Consider the following table found in a PDF file:
I can download and extract the table with the following code:
url <- "https://www.fenabrave..br/portal/files/2023_01_2.pdf"
download.file(url, 'cars.pdf', mode = "wb")

library(tabulapdf)
library(dplyr)   # bind_rows()
library(purrr)   # set_names()

df <- extract_tables(
  'cars.pdf',
  pages = 27,
  area  = list(c(126.4826, 96.5997, 782.1684, 297.9600)),
  guess = FALSE) |>
  bind_rows() |>
  set_names(c("Model", "Quantity"))
Unfortunately, extract_tables() reads the quantities as doubles and drops the trailing zeros.
I can add the following code to change its class:
df <- extract_tables(
  'cars.pdf',
  pages = 27,
  area  = list(c(126.4826, 96.5997, 782.1684, 297.9600)),
  guess = FALSE) |>
  bind_rows() |>
  set_names(c("Model", "Quantity")) |>
  mutate(Quantity = gsub("\\.", "", Quantity)) |>
  mutate(Quantity = as.integer(Quantity))
But the damage is already done: 2.830 became 283, and 1.220 became 122.
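To see why (a small illustration, not from the original post): the quantities use "." as a thousands separator, but as.numeric() treats it as a decimal mark, so the trailing zero is gone before gsub() ever runs.

# Hypothetical illustration of the truncation
x <- as.numeric("2.830")   # parsed as the double 2.83
x                          # 2.83
gsub("\\.", "", x)         # "283" -- not "2830"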
Is there a way to have the data read in as character?
1 comment: "I recommend the use of command-line tools for such a task." – Friede
1 Answer
extract_tables lets you read the data in as a character vector. That vector can then be parsed with fread using colClasses = 'character', so every column is read as character. Afterwards, gsub("\\.", "", df2$quantity) removes the thousands-separator dots.
library(tabulapdf)
library(data.table)

# Extract the table as raw character output instead of parsed data frames
string <- tabulapdf::extract_tables("table.pdf", output = "character") |>
  unlist()

# Parse the text, forcing every column to be read as character
df2 <- fread(string, colClasses = 'character', data.table = FALSE)

# Strip the thousands-separator dots
df2$quantity <- gsub("\\.", "", df2$quantity)
giving
  Head1 quantity
1     1     3678
2     2     3093
3     3     2830
4     4     2770
5     5     2200
Test PDF
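As an optional follow-up (not part of the original answer), once the dots are stripped the column can be converted to integers; alternatively, readr can parse the original "2.830"-style strings directly if "." is declared as the grouping mark — a sketch under those assumptions:

# Convert the cleaned character column to integers
df2$quantity <- as.integer(df2$quantity)

# Alternative: parse the raw strings with readr, declaring "." as the
# thousands (grouping) mark and "," as the decimal mark
readr::parse_number("2.830",
                    locale = readr::locale(grouping_mark = ".", decimal_mark = ","))  # 2830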