r - Extract the correct data type in a PDF table - Stack Overflow

Consider the following table found in a PDF file:

I can download and extract the table with the following code:

url <- "https://www.fenabrave..br/portal/files/2023_01_2.pdf"

download.file(url, 'cars.pdf', mode="wb")

library(tabulapdf)
library(dplyr)  # bind_rows(), mutate()
library(purrr)  # set_names()

df <- extract_tables(
  'cars.pdf',
  pages = 27,
  area = list(c(126.4826, 96.5997, 782.1684, 297.9600)),
  guess = FALSE) |>
  bind_rows() |>
  set_names(c("Model","Quantity"))

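Unfortunately, the function reads the quantities as type double. The dot in the PDF is a thousands separator, but it is parsed as a decimal point, so the trailing zeros are dropped (2.830 is stored as 2.83).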

I can add the following code to change its class:

df <- extract_tables(
  'cars.pdf',
  pages = 27,
  area = list(c(126.4826, 96.5997, 782.1684, 297.9600)),
  guess = FALSE) |>
  bind_rows() |>
  set_names(c("Model","Quantity")) |>
  mutate(Quantity = gsub("\\.", "", Quantity)) |>
  mutate(Quantity = as.integer(Quantity))

But the damage is already done: 2.830 became 283 and 1.220 became 122.
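
To see why the extra gsub() step cannot recover the values, here is a minimal base R sketch. It assumes the cells arrive as the strings shown; the sample values are figures mentioned on this page, and the variable names are illustrative only.

```r
# Cell text as it appears in the PDF; the dot is a thousands separator.
raw <- c("3.678", "2.830", "1.220")

# Type guessing treats the dot as a decimal point.
as_double <- as.numeric(raw)

# Converting back to text has already lost the trailing zeros.
printed <- as.character(as_double)

# Removing the dot afterwards yields the wrong integers.
broken <- as.integer(gsub(".", "", printed, fixed = TRUE))

# Removing the dot from the original text keeps every digit.
fixed <- as.integer(gsub(".", "", raw, fixed = TRUE))

print(data.frame(raw, printed, broken, fixed))
```

The zeros are lost at the moment the text is parsed as a number, which is why the values have to be kept as character until the dot has been removed.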

Is there a way to have the data read as character instead?

asked Mar 9 at 21:30 by AndreASousa; edited Mar 10 at 7:28 by zx8754

Comment: I recommend the use of command-line tools for such a task. – Friede, Mar 9 at 22:42

1 Answer

extract_tables can return the table as a character vector when called with output = "character". That text can then be parsed with fread using colClasses = 'character', so that every column is read as character. After that, gsub("\\.", "", df2$quantity) removes the thousands dot.

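library(tabulapdf)
library(data.table)

# Return the table as raw text instead of a type-guessed data frame
string <- tabulapdf::extract_tables("table.pdf", output = "character") |>
  unlist()

# Parse the text, keeping every column as character
df2 <- fread(string, colClasses = 'character', data.table = FALSE)

# Remove the thousands dot
df2$quantity <- gsub("\\.", "", df2$quantity)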

giving

  Head1 quantity
1     1     3678
2     2     3093
3     3     2830
4     4     2770
5     5     2200
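
The same idea can be tried without a PDF or extra packages. The sketch below is only an illustration under stated assumptions: the string is a made-up stand-in for what extract_tables returns with output = "character" (its tab-delimited layout is assumed, and the values mirror the output above), and base R's read.delim is swapped in for fread.

```r
# Hypothetical stand-in for the text returned by
# extract_tables(..., output = "character"); the tab-delimited layout is assumed.
string <- "Head1\tquantity\n1\t3.678\n2\t3.093\n3\t2.830\n4\t2.770\n5\t2.200\n"

# Read every column as character so "2.830" is never parsed as the double 2.83.
df2 <- read.delim(text = string, colClasses = "character")

# Remove the thousands dot first, then convert to integer.
df2$quantity <- as.integer(gsub(".", "", df2$quantity, fixed = TRUE))

print(df2)
```

Converting to integer only after the dot has been removed is what keeps 2.830 as 2830 rather than 283.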

Test PDF
