Issue with arrow open_dataset() with character column that has illegal UTF-8 characters

Question for {arrow} plus {duckdb} users in R.

Take the code:

    library(arrow)
    library(duckdb)
    library(dplyr)
    
    seattle_csv <- open_dataset(
      sources = "data/seattle-library-checkouts.csv", 
      col_types = schema(problem_col = string()),
      format = "csv"
    )
    
    seattle_df <- seattle_csv |>
      to_duckdb() |>
      collect()

If I don't supply the col_types option, problem_col is treated as a binary column, which when collected becomes a list of raw vectors. I have not been able to properly convert it to a character column using rawToChar(): it returns a column where each cell contains all of the original column's characters. This is, of course, not the desired outcome.

The code above doesn't work either, because specifying the schema of the problematic column via col_types = schema(problem_col = string()) leads to an error about illegal UTF-8 characters. Note that I have no issue importing this file with data.table::fread(), but I need to work with many big files, so I cannot rely on data.table.

Broader context: once I solve this problem (and clean the data as I need to, including filtering and saving as .csv), I will write a function to perform the same operation on about 170 files of 20-90 GB each. So any more general advice on how to tackle this challenging task is more than welcome. I have used an approach like the above in the past, but that was for datasets that, while larger than memory, were not this large. I am happy with any solution as long as I can manipulate the data. Required operations include filtering the data based on the occurrence of a certain string in a given column (e.g., the string "Lewis" in a "book_author" column), collapsing (e.g., individual X checked out 3 Lewis books in year Y), and rbind()-ing the multiple, now cleaned, datasets at the end.

Could you help me get unstuck?

PS Bonus Q: Is there any reason why I should expect that sometimes after running this code my console does not produce any output despite the "stop" sign not being on (as in, code is running)?



1 Answer


You should probably take a step back, clean the files of invalid characters, and only then proceed with R / arrow / duckdb; i.e. you could drop invalid characters with something like this in a shell:

iconv -f UTF8 -t UTF8 -c < seattle-library-checkouts.csv > seattle-library-checkouts_valid_utf8.csv
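
Since there will eventually be ~170 files, here is a minimal sketch of driving that cleanup from R (assuming GNU iconv is on the PATH; paths like data/raw/ and data/clean/ are placeholders for your layout):

raw_files <- list.files("data/raw", pattern = "\\.csv$", full.names = TRUE)
dir.create("data/clean", showWarnings = FALSE)

for (f in raw_files) {
  out <- file.path("data/clean", basename(f))
  # -c drops byte sequences that are not valid UTF-8; stdout is redirected to the cleaned copy
  system2("iconv", c("-f", "UTF-8", "-t", "UTF-8", "-c", shQuote(f)), stdout = out)
}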

Though if you really want to / need to, you technically could clean and transform the binary column coming from open_dataset().

Starting with a faulty.csv for reprex:

raw_csv <- 
  data.frame(problem_col = month.name[1:3]) |> 
  readr::format_csv() |> 
  charToRaw()
raw_csv[27:28] <- as.raw(255)  # overwrite the last two bytes of "February" with invalid UTF-8

writeBin(raw_csv, "faulty.csv")

# fread, read.csv, etc are able to read the file:
(faulty <- read.csv("faulty.csv"))
#>      problem_col
#> 1        January
#> 2 Februa\xff\xff
#> 3          March

# but we can't do much with resulting problem_col:
nchar(faulty$problem_col)
#> Error in nchar(faulty$problem_col): invalid multibyte string, element 2

# though iconv() can help here:
iconv(faulty$problem_col, from = "UTF-8", to = "UTF-8", sub = "_")
#> [1] "January"  "Februa__" "March"

To work with the arrow binary column from a collected arrow dataset, we can use stringi::stri_conv() in a similar way to iconv() above; stri_conv() should also be faster, and it inserts substitution characters in place of invalid byte sequences, which can be handy for flagging problematic records:

library(arrow)
library(dplyr)
faulty_ds <- 
  open_dataset(
    sources = "faulty.csv", 
    format = "csv"
  )
faulty_ds
#> FileSystemDataset with 1 csv file
#> 1 columns
#> problem_col: binary

# warnings are expected
faulty_ds |> 
  collect() |> 
  mutate(
    problem_col = stringi::stri_conv(problem_col, from = "UTF-8", to = "UTF-8"),
    was_invalid = grepl('[\ufffd\u001a]', problem_col)
  ) 
#> Warning: There were 2 warnings in `mutate()`.
#> The first warning was:
#> ℹ In argument: `problem_col = stringi::stri_conv(problem_col, from = "UTF-8",
#>   to = "UTF-8")`.
#> Caused by warning in `stringi::stri_conv()`:
#> ! input data \xffffffff in the current source encoding could not be converted to Unicode
#> ℹ Run `dplyr::last_dplyr_warnings()` to see the 1 remaining warning.
#> # A tibble: 3 × 2
#>   problem_col was_invalid
#>   <chr>       <lgl>      
#> 1 January     FALSE      
#> 2 Februa��    TRUE       
#> 3 March       FALSE
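
If the goal for each file is to clean, filter, and write the result back out as CSV, the cleaned tibble could then be saved with arrow's write_csv_arrow(); a sketch under those assumptions (the filter on the flagged rows is just for illustration):

faulty_ds |> 
  collect() |> 
  mutate(problem_col = stringi::stri_conv(problem_col, from = "UTF-8", to = "UTF-8")) |> 
  # drop (or instead inspect) records that contained invalid bytes
  filter(!grepl('[\ufffd\u001a]', problem_col)) |> 
  write_csv_arrow("faulty_clean.csv")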

The main problem here is collect(): the whole dataset will be materialized in R, and I doubt that would help much if you already had issues with "many big files" and data.table.


Perhaps an option to actually consider, in addition to fixing the input data, is to drop arrow and collect() from the workflow and import directly into duckdb with its read_csv() function; it takes an ignore_errors parameter to just skip(!) faulty rows, as well as store_rejects to keep a log of errors.

library(duckdb)
library(dplyr)

con <- dbConnect(duckdb())

# errors are skipped, i.e. we get 2 instead of 3 rows
tbl(con, "read_csv('fault*.csv', store_rejects = true)")
#> # Source:   SQL [2 x 1]
#> # Database: DuckDB v1.1.0 [margus@Windows 10 x64:R 4.4.2/:memory:]
#>   problem_col
#>   <chr>      
#> 1 January    
#> 2 March

tbl(con, "reject_scans") |> glimpse()
#> Rows: ??
#> Columns: 13
#> Database: DuckDB v1.1.0 [margus@Windows 10 x64:R 4.4.2/:memory:]
#> $ scan_id           <dbl> 6
#> $ file_id           <dbl> 0
#> $ file_path         <chr> "faulty.csv"
#> $ delimiter         <chr> ","
#> $ quote             <chr> "\""
#> $ escape            <chr> "\""
#> $ newline_delimiter <chr> "\\n"
#> $ skip_rows         <dbl> 0
#> $ has_header        <lgl> TRUE
#> $ columns           <chr> "{'problem_col': 'VARCHAR'}"
#> $ date_format       <chr> NA
#> $ timestamp_format  <chr> NA
#> $ user_arguments    <chr> "store_rejects=true"

tbl(con, "reject_errors") |> glimpse()
#> Rows: ??
#> Columns: 10
#> Database: DuckDB v1.1.0 [margus@Windows 10 x64:R 4.4.2/:memory:]
#> $ scan_id            <dbl> 6
#> $ file_id            <dbl> 0
#> $ line               <dbl> 3
#> $ line_byte_position <dbl> 21
#> $ byte_position      <dbl> 21
#> $ column_idx         <dbl> 1
#> $ column_name        <chr> "problem_col"
#> $ error_type         <fct> INVALID UNICODE
#> $ csv_line           <chr> "Februa??"
#> $ error_message      <chr> "Invalid unicode (byte sequence mismatch) detected."

dbDisconnect(con)
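
Tying this back to the broader context in the question: read_csv() accepts a glob, so all ~170 raw files can be scanned as a single duckdb table, and the filtering and collapsing can happen inside duckdb, with only the aggregated result collected into R. A sketch, where column names such as book_author, person_id and checkout_year are assumptions based on the question:

library(duckdb)
library(dplyr)

con <- dbConnect(duckdb())

# scan all files lazily; ignore_errors = true skips rows with invalid UTF-8
checkouts <- tbl(con, "read_csv('data/raw/*.csv', ignore_errors = true)")

lewis_per_year <- checkouts |> 
  filter(grepl("Lewis", book_author)) |>               # translated to SQL by dbplyr
  count(person_id, checkout_year, name = "n_books") |> 
  collect()                                            # only the small summary is materialized

dbDisconnect(con)

This would also make the final rbind() step unnecessary, since the glob already treats the files as one table.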

For more about faulty CSV files and duckdb, see https://duckdb.org/docs/data/csv/reading_faulty_csv_files
