Issue with arrow open_dataset() with character column that has illegal UTF-8 characters

Question for {arrow} plus {duckdb} users in R.

Take the code:

    library(arrow)
    library(duckdb)
    library(dplyr)
    
    seattle_csv <- open_dataset(
      sources = "data/seattle-library-checkouts.csv", 
      col_types = schema(problem_col = string()),
      format = "csv"
    )
    
    seattle_df <- seattle_csv |>
      to_duckdb() |>
      collect()

If I don't supply the col_types option, problem_col is treated as a binary column, which when collected becomes a list of raw vectors. I have not been able to properly convert it to a character column using rawToChar(): it returns a column where each cell contains all of the original column's characters. This is, of course, not the desired outcome.

The code above doesn't work either, because specifying the schema of the problematic column via col_types = schema(problem_col = string()) leads to an error about illegal UTF-8 characters. Note that I have no issue importing this file with data.table::fread(), but I need to work with many big files, so I cannot rely on data.table.

Broader context: once I solve this problem (and clean the data as I need to, including filtering and saving as .csv), I will write a function to perform the same operation on about 170 files of 20-90 GB each. So any more general advice on how to tackle this challenging task is more than welcome. I have used an approach like the above in the past, but that was for datasets that, while larger than memory, were not this large. I am happy with any solution as long as I can manipulate the data. Required operations include filtering the data based on the occurrence of a certain string in a given column (e.g., the string "Lewis" in a "book_author" column), collapsing (e.g., individual X checked out 3 Lewis books in year Y), and rbind()-ing the multiple, now cleaned, datasets at the end.

Could you help me get unstuck?

PS Bonus Q: Is there any reason why I should expect that sometimes after running this code my console does not produce any output despite the "stop" sign not being on (as in, code is running)?



1 Answer


You should probably take a step back, clean the files of invalid characters, and only then proceed with R / arrow / duckdb; i.e. you could drop invalid characters with something like this in a shell:

iconv -f UTF8 -t UTF8 -c < seattle-library-checkouts.csv > seattle-library-checkouts_valid_utf8.csv
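
Since there will eventually be ~170 files, here is a minimal sketch of driving that cleanup from R (assuming GNU iconv is on the PATH; paths like data/raw/ and data/clean/ are placeholders for your layout):

raw_files <- list.files("data/raw", pattern = "\\.csv$", full.names = TRUE)
dir.create("data/clean", showWarnings = FALSE)

for (f in raw_files) {
  out <- file.path("data/clean", basename(f))
  # -c drops byte sequences that are not valid UTF-8; stdout is redirected to the cleaned copy
  system2("iconv", c("-f", "UTF-8", "-t", "UTF-8", "-c", shQuote(f)), stdout = out)
}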

Though if you really want to / need to, you technically could clean and transform the binary column coming from open_dataset().

Starting with a faulty.csv for reprex:

raw_csv <- 
  data.frame(problem_col = month.name[1:3]) |> 
  readr::format_csv() |> 
  charToRaw()
raw_csv[27:28] <- as.raw(255)  # overwrite the last two bytes of "February" with invalid UTF-8

writeBin(raw_csv, "faulty.csv")

# fread, read.csv, etc are able to read the file:
(faulty <- read.csv("faulty.csv"))
#>      problem_col
#> 1        January
#> 2 Februa\xff\xff
#> 3          March

# but we can't do much with resulting problem_col:
nchar(faulty$problem_col)
#> Error in nchar(faulty$problem_col): invalid multibyte string, element 2

# though iconv() can help here:
iconv(faulty$problem_col, from = "UTF-8", to = "UTF-8", sub = "_")
#> [1] "January"  "Februa__" "March"

To work with the arrow binary column from a collected arrow dataset, we can use stringi::stri_conv() in a similar way to iconv() above; stri_conv() should also be faster, and it inserts substitution characters in place of invalid byte sequences, which can be handy for flagging problematic records:

library(arrow)
library(dplyr)
faulty_ds <- 
  open_dataset(
    sources = "faulty.csv", 
    format = "csv"
  )
faulty_ds
#> FileSystemDataset with 1 csv file
#> 1 columns
#> problem_col: binary

# warnings are expected
faulty_ds |> 
  collect() |> 
  mutate(
    problem_col = stringi::stri_conv(problem_col, from = "UTF-8", to = "UTF-8"),
    was_invalid = grepl('[\ufffd\u001a]', problem_col)
  ) 
#> Warning: There were 2 warnings in `mutate()`.
#> The first warning was:
#> ℹ In argument: `problem_col = stringi::stri_conv(problem_col, from = "UTF-8",
#>   to = "UTF-8")`.
#> Caused by warning in `stringi::stri_conv()`:
#> ! input data \xffffffff in the current source encoding could not be converted to Unicode
#> ℹ Run `dplyr::last_dplyr_warnings()` to see the 1 remaining warning.
#> # A tibble: 3 × 2
#>   problem_col was_invalid
#>   <chr>       <lgl>      
#> 1 January     FALSE      
#> 2 Februa��    TRUE       
#> 3 March       FALSE
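
If the goal for each file is to clean, filter, and write the result back out as CSV, the cleaned tibble could then be saved with arrow's write_csv_arrow(); a sketch under those assumptions (the filter on the flagged rows is just for illustration):

faulty_ds |> 
  collect() |> 
  mutate(problem_col = stringi::stri_conv(problem_col, from = "UTF-8", to = "UTF-8")) |> 
  # drop (or instead inspect) records that contained invalid bytes
  filter(!grepl('[\ufffd\u001a]', problem_col)) |> 
  write_csv_arrow("faulty_clean.csv")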

The main problem here is collect(): the whole dataset will be materialized in R, and I doubt that would help much if you already had issues with "many big files" and data.table.


Perhaps an option to actually consider, in addition to fixing the input data, is to drop arrow and collect() from the workflow and import directly into duckdb with its read_csv() function; it takes an ignore_errors parameter to just skip(!) faulty rows, as well as store_rejects to keep a log of errors.

library(duckdb)
library(dplyr)

con <- dbConnect(duckdb())

# errors are skipped, i.e. we get 2 instead of 3 rows
tbl(con, "read_csv('fault*.csv', store_rejects = true)")
#> # Source:   SQL [2 x 1]
#> # Database: DuckDB v1.1.0 [margus@Windows 10 x64:R 4.4.2/:memory:]
#>   problem_col
#>   <chr>      
#> 1 January    
#> 2 March

tbl(con, "reject_scans") |> glimpse()
#> Rows: ??
#> Columns: 13
#> Database: DuckDB v1.1.0 [margus@Windows 10 x64:R 4.4.2/:memory:]
#> $ scan_id           <dbl> 6
#> $ file_id           <dbl> 0
#> $ file_path         <chr> "faulty.csv"
#> $ delimiter         <chr> ","
#> $ quote             <chr> "\""
#> $ escape            <chr> "\""
#> $ newline_delimiter <chr> "\\n"
#> $ skip_rows         <dbl> 0
#> $ has_header        <lgl> TRUE
#> $ columns           <chr> "{'problem_col': 'VARCHAR'}"
#> $ date_format       <chr> NA
#> $ timestamp_format  <chr> NA
#> $ user_arguments    <chr> "store_rejects=true"

tbl(con, "reject_errors") |> glimpse()
#> Rows: ??
#> Columns: 10
#> Database: DuckDB v1.1.0 [margus@Windows 10 x64:R 4.4.2/:memory:]
#> $ scan_id            <dbl> 6
#> $ file_id            <dbl> 0
#> $ line               <dbl> 3
#> $ line_byte_position <dbl> 21
#> $ byte_position      <dbl> 21
#> $ column_idx         <dbl> 1
#> $ column_name        <chr> "problem_col"
#> $ error_type         <fct> INVALID UNICODE
#> $ csv_line           <chr> "Februa??"
#> $ error_message      <chr> "Invalid unicode (byte sequence mismatch) detected."

dbDisconnect(con)
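
Tying this back to the broader context in the question: read_csv() accepts a glob, so all ~170 raw files can be scanned as a single duckdb table, and the filtering and collapsing can happen inside duckdb, with only the aggregated result collected into R. A sketch, where column names such as book_author, person_id and checkout_year are assumptions based on the question:

library(duckdb)
library(dplyr)

con <- dbConnect(duckdb())

# scan all files lazily; ignore_errors = true skips rows with invalid UTF-8
checkouts <- tbl(con, "read_csv('data/raw/*.csv', ignore_errors = true)")

lewis_per_year <- checkouts |> 
  filter(grepl("Lewis", book_author)) |>               # translated to SQL by dbplyr
  count(person_id, checkout_year, name = "n_books") |> 
  collect()                                            # only the small summary is materialized

dbDisconnect(con)

This would also make the final rbind() step unnecessary, since the glob already treats the files as one table.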

For more about faulty CSV files and duckdb, see https://duckdb.org/docs/data/csv/reading_faulty_csv_files
