I have been converting CSV files to Parquet files using duckdb::dbExecute() in R, but I am running into a type issue with one of my files. I have a column called code that I want to be VARCHAR, not BIGINT. I cannot seem to override the BIGINT auto-detection, and the conversion throws an error.
dbExecute(con, sprintf("CREATE VIEW csv_data AS SELECT * FROM read_csv_auto('%s')", csv_file))
dbExecute(con, sprintf("COPY (SELECT * FROM csv_data) TO '%s' (FORMAT 'parquet', CODEC 'snappy')", parquet_file))
The error is thrown by the second statement:
Error when converting column "code". Could not convert string "FELH234713" to 'BIGINT'
Column code is being converted as type BIGINT. This type was auto-detected from the CSV file.
Possible solutions:
* Override the type for this column manually by setting the type explicitly, e.g. types={'code': 'VARCHAR'}
* Set the sample size to a larger value to enable the auto-detection to scan more values, e.g. sample_size=-1
* Use a COPY statement to automatically derive types from an existing table.
I have tried setting the type explicitly, but that gives me a syntax error. I think part of the issue is that the code column contains both purely integer-looking values and values that mix characters and integers. The CSV file is too big to load into memory (~500 GB).
My attempts at setting the type, all of which throw the same syntax error:
dbExecute(con, sprintf("COPY (SELECT * FROM csv_data) TO '%s' (FORMAT 'parquet', CODEC 'snappy', types={'code': 'VARCHAR'})", parquet_file))
dbExecute(con, sprintf("COPY (SELECT * FROM csv_data) TO '%s' (FORMAT 'parquet', CODEC 'snappy', TYPES ('code' 'VARCHAR')))", parquet_file))
dbExecute(con, sprintf("COPY (SELECT * FROM csv_data) TO '%s' (FORMAT 'parquet', CODEC 'snappy', TYPES (code VARCHAR))", parquet_file))
error:
Error in dbSendQuery():
! rapi_prepare: Failed to extract statements: Parser Error: syntax error at or near "'VARCHAR'" LINE 1: ...arquet', CODEC 'snappy', TYPES ('code' 'VARCHAR'))
asked Mar 6 at 19:54 by neuroandstats; edited Mar 6 at 21:01 by Dale K

Comment from margusl (Mar 6 at 20:09): Could you include that failed query where you tried to set types?
1 Answer

DuckDB supports a couple of ways to work with this.
Setup:
writeLines(c("a,b","1,A"), "~/quux.csv")
conn <- duckdb::dbConnect(duckdb::duckdb(), ":memory:")
DBI::dbGetQuery(conn, "select * from read_csv('~/quux.csv')") |>
str()
# 'data.frame': 1 obs. of 2 variables:
# $ a: num 1
# $ b: chr "A"
You can specify the type of individual columns with types=
(ref: https://duckdb.org/docs/stable/data/csv/overview.html):
DBI::dbGetQuery(conn, "select * from read_csv('~/quux.csv', types={'A':'VARCHAR'})") |>
str()
# 'data.frame': 1 obs. of 2 variables:
# $ a: chr "1"
# $ b: chr "A"
This means your view can be defined as
dbExecute(con, sprintf("CREATE VIEW csv_data AS SELECT * FROM read_csv_auto('%s', types={'code':'VARCHAR'})", csv_file))
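Putting that view definition together with the COPY step on a small stand-in file (a sketch, not a tested recipe for the real 500 GB file; the file contents here are invented, and it assumes the duckdb and DBI packages are installed):

```r
library(duckdb)

# Tiny stand-in for the real file: 'code' starts with an
# integer-looking value, so auto-detection alone would guess BIGINT.
csv_file <- tempfile(fileext = ".csv")
parquet_file <- tempfile(fileext = ".parquet")
writeLines(c("code,value", "123456,1", "FELH234713,2"), csv_file)

con <- dbConnect(duckdb())

# Force 'code' to VARCHAR in the CSV reader itself, not in COPY:
dbExecute(con, sprintf(
  "CREATE VIEW csv_data AS SELECT * FROM read_csv_auto('%s', types={'code':'VARCHAR'})",
  csv_file))
dbExecute(con, sprintf(
  "COPY (SELECT * FROM csv_data) TO '%s' (FORMAT 'parquet', CODEC 'snappy')",
  parquet_file))

# Read the Parquet file back to confirm the column type survived:
res <- dbGetQuery(con, sprintf("SELECT * FROM read_parquet('%s')", parquet_file))
str(res)

dbDisconnect(con, shutdown = TRUE)
```

Because the view streams through read_csv_auto, nothing is loaded into R's memory, so this works the same way on a file far larger than RAM.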
If you don't know the column name beforehand, then you can adjust sample_size
. It defaults to 20,480 rows (ref: https://duckdb.org/docs/stable/data/csv/auto_detection.html); you can set it to a large enough positive value if you know where the strings will be found. You can disable the limit by setting it to -1
, which makes DuckDB read all the data before auto-detecting types, at the risk of a performance hit.
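A minimal sketch of the sample_size route (the file and its values are invented for illustration; assumes the duckdb and DBI packages are installed). The first string value is placed past the default 20,480-row sample so the default read mis-detects the column, mirroring the situation in the question:

```r
library(duckdb)

csv_file <- tempfile(fileext = ".csv")
# 'code' is all digits for 30,000 rows; the string appears only at the end,
# beyond the default sample window.
writeLines(c("code", as.character(seq_len(30000)), "FELH234713"), csv_file)

con <- dbConnect(duckdb())

# With the default sample size this read will generally fail
# with the same BIGINT conversion error as in the question:
attempt <- try(dbGetQuery(con, sprintf(
  "SELECT * FROM read_csv_auto('%s')", csv_file)), silent = TRUE)

# sample_size=-1 scans the whole file before choosing types,
# so the late string value promotes 'code' to VARCHAR:
res <- dbGetQuery(con, sprintf(
  "SELECT * FROM read_csv_auto('%s', sample_size=-1)", csv_file))
str(res)

dbDisconnect(con, shutdown = TRUE)
```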