python - How to handle inconsistent columns (ragged rows) in a delimited file using Polars? - Stack Overflow
I am working with a legacy system that generates delimited files (e.g., CSV), but the number of columns in these files is inconsistent across rows (ragged rows). I am reading the file from ADLS with Polars, but I'm encountering an issue depending on the structure of the second row in the file.
pl.read_csv('sample.csv', has_header=False, skip_rows=1, infer_schema=False, infer_schema_length=None, ignore_errors=True)
- If the second row has more columns than subsequent rows, Polars reads the file successfully and fills the missing values in subsequent rows with null.
- However, if subsequent rows have more columns than the second row, I get the following exception:

ComputeError: found more fields than defined in 'Schema'
Consider setting 'truncate_ragged_lines=True'.
Is there a way to handle such cases dynamically in Polars, or do I need to preprocess the file to fix these inconsistencies before reading? Any alternative approaches or solutions to this problem would be appreciated!
Example Data - Failure
ID,Name,Age
1,John,28
2,Jane,35,California,USA
3,Emily,22
4,Michael,40,Australia,Melbourne
Example Data - Success
ID,Name,Age
2,Jane,35,California,USA
1,John,28
3,Emily,22
4,Michael,40,Australia,Melbourne
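Note: the truncate_ragged_lines=True setting suggested by the error does make the failing file load, but it truncates the longer rows and silently drops the extra fields (California, USA, ...) rather than capturing them. For example:

import polars as pl

# Loads the failing file without a ComputeError, but Jane's and
# Michael's overflow fields are discarded from the result.
df = pl.read_csv('sample.csv', has_header=False, skip_rows=1, infer_schema=False, truncate_ragged_lines=True)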
asked Jan 21 at 2:26 by Neeraj Vernekar
- Do you need the extra columns (state & country) or are you interested purely in id, name, and age? – Hericks, Jan 21 at 8:15
2 Answers
Read the file in as a single column by setting the separator to a (hopefully) unused UTF-8 character with no header, then use .str.split and .list.to_struct followed by unnest to allow a dynamic number of columns. Then you have to rename the columns and slice off the first row.
import polars as pl
import io
from warnings import catch_warnings, filterwarnings

input_file = io.StringIO("""ID,Name,Age
1,John,28
2,Jane,35,California,USA
3,Emily,22
4,Michael,40,Australia,Melbourne"""
)
input_file.seek(0)

with catch_warnings():
    # suppress the warning from `to_struct`, which wants explicit field names
    filterwarnings("ignore")
    df = (
        pl.read_csv(input_file, separator="\x00", has_header=False)
        .with_columns(
            pl.col("column_1")
            .str.split(",")
            .list.to_struct(n_field_strategy="max_width")
        )
        .unnest("column_1")
    )

df = df.rename({x: y for x, y in zip(df.columns, df.row(0)) if y is not None})
df = df.slice(1)
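With the example input, the frame at this point should have columns ID, Name, Age plus the auto-generated names that to_struct assigns to the overflow fields (field_3 and field_4, assuming its default field_N naming):

print(df.columns)
# expected: ['ID', 'Name', 'Age', 'field_3', 'field_4']
print(df.row(0))
# expected: ('1', 'John', '28', None, None) -- everything is still a string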
Now you've got a df of all strings. You could loop over the columns and try to cast each one, but it turns out that is slower (at least in a few tests that I did) than writing the existing df to a csv and then re-reading it to force Polars's auto-infer mechanism.
from tempfile import NamedTemporaryFile

with NamedTemporaryFile() as ff:
    df.write_csv(ff)
    ff.seek(0)
    df = pl.read_csv(ff)
If you've got enough memory, then replacing the tempfile with an io.BytesIO() buffer will be even faster.
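A minimal sketch of that in-memory variant, with io.BytesIO standing in for the temp file:

import io

buf = io.BytesIO()
df.write_csv(buf)      # serialize the all-string frame into memory
buf.seek(0)            # rewind before re-reading
df = pl.read_csv(buf)  # re-read so Polars re-infers the dtypes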
A simple workaround for the issue would be to prepend a long enough initial row so that all subsequent rows are read as shorter than the first one.
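A rough sketch of that idea, assuming the whole file fits in memory and that max_cols safely over-estimates the widest row (both names here are illustrative):

import io
import polars as pl

max_cols = 20  # assumed upper bound on the number of fields in any row
with open("sample.csv", encoding="utf8") as f:
    raw = f.read()

# A leading row of max_cols empty fields fixes the schema width, so every
# real row is shorter than the first and gets null-padded, as in the
# "success" case from the question.
padded = "," * (max_cols - 1) + "\n" + raw

df = pl.read_csv(io.StringIO(padded), has_header=False, infer_schema=False)
df = df.slice(1)  # drop the padding row; the original header is now the first row

From here you can rename the columns from that header row and drop the all-null padding columns, as in the first answer.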