How to handle inconsistent columns (ragged rows) in a delimited file using Polars

I am working with a legacy system that generates delimited files (e.g., CSV), but the number of columns in these files is inconsistent across rows (ragged rows). I am reading the file from ADLS with Polars, but I'm encountering an issue depending on the structure of the second row in the file.

pl.read_csv('sample.csv', has_header=False, skip_rows=1, infer_schema=False, infer_schema_length=None, ignore_errors=True)
  • If the second row has more columns than subsequent rows, Polars reads the file successfully and fills the missing values in subsequent rows with null.
  • However, if a subsequent row has more columns than the second row, I get the following exception:

ComputeError: found more fields than defined in 'Schema' Consider setting 'truncate_ragged_lines=True'.

Is there a way to handle such cases dynamically in Polars, or do I need to preprocess the file to fix these inconsistencies before reading? Any alternative approaches or solutions to this problem would be appreciated!

Example Data - Failure

ID,Name,Age
1,John,28
2,Jane,35,California,USA
3,Emily,22
4,Michael,40,Australia,Melbourne

Example Data - Success

ID,Name,Age
2,Jane,35,California,USA
1,John,28
3,Emily,22
4,Michael,40,Australia,Melbourne
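
For reference, setting truncate_ragged_lines=True as the error message suggests does make the failing file readable, but as far as I can tell it silently drops the extra fields (state/country) from the longer rows, which is why I'm looking for a dynamic approach:

pl.read_csv('sample.csv', has_header=False, skip_rows=1, infer_schema=False, truncate_ragged_lines=True)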

asked Jan 21 at 2:26 by Neeraj Vernekar
  • Do you need the extra columns (state & country) or are you interested purely in id, name, and age? – Hericks Commented Jan 21 at 8:15

2 Answers

Read the file in as a single column by setting the separator to a (hopefully) unused UTF-8 character with no header, then use .str.split followed by .list.to_struct and unnest to allow a dynamic number of columns. After that, rename the columns from the first row and slice it off.

import polars as pl
import io
from warnings import catch_warnings, filterwarnings

input_file = io.StringIO("""ID,Name,Age
1,John,28
2,Jane,35,California,USA
3,Emily,22
4,Michael,40,Australia,Melbourne"""
                   )
input_file.seek(0)


with catch_warnings():
    filterwarnings("ignore")
    ## this suppresses the warning from `to_struct` which wants explicit field names.
    df = (
        pl.read_csv(input_file, separator="\x00", has_header=False)
        .with_columns(
            pl.col("column_1")
            .str.split(",")
            .list.to_struct(n_field_strategy="max_width")
            )
        .unnest("column_1")
    )

# Use the first row (the original header) for column names, then drop it.
df = df.rename({x: y for x, y in zip(df.columns, df.row(0)) if y is not None})
df = df.slice(1)

Now you've got a DataFrame of all strings. One option is to loop over the columns and try to cast each one, but in the few tests I did that turned out to be slower than writing the existing df back out to CSV and re-reading it to trigger Polars' auto-infer mechanism.
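
For comparison, a rough sketch of that slower cast loop (the dtypes tried and the broad exception handling are just assumptions for illustration):

# Try to cast each string column in place; keep the first cast that succeeds.
for name in df.columns:
    for dtype in (pl.Int64, pl.Float64):
        try:
            df = df.with_columns(pl.col(name).cast(dtype))
            break
        except Exception:
            pass  # leave the column as a string

The faster write-and-re-read round trip goes through a temporary file: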

from tempfile import NamedTemporaryFile

with NamedTemporaryFile() as ff:
    df.write_csv(ff)   # round-trip through CSV so read_csv can infer dtypes
    ff.seek(0)
    df = pl.read_csv(ff)

If you've got enough memory, replacing the tempfile with an io.BytesIO() is even faster.
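
For example, an in-memory sketch of the same round trip (assuming the whole frame fits comfortably in memory):

import io

buf = io.BytesIO()
df.write_csv(buf)      # write the all-string frame to an in-memory buffer
buf.seek(0)
df = pl.read_csv(buf)  # re-read so Polars infers proper dtypes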

A simple workaround would be to prepend a sufficiently wide initial row so that every subsequent row is read as shorter than (or equal to) the first one.
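
A minimal sketch of that idea, assuming the whole file fits in memory and that 'sample.csv' and the field-count upper bound are placeholders:

import io
import polars as pl

MAX_FIELDS = 10  # assumed upper bound on the number of fields in any row

with open('sample.csv', encoding='utf-8') as f:
    text = f.read()

# Prepend a dummy row wider than any real row so it defines the schema width;
# every following row is then shorter and gets padded with nulls.
padded = ','.join(['x'] * MAX_FIELDS) + '\n' + text

df = (
    pl.read_csv(io.StringIO(padded), has_header=False, infer_schema=False)
    .slice(2)  # drop the dummy row and the original header row
)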
