How to handle inconsistent columns (ragged rows) in a delimited file using Polars

I am working with a legacy system that generates delimited files (e.g., CSV), but the number of columns in these files is inconsistent across rows (ragged rows). I am reading the file from ADLS with Polars, but I'm encountering an issue depending on the structure of the second row in the file.

pl.read_csv('sample.csv', has_header=False, skip_rows=1, infer_schema=False, infer_schema_length=None, ignore_errors=True)
  • If the second row has more columns than subsequent rows, Polars reads the file successfully and fills the missing values in subsequent rows with null.
  • However, if a subsequent row has more columns than the second row, I get the following exception:

ComputeError: found more fields than defined in 'Schema' Consider setting 'truncate_ragged_lines=True'.

Is there a way to handle such cases dynamically in Polars, or do I need to preprocess the file to fix these inconsistencies before reading? Any alternative approaches or solutions to this problem would be appreciated!

Example Data - Failure

ID,Name,Age
1,John,28
2,Jane,35,California,USA
3,Emily,22
4,Michael,40,Australia,Melbourne

Example Data - Success

ID,Name,Age
2,Jane,35,California,USA
1,John,28
3,Emily,22
4,Michael,40,Australia,Melbourne
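
For reference, setting truncate_ragged_lines=True as the error message suggests does make the failing file readable, but as far as I can tell it silently drops the extra fields (state/country) from the longer rows, which is why I'm looking for a dynamic approach:

pl.read_csv('sample.csv', has_header=False, skip_rows=1, infer_schema=False, truncate_ragged_lines=True)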

asked Jan 21 at 2:26 by Neeraj Vernekar
  • Do you need the extra columns (state & country) or are you interested purely in id, name, and age? – Hericks Commented Jan 21 at 8:15

2 Answers

Read the file in as a single column by setting the separator to a (hopefully) unused UTF-8 character with no header, then use .str.split followed by .list.to_struct and unnest to allow a dynamic number of columns. After that, rename the columns from the first row and slice it off.

import polars as pl
import io
from warnings import catch_warnings, filterwarnings

input_file = io.StringIO("""ID,Name,Age
1,John,28
2,Jane,35,California,USA
3,Emily,22
4,Michael,40,Australia,Melbourne"""
                   )
input_file.seek(0)


with catch_warnings():
    filterwarnings("ignore")
    ## this suppresses the warning from `to_struct` which wants explicit field names.
    df = (
        pl.read_csv(input_file, separator="\x00", has_header=False)
        .with_columns(
            pl.col("column_1")
            .str.split(",")
            .list.to_struct(n_field_strategy="max_width")
            )
        .unnest("column_1")
    )

# Use the first row (the original header) for column names, then drop it.
df = df.rename({x: y for x, y in zip(df.columns, df.row(0)) if y is not None})
df = df.slice(1)

Now you've got a DataFrame of all strings. One option is to loop over the columns and try to cast each one, but in the few tests I did that turned out to be slower than writing the existing df back out to CSV and re-reading it to trigger Polars' auto-infer mechanism.
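
For comparison, a rough sketch of that slower cast loop (the dtypes tried and the broad exception handling are just assumptions for illustration):

# Try to cast each string column in place; keep the first cast that succeeds.
for name in df.columns:
    for dtype in (pl.Int64, pl.Float64):
        try:
            df = df.with_columns(pl.col(name).cast(dtype))
            break
        except Exception:
            pass  # leave the column as a string

The faster write-and-re-read round trip goes through a temporary file: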

from tempfile import NamedTemporaryFile

with NamedTemporaryFile() as ff:
    df.write_csv(ff)   # round-trip through CSV so read_csv can infer dtypes
    ff.seek(0)
    df = pl.read_csv(ff)

If you've got enough memory, replacing the tempfile with an io.BytesIO() is even faster.
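
For example, an in-memory sketch of the same round trip (assuming the whole frame fits comfortably in memory):

import io

buf = io.BytesIO()
df.write_csv(buf)      # write the all-string frame to an in-memory buffer
buf.seek(0)
df = pl.read_csv(buf)  # re-read so Polars infers proper dtypes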

A simple workaround would be to prepend a sufficiently wide initial row so that every subsequent row is read as shorter than (or equal to) the first one.
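
A minimal sketch of that idea, assuming the whole file fits in memory and that 'sample.csv' and the field-count upper bound are placeholders:

import io
import polars as pl

MAX_FIELDS = 10  # assumed upper bound on the number of fields in any row

with open('sample.csv', encoding='utf-8') as f:
    text = f.read()

# Prepend a dummy row wider than any real row so it defines the schema width;
# every following row is then shorter and gets padded with nulls.
padded = ','.join(['x'] * MAX_FIELDS) + '\n' + text

df = (
    pl.read_csv(io.StringIO(padded), has_header=False, infer_schema=False)
    .slice(2)  # drop the dummy row and the original header row
)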
