I have a parquet file with a dataset that looks like:

df = pl.LazyFrame(
    {
        "target": [
            [1.0, 2.0],
            [3.0, 4.0],
        ],
        "point_cloud": [
            [
                [7.0, 8.0],
                [9.0, 10.0],
            ],
            [
                [9.0, 10.0],
            ],
        ],
    },
    schema={
        "target": pl.Array(pl.Float32, 2),
        "point_cloud": pl.List(pl.Array(pl.Float32, 2)),
    },
)

The file has 4 million rows and is 20 GB (doesn't fit in RAM).

I am trying to get the size of point clouds like:

df = (
    pl.scan_parquet(dataset).select(size=pl.col("point_cloud").list.len()).collect()
)

But my program runs out of memory and dies. I have also tried collect(engine="streaming"), but the result is the same.
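For completeness, the streaming attempt is just the same query collected with the streaming engine (a sketch; dataset is the path to the parquet file):

df = (
    pl.scan_parquet(dataset)
    .select(size=pl.col("point_cloud").list.len())
    .collect(engine="streaming")  # still runs out of memory
)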

I am puzzled because when I try to get e.g. the x coordinate of all targets, it works OK (and is super fast):

df = pl.scan_parquet(dataset).select(x=pl.col("target").arr.get(0)).collect()

Can I get some help with this? Thanks

EDIT: This is a plot of the distribution of the list lengths (produced by running the same code on a computer with enough RAM to fit the entire dataset).

1 Answer

This should run in streaming mode with .collect(engine="streaming"), but I suspect that your individual rows are so large (in bytes) that the default morsel size of the streaming engine still causes it to run out of memory.

Can you try running your script while setting the environment variable POLARS_IDEAL_MORSEL_SIZE=100? The default is (currently) 100,000 rows, which is likely much too large for your use case. In the future we want to set the morsel size based on bytes rather than rows, but that's not the case yet.
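Concretely, something like this should pick up the smaller morsel size (a sketch; the path is a placeholder, and the variable can just as well be exported in the shell before launching Python):

import os

# Shrink the streaming engine's morsel size (default is 100,000 rows);
# exporting POLARS_IDEAL_MORSEL_SIZE=100 in the shell works too.
os.environ["POLARS_IDEAL_MORSEL_SIZE"] = "100"

import polars as pl

dataset = "point_clouds.parquet"  # placeholder path to the 20 GB file

df = (
    pl.scan_parquet(dataset)
    .select(size=pl.col("point_cloud").list.len())
    .collect(engine="streaming")
)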
