I have a pl.LazyFrame with a number of columns. One of the columns is called signal and is supposed to have dtype=pl.Int8. It only contains 0 and 1.

Calling collect_schema() on the LazyFrame confirms this.
However, when I actually collect the LazyFrame, the dtype of the resulting DataFrame column changes to pl.Int32.

I wasn't able to come up with a toy example, so I'm showing the behaviour with my existing pl.LazyFrame. Hopefully somebody can still point me in the right direction.

In [1]: lf.select(pl.col("signal")).collect_schema()
Out[1]: Schema([('signal', Int8)])

In [2]: lf.select(pl.col("signal")).collect()
Out[2]: 
shape: (7_556, 1)
┌────────┐
│ signal │
│ ---    │
│ i32    │
╞════════╡
│ 0      │
│ 0      │
│ 0      │
│ 0      │
│ 1      │
│ …      │
│ 1      │
│ 1      │
│ 1      │
│ 0      │
│ 0      │
└────────┘

In [3]: lf.select(pl.col("signal")).collect().collect_schema()
Out[3]: Schema([('signal', Int32)])

In [4]: lf.select(pl.col("signal")).collect().describe()
Out[4]: 
shape: (9, 2)
┌────────────┬──────────┐
│ statistic  ┆ signal   │
│ ---        ┆ ---      │
│ str        ┆ f64      │
╞════════════╪══════════╡
│ count      ┆ 7556.0   │
│ null_count ┆ 0.0      │
│ mean       ┆ 0.55585  │
│ std        ┆ 0.496904 │
│ min        ┆ 0.0      │
│ 25%        ┆ 0.0      │
│ 50%        ┆ 1.0      │
│ 75%        ┆ 1.0      │
│ max        ┆ 1.0      │
└────────────┴──────────┘

To me this looks like a bug: the schema reported by the lazy plan and the dtype of the collected result disagree. Is it?
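For narrowing this down, a small helper that lists every column whose planned dtype differs from the materialized one might be useful. The sketch below is only an illustration: `schema_mismatches` is a hypothetical name, not a Polars function, and it assumes both schemas behave like mappings from column name to dtype. Since the original `lf` is not reproducible, the demo uses plain strings as stand-ins for the dtypes shown above.

```python
from typing import Any, Mapping


def schema_mismatches(
    planned: Mapping[str, Any], actual: Mapping[str, Any]
) -> dict[str, tuple[Any, Any]]:
    """Return {column: (planned_dtype, actual_dtype)} for columns whose dtypes differ."""
    return {
        name: (dtype, actual.get(name))
        for name, dtype in planned.items()
        if actual.get(name) != dtype
    }


# Intended use with the LazyFrame from the question (not executed here,
# because `lf` cannot be reproduced):
#   schema_mismatches(lf.collect_schema(), lf.collect().schema)

# Stand-in data: dtype names as plain strings, mirroring Out[1] and Out[3].
planned = {"signal": "Int8"}
actual = {"signal": "Int32"}
print(schema_mismatches(planned, actual))
```

Running the same comparison on intermediate steps of the query could show which operation first reports Int8 but produces Int32, which would also be a good starting point for a minimal reproducible example.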
