Polars out of core sorting and memory usage
From what I understand, this is a main use case for Polars: processing a dataset larger than RAM, spilling to disk if necessary. Yet I am unable to achieve this in a Kubernetes environment. To replicate the problem locally, I launched a Docker container with a low memory limit:
docker run -it --memory=500m --rm -v `pwd`:/app python:3.12 /bin/bash
# pip install polars==1.26.0
I checked that Docker set the memory limit in cgroups for the container.
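For reference, a check along these lines confirmed it from inside the container (a minimal sketch, assuming the host uses cgroup v2; under cgroup v1 the path would be /sys/fs/cgroup/memory/memory.limit_in_bytes):

from pathlib import Path

# Read the memory limit the container runtime wrote into the cgroup filesystem.
limit = Path("/sys/fs/cgroup/memory.max").read_text().strip()
print(limit)  # expect 524288000 (500 MiB) with --memory=500m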
Then I ran a script that loads a moderately large dataframe (23M Parquet file, 158M uncompressed) with scan_parquet, performs a sort, and prints the head:
import polars as pl

source = "parquet/central_west.df"
df = pl.scan_parquet(source, low_memory=True)  # lazy scan of the Parquet file
query = df.sort("station_code").head()
print(query.collect(engine="streaming"))
This leads to the process getting killed. It works with a smaller dataframe or a larger memory limit. Is Polars not reading the limit correctly, or is it unable to work within a limit that low? I understand the "new" streaming engine is still in beta, so I tried the same script with Polars 1.22.0, but the result was the same. This seems like a very simple and common use case, so I hope I am just missing a configuration trick.
On a hunch, and based on a similar question, I tried setting POLARS_IDEAL_MORSEL_SIZE=100, but it made no difference; I feel like I am grasping at straws here.
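For completeness, this is how I applied it, shown as a sketch (setting it before importing Polars is my own precaution; I have not verified when the variable is actually read):

import os

# Set before importing polars, in case the variable is read at import time.
os.environ["POLARS_IDEAL_MORSEL_SIZE"] = "100"

import polars as pl  # then run the same query as above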