We want to use Polars to load a 22 GB JSON file (10M rows and 65 columns), but we run out of memory when calling collect(), which crashes the program. We're using pl.scan_ndjson to load the file, and we can only load JSON files of up to about 400 MB. The server has 8 GB of RAM and 2 CPUs.
Before we increase the server's capacity, we'd like to know whether Polars can load files of this shape, or whether there is any trick to get it working (slicing the JSON file into chunks or smaller datasets, for example). We've tried the lazy API, but with the same results.
The versions used are:
Python 3.12
Polars 1.7.0
import polars as pl

file_path = "file.json"

# Lazily scan the NDJSON file. Note: infer_schema_length=None forces Polars
# to read the entire file just to infer the schema, which is itself costly.
lazy_frame = pl.scan_ndjson(file_path, infer_schema_length=None)

selected_columns = [
    'field1', 'field2', 'field3', 'field4',
    'field5', 'field6', 'field7', 'field8'
]

# Select only the needed columns, then collect with the streaming engine
result = lazy_frame.select(selected_columns)
df = result.collect(streaming=True)
print(df.shape)
Thank you in advance,
Kind regards
1 Answer
I ran into a similar issue with a much larger JSON dataset (actually a JSONL dataset). I initially thought the cause was the lack of a schema and the need to infer it from the full dataset, but I couldn't find a way to fix that. The only solution I've found so far is to load the dataset with pandas and then convert it to Polars. It turned out to be pretty efficient (and, interestingly, Polars converted the "object" dtypes directly to list and struct appropriately!).
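A minimal sketch of that workaround, assuming the file is newline-delimited JSON and that the selected columns fit in memory once loaded; the chunksize and the column names are placeholders:

import pandas as pd
import polars as pl

file_path = "file.json"  # newline-delimited JSON, one record per line

# Read the file in chunks with pandas to bound peak memory, keeping only
# the columns of interest, then hand the concatenated result to Polars.
chunks = []
for chunk in pd.read_json(file_path, lines=True, chunksize=100_000):
    chunks.append(chunk[["field1", "field2", "field3", "field4"]])

pdf = pd.concat(chunks, ignore_index=True)
df = pl.from_pandas(pdf)  # object columns holding lists/dicts map to List/Struct
print(df.shape)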
low_memory=True? – Adrian Klaver (comment)
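A sketch of what that comment suggests: scan_ndjson exposes a low_memory flag that trades performance for a smaller memory footprint. Combined here with an explicit schema so Polars skips the full-file schema inference; the column names and dtypes are placeholder assumptions, and whether a partial schema suffices (rather than listing all 65 columns) may depend on your Polars version.

import polars as pl

# Hypothetical schema covering only the fields of interest
schema = {"field1": pl.Utf8, "field2": pl.Int64}

lf = pl.scan_ndjson("file.json", schema=schema, low_memory=True)
df = lf.select(list(schema)).collect(streaming=True)
print(df.shape)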