We want to use Polars to load a 22 GB JSON file (10M rows and 65 columns), but we run out of memory when we call collect(), which causes the program to crash. We're using pl.scan_ndjson to load the file, and we're only able to load JSON files of 400 MB. The server has 8 GB of RAM and 2 CPUs.

Before we increase the capacity of the server, we'd like to know whether Polars is able to load files of this shape, or whether there is any trick to get it working (slicing the JSON file into chunks or smaller datasets, for example). We've tried using the lazy API, but with the same results.

The versions used are:

  • Python 3.12

  • Polars 1.7.0

import polars as pl

file_path = "file.json"

# Lazily scan the newline-delimited JSON file.
# infer_schema_length=None makes Polars scan the whole file to infer the schema.
lazy_frame = pl.scan_ndjson(file_path, infer_schema_length=None)

selected_columns = [
    'field1', 'field2', 'field3', 'field4',
    'field5', 'field6', 'field7', 'field8'
]
result = lazy_frame.select(selected_columns)

# Collect with the streaming engine so the query is executed in batches.
df = result.collect(streaming=True)
print(df.shape)

Thank you in advance,
Kind regards


  • JSON is not really a good format for big data. You can preprocess the file into some better format (JSON Lines?) that can be processed in chunks (line by line) without loading the whole file into memory. – Jeyekomon Commented 2 days ago
  • It may be because scan_ndjson is not yet a fully streaming reader. Parquet seems to have the best support currently. – jqurious Commented 2 days ago
  • From the scan_ndjson docs: have you tried low_memory=True? – Adrian Klaver Commented 2 days ago
  • JSON is not a streamable file format. It needs to be read in full to be able to finish the batch. If you have data this large, I would recommend JSON Lines or CSV for human-readable data, or Parquet or Arrow/IPC for data in general. – ritchie46 Commented yesterday
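
Building on those comments, here is a minimal sketch (an illustration, not code from the question) of one way to slice the file: split the newline-delimited JSON into chunks line by line, write each chunk as Parquet, and then scan the Parquet files lazily. The chunk size and the chunk_*.parquet file names are assumptions; the column list is reused from the question.

import polars as pl

# Split the NDJSON into row chunks and write each one as Parquet.
# Only one chunk is parsed at a time, so the full 22 GB file never
# has to fit in memory.
chunk_size = 500_000  # rows per chunk (illustrative; tune to the available RAM)
selected_columns = [
    'field1', 'field2', 'field3', 'field4',
    'field5', 'field6', 'field7', 'field8'
]

with open("file.json", "r", encoding="utf-8") as src:
    lines, part = [], 0
    for line in src:
        lines.append(line)
        if len(lines) == chunk_size:
            chunk = pl.read_ndjson("".join(lines).encode())
            chunk.select(selected_columns).write_parquet(f"chunk_{part:04d}.parquet")
            lines, part = [], part + 1
    if lines:  # last, partially filled chunk
        chunk = pl.read_ndjson("".join(lines).encode())
        chunk.select(selected_columns).write_parquet(f"chunk_{part:04d}.parquet")

# Parquet streams well, so collecting only the 8 selected columns is far
# more likely to fit in 8 GB of RAM.
df = pl.scan_parquet("chunk_*.parquet").collect(streaming=True)
print(df.shape)

In practice it is safer to pass an explicit schema to read_ndjson so every chunk (and therefore every Parquet file) ends up with the same dtypes. Adrian Klaver's low_memory=True suggestion is also worth trying first, since it doesn't require rewriting the file.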

1 Answer


I ran into a similar issue with a much larger JSON dataset (actually, a JSONL dataset). I initially thought the cause was the lack of a schema and the need to infer it from the full dataset, but I couldn't find a way to fix it. The only solution I've found so far is to load the dataset with pandas and convert it to polars. It turned out to be pretty efficient (and, interestingly, polars converted the "object" dtypes directly to list and struct appropriately!).
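
For reference, a minimal sketch of that pandas route, assuming the file is newline-delimited JSON like the one in the question (the file name comes from the question; everything else is illustrative):

import pandas as pd
import polars as pl

# pandas parses the newline-delimited JSON, then Polars takes over the DataFrame.
pdf = pd.read_json("file.json", lines=True)
df = pl.from_pandas(pdf)  # "object" columns holding dicts/lists come out as Struct/List
print(df.shape)

If a single pandas DataFrame of this size still doesn't fit in memory, pd.read_json also accepts chunksize= together with lines=True and returns an iterator, so the chunks can be converted and written out one at a time.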
