We want to use Polars to load a 22 GB JSON file (10M rows and 65 columns), but we run out of memory when we call collect(), which causes the program to crash. We're using pl.scan_ndjson to load the file, and we're only able to load JSON files of 400 MB. The server has 8 GB of RAM and 2 CPUs.

Before we increase the capacity of the server, we'd like to know whether Polars is able to load files of this shape, or whether there is any trick to get it working (slicing the JSON file into chunks or smaller datasets, for example). We've tried using the lazy API, but with the same results.

The versions used are:

  • Python 3.12

  • Polars 1.7.0

import polars as pl

file_path = "file.json"

# Lazily scan the newline-delimited JSON file.
# infer_schema_length=None makes Polars scan the whole file to infer the schema.
lazy_frame = pl.scan_ndjson(file_path, infer_schema_length=None)

selected_columns = [
    'field1', 'field2', 'field3', 'field4',
    'field5', 'field6', 'field7', 'field8'
]
result = lazy_frame.select(selected_columns)

# Collect with the streaming engine so the query is executed in batches.
df = result.collect(streaming=True)
print(df.shape)

Thank you in advance,
Kind regards


  • JSON is not really a good format for big data. You can preprocess the file into some better format (JSON Lines?) that can be processed in chunks (line by line) without loading the whole file into memory. – Jeyekomon Commented 2 days ago
  • It may be because scan_ndjson is not yet a fully streaming reader. Parquet seems to have the best support currently. – jqurious Commented 2 days ago
  • From the scan_ndjson docs: have you tried low_memory=True? – Adrian Klaver Commented 2 days ago
  • JSON is not a streamable file format. It needs to be read in full to be able to finish the batch. If you have data this large, I would recommend JSON Lines or CSV for human-readable data, or Parquet or Arrow/IPC for data in general. – ritchie46 Commented yesterday
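
Building on those comments, here is a minimal sketch (an illustration, not code from the question) of one way to slice the file: split the newline-delimited JSON into chunks line by line, write each chunk as Parquet, and then scan the Parquet files lazily. The chunk size and the chunk_*.parquet file names are assumptions; the column list is reused from the question.

import polars as pl

# Split the NDJSON into row chunks and write each one as Parquet.
# Only one chunk is parsed at a time, so the full 22 GB file never
# has to fit in memory.
chunk_size = 500_000  # rows per chunk (illustrative; tune to the available RAM)
selected_columns = [
    'field1', 'field2', 'field3', 'field4',
    'field5', 'field6', 'field7', 'field8'
]

with open("file.json", "r", encoding="utf-8") as src:
    lines, part = [], 0
    for line in src:
        lines.append(line)
        if len(lines) == chunk_size:
            chunk = pl.read_ndjson("".join(lines).encode())
            chunk.select(selected_columns).write_parquet(f"chunk_{part:04d}.parquet")
            lines, part = [], part + 1
    if lines:  # last, partially filled chunk
        chunk = pl.read_ndjson("".join(lines).encode())
        chunk.select(selected_columns).write_parquet(f"chunk_{part:04d}.parquet")

# Parquet streams well, so collecting only the 8 selected columns is far
# more likely to fit in 8 GB of RAM.
df = pl.scan_parquet("chunk_*.parquet").collect(streaming=True)
print(df.shape)

In practice it is safer to pass an explicit schema to read_ndjson so every chunk (and therefore every Parquet file) ends up with the same dtypes. Adrian Klaver's low_memory=True suggestion is also worth trying first, since it doesn't require rewriting the file.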

1 Answer


I ran into a similar issue with a much larger JSON dataset (actually, a JSONL dataset). I initially thought the cause was the lack of a schema and the need to infer it from the full dataset, but I couldn't find a way to fix it. The only solution I've found so far is to load the dataset with pandas and convert it to polars. It turned out to be pretty efficient (and, interestingly, polars converted the "object" dtypes directly to list and struct appropriately!).
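
For reference, a minimal sketch of that pandas route, assuming the file is newline-delimited JSON like the one in the question (the file name comes from the question; everything else is illustrative):

import pandas as pd
import polars as pl

# pandas parses the newline-delimited JSON, then Polars takes over the DataFrame.
pdf = pd.read_json("file.json", lines=True)
df = pl.from_pandas(pdf)  # "object" columns holding dicts/lists come out as Struct/List
print(df.shape)

If a single pandas DataFrame of this size still doesn't fit in memory, pd.read_json also accepts chunksize= together with lines=True and returns an iterator, so the chunks can be converted and written out one at a time.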
