Generate a reasonably large dataset:

import pandas as pd
import polars as pl

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(
    n_samples=3_000_000,
    n_features=100,
    n_informative=20,
    n_redundant=10,
    n_repeated=0,
    n_classes=2,
    random_state=42,
    shuffle=False,
)

feature_names = [f"feature_{i}" for i in range(X.shape[1])]

X_polars = pl.DataFrame(X, schema=feature_names)
y_polars = pl.Series(values=y, name="target")
X_pandas = X_polars.clone().to_pandas()

If I execute a code block in Jupyter, which renders both the text/html and text/plain representations:

df = X_polars+1
df

The `df` table is rendered and shown as the output under the cell. The problem begins when I re-run the same cell multiple times: each time, the full memory footprint of `df` is added on top of what is already allocated:

# First time running
import psutil
print(psutil.virtual_memory().available / (1024 ** 3))
df = X_polars+1
df

out:
26.315258026123047
(the rendered `df` table)

# Fourth time running the same cell
out:
19.47336196899414
...
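
My current suspicion, which I have not been able to confirm, is that IPython's output history keeps the previously rendered frames alive (the result of each cell's last expression is stored in Out, and the last three results in _, __, ___). The snippet below is only a sketch of how I checked that, assuming the default output-caching behaviour:

# Run in a fresh cell after re-running the `df` cell a few times.
print(len(Out))      # number of cached cell outputs
print(list(Out))     # execution counts that are still cached

# Clear the output history and the _ aliases, then collect,
# to see whether the memory comes back:
%reset -f out
import gc
gc.collect()
print(psutil.virtual_memory().available / (1024 ** 3))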

The same behaviour persists even when I:

  1. Use X_pandas instead.
  2. Do not define the df variable (i.e., evaluate X_polars + 1 directly).
  3. Run on a Linux-based system (I am on Windows 10).
  4. Call gc.collect() (see the sketch after this list).
  5. Switch the IDE to a. VS Code b. Jupyter Notebook c. JupyterLab.
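
For reference, the gc attempt from item 4 looked roughly like the sketch below (the available_gb helper is only for illustration, not part of the original code):

import gc
import psutil

def available_gb():
    # available system memory in GiB, same metric as above
    return psutil.virtual_memory().available / (1024 ** 3)

print("before:", available_gb())
df = X_polars + 1
df   # trailing expression, so Jupyter renders it under the cell

# in a later cell:
del df
gc.collect()
print("after del + gc.collect():", available_gb())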

The problem does not occur, however, when I use:

  1. print(X_polars+1)
  2. print(df)
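
As far as I understand the display path (and this is an assumption on my part), print(df) only goes through str(df) and writes to stdout, whereas a bare trailing df is handed to IPython's rich display machinery (text/html plus text/plain) and, being the cell's last expression, is also stored in the output history:

# Cell A: no memory growth observed
print(X_polars + 1)   # str() written to stdout; the result is not retained by IPython

# Cell B: the problematic pattern
X_polars + 1          # last expression of the cell: rendered via _repr_html_()
                      # and __repr__(), and cached as Out[n] by IPython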
