Assume a Dask dataframe partitioned on the region column. As more data arrives, additional parquet files accumulate within each region partition. I want to "concatenate" all files within the regions that have more than one parquet into a single parquet file. Assume further that the data is large, so I do not want to touch the regions that already contain only one parquet. I have tried various approaches but did not manage to get this working.

P.S. I hope my usage of the word "partition" is not confusing - I am not sure how else to differentiate between the regions and the individual parquet files.

Old data:

my_df
├── region=A
│   └── part.0.parquet
├── region=B
│   ├── part.0.parquet
│   └── part.3.parquet
├── region=C
│   ├── part.0.parquet
│   └── part.3.parquet
└── region=D
    └── part.3.parquet

Desired data:

my_df
├── region=A
│   └── part.0.parquet
├── region=B
│   └── part.4.parquet
├── region=C
│   └── part.4.parquet
└── region=D
    └── part.3.parquet

Code to reproduce the data:

import dask.dataframe as dd
import pandas as pd
from pyarrow import fs

# create local filesystem and path to data
filesys = fs.LocalFileSystem()
PATH = "my_df"

# create dataframes
df1 = pd.DataFrame({
    "region": ["A", "B", "C"],
    "date": ["2025-03-21", "2025-03-22", "2025-03-23"],
    "value": [100, 200, 300],
})

df2 = pd.DataFrame({
    "region": ["B", "C", "D"],
    "date": ["2025-03-22", "2025-03-23", "2025-03-24"],
    "value": [150, 250, 350],
})

# create dask dataframes and write to parquet
ddf1 = dd.from_pandas(df1)
ddf1.to_parquet(PATH, filesystem=filesys, partition_on=["region"], append=True)

ddf2 = dd.from_pandas(df2)
ddf2.to_parquet(PATH, filesystem=filesys, partition_on=["region"], append=True, ignore_divisions=True)
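
To make the intent concrete, here is a rough sketch of the kind of per-region compaction I am after, written directly against pyarrow rather than through Dask (the compact_region helper and the part.N output naming below are just my illustration, not something I claim is the right approach):

import os

import pyarrow as pa
import pyarrow.parquet as pq

def compact_region(region_dir):
    """Merge all parquet files in one region directory into a single file."""
    files = sorted(
        os.path.join(region_dir, name)
        for name in os.listdir(region_dir)
        if name.endswith(".parquet")
    )
    if len(files) <= 1:
        return  # only one parquet here, leave this region untouched

    # region is encoded in the directory name (hive-style partitioning),
    # so the files themselves can simply be concatenated
    table = pa.concat_tables([pq.read_table(f) for f in files])

    # name the new part one past the highest existing number,
    # e.g. part.0 + part.3 -> part.4
    highest = max(int(os.path.basename(f).split(".")[1]) for f in files)
    out_path = os.path.join(region_dir, f"part.{highest + 1}.parquet")
    pq.write_table(table, out_path)

    # remove the now-redundant smaller files
    for f in files:
        os.remove(f)

# compact every region directory under the dataset root
for entry in sorted(os.listdir(PATH)):
    region_dir = os.path.join(PATH, entry)
    if os.path.isdir(region_dir) and entry.startswith("region="):
        compact_region(region_dir)

Reading the dataset back with dd.read_parquet(PATH, filesystem=filesys) afterwards should return the same rows, just laid out as in "Desired data" above. This only illustrates the end state I want; ideally I would get there through Dask itself rather than by dropping down to pyarrow.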
