repartition structured Dask dataframe
Assume a Dask dataframe partitioned on the region column. As more data arrives, more parquet files are created within each region partition. I want to "concatenate" all parquet files within a region into a single file whenever that region holds more than one. Assume further that the data is large, so I do not want to touch regions that already contain only a single parquet file. I tried various ways but did not manage to make it work; a rough sketch of the operation I'm after is shown after the setup code below.
P.S. I hope my usage of the word "partition" is not confusing - I'm not sure how to differentiate between regions and individual parquet files.
Old data:
my_df
├── region=A
│ └── part.0.parquet
├── region=B
│ ├── part.0.parquet
│ └── part.3.parquet
├── region=C
│ ├── part.0.parquet
│ └── part.3.parquet
└── region=D
└── part.3.parquet
Desired data:
my_df
├── region=A
│ └── part.0.parquet
├── region=B
│ └── part.4.parquet
├── region=C
│ └── part.4.parquet
└── region=D
└── part.3.parquet
Code to create data:
import dask.dataframe as dd
import pandas as pd
from pyarrow import fs
# create local filesystem and path to data
filesys = fs.LocalFileSystem()
PATH = "my_df"
# create dataframes
df1 = pd.DataFrame({
    "region": ["A", "B", "C"],
    "date": ["2025-03-21", "2025-03-22", "2025-03-23"],
    "value": [100, 200, 300],
})
df2 = pd.DataFrame({
    "region": ["B", "C", "D"],
    "date": ["2025-03-22", "2025-03-23", "2025-03-24"],
    "value": [150, 250, 350],
})
# create dask dataframes and write to parquet
ddf1 = dd.from_pandas(df1)
ddf1.to_parquet(f"{PATH}", filesystem=filesys, partition_on=["region"], append=True)
ddf2 = dd.from_pandas(df2)
ddf2.to_parquet(f"{PATH}", filesystem=filesys, partition_on=["region"], append=True, ignore_divisions=True)
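For concreteness, the operation I'm after is roughly what this hand-written sketch does with os and pyarrow directly (the output file name part.merged.parquet is just illustrative):
import os
import pyarrow as pa
import pyarrow.parquet as pq

# walk the region=... directories under the dataset root
for entry in os.listdir(PATH):
    region_dir = os.path.join(PATH, entry)
    if not os.path.isdir(region_dir):
        continue
    files = sorted(
        os.path.join(region_dir, name)
        for name in os.listdir(region_dir)
        if name.endswith(".parquet")
    )
    if len(files) <= 1:
        continue  # leave regions that already hold a single parquet untouched
    # read and concatenate all parquet files of this region
    # (the region value is encoded in the directory name, not in the files)
    table = pa.concat_tables([pq.read_table(f) for f in files])
    # write one consolidated file, then drop the originals
    out = os.path.join(region_dir, "part.merged.parquet")  # illustrative name
    pq.write_table(table, out)
    for f in files:
        os.remove(f)
Ideally I'd like to express this through the Dask dataframe API itself rather than dropping down to pyarrow for each region.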