admin管理员组文章数量:1201410
I load data from a parquet file into a pyarrow table with the following schema below. I would like to group the table by the ma_id and items.nan and get the max(processing_ts) for each group. I didn't manage it to group by the filed nan within the items list
import pyarrow as pa
schema: pa.Schema = pa.schema(
[
("ma_id", pa.int32()),
("processing_ts", pa.timestamp("ms")),
(
"items",
pa.list_(
pa.struct(
[
pa.field("nan", pa.int32()),
pa.field("ean", pa.int32()),
]
)
),
),
]
)
Assumed the table contains data like this :
[
(100, '2025-01-03 16:21:00', [{'nan': 1, 'ean': 11}, {'nan': 2, 'ean': 212}, {'nan': 3, 'ean': 3}]),
(100, '2025-01-03 23:55:00', [{'nan': 9, 'ean': 95}, {'nan': 2, 'ean': 212}, {'nan': 9, 'ean': 95}]),
(120, '2025-01-03 21:21:00', [{'nan': 8, 'ean': 87}, {'nan': 2, 'ean': 212}, {'nan': 9, 'ean': 95}]),
(100, '2025-01-03 01:45:00', [{'nan': 6, 'ean': 666}, {'nan': 1, 'ean': 11}, {'nan': 7, 'ean': 711}, {'nan': 6, 'ean': 666}]),
(120, '2025-01-03 12:38:00', [{'nan': 8, 'ean': 87}, {'nan': 9, 'ean': 95}]),
]
My goal is to get the max processing_ts value for each kombination of ma_id and nan from the items column. Related to the data above the result should be:
ma_id | nan | max_processing_ts |
---|---|---|
100 | 1 | '2025-01-03 16:21:00' |
100 | 2 | '2025-01-03 23:55:00' |
100 | 3 | '2025-01-03 16:21:00' |
100 | 6 | '2025-01-03 01:45:00' |
100 | 7 | '2025-01-03 01:45:00' |
100 | 9 | '2025-01-03 23:55:00' |
120 | 2 | '2025-01-03 21:21:00' |
120 | 8 | '2025-01-03 21:21:00' |
120 | 9 | '2025-01-03 21:21:00' |
I load data from a parquet file into a pyarrow table with the following schema below. I would like to group the table by the ma_id and items.nan and get the max(processing_ts) for each group. I didn't manage it to group by the filed nan within the items list
import pyarrow as pa
schema: pa.Schema = pa.schema(
[
("ma_id", pa.int32()),
("processing_ts", pa.timestamp("ms")),
(
"items",
pa.list_(
pa.struct(
[
pa.field("nan", pa.int32()),
pa.field("ean", pa.int32()),
]
)
),
),
]
)
Assumed the table contains data like this :
[
(100, '2025-01-03 16:21:00', [{'nan': 1, 'ean': 11}, {'nan': 2, 'ean': 212}, {'nan': 3, 'ean': 3}]),
(100, '2025-01-03 23:55:00', [{'nan': 9, 'ean': 95}, {'nan': 2, 'ean': 212}, {'nan': 9, 'ean': 95}]),
(120, '2025-01-03 21:21:00', [{'nan': 8, 'ean': 87}, {'nan': 2, 'ean': 212}, {'nan': 9, 'ean': 95}]),
(100, '2025-01-03 01:45:00', [{'nan': 6, 'ean': 666}, {'nan': 1, 'ean': 11}, {'nan': 7, 'ean': 711}, {'nan': 6, 'ean': 666}]),
(120, '2025-01-03 12:38:00', [{'nan': 8, 'ean': 87}, {'nan': 9, 'ean': 95}]),
]
My goal is to get the max processing_ts value for each kombination of ma_id and nan from the items column. Related to the data above the result should be:
ma_id | nan | max_processing_ts |
---|---|---|
100 | 1 | '2025-01-03 16:21:00' |
100 | 2 | '2025-01-03 23:55:00' |
100 | 3 | '2025-01-03 16:21:00' |
100 | 6 | '2025-01-03 01:45:00' |
100 | 7 | '2025-01-03 01:45:00' |
100 | 9 | '2025-01-03 23:55:00' |
120 | 2 | '2025-01-03 21:21:00' |
120 | 8 | '2025-01-03 21:21:00' |
120 | 9 | '2025-01-03 21:21:00' |
1 Answer
Reset to default 1Technically you can do it by exploding the list, flattening/unnesting the struct and calling group by. But it's a lot of work in pyarrow. You'll have a much easier time using polars.
import polaras as pl
df = pl.from_arrow(table)
results = (
df.explode("items")
.unnest("items")
.group_by("ma_id", "nan", maintain_order=True)
.agg(pl.col("processing_ts").max().alias("max_processing_ts"))
)
本文标签:
版权声明:本文标题:python 3.x - Group pyarrow table by multiple columns and aggregate by an item from another list column - Stack Overflow 内容由网友自发贡献,该文观点仅代表作者本人, 转载请联系作者并注明出处:http://www.betaflare.com/web/1738632616a2103837.html, 本站仅提供信息存储空间服务,不拥有所有权,不承担相关法律责任。如发现本站有涉嫌抄袭侵权/违法违规的内容,一经查实,本站将立刻删除。
发表评论