admin管理员组

文章数量:1201410

I load data from a parquet file into a pyarrow table with the following schema below. I would like to group the table by the ma_id and items.nan and get the max(processing_ts) for each group. I didn't manage it to group by the filed nan within the items list

import pyarrow as pa

schema: pa.Schema = pa.schema(
    [
        ("ma_id", pa.int32()),
        ("processing_ts", pa.timestamp("ms")),
        (
            "items",
            pa.list_(
                pa.struct(
                    [
                        pa.field("nan", pa.int32()),
                        pa.field("ean", pa.int32()),
                    ]
                )
            ),
        ),
    ]
)

Assumed the table contains data like this :

[
 (100, '2025-01-03 16:21:00', [{'nan': 1, 'ean': 11}, {'nan': 2, 'ean': 212}, {'nan': 3, 'ean': 3}]),
 (100, '2025-01-03 23:55:00', [{'nan': 9, 'ean': 95}, {'nan': 2, 'ean': 212}, {'nan': 9, 'ean': 95}]),
 (120, '2025-01-03 21:21:00', [{'nan': 8, 'ean': 87}, {'nan': 2, 'ean': 212}, {'nan': 9, 'ean': 95}]),
 (100, '2025-01-03 01:45:00', [{'nan': 6, 'ean': 666}, {'nan': 1, 'ean': 11}, {'nan': 7, 'ean': 711}, {'nan': 6, 'ean': 666}]),
 (120, '2025-01-03 12:38:00', [{'nan': 8, 'ean': 87}, {'nan': 9, 'ean': 95}]),
               ]

My goal is to get the max processing_ts value for each kombination of ma_id and nan from the items column. Related to the data above the result should be:

ma_id nan max_processing_ts
100 1 '2025-01-03 16:21:00'
100 2 '2025-01-03 23:55:00'
100 3 '2025-01-03 16:21:00'
100 6 '2025-01-03 01:45:00'
100 7 '2025-01-03 01:45:00'
100 9 '2025-01-03 23:55:00'
120 2 '2025-01-03 21:21:00'
120 8 '2025-01-03 21:21:00'
120 9 '2025-01-03 21:21:00'

I load data from a parquet file into a pyarrow table with the following schema below. I would like to group the table by the ma_id and items.nan and get the max(processing_ts) for each group. I didn't manage it to group by the filed nan within the items list

import pyarrow as pa

schema: pa.Schema = pa.schema(
    [
        ("ma_id", pa.int32()),
        ("processing_ts", pa.timestamp("ms")),
        (
            "items",
            pa.list_(
                pa.struct(
                    [
                        pa.field("nan", pa.int32()),
                        pa.field("ean", pa.int32()),
                    ]
                )
            ),
        ),
    ]
)

Assumed the table contains data like this :

[
 (100, '2025-01-03 16:21:00', [{'nan': 1, 'ean': 11}, {'nan': 2, 'ean': 212}, {'nan': 3, 'ean': 3}]),
 (100, '2025-01-03 23:55:00', [{'nan': 9, 'ean': 95}, {'nan': 2, 'ean': 212}, {'nan': 9, 'ean': 95}]),
 (120, '2025-01-03 21:21:00', [{'nan': 8, 'ean': 87}, {'nan': 2, 'ean': 212}, {'nan': 9, 'ean': 95}]),
 (100, '2025-01-03 01:45:00', [{'nan': 6, 'ean': 666}, {'nan': 1, 'ean': 11}, {'nan': 7, 'ean': 711}, {'nan': 6, 'ean': 666}]),
 (120, '2025-01-03 12:38:00', [{'nan': 8, 'ean': 87}, {'nan': 9, 'ean': 95}]),
               ]

My goal is to get the max processing_ts value for each kombination of ma_id and nan from the items column. Related to the data above the result should be:

ma_id nan max_processing_ts
100 1 '2025-01-03 16:21:00'
100 2 '2025-01-03 23:55:00'
100 3 '2025-01-03 16:21:00'
100 6 '2025-01-03 01:45:00'
100 7 '2025-01-03 01:45:00'
100 9 '2025-01-03 23:55:00'
120 2 '2025-01-03 21:21:00'
120 8 '2025-01-03 21:21:00'
120 9 '2025-01-03 21:21:00'
Share Improve this question edited Jan 21 at 15:27 0x26res 13.9k12 gold badges62 silver badges120 bronze badges asked Jan 21 at 13:15 Najib BakahouiNajib Bakahoui 815 bronze badges
Add a comment  | 

1 Answer 1

Reset to default 1

Technically you can do it by exploding the list, flattening/unnesting the struct and calling group by. But it's a lot of work in pyarrow. You'll have a much easier time using polars.

import polaras as pl

df = pl.from_arrow(table)

results = (
    df.explode("items")
    .unnest("items")
    .group_by("ma_id", "nan", maintain_order=True)
    .agg(pl.col("processing_ts").max().alias("max_processing_ts"))
)

本文标签: