python 3.x - Group pyarrow table by multiple columns and aggregate by an item from another list column - Stack Overflow


Updated: 2025-02-04


I load data from a parquet file into a pyarrow table with the schema below. I would like to group the table by ma_id and items.nan and get max(processing_ts) for each group. I couldn't work out how to group by the field nan inside the items list.

import pyarrow as pa

schema: pa.Schema = pa.schema(
    [
        ("ma_id", pa.int32()),
        ("processing_ts", pa.timestamp("ms")),
        (
            "items",
            pa.list_(
                pa.struct(
                    [
                        pa.field("nan", pa.int32()),
                        pa.field("ean", pa.int32()),
                    ]
                )
            ),
        ),
    ]
)

Assume the table contains data like this:

[
 (100, '2025-01-03 16:21:00', [{'nan': 1, 'ean': 11}, {'nan': 2, 'ean': 212}, {'nan': 3, 'ean': 3}]),
 (100, '2025-01-03 23:55:00', [{'nan': 9, 'ean': 95}, {'nan': 2, 'ean': 212}, {'nan': 9, 'ean': 95}]),
 (120, '2025-01-03 21:21:00', [{'nan': 8, 'ean': 87}, {'nan': 2, 'ean': 212}, {'nan': 9, 'ean': 95}]),
 (100, '2025-01-03 01:45:00', [{'nan': 6, 'ean': 666}, {'nan': 1, 'ean': 11}, {'nan': 7, 'ean': 711}, {'nan': 6, 'ean': 666}]),
 (120, '2025-01-03 12:38:00', [{'nan': 8, 'ean': 87}, {'nan': 9, 'ean': 95}]),
]

My goal is to get the max processing_ts value for each combination of ma_id and nan from the items column. For the data above, the result should be:

ma_id	nan	max_processing_ts
100	1	'2025-01-03 16:21:00'
100	2	'2025-01-03 23:55:00'
100	3	'2025-01-03 16:21:00'
100	6	'2025-01-03 01:45:00'
100	7	'2025-01-03 01:45:00'
100	9	'2025-01-03 23:55:00'
120	2	'2025-01-03 21:21:00'
120	8	'2025-01-03 21:21:00'
120	9	'2025-01-03 21:21:00'

Asked Jan 21 at 13:15 by Najib Bakahoui; edited Jan 21 at 15:27 by 0x26res

1 Answer

Technically you can do it by exploding the list, flattening/unnesting the struct and calling group by. But it's a lot of work in pyarrow. You'll have a much easier time using polars.

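import polars as pl

# Convert the pyarrow table to a polars DataFrame
df = pl.from_arrow(table)

results = (
    # One row per list element; ma_id and processing_ts are repeated
    df.explode("items")
    # Turn the struct fields (nan, ean) into regular columns
    .unnest("items")
    # maintain_order=True keeps groups in order of first appearance
    .group_by("ma_id", "nan", maintain_order=True)
    .agg(pl.col("processing_ts").max().alias("max_processing_ts"))
)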
