I have a very large Polars LazyFrame (if collected it would be tens of millions of records). I have information recorded for a specific piece of equipment taken every second and a location flag that is either 1 or 0.

When I have sequences where the location flag is equal to 1, I need to filter out and only leave the latest one but this must be done per equipment id.

I cannot use UDFs since this is a performance-critical piece of code; it should ideally stay within Polars expression syntax.

For a simple case where I have only a single equipment id, I can do it relatively easily by shifting the time data by 1 row and filtering out the records where there's a big gap:

import polars as pl

df_test = pl.DataFrame(
    {
        'time': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13],
        'equipment': [0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        'loc': [0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 0, 1, 1]
    }
)

df_test.filter(pl.col('loc') == 1).with_columns(
    (pl.col('time') - pl.col('time').shift(1)).alias('time_diff')
).filter(pl.col('time_diff') > 1)

This gives me sort of a correct result, but the problem is that out of 3 sequences of 1s, I only keep 2, the first one gets lost. I can probably live with that, but ideally I don't want to lose any data.
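For reference, on this sample the filter above only returns the rows at time 9 and 12; the first sequence (starting at time 3) is the one that gets dropped:

shape: (2, 4)
┌──────┬───────────┬─────┬───────────┐
│ time ┆ equipment ┆ loc ┆ time_diff │
│ ---  ┆ ---       ┆ --- ┆ ---       │
│ i64  ┆ i64       ┆ i64 ┆ i64       │
╞══════╪═══════════╪═════╪═══════════╡
│ 9    ┆ 1         ┆ 1   ┆ 4         │
│ 12   ┆ 1         ┆ 1   ┆ 2         │
└──────┴───────────┴─────┴───────────┘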

In the standard case there will be multiple equipment types, and once again the same approach works, but for both types I only keep 2 out of 3 sequences.

df_test = pl.DataFrame(
    {
        'time': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,],
        'equipment': [0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2],
        'loc': [0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 1, 0]
    }
)

Is there a better way to do this?

asked Mar 28 at 3:40 by NotAName
  • 1 "I need to filter out and only leave the latest one ... per equipment id." But further down: "out of 3 sequences of 1s, I only keep 2, the first one gets lost". Can you add the exact desired output? 1st sentence reads to me as if you are interested in the last sequence per id, 2nd as if you need all of them. And either way: what do you like the result to be? Just the start time per sequence? Also the length? What's the minimul length of a sequence: 1, 2, ...? And finally, is the data sorted already? Seems to be the case here, except for the odd 0 in 1st row? Shouldn't that be 1? – ouroboros1 Commented Mar 28 at 5:45
  • 1 Adding to the above: in the title you have "leave only the first occurence". In the text: " only leave the latest one". So, which one is it, and first/latest of what? First/latest sequence / or first/latest (some value) from each sequence? – ouroboros1 Commented Mar 28 at 5:56

2 Answers


If I've interpreted correctly, for each equipment you want to keep only the first row of each continuous sequence of loc = 1.

Fixing your solution

In that case, the only changes you need to make to your solution are:

  • Add a fill_value to pl.col("time").shift(1) to ensure that the first row with loc = 1 is always selected. The fill_value must be chosen so that the first time_diff comes out greater than 1, e.g. any negative number.

    • Note that without the fill_value, the first row of the shift is always null, resulting in a null time_diff, so it is not selected by the time_diff > 1 filter.
    • Another option would be to change the filter to (pl.col("time_diff") > 1) | pl.col("time_diff").is_null() (note the parentheses: | binds more tightly than > in Python); see the sketch after the output below.
  • Apply the logic to each equipment by making it a window expression with .over("equipment").

import polars as pl

df_test = pl.DataFrame(
    {
        "time": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13],
        "equipment": [0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        "loc": [0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 0, 1, 1],
    }
)


res = (
    df_test.filter(pl.col("loc") == 1)
    #.sort("time") # uncomment if we can't assume that the df is sorted by time.
    .with_columns(
        (pl.col("time") - pl.col("time").shift(1, fill_value=-1))
        .over("equipment")
        .alias("time_diff")
    )
    .filter(pl.col("time_diff") > 1)
)

Output:

>>> res

shape: (3, 4)
┌──────┬───────────┬─────┬───────────┐
│ time ┆ equipment ┆ loc ┆ time_diff │
│ ---  ┆ ---       ┆ --- ┆ ---       │
│ i64  ┆ i64       ┆ i64 ┆ i64       │
╞══════╪═══════════╪═════╪═══════════╡
│ 3    ┆ 1         ┆ 1   ┆ 4         │
│ 9    ┆ 1         ┆ 1   ┆ 4         │
│ 12   ┆ 1         ┆ 1   ┆ 2         │
└──────┴───────────┴─────┴───────────┘
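For completeness, a minimal sketch of the is_null variant mentioned above (not part of the original answer): without a fill_value, the first time_diff within each equipment group is null, so those rows are kept explicitly instead.

res = (
    df_test.filter(pl.col("loc") == 1)
    # per-equipment diff; the first row in each group gets a null time_diff
    .with_columns(
        (pl.col("time") - pl.col("time").shift(1))
        .over("equipment")
        .alias("time_diff")
    )
    # keep gaps > 1 second, plus the first row of each group (null diff)
    .filter((pl.col("time_diff") > 1) | pl.col("time_diff").is_null())
)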

Alternative solution

That said, here is another similar solution which I think is clearer:

res = (
    df_test
    #.sort("time") # uncomment if we can't assume that the df is sorted by time.
    .filter(
        ((pl.col("loc") == 1) & (pl.col("loc").shift(fill_value=0) != 1))
        .over("equipment")
    )
)

Note that in this case the fill_value can be any value other than 1.
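On the single-equipment sample this should return the same three rows, just without the time_diff column:

shape: (3, 3)
┌──────┬───────────┬─────┐
│ time ┆ equipment ┆ loc │
│ ---  ┆ ---       ┆ --- │
│ i64  ┆ i64       ┆ i64 │
╞══════╪═══════════╪═════╡
│ 3    ┆ 1         ┆ 1   │
│ 9    ┆ 1         ┆ 1   │
│ 12   ┆ 1         ┆ 1   │
└──────┴───────────┴─────┘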

There is .rle_id() to assign IDs to each run/sequence (shown here on the two-equipment df_test from the question).

df_test.with_columns(id = pl.col("loc").rle_id().over("equipment"))
shape: (26, 4)
┌──────┬───────────┬─────┬─────┐
│ time ┆ equipment ┆ loc ┆ id  │
│ ---  ┆ ---       ┆ --- ┆ --- │
│ i64  ┆ i64       ┆ i64 ┆ u32 │
╞══════╪═══════════╪═════╪═════╡
│ 1    ┆ 0         ┆ 0   ┆ 0   │
│ 2    ┆ 1         ┆ 0   ┆ 0   │
│ 3    ┆ 1         ┆ 1   ┆ 1   │ # keep
│ 4    ┆ 1         ┆ 1   ┆ 1   │
│ 5    ┆ 1         ┆ 1   ┆ 1   │
│ …    ┆ …         ┆ …   ┆ …   │
│ 9    ┆ 2         ┆ 0   ┆ 4   │
│ 10   ┆ 2         ┆ 0   ┆ 4   │
│ 11   ┆ 2         ┆ 1   ┆ 5   │ # keep
│ 12   ┆ 2         ┆ 1   ┆ 5   │
│ 13   ┆ 2         ┆ 0   ┆ 6   │
└──────┴───────────┴─────┴─────┘

.is_first_distinct() can be used to detect the first occurrences - which you can filter by.

df_test.filter(
    pl.col.loc == 1,
    pl.col.loc.rle_id().is_first_distinct().over("equipment")
)
shape: (6, 3)
┌──────┬───────────┬─────┐
│ time ┆ equipment ┆ loc │
│ ---  ┆ ---       ┆ --- │
│ i64  ┆ i64       ┆ i64 │
╞══════╪═══════════╪═════╡
│ 3    ┆ 1         ┆ 1   │
│ 9    ┆ 1         ┆ 1   │
│ 12   ┆ 1         ┆ 1   │
│ 3    ┆ 2         ┆ 1   │
│ 6    ┆ 2         ┆ 1   │
│ 11   ┆ 2         ┆ 1   │
└──────┴───────────┴─────┘

(It's basically the same as the Alternative solution, just worded a little differently.)
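If "latest" is meant as the last row of each run rather than the first, the same pattern should also work with .is_last_distinct() in place of .is_first_distinct() (a variation not shown above, so treat it as a sketch):

df_test.filter(
    pl.col("loc") == 1,
    # mark the last row of each run instead of the first
    pl.col("loc").rle_id().is_last_distinct().over("equipment")
)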
