I have a very large Polars LazyFrame (if collected it would be tens of millions of records). I have information recorded for a specific piece of equipment taken every second and a location flag that is either 1 or 0.

When I have sequences where the location flag is equal to 1, I need to filter out and only leave the latest one but this must be done per equipment id.

I cannot use UDFs since this is a performance-critical piece of code; it should ideally stay within Polars expression syntax.

For a simple case where I have only a single equipment id, I can do it relatively easily by shifting the time data by 1 row and filtering out the records where there's a big gap:

import polars as pl

df_test = pl.DataFrame(
    {
        'time': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13],
        'equipment': [0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        'loc': [0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 0, 1, 1]
    }
)

df_test.filter(pl.col('loc') == 1).with_columns(
    (pl.col('time') - pl.col('time').shift(1)).alias('time_diff')
).filter(pl.col('time_diff') > 1)

This gives me sort of a correct result, but the problem is that out of 3 sequences of 1s, I only keep 2, the first one gets lost. I can probably live with that, but ideally I don't want to lose any data.
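For reference, on this sample the filter above only returns the rows at time 9 and 12; the first sequence (starting at time 3) is the one that gets dropped:

shape: (2, 4)
┌──────┬───────────┬─────┬───────────┐
│ time ┆ equipment ┆ loc ┆ time_diff │
│ ---  ┆ ---       ┆ --- ┆ ---       │
│ i64  ┆ i64       ┆ i64 ┆ i64       │
╞══════╪═══════════╪═════╪═══════════╡
│ 9    ┆ 1         ┆ 1   ┆ 4         │
│ 12   ┆ 1         ┆ 1   ┆ 2         │
└──────┴───────────┴─────┴───────────┘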

In the standard case there will be multiple equipment types, and once again the same approach works, but for both types I only keep 2 out of 3 sequences.

df_test = pl.DataFrame(
    {
        'time': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,],
        'equipment': [0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2],
        'loc': [0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 1, 0]
    }
)

Is there a better way to do this?

asked Mar 28 at 3:40 by NotAName
  • 1 "I need to filter out and only leave the latest one ... per equipment id." But further down: "out of 3 sequences of 1s, I only keep 2, the first one gets lost". Can you add the exact desired output? 1st sentence reads to me as if you are interested in the last sequence per id, 2nd as if you need all of them. And either way: what do you like the result to be? Just the start time per sequence? Also the length? What's the minimul length of a sequence: 1, 2, ...? And finally, is the data sorted already? Seems to be the case here, except for the odd 0 in 1st row? Shouldn't that be 1? – ouroboros1 Commented Mar 28 at 5:45
  • 1 Adding to the above: in the title you have "leave only the first occurence". In the text: " only leave the latest one". So, which one is it, and first/latest of what? First/latest sequence / or first/latest (some value) from each sequence? – ouroboros1 Commented Mar 28 at 5:56

2 Answers


If I've interpreted correctly, for each equipment you want to keep only the first row of each continuous sequence of loc = 1.

Fixing your solution

In that case, the only changes you need to make to your solution are:

  • Add a fill_value to pl.col("time").shift(1) to ensure that the first row with loc = 1 is always selected. The fill_value must be chosen so that the first time_diff comes out greater than 1, e.g. any negative number.

    • Note that without the fill_value, the first row of the shift is always null, resulting in a null time_diff, so it is not selected by the time_diff > 1 filter.
    • Another option would be to change the filter to (pl.col("time_diff") > 1) | pl.col("time_diff").is_null() (note the parentheses: | binds more tightly than > in Python); see the sketch after the output below.
  • Apply the logic to each equipment by making it a window expression with .over("equipment").

import polars as pl

df_test = pl.DataFrame(
    {
        "time": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13],
        "equipment": [0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        "loc": [0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 0, 1, 1],
    }
)


res = (
    df_test.filter(pl.col("loc") == 1)
    #.sort("time") # uncomment if we can't assume that the df is sorted by time.
    .with_columns(
        (pl.col("time") - pl.col("time").shift(1, fill_value=-1))
        .over("equipment")
        .alias("time_diff")
    )
    .filter(pl.col("time_diff") > 1)
)

Output:

>>> res

shape: (3, 4)
┌──────┬───────────┬─────┬───────────┐
│ time ┆ equipment ┆ loc ┆ time_diff │
│ ---  ┆ ---       ┆ --- ┆ ---       │
│ i64  ┆ i64       ┆ i64 ┆ i64       │
╞══════╪═══════════╪═════╪═══════════╡
│ 3    ┆ 1         ┆ 1   ┆ 4         │
│ 9    ┆ 1         ┆ 1   ┆ 4         │
│ 12   ┆ 1         ┆ 1   ┆ 2         │
└──────┴───────────┴─────┴───────────┘
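For completeness, a minimal sketch of the is_null variant mentioned above (not part of the original answer): without a fill_value, the first time_diff within each equipment group is null, so those rows are kept explicitly instead.

res = (
    df_test.filter(pl.col("loc") == 1)
    # per-equipment diff; the first row in each group gets a null time_diff
    .with_columns(
        (pl.col("time") - pl.col("time").shift(1))
        .over("equipment")
        .alias("time_diff")
    )
    # keep gaps > 1 second, plus the first row of each group (null diff)
    .filter((pl.col("time_diff") > 1) | pl.col("time_diff").is_null())
)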

Alternative solution

That said, here is another similar solution which I think is clearer:

res = (
    df_test
    #.sort("time") # uncomment if we can't assume that the df is sorted by time.
    .filter(
        ((pl.col("loc") == 1) & (pl.col("loc").shift(fill_value=0) != 1))
        .over("equipment")
    )
)

Note that in this case the fill_value can be any value other than 1.
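On the single-equipment sample this should return the same three rows, just without the time_diff column:

shape: (3, 3)
┌──────┬───────────┬─────┐
│ time ┆ equipment ┆ loc │
│ ---  ┆ ---       ┆ --- │
│ i64  ┆ i64       ┆ i64 │
╞══════╪═══════════╪═════╡
│ 3    ┆ 1         ┆ 1   │
│ 9    ┆ 1         ┆ 1   │
│ 12   ┆ 1         ┆ 1   │
└──────┴───────────┴─────┘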

There is .rle_id() to assign IDs to each run/sequence (shown here on the two-equipment df_test from the question).

df_test.with_columns(id = pl.col("loc").rle_id().over("equipment"))
shape: (26, 4)
┌──────┬───────────┬─────┬─────┐
│ time ┆ equipment ┆ loc ┆ id  │
│ ---  ┆ ---       ┆ --- ┆ --- │
│ i64  ┆ i64       ┆ i64 ┆ u32 │
╞══════╪═══════════╪═════╪═════╡
│ 1    ┆ 0         ┆ 0   ┆ 0   │
│ 2    ┆ 1         ┆ 0   ┆ 0   │
│ 3    ┆ 1         ┆ 1   ┆ 1   │ # keep
│ 4    ┆ 1         ┆ 1   ┆ 1   │
│ 5    ┆ 1         ┆ 1   ┆ 1   │
│ …    ┆ …         ┆ …   ┆ …   │
│ 9    ┆ 2         ┆ 0   ┆ 4   │
│ 10   ┆ 2         ┆ 0   ┆ 4   │
│ 11   ┆ 2         ┆ 1   ┆ 5   │ # keep
│ 12   ┆ 2         ┆ 1   ┆ 5   │
│ 13   ┆ 2         ┆ 0   ┆ 6   │
└──────┴───────────┴─────┴─────┘

.is_first_distinct() can be used to detect the first occurrences - which you can filter by.

df_test.filter(
    pl.col.loc == 1,
    pl.col.loc.rle_id().is_first_distinct().over("equipment")
)
shape: (6, 3)
┌──────┬───────────┬─────┐
│ time ┆ equipment ┆ loc │
│ ---  ┆ ---       ┆ --- │
│ i64  ┆ i64       ┆ i64 │
╞══════╪═══════════╪═════╡
│ 3    ┆ 1         ┆ 1   │
│ 9    ┆ 1         ┆ 1   │
│ 12   ┆ 1         ┆ 1   │
│ 3    ┆ 2         ┆ 1   │
│ 6    ┆ 2         ┆ 1   │
│ 11   ┆ 2         ┆ 1   │
└──────┴───────────┴─────┘

(It's basically the same as the Alternative solution, just worded a little differently.)
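If "latest" is meant as the last row of each run rather than the first, the same pattern should also work with .is_last_distinct() in place of .is_first_distinct() (a variation not shown above, so treat it as a sketch):

df_test.filter(
    pl.col("loc") == 1,
    # mark the last row of each run instead of the first
    pl.col("loc").rle_id().is_last_distinct().over("equipment")
)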
