Filter sequences of same values in a particular column of a Polars df and leave only the first occurrence
I have a very large Polars LazyFrame (if collected, it would be tens of millions of records). I have information recorded for a specific piece of equipment every second, and a location flag that is either 1 or 0.
When I have sequences where the location flag is equal to 1, I need to filter them down and only leave the first row of each sequence, and this must be done per equipment id.
I cannot use UDFs since this is a performance-critical piece of code; it should ideally stay within Polars expression syntax.
For a simple case where I have only a single equipment id, I can do it relatively easily by shifting the time column by 1 row and keeping only the records where there's a big gap:
df_test = pl.DataFrame(
    {
        'time': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13],
        'equipment': [0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        'loc': [0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 0, 1, 1]
    }
)
df_test.filter(pl.col('loc') == 1).with_columns(
    (pl.col('time') - pl.col('time').shift(1)).alias('time_diff')
).filter(pl.col('time_diff') > 1)
This gives me more or less the correct result, but the problem is that out of the 3 sequences of 1s I only keep 2; the first one gets lost. I can probably live with that, but ideally I don't want to lose any data.
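For reference, on the single-equipment frame above the filter returns only the two rows below; the row at time 3 (the start of the first sequence) is missing:

shape: (2, 4)
┌──────┬───────────┬─────┬───────────┐
│ time ┆ equipment ┆ loc ┆ time_diff │
│ ---  ┆ ---       ┆ --- ┆ ---       │
│ i64  ┆ i64       ┆ i64 ┆ i64       │
╞══════╪═══════════╪═════╪═══════════╡
│ 9    ┆ 1         ┆ 1   ┆ 4         │
│ 12   ┆ 1         ┆ 1   ┆ 2         │
└──────┴───────────┴─────┴───────────┘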
In the general case there will be multiple equipment ids, and once again the same approach works when applied per equipment (one possible version is sketched after the frame below), but again, for both ids I only keep 2 out of 3 sequences.
df_test = pl.DataFrame(
    {
        'time': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13],
        'equipment': [0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2],
        'loc': [0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 1, 0]
    }
)
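For completeness, here is a sketch of one way to apply the same shift-based filter per equipment id, by partitioning the frame and running the filter on each piece (the names used here are only illustrative):

# Sketch: run the single-equipment filter separately for each equipment id
# and concatenate the pieces back together (illustrative only).
pieces = [
    part.filter(pl.col('loc') == 1)
    .with_columns((pl.col('time') - pl.col('time').shift(1)).alias('time_diff'))
    .filter(pl.col('time_diff') > 1)
    for part in df_test.partition_by('equipment')
]
res = pl.concat(pieces)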
Is there a better way to do this?
2 Answers
If I've interpreted correctly, for each equipment you want to keep only the first row of each continuous sequence of loc = 1.
Fixing your solution
In that case, the only changes you need to make to your solution are:

1. Add a fill_value to pl.col("time").shift(1) to ensure that the first row with loc = 1 is always selected. The fill_value must be chosen so that the first time_diff is greater than 1, e.g. any negative number.
   - Note that without the fill_value, the first row of the shift is always null, resulting in a null time_diff, so it is not selected by the time_diff > 1 filter.
   - Another option would be to change the filter to (pl.col("time_diff") > 1) | pl.col("time_diff").is_null().
2. Apply the logic to each equipment by making it a window expression with .over("equipment").
import polars as pl

df_test = pl.DataFrame(
    {
        "time": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13],
        "equipment": [0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        "loc": [0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 0, 1, 1],
    }
)

res = (
    df_test.filter(pl.col("loc") == 1)
    # .sort("time")  # uncomment if we can't assume that the df is sorted by time.
    .with_columns(
        (pl.col("time") - pl.col("time").shift(1, fill_value=-1))
        .over("equipment")
        .alias("time_diff")
    )
    .filter(pl.col("time_diff") > 1)
)
Output:
>>> res
shape: (3, 4)
┌──────┬───────────┬─────┬───────────┐
│ time ┆ equipment ┆ loc ┆ time_diff │
│ ---  ┆ ---       ┆ --- ┆ ---       │
│ i64  ┆ i64       ┆ i64 ┆ i64       │
╞══════╪═══════════╪═════╪═══════════╡
│ 3    ┆ 1         ┆ 1   ┆ 4         │
│ 9    ┆ 1         ┆ 1   ┆ 4         │
│ 12   ┆ 1         ┆ 1   ┆ 2         │
└──────┴───────────┴─────┴───────────┘
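The .over("equipment") window is what makes the same expression work per id. As a quick check on the two-equipment frame from the question, it should keep the start of every run for both ids, i.e. times 3, 9 and 12 for equipment 1 and times 3, 6 and 11 for equipment 2 (df_test2 and res2 are just illustrative names for that second frame and its result):

df_test2 = pl.DataFrame(
    {
        "time": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,
                 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13],
        "equipment": [0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
                      2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2],
        "loc": [0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 0, 1, 1,
                0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 1, 0],
    }
)

res2 = (
    df_test2.filter(pl.col("loc") == 1)
    .with_columns(
        (pl.col("time") - pl.col("time").shift(1, fill_value=-1))
        .over("equipment")
        .alias("time_diff")
    )
    .filter(pl.col("time_diff") > 1)  # should yield 6 rows: one per run of 1s, per equipment
)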
Alternative solution
That said, here is another similar solution which I think is clearer:
res = (
    df_test
    # .sort("time")  # uncomment if we can't assume that the df is sorted by time.
    .filter(
        ((pl.col("loc") == 1) & (pl.col("loc").shift(fill_value=0) != 1))
        .over("equipment")
    )
)
Note that in this case the fill_value has to be any value other than 1.
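Since the real data is a LazyFrame, the same expressions work lazily and are only collected at the end. A minimal sketch using the alternative filter, assuming lf stands in for the actual LazyFrame (here faked from the eager test frame):

import polars as pl

lf = df_test.lazy()  # assumption: stand-in for the real LazyFrame

result = (
    lf.filter(
        ((pl.col("loc") == 1) & (pl.col("loc").shift(fill_value=0) != 1))
        .over("equipment")
    )
    .collect()
)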
Another approach: there is .rle_id() to assign IDs to each run/sequence. Using the two-equipment frame from the question (called df here):
df.with_columns(id = pl.col("loc").rle_id().over("equipment"))
shape: (26, 4)
┌──────┬───────────┬─────┬─────┐
│ time ┆ equipment ┆ loc ┆ id  │
│ ---  ┆ ---       ┆ --- ┆ --- │
│ i64  ┆ i64       ┆ i64 ┆ u32 │
╞══════╪═══════════╪═════╪═════╡
│ 1    ┆ 0         ┆ 0   ┆ 0   │
│ 2    ┆ 1         ┆ 0   ┆ 0   │
│ 3    ┆ 1         ┆ 1   ┆ 1   │ # keep
│ 4    ┆ 1         ┆ 1   ┆ 1   │
│ 5    ┆ 1         ┆ 1   ┆ 1   │
│ …    ┆ …         ┆ …   ┆ …   │
│ 9    ┆ 2         ┆ 0   ┆ 4   │
│ 10   ┆ 2         ┆ 0   ┆ 4   │
│ 11   ┆ 2         ┆ 1   ┆ 5   │ # keep
│ 12   ┆ 2         ┆ 1   ┆ 5   │
│ 13   ┆ 2         ┆ 0   ┆ 6   │
└──────┴───────────┴─────┴─────┘
.is_first_distinct() can be used to detect the first occurrences, which you can filter by.
df.filter(
    pl.col.loc == 1,
    pl.col.loc.rle_id().is_first_distinct().over("equipment")
)
shape: (6, 3)
┌──────┬───────────┬─────┐
│ time ┆ equipment ┆ loc │
│ ---  ┆ ---       ┆ --- │
│ i64  ┆ i64       ┆ i64 │
╞══════╪═══════════╪═════╡
│ 3    ┆ 1         ┆ 1   │
│ 9    ┆ 1         ┆ 1   │
│ 12   ┆ 1         ┆ 1   │
│ 3    ┆ 2         ┆ 1   │
│ 6    ┆ 2         ┆ 1   │
│ 11   ┆ 2         ┆ 1   │
└──────┴───────────┴─────┘
(It's basically the same as the Alternative solution above, just worded a little differently.)
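As a minimal standalone illustration of the two building blocks (values made up for demonstration):

import polars as pl

s = pl.Series("loc", [0, 0, 1, 1, 0, 1])

s.rle_id()                      # [0, 0, 1, 1, 2, 3] - a new id starts whenever the value changes
s.rle_id().is_first_distinct()  # [true, false, true, false, true, true] - flags the first row of each run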