import polars as pl

df = pl.from_repr('''
shape: (7, 2)
┌──────┬──────┐
│ A ┆ B │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞══════╪══════╡
│ 1 ┆ null │
│ 2 ┆ 1 │
│ 2 ┆ 2 │
│ null ┆ 3 │
│ 3 ┆ 4 │
│ 4 ┆ null │
│ 5 ┆ 5 │
└──────┴──────┘
''')
I want to sort a dataframe such that multiple columns are in a sorted order, excluding nulls.
In the example above, columns A and B are both sorted, excluding nulls. This feels like a topological sort to me, with the following conditions:
df[0, 'A'] < df[1, 'A']
df[1, 'B'] < df[2, 'B']
df[2, 'B'] < df[3, 'B']
df[3, 'B'] < df[4, 'B']
df[4, 'A'] < df[5, 'A']
df[5, 'A'] < df[6, 'A']
I understand it's not always possible to do a topological sort if there is a cycle, e.g.
df[0, 'A'] < df[1, 'A']
df[0, 'B'] > df[1, 'B']
In that case, I want to specify that ordering for column A should take precedence over column B.
My use case is that I am merging time series data from multiple datasets with some overlapping events, and I want a single dataframe with all events in a chronological order. There are issues with some of the timestamps, so I cannot compare the raw timestamps directly across datasets.
Is something like this possible in polars?
Asked Feb 18 at 4:45 by T.H Rice, edited 2 days ago by jqurious
1 Answer
You can specify multiple columns when you call DataFrame.sort (or LazyFrame.sort), but they only support absolute ordering, with nulls sent to the start or the end.
You could try to customize your sorting logic using pl.arg_sort_by, Expr.sort_by and so on; however, it will probably be much less efficient than the built-in sort method.
Example
expression = (
pl.col('A')
# Order by B to fill in nulls inside of A with the preceding value A would have when sorted by B
.sort_by("B")
.forward_fill()
# Sort back into the original order
.sort_by(pl.col('idx').sort_by("B"))
)
print(df.with_row_index('idx').sort(expression, "B"))
shape: (7, 3)
┌─────┬──────┬──────┐
│ idx ┆ A ┆ B │
│ --- ┆ --- ┆ --- │
│ u32 ┆ i64 ┆ i64 │
╞═════╪══════╪══════╡
│ 0 ┆ 1 ┆ null │
│ 1 ┆ 2 ┆ 1 │
│ 2 ┆ 2 ┆ 2 │
│ 3 ┆ null ┆ 3 │
│ 4 ┆ 3 ┆ 4 │
│ 5 ┆ 4 ┆ null │
│ 6 ┆ 5 ┆ 5 │
└─────┴──────┴──────┘