import polars as pl

df = pl.from_repr('''
shape: (7, 2)
┌──────┬──────┐
│ A    ┆ B    │
│ ---  ┆ ---  │
│ i64  ┆ i64  │
╞══════╪══════╡
│ 1    ┆ null │
│ 2    ┆ 1    │
│ 2    ┆ 2    │
│ null ┆ 3    │
│ 3    ┆ 4    │
│ 4    ┆ null │
│ 5    ┆ 5    │
└──────┴──────┘
''')

I want to sort a dataframe so that several columns are each in sorted order when nulls are ignored.

In the example above, columns A and B are both sorted, excluding nulls. This feels like a topological sort to me, with the following conditions:

df[0, 'A'] < df[1, 'A']
df[1, 'B'] < df[2, 'B']
df[2, 'B'] < df[3, 'B']
df[3, 'B'] < df[4, 'B']
df[4, 'A'] < df[5, 'A']
df[5, 'A'] < df[6, 'A']

I understand that a topological sort is not possible when the constraints contain a cycle, e.g.

df[0, 'A'] < df[1, 'A']
df[0, 'B'] > df[1, 'B']

In that case, I want to specify that ordering for column A should take precedence over column B.

My use case is that I am merging time series data from multiple datasets with some overlapping events, and I want a single dataframe with all events in a chronological order. There are issues with some of the timestamps, so I cannot compare the raw timestamps directly across datasets.

Is something like this possible in polars?

Asked Feb 18 at 4:45 by T.H Rice; edited by jqurious.

1 Answer

You can pass multiple columns to DataFrame.sort (or LazyFrame.sort), but that only gives a lexicographic ordering: the first column decides, later columns break ties, and nulls are sent to the start or the end.

You could try to build custom sorting logic with pl.arg_sort_by, Expr.sort_by and so on, but it will probably be much less efficient than the built-in sort method.

Example

expression = (
    pl.col('A')
    # Sort A by B, then forward-fill so each null in A takes the preceding value in B order
    .sort_by('B')
    .forward_fill()
    # Sort back into the original row order
    .sort_by(pl.col('idx').sort_by('B'))
)

# Sort by the filled-in A first, then by B to break ties
print(df.with_row_index('idx').sort(expression, 'B'))
shape: (7, 3)
┌─────┬──────┬──────┐
│ idx ┆ A    ┆ B    │
│ --- ┆ ---  ┆ ---  │
│ u32 ┆ i64  ┆ i64  │
╞═════╪══════╪══════╡
│ 0   ┆ 1    ┆ null │
│ 1   ┆ 2    ┆ 1    │
│ 2   ┆ 2    ┆ 2    │
│ 3   ┆ null ┆ 3    │
│ 4   ┆ 3    ┆ 4    │
│ 5   ┆ 4    ┆ null │
│ 6   ┆ 5    ┆ 5    │
└─────┴──────┴──────┘

Tags: python, Topological sort in Polars, Stack Overflow