admin管理员组

文章数量:1291174

I have a Polars DataFrame where each id can appear multiple times with different state values (either 1 or 2). I want to count how many unique ids have only state 1, only state 2, or both states 1 and 2.

import polars as pl

df = pl.DataFrame({
    "id": [1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8, 9, 9, 10, 10, 10, 11, 11, 12, 12, 13, 14, 15, 15, 16, 16, 17, 17, 18, 18, 19, 20, 20, 20],
    "state": [1, 2, 1, 1, 2, 2, 1, 2, 1, 1, 2, 2, 1, 1, 2, 1, 2, 1, 2, 2, 2, 2, 1, 1, 2, 2, 1, 2, 1, 2, 1, 1, 2, 2, 1, 1, 2, 2]
})

I want to count how many unique ids fall into each category:

• Only state 1 (e.g., IDs that only have 1)

• Only state 2 (e.g., IDs that only have 2)

• Both states 1 and 2 (e.g., IDs that have both 1 and 2)

Expected Result (Example):

State combination [1]  -> 20 IDs  
State combination [2]  -> 15 IDs  
State combination [1, 2]  -> 30 IDs  

I have a Polars DataFrame where each id can appear multiple times with different state values (either 1 or 2). I want to count how many unique ids have only state 1, only state 2, or both states 1 and 2.

import polars as pl

df = pl.DataFrame({
    "id": [1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8, 9, 9, 10, 10, 10, 11, 11, 12, 12, 13, 14, 15, 15, 16, 16, 17, 17, 18, 18, 19, 20, 20, 20],
    "state": [1, 2, 1, 1, 2, 2, 1, 2, 1, 1, 2, 2, 1, 1, 2, 1, 2, 1, 2, 2, 2, 2, 1, 1, 2, 2, 1, 2, 1, 2, 1, 1, 2, 2, 1, 1, 2, 2]
})

I want to count how many unique ids fall into each category:

• Only state 1 (e.g., IDs that only have 1)

• Only state 2 (e.g., IDs that only have 2)

• Both states 1 and 2 (e.g., IDs that have both 1 and 2)

Expected Result (Example):

State combination [1]  -> 20 IDs  
State combination [2]  -> 15 IDs  
State combination [1, 2]  -> 30 IDs  
Share Improve this question asked Feb 13 at 20:40 SimonSimon 1,0054 silver badges15 bronze badges
Add a comment  | 

3 Answers 3

Reset to default 4

You need two group_bys, the first to combine the ids into the states and then another group_by for the states to count up the number of ids

(
    df
    .group_by("id")
    .agg(pl.col("state").unique().sort())
    .group_by("state")
    .len()
)

You could group by the id and use .all() and .any() to check the states.

(df.group_by("id")
   .agg(
       one  = (pl.col.state == 1).all(),
       two  = (pl.col.state == 2).all(),
       both = (pl.col.state == 1).any() & (pl.col.state == 2).any()
       # both = pl.lit(1).is_in("state") & pl.lit(2).is_in("state")
   )
   # .select(pl.exclude("id").sum())
)
shape: (20, 4)
┌─────┬───────┬───────┬───────┐
│ id  ┆ one   ┆ two   ┆ both  │
│ --- ┆ ---   ┆ ---   ┆ ---   │
│ i64 ┆ bool  ┆ bool  ┆ bool  │
╞═════╪═══════╪═══════╪═══════╡
│ 6   ┆ false ┆ true  ┆ false │
│ 3   ┆ false ┆ true  ┆ false │
│ 2   ┆ true  ┆ false ┆ false │
│ 12  ┆ true  ┆ false ┆ false │
│ 16  ┆ false ┆ false ┆ true  │
│ …   ┆ …     ┆ …     ┆ …     │
│ 9   ┆ false ┆ false ┆ true  │
│ 13  ┆ false ┆ true  ┆ false │
│ 8   ┆ false ┆ true  ┆ false │
│ 15  ┆ false ┆ false ┆ true  │
│ 10  ┆ false ┆ false ┆ true  │
└─────┴───────┴───────┴───────┘

The .sum() of the bool columns are the counts.

shape: (1, 3)
┌─────┬─────┬──────┐
│ one ┆ two ┆ both │
│ --- ┆ --- ┆ ---  │
│ u32 ┆ u32 ┆ u32  │
╞═════╪═════╪══════╡
│ 6   ┆ 7   ┆ 7    │
└─────┴─────┴──────┘

For future reference, please provide what the correct output should be instead of just an example output.

You can perform a group by, take the unique States for each ID, then take the value counts of that

combinations = df.group_by('id').agg(pl.col('state').unique())
counts = combinations.select(pl.col('state').value_counts().alias('counts'))
print(counts.unnest('counts'))

assert (counts.select(pl.col('counts').struct.field('count').sum()) == df.n_unique('id')).item()

# Alternatively, as a single expression:
print(df.select(
    pl.col('state').unique().implode()
    .over('id', mapping_strategy='explode')
    .value_counts()
    .struct.unnest()
))

本文标签: pythonHow to count unique state combinations per ID in a Polars DataFrameStack Overflow