admin管理员组

文章数量:1394133

I'm slowly migrating to polars from pandas and I have found that in some cases the polars syntax is tricky.

I'm seeking help to do a group_by followed by a describe using less (or more readable) code.

See this example:

from io import BytesIO
import pandas as pd
import polars as pl

S = b'''group,value\n3,245\n3,28\n3,48\n1,113\n1,288\n1,165\n2,90\n2,21\n2,109'''

pl_df = pl.read_csv(BytesIO(S))
pd_df = pd.read_csv(BytesIO(S))

# Polars' way
pl_df.group_by('group').map_groups(
    lambda df: (
        df['value']
        .describe()
        .with_columns(
            group=pl.lit(df['group'][0])
        )
    )
).pivot(index='group', on='statistic')

Something similar in pandas would be:

# pandas' 
pd_df.groupby('group').value.describe()

I'm slowly migrating to polars from pandas and I have found that in some cases the polars syntax is tricky.

I'm seeking help to do a group_by followed by a describe using less (or more readable) code.

See this example:

from io import BytesIO
import pandas as pd
import polars as pl

S = b'''group,value\n3,245\n3,28\n3,48\n1,113\n1,288\n1,165\n2,90\n2,21\n2,109'''

pl_df = pl.read_csv(BytesIO(S))
pd_df = pd.read_csv(BytesIO(S))

# Polars' way
pl_df.group_by('group').map_groups(
    lambda df: (
        df['value']
        .describe()
        .with_columns(
            group=pl.lit(df['group'][0])
        )
    )
).pivot(index='group', on='statistic')

Something similar in pandas would be:

# pandas' 
pd_df.groupby('group').value.describe()
Share Improve this question asked Mar 14 at 16:44 jcalizjcaliz 4,0432 gold badges10 silver badges15 bronze badges
Add a comment  | 

1 Answer 1

Reset to default 3

You can write a quick function that returns a mapping of expressions that you can unpack right into your DataFrame.group_by(...).agg. This avoids any slow-ness of using map_groups and enables Polars to easily scan the query for any optimizations (provided you are working with a LazyFrame).

from io import BytesIO
import pandas as pd
import polars as pl

def describe(column, percentiles=[.25, .5, .75]):
    return {
        'count': column.count(),
        'null_count': column.null_count(),
        'mean': column.mean(),
        'std': column.std(),
        'min': column.min(),
        **{
            f'{pct*100:g}%': column.quantile(pct)
            for pct in percentiles
        },
        'max': column.max(),
    }

S = b'''group,value\n3,245\n3,28\n3,48\n1,113\n1,288\n1,165\n2,90\n2,21\n2,109'''

pl_df = pl.read_csv(BytesIO(S))

print(
    pl_df.group_by('group').agg(**describe(pl.col('value')))
)

Produces

shape: (3, 10)
┌───────┬───────┬────────────┬────────────┬───┬───────┬───────┬───────┬─────┐
│ group ┆ count ┆ null_count ┆ mean       ┆ … ┆ 25%   ┆ 50%   ┆ 75%   ┆ max │
│ ---   ┆ ---   ┆ ---        ┆ ---        ┆   ┆ ---   ┆ ---   ┆ ---   ┆ --- │
│ i64   ┆ u32   ┆ u32        ┆ f64        ┆   ┆ f64   ┆ f64   ┆ f64   ┆ i64 │
╞═══════╪═══════╪════════════╪════════════╪═══╪═══════╪═══════╪═══════╪═════╡
│ 1     ┆ 3     ┆ 0          ┆ 188.666667 ┆ … ┆ 165.0 ┆ 165.0 ┆ 288.0 ┆ 288 │
│ 3     ┆ 3     ┆ 0          ┆ 107.0      ┆ … ┆ 48.0  ┆ 48.0  ┆ 245.0 ┆ 245 │
│ 2     ┆ 3     ┆ 0          ┆ 73.333333  ┆ … ┆ 90.0  ┆ 90.0  ┆ 109.0 ┆ 109 │
└───────┴───────┴────────────┴────────────┴───┴───────┴───────┴───────┴─────┘

本文标签: pythonPolars groupbydescribe return all columns as single dataframeStack Overflow