admin管理员组

文章数量:1406177

Among other descriptive statistics, I want to get some quantiles out of my pandas DataFrame. I can get the quantiles I want a couple of different ways, but I can't find the right way to do it with aggregate. I'd like to use aggregate because it'd be tidy and maybe computationally efficient to get all my stats in one go.

rng = np.random.default_rng(seed=18860504)
df = pd.DataFrame({
    "dummy": 1,
    "bell": rng.normal(loc=0, scale=1, size=100),
    "fish": rng.poisson(lam=10, size=100),
    "cabin": rng.lognormal(mean=0, sigma=1.0, size=100),
})
quants = [x/5 for x in range(6)]
quantiles = pd.DataFrame({
    "quantile" : [f"q{100*q:02n}" for q in quants],
    "bell" : df.groupby("dummy")["bell"].quantile(quants),
    "fish" : df.groupby("dummy")["fish"].quantile(quants),
}) 
print(quantiles)

Output:

          quantile      bell  fish
dummy                             
1     0.0     q000 -2.313461   4.0
      0.2     q020 -0.933831   7.0
      0.4     q040 -0.246860   9.0
      0.6     q060  0.211076  10.0
      0.8     q080  0.685958  13.0
      1.0     q100  3.017258  20.0

I'd like to get these quantiles using groupby().agg(), ideally with programmatically named columns like "bell_q90". Here's an example of the aggregate syntax that feels natural to me:

df.groupby("dummy").agg(
    bell_med=("bell", "median"),
    bell_mean=("bell", "mean"),
    fish_med=("fish", "median"),
    fish_mean=("fish", "mean"),
    # fish_q10=("fish", "quantile(0.1)"), # nothing like it 
    # fish_q10=("fish", "quantile", 0.1), # nothing like it 
    # fish_q10=("fish", "quantile", kwargs({"q":0.1}), # nothing like it 
) 

I can imagine generating the columns by iterating over quants and a list of named columns, using Series.agg and than stitching them together, but this seems like a hack. (For example, it would require me to do my "normal" aggregation first and then add quantiles on afterwards.)

my_aggs = dict()
for q in quants:
    for col in ["bell", "fish"]:
        my_aggs[f"{col}_q{100*q:03n}"] = df.groupby("dummy")[col].quantile(q) 

print(pd.DataFrame(my_aggs)) # numbers equivalent to those above

Is there a better way?

Among other descriptive statistics, I want to get some quantiles out of my pandas DataFrame. I can get the quantiles I want a couple of different ways, but I can't find the right way to do it with aggregate. I'd like to use aggregate because it'd be tidy and maybe computationally efficient to get all my stats in one go.

rng = np.random.default_rng(seed=18860504)
df = pd.DataFrame({
    "dummy": 1,
    "bell": rng.normal(loc=0, scale=1, size=100),
    "fish": rng.poisson(lam=10, size=100),
    "cabin": rng.lognormal(mean=0, sigma=1.0, size=100),
})
quants = [x/5 for x in range(6)]
quantiles = pd.DataFrame({
    "quantile" : [f"q{100*q:02n}" for q in quants],
    "bell" : df.groupby("dummy")["bell"].quantile(quants),
    "fish" : df.groupby("dummy")["fish"].quantile(quants),
}) 
print(quantiles)

Output:

          quantile      bell  fish
dummy                             
1     0.0     q000 -2.313461   4.0
      0.2     q020 -0.933831   7.0
      0.4     q040 -0.246860   9.0
      0.6     q060  0.211076  10.0
      0.8     q080  0.685958  13.0
      1.0     q100  3.017258  20.0

I'd like to get these quantiles using groupby().agg(), ideally with programmatically named columns like "bell_q90". Here's an example of the aggregate syntax that feels natural to me:

df.groupby("dummy").agg(
    bell_med=("bell", "median"),
    bell_mean=("bell", "mean"),
    fish_med=("fish", "median"),
    fish_mean=("fish", "mean"),
    # fish_q10=("fish", "quantile(0.1)"), # nothing like it 
    # fish_q10=("fish", "quantile", 0.1), # nothing like it 
    # fish_q10=("fish", "quantile", kwargs({"q":0.1}), # nothing like it 
) 

I can imagine generating the columns by iterating over quants and a list of named columns, using Series.agg and than stitching them together, but this seems like a hack. (For example, it would require me to do my "normal" aggregation first and then add quantiles on afterwards.)

my_aggs = dict()
for q in quants:
    for col in ["bell", "fish"]:
        my_aggs[f"{col}_q{100*q:03n}"] = df.groupby("dummy")[col].quantile(q) 

print(pd.DataFrame(my_aggs)) # numbers equivalent to those above

Is there a better way?

Share Improve this question asked Mar 6 at 16:34 fliesflies 2,1772 gold badges25 silver badges38 bronze badges
Add a comment  | 

2 Answers 2

Reset to default 3

You could use a function factory to simplify the syntax:

def quantile(q=0.5, **kwargs):
    def f(series):
        return series.quantile(q, **kwargs)
    return f
    
df.groupby('dummy').agg(
    bell_med=('bell', 'median'),
    bell_mean=('bell', 'mean'),
    fish_med=('fish', 'median'),
    fish_mean=('fish', 'mean'),
    bell_q10=('bell', quantile(0.1)),
    fish_q10=('fish', quantile(0.1)),
)

If you have many combinations, you could also combine this with a dictionary comprehension and parameter expansion:

df.groupby('dummy').agg(**{'bell_med': ('bell', 'median'),
                           'bell_mean': ('bell', 'mean'),
                           'fish_med': ('fish', 'median'),
                           'fish_mean': ('fish', 'mean'),
                           },
                        **{f'{c}_q{100*q:02n}': (c, quantile(q))
                           for q in [0.1] # add more if needed
                           for c in ['bell', 'fish']
                          }
                       )

Output:


       bell_med  bell_mean  fish_med  fish_mean  bell_q10  fish_q10
dummy                                                              
1     -0.063454  -0.058557      10.0       9.92 -1.553682       6.0

Consider lambda to expand the Series.quantile call:

agg_df = df.groupby('dummy').agg(
    bell_med=('bell', 'median'),
    bell_mean=('bell', 'mean'),
    fish_med=('fish', 'median'),
    fish_mean=('fish', 'mean'),
    bell_q10=('bell', lambda ser: ser.quantile(0.1)),
    fish_q10=('fish', lambda ser: ser.quantile(0.1))
)

agg_df
#        bell_med  bell_mean  fish_med  fish_mean  bell_q10  fish_q10
# dummy                                                              
# 1     -0.063454  -0.058557      10.0       9.92 -1.553682       6.0

To borrow @mozway's dynamic solution to dictionarize the aggregate functions, an extra lambda input is needed within the dict comprehension:

agg_df = df.groupby('dummy').agg(
    **{
        'bell_med': ('bell', 'median'),
        'bell_mean': ('bell', 'mean'),
        'fish_med': ('fish', 'median'),
        'fish_mean': ('fish', 'mean'),
    },
    **{
        f'{c}_q{100*q:02n}': (c, lambda ser, q=q: ser.quantile(q))
        for q in [0.1, 0.5, 0.9]
        for c in ['bell', 'fish']
    }
)

agg_df
#        bell_med  bell_mean  fish_med  fish_mean  bell_q10  fish_q10  bell_q50  fish_q50  bell_q90  fish_q90
# dummy                                                                                                      
# 1     -0.063454  -0.058557      10.0       9.92 -1.553682       6.0 -0.063454      10.0  1.045002      14.0


本文标签: pythoncompute named quantiles in pandas using groupby aggregateStack Overflow