
I'm taking an online class, and there appears to be a glitch in the coursework that seems to stem from a difference in pandas versions. The code the course provides does not run. The course does offer a patch to repair it, but that patch doesn't appear to be necessary in one place, and it doesn't work in another.

We want to group by planet type and then subdivide further by whether the planet has a magnetic field. That is six groups in theory, but only four actually occur in the data. Once we have these groups, we want to perform some sum and agg operations.

The patch was delivered immediately beforehand: we are told that, to avoid an error, we need to pass numeric_only=True to the sum() function. My current environment doesn't throw an error without this tweak, though; instead it just concatenates the non-numeric values.

But the real problem is where we are asked to run the code at the bottom of the block, with the agg() function. I think the problem is that I'm trying to perform mathematical operations on non-numeric data; specifically, the "rings" column is a bool type. I have been able to adjust the parameters for the mean and max functions individually (so that they only assess numeric columns), but I can't make this adjustment on agg() because it does not have this parameter. And without being able to restrict agg() to numeric columns, the coursework code itself produces an error.

If I pursue my own fix as outlined above and separate the mean() and max() operations, I can set numeric_only=True for each:

print(planets.groupby(['type', 'magnetic_field']).max(numeric_only=True))
print(planets.groupby(['type', 'magnetic_field']).mean(numeric_only=True))

This does produce all the correct data, albeit less efficiently. And shouldn't these two functions have the same parameters as agg(), since they are all pandas aggregation functions?

Aside from that question, there is the issue of reproducing the coursework results: I want all the data in the same output DataFrame. If I separate these functions and adjust the parameters individually, I can collect the data correctly, but much less efficiently, and the coursework wants all the output in the same printout. Any ideas what I'm missing in this syntax to get it done in one execution? Thanks!

import numpy as np
import pandas as pd
data = {'planet': ['Mercury', 'Venus', 'Earth', 'Mars',
                   'Jupiter', 'Saturn', 'Uranus', 'Neptune'],
        'radius_km': [2440, 6052, 6371, 3390, 69911, 58232,
                     25362, 24622],
        'moons': [0, 0, 1, 2, 80, 83, 27, 14],
        'type': ['terrestrial', 'terrestrial', 'terrestrial', 'terrestrial',
                 'gas giant', 'gas giant', 'ice giant', 'ice giant'],
        'rings': ['no', 'no', 'no', 'no', 'yes', 'yes', 'yes','yes'],
        'mean_temp_c': [167, 464, 15, -65, -110, -140, -195, -200],
        'magnetic_field': ['yes', 'no', 'yes', 'no', 'yes', 'yes', 'yes', 'yes'] }

planets = pd.DataFrame(data)
P = planets.groupby(['type', 'magnetic_field']).agg(['mean', 'max'])
print(P)

Asked by PleaseBeNice; edited by BigBen.
  • You should check the data types in your DataFrame. The rings column seems to contain "yes"/"no" strings, which may be causing issues during aggregation. If you convert it to bool (True/False), does that help? – bsraskr Commented yesterday
  • what is your expected output? – iBeMeltin Commented yesterday
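Following up on the first comment, a quick way to see which columns agg(['mean', 'max']) would trip over is to inspect the dtypes. This is only a minimal sketch using the DataFrame from the question; the variable name non_numeric is introduced here for illustration.

```python
import pandas as pd

# Same data as in the question
data = {'planet': ['Mercury', 'Venus', 'Earth', 'Mars',
                   'Jupiter', 'Saturn', 'Uranus', 'Neptune'],
        'radius_km': [2440, 6052, 6371, 3390, 69911, 58232, 25362, 24622],
        'moons': [0, 0, 1, 2, 80, 83, 27, 14],
        'type': ['terrestrial', 'terrestrial', 'terrestrial', 'terrestrial',
                 'gas giant', 'gas giant', 'ice giant', 'ice giant'],
        'rings': ['no', 'no', 'no', 'no', 'yes', 'yes', 'yes', 'yes'],
        'mean_temp_c': [167, 464, 15, -65, -110, -140, -195, -200],
        'magnetic_field': ['yes', 'no', 'yes', 'no', 'yes', 'yes', 'yes', 'yes']}
planets = pd.DataFrame(data)

# Show the dtype pandas inferred for every column
print(planets.dtypes)

# Columns that are not numeric; 'mean' cannot be computed on these
non_numeric = planets.select_dtypes(exclude='number').columns.tolist()
print(non_numeric)
```

With this data, rings holds the strings 'yes'/'no' rather than booleans, so it is listed next to planet, type and magnetic_field. That would be consistent with agg() failing on 'mean' once it reaches a string column.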

3 Answers

I'm not sure what the exact expected output is, but you can convert the yes/no values to booleans and select the desired dtypes before aggregating:

planets = (planets
           .replace({'yes': True, 'no': False})
           .convert_dtypes()
           )

cols = planets.select_dtypes(['number', 'boolean']).columns

P = (planets.groupby(['type', 'magnetic_field'])[cols]
     .agg(['mean', 'max'])
     )

Output:

                           radius_km        moons     rings        mean_temp_c       magnetic_field       
                                mean    max  mean max  mean    max        mean   max           mean    max
type        magnetic_field                                                                                
gas giant   True             64071.5  69911  81.5  83   1.0   True      -125.0  -110            1.0   True
ice giant   True             24992.0  25362  20.5  27   1.0   True      -197.5  -195            1.0   True
terrestrial False             4721.0   6052   1.0   2   0.0  False       199.5   464            0.0  False
            True              4405.5   6371   0.5   1   0.0  False        91.0   167            1.0   True
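In the output above, magnetic_field shows up twice: once as a grouping key and once as an aggregated column. If that is not wanted, one possible variation is to convert only the yes/no columns and exclude the grouping keys from the selection. This is a sketch under my own assumptions, not the coursework's method; the names yes_no_cols, keys and value_cols are made up for illustration.

```python
import pandas as pd

data = {'planet': ['Mercury', 'Venus', 'Earth', 'Mars',
                   'Jupiter', 'Saturn', 'Uranus', 'Neptune'],
        'radius_km': [2440, 6052, 6371, 3390, 69911, 58232, 25362, 24622],
        'moons': [0, 0, 1, 2, 80, 83, 27, 14],
        'type': ['terrestrial', 'terrestrial', 'terrestrial', 'terrestrial',
                 'gas giant', 'gas giant', 'ice giant', 'ice giant'],
        'rings': ['no', 'no', 'no', 'no', 'yes', 'yes', 'yes', 'yes'],
        'mean_temp_c': [167, 464, 15, -65, -110, -140, -195, -200],
        'magnetic_field': ['yes', 'no', 'yes', 'no', 'yes', 'yes', 'yes', 'yes']}
planets = pd.DataFrame(data)

# Convert only the known yes/no columns, leaving other strings untouched
yes_no_cols = ['rings', 'magnetic_field']
converted = planets.copy()
for col in yes_no_cols:
    converted[col] = converted[col].map({'yes': True, 'no': False})

# Aggregate numeric and boolean columns, but skip the grouping keys
keys = ['type', 'magnetic_field']
candidates = converted.select_dtypes(['number', 'bool']).columns
value_cols = [c for c in candidates if c not in keys]

P = converted.groupby(keys)[value_cols].agg(['mean', 'max'])
print(P)
```

Mapping column by column avoids touching any other column that might happen to contain the strings 'yes' or 'no'.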

You can give specific aggregations for each column, and just leave out the non-numeric columns.

planets.groupby(['type', 'magnetic_field']).agg(
    mean_radius=('radius_km', 'mean'), max_radius=('radius_km', 'max'), 
    mean_moons=('moons', 'mean'), max_moons=('moons', 'max'), 
    mean_temp=('mean_temp_c', 'mean'), max_temp=('mean_temp_c', 'max')
)

Output:

                            mean_radius  max_radius  mean_moons  max_moons  mean_temp  max_temp
type        magnetic_field                                                                     
gas giant   yes                 64071.5       69911        81.5         83     -125.0      -110
ice giant   yes                 24992.0       25362        20.5         27     -197.5      -195
terrestrial no                   4721.0        6052         1.0          2      199.5       464
            yes                  4405.5        6371         0.5          1       91.0       167
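Typing out every named aggregation gets tedious with more columns. As a rough sketch, the same keyword arguments can be generated with a dict comprehension and unpacked into agg(). The resulting names follow a func_column pattern (for example mean_radius_km), so they differ slightly from the hand-written ones above; value_cols and named are illustrative names.

```python
import pandas as pd

data = {'planet': ['Mercury', 'Venus', 'Earth', 'Mars',
                   'Jupiter', 'Saturn', 'Uranus', 'Neptune'],
        'radius_km': [2440, 6052, 6371, 3390, 69911, 58232, 25362, 24622],
        'moons': [0, 0, 1, 2, 80, 83, 27, 14],
        'type': ['terrestrial', 'terrestrial', 'terrestrial', 'terrestrial',
                 'gas giant', 'gas giant', 'ice giant', 'ice giant'],
        'rings': ['no', 'no', 'no', 'no', 'yes', 'yes', 'yes', 'yes'],
        'mean_temp_c': [167, 464, 15, -65, -110, -140, -195, -200],
        'magnetic_field': ['yes', 'no', 'yes', 'no', 'yes', 'yes', 'yes', 'yes']}
planets = pd.DataFrame(data)

# Columns to aggregate, chosen by hand here
value_cols = ['radius_km', 'moons', 'mean_temp_c']

# Build {output_name: (input_column, function)} for every combination
named = {f'{func}_{col}': (col, func)
         for col in value_cols
         for func in ('mean', 'max')}

# Unpack the dict as keyword arguments for named aggregation
P = planets.groupby(['type', 'magnetic_field']).agg(**named)
print(P)
```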

You can select the numeric columns with .select_dtypes('number') (and use .select_dtypes('object') for the string-valued, categorical ones):

numeric_cols = planets.select_dtypes('number').columns
P = planets.groupby(['type', 'magnetic_field'])[numeric_cols].agg(['mean', 'max'])
print(P)  # in a Jupyter notebook, display(P) works as well
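If you would rather keep the two separate calls from the question, their results can also be glued together afterwards. A minimal sketch, assuming pd.concat with dict keys plus a reindex to restore the (column, statistic) layout that agg(['mean', 'max']) produces; the names grouped, combined and layout are illustrative.

```python
import pandas as pd

data = {'planet': ['Mercury', 'Venus', 'Earth', 'Mars',
                   'Jupiter', 'Saturn', 'Uranus', 'Neptune'],
        'radius_km': [2440, 6052, 6371, 3390, 69911, 58232, 25362, 24622],
        'moons': [0, 0, 1, 2, 80, 83, 27, 14],
        'type': ['terrestrial', 'terrestrial', 'terrestrial', 'terrestrial',
                 'gas giant', 'gas giant', 'ice giant', 'ice giant'],
        'rings': ['no', 'no', 'no', 'no', 'yes', 'yes', 'yes', 'yes'],
        'mean_temp_c': [167, 464, 15, -65, -110, -140, -195, -200],
        'magnetic_field': ['yes', 'no', 'yes', 'no', 'yes', 'yes', 'yes', 'yes']}
planets = pd.DataFrame(data)

grouped = planets.groupby(['type', 'magnetic_field'])

# Side-by-side concat; the dict keys become the outer column level
combined = pd.concat({'mean': grouped.mean(numeric_only=True),
                      'max': grouped.max(numeric_only=True)}, axis=1)

# Swap to (column, statistic) and order the columns like agg() would
numeric_cols = planets.select_dtypes('number').columns
layout = pd.MultiIndex.from_product([numeric_cols, ['mean', 'max']])
combined = combined.swaplevel(axis=1).reindex(columns=layout)
print(combined)
```

This takes more steps than selecting the numeric columns first, but it shows that the two numeric_only=True results can end up in a single DataFrame.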

Tags: python · Missing some part of the groupby() syntax · Stack Overflow