admin管理员组

文章数量:1290098

The basics of using Pandas where with callables seems simple.

np.random.seed(0)
df = pd.DataFrame(np.random.randn(8, 4), columns=['A', 'B', 'C', 'D'])
df["test"] = range(1,9)

def MyBool(x):
    print(1)
    return ( x > 0 )

def MyFunc(x1):
    print(1)
    return x1['A']

df.where(
    cond  = lambda x: MyBool(x),
    other = lambda x: MyFunc(x) ,
    )

In the code above, I am replacing the values of all columns with the value of column A whenever the value of the col is less than 0. Note, I know I don't need to use callables for this simple example.

Based on my analysis, this is what is happening under the hood.

First, MyFunc is evaluated where the argument is the df itself. This returns a 8x1 df (=A)

Second, the MyBool is evaluated which returns a 8x5 boolean df.

Third, (not sure about this last step) for all entries (i,j) where MyBool returned False, the value of the i'th row of the output of MyFunc is used to replace the current value of the df.

This leads me on to my question: how does this extend to the cases when MyFunc returns a dataframe with several columns and rows? How does the function determine which entries need to be replaced and with which values?

For illustrative purposes, suppose now that we want to divide B and C by 2 when test is equal to 5. The code I have provided below works but I don't quite understand how it determines which entries are to be replaced and with which values.

MyBool still returns a one dimensional vector but MyFunc returns a dataframe. If the previous logic I explained was correct, then shouldn't it replace each False entry with the dataframe? Indeed, if this were the case, the resulting dataframe should be bigger than the input df.

I've been reading the documentation and playing with different examples but can't figure this one out.

def MyBool(x):
    output = x.test != 5
    return output

def MyFunc(x1):
    x1.loc[ x1.test == 5, ["B", "C"] ] /= 2
    return x1

df.where(
    cond  = lambda x: MyBool(x),
    other = lambda x: MyFunc(x.copy()),
    axis  = 0
    )

The basics of using Pandas where with callables seems simple.

np.random.seed(0)
df = pd.DataFrame(np.random.randn(8, 4), columns=['A', 'B', 'C', 'D'])
df["test"] = range(1,9)

def MyBool(x):
    print(1)
    return ( x > 0 )

def MyFunc(x1):
    print(1)
    return x1['A']

df.where(
    cond  = lambda x: MyBool(x),
    other = lambda x: MyFunc(x) ,
    )

In the code above, I am replacing the values of all columns with the value of column A whenever the value of the col is less than 0. Note, I know I don't need to use callables for this simple example.

Based on my analysis, this is what is happening under the hood.

First, MyFunc is evaluated where the argument is the df itself. This returns a 8x1 df (=A)

Second, the MyBool is evaluated which returns a 8x5 boolean df.

Third, (not sure about this last step) for all entries (i,j) where MyBool returned False, the value of the i'th row of the output of MyFunc is used to replace the current value of the df.

This leads me on to my question: how does this extend to the cases when MyFunc returns a dataframe with several columns and rows? How does the function determine which entries need to be replaced and with which values?

For illustrative purposes, suppose now that we want to divide B and C by 2 when test is equal to 5. The code I have provided below works but I don't quite understand how it determines which entries are to be replaced and with which values.

MyBool still returns a one dimensional vector but MyFunc returns a dataframe. If the previous logic I explained was correct, then shouldn't it replace each False entry with the dataframe? Indeed, if this were the case, the resulting dataframe should be bigger than the input df.

I've been reading the documentation and playing with different examples but can't figure this one out.

def MyBool(x):
    output = x.test != 5
    return output

def MyFunc(x1):
    x1.loc[ x1.test == 5, ["B", "C"] ] /= 2
    return x1

df.where(
    cond  = lambda x: MyBool(x),
    other = lambda x: MyFunc(x.copy()),
    axis  = 0
    )
Share Improve this question edited Feb 19 at 18:56 Leksa99 asked Feb 19 at 17:14 Leksa99Leksa99 936 bronze badges 1
  • 2 It would be great to provide a reproducible example (seed the random numbers with np.random.seed(0)), and the matching expected output, for clarity. – mozway Commented Feb 19 at 17:38
Add a comment  | 

1 Answer 1

Reset to default 2

The logic is quite simple, for each False in the output of the cond callable, the matching value in the result of other will be used as replacement. If other is a scalar, this value is used.

The matching value is identified by position if the callable returns an array, and by alignment for a DataFrame:

df = pd.DataFrame({'A': [1, 0], 'B': [0, 1]})
df.where(cond=lambda x: x==0,
         other=lambda x: pd.DataFrame({'A': [10, 20]}, index=[1, 0]))
#     A    B
# 0  20  0.0
# 1   0  NaN

df.where(cond=lambda x: x==0, other=lambda x: [[7,8],[9,10]])
#    A   B
# 0  7   0
# 1  0  10

Therefore, your MyFunc functions should return a scalar, a DataFrame (that will be aligned), or an array of the same shape as the input.

You can modify it to broadcast the values to all columns:

def MyFunc(x1):
    print(1)
    return np.broadcast_to(x1['A'].to_frame().values, df.shape)

Example:

df = pd.DataFrame([[1, -1,  0,  0],
                   [2,  0, -1,  0],
                   [3,  0,  0, -1]],
                  columns=['A', 'B', 'C', 'D'])
def MyBool(x):
    return x >= 0

def MyFunc(x):
    return np.broadcast_to(x['A'].to_frame().values, df.shape)

out = df.where(cond=MyBool, other=MyFunc)

#    A  B  C  D
# 0  1  1  0  0
# 1  2  0  2  0
# 2  3  0  0  3

Note that the callables should NOT modify the DataFrame in place.

This should be avoided:

def MyFunc(x1):
    x1.loc[ x1.test == 5, ["B", "C"] ] /= 2
    return x1

and could be replaced by a simple (without using where):

df.loc[df['test'] == 5, ['B', 'C']] /= 2

本文标签: pythonHow does pdwhere work with callables as parametersStack Overflow