admin管理员组文章数量:1290098
The basics of using Pandas where with callables seems simple.
np.random.seed(0)
df = pd.DataFrame(np.random.randn(8, 4), columns=['A', 'B', 'C', 'D'])
df["test"] = range(1,9)
def MyBool(x):
print(1)
return ( x > 0 )
def MyFunc(x1):
print(1)
return x1['A']
df.where(
cond = lambda x: MyBool(x),
other = lambda x: MyFunc(x) ,
)
In the code above, I am replacing the values of all columns with the value of column A whenever the value of the col is less than 0. Note, I know I don't need to use callables for this simple example.
Based on my analysis, this is what is happening under the hood.
First, MyFunc is evaluated where the argument is the df itself. This returns a 8x1 df (=A)
Second, the MyBool is evaluated which returns a 8x5 boolean df.
Third, (not sure about this last step) for all entries (i,j) where MyBool returned False, the value of the i'th row of the output of MyFunc is used to replace the current value of the df.
This leads me on to my question: how does this extend to the cases when MyFunc returns a dataframe with several columns and rows? How does the function determine which entries need to be replaced and with which values?
For illustrative purposes, suppose now that we want to divide B and C by 2 when test is equal to 5. The code I have provided below works but I don't quite understand how it determines which entries are to be replaced and with which values.
MyBool still returns a one dimensional vector but MyFunc returns a dataframe. If the previous logic I explained was correct, then shouldn't it replace each False entry with the dataframe? Indeed, if this were the case, the resulting dataframe should be bigger than the input df.
I've been reading the documentation and playing with different examples but can't figure this one out.
def MyBool(x):
output = x.test != 5
return output
def MyFunc(x1):
x1.loc[ x1.test == 5, ["B", "C"] ] /= 2
return x1
df.where(
cond = lambda x: MyBool(x),
other = lambda x: MyFunc(x.copy()),
axis = 0
)
The basics of using Pandas where with callables seems simple.
np.random.seed(0)
df = pd.DataFrame(np.random.randn(8, 4), columns=['A', 'B', 'C', 'D'])
df["test"] = range(1,9)
def MyBool(x):
print(1)
return ( x > 0 )
def MyFunc(x1):
print(1)
return x1['A']
df.where(
cond = lambda x: MyBool(x),
other = lambda x: MyFunc(x) ,
)
In the code above, I am replacing the values of all columns with the value of column A whenever the value of the col is less than 0. Note, I know I don't need to use callables for this simple example.
Based on my analysis, this is what is happening under the hood.
First, MyFunc is evaluated where the argument is the df itself. This returns a 8x1 df (=A)
Second, the MyBool is evaluated which returns a 8x5 boolean df.
Third, (not sure about this last step) for all entries (i,j) where MyBool returned False, the value of the i'th row of the output of MyFunc is used to replace the current value of the df.
This leads me on to my question: how does this extend to the cases when MyFunc returns a dataframe with several columns and rows? How does the function determine which entries need to be replaced and with which values?
For illustrative purposes, suppose now that we want to divide B and C by 2 when test is equal to 5. The code I have provided below works but I don't quite understand how it determines which entries are to be replaced and with which values.
MyBool still returns a one dimensional vector but MyFunc returns a dataframe. If the previous logic I explained was correct, then shouldn't it replace each False entry with the dataframe? Indeed, if this were the case, the resulting dataframe should be bigger than the input df.
I've been reading the documentation and playing with different examples but can't figure this one out.
def MyBool(x):
output = x.test != 5
return output
def MyFunc(x1):
x1.loc[ x1.test == 5, ["B", "C"] ] /= 2
return x1
df.where(
cond = lambda x: MyBool(x),
other = lambda x: MyFunc(x.copy()),
axis = 0
)
Share
Improve this question
edited Feb 19 at 18:56
Leksa99
asked Feb 19 at 17:14
Leksa99Leksa99
936 bronze badges
1
|
1 Answer
Reset to default 2The logic is quite simple, for each False
in the output of the cond
callable, the matching value in the result of other
will be used as replacement. If other
is a scalar, this value is used.
The matching value is identified by position if the callable returns an array, and by alignment for a DataFrame:
df = pd.DataFrame({'A': [1, 0], 'B': [0, 1]})
df.where(cond=lambda x: x==0,
other=lambda x: pd.DataFrame({'A': [10, 20]}, index=[1, 0]))
# A B
# 0 20 0.0
# 1 0 NaN
df.where(cond=lambda x: x==0, other=lambda x: [[7,8],[9,10]])
# A B
# 0 7 0
# 1 0 10
Therefore, your MyFunc
functions should return a scalar, a DataFrame (that will be aligned), or an array of the same shape as the input.
You can modify it to broadcast the values to all columns:
def MyFunc(x1):
print(1)
return np.broadcast_to(x1['A'].to_frame().values, df.shape)
Example:
df = pd.DataFrame([[1, -1, 0, 0],
[2, 0, -1, 0],
[3, 0, 0, -1]],
columns=['A', 'B', 'C', 'D'])
def MyBool(x):
return x >= 0
def MyFunc(x):
return np.broadcast_to(x['A'].to_frame().values, df.shape)
out = df.where(cond=MyBool, other=MyFunc)
# A B C D
# 0 1 1 0 0
# 1 2 0 2 0
# 2 3 0 0 3
Note that the callables should NOT modify the DataFrame in place.
This should be avoided:
def MyFunc(x1):
x1.loc[ x1.test == 5, ["B", "C"] ] /= 2
return x1
and could be replaced by a simple (without using where
):
df.loc[df['test'] == 5, ['B', 'C']] /= 2
本文标签: pythonHow does pdwhere work with callables as parametersStack Overflow
版权声明:本文标题:python - How does pd.where work with callables as parameters? - Stack Overflow 内容由网友自发贡献,该文观点仅代表作者本人, 转载请联系作者并注明出处:http://www.betaflare.com/web/1741481213a2381184.html, 本站仅提供信息存储空间服务,不拥有所有权,不承担相关法律责任。如发现本站有涉嫌抄袭侵权/违法违规的内容,一经查实,本站将立刻删除。
np.random.seed(0)
), and the matching expected output, for clarity. – mozway Commented Feb 19 at 17:38