Pandas astype() appears to unexpectedly switch to performing in-place operations after loading data from a pickle file. Concretely, for astype(str), the data type of the input dataframe values is modified. What is causing this behavior?
Pandas version: 2.0.3
Minimal example:
import pandas as pd
import numpy as np
# create a test dataframe
df = pd.DataFrame({'col1': ['hi']*10 + [False]*20 + [np.nan]*30})
# print the data types of the cells, before and after casting to string
print(pd.unique([type(elem) for elem in df['col1'].values]))
_ = df.astype(str)
print(pd.unique([type(elem) for elem in df['col1'].values]))
# store the dataframe as pkl and directly load it again
outpath = 'C:/Dokumente/my_test_df.pkl'
df.to_pickle(outpath)
df2 = pd.read_pickle(outpath)
# print the data types of the cells, before and after casting to string
print(pd.unique([type(elem) for elem in df2['col1'].values]))
_ = df2.astype(str)
print(pd.unique([type(elem) for elem in df2['col1'].values]))
Output:
Asked Feb 24 at 16:59 by silence_of_the_lambdas, edited Feb 24 at 17:04
1 Answer
This is a bug that has been fixed in pandas 2.2.0:
Bug in DataFrame.astype() when called with str on unpickled array - the array might change in-place (GH 54654)
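To check whether an installed pandas version shows this behavior, the minimal example from the question can be wrapped into a small check that uses a temporary directory instead of a fixed path. This is only a sketch: the helper names cell_types and astype_str_mutates_input are hypothetical, and the check covers just the one column layout used in the question.

```python
import os
import tempfile

import numpy as np
import pandas as pd


def cell_types(df, col="col1"):
    """Return the set of Python types found in the cells of one column."""
    return {type(elem) for elem in df[col].to_numpy()}


def astype_str_mutates_input():
    """Hypothetical helper: True if astype(str) changed the unpickled input frame."""
    df = pd.DataFrame({"col1": ["hi"] * 10 + [False] * 20 + [np.nan] * 30})
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "my_test_df.pkl")
        df.to_pickle(path)
        df2 = pd.read_pickle(path)
    before = cell_types(df2)
    _ = df2.astype(str)  # result discarded on purpose; only side effects matter
    return cell_types(df2) != before


affected = astype_str_mutates_input()
print(pd.__version__, "affected:", affected)
```

Going by the changelog entry above, this should report False on pandas 2.2.0 and later.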
As noted by Itayazolay in the PR, regarding the pickle MRE used there:
The problem is not exactly with pickle, it's just a quick way to reproduce the problem.
The problem is that the code here attempts to check if two arrays have the same memory (or share memory) and it does so incorrectly - result is arr
See numpy/numpy#24478 for more technical details.
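The difference between an identity check and a memory check can be illustrated with plain NumPy. In this sketch a sliced view stands in for "a new array object that shares data with the original". That is an assumption made for illustration, not the exact mechanism that occurs with unpickled arrays.

```python
import numpy as np

arr = np.array(["hi", False, np.nan], dtype=object)
view = arr[:]  # new ndarray object, backed by the same buffer as arr

same_object = view is arr                       # identity check, as in the old code
shares_buffer = np.may_share_memory(arr, view)  # memory-bounds check, as in the fix

print(same_object, shares_buffer)

# Writing through the view changes the original array as well,
# even though the identity check reported two different objects.
view[1] = "False"
print(type(arr[1]))
```

Here the identity check is False while the memory check is True, so code that copies only when result is arr would skip the copy and then write into the caller's data.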
If you're using a version < 2.2 and cannot upgrade, you could try manually applying the fix mentioned in the PR and recompiling ".../pandas/_libs/lib.pyx".
At #L759:
    if copy and result is arr:
        result = result.copy()
Required change:
    if copy and (result is arr or np.may_share_memory(arr, result)):
        result = result.copy()
There are now some extra comments in ".../pandas/_libs/lib.pyx", version 2.3.x, together with adjusted logic. See #L777-L785:
    if result is arr or np.may_share_memory(arr, result):
        # if np.asarray(..) did not make a copy of the input arr, we still need
        # to do that to avoid mutating the input array
        # GH#54654: share_memory check is needed for rare cases where np.asarray
        # returns a new object without making a copy of the actual data
        if copy:
            result = result.copy()
    else:
        already_copied = False
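If neither upgrading nor recompiling is an option, a user-level alternative might be to cast a deep copy, so that any in-place modification lands on a throwaway frame rather than on the caller's data. This is a sketch under that assumption, not something taken from the PR. The helper name astype_str_defensive is hypothetical, and the extra copy costs memory and time.

```python
import numpy as np
import pandas as pd


def astype_str_defensive(df):
    """Hypothetical helper: run astype(str) on a deep copy of df.

    The copy owns its own buffers, so even if astype(str) were to write
    in place, the frame passed in by the caller is left untouched.
    """
    return df.copy(deep=True).astype(str)


df = pd.DataFrame({"col1": ["hi"] * 10 + [False] * 20 + [np.nan] * 30})
types_before = {type(elem) for elem in df["col1"].to_numpy()}
as_str = astype_str_defensive(df)
types_after = {type(elem) for elem in df["col1"].to_numpy()}
print(types_before == types_after)
```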