I have two DataFrames:
A: contains unique (A1, A2) pairs and a column D with numerical values.
B: contains (A1, A2) pairs, but allows duplicates.
I need to efficiently map column D from A to B based on the (A1, A2) keys.
Currently, I’m using the following Pandas approach:
import pandas as pd

A = pd.DataFrame({
    'A1': [1, 2, 3],
    'A2': ['X', 'Y', 'Z'],
    'D': [10, 20, 30]
})
B = pd.DataFrame({
    'A1': [2, 3, 4, 2],
    'A2': ['Y', 'Z', 'W', 'Y'],
})

B = B.merge(A, how='left', on=['A1', 'A2'], suffixes=('', '_A'))
B.drop(columns=[col for col in B.columns if col.endswith('_A')], inplace=True)
print(B)
This gives the output of B filled with D:
A1 A2 D
0 2 Y 20.0
1 3 Z 30.0
2 4 W NaN
3 2 Y 20.0
Concerns:
I am looking for a faster way to achieve the same mapping than using merge. The output should retain all rows from B, filling in values from A where applicable. One drawback of the current approach is that I have to remove the unnecessary columns produced by the left join to keep the result compatible with my downstream code.
What I’ve Tried:
Using update(), but it doesn't work well with multi-key joins (see the sketch below).
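For context, here is a minimal sketch of the kind of update()-based attempt this presumably refers to (my reconstruction, not the asker's exact code). It illustrates the friction: update() aligns on the index, so both frames need (A1, A2) as a MultiIndex; it never creates new columns, so D has to be pre-created in B; and depending on the pandas version, the duplicated keys in B can make the alignment raise outright.

# Hypothetical update()-based attempt (illustrative reconstruction only)
B2 = B.set_index(['A1', 'A2'])
B2['D'] = float('nan')                 # update() will not create a new column
B2.update(A.set_index(['A1', 'A2']))   # fills D where keys match; recent pandas
                                       # may reject B2's duplicated MultiIndex
B2 = B2.reset_index()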
Question:
Is there a more efficient way to map D from A to B without unnecessary column operations?
1 Answer
Your question is not fully clear, and without specifics about your real dataset it's not easy to help you.
That said, in response to "I have downstream code that needs to remove all unnecessary columns, which is one of the main reasons to ask for a faster version":
You are performing a merge using all columns of A, only to immediately remove those that are common with B (except for the merging keys). You could therefore improve efficiency by pre-filtering A to keep only the keys and the columns unique to A:
keys = ['A1', 'A2']
# keep only the merge keys plus the columns that exist in A but not in B
A_keep = A.columns.difference(B.columns).union(keys)
out = B.merge(A[A_keep], how='left', on=keys)
Example input:
A = pd.DataFrame({'A1': [1, 1, 1, 2, 2, 2, 3, 3, 3],
                  'A2': [1, 2, 3, 1, 2, 3, 1, 2, 3],
                  'D':  range(9),
                  'w':  'wA',
                  'x':  'xA',  # we don't want to keep this column
                  'y':  'yA',  # we don't want to keep this column
                  })
B = pd.DataFrame({'A1': [1, 1, 1, 2, 2, 2, 3],
                  'A2': [1, 2, 3, 1, 1, 2, 4],
                  'x':  'xB',
                  'y':  'yB',
                  'z':  'zB',
                  })
Output, which is identical to that of your code:
A1 A2 x y z D w
0 1 1 xB yB zB 0.0 wA
1 1 2 xB yB zB 1.0 wA
2 1 3 xB yB zB 2.0 wA
3 2 1 xB yB zB 3.0 wA
4 2 1 xB yB zB 3.0 wA
5 2 2 xB yB zB 4.0 wA
6 3 4 xB yB zB NaN NaN
If you have many common columns, there can be a large gain in speed.
For instance, using 1M rows and a variable number of common columns:
[benchmark plot omitted: absolute timings of the merge with and without pre-filtering]
Relative to pre-filtering:
[benchmark plot omitted: timings relative to the pre-filtering approach]
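As a side note (my addition, not part of the original answer): if the goal is to avoid merge altogether, a common merge-free pattern is a lookup through a MultiIndex, sketched here on the question's original A and B:

# Merge-free mapping via a MultiIndex lookup (a sketch, not the answer's code)
keys = ['A1', 'A2']
d = A.set_index(keys)['D']                # Series keyed by the unique (A1, A2) pairs
B['D'] = B.set_index(keys).index.map(d)   # positional result; NaN where a key pair is absent from A

This adds exactly one column and leaves the rest of B untouched, so the column-dropping step disappears entirely; whether it actually beats the pre-filtered merge is worth benchmarking on the real data.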
Comments:
"merge is slow for large datasets" / "I am looking a faster way to achieve the same mapping other than using merge": this is subjective. How many rows/duplicates do you have? What are you considering slow? What makes you think there could be a faster approach? – mozway, Feb 13 at 8:37
Also, you are creating and then dropping the _A-suffixed columns, which doesn't seem useful with your example and could be avoided in a generic case by doing B.merge(A[['A1', 'A2', 'D']], ...), which should be quite a bit faster if you have many common columns. Is this relevant to this question? – mozway, Feb 13 at 8:39