Fastest way to map column from unique key dataframe to a duplicate-allowed dataframe

I have two DataFrames:

  • A: Contains unique (A1, A2) pairs and a column D with numerical values.
  • B: Contains (A1, A2) pairs, but allows duplicates.

I need to efficiently map column D from A to B based on the (A1, A2) keys.

Currently, I’m using the following Pandas approach:

import pandas as pd


A = pd.DataFrame({
    'A1': [1, 2, 3],
    'A2': ['X', 'Y', 'Z'],
    'D': [10, 20, 30]
})


B = pd.DataFrame({
    'A1': [2, 3, 4, 2],
    'A2': ['Y', 'Z', 'W', 'Y'],
})

# Left-join A onto B on the composite (A1, A2) key; the '_A' suffix marks
# any non-key columns duplicated from A so they can be dropped afterwards.
B = B.merge(A, how='left', on=['A1', 'A2'], suffixes=('', '_A'))
B.drop(columns=[col for col in B.columns if col.endswith('_A')], inplace=True)

print(B)

which gives B with D filled in:

   A1 A2     D
0   2  Y  20.0
1   3  Z  30.0
2   4  W   NaN
3   2  Y  20.0

Concerns:

I am looking for a faster way to achieve the same mapping without using merge. The output should retain all rows from B, filling in D from A where the keys match (NaN otherwise). A drawback of the current approach is that the left join brings in unnecessary columns that I then have to drop to keep the result compatible with my downstream code.

What I’ve Tried:

Using update(), but it doesn’t work well with multi-key joins.
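
For illustration, a sketch of the kind of workaround update() seems to need here (my attempt; the extra re-indexing steps are what makes it feel clumsy):

import numpy as np

# update() aligns on the index only, so both frames must first be indexed
# by the (A1, A2) keys, and B needs a D column before the call, because
# update() only fills columns that already exist.
B['D'] = np.nan
B_keyed = B.set_index(['A1', 'A2'])
B_keyed.update(A.set_index(['A1', 'A2'])['D'])
B = B_keyed.reset_index()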

Question:

Is there a faster way to map D from A to B without the extra column operations?


asked Feb 13 at 8:29 by hanugm; edited Feb 13 at 8:40
  • Please provide a minimal reproducible example. Ideally as code to generate representative DataFrames. It's hard to answer questions about efficiency without a clear description of the expected timings, current timings, etc. – mozway Commented Feb 13 at 8:32
  • @mozway I provided a toy example such that one can print and check B. – hanugm Commented Feb 13 at 8:35
  • Such a toy example isn't really useful. Your assumption is "merge is slow for large datasets" / "I am looking for a faster way to achieve the same mapping other than using merge". This is subjective. How many rows/duplicates do you have? What are you considering slow? What makes you think there could be a faster approach? – mozway Commented Feb 13 at 8:37
  • Also, you show a step to remove the _A columns, which doesn't seem useful with your example and could be avoided in a generic case by doing B.merge(A[['A1', 'A2', 'D']], ...), which should be quite a bit faster if you have many common columns. Is this relevant to this question? – mozway Commented Feb 13 at 8:39
  • I have downstream code that needs to remove all unnecessary columns, which is one of the main reasons to ask for a faster version. @mozway – hanugm Commented Feb 13 at 8:40

1 Answer


Your question is not fully clear, and without specifics about your real dataset it's not easy to help you.

That said, in response to "I have downstream code that needs to remove all unnecessary columns, which is one of the main reasons to ask for a faster version":

You are performing the merge with all columns of A, only to immediately drop those that are common with B (except for the merge keys).

You could therefore improve efficiency by pre-filtering A to only keep the keys and unique columns:

keys = ['A1', 'A2']
# Keep only the merge keys plus the columns that exist in A but not in B,
# so the merge produces no duplicated columns (and nothing to drop).
A_keep = A.columns.difference(B.columns).union(keys)

out = B.merge(A[A_keep], how='left', on=keys)

Example input:

A = pd.DataFrame({'A1': [1,1,1,2,2,2,3,3,3],
                  'A2': [1,2,3,1,2,3,1,2,3],
                  'D' : range(9),
                  'w' : 'wA',
                  'x' : 'xA',  # we don't want to keep this column
                  'y' : 'yA',  # we don't want to keep this column
                 })
B = pd.DataFrame({'A1': [1,1,1,2,2,2,3],
                  'A2': [1,2,3,1,1,2,4],
                  'x' : 'xB',
                  'y' : 'yB',
                  'z' : 'zB',
                 })

Output, which is identical to that of your code:

   A1  A2   x   y   z    D    w
0   1   1  xB  yB  zB  0.0   wA
1   1   2  xB  yB  zB  1.0   wA
2   1   3  xB  yB  zB  2.0   wA
3   2   1  xB  yB  zB  3.0   wA
4   2   1  xB  yB  zB  3.0   wA
5   2   2  xB  yB  zB  4.0   wA
6   3   4  xB  yB  zB  NaN  NaN

If you have many common columns, there can be a large gain in speed.

For instance, using 1M rows and a variable number of common columns:

[Benchmark plots not preserved: absolute timings, and the same timings relative to the pre-filtered merge.]
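
A rough sketch of that kind of benchmark (the sizes and the number of common columns below are arbitrary choices, not the exact setup behind the plots):

import numpy as np
import pandas as pd
from timeit import timeit

rng = np.random.default_rng(0)
n = 1_000_000

# A has unique (A1, A2) pairs; B draws random keys, duplicates allowed.
A = pd.DataFrame({'A1': np.arange(n), 'A2': np.arange(n), 'D': rng.random(n)})
B = pd.DataFrame({'A1': rng.integers(0, n, n), 'A2': rng.integers(0, n, n)})
for i in range(10):  # add 10 non-key columns common to both frames
    A[f'c{i}'] = 'A'
    B[f'c{i}'] = 'B'

keys = ['A1', 'A2']
A_keep = A.columns.difference(B.columns).union(keys)

t_full = timeit(lambda: B.merge(A, how='left', on=keys, suffixes=('', '_A')), number=3)
t_pre = timeit(lambda: B.merge(A[A_keep], how='left', on=keys), number=3)
print(f'full merge: {t_full / 3:.2f}s/run, pre-filtered: {t_pre / 3:.2f}s/run')

Finally, since you asked for a merge-free option: when D is the only column to bring over, a lookup through a MultiIndex is a common alternative (a minimal sketch, not benchmarked here):

# Build a Series keyed by (A1, A2), then map B's key pairs through it;
# unmatched pairs become NaN, all rows of B are kept, nothing to drop.
s = A.set_index(['A1', 'A2'])['D']
B['D'] = pd.MultiIndex.from_frame(B[['A1', 'A2']]).map(s)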
