Fastest way to map column from unique key dataframe to a duplicate-allowed dataframe

I have two DataFrames:

  • A: Contains unique (A1, A2) pairs and a column D with numerical values.
  • B: Contains (A1, A2) pairs, but allows duplicates.

I need to efficiently map column D from A to B based on the (A1, A2) keys.

Currently, I’m using the following Pandas approach:

import pandas as pd


A = pd.DataFrame({
    'A1': [1, 2, 3],
    'A2': ['X', 'Y', 'Z'],
    'D': [10, 20, 30]
})


B = pd.DataFrame({
    'A1': [2, 3, 4, 2],
    'A2': ['Y', 'Z', 'W', 'Y'],
})

# Left-join A onto B on the composite (A1, A2) key; the '_A' suffix marks
# any non-key columns duplicated from A so they can be dropped afterwards.
B = B.merge(A, how='left', on=['A1', 'A2'], suffixes=('', '_A'))
B.drop(columns=[col for col in B.columns if col.endswith('_A')], inplace=True)

print(B)

which gives B with D filled in:

   A1 A2     D
0   2  Y  20.0
1   3  Z  30.0
2   4  W   NaN
3   2  Y  20.0

Concerns:

I am looking for a faster way to achieve the same mapping without using merge. The output should retain all rows from B, filling in D from A where the keys match (NaN otherwise). A drawback of the current approach is that the left join brings in unnecessary columns that I then have to drop to keep the result compatible with my downstream code.

What I’ve Tried:

Using update(), but it doesn’t work well with multi-key joins.
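
For illustration, a sketch of the kind of workaround update() seems to need here (my attempt; the extra re-indexing steps are what makes it feel clumsy):

import numpy as np

# update() aligns on the index only, so both frames must first be indexed
# by the (A1, A2) keys, and B needs a D column before the call, because
# update() only fills columns that already exist.
B['D'] = np.nan
B_keyed = B.set_index(['A1', 'A2'])
B_keyed.update(A.set_index(['A1', 'A2'])['D'])
B = B_keyed.reset_index()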

Question:

Is there a faster way to map D from A to B without the extra column operations?


asked Feb 13 at 8:29 by hanugm; edited Feb 13 at 8:40
  • Please provide a minimal reproducible example. Ideally as code to generate representative DataFrames. It's hard to answer questions about efficiency without a clear description of the expected timings, current timings, etc. – mozway Commented Feb 13 at 8:32
  • @mozway I provided a toy example such that one can print and check B. – hanugm Commented Feb 13 at 8:35
  • Such a toy example isn't really useful. Your assumption is "merge is slow for large datasets" / "I am looking for a faster way to achieve the same mapping other than using merge". This is subjective. How many rows/duplicates do you have? What are you considering slow? What makes you think there could be a faster approach? – mozway Commented Feb 13 at 8:37
  • Also, you show a step to remove the _A columns, which doesn't seem useful with your example and could be avoided in a generic case by doing B.merge(A[['A1', 'A2', 'D']], ...), which should be quite a bit faster if you have many common columns. Is this relevant to this question? – mozway Commented Feb 13 at 8:39
  • I have downstream code that needs to remove all unnecessary columns, which is one of the main reasons to ask for a faster version. @mozway – hanugm Commented Feb 13 at 8:40

1 Answer


Your question is not fully clear, and without specifics about your real dataset it's not easy to help you.

That said, in response to "I have downstream code that needs to remove all unnecessary columns, which is one of the main reasons to ask for a faster version":

You are performing the merge with all columns of A, only to immediately drop those that are common with B (except for the merge keys).

You could therefore improve efficiency by pre-filtering A to only keep the keys and unique columns:

keys = ['A1', 'A2']
# Keep only the merge keys plus the columns that exist in A but not in B,
# so the merge produces no duplicated columns (and nothing to drop).
A_keep = A.columns.difference(B.columns).union(keys)

out = B.merge(A[A_keep], how='left', on=keys)

Example input:

A = pd.DataFrame({'A1': [1,1,1,2,2,2,3,3,3],
                  'A2': [1,2,3,1,2,3,1,2,3],
                  'D' : range(9),
                  'w' : 'wA',
                  'x' : 'xA',  # we don't want to keep this column
                  'y' : 'yA',  # we don't want to keep this column
                 })
B = pd.DataFrame({'A1': [1,1,1,2,2,2,3],
                  'A2': [1,2,3,1,1,2,4],
                  'x' : 'xB',
                  'y' : 'yB',
                  'z' : 'zB',
                 })

Output, which is identical to that of your code:

   A1  A2   x   y   z    D    w
0   1   1  xB  yB  zB  0.0   wA
1   1   2  xB  yB  zB  1.0   wA
2   1   3  xB  yB  zB  2.0   wA
3   2   1  xB  yB  zB  3.0   wA
4   2   1  xB  yB  zB  3.0   wA
5   2   2  xB  yB  zB  4.0   wA
6   3   4  xB  yB  zB  NaN  NaN

If you have many common columns, there can be a large gain in speed.

For instance, using 1M rows and a variable number of common columns:

[Benchmark plots not preserved: absolute timings, and the same timings relative to the pre-filtered merge.]
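
A rough sketch of that kind of benchmark (the sizes and the number of common columns below are arbitrary choices, not the exact setup behind the plots):

import numpy as np
import pandas as pd
from timeit import timeit

rng = np.random.default_rng(0)
n = 1_000_000

# A has unique (A1, A2) pairs; B draws random keys, duplicates allowed.
A = pd.DataFrame({'A1': np.arange(n), 'A2': np.arange(n), 'D': rng.random(n)})
B = pd.DataFrame({'A1': rng.integers(0, n, n), 'A2': rng.integers(0, n, n)})
for i in range(10):  # add 10 non-key columns common to both frames
    A[f'c{i}'] = 'A'
    B[f'c{i}'] = 'B'

keys = ['A1', 'A2']
A_keep = A.columns.difference(B.columns).union(keys)

t_full = timeit(lambda: B.merge(A, how='left', on=keys, suffixes=('', '_A')), number=3)
t_pre = timeit(lambda: B.merge(A[A_keep], how='left', on=keys), number=3)
print(f'full merge: {t_full / 3:.2f}s/run, pre-filtered: {t_pre / 3:.2f}s/run')

Finally, since you asked for a merge-free option: when D is the only column to bring over, a lookup through a MultiIndex is a common alternative (a minimal sketch, not benchmarked here):

# Build a Series keyed by (A1, A2), then map B's key pairs through it;
# unmatched pairs become NaN, all rows of B are kept, nothing to drop.
s = A.set_index(['A1', 'A2'])['D']
B['D'] = pd.MultiIndex.from_frame(B[['A1', 'A2']]).map(s)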
