admin管理员组

文章数量:1123798

This should be simple using pandas merge_asof function, but unfortunately it's not working because the function complains: ValueError: left keys must be sorted.

I want to assign the correct values of seniority of individuals to their achievements over time. I have a data frame with the achievements of 3,500 people over time. In total, there are > 75,000 achievements over a time period of 1970 to this year. Now, the individuals progress in seniority over time.

I want to match the achievements to their seniority.

Below, under the heading ACHIEVEMENTDATA is an example of data for two people. The relevant identifiers are identifier and achievement_year. achieve_count is the number of achievements by achievement_year for a person.

Now, I want a column (seniority) added to dfa based on achievement data over time from df (see PERSONALNDATA below), such that the rows in dfa reflect the proper seniority for each row in dfa. I manually did that below.

Note that some rows in df have duplicates identifier per year (A in 2015 and 2019) in these cases, rely on the row with the highest value of seniority.

PERSONALNDATA (df)
identifier  seniority   year
A   2   2009
A   3   2015
A   3   2015
A   4   2019
A   4   2019
A   4   2023
B   2   2012
B   4   2024


ACHIEVEMENTDATA (dfa):
identifier  achievement_year    achieve_count   seniority
A   2003    2   
A   2004    3   
A   2005    1   
A   2006    3   
A   2007    1   
A   2008    1   
A   2010    2   2
A   2011    2   2
A   2012    2   2
A   2013    4   2
A   2014    8   2
A   2015    4   3
A   2016    4   3
A   2017    4   3
A   2018    7   3
A   2019    4   4
A   2020    12  4
A   2021    8   4
A   2022    5   4
A   2023    7   4
A   2024    5   4
B   2007    1   
B   2009    1   
B   2010    2   
B   2011    1   
B   2012    2   2
B   2013    1   2
B   2014    1   2
B   2017    3   2
B   2019    1   2
B   2020    2   2
B   2021    1   2
B   2023    2   2
B   2024    2   4

This should be simple using pandas merge_asof function, but unfortunately it's not working because the function complains: ValueError: left keys must be sorted.

I want to assign the correct values of seniority of individuals to their achievements over time. I have a data frame with the achievements of 3,500 people over time. In total, there are > 75,000 achievements over a time period of 1970 to this year. Now, the individuals progress in seniority over time.

I want to match the achievements to their seniority.

Below, under the heading ACHIEVEMENTDATA is an example of data for two people. The relevant identifiers are identifier and achievement_year. achieve_count is the number of achievements by achievement_year for a person.

Now, I want a column (seniority) added to dfa based on achievement data over time from df (see PERSONALNDATA below), such that the rows in dfa reflect the proper seniority for each row in dfa. I manually did that below.

Note that some rows in df have duplicates identifier per year (A in 2015 and 2019) in these cases, rely on the row with the highest value of seniority.

PERSONALNDATA (df)
identifier  seniority   year
A   2   2009
A   3   2015
A   3   2015
A   4   2019
A   4   2019
A   4   2023
B   2   2012
B   4   2024


ACHIEVEMENTDATA (dfa):
identifier  achievement_year    achieve_count   seniority
A   2003    2   
A   2004    3   
A   2005    1   
A   2006    3   
A   2007    1   
A   2008    1   
A   2010    2   2
A   2011    2   2
A   2012    2   2
A   2013    4   2
A   2014    8   2
A   2015    4   3
A   2016    4   3
A   2017    4   3
A   2018    7   3
A   2019    4   4
A   2020    12  4
A   2021    8   4
A   2022    5   4
A   2023    7   4
A   2024    5   4
B   2007    1   
B   2009    1   
B   2010    2   
B   2011    1   
B   2012    2   2
B   2013    1   2
B   2014    1   2
B   2017    3   2
B   2019    1   2
B   2020    2   2
B   2021    1   2
B   2023    2   2
B   2024    2   4
Share Improve this question asked yesterday Martien LubberinkMartien Lubberink 2,7251 gold badge22 silver badges34 bronze badges
Add a comment  | 

2 Answers 2

Reset to default 0

the second dfa looks very unorganised, let us sort that first

df = df.sort_values(by=['identifier', 'year'])
dfa = dfa.sort_values(by=['identifier', 'achievement_year'])

looks like there are some dups too, lets try that too..

A   3   2015
A   3   2015
A   4   2019
A   4   2019

df = df.sort_values(by=['identifier', 'year']).drop_duplicates(subset=['identifier', 'year'], keep='last')

lets go with merge func

merged_df = pd.merge_asof(
    dfa,
    df,
    by='identifier',
    left_on='achievement_year',
    right_on='year',
    direction='backward'
)

add this to code and run, any error let me know..

See rule #5 in this link: Sort both dataframes according to the column listed for the ‘on’ parameter.

This is also a rather hidden rule given its necessity, especially if you data was sorted initially. Regardless, just like watching out for nonetypes before running a merge ensure you sort both data frames according to the on column and all should be well. This is actually a rule that merge_asof shares with merge_ordered() so ensure you sort before that kind of merge also. To find out more about merge_ordered() go here.

I had sorted the frames on 'identifier', and 'year'

本文标签: