This should be simple using the pandas merge_asof function, but unfortunately it's not working because the function complains: ValueError: left keys must be sorted.
I want to assign each individual's seniority to their achievements over time. I have a data frame with the achievements of 3,500 people: more than 75,000 achievements in total, covering 1970 to this year. The individuals progress in seniority over time, and I want to match each achievement to the seniority that applied at the time.
Below, under the heading ACHIEVEMENTDATA, is an example of data for two people. The relevant identifiers are identifier and achievement_year. achieve_count is the number of achievements by achievement_year for a person.
Now, I want a column (seniority) added to dfa based on the seniority data over time from df (see PERSONALNDATA below), such that each row in dfa reflects the proper seniority. I did that manually below.
Note that df has some duplicate rows per identifier and year (A in 2015 and 2019). In these cases, rely on the row with the highest value of seniority.
PERSONALNDATA (df)
identifier seniority year
A 2 2009
A 3 2015
A 3 2015
A 4 2019
A 4 2019
A 4 2023
B 2 2012
B 4 2024
ACHIEVEMENTDATA (dfa):
identifier achievement_year achieve_count seniority
A 2003 2
A 2004 3
A 2005 1
A 2006 3
A 2007 1
A 2008 1
A 2010 2 2
A 2011 2 2
A 2012 2 2
A 2013 4 2
A 2014 8 2
A 2015 4 3
A 2016 4 3
A 2017 4 3
A 2018 7 3
A 2019 4 4
A 2020 12 4
A 2021 8 4
A 2022 5 4
A 2023 7 4
A 2024 5 4
B 2007 1
B 2009 1
B 2010 2
B 2011 1
B 2012 2 2
B 2013 1 2
B 2014 1 2
B 2017 3 2
B 2019 1 2
B 2020 2 2
B 2021 1 2
B 2023 2 2
B 2024 2 4
Asked yesterday by Martien Lubberink
2 Answers
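The second frame, dfa, looks unsorted, so let's sort both frames first. Note that merge_asof checks the on key itself (achievement_year on the left, year on the right), so sort by that column rather than by identifier first:
    df = df.sort_values(by='year')
    dfa = dfa.sort_values(by='achievement_year')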
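It looks like there are some duplicates too, so let's handle those as well:
A 3 2015
A 3 2015
A 4 2019
A 4 2019
Sorting by year and then seniority means keep='last' retains the highest seniority for each identifier and year, and leaves the year column sorted for merge_asof:
    df = df.sort_values(by=['year', 'seniority']).drop_duplicates(subset=['identifier', 'year'], keep='last')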
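An alternative sketch for the duplicate rule from the question (keep the highest seniority per identifier and year) uses groupby with max instead of sort-and-drop. The sample frame below is hypothetical: the conflicting 2015 rows are invented for illustration and do not appear in the original data, and the name df_unique is only illustrative.

```python
import pandas as pd

# Hypothetical sample: person A with an invented conflicting duplicate in 2015
# (seniority 2 and 3), to show that the higher value is the one retained.
df = pd.DataFrame({
    "identifier": ["A", "A", "A", "A"],
    "seniority": [2, 2, 3, 4],
    "year": [2009, 2015, 2015, 2019],
})

# Collapse to one row per (identifier, year), keeping the highest seniority,
# then sort by year so the frame is ready to be the right side of merge_asof.
df_unique = (
    df.groupby(["identifier", "year"], as_index=False)["seniority"].max()
      .sort_values("year")
      .reset_index(drop=True)
)
print(df_unique)
```

This version does not depend on the order of the rows before deduplication, which may be easier to reason about if other columns are added later.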
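Then let's go with merge_asof:
    import pandas as pd

    # For each row of dfa, take the latest df row of the same identifier
    # whose year is <= achievement_year.
    merged_df = pd.merge_asof(
        dfa,
        df,
        by='identifier',
        left_on='achievement_year',
        right_on='year',
        direction='backward'
    )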
Add this to your code and run it. If you hit any errors, let me know.
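For reference, here is a minimal end-to-end sketch on a subset of the sample data from the question. It assumes both frames are sorted by their year column alone (not by identifier first), and the names df_sorted, dfa_sorted, merged and result are only illustrative.

```python
import pandas as pd

# PERSONALNDATA (df) exactly as listed in the question.
df = pd.DataFrame({
    "identifier": ["A", "A", "A", "A", "A", "A", "B", "B"],
    "seniority": [2, 3, 3, 4, 4, 4, 2, 4],
    "year": [2009, 2015, 2015, 2019, 2019, 2023, 2012, 2024],
})

# A subset of ACHIEVEMENTDATA (dfa) rows from the question, without seniority.
dfa = pd.DataFrame({
    "identifier": ["A"] * 6 + ["B"] * 4,
    "achievement_year": [2008, 2010, 2015, 2018, 2019, 2024,
                         2011, 2012, 2023, 2024],
    "achieve_count": [1, 2, 4, 7, 4, 5, 1, 2, 2, 2],
})

# Right side: highest seniority per (identifier, year), sorted by year.
df_sorted = (
    df.sort_values(["year", "seniority"])
      .drop_duplicates(subset=["identifier", "year"], keep="last")
)

# Left side: sorted by the on key only; 'by' handles the grouping.
dfa_sorted = dfa.sort_values("achievement_year")

merged = pd.merge_asof(
    dfa_sorted,
    df_sorted,
    left_on="achievement_year",
    right_on="year",
    by="identifier",
    direction="backward",
)

# Restore a per-person ordering for display and drop the helper column.
result = (
    merged.drop(columns="year")
          .sort_values(["identifier", "achievement_year"])
          .reset_index(drop=True)
)
print(result)
```

Achievements that predate a person's first seniority record (A in 2008 and B in 2011 here) come back with NaN in seniority, which matches the blank cells in the manually filled table.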
See rule #5 in this link: Sort both dataframes according to the column listed for the ‘on’ parameter.
This is a rather hidden rule given how necessary it is, especially if your data was sorted initially. Regardless, just as you watch out for NoneTypes before running a merge, ensure you sort both data frames according to the on column and all should be well. merge_asof shares this rule with merge_ordered(), so sort before that kind of merge as well. To find out more about merge_ordered(), go here.
I had sorted the frames on 'identifier' and 'year', so the year column was not sorted across the whole frame, which is what merge_asof checks.
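The difference between the two sort orders can be reproduced on a tiny example. This is only a sketch: the frames left and right and the helper try_merge are hypothetical names made up for the illustration.

```python
import pandas as pd

# Tiny hypothetical frames: two people whose year ranges overlap.
left = pd.DataFrame({
    "identifier": ["A", "A", "B", "B"],
    "achievement_year": [2010, 2020, 2007, 2013],
})
right = pd.DataFrame({
    "identifier": ["A", "B"],
    "year": [2009, 2012],
    "seniority": [2, 2],
})


def try_merge(left_frame):
    """Return 'ok' if merge_asof succeeds, otherwise the error message."""
    try:
        pd.merge_asof(
            left_frame,
            right,
            left_on="achievement_year",
            right_on="year",
            by="identifier",
        )
        return "ok"
    except ValueError as exc:
        return str(exc)


# Sorted within each identifier: the year column runs 2010, 2020, 2007, 2013,
# so it is not sorted across the whole frame.
grouped = try_merge(left.sort_values(["identifier", "achievement_year"]))

# Sorted by the on key alone: 2007, 2010, 2013, 2020.
by_key = try_merge(left.sort_values("achievement_year"))

print("sorted by identifier, then year:", grouped)
print("sorted by year only:", by_key)
```

The by argument takes care of matching within each identifier, so the identifier column itself does not need to be part of the sort.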