I'm trying to implement some filtering procedures in PySpark (for educational purposes).
I'm new to PySpark, so I decided to ask for help.
My dataframe looks like this:
ID ApplicationDate Loansum Company Decision
ID1 2020-06-01 100 B Negative
ID1 2020-06-04 50 M Positive
ID1 2020-06-05 50 M Positive
ID1 2020-06-10 10 M Positive
ID1 2020-06-15 60 B Negative
ID1 2020-07-15 40 B Positive
ID1 2020-06-22 20 M Positive
ID1 2020-07-01 100 B Negative
ID1 2020-07-02 40 B Positive
ID1 2020-07-03 70 M Positive
ID1 2020-08-01 100 B Negative
ID1 2020-08-01 40 B Positive
ID1 2020-08-02 100 M Positive
ID2 2020-10-01 100 B Negative
ID2 2020-10-04 50 M Positive
ID2 2020-10-05 50 M Positive
ID2 2020-10-10 10 M Positive
ID2 2020-10-15 60 B Negative
ID2 2020-10-15 40 B Positive
ID2 2020-10-22 20 M Positive
ID2 2020-10-01 100 B Negative
ID2 2020-10-02 40 B Positive
ID2 2020-10-03 70 M Positive
My goal is to filter my dataframe in such a way that, for each ID, I find and extract all the cases where:
- the gap in ApplicationDate between the first loan issued by Company "B" and the next nearest loans issued by Company "M" does not exceed 5 days;
- the Loansums of all "Positive" issued loans are not more than 20% above the Loansum of the loan with the "Negative" Decision.
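To make sure I understand the first rule correctly, here is a minimal Spark-free sketch of it in plain Python (the helper name `within_window` is mine, not from any library), checked against the first group of ID1's rows:

```python
from datetime import date

# Sketch of the 5-day rule: starting from a loan issued by company "B",
# keep a subsequent "M" loan only if its ApplicationDate falls no more
# than max_days after the "B" loan's date.
def within_window(b_date, m_date, max_days=5):
    """Return True if m_date is 0..max_days days after b_date."""
    return 0 <= (m_date - b_date).days <= max_days

# Sample rows taken from the question (ID1, first group).
b_loan = date(2020, 6, 1)   # B / Negative / 100
m_loans = [date(2020, 6, 4), date(2020, 6, 5), date(2020, 6, 10)]

kept = [d for d in m_loans if within_window(b_loan, d)]
# 2020-06-04 and 2020-06-05 pass; 2020-06-10 is 9 days later and is
# dropped, matching the expected output below.
```

In actual PySpark this condition would presumably be expressed with something like `datediff` over the relevant pair of dates, but the windowing over groups is where I'm stuck.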
My expected result:
ID ApplicationDate Loansum Company Decision
ID1 2020-06-01 100 B Negative
ID1 2020-06-04 50 M Positive
ID1 2020-06-05 50 M Positive
ID1 2020-07-01 100 B Negative
ID1 2020-07-02 40 B Positive
ID1 2020-07-03 70 M Positive
ID2 2020-10-01 100 B Negative
ID2 2020-10-04 50 M Positive
ID2 2020-10-05 50 M Positive
Any help is highly appreciated!