I'm trying to implement some filtering logic in PySpark (for educational purposes).

I'm new to PySpark, so I decided to ask for help.

My dataframe looks like this:

ID     ApplicationDate  Loansum Company Decision
ID1    2020-06-01       100     B       Negative
ID1    2020-06-04       50      M       Positive
ID1    2020-06-05       50      M       Positive

ID1    2020-06-10       10      M       Positive

ID1    2020-06-15       60      B       Negative
ID1    2020-07-15       40      B       Positive
ID1    2020-06-22       20      M       Positive

ID1    2020-07-01       100     B       Negative
ID1    2020-07-02       40      B       Positive
ID1    2020-07-03       70      M       Positive

ID1    2020-08-01       100     B       Negative
ID1    2020-08-01       40      B       Positive
ID1    2020-08-02       100     M       Positive

ID2    2020-10-01       100     B       Negative
ID2    2020-10-04       50      M       Positive
ID2    2020-10-05       50      M       Positive

ID2    2020-10-10       10      M       Positive

ID2    2020-10-15       60      B       Negative
ID2    2020-10-15       40      B       Positive
ID2    2020-10-22       20      M       Positive

ID2    2020-10-01       100     B       Negative
ID2    2020-10-02       40      B       Positive
ID2    2020-10-03       70      M       Positive

My goal is to filter my dataframe so that, for each ID, I find and extract all the cases where:

  1. The ApplicationDate of the first loan issued by Company "B" and the ApplicationDates of the next nearest loans issued by Company "M" are no more than 5 days apart;
  2. The total Loansum of all loans with a "Positive" Decision is no more than 20% greater than the Loansum of the loan with the "Negative" Decision.

My expected result:

ID     ApplicationDate  Loansum Company Decision
ID1    2020-06-01       100     B       Negative
ID1    2020-06-04       50      M       Positive
ID1    2020-06-05       50      M       Positive

ID1    2020-07-01       100     B       Negative
ID1    2020-07-02       40      B       Positive
ID1    2020-07-03       70      M       Positive

ID2    2020-10-01       100     B       Negative
ID2    2020-10-04       50      M       Positive
ID2    2020-10-05       50      M       Positive

Any help is highly appreciated!
