Compare two PySpark DataFrames and append the results side by side
I have two pySpark DataFrames, need to compare those two DataFrames column wise and append the result next to it.
DF1:

| Claim_number | Claim_Status |
|---|---|
| 1001 | Closed |
| 1002 | In Progress |
| 1003 | open |
DF2:

| Claim_number | Claim_Status |
|---|---|
| 1001 | Closed |
| 1002 | open |
| 1004 | In Progress |
Expected result in PySpark:

DF3:

| Claim_number_DF1 | Claim_number_DF2 | Comparison_of_Claim_number | Claim_status_DF1 | Claim_status_DF2 | Comparison_of_Claim_Status |
|---|---|---|---|---|---|
| 1001 | 1001 | TRUE | Closed | Closed | TRUE |
| 1002 | 1002 | TRUE | In Progress | open | FALSE |
| 1003 | 1004 | FALSE | open | In Progress | FALSE |
- What is your actual question? What have you tried? – Andrew Commented Nov 20, 2024 at 15:57
- I want to compare the two DataFrames, and if the column values match it should populate True next to the column, and False if they do not match. – Srinivasan Commented Nov 20, 2024 at 15:59
- That's not a question, that's asking SO to write your code for you. – Andrew Commented Nov 20, 2024 at 16:06
- Sorry, I don't understand your question... – Srinivasan Commented Nov 20, 2024 at 16:08
- Unlike pandas DataFrames, PySpark DataFrames are not ordered, so the task is not doable unless a criterion is provided for which rows of each DataFrame should be compared. Simply saying "take the third row from df1 and compare it with the third row from df2" does not work, unfortunately; there is no "third row", at least not with large datasets spread across multiple partitions. – werner Commented Nov 20, 2024 at 18:24
1 Answer
DataFrames are not ordered; they are distributed across partitions, so this is an invalid ask as stated.
However, what you can do instead is the following:
- You can treat DF1 as the master DataFrame and join it with DF2 on Claim_number. If a claim number is missing from DF2, then depending on the join type you can either drop the row (inner join) or produce it with nulls (left outer join).
If that is what your ask is, then here is the solution:
```python
# The join column must be passed as a string (a bare Claim_number is a NameError).
final_df = df1.join(df2, "Claim_number", "inner").distinct()
```
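To get the side-by-side comparison columns from the question, a minimal sketch of the join-based approach follows. It assumes `df1` and `df2` are the DataFrames shown above; the `_r` renamed intermediates are names introduced here for illustration. Note that this aligns rows by Claim_number rather than by position, so 1003 and 1004 come out as two separate rows with nulls instead of the single FALSE row in the expected output.

```python
from pyspark.sql import functions as F

# Rename columns on each side so both can sit next to each other after the join.
df1_r = df1.select(
    F.col("Claim_number").alias("Claim_number_DF1"),
    F.col("Claim_Status").alias("Claim_status_DF1"),
)
df2_r = df2.select(
    F.col("Claim_number").alias("Claim_number_DF2"),
    F.col("Claim_Status").alias("Claim_status_DF2"),
)

# A full outer join keeps claims that exist on only one side (1003, 1004);
# eqNullSafe treats null == null as True and null == value as False,
# so the comparison columns never come out null.
df3 = (
    df1_r.join(
        df2_r,
        df1_r["Claim_number_DF1"] == df2_r["Claim_number_DF2"],
        "full_outer",
    )
    .withColumn(
        "Comparison_of_Claim_number",
        F.col("Claim_number_DF1").eqNullSafe(F.col("Claim_number_DF2")),
    )
    .withColumn(
        "Comparison_of_Claim_Status",
        F.col("Claim_status_DF1").eqNullSafe(F.col("Claim_status_DF2")),
    )
    .select(
        "Claim_number_DF1",
        "Claim_number_DF2",
        "Comparison_of_Claim_number",
        "Claim_status_DF1",
        "Claim_status_DF2",
        "Comparison_of_Claim_Status",
    )
)
df3.show()
```

If you truly need the positional comparison from the expected output, you would have to add an explicit ordering column to both DataFrames before comparing; as noted in the comments, without one there is no reproducible "third row" in a distributed DataFrame.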