I’m working with stateful stream processing in PySpark, specifically using applyInPandasWithState, mapGroups… etc. My function processes data while maintaining state, and it also handles timeouts (e.g. GroupStateTimeout.ProcessingTimeTimeout).

I want to understand best practices for unit testing such functions (using pytest or unittest): should I mock the Spark/GroupState behaviour completely or use an actual Spark session, and how would I go about testing timeouts in either case?

Initially, I decided to mock Spark’s behaviour completely, to have full control over the tests. This allowed me to test the outcome when data is received in a specific order. However, I am now struggling to mock timeout behaviour properly, and I’m unsure whether my current mock-based approach is too far from real production behaviour.


asked Mar 12 at 21:07 by user22144809
  • Please share any relevant code you've written! For tips on asking effective questions, check out: How-to-ask – Yu Wei Liu Commented Mar 13 at 8:24

1 Answer


There are plenty of examples of how to unit test a Spark app.
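For the mock-based route the question describes, one option is a small hand-written fake instead of a generic mock. The sketch below is only an illustration under several assumptions: `FakeGroupState`, `count_events` and `TIMEOUT_MS` are hypothetical names, the fake mirrors only the subset of PySpark's `GroupState` interface that the function touches (`exists`, `get`, `hasTimedOut`, `update`, `remove`, `setTimeoutDuration`, `getCurrentProcessingTimeMs`), and the core logic is kept free of pandas so it runs without Spark. A thin wrapper that converts pandas DataFrames to and from this logic would still be needed for `applyInPandasWithState`. A timeout is simulated the way Spark delivers it: the function is called with no input and `hasTimedOut` set to true.

```python
from typing import Iterable, Iterator, Optional, Tuple


class FakeGroupState:
    """Hypothetical test double; not part of PySpark."""

    def __init__(self, value: Optional[Tuple] = None, now_ms: int = 0,
                 timed_out: bool = False):
        self._value = value
        self._now_ms = now_ms
        self.hasTimedOut = timed_out
        # Test-only attribute: the processing time at which the timeout would fire.
        self.timeout_at_ms: Optional[int] = None

    @property
    def exists(self) -> bool:
        return self._value is not None

    @property
    def get(self) -> Tuple:
        if self._value is None:
            raise ValueError("state does not exist")
        return self._value

    def update(self, new_value: Tuple) -> None:
        self._value = new_value

    def remove(self) -> None:
        self._value = None

    def setTimeoutDuration(self, duration_ms: int) -> None:
        self.timeout_at_ms = self._now_ms + duration_ms

    def getCurrentProcessingTimeMs(self) -> int:
        return self._now_ms


TIMEOUT_MS = 10_000  # assumed timeout, chosen only for this example


def count_events(key: Tuple, batches: Iterable[list],
                 state) -> Iterator[Tuple[str, int, bool]]:
    """Example stateful logic: yields (key, running_count, expired)."""
    if state.hasTimedOut:
        # Timeout call: no new data arrives, so emit the final count and drop the state.
        (count,) = state.get
        state.remove()
        yield (key[0], count, True)
        return
    count = state.get[0] if state.exists else 0
    for batch in batches:
        count += len(batch)
    state.update((count,))
    state.setTimeoutDuration(TIMEOUT_MS)
    yield (key[0], count, False)


# Normal call: two batches for key "a" arrive at processing time 1000 ms.
state = FakeGroupState(now_ms=1_000)
first = list(count_events(("a",), [[1, 2], [3]], state))

# Timeout call: same state value, later clock, empty input, hasTimedOut=True.
expired_state = FakeGroupState(value=state.get, now_ms=20_000, timed_out=True)
last = list(count_events(("a",), [], expired_state))
```

Because the fake records `timeout_at_ms`, a test can assert both what the function emitted and when it asked to be woken up. That covers the function's own logic only; whether Spark actually fires the timeout as expected is better checked separately against a real local Spark session.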

Tags: python, Unit testing stateful streaming processing functions, Stack Overflow