Best way to push 2 GB/s of data to a Kafka topic via PySpark

I am trying to do load testing for the Kafka producer and consumer. For this, I have generated a single file of test data with a payload of 4 KB per record, since I want to achieve 2 GB/s throughput. I have used PySpark to load this data and push it to a Kafka topic.

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("Batch JSON read")
    .config("spark.jars.packages", "org.apache.spark:spark-sql-kafka-0-10_2.12:3.3.0")
    .getOrCreate()
)

df = spark.read.option("multiline", "true").json("/Users/Downloads/sample_data.json")

(df.selectExpr("to_json(struct(*)) AS value")
    .write.format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("topic", "test01")
    .save())

I can push the data, but I wanted suggestions on whether this is the correct way to push the data in "batches" if I go ahead with PySpark, or whether I should instead leverage the batch.size and linger.ms producer properties of Kafka via the CLI.

Once the approach is finalized, I will integrate it with JMeter for load testing.


1 Answer


tl;dr Don't reinvent the wheel; there is an existing benchmark framework for this: https://openmessaging.cloud/docs/benchmarks/


Firstly, Kafka has built-in producer performance test scripts (kafka-producer-perf-test)... But if you want speed on Spark, use Scala/Java rather than Python to skip the extra serialization between the Python process and the JVM, don't use JSON (keep your DataFrame values in binary), and maybe even upgrade to the latest Spark... Or just don't use Spark at all.
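For example, a sketch of a perf-test run; the topic and the 4 KB record size come from the question, while the record count and the batch.size/linger.ms values are arbitrary placeholders, not recommendations:

kafka-producer-perf-test.sh \
    --topic test01 \
    --num-records 500000 \
    --record-size 4096 \
    --throughput -1 \
    --producer-props bootstrap.servers=localhost:9092 batch.size=65536 linger.ms=10

Alternatively, you can replay your existing file through the console producer: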

kafka-console-producer ... < sample_data.json 

You'd speed this up by splitting your file apart and running one producer per core of your machine (and, once that's maxed out, doing the same on other machines to push Kafka even harder), as in the sketch below. Also, don't run Kafka on the same machine where the producers run, otherwise there will be IO contention.
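A rough sketch of that fan-out, assuming GNU split, an 8-core machine, and a remote broker (the chunk count, file names, and broker address are illustrative):

split -n l/8 sample_data.json part_   # 8 line-aligned chunks, one per core
for f in part_*; do
    kafka-console-producer --bootstrap-server broker1:9092 --topic test01 < "$f" &
done
wait   # block until every background producer has drained its chunk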

But otherwise, there are only so many parameters you can tweak on the producer config in Spark; producer properties are passed through as options prefixed with kafka., e.g. kafka.batch.size and kafka.linger.ms. Regarding batches: yes, .write is already one batch. Use a for loop to write multiple batches, or use a streaming DataFrame instead, though Flink would be the better fit if you have genuinely streaming datasets.
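A minimal sketch of that loop, assuming the spark-sql-kafka package from the question is on the classpath; the iteration count and the batch.size/linger.ms values are illustrative only:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Kafka load test").getOrCreate()

# Read once and cache, so each iteration re-sends the same payload
payload = (
    spark.read.option("multiline", "true")
    .json("/Users/Downloads/sample_data.json")
    .selectExpr("to_json(struct(*)) AS value")
    .cache()
)

for i in range(10):  # each .save() is one batch write; loop to send several
    (payload.write.format("kafka")
        .option("kafka.bootstrap.servers", "localhost:9092")
        .option("kafka.batch.size", "65536")  # forwarded to the producer (bytes)
        .option("kafka.linger.ms", "10")      # forwarded to the producer (ms)
        .option("topic", "test01")
        .save())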
