Best way to push 2 GB/s of data to a Kafka topic via PySpark

I am trying to do load testing for the Kafka producer and consumer. For this, I have generated a single file of test data with a payload of 4 KB per record, since I want to achieve 2 GB/s throughput. I have used PySpark to load this data and push it to a Kafka topic.

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("Batch JSON read")
    .config("spark.jars.packages", "org.apache.spark:spark-sql-kafka-0-10_2.12:3.3.0")
    .getOrCreate()
)

df = spark.read.option("multiline", "true").json("/Users/Downloads/sample_data.json")

(df.selectExpr("to_json(struct(*)) AS value")
    .write.format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("topic", "test01")
    .save())

I can push the data, but I wanted suggestions on whether this is the correct way to push the data in "batches" if I go ahead with PySpark, or whether I should instead leverage the batch.size and linger.ms producer properties of Kafka via the CLI.

Once the approach is finalized, I will integrate it with JMeter for load testing.


1 Answer


tl;dr Don't reinvent the wheel; there is an existing benchmark framework for this: https://openmessaging.cloud/docs/benchmarks/


Firstly, Kafka has built-in producer performance test scripts (kafka-producer-perf-test)... But if you want speed on Spark, use Scala/Java rather than Python to skip the extra serialization between the Python process and the JVM, don't use JSON (keep your DataFrame values in binary), and maybe even upgrade to the latest Spark... Or just don't use Spark at all.
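For example, a sketch of a perf-test run; the topic and the 4 KB record size come from the question, while the record count and the batch.size/linger.ms values are arbitrary placeholders, not recommendations:

kafka-producer-perf-test.sh \
    --topic test01 \
    --num-records 500000 \
    --record-size 4096 \
    --throughput -1 \
    --producer-props bootstrap.servers=localhost:9092 batch.size=65536 linger.ms=10

Alternatively, you can replay your existing file through the console producer: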

kafka-console-producer ... < sample_data.json 

You'd speed this up by splitting your file apart and running one producer per core of your machine (and, once that's maxed out, doing the same on other machines to push Kafka even harder), as in the sketch below. Also, don't run Kafka on the same machine where the producers run, otherwise there will be IO contention.
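A rough sketch of that fan-out, assuming GNU split, an 8-core machine, and a remote broker (the chunk count, file names, and broker address are illustrative):

split -n l/8 sample_data.json part_   # 8 line-aligned chunks, one per core
for f in part_*; do
    kafka-console-producer --bootstrap-server broker1:9092 --topic test01 < "$f" &
done
wait   # block until every background producer has drained its chunk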

But otherwise, there are only so many parameters you can tweak on the producer config in Spark; producer properties are passed through as options prefixed with kafka., e.g. kafka.batch.size and kafka.linger.ms. Regarding batches: yes, .write is already one batch. Use a for loop to write multiple batches, or use a streaming DataFrame instead, though Flink would be the better fit if you have genuinely streaming datasets.
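A minimal sketch of that loop, assuming the spark-sql-kafka package from the question is on the classpath; the iteration count and the batch.size/linger.ms values are illustrative only:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Kafka load test").getOrCreate()

# Read once and cache, so each iteration re-sends the same payload
payload = (
    spark.read.option("multiline", "true")
    .json("/Users/Downloads/sample_data.json")
    .selectExpr("to_json(struct(*)) AS value")
    .cache()
)

for i in range(10):  # each .save() is one batch write; loop to send several
    (payload.write.format("kafka")
        .option("kafka.bootstrap.servers", "localhost:9092")
        .option("kafka.batch.size", "65536")  # forwarded to the producer (bytes)
        .option("kafka.linger.ms", "10")      # forwarded to the producer (ms)
        .option("topic", "test01")
        .save())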
