I am trying to do load testing for the Kafka producer and consumer. For this reason, I have generated test data in a single file with a payload of 4 KB per record. I have used PySpark to load this data and push it to the Kafka topic, since I want to achieve 2 GBps throughput.
from pyspark.sql.session import SparkSession

# Spark session with the Kafka sink package on the classpath
spark = SparkSession \
    .builder \
    .appName("Batch JSON read") \
    .config("spark.jars.packages", "org.apache.spark:spark-sql-kafka-0-10_2.12:3.3.0") \
    .getOrCreate()

# Read the multiline JSON file and write each row as a JSON string to the topic
df = spark.read.option("multiline", "true").json("/Users/Downloads/sample_data.json")
df.selectExpr("to_json(struct(*)) AS value") \
    .write.format("kafka") \
    .option("kafka.bootstrap.servers", "localhost:9092") \
    .option("topic", "test01") \
    .save()
I can push the data, but I wanted some suggestions on whether this is the correct way to push the data in "batches" if I go ahead with PySpark, or whether I should instead leverage the batch.size or linger.ms producer properties of Kafka via the CLI.
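For example, I assume the CLI route would look roughly like this (untested sketch; the broker, topic, and values are placeholders, and the console producer sends one record per line, so the file would need one JSON document per line):

kafka-console-producer --bootstrap-server localhost:9092 --topic test01 \
    --producer-property batch.size=65536 \
    --producer-property linger.ms=50 < /Users/Downloads/sample_data.json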
Once the approach is fixed, I will integrate it with JMeter for load testing.
1 Answer
tl;dr Don't reinvent the wheel: https://openmessaging.cloud/docs/benchmarks/
Firstly, Kafka has built-in producer-perf-test scripts... But if you want speed with Spark, use Scala/Java rather than Python to avoid the Python-to-JVM serialization overhead, don't use JSON (keep your DataFrame values in binary), and maybe even upgrade to the latest Spark... Or just don't use Spark.
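A sketch of the built-in perf test, assuming a recent Kafka distribution (the record size matches your 4 KB payload; the record count and producer settings are illustrative):

kafka-producer-perf-test.sh --topic test01 \
    --num-records 5000000 --record-size 4096 \
    --throughput -1 \
    --producer-props bootstrap.servers=localhost:9092 batch.size=65536 linger.ms=50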
kafka-console-producer ... < sample_data.json
You'd speed this up by splitting your file apart and running one producer per core of your machine (and, once that's maxed out, doing the same on other machines to push Kafka even harder). Also, don't run Kafka on the same machine as the producer, otherwise there will be I/O contention.
Otherwise, there are only so many parameters you can tweak on the producer config in Spark. But regarding batches: yes, .write is already one batch. Use a for loop to write multiple batches, or use a streaming DataFrame instead, though Flink would be a better fit if you have streaming datasets.
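For the for-loop idea, a minimal sketch reusing the DataFrame from the question (Spark's Kafka sink forwards kafka.-prefixed options to the underlying producer; the iteration count and settings are illustrative):

# Re-send the same payload several times to sustain load on the brokers
payload = df.selectExpr("to_json(struct(*)) AS value")
for _ in range(10):
    payload.write.format("kafka") \
        .option("kafka.bootstrap.servers", "localhost:9092") \
        .option("kafka.batch.size", "65536") \
        .option("kafka.linger.ms", "50") \
        .option("topic", "test01") \
        .save()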