python - handling large tgz with pcap in pyspark - ValueError: can not serialize object larger than 2G
I have a PySpark-based pipeline that uses spark.read.format("binaryFile") to decompress tgz files and handle the pcap file inside (exploding into packets, etc.). The code that handles the tar, the pcap, and the individual packets is written in pure Python and integrated as a User Defined Function.
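A minimal sketch of that pattern (illustrative only, not the original code; the folder path and column handling are assumptions): binaryFile produces one row per archive with the raw bytes in the content column, a UDF unpacks the tgz in memory and returns the pcap payloads, and explode turns them into one row per pcap.

import io
import tarfile

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, explode, udf
from pyspark.sql.types import ArrayType, BinaryType

spark = SparkSession.builder.getOrCreate()

@udf(returnType=ArrayType(BinaryType()))
def extract_pcaps(content):
    # Unpack the in-memory tgz and collect the raw pcap payloads it contains.
    pcaps = []
    with tarfile.open(fileobj=io.BytesIO(content), mode="r:gz") as tar:
        for member in tar.getmembers():
            if member.isfile() and member.name.endswith(".pcap"):
                pcaps.append(tar.extractfile(member).read())
    return pcaps

df = (spark.read.format("binaryFile")
      .option("pathGlobFilter", "*.pcap.tgz")
      .load("file:///data/captures"))  # hypothetical folder

pcap_df = df.select(col("path"), explode(extract_pcaps(col("content"))).alias("pcap_bytes"))

With this layout a single pcap larger than 2 GB has to pass through the Python serializer as one object, which is presumably where the limit is hit.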
This pipeline works fine, but files that contain a pcap larger than 2 GB yield ValueError: can not serialize object larger than 2G.
Is there any way to overcome this?
I would like to keep
self.unzipped: DataFrame = spark.read.format("binaryFile") \
    .option("pathGlobFilter", "*.pcap.tgz") \
    .option("compression", "gzip") \
    .load(folder)
Because of the abstraction layer over the file source, it works with file://, hdfs://, and others such as Azure (abfss://) if you add the required dependencies.
If that is not possible, what are the alternatives?
- Since this is an error in the Python serializer: would a Scala or R implementation avoid it?
- If decompressing on the driver (with pure Python code, and building the first DataFrame from chunks of packets from the pcap): how do I read the file in a similarly protocol-agnostic way? I would need at least file:// and abfss:// (see the sketch after this list).
- Any other ideas?
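For the driver-side option above, a hedged sketch (not from the post; paths and helper names are placeholders): Spark's underlying Hadoop FileSystem API is reachable through the JVM gateway, so one piece of code accepts file://, hdfs:// and abfss:// URIs as long as the matching connectors and credentials are already configured. The file is copied to the driver's local disk and can then be streamed with pure Python, so no single object larger than 2 GB goes through the serializer.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
jvm = spark.sparkContext._jvm
hadoop_conf = spark.sparkContext._jsc.hadoopConfiguration()

def fetch_to_local(uri, local_path):
    # Copy a remote (or local) file to the driver's local disk via the Hadoop FileSystem API.
    src = jvm.org.apache.hadoop.fs.Path(uri)
    dst = jvm.org.apache.hadoop.fs.Path("file://" + local_path)
    fs = src.getFileSystem(hadoop_conf)        # resolves the FS implementation from the URI scheme
    fs.copyToLocalFile(False, src, dst, True)  # delSrc=False, useRawLocalFileSystem=True
    return local_path

# Hypothetical usage: afterwards the tgz is streamed with plain tarfile.
# import tarfile
# local = fetch_to_local("abfss://container@account.dfs.core.windows.net/captures/a.pcap.tgz",
#                        "/tmp/a.pcap.tgz")
# with tarfile.open(local, mode="r:gz") as tar:
#     for member in tar:
#         ...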
Update:
I am using PySpark 3.5.1.
Source that raises the error: .py#L160