I have a PySpark-based pipeline that uses spark.read.format("binaryFile") to read tgz files, decompress them, and handle the pcap file inside (exploding it into packets, etc.). The code that handles the tar, pcap, and individual packets is written in pure Python and integrated as a User Defined Function (UDF).
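
Roughly, the UDF side is wired like this (a simplified sketch, not my exact code: the schema, function name, and the classic-pcap-only parsing are placeholders):

import io
import struct
import tarfile

from pyspark.sql import functions as F
from pyspark.sql import types as T

# One element per packet after the explode (simplified; the real code carries more fields).
packet_schema = T.ArrayType(T.StructType([
    T.StructField("ts_sec", T.LongType()),
    T.StructField("ts_usec", T.LongType()),
    T.StructField("data", T.BinaryType()),
]))

@F.udf(returnType=packet_schema)
def tgz_to_packets(content):
    """Unpack a .pcap.tgz blob and return a list of (ts_sec, ts_usec, raw packet bytes)."""
    packets = []
    with tarfile.open(fileobj=io.BytesIO(content), mode="r:gz") as tar:
        for member in tar.getmembers():
            if not member.name.endswith(".pcap"):
                continue
            fobj = tar.extractfile(member)
            if fobj is None:
                continue
            raw = fobj.read()
            # Classic pcap: 24-byte global header; the magic number gives the byte order.
            endian = "<" if raw[:4] == b"\xd4\xc3\xb2\xa1" else ">"
            offset = 24
            while offset + 16 <= len(raw):
                ts_sec, ts_usec, incl_len, _orig_len = struct.unpack(
                    endian + "IIII", raw[offset:offset + 16])
                offset += 16
                packets.append((ts_sec, ts_usec, bytearray(raw[offset:offset + incl_len])))
                offset += incl_len
    return packets

# Applied to the binaryFile DataFrame shown below:
# exploded = (self.unzipped
#             .withColumn("packet", F.explode(tgz_to_packets("content")))
#             .select("path", "packet.*"))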

This pipeline works fine, but files that contain a pcap file larger than 2 GB raise ValueError: can not serialize object larger than 2G.

Is there any way to overcome this?

I would like to keep:

self.unzipped: DataFrame = spark.read.format("binaryFile")\
    .option("pathGlobFilter", "*.pcap.tgz")\
    .option("compression", "gzip")\
    .load(folder)

Because of the abstraction layer over the file source, this works with file://, hadoop://, and others like Azure (abfss://), provided you add the dependencies.
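
(For the abfss:// case, the extra dependency and auth I mean is roughly the following kind of setup; the artifact version, account name, and the account-key mechanism are only placeholders:)

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("pcap-pipeline")
         # hadoop-azure supplies the abfss:// filesystem implementation
         .config("spark.jars.packages", "org.apache.hadoop:hadoop-azure:3.3.6")
         # one possible auth mechanism: a storage account key, passed through to the Hadoop conf
         .config("spark.hadoop.fs.azure.account.key.<account>.dfs.core.windows.net", "<key>")
         .getOrCreate())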

If that is not possible, what are the alternatives?

  • Since this is an error in the Python serializer, would a Scala or R implementation work?
  • If I decompress on the driver (with pure Python code, creating the first DataFrame from chunks of packets from the pcap), how can I read the file in a way that similarly accepts different protocols in the path (I would need file:// and abfss://)? A rough sketch of what I mean follows this list.
  • Any other ideas?
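
To make the second bullet concrete, this is the kind of driver-side read I have in mind, going through Spark's Hadoop FileSystem classes via the JVM gateway so the same URI schemes keep working (paths here are placeholders, and this is only a sketch, not something I have in production):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
jvm = spark.sparkContext._jvm
hadoop_conf = spark.sparkContext._jsc.hadoopConfiguration()

# The same URI schemes spark.read understands (file://, abfss://, ...) resolve here too,
# as long as the matching Hadoop filesystem dependencies are on the classpath.
src = jvm.org.apache.hadoop.fs.Path("abfss://container@account.dfs.core.windows.net/captures/big.pcap.tgz")
dst = jvm.org.apache.hadoop.fs.Path("file:///tmp/big.pcap.tgz")

src_fs = src.getFileSystem(hadoop_conf)
dst_fs = dst.getFileSystem(hadoop_conf)

# Stream-copy to local driver disk instead of holding the whole object in one buffer,
# then continue with pure Python (tarfile/struct) and build the first DataFrame in chunks.
jvm.org.apache.hadoop.fs.FileUtil.copy(src_fs, src, dst_fs, dst, False, hadoop_conf)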

Update:

I am using PySpark 3.5.1.

Source that raises the error: .py#L160
