Partial records being read in PySpark through Dataproc

I have a Google Dataproc job that reads a CSV file from Google Cloud Storage. The GCS object has the following metadata headers:

Content-Type: application/octet-stream

Content-Encoding: gzip

File name: gs://test_bucket/sample.txt (the file doesn't have a .gz extension, but it is gzip-compressed)
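
For reference, a minimal sketch of how this metadata can be checked with the google-cloud-storage client (bucket and object names taken from the path above; this is just an illustration, not part of the Dataproc job):

    from google.cloud import storage

    # Inspect the GCS object's metadata to confirm the values quoted above
    # (assumes default application credentials are available).
    client = storage.Client()
    blob = client.bucket("test_bucket").get_blob("sample.txt")
    print("content_type:    ", blob.content_type)      # application/octet-stream
    print("content_encoding:", blob.content_encoding)  # gzip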

The code below runs successfully, but the DataFrame record count (9k) does not match the file record count (100k). It looks like only the first 9k rows are being read. How do I make sure all the rows are read into my DataFrame?

    self.spark: SparkSession = (
        SparkSession.builder.appName("app_name")
        .config("spark.executor.memory", "4g")
        .config("spark.hadoop.fs.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem")
        .config("spark.hadoop.google.cloud.auth.service.account.enable", "true")
        .config("spark.hadoop.fs.gs.inputstream.support.gzip.encoding.enable", "true")
        .config("spark.sql.legacy.timeParserPolicy", "CORRECTED")
        .config("spark.driver.memory", "4g")
        .getOrCreate()
    )

        df = (self.spark.read.format("csv")
            .schema(schema)
            .option("mode", 'PERMISSIVE')  
            .option("encoding", "UTF-8")
            .option("columnNameOfCorruptRecord", '_corrupt_record')
            .load(self.file_path) )


        print("df total count: ", df.count())
