
I have a Glue job with the configuration below that writes files to S3 using spark.write, and the write step is slow: it produces 544 files of ~7.5 MB each. Using coalesce(16) instead generates 16 files of ~2.5 GB each, which didn't help much.
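As a side note, one common rule of thumb is to size Parquet output files at around 128 MB and derive the partition count from the total data volume. A minimal sketch in plain Python, using the file counts and sizes stated above (the 128 MB target is an assumption, not something from the question):

```python
import math

# Numbers taken from the question: 544 output files of ~7.5 MB each.
num_files = 544
file_size_mb = 7.5

# Assumed target size per output file (a common Parquet guideline).
target_file_mb = 128

total_mb = num_files * file_size_mb              # ~4080 MB in total
partitions = math.ceil(total_mb / target_file_mb)

print(partitions)  # -> 32, i.e. repartition(32) for ~128 MB files
```

By that estimate, `coalesce(16)` would only be expected to yield ~255 MB files, so the reported 2.5 GB files suggest the job is reading far more data than the 544 * 7.5 MB it finally writes.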

Glue Config:

  • worker type: G.1X
  • max number of workers: 10
  • glue version 5.0

This job basically selects data from a partitioned Athena table and finally writes the result to S3. The S3 write is taking a lot of time.

select_query = (
    f"SELECT * FROM table1 "
    f"WHERE year='{year}' AND month='{month}' AND day='{day}' AND hour='{hour}' AND col1='abc' "
    f"AND col2='123' AND col3 IN ('ABC12','CDE23','DEF34','GHI23') "
    f"AND col4='NEW' "
    f"AND key IN ('val1', 'val2')"
)

data_df = spark.sql(select_query)
data_df.write.mode("append").parquet(athena_output_location)

Tags: python. Title: Aws glue job is very slow while writing to s3 using pyspark write (Stack Overflow)