I have a Glue job with the configuration below that writes files to S3 using spark.write, and the write step is slow: it produces 544 files of about 7.5 MB each. Using coalesce(16) instead produced 16 files of about 2.5 GB each, but it didn't help much.
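For reference, a minimal sketch of that coalesce attempt, reusing the data_df and athena_output_location names from the job code further down:

# Sketch of the coalesce(16) attempt described above; data_df and
# athena_output_location are the names from the job code below.
# coalesce() narrows the DataFrame to 16 partitions without a full
# shuffle, so 16 tasks each write one large output file.
data_df.coalesce(16).write.mode("append").parquet(athena_output_location)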
Glue Config:
- worker type: G.1X
- max number of workers: 10
- Glue version: 5.0
The job selects data from a partitioned Athena table and writes the result to a file in S3; the S3 write is what takes most of the time.
# Build the SELECT against the partitioned Athena table.
select_query = (
    f"SELECT * FROM table1 "
    f"WHERE year='{year}' AND month='{month}' AND day='{day}' AND hour='{hour}' AND col1='abc' "
    f"AND col2='123' AND col3 IN ('ABC12','CDE23','DEF34','GHI23') AND col4='NEW' "
    f"AND key IN ('val1', 'val2')"
)
# Run the query and append the result to S3 as Parquet.
data_df = spark.sql(select_query)
data_df.write.mode("append").parquet(athena_output_location)
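Spark typically writes one file per non-empty partition, so the 544 output files suggest data_df has 544 partitions at write time. A quick check (a sketch, assuming the same data_df as above):

# Each non-empty partition is typically written as one output file,
# so the output file count should track the partition count here.
num_partitions = data_df.rdd.getNumPartitions()
print(f"data_df has {num_partitions} partitions before write")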