Update:
I have since optimized the query and the performance is now much better, so the original question is no longer relevant.
Original post:
I have a query that takes about 5 minutes to run in DBeaver (a Redshift client tool). The query is a simple select:
select to_char(TO_TIMESTAMP(hit_time_gmt_ts, 'DD-Mon-YYYY HH24:MI:SS'), 'YYYYMMDD') as hit_time_gmt_dt_key,
       post_visid_combined,
       visit_return_count,
       event_list
from cnds.omni_hit_data_cn_v
where TO_TIMESTAMP(hit_time_gmt_ts, 'DD-Mon-YYYY HH24:MI:SS') >= '20240701'
  and TO_TIMESTAMP(hit_time_gmt_ts, 'DD-Mon-YYYY HH24:MI:SS') < '20250101'
  and post_campaign ~ '(^EM.*_T_.*)((NEWCA)|(NCA)).*'
This takes 5 minutes in DBeaver. It returns about 800 rows, so a very small data set.
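As an aside, the POSIX regex in the WHERE clause can be sanity-checked locally with Python's re module, whose syntax is close enough for this pattern. The sample campaign codes below are made up for illustration:

```python
import re

# The pattern from the post's WHERE clause: must start with EM, contain _T_,
# and then contain NEWCA or NCA somewhere after that.
pattern = re.compile(r'(^EM.*_T_.*)((NEWCA)|(NCA)).*')

# Hypothetical sample campaign codes and whether they should match.
samples = {
    "EM2024_T_summer_NEWCA_01": True,   # EM prefix, _T_, then NEWCA
    "EM2024_T_summer_NCA_01": True,     # EM prefix, _T_, then NCA
    "EM2024_T_summer_promo": False,     # no NEWCA/NCA after _T_
    "XX2024_T_NEWCA_01": False,         # does not start with EM
}
for campaign, expected in samples.items():
    assert bool(pattern.search(campaign)) == expected
```

Note that an unanchored regex with leading and trailing wildcards like this cannot be served by a simple sort-key scan, so Redshift has to evaluate it row by row.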
But in Databricks it takes 14 minutes. The code in Databricks:
bucket = "tfsds-prod"
RS_tempDir = "s3://xxxx"
RS_jdbcUrl = 'jdbc:redshift://xxxxx.us-east-1.redshift.amazonaws.com:5439/rscdwdm'
RS_username = 'xxx'
RS_password = 'xxx'
query = """
select to_char(TO_TIMESTAMP(hit_time_gmt_ts, 'DD-Mon-YYYY HH24:MI:SS'), 'YYYYMMDD') as hit_time_gmt_dt_key,
       post_visid_combined,
       visit_return_count,
       event_list
from cnds.omni_hit_data_cn_v
where TO_TIMESTAMP(hit_time_gmt_ts, 'DD-Mon-YYYY HH24:MI:SS') >= '20240701'
  and TO_TIMESTAMP(hit_time_gmt_ts, 'DD-Mon-YYYY HH24:MI:SS') < '20250101'
  and post_campaign ~ '(^EM.*_T_.*)((NEWCA)|(NCA)).*'
"""
df = (spark.read
.format("redshift")
.option("query", query)
.option("tempdir", RS_tempDir)
.option("url", RS_jdbcUrl)
.option("user", RS_username)
.option("password", RS_password)
.option("forward_spark_s3_credentials", True)
.load()
)
df.display()
During the execution in Databricks, it shows this message:
Waiting for Redshift to unload intermediate data under s3
It looks to me like the job is stuck writing the query result to the S3 bucket, and I don't understand why DBeaver is so much faster.
My ultimate goal is to somehow speed up this query in Databricks for such a small dataset. I don't know much about Redshift, which is governed by another team, so I need some evidence before I can report the issue to them.
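One way to gather that evidence: the Databricks Redshift connector issues an UNLOAD behind the scenes, and Redshift records every statement in its system tables, so you can ask Redshift itself how long the UNLOAD took and separate that from the S3/Spark side. A hedged sketch (stl_query is a standard Redshift system table; the ilike filter is an assumption about how the connector's UNLOAD text appears in querytxt):

```python
# Diagnostic query to run against Redshift (e.g. from DBeaver): how long did
# recent UNLOAD statements take? Compare elapsed_secs with the total
# Databricks wall-clock time to see where the 14 minutes actually go.
diagnostic_sql = """
select query,
       starttime,
       endtime,
       datediff(seconds, starttime, endtime) as elapsed_secs,
       trim(querytxt) as query_text
from stl_query
where querytxt ilike '%unload%'
order by starttime desc
limit 10;
"""
```

If elapsed_secs is close to the full 14 minutes, the time is spent inside Redshift; if it is small, the overhead is in the S3 unload/read round trip on the Spark side.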
asked Nov 21, 2024 by thotwielder; edited Nov 22, 2024

1 Answer
There are additional steps here by design: the Redshift connector offloads data to a temporary S3 location, and Spark adds coordination overhead that pays off for large datasets but becomes pure overhead for small ones. For small result sets you can try the plain JDBC format, which should mimic DBeaver's behaviour.
It is not advisable to change this without understanding the actual execution plan and the data volumes the query eventually handles, but for this particular query you can try switching the format to "jdbc": Redshift then runs the query and streams the result directly to the JDBC client instead of unloading to S3 and reading it back.
Something like:
df = (spark.read
.format("jdbc")
.option("url", RS_jdbcUrl)
.option("query", query)
.option("user", RS_username)
.option("password", RS_password)
.load())
As to why it is slower: the query may process more data internally than the final 800 rows suggest, which requires temp storage, while DBeaver is faster because it uses JDBC directly with no intermediate storage.
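If the plain JDBC path is still sluggish, the generic Spark JDBC source also accepts a fetchsize option (a standard Spark JDBC option) that controls how many rows the driver pulls per network round trip. The connection values below are placeholders mirroring the post, and the fetchsize value is illustrative:

```python
# Illustrative options for the generic "jdbc" source. url/user/password are
# placeholders; fetchsize controls rows fetched per round trip (the JDBC
# driver default is often small, which hurts throughput).
jdbc_options = {
    "url": "jdbc:redshift://xxxxx.us-east-1.redshift.amazonaws.com:5439/rscdwdm",
    "query": "select 1",   # stand-in for the real query
    "user": "xxx",
    "password": "xxx",
    "fetchsize": "10000",
}
# In Databricks this would be used as:
# df = spark.read.format("jdbc").options(**jdbc_options).load()
```

For an 800-row result fetchsize matters little, but it becomes relevant if the same JDBC path is later reused for larger pulls.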
Comment: You could query Redshift directly (e.g. via psycopg2) rather than going via Spark, but it depends on your use-case. – John Rotenstein, Nov 21, 2024