
Update

I have since optimized the query and the performance is now much better, so the original question is no longer relevant.

Original post:

I have a query that takes about 5 minutes to run in DBeaver (a Redshift client tool). The query is a simple select:

select to_char(TO_TIMESTAMP(hit_time_gmt_ts, 'DD-Mon-YYYY HH24:MI:SS'), 'YYYYMMDD') as hit_time_gmt_dt_key,
       post_visid_combined,
       visit_return_count,
       event_list
from cnds.omni_hit_data_cn_v
where TO_TIMESTAMP(hit_time_gmt_ts, 'DD-Mon-YYYY HH24:MI:SS') >= '20240701'
  and TO_TIMESTAMP(hit_time_gmt_ts, 'DD-Mon-YYYY HH24:MI:SS') < '20250101'
  and post_campaign ~ '(^EM.*_T_.*)((NEWCA)|(NCA)).*'

This takes 5 minutes in DBeaver and returns about 800 rows, so it is a very small data set.

But in Databricks it takes 14 minutes. The code in Databricks:

bucket = "tfsds-prod"
RS_tempDir = "s3://xxxx"
RS_jdbcUrl = 'jdbc:redshift://xxxxx.us-east-1.redshift.amazonaws.com:5439/rscdwdm'
RS_username = 'xxx'
RS_password = 'xxx'

query = """
  select to_char(TO_TIMESTAMP(hit_time_gmt_ts, 'DD-Mon-YYYY HH24:MI:SS'), 'YYYYMMDD') as hit_time_gmt_dt_key,
         post_visid_combined,
         visit_return_count,
         event_list
  from cnds.omni_hit_data_cn_v
  where TO_TIMESTAMP(hit_time_gmt_ts, 'DD-Mon-YYYY HH24:MI:SS') >= '20240701'
    and TO_TIMESTAMP(hit_time_gmt_ts, 'DD-Mon-YYYY HH24:MI:SS') < '20250101'
    and post_campaign ~ '(^EM.*_T_.*)((NEWCA)|(NCA)).*'
"""

df = (spark.read
  .format("redshift")
  .option("query", query)
  .option("tempdir", RS_tempDir)
  .option("url", RS_jdbcUrl)
  .option("user", RS_username)
  .option("password", RS_password)
  .option("forward_spark_s3_credentials", True)
  .load()
)

df.display()

During the execution in Databricks, it shows this message:

Waiting for Redshift to unload intermediate data under s3

It looks to me like the code is stuck writing the query result to the S3 bucket, and I don't know why DBeaver is so much faster.
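That message matches how the Redshift connector works: it asks Redshift to UNLOAD the query result to the configured tempdir, waits for the unload to finish (the phase the message refers to), and then reads the unloaded files back from S3. A rough sketch of the statement shape follows; `build_unload_statement` is a hypothetical helper for illustration, not connector API, and the real statement also carries credentials, format, and escaping options.

```python
import uuid

def build_unload_statement(query: str, tempdir: str) -> str:
    """Illustrative only: roughly the shape of the UNLOAD statement the
    spark-redshift connector issues against the configured tempdir."""
    # The connector generates a unique prefix under tempdir for each read.
    prefix = f"{tempdir.rstrip('/')}/{uuid.uuid4()}/"
    # The query is embedded inside UNLOAD as a quoted SQL string literal,
    # so single quotes in it must be doubled.
    escaped = query.replace("'", "''")
    return f"UNLOAD ('{escaped}') TO '{prefix}'"

stmt = build_unload_statement("select 1", "s3://xxxx")
```

Spark only starts reading once all of this server-side unload work has completed, which is latency DBeaver never pays.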

My ultimate goal is to speed up the query execution in Databricks for such a small data set. I don't know much about Redshift, which is governed by another team, but I need some evidence before I can report the issue.


Asked Nov 21, 2024 at 8:59 by thotwielder; edited Nov 22, 2024 at 6:17.
  • DBeaver sends the SQL query to Redshift and receives the response directly back from Redshift. Looking at Query Amazon Redshift using Databricks | Databricks on AWS, using Spark to connect to Redshift appears to be a much more complex process, not to mention that it puts the data in a DataFrame, which is an extra level of abstraction. You might consider making a call directly to Redshift (e.g. using psycopg2) rather than going via Spark, but it depends on your use case. – John Rotenstein, Nov 21, 2024 at 11:28
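The direct-connection suggestion in the comment above could be sketched like this. This is a hypothetical helper, not from the post: Redshift speaks the PostgreSQL wire protocol, so psycopg2 can connect to it directly, with no Spark or S3 unload in the path.

```python
def fetch_redshift_rows(host, user, password, query,
                        dbname="rscdwdm", port=5439):
    """Fetch all rows for `query` over a direct Redshift connection.

    Sketch only: requires the psycopg2 package, and real code should use
    secrets management rather than a plain-text password.
    """
    import psycopg2  # imported lazily so the module loads without the driver

    conn = psycopg2.connect(host=host, port=port, dbname=dbname,
                            user=user, password=password)
    try:
        with conn.cursor() as cur:
            cur.execute(query)
            return cur.fetchall()  # fine for ~800 rows; stream larger results
    finally:
        conn.close()
```

For an 800-row result this avoids the connector's UNLOAD-to-S3 round trip entirely, at the cost of losing Spark's distributed read for genuinely large results.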

1 Answer


The Redshift connector performs additional steps to accomplish the same task: it offloads the result to a temporary S3 location, and Spark itself adds some computation time. That overhead pays off for large data sets but can dominate for small ones. For small result sets you can try the plain JDBC format, which should mimic DBeaver's behaviour.

It is not recommended to change this without understanding the actual execution plans and the data volumes the query eventually handles, but for this particular query you can try switching the format to "jdbc": Redshift then runs the query and streams the result directly to the JDBC client, instead of unloading it to S3 for Spark to read back.

Something like:

df = (spark.read
  .format("jdbc")
  .option("url", RS_jdbcUrl)
  .option("query", query)
  .option("user", RS_username)
  .option("password", RS_password)
  .load())
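If the plain-JDBC read slows down as result sizes grow, Spark's JDBC source also exposes a `fetchsize` option that controls how many rows the driver pulls per round trip. The sketch below uses placeholder connection values, and 10000 is an arbitrary example, not a recommendation:

```python
# "fetchsize" is a standard Spark JDBC option; connection values here are
# placeholders, not real endpoints.
jdbc_options = {
    "url": "jdbc:redshift://<host>.us-east-1.redshift.amazonaws.com:5439/rscdwdm",
    "query": "select 1",
    "user": "<user>",
    "password": "<password>",
    "fetchsize": "10000",  # rows per round trip; driver defaults are often small
}

# In the notebook this replaces the per-option calls above:
# df = spark.read.format("jdbc").options(**jdbc_options).load()
```

Tuning this only matters once row counts are large enough that round trips dominate; for ~800 rows the default is fine.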

As for why it is slower: the query may scan or materialize far more data than the final 800 rows, and all of that intermediate data goes through temp storage, even though the final output is small. DBeaver is faster because it uses JDBC directly, with no intermediate storage.

Tags: amazon-web-services | Redshift query takes longer in Databricks than in Redshift client | Stack Overflow