
I have an ETL operation where I query a SQL Server whose datetimes are in the America/New_York time zone, write the data to Parquet, and ingest it into a Delta table via Spark (Spark 3.5 and Delta 3.1).

As far as I can tell (from this referenced Databricks blog), Spark's timestamp type does not store any offset information with the value; it assumes UTC.

I am trying to figure out whether we need to convert our SQL Server datetimes to UTC before writing them to the table.

I have been testing changes to the session time zone, and it does cause confusion: if I set it to America/New_York and do a df.show(), Spark converts what it thinks is UTC to America/New_York, and the displayed time is incorrect.

However, when you df.collect(), it shows the correct America/New_York time (which Spark thinks is UTC?).
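Here is a minimal sketch of the show()/collect() mismatch I am seeing, using a local SparkSession and a made-up timestamp column ts (not our actual job):

import datetime
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Spark stores TIMESTAMP values internally as UTC instants.
df = spark.createDataFrame([(datetime.datetime(2024, 1, 1, 12, 0, 0),)], ["ts"])

spark.conf.set("spark.sql.session.timeZone", "America/New_York")

# show() renders the stored instant in the session time zone...
df.show(truncate=False)

# ...while collect() converts it back to a naive Python datetime using the
# driver's local time zone, which is why the two can disagree.
print(df.collect()[0].ts)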

We are also writing data whose timestamps are created with est_tz.localize(datetime.datetime.now()), where est_tz is the America/New_York time zone. The Python datetime objects appear to be handled differently -- they use the driver's time zone, which can differ from the session time zone (if you are setting it manually).
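For reference, a rough sketch of the pytz-style localization we use (est_tz and the variable names here are illustrative, not the actual job code):

import datetime
import pytz

est_tz = pytz.timezone("America/New_York")

# Attach the America/New_York zone to the current wall-clock time, producing a
# tz-aware Python datetime.
aware_ts = est_tz.localize(datetime.datetime.now())

# Because the datetime carries its own offset, PySpark can convert it to the
# UTC instant it represents, independent of spark.sql.session.timeZone.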

My gut is telling me to convert the datetimes to UTC (using to_utc_timestamp(ts, "America/New_York")) and, when we read / present in a BI tool, to convert back to America/New_York.
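Concretely, something like this sketch is what I have in mind (df and the column name ts are placeholders):

from pyspark.sql import functions as F

# Reinterpret the naive wall-clock values as America/New_York and shift them to
# the equivalent UTC instants before writing.
df_utc = df.withColumn("ts", F.to_utc_timestamp("ts", "America/New_York"))

# On the read / presentation side, shift back for display.
df_local = df_utc.withColumn("ts_local", F.from_utc_timestamp("ts", "America/New_York"))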

Does anyone have any advice?

Thank you.

asked Mar 11 at 15:17 by eclipsedlamp
  • Your gut is trustworthy! There's a good reason why 99% of systems save their timestamps in UTC: it's the standard reference time used globally and serves as the basis for all other time zones. – abiratsis, Mar 12 at 6:42

1 Answer


You're on the right track: Spark timestamps are always stored as UTC instants.
If your SQL Server timestamps are America/New_York wall-clock values, Spark will misinterpret them as UTC unless they are explicitly converted. Convert them to UTC (to_utc_timestamp(ts, "America/New_York")) before writing to avoid incorrect conversions when reading.
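As a rough end-to-end sketch (the JDBC URL, table name, column name event_ts, and Delta path below are placeholders, not details of your setup):

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Read the source table over JDBC; the datetimes arrive as naive wall-clock
# values that are really America/New_York.
raw = (spark.read.format("jdbc")
       .option("url", "jdbc:sqlserver://<host>;databaseName=<db>")
       .option("dbtable", "dbo.events")
       .load())

# Normalize to UTC once, at ingestion time.
normalized = raw.withColumn("event_ts",
                            F.to_utc_timestamp("event_ts", "America/New_York"))

normalized.write.format("delta").mode("append").save("/path/to/delta/events")

# For BI presentation or ad-hoc checks, convert back:
display_df = normalized.withColumn(
    "event_ts_local", F.from_utc_timestamp("event_ts", "America/New_York"))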
