I have an ETL operation where I query a SQL Server database whose datetimes are in the America/New_York timezone, write the data to Parquet, and ingest it into a Delta table via Spark (Spark 3.5 and Delta 3.1).
As far as I can tell (from this Databricks blog post), Spark's timestamp type does not store any offset information with the value; it assumes the value is UTC.
I am trying to figure out whether we need to convert our SQL Server datetimes to UTC before writing them to the table.
I have been testing with the session timezone, and it does cause confusion: if I set it to America/New_York and call df.show(), Spark converts what it thinks is UTC to America/New_York, which gives the wrong time. However, df.collect() returns the correct America/New_York time (which Spark apparently thinks is UTC?).
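Here is a minimal sketch of what I am seeing (the column and values are made up, and I am assuming the driver's local timezone differs from the session timezone):

```python
import datetime
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
spark.conf.set("spark.sql.session.timeZone", "America/New_York")

# A naive datetime, roughly as it comes out of SQL Server.
df = spark.createDataFrame([(datetime.datetime(2024, 1, 1, 12, 0),)], "ts timestamp")

df.show(truncate=False)   # rendered using the session timezone (America/New_York)
print(df.collect())       # naive datetime objects, built with the driver's local timezone
```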
We also write data whose timestamps are created with est_tz.localize(datetime.datetime.now()), where est_tz is the America/New_York timezone. The Python datetime objects appear to be handled differently: they use the driver timezone, which can differ from the session timezone (if you are setting it manually).
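For reference, this is roughly how we build those timestamps today (a sketch with pytz, matching our code; spark is the same session as above):

```python
import datetime
import pytz

est_tz = pytz.timezone("America/New_York")
aware_ts = est_tz.localize(datetime.datetime.now())  # tz-aware, America/New_York

# The offset is honoured when the DataFrame is created, but collect() still hands
# back naive datetimes rendered in the driver's local timezone.
df_created = spark.createDataFrame([(aware_ts,)], "created_ts timestamp")
```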
My gut tells me to convert the datetimes to UTC (using to_utc_timestamp(ts, "America/New_York")) before writing, and then convert back to America/New_York when we read or present the data in a BI tool, something like the sketch below.
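Concretely, this is what I have in mind (a sketch; df is the DataFrame read from SQL Server, and the column name and table path are made up):

```python
from pyspark.sql import functions as F

# Write side: treat the naive SQL Server values as America/New_York and shift them to UTC.
df_utc = df.withColumn("ts", F.to_utc_timestamp("ts", "America/New_York"))
df_utc.write.format("delta").mode("append").save("/path/to/delta_table")  # hypothetical path

# Read / BI side: shift back to local time for display.
df_local = (
    spark.read.format("delta")
    .load("/path/to/delta_table")
    .withColumn("ts_local", F.from_utc_timestamp("ts", "America/New_York"))
)
```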
Does anyone have any advice?
Thank you.
asked Mar 11 at 15:17 by eclipsedlamp

Comment: Your gut is trustworthy! There's a good reason why 99% of systems save their timestamps in UTC; it's the standard reference time used globally and serves as the basis for all other time zones. – abiratsis, Mar 12 at 6:42
1 Answer
You're on the right track. Spark timestamps are always stored as UTC.
If your SQL Server timestamps are in America/New_York, Spark will misinterpret them as UTC unless they are explicitly converted. Convert them to UTC with to_utc_timestamp(ts, "America/New_York") before writing, so you avoid incorrect conversions when reading.
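A minimal sketch, assuming a column named ts; for presentation you can either convert back per column or rely on the session timezone so the reporting session renders local time:

```python
from pyspark.sql import functions as F

# Normalize before writing: the naive values are America/New_York wall-clock times.
df = df.withColumn("ts", F.to_utc_timestamp("ts", "America/New_York"))

# For presentation, either convert back explicitly...
df.select(F.from_utc_timestamp("ts", "America/New_York").alias("ts_ny"))

# ...or leave the stored values in UTC and set the session timezone so
# show()/SQL render timestamps in local time.
spark.conf.set("spark.sql.session.timeZone", "America/New_York")
```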