I have a data set on a pretty small Microsoft Fabric capacity (which, in case you don't know, is basically Azure Synapse, which is basically Apache Spark).
Due to limitations with the data source, I basically get a full dump every day. So I need to update a "last seen" timestamp on the rows I already know (i.e. identical data) and append the changed ones.
This is not hard to do, but I am looking for the most efficient option.
Loading the new data into df_update and the existing data into df_existing, I have tried two ways of doing this:
-- 1 -- Using pyspark data frames:
I can solve the task with an outer join like
from pyspark.sql.functions import coalesce

# Outer join on all data columns; matched rows carry both timestamps.
df_new = df_existing \
    .withColumnRenamed('ts', 'ts_old') \
    .join(df_update, on=all_columns_but_the_timestamp, how='outer')
# Keep the new timestamp where one exists, otherwise fall back to the old one.
return df_new \
    .withColumn('ts', coalesce(df_new['ts'], df_new['ts_old'])) \
    .drop('ts_old')
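Writing the result back then means a full overwrite of the Delta table; a minimal sketch (the table name is just a placeholder):
# Rewrites every file of the table even though only timestamps changed.
df_new.write \
    .format('delta') \
    .mode('overwrite') \
    .saveAsTable('my_table')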
Unfortunately, this requires me to replace the whole table on disk. That's slow and seems to upset OneLake a bit (seeing the updated data in a query takes additional time). Therefore I tried:
-- 2 -- Using Delta Lake update
By using
df_new = df_update.exceptAll(df_existing.select(all_columns_but_the_timestamp))
df_duplicates = df_update.exceptAll(df_new)
I can get the new and the revisited data, and then update the revisited rows one by one:
from pyspark.sql.functions import lit
from pyspark.sql.types import TimestampType

for row in df_duplicates.collect():
    # Build a predicate matching this exact row, then run one update per row.
    table.update(
        ' AND '.join([f'{k} = "{v}"' for k, v in row.asDict().items()]),
        {'ts': lit(new_timestamp).cast(TimestampType())})
This, however, is a woefully slow way to do the updates. df_new can simply be appended to the table afterwards.
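The append itself is cheap; a minimal sketch (again with a placeholder table name):
# Append only the genuinely new rows; existing files stay untouched.
df_new.write \
    .format('delta') \
    .mode('append') \
    .saveAsTable('my_table')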
What I have been looking for is
-- 3 -- Delta Lake update in bulk
i.e. somehow selecting all affected rows in one go and updating the value:
table.update(
    some_very_neat_condition,
    {'ts': lit(new_timestamp).cast(TimestampType())})
Since I don't have reliable IDs, however, I don't know how to do that.
Or is there another option I'm missing?
asked Mar 28 at 15:38 – Jörg Neulist

2 Answers
If I understand correctly, you are trying to merge, i.e. insert or update.
Use MERGE INTO whenever possible; even traditional databases have an equivalent SQL MERGE statement.
from delta.tables import DeltaTable

delta_table = DeltaTable.forPath(spark, "your table here...")

# Match on every column except the timestamp: refresh ts on matches,
# insert brand-new rows with a fresh ts.
delta_table.alias("existing").merge(
    df_update.alias("updates"),
    " AND ".join([f"existing.{col} = updates.{col}" for col in all_columns_but_the_timestamp])
).whenMatchedUpdate(set={
    "ts": "current_timestamp()"
}).whenNotMatchedInsert(values={
    **{col: f"updates.{col}" for col in all_columns_but_the_timestamp},
    "ts": "current_timestamp()"
}).execute()
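For reference, roughly the same merge can be expressed through Spark SQL. This is only a sketch: the table name my_table and the temp view staged_updates are placeholders, and it assumes all_columns_but_the_timestamp is a plain Python list of column names:
# Expose the incoming frame to SQL via a temp view (name is a placeholder).
df_update.createOrReplaceTempView("staged_updates")

on_clause = " AND ".join(
    f"existing.{c} = updates.{c}" for c in all_columns_but_the_timestamp
)
insert_cols = ", ".join(all_columns_but_the_timestamp + ["ts"])
insert_vals = ", ".join(
    [f"updates.{c}" for c in all_columns_but_the_timestamp] + ["current_timestamp()"]
)

spark.sql(f"""
    MERGE INTO my_table AS existing
    USING staged_updates AS updates
    ON {on_clause}
    WHEN MATCHED THEN UPDATE SET ts = current_timestamp()
    WHEN NOT MATCHED THEN INSERT ({insert_cols}) VALUES ({insert_vals})
""")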
- This is the bread-and-butter use case for DeltaTable.merge(), as Ram mentioned, so something like what Ram suggested in his answer. More docs here.
- IMO you should add a key column to your table and use that in the condition param of your merge() call. Historically, every time someone says there is no unique key, they either don't know the data or they haven't tried hard enough. In any case, assuming your row is uniquely identified by a composite key of all_columns_but_the_timestamp, you could (a more defensive sketch of the key generation follows at the end of this answer):
from pyspark.sql import functions as F

df_update = df_update.withColumn(
    'all_columns_str',
    # in practice this would be more complex, as you'll have to convert
    # all columns to str, handle NULLs, ...
    F.concat(*all_columns_but_the_timestamp)
).withColumn(
    'generated_key',
    F.conv(F.sha2('all_columns_str', 256), 16, 10)
).drop('all_columns_str')
and then:
delta_table.alias("existing").merge(
    source=df_update.alias("updates"),
    condition='existing.generated_key = updates.generated_key'
).whenMatchedUpdate(set={
    "ts": "current_timestamp()"
}).whenNotMatchedInsert(values={
    **{col: f"updates.{col}" for col in all_columns_but_the_timestamp},
    "generated_key": "updates.generated_key",  # carry the key on inserted rows too
    "ts": "current_timestamp()"
}).execute()
- Not sure how you're partitioning your table. If you add a hash as a key, then partitioning would be easy. Let's say you decide that 128 is your sweet spot for the number of partitions; then:
df_update = df_update.withColumn('partition_id', F.col('generated_key') % 128)
and use partition_id as the partitioning column while creating your delta table (a sketch of that write follows below).
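A minimal sketch of that table creation, where df_full stands in for whatever frame you first materialise the table from (it must already carry generated_key and partition_id) and my_table is a placeholder name:
# Create the Delta table partitioned by the derived partition_id.
(df_full
    .write
    .format('delta')
    .partitionBy('partition_id')
    .saveAsTable('my_table'))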
Also, none of the options listed in the OP is a good fit for this use case, so do not use them.
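As for the "convert all columns to str, handle NULLs" caveat in the key-generation snippet above, here is a minimal defensive sketch; the helper name with_generated_key is made up for illustration:
from pyspark.sql import functions as F

def with_generated_key(df, key_columns, sentinel='<NULL>'):
    # Cast every key column to string and replace NULLs with a sentinel so
    # that a NULL still contributes to the hash instead of being dropped.
    parts = [F.coalesce(F.col(c).cast('string'), F.lit(sentinel)) for c in key_columns]
    # The separator guards against collisions like ('ab', 'c') vs ('a', 'bc').
    hashed = F.sha2(F.concat_ws('||', *parts), 256)
    return (df
            .withColumn('generated_key', hashed)
            # Numeric variant for the modulo-based partition_id: the first
            # 15 hex digits fit comfortably into a 64-bit integer.
            .withColumn('generated_key_num',
                        F.conv(F.substring(hashed, 1, 15), 16, 10).cast('long')))

df_update = with_generated_key(df_update, all_columns_but_the_timestamp)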
"...basically Azure Synapse, which is basically Apache Spark"
AFAIK this is not true. You have an option to choose Spark or "SQL pool" or "Data Explorer pool" as your compute/engine. And how you interact depends on your engine. See learn.microsoft/en-us/azure/synapse-analytics/… – Kashyap Commented Mar 28 at 17:04