We have a Lambda writing to Redshift through the Redshift Data API, executing a MERGE (update/insert) statement with a simple merge condition (a comparison of two integer columns).
Only one Lambda runs at a given time (enforced by a FIFO queue), so there are usually no competing transactions.
However, rarely, the merge statement takes minutes to execute (compared to the usual few seconds). This causes the Lambda to time out and reprocess the same message.
When that happens, something unexpected occurs: we end up with two identical rows inserted into the table by two identical competing MERGE statements. This is undesirable, especially since PK/uniqueness constraints are not enforced by Redshift, and it causes future merges to fail because they find multiple matches.
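To recover from the failure mode described above (future merges failing on multiple matches), existing duplicates can be cleaned up with a window-function dedup pass. This is only a sketch; `my_table`, `key1`, `key2`, and `payload` are placeholder names for the actual table and columns:

```sql
-- Keep exactly one copy per key pair; assumes the duplicated rows are fully identical
CREATE TABLE my_table_dedup AS
SELECT key1, key2, payload
FROM (
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY key1, key2 ORDER BY key1) AS rn
    FROM my_table
)
WHERE rn = 1;

-- After validating row counts, swap the tables:
-- ALTER TABLE my_table RENAME TO my_table_old;
-- ALTER TABLE my_table_dedup RENAME TO my_table;
```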
The isolation level is set to SNAPSHOT for the DB.
I'm not sure how SNAPSHOT isolation works in Redshift, but I expect it to be similar to SQL Server:
Snapshot isolation uses an optimistic concurrency model. If a snapshot transaction attempts to commit modifications to data that has changed since the transaction began, the transaction will roll back and an error will be raised.
I have the following questions:
- Does Redshift's SNAPSHOT isolation level work the same as SQL Server's SNAPSHOT, namely that the transaction is rolled back if the underlying data is modified?
- If the answer to the above is "yes", why do we get two identical rows? Is it connected to the fact that the MERGE is rewritten to INSERT by Redshift and considered independent?
- Will switching to the SERIALIZABLE level mitigate this issue? If not, what's the best way to handle it?
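One mitigation we are evaluating, regardless of the isolation-level question, is taking an explicit lock on the target table inside the transaction so that concurrent MERGEs serialize. This is a sketch only, not verified on our cluster; `target_table` and `staging` with columns `key1`, `key2`, `payload` are placeholder names:

```sql
BEGIN;

-- Second identical transaction blocks here until the first commits,
-- so its MERGE sees the already-inserted rows and takes the UPDATE branch
LOCK target_table;

MERGE INTO target_table
USING staging s
    ON s.key1 = target_table.key1 AND s.key2 = target_table.key2
WHEN MATCHED THEN
    UPDATE SET payload = s.payload
WHEN NOT MATCHED THEN
    INSERT (key1, key2, payload)
    VALUES (s.key1, s.key2, s.payload);

END;
```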
1 Answer
After some testing, it is indeed the SNAPSHOT transaction isolation level that can cause duplicated records in the same table when using MERGE.
This is the code to demonstrate the behavior:
-- Python UDF that sleeps for x seconds, used to widen the race window
CREATE OR REPLACE FUNCTION janky_sleep (x float) RETURNS bool IMMUTABLE AS $$
    from time import sleep
    sleep(x)
    return True
$$ LANGUAGE plpythonu;

SELECT janky_sleep(5.0);

CREATE TABLE testingConcurrentMerge (
    id integer,
    text varchar(10)
);

BEGIN TRANSACTION;

-- Stage the rows to merge in a temporary table
CREATE TABLE #tmerge (
    id integer,
    text varchar(10)
);
INSERT INTO #tmerge VALUES (1, 'test'), (2, 'test2');

MERGE INTO testingConcurrentMerge
USING #tmerge t
    ON t.id = testingConcurrentMerge.id
WHEN MATCHED THEN
    UPDATE SET text = t.text
WHEN NOT MATCHED THEN
    INSERT (id, text)
    VALUES (t.id, t.text);

-- Hold the transaction open so a second session can run the same MERGE concurrently
SELECT janky_sleep(10.0);

END TRANSACTION;
Executing the transaction from two connections concurrently causes the data to be duplicated.
Changing the isolation level to SERIALIZABLE forces the second transaction to fail, so the records are not duplicated.
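For reference, the isolation level is set per database in Redshift. A sketch, where `dev` is a placeholder database name (note that Redshift restricts when this can run, e.g. other active connections to the target database can block the change):

```sql
ALTER DATABASE dev ISOLATION LEVEL SERIALIZABLE;
```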