Redshift MERGE statement with concurrent transactions

We have a Lambda writing to Redshift through the Redshift Data API, executing a MERGE (update/insert) statement with a simple merge condition (a comparison of two integer columns).
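
For context, the statement has roughly the following shape (a minimal sketch; the table and column names here are hypothetical, not our actual schema):

MERGE INTO target_table
USING staging_table s
ON s.key_a = target_table.key_a AND s.key_b = target_table.key_b
WHEN MATCHED THEN
    UPDATE SET payload = s.payload
WHEN NOT MATCHED THEN
    INSERT (key_a, key_b, payload)
    VALUES (s.key_a, s.key_b, s.payload);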

We have only one Lambda running at a given time (enforced by a FIFO queue), which usually guarantees there are no competing transactions.

Rarely, however, we encounter a situation where the MERGE statement takes minutes to execute (compared to the usual few seconds). This causes the Lambda to time out and reprocess the same message.

In this case, something unexpected happens: we end up with two identical rows inserted into the table by two identical competing MERGE statements. Of course, this is undesirable, especially given that PK/uniqueness constraints are not enforced by Redshift. It also causes future merges to fail, because the merge condition finds multiple matches.
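
When this happens, a grouped count of the following shape (column names hypothetical, matching the sketch above) is how we spot the duplicated keys:

SELECT key_a, key_b, COUNT(*)
FROM target_table
GROUP BY key_a, key_b
HAVING COUNT(*) > 1;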

The isolation level is set to SNAPSHOT for the DB.
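
For reference, this is how the level is set in Redshift (the database name is a placeholder):

ALTER DATABASE my_db ISOLATION LEVEL SNAPSHOT;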

I'm not sure how SNAPSHOT isolation works in Redshift, but I expect it to be similar to SQL Server's:

Snapshot isolation uses an optimistic concurrency model. If a snapshot transaction attempts to commit modifications to data that has changed since the transaction began, the transaction will roll back and an error will be raised.

I have the following questions:

  1. Does Redshift's SNAPSHOT isolation level work the same as SQL Server's SNAPSHOT, namely that the transaction is rolled back if the underlying data has been modified?
  2. If the answer to the above is "yes", why do we get two identical rows? Is it connected to the fact that the MERGE is rewritten to an INSERT by Redshift and the two statements are considered independent?
  3. Will switching to the SERIALIZABLE level mitigate this issue? If not, what's the best way to handle it?

1 Answer

After some testing, it is indeed the SNAPSHOT transaction isolation level that can cause duplicated records in the same table when using MERGE.

This is the code to demonstrate the behavior:

-- A Python UDF that sleeps for x seconds, used to hold the transaction
-- open long enough to overlap with a second session. It is declared
-- IMMUTABLE even though it isn't, which is fine for this test.
CREATE OR REPLACE FUNCTION janky_sleep (x float) RETURNS bool IMMUTABLE AS $$
    from time import sleep
    sleep(x)
    return True
$$ LANGUAGE plpythonu;

-- Warm-up call to make sure the UDF works before the test.
SELECT janky_sleep(5.0);

-- The target table; note there is no enforced uniqueness on id.
CREATE TABLE testingConcurrentMerge (
    id integer,
    text varchar(10)
);

-- Run this transaction from two sessions at roughly the same time.
BEGIN TRANSACTION;

    -- Temporary staging table holding the rows to merge.
    CREATE TABLE #tmerge (
        id integer,
        text varchar(10)
    );

    INSERT INTO #tmerge VALUES (1, 'test'), (2, 'test2');

    MERGE INTO testingConcurrentMerge
    USING #tmerge t
    ON t.id = testingConcurrentMerge.id
    WHEN MATCHED THEN
        UPDATE SET text = t.text
    WHEN NOT MATCHED THEN
        INSERT (id, text)
        VALUES (t.id, t.text);

    -- Hold the transaction open for 10 seconds so the second session's
    -- MERGE runs against the same snapshot before either commits.
    SELECT janky_sleep(10.0);

END TRANSACTION;

Executing the transaction from two separate connections at roughly the same time causes the data to be duplicated. This makes sense under SNAPSHOT isolation: both transactions read the same snapshot, in which the target rows do not yet exist, so both take the WHEN NOT MATCHED branch; and because two INSERTs of brand-new rows never modify the same existing row, there is no write-write conflict to roll either transaction back.
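
As a quick check (a sketch to run after both sessions commit), each key comes back twice:

SELECT id, text, COUNT(*) AS copies
FROM testingConcurrentMerge
GROUP BY id, text;

-- both (1, 'test') and (2, 'test2') return copies = 2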

Changing the isolation level to SERIALIZABLE forces the second transaction to fail, so the records are not duplicated.
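
For completeness, switching the level uses the same ALTER DATABASE form as above (database name is a placeholder; as far as I know, it can't be run inside a transaction block):

ALTER DATABASE my_db ISOLATION LEVEL SERIALIZABLE;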
