admin管理员组

文章数量:1406950

I am using Google-BigQuery. I want to delete duplicate records (retain only one record per group) from a table in Prod env. I tried the below method by creating temp tables, but it appears that I don't have access to execute DDLs (CREATE table) in Prod. Hence, I need to use only DML statements to delete the duplicates.

id | fname  | age|  dt
1  | 'John' | 23 | '2024-10-16'
1  | 'John' | 31 | '2024-12-23'
1  | 'John' | 29 | '2025-02-08'
2  | 'Tony' | 25 | '2024-11-24'
2  | 'Tony' | 34 | '2025-01-06'
3  | 'Peter'| 42 | '2024-10-17'

Expected:

id | fname  | age|  dt
1  | 'John' | 29 | '2025-02-08'
2  | 'Tony' | 34 | '2025-01-06'
3  | 'Peter'| 42 | '2024-10-17'

Duplicates determined on [id + fname] and retain record based on latest [dt].

Code that I came up with which uses CREATE statements:

create or replace table `prod.emp_dedup`
as
select * from `prod.emp`
qualify row_number() over (partition by id, fname order by dt desc) = 1;

truncate table `prod.emp`;

insert into `prod.emp` select * from `prod.emp_dedup`;

drop table `prod.emp_dedup`;

I am unable to execute the above statements since I don't have DDL access in Prod. How can I achieve this by using DELETE (DML) statement(s) only ?

Thanks.

I am using Google-BigQuery. I want to delete duplicate records (retain only one record per group) from a table in Prod env. I tried the below method by creating temp tables, but it appears that I don't have access to execute DDLs (CREATE table) in Prod. Hence, I need to use only DML statements to delete the duplicates.

id | fname  | age|  dt
1  | 'John' | 23 | '2024-10-16'
1  | 'John' | 31 | '2024-12-23'
1  | 'John' | 29 | '2025-02-08'
2  | 'Tony' | 25 | '2024-11-24'
2  | 'Tony' | 34 | '2025-01-06'
3  | 'Peter'| 42 | '2024-10-17'

Expected:

id | fname  | age|  dt
1  | 'John' | 29 | '2025-02-08'
2  | 'Tony' | 34 | '2025-01-06'
3  | 'Peter'| 42 | '2024-10-17'

Duplicates determined on [id + fname] and retain record based on latest [dt].

Code that I came up with which uses CREATE statements:

create or replace table `prod.emp_dedup`
as
select * from `prod.emp`
qualify row_number() over (partition by id, fname order by dt desc) = 1;

truncate table `prod.emp`;

insert into `prod.emp` select * from `prod.emp_dedup`;

drop table `prod.emp_dedup`;

I am unable to execute the above statements since I don't have DDL access in Prod. How can I achieve this by using DELETE (DML) statement(s) only ?

Thanks.

Share asked Mar 6 at 20:28 marie20marie20 81320 silver badges37 bronze badges 1
  • It is possible that if you change the first line of your SQL to create temp table emp_dedup your query may just work – Yun Zhang Commented Mar 12 at 4:34
Add a comment  | 

2 Answers 2

Reset to default 0

You can achieve this using only DML statements by leveraging a common table expression (CTE) to identify the duplicates and then delete them.

-- Step 1: Identify the rows to retain
WITH latest_records AS (
  SELECT
    id,
    fname,
    age,
    dt,
    ROW_NUMBER() OVER (PARTITION BY id, fname ORDER BY dt DESC) AS rn
  FROM
    `prod.emp`
)

-- Step 2: Delete the duplicates
DELETE FROM `prod.emp`
WHERE (id, fname, dt) NOT IN (
  SELECT id, fname, dt
  FROM latest_records
  WHERE rn = 1
);

The CTE `latest_records` assigns a row number (`rn`) to each row within each group of `id` and `fname`, ordered by `dt` in descending order. The row with `rn = 1` is the latest record for each group.The DELETE statement removes all rows from `prod.emp` that are not the latest record for their respective groups

Hope it gave you some pointers.

Based on the documentation it is not clear whether bigquery supports DELETE with CTE .

If it does not, you can try below which is standard SQL

DELETE FROM t
WHERE id IN (
  
SELECT id FROM t
GROUP BY id, fname HAVING COUNT(*) > 1)

AND dt NOT IN (
  
SELECT MAX(dt) FROM t GROUP BY id,fname
);

First get the ids that are duplicate based on id+fname and then for the dates choose only dates that are lower than the MAX dates again based on the group of id+fname.

Output

id fname age dt
1 John 29 2025-02-08
2 Tony 34 2025-01-06
3 Peter 42 2024-10-17

Demo in Postgres, sorry I dont have Bigquery to test

本文标签: sqlDeleting duplicate records in BigQuery by DML onlyStack Overflow