admin管理员组文章数量:1406950
I am using Google-BigQuery. I want to delete duplicate records (retain only one record per group) from a table in Prod env. I tried the below method by creating temp tables, but it appears that I don't have access to execute DDLs
(CREATE table
) in Prod. Hence, I need to use only DML
statements to delete the duplicates.
id | fname | age| dt
1 | 'John' | 23 | '2024-10-16'
1 | 'John' | 31 | '2024-12-23'
1 | 'John' | 29 | '2025-02-08'
2 | 'Tony' | 25 | '2024-11-24'
2 | 'Tony' | 34 | '2025-01-06'
3 | 'Peter'| 42 | '2024-10-17'
Expected:
id | fname | age| dt
1 | 'John' | 29 | '2025-02-08'
2 | 'Tony' | 34 | '2025-01-06'
3 | 'Peter'| 42 | '2024-10-17'
Duplicates determined on [id + fname]
and retain record based on latest [dt]
.
Code that I came up with which uses CREATE statements:
create or replace table `prod.emp_dedup`
as
select * from `prod.emp`
qualify row_number() over (partition by id, fname order by dt desc) = 1;
truncate table `prod.emp`;
insert into `prod.emp` select * from `prod.emp_dedup`;
drop table `prod.emp_dedup`;
I am unable to execute the above statements since I don't have DDL access in Prod. How can I achieve this by using DELETE
(DML) statement(s) only ?
Thanks.
I am using Google-BigQuery. I want to delete duplicate records (retain only one record per group) from a table in Prod env. I tried the below method by creating temp tables, but it appears that I don't have access to execute DDLs
(CREATE table
) in Prod. Hence, I need to use only DML
statements to delete the duplicates.
id | fname | age| dt
1 | 'John' | 23 | '2024-10-16'
1 | 'John' | 31 | '2024-12-23'
1 | 'John' | 29 | '2025-02-08'
2 | 'Tony' | 25 | '2024-11-24'
2 | 'Tony' | 34 | '2025-01-06'
3 | 'Peter'| 42 | '2024-10-17'
Expected:
id | fname | age| dt
1 | 'John' | 29 | '2025-02-08'
2 | 'Tony' | 34 | '2025-01-06'
3 | 'Peter'| 42 | '2024-10-17'
Duplicates determined on [id + fname]
and retain record based on latest [dt]
.
Code that I came up with which uses CREATE statements:
create or replace table `prod.emp_dedup`
as
select * from `prod.emp`
qualify row_number() over (partition by id, fname order by dt desc) = 1;
truncate table `prod.emp`;
insert into `prod.emp` select * from `prod.emp_dedup`;
drop table `prod.emp_dedup`;
I am unable to execute the above statements since I don't have DDL access in Prod. How can I achieve this by using DELETE
(DML) statement(s) only ?
Thanks.
Share asked Mar 6 at 20:28 marie20marie20 81320 silver badges37 bronze badges 1 |2 Answers
Reset to default 0You can achieve this using only DML statements by leveraging a common table expression (CTE) to identify the duplicates and then delete them.
-- Step 1: Identify the rows to retain
WITH latest_records AS (
SELECT
id,
fname,
age,
dt,
ROW_NUMBER() OVER (PARTITION BY id, fname ORDER BY dt DESC) AS rn
FROM
`prod.emp`
)
-- Step 2: Delete the duplicates
DELETE FROM `prod.emp`
WHERE (id, fname, dt) NOT IN (
SELECT id, fname, dt
FROM latest_records
WHERE rn = 1
);
The CTE `latest_records` assigns a row number (`rn`) to each row within each group of `id` and `fname`, ordered by `dt` in descending order. The row with `rn = 1` is the latest record for each group.The DELETE statement removes all rows from `prod.emp` that are not the latest record for their respective groups
Hope it gave you some pointers.
Based on the documentation it is not clear whether bigquery supports DELETE
with CTE
.
If it does not, you can try below which is standard SQL
DELETE FROM t
WHERE id IN (
SELECT id FROM t
GROUP BY id, fname HAVING COUNT(*) > 1)
AND dt NOT IN (
SELECT MAX(dt) FROM t GROUP BY id,fname
);
First get the ids that are duplicate based on id+fname
and then for the dates choose only dates that are lower than the MAX
dates again based on the group of id+fname.
Output
id | fname | age | dt |
---|---|---|---|
1 | John | 29 | 2025-02-08 |
2 | Tony | 34 | 2025-01-06 |
3 | Peter | 42 | 2024-10-17 |
Demo in Postgres, sorry I dont have Bigquery to test
本文标签: sqlDeleting duplicate records in BigQuery by DML onlyStack Overflow
版权声明:本文标题:sql - Deleting duplicate records in BigQuery by DML only - Stack Overflow 内容由网友自发贡献,该文观点仅代表作者本人, 转载请联系作者并注明出处:http://www.betaflare.com/web/1744951068a2634088.html, 本站仅提供信息存储空间服务,不拥有所有权,不承担相关法律责任。如发现本站有涉嫌抄袭侵权/违法违规的内容,一经查实,本站将立刻删除。
create temp table emp_dedup
your query may just work – Yun Zhang Commented Mar 12 at 4:34