Among the solutions I have tried, Delta has a VACUUM command that removes old versions from a Delta table. However, I would like to remove a partition based on the modification date of the partition.
How can I do this efficiently with Delta features and without losing durability?
Asked Dec 18, 2024 at 5:00 by Yash, edited Feb 24 at 9:05
Comments:
- Did you try a simple delete? – Steven, Dec 18, 2024 at 16:38
- I have tried a simple delete. It creates a new version in the Delta log, but the .parquet file is still present. – Yash, Dec 22, 2024 at 8:55
2 Answers
You're mixing up two different things here:
- Cleaning up files that no longer belong to the Delta table (this is what VACUUM does).
- Deleting data that still exists in the table but is considered outdated.
In your case, there's no built-in way to remove partitions based on the modification date directly. A good workaround would be to add a new column to your table to track the "update timestamp" (basically your modification date).
Once you have that, you can simply run a DELETE query like this:
DELETE FROM your_table
WHERE update_date < '2024-11-30';
Delta Lake supports standard DML commands like DELETE, so this works natively without any additional setup.
If you can't add a new column, you'll need to extract some other information to identify the rows you want to remove. If the partition date itself is sufficient, you can use it and perform a simple DELETE.
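The DELETE approach above can be sketched in PySpark as follows. This is a minimal sketch, not a definitive implementation: it assumes an existing SparkSession with Delta enabled is passed in as `spark`, and the helper names (`build_partition_delete`, `delete_old_partitions`) and the `retention_days` parameter are hypothetical. `your_table` and `update_date` come from the query above.

```python
from datetime import date, timedelta
from typing import Optional


def build_partition_delete(table: str, partition_col: str, cutoff: date) -> str:
    """Return a DELETE statement for every row older than `cutoff`."""
    # ISO format (YYYY-MM-DD) matches the date literal used in the query above.
    return f"DELETE FROM {table} WHERE {partition_col} < '{cutoff.isoformat()}'"


def delete_old_partitions(spark, table: str, partition_col: str,
                          retention_days: int,
                          today: Optional[date] = None) -> str:
    """Delete rows whose date column is more than `retention_days` old.

    `spark` is assumed to be a SparkSession configured for Delta Lake.
    """
    today = today or date.today()
    cutoff = today - timedelta(days=retention_days)
    statement = build_partition_delete(table, partition_col, cutoff)
    spark.sql(statement)  # commits a new table version; old files remain on disk
    return statement
```

A call such as `delete_old_partitions(spark, "your_table", "update_date", 30)` would remove everything older than 30 days. When the predicate only references partition columns, Delta can usually drop whole files from the table without rewriting data, which is why filtering on the partition date tends to be the cheaper option.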
Read up on time travel and vacuuming.
The DELETE command deletes the data from the current/latest version of the table (say v10) and creates a new version (say v11), which becomes the current/latest version. By default, when you interact with a table (e.g. SELECT * FROM table), you're interacting with v11. You can also interact with an older version of the table, as long as that version still exists, e.g. SELECT * FROM table VERSION AS OF 8.
Note that the versions are actually numbers (8, 10, 11), not text (v8, v10, v11).
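As a rough illustration of time travel from PySpark, the sketch below builds the two statements involved: one that lists the available versions and one that reads a specific version. The helper names are hypothetical, `spark` is assumed to be an existing SparkSession with Delta enabled, and `VERSION AS OF` in SQL depends on the Delta and Spark versions in use.

```python
def history_sql(table: str) -> str:
    """Statement that lists the table's versions, newest first."""
    return f"DESCRIBE HISTORY {table}"


def read_version_sql(table: str, version: int) -> str:
    """Statement that reads the table as of a numeric version."""
    if version < 0:
        raise ValueError("Delta table versions start at 0")
    # Versions are plain integers (8), not labels such as "v8".
    return f"SELECT * FROM {table} VERSION AS OF {int(version)}"


def show_version(spark, table: str, version: int) -> None:
    """Print the history, then the contents of one older version."""
    spark.sql(history_sql(table)).show(truncate=False)
    spark.sql(read_version_sql(table, version)).show()
```

Running `show_version(spark, "your_table", 8)` only succeeds while the data files behind version 8 still exist, which is the constraint described above.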
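VACUUM deletes the data files that are only referenced by older "versions" (snapshots) of the table, e.g. v10, once those files are older than the retention period (7 days by default). This is why the .parquet files are still present right after a DELETE. After you vacuum, you lose the ability to time travel to the versions whose files were removed.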
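To physically remove the files left behind by a delete, a VACUUM can be issued afterwards. The sketch below is an illustration under assumptions: `spark` is an existing SparkSession with Delta enabled, the helper names are hypothetical, and 168 hours is the default 7-day retention. A dry run lists the files that would be deleted without touching them.

```python
DEFAULT_RETENTION_HOURS = 168  # 7 days, Delta's default retention period


def vacuum_sql(table: str, retain_hours: int = DEFAULT_RETENTION_HOURS,
               dry_run: bool = False) -> str:
    """Build a VACUUM statement for `table`."""
    if retain_hours < 0:
        raise ValueError("retain_hours must be non-negative")
    statement = f"VACUUM {table} RETAIN {retain_hours} HOURS"
    if dry_run:
        # DRY RUN only reports the files that would be removed.
        statement += " DRY RUN"
    return statement


def vacuum_table(spark, table: str,
                 retain_hours: int = DEFAULT_RETENTION_HOURS) -> None:
    """Preview the files to be removed, then run the actual VACUUM."""
    spark.sql(vacuum_sql(table, retain_hours, dry_run=True)).show(truncate=False)
    spark.sql(vacuum_sql(table, retain_hours))
```

Keeping the default retention preserves time travel for recent versions and protects concurrent readers. Delta rejects a shorter retention unless its safety check is explicitly disabled, so lowering `retain_hours` trades durability for earlier cleanup.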
Source: pyspark - How to do delta table deletion for a partition based on the creation/modification date of the partition folder - Stack, http://www.betaflare.com/web/1742191702a2430334.html