Delta Table Partitioning - why only for tables bigger than 1 TB?

I've been building an ETL/ELT pipeline with Azure Databricks Workflows, Spark and Azure Data Lake. It processes changes from an Azure SQL Server database in near real time (a Change Data Capture process). On the SQL Server side I have a number of big tables of 30-50 GB, the biggest of them 190-220 GB, plus a lot of small tables. Compressed in Delta (Parquet) format they come down to 3-5 GB, the biggest ones to 45-55 GB on the Data Lake, but they are expected to grow. I am storing the data in Delta tables.

Because it makes the data easier to manage (lifecycle management, debugging of the process, etc.), I think it would be better to store the big tables as partitioned Delta tables. I ran some tests and observed that in some cases (when the written batch is big, i.e. millions of records) writing into the partitioned tables takes less time, but in most cases (thousands of records) it takes the same time as with non-partitioned tables. When a table is created for the first time, the partitioned table takes longer to create, which makes sense, since the metadata of a partitioned table is more complicated.

So far so good, but at the same time the Databricks documentation states that you should only partition tables that are bigger than 1 TB: https://docs.databricks.com/en/tables/partitions.html

https://learn.microsoft.com/en-us/azure/databricks/tables/partitions

Do you think that I should follow this rule and create the tables as non-partitioned? And what is the reasoning behind this rule? Why should I partition only tables that are bigger than 1 TB?

I am partitioning based on the year and month of ingestion of the data. The data will not be queried on those columns, but there is no requirement for a specific column that the data will be queried on.
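For illustration, this is roughly what the partitioned write looks like; a minimal sketch with placeholder table/column names rather than my actual pipeline code:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Placeholder CDC batch; in the real pipeline this comes from the CDC feed
cdc_batch = spark.read.format("delta").load("/mnt/raw/cdc/my_big_table")

# Derive ingestion year/month columns used only for partitioning
cdc_batch = (cdc_batch
    .withColumn("ingest_year", F.year(F.current_timestamp()))
    .withColumn("ingest_month", F.month(F.current_timestamp())))

# Append into a Delta table partitioned by ingestion year/month
(cdc_batch.write
    .format("delta")
    .mode("append")
    .partitionBy("ingest_year", "ingest_month")
    .saveAsTable("bronze.my_big_table"))
```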

Many thanks for your answer in advance!

  • What are the types of queries that you need to run on this data? – Rakesh Govindula
  • Let's say that it will be queried on different columns; there is no special column to query on. I plan to partition the tables based on the year and month of creation, since that is the basis of their ingestion. – Martin

1 Answer


Tables smaller than 1 TB are optimized automatically by built-in features; this is mentioned in the article you shared.

Built-in features like ingestion time clustering automatically cluster the data of unpartitioned tables by ingestion time, which is why you are seeing the same write times.
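For comparison, a minimal sketch (placeholder names) of a plain unpartitioned append, which is what ingestion time clustering applies to on Databricks:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Placeholder batch standing in for the CDC feed
cdc_batch = spark.range(1000).withColumnRenamed("id", "key")

# Plain unpartitioned append: Databricks clusters the files by ingestion time
# automatically, so small appends perform about the same as with explicit
# year/month partitions
(cdc_batch.write
    .format("delta")
    .mode("append")
    .saveAsTable("bronze.my_big_table_unpartitioned"))
```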

You also said the data is expected to grow; in that case you can use liquid clustering, which manages the data layout automatically (neither too many nor too few files) and adapts as the tables grow.
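A rough sketch of enabling liquid clustering (hypothetical table and column names; requires a Databricks Runtime and Delta version that support CLUSTER BY):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Create the table with liquid clustering instead of static partitions
# (Delta is the default table format on Databricks)
spark.sql("""
    CREATE TABLE IF NOT EXISTS bronze.my_big_table (
        id        BIGINT,
        payload   STRING,
        ingest_ts TIMESTAMP
    )
    CLUSTER BY (ingest_ts)
""")

# Writes stay plain appends; the clustering layout is managed for you
spark.sql("""
    INSERT INTO bronze.my_big_table
    SELECT 1, 'example', current_timestamp()
""")

# OPTIMIZE incrementally clusters newly ingested data
spark.sql("OPTIMIZE bronze.my_big_table")
```

Unlike partition columns, the clustering keys can also be changed later with ALTER TABLE ... CLUSTER BY if the query patterns become clearer.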
