Disclaimer: This is work in progress intended to consolidate information from various sources for learning purposes. For the latest information please consult the documentation (see the links below)!
Last updated: 14-Feb-2024
[Microsoft Fabric] Partitions
- {def} a data organization technique used to split a large dataset into smaller, more manageable nonoverlapping subsets (aka partitions, shards)
- a partition is defined based on one or more fields
- each partition contains a subset of the data
- each partition can be stored and processed independently
- {goal} improve performance, scalability, and manageability of large data tables
- {benefit} allows splitting large tables into smaller, manageable partitions based on specific criteria [2]
- e.g., date ranges, regions, categories, entities
- {benefit} allows improving query performance, as queries can target specific partitions [2]
 - reduces the amount of data scanned [2]
- {benefit} allows for more efficient data loading [2]
- {benefit} facilitates the management of big tables [2]
- maintenance tasks can be performed on individual partitions [2]
- obsolete data partitions can be removed with minimal overhead, and new partitions can be added as needed [2]
- applies to
- backups
- indexing
- allows optimizing query performance for specific subsets of data
- statistics
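As a minimal sketch of the idea (plain Python with hypothetical data, not Fabric-specific code), partitioning splits a dataset into nonoverlapping subsets keyed by the partition column; a query filtering on that column only touches the matching subset, and an obsolete partition can be dropped without rewriting the rest:

```python
from collections import defaultdict

# Hypothetical sales records; 'region' is the partition column.
rows = [
    {"region": "EU", "amount": 10},
    {"region": "US", "amount": 25},
    {"region": "EU", "amount": 7},
    {"region": "APAC", "amount": 12},
]

# Split the dataset into nonoverlapping subsets, one per partition key.
partitions = defaultdict(list)
for row in rows:
    partitions[row["region"]].append(row)

# A query filtering on the partition column scans only one subset
# (partition pruning) instead of the whole table.
eu_total = sum(r["amount"] for r in partitions["EU"])

# Dropping an obsolete partition removes just that subset; the other
# partitions are untouched.
del partitions["APAC"]
```

In an engine such as Delta Lake, each partition typically maps to a folder of files named after the partition value, which is what makes both pruning and partition-level maintenance cheap.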
- performance can be affected by
- the choice of partition columns for a delta table [1]
- the number and size of the partitions resulting from the partition column [1]
- a column with high cardinality (mostly or entirely made of unique values) results in a large number of partitions [1]
- ⇐ negatively impacts performance of the metadata discovery scan for changes [1]
- {recommendation} if the cardinality of a column is high, choose another column for partitioning [1]
- the size of each partition can also affect performance
- {recommendation} use a column that would result in partitions of at least (or close to) 1 GB each [1]
- {recommendation} follow the best practices for delta tables maintenance [1]
- a large volume of small-sized parquet files increases the time it takes to sync the changes [1]
- a large number of parquet files in a delta table can result from [1]
 - over-partitioning
 - partitioning on a column with a high number of unique values [1]
- {recommendation} choose a partition column that
- doesn't have a high cardinality [1]
- results in individual partition size of at least 1 GB [1]
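The two checks above (low cardinality, partitions near 1 GB) can be estimated before choosing a column. The sketch below is illustrative plain Python: `row_size_bytes` and the cardinality threshold are assumed values, and in practice the sizes would come from the actual parquet footprint:

```python
from collections import Counter

ONE_GB = 1024 ** 3

def partition_stats(rows, column, row_size_bytes=200):
    """Rough stats for a candidate partition column: cardinality and
    the average partition size it would produce.

    row_size_bytes is an assumed average serialized row size.
    """
    counts = Counter(row[column] for row in rows)
    cardinality = len(counts)
    avg_rows_per_partition = len(rows) / cardinality
    avg_partition_bytes = avg_rows_per_partition * row_size_bytes
    return cardinality, avg_partition_bytes

def good_partition_column(rows, column, max_cardinality=1000):
    """A column qualifies if its cardinality is not high and the
    resulting partitions would be at least ~1 GB each."""
    cardinality, avg_bytes = partition_stats(rows, column)
    return cardinality <= max_cardinality and avg_bytes >= ONE_GB
```

A near-unique column (e.g. an order ID) fails on cardinality, while a very selective filter column with tiny partitions fails on size; the recommendation is to find a column that passes both tests.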
- many small files
- batch and streaming data ingestion rates might result in small files
- depends on the frequency and size of changes being written to a lakehouse [1]
- {recommendation} implement regular lakehouse table maintenance [1]
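The core idea behind such maintenance (e.g. Delta Lake's OPTIMIZE command, which compacts files) can be sketched in plain Python: group many small files into batches close to a target size, so each batch can be rewritten as one larger file and readers (and the metadata sync) process fewer files. The 128 MB target here is an illustrative assumption, not a Fabric default:

```python
def plan_compaction(file_sizes, target=128 * 1024 ** 2):
    """Greedily group small files into batches of at most `target` bytes.

    Each batch would be rewritten as a single larger file, reducing
    the total file count in the delta table.
    """
    batches, current, current_size = [], [], 0
    for size in sorted(file_sizes):
        if current and current_size + size > target:
            batches.append(current)
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        batches.append(current)
    return batches
```

For example, thirty 10 MB files compact into three batches under a 128 MB target, cutting the file count by an order of magnitude.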
[1] Microsoft Learn (2024) Microsoft Fabric: SQL analytics endpoint performance considerations [link]