SQL Troubles: 🏭🧊🗒️Microsoft Fabric: Partitions in Lakehouses [Notes]

14 February 2025

🏭🧊🗒️Microsoft Fabric: Partitions in Lakehouses [Notes]

Disclaimer: This is work in progress intended to consolidate information from various sources for learning purposes. For the latest information please consult the documentation (see the links below)!

Last updated: 14-Feb-2024

[Microsoft Fabric] Partitions

{def} a data organization technique used to split a large dataset into smaller, more manageable nonoverlapping subsets (aka partitions, shards)

a pattition is defined based on one or more fields
each partition contains a subset of the data
each partitions can be stored and processed independently

{goal} improve performance, scalability, and manageability of large data tables
{benefit} allows to split large tables into smaller, manageable partitions based on specific criteria [2]

e.g., date ranges, regions, categories, entities

{benefit} allows to improve queries' performance as they can target specific partitions [2]

reduces the amount of data scanned [2]
improves queries' performance [2]

{benefit} allows for more efficient data loading [2]
{benefit} facilitates the management of big tables [2]

maintenance tasks can be performed on individual partitions [2]
obsolete data partitions can be removed with no overhead, adding new partitions on a need basis [2]

applies to

backups
indexing
allows optimizing query performance for specific subsets of data
statistics

performance can be affected by

the choice of partition columns for a delta table [1]
the number and size of partitions of the partition column [1]
a column with high cardinality (mostly or entirely made of unique values) results in a large number of partitions [1]

⇐ negatively impacts performance of the metadata discovery scan for changes [1]
{recommendation} if the cardinality of a column is high, choose another column for partitioning [1]

the size of each partition can also affect performance

{recommendation} use a column that would result in a partition of at least (or close to) 1 GB [1]
{recommendation} follow the best practices for delta tables maintenance [1]
a large volume of small-sized parquet files increases the time it takes to sync the changes [1]

⇒ leads to large number of parquet files in a delta table due to [1]

over-partitioning

partitions with high number of unique values [1]
{recommendation} choose a partition column that

doesn't have a high cardinality [1]
results in individual partition size of at least 1 GB [1]

many small files

batch and streaming data ingestion rates might result in small files

depends on the frequency and size of changes being written to a lakehouse [1]

{recommendation} implement regular lakehouse table maintenance [1]

Previous Post <<||>> Next Post

References:
[1] Microsoft Learn (2024) Microsoft Fabric: SQL analytics endpoint performance considerations [link]

[2] Kenneth A Omorodion (2024) Partitioning Data in Microsoft Fabric to Improve Performance

written [link]

[3] Microsoft Learn (2024) Microsoft Fabric: Loading Fabric Lakehouse Tables with partitions [link]

[4]

Resources

[R1] Microsoft Learn (2024) Microsoft Fabric: Load data to Lakehouse using partition in a Data pipeline [link]

[R2] Microsoft Learn (2025) Fabric: What's new in Microsoft Fabric? [link]

SQL Troubles

Pages

14 February 2025

🏭🧊🗒️Microsoft Fabric: Partitions in Lakehouses [Notes]

No comments:

About Me