Disclaimer: This is work in progress intended to consolidate information from various sources.
Last updated: 31-Jan-2024
Parquet format
- {definition} open source, column-oriented data file format designed for efficient data storage and retrieval [1]
- provides efficient data compression and encoding schemes with enhanced performance to handle complex data in bulk [1]
- designed to be a common interchange format for both batch and interactive workloads [1]
- {characteristic} open source file format
- similar to other columnar-storage file formats available in Hadoop [1]
- e.g. RCFile, ORC
- became an industry standard
- {benefit} provides interoperability across multiple tools
- {characteristic} language agnostic [1]
- different programming languages can be used to manipulate the data
- {characteristic} column-based format [1]
- files are organized by column
- ⇐ rather than by row
- ⇒ saves storage space and speeds up analytics queries [1]
- reads only the needed columns (see the sketch below)
- ⇐ non-relevant data are skipped
- ⇒ greatly minimizes the IO [1]
- aggregation queries are less time-consuming compared to row-oriented databases [1]
- {benefit} increased data throughput and performance [1]
- ⇒ recommended for analytical workloads
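A minimal sketch of column projection (pyarrow assumed; "events.parquet", "user_id" and "event_date" are hypothetical names): only the requested column chunks are read and deserialized, the remaining columns are skipped on disk.

```python
import pyarrow.parquet as pq

# read just two columns instead of the whole file; the other column chunks are never touched
table = pq.read_table("events.parquet", columns=["user_id", "event_date"])
print(table.num_rows, table.schema)
```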
- {characteristic} highly efficient data compression/decompression [1]
- supports flexible compression options and efficient encoding schemes [1]
- data can be compressed by using one of the several codecs available [1]
- ⇒ different data files can be compressed differently [1]
- reduced storage requirements [1]
- by at least one-third on large datasets
- ⇒ {benefit} saves on cloud storage space
- greatly improves scan and deserialization time [1]
- ⇒ {benefit} reduces the processing costs
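A minimal sketch of the codec flexibility (pyarrow assumed; file names are hypothetical): the codec is chosen per write, so files belonging to the same dataset can be compressed differently.

```python
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"id": list(range(1000)), "value": [x * 0.5 for x in range(1000)]})

pq.write_table(table, "data_snappy.parquet", compression="snappy")  # fast compression/decompression
pq.write_table(table, "data_zstd.parquet", compression="zstd")      # better compression ratio
pq.write_table(table, "data_gzip.parquet", compression="gzip")      # widely supported
```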
- {downside} can be slower to write than row-based file formats
- primarily because the writer must also compute and store metadata about the file contents
- though read times remain fast
- {characteristic} supports complex data types and advanced nested data structures [1]
- implemented using the record-shredding and assembly algorithm
- accommodates complex data structures that can be used to store the data [1]
- optimized to work with complex data in bulk and features different ways for efficient data compression and encoding types [1]
- the approach is best especially for those queries that need to read certain columns from a large table [1]
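A minimal sketch of nested data support (pyarrow assumed; table and column names are hypothetical): structs and lists round-trip through Parquet, which shreds them into columns on write and reassembles the records on read.

```python
import pyarrow as pa
import pyarrow.parquet as pq

orders = pa.table({
    "order_id": [1, 2],
    "customer": [{"name": "Ana", "country": "RO"}, {"name": "Bob", "country": "DE"}],  # struct column
    "items": [["book", "pen"], ["laptop"]],                                            # list column
})
pq.write_table(orders, "orders.parquet")
print(pq.read_table("orders.parquet").schema)  # shows struct<...> and list<...> column types
```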
- {characteristic} cloud-ready
- works best with interactive and serverless technologies [1]
- {characteristic} immutable
- a file can't be updated to rename, reorder or drop columns [2]
- ⇐ requires rewriting the whole file [2]
- {characteristic} binary-based file
- ⇒ not easily readable (by humans)
- {characteristic} self-describing
- contains metadata about schema and structure
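A minimal sketch of the self-describing aspect (pyarrow assumed; "events.parquet" is hypothetical): schema and structural metadata can be inspected without scanning the data pages.

```python
import pyarrow.parquet as pq

pf = pq.ParquetFile("events.parquet")
print(pf.schema_arrow)                                   # column names and types from the file itself
print(pf.metadata.num_row_groups, pf.metadata.num_rows)  # structural metadata stored in the footer
```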
- {concept} row groups (aka segments)
- holds a subset of the table's rows, with the values stored column by column within the group (one column chunk per column)
- {constraint} column names are case sensitive
- {concept} file footer
- stores metadata statistics for each row group [2]
- min/max statistics
- the number of rows
- can be leveraged by data processing engines to run queries more efficiently [2]
- ⇐ depending on the query, entire row groups can be skipped [2]
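A minimal sketch of reading those footer statistics (pyarrow assumed; "events.parquet" is hypothetical): the min/max values and row counts are available per row group without touching the data pages, which is what makes row-group skipping possible.

```python
import pyarrow.parquet as pq

meta = pq.ParquetFile("events.parquet").metadata
for rg in range(meta.num_row_groups):
    row_group = meta.row_group(rg)
    col = row_group.column(0)  # metadata of the first column chunk in this row group
    if col.statistics is not None:
        print(rg, col.path_in_schema, col.statistics.min, col.statistics.max, row_group.num_rows)
```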
- {concept} file header
- contains the 4-byte magic number 'PAR1' that identifies the file as Parquet
- large datasets can be split across multiple parquet files
- ⇐ the structure can be flat or hierarchical
- managing multiple files has several challenges
- the files can be used to define a table (aka parquet table)
- ⇐ {constraint} the files must have the same definition
- ⇐ schema enforcement must be coded manually [2]
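A minimal sketch of such a manual check (pyarrow assumed; "table_dir/" is a hypothetical folder): every file's schema is compared against the first one before the folder is treated as a single table.

```python
import pyarrow.parquet as pq
from pathlib import Path

files = sorted(Path("table_dir").glob("*.parquet"))
reference = pq.read_schema(files[0])          # schema of the first file as the reference
for f in files[1:]:
    if not pq.read_schema(f).equals(reference):
        raise ValueError(f"schema mismatch in {f}")
```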
- {limitation} [Data Lake] no support for ACID transactions [2]
- ⇒ easy to corrupt [2]
- partially written files will break any subsequent read operations
- the compute engine will try to read in the corrupt files and error out [2]
- corrupted files must be identified and deleted manually to fix the issue [2]
- {limitation} it's not easy to delete rows from it [2]
- requires reading all the data, filtering out the data not needed, and then rewriting the entire table [2]
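A minimal sketch of such a delete (pyarrow assumed; file and column names are hypothetical): the data is read, filtered, and the whole file rewritten, as there is no in-place DELETE.

```python
import pyarrow.compute as pc
import pyarrow.parquet as pq

table = pq.read_table("events.parquet")
keep = pc.not_equal(table["user_id"], 42)             # boolean mask of the rows to keep
pq.write_table(table.filter(keep), "events.parquet")  # rewrite the entire file
```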
- {limitation} doesn't support DML transactions [2]
- {limitation} there is no change data feed [2]
- {limitation} slow file listing [2]
- small files require excessive I/O overhead
- ideally the files should be between 64 MB and 1 GB
- ideally the files should be compacted into larger files (aka small file compaction, bin-packing)
- {limitation} expensive footer reads to gather statistics for file skipping [2]
- fetching all the footers and building the file-level metadata for the entire table is slow [2]
- ⇐ it requires a file-listing operation [2]
- the effectiveness of data skipping depends on how many files can be skipped when performing a query (see the sketch below) [2]
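A minimal sketch of gathering those statistics (pyarrow assumed; "table_dir/" is a hypothetical folder): every file has to be listed and its footer opened, which is the slow part, especially on object storage.

```python
import pyarrow.parquet as pq
from pathlib import Path

file_ranges = {}
for f in sorted(Path("table_dir").glob("*.parquet")):   # the file-listing operation
    meta = pq.ParquetFile(f).metadata                   # one footer read per file
    stats = [meta.row_group(i).column(0).statistics for i in range(meta.num_row_groups)]
    stats = [s for s in stats if s is not None]
    if stats:
        file_ranges[f.name] = (min(s.min for s in stats), max(s.max for s in stats))
# a query can skip any file whose [min, max] range cannot match its predicate
```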
- {limitation} doesn't support schema enforcement [2]
- {limitation} doesn't support check constraints [2]
- {limitation} doesn't support data versioning [2]
- {concept} table partitioning
- {definition} common optimization approach used to store the data of the same table in different directories, with partitioning column values encoded in the path of each partition directory [6]
- {recommendation} avoid partitioning by columns with very high cardinality
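A minimal sketch of a partitioned write (pyarrow assumed; folder and column names are hypothetical): the partition column value is encoded in the directory path, e.g. sales/country=RO/part-0.parquet, which is why low-cardinality columns are preferred.

```python
import pyarrow as pa
import pyarrow.dataset as ds

sales = pa.table({"country": ["RO", "RO", "DE"], "amount": [10.0, 20.0, 15.0]})
ds.write_dataset(sales, "sales", format="parquet",
                 partitioning=["country"], partitioning_flavor="hive")  # hive-style directories
```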
- {concept} bin-packing (aka compaction, bin-compaction)
- aims to produce evenly-balanced data files with respect to their size on disk
- ⇐ but not necessarily with respect to the number of tuples per file [7]
- requires an algorithm that efficiently organizes the files into equal size containers [6]
- {characteristic} idempotent
- if it is run twice on the same dataset, the second run has no effect [7]
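A minimal sketch of a naive bin-packing pass (pyarrow assumed; folder names are hypothetical): many small files are read and rewritten as fewer, larger ones; unlike Delta Lake's OPTIMIZE, nothing here protects concurrent readers.

```python
import pyarrow.dataset as ds

small_files = ds.dataset("landing_dir", format="parquet")
ds.write_dataset(small_files, "compacted_dir", format="parquet",
                 min_rows_per_group=500_000,   # produce larger row groups
                 max_rows_per_file=5_000_000)  # and fewer, bigger files
```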
- {feature} [Microsoft Fabric] V-order
- {definition} write-time optimization to the parquet file format that enables lightning-fast reads under the Microsoft Fabric compute engines [3]
- applies special sorting, row group distribution, dictionary encoding and compression on parquet files [3]
- requires fewer compute engine resources to read it [3]
- provides further cost efficiency and performance
- has a 15% impact on average write times but provides up to 50% more compression [3]
- {characteristic} open-source parquet format compliant
- all parquet engines can read it as a regular parquet file [3]
- ⇐ table properties and optimization commands can be used to control V-Order on its partitions [3]
- compatible with other features [3]
- applied at parquet file level [3]
- enabled by default
- {command} OPTIMIZE
- merges all changes into bigger, consolidated parquet files (aka bin-compaction) [3]
- [Spark] dynamically optimizes partitions while generating files with a default 128 MB size [5]
- the target file size may be changed per workload requirements using configurations [5]
- properly designing the table physical structure based on the ingestion frequency and expected read patterns is likely more important than running the optimization command [3]
- running the compaction operation brings the data lake into an unusable state for readers [7]
- {warning} manually compacting the files is inefficient and error prone [7]
- no way to differentiate files that contain new data from files that contain existing data that was just compacted into new files [7]
- [Delta Lake] when ZORDER and VORDER are used together, Apache Spark performs bin-compaction, ZORDER, VORDER sequentially [3]
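A minimal sketch of combining them (PySpark in a Microsoft Fabric notebook assumed, where `spark` is the session provided by the runtime; "sales" and "customer_id" are hypothetical table/column names):

```python
# bin-compaction, ZORDER and VORDER are applied sequentially by this single command
spark.sql("OPTIMIZE sales ZORDER BY (customer_id) VORDER")
```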
ACID - atomicity, consistency, isolation, durability
IO - Input/Output
ORC - Optimized Row Columnar
RCFile - Record Columnar File
Resources:
[1] Databricks (2023) What is Parquet? (link)
[2] Delta Lake (2023) Delta Lake vs. Parquet Comparison (link)
[3] Data Mozart (2023) Parquet file format – everything you need to know! (link)
[4] Microsoft Learn (2023) Query Parquet files using serverless SQL pool in Azure Synapse Analytics (link)
[5] Microsoft Learn (2023) Lakehouse tutorial: Prepare and transform data in the lakehouse (link)
[6] Apache Spark (2023) Spark SQL Guide (link)
[7] Delta Lake (2023) Delta Lake Small File Compaction with OPTIMIZE (link)
[8] Delta Lake (2023) Optimizations (link)