Disclaimer: This is work in progress intended to consolidate information from various sources.
Last updated: 31-Jan-2024
Parquet format
- {definition} open source, column-oriented data file format designed for efficient data storage and retrieval [1]
- provides efficient data compression and encoding schemes with enhanced performance to handle complex data in bulk [1]
- designed to be a common interchange format for both batch and interactive workloads [1]
- {characteristic} open source file format
- similar to other columnar-storage file formats available in Hadoop [1]
- e.g. RCFile, ORC
- became an industry standard
- {benefit} provides interoperability across multiple tools
- {characteristic} language agnostic [1]
- different programming languages can be used to manipulate the data
- {characteristic} column-based format [1]
- files are organized by column
- ⇐ rather than by row
- ⇒ saves storage space and speeds up analytics queries [1]
- reads only the needed columns (see the sketch below)
- ⇐ non-relevant data are skipped
- ⇒ greatly minimizes the IO [1]
- aggregation queries are less time-consuming compared to row-oriented databases [1]
- {benefit} increased data throughput and performance [1]
- ⇒ recommended for analytical workloads
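A minimal sketch of column projection (pyarrow assumed; "events.parquet", "user_id" and "event_date" are hypothetical names): only the requested column chunks are read and deserialized, the remaining columns are skipped on disk.

```python
import pyarrow.parquet as pq

# read just two columns instead of the whole file; the other column chunks are never touched
table = pq.read_table("events.parquet", columns=["user_id", "event_date"])
print(table.num_rows, table.schema)
```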
- {characteristic} highly efficient data compression/decompression [1]
- supports flexible compression options and efficient encoding schemes [1]
- data can be compressed by using one of the several codecs available [1]
- ⇒ different data files can be compressed differently [1]
- reduced storage requirements [1]
- by at least one-third on large datasets
- ⇒ {benefit} saves on cloud storage space
- greatly improves scan and deserialization time [1]
- ⇒ {benefit} reduces the processing costs
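A minimal sketch of the codec flexibility (pyarrow assumed; file names are hypothetical): the codec is chosen per write, so files belonging to the same dataset can be compressed differently.

```python
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"id": list(range(1000)), "value": [x * 0.5 for x in range(1000)]})

pq.write_table(table, "data_snappy.parquet", compression="snappy")  # fast compression/decompression
pq.write_table(table, "data_zstd.parquet", compression="zstd")      # better compression ratio
pq.write_table(table, "data_gzip.parquet", compression="gzip")      # widely supported
```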
- {downside} can be slower to write than row-based file formats
- primarily because the writer must also compute and store metadata about the file contents
- though read times remain fast
- {characteristic} supports complex data types and advanced nested data structures [1]
- implemented using the record-shredding and assembly algorithm
- accommodates complex data structures that can be used to store the data [1]
- optimized to work with complex data in bulk and features different ways for efficient data compression and encoding types [1]
- the approach is best especially for those queries that need to read certain columns from a large table [1]
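A minimal sketch of nested data support (pyarrow assumed; table and column names are hypothetical): structs and lists round-trip through Parquet, which shreds them into columns on write and reassembles the records on read.

```python
import pyarrow as pa
import pyarrow.parquet as pq

orders = pa.table({
    "order_id": [1, 2],
    "customer": [{"name": "Ana", "country": "RO"}, {"name": "Bob", "country": "DE"}],  # struct column
    "items": [["book", "pen"], ["laptop"]],                                            # list column
})
pq.write_table(orders, "orders.parquet")
print(pq.read_table("orders.parquet").schema)  # shows struct<...> and list<...> column types
```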
- {characteristic} cloud-ready
- works best with interactive and serverless technologies [1]
- {characteristic} immutable
- a file can't be updated to rename, reorder or drop columns [2]
- ⇐ requires rewriting the whole file [2]
- {characteristic} binary-based file
- ⇒ not easily readable (by humans)
- {characteristic} self-describing
- contains metadata about schema and structure
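A minimal sketch of the self-describing aspect (pyarrow assumed; "events.parquet" is hypothetical): schema and structural metadata can be inspected without scanning the data pages.

```python
import pyarrow.parquet as pq

pf = pq.ParquetFile("events.parquet")
print(pf.schema_arrow)                                   # column names and types from the file itself
print(pf.metadata.num_row_groups, pf.metadata.num_rows)  # structural metadata stored in the footer
```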
- {concept} row groups (aka segments)
- holds a subset of the table's rows, with the values stored column by column within the group (one column chunk per column)
- {constraint} column names are case sensitive
- {concept} file footer
- stores metadata statistics for each row group [2]
- min/max statistics
- the number of rows
- can be leveraged by data processing engines to run queries more efficiently [2]
- ⇐ depending on the query, entire row groups can be skipped [2]
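A minimal sketch of reading those footer statistics (pyarrow assumed; "events.parquet" is hypothetical): the min/max values and row counts are available per row group without touching the data pages, which is what makes row-group skipping possible.

```python
import pyarrow.parquet as pq

meta = pq.ParquetFile("events.parquet").metadata
for rg in range(meta.num_row_groups):
    row_group = meta.row_group(rg)
    col = row_group.column(0)  # metadata of the first column chunk in this row group
    if col.statistics is not None:
        print(rg, col.path_in_schema, col.statistics.min, col.statistics.max, row_group.num_rows)
```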
- {concept} file header
- contains the 4-byte magic number 'PAR1' that identifies the file as Parquet
- large datasets can be split across multiple parquet files
- ⇐ the structure can be flat or hierarchical
- managing multiple files has several challenges
- the files can be used to define a table (aka parquet table)
- ⇐ {constraint} the files must have the same definition
- ⇐ schema enforcement must be coded manually [2]
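A minimal sketch of such a manual check (pyarrow assumed; "table_dir/" is a hypothetical folder): every file's schema is compared against the first one before the folder is treated as a single table.

```python
import pyarrow.parquet as pq
from pathlib import Path

files = sorted(Path("table_dir").glob("*.parquet"))
reference = pq.read_schema(files[0])          # schema of the first file as the reference
for f in files[1:]:
    if not pq.read_schema(f).equals(reference):
        raise ValueError(f"schema mismatch in {f}")
```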
- {limitation} [Data Lake] no support for ACID transactions [2]
- ⇒ easy to corrupt [2]
- partially written files will break any subsequent read operations
- the compute engine will try to read in the corrupt files and error out [2]
- corrupted files must be identified and deleted manually to fix the issue [2]
- {limitation} it's not easy to delete rows from it [2]
- requires reading all the data, filtering out the data not needed, and then rewriting the entire table [2]
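A minimal sketch of such a delete (pyarrow assumed; file and column names are hypothetical): the data is read, filtered, and the whole file rewritten, as there is no in-place DELETE.

```python
import pyarrow.compute as pc
import pyarrow.parquet as pq

table = pq.read_table("events.parquet")
keep = pc.not_equal(table["user_id"], 42)             # boolean mask of the rows to keep
pq.write_table(table.filter(keep), "events.parquet")  # rewrite the entire file
```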
- {limitation} doesn't support DML transactions [2]
- {limitation} there is no change data feed [2]
- {limitation} slow file listing [2]
- small files require excessive I/O overhead
- ideally the files should be between 64 MB and 1 GB
- ideally the files should be compacted into larger files (aka small file compaction, bin-packing)
- {limitation} expensive footer reads to gather statistics for file skipping [2]
- fetching all the footers and building the file-level metadata for the entire table is slow [2]
- ⇐ it requires a file-listing operation [2]
- the effectiveness of data skipping depends on how many files can be skipped when performing a query (see the sketch below) [2]
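A minimal sketch of gathering those statistics (pyarrow assumed; "table_dir/" is a hypothetical folder): every file has to be listed and its footer opened, which is the slow part, especially on object storage.

```python
import pyarrow.parquet as pq
from pathlib import Path

file_ranges = {}
for f in sorted(Path("table_dir").glob("*.parquet")):   # the file-listing operation
    meta = pq.ParquetFile(f).metadata                   # one footer read per file
    stats = [meta.row_group(i).column(0).statistics for i in range(meta.num_row_groups)]
    stats = [s for s in stats if s is not None]
    if stats:
        file_ranges[f.name] = (min(s.min for s in stats), max(s.max for s in stats))
# a query can skip any file whose [min, max] range cannot match its predicate
```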
- {limitation} doesn't support schema enforcement [2]
- {limitation} doesn't support check constraints [2]
- {limitation} doesn't support data versioning [2]
- {concept} table partitioning
- {definition} common optimization approach used to store the data of the same table in different directories, with partitioning column values encoded in the path of each partition directory [6]
- {recommendation} avoid partitioning by columns with very high cardinality
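A minimal sketch of a partitioned write (pyarrow assumed; folder and column names are hypothetical): the partition column value is encoded in the directory path, e.g. sales/country=RO/part-0.parquet, which is why low-cardinality columns are preferred.

```python
import pyarrow as pa
import pyarrow.dataset as ds

sales = pa.table({"country": ["RO", "RO", "DE"], "amount": [10.0, 20.0, 15.0]})
ds.write_dataset(sales, "sales", format="parquet",
                 partitioning=["country"], partitioning_flavor="hive")  # hive-style directories
```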
- {concept} bin-packing (aka compaction, bin-compaction)
- aims to produce evenly-balanced data files with respect to their size on disk
- ⇐ but not necessarily with respect to the number of tuples per file [7]
- requires an algorithm that efficiently organizes the files into equal size containers [6]
- {characteristic} idempotent
- if it is run twice on the same dataset, the second run has no effect [7]
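A minimal sketch of a naive bin-packing pass (pyarrow assumed; folder names are hypothetical): many small files are read and rewritten as fewer, larger ones; unlike Delta Lake's OPTIMIZE, nothing here protects concurrent readers.

```python
import pyarrow.dataset as ds

small_files = ds.dataset("landing_dir", format="parquet")
ds.write_dataset(small_files, "compacted_dir", format="parquet",
                 min_rows_per_group=500_000,   # produce larger row groups
                 max_rows_per_file=5_000_000)  # and fewer, bigger files
```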
- {feature} [Microsoft Fabric] V-order
- {definition} write-time optimization to the parquet file format that enables lightning-fast reads under the Microsoft Fabric compute engines [3]
- applies special sorting, row group distribution, dictionary encoding and compression on parquet files [3]
- requires fewer compute engine resources to read it [3]
- provides further cost efficiency and performance
- has a 15% impact on average write times but provides up to 50% more compression [3]
- {characteristic} open-source parquet format compliant
- all parquet engines can read it as a regular parquet file [3]
- ⇐ table properties and optimization commands can be used to control V-Order on its partitions [3]
- compatible with other features [3]
- applied at parquet file level [3]
- enabled by default
- {command} OPTIMIZE
- merges all changes into bigger, consolidated parquet files (aka bin-compaction) [3]
- [Spark] dynamically optimizes partitions while generating files with a default 128 MB size [5]
- the target file size may be changed per workload requirements using configurations [5]
- properly designing the table physical structure based on the ingestion frequency and expected read patterns is likely more important than running the optimization command [3]
- running the compaction operation brings the data lake into an unusable state for readers [7]
- {warning} manually compacting the files is inefficient and error prone [7]
- no way to differentiate files that contain new data from files that contain existing data that was just compacted into new files [7]
- [Delta Lake] when ZORDER and VORDER are used together, Apache Spark performs bin-compaction, ZORDER, VORDER sequentially [3]
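A minimal sketch of combining them (PySpark in a Microsoft Fabric notebook assumed, where `spark` is the session provided by the runtime; "sales" and "customer_id" are hypothetical table/column names):

```python
# bin-compaction, ZORDER and VORDER are applied sequentially by this single command
spark.sql("OPTIMIZE sales ZORDER BY (customer_id) VORDER")
```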
ACID - atomicity, consistency, isolation, durability
IO - Input/Output
ORC - Optimized Row Columnar
RCFile - Record Columnar File
Resources:
[1] Databricks (2023) What is Parquet? (link)
[2] Delta Lake (2023) Delta Lake vs. Parquet Comparison (link)
[3] Data Mozart (2023) Parquet file format – everything you need to know! (link)
[4] Microsoft Learn (2023) Query Parquet files using serverless SQL pool in Azure Synapse Analytics (link)
[5] Microsoft Learn (2023) Lakehouse tutorial: Prepare and transform data in the lakehouse (link)
[6] Apache Spark (2023) Spark SQL Guide (link)
[7] Delta Lake (2023) Delta Lake Small File Compaction with OPTIMIZE (link)
[8] Delta Lake (2023) Optimizations (link)