09 December 2024

🏭🗒️Microsoft Fabric: Delta Lake [Notes]

Disclaimer: This is work in progress intended to consolidate information from various sources and may deviate from them. Please consult the sources for the exact content!
Last updated: 1-Apr-2024

Delta Lake 

  • {definition} an optimized open source storage layer that runs on top of a data lake [1]
    • the default storage format in a Fabric lakehouse [1]
    • stores data in Parquet file format
    • is a variant of log-structured files 
    • initially developed at Databricks
    • fully compatible with Apache Spark APIs [1]
  • {characteristic} high reliability
  • {characteristic} secure
  • {characteristic} performant
    • provides low latency
  • {feature} data indexing
    • indexes are created and maintained on the ingested data  [1]
      • increases the querying speed significantly [1]
  • {feature} data skipping
    • file statistics are maintains so that data subsets relevant to the query are used instead of entire tables - this partition pruning avoids processing data that is not relevant to the query [1]
    • helps complex queries to read only the relevant subsets to fulfil query [1]
  • {feature} multidimensional clustering
    • uses the Z-ordering algorithm
    • enables data skipping this.
  • {feature} compaction
    • compacts or combines multiple small files into more efficient larger ones [1]
      • speeds up query performance
    • storing and accessing  small files can be processing-intensive, slow and inefficient from a storage utilization perspective [1]
  • {feature} data caching
    • highly accessed data is automatically cached to speed access for queries
  • {feature} ACID transactions
    • "all or nothing" ACID transaction approach is employed to prevent data corruption
      • ⇐ partial or failed writes risk corrupting the data [1]
  • {feature} snapshot isolation (aka SI)
    • ensures that multiple writers can write to a dataset simultaneously without interfering with jobs reading the dataset [1]
  • {feature} schema enforcement
    • data can be stored using a schema
    • {benefit} helps ensure data integrity for ingested data by providing schema enforcement [1]
      • potential data corruption with incorrect or invalid schemas is avoided [1]
  • {feature} checkpointing 
    • employed to provide a robust exactly once delivery semantic [1]
    • {benefit} ensures that data is neither missed nor repeated erroneously [1]
  • {feature} UPSERTS and DELETES support
    • provide a more convenient way of dealing with such changes  [1]
  • {feature} unified streaming and batch data processing
    • both batch and streaming data are handled via a direct integration with Structured Streaming for low latency updates  [1]
      • {benefit} simplifies the system architecture  [1]
      • {benefit} results in shorter time from data ingest to query result  [1]
    • can concurrently write batch and streaming data to the same data table [1]
  • {feature} schema evolution
    •  schema is inferred from input data
    • {benefit} reduces the effort for dealing with schema impact of changing business needs at multiple levels of the pipeline/data stack [1]
  • {feature} scalable metadata handling
  • {feature} predictive optimization
    • removes the need to manually manage maintenance operations for delta tables [8]
    • {enabled} automatically identifies tables that would benefit from maintenance operations, and then optimizes their storage
  • {feature} historical retention
    • {default} maintains a history of all changes made [4]
    • {benefit} enhanced regulatory compliance and audit
    • {recommendation} keep historical data only for a certain period of time to reduce storage costs [4]
  • {feature} time travel
    • {benefit} support for data rollback 
    • {benefit} lets users query point-in-time snapshots [5]
  • {best practice} all writes and reads should go through Delta Lake [1]
    • {benefit} ensure consistent overall behavior [1]
  • {best practice}run OPTIMIZE Regularly
    • {exception} should not be run on base or staging tables [1]
  • {best practice} run VACUUM Regularly
    • cleans up expired snapshots that are no longer required [1]
  • {best practice} use MERGE INTO to batch changes 
    • {benefit} allows to efficiently rewrite queries to implement updates to archived data and compliance workflows [5]
  • {best practice} use DELETE commands
    • {benefit} ensures proper progression of the change [1] 
    • {warning} manually deleting files from the underlying storage is likely to break the table [1]

Acronyms:
ACID - atomicity, consistency, isolation, durability

References:
[1] Azure Databricks (2023) Delta Lake on Azure Databricks
[2] Josep Aguilar-Saborit et al, POLARIS: The Distributed SQL Engine in Azure Synapse, PVLDB, 13(12), 2020 (link)
[3] Josep Aguilar-Saborit et al, Extending Polaris to Support Transactions 2024
[4] Implement medallion lakehouse architecture in Microsoft Fabric (link)
[5] Michael Armbrust et al (2020) Delta Lake: High-Performance ACID Table Storage over Cloud Object Stores, Proceedings of the VLDB Endowment13(12) (link)
[6] 
Bennie Haelen & Dan Davis (2024) Delta Lake: Up and Running Modern Data Lakehouse Architectures with Delta Lake

No comments:

Related Posts Plugin for WordPress, Blogger...

About Me

My photo
Koeln, NRW, Germany
IT Professional with more than 24 years experience in IT in the area of full life-cycle of Web/Desktop/Database Applications Development, Software Engineering, Consultancy, Data Management, Data Quality, Data Migrations, Reporting, ERP implementations & support, Team/Project/IT Management, etc.