- a recommended data design pattern used to organize data in a lakehouse logically [2]
- compatible with the concept of data mesh
- {goal} incrementally and progressively improve the structure and quality of data as it progresses through each stage [1]
- brings structure and efficiency to a lakehouse environment [2]
- ensures that data is reliable and consistent as it goes through various checks and changes [2]
- complements other data organization methods, rather than replacing them [2]
- consists of three distinct layers (or zones)
- {layer} bronze (aka raw zone)
- stores source data in its original format [1]
- the data in this layer is typically append-only and immutable [1]
- {recommendation} store the data in its original format, or use Parquet or Delta Lake [1]
- {recommendation} create a shortcut in the bronze zone instead of copying the data across [1]
- works with OneLake, ADLS Gen2, Amazon S3 and Google Cloud Storage
- {operation} ingest data
- {characteristic} maintains the raw state of the data source [3]
- {characteristic} is appended incrementally and grows over time [3]
- {characteristic} can be any combination of streaming and batch transactions [3]
- ⇒ retaining the full, unprocessed history
- ⇒ provides the ability to recreate any state of a given data system [3]
- additional metadata may be added to data on ingest (see the sketch below)
- e.g. source file names, the time the data was processed
- {goal} enhanced discoverability [3]
- {goal} description of the state of the source dataset [3]
- {goal} optimized performance in downstream applications [3]
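A minimal PySpark sketch of such an ingest, assuming a Fabric notebook attached to the bronze lakehouse; the table and path names are illustrative, not prescribed:

```python
# Minimal sketch of a bronze-layer ingest: read landed files, add ingest metadata,
# append to a bronze Delta table (names below are illustrative).
from pyspark.sql import functions as F

raw_df = (
    spark.read
        .option("header", "true")
        .csv("Files/landing/sales/*.csv")   # source files landed in the lakehouse
)

bronze_df = (
    raw_df
        .withColumn("_source_file", F.input_file_name())     # discoverability metadata
        .withColumn("_ingested_at", F.current_timestamp())    # when the row was processed
)

# Append-only write keeps the full, unprocessed history in the bronze zone.
bronze_df.write.format("delta").mode("append").saveAsTable("bronze_sales")
```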
- {layer} silver (aka enriched zone)
- stores data sourced from the bronze layer
- the raw data has been
- cleansed
- standardized
- structured as tables (rows and columns)
- integrated with other data to provide an enterprise view of all business entities
- {recommendation} use Delta tables
- provide extra capabilities and performance enhancements [1]
- {default} every engine in Fabric writes data in the Delta format and uses V-Order, a write-time optimization to the Parquet file format [1]
- {operation} validate and deduplicate data (see the sketch below)
- for any data pipeline, the silver layer may contain more than one table [3]
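An illustrative PySpark sketch of a silver-layer step that validates, standardizes and deduplicates bronze data before writing a Delta table; the column and table names are assumptions:

```python
# Sketch of a silver-layer step: cleanse, standardize and deduplicate bronze data,
# then persist it as a Delta table (column/table names are assumptions).
from pyspark.sql import functions as F

bronze_df = spark.read.table("bronze_sales")

silver_df = (
    bronze_df
        .filter(F.col("order_id").isNotNull())               # basic validation
        .withColumn("order_date", F.to_date("order_date"))    # standardize types
        .dropDuplicates(["order_id"])                         # deduplicate on the key
)

# Fabric engines write the Delta format by default; overwrite (or merge) the silver table.
silver_df.write.format("delta").mode("overwrite").saveAsTable("silver_sales")
```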
- {layer} gold (aka curated zone)
- stores data sourced from the silver layer [1]
- the data is refined to meet specific downstream business and analytics requirements [1]
- tables typically conform to a star schema design (see the sketch below)
- supports the development of data models that are optimized for performance and usability [1]
- use lakehouses (one for each zone), a data warehouse, or a combination of both
- the decision should be based on the team's preference and expertise
- different analytic engines can be used [1]
- ⇐ schemas and tables within each layer can take on a variety of forms and degrees of normalization [3]
- depends on the frequency and nature of data updates and the downstream use cases for the data [3]
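An illustrative sketch of a gold-layer step that shapes silver data into a simple star schema, one dimension and one fact table; the entities and columns are assumptions, not a prescribed model:

```python
# Sketch of a gold-layer step: model silver data into star-schema tables
# (dimension + fact); names and columns are illustrative.
from pyspark.sql import functions as F

silver_df = spark.read.table("silver_sales")

dim_customer = (
    silver_df
        .select("customer_id", "customer_name", "country")
        .dropDuplicates(["customer_id"])
)

fact_sales = (
    silver_df.groupBy("customer_id", "order_date")
             .agg(F.sum("amount").alias("total_amount"),
                  F.count("order_id").alias("order_count"))
)

dim_customer.write.format("delta").mode("overwrite").saveAsTable("gold_dim_customer")
fact_sales.write.format("delta").mode("overwrite").saveAsTable("gold_fact_sales")
```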
- {pattern} create each zone as a lakehouse
- business users access data by using the SQL analytics endpoint [1]
- {pattern} create the bronze and silver zones as lakehouses, and the gold zone as data warehouse
- business users access data by using the data warehouse endpoint [1]
- {pattern} create all lakehouses in a single Fabric workspace
- {recommendation} create each lakehouse in its own workspace [1]
- provides more control and better governance at the zone level [1]
- {concept} data transformation
- involves altering the structure or content of data to meet specific requirements [2]
- via Dataflows (Gen2) or notebooks
- {concept} data orchestration
- refers to the coordination and management of multiple data-related processes, ensuring they work together to achieve a desired outcome [2]
- via data pipelines (contrasted with transformation in the sketch below)
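A conceptual sketch of how the two ideas relate: a transformation is a single step that reshapes data, while orchestration coordinates the steps so they run in the right order. The function names are illustrative stand-ins for the notebooks or Dataflows a pipeline would run:

```python
# Conceptual sketch: transformations are individual steps; orchestration runs them
# as an ordered chain so each layer only reads data the previous step has written.

def ingest_to_bronze():
    print("bronze: land raw data, append-only")          # e.g. a copy activity or notebook

def refine_to_silver():
    print("silver: validate, deduplicate, standardize")

def curate_to_gold():
    print("gold: model into star-schema tables")

# Orchestration: coordinate the steps in dependency order.
for step in (ingest_to_bronze, refine_to_silver, curate_to_gold):
    step()
```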
Acronyms:
ADLS - Azure Data Lake Storage Gen2
References:
[1] Microsoft Learn: Fabric (2023) Implement medallion lakehouse architecture in Microsoft Fabric (link)
[2] Microsoft Learn: Fabric (2023) Organize a Fabric lakehouse using medallion architecture design (link)
[3] Microsoft Learn: Azure (2023) What is the medallion lakehouse architecture? (link)