SQL Troubles: medallion architecture

10 March 2024

🏭📑Microsoft Fabric: Medallion Architecture [Notes]

Disclaimer: This is work in progress intended to consolidate information from various sources for learning purposes. For the latest information please consult the documentation (see the links below)!

Last updated: 10-Mar-2024

Medallion Architecture in Microsoft Fabric [1]

Medallion architecture

a recommended data design pattern used to organize data in a lakehouse logically [2]

compatible with the concept of data mesh

{goal} incrementally and progressively improve the structure and quality of data as it progresses through each stage [1]

brings structure and efficiency to a lakehouse environment [2]
ensures that data is reliable and consistent as it goes through various checks and changes [2]
complements other data organization methods, rather than replacing them [2]

consists of three distinct layers (or zones)

{layer} bronze (aka raw zone)

stores source data in its original format [1]
the data in this layer is typically append-only and immutable [1]
{recommendation} store the data in its original format, or use Parquet or Delta Lake [1]
{recommendation} create a shortcut in the bronze zone instead of copying the data across [1]

works with OneLake, ADLS Gen2, Amazon S3, Google

{operation} ingest data

{characteristic} maintains the raw state of the data source [3]
{characteristic} is appended incrementally and grows over time [3]
{characteristic} can be any combination of streaming and batch transactions [3]
⇒ retaining the full, unprocessed history

⇒ provides the ability to recreate any state of a given data system [3]

additional metadata may be added to data on ingest

e.g. source file names, recording the time data was processed

{goal} enhanced discoverability [3]
{goal} description of the state of the source dataset [3]
{goal} optimized performance in downstream applications [3]
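The metadata enrichment on ingest can be illustrated with a minimal sketch in plain Python (not a Fabric notebook API). The function name `ingest_to_bronze` and the columns `_source_file` and `_ingested_at` are assumptions made for illustration only; the raw values are stored exactly as received and the layer is only ever appended to.

```python
from datetime import datetime, timezone

def ingest_to_bronze(bronze, rows, source_file, ingested_at=None):
    """Append raw rows to the bronze list, tagging each with ingest metadata."""
    # Hypothetical metadata columns: source file name and processing time.
    ingested_at = ingested_at or datetime.now(timezone.utc)
    for row in rows:
        # Copy the row so the raw payload is kept as received (no cleansing here).
        bronze.append({
            **row,
            "_source_file": source_file,
            "_ingested_at": ingested_at.isoformat(),
        })
    return bronze

# Two loads of the same record: both are kept, nothing is overwritten.
bronze = []
ingest_to_bronze(bronze, [{"id": "1", "amount": " 10.5 "}], "sales_2024-03-01.csv")
ingest_to_bronze(bronze, [{"id": "1", "amount": "10.5"}], "sales_2024-03-02.csv")
```

Keeping both loads is what makes it possible to recreate an earlier state of the source later on.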

{layer} silver (aka enriched zone)

stores data sourced from the bronze layer
the raw data has been

cleansed
standardized
structured as tables (rows and columns)
integrated with other data to provide an enterprise view of all business entities

{recommendation} use Delta tables

provide extra capabilities and performance enhancements [1]

{default} every engine in Fabric writes data in the Delta format and applies V-Order write-time optimization to the Parquet files [1]

{operation} validate and deduplicate data
for any data pipeline, the silver layer may contain more than one table [3]
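The validate-and-deduplicate operation could look roughly like the following sketch in plain Python. The function name `to_silver`, the `id` and `amount` columns and the "last row wins" rule are assumptions for illustration; in Fabric the same logic would typically run in a notebook or a Dataflow against Delta tables.

```python
def to_silver(bronze_rows):
    """Cleanse, validate and deduplicate bronze rows (keeps the last row per id)."""
    latest = {}
    rejected = []
    for row in bronze_rows:
        try:
            # Standardize: trim whitespace and cast to proper types.
            clean = {
                "id": int(row["id"]),
                "amount": float(str(row["amount"]).strip()),
            }
        except (KeyError, TypeError, ValueError):
            # Validation failed: set the row aside instead of loading it.
            rejected.append(row)
            continue
        # Deduplicate: a later row overwrites an earlier one with the same id.
        latest[clean["id"]] = clean
    return list(latest.values()), rejected

silver, rejected = to_silver([
    {"id": "1", "amount": " 10.5 "},
    {"id": "1", "amount": "12.0"},
    {"id": "2", "amount": "n/a"},
])
```

Rejected rows are returned rather than dropped silently, so they can be inspected or reprocessed.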

{layer} gold (aka curated zone)

stores data sourced from the silver layer [1]
the data is refined to meet specific downstream business and analytics requirements [1]
tables typically conform to star schema design

supports the development of data models that are optimized for performance and usability [1]
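As an illustration of the star schema shape, the sketch below splits flat silver rows into one dimension and one fact table. The names `to_star_schema`, `customer` and `customer_key` are hypothetical, and generating surrogate keys with a running counter is a simplification chosen for the example.

```python
def to_star_schema(silver_rows):
    """Split flat rows into a customer dimension and a sales fact table."""
    dim_customer = {}  # natural key (customer name) -> surrogate key
    fact_sales = []
    for row in silver_rows:
        # Assign a surrogate key the first time a customer is seen.
        key = dim_customer.setdefault(row["customer"], len(dim_customer) + 1)
        # The fact row references the dimension through the surrogate key only.
        fact_sales.append({"customer_key": key, "amount": row["amount"]})
    dim_rows = [
        {"customer_key": key, "customer": name}
        for name, key in dim_customer.items()
    ]
    return dim_rows, fact_sales

dims, facts = to_star_schema([
    {"customer": "A", "amount": 1.0},
    {"customer": "B", "amount": 2.0},
    {"customer": "A", "amount": 3.0},
])
```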

use lakehouses (one for each zone), a data warehouse, or combination of both

the decision should be based on the preference and expertise of your team
different analytic engines can be used [1]

⇐ schemas and tables within each layer can take on a variety of forms and degrees of normalization [3]

depends on the frequency and nature of data updates and the downstream use cases for the data [3]

{pattern} create each zone as a lakehouse

business users access data by using the SQL analytics endpoint [1]

{pattern} create the bronze and silver zones as lakehouses, and the gold zone as data warehouse

business users access data by using the data warehouse endpoint [1]

{pattern} create all lakehouses in a single Fabric workspace

{recommendation} create each lakehouse in its own workspace [1]
provides more control and better governance at the zone level [1]

{concept} data transformation

involves altering the structure or content of data to meet specific requirements [2]

via Dataflows (Gen2), notebooks

{concept} data orchestration

refers to the coordination and management of multiple data-related processes, ensuring they work together to achieve a desired outcome [2]

via data pipelines
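The idea of orchestration, several steps coordinated toward one outcome, can be sketched in a few lines of Python. This is not the Fabric data pipelines API; `run_pipeline` and the three step names are assumptions made up for the example.

```python
def run_pipeline(steps, data):
    """Run named steps in order, passing each step's output to the next."""
    log = []
    for name, step in steps:
        data = step(data)
        log.append(name)  # record completed steps, e.g. for monitoring
    return data, log

steps = [
    ("ingest", lambda rows: list(rows)),
    ("cleanse", lambda rows: [r.strip() for r in rows if r.strip()]),
    ("aggregate", lambda rows: {"count": len(rows)}),
]
result, log = run_pipeline(steps, [" a ", "", "b"])
```

A real pipeline would add scheduling, retries and failure handling on top of this ordering.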

References:
[1] Microsoft Learn: Fabric (2023) Implement medallion lakehouse architecture in Microsoft Fabric (link)
[2] Microsoft Learn: Fabric (2023) Organize a Fabric lakehouse using medallion architecture design (link)
[3] Microsoft Learn: Azure (2023) What is the medallion lakehouse architecture? (link)

Resources:
[R1] Serverless.SQL (2023) Data Loading Options With Fabric Workspaces, by Andy Cutler (link)
[R2] Microsoft Learn: Fabric (2023) Lakehouse end-to-end scenario: overview and architecture (link)
[R3] Microsoft Learn: Fabric (2025) What's new in Microsoft Fabric? (link)

Acronyms:
ADLS - Azure Data Lake Storage

13 February 2024

🧭Business Intelligence: A One-Man Show (Part V: Focus on the Foundation)

Business Intelligence Suite

I tend to agree that one person can no longer do "everything in the data space", as Christopher Laubenthal put it in his article on the topic [1]. He seems to catch the essence of some of the core data roles found in organizations. Summarizing these roles, data architecture is about designing and building a data infrastructure, data engineering is about moving data, database administration is mainly about managing databases, data analysis is about assisting the business with data and reports, information design is about telling stories, while data science can be about studying the impact of various components on the data.

However, I find his analogy between a college's functional structure and the core data roles poorly chosen from multiple perspectives, even if both are about building an infrastructure of some type.

Firstly, the two constructions have different foundations. Data exists in an organization even without data architects, data engineers or database administrators (DBAs)! It's enough to buy one or more information systems functioning as islands and reporting needs will arise. The need for a data architect might come when the systems need to be integrated or maybe when a data warehouse needs to be built, though many organizations are still in business without such constructs. For the others, the more complex the integrations, the bigger the need for a data architect. Conversely, some systems can be integrated by design and such capabilities might drive their selection.

Data engineering is needed mainly in the context of the cloud, respectively of data lake-based architectures, where data needs to be moved, processed and prepared for consumption. Conversely, architectures like Microsoft Fabric minimize data movement, the focus being on data processing, on the successive transformations the data needs to undergo in moving from the bronze to the gold layer, respectively on creating an organizational semantic data model. The complexity of the data processing depends on the data's structuredness, quality and other characteristics.

As I mentioned before, modern databases, including the ones in the cloud, reduce the need for DBAs to a considerable degree. Unless the volume of work is big enough to consider a DBA role as an in-house resource, organizations will more likely consider involving a service provider and a contingent to cover the needs.

Having in-house one or more people acting under the Data Analyst role, people who know and understand the business, respectively the data tools used in the process, can go a long way. Moreover, it's helpful to have an evangelist-like resource in house, a person who is able to raise awareness and knowhow, help diffuse knowledge about tools, techniques, data, results, best practices, respectively act as a mentor for the Data Analyst citizens. From my point of view, these are the people who form the data-related backbone (foundation) of an organization and this is the minimum of what an organization should have!

Once this is established, one can build data warehouses, data integrations and other support architectures, respectively think about BI and Data strategy, Data Governance, etc. Of course, having a Chief Data Officer and a Data Strategy in place can bring more structure in handling the topics at the various levels - strategic, tactical, respectively operational. In construction one starts with a blueprint, and a data strategy can have the same effect, if one knows how to write it and implement it accordingly. However, the strategy is just a tool, while the data-knowledgeable workers are the foundation on which organizations should build!

A "build it and they will come" philosophy can work as well, though without knowledgeable and inquisitive people it has a high chance of failing.

Resources:
[1] Christopher Laubenthal (2024) "Why One Person Can’t Do Everything In Data" (link)

21 October 2023

🧊Data Warehousing: Architecture V (Dynamics 365, the Data Lakehouse and the Medallion Architecture)

Data Warehousing Series

An IT architecture is built and functions under a set of constraints that derive from architecture’s components. Usually, if we want flexibility or to change something in one area, this might have an impact in another area. This rule applies to the usage of the medallion architecture as well!

In Data Warehousing the medallion architecture takes a multilayered approach to building a single source of truth, each layer denoting the quality of the data stored in the lakehouse [1]. For the moment, three layers are defined - bronze for raw data, silver for validated data, and gold for enriched data. The concept seems sound considering that a Data Lake contains all types of raw data of different quality that need to be validated and prepared for reporting or other purposes.

On the other side there are systems like Dynamics 365 that synchronize the data in near-real-time to the Data Lake through various mechanisms at table and/or data entity level (think of data entities as views on top of other tables or views). The databases behind are relational and in theory the data should be of proper quality as needed by business.

The greatest benefit of the serverless SQL pool is that it can be used to build near-real-time data analytics solutions on top of the files existing in the Data Lake, and the mechanism is quite simple. External tables that reflect the data model from the source systems are built in the serverless SQL pool on top of such files. The external tables can be called like any other tables from the various database objects (views, stored procedures and table-valued functions). Thus, an enterprise data model with dimensions, fact-like and mart-like entities can be built on top of the synchronized files from the Data Lake. The Data Lakehouse (= Data Warehouse + Data Lake) thus created can be used for (enterprise) reporting and other purposes.

As long as there are no special requirements for data processing (e.g. flattening hierarchies, complex data processing, high performance, data cleaning), this approach allows reporting on the data from the data sources in near-real-time (10-30 minutes), which can prove useful for operational and tactical reporting. Tapping into this model via standard Power BI and paginated reports is quite easy.

Now, if one is to use the medallion approach and rely on pipelines to process the data, unless one is able to process the data in near-real-time or something comparable, a considerable delay will be introduced, a delay that can span from a couple of hours to one day. It's also true that having the data prepared as needed by the reports can increase the performance considerably as compared to processing the logic at runtime. There are advantages and disadvantages to both approaches.

Probably, the most important scenario that needs to be handled is that of integrating the data from different sources. If unique mappings between values exist, unique references are available in one system to the records from the other system, respectively when a unique logic can be identified, the data integration can be handled in serverless SQL pool.
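The integration condition described above, a unique mapping between the records of two systems, can be sketched as follows. In the serverless SQL pool this would be a join inside a view; Python is used here only to show the logic. The system names (CRM, ERP), the column names and the function `integrate` are hypothetical.

```python
def integrate(crm_rows, erp_rows, mapping):
    """Join CRM rows to ERP rows through a CRM-id -> ERP-id mapping."""
    # The mapping is only usable as a join key when it is unique:
    # no two CRM ids may point to the same ERP id.
    if len(set(mapping.values())) != len(mapping):
        raise ValueError("mapping is not unique: two CRM ids share one ERP id")
    erp_by_id = {row["erp_id"]: row for row in erp_rows}
    joined = []
    for row in crm_rows:
        erp_id = mapping.get(row["crm_id"])
        if erp_id in erp_by_id:
            # Combine the attributes of both systems into one record.
            joined.append({**row, **erp_by_id[erp_id]})
    return joined

joined = integrate(
    [{"crm_id": "C1", "name": "Acme"}, {"crm_id": "C2", "name": "Beta"}],
    [{"erp_id": "E9", "balance": 5}],
    {"C1": "E9"},
)
```

Records without a mapping are left out of the result, which mirrors an inner join; whether to keep them instead is a design decision that depends on the reporting requirements.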

Unfortunately, when compared to on-premises or Azure SQL functionality, the serverless SQL pool has important constraints - it's not possible to use scalar UDFs, regular (non-external) tables, recursive CTEs, etc. So, one needs to work around these limitations and in some cases use the Spark pool or pipelines. So, at least for exceptions and maybe for strategic reporting, a medallion architecture can make sense and be used in parallel. However, imposing it on all the data can reduce flexibility!

Bottom line: consider the architecture against your requirements!

[1] What is the medallion lakehouse architecture?
https://learn.microsoft.com/en-us/azure/databricks/lakehouse/medallion
