SQL Troubles: ETL

Showing posts with label ETL. Show all posts

12 April 2025

🏭🗒️Microsoft Fabric: Copy job in Data Factory [Notes]

Disclaimer: This is work in progress intended to consolidate information from various sources for learning purposes. For the latest information please consult the documentation (see the links below)!

Last updated: 11-Apr-2025

[Microsoft Fabric] Copy job in Data Factory

{def}

{benefit} simplifies data ingestion with built-in patterns for batch and incremental copy, eliminating the need for pipeline creation [1]

across cloud data stores [1]
from on-premises data stores behind a firewall [1]
within a virtual network via a gateway [1]

elevates the data ingestion experience to a more streamlined and user-friendly process from any source to any destination [1]
{benefit} provides seamless data integration

through over 100 built-in connectors [3]
provides essential tools for data operations [3]

{benefit} provides intuitive experience

easy configuration and monitoring [1]

{benefit} efficiency

enable incremental copying effortlessly, reducing manual intervention [1]

{benefit} less resource utilization and faster copy durations

flexibility to control data movement [1]

choose which tables and columns to copy
map the data
define read/write behavior
set schedules that fit requirements whether [1]

applies for a one-time or recurring jobs [1]

{benefit} robust performance

the serverless setup enables data transfer with large-scale parallelism
maximizes data movement throughput [1]

fully utilizes network bandwidth and data store IOPS for optimal performance [3]

{feature} monitoring

once a job executed, users can monitor its progress and metrics through either [1]

the Copy job panel

shows data from the most recent runs [1]

reports several metrics

status
row read
row written
throughput

the Monitoring hub

acts as a centralized portal for reviewing runs across various items [4]

{mode} full copy

copies all data from the source to the destination at once

{mode|GA} incremental copy

the initial job run copies all data, and subsequent job runs only copy changes since the last run [1]
an incremental column must be selected for each table to identify changes [1]

used as a watermark

allows comparing its value with the same from last run in order to copy the new or updated data only [1]
the incremental column can be a timestamp or an increasing INT [1]

{scenario} copying from a database

new or updated rows will be captured and moved to the destination [1]

{scenario} copying from a storage store

new or updated files identified by their LastModifiedTime are captured and moved to the destination [1]

{scenario} copy data to storage store

new rows from the tables or files are copied to new files in the destination [1]

files with the same name are overwritten [1]

{scenario} copy data to database

new rows from the tables or files are appended to destination tables [1]

the update method to merge or overwrite [1]

{default} appends data to the destination [1]

the update method can be adjusted to

{operation} merge

a key column must be provided

{default} the primary key is used, if available [1]

{operation} overwrite

availability

the same regional availability as the pipeline [1]

billing meter

Data Movement, with an identical consumption rate [1]

{feature} robust Public API

{benefit} allows to automate and manage Copy Job efficiently [2]

{feature} Git Integration

{benefit} allows to leverage Git repositories in Azure DevOps or GitHub [2]
{benefit} allows to seamlessly deploy Copy Job with Fabric’s built-in CI/CD workflows [2]

{feature|preview} VNET gateway support

enables secure connections to data sources within virtual network or behind firewalls

Copy Job can be executed directly on the VNet data gateway, ensuring seamless and secure data movement [2]

{feature} upsert to Azure SQL Database
{feature} overwrite to Fabric Lakehouse
{feature} [Jul-2025] native CDC

enables efficient and automated replication of changed data including inserted, updated and deleted records from a source to a destination [5]

ensures destination data stays up to date without manual effort

improves efficiency in data integration while reducing the load on source systems [5]

see Data Movement - Incremental Copy meter

consumption rate of 3 CU

{benefit} zero manual intervention

automatically captures incremental changes directly from the source [5]

{benefit} automatic replication

keeps destination data continuously synchronized with source changes [5]

{benefit} optimized performance

processes only changed data

reduces processing time and minimizing load on the source [5]

smarter incremental copy

automatically detects CDC-enabled source tables and allows to select either CDC-based or watermark-based incremental copy for each table [5]

applies to

CDC-enabled tables

CDC automatically captures and replicates actions on data

non-CDC-enabled tables

Copy Job detects changes by comparing an incremental column against the last run [5]

then merges or appends the changed data to the destination based on configuration [5]

supported connectors

⇐ applies to sources and destinations
Azure SQL DB [5]
On-premises SQL Server [5]
Azure SQL Managed Instance [5]

{enhancement} column mapping for simple data modification to storage as destination store [2]
{enhancement} data preview to help select the right incremental column [2]
{enhancement} search functionality to quickly find tables or columns [2]
{enhancement} real-time monitoring with an in-progress view of running Copy Jobs [2]
{enhancement} customizable update methods & schedules before job creation [2]

Previous Post <<||>> Next Post

References:

[1] Microsoft Learn (2025) Fabric: What is the Copy job in Data Factory for Microsoft Fabric? [link]

[2] Microsoft Fabric Updates Blog (2025) Recap of Data Factory Announcements at Fabric Conference US 2025 [link]

[3] Microsoft Fabric Updates Blog (2025) Fabric: Announcing Public Preview: Copy Job in Microsoft Fabric [link]

[4] Microsoft Learn (2025) Fabric: Learn how to monitor a Copy job in Data Factory for Microsoft Fabric [link]
[5] Microsoft Fabric Updates Blog (2025) Fabric: Simplifying Data Ingestion with Copy job – Introducing Change Data Capture (CDC) Support (Preview) [link]
[6] Microsoft Learn (2025) Fabric: Change data capture (CDC) in Copy Job (Preview) [link]

[7] Microsoft Fabric Updates Blog (2025) Simplifying Data Ingestion with Copy job – Incremental Copy GA, Lakehouse Upserts, and New Connectors [link]

Resources:
[R1] Microsoft Learn (2025) Fabric: Learn how to create a Copy job in Data Factory for Microsoft Fabric [link]
[R2] Microsoft Learn (2025) Microsoft Fabric decision guide: copy activity, Copy job, dataflow, Eventstream, or Spark [link]

Acronyms:

API - Application Programming Interface
CDC - Change Data Capture
CI/CD - Continuous Integration and Continuous Deployment

CU - Capacity Unit

DevOps - Development & Operations

DF - Data Factory
IOPS - Input/Output Operations Per Second

VNet - Virtual Network

15 March 2025

💫🗒️ERP Systems: Microsoft Dynamics 365's Business Performance Analytics (BPA) [notes]

Disclaimer: This is work in progress intended to consolidate information from the various sources and not to provide a complete overview of all the features. Please refer to the documentation for a complete overview!

Last updated: 15-Mar-2025

[Dynamics 365] Business Performance Analytics (BPA)

{def} centralized reporting hub within D365 F&O designed to streamline insights and help organizations make faster, data driven decisions [3]

solution designed to transform organization's data into actionable insights [1]
provides an intuitive out-of-box data model along with familiar tools like Microsoft Excel and Power BI for self-service analytics [4]

data extracted from D365 is classified in BPA in the form of value chains

⇐ a group of business processes on top of the value chain [4]

{benefit} allows to simplify data insights by providing a unified view of business data across entities in near real time [4]
{benefit} allows to streamline financial and operations reporting to reduce the cycle times [4]
{benefit} allows users of all technical abilities to quickly access and analyze data to facilitate data driven decisions [4]
{benefit} provides auditors with direct access to financial data, making the audit process more efficient
{benefit} enables ease of use through familiar apps like Excel and Power BI, in addition to AI driven insights and automation in this platform that can be scalable and extendable [4]
{feature} extends into Microsoft Fabric

{benefit} provide a scalable, secure environment for handling large data sets and ensuring insights are always powered by the latest technology [3]

{feature} ETL process

involves extracting data from finance and operations database, transforming and loading it into Dataverse [4]

each of the entities required for the generation of the dimensional model for the value chains that were mentioned earlier, they are backed by the underlying tables in finance and operations database [4]

installed in Dataverse, virtual entities that are created will then pull in the data into the managed data lake [4]
the data is then transformed to generate the dimensional model which is then pushed into the embedded Power BI workspace in the form of analytical tables [4]
BPA consumes this data from Power BI workspace to render the power BI reports [4]
this data can also be extended to Fabric if there is a need to consolidate data from multiple sources [4]

{feature} reports

designed to provide a detailed overview of an organization's financial health [8]
further reports will be added to expand the coverage for the value chains [8]
out-of-box reports can't be modified

⇐ users cannot rename, delete or edit these type of reports [8]
there’s the option to duplicate the base report and edit the version thus created [8]

can be shared with other users who have access to BPA

⇐ they can receive an in-app notification [8]
can be shared over email with another user by entering user’s email address [8]
one can configure whether the recipient can edit or view the report [8]

{feature} allows to create a new Power BI or Excel report from scratch [8]

{option} start with a blank report or duplicate an existing report [8]

{feature} data refresh

automatic data refreshes run currently two times a day [4]

at 12:00 AM and 12:00 PM UTC
the volume of data is also constrained by the storage capacity of the A3 SKU for Power BI Embedded [1]

future release, may support additional data reporting capacity [1]

⇐ so that larger data sets can be reported and analyzed [1]

the target is to have refreshes every hour or less [3]

data volume will be initially for about eight quarters of data [4]
extensibility will be supported with bring your own Fabric [4]

architecture

SaaS solution

{capability} immediate deployment

businesses can start to analyze data and generate insights with minimal setup [1]

{capability} comprehensive reporting and dashboards

provides access to a wide range of preconfigured reports that cover multiple business functions [1]

{capability} near-real-time analytics

future releases will offer more frequent data refreshes to enable near-real-time data analysis and reporting

{capability} predictive insights

future releases will introduce predictive analytics capabilities that enable businesses to

forecast trends
identify risks
seize opportunities [1]

{capability} user-friendly interface

intuitive design ⇒ minimal training

fosters broader adoption
enables a data-driven culture across the organization [1]

{capability} cost-effectiveness

available as part of D365 license

⇒ provides advanced analytics without requiring significant investments in IT infrastructure [1]

DaaS solution

{capability} organizations can integrate its data models with their existing data warehousing infrastructure in Microsoft Fabric [1]

maximizes the value of existing data solutions [1]
positions businesses for future enhancements [1]

{capability} unified and scalable data models

customers can build custom models on top of a unified framework

ensures consistency and scalability across data sets [1]

{capability} future-proofing with automatic upgrades

data models integrate seamlessly with future D365 updates

reduces manual maintenance and ensures access to the latest features [1]

{capability} consistency and standardization

data models provide consistency and standardization across data sources

ensure high data quality and integrity [1]

{capability} advanced analytics and AI

by customizing the data models, organizations can take advantage of advanced analytics and AI capabilities [1]

deeper insights without having to develop them from scratch [1]

{capability} enhanced data governance

unified data models support better data governance by providing standardized data definitions, relationships, and hierarchies [1]

ensure consistency and quality across the organization [1]

requires an integrated Power Platform environment [5]

must be integrated with the Microsoft Entra tenant [5]

uses shared Dataverse entitlements [1]

includes access to the data lake [1]

setup

dimensions

the selection of dimensions might affect the dimension groups that are created using these dimensions and the users who are assigned there [7]

e.g. legal entity, business unit

dimension groups

users can select specific values for the legal entity, or add a range of values [7]

selecting an invalid combination of dimension values, the dimension group will filter out all the records on the report [7]

{warning} assigning too many dimension groups to a user, slows the load for that user [7]

roles

determine which reports the user can access [7]

security

secure data through role-based access control on top of the value chains [7]
the first user who signs into the app is assigned the BPA admin role [7]

allows a user to access the administrator section of the BPA [7]

where the security can be set up [7]

has automatically assigned

Microsoft report viewer role
the All Access Dimension group [7]

allow the admin to see the data in all the reports across all the dimensions [7]

{feature} dimension-based role-level security

ensures that users only see the data relevant to them based on their role

confidently share reports without duplicating them

⇐ data is automatically filtered by organization's security policies [3]

simple but powerful way to maintain control while providing access for teams that love working in Excel [3]

accessibility

can be accessed through either

Power Platform

admins can access BPA app through PowerApps' makeup portal [6]

Dynamics 365

through the BPA preview shortcut in the homepage or the default dashboard [6]
for end users, the BPA preview shortcut is provided when they have certain duties associated to their role(s) [6]

licensing

included in D365 F&O license [4]

requirements

requires a tier two environment and Dynamics 365 finance version 1.0.38 or later [5]

{project} timeline

[2025 wave 1] backup and restore custom reports and analytics

{benefit} support better lifecycle management and empower customers to develop on sandbox instances before publishing to production [3]

2025: available in all regions where F&O is available [3]
Oct-2024: GA

Previous Post <<||>> Next Post

References:
[1] Microsoft Learn (2024) Dynamics 365 Finance: What is Business performance analytics? [link]
[2] Microsoft Learn (2025) Business performance analytics (BPA) with Dynamics 365 Finance [link]
[3] Dynamics 365 Finance - Business Performance Analytics 2025 Release Wave 1 Release Highlights [link]
[4] Dynamics 365 Community (2024) Dynamics 365 Bites: Business Performance Analytics Part 1 [link]
[5] Dynamics 365 Community (2024) Dynamics 365 Bites: Business Performance Analytics Part 2 [link]
[6] Dynamics 365 Community (2024) Dynamics 365 Bites: Business Performance Analytics Part 3 [link]
[7] Dynamics 365 Community (2024) Dynamics 365 Bites: Business Performance Analytics Part 4 [link]
[8] Dynamics 365 Community (2024) Dynamics 365 Bites: Business Performance Analytics Part 5 [link]
[9] Microsoft Learn (2024) Dynamics 365: Business performance analytics introduction [link]

Acronyms:
AI - Artificial Intelligence
BPA - Business Performance Analytics
D365 F&O - Dynamics 365 for Finance and Operations
DaaS - Data-as-a-Service
ETL - Extract, Transfer, Load
GA - General Availability
MF - Microsoft Fabric
PP - Public Preview
SaaS - Software-as-a-Service
SKU - Stock Keeping Unit
UTC - Coordinated Universal Time

09 February 2025

🏭🗒️Microsoft Fabric: Data Pipelines [Notes]

Disclaimer: This is work in progress intended to consolidate information from various sources for learning purposes. For the latest information please consult the documentation (see the links below)!

Last updated: 9-Feb-2024

[Microsoft Fabric] Data pipeline

{def} a logical sequence of activities that orchestrate a process and perform together a task [1]

usually by extracting data from one or more sources and loading it into a destination;

⇐ often transforming it along the way [1]

⇐ allows to manage the activities as a set instead of each one individually [2]
⇐ used to automate ETL processes that ingest transactional data from operational data stores into an analytical data store [1]
e.g. lakehouse or data warehouse

{concept} activity

{def} an executable task in a pipeline

a flow of activities can be defined by connecting them in a sequence [1]
its outcome (success, failure, or completion) can be used to direct the flow to the next activity in the sequence [1]

{type} data movement activities

copies data from a source data store to a sink data store [2]

{type} data transformation activities

encapsulate data transfer operations

incl. simple Copy Data activities that extract data from a source and load it to a destination
incl. complex Data Flow activities that encapsulate dataflows (Gen2) that apply transformations to the data as it is transferred
incl. notebook activities to run a Spark notebook
incl. stored procedure activities to run SQL code
incl. delete data activities to delete existing data

{type} control flow activities

used to

implement loops
implement conditional branching
manage variables
manage parameter values
enable to implement complex pipeline logic to orchestrate data ingestion and transformation flow [1]

can be parameterized

⇐enabling to provide specific values to be used each time a pipeline is run [1]

when executed, a run is initiated (aka data pipeline run

runs can be initiated on-demand or scheduled to start at a specific frequency
use the unique run ID to review run details to confirm they completed successfully and investigate the specific settings used for each execution [1]

{benefit} increases pipelines’ reusability

{concept} pipeline template

predefined pipeline that can be used and customize as required

{concept} data pipeline run

occurs when a data pipeline is executed
the activities in the data pipeline are executed to completion [3]
can be triggered one of two ways

on-demand
on a schedule

the scheduled pipeline will be able to run based on the time and frequency set [3]

Previous Post <<||>> Next Post

References:
[1] Microsoft Learn (2023) Use Data Factory pipelines in Microsoft Fabric [link]
[2] Microsoft Learn (2024) Microsoft Fabric: Activity overview [link]
[3] Microsoft Learn (2024) Microsoft Fabric Concept: Data pipeline Runs [link]

Resources
[R1] Metadata Driven Pipelines for Microsoft Fabric (link)
[R2] Microsoft Learn (2025) Fabric: What's new in Microsoft Fabric? [link]

21 January 2025

🧊🗒️Data Warehousing: Extract, Transform, Load (ETL) [Notes]

Disclaimer: This is work in progress intended to consolidate information from various sources for learning purposes.

Last updated: 21-Jan-2025

[Data Warehousing] Extract, Transform, Load (ETL)

{def} automated process which takes raw data, extracts the data required for further processing, transforms it into a format that addresses business' needs, and loads it into the destination repository (e.g. data warehouse)

includes

the transportation of data
overlaps between stages
changes in flow

due to

new technologies
changing requirements

changes in scope
troubleshooting

due to data mismatches

{step} extraction

data is extracted directly from the source systems or intermediate repositories

data may be made available in intermediate repositories, when the direct access to the source system is not possible

⇐ this approach can add a complexity layer

{substep} data validation

an automated process that validates whether data pulled from sources has the expected values
relies on a validation engine

rejects data if it falls outside the validation rules
analyzes rejected records on an ongoing basis to

identifies what went wrong
corrects the source data
modifies extraction to resolve the problem in the next batches

{step} transform

transforms the data, removing extraneous or erroneous data
applies business rules
checks data integrity

ensures that the data is not corrupted in the source or corrupted by ETL
may ensure no data was dropped in previous stages

aggregates the data if necessary

{step} load

{substep} store the data into a staging layer

transformed data are not loaded directly into the target but staged into an intermediate layer (e.g. database)
{advantage} makes it easier to roll back, if something went wrong
{advantage} allows to develop the logic iteratively and publish the data only when needed
{advantage} can be used to generate audit reports for

regulatory compliance
diagnose and repair of data problems

modern ETL process perform transformations in place, instead of in staging areas

{substep} publish the data to the target

loads the data into the target table(s)
{scenario} the existing data are overridden every time the ETL pipeline loads a new batch

this might happen daily, weekly, or monthly

{scenario} add new data without overriding

the timestamp can indicate the data is new

{recommendation} prevent the loading process to error out due to disk space and performance limitations

{approach} building an ETL infrastructure

involves integrating data from one or more data sources and testing the overall processes to ensure the data is processed correctly

recurring process

e.g. data used for reporting

one-time process

e.g. data migration

may involve

multiple source or destination systems
different types of data

e.g. reference, master and transactional data
⇐ may have complex dependencies

different level of structuredness

e.g. structured, semistructured, nonstructured

different data formats
data of different quality
different ownership

{recommendation} consider ETL best practices

{best practice} define requirements upfront in a consolidated and consistent manner

allows to set clear expectations, consolidate the requirements, estimate the effort and costs, respectively get the sign-off
the requirements may involve all the aspects of the process

e.g. data extraction, data transformation, standard formatting, etc.

{best practice} define a high level strategy

allows to define the road ahead, risks and other aspects
allows to provide transparency
this may be part of a broader strategy that can be referenced

{best practice} align the requirements and various aspects to the existing strategies existing in the organization

allows to consolidate the various efforts and make sure that the objectives, goals and requirements are aligned
e.g. IT, business, Information Security, Data Management strategies

{best practice} define the scope upfront

allows to better estimate the effort and validate the outcomes
even if the scope may change in time, this allows to provide transparence and used as basis for the time and costs estimations

{best practice} manage the effort as a project and use a suitable Project Management methodology

allows to apply structured well-established PM practices
it might be suited to adapt the methodology to project's particularities

{best practice} convert data to standard formats to standardize data processing

allows to reduce the volume of issues resulted from data type mismatches
applies mainly to dates, numeric or other values for which can be defined standard formats

{best practice} clean the data in the source systems, when cost-effective

allows to reduces the overall effort, especially when this is done in advance
this should be based ideally on the scope

{best practice} define and enforce data ownership

allows to enforce clear responsibilities across the various processes
allows to reduce the overall effort

{best practice} document data dependencies

document the dependencies existing in the data at the various levels

{best practice} protocol data movement from source(s) to destination(s) in term of data volume

allows to provide transparence into the data movement process
allows to identify gaps in the data or broader issues
can be used for troubleshooting and understanding the overall data growth

{recommendation} consider proven systems, architectures and methodologies

allows to minimize the overall effort and costs associated with the process

Previous Post <<||>> Next Post

18 April 2024

🏭Data Warehousing: Microsoft Fabric (Part II: Data(base) Mirroring) [New feature]

Data Warehousing Series

Microsoft recently announced [4] the preview of a new Fabric feature called Mirroring, a low-cost, low-latency fully managed service that allows to replicate data from various systems together into OneLake [1]. Currently only Azure SQL Database, Azure Cosmos DB, and Snowflake are supported, though probably more database vendors will be targeted soon.

For Microsoft Fabric's data engineers, data scientists and data warehouse professionals this feature is huge as importance because they don't need to care anymore about making the data available in Microsoft Fabric, which involves a considerable amount of work.

Usually, at least for flexibility, transparence, performance and standardization, data professionals prefer to extract the data 1:1 from the source systems into a landing zone in the data warehouse or data/delta lake from where the data are further processed as needed. One data pipeline is thus built for every table in scope, which sometimes is a 10–15-minute effort per table, when the process is standardized, though upon case the effort is much higher if troubleshooting (e.g. data type incompatibility or support) or further logic changes are involved. Maintaining such data pipelines can prove to be costly over time, especially when periodic changes are needed.

Microsoft lists other downsides of the ETL approach - restricted access to data changes, friction between people, processes, and technology, respectively the effort needed to create the pipelines, and the time needed for importing the data [1]. There's some truth is each of these points, though everything is relative. For big tables, however, refreshing all the data overnight can prove to be time-consuming and costly, especially when the data don't lie within the same region, respectively data center. Unless the data can be refreshed incrementally, the night runs can extend into the day, will all the implications that derive from this - not having actual data, which decreases the trust in reports, etc. There are tricks to speed up the process, though there are limits to what can be done.

With mirroring, the replication of data between data sources and the analytics platform is handled in the background, after an initial replication, the changes in the source systems being reflected with a near real-time latency into OneLake, which is amazing! This allows building near real-time reporting solutions which can help the business in many ways - reviewing (and correcting in the data source) records en masse, faster overview of what's happening in the organizations, faster basis for decision-making, etc. Moreover, the mechanism is fully managed by Microsoft, which is thus responsible for making sure that the data are correctly synchronized. Only from this perspective 10-20% from the effort of building an analytics solution is probably reduced.

Mirroring in Microsoft Fabric (adapted after [2])

According to the documentation, one can replicate a whole database or choose individual regular tables (currently views aren't supported [3]), stop, restart, or remove a table from a mirroring. Moreover, through sharing, users can grant to other users or groups of users access to a mirrored database without giving access to the workspace and the rest of its items [1].

The data professionals and citizens can write then cross-database queries against the mirrored databases, warehouses, and the SQL analytics endpoints of lakehouses, combining data from all these sources into a single T-SQL query, which opens lot of opportunities especially in what concerns the creation of an enterprise semantic model, which should be differentiated from the semantic model created by default by the mirroring together with the SQL analytics endpoint.

Considering that the data is replicated into delta tables, one can take advantage of all the capabilities available with such tables - data versioning, time travel, interoperability and/or performance, respectively direct consumption in Power BI.

Previous Post <<||>> Next Post

References:
[1] Microsoft Learn - Microsoft Fabric (2024) What is Mirroring in Fabric? (link)
[2] Microsoft Learn - Microsoft Fabric (2024) Mirroring Azure SQL Database [Preview] (link)
[3] Microsoft Learn - Microsoft Fabric (2024) Frequently asked questions for Mirroring Azure SQL Database in Microsoft Fabric [Preview] (link)
[4] Microsoft Fabric Updates Blog (2024) Announcing the Public Preview of Mirroring in Microsoft Fabric, by Charles Webb (link)

10 March 2024

🏭🗒️Microsoft Fabric: Dataflows Gen 1 [Notes]

Disclaimer: This is work in progress intended to consolidate information from various sources.

Last updated: 10-Mar-2024

Dataflows Architecture [3]

[Microsoft Fabric] Dataflow (Gen1)

a type of cloud-based ETL tool for building and executing scalable data transformation processes [1]
a collection of tables created and managed in workspaces in the Power BI service [4]
acts as building blocks on top of one another [4]
includes all of the transformations to reduce data prep time and then can be loaded into a new table, included in a Data Pipeline, or used as a data source by data analysts [1]
{benefit} promote reusability of underlying data elements

prevent the need to create separate connections with your cloud or on-premises data sources.

supports a wide range of cloud and on-premises sources [3]
{operation} refreshing a dataflow

is required before it can be consumed in a semantic model in Power BI Desktop, or referenced as a linked or computed table [4]
can be refreshed at the same frequency as a semantic model [4]
{concept} incremental refresh
[Premium capacity] can be set to refresh incrementally

adds parameters to the dataflow to specify the date range
{contraindication} linked tables shouldn't use incremental refresh if they reference a dataflow

⇐ dataflows don't support query folding (even if the table is DirectQuery enabled).

{contraindication} semantic models referencing dataflows shouldn't use incremental refresh

⇐ refreshes to dataflows are generally performant, so incremental refreshes shouldn't be necessary
if refreshes take too long, consider using the compute engine, or DirectQuery mode

{operation} deleting a workflow

if a workspace that contains dataflows is deleted, all its dataflows are also deleted [4]

even if recovery of the workspace is possible, one cannot recover the deleted dataflows

⇐ either directly or through support from Microsoft [4]

{operation} consuming a dataflow

create a linked table from the dataflow
allows another dataflow author to use the data [4]
create a semantic model from the dataflow
allows a user to utilize the data to create reports [4]
create a connection from external tools that can read from the CDM format [4]

{feature} [premium] Enhanced compute engine.

enables premium subscribers to use their capacity to optimize the use of dataflows
{advantage} reduces the refresh time required for long-running ETL steps over computed entities, such as performing joins, distinct, filters, and group by [7]
{advantage} performs DirectQuery queries over entities [7]
individually set for each dataflow
{configuration} disabled
{configuration|default} optimized

automatically turned on when a table in the dataflow is referenced by another table or when the dataflow is connected to another dataflow in the same workspace.

{configuration} On
{limitation} works only for A3 or larger Power BI capacities [7]

{feature} [premium] DirectQuery

allows to use DirectQuery to connect directly to dataflows without having to import its data [7]

{advantage} avoid separate refresh schedules

removes the need to create an imported semantic model [7]

{advantage} filtering data

allows to filter dataflow data and work with the filtered subset [7]
{limitation} composite/mixed models that have import and DirectQuery data sources are currently not supported [7]
{limitation} large dataflows might have trouble with timeout issues when viewing visualizations [7]

{workaround} use Import mode [7]

{limitation} under data source settings, the dataflow connector will show invalid credentials [7]

the warning doesn't affect the behavior, and the semantic model will work properly [7]

{feature} [premium] Computed entities

allows to perform calculations on your existing dataflows, and return results [7]
enable to focus on report creation and analytics [7]
{limitation} work properly only when the entities reside in the same storage account [7]

{feature} [premium] Linked Entities

allows to reference existing dataflows
one can perform calculations on these entities using computed entities [7]
allows to create a "single source of the truth" table that can be reused within multiple dataflows [7]
{limitation} work properly only when the entities reside in the same storage account [7]

{feature} [premium] Incremental refresh

adds parameters to the dataflow to specify the date range [7]

{concept} table

represents the data output of a query created in a dataflow, after the dataflow has been refreshed
represents data from a source and, optionally, the transformations that were applied to it

{concept} computed tables

similar to other tables

one get data from a source and one can apply further transformations to create them

their data originates from the storage dataflow used, and not the original data source [6]

⇐ they were previously created by a dataflow and then reused [6]

created by referencing a table in the same dataflow or in a different dataflow [6]

{concept} [Power Query] custom function

a mapping from a set of input values to a single output value [5]

{scenario} create reusable transformation logic that can be shared by many semantic models and reports inside Power BI [3]
{scenario} persist data in ADL Gen 2 storage, enabling you to expose it to other Azure services outside Power BI [3]
{scenario} create a single source of truth

encourages uptake by removing analysts' access to underlying data sources [3]

{scenario} strengthen security around underlying data sources by exposing data to report creators in dataflows

allows to limit access to underlying data sources, reducing the load on source systems [3]
gives administrators finer control over data refresh operations [3]

{scenario} perform ETL at scale,

dataflows with Power BI Premium scales more efficiently and gives you more flexibility [3]

{best practice} chose the best connector for the task provides the best experience and performance [5]
{best practice} filter data in the early stages of the query

some connectors can take advantage of filters through query folding [5]

{best practice} do expensive operations last

help minimize the amount of time spend waiting for the preview to render each time a new step is added to the query [5]

{best practice} temporarily work against a subset of your data

if adding new steps to the query is slow, consider using "Keep First Rows" operation and limiting the number of rows you're working against [5]

{best practice} use the correct data types

some features are contextual to the data type [5]

{best practice} explore the data
{best practice} document queries by renaming or adding a description to steps, queries, or groups [5]
{best practice} take a modular approach

split queries that contains a large number of steps into multiple queries
{goal} simplify and decouple transformation phases into smaller pieces to make them easier to understand [5]

{best practice} future-proof queries

make queries resilient to changes and able to refresh even when some components of data source change [5]

{best practice] creating queries that are dynamic and flexible via parameters [5]

parameters serves as a way to easily store and manage a value that can be reused in many different ways [5]

{best practice} create reusable functions

can be created from existing queries and parameters [5]

Previous Post <<||>> Next Post

Acronyms:
CDM - Common Data Model
ETL - Extract, Transform, Load

References:
[1] Microsoft Learn: Fabric (2023) Ingest data with Microsoft Fabric (link)

[2] Microsoft Learn: Fabric (2023) Dataflow Gen2 pricing for Data Factory in Microsoft Fabric (link)

[3] Microsoft Learn: Fabric (2023) Introduction to dataflows and self-service data prep (link)

[4] Microsoft Learn: Fabric (2023) Configure and consume a dataflow (link)

[5] Microsoft Learn: Fabric (2023) Dataflows best practices* (link)

[6] Microsoft Learn: Fabric (2023) Computed table scenarios and use cases (link)

[7] Microsoft Lean: Power BI - Learn (2024) Premium features of dataflows (link)

Resources:

🏭🗒️Microsoft Fabric: Dataflows Gen2 [Notes]

Disclaimer: This is work in progress intended to consolidate information from various sources for learning purposes. For the latest information please consult the documentation (see the links below)!

Last updated: 30-May-2025

Dataflow (Gen2) Architecture [4]

[Microsoft Fabric] Dataflow (Gen2)

new generation of dataflows that resides alongside the Power BI Dataflow (Gen1) [2]

brings new features and improved experience [2]
similar to Dataflow Gen1 in Power BI [2]
allow to

extract data from various sources [1]
transform it using a wide range of transformation operations [1]
load it into a destination [1]

{goal} provide an easy, reusable way to perform ETL tasks using Power Query Online [1]

allows to promote reusable ETL logic

⇒ prevents the need to create more connections to the data source [1]
offer a wide variety of transformations [1]

can be horizontally partitioned

{component} Lakehouse

used to stage data being ingested

{component} Warehouse

used as a compute engine and means to write back results to staging or supported output destinations faster

{component} mashup engine

extracts, transforms, or loads the data to staging or data destinations when either [4]

warehouse compute cannot be used [4]
{limitation} staging is disabled for a query [4]

{operation} create a dataflow

can be created in a

Data Factory workload
Power BI workspace
Lakehouse

{operation} publish a dataflow

generates dataflow's definition

⇐ the program that runs once the dataflow is refreshed to produce tables in staging storage and/or output destination [4]
used by the dataflow engine to generate an orchestration plan, manage resources, and orchestrate execution of queries across data sources, gateways, and compute engines, and to create tables in either the staging storage or data destination [4]

saves changes and runs validations that must be performed in the background [2]

{operation} refresh a dataflow
{operation} running a dataflow

can be performed

manually
on a refresh schedule
as part of a Data Pipeline orchestration

{feature} author dataflows with Power Query

uses the full Power Query experience of Power BI dataflows [2]

{feature} shorter authoring flow

uses step-by-step for getting the data into your the dataflow [2]

the number of steps required to create dataflows were reduced [2]

a few new features were added to improve the experience [2]

{feature} Auto-Save and background publishing

changes made to a dataflow are autosaved to the cloud (aka draft version of the dataflow) [2]

⇐ without having to wait for the validation to finish [2]

{functionality} save as draft

stores a draft version of the dataflow every time you make a change [2]
seamless experience and doesn't require any input [2]

{concept} published version

the version of the dataflow that passed validation and is ready to refresh [5]

{feature} integration with data pipelines

integrates directly with Data Factory pipelines for scheduling and orchestration [2]

{feature} high-scale compute

leverages a new, higher-scale compute architecture [2]

improves the performance of both transformations of referenced queries and get data scenarios [2]
creates both Lakehouse and Warehouse items in the workspace, and uses them to store and access data to improve performance for all dataflows [2]

{feature} improved monitoring and refresh history

integrate support for Monitoring Hub [2]
Refresh History experience upgraded [2]

{feature} get data via Dataflows connector

supports a wide variety of data source connectors

include cloud and on-premises relational databases

{feature} incremental refresh

enables to incrementally extract data from data sources, apply Power Query transformations, and load into various output destinations [5]

{feature|planned} Fast Copy
- enables large-scale data ingestion directly utilizing the pipelines Copy Activity capability [6]
- supports sources such Azure SQL Databases, CSV, and Parquet files in Azure Data Lake Storage and Blob Storage [6]
- significantly scales up the data processing capacity providing high-scale ELT capabilities [6]
{feature|planned}Cancel refresh

enables to cancel ongoing Dataflow Gen2 refreshes from the workspace items view [6]

{feature} data destinations

allows to

specify an output destination
separate ETL logic and destination storage [2]

every tabular data query can have a data destination [3]

available destinations

Azure SQL databases
Azure Data Explorer (Kusto)
Fabric Lakehouse
Fabric Warehouse
Fabric KQL database

a destination can be specified for every query individually [3]
multiple different destinations can be used within a dataflow [3]
connecting to the data destination is similar to connecting to a data source
{limitation} functions and lists aren't supported

{operation} creating a new table

{default} table name has the same name as the query name.

{operation} picking an existing table
{operation} deleting a table manually from the data destination

doesn't recreate the table on the next refresh [3]

{operation} reusing queries from Dataflow Gen1

{method} export Dataflow Gen1 query and import it into Dataflow Gen2

export the queries as a PQT file and import them into Dataflow Gen2 [2]

{method} copy and paste in Power Query

copy the queries and paste them in the Dataflow Gen2 editor [2]

automatic settings:

{limitation} supported only for Lakehouse and Azure SQL database
{setting} Update method replace:

data in the destination is replaced at every dataflow refresh with the output data of the dataflow [3]

{setting} Managed mapping:

the mapping is automatically adjusted when republishing the data flow to reflect the change

⇒ doesn't need to be updated manually into the data destination experience every time changes occur [3]

{setting} Drop and recreate table:

on every dataflow refresh the table is dropped and recreated to allow schema changes
{limitation} the dataflow refresh fails if any relationships or measures were added to the table [3]

update methods

{method} replace:

on every dataflow refresh, the data is dropped from the destination and replaced by the output data of the dataflow.
{limitation} not supported by Fabric KQL databases and Azure Data Explorer

{method} append:

on every dataflow refresh, the output data from the dataflow is appended (aka merged) to the existing data in the data destination table (aka upsert)

staging

{default} enabled

allows to use Fabric compute to execute queries

⇐ enhances the performance of query processing

the data is loaded into the staging location

⇐ an internal Lakehouse location accessible only by the dataflow itself

[Warehouse] staging is required before the write operation to the data destination

⇐ improves performance
{limitation} only loading into the same workspace as the dataflow is supported

using staging locations can enhance performance in some cases

disabled

{recommendation} [Lakehouse] disable staging on the query to avoid loading twice into a similar destination

⇐ once for staging and once for data destination
improves dataflow's performance

{scenario} use a dataflow to load data into the lakehouse and then use a notebook to analyze the data [2]
{scenario} use a dataflow to load data into an Azure SQL database and then use a data pipeline to load the data into a data warehouse [2]

{feature} fast copy

allows ingesting terabytes of data with the easy experience and the scalable back-end of the pipeline Copy Activity [7]
the feature must be enabled [7]

after enabling, Dataflows automatically switch the back-end when data size exceeds a particular threshold [7]
⇐there's no need to change anything during authoring of the dataflows
one can check thr refresh history to see if fast copy was used [7]
⇐see the Engine typeRequire fast copy option
{option} Require fast copy

{prerequisite} Fabric capacity is available [7]
{prerequisite} data files

are in .csv or parquet format
have at least 100 MB
are stored in an ADLS Gen2 or a Blob storage account [6]

{prerequisite} [Azure SQL DB|PostgreSQL] >= 5 million rows in the data source [7]
{limitation} doesn't support [7]

the VNet gateway
writing data into an existing table in Lakehouse
fixed schema

{feature} parameters

allow to dynamically control and customize dataflows

makes them more flexible and reusable by enabling different inputs and scenarios without modifying the dataflow itself [9]
the dataflow is refreshed by passing parameter values outside of the Power Query editor through either

Fabric REST API [9]
native Fabric experiences [9]

parameter names are case sensitive [9]
{type} required parameters

{warning} the refresh fails if no value is passed for it [9]

{type} optional parameters
enabled via Parameters >> Enable parameters to be discovered and override for execution [9]

{limitation} dataflows with parameters can't be

scheduled for refresh through the Fabric scheduler [9]
manually triggered through the Fabric Workspace list or lineage view [9]

{limitation} parameters that affect the resource path of a data source or a destination are not supported [9]

⇐connections are linked to the exact data source path defined in the authored dataflow

can't be currently override to use other connections or resource paths [9]

{limitation} can't be leveraged by dataflows with incremental refresh [9]
{limitation} supports only parameters of the type decimal number, whole number, text and true/false can be passed for override

any other data types don't produce a refresh request in the refresh history but show in the monitoring hub [9]

{warning} allow other users who have permissions to the dataflow to refresh the data with other values [9]
{limitation} refresh history does not display information about the parameters passed during the invocation of the dataflow [9]
{limitation} monitoring hub doesn't display information about the parameters passed during the invocation of the dataflow [9]
{limitation} staged queries only keep the last data refresh of a dataflow stored in the Staging Lakehouse [9]
{limitation} only the first request will be accepted from duplicated requests for the same parameter values [9]

subsequent requests are rejected until the first request finishes its evaluation [9]

{feature} supports CI/CD and Git integration

allows to create, edit, and manage dataflows in a Git repository that's connected to a Fabric workspace [10]
allows to use the deployment pipelines to automate the deployment of dataflows between workspaces [10]
allows to use Public APIs to create and manage Dataflow Gen2 with CI/CD and Git integration [10]
allows to create Dataflow Gen2 directly into a workspace folder [10]
allows to use the Fabric settings and scheduler to refresh and edit settings for Dataflow Gen2 [10]
{action} save a workflow

replaces the publish operation

when saving th dataflow, it automatically publishes the changes to the dataflow [10]

{action} delete a dataflow

the staging artifacts become visible in the workspace and are safe to be deleted [10]

{action} schedule a refresh

can be done manually or by scheduling a refresh [10]
{limitation} the Workspace view doesn't show if a refresh is ongoing for the dataflow [10]
refresh information is available in the refresh history [10]

{action} branching out to another workspace

{limitation} the refresh can fail with the message that the staging lakehouse couldn't be found [10]
{workaround} create a new Dataflow Gen2 with CI/CD and Git support in the workspace to trigger the creation of the staging lakehouse [10]
⇐ all other dataflows in the workspace should start to function again.

{action} syncing changes from GIT into the workspace

requires to open the new or updated dataflow and save changes manually with the editor [10]

triggers a publish action in the background to allow the changes to be used during refresh of the dataflow [10]

[Power Automate] {limitation} the connector for dataflows isn't working [10]

{benefit} enhance flexibility by allowing dynamic adjustments without altering the dataflow itself [9]
{benefit} extends data with consistent data, such as a standard date dimension table [1]
{benefit} allows self-service users access to a subset of data warehouse separately [1]
{benefit} optimizes performance with dataflows, which enable extracting data once for reuse, reducing data refresh time for slower sources [1]
{benefit} simplifies data source complexity by only exposing dataflows to larger analyst groups [1]
{benefit} ensures consistency and quality of data by enabling users to clean and transform data before loading it to a destination [1]
{benefit} simplifies data integration by providing a low-code interface that ingests data from various sources [1]
{limitation} not a replacement for a data warehouse [1]
{limitation} row-level security isn't supported [1]
{limitation} Fabric or Fabric trial capacity workspace is required [1]

Feature	Data flow Gen2	Dataflow Gen1
Author dataflows with Power Query	✓	✓
Shorter authoring flow	✓
Auto-Save and background publishing	✓
Data destinations	✓
Improved monitoring and refresh history	✓
Integration with data pipelines	✓
High-scale compute	✓
Get Data via Dataflows connector	✓	✓
Direct Query via Dataflows connector		✓
Incremental refresh	✓*
Fast Copy	✓*
Cancel refresh	✓*
AI Insights support		✓

Dataflow Gen1 vs Gen2 [2]

Previous Post <<||>> Next Post

References:
[1] Microsoft Learn (2023) Fabric: Ingest data with Microsoft Fabric (link)
[2] Microsoft Learn (2023) Fabric: Getting from Dataflow Generation 1 to Dataflow Generation 2 (link)
[3] Microsoft Learn (2023) Fabric: Dataflow Gen2 data destinations and managed settings (link)
[4] Microsoft Learn (2023) Fabric: Dataflow Gen2 pricing for Data Factory in Microsoft Fabric (link)
[5] Microsoft Learn (2023) Fabric: Save a draft of your dataflow (link)
[6] Microsoft Learn (2023) Fabric: What's new and planned for Data Factory in Microsoft Fabric (link)

[7] Microsoft Learn (2023) Fabric: Fast copy in Dataflows Gen2 [link]
[8] Microsoft Learn (2025) Fabric: Incremental refresh in Dataflow Gen2 [link]

[9] Microsoft Learn (2025) Use public parameters in Dataflow Gen2 (Preview) [link]

[10] Microsoft Learn (2025) Dataflow Gen2 with CI/CD and Git integration support [link]

Resources:
[R1] Arshad Ali & Bradley Schacht (2024) Learn Microsoft Fabric (link)
[R2] Microsoft Learn: Fabric (2023) Data Factory limitations overview (link)
[R3] Microsoft Fabric Blog (2023) Data Factory Spotlight: Dataflow Gen2, by Miguel Escobar (link)
[R4] Microsoft Learn: Fabric (2023) Dataflow Gen2 connectors in Microsoft Fabric (link)
[R5] Microsoft Learn: Fabric (2023) Pattern to incrementally amass data with Dataflow Gen2 (link)
[R6] Fourmoo (2004) Microsoft Fabric – Comparing Dataflow Gen2 vs Notebook on Costs and usability, by Gilbert Quevauvilliers (link)
[R7] Microsoft Learn: Fabric (2023) A guide to Fabric Dataflows for Azure Data Factory Mapping Data Flow users (link)
[R8] Microsoft Learn: Fabric (2023) Quickstart: Create your first dataflow to get and transform data (link)
[R9] Microsoft Learn: Fabric (2023) Microsoft Fabric decision guide: copy activity, dataflow, or Spark (link)
[R10] Microsoft Fabric Blog (2023) Dataflows Gen2 data destinations and managed settings, by Miquella de Boer (link)
[R11] Microsoft Fabric Blog (2023) Service principal support to connect to data in Dataflow, Datamart, Dataset and Dataflow Gen 2, by Miquella de Boer (link)
[R12] Chris Webb's BI Blog (2023) Fabric Dataflows Gen2: To Stage Or Not To Stage? (link)
[R13] Power BI Tips (2023) Let's Learn Fabric ep.7: Fabric Dataflows Gen2 (link)
[R14] Microsoft Learn (2025) Fabric: What's new in Microsoft Fabric? [link]

[R15] Microsoft Fabric Blog (2023) Passing parameter values to refresh a Dataflow Gen2 (Preview) [link]

Acronyms:
ADLS - Azure Data Lake Storage

CI/CD - Continuous Integration/Continuous Deployment

ETL - Extract, Transform, Load

KQL - Kusto Query Language
PQO - Power Query Online
PQT - Power Query Template

14 February 2024

🧭Business Intelligence: A One-Man Show (Part VI: The Lakehouse Perspective)

Business Intelligence Suite

Continuing the ideas on Christopher Laubenthal's article "Why one person can't do everything in the data space" [1] and why his analogy between a college's functional structure and the core data roles is poorly chosen. In the last post I mentioned as a first argument that the two constructions have different foundations.

Secondly, it's a matter of construction, namely the steps used to arrive from one state to another. Indeed, there's somebody who builds the data warehouse (DWH), somebody who builds the ETL/ELT pipelines for moving the data from the sources to the DWH, somebody who builds the sematic data model that includes business related logic, respectively people who tap into the data for reporting, data visualizations, data science projects, and whatever is still needed in the organization. On top of this, there should be somebody who manages the DWH. I haven't associated any role to them because one of the core roles can be responsible for more than one step.

In the case of a lakehouse, it is the data engineer who moves the data from the various data sources to the data lake if that doesn't happen already by design or configuration. As per my understanding the data engineers are the ones who design and build the new lakehouse, move transform and manage the data as required. The Data Analysts, Data Scientist and maybe some Information Designers can tap then into the data. However, the DWH and the lakehouse(s) are technologies that facilitate their work. They can still do their work also if the same data are available by other means.

In what concerns the dorm analogy, the verbs were chosen to match the way data warehouses (DWH) or lakehouses are built, though the congruence of the steps is questionable. One could have compared the number of students with the numbers of data entities, but not with the data themselves. Usually, students move by themselves and occupy the places. The story tellers, the assistants and researchers are independent on whether the students are hosted in the dorm or not. Therefore, the analogy seems to be a bit forced.

Frankly, I covered all the steps except the ones related to Data Science by myself for both described scenarios. It helped that I knew the data from the data sources and the transformations rules I had to apply, respectively the techniques needed for moving and transforming the data, and the volume of data entities was manageable somehow. Conversely, 1-2 more resources in the area of data analysis and visualizations could have helped to bring more value to the business.

This opens the challenge of scale and it has do to with systems engineering and how the number of components and the interactions between them increase systems' complexity and the demand for managing the respective components. In the simplest linear models, for each multiplier of a certain number of components of the same type from the organization, the number of resources managing the respective layer matches to some degree the multiplier. E.g. if a data engineer can handle x data entities in a unit of time, then for hand n*x components are more likely at least n data engineers required. However, the output of n components is only a fraction of the n*x given the dependencies existing between components and other constraints.

An optimization problem resumes in finding out what data roles to chose to cover an organization's needs. A one man show can be the best solution for small organizations, though unless there's a good division of labor, bringing a second person will make the throughput slower until will become faster.

Previous Post <<|||>> Next Post

Resources:
[1] Christopher Laubenthal (2024) "Why One Person Can’t Do Everything In Data" (link)

03 March 2023

🧊Data Warehousing: Architecture (Part IV: Building a Modern Data Warehouse with Azure Synapse)

Introduction

When building a data warehouse (DWH) several key words or derivatives of them appear in requirements: secure, flexible, simple, scalable, reliable, performant, non-redundant, modern, automated, real-timed, etc. As it proves in practice, all these requirements are sometimes challenging to address with the increased complexity of the architecture chosen. There are so many technologies on the DWH market promising all these at low costs, low effort and high ROI, though DWH projects continue to fail addressing the business and technical requirements.

On a basic level for building a DWH is needed a data storage layer and an ETL (Extract, Transfer, Load) tool responsible for the data movement between the various source systems and DWH, and eventually within the DWH itself. After that, each technology added to the landscape tends to increase the overall complexity (and should be regarded with a critical eye in what concerns the advantages and disadvantages).

Data Warehouse Architecture (on-premise)

A Reference Architecture

When building a DWH or a data migration solution, which has many of the characteristics of a DWH, from the many designs, I prefer to keep things as simple as possible. An approach based on a performant database engine like SQL Server as storage layer and SSIS (SQL Server Integration Services) as ETL proved to be the best choice until now, allowing to address most of the technical requirements by design. Then come the choices on how and where to import and transform the data, at what level of granularity, on how the semantic layer is built, how the data are accessed, etc.

Being able to pull (see extract subprocess) the data from the data sources on a need by basis offers the most flexible approach, however there are cases in which the direct access to source data is not possible, having to rely on a push approach, where data are dumped regularly to a given location (e.g. FTP folder structure), following to be picked up as needed. It's actually a hybrid between a push and pull, because a fully push approach would mean pushing the data directly to the DWH, which can be also acceptable, though might offer lower control on data's movement and involve a few other challenges (e.g. permissions, concurrency).

Data can be prepared for the DWH in the source systems (e.g. exposed via data objects or API calls), anywhere in between via ETL-based transformations (see transform subprocess) or directly in the DWH. I prefer importing the data (see load subprocess) 1:1 without any transformations from the various sources via SSIS (or similar technologies) into a set of tables that designated the staging area. It's true that in this way the ETL technology is used to a minimum, though unless there's a major benefit to use it for data transformations, using DWH's capabilities and SQL for data processing can provide better performance and flexibility.

Besides the selection of the columns in scope (typically columns with meaningful values), it's important not to do any transformations in the extraction layer because the data is imported faster (eventually using fast load options as in SSIS) and it assures a basis for troubleshooting (as the data don't change between loads). Some filters can be applied only when the volume of data is high, and the subset of the data could be identified clearly (e.g. when data are partitioned based on a key like business unit, legal entity or creation date).

For better traceability, the staging schemas can reflect the systems they come from, the tables and the columns should have the same names, respectively same data types. On such tables no constraints are applied and no indexes are needed. They can be constructed however on the production tables (aka base tables) - copy of the tables from production.

Some DWH architects try replicating the constraints from the source systems and/or add more constraints on top to define the various business rules. Rigor is good in some scenarios, though it can involve a considerable effort and it might be challenging to keep over time, especially when considering the impact of big data on DWH architectures. Instead of using constraints, building a set of SQL scripts that pinpoint the issues as reports allow more flexibility with the risk of having inconsistencies running wild through the reports. The data should be cleaned in the source system and not possible then properly addressed in the DWH. Applying constraints will make the data unavailable for reporting until data are corrected, while being more permissive would allow dirty data. Thus, either case has advantages or disadvantages, though the latter seems to be more appropriate.

Indexes on the production schema should reflect the characteristics of the queries run on the data and shouldn't replicate the indexes from the source environments, even if some overlaps might exist. In practice, dropping the non-clustered indexes on the production tables before loading the data from staging, and recreating them afterwards proves to provide faster loading (see load optimization techniques).

The production tables are used for building a "semantic" data model or something similar. Several levels of views, table-valued functions and/or indexed/materialized views allows building the dimensions and facts tables, the latter incorporating the business logic needed by the reports. Upon case, stored-procedures, physical or temporary tables, table variables can be used to prepare the data, though they tend to break the "free" flow of data as steps in-between need to be run. On the other side, in certain scenarios their use is unavoidable.

The first level of views (aka base views) is based on the base tables without any joins, though they include only the fields in use (needed by the business) ordered and "grouped" together based on their importance or certain characteristics. The views can include conversions of data types, translations of codes into meaningful values, and quite seldom filters on the data. Based on these "base" views the second level is built, which attempts to define the dimension and fact tables at the lowest granularity. These views include joins between tables coming from the same or different systems, respectively mappings of values defined in tables, and whatever it takes to build such entities. However, transformations on individual fields are pushed, when possible, to the lower level to minimize logic redundancy. From similar reasons, the logic could be broken down over two or more "helper" views when visible benefits could be obtained from it (e.g troubleshooting, reuse, maintenance). It's important to balance between creating too many helper views and encapsulating too much logic in a view.

One of the design principles used in building the entities is to minimize the redundance of the fields used, ideally without having columns duplicated between entities at this level. This would facilitate the traceability of columns to the source tables within the "semantic" layer (typically in the detriment of a few more joins). In practice, one is forced to replicate some columns to simplify some parts of the logic.

Further views can be built based on the dimension and fact entities to define the logic needed by the reports. Only these objects are used and no direct reference to the "base" tables or views are made. Moreover, to offer better performance when the views can be materialized or, when there's an important benefit, physically saved as table (e.g. having multiple indexes for different scenarios). It's the case of entities with considerable data volume called over and over.

This approach of building the entities is usually flexible enough to address most of the reporting requirements, independently whether the technical solution has the characteristics of a DWH, data mart or data migration layer. Moreover, the overall architectural approach can be used on-premise as well in cloud architectures, where Azure SQL Server and ADF (Azure Data Factory) provide similar capabilities. Compared with standard SQL Server, some features might not be available, while other features might bring further benefits, though the gaps should be neglectable.

Data Management topics like Master Data Management (MDM), Data Quality Management (DQM) and/or Metadata Management can be addressed as well by using third-party tools or tools from the Microsoft stack - Master Data Services (MDS) and Data Quality Services (DQS) in combination with SSIS help addressing a wide range of scenarios - however these are optional.

Moving to the Cloud

Within the context of big data, characterized by (high/variable) volume, value, variety, velocity, veracity, and further less important V's, the before technical requirements still apply, however within a cloud environment the overall architecture becomes more complex. Each component becomes a service. There are thus various services for data ingestion, storage, processing, sharing, collaboration, etc. The way data are processed involves also several important transformations: ETL becomes ELT, FTP and local storage by Data Lakes, data packages by data pipelines, stateful by stateless, SMP (Symmetric Multi-Processing) by MPP (Massive Parallel Processing), and so on.

As file storage is less expensive than database storage, there's an increasing trend of dumping business critical data into the Data Lake via data pipelines or features like Link to Data Lake or Export to Data Lake (*), which synchronize the data between source systems and Data Lake in near real-time at table or entity level. Either saved as csv, parquet, delta lake or any other standard file format, in single files or partitions, the data can be used directly or indirectly for analytics.

Cloud-native warehouses allow addressing topics like scalability, elasticity, fault-tolerance and performance by design, though further challenges appear as compute needs to be decoupled from storage, the workloads need to be estimated for assuring the performance, data may be distributed across data centers spanning geographies, the infrastructure is exposed to attacks, etc.

Azure Synapse

If one wants to take advantage of the MPP architecture's power, Microsoft provides an analytical architecture based on Azure Synapse, an analytics service that brings together data integration, enterprise DWH, and big data analytics. Besides two types of SQL-based data processing services (dedicated vs serverless SQL pools) it comes also with a Spark pool for in-memory cluster computing.

A DWH based on Azure Synapse is not that different from the reference architecture described above for an on-premise solution. Actually, a DWH based on a dedicated SQL pool (aka a physical data warehouse) involves the same steps mentioned above.

Data Warehouse Architecture with Dedicated SQL Pool

The data can be imported via ETL/ELT pipelines in the DWH, though there are also mechanisms for consuming the data directly from the files stored in the Data Lake or Azure storage. CETAS (aka Create External Table as Select) can be defined on top of the data files, the external tables acting as "staging" or "base" tables in the architecture described above. When using a dedicated SQL pool it makes sense to use the CETAS as "staging" tables, the processed data following to be dumped to "optimized" physical tables for consumption and refreshed periodically. However, when this happens the near real-time character of data is lost. Using the CETAs as base tables would keep this characteristic as long the data isn't saved physically in tables or files, maybe in the detriment of performance.

Using a dedicated SQL pool for direct reporting can become expensive as the pool needs to be available at least during business hours for incoming user requests, or at least for importing the data and refreshing the datasets. When using the CETAS as a base table, a serverless (aka on-demand) SQL pool, which uses a per-pay-use billing model could prove to be more cost-effective and flexible in many scenarios. By design, it helps to keep the near real-time character of the data. Moreover, even if the data are actually moved from the source tables into the Data Lake, this architecture has the characteristics of a logical data warehouse:

Data Warehouse Architecture with Serverless SQL Pool

Unfortunately, unless one uses Spark tables, misuses views or adds an Azure SQL database to the architecture, there are no physical tables or materialized views in a serverless SQL pool. There's still the option to use data pipelines for regullarly exporting intermediary data to files (incl. over partitions or folders), even if this involves more overhead as it's not possible to export data over SQL syntax to files more than once (though this might change in the future). For certain scenario it could be useful to store data in a Azure SQL Server or similar database, including a dedicated SQL pool.

Choosing between serverless and dedicated SQL pool is not an exclusive choice, both or all 3 types of pools (if we consider also the Spark pool) can be used in the architecture for addressing specific challenges, especially when we consider that there are important differences between the features available in each of the pools. Moreover, one can start the PoC based on the serverless SQL pool and when the solution became mature enough and used in all enterprise, parts of the logic or all of it can be migrated to a dedicated SQL pool. This would allow to save costs at the beginning in the detriment of further effort later.

Talking about the physical storage, data engineers recommend defining within a Data Lake several layers (aka regions, zones) labeled as bronze, silver and gold (and probably platinum will join the club anytime soon). The bronze layer refers to the raw data available in the Data Lake, including the files on which the initial CETAS are defined upon. The silver refers to transformed, cleaned, enriched and integrated data, data resulting from the second layer of views described above. The gold layer refers to the data to which business logic was applied and prepared for consumption, data resulting from the final layer of views. Of course, data pipelines can be used to prepare the data at these stages, though a view-based approach offers more flexibility, are easier to troubleshoot, manage and reuse than data pipelines.

Ideally the gold data should involve no or minimal further transformation before reaching the users, though that's not realistic. Building a DWH takes a considerable time and the business can't usually wait until everything is in place. Therefore, reports based on DWH will continue to coexist with reports directly accessing the source data, which will lead to controversies. Enforcing a single source of truth will help to minimize the gap, though will not eliminate it completely.

Closing Notes

These are just outlines of a minimal reference architecture. There's more to consider, as there are several alternatives (see [1] [2] [3] [4]) for each of the steps considered in here, each technology, new features or mechanisms opening new opportunities. The advantages and disadvantages should be always considered against the business needs and requirements. One approach, even if recommended, might not work for all, though unless there's an important requirement or an opportunity associated with an additional technology, deviating from reference architectures might not be such a good idea afterall.

Note:

(*) Existing customers have until 1-Nov-2024 to transition from Export to Data lake to Synapse link. Microsoft advises new customers to use Synapse Link.

Previous Post <<||>> Next Post

Resources:

[1] Microsoft Learn (2022) Modern data warehouse for small and medium business (link)

[2] Microsoft Learn (2022) Data warehousing and analytics (link)

[3] Microsoft Learn (2022) Enterprise business intelligence (link)

[4] Microsoft Learn (2022) Serverless Modern Data Warehouse Sample using Azure Synapse Analytics and Power BI (link)

[5] Coursera (2023) Data Warehousing with Microsoft Azure Synapse Analytics (link) [course, free to audit]

[6] SQLBits (2020) Mahesh Balija's Building Modern Data Warehouse with Azure Synapse Analytics (link)

[7] Matt How (2020) The Modern Data Warehouse in Azure: Building with Speed and Agility on Microsoft’s Cloud Platform (Amazon)

[8] James Serra's blog (2022) Data lake architecture (link)

[9] SQL Stijn (2022) SQL Building a Modern Lakehouse Data Warehouse with Azure Synapse Analytics: Moving your Database to the lake (link)

[10] Solliance (2022) Azure Synapse Analytics Workshop 400 (link) [GitHub repository]

SQL Troubles

Pages

12 April 2025

🏭🗒️Microsoft Fabric: Copy job in Data Factory [Notes]

15 March 2025

💫🗒️ERP Systems: Microsoft Dynamics 365's Business Performance Analytics (BPA) [notes]

09 February 2025

🏭🗒️Microsoft Fabric: Data Pipelines [Notes]

21 January 2025

🧊🗒️Data Warehousing: Extract, Transform, Load (ETL) [Notes]

18 April 2024

🏭Data Warehousing: Microsoft Fabric (Part II: Data(base) Mirroring) [New feature]

10 March 2024

🏭🗒️Microsoft Fabric: Dataflows Gen 1 [Notes]

🏭🗒️Microsoft Fabric: Dataflows Gen2 [Notes]

14 February 2024

🧭Business Intelligence: A One-Man Show (Part VI: The Lakehouse Perspective)

03 March 2023

🧊Data Warehousing: Architecture (Part IV: Building a Modern Data Warehouse with Azure Synapse)

About Me