SQL Troubles: pipeline

Showing posts with label pipeline. Show all posts

24 May 2025

🏭🗒️Microsoft Fabric: Materialized Lake Views (MLV) [Notes] 🆕🗓️

Disclaimer: This is work in progress intended to consolidate information from various sources for learning purposes. For the latest information please consult the documentation (see the links below)!

Last updated: 27-Jul-2025

-- create schema
CREATE SCHERA IF NOT EXISTS <lakehouse_name>.<schema_name>

-- create a materialized view
CREATE MATERIALIZED VIEW IF NOT EXISTS <lakehouse_name>.<schema_name>.<view_name> 
[(
    CONSTRAINT <constraint_name> CHECK (<constraint>) ON MISMATCH DROP 
)]

[PARTITIONED BY (col1, col2, ... )] 
[COMMENT “description or comment”] 
[TBLPROPERTIES (“key1”=”val1”, “key2”=”val2”,

AS 
SELECT ...
FROM ...
-- WHERE ...
--GROUP BY ...

[Microsoft Fabric] Materialized Lake Views (MLVs)

{def} persisted, continuously updated view of data [1]

{benefit} allows to build declarative data pipelines using SQL, complete with built-in data quality rules and automatic monitoring of data transformations

simplifies the implementation of multi-stage Lakehouse processing [1]

⇐ aids in the creation, management, and monitoring of views [3]
⇐ improves transformations through a declarative approach [3]
streamline data workflows
enable developers to focus on business logic [1]

⇐ not on infrastructural or data quality-related issues [1]

the views can be created in a notebook [2]

{benefit} allows developers visualize lineage across all entities in lakehouse, view the dependencies, and track its execution progress [3]

can have data quality constraints enforced and visualized for every run, showing completion status and conformance to data quality constraints defined in a single view [1]
empowers developers to set up complex data pipelines with just a few SQL statements and then handle the rest automatically [1]

faster development cycles
trustworthy data
quicker insights

{goal} process only the new or changed data instead of reprocessing everything each time [1]

⇐ leverages Delta Lake’s CDF under the hood

⇒ it can update just the portions of data that changed rather than recompute the whole view from scratch [1]

{operation} creation

allows defining transformations at each layer [1]

e.g. aggregation, projection, filters

allows specifying certain checks that the data must meet [1]

incorporate data quality constraints directly into the pipeline definition

via CREATE MATERIALIZED LAKE VIEW

the SQL syntax is declarative and Fabric figures out how to produce and maintain it [1]

{operation} refresh

refreshes only when its source has new data [1]

if there’s no change, it can skip running entirely (saving time and resources) [1]

via REFRESH MATERIALIZED LAKE VIEW [workspace.lakehouse.schema].MLV_Identifier [FULL];

{operation} list views from schema [3]

via SHOW MATERIALIZED LAKE VIEWS <IN/FROM> Schema_Name;

{opetation} retrieve definition

via SHOW CREATE MATERIALIZED LAKE VIEW MLV_Identifier;

{operstion} update definition

via ALTER MATERIALIZED LAKE VIEW MLV_Identifier RENAME TO MLV_Identifier_New;

{operstion} drop view

via DROP MATERIALIZED LAKE VIEW MLV_Identifier;
{warning} dropping or renaming a materialized lake view affects the lineage view and scheduled refresh [3]

{recommendation} update the reference in all dependent materialized lake views [3]

{operation} schedule view run

lets users set how often the MLV should be refreshed based on business needs and lineage execution timing [5]
depends on

data update frequency: the frequency with which the data is updated [5]
query performance requirements: Business requirement to refresh the data in defined frequent intervals [5]
system load: optimizing the time to run the lineage without overloading the system [5]

{operation} view run history

users can access the last 25 runs including lineage and run metadata

available from the dropdown for monitoring and troubleshooting

{concept} lineage

the sequence of MLV that needs to be executed to refresh the MLV once new data is available [5]

{feature} automatically generate a visual report that shows trends on data quality constraints

{benefit} allows to easily identify the checks that introduce maximum errors and the associated MLVs for easy troubleshooting [1]

{feature} can be combined with Shortcut Transformation feature for CSV ingestion

{benefit} facilitate the building of end-to-end Medallion architectures

{feature} dependency graph

allows to see the dependencies existing between the various objects [2]

⇐ automatically generated [2]

{feature} data quality

{benefit} allows to compose precise queries to exclude poor quality data from the source tables [5]
[medallion architecture] ensuring data quality is essential at every stage of the architecture [5]
maintained by setting constraints when defining the MLVs [5]
{action} FAIL

stops refreshing an MLV if any constraint is violated [5]
{default} halt is at the first instance

even without specifying the FAIL keyword [5]
takes precedence over DROP

{action} DROP

processes the MLV and removes records that don't meet the specified constrain [5]

provides the count of removed records in the lineage view [5]

{constraint} updating data quality constraints after creating an MLV isn't supported [5]

⇐ the MLV must be recreated

{constraint} the use of functions and pattern search with operators in constraint condition is restricted [5]

e.g. LIKE, regex

{known issue} the creation and refresh of an MLV with a FAIL action in constraint may result in a "delta table not found" error

{recommendation} recreate the MLV and avoid using the FAIL action [5]

{feature} data quality report

built-in Power BI dashboard that shows several aggregated metrics [2]

{feature} monitor hub

centralized portal to browse MLV runs in the lakehouse [7]
{operation} view runs' status [7]
{operation} search and filter the runs [7]

based on different criteria

{operation} cancel in-progress run [7]
{operation} drill down run execution details [7]

doesn't support

{feature|planned} PySpark [3]
{feature|planned} incremental refresh [3]
{feature|planned} integration with Data Activator [3]
{feature|planned} API [3]
{feature|planned} cross-lakehouse lineage and execution [3]
{limitation} Spark properties set at the session level aren't applied during scheduled lineage refresh [4]
{limitation} creation with delta time-travel [4]
{limitation} DML statements [4]
{limitation} UDFs in CTAS [4]
{limitation} temporary views can't be used to define MLVs [4]

{recommendation} create all MLVs for a workspace within a single Lakehouss [8]

allows to ensure a consolidated and clear representation in the DAG view [8]

Previous Post <<||>> Next Post

References:

[1] Microsoft Fabric Update Blog (2025) Simplifying Medallion Implementation with Materialized Lake Views in Fabric [link|aka]

[2] Power BI Tips (2025) Microsoft Fabric Notebooks with Materialized Views - Quick Tips [link]

[3] Microsoft Learn (2025) What are materialized lake views in Microsoft Fabric? [link]

[4] Microsoft Learn (2025) Materialized lake views Spark SQL reference [link]

[5] Microsoft Learn (2025) Manage Fabric materialized lake views lineage [link]

[6] Microsoft Learn (2025) Data quality in materialized lake views [link]

[7] Microsoft Learn (2025) Monitor materialized lake views [link]
[8] Microsoft Learn (2025) Development and Monitoring [link]

Resources:
[R1] Databricks (2025) Use materialized views in Databricks SQL [link]

[R2] Microsoft Learn (2025) Implement medallion architecture with materialized lake views [link]

[R3] Mastering Declarative Data Transformations with Materialized Lake Views [link]

Acronyms:

API - Application Programming Interface

CDF - Change Data Feed

CTAS - Create Tables As Select

DML - Data Manipulation Language

ETL - Extract, Transfer, Load

MF - Microsoft Fabric
MLV - Materialized Lake views

UDF - User-defined functions

26 April 2025

🏭🗒️Microsoft Fabric: Deployment Pipelines [Notes]

Disclaimer: This is work in progress intended to consolidate information from various sources for learning purposes. For the latest information please consult the documentation (see the links below)!

Last updated: 26-Apr-2025

[Microsoft Fabric] Deployment Pipelines

{def} a structured process that enables content creators to manage the lifecycle of their organizational assets [5]

enable creators to develop and test content in the service before it reaches the users [5]

can simplify the deployment process to development, test, and production workspaces [5]
one Premium workspace is assigned to each stage [5]
each stage can have

different configurations [5]
different databases or different query parameters [5]

{action} create pipeline

from the deployment pipelines entry point in Fabric [5]

creating a pipeline from a workspace automatically assigns it to the pipeline [5]

{action} define how many stages it should have and what they should be called [5]

{default} has three stages

e.g. Development, Test, and Production
the number of stages can be changed anywhere between 2-10
{action} add another stage,
{action} delete stage
{action} rename stage

by typing a new name in the box

{action} share a pipeline with others

users receive access to the pipeline and become pipeline admins [5]

⇐ the number of stages are permanent [5]

can't be changed after the pipeline is created [5]

{action} add content to the pipeline [5]

done by assigning a workspace to the pipeline stage [5]

the workspace can be assigned to any stage [5]

{action|optional} make a stage public

{default} the final stage of the pipeline is made public
a consumer of a public stage without access to the pipeline sees it as a regular workspace [5]

without the stage name and deployment pipeline icon on the workspace page next to the workspace name [5]

{action} deploy to an empty stage

when finishing the work in one pipeline stage, the content can be deployed to the next stage [5]

deployment can happen in any direction [5]

{option} full deployment

deploy all content to the target stage [5]

{option} selective deployment

allows select the content to deploy to the target stage [5]

{option} backward deployment

deploy content from a later stage to an earlier stage in the pipeline [5]
{restriction} only possible when the target stage is empty [5]

{action} deploy content between pages [5]

content can be deployed even if the next stage has content

paired items are overwritten [5]

{action|optional} create deployment rules

when deploying content between pipeline stages, allow changes to content while keeping some settings intact [5]
once a rule is defined or changed, the content must be redeployed

the deployed content inherits the value defined in the deployment rule [5]
the value always applies as long as the rule is unchanged and valid [5]

{feature} deployment history

allows to see the last time content was deployed to each stage [5]
allows to to track time between deployments [5]

{concept} pairing

{def} the process by which an item in one stage of the deployment pipeline is associated with the same item in the adjacent stage

applies to reports, dashboards, semantic models
paired items appear on the same line in the pipeline content list [5]

⇐ items that aren't paired, appear on a line by themselves [5]

the items remain paired even if their name changes
items added after the workspace is assigned to a pipeline aren't automatically paired [5]

⇐ one can have identical items in adjacent workspaces that aren't paired [5]

[lakehouse]

can be removed as a dependent object upon deployment [3]
supports mapping different Lakehouses within the deployment pipeline context [3]
{default} a new empty Lakehouse object with same name is created in the target workspace [3]

⇐ if nothing is specified during deployment pipeline configuration
notebook and Spark job definitions are remapped to reference the new lakehouse object in the new workspace [3]
{warning} a new empty Lakehouse object with same name still is created in the target workspace [3]
SQL Analytics endpoints and semantic models are provisioned
no object inside the Lakehouse is overwritten [3]
updates to Lakehouse name can be synchronized across workspaces in a deployment pipeline context [3]

[notebook] deployment rules can be used to customize the behavior of notebooks when deployed [4]

e.g. change notebook's default lakehouse [4]
{feature} auto-binding

binds the default lakehouse and attached environment within the same workspace when deploying to next stage [4]

[environment] custom pool is not supported in deployment pipeline

the configurations of Compute section in the destination environment are set with default values [6]
⇐ subject to change in upcoming releases [6]

[warehouse]

[database project] ALTER TABLE to add a constraint or column

{limitation} the table will be dropped and recreated when deploying, resulting in data loss

{recommendation} do not create a Dataflow Gen2 with an output destination to the warehouse

⇐ deployment would be blocked by a new item named DataflowsStagingWarehouse that appears in the deployment pipeline [10]

SQL analytics endpoint is not supported

[Eventhouse]

{limitation} the connection must be reconfigured in destination that use Direct Ingestion mode [8]

[EventStream]

{limitation} limited support for cross-workspace scenarios

{recommendation} make sure all EventStream destinations within the same workspace [8]

KQL database

applies to tables, functions, materialized views [7]

KQL queryset

⇐ tabs, data sources [7]

[real-time dashboard]

data sources, parameters, base queries, tiles [7]

[SQL database]

includes the specific differences between the individual database objects in the development and test workspaces [9]

can be also used with

Previous Post <<||>> Next Post

References:

[1] Microsoft Learn (2024) Get started with deployment pipelines [link]

[2] Microsoft Learn (2024) Implement continuous integration and continuous delivery (CI/CD) in Microsoft Fabric [link]

[3] Microsoft Learn (2024) Lakehouse deployment pipelines and git integration (Preview) [link]

[4] Microsoft Learn (2024) Notebook source control and deployment [link]

[5] Microsoft Learn (2024) Introduction to deployment pipelines [link]

[6] Environment Git integration and deployment pipeline [link]

[7] Microsoft Learn (2024) Microsoft Learn (2024) Real-Time Intelligence: Git integration and deployment pipelines (Preview) [link]

[8] Microsoft Learn (2024) Eventstream CI/CD - Git Integration and Deployment Pipeline [link]

[9] Microsoft Learn (2024) Get started with deployment pipelines integration with SQL database in Microsoft Fabric [link]

[10] Microsoft Learn (2025) Source control with Warehouse (preview) [link]

Resources:

Acronyms:
CLM - Content Lifecycle Management
UAT - User Acceptance Testing

09 February 2025

🏭🗒️Microsoft Fabric: Data Pipelines [Notes]

Disclaimer: This is work in progress intended to consolidate information from various sources for learning purposes. For the latest information please consult the documentation (see the links below)!

Last updated: 9-Feb-2024

[Microsoft Fabric] Data pipeline

{def} a logical sequence of activities that orchestrate a process and perform together a task [1]

usually by extracting data from one or more sources and loading it into a destination;

⇐ often transforming it along the way [1]

⇐ allows to manage the activities as a set instead of each one individually [2]
⇐ used to automate ETL processes that ingest transactional data from operational data stores into an analytical data store [1]
e.g. lakehouse or data warehouse

{concept} activity

{def} an executable task in a pipeline

a flow of activities can be defined by connecting them in a sequence [1]
its outcome (success, failure, or completion) can be used to direct the flow to the next activity in the sequence [1]

{type} data movement activities

copies data from a source data store to a sink data store [2]

{type} data transformation activities

encapsulate data transfer operations

incl. simple Copy Data activities that extract data from a source and load it to a destination
incl. complex Data Flow activities that encapsulate dataflows (Gen2) that apply transformations to the data as it is transferred
incl. notebook activities to run a Spark notebook
incl. stored procedure activities to run SQL code
incl. delete data activities to delete existing data

{type} control flow activities

used to

implement loops
implement conditional branching
manage variables
manage parameter values
enable to implement complex pipeline logic to orchestrate data ingestion and transformation flow [1]

can be parameterized

⇐enabling to provide specific values to be used each time a pipeline is run [1]

when executed, a run is initiated (aka data pipeline run

runs can be initiated on-demand or scheduled to start at a specific frequency
use the unique run ID to review run details to confirm they completed successfully and investigate the specific settings used for each execution [1]

{benefit} increases pipelines’ reusability

{concept} pipeline template

predefined pipeline that can be used and customize as required

{concept} data pipeline run

occurs when a data pipeline is executed
the activities in the data pipeline are executed to completion [3]
can be triggered one of two ways

on-demand
on a schedule

the scheduled pipeline will be able to run based on the time and frequency set [3]

Previous Post <<||>> Next Post

References:
[1] Microsoft Learn (2023) Use Data Factory pipelines in Microsoft Fabric [link]
[2] Microsoft Learn (2024) Microsoft Fabric: Activity overview [link]
[3] Microsoft Learn (2024) Microsoft Fabric Concept: Data pipeline Runs [link]

Resources
[R1] Metadata Driven Pipelines for Microsoft Fabric (link)
[R2] Microsoft Learn (2025) Fabric: What's new in Microsoft Fabric? [link]

22 January 2025

🏭🗒️Microsoft Fabric: Folders [Notes]

Disclaimer: This is work in progress intended to consolidate information from various sources for learning purposes. For the latest information please consult the documentation (see the links below)!

Last updated: 22-Jan-2025

[Microsoft Fabric] Folders

{def} organizational units inside a workspace that enable users to efficiently organize and manage artifacts in the workspace [1]
identifiable by its name

{constraint} must be unique in a folder or at the root level of the workspace
{constraint} can’t include certain special characters [1]

C0 and C1 control codes [1]
leading or trailing spaces [1]
characters: ~"#.&*:<>?/{|} [1]

{constraint} can’t have system-reserved names

e.g. $recycle.bin, recycled, recycler.

{constraint} its length can't exceed 255 characters

{operation} create folder

can be created in

an existing folder (aka nested subfolder) [1]

{restriction} a maximum of 10 levels of nested subfolders can be created [1]
up to 10 folders can be created in the root folder [1]
{benefit} provide a hierarchical structure for organizing and managing items [1]

the root

{operation} move folder
{operation} rename folder

same rules applies as for folders’ creation [1]

{operation} delete folder

{restriction} currently can be deleted only empty folders [1]

{recommendation} make sure the folder is empty [1]

{operation} create item in folder

{restriction} certain items can’t be created in a folder

dataflows gen2
streaming semantic models
streaming dataflows

⇐ items created from the home page or the Create hub, are created at the root level of the workspace [1]

{operation} move file(s) between folders [1]
{operation} publish to folder [1]

Power BI reports can be published to specific folders

{restriction} folders' name must be unique throughout an entire workspace, regardless of their location [1]

when publishing a report to a workspace that has another report with the same name in a different folder, the report will publish to the location of the already existing report [1]

{limitation}may not be supported by certain features

e.g. Git

{recommendation} use folders to organize workspaces [1]

allows to improve content’s organization and navigation [AN]
allows to improve collaboration, access control and governance [AN]

{permissions}

inherit the permissions of the workspace where they're located [1] [2]
workspace admins, members, and contributors can create, modify, and delete folders in the workspace [1]
viewers can only view folder hierarchy and navigate in the workspace [1]

[deployment pipelines] deploying items in folders to a different stage, the folder hierarchy is automatically applied [2]

Previous Post <<||>> Next Post

References:
[1] Microsoft Fabric (2024) Create folders in workspaces [link]
[2] Microsoft Fabric (2024) The deployment pipelines process [link]
[3] Microsoft Fabric Updates Blog (2025) Define security on folders within a shortcut using OneLake data access roles [link]
[4] Microsoft Fabric Updates Blog (2025) Announcing the General Availability of Folder in Workspace [link]
[5] Microsoft Fabric Updates Blog (2025) Announcing Folder in Workspace in Public Preview [link]
[6] Microsoft Fabric Updates Blog (2025) Getting the size of OneLake data items or folders [link]

Resources:

[R1] Microsoft Learn (2025) Fabric: What's new in Microsoft Fabric? [link]

20 January 2025

🏭🗒️Microsoft Fabric: [Azure] Service Principals (SPN) [Notes]

Disclaimer: This is work in progress intended to consolidate information from various sources for learning purposes. For the latest information please consult the documentation (see the links below)!

Last updated: 20-Jan-2025

[Azure] Service Principal (SPN)

{def} a non-human, application-based security identity used by applications or automation tools to access specific Azure resources [1]

can be assigned precise permissions, making them perfect for automated processes or background services

allows to minimize the risks of human error and identity-based vulnerabilities
supported in datasets, Gen1/Gen2 dataflows, datamarts [2]
authentication type

supported only by [2]

Azure Data Lake Storage
Azure Data Lake Storage Gen2
Azure Blob Storage
Azure Synapse Analytics
Azure SQL Database
Dataverse
SharePoint online

doesn’t support

SQL data source with Direct Query in datasets [2]

when registering a new application in Microsoft Entra ID, a SPN is automatically created for the app registration [4]

the access to resources is restricted by the roles assigned to the SPN

⇒ gives control over which resources can be accessed and at which level [4]

{recommendation} use SPN with automated tools [4]

rather than allowing them to sign in with a user identity [4]

{prerequisite} an active Microsoft Entra user account with sufficient permissions to

register an application with the tenant [4]
assign to the application a role in the Azure subscription [4]
⇐ requires Application.ReadWrite.All permission [4]

extended to support Fabric Data Warehouses [1]

{benefit} automation-friendly API Access

allows to create, update, read, and delete Warehouse items via Fabric REST APIs using service principals [1]
enables to automate repetitive tasks without relying on user credentials [1]

e.g. provisioning or managing warehouses
increases security by limiting human error

the warehouses thus created, will be displayed in the Workspace list view in Fabric UI, with the Owner name of the SPN [1]
applicable to users with administrator, member, or contributor workspace role [3]
minimizes risk

the warehouses created with delegated account or fixed identity (owner’s identity) will stop working when the owner leaves the organization [1]

Fabric requires the user to login every 30 days to ensure a valid token is provided for security reasons [1]

{benefit} seamless integration with Client Tools:

tools like SSMS can connect to the Fabric DWH using SPN [1]
SPN provides secure access for developers to

run COPY INTO

with and without firewall enabled storage [1]

run any T-SQL query programmatically on a schedule with ADF pipelines [1]

{benefit} granular access control

Warehouses can be shared with an SPN through the Fabric portal [1]

once shared, administrators can use T-SQL commands to assign specific permissions to SPN [1]

allows to control precisely which data and operations an SPN has access to [1]

GRANT SELECT ON <table name> TO <Service principal name>

warehouses' ownership can be changed from an SPN to user, and vice-versa [3]

{benefit} improved DevOps and CI/CD Integration

SPN can be used to automate the deployment and management of DWH resources [1]

⇐ ensures faster, more reliable deployment processes while maintaining strong security postures [1]

{limitation} default semantic models are not supported for SPN created warehouses [3]

⇒ features such as listing tables in dataset view, creating report from the default dataset don’t work [3]

{limitation} SPN for SQL analytics endpoints is not currently supported
{limitation} SPNs are currently not supported for COPY INTO error files [3]

⇐ Entra ID credentials are not supported as well [3]

{limitation} SPNs are not supported for GIT APIs. SPN support exists only for Deployment pipeline APIs [3]
monitoring tools

[DMV] sys.dm_exec_sessions.login_name column [3]
[Query Insights] queryinsights.exec_requests_history.login_name [3]
Query activity

submitter column in Fabric query activity [3]

Capacity metrics app:

compute usage for warehouse operations performed by SPN appears as the Client ID under the User column in Background operations drill through table [3]

Previous Post <<||>> Next Post

References:

[1] Microsoft Fabric Updates Blog (2024) Service principal support for Fabric Data Warehouse [link]

[2] Microsoft Fabric Learn (2024) Service principal support in Data Factory [link]

[3] Microsoft Fabric Learn (2024) Service principal in Fabric Data Warehouse [link]

[4] Microsoft Fabric Learn (2024) Register a Microsoft Entra app and create a service principal [link]

[5] Microsoft Fabric Updates Blog (2024) Announcing Service Principal support for Fabric APIs [link]

Resources:

[R1] Microsoft Learn (2025) Fabric: What's new in Microsoft Fabric? [link]

Acronyms:

ADF - Azure Data Factory

API - Application Programming Interface

CI/CD - Continuous Integration/Continuous Deployment

DMV - Dynamic Management View

DWH - Data Warehouse

SPN - service principal

SSMS - SQL Server Management Studio

18 April 2024

🏭Data Warehousing: Microsoft Fabric (Part II: Data(base) Mirroring) [New feature]

Data Warehousing Series

Microsoft recently announced [4] the preview of a new Fabric feature called Mirroring, a low-cost, low-latency fully managed service that allows to replicate data from various systems together into OneLake [1]. Currently only Azure SQL Database, Azure Cosmos DB, and Snowflake are supported, though probably more database vendors will be targeted soon.

For Microsoft Fabric's data engineers, data scientists and data warehouse professionals this feature is huge as importance because they don't need to care anymore about making the data available in Microsoft Fabric, which involves a considerable amount of work.

Usually, at least for flexibility, transparence, performance and standardization, data professionals prefer to extract the data 1:1 from the source systems into a landing zone in the data warehouse or data/delta lake from where the data are further processed as needed. One data pipeline is thus built for every table in scope, which sometimes is a 10–15-minute effort per table, when the process is standardized, though upon case the effort is much higher if troubleshooting (e.g. data type incompatibility or support) or further logic changes are involved. Maintaining such data pipelines can prove to be costly over time, especially when periodic changes are needed.

Microsoft lists other downsides of the ETL approach - restricted access to data changes, friction between people, processes, and technology, respectively the effort needed to create the pipelines, and the time needed for importing the data [1]. There's some truth is each of these points, though everything is relative. For big tables, however, refreshing all the data overnight can prove to be time-consuming and costly, especially when the data don't lie within the same region, respectively data center. Unless the data can be refreshed incrementally, the night runs can extend into the day, will all the implications that derive from this - not having actual data, which decreases the trust in reports, etc. There are tricks to speed up the process, though there are limits to what can be done.

With mirroring, the replication of data between data sources and the analytics platform is handled in the background, after an initial replication, the changes in the source systems being reflected with a near real-time latency into OneLake, which is amazing! This allows building near real-time reporting solutions which can help the business in many ways - reviewing (and correcting in the data source) records en masse, faster overview of what's happening in the organizations, faster basis for decision-making, etc. Moreover, the mechanism is fully managed by Microsoft, which is thus responsible for making sure that the data are correctly synchronized. Only from this perspective 10-20% from the effort of building an analytics solution is probably reduced.

Mirroring in Microsoft Fabric (adapted after [2])

According to the documentation, one can replicate a whole database or choose individual regular tables (currently views aren't supported [3]), stop, restart, or remove a table from a mirroring. Moreover, through sharing, users can grant to other users or groups of users access to a mirrored database without giving access to the workspace and the rest of its items [1].

The data professionals and citizens can write then cross-database queries against the mirrored databases, warehouses, and the SQL analytics endpoints of lakehouses, combining data from all these sources into a single T-SQL query, which opens lot of opportunities especially in what concerns the creation of an enterprise semantic model, which should be differentiated from the semantic model created by default by the mirroring together with the SQL analytics endpoint.

Considering that the data is replicated into delta tables, one can take advantage of all the capabilities available with such tables - data versioning, time travel, interoperability and/or performance, respectively direct consumption in Power BI.

Previous Post <<||>> Next Post

References:
[1] Microsoft Learn - Microsoft Fabric (2024) What is Mirroring in Fabric? (link)
[2] Microsoft Learn - Microsoft Fabric (2024) Mirroring Azure SQL Database [Preview] (link)
[3] Microsoft Learn - Microsoft Fabric (2024) Frequently asked questions for Mirroring Azure SQL Database in Microsoft Fabric [Preview] (link)
[4] Microsoft Fabric Updates Blog (2024) Announcing the Public Preview of Mirroring in Microsoft Fabric, by Charles Webb (link)

10 March 2024

🏭🗒️Microsoft Fabric: Dataflows Gen2 [Notes]

Disclaimer: This is work in progress intended to consolidate information from various sources for learning purposes. For the latest information please consult the documentation (see the links below)!

Last updated: 30-May-2025

Dataflow (Gen2) Architecture [4]

[Microsoft Fabric] Dataflow (Gen2)

new generation of dataflows that resides alongside the Power BI Dataflow (Gen1) [2]

brings new features and improved experience [2]
similar to Dataflow Gen1 in Power BI [2]
allow to

extract data from various sources [1]
transform it using a wide range of transformation operations [1]
load it into a destination [1]

{goal} provide an easy, reusable way to perform ETL tasks using Power Query Online [1]

allows to promote reusable ETL logic

⇒ prevents the need to create more connections to the data source [1]
offer a wide variety of transformations [1]

can be horizontally partitioned

{component} Lakehouse

used to stage data being ingested

{component} Warehouse

used as a compute engine and means to write back results to staging or supported output destinations faster

{component} mashup engine

extracts, transforms, or loads the data to staging or data destinations when either [4]

warehouse compute cannot be used [4]
{limitation} staging is disabled for a query [4]

{operation} create a dataflow

can be created in a

Data Factory workload
Power BI workspace
Lakehouse

{operation} publish a dataflow

generates dataflow's definition

⇐ the program that runs once the dataflow is refreshed to produce tables in staging storage and/or output destination [4]
used by the dataflow engine to generate an orchestration plan, manage resources, and orchestrate execution of queries across data sources, gateways, and compute engines, and to create tables in either the staging storage or data destination [4]

saves changes and runs validations that must be performed in the background [2]

{operation} refresh a dataflow
{operation} running a dataflow

can be performed

manually
on a refresh schedule
as part of a Data Pipeline orchestration

{feature} author dataflows with Power Query

uses the full Power Query experience of Power BI dataflows [2]

{feature} shorter authoring flow

uses step-by-step for getting the data into your the dataflow [2]

the number of steps required to create dataflows were reduced [2]

a few new features were added to improve the experience [2]

{feature} Auto-Save and background publishing

changes made to a dataflow are autosaved to the cloud (aka draft version of the dataflow) [2]

⇐ without having to wait for the validation to finish [2]

{functionality} save as draft

stores a draft version of the dataflow every time you make a change [2]
seamless experience and doesn't require any input [2]

{concept} published version

the version of the dataflow that passed validation and is ready to refresh [5]

{feature} integration with data pipelines

integrates directly with Data Factory pipelines for scheduling and orchestration [2]

{feature} high-scale compute

leverages a new, higher-scale compute architecture [2]

improves the performance of both transformations of referenced queries and get data scenarios [2]
creates both Lakehouse and Warehouse items in the workspace, and uses them to store and access data to improve performance for all dataflows [2]

{feature} improved monitoring and refresh history

integrate support for Monitoring Hub [2]
Refresh History experience upgraded [2]

{feature} get data via Dataflows connector

supports a wide variety of data source connectors

include cloud and on-premises relational databases

{feature} incremental refresh

enables to incrementally extract data from data sources, apply Power Query transformations, and load into various output destinations [5]

{feature|planned} Fast Copy
- enables large-scale data ingestion directly utilizing the pipelines Copy Activity capability [6]
- supports sources such Azure SQL Databases, CSV, and Parquet files in Azure Data Lake Storage and Blob Storage [6]
- significantly scales up the data processing capacity providing high-scale ELT capabilities [6]
{feature|planned}Cancel refresh

enables to cancel ongoing Dataflow Gen2 refreshes from the workspace items view [6]

{feature} data destinations

allows to

specify an output destination
separate ETL logic and destination storage [2]

every tabular data query can have a data destination [3]

available destinations

Azure SQL databases
Azure Data Explorer (Kusto)
Fabric Lakehouse
Fabric Warehouse
Fabric KQL database

a destination can be specified for every query individually [3]
multiple different destinations can be used within a dataflow [3]
connecting to the data destination is similar to connecting to a data source
{limitation} functions and lists aren't supported

{operation} creating a new table

{default} table name has the same name as the query name.

{operation} picking an existing table
{operation} deleting a table manually from the data destination

doesn't recreate the table on the next refresh [3]

{operation} reusing queries from Dataflow Gen1

{method} export Dataflow Gen1 query and import it into Dataflow Gen2

export the queries as a PQT file and import them into Dataflow Gen2 [2]

{method} copy and paste in Power Query

copy the queries and paste them in the Dataflow Gen2 editor [2]

automatic settings:

{limitation} supported only for Lakehouse and Azure SQL database
{setting} Update method replace:

data in the destination is replaced at every dataflow refresh with the output data of the dataflow [3]

{setting} Managed mapping:

the mapping is automatically adjusted when republishing the data flow to reflect the change

⇒ doesn't need to be updated manually into the data destination experience every time changes occur [3]

{setting} Drop and recreate table:

on every dataflow refresh the table is dropped and recreated to allow schema changes
{limitation} the dataflow refresh fails if any relationships or measures were added to the table [3]

update methods

{method} replace:

on every dataflow refresh, the data is dropped from the destination and replaced by the output data of the dataflow.
{limitation} not supported by Fabric KQL databases and Azure Data Explorer

{method} append:

on every dataflow refresh, the output data from the dataflow is appended (aka merged) to the existing data in the data destination table (aka upsert)

staging

{default} enabled

allows to use Fabric compute to execute queries

⇐ enhances the performance of query processing

the data is loaded into the staging location

⇐ an internal Lakehouse location accessible only by the dataflow itself

[Warehouse] staging is required before the write operation to the data destination

⇐ improves performance
{limitation} only loading into the same workspace as the dataflow is supported

using staging locations can enhance performance in some cases

disabled

{recommendation} [Lakehouse] disable staging on the query to avoid loading twice into a similar destination

⇐ once for staging and once for data destination
improves dataflow's performance

{scenario} use a dataflow to load data into the lakehouse and then use a notebook to analyze the data [2]
{scenario} use a dataflow to load data into an Azure SQL database and then use a data pipeline to load the data into a data warehouse [2]

{feature} fast copy

allows ingesting terabytes of data with the easy experience and the scalable back-end of the pipeline Copy Activity [7]
the feature must be enabled [7]

after enabling, Dataflows automatically switch the back-end when data size exceeds a particular threshold [7]
⇐there's no need to change anything during authoring of the dataflows
one can check thr refresh history to see if fast copy was used [7]
⇐see the Engine typeRequire fast copy option
{option} Require fast copy

{prerequisite} Fabric capacity is available [7]
{prerequisite} data files

are in .csv or parquet format
have at least 100 MB
are stored in an ADLS Gen2 or a Blob storage account [6]

{prerequisite} [Azure SQL DB|PostgreSQL] >= 5 million rows in the data source [7]
{limitation} doesn't support [7]

the VNet gateway
writing data into an existing table in Lakehouse
fixed schema

{feature} parameters

allow to dynamically control and customize dataflows

makes them more flexible and reusable by enabling different inputs and scenarios without modifying the dataflow itself [9]
the dataflow is refreshed by passing parameter values outside of the Power Query editor through either

Fabric REST API [9]
native Fabric experiences [9]

parameter names are case sensitive [9]
{type} required parameters

{warning} the refresh fails if no value is passed for it [9]

{type} optional parameters
enabled via Parameters >> Enable parameters to be discovered and override for execution [9]

{limitation} dataflows with parameters can't be

scheduled for refresh through the Fabric scheduler [9]
manually triggered through the Fabric Workspace list or lineage view [9]

{limitation} parameters that affect the resource path of a data source or a destination are not supported [9]

⇐connections are linked to the exact data source path defined in the authored dataflow

can't be currently override to use other connections or resource paths [9]

{limitation} can't be leveraged by dataflows with incremental refresh [9]
{limitation} supports only parameters of the type decimal number, whole number, text and true/false can be passed for override

any other data types don't produce a refresh request in the refresh history but show in the monitoring hub [9]

{warning} allow other users who have permissions to the dataflow to refresh the data with other values [9]
{limitation} refresh history does not display information about the parameters passed during the invocation of the dataflow [9]
{limitation} monitoring hub doesn't display information about the parameters passed during the invocation of the dataflow [9]
{limitation} staged queries only keep the last data refresh of a dataflow stored in the Staging Lakehouse [9]
{limitation} only the first request will be accepted from duplicated requests for the same parameter values [9]

subsequent requests are rejected until the first request finishes its evaluation [9]

{feature} supports CI/CD and Git integration

allows to create, edit, and manage dataflows in a Git repository that's connected to a Fabric workspace [10]
allows to use the deployment pipelines to automate the deployment of dataflows between workspaces [10]
allows to use Public APIs to create and manage Dataflow Gen2 with CI/CD and Git integration [10]
allows to create Dataflow Gen2 directly into a workspace folder [10]
allows to use the Fabric settings and scheduler to refresh and edit settings for Dataflow Gen2 [10]
{action} save a workflow

replaces the publish operation

when saving th dataflow, it automatically publishes the changes to the dataflow [10]

{action} delete a dataflow

the staging artifacts become visible in the workspace and are safe to be deleted [10]

{action} schedule a refresh

can be done manually or by scheduling a refresh [10]
{limitation} the Workspace view doesn't show if a refresh is ongoing for the dataflow [10]
refresh information is available in the refresh history [10]

{action} branching out to another workspace

{limitation} the refresh can fail with the message that the staging lakehouse couldn't be found [10]
{workaround} create a new Dataflow Gen2 with CI/CD and Git support in the workspace to trigger the creation of the staging lakehouse [10]
⇐ all other dataflows in the workspace should start to function again.

{action} syncing changes from GIT into the workspace

requires to open the new or updated dataflow and save changes manually with the editor [10]

triggers a publish action in the background to allow the changes to be used during refresh of the dataflow [10]

[Power Automate] {limitation} the connector for dataflows isn't working [10]

{benefit} enhance flexibility by allowing dynamic adjustments without altering the dataflow itself [9]
{benefit} extends data with consistent data, such as a standard date dimension table [1]
{benefit} allows self-service users access to a subset of data warehouse separately [1]
{benefit} optimizes performance with dataflows, which enable extracting data once for reuse, reducing data refresh time for slower sources [1]
{benefit} simplifies data source complexity by only exposing dataflows to larger analyst groups [1]
{benefit} ensures consistency and quality of data by enabling users to clean and transform data before loading it to a destination [1]
{benefit} simplifies data integration by providing a low-code interface that ingests data from various sources [1]
{limitation} not a replacement for a data warehouse [1]
{limitation} row-level security isn't supported [1]
{limitation} Fabric or Fabric trial capacity workspace is required [1]

Feature	Data flow Gen2	Dataflow Gen1
Author dataflows with Power Query	✓	✓
Shorter authoring flow	✓
Auto-Save and background publishing	✓
Data destinations	✓
Improved monitoring and refresh history	✓
Integration with data pipelines	✓
High-scale compute	✓
Get Data via Dataflows connector	✓	✓
Direct Query via Dataflows connector		✓
Incremental refresh	✓*
Fast Copy	✓*
Cancel refresh	✓*
AI Insights support		✓

Dataflow Gen1 vs Gen2 [2]

Previous Post <<||>> Next Post

References:
[1] Microsoft Learn (2023) Fabric: Ingest data with Microsoft Fabric (link)
[2] Microsoft Learn (2023) Fabric: Getting from Dataflow Generation 1 to Dataflow Generation 2 (link)
[3] Microsoft Learn (2023) Fabric: Dataflow Gen2 data destinations and managed settings (link)
[4] Microsoft Learn (2023) Fabric: Dataflow Gen2 pricing for Data Factory in Microsoft Fabric (link)
[5] Microsoft Learn (2023) Fabric: Save a draft of your dataflow (link)
[6] Microsoft Learn (2023) Fabric: What's new and planned for Data Factory in Microsoft Fabric (link)

[7] Microsoft Learn (2023) Fabric: Fast copy in Dataflows Gen2 [link]
[8] Microsoft Learn (2025) Fabric: Incremental refresh in Dataflow Gen2 [link]

[9] Microsoft Learn (2025) Use public parameters in Dataflow Gen2 (Preview) [link]

[10] Microsoft Learn (2025) Dataflow Gen2 with CI/CD and Git integration support [link]

Resources:
[R1] Arshad Ali & Bradley Schacht (2024) Learn Microsoft Fabric (link)
[R2] Microsoft Learn: Fabric (2023) Data Factory limitations overview (link)
[R3] Microsoft Fabric Blog (2023) Data Factory Spotlight: Dataflow Gen2, by Miguel Escobar (link)
[R4] Microsoft Learn: Fabric (2023) Dataflow Gen2 connectors in Microsoft Fabric (link)
[R5] Microsoft Learn: Fabric (2023) Pattern to incrementally amass data with Dataflow Gen2 (link)
[R6] Fourmoo (2004) Microsoft Fabric – Comparing Dataflow Gen2 vs Notebook on Costs and usability, by Gilbert Quevauvilliers (link)
[R7] Microsoft Learn: Fabric (2023) A guide to Fabric Dataflows for Azure Data Factory Mapping Data Flow users (link)
[R8] Microsoft Learn: Fabric (2023) Quickstart: Create your first dataflow to get and transform data (link)
[R9] Microsoft Learn: Fabric (2023) Microsoft Fabric decision guide: copy activity, dataflow, or Spark (link)
[R10] Microsoft Fabric Blog (2023) Dataflows Gen2 data destinations and managed settings, by Miquella de Boer (link)
[R11] Microsoft Fabric Blog (2023) Service principal support to connect to data in Dataflow, Datamart, Dataset and Dataflow Gen 2, by Miquella de Boer (link)
[R12] Chris Webb's BI Blog (2023) Fabric Dataflows Gen2: To Stage Or Not To Stage? (link)
[R13] Power BI Tips (2023) Let's Learn Fabric ep.7: Fabric Dataflows Gen2 (link)
[R14] Microsoft Learn (2025) Fabric: What's new in Microsoft Fabric? [link]

[R15] Microsoft Fabric Blog (2023) Passing parameter values to refresh a Dataflow Gen2 (Preview) [link]

Acronyms:
ADLS - Azure Data Lake Storage

CI/CD - Continuous Integration/Continuous Deployment

ETL - Extract, Transform, Load

KQL - Kusto Query Language
PQO - Power Query Online
PQT - Power Query Template

29 March 2021

Notes: Team Data Science Process (TDSP)

Team Data Science Process (TDSP)

an agile, iterative data science methodology to deliver predictive analytics solutions and intelligent applications efficiently [1]
{goal} help customers fully realize the benefits of their analytics program [1]
{component} data science lifecycle definition
- {description} a framework to structure the development of data science projects [1]
- {goal} designed for data science projects that ship as part of intelligent applications that deploy ML & AI models for predictive analytics [1]
- {benefit} can be used in the context of other DM methodologies as they have common ground [1]
  - e.g. CRISP-DM, KDD
- {benefit} exploratory data science projects or improvised analytics projects can also benefit from using this process [1]
{component} standardized project structure
- {description} a directory structure that includes templates for project documents
  - ⇒makes it easy for team members to find information [1]
  - ⇐templates for the folder structure and required documents are provided in standard locations [1]
  - all code and documents are stored in an agile VCS tracking repository [1]
    - {recommendation} create a separate repository for each project on the VCS for versioning, information security, and collaboration [1]
- {benefit} organizes the code for the various activities [1]
- {benefit} allows tracking the progress [1]
- {benefit} provides checklist with key questions for each project to guarantee process and deliverables’ quality [1]
- {benefit} enables team collaboration [1]
- {benefit} allows closer tracking of the code for individual features [1]
- {benefit} enables teams to obtain better cost estimates [1]
- {benefit} helps build institutional knowledge across the organization [1]
{component} recommended infrastructure
- {description} a set of recommendations for the infrastructure and resources needed for analytics and storage [1]
- {benefit} addresses cloud and/or on-premises requirements [1]
- {benefit} enables reproducible analysis [1]
- {benefit} avoids infrastructure duplication [1]
  - ⇒minimizes inconsistencies and unnecessary infrastructure costs [1]
- {tools} tools are provided to provision the shared resources, track them, and allow each team member to connect to those resources securely [1]
- {good practice} create a consistent compute environment [1]
  - ⇐allows team members replicate and validate experiments [1]
{component} recommended tools and utilities
- {description} a set of recommendations for the tools and utilities needed for project’s execution [1]
- {benefit} help lower the barriers and increase the consistency of their adoption [1]
- {benefit} provides an initial set of tools and scripts to jump-start methodology’s adoption [1]
- {benefit} helps automate some of the common tasks in the data science lifecycle [1]
  - e.g. data exploration and baseline modeling [1]
- {benefit} well-defined structure provided for individuals to contribute shared tools and utilities into their team's shared code repository [1]
  - ⇐ resources can then be leveraged by other projects [1]
{phase} 1: business understanding
- {goal} define and document the business problem, its objectives, the needed attributes, and the metric(s) used to determine project’s success
- {goal} identify and document the relevant data sources
- {step} 1.1: define project’s objectives
  - elicit together with the stakeholders the requirements, define and document the problem and its objectives, respectively the metric(s) used to determine project’s success
    - requires a good understanding of the business processes, data and further characteristics
- {step} 1.2: identify data sources
  - identify the attributes and the data sources relevant to the problem under study
- {step} 1.3: define project plan and team*
  - develop a high-level milestone plan and identify the resources needed for executing it
- {tool} project charter
  - standard template that documents the business problem, the scope of the project, the business objectives and metric(s) used to determine project’s success
{phase} 2: data acquisition & understanding
- {goal} prepare the base dataset(s) as needed by the modeling phase into the target repository
- {goal} build the data ETL/ELT architecture and processes needed for provisioning the basis data
- {step} 2.1: ingest data
  - make the required data available for the team in the repository where the analytics operations take place
- {step} 2.2: explore data
  - understand data’s characteristics by leveraging specific tools (visualization, analysis)
  - prepare the data as needed for further processing
- {step} 2.3: set up pipelines
  - build the pipelines needed for data actualization and qualitative assessment [3]
  - set up a process to score new data or refresh the data regularly [3]
- {step} 2.4: feasibility analysis*
  - reevaluate the project to determine whether the value expected is sufficient to continue pursuing it
- {tool} data quality report
  - report that includes data summaries, data mappings, variable ranking, data qualitative assessment(s) and further information [3]
- {tool} solution architecture
  - diagram and/or textual-based description of the data pipeline(s), technical assumptions and further aspects
- {tool} data reports
  - document the structure and statistics of the raw data
- {tool} checkpoint decision
  - decision template document that
    - summarizes the findings of the feasibility analysis step
    - includes a set of choices and recommendations for the next steps
    - serves as basis for the decision on whether to continue or not the project, respectively what the next steps are
{phase} 3: modeling
- {goal} create a machine-learning model that addresses the prediction requirements and that's suitable for production
- {step} 3.1: feature engineering
  - the inclusion, aggregation, and transformation of raw variables to create the features used in the analysis [4]
    - ⇐requires a good understanding of how the features relate to each other and how the ML algorithms use those features [4]
- {step} 3.2: model selection*
  - choose one or more modeling algorithms that address problem’s characteristics the best
- {step} 3.3: model training
  - involves the following steps:
    - split the input data into training and test datasets
    - build the models by using the training dataset
    - evaluate the training and the test data set
    - determine the optimal setup and methods
- {step} 3.4: model evaluation
  - evaluate the performance of the model(s)
- {step} 3.5: feasibility analysis*
  - evaluate the readiness of the models for use into production, respectively on whether they fulfill project’s objectives
- {tool} feature sets
  - describe the features developed for the modeling and how they were generated
  - contains pointers to the code used to generate the features
- {tool} model report
  - a standard, template-based report that provides details on each experiment’s outcomes
  - created for each model tried
- {tool} checkpoint decision
- {tool} model performance metrics
  - e.g. ROC curves or MSE
{phase} 4: deployment
- {goal} deploy the models and the data pipelines to the environment used for final user acceptance
- {step} 4.1: operationalize architecture
  - prepare the models and data pipelines for use into production
  - {best practice} expose the models over an open API interface
    - enables models’ consumption from various applications
  - {best practice} build telemetry and monitoring into the models and the data pipelines [5]
    - helps in monitoring and troubleshooting [5]
- {step} 4.2: deploy solution*
  - deploy the architecture into production
- {tool} status dashboard
  - displays data on system’s health and key metrics
- {tool} model report
  - the report in its final form with deployment information
- {tool} solution architecture
  - the document in its final form
{phase} 5: customer acceptance
- {goal} confirm that project’s objectives were fulfilled and get customer’s acceptance
- {step} 5.1: system validation
  - validate system’s performance and outcomes and confirm that it fulfills customer’s needs
- {step} 5.2: project signoff*
  - finalize and review documentation
  - handover the solution and afferent documentation to customer
  - evaluate the project against the defined objectives and get customer’ signoff
- {tool} exit report
- {tool} technical report
  - contains all the details of the project that are useful for learning about how to operate the system [6]

Acronyms:

Artificial Intelligence (AI)

Cross-Industry Standard Process for Data Mining (CRISP-DM)

Data Mining (DM)

Knowledge Discovery in Databases (KDD)

Team Data Science Process (TDSP)

Version Control System (VCS)

Visual Studio Team Services (VSTS)

Resources:

[1] Microsoft Azure (2020) What is the Team Data Science Process? [source]

[2] Microsoft Azure (2020) The business understanding stage of the Team Data Science Process lifecycle [source]

[3] Microsoft Azure (2020) Data acquisition and understanding stage of the Team Data Science Process [source]

[4] Microsoft Azure (2020) Modeling stage of the Team Data Science Process lifecycle [source]

[5] Microsoft Azure (2020) Deployment stage of the Team Data Science Process lifecycle [source]

[6] Microsoft Azure (2020) Customer acceptance stage of the Team Data Science Process lifecycle [source]

SQL Troubles

Pages

24 May 2025

🏭🗒️Microsoft Fabric: Materialized Lake Views (MLV) [Notes] 🆕🗓️

26 April 2025

🏭🗒️Microsoft Fabric: Deployment Pipelines [Notes]

09 February 2025

🏭🗒️Microsoft Fabric: Data Pipelines [Notes]

22 January 2025

🏭🗒️Microsoft Fabric: Folders [Notes]

20 January 2025

🏭🗒️Microsoft Fabric: [Azure] Service Principals (SPN) [Notes]

18 April 2024

🏭Data Warehousing: Microsoft Fabric (Part II: Data(base) Mirroring) [New feature]

10 March 2024

🏭🗒️Microsoft Fabric: Dataflows Gen2 [Notes]

29 March 2021

Notes: Team Data Science Process (TDSP)

About Me