
24 May 2025

🏭🗒️Microsoft Fabric: Materialized Lake Views (MLV) [Notes] 🆕🗓️

Disclaimer: This is work in progress intended to consolidate information from various sources for learning purposes. For the latest information please consult the documentation (see the links below)! 

Last updated: 24-May-2025

-- create schema
CREATE SCHEMA IF NOT EXISTS <lakehouse_name>.<schema_name>

-- create a materialized view
CREATE MATERIALIZED LAKE VIEW IF NOT EXISTS <lakehouse_name>.<schema_name>.<view_name> 
(
    CONSTRAINT <constraint_name> CHECK (<constraint>) ON MISMATCH DROP 
) 
AS 
SELECT ...
FROM ...
-- WHERE ...
--GROUP BY ...
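
A concrete instantiation of the template above might look as follows (a minimal sketch; the lakehouse, schema, table and column names are hypothetical):

-- create a materialized lake view with a data quality constraint
CREATE MATERIALIZED LAKE VIEW IF NOT EXISTS LH_Sales.silver.orders_cleansed
(
    CONSTRAINT non_empty_order CHECK (SalesOrderID IS NOT NULL) ON MISMATCH DROP 
) 
AS 
SELECT SalesOrderID
, CustomerID
, OrderDate
, OrderQty * ListPrice AS LineItemTotal
FROM LH_Sales.bronze.orders
WHERE OrderDate IS NOT NULL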

[Microsoft Fabric] Materialized Lake Views (MLV)

  • {def} persisted, continuously updated view of data [1]
    • {benefit} allows building declarative data pipelines using SQL, complete with built-in data quality rules and automatic monitoring of data transformations
      • simplifies the implementation of multi-stage Lakehouse processing [1]
        • streamline data workflows
        • enable developers to focus on business logic [1]
          • ⇐ not on infrastructural or data quality-related issues [1]
        • the views can be created in a notebook [2]
      • can have data quality constraints enforced and visualized for every run, showing completion status and conformance to data quality constraints defined in a single view [1]
      • empowers developers to set up complex data pipelines with just a few SQL statements, with the rest handled automatically [1]
        • faster development cycles 
        • trustworthy data
        • quicker insights
  • {goal} process only the new or changed data instead of reprocessing everything each time [1]
    • ⇐  leverages Delta Lake’s CDF under the hood
      • ⇒ it can update just the portions of data that changed rather than recompute the whole view from scratch [1]
  • {operation} creation
    • allows defining transformations at each layer [1]
      • e.g. aggregation, projection, filters
    • allows specifying certain checks that the data must meet [1]
      • incorporate data quality constraints directly into the pipeline definition
    • via CREATE MATERIALIZED LAKE VIEW
      • the SQL syntax is declarative and Fabric figures out how to produce and maintain it [1]
  • {operation} refresh
    • refreshes only when its source has new data [1]
      • if there’s no change, it can skip running entirely (saving time and resources) [1]
  • {feature} automatically generated visual report that shows trends on data quality constraints
    • {benefit} makes it easy to identify the checks that introduce the most errors and the associated MLVs for troubleshooting [1]
  • {feature} can be combined with Shortcut Transformation feature for CSV ingestion 
    • {benefit} allows building an end-to-end Medallion architecture
  • {feature} dependency graph
    • allows seeing the dependencies existing between the various objects [2]
      • ⇐ automatically generated [2]
  • {feature} data quality report
    • built-in Power BI dashboard that shows several aggregated metrics [2]
  • {feature|planned} support for PySpark
  • {feature|planned} incremental refresh
  • {feature|planned} integration with Data Activator

References:
[1] Microsoft Fabric Update Blog (2025) Simplifying Medallion Implementation with Materialized Lake Views in Fabric [link|aka]
[2] Power BI Tips (2025) Microsoft Fabric Notebooks with Materialized Views - Quick Tips [link]
[3] Microsoft Learn (2025)  [link]

Resources:
[R1] Databricks (2025) Use materialized views in Databricks SQL [link]

Acronyms:
CDF - Change Data Feed
ETL - Extract, Transform, Load
MF - Microsoft Fabric
MLV - Materialized Lake Views

26 April 2025

🏭🗒️Microsoft Fabric: Deployment Pipelines [Notes]

Disclaimer: This is work in progress intended to consolidate information from various sources for learning purposes. For the latest information please consult the documentation (see the links below)! 

Last updated: 26-Apr-2025

[Microsoft Fabric] Deployment Pipelines

  • {def} a structured process that enables content creators to manage the lifecycle of their organizational assets [5]
    • enable creators to develop and test content in the service before it reaches the users [5]
      • can simplify the deployment process to development, test, and production workspaces [5]
      • one Premium workspace is assigned to each stage [5]
      • each stage can have 
        • different configurations [5]
        • different databases or different query parameters [5]
  • {action} create pipeline
    • from the deployment pipelines entry point in Fabric [5]
      • creating a pipeline from a workspace automatically assigns it to the pipeline [5]
    • {action} define how many stages it should have and what they should be called [5]
      • {default} has three stages
        • e.g. Development, Test, and Production
        • the number of stages can be changed anywhere between 2-10 
        • {action} add another stage,
        • {action} delete stage
        • {action} rename stage 
          • by typing a new name in the box
        • {action} share a pipeline with others
          • users receive access to the pipeline and become pipeline admins [5]
        • ⇐ the number of stages is permanent [5]
          • can't be changed after the pipeline is created [5]
    • {action} add content to the pipeline [5]
      • done by assigning a workspace to the pipeline stage [5]
        • the workspace can be assigned to any stage [5]
    • {action|optional} make a stage public
      • {default} the final stage of the pipeline is made public
      • a consumer of a public stage without access to the pipeline sees it as a regular workspace [5]
        • without the stage name and deployment pipeline icon on the workspace page next to the workspace name [5]
    • {action} deploy to an empty stage
      • when finishing the work in one pipeline stage, the content can be deployed to the next stage [5] 
        • deployment can happen in any direction [5]
      • {option} full deployment 
        • deploy all content to the target stage [5]
      • {option} selective deployment 
        • allows selecting the content to deploy to the target stage [5]
      • {option} backward deployment 
        • deploy content from a later stage to an earlier stage in the pipeline [5] 
        • {restriction} only possible when the target stage is empty [5]
    • {action} deploy content between stages [5]
      • content can be deployed even if the next stage has content
        • paired items are overwritten [5]
    • {action|optional} create deployment rules
      • when deploying content between pipeline stages, deployment rules allow changes to content while keeping some settings intact [5]
      • once a rule is defined or changed, the content must be redeployed
        • the deployed content inherits the value defined in the deployment rule [5]
        • the value always applies as long as the rule is unchanged and valid [5]
    • {feature} deployment history 
      • allows to see the last time content was deployed to each stage [5]
      • allows tracking the time between deployments [5]
  • {concept} pairing
    • {def} the process by which an item in one stage of the deployment pipeline is associated with the same item in the adjacent stage
      • applies to reports, dashboards, semantic models
      • paired items appear on the same line in the pipeline content list [5]
        • ⇐ items that aren't paired appear on a line by themselves [5]
      • the items remain paired even if their name changes
      • items added after the workspace is assigned to a pipeline aren't automatically paired [5]
        • ⇐ one can have identical items in adjacent workspaces that aren't paired [5]
  • [lakehouse]
    • can be removed as a dependent object upon deployment [3]
    • supports mapping different Lakehouses within the deployment pipeline context [3]
    • {default} a new empty Lakehouse object with same name is created in the target workspace [3]
      • ⇐ if nothing is specified during deployment pipeline configuration
      • notebook and Spark job definitions are remapped to reference the new lakehouse object in the new workspace [3]
      • {warning} a new empty Lakehouse object with the same name is still created in the target workspace [3]
      • SQL Analytics endpoints and semantic models are provisioned
      • no object inside the Lakehouse is overwritten [3]
      • updates to Lakehouse name can be synchronized across workspaces in a deployment pipeline context [3] 
  • [notebook] deployment rules can be used to customize the behavior of notebooks when deployed [4]
    • e.g. change notebook's default lakehouse [4]
    • {feature} auto-binding
      • binds the default lakehouse and attached environment within the same workspace when deploying to next stage [4]
  • [environment] custom pool is not supported in deployment pipeline
    • the configurations of Compute section in the destination environment are set with default values [6]
    • ⇐ subject to change in upcoming releases [6]
  • [warehouse]
    • [database project] ALTER TABLE to add a constraint or column
      • {limitation} the table will be dropped and recreated when deploying, resulting in data loss
    • {recommendation} do not create a Dataflow Gen2 with an output destination to the warehouse
      • ⇐ deployment would be blocked by a new item named DataflowsStagingWarehouse that appears in the deployment pipeline [10]
    • SQL analytics endpoint is not supported
  • [Eventhouse]
    • {limitation} the connection must be reconfigured in destinations that use Direct Ingestion mode [8]
  • [EventStream]
    • {limitation} limited support for cross-workspace scenarios
      • {recommendation} make sure all EventStream destinations are within the same workspace [8]
  • KQL database
    • applies to tables, functions, materialized views [7]
  • KQL queryset
    • ⇐ tabs, data sources [7]
  • [real-time dashboard]
    • data sources, parameters, base queries, tiles [7]
  • [SQL database]
    • includes the specific differences between the individual database objects in the development and test workspaces [9]
  • can be also used with

    References:
    [1] Microsoft Learn (2024) Get started with deployment pipelines [link]
    [2] Microsoft Learn (2024) Implement continuous integration and continuous delivery (CI/CD) in Microsoft Fabric [link]
    [3] Microsoft Learn (2024)  Lakehouse deployment pipelines and git integration (Preview) [link]
    [4] Microsoft Learn (2024) Notebook source control and deployment [link]
    [5] Microsoft Learn (2024) Introduction to deployment pipelines [link]
    [6] Microsoft Learn (2024) Environment Git integration and deployment pipeline [link]
    [7] Microsoft Learn (2024) Real-Time Intelligence: Git integration and deployment pipelines (Preview) [link]
    [8] Microsoft Learn (2024) Eventstream CI/CD - Git Integration and Deployment Pipeline [link]
    [9] Microsoft Learn (2024) Get started with deployment pipelines integration with SQL database in Microsoft Fabric [link]
    [10] Microsoft Learn (2025) Source control with Warehouse (preview) [link]

    Resources:

    Acronyms:
    CLM - Content Lifecycle Management
    UAT - User Acceptance Testing

    24 April 2025

    💎🏭SQL Reloaded: Microsoft Fabric's Lakehouses at Work (Part I: Proof-of-Concept)


    Introduction

    One way to work with the data files existing in an organization is to import them into a lakehouse and build a data model based on them that can be reused in the various solutions (incl. Power BI). As a reminder, a lakehouse is a data architecture platform for storing, managing, and analyzing structured and unstructured data in a single location.

    The tutorials available on lakehouses are pretty useful for getting an idea how to start, though probably each seasoned professional has his/her own way of doing things, at least for testing the capabilities before building a proper solution. The target is thus to create the minimum needed for a proof-of-concept solution. 

    The files used in this post are available on Microsoft's GitHub. Download the files and store them to be easily available for the next steps. The following files were considered for the current post: customers.csv, orders.csv and products.csv.

    Create the Lakehouse

    It's recommended to have a naming convention in place for the various items created in a workspace, e.g. a combination of item type (lakehouse, warehouse), environment type (Prod, UAT, Dev, PoC) and possibly the department (e.g. FIN, SCM, HR, etc.). One should try to balance meaning and usefulness. Ideally, one should use 2 to a maximum of 4 letters for each part encoded in the name. For example, the following scripts were created in the LH_SCM_PoC lakehouse. 

    More complex naming conventions can include the system (e.g. D365, CRM, EBS) or the company. The target is to be able to easily identify the systems, independently of how complex the rules are. Given that it can become challenging to rename the schemas afterwards, ideally the naming convention should be available from the early stages. 
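
    For illustration, a few hypothetical names following such a convention:

    -- LH_SCM_PoC  - lakehouse, Supply Chain Management, proof-of-concept
    -- LH_FIN_Dev  - lakehouse, Finance, development
    -- WH_SCM_Prod - warehouse, Supply Chain Management, production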

    Create the Schema

    A lakehouse comes with a dbo schema available by default, though it's recommended to create meaningful schema(s) as needed. The schemas should ideally reflect the domain of the data (e.g. departments or other key areas) and shouldn't change when the objects are deployed between the different environments. Where appropriate, one should consider creating multiple schemas that reflect the model's key areas. The names should be simple and suggestive.

    -- create schema
    CREATE Schema Orders
    

    Create a Staging Area

    The next step is to create a staging area where the files in scope can be made available and then further loaded into the lakehouse. One needs to compromise between creating a deep hierarchical structure that reflects the business structure and the need to easily identify, respectively manage the files. A hierarchical structure with 1-2 levels could provide the needed compromise, though each additional level tends to increase the complexity. 

    One should also consider rules for archiving or even deleting the files.

    Upload the Files

    Microsoft Fabric allows users to upload multiple files together in a single step. Ideally the files should have proper names for each column, otherwise the resulting overhead may appear later in the process. 

    When the files are available in multiple folders in a complex hierarchical structure, a set of shortcuts could help in their management.

    Load the Data

    A file's data can be loaded into the lakehouse on the fly by providing a valid table name:
    Files >> SCM_Orders >> (select file) >> Load to Tables >> new table >> Load file to new table >> (provide information) >> Load

    Load file to new table


    Of course, the table's name must be unique within the schema, and the remaining properties must match the file's structure. 

    One should consider loading a couple of tables first, performing a rough validation of the imported data, and only after that importing the remaining tables. This allows identifying the issues that typically lead to data reimports (wrong formatting, invalid column names, duplicated files, etc.) or rework.

    If the files have different characteristics (e.g. delimiters, number of attributes/records, special data types), one should consider this in the initial scope and have at least one example from each category. 

    Review the Metadata

    Once the files are made available, there's a tendency to start directly with the development without analyzing the data or, equally important, the available metadata. To review the metadata of the newly created tables, one can use the objects from the standard INFORMATION_SCHEMA (see post):

    -- retrieve the list of tables
    SELECT * 
    FROM INFORMATION_SCHEMA.TABLES
    WHERE TABLE_SCHEMA = 'orders'
    ORDER BY TABLE_NAME

    Further on, one can review columns' definition:
     
    -- retrieve column metadata
    SELECT TABLE_CATALOG
    , TABLE_SCHEMA
    , TABLE_NAME
    , COLUMN_NAME
    , ORDINAL_POSITION
    , DATA_TYPE
    , CHARACTER_MAXIMUM_LENGTH
    , NUMERIC_PRECISION
    , NUMERIC_SCALE
    , DATETIME_PRECISION
    , CHARACTER_SET_NAME
    FROM INFORMATION_SCHEMA.COLUMNS
    WHERE TABLE_SCHEMA = 'orders'
    ORDER BY TABLE_NAME
    , ORDINAL_POSITION

    It's a good idea to save the metadata to a file and use it later for reviews, respectively for metadata management, when no other solution is in place for the same (e.g. Purview). That's also useful for people with limited or no access to the workspace. 

    Alternatively, one can use a notebook with the following SQL-based commands: 

    %%sql
    
    DESCRIBE TABLE LH_SCM_PoC.orders.sales;
    
    DESCRIBE TABLE EXTENDED LH_SCM_PoC.orders.sales;
    

    One can even provide meaningful descriptions for each table and its attributes via scripts like the ones below, however it might be a good idea to do this in the later phases of the PoC, when the logic becomes stable:

    %%sql
    
    -- modify a table's COMMENT
    COMMENT ON TABLE LH_SCM_PoC.orders.sales IS 'Customer sales orders';
    
    -- modify columns' COMMENT for an existing table
    ALTER TABLE LH_SCM_PoC.orders.sales  
    ALTER COLUMN SalesOrderNumber COMMENT 'Sales Order Number';
    

    Data Validation

    Before diving into building any business logic, besides identifying the primary, foreign keys and further attributes used in bringing the data together, it's recommended to get an overview of data's intrinsic and extrinsic characteristics relevant to the analysis. Some of the rules used typically for studying the quality of data apply to some extent also in here, though one needs to prioritize accordingly, otherwise one replicates the effort that's typically part of the Data Management initiatives. 

    In addition, it's important to check how much the identified issues impact the business logic, respectively whether the issues can be corrected to match the expectations. Often, no logic can compensate for major data quality issues, and this can also affect the PoC's results as soon as the outcomes are validated against the expectations! 

    Data Understanding 

    Further on, it makes sense to get a high-level understanding of the data by looking at the distribution of values, respectively at the records participating in the joins. Of course, more similar queries can be built, though again, one should try to focus on the most important aspects!

    The analysis could for example consider the following points:

    /* validation of Products */
    
    -- review duplicated product names (should be 0)
    SELECT ProductName
    , count(*) RecordCount
    FROM orders.products
    GROUP BY ProductName
    HAVING count(*)>1
    
    -- review most (in)expensive products
    SELECT top 100 ProductID
    , ProductName
    , Category
    , ListPrice 
    FROM orders.products
    ORDER BY ListPrice DESC --ASC
    
    -- review category distribution
    SELECT Category
    , count(*) RecordCount 
    FROM orders.products
    GROUP BY Category
    ORDER BY RecordCount DESC
    
    -- review price ranges (by order of magnitude)
    SELECT Len(floor(ListPrice)) RangeCount
    , count(*) RecordCount 
    FROM orders.products
    GROUP BY Len(floor(ListPrice)) 
    ORDER BY RangeCount DESC
    
    /* validation of Customers */
    
    -- duplicated email address 
    SELECT CST.CustomerID
    , CST.FirstName
    , CST.LastName 
    , CST.EmailAddress 
    , DUP.RecordCount
    FROM (-- duplicates
    	SELECT EmailAddress
    	, count(*) RecordCount 
    	FROM orders.customers 
    	GROUP BY EmailAddress 
    	HAVING count(*)>1
    	) DUP
    	JOIN orders.customers CST
    	   ON DUP.EmailAddress = CST.EmailAddress
    ORDER BY DUP.RecordCount DESC
    , DUP.EmailAddress 
    
    -- duplicated Customer names (not necessarily duplicates)
    SELECT CST.CustomerID
    , CST.FirstName
    , CST.LastName 
    , CST.EmailAddress 
    , DUP.RecordCount
    FROM (-- duplicates
    	SELECT FirstName
    	, LastName
    	, count(*) RecordCount 
    	FROM orders.customers 
    	GROUP BY FirstName
    	, LastName 
    	HAVING count(*)>1
    	) DUP
    	JOIN orders.customers CST
    	   ON DUP.FirstName = CST.FirstName
          AND DUP.LastName = CST.LastName
    ORDER BY DUP.RecordCount DESC
    , DUP.FirstName
    , DUP.LastName
    
    /* validation of Orders */
    
    -- review a typical order
    SELECT SalesOrderID
    , OrderDate
    , CustomerID
    , LineItem
    , ProductID
    , OrderQty
    , LineItemTotal
    FROM orders.orders
    WHERE SalesOrderID = 71780
    ORDER BY SalesOrderID 
    , LineItem
    
    -- review orders' distribution by month
    SELECT Year(OrderDate) Year
    , Month(OrderDate) Month
    , count(*) RecordCount
    FROM orders.orders
    GROUP BY Year(OrderDate) 
    , Month(OrderDate) 
    ORDER BY Year
    , Month
    
    -- checking for duplicates
    SELECT SalesOrderID
    , LineItem
    , count(*) RecordCount
    FROM orders.orders ord 
    GROUP BY SalesOrderID
    , LineItem
    HAVING count(*)>1
    
    -- checking for biggest orders
    SELECT SalesOrderID
    , count(*) RecordCount
    FROM orders.orders ord 
    GROUP BY SalesOrderID
    HAVING count(*) > 10
    ORDER BY RecordCount DESC
    
    -- checking for most purchased products
    SELECT ProductID
    , count(*) NoRecords
    FROM orders.orders ord 
    GROUP BY ProductID
    HAVING count(*) > 8
    ORDER BY NoRecords DESC
    
    -- checking for most active customers
    SELECT CustomerID
    , count(*) RecordCount
    FROM orders.orders ord 
    GROUP BY CustomerID
    HAVING count(*) > 10
    ORDER BY RecordCount DESC
    
    /* join checks */
    
    -- Orders without Product (should be 0)
    SELECT count(*) RecordCount
    FROM orders.orders ord 
    	 LEFT JOIN orders.products prd
    	   ON ord.ProductID = prd.ProductID
    WHERE prd.ProductID IS NULL
    
    -- Orders without Customer (should be 0)
    SELECT count(*) RecordCount
    FROM orders.orders ORD 
    	 LEFT JOIN orders.customers CST
    	   ON ORD.CustomerID = CST.CustomerID
    WHERE CST.CustomerID IS NULL
    
    -- Products without Orders (153 records)
    SELECT count(*) RecordCount
    FROM orders.products prd
    	 LEFT JOIN orders.orders ord 
    	   ON prd.ProductID = ord.ProductID 
    WHERE ord.ProductID IS NULL
    
    
    -- Customers without Orders (815 records)
    SELECT count(*) RecordCount
    FROM orders.customers CST
    	 LEFT JOIN orders.orders ORD
    	   ON ORD.CustomerID = CST.CustomerID
    WHERE ORD.CustomerID IS NULL
    

    The more tables are involved, the more complex the validation logic can become. One should focus on the most important aspects.

    Building the Logic

    Once one has an acceptable understanding of the data entities involved and the relation between them, it's time to build the needed business logic by joining the various tables at the various levels of detail. One can focus on the minimum required, respectively attempt to build a general model that can address a broader set of requirements. For the PoC it's usually recommended to start small by addressing the immediate requirements, though some flexibility might be needed for exploring the data and preparing the logic for a broader set of requirements. Independently of the scope, one should consider a set of validations. 

    Usually, it makes sense to encapsulate the logic in several views or table-valued functions that reflect the logic for the main purposes and which allow a high degree of reuse (see [1]). Of course, one can use the standard approach for modelling the bronze, silver, respectively the gold layers adopted by many professionals. For a PoC, even if  that's not mandatory, it might still be a good idea to make steps in the respective direction. 

    In this case, dealing with only three tables - a fact table and two dimension tables - there are several perspectives that can be built:

    a) all records from fact table + dimension records

    The following view provides the lowest level of detail for the fact table, thus allowing one to look at the data from different perspectives as long as the focus is only on the values used in Sales Orders:

    -- create the view
    CREATE OR ALTER VIEW orders.vSalesOrders
    -- Sales Orders with Product & Customer information
    AS
    SELECT ORD.SalesOrderID
    , ORD.OrderDate
    , ORD.CustomerID
    , CST.FirstName 
    , CST.LastName
    , CST.EmailAddress
    , ORD.LineItem
    , ORD.ProductID
    , PRD.ProductName 
    , PRD.Category
    , ORD.OrderQty
    , ORD.LineItemTotal
    , PRD.ListPrice 
    , ORD.OrderQty * PRD.ListPrice ListPriceTotal
    FROM orders.orders ORD 
    	 JOIN orders.products PRD
    	   ON ORD.ProductID = PRD.ProductID
    	 JOIN orders.customers CST
    	   ON ORD.CustomerID = CST.CustomerID
    
    -- test the view   
    SELECT *
    FROM orders.vSalesOrders
    WHERE SalesOrderID = 71780
    

    One can use inner joins as long as the referenced dimension records are always available; otherwise, left joins should be considered (see the sketch below).
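
    A minimal sketch of such a variant, assuming the same tables as above and keeping only a few columns for brevity (the view name is hypothetical):

    -- create a variant of the view that tolerates missing dimension records
    CREATE OR ALTER VIEW orders.vSalesOrdersAll
    -- Sales Orders with optional Product & Customer information
    AS
    SELECT ORD.SalesOrderID
    , ORD.OrderDate
    , ORD.CustomerID
    , CST.EmailAddress
    , ORD.ProductID
    , PRD.ProductName 
    , ORD.OrderQty
    , ORD.LineItemTotal
    FROM orders.orders ORD 
    	 LEFT JOIN orders.products PRD
    	   ON ORD.ProductID = PRD.ProductID
    	 LEFT JOIN orders.customers CST
    	   ON ORD.CustomerID = CST.CustomerID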

    b) aggregated data for all dimension combinations

    The previous view allows aggregating the data at various levels of detail:

    -- Sales volume by Customer & Product
    SELECT ORD.EmailAddress
    , ORD.ProductName 
    , ORD.Category
    , SUM(ORD.OrderQty) OrderQty
    , SUM(ORD.LineItemTotal) LineItemTotal
    FROM orders.vSalesOrders ORD 
    WHERE ORD.OrderDate >= '2022-06-01'
      AND ORD.OrderDate < '2022-07-01'
    GROUP BY ORD.EmailAddress
    , ORD.ProductName 
    , ORD.Category
    ORDER BY ORD.EmailAddress
    , ORD.ProductName 
    

    One can comment out the dimensions not needed. The query can be included in a view as well (see the sketch below). 
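
    For instance, the aggregation can be wrapped as follows (a sketch with a hypothetical view name; the date filter is left to the consumer, hence OrderDate is kept in the output):

    -- encapsulate the aggregation in a view
    CREATE OR ALTER VIEW orders.vSalesVolumeByCustomerProduct
    -- Sales volume by Customer & Product
    AS
    SELECT ORD.EmailAddress
    , ORD.ProductName 
    , ORD.Category
    , ORD.OrderDate
    , SUM(ORD.OrderQty) OrderQty
    , SUM(ORD.LineItemTotal) LineItemTotal
    FROM orders.vSalesOrders ORD 
    GROUP BY ORD.EmailAddress
    , ORD.ProductName 
    , ORD.Category
    , ORD.OrderDate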

    c) all records from each dimension table + aggregated fact records

    Sometimes, it's useful to look at the data from a dimension's perspective, though one might need to create such an object for each dimension, as in the examples below. For maximum flexibility, the logic can be included in a table-valued function:

    -- create the user-defined function
    CREATE OR ALTER FUNCTION orders.tvfProductsSalesVolume(
        @StartDate date = NULL,
        @EndDate date = NULL
    )
    RETURNS TABLE
    -- Sales volume by Product
    AS
    RETURN (
    SELECT PRD.ProductID
    , PRD.ProductName 
    , PRD.Category
    , ORD.FirstOrderDate
    , ORD.LastOrderDate 
    , IsNull(ORD.TotalSalesQty, 0) TotalSalesQty 
    , IsNull(ORD.TotalSalesValue, 0) TotalSalesValue
    , IsNull(ORD.OrderCount, 0) OrderCount
    , IsNull(ORD.LineCount, 0) LineCount
    FROM orders.products PRD
         OUTER APPLY (
    		SELECT Min(ORD.OrderDate) FirstOrderDate
    		, Max(ORD.OrderDate) LastOrderDate 
    		, SUM(ORD.OrderQty) TotalSalesQty
    		, SUM(ORD.LineItemTotal) TotalSalesValue
    		, count(DISTINCT SalesOrderID) OrderCount
    		, count(*) LineCount
    		FROM orders.orders ORD 
    		WHERE ORD.ProductID = PRD.ProductID
    		  AND ORD.OrderDate >= @StartDate 
    		  AND ORD.OrderDate < @EndDate 
    	 ) ORD
    );
    
    -- test the user-defined function
    SELECT *
    FROM orders.tvfProductsSalesVolume('2022-06-01','2022-07-01') PRD
    WHERE TotalSalesValue <> 0
    ORDER BY TotalSalesValue DESC
    , LastOrderDate DESC
    
    
    -- create the user-defined function
    CREATE OR ALTER FUNCTION orders.tvfCustomersSalesVolume(
        @StartDate date = NULL,
        @EndDate date = NULL
    )
    RETURNS TABLE
    -- Sales volume by Customer
    AS
    RETURN (
    SELECT CST.CustomerID
    , CST.FirstName 
    , CST.LastName
    , CST.EmailAddress
    , ORD.FirstOrderDate
    , ORD.LastOrderDate 
    , IsNull(ORD.TotalSalesValue, 0) TotalSalesValue
    , IsNull(ORD.OrderCount, 0) OrderCount
    , IsNull(ORD.LineCount, 0) LineCount
    FROM orders.customers CST
         OUTER APPLY (
    		SELECT Min(ORD.OrderDate) FirstOrderDate
    		, Max(ORD.OrderDate) LastOrderDate 
    		, SUM(ORD.LineItemTotal) TotalSalesValue
    		, count(DISTINCT SalesOrderID) OrderCount
    		, count(*) LineCount
    		FROM orders.orders ORD 
    		WHERE ORD.CustomerID = CST.CustomerID
    		  AND ORD.OrderDate >= @StartDate 
    		  AND ORD.OrderDate < @EndDate 
    	 ) ORD
    );
    
    -- test the user-defined function
    SELECT *
    FROM orders.tvfCustomersSalesVolume('2022-06-01','2022-07-01') PRD
    WHERE TotalSalesValue <> 0
    ORDER BY TotalSalesValue DESC
    , LastOrderDate DESC
    

    When restructuring the queries in similar ways, there's always a compromise between the various factors: (re)usability, performance or completeness. 

    Further Comments

    The above database objects should allow users to address most of the requirements, though, as usual, there can be also exceptions, especially when the data needs to be aggregated at a different level of detail that requires the query to be structured differently.

    The number of perspectives can also increase with the number of fact tables used to model a certain entity (e.g. Sales order headers vs. lines). 

    In theory, one can also find ways to automate the process of creating database objects, though one must choose the relevant attributes, respectively include logic that makes sense only within a certain perspective. 
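
    For illustration, a metadata-driven sketch that generates a basic SELECT statement per table from INFORMATION_SCHEMA (assuming STRING_AGG is available in the SQL analytics endpoint; the generated statements would still need manual review and adjustment):

    -- generate a SELECT statement for each table in the schema
    SELECT CONCAT('SELECT '
        , STRING_AGG(COLUMN_NAME, ', ') WITHIN GROUP (ORDER BY ORDINAL_POSITION)
        , ' FROM ', TABLE_SCHEMA, '.', TABLE_NAME) GeneratedStatement
    FROM INFORMATION_SCHEMA.COLUMNS
    WHERE TABLE_SCHEMA = 'orders'
    GROUP BY TABLE_SCHEMA
    , TABLE_NAME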

    No matter the data, respectively the systems used as source, expect surprises and test your assumptions! For example, in the file used to create the orders.customers table, there seem to be duplicated entities with the same name and email address. One must clarify how such entities must be handled in data analysis, respectively in data modeling. For example, a person can appear twice because of the roles associated with the name, or there can be other legitimate reasons. 
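
    A minimal sketch for handling such duplicates once the business rule is clarified, e.g. keeping a single (arbitrary) record per email address:

    -- keep one record per email address (the ordering criteria must reflect the agreed rule)
    SELECT CustomerID
    , FirstName
    , LastName
    , EmailAddress
    FROM (
        SELECT CustomerID
        , FirstName
        , LastName
        , EmailAddress
        , ROW_NUMBER() OVER (PARTITION BY EmailAddress ORDER BY CustomerID) RowNumber
        FROM orders.customers
    ) CST
    WHERE RowNumber = 1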

    The files in scope of this post are small compared with the files existing in organizations. In many scenarios, file sizes could range from GBs to TBs and thus require partitioning and various other strategies. 
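
    A minimal sketch of a partitioned Delta table in Spark SQL (hypothetical table and column names; the partition column should be of low cardinality, e.g. the year):

    %%sql
    
    -- create a partitioned Delta table for larger data volumes
    CREATE TABLE IF NOT EXISTS LH_SCM_PoC.orders.orders_large (
        SalesOrderID INT,
        OrderDate DATE,
        OrderYear INT,
        CustomerID INT,
        ProductID INT,
        OrderQty INT,
        LineItemTotal DECIMAL(18,2)
    )
    USING DELTA
    PARTITIONED BY (OrderYear)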


    References
    [1] sql-troubles (2023) Architecture Part IV: Building a Modern Data Warehouse with Azure Synapse [link]

    Resources
    [1] Microsoft Learn (2024) Fabric: Lakehouse and Delta Lake tables [link]

    12 April 2025

    🏭🗒️Microsoft Fabric: Copy job in Data Factory [Notes]

    Disclaimer: This is work in progress intended to consolidate information from various sources for learning purposes. For the latest information please consult the documentation (see the links below)! 

    Last updated: 11-Apr-2025

    [Microsoft Fabric] Copy job in Data Factory 
    • {def} Data Factory item that elevates the data ingestion experience to a more streamlined and user-friendly process from any source to any destination [1]
      • {benefit} simplifies data ingestion with built-in patterns for batch and incremental copy, eliminating the need for pipeline creation [1]
        • across cloud data stores [1]
        • from on-premises data stores behind a firewall [1]
        • within a virtual network via a gateway [1]
    • {benefit} provides seamless data integration 
      • through over 100 built-in connectors [3]
      • provides essential tools for data operations [3]
    • {benefit} provides intuitive experience
      • easy configuration and monitoring [1]
    • {benefit} efficiency
      • enable incremental copying effortlessly, reducing manual intervention [1]
    • {benefit} less resource utilization and faster copy durations
      • flexibility to control data movement [1]
        • choose which tables and columns to copy
        • map the data
        • define read/write behavior
        • set schedules that fit requirements whether [1]
      • applies for a one-time or recurring jobs [1]
    • {benefit} robust performance
      • the serverless setup enables data transfer with large-scale parallelism
      • maximizes data movement throughput [1]
        • fully utilizes network bandwidth and data store IOPS for optimal performance [3]
    • {feature} monitoring
      • once a job is executed, users can monitor its progress and metrics through either [1] 
        • the Copy job panel
          • shows data from the most recent runs [1]
        • reports several metrics
          • status
          • rows read
          • rows written
          • throughput
        • the Monitoring hub
          • acts as a centralized portal for reviewing runs across various items [4]
    • {mode} full copy
      • copies all data from the source to the destination at once
    • {mode|preview} incremental copy
      • the initial job run copies all data, and subsequent job runs only copy changes since the last run [1]
      • an incremental column must be selected for each table to identify changes [1]
        • used as a watermark
          • allows comparing its value with the value from the last run in order to copy only the new or updated data [1] (see the conceptual sketch after this list)
          • the incremental column can be a timestamp or an increasing INT [1]
        • {scenario} copying from a database
          • new or updated rows will be captured and moved to the destination [1]
        • {scenario} copying from a storage store
          • new or updated files identified by their LastModifiedTime are captured and moved to the destination [1]
        • {scenario} copy data to storage store
          • new rows from the tables or files are copied to new files in the destination [1]
            • files with the same name are overwritten [1]
        • {scenario} copy data to database
          • new rows from the tables or files are appended to destination tables [1]
            • the update method can be set to merge or overwrite [1]
    • {default} appends data to the destination [1]
      • the update method can be adjusted to 
        • {operation} merge
          • a key column must be provided
            • {default} the primary key is used, if available [1]
        • {operation} overwrite
    • availability 
      • the same regional availability as the pipeline [1]
    • billing meter
      • Data Movement, with an identical consumption rate [1]
    • {feature} robust Public API
      • {benefit} allows to automate and manage Copy Job efficiently [2]
    • {feature} Git Integration
      • {benefit} allows to leverage Git repositories in Azure DevOps or GitHub [2]
      • {benefit} allows to seamlessly deploy Copy Job with Fabric’s built-in CI/CD workflows [2]
    • {feature|preview} VNET gateway support
      • enables secure connections to data sources within virtual network or behind firewalls
        • Copy Job can be executed directly on the VNet data gateway, ensuring seamless and secure data movement [2]
    • {feature} upsert to Azure SQL Database
    • {feature} overwrite to Fabric Lakehouse
    • {feature} [Jul-2025] native CDC
      • enables efficient and automated replication of changed data including inserted, updated and deleted records from a source to a destination [5]
        •  ensures destination data stays up to date without manual effort
          • improves efficiency in data integration while reducing the load on source systems [5]
        • see Data Movement - Incremental Copy meter
          •  consumption rate of 3 CU
        • {benefit} zero manual intervention
          • automatically captures incremental changes directly from the source [5]  
        • {benefit} automatic replication
          • keeps destination data continuously synchronized with source changes [5]  
        • {benefit} optimized performance
          • processes only changed data
            • reduces processing time and minimizing load on the source [5]
        • smarter incremental copy 
          • automatically detects CDC-enabled source tables and allows to select either CDC-based or watermark-based incremental copy for each table [5]
      • applies to 
        • CDC-enabled tables
          • CDC automatically captures and replicates actions on data
        • non-CDC-enabled tables
          • Copy Job detects changes by comparing an incremental column against the last run [5]
            • then merges or appends the changed data to the destination based on configuration [5]
      • supported connectors
        • ⇐ applies to sources and destinations
        • Azure SQL DB [5]
        • On-premises SQL Server [5]
        • Azure SQL Managed Instance [5]
    • {enhancement} column mapping for simple data modification to storage as destination store [2]
    • {enhancement} data preview to help select the right incremental column  [2]
    • {enhancement} search functionality to quickly find tables or columns  [2]
    • {enhancement} real-time monitoring with an in-progress view of running Copy Jobs  [2]
    • {enhancement} customizable update methods & schedules before job creation [2]
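
    Conceptually, the watermark mechanism described above resembles the following sketch (hypothetical table, column and variable names; Copy job performs the equivalent logic internally, so no such query needs to be written by hand):

    -- watermark-based incremental read: only the rows changed since the last run are copied
    DECLARE @LastWatermark datetime2 = '2025-04-01T00:00:00'; -- highest value captured by the previous run
    
    SELECT *
    FROM dbo.SalesOrders
    WHERE ModifiedDate > @LastWatermark;
    
    -- after the copy, the new watermark becomes the MAX(ModifiedDate) of the copied rows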

    References:
    [1] Microsoft Learn (2025) Fabric: What is the Copy job in Data Factory for Microsoft Fabric? [link]
    [2] Microsoft Fabric Updates Blog (2025) Recap of Data Factory Announcements at Fabric Conference US 2025 [link]
    [3] Microsoft Fabric Updates Blog (2025) Fabric: Announcing Public Preview: Copy Job in Microsoft Fabric [link]
    [4] Microsoft Learn (2025) Fabric: Learn how to monitor a Copy job in Data Factory for Microsoft Fabric [link]
    [5] Microsoft Fabric Updates Blog (2025) Fabric: Simplifying Data Ingestion with Copy job – Introducing Change Data Capture (CDC) Support (Preview) [link]
    [6] Microsoft Learn (2025) Fabric: Change data capture (CDC) in Copy Job (Preview) [link]

    Resources:
    [R1] Microsoft Learn (2025) Fabric: Learn how to create a Copy job in Data Factory for Microsoft Fabric [link]

    Acronyms:
    API - Application Programming Interface
    CDC - Change Data Capture
    CI/CD - Continuous Integration and Continuous Deployment
    CU - Capacity Unit
    DevOps - Development & Operations
    DF - Data Factory
    IOPS - Input/Output Operations Per Second
    VNet - Virtual Network

    26 March 2025

    🏭🗒️Microsoft Fabric: External Data Sharing [Notes]

    Disclaimer: This is work in progress intended to consolidate information from various sources for learning purposes. For the latest information please consult the documentation (see the links below)! 

    Last updated: 26-Mar-2025

    External data sharing [1]

    [Microsoft Fabric] External data sharing 

    • {def} feature that enables Fabric users to share data from their tenant with users in another Fabric tenant (aka cross-tenant sharing) [1]
      • the data is shared in-place from OneLake storage locations in the sharer's tenant [1]
        • ⇒ no data is actually copied to the other tenant [1]
        • creates a OneLake shortcut in the other tenant that points back to the original data in the sharer's tenant [1]
        • data is exposed as read-only [1]
        • data can be consumed by any OneLake compatible Fabric workload in that tenant [1]
      • {benefit} allows for efficient and secure data sharing without duplicating data
        • the shared data remains read-only for the consumer, ensuring data integrity and consistency [2]
        • multiple tables and folders can be shared at once [2]
    • {prerequisite} Fabric admins must turn on external data sharing both in the sharer's tenant and in the external tenant
      • by specifying who can create and accept external data shares [1]
      • users can share data residing in tables or files within supported Fabric items [1]
        • require standard Fabric read and reshare permissions for the item being shared [1]
        • the user creating the share invites a user from another tenant with a link to accept the external data share [1]
          • upon accepting the share, the recipient chooses a lakehouse where a shortcut to the shared data will be created [1]
          • the links work only for users in external tenants
            • for sharing data within the same OneLake storage accounts with users in the same tenant, use OneLake shortcuts [1]
          • {limitation} shortcuts contained in folders that are shared via external data sharing won't resolve in the consumer tenant [1]
      • access is enabled via a dedicated Fabric-to-Fabric authentication mechanism 
      • ⇐ doesn’t require Entra B2B guest user access [1]
    • {operation} create an external data share (in the provider tenant)
      • external data shares can be created for tables or files in lakehouses and warehouses, and in KQL databases, SQL databases, and mirrored databases [1]
      • {limitation} the sharer can't control who has access to the data in the consumer's tenant [1]
    • {operation} accept an external data share (in the consuming tenant)
      • only lakehouses can be chosen for the operation
      • the consumer can grant access to the data to anyone [1]
        • incl. guest users from outside the consumer's organization [1]
      • data can be transferred across geographic boundaries when it's accessed within the consumer's tenant [1]
    • {operation} revoke external data shares
      • any user in the sharing tenant with read and reshare permissions on an externally shared item can revoke the external data share at any time [1]
        • via Manage permissions >> External data shares tab
        • can be performed of any item the user has read and reshare permissions on [3]
        • {warning} a revoked external data share can't be restored [3] 
          • irreversibly severs all access from the receiving tenant to the shared data [3]
          • a new external data share can be created instead [3]
    • applies to
      • lakehouse
        • an entire lakehouse schema can be shared [2]
          • shares all the tables in the schema [2]
          • any changes to the schema are immediately reflected in the consumer’s lakehouse [2]
      • mirrored database
      • KQL database
      • OneLake catalog
    • can be consumed via 
      • Spark workloads
        • notebooks or Spark
      • lakehouse SQL Analytics Endpoint
      • semantic models
      • ⇐ data can be shared from a provider and consumed in-place via SQL queries or in a Power BI report [2]
    • {feature} external data sharing APIs
      • support service principals for admin and user operations
        • can be used to automate the creation or management of shares [2]
          • supports service principals or managed identity [2]
      • {planned} data warehouse support
        • share schemas or tables from a data warehouse [2]
      • {planned} shortcut sharing
        • share OneLake shortcuts using external data sharing [2]
        • ⇒ data residing outside of Fabric will be externally shareable
          • S3, ADLS Gen2, or any other supported shortcut location [2]
      • {planned} consumer APIs
        • consumer activities, including viewing share details and accepting shares, will soon be available via API [2]

    References:
    [1] Microsoft Learn (2024) External data sharing in Microsoft Fabric [link]
    [2] Microsoft Fabric Updates Blog (2025) External data sharing enhancements out now [link]
    [3] Microsoft Learn (2024) Fabric: Manage external data shares [link]

    Resources:
    [R1] Analytics on Azure Blog (2024) External Data Sharing With Microsoft Fabric [link]
    [R2] Microsoft Learn (2025) Fabric: What's new in Microsoft Fabric? [link]

    Acronyms:
    ADLS - Azure Data Lake Storage
    API - Application Programming Interface
    B2B - business-to-business
    KQL - Kusto Query Language
