
05 July 2025

🏭🗒️Microsoft Fabric: Git Repository [Notes]

Disclaimer: This is work in progress intended to consolidate information from various sources for learning purposes. For the latest information please consult the documentation (see the links below)! 

Last updated: 4-Jul-2025

[Microsoft Fabric] Git Repository

  • {def} set of features that enable developers to integrate their development processes, tools, and best practices straight into the Fabric platform [2]

  • {goal} the repo serves as single-source-of-truth
  • {feature} backup and version control [2]
  • {feature} revert to previous stages [2]
  • {feature} collaborate with others or work alone using Git branches [2]
  • {feature} source control 
    • provides tools to manage Fabric items [2]
    • supported for Azure DevOps and GitHub [3]
  • {configuration} tenant switches 
    • ⇐ must be enabled from the Admin portal 
      • by the tenant admin, capacity admin, or workspace admin
        • dependent on organization's settings [3]
    • users can create Fabric items
    • users can synchronize workspace items with their Git repositories
    • users can create workspaces
      • needed only when branching out to a new workspace [3]
    • users can synchronize workspace items with GitHub repositories
      • for GitHub users only [3]
  • {concept} release process 
    • begins once new updates complete a Pull Request process and merge into the team’s shared branch [3]
  • {concept} branch
    • {operation} switch branches
      • the workspace syncs with the new branch and all items in the workspace are overridden [3]
        • if there are different versions of the same item in each branch, the item is replaced [3]
        • if an item is in the old branch, but not the new one, it gets deleted [3]
      • one can't switch branches if there are any uncommitted changes in the workspace [3]
    • {action} branch out to another workspace 
      • creates a new workspace, or switches to an existing workspace based on the last commit to the current workspace, and then connects to the target workspace and branch [4]
      • {permission} contributor and above
    • {action} check out a new branch
      • creates a new branch based on the last synced commit in the workspace [4]
      • changes the Git connection in the current workspace [4]
      • doesn't change the workspace content [4]
      • {permission} workspace admin
    • {action} switch branch
      • syncs the workspace with another new or existing branch and overrides all items in the workspace with the content of the selected branch [4]
      • {permission} workspace admin
    • {limitation} maximum length of branch name: 244 characters.
    • {limitation} maximum length of full path for file names: 250 characters
    • {limitation} maximum file size: 25 MB
  • {operation} connect a workspace to a Git repository
    • can be done only by a workspace admin [4]
      • once connected, anyone with permissions can work in the workspace [4]
    • synchronizes the content between the two (aka initial sync)
      • {scenario} either of the two is empty while the other has content
        • the content is copied from the nonempty location to the empty one [4]
      • {scenario} both have content
        • one must decide which direction the sync should go [4]
          • the content at the sync's destination is overwritten [4]
      • includes folder structures [4]
        • workspace items in folders are exported to folders with the same name in the Git repo [4]
        • items in Git folders are imported to folders with the same name in the workspace [4]
        • if the workspace has folders and the connected Git folder doesn't yet have subfolders, they're considered to be different [4]
          • leads to uncommitted changes status in the source control panel [4]
              • one must commit the changes to Git before updating the workspace [4]
                • if the workspace is updated first, the Git folder structure overwrites the workspace folder structure [4]
        • {limitation} empty folders aren't copied to Git
          • when creating or moving items to a folder, the folder is created in Git [4]
        • {limitation} empty folders in Git are deleted automatically [4]
        • {limitation} empty folders in the workspace aren't deleted automatically even if all items are moved to different folders [4]
        • {limitation} folder structure is retained up to 10 levels deep [4]
    • Git status
      • synced
        • the item is the same in the workspace and Git branch [4]
      • conflict
        • the item was changed in both the workspace and Git branch [4]
      • unsupported item
      • uncommitted changes in the workspace
      • update required from Git [4]
      • item is identical in both places but needs to be updated to the last commit [4]
  • source control panel
    • shows the number of items that are different in the workspace and Git branch
      • when changes are made, the number is updated
      • when the workspace is synced with the Git branch, the Source control icon displays a 0
  • commit and update panel 
    • {section} changes 
      • shows the number of items that were changed in the workspace and need to be committed to Git [4]
      • changed workspace items are listed in the Changes section
        • when there's more than one changed item, one can select which items to commit to the Git branch [4]
      • if there were updates made to the Git branch, commits are disabled until you update your workspace [4]
    • {section} updates 
      • shows the number of items that were modified in the Git branch and need to be updated to the workspace [4]
      • the Update command always updates the entire branch and syncs to the most recent commit [4]
        • {limitation} one can’t select specific items to update [4]
        • if changes were made in the workspace and in the Git branch on the same item, updates are disabled until the conflict is resolved [4]
    • in each section, the changed items are listed with an icon indicating the status
      • new
      • modified
      • deleted
      • conflict
      • same-changes
  • {concept} related workspace
    • workspace with the same connection properties as the current branch [4]
      • e.g.  the same organization, project, repository, and git folder [4] 

References:
[2] Microsoft Learn (2025) Fabric: What is Microsoft Fabric Git integration? [link]
    What is lifecycle management in Microsoft Fabric? [link]
[3] Microsoft Fabric Updates Blog (2025) Fabric: Introducing New Branching Capabilities in Fabric Git Integration [link]
[4] Microsoft Learn (2025) Fabric: Basic concepts in Git integration [link]

Resources:

Acronyms:
CI/CD - Continuous Integration and Continuous Deployment

21 June 2025

🏭🗒️Microsoft Fabric: Result Set Caching in SQL Analytics Endpoints [Notes] 🆕

Disclaimer: This is work in progress intended to consolidate information from various sources for learning purposes. For the latest information please consult the documentation (see the links below)! 

Last updated: 21-Jun-2025

[Microsoft Fabric] Result Set Caching in SQL Analytics Endpoints

  • {def} built-in performance optimization for Warehouse and Lakehouse that improves read latency [1]
    • fully transparent to the user [3]
    • persists the final result sets for applicable SELECT T-SQL queries
      • caches all the data accessed by a query [3]
      • subsequent runs that "hit" cache will process just the final result set
        • can bypass complex compilation and data processing of the original query[1]
          • ⇐ returns subsequent queries faster [1]
      • the cache creation and reuse is applied opportunistically for queries
    • works on
      • warehouse tables
      • shortcuts to OneLake sources
      • shortcuts to non-Azure sources
    • the management of cache is handled automatically [1]
      • regularly evicts cache as needed
    • as data changes, result consistency is ensured by invalidating cache created earlier [1]
  • {operation} enable setting
    • via ALTER DATABASE <database_name> SET RESULT_SET_CACHING ON
  • {operation} validate setting
    • via SELECT name, is_result_set_caching_on FROM sys.databases
  • {operation} configure setting
    • configurable at item level
      • once enabled, it can then be disabled 
        • at the item level
        • for individual queries
          • e.g. debugging or A/B testing a query
        • via OPTION (USE HINT ('DISABLE_RESULT_SET_CACHE')) (see the sketch after this list)
    • {default} during the preview, result set caching is off for all items [1]
  • [monitoring] 
    • via Message Output
      • applicable to Fabric Query editor, SSMS
      • the statement "Result set cache was used" is displayed after query execution if the query was able to use an existing result set cache
    • via queryinsights.exec_requests_history system view
      • result_cache_hit indicates result set cache usage for each query execution [1]
        • {value} 2: the query used result set cache (cache hit)
        • {value} 1: the query created result set cache
        • {value} 0: the query wasn't applicable for result set cache creation or usage [1]
          • {reason} the cache no longer exists
          • {reason} the cache was invalidated by a data change, disqualifying it for reuse [1]
          • {reason} query isn't deterministic
            • isn't eligible for cache creation [1]
          • {reason} query isn't a SELECT statement
  • [warehousing] 
    • {scenario} analytical queries that process large amounts of data to produce a relatively small result [1]
    • {scenario} workloads that trigger the same analytical queries repeatedly [1]
      • the same heavy computation can be triggered multiple times, even though the final result remains the same [1]
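
A minimal T-SQL sketch that ties the settings above together (the warehouse name SalesDW, the table dbo.FactSales, and the start_time/command columns of queryinsights.exec_requests_history are assumptions for illustration; the statements and the result_cache_hit values follow [1]):

-- enable result set caching for the item (hypothetical warehouse SalesDW)
ALTER DATABASE SalesDW SET RESULT_SET_CACHING ON;

-- validate the setting
SELECT name, is_result_set_caching_on FROM sys.databases;

-- bypass the cache for an individual query, e.g. while debugging or A/B testing
SELECT SUM(SalesAmount) AS TotalSales
FROM dbo.FactSales
OPTION (USE HINT ('DISABLE_RESULT_SET_CACHE'));

-- inspect cache usage per execution: 0 = not applicable, 1 = cache created, 2 = cache hit
SELECT TOP (10) start_time, command, result_cache_hit
FROM queryinsights.exec_requests_history
ORDER BY start_time DESC;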

References:
[1] Microsoft Learn (2025) Result set caching (preview) [link]
[2] Microsoft Fabric Update Blog (2025) Result Set Caching for Microsoft Fabric Data Warehouse (Preview) [link|aka]
[3] Microsoft Learn (2025) In-memory and disk caching [link]
[4] Microsoft Learn (2025) Performance guidelines in Fabric Data Warehouse [link]

Resources:
[R1] Microsoft Fabric (2025) Fabric Update - June 2025 [link]

Acronyms:
MF - Microsoft Fabric
SSMS - SQL Server Management Studio

24 May 2025

🏭🗒️Microsoft Fabric: Materialized Lake Views (MLV) [Notes] 🆕🗓️

Disclaimer: This is work in progress intended to consolidate information from various sources for learning purposes. For the latest information please consult the documentation (see the links below)! 

Last updated: 9-Jun-2025

-- create schema
CREATE SCHEMA IF NOT EXISTS <lakehouse_name>.<schema_name>

-- create a materialized lake view
CREATE MATERIALIZED LAKE VIEW IF NOT EXISTS <lakehouse_name>.<schema_name>.<view_name>
[(
    CONSTRAINT <constraint_name> CHECK (<constraint>) ON MISMATCH DROP
)]
[PARTITIONED BY (col1, col2, ... )]
[COMMENT "description or comment"]
[TBLPROPERTIES ("key1"="val1", "key2"="val2")]
AS
SELECT ...
FROM ...
-- WHERE ...
-- GROUP BY ...

[Microsoft Fabric] Materialized Lake Views (MLVs)

  • {def} persisted, continuously updated view of data [1]
    • {benefit} allows to build declarative data pipelines using SQL, complete with built-in data quality rules and automatic monitoring of data transformations
      • simplifies the implementation of multi-stage Lakehouse processing [1]
        • ⇐ aids in the creation, management, and monitoring of views [3]
        • ⇐ improves transformations through a declarative approach [3]
        • streamline data workflows
        • enable developers to focus on business logic [1]
          • ⇐ not on infrastructural or data quality-related issues [1]
        • the views can be created in a notebook [2]
    • {benefit} allows developers to visualize lineage across all entities in the lakehouse, view the dependencies, and track execution progress [3]
      • can have data quality constraints enforced and visualized for every run, showing completion status and conformance to data quality constraints defined in a single view [1]
      • empowers developers to set up complex data pipelines with just a few SQL statements and then handle the rest automatically [1]
        • faster development cycles 
        • trustworthy data
        • quicker insights
  • {goal} process only the new or changed data instead of reprocessing everything each time [1]
    • ⇐  leverages Delta Lake’s CDF under the hood
      • ⇒ it can update just the portions of data that changed rather than recompute the whole view from scratch [1]
  • {operation} creation
    • allows defining transformations at each layer [1]
      • e.g. aggregation, projection, filters
    • allows specifying certain checks that the data must meet [1]
      • incorporate data quality constraints directly into the pipeline definition
    • via CREATE MATERIALIZED LAKE VIEW (see the sketch after these notes)
      • the SQL syntax is declarative and Fabric figures out how to produce and maintain it [1]
  • {operation} refresh
    • refreshes only when its source has new data [1]
      • if there’s no change, it can skip running entirely (saving time and resources) [1]
    • via REFRESH MATERIALIZED LAKE VIEW [workspace.lakehouse.schema].MLV_Identifier [FULL];
  • {operation} list views from schema [3]
    • via SHOW MATERIALIZED LAKE VIEWS <IN/FROM> Schema_Name;
  • {operation} retrieve definition
    • via SHOW CREATE MATERIALIZED LAKE VIEW MLV_Identifier;
  • {operation} rename view
    • via ALTER MATERIALIZED LAKE VIEW MLV_Identifier RENAME TO MLV_Identifier_New;
  • {operation} drop view
    • via DROP MATERIALIZED LAKE VIEW MLV_Identifier;
    • {warning} dropping or renaming a materialized lake view affects the lineage view and scheduled refresh [3]
    • {recommendation} update the reference in all dependent materialized lake views [3]
  • {operation} schedule view run
    • lets users set how often the MLV should be refreshed based on business needs and lineage execution timing [5]
    • depends on
      • data update frequency: the frequency with which the data is updated [5]
      • query performance requirements: business requirements to refresh the data at defined intervals [5]
      • system load: optimizing the time to run the lineage without overloading the system [5]
  • {operation} view run history
    • users can access the last 25 runs including lineage and run metadata
      • available from the dropdown for monitoring and troubleshooting
  • {concept} lineage
    • the sequence of MLV that needs to be executed to refresh the MLV once new data is available [5]
  • {feature} automatically generate a visual report that shows trends on data quality constraints 
    • {benefit} allows to easily identify the checks that introduce maximum errors and the associated MLVs for easy troubleshooting [1]
  • {feature} can be combined with Shortcut Transformation feature for CSV ingestion 
    • {benefit} facilitate the building of end-to-end Medallion architectures
  • {feature} dependency graph
    • allows to see the dependencies existing between the various objects [2]
      • ⇐ automatically generated [2]
  • {feature} data quality report
    • built-in Power BI dashboard that shows several aggregated metrics [2]
  • doesn't support
    • {feature|planned} PySpark [3]
    • {feature|planned} incremental refresh [3]
    • {feature|planned} integration with Data Activator [3]
    • {feature|planned} API [3]
    • {feature|planned} cross-lakehouse lineage and execution [3]
    • {limitation} Spark properties set at the session level aren't applied during scheduled lineage refresh [4]
    • {limitation} creation with delta time-travel [4]
    • {limitation} DML statements [4]
    • {limitation} UDFs in CTAS [4] 
    • {limitation} temporary views can't be used to define MLVs [4]
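
A minimal Spark SQL sketch that instantiates the template above (the lakehouse sales, the schema silver, the source table sales.bronze_orders, and the column names are assumptions for illustration; the statements follow the syntax shown in the notes above):

-- schema for the silver layer
CREATE SCHEMA IF NOT EXISTS sales.silver;

-- cleansed orders with a data quality constraint; rows failing the check are dropped
CREATE MATERIALIZED LAKE VIEW IF NOT EXISTS sales.silver.orders_cleaned
(
    CONSTRAINT non_negative_amount CHECK (order_amount >= 0) ON MISMATCH DROP
)
COMMENT "cleansed orders with basic data quality checks"
AS
SELECT order_id,
       customer_id,
       CAST(order_date AS DATE) AS order_date,
       order_amount
FROM sales.bronze_orders
WHERE order_id IS NOT NULL;

-- refresh on demand (FULL forces a complete rebuild)
REFRESH MATERIALIZED LAKE VIEW sales.silver.orders_cleaned FULL;

-- list the materialized lake views defined in the schema
SHOW MATERIALIZED LAKE VIEWS IN silver;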


References:
[1] Microsoft Fabric Update Blog (2025) Simplifying Medallion Implementation with Materialized Lake Views in Fabric [link|aka]
[2] Power BI Tips (2025) Microsoft Fabric Notebooks with Materialized Views - Quick Tips [link]
[3] Microsoft Learn (2025) What are materialized lake views in Microsoft Fabric? [link]
[4] Microsoft Learn (2025) Materialized lake views Spark SQL reference [link]
[5] Microsoft Learn (2025) Manage Fabric materialized lake views lineage [link] 
[6] Microsoft Learn (2025) Data quality in materialized lake views [link]

Resources:
[R1] Databricks (2025) Use materialized views in Databricks SQL [link]
[R2] Microsoft Learn (2025) Implement medallion architecture with materialized lake views [link]

Acronyms:
API - Application Programming Interface
CDF - Change Data Feed
CTAS - Create Table As Select
DML - Data Manipulation Language
ETL - Extract, Transform, Load
MF - Microsoft Fabric
MLV - Materialized Lake View
UDF - User-Defined Function

23 May 2025

🏭🗒️Microsoft Fabric: Warehouse Snapshots [Notes] 🆕

Disclaimer: This is work in progress intended to consolidate information from various sources for learning purposes. For the latest information please consult the documentation (see the links below)! 

Last updated: 23-May-2025

[Microsoft Fabric] Warehouse Snapshots

  • {def} read-only representation of a warehouse at a specific point in time [1]
  • allows support for analytics, reporting, and historical analysis scenarios without worrying about the volatility of live data updates [1]
    • provide a consistent and stable view of data [1]
    • ensuring that analytical workloads remain unaffected by ongoing changes or ETL  operations [1]
  • {benefit} guarantees data consistency
    • the dataset remains unaffected by ongoing ETL processes [1]
  • {benefit} immediate roll-forward updates
    • can be seamlessly rolled forward on demand to reflect the latest state of the warehouse
      • ⇒ {benefit} consumers access the same snapshot using a consistent connection string, even from third-party tools [1]
      • ⇐ updates are applied immediately, as if in a single, atomic transaction [1]
  • {benefit} facilitates historical analysis
    • snapshots can be created on an hourly, daily, or weekly basis to suit their business requirements [1]
  • {benefit} enhanced reporting
    • provides a point-in-time reliable dataset for precise reporting [1]
      • ⇐ free from disruptions caused by data modifications [1]
  • {benefit} doesn't require separate storage [1]
    • relies on source Warehouse [1]
  • {limit} doesn't support database objects 
  • {limit} capture a state within the last 30 days
  • {operation} create snapshot
    • via New warehouse snapshot
    • multiple snapshots can be created for the same parent warehouse [1]
      • appear as child items of the parent warehouse in the workspace view [1]
      • the queries run against provide the current version of the data being accessed [1]
  • {operation} read properties 
    • via GET https://api.fabric.microsoft.com/v1/workspaces/{workspaceId}/items/{warehousesnapshotId}
      • with header Authorization: Bearer <bearer token>
  • {operation} update snapshot timestamp
    • allows users to roll forward data instantly, ensuring consistency [1] (see the sketch after this list)
      • use current state
        • via ALTER DATABASE [<snapshot name>] SET TIMESTAMP = CURRENT_TIMESTAMP; 
      • use point in time
        • via ALTER DATABASE [<snapshot name>] SET TIMESTAMP = 'YYYY-MM-DDTHH:MM:SS.SS'; -- UTC time
    • queries that are in progress during point in time update will complete against the version of data they were started against [1]
  • {operation} rename snapshot
  • {operation} delete snapshot
    • via DELETE
    • when the parent warehouse gets deleted, the snapshot is also deleted [1]
  • {operation} modify source table
    • DDL changes to source will only impact queries in the snapshot against tables affected [1]
  • {operation} join multiple snapshots
    • the resulting snapshot date will be applied to each warehouse connection [1]
  • {operation} retrieve metadata
    • via sys.databases [1]
  • [permissions] inherited from the source warehouse [1]
    • ⇐ any permission changes in the source warehouse applies instantly to the snapshot [1]
    • security updates on source database will be rendered immediately to the snapshot databases [1]
  • {limitation} can only be created against new warehouses [1]
    • created after Mar-2025
  • {limitation} do not appear in SSMS Object Explorer but will show up in the database selection dropdown [1]
  • {limitation} datetime can be set to any date in the past up to 30 days or database creation time (whichever is later)  [1]
  • {limitation} modified objects after the snapshot timestamp become invalid in the snapshot [1]
    • applies to tables, views, and stored procedures [1]
  • {limitation} must be recreated if the data warehouse is restored [1]
  • {limitation} aren’t supported on the SQL analytics endpoint of the Lakehouse [1]
  • {limitation} aren’t supported as a source for OneLake shortcuts [1]
  • [Power BI] {limitation} require DirectQuery or Import mode [1]
    • don’t support Direct Lake
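
  A short T-SQL sketch for managing a snapshot's timestamp (the snapshot name SalesDW_Snapshot and the date value are assumptions for illustration; the statements follow [1]):

  -- roll the snapshot forward to the warehouse's current state
  ALTER DATABASE [SalesDW_Snapshot] SET TIMESTAMP = CURRENT_TIMESTAMP;

  -- or pin it to a specific point in time (UTC, within the last 30 days)
  ALTER DATABASE [SalesDW_Snapshot] SET TIMESTAMP = '2025-05-20T08:00:00.00';

  -- retrieve snapshot metadata
  SELECT name, create_date FROM sys.databases;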

    References:
    [1] Microsoft Learn (2025) Fabric: Warehouse Snapshots in Microsoft Fabric (Preview) [link]
    [2] Microsoft Learn (2025) Warehouse snapshots (preview) [link]
    [3] Microsoft Learn (2025) Create and manage a warehouse snapshot (preview) [link]

    Resources:


    Acronyms:
    DDL - Data Definition Language
    ETL - Extract, Transform, Load
    MF - Microsoft Fabric
    SSMS - SQL Server Management Studio

    29 April 2025

    🏭🗒️Microsoft Fabric: Purview [Notes]

    Disclaimer: This is work in progress intended to consolidate information from various sources for learning purposes. For the latest information please consult the documentation (see the links below)! 

    Last updated: 29-Apr-2025

    [Microsoft Purview] Purview
    • {def} comprehensive data governance and security platform designed to help organizations manage, protect, and govern their data across various environments [1]
      • incl. on-premises, cloud & SaaS applications [1]
      • provides the highest and most flexible level of functionality for data governance in MF [1]
        • offers comprehensive tools for 
          • data discovery
          • data classification
          • data cataloging
    • {capability} managing the data estate
      • {tool} dedicated portal
        • aka Fabric Admin portal
        • used to control tenant settings, capacities, domains, and other objects, typically reserved for administrators
      • {type} logical containers
        • used to control access to data and capabilities [1]
        • {level} tenants
          • settings for Fabric administrators [1]
        • {level} domains
          • group data that is relevant to a single business area or subject field [1]
        • {level} workspaces 
          • group Fabric items used by a single team or department [1]
      • {type} capacities
        • objects that limit compute resource usage for all Fabric workloads [1]
    • {capability} metadata scanning
      • extracts values from data lakes
        • e.g. names, identities, sensitivities, endorsements, etc. 
        • can be used to analyze and set governance policies [1]
    • {capability} secure and protect data
      • assure that data is protected against unauthorized access and destructive attacks [1]
      • compliant with data storage regulations applicable in your region [1]
      • {tool} data tags
        • allows to identify the sensitivity of data and apply data retention and protection policies [1]
      • {tool} workspace roles
        • define the users who are authorized to access the data in a workspace [1]
      • {tool} data-level controls
        • used at the level of Fabric items
          • e.g. tables, rows, and columns to impose granular restrictions.
      • {tool} certifications
        • Fabric is compliant with many data management certifications
          • incl. HIPAA BAA, ISO/IEC 27017, ISO/IEC 27018, ISO/IEC 27001, ISO/IEC 27701 [1]
    • {feature} OneLake data hub
      • allows users to find and explore the data in their estate.
    • {feature} endorsement
      • allows users to endorse a Fabric item to identify it as being of high quality [1]
        • help other users to trust the data that the item contains [1]
    • {feature} data lineage
      • allows users to understand the flow of data between items in a workspace and the impact that a change would have [1]
    • {feature} monitoring hub
      • allows to monitor activities for the Fabric items for which the user has the permission to view [1]
    • {feature} capacity metrics
      • app used to monitor usage and consumption
    • {feature} allows to automate the identification of sensitive information and provides a centralized repository for metadata [1]
    • {feature} allows to find, manage, and govern data across various environments
      • incl. both on-premises and cloud-based systems [1]
      • supports compliance and risk management with features that monitor regulatory adherence and assess data vulnerabilities [1]
    • {feature} integrated with other Microsoft services and third-party tools 
      • {benefit} enhances its utility
      • {benefit} streamlines data access controls
        • enforcing policies, and delivering insights into data lineage [1]
    • {benefit} helps organizations maintain data integrity, comply with regulations, and use their data effectively for strategic decision-making [1]
    • {feature} Data Catalog
      • {benefit} allows users to discover, understand, and manage their organization's data assets
        • search for and browse datasets
        • view metadata
        • gain insights into the data’s lineage, classification, and sensitivity labels [1]
      • {benefit} promotes collaboration
        • users can annotate datasets with tags to improve discoverability and data governance [1]
      • targets users and administrators
      • {benefit} allows to discover where patient records are held by searching for keywords [1]
      • {benefit} allows to label documents and items based on their sensitiveness [1]
      • {benefit} allows to use access policies to manage self-service access requests [1]
    • {feature} Information Protection
      • used to classify, label, and protect sensitive data throughout the organization [1]
        • by applying customizable sensitivity labels, users classify records. [1]
        • {concept} policies
          • define access controls and enforce encryption
          • labels follow the data wherever it goes
          • helps organizations meet compliance requirements while safeguarding data against accidental exposure or malicious threats [1]
      • allows to protect records with policies to encrypt data and impose IRM
    • {feature} Data Loss Prevention (DLP)
      • the practice of protecting sensitive data to reduce the risk from oversharing [2]
        • implemented by defining and applying DLP policies [2]
    • {feature} Audit
      • user activities are automatically logged and appear in the Purview audit log
        • e.g. creating files or accessing Fabric items
    • {feature} connect Purview to Fabric in a different tenant
      • all functionality is supported, except that 
        • {limitation} Purview's live view isn't available for Fabric items [1]
        • {limitation} the system can't identify user registration automatically [1]
        • {limitation} managed identity can’t be used for authentication in cross-tenant connections [1]
          • {workaround} use a service principal or delegated authentication [1]
    • {feature} Purview hub
      • displays reports and insights about Fabric items [1]
        • acts as a centralized location to begin data governance and access more advanced features [1]
        • via Settings >> Microsoft Purview hub
        • administrators see information about their entire organization's Fabric data estate
        • provides information about
          • Data Catalog
          • Information Protection
          • Audit
      • the data section displays tables and graphs that analyze the entire organization's items in MF
        • users only see information about their own Fabric items and data

    References:
    [1] Microsoft Learn (2024) Purview: Govern data in Microsoft Fabric with Purview[link]
    [2] Microsoft Learn (2024) Purview: Learn about data loss prevention [link]

    Resources:

    Acronyms:
    DLP - Data Loss Prevention
    M365 - Microsoft 365
    MF - Microsoft Fabric
    SaaS - Software-as-a-Service

    🏭🗒️Microsoft Fabric: Data Loss Prevention (DLP) in Purview [Notes]

    Disclaimer: This is work in progress intended to consolidate information from various sources for learning purposes. For the latest information please consult the documentation (see the links below)! 

    Last updated: 10-Jun-2025

    [Microsoft Purview] Data Loss Prevention (DLP)
    • {def} the practice of protecting sensitive data to reduce the risk from oversharing [2]
      • implemented by defining and applying DLP policies [2]
    • {benefit} helps to protect sensitive information with policies that automatically detect, monitor, and control the sharing or movement of sensitive data [1]
      • administrators can customize rules to block, restrict, or alert when sensitive data is transferred to prevent accidental or malicious data leaks [1]
    • {concept} DLP policies
      • allow to monitor the activities users take on sensitive items and then take protective actions [2]
        • applies to sensitive items 
          • at rest
          • in transit [2]
          • in use [2]
        • created and maintained in the Microsoft Purview portal [2]
      • {scope} only supported for Power BI semantic models [1]
      • {action} show a pop-up policy tip to the user that warns that they might be trying to share a sensitive item inappropriately [2]
      • {action} block the sharing and, via a policy tip, allow the user to override the block and capture the users' justification [2]
      • {action} block the sharing without the override option [2]
      • {action} [data at rest] sensitive items can be locked and moved to a secure quarantine location [2]
      • {action} sensitive information won't be displayed 
        • e.g. Teams chat
    • DLP reports
      • provides data from monitoring policy matches and actions, to user activities [2]
        • used as basis for tuning policies and triage actions taken on sensitive items [2]
      • telemetry uses M365 audit logs and processes the data for the different reporting tools [2]
        • M365 provides visibility into risky user activities [2]
        • scans the audit logs for risky activities and runs them through a correlation engine to find activities that are occurring at a high volume [1]
          • no DLP policies are required [2]
    • {feature} detects sensitive items by using deep content analysis [2]
      • ⇐ not by just a simple text scan [2]
      • based on
        • keywords matching [2]
        • evaluation of regular expressions [2] 
        • internal function validation [2]
        • secondary data matches that are in proximity to the primary data match [2]
        • ML algorithms and other methods to detect content that matches DLP policies
      • all DLP monitored activities are recorded to the Microsoft 365 Audit log [2]
    • DLP lifecycle
      • {phase} plan for DLP
        • train and acclimate users to DLP practices on well-planned and tuned policies [2]
        • {recommendation} use policy tips to raise awareness with users before changing the policy status from simulation mode to more restrictive modes [2]
      • {phase} prepare for DLP
      • {phase} deploy policies in production
        • {action} define control objectives, and how they apply across workloads [2]
        • {action} draft a policy that embodies the objectives
        • {action} start with one workload at a time, or across all workloads - there's no impact yet
        • {feature} implement policies in simulation mode
          • {benefit} allows to evaluate the impact of controls
            • the actions defined in a policy aren't applied yet
          • {benefit} allows to monitor the outcomes of the policy and fine-tune it so that it meets the control objectives while ensuring it doesn't adversely or inadvertently impact valid user workflows and productivity [2]
            • e.g. adjusting the locations and people/places that are in or out of scope
            • e.g. tune the conditions that are used to determine if an item and what is being done with it matches the policy
            • e.g. the sensitive information definition/s
            • e.g. add new controls
            • e.g. add new people
            • e.g. add new restricted apps
            • e.g. add new restricted sites
          • {step} enable the control and tune policies [2]
            • policies take effect about an hour after being turned on [2]
        • {action} create DLP policy 
        • {action} deploy DLP policy 
    • DLP alerts 
      • alerts generated when a user performs an action that meets the criteria of a DLP policy [2]
        • there are incident reports configured to generate alerts [2]
        • {limitation} available in the alerts dashboard for 30 days [2]
      • DLP posts the alert for investigation in the DLP Alerts dashboard
      • {tool} DLP Alerts dashboard 
        • allows to view alerts, triage them, set investigation status, and track resolution
          • routed to Microsoft Defender portal 
          • {limitation} available for six months [2]
        • {constraint} administrative unit restricted admins see the DLP alerts for their administrative unit only [2]
    • {concept} egress activities (aka exfiltration)
      • {def} actions related to exiting or leaving a space, system or network [2]
    • {concept}[Microsoft Fabric] policy
      • when a DLP policy detects a supported item type containing sensitive information, the actions configured in the policy are triggered [3]
      • {feature} Activity explorer
        • allows to view Data from DLP for Fabric and Power BI
        • for accessing the data, user's account must be a member of any of the following roles or higher [3]
          • Compliance administrator
          • Security administrator
          • Compliance data administrator
          • Global Administrator 
            • {warning} a highly privileged role that should only be used in scenarios where a lesser privileged role can't be used [3]
          • {recommendation} use a role with the fewest permissions [3]
      • {warning} DLP evaluation workloads impact capacity consumption [3]
      • {action} define policy
        • in the data loss prevention section of the Microsoft Purview portal [3]
        • allows to specify 
          •  conditions 
            • e.g. sensitivity labels
          •  sensitive info types that should be detected [3]
        • [semantic model] evaluated against DLP policies 
          • whenever one of the following events occurs:
            • publish
            • republish
            • on-demand refresh
            • scheduled refresh
          •  the evaluation  doesn't occur if either of the following is true
            • the initiator of the event is an account using service principal authentication [3]
            • the semantic model owner is a service principal [3]
        • [lakehouse] evaluated against DLP policies when the data within a lakehouse undergoes a change
          • e.g. getting new data, connecting a new source, adding or updating existing tables, etc. [3]

    References:
    [1] Microsoft Learn (2025) Learn about data loss prevention [link]
    [2] Microsoft Learn (2024) Purview: Learn about data loss prevention [link]
    [3] Microsoft Learn (2025) Get started with Data loss prevention policies for Fabric and Power BI [link]

    Resources:
    [R1] Microsoft Fabric Updates Blog (2024) Secure Your Data from Day One: Best Practices for Success with Purview Data Loss Prevention (DLP) Policies in Microsoft Fabric [link]

    Acronyms:
    DLP - Data Loss Prevention
    M365 - Microsoft 365

    26 April 2025

    🏭🗒️Microsoft Fabric: Parameters in Dataflows Gen2 [Notes] 🆕

    Disclaimer: This is work in progress intended to consolidate information from various sources for learning purposes. For the latest information please consult the documentation (see the links below)! 

    Last updated: 26-Apr-2025

    [Microsoft Fabric] Dataflow Gen2 Parameters

    • {def} parameters that allow to dynamically control and customize Dataflows Gen2
      • makes them more flexible and reusable by enabling different inputs and scenarios without modifying the dataflow itself [1]
      • the dataflow is refreshed by passing parameter values outside of the Power Query editor through either
        • Fabric REST API [1]
        • native Fabric experiences [1]
      • parameter names are case sensitive [1]
      • {type} required parameters
        • {warning} the refresh fails if no value is passed for it [1]
      • {type} optional parameters
      • enabled via Parameters >> Enable parameters to be discovered and override for execution [1]
    • {limitation} dataflows with parameters can't be
      • scheduled for refresh through the Fabric scheduler [1]
      • manually triggered through the Fabric Workspace list or lineage view [1]
    • {limitation} parameters that affect the resource path of a data source or a destination are not supported [1]
      • ⇐ connections are linked to the exact data source path defined in the authored dataflow
        • can't currently be overridden to use other connections or resource paths [1]
    • {limitation} can't be leveraged by dataflows with incremental refresh [1]
    • {limitation} only parameters of type decimal number, whole number, text, and true/false can be passed for override
      • any other data types don't produce a refresh request in the refresh history but show in the monitoring hub [1]
    • {warning} parameters allow other users who have permissions to the dataflow to refresh the data with other values [1]
    • {limitation} refresh history does not display information about the parameters passed during the invocation of the dataflow [1]
    • {limitation} monitoring hub doesn't display information about the parameters passed during the invocation of the dataflow [1]
    • {limitation} staged queries only keep the last data refresh of a dataflow stored in the Staging Lakehouse [1]
    • {limitation} only the first request will be accepted from duplicated requests for the same parameter values [1]
      • subsequent requests are rejected until the first request finishes its evaluation [1]

    References:
    [1] Microsoft Learn (2025) Use public parameters in Dataflow Gen2 (Preview) [link]

    Resources:
    [R1] Microsoft Fabric Blog (2025) Passing parameter values to refresh a Dataflow Gen2 (Preview) [link]

    Acronyms:
    API - Application Programming Interface
    REST - Representational State Transfer

    🏭🗒️Microsoft Fabric: Deployment Pipelines [Notes]

    Disclaimer: This is work in progress intended to consolidate information from various sources for learning purposes. For the latest information please consult the documentation (see the links below)! 

    Last updated: 26-Apr-2025

    [Microsoft Fabric] Deployment Pipelines

    • {def} a structured process that enables content creators to manage the lifecycle of their organizational assets [5]
      • enable creators to develop and test content in the service before it reaches the users [5]
        • can simplify the deployment process to development, test, and production workspaces [5]
        • one Premium workspace is assigned to each stage [5]
        • each stage can have 
          • different configurations [5]
          • different databases or different query parameters [5]
    • {action} create pipeline
      • from the deployment pipelines entry point in Fabric [5]
        • creating a pipeline from a workspace automatically assigns it to the pipeline [5]
      • {action} define how many stages it should have and what they should be called [5]
        • {default} has three stages
          • e.g. Development, Test, and Production
          • the number of stages can be changed anywhere between 2-10 
          • {action} add another stage,
          • {action} delete stage
          • {action} rename stage 
            • by typing a new name in the box
          • {action} share a pipeline with others
            • users receive access to the pipeline and become pipeline admins [5]
          • ⇐ the number of stages is permanent [5]
            • can't be changed after the pipeline is created [5]
      • {action} add content to the pipeline [5]
        • done by assigning a workspace to the pipeline stage [5]
          • the workspace can be assigned to any stage [5]
      • {action|optional} make a stage public
        • {default} the final stage of the pipeline is made public
        • a consumer of a public stage without access to the pipeline sees it as a regular workspace [5]
          • without the stage name and deployment pipeline icon on the workspace page next to the workspace name [5]
      • {action} deploy to an empty stage
        • when finishing the work in one pipeline stage, the content can be deployed to the next stage [5] 
          • deployment can happen in any direction [5]
        • {option} full deployment 
          • deploy all content to the target stage [5]
        • {option} selective deployment 
          • allows select the content to deploy to the target stage [5]
        • {option} backward deployment 
          • deploy content from a later stage to an earlier stage in the pipeline [5] 
          • {restriction} only possible when the target stage is empty [5]
      • {action} deploy content between stages [5]
        • content can be deployed even if the next stage has content
          • paired items are overwritten [5]
      • {action|optional} create deployment rules
        • when deploying content between pipeline stages, allow changes to content while keeping some settings intact [5] 
        • once a rule is defined or changed, the content must be redeployed
          • the deployed content inherits the value defined in the deployment rule [5]
          • the value always applies as long as the rule is unchanged and valid [5]
      • {feature} deployment history 
        • allows to see the last time content was deployed to each stage [5]
        • allows to track the time between deployments [5]
    • {concept} pairing
      • {def} the process by which an item in one stage of the deployment pipeline is associated with the same item in the adjacent stage
        • applies to reports, dashboards, semantic models
        • paired items appear on the same line in the pipeline content list [5]
          • ⇐ items that aren't paired, appear on a line by themselves [5]
        • the items remain paired even if their name changes
        • items added after the workspace is assigned to a pipeline aren't automatically paired [5]
          • ⇐ one can have identical items in adjacent workspaces that aren't paired [5]
    • [lakehouse]
      • can be removed as a dependent object upon deployment [3]
      • supports mapping different Lakehouses within the deployment pipeline context [3]
      • {default} a new empty Lakehouse object with same name is created in the target workspace [3]
        • ⇐ if nothing is specified during deployment pipeline configuration
        • notebook and Spark job definitions are remapped to reference the new lakehouse object in the new workspace [3]
        • {warning} a new empty Lakehouse object with same name still is created in the target workspace [3]
        • SQL Analytics endpoints and semantic models are provisioned
        • no object inside the Lakehouse is overwritten [3]
        • updates to Lakehouse name can be synchronized across workspaces in a deployment pipeline context [3] 
    • [notebook] deployment rules can be used to customize the behavior of notebooks when deployed [4]
      • e.g. change notebook's default lakehouse [4]
      • {feature} auto-binding
        • binds the default lakehouse and attached environment within the same workspace when deploying to next stage [4]
    • [environment] custom pool is not supported in deployment pipeline
      • the configurations of Compute section in the destination environment are set with default values [6]
      • ⇐ subject to change in upcoming releases [6]
    • [warehouse]
      • [database project] ALTER TABLE to add a constraint or column
        • {limitation} the table will be dropped and recreated when deploying, resulting in data loss
      • {recommendation} do not create a Dataflow Gen2 with an output destination to the warehouse
        • ⇐ deployment would be blocked by a new item named DataflowsStagingWarehouse that appears in the deployment pipeline [10]
      • SQL analytics endpoint is not supported
    • [Eventhouse]
      • {limitation} the connection must be reconfigured in destination that use Direct Ingestion mode [8]
    • [EventStream]
      • {limitation} limited support for cross-workspace scenarios
        • {recommendation} make sure all Eventstream destinations are within the same workspace [8]
    • [KQL database]
      • applies to tables, functions, materialized views [7]
    • [KQL queryset]
      • applies to tabs, data sources [7]
    • [real-time dashboard]
      • applies to data sources, parameters, base queries, tiles [7]
    • [SQL database]
      • includes the specific differences between the individual database objects in the development and test workspaces [9]
    • can be also used with

      References:
      [1] Microsoft Learn (2024) Get started with deployment pipelines [link]
      [2] Microsoft Learn (2024) Implement continuous integration and continuous delivery (CI/CD) in Microsoft Fabric [link]
      [3] Microsoft Learn (2024)  Lakehouse deployment pipelines and git integration (Preview) [link]
      [4] Microsoft Learn (2024) Notebook source control and deployment [link]
      [5] Microsoft Learn (2024) Introduction to deployment pipelines [link]
      [6] Environment Git integration and deployment pipeline [link]
      [7] Microsoft Learn (2024) Real-Time Intelligence: Git integration and deployment pipelines (Preview) [link]
      [8] Microsoft Learn (2024) Eventstream CI/CD - Git Integration and Deployment Pipeline [link]
      [9] Microsoft Learn (2024) Get started with deployment pipelines integration with SQL database in Microsoft Fabric [link]
      [10] Microsoft Learn (2025) Source control with Warehouse (preview) [link]

      Resources:

      Acronyms:
      CLM - Content Lifecycle Management
      UAT - User Acceptance Testing

      🏭🗒️Microsoft Fabric: Power BI Environments [Notes]

      Disclaimer: This is work in progress intended to consolidate information from various sources for learning purposes. For the latest information please consult the documentation (see the links below)! 

      Last updated: 26-Apr-2025

      Enterprise Content Publishing [2]

      [Microsoft Fabric] Power BI Environments

      • {def} structured spaces within Microsoft Fabric that help organizations manage Power BI assets through their entire lifecycle
      • {environment} development 
        • allows to develop the solution
        • accessible only to the development team 
          • via Contributor access
        • {recommendation} use Power BI Desktop as local development environment
          • {benefit} allows to try, explore, and review updates to reports and datasets
            • once the work is done, upload the new version to the development stage
          • {benefit} enables collaborating and changing dashboards
          • {benefit} avoids duplication 
            • making online changes, downloading the .pbix file, and then uploading it again creates duplicate reports and datasets
        • {recommendation} use version control to keep the .pbix files up to date
          • [OneDrive] use Power BI's autosync
            • {alternative} SharePoint Online with folder synchronization
            • {alternative} GitHub and/or VSTS with local repository & folder synchronization
        • [enterprise scale deployments] 
          • {recommendation} separate dataset from reports and dashboards’ development
            • use the deployment pipelines selective deploy option [22]
            • create separate .pbix files for datasets and reports [22]
              • create a dataset .pbix file and upload it to the development stage (see shared datasets) [22]
              • create .pbix only for the report, and connect it to the published dataset using a live connection [22]
            • {benefit} allows different creators to separately work on modeling and visualizations, and deploy them to production independently
          • {recommendation} separate data model from report and dashboard development
            • allows using advanced capabilities 
              • e.g. source control, merging diff changes, automated processes
            • separate the development from test data sources [1]
              • the development database should be relatively small [1]
        • {recommendation} use only a subset of the data [1]
          • ⇐ otherwise the data volume can slow down the development [1]
      • {environment} user acceptance testing (UAT)
        • test environment that sits between development and production in the deployment lifecycle
          • it's not necessary for all Power BI solutions [3]
          • allows to test the solution before deploying it into production
            • all testers must have
              • View access for testing
              • Contributor access for report authoring
          • involves business users who are SMEs
            • provide approval that the content 
              • is accurate
              • meets requirements
              • can be deployed for wider consumption
        • {recommendation} check report’s load and the interactions to find out if changes impact performance [1]
        • {recommendation} monitor the load on the capacity to catch extreme loads before they reach production [1]
        • {recommendation} test data refresh in the Power BI service regularly during development [20]
      • {environment} production
        • {concept} staged deployment
          • {goal} help minimize risk, user disruption, or address other concerns [3]
            • the deployment involves a smaller group of pilot users who provide feedback [3]
        • {recommendation} set production deployment rules for data sources and parameters defined in the dataset [1]
          • allows ensuring the data in production is always connected and available to users [1]
        • {recommendation} don’t upload a new .pbix version directly to the production stage
          •  ⇐ without going through testing
      • {feature|preview} deployment pipelines 
        • enable creators to develop and test content in the service before it reaches the users [5]
      • {recommendation} build separate databases for development and testing 
        • helps protect production data [1]
      • {recommendation} make sure that the test and production environment have similar characteristics [1]
        • e.g. data volume, usage volume, similar capacity
        • {warning} testing into production can make production unstable [1]
        • {recommendation} use Azure A capacities [22]
      • {recommendation} for formal projects, consider creating an environment for each phase
      • {recommendation} enable users to connect to published datasets to create their own reports
      • {recommendation} use parameters to store connection details 
        • e.g. instance names, database names
        • ⇐  deployment pipelines allow configuring parameter rules to set specific values for the development, test, and production stages
          • alternatively data source rules can be used to specify a connection string for a given dataset
            • {restriction} in deployment pipelines, this isn't supported for all data sources
      • {recommendation} keep the data in blob storage under the 50k blobs and 5GB data in total to prevent timeouts [29]
      • {recommendation} provide data to self-service authors from a centralized data warehouse [20]
        • allows to minimize the amount of work that self-service authors need to take on [20]
      • {recommendation} minimize the use of Excel, csv, and text files as sources when practical [20]
      • {recommendation} store source files in a central location accessible by all coauthors of the Power BI solution [20]
      • {recommendation} be aware of API connectivity issues and limits [20]
      • {recommendation} know how to support SaaS solutions from AppSource and expect further data integration requests [20]
      • {recommendation} minimize the query load on source systems [20]
        • use incremental refresh in Power BI for the dataset(s)
        • use a Power BI dataflow that extracts the data from the source on a schedule
        • reduce the dataset size by only extracting the needed amount of data 
      • {recommendation} expect data refresh operations to take some time [20]
      • {recommendation} use relational database sources when practical [20]
      • {recommendation} make the data easily accessible [20]
      • [knowledge area] knowledge transfer
        • {recommendation} maintain a list of best practices and review it regularly [24]
        • {recommendation} develop a training plan for the various types of users [24]
          • usability training for read-only report/app users [24]
          • self-service reporting for report authors & data analysts [24]
          • more elaborated training for advanced analysts & developers [24]
      • [knowledge area] lifecycle management
        • consists of the processes and practices used to handle content from its creation to its eventual retirement [6]
        • {recommendation} postfix files with 3-part version number in Development stage [24]
          • remove the version number when publishing files in UAT and production 
        • {recommendation} backup files for archive 
        • {recommendation} track version history 

        References:
        [1] Microsoft Learn (2021) Fabric: Deployment pipelines best practices [link]
        [2] Microsoft Learn (2024) Power BI: Power BI usage scenarios: Enterprise content publishing [link]
        [3] Microsoft Learn (2024) Deploy to Power BI [link]
        [4] Microsoft Learn (2024) Power BI implementation planning: Content lifecycle management [link]
        [5] Microsoft Learn (2024) Introduction to deployment pipelines [link]
        [6] Microsoft Learn (2024) Power BI implementation planning: Content lifecycle management [link]
        [20] Microsoft (2020) Planning a Power BI  Enterprise Deployment [White paper] [link]
        [22] Power BI Docs (2021) Create Power BI Embedded capacity in the Azure portal [link]
        [24] Paul Turley (2019)  A Best Practice Guide and Checklist for Power BI Projects

        Resources:

        Acronyms:
        API - Application Programming Interface
        CLM - Content Lifecycle Management
        COE - Center of Excellence
        SaaS - Software-as-a-Service
        SME - Subject Matter Expert
        UAT - User Acceptance Testing
        VSTS - Visual Studio Team System

        25 April 2025

        🏭🗒️Microsoft Fabric: Dataflows Gen2's Incremental Refresh [Notes] 🆕

        Disclaimer: This is work in progress intended to consolidate information from various sources for learning purposes. For the latest information please consult the documentation (see the links below)! 

        Last updated: 25-Apr-2025

        [Microsoft Fabric] Incremental Refresh in Dataflows Gen2

        • {feature} enables to incrementally extract data from data sources, apply Power Query transformations, and load into various output destinations [5]
          • designed to reduce the amount of data that needs to be processed and retrieved from the source system [8]
          • configurable directly in the dataflow editor [8]
          • doesn't need to specify the historical data range [8]
            • ⇐ the dataflow doesn't remove any data from the destination that's outside the bucket range [8]
          • doesn't need to specify the parameters for the incremental refresh [8]
            • the filters and parameters are automatically added as the last step in the query [8]
        • {prerequisite} the data source 
          • supports folding [8]
          • needs to contain a Date/DateTime column that can be used to filter the data [8]
        • {prerequisite} the data destination supports incremental refresh [8]
          • available destinations
            • Fabric Warehouse
            • Azure SQL Database
            • Azure Synapse Analytics
            • Fabric Lakehouse [preview]
          • other destinations can be used in combination with incremental refresh by using a second query that references the staged data to update the data destination [8]
            • allows to use incremental refresh to reduce the amount of data that needs to be processed and retrieved from the source system [8]
              • a full refresh from the staged data to the data destination is still needed [8]
        • works by dividing the data into buckets based on a DateTime column [8]
          • each bucket contains the data that changed since the last refresh [8]
            • the dataflow knows what changed by checking the maximum value in the specified column 
              • if the maximum value changed for that bucket, the dataflow retrieves the whole bucket and replaces the data in the destination [8]
              • if the maximum value didn't change, the dataflow doesn't retrieve any data [8]
        • {limitation} 
          • the data destination must be set to a fixed schema [8]
          • ⇒ the table's schema in the data destination must be fixed and can't change [8]
            • ⇒ dynamic schema must be changed to fixed schema before configuring incremental refresh [8]
        • {limitation} the only supported update method in the data destination: replace
          • ⇒ the dataflow replaces the data for each bucket in the data destination with the new data [8]
            • data that is outside the bucket range isn't affected [8]
        • {limitation} maximum number of buckets
          • single query: 50
            • {workaround} increase the bucket size or reduce the bucket range to lower the number of buckets [8]
          • whole dataflow: 150
            • {workaround} reduce the number of incremental refresh queries or increase the bucket size [8]
        • {downside} the dataflow may take longer to refresh after enabling incremental refresh [8]
          • because the additional overhead of checking if data changed and processing the buckets is higher than the time saved by processing less data [8]
          • {recommendation} review the settings for incremental refresh and adjust them to better fit the scenario
            • {option} increase the bucket size to reduce the number of buckets and the overhead of processing them [8]
            • {option} disable incremental refresh [8]
        • {recommendation} don't use the column for detecting changes also for filtering [8]
          • because this can lead to unexpected results [8]
        • {setting} limit number of concurrent evaluation
          • setting the value to a lower number, reduces the number of requests sent to the source system [8]
          • via global settings >> Scale tab >> maximum number of parallel query evaluations
          • {recommendation} don't enable this limit unless there're issues with the source system [8]

        References:
        [5] Microsoft Learn (2023) Fabric: Save a draft of your dataflow [link]
        [8] Microsoft Learn (2025) Fabric: Incremental refresh in Dataflow Gen2 [link]

        Resources:

