
06 October 2025

🏭🗒️Microsoft Fabric: Git [Notes]

Disclaimer: This is work in progress intended to consolidate information from various sources for learning purposes. For the latest information please consult the documentation (see the links below)! 

Last updated: 6-Oct-2025

[Microsoft Fabric] Git

  • {def} an open source, distributed version control platform
    • enables developers to commit their work to a local repository and then sync their copy of the repository with the copy on the server [1]
    • to be differentiated from centralized version control 
      • where clients must synchronize code with a server before creating new versions of code [1]
    • provides tools for isolating changes and later merging them back together
  • {benefit} simultaneous development
    • everyone has their own local copy of code and works simultaneously on their own branches
      •  Git works offline since almost every operation is local
  • {benefit} faster release
    • branches allow for flexible and simultaneous development
  • {benefit} built-in integration
    • integrates into most tools and products
      •  every major IDE has built-in Git support
        • this integration simplifies the day-to-day workflow
  • {benefit} strong community support
    • the volume of community support makes it easy to get help when needed
  • {benefit} works with any team
    • using Git with a source code management tool increases a team's productivity 
      • by encouraging collaboration, enforcing policies, automating processes, and improving visibility and traceability of work
    • the team can either
      • settle on individual tools for version control, work item tracking, and continuous integration and deployment
      • choose a solution that supports all of these tasks in one place
        • e.g. GitHub, Azure DevOps
  • {benefit} pull requests
    • used to discuss code changes with the team before merging them into the main branch
    • allows to ensure code quality and increase knowledge across the team
    • platforms like GitHub and Azure DevOps offer a rich pull request experience
  • {benefit} branch policies
    • protect important branches by preventing direct pushes, requiring reviewers, and ensuring clean builds
      •  used to ensure that pull requests meet requirements before completion
    • teams can configure their solution to enforce consistent workflows and processes across the team
  • {feature} continuous integration
  • {feature} continuous deployment
  • {feature} automated testing
  • {feature} work item tracking
  • {feature} metrics
  • {feature} reporting 
  • {operation} commit
    • snapshot of all files at a point in time [1]
      •  every time work is saved, Git creates a commit [1]
      •  identified by a unique cryptographic hash of the committed content [1]
      •  everything is hashed
      •  it's impossible to make changes, lose information, or corrupt files without Git detecting it [1]
    • commits create links to other commits, forming a graph of the development history [2A]
    • {operation} revert code to a previous commit [1]
    • {operation} inspect how files changed from one commit to the next [1]
    • {operation} review information e.g. where and when changes were made [1]
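The commit operations above can be sketched with plain `git` in a throwaway repository (a minimal illustration; assumes `git` is installed, and the file name is made up for the demo):

```shell
#!/bin/sh
set -e
# Throwaway repo to illustrate commits as hashed snapshots
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email "demo@example.com"   # local identity just for the demo
git config user.name "Demo"

echo "v1" > report.txt
git add report.txt
git commit -q -m "Add report"

echo "v2" > report.txt
git commit -q -am "Update report"

git log --oneline              # each commit identified by a (shortened) content hash
git diff HEAD~1 HEAD           # inspect how files changed from one commit to the next
git log -1 --format="%an %ad"  # review who made the change and when
```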
  • {operation} branch
    •  lightweight pointers to work in progress
    •  each developer saves changes to their own local code repository
      • there can be many different changes based on the same commit
        •  branches manage this separation
      • once work created in a branch is finished, it can be merged back into the team's main (or trunk) branch
    • main branch
      • contains stable, high-quality code from which programmers release
    • feature branches 
      • contain work in progress, which is merged into the main branch upon completion
      • allows to isolate development work and minimize conflicts among multiple developers [2]
    •  release branch
      •  by separating the release branch from development in progress, it's easier to manage stable code and ship updates more quickly
  • if a file hasn't changed from one commit to the next, Git uses the previously stored file [1]
  • files are in one of three states
    • {state} modified
      • when a file is first modified, the changes exist only in the working directory
        • they aren't yet part of a commit or the development history
      • the developer must stage the changed files to be included in the commit
    • {state} staged
      • the staging area contains all changes to include in the next commit
      • staging lets developers pick which file changes to save in a commit, breaking down large changes into a series of smaller commits
        • by reducing the scope of commits, it's easier to review the commit history
    • {state} committed
      • once the developer is happy with the staged files, the files are packaged as a commit with a message describing what changed
        • this commit becomes part of the development history
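The three states can be observed directly via `git status` in a scratch repository (a minimal sketch; `git` assumed available, file name illustrative):

```shell
#!/bin/sh
set -e
repo=$(mktemp -d); cd "$repo"
git init -q
git config user.email "demo@example.com"
git config user.name "Demo"
echo "base" > app.txt
git add app.txt && git commit -qm "Initial commit"

echo "change" >> app.txt                 # state: modified (working directory only)
git status --short                       # " M app.txt" - modified, not staged
git add app.txt                          # state: staged (in the staging area)
git status --short                       # "M  app.txt" - staged for the next commit
git commit -qm "Describe what changed"   # state: committed (part of history)
git status --short                       # clean - nothing left to commit
```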
  • {best practice} set up a shared Git repository and CI/CD pipelines [2]
    • enables effective collaboration and deployment in PBIP [2]
    • enables implementing version control in PBIP [2]
      • it’s essential for managing project history and collaboration [2]
      • allows to track changes throughout the model lifecycle [2]
      • allows to enable effective governance and collaboration
    •  provides robust version tracking and collaboration features, ensuring traceability
  • {best practice} use descriptive commit messages [2]
    • allows to ensure clarity and facilitate collaboration in version control [2]
  • {best practice} avoid sharing Git credentials [2]
    • compromises security and accountability [2]
      •  can lead to potential breaches [2]
  • {best practice} define a naming convention for files and communicate it accordingly [2]
  • {best practice} avoid merging changes directly into the master branch [2]
    • {risk} this can lead to integration issues [2]
  • {best practice} use git merge for integrating changes from one branch to another [2]
    • {benefit} ensures seamless collaboration [2]
  • {best practice} avoid skipping merges [2]
    • failing to merge regularly can lead to complex conflicts and integration challenges [2]
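A minimal sketch of the branch-and-merge workflow these practices describe, using a scratch repository (branch and file names are illustrative):

```shell
#!/bin/sh
set -e
repo=$(mktemp -d); cd "$repo"
git init -q
git config user.email "demo@example.com"
git config user.name "Demo"
echo "stable" > prod.txt
git add prod.txt
git commit -qm "Stable baseline"
git branch -M main                   # normalize the default branch name

git switch -c feature/report-fix     # work on a branch, not on main directly
echo "fix" > fix.txt
git add fix.txt
git commit -qm "Fix report layout"

git switch main
git merge --no-ff feature/report-fix -m "Merge feature/report-fix"  # integrate via git merge
git log --oneline --graph            # merge commit preserves the branch history
```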

References:
[1] Microsoft Learn (2022) DevOps: What is Git? [link]
[2] M Anand, Microsoft Fabric Analytics Engineer Associate: Implementing Analytics Solutions Using Microsoft Fabric (DP-600), 2025 

Acronyms:
PBIP - Power BI Project
CI/CD - Continuous Integration and Continuous Deployment
IDE - Integrated Development Environment
 

13 April 2025

🏭🗒️Microsoft Fabric: Continuous Integration & Continuous Deployment [CI/CD] [Notes]

Disclaimer: This is work in progress intended to consolidate information from various sources for learning purposes. For the latest information please consult the documentation (see the links below)! 

Last updated: 13-Apr-2025

[Microsoft Fabric] Continuous Integration & Continuous Deployment [CI/CD] 
  • {def} development processes, tools, and best practices used to automate the integration, testing, and deployment of code changes to ensure efficient and reliable development
    • can be used in combination with a client tool
      • e.g. VS Code, Power BI Desktop
      • a workspace isn't necessarily needed
        • developers can 
          • create branches
          • commit changes to that branch locally
          • push changes to the remote repo
          • create a pull request to the main branch
          • ⇐ all steps can be performed without a workspace [1]
        • workspace is needed only as a testing environment [1]
          • to check that everything works in a real-life scenario [1]
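The workspace-free developer flow above can be sketched locally; a bare repository stands in for the remote (Azure DevOps or GitHub), and the pull-request step is only indicated in a comment, since it happens on the hosting platform:

```shell
#!/bin/sh
set -e
work=$(mktemp -d)
# A local bare repo stands in for the remote (Azure DevOps / GitHub) repository
git init -q --bare "$work/remote.git"
git clone -q "$work/remote.git" "$work/clone"
cd "$work/clone"
git config user.email "demo@example.com"
git config user.name "Demo"

echo "item definition" > notebook.py
git add notebook.py
git commit -qm "Add notebook definition"
git branch -M main
git push -q origin main

git switch -c feature/new-report          # 1. create a branch
echo "report" > report.py
git add report.py
git commit -qm "Add report"               # 2. commit changes to that branch locally
git push -q -u origin feature/new-report  # 3. push changes to the remote repo
# 4. create a pull request to main in the platform's UI or CLI
#    (e.g. `az repos pr create` for Azure DevOps) - not runnable locally
```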
    • addresses a few pain points [2]
      • manual integration issues
        • manual changes can lead to conflicts and errors
          • slow down development [2]
      • development delays
        • manual deployments are time-consuming and prone to errors
          • lead to delays in delivering new features and updates [2]
      • inconsistent environments
        • inconsistencies between environment cause issues that are hard to debug [2]
      • lack of visibility
        • can be challenging to
          • track changes through their lifetime [2]
          • understand the state of the codebase [2]
    • {process} continuous integration (CI)
    • {process} continuous deployment (CD)
    • architecture
      • {layer} development database 
        • {recommendation} should be relatively small [1]
      • {layer} test database 
        • {recommendation} should be as similar as possible to the production database [1]
      • {layer} production database

      • data items
        • items that store data
        • items' definition in Git defines how the data is stored [1]
    • {stage} development 
      • {best practice} back up work to a Git repository
        • back up the work by committing it into Git [1]
        • {prerequisite} the work environment must be isolated [1]
          • so others don’t override the work before it gets committed [1]
          • commit to a branch no other developer is using [1]
          • commit together changes that must be deployed together [1]
            • helps later when 
              • deploying to other stages
              • creating pull requests
              • reverting changes
      • {warning} big commits might hit the max commit size limit [1]
        • {bad practice} store large-size items in source control systems, even if it works [1]
        • {recommendation} consider ways to reduce items’ size if they have lots of static resources, like images [1]
      • {action} revert to a previous version
        • {operation} undo
          • revert the immediate changes made, as long as they aren't committed yet [1]
          • each item can be reverted separately [1]
        • {operation} revert
          • reverting to older commits
            • {recommendation} promote an older commit to be the HEAD 
              • via git revert or git reset [1]
              • shows that there’s an update in the source control pane [1]
              • the workspace can be updated with that new commit [1]
          • {warning} reverting a data item to an older version might break the existing data and could possibly require dropping the data or the operation might fail [1]
          • {recommendation} check dependencies in advance before reverting changes back [1]
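The undo and revert operations can be illustrated in a scratch repository: `git restore` discards uncommitted changes to an item, while `git revert` promotes the previous state to HEAD via a new commit (a sketch; file name is made up):

```shell
#!/bin/sh
set -e
repo=$(mktemp -d); cd "$repo"
git init -q
git config user.email "demo@example.com"
git config user.name "Demo"

echo "v1" > model.json && git add model.json && git commit -qm "v1"
echo "v2" > model.json && git commit -qam "v2"

# Undo: discard uncommitted changes to a single item
echo "oops" >> model.json
git restore model.json          # back to the last committed content

# Revert: promote the older state to HEAD with a new commit (history preserved)
git revert --no-edit HEAD       # creates a commit that undoes "v2"
cat model.json                  # back to the "v1" content, via a new HEAD commit
```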
      • {concept} private workspace
        • a workspace that provides an isolated environment [1]
          • ⇐ allows to work in isolation [1]
        • {prerequisite} the workspace is assigned to a Fabric capacity [1]
        • {prerequisite} access to data to work in the workspace [1]
        • {step} create a new branch from the main branch [1]
          • allows to have most up-to-date version of the content [1]
          • can be used for any future branch created by the user [1]
            • when a sprint is over, the changes are merged and one can start a fresh new task [1]
              • switch the connection to a new branch on the same workspace
            • the approach can also be used when a bug needs to be fixed in the middle of a sprint [1]
          • {validation} connect to the correct folder in the branch to pull the right content into the workspace [1]
      • {best practice} make small incremental changes that are easy to merge and less likely to get into conflicts [1]
        • update the branch to resolve the conflicts first [1]
      • {best practice} change workspace’s configurations to enable productivity [1]
        • connection between items, or to different data sources or changes to parameters on a given item [1]
      • {recommendation} make sure you're working with the supported structure of the item you're authoring [1]
        • if you’re not sure, first clone a repo with content already synced to a workspace, then start authoring from there, where the structure is already in place [1]
      • {constraint} a workspace can only be connected to a single branch at a time [1]
        • {recommendation} treat this as a 1:1 mapping [1]
    • {stage} test
      • {best practice} simulate a real production environment for testing purposes [1]
        • {alternative} simulate this by connecting Git to another workspace [1]
      • factors to consider for the test environment
        • data volume
        • usage volume
        • production environment’s capacity
          • stage and production should have the same (minimal) capacity [1]
            • using the same capacity can make production unstable during load testing [1]
              • {recommendation} test using a different capacity similar in resources to the production capacity [1]
              • {recommendation} use a capacity that allows to pay only for the testing time [1]
                • allows to avoid unnecessary costs [1]
      • {best practice} use deployment rules with a real-life data source
        • {recommendation} use data source rules to switch data sources in the test stage or parameterize the connection if not working through deployment pipelines [1]
        • {recommendation} separate the development and test data sources [1]
        • {recommendation} check related items
          • the changes made can also affect the dependent items [1]
        • {recommendation} verify that the changes don’t affect or break the performance of dependent items [1]
          • via impact analysis.
      • {operation} update data items in the workspace
        • imports items’ definition into the workspace and applies it on the existing data [1]
        • the operation is same for Git and deployment pipelines [1]
        • {recommendation} know in advance what the changes are and what impact they have on the existing data [1]
        • {recommendation} use commit messages to describe the changes made [1]
        • {recommendation} upload the changes first to a dev or test environment [1]
          • {benefit} allows to see how that item handles the change with test data [1]
        • {recommendation} check the changes on a staging environment, with real-life data (or as close to it as possible) [1]
          • {benefit} allows to minimize the unexpected behavior in production [1]
        • {recommendation} consider the best timing when updating the Prod environment [1]
          • {benefit} minimize the impact errors might cause on the business [1]
        • {recommendation} perform post-deployment tests in Prod to verify that everything works as expected [1]
        • {recommendation} have a deployment, respectively a recovery plan [1]
          • {benefit} allows to minimize the effort, respectively the downtime [1]
    • {stage} production
      • {best practice} let only specific people manage sensitive operations [1]
      • {best practice} use workspace permissions to manage access [1]
        • applies to all BI creators for a specific workspace who need access to the pipeline
      • {best practice} limit access to the repo or pipeline by enabling permissions only for users who are part of the content creation process [1]
      • {best practice} set deployment rules to ensure production stage availability [1]
        • {goal} ensure the data in production is always connected and available to users [1]
        • {benefit} allows deployments to run while minimizing downtime
        • applies to data sources and parameters defined in the semantic model [1]
      • deployment into production using Git branches
        • {recommendation} use release branches [1]
          • requires changing the workspace’s connection to the new release branch before every deployment [1]
          • if the build or release pipeline requires to change the source code, or run scripts in a build environment before deployment, then connecting the workspace to Git won't help [1]
      • {recommendation} after deploying to each stage, make sure to change all the configuration specific to that stage [1]

    References:
    [1] Microsoft Learn (2025) Fabric: Best practices for lifecycle management in Fabric [link]
    [2] Microsoft Learn (2025) Fabric: CI/CD for pipelines in Data Factory in Microsoft Fabric [link]
    [3] Microsoft Learn (2025) Fabric: Choose the best Fabric CI/CD workflow option for you [link]

    Acronyms:
    API - Application Programming Interface
    BI - Business Intelligence
    CI/CD - Continuous Integration and Continuous Deployment
    VS - Visual Studio

    12 April 2025

    🏭🗒️Microsoft Fabric: Copy job in Data Factory [Notes]

    Disclaimer: This is work in progress intended to consolidate information from various sources for learning purposes. For the latest information please consult the documentation (see the links below)! 

    Last updated: 11-Apr-2025

    [Microsoft Fabric] Copy job in Data Factory 
    • {def} 
      • {benefit} simplifies data ingestion with built-in patterns for batch and incremental copy, eliminating the need for pipeline creation [1]
        • across cloud data stores [1]
        • from on-premises data stores behind a firewall [1]
        • within a virtual network via a gateway [1]
    • elevates the data ingestion experience to a more streamlined and user-friendly process from any source to any destination [1]
    • {benefit} provides seamless data integration 
      • through over 100 built-in connectors [3]
      • provides essential tools for data operations [3]
    • {benefit} provides intuitive experience
      • easy configuration and monitoring [1]
    • {benefit} efficiency
      • enables incremental copying effortlessly, reducing manual intervention [1]
    • {benefit} less resource utilization and faster copy durations
      • flexibility to control data movement [1]
        • choose which tables and columns to copy
        • map the data
        • define read/write behavior
        • set schedules that fit requirements [1]
      • applies to one-time or recurring jobs [1]
    • {benefit} robust performance
      • the serverless setup enables data transfer with large-scale parallelism
      • maximizes data movement throughput [1]
        • fully utilizes network bandwidth and data store IOPS for optimal performance [3]
    • {feature} monitoring
      • once a job is executed, users can monitor its progress and metrics through either [1] 
        • the Copy job panel
          • shows data from the most recent runs [1]
          • reports several metrics
            • status
            • rows read
            • rows written
            • throughput
        • the Monitoring hub
          • acts as a centralized portal for reviewing runs across various items [4]
    • {mode} full copy
      • copies all data from the source to the destination at once
    • {mode|GA} incremental copy
      • the initial job run copies all data, and subsequent job runs only copy changes since the last run [1]
      • an incremental column must be selected for each table to identify changes [1]
        • used as a watermark
          • allows comparing its value with the value from the last run in order to copy only the new or updated data [1]
          • the incremental column can be a timestamp or an increasing INT [1]
        • {scenario} copying from a database
          • new or updated rows will be captured and moved to the destination [1]
        • {scenario} copying from a storage store
          • new or updated files identified by their LastModifiedTime are captured and moved to the destination [1]
        • {scenario} copy data to storage store
          • new rows from the tables or files are copied to new files in the destination [1]
            • files with the same name are overwritten [1]
        • {scenario} copy data to database
          • new rows from the tables or files are appended to destination tables [1]
            • the update method can be set to merge or overwrite [1]
    • {default} appends data to the destination [1]
      • the update method can be adjusted to 
        • {operation} merge
          • a key column must be provided
            • {default} the primary key is used, if available [1]
        • {operation} overwrite
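A rough sketch of the watermark idea for a storage-store source, using file LastModifiedTime as the watermark: only files modified after the last recorded watermark are copied on an incremental run (this illustrates the concept only, not how Copy job is implemented internally; paths and names are made up):

```shell
#!/bin/sh
set -e
base=$(mktemp -d)
src="$base/src"; dst="$base/dst"; watermark="$base/.watermark"
mkdir -p "$src" "$dst"

echo "a" > "$src/a.csv"
cp "$src"/*.csv "$dst"/          # initial run: full copy of all files
touch "$watermark"               # record the high-water mark after the run

sleep 1
echo "b" > "$src/b.csv"          # a new file arrives after the watermark

# Incremental run: copy only files with LastModifiedTime after the watermark
find "$src" -type f -newer "$watermark" -exec cp {} "$dst"/ \;
touch "$watermark"               # advance the watermark for the next run
ls "$dst"                        # both files present; only b.csv was re-copied
```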
    • availability 
      • the same regional availability as the pipeline [1]
    • billing meter
      • Data Movement, with an identical consumption rate [1]
    • {feature} robust Public API
      • {benefit} allows to automate and manage Copy Job efficiently [2]
    • {feature} Git Integration
      • {benefit} allows to leverage Git repositories in Azure DevOps or GitHub [2]
      • {benefit} allows to seamlessly deploy Copy Job with Fabric’s built-in CI/CD workflows [2]
    • {feature|preview} VNET gateway support
      • enables secure connections to data sources within virtual network or behind firewalls
        • Copy Job can be executed directly on the VNet data gateway, ensuring seamless and secure data movement [2]
    • {feature} upsert to Azure SQL Database
    • {feature} overwrite to Fabric Lakehouse
    • {feature} [Jul-2025] native CDC
      • enables efficient and automated replication of changed data including inserted, updated and deleted records from a source to a destination [5]
        •  ensures destination data stays up to date without manual effort
          • improves efficiency in data integration while reducing the load on source systems [5]
        • see Data Movement - Incremental Copy meter
          •  consumption rate of 3 CU
        • {benefit} zero manual intervention
          • automatically captures incremental changes directly from the source [5]  
        • {benefit} automatic replication
          • keeps destination data continuously synchronized with source changes [5]  
        • {benefit} optimized performance
          • processes only changed data
            • reduces processing time and minimizing load on the source [5]
        • smarter incremental copy 
          • automatically detects CDC-enabled source tables and allows to select either CDC-based or watermark-based incremental copy for each table [5]
      • applies to 
        • CDC-enabled tables
          • CDC automatically captures and replicates actions on data
        • non-CDC-enabled tables
          • Copy Job detects changes by comparing an incremental column against the last run [5]
            • then merges or appends the changed data to the destination based on configuration [5]
      • supported connectors
        • ⇐ applies to sources and destinations
        • Azure SQL DB [5]
        • On-premises SQL Server [5]
        • Azure SQL Managed Instance [5]
    • {enhancement} column mapping for simple data modification to storage as destination store [2]
    • {enhancement} data preview to help select the right incremental column  [2]
    • {enhancement} search functionality to quickly find tables or columns  [2]
    • {enhancement} real-time monitoring with an in-progress view of running Copy Jobs  [2]
    • {enhancement} customizable update methods & schedules before job creation [2]

    References:
    [1] Microsoft Learn (2025) Fabric: What is the Copy job in Data Factory for Microsoft Fabric? [link]
    [2] Microsoft Fabric Updates Blog (2025) Recap of Data Factory Announcements at Fabric Conference US 2025 [link]
    [3] Microsoft Fabric Updates Blog (2025) Fabric: Announcing Public Preview: Copy Job in Microsoft Fabric [link]
    [4] Microsoft Learn (2025) Fabric: Learn how to monitor a Copy job in Data Factory for Microsoft Fabric [link]
    [5] Microsoft Fabric Updates Blog (2025) Fabric: Simplifying Data Ingestion with Copy job – Introducing Change Data Capture (CDC) Support (Preview) [link]
    [6] Microsoft Learn (2025) Fabric: Change data capture (CDC) in Copy Job (Preview) [link]
    [7] Microsoft Fabric Updates Blog (2025) Simplifying Data Ingestion with Copy job – Incremental Copy GA, Lakehouse Upserts, and New Connectors [link]

    Resources:
    [R1] Microsoft Learn (2025) Fabric: Learn how to create a Copy job in Data Factory for Microsoft Fabric [link]
    [R2] Microsoft Learn (2025) Microsoft Fabric decision guide: copy activity, Copy job, dataflow, Eventstream, or Spark [link]

    Acronyms:
    API - Application Programming Interface
    CDC - Change Data Capture
    CI/CD - Continuous Integration and Continuous Deployment
    CU - Capacity Unit
    DevOps - Development & Operations
    DF - Data Factory
    IOPS - Input/Output Operations Per Second
    VNet - Virtual Network

    10 March 2024

    🏭🗒️Microsoft Fabric: Dataflows Gen2 [Notes]

    Disclaimer: This is work in progress intended to consolidate information from various sources for learning purposes. For the latest information please consult the documentation (see the links below)! 

    Last updated: 24-Nov-2025

    Dataflow (Gen2) Architecture [4]

    [Microsoft Fabric] Dataflow (Gen2) 

    • cloud-based, low-code interface that provides a modern data integration experience allowing users to ingest, prepare and transform data from a rich set of data sources incl. databases, data warehouses, lakehouses, real-time data repositories, etc. [11]
      • new generation of dataflows that resides alongside the Power BI Dataflow (Gen1) [2]
        • brings new features, improved experience [2] and enhanced performance [11]
        • similar to Dataflow Gen1 in Power BI [2] 
        • {recommendation} implement new functionality using Dataflow (Gen2) [11]
          • allows to leverage the many features and experiences not available in (Gen1) 
        • {recommendation} migrate from Dataflow (Gen1) to (Gen2) [11] 
          • allows to leverage the modern experience and capabilities 
      • allows to 
        • extract data from various sources [1]
        • transform it using a wide range of transformation operations [1]
        • load it into a destination [1]
      • {goal} provide an easy, reusable way to perform ETL tasks using Power Query Online [1]
        • allows to promote reusable ETL logic 
          • ⇒ prevents the need to create more connections to the data source [1]
          • offer a wide variety of transformations [1]
      • can be horizontally partitioned
    • {component} Lakehouse 
      • used to stage data being ingested
    • {component} Warehouse 
      • used as a compute engine and means to write back results to staging or supported output destinations faster
    • {component} mashup engine
      • extracts, transforms, or loads the data to staging or data destinations when either [4]
        • warehouse compute cannot be used [4]
        • {limitation} staging is disabled for a query [4]
    • {operation} create a dataflow
      • can be created in a
        • Data Factory workload
        • Power BI workspace
        • Lakehouse
      • when a dataflow (Gen2) is created in a workspace, lakehouse and warehouse items are provisioned along with their related SQL analytics endpoint and semantic models [12]
        •  shared by all dataflows in the workspace and are required for Dataflow Gen2 to operate [12]
          • {warning} shouldn't be deleted, and aren't intended to be used directly by users [12]
          •  aren't visible in the workspace, but might be accessible in other experiences such as the Notebook, SQL-endpoint, Lakehouse, and Warehouse experience [12]
          • the items can be recognized by their prefix: 'DataflowsStaging' [12]
    • {operation} set a default destination for the dataflow 
      • helps to get started quickly by loading all queries to the same destination [14]
      • via ribbon or the status bar in the editor
      • users are prompted to choose a destination and select which queries to bind to it [14]
      • to update the default destination, delete the current default destination and set a new one [14]
      • {default} any new query has as destination the lakehouse, warehouse, or KQL database from which it got started [14] 
    • {operation} publish a dataflow
      • generates dataflow's definition  
        • ⇐ the program that runs once the dataflow is refreshed to produce tables in staging storage and/or output destination [4]
        • used by the dataflow engine to generate an orchestration plan, manage resources, and orchestrate execution of queries across data sources, gateways, and compute engines, and to create tables in either the staging storage or data destination [4]
      • saves changes and runs validations that must be performed in the background [2]
    • {operation} export/import dataflows [11]
      •  allows also to migrate from dataflow (Gen1) to (Gen2) [11]
    • {operation} refresh a dataflow
      • applies the transformation steps defined during authoring 
      • can be triggered on-demand or by setting up a refresh schedule
      • {action} cancel refresh
        • enables to cancel ongoing Dataflow Gen2 refreshes from the workspace items view [6]
        • once canceled, the dataflow's refresh history status is updated to reflect cancellation status [15] 
        • {scenario} stop a refresh during peak time, if a capacity is nearing its limits, or if refresh is taking longer than expected [15]
        • it may have different outcomes
          • data from the last successful refresh is available [15]
          • data written up to the point of cancellation is available [15]
        • {warning} if a refresh is canceled before evaluation of a query that loads data to a destination began, there's no change to data in that query's destination [15]
      • {limitation} each dataflow is allowed up to 300 refreshes per 24-hour rolling window [15]
        •  {warning} attempting 300 refreshes within a short burst (e.g., 60 seconds) may trigger throttling and result in rejected requests [15]
          •  protections in place to ensure system reliability [15]
        • if scheduled dataflow refreshes fail consecutively, the refresh schedule is paused and an email is sent to the owner [15]
      • {limitation} a single evaluation of a query has a limit of 8 hours [15]
      • {limitation} total refresh time of a single refresh of a dataflow is limited to a max of 24 hours [15]
      • {limitation} per dataflow, one can have a maximum of 50 staged queries, queries with an output destination, or a combination of both [15]
    • {operation} copy and paste code in Power Query [11]
      •   allows to migrate dataflow (Gen1) to (Gen2) [11]
    • {operation} save a dataflow [11]
      • via 'Save As'  feature
      • can be used to save a dataflow (Gen1) as (Gen2) dataflow [11] 
    • {operation} save a dataflow as draft 
      •  allows to make changes to dataflows without immediately publishing them to a workspace [13]
        • can be later reviewed, and then published, if needed [13]
      • {operation} publish draft dataflow 
        • performed as a background job [13]
        • publishing related errors are visible next to the dataflow's name [13]
          • selecting the indication reveals the publishing errors and allows to edit the dataflow from the last saved version [13]
    • {operation} run a dataflow 
      • can be performed
        • manually
        • on a refresh schedule
        • as part of a Data Pipeline orchestration
    •  {operation} monitor pipeline runs 
      • allows to check pipelines' status, spot issues early, and troubleshoot them
      • [Workspace Monitoring] provides log-level visibility for all items in a workspace [link]
        • via Workspace Settings >> select Monitoring 
      • [Monitoring Hub] serves as a centralized portal for browsing pipeline runs across items within the Data Factory or Data Engineering experience [link]
    • {feature} connect multiple activities in a pipeline [11]
      •  allows to build end-to-end, automated data workflows
    • {feature} author dataflows with Power Query
      • uses the full Power Query experience of Power BI dataflows [2]
    • {feature} shorter authoring flow
      • uses a step-by-step flow for getting the data into the dataflow [2]
        • the number of steps required to create dataflows were reduced [2]
      • a few new features were added to improve the experience [2]
    • {feature} AutoSave and background publishing
      • changes made to a dataflow are autosaved to the cloud (aka draft version of the dataflow) [2]
        • ⇐ without having to wait for the validation to finish [2]
      • {functionality} save as draft 
        • stores a draft version of the dataflow every time you make a change [2]
        • seamless experience and doesn't require any input [2]
      • {concept} published version
        • the version of the dataflow that passed validation and is ready to refresh [5]
    • {feature} integration with data pipelines
      • integrates directly with Data Factory pipelines for scheduling and orchestration [2] 
    • {feature} high-scale compute
      • leverages a new, higher-scale compute architecture [2] 
        •  improves the performance of both transformations of referenced queries and get data scenarios [2]
        • creates both Lakehouse and Warehouse items in the workspace, and uses them to store and access data to improve performance for all dataflows [2]
    • {feature} improved monitoring and refresh history
      • integrate support for Monitoring Hub [2]
      • Refresh History experience upgraded [2]
    • {feature} get data via Dataflows connector
      • supports a wide variety of data source connectors
        • include cloud and on-premises relational databases
    • {feature} incremental refresh
      • enables incrementally extracting data from data sources, applying Power Query transformations, and loading the result into various output destinations [5]
    • {feature} data destinations
      • allows one to
        • specify an output destination
        • separate ETL logic and destination storage [2]
      • every tabular data query can have a data destination [3]
        • available destinations
          • Azure SQL databases
          • Azure Data Explorer (Kusto)
          • Fabric Lakehouse
          • Fabric Warehouse
          • Fabric KQL database
        • a destination can be specified for every query individually [3]
        • multiple different destinations can be used within a dataflow [3]
        • connecting to the data destination is similar to connecting to a data source
        • {limitation} functions and lists aren't supported
      • {operation} creating a new table
        • {default} the table name is the same as the query name
      • {operation} picking an existing table
      • {operation} deleting a table manually from the data destination 
        • doesn't recreate the table on the next refresh [3]
      • {operation} reusing queries from Dataflow Gen1
        • {method} export Dataflow Gen1 query and import it into Dataflow Gen2
          • export the queries as a PQT file and import them into Dataflow Gen2 [2]
        • {method} copy and paste in Power Query
          • copy the queries and paste them in the Dataflow Gen2 editor [2]
      • {feature} automatic settings:
        • {limitation} supported only for Lakehouse and Azure SQL database
        • {setting} Update method replace: 
          • data in the destination is replaced at every dataflow refresh with the output data of the dataflow [3]
        • {setting} Managed mapping: 
          • the mapping is automatically adjusted when republishing the dataflow to reflect the change
            • ⇒ doesn't need to be updated manually into the data destination experience every time changes occur [3]
        • {setting} Drop and recreate table: 
          • on every dataflow refresh the table is dropped and recreated to allow schema changes
          • {limitation} the dataflow refresh fails if any relationships or measures were added to the table [3]
      • {feature} update methods
        • {method} replace
          • on every dataflow refresh, the data is dropped from the destination and replaced by the output data of the dataflow.
          • {limitation} not supported by Fabric KQL databases and Azure Data Explorer 
        • {method} append
          • on every dataflow refresh, the output data from the dataflow is appended to the existing data in the data destination table
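The two update methods can be illustrated with a minimal simulation in which plain Python lists stand in for the destination table; this is conceptual only, not Fabric code:

```python
# Conceptual illustration of the two update methods for data destinations [3];
# a list of rows stands in for the destination table.

def refresh_replace(destination_rows, dataflow_output):
    """Replace: destination data is dropped and replaced by the dataflow output."""
    return list(dataflow_output)

def refresh_append(destination_rows, dataflow_output):
    """Append: the dataflow output is added after the existing destination data."""
    return list(destination_rows) + list(dataflow_output)

table = [{"id": 1}, {"id": 2}]
new_data = [{"id": 3}]

print(refresh_replace(table, new_data))  # [{'id': 3}]
print(refresh_append(table, new_data))   # [{'id': 1}, {'id': 2}, {'id': 3}]
```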
      • {feature} data staging 
        • {default} enabled
          • allows using Fabric compute to execute queries
            • ⇐ enhances the performance of query processing
          • the data is loaded into the staging location
            • ⇐ an internal Lakehouse location accessible only by the dataflow itself
          • [Warehouse] staging is required before the write operation to the data destination
            • ⇐ improves performance
            • {limitation} only loading into the same workspace as the dataflow is supported
          •  using staging locations can enhance performance in some cases
        • disabled
          • {recommendation} [Lakehouse] disable staging on the query to avoid loading twice into a similar destination
            • ⇐ once for staging and once for data destination
            • improves dataflow's performance
      • {scenario} use a dataflow to load data into the lakehouse and then use a notebook to analyze the data [2]
      • {scenario} use a dataflow to load data into an Azure SQL database and then use a data pipeline to load the data into a data warehouse [2]
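As a sketch of the per-query destination rules above (one destination per tabular query, functions and lists excluded), the following illustrative helper validates a query-to-destination mapping; the function and query names are made up:

```python
# Illustrative check of the data destination rules [3]: every tabular query may
# have its own destination, chosen from the supported list; functions and lists
# aren't supported. The helper itself is a sketch, not part of Fabric.

SUPPORTED_DESTINATIONS = {
    "Azure SQL database", "Azure Data Explorer (Kusto)",
    "Fabric Lakehouse", "Fabric Warehouse", "Fabric KQL database",
}

def validate_destinations(query_destinations, non_tabular_queries=()):
    """Return a list of rule violations for a query -> destination mapping."""
    errors = []
    for query, dest in query_destinations.items():
        if query in non_tabular_queries:
            errors.append(f"{query}: functions/lists can't have a destination")
        elif dest not in SUPPORTED_DESTINATIONS:
            errors.append(f"{query}: unsupported destination '{dest}'")
    return errors
```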
    • {feature} Fast Copy
      • allows ingesting terabytes of data with the easy experience and the scalable back-end of the pipeline Copy Activity [7]
        • enables large-scale data ingestion directly utilizing the pipelines Copy Activity capability [6]
        • supports sources such as Azure SQL databases, and CSV and Parquet files in Azure Data Lake Storage and Blob Storage [6]
        • significantly scales up the data processing capacity providing high-scale ELT capabilities
      • the feature must be enabled [7]
        • after enabling, Dataflows automatically switch the back-end when data size exceeds a particular threshold [7]
        • ⇐ there's no need to change anything during authoring of the dataflows
        • one can check the refresh history to see if fast copy was used [7]
        • ⇐ see the Engine type
        • {option} Require fast copy
      • {prerequisite} Fabric capacity is available [7]
        •  requires a Fabric capacity or a Fabric trial capacity [11]
      • {prerequisite} data files 
        • are in .csv or parquet format
        • are at least 100 MB in size
        • are stored in an ADLS Gen2 or a Blob storage account [6]
      • {prerequisite} [Azure SQL DB|PostgreSQL] >= 5 million rows in the data source [7]
      • {limitation} doesn't support [7] 
        • the VNet gateway
        • writing data into an existing table in Lakehouse
        • fixed schema
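The prerequisites and limitations above can be summarized as a small eligibility check; the thresholds come from the notes, while the function itself is an illustrative sketch, not a Fabric API:

```python
# Hedged sketch: decide whether a planned ingestion meets the documented
# Fast Copy prerequisites [6][7]. Thresholds and rules reflect the notes above;
# the function is illustrative only.

FILE_FORMATS = {".csv", ".parquet"}
MIN_FILE_MB = 100
MIN_DB_ROWS = 5_000_000  # Azure SQL DB / PostgreSQL sources

def fast_copy_eligible(source_kind, file_ext=None, file_mb=0, row_count=0,
                       uses_vnet_gateway=False,
                       writes_to_existing_lakehouse_table=False):
    """Return True if the source plausibly qualifies for Fast Copy."""
    # documented limitations: VNet gateway and writing into an existing
    # Lakehouse table aren't supported
    if uses_vnet_gateway or writes_to_existing_lakehouse_table:
        return False
    if source_kind == "file":
        return file_ext in FILE_FORMATS and file_mb >= MIN_FILE_MB
    if source_kind == "database":
        return row_count >= MIN_DB_ROWS
    return False
```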
    • {feature} parameters
      • allow dynamically controlling and customizing dataflows
        • makes them more flexible and reusable by enabling different inputs and scenarios without modifying the dataflow itself [9]
        • the dataflow is refreshed by passing parameter values outside of the Power Query editor through either
          • Fabric REST API [9]
          • native Fabric experiences [9]
        • parameter names are case sensitive [9]
        • {type} required parameters
          • {warning} the refresh fails if no value is passed for it [9]
        • {type} optional parameters
        • enabled via Parameters >> Enable parameters to be discovered and override for execution [9]
      • {limitation} dataflows with parameters can't be
        • scheduled for refresh through the Fabric scheduler [9]
        • manually triggered through the Fabric Workspace list or lineage view [9]
      • {limitation} parameters that affect the resource path of a data source or a destination are not supported [9]
        • ⇐ connections are linked to the exact data source path defined in the authored dataflow
          • can't currently be overridden to use other connections or resource paths [9]
      • {limitation} can't be leveraged by dataflows with incremental refresh [9]
      • {limitation} only parameters of type decimal number, whole number, text, and true/false can be passed for override
        • any other data types don't produce a refresh request in the refresh history but show in the monitoring hub [9]
      • {warning} parameters allow other users who have permissions on the dataflow to refresh the data with other values [9]
      • {limitation} refresh history does not display information about the parameters passed during the invocation of the dataflow [9]
      • {limitation} monitoring hub doesn't display information about the parameters passed during the invocation of the dataflow [9]
      • {limitation} staged queries only keep the last data refresh of a dataflow stored in the Staging Lakehouse [9]
      • {limitation} only the first request will be accepted from duplicated requests for the same parameter values [9]
        • subsequent requests are rejected until the first request finishes its evaluation [9]
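Passing parameter values outside the Power Query editor typically means calling the Fabric REST API. The sketch below only builds the request URL and body; the endpoint path and payload shape are assumptions modeled on the on-demand job pattern described in [9]/[R15] and should be verified against the current API reference, and the parameter names are invented:

```python
# Hedged sketch: building a refresh request for a Dataflow Gen2 with parameter
# overrides. Endpoint path and body shape are assumptions based on [9]/[R15];
# verify against the current Fabric REST API reference. "OrderThreshold" and
# "Region" are made-up parameter names.

def build_refresh_request(workspace_id, dataflow_id, parameters):
    """parameters: iterable of (name, type, value); names are case sensitive."""
    url = (f"https://api.fabric.microsoft.com/v1/workspaces/{workspace_id}"
           f"/items/{dataflow_id}/jobs/instances?jobType=Refresh")
    body = {"executionData": {"parameters": [
        {"parameterName": name, "type": ptype, "value": value}
        for name, ptype, value in parameters
    ]}}
    return url, body

url, body = build_refresh_request(
    "<workspaceId>", "<dataflowId>",
    [("OrderThreshold", "Int64", 50), ("Region", "Text", "EMEA")],
)
```

An access token with the appropriate scope would still be needed to actually send the request (e.g. via `requests.post(url, json=body, headers=...)`).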
    • {feature} support for CI/CD and Git integration
      • allows creating, editing, and managing dataflows in a Git repository that's connected to a Fabric workspace [10]
      • allows using deployment pipelines to automate the deployment of dataflows between workspaces [10]
      • allows using Public APIs to create and manage Dataflow Gen2 with CI/CD and Git integration [10]
      • allows creating a Dataflow Gen2 directly in a workspace folder [10]
      • allows using the Fabric settings and scheduler to refresh and edit settings for Dataflow Gen2 [10]
      • {action} save a dataflow
        • replaces the publish operation
        • saving the dataflow automatically publishes the changes [10]
      • {action} delete a dataflow
        • the staging artifacts become visible in the workspace and are safe to be deleted [10]
      • {action} refresh the dataflow
        • can be triggered manually or via a refresh schedule [10]
        • {limitation} the Workspace view doesn't show if a refresh is ongoing for the dataflow [10]
        • refresh information is available in the refresh history [10]
      • {action} branching out to another workspace
        • {limitation} the refresh can fail with the message that the staging lakehouse couldn't be found [10]
        • {workaround} create a new Dataflow Gen2 with CI/CD and Git support in the workspace to trigger the creation of the staging lakehouse [10]
        •  all other dataflows in the workspace should start to function again.
      • {action} syncing changes from Git into the workspace
        • requires to open the new or updated dataflow and save changes manually with the editor [10]
          • triggers a publish action in the background to allow the changes to be used during refresh of the dataflow [10]
      • [Power Automate] {limitation} the connector for dataflows isn't working [10]
    •  {feature} Copilot for Dataflow Gen2
      • provides AI-powered assistance for creating data integration solutions using natural language prompts [11]
      • {benefit} helps streamline the dataflow development process by allowing users to use conversational language to perform data transformations and operations [11]
    • {benefit} enhances flexibility by allowing dynamic adjustments without altering the dataflow itself [9]
    • {benefit} extends data with consistent data, such as a standard date dimension table [1]
    • {benefit} allows self-service users to access a subset of the data warehouse separately [1]
    • {benefit} optimizes performance with dataflows, which enable extracting data once for reuse, reducing data refresh time for slower sources [1]
    • {benefit} simplifies data source complexity by only exposing dataflows to larger analyst groups [1]
    • {benefit} ensures consistency and quality of data by enabling users to clean and transform data before loading it to a destination [1]
    • {benefit} simplifies data integration by providing a low-code interface that ingests data from various sources [1]
    • {limitation} not a replacement for a data warehouse [1]
    • {limitation} row-level security isn't supported [1]
    • {limitation} Fabric or Fabric trial capacity workspace is required [1]


    Feature                                   Dataflow Gen2   Dataflow Gen1
    Author dataflows with Power Query               ✓               ✓
    Shorter authoring flow                          ✓
    Auto-Save and background publishing             ✓
    Data destinations                               ✓
    Improved monitoring and refresh history         ✓
    Integration with data pipelines                 ✓
    High-scale compute                              ✓
    Get Data via Dataflows connector                ✓               ✓
    Direct Query via Dataflows connector                            ✓
    Incremental refresh                             ✓*
    Fast Copy                                       ✓*
    Cancel refresh                                  ✓*
    AI Insights support                                             ✓
    Dataflow Gen1 vs Gen2 [2]

    References:
    [1] Microsoft Learn (2023) Fabric: Ingest data with Microsoft Fabric [link]
    [2] Microsoft Learn (2023) Fabric: Getting from Dataflow Generation 1 to Dataflow Generation 2 [link]
    [3] Microsoft Learn (2023) Fabric: Dataflow Gen2 data destinations and managed settings [link]
    [4] Microsoft Learn (2023) Fabric: Dataflow Gen2 pricing for Data Factory in Microsoft Fabric [link]
    [5] Microsoft Learn (2023) Fabric: Save a draft of your dataflow [link]
    [6] Microsoft Learn (2023) Fabric: What's new and planned for Data Factory in Microsoft Fabric [link]
    [7] Microsoft Learn (2023) Fabric: Fast copy in Dataflows Gen2 [link]
    [8] Microsoft Learn (2025) Fabric: Incremental refresh in Dataflow Gen2 [link]
    [9] Microsoft Learn (2025) Fabric: Use public parameters in Dataflow Gen2 (Preview) [link]
    [10] Microsoft Learn (2025) Fabric: Dataflow Gen2 with CI/CD and Git integration support [link]
    [11] Microsoft Learn (2025) Fabric: What is Dataflow Gen2? [link]
    [12] Microsoft Learn (2025) Fabric: Use a dataflow in a pipeline [link]
    [13] Microsoft Learn (2025) Fabric: Save a draft of your dataflow [link]
    [14] Microsoft Learn (2025) Fabric: Dataflow destinations and managed settings [link]
    [15] Microsoft Learn (2025) Fabric: Dataflow refresh [link]

    Resources:
    [R1] Arshad Ali & Bradley Schacht (2024) Learn Microsoft Fabric [link]
    [R2] Microsoft Learn: Fabric (2023) Data Factory limitations overview [link]
    [R3] Microsoft Fabric Blog (2023) Data Factory Spotlight: Dataflow Gen2, by Miguel Escobar [link]
    [R4] Microsoft Learn (2023) Fabric: Dataflow Gen2 connectors in Microsoft Fabric [link]
    [R5] Microsoft Learn (2023) Fabric: Pattern to incrementally amass data with Dataflow Gen2 [link]
    [R6] Fourmoo (2024) Microsoft Fabric – Comparing Dataflow Gen2 vs Notebook on Costs and usability, by Gilbert Quevauvilliers [link]
    [R7] Microsoft Learn: Fabric (2023) A guide to Fabric Dataflows for Azure Data Factory Mapping Data Flow users [link]
    [R8] Microsoft Learn: Fabric (2023) Quickstart: Create your first dataflow to get and transform data [link]
    [R9] Microsoft Learn: Fabric (2023) Microsoft Fabric decision guide: copy activity, dataflow, or Spark [link]
    [R10] Microsoft Fabric Blog (2023) Dataflows Gen2 data destinations and managed settings, by Miquella de Boer  [link]
    [R11] Microsoft Fabric Blog (2023) Service principal support to connect to data in Dataflow, Datamart, Dataset and Dataflow Gen 2, by Miquella de Boer [link]
    [R12] Chris Webb's BI Blog (2023) Fabric Dataflows Gen2: To Stage Or Not To Stage? [link]
    [R13] Power BI Tips (2023) Let's Learn Fabric ep.7: Fabric Dataflows Gen2 [link]
    [R14] Microsoft Learn (2025) Fabric: What's new in Microsoft Fabric? [link]
    [R15] Microsoft Fabric Blog (2023) Passing parameter values to refresh a Dataflow Gen2 (Preview) [link]

    Acronyms:
    ADLS - Azure Data Lake Storage
    CI/CD - Continuous Integration/Continuous Deployment 
    ETL - Extract, Transform, Load
    KQL - Kusto Query Language
    PQO - Power Query Online
    PQT - Power Query Template