Showing posts with label dataflows. Show all posts
Showing posts with label dataflows. Show all posts

26 April 2025

🏭🗒️Microsoft Fabric: Parameters in Dataflows Gen2 [Notes] 🆕

Disclaimer: This is work in progress intended to consolidate information from various sources for learning purposes. For the latest information please consult the documentation (see the links below)! 

Last updated: 26-Apr-2

[Microsoft Fabric] Dataflow Gen2 Parameters

  • {def} parameters that allow to dynamically control and customize Dataflows Gen2
    • makes them more flexible and reusable by enabling different inputs and scenarios without modifying the dataflow itself [1]
    • the dataflow is refreshed by passing parameter values outside of the Power Query editor through either
      • Fabric REST API [1]
      • native Fabric experiences [1]
    • parameter names are case sensitive [1]
    • {type} required parameters
      • {warning} the refresh fails if no value is passed for it [1]
    • {type} optional parameters
    • enabled via Parameters >> Enable parameters to be discovered and override for execution [1]
  • {limitation} dataflows with parameters can't be
    • scheduled for refresh through the Fabric scheduler [1]
    • manually triggered through the Fabric Workspace list or lineage view [1]
  • {limitation} parameters that affect the resource path of a data source or a destination are not supported [1]
    • ⇐ connections are linked to the exact data source path defined in the authored dataflow
      • can't be currently override to use other connections or resource paths [1]
  • {limitation} can't be leveraged by dataflows with incremental refresh [1]
  • {limitation} supports only parameters of the type decimal number, whole number, text and true/false can be passed for override
    • any other data types don't produce a refresh request in the refresh history but show in the monitoring hub [1]
  • {warning} allow other users who have permissions to the dataflow to refresh the data with other values [1]
  • {limitation} refresh history does not display information about the parameters passed during the invocation of the dataflow [1]
  • {limitation} monitoring hub doesn't display information about the parameters passed during the invocation of the dataflow [1]
  • {limitation} staged queries only keep the last data refresh of a dataflow stored in the Staging Lakehouse [1]
  • {limitation} only the first request will be accepted from duplicated requests for the same parameter values [1]
    • subsequent requests are rejected until the first request finishes its evaluation [1]

References:
[1] Microsoft Learn (2025) Use public parameters in Dataflow Gen2 (Preview) [link

Resources:
[R1] Microsoft Fabric Blog (2025) Passing parameter values to refresh a Dataflow Gen2 (Preview) [link

Acronyms:
API - Application Programming Interface
REST - Representational State Transfer

25 April 2025

🏭🗒️Microsoft Fabric: Dataflows Gen2's Incremental Refresh [Notes] 🆕

Disclaimer: This is work in progress intended to consolidate information from various sources for learning purposes. For the latest information please consult the documentation (see the links below)! 

Last updated: 25-Apr-2025

[Microsoft Fabric] Incremental Refresh in Dataflows Gen2

  • {feature} enables to incrementally extract data from data sources, apply Power Query transformations, and load into various output destinations [5]
    • designed to reduce the amount of data that needs to be processed and retrieved from the source system [8]
    • configurable directly in the dataflow editor [8]
    • doesn't need to specify the historical data range [8]
      • ⇐ the dataflow doesn't remove any data from the destination that's outside the bucket range [8]
    • doesn't need to specify the parameters for the incremental refresh [8]
      • the filters and parameters are automatically added as the last step in the query [8]
  • {prerequisite} the data source 
    • supports folding [8]
    • needs to contain a Date/DateTime column that can be used to filter the data [8]
  • {prerequisite} the data destination supports incremental refresh [8]
    • available destinations
      • Fabric Warehouse
      • Azure SQL Database
      • Azure Synapse Analytics
      • Fabric Lakehouse [preview]
    • other destinations can be used in combination with incremental refresh by using a second query that references the staged data to update the data destination [8]
      • allows to use incremental refresh to reduce the amount of data that needs to be processed and retrieved from the source system [8]
        • a full refresh from the staged data to the data destination is still needed [8]
  • works by dividing the data into buckets based on a DateTime column [8]
    • each bucket contains the data that changed since the last refresh [8]
      • the dataflow knows what changed by checking the maximum value in the specified column 
        • if the maximum value changed for that bucket, the dataflow retrieves the whole bucket and replaces the data in the destination [8]
        • if the maximum value didn't change, the dataflow doesn't retrieve any data [8]
  • {limitation} 
    • the data destination must be set to a fixed schema [8]
    • ⇒table's schema in the data destination must be fixed and can't change [8]
      • ⇒ dynamic schema must be changed to fixed schema before configuring incremental refresh [8]
  • {limitation} the only supported update method in the data destination: replace
    • ⇒the dataflow replaces the data for each bucket in the data destination with the new data [8]
      • data that is outside the bucket range isn't affected [8]
  • {limitation} maximum number of buckets
    • single query: 50
      • {workaround} increase the bucket size or reduce the bucket range to lower the number of buckets [8]
    • whole dataflow: 150
      • {workaround} reduce the number of incremental refresh queries or increase the bucket size [8]
  • {downside} the dataflow may take longer to refresh after enabling incremental refresh [8]
    • because the additional overhead of checking if data changed and processing the buckets is higher than the time saved by processing less data [8]
    • {recommendation} review the settings for incremental refresh and adjust them to better fit the scenario
      • {option} increase the bucket size to reduce the number of buckets and the overhead of processing them [8]
      • {option} reduce the number of buckets by increasing the bucket size [8]
      • {option} disable incremental refresh [8]
  • {recommendation} don't use the column for detecting changes also for filtering [8]
    • because this can lead to unexpected results [8]
  • {setting} limit number of concurrent evaluation
    • setting the value to a lower number, reduces the number of requests sent to the source system [8]
    • via global settings >> Scale tab >> maximum number of parallel query evaluations
    • {recommendation} don't enable this limit unless there're issues with the source system [8]

References:
[5] Microsoft Learn (2023) Fabric: Save a draft of your dataflow [link]
[8] Microsoft Learn (2025) Fabric: Incremental refresh in Dataflow Gen2 [link

Resources:


09 February 2025

🏭🗒️Microsoft Fabric: Data Pipelines [Notes]

Disclaimer: This is work in progress intended to consolidate information from various sources for learning purposes. For the latest information please consult the documentation (see the links below)! 

Last updated: 9-Feb-2024

[Microsoft Fabric] Data pipeline

  • {def} a logical sequence of activities that orchestrate a process and perform together a task [1]
    • usually by extracting data from one or more sources and loading it into a destination; 
      • ⇐ often transforming it along the way [1]
      • ⇐ allows to manage the activities as a set instead of each one individually [2]
      • ⇐ used to automate ETL processes that ingest transactional data from operational data stores into an analytical data store [1]
      • e.g. lakehouse or data warehouse 
  • {concept} activity
    • {def} an executable task in a pipeline
      • a flow of activities can be defined by connecting them in a sequence [1]
      • its outcome (success, failure, or completion) can be used to direct the flow to the next activity in the sequence [1]
    • {type} data movement activities
      • copies data from a source data store to a sink data store [2]
    • {type} data transformation activities
      • encapsulate data transfer operations
        • incl. simple Copy Data activities that extract data from a source and load it to a destination
        • incl. complex Data Flow activities that encapsulate dataflows (Gen2) that apply transformations to the data as it is transferred
        • incl. notebook activities to run a Spark notebook
        • incl. stored procedure activities to run SQL code
        • incl. delete data activities to delete existing data
    • {type} control flow activities
        • used to 
          • implement loops
          • implement conditional branching
          • manage variables
          • manage parameter values
          • enable to implement complex pipeline logic to orchestrate data ingestion and transformation flow  [1]
        • can be parameterized
          • ⇐enabling to provide specific values to be used each time a pipeline is run  [1]
        • when executed, a run is initiated (aka data pipeline run
          • runs can be initiated on-demand or scheduled to start at a specific frequency
          • use the unique run ID to review run details to confirm they completed successfully and investigate the specific settings used for each execution [1]
      • {benefit} increases pipelines’ reusability
    • {concept} pipeline template
      • predefined pipeline that can be used and customize as required
    • {concept} data pipeline run
      • occurs when a data pipeline is executed
      • the activities in the data pipeline are executed to completion [3] 
      • can be triggered one of two ways
        • on-demand
        • on a schedule
          • the scheduled pipeline will be able to run based on the time and frequency set [3]

    Previous Post  <<||>> Next Post

    References:
    [1] Microsoft Learn (2023) Use Data Factory pipelines in Microsoft Fabric [link]
    [2] Microsoft Learn (2024) Microsoft Fabric: Activity overview [link]
    [3] Microsoft Learn (2024) Microsoft Fabric Concept: Data pipeline Runs [link]

    Resources
    [R1] Metadata Driven Pipelines for Microsoft Fabric (link)
    [R2] Microsoft Learn (2025) Fabric: What's new in Microsoft Fabric? [link]

    10 March 2024

    🏭🗒️Microsoft Fabric: Dataflows Gen 1 [Notes]

    Disclaimer: This is work in progress intended to consolidate information from various sources. 

    Last updated: 10-Mar-2024

    Dataflows Architecture in Power BI
    Dataflows Architecture [3]

    [Microsoft Fabric] Dataflow (Gen1)

    • a type of cloud-based ETL tool for building and executing scalable data transformation processes [1]
    • a collection of tables created and managed in workspaces in the Power BI service [4]
    • acts as building blocks on top of one another [4]
    • includes all of the transformations to reduce data prep time and then can be loaded into a new table, included in a Data Pipeline, or used as a data source by data analysts [1]
    • {benefit} promote reusability of underlying data elements
      • prevent the need to create separate connections with your cloud or on-premises data sources.
    • supports a wide range of cloud and on-premises sources [3]
    • {operation} refreshing a dataflow 
      • is required before it can be consumed in a semantic model in Power BI Desktop, or referenced as a linked or computed table [4]
      • can be refreshed at the same frequency as a semantic model [4]
      • {concept} incremental refresh
      • [Premium capacity] can be set to refresh incrementally
        • adds parameters to the dataflow to specify the date range
        • {contraindication} linked tables shouldn't use incremental refresh if they reference a dataflow
          • ⇐ dataflows don't support query folding (even if the table is DirectQuery enabled).
        • {contraindication} semantic models referencing dataflows shouldn't use incremental refresh
          • ⇐ refreshes to dataflows are generally performant, so incremental refreshes shouldn't be necessary
          • if refreshes take too long, consider using the compute engine, or DirectQuery mode
    • {operation} deleting a workflow
      • if a workspace that contains dataflows is deleted, all its dataflows are also deleted [4]
        • even if recovery of the workspace is possible, one cannot recover the deleted dataflows
          • ⇐ either directly or through support from Microsoft [4]
    • {operation} consuming a dataflow
      • create a linked table from the dataflow
      • allows another dataflow author to use the data [4]
      • create a semantic model from the dataflow
      • allows a user to utilize the data to create reports [4]
      • create a connection from external tools that can read from the CDM format [4]
    • {feature} [premium] Enhanced compute engine.
      • enables premium subscribers to use their capacity to optimize the use of dataflows
      • {advantage} reduces the refresh time required for long-running ETL steps over computed entities, such as performing joins, distinct, filters, and group by [7]
      • {advantage} performs DirectQuery queries over entities [7]
      • individually set for each dataflow
      • {configuration} disabled
      • {configuration|default} optimized
        • automatically turned on when a table in the dataflow is referenced by another table or when the dataflow is connected to another dataflow in the same workspace.
      • {configuration} On
      • {limitation} works only for A3 or larger Power BI capacities [7]
    • {feature} [premium] DirectQuery
      • allows to use DirectQuery to connect directly to dataflows without having to import its data [7]
    • {advantage} avoid separate refresh schedules 
      • removes the need to create an imported semantic model [7]
    • {advantage} filtering data 
      • allows to filter dataflow data and work with the filtered subset [7]
      • {limitation} composite/mixed models that have import and DirectQuery data sources are currently not supported [7]
      • {limitation} large dataflows might have trouble with timeout issues when viewing visualizations [7]
        • {workaround} use Import mode [7]
      • {limitation} under data source settings, the dataflow connector will show invalid credentials [7]
        • the warning doesn't affect the behavior, and the semantic model will work properly [7]
    • {feature} [premium] Computed entities
      • allows to perform calculations on your existing dataflows, and return results [7]
      • enable to focus on report creation and analytics [7]
      • {limitation} work properly only when the entities reside in the same storage account [7]
    • {feature} [premium] Linked Entities
      • allows to reference existing dataflows
      • one can perform calculations on these entities using computed entities [7]
      • allows to create a "single source of the truth" table that can be reused within multiple dataflows [7]
      • {limitation} work properly only when the entities reside in the same storage account [7]
    • {feature} [premium] Incremental refresh
      • adds parameters to the dataflow to specify the date range [7]
    • {concept} table
      • represents the data output of a query created in a dataflow, after the dataflow has been refreshed
      • represents data from a source and, optionally, the transformations that were applied to it
    • {concept} computed tables
      • similar to other tables 
        • one get data from a source and one can apply further transformations to create them
      • their data originates from the storage dataflow used, and not the original data source [6]
        • ⇐ they were previously created by a dataflow and then reused [6]
      • created by referencing a table in the same dataflow or in a different dataflow [6]
    • {concept} [Power Query] custom function
      • a mapping from a set of input values to a single output value [5]
    • {scenario} create reusable transformation logic that can be shared by many semantic models and reports inside Power BI [3]
    • {scenario} persist data in ADL Gen 2 storage, enabling you to expose it to other Azure services outside Power BI [3]
    • {scenario} create a single source of truth
      • encourages uptake by removing analysts' access to underlying data sources [3]
    • {scenario} strengthen security around underlying data sources by exposing data to report creators in dataflows
      • allows to limit access to underlying data sources, reducing the load on source systems [3]
      • gives administrators finer control over data refresh operations [3]
    • {scenario} perform ETL at scale, 
      • dataflows with Power BI Premium scales more efficiently and gives you more flexibility [3]
    • {best practice} chose the best connector for the task provides the best experience and performance [5] 
    • {best practice} filter data in the early stages of the query
      • some connectors can take advantage of filters through query folding [5]
    • {best practice} do expensive operations last
      • help minimize the amount of time spend waiting for the preview to render each time a new step is added to the query [5]
    • {best practice} temporarily work against a subset of your data
      • if adding new steps to the query is slow, consider using "Keep First Rows" operation and limiting the number of rows you're working against [5]
    • {best practice} use the correct data types
      • some features are contextual to the data type [5]
    • {best practice} explore the data
    • {best practice} document queries by renaming or adding a description to steps, queries, or groups [5]
    • {best practice} take a modular approach
      • split queries that contains a large number of steps into multiple queries
      • {goal} simplify and decouple transformation phases into smaller pieces to make them easier to understand [5]
    • {best practice} future-proof queries 
      • make queries resilient to changes and able to refresh even when some components of data source change [5]
    • {best practice] creating queries that are dynamic and flexible via parameters [5]
      • parameters serves as a way to easily store and manage a value that can be reused in many different ways [5]
    • {best practice} create reusable functions
      • can be created from existing queries and parameters [5]

    Acronyms:
    CDM - Common Data Model
    ETL - Extract, Transform, Load

    References:
    [1] Microsoft Learn: Fabric (2023) Ingest data with Microsoft Fabric (link)
    [2] Microsoft Learn: Fabric (2023) Dataflow Gen2 pricing for Data Factory in Microsoft Fabric (link)
    [3] Microsoft Learn: Fabric (2023) Introduction to dataflows and self-service data prep (link)
    [4] Microsoft Learn: Fabric (2023) Configure and consume a dataflow (link)
    [5] Microsoft Learn: Fabric (2023) Dataflows best practices* (link)
    [6] Microsoft Learn: Fabric (2023) Computed table scenarios and use cases (link)
    [7] Microsoft Lean: Power BI - Learn (2024) Premium features of dataflows (link

    Resources:

    🏭🗒️Microsoft Fabric: Dataflows Gen2 [Notes]

    Disclaimer: This is work in progress intended to consolidate information from various sources for learning purposes. For the latest information please consult the documentation (see the links below)! 

    Last updated: 30-May-2025

    Dataflow (Gen2) Architecture [4]

    [Microsoft Fabric] Dataflow (Gen2) 

    •  new generation of dataflows that resides alongside the Power BI Dataflow (Gen1) [2]
      • brings new features and improved experience [2]
      • similar to Dataflow Gen1 in Power BI [2]
      • allow to 
        • extract data from various sources [1]
        • transform it using a wide range of transformation operations [1]
        • load it into a destination [1]
      • {goal} provide an easy, reusable way to perform ETL tasks using Power Query Online [1]
        • allows to promote reusable ETL logic 
          • ⇒ prevents the need to create more connections to the data source [1]
          • offer a wide variety of transformations [1]
      • can be horizontally partitioned
    • {component} Lakehouse 
      • used to stage data being ingested
    • {component} Warehouse 
      • used as a compute engine and means to write back results to staging or supported output destinations faster
    • {component} mashup engine
      • extracts, transforms, or loads the data to staging or data destinations when either [4]
        • warehouse compute cannot be used [4]
        • {limitation} staging is disabled for a query [4]
    • {operation} create a dataflow
      • can be created in a
        • Data Factory workload
        • Power BI workspace
        • Lakehouse
    • {operation} publish a dataflow
      • generates dataflow's definition  
        • ⇐ the program that runs once the dataflow is refreshed to produce tables in staging storage and/or output destination [4]
        • used by the dataflow engine to generate an orchestration plan, manage resources, and orchestrate execution of queries across data sources, gateways, and compute engines, and to create tables in either the staging storage or data destination [4]
      • saves changes and runs validations that must be performed in the background [2]
    • {operation} refresh a dataflow
    • {operation} running a dataflow 
      • can be performed
        • manually
        • on a refresh schedule
        • as part of a Data Pipeline orchestration
    • {feature} author dataflows with Power Query
      • uses the full Power Query experience of Power BI dataflows [2]
    • {feature} shorter authoring flow
      • uses step-by-step for getting the data into your the dataflow [2]
        • the number of steps required to create dataflows were reduced [2]
      • a few new features were added to improve the experience [2]
    • {feature} Auto-Save and background publishing
      • changes made to a dataflow are autosaved to the cloud (aka draft version of the dataflow) [2]
        • ⇐ without having to wait for the validation to finish [2]
      • {functionality} save as draft 
        • stores a draft version of the dataflow every time you make a change [2]
        • seamless experience and doesn't require any input [2]
      • {concept} published version
        • the version of the dataflow that passed validation and is ready to refresh [5]
    • {feature} integration with data pipelines
      • integrates directly with Data Factory pipelines for scheduling and orchestration [2] 
    • {feature} high-scale compute
      • leverages a new, higher-scale compute architecture [2] 
        •  improves the performance of both transformations of referenced queries and get data scenarios [2]
        • creates both Lakehouse and Warehouse items in the workspace, and uses them to store and access data to improve performance for all dataflows [2]
    • {feature} improved monitoring and refresh history
      • integrate support for Monitoring Hub [2]
      • Refresh History experience upgraded [2]
    • {feature} get data via Dataflows connector
      • supports a wide variety of data source connectors
        • include cloud and on-premises relational databases
    • {feature} incremental refresh
      • enables to incrementally extract data from data sources, apply Power Query transformations, and load into various output destinations [5]
    • {feature|planned} Fast Copy 
      • enables large-scale data ingestion directly utilizing the pipelines Copy Activity capability [6]
      • supports sources such Azure SQL Databases, CSV, and Parquet files in Azure Data Lake Storage and Blob Storage [6]
      • significantly scales up the data processing capacity providing high-scale ELT capabilities [6]
    • {feature|planned}Cancel refresh
      • enables to cancel ongoing Dataflow Gen2 refreshes from the workspace items view [6]
    • {feature} data destinations
      • allows to 
        • specify an output destination
        • separate ETL logic and destination storage [2]
      • every tabular data query can have a data destination [3]
        • available destinations
          • Azure SQL databases
          • Azure Data Explorer (Kusto)
          • Fabric Lakehouse
          • Fabric Warehouse
          • Fabric KQL database
        • a destination can be specified for every query individually [3]
        • multiple different destinations can be used within a dataflow [3]
        • connecting to the data destination is similar to connecting to a data source
        • {limitation} functions and lists aren't supported
      • {operation} creating a new table
        • {default} table name has the same name as the query name.
      • {operation} picking an existing table
      • {operation} deleting a table manually from the data destination 
        • doesn't recreate the table on the next refresh [3]
      • {operation} reusing queries from Dataflow Gen1
        • {method} export Dataflow Gen1 query and import it into Dataflow Gen2
          • export the queries as a PQT file and import them into Dataflow Gen2 [2]
        • {method} copy and paste in Power Query
          • copy the queries and paste them in the Dataflow Gen2 editor [2]
      • automatic settings:
        • {limitation} supported only for Lakehouse and Azure SQL database
        • {setting} Update method replace: 
          • data in the destination is replaced at every dataflow refresh with the output data of the dataflow [3]
        • {setting} Managed mapping: 
          • the mapping is automatically adjusted when republishing the data flow to reflect the change 
            • ⇒ doesn't need to be updated manually into the data destination experience every time changes occur [3]
        • {setting} Drop and recreate table: 
          • on every dataflow refresh the table is dropped and recreated to allow schema changes
          • {limitation} the dataflow refresh fails if any relationships or measures were added to the table [3]
      • update methods
        • {method} replace: 
          • on every dataflow refresh, the data is dropped from the destination and replaced by the output data of the dataflow.
          • {limitation} not supported by Fabric KQL databases and Azure Data Explorer 
        • {method} append: 
          • on every dataflow refresh, the output data from the dataflow is appended (aka merged) to the existing data in the data destination table (aka upsert)
      • staging 
        • {default} enabled
          • allows to use Fabric compute to execute queries
            • ⇐ enhances the performance of query processing
          • the data is loaded into the staging location
            • ⇐ an internal Lakehouse location accessible only by the dataflow itself
          • [Warehouse] staging is required before the write operation to the data destination
            • ⇐ improves performance
            • {limitation} only loading into the same workspace as the dataflow is supported
          •  using staging locations can enhance performance in some cases
        • disabled
          • {recommendation} [Lakehouse] disable staging on the query to avoid loading twice into a similar destination
            • ⇐ once for staging and once for data destination
            • improves dataflow's performance
      • {scenario} use a dataflow to load data into the lakehouse and then use a notebook to analyze the data [2]
      • {scenario} use a dataflow to load data into an Azure SQL database and then use a data pipeline to load the data into a data warehouse [2]
    • {feature} fast copy
      • allows ingesting terabytes of data with the easy experience and the scalable back-end of the pipeline Copy Activity [7]
      • the feature must be enabled [7]
        • after enabling, Dataflows automatically switch the back-end when data size exceeds a particular threshold [7]
        • ⇐there's no need to change anything during authoring of the dataflows
        • one can check thr refresh history to see if fast copy was used [7]
        • ⇐see the Engine typeRequire fast copy option
        • {option} Require fast copy
      • {prerequisite} Fabric capacity is available [7]
      • {prerequisite} data files 
        • are in .csv or parquet format
        • have at least 100 MB
        • are stored in an ADLS Gen2 or a Blob storage account [6]
      • {prerequisite} [Azure SQL DB|PostgreSQL] >= 5 million rows in the data source [7]
      • {limitation} doesn't support [7] 
        • the VNet gateway
        • writing data into an existing table in Lakehouse
        • fixed schema
    • {feature} parameters
      • allow to dynamically control and customize dataflows
        • makes them more flexible and reusable by enabling different inputs and scenarios without modifying the dataflow itself [9]
        • the dataflow is refreshed by passing parameter values outside of the Power Query editor through either
          • Fabric REST API [9]
          • native Fabric experiences [9]
        • parameter names are case sensitive [9]
        • {type} required parameters
          • {warning} the refresh fails if no value is passed for it [9]
        • {type} optional parameters
        • enabled via Parameters >> Enable parameters to be discovered and override for execution [9]
      • {limitation} dataflows with parameters can't be
        • scheduled for refresh through the Fabric scheduler [9]
        • manually triggered through the Fabric Workspace list or lineage view [9]
      • {limitation} parameters that affect the resource path of a data source or a destination are not supported [9]
        • ⇐connections are linked to the exact data source path defined in the authored dataflow
          • can't be currently override to use other connections or resource paths [9]
      • {limitation} can't be leveraged by dataflows with incremental refresh [9]
      • {limitation} supports only parameters of the type decimal number, whole number, text and true/false can be passed for override
        • any other data types don't produce a refresh request in the refresh history but show in the monitoring hub [9]
      • {warning} allow other users who have permissions to the dataflow to refresh the data with other values [9]
      • {limitation} refresh history does not display information about the parameters passed during the invocation of the dataflow [9]
      • {limitation} monitoring hub doesn't display information about the parameters passed during the invocation of the dataflow [9]
      • {limitation} staged queries only keep the last data refresh of a dataflow stored in the Staging Lakehouse [9]
      • {limitation} only the first request will be accepted from duplicated requests for the same parameter values [9]
        • subsequent requests are rejected until the first request finishes its evaluation [9]
    • {feature} supports CI/CD and Git integration
      • allows to create, edit, and manage dataflows in a Git repository that's connected to a Fabric workspace [10]
      • allows to use the deployment pipelines to automate the deployment of dataflows between workspaces [10]
      • allows to use Public APIs to create and manage Dataflow Gen2 with CI/CD and Git integration [10]
      • allows to create Dataflow Gen2 directly into a workspace folder [10]
      • allows to use the Fabric settings and scheduler to refresh and edit settings for Dataflow Gen2 [10]
      • {action} save a workflow
        • replaces the publish operation
        • when saving th dataflow, it automatically publishes the changes to the dataflow [10]
      • {action} delete a dataflow
        • the staging artifacts become visible in the workspace and are safe to be deleted [10]
      • {action} schedule a refresh
        • can be done manually or by scheduling a refresh [10]
        • {limitation} the Workspace view doesn't show if a refresh is ongoing for the dataflow [10]
        • refresh information is available in the refresh history [10]
      • {action} branching out to another workspace
        • {limitation} the refresh can fail with the message that the staging lakehouse couldn't be found [10]
        • {workaround} create a new Dataflow Gen2 with CI/CD and Git support in the workspace to trigger the creation of the staging lakehouse [10]
        •  all other dataflows in the workspace should start to function again.
      • {action} syncing changes from GIT into the workspace
        • requires to open the new or updated dataflow and save changes manually with the editor [10]
          • triggers a publish action in the background to allow the changes to be used during refresh of the dataflow [10]
      • [Power Automate] {limitation} the connector for dataflows isn't working [10]
    • {benefit} enhance flexibility by allowing dynamic adjustments without altering the dataflow itself [9]
    • {benefit} extends data with consistent data, such as a standard date dimension table [1]
    • {benefit} allows self-service users access to a subset of data warehouse separately [1]
    • {benefit} optimizes performance with dataflows, which enable extracting data once for reuse, reducing data refresh time for slower sources [1]
    • {benefit} simplifies data source complexity by only exposing dataflows to larger analyst groups [1]
    • {benefit} ensures consistency and quality of data by enabling users to clean and transform data before loading it to a destination [1]
    • {benefit} simplifies data integration by providing a low-code interface that ingests data from various sources [1]
    • {limitation} not a replacement for a data warehouse [1]
    • {limitation} row-level security isn't supported [1]
    • {limitation} Fabric or Fabric trial capacity workspace is required [1]


    Feature Data flow Gen2 Dataflow Gen1
    Author dataflows with Power Query
    Shorter authoring flow
    Auto-Save and background publishing
    Data destinations
    Improved monitoring and refresh history
    Integration with data pipelines
    High-scale compute
    Get Data via Dataflows connector
    Direct Query via Dataflows connector
    Incremental refresh ✓*
    Fast Copy ✓*
    Cancel refresh ✓*
    AI Insights support
    Dataflow Gen1 vs Gen2 [2]


    References:
    [1] Microsoft Learn (2023) Fabric: Ingest data with Microsoft Fabric (link)
    [2] Microsoft Learn (2023) Fabric: Getting from Dataflow Generation 1 to Dataflow Generation 2 (link)
    [3] Microsoft Learn (2023) Fabric: Dataflow Gen2 data destinations and managed settings (link)
    [4] Microsoft Learn (2023) Fabric: Dataflow Gen2 pricing for Data Factory in Microsoft Fabric (link)
    [5] Microsoft Learn (2023) Fabric: Save a draft of your dataflow (link)
    [6] Microsoft Learn (2023) Fabric: What's new and planned for Data Factory in Microsoft Fabric (link)

    [7] Microsoft Learn (2023) Fabric: Fast copy in Dataflows Gen2 [link]
    [8] Microsoft Learn (2025) Fabric: Incremental refresh in Dataflow Gen2 [link]

    [9] Microsoft Learn (2025) Use public parameters in Dataflow Gen2 (Preview) [link]

    [10] Microsoft Learn (2025) Dataflow Gen2 with CI/CD and Git integration support [link

    Resources:
    [R1] Arshad Ali & Bradley Schacht (2024) Learn Microsoft Fabric (link)
    [R2] Microsoft Learn: Fabric (2023) Data Factory limitations overview (link)
    [R3] Microsoft Fabric Blog (2023) Data Factory Spotlight: Dataflow Gen2, by Miguel Escobar (link)
    [R4] Microsoft Learn: Fabric (2023) Dataflow Gen2 connectors in Microsoft Fabric (link) 
    [R5] Microsoft Learn: Fabric (2023) Pattern to incrementally amass data with Dataflow Gen2 (link)
    [R6] Fourmoo (2004) Microsoft Fabric – Comparing Dataflow Gen2 vs Notebook on Costs and usability, by Gilbert Quevauvilliers (link)
    [R7] Microsoft Learn: Fabric (2023) A guide to Fabric Dataflows for Azure Data Factory Mapping Data Flow users (link)
    [R8] Microsoft Learn: Fabric (2023) Quickstart: Create your first dataflow to get and transform data (link)
    [R9] Microsoft Learn: Fabric (2023) Microsoft Fabric decision guide: copy activity, dataflow, or Spark (link)
    [R10] Microsoft Fabric Blog (2023) Dataflows Gen2 data destinations and managed settings, by Miquella de Boer  (link)
    [R11] Microsoft Fabric Blog (2023) Service principal support to connect to data in Dataflow, Datamart, Dataset and Dataflow Gen 2, by Miquella de Boer (link)
    [R12] Chris Webb's BI Blog (2023) Fabric Dataflows Gen2: To Stage Or Not To Stage? (link)
    [R13] Power BI Tips (2023) Let's Learn Fabric ep.7: Fabric Dataflows Gen2 (link)
    [R14] Microsoft Learn (2025) Fabric: What's new in Microsoft Fabric? [link]

    [R15] Microsoft Fabric Blog (2023) Passing parameter values to refresh a Dataflow Gen2 (Preview) [link

    Acronyms:
    ADLS - Azure Data Lake Storage
    CI/CD - Continuous Integration/Continuous Deployment 
    ETL - Extract, Transform, Load
    KQL - Kusto Query Language
    PQO - Power Query Online
    PQT - Power Query Template
    Related Posts Plugin for WordPress, Blogger...

    About Me

    My photo
    Koeln, NRW, Germany
    IT Professional with more than 25 years experience in IT in the area of full life-cycle of Web/Desktop/Database Applications Development, Software Engineering, Consultancy, Data Management, Data Quality, Data Migrations, Reporting, ERP implementations & support, Team/Project/IT Management, etc.