
25 April 2025

🏭🗒️Microsoft Fabric: Dataflows Gen2's Incremental Refresh [Notes] 🆕

Disclaimer: This is a work in progress intended to consolidate information from various sources for learning purposes. For the latest information, please consult the documentation (see the links below)!

Last updated: 25-Apr-2025

[Microsoft Fabric] Incremental Refresh in Dataflows Gen2

  • {feature} enables incrementally extracting data from data sources, applying Power Query transformations, and loading the data into various output destinations [5]
    • designed to reduce the amount of data that needs to be processed and retrieved from the source system [8]
    • configurable directly in the dataflow editor [8]
    • no need to specify the historical data range [8]
      • ⇐ the dataflow doesn't remove any data from the destination that's outside the bucket range [8]
    • no need to specify the parameters for the incremental refresh [8]
      • the filters and parameters are automatically added as the last step in the query [8]
  • {prerequisite} the data source 
    • supports folding [8]
    • needs to contain a Date/DateTime column that can be used to filter the data [8]
  • {prerequisite} the data destination supports incremental refresh [8]
    • available destinations
      • Fabric Warehouse
      • Azure SQL Database
      • Azure Synapse Analytics
      • Fabric Lakehouse [preview]
    • other destinations can be used in combination with incremental refresh by using a second query that references the staged data to update the data destination [8]
      • this allows using incremental refresh to reduce the amount of data that needs to be processed and retrieved from the source system [8]
        • a full refresh from the staged data to the data destination is still needed [8]
  • works by dividing the data into buckets based on a DateTime column [8] (see the SQL sketch after these notes)
    • each bucket contains the data that changed since the last refresh [8]
      • the dataflow knows what changed by checking the maximum value in the specified column 
        • if the maximum value changed for that bucket, the dataflow retrieves the whole bucket and replaces the data in the destination [8]
        • if the maximum value didn't change, the dataflow doesn't retrieve any data [8]
  • {limitation} the data destination must be set to a fixed schema [8]
    • ⇒ the table's schema in the data destination must be fixed and can't change [8]
      • ⇒ a dynamic schema must be changed to a fixed schema before configuring incremental refresh [8]
  • {limitation} the only supported update method in the data destination: replace
    • ⇒ the dataflow replaces the data for each bucket in the data destination with the new data [8]
      • data that is outside the bucket range isn't affected [8]
  • {limitation} maximum number of buckets
    • single query: 50
      • {workaround} increase the bucket size or reduce the bucket range to lower the number of buckets [8]
    • whole dataflow: 150
      • {workaround} reduce the number of incremental refresh queries or increase the bucket size [8]
  • {downside} the dataflow may take longer to refresh after enabling incremental refresh [8]
    • because the additional overhead of checking whether data changed and of processing the buckets can outweigh the time saved by processing less data [8]
    • {recommendation} review the settings for incremental refresh and adjust them to better fit the scenario
      • {option} increase the bucket size to reduce the number of buckets and the overhead of processing them [8]
      • {option} disable incremental refresh [8]
  • {recommendation} don't use the same column both for detecting changes and for filtering [8]
    • because this can lead to unexpected results [8]
  • {setting} limit the number of concurrent evaluations
    • setting the value to a lower number reduces the number of requests sent to the source system [8]
    • via global settings >> Scale tab >> maximum number of parallel query evaluations
    • {recommendation} don't enable this limit unless there are issues with the source system [8]
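
To make the bucket mechanism described in the notes above more tangible, below is a rough conceptual sketch in SQL of the "check the bucket's maximum, then replace the bucket" idea. The object names (SourceOrders, DestOrders, LastUpdateDate) and the bind variables are hypothetical; Dataflows Gen2 builds and folds its own queries internally, so this illustrates only the principle, not the code it actually runs.

 -- Conceptual sketch only; object names and parameters are hypothetical.
 -- 1. For a given bucket, check whether anything changed since the last refresh
 --    by comparing the current maximum of the change-detection column against
 --    the maximum remembered from the previous run.
 SELECT MAX(LastUpdateDate)
 FROM SourceOrders
 WHERE LastUpdateDate > :BUCKET_START
   AND LastUpdateDate <= :BUCKET_END;

 -- 2. If the maximum changed, replace the whole bucket in the destination
 --    (replace is the only supported update method); if it didn't change,
 --    the bucket is skipped and no data is retrieved.
 DELETE FROM DestOrders
 WHERE LastUpdateDate > :BUCKET_START
   AND LastUpdateDate <= :BUCKET_END;

 INSERT INTO DestOrders
 SELECT *
 FROM SourceOrders
 WHERE LastUpdateDate > :BUCKET_START
   AND LastUpdateDate <= :BUCKET_END;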

References:
[5] Microsoft Learn (2023) Fabric: Save a draft of your dataflow [link]
[8] Microsoft Learn (2025) Fabric: Incremental refresh in Dataflow Gen2 [link]

Resources:


22 September 2009

⌛SQL Reloaded: Incremental Update Technique(s) in Oracle

In the Cognos Enterprise Business Intelligence Developer Guides mentioned in the previous post, a set of constraints like the one in the example below (e.g. Accounts Payable Analysis for Oracle e-Business Suite, p. 21) is repeated over and over again in the WHERE clause, involving a varying number of tables (3 in the example):

 (A.CREATION_DATE > TO_DATE(:FROM_DATETIME,'YYYYMMDDHH24MISS') AND A.CREATION_DATE <= TO_DATE(:TO_DATETIME,'YYYYMMDDHH24MISS'))
 OR (B.CREATION_DATE > TO_DATE(:FROM_DATETIME,'YYYYMMDDHH24MISS') AND B.CREATION_DATE <= TO_DATE(:TO_DATETIME,'YYYYMMDDHH24MISS'))
 OR (C.CREATION_DATE > TO_DATE(:FROM_DATETIME,'YYYYMMDDHH24MISS') AND C.CREATION_DATE <= TO_DATE(:TO_DATETIME,'YYYYMMDDHH24MISS'))

For 3 tables it's not a lot of work, though the set of constraints becomes more complicated as the number of tables increases. It can be simplified by using Oracle's LEAST & GREATEST functions, resulting in the equivalent formulation:

 (LEAST(A.CREATION_DATE, B.CREATION_DATE, C.CREATION_DATE) > TO_DATE(:FROM_DATETIME,'YYYYMMDDHH24MISS')
  AND GREATEST(A.CREATION_DATE, B.CREATION_DATE, C.CREATION_DATE) <= TO_DATE(:TO_DATETIME,'YYYYMMDDHH24MISS'))

The second formulation is simpler and should have the same performance as the first; however, it has a small impediment: the behaviour of LEAST/GREATEST with NULL values when LEFT JOINs are involved. If one of the parameters provided to the two functions is NULL, then the result is NULL, which can mess up a query. That's not difficult to avoid, for example by using the NVL function. Supposing that there is a LEFT JOIN between A and the two other tables B & C, the above constraint can be rewritten as:

 (LEAST(A.CREATION_DATE, NVL(B.CREATION_DATE, A.CREATION_DATE), NVL(C.CREATION_DATE, A.CREATION_DATE)) > TO_DATE(:FROM_DATETIME,'YYYYMMDDHH24MISS')
  AND GREATEST(A.CREATION_DATE, NVL(B.CREATION_DATE, A.CREATION_DATE), NVL(C.CREATION_DATE, A.CREATION_DATE)) <= TO_DATE(:TO_DATETIME,'YYYYMMDDHH24MISS'))
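
For context, a complete query over the three tables might look as follows; the table names are invented for illustration, and the joins are assumed to be LEFT JOINs from A to B and C:

 SELECT A.INVOICE_ID, A.CREATION_DATE HEADER_CREATED, B.CREATION_DATE LINE_CREATED, C.CREATION_DATE PAYMENT_CREATED
 FROM AP_INVOICES A
      LEFT JOIN AP_INVOICE_LINES B
        ON B.INVOICE_ID = A.INVOICE_ID
      LEFT JOIN AP_PAYMENTS C
        ON C.INVOICE_ID = A.INVOICE_ID
 WHERE LEAST(A.CREATION_DATE, NVL(B.CREATION_DATE, A.CREATION_DATE), NVL(C.CREATION_DATE, A.CREATION_DATE)) > TO_DATE(:FROM_DATETIME,'YYYYMMDDHH24MISS')
   AND GREATEST(A.CREATION_DATE, NVL(B.CREATION_DATE, A.CREATION_DATE), NVL(C.CREATION_DATE, A.CREATION_DATE)) <= TO_DATE(:TO_DATETIME,'YYYYMMDDHH24MISS')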

One might end up writing such constraints, for example, when attempting an incremental update in a data warehouse based on the Creation Date from multiple source tables, loading the records with a Creation Date greater than the date of the previous update (of course, the records that already exist in the data warehouse need to be deleted or updated first). The Creation Date works only for transactional data that doesn't change after creation; for the other cases the Last Update Date (the date of a record's last update) should be considered instead. This approach also has its flaws, working well only when records are not deleted from the source table; that issue could be addressed, for example, by creating a trigger on deletion that saves the ID of the deleted record in a third-party table, or by actually deleting the record from the third-party database when the deletion happens in the source table.
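
As a minimal sketch of that approach, assuming hypothetical SOURCE_ORDERS and DWH_ORDERS tables, an ETL_CONTROL table that stores the date of the previous load, and a DELETED_ORDERS table for the deletion trigger, the incremental update could look roughly like this:

 -- Minimal sketch; all object names are hypothetical.
 -- 1. Remove from the warehouse the rows changed in the source since the last load.
 DELETE FROM DWH_ORDERS D
 WHERE EXISTS (SELECT 1
               FROM SOURCE_ORDERS S
               WHERE S.ORDER_ID = D.ORDER_ID
                 AND S.LAST_UPDATE_DATE > (SELECT LAST_LOAD_DATE
                                           FROM ETL_CONTROL
                                           WHERE TABLE_NAME = 'DWH_ORDERS'));

 -- 2. Reload the new and changed rows.
 INSERT INTO DWH_ORDERS (ORDER_ID, AMOUNT, CREATION_DATE, LAST_UPDATE_DATE)
 SELECT S.ORDER_ID, S.AMOUNT, S.CREATION_DATE, S.LAST_UPDATE_DATE
 FROM SOURCE_ORDERS S
 WHERE S.LAST_UPDATE_DATE > (SELECT LAST_LOAD_DATE
                             FROM ETL_CONTROL
                             WHERE TABLE_NAME = 'DWH_ORDERS');

 -- 3. Advance the watermark to the highest date just loaded.
 UPDATE ETL_CONTROL E
 SET E.LAST_LOAD_DATE = (SELECT NVL(MAX(S.LAST_UPDATE_DATE), E.LAST_LOAD_DATE)
                         FROM SOURCE_ORDERS S)
 WHERE E.TABLE_NAME = 'DWH_ORDERS';

 -- 4. Optionally, capture deletions in the source via a trigger, as suggested above.
 CREATE OR REPLACE TRIGGER TRG_SOURCE_ORDERS_DEL
 AFTER DELETE ON SOURCE_ORDERS
 FOR EACH ROW
 BEGIN
   INSERT INTO DELETED_ORDERS (ORDER_ID, DELETED_ON)
   VALUES (:OLD.ORDER_ID, SYSDATE);
 END;
 /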
