Showing posts with label pipeline. Show all posts
Showing posts with label pipeline. Show all posts

18 April 2024

Data Warehousing: Data(base) Mirroring in Microsoft Fabric (New feature)

Data Warehousing
Data Warehousing Series

Microsoft recently announced [4] the preview of a new Fabric feature called Mirroring, a low-cost, low-latency fully managed service that allows to replicate data from various systems together into OneLake [1]. Currently only Azure SQL Database, Azure Cosmos DB, and Snowflake are supported, though probably more database vendors will be targeted soon. 

For Microsoft Fabric's data engineers, data scientists and data warehouse professionals this feature is huge as importance because they don't need to care anymore about making the data available in Microsoft Fabric, which involves a considerable amount of work. 

Usually, at least for flexibility, transparence, performance and standardization, data professionals prefer to extract the data 1:1 from the source systems into a landing zone in the data warehouse or data/delta lake from where the data are further processed as needed. One data pipeline is thus built for every table in scope, which sometimes is a 10–15-minute effort per table, when the process is standardized, though upon case the effort is much higher if troubleshooting (e.g. data type incompatibility or support) or further logic changes are involved. Maintaining such data pipelines can prove to be costly over time, especially when periodic changes are needed. 

Microsoft lists other downsides of the ETL approach - restricted access to data changes, friction between people, processes, and technology, respectively the effort needed to create the pipelines, and the time needed for importing the data [1]. There's some truth is each of these points, though everything is relative. For big tables, however, refreshing all the data overnight can prove to be time-consuming and costly, especially when the data don't lie within the same region, respectively data center. Unless the data can be refreshed incrementally, the night runs can extend into the day, will all the implications that derive from this - not having actual data, which decreases the trust in reports, etc. There are tricks to speed up the process, though there are limits to what can be done. 

With mirroring, the replication of data between data sources and the analytics platform is handled in the background, after an initial replication, the changes in the source systems being reflected with a near real-time latency into OneLake, which is amazing! This allows building near real-time reporting solutions which can help the business in many ways - reviewing (and correcting in the data source) records en masse, faster overview of what's happening in the organizations, faster basis for decision-making, etc. Moreover, the mechanism is fully managed by Microsoft, which is thus responsible for making sure that the data are correctly synchronized. Only from this perspective 10-20% from the effort of building an analytics solution is probably reduced.

Mirroring in Microsoft Fabric
Mirroring in Microsoft Fabric (adapted after [2])

According to the documentation, one can replicate a whole database or choose individual regular tables (currently views aren't supported [3]), stop, restart, or remove a table from a mirroring. Moreover, through sharing, users can grant to other users or groups of users access to a mirrored database without giving access to the workspace and the rest of its items [1]. 

The data professionals and citizens can write then cross-database queries against the mirrored databases, warehouses, and the SQL analytics endpoints of lakehouses, combining data from all these sources into a single T-SQL query, which opens lot of opportunities especially in what concerns the creation of an enterprise semantic model, which should be differentiated from the semantic model created by default by the mirroring together with the SQL analytics endpoint.

Considering that the data is replicated into delta tables, one can take advantage of all the capabilities available with such tables - data versioning, time travel, interoperability and/or performance, respectively direct consumption in Power BI.

Previous Post <<||>> Next Post

References:
[1] Microsoft Learn - Microsoft Fabric (2024) What is Mirroring in Fabric? (link)
[2] Microsoft Learn - Microsoft Fabric (2024) Mirroring Azure SQL Database [Preview] (link)
[3] Microsoft Learn - Microsoft Fabric (2024) Frequently asked questions for Mirroring Azure SQL Database in Microsoft Fabric [Preview] (link)
[4] Microsoft Fabric Updates Blog (2024) Announcing the Public Preview of Mirroring in Microsoft Fabric, by Charles Webb (link)

10 March 2024

Microsoft Fabric: Dataflows Gen2 (Notes)

Disclaimer: This is work in progress intended to consolidate information from various sources. 
Last updated: 10-Mar-2024

Dataflow (Gen2) Architecture [4]

Dataflow (Gen2) 

  •  new generation of dataflows that resides alongside the Power BI Dataflow (Gen1) [2]
  • allows to 
    • extract data from various sources
    • transform it using a wide range of transformation operations 
    • load it into a destination [1]
  • {goal} provide an easy, reusable way to perform ETL tasks using Power Query Online [1]
    • allows to promote reusable ETL logic 
      • ⇒ prevents the need to create more connections to the data source.
      • offer a wide variety of transformations
    • can be horizontally partitioned
  • {component} Lakehouse 
    • used to stage data being ingested
  • {component} Warehouse 
    • used as a compute engine and means to write back results to staging or supported output destinations faster
  • {component} Mashup Engine
    • extracts, transforms, or loads the data to staging or data destinations when either [4]
      • Warehouse compute cannot be used [4]
      • staging is disabled for a query [4]
  • {operation} creating a dataflow
    • can be created in a
      • Data Factory workload
      • Power BI workspace
      • Lakehouse
  • {operation} publishing a dataflow
    • generates dataflow's definition  
      • ⇐ the program that runs once the dataflow is refreshed to produce tables in staging storage and/or output destination [4]
      • used by the dataflow engine to generate an orchestration plan, manage resources, and orchestrate execution of queries across data sources, gateways, and compute engines, and to create tables in either the staging storage or data destination [4]
    • saves changes and runs validations that must be performed in the background [2]
  • {operation} refreshing a dataflow
  • {operation} running a dataflow 
    • can be run
      • manually
      • on a refresh schedule
      • as part of a Data Pipeline orchestration
  • {feature} author dataflows with Power Query
    • uses the full Power Query experience of Power BI dataflows [2]
  • {feature} shorter authoring flow
    • uses step-by-step for getting the data into your the dataflow [2]
      • the number of steps required to create dataflows were reduced [2]
    • a few new features were added to improve the experience [2]
  • {feature} Auto-Save and background publishing
    • changes made to a dataflow are autosaved to the cloud (aka draft version of the dataflow) [2]
      • ⇐ without having to wait for the validation to finish [2]
    • {functionality} save as draft 
      • stores a draft version of the dataflow every time you make a change [2]
      • seamless experience and doesn't require any input [2]
    • {concept} published version
      • the version of the dataflow that passed validation and is ready to refresh [5]
  • {feature} integration with data pipelines
    • integrates directly with Data Factory pipelines for scheduling and orchestration [2] 
  • {feature} high-scale compute
    • leverages a new, higher-scale compute architecture [2] 
      •  improves the performance of both transformations of referenced queries and get data scenarios [2]
      • creates both Lakehouse and Warehouse items in the workspace, and uses them to store and access data to improve performance for all dataflows [2]
  • {feature} improved monitoring and refresh history
    • integrate support for Monitoring Hub [2]
    • Refresh History experience upgraded [2]
  • {feature} get data via Dataflows connector
    • supports a wide variety of data source connectors
      • include cloud and on-premises relational databases
  • {feature|planned} incremental refresh 
    • enables you to incrementally extract data from data sources, apply Power Query transformations, and load into various output destinations [5]
  • {feature|planned} Fast Copy 
    • enables large-scale data ingestion directly utilizing the pipelines Copy Activity capability [6]
    • supports sources such Azure SQL Databases, CSV, and Parquet files in Azure Data Lake Storage and Blob Storage [6]
    • significantly scales up the data processing capacity providing high-scale ELT capabilities [6]
  • {feature|planned}Cancel refresh
    • enables to cancel ongoing Dataflow Gen2 refreshes from the workspace items view [6]
  • {feature} data destinations
    • allows to 
      • specify an output destination
      • separate ETL logic and destination storage [2]
    • every tabular data query can have a data destination [3]
      • available destinations
        • Azure SQL databases
        • Azure Data Explorer (Kusto)
        • Fabric Lakehouse
        • Fabric Warehouse
        • Fabric KQL database
      • a destination can be specified for every query individually [3]
      • multiple different destinations can be used within a dataflow [3]
      • connecting to the data destination is similar to connecting to a data source
      • {limitation} functions and lists aren't supported
    • {operation} creating a new table
      • {default} table name has the same name as the query name.
    • {operation} picking an existing table
    • {operation} deleting a table manually from the data destination 
      • doesn't recreate the table on the next refresh [3]
    • {operation} reusing queries from Dataflow Gen1
      • {method} export Dataflow Gen1 query and import it into Dataflow Gen2
        • export the queries as a PQT file and import them into Dataflow Gen2 [2]
      • {method} copy and paste in Power Query
        • copy the queries and paste them in the Dataflow Gen2 editor [2]
    • automatic settings:
      • {limitation} supported only for Lakehouse and Azure SQL database
      • {setting} Update method replace: 
        • data in the destination is replaced at every dataflow refresh with the output data of the dataflow [3]
      • {setting} Managed mapping: 
        • the mapping is automatically adjusted when republishing the data flow to reflect the change 
          • ⇒ doesn't need to be updated manually into the data destination experience every time changes occur [3]
      • {setting} Drop and recreate table: 
        • on every dataflow refresh the table is dropped and recreated to allow schema changes
        • {limitation} the dataflow refresh fails if any relationships or measures were added to the table [3]
    • update methods
      • {method} replace: 
        • on every dataflow refresh, the data is dropped from the destination and replaced by the output data of the dataflow.
        • {limitation} not supported by Fabric KQL databases and Azure Data Explorer 
      • {method} append: 
        • on every dataflow refresh, the output data from the dataflow is appended (aka merged) to the existing data in the data destination table (aka upsert)
    • staging 
      • {default} enabled
        • allows to use Fabric compute to execute queries
          • ⇐ enhances the performance of query processing
        • the data is loaded into the staging location
          • ⇐ an internal Lakehouse location accessible only by the dataflow itself
        • [Warehouse] staging is required before the write operation to the data destination
          • ⇐ improves performance
          • {limitation} only loading into the same workspace as the dataflow is supported
        •  using staging locations can enhance performance in some cases
      • disabled
        • {recommendation} [Lakehouse] disable staging on the query to avoid loading twice into a similar destination
          • ⇐ once for staging and once for data destination
          • improves dataflow's performance
    • {scenario} use a dataflow to load data into the lakehouse and then use a notebook to analyze the data [2]
    • {scenario} use a dataflow to load data into an Azure SQL database and then use a data pipeline to load the data into a data warehouse [2]
  • {benefit} extends data with consistent data, such as a standard date dimension table [1]
  • {benefit} allows self-service users access to a subset of data warehouse separately [1]
  • {benefit} optimizes performance with dataflows, which enable extracting data once for reuse, reducing data refresh time for slower sources [1]
  • {benefit} simplifies data source complexity by only exposing dataflows to larger analyst groups [1]
  • {benefit} ensures consistency and quality of data by enabling users to clean and transform data before loading it to a destination [1]
  • {benefit} simplifies data integration by providing a low-code interface that ingests data from various sources [1]
  • {limitation} not a replacement for a data warehouse [1]
  • {limitation} row-level security isn't supported [1]
  • {limitation} Fabric or Fabric trial capacity workspace is required [1]


Feature Data flow Gen2 Dataflow Gen1
Author dataflows with Power Query
Shorter authoring flow
Auto-Save and background publishing
Data destinations
Improved monitoring and refresh history
Integration with data pipelines
High-scale compute
Get Data via Dataflows connector
Direct Query via Dataflows connector
Incremental refresh ✓*
Fast Copy ✓*
Cancel refresh ✓*
AI Insights support
Dataflow Gen1 vs Gen2 [2]


Acronyms:
ETL - Extract, Transform, Load
KQL - Kusto Query Language
PQO - Power Query Online
PQT - Power Query Template

References:
[1] Microsoft Learn: Fabric (2023) Ingest data with Microsoft Fabric (link)
[2] Microsoft Learn: Fabric (2023) Getting from Dataflow Generation 1 to Dataflow Generation 2 (link)
[3] Microsoft Learn: Fabric (2023) Dataflow Gen2 data destinations and managed settings (link)
[4] Microsoft Learn: Fabric (2023) Dataflow Gen2 pricing for Data Factory in Microsoft Fabric (link)
[5] Microsoft Learn: Fabric (2023) Save a draft of your dataflow (link)
[6] Microsoft Learn: Fabric (2023) What's new and planned for Data Factory in Microsoft Fabric (link)

Resources:
[R1] Arshad Ali & Bradley Schacht (2024) Learn Microsoft Fabric (link)
[R2] Microsoft Learn: Fabric (2023) Data Factory limitations overview (link)
[R3] Microsoft Fabric Blog (2023) Data Factory Spotlight: Dataflow Gen2, by Miguel Escobar (link)
[R4] Microsoft Learn: Fabric (2023) Dataflow Gen2 connectors in Microsoft Fabric (link) 
[R5] Microsoft Learn: Fabric (2023) Pattern to incrementally amass data with Dataflow Gen2 (link)
[R6] Fourmoo (2004) Microsoft Fabric – Comparing Dataflow Gen2 vs Notebook on Costs and usability, by Gilbert Quevauvilliers (link)
[R7] Microsoft Learn: Fabric (2023) A guide to Fabric Dataflows for Azure Data Factory Mapping Data Flow users (link)
[R8] Microsoft Learn: Fabric (2023) Quickstart: Create your first dataflow to get and transform data (link)
[R9] Microsoft Learn: Fabric (2023) Microsoft Fabric decision guide: copy activity, dataflow, or Spark (link)
[R10] Microsoft Fabric Blog (2023) Dataflows Gen2 data destinations and managed settings, by Miquella de Boer  (link)
[R11] Microsoft Fabric Blog (2023) Service principal support to connect to data in Dataflow, Datamart, Dataset and Dataflow Gen 2, by Miquella de Boer (link)
[R12] Chris Webb's BI Blog (2023) Fabric Dataflows Gen2: To Stage Or Not To Stage? (link)
[R13] Power BI Tips (2023) Let's Learn Fabric ep.7: Fabric Dataflows Gen2 (link)

29 March 2021

Notes: Team Data Science Process (TDSP)

Team Data Science Process (TDSP)
Acronyms:
Artificial Intelligence (AI)
Cross-Industry Standard Process for Data Mining (CRISP-DM)
Data Mining (DM)
Knowledge Discovery in Databases (KDD)
Team Data Science Process (TDSP) 
Version Control System (VCS)
Visual Studio Team Services (VSTS)

Resources:
[1] Microsoft Azure (2020) What is the Team Data Science Process? [source]
[2] Microsoft Azure (2020) The business understanding stage of the Team Data Science Process lifecycle [source]
[3] Microsoft Azure (2020) Data acquisition and understanding stage of the Team Data Science Process [source]
[4] Microsoft Azure (2020) Modeling stage of the Team Data Science Process lifecycle [source
[5] Microsoft Azure (2020) Deployment stage of the Team Data Science Process lifecycle [source]
[6] Microsoft Azure (2020) Customer acceptance stage of the Team Data Science Process lifecycle [source]

11 March 2021

Microsoft Azure: Azure Data Factory (Notes)

Azure Data Factory - Concept Map

Acronyms:
Azure Data Factory (ADF)
Continuous Integration/Continuous Deployment (CI/CD)
Extract Load Transform (ELT)
Extract Transform Load (ETL)
Independent Software Vendors (ISVs)
Operations Management Suite (OMS)
pay-as-you-go (PAYG)
SQL Server Integration Services (SSIS)

Resources:
[1] Microsoft (2020) "Microsoft Business Intelligence and Information Management: Design Guidance", by Rod College
[2] Microsoft (2021) Azure Data Factory [source]
[3] Microsoft (2018) Azure Data Factory: Data Integration in the Cloud [source]
[4] Microsoft (2021) Integrate data with Azure Data Factory or Azure Synapse Pipeline [source]
[10] Coursera (2021) Data Processing with Azure [source]
[11] Sudhir Rawat & Abhishek Narain (2019) "Understanding Azure Data Factory: Operationalizing Big Data and Advanced Analytics Solutions"

16 March 2018

Data Science: Data Pipeline/Pipelining (Definitions)

"A series of operations in an aggregation process." (MongoDb, "Glossary", 2008)

"A series of processes all in a row, linked by pipes, where each passes its output stream to the next." (Jon Orwant et al, "Programming Perl" 4th Ed., 2012)

"Description of the process workflow in sequential order." (Hamid R Arabnia et al, "Application of Big Data for National Security", 2015)

"In data processing, a pipeline is a sequence of processing steps combined into a single object. In Spark MLlib, a pipeline is a sequence of stages. A Pipeline is an estimator containing transformers, estimators, and evaluators. When it is trained, it produces a PipelineModel containing transformers, models, and evaluators." (Alex Thomas, "Natural Language Processing with Spark NLP", 2020)

"Abstract concept used to describe where work is broken into several steps which enable multiple tasks to be in progress at the same time. Pipelining is applied in processors to increase processing of machine language instructions and is also a category of functional decomposition that reduces the synchronization cost while maintaining many of the benefits of concurrent execution." (Max Domeika, "Software Development for Embedded Multi-core Systems", 2011)

"A technique that breaks an instruction into smaller steps that can be overlapped" (Nell Dale & John Lewis, "Computer Science Illuminated" 6th Ed., 2015)

[pipeline pattern:] "A set of data processing elements connected in series, generally so that the output of one element is the input of the next one. The elements of a pipeline are often executed concurrently. Describing many algorithms, including many signal processing problems, as pipelines is generally quite natural and lends itself to parallel execution. However, in order to scale beyond the number of pipeline stages, it is necessary to exploit parallelism within a single pipeline stage." (Michael McCool et al, "Structured Parallel Programming", 2012)

"A data pipeline is a general term for a process that moves data from a source to a destination. ETL (extract, transform, and load) uses a data pipeline to move the data it extracts from a source to the destination, where it loads the data." (Jake Stein)

"A data pipeline is a piece of infrastructure responsible for routing data from where it is to where it needs to go and provide any necessary transformations through that process." (Precisely) [source

"A data pipeline is a service or set of actions that process data in sequence. This means that the results or output from one segment of the system become the input for the next. The usual function of a data pipeline is to move data from one state or location to another."(SnapLogic) [source]

"A data pipeline is a software process that takes data from sources and pushes it to a destination. Most modern data pipelines are automated with an ETL (Extract, Transform, Load) platform." (Xplenty) [source

"A data pipeline is a set of actions that extract data (or directly analytics and visualization) from various sources. It is an automated process: take these columns from this database, merge them with these columns from this API, subset rows according to a value, substitute NAs with the median and load them in this other database." (Alan Marazzi)

"A source and all the transformations and targets that receive data from that source. Each mapping contains one or more pipelines." (Informatica)

"An ETL Pipeline refers to a set of processes extracting data from an input source, transforming the data, and loading into an output destination such as a database, data mart, or a data warehouse for reporting, analysis, and data synchronization." (Databricks) [source]

"Data pipeline consists of a set of actions performed in real-time or in batches, that captures data from various sources, sorting it and then moving that data through applications, filters, and APIs for storage and analysis." (EAI) 

Related Posts Plugin for WordPress, Blogger...

About Me

My photo
IT Professional with more than 24 years experience in IT in the area of full life-cycle of Web/Desktop/Database Applications Development, Software Engineering, Consultancy, Data Management, Data Quality, Data Migrations, Reporting, ERP implementations & support, Team/Project/IT Management, etc.