Showing posts with label ELT. Show all posts
Showing posts with label ELT. Show all posts

10 March 2024

🏭🗒️Microsoft Fabric: Dataflows Gen2 [Notes]

Disclaimer: This is work in progress intended to consolidate information from various sources for learning purposes. For the latest information please consult the documentation (see the links below)! 

Last updated: 10-Mar-2024

Dataflow (Gen2) Architecture [4]

[Microsoft Fabric] Dataflow (Gen2) 

  •  new generation of dataflows that resides alongside the Power BI Dataflow (Gen1) [2]
  • allows to 
    • extract data from various sources
    • transform it using a wide range of transformation operations 
    • load it into a destination [1]
  • {goal} provide an easy, reusable way to perform ETL tasks using Power Query Online [1]
    • allows to promote reusable ETL logic 
      • ⇒ prevents the need to create more connections to the data source.
      • offer a wide variety of transformations
    • can be horizontally partitioned
  • {component} Lakehouse 
    • used to stage data being ingested
  • {component} Warehouse 
    • used as a compute engine and means to write back results to staging or supported output destinations faster
  • {component} Mashup Engine
    • extracts, transforms, or loads the data to staging or data destinations when either [4]
      • Warehouse compute cannot be used [4]
      • staging is disabled for a query [4]
  • {operation} creating a dataflow
    • can be created in a
      • Data Factory workload
      • Power BI workspace
      • Lakehouse
  • {operation} publishing a dataflow
    • generates dataflow's definition  
      • ⇐ the program that runs once the dataflow is refreshed to produce tables in staging storage and/or output destination [4]
      • used by the dataflow engine to generate an orchestration plan, manage resources, and orchestrate execution of queries across data sources, gateways, and compute engines, and to create tables in either the staging storage or data destination [4]
    • saves changes and runs validations that must be performed in the background [2]
  • {operation} refreshing a dataflow
  • {operation} running a dataflow 
    • can be run
      • manually
      • on a refresh schedule
      • as part of a Data Pipeline orchestration
  • {feature} author dataflows with Power Query
    • uses the full Power Query experience of Power BI dataflows [2]
  • {feature} shorter authoring flow
    • uses step-by-step for getting the data into your the dataflow [2]
      • the number of steps required to create dataflows were reduced [2]
    • a few new features were added to improve the experience [2]
  • {feature} Auto-Save and background publishing
    • changes made to a dataflow are autosaved to the cloud (aka draft version of the dataflow) [2]
      • ⇐ without having to wait for the validation to finish [2]
    • {functionality} save as draft 
      • stores a draft version of the dataflow every time you make a change [2]
      • seamless experience and doesn't require any input [2]
    • {concept} published version
      • the version of the dataflow that passed validation and is ready to refresh [5]
  • {feature} integration with data pipelines
    • integrates directly with Data Factory pipelines for scheduling and orchestration [2] 
  • {feature} high-scale compute
    • leverages a new, higher-scale compute architecture [2] 
      •  improves the performance of both transformations of referenced queries and get data scenarios [2]
      • creates both Lakehouse and Warehouse items in the workspace, and uses them to store and access data to improve performance for all dataflows [2]
  • {feature} improved monitoring and refresh history
    • integrate support for Monitoring Hub [2]
    • Refresh History experience upgraded [2]
  • {feature} get data via Dataflows connector
    • supports a wide variety of data source connectors
      • include cloud and on-premises relational databases
  • {feature|planned} incremental refresh 
    • enables you to incrementally extract data from data sources, apply Power Query transformations, and load into various output destinations [5]
  • {feature|planned} Fast Copy 
    • enables large-scale data ingestion directly utilizing the pipelines Copy Activity capability [6]
    • supports sources such Azure SQL Databases, CSV, and Parquet files in Azure Data Lake Storage and Blob Storage [6]
    • significantly scales up the data processing capacity providing high-scale ELT capabilities [6]
  • {feature|planned}Cancel refresh
    • enables to cancel ongoing Dataflow Gen2 refreshes from the workspace items view [6]
  • {feature} data destinations
    • allows to 
      • specify an output destination
      • separate ETL logic and destination storage [2]
    • every tabular data query can have a data destination [3]
      • available destinations
        • Azure SQL databases
        • Azure Data Explorer (Kusto)
        • Fabric Lakehouse
        • Fabric Warehouse
        • Fabric KQL database
      • a destination can be specified for every query individually [3]
      • multiple different destinations can be used within a dataflow [3]
      • connecting to the data destination is similar to connecting to a data source
      • {limitation} functions and lists aren't supported
    • {operation} creating a new table
      • {default} table name has the same name as the query name.
    • {operation} picking an existing table
    • {operation} deleting a table manually from the data destination 
      • doesn't recreate the table on the next refresh [3]
    • {operation} reusing queries from Dataflow Gen1
      • {method} export Dataflow Gen1 query and import it into Dataflow Gen2
        • export the queries as a PQT file and import them into Dataflow Gen2 [2]
      • {method} copy and paste in Power Query
        • copy the queries and paste them in the Dataflow Gen2 editor [2]
    • automatic settings:
      • {limitation} supported only for Lakehouse and Azure SQL database
      • {setting} Update method replace: 
        • data in the destination is replaced at every dataflow refresh with the output data of the dataflow [3]
      • {setting} Managed mapping: 
        • the mapping is automatically adjusted when republishing the data flow to reflect the change 
          • ⇒ doesn't need to be updated manually into the data destination experience every time changes occur [3]
      • {setting} Drop and recreate table: 
        • on every dataflow refresh the table is dropped and recreated to allow schema changes
        • {limitation} the dataflow refresh fails if any relationships or measures were added to the table [3]
    • update methods
      • {method} replace: 
        • on every dataflow refresh, the data is dropped from the destination and replaced by the output data of the dataflow.
        • {limitation} not supported by Fabric KQL databases and Azure Data Explorer 
      • {method} append: 
        • on every dataflow refresh, the output data from the dataflow is appended (aka merged) to the existing data in the data destination table (aka upsert)
    • staging 
      • {default} enabled
        • allows to use Fabric compute to execute queries
          • ⇐ enhances the performance of query processing
        • the data is loaded into the staging location
          • ⇐ an internal Lakehouse location accessible only by the dataflow itself
        • [Warehouse] staging is required before the write operation to the data destination
          • ⇐ improves performance
          • {limitation} only loading into the same workspace as the dataflow is supported
        •  using staging locations can enhance performance in some cases
      • disabled
        • {recommendation} [Lakehouse] disable staging on the query to avoid loading twice into a similar destination
          • ⇐ once for staging and once for data destination
          • improves dataflow's performance
    • {scenario} use a dataflow to load data into the lakehouse and then use a notebook to analyze the data [2]
    • {scenario} use a dataflow to load data into an Azure SQL database and then use a data pipeline to load the data into a data warehouse [2]
  • {benefit} extends data with consistent data, such as a standard date dimension table [1]
  • {benefit} allows self-service users access to a subset of data warehouse separately [1]
  • {benefit} optimizes performance with dataflows, which enable extracting data once for reuse, reducing data refresh time for slower sources [1]
  • {benefit} simplifies data source complexity by only exposing dataflows to larger analyst groups [1]
  • {benefit} ensures consistency and quality of data by enabling users to clean and transform data before loading it to a destination [1]
  • {benefit} simplifies data integration by providing a low-code interface that ingests data from various sources [1]
  • {limitation} not a replacement for a data warehouse [1]
  • {limitation} row-level security isn't supported [1]
  • {limitation} Fabric or Fabric trial capacity workspace is required [1]


Feature Data flow Gen2 Dataflow Gen1
Author dataflows with Power Query
Shorter authoring flow
Auto-Save and background publishing
Data destinations
Improved monitoring and refresh history
Integration with data pipelines
High-scale compute
Get Data via Dataflows connector
Direct Query via Dataflows connector
Incremental refresh ✓*
Fast Copy ✓*
Cancel refresh ✓*
AI Insights support
Dataflow Gen1 vs Gen2 [2]


Acronyms:
ETL - Extract, Transform, Load
KQL - Kusto Query Language
PQO - Power Query Online
PQT - Power Query Template

References:
[1] Microsoft Learn: Fabric (2023) Ingest data with Microsoft Fabric (link)
[2] Microsoft Learn: Fabric (2023) Getting from Dataflow Generation 1 to Dataflow Generation 2 (link)
[3] Microsoft Learn: Fabric (2023) Dataflow Gen2 data destinations and managed settings (link)
[4] Microsoft Learn: Fabric (2023) Dataflow Gen2 pricing for Data Factory in Microsoft Fabric (link)
[5] Microsoft Learn: Fabric (2023) Save a draft of your dataflow (link)
[6] Microsoft Learn: Fabric (2023) What's new and planned for Data Factory in Microsoft Fabric (link)

Resources:
[R1] Arshad Ali & Bradley Schacht (2024) Learn Microsoft Fabric (link)
[R2] Microsoft Learn: Fabric (2023) Data Factory limitations overview (link)
[R3] Microsoft Fabric Blog (2023) Data Factory Spotlight: Dataflow Gen2, by Miguel Escobar (link)
[R4] Microsoft Learn: Fabric (2023) Dataflow Gen2 connectors in Microsoft Fabric (link) 
[R5] Microsoft Learn: Fabric (2023) Pattern to incrementally amass data with Dataflow Gen2 (link)
[R6] Fourmoo (2004) Microsoft Fabric – Comparing Dataflow Gen2 vs Notebook on Costs and usability, by Gilbert Quevauvilliers (link)
[R7] Microsoft Learn: Fabric (2023) A guide to Fabric Dataflows for Azure Data Factory Mapping Data Flow users (link)
[R8] Microsoft Learn: Fabric (2023) Quickstart: Create your first dataflow to get and transform data (link)
[R9] Microsoft Learn: Fabric (2023) Microsoft Fabric decision guide: copy activity, dataflow, or Spark (link)
[R10] Microsoft Fabric Blog (2023) Dataflows Gen2 data destinations and managed settings, by Miquella de Boer  (link)
[R11] Microsoft Fabric Blog (2023) Service principal support to connect to data in Dataflow, Datamart, Dataset and Dataflow Gen 2, by Miquella de Boer (link)
[R12] Chris Webb's BI Blog (2023) Fabric Dataflows Gen2: To Stage Or Not To Stage? (link)
[R13] Power BI Tips (2023) Let's Learn Fabric ep.7: Fabric Dataflows Gen2 (link)

03 March 2023

🧊Data Warehousing: Architecture (Part IV: Building a Modern Data Warehouse with Azure Synapse)

Data Warehousing

Introduction

When building a data warehouse (DWH) several key words or derivatives of them appear in requirements: secure, flexible, simple, scalable, reliable, performant, non-redundant, modern, automated, real-timed, etc. As it proves in practice, all these requirements are sometimes challenging to address with the increased complexity of the architecture chosen. There are so many technologies on the DWH market promising all these at low costs, low effort and high ROI, though DWH projects continue to fail addressing the business and technical requirements.

On a basic level for building a DWH is needed a data storage layer and an ETL (Extract, Transfer, Load) tool responsible for the data movement between the various source systems and DWH, and eventually within the DWH itself. After that, each technology added to the landscape tends to increase the overall complexity (and should be regarded with a critical eye in what concerns the advantages and disadvantages).

Data Warehouse Architecture (on-premise)

A Reference Architecture

When building a DWH or a data migration solution, which has many of the characteristics of a DWH, from the many designs, I prefer to keep things as simple as possible.  An approach based on a performant database engine like SQL Server as storage layer and SSIS (SQL Server Integration Services) as ETL proved to be the best choice until now, allowing to address most of the technical requirements by design. Then come the choices on how and where to import and transform the data, at what level of granularity, on how the semantic layer is built, how the data are accessed, etc.

Being able to pull (see extract subprocess) the data from the data sources on a need by basis offers the most flexible approach, however there are cases in which the direct access to source data is not possible, having to rely on a push approach, where data are dumped regularly to a given location (e.g. FTP folder structure), following to be picked up as needed. It's actually a hybrid between a push and pull, because a fully push approach would mean pushing the data directly to the DWH, which can be also acceptable, though might offer lower control on data's movement and involve a few other challenges (e.g. permissions, concurrency). 

Data can be prepared for the DWH in the source systems (e.g. exposed via data objects or API calls), anywhere in between via ETL-based transformations (see transform subprocess) or directly in the DWH. I prefer importing the data (see load subprocess) 1:1 without any transformations from the various sources via SSIS (or similar technologies) into a set of tables that designated the staging area. It's true that in this way the ETL technology is used to a minimum, though unless there's a major benefit to use it for data transformations, using DWH's capabilities and SQL for data processing can provide better performance and flexibility

Besides the selection of the columns in scope (typically columns with meaningful values), it's important not to do any transformations in the extraction layer because the data is imported faster (eventually using fast load options as in SSIS) and it assures a basis for troubleshooting (as the data don't change between loads). Some filters can be applied only when the volume of data is high, and the subset of the data could be identified clearly (e.g. when data are partitioned based on a key like business unit, legal entity or creation date).

For better traceability, the staging schemas can reflect the systems they come from, the tables and the columns should have the same names, respectively same data types. On such tables no constraints are applied and no indexes are needed. They can be constructed however on the production tables (aka base tables) - copy of the tables from production. 

Some DWH architects try replicating the constraints from the source systems and/or add more constraints on top to define the various business rules. Rigor is good in some scenarios, though it can involve a considerable effort and it might be challenging to keep over time, especially when considering the impact of big data on DWH architectures. Instead of using constraints, building a set of SQL scripts that pinpoint the issues as reports allow more flexibility with the risk of having inconsistencies running wild through the reports. The data should be cleaned in the source system and not possible then properly addressed in the DWH. Applying constraints will make the data unavailable for reporting until data are corrected, while being more permissive would allow dirty data. Thus, either case has advantages or disadvantages, though the latter seems to be more appropriate. 

Indexes on the production schema should reflect the characteristics of the queries run on the data and shouldn't replicate the indexes from the source environments, even if some overlaps might exist. In practice, dropping the non-clustered indexes on the production tables before loading the data from staging, and recreating them afterwards proves to provide faster loading (see load optimization techniques). 

The production tables are used for building a "semantic" data model or something similar. Several levels of views, table-valued functions and/or indexed/materialized views allows building the dimensions and facts tables, the latter incorporating the business logic needed by the reports. Upon case, stored-procedures, physical or temporary tables, table variables can be used to prepare the data, though they tend to break the "free" flow of data as steps in-between need to be run. On the other side, in certain scenarios their use is unavoidable. 

The first level of views (aka base views) is based on the base tables without any joins, though they include only the fields in use (needed by the business) ordered and "grouped" together based on their importance or certain characteristics. The views can include conversions of data types, translations of codes into meaningful values, and quite seldom filters on the data. Based on these "base" views the second level is built, which attempts to define the dimension and fact tables at the lowest granularity. These views include joins between tables coming from the same or different systems, respectively mappings of values defined in tables, and whatever it takes to build such entities. However, transformations on individual fields are pushed, when possible, to the lower level to minimize logic redundancy. From similar reasons, the logic could be broken down over two or more "helper" views when visible benefits could be obtained from it (e.g troubleshooting, reuse, maintenance). It's important to balance between creating too many helper views and encapsulating too much logic in a view. 

One of the design principles used in building the entities is to minimize the redundance of the fields used, ideally without having columns duplicated between entities at this level. This would facilitate the traceability of columns to the source tables within the "semantic" layer (typically in the detriment of a few more joins). In practice, one is forced to replicate some columns to simplify some parts of the logic. 

Further views can be built based on the dimension and fact entities to define the logic needed by the reports. Only these objects are used and no direct reference to the "base" tables or views are made. Moreover, to offer better performance when the views can be materialized or, when there's an important benefit, physically saved as table (e.g. having multiple indexes for different scenarios). It's the case of entities with considerable data volume called over and over. 

This approach of building the entities is usually flexible enough to address most of the reporting requirements, independently whether the technical solution has the characteristics of a DWH, data mart or data migration layer. Moreover, the overall architectural approach can be used on-premise as well in cloud architectures, where Azure SQL Server and ADF (Azure Data Factory) provide similar capabilities. Compared with standard SQL Server, some features might not be available, while other features might bring further benefits, though the gaps should be neglectable.

Data Management topics like Master Data Management (MDM), Data Quality Management (DQM) and/or Metadata Management can be addressed as well by using third-party tools or tools from the Microsoft stack - Master Data Services (MDS) and Data Quality Services (DQS) in combination with SSIS help addressing a wide range of scenarios - however these are optional. 

Moving to the Cloud

Within the context of big data, characterized by (high/variable) volume, value, variety, velocity, veracity, and further less important V's, the before technical requirements still apply, however within a cloud environment the overall architecture becomes more complex. Each component becomes a service. There are thus various services for data ingestion, storage, processing, sharing, collaboration, etc. The way data are processed involves also several important transformations: ETL becomes ELT, FTP and local storage by Data Lakes, data packages by data pipelines, stateful by stateless, SMP (Symmetric Multi-Processing) by MPP (Massive Parallel Processing), and so on.

As file storage is less expensive than database storage, there's an increasing trend of dumping business critical data into the Data Lake via data pipelines or features like Link to Data Lake or Export to Data Lake (*), which synchronize the data between source systems and Data Lake in near real-time at table or entity level. Either saved as csv, parquet, delta lake or any other standard file format, in single files or partitions, the data can be used directly or indirectly for analytics.

Cloud-native warehouses allow addressing topics like scalability, elasticity, fault-tolerance and performance by design, though further challenges appear as compute needs to be decoupled from storage, the workloads need to be estimated for assuring the performance, data may be distributed across data centers spanning geographies, the infrastructure is exposed to attacks, etc. 

Azure Synapse

If one wants to take advantage of the MPP architecture's power, Microsoft provides an analytical architecture based on Azure Synapse, an analytics service that brings together data integration, enterprise DWH, and big data analytics. Besides two types of SQL-based data processing services  (dedicated vs serverless SQL pools) it comes also with a Spark pool for in-memory cluster computing.

A DWH based on Azure Synapse is not that different from the reference architecture described above for an on-premise solution. Actually, a DWH based on a dedicated SQL pool (aka a physical data warehouse) involves the same steps mentioned above. 

Data Warehouse Architecture with Dedicated SQL Pool

The data can be imported via ETL/ELT pipelines in the DWH, though there are also mechanisms for consuming the data directly from the files stored in the Data Lake or Azure storage. CETAS (aka Create External Table as Select) can be defined on top of the data files, the external tables acting as "staging" or "base" tables in the architecture described above. When using a dedicated SQL pool it makes sense to use the CETAS as "staging" tables, the processed data following to be dumped to "optimized" physical tables for consumption and refreshed periodically. However, when this happens the near real-time character of data is lost. Using the CETAs as base tables would keep this characteristic as long the data isn't saved physically in tables or files, maybe in the detriment of performance.

Using a dedicated SQL pool for direct reporting can become expensive as the pool needs to be available at least during business hours for incoming user requests, or at least for importing the data and refreshing the datasets. When using the CETAS as a base table, a serverless (aka on-demand) SQL pool, which uses a per-pay-use billing model could prove to be more cost-effective and flexible in many scenarios. By design, it helps to keep the near real-time character of the data. Moreover, even if the data are actually moved from the source tables into the Data Lake, this architecture has the characteristics of a logical data warehouse:

Data Warehouse Architecture with Serverless SQL Pool

Unfortunately, unless one uses Spark tables, misuses views or adds an Azure SQL database to the architecture, there are no physical tables or materialized views in a serverless SQL pool. There's still the option to use data pipelines for regullarly exporting intermediary data to files (incl. over partitions or folders), even if this involves more overhead as it's not possible to export data over SQL syntax to files more than once (though this might change in the future). For certain scenario it could be useful to store data in a Azure SQL Server or similar database, including a dedicated SQL pool. 

Choosing between serverless and dedicated SQL pool is not an exclusive choice, both or all 3 types of pools (if we consider also the Spark pool) can be used in the architecture for addressing specific challenges, especially when we consider that there are important differences between the features available in each of the pools. Moreover, one can start the PoC based on the serverless SQL pool and when the solution became mature enough and used in all enterprise, parts of the logic or all of it can be migrated to a dedicated SQL pool. This would allow to save costs at the beginning in the detriment of further effort later. 

Talking about the physical storage, data engineers recommend defining within a Data Lake several layers (aka regions, zones) labeled as bronze, silver and gold (and probably platinum will join the club anytime soon). The bronze layer refers to the raw data available in the Data Lake, including the files on which the initial CETAS are defined upon. The silver refers to transformed, cleaned, enriched and integrated data, data resulting from the second layer of views described above. The gold layer refers to the data to which business logic was applied and prepared for consumption, data resulting from the final layer of views. Of course, data pipelines can be used to prepare the data at these stages, though a view-based approach offers more flexibility, are easier to troubleshoot, manage and reuse than data pipelines.

Ideally the gold data should involve no or minimal further transformation before reaching the users, though that's not realistic. Building a DWH takes a considerable time and the business can't usually wait until everything is in place. Therefore, reports based on DWH will continue to coexist with reports directly accessing the source data, which will lead to controversies. Enforcing a single source of truth will help to minimize the gap, though will not eliminate it completely. 

Closing Notes

These are just outlines of a minimal reference architecture. There's more to consider, as there are several alternatives (see [1] [2] [3] [4]) for each of the steps considered in here, each technology, new features or mechanisms opening new opportunities. The advantages and disadvantages should be always considered against the business needs and requirements. One approach, even if recommended, might not work for all, though unless there's an important requirement or an opportunity associated with an additional technology, deviating from reference architectures might not be such a good idea afterall.

Note:
(*) Existing customers have until 1-Nov-2024 to transition from Export to Data lake to Synapse link. Microsoft advises new customers to use Synapse Link. 


Resources:
[1] Microsoft Learn (2022) Modern data warehouse for small and medium business (link)
[2] Microsoft Learn (2022) Data warehousing and analytics (link)
[3] Microsoft Learn (2022) Enterprise business intelligence (link)
[4] Microsoft Learn (2022) Serverless Modern Data Warehouse Sample using Azure Synapse Analytics and Power BI (link)
[5] Coursera (2023) Data Warehousing with Microsoft Azure Synapse Analytics (link) [course, free to audit]
[6] SQLBits (2020) Mahesh Balija's Building Modern Data Warehouse with Azure Synapse Analytics (link)
[7] Matt How (2020) The Modern Data Warehouse in Azure: Building with Speed and Agility on Microsoft’s Cloud Platform (Amazon)
[8] James Serra's blog (2022) Data lake architecture (link)
[9] SQL Stijn (2022) SQL Building a Modern Lakehouse Data Warehouse with Azure Synapse Analytics: Moving your Database to the lake (link)
[10] Solliance (2022) Azure Synapse Analytics Workshop 400 (link) [GitHub repository]

20 March 2021

🧭Business Intelligence: New Technologies, Old Challenges (Part II - ETL vs. ELT)

 

Business Intelligence

Data lakes and similar cloud-based repositories drove the requirement of loading the raw data before performing any transformations on the data. At least that’s the approach the new wave of ELT (Extract, Load, Transform) technologies use to handle analytical and data integration workloads, which is probably recommendable for the mentioned cloud-based contexts. However, ELT technologies are especially relevant when is needed to handle data with high velocity, variance, validity or different value of truth (aka big data). This because they allow processing the workloads over architectures that can be scaled with workloads’ demands.

This is probably the most important aspect, even if there can be further advantages, like using built-in connectors to a wide range of sources or implementing complex data flow controls. The ETL (Extract, Transform, Load) tools have the same capabilities, maybe reduced to certain data sources, though their newer versions seem to bridge the gap.

One of the most stressed advantages of ELT is the possibility of having all the (business) data in the repository, though these are not technological advantages. The same can be obtained via ETL tools, even if this might involve upon case a bigger effort, effort depending on the functionality existing in each tool. It’s true that ETL solutions have a narrower scope by loading a subset of the available data, or that transformations are made before loading the data, though this depends on the scope considered while building the data warehouse or data mart, respectively the design of ETL packages, and both are a matter of choice, choices that can be traced back to business requirements or technical best practices.

Some of the advantages seen are context-dependent – the context in which the technologies are put, respectively the problems are solved. It is often imputed to ETL solutions that the available data are already prepared (aggregated, converted) and new requirements will drive additional effort. On the other side, in ELT-based solutions all the data are made available and eventually further transformed, but also here the level of transformations made depends on specific requirements. Independently of the approach used, the data are still available if needed, respectively involve certain effort for further processing.

Building usable and reliable data models is dependent on good design, and in the design process reside the most important challenges. In theory, some think that in ETL scenarios the design is done beforehand though that’s not necessarily true. One can pull the raw data from the source and build the data models in the target repositories.

Data conversion and cleaning is needed under both approaches. In some scenarios is ideal to do this upfront, minimizing the effect these processes have on data’s usage, while in other scenarios it’s helpful to address them later in the process, with the risk that each project will address them differently. This can become an issue and should be ideally addressed by design (e.g. by building an intermediate layer) or at least organizationally (e.g. enforcing best practices).

Advancing that ELT is better just because the data are true (being in raw form) can be taken only as a marketing slogan. The degree of truth data has depends on the way data reflects business’ processes and the way data are maintained, while their quality is judged entirely on their intended use. Even if raw data allow more flexibility in handling the various requests, the challenges involved in processing can be neglected only under the consequences that follow from this.

Looking at the analytics and data integration cloud-based technologies, they seem to allow both approaches, thus building optimal solutions relying on professionals’ wisdom of making appropriate choices.

Previous Post <<||>>Next Post

11 March 2021

💠🗒️Microsoft Azure: Azure Data Factory [Notes]

Azure Data Factory - Concept Map

Acronyms:
Azure Data Factory (ADF)
Continuous Integration/Continuous Deployment (CI/CD)
Extract Load Transform (ELT)
Extract Transform Load (ETL)
Independent Software Vendors (ISVs)
Operations Management Suite (OMS)
pay-as-you-go (PAYG)
SQL Server Integration Services (SSIS)

Resources:
[1] Microsoft (2020) "Microsoft Business Intelligence and Information Management: Design Guidance", by Rod College
[2] Microsoft (2021) Azure Data Factory [source]
[3] Microsoft (2018) Azure Data Factory: Data Integration in the Cloud [source]
[4] Microsoft (2021) Integrate data with Azure Data Factory or Azure Synapse Pipeline [source]
[10] Coursera (2021) Data Processing with Azure [source]
[11] Sudhir Rawat & Abhishek Narain (2019) "Understanding Azure Data Factory: Operationalizing Big Data and Advanced Analytics Solutions"

03 February 2021

📦Data Migrations (DM): Conceptualization (Part V: Data Extraction Layer)

Data Migration

ETL tools are ideal for extracting the needed data from the legacy system(s). They offer a considerable number of connectors to standard databases that leverage legacy systems’ data access layers or own frameworks, both categories providing acceptable performance for a wide range of solutions. Otherwise, third-party connectors can be considered as well, though their advantage might reside in the extra features they bring out-of-the-box in the detriment of performance loss, and thus should be used with caution.

Besides that, ETL tools provide also rich visual functionality that allow users building complex pipelines with transformations that process the data as data go through the pipeline. Further features like data profiling or cleansing bring additional benefits.

As usually only a subset of the legacy data is needed for the migration, an ETL solution allows extracting only the data in scope as filtering and other logic can be used in the extraction mechanism. Whether one loads the tables or entities 1:1 or aggregates the data from multiple tables is a matter of choice, even if the former two approaches are usually recommended.

As alternative to an ETL tool is building own extraction layer based for example on a powerful data access layer like ADO.Net. This might prove to be a cheaper alternative especially when ETL capabilities aren’t needed. This depends also on the overall architectural approach. Attempting to build a desktop-based application for a DM can prove to be a foolhardy approach especially when dealing with a considerable volume of data. Moreover, it would be needed to build features that are already available in ETL tools (transformations, workflows) or databases (indexes for performance optimization, join-based logic).

When the volume of data exceeds the capabilities of ETL tools one can consider ELT tools which load first the data before applying any transformations on them. Such tools are designed for the processing of what is known as big data (data having high volume, high velocity, high variety and different veracity).

When considering the best data extraction approach, it’s important to know where the data will be stored for processing. Given that DMs are data processing intensive the best data storage solution for processing would be a modern relational database. Besides performance, scalability, security, concurrency, failover mechanism some databases offer the possibility to connect directly to other servers via server links functionality. Despite this latter feature an ETL tool can still have considerable advantages for data extraction.

On the other side the DM logic can be in theory built entirely in the ETL tool without storing the data within a database, though this adds a high overhead on the server resources on which the solution runs as all the data needed for processing need to be loaded in memory. Even if the data are loaded in batches and processed as the batches go through the pipeline, the complexity of the processing can make challenging implementing any optimization techniques directly into the ETL tool. Moreover, fully ETL-based solutions are difficult to troubleshoot and change as the requirements change.

To address the high resources’ consumption of the ETL tools one can store the intermediary results into database tables on which indexes can be created for performance optimization. Moreover, the logic can be encapsulated in database objects and used in the processing. This approach enables troubleshooting, performing validations and restarting the processing from a given step in the detriment of splitting the logic between multiple ETL packages. This can be an acceptable price to pay for more flexibility. Given that most ETL transformations can be replaced with SQL-based logic the ETL tool can be used only for data extraction.

Previous Post <<||>> Next Post

📦Data Migrations (DM): Conceptualization (Part III: Heuristics)

Data Migration

Probably one of the most difficult things to learn as a technical person is using the right technology for a given purpose, this mainly because one’s inclined using the tools one knows best. Moreover, technologies’ overlapping makes the task more and more challenging, the difference between competing technologies often residing in the details. Thus, identifying the gaps resumes in understanding the details of the problem(s) or need(s), respectively the advantages or disadvantages of a technology over the other. This is true especially about competing technologies, including the ones that replace other technologies.

There are simple heuristics, that can allow approaching such challenges. For example, heavy data processing belongs usually in databases, while import/export functionality belongs in an ETL tool.  Therefore, one can start looking at the problems from these two perspectives. Would the solution benefit from these two approaches or are there more appropriate technologies (e.g. data streaming, ELT, non-relational databases)? How much effort would involve building the solution? 

Commercial Off-The-Shelf (COTS) tools provided by third-party vendors usually offer specialized functionality in each area. Gartner and Forrester provide regular analyses of the main players in the important areas, analyses which can be used in theory as basis for further research. Even if COTS tend to be more expensive and can have some important functionality gaps, as long they are extensible, they can prove a good starting point for developing a solution. 

Sometimes it helps researching on the web what other people or organizations did, how they approached the same aspects, what technologies, techniques and best practices they used to overcome the challenges. One doesn’t need to reinvent the wheel even if it’s sometimes fun to do so. Moreover, a few hours of research can give one a basis of useful information and a better understanding over the work ahead.

On the other side sometimes it’s advisable to use the tools one knows best, however this can lead also to unusable and less performant solutions. For example, MS Excel and Access have been for years the tools of choice for building personal solutions that later grew into maintenance nightmares for the IT team. Ideally, they can still be used for data entry or data cleaning, though building solutions exclusively based on (one of) them can prove to be far than optimal. 

When one doesn’t know whether a technology or mix of technologies can be used to provide a solution, it’s recommended to start a proof-of-concept (PoC) that would allow addressing most important aspects of the needed solution. One can start small by focusing on the minimal functionality needed to check the main aspects and evolve the PoC during several iterations as needed.

For example, in the case of a Data Migration (DM) this would involve building the data extraction layer for an entity, implement several data transformations based on the defined mappings, consider building a few integrity rules for validation, respectively attempt importing the data into the target system. Once this accomplished, one can start increasing the volume of data to check how the solution behaves under stress. The volume of data can be increased incrementally or by considering all the data available. 

As soon the skeleton was built one can consider all the mappings, respectively add several entities to build the dependencies existing between them and other functionality. The prototype might not address all the requirements from the beginning, therefore consider the problems as they arise. For example, if the volume of data seems to cause problems then attempt splitting the data during processing in batches or considering specific optimization techniques like indexing or scaling techniques like increasing computing resources. 

Previous Post <<||>> Next Post

16 March 2010

🕋Data Warehousing: Data Pipeline/Pipelining (Definitions)

"A series of operations in an aggregation process." (MongoDb, "Glossary", 2008)

"A series of processes all in a row, linked by pipes, where each passes its output stream to the next." (Jon Orwant et al, "Programming Perl" 4th Ed., 2012)

"Description of the process workflow in sequential order." (Hamid R Arabnia et al, "Application of Big Data for National Security", 2015)

"In data processing, a pipeline is a sequence of processing steps combined into a single object. In Spark MLlib, a pipeline is a sequence of stages. A Pipeline is an estimator containing transformers, estimators, and evaluators. When it is trained, it produces a PipelineModel containing transformers, models, and evaluators." (Alex Thomas, "Natural Language Processing with Spark NLP", 2020)

"Abstract concept used to describe where work is broken into several steps which enable multiple tasks to be in progress at the same time. Pipelining is applied in processors to increase processing of machine language instructions and is also a category of functional decomposition that reduces the synchronization cost while maintaining many of the benefits of concurrent execution." (Max Domeika, "Software Development for Embedded Multi-core Systems", 2011)

"A technique that breaks an instruction into smaller steps that can be overlapped" (Nell Dale & John Lewis, "Computer Science Illuminated" 6th Ed., 2015)

[pipeline pattern:] "A set of data processing elements connected in series, generally so that the output of one element is the input of the next one. The elements of a pipeline are often executed concurrently. Describing many algorithms, including many signal processing problems, as pipelines is generally quite natural and lends itself to parallel execution. However, in order to scale beyond the number of pipeline stages, it is necessary to exploit parallelism within a single pipeline stage." (Michael McCool et al, "Structured Parallel Programming", 2012)

"A data pipeline is a general term for a process that moves data from a source to a destination. ETL (extract, transform, and load) uses a data pipeline to move the data it extracts from a source to the destination, where it loads the data." (Jake Stein)

"A data pipeline is a piece of infrastructure responsible for routing data from where it is to where it needs to go and provide any necessary transformations through that process." (Precisely) [source

"A data pipeline is a service or set of actions that process data in sequence. This means that the results or output from one segment of the system become the input for the next. The usual function of a data pipeline is to move data from one state or location to another."(SnapLogic) [source]

"A data pipeline is a software process that takes data from sources and pushes it to a destination. Most modern data pipelines are automated with an ETL (Extract, Transform, Load) platform." (Xplenty) [source

"A data pipeline is a set of actions that extract data (or directly analytics and visualization) from various sources. It is an automated process: take these columns from this database, merge them with these columns from this API, subset rows according to a value, substitute NAs with the median and load them in this other database." (Alan Marazzi)

"A source and all the transformations and targets that receive data from that source. Each mapping contains one or more pipelines." (Informatica)

"An ETL Pipeline refers to a set of processes extracting data from an input source, transforming the data, and loading into an output destination such as a database, data mart, or a data warehouse for reporting, analysis, and data synchronization." (Databricks) [source]

"Data pipeline consists of a set of actions performed in real-time or in batches, that captures data from various sources, sorting it and then moving that data through applications, filters, and APIs for storage and analysis." (EAI) 

Related Posts Plugin for WordPress, Blogger...

About Me

My photo
Koeln, NRW, Germany
IT Professional with more than 24 years experience in IT in the area of full life-cycle of Web/Desktop/Database Applications Development, Software Engineering, Consultancy, Data Management, Data Quality, Data Migrations, Reporting, ERP implementations & support, Team/Project/IT Management, etc.