- Microsoft Azure: Azure Data Factory (ADF)
-
{definition} a pay-per-use, serverless, cloud-based data integration service that orchestrates and automates the movement and transformation of data across both cloud-based and on-premises data sources [1]
- ⇐ a hybrid and scalable data integration service for Big Data and advanced end-to-end analytics solutions [11]
- ⇐ Microsoft Azure's PaaS offering for ETL/ELT workloads, currently in its second generation [11]
- allows creating data-driven workflows that orchestrate the movement of data between supported data stores and the processing of data by compute services in other regions or in an on-premises environment
-
{benefit} easy-to-use
- {feature} allows creating code-free pipelines with drag-and-drop functionality [2]
- {feature} uses JSON to describe each of its entities (see the sketch below)
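A minimal sketch of such a JSON entity definition; the name is hypothetical and the shape follows the documented ADFv2 pipeline schema:

```json
{
    "name": "MinimalPipeline",
    "properties": {
        "description": "hypothetical example: every ADF entity (pipeline, dataset, linked service, trigger) is declared as JSON",
        "activities": []
    }
}
```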
-
{benefit} cost-effective
- pay-as-you-go (PAYG) model billed against the Azure subscription, with no up-front costs
-
low price-to-performance ratio
- ⇐ cost-effective and performant at the same time
-
fully managed serverless cloud service that scales on demand [2]
- ⇒ requires zero hardware maintenance [1]
- ⇒ can easily scale beyond what was originally anticipated [1]
- does not store any data [1]
-
provides additional cost-saving functionality [11]
- {feature} takes care of provisioning the cluster and tearing it down once the job has executed [11]
-
{benefit} powerful
- allows ingesting on-premises and cloud-based data sources
-
high-performance hybrid connectivity
- over 90 built-in connectors make it easy to interact with all kinds of technologies [11]
-
orchestrate at scale
- on-demand compute
- Big Data workloads are scaled out over multiple nodes, which process chunks of data in parallel [11]
-
{feature} [ADFv2] monitoring
-
richer than in the first version and natively integrated with Azure Monitor and OMS [11]
- includes feature-rich monitoring and management tools to visualize the current state of data pipelines, data lineage and pipeline dependencies [1]
-
{feature} [ADFv2] control flow functionality
-
lets users define complex workflows through programmatic or UI mechanisms
- allows defining parameters at the pipeline level [11]
- includes custom state passing and looping containers [11] (both sketched below)
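A hedged sketch combining both features, assuming a hypothetical pipeline with an Array parameter looped over by a ForEach activity (the Wait activity merely stands in for real per-item work):

```json
{
    "name": "LoadAllTables",
    "properties": {
        "description": "hypothetical sketch of pipeline parameters plus a looping container",
        "parameters": {
            "tableList": { "type": "Array", "defaultValue": [ "Customers", "Orders" ] }
        },
        "activities": [
            {
                "name": "PerTableLoop",
                "type": "ForEach",
                "typeProperties": {
                    "items": { "value": "@pipeline().parameters.tableList", "type": "Expression" },
                    "activities": [
                        { "name": "WaitPerTable", "type": "Wait", "typeProperties": { "waitTimeInSeconds": 1 } }
                    ]
                }
            }
        ]
    }
}
```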
-
pipelines can be authored via additional tools
- e.g. PowerShell, .NET, Python, REST APIs
- ⇒ helps ISVs build SaaS-based analytics solutions on top of ADF app models
-
{benefit} intelligent
- autonomous ETL unlocks operational efficiencies and enables citizen integrators [2]
-
{benefit} enterprise-grade security
- provides the same security standards as any other Microsoft service [11]
-
{benefit} monthly release cycle
- {feature} via auto-update
- improvements may include support for new connectors, bug fixes, security patches, and performance improvements [11]
-
{benefit} backwards compatibility
-
{feature} [ADFv2] allows rehosting SSIS solutions [2]
- ⇒ helpful for modernizing data warehouse solutions
- {prerequisite} an Azure subscription with the contributor role assigned to at least one resource group
-
{limitation} availability
-
the service isn’t available in all regions
-
an instance can be made available in another region to trigger the job on the customer's compute environment [1]
- ⇐ the time for executing the job on the compute environment doesn’t change [1]
-
{concept} activity
- the unit of orchestration in ADF [1]
- defines the actions to perform on data [1]
- takes zero or more datasets as inputs and produces one or more datasets as outputs [1] (a sample activity is sketched below)
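A hedged sketch of a single data movement activity (a Copy activity) with one input and one output dataset; all names are hypothetical and the referenced datasets are assumed to exist:

```json
{
    "name": "CopySalesData",
    "description": "hypothetical example of an activity consuming and producing datasets",
    "type": "Copy",
    "inputs": [ { "referenceName": "SalesBlobDataset", "type": "DatasetReference" } ],
    "outputs": [ { "referenceName": "SalesSqlDataset", "type": "DatasetReference" } ],
    "typeProperties": {
        "source": { "type": "DelimitedTextSource" },
        "sink": { "type": "AzureSqlSink" }
    }
}
```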
-
activity types
- data movement activities
- data transformation activities
-
control activities
- control how the pipeline works and interacts with the data [10]
- allow executing pipelines [10]
- allow running ForEach loops or Lookup activities [10]
-
{concept} pipeline
-
logical grouping of activities that together perform a task [1]
- the sequence can have a complex schedule and dependencies that need to be orchestrated and automated [1]
- two activities can be chained by setting the output dataset of one activity as the input dataset of the other activity (sketched below)
- allows building ETL/ELT workloads
- scheduled by scheduler triggers [10]
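The dataset-chaining description above matches ADFv1; in ADFv2 the same sequencing is usually expressed with an explicit dependsOn property. A hedged sketch with hypothetical names:

```json
{
    "name": "EtlSalesPipeline",
    "properties": {
        "description": "hypothetical sketch: the stored procedure runs only after the copy succeeds",
        "activities": [
            {
                "name": "CopySalesData",
                "type": "Copy",
                "inputs": [ { "referenceName": "SalesBlobDataset", "type": "DatasetReference" } ],
                "outputs": [ { "referenceName": "SalesSqlDataset", "type": "DatasetReference" } ],
                "typeProperties": {
                    "source": { "type": "DelimitedTextSource" },
                    "sink": { "type": "AzureSqlSink" }
                }
            },
            {
                "name": "TransformSalesData",
                "type": "SqlServerStoredProcedure",
                "dependsOn": [ { "activity": "CopySalesData", "dependencyConditions": [ "Succeeded" ] } ],
                "linkedServiceName": { "referenceName": "SqlDbLinkedService", "type": "LinkedServiceReference" },
                "typeProperties": { "storedProcedureName": "dbo.TransformSales" }
            }
        ]
    }
}
```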
-
data in a pipeline is referred to by different names
- ⇐ based on the amount of modification that has been performed
-
raw data
-
data with no processing applied [10]
- ⇒ does not yet have a schema applied
- stored in the message encoding format used to send tracking events, e.g. JSON [10]
-
can be organized into meaningful data stores and data lakes [10]
- ⇐ further used in decision-making
-
it's common to send all tracking events as raw events
- ⇐ because all events can be sent to a single endpoint and schemas can be applied later in the pipeline [10]
-
processed data
-
raw data that has been decoded into event-specific formats, with the schema applied
- e.g. JSON tracking events that have been translated into a session start event with a fixed schema [10]
- usually stored in different event tables/destinations in a data pipeline [10]
-
cooked data
- processed data that has been aggregated or summarized [10] (all three stages are sketched below)
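A purely illustrative sketch of the three stages for a hypothetical session-tracking event; every field name and value below is invented for illustration:

```json
{
    "raw":       { "payload": "{\"evt\":\"session_start\",\"ts\":\"2021-06-01T08:30:00Z\",\"user\":\"u123\"}" },
    "processed": { "eventType": "SessionStart", "timestamp": "2021-06-01T08:30:00Z", "userId": "u123" },
    "cooked":    { "date": "2021-06-01", "sessionsStarted": 1842 }
}
```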
-
{concept} pipeline parameters
-
similar to SSIS package parameters
- ⇐ need to be set from outside packages
- can be passed from the parent pipeline (as sketched below)
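A hedged sketch of a parent pipeline handing a parameter down through an Execute Pipeline activity; the pipeline and parameter names are hypothetical:

```json
{
    "name": "RunChildPipeline",
    "description": "hypothetical example of passing a parameter to a child pipeline",
    "type": "ExecutePipeline",
    "typeProperties": {
        "pipeline": { "referenceName": "ChildPipeline", "type": "PipelineReference" },
        "parameters": { "sourceFolder": "@pipeline().parameters.sourceFolder" },
        "waitOnCompletion": true
    }
}
```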
-
{concept} dataset
- named references/pointers to the data used as an input or an output of an activity [1]
-
identifies data structures within different (linked) data stores [1]
- ⇐ before creating a dataset, a linked service must be created to link the data store to ADF [10]
-
once created, it can be used with activities in a pipeline [10]
- e.g. a dataset can be an input or output dataset of a copy activity (see the sketch below)
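A hedged sketch of a dataset pointing at a delimited text file in blob storage; the names are hypothetical and the linked service is assumed to exist (see the next concept):

```json
{
    "name": "SalesBlobDataset",
    "properties": {
        "description": "hypothetical example of a dataset over a CSV file",
        "type": "DelimitedText",
        "linkedServiceName": { "referenceName": "BlobStorageLinkedService", "type": "LinkedServiceReference" },
        "typeProperties": {
            "location": { "type": "AzureBlobStorageLocation", "container": "sales", "fileName": "sales.csv" },
            "columnDelimiter": ",",
            "firstRowAsHeader": true
        }
    }
}
```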
-
{concept} linked service
-
defines the information needed by ADF to connect to external resources at runtime
- much like connection strings, which define the connection information [10] (a sketch follows)
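A hedged sketch of an Azure Blob Storage linked service; the name is hypothetical and the connection string placeholders must be filled in (secrets are better kept in Azure Key Vault):

```json
{
    "name": "BlobStorageLinkedService",
    "properties": {
        "description": "hypothetical example of a connection definition",
        "type": "AzureBlobStorage",
        "typeProperties": {
            "connectionString": "DefaultEndpointsProtocol=https;AccountName=<account>;AccountKey=<key>"
        }
    }
}
```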
-
used to represent
-
{concept} data store
- holds the input and output data used by ADF
- e.g. tables, files, folders, and documents
-
{concept} compute resource
- can host the execution of an activity [1]
-
{concept} scheduler triggers
-
allow pipelines to be triggered on a wall-clock schedule [10]
-
pipelines and triggers have an n-m relationship
- multiple triggers can kick off a single pipeline
- the same trigger can kick off multiple pipelines
- manual triggers trigger pipelines on demand [10]
-
once defined, a trigger must be started to begin triggering the pipeline [10]
- comes into effect only after publishing the solution to ADF [10]
- ⇐ not when saving the trigger in the UI [10]
- to run a pipeline, a pipeline reference must be included in the trigger definition [10] (see the sketch below)
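A hedged sketch of a schedule trigger bound to a pipeline; the names, recurrence and parameter values are hypothetical:

```json
{
    "name": "DailyTrigger",
    "properties": {
        "description": "hypothetical example: runs the referenced pipeline once a day",
        "type": "ScheduleTrigger",
        "typeProperties": {
            "recurrence": { "frequency": "Day", "interval": 1, "startTime": "2021-06-01T06:00:00Z", "timeZone": "UTC" }
        },
        "pipelines": [
            {
                "pipelineReference": { "referenceName": "CopySalesPipeline", "type": "PipelineReference" },
                "parameters": { "sourceFolder": "incoming" }
            }
        ]
    }
}
```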
-
there is a cost associated with each pipeline run
- {recommendation} when testing, make sure that the pipeline is triggered only a couple of times [10]
- {recommendation} ensure that there is enough time for the pipeline to run between the published time and the end time [10]
-
Acronyms:
Azure Data Factory (ADF)
Continuous Integration/Continuous Deployment (CI/CD)
Extract Load Transform (ELT)
Extract Transform Load (ETL)
Independent Software Vendors (ISVs)
Operations Management Suite (OMS)
pay-as-you-go (PAYG)
SQL Server Integration Services (SSIS)
Resources:
[1] Microsoft (2020) "Microsoft Business Intelligence and Information Management: Design Guidance", by Rod College
[2] Microsoft (2021) Azure Data Factory [source]
[3] Microsoft (2018) Azure Data Factory: Data Integration in the Cloud [source]
[4] Microsoft (2021) Integrate data with Azure Data Factory or Azure Synapse Pipeline [source]
[10] Coursera (2021) Data Processing with Azure [source]
[11] Sudhir Rawat & Abhishek Narain (2019) "Understanding Azure Data Factory: Operationalizing Big Data and Advanced Analytics Solutions"