-
Azure Data Factory (ADF)
-
{definition} a pay-per-use, serverless, cloud-based data integration service that orchestrates and automates the movement and transformation of data across both cloud-based and on-premises data sources [1]
- ⇐ a hybrid and scalable data integration service for Big Data and advanced end-to-end analytics solutions [11]
- ⇐ Microsoft Azure PaaS offering for ETL/ELT workloads, now in its second generation [11]
- allows creating data-driven workflows that orchestrate the movement of data between supported data stores and process data using compute services in other regions or in an on-premises environment
-
{benefit} easy-to-use
- {feature} allows creating code-free pipelines with drag-and-drop functionality [2]
- {feature} uses JSON to describe each of its entities
-
{benefit} cost-effective
- pay-as-you-go model against the Azure subscription with no up-front costs
-
low price-to-performance ratio
- ⇐ cost-effective and performant at the same time
-
fully managed serverless cloud service that scales on demand [2]
- ⇒ requires zero hardware maintenance [1]
- ⇒ can easily scale beyond what was originally anticipated [1]
- does not store any data [1]
-
provides additional cost-saving functionality [11]
- {feature} it takes care of provisioning the cluster and tearing it down once the job has executed [11]
-
{benefit} powerful
- allows ingesting on-premises and cloud-based data sources
-
high-performance hybrid connectivity
- over 90 built-in connectors make it easy to interact with all kinds of technologies [11]
-
orchestrate at scale
- on-demand compute
- Big Data workloads are scaled out over multiple nodes that process chunks of data in parallel [11]
-
{feature} [ADFv2] monitoring
-
richer than in ADFv1, natively integrating with Azure Monitor and OMS [11]
- includes feature-rich monitoring and management tools to visualize the current state of data pipelines, data lineage and pipeline dependencies [1]
-
{feature} [ADFv2] control flow functionality
-
allows defining complex workflows via programmatic or UI mechanisms
- allows defining parameters at pipeline level [11]
- includes custom state passing and looping containers [11]
-
pipelines can be authored via additional tools
- e.g. PowerShell, .NET, Python, REST APIs
- ⇒ helps ISVs build SaaS-based analytics solutions on top of ADF app models
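As a minimal sketch of such programmatic authoring, assuming the azure-identity and azure-mgmt-datafactory Python packages (the subscription, resource group, factory name, and region below are placeholders):

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import Factory

subscription_id = "<subscription-id>"   # placeholder
rg_name = "<resource-group>"            # placeholder
df_name = "<factory-name>"              # placeholder

credential = DefaultAzureCredential()
adf_client = DataFactoryManagementClient(credential, subscription_id)

# The factory is the root entity; linked services, datasets,
# pipelines, and triggers are all created under it.
df = adf_client.factories.create_or_update(
    rg_name, df_name, Factory(location="westeurope")
)
print(df.provisioning_state)
```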
-
{benefit} intelligent
- autonomous ETL allows unlocking operational efficiencies and enables citizen integrators [2]
-
{benefit} enterprise-grade security
- provides the same security standards as any other Microsoft service [11]
-
{benefit} monthly release cycle
- {feature} via auto-update
- improvements may include support for new connectors, bug fixes, security patches, and performance improvements [11]
-
{benefit} backwards compatibility
-
{feature} [ADFv2] allows rehosting SSIS solutions [2]
- ⇒ helpful for modernizing data warehouse solutions
- {prerequisite} an Azure subscription with the Contributor role assigned to at least one resource group
-
{limitation} availability
-
the service isn’t available in all regions
-
an instance can be made available in another region to trigger the job on the customer's compute environment [1]
- ⇐ the time for executing the job on the compute environment doesn’t change [1]
-
{concept} activity
- the unit of orchestration in ADF [1]
- defines the actions to perform on data [1]
- takes zero or more datasets as inputs and produces one or more datasets as outputs [1]
-
activity types
- data movement activities
- data transformation activities
-
control activities
- control how the pipeline works and interacts with the data [10]
- allow executing pipelines [10]
- allow running ForEach loops or Lookup activities [10]
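A sketch of one activity of each kind, reusing the hypothetical SDK client from the authoring example above; all dataset and pipeline names are illustrative assumptions:

```python
from azure.mgmt.datafactory.models import (
    BlobSink, BlobSource, CopyActivity, DatasetReference,
    ExecutePipelineActivity, PipelineReference,
)

# Data movement activity: copy between two blob datasets.
# (Newer SDK versions may also require type="DatasetReference".)
copy = CopyActivity(
    name="CopyBlobToStage",
    inputs=[DatasetReference(reference_name="InputDataset")],
    outputs=[DatasetReference(reference_name="StageDataset")],
    source=BlobSource(),
    sink=BlobSink(),
)

# Control activity: execute another pipeline from this one.
run_child = ExecutePipelineActivity(
    name="RunChildPipeline",
    pipeline=PipelineReference(reference_name="ChildPipeline"),
    wait_on_completion=True,
)
```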
-
{concept} pipeline
-
logical grouping of activities that together perform a task [1]
- the sequence can have a complex schedule and dependencies that need to be orchestrated and automated [1]
- two activities can be chained by setting the output dataset of one activity as the input dataset of the other (see the sketch after this list)
- allows building ETL/ELT workloads
- scheduled by scheduler triggers [10]
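Continuing the same sketch, the two copy activities can be grouped into a pipeline and chained; ADFv2 declares the ordering explicitly via depends_on, whereas in ADFv1 it was implied by the shared dataset:

```python
from azure.mgmt.datafactory.models import ActivityDependency, PipelineResource

# Second copy activity: consumes the first activity's output dataset
# and declares an explicit dependency on the first activity.
stage_to_dw = CopyActivity(
    name="CopyStageToWarehouse",
    inputs=[DatasetReference(reference_name="StageDataset")],
    outputs=[DatasetReference(reference_name="WarehouseDataset")],
    source=BlobSource(),
    sink=BlobSink(),
    depends_on=[ActivityDependency(
        activity="CopyBlobToStage", dependency_conditions=["Succeeded"]
    )],
)

# Group both activities into one pipeline and publish it.
pipeline = PipelineResource(activities=[copy, stage_to_dw])
adf_client.pipelines.create_or_update(rg_name, df_name, "EtlPipeline", pipeline)
```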
-
data in a pipeline is referred to by different names
- ⇐ based on the amount of modification that has been performed
-
raw data
-
data with no processing applied [10]
- ⇒ does not yet have a schema applied
- stored in the message encoding format used to send tracking events, such as JSON [10]
-
can be organized into meaningful data stores and data lakes [10]
- ⇐ further used in decision-making
-
it's common to send all tracking events as raw events
- ⇐ because all events can be sent to a single endpoint and schemas can be applied later in the pipeline [10]
-
processed data
-
raw data that has been decoded into event-specific formats with the schema applied
- e.g. JSON tracking events that have been translated into a session start event with a fixed schema [10]
- usually stored in different event tables/destinations in a data pipeline [10]
-
cooked data
- processed data that has been aggregated or summarized [10]
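A toy, ADF-independent Python illustration of the three stages; the event shape is invented for the example:

```python
import json
from collections import defaultdict

# Raw data: JSON-encoded tracking events exactly as they were sent.
raw_events = [
    '{"type": "session_start", "user": "u1", "duration": 120}',
    '{"type": "session_start", "user": "u2", "duration": 300}',
]

# Processed data: decoded into an event-specific format with a fixed schema.
processed = [json.loads(event) for event in raw_events]

# Cooked data: processed data aggregated/summarized for decision-making.
cooked = defaultdict(int)
for event in processed:
    cooked[event["type"]] += event["duration"]

print(dict(cooked))  # {'session_start': 420}
```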
-
{concept} pipeline parameters
-
similar to SSIS package parameters
- ⇐ need to be set from outside the package
- can be passed from the parent pipeline
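A sketch of declaring a pipeline-level parameter and passing a value from a parent pipeline, continuing the client from the earlier sketches; names and the path value are assumptions:

```python
from azure.mgmt.datafactory.models import (
    ExecutePipelineActivity, ParameterSpecification,
    PipelineReference, PipelineResource,
)

# The child pipeline declares a parameter at pipeline level; its
# activities would reference it as @pipeline().parameters.inputPath.
child = PipelineResource(
    parameters={"inputPath": ParameterSpecification(type="String")},
    activities=[],
)
adf_client.pipelines.create_or_update(rg_name, df_name, "ChildPipeline", child)

# The parent passes a concrete value when invoking the child.
run_child = ExecutePipelineActivity(
    name="RunChild",
    pipeline=PipelineReference(reference_name="ChildPipeline"),
    parameters={"inputPath": "landing/2021/07/"},
)
```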
-
{concept} dataset
- named references/pointers to the data used as an input or an output of an activity [1]
-
identifies data structures within different (linked) data stores [1]
- ⇐ before creating a dataset, a linked service must be created to link the data store to ADF [10]
-
once created, it can be used with activities in a pipeline [10]
- e.g. a dataset can be an input or output dataset of a copy activity
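For example, a dataset pointing at a single blob file might be declared as follows (a sketch reusing the client from the earlier sketches; the linked service name and path are placeholders):

```python
from azure.mgmt.datafactory.models import (
    AzureBlobDataset, DatasetResource, LinkedServiceReference,
)

# A dataset is a named pointer to data inside a store that a linked
# service already connects to; here, one file in a blob folder.
blob_ds = DatasetResource(
    properties=AzureBlobDataset(
        linked_service_name=LinkedServiceReference(
            reference_name="StorageLinkedService"
        ),
        folder_path="adf-demo/input",
        file_name="input.txt",
    )
)
adf_client.datasets.create_or_update(rg_name, df_name, "InputDataset", blob_ds)
```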
-
{concept} linked service
-
defines the information needed by ADF to connect to external resources at runtime
- much like connection strings, which define the connection information [10]
-
used to represent
-
{concept} data store
- holds the input and output data for ADF
- e.g. tables, files, folders, and documents
-
{concept} compute resource
- can host the execution of an activity [1]
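A sketch of creating a data-store linked service, assuming the AzureStorageLinkedService model and reusing the earlier client; the connection string is a placeholder:

```python
from azure.mgmt.datafactory.models import (
    AzureStorageLinkedService, LinkedServiceResource, SecureString,
)

# The linked service carries the connection information, much like a
# connection string (the account name/key here are placeholders).
storage_ls = LinkedServiceResource(
    properties=AzureStorageLinkedService(
        connection_string=SecureString(
            value="DefaultEndpointsProtocol=https;AccountName=<account>;AccountKey=<key>"
        )
    )
)
adf_client.linked_services.create_or_update(
    rg_name, df_name, "StorageLinkedService", storage_ls
)
```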
-
{concept} scheduler triggers
-
allow pipelines to be triggered on a wall-clock schedule [10]
-
pipelines and triggers have an n-m relationship
- multiple triggers can kick off a single pipeline
- the same trigger can kick off multiple pipelines
- manual triggers trigger pipelines on demand [10]
- once defined, a trigger must be started to begin triggering the pipeline [10]
-
comes into effect only after publishing the solution to ADF [10]
- ⇐ not when saving the trigger in the UI [10]
- to run a pipeline, a pipeline reference must be included in the trigger definition [10]
-
there is a cost associated with each pipeline run
- {recommendation} when testing, make sure that the pipeline is triggered only a couple of times [10]
- {recommendation} ensure that there is enough time for the pipeline to run between the published time and the end time [10]
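A sketch that defines, registers, and starts such a trigger, continuing the earlier client; the pipeline name is assumed, and begin_start is the track-2 SDK spelling (older SDKs expose start):

```python
from datetime import datetime, timedelta
from azure.mgmt.datafactory.models import (
    PipelineReference, ScheduleTrigger, ScheduleTriggerRecurrence,
    TriggerPipelineReference, TriggerResource,
)

# Run the pipeline every 15 minutes inside a one-hour window; a short
# window keeps the number of billed runs small while testing.
start = datetime.utcnow() + timedelta(minutes=5)
recurrence = ScheduleTriggerRecurrence(
    frequency="Minute",
    interval=15,
    start_time=start,
    end_time=start + timedelta(hours=1),
    time_zone="UTC",
)
trigger = TriggerResource(
    properties=ScheduleTrigger(
        recurrence=recurrence,
        # the trigger definition must include a reference to the pipeline
        pipelines=[TriggerPipelineReference(
            pipeline_reference=PipelineReference(reference_name="EtlPipeline"),
            parameters={},
        )],
    )
)
adf_client.triggers.create_or_update(rg_name, df_name, "EveryFifteenMinutes", trigger)

# Creating/saving the trigger is not enough: it only fires once started
# (and once the solution is published).
adf_client.triggers.begin_start(rg_name, df_name, "EveryFifteenMinutes")
```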
-
Acronyms:
Azure Data Factory (ADF)
Continuous Integration/Continuous Deployment (CI/CD)
Extract Load Transform (ELT)
Extract Transform Load (ETL)
Independent Software Vendors (ISVs)
Operations Management Suite (OMS)
pay-as-you-go (PAYG)
SQL Server Integration Services (SSIS)
Resources:
[1] Microsoft (2020) "Microsoft Business Intelligence and Information Management: Design Guidance", by Rod College
[2] Microsoft (2021) "Azure Data Factory" [source]
[3] Microsoft (2018) "Azure Data Factory: Data Integration in the Cloud" [source]
[4] Microsoft (2021) "Integrate data with Azure Data Factory or Azure Synapse Pipeline" [source]
[10] Coursera (2021) "Data Processing with Azure" [source]
[11] Sudhir Rawat & Abhishek Narain (2019) "Understanding Azure Data Factory: Operationalizing Big Data and Advanced Analytics Solutions"