- Microsoft Azure: Azure Data Factory (ADF)
    - {definition} a pay-per-use, serverless, cloud-based data integration service that orchestrates and automates the movement and transformation of both cloud-based and on-premises data sources [1]
        - ⇐ a hybrid and scalable data integration service for Big Data and advanced end-to-end analytics solutions [11]
        - ⇐ Microsoft Azure PaaS offering for ETL/ELT workloads, now in its second generation [11]
        - allows creating data-driven workflows that orchestrate the movement of data between supported data stores and the processing of data using compute services in other regions or in an on-premises environment
    - {benefit} easy-to-use
        - {feature} allows creating code-free pipelines with drag-and-drop functionality [2]
        - {feature} uses JSON to describe each of its entities (see the sketch below)
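As a sketch of what such a JSON entity description looks like, here is a hypothetical Azure Blob dataset written as a Python dict; the property layout follows the documented dataset JSON shape, but all names and paths are made-up placeholders:

```python
import json

# Hypothetical dataset entity: a named pointer to a folder in Azure Blob Storage.
# The structure mirrors the JSON that ADF uses to describe a dataset entity.
blob_dataset = {
    "name": "MyBlobDataset",                      # placeholder entity name
    "properties": {
        "type": "AzureBlob",                      # dataset type
        "linkedServiceName": {                    # reference to an existing linked service
            "referenceName": "MyStorageLinkedService",
            "type": "LinkedServiceReference",
        },
        "typeProperties": {
            "folderPath": "input/events/",        # where the data lives
            "format": {"type": "JsonFormat"},
        },
    },
}

print(json.dumps(blob_dataset, indent=2))
```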
 
    - {benefit} cost-effective
        - pay-as-you-go model against the Azure subscription with no up-front costs
        - low price-to-performance ratio
            - ⇐ cost-effective and performant at the same time
        - fully managed serverless cloud service that scales on demand [2]
            - ⇒ requires zero hardware maintenance [1]
            - ⇒ can easily scale beyond what was originally anticipated [1]
        - does not store any data [1]
        - provides additional cost-saving functionality [11]
            - {feature} takes care of the provisioning and teardown of the cluster once the job has executed [11]
    - {benefit} powerful
        - allows ingesting on-premises and cloud-based data sources
        - high-performance hybrid connectivity
            - over 90 built-in connectors make it easy to interact with all kinds of technologies [11]
        - orchestrate at scale
            - on-demand compute
            - Big Data workloads are scaled over multiple nodes to chunk data in parallel [11]
        - {feature} [ADFv2] monitoring
            - richer monitoring, natively integrated with Azure Monitor and OMS [11]
            - includes feature-rich monitoring and management tools to visualize the current state of data pipelines, data lineage and pipeline dependencies [1] (see the sketch below)
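The portal tooling is the primary monitoring interface, but runs can also be inspected programmatically. A rough sketch using the azure-mgmt-datafactory Python SDK; subscription, resource group and factory names are placeholders, and method names may differ slightly between SDK versions:

```python
from datetime import datetime, timedelta

from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import RunFilterParameters

# Placeholder identifiers -- replace with real values.
adf_client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")
rg_name, df_name = "my-resource-group", "my-data-factory"

# Query all pipeline runs of the last 24 hours and print their status.
filters = RunFilterParameters(
    last_updated_after=datetime.utcnow() - timedelta(days=1),
    last_updated_before=datetime.utcnow(),
)
runs = adf_client.pipeline_runs.query_by_factory(rg_name, df_name, filters)
for run in runs.value:
    print(run.pipeline_name, run.run_id, run.status)
```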
 
 
        - {feature} [ADFv2] control flow functionality
            - lets users define complex workflows using programmatic or UI mechanisms
            - allows defining parameters at pipeline level [11]
            - includes custom state passing and looping containers [11]
            - pipelines can be authored via additional tools
                - e.g. PowerShell, .NET, Python, REST APIs (see the sketch below)
                - ⇒ helps ISVs build SaaS-based analytics solutions on top of ADF app models
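For example, a pipeline can be authored and run from Python with the azure-mgmt-datafactory management SDK (PowerShell, .NET and the REST API expose equivalent operations). A minimal sketch with placeholder resource names; signatures may differ slightly between SDK versions:

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import PipelineResource, WaitActivity

# Placeholder identifiers -- replace with real values.
adf_client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")
rg_name, df_name = "my-resource-group", "my-data-factory"

# Author a trivial pipeline with a single Wait activity, entirely in code.
pipeline = PipelineResource(
    activities=[WaitActivity(name="WaitTenSeconds", wait_time_in_seconds=10)])
adf_client.pipelines.create_or_update(rg_name, df_name, "DemoPipeline", pipeline)

# Kick off a run on demand (a manual trigger).
run = adf_client.pipelines.create_run(rg_name, df_name, "DemoPipeline")
print("started run:", run.run_id)
```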
 
 
 
    - {benefit} intelligent
        - autonomous ETL allows unlocking operational efficiencies and enables citizen integrators [2]
    - {benefit} enterprise-grade security
        - provides the same security standards as any other Microsoft service [11]
    - {benefit} monthly release cycle
        - {feature} via auto-update
        - improvements may include support for new connectors, bug fixes, security patches, and performance improvements [11]
    - {benefit} backwards compatibility
        - {feature} [ADFv2] allows rehosting SSIS solutions [2]
            - ⇒ helpful for modernizing data warehouse solutions
    - {prerequisite} an Azure subscription with the contributor role assigned to at least one resource group
    - {limitation} availability
        - the service isn't available in all regions
            - an instance can be made available in another region to trigger the job on the customer's compute environment [1]
                - ⇐ the time for executing the job on the compute environment doesn't change [1]
    - {concept} activity
        - the unit of orchestration in ADF [1]
        - defines the actions to perform on data [1]
        - takes zero or more datasets as inputs and produces one or more datasets as outputs [1]
        - activity types (see the sketch below)
            - data movement activities
            - data transformation activities
            - control activities
                - control how the pipeline works and interacts with the data [10]
                - allow executing other pipelines [10]
                - allow running ForEach loops or Lookup activities [10]
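A sketch of the three activity types as Python SDK model objects; the dataset, notebook and pipeline names are hypothetical, and exact class/parameter names may vary by azure-mgmt-datafactory version:

```python
from azure.mgmt.datafactory.models import (
    BlobSink, BlobSource, CopyActivity, DatabricksNotebookActivity, DatasetReference,
    ExecutePipelineActivity, Expression, ForEachActivity, LinkedServiceReference,
    PipelineReference,
)

# Data movement activity: copy data from one dataset to another.
copy = CopyActivity(
    name="CopyRawToStaging",
    inputs=[DatasetReference(reference_name="RawEvents", type="DatasetReference")],
    outputs=[DatasetReference(reference_name="StagedEvents", type="DatasetReference")],
    source=BlobSource(),
    sink=BlobSink(),
)

# Data transformation activity: run a notebook on a compute linked service.
transform = DatabricksNotebookActivity(
    name="TransformEvents",
    notebook_path="/pipelines/transform_events",
    linked_service_name=LinkedServiceReference(reference_name="DatabricksCompute",
                                               type="LinkedServiceReference"),
)

# Control activity: loop over a pipeline parameter and run the copy for each item.
loop = ForEachActivity(
    name="ForEachInputFolder",
    items=Expression(value="@pipeline().parameters.inputFolders"),
    activities=[copy],
)

# Control activity: execute another pipeline.
run_child = ExecutePipelineActivity(
    name="RunTransformPipeline",
    pipeline=PipelineReference(reference_name="TransformPipeline", type="PipelineReference"),
)
```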
 
 
 
    - {concept} pipeline
        - logical grouping of activities that together perform a task [1]
            - the sequence can have a complex schedule and dependencies that need to be orchestrated and automated [1]
            - two activities can be chained by setting the output dataset of one activity as the input dataset of the other activity
        - allows building ETL/ELT workloads
        - scheduled by scheduler triggers [10]
        - data in a pipeline is referred to by different names
            - ⇐ based on the amount of modification that has been performed (see the sketch after this list)
            - raw data
                - data with no processing applied [10]
                    - ⇒ does not yet have a schema applied
                - stored in the message encoding format used to send tracking events, e.g. JSON
                - can be organized into meaningful data stores and data lakes [10]
                    - ⇐ further used in decision-making
                - it's common to send all tracking events as raw events
                    - ⇐ because all events can be sent to a single endpoint and schemas can be applied later in the pipeline [10]
            - processed data
                - raw data that has been decoded into event-specific formats, with the schema applied
                    - e.g. JSON tracking events that have been translated into a session start event with a fixed schema [10]
                - usually stored in different event tables and destinations in a data pipeline [10]
            - cooked data
                - processed data that has been aggregated or summarized [10]
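To make the raw / processed / cooked distinction concrete, a small plain-Python illustration; the event shape and field names are invented for the example:

```python
import json
from collections import Counter

# Raw data: tracking events as they were sent -- JSON strings, no schema applied yet.
raw_events = [
    '{"event": "session_start", "user": "u1", "ts": "2021-06-01T10:00:00Z"}',
    '{"event": "session_start", "user": "u2", "ts": "2021-06-01T10:05:00Z"}',
    '{"event": "click", "user": "u1", "ts": "2021-06-01T10:06:00Z"}',
]

# Processed data: raw events decoded into event-specific records with a fixed schema.
decoded = [json.loads(e) for e in raw_events]
session_starts = [
    {"user": e["user"], "started_at": e["ts"]}
    for e in decoded
    if e["event"] == "session_start"
]

# Cooked data: processed data aggregated/summarized, e.g. session starts per user.
sessions_per_user = Counter(e["user"] for e in session_starts)
print(sessions_per_user)   # Counter({'u1': 1, 'u2': 1})
```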
 
 
        - {concept} pipeline parameters
            - similar to SSIS package parameters
                - ⇐ need to be set from outside the package
            - can be passed from the parent pipeline (see the sketch below)
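A minimal sketch of pipeline parameters with the Python SDK, assuming the placeholder factory from the earlier sketches: the parameter is declared on the pipeline, referenced inside it via the @pipeline().parameters expression syntax, and supplied from outside when the run is requested (a parent pipeline can pass values the same way through an Execute Pipeline activity):

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    ParameterSpecification, PipelineResource, WaitActivity,
)

adf_client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")
rg_name, df_name = "my-resource-group", "my-data-factory"

# Declare a pipeline-level parameter; activities inside the pipeline reference it
# with the expression @pipeline().parameters.inputFolder (e.g. as a copy source path).
pipeline = PipelineResource(
    parameters={"inputFolder": ParameterSpecification(type="String")},
    activities=[WaitActivity(name="Placeholder", wait_time_in_seconds=1)],
)
adf_client.pipelines.create_or_update(rg_name, df_name, "ParameterizedPipeline", pipeline)

# The parameter value is set from outside the pipeline at run time.
adf_client.pipelines.create_run(rg_name, df_name, "ParameterizedPipeline",
                                parameters={"inputFolder": "input/2021-06-01/"})
```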
 
    - {concept} dataset
        - named references/pointers to the data used as an input or an output of an activity [1]
        - identifies data structures within different (linked) data stores [1]
            - ⇐ before creating a dataset, a linked service must be created to link the data store to ADF [10]
            - once created, a dataset can be used with activities in a pipeline [10]
                - e.g. a dataset can be the input or output dataset of a copy activity (see the sketch below)
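A rough sketch of registering a dataset with the Python SDK; it is only a named pointer into a data store, so the linked service it references (here the hypothetical "MyStorageLinkedService") must already exist:

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    AzureBlobDataset, DatasetResource, LinkedServiceReference,
)

adf_client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")
rg_name, df_name = "my-resource-group", "my-data-factory"

# The dataset points at a folder/file in the data store behind the linked service.
raw_events = DatasetResource(properties=AzureBlobDataset(
    linked_service_name=LinkedServiceReference(reference_name="MyStorageLinkedService",
                                               type="LinkedServiceReference"),
    folder_path="input/events",
    file_name="events.json",
))
adf_client.datasets.create_or_update(rg_name, df_name, "RawEvents", raw_events)
# "RawEvents" can now be used as the input or output dataset of a copy activity.
```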
 
 
 
    - {concept} linked service
        - defines the information needed by ADF to connect to external resources at runtime
            - much like connection strings, which define the connection information [10]
        - used to represent (see the sketch below)
            - {concept} data store
                - holds the input or output data for ADF
                - e.g. tables, files, folders, and documents
            - {concept} compute resource
                - can host the execution of an activity [1]
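A rough sketch of both kinds of linked service with the Python SDK: one for a data store (Azure Blob Storage) and one for a compute resource (an existing Databricks cluster). Connection string, URL, token and cluster id are placeholders, and the exact model parameters may differ by SDK version:

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    AzureDatabricksLinkedService, AzureStorageLinkedService,
    LinkedServiceResource, SecureString,
)

adf_client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")
rg_name, df_name = "my-resource-group", "my-data-factory"

# Data store linked service: essentially a named connection string.
storage_ls = LinkedServiceResource(properties=AzureStorageLinkedService(
    connection_string=SecureString(
        value="DefaultEndpointsProtocol=https;AccountName=<account>;AccountKey=<key>")))
adf_client.linked_services.create_or_update(rg_name, df_name,
                                            "MyStorageLinkedService", storage_ls)

# Compute resource linked service: a resource that can host the execution of an activity.
databricks_ls = LinkedServiceResource(properties=AzureDatabricksLinkedService(
    domain="https://<region>.azuredatabricks.net",
    access_token=SecureString(value="<access-token>"),
    existing_cluster_id="<cluster-id>"))
adf_client.linked_services.create_or_update(rg_name, df_name,
                                            "DatabricksCompute", databricks_ls)
```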
 
 
    - {concept} scheduler triggers
        - allow pipelines to be triggered on a wall-clock schedule [10]
            - pipelines and triggers have an n:m relationship
                - multiple triggers can kick off a single pipeline
                - the same trigger can kick off multiple pipelines
            - manual triggers trigger pipelines on demand [10]
        - once defined, a trigger must be started to begin triggering the pipeline [10]
        - comes into effect only after publishing the solution to ADF [10]
            - ⇐ not when saving the trigger in the UI [10]
        - to run a pipeline, a pipeline reference must be included in the trigger definition [10] (see the sketch below)
        - there is a cost associated with each pipeline run
            - {recommendation} when testing, make sure that the pipeline is triggered only a couple of times [10]
            - {recommendation} ensure that there is enough time for the pipeline to run between the published time and the end time [10]
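A rough sketch of a scheduler trigger with the Python SDK: the trigger definition carries the required pipeline reference plus a wall-clock recurrence, and it still has to be started (and the solution published) before it fires. Names and dates are placeholders; older SDK versions use start() instead of begin_start():

```python
from datetime import datetime

from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    PipelineReference, ScheduleTrigger, ScheduleTriggerRecurrence,
    TriggerPipelineReference, TriggerResource,
)

adf_client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")
rg_name, df_name = "my-resource-group", "my-data-factory"

# Wall-clock schedule: once a day between a start and an end time.
# Keeping the window short limits the number of (billed) pipeline runs while testing.
trigger = ScheduleTrigger(
    recurrence=ScheduleTriggerRecurrence(frequency="Day", interval=1,
                                         start_time=datetime(2021, 6, 1),
                                         end_time=datetime(2021, 6, 3),
                                         time_zone="UTC"),
    pipelines=[TriggerPipelineReference(     # the pipeline reference is required
        pipeline_reference=PipelineReference(reference_name="DemoPipeline",
                                             type="PipelineReference"))],
)
adf_client.triggers.create_or_update(rg_name, df_name, "DailyTrigger",
                                     TriggerResource(properties=trigger))

# A defined trigger does nothing until it is started.
adf_client.triggers.begin_start(rg_name, df_name, "DailyTrigger").result()
```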
 
 
 
Acronyms:
    Azure Data Factory (ADF)
    Continuous Integration/Continuous Deployment (CI/CD)
    Extract Load Transform (ELT)
    Extract Transform Load (ETL)
    Independent Software Vendors (ISVs)
    Operations Management Suite (OMS)
    pay-as-you-go (PAYG)
    SQL Server Integration Services (SSIS)
Resources:
[1] Microsoft (2020) "Microsoft Business Intelligence and Information Management: Design Guidance", by Rod College
[2] Microsoft (2021) Azure Data Factory [source]
[3] Microsoft (2018) Azure Data Factory: Data Integration in the Cloud [source]
[4] Microsoft (2021) Integrate data with Azure Data Factory or Azure Synapse Pipeline [source]
[10] Coursera (2021) Data Processing with Azure [source]
[11] Sudhir Rawat & Abhishek Narain (2019) "Understanding Azure Data Factory: Operationalizing Big Data and Advanced Analytics Solutions"


