
12 April 2025

🏭🗒️Microsoft Fabric: Copy job in Data Factory [Notes]

Disclaimer: This is work in progress intended to consolidate information from various sources for learning purposes. For the latest information please consult the documentation (see the links below)! 

Last updated: 11-Apr-2025

[Microsoft Fabric] Copy job in Data Factory 
  • {def} a Data Factory item that elevates the data ingestion experience to a more streamlined and user-friendly process, from any source to any destination [1]
    • {benefit} simplifies data ingestion with built-in patterns for batch and incremental copy, eliminating the need for pipeline creation [1]
      • across cloud data stores [1]
      • from on-premises data stores behind a firewall [1]
      • within a virtual network via a gateway [1]
  • {benefit} provides seamless data integration 
    • through over 100 built-in connectors [3]
    • provides essential tools for data operations [3]
  • {benefit} provides intuitive experience
    • easy configuration and monitoring [1]
  • {benefit} efficiency
    • enables incremental copying effortlessly, reducing manual intervention [1]
  • {benefit} less resource utilization and faster copy durations
    • flexibility to control data movement [1]
      • choose which tables and columns to copy
      • map the data
      • define read/write behavior
      • set schedules that fit the requirements, whether for one-time or recurring jobs [1]
  • {benefit} robust performance
    • the serverless setup enables data transfer with large-scale parallelism
    • maximizes data movement throughput [1]
      • fully utilizes network bandwidth and data store IOPS for optimal performance [3]
  • {feature} monitoring
    • once a job is executed, users can monitor its progress and metrics through either [1] 
      • the Copy job panel
        • shows data from the most recent runs [1]
        • reports several metrics [1]
          • status
          • rows read
          • rows written
          • throughput
      • the Monitoring hub
        • acts as a centralized portal for reviewing runs across various items [4]
  • {mode} full copy
    • copies all data from the source to the destination at once
  • {mode|preview} incremental copy
    • the initial job run copies all data, and subsequent job runs only copy changes since the last run [1]
    • an incremental column must be selected for each table to identify changes [1]
      • used as a watermark
        • allows comparing its value with the value from the last run in order to copy only the new or updated data [1] (see the T-SQL sketch after these notes)
        • the incremental column can be a timestamp or an increasing INT [1]
      • {scenario} copying from a database
        • new or updated rows will be captured and moved to the destination [1]
      • {scenario} copying from a storage store
        • new or updated files identified by their LastModifiedTime are captured and moved to the destination [1]
      • {scenario} copy data to storage store
        • new rows from the tables or files are copied to new files in the destination [1]
          • files with the same name are overwritten [1]
      • {scenario} copy data to database
        • new rows from the tables or files are appended to destination tables [1]
          • the update method can be set to merge or overwrite [1]
  • {default} appends data to the destination [1]
    • the update method can be adjusted to 
      • {operation} merge
        • a key column must be provided
          • {default} the primary key is used, if available [1]
      • {operation} overwrite
  • availability 
    • the same regional availability as the pipeline [1]
  • billing meter
    • Data Movement, with an identical consumption rate [1]
  • {feature} robust Public API
    • {benefit} allows automating and managing Copy jobs efficiently [2]
  • {feature} Git Integration
    • {benefit} allows leveraging Git repositories in Azure DevOps or GitHub [2]
    • {benefit} allows deploying Copy jobs seamlessly with Fabric’s built-in CI/CD workflows [2]
  • {feature|preview} VNET gateway support
    • enables secure connections to data sources within a virtual network or behind firewalls
      • Copy Job can be executed directly on the VNet data gateway, ensuring seamless and secure data movement [2]
  • {feature} upsert to Azure SQL Database
  • {feature} overwrite to Fabric Lakehouse
  • {feature} [Jul-2025] native CDC
    • enables efficient and automated replication of changed data, including inserted, updated and deleted records, from a source to a destination [5]
      • ensures destination data stays up to date without manual effort
        • improves efficiency in data integration while reducing the load on source systems [5]
      • billed via the Data Movement - Incremental Copy meter
        • consumption rate of 3 CU
      • {benefit} zero manual intervention
        • automatically captures incremental changes directly from the source [5]  
      • {benefit} automatic replication
        • keeps destination data continuously synchronized with source changes [5]  
      • {benefit} optimized performance
        • processes only changed data
          • reduces processing time and minimizes the load on the source [5]
      • {benefit} smarter incremental copy
        • automatically detects CDC-enabled source tables and allows selecting either CDC-based or watermark-based incremental copy for each table [5]
    • applies to 
      • CDC-enabled tables
        • CDC automatically captures and replicates the insert, update and delete actions on the data (see the enablement sketch after these notes)
      • non-CDC-enabled tables
        • Copy Job detects changes by comparing an incremental column against the last run [5]
          • then merges or appends the changed data to the destination based on configuration [5]
    • supported connectors
      • ⇐ applies to sources and destinations
      • Azure SQL DB [5]
      • On-premises SQL Server [5]
      • Azure SQL Managed Instance [5]
  • {enhancement} column mapping for simple data modification with storage as the destination store [2]
  • {enhancement} data preview to help select the right incremental column [2]
  • {enhancement} search functionality to quickly find tables or columns [2]
  • {enhancement} real-time monitoring with an in-progress view of running Copy jobs [2]
  • {enhancement} customizable update methods & schedules before job creation [2]
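
To make the watermark and merge mechanics above more concrete, here is a minimal conceptual T-SQL sketch of what an incremental run with a merge update method boils down to. All object names (dbo.SalesOrder, dbo.SalesOrder_Dest, dbo.CopyJobWatermark, OrderID, ModifiedDate) are hypothetical; Copy job performs the equivalent steps internally:

-- conceptual sketch of watermark-based incremental copy with merge (hypothetical names)
DECLARE @LastWatermark datetime2;

-- 1) read the watermark stored by the previous run
SELECT @LastWatermark = WatermarkValue
FROM dbo.CopyJobWatermark
WHERE TableName = 'SalesOrder';

-- 2) pick up only the rows changed since the last run, based on the incremental column
SELECT OrderID, Amount, ModifiedDate
INTO #ChangedRows
FROM dbo.SalesOrder
WHERE ModifiedDate > @LastWatermark;

-- 3) merge the changed rows into the destination on the key column
--    (by default the primary key); unmatched rows are inserted
MERGE dbo.SalesOrder_Dest AS tgt
USING #ChangedRows AS src
	ON tgt.OrderID = src.OrderID
WHEN MATCHED THEN
	UPDATE SET tgt.Amount = src.Amount, tgt.ModifiedDate = src.ModifiedDate
WHEN NOT MATCHED THEN
	INSERT (OrderID, Amount, ModifiedDate)
	VALUES (src.OrderID, src.Amount, src.ModifiedDate);

-- 4) advance the watermark for the next run
UPDATE dbo.CopyJobWatermark
SET WatermarkValue = (SELECT MAX(ModifiedDate) FROM dbo.SalesOrder)
WHERE TableName = 'SalesOrder';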

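For the CDC-based path, the prerequisite is that CDC is enabled on the source. As a hypothetical sketch (schema and table names assumed, run in the context of the source database), this is how CDC could be enabled in SQL Server or Azure SQL Database so that Copy job detects the table as CDC-enabled:

-- hypothetical example: enable CDC on the source database and table
-- (execute in the context of the source database)
EXEC sys.sp_cdc_enable_db;
GO
EXEC sys.sp_cdc_enable_table
	@source_schema = N'dbo',
	@source_name   = N'SalesOrder',  -- hypothetical table
	@role_name     = NULL;           -- no gating role
GO
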
References:
[1] Microsoft Learn (2025) Fabric: What is the Copy job in Data Factory for Microsoft Fabric? [link]
[2] Microsoft Fabric Updates Blog (2025) Recap of Data Factory Announcements at Fabric Conference US 2025 [link]
[3] Microsoft Fabric Updates Blog (2025) Fabric: Announcing Public Preview: Copy Job in Microsoft Fabric [link]
[4] Microsoft Learn (2025) Fabric: Learn how to monitor a Copy job in Data Factory for Microsoft Fabric [link]
[5] Microsoft Fabric Updates Blog (2025) Fabric: Simplifying Data Ingestion with Copy job – Introducing Change Data Capture (CDC) Support (Preview) [link]
[6] Microsoft Learn (2025) Fabric: Change data capture (CDC) in Copy Job (Preview) [link]

Resources:
[R1] Microsoft Learn (2025) Fabric: Learn how to create a Copy job in Data Factory for Microsoft Fabric [link]

Acronyms:
API - Application Programming Interface
CDC - Change Data Capture
CI/CD - Continuous Integration and Continuous Deployment
CU - Capacity Unit
DevOps - Development & Operations
DF - Data Factory
IOPS - Input/Output Operations Per Second
VNet - Virtual Network

09 April 2024

💎🤖SQL Reloaded: Copilot Stories (Part I: Copilot in Azure SQL Database is in Private Preview) 🆕

Following the news about Copilot and its adoption in various areas, you have probably asked yourself when it will become available in SQL Server. The good news is that Copilot in Azure SQL Database is already in Private Preview (see [1], [2]). The bad news is that for the early adoption program Microsoft is looking only for customers who can provide feedback based on real-world use cases, and one needs a valid Tenant ID to sign up for the Private Preview.

The new feature comes with two Azure portal experiences:

  • Natural language to SQL: within the Azure portal query editor, natural language prompts are translated into SQL;
  • Azure Copilot integration: provides customers with self-guided assistance for managing databases and troubleshooting. 
In the Data Exposed webcast with Anna Hoffman [1], the guest, Joe Sack, showcases the use of both experiences. Concerning the first experience, besides simple joins, Joe also showcased the use of a PIVOT based on the prompt:

"Show me a pivot summary table that displays the total number of properties sold in each year from 2020 to 2023".

-- output from Copilot
SELECT *
FROM 
(
	SELECT YEAR(SaleDate) AS Year, COUNT(*) AS TotalPropertiesSold
	FROM [PropertySearch].[dbo].[PropertyHistory]
	WHERE YEAR(SaleDate) BETWEEN 2020 AND 2023
	GROUP BY YEAR(SaleDate)
) AS PivotDate
PIVOT
(
	SUM(TotalPropertiesSold)
	FOR Year IN ([2020], [2021], [2022], [2023])
) AS PivotTable;

Even if this is a modest example, the great thing is that it generates the skeleton on which the logic can be built, which can really help beginners, though it can also be useful for more seasoned data professionals as it saves time. Probably, in a first phase, this will be the main benefit of Copilot: generating skeletons or templates on which further logic can be built. Hopefully, in time, it will do much more than that.

I wonder how complex the prompts can become and how the logic can be structured to create a multistep scenario. The Copilot versions from other areas showed that complex prompts give results; the question is whether Copilot can devise the steps in an optimal manner, much like a seasoned data professional does. 

For the moment, the feature utilizes table and view names, column names, and primary key and foreign key metadata to generate T-SQL code. Probably, it would make sense to also use index and statistics information, query hints and the whole arsenal usually needed by data professionals to optimize a query. Conversely, maybe the second experience could be used for optimizing and troubleshooting the query. 

I'd really love to test this feature, though I'll probably need to wait until it becomes publicly available. In the meantime, one can play with GitHub Copilot [3] or install Copilot in Azure Data Studio [4].


References:
[1] Data Exposed (2024) Introducing Copilot in Azure SQL Database (Private Preview) (link)
[2] Azure SQL Blog (2024) Microsoft Copilot in Azure extends capabilities to Azure SQL Database (Private Preview) by Joe Sack (link)
[3] Azure SQL Blog (2023) GitHub Copilot for SQL Developers: Turbocharge your SQL Development, by Subhojit Basak (link)
[4] Microsoft Developer (2023) Copilot is now in Azure Data Studio and this is how it can help you! (link)
