SQL Troubles: CI/CD

Showing posts with label CI/CD. Show all posts

13 April 2025

🏭🗒️Microsoft Fabric: Continuous Integration & Continuous Deployment [CI/CD] [Notes]

Disclaimer: This is work in progress intended to consolidate information from various sources for learning purposes. For the latest information please consult the documentation (see the links below)!

Last updated: 13-Apr-2025

[Microsoft Fabric] Continuous Integration & Continuous Deployment [CI/CD]

{def} development processes, tools, and best practices used to automates the integration, testing, and deployment of code changes to ensure efficient and reliable development

can be used in combination with a client tool

e.g. VS Code, Power BI Desktop
don’t necessarily need a workspace

developers can

create branches
commit changes to that branch locally
push changes to the remote repo
create a pull request to the main branch
⇐ all steps can be performed without a workspace [1]

workspace is needed only as a testing environment [1]

to check that everything works in a real-life scenario [1]

addresses a few pain points [2]

manual integration issues

manual changes can lead to conflicts and errors

slow down development [2]

development delays

manual deployments are time-consuming and prone to errors

lead to delays in delivering new features and updates [2]

inconsistent environments

inconsistencies between environment cause issues that are hard to debug [2]

lack of visibility

can be challenging to

track changes though their lifetime [2]
understand the state of the codebase[2]

{process} continuous integration (CI)
{process} continuous deployment (CD)

architecture

{layer} development database

{recommendation} should be relatively small [1]

{layer} test database

{recommendation} should be as similar as possible to the production database [1]

{layer} production database

data items

items that store data
items' definition in Git defines how the data is stored [1]

{stage} development

{best practice} back up work to a Git repository

back up the work by committing it into Git [1]
{prerequisite} the work environment must be isolated [1]

so others don’t override the work before it gets committed [1]
commit to a branch no other developer is using [1]
commit together changes that must be deployed together [1]

helps later when

deploying to other stages
creating pull requests
reverting changes

{warning} big commits might hit the max commit size limit [1]

{bad practice} store large-size items in source control systems, even if it works [1]
{recommendation} consider ways to reduce items’ size if they have lots of static [1] resources, like images [1]

{action} revert to a previous version

{operation} undo

revert the immediate changes made, as long as they aren't committed yet [1]
each item can be reverted separately [1]

{operation} revert

reverting to older commits

{recommendation} promote an older commit to be the HEAD

via git revert or git reset [1]
shows that there’s an update in the source control pane [1]
the workspace can be updated with that new commit [1]

{warning} reverting a data item to an older version might break the existing data and could possibly require dropping the data or the operation might fail [1]
{recommendation} check dependencies in advance before reverting changes back [1]

{concept} private workspace

a workspace that provides an isolated environment [1]

⇐ allows to work in isolation [1]

{prerequisite} the workspace is assigned to a Fabric capacity [1]
{prerequisite} access to data to work in the workspace [1]
{step} create a new branch from the main branch [1]

allows to have most up-to-date version of the content [1]
can be used for any future branch created by the user [1]

when a sprint is over, the changes are merged and one can start a fresh new task [1]

switch the connection to a new branch on the same workspace

approach can be used when is needed to fix a bug in the middle of a sprint [1]

{validation} connect to the correct folder in the branch to pull the right content into the workspace [1]

{best practice} make small incremental changes that are easy to merge and less likely to get into conflicts [1]

update the branch to resolve the conflicts first [1]

{best practice} change workspace’s configurations to enable productivity [1]

connection between items, or to different data sources or changes to parameters on a given item [1]

{recommendation} make sure you're working with the supported structure of the item you're authoring [1]

if you’re not sure, first clone a repo with content already synced to a workspace, then start authoring from there, where the structure is already in place [1]

{constraint} a workspace can only be connected to a single branch at a time [1]

{recommendation} treat this as a 1:1 mapping [1]

{stage} test

{best practice} allows to simulate a real production environment for testing purposes [1]

{alternative} simulate this by connecting Git to another workspace [1]

factors to consider for the test environment

data volume
usage volume
production environment’s capacity

stage and production should have the same (minimal) capacity [1]

using the same capacity can make production unstable during load testing [1]

{recommendation} test using a different capacity similar in resources to the production capacity [1]
{recommendation} use a capacity that allows to pay only for the testing time [1]

allows to avoid unnecessary costs [1]

{best practice} use deployment rules with a real-life data source

{recommendation} use data source rules to switch data sources in the test stage or parameterize the connection if not working through deployment pipelines [1]
{recommendation} separate the development and test data sources [1]
{recommendation} check related items

the changes made can also affect the dependent items [1]

{recommendation} verify that the changes don’t affect or break the performance of dependent items [1]

via impact analysis.

{operation} update data items in the workspace

imports items’ definition into the workspace and applies it on the existing data [1]
the operation is same for Git and deployment pipelines [1]
{recommendation} know in advance what the changes are and what impact they have on the existing data [1]
{recommendation} use commit messages to describe the changes made [1]
{recommendation} upload the changes first to a dev or test environment [1]

{benefit} allows to see how that item handles the change with test data [1]

{recommendation} check the changes on a staging environment, with real-life data (or as close to it as possible) [1]

{benefit} allows to minimize the unexpected behavior in production [1]

{recommendation} consider the best timing when updating the Prod environment [1]

{benefit} minimize the impact errors might cause on the business [1]

{recommendation} perform post-deployment tests in Prod to verify that everything works as expected [1]
{recommendation} have a deployment, respectively a recovery plan [1]

{benefit) allows to minimize the effort, respectively the downtime [1]

{stage} production

{best practice} let only specific people manage sensitive operations [1]
{best practice} use workspace permissions to manage access [1]

applies to all BI creators for a specific workspace who need access to the pipeline

{best practice} limit access to the repo or pipeline by only enabling permissions to users [1] who are part of the content creation process [1]
{best practice} set deployment rules to ensure production stage availability [1]

{goal} ensure the data in production is always connected and available to users [1]
{benefit} allows deployments run while while minimizing the downtimes
applies to data sources and parameters defined in the semantic model [1]

deployment into production using Git branches

{recommendation} use release branches [1]

requires changing the connection of workspace to the new release branches before every deployment [1]
if the build or release pipeline requires to change the source code, or run scripts in a build environment before deployment, then connecting the workspace to Git won't help [1]

{recommendation} after deploying to each stage, make sure to change all the configuration specific to that stage [1]

Previous Post <<||>> Next Post

References:

[1] Microsoft Learn (2025) Fabric: Best practices for lifecycle management in Fabric [link]

[2] Microsoft Learn (2025) Fabric: CI/CD for pipelines in Data Factory in Microsoft Fabric [link]
[3] Microsoft Learn (2025) Fabric: Choose the best Fabric CI/CD workflow option for you [link]

Acronyms:

API - Application Programming Interface
BI - Business Intelligence
CI/CD - Continuous Integration and Continuous Deployment
VS - Visual Studio

12 April 2025

🏭🗒️Microsoft Fabric: Copy job in Data Factory [Notes]

Disclaimer: This is work in progress intended to consolidate information from various sources for learning purposes. For the latest information please consult the documentation (see the links below)!

Last updated: 11-Apr-2025

[Microsoft Fabric] Copy job in Data Factory

{def}

{benefit} simplifies data ingestion with built-in patterns for batch and incremental copy, eliminating the need for pipeline creation [1]

across cloud data stores [1]
from on-premises data stores behind a firewall [1]
within a virtual network via a gateway [1]

elevates the data ingestion experience to a more streamlined and user-friendly process from any source to any destination [1]
{benefit} provides seamless data integration

through over 100 built-in connectors [3]
provides essential tools for data operations [3]

{benefit} provides intuitive experience

easy configuration and monitoring [1]

{benefit} efficiency

enable incremental copying effortlessly, reducing manual intervention [1]

{benefit} less resource utilization and faster copy durations

flexibility to control data movement [1]

choose which tables and columns to copy
map the data
define read/write behavior
set schedules that fit requirements whether [1]

applies for a one-time or recurring jobs [1]

{benefit} robust performance

the serverless setup enables data transfer with large-scale parallelism
maximizes data movement throughput [1]

fully utilizes network bandwidth and data store IOPS for optimal performance [3]

{feature} monitoring

once a job executed, users can monitor its progress and metrics through either [1]

the Copy job panel

shows data from the most recent runs [1]

reports several metrics

status
row read
row written
throughput

the Monitoring hub

acts as a centralized portal for reviewing runs across various items [4]

{mode} full copy

copies all data from the source to the destination at once

{mode|GA} incremental copy

the initial job run copies all data, and subsequent job runs only copy changes since the last run [1]
an incremental column must be selected for each table to identify changes [1]

used as a watermark

allows comparing its value with the same from last run in order to copy the new or updated data only [1]
the incremental column can be a timestamp or an increasing INT [1]

{scenario} copying from a database

new or updated rows will be captured and moved to the destination [1]

{scenario} copying from a storage store

new or updated files identified by their LastModifiedTime are captured and moved to the destination [1]

{scenario} copy data to storage store

new rows from the tables or files are copied to new files in the destination [1]

files with the same name are overwritten [1]

{scenario} copy data to database

new rows from the tables or files are appended to destination tables [1]

the update method to merge or overwrite [1]

{default} appends data to the destination [1]

the update method can be adjusted to

{operation} merge

a key column must be provided

{default} the primary key is used, if available [1]

{operation} overwrite

availability

the same regional availability as the pipeline [1]

billing meter

Data Movement, with an identical consumption rate [1]

{feature} robust Public API

{benefit} allows to automate and manage Copy Job efficiently [2]

{feature} Git Integration

{benefit} allows to leverage Git repositories in Azure DevOps or GitHub [2]
{benefit} allows to seamlessly deploy Copy Job with Fabric’s built-in CI/CD workflows [2]

{feature|preview} VNET gateway support

enables secure connections to data sources within virtual network or behind firewalls

Copy Job can be executed directly on the VNet data gateway, ensuring seamless and secure data movement [2]

{feature} upsert to Azure SQL Database
{feature} overwrite to Fabric Lakehouse
{feature} [Jul-2025] native CDC

enables efficient and automated replication of changed data including inserted, updated and deleted records from a source to a destination [5]

ensures destination data stays up to date without manual effort

improves efficiency in data integration while reducing the load on source systems [5]

see Data Movement - Incremental Copy meter

consumption rate of 3 CU

{benefit} zero manual intervention

automatically captures incremental changes directly from the source [5]

{benefit} automatic replication

keeps destination data continuously synchronized with source changes [5]

{benefit} optimized performance

processes only changed data

reduces processing time and minimizing load on the source [5]

smarter incremental copy

automatically detects CDC-enabled source tables and allows to select either CDC-based or watermark-based incremental copy for each table [5]

applies to

CDC-enabled tables

CDC automatically captures and replicates actions on data

non-CDC-enabled tables

Copy Job detects changes by comparing an incremental column against the last run [5]

then merges or appends the changed data to the destination based on configuration [5]

supported connectors

⇐ applies to sources and destinations
Azure SQL DB [5]
On-premises SQL Server [5]
Azure SQL Managed Instance [5]

{enhancement} column mapping for simple data modification to storage as destination store [2]
{enhancement} data preview to help select the right incremental column [2]
{enhancement} search functionality to quickly find tables or columns [2]
{enhancement} real-time monitoring with an in-progress view of running Copy Jobs [2]
{enhancement} customizable update methods & schedules before job creation [2]

Previous Post <<||>> Next Post

References:

[1] Microsoft Learn (2025) Fabric: What is the Copy job in Data Factory for Microsoft Fabric? [link]

[2] Microsoft Fabric Updates Blog (2025) Recap of Data Factory Announcements at Fabric Conference US 2025 [link]

[3] Microsoft Fabric Updates Blog (2025) Fabric: Announcing Public Preview: Copy Job in Microsoft Fabric [link]

[4] Microsoft Learn (2025) Fabric: Learn how to monitor a Copy job in Data Factory for Microsoft Fabric [link]
[5] Microsoft Fabric Updates Blog (2025) Fabric: Simplifying Data Ingestion with Copy job – Introducing Change Data Capture (CDC) Support (Preview) [link]
[6] Microsoft Learn (2025) Fabric: Change data capture (CDC) in Copy Job (Preview) [link]

[7] Microsoft Fabric Updates Blog (2025) Simplifying Data Ingestion with Copy job – Incremental Copy GA, Lakehouse Upserts, and New Connectors [link]

Resources:
[R1] Microsoft Learn (2025) Fabric: Learn how to create a Copy job in Data Factory for Microsoft Fabric [link]
[R2] Microsoft Learn (2025) Microsoft Fabric decision guide: copy activity, Copy job, dataflow, Eventstream, or Spark [link]

Acronyms:

API - Application Programming Interface
CDC - Change Data Capture
CI/CD - Continuous Integration and Continuous Deployment

CU - Capacity Unit

DevOps - Development & Operations

DF - Data Factory
IOPS - Input/Output Operations Per Second

VNet - Virtual Network

SQL Troubles

Pages

13 April 2025

🏭🗒️Microsoft Fabric: Continuous Integration & Continuous Deployment [CI/CD] [Notes]

12 April 2025

🏭🗒️Microsoft Fabric: Copy job in Data Factory [Notes]

About Me