24 April 2025

💎🏭SQL Reloaded: Microsoft Fabric's Lakehouses at Work (Part I: Proof-of-Concept)


Introduction

One way to work with the data files existing in organization is to import them into a lakehouse and build a data model based on them that can be reused in the various solutions (incl. Power BI). As a reminder, a lakehouse is a data architecture platform for storing, managing, and analyzing structured and unstructured data in a single location.

The tutorials available on lakehouses are pretty useful for getting an idea how to start, though probably each seasoned professional has his/her way of doing things, at least for testing the capabilities before building a proper solution. The target is thus, to create the minimum for testing the capabilities needed for a proof-of-concept solution. 

The files used in this post are available on Microsoft's GitHub. Download the files and store them to be easily available for the next steps. The following files were considered for the current post: customers.csv, orders.csv and products.csv.

Create the Lakehouse

It's recommended to have a naming convention in place for the various items created in a workspace, e.g. a combination between item type (lakehouse, warehouse), system type (Prod, UAT, Dev, PoC) and eventually department (e.g. FIN, SCM, HR, etc.). One should try to balance between meaning and usefulness. Ideally, one should use 2 maximum 4 letters for each part encoded in the name. For example, the following scripts were created in the LH_SCM_PoC lakehouse. 

More complex naming conventions can include the system (e.g. D365, CRM, EBS) or the company. The target is to easily identify the systems, independently how complex the rules are. Given that it can become challenging to rename the schemas afterwards, ideally the naming convention should be available from the early stages. 

Create the Schema

A lakehouse comes with a dbo schema available by default, though it's recommended to create meaningful schema(s) as needed. The schemas should ideally reflect the domain of the data (e.g. departments or other key areas) and the schemas shouldn't change when the objects are deployed between the different environments. Upon case, one should consider creating multiple schemas that reflect the model's key areas. The names should be simple and suggestive.

-- create schema
CREATE Schema Orders

Create a Staging Area

The next step is to create a staging area where the files in scope can be made available and then further loaded in the lakehouse. One needs to compromise between creating a deep hierarchical structure that reflects the business structure and the need to easily identify, respectively manage the files. An hierarchical structure with 1-2 level could provide the needed compromise, though each additional level tends to increase the complexity. 

One should also consider rules for archiving or even deleting the files.

Upload the Files

Microsoft Fabric allows users to upload multiple files together into a single step. Ideally the files should have proper names for each column, otherwise overheads deriving from this may appear later in the process. 

When the files are available in multiple folders in a complex hierarchical structure, a set of shortcuts could help in their management.

Load the Data

A file's data can be loaded into the lakehouse on the fly by providing a valid table name:
Files >> SCM_Orders >> (select file) >> Load to Tables >> new table >> Load file to new table >> (provide information) >> Load

Load file to new table


Of course, the table's name must be unique within the Schema and the further properties must define files' definition. 

One should consider loading first a couple of tables, performing a rough validation of the data imported, and only after that the remaining tables can be imported. This allows to identify the issues that typically lead to reimports of the data (wrong formatting, invalid column names, duplicated files, etc.) or rework.

If the files have different characteristics (e.g. delimiters, number of attributes/records, special data types), one should consider this in the initial scope and have at least one example from each category. 

Review the Metadata

Once the files were made available, there's the tendency to start directly with the development without analyzing the data, or equally important, the metadata available. To review the metadata of the tables newly created, one can use the objects from the standard INFORMATION_SCHEMA (see post):

-- retrieve the list of tables
SELECT * 
FROM INFORMATION_SCHEMA.TABLES
WHERE TABLE_SCHEMA = 'orders'
ORDER BY TABLE_SCHEMA  

Further on, one can review columns' definition:
 
-- retrieve column metadata
SELECT TABLE_CATALOG
, TABLE_SCHEMA
, TABLE_NAME
, COLUMN_NAME
, ORDINAL_POSITION
, DATA_TYPE
, CHARACTER_MAXIMUM_LENGTH
, NUMERIC_PRECISION
, NUMERIC_SCALE
, DATETIME_PRECISION
, CHARACTER_SET_NAME
FROM INFORMATION_SCHEMA.COLUMNS
WHERE TABLE_SCHEMA = 'orders'
ORDER BY ORDINAL_POSITION

It's a good idea to save the metadata to a file and use it later for reviews, respectively for metadata management, when no other solution is in place for the same (e.g. Purview). That's useful also for the people with limited or no access to the workspace. 

Alternatively, one can use a notebook with the following SQL-based commands: 

%%sql

DESCRIBE TABLE LH_SCM_PoC.orders.sales;

DESCRIBE TABLE EXTENDED LH_SCM_PoC.orders.sales;

One can even provide meaningful descriptions for each table and its attributes via scripts like the ones below, however it might be a good idea to do this in the later phases of the PoC, when the logic become stable:

%%sql

-- modify a table's COMMENT
COMMENT ON TABLE LH_SCM_PoC.orders.sales IS 'Customer sales orders';

-- modify columns' COMMENT for an existing table
ALTER TABLE LH_SCM_DWH.orders.sales  
ALTER COLUMN SalesOrderNumber COMMENT 'Sales Order Number';

Data Validation

Before diving into building any business logic, besides identifying the primary, foreign keys and further attributes used in bringing the data together, it's recommended to get an overview of data's intrinsic and extrinsic characteristics relevant to the analysis. Some of the rules used typically for studying the quality of data apply to some extent also in here, though one needs to prioritize accordingly, otherwise one replicates the effort that's typically part of the Data Management initiatives. 

In addition, it's important to check how much the identified issues impact the business logic, respectively on whether the issues can be corrected to match the expectations. Often, no logic can compensate for major  data quality issues, and this can also affect  PoC's results as soon as the outcomes are validated against the expectations! 

Data Understanding 

Further on, it makes sense to get a high-level understanding of the data by looking at the distribution of values, respectively at the records participating in the joins. Of course, more similar queries can be built, though again, one should try to focus on the most important aspects!

The analysis could for example consider the following points:

/* validation of Products */

-- review duplicated product numbers (should be 0)
SELECT ProductName
, count(*) RecordCount
FROM orders.products
GROUP BY ProductName
HAVING count(*)>1

-- review most (in)expensive products
SELECT top 100 ProductID
, ProductName
, Category
, ListPrice 
FROM orders.products
ORDER BY ListPrice DESC --ASC

-- review category distribution
SELECT Category
, count(*) RecordCount 
FROM orders.products
GROUP BY Category
ORDER BY RecordCount DESC

-- review price ranges (
SELECT Len(floor(ListPrice)) RangeCount
, count(*) RecordCount 
FROM orders.products
GROUP BY Len(floor(ListPrice)) 
ORDER BY RangeCount DESC

/* validation of Customers */

-- duplicated email address 
SELECT CST.CustomerID
, CST.FirstName
, CST.LastName 
, CST.EmailAddress 
, DUP.RecordCount
FROM (-- duplicates
	SELECT EmailAddress
	, count(*) RecordCount 
	FROM orders.customers 
	GROUP BY EmailAddress 
	HAVING count(*)>1
	) DUP
	JOIN orders.customers CST
	   ON DUP.EmailAddress = CST.EmailAddress
ORDER BY DUP.RecordCount DESC
, DUP.EmailAddress 

-- duplicated Customer names (not necessarily duplicates)
SELECT CST.CustomerID
, CST.FirstName
, CST.LastName 
, CST.EmailAddress 
, DUP.RecordCount
FROM (-- duplicates
	SELECT FirstName
	, LastName
	, count(*) RecordCount 
	FROM orders.customers 
	GROUP BY FirstName
	, LastName 
	HAVING count(*)>1
	) DUP
	JOIN orders.customers CST
	   ON DUP.FirstName = CST.FirstName
      AND DUP.LastName = CST.LastName
ORDER BY DUP.RecordCount DESC
, DUP.FirstName
, DUP.LastName

/* validation of Orders */

-- review a typical order
SELECT SalesOrderID
, OrderDate
, CustomerID
, LineItem
, ProductID
, OrderQty
, LineItemTotal
FROM orders.orders
WHERE SalesOrderID = 71780
ORDER BY SalesOrderID 
, LineItem

-- review orders' distribution by month
SELECT Year(OrderDate) Year
, Month(OrderDate) Month
, count(*) RecordCount
FROM orders.orders
GROUP BY Year(OrderDate) 
, Month(OrderDate) 
ORDER BY Year
, Month

-- checking for duplicates
SELECT SalesOrderID
, LineItem
, count(*) RecordCount
FROM orders.orders ord 
GROUP BY SalesOrderID
, LineItem
HAVING count(*)>1

-- checking for biggest orders
SELECT SalesOrderID
, count(*) RecordCount
FROM orders.orders ord 
GROUP BY SalesOrderID
HAVING count(*) > 10
ORDER BY NoRecords DESC

-- checking for most purchased products
SELECT ProductID
, count(*) NoRecords
FROM orders.orders ord 
GROUP BY ProductID
HAVING count(*) > 8
ORDER BY NoRecords DESC

-- checking for most active customers
SELECT CustomerID
, count(*) RecordCount
FROM orders.orders ord 
GROUP BY CustomerID
HAVING count(*) > 10
ORDER BY RecordCount DESC

/* join checks */

-- Prders without Product (should be 0)
SELECT count(*) RecordCount
FROM orders.orders ord 
	 LEFT JOIN orders.products prd
	   ON ord.ProductID = prd.ProductID
WHERE prd.ProductID IS NULL

-- Prders without Customer (should be 0)
SELECT count(*) RecordCount
FROM orders.orders ORD 
	 LEFT JOIN orders.customers CST
	   ON ORD.CustomerID = CST.CustomerID
WHERE CST.CustomerID IS NULL

-- Products without Orders (153 records)
SELECT count(*) RecordCount
FROM orders.products prd
	 LEFT JOIN orders.orders ord 
	   ON prd.ProductID = ord.ProductID 
WHERE ord.ProductID IS NULL


-- Customers without Orders (815 records)
SELECT count(*) RecordCount
FROM orders.customers CST
	 LEFT JOIN orders.orders ORD
	   ON ORD.CustomerID = CST.CustomerID
WHERE ORD.CustomerID IS NULL

The more tables are involved, the more complex the validation logic can become. One should focus on the most important aspects.

Building the Logic

Once one has an acceptable understanding of the data entities involved and the relation between them, it's time to build the needed business logic by joining the various tables at the various levels of detail. One can focus on the minimum required, respectively attempt to build a general model that can address a broader set of requirements. For the PoC it's usually recommended to start small by addressing the immediate requirements, though some flexibility might be needed for exploring the data and preparing the logic for a broader set of requirements. Independently of the scope, one should consider a set of validations. 

Usually, it makes sense to encapsulate the logic in several views or table-valued functions that reflect the logic for the main purposes and which allow a high degree of reuse (see [1]). Of course, one can use the standard approach for modelling the bronze, silver, respectively the gold layers adopted by many professionals. For a PoC, even if  that's not mandatory, it might still be a good idea to make steps in the respective direction. 

In this case, dealing with only three tables - a fact table and two dimensions table - there are several perspectives that can be built:

a) all records from fact table + dimension records

The following view provides the lowest level of details for the fact table, allowing thus to look at the data from different perspectives as long as focus is only the values used is Sales Orders:

-- create the view
CREATE OR ALTER VIEW orders.vSalesOrders
-- Sales Orders with Product & Customer information
AS
SELECT ORD.SalesOrderID
, ORD.OrderDate
, ORD.CustomerID
, CST.FirstName 
, CST.LastName
, CST.EmailAddress
, ORD.LineItem
, ORD.ProductID
, PRD.ProductName 
, PRD.Category
, ORD.OrderQty
, ORD.LineItemTotal
, PRD.ListPrice 
, ORD.OrderQty * PRD.ListPrice ListPriceTotal
FROM orders.orders ORD 
	 JOIN orders.products PRD
	   ON ORD.ProductID = PRD.ProductID
	 JOIN orders.customers CST
	   ON ORD.CustomerID = CST.CustomerID

-- test the view   
SELECT *
FROM orders.vSalesOrders
WHERE SalesOrderID = 71780

One can use full joins unless some of the references dimensions are not available.  

b) aggregated data for all dimension combinations

The previous view allows to aggregate the data at the various levels of details:

-- Sales volume by Customer & Product
SELECT ORD.EmailAddress
, ORD.ProductName 
, ORD.Category
, SUM(ORD.OrderQty) OrderQty
, SUM(ORD.LineItemTotal) LineItemTotal
FROM orders.vSalesOrders ORD 
WHERE ORD.OrderDate >= '2022-06-01'
  AND ORD.OrderDate < '2022-07-01'
GROUP BY ORD.EmailAddress
, ORD.ProductName 
, ORD.Category
ORDER BY ORD.EmailAddress
, ORD.ProductName 

One can comment out the dimensions not needed. The query can be included in a view as well. 

c) all records from each dimension table + aggregated fact records

Sometimes, it's useful to look at the data from a dimension's perspective, though it might be needed to create such an object for each dimension, like in the below examples. For the maximum of flexibility the logic can be included in a table-valued function:

-- create the user-defined function
CREATE OR ALTER FUNCTION orders.tvfProductsSalesVolume(
    @StartDate date NULL,
    @EndDate date NULL
)
RETURNS TABLE
-- Sales volume by Product
AS
RETURN (
SELECT PRD.ProductID
, PRD.ProductName 
, PRD.Category
, ORD.FirstOrderDate
, ORD.LastOrderDate 
, IsNull(ORD.TotalSalesQty, 0) TotalSalesQty 
, IsNull(ORD.TotalSalesValue, 0) TotalSalesValue
, IsNull(ORD.OrderCount, 0) OrderCount
, IsNull(ORD.LineCount, 0) LineCount
FROM orders.products PRD
     OUTER APPLY (
		SELECT Min(ORD.OrderDate) FirstOrderDate
		, Max(ORD.OrderDate) LastOrderDate 
		, SUM(ORD.OrderQty) TotalSalesQty
		, SUM(ORD.LineItemTotal) TotalSalesValue
		, count(DISTINCT SalesOrderID) OrderCount
		, count(*) LineCount
		FROM orders.orders ORD 
		WHERE ORD.ProductID = PRD.ProductID
		  AND ORD.OrderDate >= @StartDate 
		  AND ORD.OrderDate < @EndDate 
	 ) ORD
);

-- test the user-defined function
SELECT *
FROM orders.tvfProductsSalesVolume('2022-06-01','2022-07-01') PRD
WHERE TotalSalesValue <> 0
ORDER BY TotalSalesValue DESC
, LastOrderDate DESC


-- create the user-defined function
CREATE OR ALTER FUNCTION orders.tvfCustomersSalesVolume(
    @StartDate date NULL,
    @EndDate date NULL
)
RETURNS TABLE
-- Sales volume by Customer
AS
RETURN (
SELECT CST.CustomerID
, CST.FirstName 
, CST.LastName
, CST.EmailAddress
, ORD.FirstOrderDate
, ORD.LastOrderDate 
, IsNull(ORD.TotalSalesValue, 0) TotalSalesValue
, IsNull(ORD.OrderCount, 0) OrderCount
, IsNull(ORD.LineCount, 0) LineCount
FROM orders.customers CST
     OUTER APPLY (
		SELECT Min(ORD.OrderDate) FirstOrderDate
		, Max(ORD.OrderDate) LastOrderDate 
		, SUM(ORD.LineItemTotal) TotalSalesValue
		, count(DISTINCT SalesOrderID) OrderCount
		, count(*) LineCount
		FROM orders.orders ORD 
		WHERE ORD.CustomerID = CST.CustomerID
		  AND ORD.OrderDate >= @StartDate 
		  AND ORD.OrderDate < @EndDate 
	 ) ORD
);

-- test the user-defined function
SELECT *
FROM orders.tvfCustomersSalesVolume('2022-06-01','2022-07-01') PRD
WHERE TotalSalesValue <> 0
ORDER BY TotalSalesValue DESC
, LastOrderDate DESC

When restructuring the queries in similar ways, there's always a compromise between the various factors: (re)usability, performance or completeness. 

Further Comments

The above database objects should allow users to address most of the requirements, though, as usual, there can be also exceptions, especially when the data needs to be aggregated at a different level of detail that requires the query to be structured differently.

The number of perspectives can increase also with the number of fact tables used to model a certain entity (e.g. Sales order headers vs. lines). For example, 

In theory, one can also find ways to automate the process of creating database objects, though one must choose the relevant attributes, respectively include logic that makes sense only within a certain perspective. 

No matter the data, respectively systems used as source, expect surprises and test your assumptions! For example, in the file used to create the orders.customers table, there seem to be duplicated entities with the same name and email address. One must clarify how such entities must be handled in data analysis, respectively in data modeling. For example, a person can appear twice because of the roles associated with the name or can be other entitled reasons. 

The files in scope of this post are small compared with the files existing in organizations. In many scenarios files' size could range from GB to TB and thus require partitioning and different other strategies. 

|>> Next Post

References
[1] sql-troubles (2023) Architecture Part IV: Building a Modern Data Warehouse with Azure Synapse [link]

Resources
[1] Microsoft Learn (2024) Fabric: Lakehouse and Delta Lake tables [link]

20 April 2025

🧮ERP: Implementations (Part XVIII: The Price of Partnership)


ERP Implementations Series
ERP Implementations Series

When one proceeds on a journey, it’s important to have a precise destination, a good map to show the road, the obstacles ahead, and help to plan the journey, good gear, enough resources to make it through the journey, but probably more important, good companions and ideally guides who can show the way ahead and offer advice when needed. This is in theory the role of a partner, and on such coordinates should be a partnership based upon. However, unless the partners pay for the journey as well, the partnership can come with important costs and occasionally more overhead than needed. 

The traveler’s metaphor is well suited to ERP implementations and probably many other projects in which the customer doesn’t have the knowledge about the various aspects of the project. The role of a partner is thus multifold, and it takes time for all the areas to be addressed. Typically, it takes years for such a relationship to mature to the degree that it all develops naturally, at least in theory. Conversely, few relationships can resist in time given the complex challenges resulting from different goals and objectives, business models, lack of successes or benefits.

Usually, a partnership means sharing the risks and successes, but more importantly, building a beneficial bidirectional relationship from which all parties can profit. This usually means that the partner provides a range of services not available in-house, allowing customers to pull resources and knowledge on a need-by basis, providing direction and other advice whenever is needed. A partner can help if it has the needed insight into the business, and this implies a minimum of communication in respect to business decisions, strategies, goals, objectives, requirements, implications, etc. 

During sales pitches and other meetings, many service providers assume themselves the role of partners, however between their behavior and the partner role is usually a considerable gap that often may seem impossible if not difficult to bridge. It’s helpful to define as part of the various contracts the role of partnership, respectively the further implications. It’s helpful to have a structure of bidirectional bonuses and other benefits that would help to strengthen the bond between organizations. A framework for supporting the partnership must be built, and this takes time to be implemented adequately. 

Even if some consultants are available from the early stages of the partnership, that’s typically the exception and not the norm. It’s typical for resources to be involved only for the whole duration of a project or less. Independently of their performance, the replacement of resources in projects is unavoidable and must be addressed adequately, with knowledge transfer and all that belongs to such situations. Moreover, it needs to be managed adequately by the serve provider, however resources can’t be replaced as the parts of an engine. The planning and other activities must consider and accommodate such changes.

Also the replacement of partners in mid of the project is possible and this option should be considered as exception in projects and planned accordingly. The costs of working with partners can be high and therefore organizations should consider the alternatives. Bringing individual resources in projects and even building long-term relationships with them can prove to be a cost-effective alternative. Even if such partnerships are more challenging to manage, the model can offer other advantages that compensate for the overhead of managing them.

Outsourcing resources across geographies or mixing models can work as well. Even if implementations usually don’t allow for experiments, they can still be a feasible alternative. The past successes and failures are usually a good measure of what works and doesn't for organizations. 

Previous Post <<||>> Next Post 

🧮ERP: Implementations (Part XVII: Taming the Monsters)

ERP Implementations Series
ERP Implementations Series
 
Given their extensive scope, duration, investment and complexity, ERP implementations are probably one of the most complex endeavors pursued by organizations. Moreover, they are often a matter of endurance with many junctions, skirts, turns, shortcuts, ups and downs, a continuous carousel in which the various issues tend to misbehave like little monsters, many of them haunting one’s dreams unexpectedly during and long after implementations.

Probably, the main drivers are the scale and mass of such projects as they touch all or most important aspects of organizations. Just consider the typical project done for a single department and multiply its complexity by a constant number representing the number of departments in scope. And the more one goes into details, the higher the complexity. To move forward the parties need to compromise, and as no one wants to do that, the discussions are prolonged, the discussions get personal, issues are escalated, and probably more negative effects can be met.

Tensions can be rooted in politics, in the friction between different goals, in the need to prioritize requirements, postponing or leaving things out of scope, or by pushing an agenda other parties don't agree with. Besides the typical constraints of projects, there’s the complexity of performing a huge amount of work within a limited period, time during which the resources must be available, the quality must match the expectations, and there are so many aspects to be considered!

Of course, not all implementations are like this, though each such project is a real exam of maturity for the people involved in it. Sometimes, it’s better to have people who care about the decisions made. On the opposite side, there are organizations that go almost blindly with the solutions suggested to them, with all the effects resulting from this. Probably, the middle way between these two extremes is more indicated, though it’s hard to find such a path through all complexity.

An ERP implementation is highly dependent on the initial conditions under which the project has started, the commitment made by the various parties involved in the project, the way resources are made available, on what’s considered in plan, on the communication that takes place, the planning done and its enforcement, etc. Of course, some topics can be addressed also later, though delays tend to create more delays that can have a ripple effect through the project. Under normal circumstances the backlog and other aspects can be manageable, though it’s enough for a few issues to gather momentum so that their cumulative impact can have an exponential impact.

Certain sensitive project topics can easily lead to crises and abnormal behavior, though such situations are usually exceptions (until they are not). It’s important to have in place the processes and procedures that can be used to address this kind of situation, and, not less important, have them communicated to the team. Moreover, it’s not necessary to reinvent the wheel - the processes defined in IT and project methodologies can be used and adapted for this purpose.

It's important to have in place all the processes, procedures and checkpoints needed to support the project. The people participating in a project should have some hands-on experience with them, including the exceptions (e.g. escalation procedures). It’s useful to have a mentor or some experienced person who can help with advice and even attend meetings and provide constructive feedback. Just having some awareness sessions with no feedback can be as dangerous as not having any training at all! It’s suboptimal to use the implementation itself as an environment for learning though in extremis this approach may work as well.

Previous Post <<||>> Next Post

19 April 2025

🧮ERP: Implementations (Part XVI: It’s All About Politics)

ERP Implementations Series
ERP Implementations Series

An ERP implementation takes place within a political context and politics can make or break implementations. Politics occurs whenever individuals or organization groups interact to make decisions that affect parts or the whole organization. Besides decision-making there are further components that revolve around the various types of resources allocation and management, resulting in power dynamics that shape and pull organizations in politically charged directions.

Given the deep implications of ERP systems, probably in no other type of projects the political aspects are that visible and stringent to all employees to the degree that they pull decisions in one direction independently of the actual requirements. It may seem incredible, though there are cases in which ERP systems were selected just because the organization’s CEO played golf with the vendor’s CEO. In the end, the gaps between systems should be minimal nowadays, at least in theory, isn’t it?

Of course, just because one meets certain strange behaviors, it doesn’t mean that this is common practice! There are higher chances of selecting an inadequate system just because the sales representative did a good job and convinced the audience that the system can do anything they want. It probably does if coins are used for each missing feature, and in the long term it can be a lot of coins. Conversely, even if a system satisfies nowadays’ requirements, it doesn’t mean it will continue to do the same with future requirements. Only the future can tell whether the choice of a system over the others was a good one.

The bigger the gaps between the various interests, the more difficult it becomes to pull the project in the right direction. Probably the best way to demonstrate why one system is better than another is by bringing facts and focusing on the main requirements of the organization. This supposes the existence of an explicit list of requirements with a high-level description of how they can be addressed by the future system. This might not be enough, though it’s a good start, a good basis for discussion, for making people aware of the implications. However, doing this exercise for 2-3 or more systems is not cost effective, as such analysis can become time-consuming and expensive.

One way to address political resistance is by discussing openly with the stakeholders and addressing their concerns, arguing why the system is a good choice, what can be done to address the gaps, and so on. It will not always be enough, though it’s important to establish common ground for further discussions. Further on, it’s important to keep the same openness and disposition for communication given that the further the project progresses, the higher the likelihood of other concerns to appear. It’s a never-ending story if there are gaps between needs and what the system provides.

It's important to establish clear and honest communication with the stakeholders, informing them proactively about the challenges faced, independently in which area they are faced. Conversely, too much communication can be disruptive and can create other challenges. One way to cope with this is by identifying the communication needs of each stakeholder and trying to identify what’s the volume of information, respectively the communication needs of each of them. That’s project management 1:1.

The Project Manager and his team should ideally anticipate and address the potential conflicts timely, before they propagate and reach a broader audience. It’s questionable how much can be achieved proactively, especially when the project keeps everybody busy. The tendency is to answer politics with politics, though brainstorming sessions, open communication and a few other approaches can reach deeper where politics can’t.

18 April 2025

🧮ERP: Implementations (Part XV: An Ecosystemic View)

ERP Implementation Series
ERP Implementations Series

In organizations, the network of interconnected systems, technologies, applications, users and all what resides around them, physically or logically, through their direct or indirect interactions form an ecosystem governed by common goals, principles, values, policies, processes and resources. An ERP system becomes thus part of the ecosystem and, given its importance, probably becomes the centerpiece of the respective structure, having the force to shift towards it all the flows existing in organizations, respectively nourishing its satellite systems, enabling them to create more value.

Ecology’s components should allow for growth, regeneration, adaptation, flow, interaction and diversity, respectively creating value for the organizations. The more the flow of information, data, products, and money is unrestricted and sustained by the overall architecture, the more beneficial they become. At least, this is the theoretical basis on which such architectures are built upon. Of course, all this doesn’t happen by itself, but must be sustained and nourished adequately!

The mixture of components can enable system interoperability, scalability, reliability, automation, extensibility or maintainability with impact on collaboration, innovation, sustainability, cost-efficiency and effectiveness. All of these are concepts with weight and further implications for ecology. Some aspects are available by design, while others need to be introduced and enforced in a systematic, respectively systemic manner. It’s an exploration of what’s achievable, of what works, of what creates value. Organizations should review the best practices in the field and explore.

Ideally, the ERP system should nourish the ecology it belongs to, enabling it to grow and create more value for the organization. Conversely, the ERP system can become a cannibal for the ecology, pulling towards it all the resources, but more importantly, limiting the other systems in terms of functionality, value, usage, etc. In other terms, an ERP system has the power to cannibalize its environment, with all the implications deriving from this. This may seem like a pessimistic view, though it’s met in organizations more often than people think. Just think about the rate of failure in ERP implementations, the ERP cannibalizing thus all the financial and human resources available at their disposal.

To create more value, the ERPs should be integrated with the other critical and non-critical information systems – email and other communication systems, customer and vendor relationship systems, inventory management systems, regulatory reporting platforms and probably there are many other systems designed to address the various needs. Of course, not every organization needs all these types of information systems, though the functions exist to some degree in many organizations. Usually, components grow in importance with the shift of needs and attention.

There are also small organizations that only use small subparts of the ERP system (e.g. finance, HR), however an ERP investment becomes more attractive and cost-effective, the further the organization can leverage all its (important) features. It’s also true that an ERP system is not always the solution for everybody. It’s a matter of scale, functionality, and business model. Above all, the final solution must be cost-effective and an enabler for the organization!

Beyond the ecosystemic view, there are design principles and rules that can be used to describe, map and plan the road ahead, though nothing should be considered as fixed. One needs to balance between opportunities and risks, value and costs, respectively between the different priorities. In some scenarios there’s a place for experimentation, while in other scenarios organizations must stick to what they know. Even if there are maybe recipes for success, there’s no guarantee behind them. Each organization must evaluate what works and what doesn’t. It’s a never-ending story in which such topics need to be approached gradually and iteratively.

Previous Post <<||>> Next Post

16 April 2025

🧮ERP: Implementations (Part XIV: A Never-Ending Story)

ERP Implementations Series
ERP Implementations Series

An ERP implementation is occasionally considered as a one-time endeavor after which an organization will live happily ever after. In an ideal world that would be true, though the work never stops – things that were carved out from the implementation, optimizations, new features, new regulations, new requirements, integration with other systems, etc. An implementation is thus just the beginning from what it comes and it's essential to get the foundation right – and that’s the purpose of the ERP implementation – provide a foundation on which something bigger and solid can be erected. 

No matter how well an ERP implementation is managed and executed, respectively how well people work towards the same goals, there’s always something forgotten or carved out from the initial project. Usually, the casual suspects are the integrations with other systems, though there can be also minor or even bigger features that are planned to be addressed later, if the implementation hasn’t consumed already all the financial resources available, as it's usually the case. Some of the topics can be addressed as Change Requests or consolidated on projects of their own. 

Even simple integrations can become complex when the processes are poorly designed, and that typically happens more often than people think. It’s not necessarily about the lack of skillset or about the technologies used, but about the degree to which the processes can work in a loosely coupled interconnected manner. Even unidirectional integrations can raise challenges, though everything increases in complexity when the flow of data is bidirectional. Moreover, the complexity increases with each system added to the overall architecture. 

Like a sculpture’s manual creation, processes in an ERP implementation form a skeleton that needs chiseling and smoothing until the form reaches the desired optimized shape. However, optimization is not a one-time attempt but a continuous work of exploring what is achievable, what works, what is optimal. Sometimes optimization is an exact science, while other times it’s about (scientifical) experimentation in which theory, ideas and investments are put to good use. However, experimentation tends to be expensive at least in terms of time and effort, and probably these are the main reasons why some organizations don’t even attempt that – or maybe it’s just laziness, pure indifference or self-preservation. In fact, why change something that already works?

Typically, software manufacturers make available new releases on a periodic basis as part of their planning for growth and of attracting more businesses. Each release that touches used functionality typically needs proper evaluation, testing and whatever organizations consider as important as part of the release management process. Ideally, everything should go smoothly though life never ceases to surprise and even a minor release can have an important impact when earlier critical functionality stopped working. Test automation and other practices can make an important difference for organizations, though these require additional effort and investments that usually pay off when done right. 

Regulations and other similar requirements must be addressed as they can involve penalties or other risks that are usually worth avoiding. Ideally such requirements should be supported by design, though even then a certain volume of work is involved. Moreover, the business context can change unexpectedly, and further requirements need to be considered eventually. 

The work on an ERP system and the infrastructure built around it is a never-ending story. Therefore, organizations must have not only the resources for the initial project, but also what comes after that. Of course, some work can be performed manually, some requirements can be delayed, some risks can be assumed, though the value of an ERP system increases with its extended usage, at least in theory. 

🧮ERP: Implementations (Part XIII: On Project Management)

ERP Implementations Series
ERP Implementations Series

Given its intrinsic complexity and extended implications, an ERP implementation can be considered as the real test of endurance for a Project Manager, respectively the team managed. Such projects typically deal with multiple internal and external parties with various interests in the outcomes of the project. Moreover, such projects involve multiple technologies, systems, and even methodologies. But, more importantly, such projects tend to have specific characteristics associated with their mass, being challenging to manage within the predefined constraints: time, scope, costs and quality.

From a Project Manager’s perspective what counts is only the current project. From a PMO perspective, one project, independent of its type, must be put within the broader perspective, while looking at the synergies and other important aspects that can help the organization. Unfortunately, for many organizations all begins and ends with the implementation, and this independently of the outcomes of the project. Often failure lurks in the background and usually there can be small differences that in the long term have a considerable impact. ERP implementations are more than other projects sensitive on the initial conditions – the premises under which the project starts and progresses. 

One way of coping with this inherent complexity is to split projects into several phases considered as projects or subprojects in their own boundaries. This allows organizations to narrow the focus and split the overall work into more manageable pieces, reducing to some degree the risks while learning in the process about organization’s capabilities in addressing the various aspects. Conversely, the phases are not necessarily sequential but often must overlap to better manage the resources and minimize waste. 

Given that an implementation project can take years, it’s normal for people to come and go, some taking over work from colleagues, with or without knowledge transfer. The knowledge is available further on, as long as the resources don’t leave the organization, though knowledge transfer can’t be taken for granted. It’s also normal for resources to suddenly not be available or disappear, increasing the burden that needs to be shifted on others’ shoulders. There’s seldom a project without such events and one needs to make the best of each situation, even if several tries and iterations are needed in the process.

Somebody needs to manage all this, and the weight of the whole project falls on a PM’s shoulders. Managing by exception and other management principles break under the weight of implementation projects and often it’s challenging to make progress without addressing this. Fortunately, PMs can shift the burden on Key Users and other parties involved in the project. Splitting a project in subprojects can help set boundaries even if more management could occasionally be involved. Also having clear responsibilities and resources who can take over the burdens when needed can be a sign of maturity of the teams, respectively the organization. 

Teams in Project Management are often compared with teams in sports, though the metaphor is partially right when each party has a ball to play with, while some of the players or even teams prefer to play alone at their own pace. It takes time to build effective teams that play well together, and the team spirit or other similar concepts can't fill all the gaps existing in organizations! Training in team sports has certain characteristics that must be mirrored in organizations to allow for teams to improve. Various parties expect from the PM to be the binder and troubleshooter of something that should have been part of an organization’s DNA! Bringing external players to do the heavy lifting may sometimes work, though who’ll do the lifting after the respective resources are gone? 

Previous Post <<||>> Next Post

15 April 2025

🧮ERP: Implementations (Part XII: The Process Perspective)

ERP Implementation Series
ERP Implementations Series

Technology can have a tremendous potential impact on organizations, helping them achieve their strategic goals and objectives, however it takes more than an implementation of one or more technologies to leverage that potential! This applies to ERP and other technology implementations altogether, but the role of technology is more important in the latter through its transformative role. ERP implementations can be the foundation on which the whole future of the organization is built upon, and it’s ideal to have a broader strategy that looks at all the facets of an organization pre-, during and postimplementation. 

One of the most important assets an organization has is its processes, organization’s success depending on the degree the processes are used to leverage the various strategies. Many customers want their business processes to be implemented on the new platform and that's the point where many projects go in the wrong direction! There are probably areas where this approach makes sense, though organizations need to look also at the alternatives available in the new ecosystem, identify and prioritize the not existing features accordingly. There will be also extreme cases in which one or a mix of systems will be considered as not feasible, and this is an alternative that should be considered during such evaluations! 

An ERP system allows organizations to implement their key value-creation processes by providing a technological skeleton with a set of configurations and features that can be used to address a wide set of requirements. Such a framework is an enabler - makes things possible - though the potential is not reached automatically, and this is one of the many false assumptions associated with such projects. Customers choose such a system and expect magic to happen! Many of the false perceptions are strengthened by implementers or the other parties involved in the projects. As in other IT areas, there are many misconceptions that pervade. 

An ERP provides thus a basis on which an organization can implement its processes. Doing an ERP implementation without process redesign is seldom possible, even if many organizations want to avoid it at all costs. Even if organization’s processes are highly standardized, expecting a system to model them by design is utopian, given that ERP system tends to target the most important aspects identified across industries. And thus, customizations come into play, some of them done without looking for alternatives already existing in the intrinsic or extended range of solutions available in an ERP’s ecosystem. 

One of the most important dangers is when an organization’s processes are so complex that their replication in the new environment creates more issues that the implementation can solve. At least in the first phases of the implementation, organizations must learn to compromise and focus on the critical aspects without which the organization can’t do its business. Moreover, the costs of implementations tend to increase exponentially, when multiple complex requirements are added to address the gaps.  Organizations should always look at alternatives – integrations with third party systems tend to be more cost-effective than rebuilding the respective functionality from scratch! 

It's also true that some processes are too complex to be implemented, though the solution resides usually in the middle. Each customization adds another level of complexity, and a whole range of risk many customers take. Conversely, there’s no blueprint that works for everybody. Organizations must thus compromise and that’s probably one of the most important aspects they should be aware of! However, also compromises must be made in the right places, while evaluating alternatives and the possible outcomes. It’s important to be aware of the full extent of the implications for their decisions. 

14 April 2025

🧮ERP: Implementations (Part XI: Tales from the Crypt)

ERP Implementation Series
ERP Implementations Series

One can seldom meet more frighteningly strange stories than the ones told by people who worked in ERP implementations. Such projects attempt to model an organization’s main functions and processes, independently on whether the focus is on production, finance, supply chain, services, projects or human resources. Because they tend to touch all important aspects of a business, such projects become so complex and political that they are often challenging to manage and occasionally are predestined to failure by design.

For the ones who never participated in an ERP implementation, imagine an average project and the number of challenges associated with it, and multiply it by 10 or a similar number that reflects the increase in complexity with the translation to broader scales. The jump in complexity can be compared with the jump from putting together a bed after a scheme to building a whole house using the same level of detail. The scale can further increase by moving from a house to a whole building or a complex of residential houses. Even if that’s technically achievable, a further challenge is how to build all this in a short amount of time, with minimal costs and acceptable quality levels.

With the increase of scale, imagine the amount of planning and coordination that needs to be achieved to avoid any delays. Even if many plan with the "first-time right" objective in mind, inherent issues are often unavoidable, and an organization’s agility can be measured on how robustly it can handle the foreseeable and unforeseeable challenges altogether. Of course, there are many approaches that allow one to minimize, defer or share the risks, or even opportunities, though there’s usually an important gap between one’s planning and reality!

This doesn’t mean that such projects are unmanageable! Everything can be managed to some level of detail and within some tolerance margins, however many organizations are tempted to answer complexity with complexity, and that’s seldom the right approach! Ideally, complexity should be broken down to manageable parts, though that’s challenging to do when one doesn’t know what is being done. That’s why many organizations search for partners with which to share the risks and success, though that works if the customer, and its partners can stir the same ship toward common destinations, at least for the main itinerary if not for the whole duration of the trip.  

Unfortunately, as happens in partnerships that diverge toward distinct goals, the misalignment and other similar factors resulting from this divergence can lead to further challenges that increase the complexity of ERP implementations even more. Ideally, a partner should behave like the mechanics at a pitstop, though that’s utopic especially when they must be always available and this for the whole duration of the project. So, all parties need to compromise somehow, and, even if there are general recipes that can be used, it’s challenging to make everybody happy!

Often in an ERP implementation is defined from the beginning whose needs are the most important, and from there one can build a whole hierarchy of scenarios, models and analyses that should show the right path(s). There’s a lot of knowledge that can be carried out between projects, respectively, between the different phases of a project, though there will always be surprises and one should be prepared for them! Same as the captain must occasionally change the course to avoid or fight storms or other hazards, so must the corresponding structure act when this is the case! Occasionally, each team member may be in the position to act like a captain and raise to expectations, though project designs must allow for this!

🏭🗒️Microsoft Fabric: Quotas [Notes]

Disclaimer: This is work in progress intended to consolidate information from various sources for learning purposes. For the latest information please consult the documentation (see the links below)! 

Last updated: 13-Apr-2025

[Microsoft Fabric] Quotas

  • {def} assigned number of resources for an Azure subscription
    • depend on the Azure subscription type [5]
      • set and enforced in the scope of the subscription [1]
      • each subscription has a default value for each quota [1]
    • determine the maximum number of CUs for each of the capacities on the subscription [5]
      • customers don't get charged for quotas, but on capacities [5]
    • each quota represents a specific countable resource
      • e.g. number of VMs that can be created [2]
      • e.g. the number of storage accounts that can be used concurrently [2]
      • e.g. the number of networking resources that can be consumed [2]
      • e.g. the number of API calls to a particular service that can be made [2]
    • designed to help protect customers from unexpected behavior [1]
      • e.g. inaccurately resourced deployments, mistaken consumption
      • helps minimize risks from deceptive or inappropriate consumption and unexpected demand [1]
    • limitations
      • ⇐ some limits are managed at a regional level [2]
      • ⇐ variable and dependent on several factors
  • {type} adjustable quotas
    • quotas for which customers can request quota increases
      • via Azure Home My quotas page
      • based on amount or usage percentage and submitting it directly [1]
      • each subscription has a default quota value for each quota [1]
    • the quickest way to increase quotas [1]
  • {type} non-adjustable quotas
    • quotas which have a hard limit, usually determined by the scope of the subscription [1]
    • to make changes, customers must submit a support request for the Azure support team [1]
  • [Fabric] limit the number of CUs customers can provision across multiple capacities in a subscription
    • calculation based on
      • subscription plan type
      • Azure region
  • {action} provision a new capacity
    • {level} Azure management group limits
    • {level} Azure subscription limits
    • {level} Azure resource group limits
    • {level} Template limits
    • {level} Microsoft Entra ID limits
    • {level} Azure API Center limits
    • {level} Azure API Management limits
    • {level} API Management v2 tiers
    • {level} Azure App Service limits
    • {level} Azure Automation limits
    • {action} request quota increase
      • via support request
      • there is no cost associated with requesting a quota increase [1]
        • ⇐ costs are incurred based on resource usage, not the quotas themselves [1]
      • should be based on workloads' characteristics [1]
    • {action} view quota
      • [permission] contributor role, or another role that includes contributor access [5]
    • {action} manage quota
      • [permission] see Quota Request Operator
    • [Azure] quota alerts
      • notifications triggered when the usage of a specific Azure resource nears the predefined quota limit [4]
      • facilitates proactive resource management [4]
      • multiple alert rules can be created for a given quota or across quotas in a subscription [4]

      References:
      [1] Microsoft Learn (2024) Fabric: Quotas overview [link
      [2] Microsoft Learn (2025) Fabric: Azure subscription and service limits, quotas, and constraints [link
      [3] Microsoft Learn (2025) Fabric: Quota monitoring and alerting [link
      [4] Microsoft Fabric Update Blog (2024) Announcing the launch of Microsoft Fabric Quotas [link]
      [5] Microsoft Learn (2025) Fabric: Microsoft Fabric capacity quotas [link]

      Resources:
      [R1] Microsoft Learn (2025) Fabric: What's new in Microsoft Fabric? [link]

      Acronyms:
      API - Application Programming Interface
      CU - Capacity Units
      VM - Virtual Machine

      13 April 2025

      🏭🗒️Microsoft Fabric: Continuous Integration & Continuous Deployment [CI/CD] [Notes]

      Disclaimer: This is work in progress intended to consolidate information from various sources for learning purposes. For the latest information please consult the documentation (see the links below)! 

      Last updated: 13-Apr-2025

      [Microsoft Fabric] Continuous Integration & Continuous Deployment [CI/CD] 
      • {def} development processes, tools, and best practices used to automates the integration, testing, and deployment of code changes to ensure efficient and reliable development
        • can be used in combination with a client tool
          • e.g. VS Code, Power BI Desktop
          • don’t necessarily need a workspace
            • developers can 
              • create branches
              • commit changes to that branch locally
              • push changes to the remote repo
              • create a pull request to the main branch
              • ⇐ all steps can be performed without a workspace [1]
            • workspace is needed only as a testing environment [1]
              • to check that everything works in a real-life scenario [1]
        • addresses a few pain points [2]
          • manual integration issues
            • manual changes can lead to conflicts and errors
              • slow down development [2]
          • development delays
            • manual deployments are time-consuming and prone to errors
              • lead to delays in delivering new features and updates [2]
          • inconsistent environments
            • inconsistencies between environment cause issues that are hard to debug [2]
          • lack of visibility
            • can be challenging to
              • track changes though their lifetime [2]
              • understand the state of the codebase[2]
        • {process} continuous integration (CI)
        • {process} continuous deployment (CD)
        • architecture
          • {layer} development database 
            • {recommendation} should be relatively small [1]
          • {layer} test database 
            • {recommendation} should be as similar as possible to the production database [1]
          • {layer} production database

          • data items
            • items that store data
            • items' definition in Git defines how the data is stored [1]
        • {stage} development 
          • {best practice} back up work to a Git repository
            • back up the work by committing it into Git [1]
            • {prerequisite} the work environment must be isolated [1]
              • so others don’t override the work before it gets committed [1]
              • commit to a branch no other developer is using [1]
              • commit together changes that must be deployed together [1]
                • helps later when 
                  • deploying to other stages
                  • creating pull requests
                  • reverting changes
          • {warning} big commits might hit the max commit size limit [1]
            • {bad practice} store large-size items in source control systems, even if it works [1]
            • {recommendation} consider ways to reduce items’ size if they have lots of static [1] resources, like images [1]
          • {action} revert to a previous version
            • {operation} undo
              • revert the immediate changes made, as long as they aren't committed yet [1]
              • each item can be reverted separately [1]
            • {operation} revert
              • reverting to older commits
                • {recommendation} promote an older commit to be the HEAD 
                  • via git revert or git reset [1]
                  • shows that there’s an update in the source control pane [1]
                  • the workspace can be updated with that new commit [1]
              • {warning} reverting a data item to an older version might break the existing data and could possibly require dropping the data or the operation might fail [1]
              • {recommendation} check dependencies in advance before reverting changes back [1]
          • {concept} private workspace
            • a workspace that provides an isolated environment [1]
              • ⇐ allows to work in isolation [1]
            • {prerequisite} the workspace is assigned to a Fabric capacity [1]
            • {prerequisite} access to data to work in the workspace [1]
            • {step} create a new branch from the main branch [1]
              • allows to have most up-to-date version of the content [1]
              • can be used for any future branch created by the user [1]
                • when a sprint is over, the changes are merged and one can start a fresh new task [1]
                  • switch the connection to a new branch on the same workspace
                • approach can be used when is needed to fix a bug in the middle of a sprint [1]
              • {validation} connect to the correct folder in the branch to pull the right content into the workspace [1]
          • {best practice} make small incremental changes that are easy to merge and less likely to get into conflicts [1]
            • update the branch to resolve the conflicts first [1]
          • {best practice} change workspace’s configurations to enable productivity [1]
            • connection between items, or to different data sources or changes to parameters on a given item [1]
          • {recommendation} make sure you're working with the supported structure of the item you're authoring [1]
            • if you’re not sure, first clone a repo with content already synced to a workspace, then start authoring from there, where the structure is already in place [1]
          • {constraint} a workspace can only be connected to a single branch at a time [1]
            • {recommendation} treat this as a 1:1 mapping [1]
        • {stage} test
          • {best practice} allows to simulate a real production environment for testing purposes [1]
            • {alternative} simulate this by connecting Git to another workspace [1]
          • factors to consider for the test environment
            • data volume
            • usage volume
            • production environment’s capacity
              • stage and production should have the same (minimal) capacity [1]
                • using the same capacity can make production unstable during load testing [1]
                  • {recommendation} test using a different capacity similar in resources to the production capacity [1]
                  • {recommendation} use a capacity that allows to pay only for the testing time [1]
                    • allows to avoid unnecessary costs [1]
          • {best practice} use deployment rules with a real-life data source
            • {recommendation} use data source rules to switch data sources in the test stage or parameterize the connection if not working through deployment pipelines [1]
            • {recommendation} separate the development and test data sources [1]
            • {recommendation} check related items
              • the changes made can also affect the dependent items [1]
            • {recommendation} verify that the changes don’t affect or break the performance of dependent items [1]
              • via impact analysis.
          • {operation} update data items in the workspace
            • imports items’ definition into the workspace and applies it on the existing data [1]
            • the operation is same for Git and deployment pipelines [1]
            • {recommendation} know in advance what the changes are and what impact they have on the existing data [1]
            • {recommendation} use commit messages to describe the changes made [1]
            • {recommendation} upload the changes first to a dev or test environment [1]
              • {benefit} allows to see how that item handles the change with test data [1]
            • {recommendation} check the changes on a staging environment, with real-life data (or as close to it as possible) [1]
              • {benefit} allows to minimize the unexpected behavior in production [1]
            • {recommendation} consider the best timing when updating the Prod environment [1]
              • {benefit} minimize the impact errors might cause on the business [1]
            • {recommendation} perform post-deployment tests in Prod to verify that everything works as expected [1]
            • {recommendation} have a deployment, respectively a recovery plan [1]
              • {benefit) allows to minimize the effort, respectively the downtime [1]
        • {stage} production
          • {best practice} let only specific people manage sensitive operations [1]
          • {best practice} use workspace permissions to manage access [1]
            • applies to all BI creators for a specific workspace who need access to the pipeline
          • {best practice} limit access to the repo or pipeline by only enabling permissions to users [1] who are part of the content creation process [1]
          • {best practice} set deployment rules to ensure production stage availability [1]
            • {goal} ensure the data in production is always connected and available to users [1]
            • {benefit} allows deployments run while while minimizing the downtimes
            • applies to data sources and parameters defined in the semantic model [1]
          • deployment into production using Git branches
            • {recommendation} use release branches [1]
              • requires changing the connection of workspace to the new release branches before every deployment [1]
              • if the build or release pipeline requires to change the source code, or run scripts in a build environment before deployment, then connecting the workspace to Git won't help [1]
          • {recommendation} after deploying to each stage, make sure to change all the configuration specific to that stage [1]

        References:
        [1] Microsoft Learn (2025) Fabric: Best practices for lifecycle management in Fabric [link]
        [2] Microsoft Learn (2025) Fabric: CI/CD for pipelines in Data Factory in Microsoft Fabric [link]
        [3] Microsoft Learn (2025) Fabric: Choose the best Fabric CI/CD workflow option for you [link]

        Acronyms:
        API - Application Programming Interface
        BI - Business Intelligence
        CI/CD - Continuous Integration and Continuous Deployment
        VS - Visual Studio

        12 April 2025

        🏭🗒️Microsoft Fabric: Copy job in Data Factory [Notes]

        Disclaimer: This is work in progress intended to consolidate information from various sources for learning purposes. For the latest information please consult the documentation (see the links below)! 

        Last updated: 11-Apr-2025

        [Microsoft Fabric] Copy job in Data Factory 
        • {def} 
          • {benefit} simplifies data ingestion with built-in patterns for batch and incremental copy, eliminating the need for pipeline creation [1]
            • across cloud data stores [1]
            • from on-premises data stores behind a firewall [1]
            • within a virtual network via a gateway [1]
        • elevates the data ingestion experience to a more streamlined and user-friendly process from any source to any destination [1]
        • {benefit} provides seamless data integration 
          • through over 100 built-in connectors [3]
          • provides essential tools for data operations [3]
        • {benefit} provides intuitive experience
          • easy configuration and monitoring [1]
        • {benefit} efficiency
          • enable incremental copying effortlessly, reducing manual intervention [1]
        • {benefit} less resource utilization and faster copy durations
          • flexibility to control data movement [1]
            • choose which tables and columns to copy
            • map the data
            • define read/write behavior
            • set schedules that fit requirements whether [1]
          • applies for a one-time or recurring jobs [1]
        • {benefit} robust performance
          • the serverless setup enables data transfer with large-scale parallelism
          • maximizes data movement throughput [1]
            • fully utilizes network bandwidth and data store IOPS for optimal performance [3]
        • {feature} monitoring
          • once a job executed, users can monitor its progress and metrics through either [1] 
            • the Copy job panel
              • shows data from the most recent runs [1]
            • reports several metrics
              • status
              • row read
              • row written
              • throughput
            • the Monitoring hub
              • acts as a centralized portal for reviewing runs across various items [4]
        • {mode} full copy
          • copies all data from the source to the destination at once
        • {mode|GA} incremental copy
          • the initial job run copies all data, and subsequent job runs only copy changes since the last run [1]
          • an incremental column must be selected for each table to identify changes [1]
            • used as a watermark
              • allows comparing its value with the same from last run in order to copy the new or updated data only [1]
              • the incremental column can be a timestamp or an increasing INT [1]
            • {scenario} copying from a database
              • new or updated rows will be captured and moved to the destination [1]
            • {scenario} copying from a storage store
              • new or updated files identified by their LastModifiedTime are captured and moved to the destination [1]
            • {scenario} copy data to storage store
              • new rows from the tables or files are copied to new files in the destination [1]
                • files with the same name are overwritten [1]
            • {scenario} copy data to database
              • new rows from the tables or files are appended to destination tables [1]
                • the update method to merge or overwrite [1]
        • {default} appends data to the destination [1]
          • the update method can be adjusted to 
            • {operation} merge
              • a key column must be provided
                • {default} the primary key is used, if available [1]
            • {operation} overwrite
        • availability 
          • the same regional availability as the pipeline [1]
        • billing meter
          • Data Movement, with an identical consumption rate [1]
        • {feature} robust Public API
          • {benefit} allows to automate and manage Copy Job efficiently [2]
        • {feature} Git Integration
          • {benefit} allows to leverage Git repositories in Azure DevOps or GitHub [2]
          • {benefit} allows to seamlessly deploy Copy Job with Fabric’s built-in CI/CD workflows [2]
        • {feature|preview} VNET gateway support
          • enables secure connections to data sources within virtual network or behind firewalls
            • Copy Job can be executed directly on the VNet data gateway, ensuring seamless and secure data movement [2]
        • {feature} upsert to Azure SQL Database
        • {feature} overwrite to Fabric Lakehouse
        • {feature} [Jul-2025] native CDC
          • enables efficient and automated replication of changed data including inserted, updated and deleted records from a source to a destination [5]
            •  ensures destination data stays up to date without manual effort
              • improves efficiency in data integration while reducing the load on source systems [5]
            • see Data Movement - Incremental Copy meter
              •  consumption rate of 3 CU
            • {benefit} zero manual intervention
              • automatically captures incremental changes directly from the source [5]  
            • {benefit} automatic replication
              • keeps destination data continuously synchronized with source changes [5]  
            • {benefit} optimized performance
              • processes only changed data
                • reduces processing time and minimizing load on the source [5]
            • smarter incremental copy 
              • automatically detects CDC-enabled source tables and allows to select either CDC-based or watermark-based incremental copy for each table [5]
          • applies to 
            • CDC-enabled tables
              • CDC automatically captures and replicates actions on data
            • non-CDC-enabled tables
              • Copy Job detects changes by comparing an incremental column against the last run [5]
                • then merges or appends the changed data to the destination based on configuration [5]
          • supported connectors
            • ⇐ applies to sources and destinations
            • Azure SQL DB [5]
            • On-premises SQL Server [5]
            • Azure SQL Managed Instance [5]
        • {enhancement} column mapping for simple data modification to storage as destination store [2]
        • {enhancement} data preview to help select the right incremental column  [2]
        • {enhancement} search functionality to quickly find tables or columns  [2]
        • {enhancement} real-time monitoring with an in-progress view of running Copy Jobs  [2]
        • {enhancement} customizable update methods & schedules before job creation [2]

        References:
        [1] Microsoft Learn (2025) Fabric: What is the Copy job in Data Factory for Microsoft Fabric? [link]
        [2] Microsoft Fabric Updates Blog (2025) Recap of Data Factory Announcements at Fabric Conference US 2025 [link]
        [3] Microsoft Fabric Updates Blog (2025) Fabric: Announcing Public Preview: Copy Job in Microsoft Fabric [link]
        [4] Microsoft Learn (2025) Fabric: Learn how to monitor a Copy job in Data Factory for Microsoft Fabric [link]
        [5] Microsoft Fabric Updates Blog (2025) Fabric: Simplifying Data Ingestion with Copy job – Introducing Change Data Capture (CDC) Support (Preview) [link]
        [6] Microsoft Learn (2025) Fabric: Change data capture (CDC) in Copy Job (Preview) [link]
        [7] Microsoft Fabric Updates Blog (2025) Simplifying Data Ingestion with Copy job – Incremental Copy GA, Lakehouse Upserts, and New Connectors [link

        Resources:
        [R1] Microsoft Learn (2025) Fabric: Learn how to create a Copy job in Data Factory for Microsoft Fabric [link]
        [R2] Microsoft Learn (2025) Microsoft Fabric decision guide: copy activity, Copy job, dataflow, Eventstream, or Spark [link]

        Acronyms:
        API - Application Programming Interface
        CDC - Change Data Capture
        CI/CD - Continuous Integration and Continuous Deployment
        CU - Capacity Unit
        DevOps - Development & Operations
        DF - Data Factory
        IOPS - Input/Output Operations Per Second
        VNet - Virtual Network
        Related Posts Plugin for WordPress, Blogger...

        About Me

        My photo
        Koeln, NRW, Germany
        IT Professional with more than 25 years experience in IT in the area of full life-cycle of Web/Desktop/Database Applications Development, Software Engineering, Consultancy, Data Management, Data Quality, Data Migrations, Reporting, ERP implementations & support, Team/Project/IT Management, etc.