
24 April 2025

💎🏭SQL Reloaded: Microsoft Fabric's Lakehouses at Work (Part I: Proof-of-Concept)


Introduction

One way to work with the data files existing in an organization is to import them into a lakehouse and build on top of them a data model that can be reused across the various solutions (incl. Power BI). As a reminder, a lakehouse is a data architecture platform for storing, managing, and analyzing structured and unstructured data in a single location.

The tutorials available on lakehouses are pretty useful for getting an idea of how to start, though each seasoned professional probably has his/her own way of doing things, at least for testing the capabilities before building a proper solution. The target here is thus to create the minimum needed for testing the capabilities within a proof-of-concept solution. 

The files used in this post are available on Microsoft's GitHub. Download the files and store them locally so that they're easily available for the next steps. The following files were considered for the current post: customers.csv, orders.csv and products.csv.

Create the Lakehouse

It's recommended to have a naming convention in place for the various items created in a workspace, e.g. a combination of item type (lakehouse, warehouse), environment (Prod, UAT, Dev, PoC) and possibly the department (e.g. FIN, SCM, HR, etc.). One should try to balance meaning against usefulness. Ideally, each part encoded in the name should use 2 to maximum 4 letters. For example, the following scripts were created in the LH_SCM_PoC lakehouse. 

More complex naming conventions can include the system (e.g. D365, CRM, EBS) or the company. The target is to easily identify the systems, independently of how complex the rules are. Given that it can become challenging to rename schemas afterwards, the naming convention should ideally be in place from the early stages. 

Create the Schema

A lakehouse comes with a dbo schema available by default, though it's recommended to create meaningful schema(s) as needed. The schemas should ideally reflect the domain of the data (e.g. departments or other key areas), and they shouldn't change when the objects are deployed between the different environments. Depending on the case, one should consider creating multiple schemas that reflect the model's key areas. The names should be simple and suggestive.

-- create schema
CREATE SCHEMA Orders

Create a Staging Area

The next step is to create a staging area where the files in scope can be made available and then further loaded into the lakehouse. One needs to compromise between creating a deep hierarchical structure that reflects the business structure and the need to easily identify, respectively manage the files. A hierarchical structure with 1-2 levels could provide the needed compromise, though each additional level tends to increase the complexity. 

One should also consider rules for archiving or even deleting the files.
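
For illustration, a shallow layout could suffice; SCM_Orders is the folder used later in this post, while the archive folder is only a suggestion:

Files
  SCM_Orders          -- landing zone for the files in scope
  SCM_Orders_Archive  -- processed files kept for reference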

Upload the Files

Microsoft Fabric allows users to upload multiple files together in a single step. Ideally, the files should have proper names for each column (i.e. a header row), otherwise overhead deriving from this may appear later in the process. 

When the files are available in multiple folders in a complex hierarchical structure, a set of shortcuts could help in their management.

Load the Data

A file's data can be loaded into the lakehouse on the fly by providing a valid table name:
Files >> SCM_Orders >> (select file) >> Load to Tables >> new table >> Load file to new table >> (provide information) >> Load

Load file to new table


Of course, the table's name must be unique within the schema, and the other properties must match the file's definition. 

One should consider loading first a couple of tables, performing a rough validation of the imported data, and only after that importing the remaining tables. This helps identify early the issues that typically lead to data reimports (wrong formatting, invalid column names, duplicated files, etc.) or rework.
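
Such a rough validation can start with comparing the record counts against the source files:

-- rough validation: number of records imported per table
SELECT 'customers' TableName
, count(*) RecordCount
FROM orders.customers
UNION ALL
SELECT 'products' TableName
, count(*) RecordCount
FROM orders.products
UNION ALL
SELECT 'orders' TableName
, count(*) RecordCount
FROM orders.orders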

If the files have different characteristics (e.g. delimiters, number of attributes/records, special data types), one should consider this in the initial scope and have at least one example from each category. 

Review the Metadata

Once the files are made available, there's the tendency to start directly with the development without analyzing the data or, equally important, the metadata available. To review the metadata of the newly created tables, one can use the objects from the standard INFORMATION_SCHEMA (see post):

-- retrieve the list of tables
SELECT * 
FROM INFORMATION_SCHEMA.TABLES
WHERE TABLE_SCHEMA = 'orders'
ORDER BY TABLE_NAME  

Further on, one can review the columns' definitions:
 
-- retrieve column metadata
SELECT TABLE_CATALOG
, TABLE_SCHEMA
, TABLE_NAME
, COLUMN_NAME
, ORDINAL_POSITION
, DATA_TYPE
, CHARACTER_MAXIMUM_LENGTH
, NUMERIC_PRECISION
, NUMERIC_SCALE
, DATETIME_PRECISION
, CHARACTER_SET_NAME
FROM INFORMATION_SCHEMA.COLUMNS
WHERE TABLE_SCHEMA = 'orders'
ORDER BY TABLE_NAME
, ORDINAL_POSITION

It's a good idea to save the metadata to a file and use it later for reviews, respectively for metadata management, when no other solution (e.g. Purview) is in place for this purpose. That's also useful for people with limited or no access to the workspace. 

Alternatively, one can use a notebook with the following SQL-based commands: 

%%sql

DESCRIBE TABLE LH_SCM_PoC.orders.sales;

DESCRIBE TABLE EXTENDED LH_SCM_PoC.orders.sales;

One can even provide meaningful descriptions for each table and its attributes via scripts like the ones below, though it might be a good idea to do this in the later phases of the PoC, when the logic becomes stable:

%%sql

-- modify a table's COMMENT
COMMENT ON TABLE LH_SCM_PoC.orders.sales IS 'Customer sales orders';

-- modify columns' COMMENT for an existing table
ALTER TABLE LH_SCM_PoC.orders.sales  
ALTER COLUMN SalesOrderNumber COMMENT 'Sales Order Number';

Data Validation

Before diving into building any business logic, besides identifying the primary and foreign keys and the further attributes used to bring the data together, it's recommended to get an overview of the data's intrinsic and extrinsic characteristics relevant to the analysis. Some of the rules typically used for studying the quality of data apply to some extent here as well, though one needs to prioritize accordingly, otherwise one replicates the effort that's typically part of the Data Management initiatives. 

In addition, it's important to check how much the identified issues impact the business logic, respectively whether the issues can be corrected to match the expectations. Often, no logic can compensate for major data quality issues, and this can also affect the PoC's results as soon as the outcomes are validated against the expectations! 
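
For example, a quick completeness check on the attributes used later in joins and aggregations can reveal such issues early:

-- review the completeness of customers' key attributes
SELECT count(*) RecordCount
, SUM(CASE WHEN CustomerID IS NULL THEN 1 ELSE 0 END) MissingCustomerIDs
, SUM(CASE WHEN EmailAddress IS NULL THEN 1 ELSE 0 END) MissingEmailAddresses
FROM orders.customers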

Data Understanding 

Further on, it makes sense to get a high-level understanding of the data by looking at the distribution of values, respectively at the records participating in the joins. Of course, many similar queries can be built, though again, one should try to focus on the most important aspects!

The analysis could for example consider the following points:

/* validation of Products */

-- review duplicated product names (should be 0)
SELECT ProductName
, count(*) RecordCount
FROM orders.products
GROUP BY ProductName
HAVING count(*)>1

-- review most (in)expensive products
SELECT top 100 ProductID
, ProductName
, Category
, ListPrice 
FROM orders.products
ORDER BY ListPrice DESC --ASC

-- review category distribution
SELECT Category
, count(*) RecordCount 
FROM orders.products
GROUP BY Category
ORDER BY RecordCount DESC

-- review price ranges (by price's order of magnitude)
SELECT Len(floor(ListPrice)) PriceMagnitude
, count(*) RecordCount 
FROM orders.products
GROUP BY Len(floor(ListPrice)) 
ORDER BY PriceMagnitude DESC

/* validation of Customers */

-- review duplicated email addresses 
SELECT CST.CustomerID
, CST.FirstName
, CST.LastName 
, CST.EmailAddress 
, DUP.RecordCount
FROM (-- duplicates
	SELECT EmailAddress
	, count(*) RecordCount 
	FROM orders.customers 
	GROUP BY EmailAddress 
	HAVING count(*)>1
	) DUP
	JOIN orders.customers CST
	   ON DUP.EmailAddress = CST.EmailAddress
ORDER BY DUP.RecordCount DESC
, DUP.EmailAddress 

-- duplicated Customer names (not necessarily duplicates)
SELECT CST.CustomerID
, CST.FirstName
, CST.LastName 
, CST.EmailAddress 
, DUP.RecordCount
FROM (-- duplicates
	SELECT FirstName
	, LastName
	, count(*) RecordCount 
	FROM orders.customers 
	GROUP BY FirstName
	, LastName 
	HAVING count(*)>1
	) DUP
	JOIN orders.customers CST
	   ON DUP.FirstName = CST.FirstName
      AND DUP.LastName = CST.LastName
ORDER BY DUP.RecordCount DESC
, DUP.FirstName
, DUP.LastName

/* validation of Orders */

-- review a typical order
SELECT SalesOrderID
, OrderDate
, CustomerID
, LineItem
, ProductID
, OrderQty
, LineItemTotal
FROM orders.orders
WHERE SalesOrderID = 71780
ORDER BY SalesOrderID 
, LineItem

-- review orders' distribution by month
SELECT Year(OrderDate) Year
, Month(OrderDate) Month
, count(*) RecordCount
FROM orders.orders
GROUP BY Year(OrderDate) 
, Month(OrderDate) 
ORDER BY Year
, Month

-- checking for duplicates
SELECT SalesOrderID
, LineItem
, count(*) RecordCount
FROM orders.orders ord 
GROUP BY SalesOrderID
, LineItem
HAVING count(*)>1

-- checking for biggest orders
SELECT SalesOrderID
, count(*) RecordCount
FROM orders.orders ord 
GROUP BY SalesOrderID
HAVING count(*) > 10
ORDER BY RecordCount DESC

-- checking for most purchased products
SELECT ProductID
, count(*) RecordCount
FROM orders.orders ord 
GROUP BY ProductID
HAVING count(*) > 8
ORDER BY RecordCount DESC

-- checking for most active customers
SELECT CustomerID
, count(*) RecordCount
FROM orders.orders ord 
GROUP BY CustomerID
HAVING count(*) > 10
ORDER BY RecordCount DESC

/* join checks */

-- Orders without Product (should be 0)
SELECT count(*) RecordCount
FROM orders.orders ord 
	 LEFT JOIN orders.products prd
	   ON ord.ProductID = prd.ProductID
WHERE prd.ProductID IS NULL

-- Orders without Customer (should be 0)
SELECT count(*) RecordCount
FROM orders.orders ORD 
	 LEFT JOIN orders.customers CST
	   ON ORD.CustomerID = CST.CustomerID
WHERE CST.CustomerID IS NULL

-- Products without Orders (153 records)
SELECT count(*) RecordCount
FROM orders.products prd
	 LEFT JOIN orders.orders ord 
	   ON prd.ProductID = ord.ProductID 
WHERE ord.ProductID IS NULL


-- Customers without Orders (815 records)
SELECT count(*) RecordCount
FROM orders.customers CST
	 LEFT JOIN orders.orders ORD
	   ON ORD.CustomerID = CST.CustomerID
WHERE ORD.CustomerID IS NULL

The more tables are involved, the more complex the validation logic can become. One should focus on the most important aspects.

Building the Logic

Once one has an acceptable understanding of the data entities involved and the relations between them, it's time to build the needed business logic by joining the various tables at the various levels of detail. One can either focus on the minimum required, or attempt to build a general model that can address a broader set of requirements. For a PoC it's usually recommended to start small by addressing the immediate requirements, though some flexibility might be needed for exploring the data and preparing the logic for a broader set of requirements. Independently of the scope, one should consider a set of validations. 

Usually, it makes sense to encapsulate the logic in several views or table-valued functions that reflect the logic for the main purposes and allow a high degree of reuse (see [1]). Of course, one can use the standard approach of modelling the bronze, silver, respectively gold layers adopted by many professionals. For a PoC, even if that's not mandatory, it might still be a good idea to make steps in the respective direction. 

In this case, dealing with only three tables - a fact table and two dimension tables - there are several perspectives that can be built:

a) all records from fact table + dimension records

The following view provides the lowest level of detail for the fact table, thus allowing one to look at the data from different perspectives, as long as the focus is only on the values used in Sales Orders:

-- create the view
CREATE OR ALTER VIEW orders.vSalesOrders
-- Sales Orders with Product & Customer information
AS
SELECT ORD.SalesOrderID
, ORD.OrderDate
, ORD.CustomerID
, CST.FirstName 
, CST.LastName
, CST.EmailAddress
, ORD.LineItem
, ORD.ProductID
, PRD.ProductName 
, PRD.Category
, ORD.OrderQty
, ORD.LineItemTotal
, PRD.ListPrice 
, ORD.OrderQty * PRD.ListPrice ListPriceTotal
FROM orders.orders ORD 
	 JOIN orders.products PRD
	   ON ORD.ProductID = PRD.ProductID
	 JOIN orders.customers CST
	   ON ORD.CustomerID = CST.CustomerID

-- test the view   
SELECT *
FROM orders.vSalesOrders
WHERE SalesOrderID = 71780

One can use inner joins, as in the view above, as long as the referenced dimension records are always available; otherwise, left (or full) joins should be considered.  
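
For cases in which the dimension records can be missing, a minimal sketch of the left join variant that keeps all the order lines:

-- left join variant: keep orders without Product/Customer matches
SELECT ORD.SalesOrderID
, ORD.CustomerID
, CST.EmailAddress
, ORD.ProductID
, PRD.ProductName
, ORD.OrderQty
, ORD.LineItemTotal
FROM orders.orders ORD 
	 LEFT JOIN orders.products PRD
	   ON ORD.ProductID = PRD.ProductID
	 LEFT JOIN orders.customers CST
	   ON ORD.CustomerID = CST.CustomerID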

b) aggregated data for all dimension combinations

The previous view allows aggregating the data at various levels of detail:

-- Sales volume by Customer & Product
SELECT ORD.EmailAddress
, ORD.ProductName 
, ORD.Category
, SUM(ORD.OrderQty) OrderQty
, SUM(ORD.LineItemTotal) LineItemTotal
FROM orders.vSalesOrders ORD 
WHERE ORD.OrderDate >= '2022-06-01'
  AND ORD.OrderDate < '2022-07-01'
GROUP BY ORD.EmailAddress
, ORD.ProductName 
, ORD.Category
ORDER BY ORD.EmailAddress
, ORD.ProductName 

One can comment out the dimensions not needed. The query can be included in a view as well, as sketched below. 
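
Given that views don't accept parameters, the date filter moves into the consuming query; OrderDate is therefore kept at day level, and the view's name is only a suggestion:

-- create the view (suggested name)
CREATE OR ALTER VIEW orders.vSalesVolumeCustomersProducts
-- Sales volume by Customer & Product
AS
SELECT ORD.EmailAddress
, ORD.ProductName 
, ORD.Category
, ORD.OrderDate
, SUM(ORD.OrderQty) OrderQty
, SUM(ORD.LineItemTotal) LineItemTotal
FROM orders.vSalesOrders ORD 
GROUP BY ORD.EmailAddress
, ORD.ProductName 
, ORD.Category
, ORD.OrderDate

-- test the view
SELECT *
FROM orders.vSalesVolumeCustomersProducts
WHERE OrderDate >= '2022-06-01'
  AND OrderDate < '2022-07-01'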

c) all records from each dimension table + aggregated fact records

Sometimes, it's useful to look at the data from a dimension's perspective, though it might be needed to create such an object for each dimension, like in the examples below. For maximum flexibility, the logic can be included in a table-valued function:

-- create the user-defined function
CREATE OR ALTER FUNCTION orders.tvfProductsSalesVolume(
    @StartDate date NULL,
    @EndDate date NULL
)
RETURNS TABLE
-- Sales volume by Product
AS
RETURN (
SELECT PRD.ProductID
, PRD.ProductName 
, PRD.Category
, ORD.FirstOrderDate
, ORD.LastOrderDate 
, IsNull(ORD.TotalSalesQty, 0) TotalSalesQty 
, IsNull(ORD.TotalSalesValue, 0) TotalSalesValue
, IsNull(ORD.OrderCount, 0) OrderCount
, IsNull(ORD.LineCount, 0) LineCount
FROM orders.products PRD
     OUTER APPLY (
		SELECT Min(ORD.OrderDate) FirstOrderDate
		, Max(ORD.OrderDate) LastOrderDate 
		, SUM(ORD.OrderQty) TotalSalesQty
		, SUM(ORD.LineItemTotal) TotalSalesValue
		, count(DISTINCT SalesOrderID) OrderCount
		, count(*) LineCount
		FROM orders.orders ORD 
		WHERE ORD.ProductID = PRD.ProductID
		  AND ORD.OrderDate >= @StartDate 
		  AND ORD.OrderDate < @EndDate 
	 ) ORD
);

-- test the user-defined function
SELECT *
FROM orders.tvfProductsSalesVolume('2022-06-01','2022-07-01') PRD
WHERE TotalSalesValue <> 0
ORDER BY TotalSalesValue DESC
, LastOrderDate DESC


-- create the user-defined function
CREATE OR ALTER FUNCTION orders.tvfCustomersSalesVolume(
    @StartDate date NULL,
    @EndDate date NULL
)
RETURNS TABLE
-- Sales volume by Customer
AS
RETURN (
SELECT CST.CustomerID
, CST.FirstName 
, CST.LastName
, CST.EmailAddress
, ORD.FirstOrderDate
, ORD.LastOrderDate 
, IsNull(ORD.TotalSalesValue, 0) TotalSalesValue
, IsNull(ORD.OrderCount, 0) OrderCount
, IsNull(ORD.LineCount, 0) LineCount
FROM orders.customers CST
     OUTER APPLY (
		SELECT Min(ORD.OrderDate) FirstOrderDate
		, Max(ORD.OrderDate) LastOrderDate 
		, SUM(ORD.LineItemTotal) TotalSalesValue
		, count(DISTINCT SalesOrderID) OrderCount
		, count(*) LineCount
		FROM orders.orders ORD 
		WHERE ORD.CustomerID = CST.CustomerID
		  AND ORD.OrderDate >= @StartDate 
		  AND ORD.OrderDate < @EndDate 
	 ) ORD
);

-- test the user-defined function
SELECT *
FROM orders.tvfCustomersSalesVolume('2022-06-01','2022-07-01') CST
WHERE TotalSalesValue <> 0
ORDER BY TotalSalesValue DESC
, LastOrderDate DESC

When restructuring the queries in similar ways, there's always a compromise between various factors: (re)usability, performance and completeness. 

Further Comments

The above database objects should allow users to address most of the requirements, though, as usual, there can also be exceptions, especially when the data needs to be aggregated at a different level of detail, which requires the query to be structured differently.

The number of perspectives can also increase with the number of fact tables used to model a certain entity (e.g. Sales order headers vs. lines), where, for example, header-level totals can be derived from the order lines, as sketched below. 
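
The sketch derives the totals at header level; the column aliases are only suggestions:

-- sketch: Sales Orders aggregated at header level
SELECT SalesOrderID
, OrderDate
, CustomerID
, SUM(OrderQty) TotalQty
, SUM(LineItemTotal) OrderTotal
, count(*) LineCount
FROM orders.orders
GROUP BY SalesOrderID
, OrderDate
, CustomerID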

In theory, one can also find ways to automate the process of creating database objects, though one must choose the relevant attributes, respectively include logic that makes sense only within a certain perspective. 

No matter the data, respectively the systems used as source, expect surprises and test your assumptions! For example, in the file used to create the orders.customers table, there seem to be duplicated entities with the same name and email address. One must clarify how such entities must be handled in data analysis, respectively in data modeling. For example, a person can appear twice because of the roles associated with the name, or there can be other legitimate reasons. 

The files in scope of this post are small compared with the files existing in organizations. In many scenarios the files' size can range from GBs to TBs, requiring partitioning and various other strategies. 

|>> Next Post

References
[1] sql-troubles (2023) Architecture Part IV: Building a Modern Data Warehouse with Azure Synapse [link]

Resources
[1] Microsoft Learn (2024) Fabric: Lakehouse and Delta Lake tables [link]

26 January 2025

🧭Business Intelligence: Perspectives (Part XXV: Grounding the Roots)

Business Intelligence Series

When building something that is supposed to last, one needs a solid foundation on which the artifact can be built. That’s valid for castles, houses, IT architectures, and, probably most important, for BI infrastructures. There are so many tools out there that allow building a dashboard, report or other type of BI artifact with a few drag-and-drops, moving things around, adding formatting and shiny things. In many cases, all these steps are followed to create a prototype for a set of ideas or more formalized requirements, keeping the overall process to a minimum. 

Rapid prototyping, the process of building a proof-of-concept by focusing at a high level on the most important design and functional aspects, is helpful and sometimes a mandatory step in eliciting and addressing the requirements properly. It provides a fast road from an idea to an actual concept; however, the prototype, still in its early stages, can rapidly become the actual solution that unfortunately continues to haunt the dreams of its creator(s). 

Especially in the BI area, there are many solutions that started as a prototype and gained mass until they started to disturb many things around them, with implications for security, performance, data quality, and many other aspects. Moreover, the mass becomes critical in time, to the degree that it pulls more attention and effort than intended, with positive and negative impact altogether. It’s like building an artificial sun that suddenly becomes a danger for the nearby planet(s) and other celestial bodies. 

When building such artifacts, it’s important to define which goals the end-result must meet and which would be nice to have, differentiating clearly between them, respectively when it’s time to stop and properly address the aspects that are mandatory in transitioning from the prototype to an actual solution that follows the best practices in scope. It’s also the point when one should decide upon the solution’s feasibility, the needed quality acceptance criteria, and broader aspects like supporting processes, human resources, data, and the various other factors that have an impact. Unfortunately, many solutions gain inertia without the proper foundation and, in extremis, succumb under the various forces.

Developing software artifacts of any type is a balancing act between all these aspects, often under suboptimal circumstances. Therefore, one must be able to set the priorities right, react and change direction (and gear) according to the changing context. Many wish all this to be a straight sequential road, when in reality it looks more like mountain climbing, with many peaks, valleys and changes of scenery. The more exploration is needed, the slower the progress.

All these aspects require additional time, effort, resources and planning, which can easily increase the overall complexity of projects to the degree that it leads to (exponential) effort and, more important, waste. Moreover, the complexity pushes back, leading to more effort, and with it to higher costs. On top of this comes the iterative character of BI topics, multiple iterations being needed from the initial concept to the final solution(s); sometimes many steps are discarded in the process, corners are cut, with all the further implications that follow from this. 

Somewhere in the middle, between the minimum and the broad, overextending complexity, is the sweet spot that drives the most impact with a minimum of effort. For some organizations, respectively professionals, reaching and remaining in that zone will be quite a challenge, though it’s not impossible. It’s important to be aware of all the aspects that drive and sustain the quality of artifacts, data and processes. There’s a lot to learn from successful as well as from failed endeavors, and the various aspects should be reflected in the lessons learned. 

11 June 2024

🧭🏭Business Intelligence: Microsoft Fabric (Part IV: Is Microsoft Fabric Ready?)

Business Intelligence Series

When writing a Business Case, besides the high-level descriptions of the problem and solution(s), it’s important to roughly estimate how much it costs, how long it takes, respectively how many resources are needed and for what activities. A proof-of-concept (PoC) might not need an explicit Business Case, though the same high-level information is needed at least for the planning of resources and a formal approval.

Given that there are several analytical experiences in Microsoft Fabric (MF), it’s clear that there can’t be a single reference architecture recommended to customers anymore. Frankly, that ship had already sailed with the introduction of Microsoft Synapse, if not earlier, with the move to the cloud. Also, there’s no one-size-fits-all, as certain building blocks make sense only in certain scenarios (e.g. organization scale, data volume or source type). Moreover, even if MF has been generally available for quite some time, customers and service providers ask themselves whether the available features are enough for building analytics solutions based on it. 

“Is Fabric Ready?” was the topic of today’s Explicit Measures webcast [1]. Probably the answer is, as usual, “it depends”, and the general recommendation is to do a PoC to check the solution’s feasibility. Conversely, MF may be the best approach to consider if integration with other systems (e.g. Dynamics 365, Dataverse) is needed. 

What customers need are some rough but realistic estimates they can base any planning upon (at least for a PoC, if not for the whole project) in terms of making the data available in OneLake, building a semantic model, respectively processing and making the data available for consumption. Ideally, one needs a translation of the various steps into time and costs, as was done for earlier architectures. For example, how long does it take to make the data available in OneLake, to move the data physically or logically through the various layers, to build semantic models, etc.? 

Probably, some things can be achieved in a matter of days, at least if one knows what one’s doing. However, we are talking here about a new architecture that may resemble unknown territory for some. Even if old and new techniques can be mixed, there are further implications or improvements that must be considered. There are many webcasts, blog posts and other materials on how to do things and on what’s possible, though building a functioning solution from beginning to end, even as a PoC, requires more than putting all this together. 

Just making the data flow from point A to B or C is not enough - data security, data governance and a few other topics like scalability and availability need to be considered as well. Security and governance are also the areas in which more features probably need to be added. For many customers starting now with MF, the hope is that most of these features will be available by the time their solutions are ready for production.

From a cost perspective, there’s the cost of data at rest and in transit, plus the licensing for MF and the other components involved. Ideally, one should start small and increase capacities as needed, though “small” can vary from case to case, and it’s important to find the optimum. Starting in the middle could be an alternative approach, even if it may involve higher costs. If one starts small, the costs for a PoC can be negligible, though sooner or later a compromise is needed to provide acceptable performance. 

In terms of human resources, the topic is more complex (see [2]), and it depends largely on the nature of the project. The pool of skillsets is the most important constraint or enabler such projects can have.

Previous Post <<||>> Next Post

References:
[1] Explicit Measures (2024) Power BI tips Ep.327: Is Fabric Ready? (link)
[2] Explicit Measures (2024) Power BI tips Ep.321: Building a BI Team (link)

09 April 2024

🧭Business Intelligence: Why Data Projects Fail to Deliver Real-Life Impact (Part IV: Making It in the Statistics)

Business Intelligence Series

Various sources (e.g., [1], [2], [3]) place the failure rates for data projects somewhere between 70% and 85%, rates which are a bit higher than the failure rates of standard projects, estimated at 60-75%, but not by much. This means that only 2-3 out of 10 projects will succeed, and that’s another reason to plan for failure, respectively to embrace failure.

Unfortunately, the statistics advanced on project failure have no solid foundation and should be regarded with circumspection as long as the methodology and the information about the population used for the estimates aren’t shared, though they do reflect an important point – many data projects do fail! It would be foolish to think that your project won’t fail just because you’re a big company, you have the best resources, you have a proven rate of success, and you took all the precautions for the project not to fail.

Usually, at the end of a project the team meets to document the lessons learned in the hope that the next projects will benefit from them. The team did learn something, though as practice shows, even if the team manages to avoid some issues, other issues will impact the next similar project, leading to similar variances. One can summarize this as "on average, the impact of new issues and avoided known issues tends to zero out" or "on average, the plusses and minuses balance each other across projects". It’s probably a question of focus – if organizations focus too much on certain aspects, other aspects are ignored and/or unseen. 

So, your first data project will more likely fail. The question is: what do you do about it? It’s important to be aware of why projects, and data projects in particular, fail, though starting to consider and monitor each possible issue can prove to be ineffective. One can, however, create a risk register from the list and estimate the rates for each of the potential failures, respectively focus only on the top 3-5 risks with the highest exposure. Of course, one should reevaluate the estimates on a regular basis, though that’s Risk Management 101. 

Besides this, one should focus on how the team can make the project succeed. When adopting a technology, methodology or set of processes, it’s recommended to start with a proof-of-concept (PoC). To make the PoC a helpful experience, it’s probably important to start with a topic that’s not too big to handle, but that also involves some complexity that allows the organization to evaluate the targeted set of tools and technologies. It can also be a topic for which other organizations have made important progress, respectively succeeded. The temptation is big to approach the most stringent issues in the organization, respectively to build something big that can have an enormous impact on the organization. Jumping too soon into such topics can just increase the chances of failure. 

One can also formulate the goals, objectives and further requirements in a form that allows the organization to build upon them even if the project fails. A PoC is about learning, building a foundation, doing the groundwork, exploring, mapping the unknown, and identifying what's still missing to make progress, respectively closing the full circle. A PoC is less about overachievement and big impact, which can happen, though as a consequence of the good work done in the PoC. 

The bottom line: no matter whether you succeed or fail, once you start a project, you’ll still make it into the statistics! More important is what you’ve learnt after the first data project, respectively how you can use that knowledge in further projects to make a difference!

Previous Post <<||>> Next Post

References:
[1] Harvard Business Review (2023) Keep Your AI Projects on Track, by Iavor Bojinov (link)
[2] Cognilytica (2023) The Shocking Truth: 70-80% of AI Projects Fail! (link)
[3] VentureBeat (2019) Why do 87% of data science projects never make it into production? (link)

03 October 2023

🧮ERP: Implementations (Part X: Introducing an Upfront Proof-of-Concept Setup)

ERP Implementations Series

The standard phases of an ERP implementation are mandatory and inflexible, as there seems to exist an imposed succession of phases rooted in the customer’s need of having an upfront cost estimate for the project. Moreover, the concept-based approach reflected in the creation of a set of Functional Design Documents (FDDs), even if it’s supposed to increase an implementation’s accuracy, brings considerable challenges and an effort volume that could be spent in other areas. E.g., having a proof-of-concept setup subproject early in the project seems to bring more benefits.

Usually, before or during the requirements gathering phase, the functional consultants together with the key users look at the legacy system(s) and data, questions are asked on both sides, and the findings are hopefully documented, though the outputs are high-level ideas or process design sketches. The sessions are abstract, and besides diagrams there’s no feedback mechanism to make sure that the parties understood the customer’s processes and data structures, respectively that the key users understood what the future system is supposed to deliver. Some projects consider building 'AS-IS' diagrams and/or user stories during this phase, though their impact on the project’s outcomes is questionable.

Why not include in this phase also hands-on training sessions for the key users, during which a system is set up based on the available information? For example, one can start with an existing shell of the system reflecting the standard parameters used in the industry in which the customer works. Starting from this shell, the key users and consultants go through the various processes and business scenarios, change parameters, add master data manually, sketch what the processes could look like, respectively understand the gaps from expectations, or maybe how a process can be changed to avoid customizations. That’s more effective than discussing the data structures and processes over and over!

Of course, this seems to increase the exploratory phase's complexity, though the increase is only apparent. Allowing key users to understand how the target system works has the potential of simplifying the project's planning and execution. Besides reaching a common understanding of the functionality, the key users can better evaluate whether the target system satisfies the high-level requirements, respectively better perform the various activities, with requirements’ definition, reviews and user acceptance testing benefiting altogether. Moreover, they can train and involve other users earlier.

For this to work, there are several assumptions. First, the functional consultants must know the target system(s), which is not necessarily needed in other approaches, where a person (e.g. a business analyst) who can understand how a system works and can document processes is enough. Second, the key users must have a good understanding of the legacy systems. Third, the shell should reflect the business needs as much as possible. Fourth, the necessary financial resources need to be made available upfront. Fifth, the business commitment must be there, and with it the key users should focus only on the project.

However, the most important aspect is that the parties involved need to buy into and support the idea! The FDDs bring a safety net and make sense for both parties, the setup being performed only after the sign-off. On the other hand, because of the considerable number of iterations, FDDs involve high costs. Performing the setup first, as described above, and writing the FDDs later, if still needed, should improve the FDDs’ quality and require fewer iterations.

This approach allows an important volume of work to be done upfront, and even if further effort is needed for customizations and testing, a lower level of coordination is needed later, reducing thus the complexity of the planning and of the overall project.

Previous <<||>> Next

03 February 2021

📦Data Migrations (DM): Conceptualization (Part III: Heuristics)

Data Migration

Probably one of the most difficult things to learn as a technical person is using the right technology for a given purpose, mainly because one is inclined to use the tools one knows best. Moreover, the overlapping of technologies makes the task more and more challenging, the difference between competing technologies often residing in the details. Thus, identifying the gaps comes down to understanding the details of the problem(s) or need(s), respectively the advantages or disadvantages of one technology over the other. This is true especially about competing technologies, including the ones that replace other technologies.

There are simple heuristics that allow approaching such challenges. For example, heavy data processing belongs usually in databases, while import/export functionality belongs in an ETL tool. Therefore, one can start looking at the problems from these two perspectives. Would the solution benefit from these two approaches, or are there more appropriate technologies (e.g. data streaming, ELT, non-relational databases)? How much effort would building the solution involve? 

Commercial Off-The-Shelf (COTS) tools provided by third-party vendors usually offer specialized functionality in each area. Gartner and Forrester provide regular analyses of the main players in the important areas, analyses which can be used in theory as a basis for further research. Even if COTS tools tend to be more expensive and can have some important functionality gaps, as long as they are extensible, they can prove a good starting point for developing a solution. 

Sometimes it helps to research on the web what other people or organizations did, how they approached the same aspects, what technologies, techniques and best practices they used to overcome the challenges. One doesn’t need to reinvent the wheel, even if it’s sometimes fun to do so. Moreover, a few hours of research can give one a basis of useful information and a better understanding of the work ahead.

On the other side, sometimes it’s advisable to use the tools one knows best; however, this can also lead to unusable and less performant solutions. For example, MS Excel and Access have been for years the tools of choice for building personal solutions that later grew into maintenance nightmares for the IT team. Ideally, they can still be used for data entry or data cleaning, though building solutions exclusively on (one of) them can prove to be far from optimal. 

When one doesn’t know whether a technology or a mix of technologies can be used to provide a solution, it’s recommended to start a proof-of-concept (PoC) that addresses the most important aspects of the needed solution. One can start small by focusing on the minimal functionality needed to check the main aspects and evolve the PoC over several iterations as needed.

For example, in the case of a Data Migration (DM) this would involve building the data extraction layer for an entity, implementing several data transformations based on the defined mappings, considering building a few integrity rules for validation, respectively attempting to import the data into the target system. Once this is accomplished, one can start increasing the volume of data to check how the solution behaves under stress. The volume of data can be increased incrementally or by considering all the data available. 
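
A minimal sketch of such a transformation step with a simple integrity rule, assuming hypothetical staging and mapping tables (stg_Customers, map_Countries):

-- sketch: transform & validate one entity (hypothetical tables)
SELECT CST.CustomerNumber
, Trim(CST.CustomerName) CustomerName
, Coalesce(MAP.TargetCountryCode, 'N/A') CountryCode -- mapped value or fallback
, CASE WHEN CST.EmailAddress NOT LIKE '%@%.%' THEN 1 ELSE 0 END IsInvalidEmail -- simplistic check
FROM dbo.stg_Customers CST
	 LEFT JOIN dbo.map_Countries MAP
	   ON CST.CountryCode = MAP.SourceCountryCode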

As soon as the skeleton is built, one can consider all the mappings, respectively add several entities to build the dependencies existing between them and other functionality. The prototype might not address all the requirements from the beginning, therefore consider the problems as they arise. For example, if the volume of data causes problems, then attempt splitting the data into batches during processing, or consider specific optimization techniques like indexing, or scaling techniques like increasing the computing resources. 

Previous Post <<||>> Next Post

📦Data Migrations (DM): Conceptualization (Part II: Plan vs. Concept vs. Strategy)

Data Migrations Series

A concept is a document that describes at a high level the set of necessary steps and their implications in order to achieve a desired result, typically making the object of a project. A concept is usually needed to provide more technical and nontechnical information about the desired solution, the context in which a set of steps are conducted, respectively the changes considered, how the changes will be implemented and the further aspects that need to be considered. It can include a high-level plan and sometimes also information that typically belongs in a Business Case – goals, objectives, required resources, estimated effort and costs, risks and opportunities.

A concept is used primarily as a basis for sign-off, as well as for establishing common ground and understanding. When approved, it’s used for the actual implementation and the solution’s validation. The concept should be updated as the project progresses, respectively as new information is discovered.

Creating a concept for a DM can be considered a best practice because it allows documenting the context, the technical and organizational requirements, the dependencies existing between the DM and other projects, and how they will be addressed. The concept can also include a high-level plan of the main activities (to be detailed in a separate document).

Especially when the concept has an exploratory nature (due to incomplete knowledge or other considerations), it can be validated with the help of a proof-of-concept (PoC), the realization of a high-level-design prototype that focuses on the main characteristics of the solution and thus allows identifying the challenges. Once the PoC is implemented, the feedback can be used to round out the concept.

Building a PoC for a DM should be considered an objective even when the project doesn’t seem to face any major challenges. The PoC should focus on addressing the most important DM requirements, ideally by implementing the whole or the most important functionality (e.g. data extraction, data transformations, integrity validation, respectively the import into the target system) for one or two data entities. Once the PoC is built, the team can use it as a basis for the evolutive development of the solution during the iterations considered.

A strategy is a set of coordinated and sustainable actions following a set of well-defined goals, actions devised into a plan and designed to create value and overcome further challenges. A strategy has the character of a concept, though it has a broader scope, being usually considered when multiple projects or initiatives compete for the same resources, to provide a broader context and handle the challenges, risks and opportunities. Moreover, the strategy takes an inventory of the current issues and architecture – the 'AS-IS' perspective – and sketches the 'TO-BE' perspective by devising a roadmap that bridges the gap between the two.

In the case of a DM a strategy might be required when multiple DM projects need to be performed in parallel or sequentially, as it can help the organization to better manage the migrations.

A plan is a high-level document that describes the tasks, schedule and resources required to carry out an activity. Even if it typically refers to the work or product breakdown structure, it can cover other information usually available in a Business Case. A project plan is used to guide both project execution and project control, while in the context of Strategic Management the (strategic) plan provides a high-level roadmap on how the defined goals and objectives will be achieved during the period covered by the strategy.

For small DM projects a plan can be, in theory, enough. As both a strategy and a concept can include a high-level plan, the names are in practice interchangeable.

Previous Post <<||>> Next Post

15 May 2019

#️⃣Software Engineering: Programming (Part XV: Rapid Prototyping - Introduction)

Software Engineering Series

Rapid (software) prototyping (RSP) is a group of techniques applied in Software Engineering to quickly build a prototype (aka mockup, wireframe) to verify the technical or factual realization and feasibility of an application architecture, process or business model. A similar notion is that of the Proof-of-Concept (PoC), which attempts to demonstrate, by building a prototype, starting an experiment or a pilot project, that a technical concept, business proposal or theory has practical potential. In other words, in Software Engineering RSP encompasses the techniques by which a PoC is led.

In industries that deal with physical products, a prototype is typically a small-scale object made from inexpensive material that resembles the final product to a certain degree, with some characteristics, details or features being completely ignored (e.g. the inner design, some components, the finishing, etc.). Building several prototypes is much easier and cheaper than building the end product, allowing one to play with a concept or idea until it gets close to the final product. Moreover, this approach reduces the risk of ending up with a product nobody wants.

A similar approach and reasoning are used in Software Engineering as well. Building a prototype allows focusing at the beginning on the essential characteristics or aspects of the application, process or (business) model under consideration. Depending on the case, one can focus on the user interface (UI), database access, integration mechanisms or any other feature that involves a challenge. As in the case of the UI, one can build several prototypes that demonstrate different designs or architectures. The initial prototype can go through a series of transformations until it reaches the desired form, following then to integrate more functionality and refine the end product gradually. This iterative and incremental approach is known as rapid evolutionary prototyping.

A prototype is useful especially when dealing with uncertainty, e.g. when adopting (new) technologies or methodologies, when mixing technologies within an architecture, when the details of the implementation are not known, when exploring an idea, when the requirements are expected to change often, etc. Rapidly building a prototype allows validating the requirements, responding agilely to change, getting customers’ feedback and sign-off as early as possible, showing them what’s possible and what the future application can look like, and all this without investing too much effort. It’s easier to change a design or an architecture in the concept and design phases than later.

In BI, prototyping usually comes down to building queries to identify the source of the data, reengineer the logic from the business application, and prove whether the logic is technically feasible, feasibility translating into robustness, performance and flexibility. In projects that have a broader scope, one can attempt building the needed infrastructure for several reports, to make sure that the main requirements are met. Similarly, one can use prototyping to build a data warehouse or a data migration layer. Thus, one can build all or most of the logic for one or two entities, resolving the challenges for them, and once the challenges are solved, one can go ahead and gradually integrate the other entities.

Rapid prototyping can be used also in the implementation of a strategy or management system to prove the concepts behind it. One can thus start with a narrow focus and integrate more functions, processes and business segments gradually, in iterative and incremental steps, each step allowing to integrate the lessons learned, address the risks and opportunities, check the progress and change direction as needed.

Rapid prototyping can prove to be a useful tool when given the chance to prove its benefits. Through its iterative and incremental approaches, it allows reaching the targets efficiently.


