SQL Troubles

08 April 2024

🧭Business Intelligence: Why Data Projects Fail to Deliver Real-Life Impact (Part III: Failure through the Looking Glass)

Business Intelligence Series

There’s a huge volume of material available on project failure – resources that document why individual projects failed, why in general projects fail, why project members, managers and/or executives think projects fail, and there seems to be no other more rewarding activity at the end of a project than to theorize why a project failed, the topic culminating occasionally with the blaming game. Success may generate applause, though it's failure that attracts and stirs the most waves (irony, disapproval, and other similar behavior) and everybody seems to be an expert after the consumed endeavor.

The mere definition of a project failure – not fulfilling project’s objectives within the set budget and timeframe - is a misnomer because budgets and timelines are estimated based on the information available at the beginning of the project, the amount of uncertainty for many projects being considerable, and data projects are no exceptions from it. The higher the uncertainty the less probable are the two estimates. Even simple projects can reveal uncertainty especially when the broader context of the projects is considered.

Even if it’s not a common practice, one way to cope with uncertainty is to add a tolerance for the estimates, though even this practice probably will not always accommodate the full extent of the unknown as the tolerances are usually small. The general expectation is to have an accurate and precise landing, which for big or exploratory projects is seldom possible!

Moreover, the assumptions under which the estimates hold are easily invalidated in praxis – resources’ availability, first time right, executive’s support to set priorities, requirements’ quality, technologies’ maturity, etc. If one looks beyond the reasons why projects fail in general, quite often the issues are more organizational than technological, the lack of knowledge and experience being some of the factors.

Conversely, many projects will not get approved if the estimates don’t look positive, and therefore people are pressured in one way or another to make the numbers fit the expectations. Some projects, given their importance, need to be done even if the numbers don’t look good or can’t be quantified correctly. Other projects represent people’s subsistence on the job, respectively people's self-occupation to create motion, though they can occasionally have also a positive impact for the organizations. These kinds of aspects almost never make it in statistics or surveys. Neither do the big issues people are afraid to talk about. Where to consider that in the light of politics and office’s grapevine the facts get distorted!

Data projects reflect all the symptoms of failure projects have in general, though when words like AI, Statistics or Machine Learning are used, the chances for failure are even higher given that the respective fields require a higher level of expertise, the appropriate use of technologies and adherence to the scientific process for the results to be valid. If projects can benefit from general recipes, respectively established procedures and methods, their range of applicability decreases when the mentioned areas are involved.

Many data projects have an exploratory nature – seeing what’s possible - and therefore a considerable percentage will not reach production. Moreover, even those that reach that far might arrive to be stopped or discarded sooner or later if they don’t deliver the expected value, and probably many of the models created in the process are biased, irrelevant, or incorrectly apply the theory. Where to add that the mere use of tools and algorithms is not Data Science or Data Analysis.

The challenge for many data projects is to identify which Project Management (PM) best practices to consider. Following all or no practices at all just increases the risks of failure!

Previous Post <<||>> Next Post

06 April 2024

🧭Business Intelligence: Why Data Projects Fail to Deliver Real-Life Impact (Part II: There's Value in Failure)

Business Intelligence Series

"Results are nothing; the energies which produce them

and which again spring from them are everything."

(Wilhelm von Humboldt, "On Language", 1836)

When the data is not available and is needed on a continuous basis then usually the solution is to redesign the processes and make sure the data becomes available at the needed quality level. Redesign involves additional costs for the business; therefore, it might be tempting to cancel or postpone data projects, at least until they become feasible, though they’re seldom feasible.

Just because there’s a set of data, this doesn’t mean that there is important knowledge to be extracted from it, respectively that the investment is feasible. There’s however value in building experience in the internal resources, in identifying the challenges and the opportunities, in identifying what needs to be changed for harnessing the data. Unfortunately, organizations expect that somebody else will do the work for them instead of doing the jump by themselves, and this approach more likely will fail. It’s like expecting to get enlightened after a few theoretical sessions with a guru than walking the path by oneself.

This is reflected also in organizations’ readiness to do the required endeavors for making the jump on the maturity scale. If organizations can’t approach such topics systematically and address the assumptions, opportunities, and risks adequately, respectively to manage the various aspects, it’s hard to believe that their data journey will be positive.

A data journey shouldn’t be about politics even if some minds need to be changed in the process, at management as well as at lower level. If the leadership doesn’t recognize the importance of becoming an enabler for such initiatives, then the organization probably deserves to keep the status quo. The drive for change should come from the leadership even if we talk about data culture, data strategy, decision-making, or any critical aspect.

An organization will always need to find the balance between time, scope, cost, and quality, and this applies to operations, tactics, and strategies as well as to projects. There are hard limits and lot of uncertainty associated with data projects and the tasks involved, limits reflected in cost and time estimations (which frankly are just expert’s rough guesses that can change for the worst in the light of new information). Therefore, especially in data projects one needs to be able to compromise, to change scope and timelines as seems fit, and why not, to cancel the projects if the objectives aren’t feasible anymore, respectively if compromises can’t be reached.

An organization must be able to take the risks and invest in failure, otherwise the opportunities for growth don’t change. Being able to split a roadmap into small iterative steps that allow besides breaking down the complexity and making progress to evaluate the progress and the knowledge resulted, respectively incorporate the feedback and knowledge in the next steps, can prove to be what organizations lack in coping with the high uncertainty. Instead, organizations seem to be fascinated by the big bang, thinking that technology can automatically fill the organizational gaps.

Doing the same thing repeatedly and expecting different results is called insanity. Unfortunately, this is what organizations and service providers do in what concerns Project Management in general and data projects in particular. Building something without a foundation, without making sure that the employees have the skillset, maturity and culture to manage the data-related tasks, challenges and opportunities is pure insanity!

Bottom line, harnessing the data requires a certain maturity and it starts with recognizing and pursuing opportunities, setting goals, following roadmaps, learning to fail and getting value from failure, respectively controlling the failure. Growth or instant enlightenment without a fair amount of sweat is possible, though that’s an exception for few in sight!

Previous Post <<||>> Next Post

🧭Business Intelligence: Why Data Projects Fail to Deliver Real-Life Impact (Part I: First Thoughts)

Business Intelligence Series

A data project has a set of assumptions and requirements that must be met, otherwise the project has a high chance of failing. It starts with a clear idea of the goals and objectives, and they need to be achievable and feasible, with the involvement of key stakeholders and the executive without which it’s impossible to change the organization’s data culture. Ideally, there should also be a business strategy, respectively a data strategy available to understand the driving forces and the broader requirements.

An organization’s readiness is important not only in what concerns the data but also the things revolving around the data - processes, systems, decision-making, requirements management, project management, etc. One of the challenges is that the systems and processes available can’t be used as they are for answering important business questions, and many of such questions are quite basic, though unavailability or poor quality of data makes this challenging if not impossible.

Thus, when starting a data project an organization must be ready to change some of its processes to address a project’s needs, and thus the project can become more expensive as changes need to be made to the systems. For many organizations the best time to have done this was when they implemented the system, respectively the integration(s) between systems. Any changes made after that come in theory with higher costs derived from systems and processes’ redesign.

Many projects start big and data projects are no exception to this. Some of them build a costly infrastructure without first analyzing the feasibility of the investment, or at least whether the data can form a basis for answering the targeted questions. On one side one can torture any dataset and some knowledge will be obtained from it (aka data will confess), though few datasets can produce valuable insights, and this is where probably many data projects oversell their potential. Conversely, some initiatives are worth pursuing even only for the sake of the exposure and experience the employees get. However, trying to build something big only through the perspective of one project can easily become a disaster.

When building a data infrastructure, the project needs to be an initiative given the transformative potential such an endeavor can have for the organization, and the different aspects must be managed accordingly. It starts with the management of stakeholders’ expectations, with building a data strategy, respectively with addressing the opportunities and risks associated with the broader context.

Organizations recognize that they aren’t capable of planning and executing such a project or initiative, and they search for a partner to lead the way. Becoming overnight such a partner is more than a challenge as a good understanding of the industry and the business is needed. Some service providers have such knowledge, at least in theory, though the leap from knowledge to results can prove to be a challenge even for experienced service providers.

Many projects follow the pattern: the service provider comes, analyzes the requirements, builds something wonderful, the solution is used for some time and then the business realizes that the result is not what was intended. The causes are multiple and usually form a complex network of causality, though probably the most important aspect is that customers don’t have the in-house technical resources to evaluate the feasibility of requirements, solutions, respectively of the results. Even if organizations involve the best key users, are needed also good data professionals or similar resources who can become the bond between the business and the services provider. Without such an intermediary the disconnect between the business and the service provider can grow with all the implications.

Previous Post <<||>> Next Post

🏭🗒️Microsoft Fabric: Data Governance [Notes]

Disclaimer: This is work in progress intended to consolidate information from various sources for learning purposes. For the latest information please consult the documentation (see the links below)!

Last updated: 23-May-2024

[Microsoft Fabric] Data Governance

{definition}set of capabilities that help organizations to manage, protect, monitor, and improve the discoverability of data, so as to meet data governance (and compliance) requirements and regulations [2]
several built-in governance features are available to manage and control the data within Fabric (MF) [1]
{feature} endorsement [aka content endorsement]

{definition} formal process performed by admins to endorse MF items
{benefit} allows admins to designate specific MF items as trusted and approved for use across the organization [1]

establishes trust in data assets by promoting and certifying specific MF items [1]

users know which assets they can trust and rely on for accurate information [1]

endorsed assets are identified with a badge that indicates they have been reviewed and approved [1]

{scope} applies to all MF items except dashboards [1]
{benefit} helps admin manage the overall growth of items across your environment [1]

{feature} promoting [aka content promoting]

{definition} formal process performed by contributors or admins to promote content
promoted content appears with a Promoted badge in the MF portal [1]

workspace members with the contributor or admin role can promote content within a workspace [1]
MF admin can promote content across the organization [1]

{feature} certification [aka content certification]

{definition} formal process that involves a review of the content by a designated reviewer and managed by the admin [1]

can be customized to meet organization’s needs [1]
users can request item certification from an admin [1]

via Request certification from the More menu [1]

the certified content appears with a Certified badge in the Fabric portal [1]

{benefit} allows organizations to label items considered to be quality items [1]

an organization can certify items to identify them an as authoritative sources for critical information [1]

⇐ all Fabric items except Power BI dashboards can be certified [1]

{benefit} allows to specify certifiers who are experts in the domain [1]
domain level settings

enable or disable certification of items that belong to the domain [1]

provides a URL to documentation that is relevant to certification in the domain [1]

{feature} tenant (aka Microsoft Fabric tenant, MF tenant)

a single instance of Fabric for an organization that is aligned with a Microsoft Entra ID
can contain any number of workspaces

{feature} workspaces

{definition} a collection of items that brings together different functionality in a single environment designed for collaboration
can be assigned to teams or departments based on governance requirements and data boundaries [2]
are associated with domains [3]

⇐ {benefit} allows to group data into business domains
all the items in the workspace are then associated with the domain, and they receive a domain attribute as part of their metadata [3]

⇐ {benefit} enables a better consumption experience [1]

{benefit} enables better discoverability and governance [2]

{feature} domains [Notes]

{definition} a way of logically grouping together data in an organization that is relevant to a particular area or field [1]
allows to group data by business domains

⇒{benefit} allows business domains to manage their data according to their specific regulations, restrictions, and needs [3]

{feature} subdomains
- {definition} a way for fine tuning the logical grouping data under a domain [1]

{feature} labeling

default labeling, label inheritance, and programmatic labeling,
{benefit} help achieve maximal sensitivity label coverage across MF [2]
once labeled, data remains protected even when it's exported out of MF via supported export paths [2]
[Purview Audit] compliance admins can monitor activities on sensitivity labels

{feature|preview} folders

{definition} a way of logically grouping MF items

{feature|preview} tags

{benefit} allow managing Fabric items for enhanced compliance, discoverability, and reuse

{feature} scanner API

a set of admin REST APIs
{benefit} allows to scan MF items for sensitive data [1]
can be used to scan both structured and unstructured data [1]
{concept} metadata scanning

facilitates governance of data by enabling cataloging and reporting on all the metadata of organization's Fabric items [1]
it needs to be set up by Admin before metadata scanning can be run [1]

{concept} data lineage

{definition}
{benefit} allows to track the flow of data through Fabric [1]
{benefit} allows to see where data comes from, how it's transformed, and where it goes [1]
{benefit} helps understand the data available in Fabric, and how it's being used [1]

{concept} Fabric item (aka MF item)

{definition} a set of capabilities within an experience

form the building blocks of the Fabric platform

{type} data warehouse
{type} data pipeline
{type} semantic model
{type} reports
{type} dashboards
{type} notebook
{type} lakehouse
{type} metric set

Previous Post <<||>> Next Post

Resources:
[1] Microsoft Learn (2023) Administer Microsoft Fabric (link)
[2] Microsoft Learn - Fabric (2024) Governance overview and guidance (link)
[3] Microsoft Learn: Fabric (2023) Fabric domains (link)
[4] Establishing Data Mesh architectural pattern with Domains and OneLake on Microsoft Fabric, by Maheswaran Arunachalam (link)

Resources:
[R1] Microsoft Learn (2025) Fabric: What's new in Microsoft Fabric? [link]

Acronyms:
API - Application Programming Interface
MF - Microsoft Fabric

05 April 2024

💎SQL Reloaded: SQL Antipatterns (Part I: JOINs, UNIONs & DISTINCT)

Introduction

SQL antipatterns refer in general to common mistakes made when developing SQL code, though the term can refer also to situations in which even if the code is syntactically and logically correct, it's either suboptimal, unclear or even incorrect. Therefore "mistake" can cover a wide range of scenarios, some that can be ignored, while others need to be addressed accordingly.

In this post I consider a few antipatterns observed especially in data warehouses (DWHs). Let's look at the below code created to exemplify several scenarios:

-- Products in open orders (initial query)
SELECT DISTINCT ITM.ProductId                                   -- (1) use of DISTINCT
, ITM.ProductNumber
, ITM.Name 
, ITM.Color 
, ITM.Style 
, ITM.Size
FROM Production.Product ITM
    LEFT JOIN (							-- (5) use of JOIN instead of EXISTS
	-- Open Purchase orders 
	SELECT DISTINCT POL.ProductId
	, 'POs' Source                                          -- (7) use columns not needed in output
	FROM Purchasing.PurchaseOrderDetail POL                 
	     LEFT JOIN Purchasing.PurchaseOrderHeader POH       -- (2) use of LEFT JOIN instead of FULL JOIN
		  ON POL.PurchaseOrderID = POH.PurchaseOrderID
	WHERE POH.Status = 1 -- pending 
	UNION					                -- (3) use of UNION
	-- Open Sales orders 
	SELECT DISTINCT SOL.ProductId
	, 'SOs' Source
	FROM Sales.SalesOrderDetail SOL
	    LEFT JOIN Sales.SalesOrderHeader SOH
		  ON SOL.SalesOrderID = SOH.SalesOrderID
	WHERE SOH.Status = 1 -- in process		        -- (4) use of OR instead of IN
	   OR SOH.Status = 2 -- approved
	) DAT
	ON ITM.ProductID = DAT.ProductID
WHERE DAT.ProductID IS NOT NULL 
ORDER BY ITM.ProductNumber			                -- (6) using too many columns in ORDER BY
, ITM.Name 
, ITM.Color 
, ITM.Style 
, ITM.Size

(1) Use of DISTINCT

DISTINCT is a dirty way to remove the duplicates from a dataset. Sometimes it makes sense to use it to check something fast, though it should be avoided into code intended for a production environment because it can lead to unexpected behavior especially when selecting all the columns using the "*" (SELECT DISTINCT *).

I saw tools and developers adding a DISTINCT in almost each step, independently on whether it was necessary or not. One can thus but wonder whether the DISTINCT was added to fix a bigger issue with the data in the DWH, to remove special duplicates imposed by the logic or just as poor practice. Unfortunately, when it's used frequently, it can become challenging to investigate its use and discover the actual issues from the DWH.

There are several approaches to eliminate DISTINCTs from the code: GROUP BY, ranking functions or upon case also code rewrites.

(2) Use of LEFT JOIN Instead of FULL JOIN

When refreshing a DWH there can be the case that related data are out of synch. It would be the case of Purchase or Sales orders where the headers and lines are out of synch, headers existing without lines and/or vice versa. A common practice is to use a FULL JOIN and thus to eliminate such exceptions, though there are also entitled uses of a LEFT JOIN. This antipattern resumes however to the cases in which logically should be used a FULL JOIN, though a LEFT JOIN is used instead.

In the above example there are two distinct occurrences of this pattern: the relationship between the header and lines in the inner query, respectively the LEFT JOIN with a NOT NULL constraint in the outer query. The latter use is useful when during testing one wants to see all the Products, though bringing this further into production may rise some eyebrows even if it's not necessarily wrong. Anyway, the database engine should be smart enough to recognize such a scenario. However, for the header vs lines case, upon case the plan generated might be suboptimal.

One of the best practices when writing SQL queries is to state one's intent clearly in what the logic concerns. Using a LEFT JOIN instead of a FULL JOIN can make people raise questions about the actual need. When the need is not properly documented, some developer may even go and change the joins. There can be for example business cases that are not cover by the current data, but as soon as case appears it will lead to incorrect logic!

Similarly, splitting a piece of logic into two or more steps unnecessarily can create confusion. There can be however also entitled s situations (e.g. query optimization), which ideally should be documented.

(3) Use of UNION

When a UNION is used, the values returned by the first query will be checked against the values of the second query, and thus unnecessary comparisons occur even if they are not needed. This depends also on the business context, which might not be easily to identify from the query (especially when the reviewer doesn't know the business case).

The misuse of a UNION will not make a big difference when the volume of data is small, though the more data are processed by the query, the higher the impact.

Besides the proper use of the UNION, there are also situations in which a query rewrite can eliminate the need for a UNION (see the rewritten query below).

(4) use of OR instead of IN

One can occasionally find queries in which a OR was used for 10 to 50 distinct values as in the example above. Even if the database engine generate in both cases the same query plan, it's easier to read and maintain a query that used IN. However, if the number of values go beyond a certain value, other techniques should be used to improve the performance.

The only benefit I see for a OR is to document meaning's values or remove during testing one of the values, especially when the list is a selection coming from the user. Frankly, it's not the appropriate way for documenting logic (even if I'm doing it sometimes in ad-hoc queries).

There's a more extreme scenario in which distinct subqueries are written for each or a set of ORs (e.g. the distinction between open vs closed vs. invoices orders), which can make sense sometimes (e.g. the logic is completely different). Therefore, an antipattern can be dependent also of the context or use case.

(5) use of JOIN instead of EXISTS

When there are no values returned from the subquery, quite often it makes sense to the EXISTS or not EXISTS operators in the queries (see the rewritten query below). This might not be indicated however for distributed environments like serverless SQL pool in which the distribution of the processing across multiple tasks might benefit when the pieces of the logic distributed don't require heavy reshuffles.

(6) Using too Many Columns in ORDER BY

The columns specified in an ORDER BY clause need to make sense, otherwise they just add extra burden on the database engine, at least from the perspective of the checks that need to be performed. In the above query, at least the Name doesn't make sense.

It helps also if the columns can use existing indexes, though this depends also on query specifics.

Another antipattern scenario not exemplified above is the use of ordinals to refer to the columns, which should be avoided in production environments (because the order of the columns can be changed accidentally or even :

-- using ordinals instead of number columns (not recommended)
SELECT ITM.ProductId                                  
, ITM.ProductNumber
, ITM.Name 
, ITM.Color 
, ITM.Style 
, ITM.Size
FROM Production.Product ITM                           
ORDER BY 2, 4, 5, 6

(7) Use Columns Not Needed in Output

Besides the fact that each column included unnecessarily in the query can increase the size of the data processed (unless the database engine is smart to remove them), there can be also performance issues and/or optimizations involved. For example, if all the other columns are part of a covering index, the database engine might opt for a suboptimal index compared to the case in which the unnecessary columns are removed.

Conversely, some columns are helpful to troubleshoot the logic (and that's why the Source column was considered) even if they aren't considered in the final output or the logic. It doesn't make sense to bring the query version with the respective fields into production, even if this would mean to have maybe a second version of the query used only for troubleshooting needs. Commenting the unnecessary columns could be a better choice, even if it's not recommended in general as too many such comments can obfuscate the case.

Rewriting the Query

With the above input the query can be rewritten as follows:

-- Products in open orders (modified query)
SELECT ITM.ProductId                                  
, ITM.ProductNumber
, ITM.Name 
, ITM.Color 
, ITM.Style 
, ITM.Size
FROM Production.Product ITM
WHERE EXISTS (										            
	-- Open Purchase orders 
	SELECT POL.ProductId
	FROM Purchasing.PurchaseOrderDetail POL                 
	     JOIN Purchasing.PurchaseOrderHeader POH      
		  ON POL.PurchaseOrderID = POH.PurchaseOrderID
	WHERE POH.Status = 1 
	  AND ITM.ProductID = POL.ProductID
	)
 OR EXISTS (				                                    
	-- Open Sales orders 
	SELECT SOL.ProductId
	FROM Sales.SalesOrderDetail SOL
	     JOIN Sales.SalesOrderHeader SOH
		  ON SOL.SalesOrderID = SOH.SalesOrderID
	WHERE SOH.Status IN (1, 2)
	  AND ITM.ProductID = SOL.ProductID
	)	                           
ORDER BY ITM.ProductNumber			                           
, ITM.Color 
, ITM.Style 
, ITM.Size

Please note that in general to each rule there are also exceptions which should be considered against one's common sense. When the benefit of addressing an antipattern is neglectable compared with the effort involved and the logic doesn't create any issues, probably it's better to let the code as it. One can still reconsider the antipatterns later with the next refactoring opportunity.

There are zealous data professionals who treat minor inconveniences (e.g. not using upper case for SQL reserved words, alternate code formatting, alternative writing of words, especially function names, different indentation, the use of "--" for commenting within a query, etc.) as antipatterns. Use your common sense and evaluate the effort against the benefits or risks and, not less important, be patient with others' mistakes!

Happy coding!

Previous Post <<||>> Next Post

03 April 2024

🧭Business Intelligence: Perspectives (Part 10: The Top 5 Pains of a BI/Analytics Manager)

Business Intelligence Series

1) Business Strategy

A business strategy is supposed to define an organization's mission, vision, values, direction, purpose, goals, objectives, respectively the roadmap, alternatives, capabilities considered to achieve them. All this information is needed by the BI manager to sketch the BI strategy needed to support the business strategy.

Without them, the BI manager must extrapolate, and one thing is to base one's decisions on a clearly stated and communicated business strategy, and another thing to work with vague declarations full of uncertainty. In the latter sense, it's like attempting to build castles into thin air and expecting to have a solid foundation. It may work as many BI requirements are common across organizations, but it can also become a disaster.

2) BI/Data Strategy

Organizations usually differentiate between the BI and the data Strategy because different driving forces and needs are involved, even if there are common goals, needs and opportunities that must be considered from both perspectives. When there's no data strategy available, the BI manager is either forced to address thus many data-related topics (e.g. data culture, data quality, metadata management, data governance), or ignore them with all consequences deriving from this.

A BI strategy is an extension of the business, data and IT strategies into the BI knowledge areas. Unfortunately, few organizations give it the required attention. Besides the fact that the BI strategy breaks down the business strategy from its perspective, it also adds its own goals and objectives which are ideally aligned with the ones from the other strategies.

3) Data Culture

Data culture is "the collective beliefs, values, behaviors, and practices of an organization’s employees in harnessing the value of data for decision-making, operations, or insight". Therefore, data culture is an enabler which, when the many aspects are addressed adequately, can have a multiplier effect for the BI strategy and its execution. Conversely, when basic data culture assumptions and requirements aren't addressed, the interrelated issues resulting from this can prove to be a barrier for the BI projects, operations and strategy.

As mentioned before, an organization’s (data) culture is created, managed, nourished, and destroyed through leadership. If the other leaders aren't playing along, each challenge related to data culture and BI will become a concern for the BI manager.

4) Managing Expectations

A business has great expectations from the investment in its BI infrastructure, especially when the vendors promise competitive advantage, real-time access to data and insights, self-service capabilities, etc. Even if these promises are achievable, they represent a potential that needs to be harnessed and there are several premises that need to be addressed continuously.

Some BI strategies and/or projects address these expectations from the beginning, though there are many organizations that ignore or don't give them the required importance. Unfortunately, these expectations (re)surface when people start using the infrastructure and this can easily become an acceptance issue. It's the BI manager's responsibility to ensure expectations are managed accordingly.

5) Building the Right BI Architecture

For the BI architecture the main driving forces are the shifts in technologies from single servers to distributed environments, from relational tables and data warehouses to delta tables and delta lakes built with the data mesh's principles and product-orientation in mind, which increase the overall complexity considerably. Vendors and data professionals' vision of how the architectures of the future will look like still has major milestones and challenges to surpass.

Therefore, organizations are forced to explore the new architectures and the opportunities they bring, however this involves a considerable effort, skilled resources, and more iterations. Conversely, ignoring these trends might prove to be an opportunity lost and eventually duplicated effort on the long term.

Previous Post <<||>> Next Post

01 April 2024

📊R Language: Data Transformations (Part I: Temperatures' comparison between F° and C°)

The time series used for weather analysis use either Fahrenheit (F°) or Celsius (C°) for the temperature values. Looking at the A and B plots below that represent the values of the same dataset in F°, respectively C°, there seems to be no difference between the two plots independently on whether one works with F° or C°, however the scales are different. Once one uses the same scale for both values (see C) the plots are distorted according to the formula used for transformation.

Comments:
(1) Typically, it makes sense to adapt the temperature scale to the audience, though on the Web there will be always a mix of audiences (and that's why weather websites allow to choose one of the values).
(2) Not starting from 0 might show in the end the same trend at same scale, though the behavior can change occasionally. As long as the Y-axis is correctly labeled, this shouldn't be a problem. Conversely, it's better to control the scale and provide the min-max values for the axis accordingly.
(3) When creating such plots, it's important to be aware of the distortion that might be introduced by transformations. For linear transformations of the type a*x+b, the value of the "a" coefficient tells how much the resulting values are stretched or contracted.

I used as exemplification the airquality dataset which contains data for 1973, the temperature being given in F°. Unfortunately, the dataset contains only the day and the month, so the date must be constructed and added to the dataset. For simplification, I've added the calculated temperature in C° as column as well:

#reviewing the data
help("airquality")

#preparing the data
head(airquality)
airquality$date <- with(airquality, as.Date(ISOdate(1973, Month, Day))) #adding the date
airquality$TempC <- with(airquality, (Temp - 32) * 5/9) #adding the temperature in C°
head(airquality)

And, here's the code used to generate the plots:

#Temperatures' comparison between F° and C°
par(mfrow = c(2,2)) #1x2 matrix display

plot(airquality$date, airquality$Temp, ylab="Temperature (F°)", xlab="date", type="l", col="blue", main="A")

plot(airquality$date, airquality$TempC, ylab="Temperature (C°)", xlab="date", type="l", col="brown", main="B")

plot(airquality$date, airquality$Temp, ylab="Temperature (F°) vs (C°)", xlab="date", ylim=c(0,100), type="l", col="blue", main="C")
lines(airquality$date, airquality$TempC, col="brown")

# using inline formula
plot(airquality$date, (airquality$Temp - 32) * 5/9, ylab="(Temp-32)*5/9", xlab="date", ylim=c(0,100), type="l", col="brown", main="D")

mtext("© sql-troubles@blogspot.com @sql_troubles, 2024", side = 1, line = 4, adj = 1, col = "dodgerblue4", cex = .7)
title("Temperatures' comparison between F° and C°", line = -1, outer = TRUE)

In the fourth plot I directly used the formula for transforming the values from F° and C°. If the values based on the formula need to be used repeatedly, it's probably better to add a column to the dataset.

Unfortunately, the standard library has its limitations when creating visualizations. While writing this post I tried to work also with the plotly library, which offers a richer set of tools and can be used to create wonderful visualizations (though it proves also more complex to use).

install.packages("plotly")
library("plotly")

Here's the code used to plot the below graphic (the points have labels, much like in Power BI):

fig <- plot_ly(airquality, type = 'scatter', mode = 'lines+markers')%>%
  add_trace(x = ~date, y = ~Temp, name = 'Temp (F)')%>%
  add_trace(x = ~date, y = ~TempC, name = 'Temp (C)')%>%
  layout(showlegend = F, title="Temperatures' comparison between K° and C°")

fig

The temperatures via Plotly

Happy coding!

Previous Post <<||>> Next Post

31 March 2024

🏭🗒️Microsoft Fabric: Polaris [Notes]

Disclaimer: This is work in progress intended to consolidate information from various sources and may deviate from them. Please consult the sources for the exact content!

Last updated: 31-Mar-2024

[Microsoft Fabric] Polaris

{def} cloud-native analytical query engine over the data lake that follows a stateless micro-service architecture and is designed to execute queries in a scalable, dynamic and fault-tolerant way [1], [2]

the engine behind the serverless SQL pool [1] and Microsoft Fabric [2]
petabyte-scale execution [1]
highly-available micro-service architecture

data and query processing is packaged into units (aka tasks) [1]

can be readily moved across compute nodes and re-started at the task level [1]

can run directly over data in HDFS and in managed transactional stores [1]

[Azure Synapse] designed initially to execute read-only queries [1]

⇐ the architecture behind serverless SQL pool
uses a completely new scale-out framework based on a distributed SQL Server query engine [1]

fully compatible with T-SQL
leverages SQL Server single-node runtime and QO [1]

[Microsoft Fabric] extended with a complete transaction manager that executes general CRUD transactions [2]

incl. updates, deletes and bulk loads [2]
based on [delta tables] and [delta lake]

the delta lake supports currently only transactions within one table [4]

⇐ the architecture behind lakehouses

{goal} converge DWH and big data workloads [1]

the query engine scales-out for relational data and heterogeneous datasets stored in DFSs[1]

needs a clean abstraction over the underlying data type and format, capturing just what’s needed for efficiently parallelizing data processing

{goal} separate compute and state for cloud-native execution [1]

all services within a pool are stateless

data is stored durably in remote storage and is abstracted via data cells [1]

⇐ data is naturally decoupled from compute nodes

the metadata and transactional log state is off-loaded to centralized services [[1]
multiple compute pools can transactionally access the same logical database [1]

{goal} cloud-first [2]

{benefit} leverages elasticity
transactions need to be resilient to node failures on dynamically changing topologies [2]

⇒ the storage engine disaggregates the source of truth for execution state (including data, metadata and transactional state) from compute nodes [2]

must ensure disaggregation of metadata and transactional state from compute nodes [2]

⇐ to ensure that the life span of a transaction is resilient to changes in the backend compute topology [2]

⇐ can change dynamically to take advantage of the elastic nature of the cloud or to handle node failures [2]

{goal} use optimized native columnar, immutable and open storage format [2]

uses delta format

⇐ optimized to handle read-heavy workloads with low contention [2]

{goal} leverage the full potential of vectorized query processing for SQL [2]
{goal} support zero-copy data sharing with other services in the lake [2]
{goal} support read-heavy workloads with low contention [2]
{goal} support lineage-based features [2]

by taking advantage of delta table capabilities

{goal} provide full SQL SI transactional support [2]

{benefit} all traditional DWH requirements are met [2]

incl. multi-table and multi-statement transactions [2]

⇐ Polaris is the only system that supports this [2]
the design is optimized for analytics, specifically read- and insert-intensive workloads [2]
mixes of transactions are supported as well

{objective} no cross-component state sharing [2]

{principle} encapsulation of state within each component to avoid sharing state across nodes [2]
SI and the isolation of state across components allows to execute transactions as if they were queries [2]

⇒ makes read and write transactions indistinguishable [2]

⇒ allows to fully leverage its optimized distributed execution framework [2]

{objective} support snapshot Isolation (SI) semantics [2]

implemented over versioned data
allows reads (R) and writes (W) to proceed concurrently over their own data snapshot

R/W never conflict, and W/W of active transactions only conflict if they modify the same data [2]

⇐ all W transactions are serializable, leading to a serial schedule in increasing order of log record IDs [4]

follows from the commit protocol for write transactions, where only one transaction can write the record with each record ID [4]

⇐ R transactions at the snapshot isolation level create no contention

⇒ any number of R transactions can run concurrently [4]

the immutable data representation in LSTs allows dealing with failures by simply discarding data and metadata files that represent uncommitted changes [2]

similar to how temporary tables are discarded during query processing failures [2]

{feature} resize live workloads [1]

scales resources with the workloads automatically

{feature} deliver predictable performance at scale [1]

scales computational resources based on workloads' needs

{feature} efficiently handle both relational and unstructured data [1]
{feature} flexible, fine-grained task monitoring

a task is the finest grain of execution

{feature} global resource-aware scheduling

enables much better resource utilization and concurrency than traditional DWHs

capable of handling partial query restarts
maintains a global view of multiple queries

it is planned to build on this a global view with autonomous workload management features

{feature} multi-layered data caching model

leverages

SQL Server buffer pools for cashing columnar data
SSD caching

the delta table and its log are are immutable, they can be safely cached on cluster nodes [4]

{feature} tracks data lineage natively

the transaction log can also be used to audit logging based on the commit Info records [4]

{feature} versioning

maintain all versions as data is updated [1]

{feature} time-travel

{benefit} allows users query point-in-time snapshots
{benefit)} allows to roll back erroneous updates to the data.

{feature} table cloning

{benefit} allows to create a point-in-time snapshot of the data based on its metadata

{concept} state

allows to drive the end-to-end life cycle of a SQL statement with transactional guarantees and top tier performance [1]
comprised of

cache
metadata
transaction logs
data

[on-premises architecture] all state is in the compute layer

relies on small, highly stable and homogenous clusters with dedicated hardware for Tier-1 performance
{downside} expensive
{downside} hard to maintain
{downside} limited scalability

cluster capacity is bounded by machine sizes because of the fixed topology

{concept}[stateful architecture]

the state of inflight transactions is stored in the compute node and is not hardened into persistent storage until the transaction commits [1]

⇒ when a compute node fails, the state of non-committed transactions is lost [1]

⇒ the in-flight transactions fail as well [1]

often also couples metadata describing data distributions and mappings to compute nodes [1]

⇒ a compute node effectively owns responsibility for processing a subset of the data [1]

its ownership cannot be transferred without a cluster restart [1]

{downside} resilience to compute node failure and elastic assignment of data to compute are not possible [1]

{concept} stateless compute architecture

requires that compute nodes hold no state information [1]

⇒ all data, transactional logs and metadata need to be externalized [1]

{benefit} allows applications to

partially restart the execution of queries in the event of compute node failures [1]
adapt to online changes of the cluster topology without failing in-flight transactions [1]

caches need to be as close to the compute as possible [1]

since they can be lazily reconstructed from persisted data they don’t necessarily need to be decoupled from compute [1]

the coupling of caches and compute does not make the architecture stateful [1]

{concept} [cloud] decoupling of compute and storage

provides more flexible resource scaling

the 2 layers can scale up and down independently adapting to user needs [1]
customers pay for the compute needed to query a working subset of the data [1]

is not the same as decoupling compute and state [1]

if any of the remaining state held in compute cannot be reconstructed from external services, then compute remains stateful [1]

Previous Post <<||>> Next Post

References:
[1] Josep Aguilar-Saborit et al (2020) POLARIS: The Distributed SQL Engine in Azure Synapse, Proceedings of the VLDB Endowment PVLDB 13(12) (link)
[2] Josep Aguilar-Saborit et al (2024), Extending Polaris to Support Transactions (link)
[3] Advancing Analytics (2021) Azure Synapse Analytics - Polaris Whitepaper Deep-Dive (link)
[4] Michael Armbrust et al (2020) Delta Lake: High-Performance ACID Table Storage over Cloud Object Stores, Proceedings of the VLDB Endowment 13(12) (link)

Resources:

[R1] Microsoft Learn (2025) Fabric: What's new in Microsoft Fabric? [link]

Acronyms:

ADLS - Azure Data Lake Storage

CRUD - Create, Read, Update, Delete

DCP - Distributed Computation Platform

DFS - Distributed File System

DWH - data warehouse

HDFS - Hadoop DFS
R/W - read/write

SI - Semantic Isolation

SSD - Solid-State Drive

29 March 2024

🗄️🗒️Data Management: Data [Notes]

Disclaimer: This is work in progress intended to consolidate information from various sources.
Last updated: 29-Mar-2024

[Data Management] Data

{def} raw, unrelated numbers or entries that represent facts, concepts, events, and/or associations
categorized by

domain

{type} transactional data
{type} master data
{type} configuration data

{subtype}hierarchical data
{subtype} reference data
{subtype} setup data
{subtype} policy

{type} analytical data

{subtype} measurements
{subtype} metrics
{subtype}

structuredness

{type} structured data
{type} semi-structured data
{type} unstructured data

statistical usage as variable

{type} categorical data (aka qualitative data)

{subtype} nominal data
{subtype} ordinal data
{subtype} binary data

{type} numerical data (aka quantitative data)

{subtype} discrete data
{subtype} continuous data

size

{type} small data
{type} big data

{concept} transactional data

{def} data that describe business transactions and/or events
supports the daily operations of an organization
commonly refers to data created and updated within operational systems
support applications that automated key business processes
usually stored in normalized tables

{concept} master data

{def}"data that provides the context for business activity data in the form of common and abstract concepts that relate to the activity" [2]

the key business entities on which transaction are executed

the dimensions around on which analysis is conducted

used to categorize, evaluate and aggregate transactional data

can be shared across more than one transactional applications
there are master data similar to most organizations, but also master data specific to certain industries
often appear in more than one area within the business
represent one version of the truth
can be further divided into specialized subsets
{concept} master data entity

core business entity used in different applications across the organization, together with their associated metadata, attributes, definitions, roles, connections and taxonomies
may be classified within a hierarchy

the way they describe, characterize and classify business concepts may actually cross multiple hierarchies in different ways

e.g. a party can be an individual, customer, employee, while a customer might be an individual, party or organization

do not change as frequent like transactional data

less volatile than transactional data
there are master data that don’t change at all

e.g. geographic locations

strategic asset of the business
needs to be managed with the same diligence as other strategic assets

{concept} metadata

{definition} "data that defines and describes the characteristics of other data, used to improve both business and technical understanding of data and data-related processes" [2]

data about data

refers to

database schemas for OLAP & OLTP systems
XML document schemas
report definitions
additional database table and column descriptions stored with extended properties or custom tables provided by SQL Server
application configuration data

{concept} analytical data

{definition} data that supports analytical activities

e.g. decision making, reporting queries and analysis

comprises

numerical values
metrics
measurements

stored in OLAP repositories

optimized for decision support
enterprise data warehouses
departmental data marts
within table structures designed to support aggregation, queries and data mining

{concept} hierarchical data
- {definition} data that reflects a hierarchy
- typically appears in analytical applications
- {concept} hierarchy
{concept} structured data

{definition} "data that has a strict metadata defined"

{concept} unstructured data

{definition} data that doesn't follow predefined metadata
involves all kinds of documents
can appear in a database, in a file, or even in printed material

{concept} semi-structured data

{definition} structured data stored within unstructured data,
data typically in XML form

XML is widely used for data exchange

can appear in stand-alone files or as part of a database (as a column in a table)
useful when metadata (the schema) changes frequently, or there’s no need for a detailed relational schema

Previous Post <<||>> Next Post

References:
[1] The Art of Service (2017) Master Data Management Course

[2] DAMA International (2011) "The DAMA Dictionary of Data Management",

28 March 2024

🗄️🗒️Data Management: Master Data Management [MDM] [Notes]

Disclaimer: This is work in progress intended to consolidate information from various sources.
Last updated: 28-Mar-2024

Master Data Management (MDM)

{definition} the technologies, processes, policies, standards and guiding principles that enable the management of master data values to enable consistent, shared, contextual use across systems, of the most accurate, timely, and relevant version of truth about essential business entities [2],[3]
{goal} enable sharing of information assets across business domains and applications within an organization [4]
{goal} provide authoritative source of reconciled and quality-assessed master (and reference) data [4]
{goal} lower cost and complexity through use of standards, common data models, and integration patterns [4]
{driver} meeting organizational data requirements
{driver} improving data quality
{driver} reducing the costs for data integration
{driver} reducing risks
{type} operational MDM

involves solutions for managing transactional data in operational applications [1]
rely heavily on data integration technologies

{type} analytical MDM

involves solutions for managing analytical master data
centered on providing high quality dimensions with multiple hierarchies [1]
cannot influence operational systems

any data cleansing made within operational application isn’t recognized by transactional applications [1]

⇒ inconsistencies to the main operational data [1]

transactional application knowledge isn’t available to the cleansing process

{type} enterprise MDM

involves solutions for managing both transactional and analytical master data

manages all master data entities
deliver maximum business value

operational data cleansing

improves the operational efficiencies of the applications and the business processes that use the applications

cross-application data need

consolidation
standardization
cleansing
distribution

needs to support high volume of transactions

⇒ master data must be contained in data models designed for OLTP

⇐ ODS don’t fulfill this requirement

{enabler} high-quality data
{enabler} data governance
{benefit} single source of truth

used to support both operational and analytical applications in a consistent manner [1]

{benefit} consistent reporting

reduces the inconsistencies experienced previously
influenced by complex transformations

{benefit} improved competitiveness

MDM reduces the complexity of integrating new data and systems into the organization

⇒ increased flexibility and improves competitiveness

ability to react to new business opportunities quickly with limited resources

{benefit} improved risk management

more reliable and consistent data improves the business’s ability to manage enterprise risk [1]

{benefit} improved operational efficiency and reduced costs

helps identify business’ pain point

by developing a strategy for managing master data

{benefit} improved decision making

reducing data inconsistency diminishes organizational data mistrust and facilitates clearer (and faster) business decisions [1]

{benefit} more reliable spend analysis and planning

better data integration helps planners come up with better decisions

improves the ability to

aggregate purchasing activities
coordinate competitive sourcing
be more predictable about future spending
generally improve vendor and supplier management

{benefit} regulatory compliance

allows to reduce compliance risk

helps satisfy governance, regulatory and compliance requirements

simplifies compliance auditing

enables more effective information controls that facilitate compliance with regulations

{benefit} increased information quality

enables organizations to monitor conformance more effectively

via metadata collection
it can track whether data meets information quality expectations across vertical applications, which reduces information scrap and rework

{benefit} quicker results

reduces the delays associated with extraction and transformation of data [1]

⇒ it speeds up the implementation of application migrations, modernization projects, and data warehouse/data mart construction [1]

{benefit} improved business productivity

gives enterprise architects the chance to explore how effective the organization is in automating its business processes by exploiting the information asset [1]

⇐ master data helps organizations realize how the same data entities are represented, manipulated, or exchanged across applications within the enterprise and how those objects relate to business process workflows [1]

{benefit} simplified application development

provides the opportunity to consolidate the application functionality associated with the data lifecycle [1]

⇐ consolidation in MDM is not limited to the data
⇒ provides a single functional to which different applications can subscribe

⇐ introducing a technical service layer for data lifecycle functionality provides the type of abstraction needed for deploying SOA or similar architectures

factors to consider for implementing an MDM:

effective technical infrastructure for collaboration [1]
organizational preparedness

for making a quick transition from a loosely combined confederation of vertical silos to a more tightly coupled collaborative framework
{recommendation} evaluate the kinds of training sessions and individual incentives required to create a smooth transition [1]

metadata management

via a metadata registry

{recommendation} sets up a mechanism for unifying a master data view when possible [1]
determines when that unification should be carried out [1]

technology integration

{recommendation} diagnose what technology needs to be integrated to support the process instead of developing the process around the technology [1]

anticipating/managing change

proper preparation and organization will subtly introduce change to the way people think and act as shown in any shift in pattern [1]
changes in reporting structures and needs are unavoidable

creating a partnership between Business and IT

IT roles

plays a major role in executing the MDM program[1]

business roles

identifying and standardizing master data [1]
facilitating change management within the MDM program [1]
establishing data ownership

measurably high data quality
overseeing processes via policies and procedures for data governance [1]

{challenge} establishing enterprise-wide data governance

{recommendation} define and distribute the policies and procedures governing the oversight of master data

seeking feedback from across the different application teams provides a chance to develop the stewardship framework agreed upon by the majority while preparing the organization for the transition [1]

{challenge} isolated islands of information

caused by vertical alignment of IT

makes it difficult to fix the dissimilarities in roles and responsibilities in relation to the isolated data sets because they are integrated into a master view [1]

caused by data ownership

the politics of information ownership and management have created artificial exclusive domains supervised by individuals who have no desire to centralize information [1]

{challenge} consolidating master data into a centrally managed data asset [1]

transfers the responsibility and accountability for information management from the lines of business to the organization [1]

{challenge} managing MDM

MDM should be considered a program and not a project or an application [1]

{challenge} achieving timely and accurate synchronization across disparate systems [1]
{challenge} different definitions of master metadata
- different coding schemes, data types, collations, and more
{challenge} data conflicts

{recommendation} resolve data conflicts during the project [5]
{recommendation} replicate the resolved data issues back to the source systems [5]

{challenge} domain knowledge

{recommendation} involve domain experts in an MDM project [5]

{challenge} documentation

{recommendation} properly document your master data and metadata [5]

approaches

{architecture} no central MDM

isn’t a real MDM approach
used when any kind of cross-system interaction is required [5]

e.g. performing analysis on data from multiple systems, ad-hoc merging and cleansing

{drawback} very inexpensive at the beginning; however, it turns out to be the most expensive over time [5]

{architecture} central metadata storage

provides unified, centrally maintained definitions for master data [5]

followed and implemented by all systems

ad-hoc merging and cleansing becomes somewhat simpler [5]
does not use a specialized solution for the central metadata storage [5]

⇐ the central storage of metadata is probably in an unstructured form

e.g. documents, worksheets, paper

{architecture} central metadata storage with identity mapping

stores keys that map tables in the MDM solution

only has keys from the systems in the MDM database; it does not have any other attributes [5]

{benefit} data integration applications can be developed much more quickly and easily [5]
{drawback} raises problems in regard to maintaining master data over time [5]

there is no versioning or auditing in place to follow the changes [5]

⇒ viable for a limited time only

e.g. during upgrading, testing, and the initial usage of a new ERP system to provide mapping back to the old ERP system

{architecture} central metadata storage and central data that is continuously merged

stores metadata as well as master data in a dedicated MDM system
master data is not inserted or updated in the MDM system [5]
the merging (and cleansing) of master data from source systems occurs continuously, regularly [5]
{drawback} continuous merging can become expensive [5]
the only viable use for this approach is for finding out what has changed in source systems from the last merge [5]

enables merging only the delta (new and updated data)

frequently used for analytical systems

{architecture} central MDM, single copy

involves a specialized MDM application

master data, together with its metadata, is maintained in a central location [5]
⇒ all existing applications are consumers of the master data

{drawback} upgrade all existing applications to consume master data from central storage instead of maintaining their own copies [5]

⇒ can be expensive
⇒ can be impossible (e.g. for older systems)

{drawback} needs to consolidate all metadata from all source systems [5]
{drawback} the process of creating and updating master data could simply be too slow [5]

because of the processes in place

{architecture} central MDM, multiple copies

uses central storage of master data and its metadata

⇐ the metadata here includes only an intersection of common metadata from source systems [5]
each source system maintains its own copy of master data, with additional attributes that pertain to that system only [5]

after master data is inserted into the central MDM system, it is replicated (preferably automatically) to source systems, where the source-specific attributes are updated [5]
{benefit} good compromise between cost, data quality, and the effectiveness of the CRUD process [5]
{drawback} update conflicts

different systems can also update the common data [5]

⇒ involves continuous merges as well [5]

{drawback} uses a special MDM application

Previous Post <<||>> Next Post

Acronyms:

MDM - Master Data Management

ODS - Operational Data Store

OLAP - online analytical processing

OLTP - online transactional processing

SOA - Service Oriented Architecture

References:
[1] The Art of Service (2017) Master Data Management Course
[2] DAMA International (2009) "The DAMA Guide to the Data Management Body of Knowledge" 1st Ed.

[3] Tony Fisher 2009 "The Data Asset"

[4] DAMA International (2017) "The DAMA Guide to the Data Management Body of Knowledge" 2nd Ed.

[5] Dejan Sarka et al (2012) Exam 70-463: Implementing a Data Warehouse with Microsoft SQL Server 2012 (Training Kit)

SQL Troubles

Pages

08 April 2024

🧭Business Intelligence: Why Data Projects Fail to Deliver Real-Life Impact (Part III: Failure through the Looking Glass)

06 April 2024

🧭Business Intelligence: Why Data Projects Fail to Deliver Real-Life Impact (Part II: There's Value in Failure)

🧭Business Intelligence: Why Data Projects Fail to Deliver Real-Life Impact (Part I: First Thoughts)

🏭🗒️Microsoft Fabric: Data Governance [Notes]

05 April 2024

💎SQL Reloaded: SQL Antipatterns (Part I: JOINs, UNIONs & DISTINCT)

03 April 2024

🧭Business Intelligence: Perspectives (Part 10: The Top 5 Pains of a BI/Analytics Manager)

01 April 2024

📊R Language: Data Transformations (Part I: Temperatures' comparison between F° and C°)

31 March 2024

🏭🗒️Microsoft Fabric: Polaris [Notes]

29 March 2024

🗄️🗒️Data Management: Data [Notes]

28 March 2024

🗄️🗒️Data Management: Master Data Management [MDM] [Notes]

About Me