SQL Troubles: February 2024

29 February 2024

R Language: Visualizing the Iris Dataset

When working with a dataset that has several numeric features, it's useful to visualize it to understand the shapes of each feature, usually by category or in the case of the iris dataset by species. For this purpose one can use a combination between a boxplot and a stripchart to obtain a visualization like the one below (click on the image for a better resolution):

Iris features by species (box & jitter plots combined)

And here's the code used to obtain the above visualization:

par(mfrow = c(2,2)) #2x2 matrix display

boxplot(iris$Petal.Width ~ iris$Species) 
stripchart(iris$Petal.Width ~ iris$Species
	, method = "jitter"
	, add = TRUE
	, vertical = TRUE
	, pch = 20
	, jitter = .5
	, col = c('steelblue', 'red', 'purple'))

boxplot(iris$Petal.Length ~ iris$Species) 
stripchart(iris$Petal.Length ~ iris$Species
	, method = "jitter"
	, add = TRUE
	, vertical = TRUE
	, pch = 20
	, jitter = .5
	, col = c('steelblue', 'red', 'purple'))

boxplot(iris$Sepal.Width ~ iris$Species) 
stripchart(iris$Sepal.Width ~ iris$Species
	, method = "jitter"
	, add = TRUE
	, vertical = TRUE
	, pch = 20
	, jitter = .5
	, col = c('steelblue', 'red', 'purple'))

boxplot(iris$Sepal.Length ~ iris$Species) 
stripchart(iris$Sepal.Length ~ iris$Species
	, method = "jitter"
	, add = TRUE
	, vertical = TRUE
	, pch = 20
	, jitter = .5
	, col = c('steelblue', 'red', 'purple'))

mtext("© sql-troubles@blogspot.com 2024", side = 1, line = 4, adj = 1, col = "dodgerblue4", cex = .7)
title("Iris Features (cm) by Species", line = -2, outer = TRUE)

By contrast, one can obtain a similar visualization with just a command:

plot(iris, col = c('steelblue', 'red', 'purple'), pch = 20)

title("Iris Features (cm) by Species", line = -1, outer = TRUE)
mtext("© sql-troubles@blogspot.com 2024", side = 1, line = 4, adj = 1, col = "dodgerblue4", cex = .7)

And here's the output:

Iris features by species (general plot)

One can improve the visualization by using a bigger contrast between colors (I preferred to use the same colors as in the previous visualization).

I find the first data visualization easier to understand and it provides more information about the shape of data even it requires more work.

Histograms make it easier to understand the distribution of values, though the visualizations make sense only when done by species:

Histograms of Setosa's features

And, here's the code:

par(mfrow = c(2,2)) #2x2 matrix display

setosa = subset(iris, Species == 'setosa') #focus only on setosa
hist(setosa$Sepal.Width)
hist(setosa$Sepal.Length)
hist(setosa$Petal.Width)
hist(setosa$Petal.Length)

title("Setosa's Features (cm)", line = -1, outer = TRUE)
mtext("© sql-troubles@blogspot.com 2024", side = 1, line = 4, adj = 1, col = "dodgerblue4", cex = .7)

There's however a visual called staked histogram that allows to delimit the data for each species:

Iris features by species (stacked histograms)

And, here's the code:

#installing plotrix & multcomp
install.packages("plotrix")
install.packages("plotrix")
library(plotrix)
library(multcomp)

par(mfrow = c(2,2)) #1x2 matrix display

histStack(iris$Sepal.Width
	, z = iris$Species
	, col = c('steelblue', 'red', 'purple')
	, main = "Sepal.Width"
	, xlab = "Width"
	, legend.pos = "topright")

histStack(iris$Sepal.Length
	, z = iris$Species
	, col = c('steelblue', 'red', 'purple')
	, main = "Sepal.Length"
	, xlab = "Length"
	, legend.pos = "topright")

histStack(iris$Petal.Width
	, z = iris$Species
	, col = c('steelblue', 'red', 'purple')
	, main = "Petal.Width"
	, xlab = "Width"
	, legend.pos = "topright")

histStack(iris$Petal.Length
	, z = iris$Species
	, col = c('steelblue', 'red', 'purple')
	, main = "Petal.Length"
	, xlab = "Length"
	, legend.pos = "topright")

title("Iris Features (cm) by Species - Histograms", line = -1, outer = TRUE)
mtext("© sql-troubles@blogspot.com 2024", side = 1, line = 4, adj = 1, col = "dodgerblue4", cex = .7)

Conversely, the standard histogram allows drawing the density curves within its boundaries:

par(mfrow = c(2,2)) #1x2 matrix display 

hist(iris$Sepal.Width
	, main = "Sepal.Width"
	, xlab = "Width"
	, las = 1, cex.axis = .8, freq = F)
eq = density(iris$Sepal.Width) # estimate density curve
lines(eq, lwd = 2) # plot density curve

hist(iris$Sepal.Length
	, main = "Sepal.Length"
	, xlab = "Length"
	, las = 1, cex.axis = .8, freq = F)
eq = density(iris$Sepal.Length) # estimate density curve
lines(eq, lwd = 2) # plot density curve

hist(iris$Petal.Width
	, main = "Petal.Width"
	, xlab = "Width"
	, las = 1, cex.axis = .8, freq = F)
eq = density(iris$Petal.Width) # estimate density curve
lines(eq, lwd = 2) # plot density curve

hist(iris$Petal.Length
	, main = "Petal.Length"
	, xlab = "Length"
	, las = 1, cex.axis = .8, freq = F)
eq = density(iris$Petal.Length) # estimate density curve
lines(eq, lwd = 2) # plot density curve

title("Iris Features (cm) by Species - Density plots", line = -1, outer = TRUE)
mtext("© sql-troubles@blogspot.com 2024", side = 1, line = 4, adj = 1, col = "dodgerblue4", cex = .7)

And, here's the diagram:

Iris features aggregated (histograms with density plots)

As final visualization, one can also compare the width and length for the sepal, respectively petal:

par(mfrow = c(1,2)) #1x2 matrix display

plot(iris$Sepal.Width, iris$Sepal.Length, main = "Sepal Width vs Length", col = iris$Species)
plot(iris$Petal.Width, iris$Petal.Length, main = "Petal Width vs Length", col = iris$Species)

title("Iris Features (cm) by Species - Scatter Plots", line = -1, outer = TRUE)
mtext("© sql-troubles@blogspot.com 2024", side = 1, line = 4, adj = 1, col = "dodgerblue4", cex = .7)

And, here's the output:

Iris features by species (scatter plots)

Happy coding!

28 February 2024

Business Intelligence: A Software Engineer's Perspective V (From Process Management to Mental Models in Knowledge Gaps)

Business Intelligence Series

An organization's business processes are probably one of its most important assets because they reflect the business model, philosophy and culture, respectively link the material, financial, decisional, informational and communicational flows across the whole organization with implication in efficiency, productivity, consistency, quality, adaptability, agility, control or governance. A common practice in organizations is to document the business-critical processes and manage them accordingly over their lifetime, making sure that the employees understand and respect them, respectively improve them continuously.

In what concerns the creation of data artifacts, data without the processual context are often meaningless, no matter how much a data professional knows about data structures/models. Processes allow to delimit the flow and boundaries of data, respectively delimit the essential from non-essential. Moreover, it's the knowledge of processes that allows to reengineer the logic behind systems especially when no proper documentation about the logic is available.

Therefore, the existence of documented processes allows to bridge the knowledge gaps existing on the factual side, and occasionally also on the technical side. In theory, the processes should provide a complete overview of the procedures, rules, policies and responsibilities existing in the organization, respectively how the business operates. However, even if people tend to understand how the world works locally, when broken down into parts, their understanding is systemically flawed, missing the implications of causal relationships that span time with delays, feedback, variable confusion, chaotic behavior, and/or other characteristics borrowed from the vocabulary of complex systems.

Jay W Forrester [3], Peter M Senge [1], John D Sterman [2] and several other systems-thinking theoreticians stressed the importance of mental models in making-sense about the world especially in setups that reflect the characteristics of complex systems. Mental models frame our experience about the world in congruent mental constructs that are further used to think, understand and navigate the world. They are however tacit, fuzzy, incomplete, imprecisely stated, inaccurate, evolving simplifications with dual character, enabling on one side, while impeding on the other side cognitive processes like sense-making, learning, thinking or decision-making, limiting the range of action to what is familiar and comfortable.

On one side one of the primary goals of Data Analytics is to provide new insights, while on the other side the new insights fail to be recognized and put into practice because they conflict with existing mental models, limiting employees to familiar ways of thinking and acting.

Externalizing and sharing mental models allow besides making assumptions explicit and creating a world view also to strategize, make tests and simulations, respectively make sure that the barriers and further constraints don't impact the decisional process. Sange goes further and advances that mental models, especially at management level, offer a competitive advantage, allowing to maintain coherence and direction, people becoming more perceptive and responsive about environmental or circumstance changes.

The whole process isn't about creating a unique congruent mental model, even if several mental models may converge toward one or more holistic models, but of providing different diverse perspectives and enabling people to make leaps in abstraction (by moving from direct observations to generalizations) while blending advocacy and inquiry to promote collaborative learning. Gradually, people and organizations should recognize a shift from mental models dominated by events to mental models that recognize longer-tern patterns of change and the underlying structures producing those patterns [1].

Probably, for many the concept of mental models seems to be still too abstract, respectively that the effort associated with it is unnecessary, or at least questionable on whether it can make a difference. Conversely, being aware of the positive and negative implications the mental models hold, can makes us explore, even if ad-hoc, the roads they open.

Previous Post <<||>> Next Post

Resources:
[1] Peter M Senge (1990) The Fifth Discipline: The Art & Practice of The Learning Organization
[2] John D Sterman (2000) "Business Dynamics: Systems thinking and modeling for a complex world"
[3] Jay W Forrester (1971) "Counterintuitive Behaviour of Social Systems", Technology Review

27 February 2024

Book Review: Rolf Hichert & Jürgen Faisst's International Business Communication Standards (IBCS Version 1.2)

Over the last months I found several references to Rolf Hichert & Jürgen Faisst's booklet on business communication standards [1]. It draw my attention especially because it attempts to provide a standard for reports and data visualizations, which frankly it seems like a tremendous endeavor if done right. The two authors founded the IBCS institute 20 years ago, which is a host, training institute, and certification body of the Creative Commons project called IBCS.

The 150 pages booklet considers various standardization techniques with the help of more than 180 instructive figures, the overall structure being based on a set of principles and rules rooted in an acronym that spells "SUCCESS" - Say, Unify, Condense, Check, Express, Simplify, Structure. On one side the principles seem to form a solid fundament, however the fundament seems to suffer from the same rigidity resulted from fitting something in a nicely-spelled acronym.

Say or conveying a message reflects the principle that each report should convey a message, otherwise the report is just a data collection. According to this "definition" most of the operational reports are just collections of data. Conversely, lot of communication in organizations revolve around issues, metrics and decision making, scenarios in which the messages conveyed can be powerful though dependent on the business context. Settling on only one message can make the message fall short.

Unifying or applying semantic notation reflects the principle that things that have same meaning should look the same. There are many patterns out there that can be standardized, however it's questionable how much complex visualizations can be standardized, respectively how much liberty of expressing certain aspects the standardization allows.

Condense or increasing the information density reflects the requirements that all information necessary to understanding the content should, if possible, be included on one page. This allows to easier navigate the content and prioritize what the audience is able to see. The principle however seems to have more to do with the ink-information ratio principle (see [2]).

Check or ensuring the visual integrity reflects the principle that the information should be presented in the most truthful and the most easily understood way. This is something that many data visualizations out there lack.

Express or choosing the proper visualizations is based on the principle that the visuals considered should be as intuitive as possible. In theory, the more intuitive a visual the easier is to be understood and reused, however this depends on the "visual vocabulary" and "visual grammar" of each individual. Intuition is something that needs to grow through the interplay of these two areas. Having the expectation of displaying everything in terms of basic elements is unrealistic and suboptimal.

Simplify or avoiding clutter refers to eliminating the unnecessary from a visualization, when there's nothing to take out without changing the meaning of a visualization. At least, the principle is correctly considered even if is in general difficult to apply because quite often one needs to build something more complex and reduce the complexity through iterative steps until the simple is obtained.

Structure or organizing the content is based on the principle that content should follow (a logical consistent) structure. The interplay between function and structure is an important topic in itself.

Browsing through the many data visualizations given as example, I'd say that many of the recommendations make sense, though from there to a standardization is still a long way. The reader should evaluate against his/her own judgements the practices described and consider what seems to work.

The book is available on the IBS website as PDF, though the Kindle version is 40% cheaper. Overall, it is worth a read.

Previous Post <<<||>> Next Post

Resources:
[1] Rolf Hichert & Jürgen Faisst (2022) "International Business Communication Standards (IBCS Version 1.2): Conceptual, perceptual, and semantic design of comprehensible business reports, presentations, and dashboards" (link)
[2] Edward R Tufte (1983) "The Visual Display of Quantitative Information"
[3] IBCS Institude (2024) About (link)

26 February 2024

R Language: Data Summaries without Using a DataFrame

Coming back to the R language after several years and trying to remember some basic functions proved to be a bit challenging, even if the syntax is quite simple. Therefore, I considered putting together a few calls as refresher based on Youden-Beale data. To run the below code you'll need to install the R language and RStudio.

In case you don't have the package installed, run the next two lines:

install.packages("ACSWR") #install the Youden-Beale Experiment package
library(ACSWR)	#load the library

str(yb)		#display datasets' structure

'data.frame': 8 obs. of 2 variables:
$ Preparation_1: int 31 20 18 17 9 8 10 7
$ Preparation_2: int 18 17 14 11 10 7 5 6

yb		#display the dataset

Preparation_1 Preparation_2
1      31              18
2      20              17
3      18              14
4      17              11
5         9               10
6       8               7
7       10                5
8      7            6

summary(yb) 	#display the summary for whole dataset

Preparation_1 Preparation_2
Min. : 7.00      Min. : 5.00
1st Qu.: 8.75      1st Qu.: 6.75
Median :13.50   Median :10.50
Mean :15.00      Mean :11.00
3rd Qu.:18.50      3rd Qu.:14.75
Max. :31.00    Max. :18.00

summary(yb$Preparation_1)	#display the summary for first column

Min. 1st Qu. Median Mean 3rd Qu. Max.
7.00 8.75 13.50 15.00 18.50 31.00

summary(yb$Preparation_2)	#display the summary for second column

Min. 1st Qu. Median Mean 3rd Qu. Max.
5.00 6.75 10.50 11.00 14.75 18.00

min(yb)	#display the minimum value for the whole dataset

[1] 5

min(yb$Preparation_1)	#display the mininun of first column

[1] 7

min(yb$Preparation_2)	#display the minimum of second column

[1] 5

sum(yb)	#display the sum of all values

[1] 208

sum(yb$Preparation_1)	#display the sum of first column

[1] 120

sum(yb$Preparation_2)	#display the sum of second column

[1] 88

#display the percentiles 
quantile(yb$Preparation_1,seq(0,1,.25))

0% 25% 50% 75% 100%
7.00 8.75 13.50 18.50 31.00

#display the percentiles 
quantile(yb$Preparation_2,seq(0,1,.25))

0% 25% 50% 75% 100%
5.00 6.75 10.50 14.75 18.00

#display the percentiles 
quantile(yb$Preparation_2,seq(0,1,.25))

0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100%
7.0 7.7 8.4 9.1 9.8 13.5 17.2 17.9 19.2 23.3 31.0

quantile(yb$Preparation_2,seq(0,1,.1))

0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100%
5.0 5.7 6.4 7.3 9.4 10.5 11.6 13.7 15.8 17.3 18.0

length(yb) 	#display the number of items 
ncol(yb) 	#display the number of columns

[1] 2

sort(yb$Preparation_1) #display the sorted values ascendingly

[1] 7 8 9 10 17 18 20 31

sort(yb$Preparation_1, decreasing = TRUE)

[1] 31 20 18 17 10 9 8 7

#display a vertical poxplot
boxplot(yb, notch=FALSE)
title("A: Vertical Boxplot for Youden-Beale Data")

#display an horizontal poxplot
boxplot(yb, horizontal = TRUE)
title("B: Horizontal Boxplot for Youden-Beale Data")

plot(yb) #scatter diagram

title("Scatter diagram")

lsfit(yb$Preparation_1, yb$Preparation_2)$coefficients #list square fit coefficients

Intercept X

2.8269231 0.5448718

lsfit(yb$Preparation_1, yb$Preparation_2)$residuals #list square fit residuals

[1] -1.7179487  3.2756410  1.3653846 -1.0897436  2.2692308 -0.1858974
[7] -3.2756410 -0.6410256

Happy coding!

21 February 2024

Business Intelligence: A Software Engineer's Perspective IV (The Loom of Interactions)

Business Intelligence Series

The process of developing or creating a report is quite simple - there's a demand for data, usually a business problem, the user (aka requestor) defines a set of requirements, the data professional writes one or more queries to address the requirements, which are then used to build one or more reports. The report(s) is/are reviewed by the requestor and with this the process should be over in most of the cases. However, this is rather the exception - a long series of changes over multiple iterations are usually necessary, the queries and the reports get modified and even rewritten until they reach the final form, lot of effort being wasted in the process on both sides.

Common practices for improving the process behind resume to assuring that the requirements are complete and understood upfront, that best practices are followed, that the user gets an early review of the work and that there's a continuous communication, that process' performance is monitored, that controls are in place, etc. Standardizing the process helps to reduce the number of iterations, but only by a factor. Unfortunately, the bigger issue - the knowledge gap - is often ignored.

There's lot of literature on problem solving, on what steps to follow, on how to define the problem, what aspects should be considered, etc. Recipes are good when one knows how to follow them, respectively how to cook, and that can be a tedious process. It is said that framing the right problem is half the way to its solving, and that's so true. Part of the bigger issue is that users need data to better understand the problem, however the drives can be different - sometimes is problem's complexity, while other times the need is apparent, only with the first set of data the users start thinking seriously about the problem.

So, the first major gap is between the problem and user's knowledge about the problem. Experience and theory can help reduce the gap, however the most important progress comes when the user understands the data behind the various processes that overlap with the problem. Sometimes, it's enough to explore the data visually, while other times deeper explorations are needed. Data literacy is important, though more important are the exposure to the data and problems of different variety and complexity, respectively having the time for this.

The second gap concerns the data professional - building the data model and the logic for the report requires domain knowledge. The level of knowledge depends from case to case, and typically what one doesn't know has the biggest impact. A data professional can help to the degree of the information, respectively knowledge he has about the business. The expectation to provide a report based on a set of fields might be valid for simple requirements, though the more complex a problem, the more domain knowledge is needed. Moreover, the data professional might need to reengineer the logic from the source system, which can prove challenging only by looking at the data.

Ideally, the two parties should work together starting with problem's framing and build common ground while covering the knowledge gaps on both sides. Of course, the user doesn't need to dive into the technical knowledge unless the organization leverages this interaction further by adopting the data citizen mindset. Such interactions can help to build trust, respectively a basis for further collaboration. Conversely, the more isolated the two parties, the higher the chances for more iterations to occur.

Covering the knowledge gaps might look like a redistribution of the effort, though by keeping the status quo there is little chance for growth!

Previous Post <<||>> Next Post

18 February 2024

Business Intelligence: A Software Engineer's Perspective III (More of a One-Man Show)

Business Intelligence Series

Probably, in some organizations there are still recounted stories about a hero who knew so much about the business and was technically proficient that he/she was able to provide data-driven answers to most business questions. Unfortunately, the times of solo representations are for long gone - the world moves too fast, there are too many questions looking for an answer, many of them requiring a solution before the problem was actually defined, a whole infrastructure is needed to be able to harness the potential of technologies and data, the volume of knowledge required grows exponentially, etc.

One of the approaches of handling the knowledge gap between the initial and required knowledge in solving problems based on data is to build all the required knowledge in one person, either on the business or the technical side. More common is to hire a data analyst and build the knowledge in the respective resource, and the approach has great chances to work until the volume of work exceeds a person's limits. The data analyst is forced to request to have the workload prioritized, which might work in certain occasions, while in others one needs to compromise on quality and/or do overtime, and all the issues deriving from this.

There are also situations in which the complexity of the problem exceeds a person's ability to handle it, and that's not necessarily a matter of intelligence but of knowhow. Some organizations respond with complexity to complexity, while others are more creative and break the complexity in manageable pieces. In both cases, more resources are needed to cover the knowledge and resource gap. Hiring more data analysts can get the work done though it's not a recipe for success. The more diverse the team, the higher the chances to succeed, though again it's a matter of creativity and of covering the knowledge gaps. Sometimes, it's more productive to use the resources already available in organization, though this can involve other challenges.

Even if much of the knowledge gets documented, as soon the data analyst leaves the organization a void is created until a similar resource is able to fill it. Organizations can better cope with these challenges if they disseminate the knowledge between data professionals respectively within the business. The more resources are involved the higher the level of retention and higher the chances of reusing the knowledge. However, the more people are involved, the higher the costs, especially the one associated with the waste of effort.

Organizations can compromise by choosing 1-2 resources from each department to be involved in knowledge dissemination, ideally people with data and technology affinity. They shall become data citizens, people who use data, data processing and visualization for building solutions that enable their job. Data citizens are expected to act as showmen in their knowledge domain and do their magic whenever such requirements arise.

Having a whole team of data citizens opens new opportunities for organizations, though such resources will need beside domain knowledge and data literacy also technical knowledge. Unfortunately, many people will reach their limitations in this area. Besides the learning effort, understanding what good architecture, design and techniques means is unfortunately not for everybody, and here's where the concept of citizen data analyst or citizen scientist breaks, and this independently of the tools used.

A data citizen's effort works best in data discovery, exploration and visualization scenarios where the rapid creation of prototypes reduces the time from idea to solution. However, the results are personal solutions that need to be validated by a technical person, pieces of the solutions maybe redesigned and moved until enterprise solutions result.

Previous Post <<||>> Next Post

17 February 2024

Business Intelligence: A Software Engineer's Perspective II (Major Knowledge Gaps)

Business Intelligence Series

Solving a problem requires a certain degree of knowledge in the areas affected by the problem, degree that varies exponentially with problem's complexity. This requirement applies to scientific fields with low allowance for errors, as well as to business scenarios where the allowance for errors is in theory more relaxed. Building a report or any other data artifact is closely connected with problem solving as the data artifacts are supposed to model the whole or parts of what is needed for solving the problem(s) in scope.

In general, creating data artifacts requires: (1) domain knowledge - knowledge of the concepts, processes, systems, data, data structures and data flows as available in the organization; (2) technical knowledge - knowledge about the tools, techniques, processes and methodologies used to produce the artifacts; (3) data literacy - critical thinking, the ability to understand and explore the implications of data, respectively communicating data in context; (4) activity management - managing the activities involved.

At minimum, creating a report may require only narrower subsets from the areas mentioned above, depending on the complexity of the problem and the tasks involved. Ideally, a single person should be knowledgeable enough to handle all this alone, though that's seldom the case. Commonly, two or more parties are involved, though let's consider the two-parties scenario: on one side is the customer who has (in theory) a deep understanding of the domain, respectively on the other side is the data professional who has (in theory) a deep understanding of the technical aspects. Ideally, both parties should be data literates and have some basic knowledge of the other party's domain.

To attack a business problem that requires one or more data artifacts both parties need to have a common understanding of the problem to be solved, of the requirements, constraints, assumptions, expectations, risks, and other important aspects associated with it. It's critical for the data professional to acquire the domain knowledge required by the problem, otherwise the solution has high chances to deviate from the expectations. The general issue is that there are multiple interactions that are iterative. Firstly, the interactions for building the needed common ground. Secondly, the interaction between the problem and reality. Thirdly, the interaction between the problem and parties’ mental models und understanding about the problem.

The outcome of these interactions is that the problem and its requirements go through several iterations in which knowledge from the previous iterations are incorporated successively. With each important piece of knowledge gained, it's important to revise and refine the question(s), respectively the problem. If in each iteration there are also programming and further technical activities involved, the effort and costs resulted in the process can explode, while the timeline expands accordingly.

There are several heuristics that could be devised to address these challenges: (1) build all the required knowledge in one person, either on the business or the technical side; (2) make sure that the parties have the required knowledge for approaching the problems in scope; (3) make sure that the gaps between reality and parties' mental models is minimal; (4) make sure that the requirements are complete and understood before starting the development; (5) adhere to methodologies that accommodate the necessary iterations and endeavor's particularities; (6) make sure that there's a halt condition for regularly reviewing the progress, respectively halting the work; (7) build an organizational culture to support all this.

The list is open, and the heuristics aren't exclusive, so in theory any combination of them can be considered. Ideally, an organization should reflect all these heuristics in one form or another. The higher the coverage, the more mature the organization is. The question is how organizations with a suboptimal setup can change the status quo?

Previous Post <<||>> Next Post

Business Intelligence: A Software Engineer's Perspective I (Houston, we have a Problem!)

Business Intelligence Series

One of the critics addressed to the BI/Data Analytics, Data Engineering and even Data Science fields is their resistance to applying Software Engineering (SE) methods in practice. SE can be regarded as the application of sound methods, methodologies, techniques, principles, and practices to obtain high quality economic software in a reproducible manner. At minimum, should be applied SE techniques and practices proven to work, for example the use of best practices, reference technologies, standardized processes for requirements gathering and management, etc. This doesn't mean that one should apply the full extent of SE but consider a minimum that makes sense to adopt.

Unfortunately, the creation of data artifacts (queries, reports, data models, data pipelines, data visualizations, etc.) as process seem to be done after the principle of least action, though least action means here the minimum interaction to push pieces on a board rather than getting the things done. At high level, the process is as follows: get the requirements, build something, present results, get more requirements, do changes, present the results, and the process is repeated ad infinitum.

Given that data artifact's creation finds itself at the intersection of two or more knowledge areas in which knowledge is exchanged in several iterations between the parties involved until a common ground is achieved, this process is totally inefficient from multiple perspectives. First of all, it takes considerably more time than planned to reach a solution, resources being wasted in the process, multiple forms of waste being involved. Secondly, the exchange and retention of knowledge resulting from the process is minimal, mainly on a need by basis. This might look as an efficient approach on the short term, but is inefficient overall.

BI reflects the general issues from SE - most of the issues can be traced back to requirements - if the requirements are incorrect and there's no magic involved in between, then one can't expect for the solution to be correct. The bigger the difference between the initial and final requirements elicited in the process, the more resources are wasted. The more time passes between the start of the development phase and the time a solution is presented to the customer, the longer it takes to build the final solution. Same impact have the time it takes to establish a common ground and other critical factors for success involved in the process.

One can address these issues through better requirements elicitation, rapid prototyping, the use of agile methodologies and similar approaches, though the general feeling is that even if they bring improvements, they don't address the root causes - lack of data literacy skills, lack of knowledge about the business, lack of maturity in planning and executing tasks, the inexistence of well-designed processes and procedures, respectively the lack of an engineering mindset.

These inefficiencies have low impact when building a report occasionally, though they accumulate and tend to create systemic issues in what concerns the overall BI effort. They are addressed locally by experts and in general through a strategic approach like the elaboration of a BI strategy, though organizations seldom pay attention to them. Some organizations consider that they are automatically addressed as part of the data culture, though data culture focuses in general on data literacy and not on the whole set of assumptions mentioned above.

An experienced data professional sees more likely the inefficiencies, tries to address them locally in his interactions with the various stakeholders, he/she can build a business case for addressing them, though it depends on organizations to recognize that they have a problem, respective address the inefficiencies in a strategic and systemic manner!

Previous Post <<||>> Next Post

Business Intelligence: Microsoft Fabric's Notebooks

Business Intelligence Series

When several technologies make their entrance in a data-related field like Data Warehousing, Data Analitics or Data Science, one is forced to understand how the respective technologies can be used or misused, respectively what's their place in the bigger picture. Microsoft Fabric introduces several important technologies that will change the way data are stored, processed and consumed.

The first important technology is the notebook - a web document-like cell-based container for writing and executing code in a collaborative manner. The concept is not new, Jupyter notebooks have been around for almost a decade. In Microsof Fabric, notebooks support multiple languages, from which a default one applies to the whole notebook, while on cell level any of the supported languages can be used.

One can execute a single cell, multiple cells or the entire notebook in a sequential manner, mix languages for the various operations - load, transform, save, and visualize data when needed. Notebooks can be parametrized and run via the homonymous activity in Data Factory pipelines, automating thus data processing. Probably more functionality is to come.

Data engineers seems to have great flexibility, though usually flexibility implies constraints and/or mischiefs in other areas. I see for example in presentations the overuse of temporary data objects (mainly views) in Spark SQL as part of complex logic. That's acceptable during prototyping, though such code becomes a danger as soon the logic is deployed into production. Data objects should be created outside of the logic that uses them and should be treated as artifacts, with version control and proper documentation. It's maybe true that temporary objects reduce the volume of objects in the metastore, though is this the way to go?

Temporary objects tend to lead to wheel's reinvention or they get duplicated across multiple notebooks, which can easily create a maintenance nightmare. One needs to consider that the business logic changes a lot, the requirements and the data sources change, and on the long term, the cost of maintaining the code can easily overweight the benefits.

Notebooks remind me of the beginnings of web programming when HTML was mixed up with client scripting languages like VB Script or Javascript, CSS, respectively server-side scripting languages. It was kind of a spaghetti code, modified repeatedly by multiple programmers, unendingly duplicated, and through a miracle it worked, until it stopped working unexpectedly in strangest situations. The strangest part was when after removing commented code from a section made the code run again.

The debugging of another person's code was a nightmare. Code developed by two people for similar purposes was looking unrecognizable different in terms of structure, programming techniques and layout. The technical debt was high, increasing in exponential manner. One was aware that the code needed refactoring, though there were more important things to do or no time allocated for it.

In the meantime the maturity of programming languages, frameworks, methodologies, best practices, and hopefully of programmers improved the overall quality of software (at least on average). Thinking of software from an Engineer's perspective improved the efficiency and effectiveness of a programmer's endeavor. The average programmer is able to write quality code, though there's a considerable minimum of "engineering" knowledge involved beside the mere knowledge of languages and tools.

Notebooks are good up to a point, beyond which one needs to take a step back, restructure, move the code where it belongs, take a few more steps back and review the good practices and their application, disseminate the knowledge inside the team and use it in the next iterations, respectively refractor the code when needed! Hopefully, people learned from the mistakes of the past.

Resources:
[1] Microsoft Learn (2023) How to use Microsoft Fabric notebooks (link)

16 February 2024

Business Intelligence: What is a BI Strategy?

Business Intelligence Series

"A BI strategy is a plan to implement, use, and manage data and analytics to better enable your users to meet their business objectives. An effective BI strategy ensures that data and analytics support your business strategy." [1]

The definition is from Microsoft's guide on Power BI implementation planning, a long-awaited resource for those deploying Power BI in their organization.

I read the definition repeatedly and, even if it looks logically correct, the general feeling is that it falls short, and I'm trying to understand why. A strategy is a plan indeed, even if various theorists use modifiers like unified, comprehensive, integrative, forward-looking, etc. Probably, because it talks about a BI strategy, the definition implies using a strategic plan. Conversely, using "strategic plan" in the definition seems to make the definition redundant, though it would pull then with it all what a strategy is about.

A business strategy is about enabling users to meet organization's business objectives, otherwise it would fail by design. Implicitly, an organization's objectives become its employees' objectives. The definition kind of states the obvious. Conversely, it talks only about the users, and not all employees are users. Thus, it refers only to a subset. Shouldn't a BI strategy support everybody?

Usually, data analytics refers to the procedures and techniques used for exploration and analysis. Isn't supposed to consider also the visualization of data? Did it forgot something else? Ideally, a definition shouldn't define what its terms are about individually, but what they are when used together.

BI as a set of technologies, architectures, methodologies, processes and practices is by definition an enabler if we take these components individually or as a whole. I would play devil's advocate and ask "better than what?". Many of the information systems used in organizations come with a set of reports or functionalities that enable users in their jobs without investing a cent in a BI infrastructure.

One or two decades ago one of the big words used in sales pitches for BI tools was "competitive advantage". I was asking myself when and where did the word disappeared? Is BI technologies' success so common that the word makes no sense anymore? Did the sellers become more ethical? Or did we recognize that the challenges behind a technology are more of an organizational nature?

When looking at a business strategy, the hierarchy of business objectives forms its backbone, though there are other important elements that form its foundation: mission, vision, purpose, values or principles. A BI strategy needs to be aligned with the business strategy and the other strategies (e.g. quality, IT, communication, etc.). Being able to trace this kind of relationships between strategies is quintessential.

We talk about BI, Data Analytics, Data Management and newly Data Science. The relationship between them becomes more complex. Therefore, what differentiates a BI strategy from the other strategies? The above definition could apply to the other fields as well. Moreover, does it makes sense to include them in one form or another?

Independently how the joint field is called, BI and Data Analytics should be about gaining a deeper understanding about the business and disseminating that knowledge within the organization, respectively about exploring courses of action, building the infrastructure, the skillset, the culture and the mindset to approach more complex challenges and not only to enable business goals!

There are no perfect definitions, especially when the concepts used have drifting definitions as well, being caught into a net that makes it challenging to grasp the essence of things. In the end, a definition is good enough if the data professionals can work with it.

Resources:

[1] Microsoft Learn (2004) Power BI implementation planning: BI strategy (link).

14 February 2024

Business Intelligence: A One Man Show VI (The Lakehouse Perspective)

Business Intelligence Suite

Continuing the ideas on Christopher Laubenthal's article "Why one person can't do everything in the data space" [1] and why his analogy between a college's functional structure and the core data roles is poorly chosen. In the last post I mentioned as a first argument that the two constructions have different foundations.

Secondly, it's a matter of construction, namely the steps used to arrive from one state to another. Indeed, there's somebody who builds the data warehouse (DWH), somebody who builds the ETL/ELT pipelines for moving the data from the sources to the DWH, somebody who builds the sematic data model that includes business related logic, respectively people who tap into the data for reporting, data visualizations, data science projects, and whatever is still needed in the organization. On top of this, there should be somebody who manages the DWH. I haven't associated any role to them because one of the core roles can be responsible for more than one step.

In the case of a lakehouse, it is the data engineer who moves the data from the various data sources to the data lake if that doesn't happen already by design or configuration. As per my understanding the data engineers are the ones who design and build the new lakehouse, move transform and manage the data as required. The Data Analysts, Data Scientist and maybe some Information Designers can tap then into the data. However, the DWH and the lakehouse(s) are technologies that facilitate their work. They can still do their work also if the same data are available by other means.

In what concerns the dorm analogy, the verbs were chosen to match the way data warehouses (DWH) or lakehouses are built, though the congruence of the steps is questionable. One could have compared the number of students with the numbers of data entities, but not with the data themselves. Usually, students move by themselves and occupy the places. The story tellers, the assistants and researchers are independent on whether the students are hosted in the dorm or not. Therefore, the analogy seems to be a bit forced.

Frankly, I covered all the steps except the ones related to Data Science by myself for both described scenarios. It helped that I knew the data from the data sources and the transformations rules I had to apply, respectively the techniques needed for moving and transforming the data, and the volume of data entities was manageable somehow. Conversely, 1-2 more resources in the area of data analysis and visualizations could have helped to bring more value to the business.

This opens the challenge of scale and it has do to with systems engineering and how the number of components and the interactions between them increase systems' complexity and the demand for managing the respective components. In the simplest linear models, for each multiplier of a certain number of components of the same type from the organization, the number of resources managing the respective layer matches to some degree the multiplier. E.g. if a data engineer can handle x data entities in a unit of time, then for hand n*x components are more likely at least n data engineers required. However, the output of n components is only a fraction of the n*x given the dependencies existing between components and other constraints.

An optimization problem resumes in finding out what data roles to chose to cover an organization's needs. A one man show can be the best solution for small organizations, though unless there's a good division of labor, bringing a second person will make the throughput slower until will become faster.

Previous Post <<|||>> Next Post

Resources:
[1] Christopher Laubenthal (2024) "Why One Person Can’t Do Everything In Data" (link)

13 February 2024

Business Intelligence: A One Man Show V (Focus on the Foundation)

Business Intelligence Suite

I tend to agree that one person can't do anymore "everything in the data space", as Christopher Laubenthal put it his article on the topic [1]. He seems to catch the essence of some of the core data roles found in organizations. Summarizing these roles, data architecture is about designing and building a data infrastructure, data engineering is about moving data, database administration is mainly about managing databases, data analysis is about assisting the business with data and reports, information design is about telling stories, while data science can be about studying the impact of various components on the data.

However, I find his analogy between a college's functional structure and the core data roles as poorly chosen from multiple perspectives, even if both are about building an infrastructure of some type.

Firstly, the two constructions have different foundations. Data exists in a an organization also without data architects, data engineers or data administrators (DBAs)! It's enough to buy one or more information systems functioning as islands and reporting needs will arise. The need for a data architect might come when the systems need to be integrated or maybe when a data warehouse needs to be build, though many organizations are still in business without such constructs. While for the others, the more complex the integrations, the bigger the need for a Data Architect. Conversely, some systems can be integrated by design and such capabilities might drive their selection.

Data engineering is needed mainly in the context of the cloud, respectively of data lake-based architectures, where data needs to be moved, processed and prepared for consumption. Conversely, architectures like Microsoft Fabric minimize data movement, the focus being on data processing, the successive transformations it needs to suffer in moving from bronze to the gold layer, respectively in creating an organizational semantical data model. The complexity of the data processing is dependent on data' structuredness, quality and other data characteristics.

As I mentioned before, modern databases, including the ones in the cloud, reduce the need for DBAs to a considerable degree. Unless the volume of work is big enough to consider a DBA role as an in-house resource, organizations will more likely consider involving a service provider and a contingent to cover the needs.

Having in-house one or more people acting under the Data Analyst role, people who know and understand the business, respectively the data tools used in the process, can go a long way. Moreover, it's helpful to have an evangelist-like resource in house, a person who is able to raise awareness and knowhow, help diffuse knowledge about tools, techniques, data, results, best practices, respectively act as a mentor for the Data Analyst citizens. From my point of view, these are the people who form the data-related backbone (foundation) of an organization and this is the minimum of what an organization should have!

Once this established, one can build data warehouses, data integrations and other support architectures, respectively think about BI and Data strategy, Data Governance, etc. Of course, having a Chief Data Officer and a Data Strategy in place can bring more structure in handling the topics at the various levels - strategical, tactical, respectively operational. In constructions one starts with a blueprint and a data strategy can have the same effect, if one knows how to write it and implement it accordingly. However, the strategy is just a tool, while the data-knowledgeable workers are the foundation on which organizations should build upon!

"Build it and they will come" philosophy can work as well, though without knowledgeable and inquisitive people the philosophy has high chances to fail.

Previous Post <<||>> Next Post

Resources:
[1] Christopher Laubenthal (2024) "Why One Person Can’t Do Everything In Data" (link)

Business Intelligence: A One Man Show IV (Data Roles between Past and Future)

Business Intelligence Series

Databases nowadays are highly secure, reliable and available to a degree that reduces the involvement of DBAs to a minimum. The more databases and servers are available in an organization, and the older they are, the bigger the need for dedicated resources to manage them. The number of DBAs involved tends to be proportional with the volume of work required by the database infrastructure. However, if the infrastructure is in the cloud, managed by the cloud providers, it's enough to have a person in the middle who manages the communication between cloud provider(s) and the organization. The person doesn't even need to be a DBA, even if some knowledge in the field is usually recommended.

The requirement for a Data Architect comes when there are several systems in place and there're multiple projects to integrate or build around the respective systems. It'a also the question of what drives the respective requirement - is it the knowledge of data architectures, the supervision of changes, and/or the review of technical documents? The requirement is thus driven by the projects in progress and those waiting in the pipeline. Conversely, if all the systems are in the cloud, their integration is standardized or doesn't involve much architectural knowledge, the role becomes obsolete or at least not mandatory.

The Data Engineer role is a bit more challenging to define because it appeared in the context of cloud-based data architectures. It seems to be related to the data movement via ETL/ELT pipelines and of data processing and preparation for the various needs. Data modeling or data presentation knowledge isn't mandatory even if ideal. The role seems to overlap with the one of a Data Warehouse professional, be it a simple architect or developer. Role's knowhow depends also on the tools involved, because one thing is to build a solution based on a standard SQL Server, and another thing to use dedicated layers and architectures for the various purposes. Engineers' number should be proportional with the number of data entities involved.

Conversely, the existence of solutions that move and process the data as needed, can reduce the volume of work. Moreover, the use of AI-driven tools like Copilot might shift the focus from data to prompt engineering.

The Data Analyst role is kind of a Cinderella - it can involve upon case everything from requirements elicitation to reports writing and results' interpretation, respectively from data collection and data modeling to data visualization. If you have a special wish related to your data, just add it to the role! Analysts' number should be related to the number of issues existing in organization where the collection and processing of data could make a difference. Conversely, the Data Citizen, even if it's not a role but a desirable state of art, could absorb in theory the Data Analyst role.

The Data Scientist is supposed to reveal the gems of knowledge hidden in the data by using Machine Learning, Statistics and other magical tools. The more data available, the higher the chances of finding something, even if probably statistically insignificant or incorrect. The role makes sense mainly in the context of big data, even if some opportunities might be available at smaller scales. Scientists' number depends on the number of projects focused on the big questions. Again, one talks about the Data Scientist citizen.

The Information Designer role seems to be more about data visualization and presentation. It makes sense in the organizations that rely heavily on visual content. All the other organizations can rely on the default settings of data visualization tools, independently on whether AI is involved or not.

Previous Post <<||>> Next Post

Business Intelligence: A One Man Show III (The Microsoft Fabric)

Business Intelligence Series

Announced at the end of the last year, Microsoft Fabric (MF) become a reality for the data professional, even if there are still many gaps in the overall architecture and some things don't work as they should. The Delta Lake and the various data consumption experiences seem to bring more flexibility but also raise questions on how one can use them adequately in building solutions for Data Analytics and/or Data Science.

Currently, as it happens with new technologies, data professionals seem to try to explore the functionality, see what's possible, what's missing, and that's a considerable effort as everybody is more or less on his own. The material released by Microsoft and other professionals should facilitate in theory this effort, though the considerable number of features and the effort needed to review them do the opposite. Some professionals do this as part of their jobs, and exploring the feature seems to be a full job in each area, while others, like myself, do it in their own time.

There are organizations that demand from their employees to regularly actualize their knowledge in their field of activity, respectively explore how new technologies can be integrated in organization's architecture. Having a few hours or even a day a weak for this can go a long way! Occasionally, I could take 1-2 hours a week during the program and take maybe a few many more hours from my own time. Unfortunately, most of the significant progress I made in a certain area (SQL Server, Dynamics 365, Software Engineering, Power BI, and now MF) it was done in my own time, which became in time more and more challenging to do given the pace with which new features and technologies develop.

By comparison, it was relatively easy to locally install SQL Server in its various CTP or community versions, deploy one of the readily-available databases, and start learning. I'm still doing it, playing with a SQL Server 2022 instance whenever I find the time. Similarly, I can use Power BI and a few other tools, depending again on the time available to make progress. However, with MF things start slowly to get blurry. The 60 days of trial won't cut it anymore as there are so many things to learn - Spark SQL, PySpark, Delta Lake, KQL, Dataflows, etc. Probably, there will be ways for learning any of this standalone, though not together in an integrated manner.

The complexity of the tools demands more time, a proper infrastructure and a good project to accommodate them. This doesn't mean that the complexity of the solutions need to increase as well! Azure Synapse allowed me to reuse many of the techniques I used in the past to build a modern Data Analytics solution, while in other areas I had to accommodate the new. The solution wasn't perfect (only time will tell), though it provided the minimum of what was needed. I expect the same to happen in Microsoft Fabric, even if the number of choices is bigger.

There's a considerable difference between building a minimal viable solution and exploring, respectively harnessing MF's capabilities. The challenge for many organizations is to determine what that minimum is about, how to build that knowledge into the team, especially when starting from zero.

Conversely, this doesn't mean that the skillset and effort can't be covered by one person. It might be more challenging though achievable if the foundation is there, respectively if certain conditions are met. This depends also on organization's expectations, infrastructure and other characteristics. A whole team is more likely to succeed than one person, but not certainty!

Previous Post <<||>> Next Post

SQL Troubles

Pages

29 February 2024

R Language: Visualizing the Iris Dataset

28 February 2024

Business Intelligence: A Software Engineer's Perspective V (From Process Management to Mental Models in Knowledge Gaps)

27 February 2024

Book Review: Rolf Hichert & Jürgen Faisst's International Business Communication Standards (IBCS Version 1.2)

26 February 2024

R Language: Data Summaries without Using a DataFrame

21 February 2024

Business Intelligence: A Software Engineer's Perspective IV (The Loom of Interactions)

18 February 2024

Business Intelligence: A Software Engineer's Perspective III (More of a One-Man Show)

17 February 2024

Business Intelligence: A Software Engineer's Perspective II (Major Knowledge Gaps)

Business Intelligence: A Software Engineer's Perspective I (Houston, we have a Problem!)

Business Intelligence: Microsoft Fabric's Notebooks

16 February 2024

Business Intelligence: What is a BI Strategy?

14 February 2024

Business Intelligence: A One Man Show VI (The Lakehouse Perspective)

13 February 2024

Business Intelligence: A One Man Show V (Focus on the Foundation)

Business Intelligence: A One Man Show IV (Data Roles between Past and Future)

Business Intelligence: A One Man Show III (The Microsoft Fabric)

About Me