SQL Troubles: March 2024

31 March 2024

Microsoft Fabric: Polaris (Notes)

Disclaimer: This is work in progress intended to consolidate information from various sources and may deviate from them. Please consult the sources for the exact content!
Last updated: 31-Mar-2024

Polaris

{definition} cloud-native analytical query engine over the data lake that follows a stateless micro-service architecture and is designed to execute queries in a scalable, dynamic and fault-tolerant way [1], [2]

the engine behind the serverless SQL pool [1] and Microsoft Fabric [2]
petabyte-scale execution [1]
highly-available micro-service architecture

data and query processing is packaged into units (aka tasks) [1]

can be readily moved across compute nodes and re-started at the task level [1]

can run directly over data in HDFS and in managed transactional stores [1]

[Azure Synapse] designed initially to execute read-only queries [1]

⇐ the architecture behind serverless SQL pool
uses a completely new scale-out framework based on a distributed SQL Server query engine [1]

fully compatible with T-SQL
leverages SQL Server single-node runtime and QO [1]

[Microsoft Fabric] extended with a complete transaction manager that executes general CRUD transactions [2]

incl. updates, deletes and bulk loads [2]
based on [delta tables] and [delta lake]

the delta lake supports currently only transactions within one table [4]

⇐ the architecture behind lakehouses

{goal} converge DWH and big data workloads [1]

the query engine scales out for relational data and heterogeneous datasets stored in DFSs [1]

needs a clean abstraction over the underlying data type and format, capturing just what’s needed for efficiently parallelizing data processing

{goal} separate compute and state for cloud-native execution [1]

all services within a pool are stateless

data is stored durably in remote storage and is abstracted via data cells [1]

⇐ data is naturally decoupled from compute nodes

the metadata and transactional log state is off-loaded to centralized services [1]
multiple compute pools can transactionally access the same logical database [1]

{goal} cloud-first [2]

{benefit} leverages elasticity
transactions need to be resilient to node failures on dynamically changing topologies [2]

⇒ the storage engine disaggregates the source of truth for execution state (including data, metadata and transactional state) from compute nodes [2]

must ensure disaggregation of metadata and transactional state from compute nodes [2]

⇐ to ensure that the life span of a transaction is resilient to changes in the backend compute topology [2]

⇐ can change dynamically to take advantage of the elastic nature of the cloud or to handle node failures [2]

{goal} use optimized native columnar, immutable and open storage format [2]

uses delta format

⇐ optimized to handle read-heavy workloads with low contention [2]

{goal} leverage the full potential of vectorized query processing for SQL [2]
{goal} support zero-copy data sharing with other services in the lake [2]
{goal} support read-heavy workloads with low contention [2]
{goal} support lineage-based features [2]

by taking advantage of delta table capabilities

{goal} provide full SQL SI transactional support [2]

{benefit} all traditional DWH requirements are met [2]

incl. multi-table and multi-statement transactions [2]

⇐ Polaris is the only system that supports this [2]
the design is optimized for analytics, specifically read- and insert-intensive workloads [2]
mixes of transactions are supported as well

{objective} no cross-component state sharing [2]

{principle} encapsulation of state within each component to avoid sharing state across nodes [2]
SI and the isolation of state across components allow transactions to be executed as if they were queries [2]

⇒ makes read and write transactions indistinguishable [2]

⇒ allows the system to fully leverage its optimized distributed execution framework [2]

{objective} support snapshot isolation (SI) semantics [2]

implemented over versioned data
allows reads (R) and writes (W) to proceed concurrently over their own data snapshot

R/W never conflict, and W/W of active transactions only conflict if they modify the same data [2]

⇐ all W transactions are serializable, leading to a serial schedule in increasing order of log record IDs [4]

follows from the commit protocol for write transactions, where only one transaction can write the record with each record ID [4]

⇐ R transactions at the snapshot isolation level create no contention

⇒ any number of R transactions can run concurrently [4]

the immutable data representation in LSTs allows dealing with failures by simply discarding data and metadata files that represent uncommitted changes [2]

similar to how temporary tables are discarded during query processing failures [2]
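The conflict rule of the snapshot isolation objective above (R/W never conflict, W/W conflict only on the same data) can be illustrated with a toy model. This is only a sketch under simplified assumptions: a single process, row-level write sets, and a first-committer-wins check, which is a common way to implement SI but is an assumption here. The names `begin_tx`, `write_tx`, `commit_tx` and the store layout are hypothetical and do not reflect Polaris internals.

```r
# toy store: a commit counter plus, per row, the version that last wrote it
store <- list(version = 0, last_write = setNames(numeric(0), character(0)))

# a transaction remembers the snapshot version at which it started
begin_tx <- function(store) list(snapshot = store$version, writes = character(0))

# record the rows a transaction intends to modify
write_tx <- function(tx, rows) { tx$writes <- union(tx$writes, rows); tx }

# commit succeeds unless a row in the write set was committed after the snapshot
commit_tx <- function(store, tx) {
  touched <- store$last_write[intersect(names(store$last_write), tx$writes)]
  if (any(touched > tx$snapshot)) return(list(store = store, ok = FALSE))
  if (length(tx$writes) > 0) {
    store$version <- store$version + 1
    store$last_write[tx$writes] <- store$version
  }
  list(store = store, ok = TRUE)
}

# all four transactions start from the same snapshot
t1 <- write_tx(begin_tx(store), "row1")
t2 <- write_tx(begin_tx(store), "row1")   # same row as t1
t3 <- write_tx(begin_tx(store), "row2")   # different row
r  <- begin_tx(store)                     # read-only transaction

res1 <- commit_tx(store, t1); store <- res1$store
res2 <- commit_tx(store, t2); store <- res2$store
res3 <- commit_tx(store, t3); store <- res3$store
res_r <- commit_tx(store, r)

# t1 and t3 commit, t2 fails (same row as t1), the reader never conflicts
print(c(t1 = res1$ok, t2 = res2$ok, t3 = res3$ok, reader = res_r$ok))
```

The read-only transaction has an empty write set, so in this model it can never fail the check, which mirrors the note that any number of R transactions can run concurrently.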

{feature} resize live workloads [1]

scales resources with the workloads automatically

{feature} deliver predictable performance at scale [1]

scales computational resources based on workloads' needs

{feature} efficiently handle both relational and unstructured data [1]
{feature} flexible, fine-grained task monitoring

a task is the finest grain of execution

{feature} global resource-aware scheduling

enables much better resource utilization and concurrency than traditional DWHs

capable of handling partial query restarts
maintains a global view of multiple queries

it is planned to build on this a global view with autonomous workload management features

{feature} multi-layered data caching model

leverages

SQL Server buffer pools for caching columnar data
SSD caching

since the delta table and its log are immutable, they can be safely cached on cluster nodes [4]

{feature} tracks data lineage natively

the transaction log can also be used for audit logging based on the commitInfo records [4]

{feature} versioning

maintain all versions as data is updated [1]

{feature} time-travel

{benefit} allows users to query point-in-time snapshots
{benefit} allows rolling back erroneous updates to the data
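Versioning and time travel can be pictured as keeping every committed snapshot and reading the table "as of" a version. A toy sketch with in-memory data frames follows; the helper names `commit_version` and `read_as_of` and the sample values are hypothetical, and a real delta table keeps a log over immutable data files rather than full copies.

```r
# history of committed snapshots; position i holds version i
history <- list()

# append a snapshot as the next version
commit_version <- function(history, df) c(history, list(df))

# return the snapshot stored for a given version
read_as_of <- function(history, version) history[[version]]

v1 <- data.frame(id = 1:3, amount = c(10, 20, 30))
history <- commit_version(history, v1)

# an erroneous update produces version 2
v2 <- v1
v2$amount[2] <- -999
history <- commit_version(history, v2)

# point-in-time query: read the table as of version 1
print(read_as_of(history, 1))

# roll back by committing the old snapshot as a new version
history <- commit_version(history, read_as_of(history, 1))
latest <- read_as_of(history, length(history))
print(latest)
```

The rollback does not delete version 2; it adds a third version with the earlier content, so the erroneous state stays available for auditing.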

{feature} table cloning

{benefit} allows creating a point-in-time snapshot of the data based on its metadata

{concept} state

allows driving the end-to-end life cycle of a SQL statement with transactional guarantees and top-tier performance [1]
comprised of

cache
metadata
transaction logs
data

[on-premises architecture] all state is in the compute layer

relies on small, highly stable and homogenous clusters with dedicated hardware for Tier-1 performance
{downside} expensive
{downside} hard to maintain
{downside} limited scalability

cluster capacity is bounded by machine sizes because of the fixed topology

{concept} [stateful architecture]

the state of inflight transactions is stored in the compute node and is not hardened into persistent storage until the transaction commits [1]

⇒ when a compute node fails, the state of non-committed transactions is lost [1]

⇒ the in-flight transactions fail as well [1]

often also couples metadata describing data distributions and mappings to compute nodes [1]

⇒ a compute node effectively owns responsibility for processing a subset of the data [1]

its ownership cannot be transferred without a cluster restart [1]

{downside} resilience to compute node failure and elastic assignment of data to compute are not possible [1]

{concept} [stateless compute architecture]

requires that compute nodes hold no state information [1]

⇒ all data, transactional logs and metadata need to be externalized [1]

{benefit} allows applications to

partially restart the execution of queries in the event of compute node failures [1]
adapt to online changes of the cluster topology without failing in-flight transactions [1]

caches need to be as close to the compute as possible [1]

since they can be lazily reconstructed from persisted data they don’t necessarily need to be decoupled from compute [1]

the coupling of caches and compute does not make the architecture stateful [1]

{concept} [cloud] decoupling of compute and storage

provides more flexible resource scaling

the 2 layers can scale up and down independently adapting to user needs [1]
customers pay for the compute needed to query a working subset of the data [1]

is not the same as decoupling compute and state [1]

if any of the remaining state held in compute cannot be reconstructed from external services, then compute remains stateful [1]

Acronyms:

ADLS - Azure Data Lake Storage

CRUD - Create, Read, Update, Delete

DCP - distributed computation platform

DFS - Distributed File System

DWH - data warehouse

HDFS - Hadoop DFS

SI - Snapshot Isolation

SSD - Solid-State Drive

References:
[1] Josep Aguilar-Saborit et al (2020) POLARIS: The Distributed SQL Engine in Azure Synapse, Proceedings of the VLDB Endowment 13(12)
[2] Josep Aguilar-Saborit et al (2024) Extending Polaris to Support Transactions
[3] Advancing Analytics (2021) Azure Synapse Analytics - Polaris Whitepaper Deep-Dive
[4] Michael Armbrust et al (2020) Delta Lake: High-Performance ACID Table Storage over Cloud Object Stores, Proceedings of the VLDB Endowment 13(12)

29 March 2024

Data Management: Data (Notes)

Disclaimer: This is work in progress intended to consolidate information from various sources.
Last updated: 29-Mar-2024

Data

{definition} raw, unrelated numbers or entries that represent facts, concepts, events, and/or associations
categorized by

domain

{type} transactional data
{type} master data
{type} configuration data

{subtype} hierarchical data
{subtype} reference data
{subtype} setup data
{subtype} policy

{type} analytical data

{subtype} measurements
{subtype} metrics
{subtype}

structuredness

{type} structured data
{type} semi-structured data
{type} unstructured data

statistical usage as variable

{type} categorical data (aka qualitative data)

{subtype} nominal data
{subtype} ordinal data
{subtype} binary data

{type} numerical data (aka quantitative data)

{subtype} discrete data
{subtype} continuous data

size

{type} small data
{type} big data

{concept} transactional data

{definition} data that describe business transactions and/or events
supports the daily operations of an organization
commonly refers to data created and updated within operational systems
supports applications that automate key business processes
usually stored in normalized tables

{concept} master data

{definition} "data that provides the context for business activity data in the form of common and abstract concepts that relate to the activity" [2]

the key business entities on which transactions are executed

the dimensions around which analysis is conducted

used to categorize, evaluate and aggregate transactional data

can be shared across more than one transactional application
there are master data similar to most organizations, but also master data specific to certain industries
often appear in more than one area within the business
represent one version of the truth
can be further divided into specialized subsets
{concept} master data entity

core business entity used in different applications across the organization, together with their associated metadata, attributes, definitions, roles, connections and taxonomies
may be classified within a hierarchy

the way they describe, characterize and classify business concepts may actually cross multiple hierarchies in different ways

e.g. a party can be an individual, customer, employee, while a customer might be an individual, party or organization

do not change as frequently as transactional data

less volatile than transactional data
there are master data that don’t change at all

e.g. geographic locations

strategic asset of the business
needs to be managed with the same diligence as other strategic assets

{concept} metadata

{definition} "data that defines and describes the characteristics of other data, used to improve both business and technical understanding of data and data-related processes" [2]

data about data

refers to

database schemas for OLAP & OLTP systems
XML document schemas
report definitions
additional database table and column descriptions stored with extended properties or custom tables provided by SQL Server
application configuration data

{concept} analytical data

{definition} data that supports analytical activities

e.g. decision making, reporting queries and analysis

comprises

numerical values
metrics
measurements

stored in OLAP repositories

optimized for decision support
enterprise data warehouses
departmental data marts
within table structures designed to support aggregation, queries and data mining

{concept} hierarchical data
- {definition} data that reflects a hierarchy
- typically appears in analytical applications
- {concept} hierarchy
{concept} structured data

{definition} "data that has a strict metadata defined"

{concept} unstructured data

{definition} data that doesn't follow predefined metadata
involves all kinds of documents
can appear in a database, in a file, or even in printed material

{concept} semi-structured data

{definition} structured data stored within unstructured data
data typically in XML form

XML is widely used for data exchange

can appear in stand-alone files or as part of a database (as a column in a table)
useful when metadata (the schema) changes frequently, or there’s no need for a detailed relational schema

References:
[1] The Art of Service (2017) Master Data Management Course

[2] DAMA International (2011) "The DAMA Dictionary of Data Management"

28 March 2024

Data Management: Master Data Management [MDM] (Notes)

Disclaimer: This is work in progress intended to consolidate information from various sources.
Last updated: 28-Mar-2024

Master Data Management (MDM)

{definition} the technologies, processes, policies, standards and guiding principles that enable the management of master data values to enable consistent, shared, contextual use across systems, of the most accurate, timely, and relevant version of truth about essential business entities [2],[3]
{goal} enable sharing of information assets across business domains and applications within an organization [4]
{goal} provide authoritative source of reconciled and quality-assessed master (and reference) data [4]
{goal} lower cost and complexity through use of standards, common data models, and integration patterns [4]
{driver} meeting organizational data requirements
{driver} improving data quality
{driver} reducing the costs for data integration
{driver} reducing risks
{type} operational MDM

involves solutions for managing transactional data in operational applications [1]
rely heavily on data integration technologies

{type} analytical MDM

involves solutions for managing analytical master data
centered on providing high quality dimensions with multiple hierarchies [1]
cannot influence operational systems

any data cleansing made within operational application isn’t recognized by transactional applications [1]

⇒ inconsistencies to the main operational data [1]

transactional application knowledge isn’t available to the cleansing process

{type} enterprise MDM

involves solutions for managing both transactional and analytical master data

manages all master data entities
deliver maximum business value

operational data cleansing

improves the operational efficiencies of the applications and the business processes that use the applications

cross-application data need

consolidation
standardization
cleansing
distribution

needs to support high volume of transactions

⇒ master data must be contained in data models designed for OLTP

⇐ ODS don’t fulfill this requirement

{enabler} high-quality data
{enabler} data governance
{benefit} single source of truth

used to support both operational and analytical applications in a consistent manner [1]

{benefit} consistent reporting

reduces the inconsistencies experienced previously
influenced by complex transformations

{benefit} improved competitiveness

MDM reduces the complexity of integrating new data and systems into the organization

⇒ increased flexibility and improved competitiveness

ability to react to new business opportunities quickly with limited resources

{benefit} improved risk management

more reliable and consistent data improves the business’s ability to manage enterprise risk [1]

{benefit} improved operational efficiency and reduced costs

helps identify the business's pain points

by developing a strategy for managing master data

{benefit} improved decision making

reducing data inconsistency diminishes organizational data mistrust and facilitates clearer (and faster) business decisions [1]

{benefit} more reliable spend analysis and planning

better data integration helps planners come up with better decisions

improves the ability to

aggregate purchasing activities
coordinate competitive sourcing
be more predictable about future spending
generally improve vendor and supplier management

{benefit} regulatory compliance

reduces compliance risk

helps satisfy governance, regulatory and compliance requirements

simplifies compliance auditing

enables more effective information controls that facilitate compliance with regulations

{benefit} increased information quality

enables organizations to monitor conformance more effectively

via metadata collection
it can track whether data meets information quality expectations across vertical applications, which reduces information scrap and rework

{benefit} quicker results

reduces the delays associated with extraction and transformation of data [1]

⇒ it speeds up the implementation of application migrations, modernization projects, and data warehouse/data mart construction [1]

{benefit} improved business productivity

gives enterprise architects the chance to explore how effective the organization is in automating its business processes by exploiting the information asset [1]

⇐ master data helps organizations realize how the same data entities are represented, manipulated, or exchanged across applications within the enterprise and how those objects relate to business process workflows [1]

{benefit} simplified application development

provides the opportunity to consolidate the application functionality associated with the data lifecycle [1]

⇐ consolidation in MDM is not limited to the data
⇒ provides a single functional service to which different applications can subscribe

⇐ introducing a technical service layer for data lifecycle functionality provides the type of abstraction needed for deploying SOA or similar architectures

factors to consider for implementing an MDM:

effective technical infrastructure for collaboration [1]
organizational preparedness

for making a quick transition from a loosely combined confederation of vertical silos to a more tightly coupled collaborative framework
{recommendation} evaluate the kinds of training sessions and individual incentives required to create a smooth transition [1]

metadata management

via a metadata registry

{recommendation} sets up a mechanism for unifying a master data view when possible [1]
determines when that unification should be carried out [1]

technology integration

{recommendation} diagnose what technology needs to be integrated to support the process instead of developing the process around the technology [1]

anticipating/managing change

proper preparation and organization will subtly introduce change to the way people think and act as shown in any shift in pattern [1]
changes in reporting structures and needs are unavoidable

creating a partnership between Business and IT

IT roles

plays a major role in executing the MDM program [1]

business roles

identifying and standardizing master data [1]
facilitating change management within the MDM program [1]
establishing data ownership

measurably high data quality
overseeing processes via policies and procedures for data governance [1]

{challenge} establishing enterprise-wide data governance

{recommendation} define and distribute the policies and procedures governing the oversight of master data

seeking feedback from across the different application teams provides a chance to develop the stewardship framework agreed upon by the majority while preparing the organization for the transition [1]

{challenge} isolated islands of information

caused by vertical alignment of IT

makes it difficult to fix the dissimilarities in roles and responsibilities in relation to the isolated data sets because they are integrated into a master view [1]

caused by data ownership

the politics of information ownership and management have created artificial exclusive domains supervised by individuals who have no desire to centralize information [1]

{challenge} consolidating master data into a centrally managed data asset [1]

transfers the responsibility and accountability for information management from the lines of business to the organization [1]

{challenge} managing MDM

MDM should be considered a program and not a project or an application [1]

{challenge} achieving timely and accurate synchronization across disparate systems [1]
{challenge} different definitions of master metadata
- different coding schemes, data types, collations, and more
{challenge} data conflicts

{recommendation} resolve data conflicts during the project [5]
{recommendation} replicate the resolved data issues back to the source systems [5]

{challenge} domain knowledge

{recommendation} involve domain experts in an MDM project [5]

{challenge} documentation

{recommendation} properly document your master data and metadata [5]

approaches

{architecture} no central MDM

isn’t a real MDM approach
used when any kind of cross-system interaction is required [5]

e.g. performing analysis on data from multiple systems, ad-hoc merging and cleansing

{drawback} very inexpensive at the beginning; however, it turns out to be the most expensive over time [5]

{architecture} central metadata storage

provides unified, centrally maintained definitions for master data [5]

followed and implemented by all systems

ad-hoc merging and cleansing becomes somewhat simpler [5]
does not use a specialized solution for the central metadata storage [5]

⇐ the central storage of metadata is probably in an unstructured form

e.g. documents, worksheets, paper

{architecture} central metadata storage with identity mapping

stores keys that map tables in the MDM solution

only has keys from the systems in the MDM database; it does not have any other attributes [5]

{benefit} data integration applications can be developed much more quickly and easily [5]
{drawback} raises problems in regard to maintaining master data over time [5]

there is no versioning or auditing in place to follow the changes [5]

⇒ viable for a limited time only

e.g. during upgrading, testing, and the initial usage of a new ERP system to provide mapping back to the old ERP system
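As an illustration of identity mapping, the MDM database holds only the keys from each system, and integration code joins the sources through that mapping. A minimal sketch with made-up tables follows; `crm`, `erp`, `key_map`, their columns and their values are all hypothetical.

```r
# customer rows as they exist in two source systems (invented sample data)
crm <- data.frame(crm_id = c("C1", "C2", "C3"),
                  name = c("Contoso", "Fabrikam", "Northwind"))
erp <- data.frame(erp_id = c(100, 200, 300),
                  credit_limit = c(5000, 12000, 800))

# the MDM side stores keys only, no other attributes
key_map <- data.frame(crm_id = c("C1", "C2", "C3"),
                      erp_id = c(200, 100, 300))

# integrate by joining each source through the key mapping
integrated <- merge(merge(crm, key_map, by = "crm_id"), erp, by = "erp_id")
print(integrated[order(integrated$crm_id),
                 c("crm_id", "erp_id", "name", "credit_limit")])
```

Because the mapping carries no attributes or history, a changed key in either system silently breaks the join, which is one way to picture the maintenance drawback noted above.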

{architecture} central metadata storage and central data that is continuously merged

stores metadata as well as master data in a dedicated MDM system
master data is not inserted or updated in the MDM system [5]
the merging (and cleansing) of master data from source systems occurs continuously, regularly [5]
{drawback} continuous merging can become expensive [5]
the only viable use for this approach is for finding out what has changed in source systems from the last merge [5]

enables merging only the delta (new and updated data)

frequently used for analytical systems

{architecture} central MDM, single copy

involves a specialized MDM application

master data, together with its metadata, is maintained in a central location [5]
⇒ all existing applications are consumers of the master data

{drawback} upgrade all existing applications to consume master data from central storage instead of maintaining their own copies [5]

⇒ can be expensive
⇒ can be impossible (e.g. for older systems)

{drawback} needs to consolidate all metadata from all source systems [5]
{drawback} the process of creating and updating master data could simply be too slow [5]

because of the processes in place

{architecture} central MDM, multiple copies

uses central storage of master data and its metadata

⇐ the metadata here includes only an intersection of common metadata from source systems [5]
each source system maintains its own copy of master data, with additional attributes that pertain to that system only [5]

after master data is inserted into the central MDM system, it is replicated (preferably automatically) to source systems, where the source-specific attributes are updated [5]
{benefit} good compromise between cost, data quality, and the effectiveness of the CRUD process [5]
{drawback} update conflicts

different systems can also update the common data [5]

⇒ involves continuous merges as well [5]

{drawback} uses a special MDM application

Acronyms:

MDM - Master Data Management

ODS - Operational Data Store

OLAP - online analytical processing

OLTP - online transactional processing

SOA - Service Oriented Architecture

References:
[1] The Art of Service (2017) Master Data Management Course
[2] DAMA International (2009) "The DAMA Guide to the Data Management Body of Knowledge" 1st Ed.

[3] Tony Fisher 2009 "The Data Asset"

[4] DAMA International (2017) "The DAMA Guide to the Data Management Body of Knowledge" 2nd Ed.

[5] Dejan Sarka et al (2012) Exam 70-463: Implementing a Data Warehouse with Microsoft SQL Server 2012 (Training Kit)

25 March 2024

R Language: Regression Analysis with Simulated & Real Data

Before doing regression on a real dataset, one can use a small set of simulated data to test the steps (code adapted from [1]):

# define the model with simulated data
n <- 100
x <- c(1:n)
error <- rnorm(n,0,10)
y <- 1+2*x+error
fit <- lm(y~x)

# plotting the values
plot(x, y, ylab="1+2*x+error")
lines(x, fit$fitted.values)

#using anova (analysis of variance)
anova(fit)

The first step defines the data model and fits it, the second plots the data together with the fitted line, and the third runs the analysis of variance. For the y variable, any linear function that represents a line in the plane can be used.

The rnorm() function generates normally distributed random values based on the parameters given (number of values, mean and standard deviation), therefore the output will vary between runs of the above code. The bigger the value of the third parameter (the standard deviation), the more dispersed the data are.
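To make the runs repeatable and to see the effect of the third parameter, one can fix the seed and compare the residual spread for two noise levels. This is a minimal sketch; the seed and the two sd values are arbitrary choices for illustration.

```r
# fix the seed so that rnorm() returns the same values on every run
set.seed(42)

n <- 100
x <- 1:n

# same line, two noise levels (sd is the third parameter of rnorm)
y_low  <- 1 + 2*x + rnorm(n, mean = 0, sd = 10)
y_high <- 1 + 2*x + rnorm(n, mean = 0, sd = 50)

fit_low  <- lm(y_low ~ x)
fit_high <- lm(y_high ~ x)

# the residual standard error reflects the noise level
sigma_low  <- summary(fit_low)$sigma
sigma_high <- summary(fit_high)$sigma
print(c(sd10 = sigma_low, sd50 = sigma_high))

# estimated intercept and slope for the low-noise fit
print(coef(fit_low))
```

The estimated slope should stay close to 2 in both fits, while the residual standard error tracks the sd used to generate the error term.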

To test the code on real data, one can use the Sleuth3 library with the data from [2] (see RPubs):

install.packages("Sleuth3")
library("Sleuth3")

Let's look at the data from the first case, which represent an experiment concerning the effects of intrinsic and extrinsic motivation on creativity run by the psychologist Teresa Amabile (see [2]):

attach(case0101)
case0101
summary(case0101)

The regression can be applied to all the data:

# case 0101 (all data)
x <- c(1:47)
y <- case0101$Score
fit <- lm(y~x)
plot(x, y, ylab="Score")
lines(x, fit$fitted.values)

However, a more appropriate analysis should consider each treatment group separately:

# case 0101 (extrinsic vs intrinsic treatments)
extrinsic <- subset(case0101, Treatment %in% "Extrinsic")
intrinsic <- subset(case0101, Treatment %in% "Intrinsic")

par(mfrow = c(1,2)) #1x2 matrix display

x <- c(1:length(extrinsic$Score))
y <- extrinsic$Score
fit <- lm(y~x)
plot(x, y, ylab="Extrinsic Score")
lines(x, fit$fitted.values)

x <- c(1:length(intrinsic$Score))
y <- intrinsic$Score
fit <- lm(y~x)
plot(x, y, ylab="Intrinsic Score")
lines(x, fit$fitted.values)

title("Extrinsic vs. Intrinsic Motivation on Creativity", line = -2, outer = TRUE)

And, here's the output:

Case 0101 Extrinsic vs. Intrinsic Motivation on Creativity
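Because Treatment is a categorical variable, another way to look at the two groups is to regress the score directly on the group. The sketch below uses simulated data, assuming a data frame with a Score column and a two-level Treatment factor. The group means, spread and group sizes are invented for illustration and are not the values from case0101.

```r
set.seed(1)

# simulated stand-in for a two-group experiment (all values are invented)
sim <- data.frame(
  Treatment = factor(rep(c("Extrinsic", "Intrinsic"), times = c(23, 24))),
  Score = c(rnorm(23, mean = 15, sd = 5), rnorm(24, mean = 20, sd = 5))
)

# the coefficient of the factor level is the difference between group means
fit <- lm(Score ~ Treatment, data = sim)
print(coef(fit))

# group means computed directly, for comparison
means <- tapply(sim$Score, sim$Treatment, mean)
print(means)

# the same comparison as a one-way analysis of variance
print(anova(fit))
```

With Sleuth3 loaded, the same formula could presumably be tried with data = case0101, since the code above uses its Score and Treatment columns.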

Happy coding!

References:
[1] DeWayne R Derryberry (2014) Basic Data Analysis for Time Series with R 1st Ed.
[2] Fred L Ramsey & Daniel W Schafer (2013) The Statistical Sleuth: A Course in Methods of Data Analysis 3rd Ed.

22 March 2024

Business Intelligence: Dashboards (Part I: Dashboards Are Dead & Other Crap)

Business Intelligence Series

I find annoying the posts that declare that a technology is dead, as they seem to seek the sensational and, in the end, don't offer enough arguments for the positions taken; all is just surfing through a few random ideas. Almost each time I click on such a link I find myself disappointed. Maybe it's just me - having too great expectations from ad-hoc experts who haven't understood the role of technologies and their lifecycle.

At least until now dashboards are the only visual tool that allows displaying related metrics in a consistent manner, reflecting business objectives, health, or other important perspectives on an organization's performance. More recently notebooks seem to be getting closer given their capabilities of presenting data visualizations and some intermediary steps used to obtain the data, though they are still far away from offering similar capabilities. So, where could any justification against dashboards' utility come from? Even if I heard one or two expert voices saying that they don't need KPIs for managing an organization, organizations still need metrics to understand how the organization is doing as a whole and in its parts.

Many argue that the design of dashboards is poor, that they don't reflect data visualization best practices, or that they are too difficult to navigate. There are so many books on dashboard and/or graphic design that it is almost impossible not to find such a book in any big library if one wants to learn more about design. There are many resources online as well, though it's tough to fight a mind's stubbornness in showing no interest in the topic. Conversely, there's also a lot of crap on the social networks that passes in the mainstream as best practice.

Frankly, design is important, though as long as the dashboards show the right data and the organization can guide itself by the respective numbers, the perfectionists can say whatever they want, even if they are right! Unfortunately, the numbers shown in dashboards raise legitimate questions and the reasons are multiple. Do dashboards show the right numbers? Do they focus on the objectives or important issues? Can the numbers be trusted? Do they reflect reality? Can we use them in decision-making?

There are so many things that can go wrong when building a dashboard - there are so many transformations that need to be performed, that the chances of failure are high. It's enough to have several blunders in the code or data visualizations for people to stop trusting the data shown.

Trust and quality are complex concepts and there’s no standard path to address them because they are a matter of perception, which can vary and change dynamically based on the situation. There are, however, approaches that help minimize the problem. One can start for example by providing transparency. For each dashboard provide also detailed reports that through drilldown (or by running the reports separately if that’s not possible) allow validating the numbers from the report. If users don’t trust the data or the report, then they should pinpoint what’s wrong. Of course, the two sources must be in sync, otherwise the validation will become more complex.

There are also issues related to the approach - the way a reporting tool was introduced, the way dashboards flooded the space, how people reacted, etc. Introducing a reporting tool for dashboards is also a matter of strategy, tactics and operations, and the various aspects related to them must be addressed. Few organizations address this properly. Many organizations work on the principle "build it and they will come" even if they build the wrong thing!


Business Intelligence: Monolithic vs. Distributed Architecture (Part III: Architectural Applications)

Business Intelligence Series

Now considering the 500 houses and the skyscraper model introduced in the previous post, which do you think will be built first? A skyscraper takes 2-10 years to build, depending on the city in which it is built and its architectural characteristics. A house may take 6-12 months depending on similar factors. But one needs to build 500 houses. For sure the process can be optimized when the houses look the same, though there are many constraints one needs to consider - the number of workers, tools, and the construction material available at a given time, the volume of planning, etc.

Within a rough estimate, it can take 2-5 years for each architecture to be built considering that on the average the advantages and disadvantages from the various areas can balance each other out. Historical data are in general needed for estimating the actual development time. One can start with a rough estimate and reevaluate the estimates up and down as more information are gathered. This usually happens in Software Engineering as well.

Monolith vs. Distributed Architecture - 500 families

There are multiple ways in which the work can be assigned to the contractors. When the houses are split between domains, each domain can have its own contractor(s) or the contractors can be specialized by knowledge areas, or a combination of the two. Contractors’ performance should be the same, though in practice no two contractors are the same. Conversely, the chances are higher for some contractors to deliver at the expected quality. It would be useful to have worked before with the contractors and have a partnership that spans years back. There are risks on both sides, even if the risks might favor one architecture over the other, and this depends also on the quality of the contractors, designs, and planning.

The planning must be good if not perfect to assure smooth development and each day can cost money when contractors are involved. The first planning must be done for the whole project and then split individually for each contractor and/or group of buildings. A back-and-forth check between the various plans is needed. Managing by exception can work, though it can also go terribly wrong.

A lot of communication must occur between domains to make sure that everything fits together. Especially at the beginning, all the parties must plan together and must make sure that the rules of the game (best practices, policies, procedures, processes, methodologies) are agreed upon. Oversight (governance) needs to happen at a small scale as well as in aggregate to make sure that the rules of the game are followed.

Now, which of the architectures do you think will fit a data warehouse (DWH)? Probably multiple voices will opt for the skyscraper, at least this is how a DWH looks from the outside. However, when one evaluates the architecture behind it, it can resemble a residential complex in which parts are bound together, but there are parts that can be distributed if needed. For example, in a DWH the HR department has its own area that's isolated from the other areas as it has higher security demands. There can be 2-3 other areas that don't share objects, and they can be distributed as well. The reasons why all the infrastructure is on one machine are the costs associated with the licenses and the fact that the reporting tools point to only one address.

In data mart-based DWHs, there are multiple buildings within the architecture, and thus the data marts can be distributed across a wider infrastructure, with each domain responsible for its own data mart(s). The data marts are by definition domain-dependent, and this is one of the downsides attributed to this architecture.


Business Intelligence: Monolithic vs. Distributed Architecture (Part II: Architectural Choices)

Business Intelligence Series

One metaphor that can be used to understand the difference between monolithic and distributed architectures, respectively between data warehouses and data mesh-based architectures as per Dehghani’s definition [1], is to imagine that you need to accommodate 500 families (the data products to be built). There are several options: (1) build a skyscraper (developing vertically); (2) build a complex of tall buildings, developing both horizontally and vertically while finding a balance between the two; (3) split (aka distribute) the second option into several buildings; (4) build a house for each family, creating a village or a neighborhood.

Monolith vs. Distributed Architecture - 500 families

(1) and (2) fit the definition of monoliths, while (3) and (4) are distributed architectures, though also in (3) one of the buildings can resemble a monolith if one chooses different architectures and heights for the buildings. For houses one can use a single architecture, agree on a set of predefined architectures, or have an architecture for each house, so that houses would look alike only by chance. One can also opt to have the same architecture for the buildings belonging to the same neighborhood (domain or subdomain). Moreover, the development could be split between multiple contractors that adhere to the same standards.

If the land is expensive, for example in big, overpopulated cities, when the infrastructure and the terrain allow it, one can build entirely on vertical, a skyscraper. If the land is cheap one can build a house for each family. The other architectures can be considered for everything in between.

A skyscraper is easier for externals to find (mailmen, couriers, milkmen, and other service providers) though will need a doorman to interact with them and probably a few other resources. Everybody will have the same address except the apartment number. There must be many elevators and the infrastructure must allow the flux of utilities up and down the floors, which can be challenging to achieve.

Within a village, every person who needs to deliver or pick up something needs to traverse parts of the village. There are many services that need to be provided for both scenarios, though the difference will be the time that's needed to move between addresses. In the virtual world this shouldn't matter unless one needs to inspect each house to check and/or retrieve something. The network of streets and the flux of utilities must scale with the population of the area.

A skyscraper will need materials of high quality that resist the various forces acting on the building even in the most extreme situations. The same cannot be said about a house, which in theory needs more materials, though a less solid foundation, and for which the construction specifications are more relaxed. Moreover, a house needs smaller tools and is easier to build, unless each house has its own design.

A skyscraper can host the families only when the construction is finished and the needed certificates have been approved. The same can be said about houses, but the effort and time are considerably smaller, though the utilities must also be available, and they can have their own timeline.

The model is far from perfect, though it allows us to reason about how changing the architecture affects various aspects. It doesn't reflect reality because there's a big difference between the physical and the virtual world. E.g., parts of the monolith can be used productively much earlier (though the core functionality might become available later), one doesn't need construction material but needs tools, the infrastructure must be available first, etc. Conversely, functional prototypes must be available beforehand, the needed skillset must exist, a set of assumptions and other requirements must be met, etc.


References:
[1] Zhamak Dehghani (2021) Data Mesh: Delivering Data-Driven Value at Scale (book review)

20 March 2024

Data Management: Understanding Master Data Management’s Integration Challenges (Answer)

Data Management Series

An answer to the post [1] by Piethein Strengholt, the author of "Data Management at Scale", on Master Data Management’s (MDM) integration challenges.

Master data can be managed within individual domains, though the boundaries must be clearly defined, and some coordination is needed. Attempting to partition the entities based on domains doesn’t always work. The partition needs to be performed at attribute level, though even then there might be some exceptions involved (e.g. some Products are only for Finance to use). One can then identify attributes inside the system to create the boundaries.

MDM is simple if you have the right systems, processes, procedures, roles, and data culture in place. Unfortunately, people make it too complicated – oh, we need a nice shiny system for managing the data before they are entered in ERP or other systems, we need a system for storing and maintaining the metadata, and another system for managing the policies, and the story goes on. The lack of systems is given as reason why people make no progress. Moreover, people will want to integrate the systems, increasing the overall complexity of the ecosystem.

The data should be cleaned in the source systems and assessed against the same. If that's not possible, then you have the wrong system! A set of well-built reports can make data assessment possible.

The metadata and policies can be maintained in Excel (and stored in SharePoint), SharePoint or a similar system that supports versioning. Pragmatic solutions can be found for other topics as well.

ERP systems allow us to define workflows and enable a master data record to be published only when the information is complete, though there will always be exceptions (e.g., a Purchase Order must be sent today). Such exceptions make people circumvent the MDM systems with all the issues deriving from this.

Adding an MDM system within an architecture tends to increase the complexity of the overall infrastructure and create more bottlenecks. Occasionally, it just replicates the structures existing in the target system(s).

Integrations are supposed to reduce the effort, though in the past 20 years I never saw an integration work without issues, not even where MDM is concerned. One of the main issues is that the solutions just synchronized the data without considering the processual dependencies, and sometimes also the referential dependencies. The time needed for troubleshooting the integrations can easily exceed the time for importing the data manually over an upload mechanism.

To make the integration work, the MDM system will end up duplicating all the validation available in the target system(s). This can make sense when a considerable volume of master data is created daily or weekly. Native connectors simplify the integrations, especially when they can handle the errors transparently and allow the records to be modified manually, though the issues start as soon as the target system is extended with more attributes or other structures.

If an organization has an MDM system, then all the master data should come from the MDM. As soon as a bidirectional synchronization is used (and other integrations might require this), Pandora’s box is open. One can define hard rules, though again, there are always exceptions in which manual interference is needed.

Attempting an integration of reference data is not recommended. ERP systems can have hundreds of such entities. Some organizations tend to have a golden system (a copy of production) with all the reference data. It works for some time, until people realize that the solution is expensive and time-consuming.

MDM systems do make sense in certain scenarios, though to get the integrations right can involve a considerable effort and certain assumptions and requirements must be met.


References:
[1] Piethein Strengholt (2023) Understanding Master Data Management’s Integration Challenges (link)

19 March 2024

R Language: Drawing Function Plots (Part II - Basic Curves & Inflection Points)

For a previous post on inflection points I needed a few examples, so I thought to write the code in the R language, which I did. Here's the final output:

Examples of Inflection Points

And, here's the code used to generate the above graphic:

par(mfrow = c(2,2)) #2x2 matrix display

# Example A: Inflection point with bifurcation
curve(x^3+20, -3,3, col = "black", main="(A) Inflection Point with Bifurcation")
curve(-x^2+20, 0, 3, add=TRUE, col="blue")
text (2, 10, "f(x)=-x^2+20, [0,3]", pos=1, offset = 1) #label second curve (bifurcation)
points(0, 20, col = "red", pch = 19) #inflection point 
text (0, 20, "inflection point", pos=1, offset = 1) #label inflection point


# Example B: Inflection point with Up & Down Concavity
curve(x^3-3*x^2-9*x+1, -3,6, main="(B) Inflection point with Up & Down Concavity")
points(1, -10, col = "red", pch = 19) #inflection point 
text (1, -10, "inflection point", pos=4, offset = 1) #label inflection point
text (-1, -10, "concave down", pos=3, offset = 1) 
text (-1, -10, "f''(x)<0", pos=1, offset = 0) 
text (2, 5, "concave up", pos=3, offset = 1)
text (2, 5, "f''(x)>0", pos=1, offset = 0) 


# Example C: Inflection point for multiple curves
curve(x^3-3*x+2, -3,3, col ="black", ylab="x^n-3*x+2, n = 2..5", main="(C) Inflection Point for Multiple Curves")
text (-3, -10, "n=3", pos=1) #label curve
curve(x^2-3*x+2,-3,3, add=TRUE, col="blue")
text (-2, 10, "n=2", pos=1) #label curve
curve(x^4-3*x+2,-3,3, add=TRUE, col="brown")
text (-1, 10, "n=4", pos=1) #label curve
curve(x^5-3*x+2,-3,3, add=TRUE, col="green")
text (-2, -10, "n=5", pos=1) #label curve
points(0, 2, col = "red", pch = 19) #inflection point 
text (0, 2, "inflection point", pos=4, offset = 1) #label inflection point
title("", line = -3, outer = TRUE)


# Example D: Inflection Point with fast change
curve(x^5-3*x+2,-3,3, col="black", ylab="x^n-3*x+2, n = 5,7,9", main="(D) Inflection Point with Slow vs. Fast Change")
text (-3, -100, "n=5", pos=1) #label curve
curve(x^7-3*x+2, add=TRUE, col="green")
text (-2.25, -100, "n=7", pos=1) #label curve
curve(x^9-3*x+2, add=TRUE, col="brown")
text (-1.5, -100, "n=9", pos=1) #label curve
points(0, 2, col = "red", pch = 19) #inflection point 
text (0, 2, "inflection point", pos=3, offset = 1) #label inflection point

mtext("© sql-troubles@blogspot.com @sql_troubles, 2024", side = 1, line = 4, adj = 1, col = "dodgerblue4", cex = .7)
#title("Examples of Inflection Points", line = -1, outer = TRUE)

Mathematically, an inflection point is a point on a smooth (plane) curve at which the curvature changes sign and where the second derivative is 0 [1]. The curvature intuitively measures the amount by which a curve deviates from being a straight line.

In example A, the main function has an inflection point, while the second function defined only for the interval [0,3] is used to represent a descending curve (aka bifurcation) for which the same point is a maximum point.

In example B, the function was chosen to represent an example with a concave down section (for which the second derivative is negative) and a concave up section (for which the second derivative is positive). So what comes after an inflection point is not necessarily a monotonically increasing function.

In example C, several functions are depicted that differ only in the power of the first term and pass through the same point (0, 2). Strictly speaking, that point is an inflection point only for the odd powers (n = 3, 5): for n = 2 the second derivative is constant, and for n = 4 it is 0 at x = 0 without changing sign. One could have shown only the behavior of the functions after that point, while choosing only one of the functions before it (see example A).

Example D uses the same family of functions as example C, though with higher powers of the first term. I kept the function for n=5 to offer a basis for comparison. The apparently strange thing is that around the inflection point the change seems to be small and linear, which is not the case. Both graphics are correct though, because in D the scale is set by the curve for n=5, while in C it is set by n=3 (the larger the scale, the flatter the curves look near the inflection point). If one adds n=3 as the first function in example D, the new chart will resemble C. Unfortunately, this behavior can be misused to present something as linear around the inflection point, which is not the case.

# Example E: Inflection Point with slow vs. fast change extended
curve(x^3-3*x+2,-3,3, col="black", ylab="x^n-3*x+2, n = 3,5,7,9", main="(E) Inflection Point with Slow vs. Fast Change")
text (-3, -10, "n=3", pos=1) #label curve
curve(x^5-3*x+2,-3,3, add=TRUE, col="brown")
text (-2, -10, "n=5", pos=1) #label curve
curve(x^7-3*x+2, add=TRUE, col="green")
text (-1.5, -10, "n=7", pos=1) #label curve
curve(x^9-3*x+2, add=TRUE, col="orange")
text (-1, -5, "n=9", pos=1) #label curve
points(0, 2, col = "red", pch = 19) #inflection point 
text (0, 2, "inflection point", pos=3, offset = 1) #label inflection point
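
The scale effect discussed for examples D and E can also be checked numerically. The following is only a minimal sketch: the helper g, the grid x and the sample point x = 0.5 are assumptions chosen for illustration, not part of the original charts. It compares how much of the y-axis height is taken by the distance between the n=5 curve and its linear part -3*x+2 when the axis is set by n=3 versus n=5.

```r
# Family of curves used in examples C, D and E (helper name is illustrative)
g <- function(x, n) x^n - 3*x + 2

# Grid over the plotted interval [-3, 3]
x <- seq(-3, 3, length.out = 601)

# Height of the y-axis, which curve() derives from the first curve drawn
axis_c <- diff(range(g(x, 3)))   # examples C and E start with n = 3
axis_d <- diff(range(g(x, 5)))   # example D starts with n = 5

# Distance between the n = 5 curve and its linear part -3*x + 2 at x = 0.5
bend <- 0.5^5

# Fraction of the axis height taken by that distance in each chart
shares <- c(share_in_C = bend / axis_c, share_in_D = bend / axis_d)
shares
```

The smaller the share, the harder it is to see the curve departing from a straight line near the inflection point, which is why D looks flatter than C and E.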

Comments:
(1) I cheated a bit by calculating the second derivative manually, which is an easy task for polynomials. There seem to be methods for calculating the inflection point, though the focus was on providing the examples.
(2) The examples C and D could have been implemented as part of a loop, though I needed anyway to add the labels for each curve individually. Here's the modified code to support a loop:

# Example F: Inflection Point with slow vs. fast change with loop
n <- c(5, 7, 9) #powers of the first term
color <- c("brown", "green", "orange") #one color per power

curve(x^3-3*x+2,-3,3, col="black", ylab="x^n-3*x+2, n = 3,5,7,9", main="(F) Inflection Point with Slow vs. Fast Change")
for (i in seq_along(n))
{
  ind <- n[i] #current power
  curve(x^ind-3*x+2,-3,3, add=TRUE, col=color[i])
}

text (-3, -10, "n=3", pos=1) #label curve
text (-2, -10, "n=5", pos=1) #label curve
text (-1, -5, "n=9", pos=1) #label curve
text (-1.5, -10, "n=7", pos=1) #label curve
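
Regarding comment (1), the manual step can also be left to base R. The following is a minimal sketch for example B, assuming that the functions D() and uniroot() from the stats package are sufficient for polynomials like these; the variable names are illustrative and the search interval is the one used in the chart.

```r
# Function from example B as an unevaluated expression
f <- expression(x^3 - 3*x^2 - 9*x + 1)

# Second derivative, derived symbolically by applying D() twice
d2 <- D(D(f[[1]], "x"), "x")

# Wrap the expressions into functions of x
f_fun  <- function(x) eval(f[[1]])
d2_fun <- function(x) eval(d2)

# Find where the second derivative is zero on the plotted interval [-3, 6]
root <- uniroot(d2_fun, interval = c(-3, 6))$root

# The sign of f'' must change around the root for an inflection point
sign_change <- d2_fun(root - 0.1) * d2_fun(root + 0.1) < 0

# Coordinates of the candidate point and the result of the sign check
c(x = root, y = f_fun(root), sign_change = sign_change)
```

The result can be compared with the point (1, -10) marked in example B. The sign check matters because a zero of the second derivative is not sufficient on its own.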

Happy coding!


References:
[1] Wikipedia (2023) Inflection point (link)

Strategic Management: Inflection Points and the Data Mesh (Quote of the Day)

Strategic Management Series

"Data mesh is what comes after an inflection point, shifting our approach, attitude, and technology toward data. Mathematically, an inflection point is a magic moment at which a curve stops bending one way and starts curving in the other direction. It’s a point that the old picture dissolves, giving way to a new one. [...] The impacts affect business agility, the ability to get value from data, and resilience to change. In the center is the inflection point, where we have a choice to make: to continue with our existing approach and, at best, reach a plateau of impact or take the data mesh approach with the promise of reaching new heights." [1]

I tried to understand the "metaphor" behind the quote. As the author points out through another quote, the metaphor is borrowed from Andrew Grove:

"An inflection point occurs where the old strategic picture dissolves and gives way to the new, allowing the business to ascend to new heights. However, if you don’t navigate your way through an inflection point, you go through a peak and after the peak the business declines. [...] Put another way, a strategic inflection point is when the balance of forces shifts from the old structure, from the old ways of doing business and the old ways of competing, to the new. Before" [2]

The second part of the quote clarifies the role of the inflection point - the shift from a structure, respectively organization or system, to a new one. The inflection point is not when we take a decision, but when the decision we took and its impact shift the balance. If the data mesh comes after the inflection point (see A), then there must be some kind of causality that converges uniquely toward the data mesh, which is questionable, if not illogical. A data mesh eventually makes sense after organizations have reached a certain scale and thus is unlikely to be adopted by small to medium businesses. Even for large organizations the data mesh may not be a viable solution if it doesn't have a proven record of success.

I could understand if the author had said that the data mesh will lead to an inflection point after its adoption, as is the case with transformative/disruptive technologies. Unfortunately, the track record of BI and Data Analytics projects doesn't give much hope for such a magical moment to happen. Probably, becoming a data-driven organization could have such an effect, though for many organizations the effects are still far from expectations.

There's another point to consider. A curve with inflection points can contain up and down concavities (see B) or there can be multiple curves passing through an inflection point (see C) and the continuation can be on any of the curves.

Examples of Inflection Points [3]

The change can be fast or slow (see D), and in the latter case it may take a long time for the change to be perceived. Also, [2] notes that the perception that something changed can happen in stages. Moreover, the inflection point can be only local and doesn't describe the future evolution of the curve, which is to say that the curve can change its trajectory shortly after that. It happens in business processes and policy implementations that, after a change was made in extremis to alleviate an issue, a slight improvement is recognized, after which the performance decays sharply. This is the case in situations in which the symptoms and not the root causes were addressed.

More appropriate to describe the change would be a tipping point, which can be defined as a critical threshold beyond which a system (the organization) reorganizes/changes, often abruptly and/or irreversibly.


References:
[1] Zhamak Dehghani (2021) Data Mesh: Delivering Data-Driven Value at Scale (book review)
[2] Andrew S Grove (1996) "Only the Paranoid Survive: How to Exploit the Crisis Points that Challenge Every Company and Career"
[3] SQL Troubles (2024) R Language: Drawing Function Plots (Part II - Basic Curves & Inflection Points) (link)

18 March 2024

Strategic Management: Strategy (Notes)

Disclaimer: This is work in progress intended to consolidate information from various sources.
Last updated: 18-Mar-2024

Strategy

{definition} "the determination of the long-term goals and objectives of an enterprise, and the adoption of courses of action and the allocation of resources necessary for carrying out these goals" [4]
{goal} bring all tools and insights together to create an integrative narrative about what the organization should do moving forward [1]
a good strategy emerges out of the values, opportunities and capabilities of the organization [1]

{characteristic} robust
{characteristic} flexible
{characteristic} needs to embrace the uncertainty and complexity of the world
{characteristic} fact-based and informed by research and analytics
{characteristic} testable

{concept} strategy analysis

{definition} the assessment of an organization's current competitive position and the identification of future valuable competitive positions and how the firm plans to achieve them [1]

done from a general perspective

in terms of different functional elements within the organization [1]
in terms of being integrated across different concepts and tools and frameworks [1]

a good strategic analysis integrates various tools and frameworks that are in our strategist toolkit [1]

approachable in terms of

dynamics
complexity
competition

{step} identify the mission and values of the organization

critical for understanding what the firm values and how this may influence which opportunities it looks for and what actions it might be willing to take

{step} analyze the competitive environment

looking at what opportunities the environment provides and how competitors are likely to react

{step} analyze competitive positions

think about what the firm's own capabilities are and how they might relate to the opportunities that are available

{step} analyze and recommend strategic actions

actions for future improvement

{question} how do we create more value?
{question} how can we improve our current competitive position?
{question} how can we, in essence, create more value in our competitive environment?

alternatives

scaling the business
entering new markets
innovating
acquiring a competitor/another player within a market segment of interest

recommendations

{question} what do we recommend doing going forward?
{question} what are the underlying assumptions of these recommendations?
{question} do they meet our tests that we might have for providing value?
move from analysis to action

actions come from asking a series of questions about which opportunities exist and which actions we can take moving forward

{step} strategy formulation
{step} strategy implementation

{tool} competitor analysis

{question} what market is the firm in, and who are the players in these markets?

{tool} environmental analysis

{benefit} provides a picture on the broader competitive environment
{question} what are the major trends impacting this industry?
{question} are there changes in the sociopolitical environment that are going to have important implications for this industry?
{question} is this an attractive market or are there barriers to competition?

{tool} five forces analysis

{benefit} provides an overview of the market structure/industry structure
{benefit} helps understand the nature of the competitive game that we are playing as we then devise future strategies [1]

provides a dynamic perspective in our understanding of a competitive market

{question} how's the competitive structure in a market likely to evolve?

{tool} competitive life cycle analysis
{tool} SWOT (strengths, weaknesses, opportunities, threats) analysis
{tool} stakeholder analysis

{benefit} valuable in trying to understand the mission and values, as well as others' expectations of a firm

{tool} capabilities analysis

{question} what are the firm's unique resources and capabilities?
{question} how sustainable is any advantage that these assets provide?

{tool} portfolio planning matrix

{benefit} helps understand how a firm might leverage these assets across markets to improve its position in any given market
{question} how should we position ourselves in the market relative to our rivals?

{tool} capability analysis

{benefit} helps understand what the firm does well and which opportunities it might ultimately want to pursue in terms of these valuable competitive positions

via Strategy Maps and Portfolio Planning matrices

{tool} hypothesis testing

{question} how are competitors likely to react to these actions?
{question} does it make sense in the future worlds we envision?
[game theory] payoff matrices can be useful for understanding the actions taken by various competitors within an industry

{tool} scenario planning

{benefit} helps us envision future scenarios and then work back to understand which actions we might need to take in those scenarios if they play out
{question} does it provide strategic flexibility?

{tool} real options analysis

highlights the desire for strategic flexibility, or at least the value that strategic flexibility provides

{tool} acquisition analysis

{benefit} helps understand the value of certain actions versus others
{benefit} useful for understanding the opportunity costs of other strategic investments one might make
focused on mergers and acquisitions

{tool} If-Then thinking

sequential in nature

different from causal logic

commonly used in network diagrams, flow charts, Gantt charts, and computer programming

{tool} Balanced Scorecard

{definition} a framework to look at the strategy used for value creation from four different perspectives [5]

{perspective} financial

{scope} the strategy for growth, profitability, and risk viewed from the perspective of the shareholder [5]
{question} what are the financial objectives for growth and productivity? [5]
{question} what are the major sources of growth? [5]
{question} If we succeed, how will we look to our shareholders? [5]

{perspective} customer

{scope} the strategy for creating value and differentiation from the perspective of the customer [5]
{question} who are the target customers that will generate revenue growth and a more profitable mix of products and services? [5]
{question} what are their objectives, and how do we measure success with them? [5]

{perspective} internal business processes

{scope} the strategic priorities for various business processes, which create customer and shareholder satisfaction [5]

{perspective} learning and growth

{scope} defines the skills, technologies, and corporate culture needed to support the strategy

enable a company to align its human resources and IT with its strategy

{benefit} enables the strategic hypotheses to be described as a set of cause-and-effect relationships that are explicit and testable [5]

require identifying the activities that are the drivers (or lead indicators) of the desired outcomes (lag indicators) [5]
everyone in the organization must clearly understand the underlying hypotheses, to align resources with the hypotheses, to test the hypotheses continually, and to adapt as required in real time [5]

{tool} strategy map

{definition} a visual representation of a company’s critical objectives and the crucial relationships that drive organizational performance [2]

shows the cause-and-effect links by which specific improvements create desired outcomes [2]

{benefit} shows how an organization will convert its initiatives and resources, including intangible assets such as corporate culture and employee knowledge, into tangible outcomes [2]

{component} mission

{question} why we exist?

{component} core values

{question} what we believe in?
⇐ mission and the core values remain fairly stable over time [5]

{component} vision

{question} what we want to be?
paints a picture of the future that clarifies the direction of the organization [5]

helps individuals to understand why and how they should support the organization [5]


References:
[1] University of Virginia (2022) Strategic Planning and Execution (MOOC, Coursera)
[2] Robert S Kaplan & David P Norton (2000) Having Trouble with Your Strategy? Then Map It (link)
[3] Harold Kerzner (2001) Strategic planning for project management using a project management maturity model
[4] Alfred D Chandler Jr. (1962) "Strategy and Structure"
[5] Robert S Kaplan & David P Norton (2000) The Strategy-focused Organization: How Balanced Scorecard Companies Thrive in the New Business Environment

17 March 2024

Business Intelligence: Data Products (Part II: The Complexity Challenge)

Business Intelligence Series

Creating data products within a data mesh comes down to "partitioning" a given set of inputs, outputs and transformations to create something that looks like a Lego structure, in which each Lego piece represents a data product. The word partition is improperly used as there can be overlaps in terms of inputs, outputs and transformations, though in an ideal solution the outcome should be close to a partition.

While the complexity of inputs and outputs can be neglected, even if their number can be large, the same cannot be said about the transformations that must be performed in the process. Moreover, the transformations involve reengineering the logic built into the source systems, which is not a trivial task and must involve adequate testing. The transformations are a must and there's no way to avoid them.

When designing a data warehouse or data mart, one of the goals is to keep the redundancy of the transformations and of the intermediary results to a minimum, to avoid the unnecessary duplication of code and data. Code duplication usually becomes an issue when the logic needs to be changed, and in business contexts that can happen often enough to create other challenges. Data duplication becomes an issue when the copies are not in sync, which results from unsynchronized code or from different refresh rates.

Building the transformations as SQL-based database objects has its advantages. There were many attempts at providing non-SQL operators for the same (in SSIS, Power Query), though the solutions built on them are difficult to troubleshoot and maintain, the overall complexity increasing with the volume of transformations that must be performed. In data meshes, the complexity increases also with the number of data products involved, especially when there are multiple stakeholders and different goals involved (see the challenges of developing data marts supposed to be domain-specific).

To growing complexity organizations answer with complexity. On one side the teams of developers, business users and other members of the governance teams who together with the solution create an ecosystem. On the other side, the inherent coordination and organization meetings, managing proposals, the negotiation of scope for data products, their design, testing, etc. The more complex the whole ecosystem becomes, the higher the chances for systemic errors to occur and multiply, respectively to create unwanted behavior of the parties involved. Ecosystems are challenging to monitor and manage.

The more complex the architecture, the higher the chances for failure. Even if some organizations might succeed, it doesn't mean that such an endeavor is for everybody - a certain maturity in building data architectures, data-based artefacts and managing projects must exist in the organization. Many organizations fail in addressing basic analytical requirements, so why would one think that they are capable of handling increased complexity? Even if one breaks the complexity of a data warehouse into more manageable units, the complexity is just moved to other levels that are more difficult to manage as a whole.

Being able to audit and test each data product individually has its advantages, though when a data product becomes part of an aggregate it can easily get lost in the bigger picture. Thus, a global observability framework is needed that allows monitoring the performance and health of each data product within the aggregate. Besides that, event brokers and other mechanisms are needed to handle failure, availability, security, etc.
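As a rough illustration of what monitoring data products "in aggregate" could mean, the following minimal sketch rolls individual health checks up into one summary. The `HealthCheck` fields, the product names and the `summarize` function are hypothetical and chosen only for illustration; a real observability framework would track far more signals.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class HealthCheck:
    """Result of checking one data product (hypothetical structure)."""
    product: str
    fresh: bool          # assumed flag: data refreshed within the agreed window
    tests_passed: bool   # assumed flag: the product's own tests succeeded


def summarize(checks: List[HealthCheck]) -> dict:
    # A product counts as failing if any of its checks failed.
    failing = sorted(c.product for c in checks if not (c.fresh and c.tests_passed))
    return {
        "total": len(checks),
        "healthy": len(checks) - len(failing),
        "failing": failing,
    }


checks = [
    HealthCheck("CleanOrders", fresh=True, tests_passed=True),
    HealthCheck("CustomerRevenue", fresh=False, tests_passed=True),
    HealthCheck("RevenueReport", fresh=True, tests_passed=True),
]
summary = summarize(checks)
print(summary)
```

The point of the sketch is only that the individual results must be collected in one place; a data product that passes its own tests in isolation still needs to be visible in the overall picture.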

Data products make sense in certain scenarios, especially when the complexity of architectures is manageable, though attempting to redesign everything from their perspective is like having a hammer in one's hand and treating everything like a nail.


Business Intelligence: Data Products (Part I: A Lego Exercise)

Business Intelligence Series

One can define a data product as the smallest unit of data-driven architecture that can be independently deployed and managed (aka product quantum) [1]. In other words, one can think of a data product as a box (or Lego piece) that takes data as inputs and performs several transformations on them, resulting in several outputs (data, data visualizations, or a hybrid of data, visualizations and other content).

At a high level, each Data Analytics solution can be regarded as a set of inputs, a set of outputs and the transformations that must be performed on the inputs to generate the outputs. The inputs are the data from the operational systems, while the outputs are analytics data that can be anything from data to KPIs and other metrics. A data mart, data warehouse, lakehouse and data mesh can be abstracted in this way, though different scales apply.

For creating data products within a data mesh, given a set of inputs, outputs and transformations, the challenge is to find horizontal and vertical partitions within these areas to create something that looks like a Lego structure, in which each piece of Lego represents a data product, while its color represents its membership in a business domain. Each such piece is self-contained and contains a set of transformations, together with its intermediary inputs and outputs. Multiple such pieces can be combined in a linear or hierarchical fashion to transform the initial inputs into the final outputs.

Data Products with a Data Mesh
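The Lego analogy can be made a bit more tangible with a minimal sketch, assuming that each piece declares its inputs, one output and one transformation, and that pieces are chained in a linear fashion. All names (`DataProduct`, `run_pipeline`, the sample datasets and domains) are hypothetical and serve only as illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Dataset = List[dict]  # a dataset is modeled as a simple list of rows


@dataclass
class DataProduct:
    """One 'Lego piece': declared inputs, one transformation, one output."""
    name: str
    domain: str                 # the 'color' of the piece
    inputs: List[str]
    output: str
    transform: Callable[..., Dataset]

    def run(self, catalog: Dict[str, Dataset]) -> None:
        # Read the declared inputs and publish the result under the output name.
        args = [catalog[i] for i in self.inputs]
        catalog[self.output] = self.transform(*args)


def run_pipeline(products: List[DataProduct], catalog: Dict[str, Dataset]) -> Dict[str, Dataset]:
    # Assumption: the pieces are listed in dependency order (linear composition).
    for product in products:
        product.run(catalog)
    return catalog


def clean_orders(rows: Dataset) -> Dataset:
    # Keep only the orders with a positive amount.
    return [r for r in rows if r["amount"] > 0]


def revenue_per_customer(rows: Dataset) -> Dataset:
    totals: Dict[str, float] = {}
    for r in rows:
        totals[r["customer"]] = totals.get(r["customer"], 0.0) + r["amount"]
    return [{"customer": c, "revenue": v} for c, v in sorted(totals.items())]


orders = [
    {"customer": "A", "amount": 100.0},
    {"customer": "B", "amount": 50.0},
    {"customer": "A", "amount": 25.0},
    {"customer": "B", "amount": -10.0},
]

pipeline = [
    DataProduct("CleanOrders", "Sales", ["orders"], "clean_orders", clean_orders),
    DataProduct("CustomerRevenue", "Finance", ["clean_orders"], "customer_revenue", revenue_per_customer),
]

result = run_pipeline(pipeline, {"orders": orders})
print(result["customer_revenue"])
# [{'customer': 'A', 'revenue': 125.0}, {'customer': 'B', 'revenue': 50.0}]
```

Even in this toy form, the second piece works only because it agrees with the first on the name and shape of `clean_orders`, which is exactly the kind of fit between pieces that has to be designed and governed.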

Finding such a partition is possible, though it involves a considerable effort, especially in designing the whole thing - identifying each Lego piece uniquely. When each department is on its own and develops its own Lego pieces, there's no guarantee that the pieces from the various domains will fit together to build something cohesive, performant, secure or well-structured. It's like building a house from modules: the pieces must fit together. That would be the role of governance (federated computational governance) - to align and coordinate the effort.

Conversely, there are transformations that need to be replicated to obtain autonomous data products, and the volume of such overlap can be considerably high. Consider for example the logic available in reports and how often it needs to be replicated. Alternatively, one can create intermediary data products, when that's feasible.

It's challenging to define the inputs and outputs for a Lego piece. Now imagine doing the same for a whole set of such pieces depending on each other! This might work for small pieces of data and entities quite stable in their lifetime (e.g. playlists, artists, songs), but with complex information systems the effort can increase by a few factors. Moreover, the complexity of the structure increases as soon as the Lego pieces expand beyond their initial design. It's as if real Lego pieces would grow within the available space but still keep the initial structure - strange constructs may result, which even if they work, shift the center of gravity of the edifice in other directions. There will thus be limits to growth that can easily lead to duplication of functionality to overcome such challenges.

Each new output or change in the initial input for these magic boxes involves a change of all the intermediary Lego pieces from input to output. Just recollect the last experience of defining the inputs and the outputs for an important complex report, how many iterations and how much effort was involved. This might have been an extreme case, though how realistic is the assumption that with data products everything will go smoother? No matter how much effort goes into the design, there will always be changes and further iterations involved.
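How far a change travels can be estimated with a simple impact analysis. The sketch below assumes a hypothetical dependency map (`reads_from`) that records which inputs each data product reads; the product names are invented for illustration. It walks the map breadth-first to list every piece downstream of a changed input.

```python
from collections import deque
from typing import Dict, List, Set

# Hypothetical dependency map: data product -> the inputs it reads.
reads_from: Dict[str, List[str]] = {
    "CleanOrders": ["orders"],
    "CleanCustomers": ["customers"],
    "CustomerRevenue": ["CleanOrders", "CleanCustomers"],
    "RevenueReport": ["CustomerRevenue"],
}


def affected_by(changed: str, reads_from: Dict[str, List[str]]) -> List[str]:
    # Invert the map: input -> the products that consume it directly.
    consumers: Dict[str, List[str]] = {}
    for product, sources in reads_from.items():
        for source in sources:
            consumers.setdefault(source, []).append(product)

    # Breadth-first walk over everything downstream of the changed input.
    seen: Set[str] = set()
    queue = deque([changed])
    while queue:
        node = queue.popleft()
        for consumer in consumers.get(node, []):
            if consumer not in seen:
                seen.add(consumer)
                queue.append(consumer)
    return sorted(seen)


print(affected_by("orders", reads_from))
# ['CleanOrders', 'CustomerRevenue', 'RevenueReport']
```

In this small example a change in `orders` touches three of the four pieces; in a mesh with many interdependent products the affected set, and with it the coordination effort, grows accordingly.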


References:
[1] Zhamak Dehghani (2021) Data Mesh: Delivering Data-Driven Value at Scale (book review)
