Showing posts with label data analytics. Show all posts
Showing posts with label data analytics. Show all posts

16 September 2024

🧭Business Intelligence: Mea Culpa (Part IV: Generalist or Specialist in an AI Era?)

Business Intelligence Series
Business Intelligence Series

Except the early professional years when I did mainly programming for web or desktop applications in the context of n-tier architectures, over the past 20 years my professional life was a mix between BI, Data Analytics, Data Warehousing, Data Migrations and other topics (ERP implementations and support, Project Management, IT Service Management, IT, Data and Applications Management), though the BI topics covered probably on average at least 60% of my time, either as internal or external consultant. 

I can consider myself thus a generalist who had the chance to cover most of the important aspects of a business from an IT perspective, and it was thus a great experience, at least until now! It’s a great opportunity to have the chance to look at problems, solutions, processes and the various challenges and opportunities from different perspectives. Technical people should have this opportunity directly in their jobs through the communication occurring in projects or IT services, though that’s more of a wish! Unfortunately, the dialogue between IT and business occurs almost only over the tickets and documents, which might be transparent but isn’t necessarily effective or efficient! 

Does working only part time in an area make one person less experienced or knowledgeable than other people? In theory, a full-time employee should get more exposure in depth and/or breadth, but that’s relative! It depends on the challenges one faces, the variation of the tasks, the implemented solutions, their depth and other technical and nontechnical factors like training, one’s experience in working with the various tools, the variety of the tasks and problem faced, professionalism, etc. A richer exposure can but not necessarily involve more technical and nontechnical knowledge, and this shouldn’t be taken as given! There’s no right or wrong answer even if people tend to take sides and argue over details.

Independently of job's effective time, one is forced to use his/her time to keep current with technologies or extend one’s horizon. In IT, a professional seldom can rely on what is learned on the job. Fortunately, nowadays one has more and more ways of learning, while the challenge shifts toward what to ignore, respectively better management of one’s time while learning. The topics increase in complexity and with this blogging becomes even more difficult, especially when one competes with AI content!

Talking about IT, it will be interesting to see how much AI can help or replace some of the professions or professionals. Anyway, some jobs will become obsolete or shift the focus to prompt engineering and technical reviews. AI still needs explicit descriptions of how to address tasks, at least until it learns to create and use better recipes for problem definition and solving. The bottom line, AI and its use can’t be ignored, and it can and should be used also in learning new things. It’s amazing what one can do nowadays with prompt engineering! 

Another aspect on which AI can help is to tailor the content to one’s needs. A high percentage in the learning process is spent on fishing in a sea of information for content that is worth knowing, respectively for a solution to one’s needs. AI must be able to address also some of the context without prompters being forced to give information explicitly!

AI opens many doors but can close many others. How much of one’s experience will remain relevant over the next years? Will AI have more success in addressing some of the challenges existing in people’s understanding or people will just trust AI blindly? Anyway, somebody must be smarter than AI, and here people’s collective intelligence probably can prove to be a real match. 

07 May 2024

🏭🗒️Microsoft Fabric: The Metrics Layer [Notes] [new feature]

Disclaimer: This is work in progress intended to consolidate information from various sources for learning purposes. For the latest information please consult the documentation (see the links below)! 

Last updated: 07-May-2024

The Metrics Layer in Microsoft Fabric (adapted diagram)
The Metrics Layer in Microsoft Fabric (adapted diagram)

[new feature] Metrics Layer (Metrics Store)

  • {definition}an abstraction layer available between the data store(s) and end users which allows organizations to create standardized business metrics, that are rooted in measures and are discoverable and intended for reuse
    • ⇐ {important} feature still in private preview 
  • {goal} extend existing infrastructure 
    • {benefit} leverages and extends existing features
  • {goal} provide consistent definitions and descriptions [1]
    • consistent definitions that include besides business logic additional dimensions and filters [1]
    • ⇒ {benefit} allows to standardize the metrics across the organization
    • ⇒ {benefit} enforce to enforce a SSoT
  • {goal} easy management 
    • via management views 
    • [feature] lineage 
    • [feature] source control
    • [feature] duplicate identification
    • [feature] push updates to downstream uses of the metrics 
  • {goal}searchable and discoverable metrics 
    • {feature} integration
      • based on Sempy fabric package
        • ⇐ a dataframe for storage and propagation of Power BI metadata which is part of the python-based semantic Link in Fabric
  • {goal}trust
    • [feature] trust indicators
    • {benefit} facilitates report's adoption
  • {feature} metric set 
    • {definition} a Fabric item that groups together a set of metrics into a mini-model
    • {benefit} allows to reduce the overall complexity of semantic models, while being easy to evolve and consume
    • associated with a single domain
      • ⇒ supports the data mesh architecture
    • shareable 
      • can be shared with other users
    • {action} create metric set
      • creates the actual artifact, to which metrics can be added 
  • {feature} metric
    •  {definition} a way to elevate the measures from the various semantic models existing in the organization
    • tied to the original semantic model
      • ⇒ {benefit} allows to see how a metric is used across the solutions 
    • reusable
      • can be reused in other fabric artifacts
        • new reports on the Power BI service
        • notebooks 
          • by copying the code
      • can be reused in Power BI
        • via OneLake data hub menu element
      • can be chained 
        • changes are propagated downstream 
    • materializable 
      • its output can be persisted to OneLake by saving it a delta table into a lakehouse
      • {misuse} data is persisted unnecessarily
    • {action} elevate metric
      • copies measure's definition and description
      • ⇒  implies restructuring, refactoring, moving, and testing a lot of code in the process
      • {misuse} data professionals build everything as metrics
    • {action} update metric
    • {action} add filters to metric 
    • {action} add dimensions to metric
    • {action} materialize metric 

Acronyms:
SSoT - single source of truth ()

References:
[1] Power BI Tips (2024) Explicit Measures Ep. 236: Metrics Hub, Hot New Feature with Carly Newsome (link)
[2] Power BI Tips (2024) Introducing Fabric Metrics Layer / Power Metrics Hub [with Carly Newsome] (link)

06 May 2024

🧭🏭Business Intelligence: Microsoft Fabric (Part III: The Metrics Layer [new feature])

Introduction

One of the announcements of this year's Microsoft Fabric Community first conference was the introduction of a metrics layer in Fabric which "allows organizations to create standardized business metrics, that are rooted in measures and are discoverable and intended for reuse" [1]. As it seems, the information content provided at the conference was kept to a minimum given that the feature is still in private preview, though several webcasts start to catch up on the topic (see [2], [4]). Moreover, as part of their show, the Explicit Measures (@PowerBITips) hosts had Carly Newsome as invitee, the manager of the project, who unveiled more details about the project and the feature, details which became the main source for the information below. 

The idea of a metric layer or metric store is not new, data professionals occasionally refer to their structure(s) of metrics as such. The terms gained weight in their modern conception relatively recently in 2021-2022 (see [5], [6], [7], [8], [10]). Within the modern data stack, a metrics layer or metric store is an abstraction layer available between the data store(s) and end users. It allows to centrally define, store, and manage business metrics. Thus, it allows us to standardize and enforce a single source of truth (SSoT), respectively solve several issues existing in the data stacks. As Benn Stancil earlier remarked, the metrics layer is one of the missing pieces from the modern data stack (see [10]).

Microsoft's Solution

Microsoft's business case for metrics layer's implementation is based on three main ideas (1) duplicate measures contribute to poor data quality, (2) complex data models hinder self-service, (3) reduce data silos in Power BI. In Microsoft's conception the metric layer provides several benefits: consistent definitions and descriptions, easy management via management views, searchable and discoverable metrics, respectively assure trust through indicators. 

For this feature's implementation Microsoft introduces a new Fabric Item called a metric set that allows to group several (business) metrics together as part of a mini-model that can be tailored to the needs of a subset of end-users and accessed by them via the standard tools already available. The metric set becomes thus a mini-model. Such mini-models allow to break down and reduce the overall complexity of semantic models, while being easy to evolve and consume. The challenge will become then on how to break down existing and future semantic models into nonoverlapping mini-models, creating in extremis a partition (see the Lego metaphor for data products). The idea of mini-models is not new, [12] advocating the idea of using a Master Model, a technique for creating derivative tabular models based on a single tabular solution.

A (business) metric is a way to elevate the measures from the various semantic models existing in the organization within the mini-model defined by the metric set. A metric can be reused in other fabric artifacts - currently in new reports on the Power BI service, respectively in notebooks by copying the code. Reusing metrics in other measures can mean that one can chain metrics and the changes made will be further propagated downstream. 

The Metrics Layer in Microsoft Fabric (adapted diagram)
The Metrics Layer in Microsoft Fabric (adapted diagram)

Every metric is tied to the original semantic model which allows thus to track how a metric is used across the solutions and, looking forward to Purview, to identify data's lineage. A measure is related to a "table", the source from which the measure came from.

Users' Perspective

The Metrics Layer feature is available in Microsoft Fabric service for Power BI within the Metrics menu element next to Scorecards. One starts by creating a metric set in an existing workspace, an operation which creates the actual artifact, to which the individual metrics are added. To create a metric, a user with build permissions can navigate through the semantic models across different workspaces he/she has access to, pick a measure from one of them and elevate it to a metric, copying in the process its measure's definition and description. In this way the metric will always point back to the measure from the semantic model, while the metrics thus created are considered as a related collection and can be shared around accordingly. 

Once a metric is added to the metric set, one can add in edit mode dimensions to it (e.g. Date, Category, Product Id, etc.). One can then further explore a metric's output and add filters (e.g. concentrate on only one product or category) point from which one can slice-and-dice the data as needed.

There is a panel where one can see where the metric has been used (e.g. in reports, scorecards, and other integrations), when was last time refreshed, respectively how many times was used. Thus, one has the most important information in one place, which is great for developers as well as for the users. Probably, other metadata will be added, such as whether an increase in the metric would be favorable or unfavorable (like in Tableau Pulse, see [13]) or maybe levels of criticality, an unit of measure, or maybe its type - simple metric, performance indicator (PI), result indicator (RI), KPI, KRI etc.

Metrics can be persisted to the OneLake by saving their output to a delta table into the lakehouse. As demonstrated in the presentation(s), with just a copy-paste and a small piece of code one can materialize the data into a lakehouse delta table, from where the data can be reused as needed. Hopefully, the process will be further automated. 

One can consume metrics and metrics sets also in Power BI Desktop, where a new menu element called Metric sets was added under the OneLake data hub, which can be used to connect to a metric set from a Semantic model and select the metrics needed for the project. 

Tapping into the available Power BI solutions is done via an integration feature based on Sempy fabric package, a dataframe for storage and propagation of Power BI metadata which is part of the python-based semantic Link in Fabric [11].

Further Thoughts

When dealing with a new feature, a natural idea comes to mind: what challenges does the feature involve, respectively how can it be misused? Given that the metrics layer can be built within a workspace and that it can tap into the existing measures, this means that one can built on the existing infrastructure. However, this can imply restructuring, refactoring, moving, and testing a lot of code in the process, hopefully with minimal implications for the solutions already available. Whether the process is as simple as imagined is another story. As misusage, in extremis, data professionals might start building everything as metrics, though the danger might come when the data is persisted unnecessarily. 

From a data mesh's perspective, a metric set is associated with a domain, though there will be metrics and data common to multiple domains. Moreover, a mini-model has the potential of becoming a data product. Distributing the logic across multiple workspaces and domains can add further challenges, especially in what concerns the synchronization and implemented of requirements in a way that doesn't lead to bottlenecks. But this is a general challenge for the development team(s). 

The feature will probably suffer further changes until is released in public review (probably by September or the end of the year). I subscribe to other data professionals' opinion that the feature was for long needed and that can have an important impact on the solutions built. 

Previous Post <<||>> Next Post

Resources:
[1] Microsoft Fabric Blog (2024) Announcements from the Microsoft Fabric Community Conference (link)
[2] Power BI Tips (2024) Explicit Measures Ep. 236: Metrics Hub, Hot New Feature with Carly Newsome (link)
[3] Power BI Tips (2024) Introducing Fabric Metrics Layer / Power Metrics Hub [with Carly Newsome] (link)
[4] KratosBI (2024) Fabric Fridays: Metrics Layer Conspiracy Theories #40 (link)
[5] Chris Webb's BI Blog (2022) Is Power BI A Semantic Layer? (link)
[6] The Data Stack Show (2022) TDSS 95: How the Metrics Layer Bridges the Gap Between Data & Business with Nick Handel of Transform (link)
[7] Sundeep Teki (2022) The Metric Layer & how it fits into the Modern Data Stack (link)
[8] Nick Handel (2021) A brief history of the metrics store (link)
[9] Aurimas (2022) The Jungle of Metrics Layers and its Invisible Elephant (link)
[10] Benn Stancil (2021) The missing piece of the modern data stack (link)
[11] Microsoft Learn (2024) Sempy fabric Package (link)
[12] Michael Kovalsky (2019) Master Model: Creating Derivative Tabular Models (link)
[13] Christina Obry (2023) The Power of a Metrics Layer - and How Your Organization Can Benefit From It (link
[14] KratosBI (2024) Introducing the Metrics Layer in #MicrosoftFabric with Carly Newsome [link]

19 March 2024

𖣯Strategic Management: Inflection Points and the Data Mesh (Quote of the Day)

Strategic Management
Strategic Management Series

"Data mesh is what comes after an inflection point, shifting our approach, attitude, and technology toward data. Mathematically, an inflection point is a magic moment at which a curve stops bending one way and starts curving in the other direction. It’s a point that the old picture dissolves, giving way to a new one. [...] The impacts affect business agility, the ability to get value from data, and resilience to change. In the center is the inflection point, where we have a choice to make: to continue with our existing approach and, at best, reach a plateau of impact or take the data mesh approach with the promise of reaching new heights." [1]

I tried to understand the "metaphor" behind the quote. As the author through another quote pinpoints, the metaphor is borrowed from Andrew Groove:

"An inflection point occurs where the old strategic picture dissolves and gives way to the new, allowing the business to ascend to new heights. However, if you don’t navigate your way through an inflection point, you go through a peak and after the peak the business declines. [...] Put another way, a strategic inflection point is when the balance of forces shifts from the old structure, from the old ways of doing business and the old ways of competing, to the new. Before" [2]

The second part of the quote clarifies the role of the inflection point - the shift from a structure, respectively organization or system to a new one. The inflection point is not when we take a decision, but when the decision we took, and the impact shifts the balance. If the data mesh comes after the inflection point (see A), then there must be some kind of causality that converges uniquely toward the data mesh, which is questionable, if not illogical. A data mesh eventually makes sense after organizations reached a certain scale and thus is likely improbable to be adopted by small to medium businesses. Even for large organizations the data mesh may not be a viable solution if it doesn't have a proven record of success. 

I could understand if the author would have said that the data mesh will lead to an inflection point after its adoption, as is the case of transformative/disruptive technologies. Unfortunately, the tracking record of BI and Data Analytics projects doesn't give many hopes for such a magical moment to happen. Probably, becoming a data-driven organization could have such an effect, though for many organizations the effects are still far from expectations. 

There's another point to consider. A curve with inflection points can contain up and down concavities (see B) or there can be multiple curves passing through an inflection point (see C) and the continuation can be on any of the curves.

Examples of Inflection Points [3]

The change can be fast or slow (see D), and in the latter it may take a long time for change to be perceived. Also [2] notes that the perception that something changed can happen in stages. Moreover, the inflection point can be only local and doesn't describe the future evolution of the curve, which to say that the curve can change the trajectory shortly after that. It happens in business processes and policy implementations that after a change was made in extremis to alleviate an issue a slight improvement is recognized after which the performance decays sharply. It's the case of situations in which the symptoms and not the root causes were addressed. 

More appropriate to describe the change would be a tipping point, which can be defined as a critical threshold beyond which a system (the organization) reorganizes/changes, often abruptly and/or irreversible.

Previous Post <<||>> Next Post

References:
[1] Zhamak Dehghani (2021) Data Mesh: Delivering Data-Driven Value at Scale (book review)
[2] Andrew S Grove (1988) "Only the Paranoid Survive: How to Exploit the Crisis Points that Challenge Every Company and Career"
[3] SQL Troubles (2024) R Language: Drawing Function Plots (Part II - Basic Curves & Inflection Points) (link)

17 March 2024

🧭Business Intelligence: Data Products (Part I: A Lego Exercise)

Business Intelligence
Business Intelligence Series

One can define a data product as the smallest unit of data-driven architecture that can be independently deployed and managed (aka product quantum) [1]. In other terms one can think of a data product like a box (or Lego piece) which takes data as inputs, performs several transformations on the data from which result several output data (or even data visualizations or a hybrid between data, visualizations and other content). 

At high-level each Data Analytics solution can be regarded as a set of inputs, a set of outputs and the transformations that must be performed on the inputs to generate the outputs. The inputs are the data from the operational systems, while the outputs are analytics data that can be anything from data to KPIs and other metrics. A data mart, data warehouse, lakehouse and data mesh can be abstracted in this way, though different scales apply. 

For creating data products within a data mesh, given a set of inputs, outputs and transformations, the challenge is to find horizontal and vertical partitions within these areas to create something that looks like a Lego structure, in which each piece of Lego represents a data product, while its color represents the membership to a business domain. Each such piece is self-contained and contains a set of transformations, respectively intermediary inputs and outputs. Multiple such pieces can be combined in a linear or hierarchical fashion to transform the initial inputs into the final outputs. 

Data Products with a Data Mesh
Data Products with a Data Mesh

Finding such a partition is possible though it involves a considerable effort, especially in designing the whole thing - identifying each Lego piece uniquely. When each department is on its own and develops its own Lego pieces, there's no guarantee that the pieces from the various domains will fit together to built something cohesive, performant, secure or well-structured. Is like building a house from modules, the pieces must fit together. That would be the role of governance (federated computational governance) - to align and coordinate the effort. 

Conversely, there are transformations that need to be replicated for obtaining autonomous data products, and the volume of such overlapping can be considerable high. Consider for example the logic available in reports and how often it needs to be replicated. Alternatively, one can create intermediary data products, when that's feasible. 

It's challenging to define the inputs and outputs for a Lego piece. Now imagine in doing the same for a whole set of such pieces depending on each other! This might work for small pieces of data and entities quite stable in their lifetime (e.g. playlists, artists, songs), but with complex information systems the effort can increase by a few factors. Moreover, the complexity of the structure increases as soon the Lego pieces expand beyond their initial design. It's like the real Lego pieces would grow within the available space but still keep the initial structure - strange constructs may result, which even if they work, change the gravity center of the edifice in other directions. There will be thus limits to grow that can easily lead to duplication of functionality to overcome such challenges.

Each new output or change in the initial input for this magic boxes involves a change of all the intermediary Lego pieces from input to output. Just recollect the last experience of defining the inputs and the outputs for an important complex report, how many iterations and how much effort was involved. This might have been an extreme case, though how realistic is the assumption that with data products everything will go smoother? No matter of the effort involved in design, there will be always changes and further iterations involved.

Previous Post <<||>> Next Post

References:
[1] Zhamak Dehghani (2021) Data Mesh: Delivering Data-Driven Value at Scale (book review

15 March 2024

🧊🗒️Data Warehousing: Data Mesh [Notes]

Disclaimer: This is work in progress intended to consolidate information from various sources. 

Last updated: 17-Mar-2024

Data Products with a Data Mesh
Data Products with a Data Mesh

Data Mesh
  • {definition} "a sociotechnical approach to share, access and manage analytical data in complex and large-scale environments - within or across organizations" [1]
    • ⇐ there is no default standard or reference implementation of data mesh and its components [2]
  • {definition} a type of decentralized data architecture that organizes data based on different business domains [2]
    • ⇐ no centralized data architecture coexists with data mesh, unless in transition [1]
    • distributes the modeling of analytical data, the data itself and its ownership [1]
  • {characteristic} partitions data around business domains and gives data ownership to the domains [1]
    • each domain can model their data according to their context [1]
    • there can be multiple models of the same concept in different domains gives the data sharing responsibility to those who are most intimately familiar with the data [1]
    • endorses multiple models of the data
      • data can be read from one domain, transformed and stored by another domain [1]
  • {characteristic} evolutionary execution process
  • {characteristic} agnostic of the underlying technology and infrastructure [1]
  • {aim} respond gracefully to change [1]
  • {aim} sustain agility in the face of growth [1]
  • {aim} increase the ratio of value from data to investment [1]
  • {principle} data as a product
    • {goal} business domains become accountable to share their data as a product to data users
    • {goal} introduce a new unit of logical architecture that controls and encapsulates all the structural components needed to share data as a product autonomously [1]
    • {goal} adhere to a set of acceptance criteria that assure the usability, quality, understandability, accessibility and interoperability of data products*
    • usability characteristics
  • {principle} domain-oriented ownership
    • {goal} decentralize the ownership of sharing analytical data to business domains that are closest to the data [1]
    • {goal} decompose logically the data artefacts based on the business domain they represent and manage their life cycle independently [1]
    • {goal} align business, technology and analytical data [1]
  • {principle} self-serve data platform
    • {goal} provide a self-serve data platform to empower domain-oriented teams to manage and govern the end-to-end life cycle of their data products* [1]
    • {goal} streamline the experience of data consumers to discover, access, and use the data products [1]
  • {principle} federated computational governance
    • {goal} implement a federated decision making and accountability structure that balances the autonomy and agility of domains, while respecting the global conformance, interoperability and security of the mesh* [1]
    • {goal} codifying and automated execution of policies at a fine-grained level [1]
    • ⇐ the principles represent a generalization and adaptation of practices that address the scale of organization digitization* [1]
  • {concept} decentralization of data products
    • {requirement} ability to compose data across different modes of access and topologies [1]
      • data needs to be agnostic to the syntax of data, underlying storage type, and mode of access to it [1]
        • many of the existing composability techniques that assume homogeneous data won’t work
          • e.g.  defining primary and foreign key relationships between tables of a single schema [1]
    • {requirement} ability to discover and learn what is relatable and decentral [1]
    • {requirement} ability to seamlessly link relatable data [1]
    • {requirement} ability to relate data temporally [1]
  • {concept} data product 
    • the smallest unit of data-based architecture that can be independently deployed and managed (aka product quantum) [1]
    • provides a set of explicitly defined and data sharing contracts
    • provides a truthful portion of the reality for a particular domain (aka single slice of truth) [1]
    • constructed in alignment with the source domain [3]
    • {characteristic} autonomous
      • its life cycle and model are managed independently of other data products [1]
    • {characteristic} discoverable
      • via a centralized registry or catalog that list the available datasets with some additional information about each dataset, the owners, the location, sample data, etc. [1]
    • {characteristic} addressable
      • via a permanent and unique address to the data user to programmatically or manually access it [1] 
    • {characteristic} understandable
      • involves getting to know the semantics of its underlying data and the syntax in which the data is encoded [1]
      • describes which entities it encapsulates, the relationships between them, and their adjacent data products [1]
    • {characteristic} trustworthy and truthful
      • represents the fact of the business correctly [1]
      • provides data provenance and data lineage [1]
    • {characteristic} natively accessible
      • make it possible for various data users to access and read its data in their native mode of access [1]
      • meant to be broadcast and shared widely [3]
    • {characteristic} interoperable and composable
      • follows a set of standards and harmonization rules that allow linking data across domains easily [1]
    • {characteristic} valuable on its own
      • must have some inherent value for the data users [1]
    • {characteristic} secure
      • the access control is validated by the data product, right in the flow of data, access, read, or write [1] 
        • ⇐ the access control policies can change dynamically
    • {characteristic} multimodal 
      • there is no definitive 'right way' to create a data product, nor is there a single expected form, format, or mode that it is expected to take [3] 
    • shares its logs, traces, and metrics while consuming, transforming, and sharing data [1]
    • {concept} data quantum (aka product data quantum, architectural quantum) 
      • unit of logical architecture that controls and encapsulates all the structural components needed to share a data product [1]
        • {component} data
        • {component} metadata
        • {component} code
        • {component} policies
        • {component} dependencies' listing
    • {concept} data product observability
      • monitor the operational health of the mesh
      • debug and perform postmortem analysis
      • perform audits
      • understand data lineage
    • {concept} logs 
      • immutable, timestamped, and often structured events that are produced as a result of processing and the execution of a particular task [1]
      • used for debugging and root cause analysis
    • {concept} traces
      • records of causally related distributed events [1]
    • {concept} metrics
      • objectively quantifiable parameters that continue to communicate build-time and runtime characteristics of data products [1]
  • artefacts 
    • e.g. data, code, metadata, policies
Previous Post <<||>> Next Post

References:
[1] Zhamak Dehghani (2021) Data Mesh: Delivering Data-Driven Value at Scale (book review)
[2] Zhamak Dehghani (2019) How to Move Beyond a Monolithic Data Lake to a Distributed Data Mesh (link)
[3] Adam Bellemare (2023) Building an Event-Driven Data Mesh: Patterns for Designing and Building Event-Driven Architectures

13 March 2024

🔖Book Review: Zhamak Dehghani's Data Mesh: Delivering Data-Driven Value at Scale (2021)

Zhamak Dehghani's "Data Mesh: Delivering Data-Driven Value at Scale" (2021)

Zhamak Dehghani's "Data Mesh: Delivering Data-Driven Value at Scale" (2021) is a must read book for the data professional. So, here I am, finally managing to read it and give it some thought, even if it will probably take more time and a few more reads for the ideas to grow. Working in the fields of Business Intelligence and Software Engineering for almost a quarter-century, I think I can understand the historical background and the direction of the ideas presented in the book. There are many good ideas but also formulations that make me circumspect about the applicability of some assumptions and requirements considered. 

So, after data marts, warehouses, lakes and lakehouses, the data mesh paradigm seems to be the new shiny thing that will bring organizations beyond the inflection point with tipping potential from where organization's growth will have an exponential effect. At least this seems to be the first impression when reading the first chapters. 

The book follows to some degree the advocative tone of promoting that "our shiny thing is much better than previous thing", or "how bad the previous architectures or paradigms were and how good the new ones are" (see [2]). Architectures and paradigms evolve with the available technologies and our perception of what is important for businesses. Old and new have their place in the order of things, and the old will continue to exist, at least until the new proves its feasibility.  

The definition of the data mash as "a sociotechnical approach to share, access and manage analytical data in complex and large-scale environments - within or across organizations" [1] is too abstract even if it reflects at high level what the concept is about. Compared to other material I read on the topic, the book succeeds in explaining the related concepts as well the goals (called definitions) and benefits (called motivations) associated with the principles behind the data mesh, making the book approachable also by non-professionals. 

Built around four principles "data as a product", "domain-oriented ownership", "self-serve data platform" and "federated governance", the data mesh is the paradigm on which data as products are developed; where the products are "the smallest unit of architecture that can be independently deployed and managed", providing by design the information necessary to be discovered, understood, debugged, and audited.

It's possible to create Lego-like data products, data contracts and/or manifests that address product's usability characteristics, though unless the latter are generated automatically, put in the context of ERP and other complex systems, everything becomes quite an endeavor that requires time and adequate testing, increasing the overall timeframe until a data product becomes available. 

The data mesh describes data products in terms of microservices that structure architectures in terms of a collection of services that are independently deployable and loosely coupled. Asking from data products to behave in this way is probably too hard a constraint, given the complexity and interdependency of the data models behind business processes and their needs. Does all the effort make sense? Is this the "agility" the data mesh solutions are looking for?

Many pioneering organizations are still fighting with the concept of data mesh as it proves to be challenging to implement. At a high level everything makes sense, but the way data products are expected to function makes the concept challenging to implement to the full extent. Moreover, as occasionally implied, the data mesh is about scaling data analytics solutions with the size and complexity of organizations. The effort makes sense when the organizations have a certain size and the departments have a certain autonomy, therefore, it might not apply to small to medium businesses.

Previous Post <<||>>  Next Post

References:
[1] Zhamak Dehghani (2021) "Data Mesh: Delivering Data-Driven Value at Scale" (link)
[2] SQL-troubles (2024) Zhamak Dehghani's Data Mesh - Monolithic Warehouses and Lakes (link)

06 March 2024

🧭Business Intelligence: Data Culture (Part II: Leadership, Necessary but not Sufficient)

Business Intelligence
Business Intelligence Series

Continuing the idea from the previous post on Brent Dykes’ article on data culture and Generative AI [1], it’s worth discussing about the relationship between data culture and leadership. Leadership belongs to a list of select words everybody knows about but fails to define them precisely, especially when many traits are associated with leadership, respectively when most of the issues existing in organizations ca be associated with it directly or indirectly.

Take for example McKinsey’s definition: "Leadership is a set of behaviors used to help people align their collective direction, to execute strategic plans, and to continually renew an organization." [2] It gives an idea of what leadership is about, though it lacks precision, which frankly is difficult to accomplish. Using modifiers like strong or weak with the word leadership doesn’t increase the precision of its usage. Several words stand out though: direction, strategy, behavior, alignment, renewal.

Leadership is about identifying and challenging the status quo, defining how the future will or could look like for the organization in terms of a vision, a mission and a destination, translating them into a set of goals and objectives. Then, it’s about defining a set of strategies, focusing on transformation and what it takes to execute it, adjusting the strategic bridge between goals and objectives, or, reading between the lines, identifying and doing the right things, being able to introduce a new order of things, reinventing the organization, adapting the organization to circumstances.

Aligning resumes in aligning the various strategies, aligning people with the vision and mission, while renewal is about changing course in response to new information or business context, identifying and transforming weaknesses into strengths, risks into opportunities, respectively opportunities into certitudes, seeing possibilities and multiplying them.

Leadership is also about working on the system, addressing the systemic failure, addressing structural and organizational issues, making sure that the preconditions and enablers for organizational change are in place, that no barriers exist or other factors impact negatively the change, that the positive aspects of complex systems like emergence or exponential growth do happen in time.

And leadership is about much more - interpersonal influence, inspiring people, Inspiring change, changing mindsets, assisting, motivating, mobilizing, connecting, knocking people out of their comfort zones, conviction, consistency, authority, competence, wisdom, etc. Leadership seems to be an idealistic concept where too many traits are considered, traits that ideally should apply to the average knowledge worker as well.

An organization’s culture is created, managed, nourished, and destroyed through leadership, and that’s a strong statement and constraint. By extension this statement applies to the data culture as well. It’s about leading by example and not by words or preaching, and many love to preach, even when no quire is around. It’s about demanding the same from the managers as managers demand from their subalterns, it’s about pushing the edges of culture. As Dykes mentions, it should be about participating in the data culture initiatives, making expectations explicit, and sharing mental models.

Leadership is a condition necessary but not sufficient for an organizations culture to mature. Financial and other type of resources are needed, though once a set of behaviors is seeded, they have the potential to grow and multiply when the proper conditions are met. Growth occurs also by being aware of what needs to be done and doing it day by day consciously, through self-mastery. Nowadays there are so many ways to learn and search for support, one just needs a bit of curiosity and drive to learn anything. Blaming in general the lack of leadership is just a way of passing the blame one level above on the command chain.

Resources:
[1] Forbes (2024) Why AI Isn’t Going To Solve All Your Data Culture Problems, by Brent Dykes (link)
[2] McKinsey (2022) What is leadership? (link)

Previous Post <<||>> Next Post

04 March 2024

🧭🏭Business Intelligence: Microsoft Fabric (Part II: Domains and the Data Mesh I -The Challenge of Structure Matching)

Business Intelligence Series
Business Intelligence Series

The holy grail of building a Data Analytics infrastructure seems to be nowadays the creation of a data mesh, a decentralized data architecture that organizes data by specific business domains. This endeavor proves to be difficult to achieve given the various challenges faced  – data integration, data ownership, data product creation and ownership, enablement of data citizens, respectively enforcing security and governance in a federated manner. 

Microsoft Fabric promises to facilitate the creation of data mashes with the help of domains and subdomain by providing built-in security, administration, and governance features associated with them. A domain is a way of logically grouping together all the data in an organization that is relevant to a particular area or field. A subdomain is a way for fine tuning the logical grouping of the data.

Business domains
Business domains & their entities

At high level the challenge of building a data mesh is on how to match or aggregate structures. On one side is the high-level structure of the data mesh, while on the other side is the structure of the business data entities. The data entities can be grouped within a taxonomy with multiple levels that expands to the departments. That’s why it seems somehow natural to consider the departments as the top-most domains of the data mesh. The issue is that if the segmentation starts from a high level, iI becomes inflexible in modeling. Moreover, one has only domains and subdomains, and thus a 2-level structure to model the main aspects of the data mesh.

Some organizations allow unrestricted access to the data belonging to a given department, while others breakdown the access to a more granular level. There are also organizations that don’t restrict the access at all, though this may change later. Besides permissions and a way of grouping together the entities, what value brings to set the domains as departments? 

Therefore, I’m not convinced about using an organizations’ departmental structure as domains, especially when such a structure may change and this would imply a full range of further changes. Moreover, such a structure doesn’t reflect the span of processes or how permissions are assigned for the various roles, which are better reflected on how information systems are structured. Most probably the solution needs to accommodate both perspective and be somehow in the middle. 

Take for example the internal structure of the modules from Dynamics 365 (D365). The Finance area is broken down in Accounts Payable, Accounts Receivables, Fixed Assets, General Ledger, etc. In some organizations the departments reflect this delimitation to some degree, while in others are just associated with finance-related roles. Moreover, the permissions are more granular and, reflecting the data entities the users work with. 

Conversely, SCM extends into Finance as Purchase orders, Sales orders and other business documents are the starting or intermediary points of processes that span modules. Similarly, there are processes that start in CRM or other systems. The span of processes seem to be more appropriate for structuring the data mesh, though the system overlapping with the roles involved in the processes and the free definition of process boundaries can overcomplicate the whole design.

It makes sense to define the domains at a level that resembles the structure of the modules available in D365, while the macro data-entities represent the subdomain. The subdomain would represent then master as well as transactional data entities from the perspective of the domains, with there will be entities that need to be shared between multiple domains. Such a structure has less chances to change over time, allowing more flexibility and smaller areas of focus and thus easier to design, develop, test, deploy and maintain.

Previous Post <<||>> Next Post

16 February 2024

🧭Business Intelligence: Strategic Management (Part I: What is a BI Strategy?)

Business Intelligence Series
Business Intelligence Series

"A BI strategy is a plan to implement, use, and manage data and analytics to better enable your users to meet their business objectives. An effective BI strategy ensures that data and analytics support your business strategy." [1]

The definition is from Microsoft's guide on Power BI implementation planning, a long-awaited resource for those deploying Power BI in their organization. 

I read the definition repeatedly and, even if it looks logically correct, the general feeling is that it falls short, and I'm trying to understand why. A strategy is a plan indeed, even if various theorists use modifiers like unified, comprehensive, integrative, forward-looking, etc. Probably, because it talks about a BI strategy, the definition implies using a strategic plan. Conversely, using "strategic plan" in the definition seems to make the definition redundant, though it would pull then with it all what a strategy is about. 

A business strategy is about enabling users to meet organization's business objectives, otherwise it would fail by design. Implicitly, an organization's objectives become its employees' objectives. The definition kind of states the obvious. Conversely, it talks only about the users, and not all employees are users. Thus, it refers only to a subset. Shouldn't a BI strategy support everybody? 

Usually, data analytics refers to the procedures and techniques used for exploration and analysis. Isn't supposed to consider also the visualization of data? Did it forgot something else? Ideally, a definition shouldn't define what its terms are about individually, but what they are when used together.

BI as a set of technologies, architectures, methodologies, processes and practices is by definition an enabler if we take these components individually or as a whole. I would play devil's advocate and ask "better than what?". Many of the information systems used in organizations come with a set of reports or functionalities that enable users in their jobs without investing a cent in a BI infrastructure. 

One or two decades ago one of the big words used in sales pitches for BI tools was "competitive advantage". I was asking myself when and where did the word disappeared? Is BI technologies' success so common that the word makes no sense anymore? Did the sellers become more ethical? Or did we recognize that the challenges behind a technology are more of an organizational nature? 

When looking at a business strategy, the hierarchy of business objectives forms its backbone, though there are other important elements that form its foundation: mission, vision, purpose, values or principles. A BI strategy needs to be aligned with the business strategy and the other strategies (e.g. quality, IT, communication, etc.). Being able to trace this kind of relationships between strategies is quintessential. 

We talk about BI, Data Analytics, Data Management and newly Data Science. The relationship between them becomes more complex. Therefore, what differentiates a BI strategy from the other strategies? The above definition could apply to the other fields as well. Moreover, does it makes sense to include them in one form or another?

Independently how the joint field is called, BI and Data Analytics should be about gaining a deeper understanding about the business and disseminating that knowledge within the organization, respectively about exploring courses of action, building the infrastructure, the skillset, the culture and the mindset to approach more complex challenges and not only to enable business goals!

There are no perfect definitions, especially when the concepts used have drifting definitions as well, being caught into a net that makes it challenging to grasp the essence of things. In the end, a definition is good enough if the data professionals can work with it. 

Resources:

[1] Microsoft Learn (2004) Power BI implementation planning: BI strategy (link).

13 February 2024

🧭Business Intelligence: A One-Man Show (Part II: In the Cusps of Complexity)

Business Intelligence Series
Business Intelligence Series

I watched today on YouTube Power BI Tips' "One Person to Do Everything" episode I missed last week. The main topic is based on Christopher Laubenthal's article "Why one person can't do everything in the data space". Author's arguments are based on an analogy between the various data areas and a college's functional structure. Reading the article, I must say that it takes a poorly chosen analogy to mess messy things more!

One of the most confusing things is that there are so many data-related context-dependent roles with considerable overlapping, that it becomes more and more difficult to understand what they cover. The author considers the roles of Data Architect, Data Engineer, Database Administrator (DBA), Data Analyst, Information Designer and Data Scientist. However, to the every aspect of a data architecture there are also developers on the database (backend) and reporting side (front-end). Conversely, there are other data professionals on the management side for the various knowledge areas of Data Management: Data Governance, Data Strategy, Data Security, Data Operations, etc. There are also roles at the border between the business and the technical side like Data Stewards, Business Analysts, Data Citizen, etc. 

There are two main aspects here. According to the historical perspective, many of these roles appeared when a new set of requirements or a new layer appeared in the architecture. Firstly, it was maybe the DBA, who was supposed to primarily administer the database. Being a keeper of the data and having some knowledge of the data entities, it was easy for him/her to export data for the various reporting needs. In time such activities were taken over by a second category of data professionals. Then the data were moved to Decision Support Systems and later to Data Warehouses and Data Lakes/Lakehoses, this evolution requiring other professionals to address the challenges of each layer. Every activity performed on the data requires a certain type of knowledge that can result in the end in a new denomination. 

The second perspective results from the management of data and the knowledge areas associated with it. If in small organizations with one or two systems in place one doesn't need to talk about Data Operations, in big organizations, where a data center or something similar is maybe in place, Data Operations can easily become a topic on its own, a management structure needing to be in place for its "effective and efficient" management. And the same can happen in the other knowledge areas and their interaction with the business. It's an inherent tendency of answering to complexity with complexity, which on the long term can be in the detriment of any business. In extremis, organizations tend to have a whole team in each area, which can further increase the overall complexity by a small to not that small magnitude. 

Fortunately, one of the benefits of technological advancement is that much of the complexity can be moved somewhere else, and these are the areas where the cloud brings the most advantages. Parts or all architecture can be deployed into the cloud, being managed by cloud providers and third-parties on an on-demand basis at stable costs. Moreover, with the increasing maturity and integration of the various layers, the impact of the various roles in the overall picture is reduced considerably as areas like governance, security or operations are built-in as services, requiring thus less resources. 

With Microsoft Fabric, all the data needed for reporting becomes in theory easily available in the OneLake. Unfortunately, there is another type of complexity that is dumped on other professionals' shoulders and these aspects need to be furthered considered. 

Previous Post <<|||>> Next Post

Resources:
[1] Christopher Laubenthal (2024) "Why One Person Can’t Do Everything In Data" (link)
[2] Power BI tips (2024) Ep.292: One Person to Do Everything (link)


27 January 2024

Data Science: Back to the Future I (About Beginnings)

Data Science
Data Science Series

I've attended again, after several years, a webcast on performance improvement in SQL Server with Claudio Silva, “Writing T-SQL code for the engine, not for you”. The session was great and I really enjoyed it! I recommend it to any data(base) professional, even if some of the scenarios presented should be known already.

It's strange to see the same topics from 20-25 years ago reappearing over and over again despite the advancements made in the area of database engines. Each version of SQL Server brought something new in what concerns the performance, though without some good experience and understanding of the basic optimization and troubleshooting techniques there's little overall improvement for the average data professional in terms of writing and tuning queries!

Especially with the boom of Data Science topics, the volume of material on SQL increased considerably and many discover how easy is to write queries, even if the start might be challenging for some. Writing a query is easy indeed, though writing a performant query requires besides the language itself also some knowledge about the database engine and the various techniques used for troubleshooting and optimization. It's not about knowing in advance what the engine will do - the engine will often surprise you - but about knowing what techniques work, in what cases, which are their advantages and disadvantages, respectively on how they might impact the processing.

Making a parable with writing literature, it's not enough to speak a language; one needs more for becoming a writer, and there are so many levels of mastery! However, in database world even if creativity is welcomed, its role is considerable diminished by the constraints existing in the database engine, the problems to be solved, the time and the resources available. More important, one needs to understand some of the rules and know how to use the building blocks to solve problems and build reliable solutions.

The learning process for newbies focuses mainly on the language itself, while the exposure to complexity is kept to a minimum. For some learners the problems start when writing queries based on multiple tables -  what joins to use, in what order, how to structure the queries, what database objects to use for encapsulating the code, etc. Even if there are some guidelines and best practices, the learner must walk the path and experiment alone or in an organized setup.

In university courses the focus is on operators algebras, algorithms, on general database technologies and architectures without much hand on experience. All is too theoretical and abstract, which is acceptable for research purposes,  but not for the contact with the real world out there! Probably some labs offer exposure to real life scenarios, though what to cover first in the few hours scheduled for them?

This was the state of art when I started to learn SQL a quarter century ago, and besides the current tendency of cutting corners, the increased confidence from doing some tests, and the eagerness of shouting one’s shaking knowledge and more or less orthodox ideas on the various social networks, nothing seems to have changed! Something did change – the increased complexity of the problems to solve, and, considering the recent technological advances, one can afford now an AI learn buddy to write some code for us based on the information provided in the prompt.

This opens opportunities for learning and growth. AI can be used in the learning process by providing additional curricula for learners to dive deeper in some topics. Moreover, it can help us in time to address the challenges of the ever-increase complexity of the problems.

02 January 2024

🕸Systems Engineering: Never-Ending Stories in Praxis (Quote of the Day)

Systems Engineering
Systems Engineering Cycle

"[…] the longer one works on […] a project without actually concluding it, the more remote the expected completion date becomes. Is this really such a perplexing paradox? No, on the contrary: human experience, all-too-familiar human experience, suggests that in fact many tasks suffer from similar runaway completion times. In short, such jobs either get done soon or they never get done. It is surprising, though, that this common conundrum can be modeled so simply by a self-similar power law." (Manfred Schroeder, "Fractals, Chaos, Power Laws Minutes from an Infinite Paradise", 1990)

I found the above quote while browsing through Manfred Schroeder's book on fractals, chaos and power laws, book that also explores similar topics like percolation, recursion, randomness, self-similarity, determinism, etc. Unfortunately, when one goes beyond the introductory notes of each chapter, the subjects require more advanced knowledge of Mathematics, respectively further analysis and exploration of the models behind. Despite this, the book is still an interesting read with ideas to ponder upon.

I found myself a few times in the situation described above - working on a task that didn't seem to end, despite investing more effort, respectively approaching the solution from different angles. The reasons residing behind such situations were multiple, found typically beyond my direct area of influence and/or decision. In a systemic setup, there are parts of a system that find themselves in opposition, different forces pulling in distinct directions. It can be the case of interests, goals, expectations or solutions which compete or make subject to politics. 

For example, in Data Analytics or Data Science there are high chances that no progress can be made beyond a certain point without addressing first the quality of data or design/architectural issues. The integrations between applications, data migrations and other solutions which heavily rely on data are sensitive to data quality and architecture's reliability. As long the source of variability (data, data generators) is not stabilized, providing a stable solution has low chances of success, no matter how much effort is invested, respectively how performant the tools are. 

Some of the issues can be solved by allocating resources to handle their implications. Unfortunately, some organizations attempt to solve such issues by allocating the resources in the wrong areas or by addressing the symptoms instead of taking a step back and looking systemically at the problem, analyzing and modeling it accordingly. Moreover, there are organizations which refuse to recognize they have a problem at all! In the blame game, it's much easier to shift the responsibility on somebody else's shoulders. 

Defining the right problem to solve might prove more challenging than expected and usually this requires several iterations in which the knowledge obtained in the process is incorporated gradually. Other times, one attempts to solve the correct problem by using the wrong methodology, architecture and/or skillset. The difference between right and wrong depends on the context, and even between similar problems and factors the context can make a considerable difference.

The above quote can be corroborated with situations in which perfection is demanded. In IT and management setups, excellence is often confounded with perfection, the latter being impossible to achieve, though many managers take it as the norm. There's a critical point above which the effort invested outweighs solution's plausibility by an exponential factor.  

Another source for unending effort is when requirements change frequently in a swift manner - e.g. the rate with which changes occur outweighs the progress made for finding a solution. Unless the requirements are stabilized, the effort spirals towards the outside (in an exponential manner). 

Finally, there are cases with extreme character, in which for example the complexity of the task outweighs the skillset and/or the number of resources available. Moreover, there are problems which accept plausible solutions, though there are also problems (especially systemic ones) which don't have stable or plausible solutions. 

Behind most of such cases lie factors that tend to have chaotic behavior that occurs especially when the environments are far from favorable. The models used to depict such relations are nonlinear, sometimes expressed as power laws - one quantity varying as a power of another, with the variation increasing with each generation. 

Previous Post <<||>> Next Post

Resources:
[1] Manfred Schroeder, "Fractals, Chaos, Power Laws Minutes from an Infinite Paradise", 1990 (quotes)

14 October 2023

🧭Business Intelligence: Perspectives (Part VII: Insights - Aha' Moments)

Business Intelligence Series
Business Intelligence Series

On one side scientists talk about 'Insight' with a sign of reverence when referring to the processes, patterns, models, metaphors, stories and paradigms used to generate and communicate insight. Conversely, data professionals seem to regard 'Insight' as something trivial, achievable just by picking and combining the right visualizations and storytelling. Are the scientists exaggerating when talking about insight, or do the data professionals downplay the meaning and role of insight? Or maybe the scientific and business contexts have incomparable complexity, even if the same knowledge toolset are used?

One probably can't deny the potentiality of tools or toolsets like data visualization or data storytelling in providing new information or knowledge that leads to insights, though between potential usefulness and harnessing that potential on a general basis there's a huge difference, no matter how much people tend to idealize the process (and there's lot of idealization going on). Moreover, sometimes the whole process seems to look like a black box in which some magic happens and insight happens.

It's challenging to explain the gap as long as there's no generally accepted scientific definition of insights, respectively an explanation of how insights come into being. Probably, the easiest way to recognize their occurrence is when an 'Aha' moment appears, though that's the outcome of a process and gives almost no information about the process itself. Thus, insight occurs when knowledge about the business is acquired, knowledge that allows new or better understanding of the data, facts, processes or models involved. 

So, there must be new associations that are formed, either derived directly from data or brought to surface by the storytelling process. The latter aspect implies that the storyteller is already in possession of the respective insight(s) or facilitates their discovery without being aware of them. It further implies that the storyteller has a broader understanding of the business than the audience, which is seldom the case, or that the storyteller has a broader understanding of the data and the information extracted from the data, and that's a reasonable expectation.

There're two important restrictions. First, the insight moments must be associated with the business context rather than with the mere use of tools! Secondly, there should be genuine knowledge, not knowledge that the average person should know, respectively the mere confirmation of expectations or bias. 

Understanding can be put in the context of decision making, respectively in the broader context of problem solving. In the latter, insight involves the transition from not knowing how to solve a problem to the state of knowing how to solve it. So, this could apply in the context of data visualization as well, though there might exist intermediary steps in between. For example, in a first step insights enable us to understand and define the right problem. A further step might involve the recognition of the fact the problem belongs to a broader set of problems that have certain characteristics. Thus, the process might involve a succession of 'Aha' moments. Given the complexity of the problems we deal with in business or social contexts, that's more likely to happen. So, the average person might need several 'Aha' moments - leaps in understanding - before the data can make a difference! 

Conversely, new knowledge and understanding obtained over successive steps might not lead to an 'Aha' moment at all. Whether such moments converge or not to 'Aha' moments may rely on the importance of the overall leap, though other factors might be involved as well. In the end, the emergence of new understanding is enough to explain what insights mean. Whether that's enough is a different discussion!

Previous Post <<||>> Next Post 


18 April 2023

📊Graphical Representation: Graphics We Live By I (The Analytics Marathon)

Graphical Representation
Graphical Representation Series

In a diagram adapted from an older article [1], Brent Dykes, the author of "Effective Data Storytelling" [2], makes a parallel between Data Analytics and marathon running, considering that an organization must pass through the depicted milestones, the percentages representing how many organizations reach the respective milestones:



It's a nice visualization and the metaphor makes sense given that running a marathon requires a long-term strategy to address the gaps between the current and targeted physical/mental form and skillset required to run a marathon, respectively for approaching a set of marathons and each course individually. Similarly, implementing a Data Analytics initiative requires a Data Strategy supposed to address the gaps existing between current and targeted state of art, respectively the many projects run to reach organization's goals. 

It makes sense, isn't it? On the other side the devil lies in details and frankly the diagram raises several questions when is compared with practices and processes existing in organizations. This doesn't mean that the diagram is wrong, just that it doesn't seem to reflect entirely the reality. 

The percentages represent author's perception of how many organizations reach the respective milestones, probably in an repeatable manner (as there are several projects). Thus, only 10% have a data strategy, 100% collect data, 80% of them prepare the data, while at the opposite side only 15% communicate insight, respectively 5% act on information.

Considering only the milestones the diagram looks like a funnel and a capability maturity model (CMM). Typically, the CMMs are more complex than this, evolving with technologies' capabilities. All the mentioned milestones have a set of capabilities that increase in complexity and that usually help differentiated organization's maturity. Therefore, the model seems too simple for an actual categorization.  

Typically, data collection has a specific scope resuming to surveys, interviews and/or research. However, the definition can be extended to the storage of data within organizations. Thus, data collection as the gathering of raw data is mainly done as part of their value supporting processes, and given the degree of digitization of data, one can suppose that most organizations gather data for the different purposes, even if only a small part are maybe digitized.

Even if many organizations build data warehouses, marts, lakehouses, mashes or whatever architecture might be en-vogue these days, an important percentage of the reporting needs are covered by standard reports or reporting tools that access directly the source systems without data preparation or even data visualization. The first important question is what is understood by data analytics? Is it only the use of machine learning and statistical analysis? Does it resume only to pattern and insight finding or does it includes also what is typically considered under the Business Intelligence umbrella? 

Pragmatically thinking, Data Analytics should consider BI capabilities as well as its an extension of the current infrastructure to consider analytic capabilities. On the other side Data Warehousing and BI are considered together by DAMA as part of their Data Management methodology. Moreover, organizations may have a Data Strategy and a BI strategy, respectively a Data Analytics strategy as they might have different goals, challenges and bodies to support them. To make it even more complicated, an organization might even consider all these important topics as part of the Data or even Information Governance, or consider BI or Analytics without Data Management. 

So, a Data Strategy might or might not address Data Analytics at all. It's a matter of management philosophy, organizational structure, politics and other factors. Probably, having a strayegy related to data should count. Even if a written and communicated data-related strategy is recommended for all medium to big organizations, only a small percentage of them have one, while small organizations might ignore the topic completely.

At least in the past, data analysis and its various subcomponents was performed before preparing and visualizing the data, or at least in parallel with data visualization. Frankly, it's a strange succession of steps. Or does it refers to exploratory data analysis (EDA) from a statistical perspective, which requires statistical experience to model and interpret the facts? Moreover, data exploration and discovery happen usually in the early stages.

The most puzzling step is the last one - what does the author intended with it? Ideally, data should be actionable, at least that's what one says about KPIs, OKRs and other metrics. Does it make sense to extend Data Analytics into the decision-making process? Where does a data professional's responsibilities end and which are those boundaries? Or does it refer to the actions that need to be performed by data professionals? 

The natural step after communicating insight is for the management to take action and provide feedback. Furthermore, the decisions taken have impact on the artifacts built and a reevaluation of the business problem, assumptions and further components is needed. The many steps of analytics projects are iterative, some iterations affecting the Data Strategy as well. The diagram shows the process as linear, which is not the case.

For sure there's an interface between Data Analytics and Decision-Making and the processes associated with them, however there should be clear boundaries. E.g., it's a data professional's responsibility to make sure that the data/information is actionable and eventually advise upon it, though whether the entitled people act on it is a management topic. Not acting upon an information is also a decision. Overstepping boundaries can put the data professional into a strange situation in which he becomes responsible and eventually accountable for an action not taken, which is utopic.

The final question - is the last mile representative for the analytical process? The challenge is not the analysis and communication of data but of making sure that the feedback processes work and the changes are addressed correspondingly, that value is created continuously from the data analytics infrastructure, that data-related risks and opportunities are addressed as soon they are recognized. 

As any model, a diagram doesn't need to be correct to be useful and might not be even wrong in the right context and argumentation. A data analytics CMM might allow better estimates and comparison between organizations, though it can easily become more complex to use. Between the two models lies probably a better solution for modeling the data analytics process.

Resources:
[1] Brent Dykes (2022) "Data Analytics Marathon: Why Your Organization Must Focus On The Finish", Forbes (link)
[2] Brent Dykes (2019) Effective Data Storytelling: How to Drive Change with Data, Narrative and Visuals (link)

Related Posts Plugin for WordPress, Blogger...

About Me

My photo
Koeln, NRW, Germany
IT Professional with more than 24 years experience in IT in the area of full life-cycle of Web/Desktop/Database Applications Development, Software Engineering, Consultancy, Data Management, Data Quality, Data Migrations, Reporting, ERP implementations & support, Team/Project/IT Management, etc.