Showing posts with label data analytics. Show all posts
Showing posts with label data analytics. Show all posts

07 May 2024

Microsoft Fabric: The Metrics Layer (Notes) [new feature]

Disclaimer: This is work in progress intended to consolidate information from various sources.
Last updated: 07-May-2024

The Metrics Layer in Microsoft Fabric (adapted diagram)
The Metrics Layer in Microsoft Fabric (adapted diagram)

[new feature] Metrics Layer (Metrics Store)

  • {definition}an abstraction layer available between the data store(s) and end users which allows organizations to create standardized business metrics, that are rooted in measures and are discoverable and intended for reuse
    • ⇐ {important} feature still in private preview 
  • {goal} extend existing infrastructure 
    • {benefit} leverages and extends existing features
  • {goal} provide consistent definitions and descriptions [1]
    • consistent definitions that include besides business logic additional dimensions and filters [1]
    • ⇒ {benefit} allows to standardize the metrics across the organization
    • ⇒ {benefit} enforce to enforce a SSoT
  • {goal} easy management 
    • via management views 
    • [feature] lineage 
    • [feature] source control
    • [feature] duplicate identification
    • [feature] push updates to downstream uses of the metrics 
  • {goal}searchable and discoverable metrics 
    • {feature} integration
      • based on Sempy fabric package
        • ⇐ a dataframe for storage and propagation of Power BI metadata which is part of the python-based semantic Link in Fabric
  • {goal}trust
    • [feature] trust indicators
    • {benefit} facilitates report's adoption
  • {feature} metric set 
    • {definition} a Fabric item that groups together a set of metrics into a mini-model
    • {benefit} allows to reduce the overall complexity of semantic models, while being easy to evolve and consume
    • associated with a single domain
      • ⇒ supports the data mesh architecture
    • shareable 
      • can be shared with other users
    • {action} create metric set
      • creates the actual artifact, to which metrics can be added 
  • {feature} metric
    •  {definition} a way to elevate the measures from the various semantic models existing in the organization
    • tied to the original semantic model
      • ⇒ {benefit} allows to see how a metric is used across the solutions 
    • reusable
      • can be reused in other fabric artifacts
        • new reports on the Power BI service
        • notebooks 
          • by copying the code
      • can be reused in Power BI
        • via OneLake data hub menu element
      • can be chained 
        • changes are propagated downstream 
    • materializable 
      • its output can be persisted to OneLake by saving it a delta table into a lakehouse
      • {misuse} data is persisted unnecessarily
    • {action} elevate metric
      • copies measure's definition and description
      • ⇒  implies restructuring, refactoring, moving, and testing a lot of code in the process
      • {misuse} data professionals build everything as metrics
    • {action} update metric
    • {action} add filters to metric 
    • {action} add dimensions to metric
    • {action} materialize metric 

Acronyms:
SSoT - single source of truth ()

References:
[1] Power BI Tips (2024) Explicit Measures Ep. 236: Metrics Hub, Hot New Feature with Carly Newsome (link)
[2] Power BI Tips (2024) Introducing Fabric Metrics Layer / Power Metrics Hub [with Carly Newsome] (link)

06 May 2024

Microsoft Fabric: The Metrics Layer [new feature]

Introduction

One of the announcements of this year's Microsoft Fabric Community first conference was the introduction of a metrics layer in Fabric which "allows organizations to create standardized business metrics, that are rooted in measures and are discoverable and intended for reuse" [1]. As it seems, the information content provided at the conference was kept to a minimum given that the feature is still in private preview, though several webcasts start to catch up on the topic (see [2], [4]). Moreover, as part of their show, the Explicit Measures (@PowerBITips) hosts had Carly Newsome as invitee, the manager of the project, who unveiled more details about the project and the feature, details which became the main source for the information below. 

The idea of a metric layer or metric store is not new, data professionals occasionally refer to their structure(s) of metrics as such. The terms gained weight in their modern conception relatively recently in 2021-2022 (see [5], [6], [7], [8], [10]). Within the modern data stack, a metrics layer or metric store is an abstraction layer available between the data store(s) and end users. It allows to centrally define, store, and manage business metrics. Thus, it allows us to standardize and enforce a single source of truth (SSoT), respectively solve several issues existing in the data stacks. As Benn Stancil earlier remarked, the metrics layer is one of the missing pieces from the modern data stack (see [10]).

Microsoft's Solution

Microsoft's business case for metrics layer's implementation is based on three main ideas (1) duplicate measures contribute to poor data quality, (2) complex data models hinder self-service, (3) reduce data silos in Power BI. In Microsoft's conception the metric layer provides several benefits: consistent definitions and descriptions, easy management via management views, searchable and discoverable metrics, respectively assure trust through indicators. 

For this feature's implementation Microsoft introduces a new Fabric Item called a metric set that allows to group several (business) metrics together as part of a mini-model that can be tailored to the needs of a subset of end-users and accessed by them via the standard tools already available. The metric set becomes thus a mini-model. Such mini-models allow to break down and reduce the overall complexity of semantic models, while being easy to evolve and consume. The challenge will become then on how to break down existing and future semantic models into nonoverlapping mini-models, creating in extremis a partition (see the Lego metaphor for data products). The idea of mini-models is not new, [12] advocating the idea of using a Master Model, a technique for creating derivative tabular models based on a single tabular solution.

A (business) metric is a way to elevate the measures from the various semantic models existing in the organization within the mini-model defined by the metric set. A metric can be reused in other fabric artifacts - currently in new reports on the Power BI service, respectively in notebooks by copying the code. Reusing metrics in other measures can mean that one can chain metrics and the changes made will be further propagated downstream. 

The Metrics Layer in Microsoft Fabric (adapted diagram)
The Metrics Layer in Microsoft Fabric (adapted diagram)

Every metric is tied to the original semantic model which allows thus to track how a metric is used across the solutions and, looking forward to Purview, to identify data's lineage. A measure is related to a "table", the source from which the measure came from.

Users' Perspective

The Metrics Layer feature is available in Microsoft Fabric service for Power BI within the Metrics menu element next to Scorecards. One starts by creating a metric set in an existing workspace, an operation which creates the actual artifact, to which the individual metrics are added. To create a metric, a user with build permissions can navigate through the semantic models across different workspaces he/she has access to, pick a measure from one of them and elevate it to a metric, copying in the process its measure's definition and description. In this way the metric will always point back to the measure from the semantic model, while the metrics thus created are considered as a related collection and can be shared around accordingly. 

Once a metric is added to the metric set, one can add in edit mode dimensions to it (e.g. Date, Category, Product Id, etc.). One can then further explore a metric's output and add filters (e.g. concentrate on only one product or category) point from which one can slice-and-dice the data as needed.

There is a panel where one can see where the metric has been used (e.g. in reports, scorecards, and other integrations), when was last time refreshed, respectively how many times was used. Thus, one has the most important information in one place, which is great for developers as well as for the users. Probably, other metadata will be added, such as whether an increase in the metric would be favorable or unfavorable (like in Tableau Pulse, see [13]) or maybe levels of criticality, an unit of measure, or maybe its type - simple metric, performance indicator (PI), result indicator (RI), KPI, KRI etc.

Metrics can be persisted to the OneLake by saving their output to a delta table into the lakehouse. As demonstrated in the presentation(s), with just a copy-paste and a small piece of code one can materialize the data into a lakehouse delta table, from where the data can be reused as needed. Hopefully, the process will be further automated. 

One can consume metrics and metrics sets also in Power BI Desktop, where a new menu element called Metric sets was added under the OneLake data hub, which can be used to connect to a metric set from a Semantic model and select the metrics needed for the project. 

Tapping into the available Power BI solutions is done via an integration feature based on Sempy fabric package, a dataframe for storage and propagation of Power BI metadata which is part of the python-based semantic Link in Fabric [11].

Further Thoughts

When dealing with a new feature, a natural idea comes to mind: what challenges does the feature involve, respectively how can it be misused? Given that the metrics layer can be built within a workspace and that it can tap into the existing measures, this means that one can built on the existing infrastructure. However, this can imply restructuring, refactoring, moving, and testing a lot of code in the process, hopefully with minimal implications for the solutions already available. Whether the process is as simple as imagined is another story. As misusage, in extremis, data professionals might start building everything as metrics, though the danger might come when the data is persisted unnecessarily. 

From a data mesh's perspective, a metric set is associated with a domain, though there will be metrics and data common to multiple domains. Moreover, a mini-model has the potential of becoming a data product. Distributing the logic across multiple workspaces and domains can add further challenges, especially in what concerns the synchronization and implemented of requirements in a way that doesn't lead to bottlenecks. But this is a general challenge for the development team(s). 

The feature will probably suffer further changes until is released in public review (probably by September or the end of the year). I subscribe to other data professionals' opinion that the feature was for long needed and that can have an important impact on the solutions built. 

Previous Post <<||>> Next Post

Resources:
[1] Microsoft Fabric Blog (2024) Announcements from the Microsoft Fabric Community Conference (link)
[2] Power BI Tips (2024) Explicit Measures Ep. 236: Metrics Hub, Hot New Feature with Carly Newsome (link)
[3] Power BI Tips (2024) Introducing Fabric Metrics Layer / Power Metrics Hub [with Carly Newsome] (link)
[4] KratosBI (2024) Fabric Fridays: Metrics Layer Conspiracy Theories #40 (link)
[5] Chris Webb's BI Blog (2022) Is Power BI A Semantic Layer? (link)
[6] The Data Stack Show (2022) TDSS 95: How the Metrics Layer Bridges the Gap Between Data & Business with Nick Handel of Transform (link)
[7] Sundeep Teki (2022) The Metric Layer & how it fits into the Modern Data Stack (link)
[8] Nick Handel (2021) A brief history of the metrics store (link)
[9] Aurimas (2022) The Jungle of Metrics Layers and its Invisible Elephant (link)
[10] Benn Stancil (2021) The missing piece of the modern data stack (link)
[11] Microsoft Learn (2024) Sempy fabric Package (link)
[12] Michael Kovalsky (2019) Master Model: Creating Derivative Tabular Models (link)
[13] Christina Obry (2023) The Power of a Metrics Layer - and How Your Organization Can Benefit From It (link

19 March 2024

Strategic Management: Inflection Points and the Data Mesh (Quote of the Day)

Strategic Management
Strategic Management Series

"Data mesh is what comes after an inflection point, shifting our approach, attitude, and technology toward data. Mathematically, an inflection point is a magic moment at which a curve stops bending one way and starts curving in the other direction. It’s a point that the old picture dissolves, giving way to a new one. [...] The impacts affect business agility, the ability to get value from data, and resilience to change. In the center is the inflection point, where we have a choice to make: to continue with our existing approach and, at best, reach a plateau of impact or take the data mesh approach with the promise of reaching new heights." [1]

I tried to understand the "metaphor" behind the quote. As the author through another quote pinpoints, the metaphor is borrowed from Andrew Groove:

"An inflection point occurs where the old strategic picture dissolves and gives way to the new, allowing the business to ascend to new heights. However, if you don’t navigate your way through an inflection point, you go through a peak and after the peak the business declines. [...] Put another way, a strategic inflection point is when the balance of forces shifts from the old structure, from the old ways of doing business and the old ways of competing, to the new. Before" [2]

The second part of the quote clarifies the role of the inflection point - the shift from a structure, respectively organization or system to a new one. The inflection point is not when we take a decision, but when the decision we took, and the impact shifts the balance. If the data mesh comes after the inflection point (see A), then there must be some kind of causality that converges uniquely toward the data mesh, which is questionable, if not illogical. A data mesh eventually makes sense after organizations reached a certain scale and thus is likely improbable to be adopted by small to medium businesses. Even for large organizations the data mesh may not be a viable solution if it doesn't have a proven record of success. 

I could understand if the author would have said that the data mesh will lead to an inflection point after its adoption, as is the case of transformative/disruptive technologies. Unfortunately, the tracking record of BI and Data Analytics projects doesn't give many hopes for such a magical moment to happen. Probably, becoming a data-driven organization could have such an effect, though for many organizations the effects are still far from expectations. 

There's another point to consider. A curve with inflection points can contain up and down concavities (see B) or there can be multiple curves passing through an inflection point (see C) and the continuation can be on any of the curves.

Examples of Inflection Points [3]

The change can be fast or slow (see D), and in the latter it may take a long time for change to be perceived. Also [2] notes that the perception that something changed can happen in stages. Moreover, the inflection point can be only local and doesn't describe the future evolution of the curve, which to say that the curve can change the trajectory shortly after that. It happens in business processes and policy implementations that after a change was made in extremis to alleviate an issue a slight improvement is recognized after which the performance decays sharply. It's the case of situations in which the symptoms and not the root causes were addressed. 

More appropriate to describe the change would be a tipping point, which can be defined as a critical threshold beyond which a system (the organization) reorganizes/changes, often abruptly and/or irreversible.

Previous Post <<||>> Next Post

References:
[1] Zhamak Dehghani (2021) Data Mesh: Delivering Data-Driven Value at Scale (book review)
[2] Andrew S Grove (1988) "Only the Paranoid Survive: How to Exploit the Crisis Points that Challenge Every Company and Career"
[3] SQL Troubles (2024) R Language: Drawing Function Plots (Part II - Basic Curves & Inflection Points) (link)

17 March 2024

Business Intelligence: Data Products (Part I: A Lego Exercise)

Business Intelligence
Business Intelligence Series

One can define a data product as the smallest unit of data-driven architecture that can be independently deployed and managed (aka product quantum) [1]. In other terms one can think of a data product like a box (or Lego piece) which takes data as inputs, performs several transformations on the data from which result several output data (or even data visualizations or a hybrid between data, visualizations and other content). 

At high-level each Data Analytics solution can be regarded as a set of inputs, a set of outputs and the transformations that must be performed on the inputs to generate the outputs. The inputs are the data from the operational systems, while the outputs are analytics data that can be anything from data to KPIs and other metrics. A data mart, data warehouse, lakehouse and data mesh can be abstracted in this way, though different scales apply. 

For creating data products within a data mesh, given a set of inputs, outputs and transformations, the challenge is to find horizontal and vertical partitions within these areas to create something that looks like a Lego structure, in which each piece of Lego represents a data product, while its color represents the membership to a business domain. Each such piece is self-contained and contains a set of transformations, respectively intermediary inputs and outputs. Multiple such pieces can be combined in a linear or hierarchical fashion to transform the initial inputs into the final outputs. 

Data Products with a Data Mesh
Data Products with a Data Mesh

Finding such a partition is possible though it involves a considerable effort, especially in designing the whole thing - identifying each Lego piece uniquely. When each department is on its own and develops its own Lego pieces, there's no guarantee that the pieces from the various domains will fit together to built something cohesive, performant, secure or well-structured. Is like building a house from modules, the pieces must fit together. That would be the role of governance (federated computational governance) - to align and coordinate the effort. 

Conversely, there are transformations that need to be replicated for obtaining autonomous data products, and the volume of such overlapping can be considerable high. Consider for example the logic available in reports and how often it needs to be replicated. Alternatively, one can create intermediary data products, when that's feasible. 

It's challenging to define the inputs and outputs for a Lego piece. Now imagine in doing the same for a whole set of such pieces depending on each other! This might work for small pieces of data and entities quite stable in their lifetime (e.g. playlists, artists, songs), but with complex information systems the effort can increase by a few factors. Moreover, the complexity of the structure increases as soon the Lego pieces expand beyond their initial design. It's like the real Lego pieces would grow within the available space but still keep the initial structure - strange constructs may result, which even if they work, change the gravity center of the edifice in other directions. There will be thus limits to grow that can easily lead to duplication of functionality to overcome such challenges.

Each new output or change in the initial input for this magic boxes involves a change of all the intermediary Lego pieces from input to output. Just recollect the last experience of defining the inputs and the outputs for an important complex report, how many iterations and how much effort was involved. This might have been an extreme case, though how realistic is the assumption that with data products everything will go smoother? No matter of the effort involved in design, there will be always changes and further iterations involved.

Previous Post <<||>> Next Post

References:
[1] Zhamak Dehghani (2021) Data Mesh: Delivering Data-Driven Value at Scale (book review

15 March 2024

Data Warehousing: Data Mesh (Notes)

Disclaimer: This is work in progress intended to consolidate information from various sources. 
Last updated: 17-Mar-2024

Data Products with a Data Mesh
Data Products with a Data Mesh

Data Mesh
  • {definition} "a sociotechnical approach to share, access and manage analytical data in complex and large-scale environments - within or across organizations" [1]
    • ⇐ there is no default standard or reference implementation of data mesh and its components [2]
  • {definition} a type of decentralized data architecture that organizes data based on different business domains [2]
    • ⇐ no centralized data architecture coexists with data mesh, unless in transition [1]
    • distributes the modeling of analytical data, the data itself and its ownership [1]
  • {characteristic} partitions data around business domains and gives data ownership to the domains [1]
    • each domain can model their data according to their context [1]
    • there can be multiple models of the same concept in different domains gives the data sharing responsibility to those who are most intimately familiar with the data [1]
    • endorses multiple models of the data
      • data can be read from one domain, transformed and stored by another domain [1]
  • {characteristic} evolutionary execution process
  • {characteristic} agnostic of the underlying technology and infrastructure [1]
  • {aim} respond gracefully to change [1]
  • {aim} sustain agility in the face of growth [1]
  • {aim} increase the ratio of value from data to investment [1]
  • {principle} data as a product
    • {goal} business domains become accountable to share their data as a product to data users
    • {goal} introduce a new unit of logical architecture that controls and encapsulates all the structural components needed to share data as a product autonomously [1]
    • {goal} adhere to a set of acceptance criteria that assure the usability, quality, understandability, accessibility and interoperability of data products*
    • usability characteristics
  • {principle} domain-oriented ownership
    • {goal} decentralize the ownership of sharing analytical data to business domains that are closest to the data [1]
    • {goal} decompose logically the data artefacts based on the business domain they represent and manage their life cycle independently [1]
    • {goal} align business, technology and analytical data [1]
  • {principle} self-serve data platform
    • {goal} provide a self-serve data platform to empower domain-oriented teams to manage and govern the end-to-end life cycle of their data products* [1]
    • {goal} streamline the experience of data consumers to discover, access, and use the data products [1]
  • {principle} federated computational governance
    • {goal} implement a federated decision making and accountability structure that balances the autonomy and agility of domains, while respecting the global conformance, interoperability and security of the mesh* [1]
    • {goal} codifying and automated execution of policies at a fine-grained level [1]
    • ⇐ the principles represent a generalization and adaptation of practices that address the scale of organization digitization* [1]
  • {concept} decentralization of data products
    • {requirement} ability to compose data across different modes of access and topologies [1]
      • data needs to be agnostic to the syntax of data, underlying storage type, and mode of access to it [1]
        • many of the existing composability techniques that assume homogeneous data won’t work
          • e.g.  defining primary and foreign key relationships between tables of a single schema [1]
    • {requirement} ability to discover and learn what is relatable and decentral [1]
    • {requirement} ability to seamlessly link relatable data [1]
    • {requirement} ability to relate data temporally [1]
  • {concept} data product 
    • the smallest unit of data-based architecture that can be independently deployed and managed (aka product quantum) [1]
    • provides a set of explicitly defined and data sharing contracts
    • provides a truthful portion of the reality for a particular domain (aka single slice of truth) [1]
    • constructed in alignment with the source domain [3]
    • {characteristic} autonomous
      • its life cycle and model are managed independently of other data products [1]
    • {characteristic} discoverable
      • via a centralized registry or catalog that list the available datasets with some additional information about each dataset, the owners, the location, sample data, etc. [1]
    • {characteristic} addressable
      • via a permanent and unique address to the data user to programmatically or manually access it [1] 
    • {characteristic} understandable
      • involves getting to know the semantics of its underlying data and the syntax in which the data is encoded [1]
      • describes which entities it encapsulates, the relationships between them, and their adjacent data products [1]
    • {characteristic} trustworthy and truthful
      • represents the fact of the business correctly [1]
      • provides data provenance and data lineage [1]
    • {characteristic} natively accessible
      • make it possible for various data users to access and read its data in their native mode of access [1]
      • meant to be broadcast and shared widely [3]
    • {characteristic} interoperable and composable
      • follows a set of standards and harmonization rules that allow linking data across domains easily [1]
    • {characteristic} valuable on its own
      • must have some inherent value for the data users [1]
    • {characteristic} secure
      • the access control is validated by the data product, right in the flow of data, access, read, or write [1] 
        • ⇐ the access control policies can change dynamically
    • {characteristic} multimodal 
      • there is no definitive 'right way' to create a data product, nor is there a single expected form, format, or mode that it is expected to take [3] 
    • shares its logs, traces, and metrics while consuming, transforming, and sharing data [1]
    • {concept} data quantum (aka product data quantum, architectural quantum) 
      • unit of logical architecture that controls and encapsulates all the structural components needed to share a data product [1]
        • {component} data
        • {component} metadata
        • {component} code
        • {component} policies
        • {component} dependencies' listing
    • {concept} data product observability
      • monitor the operational health of the mesh
      • debug and perform postmortem analysis
      • perform audits
      • understand data lineage
    • {concept} logs 
      • immutable, timestamped, and often structured events that are produced as a result of processing and the execution of a particular task [1]
      • used for debugging and root cause analysis
    • {concept} traces
      • records of causally related distributed events [1]
    • {concept} metrics
      • objectively quantifiable parameters that continue to communicate build-time and runtime characteristics of data products [1]
  • artefacts 
    • e.g. data, code, metadata, policies

References:
[1] Zhamak Dehghani (2021) Data Mesh: Delivering Data-Driven Value at Scale (book review)
[2] Zhamak Dehghani (2019) How to Move Beyond a Monolithic Data Lake to a Distributed Data Mesh (link)
[3] Adam Bellemare (2023) Building an Event-Driven Data Mesh: Patterns for Designing and Building Event-Driven Architectures

13 March 2024

Book Review: Zhamak Dehghani's Data Mesh: Delivering Data-Driven Value at Scale (2021)

Zhamak Dehghani's "Data Mesh: Delivering Data-Driven Value at Scale" (2021)

Zhamak Dehghani's "Data Mesh: Delivering Data-Driven Value at Scale" (2021) is a must read book for the data professional. So, here I am, finally managing to read it and give it some thought, even if it will probably take more time and a few more reads for the ideas to grow. Working in the fields of Business Intelligence and Software Engineering for almost a quarter-century, I think I can understand the historical background and the direction of the ideas presented in the book. There are many good ideas but also formulations that make me circumspect about the applicability of some assumptions and requirements considered. 

So, after data marts, warehouses, lakes and lakehouses, the data mesh paradigm seems to be the new shiny thing that will bring organizations beyond the inflection point with tipping potential from where organization's growth will have an exponential effect. At least this seems to be the first impression when reading the first chapters. 

The book follows to some degree the advocative tone of promoting that "our shiny thing is much better than previous thing", or "how bad the previous architectures or paradigms were and how good the new ones are" (see [2]). Architectures and paradigms evolve with the available technologies and our perception of what is important for businesses. Old and new have their place in the order of things, and the old will continue to exist, at least until the new proves its feasibility.  

The definition of the data mash as "a sociotechnical approach to share, access and manage analytical data in complex and large-scale environments - within or across organizations" [1] is too abstract even if it reflects at high level what the concept is about. Compared to other material I read on the topic, the book succeeds in explaining the related concepts as well the goals (called definitions) and benefits (called motivations) associated with the principles behind the data mesh, making the book approachable also by non-professionals. 

Built around four principles "data as a product", "domain-oriented ownership", "self-serve data platform" and "federated governance", the data mesh is the paradigm on which data as products are developed; where the products are "the smallest unit of architecture that can be independently deployed and managed", providing by design the information necessary to be discovered, understood, debugged, and audited.

It's possible to create Lego-like data products, data contracts and/or manifests that address product's usability characteristics, though unless the latter are generated automatically, put in the context of ERP and other complex systems, everything becomes quite an endeavor that requires time and adequate testing, increasing the overall timeframe until a data product becomes available. 

The data mesh describes data products in terms of microservices that structure architectures in terms of a collection of services that are independently deployable and loosely coupled. Asking from data products to behave in this way is probably too hard a constraint, given the complexity and interdependency of the data models behind business processes and their needs. Does all the effort make sense? Is this the "agility" the data mesh solutions are looking for?

Many pioneering organizations are still fighting with the concept of data mesh as it proves to be challenging to implement. At a high level everything makes sense, but the way data products are expected to function makes the concept challenging to implement to the full extent. Moreover, as occasionally implied, the data mesh is about scaling data analytics solutions with the size and complexity of organizations. The effort makes sense when the organizations have a certain size and the departments have a certain autonomy, therefore, it might not apply to small to medium businesses.

Previous Post <<||>> Next Post

References:
[1] Zhamak Dehghani (2021) "Data Mesh: Delivering Data-Driven Value at Scale" (link)
[2] SQL-troubles (2024) Zhamak Dehghani's Data Mesh - Monolithic Warehouses and Lakes (link)

06 March 2024

Business Intelligence: Data Culture (Part II: Leadership, Necessary but not Sufficient)

Business Intelligence
Business Intelligence Series

Continuing the idea from the previous post on Brent Dykes’ article on data culture and Generative AI [1], it’s worth discussing about the relationship between data culture and leadership. Leadership belongs to a list of select words everybody knows about but fails to define them precisely, especially when many traits are associated with leadership, respectively when most of the issues existing in organizations ca be associated with it directly or indirectly.

Take for example McKinsey’s definition: "Leadership is a set of behaviors used to help people align their collective direction, to execute strategic plans, and to continually renew an organization." [2] It gives an idea of what leadership is about, though it lacks precision, which frankly is difficult to accomplish. Using modifiers like strong or weak with the word leadership doesn’t increase the precision of its usage. Several words stand out though: direction, strategy, behavior, alignment, renewal.

Leadership is about identifying and challenging the status quo, defining how the future will or could look like for the organization in terms of a vision, a mission and a destination, translating them into a set of goals and objectives. Then, it’s about defining a set of strategies, focusing on transformation and what it takes to execute it, adjusting the strategic bridge between goals and objectives, or, reading between the lines, identifying and doing the right things, being able to introduce a new order of things, reinventing the organization, adapting the organization to circumstances.

Aligning resumes in aligning the various strategies, aligning people with the vision and mission, while renewal is about changing course in response to new information or business context, identifying and transforming weaknesses into strengths, risks into opportunities, respectively opportunities into certitudes, seeing possibilities and multiplying them.

Leadership is also about working on the system, addressing the systemic failure, addressing structural and organizational issues, making sure that the preconditions and enablers for organizational change are in place, that no barriers exist or other factors impact negatively the change, that the positive aspects of complex systems like emergence or exponential growth do happen in time.

And leadership is about much more - interpersonal influence, inspiring people, Inspiring change, changing mindsets, assisting, motivating, mobilizing, connecting, knocking people out of their comfort zones, conviction, consistency, authority, competence, wisdom, etc. Leadership seems to be an idealistic concept where too many traits are considered, traits that ideally should apply to the average knowledge worker as well.

An organization’s culture is created, managed, nourished, and destroyed through leadership, and that’s a strong statement and constraint. By extension this statement applies to the data culture as well. It’s about leading by example and not by words or preaching, and many love to preach, even when no quire is around. It’s about demanding the same from the managers as managers demand from their subalterns, it’s about pushing the edges of culture. As Dykes mentions, it should be about participating in the data culture initiatives, making expectations explicit, and sharing mental models.

Leadership is a condition necessary but not sufficient for an organizations culture to mature. Financial and other type of resources are needed, though once a set of behaviors is seeded, they have the potential to grow and multiply when the proper conditions are met. Growth occurs also by being aware of what needs to be done and doing it day by day consciously, through self-mastery. Nowadays there are so many ways to learn and search for support, one just needs a bit of curiosity and drive to learn anything. Blaming in general the lack of leadership is just a way of passing the blame one level above on the command chain.

Resources:
[1] Forbes (2024) Why AI Isn’t Going To Solve All Your Data Culture Problems, by Brent Dykes (link)
[2] McKinsey (2022) What is leadership? (link)

Previous Post <<||>> Next Post

04 March 2024

Business Intelligence: Microsoft Fabric's Domains and the Data Mesh I (The Challenge of Structure Matching)

Business Intelligence Series
Business Intelligence Series

The holy grail of building a Data Analytics infrastructure seems to be nowadays the creation of a data mesh, a decentralized data architecture that organizes data by specific business domains. This endeavor proves to be difficult to achieve given the various challenges faced  – data integration, data ownership, data product creation and ownership, enablement of data citizens, respectively enforcing security and governance in a federated manner. 

Microsoft Fabric promises to facilitate the creation of data mashes with the help of domains and subdomain by providing built-in security, administration, and governance features associated with them. A domain is a way of logically grouping together all the data in an organization that is relevant to a particular area or field. A subdomain is a way for fine tuning the logical grouping of the data.

Business domains
Business domains & their entities

At high level the challenge of building a data mesh is on how to match or aggregate structures. On one side is the high-level structure of the data mesh, while on the other side is the structure of the business data entities. The data entities can be grouped within a taxonomy with multiple levels that expands to the departments. That’s why it seems somehow natural to consider the departments as the top-most domains of the data mesh. The issue is that if the segmentation starts from a high level, iI becomes inflexible in modeling. Moreover, one has only domains and subdomains, and thus a 2-level structure to model the main aspects of the data mesh.

Some organizations allow unrestricted access to the data belonging to a given department, while others breakdown the access to a more granular level. There are also organizations that don’t restrict the access at all, though this may change later. Besides permissions and a way of grouping together the entities, what value brings to set the domains as departments? 

Therefore, I’m not convinced about using an organizations’ departmental structure as domains, especially when such a structure may change and this would imply a full range of further changes. Moreover, such a structure doesn’t reflect the span of processes or how permissions are assigned for the various roles, which are better reflected on how information systems are structured. Most probably the solution needs to accommodate both perspective and be somehow in the middle. 

Take for example the internal structure of the modules from Dynamics 365 (D365). The Finance area is broken down in Accounts Payable, Accounts Receivables, Fixed Assets, General Ledger, etc. In some organizations the departments reflect this delimitation to some degree, while in others are just associated with finance-related roles. Moreover, the permissions are more granular and, reflecting the data entities the users work with. 

Conversely, SCM extends into Finance as Purchase orders, Sales orders and other business documents are the starting or intermediary points of processes that span modules. Similarly, there are processes that start in CRM or other systems. The span of processes seem to be more appropriate for structuring the data mesh, though the system overlapping with the roles involved in the processes and the free definition of process boundaries can overcomplicate the whole design.

It makes sense to define the domains at a level that resembles the structure of the modules available in D365, while the macro data-entities represent the subdomain. The subdomain would represent then master as well as transactional data entities from the perspective of the domains, with there will be entities that need to be shared between multiple domains. Such a structure has less chances to change over time, allowing more flexibility and smaller areas of focus and thus easier to design, develop, test, deploy and maintain.

Previous Post <<||>> Next Post

16 February 2024

Business Intelligence: What is a BI Strategy?

Business Intelligence Series
Business Intelligence Series

"A BI strategy is a plan to implement, use, and manage data and analytics to better enable your users to meet their business objectives. An effective BI strategy ensures that data and analytics support your business strategy." [1]

The definition is from Microsoft's guide on Power BI implementation planning, a long-awaited resource for those deploying Power BI in their organization. 

I read the definition repeatedly and, even if it looks logically correct, the general feeling is that it falls short, and I'm trying to understand why. A strategy is a plan indeed, even if various theorists use modifiers like unified, comprehensive, integrative, forward-looking, etc. Probably, because it talks about a BI strategy, the definition implies using a strategic plan. Conversely, using "strategic plan" in the definition seems to make the definition redundant, though it would pull then with it all what a strategy is about. 

A business strategy is about enabling users to meet organization's business objectives, otherwise it would fail by design. Implicitly, an organization's objectives become its employees' objectives. The definition kind of states the obvious. Conversely, it talks only about the users, and not all employees are users. Thus, it refers only to a subset. Shouldn't a BI strategy support everybody? 

Usually, data analytics refers to the procedures and techniques used for exploration and analysis. Isn't supposed to consider also the visualization of data? Did it forgot something else? Ideally, a definition shouldn't define what its terms are about individually, but what they are when used together.

BI as a set of technologies, architectures, methodologies, processes and practices is by definition an enabler if we take these components individually or as a whole. I would play devil's advocate and ask "better than what?". Many of the information systems used in organizations come with a set of reports or functionalities that enable users in their jobs without investing a cent in a BI infrastructure. 

One or two decades ago one of the big words used in sales pitches for BI tools was "competitive advantage". I was asking myself when and where did the word disappeared? Is BI technologies' success so common that the word makes no sense anymore? Did the sellers become more ethical? Or did we recognize that the challenges behind a technology are more of an organizational nature? 

When looking at a business strategy, the hierarchy of business objectives forms its backbone, though there are other important elements that form its foundation: mission, vision, purpose, values or principles. A BI strategy needs to be aligned with the business strategy and the other strategies (e.g. quality, IT, communication, etc.). Being able to trace this kind of relationships between strategies is quintessential. 

We talk about BI, Data Analytics, Data Management and newly Data Science. The relationship between them becomes more complex. Therefore, what differentiates a BI strategy from the other strategies? The above definition could apply to the other fields as well. Moreover, does it makes sense to include them in one form or another?

Independently how the joint field is called, BI and Data Analytics should be about gaining a deeper understanding about the business and disseminating that knowledge within the organization, respectively about exploring courses of action, building the infrastructure, the skillset, the culture and the mindset to approach more complex challenges and not only to enable business goals!

There are no perfect definitions, especially when the concepts used have drifting definitions as well, being caught into a net that makes it challenging to grasp the essence of things. In the end, a definition is good enough if the data professionals can work with it. 

Resources:

[1] Microsoft Learn (2004) Power BI implementation planning: BI strategy (link).

13 February 2024

Business Intelligence: A One Man Show II (In the Cusps of Complexity)

Business Intelligence Series
Business Intelligence Series

I watched today on YouTube Power BI Tips' "One Person to Do Everything" episode I missed last week. The main topic is based on Christopher Laubenthal's article "Why one person can't do everything in the data space". Author's arguments are based on an analogy between the various data areas and a college's functional structure. Reading the article, I must say that it takes a poorly chosen analogy to mess messy things more!

One of the most confusing things is that there are so many data-related context-dependent roles with considerable overlapping, that it becomes more and more difficult to understand what they cover. The author considers the roles of Data Architect, Data Engineer, Database Administrator (DBA), Data Analyst, Information Designer and Data Scientist. However, to the every aspect of a data architecture there are also developers on the database (backend) and reporting side (front-end). Conversely, there are other data professionals on the management side for the various knowledge areas of Data Management: Data Governance, Data Strategy, Data Security, Data Operations, etc. There are also roles at the border between the business and the technical side like Data Stewards, Business Analysts, Data Citizen, etc. 

There are two main aspects here. According to the historical perspective, many of these roles appeared when a new set of requirements or a new layer appeared in the architecture. Firstly, it was maybe the DBA, who was supposed to primarily administer the database. Being a keeper of the data and having some knowledge of the data entities, it was easy for him/her to export data for the various reporting needs. In time such activities were taken over by a second category of data professionals. Then the data were moved to Decision Support Systems and later to Data Warehouses and Data Lakes/Lakehoses, this evolution requiring other professionals to address the challenges of each layer. Every activity performed on the data requires a certain type of knowledge that can result in the end in a new denomination. 

The second perspective results from the management of data and the knowledge areas associated with it. If in small organizations with one or two systems in place one doesn't need to talk about Data Operations, in big organizations, where a data center or something similar is maybe in place, Data Operations can easily become a topic on its own, a management structure needing to be in place for its "effective and efficient" management. And the same can happen in the other knowledge areas and their interaction with the business. It's an inherent tendency of answering to complexity with complexity, which on the long term can be in the detriment of any business. In extremis, organizations tend to have a whole team in each area, which can further increase the overall complexity by a small to not that small magnitude. 

Fortunately, one of the benefits of technological advancement is that much of the complexity can be moved somewhere else, and these are the areas where the cloud brings the most advantages. Parts or all architecture can be deployed into the cloud, being managed by cloud providers and third-parties on an on-demand basis at stable costs. Moreover, with the increasing maturity and integration of the various layers, the impact of the various roles in the overall picture is reduced considerably as areas like governance, security or operations are built-in as services, requiring thus less resources. 

With Microsoft Fabric, all the data needed for reporting becomes in theory easily available in the OneLake. Unfortunately, there is another type of complexity that is dumped on other professionals' shoulders and these aspects need to be furthered considered. 

Previous Post <<|||>> Next Post

Resources:
[1] Christopher Laubenthal (2024) "Why One Person Can’t Do Everything In Data" (link)
[2] Power BI tips (2024) Ep.292: One Person to Do Everything (link)


27 January 2024

Data Science: Back to the Future I (About Beginnings)

Data Science
Data Science Series

I've attended again, after several years, a webcast on performance improvement in SQL Server with Claudio Silva, “Writing T-SQL code for the engine, not for you”. The session was great and I really enjoyed it! I recommend it to any data(base) professional, even if some of the scenarios presented should be known already.

It's strange to see the same topics from 20-25 years ago reappearing over and over again despite the advancements made in the area of database engines. Each version of SQL Server brought something new in what concerns the performance, though without some good experience and understanding of the basic optimization and troubleshooting techniques there's little overall improvement for the average data professional in terms of writing and tuning queries!

Especially with the boom of Data Science topics, the volume of material on SQL increased considerably and many discover how easy is to write queries, even if the start might be challenging for some. Writing a query is easy indeed, though writing a performant query requires besides the language itself also some knowledge about the database engine and the various techniques used for troubleshooting and optimization. It's not about knowing in advance what the engine will do - the engine will often surprise you - but about knowing what techniques work, in what cases, which are their advantages and disadvantages, respectively on how they might impact the processing.

Making a parable with writing literature, it's not enough to speak a language; one needs more for becoming a writer, and there are so many levels of mastery! However, in database world even if creativity is welcomed, its role is considerable diminished by the constraints existing in the database engine, the problems to be solved, the time and the resources available. More important, one needs to understand some of the rules and know how to use the building blocks to solve problems and build reliable solutions.

The learning process for newbies focuses mainly on the language itself, while the exposure to complexity is kept to a minimum. For some learners the problems start when writing queries based on multiple tables -  what joins to use, in what order, how to structure the queries, what database objects to use for encapsulating the code, etc. Even if there are some guidelines and best practices, the learner must walk the path and experiment alone or in an organized setup.

In university courses the focus is on operators algebras, algorithms, on general database technologies and architectures without much hand on experience. All is too theoretical and abstract, which is acceptable for research purposes,  but not for the contact with the real world out there! Probably some labs offer exposure to real life scenarios, though what to cover first in the few hours scheduled for them?

This was the state of art when I started to learn SQL a quarter century ago, and besides the current tendency of cutting corners, the increased confidence from doing some tests, and the eagerness of shouting one’s shaking knowledge and more or less orthodox ideas on the various social networks, nothing seems to have changed! Something did change – the increased complexity of the problems to solve, and, considering the recent technological advances, one can afford now an AI learn buddy to write some code for us based on the information provided in the prompt.

This opens opportunities for learning and growth. AI can be used in the learning process by providing additional curricula for learners to dive deeper in some topics. Moreover, it can help us in time to address the challenges of the ever-increase complexity of the problems.

02 January 2024

Systems Engineering: Never-Ending Stories in Praxis (Quote of the Day)

Systems Engineering
Systems Engineering Cycle

"[…] the longer one works on […] a project without actually concluding it, the more remote the expected completion date becomes. Is this really such a perplexing paradox? No, on the contrary: human experience, all-too-familiar human experience, suggests that in fact many tasks suffer from similar runaway completion times. In short, such jobs either get done soon or they never get done. It is surprising, though, that this common conundrum can be modeled so simply by a self-similar power law." (Manfred Schroeder, "Fractals, Chaos, Power Laws Minutes from an Infinite Paradise", 1990)

I found the above quote while browsing through Manfred Schroeder's book on fractals, chaos and power laws, book that also explores similar topics like percolation, recursion, randomness, self-similarity, determinism, etc. Unfortunately, when one goes beyond the introductory notes of each chapter, the subjects require more advanced knowledge of Mathematics, respectively further analysis and exploration of the models behind. Despite this, the book is still an interesting read with ideas to ponder upon.

I found myself a few times in the situation described above - working on a task that didn't seem to end, despite investing more effort, respectively approaching the solution from different angles. The reasons residing behind such situations were multiple, found typically beyond my direct area of influence and/or decision. In a systemic setup, there are parts of a system that find themselves in opposition, different forces pulling in distinct directions. It can be the case of interests, goals, expectations or solutions which compete or make subject to politics. 

For example, in Data Analytics or Data Science there are high chances that no progress can be made beyond a certain point without addressing first the quality of data or design/architectural issues. The integrations between applications, data migrations and other solutions which heavily rely on data are sensitive to data quality and architecture's reliability. As long the source of variability (data, data generators) is not stabilized, providing a stable solution has low chances of success, no matter how much effort is invested, respectively how performant the tools are. 

Some of the issues can be solved by allocating resources to handle their implications. Unfortunately, some organizations attempt to solve such issues by allocating the resources in the wrong areas or by addressing the symptoms instead of taking a step back and looking systemically at the problem, analyzing and modeling it accordingly. Moreover, there are organizations which refuse to recognize they have a problem at all! In the blame game, it's much easier to shift the responsibility on somebody else's shoulders. 

Defining the right problem to solve might prove more challenging than expected and usually this requires several iterations in which the knowledge obtained in the process is incorporated gradually. Other times, one attempts to solve the correct problem by using the wrong methodology, architecture and/or skillset. The difference between right and wrong depends on the context, and even between similar problems and factors the context can make a considerable difference.

The above quote can be corroborated with situations in which perfection is demanded. In IT and management setups, excellence is often confounded with perfection, the latter being impossible to achieve, though many managers take it as the norm. There's a critical point above which the effort invested outweighs solution's plausibility by an exponential factor.  

Another source for unending effort is when requirements change frequently in a swift manner - e.g. the rate with which changes occur outweighs the progress made for finding a solution. Unless the requirements are stabilized, the effort spirals towards the outside (in an exponential manner). 

Finally, there are cases with extreme character, in which for example the complexity of the task outweighs the skillset and/or the number of resources available. Moreover, there are problems which accept plausible solutions, though there are also problems (especially systemic ones) which don't have stable or plausible solutions. 

Behind most of such cases lie factors that tend to have chaotic behavior that occurs especially when the environments are far from favorable. The models used to depict such relations are nonlinear, sometimes expressed as power laws - one quantity varying as a power of another, with the variation increasing with each generation. 

Previous Post <<||>> Next Post

Resources:
[1] Manfred Schroeder, "Fractals, Chaos, Power Laws Minutes from an Infinite Paradise", 1990 (quotes)

14 October 2023

Graphical Representation: On Insights I ('Aha' Moments)

Graphical Representation
Graphical Representation Series

On one side scientists talk about 'Insight' with a sign of reverence when referring to the processes, patterns, models, metaphors, stories and paradigms used to generate and communicate insight. Conversely, data professionals seem to regard 'Insight' as something trivial, achievable just by picking and combining the right visualizations and storytelling. Are the scientists exaggerating when talking about insight, or do the data professionals downplay the meaning and role of insight? Or maybe the scientific and business contexts have incomparable complexity, even if the same knowledge toolset are used?

One probably can't deny the potentiality of tools or toolsets like data visualization or data storytelling in providing new information or knowledge that leads to insights, though between potential usefulness and harnessing that potential on a general basis there's a huge difference, no matter how much people tend to idealize the process (and there's lot of idealization going on). Moreover, sometimes the whole process seems to look like a black box in which some magic happens and insight happens.

It's challenging to explain the gap as long as there's no generally accepted scientific definition of insights, respectively an explanation of how insights come into being. Probably, the easiest way to recognize their occurrence is when an 'Aha' moment appears, though that's the outcome of a process and gives almost no information about the process itself. Thus, insight occurs when knowledge about the business is acquired, knowledge that allows new or better understanding of the data, facts, processes or models involved. 

So, there must be new associations that are formed, either derived directly from data or brought to surface by the storytelling process. The latter aspect implies that the storyteller is already in possession of the respective insight(s) or facilitates their discovery without being aware of them. It further implies that the storyteller has a broader understanding of the business than the audience, which is seldom the case, or that the storyteller has a broader understanding of the data and the information extracted from the data, and that's a reasonable expectation.

There're two important restrictions. First, the insight moments must be associated with the business context rather than with the mere use of tools! Secondly, there should be genuine knowledge, not knowledge that the average person should know, respectively the mere confirmation of expectations or bias. 

Understanding can be put in the context of decision making, respectively in the broader context of problem solving. In the latter, insight involves the transition from not knowing how to solve a problem to the state of knowing how to solve it. So, this could apply in the context of data visualization as well, though there might exist intermediary steps in between. For example, in a first step insights enable us to understand and define the right problem. A further step might involve the recognition of the fact the problem belongs to a broader set of problems that have certain characteristics. Thus, the process might involve a succession of 'Aha' moments. Given the complexity of the problems we deal with in business or social contexts, that's more likely to happen. So, the average person might need several 'Aha' moments - leaps in understanding - before the data can make a difference! 

Conversely, new knowledge and understanding obtained over successive steps might not lead to an 'Aha' moment at all. Whether such moments converge or not to 'Aha' moments may rely on the importance of the overall leap, though other factors might be involved as well. In the end, the emergence of new understanding is enough to explain what insights mean. Whether that's enough is a different discussion!

Previous Post <<||>> Next Post 

18 April 2023

Graphical Representation: Graphics We Live By I (The Analytics Marathon)

Graphical Representation
Graphical Representation Series

In a diagram adapted from an older article [1], Brent Dykes, the author of "Effective Data Storytelling" [2], makes a parallel between Data Analytics and marathon running, considering that an organization must pass through the depicted milestones, the percentages representing how many organizations reach the respective milestones:



It's a nice visualization and the metaphor makes sense given that running a marathon requires a long-term strategy to address the gaps between the current and targeted physical/mental form and skillset required to run a marathon, respectively for approaching a set of marathons and each course individually. Similarly, implementing a Data Analytics initiative requires a Data Strategy supposed to address the gaps existing between current and targeted state of art, respectively the many projects run to reach organization's goals. 

It makes sense, isn't it? On the other side the devil lies in details and frankly the diagram raises several questions when is compared with practices and processes existing in organizations. This doesn't mean that the diagram is wrong, just that it doesn't seem to reflect entirely the reality. 

The percentages represent author's perception of how many organizations reach the respective milestones, probably in an repeatable manner (as there are several projects). Thus, only 10% have a data strategy, 100% collect data, 80% of them prepare the data, while at the opposite side only 15% communicate insight, respectively 5% act on information.

Considering only the milestones the diagram looks like a funnel and a capability maturity model (CMM). Typically, the CMMs are more complex than this, evolving with technologies' capabilities. All the mentioned milestones have a set of capabilities that increase in complexity and that usually help differentiated organization's maturity. Therefore, the model seems too simple for an actual categorization.  

Typically, data collection has a specific scope resuming to surveys, interviews and/or research. However, the definition can be extended to the storage of data within organizations. Thus, data collection as the gathering of raw data is mainly done as part of their value supporting processes, and given the degree of digitization of data, one can suppose that most organizations gather data for the different purposes, even if only a small part are maybe digitized.

Even if many organizations build data warehouses, marts, lakehouses, mashes or whatever architecture might be en-vogue these days, an important percentage of the reporting needs are covered by standard reports or reporting tools that access directly the source systems without data preparation or even data visualization. The first important question is what is understood by data analytics? Is it only the use of machine learning and statistical analysis? Does it resume only to pattern and insight finding or does it includes also what is typically considered under the Business Intelligence umbrella? 

Pragmatically thinking, Data Analytics should consider BI capabilities as well as its an extension of the current infrastructure to consider analytic capabilities. On the other side Data Warehousing and BI are considered together by DAMA as part of their Data Management methodology. Moreover, organizations may have a Data Strategy and a BI strategy, respectively a Data Analytics strategy as they might have different goals, challenges and bodies to support them. To make it even more complicated, an organization might even consider all these important topics as part of the Data or even Information Governance, or consider BI or Analytics without Data Management. 

So, a Data Strategy might or might not address Data Analytics at all. It's a matter of management philosophy, organizational structure, politics and other factors. Probably, having a strayegy related to data should count. Even if a written and communicated data-related strategy is recommended for all medium to big organizations, only a small percentage of them have one, while small organizations might ignore the topic completely.

At least in the past, data analysis and its various subcomponents was performed before preparing and visualizing the data, or at least in parallel with data visualization. Frankly, it's a strange succession of steps. Or does it refers to exploratory data analysis (EDA) from a statistical perspective, which requires statistical experience to model and interpret the facts? Moreover, data exploration and discovery happen usually in the early stages.

The most puzzling step is the last one - what does the author intended with it? Ideally, data should be actionable, at least that's what one says about KPIs, OKRs and other metrics. Does it make sense to extend Data Analytics into the decision-making process? Where does a data professional's responsibilities end and which are those boundaries? Or does it refer to the actions that need to be performed by data professionals? 

The natural step after communicating insight is for the management to take action and provide feedback. Furthermore, the decisions taken have impact on the artifacts built and a reevaluation of the business problem, assumptions and further components is needed. The many steps of analytics projects are iterative, some iterations affecting the Data Strategy as well. The diagram shows the process as linear, which is not the case.

For sure there's an interface between Data Analytics and Decision-Making and the processes associated with them, however there should be clear boundaries. E.g., it's a data professional's responsibility to make sure that the data/information is actionable and eventually advise upon it, though whether the entitled people act on it is a management topic. Not acting upon an information is also a decision. Overstepping boundaries can put the data professional into a strange situation in which he becomes responsible and eventually accountable for an action not taken, which is utopic.

The final question - is the last mile representative for the analytical process? The challenge is not the analysis and communication of data but of making sure that the feedback processes work and the changes are addressed correspondingly, that value is created continuously from the data analytics infrastructure, that data-related risks and opportunities are addressed as soon they are recognized. 

As any model, a diagram doesn't need to be correct to be useful and might not be even wrong in the right context and argumentation. A data analytics CMM might allow better estimates and comparison between organizations, though it can easily become more complex to use. Between the two models lies probably a better solution for modeling the data analytics process.

Resources:
[1] Brent Dykes (2022) "Data Analytics Marathon: Why Your Organization Must Focus On The Finish", Forbes (link)
[2] Brent Dykes (2019) Effective Data Storytelling: How to Drive Change with Data, Narrative and Visuals (link)

22 October 2022

Data Analytics: Data Lakes/Lakehouses (Just the Quotes)

"If you think of a Data Mart as a store of bottled water, cleansed and packaged and structured for easy consumption, the Data Lake is a large body of water in a more natural state. [...] The contents of the Data Lake stream in from a source to fill the lake, and various users of the lake can come to examine, dive in, or take samples." (James Dixon, "Pentaho, Hadoop, and Data Lakes", 2010) [sorce] [first known usage]

"A data lake represents an environment that collects and stores large volumes of structured and unstructured datasets, typically in their original, unaltered forms. More than a data depository, the data lake architecture enables the various users and data science teams to conduct data exploration and related analytical activities." (EMC Education Services, "Data Science & Big Data Analytics", 2015)

"A data lake strategy supports the introduction of a separate analytics environment that off-loads the analytics being done today on your overly expensive data warehouse. This separate analytics environment provides the data science team an on-demand, fail-fast environment for quickly ingesting and analyzing a wide variety of data sources in an attempt to address immediate business opportunities independent of the data warehouse's production schedule and service level agreement (SLA) rules." (Billl Schmarzo, "Driving Business Strategies with Data Science: Big Data MBA" 1st Ed., 2015)

"At its core, it is a data storage and processing repository in which all of the data in an organization can be placed so that every internal and external systems', partners', and collaborators' data flows into it and insights spring out. [...] Data Lake is a huge repository that holds every kind of data in its raw format until it is needed by anyone in the organization to analyze." (Beulah S Purra & Pradeep Pasupuleti, "Data Lake Development with Big Data", 2015) 

"Having multiple data lakes replicates the same problems that were created with multiple data warehouses - disparate data siloes and data fiefdoms that don't facilitate sharing of the corporate data assets across the organization. Organizations need to have a single data lake from which they can source the data for their BI/data warehousing and analytic needs. The data lake may never become the 'single version of the truth' for the organization, but then again, neither will the data warehouse. Instead, the data lake becomes the 'single or central repository for all the organization's data' from which all the organization's reporting and analytic needs are sourced." (Billl Schmarzo, "Driving Business Strategies with Data Science: Big Data MBA" 1st Ed., 2015)

"[...] the real power of the data lake is to enable advanced analytics or data science on the detailed and complete history of data in an attempt to uncover new variables and metrics that are better predictors of business performance." (Billl Schmarzo, "Driving Business Strategies with Data Science: Big Data MBA" 1st Ed., 2015)

"The data lake is not an incremental enhancement to the data warehouse, and it is NOT data warehouse 2.0. The data lake enables entirely new capabilities that allow your organization to address data and analytic challenges that the data warehouse could not address." (Billl Schmarzo, "Driving Business Strategies with Data Science: Big Data MBA" 1st Ed., 2015)

"Unfortunately, some organizations are replicating the bad data warehouse practice by creating special-purpose data lakes - data lakes to address a specific business need. Resist that urge! Instead, source the data that is needed for that specific business need into an 'analytic sandbox' where the data scientists and the business users can collaborate to find those data variables and analytic models that are better predictors of the business performance. Within the 'analytic sandbox', the organization can bring together (ingest and integrate) the data that it wants to test, build the analytic models, test the model's goodness of fit, acquire new data, refine the analytic models, and retest the goodness of fit." (Billl Schmarzo, "Driving Business Strategies with Data Science: Big Data MBA" 1st Ed., 2015)

"A data lake is a storage repository that holds a very large amount of data, often from diverse sources, in native format until needed. In some respects, a data lake can be compared to a staging area of a data warehouse, but there are key differences. Just like a staging area, a data lake is a conglomeration point for raw data from diverse sources. However, a staging area only stores new data needed for addition to the data warehouse and is a transient data store. In contrast, a data lake typically stores all possible data that might be needed for an undefined amount of analysis and reporting, allowing analysts to explore new data relationships. In addition, a data lake is usually built on commodity hardware and software such as Hadoop, whereas traditional staging areas typically reside in structured databases that require specialized servers." (Mike Fleckenstein & Lorraine Fellows, "Modern Data Strategy", 2018)

"A data warehouse follows a pre-built static structure to model source data. Any changes at the structural and configuration level must go through a stringent business review process and impact analysis. Data lakes are very agile. Consumption or analytical layer can be modified to fit in the model requirements. Consumers of a data lake are not constant; therefore, schema and modeling lies at the liberty of analysts and scientists." (Saurabh Gupta et al, "Practical Enterprise Data Lake Insights", 2018)

"Data in the data lake should never get disposed. Data driven strategy must define steps to version the data and handle deletes and updates from the source systems." (Saurabh Gupta et al, "Practical Enterprise Data Lake Insights", 2018)

"Data governance policies must not enforce constraints on data - Data governance intends to control the level of democracy within the data lake. Its sole purpose of existence is to maintain the quality level through audits, compliance, and timely checks. Data flow, either by its size or quality, must not be constrained through governance norms. [...] Effective data governance elevates confidence in data lake quality and stability, which is a critical factor to data lake success story. Data compliance, data sharing, risk and privacy evaluation, access management, and data security are all factors that impact regulation." (Saurabh Gupta et al, "Practical Enterprise Data Lake Insights", 2018)

"Data Lake induces accessibility and catalyzes availability. It warrants data discovery platforms to soak the data trends at a horizontal scale and produce visual insights. It largely cuts down the time that goes into data preparation and exhaustive data analysis." (Saurabh Gupta et al, "Practical Enterprise Data Lake Insights", 2018)

"Data Lake is a single window snapshot of all enterprise data in its raw format, be it structured, semi-structured, or unstructured. Starting from curating the data ingestion pipeline to the transformation layer for analytical consumption, every aspect of data gets addressed in a data lake ecosystem. It is supposed to hold enormous volumes of data of varied structures." (Saurabh Gupta et al, "Practical Enterprise Data Lake Insights", 2018)

"Data swamp, on the other hand, presents the devil side of a lake. A data lake in a state of anarchy is nothing but turns into a data swamp. It lacks stable data governance practices, lacks metadata management, and plays weak on ingestion framework. Uncontrolled and untracked access to source data may produce duplicate copies of data and impose pressure on storage systems." (Saurabh Gupta et al, "Practical Enterprise Data Lake Insights", 2018)

"Data warehousing, as we are aware, is the traditional approach of consolidating data from multiple source systems and combining into one store that would serve as the source for analytical and business intelligence reporting. The concept of data warehousing resolved the problems of data heterogeneity and low-level integration. In terms of objectives, a data lake is no different from a data warehouse. Both are primary advocates of terms like 'single source of truth' and 'central data repository'." (Saurabh Gupta et al, "Practical Enterprise Data Lake Insights", 2018)

"A data lakehouse is an amalgamation of the best components from both data lakes and data warehouses. A data lakehouse implements data structure and data management features from data warehouses into a cost-effective storage like a data lake. It tries to combine the best from both worlds - data lake - based Big Data analytics and a data warehouse." (Bhadresh Shiyal, "Beginning Azure Synapse Analytics: Transition from Data Warehouse to Data Lakehouse", 2021) 

"A defining characteristic of the data lakehouse architecture is allowing direct access to data as files while retaining the valuable properties of a data warehouse. Just do both!" (Bill Inmon et al, "Building the Data Lakehouse", 2021)

"At first, we threw all of this data into a pit called the 'data lake'. But we soon discovered that merely throwing data into a pit was a pointless exercise. To be useful - to be analyzed - data needed to (1) be related to each other and (2) have its analytical infrastructure carefully arranged and made available to the end user. Unless we meet these two conditions, the data lake turns into a swamp, and swamps start to smell after a while. [...] In a data swamp, data just sits there are no one uses it. In the data swamp, data just rots over time." (Bill Inmon et al, "Building the Data Lakehouse", 2021)

"Data lake architecture suffers from complexity and deterioration. It creates complex and unwieldy pipelines of batch or streaming jobs operated by a central team of hyper-specialized data engineers. It deteriorates over time. Its unmanaged datasets, which are often untrusted and inaccessible, provide little value. The data lineage and dependencies are obscured and hard to track." (Zhamak Dehghani, "Data Mesh: Delivering Data-Driven Value at Scale", 2021)

"Once you combine the data lake along with analytical infrastructure, the entire infrastructure can be called a data lakehouse. [...] The data lake without the analytical infrastructure simply becomes a data swamp. And a data swamp does no one any good." (Bill Inmon et al, "Building the Data Lakehouse", 2021)

"The data lakehouse architecture presents an opportunity comparable to the one seen during the early years of the data warehouse market. The unique ability of the lakehouse to manage data in an open environment, blend all varieties of data from all parts of the enterprise, and combine the data science focus of the data lake with the end user analytics of the data warehouse will unlock incredible value for organizations. [...] The lakehouse architecture equally makes it natural to manage and apply models where the data lives." (Bill Inmon et al, "Building the Data Lakehouse", 2021)

"With the data lakehouse, it is possible to achieve a level of analytics and machine learning that is not feasible or possible any other way. But like all architectural structures, the data lakehouse requires an understanding of architecture and an ability to plan and create a blueprint." (Bill Inmon et al, "Building the Data Lakehouse", 2021)

"Delta Lake is a transactional storage software layer that runs on top of an existing data lake and adds RDW-like features that improve the lake’s reliability, security, and performance. Delta Lake itself is not storage. In most cases, it’s easy to turn a data lake into a Delta Lake; all you need to do is specify, when you are storing data to your data lake, that you want to save it in Delta Lake format (as opposed to other formats, like CSV or JSON)." (James Serra, "Deciphering Data Architectures", 2024)

Related Posts Plugin for WordPress, Blogger...

About Me

My photo
IT Professional with more than 24 years experience in IT in the area of full life-cycle of Web/Desktop/Database Applications Development, Software Engineering, Consultancy, Data Management, Data Quality, Data Migrations, Reporting, ERP implementations & support, Team/Project/IT Management, etc.