
06 May 2024

🧭🏭Business Intelligence: Microsoft Fabric (Part III: The Metrics Layer [new feature])

Introduction

One of the announcements at this year's first Microsoft Fabric Community Conference was the introduction of a metrics layer in Fabric which "allows organizations to create standardized business metrics, that are rooted in measures and are discoverable and intended for reuse" [1]. The information provided at the conference was kept to a minimum given that the feature is still in private preview, though several webcasts have started to catch up on the topic (see [2], [4]). Moreover, as part of their show, the Explicit Measures (@PowerBITips) hosts had Carly Newsome, the manager of the project, as a guest, who unveiled more details about the project and the feature, details which became the main source for the information below. 

The idea of a metrics layer or metrics store is not new; data professionals occasionally refer to their structure(s) of metrics as such. The terms gained weight in their modern conception relatively recently, in 2021-2022 (see [5], [6], [7], [8], [10]). Within the modern data stack, a metrics layer or metrics store is an abstraction layer that sits between the data store(s) and the end users. It allows organizations to centrally define, store, and manage business metrics, and thus to standardize and enforce a single source of truth (SSoT), solving several issues existing in today's data stacks. As Benn Stancil remarked earlier, the metrics layer is one of the missing pieces of the modern data stack (see [10]).

Microsoft's Solution

Microsoft's business case for the metrics layer's implementation is based on three main ideas: (1) duplicate measures contribute to poor data quality, (2) complex data models hinder self-service, (3) data silos in Power BI need to be reduced. In Microsoft's conception the metrics layer provides several benefits: consistent definitions and descriptions, easy management via management views, searchable and discoverable metrics, and trust assured through indicators. 

For this feature's implementation Microsoft introduces a new Fabric item called a metric set, which allows grouping several (business) metrics together as part of a mini-model that can be tailored to the needs of a subset of end users and accessed by them via the standard tools already available. The metric set thus becomes a mini-model. Such mini-models allow breaking down and reducing the overall complexity of semantic models, while being easy to evolve and consume. The challenge will then be how to break down existing and future semantic models into nonoverlapping mini-models, creating in extremis a partition (see the Lego metaphor for data products). The idea of mini-models is not new, [12] advocating the use of a Master Model, a technique for creating derivative tabular models based on a single tabular solution.

A (business) metric is a way to elevate the measures from the various semantic models existing in the organization into the mini-model defined by the metric set. A metric can be reused in other Fabric artifacts - currently in new reports on the Power BI service, and in notebooks by copying the code. Reusing metrics in other measures means that one can chain metrics, with the changes made being propagated further downstream. 

The Metrics Layer in Microsoft Fabric (adapted diagram)

Every metric is tied to the original semantic model, which allows tracking how a metric is used across solutions and, looking forward to Purview, identifying the data's lineage. A measure is related to a "table", the source from which the measure came.

Users' Perspective

The Metrics Layer feature is available in the Microsoft Fabric service for Power BI under the Metrics menu element, next to Scorecards. One starts by creating a metric set in an existing workspace, an operation which creates the actual artifact, to which the individual metrics are then added. To create a metric, a user with build permissions can navigate through the semantic models across the different workspaces he/she has access to, pick a measure from one of them and elevate it to a metric, copying in the process the measure's definition and description. In this way the metric will always point back to the measure from the semantic model, while the metrics thus created are considered a related collection and can be shared around accordingly. 

Once a metric is added to the metric set, one can, in edit mode, add dimensions to it (e.g. Date, Category, Product Id, etc.). One can then further explore a metric's output and add filters (e.g. concentrate on only one product or category), a point from which one can slice-and-dice the data as needed.

There is a panel where one can see where the metric has been used (e.g. in reports, scorecards, and other integrations), when it was last refreshed, respectively how many times it was used. Thus, one has the most important information in one place, which is great for developers as well as for users. Probably, other metadata will be added, such as whether an increase in the metric would be favorable or unfavorable (like in Tableau Pulse, see [13]), levels of criticality, a unit of measure, or maybe its type - simple metric, performance indicator (PI), result indicator (RI), KPI, KRI, etc.

Metrics can be persisted to OneLake by saving their output to a delta table in the lakehouse. As demonstrated in the presentation(s), with just a copy-paste and a small piece of code one can materialize the data into a lakehouse delta table, from where the data can be reused as needed. Hopefully, the process will be further automated. 
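The exact code generated by the feature wasn't shown in detail; as a rough sketch of what such a materialization could look like in a Fabric notebook - assuming the metric's underlying measure is evaluated via semantic link and written with the notebook's built-in Spark session, and with dataset, measure, column and table names made up for illustration:

```python
# Hypothetical sketch: evaluate the measure behind a metric via semantic link
# and persist the result as a delta table in the attached lakehouse.
import sempy.fabric as fabric

# Evaluate the measure grouped by the dimensions of interest
result = fabric.evaluate_measure(
    dataset="Sales Semantic Model",                      # source semantic model (made up)
    measure="Total Sales",                               # measure behind the metric (made up)
    groupby_columns=["Date[Year]", "Product[Category]"]
)

# Materialize the output into the lakehouse as a delta table
spark.createDataFrame(result) \
    .write.format("delta") \
    .mode("overwrite") \
    .saveAsTable("metric_total_sales")
```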

One can consume metrics and metric sets also in Power BI Desktop, where a new menu element called Metric sets was added under the OneLake data hub, which can be used to connect to a metric set from a semantic model and select the metrics needed for the project. 

Tapping into the available Power BI solutions is done via an integration feature based on the sempy fabric package, part of the Python-based semantic link in Fabric, which provides a dataframe for the storage and propagation of Power BI metadata [11].
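For orientation, a minimal sketch of how that metadata can already be browsed with sempy in a Fabric notebook - the workspace and semantic model names are placeholders, and the metric set integration itself may expose different calls once released:

```python
# Minimal sketch: browsing Power BI metadata with semantic link (sempy).
import sempy.fabric as fabric

# Semantic models (datasets) available in a workspace the user has access to
datasets = fabric.list_datasets(workspace="Finance Workspace")   # placeholder name
print(datasets.head())

# Measures defined in one of those semantic models - the candidates
# for being elevated to metrics
measures = fabric.list_measures(dataset="Sales Semantic Model")  # placeholder name
print(measures.head())
```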

Further Thoughts

When dealing with a new feature, a natural question comes to mind: what challenges does the feature involve, respectively how can it be misused? Given that the metrics layer can be built within a workspace and that it can tap into the existing measures, one can build on the existing infrastructure. However, this can imply restructuring, refactoring, moving, and testing a lot of code in the process, hopefully with minimal implications for the solutions already available. Whether the process is as simple as imagined is another story. As for misuse, in extremis data professionals might start building everything as metrics, though the danger might come when the data is persisted unnecessarily. 

From a data mesh's perspective, a metric set is associated with a domain, though there will be metrics and data common to multiple domains. Moreover, a mini-model has the potential of becoming a data product. Distributing the logic across multiple workspaces and domains can add further challenges, especially in what concerns the synchronization and implementation of requirements in a way that doesn't lead to bottlenecks. But this is a general challenge for the development team(s). 

The feature will probably undergo further changes until it is released in public preview (probably by September or the end of the year). I subscribe to other data professionals' opinion that the feature has long been needed and that it can have an important impact on the solutions built. 


Resources:
[1] Microsoft Fabric Blog (2024) Announcements from the Microsoft Fabric Community Conference (link)
[2] Power BI Tips (2024) Explicit Measures Ep. 236: Metrics Hub, Hot New Feature with Carly Newsome (link)
[3] Power BI Tips (2024) Introducing Fabric Metrics Layer / Power Metrics Hub [with Carly Newsome] (link)
[4] KratosBI (2024) Fabric Fridays: Metrics Layer Conspiracy Theories #40 (link)
[5] Chris Webb's BI Blog (2022) Is Power BI A Semantic Layer? (link)
[6] The Data Stack Show (2022) TDSS 95: How the Metrics Layer Bridges the Gap Between Data & Business with Nick Handel of Transform (link)
[7] Sundeep Teki (2022) The Metric Layer & how it fits into the Modern Data Stack (link)
[8] Nick Handel (2021) A brief history of the metrics store (link)
[9] Aurimas (2022) The Jungle of Metrics Layers and its Invisible Elephant (link)
[10] Benn Stancil (2021) The missing piece of the modern data stack (link)
[11] Microsoft Learn (2024) Sempy fabric Package (link)
[12] Michael Kovalsky (2019) Master Model: Creating Derivative Tabular Models (link)
[13] Christina Obry (2023) The Power of a Metrics Layer - and How Your Organization Can Benefit From It (link)
[14] KratosBI (2024) Introducing the Metrics Layer in #MicrosoftFabric with Carly Newsome (link)

17 March 2024

🧭Business Intelligence: Data Products (Part II: The Complexity Challenge)


Creating data products within a data mesh comes down to "partitioning" a given set of inputs, outputs and transformations to create something that looks like a Lego structure, in which each Lego piece represents a data product. The word partition is used improperly here, as there can be overlaps in terms of inputs, outputs and transformations, though in an ideal solution the outcome should be close to a partition.

Even if the complexity of inputs and outputs can be neglected, no matter how big their number, the same cannot be said about the transformations that must be performed in the process. Moreover, the transformations involve reengineering the logic built in the source systems, which is not a trivial task and must involve adequate testing. The transformations are a must and there's no way to avoid them. 

When designing a data warehouse or data mart, one of the goals is to keep the redundancy of the transformations and of the intermediary results to a minimum, to minimize the unnecessary duplication of code and data. Code duplication usually becomes an issue when the logic needs to be changed, and in business contexts that can happen often enough to create other challenges. Data duplication becomes an issue when copies are not in sync, a consequence of unsynchronized code or of different refresh rates.

Building the transformations as SQL-based database objects has its advantages. There have been many attempts at providing non-SQL operators for the same purpose (in SSIS, Power Query), though the solutions built on them are difficult to troubleshoot and maintain, the overall complexity increasing with the volume of transformations that must be performed. In data meshes, the complexity increases also with the number of data products involved, especially when there are multiple stakeholders and different goals involved (see the challenges of developing data marts supposed to be domain-specific). 

To growing complexity organizations answer with complexity. On one side are the teams of developers, business users and other members of the governance teams, who together with the solution create an ecosystem. On the other side are the inherent coordination and organization meetings, the managing of proposals, the negotiation of scope for data products, their design, testing, etc. The more complex the whole ecosystem becomes, the higher the chances for systemic errors to occur and multiply, respectively to create unwanted behavior of the parties involved. Ecosystems are challenging to monitor and manage. 

The more complex the architecture, the higher the chances for failure. Even if some organizations might succeed, it doesn't mean that such an endeavor is for everybody - a certain maturity in building data architectures and data-based artefacts, and in managing projects, must exist in the organization. Many organizations fail in addressing basic analytical requirements; why would one think that they are capable of handling an increased complexity? Even if one breaks down the complexity of a data warehouse into more manageable units, the complexity is just moved to other levels that are more difficult to manage as an ensemble. 

Being able to audit and test each data product individually has its advantages, though when a data product becomes part of an aggregate it can easily get lost in the bigger picture. Thus, a global observability framework is needed that allows monitoring the performance and health of each data product in aggregate. Besides that, event brokers and other mechanisms are needed to handle failure, availability, security, etc. 

Data products make sense in certain scenarios, especially when the complexity of architectures is manageable, though attempting to redesign everything from their perspective is like having a hammer in one's hand and treating everything like a nail.


🧭Business Intelligence: Data Products (Part I: A Lego Exercise)


One can define a data product as the smallest unit of data-driven architecture that can be independently deployed and managed (aka product quantum) [1]. In other terms, one can think of a data product like a box (or Lego piece) which takes data as inputs, performs several transformations on it and produces several data outputs (or even data visualizations, or a hybrid between data, visualizations and other content). 

At a high level, each Data Analytics solution can be regarded as a set of inputs, a set of outputs and the transformations that must be performed on the inputs to generate the outputs. The inputs are the data from the operational systems, while the outputs are analytical data that can be anything from data to KPIs and other metrics. A data mart, data warehouse, lakehouse or data mesh can be abstracted in this way, though at different scales. 

For creating data products within a data mesh, given a set of inputs, outputs and transformations, the challenge is to find horizontal and vertical partitions within these areas to create something that looks like a Lego structure, in which each Lego piece represents a data product, while its color represents its membership to a business domain. Each such piece is self-contained and contains a set of transformations, respectively intermediary inputs and outputs. Multiple such pieces can be combined in a linear or hierarchical fashion to transform the initial inputs into the final outputs. 

Data Products with a Data Mesh
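To make the metaphor more tangible, below is a purely illustrative sketch (not tied to any particular product) that models a data product as a self-contained piece with declared inputs, outputs and a transformation, and chains two such pieces linearly:

```python
# Illustrative sketch only: a data product as a self-contained "Lego piece"
# with declared inputs, outputs and a transformation, composable into chains.
from dataclasses import dataclass
from typing import Callable, Dict, List

import pandas as pd

@dataclass
class DataProduct:
    name: str
    domain: str                      # the "color" of the Lego piece
    inputs: List[str]                # names of the upstream datasets consumed
    outputs: List[str]               # names of the datasets it publishes
    transform: Callable[[Dict[str, pd.DataFrame]], Dict[str, pd.DataFrame]]

    def run(self, available: Dict[str, pd.DataFrame]) -> Dict[str, pd.DataFrame]:
        # pick only the declared inputs and apply the encapsulated transformation
        return self.transform({name: available[name] for name in self.inputs})

# Two pieces chained linearly: raw orders -> cleansed orders -> revenue by product
cleansed_orders = DataProduct(
    name="cleansed_orders", domain="sales",
    inputs=["raw_orders"], outputs=["orders"],
    transform=lambda d: {"orders": d["raw_orders"].dropna(subset=["amount"])},
)
revenue_by_product = DataProduct(
    name="revenue_by_product", domain="finance",
    inputs=["orders"], outputs=["revenue"],
    transform=lambda d: {"revenue": d["orders"].groupby("product", as_index=False)["amount"].sum()},
)

raw = {"raw_orders": pd.DataFrame({"product": ["A", "A", "B"], "amount": [10.0, None, 5.0]})}
final = revenue_by_product.run(cleansed_orders.run(raw))
print(final["revenue"])   # revenue aggregated per product, per domain-owned logic
```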

Finding such a partition is possible, though it involves a considerable effort, especially in designing the whole thing - identifying each Lego piece uniquely. When each department is on its own and develops its own Lego pieces, there's no guarantee that the pieces from the various domains will fit together to build something cohesive, performant, secure or well-structured. It's like building a house from modules - the pieces must fit together. That would be the role of governance (federated computational governance) - to align and coordinate the effort. 

Conversely, there are transformations that need to be replicated to obtain autonomous data products, and the volume of such overlaps can be considerably high. Consider for example the logic available in reports and how often it needs to be replicated. Alternatively, one can create intermediary data products, when that's feasible. 

It's challenging to define the inputs and outputs for a single Lego piece. Now imagine doing the same for a whole set of such pieces depending on each other! This might work for small pieces of data and entities quite stable over their lifetime (e.g. playlists, artists, songs), but with complex information systems the effort can increase by a few factors. Moreover, the complexity of the structure increases as soon as the Lego pieces expand beyond their initial design. It's as if the real Lego pieces would grow within the available space but still keep the initial structure - strange constructs may result, which even if they work, shift the gravity center of the edifice in other directions. There will thus be limits to growth that can easily lead to duplication of functionality to overcome such challenges.

Each new output or change in the initial input for these magic boxes involves a change in all the intermediary Lego pieces from input to output. Just recall the last experience of defining the inputs and the outputs for an important complex report - how many iterations and how much effort was involved. That might have been an extreme case, though how realistic is the assumption that with data products everything will go smoother? No matter the effort involved in design, there will always be changes and further iterations involved.


References:
[1] Zhamak Dehghani (2021) Data Mesh: Delivering Data-Driven Value at Scale (book review)

15 March 2024

🧊🗒️Data Warehousing: Data Mesh [Notes]

Disclaimer: This is work in progress intended to consolidate information from various sources. 

Last updated: 17-Mar-2024

Data Products with a Data Mesh

Data Mesh
  • {definition} "a sociotechnical approach to share, access and manage analytical data in complex and large-scale environments - within or across organizations" [1]
    • ⇐ there is no default standard or reference implementation of data mesh and its components [2]
  • {definition} a type of decentralized data architecture that organizes data based on different business domains [2]
    • ⇐ no centralized data architecture coexists with data mesh, unless in transition [1]
    • distributes the modeling of analytical data, the data itself and its ownership [1]
  • {characteristic} partitions data around business domains and gives data ownership to the domains [1]
    • each domain can model their data according to their context [1]
    • there can be multiple models of the same concept in different domains [1]
    • gives the data sharing responsibility to those who are most intimately familiar with the data [1]
    • endorses multiple models of the data
      • data can be read from one domain, transformed and stored by another domain [1]
  • {characteristic} evolutionary execution process
  • {characteristic} agnostic of the underlying technology and infrastructure [1]
  • {aim} respond gracefully to change [1]
  • {aim} sustain agility in the face of growth [1]
  • {aim} increase the ratio of value from data to investment [1]
  • {principle} data as a product
    • {goal} business domains become accountable to share their data as a product to data users
    • {goal} introduce a new unit of logical architecture that controls and encapsulates all the structural components needed to share data as a product autonomously [1]
    • {goal} adhere to a set of acceptance criteria that assure the usability, quality, understandability, accessibility and interoperability of data products*
    • usability characteristics
  • {principle} domain-oriented ownership
    • {goal} decentralize the ownership of sharing analytical data to business domains that are closest to the data [1]
    • {goal} decompose logically the data artefacts based on the business domain they represent and manage their life cycle independently [1]
    • {goal} align business, technology and analytical data [1]
  • {principle} self-serve data platform
    • {goal} provide a self-serve data platform to empower domain-oriented teams to manage and govern the end-to-end life cycle of their data products* [1]
    • {goal} streamline the experience of data consumers to discover, access, and use the data products [1]
  • {principle} federated computational governance
    • {goal} implement a federated decision making and accountability structure that balances the autonomy and agility of domains, while respecting the global conformance, interoperability and security of the mesh* [1]
    • {goal} codifying and automated execution of policies at a fine-grained level [1]
    • ⇐ the principles represent a generalization and adaptation of practices that address the scale of organization digitization* [1]
  • {concept} decentralization of data products
    • {requirement} ability to compose data across different modes of access and topologies [1]
      • data needs to be agnostic to the syntax of data, underlying storage type, and mode of access to it [1]
        • many of the existing composability techniques that assume homogeneous data won’t work
          • e.g.  defining primary and foreign key relationships between tables of a single schema [1]
    • {requirement} ability to discover and learn what is relatable and decentral [1]
    • {requirement} ability to seamlessly link relatable data [1]
    • {requirement} ability to relate data temporally [1]
  • {concept} data product 
    • the smallest unit of data-based architecture that can be independently deployed and managed (aka product quantum) [1]
    • provides a set of explicitly defined data sharing contracts
    • provides a truthful portion of the reality for a particular domain (aka single slice of truth) [1]
    • constructed in alignment with the source domain [3]
    • {characteristic} autonomous
      • its life cycle and model are managed independently of other data products [1]
    • {characteristic} discoverable
      • via a centralized registry or catalog that lists the available datasets with some additional information about each dataset, the owners, the location, sample data, etc. [1]
    • {characteristic} addressable
      • via a permanent and unique address to the data user to programmatically or manually access it [1] 
    • {characteristic} understandable
      • involves getting to know the semantics of its underlying data and the syntax in which the data is encoded [1]
      • describes which entities it encapsulates, the relationships between them, and their adjacent data products [1]
    • {characteristic} trustworthy and truthful
      • represents the fact of the business correctly [1]
      • provides data provenance and data lineage [1]
    • {characteristic} natively accessible
      • make it possible for various data users to access and read its data in their native mode of access [1]
      • meant to be broadcast and shared widely [3]
    • {characteristic} interoperable and composable
      • follows a set of standards and harmonization rules that allow linking data across domains easily [1]
    • {characteristic} valuable on its own
      • must have some inherent value for the data users [1]
    • {characteristic} secure
      • the access control is validated by the data product, right in the flow of data, access, read, or write [1] 
        • ⇐ the access control policies can change dynamically
    • {characteristic} multimodal 
      • there is no definitive 'right way' to create a data product, nor is there a single expected form, format, or mode that it is expected to take [3] 
    • shares its logs, traces, and metrics while consuming, transforming, and sharing data [1]
    • {concept} data quantum (aka product data quantum, architectural quantum) 
      • unit of logical architecture that controls and encapsulates all the structural components needed to share a data product [1]
        • {component} data
        • {component} metadata
        • {component} code
        • {component} policies
        • {component} dependencies' listing
    • {concept} data product observability
      • monitor the operational health of the mesh
      • debug and perform postmortem analysis
      • perform audits
      • understand data lineage
    • {concept} logs 
      • immutable, timestamped, and often structured events that are produced as a result of processing and the execution of a particular task [1]
      • used for debugging and root cause analysis
    • {concept} traces
      • records of causally related distributed events [1]
    • {concept} metrics
      • objectively quantifiable parameters that continue to communicate build-time and runtime characteristics of data products [1]
  • artefacts 
    • e.g. data, code, metadata, policies

References:
[1] Zhamak Dehghani (2021) Data Mesh: Delivering Data-Driven Value at Scale (book review)
[2] Zhamak Dehghani (2019) How to Move Beyond a Monolithic Data Lake to a Distributed Data Mesh (link)
[3] Adam Bellemare (2023) Building an Event-Driven Data Mesh: Patterns for Designing and Building Event-Driven Architectures

13 June 2020

🧭☯Business Intelligence: Self-Service BI (The Good, the Bad and the Ugly)


Self-Service BI (SSBI) is a form of Business Intelligence (BI) in which the users are enabled and empowered to explore and analyze the data, respectively build reports and visualizations on their own, with minimal IT support. 

The Good: Modern SSBI tools like Power BI, Tableau or Qlik Sense provide easy-to-use and rich functionality for data preparation, exploration, discovery, integration, modelling, visualization, and analysis. Moreover, they integrate the advances made in graphics, data storage and processing (e.g. in-memory processing, parallel processing), which allow addressing most data requirements. With just a few drag-and-drops users can display details, aggregate data, and identify trends and correlations in the data. Slice-and-dice or drill-through features allow navigating the data across dimensions and different levels of detail. In addition, the tools can leverage the existing data models available in data warehouses, data marts and other types of data repositories, including the rich set of open data available on the web.

With the right infrastructure, knowledge and skills, users can better understand and harness the business data, using it to address business questions; they can make faster and smarter decisions rooted in data. SSBI offers the potential of increasing the value data has for the organization, while improving the time to value for data products (data models, reports, visualizations). 

The Bad: In the 90s, products like MS Excel or Access allowed users to build personal solutions to address gaps existing in processes and reporting. In some cases, the personal solutions gained in importance, starting to be used by more users to the degree that they became essential for the business. Thus, these islands of data and knowledge started to become a nightmare for the IT department, which was supposed to maintain and back them up. In addition, issues like data security, inefficient data processing, duplication of data and effort, and different versions of the truth urged the business to consolidate such solutions into standardized ones. 

Without an adequate strategy and a certain control over the outcomes of SSBI initiatives, organizations risk reaching the same deplorable state, with SSBI initiatives having the potential to bring more damage than the issues they can solve. Insufficient data quality and integration, unrealistic expectations, communication problems between business and IT, as well as insufficient training and support have the potential of making SSBI’s adoption more difficult.

The investment in adequate SSBI tool(s) might be small compared with the further changes that need to be done within the technical and logistical BI infrastructure. In addition, even if the role of IT is minimized, it doesn’t mean that IT needs to be left out of the picture. IT is still the owner of the IT infrastructure, and it still needs to oversee the self-service processes and the flow of data, information and knowledge within the organization. From infrastructure to skillset, there are aspects of SSBI that need to be addressed accordingly. The BI professional can’t be replaced entirely, though the scope of their work may shift to address new types of challenges.

Not understanding that SSBI initiatives are iterative and explorative in nature and require time to bring value can put unnecessary pressure on those taking part in them. Renouncing SSBI initiatives without attempting to address the issues and steer them in the right direction hinders an organization's and its employees’ potential to grow, with all the implications deriving from it.

The Ugly: Despite the benefits SSBI can bring, its adoption within organizations remains low. Whether it’s the business’ lack of confidence in its own forces, or the inherent technical or logistical challenges, SSBI follows the BI trend of being a promise that seldom reaches its potential.

04 October 2018

🔭Data Science: Data Products (Just the Quotes)

"Data scientists combine entrepreneurship with patience, the willingness to build data products incrementally, the ability to explore, and the ability to iterate over a solution. They are inherently interdisciplinary. They can tackle all aspects of a problem, from initial data collection and data conditioning to drawing conclusions. They can think outside the box to come up with new ways to view the problem, or to work with very broadly defined problems: 'there’s a lot of data, what can you make from it?'" (Mike Loukides, "What Is Data Science?", 2011)

"Discovery is the key to building great data products, as opposed to products that are merely good." (Mike Loukides, "The Evolution of Data Products", 2011)

"New interfaces for data products are all about hiding the data itself, and getting to what the user wants." (Mike Loukides, "The Evolution of Data Products", 2011)

"[...] a good definition of a data product is a product that facilitates an end goal through the use of data. It’s tempting to think of a data product purely as a data problem. After all, there’s nothing more fun than throwing a lot of technical expertise and fancy algorithmic work at a difficult problem." (Dhanurjay Patil, "Data Jujitsu: The Art of Turning Data into Product", 2012)

"As data scientists, we prefer to interact with the raw data. We know how to import it, transform it, mash it up with other data sources, and visualize it. Most of your customers can’t do that. One of the biggest challenges of developing a data product is figuring out how to give data back to the user. Giving back too much data in a way that’s overwhelming and paralyzing is 'data vomit'. It’s natural to build the product that you would want, but it’s very easy to overestimate the abilities of your users. The product you want may not be the product they want." (Dhanurjay Patil, "Data Jujitsu: The Art of Turning Data into Product", 2012)

"Generalizing beyond advertising, when building any data product in which the data is obfuscated (where there isn’t a clear relationship between the user and the result), you can compromise on precision, but not on recall. But when the data is exposed, focus on high precision." (Dhanurjay Patil, "Data Jujitsu: The Art of Turning Data into Product", 2012)

"Ideas for data products tend to start simple and become complex; if they start complex, they become impossible." (Dhanurjay Patil, "Data Jujitsu: The Art of Turning Data into Product", 2012)

"In an emergency, a data product that just produces more data is of little use. Data scientists now have the predictive tools to build products that increase the common good, but they need to be aware that building the models is not enough if they do not also produce optimized, implementable outcomes." (Jeremy Howard et al, "Designing Great Data Products", 2012)

"The best way to avoid data vomit is to focus on actionability of data. That is, what action do you want the user to take? If you want them to be impressed with the number of things that you can do with the data, then you’re likely producing data vomit. If you’re able to lead them to a clear set of actions, then you’ve built a product with a clear focus." (Dhanurjay Patil, "Data Jujitsu: The Art of Turning Data into Product", 2012)

"The key aspect of making a data product is putting the 'product' first and 'data' second. Saying it another way, data is one mechanism by which you make the product user-focused. With all products, you should ask yourself the following three questions: (1) What do you want the user to take away from this product? (2) What action do you want the user to take because of the product? (3) How should the user feel during and after using your product?" (Dhanurjay Patil, "Data Jujitsu: The Art of Turning Data into Product", 2012)

"You can give your data product a better chance of success by carefully setting the users’ expectations. [...] One under-appreciated facet of designing data products is how the user feels after using the product. Does he feel good? Empowered? Or disempowered and dejected?" (Dhanurjay Patil, "Data Jujitsu: The Art of Turning Data into Product", 2012)

"To explain a data mesh in one sentence, a data mesh is a centrally managed network of decentralized data products. The data mesh breaks the central data lake into decentralized islands of data that are owned by the teams that generate the data. The data mesh architecture proposes that data be treated like a product, with each team producing its own data/output using its own choice of tools arranged in an architecture that works for them. This team completely owns the data/output they produce and exposes it for others to consume in a way they deem fit for their data." (Aniruddha Deswandikar,"Engineering Data Mesh in Azure Cloud", 2024)

"Data product usage is growing quickly, doubling every year. Obviously, since we made the investment, we'll work with our customer to find applications." (Ken Shelton)

09 June 2018

📦Data Migrations (DM): Guiding Principles


Introduction

“An army of principles can penetrate where an army of soldiers cannot."
Thomas Paine

In life as well as in IT, principles serve as patterns of advice in the form of general or fundamental ideas, truths or values stated in a context-independent manner. They can be used as guidelines in understanding and modeling the reality, the world we live in. With the invasion of technologies in our lives, principles serve as a solid ground on which we can build castles – solutions for our problems. Each technology comes with its own set of principles that defines its usage in general terms. That's why most IT books attempt to capture these sets of principles. Unfortunately, few technical writers manage to define meaningful principles and showcase their usage.

Many of the ideas considered as principles in papers on Data Migration (DM) are at best just practices, and some can be considered best/good practices. Just because something worked well in a previous migration doesn’t automatically mean that the idea behind the respective decision turns into a principle. Some of the advice given consists just of lessons learned in disguise. Principles, through their generality, apply to a broad range of cases, while practices are more activity-specific.

A DM, through its nature, finds its characteristics at the intersection of several areas - database-based architecture design, ETL workflows, data management, project management (PM) and services. From these areas one can pull a set of principles that can be used in building DM architectures.

Architecture Principles

“Architecture starts when you carefully put two bricks together.”
Ludwig Mies van der Rohe

There are several general principles that apply to the architecture of applications, independently of the technologies used or the industry, e.g. research first, keep it simple/small, start with the end in mind, model first, design to handle failure, secure by design (aka safety first), prototype, progress iteratively, focus on value, reuse (aka don't reinvent the wheel), test early, early feedback, refactor, govern, validate, document, right tool – right people, make it to last, make it sustainable, partition around limits, scale out, defensive coding, minimal intervention, use common sense, process orientation, follow the data, abstract, anticipate obsolescence, benchmark, single-responsibility, single dispatch, separation of concerns, right perspective.

To them one can add a range of application design characteristics that can be considered as principles as well: extensibility, modularity, adaptability, reusability, repeatability, performance, revocability, auditability, subject-orientation, traceability, robustness, locality, heterogeneity, consistency, atomicity, increased cohesion, reduced coupling, monitoring, usability, etc. There are several principles that can be transported from problem solving into design - divide and conquer, prioritize, system’s approach, take inventory, and so on.

A DM’s architecture has more in common with a data warehouse's, as it relies heavily on ETL tasks and data needs to be stored for various purposes. Besides the principles of good database design, a few other principles apply: model (the domain) first, denormalize, design for performance, maintainability and security, validate continuously. From the ETL area the following principles can be considered: single point of processing, each step must have a purpose, minimize touch points, rest data for checkpoints, leverage existing knowledge, automate the steps, batch processing.

In addition, considering their data-specific character, a DM can be regarded as one or several data products, though in contrast with typical data products, DMs typically have a limited purpose. From this area the following principles could be considered: build trust with transparency, blend in, visualize the complex.

Data Management Principles

Considering that a DM’s focus is an organization's data, some principles need to focus on the management and governance of data. Data Governance, together with Data Quality, Data Architecture, Metadata Management and Master Data Management, are functions of Data Management. The focus is on data, metadata and their lifecycle, on processes, ownership, roles and their responsibilities. With this in mind, several principles can be defined that are supposed to facilitate the functions of Data Management: manage data as an asset, manage the data lifecycle, the business owns the data, integration across the organization, make data/metadata accessible, transparent and auditable processes, one source of truth.

A DM deals with customer, employee and vendor information, which falls under the General Data Protection Regulation (GDPR) EU 2016/679, the regulation that defines the legal framework for data protection and privacy for all individuals within the European Union (EU) and the European Economic Area (EEA), as well as the export of personal data outside the EU and EEA. The regulation defines a set of principles that form its backbone: fairness, lawfulness and transparency, purpose limitation, data minimization, accuracy, storage limitation, integrity and confidentiality, accountability [6].

Overseas, the US Federal Trade Commission (FTC) issued in 2012 a report recommending organizations design and implement their own privacy programs based on a set of best practices. The report reaffirms the FTC’s focus on the Fair Information Processing Principles, which include notice/awareness, choice/consent, access/participation, integrity/security, and enforcement/redress [6].


Project Management (PM) Principles

"Management is doing things right […]"
Peter Drucker

A DM, through its characteristics, is a project and, to increase the chances of success, it needs to be managed as a project. Managing a DM as a project is one of the most important principles to consider. The usage of a PM framework will further increase the chances of success, as long as the framework is adequate for the purpose and the organization's team is able to use it. PMI, Prince2 and Agile/Scrum/Kanban are probably the most used PM methodologies, and they come with their own sets of principles.

In general, all or some of the PM principles apply independently of whether a methodology is used alone or in combination with other PM methodologies: a single project manager, an informed and supportive management, a dedicated team of qualified people to do the work of the project, clearly defined goals addressing stakeholders’ priorities, an integrated plan and schedule, as well as a budget of costs and/or resources required [1].

On the other side, an agile approach could prove to be a better match for a DM given that requirements change a lot, frequent and continuous deliveries are needed, collaboration is necessary, and agile processes as well as self-organizing teams can facilitate the migration. These are just a few of the catchwords that make up the backbone of the Agile Manifesto (see [3]).

An agile form of Prince2 could be something to consider as well, especially when Prince2 is used as the methodology for other projects. For Prince2 the following principles are to be considered: continued business justification, learn from experience, defined roles and responsibilities, manage by stages, management by exception, focus on products, tailor to suit the project environment [2].

All these PM principles reveal important aspects to ponder upon, and maybe with a few exceptions, all can be incorporated in the way the DM project is managed.


Service Principles

Considering the dependencies existing between the DM and Data Quality, as well as to the broader project, a DM can have the characteristics of a service. It’s not an IT Service per se, as IT only supports the project technically and, eventually, from a PM perspective. Even if a DM is not an ITSM service, some of the ITIL principles can still apply: focus on value, design for experience, start where you are, work holistically, progress iteratively, observe directly, be transparent, collaborate and keep it simple [4].


Conclusion

“Obey the principles without being bound by them.”
Bruce Lee

Within a DM all the above principles can be considered, though the network of implications they create can easily shift the focus from the solution to the philosophical aspects, and that’s a marshy road to follow. Even if all principles are noble, not all can be considered; it would be utopian to consider every possible principle. The trick is to identify the most “important” principles (the ones that make sense) and prioritize them according to the existing requirements. In theory, this is a one-time process that involves establishing a “framework” of best/good practices for the DM, subsequent migrations needing only to consider new facts and aspects.


References:
[1] J A Bing (1994) Principles of project management, PM Network (link)
[2] Axelos (2018) What is PRINCE2? (link)
[3] Agile Manifesto (2001) Principles behind the Agile Manifesto (link)
[4] Axelos (2018) ITIL® Practitioner 9 Guiding Principles (link)
[5] The Data Governance Institute (2018) Goals and Principles for Data Governance (link) 
[6] Laura Sebastian-Coleman (2018) Navigating the Labyrinth: An Executive Guide to Data Management, DAMA International/Technics Publications (link)

29 January 2018

🔬Data Science: Data Products (Definitions)

"Broadly defined, data means events that are captured and made available for analysis. A data source is a consistent record of these events. And a data product translates this record of events into something that can easily be understood." (Richard Galentino; et al, "Data Fluency: Empowering Your Organization with Effective Data Communication", 2014)

"Self-adapting, broadly applicable economic engines that derive their value from data and generate more data by influencing human behavior or by making inferences or predictions upon new data." (Benjamin Bengfort & Jenny Kim, "Data Analytics with Hadoop", 2016)

"Data products are software applications that derive value from data and in turn generate new data." (Rebecca Bilbro et al, "Applied Text Analysis with Python", 2018)

"[...] a product that facilitates an end goal through the use of data." (Ulrika Jägare, "Data Science Strategy For Dummies", 2019)

"Any computer software that uses data as inputs, produces outputs, and provides feedback based on the output to control the environment is referred to as a data product. A data product is generally based on a model developed during data analysis, for example, a recommendation model that inputs user purchase history and recommends a related item that the user is highly likely to buy." (Suresh K Mukhiya; Usman Ahmed, Hands-On Exploratory Data Analysis with Python, 2020)

"A data product is a product or service whose value is derived from using algorithmic methods on data, and which in turn produces data to be used in the same product, or tangential data products." (Statistics.com)

"A data product, in general terms, is any tool or application that processes data and generates results. […] Data products have one primary objective: to manage, organize and make sense of the vast amount of data that organizations collect and generate. It’s the users’ job to put the insights to use that they gain from these data products, take actions and make better decisions based on these insights." (Sisense) [source]

" A data product is digital information that can be purchased." (Techtarget) [source]

"A strategy for monetizing an organization’s data by offering it as a product to other parties." (Izenda) 

"An information product that is derived from observational data through any kind of computation or processing. This includes aggregation, analysis, modelling, or visualization processes." (Fixed-Point Open Ocean Observatories) 

"Data set or data set series that conforms to a data product specification." (ISO 19131)

01 January 2018

🔬Data Science: Data Science (Definitions)

"A set of quantitative and qualitative methods that support and guide the extraction of information and knowledge from data to solve relevant problems and predict outcomes." (Xiuli He et al, "Supply Chain Analytics: Challenges and Opportunities", 2014)

"A collection of models, techniques and algorithms that focus on the issues of gathering, pre-processing, and making sense-out of large repositories of data, which are seen as 'data products'." (Alfredo Cuzzocrea & Mohamed M Gaber, "Data Science and Distributed Intelligence", 2015)

"Data science involves using methods to analyze massive amounts of data and extract the knowledge it contains. […] Data science is an evolutionary extension of statistics capable of dealing with the massive amounts of data produced today. It adds methods from computer science to the repertoire of statistics." (Davy Cielen et al, "Introducing Data Science", 2016)

"The workflows and processes involved in the creation and development of data products." (Benjamin Bengfort & Jenny Kim, "Data Analytics with Hadoop", 2016)

"The discipline of analysis that helps relate data to the events and processes that produce and consume it for different reasons." (Gregory Lampshire, "The Data and Analytics Playbook", 2016)

"The extraction of knowledge from large volumes of unstructured data which is a continuation of the field data mining and predictive analytics, also known as knowledge discovery and data mining (KDD)." (Suren Behari, "Data Science and Big Data Analytics in Financial Services: A Case Study", 2016)

"A knowledge acquisition from data through scientific method that comprises systematic observation, experiment, measurement, formulation, and hypotheses testing with the aim of discovering new ideas and concepts." (Babangida Zubairu, "Security Risks of Biomedical Data Processing in Cloud Computing Environment", 2018)

"Data science is a collection of techniques used to extract value from data. It has become an essential tool for any organization that collects, stores, and processes data as part of its operations. Data science techniques rely on finding useful patterns, connections, and relationships within data. Being a buzzword, there is a wide variety of definitions and criteria for what constitutes data science. Data science is also commonly referred to as knowledge discovery, machine learning, predictive analytics, and data mining. However, each term has a slightly different connotation depending on the context." (Vijay Kotu & Bala Deshpande, "Data Science" 2nd Ed., 2018)

"A field that builds on and synthesizes a number of relevant disciplines and bodies of knowledge, including statistics, informatics, computing, communication, management, and sociology to translate data into information, knowledge, insight, and intelligence for improving innovation, productivity, and decision making." (Zhaohao Sun, "Intelligent Big Data Analytics: A Managerial Perspective", 2019)

"Data science is an interdisciplinary field that uses scientific methods, processes, algorithms, and systems to extract knowledge and insights from data in various forms, both structured and unstructured similar to data mining." (K Hariharanath, "BIG Data: An Enabler in Developing Business Models in Cloud Computing Environments", 2019)

"Is a broad field that refers to the collective processes, theories, concepts, tools and technologies that enable the review, analysis, and extraction of valuable knowledge and information from raw data. It is geared toward helping individuals and organizations make better decisions from stored, consumed and managed data." (Maryna Nehrey & Taras Hnot, "Data Science Tools Application for Business Processes Modelling in Aviation", 2019)

"It is a new discipline that combines elements of mathematics, statistics, computer science, and data visualization. The objective is to extract information from data sources. In this sense, data science is devoted to database exploration and analysis. This discipline has recently received much attention due to the growing interest in big data." (Soraya Sedkaoui, "Big Data Analytics for Entrepreneurial Success", 2019)

"the study and application of techniques for deriving insights from data, including constructing algorithms for prediction. Traditional statistical science forms part of data science, which also includes a strong element of coding and data management." (David Spiegelhalter, "The Art of Statistics: Learning from Data", 2019)

"A relatively new term applied to an interdisciplinary field of study focused on methods for collecting, maintaining, processing, analyzing and presenting results from large datasets." (Osman Kandara & Eugene Kennedy, "Educational Data Mining: A Guide for Educational Researchers", 2020)

"Data Science is the branch of science that uses technologies to predict the upcoming nature of different things such as a market or weather conditions. It shows a wide usage in today’s world." (Kirti R Bhatele, "Data Analysis on Global Stratification", 2020)

"Data science is a methodical form of integrating statistics, algorithms, scientific methods, models and visualization methods for interpretation of outcomes in organizational problem solving and fact based decision making." (Tanushri Banerjee & Arindam Banerjee, "Designing a Business Analytics Culture in Organizations in India", 2021)

"Data science is a multi-disciplinary field that follows scientific approaches, methods, and processes to extract knowledge and insights from structured, semi-structured and unstructured data." (Ahmad M Kabil, Integrating Big Data Technology Into Organizational Decision Support Systems, 2021)

"Data Science is an inter-disciplinary field that uses scientific methods, processes, algorithms and systems to extract knowledge and insights." (R Suganya et al, "A Literature Review on Thyroid Hormonal Problems in Women Using Data Science and Analytics: Healthcare Applications", 2021)

"Data Science is the science and art of using computational methods to identify and discover influential patterns in data." (M Govindarajan, "Introduction to Data Science", 2021)

"Data science is the study of data. It involves developing methods of recording, storing, and analyzing data to effectively extract useful information. The goal of data science is to gain insights and knowledge from any type of data - both structured and unstructured." (Pankaj Pathak, "A Survey on Tools for Data Analytics and Data Science", 2021)

"It is a science of multiple disciplines used for exploring knowledge from data using complex scientific algorithms and methods." (Vandana Kalra et al, "Machine Learning and Its Application in Monitoring Diabetes Mellitus", 2021)

"The concept that utilizes scientific and software methods, IT infrastructure, processes, and software systems in order to gather, process, analyze and deliver useful information, knowledge and insights from various data sources." (Nenad Stefanovic, "Big Data Analytics in Supply Chain Management", 2021)

"This is an evolving field that deals with the gathering, preparation, exploration, visualization, organisation, and storage of large groups of data and the extraction of valuable information from large volumes of data that may exist in an unorganised or unstructured form." (James O Odia & Osaheni T Akpata, "Role of Data Science and Data Analytics in Forensic Accounting and Fraud Detection", 2021)

"A field of study involving the processes and systems used to extract insights from data in all of its forms. The profession is seen as a continuation of the other data analysis fields, such as statistics." (Solutions Review)

"The discipline of using data and advanced statistics to make predictions. Data science is also focused on creating understanding among messy and disparate data. The “what” a scientist is tackling will differ greatly by employer." (KDnuggets)

"Unites statistical systems and processes with computer and information science to mine insights with structured and/or unstructured data analytics." (Accenture)

"Data science is a multidisciplinary approach to finding, extracting, and surfacing patterns in data through a fusion of analytical methods, domain expertise, and technology. This approach generally includes the fields of data mining, forecasting, machine learning, predictive analytics, statistics, and text analytics." (Tibco) [source]

"Data science is an interdisciplinary field that combines social sciences, advanced statistics, and computer engineering skills to acquire, store, organize, and analyze information across a variety of sources." (TDWI)

"Data science is the multidisciplinary field that focuses on finding actionable information in large, raw or structured data sets to identify patterns and uncover other insights. The field primarily seeks to discover answers for areas that are unknown and unexpected." (Sisense) [source]

"Data science is the practical application of advanced analytics, statistics, machine learning, and the associated activities involved in those areas in a business context, like data preparation for example." (RapidMiner) [source]

25 April 2017

⛏️Data Management: Data Products (Definitions)

"In the case of data mesh, a data product is an architectural quantum. It is the smallest unit of architecture that can be independently deployed and managed." (Zhamak Dehghani, "Data Mesh: Delivering Data-Driven Value at Scale", 2021)

"A data product is a data asset that should be trusted, reusable, and accessible. The data product is developed, owned, and managed by a domain, and each domain needs to make sure that its data products are accessible to other domains and their data consumers." (Marthe Mengen, 2024) [source

"A data product is a self-contained, independently deployable unit of data that delivers business value." (James Serra, "Deciphering Data Architectures", 2024)

"A collection of optimized data or data-related assets that are packaged for reuse and distribution with controlled access. Data products contain data as well as models, dashboards, and other computational asset types. Unlike data assets in governance catalogs, data products are managed as products with multiple purposes to provide business value." (IBM)

"A data product, in general terms, is any tool or application that processes data and generates results. […] Data products have one primary objective: to manage, organize and make sense of the vast amount of data that organizations collect and generate. It’s the users’ job to put the insights to use that they gain from these data products, take actions and make better decisions based on these insights." (Sisense) [source]

"A data product is a product built around data, containing everything required to complete a specific task or objective using that underlying data." (Opendatasoft)

"A data product is digital information that can be purchased." (Techtarget) [source]

"A key concept in data mesh architecture, Data Products are independent units of data managed by a specific domain team. They are responsible for defining, publishing, and maintaining their data assets while ensuring high-quality data that meets the needs of its consumers." (DataHub)

[Data product specification:] "Detailed description of a data set or data set series together with additional information that will enable it to be created, supplied to and used by another party" (ISO 19131)

"Data set or data set series that conforms to a data product specification" (ISO 19131)

04 November 2006

🔢Dhanurjay "DJ" Patil - Collected Quotes

"[...] a good definition of a data product is a product that facilitates an end goal through the use of data. It’s tempting to think of a data product purely as a data problem. After all, there’s nothing more fun than throwing a lot of technical expertise and fancy algorithmic work at a difficult problem." (Dhanurjay Patil, "Data Jujitsu: The Art of Turning Data into Product", 2012)

"As data scientists, we prefer to interact with the raw data. We know how to import it, transform it, mash it up with other data sources, and visualize it. Most of your customers can’t do that. One of the biggest challenges of developing a data product is figuring out how to give data back to the user. Giving back too much data in a way that’s overwhelming and paralyzing is 'data vomit'. It’s natural to build the product that you would want, but it’s very easy to overestimate the abilities of your users. The product you want may not be the product they want." (Dhanurjay Patil, "Data Jujitsu: The Art of Turning Data into Product", 2012)

"By giving data back to the user, you can create both engagement and revenue. We’re far enough into the data game that most users have realized that they’re not the customer, they’re the product. Their role in the system is to generate data, either to assist in ad targeting or to be sold to the highest bidder, or both." (Dhanurjay Patil, "Data Jujitsu: The Art of Turning Data into Product", 2012)

"Data Jujitsu: the art of using multiple data elements in clever ways to solve iterative problems that, when combined, solve a data problem that might otherwise be intractable." (Dhanurjay Patil, "Data Jujitsu: The Art of Turning Data into Product", 2012)

"Generalizing beyond advertising, when building any data product in which the data is obfuscated (where there isn’t a clear relationship between the user and the result), you can compromise on precision, but not on recall. But when the data is exposed, focus on high precision." (Dhanurjay Patil, "Data Jujitsu: The Art of Turning Data into Product", 2012)

"Ideas for data products tend to start simple and become complex; if they start complex, they become impossible." (Dhanurjay Patil, "Data Jujitsu: The Art of Turning Data into Product", 2012)

"In many applications, a design treatment that gives the user control over the outcome can go far to create interactions that leave the user feeling good." (Dhanurjay Patil, "Data Jujitsu: The Art of Turning Data into Product", 2012)

"Smart data scientists don’t just solve big, hard problems; they also have an instinct for making big problems small." (Dhanurjay Patil, "Data Jujitsu: The Art of Turning Data into Product", 2012)

"The best way to avoid data vomit is to focus on actionability of data. That is, what action do you want the user to take? If you want them to be impressed with the number of things that you can do with the data, then you’re likely producing data vomit. If you’re able to lead them to a clear set of actions, then you’ve built a product with a clear focus." (Dhanurjay Patil, "Data Jujitsu: The Art of Turning Data into Product", 2012)

"The key aspect of making a data product is putting the 'product' first and 'data' second. Saying it another way, data is one mechanism by which you make the product user-focused. With all products, you should ask yourself the following three questions: (1) What do you want the user to take away from this product? (2) What action do you want the user to take because of the product? (3) How should the user feel during and after using your product?" (Dhanurjay Patil, "Data Jujitsu: The Art of Turning Data into Product", 2012)

"You can give your data product a better chance of success by carefully setting the users’ expectations. [...] One under-appreciated facet of designing data products is how the user feels after using the product. Does he feel good? Empowered? Or disempowered and dejected?" (Dhanurjay Patil, "Data Jujitsu: The Art of Turning Data into Product", 2012)

"Data is such an incredible lever arm for change, we need to make sure that the change that is coming, is the one we all want to see." (Dhanurjay Patil, "A Code of Ethics for Data Science", 2016)

