Disclaimer: This is work in progress intended to consolidate information from various sources.
Last updated: 17-Mar-2024
Data Mesh
- {definition} "a sociotechnical approach to share, access and manage analytical data in complex and large-scale environments - within or across organizations" [1]
- ⇐ there is no default standard or reference implementation of data mesh and its components [2]
- {definition} a type of decentralized data architecture that organizes data based on different business domains [2]
- ⇐ no centralized data architecture coexists with data mesh, unless in transition [1]
- distributes the modeling of analytical data, the data itself and its ownership [1]
- {characteristic} partitions data around business domains and gives data ownership to the domains [1]
- each domain can model their data according to their context [1]
- there can be multiple models of the same concept in different domains [1]
- {characteristic} gives the data sharing responsibility to those who are most intimately familiar with the data [1]
- endorses multiple models of the data
- data can be read from one domain, transformed and stored by another domain [1]
- {characteristic} evolutionary execution process
- {characteristic} agnostic of the underlying technology and infrastructure [1]
- {aim} respond gracefully to change [1]
- {aim} sustain agility in the face of growth [1]
- {aim} increase the ratio of value from data to investment [1]
- {principle} data as a product
- {goal} business domains become accountable for sharing their data as a product with data users
- {goal} introduce a new unit of logical architecture that controls and encapsulates all the structural components needed to share data as a product autonomously [1]
- {goal} adhere to a set of acceptance criteria that assure the usability, quality, understandability, accessibility and interoperability of data products
- ⇐ the usability characteristics are detailed under {concept} data product below; a minimal descriptor sketch follows these principles
- {principle} domain-oriented ownership
- {goal} decentralize the ownership of sharing analytical data to business domains that are closest to the data [1]
- {goal} logically decompose the data artefacts based on the business domain they represent and manage their life cycle independently [1]
- {goal} align business, technology and analytical data [1]
- {principle} self-serve data platform
- {goal} provide a self-serve data platform to empower domain-oriented teams to manage and govern the end-to-end life cycle of their data products [1]
- {goal} streamline the experience of data consumers to discover, access, and use the data products [1]
- {principle} federated computational governance
- {goal} implement a federated decision making and accountability structure that balances the autonomy and agility of domains, while respecting the global conformance, interoperability and security of the mesh [1]
- {goal} codify and automate the execution of policies at a fine-grained level [1] (see the policy-as-code sketch after these principles)
- ⇐ the principles represent a generalization and adaptation of practices that address the scale of organization digitization [1]
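To make "data as a product" and "federated computational governance" more concrete, here is a minimal sketch in Python. All names (`DataProductDescriptor`, `policy_violations`, the field set) are illustrative assumptions rather than a standard; as noted above, data mesh has no reference implementation [2].

```python
from dataclasses import dataclass

# Hypothetical descriptor of a data product; the field set is illustrative,
# not a standard -- data mesh has no reference implementation [2].
@dataclass
class DataProductDescriptor:
    name: str           # unique, addressable identifier
    domain: str         # owning business domain
    owner: str          # accountable domain team
    description: str    # supports understandability
    schema: dict        # syntax and semantics of the shared data
    output_ports: list  # native modes of access, e.g. ["sql", "files", "events"]
    sla: dict           # acceptance criteria, e.g. {"freshness_hours": 24}

# A global governance policy codified as an executable check, so it can be
# applied automatically and at a fine-grained level across all domains.
def policy_violations(p: DataProductDescriptor) -> list:
    violations = []
    if not p.owner:
        violations.append("missing accountable owner")
    if not p.schema:
        violations.append("missing schema")
    if not p.output_ports:
        violations.append("no output port declared")
    return violations

orders = DataProductDescriptor(
    name="orders.daily",
    domain="sales",
    owner="sales-data-team@example.com",
    description="Daily order facts, one row per order",
    schema={"order_id": "string", "amount": "decimal", "ordered_at": "timestamp"},
    output_ports=["sql", "files"],
    sla={"freshness_hours": 24},
)
print(policy_violations(orders))  # [] -> compliant, may join the mesh
```

The point of the sketch is the shape, not the names: each domain publishes such a descriptor with its product, and the platform runs the codified policies automatically instead of routing every product through a central review board.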
- {concept} decentralization of data products
- {requirement} ability to compose data across different modes of access and topologies [1] (a composition sketch follows these requirements)
- relating data needs to be agnostic to the syntax of the data, the underlying storage type, and the mode of access to it [1]
- many of the existing composability techniques that assume homogeneous data won’t work
- e.g. defining primary and foreign key relationships between tables of a single schema [1]
- {requirement} ability to discover and learn what is relatable, in a decentralized fashion [1]
- {requirement} ability to seamlessly link relatable data [1]
- {requirement} ability to relate data temporally [1]
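A minimal sketch of what such composition could look like, assuming two domains that expose the same concept through different syntaxes and modes of access; the payloads and the shared `customer_id` key convention are made up for illustration.

```python
import csv
import io
import json

# Two domains expose the same concept ("customer") through different
# syntaxes: sales shares CSV files, support shares JSON events. The
# in-memory strings below stand in for real output ports.
sales_csv = "customer_id,total_spend\nc-1,120.50\nc-2,80.00\n"
support_json = '[{"customer_id": "c-1", "open_tickets": 2}]'

# Normalize both into a syntax-agnostic form keyed on a globally agreed
# identifier (a harmonization rule shared across domains).
sales = {r["customer_id"]: r for r in csv.DictReader(io.StringIO(sales_csv))}
support = {r["customer_id"]: r for r in json.loads(support_json)}

# Compose across domains: a join on the shared key, tolerant of rows that
# exist in only one domain -- no single-schema foreign keys involved.
composed = {
    cid: {**sales.get(cid, {}), **support.get(cid, {})}
    for cid in sales.keys() | support.keys()
}
print(composed["c-1"])
# {'customer_id': 'c-1', 'total_spend': '120.50', 'open_tickets': 2}
```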
- {concept} data product
- the smallest unit of architecture that can be independently deployed and managed (aka product quantum) [1]
- provides a set of explicitly defined data sharing contracts
- provides a truthful portion of the reality for a particular domain (aka single slice of truth) [1]
- constructed in alignment with the source domain [3]
- {characteristic} autonomous
- its life cycle and model are managed independently of other data products [1]
- {characteristic} discoverable
- via a centralized registry or catalog that lists the available datasets with some additional information about each dataset: the owners, the location, sample data, etc. [1] (a minimal registry sketch follows this list of characteristics)
- {characteristic} addressable
- via a permanent and unique address that lets data users access it programmatically or manually [1]
- {characteristic} understandable
- involves getting to know the semantics of its underlying data and the syntax in which the data is encoded [1]
- describes which entities it encapsulates, the relationships between them, and their adjacent data products [1]
- {characteristic} trustworthy and truthful
- represents the facts of the business correctly [1]
- provides data provenance and data lineage [1]
- {characteristic} natively accessible
- makes it possible for various data users to access and read its data in their native mode of access [1]
- meant to be broadcast and shared widely [3]
- {characteristic} interoperable and composable
- follows a set of standards and harmonization rules that allow linking data across domains easily [1]
- {characteristic} valuable on its own
- must have some inherent value for the data users [1]
- {characteristic} secure
- the access control is validated by the data product, right in the flow of data access, read, or write [1]
- ⇐ the access control policies can change dynamically
- {characteristic} multimodal
- there is no definitive 'right way' to create a data product, nor is there a single expected form, format, or mode that it is expected to take [3]
- {characteristic} observable
- shares its logs, traces, and metrics while consuming, transforming, and sharing data [1]
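A toy sketch of the discoverable and addressable characteristics: a central catalog lists the available data products, and each entry is reachable through a permanent, unique address. The `dp://domain/name` address scheme is an invented convention, not part of any specification.

```python
# Central catalog: permanent address -> descriptive metadata about the
# product (owner, description, sample data), supporting discoverability.
catalog = {
    "dp://sales/orders.daily": {
        "owner": "sales-data-team@example.com",
        "description": "Daily order facts, one row per order",
        "sample": [{"order_id": "o-1", "amount": 42.0}],
    },
}

def discover(keyword: str) -> list:
    """Return permanent addresses of products whose description matches."""
    return [addr for addr, meta in catalog.items()
            if keyword.lower() in meta["description"].lower()]

print(discover("order"))  # ['dp://sales/orders.daily']
```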
- {concept} data quantum (aka product data quantum, architectural quantum)
- unit of logical architecture that controls and encapsulates all the structural components needed to share a data product [1] (a minimal sketch follows the component list)
- {component} data
- {component} metadata
- {component} code
- {component} policies
- {component} dependencies' listing
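A minimal sketch of the data quantum as a single deployable unit that carries all five components listed above; the class and field names are illustrative assumptions, not a standard.

```python
from dataclasses import dataclass, field

@dataclass
class DataQuantum:
    """One independently deployable unit encapsulating a data product."""
    data: dict           # the domain data itself, or pointers to it
    metadata: dict       # schema, semantics, documentation, lineage
    code: dict           # ingestion, transformation, and serving logic
    policies: dict       # access control, retention, privacy rules
    dependencies: list = field(default_factory=list)  # upstream products

    def deploy(self) -> None:
        # Everything the product needs travels with the quantum, so its
        # life cycle is managed independently of other data products.
        print(f"deploying with {len(self.dependencies)} upstream dependencies")
```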
- {concept} data product observability
- monitor the operational health of the mesh (a minimal logging/metrics sketch follows below)
- debug and perform postmortem analysis
- perform audits
- understand data lineage
- {concept} logs
- immutable, timestamped, and often structured events that are produced as a result of processing and the execution of a particular task [1]
- used for debugging and root cause analysis
- {concept} traces
- records of causally related distributed events [1]
- {concept} metrics
- objectively quantifiable parameters that continue to communicate build-time and runtime characteristics of data products [1]
- artefacts
- e.g. data, code, metadata, policies
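A minimal sketch, using only the Python standard library, of a data product emitting the observability outputs described above: timestamped structured log events, a trace id that causally relates the events of one run, and quantifiable runtime metrics. The event and metric names are illustrative.

```python
import json
import time
import uuid

trace_id = str(uuid.uuid4())  # shared by all events of one pipeline run

def log_event(task: str, **fields) -> None:
    # Immutable, timestamped, structured event produced for a particular task.
    event = {"ts": time.time(), "trace_id": trace_id, "task": task, **fields}
    print(json.dumps(event))  # in practice, shipped to a log/trace backend

log_event("consume", source="dp://sales/orders.daily", rows=10000)
log_event("transform", dropped_rows=3)
log_event("share", port="files")

# Runtime metrics, e.g. feeding freshness and quality SLOs.
metrics = {"rows_served": 9997, "freshness_seconds": 3600, "error_rate": 0.0003}
print(json.dumps(metrics))
```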
References:
[1] Zhamak Dehghani (2021). Data Mesh: Delivering Data-Driven Value at Scale. O'Reilly.
[2] Zhamak Dehghani (2019). How to Move Beyond a Monolithic Data Lake to a Distributed Data Mesh. martinfowler.com.
[3] Adam Bellemare (2023). Building an Event-Driven Data Mesh: Patterns for Designing and Building Event-Driven Architectures. O'Reilly.