Disclaimer: This is work in progress intended to consolidate information from various sources.
Last updated: 17-Mar-2024
Data Mesh
- {definition} "a sociotechnical approach to share, access and manage analytical data in complex and large-scale environments - within or across organizations" [1]
- ⇐ there is no default standard or reference implementation of data mesh and its components [2]
- {definition} a type of decentralized data architecture that organizes data based on different business domains [2]
- ⇐ no centralized data architecture coexists with data mesh, unless in transition [1]
- distributes the modeling of analytical data, the data itself and its ownership [1]
- {characteristic} partitions data around business domains and gives data ownership to the domains [1]
- each domain can model their data according to their context [1]
- there can be multiple models of the same concept in different domains, i.e. the mesh endorses multiple models of the data [1]
- gives the data sharing responsibility to those who are most intimately familiar with the data [1]
- data can be read from one domain, transformed and stored by another domain [1]
- {characteristic} evolutionary execution process
- {characteristic} agnostic of the underlying technology and infrastructure [1]
- {aim} respond gracefully to change [1]
- {aim} sustain agility in the face of growth [1]
- {aim} increase the ratio of value from data to investment [1]
- {principle} data as a product
- {goal} business domains become accountable for sharing their data as products with data users
- {goal} introduce a new unit of logical architecture that controls and encapsulates all the structural components needed to share data as a product autonomously [1]
- {goal} adhere to a set of acceptance criteria that assure the usability, quality, understandability, accessibility and interoperability of data products*
- ⇐ these criteria correspond to the usability characteristics listed under {concept} data product below (a sketch of codified checks follows this principle)
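As a minimal sketch of how such acceptance criteria might be codified, the following Python checks a hypothetical `DataProductDescriptor` before the product is shared on the mesh. The descriptor fields, the `dataproduct://` address scheme, and the check names are illustrative assumptions, not prescribed by [1].

```python
from dataclasses import dataclass, field

@dataclass
class DataProductDescriptor:
    name: str
    owner_domain: str
    address: str        # permanent, unique address on the mesh
    schema: dict        # syntax: how the data is encoded
    semantics: str      # meaning of the entities the product encapsulates
    output_ports: list = field(default_factory=list)  # e.g. ["sql", "events"]

def acceptance_checks(dp: DataProductDescriptor) -> dict:
    """Automated checks mirroring the acceptance criteria above."""
    return {
        "understandable": bool(dp.semantics) and bool(dp.schema),
        "accessible": len(dp.output_ports) > 0,
        "addressable": dp.address.startswith("dataproduct://"),
        "owned_by_a_domain": bool(dp.owner_domain),
    }

dp = DataProductDescriptor(
    name="retail-orders",
    owner_domain="sales",
    address="dataproduct://sales/retail-orders",
    schema={"order_id": "string", "total": "decimal"},
    semantics="One record per confirmed customer order.",
    output_ports=["sql", "events"],
)
assert all(acceptance_checks(dp).values())  # share only if every check passes
```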
- {principle} domain-oriented ownership
- {goal} decentralize the ownership of sharing analytical data to business domains that are closest to the data [1]
- {goal} decompose logically the data artefacts based on the business domain they represent and manage their life cycle independently [1]
- {goal} align business, technology and analytical data [1]
- {principle} self-serve data platform
- {goal} provide a self-serve data platform to empower domain-oriented teams to manage and govern the end-to-end life cycle of their data products* [1]
- {goal} streamline the experience of data consumers to discover, access, and use the data products [1]
- {principle} federated computational governance
- {goal} implement a federated decision making and accountability structure that balances the autonomy and agility of domains, while respecting the global conformance, interoperability and security of the mesh* [1]
- {goal} codify policies and automate their execution at a fine-grained level [1]
- ⇐ the principles represent a generalization and adaptation of practices that address the scale of organization digitization* [1]
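A minimal sketch of computational governance, under assumptions: the policy is codified once (the federated decision) as a plain function, then executed automatically in the flow of every read, so a change to the policy list takes effect dynamically without redeploying the data product. The function names and the masking rule are illustrative, not from [1].

```python
from typing import Callable

Row = dict
Policy = Callable[[Row], Row]

def mask_pii(row: Row) -> Row:
    """Global rule: e-mail addresses never leave a data product unmasked."""
    if "email" in row:
        row = {**row, "email": "***@***"}
    return row

def enforce(policies: list, row: Row) -> Row:
    # Runs in the flow of every read/write, so an updated policy list
    # applies immediately, at the level of individual rows.
    for policy in policies:
        row = policy(row)
    return row

print(enforce([mask_pii], {"order_id": "42", "email": "ada@example.com"}))
# -> {'order_id': '42', 'email': '***@***'}
```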
- {concept} decentralization of data products
- {requirement} ability to compose data across different modes of access and topologies [1]
- data needs to be agnostic to the syntax of data, underlying storage type, and mode of access to it [1]
- many of the existing composability techniques that assume homogeneous data won’t work
- e.g. defining primary and foreign key relationships between tables of a single schema [1]
- {requirement} ability to discover and learn what is relatable, in a decentralized fashion [1]
- {requirement} ability to seamlessly link relatable data [1]
- {requirement} ability to relate data temporally [1]
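The sketch below illustrates the composability requirements above, under assumptions: two independently modeled data products share a mesh-wide `customer_urn` identifier (hypothetical), so their records can be linked across domains and related temporally, without relying on a primary/foreign key inside a single schema.

```python
from datetime import date

# Output of the "sales" domain's product, read via its output port.
orders = [
    {"customer_urn": "urn:cust:7", "total": 120.0, "at": date(2024, 3, 1)},
]
# Output of the independently modeled "crm" domain's product.
profiles = [
    {"customer_urn": "urn:cust:7", "segment": "gold",
     "valid_from": date(2024, 1, 1), "valid_to": date(2024, 6, 30)},
]

def link_temporally(orders, profiles):
    """Join on the mesh-wide identifier, picking the profile valid at order time."""
    for o in orders:
        for p in profiles:
            if (p["customer_urn"] == o["customer_urn"]
                    and p["valid_from"] <= o["at"] <= p["valid_to"]):
                yield {**o, "segment": p["segment"]}

print(list(link_temporally(orders, profiles)))
```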
- {concept} data product
- the smallest unit of architecture that can be independently deployed and managed (aka product quantum) [1]
- provides a set of explicitly defined data sharing contracts
- provides a truthful portion of the reality for a particular domain (aka single slice of truth) [1]
- constructed in alignment with the source domain [3]
- {characteristic} autonomous
- its life cycle and model are managed independently of other data products [1]
- {characteristic} discoverable
- via a centralized registry or catalog that lists the available datasets with additional information about each: the owners, the location, sample data, etc. [1] (see the registry sketch after this concept)
- {characteristic} addressable
- via a permanent and unique address that enables data users to access it programmatically or manually [1]
- {characteristic} understandable
- involves getting to know the semantics of its underlying data and the syntax in which the data is encoded [1]
- describes which entities it encapsulates, the relationships between them, and their adjacent data products [1]
- {characteristic} trustworthy and truthful
- represents the facts of the business correctly [1]
- provides data provenance and data lineage [1]
- {characteristic} natively accessible
- makes it possible for various data users to access and read its data in their native mode of access [1]
- meant to be broadcast and shared widely [3]
- {characteristic} interoperable and composable
- follows a set of standards and harmonization rules that allow linking data across domains easily [1]
- {characteristic} valuable on its own
- must have some inherent value for the data users [1]
- {characteristic} secure
- the access control is validated by the data product, right in the flow of data access, read, or write [1]
- ⇐ the access control policies can change dynamically
- {characteristic} multimodal
- there is no definitive 'right way' to create a data product, nor is there a single expected form, format, or mode that it is expected to take [3]
- shares its logs, traces, and metrics while consuming, transforming, and sharing data [1]
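To ground the discoverable and addressable characteristics, here is a minimal registry sketch. The `Registry` class, the `dataproduct://` address scheme, and the entry fields are assumptions for illustration; [1] only requires a catalog listing the datasets, their owners, location, and sample data.

```python
class Registry:
    """Mesh-level catalog: permanent address -> descriptive entry."""

    def __init__(self):
        self._entries = {}

    def register(self, address, owner, sample, description):
        self._entries[address] = {
            "owner": owner, "sample": sample, "description": description,
        }

    def discover(self, keyword):
        """Return permanent addresses whose description mentions the keyword."""
        return [addr for addr, entry in self._entries.items()
                if keyword in entry["description"]]

registry = Registry()
registry.register(
    address="dataproduct://sales/retail-orders",
    owner="sales",
    sample=[{"order_id": "42", "total": "120.00"}],
    description="Confirmed customer orders, one record each.",
)
print(registry.discover("orders"))  # ['dataproduct://sales/retail-orders']
```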
- {concept} data quantum (aka product data quantum, architectural quantum)
- unit of logical architecture that controls and encapsulates all the structural components needed to share a data product [1]
- {component} data
- {component} metadata
- {component} code
- {component} policies
- {component} dependencies' listing
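A sketch of the data quantum as a single deployable unit, transcribing the five components above into fields. The field contents are illustrative assumptions; real implementations vary, since there is no reference implementation [2].

```python
from dataclasses import dataclass, field

@dataclass
class DataQuantum:
    data: dict          # datasets exposed through the product's output ports
    metadata: dict      # schema, semantics, provenance, quality information
    code: dict          # transformation pipelines, serving APIs, tests
    policies: dict      # access control, privacy, retention rules
    dependencies: list = field(default_factory=list)  # upstream product addresses

quantum = DataQuantum(
    data={"orders": []},
    metadata={"owner": "sales", "schema": {"order_id": "string"}},
    code={"pipeline": "transform_orders.py"},
    policies={"access": "sales-consumers-only"},
    dependencies=["dataproduct://crm/customer-profiles"],
)
```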
- {concept} data product observability (a sketch of emitting these signals follows the signal types below)
- monitor the operational health of the mesh
- debug and perform postmortem analysis
- perform audits
- understand data lineage
- {concept} logs
- immutable, timestamped, and often structured events that are produced as a result of processing and the execution of a particular task [1]
- used for debugging and root cause analysis
- {concept} traces
- records of causally related distributed events [1]
- {concept} metrics
- objectively quantifiable parameters that continue to communicate build-time and runtime characteristics of data products [1]
- artefacts
- e.g. data, code, metadata, policies
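As a sketch of these observability signals, the following emits structured, timestamped log events correlated by a trace id across the consume/transform/share steps, plus one runtime metric. The event shapes and the completeness metric are illustrative assumptions; [1] only says a data product shares its logs, traces, and metrics.

```python
import json
import time
import uuid

trace_id = str(uuid.uuid4())  # correlates causally related events (the trace)

def log(event, **fields):
    """Immutable, timestamped, structured log event."""
    print(json.dumps({"ts": time.time(), "trace_id": trace_id,
                      "event": event, **fields}))

rows_in = [{"order_id": "42"}, {"order_id": "43"}]
log("consume", source="dataproduct://crm/customer-profiles", rows=len(rows_in))
rows_out = [r for r in rows_in if r["order_id"] != "43"]  # the transform step
log("transform", dropped=len(rows_in) - len(rows_out))
log("share", port="events", rows=len(rows_out))
# A runtime metric of this run, e.g. completeness:
log("metric", name="completeness", value=len(rows_out) / len(rows_in))
```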
References:
[1] Zhamak Dehghani (2021). Data Mesh: Delivering Data-Driven Value at Scale. O'Reilly.
[2] Zhamak Dehghani (2019). How to Move Beyond a Monolithic Data Lake to a Distributed Data Mesh. martinfowler.com.
[3] Adam Bellemare (2023). Building an Event-Driven Data Mesh: Patterns for Designing and Building Event-Driven Architectures. O'Reilly.