15 March 2024

Data Warehousing: Data Mesh (Notes)

Disclaimer: This is work in progress intended to consolidate information from various sources. 
Last updated: 17-Mar-2024

Data Products with a Data Mesh
Data Products with a Data Mesh

Data Mesh
  • {definition} "a sociotechnical approach to share, access and manage analytical data in complex and large-scale environments - within or across organizations" [1]
    • ⇐ there is no default standard or reference implementation of data mesh and its components [2]
  • {definition} a type of decentralized data architecture that organizes data based on different business domains [2]
    • ⇐ no centralized data architecture coexists with data mesh, unless in transition [1]
    • distributes the modeling of analytical data, the data itself and its ownership [1]
  • {characteristic} partitions data around business domains and gives data ownership to the domains [1]
    • each domain can model their data according to their context [1]
    • there can be multiple models of the same concept in different domains gives the data sharing responsibility to those who are most intimately familiar with the data [1]
    • endorses multiple models of the data
      • data can be read from one domain, transformed and stored by another domain [1]
  • {characteristic} evolutionary execution process
  • {characteristic} agnostic of the underlying technology and infrastructure [1]
  • {aim} respond gracefully to change [1]
  • {aim} sustain agility in the face of growth [1]
  • {aim} increase the ratio of value from data to investment [1]
  • {principle} data as a product
    • {goal} business domains become accountable to share their data as a product to data users
    • {goal} introduce a new unit of logical architecture that controls and encapsulates all the structural components needed to share data as a product autonomously [1]
    • {goal} adhere to a set of acceptance criteria that assure the usability, quality, understandability, accessibility and interoperability of data products*
    • usability characteristics
  • {principle} domain-oriented ownership
    • {goal} decentralize the ownership of sharing analytical data to business domains that are closest to the data [1]
    • {goal} decompose logically the data artefacts based on the business domain they represent and manage their life cycle independently [1]
    • {goal} align business, technology and analytical data [1]
  • {principle} self-serve data platform
    • {goal} provide a self-serve data platform to empower domain-oriented teams to manage and govern the end-to-end life cycle of their data products* [1]
    • {goal} streamline the experience of data consumers to discover, access, and use the data products [1]
  • {principle} federated computational governance
    • {goal} implement a federated decision making and accountability structure that balances the autonomy and agility of domains, while respecting the global conformance, interoperability and security of the mesh* [1]
    • {goal} codifying and automated execution of policies at a fine-grained level [1]
    • ⇐ the principles represent a generalization and adaptation of practices that address the scale of organization digitization* [1]
  • {concept} decentralization of data products
    • {requirement} ability to compose data across different modes of access and topologies [1]
      • data needs to be agnostic to the syntax of data, underlying storage type, and mode of access to it [1]
        • many of the existing composability techniques that assume homogeneous data won’t work
          • e.g.  defining primary and foreign key relationships between tables of a single schema [1]
    • {requirement} ability to discover and learn what is relatable and decentral [1]
    • {requirement} ability to seamlessly link relatable data [1]
    • {requirement} ability to relate data temporally [1]
  • {concept} data product 
    • the smallest unit of data-based architecture that can be independently deployed and managed (aka product quantum) [1]
    • provides a set of explicitly defined and data sharing contracts
    • provides a truthful portion of the reality for a particular domain (aka single slice of truth) [1]
    • constructed in alignment with the source domain [3]
    • {characteristic} autonomous
      • its life cycle and model are managed independently of other data products [1]
    • {characteristic} discoverable
      • via a centralized registry or catalog that list the available datasets with some additional information about each dataset, the owners, the location, sample data, etc. [1]
    • {characteristic} addressable
      • via a permanent and unique address to the data user to programmatically or manually access it [1] 
    • {characteristic} understandable
      • involves getting to know the semantics of its underlying data and the syntax in which the data is encoded [1]
      • describes which entities it encapsulates, the relationships between them, and their adjacent data products [1]
    • {characteristic} trustworthy and truthful
      • represents the fact of the business correctly [1]
      • provides data provenance and data lineage [1]
    • {characteristic} natively accessible
      • make it possible for various data users to access and read its data in their native mode of access [1]
      • meant to be broadcast and shared widely [3]
    • {characteristic} interoperable and composable
      • follows a set of standards and harmonization rules that allow linking data across domains easily [1]
    • {characteristic} valuable on its own
      • must have some inherent value for the data users [1]
    • {characteristic} secure
      • the access control is validated by the data product, right in the flow of data, access, read, or write [1] 
        • ⇐ the access control policies can change dynamically
    • {characteristic} multimodal 
      • there is no definitive 'right way' to create a data product, nor is there a single expected form, format, or mode that it is expected to take [3] 
    • shares its logs, traces, and metrics while consuming, transforming, and sharing data [1]
    • {concept} data quantum (aka product data quantum, architectural quantum) 
      • unit of logical architecture that controls and encapsulates all the structural components needed to share a data product [1]
        • {component} data
        • {component} metadata
        • {component} code
        • {component} policies
        • {component} dependencies' listing
    • {concept} data product observability
      • monitor the operational health of the mesh
      • debug and perform postmortem analysis
      • perform audits
      • understand data lineage
    • {concept} logs 
      • immutable, timestamped, and often structured events that are produced as a result of processing and the execution of a particular task [1]
      • used for debugging and root cause analysis
    • {concept} traces
      • records of causally related distributed events [1]
    • {concept} metrics
      • objectively quantifiable parameters that continue to communicate build-time and runtime characteristics of data products [1]
  • artefacts 
    • e.g. data, code, metadata, policies

References:
[1] Zhamak Dehghani (2021) Data Mesh: Delivering Data-Driven Value at Scale (book review)
[2] Zhamak Dehghani (2019) How to Move Beyond a Monolithic Data Lake to a Distributed Data Mesh (link)
[3] Adam Bellemare (2023) Building an Event-Driven Data Mesh: Patterns for Designing and Building Event-Driven Architectures

No comments:

Related Posts Plugin for WordPress, Blogger...

About Me

My photo
IT Professional with more than 24 years experience in IT in the area of full life-cycle of Web/Desktop/Database Applications Development, Software Engineering, Consultancy, Data Management, Data Quality, Data Migrations, Reporting, ERP implementations & support, Team/Project/IT Management, etc.