
15 March 2024

🧊🗒️Data Warehousing: Data Mesh (Notes)

Disclaimer: This is work in progress intended to consolidate information from various sources. 
Last updated: 17-Mar-2024

Data Products with a Data Mesh

Data Mesh
  • {definition} "a sociotechnical approach to share, access and manage analytical data in complex and large-scale environments - within or across organizations" [1]
    • ⇐ there is no default standard or reference implementation of data mesh and its components [2]
  • {definition} a type of decentralized data architecture that organizes data based on different business domains [2]
    • ⇐ no centralized data architecture coexists with data mesh, unless in transition [1]
    • distributes the modeling of analytical data, the data itself and its ownership [1]
  • {characteristic} partitions data around business domains and gives data ownership to the domains [1]
    • each domain can model their data according to their context [1]
    • there can be multiple models of the same concept in different domains [1]
    • gives the data sharing responsibility to those who are most intimately familiar with the data [1]
    • endorses multiple models of the data
      • data can be read from one domain, transformed and stored by another domain [1]
  • {characteristic} evolutionary execution process
  • {characteristic} agnostic of the underlying technology and infrastructure [1]
  • {aim} respond gracefully to change [1]
  • {aim} sustain agility in the face of growth [1]
  • {aim} increase the ratio of value from data to investment [1]
  • {principle} data as a product
    • {goal} business domains become accountable to share their data as a product to data users
    • {goal} introduce a new unit of logical architecture that controls and encapsulates all the structural components needed to share data as a product autonomously [1]
    • {goal} adhere to a set of acceptance criteria that assure the usability, quality, understandability, accessibility and interoperability of data products*
    • usability characteristics
  • {principle} domain-oriented ownership
    • {goal} decentralize the ownership of sharing analytical data to business domains that are closest to the data [1]
    • {goal} decompose logically the data artefacts based on the business domain they represent and manage their life cycle independently [1]
    • {goal} align business, technology and analytical data [1]
  • {principle} self-serve data platform
    • {goal} provide a self-serve data platform to empower domain-oriented teams to manage and govern the end-to-end life cycle of their data products* [1]
    • {goal} streamline the experience of data consumers to discover, access, and use the data products [1]
  • {principle} federated computational governance
    • {goal} implement a federated decision making and accountability structure that balances the autonomy and agility of domains, while respecting the global conformance, interoperability and security of the mesh* [1]
    • {goal} codifying and automated execution of policies at a fine-grained level [1]
    • ⇐ the principles represent a generalization and adaptation of practices that address the scale of organization digitization* [1]
  • {concept} decentralization of data products
    • {requirement} ability to compose data across different modes of access and topologies [1]
      • data needs to be agnostic to the syntax of data, underlying storage type, and mode of access to it [1]
        • many of the existing composability techniques that assume homogeneous data won’t work
          • e.g.  defining primary and foreign key relationships between tables of a single schema [1]
    • {requirement} ability to discover and learn what is relatable, in a decentralized fashion [1]
    • {requirement} ability to seamlessly link relatable data [1]
    • {requirement} ability to relate data temporally [1]
  • {concept} data product 
    • the smallest unit of data-based architecture that can be independently deployed and managed (aka product quantum) [1]
    • provides a set of explicitly defined data sharing contracts
    • provides a truthful portion of the reality for a particular domain (aka single slice of truth) [1]
    • constructed in alignment with the source domain [3]
    • {characteristic} autonomous
      • its life cycle and model are managed independently of other data products [1]
    • {characteristic} discoverable
      • via a centralized registry or catalog that lists the available datasets with some additional information about each dataset, the owners, the location, sample data, etc. [1]
    • {characteristic} addressable
      • via a permanent and unique address that allows the data user to access it programmatically or manually [1]
    • {characteristic} understandable
      • involves getting to know the semantics of its underlying data and the syntax in which the data is encoded [1]
      • describes which entities it encapsulates, the relationships between them, and their adjacent data products [1]
    • {characteristic} trustworthy and truthful
      • represents the facts of the business correctly [1]
      • provides data provenance and data lineage [1]
    • {characteristic} natively accessible
      • makes it possible for various data users to access and read its data in their native mode of access [1]
      • meant to be broadcast and shared widely [3]
    • {characteristic} interoperable and composable
      • follows a set of standards and harmonization rules that allow linking data across domains easily [1]
    • {characteristic} valuable on its own
      • must have some inherent value for the data users [1]
    • {characteristic} secure
      • the access control is validated by the data product, right in the flow of data access, read, or write [1]
        • ⇐ the access control policies can change dynamically
    • {characteristic} multimodal 
      • there is no definitive 'right way' to create a data product, nor is there a single expected form, format, or mode that it is expected to take [3] 
    • shares its logs, traces, and metrics while consuming, transforming, and sharing data [1]
    • {concept} data quantum (aka product data quantum, architectural quantum) 
      • unit of logical architecture that controls and encapsulates all the structural components needed to share a data product [1] (a minimal descriptor sketch is given after this list)
        • {component} data
        • {component} metadata
        • {component} code
        • {component} policies
        • {component} dependencies' listing
    • {concept} data product observability
      • monitor the operational health of the mesh
      • debug and perform postmortem analysis
      • perform audits
      • understand data lineage
    • {concept} logs 
      • immutable, timestamped, and often structured events that are produced as a result of processing and the execution of a particular task [1]
      • used for debugging and root cause analysis
    • {concept} traces
      • records of causally related distributed events [1]
    • {concept} metrics
      • objectively quantifiable parameters that continue to communicate build-time and runtime characteristics of data products [1]
  • artefacts 
    • e.g. data, code, metadata, policies
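
A minimal sketch of how the structural components of a data quantum listed above (data, metadata, code, policies, dependencies) could be captured in a single descriptor. The class and field names, as well as the mesh:// addresses, are hypothetical illustrations, not a standard or reference implementation:

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class OutputPort:
    """One way the data product exposes its data (addressable, natively accessible)."""
    name: str        # e.g. "orders_daily"
    address: str     # permanent, unique address of the port
    syntax: str      # e.g. "parquet", "json"
    schema_ref: str  # pointer to the published schema/contract

@dataclass
class DataProductQuantum:
    """Encapsulates the structural components needed to share a data product."""
    domain: str                                                    # owning business domain
    name: str                                                      # discoverable product name
    output_ports: List[OutputPort] = field(default_factory=list)  # data
    metadata: Dict[str, str] = field(default_factory=dict)        # semantics, owners, SLOs
    code: List[str] = field(default_factory=list)                 # pipelines, APIs, tests
    policies: List[str] = field(default_factory=list)             # access, retention, quality
    dependencies: List[str] = field(default_factory=list)         # upstream data products

# Usage sketch with purely illustrative values
orders = DataProductQuantum(
    domain="sales",
    name="orders",
    output_ports=[OutputPort("orders_daily", "mesh://sales/orders/daily",
                             "parquet", "schemas/orders_daily.json")],
    metadata={"owner": "sales-data-team", "description": "Confirmed orders, daily grain"},
    code=["pipelines/orders_daily.py"],
    policies=["pii-masking", "30-day-freshness"],
    dependencies=["mesh://crm/customers"],
)
print(orders.name, [p.address for p in orders.output_ports])
```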

References:
[1] Zhamak Dehghani (2021) Data Mesh: Delivering Data-Driven Value at Scale (book review)
[2] Zhamak Dehghani (2019) How to Move Beyond a Monolithic Data Lake to a Distributed Data Mesh (link)
[3] Adam Bellemare (2023) Building an Event-Driven Data Mesh: Patterns for Designing and Building Event-Driven Architectures

09 April 2012

🧭Business Intelligence: Between Products, Partners, People and Processes


In the previous post, “BI between Potential, Reality, Quality and Stories”, I commented on five of the important findings of a study led by KPMG with respect to the state of the art in BI initiatives. My comments were centered mainly on the first three of the 4Ps (Products, People, Partners, and Processes) considered in ITSM (IT Service Management). The connection to IT Service Management isn’t accidental, BI being also an organizational capability. Many of the aspects related to the 4Ps perspectives reveal the maturity of an organization in leveraging its BI infrastructure. In this post I would like to consider the BI landscape from these four perspectives.

Products

The products or technology perspective has a dual nature within the BI context. First of all, we have to consider the BI infrastructure – the whole set of BI tools we have at our disposal for our shiny reports. Secondly, because the BI infrastructure doesn’t stand on its own, we also have to consider the IT infrastructure on which it is based – a full range of ISs (Information Systems) in which data are entered, processed, transported and consumed before they are used by the BI tools. For data quality issues, we often have to consider this broader perspective and tackle the problems at the source; otherwise we end up treating the symptoms and not the causes. It’s important to note that the two layers or perspectives are interconnected, the consequences being bidirectional.

A typical BI infrastructure revolves around several databases, maybe one or more data warehouses and data marts, and one or more reporting systems. In the most basic scenario, the data flow is unidirectional from the databases to the data warehouse/marts, with reports built on top of the data warehouse/marts or directly on the ISs’ databases. In more complex scenarios, the data can flow between the various ISs when they are integrated, and even between data warehouses/marts, in a unidirectional or bidirectional flow. Unless the reports are based directly on the ISs’ databases, such architectures lead to data duplication, conversions between complex schemas, and delays between the various layers, to mention just a few of the most important implications. At some point in time the complexity falls down on you.
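
The basic, unidirectional flow described above can be sketched in a few lines of code; the tables, columns and values are purely illustrative, and the in-memory SQLite databases only stand in for an operational system and a data mart:

```python
import sqlite3

# Stand-in for the IS (operational) database
src = sqlite3.connect(":memory:")
src.execute("CREATE TABLE orders (order_no TEXT, amount REAL, order_date TEXT)")
src.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                [("O-1", 100.0, "2012-03-01"), ("O-2", 250.0, "2012-03-15")])

# Stand-in for the data mart
mart = sqlite3.connect(":memory:")
mart.execute("CREATE TABLE fact_orders (order_no TEXT, amount REAL, order_month TEXT)")

# Extract from the source, transform (derive the month), load into the mart
rows = src.execute("SELECT order_no, amount, order_date FROM orders").fetchall()
mart.executemany("INSERT INTO fact_orders VALUES (?, ?, ?)",
                 [(no, amt, dt[:7]) for no, amt, dt in rows])

# The report queries only the mart, never the operational database
for month, total in mart.execute(
        "SELECT order_month, SUM(amount) FROM fact_orders GROUP BY order_month"):
    print(month, total)
```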

One of the problems I have met is that a considerable percentage of the ISs are not developed to address BI requirements. It starts with data validation, with the way data are modeled, structured, formatted and made available for BI consumption. If you want to increase the quality of your data, sooner or later you have to address these aspects. What matters is thus the degree to which the systems are designed to cover the BI needs in particular, and decision making in general. This presumes that BI requirements need to be addressed in the early phases of implementations, during software design, or when tools are considered for purchase.

In addition, many ISs come with their own (standard) reports or reporting frameworks, thus becoming part of your BI infrastructure, intended or not. Even if such reports are meant to cover basic immediate reporting requirements, they are not always easy to consume, the logic behind them is not visible, they are hard to extend, they are not always tested, the additional reports built in other tools need to be synchronized with them, etc.

Partners

We gather huge volumes of data, we are drowning in it; we want to take decisions rooted in data and get visibility into the past, current and future state of the business. How can we achieve that if we don’t have the knowledge and human resources to do it? “Partners” is the magic word – external suppliers specialized, in theory, in providing this kind of services: BI analysts and developers, business analysts, data miners, and other IT professionals working together to build your BI infrastructure. One detail many people forget is that BI tools only provide potential; it is the skills and knowledge of those working with them that transform that potential into success. The success of such projects depends on their capabilities. Not to forget that BI projects are similar to other IT projects, falling under the same types of fallacies, plus a few of their own derived from the exploratory and complex nature of BI projects.

There is a dual nature to the “partners” perspective as well – besides the external perspective, which concerns the external partners and the IT department or the business as a whole, there is also the internal perspective, in which the IT department again plays a central role. I have often heard it loudly affirmed that the other departments are customers of the IT department, or the other way around. I have also seen this conception taken to the extreme, in which IT had no say in what concerns the IT infrastructure in general, and the BI infrastructure in particular. As long as the IT department isn’t treated as a business partner, an organization is more likely to be sabotaged from inside. Sabotage is maybe too strong a word, though it kind of reflects the state of affairs.

People

Same as with partners, the people perspective includes a considerable variety of types: IT staff, executives, managers, end users and other stakeholders, each of them with a word to say, grouped in various interest groups that don’t always converge, situations in which politics plays a major role. It’s actually interesting to see how the decision for a given BI solution is made, how the solution takes its place in the landscape, how it’s used and misused, how personalities and knowledge harness it or stand in its way. I feel that there are organizations (people) which do BI just for the sake of doing something, sometimes copying recipes of success, without connecting the dots, without clear goals and strategy. There are people who juggle with numbers and BI concepts without knowing their meaning and what they involve. This aspect is reflected in how BI tools are selected, implemented and used.

Having the best tools, consultants and the highest data quality won’t guarantee the success of a BI initiative without users’ acceptance, without teaching them how to make constructive use of tools and data, how to use and build models in order to solve the problems the business is confronted with, how to address strategic, tactical and operational requirements. The transformation from a robot to a knowledge worker doesn’t happen overnight. People need to be made aware of the various aspects of BI – data quality, process and data ownership, how models can be used and misused, how models evolve or become obsolete, how the BI infrastructure has to evolve with the business’ dynamics. There are so many aspects that need to be considered. It’s a continuous learning process.

Processes

A dual nature can be seen in the processes perspective too. First of all, we have to consider the processes used to manage the whole BI infrastructure efficiently and effectively. They are widely discussed in various methodologies like ITIL, whose implementation is thoroughly documented. Secondly, there is the reflection of departmental processes within the various data perspectives – how they are measured, and how the measurements are further used for continuous improvement.

Considering that this aspect is correlated with an organization’s capability model, I don’t think that many organizations reach that far. Sure, the trend is to define meaningful KPIs, growth, health and other types of metrics, but the question is – are you using those metrics constructively, are you aligning them with your strategic, tactical and operational goals? I think there is a lot of potential in this, though in order to measure processes accordingly it is imperative to also have the system designed for this purpose. Back to the technological perspective…

31 January 2010

🧭Business Intelligence: Enterprise Reporting (Part IV: Choosing Report’s Attributes)


Introduction

How are the attributes of a report chosen? Attributes are added primarily based on users’ specifications, however these can often be too high level, or the user ignored certain aspects, willingly or by mistake. In general, a report needs to show the attributes of high relevance to a certain topic, for example Document information (Document Number, Type, Dates, Statuses, etc.), Product main information (Product Number, Description, Type, Status, etc.), Quantities, Prices, Amounts, Responsible Users (e.g. Buyers, Preparers, Managers, etc.) or Responsible Third Parties (e.g. Customers, Vendors, Carriers).

When choosing the attributes for a report, there are several important sets of attributes which need to be considered:

Unique identifiers

Together with the various Names (e.g. Vendor Name, Customer Name) associated with entities, a report should also include the “unique identifier” (UID) for each entity, even if it is formed from one or more attributes. The UID makes it possible to identify, for example, whether duplicate records appear in the report, and it can be used to match/join the data from the report with other data sets in order to pull details or for further analysis. For example, in a PO report over PO Shipments a unique (natural) key could be formed from the PO Number, Line Number and Shipment Number; for a Vendor, the Vendor Name or the GSL (Global Supplier Location) Number could be used, though the latter is more adequate because it’s more general and accurate, making the Vendor’s identification easier. In theory, the database (surrogate) unique identifier from the PO Shipments table – the element dictating the report’s level of detail – respectively the Vendor ID, could be used for the same scope; however, even if surrogate UIDs are easier to use in joins, they can create confusion and overload the report, given that surrogate UIDs would need to be provided for the other elements as well.

Documents like Invoices include both an external and an internal unique identifier: the Invoice Number together with the Vendor, typically unique in a system, forms the external UID, while the Document or Voucher Number is used as the internal UID. The external UID is easier to use for external-facing considerations, while the internal UID is easier to use for internal needs, so it makes sense to include both types of unique identifiers.
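
A small sketch of how such UIDs are used in practice – checking a report for duplicates on its natural key and joining it with another data set. The column names (PO_NUMBER, LINE_NUMBER, SHIPMENT_NUMBER, etc.) and the sample values are assumptions made for illustration, not those of a specific system:

```python
import pandas as pd

po_shipments = pd.DataFrame({
    "PO_NUMBER":       ["PO-100", "PO-100", "PO-101"],
    "LINE_NUMBER":     [1, 1, 1],
    "SHIPMENT_NUMBER": [1, 1, 1],
    "VENDOR_NAME":     ["Acme", "Acme", "Beta"],
    "QUANTITY":        [10, 10, 5],
})

natural_key = ["PO_NUMBER", "LINE_NUMBER", "SHIPMENT_NUMBER"]

# 1) The natural key makes duplicate records visible in the report
duplicates = po_shipments[po_shipments.duplicated(subset=natural_key, keep=False)]
print(duplicates)

# 2) The same key is used to match/join the report with another data set
receipts = pd.DataFrame({
    "PO_NUMBER": ["PO-100"], "LINE_NUMBER": [1], "SHIPMENT_NUMBER": [1],
    "RECEIVED_QTY": [10],
})
enriched = po_shipments.drop_duplicates(subset=natural_key).merge(
    receipts, on=natural_key, how="left")
print(enriched)
```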

Quantities & Related Attributes

In Item-related reports, most of the time it makes sense to also include the quantities (e.g. Transaction, Ordered, Delivered, Invoiced, On-Hand Quantities) together with the Unit of Measure (UOM) in which they are represented. A distinction has to be made between the Primary UOM, the UOM in which the item is stored, and the Transactional UOM, the UOM in which the item is transacted; for example, the Purchasing UOM, Sales UOM or Transaction UOM could differ from the Primary UOM in which the item is stored in the Warehouse. In such cases, together with the Transactional UOM, the Primary UOM and eventually the UOM Conversion Rate should also be provided, when applicable.
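
A minimal sketch of carrying both the Transactional and the Primary UOM in a report and deriving the quantity in the Primary UOM via the conversion rate; column names, UOM codes and rates are illustrative assumptions:

```python
import pandas as pd

lines = pd.DataFrame({
    "ITEM":            ["A", "B"],
    "ORDERED_QTY":     [3, 24],        # quantity in the transactional UOM
    "TRANSACTION_UOM": ["BOX", "EA"],
    "PRIMARY_UOM":     ["EA", "EA"],
    "UOM_CONVERSION":  [12.0, 1.0],    # transactional UOM -> primary UOM
})

# Quantity expressed in the Primary UOM, so data from different documents
# can be aggregated consistently
lines["PRIMARY_QTY"] = lines["ORDERED_QTY"] * lines["UOM_CONVERSION"]
print(lines)
```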

Prices/Amounts & Related Attributes

For Item-related reports, and not only those, include the various Prices (e.g. Sales, Purchase, Standard Price) together with the Currency Code used, even if only one Currency is used; the same rule applies to the amounts stored (e.g. Invoice, Sales Order, Purchase amounts). For financial reports it’s advisable to show both functional amounts, the amounts in the Currency used by the GL (General Ledger), and transactional amounts, the amounts in the Currency used in the transaction. When the level of detail allows it, also show the Quantity and Price Unit used to calculate the amounts, as well as the eventual Exchange Rate or UOM Conversion Rate used. When available, also include the Period in which the Amount was booked in the system.
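
A minimal sketch of deriving both the transactional and the functional amounts from the Quantity, Price Unit and Exchange Rate; column names, currencies and rates are illustrative assumptions:

```python
import pandas as pd

invoices = pd.DataFrame({
    "INVOICE_NO":    ["INV-1", "INV-2"],
    "QUANTITY":      [10, 4],
    "PRICE_UNIT":    [25.0, 100.0],
    "CURRENCY":      ["USD", "EUR"],   # transactional currency
    "EXCHANGE_RATE": [0.92, 1.0],      # to the GL (functional) currency, e.g. EUR
    "GL_PERIOD":     ["2010-01", "2010-01"],
})

invoices["TRANSACTIONAL_AMOUNT"] = invoices["QUANTITY"] * invoices["PRICE_UNIT"]
invoices["FUNCTIONAL_AMOUNT"] = invoices["TRANSACTIONAL_AMOUNT"] * invoices["EXCHANGE_RATE"]
print(invoices)
```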

Dates

Typically the Document Date (e.g. Invoice Date, Order Date) and Document Creation Date should be included, together with the other Dates important for the business or data analysis (e.g. Need By Date, GL Date, Value Date). In general, the Document Date or Document Creation Date, and the GL Date for financial reports, should be mandatory attributes because they can be used to segment (partition) a data set into time units (e.g. days, weeks, months, periods, years, etc.).
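
A small sketch of using the Document Date to segment (partition) a data set into time units; the column names and values are assumptions made for illustration:

```python
import pandas as pd

orders = pd.DataFrame({
    "ORDER_NO":   ["SO-1", "SO-2", "SO-3"],
    "ORDER_DATE": ["2010-01-05", "2010-01-28", "2010-02-03"],
    "AMOUNT":     [100.0, 40.0, 75.0],
})
orders["ORDER_DATE"] = pd.to_datetime(orders["ORDER_DATE"])

# Derive the time unit from the date and aggregate by it
orders["ORDER_MONTH"] = orders["ORDER_DATE"].dt.to_period("M")
print(orders.groupby("ORDER_MONTH")["AMOUNT"].sum())
```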

Statuses

The various record statuses and document statuses should again be mandatory attributes in reports. Record statuses show whether a record is active, was cancelled or was marked as deleted, while document statuses show the documents’ processing status, often being associated with a workflow (e.g. approval or processing workflows). The record statuses could be synchronized and even merged with the document statuses.

Whether expressed as flags or as lists of values, statuses are essential in delimiting the data set that needs to be considered for further calculations, because unapproved documents or cancelled records often have low or no relevance for the business. Unapproved documents are typically not considered for the various calculations until they are approved, while cancelled records are associated with mistakes or the lack of need. Not being able to identify the active records can mess things up pretty badly, because, for example, some reports show only active records, while others show all the data available in a system. Therefore showing statuses in reports can be important in mitigating the differences between reports, especially when dealing with calculations.

It’s advisable to also have the possibility to see the cancelled records, for example in order to analyze the amount of waste expressed as overwork, or to identify the records that were cancelled by mistake.
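
A minimal sketch of delimiting the data set with record and document statuses before a calculation, while keeping the cancelled records visible for analysis; the status values and column names are illustrative assumptions:

```python
import pandas as pd

invoice_lines = pd.DataFrame({
    "INVOICE_NO":  ["INV-1", "INV-1", "INV-2", "INV-3"],
    "LINE_STATUS": ["ACTIVE", "CANCELLED", "ACTIVE", "ACTIVE"],
    "DOC_STATUS":  ["APPROVED", "APPROVED", "DRAFT", "APPROVED"],
    "AMOUNT":      [100.0, 100.0, 50.0, 80.0],
})

# Only active records on approved documents enter the calculation ...
relevant = invoice_lines[(invoice_lines["LINE_STATUS"] == "ACTIVE") &
                         (invoice_lines["DOC_STATUS"] == "APPROVED")]
print(relevant["AMOUNT"].sum())

# ... while the cancelled records remain available for analyzing waste or mistakes
print(invoice_lines[invoice_lines["LINE_STATUS"] == "CANCELLED"])
```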

In reports with multiple levels of detail, it can be useful to show the statuses from all levels, as statuses might not be in sync or might have different meanings. In theory, when the statuses are in sync, and especially when considering cancellations, it should be enough to consider the status from the lowest level of detail of each logical entity (e.g. the PO Shipment Status when considering POs, the Invoice Line Status when considering Invoices, both of the mentioned statuses when considering POs together with Invoices), though reality can prove to be a tough world for statuses, as programming errors and other business scenarios need to be considered.

Action Owners

Include Requestors, Document Preparers, Buyers, Managers or any other type of action owners, so a user can track the direct or indirect issues back to them.

Note:
Such attributes can be used as a basis to calculate/reflect an action owner’s performance, a fact that can infringe country or organization regulations, so you need to check whether there are any constraints in this direction and which set of attributes might be impacted. For example, it might be no problem to show the Buyer, though it might be a problem to show information about who created/modified the record. Eventually, if the performance needs to be calculated at the action owner level, substitute any attribute that can be used to identify a person with a random value; however, if the mapping between the action owner and the value used as a substitute is known (in case unique identifiers are used) or easy to get (by checking records in the system), the data might still be misused.
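
A small sketch of the substitution described above – replacing the action owner with a random value before sharing the report, while still allowing aggregation per owner. The column names are assumptions, and, as noted, the mapping itself must be protected or discarded, otherwise the data can still be misused:

```python
import uuid
import pandas as pd

po_lines = pd.DataFrame({
    "PO_NUMBER": ["PO-100", "PO-101", "PO-102"],
    "BUYER":     ["jdoe", "asmith", "jdoe"],
    "AMOUNT":    [1000.0, 250.0, 400.0],
})

# One random token per action owner, so performance can still be aggregated
mapping = {name: uuid.uuid4().hex[:8] for name in po_lines["BUYER"].unique()}
po_lines["BUYER"] = po_lines["BUYER"].map(mapping)

print(po_lines.groupby("BUYER")["AMOUNT"].sum())
```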

11 November 2008

🗄️Data Management: Data Quality (Part I: Information Systems' Perspective)


One LinkedIn user brought to attention the fact that, according to top IT managers, the top two reasons why CRM investments fail are: (1) managing resistance within the organization; (2) bad data quality.

The two reasons are common not only to CRM or BI solutions but also to other Information Systems, though of the two, data quality usually has the bigger impact. Especially in ERP systems data quality continues to be a problem, and here are a few reasons:
  • Processes span different functions and/or roles, each of them maintaining the data they are interested in, without any agreement or coordination on the ownership. The lack of ownership is in general management’s fault.
  • Within an enterprise many systems come to be integrated, the quality of the data depending on the quality and scope of the integrations, whether they were addressed fully or only superficially. Few integrations are stable and properly designed. Even if stability can be obtained in time, the scope is seldom changed as it involves further investments, and thus the remaining data need to be maintained manually, respectively the issues need to be troubleshot or left to accumulate in the backlog.
  • There are systems which are not integrated but use the same data, users needing to duplicate their effort, so they often focus on their immediate needs. Moreover, the lack of mappings between systems makes data analysis and review difficult. 
  • The lack of knowledge about the systems used in terms of processes, procedures, best practices, policies, etc. Users usually try to do their best based on the knowledge they have, and despite their best intent, the systems end up being misused just to get things done.
  • Basic or inexistent validation for data entry at each important entry point (UI, integration interfaces, bulk upload functionality), system permissiveness (allowing workarounds), stability and reliability (bugs/defects); see the validation sketch after this list.
  • Inexistence of data quality control mechanisms or quality methodologies, respectively a Data and/or Quality Management strategy. If the data quality is not kept under review, it can easily decrease over time. 
  • The lack of a data culture and processes that support data quality.
  • People lack the consistency and/or the self-discipline to follow the processes and update the data as the processes require, and not only the data needed to move to the next or final step. Therefore, the gap between reality and what the system presents is considerable.
  • People are not motivated to improve data quality even if they may recognize the importance of doing that.
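
As referenced in the list above, a minimal, illustrative sketch of validating a record at an entry point (UI, interface, bulk upload) before it is accepted; the field names and rules are assumptions, not those of a particular ERP system:

```python
from datetime import date

def validate_vendor_record(record: dict) -> list[str]:
    """Return the list of validation errors for a vendor record."""
    errors = []
    if not record.get("VENDOR_NAME"):
        errors.append("VENDOR_NAME is mandatory")
    if not record.get("COUNTRY") or len(record["COUNTRY"]) != 2:
        errors.append("COUNTRY must be a 2-letter ISO code")
    if record.get("VALID_FROM") and record["VALID_FROM"] > date.today():
        errors.append("VALID_FROM cannot lie in the future")
    return errors

# Usage: reject (or at least flag) the record instead of letting bad data accumulate
print(validate_vendor_record({"VENDOR_NAME": "", "COUNTRY": "Germany"}))
```
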
Data quality is usually ignored in BI projects, because few are those who go and search for the causes, it being easier to blame the BI solution or the technical team than to do something about it. This is one of the reasons why users are reticent about using a BI solution, to which the solution’s flexibility and the degree to which it satisfies users’ needs add up. On the other side, BI solutions are often abused, for example by including reports which have OLTP characteristics, or by providing too much unstructured or inadequate content that needs to be further reworked.

Data quality comes onto the managers’ agenda especially during ERP implementations. Unfortunately, as soon as it appears there, it also disappears, despite the warnings about the consequences poor data quality might have on the implementation and on further data use. An ERP implementation is supposed to be an opportunity for improving data quality, though for many organizations it remains just that – an opportunity. Once it passes, organizations need far more financial and human resources to achieve even a fraction of what was missed.

The above topics are complex and need further discussion (see [1], [2]).


Written: Nov-2008, Last Reviewed: Mar-2024

Resources:
[1] SQL-Troubles (2010) Data Management: Data Quality - An Introduction (link)
[2] SQL-Troubles (2012) Data Migration: Data Quality’s Perspective I - A Bird’s-Eye View (link)
