SQL Troubles: 🗄️Data Management: Data Quality Dimensions (Part V: Consistency)

18 January 2010

🗄️Data Management: Data Quality Dimensions (Part V: Consistency)

Data Management Series

IEEE defines consistency in general as "the degree of uniformity, standardization, and freedom from contradiction among the documents or parts of a system or component" [4]. In respect to data, consistency can be defined thus as the degree of uniformity and standardization of data values among systems or data repositories (aka cross-system consistencies), records within the same repository (aka cross-record consistency), or within the same record at different points in time (temporal consistency) (see [2], [3]).

Unfortunately, uniformity, standardization and freedom can be considered as data quality dimensions as well and they might even have broader scope than the one provided by consistency. Moreover, the definition requires further definitions for the concept to be understood, which is not ideal.

Simply put, consistency refers to the extent (data) values are consistent in notation, respectively the degree to which the data values across different contexts match. For example, one system uses "Free on Board" while another systems uses "FOB" to refer to the point obligations, costs, and risk involved in the delivery of goods shift from the seller to the buyer. The two systems refer to the same value by two different notations. When the two systems are integrated, because of the different values used, the rules defined in the target system might make the record fail because it is expecting another value. Conversely, other systems may import the value as it is, leading thus to two values used in parallel for the same meaning. This can happen not only to reference data, but also to master data, for example when a value deviates slightly from the expected value (e.g. misspelled). A more special case is when one of the systems uses case sensitive values (usually the target system, though there can be also bidirectional data integrations).

One solution for such situations would be to "standardize" the values across systems, though not all systems allow to easily change the values once they have been set. Another solution would be to create a mapping as part the integration, though to maintain such mappings for many cases is suboptimal, but in the end, it might be the only solution. Further systems can be impacted by these issues as well (e.g. data warehouses, data marts).

It's recommended to use a predefined list of values (LOV) - a data dictionary, an ontology or any other type of knowledge representation form that can be used to 'enforce' data consistency. ‘Enforce’ is maybe not the best term because the two data sets could be disconnected from each other, being in Users’ responsibility to ensure the overall consistency, or the two data sets could be integrated using specific techniques. In many cases is checked the consistency of the values taken by one attribute against an existing LOV, though for example for data formed from multiple segments (e.g. accounts) each segment might need to be checked against a specific data set or rule generator, such mechanisms implying multi-attribute mappings or associational rules that specify the possible values.

As highlighted also by [1], there are two aspects of consistency: the structural consistency in which two or more values can be distinct in notation but have the same meaning (e.g. missing vs. n/a), and semantic consistency in which each value has a unique meaning (e.g. only n/a is allowed to highlight missing values). It should be targeted to have the data semantically consistent, to avoid confusion, accidental exclusion of data during filtering or reporting. More and more organizations are investing in ontologies, they allow ensuring the semantic consistency of concepts/entities, though for most of the cases simple single or multi-attribute lists of values are enough.

Previous Post <<||>> Next Post

Written: Jan-2010, Last Reviewed: Mar-2024

References:
[1] Chapman A.D. (2005) "Principles of Data Quality", version 1.0. Report for the Global Biodiversity Information Facility, Copenhagen
[2] David Loshin (2009) Master Data Management

[3] IEEE (1990) "IEEE Standard Glossary of Software Engineering Terminology"

[4] DAMA International (2010) "The DAMA Dictionary of Data Management" 1st Ed.

SQL Troubles

Pages

18 January 2010

🗄️Data Management: Data Quality Dimensions (Part V: Consistency)

No comments:

About Me