18 January 2010

Data Quality Dimensions - Consistency

    Consistency refers to the extent to which values are consistent in notation. This usually presupposes the existence of a predefined list of values (LOV), a data dictionary, an ontology or any other form of knowledge representation (e.g. charts, diagrams) that can be used to “enforce” data consistency. “Enforce” is perhaps not the best term to describe the state of the art, because the two data sets (the data being checked and the reference data) could be disconnected from each other, in which case it is the Users’ responsibility to ensure the overall consistency, or they could be integrated using specific techniques. In most cases the values taken by one attribute are checked against an existing LOV. However, for data formed from multiple segments (e.g. accounts), each segment might need to be checked against a specific data set or rule generator; such mechanisms imply multi-attribute mappings or association rules that specify the possible values.
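
    As a minimal sketch of the two kinds of checks described above, the Python snippet below validates a single attribute against a LOV and a segmented account code against per-segment LOVs plus one cross-segment rule. All names, the sample values and the "-" separator are assumptions made for illustration; in practice the lists would come from a data dictionary or a master system.

```python
# Hypothetical list of values (LOV) for a single attribute.
COUNTRY_LOV = {"DE", "FR", "US"}

# Hypothetical per-segment LOVs for an account code such as "100-20-300".
SEGMENT_LOVS = [
    {"100", "200"},  # segment 1: company
    {"10", "20"},    # segment 2: department
    {"300", "400"},  # segment 3: natural account
]

# Hypothetical multi-attribute rule: allowed (company, department) pairs.
ALLOWED_COMPANY_DEPARTMENT = {("100", "10"), ("100", "20"), ("200", "10")}


def check_against_lov(value, lov):
    """Return True if a single attribute value appears in its LOV."""
    return value in lov


def check_account(account, separator="-"):
    """Return a list of problems found in a segmented account code."""
    segments = account.split(separator)
    if len(segments) != len(SEGMENT_LOVS):
        return [f"expected {len(SEGMENT_LOVS)} segments, got {len(segments)}"]

    # Check each segment against its own LOV.
    problems = [
        f"segment {position} has unknown value {segment!r}"
        for position, (segment, lov) in enumerate(zip(segments, SEGMENT_LOVS), start=1)
        if segment not in lov
    ]

    # Only when every segment is valid, check the cross-segment rule.
    if not problems and (segments[0], segments[1]) not in ALLOWED_COMPANY_DEPARTMENT:
        problems.append(f"combination {segments[0]}-{segments[1]} is not allowed")
    return problems
```

    An empty list returned by check_account means the code passed both the per-segment and the cross-segment checks.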

    In general, consistency can be considered between two distinct data sets or distinct systems (typically with one of them functioning as master), between sets of attributes within the same record (record-level consistency), between sets of attributes from different records (cross-record consistency), or within the same record at different points in time (temporal consistency) [3]. It makes sense to talk about consistency especially with regard to master data, where it is important to keep one or more attributes in the various systems consistent with the master data.
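
    A check of a dependent system against the master could look roughly like the sketch below. It assumes both data sets are available as dictionaries keyed by a shared identifier, which is a simplification; the function name and the "<missing in master>" marker are hypothetical.

```python
def find_inconsistencies(master, other, attributes):
    """Compare selected attributes of records in `other` with the master.

    Both arguments map a shared record key to a dict of attribute values.
    Returns tuples of (key, attribute, master_value, other_value).
    """
    issues = []
    for key, record in other.items():
        master_record = master.get(key)
        if master_record is None:
            # The record exists in the dependent system but not in the master.
            issues.append((key, "<missing in master>", None, None))
            continue
        for attribute in attributes:
            master_value = master_record.get(attribute)
            other_value = record.get(attribute)
            if master_value != other_value:
                issues.append((key, attribute, master_value, other_value))
    return issues
```

    The same comparison could be applied to two snapshots of one data set taken at different points in time to approximate a temporal consistency check.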

    As [1] also highlights, there are two aspects of consistency: structural consistency, in which two or more values can be distinct in notation but have the same meaning (e.g. missing vs. n/a), and semantic consistency, in which each value has a unique meaning (for example, only n/a is allowed to flag missing values). The aim should be semantically consistent data, in order to avoid confusion and the accidental exclusion of data during filtering or reporting. More and more organizations are investing in ontologies, which help ensure the semantic consistency of concepts/entities, though in most cases simple single- or multi-attribute lists of values are enough.
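
    To illustrate moving towards semantic consistency, the sketch below maps several notations for a missing value onto the single allowed one. The synonym list is an assumption; a real list would have to be agreed with the data owners.

```python
# The single notation allowed for missing values (taken from the example above).
CANONICAL_MISSING = "n/a"

# Hypothetical notations that are treated as "missing"; compared in lower case.
MISSING_SYNONYMS = {"", "missing", "unknown", "na", "n.a.", "none", "n/a"}


def normalize_missing(value):
    """Map every notation for a missing value to the canonical one."""
    if value is None:
        return CANONICAL_MISSING
    text = str(value).strip()
    if text.lower() in MISSING_SYNONYMS:
        return CANONICAL_MISSING
    return text
```

    After such a normalization a filter on "n/a" finds all missing values, instead of silently skipping the records that used another notation.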

    Several sources (e.g. [2]) consider Codd’s referential integrity constraint a type of consistency. In support of this idea, referential integrity can be used to solve data consistency issues by bringing the various LOVs into the systems. Still, I would prefer to keep the two concepts separate, because referential integrity is mainly an architectural concept, even if it involves the “consistency” of foreign key/primary key pairs.
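
    How referential integrity can bring a LOV into a system may be sketched with Python's built-in sqlite3 module, as below. The table and column names are hypothetical, and the example assumes an SQLite build with foreign key support, where enforcement has to be switched on per connection.

```python
import sqlite3


def build_database():
    """Create an in-memory database in which the LOV is a referenced table."""
    connection = sqlite3.connect(":memory:")
    # SQLite enforces foreign keys only when this pragma is enabled.
    connection.execute("PRAGMA foreign_keys = ON")
    connection.execute("CREATE TABLE country_lov (code TEXT PRIMARY KEY)")
    connection.execute(
        "CREATE TABLE customer ("
        " id INTEGER PRIMARY KEY,"
        " country_code TEXT NOT NULL REFERENCES country_lov (code))"
    )
    connection.executemany(
        "INSERT INTO country_lov (code) VALUES (?)", [("DE",), ("FR",)]
    )
    connection.commit()
    return connection


def try_insert_customer(connection, customer_id, country_code):
    """Return True if the row was accepted, False if the constraint rejected it."""
    try:
        connection.execute(
            "INSERT INTO customer (id, country_code) VALUES (?, ?)",
            (customer_id, country_code),
        )
        return True
    except sqlite3.IntegrityError:
        return False
```

    Here the database rejects any country code that is not in the LOV table, so the foreign key acts as the consistency check, which is the overlap between the two concepts discussed above.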

[1] Chapman A.D. (2005). Principles of Data Quality, version 1.0. Report for the Global Biodiversity Information Facility, Copenhagen.
[2] Lee Y.W., Pipino L.L., Funk J.D., Wang R.Y. (2006). Journey to Data Quality. MIT Press. ISBN: 0-262-12287-1.
[3] Loshin D. (2009). Master Data Management. Morgan Kaufmann OMG Press. ISBN: 978-0-12-374225-4.
