Disclaimer: This is work in progress intended to consolidate information from various sources for learning purposes.
Last updated: 27-Jan-2025
[Data Management] Data quality dimensions
- {def} features of data that can be measured or assessed against defined standards to determine the quality of data
- captures a specific aspect of general data quality
- can refer to data values or to their schema
- {type} hard dimensions
- dimensions that can be measured
- {type} soft dimensions
- dimensions that can be measured only indirectly
- ⇐ through interviews with data users or through any other kind of communication with users
- dimensions whose measurement depends on the perception of the users of the data
- {dimension} uniqueness [post]
- the degree to which a value or set of values is unique within a dataset
- can be determined based on a set of values supposed to be unique across the whole dataset
- some systems have a artificial, respectively natural unique identified
- measured in terms of either
- the percentage of unique values available in a dataset
- the percentage of duplicate values available in a dataset
- the impossibility of identifying whether a value is unique increases the chances for it to be duplicated
- it can have broader implications
- aggregated information is not shown correctly
- ⇐ split across different entities
- can lead to further duplicates in other areas
- {recommendation} enforce uniqueness by design, if possible
- {recommendation} check the data regularly for duplicates and disable or delete the duplicated records
- ⇐ one should make sure that the records can't be further reused in business processes or analytics workloads
- {dimension} completeness [post]
- the extent to which there are missing data in a dataset
- ⇐ reflected in the number of the missing values
- measured as percentage of the missing values compared to the total
- determined by the presence of NULL values
- {type} attribute completeness
- the number of NULLs in a specific attribute
- {type} tuple completeness
- the number of unknown values of the attributes in a tuple
- {type} relation completeness
- the number of tuples with unknown attribute values in the relation
- {type} value completeness
- makes sense for complex, semi-structured columns such as XML data type columns
- e.g. a complete element or attribute can be missing
- considered in report to
- mandatory attributes
- attributes that need a not-Null value for each record
- optional attributes
- attributes that not necessarily need to be provided
- inapplicable attributes
- attributes not applicable (relevant) for certain scenarios by design
- {dimension} conformity (aka format compliance) [post]
- {def} the extent data are in the expected format
- dependent on the data type and its definition
- can be associated with a set of metadata
- data type
- e.g. text, numeric, alphanumeric, positive, date
- length
- precision
- scale
- formatting patterns
- e.g. phone number, decimal and digit grouping symbols
- different formatting might apply based on various business rules
- can use delimiters
- {recommendation} define the data type and further constraints to enforce the various characteristics of the element
- {recommendation} make sure that the delimiters don't overlap with other uses
- {dimension} accuracy [post]
- {def} the extent data is correct, respectively match the reality with an acceptable level of approximation
- stricter than just conforming to business rules
- can be measured at column and table level
- [discrete data values]
- use frequency distribution of values
- a value with very low frequency is probably incorrect
- [alphanumeric values]
- use string length distribution
- a string with a very atypical length is potentially incorrect
- try to find patterns and then create pattern distribution.
- patterns with low frequency probably denote wrong values
- [continuous attributes]
- use descriptive statistics
- just by looking at minimal and maximal values, you can easily spot potentially problematic data
- {dimension} consistency [post]
- {def} the degree of uniformity, standardization, and freedom from contradiction among the documents or parts of a system or component
- {type} notational consistency
- the extent (data) values are consistent in notation
- {type} semantic consistency
- the degree to which data has unique meaning
- is more restrictive than the notational consistency
- measures the equivalence of information stored in various repositories
- involves comparing values with a predefined set of possible values
- from the same or from different systems
- can be measured at column and table level
- can have different scopes
- cross-system consistencies
- among systems or data repositories
- cross-record consistency
- within the same repository
- temporal consistency
- within the same record at different points in time
- {dimension} timeliness [post]
- tells the degree to which data is current and available when needed
- there is always some delay between change in the real world and the moment when this change is entered into a system
- stale data/obsolete data
- {dimension} structuredness [post]
- the degree to which a data structure or model possesses a definite pattern of organization of its interdependent parts
- allows the categorization of data as
- structured data [def]
- refers to structures that can be easily perceived or known, that raises no doubt on structure’s delimitations
- unstructured data [def]
- refers to textual data and media content (video, sound, images), in which the structural patterns even if exist they are hard to discover or not predefined
- semi-structured data [def]
- refers to islands of structured data stored with unstructured data, or vice versa
- ⇐ the more structured the data, the easier it is to be processed
- {dimension} referential integrity [post]
- {def} the degree to which the values of a key in one table (aka reference value) match the values of a key in a related table (aka the referenced value)
- it's an architectural concept of the database
- {recommendation} keep the referential integrity of a system by design
- some systems build logic for assuring the referential integrity in the applications and not in the database
- {dimension} currency (aka actuality)
- the extent to which data is actual
- can be considered as a special type of accuracy
- ⇐ when the data is not actual then it doesn’t reflect reality
- {dimension} ease of use
- the extent to which data can be used for a given purpose
- usually it refers to whether the data can be processed as needed
- depends on the application or on the user interface
- {dimension} fitness of use
- the degree to which the data is fit for use
- the data may have good quality for a given purposes but
- not usable for other purposes
- can be used as substitute for other data
- e.g. use phone area codes instead of ZIP codes to locate customers approximately
- {dimension} trustfulness [post]
- the degree to which the data can be trusted
- is a matter of perception
- ask users whether they trust the data and which are the reasons
- if the users don’t trust the data
- they will create their own solutions
- they will not use applications
- {dimension} entropy
- {def} the average amount of information conveyed
- ⇐ quantification of information in a system
- ⇐ the more dispersed the values and the more the frequency distribution of a discrete column is equally spread among the values, the more information is available [1]
- ⇐ can tell whether your data is suitable for analysis or not
- can be measured at column and table level
- {dimension} presentation quality
- applicable to applications that presents data
- format and appearance should support the appropriate use of data
- depends on the UI used
- {recommendation} have a dedicated system for maintaining the master data and broadcast the data to the subscribers as needed
- the data should be exclusively managed though the management system
- {anti-pattern} data is modified in the subscribers and the changes aren't always reflected back to the source system
Previous Post
<<||>>
Next Post
References:
[1] Dejan Sarka et al (2012) Exam 70-463: Implementing a Data Warehouse with Microsoft SQL Server 2012 (Training Kit)
No comments:
Post a Comment