27 January 2025

🗄️🗒️Data Management: Data Quality Dimensions [Notes]

Disclaimer: This is work in progress intended to consolidate information from various sources for learning purposes. 

Last updated: 27-Jan-2025

[Data Management] Data quality dimensions

  • {def} features of data that can be measured or assessed against defined standards to determine the quality of data
    • captures a specific aspect of general data quality
      • can refer to data values or to their schema
  • {type} hard dimensions
    • dimensions that can be measured
  • {type} soft dimensions
    • dimensions that can be measured only indirectly
      • ⇐ through interviews with data users or through any other kind of communication with users
    • dimensions whose measurement depends on the perception of the users of the data
  • {dimension} uniqueness [post]
    • the degree to which a value or set of values is unique within a dataset
      • can be determined based on a set of values supposed to be unique across the whole dataset
        • some systems have a artificial, respectively natural unique identified
      • measured in terms of either
        • the percentage of unique values available in a dataset 
        • the percentage of duplicate values available in a dataset
    • the impossibility of identifying whether a value is unique increases the chances for it to be duplicated
    • it can have broader implications
      • aggregated information is not shown correctly
        • ⇐ split across different entities
      • can lead to further duplicates in other areas
    • {recommendation} enforce uniqueness by design, if possible
    • {recommendation} check the data regularly for duplicates and disable or delete the duplicated records
      • ⇐ one should make sure that the records can't be further reused in business processes or analytics workloads
  • {dimension} completeness [post]
    •  the extent to which there are missing data in a dataset
      • ⇐ reflected in the number of the missing values
        • measured as percentage of the missing values compared to the total
    • determined by the presence of NULL values  
      • {type} attribute completeness 
        • the number of NULLs in a specific attribute
      • {type} tuple completeness 
        • the number of unknown values of the attributes in a tuple
      • {type} relation completeness 
        • the number of tuples with unknown attribute values in the relation
      • {type} value completeness
        • makes sense for complex, semi-structured columns such as XML data type columns
          • e.g. a complete element or attribute can be missing
    • considered in report to 
      • mandatory attributes
        • attributes that need a not-Null value for each record
      • optional attributes
        • attributes that not necessarily need to be provided
      • inapplicable attributes
        • attributes not applicable (relevant) for certain scenarios by design
  • {dimension} conformity (aka format compliance) [post]
    • {def} the extent data are in the expected format
      • dependent on the data type and its definition
    • can be associated with a set of metadata 
      • data type
        • e.g. text, numeric, alphanumeric, positive, date
      • length
      • precision
      • scale
      • formatting patterns 
        • e.g. phone number, decimal and digit grouping symbols
        • different formatting might apply based on various business rules 
        • can use delimiters
    • {recommendation} define the data type and further constraints to enforce the various characteristics of the element
    • {recommendation} make sure that the delimiters don't overlap with other uses
  • {dimension} accuracy [post]
    • {def} the extent data is correct, respectively match the reality with an acceptable level of approximation
    • stricter than just conforming to business rules
    • can be measured at column and table level
      • [discrete data values] 
        • use frequency distribution of values
          • a value with very low frequency is probably incorrect
      • [alphanumeric values]
        • use string length distribution
          • a  string with a very atypical length is potentially incorrect
        • try to find patterns and then create pattern distribution.
          • patterns with low frequency probably denote wrong values
      • [continuous attributes]
        • use descriptive statistics
          • just by looking at minimal and maximal values, you can easily spot potentially problematic data
  • {dimension} consistency [post]
    • {def} the degree of uniformity, standardization, and freedom from contradiction among the documents or parts of a system or component
      • {type} notational consistency
        • the extent (data) values are consistent in notation
      • {type} semantic consistency
        • the degree to which data has unique meaning
        • is more restrictive than the notational consistency
    • measures the equivalence of information stored in various repositories
    • involves comparing values with a predefined set of possible values
      •  from the same or from different systems
    • can be measured at column and table level
    • can have different scopes
      • cross-system consistencies
        • among systems or data repositories
      • cross-record consistency
        • within the same repository
      • temporal consistency
        •  within the same record at different points in time
  • {dimension} timeliness [post]
    • tells the degree to which data is current and available when needed
      • there is always some delay between change in the real world and the moment when this change is entered into a system
    • stale data/obsolete data
  • {dimension} structuredness [post]
    • the degree to which a data structure or model possesses a definite pattern of organization of its interdependent parts
    • allows the categorization of data as
      • structured data [def]
        • refers to structures that can be easily perceived or known, that raises no doubt on structure’s delimitations
      • unstructured data [def]
        • refers to textual data and media content (video, sound, images), in which the structural patterns even if exist they are hard to discover or not predefined
      • semi-structured data [def]
        • refers to islands of structured data stored with unstructured data, or vice versa
      • ⇐ the more structured the data, the easier it is to be processed
  • {dimension} referential integrity [post]
    • {def} the degree to which the values of a key in one table (aka reference value) match the values of a key in a related table (aka the referenced value)
    • it's an architectural concept of the database 
    • {recommendation} keep the referential integrity of a system by design
      • some systems build logic for assuring the referential integrity in the applications and not in the database
  • {dimension} currency (aka actuality)
    •  the extent to which data is actual
    •  can be considered as a special type of accuracy 
      • ⇐ when the data is not actual then it doesn’t reflect reality
  • {dimension} ease of use
    • the extent to which data can be used for a given purpose
      • usually it refers to whether the data can be processed as needed
      • depends on the application or on the user interface
  • {dimension} fitness of use
    • the degree to which the data is fit for use
      • the data may have good quality for a given purposes but 
        • not usable for other purposes
        • can be used as substitute for other data
          • e.g. use phone area codes instead of ZIP codes to locate customers approximately
  • {dimension} trustfulness [post]
    • the degree to which the data can be trusted
      • is a matter of perception 
        • ask users whether they trust the data and which are the reasons
          • if the users don’t trust the data 
            • they will create their own solutions
            • they will not use applications
  • {dimension} entropy
    • {def} the average amount of information conveyed
      • ⇐ quantification of information in a system
      • ⇐ the more dispersed the values and the more the frequency distribution of a discrete column is equally spread among the values, the more information is available [1]
      • ⇐ can tell whether your data is suitable for analysis or not
    • can be measured at column and table level
  • {dimension} presentation quality 
    • applicable to applications that presents data
      • format and appearance should support the appropriate use of data
      • depends on the UI used
  • {recommendation} have a dedicated system for maintaining the master data and broadcast the data to the subscribers as needed
    • the data should be exclusively managed though the management system
    • {anti-pattern} data is modified in the subscribers and the changes aren't always reflected back to the source system
Previous Post <<||>>  Next Post

References:
[1] Dejan Sarka et al (2012) Exam 70-463: Implementing a Data Warehouse with Microsoft SQL Server 2012 (Training Kit)

No comments:

Related Posts Plugin for WordPress, Blogger...

About Me

My photo
Koeln, NRW, Germany
IT Professional with more than 24 years experience in IT in the area of full life-cycle of Web/Desktop/Database Applications Development, Software Engineering, Consultancy, Data Management, Data Quality, Data Migrations, Reporting, ERP implementations & support, Team/Project/IT Management, etc.