Showing posts with label Conformity. Show all posts
Showing posts with label Conformity. Show all posts

27 January 2025

🗄️🗒️Data Management: Data Quality Dimensions [Notes]

Disclaimer: This is work in progress intended to consolidate information from various sources for learning purposes. 

Last updated: 27-Jan-2025

[Data Management] Data quality dimensions

  • {def} features of data that can be measured or assessed against defined standards to determine the quality of data
    • captures a specific aspect of general data quality
      • can refer to data values or to their schema
  • {type} hard dimensions
    • dimensions that can be measured
  • {type} soft dimensions
    • dimensions that can be measured only indirectly
      • ⇐ through interviews with data users or through any other kind of communication with users
    • dimensions whose measurement depends on the perception of the users of the data
  • {dimension} uniqueness [post]
    • the degree to which a value or set of values is unique within a dataset
      • can be determined based on a set of values supposed to be unique across the whole dataset
        • some systems have a artificial, respectively natural unique identified
      • measured in terms of either
        • the percentage of unique values available in a dataset 
        • the percentage of duplicate values available in a dataset
    • the impossibility of identifying whether a value is unique increases the chances for it to be duplicated
    • it can have broader implications
      • aggregated information is not shown correctly
        • ⇐ split across different entities
      • can lead to further duplicates in other areas
    • {recommendation} enforce uniqueness by design, if possible
    • {recommendation} check the data regularly for duplicates and disable or delete the duplicated records
      • ⇐ one should make sure that the records can't be further reused in business processes or analytics workloads
  • {dimension} completeness [post]
    •  the extent to which there are missing data in a dataset
      • ⇐ reflected in the number of the missing values
        • measured as percentage of the missing values compared to the total
    • determined by the presence of NULL values  
      • {type} attribute completeness 
        • the number of NULLs in a specific attribute
      • {type} tuple completeness 
        • the number of unknown values of the attributes in a tuple
      • {type} relation completeness 
        • the number of tuples with unknown attribute values in the relation
      • {type} value completeness
        • makes sense for complex, semi-structured columns such as XML data type columns
          • e.g. a complete element or attribute can be missing
    • considered in report to 
      • mandatory attributes
        • attributes that need a not-Null value for each record
      • optional attributes
        • attributes that not necessarily need to be provided
      • inapplicable attributes
        • attributes not applicable (relevant) for certain scenarios by design
  • {dimension} conformity (aka format compliance) [post]
    • {def} the extent data are in the expected format
      • dependent on the data type and its definition
    • can be associated with a set of metadata 
      • data type
        • e.g. text, numeric, alphanumeric, positive, date
      • length
      • precision
      • scale
      • formatting patterns 
        • e.g. phone number, decimal and digit grouping symbols
        • different formatting might apply based on various business rules 
        • can use delimiters
    • {recommendation} define the data type and further constraints to enforce the various characteristics of the element
    • {recommendation} make sure that the delimiters don't overlap with other uses
  • {dimension} accuracy [post]
    • {def} the extent data is correct, respectively match the reality with an acceptable level of approximation
    • stricter than just conforming to business rules
    • can be measured at column and table level
      • [discrete data values] 
        • use frequency distribution of values
          • a value with very low frequency is probably incorrect
      • [alphanumeric values]
        • use string length distribution
          • a  string with a very atypical length is potentially incorrect
        • try to find patterns and then create pattern distribution.
          • patterns with low frequency probably denote wrong values
      • [continuous attributes]
        • use descriptive statistics
          • just by looking at minimal and maximal values, you can easily spot potentially problematic data
  • {dimension} consistency [post]
    • {def} the degree of uniformity, standardization, and freedom from contradiction among the documents or parts of a system or component
      • {type} notational consistency
        • the extent (data) values are consistent in notation
      • {type} semantic consistency
        • the degree to which data has unique meaning
        • is more restrictive than the notational consistency
    • measures the equivalence of information stored in various repositories
    • involves comparing values with a predefined set of possible values
      •  from the same or from different systems
    • can be measured at column and table level
    • can have different scopes
      • cross-system consistencies
        • among systems or data repositories
      • cross-record consistency
        • within the same repository
      • temporal consistency
        •  within the same record at different points in time
  • {dimension} timeliness [post]
    • tells the degree to which data is current and available when needed
      • there is always some delay between change in the real world and the moment when this change is entered into a system
    • stale data/obsolete data
  • {dimension} structuredness [post]
    • the degree to which a data structure or model possesses a definite pattern of organization of its interdependent parts
    • allows the categorization of data as
      • structured data [def]
        • refers to structures that can be easily perceived or known, that raises no doubt on structure’s delimitations
      • unstructured data [def]
        • refers to textual data and media content (video, sound, images), in which the structural patterns even if exist they are hard to discover or not predefined
      • semi-structured data [def]
        • refers to islands of structured data stored with unstructured data, or vice versa
      • ⇐ the more structured the data, the easier it is to be processed
  • {dimension} referential integrity [post]
    • {def} the degree to which the values of a key in one table (aka reference value) match the values of a key in a related table (aka the referenced value)
    • it's an architectural concept of the database 
    • {recommendation} keep the referential integrity of a system by design
      • some systems build logic for assuring the referential integrity in the applications and not in the database
  • {dimension} currency (aka actuality)
    •  the extent to which data is actual
    •  can be considered as a special type of accuracy 
      • ⇐ when the data is not actual then it doesn’t reflect reality
  • {dimension} ease of use
    • the extent to which data can be used for a given purpose
      • usually it refers to whether the data can be processed as needed
      • depends on the application or on the user interface
  • {dimension} fitness of use
    • the degree to which the data is fit for use
      • the data may have good quality for a given purposes but 
        • not usable for other purposes
        • can be used as substitute for other data
          • e.g. use phone area codes instead of ZIP codes to locate customers approximately
  • {dimension} trustfulness [post]
    • the degree to which the data can be trusted
      • is a matter of perception 
        • ask users whether they trust the data and which are the reasons
          • if the users don’t trust the data 
            • they will create their own solutions
            • they will not use applications
  • {dimension} entropy
    • {def} the average amount of information conveyed
      • ⇐ quantification of information in a system
      • ⇐ the more dispersed the values and the more the frequency distribution of a discrete column is equally spread among the values, the more information is available [1]
      • ⇐ can tell whether your data is suitable for analysis or not
    • can be measured at column and table level
  • {dimension} presentation quality 
    • applicable to applications that presents data
      • format and appearance should support the appropriate use of data
      • depends on the UI used
  • {recommendation} have a dedicated system for maintaining the master data and broadcast the data to the subscribers as needed
    • the data should be exclusively managed though the management system
    • {anti-pattern} data is modified in the subscribers and the changes aren't always reflected back to the source system
Previous Post <<||>>  Next Post

References:
[1] Dejan Sarka et al (2012) Exam 70-463: Implementing a Data Warehouse with Microsoft SQL Server 2012 (Training Kit)

13 January 2010

🗄️Data Management: Data Quality Dimensions (Part II: Conformity)

Data Management
Data Management Series

Conformity or format compliance, as named by [1], refers to the extent data are in the expected format, each attribute being associated with a set of metadata like type (e.g. text, numeric, alphanumeric, positive), length, precision, scale, or any other formatting patterns (e.g. phone number, decimal and digit grouping symbols).

Because distinct decimal, digit grouping, negative sign and currency symbols can be used to represent numeric values, same as different date formats could be used alternatively (e.g. dd-mm-yyyy vs. mm-dd-yyyy), the numeric and date data types are highly sensitive to local computer and general applications settings because the same attribute could be stored, processed and represented in different formats. Therefore, it’s preferable to minimize the variations in formatting by applying the same format to all attributes having the same data type and, whenever is possible, the format should not be confusing. 

For example all the dates in a data set or in a set of data sets being object of the same global context (e.g. data migration, reporting) should have the same format, being preferred a format of type dd-mon-yyyy which, ignoring the different values the month could have for different language settings, it lets no space for interpretations (e.g. 01-10-2009 vs. 10-01-2009). There are also situations in which the constraints imposed by the various applications used restraints the flexibility of working adequately with the local computer formats.

If for decimal and dates there are a limited number of possibilities that can be dealt with, for alphanumeric values things change drastically because excepting the format masks that could be used during data entry, the adherence to a format depends entirely on the Users and whether they applied the formatting standards defined. In the absence of standards, Users might come with their own encoding, and even then, they might change it over time. 

The use of different encodings could be also required by the standards within a specific country, organization, or other type of such entity. All these together makes from alphanumeric attributes the most often candidate for data cleaning, and the business rules used can be quite complex, needing to handle each specific case. For example, the VAT code could have different length from country to country, and more than one encoding could be used reflecting the changes in formatting policy.

In what concerns the format, the alphanumeric attributes offer greater flexibility than the decimal and date attributes, and their formatting could be in theory ignored unless they are further parsed by other applications. However, considering that such needs change over time, it’s advisable to standardize the various formats used within an organization and use 'standard' delimiters for formatting the various chunks of data with a particular meaning within an alphanumeric attribute, fact that could reduce considerably the volume of overwork needed in order to cleanse the data for further processing. An encoding could be done without the use of delimiters, e.g. when the length of each chunk of data is the same, though chunk length-based formatting could prove to be limited when the length of a chunk changes.

Note:
Delimiters should be chosen from the characters that will never be used in the actual chunks of data or in the various applications dealing with the respective data. For example pipe (“|”) or semicolon (“;”) could be good candidates for such a delimiter though they are often used as delimiters when exporting the data to text files, therefore it’s better to use a dash (“-”) or even a combinations of characters (e.g. “.-.”) when a dash is not enough, while in some cases even a space or a dot could be used as delimiter.


Written: Jan-2010, Last Reviewed: Mar-2024 

References:
[1] David Loshin (2009) "Master Data Management", Morgan Kaufmann OMG Press. ISBN 978-0-12-374225-4.
Related Posts Plugin for WordPress, Blogger...

About Me

My photo
Koeln, NRW, Germany
IT Professional with more than 24 years experience in IT in the area of full life-cycle of Web/Desktop/Database Applications Development, Software Engineering, Consultancy, Data Management, Data Quality, Data Migrations, Reporting, ERP implementations & support, Team/Project/IT Management, etc.