13 January 2010

Data Quality Dimensions – Conformity

    Conformity or format compliance as named by [1] refers to the extent data are in the expected format, each attribute being associated with a set of metadata like type (e.g. text, numeric, alphanumeric, positive), length, precision, scale or any other formatting patterns (e.g. phone number, decimal and digit grouping symbols).

    Because distinct decimal, digit grouping, negative sign and currency symbols can be used to represent numeric values, same as different date formats could be used (e.g. dd-mm-yyyy vs. mm-dd-yyyy), the numeric and date data types are highly sensitive to local computer and general applications settings because the same attribute could be stored, processed and represented in different formats. Therefore it’s preferable to minimize the variations in formatting by applying the same format to all attributes having the same data type and, in addition, whenever is possible the format should not be confusing - for example all the dates in a data set or in a set of data sets being object of the same global context (e.g. data migration, reporting) should have the same format, being preferred a format of type dd-mon-yyyy which, ignoring the different values the month could have for different language settings, it lets not space for interpretations (e.g. 01-10-2009 vs. 10-01-2009). There are also situations in which the constraints imposed by the various applications used restraints the flexibility of working adequately with the local computer formats.

    If for decimal and dates there are a limited number of possibilities that can be dealt with, for alphanumeric values things change drastically because excepting the format masks that could be used during data entry, the adherence to a format depends entirely on the user and whether he/she applies the eventual formatting standards defined. Often each user might come with its own encoding, and even then he might change it over time too. The use of different encodings could be also required by the standards within a specific country, organization or other type of such entity. All these together makes from alphanumeric attributes the most often candidate for data cleaning, the business rules used could be quite complex, needing to handle each specific case. For example the VAT code could have different length from country to country, and more than one encoding could be used reflecting the changes in formatting policy.

    In what concerns the format, the alphanumeric attributes offer greater flexibility than the decimal and date attributes, and their formatting could be in theory ignored unless they are further parsed by other applications. However considering that such needs change over time, it’s advisable to standardize the various formats used within an organization and use “standard” delimiters for delimiting the various chunks of data with a particular meaning within an alphanumeric attribute, fact that could reduce considerably the volume of overwork needed in order to cleanse the data for further processing. An encoding could be done without the use of delimiters, typically when the length of each chunk of data is the same, though chunk length based formatting could prove to be limited when the length of a chunk changes.

Note:
    Delimiters should be chosen from the characters that will never be used in the actual chunks of data or in the various applications dealing with the respective data. For example pipe (“|”) or semicolon (“;”) could be good candidates for such a delimiter though they are often used as delimiters when exporting the data to text files, therefore it’s better to use a dash (“-”) or even a combinations of characters (e.g. “.-.”) when a dash is not enough, while in some cases even a space or a dot could be used as delimiter.

References:
[1] Loshin D. (2009). Master Data Management. Morgan Kaufmann OMG Press. ISBN 978-0-12-374225-4.

No comments: