About Me

IT Professional with more than 16 years experience in IT especially in the area of full life-cycle of Web/Desktop Applications Development, Database Development, Software Engineering, Consultancy, Data Management, Data Quality, Data Migrations, Reporting, ERP support, etc.

Wednesday, January 13, 2010

Data Quality Dimensions – Uniqueness

    Uniqueness refers to “requirements that entities modeled within the master environment are captured, represented, and referenced uniquely within the relevant application architectures” [1]. An alternative form used for uniqueness dimension is the one of duplicates, which stresses the existence of duplicate records within a data set, the not-uniqueness, being a better indicator for the nonconformance especially when considering datasets.

    Why is required to enforce the uniqueness of entities? An entity is defined using a number of attributes representing entity’s characteristics, in case the attributes of two entities have the same values, then more likely the two representations refer to the same entity. This could happen in most of the cases, though there are situations in which the attribute(s) that make(s) it possible to differentiate between two distinct entities is/are not adequately maintained or not considered at all; the impossibility of identifying uniquely an entity increases the chances of using one of the respective entities wrongly, for example booking the Invoice against the wrong Vendor and all the implications derived from it. In theory for each type of entity there is in theory one or more attributes that allow identifying uniquely the entity, for example in case of a Vendor could be Vendor’s name and address, the more such attributes then more difficult becomes the identification of a Vendor; therefore even if such a set of attributes exists, like in Vendor’s case, it’s preferable to use instead a unique identifier, a numeric or alphanumeric key that identifies uniquely an entity. For example a Vendor could be uniquely identified by the Vendor ID, though that allows unique identification of a Vendor only in a data repository, the chances being quite high to have another Vendor ID for the same entity in another data repository. Therefore in order to guarantee the uniqueness of entities is prefer to use instead an attribute that has the same value indifferently from the data repository the entity is stored in (e.g. Vendor Number, Item Number, Customer Number, Asset Number, Sales Order Number, etc.), such attributes could be enforced to be unique across a set of data repositories, though one of them must function as master.

    In theory we could have multiple identifiers for the same entity, though this can easily create confusion, especially when this happens within the same system and people or machines are not aware that the respective identifiers refer to the same entity, and the more identifiers we have for the same entity the higher the chances of creating confusion. Imagine that in the same system you book some of the Invoices against one of the identifiers, and the remaining Invoices against another identifier of the same entity. Especially in reports this might be quite a problem as amounts that should appear for the same entity are split against two references, and even if they refer to the same entity report’s users might be not aware of it, the calculations based on such numbers not reflecting the reality. Imagine that you have booked the invoices against two references to the same Vendor and you want to consider the first 10 Vendors with the highest volume spent; it might happen that the aggregate amounts for each of the respective references didn’t make it in the top 10, though when considered together they can even make it to the first position.

    There are even situations in which the use of multiple identifiers for the same entity is inevitable, and I’m referring here especially to the cases in which more than one different system refer to the same entity in their unique way, often being not designed to use a global unique identifier for an entity. Under such circumstances what an organization could do is to either extend the system(s) to store the global identifier or have a mapping in place with the referents used to indicate the same entity. The more systems within an enterprise that use/represent the same entity, the more complex is to manage the correct referencing; the logical thing to do is to have in place a master system that stores the global identifier and integrate it with the other systems in place or at least store the global identifier in each of the systems. Such identifiers used for master or even transactional data allow creating consolidated reports based on several systems and mitigating possible issues existing in the data.

    Even if such keys are designed to enforce uniqueness this doesn’t mean it necessarily happens; the reason is simple - behind the assignment of a key to an entity there is a process defined, and no matter how robust a process was created if it doesn’t cover all the scenarios that might occur or if it can’t deal (adequately) with various constraints, then in one point in time a flow might lead to a duplicate, a not unique record. The simplest such example I can recall is Material Number creation, supposing that a Material Number XYZ is created in an ERP system and because of an error made one or more of the respective Material’s attributes are incorrect. In most of the cases such errors can be corrected, though each system comes with its own constraints not allowing changing certain attributes (e.g. Unit of Measure). Because an entity with the wrong attributes is inadequate to be used, a new Material Number is created for the same entity. Some systems could allow the deletion of the first entered Material though that’s not always possible. Even if the two Materials Numbers are different they point to the same Material entity, and even if the first entered Material is marked as obsolete, until this happens transactions might be already made with it, and again, it might be possible to revert the transactions back, though that’s not always possible, and that’s not all, once a Material is marked as obsolete, this doesn’t necessarily imply that a User won’t reactivate the Material in the near or far future. As can be seen is getting more and more complicated, not all the systems and processes being designed to handle the various scenarios that might appear in time. If such deviations from uniqueness can’t be corrected in the legacy systems, at least during conversion and migration should be corrected.

[1] Loshin D. (2009). Master Data Management. Morgan Kaufmann OMG Press. ISBN 978-0-12-374225-4.

No comments: