
27 January 2025

🗄️🗒️Data Management: Data Quality Dimensions [Notes]

Disclaimer: This is work in progress intended to consolidate information from various sources for learning purposes. 

Last updated: 27-Jan-2025

[Data Management] Data quality dimensions

  • {def} features of data that can be measured or assessed against defined standards to determine the quality of data
    • captures a specific aspect of general data quality
      • can refer to data values or to their schema
  • {type} hard dimensions
    • dimensions that can be measured
  • {type} soft dimensions
    • dimensions that can be measured only indirectly
      • ⇐ through interviews with data users or through any other kind of communication with users
    • dimensions whose measurement depends on the perception of the users of the data
  • {dimension} uniqueness [post]
    • the degree to which a value or set of values is unique within a dataset
      • can be determined based on a set of values supposed to be unique across the whole dataset
        • some systems have an artificial or a natural unique identifier
      • measured in terms of either
        • the percentage of unique values available in a dataset 
        • the percentage of duplicate values available in a dataset
    • the impossibility of identifying whether a value is unique increases the chances for it to be duplicated
    • it can have broader implications
      • aggregated information is not shown correctly
        • ⇐ split across different entities
      • can lead to further duplicates in other areas
    • {recommendation} enforce uniqueness by design, if possible
    • {recommendation} check the data regularly for duplicates and disable or delete the duplicated records
      • ⇐ one should make sure that the records can't be further reused in business processes or analytics workloads
    • ⇐ see the uniqueness sketch after this list
  • {dimension} completeness [post]
    • the extent to which data are missing in a dataset
      • ⇐ reflected in the number of missing values
        • measured as the percentage of missing values compared to the total
    • determined by the presence of NULL values
      • {type} attribute completeness 
        • the number of NULLs in a specific attribute
      • {type} tuple completeness 
        • the number of unknown values of the attributes in a tuple
      • {type} relation completeness 
        • the number of tuples with unknown attribute values in the relation
      • {type} value completeness
        • makes sense for complex, semi-structured columns such as XML data type columns
          • e.g. a complete element or attribute can be missing
    • considered in relation to
      • mandatory attributes
        • attributes that need a non-NULL value for each record
      • optional attributes
        • attributes that don't necessarily need to be provided
      • inapplicable attributes
        • attributes not applicable (relevant) for certain scenarios by design
    • ⇐ see the completeness sketch after this list
  • {dimension} conformity (aka format compliance) [post]
    • {def} the extent to which data are in the expected format
      • dependent on the data type and its definition
    • can be associated with a set of metadata 
      • data type
        • e.g. text, numeric, alphanumeric, positive, date
      • length
      • precision
      • scale
      • formatting patterns 
        • e.g. phone number, decimal and digit grouping symbols
        • different formatting might apply based on various business rules 
        • can use delimiters
    • {recommendation} define the data type and further constraints to enforce the various characteristics of the element
    • {recommendation} make sure that the delimiters don't overlap with other uses
    • ⇐ see the conformity sketch after this list
  • {dimension} accuracy [post]
    • {def} the extent to which data are correct, i.e. match reality with an acceptable level of approximation
    • stricter than just conforming to business rules
    • can be measured at column and table level
      • [discrete data values] 
        • use frequency distribution of values
          • a value with very low frequency is probably incorrect
      • [alphanumeric values]
        • use string length distribution
          • a string with a very atypical length is potentially incorrect
        • try to find patterns and then create a pattern distribution
          • patterns with low frequency probably denote wrong values
      • [continuous attributes]
        • use descriptive statistics
          • just by looking at minimal and maximal values, you can easily spot potentially problematic data
    • ⇐ see the accuracy sketch after this list
  • {dimension} consistency [post]
    • {def} the degree of uniformity, standardization, and freedom from contradiction among the documents or parts of a system or component
      • {type} notational consistency
        • the extent (data) values are consistent in notation
      • {type} semantic consistency
        • the degree to which data has unique meaning
        • is more restrictive than the notational consistency
    • measures the equivalence of information stored in various repositories
    • involves comparing values with a predefined set of possible values
      • from the same or from different systems
    • can be measured at column and table level
    • can have different scopes
      • cross-system consistency
        • among systems or data repositories
      • cross-record consistency
        • within the same repository
      • temporal consistency
        • within the same record at different points in time
    • ⇐ see the consistency sketch after this list
  • {dimension} timeliness [post]
    • the degree to which data is current and available when needed
      • there is always some delay between change in the real world and the moment when this change is entered into a system
    • data that is no longer current is referred to as stale or obsolete data
  • {dimension} structuredness [post]
    • the degree to which a data structure or model possesses a definite pattern of organization of its interdependent parts
    • allows the categorization of data as
      • structured data [def]
        • refers to structures that can be easily perceived or known, raising no doubt about the structure's delimitations
      • unstructured data [def]
        • refers to textual data and media content (video, sound, images), in which structural patterns, even if they exist, are hard to discover or are not predefined
      • semi-structured data [def]
        • refers to islands of structured data stored with unstructured data, or vice versa
      • ⇐ the more structured the data, the easier it is to process
  • {dimension} referential integrity [post]
    • {def} the degree to which the values of a key in one table (aka the referencing value) match the values of a key in a related table (aka the referenced value)
    • it's an architectural concept of the database 
    • {recommendation} keep the referential integrity of a system by design
      • some systems implement the logic for assuring referential integrity in the applications rather than in the database
    • ⇐ see the referential integrity sketch after this list
  • {dimension} currency (aka actuality)
    • the extent to which data is actual (i.e. current)
    •  can be considered as a special type of accuracy 
      • ⇐ when the data is not actual, it doesn't reflect reality anymore
  • {dimension} ease of use
    • the extent to which data can be used for a given purpose
      • usually it refers to whether the data can be processed as needed
      • depends on the application or on the user interface
  • {dimension} fitness of use
    • the degree to which the data is fit for use
      • the data may have good quality for a given purpose but
        • not be usable for other purposes
        • still be usable as a substitute for other data
          • e.g. use phone area codes instead of ZIP codes to locate customers approximately
  • {dimension} trustfulness [post]
    • the degree to which the data can be trusted
      • is a matter of perception 
        • ask the users whether they trust the data and what the reasons are
          • if the users don't trust the data
            • they will create their own solutions
            • they will not use the applications
  • {dimension} entropy
    • {def} the average amount of information conveyed
      • ⇐ quantification of information in a system
      • ⇐ the more dispersed the values and the more the frequency distribution of a discrete column is equally spread among the values, the more information is available [1]
      • ⇐ can tell whether your data is suitable for analysis or not
    • can be measured at column and table level
    • ⇐ see the entropy sketch after this list
  • {dimension} presentation quality 
    • applicable to applications that present data
      • format and appearance should support the appropriate use of data
      • depends on the UI used
  • {recommendation} have a dedicated system for maintaining the master data and broadcast the data to the subscribers as needed
    • the data should be exclusively managed through the management system
    • {anti-pattern} data is modified in the subscribers and the changes aren't always reflected back to the source system
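
The following T-SQL sketches are minimal illustrations of how some of the dimensions above can be measured; all table and column names (Vendors, Customers, Products, etc.) are hypothetical. First, uniqueness, expressed as the percentage of unique, respectively duplicate values in a column:

```sql
-- Uniqueness sketch: percentage of unique vs. duplicate values
-- (NULLs are excluded so that both counts refer to the same population)
SELECT 100.0 * COUNT(DISTINCT VendorNumber) / NULLIF(COUNT(*), 0) AS UniquePercent
     , 100.0 * (COUNT(*) - COUNT(DISTINCT VendorNumber)) / NULLIF(COUNT(*), 0) AS DuplicatePercent
FROM Vendors
WHERE VendorNumber IS NOT NULL;
```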
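
Completeness, measured here as attribute completeness (the percentage of NULLs in a single attribute); tuple and relation completeness can be derived analogously:

```sql
-- Completeness sketch: percentage of missing (NULL) values in an attribute
SELECT 100.0 * SUM(CASE WHEN PhoneNumber IS NULL THEN 1 ELSE 0 END) / NULLIF(COUNT(*), 0) AS MissingPercent
FROM Customers;
```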
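
Conformity can be checked by comparing values against the expected format; the sketch below uses the T-SQL character-class syntax of LIKE to flag values that don't match a hypothetical 5-digit postal code rule:

```sql
-- Conformity sketch: records whose value doesn't match the expected pattern
SELECT CustomerId
     , PostalCode
FROM Customers
WHERE PostalCode NOT LIKE '[0-9][0-9][0-9][0-9][0-9]';
```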
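
Accuracy can be screened via the distributions mentioned above: values or string lengths with a very low frequency are candidates for closer inspection:

```sql
-- Accuracy sketch: frequency distribution of a discrete column
SELECT Color
     , COUNT(*) AS Frequency
FROM Products
GROUP BY Color
ORDER BY Frequency ASC;

-- Accuracy sketch: string length distribution of an alphanumeric column
SELECT LEN(ProductName) AS NameLength
     , COUNT(*) AS Frequency
FROM Products
GROUP BY LEN(ProductName)
ORDER BY NameLength;
```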
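
Consistency against a predefined set of possible values can be checked with an anti-join, here against a hypothetical reference table:

```sql
-- Consistency sketch: values not found in the predefined set of possible values
SELECT DISTINCT c.CountryCode
FROM Customers c
WHERE NOT EXISTS (SELECT 1 FROM CountryCodes r WHERE r.CountryCode = c.CountryCode);
```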
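
Referential integrity is best enforced by design via foreign keys; where the rule lives only in the application, orphaned records can at least be detected with a query:

```sql
-- Referential integrity sketch: enforce by design via a foreign key
CREATE TABLE Vendors (
    VendorId int NOT NULL PRIMARY KEY
);

CREATE TABLE Invoices (
    InvoiceId int NOT NULL PRIMARY KEY
  , VendorId  int NOT NULL REFERENCES Vendors (VendorId)
);

-- Where no foreign key exists, orphaned references can be found like this:
SELECT i.InvoiceId
FROM Invoices i
     LEFT JOIN Vendors v ON v.VendorId = i.VendorId
WHERE v.VendorId IS NULL;
```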
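
Entropy quantifies how the frequency distribution of a discrete column is spread among its values [1]; a sketch of Shannon entropy in bits, with NULLs excluded:

```sql
-- Entropy sketch: -SUM(p * log2(p)) over the column's frequency distribution
WITH Freq AS (
    SELECT COUNT(*) * 1.0 AS Cnt
    FROM Products
    WHERE Color IS NOT NULL
    GROUP BY Color
), Prob AS (
    SELECT Cnt / SUM(Cnt) OVER () AS p
    FROM Freq
)
SELECT -SUM(p * LOG(p) / LOG(2.0)) AS EntropyBits
FROM Prob;
```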

References:
[1] Dejan Sarka et al (2012) Exam 70-463: Implementing a Data Warehouse with Microsoft SQL Server 2012 (Training Kit)

13 January 2010

🗄️Data Management: Data Quality Dimensions (Part I: Uniqueness)


Uniqueness refers to "requirements that entities modeled within the master environment are captured, represented, and referenced uniquely within the relevant application architectures" [1]. An alternative name for the dimension is duplicates, which stresses the existence of duplicate records within a dataset; non-uniqueness is often the better indicator of nonconformance, especially when considering whole datasets.

Why is it required to enforce the uniqueness of entities? An entity is defined using a number of attributes that represent its characteristics; if the attributes of two entities have the same values, then most likely the two representations refer to the same entity. This holds in most cases, though there are situations in which the attribute(s) that would make it possible to differentiate between two distinct entities are not adequately maintained or not considered at all. The impossibility of uniquely identifying an entity increases the chances of using one of the respective entities wrongly, for example booking an Invoice against the wrong Vendor, with all the implications deriving from it.
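
As a minimal illustration, candidate duplicates can be detected by grouping on the attributes supposed to identify an entity uniquely; the Vendors table and its columns below are hypothetical:

```sql
-- Groups of records sharing the same identifying attributes are candidate duplicates
SELECT Name
     , Address
     , COUNT(*) AS Occurrences
FROM Vendors
GROUP BY Name
       , Address
HAVING COUNT(*) > 1;
```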

For each type of entity there can be one or more attributes that allow identifying it uniquely; in the case of a Vendor these could be the Vendor's name and address. The more such attributes are needed, the more difficult the identification of a Vendor becomes; therefore, even if such a set of attributes exists, as in the Vendor's case, it's preferable to use a unique identifier instead: a numeric or alphanumeric key that uniquely identifies an entity.

A Vendor could be uniquely identified by the Vendor ID, though that allows unique identification only within a single data repository, the chances being quite high that the same entity has a different Vendor ID in another data repository. Therefore, in order to guarantee the uniqueness of entities, it's preferable to use an attribute that has the same value regardless of the data repository the entity is stored in (e.g. Vendor Number, Item Number, Customer Number, Asset Number, Sales Order Number, etc.). Such attributes can be enforced to be unique across a set of data repositories, though one of the repositories must function as the 'master'.

Multiple identifiers for the same entity may exist, though this can easily create confusion, especially when it happens within the same system and people or machines are not aware that the respective identifiers refer to the same entity; the more identifiers exist for the same entity, the higher the chances of creating confusion. Imagine that in the same system you book some of the Invoices against one of the identifiers, and the remaining Invoices against another identifier of the same entity.

Especially in reports this can be quite a problem, as amounts that should appear under the same entity are split across two references; even if the references point to the same entity, the report's users might not be aware of it, so calculations based on such numbers don't reflect reality. Imagine that you have booked the invoices against two references to the same Vendor and you want to determine the 10 Vendors with the highest spend volume; it might happen that the aggregated amount for each of the respective references doesn't make it into the top 10, though when considered together they could even take the first position.

There are even situations in which the use of multiple identifiers for the same entity is inevitable; important here are the cases in which more than one system refers to the same entity in its own way, the systems often not being designed to use a global unique identifier for an entity. Under such circumstances an organization can either extend the system(s) to store the global identifier or put a mapping in place between the references used to indicate the same entity.

The more systems within an enterprise use or represent the same entity, the more complex it becomes to manage the correct referencing; the logical thing to do is to have a master system in place that stores the global identifier and to integrate it with the other systems, or at least to store the global identifier in each of the systems. Such identifiers, used for master or even transactional data, allow creating consolidated reports based on several systems and mitigating possible issues existing in the data.
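
A sketch of such a mapping, with all names hypothetical: the master system owns the global identifier, while the mapping records the local identifier each system uses for the same entity, which allows consolidated reporting across systems:

```sql
-- Cross-system identifier mapping owned by the master system
CREATE TABLE VendorIdMap (
    GlobalVendorNumber varchar(20) NOT NULL  -- global identifier from the master system
  , SourceSystem       varchar(30) NOT NULL  -- e.g. 'ERP', 'CRM'
  , LocalVendorId      varchar(20) NOT NULL  -- identifier used within the source system
  , PRIMARY KEY (SourceSystem, LocalVendorId)
);

-- Consolidated spend per Vendor across systems (hypothetical Invoices table)
SELECT m.GlobalVendorNumber
     , SUM(i.Amount) AS TotalSpend
FROM Invoices i
     JOIN VendorIdMap m
       ON i.SourceSystem = m.SourceSystem
      AND i.VendorId = m.LocalVendorId
GROUP BY m.GlobalVendorNumber;
```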

Even if such keys are designed to enforce uniqueness, this doesn't mean it necessarily happens; the reason is simple: behind the assignment of a key to an entity there is a defined process, and no matter how robust the process is, if it doesn't cover all the scenarios that might occur, or if it can't deal (adequately) with various constraints, then at some point in time a flow might lead to a duplicate, a non-unique record. Take for example the creation of a Product Number: suppose that a Product Number XYZ is created in an ERP system and one or more of the respective Product's attributes are incorrect. In many cases such errors can be corrected, though each system comes with its own constraints that don't allow changing certain attributes (e.g. the Unit of Measure once Inventory was booked). Because an entity with the wrong attributes is inadequate for use, a new Product Number is created for the same entity, the two identifiers coexisting for some time. (Some systems allow building a relationship between the Products, showing that one Product replaces the other going forward.)

Some systems allow deleting the first entered Product, though that's not always possible. Even if the two Product Numbers are different, they point to the same Product entity, and even if the first entered Product is marked as obsolete, transactions might already have been made with it before that happens. It might be possible to revert those transactions, though that's not always the case either; and once a Product is marked as obsolete, this doesn't necessarily imply that a user won't reactivate it in the near or far future. As can be seen, the scenarios are quite complex, and not all systems and processes are designed to handle them. If such deviations from uniqueness can't be corrected in the legacy systems, they should at least be corrected during conversion and migration.


Written: Jan-2010, Last Reviewed: Mar-2024

References:
[1] David Loshin (2009) "Master Data Management"

27 January 2009

🛢DBMS: Unique Index (Definitions)

"Indexes which do not permit any two rows in the specified columns to have the same value. SQL Server checks for duplicate values when you create the index (if data already exists) and each time data is added." (Karen Paulsell et al, "Sybase SQL Server: Performance and Tuning Guide", 1996)

"Restricts duplicate values in a column." (Owen Williams, "MCSE TestPrep: SQL Server 6.5 Design and Implementation", 1998)

"An index in which no two rows are permitted to have the same index value, thus prohibiting duplicate index or key values. The system checks for duplicate key values when the index is created and checks each time data is added with an INSERT or UPDATE statement." (Microsoft Corporation, "SQL Server 7.0 System Administration Training Kit", 1999)

"An index with perfect selectivity, that is, an index with no duplicate values allowed. Standard SQL allows multiple NULLs in a unique index, since a NULL is not considered to be equal to any other value, including another NULL. Many DBMSs, however, accept only one NULL in a unique index. Some DBMSs won't allow even a single NULL." (Peter Gulutzan & Trudy Pelzer, "SQL Performance Tuning", 2002)

"A b-tree index whose keys are not duplicated." (Bob Bryla, "Oracle Database Foundations", 2004)

"This is a special index that prevents users from inserting the same value into a column (or columns) more than once." (Joseph L Jorden & Dandy Weyn, "MCTS Microsoft SQL Server 2005: Implementation and Maintenance Study Guide - Exam 70-431", 2006)

"An index that enforces uniqueness for a particular field across a single collection." (MongoDb, "Glossary", 2008)

"Sometimes created explicitly by the user, and sometimes created automatically by the database server. By guaranteeing one-and-only-one value for a given table, this structure speeds access to information and preserves data integrity." (Robert D. Schneider and Darril Gibson, "Microsoft SQL Server 2008 All-In-One Desk Reference For Dummies", 2008)

"An index in which no two rows are permitted to have the same index value, thus prohibiting duplicate index or key values." (Microsoft, "SQL Server 2012 Glossary,", 2012)

"An index that ensures that no identical key values are stored in a column or a set of columns in a table." (Sybase, "Open Server Server-Library/C Reference Manual", 2019)

