SQL Troubles: Referential Integrity

Showing posts with label Referential Integrity. Show all posts

27 January 2025

🗄️🗒️Data Management: Data Quality Dimensions [Notes]

Disclaimer: This is work in progress intended to consolidate information from various sources for learning purposes.

Last updated: 27-Jan-2025

[Data Management] Data quality dimensions

{def} features of data that can be measured or assessed against defined standards to determine the quality of data

captures a specific aspect of general data quality

can refer to data values or to their schema

{type} hard dimensions

dimensions that can be measured

{type} soft dimensions

dimensions that can be measured only indirectly

⇐ through interviews with data users or through any other kind of communication with users

dimensions whose measurement depends on the perception of the users of the data

{dimension} uniqueness [post]

the degree to which a value or set of values is unique within a dataset

can be determined based on a set of values supposed to be unique across the whole dataset

some systems have a artificial, respectively natural unique identified

measured in terms of either

the percentage of unique values available in a dataset
the percentage of duplicate values available in a dataset

the impossibility of identifying whether a value is unique increases the chances for it to be duplicated
it can have broader implications

aggregated information is not shown correctly

⇐ split across different entities

can lead to further duplicates in other areas

{recommendation} enforce uniqueness by design, if possible
{recommendation} check the data regularly for duplicates and disable or delete the duplicated records

⇐ one should make sure that the records can't be further reused in business processes or analytics workloads

{dimension} completeness [post]

the extent to which there are missing data in a dataset

⇐ reflected in the number of the missing values

measured as percentage of the missing values compared to the total

determined by the presence of NULL values

{type} attribute completeness

the number of NULLs in a specific attribute

{type} tuple completeness

the number of unknown values of the attributes in a tuple

{type} relation completeness

the number of tuples with unknown attribute values in the relation

{type} value completeness

makes sense for complex, semi-structured columns such as XML data type columns

e.g. a complete element or attribute can be missing

considered in report to

mandatory attributes

attributes that need a not-Null value for each record

optional attributes

attributes that not necessarily need to be provided

inapplicable attributes

attributes not applicable (relevant) for certain scenarios by design

{dimension} conformity (aka format compliance) [post]

{def} the extent data are in the expected format

dependent on the data type and its definition

can be associated with a set of metadata

data type

e.g. text, numeric, alphanumeric, positive, date

length
precision
scale
formatting patterns

e.g. phone number, decimal and digit grouping symbols
different formatting might apply based on various business rules
can use delimiters

{recommendation} define the data type and further constraints to enforce the various characteristics of the element
{recommendation} make sure that the delimiters don't overlap with other uses

{dimension} accuracy [post]

{def} the extent data is correct, respectively match the reality with an acceptable level of approximation
stricter than just conforming to business rules
can be measured at column and table level

[discrete data values]

use frequency distribution of values

a value with very low frequency is probably incorrect

[alphanumeric values]

use string length distribution

a string with a very atypical length is potentially incorrect

try to find patterns and then create pattern distribution.

patterns with low frequency probably denote wrong values

[continuous attributes]

use descriptive statistics

just by looking at minimal and maximal values, you can easily spot potentially problematic data

{dimension} consistency [post]

{def} the degree of uniformity, standardization, and freedom from contradiction among the documents or parts of a system or component

{type} notational consistency

the extent (data) values are consistent in notation

{type} semantic consistency

the degree to which data has unique meaning
is more restrictive than the notational consistency

measures the equivalence of information stored in various repositories
involves comparing values with a predefined set of possible values

from the same or from different systems

can be measured at column and table level
can have different scopes

cross-system consistencies

among systems or data repositories

cross-record consistency

within the same repository

temporal consistency

within the same record at different points in time

{dimension} timeliness [post]

tells the degree to which data is current and available when needed

there is always some delay between change in the real world and the moment when this change is entered into a system

stale data/obsolete data

{dimension} structuredness [post]

the degree to which a data structure or model possesses a definite pattern of organization of its interdependent parts
allows the categorization of data as

structured data [def]

refers to structures that can be easily perceived or known, that raises no doubt on structure’s delimitations

unstructured data [def]

refers to textual data and media content (video, sound, images), in which the structural patterns even if exist they are hard to discover or not predefined

semi-structured data [def]

refers to islands of structured data stored with unstructured data, or vice versa

⇐ the more structured the data, the easier it is to be processed

{dimension} referential integrity [post]

{def} the degree to which the values of a key in one table (aka reference value) match the values of a key in a related table (aka the referenced value)
it's an architectural concept of the database
{recommendation} keep the referential integrity of a system by design

some systems build logic for assuring the referential integrity in the applications and not in the database

{dimension} currency (aka actuality)

the extent to which data is actual
can be considered as a special type of accuracy

⇐ when the data is not actual then it doesn’t reflect reality

{dimension} ease of use

the extent to which data can be used for a given purpose

usually it refers to whether the data can be processed as needed
depends on the application or on the user interface

{dimension} fitness of use

the degree to which the data is fit for use

the data may have good quality for a given purposes but

not usable for other purposes
can be used as substitute for other data

e.g. use phone area codes instead of ZIP codes to locate customers approximately

{dimension} trustfulness [post]

the degree to which the data can be trusted

is a matter of perception

ask users whether they trust the data and which are the reasons

if the users don’t trust the data

they will create their own solutions
they will not use applications

{dimension} entropy

{def} the average amount of information conveyed

⇐ quantification of information in a system
⇐ the more dispersed the values and the more the frequency distribution of a discrete column is equally spread among the values, the more information is available [1]
⇐ can tell whether your data is suitable for analysis or not

can be measured at column and table level

{dimension} presentation quality

applicable to applications that presents data

format and appearance should support the appropriate use of data
depends on the UI used

{recommendation} have a dedicated system for maintaining the master data and broadcast the data to the subscribers as needed

the data should be exclusively managed though the management system
{anti-pattern} data is modified in the subscribers and the changes aren't always reflected back to the source system

Previous Post <<||>> Next Post

References:
[1] Dejan Sarka et al (2012) Exam 70-463: Implementing a Data Warehouse with Microsoft SQL Server 2012 (Training Kit)

18 January 2010

🗄️Data Management: Data Quality Dimensions (Part V: Consistency)

Data Management Series

IEEE defines consistency in general as "the degree of uniformity, standardization, and freedom from contradiction among the documents or parts of a system or component" [4]. In respect to data, consistency can be defined thus as the degree of uniformity and standardization of data values among systems or data repositories (aka cross-system consistencies), records within the same repository (aka cross-record consistency), or within the same record at different points in time (temporal consistency) (see [2], [3]).

Unfortunately, uniformity, standardization and freedom can be considered as data quality dimensions as well and they might even have broader scope than the one provided by consistency. Moreover, the definition requires further definitions for the concept to be understood, which is not ideal.

Simply put, consistency refers to the extent (data) values are consistent in notation, respectively the degree to which the data values across different contexts match. For example, one system uses "Free on Board" while another systems uses "FOB" to refer to the point obligations, costs, and risk involved in the delivery of goods shift from the seller to the buyer. The two systems refer to the same value by two different notations. When the two systems are integrated, because of the different values used, the rules defined in the target system might make the record fail because it is expecting another value. Conversely, other systems may import the value as it is, leading thus to two values used in parallel for the same meaning. This can happen not only to reference data, but also to master data, for example when a value deviates slightly from the expected value (e.g. misspelled). A more special case is when one of the systems uses case sensitive values (usually the target system, though there can be also bidirectional data integrations).

One solution for such situations would be to "standardize" the values across systems, though not all systems allow to easily change the values once they have been set. Another solution would be to create a mapping as part the integration, though to maintain such mappings for many cases is suboptimal, but in the end, it might be the only solution. Further systems can be impacted by these issues as well (e.g. data warehouses, data marts).

It's recommended to use a predefined list of values (LOV) - a data dictionary, an ontology or any other type of knowledge representation form that can be used to 'enforce' data consistency. ‘Enforce’ is maybe not the best term because the two data sets could be disconnected from each other, being in Users’ responsibility to ensure the overall consistency, or the two data sets could be integrated using specific techniques. In many cases is checked the consistency of the values taken by one attribute against an existing LOV, though for example for data formed from multiple segments (e.g. accounts) each segment might need to be checked against a specific data set or rule generator, such mechanisms implying multi-attribute mappings or associational rules that specify the possible values.

As highlighted also by [1], there are two aspects of consistency: the structural consistency in which two or more values can be distinct in notation but have the same meaning (e.g. missing vs. n/a), and semantic consistency in which each value has a unique meaning (e.g. only n/a is allowed to highlight missing values). It should be targeted to have the data semantically consistent, to avoid confusion, accidental exclusion of data during filtering or reporting. More and more organizations are investing in ontologies, they allow ensuring the semantic consistency of concepts/entities, though for most of the cases simple single or multi-attribute lists of values are enough.

Previous Post <<||>> Next Post

Written: Jan-2010, Last Reviewed: Mar-2024

References:
[1] Chapman A.D. (2005) "Principles of Data Quality", version 1.0. Report for the Global Biodiversity Information Facility, Copenhagen
[2] David Loshin (2009) Master Data Management

[3] IEEE (1990) "IEEE Standard Glossary of Software Engineering Terminology"

[4] DAMA International (2010) "The DAMA Dictionary of Data Management" 1st Ed.

13 January 2010

🗄️Data Management: Data Quality Dimensions (Part III: Completeness)

Data Management Series

Completeness refers to the extent to which there are missing data in a dataset, fact reflected in the number of the missing values, also referred as empty (when an empty string or default values is used) or 'Nulls' (aka unknown values), and/or in the number of missing records.

The missing values are typically considered in report to mandatory attributes, attributes that need a not-Null value for each record, though after case might be applied to non-mandatory attributes (optional attributes) too, for example when is intended to understand whether the attributes are adequately maintained or not. It’s interesting that [1] considers also the inapplicable attributes referring to the attributes not applicable (relevant) for certain scenarios (e.g. physical dimensions for service-based materials), which together with the applicable attributes (relevant) can be considered as another type of categorization for attributes. Whether an attribute is mandatory is decided upon business context and not necessarily upon the physical structure containing the attribute, in other words an attribute could be optional as per database schema and mandatory per business rules.

'Missing records' can be a misleading term because is used in several contexts, however within data completeness context it refers only to the cases not covered by data integrity. For example in parent-child table relations the header data was entered though the detail data is missing, either not entered or deleted; such a case is not covered by referential integrity because there is no missing reference, but just the parent without child data (1:n cardinality).

A mixed example occurs when the same entity is split across several tables at the same level of detail. One of the tables must function as a parent, falling in the previous mentioned example (1:1 cardinality). In such a scenario it depends how one reports the nonconformances per record: (1) the error is counted only once, independently on how many dimensions an error was raised; (2) the error is counted for each dimension. For (2) when the referential integrity failed, an error is raised also for each mandatory attribute.

Both examples are dealing with explicit data referents – the 'parent' data, though there are cases in which the referents are implicit, for example when the data are not available for a certain time interval (e.g. period, day) even if needed, though also for this case the referents could be made explicit, falling in the previous mentioned examples. In such scenarios all the attributes corresponding to the missing records will be null.

Normally the completeness of parent-child relations is enforced with the help of referential integrity and database transactions, a set of actions performed as a single unit of work, they allow saving the parent data only if the child data were saved successfully, though such type of constraints is not always necessary.

Data should be cleaned when feasible in the source system(s) and this applies to incomplete data as well. It might be feasible to clean the values in Excel files or similar tools, by exporting and then reimporting the clean values back into the respective systems.

In data migrations or similar scenarios, the completeness in particular and data quality in general must be judged against the target system(s) and thus the dataset must be enriched in an intermediate layer as needed. Upon case, one can consider using default values, though this sounds like a technical debt, likely improbably to be addressed later. Moreover, one should prioritize the effort and consider first the attributes which are needed for the good functioning of the target system(s).

Ideally, for the mandatory fields should be applied data validation techniques in the source systems, when feasible.

Previous Post <<||>> Next Post

Written: Jan-2010, Last Reviewed: Mar-2024

References:
[1] David Loshin (2009) "Master Data Management"

16 July 2009

🛢DBMS: Referential Integrity (Definitions)

"The rules governing data consistency, specifically the relationships among the primary keys and foreign keys of different tables. SQL Server addresses referential integrity with user-defined triggers." (Karen Paulsell et al, "Sybase SQL Server: Performance and Tuning Guide", 1996)

"When a table has relationships with other tables, they are linked on a field (or group of fields). Referential integrity ensures that the copy of the key field kept in one table matches the key field in the other." (Owen Williams, "MCSE TestPrep: SQL Server 6.5 Design and Implementation", 1998)

"An integrity mechanism that ensures that vital data in a database, such as the unique identifier for a given piece of data, remains accurate and usable as the database changes. Referential integrity involves managing corresponding data values between tables when the foreign key of a table contains the same values as the primary key of another table." (Microsoft Corporation, "SQL Server 7.0 System Administration Training Kit", 1999)

"Mandatory condition in a data warehouse where all the keys in the fact tables are legitimate foreign keys relative to the dimension tables. In other words, all the fact key components are subsets of the primary keys found in the dimension tables at all times." (Ralph Kimball & Margy Ross, "The Data Warehouse Toolkit 2nd Ed ", 2002)

"A state in which all foreign key values in a database are valid." (Anthony Sequeira & Brian Alderman, "The SQL Server 2000 Book", 2003)

"A method employed by a relational database system that enforces one-to-many relationships between tables." (Bob Bryla, "Oracle Database Foundations", 2004)

"A feature of some database systems that ensures that any record stored in the database is supported by accurate primary and foreign keys." (Sharon Allen & Evan Terry, "Beginning Relational Data Modeling" 2nd Ed., 2005)

"The facility of a DBMS to ensure the validity of predefined relationships." (William H Inmon, "Building the Data Warehouse", 2005)

"A process (usually contained within a relational database model) of validation between related primary and foreign key field values. For example, a foreign key value cannot be added to a table unless the related primary key value exists in the parent table. Similarly, deleting a primary key value necessitates removing all records in subsidiary tables, containing that primary key value in foreign key fields. Additionally, it follows that preventing the deletion of a primary key record is not allowed if a foreign key exists elsewhere." (Gavin Powell, "Beginning Database Design", 2006)

"The assurance that a reference from one entity to another entity is valid. If entity A references entity B, entity B exists. If entity B is removed, all references to entity B must also be removed." (Pramod J Sadalage & Scott W Ambler, "Refactoring Databases: Evolutionary Database Design", 2006)

"Relational database integrity that dictates that all foreign key values in a child table must have a corresponding matching primary key value in the parent table." (Marilyn Miller-White et al, "MCITP Administrator: Microsoft SQL Server 2005 Optimization and Maintenance 70-444", 2007)

"The referential integrity imposes the constraint that if a foreign key exists in a relation, either the foreign key value must match a candidate key value of some tuple in its home relation or the foreign key value must be wholly null." (S. Sumathi & S. Esakkirajan, "Fundamentals of Relational Database Management Systems", 2007)

"A set of rules, enforced by the database server, the user’s application, or both, that protects the quality and consistency of information stored in the database." (Robert D Schneider & Darril Gibson, "Microsoft SQL Server 2008 All-in-One Desk Reference For Dummies", 2008)

"Requires that relationships among tables be consistent. For example, foreign key constraints must be satisfied. You cannot accept a transaction until referential integrity is satisfied." (Rod Stephens, "Beginning Database Design Solutions", 2008)

"A constraint on a relation that states that every non-null foreign key value must match an existing primary key value." (Jan L Harrington, "Relational Database Design and Implementation" 3rd Ed., 2009)

"A constraint in a SQL database that requires, for every foreign key instance that exists in a table, that the row (and thus the primary key instance) of the parent table associated with that foreign key instance must also exist in the database." (Toby J Teorey, ", Database Modeling and Design" 4th Ed, 2010)

"A constraint on a relation that states that every non-null foreign key value must reference an existing primary key value." (Jan L Harrington, "SQL Clearly Explained" 3rd Ed., 2010)

"In a relational database, the quality of a table that all its associations are with real instances of other tables." (David C Hay, "Data Model Patterns: A Metadata Map", 2010)

"Refers to two relational tables that are directly related. Referential integrity between related tables is established if non-null values in the foreign key field of the child table are primary key values in the parent table." (Paulraj Ponniah, "Data Warehousing Fundamentals for IT Professionals", 2010)

"A condition by which a dependent table’s foreign key must have either a null entry or a matching entry in the related table. Even though an attribute may not have a corresponding attribute, it is impossible to have an invalid entry." (Carlos Coronel et al, "Database Systems: Design, Implementation, and Management" 9th Ed, 2011)

"In data management, constraints that govern the relationship of an occurrence of one entity to one or more occurrences of another entity. These constraints may be automatically enforced by the DBMS. For instance, every purchase order must have one and only one customer. If the relationship is represented using a foreign key, then the foreign key is said to reference a file or entity table where the identifier is from the same domain. Having referential integrity means that IF a value exists in the foreign key of the referencing file, then it must exist as a valid identifier in the referenced file or table." (DAMA International, "The DAMA Dictionary of Data Management", 2011)

"Through the specification of appropriate referential constraints, RI guarantees that an acceptable value is always in each foreign key column." (Craig S Mullins, "Database Administration", 2012)

"Refers to the accuracy and consistency of records, and the assurance that they are genuine and unaltered." (Robert F Smallwood, "Information Governance: Concepts, Strategies, and Best Practices", 2014)

"The process of relating data together in a disciplined manner" (Daniel Linstedt & W H Inmon, "Data Architecture: A Primer for the Data Scientist", 2014)

"A requirement that the data in related tables be matched, so that an entry in the 'many' side of the relationship (the foreign key) must have a corresponding entry in the “one” side of the relationship (the primary key)." (Faithe Wempen, "Computing Fundamentals: Introduction to Computers", 2015)

"Refers to the accuracy and consistency of records, and the assurance that they are genuine and unaltered." (Robert F Smallwood, "Information Governance for Healthcare Professionals", 2018)

"The state of a database in which all values of all foreign keys are valid. Maintaining referential integrity requires the enforcement of a referential constraint on all operations that change the data in a table where the referential constraints are defined." (Sybase, "Open Server Server-Library/C Reference Manual", 2019)

"A rule defined on a key in one table that guarantees that the values in that key match the values in a key in a related table (the referenced value)." (Oracle, "Oracle Database Concepts")

"A state in which all foreign key values in a database are valid. For a foreign key to be valid, it must contain either the value NULL, or an existing key value from the primary or unique key columns referenced by the foreign key." (Microsoft Technet)

"The technique of maintaining data always in a consistent format, part of the ACID philosophy. In particular, data in different tables is kept consistent through the use of foreign key constraints, which can prevent changes from happening or automatically propagate those changes to all related tables. Related mechanisms include the unique constraint, which prevents duplicate values from being inserted by mistake, and the NOT NULL constraint, which prevents blank values from being inserted by mistake." (MySQL, "MySQL 8.0 Reference Manual Glossary")

09 March 2009

🛢DBMS: Trigger (Definitions)

"A special form of stored procedure that goes into effect when a user gives a change command such as insert, delete, or update to a specified table or column. Triggers are often used to enforce referential integrity." (Karen Paulsell et al, "Sybase SQL Server: Performance and Tuning Guide", 1996)

"A special form of stored procedure that goes into effect when data within a table is modified. Triggers are often created to enforce integrity or consistency among logically related data in different tables." (Patrick Dalton, "Microsoft SQL Server Black Book", 1997)

"A special type of stored procedure that is set off by actions taken on a table. Triggers allow for complex relationships between tables and complex business rules to be checked automatically." (Owen Williams, "MCSE TestPrep: SQL Server 6.5 Design and Implementation", 1998)

"A stored procedure that executes automatically when data in a specified table is modified. Triggers are often created to enforce referential integrity or consistency among logically related data in different tables." (Microsoft Corporation, "SQL Server 7.0 System Administration Training Kit", 1999)

"Code stored in the database that executes automatically when certain events occur. Traditionally associated only with table write events such as INSERT, UPDATE, or DELETE, newer versions of Oracle provide the ability to define triggers on views and on other system events such as logon, logoff, and system error." (Bill Pribyl & Steven Feuerstein, "Learning Oracle PL/SQL", 2001)

"A stored procedure that executes when data in a specified table is modified. Triggers are often created to enforce referential integrity or consistency among logically related data in different tables." (Anthony Sequeira & Brian Alderman, "The SQL Server 2000 Book", 2003)

"A trigger is a stored procedure that is fired when data is modified from a table using any of the three modification statements: DELETE, INSERT, or UPDATE. FOR and AFTER are synonymous, and are usually implied when referring to triggers, rather than INSTEAD OF triggers. Triggers are often created to enforce referential integrity or consistency among logically related data in different tables." (Thomas Moore, "EXAM CRAM™ 2: Designing and Implementing Databases with SQL Server 2000 Enterprise Edition", 2005)

"A chunk of code that executes when a specified event occurs, usually before or after an INSERT, UPDATE, or DELETE command." (Gavin Powell, "Beginning Database Design", 2006)

"A database method that is automatically invoked as the result of Data Manipulation Language (DML) activity within a persistence mechanism." (Pramod J Sadalage & Scott W Ambler, "Refactoring Databases: Evolutionary Database Design", 2006)

"A stored procedure that is fired when data is modified from a table using any of the three modification statements DELETE, INSERT, or UPDATE. FOR and AFTER are synonymous and are usually implied when referring to triggers rather than INSTEAD OF triggers. Triggers are often created to enforce referential integrity or consistency among logically related data in different tables." (Thomas Moore, "MCTS 70-431: Implementing and Maintaining Microsoft SQL Server 2005", 2006)

"A stored procedure that executes when certain conditions occurs such as when a record is created, modified, or deleted. Triggers can perform special actions such as creating other records or validating changes." (Rod Stephens, "Beginning Database Design Solutions", 2008)

"A type of stored procedure that fires in response to action on a table. DML triggers are associated with INSERT, UPDATE, and DELETE statements. DDL triggers are associated with CREATE, ALTER, and DROP statements." (Darril Gibson, "MCITP SQL Server 2005 Database Developer All-in-One Exam Guide", 2008)

"Stored in, and managed by, your database server, this software is executed when a certain event occurs. These events can range from information creation or modification to structural changes to your database. When the event occurs, the trigger is executed, causing a pre-determined set of actions to take place. These actions can encompass data validation, alerts, warnings, and other administrative operations. Triggers can invoke other triggers and stored procedures." (Robert D. Schneider and Darril Gibson, "Microsoft SQL Server 2008 All-In-One Desk Reference For Dummies", 2008)

"A stored procedure that executes in response to a Data Manipulation Language (DML) or Data Definition Language (DDL) event." (Jim Joseph, "Microsoft SQL Server 2008 Reporting Services Unleashed", 2009)

"A SQL program module that is executed when a specific data modification activity occurs. Triggers are stored in the database they manipulate." (Jan L Harrington, "SQL Clearly Explained" 3rd Ed., 2010)

"A stored procedure that can be triggered and executed automatically when a database operation such as insert, update, or delete takes place." (Paulraj Ponniah, "Data Warehousing Fundamentals for IT Professionals", 2010)

"A procedural SQL code that is automatically invoked by the relational database management system upon the occurrence of a data manipulation event." (Carlos Coronel et al, "Database Systems: Design, Implementation, and Management" 9th Ed., 2011)

"A software routine guaranteed to execute when an event occurs. Often a trigger will monitor changes to data values. A trigger includes a monitoring procedure, a set or range of values to check data integrity, and one or more procedures invoked in response, which may update other data or fulfill a data subscription." (DAMA International, "The DAMA Dictionary of Data Management", 2011)

"An event that causes a handler to be run." (Jon Orwant et al, "Programming Perl, 4th Ed.", 2012)

"An event-driven specialized procedure that is attached to database tables; typically implemented to support data integrity requirements." (Craig S Mullins, "Database Administration", 2012)

"A database object that is associated with a single base table or view and that defines a rule. The rule consists of a set of SQL statements that runs when an insert, update, or delete database operation occurs on the associated base table or view." (IBM, "Informix Servers 12.1", 2014)

"A PL/SQL or Java procedure that fires when a table or view is modified or when specific user or database actions occur. Procedures are explicitly run, whereas triggers are implicitly run." (Oracle, "Oracle Database Concepts")

"A stored procedure that executes in response to a data manipulation language (DML) or data definition language (DDL) event." (Microsoft Technet,)

SQL Troubles

Pages

27 January 2025

🗄️🗒️Data Management: Data Quality Dimensions [Notes]

18 January 2010

🗄️Data Management: Data Quality Dimensions (Part V: Consistency)

13 January 2010

🗄️Data Management: Data Quality Dimensions (Part III: Completeness)

16 July 2009

🛢DBMS: Referential Integrity (Definitions)

09 March 2009

🛢DBMS: Trigger (Definitions)

About Me