Showing posts with label Referential Integrity. Show all posts
Showing posts with label Referential Integrity. Show all posts

18 January 2010

🗄️Data Management: Data Quality Dimensions (Part V: Consistency)

Data Management
Data Management Series

IEEE defines consistency in general as "the degree of uniformity, standardization, and freedom from contradiction among the documents or parts of a system or component" [4]. In respect to data, consistency can be defined thus as the degree of uniformity and standardization of data values among systems or data repositories (aka cross-system consistencies), records within the same repository (aka cross-record consistency), or within the same record at different points in time (temporal consistency) (see [2], [3]). 

Unfortunately, uniformity, standardization and freedom can be considered as data quality dimensions as well and they might even have broader scope than the one provided by consistency. Moreover, the definition requires further definitions for the concept to be understood, which is not ideal. 

Simply put, consistency refers to the extent (data) values are consistent in notation, respectively the degree to which the data values across different contexts match. For example, one system uses "Free on Board" while another systems uses "FOB" to refer to the point obligations, costs, and risk involved in the delivery of goods shift from the seller to the buyer. The two systems refer to the same value by two different notations. When the two systems are integrated, because of the different values used, the rules defined in the target system might make the record fail because it is expecting another value. Conversely, other systems may import the value as it is, leading thus to two values used in parallel for the same meaning. This can happen not only to reference data, but also to master data, for example when a value deviates slightly from the expected value (e.g. misspelled). A more special case is when one of the systems uses case sensitive values (usually the target system, though there can be also bidirectional data integrations).

One solution for such situations would be to "standardize" the values across systems, though not all systems allow to easily change the values once they have been set. Another solution would be to create a mapping as part the integration, though to maintain such mappings for many cases is suboptimal, but in the end, it might be the only solution. Further systems can be impacted by these issues as well (e.g. data warehouses, data marts).

It's recommended to use a predefined list of values (LOV) - a data dictionary, an ontology or any other type of knowledge representation form that can be used to 'enforce' data consistency. ‘Enforce’ is maybe not the best term because the two data sets could be disconnected from each other, being in Users’ responsibility to ensure the overall consistency, or the two data sets could be integrated using specific techniques. In many cases is checked the consistency of the values taken by one attribute against an existing LOV, though for example for data formed from multiple segments (e.g. accounts) each segment might need to be checked against a specific data set or rule generator, such mechanisms implying multi-attribute mappings or associational rules that specify the possible values.

As highlighted also by [1], there are two aspects of consistency: the structural consistency in which two or more values can be distinct in notation but have the same meaning (e.g. missing vs. n/a), and semantic consistency in which each value has a unique meaning (e.g. only n/a is allowed to highlight missing values). It should be targeted to have the data semantically consistent, to avoid confusion, accidental exclusion of data during filtering or reporting. More and more organizations are investing in ontologies, they allow ensuring the semantic consistency of concepts/entities, though for most of the cases simple single or multi-attribute lists of values are enough.


Written: Jan-2010, Last Reviewed: Mar-2024

References:
[1] Chapman A.D. (2005) "Principles of Data Quality", version 1.0. Report for the Global Biodiversity Information Facility, Copenhagen
[2] David Loshin (2009) Master Data Management
[3] IEEE (1990) "IEEE Standard Glossary of Software Engineering Terminology"
[4] DAMA International (2010) "The DAMA Dictionary of Data Management" 1st Ed.

13 January 2010

🗄️Data Management: Data Quality Dimensions (Part III: Completeness)

Data Management
Data Management Series

Completeness refers to the extent to which there are missing data in a dataset, fact reflected in the number of the missing values, also referred as empty (when an empty string or default values is used) or 'Nulls' (aka unknown values), and/or in the number of missing records.

The missing values are typically considered in report to mandatory attributes, attributes that need a not-Null value for each record, though after case might be applied to non-mandatory attributes (optional attributes) too, for example when is intended to understand whether the attributes are adequately maintained or not. It’s interesting that [1] considers also the inapplicable attributes referring to the attributes not applicable (relevant) for certain scenarios (e.g. physical dimensions for service-based materials), which together with the applicable attributes (relevant) can be considered as another type of categorization for attributes. Whether an attribute is mandatory is decided upon business context and not necessarily upon the physical structure containing the attribute, in other words an attribute could be optional as per database schema and mandatory per business rules.

'Missing records' can be a misleading term because is used in several contexts, however within data completeness context it refers only to the cases not covered by data integrity. For example in parent-child table relations the header data was entered though the detail data is missing, either not entered or deleted; such a case is not covered by referential integrity because there is no missing reference, but just the parent without child data (1:n cardinality). 

A mixed example occurs when the same entity is split across several tables at the same level of detail. One of the tables must function as a parent, falling in the previous mentioned example (1:1 cardinality). In such a scenario it depends how one reports the nonconformances per record: (1) the error is counted only once, independently on how many dimensions an error was raised; (2) the error is counted for each dimension. For (2) when the referential integrity failed, an error is raised also for each mandatory attribute. 

Both examples are dealing with explicit data referents – the 'parent' data, though there are cases in which the referents are implicit, for example when the data are not available for a certain time interval (e.g. period, day) even if needed, though also for this case the referents could be made explicit, falling in the previous mentioned examples. In such scenarios all the attributes corresponding to the missing records will be null.

Normally the completeness of parent-child relations is enforced with the help of referential integrity and database transactions, a set of actions performed as a single unit of work, they allow saving the parent data only if the child data were saved successfully, though such type of constraints is not always necessary.

Data should be cleaned when feasible in the source system(s) and this applies to incomplete data as well. It might be feasible to clean the values in Excel files or similar tools, by exporting and then reimporting the clean values back into the respective systems.

In data migrations or similar scenarios, the completeness in particular and data quality in general must be judged against the target system(s) and thus the dataset must be enriched in an intermediate layer as needed. Upon case, one can consider using default values, though this sounds like a technical debt, likely improbably to be addressed later. Moreover, one should prioritize the effort and consider first the attributes which are needed for the good functioning of the target system(s).

Ideally, for the mandatory fields should be applied data validation techniques in the source systems, when feasible. 


Written: Jan-2010, Last Reviewed: Mar-2024 

References:
[1] David Loshin (2009) "Master Data Management"

16 July 2009

🛢DBMS: Referential Integrity (Definitions)

"The rules governing data consistency, specifically the relationships among the primary keys and foreign keys of different tables. SQL Server addresses referential integrity with user-defined triggers." (Karen Paulsell et al, "Sybase SQL Server: Performance and Tuning Guide", 1996)

"When a table has relationships with other tables, they are linked on a field (or group of fields). Referential integrity ensures that the copy of the key field kept in one table matches the key field in the other." (Owen Williams, "MCSE TestPrep: SQL Server 6.5 Design and Implementation", 1998)

"An integrity mechanism that ensures that vital data in a database, such as the unique identifier for a given piece of data, remains accurate and usable as the database changes. Referential integrity involves managing corresponding data values between tables when the foreign key of a table contains the same values as the primary key of another table." (Microsoft Corporation, "SQL Server 7.0 System Administration Training Kit", 1999)

"Mandatory condition in a data warehouse where all the keys in the fact tables are legitimate foreign keys relative to the dimension tables. In other words, all the fact key components are subsets of the primary keys found in the dimension tables at all times." (Ralph Kimball & Margy Ross, "The Data Warehouse Toolkit 2nd Ed ", 2002)

"A state in which all foreign key values in a database are valid." (Anthony Sequeira & Brian Alderman, "The SQL Server 2000 Book", 2003)

"A method employed by a relational database system that enforces one-to-many relationships between tables." (Bob Bryla, "Oracle Database Foundations", 2004)

"A feature of some database systems that ensures that any record stored in the database is supported by accurate primary and foreign keys." (Sharon Allen & Evan Terry, "Beginning Relational Data Modeling" 2nd Ed., 2005)

"The facility of a DBMS to ensure the validity of predefined relationships." (William H Inmon, "Building the Data Warehouse", 2005)

"A process (usually contained within a relational database model) of validation between related primary and foreign key field values. For example, a foreign key value cannot be added to a table unless the related primary key value exists in the parent table. Similarly, deleting a primary key value necessitates removing all records in subsidiary tables, containing that primary key value in foreign key fields. Additionally, it follows that preventing the deletion of a primary key record is not allowed if a foreign key exists elsewhere." (Gavin Powell, "Beginning Database Design", 2006)

"The assurance that a reference from one entity to another entity is valid. If entity A references entity B, entity B exists. If entity B is removed, all references to entity B must also be removed." (Pramod J Sadalage & Scott W Ambler, "Refactoring Databases: Evolutionary Database Design", 2006)

"Relational database integrity that dictates that all foreign key values in a child table must have a corresponding matching primary key value in the parent table." (Marilyn Miller-White et al, "MCITP Administrator: Microsoft SQL Server 2005 Optimization and Maintenance 70-444", 2007)

"The referential integrity imposes the constraint that if a foreign key exists in a relation, either the foreign key value must match a candidate key value of some tuple in its home relation or the foreign key value must be wholly null." (S. Sumathi & S. Esakkirajan, "Fundamentals of Relational Database Management Systems", 2007)

"A set of rules, enforced by the database server, the user’s application, or both, that protects the quality and consistency of information stored in the database." (Robert D Schneider & Darril Gibson, "Microsoft SQL Server 2008 All-in-One Desk Reference For Dummies", 2008)

"Requires that relationships among tables be consistent. For example, foreign key constraints must be satisfied. You cannot accept a transaction until referential integrity is satisfied." (Rod Stephens, "Beginning Database Design Solutions", 2008)

"A constraint on a relation that states that every non-null foreign key value must match an existing primary key value." (Jan L Harrington, "Relational Database Design and Implementation" 3rd Ed., 2009)

"A constraint in a SQL database that requires, for every foreign key instance that exists in a table, that the row (and thus the primary key instance) of the parent table associated with that foreign key instance must also exist in the database." (Toby J Teorey, ", Database Modeling and Design" 4th Ed, 2010)

"A constraint on a relation that states that every non-null foreign key value must reference an existing primary key value." (Jan L Harrington, "SQL Clearly Explained" 3rd Ed., 2010)

"In a relational database, the quality of a table that all its associations are with real instances of other tables." (David C Hay, "Data Model Patterns: A Metadata Map", 2010)

"Refers to two relational tables that are directly related. Referential integrity between related tables is established if non-null values in the foreign key field of the child table are primary key values in the parent table." (Paulraj Ponniah, "Data Warehousing Fundamentals for IT Professionals", 2010)

"A condition by which a dependent table’s foreign key must have either a null entry or a matching entry in the related table. Even though an attribute may not have a corresponding attribute, it is impossible to have an invalid entry." (Carlos Coronel et al, "Database Systems: Design, Implementation, and Management" 9th Ed, 2011)

"In data management, constraints that govern the relationship of an occurrence of one entity to one or more occurrences of another entity. These constraints may be automatically enforced by the DBMS. For instance, every purchase order must have one and only one customer. If the relationship is represented using a foreign key, then the foreign key is said to reference a file or entity table where the identifier is from the same domain. Having referential integrity means that IF a value exists in the foreign key of the referencing file, then it must exist as a valid identifier in the referenced file or table." (DAMA International, "The DAMA Dictionary of Data Management", 2011)

"Through the specification of appropriate referential constraints, RI guarantees that an acceptable value is always in each foreign key column." (Craig S Mullins, "Database Administration", 2012)

"Refers to the accuracy and consistency of records, and the assurance that they are genuine and unaltered." (Robert F Smallwood, "Information Governance: Concepts, Strategies, and Best Practices", 2014)

"The process of relating data together in a disciplined manner" (Daniel Linstedt & W H Inmon, "Data Architecture: A Primer for the Data Scientist", 2014)

"A requirement that the data in related tables be matched, so that an entry in the 'many' side of the relationship (the foreign key) must have a corresponding entry in the “one” side of the relationship (the primary key)." (Faithe Wempen, "Computing Fundamentals: Introduction to Computers", 2015)

"Refers to the accuracy and consistency of records, and the assurance that they are genuine and unaltered." (Robert F Smallwood, "Information Governance for Healthcare Professionals", 2018)

"The state of a database in which all values of all foreign keys are valid. Maintaining referential integrity requires the enforcement of a referential constraint on all operations that change the data in a table where the referential constraints are defined." (Sybase, "Open Server Server-Library/C Reference Manual", 2019)

"A rule defined on a key in one table that guarantees that the values in that key match the values in a key in a related table (the referenced value)." (Oracle, "Oracle Database Concepts")

"A state in which all foreign key values in a database are valid. For a foreign key to be valid, it must contain either the value NULL, or an existing key value from the primary or unique key columns referenced by the foreign key." (Microsoft Technet)

"The technique of maintaining data always in a consistent format, part of the ACID philosophy. In particular, data in different tables is kept consistent through the use of foreign key constraints, which can prevent changes from happening or automatically propagate those changes to all related tables. Related mechanisms include the unique constraint, which prevents duplicate values from being inserted by mistake, and the NOT NULL constraint, which prevents blank values from being inserted by mistake." (MySQL, "MySQL 8.0 Reference Manual Glossary")

09 March 2009

🛢DBMS: Trigger (Definitions)

"A special form of stored procedure that goes into effect when a user gives a change command such as insert, delete, or update to a specified table or column. Triggers are often used to enforce referential integrity." (Karen Paulsell et al, "Sybase SQL Server: Performance and Tuning Guide", 1996)

"A special form of stored procedure that goes into effect when data within a table is modified. Triggers are often created to enforce integrity or consistency among logically related data in different tables." (Patrick Dalton, "Microsoft SQL Server Black Book", 1997)

"A special type of stored procedure that is set off by actions taken on a table. Triggers allow for complex relationships between tables and complex business rules to be checked automatically." (Owen Williams, "MCSE TestPrep: SQL Server 6.5 Design and Implementation", 1998)

"A stored procedure that executes automatically when data in a specified table is modified. Triggers are often created to enforce referential integrity or consistency among logically related data in different tables." (Microsoft Corporation, "SQL Server 7.0 System Administration Training Kit", 1999)

"Code stored in the database that executes automatically when certain events occur. Traditionally associated only with table write events such as INSERT, UPDATE, or DELETE, newer versions of Oracle provide the ability to define triggers on views and on other system events such as logon, logoff, and system error." (Bill Pribyl & Steven Feuerstein, "Learning Oracle PL/SQL", 2001)

"A stored procedure that executes when data in a specified table is modified. Triggers are often created to enforce referential integrity or consistency among logically related data in different tables." (Anthony Sequeira & Brian Alderman, "The SQL Server 2000 Book", 2003)

"A trigger is a stored procedure that is fired when data is modified from a table using any of the three modification statements: DELETE, INSERT, or UPDATE. FOR and AFTER are synonymous, and are usually implied when referring to triggers, rather than INSTEAD OF triggers. Triggers are often created to enforce referential integrity or consistency among logically related data in different tables." (Thomas Moore, "EXAM CRAM™ 2: Designing and Implementing Databases with SQL Server 2000 Enterprise Edition", 2005)

"A chunk of code that executes when a specified event occurs, usually before or after an INSERT, UPDATE, or DELETE command." (Gavin Powell, "Beginning Database Design", 2006)

"A database method that is automatically invoked as the result of Data Manipulation Language (DML) activity within a persistence mechanism." (Pramod J Sadalage & Scott W Ambler, "Refactoring Databases: Evolutionary Database Design", 2006)

"A stored procedure that is fired when data is modified from a table using any of the three modification statements DELETE, INSERT, or UPDATE. FOR and AFTER are synonymous and are usually implied when referring to triggers rather than INSTEAD OF triggers. Triggers are often created to enforce referential integrity or consistency among logically related data in different tables." (Thomas Moore, "MCTS 70-431: Implementing and Maintaining Microsoft SQL Server 2005", 2006)

"A stored procedure that executes when certain conditions occurs such as when a record is created, modified, or deleted. Triggers can perform special actions such as creating other records or validating changes." (Rod Stephens, "Beginning Database Design Solutions", 2008)

"A type of stored procedure that fires in response to action on a table. DML triggers are associated with INSERT, UPDATE, and DELETE statements. DDL triggers are associated with CREATE, ALTER, and DROP statements." (Darril Gibson, "MCITP SQL Server 2005 Database Developer All-in-One Exam Guide", 2008)

"Stored in, and managed by, your database server, this software is executed when a certain event occurs. These events can range from information creation or modification to structural changes to your database. When the event occurs, the trigger is executed, causing a pre-determined set of actions to take place. These actions can encompass data validation, alerts, warnings, and other administrative operations. Triggers can invoke other triggers and stored procedures." (Robert D. Schneider and Darril Gibson, "Microsoft SQL Server 2008 All-In-One Desk Reference For Dummies", 2008)

"A stored procedure that executes in response to a Data Manipulation Language (DML) or Data Definition Language (DDL) event." (Jim Joseph, "Microsoft SQL Server 2008 Reporting Services Unleashed", 2009)

"A SQL program module that is executed when a specific data modification activity occurs. Triggers are stored in the database they manipulate." (Jan L Harrington, "SQL Clearly Explained" 3rd Ed., 2010)

"A stored procedure that can be triggered and executed automatically when a database operation such as insert, update, or delete takes place." (Paulraj Ponniah, "Data Warehousing Fundamentals for IT Professionals", 2010)

"A procedural SQL code that is automatically invoked by the relational database management system upon the occurrence of a data manipulation event." (Carlos Coronel et al, "Database Systems: Design, Implementation, and Management" 9th Ed., 2011)

"A software routine guaranteed to execute when an event occurs. Often a trigger will monitor changes to data values. A trigger includes a monitoring procedure, a set or range of values to check data integrity, and one or more procedures invoked in response, which may update other data or fulfill a data subscription." (DAMA International, "The DAMA Dictionary of Data Management", 2011)

"An event that causes a handler to be run." (Jon Orwant et al, "Programming Perl, 4th Ed.", 2012)

"An event-driven specialized procedure that is attached to database tables; typically implemented to support data integrity requirements." (Craig S Mullins, "Database Administration", 2012)

"A database object that is associated with a single base table or view and that defines a rule. The rule consists of a set of SQL statements that runs when an insert, update, or delete database operation occurs on the associated base table or view." (IBM, "Informix Servers 12.1", 2014)

"A database object that is associated with a single base table or view and that defines a rule. The rule consists of a set of SQL statements that runs when an insert, update, or delete database operation occurs on the associated base table or view." (Sybase, "Open Server Server-Library/C Reference Manual", 2019)

 "A PL/SQL or Java procedure that fires when a table or view is modified or when specific user or database actions occur. Procedures are explicitly run, whereas triggers are implicitly run." (Oracle, "Oracle Database Concepts")

"A stored procedure that executes in response to a data manipulation language (DML) or data definition language (DDL) event." (Microsoft Technet,)

Related Posts Plugin for WordPress, Blogger...

About Me

My photo
Koeln, NRW, Germany
IT Professional with more than 24 years experience in IT in the area of full life-cycle of Web/Desktop/Database Applications Development, Software Engineering, Consultancy, Data Management, Data Quality, Data Migrations, Reporting, ERP implementations & support, Team/Project/IT Management, etc.