
04 February 2021

📦Data Migrations (DM): Conceptualization (Part VI: Data Migration Layer)

Data Migration
Data Migrations Series

Besides migrating the master and transactional data from the legacy systems, there are usually three additional important business requirements for a Data Migration (DM): migrate the data within the expected timeline, with minimal disruption for the business, and within the expected quality levels. Hence, the DM’s timeline must match and synchronize with the main project’s timeline in terms of the main milestones, though the DM itself typically needs to be executed within a small timeframe of a few days during the Go-Live. Concerning the third requirement, even if the data have high quality as available in the source systems or as provided by the business, aspects like integration and consistency rely primarily on the DM logic.

To address these requirements the DM logic must reach a certain level of performance and quality that allows importing the data as expected. From the project’s beginning until UAT, the DM team will integrate the various pieces of information iteratively, will need to test the changes several times, and will have to troubleshoot the deviations from expectations. The volume of effort required for these activities can be overwhelming. It’s not only important for the whole solution to be performant; each step must be designed so that, besides fast execution, the changes and the troubleshooting involve a minimum of overhead.

To better understand the importance of this, imagine a quest game in which the character has to go through a labyrinth with traps. If the player makes a mistake, he needs to restart from a certain distant point or even from the beginning. Now imagine that for each mistake he has the possibility of going one step back, trying a new option and moving forward. For some it may look like cheating, though in this way one can finish the game relatively quickly. It would be great if executing a DM allowed the same flexibility.

Unfortunately, unless the data are stored between steps or each step is implemented as a separate package, an ETL solution doesn’t provide the flexibility of changing the code, moving one step back, rerunning the step and troubleshooting, over and over again as in the quest game. To better illustrate the impact of such an approach, let’s consider that the DM has about 40 entities and one needs to perform on average 20 changes per entity. If one is able to move forwards and backwards, each change will probably take only a few minutes to execute. Otherwise, rerunning a whole package can take 5-10 times longer or even more, depending on the package’s size and the data volume. For 800 changes, only one additional minute per change equates to 800 minutes (about 13 hours).

By contrast, storing the data for an entity in a database at the important points of the processing and implementing the logic as a succession of SQL scripts provides this flexibility. The most important downside is that the steps need to be executed manually, though this is a small price to pay for the flexibility and control gained. Moreover, with a few tricks one can load deltas as in the case of a phased DM.
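A minimal sketch of this approach, assuming SQL Server-like syntax and purely illustrative schema and table names, with each script persisting its output for the next step:

-- step 1: persist the raw extract from the legacy system
SELECT CustomerId, CustomerName, CountryCode, CreatedOn
INTO Staging.Customers_Raw
FROM LegacySystem.dbo.Customers;

-- step 2: apply the transformations on the previous step's output
SELECT CustomerId
, LTRIM(RTRIM(CustomerName)) AS CustomerName
, UPPER(CountryCode) AS CountryCode
, CreatedOn
INTO Staging.Customers_Transformed
FROM Staging.Customers_Raw;

-- a changed or failed step can be corrected and rerun from that point onward,
-- without re-executing the steps already completed

Because each step’s output is persisted, one can go one step back, adjust the logic and rerun only what’s needed, similarly to the quest game above.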

To assure that the consistency of the data is kept, one needs to build for each entity a set of validation queries that check for duplicates, special cases, data integrity, incorrect formats, etc. The queries can be included in the sequence of logic used for the DM, so one can react promptly to each unexpected value. When required, the validation rules can be built into reports and used by the users in the data cleaning process, or even logged periodically per entity for tracking the progress.
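For example, a few such validation queries for a hypothetical Customers entity could look as follows (the checks and the object names are only illustrative):

-- duplicates by business key
SELECT CustomerNumber, COUNT(*) AS Occurrences
FROM Staging.Customers_Transformed
GROUP BY CustomerNumber
HAVING COUNT(*) > 1;

-- values not matching the expected format (e.g. 2-letter country codes)
SELECT CustomerId, CountryCode
FROM Staging.Customers_Transformed
WHERE LEN(CountryCode) <> 2;

-- references without a match in the target list of values
SELECT CST.CustomerId, CST.CountryCode
FROM Staging.Customers_Transformed CST
     LEFT JOIN Staging.Countries CTR
       ON CST.CountryCode = CTR.CountryCode
WHERE CTR.CountryCode IS NULL;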


31 October 2020

🧊Data Warehousing: Architecture (Part III: Data Lakes & other Puddles)

Data Warehousing

One can consider a data lake as a repository of all of an organization’s data found in raw form. However, this constraint might be too harsh, as data found at different levels of processing can be imported as well; for example, the results of data mining or other Data Science techniques/methods can be considered as raw data for further processing.

In the initial definition provided by James Dixon, the difference between a data lake and a data mart/warehouse was expressed metaphorically as the transition from bottled water to lakes streamed (artificially) from various sources. It contrasts thus the objective-oriented, limited and single-purposed role of the data mart/warehouse with the flow of data in nature that can be tapped and harnessed as desired. These are though metaphors intended to sensitize the buyer. Personally, I like to think of the data lake as an extension of the data infrastructure, of which the data mart or warehouse is an integrant part. Imposing further constraints seems to have no benefit.

Probably the most important characteristic of a data lake is that it makes the data of an organization discoverable and consumable, though from there to insight and other benefits is a long road that requires specific knowledge about the techniques used, as well as about the organization’s processes and data. Without this, data lake-based solutions can lead to erroneous results, just as mixing several ingredients without knowing how to use them leads to cooking experiments aloof from the art of cooking.

A characteristic of data is that they go through continuous change and have different timeliness, respectively different degrees of quality in respect to the data quality dimensions implied and the sources considered. Data need to reflect the reality at the level of detail and quality required by the processing application(s), and this applies to data warehouses/marts as well as to data lake-based solutions.

Data found in raw form don’t necessarily represent the truth and don’t necessarily acquire good quality no matter how much they are processed. Solutions need to be resilient in respect to the data they handle through their layers, independently of data quality and transmission problems. Whether one talks about ETL, data migration or other types of data processing, keeping the data integrity across the various levels and layers is maybe the most important demand upon solutions.

Snapshots, as moment-in-time recordings of tables, entities, sets of entities, datasets or whole databases, often prove to be the best mechanism for keeping data integrity when this aspect is essential to the processing (e.g. data migrations, high-accuracy measurements). Unfortunately, the more systems are involved in the process and the broader the span of the solution over the sources, the more difficult it becomes to take such snapshots.

A SQL query’s output represents a snapshot of the data, therefore SQL-based solutions are usually appropriate for most of the business scenarios in which the characteristics of the data (typically volume, velocity and/or variety) make their processing manageable. However, when the data are extracted by other means, integrity is harder to obtain, especially when there’s no timestamp to allow partitioning the data on a time scale; the handling of data integrity becomes thus, in extremis, a programmer’s task. In addition, getting snapshots of the data while they are being changed can be a costly and futile task.
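For illustration, such a snapshot can be as simple as persisting a query’s output together with the moment it was taken; the object names and the timestamp column below are only assumptions:

-- persist a moment-in-time copy of the records of interest
SELECT GETDATE() AS SnapshotDate
, ORD.OrderId
, ORD.CustomerId
, ORD.OrderValue
INTO dbo.Orders_20201031
FROM dbo.Orders ORD
WHERE ORD.ModifiedOn <= '20201031';  -- partitioning on a time scale, possible only when a timestamp exists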

Further on, maintaining data integrity can prove to be a matter of design not only in respect to the processing of the data, but also in respect to the source applications and the business processes they implement. Mastery of the underlying principles, techniques, patterns and methodologies helps in designing the right solutions.

Note:
Written as answer to a Medium post on data lakes and batch processing in data warehouses. 

05 May 2017

⛏️Data Management: Data Steward (Definitions)

"A person with responsibility to improve the accuracy, reliability, and security of an organization’s data; also works with various groups to clearly define and standardize data." (Margaret Y Chu, "Blissful Data ", 2004)

"Critical players in data governance councils. Comfortable with technology and business problems, data stewards seek to speak up for their business units when an organization-wide decision will not work for that business unit. Yet they are not turf protectors, instead seeking solutions that will work across an organization. Data stewards are responsible for communication between the business users and the IT community." (Tony Fisher, "The Data Asset", 2009)

"A business leader and/or subject matter expert designated as accountable for: a) the identification of operational and Business Intelligence data requirements within an assigned subject area, b) the quality of data names, business definitions, data integrity rules, and domain values within an assigned subject area, c) compliance with regulatory requirements and conformance to internal data policies and data standards, d) application of appropriate security controls, e) analyzing and improving data quality, and f) identifying and resolving data related issues. Data stewards are often categorized as executive data stewards, business data stewards, or coordinating data stewards." (DAMA International, "The DAMA Dictionary of Data Management", 2011)

[business data steward:] "A knowledge worker, business leader, and recognized subject matter expert assigned accountability for the data specifications and data quality of specifically assigned business entities, subject areas or databases, but with less responsibility for data governance than a coordinating data steward or an executive data steward." (DAMA International, "The DAMA Dictionary of Data Management", 2011)

"The person responsible for maintaining a data element in a metadata registry." (Microsoft, "SQL Server 2012 Glossary, 2012)

"The term stewardship is “the management or care of another person’s property” (NOAD). Data stewards are individuals who are responsible for the care and management of data. This function is carried out in different ways based on the needs of particular organizations." (Laura Sebastian-Coleman, "Measuring Data Quality for Ongoing Improvement ", 2012)

"The person responsible for maintaining a data element in a metadata registry." (Microsoft, SQL Server 2012 Glossary, 2012)

"An individual comfortable with both technology and business problems. Stewards are responsible for communicating between the business users and the IT community." (Jim Davis & Aiman Zeid, "Business Transformation: A Roadmap for Maximizing Organizational Insights", 2014)

"A role in the data governance organization that is responsible for the development of a uniform data model for business objects used across boundaries. The data steward is also often responsible for the development of master data management and ensures compliance with the governance rules." (Boris Otto & Hubert Österle, "Corporate Data Quality", 2015)

"A natural person assigned the responsibility to catalog, define, and monitor changes to critical data. Example: The data steward for finance critical data is Dan." (Gregory Lampshire, "The Data and Analytics Playbook", 2016)

"A person responsible for managing data content, quality, standards, and controls within an organization or function." (Jonathan Ferrar et al, "The Power of People", 2017)

"A data steward is a job role that involves planning, implementing and managing the sourcing, use and maintenance of data assets in an organization. Data stewards enable an organization to take control and govern all the types and forms of data and their associated libraries or repositories." (Richard T Herschel, "Business Intelligence", 2019)

18 January 2010

🗄️Data Management: Data Quality Dimensions (Part VI: Referential Integrity)

  
Data Management
Data Management Series

Referential integrity, when considered as a data quality dimension, refers to the degree to which the values of a key in one table (aka reference values) match the values of a key in a related table (aka referenced values). Typically, this is assured by design in Database Management Systems (DBMS) via a feature also called referential integrity, which defines an explicit relationship between the two tables and makes sure that the values remain valid during database changes. Thus, when a record is inserted or updated and a value is provided for the reference, the system checks that the referenced value is valid, otherwise it throws a referential integrity error. A similar error is thrown when one attempts to delete a record whose value is still referenced by a table for which the relationship was explicitly defined.

Using referential integrity is a recommended technique for assuring the overall integrity of the data in a database, though there are also cases in which it is not enforced for all tables (e.g. data warehouses) or is relaxed for exceptions (e.g. interface tables where records are imported as they are, attributes whose values reference data from multiple tables). Therefore, even if there are tables with referential integrity enforced, don't make the assumption that it applies to all tables!

In relational DBMS three types of integrity are mentioned – entity, referential and domain integrity. Entity integrity demands that each table must have a primary key that contains no Null values. Referential integrity demands that each non-null value of a foreign key must match the value of a primary key [1], while domain integrity demands that the type of an attribute be restricted to a certain data type and that its format or range of possible values be restricted by constraints or rules [2].
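The three types of integrity translate directly into DDL constraints; a minimal sketch, assuming SQL Server-like syntax and an illustrative Orders table referencing an existing Customers table:

CREATE TABLE dbo.Orders (
  OrderId int NOT NULL PRIMARY KEY                -- entity integrity: primary key without Nulls
, CustomerId int NOT NULL
      REFERENCES dbo.Customers(CustomerId)        -- referential integrity: foreign key
, OrderDate date NOT NULL                         -- domain integrity: data type restriction
, Quantity int NOT NULL CHECK (Quantity > 0)      -- domain integrity: constraint on the range of values
);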

Even if not mandatory, all three types of integrity are quintessential for reliable relational databases. When referential integrity is not enforced at the database level or at least in code, a record can be deleted while a foreign key still points to it, which can lead to the unexpected disappearance of records from the system’s UI even if the records are still available.

During conversions or data migrations it is important to assure that the various datasets loaded match the referential and domain integrity of the database in which the data will be loaded, otherwise the records not respecting these types of integrity will be rejected. The rejection itself might not be a problem for a few records, though when it happens at large scale the situation changes dramatically, especially when the system gives no adequate message about the cause of the rejection. A recommended approach is to assure that the scope is synchronized between the various data elements, and that the referential integrity of the datasets is validated before the data are loaded into the destination database.
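Validating the referential integrity of a dataset before the load reduces to simple checks like the following sketch (the objects are only illustrative):

-- records whose referenced value has no match in the target's list of values
SELECT ITM.ItemId, ITM.UnitOfMeasure
FROM Staging.Items ITM
WHERE NOT EXISTS (
    SELECT 1
    FROM Target.UnitsOfMeasure UOM
    WHERE UOM.Code = ITM.UnitOfMeasure
);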

There are several sources (e.g. [3]) that consider Codd’s referential integrity constraint as a type of consistency; in support of this idea one could mention that referential integrity can be used to solve data consistency issues by bringing the various LOV into the systems. Referential integrity is however mainly an architectural concept, even if it involves the 'consistency' of foreign key/primary key pairs.

Note:
Expect the unforeseeable! It’s always a good idea to check whether the referential integrity is kept by a system – there are so many things that could go wrong! In data migration solutions, data warehouses and, more generally, analytical solutions it is a good idea to have in place mechanisms that check for this kind of issue.


Written: Jan-2010, Last Reviewed: Mar-2024

References:
[1] Halpin, T. (2001) "Information Modeling and Relational Databases: From Conceptual Analysis to Logical Design", Morgan Kaufmann Publishers. ISBN: 1-55860-672-6
[2] MSDN (2009) "Data Integrity" [Online] Available from: http://msdn.microsoft.com/en-us/library/ms184276.aspx (Accessed: 18 January 2009)
[3] Lee Y.W., Pipino L.L., Funk J.D., Wang R.Y. (2006) "Journey to Data Quality", MIT Press. ISBN: 0-262-12287-1

🗄️Data Management: Data Quality Dimensions (Part V: Consistency)

Data Management
Data Management Series

IEEE defines consistency in general as "the degree of uniformity, standardization, and freedom from contradiction among the documents or parts of a system or component" [4]. In respect to data, consistency can thus be defined as the degree of uniformity and standardization of data values among systems or data repositories (aka cross-system consistency), among records within the same repository (aka cross-record consistency), or within the same record at different points in time (temporal consistency) (see [2], [3]).

Unfortunately, uniformity, standardization and freedom from contradiction can be considered as data quality dimensions as well, and they might even have a broader scope than the one provided by consistency. Moreover, the definition requires further definitions for the concept to be understood, which is not ideal.

Simply put, consistency refers to the extent to which (data) values are consistent in notation, respectively the degree to which the data values across different contexts match. For example, one system uses "Free on Board" while another system uses "FOB" to refer to the point at which the obligations, costs, and risks involved in the delivery of goods shift from the seller to the buyer. The two systems refer to the same value by two different notations. When the two systems are integrated, because of the different values used, the rules defined in the target system might make the record fail because another value is expected. Conversely, other systems may import the value as it is, leading thus to two values used in parallel for the same meaning. This can happen not only to reference data, but also to master data, for example when a value deviates slightly from the expected value (e.g. is misspelled). A more special case is when one of the systems uses case-sensitive values (usually the target system, though there can also be bidirectional data integrations).

One solution for such situations would be to "standardize" the values across systems, though not all systems allow changing the values easily once they have been set. Another solution would be to create a mapping as part of the integration; maintaining such mappings for many cases is suboptimal, but in the end it might be the only solution. Further systems can be impacted by these issues as well (e.g. data warehouses, data marts).
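A minimal sketch of such a mapping, assuming a dedicated mapping table maintained as part of the integration (all names are illustrative):

-- translate the source notation into the one expected by the target system
SELECT SRC.OrderId
, COALESCE(MAP.TargetValue, SRC.DeliveryTerms) AS DeliveryTerms  -- e.g. 'Free on Board' -> 'FOB'
FROM Source.Orders SRC
     LEFT JOIN Integration.ValueMappings MAP
       ON MAP.Attribute = 'DeliveryTerms'
      AND MAP.SourceValue = SRC.DeliveryTerms;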

It's recommended to use a predefined list of values (LOV) - a data dictionary, an ontology or any other type of knowledge representation form that can be used to 'enforce' data consistency. ‘Enforce’ is maybe not the best term, because the two data sets could be disconnected from each other, in which case it remains the users’ responsibility to ensure the overall consistency, or the two data sets could be integrated using specific techniques. In many cases the consistency of the values taken by one attribute is checked against an existing LOV, though for data formed from multiple segments (e.g. accounts) each segment might need to be checked against a specific data set or rule generator, such mechanisms implying multi-attribute mappings or associational rules that specify the possible values.
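The check of an attribute against a LOV reduces then to a simple query like the following sketch (the objects are illustrative):

-- values used in the records but not present in the LOV
SELECT DISTINCT ORD.DeliveryTerms
FROM dbo.Orders ORD
     LEFT JOIN dbo.DeliveryTermsLOV LOV
       ON ORD.DeliveryTerms = LOV.Value
WHERE LOV.Value IS NULL;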

As highlighted also by [1], there are two aspects of consistency: structural consistency, in which two or more values can be distinct in notation but have the same meaning (e.g. missing vs. n/a), and semantic consistency, in which each value has a unique meaning (e.g. only n/a is allowed to highlight missing values). One should target having the data semantically consistent, to avoid confusion or the accidental exclusion of data during filtering or reporting. More and more organizations are investing in ontologies, which allow ensuring the semantic consistency of concepts/entities, though for most cases simple single- or multi-attribute lists of values are enough.


Written: Jan-2010, Last Reviewed: Mar-2024

References:
[1] Chapman A.D. (2005) "Principles of Data Quality", version 1.0. Report for the Global Biodiversity Information Facility, Copenhagen
[2] David Loshin (2009) Master Data Management
[3] IEEE (1990) "IEEE Standard Glossary of Software Engineering Terminology"
[4] DAMA International (2010) "The DAMA Dictionary of Data Management" 1st Ed.

13 January 2010

🗄️Data Management: Data Quality Dimensions (Part III: Completeness)

Data Management
Data Management Series

Completeness refers to the extent to which data are missing in a dataset, reflected in the number of missing values - also referred to as empty (when an empty string or a default value is used) or 'Nulls' (aka unknown values) - and/or in the number of missing records.

Missing values are typically considered in relation to mandatory attributes, attributes that need a not-Null value for each record, though upon case the analysis might be applied to non-mandatory (optional) attributes too, for example when one intends to understand whether the attributes are adequately maintained or not. It’s interesting that [1] considers also the inapplicable attributes, referring to the attributes not applicable (relevant) for certain scenarios (e.g. physical dimensions for service-based materials), which together with the applicable (relevant) attributes can be considered as another type of categorization for attributes. Whether an attribute is mandatory is decided upon the business context and not necessarily upon the physical structure containing the attribute; in other words, an attribute could be optional as per the database schema and mandatory per business rules.
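The completeness of the mandatory attributes can be quantified with simple counts, as in the following sketch (the attribute and table names are only illustrative):

-- number of records with missing values per mandatory attribute
SELECT COUNT(*) AS TotalRecords
, SUM(CASE WHEN IsNull(CustomerName, '') = '' THEN 1 ELSE 0 END) AS MissingCustomerName
, SUM(CASE WHEN IsNull(CountryCode, '') = '' THEN 1 ELSE 0 END) AS MissingCountryCode
FROM dbo.Customers;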

'Missing records' can be a misleading term because it is used in several contexts; however, within the data completeness context it refers only to the cases not covered by data integrity. For example, in parent-child table relations the header data was entered though the detail data is missing, either not entered or deleted; such a case is not covered by referential integrity because there is no missing reference, but just a parent without child data (1:n cardinality).
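Such cases can be identified with a query that searches for headers without lines, as in the following sketch (the object names are illustrative):

-- order headers without any order line
SELECT ORH.OrderId
FROM dbo.OrderHeaders ORH
     LEFT JOIN dbo.OrderLines ORL
       ON ORH.OrderId = ORL.OrderId
WHERE ORL.OrderId IS NULL;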

A mixed example occurs when the same entity is split across several tables at the same level of detail, one of the tables having to function as a parent, falling thus under the previously mentioned example (1:1 cardinality). In such a scenario it depends how one reports the nonconformances per record: (1) the error is counted only once, independently of how many dimensions raised an error; (2) the error is counted for each dimension. For (2), when the referential integrity fails, an error is raised also for each mandatory attribute.

Both examples deal with explicit data referents – the 'parent' data – though there are cases in which the referents are implicit, for example when the data are not available for a certain time interval (e.g. period, day) even if needed; also for this case the referents could be made explicit, falling then under the previously mentioned examples. In such scenarios all the attributes corresponding to the missing records will be null.

Normally the completeness of parent-child relations is enforced with the help of referential integrity and database transactions - sets of actions performed as a single unit of work - which allow saving the parent data only if the child data were saved successfully, though such constraints are not always necessary.
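A sketch of such a transaction, assuming SQL Server-like syntax and illustrative tables, in which the parent is kept only if the child data could be saved as well:

BEGIN TRY
    BEGIN TRANSACTION;

    INSERT INTO dbo.OrderHeaders (OrderId, CustomerId, OrderDate)
    VALUES (1001, 1, '20100113');

    INSERT INTO dbo.OrderLines (OrderId, LineNo, ItemId, Quantity)
    VALUES (1001, 1, 500, 3);

    COMMIT TRANSACTION;       -- parent and child are persisted together
END TRY
BEGIN CATCH
    IF @@TRANCOUNT > 0
        ROLLBACK TRANSACTION; -- nothing is persisted if any of the inserts fails
END CATCH;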

Data should be cleaned when feasible in the source system(s) and this applies to incomplete data as well. It might be feasible to clean the values in Excel files or similar tools, by exporting and then reimporting the clean values back into the respective systems.

In data migrations or similar scenarios, the completeness in particular and the data quality in general must be judged against the target system(s), and thus the dataset must be enriched in an intermediate layer as needed. Upon case, one can consider using default values, though this amounts to technical debt that is unlikely to be addressed later. Moreover, one should prioritize the effort and consider first the attributes which are needed for the good functioning of the target system(s).
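As a sketch, the enrichment in the intermediate layer can be as simple as providing defaults or deriving values from related entities (the names and the default value are only illustrative):

SELECT CST.CustomerId
, CST.CustomerName
, COALESCE(CST.PaymentTerms, 'NET30') AS PaymentTerms        -- default value for a mandatory attribute
, COALESCE(CST.CountryCode, ADR.CountryCode) AS CountryCode  -- value derived from a related entity
FROM Staging.Customers CST
     LEFT JOIN Staging.Addresses ADR
       ON CST.CustomerId = ADR.CustomerId;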

Ideally, data validation techniques should be applied for the mandatory fields in the source systems, when feasible.


Written: Jan-2010, Last Reviewed: Mar-2024 

References:
[1] David Loshin (2009) "Master Data Management"

01 April 2009

🛢DBMS: Data Integrity (Definitions)

"The correctness and completeness of data within a database." (Karen Paulsell et al, "Sybase SQL Server: Performance and Tuning Guide", 1996)

"A general term that refers to the correctness of the data contained in a database." (Owen Williams, "MCSE TestPrep: SQL Server 6.5 Design and Implementation", 1998)

"The accuracy and reliability of data. Data integrity is important in both single-user and multiuser environments. In multiuser environments, where data is shared, both the potential for and the cost of data corruption are high. In large-scale relational database management system (RDBMS) environments, data integrity is a primary concern." (Microsoft Corporation, "SQL Server 7.0 System Administration Training Kit", 1999)

"Data integrity refers to a state in which all the data values stored in the database are correct." (Anthony Sequeira & Brian Alderman, "The SQL Server 2000 Book", 2003)

"The condition that exists when there’s no accidental or intentional destruction, alteration, or loss of data." (Sharon Allen & Evan Terry, "Beginning Relational Data Modeling" 2nd Ed., 2005)

"The bits of data that are put in storage (via I/O writes) are the same bits of data—order and completeness - that come out (via I/O reads)." (David G Hill, "Data Protection: Governance, Risk Management, and Compliance", 2009)

"In a relational database, refers to a condition in which the data in the database is in compliance with all entity and referential integrity constraints." (Carlos Coronel et al, "Database Systems: Design, Implementation, and Management 9th Ed", 2011)

"The accuracy of data and its conformity to its expected value, especially after being transmitted or processed." (Microsoft, "SQL Server 2012 Glossary", 2012)

"Refers to the accuracy and quality of the data." (Steve Conger, "Hands-on database : an introduction to database design and development", 2012)

"Data integrity is the state of data being free from corruption." (Vince Buffalo, "Bioinformatics Data Skills", 2015)

"The property that data has not been altered in an authorized manner." (O Sami Saydjari, "Engineering Trustworthy Systems: Get Cybersecurity Design Right the First Time", 2018)

"The degree to which the data is internal or referential/consistent. If the key to refer to a different table is invalid, the join between the two tables cannot be made." (Piethein Strengholt, "Data Management at Scale", 2020)

"(1) In the context of data and network security: The assurance that information can only be accessed or modified by those authorized to do so. (2) In the context of data quality: The assurance the data are clean, traceable, and fit for purpose." (CODATA)

"The degree to which a collection of data is complete, consistent, and accurate. See also: data security; database integrity; integrity." (IEEE 610.5-1990)

27 January 2009

🛢DBMS: Unique Index (Definitions)

"Indexes which do not permit any two rows in the specified columns to have the same value. SQL Server checks for duplicate values when you create the index (if data already exists) and each time data is added." (Karen Paulsell et al, "Sybase SQL Server: Performance and Tuning Guide", 1996)

"Restricts duplicate values in a column." (Owen Williams, "MCSE TestPrep: SQL Server 6.5 Design and Implementation", 1998)

"An index in which no two rows are permitted to have the same index value, thus prohibiting duplicate index or key values. The system checks for duplicate key values when the index is created and checks each time data is added with an INSERT or UPDATE statement." (Microsoft Corporation, "SQL Server 7.0 System Administration Training Kit", 1999)

"An index with perfect selectivity, that is, an index with no duplicate values allowed. Standard SQL allows multiple NULLs in a unique index, since a NULL is not considered to be equal to any other value, including another NULL. Many DBMSs, however, accept only one NULL in a unique index. Some DBMSs won't allow even a single NULL." (Peter Gulutzan & Trudy Pelzer, "SQL Performance Tuning", 2002)

"A b-tree index whose keys are not duplicated." (Bob Bryla, "Oracle Database Foundations", 2004)

"This is a special index that prevents users from inserting the same value into a column (or columns) more than once." (Joseph L Jorden & Dandy Weyn, "MCTS Microsoft SQL Server 2005: Implementation and Maintenance Study Guide - Exam 70-431", 2006)

"An index that enforces uniqueness for a particular field across a single collection." (MongoDb, "Glossary", 2008)

"Sometimes created explicitly by the user, and sometimes created automatically by the database server. By guaranteeing one-and-only-one value for a given table, this structure speeds access to information and preserves data integrity." (Robert D. Schneider and Darril Gibson, "Microsoft SQL Server 2008 All-In-One Desk Reference For Dummies", 2008)

"An index in which no two rows are permitted to have the same index value, thus prohibiting duplicate index or key values." (Microsoft, "SQL Server 2012 Glossary,", 2012)

"An index that ensures that no identical key values are stored in a column or a set of columns in a table." (Sybase, "Open Server Server-Library/C Reference Manual", 2019)

