SQL Troubles

13 August 2017

#️⃣Software Engineering: SQL Reloaded (Patt II: Who Messed with My Data?)

Introduction

"Errors, like straws, upon the surface flow;
He who would search for pearls must dive below."
(John Dryden)

Life of a programmer is full of things that stopped working overnight. What’s beautiful about such experiences is that always there is a logical explanation for such “happenings”. There are two aspects - one is how to troubleshoot such problems, and the second – how to avoid such situations, and this is typically done through what we refer as defensive programming. On one side avoiding issues makes one’s life simpler, while issues make it fuller.

I can say that I had plenty such types of challenges in my life, most of them self-created, mainly in the learning process, but also a good share of challenges created by others. Independently of the time spent on troubleshooting such issues, it’s the experience that counts, the little wins against the “dark” side of programming. In the following series of posts I will describe some of the issues I was confronted directly or indirectly over time. In an ad-hoc characterization they can be split in syntax, logical, data, design and systemic errors.

Syntax Errors

"Watch your language young man!"

(anonymous mother)

Syntax in natural languages like English is the sequence in which words are put together, word’s order indicating the relationship existing between words. Based on the meaning the words carry and the relationships formed between words we are capable to interpret sentences. SQL, initially called SEQUEL (Structured English Query Language) is an English-like language designed to manipulate and retrieve data. Same as natural languages, artificial languages like SQL have their own set of (grammar) rules that when violated lead to runtime errors, leading to interruption in code execution or there can be cases when the code runs further leading to inconsistencies in data. Unlike natural languages, artificial languages interpreters are quite sensitive to syntax errors.

Syntax errors are common to beginners, though a moment of inattention or misspelling can happen to anyone, no matter how versatile one’s coding is. Some are more frequent or have a bigger negative impact than others. Here are some of the typical types of syntax errors:
- missing brackets and quotes, especially in complex formulas;
- misspelled commands, table or column names;
- omitting table aliases or database names;
- missing objects or incorrectly referenced objects or other resources;
- incorrect statement order;
- relying on implicit conversion;
- incompatible data types;
- incorrect parameters’ order;
- missing or misplaced semicolons;
- usage of deprecated syntax.

Typically, syntax errors are easy to track at runtime with minimal testing as long the query is static. Dynamic queries on the other side require sometimes a larger number of combinations to be tested. The higher the number of attributes to be combined and the more complex the logic behind them, the more difficult is to test all combinations. The more combinations not tested, the higher the probability that an error might lurk in the code. Dynamics queries can thus easily become (syntax) error generators.

Logical Errors

"Students are often able to use algorithms to solve numerical problems
without completely understanding the underlying scientific concept."
(Eric Mazur)

One beautiful aspect of the human mind is that it needs only a rough understanding about how a tool works in order to make use of it up to an acceptable level. Therefore often it settles for the minimum of understanding that allows it to use a tool. Aspects like the limits of a tool, contexts of applicability, how it can be used efficiently to get the job done, or available alternatives, all these can be ignored in the process. As the devil lies in details, misunderstanding how a piece of technology works can prove to be our Achilles’ heel. For example, misunderstanding how sets and the different types of joins work, that lexical order differ from logical order and further to order of execution, when is appropriate or inappropriate to use a certain technique or functionality can make us make poor choices.

One of these poor choices is the method used to solve a problem. A mature programming language can offer sometimes two or more alternatives for solving a problem. Choosing the inadequate solution can lead to performance issues in time. This type of errors can be rooted in the lack of understanding of the data, of how an application is used, or how a piece of technology works.

"I suppose it is tempting, if the only tool you have is a hammer,
to treat everything as if it were a nail."
(Abraham Maslow)

Some of the errors derive from the difference between how different programming languages work with data. There can be considerable differences between procedural, relational and vector languages. When jumping from one language to another, one can be tempted to apply the same old techniques to the new language. The solution might work, though (by far) not optimal.

The capital mistake is to be the man of one tool, and use it in all the cases, even when not appropriate. For example. when one learned working with views, attempts to apply them all over the code in order to reuse logic, creating thus chains of views which even prove to be flexible, their complexity sooner or later will kick back. Same can happen with stored procedures and other object types as well. A sign of mastery is when the developer adapts his tools to the purpose.

"For every complex problem there is an answer
that is clear, simple, and wrong."
(Henry L Mencken)

One can build elegant solutions but solve the wrong problem. Misunderstanding the problem at hand is one type of error sometimes quite difficult to identify. Typically, they can be found through thorough testing. Sometimes the unavailability of (quality) data can impede the process of testing, such errors being found late in the process.

At the opposite side, one can attempt to solve the right problem but with logic flaws – wrong steps order, wrong algorithm, wrong set of tools, or even missing facts/assumptions. A special type of logical errors are the programmatic errors, which occur when SQL code encounters a logic or behavioral error during processing (e.g. infinite loop, out of range input). [1]

Data Errors

"Data quality requires certain level of sophistication within a company
to even understand that it’s a problem."
(Colleen Graham)

Poor data quality is the source for all evil, or at least for some of the evil. Typically, a good designed database makes use of a mix of techniques to reduce the chances for inconsistencies: appropriate data types and data granularity, explicit transactions, check constraints, default values, triggers or integrity constraints. Some of these techniques can be too restrictive, therefore in design one has to provide a certain flexibility in the detriment of one of the above techniques, fact that makes the design vulnerable to same range of issues: missing values, missing or duplicate records.

No matter how good a database was designed, sometimes is difficult to cope with users’ ingenuity – misusage of functionality, typically resulting in deviations from standard processes, that can invalidate an existing query. Similar effects have the changes to processes or usage of new processed not addressed in existing queries or reports.

Another topic that have a considerable impact on queries’ correctness is the existence, or better said the inexistence of master data policies and a board to regulate the maintenance of master data. Without proper governance of master data one might end up with a big mess with no way to bring some order in it without addressing the quality of data adequately.

Designed to Fail

"The weakest spot in a good defense is designed to fail."
(Mark Lawrence)

In IT one can often meet systems designed to fail, the occurrences of errors being just a question of time, kind of a ticking bomb. In such situations, a system is only as good as its weakest link(s). Issues can be traced back to following aspects:
- systems used for what they were not designed to do – typically misusing a tool for a purpose for which another tool would be more appropriate (e.g. using Excel as database, using SSIS for real-time, using a reporting tool for data entry);
- poor performing systems - systems not adequately designed for the tasks supposed to handle (e.g. handling large volume of data/transactions);
- systems not coping with user’s inventiveness or mistakes (e.g. not validating adequately user input or not confirming critical actions like deletion of records);
- systems not configurable (e.g. usage of hardcoded values instead of parameters or configurable values);
- systems for which one of the design presumptions were invalidated by reality (e.g. input data don’t have the expected format, a certain resource always exists);
- systems not being able to handle changes in environment (e.g. changing user settings for language, numeric or data values);
- systems succumbing in their own complexity (e.g. overgeneralization, wrong mix of technologies);
- fault intolerant systems – system not handling adequately more or less unexpected errors or exceptions (e.g. division by zero, handling of nulls, network interruptions, out of memory).

Systemic Errors

Systemic errors can be found at the borders of the "impossible", situations in which the errors defy the common sense. Such errors are not determined by chance but are introduced by an inaccuracy inherent to the system/environment.

A systemic error occurs when a SQL program encounters a deficiency or unexpected condition with a system resource (e.g. a program encountered insufficient space in tempdb to process a large query, database/transaction log running out of space). [1]

Such errors are often difficult but not impossible to reproduce. The difficulty resides primarily in figuring out what happened, what caused the error. Once one found the cause, with a little resourcefulness one can come with an example to reproduce the error.

Conclusion

"To err is human; to try to prevent recurrence of error is science."
(Anon)

When one thinks about it, there are so many ways to fail. In the end to err is human and nobody is exempted from making mistakes, no matter how good or wise. The quest of a (good) programmer is to limit errors’ occurrences, and to correct them early in process, before they start becoming a nightmare.

References:
[1] Transact-SQL Programming: Covers Microsoft SQL Server 6.5 /7.0 and Sybase, by Kevin Kline, Lee Gould & Andrew Zanevsky, O’Reilly, ISBN 10: 1565924010, 1999

18 June 2017

💠🛠️SQL Server: Administration (Database Recovery on SQL Server 2017)

I installed today SQL Server 2017 CTP 2.1 on my Lab PC without any apparent problems. It was time to recreate some of the databases I used for testing. As previously I had an evaluation version of SQL Server 2016, it expired without having a backup for one of the databases. I could recreate the database from scripts and reload the data from various text files. This would have been a relatively laborious task (estimated time > 1 hour), though the chances were pretty high that everything would go smoothly. As the database is relatively small (about 2 GB) and possible data loss was neglectable, I thought it would be possible to recover the data from the database with minimal loss in less than half of hour. I knew this was possible, as I was forced a few times in the past to recover data from damaged databases in SQL Server 2005, 2008 and 2012 environments, though being in a new environment I wasn’t sure how smooth will go and how long it would take.

Plan A - Create the database with ATTACH_REBUILD_LOG option:

As it seems the option is available in SQL Server 2017, so I attempted to create the database via the following script:

CREATE DATABASE  ON 
(FILENAME='I:\Data\.mdf') 
FOR ATTACH_REBUILD_LOG

And as expected I run into the first error:

Msg 5120, Level 16, State 101, Line 1 Unable to open the physical file "I:\Data\.mdf". Operating system error 5: "5(Access is denied.)".

Msg 1802, Level 16, State 7, Line 1 CREATE DATABASE failed. Some file names listed could not be created. Check related errors.

It looked like a permissions problem, though I wasn’t entirely sure which account is causing the problem. In the past I had problems with the Administrator account, so it was the first thing to try. Once I removed the permissions for Administrator account to the folder containing the database and gave it full control permissions again, I tried to create the database anew using the above script, running into the next error:

File activation failure. The physical file name "D:\Logs\_log.ldf" may be incorrect. The log cannot be rebuilt because there were open transactions/users when the database was shutdown, no checkpoint occurred to the database, or the database was read-only. This error could occur if the transaction log file was manually deleted or lost due to a hardware or environment failure.

Msg 1813, Level 16, State 2, Line 1 Could not open new database ''. CREATE DATABASE is aborted.

This approach seemed to lead nowhere, so it was time for Plan B.

Plan B - Recover the database into an empty database with the same name:

Step 1: Create a new database with the same name, stop the SQL Server, then copy the old file over the new file, and delete the new log file manually. Then restarted the server. After the restart the database will appear in Management Studio with the SUSPECT state.

Step 2: Set the database in EMERGENCY mode:

ALTER DATABASE  SET EMERGENCY, SINGLE_USER

Step 3: Rebuild the log file:

ALTER DATABASE <database_name>

REBUILD LOG ON (Name=’_Log',

FileName='D:\Logs\.ldf')

The rebuild worked without problems.

Step 4: Set the database in MULTI_USER mode:

ALTER DATABASE  SET MULTI_USER

Step 5: Perform a consistency check:

DBCC CHECKDB () WITH ALL_ERRORMSGS, NO_INFOMSG

After 15 minutes of work the database was back online.

Warnings:
Always attempt to recover the data for production databases from the backup files! Use the above steps only if there is no other alternative!
The consistency check might return errors. In this case one might need to run CHECKDB with REPAIR_ALLOW_DATA_LOSS several times [2], until the database was repaired.
After recovery there can be problems with the user access. It might be needed to delete the users from the recovered database and reassign their permissions!

Resources:
[1] In Recovery (2008) Creating, detaching, re-attaching, and fixing a SUSPECT database, by Paul S Randal [Online] Available from: https://www.sqlskills.com/blogs/paul/creating-detaching-re-attaching-and-fixing-a-suspect-database/
[2] In Recovery (2009) Misconceptions around database repair, by Paul S Randal [Online] Available from: https://www.sqlskills.com/blogs/paul/misconceptions-around-database-repair/
[3] Microsoft Blogs (2013) Recovering from Log File Corruption, by Glen Small [Online] Available from: https://blogs.msdn.microsoft.com/glsmall/2013/11/14/recovering-from-log-file-corruption/

24 May 2017

⛏️Data Management: Data Contracts (Definitions)

"Data contracts specifically define the data that is being exchanged between a client and service. The data contract is an agreement, meaning that the client and the service must agree on the data contract in order for the exchange of data to take place. Note that they don't have to agree on the data types, just the contract." (Pablo Cibraro & Scott Klein, "Professional WCF Programming: .NET Development with the Windows Communication Foundation", 2007)

"A data contract is an agreement between a client and a service that conceptually depicts the data to be exchanged. Data contracts define the data types that are used in the service." (Nagaraju B et al, ".Net Interview Questions", 2010)

"The format of the data to be communicated and the logic under which it is created form the data contract. This contract is followed by both the producer and the consumer of the event data. It gives the event meaning and form beyond the context in which it is produced and extends the usability of the data to consumer applications." (Adam Bellemare, "Building Event-Driven Microservices", 2020)

"A data contract is a document that accompanies data movement and captures relevant information (like upstream contacts, service-level agreement, scenarios enabled, etc.)." (Vlad Riscutia, "Data Engineering on Azure", 2021)

"A data contract is a formal agreement between a service and a client that abstractly describes the data to be exchanged. That is, to communicate, the client and the service do not have to share the same types, only the same data contracts. A data contract precisely defines, for each parameter or return type, what data is serialized (turned into XML) to be exchanged." (Microsoft, "Using Data Contracts", 2021) [source]

"A data contract is a written agreement between the owner of a source system and the team ingesting data from that system for use in a data pipeline. The contract should state what data is being extracted, via what method (full, incremental), how often, as well as who (person, team) are the contacts for both the source system and the ingestion." (James Densmore, "Data Pipelines Pocket Reference", 2021)

"It's a formal agreement between the data producer and the data consumers. There is not yet a clear definition of the form and scope of a data contract. Usually, they cover the structure of the exchanged data (i.e. the schema) and its meaning (i.e. the semantics)." (Open Data Mesh, "Data Contract", 2022) [source]

"A data contract is an agreed interface between the generators of data and its consumers. It sets the expectations around that data, defines how it should be governed, and facilitates the explicit generation of quality data that meets the business requirements." (Andrew Jones, "Driving Data Quality with Data Contracts", 2023)

"A data contract is an agreement between the producer and the consumers of a data product. Just as business contracts hold up obligations between suppliers and consumers of a business product, data contracts define and enforce the functionality, manageability, and reliability of data products." (Atlan, "Data Contracts: The Key to Scaling Distributed Data Architecture and Reducing Data Chaos", 2023) [source]

"Data contracts are formal agreements outlining the structure and type of data exchanged between systems, ensuring all parties understand the data's format. Used in various contexts such as APIs, SOA, data pipelines, they provide crucial interoperability, making data contracts essential in managing and controlling data flow effectively." (Jatin Solanki, "What is Data Contracts, is it a hype?", 2023) [source]

"A formal agreement between a data consumer or user and a data provider or owner that defines the conditions under which the data is exchanged between both parties." (Circ Thread)

20 May 2017

⛏️Data Management: Data Scrubbing (Definitions)

"The process of making data consistent, either manually, or automatically using programs." (Microsoft Corporation, "Microsoft SQL Server 7.0 System Administration Training Kit", 1999)

Processing data to remove or repair inconsistencies." (Rod Stephens, "Beginning Database Design Solutions", 2008)

"The process of building a data warehouse out of data coming from multiple online transaction processing (OLTP) systems." (Microsoft, "SQL Server 2012 Glossary", 2012)

"A term that is very similar to data deidentification and is sometimes used improperly as a synonym for data deidentification. Data scrubbing refers to the removal, from data records, of identifying information (i.e., information linking the record to an individual) plus any other information that is considered unwanted. This may include any personal, sensitive, or private information contained in a record, any incriminating or otherwise objectionable language contained in a record, and any information irrelevant to the purpose served by the record." (Jules H Berman, "Principles of Big Data: Preparing, Sharing, and Analyzing Complex Information", 2013)

"The process of removing corrupt, redundant, and inaccurate data in the data governance process. (Robert F Smallwood, Information Governance: Concepts, Strategies, and Best Practices, 2014)

"Data Cleansing (or Data Scrubbing) is the action of identifying and then removing or amending any data within a database that is: incorrect, incomplete, duplicated." (experian) [source]

"Data cleansing, or data scrubbing, is the process of detecting and correcting or removing inaccurate data or records from a database. It may also involve correcting or removing improperly formatted or duplicate data or records. Such data removed in this process is often referred to as 'dirty data'. Data cleansing is an essential task for preserving data quality." (Teradata) [source]

"Data scrubbing, also called data cleansing, is the process of amending or removing data in a database that is incorrect, incomplete, improperly formatted, or duplicated." (Techtarget) [source]

"Part of the process of building a data warehouse out of data coming from multiple online transaction processing (OLTP) systems." (Microsoft Technet)

"The process of filtering, merging, decoding, and translating source data to create validated data for the data warehouse." (Information Management)

05 May 2017

⛏️Data Management: Data Steward (Definitions)

"A person with responsibility to improve the accuracy, reliability, and security of an organization’s data; also works with various groups to clearly define and standardize data." (Margaret Y Chu, "Blissful Data ", 2004)

"Critical players in data governance councils. Comfortable with technology and business problems, data stewards seek to speak up for their business units when an organization-wide decision will not work for that business unit. Yet they are not turf protectors, instead seeking solutions that will work across an organization. Data stewards are responsible for communication between the business users and the IT community." (Tony Fisher, "The Data Asset", 2009)

"A business leader and/or subject matter expert designated as accountable for: a) the identification of operational and Business Intelligence data requirements within an assigned subject area, b) the quality of data names, business definitions, data integrity rules, and domain values within an assigned subject area, c) compliance with regulatory requirements and conformance to internal data policies and data standards, d) application of appropriate security controls, e) analyzing and improving data quality, and f) identifying and resolving data related issues. Data stewards are often categorized as executive data stewards, business data stewards, or coordinating data stewards." (DAMA International, "The DAMA Dictionary of Data Management", 2011)

[business data steward:] "A knowledge worker, business leader, and recognized subject matter expert assigned accountability for the data specifications and data quality of specifically assigned business entities, subject areas or databases, but with less responsibility for data governance than a coordinating data steward or an executive data steward." (DAMA International, "The DAMA Dictionary of Data Management", 2011)

"The person responsible for maintaining a data element in a metadata registry." (Microsoft, "SQL Server 2012 Glossary, 2012)

"The term stewardship is “the management or care of another person’s property” (NOAD). Data stewards are individuals who are responsible for the care and management of data. This function is carried out in different ways based on the needs of particular organizations." (Laura Sebastian-Coleman, "Measuring Data Quality for Ongoing Improvement ", 2012)

"The person responsible for maintaining a data element in a metadata registry." (Microsoft, SQL Server 2012 Glossary, 2012)

"An individual comfortable with both technology and business problems. Stewards are responsible for communicating between the business users and the IT community." (Jim Davis & Aiman Zeid, "Business Transformation: A Roadmap for Maximizing Organizational Insights", 2014)

"A role in the data governance organization that is responsible for the development of a uniform data model for business objects used across boundaries. The data steward is also often responsible for the development of master data management and ensures compliance with the governance rules." (Boris Otto & Hubert Österle, "Corporate Data Quality", 2015)

"A natural person assigned the responsibility to catalog, define, and monitor changes to critical data. Example: The data steward for finance critical data is Dan." (Gregory Lampshire, "The Data and Analytics Playbook", 2016)

"A person responsible for managing data content, quality, standards, and controls within an organization or function." (Jonathan Ferrar et al, "The Power of People", 2017)

"A data steward is a job role that involves planning, implementing and managing the sourcing, use and maintenance of data assets in an organization. Data stewards enable an organization to take control and govern all the types and forms of data and their associated libraries or repositories." (Richard T Herschel, "Business Intelligence", 2019)

03 May 2017

⛏️Data Management: Hashing (Definitions)

"A technique for providing fast access to data based on a key value by determining the physical storage location of that data." (Jan L Harrington, "Relational Database Dessign: Clearly Explained" 2nd Ed., 2002)

"A mathematical technique for assigning a unique number to each record in a file." (S. Sumathi & S. Esakkirajan, "Fundamentals of Relational Database Management Systems", 2007)

"A technique that transforms a key value via an algorithm to a physical storage location to enable quick direct access to data. The algorithm is typically referred to as a randomizer, because the goal of the hashing routine is to spread the key values evenly throughout the physical storage." (Craig S Mullins, "Database Administration", 2012)

"A mathematical technique in which an infinite set of input values is mapped to a finite set of output values, called hash values. Hashing is useful for rapid lookups of data in a hash table." (Oracle, "Database SQL Tuning Guide Glossary", 2013)

"An algorithm converts data values into an address" (Daniel Linstedt & W H Inmon, "Data Architecture: A Primer for the Data Scientist", 2014)

"The technique used for ordering and accessing elements in a collection in a relatively constant amount of time by manipulating the element’s key to identify the element’s location in the collection" (Nell Dale et al, "Object-Oriented Data Structures Using Java" 4th Ed., 2016)

"The application of an algorithm to a search key to derive a physical storage location." (George Tillmann, "Usage-Driven Database Design: From Logical Data Modeling through Physical Schmea Definition", 2017)

"Hashing is the process of mapping data values to fixed-size hash values (hashes). Common hashing algorithms are Message Digest 5 (MD5) and Secure Hashing Algorithm (SHA). It’s impossible to turn a hash value back into the original data value." (Piethein Strengholt, "Data Management at Scale", 2020)

"A process used to convert data into a string of numbers and letters." (AICPA)

"A technique for arranging a set of items, in which a hash function is applied to the key of each item to determine its hash value. The hash value identifies each item's primary position in a hash table, and if this position is already occupied, the item is inserted either in an overflow table or in another available position in the table." (IEEE 610.5-1990)

01 May 2017

⛏️Data Management: Hash (Definitions)

"A number (often a 32-bit integer) that is derived from column values using a lossy compression algorithm. DBMSs occasionally use hashing to speed up access, but indexes are a more common mechanism." (Peter Gulutzan & Trudy Pelzer, "SQL Performance Tuning", 2002)

"A set of characters generated by running text data through certain algorithms. Often used to create digital signatures and compare changes in content." (Tom Petrocelli, "Data Protection and Information Lifecycle Management", 2005)

"Hash, a mathematical method for creating a numeric signature based on content; these days, often unique and based on public key encryption technology." (Bo Leuf, "The Semantic Web: Crafting infrastructure for agency", 2006)

[hash code:] "An integer calculated from an object. Identical objects have the same hash code. Generated by a hash method." (Michael Fitzgerald, "Learning Ruby", 2007)

"An unordered collection of data where keys and values are mapped. Compare with array." (Michael Fitzgerald, "Learning Ruby", 2007)

"A cryptographic hash is a fixed-size bit string that is generated by applying a hash function to a block of data. Secure cryptographic hash functions are collision-free, meaning there is a very small possibility of generating the same hash for two different blocks of data. A secure cryptographic hash function should also be one-way, meaning it is infeasible to retrieve the original text from the hash." (Michael Coles & Rodney Landrum, "Expert SQL Server 2008 Encryption", 2008)

"A hash is the result of applying a mathematical function or transformation on data to generate a smaller 'fingerprint' of the data. Generally, the most useful hash functions are one-way collision-free hashes that guarantee a high level of uniqueness in their results." (Michael Coles, "Pro T-SQL 2008 Programmer's Guide", 2008)

"The output of a hash function." (Mark S Merkow & Lakshmikanth Raghavan, "Secure and Resilient Software Development", 2010)

"A number based on the hash value of a string." (DAMA International, "The DAMA Dictionary of Data Management", 2011)

"1.Data allocated in an algorithmically randomized fashion in an attempt to evenly distribute data and smooth access patterns. 2.Verb. To calculate a hash key for data." (DAMA International, "The DAMA Dictionary of Data Management", 2011)

"A hash is the result of applying a mathematical function or transformation on data to generate a smaller 'fingerprint' of the data. Generally, the most useful hash functions are one-way collision-free hashes that guarantee a high level of uniqueness in their results." (Jay Natarajan et al, "Pro T-SQL 2012 Programmer's Guide" 3rd Ed., 2012)

"An unordered association of key/value pairs, stored such that you can easily use a string key to look up its associated data value. This glossary is like a hash, where the word to be defined is the key and the definition is the value. A hash is also sometimes septisyllabically called an “associative array”, which is a pretty good reason for simply calling it a 'hash' instead." (Jon Orwant et al, "Programming Perl" 4th Ed., 2012)

"In a hash cluster, a unique numeric ID that identifies a bucket. Oracle Database uses a hash function that accepts an infinite number of hash key values as input and sorts them into a finite number of buckets. Each hash value maps to the database block address for the block that stores the rows corresponding to the hash key value (department 10, 20, 30, and so on)." (Oracle, "Database SQL Tuning Guide Glossary", 2013)

"The result of applying a mathematical function or transformation to data to generate a smaller 'fingerprint' of the data. Generally, the most useful hash functions are one-way, collision-free hashes that guarantee a high level of uniqueness in their results." (Miguel Cebollero et al, "Pro T-SQL Programmer’s Guide" 4th Ed., 2015)

[hash code:] "The output of the hash function that is associated with the input object" (Nell Dale et al, "Object-Oriented Data Structures Using Java" 4th Ed., 2016)

"A numerical value produced by a mathematical function, which generates a fixed-length value typically much smaller than the input to the function. The function is many to one, but generally, for all practical purposes, each file or other data block input to a hash function yields a unique hash value." (William Stallings, "Effective Cybersecurity: A Guide to Using Best Practices and Standards", 2018)

"The number generated by a hash function to indicate the position of a given item in a hash table." (IEEE 610.5-1990)

28 April 2017

⛏️Data Management: Completeness (Definitions)

"A characteristic of information quality that measures the degree to which there is a value in a field; synonymous with fill rate. Assessed in the data quality dimension of Data Integrity Fundamentals." (Danette McGilvray, "Executing Data Quality Projects", 2008)

"Containing by a composite data all components necessary to full description of the states of a considered object or process." (Juliusz L Kulikowski, "Data Quality Assessment", 2009)

"An inherent quality characteristic that is a measure of the extent to which an attribute has values for all instances of an entity class." (David C Hay, "Data Model Patterns: A Metadata Map", 2010)

"Completeness is a dimension of data quality. As used in the DQAF, completeness implies having all the necessary or appropriate parts; being entire, finished, total. A dataset is complete to the degree that it contains required attributes and a sufficient number of records, and to the degree that attributes are populated in accord with data consumer expectations. For data to be complete, at least three conditions must be met: the dataset must be defined so that it includes all the attributes desired (width); the dataset must contain the desired amount of data (depth); and the attributes must be populated to the extent desired (density). Each of these secondary dimensions of completeness can be measured differently." (Laura Sebastian-Coleman, "Measuring Data Quality for Ongoing Improvement ", 2012)

"Completeness is defined as a measure of the presence of core source data elements that, exclusive of derived fields, must be present in order to complete a given business process." (Rajesh Jugulum, "Competing with High Quality Data", 2014)

"Complete existence of all values or attributes of a record that are necessary." (Boris Otto & Hubert Österle, "Corporate Data Quality", 2015)

"The degree to which all data has been delivered or stored and no values are missing. Examples are empty or missing records." (Piethein Strengholt, "Data Management at Scale", 2020)

"The degree to which elements that should be contained in the model are indeed there." (Panos Alexopoulos, "Semantic Modeling for Data", 2020)

"The degree of data representing all properties and instances of the real-world context." (Zhamak Dehghani, "Data Mesh: Delivering Data-Driven Value at Scale", 2021)

"Data is considered 'complete' when it fulfills expectations of comprehensiveness." (Precisely) [source]

"The degree to which all required measures are known. Values may be designated as “missing” in order not to have empty cells, or missing values may be replaced with default or interpolated values. In the case of default or interpolated values, these must be flagged as such to distinguish them from actual measurements or observations. Missing, default, or interpolated values do not imply that the dataset has been made complete." (CODATA)

27 April 2017

⛏️Data Management: Availability (Definitions)

"Corresponds to the information that should be available when necessary and in the appropriate format." (José M Gaivéo, "Security of ICTs Supporting Healthcare Activities", 2013)

"A property by which the data is available all the time during the business hours. In cloud computing domain, the data availability by the cloud service provider holds a crucial importance." (Sumit Jaiswal et al, "Security Challenges in Cloud Computing", 2015)

"Availability: the ability of the data user to access the data at the desired point in time." (Boris Otto & Hubert Österle, "Corporate Data Quality", 2015)

"It is one of the main aspects of the information security. It means data should be available to its legitimate user all the time whenever it is requested by them. To guarantee availability data is replicated at various nodes in the network. Data must be reliably available." (Omkar Badve et al, "Reviewing the Security Features in Contemporary Security Policies and Models for Multiple Platforms", 2016)

"Timely, reliable access to data and information services for authorized users." (Maurice Dawson et al, "Battlefield Cyberspace: Exploitation of Hyperconnectivity and Internet of Things", 2017)

"A set of principles and metrics that assures the reliability and constant access to data for the authorized individuals or groups." (Gordana Gardašević et al, "Cybersecurity of Industrial Internet of Things", 2020)

"Ensuring the conditions necessary for easy retrieval and use of information and system resources, whenever necessary, with strict conditions of confidentiality and integrity." (Alina Stanciu et al, "Cyberaccounting for the Leaders of the Future", 2020)

"The state when data are in the place needed by the user, at the time the user needs them, and in the form needed by the user." (CODATA)

"The state that exists when data can be accessed or a requested service provided within an acceptable period of time." (NISTIR 4734)

"Timely, reliable access to information by authorized entities." (NIST SP 800-57 Part 1)

25 April 2017

⛏️Data Management: Data Products (Definitions)

"In the case of data mesh, a data product is an architectural quantum. It is the smallest unit of architecture that can be independently deployed and managed." (Zhamak Dehghani, "Data Mesh: Delivering Data-Driven Value at Scale", 2021)

"A data product is a data asset that should be trusted, reusable, and accessible. The data product is developed, owned, and managed by a domain, and each domain needs to make sure that its data products are accessible to other domains and their data consumers." (Marthe Mengen, 2024) [source]

"A data product is a self-contained, independently deployable unit of data that delivers business value." (James Serra, "Deciphering Data Architectures", 2024)

"A collection of optimized data or data-related assets that are packaged for reuse and distribution with controlled access. Data products contain data as well as models, dashboards, and other computational asset types. Unlike data assets in governance catalogs, data products are managed as products with multiple purposes to provide business value." (IBM)

"A data product, in general terms, is any tool or application that processes data and generates results. […] Data products have one primary objective: to manage, organize and make sense of the vast amount of data that organizations collect and generate. It’s the users’ job to put the insights to use that they gain from these data products, take actions and make better decisions based on these insights." (Sisense) [source]

"A data product is a product built around data, containing everything required to complete a specific task or objective using that underlying data." (Opendatasoft)

"A data product is digital information that can be purchased." (Techtarget) [source]

"A key concept in data mesh architecture, Data Products are independent units of data managed by a specific domain team. They are responsible for defining, publishing, and maintaining their data assets while ensuring high-quality data that meets the needs of its consumers." (DataHub)

[Data product specification:] "Detailed description of a data set or data set series together with additional information that will enable it to be created, supplied to and used by another party" (ISO 19131)

"Data set or data set series that conforms to a data product specification" (ISO 19131)

12 April 2017

⛏️Data Management: Accessibility (Definitions)

"Capable of being reached, capable of being used or seen." (Martin J Eppler, "Managing Information Quality" 2nd Ed., 2006)

"The degree to which data can be obtained and used." (Danette McGilvray, "Executing Data Quality Projects", 2008)

"The opportunity to find, as well as the ease and convenience associated with locating, information. Often, this is related to the physical location of the individual seeking the information and the physical location of the information in a book or journal." (Jimmie L Joseph & David P Cook, "Medical Ethical and Policy Issues Arising from RIA", 2008)

"An inherent quality characteristic that is a measure of the ability to access data when it is required." (David C Hay, "Data Model Patterns: A Metadata Map", 2010)

"The ability to readily obtain data when needed." (DAMA International, "The DAMA Dictionary of Data Management", 2011)

"Accessibility refers to the difficulty level for users to obtain data. Accessibility is closely linked with data openness, the higher the data openness degree, the more data types obtained, and the higher the degree of accessibility." (Li Cai & Yangyong Zhu, "The Challenges of Data Quality and Data Quality Assessment in the Big Data Era", 2015) [source]

"It is the state of each user to have access to any information at any time." (ihsan Eken & Basak Gezmen, "Accessibility for Everyone in Health Communication Mobile Application Usage", 2020)

"Data accessibility measures the extent to which government data are provided in open and re-usable formats, with their associated metadata." (OECD)

⛏️Data Management: Data Virtualization (Definitions)

"The concept of letting data stay 'where it lives; and developing a hardware and software architecture that exposes the data to various business processes and organizations. The goal of virtualization is to shield developers and users from the complexity of the underlying data structures." (Jill Dyché & Evan Levy, "Customer Data Integration: Reaching a Single Version of the Truth", 2006)

"The ability to easily select and combine data fragments from many different locations dynamically and in any way into a single data structure while also maintaining its semantic accuracy." (Michael M David & Lee Fesperman, "Advanced SQL Dynamic Data Modeling and Hierarchical Processing", 2013)

"The process of retrieving and manipulating data without requiring details of how the data formatted or where the data is located" (Daniel Linstedt & W H Inmon, "Data Architecture: A Primer for the Data Scientist", 2014)

"A data integration process used to gain more insights. Usually it involves databases, applications, file systems, websites, big data techniques, and so on." (Jason Williamson, "Getting a Big Data Job For Dummies", 2015)

"Data virtualization is an approach that allows an application to retrieve and manipulate data without requiring technical details about the data, such as how it is formatted at source or where it is physically located, and can provide a single customer view (or single view of any other entity) of the overall data. Some database vendors provide a database (virtual) query layer, which is also called a data virtualization layer. This layer abstracts the database and optimizes the data for better read performance. Another reason to abstract is to intercept queries for better security. An example is Amazon Athena." (Piethein Strengholt, "Data Management at Scale", 2020)

"A data integration process in order to gain more insights. Usually it involves databases, applications, file systems, websites, big data techniques, etc.)." (Analytics Insight)

"The integration and transformation of data in real time or near real time from disparate data sources in multicloud and hybrid cloud, to support business intelligence, reporting, analytics, and other workloads." (Forrester)

⛏️Data Management: Data Lineage (Definitions)

"A mechanism for recording information to determine the source of any piece of data, and the transformations applied to that data using Data Transformation Services (DTS). Data lineage can be tracked at the package and row levels of a table and provides a complete audit trail for information stored in a data warehouse. Data lineage is available only for packages stored in Microsoft Repository." (Microsoft Corporation, "SQL Server 7.0 System Administration Training Kit", 1999)

"This information is used by Data Transformation Services (DTS) when it works in conjunction with Meta Data Services. This information records the history of package execution and data transformations for each piece of data." (Anthony Sequeira & Brian Alderman, "The SQL Server 2000 Book", 2003)

"This is also called data provenance. It deals with the origin of data; it is all about documenting where data is, how it has been derived, and how it flows so you can manage and secure it appropriately as it is further processed by applications." (Martin Oberhofer et al, "Enterprise Master Data Management", 2008)

"This provides the functionality to determine where data comes from, how it is transformed, and where it is going. Data lineage metadata traces the lifecycle of information between systems, including the operations that are performed on the data." (Martin Oberhofer et al, "The Art of Enterprise Information Architecture", 2010)

"Data lineage refers to a set of identifiable points that can be used to understand details of data movement and transformation (e.g., transactional source field names, file names, data processing job names, programming rules, target table fields). Lineage describes the movement of data through systems from its origin or provenance to its use in a particular application. Lineage is related to both the data chain and the information life cycle. Most people concerned with the lineage of data want to understand two aspects of it: the data’s origin and the ways in which the data has changed since it was originally created. Change can take place within one system or between systems." (Laura Sebastian-Coleman, "Measuring Data Quality for Ongoing Improvement ", 2012)

⛏️Data Management: Data Federation (Definitions)

"Data access to a variety of data stores, using consistent rules and definitions that enable all the data stores to be treated as a single resource." (Judith Hurwitz et al, "Service Oriented Architecture For Dummies 2nd Ed.", 2009)

"Technology that joins data from different sources, operational or analytic, around an organization. Data federation allows users to have a single view of disparate data without having to understand the details of the individual data sources." (Tony Fisher, "The Data Asset", 2009)

"A method of transparently joining or linking data from multiple physical locations and/or multiple platforms." (DAMA International, "The DAMA Dictionary of Data Management", 2011)

"Data access to a variety of data stores, using consistent rules and definitions that enable all the data stores to be treated as a single resource. " (Marcia Kaufman et al, "Big Data For Dummies", 2013)

"Data federation technology is software that provides an organization with the ability to aggregate data from disparate sources in a virtual database so it can be used for business intelligence (BI) or other analysis." (Techtarget)

"Process where data is collected from distinct databases without ever copying or transforming the original data." (Solutions Review)

06 April 2017

⛏️Data Management: Data Mesh (Definitions)

"Data Mesh is a sociotechnical approach to share, access and manage analytical data in complex and large-scale environments - within or across organizations." (Zhamak Dehghani, "Data Mesh: Delivering Data-Driven Value at Scale", 2021)

"A data mesh is an architectural concept in data engineering that gives business domains (divisions/departments) within a large organization ownership of the data they produce. The centralized data management team then becomes the organization’s data governance team." (Margaret Rouse, 2023) [source]

"Data Mesh is a design concept based on federated data and business domains. It applies product management thinking to data management with the outcome being Data Products. It’s technology agnostic and calls for a domain-centric organization with federated Data Governance." (Sonia Mezzetta, "Principles of Data Fabric", 2023)

"A data mesh is a decentralized data architecture with four specific characteristics. First, it requires independent teams within designated domains to own their analytical data. Second, in a data mesh, data is treated and served as a product to help the data consumer to discover, trust, and utilize it for whatever purpose they like. Third, it relies on automated infrastructure provisioning. And fourth, it uses governance to ensure that all the independent data products are secure and follow global rules."(James Serra, "Deciphering Data Architectures", 2024)

"A data mesh is a federated data architecture that emphasizes decentralizing data across business functions or domains such as marketing, sales, human resources, and more. It facilitates organizing and managing data in a logical way to facilitate the more targeted and efficient use and governance of the data across organizations." (Arshad Ali & Bradley Schacht, "Learn Microsoft Fabric", 2024)

"To explain a data mesh in one sentence, a data mesh is a centrally managed network of decentralized data products. The data mesh breaks the central data lake into decentralized islands of data that are owned by the teams that generate the data. The data mesh architecture proposes that data be treated like a product, with each team producing its own data/output using its own choice of tools arranged in an architecture that works for them. This team completely owns the data/output they produce and exposes it for others to consume in a way they deem fit for their data." (Aniruddha Deswandikar,"Engineering Data Mesh in Azure Cloud", 2024)

"A data mesh is a decentralized data architecture that organizes data by a specific business domain - for example, marketing, sales, customer service and more - to provide more ownership to the producers of a given data set." (IBM) [source]

"A data mesh is a new approach to designing data architectures. It takes a decentralized approach to data storage and management, having individual business domains retain ownership over their datasets rather than flowing all of an organization’s data into a centrally owned data lake." (Alteryx) [source]

"A Data Mesh is a solution architecture for the specific goal of building business-focused data products without preference or specification of the technology involved." (Gartner)

"A data mesh is an architectural framework that solves advanced data security challenges through distributed, decentralized ownership." (AWS) [source]

"Data mesh defines a platform architecture based on a decentralized network. The data mesh distributes data ownership and allows domain-specific teams to manage data independently." (TIBCO) [source]

"Data mesh refers to a data architecture where data is owned and managed by the teams that use it. A data mesh decentralizes data ownership to business domains–such as finance, marketing, and sales–and provides them a self-serve data platform and federated computational governance." (Qlik) [source]