29 December 2017

Data Management: Numeracy (Just the Quotes)

"The great body of physical science, a great deal of the essential fact of financial science, and endless social and political problems are only accessible and only thinkable to those who have had a sound training in mathematical analysis, and the time may not be very remote when it will be understood that for complete initiation as an efficient citizen of one of the new great complex world-wide States that are now developing, it is as necessary to be able to compute, to think in averages and maxima and minima, as it is now to be able to read and write." (Herbert G Wells, "Mankind in the Making",  1903) 

"[…] statistical literacy. That is, the ability to read diagrams and maps; a 'consumer' understanding of common statistical terms, as average, percent, dispersion, correlation, and index number."  (Douglas Scates, "Statistics: The Mathematics for Social Problems", 1943) 

"Statistical thinking will one day be as necessary for efficient citizenship as the ability to read and write." (Samuel S Wilks, 1951 [paraphrasing Herbert G Wells]) 

"Just as by ‘literacy’, in this context, we mean much more than its dictionary sense of the ability to read and write, so by ‘numeracy’ we mean more than mere ability to manipulate the rule of three. When we say that a scientist is ‘illiterate’, we mean that he is not well enough read to be able to communicate effectively with those who have had a literary education. When we say that a historian or a linguist is ‘innumerate’ we mean that he cannot even begin to understand what scientists and mathematicians are talking about." (Sir Geoffrey Crowther, "A Report of the Central Advisory Committee for Education", 1959) 

"It is perhaps possible to distinguish two different aspects of numeracy […]. On the one hand is an understanding of the scientific approach to the study of phenomena - observation, hypothesis, experiment, verification. On the other hand, there is the need in the modern world to think quantitatively, to realise how far our problems are problems of degree even when they appear as problems of kind." (Sir Geoffrey Crowther, "A Report of the Central Advisory Committee for Education", 1959) 

"Numeracy has two facets - reading and writing, or extracting numerical information and presenting it. The skills of data presentation may at first seem ad hoc and judgemental, a matter of style rather than of technology, but certain aspects can be formalized into explicit rules, the equivalent of elementary syntax." (Andrew Ehrenberg, "Rudiments of Numeracy", Journal of Royal Statistical Society, 1977)

"People often feel inept when faced with numerical data. Many of us think that we lack numeracy, the ability to cope with numbers. […] The fault is not in ourselves, but in our data. Most data are badly presented and so the cure lies with the producers of the data. To draw an analogy with literacy, we do not need to learn to read better, but writers need to be taught to write better." (Andrew Ehrenberg, "The problem of numeracy", American Statistician 35(2), 1981)

"We would wish ‘numerate’ to imply the possession of two attributes. The first of these is an ‘at-homeness’ with numbers and an ability to make use of mathematical skills which enable an individual to cope with the practical mathematical demands of his everyday life. The second is ability to have some appreciation and understanding of information which is presented in mathematical terms, for instance in graphs, charts or tables or by reference to percentage increase or decrease." (Cockcroft Committee, "Mathematics Counts: A Report into the Teaching of Mathematics in Schools", 1982) 

"To function in today's society, mathematical literacy - what the British call ‘numeracy' - is as essential as verbal literacy […] Numeracy requires more than just familiarity with numbers. To cope confidently with the demands of today's society, one must be able to grasp the implications of many mathematical concepts - for example, chance, logic, and graphs - that permeate daily news and routine decisions - mathematical, scientific, and cultural - provide a common fabric of communication indispensable for modern civilized society. Mathematical literacy is especially crucial because mathematics is the language of science and technology." (National Research Council, "Everybody counts: A report to the nation on the future of mathematics education", 1989)

"Illiteracy and innumeracy are social ills created in part by increased demand for words and numbers. As printing brought words to the masses and made literacy a prerequisite for productive life, so now computing has made numeracy an essential feature of today's society. But it is innumeracy, not numeracy, that dominates the headlines: ignorance of basic quantitative tools is endemic […] and is approaching epidemic levels […]." (Lynn A Steen, "Numeracy", Daedalus Vol. 119 No. 2, 1990) 

"[…] data analysis in the context of basic mathematical concepts and skills. The ability to use and interpret simple graphical and numerical descriptions of data is the foundation of numeracy […] Meaningful data aid in replacing an emphasis on calculation by the exercise of judgement and a stress on interpreting and communicating results." (David S Moore, "Statistics for All: Why, What and How?", 1990) 

"To be numerate is more than being able to manipulate numbers, or even being able to ‘succeed’ in school or university mathematics. Numeracy is a critical awareness which builds bridges between mathematics and the real world, with all its diversity. […] in this sense […] there is no particular ‘level’ of Mathematics associated with it: it is as important for an engineer to be numerate as it is for a primary school child, a parent, a car driver or gardener. The different contexts will require different Mathematics to be activated and engaged in […]" (Betty Johnston, "Critical Numeracy", 1994)

"We believe that numeracy is about making meaning in mathematics and being critical about maths. This view of numeracy is very different from numeracy just being about numbers, and it is a big step from numeracy or everyday maths that meant doing some functional maths. It is about using mathematics in all its guises - space and shape, measurement, data and statistics, algebra, and of course, number - to make sense of the real world, and using maths critically and being critical of maths itself. It acknowledges that numeracy is a social activity. That is why we can say that numeracy is not less than maths but more. It is why we don’t need to call it critical numeracy: being numerate is being critical." (Dave Tout & Beth Marr, "Changing practice: Adult numeracy professional development", 1997)

"To be numerate means to be competent, confident, and comfortable with one’s judgements on whether to use mathematics in a particular situation and if so, what mathematics to use, how to do it, what degree of accuracy is appropriate, and what the answer means in relation to the context." (Diana Coben, "Numeracy, mathematics and adult learning", 2000)

"Numeracy is the ability to process, interpret and communicate numerical, quantitative, spatial, statistical, even mathematical information, in ways that are appropriate for a variety of contexts, and that will enable a typical member of the culture or subculture to participate effectively in activities that they value." (Jeff Evans, "Adults' Mathematical Thinking and Emotion", 2000)

"Mathematics is often thought to be difficult and dull. Many people avoid it as much as they can and as a result much of the population is mathematically illiterate. This is in part due to the relative lack of importance given to numeracy in our culture, and to the way that the subject has been presented to students." (Julian Havil, "Gamma: Exploring Euler's Constant", 2003)

"One can be highly functionally numerate without being a mathematician or a quantitative analyst. It is not the mathematical manipulation of numbers (or symbols representing numbers) that is central to the notion of numeracy. Rather, it is the ability to draw correct meaning from a logical argument couched in numbers. When such a logical argument relates to events in our uncertain real world, the element of uncertainty makes it, in fact, a statistical argument." (Eric R Sowey, "The Getting of Wisdom: Educating Statisticians to Enhance Their Clients' Numeracy", The American Statistician 57(2), 2003)

"Mathematics and numeracy are not congruent. Nor is numeracy an accidental or automatic by-product of mathematics education at any level. When the goal is numeracy some mathematics will be involved but mathematical skills alone do not constitute numeracy." (Theresa Maguire & John O'Donoghue, "Numeracy concept sophistication - an organizing framework, a useful thinking tool", 2003)

"Mathematical literacy is an individual’s capacity to identify and understand the role that mathematics plays in the world, to make well-founded judgements and to use and engage with mathematics in ways that meet the needs of that individual’s life as a constructive, concerned and reflective citizen." (OECD, "Assessing scientific, reading and mathematical literacy: a framework for PISA 2006", 2006)

"Statistical literacy is more than numeracy. It includes the ability to read and communicate the meaning of data. This quality makes people literate as opposed to just numerate. Wherever words (and pictures) are added to numbers and data in your communication, people need to be able to understand them correctly." (United Nations, "Making Data Meaningful" Part 4: "A guide to improving statistical literacy", 2012)

"When a culture is founded on the principle of immediacy of experience, there is no need for numeracy. It is impossible to consume more than one thing at a time, so differentiating between 'a small amount', 'a larger amount' and 'many' is enough for survival." (The Open University, "Understanding the environment: learning and communication", 2016)

13 August 2017

Software Engineering: Who Messed with My Data?


Introduction

Errors, like straws, upon the surface flow;
He who would search for pearls must dive below.

(John Dryden) 

    The life of a programmer is full of things that stop working overnight. What’s beautiful about such experiences is that there is always a logical explanation for these “happenings”. There are two aspects: one is how to troubleshoot such problems; the second is how to avoid such situations, which is typically done through what we refer to as defensive programming. On one side, avoiding issues makes one’s life simpler, while on the other, issues make it fuller.

   I can say that I have had plenty of such challenges in my life, most of them self-created, mainly in the learning process, but also a good share created by others. Independently of the time spent troubleshooting such issues, it’s the experience that counts - the little wins against the “dark” side of programming. In the following series of posts I will describe some of the issues I have been confronted with, directly or indirectly, over time. In an ad-hoc characterization, they can be split into syntax, logical, data, design and systemic errors.

Syntax Errors

Watch your language young man!

(anonymous mother) 

    Syntax in natural languages like English is the sequence in which words are put together, the order of the words indicating the relationships existing between them. Based on the meaning the words carry and the relationships formed between them, we are capable of interpreting sentences. SQL, initially called SEQUEL (Structured English Query Language), is an English-like language designed to manipulate and retrieve data. Like natural languages, artificial languages such as SQL have their own set of (grammar) rules which, when violated, lead to runtime errors that interrupt the code’s execution; there are also cases in which the code runs further, leading to inconsistencies in the data. Unlike natural languages, the interpreters of artificial languages are quite sensitive to syntax errors.

    Syntax errors are common among beginners, though a moment of inattention or a misspelling can happen to anyone, no matter how versed one is in coding. Some errors are more frequent or have a bigger negative impact than others. Here are some of the typical types of syntax errors:
- missing brackets and quotes, especially in complex formulas;
- misspelled commands, table or column names;
- omitting table aliases or database names;
- missing objects or incorrectly referenced objects or other resources;
- incorrect statement order;
- relying on implicit conversion;
- incompatible data types;
- incorrect parameters’ order;
- missing or misplaced semicolons;
- usage of deprecated syntax.
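A couple of the errors above can be made concrete with a small sketch (the dbo.Orders table and its columns are hypothetical, used only for illustration):

```sql
-- hypothetical table, for illustration only
CREATE TABLE dbo.Orders (
    OrderId int NOT NULL
  , OrderDate date NOT NULL
  , Total money NOT NULL);

-- misspelled column name: fails with "Invalid column name 'OrderDat'"
SELECT OrderId, OrderDat
FROM dbo.Orders;

-- relying on implicit conversion: the statement runs, though the string
-- literal is interpreted based on the session's DATEFORMAT/language settings,
-- so the same code can return different rows (or fail) in another environment
SELECT OrderId
FROM dbo.Orders
WHERE OrderDate = '05/06/2017'; -- May 6 or June 5, depending on the settings
```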

   Typically, syntax errors are easy to track at runtime with minimal testing, as long as the query is static. Dynamic queries, on the other hand, sometimes require a larger number of combinations to be tested. The higher the number of attributes to be combined and the more complex the logic behind them, the more difficult it is to test all combinations. The more combinations remain untested, the higher the probability that an error lurks in the code. Dynamic queries can thus easily become (syntax) error generators.
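The mechanism by which dynamic queries turn into error generators can be sketched as follows (table, filters and parameters are hypothetical): each optional filter doubles the number of combinations, and a mistake in a rarely used branch surfaces only when that branch is finally executed:

```sql
-- hypothetical optional filters for a search query
DECLARE @OrderDate date = NULL
      , @MinTotal money = 100;

DECLARE @sql nvarchar(max) = N'SELECT OrderId FROM dbo.Orders WHERE 1 = 1';

IF @OrderDate IS NOT NULL
   SET @sql += N' AND OrderDate = @OrderDate';
IF @MinTotal IS NOT NULL
   SET @sql += N' AND Total >= @MinTotal';
-- a typo in one of the branches above stays invisible until a call
-- provides exactly the parameter combination that activates it

EXEC sp_executesql @sql
   , N'@OrderDate date, @MinTotal money'
   , @OrderDate, @MinTotal;
```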

Logical Errors

Students are often able to use algorithms to solve numerical problems
without completely understanding the underlying scientific concept.

(Eric Mazur) 

   One beautiful aspect of the human mind is that it needs only a rough understanding of how a tool works in order to make use of it, up to an acceptable level. Therefore it often settles for the minimum of understanding that allows it to use a tool. Aspects like the limits of a tool, its contexts of applicability, how it can be used efficiently to get the job done, or the available alternatives can all be ignored in the process. As the devil lies in the details, misunderstanding how a piece of technology works can prove to be our Achilles’ heel. For example, misunderstanding how sets and the different types of joins work, that lexical order differs from logical order and further from the order of execution, or when it is appropriate or inappropriate to use a certain technique or functionality, can lead us to poor choices.
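The difference between lexical and logical order can be illustrated with a classic example (dbo.Orders is hypothetical): the WHERE clause is processed logically before the SELECT list, therefore an alias defined in SELECT is not visible in WHERE, even though it appears earlier in the lexical order:

```sql
-- fails with "Invalid column name 'NetTotal'": WHERE is evaluated before SELECT
SELECT OrderId, Total * 0.81 AS NetTotal
FROM dbo.Orders
WHERE NetTotal > 100;

-- works: repeat the expression (or push it into a derived table/CTE)
SELECT OrderId, Total * 0.81 AS NetTotal
FROM dbo.Orders
WHERE Total * 0.81 > 100;
```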

   One of these poor choices is the method used to solve a problem. A mature programming language can sometimes offer two or more alternatives for solving a problem. Choosing an inadequate solution can lead to performance issues over time. This type of error can be rooted in a lack of understanding of the data, of how an application is used, or of how a piece of technology works.

I suppose it is tempting, if the only tool you have is a hammer,
to treat everything as if it were a nail.

(Abraham Maslow) 

   Some of the errors derive from the differences in how programming languages work with data. There can be considerable differences between procedural, relational and vector languages. When jumping from one language to another, one can be tempted to apply the same old techniques to the new language. The solution might work, though it is (by far) not optimal.
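A typical example is carrying row-by-row (procedural) habits into a set-based (relational) language; the sketch below, based on a hypothetical dbo.Orders table with an IsArchived flag, shows the two styles side by side:

```sql
-- procedural habit: a cursor updating one row at a time
DECLARE @OrderId int;
DECLARE crs CURSOR FOR
    SELECT OrderId FROM dbo.Orders WHERE OrderDate < '2016-01-01';
OPEN crs;
FETCH NEXT FROM crs INTO @OrderId;
WHILE @@FETCH_STATUS = 0
BEGIN
    UPDATE dbo.Orders SET IsArchived = 1 WHERE OrderId = @OrderId;
    FETCH NEXT FROM crs INTO @OrderId;
END
CLOSE crs;
DEALLOCATE crs;

-- relational thinking: the same change as one set-based statement
UPDATE dbo.Orders
SET IsArchived = 1
WHERE OrderDate < '2016-01-01';
```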

    The capital mistake is to be the man of one tool and to use it in all cases, even when not appropriate. For example, when one has learned to work with views, one may attempt to apply them all over the code in order to reuse logic, creating chains of views which, even if they prove flexible, will sooner or later kick back with their complexity. The same can happen with stored procedures and other object types as well. A sign of mastery is when the developer adapts his tools to the purpose.

"For every complex problem there is an answer
that is clear, simple, and wrong."

(Henry L Mencken) 

   One can build elegant solutions and yet solve the wrong problem. Misunderstanding the problem at hand is a type of error that is sometimes quite difficult to identify. Typically, such errors can be found through thorough testing. Sometimes the unavailability of (quality) data can impede the testing process, such errors being found late in the process.

   At the opposite side, one can attempt to solve the right problem, but with flawed logic – a wrong order of steps, a wrong algorithm, a wrong set of tools, or even missing facts/assumptions. A special type of logical error is the programmatic error, which occurs when SQL code encounters a logic or behavioral error during processing (e.g. an infinite loop or an out-of-range input). [1]

Data Errors

Data quality requires certain level of sophistication within a company
to even understand that it’s a problem.

(Colleen Graham) 

   Poor data quality is the source of all evil, or at least of some of it. Typically, a well-designed database makes use of a mix of techniques to reduce the chances of inconsistencies: appropriate data types and data granularity, explicit transactions, check constraints, default values, triggers and integrity constraints. Some of these techniques can be too restrictive, therefore in the design one has to provide a certain flexibility to the detriment of one or more of the above techniques, a fact that makes the design vulnerable to the same range of issues: missing values, missing or duplicate records.
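Several of these techniques can be combined already at table level, as in the following sketch (all names are illustrative):

```sql
CREATE TABLE dbo.Payments (
    PaymentId int NOT NULL PRIMARY KEY                           -- no duplicate records
  , OrderId int NOT NULL
        REFERENCES dbo.Orders(OrderId)                           -- integrity constraint
  , Amount money NOT NULL
        CONSTRAINT CK_Payments_Amount CHECK (Amount > 0)         -- check constraint
  , CreatedAt datetime2(0) NOT NULL
        CONSTRAINT DF_Payments_CreatedAt DEFAULT (sysdatetime()) -- default value
);
```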

   No matter how well a database was designed, it is sometimes difficult to cope with users’ ingenuity – misuse of functionality, typically resulting in deviations from standard processes, can invalidate an existing query. Changes to processes, or the usage of new processes not addressed in existing queries or reports, have similar effects.

  Another topic with a considerable impact on queries’ correctness is the existence, or better said the inexistence, of master data policies and of a board to regulate the maintenance of master data. Without proper governance of master data, one may end up with a big mess, with no way to bring order into it without adequately addressing the quality of the data.

Designed to Fail

The weakest spot in a good defense is designed to fail.
(Mark Lawrence) 

   In IT one can often meet systems designed to fail, the occurrence of errors being just a question of time – a kind of ticking bomb. In such situations, a system is only as good as its weakest link(s). The issues can be traced back to the following aspects:
- systems used for what they were not designed to do – typically misusing a tool for a purpose for which another tool would be more appropriate (e.g. using Excel as a database, using SSIS for real-time processing, using a reporting tool for data entry);
- poorly performing systems – systems not adequately designed for the tasks they are supposed to handle (e.g. handling large volumes of data/transactions);
- systems not coping with users’ inventiveness or mistakes (e.g. not validating user input adequately or not confirming critical actions like the deletion of records);
- systems that are not configurable (e.g. usage of hardcoded values instead of parameters or configurable values);
- systems for which one of the design presumptions was invalidated by reality (e.g. the input data doesn’t have the expected format, the assumption that a certain resource always exists);
- systems not able to handle changes in the environment (e.g. changes to user settings for language, numeric or date formats);
- systems succumbing to their own complexity (e.g. overgeneralization, a wrong mix of technologies);
- fault-intolerant systems – systems not handling more or less unexpected errors or exceptions adequately (e.g. division by zero, handling of nulls, network interruptions, running out of memory).
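At least the last point can be addressed with a few defensive habits in the code itself; a minimal sketch for the division-by-zero and NULL-handling cases (dbo.OrderLines is hypothetical):

```sql
-- NULLIF turns a zero divisor into NULL instead of raising error 8134,
-- while ISNULL supplies an explicit fallback for the resulting NULL
SELECT OrderId
     , ISNULL(Total / NULLIF(Quantity, 0), 0) AS UnitPrice
FROM dbo.OrderLines;
```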

Systemic Errors

    Systemic errors can be found at the borders of the “impossible”, situations in which the errors defy common sense. Such errors are not determined by chance but are introduced by an inaccuracy inherent to the system/environment.

    A systemic error occurs when a SQL program encounters a deficiency or unexpected condition with a system resource (e.g. a program encounters insufficient space in tempdb to process a large query, or the database/transaction log runs out of space). [1]
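Even when a systemic error cannot be prevented, it can at least be intercepted and documented, which makes the later reproduction easier; a minimal sketch using TRY/CATCH (the query inside the TRY block is a hypothetical placeholder):

```sql
BEGIN TRY
    -- placeholder for the statement that hits the resource limit
    -- (e.g. a large sort spilling to an already full tempdb)
    SELECT OrderId FROM dbo.Orders ORDER BY OrderDate;
END TRY
BEGIN CATCH
    -- error 1105, for instance, signals a full filegroup; logging these
    -- values is the first step toward reproducing the issue
    SELECT ERROR_NUMBER() AS ErrorNumber
         , ERROR_SEVERITY() AS ErrorSeverity
         , ERROR_MESSAGE() AS ErrorMessage;
END CATCH;
```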

   Such errors are often difficult, but not impossible, to reproduce. The difficulty resides primarily in figuring out what happened and what caused the error. Once the cause is found, with a little resourcefulness one can come up with an example to reproduce the error.

Conclusion

"To err is human; to try to prevent recurrence of error is science."
(Anon)

    When one thinks about it, there are so many ways to fail. In the end, to err is human and nobody is exempted from making mistakes, no matter how good or wise. The quest of a (good) programmer is to limit the occurrence of errors and to correct them early in the process, before they become a nightmare.

References:
[1] Transact-SQL Programming: Covers Microsoft SQL Server 6.5 /7.0 and Sybase,  by Kevin Kline, Lee Gould & Andrew Zanevsky, O’Reilly, ISBN 10: 1565924010, 1999

18 June 2017

SQL Server Administration: Database Recovery on SQL Server 2017

I installed SQL Server 2017 CTP 2.1 today on my Lab PC without any apparent problems. It was time to recreate some of the databases I use for testing. As I previously had an evaluation version of SQL Server 2016, it expired without my having a backup for one of the databases. I could have recreated the database from scripts and reloaded the data from various text files. This would have been a relatively laborious task (estimated time > 1 hour), though the chances were pretty high that everything would go smoothly. As the database is relatively small (about 2 GB) and the possible data loss was negligible, I thought it would be possible to recover the data from the database with minimal loss in less than half an hour. I knew this was possible, as I had been forced a few times in the past to recover data from damaged databases in SQL Server 2005, 2008 and 2012 environments, though being in a new environment I wasn’t sure how smoothly it would go and how long it would take.

Plan A - Create the database with  ATTACH_REBUILD_LOG option:

As the option seems to be available in SQL Server 2017, I attempted to create the database via the following script:
 
CREATE DATABASE <database_name> ON 
(FILENAME = 'I:\Data\<database_name>.mdf') 
FOR ATTACH_REBUILD_LOG 

And, as expected, I ran into the first error:
Msg 5120, Level 16, State 101, Line 1
Unable to open the physical file "I:\Data\<database_name>.mdf". Operating system error 5: "5(Access is denied.)".
Msg 1802, Level 16, State 7, Line 1
CREATE DATABASE failed. Some file names listed could not be created. Check related errors.

It looked like a permissions problem, though I wasn’t entirely sure which account was causing it. In the past I had had problems with the Administrator account, so it was the first thing to try. Once I removed the permissions of the Administrator account on the folder containing the database and granted it full control again, I tried to create the database anew using the above script, running into the next error:

File activation failure. The physical file name "D:\Logs\<database_name>_log.ldf" may be incorrect. The log cannot be rebuilt because there were open transactions/users when the database was shutdown, no checkpoint occurred to the database, or the database was read-only. This error could occur if the transaction log file was manually deleted or lost due to a hardware or environment failure.
Msg 1813, Level 16, State 2, Line 1
Could not open new database '<database_name>'. CREATE DATABASE is aborted.

This approach seemed to lead nowhere, so it was time for Plan B.

Plan B - Recover the database into an empty database with the same name:

Step 1: Create a new database with the same name, stop the SQL Server service, copy the old data file over the new one, and delete the new log file manually. Then restart the server. After the restart the database will appear in Management Studio in the SUSPECT state.

Step 2: Set the database in EMERGENCY mode:

ALTER DATABASE <database_name> SET EMERGENCY, SINGLE_USER

Step 3: Rebuild the log file:

ALTER DATABASE <database_name> 
REBUILD LOG ON (NAME = '<database_name>_Log', 
FILENAME = 'D:\Logs\<database_name>_log.ldf')

The rebuild worked without problems.

Step 4: Set the database in MULTI_USER mode:

ALTER DATABASE <database_name> SET MULTI_USER 

Step 5: Perform a consistency check:

DBCC CHECKDB(<database_name>) WITH ALL_ERRORMSGS, NO_INFOMSGS 

After 15 minutes of work the database was back online.

Warnings:
Always attempt to recover the data for production databases from the backup files! Use the above steps only if there is no other alternative!
The consistency check might return errors. In this case one might need to run CHECKDB with REPAIR_ALLOW_DATA_LOSS several times [2], until the database is repaired.
After recovery there can be problems with user access. It might be necessary to delete the users from the recovered database and reassign their permissions!

Resources:
[1] In Recovery (2008) Creating, detaching, re-attaching, and fixing a SUSPECT database, by Paul S Randal [Online] Available from: https://www.sqlskills.com/blogs/paul/creating-detaching-re-attaching-and-fixing-a-suspect-database/ 
[2] In Recovery (2009) Misconceptions around database repair, by Paul S Randal [Online] Available from: https://www.sqlskills.com/blogs/paul/misconceptions-around-database-repair/
[3] Microsoft Blogs (2013) Recovering from Log File Corruption, by Glen Small [Online] Available from: https://blogs.msdn.microsoft.com/glsmall/2013/11/14/recovering-from-log-file-corruption/

20 May 2017

Data Management: Data Scrubbing (Definitions)

"The process of making data consistent, either manually, or automatically using programs." (Microsoft Corporation, "Microsoft SQL Server 7.0 System Administration Training Kit", 1999)

"Processing data to remove or repair inconsistencies." (Rod Stephens, "Beginning Database Design Solutions", 2008)

"The process of building a data warehouse out of data coming from multiple online transaction processing (OLTP) systems." (Microsoft, "SQL Server 2012 Glossary", 2012)

"A term that is very similar to data deidentification and is sometimes used improperly as a synonym for data deidentification. Data scrubbing refers to the removal, from data records, of identifying information (i.e., information linking the record to an individual) plus any other information that is considered unwanted. This may include any personal, sensitive, or private information contained in a record, any incriminating or otherwise objectionable language contained in a record, and any information irrelevant to the purpose served by the record." (Jules H Berman, "Principles of Big Data: Preparing, Sharing, and Analyzing Complex Information", 2013)

"The process of removing corrupt, redundant, and inaccurate data in the data governance process." (Robert F Smallwood, "Information Governance: Concepts, Strategies, and Best Practices", 2014)

"Data Cleansing (or Data Scrubbing) is the action of identifying and then removing or amending any data within a database that is: incorrect, incomplete, duplicated." (Experian) [source]

"Data cleansing, or data scrubbing, is the process of detecting and correcting or removing inaccurate data or records from a database. It may also involve correcting or removing improperly formatted or duplicate data or records. Such data removed in this process is often referred to as 'dirty data'. Data cleansing is an essential task for preserving data quality." (Teradata) [source]

"Data scrubbing, also called data cleansing, is the process of amending or removing data in a database that is incorrect, incomplete, improperly formatted, or duplicated." (Techtarget) [source]

"Part of the process of building a data warehouse out of data coming from multiple online transaction processing (OLTP) systems." (Microsoft Technet)

"The process of filtering, merging, decoding, and translating source data to create validated data for the data warehouse." (Information Management)

05 May 2017

Data Management: Data Steward (Definitions)

"A person with responsibility to improve the accuracy, reliability, and security of an organization’s data; also works with various groups to clearly define and standardize data." (Margaret Y Chu, "Blissful Data", 2004)

"Critical players in data governance councils. Comfortable with technology and business problems, data stewards seek to speak up for their business units when an organization-wide decision will not work for that business unit. Yet they are not turf protectors, instead seeking solutions that will work across an organization. Data stewards are responsible for communication between the business users and the IT community." (Tony Fisher, "The Data Asset", 2009)

"A business leader and/or subject matter expert designated as accountable for: a) the identification of operational and Business Intelligence data requirements within an assigned subject area, b) the quality of data names, business definitions, data integrity rules, and domain values within an assigned subject area, c) compliance with regulatory requirements and conformance to internal data policies and data standards, d) application of appropriate security controls, e) analyzing and improving data quality, and f) identifying and resolving data related issues. Data stewards are often categorized as executive data stewards, business data stewards, or coordinating data stewards." (DAMA International, "The DAMA Dictionary of Data Management", 2011)

[business data steward:] "A knowledge worker, business leader, and recognized subject matter expert assigned accountability for the data specifications and data quality of specifically assigned business entities, subject areas or databases, but with less responsibility for data governance than a coordinating data steward or an executive data steward." (DAMA International, "The DAMA Dictionary of Data Management", 2011)

"The person responsible for maintaining a data element in a metadata registry." (Microsoft, "SQL Server 2012 Glossary", 2012)

"The term stewardship is “the management or care of another person’s property” (NOAD). Data stewards are individuals who are responsible for the care and management of data. This function is carried out in different ways based on the needs of particular organizations." (Laura Sebastian-Coleman, "Measuring Data Quality for Ongoing Improvement ", 2012)

"An individual comfortable with both technology and business problems. Stewards are responsible for communicating between the business users and the IT community." (Jim Davis & Aiman Zeid, "Business Transformation: A Roadmap for Maximizing Organizational Insights", 2014)

"A role in the data governance organization that is responsible for the development of a uniform data model for business objects used across boundaries. The data steward is also often responsible for the development of master data management and ensures compliance with the governance rules." (Boris Otto & Hubert Österle, "Corporate Data Quality", 2015)

"A natural person assigned the responsibility to catalog, define, and monitor changes to critical data. Example: The data steward for finance critical data is Dan." (Gregory Lampshire, "The Data and Analytics Playbook", 2016)

"A person responsible for managing data content, quality, standards, and controls within an organization or function." (Jonathan Ferrar et al, "The Power of People", 2017)

"A data steward is a job role that involves planning, implementing and managing the sourcing, use and maintenance of data assets in an organization. Data stewards enable an organization to take control and govern all the types and forms of data and their associated libraries or repositories." (Richard T Herschel, "Business Intelligence", 2019)

28 April 2017

Data Quality Dimensions: Completeness (Definitions)

"A characteristic of information quality that measures the degree to which there is a value in a field; synonymous with fill rate. Assessed in the data quality dimension of Data Integrity Fundamentals." (Danette McGilvray, "Executing Data Quality Projects", 2008)

"Containing by a composite data all components necessary to full description of the states of a considered object or process." (Juliusz L Kulikowski, "Data Quality Assessment", 2009)

"An inherent quality characteristic that is a measure of the extent to which an attribute has values for all instances of an entity class." (David C Hay, "Data Model Patterns: A Metadata Map", 2010)

"Completeness is a dimension of data quality. As used in the DQAF, completeness implies having all the necessary or appropriate parts; being entire, finished, total. A dataset is complete to the degree that it contains required attributes and a sufficient number of records, and to the degree that attributes are populated in accord with data consumer expectations. For data to be complete, at least three conditions must be met: the dataset must be defined so that it includes all the attributes desired (width); the dataset must contain the desired amount of data (depth); and the attributes must be populated to the extent desired (density). Each of these secondary dimensions of completeness can be measured differently." (Laura Sebastian-Coleman, "Measuring Data Quality for Ongoing Improvement ", 2012)

"Completeness is defined as a measure of the presence of core source data elements that, exclusive of derived fields, must be present in order to complete a given business process." (Rajesh Jugulum, "Competing with High Quality Data", 2014)

"Complete existence of all values or attributes of a record that are necessary." (Boris Otto & Hubert Österle, "Corporate Data Quality", 2015)

"The degree to which all data has been delivered or stored and no values are missing. Examples are empty or missing records." (Piethein Strengholt, "Data Management at Scale", 2020)

"The degree to which elements that should be contained in the model are indeed there." (Panos Alexopoulos, "Semantic Modeling for Data", 2020)

"Data is considered 'complete' when it fulfills expectations of comprehensiveness." (Precisely) [source]

"The degree to which all required measures are known. Values may be designated as “missing” in order not to have empty cells, or missing values may be replaced with default or interpolated values. In the case of default or interpolated values, these must be flagged as such to distinguish them from actual measurements or observations. Missing, default, or interpolated values do not imply that the dataset has been made complete." (CODATA)

27 April 2017

Data Quality Dimensions: Availability (Definitions)

"Corresponds to the information that should be available when necessary and in the appropriate format." (José M Gaivéo, "Security of ICTs Supporting Healthcare Activities", 2013)

"A property by which the data is available all the time during the business hours. In cloud computing domain, the data availability by the cloud service provider holds a crucial importance." (Sumit Jaiswal et al, "Security Challenges in Cloud Computing", 2015) 

"Availability: the ability of the data user to access the data at the desired point in time." (Boris Otto & Hubert Österle, "Corporate Data Quality", 2015)

"It is one of the main aspects of the information security. It means data should be available to its legitimate user all the time whenever it is requested by them. To guarantee availability data is replicated at various nodes in the network. Data must be reliably available." (Omkar Badve et al, "Reviewing the Security Features in Contemporary Security Policies and Models for Multiple Platforms", 2016)

"Timely, reliable access to data and information services for authorized users." (Maurice Dawson et al, "Battlefield Cyberspace: Exploitation of Hyperconnectivity and Internet of Things", 2017)

"A set of principles and metrics that assures the reliability and constant access to data for the authorized individuals or groups." (Gordana Gardašević et al, "Cybersecurity of Industrial Internet of Things", 2020)

"Ensuring the conditions necessary for easy retrieval and use of information and system resources, whenever necessary, with strict conditions of confidentiality and integrity." (Alina Stanciu et al, "Cyberaccounting for the Leaders of the Future", 2020)

"The state when data are in the place needed by the user, at the time the user needs them, and in the form needed by the user." (CODATA)

"The state that exists when data can be accessed or a requested service provided within an acceptable period of time." (NISTIR 4734)

"Timely, reliable access to information by authorized entities." (NIST SP 800-57 Part 1)

12 April 2017

Data Management: Accessibility (Definitions)

"Capable of being reached, capable of being used or seen." (Martin J Eppler, "Managing Information Quality" 2nd Ed., 2006)

"The degree to which data can be obtained and used." (Danette McGilvray, "Executing Data Quality Projects", 2008)

"The opportunity to find, as well as the ease and convenience associated with locating, information. Often, this is related to the physical location of the individual seeking the information and the physical location of the information in a book or journal." (Jimmie L Joseph & David P Cook, "Medical Ethical and Policy Issues Arising from RIA", 2008)

"An inherent quality characteristic that is a measure of the ability to access data when it is required." (David C Hay, "Data Model Patterns: A Metadata Map", 2010)

"The ability to readily obtain data when needed." (DAMA International, "The DAMA Dictionary of Data Management", 2011)

"Accessibility refers to the difficulty level for users to obtain data. Accessibility is closely linked with data openness, the higher the data openness degree, the more data types obtained, and the higher the degree of accessibility." (Li Cai & Yangyong Zhu, "The Challenges of Data Quality and Data Quality Assessment in the Big Data Era", 2015) [source]

"It is the state of each user to have access to any information at any time." (ihsan Eken & Basak Gezmen, "Accessibility for Everyone in Health Communication Mobile Application Usage", 2020)

"Data accessibility measures the extent to which government data are provided in open and re-usable formats, with their associated metadata." (OECD)

Data Management: Data Virtualization (Definitions)

"The concept of letting data stay 'where it lives; and developing a hardware and software architecture that exposes the data to various business processes and organizations. The goal of virtualization is to shield developers and users from the complexity of the underlying data structures." (Jill Dyché & Evan Levy, "Customer Data Integration: Reaching a Single Version of the Truth", 2006)

"The ability to easily select and combine data fragments from many different locations dynamically and in any way into a single data structure while also maintaining its semantic accuracy." (Michael M David & Lee Fesperman, "Advanced SQL Dynamic Data Modeling and Hierarchical Processing", 2013)

"The process of retrieving and manipulating data without requiring details of how the data formatted or where the data is located" (Daniel Linstedt & W H Inmon, "Data Architecture: A Primer for the Data Scientist", 2014)

"A data integration process used to gain more insights.  Usually it involves databases, applications, file systems, websites, big data techniques, and so on." (Jason Williamson, "Getting a Big Data Job For Dummies", 2015)

"Data virtualization is an approach that allows an application to retrieve and manipulate data without requiring technical details about the data, such as how it is formatted at source or where it is physically located, and can provide a single customer view (or single view of any other entity) of the overall data. Some database vendors provide a database (virtual) query layer, which is also called a data virtualization layer. This layer abstracts the database and optimizes the data for better read performance. Another reason to abstract is to intercept queries for better security. An example is Amazon Athena." (Piethein Strengholt, "Data Management at Scale", 2020)

"A data integration process in order to gain more insights. Usually it involves databases, applications, file systems, websites, big data techniques, etc.)." (Analytics Insight)

 "The integration and transformation of data in real time or near real time from disparate data sources in multicloud and hybrid cloud, to support business intelligence, reporting, analytics, and other workloads." (Forrester)

Data Management: Data Lineage (Definitions)

 "A mechanism for recording information to determine the source of any piece of data, and the transformations applied to that data using Data Transformation Services (DTS). Data lineage can be tracked at the package and row levels of a table and provides a complete audit trail for information stored in a data warehouse. Data lineage is available only for packages stored in Microsoft Repository." (Microsoft Corporation, "SQL Server 7.0 System Administration Training Kit", 1999)

"This information is used by Data Transformation Services (DTS) when it works in conjunction with Meta Data Services. This information records the history of package execution and data transformations for each piece of data." (Anthony Sequeira & Brian Alderman, "The SQL Server 2000 Book", 2003)

"This is also called data provenance. It deals with the origin of data; it is all about documenting where data is, how it has been derived, and how it flows so you can manage and secure it appropriately as it is further processed by applications." (Martin Oberhofer et al, "Enterprise Master Data Management", 2008)

"This provides the functionality to determine where data comes from, how it is transformed, and where it is going. Data lineage metadata traces the lifecycle of information between systems, including the operations that are performed on the data." (Martin Oberhofer et al, "The Art of Enterprise Information Architecture", 2010)

"Data lineage refers to a set of identifiable points that can be used to understand details of data movement and transformation (e.g., transactional source field names, file names, data processing job names, programming rules, target table fields). Lineage describes the movement of data through systems from its origin or provenance to its use in a particular application. Lineage is related to both the data chain and the information life cycle. Most people concerned with the lineage of data want to understand two aspects of it: the data’s origin and the ways in which the data has changed since it was originally created. Change can take place within one system or between systems." (Laura Sebastian-Coleman, "Measuring Data Quality for Ongoing Improvement ", 2012)

20 March 2017

Data Management: Data Structure (Definitions)

"A logical relationship among data elements that is designed to support specific data manipulation functions (trees, lists, and tables)." (William H Inmon, "Building the Data Warehouse", 2005)

"Data stored in a computer in a way that (usually) allows efficient retrieval of the data. Arrays and hashes are examples of data structures." (Michael Fitzgerald, "Learning Ruby", 2007)

"A data structure in computer science is a way of storing data to be used efficiently." (Sahar Shabanah, "Computer Games for Algorithm Learning", 2011)

"Data structure is a general term referring to how data is organized. In modeling, it refers more specifically to the model itself. Tables are referred to as 'structures'." (Laura Sebastian-Coleman, "Measuring Data Quality for Ongoing Improvement ", 2012)

[probabilistic *] "A data structure which exploits randomness to boost its efficiency, for example skip lists and Bloom filters. In the case of Bloom filters, the results of certain operations may be incorrect with a small probability." (Wei-Chih Huang & William J Knottenbelt, "Low-Overhead Development of Scalable Resource-Efficient Software Systems", 2014)

"A collection of methods for storing and organizing sets of data in order to facilitate access to them. More formally data structures are concise implementations of abstract data types, where an abstract data type is a set of objects together with a collection of operations on the elements of the set." (Ioannis Kouris et al, "Indexing and Compressing Text", 2015)

"A representation of the logical relationship between elements of data." (Adam Gordon, "Official (ISC)2 Guide to the CISSP CBK" 4th Ed., 2015)

"Is a schematic organization of data and relationship to express a reality of interest, usually represented in a diagrammatic form." (Maria T Artese  Isabella Gagliardi, "UNESCO Intangible Cultural Heritage Management on the Web", 2015)

"The implementation of a composite data field in an abstract data type" (Nell Dale & John Lewis, "Computer Science Illuminated" 6th Ed., 2015)

"A way of organizing data so that it can be efficiently accessed and updated." (Vasileios Zois et al, "Querying of Time Series for Big Data Analytics", 2016)

"A particular way of storing information, allowing to a high level approach on the software implementation." (Katia Tannous & Fillipe de Souza Silva, "Particle Shape Analysis Using Digital Image Processing", 2018)

"It is a particular way of organizing data in a computer so that they can be used efficiently." (Edgar C Franco et al, "Implementation of an Intelligent Model Based on Machine Learning in the Application of Macro-Ergonomic Methods...", 2019)

"Way information is represented and stored." (Shalin Hai-Jew, "Methods for Analyzing and Leveraging Online Learning Data", 2019)

"A physical or logical relationship among a collection of data elements." (IEEE 610.5-1990)

Data Management: Data Sharing (Definitions)

"The ability to share individual pieces of data transparently from a database across different applications." (Microsoft Corporation, "SQL Server 7.0 System Administration Training Kit", 1999)

"Exchange of data and/or meta-data in a situation involving the use of open, freely available data formats, where process patterns are known and standard, and where not limited by privacy and confidentiality regulations." (DAMA International, "The DAMA Dictionary of Data Management", 2011)

"Data sharing involves one entity sending data to another entity, usually with the understanding that the other entity will store and use the data. This process may involve free or purchased data, and it may be done willingly, or in compliance with regulations, laws, or court orders." (Jules H Berman, "Principles of Big Data: Preparing, Sharing, and Analyzing Complex Information", 2013)

"The ability of subsystems or application programs to access data directly and to change it while maintaining data integrity." (Sybase, "Open Server Server-Library/C Reference Manual", 2019)

"The ability of two or more DB2 subsystems to directly access and change a single set of data." (BMC)


Data Management: Information Overload (Definitions)

"A state in which information can no longer be internalized productively by the individual due to time constraints or the large volume of received information." (Martin J Eppler, "Managing Information Quality" 2nd Ed., 2006)

"Phenomena related to the inability to absorb and manage effectively large amounts of information, creating inefficiencies, stress, and frustration. It has been exacerbated by advances in the generation, storage, and electronic communication of information." (Glenn J Myatt, "Making Sense of Data: A Practical Guide to Exploratory Data Analysis and Data Mining", 2006)

"A situation where relevant information becomes buried in a mass of irrelevant information" (Josep C Morales, "Information Disasters in Networked Organizations", 2008)

"A situation where individuals have access to so much information that it becomes impossible for them to function effectively, sometimes leading to where nothing gets done and the user gives the impression of being a rabbit caught in the glare of car headlights." Alan Pritchard, "Information-Rich Learning Concepts", 2009)

"is the situation when the information processing requirements exceed the information processing capacities." (Jeroen ter Heerdt & Tanya Bondarouk, "Information Overload in the New World of Work: Qualitative Study into the Reasons", 2009)

"Refers to an excess amount of information, making it difficult for individuals to effectively absorb and use information; increases the likelihood of poor decisions." (Leslie G Eldenburg & Susan K Wolcott, "Cost Management" 2nd Ed., 2011)

"The inability to cope with or process ever-growing amounts of data into our lives." (Linda Volonino & Efraim Turban, "Information Technology for Management" 8th Ed., 2011)

"The state where the rate or amount of input to a system or person outstrips the capacity or speed of processing that input successfully." (DAMA International, "The DAMA Dictionary of Data Management", 2011)

"The state in which a huge influx of information interferes with understanding an issue, making good decisions, and performance on the job." (Carol A. Brown, "Economic Impact of Information and Communication Technology in Higher Education", 2014)

"The difficulty a person can have understanding an issue and making decisions that can be caused by the presence of too much information." (Li Chen, "Mobile Technostress", Encyclopedia of Mobile Phone Behavior, 2015)

"Occurs when excess of information suffocates businesses and causes employees to suffer mental anguish and physical illness. Information overload causes high levels of stress that can result in health problems and the breakdown of individuals’ personal relationships." (Sérgio Maravilhas & Sérgio R G Oliveira, "Entrepreneurship and Innovation: The Search for the Business Idea", 2018)

"A set of subjective and objective difficulties, mainly originating in the amount and complexity of information available and people’s inability to handle such situations." (Tibor Koltay, "Information Overload", 2021)

19 March 2017

Data Management: Encryption (Definitions)

"A method for keeping sensitive information confidential by changing data into an unreadable form." (Microsoft Corporation, "SQL Server 7.0 System Administration Training Kit", 1999)

"The encoding of data so that the plain text is transformed into something unintelligible, called cipher text." (Tom Petrocelli, "Data Protection and Information Lifecycle Management", 2005)

"Reordering of bits of data to make it unintelligible (and therefore useless) to an unauthorized third party, while still enabling the authorized user to use the data after the reverse process of decryption." (David G Hill, "Data Protection: Governance, Risk Management, and Compliance", 2009)

"To transform information from readable plain text to unreadable cipher text to prevent unintended recipients from reading the data." (Janice M Roehl-Anderson, "IT Best Practices for Financial Managers", 2010)

"The process of transforming data using an algorithm (called a cipher) to make it unreadable to anyone except those possessing special knowledge, usually referred to as a key." (Craig S Mullins, "Database Administration", 2012)

"The process of converting readable data (plaintext) into a coded form (ciphertext) to prevent it from being read by an unauthorized party." (Microsoft, "SQL Server 2012 Glossary", 2012)

"The cryptographic transformation of data to produce ciphertext." (Manish Agrawal, "Information Security and IT Risk Management", 2014)

"The process of scrambling data in such a way that it is unreadable by unauthorized users but can be unscrambled by authorized users to be readable again." (Weiss, "Auditing IT Infrastructures for Compliance, 2nd Ed", 2015)

"The transformation of plaintext into unreadable ciphertext." (Shon Harris & Fernando Maymi, "CISSP All-in-One Exam Guide, 8th Ed", 2018)

"In computer security, the process of transforming data into an unintelligible form in such a way that the original data either cannot be obtained or can be obtained only by using a decryption process." (Sybase, "Open Server Server-Library/C Reference Manual", 2019)

"Encryption is about translating the data into complex codes that cannot be interpreted (decrypted) without the use of a decryption key. These keys are typically distributed and stored separately. There are two types of encryption: symmetric key encryption and public key encryption. In symmetric key encryption, the key to both encrypt and decrypt is exactly the same. Public key encryption has two different keys. One key is used to encrypt the values (the public key), and one key is used to decrypt the data (the private key)." (Piethein Strengholt, "Data Management at Scale", 2020)

"The process of encoding data in such a way to prevent unauthorized access." (AICPA)

16 March 2017

Data Management: Missing Data (Definitions)

"Noise in a bivalent testing input pattern in which one or more components have been changed from the correct value to a value midway between the correct and the incorrect value, i.e. a + 1, or a -1, has been changed to a O." (Laurene V Fausett, "Fundamentals of Neural Networks: Architectures, Algorithms, and Applications", 1994)

"Many databases have cases where not all the attribute values are known. These can be due to structural reasons (e.g., parity for males), due to changes or variations in data collection methodology, or due to nonresponses. In the latter case, it is important to distinguish between ignorable and nonignorable nonresponse. The former must be addressed even though the latter can (usually) be treated as random." (William J Raynor Jr., "The International Dictionary of Artificial Intelligence", 1999)

"Observations where one or more variables contain no value." (Glenn J Myatt, "Making Sense of Data: A Practical Guide to Exploratory Data Analysis and Data Mining", 2006)

"data are said to be missing when there is no information for one or more pattern on one or more features in a research study." (Pedro J García-Laencina et al, "Classification with Incomplete Data", 2010)

"Missing data, also known as lost data, is the data that is lost in an inner join when rows of the tables being joined do not match with any other rows. Missing data can also occur with one-sided joins on the side that is not being preserved. This definition ignores all the other reasons for missing data." (Michael M David & Lee Fesperman, "Advanced SQL Dynamic Data Modeling and Hierarchical Processing", 2013)

"It refers that no data value is stored for the variable in the observation." (Liang-Ting Tsai et al, "Weighting Imputation for Categorical Data", 2014)

"Observations which were planned and are missing." (OECD)

"In statistics, missing data, or missing values, occur when no data value is stored for the variable in an observation. Missing data are a common occurrence and can have a significant effect on the conclusions that can be drawn from the data." (Wikipedia)

15 March 2017

Data Management: Data Compression (Definitions)

"any kind of data reduction method that preserves the application-specific information." (Teuvo Kohonen, "Self-Organizing Maps 3rd Ed.", 2001)

"The process of reducing the size of data by use of mathematical algorithms." (Tom Petrocelli, "Data Protection and Information Lifecycle Management", 2005)

"1.Algorithms or techniques that change data to a smaller physical size that contains the same information. 2.The process of changing data to be stored in a smaller physical or logical space." (DAMA International, "The DAMA Dictionary of Data Management", 2011)

"Encoding information in such a way that its representation consumes less space in memory" (Hasso Plattner, "A Course in In-Memory Data Management: The Inner Mechanics of In-Memory Databases 2nd Ed.", 2014)

"Compression is a data management technique that uses repeating patterns in data to reduce the storage needed to hold the data. A compression algorithm for databases should perform compression and decompression operations as fast as possible. This often entails a trade-off between the speed of compression/decompression and the size of the compressed data. Faster compression algorithms can lead to larger compressed data than other, slower algorithms." (Dan Sullivan, "NoSQL for Mere Mortals®", 2015)

"Reducing the amount of space needed to store a piece of data" (Nell Dale & John Lewis, "Computer Science Illuminated" 6th Ed., 2015)

"The process of reducing the size of a data file by encoding information using fewer bits than the original file." (Faithe Wempen, "Computing Fundamentals: Introduction to Computers", 2015)

"A method that reduces the amount of space needed for storing data. See also client compression and hardware compression." (CommVault, "Documentation 11.20", 2018)

"Any technique used to reduce the amount of storage required to store data." (IEEE 610.5-1990)

14 March 2017

Data Management: Data Protection (Definitions)

"The protecting of data from damage, destruction, and unauthorized alteration." (Tom Petrocelli, "Data Protection and Information Lifecycle Management", 2005)

"Deals with issues such as data security, privacy, and availability. Data protection controls are required by regulations and industry mandates such as Sarbanes-Oxley, European Data Protection Law, and others." (Allen Dreibelbis et al, "Enterprise Master Data Management", 2008)

"A set of rules that aim to protect the rights, freedoms and interests of individuals when information related to them is being processed." (Maria Tzanou, "Data Protection in EU Law after Lisbon: Challenges, Developments, and Limitations", 2015)

"An umbrella term for various procedures that ensure information is secure and available only to authorized users." (Peter Sasvari & Zoltán Nagymate, "The Empirical Analysis of Cloud Computing Services among the Hungarian Enterprises", 2015)

"Protection of the data against unauthorized access by third parties as well as protection of personal data (such as customer data) in the processing of data according to the applicable legal provisions." (Boris Otto & Hubert Österle, "Corporate Data Quality", 2015)

"Legal control over access to, and use of, data in computers." (Lucy Self & Petros Chamakiotis, "Understanding Cloud Computing in a Higher Education Context", 2018)

"Data protection is a task of safeguarding personal or sensitive data which are complex and widely distributed." (M Fevzi Esen & Eda Kocabas, "Personal Data Privacy and Protection in the Meeting, Incentive, Convention, and Exhibition (MICE) Industry", 2019)

"Process of protecting important information from corruption, compromise, or loss." (Patrícia C T Gonçalves, "Medical Social Networks, Epidemiology and Health Systems", 2021)

"The process involving use of laws to protect data of individuals from unauthorized disclosure or access." (Frank Makoza, "Learning From Abroad on SIM Card Registration Policy: The Case of Malawi", 2019)

"Is the process in information and communication technology that deals with the ability an organization or individual to safeguard data and information from corruption, theft, compromise, or loss." (Valerianus Hashiyana et al, "Integrated Big Data E-Healthcare Solutions to a Fragmented Health Information System in Namibia", 2021)

"The mechanisms with which an organization enables individuals to retain control of the personal data they willingly share, where security provides policies, controls, protocols, and technologies necessary to fulfill rules and obligations in accordance with privacy regulations, industry standards, and the organization's ethics and social responsibility." (Forrester)

06 March 2017

Data Management: Audit Trail (Definitions)

"Audit records stored in the sybsecurity database." (Karen Paulsell et al, "Sybase SQL Server: Performance and Tuning Guide", 1996)

"A record of what happened to data from its inception to its current state. Audit trails help verify the integrity of data." (Microsoft Corporation, "Microsoft SQL Server 7.0 Data Warehouse Training Kit", 2000)

"Data maintained to trace activity, such as a transaction log, for purposes of recovery or audit." (Craig S Mullins, "Database Administration", 2012)

"A chronological record of activities on information resources that enables the reconstruction and examination of sequences of activities on those information resources for later review." (Mark Rhodes-Ousley, "Information Security: The Complete Reference, Second Edition" 2nd Ed., 2013)

"A trace of a sequence of events in a clerical or computer system. This audit usually identifies the creation or modification of any element in the system, who did it, and (possibly) why it was done." (Marcia Kaufman et al, "Big Data For Dummies", 2013)

"A chronological record of events or transactions. An audit trail is used for examining or reconstructing a sequence of events or transactions, managing security, and recovering lost transactions." (Sybase, "Open Server Server-Library/C Reference Manual", 2019)

 "A path by which the original input to a process (e.g. data) can be traced back through the process, taking the process output as a starting point. This facilitates result checking and allows a process audit to be carried out [after TMap]." (Software Quality Assurance)

05 March 2017

Data Management: System of Record (Definitions)

"The system that definitively specifies data values. In dealing with redundant data, you can have values that should be the same but disagree. The system of record is the system you go back to, in order to verify the true value of the data." (Microsoft Corporation, "Microsoft SQL Server 7.0 Data Warehouse Training Kit", 2000)

"The definitive and singular source of operational data. If data element ABC has a value of 25 in a database record but a value of 45 in the system of record, by definition, the first value is incorrect and must be reconciled. The system of record is useful for managing redundancy of data." (William H Inmon, "Building the Data Warehouse", 2005)

"The single authoritative, enterprise-designated source of operational data. It is the most current, accurate source of its data." (David Lyle & John G Schmidt, "Lean Integration", 2010)

"A system that stores the 'official' version of a data attribute." (DAMA International, "The DAMA Dictionary of Data Management", 2011)

"The system of record is a system that is charged with keeping the most complete or trustworthy representation of a set of entities. Within the practice of master data management, such representations are referred to as golden records and the system of record can also be called the system of truth." (Laura Sebastian-Coleman, "Measuring Data Quality for Ongoing Improvement ", 2012)

"Records from which information is retrieved by the name, identifying number, symbol, or other identifying particular assigned to the individual. Sometimes abbreviated as SOR." ( Manish Agrawal, "Information Security and IT Risk Management", 2014)

"An information storage system (commonly implemented on a computer system) that is the authoritative data source for a given data element or piece of information. The need to identify systems of record can become acute in organizations where management information systems have been built by taking output data from multiple-source systems, reprocessing this data, and then re-presenting the result for a new business use." (Janice M Roehl-Anderson, "IT Best Practices for Financial Managers", 2010)

28 February 2017

Data Management: Data Lifecycle (Definitions)

[Data Lifecycle Management (DLM):" "The process by which data is moved to different mass storage devices based on its age." (Tom Petrocelli, "Data Protection and Information Lifecycle Management", 2005)

[master data lifecycle management:] "Supports the definition, creation, access, and management of master data. Master data must be managed and leveraged effectively throughout its entire lifecycle." (Allen Dreibelbis et al, "Enterprise Master Data Management", 2008)

[Data lifecycle management (DLM):] "Managing data as blocks without underlying knowledge of the content of the blocks, based on limited metadata (e.g., creation date, last accessed)." (David G Hill, "Data Protection: Governance, Risk Management, and Compliance", 2009)

"The data life cycle is the set of processes a dataset goes through from its origin through its use(s) to its retirement. Data that moves through multiple systems and multiple uses has a complex life cycle. Danette McGilvray’s POSMAD formulation identifies the phases of the life cycle as: planning for, obtaining, storing and sharing, maintaining, applying, and disposing of data." (Laura Sebastian-Coleman, "Measuring Data Quality for Ongoing Improvement", 2012)

"The recognition that as data ages, that data takes on different characteristics" (Daniel Linstedt & W H Inmon, "Data Architecture: A Primer for the Data Scientist", 2014)

"The development of a record in the company’s IT systems from its creation until its deletion. This process may also be designated as 'CRUD', an acronym for the Create, Read/Retrieve, Update and Delete database operations." (Boris Otto & Hubert Österle, "Corporate Data Quality", 2015)

"The series of stages that data moves though from initiation, to creation, to destruction. Example: the data life cycle of customer data has four distinct phases and lasts approximately eight years." (Gregory Lampshire, "The Data and Analytics Playbook", 2016)

"covers the period of time from data origination to the time when data are no longer considered useful or otherwise disposed of. The data lifecycle includes three phases, the origination phase during which data are first collected, the active phase during which data are accumulating and changing, and the inactive phase during which data are no longer expected to accumulate or change, but during which data are maintained for possible use." (Meredith Zozus, "The Data Book: Collection and Management of Research Data", 2017)

"The complete set of development stages from creation to disposal, each with its own characteristics and management responsibilities, through which organizational data assets pass." (Kevin J Sweeney, "Re-Imagining Data Governance", 2018)

"An illustrative phrase describing the many manifestations of data from its raw, unanalyzed state, such as survey data, to intellectual property, such as blueprints." (Sue Milton, "Data Privacy vs. Data Security", 2021)

"Refers to all the stages in the existence of digital information from creation to destruction. A lifecycle view is used to enable active management of the data objects and resource over time, thus maintaining accessibility and usability." (CODATA)

Data Warehousing: Data Load Optimization – A Success Story

Introduction

    This topic has been waiting in the queue for almost two years - since I finished optimizing an already existing relational data warehouse in a SQL Server 2012 Enterprise Edition environment. Using several simple techniques, I managed to reduce the running time of the load process by more than 65%, from 9 to 3 hours. That’s a considerable performance gain, considering that I didn’t have to refactor any business logic implemented in queries.

    The ETL (Extract, Transform, Load) solution used SSIS (SQL Server Integration Services) packages to load data sequentially from several sources into staging tables, and from staging further into base tables. Each package was responsible for deleting the data from the staging tables via TRUNCATE, extracting the data 1:1 from the source into the staging tables, then loading the data 1:1 from the staging tables into the base tables. It’s the simplest and a relatively effective ETL design, which I have also used, with small alterations, for other data warehouse solutions. For months the data load worked smoothly, until data growth and eventually other problems increased the loading time from 5 to 9 hours.

Using TABLOCK Hint

    Using SSIS to bulk load data into SQL Server provides a good balance of performance and flexibility. Within a Data Flow, when the “Table Lock” property on the destination is checked, the inserted records can be minimally logged (provided the other prerequisites for minimal logging are met), which can speed up the load by a factor of two. The TABLOCK hint can also be used for insert operations performed outside of SSIS packages. In this case the movement of data from staging into base tables was performed in plain T-SQL, outside of SSIS packages, and further data processing benefitted from the change as well. This optimization step alone provided a 30-40% performance gain.
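As a minimal sketch (the table and column names are hypothetical), the staging-to-base load with the TABLOCK hint looks like this:

```sql
-- Minimal logging also assumes, among others, that the database uses
-- the SIMPLE or BULK_LOGGED recovery model.
TRUNCATE TABLE dbo.FactSales;

INSERT INTO dbo.FactSales WITH (TABLOCK)
       (SalesOrderId, ProductId, OrderDate, Quantity, Amount)
SELECT SalesOrderId, ProductId, OrderDate, Quantity, Amount
FROM   stg.FactSales;
```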

Drop/Recreating the Indexes on Big Tables

    As the base tables each had several indexes, it proved beneficial to drop the indexes on the big tables (e.g. those with more than 1,000,000 records) before loading the data into the base tables, and to recreate the indexes afterwards. This was done within SSIS, and provided an additional 20-30% performance gain over the previous step.
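Sketched in T-SQL (index and table names are hypothetical), the pattern wrapped around the load was:

```sql
-- Drop the nonclustered indexes before the load ...
DROP INDEX IX_FactSales_ProductId ON dbo.FactSales;

-- ... perform the bulk insert here ...

-- ... and recreate the indexes afterwards.
CREATE NONCLUSTERED INDEX IX_FactSales_ProductId
    ON dbo.FactSales (ProductId);
```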

Consolidating the Indexes

    Adding missing indexes and removing or consolidating (overlapping) indexes are typical index maintenance tasks, though apparently they are occasionally ignored. This doesn’t always bring as much performance gain as the previous methods, though dropping and consolidating some indexes proved beneficial, as less data had to be maintained. The data processing logic benefited from the creation of new indexes as well.
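Missing-index candidates can be reviewed, for example, via SQL Server’s missing-index DMVs; the suggestions should always be validated against the workload before creating any index:

```sql
-- Review missing-index suggestions collected since the last restart.
SELECT d.statement AS table_name,
       d.equality_columns,
       d.inequality_columns,
       d.included_columns,
       s.user_seeks,
       s.avg_user_impact
FROM   sys.dm_db_missing_index_details d
       JOIN sys.dm_db_missing_index_groups g
         ON d.index_handle = g.index_handle
       JOIN sys.dm_db_missing_index_group_stats s
         ON g.index_group_handle = s.group_handle
ORDER BY s.user_seeks * s.avg_user_impact DESC;
```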

Running Packages in Parallel

As the packages were run sequentially (one package at a time), the data load was hardly taking advantage of the processing power available on the server. Even if queries could use parallelism, the benefit was minimal. Enabling the packages to run in parallel brought an additional performance gain, though it reduced the availability of processing resources for other tasks. When the data load is performed overnight this causes minimal overhead; however, it should be avoided when the data are loaded during business hours.
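One way to sketch this (the job names are hypothetical) is via SQL Server Agent: msdb.dbo.sp_start_job returns as soon as a job is started, so starting several jobs in a row runs them concurrently:

```sql
-- sp_start_job is asynchronous, so these two loads run in parallel.
EXEC msdb.dbo.sp_start_job @job_name = N'Load Sales Staging';
EXEC msdb.dbo.sp_start_job @job_name = N'Load Inventory Staging';
```

Within a single SSIS package, the MaxConcurrentExecutables property similarly controls how many tasks may run in parallel.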

Using Clustered Indexes

In my analysis I found that many tables, especially the ones storing prepared data, were lacking a clustered index, even if further indexes were built on them. I remember that years back there was a (false) myth that fact and/or dimension tables don’t need clustered indexes in SQL Server. Of course clustered indexes have downsides (e.g. fragmentation, excessive key lookups), though their benefits far exceed them. Besides the missing clustered indexes, there were cases in which the tables would have benefited from a narrow clustered index instead of a wide multicolumn one. Such cases were addressed as well.
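As a sketch (again with hypothetical names), replacing a wide multicolumn clustered index with a narrow one on a single key column:

```sql
-- Replace a wide multicolumn clustered index with a narrow one
-- on a single, ever-increasing key column.
DROP INDEX CIX_FactSales_Wide ON dbo.FactSales;

CREATE CLUSTERED INDEX CIX_FactSales
    ON dbo.FactSales (SalesOrderId);
```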

Removing the Staging Tables

    Given that the source and target systems are in the same virtual environment, and the data are loaded 1:1 between the various layers, without further transformations and conversions, one could load the data directly into the base tables. After some tests I came to the conclusion that the load from the source tables into the staging tables, and the load from the staging tables into the base tables (with the TABLOCK hint), were taking almost the same amount of time. This means the base tables would be unavailable for about the same amount of time if the data were loaded from the sources directly into them. Therefore one could in theory remove the staging tables from the architecture. Frankly, one should think twice before making such a change, as there can be further implications over time: even if today the data are imported 1:1, in the future this could change.

Reducing the Data Volume

    Reducing the data volume was identified as a further possible technique to reduce the time needed for the data load. A data warehouse is built based on a set of requirements and presumptions that change over time. It can happen, for example, that even if the reports need only 1-2 years’ worth of data, the data load considers a much bigger timeframe. Some systems can hold up to 5-10 years’ worth of data. Loading all data without a specific requirement leads to wasted resources and longer load times. Limiting the transactional data to a given timeframe can make a considerable difference. Additionally, there are historical data that have the potential to be archived.
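Sketched for the hypothetical fact table above, limiting the extraction to the timeframe the reports actually need can be as simple as a filter in the source query:

```sql
-- Extract only the last two years' worth of transactional data.
SELECT SalesOrderId, ProductId, OrderDate, Quantity, Amount
FROM   src.SalesOrders
WHERE  OrderDate >= DATEADD(year, -2, GETDATE());
```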

    There are also tables for which a weekly or monthly refresh would suffice. Some tables or even data sources can become obsolete yet continue to be loaded into the data warehouse. Such cases occur seldom, though they do occur. Also, some unused or redundant columns could be removed from the packages.

Further Thoughts

    There are further techniques to optimize the data load within a data warehouse, like partitioning large tables, using columnstore indexes, or optimizing the storage; however, my target was to provide a sufficient performance gain with a minimum of effort and design changes. Therefore I stopped when I considered the additional effort to be considerably higher than the expected performance gain.

Further Reading:
[1] TechNet (2009) The Data Loading Performance Guide, by Thomas Kejser, Peter Carlin & Stuart Ozer
https://technet.microsoft.com/en-us/library/dd425070(v=sql.100).aspx
[2] MSDN (2010) Best Practices for Data Warehousing with SQL Server 2008 R2, by Mark Whitehorn, Keith Burns & Eric N Hanson
https://msdn.microsoft.com/en-us/library/gg567302.aspx
[3] MSDN (2012) Whitepaper: Fast Track Data Warehouse Reference Guide for SQL Server 2012, by Eric Kraemer, Mike Bassett, Eric Lemoine & Dave Withers
https://msdn.microsoft.com/en-us/library/hh918452.aspx
[4] MSDN (2008) Best Practices for Data Warehousing with SQL Server 2008, by Mark Whitehorn & Keith Burns https://msdn.microsoft.com/library/cc719165.aspx
[5] TechNet (2005) Strategies for Partitioning Relational Data Warehouses in Microsoft SQL Server, by Gandhi Swaminathan
https://technet.microsoft.com/library/cc966457
[6] SQL Server Customer Advisory Team (2013) Top 10 Best Practices for Building a Large Scale Relational Data Warehouse
https://blogs.msdn.microsoft.com/sqlcat/2013/09/16/top-10-best-practices-for-building-a-large-scale-relational-data-warehouse/

23 February 2017

Data Management: Data Integration (Definitions)

"The process of coherently using data from across platforms, applications, or business units. Data integration ensures that data from different sources is merged, allowing silos of data to be combined." (Tony Fisher, "The Data Asset", 2009)

"The planned and controlled:
a) merge using some form of reference,
b) transformation using a set of business rules, and
c) flow of data from a source to a target, for operational and/or analytical use. Data needs to be accessed and extracted, moved, validated and cleansed, standardized, transformed, and loaded." (DAMA International, "The DAMA Dictionary of Data Management", 2011)

"The collection of data from various sources with the same significance into one uniform record. This data may be physically integrated, for example, into a data warehouse or virtually, meaning that the data will remain in the source systems, however will be accessed using a uniform view." (Boris Otto & Hubert Österle, "Corporate Data Quality", 2015)

"Data integration comprises the activities, techniques, and tools required to consolidate and harmonize data from different (multiple) sources into a unified view. The processes of extract, transform, and load (ETL) are part of this discipline." (Piethein Strengholt, "Data Management at Scale", 2020)

"Pulling together and reconciling dispersed data for analytic purposes that organizations have maintained in multiple, heterogeneous systems. Data needs to be accessed and extracted, moved and loaded, validated and cleaned, and standardized and transformed." (Information Management)

"The combination of technical and business processes used to combine data from disparate sources into meaningful insights." (Solutions Review)

"The process of retrieving and combining data from different sources into a unified set for users, organizations, and applications." (MuleSoft) 

"Data integration is the practice of consolidating data from disparate sources into a single dataset with the ultimate goal of providing users with consistent access and delivery of data across the spectrum of subjects and structure types, and to meet the information needs of all applications and business processes." (OmniSci) [source]

"Data integration is the process of combining data from multiple source systems to create unified sets of information for both operational and analytical uses." (Techtarget)

"Data integration is the process of bringing data from disparate sources together to provide users with a unified view. The premise of data integration is to make data more freely available and easier to consume and process by systems and users." (Tibco) [source]

"Data integration is the process of retrieving and combining data from different sources into a unified set of data. Data integration can be used to combine data for users, organizations, and applications." (kloudless)

"Data integration is the process of taking data from multiple disparate sources and collating it in a single location, such as a data warehouse. Once integrated, data can then be used for detailed analytics or to power other enterprise applications." (Xplenty) [source]

"Data integration is the process used to combine data from disparate sources into a unified view that can provide valuable and actionable information." (snowflake) [source]

"Data integration refers to the technical and business processes used to combine data from multiple sources to provide a unified, single view of the data." (OmniSci) [source]

"The discipline of data integration comprises the practices, architectural techniques and tools for achieving the consistent access and delivery of data across the spectrum of data subject areas and data structure types in the enterprise to meet the data consumption requirements of all applications and business processes." (Gartner)

Data Management: Data Cleaning/Cleansing (Definitions)

"A processing step where missing or inaccurate data is replaced with valid values." (Joseph P Bigus, "Data Mining with Neural Networks: Solving Business Problems from Application Development to Decision Support", 1996)

"The process of validating data prior to a data analysis or Data Mining. This includes both ensuring that the values of the data are valid for a particular attribute or variable (e.g., heights are all positive and in a reasonable range) and that the values for given records or set of records are consistent." (William J Raynor Jr., "The International Dictionary of Artificial Intelligence", 1999)

"The process of correcting errors or omissions in data. This is often part of the extraction, transformation, and loading (ETL) process of extracting data from a source system, usually before attempting to load it into a target system. This is also known as data scrubbing." (Sharon Allen & Evan Terry, "Beginning Relational Data Modeling" 2nd Ed., 2005)

"The removal of inconsistencies, errors, and gaps in source data prior to its incorporation into data warehouses or data marts to facilitate data integration and improve data quality." (Steve Williams & Nancy Williams, "The Profit Impact of Business Intelligence", 2007)

"Software used to identify potential data quality problems. For example, if a customer is listed multiple times in a customer database using variations of the spelling of his or her name, the data cleansing software ensures that each data element is consistent so there is no confusion. Such software is used to make corrections to help standardize the data." (Judith Hurwitz et al, "Service Oriented Architecture For Dummies" 2nd Ed., 2009)

"The process of reviewing and improving data to make sure it is correct, up to date, and not duplicated." (Tony Fisher, "The Data Asset", 2009)

"The process of correcting data errors to bring the level of data quality to an acceptable level for the information user needs." (DAMA International, "The DAMA Dictionary of Data Management", 2011)

"The act of detecting and removing and/or correcting data in a database. Also called data scrubbing." (Craig S Mullins, "Database Administration: The Complete Guide to DBA Practices and Procedures", 2012)

"Synonymous with data fixing or data correcting, data cleaning is the process by which errors, inexplicable anomalies, and missing values are somehow handled. There are three options for data cleaning: correcting the error, deleting the error, or leaving it unchanged." (Jules H Berman, "Principles of Big Data: Preparing, Sharing, and Analyzing Complex Information", 2013)

"The process of detecting, removing, or correcting incorrect data." (Evan Stubbs, "Delivering Business Analytics: Practical Guidelines for Best Practice", 2013)

"The process of finding and fixing errors and inaccuracies in data" (Daniel Linstedt & W H Inmon, "Data Architecture: A Primer for the Data Scientist", 2014)

"The process of removing corrupt, redundant, and inaccurate data in the data governance process." (Robert F Smallwood, "Information Governance: Concepts, Strategies, and Best Practices", 2014) 

"The process of eliminating inaccuracies, irregularities, and discrepancies from data." (Jim Davis & Aiman Zeid, "Business Transformation", 2014)

"The process of reviewing and revising data in order to delete duplicates, correct errors, and provide consistency." (Jason Williamson, "Getting a Big Data Job For Dummies", 2015)

"the processes of identifying and resolving potential data errors." (Meredith Zozus, "The Data Book: Collection and Management of Research Data", 2017)

"A sub-process in data preprocessing, where we remove punctuation, stop words, etc. from the text." (Neha Garg & Kamlesh Sharma, "Machine Learning in Text Analysis", 2020)

"Processing a dataset to make it easier to consume. This may involve fixing inconsistencies and errors, removing non-machine-readable elements such as formatting, using standard labels for row and column headings, ensuring that numbers, dates, and other quantities are represented appropriately, conversion to a suitable file format, reconciliation of labels with another dataset being used (see data integration)." (Open Data Handbook) 

"The process of detecting and correcting faulty records, leading to highly accurate BI-informed decisions, as enormous databases and rapid acquisition of data can lead to inaccurate or faulty data that impacts the resulting BI and analysis. Correcting typographical errors, de-duplicating records, and standardizing syntax are all examples of data cleansing." (Insight Software)

"Transforming data in its native state to a pre-defined standardized format using vendor software." (Solutions Review)

"Data cleansing is the effort to improve the overall quality of data by removing or correcting inaccurate, incomplete, or irrelevant data from a data system.  […] Data cleansing techniques are usually performed on data that is at rest rather than data that is being moved. It attempts to find and remove or correct data that detracts from the quality, and thus the usability, of data. The goal of data cleansing is to achieve consistent, complete, accurate, and uniform data." (Informatica) [source]

"Data cleansing is the process of modifying data to improve accuracy and quality." (Xplenty) [source]

"Data cleaning is the process of preparing data for analysis by removing or modifying data that is incorrect, incomplete, irrelevant, duplicated, or improperly formatted." (Sisense) [source]

"Data Cleansing (or Data Scrubbing) is the action of identifying and then removing or amending any data within a database that is: incorrect, incomplete, duplicated." (experian) [source]

"Data cleansing, or data scrubbing, is the process of detecting and correcting or removing inaccurate data or records from a database. It may also involve correcting or removing improperly formatted or duplicate data or records. Such data removed in this process is often referred to as 'dirty data'. Data cleansing is an essential task for preserving data quality." (Teradata) [source]

"Data scrubbing, also called data cleansing, is the process of amending or removing data in a database that is incorrect, incomplete, improperly formatted, or duplicated." (Techtarget) [source]

"the process of reviewing and revising data in order to delete duplicates, correct errors and provide consistency." (Analytics Insight)

21 February 2017

Data Quality Dimensions: Validity (Definitions)

"A characteristic of the data collected that indicates they are sound and accurate." (Teri Lund & Susan Barksdale, "10 Steps to Successful Strategic Planning", 2006)

"Implies that the test measures what it is supposed to." (Robert McCrie, "Security Operations Management" 2nd Ed., 2006)

"The determination that values in the field are or are not within a set of allowed or valid values. Measured as part of the Data Integrity Fundamentals data quality dimension." (Danette McGilvray, "Executing Data Quality Projects", 2008)

"A data quality dimension that reflects the confirmation of data items to their corresponding value domains, and the extent to which non-confirmation of certain items affects fitness to use. For example, a data item is invalid if it is defined to be integer but contains a non-integer value, linked to a finite set of possible values but contains a value not included in this set, or contains a NULL value where a NULL is not allowed." (G Shankaranarayanan & Adir Even, "Measuring Data Quality in Context", 2009)

"An aspect of data quality consisting in its steadiness despite the natural process of data obsolescence increasing in time." (Juliusz L Kulikowski, "Data Quality Assessment", 2009)

"An inherent quality characteristic that is a measure of the degree of conformance of data to its domain values and business rules." (David C Hay, "Data Model Patterns: A Metadata Map", 2010)

"Validity is a dimension of data quality, defined as the degree to which data conforms to stated rules. As used in the DQAF, validity is differentiated from both accuracy and correctness. Validity is the degree to which data conform to a set of business rules, sometimes expressed as a standard or represented within a defined data domain." (Laura Sebastian-Coleman, "Measuring Data Quality for Ongoing Improvement", 2012)

"Validity is defined as the extent to which data corresponds to reference tables, lists of values from golden sources documented in metadata, value ranges, etc." (Rajesh Jugulum, "Competing with High Quality Data", 2014)

"the state of consistency between a measurement and the concept that a researcher intended to measure." (Meredith Zozus, "The Data Book: Collection and Management of Research Data", 2017)

[semantic validity:] "The compliance of attribute data to rules regarding consistency and truthfulness of association." (O Sami Saydjari, "Engineering Trustworthy Systems: Get Cybersecurity Design Right the First Time", 2018)

[syntactic validity:] "The compliance of attribute data to format and grammar rules." (O Sami Saydjari, "Engineering Trustworthy Systems: Get Cybersecurity Design Right the First Time", 2018)

"Validity is a data quality dimension that refers to information that doesn’t conform to a specific format or doesn’t follow business rules." (Precisely) [source]
