SQL Troubles: Completeness

Showing posts with label Completeness. Show all posts

27 January 2025

🗄️🗒️Data Management: Data Quality Dimensions [Notes]

Disclaimer: This is work in progress intended to consolidate information from various sources for learning purposes.

Last updated: 27-Jan-2025

[Data Management] Data quality dimensions

{def} features of data that can be measured or assessed against defined standards to determine the quality of data

captures a specific aspect of general data quality

can refer to data values or to their schema

{type} hard dimensions

dimensions that can be measured

{type} soft dimensions

dimensions that can be measured only indirectly

⇐ through interviews with data users or through any other kind of communication with users

dimensions whose measurement depends on the perception of the users of the data

{dimension} uniqueness [post]

the degree to which a value or set of values is unique within a dataset

can be determined based on a set of values supposed to be unique across the whole dataset

some systems have a artificial, respectively natural unique identified

measured in terms of either

the percentage of unique values available in a dataset
the percentage of duplicate values available in a dataset

the impossibility of identifying whether a value is unique increases the chances for it to be duplicated
it can have broader implications

aggregated information is not shown correctly

⇐ split across different entities

can lead to further duplicates in other areas

{recommendation} enforce uniqueness by design, if possible
{recommendation} check the data regularly for duplicates and disable or delete the duplicated records

⇐ one should make sure that the records can't be further reused in business processes or analytics workloads

{dimension} completeness [post]

the extent to which there are missing data in a dataset

⇐ reflected in the number of the missing values

measured as percentage of the missing values compared to the total

determined by the presence of NULL values

{type} attribute completeness

the number of NULLs in a specific attribute

{type} tuple completeness

the number of unknown values of the attributes in a tuple

{type} relation completeness

the number of tuples with unknown attribute values in the relation

{type} value completeness

makes sense for complex, semi-structured columns such as XML data type columns

e.g. a complete element or attribute can be missing

considered in report to

mandatory attributes

attributes that need a not-Null value for each record

optional attributes

attributes that not necessarily need to be provided

inapplicable attributes

attributes not applicable (relevant) for certain scenarios by design

{dimension} conformity (aka format compliance) [post]

{def} the extent data are in the expected format

dependent on the data type and its definition

can be associated with a set of metadata

data type

e.g. text, numeric, alphanumeric, positive, date

length
precision
scale
formatting patterns

e.g. phone number, decimal and digit grouping symbols
different formatting might apply based on various business rules
can use delimiters

{recommendation} define the data type and further constraints to enforce the various characteristics of the element
{recommendation} make sure that the delimiters don't overlap with other uses

{dimension} accuracy [post]

{def} the extent data is correct, respectively match the reality with an acceptable level of approximation
stricter than just conforming to business rules
can be measured at column and table level

[discrete data values]

use frequency distribution of values

a value with very low frequency is probably incorrect

[alphanumeric values]

use string length distribution

a string with a very atypical length is potentially incorrect

try to find patterns and then create pattern distribution.

patterns with low frequency probably denote wrong values

[continuous attributes]

use descriptive statistics

just by looking at minimal and maximal values, you can easily spot potentially problematic data

{dimension} consistency [post]

{def} the degree of uniformity, standardization, and freedom from contradiction among the documents or parts of a system or component

{type} notational consistency

the extent (data) values are consistent in notation

{type} semantic consistency

the degree to which data has unique meaning
is more restrictive than the notational consistency

measures the equivalence of information stored in various repositories
involves comparing values with a predefined set of possible values

from the same or from different systems

can be measured at column and table level
can have different scopes

cross-system consistencies

among systems or data repositories

cross-record consistency

within the same repository

temporal consistency

within the same record at different points in time

{dimension} timeliness [post]

tells the degree to which data is current and available when needed

there is always some delay between change in the real world and the moment when this change is entered into a system

stale data/obsolete data

{dimension} structuredness [post]

the degree to which a data structure or model possesses a definite pattern of organization of its interdependent parts
allows the categorization of data as

structured data [def]

refers to structures that can be easily perceived or known, that raises no doubt on structure’s delimitations

unstructured data [def]

refers to textual data and media content (video, sound, images), in which the structural patterns even if exist they are hard to discover or not predefined

semi-structured data [def]

refers to islands of structured data stored with unstructured data, or vice versa

⇐ the more structured the data, the easier it is to be processed

{dimension} referential integrity [post]

{def} the degree to which the values of a key in one table (aka reference value) match the values of a key in a related table (aka the referenced value)
it's an architectural concept of the database
{recommendation} keep the referential integrity of a system by design

some systems build logic for assuring the referential integrity in the applications and not in the database

{dimension} currency (aka actuality)

the extent to which data is actual
can be considered as a special type of accuracy

⇐ when the data is not actual then it doesn’t reflect reality

{dimension} ease of use

the extent to which data can be used for a given purpose

usually it refers to whether the data can be processed as needed
depends on the application or on the user interface

{dimension} fitness of use

the degree to which the data is fit for use

the data may have good quality for a given purposes but

not usable for other purposes
can be used as substitute for other data

e.g. use phone area codes instead of ZIP codes to locate customers approximately

{dimension} trustfulness [post]

the degree to which the data can be trusted

is a matter of perception

ask users whether they trust the data and which are the reasons

if the users don’t trust the data

they will create their own solutions
they will not use applications

{dimension} entropy

{def} the average amount of information conveyed

⇐ quantification of information in a system
⇐ the more dispersed the values and the more the frequency distribution of a discrete column is equally spread among the values, the more information is available [1]
⇐ can tell whether your data is suitable for analysis or not

can be measured at column and table level

{dimension} presentation quality

applicable to applications that presents data

format and appearance should support the appropriate use of data
depends on the UI used

{recommendation} have a dedicated system for maintaining the master data and broadcast the data to the subscribers as needed

the data should be exclusively managed though the management system
{anti-pattern} data is modified in the subscribers and the changes aren't always reflected back to the source system

Previous Post <<||>> Next Post

References:
[1] Dejan Sarka et al (2012) Exam 70-463: Implementing a Data Warehouse with Microsoft SQL Server 2012 (Training Kit)

05 January 2021

🧮ERP: Implementations (Part III: It’s all about Scope I - Functional Requirements)

ERP Implementations Series

Introduction

ERP (Enterprise Resource Planning) Implementations tend to be expensive projects, often the actual costs overrunning the expectations by an important factor. The causes for this are multiple, the most important ones ranging from the completeness and complexity of the requirements and the impact they have on the organization to the availability of internal and external skilled resources, project methodology, project implementation, organization’s maturity in running projects, etc.

The most important decision in an ERP implementation is deciding what one needs, respectively what will be considered for the implementation, aspects reflected in a set of functional and nonfunctional requirements.

Functional Requirements

The functional requirements (FRs) reflect the expected behavior of the system in respect to the inputs and outputs – what the system must do. Typically, they encompass end-users’ requirements in the area of processes, interfaces and data processing, though are not limited to them.

The FRs are important because they reflect the future behavior of the system as perceived by the business, serving further as basis for identifying project’s scope, the gaps between end-users’ requirements and system’s functionality, respectively for estimating project’s duration and areas of focus. Further they are used as basis for validating system’s behavior and getting the sign-off for the system. Therefore, the FRs need to have the adequate level of detail, be complete, clear, comprehensible and implementable, otherwise any gaps in requirements can impact the project in adverse ways. To achieve this state of art they need to go through several iterations in which the requirements are reevaluated, enhanced, checked for duplication, relevance or any other important aspect. In the process it makes sense to categorize the requirements and provide further metadata needed for their appraisal (e.g. process, procedure, owner, status, priority).

Once brought close to a final form, the FRs are checked against the functionality available in the targeted system, or systems when more systems are considered for evaluation. Ideally all the requirements can be implemented with the proper parametrization of the systems, though it’s seldom the case as each business has certain specifics. The gaps need to be understood, their impact evaluated and decided whether the gaps need to be implemented. In general, it’s recommended to remain close to the standard functionality, as each further gap requires further changes to the system, gaps that in time can generate further quality-related and maintenance costs.

It can become a tedious effort, as in the process an impact and cost-benefit analysis need to be performed for each gap. Therefore, gaps’ estimation needs to occur earlier or intermixed with their justification. Once the list of the FRs is finalized and frozen, they will be used for estimating the final costs of the project, identifying the work packages, respectively planning the further work. Once the FRs frozen, any new requirements or changes to requirements (including taking out a requirement) need to go through the Change Management process and all the consequences deriving from it – additional effort, costs, delays, etc. This can trigger again an impact and cost-benefit analysis.

The FRs are documented in a specification document (aka functional requirement specification), which is supposed to track all the FRs through their lifetime. When evaluating the FRs against system’s functionality it’s recommended to provide general information on how they will be implemented, respectively which system function(s) will be used for that purpose. Besides the fact that it provides transparence, the information can be used as basic ground for further discussions.

Seldom all the FRs will be defined upfront or complete. Moreover, some requirements will become obsolete during project’s execution, or gaps will be downgraded as standard and vice-versa. Therefore, it’s important to recollect the unexpected.

Previous Post <<||>> Next Post

11 May 2019

#️⃣Software Engineering: Programming (Part XI: The Dark Side)

Software Engineering Series

As member of programmers' extended community, it’s hard to accept some of the views that inconsiderate programmers and their work. In some contexts, maybe the critics reveal some truths. It’s in human nature to generalize some of the bad experiences people have or to oversimplify some of programmers’ traits in stereotypes, however the generalizations and simplifications with pejorative connotations bring no service to the group criticized, as well to the critics.

The programmer finds himself at the end of the chain of command, and he’s therefore the easiest to blame for the problems existing in software development (SD). Some of the reasoning fallacies are equating the process of programming with programmers' capabilities, when the problems reside in the organization itself – the way it handles each step of the processes involved, the way it manages projects, the way it’s organized, the way it addresses cultural challenges, etc.

The meaningful part of the SD starts with requirements’ elicitation, the process of researching and discovering the requirements based on which a piece of software is built upon. The results of the programming process are as good as the inputs provided – the level of detail, accuracy and completeness with which the requirements were defined. It’s the known GIGO (garbage in, garbage out) principle. Even if he questions some of the requirements, for example, when they are contradictory or incomplete, each question adds more delays in the process because getting clarifying the open issues involves often several iterations. Thus, one must choose between being on time and delivering the expected quality. Another problem is that the pay-off and perception for the two is different from managerial and customers’ perspective.

A programmer’s work, the piece of software he developed, it’s seen late in the process, when it’s maybe too late to change something in utile time. This happens especially in waterfall methodology, this aspect being addressed by more modern technologies by involving the customers and getting constructive feedback early in the process, and by developing the software in iterations.

Being at the end of the chain command, programming is seen often as a low endeavor, minimizing its importance, maybe because it seems so obvious. Some even consider that anybody can program, and it’s true that, as each activity, anyone can learn to program, same as anyone can learn another craft, however as any craft it takes time and skills to master. The simple act of programming doesn’t make one a programmer, same as the act of singing doesn’t make one a singer. A programmer needs on average several years to achieve an acceptable level of mastery and profoundness. This can be done only by mastering one or more programming languages and frameworks, getting a good understanding of the SD processes and what the customers want, getting hand-on experience on a range of projects that allow programmers to learn and grow.

There are also affirmations that contain some degrees of truth. Overconfidence in one’s skills results in programmers not testing adequately their own work. Programmers attempt using the minimum of effort in achieving a task, the development environments and frameworks, the methodologies and other tools playing an important part. In extremis, through the hobbies, philosophies, behaviors and quirks they have, not necessarily good or bad, the programmers seem to isolate themselves.

In the end the various misconceptions about programmers have influence only to the degree they can pervade a community or an organization’s culture. The bottom line is, as Bjarne Stroustrup formulated it, “an organization that treats its programmers as morons will soon have programmers that are willing and able to act like morons only” [1].

Previous Post <<||>> Next Post

References:
[1] "The C++ Programming Language" 2nd Ed., by Bjarne Stroustrup, 1991

04 May 2019

🧊Data Warehousing: Architecture (Part I: Push vs. Pull)

In data integrations, data migrations and data warehousing there is the need to move data between two or more systems. In the simplest scenario there are only two systems involved, a source and a target system, though there can be complex scenarios in which data from multiple sources need to be available in a common target system (as in the case of data warehouses/marts or data migrations), or data from one source (e.g. ERP systems) need to be available in other systems (e.g. Web shops, planning systems), or there can be complex cases in which there is a many-to-many relationship (e.g. data from two ERP systems are consolidated in other systems).

The data can flow in one direction from the source systems to the target systems (aka unidirectional flow), though there can be situations in which once the data are modified in the target system they need to flow back to the source system (aka bidirectional flow), as in the case of planning or product development systems. In complex scenarios the communication may occur multiple times within same process until a final state is reached.

Independently of the number of systems and the type of communication involved, data need to flow between the systems as smooth as possible, assuring that the data are consistent between the various systems and available when needed. The architectures responsible for moving data between the sources are based on two simple mechanisms - push vs pull – or combinations of them.

A push mechanism makes data to be pushed from the source system into the target system(s), the source system being responsible for the operation. Typically the push can happen as soon as an event occurs in the source system, event that leads to or follows a change in the data. There can be also cases when is preferred to push the data at regular points in time (e.g. hourly, daily), especially when the changes aren’t needed immediately. This later scenario allows to still make changes to the data in the source until they are sent to other system(s). When the ability to make changes is critical this can be controlled over specific business rules.

A pull mechanism makes the data to be pulled from the source system into the target system, the target systems being responsible for the operation. This usually happens at regular points in time or on demand, however the target system has to check whether the data have been changed.

Hybrid scenarios may involve a middleware that sits between the systems, being responsible for pulling the data from the source systems and pushing them into the targets system. Another hybrid scenario is when the source system pushes the data to an intermediary repository, the target system(s) pulling the data on a need basis. The repository can reside on the source, target on in-between. A variation of it is when the source informs the target that a change happened at it’s up to the target to decide whether it needs the data or not.

The main differentiators between the various methods is the timeliness, completeness and consistency of the data. Timeliness refers to the urgency with which data need to be available in the target system(s), completeness refers to the degree to which the data are ready to be sent, while consistency refers to the degree the data from the source are consistent with the data from the target systems.

Based on their characteristics integrations seem to favor push methods while data migrations and data warehousing the pull methods, though which method suits the best depends entirely on the business needs under consideration.

Previous Post <<||>> Next Post

19 April 2019

🌡Performance Management: Mastery (Part I: The Need for Perfection vs. Excellence)

A recurring theme occurring in various contexts over the years seemed to be corroborated with the need for perfection, need going sometimes in extremis beyond common sense. The simplest theory attempting to explain at least some of these situations is that people tend to confuse excellence with perfection, from this confusion deriving false beliefs, false expectations and unhealthy behavior.

Beyond the fact that each individual has an illusory image of what perfection is about, perfection is in certain situations a limiting force rooted in the idealistic way of looking at life. Primarily, perfection denotes that we will never be good enough to reach it as we are striving to something that doesn’t exist. From this appears the external and internal criticism, criticism that instead of helping us to build something it drains out our energy to the extent that it destroys all we have built over the years with a considerable effort. Secondarily, on the long run, perfection has the tendency to steal our inner peace and balance, letting fear take over – the fear of not making mistakes, of losing the acceptance and trust of the others. It focuses on our faults, errors and failures instead of driving us to our goals. In extremis it relieves the worst in people, actors and spectators altogether.

In its proximate semantics though at diametral side through its implications, excellence focuses on our goals, on the aspiration of aiming higher without implying a limit to it. It’s a shift of attention from failure to possibilities, on what matters, on reaching our potential, on acknowledging the long way covered. It allows us building upon former successes and failures. Excellence is what we need to aim at in personal and professional life. Will Durant explaining Aristotle said that: “We are what we repeatedly do. Excellence, then, is not an act, but a habit.”

People who attempt giving 100% of their best to achieve a (positive) goal are to admire, however the proximity of 100% is only occasionally achievable, hopefully when needed the most. 100% is another illusory limit we force upon ourselves as it’s correlated to the degree of achievement, completeness or quality an artefact or result can ideally have. We rightly define quality as the degree to which something is fit for purpose. Again, a moving target that needs to be made explicit before we attempt to reach it otherwise quality envisions perfection rather than excellence and effort is wasted.

Considering the volume of effort needed to achieve a goal, Pareto’s principles (aka the 80/20 rule) seems to explain the best its underlying forces. The rule states that roughly 80% of the effects come from 20% of the causes. A corollary is that we can achieve 80% of a goal with 20% of the effort needed altogether to achieve it fully. This means that to achieve the remaining 20% toward the goal we need to put four times more of the effort already spent. This rule seems to govern the elaboration of concepts, designs and other types of documents, and I suppose it can be easily extended to other activities like writing code, cleaning data, improving performance, etc.

Given the complexity, urgency and dependencies of the tasks or goals before us probably it's beneficial sometimes to focus first on the 80% of their extent, so we can make progress, and focus on the remaining 20% if needed, when needed. This concurrent approach can allow us making progress faster in incremental steps. Also, in time, through excellence, we can bridge the gap between the two numbers as is needed less time and effort in the process.

28 April 2017

⛏️Data Management: Completeness (Definitions)

"A characteristic of information quality that measures the degree to which there is a value in a field; synonymous with fill rate. Assessed in the data quality dimension of Data Integrity Fundamentals." (Danette McGilvray, "Executing Data Quality Projects", 2008)

"Containing by a composite data all components necessary to full description of the states of a considered object or process." (Juliusz L Kulikowski, "Data Quality Assessment", 2009)

"An inherent quality characteristic that is a measure of the extent to which an attribute has values for all instances of an entity class." (David C Hay, "Data Model Patterns: A Metadata Map", 2010)

"Completeness is a dimension of data quality. As used in the DQAF, completeness implies having all the necessary or appropriate parts; being entire, finished, total. A dataset is complete to the degree that it contains required attributes and a sufficient number of records, and to the degree that attributes are populated in accord with data consumer expectations. For data to be complete, at least three conditions must be met: the dataset must be defined so that it includes all the attributes desired (width); the dataset must contain the desired amount of data (depth); and the attributes must be populated to the extent desired (density). Each of these secondary dimensions of completeness can be measured differently." (Laura Sebastian-Coleman, "Measuring Data Quality for Ongoing Improvement ", 2012)

"Completeness is defined as a measure of the presence of core source data elements that, exclusive of derived fields, must be present in order to complete a given business process." (Rajesh Jugulum, "Competing with High Quality Data", 2014)

"Complete existence of all values or attributes of a record that are necessary." (Boris Otto & Hubert Österle, "Corporate Data Quality", 2015)

"The degree to which all data has been delivered or stored and no values are missing. Examples are empty or missing records." (Piethein Strengholt, "Data Management at Scale", 2020)

"The degree to which elements that should be contained in the model are indeed there." (Panos Alexopoulos, "Semantic Modeling for Data", 2020)

"The degree of data representing all properties and instances of the real-world context." (Zhamak Dehghani, "Data Mesh: Delivering Data-Driven Value at Scale", 2021)

"Data is considered 'complete' when it fulfills expectations of comprehensiveness." (Precisely) [source]

"The degree to which all required measures are known. Values may be designated as “missing” in order not to have empty cells, or missing values may be replaced with default or interpolated values. In the case of default or interpolated values, these must be flagged as such to distinguish them from actual measurements or observations. Missing, default, or interpolated values do not imply that the dataset has been made complete." (CODATA)

12 February 2017

⛏️Data Management: Data Quality (Definitions)

"Data are of high quality if they are fit for their intended use in operations, decision-making, and planning." (Joseph M Juran, 1964)

"[…] data has quality if it satisfies the requirements of its intended use. It lacks quality to the extent that it does not satisfy the requirement. In other words, data quality depends as much on the intended use as it does on the data itself. To satisfy the intended use, the data must be accurate, timely, relevant, complete, understood, and trusted." (Jack E Olson, "Data Quality: The Accuracy Dimension", 2003)

"A set of measurable characteristics of data that define how well data represents the real-world construct to which it refers." (Alex Berson & Lawrence Dubov, "Master Data Management and Customer Data Integration for a Global Enterprise", 2007)

"The state of completeness, validity, consistency, timeliness and accuracy that makes data appropriate for a specific use." (Keith Gordon, "Principles of Data Management", 2007)

"Deals with data validation and cleansing services (to ensure relevance, validity, accuracy, and consistency of the master data), reconciliation services (aimed at helping cleanse the master data of duplicates as part of consistency), and cross-reference services (to help with matching master data across multiple systems)." (Martin Oberhofer et al,"Enterprise Master Data Management", 2008)

"A set of data properties (features, parameters, etc.) describing their ability to satisfy user’s expectations or requirements concerning data using for information acquiring in a given area of interest, learning, decision making, etc." (Juliusz L Kulikowski, "Data Quality Assessment", 2009)

"Assessment of the cleanliness, accuracy, and reliability of data." (Laura Reeves, "A Manager's Guide to Data Warehousing", 2009)

"A set of measurable characteristics of data that define how well the data represents the real-world construct to which it refers." (Alex Berson & Lawrence Dubov, "Master Data Management and Data Governance", 2010)

"This term refers to whether an organization’s data is reliable, consistent, up to date, free of duplication, and can be used efficiently across the organization." (Tony Fisher, "The Data Asset", 2009)

"The degree of data accuracy, accessibility, relevance, time-liness, and completeness." (Linda Volonino & Efraim Turban, "Information Technology for Management" 8th Ed., 2011)

"The degree of fitness for use of data in particular application. Also the degree to which data conforms to data specifications as measured in data quality dimensions. Sometimes used interchangeably with information quality." (John R Talburt, "Entity Resolution and Information Quality", 2011)

"The degree to which data is accurate, complete, timely, consistent with all requirements and business rules, and relevant for a given use." (DAMA International, "The DAMA Dictionary of Data Management", 2011)

"Contextual data quality considers the extent to which data are applicable (pertinent) to the task of the data user, not to the context of representation itself. Contextually appropriate data must be relevant to the consumer, in terms of timeliness and completeness. Dimensions include: value-added, relevancy, timeliness, completeness, and appropriate amount of data (from the Wang & Strong framework.)" (Laura Sebastian-Coleman, "Measuring Data Quality for Ongoing Improvement ", 2012)

"Intrinsic data quality denotes that data have quality in their own right; it is understood largely as the extent to which data values are in conformance with the actual or true values. Intrinsically good data is accurate, correct, and objective, and comes from a reputable source. Dimensions include: accuracy objectivity, believability, and reputation (from the Wang & Strong framework)." (Laura Sebastian-Coleman, "Measuring Data Quality for Ongoing Improvement ", 2012)

"Representational data quality indicates that the system must present data in such a way that it is easy to understand (represented concisely and consistently) so that the consumer is able to interpret the data; understood as the extent to which data is presented in an intelligible and clear manner. Dimensions include: interpretability, ease of understanding, representational consistency, and concise representation (rom the Wang & Strong framework)." (Laura Sebastian-Coleman, "Measuring Data Quality for Ongoing Improvement ", 2012)

"The level of quality of data represents the degree to which data meets the expectations of data consumers, based on their intended use of the data." (Laura Sebastian-Coleman, "Measuring Data Quality for Ongoing Improvement", 2013)

"The relative value of data, which is based on the accuracy of the knowledge that can be generated using that data. High-quality data is consistent, accurate, and unambiguous, and it can be processed efficiently." (Jim Davis & Aiman Zeid, "Business Transformation: A Roadmap for Maximizing Organizational Insights", 2014)

"The properties of data embodied by the “Five C’s”: clean, consistent, conformed, current, and comprehensive." (Daniel Linstedt & W H Inmon, "Data Architecture: A Primer for the Data Scientist", 2014)

"The degree to which data in an IT system is complete, up-to-date, consistent, and (syntactically and semantically) correct." (Tilo Linz et al, "Software Testing Foundations, 4th Ed", 2014)

"A measure for the suitability of data for certain requirements in the business processes, where it is used. Data quality is a multi-dimensional, context-dependent concept that cannot be described and measured by a single characteristic, but rather various data quality dimensions. The desired level of data quality is thereby oriented on the requirements in the business processes and functions, which use this data [...]" (Boris Otto & Hubert Österle, "Corporate Data Quality", 2015)

"[...] characteristics of data such as consistency, accuracy, reliability, completeness, timeliness, reasonableness, and validity. Data-quality software ensures that data elements are represented in a consistent way across different data stores or systems, making the data more trustworthy across the enterprise." (Judith S Hurwitz, "Cognitive Computing and Big Data Analytics", 2015)

"Refers to the accuracy, completeness, timeliness, integrity, and acceptance of data as determined by its users." (Gregory Lampshire, "The Data and Analytics Playbook", 2016)

"A measure of the useableness of data. An ideal dataset is accurate, complete, timely in publication, consistent in its naming of items and its handling of e.g. missing data, and directly machine-readable (see data cleaning), conforms to standards of nomenclature in the field, and is published with sufficient metadata that users can easily understand, for example, who it is published by and the meaning of the variables in the dataset." (Open Data Handbook)

"Refers to the level of 'quality' in data. If a particular data store is seen as holding highly relevant data for a project, that data is seen as quality to the users." (Solutions Review)

"the processes and techniques involved in ensuring the reliability and application efficiency of data. Data is of high quality if it reliably reflects underlying processes and fits the intended uses in operations, decision making and planning." (KDnuggets)

"The narrow definition of data quality is that it's about data that is missing or incorrect. A broader definition is that data quality is achieved when a business uses data that is comprehensive, consistent, relevant and timely." (Information Management)

"Data Quality refers to the accuracy of datasets, and the ability to analyse and create actionable insights for other users." (experian) [source]

"Data quality refers to the current condition of data and whether it is suitable for a specific business purpose." (Xplenty) [source]

09 November 2011

📉Graphical Representation: Completeness (Just the Quotes)

"The title for any chart presenting data in the graphic form should be so clear and so complete that the chart and its title could be removed from the context and yet give all the information necessary for a complete interpretation of the data. Charts which present new or especially interesting facts are very frequently copied by many magazines. A chart with its title should be considered a unit, so that anyone wishing to make an abstract of the article in which the chart appears could safely transfer the chart and its title for use elsewhere." (Willard C Brinton, "Graphic Methods for Presenting Facts", 1919)

"A man's judgment cannot be better than the information on which he has based it. Give him no news, or present him only with distorted and incomplete data, with ignorant, sloppy, or biased reporting, with propaganda and deliberate falsehoods, and you destroy his whole reasoning process and make him somewhat less than a man." (Arthur H Sulzberger, [speech] 1948)

"Graphical methodology provides powerful diagnostic tools for conveying properties of the fitted regression, for assessing the adequacy of the fit, and for suggesting improvements. There is seldom any prior guarantee that a hypothesized regression model will provide a good description of the mechanism that generated the data. Standard regression models carry with them many specific assumptions about the relationship between the response and explanatory variables and about the variation in the response that is not accounted for by the explanatory variables. In many applications of regression there is a substantial amount of prior knowledge that makes the assumptions plausible; in many other applications the assumptions are made as a starting point simply to get the analysis off the ground. But whatever the amount of prior knowledge, fitting regression equations is not complete until the assumptions have been examined." (John M Chambers et al, "Graphical Methods for Data Analysis", 1983)

"Labels should be complete but succinct. Long and complicated labels will defeat the viewer and therefore the purpose of the graph. Treat a label as a cue to jog the memory or to complete comprehension. Shorten long labels; avoid abbreviations unless they are universally understood; avoid repetition on the same graph. A title, for instance, should not repeat what is already in the axis labels. Be consistent in terminology." (Mary H Briscoe, "Preparing Scientific Illustrations: A guide to better posters, presentations, and publications" 2nd ed., 1995)

"[…] a graphic with loose, incomplete information that is too verbose, vague or passive can actually impede your audience’s ability to make sense of the information at hand. If the graphic confuses or frustrates the audience, you’re likely to do more harm than good, leave them with more questions than answers and essentially turn them away from your publication." (Jennifer George-Palilonis," A Practical Guide to Graphics Reporting: Information Graphics for Print, Web & Broadcast", 2006)

"Perception requires imagination because the data people encounter in their lives are never complete and always equivocal. [...] We also use our imagination and take shortcuts to fill gaps in patterns of nonvisual data. As with visual input, we draw conclusions and make judgments based on uncertain and incomplete information, and we conclude, when we are done analyzing the patterns, that out picture is clear and accurate. But is it?" (Leonard Mlodinow, "The Drunkard’s Walk: How Randomness Rules Our Lives", 2008)

"Information graphics are an essential component of technical communication. Very few technical documents or presentations can be considered complete without graphical elements to present some essential data. Because engineers are visually oriented, graphic aids allow their thoughts and ideas to be better understood by other engineers. Information graphics are essential in presenting data because they simplify the content, offer a visually pleasing alternative to gray text in a proposal or an article, and thereby invite interest." (Dennis K Lieu & Sheryl Sorby, "Visualization, Modeling, and Graphics for Engineering Design", 2009)

"Good visualization is a winding process that requires statistics and design knowledge. Without the former, the visualization becomes an exercise only in illustration and aesthetics, and without the latter, one of only analyses. On their own, these are fine skills, but they make for incomplete data graphics. Having skills in both provides you with the luxury - which is growing into a necessity - to jump back and forth between data exploration and storytelling." (Nathan Yau, "Data Points: Visualization That Means Something", 2013)

"Just because data is visualized doesn’t necessarily mean that it is accurate, complete, or indicative of the right course of action. Exhibiting a healthy skepticism is almost always a good thing." (Phil Simon, "The Visual Organization: Data Visualization, Big Data, and the Quest for Better Decisions", 2014)

"To become a great data analyst, you must be able to identify and deal with incomplete data and work to identify the data quality and accuracy issues in a data set." (Andy Kriebel & Eva Murray, "#MakeoverMonday: Improving How We Visualize and Analyze Data, One Chart at a Time", 2018)

"Using a question as a title is a great way to guide the audience. The question helps you ensure that your charts respond directly to the question and when they do not, you can remove them. And that is the main point: You need to answer the question. If the data is not conclusive, say so. Give an explanation that relates back to your title and close the loop so that your audience is informed and gets the complete picture included in your analysis." (Andy Kriebel & Eva Murray, "#MakeoverMonday: Improving How We Visualize and Analyze Data, One Chart at a Time", 2018)

"Data storytelling is a method of communicating information that is custom-fit for a specific audience and offers a compelling narrative to prove a point, highlight a trend, make a sale, or all of the above. [...] Data storytelling combines three critical components, storytelling, data science, and visualizations, to create not just a colorful chart or graph, but a work of art that carries forth a narrative complete with a beginning, middle, and end." (Kate Strachnyi, "ColorWise: A Data Storyteller’s Guide to the Intentional Use of Color", 2023)

03 July 2010

🗄️Data Management: Data Profiling I (An Introduction)

Data Management Series

Some Generalities

Jack Olson defines data profiling as "the use of analytical techniques to discover the true content, structure, and quality of data" [1]. Understanding the structure of data is quintessential for database developers in order to provide accurate technical solutions (reports, database objects for various purposes), though in order to answer business questions is important to understand also the content, the projection of business processes in database environment and, not less important, is the Data Quality reflected in the value the data has for an enterprise, being highly dependent on the understanding of data content and structure.

Typically, DBMS store also metadata about the data chunks and data structures they contain – tables, the columns they contain and their data types, the relations between tables, various types of indexes and other constraints, database objects like views, stored procedures, or user-defined functions. When taken together they provide together with the logical and semantical data models a good overview over the data structure, though the complexity of such systems, sometimes made of hundreds of tables, if often hard to comprehend, especially when one or more layers of abstraction reside in between the database structure and consumer.

Conversely, the structure is one thing, while the data contained by the structure is another thing, attributes being more or less correctly misused for various purposes, with different degrees of completeness, actuality, accuracy, etc. I wrote about Data Quality and the most important dimensions in several posts, highlighting the importance of having data with an acceptable quality. Thus, in Data Quality initiatives, especially the ones done during data migration/conversion between various systems, data profiling occupies an important place.

A natural question (actually two): what analytical metrics could be used in order to understand the data and how complex do they have to be? Nowadays data mining, the process of extracting patterns from data, gains day by day in importance, more and more complex techniques being used, visualization techniques together with complex algorithms allowing us to derive rules from data. Do we have to go so far? Statistics offers us for example a simple set of techniques and concepts which could be used to understand the data, though Statistics deals mainly with quantities.

Therefore, if we could translate the volume of data into various metrics, we could reuse in theory the metrics as input for statistical reasoning. And again, do we have to go that far? As we will see further on, data profiling could be done using several simple metrics, like the number of records, the number of missing values (see completeness), the number of distinct values and the distribution formed by them, the minimum and maximum values, or the minimum and maximum length of values. There are also more complex techniques that attempt to identify the correlation existing between the various attributes, falling already in the direction of data mining techniques.

Volume of Data

The number of records, as the name intuitively denotes, resumes in identifying the number of records each table/dataset from a set of table/datasets contains. It’s only a dimension of the volume of data, reflected also in the number of tables/datasets in scope, and the number of attributes each of them contains. So, the volume of data could be considered as a measure of number of records, number of tables and number of attributes. If we want to be more accurate in measurement, given the fact that each attribute could have a different maximum length, the maximum length could be considered as a fourth dimension of the volume of data.

Now each database vendor decides on whether it reserves the respective maximum length or it considers only the actual used length (e.g. an attribute of type string could have a maximum length of 250 characters, but it holds values with a maximum length of 30 characters). Things are even more complicated because the structural length is different than the physical lengths (the number of bits allocated for storage). The physical storage occupied by a database could be in theory easily identified, though if only a subset is considered then it might be needed to consider the actual size of the volume of data in scope. The subset resumes not only to tables, but also to attributes as maybe not all the attributes from a table are used, records as there could be also inactive/disabled or other type of irrelevant records, and even data content, in case data could be stripped of not useful content.

Considering a given table/dataset the above-mentioned metrics could be obtained fairly easy by running queries against the table/dataset itself or against the metadata. As I highlighted above, database vendors that store metadata about their data structure, of importance in this context being the tables, the attributes, attributes’ data type and maximum length. There are also vendors that store the number of records as statistics for various purposes, though the statistics are not always up to date, now it depends also on profiling requirements. On the other side getting the number of records is a simple task, even in an automated manner. Considering for exemplification the AdventureWorks database coming with SQL Server, the number of records from a table can be easily identified with the help of the count aggregate function available in all important RDBMS:

-- the number of records (returns 54) 
SELECT count(1) NumberRecords 
FROM AdventureWorks.Production.Product

The number of attributes the table contains could be obtained by doing a simple count against the number of attributes returned by a SELECT * statement or by querying directly the metadata:

-- the number of attributes (returns 25) 
SELECT COUNT(1) 
FROM sys.columns C 
    JOIN sys.tables T 
      ON C.object_id = T.object_id 
WHERE T.name = 'Product'

The same tables (catalog views starting with SQL Server 2005) could be used in order to retrieve the name of the attributes, their data type, length related attributes and whether the attribute accepts NULL values. Given the fact that the respective views will be used in multiple queries, it makes sense to encapsulate them in a view:

-- table columns view 
CREATE VIEW dbo.vTableColumns 
AS  
SELECT s.name [schema_name] 
, T.name Table_Name 
, C.name Column_Name 
, C.column_id  
, C.max_length 
, C.precision 
, C.scale 
, C.is_nullable 
, C.is_identity 
, UT.name user_type 
, ST.name system_type 
FROM sys.columns C 
    JOIN sys.types as UT 
       ON C.user_type_id = UT.user_type_id 
    JOIN sys.types ST 
      ON C.system_type_id = ST.user_type_id 
    JOIN sys.tables T 
       ON C.object_id = T.object_id 
         JOIN sys.schemas S 
            ON T.schema_id = S.schema_id

Thus, the needed query becomes:

-- metadata details 
SELECT * 
FROM dbo.vTableColumns 
WHERE [Schema_Name] = 'Production' 
   AND Table_Name = 'Product' 
ORDER BY column_id

Output:

schema_name	Table_Name	Column_Name	column_id	max_length	precision	scale	is_nullable	is_identity	user_type	system_type
Production	Product	ProductID	1	4	10	0	0	1	int	int
Production	Product	Name	2	100	0	0	0	0	Name	nvarchar
Production	Product	ProductNumber	3	50	0	0	0	0	nvarchar	nvarchar
Production	Product	MakeFlag	4	1	1	0	0	0	Flag	bit
Production	Product	FinishedGoodsFlag	5	1	1	0	0	0	Flag	bit
Production	Product	Color	6	30	0	0	1	0	nvarchar	nvarchar
Production	Product	SafetyStockLevel	7	2	5	0	0	0	smallint	smallint
Production	Product	ReorderPoint	8	2	5	0	0	0	smallint	smallint
Production	Product	StandardCost	9	8	19	4	0	0	money	money
Production	Product	ListPrice	10	8	19	4	0	0	money	money

Note:
For more information on the meaning of user_type and system_type and the other related metadata stored, or on the length of each data types in SQL Server please check the local or online SQL Server documentation (see: Mapping System Tables to System Views, respectively Data Types).

Analyzing the Data Content

Analyzing the content resumes in studying the completeness for each attribute, what values are used and their distribution, the minimal/maximal/average values and minimal/maximal/average length of values.

Completeness could be studied as a metric of the number of records that have not-null values, or the complementary metric, the number of records that have null values, or as a percentage of the two from the total number of records. For example, in order to study the completeness of “Color” attribute could be used the following query:

-- completeness 
SELECT count(1) NumberRecords 
, count(Color) NumberNotNullValues 
, count(1)-count(Color) NumberNullValues 
, CASE  
    WHEN count(1)&gt;0 THEN Cast(100*Cast(count(Color) As decimal (18,2))/count(1) AS decimal (5,2)) 
     ELSE 0.00  
END PercentageNotNullValues 
, CASE  
    WHEN count(1)&gt;0 THEN Cast(100*Cast((count(1)-count(Color)) As decimal (18,2))/count(1) AS decimal (5,2)) 
     ELSE 0.00  
END PercentageNullValues  
FROM Production.Product

Output:

NumberRecords	NumberNotNullValues	NumberNullValues	PercentageNotNullValues	PercentageNullValues
504	256	248	50.79	49.21

Studying completeness of attributes is important especially for the cases in which the attribute becomes mandatory, the number of null records equating with the number of changes that must be made to make the respective values available. In contrast, the not-null values could be a base for further review.

The next step is determining the values used in attributes, in other words the number of distinct values appearing in an attribute. Eventually, especially for list of values, it could be considered as a percentage from the number of values available. For example it could be studied how many accounts are used from the number of accounts defined. On the other side there are also attributes for which the list of possible values are not explicitly delimited, as in the case of colors available for Products, though that doesn’t necessarily happen in all business cases. To determine the number of distinct colors used for queries could be used the following query:

-- number distinct values 
 SELECT count(DISTINCT Color) NumberDistinctColors 
, count(DISTINCT IsNull(Color, 'n/a')) NumberDistinctColorsIncludingNulls 
FROM Production.Product

Output:

NumberDistinctColors	NumberDistinctColorsIncludingNulls
9	10

Note:
Please note that the NULL value is ignored by count aggregate function. I’m mentioning that because there are cases in which the NULL value has a meaning of its own, in such cases the NULL could be placed with a meaningful (e.g. ‘’n/a in this case).

Attributes’ usage is quite an important subject when considering metadata, being for example important to study which attributes are referenced in various tables. For example, is of importance studying which Products appear in Purchases, Sales Orders, Inventory, etc., especially the respective Products have to receive special attention in what concerns Data Quality.

The number of distinct values is just a measure in studying attributes’ usage, a more complex analysis being performed based on the frequency distribution of values based on the number of references in a given table/dataset. This can be easily done using a count together with a grouping based on the studied attribute. Here’s the example for color:

-- distribution of values

SELECT Color 
, count(1) NumberRecords 
FROM Production.Product 
GROUP BY Color 
ORDER BY NumberRecords DESC

Output:

Color	NumberRecords
NULL	248
Black	93
Silver	43
Red	38
Yellow	36
Blue	26
Multi	8
Silver/Black	7
White	4
Grey	1

As can be seen from the above output (called a univariate frequency table), the result shows not only the number of records with NULL values, but also the number of distinct values. Attributes’ usage could be used also as a criterion for prioritizing Data Cleansing by focusing first on the attributes with the highest frequency. Eventually the frequencies could be expressed also as percentages in ratio with the number of records, the percentages being often considered as more representative values in visualization of data.

Often, it’s useful to study the frequency of values based on explicit (e.g. Product Category/Subcategory) or implicit (ad-hoc defined) categories instead of studying values’ distribution.

A third technique used in analyzing attribute’s usage is based on identifying the range of values, here simple statistical tools like minimum, maximum, averages, or standard variation of values could be of importance in understanding the data. The minimum and maximum could be applied for all data types, but they are meaningful mainly for the numeric and date-time data types, and both could take advantage of the whole specific range of statistical tools. Studying the maximum and minimum value within for a string data attribute makes sense only when there is a strong lexicographical motivation, therefore as example I will consider the List Price from same Product table.

-- aggregate values 
 SELECT MIN(ListPrice) MinListPrice 
, MAX(ListPrice) MaxListPrice 
, AVG(ListPrice) AverageListPrice 
, STDEV(ListPrice) SandardDevListPrice  
FROM Production.Product

Output:

MinListPrice	MaxListPrice	AverageListPrice	SandardDevListPrice
0.00	3578.27	438.6662	773.602842630487

The 0 values could be excluded from analysis because at least in this case they represent the lack of a valid List Price. At least in this case the average and standard deviation doesn’t really make sense, while the minimum and maximum reveal the range of values used. As in the case of frequencies, it’s useful to study the frequency of values based on explicit or implicit categories. Here’s an example based on List Price aggregation at Subcategory level:

-- aggregate values per subcategory 
SELECT PS.Name 
, MIN(P.ListPrice) MinListPrice 
, MAX(P.ListPrice) MaxListPrice 
, AVG(P.ListPrice) AverageListPrice 
, STDEV(P.ListPrice) SandardDevListPrice  
FROM Production.Product P 
    LEFT JOIN Production.ProductSubcategory PS 
      ON P.ProductSubcategoryID = PS.ProductSubcategoryID 
WHERE IsNull(P.ListPrice, 0)!=0 
GROUP BY PS.Name 
ORDER BY PS.Name

Output:

Name	MinListPrice	MaxListPrice	AverageListPrice	SandardDevListPrice
NULL	133.34	196.92	159.1333	28.9631334630769
Bib-Shorts	89.99	89.99	89.99	1.34869915234861E-06
Bike Racks	120.00	120.00	120.00
Bike Stands	159.00	159.00	159.00
Bottles and Cages	4.99	9.99	7.99	2.64575131106459
Bottom Brackets	53.99	121.49	92.24	34.6383097162665
Brakes	106.50	106.50	106.50	0
Caps	8.99	8.99	8.99
Chains	20.24	20.24	20.24
Cleaners	7.95	7.95	7.95

I tend to believe that such analysis at category level is less important in Data Profiling, though there could be also exceptions, for example the case of ABC analysis.

Analyzing the Data Structure

The understanding of the structure of data is facilitated by the metadata available in the database itself, the logical/semantical data models and other types of documentation (e.g. functional specifications, data dictionaries, etc.), but as specified above even if the data are filled in a given predefined structure, when the database is not adequately normalized or that the data attributes are misused, results structures in structures, sometimes people use their own encodings varying from person to person, while other times such encodings are standardized, and even implemented in tools. When we talk about structure, the simplest metrics that could be created are based on the minimal and maximal length of attributes, while the average length could be used for example as a base for calculating the volume of data. Here’s the example script based on Color attribute:

-- aggregate length values 
SELECT MIN(Len(Color)) MinLength 
, MAX(Len(Color)) MaxLength 
, AVG(Len(Color)) AverageLength            
FROM Production.Product

Output:

MinLength	MaxLength	AverageLength
3	12	5

The discovery of special structures (patterns) used inside of an attribute are not so easy to determine using simple scripts, at least not unless the structure is known in advance. There are tools and I’ve also seen some SQL scripts available that allow identifying the patterns in data, though they have relative precision and I believe human intervention is requested (e.g. semi-automatic methods).

Analyzing Data Relations

Talking about patterns, the discovery of relations between attributes could also be considered as special type of pattern discovery, the above last statement applying for them too, the data mining field providing many tools for this type of discovery. In theory the relation between attributes could be discovered by grouping two or more attributes and analyzing the output and interdependence between them. Whether that can be achieved by a database developer depends also on his skills and knowledge in statistics and data mining field, often being required an experienced Data Analyst for this type of tasks.

Closing Remarks

There are more to say about Data Profiling topic, there are many more (complex) techniques that could be used to profile the data, especially in respect to Data Quality, which mainly focuses on 6 important dimensions of data: accuracy, duplicates (or uniqueness), completeness, consistency, conformity, and referential integrity. If in previous posts I addressed the respective topics from a theoretical point of view, I will try to attempt to develop the subjects from Data Profiling perspective in next posts, showing various techniques that could be used by developers for this purpose.

Previous Post <<||>> Next Post

Written: Jul-2010, Last Reviewed: Mar-2024

References:
[1] Jack E Olson(2003) Data Quality: The Accuracy Dimension

SQL Troubles

Pages

27 January 2025

🗄️🗒️Data Management: Data Quality Dimensions [Notes]

05 January 2021

🧮ERP: Implementations (Part III: It’s all about Scope I - Functional Requirements)

11 May 2019

#️⃣Software Engineering: Programming (Part XI: The Dark Side)

04 May 2019

🧊Data Warehousing: Architecture (Part I: Push vs. Pull)

19 April 2019

🌡Performance Management: Mastery (Part I: The Need for Perfection vs. Excellence)

28 April 2017

⛏️Data Management: Completeness (Definitions)

12 February 2017

⛏️Data Management: Data Quality (Definitions)

09 November 2011

📉Graphical Representation: Completeness (Just the Quotes)

03 July 2010

🗄️Data Management: Data Profiling I (An Introduction)

About Me