
21 May 2020

Data Migrations (DM): In-house Built Solutions III (The Data Preparation Layer)

Data Migration
Data Migrations Series

Once the source data (including the data needed for enrichment) has been made available in the migration database, one can model the source entities by encapsulating the logic in views or table-valued functions (TVFs). This improves the code's maintainability (by providing better visibility over the transformations), its reuse (the same objects can serve the various validations needed), performance (by taking advantage of the RDBMS's native functionality) and the flexibility in changing the code. The objects created this way can then be used in a second set of similar objects that contain the mapping logic between the source and target data models. One thus attempts to prepare the data as needed by the target system. 
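To illustrate the layering, here is a minimal T-SQL sketch; the schema, object and column names (Source.Customers, dbo.vSourceCustomers, dbo.vTargetCustomers, etc.) are invented for the example and don't come from a real system:

```sql
-- Source entity encapsulated in a view (cleansing/enrichment logic kept in one place)
CREATE VIEW dbo.vSourceCustomers
AS
SELECT CustomerId
     , LTRIM(RTRIM(CustomerName)) AS CustomerName
     , NULLIF(LTRIM(RTRIM(Country)), '') AS Country
FROM Source.Customers;
GO

-- Mapping object built on top of it, shaping the data as expected by the target data model
CREATE VIEW dbo.vTargetCustomers
AS
SELECT SC.CustomerId                   AS LegacyId
     , SC.CustomerName                 AS AccountName
     , COALESCE(SC.Country, 'Unknown') AS CountryCode
FROM dbo.vSourceCustomers SC;
```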

As each target entity is modelled, it's useful to dump the resulting data into tables, which are then used as the source for the subsequent logic instead of using the views or TVFs directly. This allows keeping a copy of the data, performing a range of validations and, most importantly, can provide better performance, as indexes can be built on the tables. In addition, the data in tables can be further manipulated as needed, e.g. by including information that becomes available only later (e.g. attributes from the target system), or for testing or correcting the data without affecting the built logic. For the same reasons such tables can be used in intermediate steps of the migration, including when modelling the source data entities. However, one should avoid their excessive use, as this can complicate the architecture unnecessarily. 
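A minimal sketch of this materialization step, reusing the hypothetical objects from the example above:

```sql
-- Materialize the prepared entity so it can be validated, indexed and further manipulated
SELECT LegacyId
     , AccountName
     , CountryCode
INTO dbo.TargetCustomers
FROM dbo.vTargetCustomers;

-- Indexing the dumped data usually pays off in the downstream logic
CREATE UNIQUE INDEX IX_TargetCustomers_LegacyId
ON dbo.TargetCustomers (LegacyId);
```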

The last step in the preparation layer is to prepare the queries that select only the attributes needed by the target system and apply any additional formatting. It's in general recommended to detach the formatting from the other transformations, as this approach provides better flexibility in addressing the migration requirements. The data can afterwards be exported manually or via an automated job, the latter approach being recommended especially when the data need to be partitioned. 

At this stage, after validating the data, they can be imported into the target system via the mechanisms available. In theory the boundary of the migration logic lies here, however this layer can be extended for data validation purposes. It would be helpful, for example, to assure that the data imported into the target entirely reflect the prepared data. It can happen that during import data are truncated, incorrectly imported (wrong attribute, values changed or incorrectly mapped) or that whole records are not imported at all, with an impact on data consistency. 

After importing the data into the target system, one can import the migrated data back into the migration database via ETL packages and build queries that match the data at attribute level. This step may seem redundant, though it's a way to assure that the migration occurred according to expectations and thus to minimize later surprises. Even if not done for all entities, this type of import might be the easiest solution for bringing data back into the logic (e.g. when identity values need to be mapped into the logic after migrating an entity).
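As a sketch of such an attribute-level match, assuming the prepared data sits in the hypothetical dbo.TargetCustomers table and the data re-imported from the target in an equally hypothetical dbo.ImportedCustomers table:

```sql
-- Compare prepared vs. re-imported data, attribute by attribute
SELECT COALESCE(PRE.LegacyId, TGT.LegacyId) AS LegacyId
     , CASE WHEN TGT.LegacyId IS NULL THEN 'record not imported'
            WHEN PRE.LegacyId IS NULL THEN 'record not prepared'
       END AS RecordCheck
     , CASE WHEN PRE.AccountName <> TGT.AccountName THEN 'AccountName differs' END AS NameCheck
     , CASE WHEN PRE.CountryCode <> TGT.CountryCode THEN 'CountryCode differs' END AS CountryCheck
FROM dbo.TargetCustomers PRE
     FULL OUTER JOIN dbo.ImportedCustomers TGT
       ON PRE.LegacyId = TGT.LegacyId
WHERE PRE.LegacyId IS NULL
   OR TGT.LegacyId IS NULL
   OR PRE.AccountName <> TGT.AccountName
   OR PRE.CountryCode <> TGT.CountryCode;
```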

It can also be helpful to import the tables for dropdown values (typical parameters), to assure that the values used for parameterizing the system were built into the migration logic as expected. It may sound surprising, but not all systems perform such checks on import, or the checks were disabled for other reasons. 

In data migrations it is recommended to assure the data's internal consistency by design. Even if the various validations for uniqueness, completeness, consistency, conformity, timeliness, referential integrity or even accuracy might seem redundant or involve extra work, in the long run they pay off, as they allow trapping the issues early in the process.
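Two such validations sketched against the hypothetical dbo.TargetCustomers table used above - a uniqueness check and a completeness check on attributes assumed to be mandatory:

```sql
-- Uniqueness: legacy identifiers occurring more than once in the prepared data
SELECT LegacyId
     , COUNT(*) AS Occurrences
FROM dbo.TargetCustomers
GROUP BY LegacyId
HAVING COUNT(*) > 1;

-- Completeness: records with missing values in (assumed) mandatory attributes
SELECT LegacyId
FROM dbo.TargetCustomers
WHERE AccountName IS NULL
   OR CountryCode IS NULL;
```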

28 April 2017

Data Management: Completeness (Definitions)

"A characteristic of information quality that measures the degree to which there is a value in a field; synonymous with fill rate. Assessed in the data quality dimension of Data Integrity Fundamentals." (Danette McGilvray, "Executing Data Quality Projects", 2008)

"Containing by a composite data all components necessary to full description of the states of a considered object or process." (Juliusz L Kulikowski, "Data Quality Assessment", 2009)

"An inherent quality characteristic that is a measure of the extent to which an attribute has values for all instances of an entity class." (David C Hay, "Data Model Patterns: A Metadata Map", 2010)

"Completeness is a dimension of data quality. As used in the DQAF, completeness implies having all the necessary or appropriate parts; being entire, finished, total. A dataset is complete to the degree that it contains required attributes and a sufficient number of records, and to the degree that attributes are populated in accord with data consumer expectations. For data to be complete, at least three conditions must be met: the dataset must be defined so that it includes all the attributes desired (width); the dataset must contain the desired amount of data (depth); and the attributes must be populated to the extent desired (density). Each of these secondary dimensions of completeness can be measured differently." (Laura Sebastian-Coleman, "Measuring Data Quality for Ongoing Improvement ", 2012)

"Completeness is defined as a measure of the presence of core source data elements that, exclusive of derived fields, must be present in order to complete a given business process." (Rajesh Jugulum, "Competing with High Quality Data", 2014)

"Complete existence of all values or attributes of a record that are necessary." (Boris Otto & Hubert Österle, "Corporate Data Quality", 2015)

"The degree to which all data has been delivered or stored and no values are missing. Examples are empty or missing records." (Piethein Strengholt, "Data Management at Scale", 2020)

"The degree to which elements that should be contained in the model are indeed there." (Panos Alexopoulos, "Semantic Modeling for Data", 2020)

"The degree of data representing all properties and instances of the real-world context." (Zhamak Dehghani, "Data Mesh: Delivering Data-Driven Value at Scale", 2021)

"Data is considered 'complete' when it fulfills expectations of comprehensiveness." (Precisely) [source]

"The degree to which all required measures are known. Values may be designated as “missing” in order not to have empty cells, or missing values may be replaced with default or interpolated values. In the case of default or interpolated values, these must be flagged as such to distinguish them from actual measurements or observations. Missing, default, or interpolated values do not imply that the dataset has been made complete." (CODATA)

27 April 2017

Data Management: Availability (Definitions)

"Corresponds to the information that should be available when necessary and in the appropriate format." (José M Gaivéo, "Security of ICTs Supporting Healthcare Activities", 2013)

"A property by which the data is available all the time during the business hours. In cloud computing domain, the data availability by the cloud service provider holds a crucial importance." (Sumit Jaiswal et al, "Security Challenges in Cloud Computing", 2015) 

"Availability: the ability of the data user to access the data at the desired point in time." (Boris Otto & Hubert Österle, "Corporate Data Quality", 2015)

"It is one of the main aspects of the information security. It means data should be available to its legitimate user all the time whenever it is requested by them. To guarantee availability data is replicated at various nodes in the network. Data must be reliably available." (Omkar Badve et al, "Reviewing the Security Features in Contemporary Security Policies and Models for Multiple Platforms", 2016)

"Timely, reliable access to data and information services for authorized users." (Maurice Dawson et al, "Battlefield Cyberspace: Exploitation of Hyperconnectivity and Internet of Things", 2017)

"A set of principles and metrics that assures the reliability and constant access to data for the authorized individuals or groups." (Gordana Gardašević et al, "Cybersecurity of Industrial Internet of Things", 2020)

"Ensuring the conditions necessary for easy retrieval and use of information and system resources, whenever necessary, with strict conditions of confidentiality and integrity." (Alina Stanciu et al, "Cyberaccounting for the Leaders of the Future", 2020)

"The state when data are in the place needed by the user, at the time the user needs them, and in the form needed by the user." (CODATA)

"The state that exists when data can be accessed or a requested service provided within an acceptable period of time." (NISTIR 4734)

"Timely, reliable access to information by authorized entities." (NIST SP 800-57 Part 1)

21 February 2017

Data Management: Validity (Definitions)

"A characteristic of the data collected that indicates they are sound and accurate." (Teri Lund & Susan Barksdale, "10 Steps to Successful Strategic Planning", 2006)

"Implies that the test measures what it is supposed to." (Robert McCrie, "Security Operations Management" 2nd Ed., 2006)

"The determination that values in the field are or are not within a set of allowed or valid values. Measured as part of the Data Integrity Fundamentals data quality dimension." (Danette McGilvray, "Executing Data Quality Projects", 2008)

"A data quality dimension that reflects the confirmation of data items to their corresponding value domains, and the extent to which non-confirmation of certain items affects fitness to use. For example, a data item is invalid if it is defined to be integer but contains a non-integer value, linked to a finite set of possible values but contains a value not included in this set, or contains a NULL value where a NULL is not allowed." (G Shankaranarayanan & Adir Even, "Measuring Data Quality in Context", 2009)

"An aspect of data quality consisting in its steadiness despite the natural process of data obsolescence increasing in time." (Juliusz L Kulikowski, "Data Quality Assessment", 2009)

"An inherent quality characteristic that is a measure of the degree of conformance of data to its domain values and business rules." (David C Hay, "Data Model Patterns: A Metadata Map", 2010)

"Validity is a dimension of data quality, defined as the degree to which data conforms to stated rules. As used in the DQAF, validity is differentiated from both accuracy and correctness. Validity is the degree to which data conform to a set of business rules, sometimes expressed as a standard or represented within a defined data domain." (Laura Sebastian-Coleman, "Measuring Data Quality for Ongoing Improvement ", 2012)

"Validity is defined as the extent to which data corresponds to reference tables, lists of values from golden sources documented in metadata, value ranges, etc." (Rajesh Jugulum, "Competing with High Quality Data", 2014)

"the state of consistency between a measurement and the concept that a researcher intended to measure." (Meredith Zozus, "The Data Book: Collection and Management of Research Data", 2017)

[semantic validity:] "The compliance of attribute data to rules regarding consistency and truthfulness of association." (O Sami Saydjari, "Engineering Trustworthy Systems: Get Cybersecurity Design Right the First Time", 2018)

[syntactic validity:] "The compliance of attribute data to format and grammar rules." (O Sami Saydjari, "Engineering Trustworthy Systems: Get Cybersecurity Design Right the First Time", 2018)

"Validity is a data quality dimension that refers to information that doesn’t conform to a specific format or doesn’t follow business rules." (Precisely) [source]

20 February 2017

Data Management: Timeliness (Definitions)

"Coming early or at the right, appropriate or adapted to the times or the occasion." (Martin J Eppler, "Managing Information Quality" 2nd Ed., 2006)

[timeliness & availability] "A data quality dimension that measures the degree to which data are current and available for use as specified, and in the time frame in which they are expected." (Danette McGilvray, "Executing Data Quality Projects", 2008)

"the ability of a task to repeatedly meet its timeliness requirements." (Bruce P Douglass, "Real-Time Agility: The Harmony/ESW Method for Real-Time and Embedded Systems Development", 2009)

"A pragmatic quality characteristic that is a measure of the relative availability of data to support a given process within the timetable required to perform the process." (David C Hay, "Data Model Patterns: A Metadata Map", 2010)

"1.The degree to which available data meets the currency requirements of information consumers. 2.The length of time between data availability and the event or phenomenon they describe." (DAMA International, "The DAMA Dictionary of Data Management", 2011)

"Timeliness is a dimension of data quality related to the availability and currency of data. As used in the DQAF, timeliness is associated with data delivery, availability, and processing. Timeliness is the degree to which data conforms to a schedule for being updated and made available. For data to be timely, it must be delivered according to schedule." (Laura Sebastian-Coleman, "Measuring Data Quality for Ongoing Improvement ", 2012)

"The degree to which the model contains elements that reflect the current version of the world Transitive Relation When a relation R is transitive then if R links entity A to entity B, and entity B to entity C, then it also links A to C." (Panos Alexopoulos, "Semantic Modeling for Data", 2020)

"The degree to which the actual time and processing time are separated. The timelier the data is, the smaller the gap is between actual time and record time."  (Zhamak Dehghani, "Data Mesh: Delivering Data-Driven Value at Scale", 2021)

"Length of time between data availability and the event or phenomenon they describe." (SDMX) 

06 April 2012

Business Intelligence: Between Potential, Reality, Quality and Stories

Business Intelligence
Business Intelligence Series

Have you ever felt that you are investing quite a lot of time, effort, money and other resources into your BI infrastructure, and in the end you don't meet your expectations? As it seems, you're not the only one. The "Does your business intelligence tell you the whole story" paper released in 2009 by KPMG provides some interesting numbers to support that:
1. “More than 50% of business intelligence projects fail to deliver the expected benefit” (BI projects failure)
2. “Two thirds of executives feel that the quality of and timely access to data is poor and inconsistent” (reports and data quality)
3. “Seven out of ten executives do not get the right information to make business decisions.” (BI value)
4. “Fewer than 10% of organizations have successfully used business intelligence to enhance their organizational and technological infrastructures”  (BI alignment)
5. “those with effective business intelligence outperform the market by more than 5% in terms of return on equity” (competitive advantage)

The numbers reflect to some degree my own expectations too, though they seem more pessimistic than I expected. That's not a surprise, considering that such studies can be strongly biased, especially because they reflect expectations, presumptions and personal views on the state of the art within an organization.

KPMG builds on the above numbers and several other aspects that revolve around the use of governance and alignment in order to increase the value provided by BI to the business, though I feel they are hardly scratching the surface. Governance and alignment look great in studies and academic work, though they alone can't bring success, no matter how much their importance and usage are accentuated. Sometimes I feel that people hide behind big words without even grasping the facts. The importance of governance and alignment can't be neglected, though the argumentation provided by KPMG isn't flawless. There are statements I can agree with, and many that are circumstantial. Anyway, let's look a little deeper at the above numbers.

I suppose there is no surprise concerning the huge failure rate of BI projects. The value is somewhat close to the failure rate of software projects in general. Why would a BI project be an exception from a typical software project, considering that they face almost the same environments and challenges? In fact, given the role played by BI in decision making, I would say that BI projects are more sensitive to the various factors than a typical software project.

It doesn't make sense to revisit the reasons for which software projects fail, but some particular aspects need to be mentioned. KPMG insists on the poor quality of data, on the relevance and volume of the reports and metrics used, the failure to reflect the organization's objectives, the inflexibility of data models and the lack of standardization, all of them reflecting to one degree or another on the success of a BI project. There is much more to it!

KPMG refers to a holistic approach centered on shifting the focus from technology to the actual needs, together with a change of process and funding. A reflection of the holistic approach is also viewing the BI infrastructure from the perspective of the entire IT infrastructure, of the organization, of the network of partners and of the end products - mainly models and reports. Many of the problems BI initiatives are confronted with concern the quality of data and its many dimensions (duplicates, conformity, consistency, integrity, accuracy, availability, timeliness, etc.), problems which could in theory be solved in the source systems, mainly through design. Other problems, like dealing with complex infrastructures based on more or less compatible information systems or BI tools, might involve the virtualization, consolidation or harmonization of such solutions, plus the addition of other tools.

Looking at the whole organization, other problems appear: the use of reports and models without understanding the whole baggage of meaning hiding behind them, the different views over the same data and models, the differences in language, problems, requirements and objectives, the departmental and organizational politics, the lack of communication, the lack of trust in the existing models and reports, and so on. What all these points have in common are people! People are maybe the most important factor in the adoption and effective usage of BI solutions. It starts with them - identifying their needs - and it ends with them - as end users. Making them aware of all contextual requirements, actually making them knowledge workers and not considering them just simple machines, could give a boost to your BI strategy.

Partners don't encompass just software vendors, service providers or consultants, but also the internal organizational structures - teams, departments, sites or any other similar structure. Many problems in BI can be tracked down to partners and to the way a partnership is understood, how resources are managed, how different goals and strategies are harmonized, how people collaborate and coordinate. Maybe the most problematic is the partnership between IT and the other departments on one side, and between IT and the external partners on the other. As long as IT is not seen as a partner, as long as IT is skipped from the important decisions or isn't acting as a mediator between its internal and external partners, there are few chances of succeeding. There are so many aspects and lots of material written on this topic, there are models and methodologies supposed to make things work, but often between theory and practice there is a long distance.

How many of the people you've met were blaming the poor quality of the data without actually doing something to improve it? If the quality of your data is one of your major problems, then why aren't you doing something to improve it? Taking ownership of your data is a major step on the way to better data quality, though a data management strategy is needed. This involves designing a framework that facilitates data quality and data consumption, and the design and use of policies, practices and procedures to properly manage the full data lifecycle. This too can be considered part of your BI infrastructure, and given the huge volume, complexity and diversity of data, it is nowadays a must for an organization.

The "right information" is an evasive construct. In order to get the right information you must be capable of defining what you want, of designing your infrastructure with that in mind and of learning how to harness your data. You don't have to look only at your data and information but at the whole DIKW pyramid. The bottom line is that you don't have to build only a BI infrastructure but a knowledge management infrastructure, and methodologies like ITIL can help you achieve that, though they are not sufficient. Sooner or later you'll end up blaming the whole DIKW pyramid - the difficulty of extracting information from data, knowledge from information, and the ultimate translation into wisdom. Actually that's also what the third and fourth of the above statements are screaming out loud - it's not so easy to get information from the silos of data, same as it's not easy to align the transformation process with the organization's strategy.

Timeliness, too, has a relative meaning. It's true that nowadays' business dynamics require faster access to data, though they also require being proactive, a level of maturity many organizations lack. In order to be proactive it's necessary to understand your business' dynamics thoroughly, something rooted primarily in your data, in the tools you are using and in the skill set your employees have acquired in order to move between the DIKW layers. I would say that an understanding of DIKW is essential in harnessing your BI infrastructure.

KPMG considers the 5% increase in return on equity associated with the effective usage of BI a positive sign, though not necessarily so. The increase can be associated with chance or other factors as well, even if it's unlikely to be so. The increase is also quite small when compared with the huge amount of resources spent on BI infrastructure. I believe that BI can do much more for organizations when harnessed adequately. It's just a belief that needs to be backed up by numbers; hopefully that will happen someday, soon.


07 February 2010

Data Management: The Data-Driven Enterprise (Part I: Thoughts on a White Paper)

Data Management
Data Management Series

I read today the 'The Data-Driven Enterprise' white paper from Informatica, quite a useful paper, especially as it comes from one of the leaders in integration software and services. In this paper the term data-driven enterprise refers to the organizations that are "able to take advantage of their data assets to work faster, better and smarter" [1]; in order to achieve this state of the art it is necessary to "invest in the people, processes and technology needed to know where the data resides, to understand it, to clean it and keep it clean, and to get it to where it is needed, when and how it is needed" [1]. 

It seems that the data-driven enterprise, same as the data-driven corporation [2], is just an alternative term for the data-driven organization, a concept already in use for several good years. Following the DIKW pyramid, a data-driven organization follows a four-stage evolution from data to information and further to knowledge and wisdom, of special importance being how knowledge is derived from data and information, the organizations capable of creating, managing and putting knowledge to use being known as knowledge-based organizations. It's interesting that the paper makes no direct reference to knowledge and information, focusing on data as an asset and possibly ignoring information, respectively knowledge, as assets. I think it would have helped if the concepts from this paper had been anchored within these two contexts as well.

The paper touches several important aspects related to Data Management, approaching concepts like 'value of data', 'data quality', 'data integration', 'business involvement', 'data trust', 'relevant data', 'timely data', 'virtualized access', 'compliant reporting' and 'Business-IT collaboration', highlighting the importance of having adequate processes, infrastructure and culture in order to bring more value to the business. I totally agree with the importance of these concepts, though I think there are many other aspects that need to be considered. Almost all vendors juggle with such concepts; what's often missing is the knowledge/wisdom and the method to put philosophies and technologies into use, to redesign an organization's infrastructure and culture so they bring the optimum benefit.

Since the appearance of data warehousing concepts, the efficient integration of the various data islands existing within and outside an organization has become a Holy Grail for IT vendors and organizations, though given the fast pace with which new technologies appear, this hunt looks more like chasing a Fata Morgana in the desert. Informatica builds a strong case for data integration in general and for Informatica 9 in particular, their new infrastructure platform aiming to enable organizations to become data-driven by providing a centralized architecture for enforcing data policy and addressing issues like data timeliness, format, semantics, privacy and quality [3]. On the other side, the grounds on which Informatica builds its launching strategy could be counter-argued, considering the grey zone they were placed in.

Quantifying Value of Data

How many organizations could say that they can (easily) quantify the real value of their data when there is no market value against which it could be benchmarked? I would say that data have only a potential value, one that increases only with their use, once you have learned to explore the data, find patterns and new uses for it, derive knowledge out of it and use that knowledge wisely in order to derive profit and a competitive advantage, and it might take years to arrive there. 

People who have witnessed big IT projects like ERP/CRM implementations or data warehousing have seen how their initial expectations were hardly met; how much are they willing to invest in an initiative that could prove its value maybe only years later, especially when there are still many organizations fighting the crisis? How could they create a business case for such a project? How much could they rely on the numbers advanced by vendors and on the nice slogans behind their tools, just good for selling a product? 

Taking a quote from the video presentation of Sohaib Abbasi, Chairman and CEO at Informatica, "70% of all current SOA initiatives will be restarted or simply abandoned (Gartner)" [3], I would bet that many such projects aim to integrate the various systems existing in an organization. Once you have had several bad experiences of this kind, how much are you willing to invest in a new one?

There are costs that can be quantified, like the number of hours employees spend on maintaining duplicate data or correcting the issues driven by bad data quality, or more generally the costs related to waste, and there are costs that can't be quantified so easily, like the costs associated with bad decisions or lost opportunities driven by missing data or an inadequate reflection of reality. There is another aspect: even if organizations manage to quantify such costs, without some transparency on how they arrived at the respective numbers it feels like somebody just pulled the numbers out of a magician's hat. It would be great if the quantification of such costs were somehow standardized, though that's difficult to do given that each organization approaches Data Management from its own perspective and requirements.

From Data to Meaning

Reports are used only to aggregate, analyze and navigate data, while it's the Users' responsibility to give adequate meaning to the data and, together with the data analyst, to find the who, how, when, where, what, why, which and by what means, in a word, to understand the factors that impact the business positively or negatively, the correlations between them and how they can be strengthened or mitigated in order to achieve better quality and outcomes.

People want nice charts and metrics that can give them a bird's-eye view of the current state, though the aggregated data can easily hide the reality because of the quality of the data, the quality of the reports themselves and the degree to which they cover the reality. Part of the data-driven philosophy consists in understanding the data and reacting to the data. I have met people who ignored the data, preferring to take wild guesses; sometimes they were right, other times they were wrong.

From Functionality to Usability

There are Users who, once they have a tool, want to find out all about its capabilities, play with the tool, find other uses and even come up with nice-to-have features. There are also Users who don't want to bother getting the data by themselves; they just want the data timely and in the format they need. The fact that Informatica allows Users to analyze the data by themselves is quite a deal, though as I already stressed in a previous post, you can't expect a User to become a data expert overnight; there are even developers who have difficulties in handling complex data analysis requirements. 

The guys from Informatica tried to make this aspect look simple in their presentation, though it's not as simple as it seems, especially when dealing with ERP systems like Oracle or SAP that have hundreds of tables, each of them with a bunch of attributes and complex relations; one of the important challenges for developers is to reengineer the logic implemented in such systems. It's a whole mindset that needs to be developed; there are also best practices that need to be considered and specific techniques that allow getting the data in the most efficient way.

Allowing users to decide which logic to apply in their reports could prove to be a double-edged sword, with organizations risking ending up with multiple versions of the same story. The various reports need to be aligned, and users brought onto the same page in terms of expectations and constraints. On the other side, some Users prefer to prepare the data by themselves because they know the issues existing in the data, or because they have more flexibility in making the data look positive.

Trust, Relevance and Timeliness

An important part of Informatica's strategy is based on data trust, relevancy and timeliness, three important but hard to quantify dimensions of Data Quality. Trust is often correlated with Users' perception of the overall Data Quality, the degree to which the aggregated data presented in reports can be backed up with detailed data to support them, and the visibility they have over the business rules and transformations used. If the Users can get a feeling for the data with click-through, drilldown or drill-through reports, and if the business rules and transformations are documented, then most probably data trust won't be an issue anymore. Data relevancy and data timeliness are heavily requirement-dependent, it being enough for some Users to work with one-week-old data while others need live data. To a greater or lesser degree, all the data used by the business are relevant, otherwise I don't see why they would be maintained.

Software Tools as Enablers

Sometimes just being aware that there is a problem and doing something to fix it already brings an amount of value to the business, and this without investing in complex technologies but by handling things methodically and enforcing some management practices - identifying, assessing, addressing, monitoring and controlling issues. I bet this alone could bring a benefit to an organization, and everything starts just by recognizing that there is a problem and doing something to fix the root causes. On the other side, software technologies can enable performing the various tasks more efficiently and effectively, with better quality, fewer resources, in less time and eventually at lower costs. Now, what's the value of the savings from addressing the issue, and what's the value of the savings from using a particular software technology?!

Software tools like Informatica are just enablers; they don't guarantee results and don't eliminate barriers unless people know how to use them and make the most of them. For this, experts are needed who know the business and the various software tools involved, and good, experienced managers to bring such projects onto the right track. When the objectives are not met or the final solution doesn't satisfy all requirements, people end up developing alternative solutions, which I categorize as personal solutions - spreadsheets, MS Access applications - an organization ending up with such islands of duplicated data/logic. Often Users need to make use of such solutions in order to understand their data, and this is an area in which Informatica could easily gain adepts.

Business-IT collaboration

It is no news that IT/IM and the other functional departments don't function as partners, IT initiatives not being adequately supported by the business, while in many IT-technology-related initiatives driven by the business at corporate level the IT department is involved only as an executor and has little to say in the decision to use one technology or another, many such initiatives ignoring aspects specific to IT - the usability of such a solution, its integration with other solutions, the nuances of internal architecture and infrastructure. Of course phrases like "business struggling to work with IT" appear when IT and the business function as separate entities with a minimum of communication, when the various strategies are not aligned as they are supposed to be. 

If you're not informing the IT department of your expectations, and vice versa, each department will come to address issues as they appear rather than proactively, so there will be no wonder when it takes weeks or months until a solution is provided. The responsiveness of IT is strongly correlated with the resources, the existing infrastructure and the policies in place. In addition, for IT to do its work the business has to share the necessary business knowledge; how can you expect issues to be addressed when even the business is not able to articulate the requirements adequately - in many cases the business figures out what it wants only when a first solution or prototype is provided. It's an iterative process, and many people ignore this aspect.

No matter the slogans and the concepts the vendors juggle with, I'm sorry, but I can't believe that there is one tool that matches all requirements, that provides a fully integrated solution, or that the tool itself is sufficient for eliminating the language and collaboration barriers between the business and IT!

Human Resources & Co.

Many organizations don't have in-house the human resources needed for the various projects related to Data Management, therefore they bring in consultants or outsource parts of the projects. A consultant needs time in order to understand the processes existing in an organization and the organization's particularities. Even if business analysts manage to refine the requirements into solid specifications, it's difficult to cover all the aspects without a deep knowledge of the architecture used, same as it's difficult for consultants to put the pieces of the puzzle together, especially when more of the pieces are missing. Consultants generally expect to get all the pieces of the puzzle, while the other side expects the consultants to identify the missing pieces.

When outsourcing tasks (e.g. data analysis) or data-related infrastructure (e.g. data warehouses, data marts), an organization risks losing control over what's happening, the communication issues being reflected in longer cycle times for issue resolution, making everything become a challenge. There are many other issues related to outsourcing that maybe deserve to be addressed in detail.

The Lack of Vision, Policy and Strategy

An organization needs to have a vision, policy and strategy toward data quality in particular and Data Management in general, in order to plan, enforce and coordinate the overall effort toward quality. Their lack can have an unpredictable impact on information systems and the reporting infrastructure in particular, and on the business as a whole; without them, data quality initiatives can have a local and narrow scope, without the expected effectiveness, resulting in rework and failure stories. The saying "it's better to prevent than to cure" best captures the philosophy on which Data Management should be centered.

Lack of Ownership

In the context of the lack of policy and strategy can also be placed the lack of ownership, though given its importance it deserves special attention. The saying "each employee is responsible for quality" applies to data quality too: each user and department needs to take ownership of the data they have to maintain, for their own or other departments' scope, same as they have to take ownership of the reports that fall within the scope of their work, assure their quality and the related documentation, and own the explicit and implicit islands of knowledge that exist.

References:
[1] Informatica. (2009). The Data-Driven Enterprise. [Online] Available from: http://www.informatica.com/downloads/7060_data_driven_wp_web.pdf (Accessed: 6 February 2010).
[2] Herzler. (2006). Eight Aspects of the Data Driven Corporation – Exploring your Gap to Entitlement. [Online] Available from: http://www.hertzler.com/php/portfolio/white.paper.detail.php?article=31 (Accessed: 6 February 2010).
[3] Informatica. (2009). Informatica 9: Infrastructure Platform for the Data-Driven Enterprise, Speaker: Sohaib Abbasi, Chairman and CEO. [Online] Available from: http://www.informatica.com/9/thelibrary.html#page=page-5 (Accessed: 6 February 2010).

25 January 2010

Data Management: Data Quality Dimensions (Part VII: Structuredness)

Data Management
Data Management Series

Barry Boehm defines structuredness as 'the degree to which a system or component possesses a definite pattern of organization of its interdependent parts' [1]. Transposed to data, it refers to the 'pattern of organization' that can be observed in data, mainly the format in which the data are stored at macro level (a file or any other type of digital container) or at micro level (tags, groupings, sentences, paragraphs, tables, etc.), several levels of structure of different types thus emerging. 

From the various sources in which data are stored - databases, Excel files and other types of data sheets, text files, emails, documentation, meeting minutes, charts, images, intranet or extranet web sites - multiple structures can be derived, coexisting in the same document, some of them quite difficult to perceive. From the structuredness point of view, data can be categorized as structured, semi-structured and unstructured.

In general, the term structured data refers to structures that can be easily perceived or known, that raise no doubt about the structure's delimitations. Unstructured data refers to textual data and media content (video, sound, images), in which the structural patterns, even if they exist, are hard to discover or not predefined, while semi-structured data refers to islands of structured data stored together with unstructured data, or vice versa. 

From this perspective, according to [3], database and file systems as well as data exchange formats are examples of semi-structured data, though from a programmer's perspective databases are highly structured, and the same goes for XML files. As also remarked by [2], the terms structured data and unstructured data are often used ambiguously by different interest groups and in different contexts - web searching, data mining, semantics, etc.

Data structuredness is especially important when the processing of data with the help of machines is considered, the correct parsing of data being highly dependent on the knowledge of the data structure, either defined beforehand or deduced. The more structured the data, and the more evident and standardized the structure, the easier it should be to process the data. Merrill Lynch estimates that 85% of the data in an organization are in unstructured form, this number most probably including semi-structured data too. Making such data available in a structured format requires an important volume of manual work, eventually combined with reliable data/text mining techniques, a fact that reduces considerably the value of such data.

Text, relational, multidimensional, object, graph or XML-based DBMS are in theory the easiest to process, map and integrate, though that might not be as simple as it looks, given the different architectures vendors come with and the fact that the structures evolve over time. To bridge the structural and architectural differences, many vendors make it possible to access data over standard interfaces (e.g. ODBC), though there are also systems that provide only proprietary interfaces, making the data difficult to obtain in an automated manner. There are also other types of technical issues, related mainly to the different data types and data formats, though such issues can be easily overcome.

In the context of Data Quality, the structuredness dimension refers to the degree to which the structure in which the data are stored matches the expectations, i.e. the syntactic set of rules defining it, considered across the whole set of records. Even a minor inadvertence in the structure of a record could lead to processing errors and unexpected behavior. The simplest example is a delimited text file: if any of the characters used to delimit the structure of the file appears in the data itself, then there are high chances that the file will be parsed incorrectly, or that the parsing will fail unless the issues are corrected.


Written: Jan-2010, Last Reviewed: Mar-2024

References:
[1] Barry W Boehm et al (1978) "Characteristics of software quality"
[2] The Register (2006) "Structured data is boring and useless", by D. Nortfolk (link)
[3] P Wood (?) "Semi-structured Data"

19 January 2010

Data Management: Consistency (Definitions)

"The degree of uniformity, standardization, and freedom from contradiction among the documents or parts of a system or component."  (IEEE," IEEE Standard Glossary of Software Engineering Terminology", 1990)

"Describes whether or not master data is defined and used across all IT systems in a consistent manner." (Allen Dreibelbis et al, "Enterprise Master Data Management", 2008)

"The requirement that a transaction should leave the database in a consistent state. If a transaction would put the database in an inconsistent state, the transaction is canceled." (Rod Stephens, "Beginning Database Design Solutions", 2008)

"The degree to which one set of attribute values match another attribute set within the same row or record (record-level consistency), within another attribute set in a different record (cross-record consistency), or within the same record at different points in time (temporal consistency)." (DAMA International, "The DAMA Dictionary of Data Management" 1st Ed., 2010)

"Consistency is a dimension of data quality. As used in the DQAF, consistency can be thought of as the absence of variety or change. Consistency is the degree to which data conform to an equivalent set of data, usually a set produced under similar conditions or a set produced by the same process over time." (Laura Sebastian-Coleman, "Measuring Data Quality for Ongoing Improvement ", 2012)

"The degree to which data values are equivalent across redundant databases. With regard to transactions, consistency refers to the state of the data both before and after the transaction is executed. A transaction maintains the consistency of the state of the data. In other words, after a transaction is run, all data in the database is 'correct' (the C in ACID)." (Craig S Mullins, "Database Administration", 2012)

"Agreement of several versions of the data related to the same real objects, which are stored in various information systems." (Boris Otto & Hubert Österle, "Corporate Data Quality", 2015)

"Consistency: agreement of several versions of the data related to the same real objects, which are stored in various information systems." (Boris Otto & Hubert Österle, "Corporate Data Quality", 2015)

"The degree to which the data reflects the definition of the data. An example is the person name field, which represents either a first name, last name, or a combination of first name and last name." (Piethein Strengholt, "Data Management at Scale", 2020)

"The degree to which the model is free of logical or semantic contradictions." (Panos Alexopoulos, "Semantic Modeling for Data", 2020)

"The degree of data being free of contradictions." (Zhamak Dehghani, "Data Mesh: Delivering Data-Driven Value at Scale", 2021)

"The degree of uniformity, standardization, and freedom from contradiction among the documents or parts of a component or system." [IEEE 610]

18 January 2010

Data Management: Data Quality Dimensions (Part VI: Referential Integrity)

  
Data Management
Data Management Series

Referential integrity, when considered as a data quality dimension, refers to the degree to which the values of a key in one table (aka the referencing value) match the values of a key in a related table (aka the referenced value). Typically, that's assured by design in Database Management Systems (DBMS) using a feature called referential integrity, which defines an explicit relationship between the two tables and makes sure that the values remain valid during database changes. Thus, when a record is inserted or updated and a value is provided for the referencing column, the system makes sure that the referenced value is valid, otherwise it throws a referential integrity error. A similar error is thrown when one attempts to delete a record whose value is still referenced by a table on which the relationship was explicitly defined.
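As a minimal sketch, this is how such a relationship is typically declared; the table and column names are invented for the example:

```sql
CREATE TABLE dbo.Countries (
    CountryCode char(2) NOT NULL PRIMARY KEY
  , CountryName nvarchar(50) NOT NULL
);

CREATE TABLE dbo.Customers (
    CustomerId int NOT NULL PRIMARY KEY
  , CustomerName nvarchar(100) NOT NULL
  , CountryCode char(2) NULL
  , CONSTRAINT FK_Customers_Countries
      FOREIGN KEY (CountryCode) REFERENCES dbo.Countries (CountryCode)
);

-- Throws a referential integrity error as long as 'XX' doesn't exist in dbo.Countries:
-- INSERT INTO dbo.Customers (CustomerId, CustomerName, CountryCode)
-- VALUES (1, N'Contoso', 'XX');
```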

Using referential integrity is a recommended technique for assuring the overall integrity of the data in a database, though there are also cases when it's not enforced for all tables (e.g. data warehouses), or when certain tables are exempted (e.g. interface tables where records are imported as they are, or attributes whose values reference data from multiple tables). Therefore, even if there are tables with referential integrity enforced, don't make the assumption that it applies to all tables!

In relational DBMS three types of integrity are mentioned - entity, referential and domain integrity. Entity integrity demands that all tables have a primary key that contains no Null values. Referential integrity demands that each non-null value of a foreign key match the value of a primary key [1], while domain integrity demands that the type of an attribute be restricted to a certain data type and that the format be restricted by using constraints, rules or ranges of possible values [2]. 

Even if not mandatory, all three types of integrity are quintessential for reliable relational databases. When referential integrity is not enforced at database level, or at least in code, a record from a table can be deleted while a foreign key is still pointing to it, a fact that could lead to the unexpected disappearance of records from the system's UI even if the records are still available. 

During conversions or data migrations it is important to assure that the various datasets loaded match the referential and domain integrity of the database into which the data will be loaded, otherwise the records not respecting the mentioned types of integrity will be rejected. The rejection itself might not be a problem for a few records, though when it happens at large scale the situation changes dramatically, especially when the system gives no adequate message about the cause of the rejection. A recommended approach is to assure that the scope is synchronized between the various data elements, and that the referential integrity of the datasets is validated before the data are loaded into the destination database.
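Such a validation can be as simple as an orphan check run before the load; a sketch reusing the hypothetical tables from the example above:

```sql
-- Referencing values without a match in the referenced table (orphans to be fixed before the load)
SELECT C.CustomerId
     , C.CountryCode
FROM dbo.Customers C
     LEFT JOIN dbo.Countries CO
       ON C.CountryCode = CO.CountryCode
WHERE C.CountryCode IS NOT NULL
  AND CO.CountryCode IS NULL;
```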

There are several sources (e.g. [3]) that consider Codd's referential integrity constraint a type of consistency; in support of this idea it could be mentioned that referential integrity can be used to solve data consistency issues by bringing the various lists of values (LOVs) into the systems. Referential integrity remains, however, mainly an architectural concept, even if it involves the 'consistency' of foreign key/primary key pairs.

Note:
Expect the unforeseeable! It's always a good idea to check whether referential integrity is kept by a system - there are so many things that could go wrong! In data migration solutions, data warehouses and, more generally, analytical solutions it is a good idea to have mechanisms in place that check for this kind of issue.


Written: Jan-2010, Last Reviewed: Mar-2024

References:
[1] Halpin, T. (2001) Information Modeling and Relational Databases: From Conceptual Analysis to Logical Design. Morgan Kaufmann Publishers. ISBN 1-55860-672-6.
[2] MSDN. 2009. Data Integrity. [Online] Available from: http://msdn.microsoft.com/en-us/library/ms184276.aspx (Accessed: 18 January 2009)
[3] Lee Y.W., Pipino L.L., Funk J.D., Wang R.Y. (2006) "Journey to Data Quality", MIT Press. ISBN: 0-262-12287-1

Data Management: Accuracy (Definitions)

"(1) A qualitative assessment of correctness, or freedom from error. (2) A quantitative measure of the magnitude of error." (IEEE, "IEEE Standard Glossary of Software Engineering Terminology", 1990)

[accuracy (of measurement):] "Closeness of the agreement between the result of a measurement and a true value of the measurand." (International Vocabulary of Basic and General Terms in Metrology, 1993)

"A qualitative assessment of freedom from error or a quantitative measure of the magnitude of error, expressed as a function of relative error." (William H Inmon, "Building the Data Warehouse", 2005)

"Accuracy is the closeness of a measured value to the true value." (Steve McKillup, "Statistics Explained: An Introductory Guide for Life Scientists", 2005)

"A data element’s degree of conformity to an established business measurement or definition. Data precision is the degree to which further measurements or definitions will show the same results." (Jill Dyché & Evan Levy, "Customer Data Integration: Reaching a Single Version of the Truth", 2006)

"Degree of conformity of a measure to a standard or a true value. Level of precision or detail." (Martin J Eppler, "Managing Information Quality" 2nd Ed., 2006)

"The accuracy reflects the number of times the model is correct." (Glenn J Myatt, "Making Sense of Data: A Practical Guide to Exploratory Data Analysis and Data Mining", 2006)

"An aspect of numerical data quality connected with a standard statistical error between a real parameter value and the corresponding value given by the data. Data accuracy is inversely proportional to this error." (Juliusz L Kulikowski, "Data Quality Assessment", 2009)

"An inherent quality characteristic that is a measure of the degree to which data agrees with an original source of data (such as a form, document, or unaltered electronic data) received from an acknowledged source outside the control of the organization." (David C Hay, "Data Model Patterns: A Metadata Map", 2010) [accuracy in regard to a surrogate source]

"An inherent quality characteristic that is a measure of the degree to which data accurately reflects the real-world object or event being described. Accuracy is the highest degree of inherent information quality possible." (David C Hay, "Data Model Patterns: A Metadata Map", 2010) [accuracy in regard to reality]

"Freedom from mistakes or error, conformity to truth or to a standard, exactness, the degree of conformity of a measure to a standard or true value. (Michael Brackett, 2011)

"The degree to which a data attribute value closely and correctly describes its business entity instance (the 'real life' entities) as of a point in time." (DAMA International, "The DAMA Dictionary of Data Management", 2011)

"Accuracy is the quality or state of being correct or precise; accurate information is correct in all details (NOAD)." (Laura Sebastian-Coleman, "Measuring Data Quality for Ongoing Improvement ", 2012)

"Within the quality management system, accuracy is an assessment of correctness." (For Dummies, "PMP Certification All-in-One For Dummies" 2nd Ed., 2013)

"How closely a measurement or assessment reflects the true value. Not to be confused with precision [...]" (Kenneth A Shaw, "Integrated Management of Processes and Information", 2013)

"Accuracy is defined as a measure of whether the value of a given data element is correct and reflects the real world as viewed by a valid real-world source (SME, customer, hard-copy record, etc.)." (Rajesh Jugulum, "Competing with High Quality Data", 2014)

"Within the quality management system, accuracy is an assessment of correctness." (Project Management Institute, "A Guide to the Project Management Body of Knowledge (PMBOK® Guide)" 6th Ed., 2017)

"The degree to which the data reflect the truth or reality. A spelling mistake is a good example of inaccurate data." (Piethein Strengholt, "Data Management at Scale", 2020)

"The degree to which the semantic assertions of a model are accepted to be true." (Panos Alexopoulos, "Semantic Modeling for Data", 2020)

"The degree of how closely the data represents the true value of the attribute in the real-world context." (Zhamak Dehghani, "Data Mesh: Delivering Data-Driven Value at Scale", 2021)

"Closeness of computations or estimates to the exact or true values that the statistics were intended to measure." (SDMX) 

"The capability of the software product to provide the right or agreed results or effects with the needed degree of precision." [ISO/IEC 25000]

 "The closeness of agreement between an observed value and an accepted reference value." (American Society for Quality)

"The term “accuracy” refers to the degree to which information accurately reflects an event or object described." (Precisely) [source]

17 January 2010

Data Management: Data Quality Dimensions (Part IV: Accuracy)

Data Management
Data Management Series

Accuracy refers to the extent to which data are correct, matching the reality with an acceptable level of approximation. Correctness, the quality of being correct, same as reality, is a vague term; in many cases these are a question of philosophy and perception, having a high degree of interpretability. However, in what concerns data, they are typically the result of measurement, therefore a measure of accuracy relates to the degree to which the data deviate from physical laws, logic or defined rules, though this context too is a swampy field because, to use a well-known syntagm, everything is relative. 

From a scientific point of view, we try to model reality with mathematical models which offer various levels of approximation; the more we learn about our world, the more flaws we discover in the existing models, in a continuous quest for finding better models that approximate reality. Things don't have to be so complicated: for basic measurements there are many tools out there that offer acceptable results for most requirements; on the other side, as requirements change, better approximations might be needed with time.

Another concept related to accuracy and measurement systems is precision, which refers to the degree to which repeated measurements under unchanged conditions lead to the same results, further concepts associated with it being repeatability and reproducibility. Even if accuracy and precision are often confounded, a measurement system can be accurate but not precise, or precise but not accurate (see the target analogy); a valid measurement system targets both aspects. Accuracy and precision can thus be considered dimensions of correctness.

Coming back to accuracy and its use in determining data quality, accuracy is typically strongly related to the measurement tools used; assessing it requires repeating the measurements for all or a sample of the dataset and identifying whether the requested level of accuracy is met, an approach that can involve quite an effort. Accuracy also depends on whether the systems used to store the data are designed to hold the data at the requested level of accuracy, a fact reflected in the characteristics of the data types used (e.g. precision, length).

Given that a system stores related data (e.g. weight, height, width, length) that should satisfy physical, business or common-sense rules, such rules can be used to check whether the data satisfy them with the desired level of approximation. For example, knowing the height, width, length and the composition of a material (e.g. a metal bar), the approximate weight can be determined and compared with the entered weight; if the difference is not within a certain interval then most probably one of the values was entered incorrectly. There are even simpler rules that might apply, for example that the physical dimensions must be positive real values or, in a generalized formulation, that they must fall within maximal or minimal limits, which leads to the identification of outliers. In fact, most of the time determining data accuracy resumes to defining possible value intervals, though there are also cases in which complex models and specific techniques are built for this purpose.
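A minimal Python sketch of such a rule-based check, assuming a hypothetical record with dimensions in cm, a density in g/cm³ and a tolerance chosen by the business:

```python
def check_weight(length_cm, width_cm, height_cm, density_g_cm3,
                 entered_weight_kg, tolerance=0.05):
    """Flag records whose entered weight deviates too much from the computed weight."""
    # Simple sanity rule: physical dimensions and density must be positive real values
    if min(length_cm, width_cm, height_cm, density_g_cm3) <= 0:
        return "invalid dimensions"
    expected_kg = length_cm * width_cm * height_cm * density_g_cm3 / 1000.0
    deviation = abs(entered_weight_kg - expected_kg) / expected_kg
    return "ok" if deviation <= tolerance else f"suspect (deviation {deviation:.1%})"

# Hypothetical metal bar: 100 x 5 x 5 cm, steel density ~7.85 g/cm3 (~19.6 kg expected)
print(check_weight(100, 5, 5, 7.85, 19.6))   # ok
print(check_weight(100, 5, 5, 7.85, 25.0))   # suspect - entered weight too far off
```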

There is another important aspect related to accuracy: the time dependency of data, i.e. whether the data change or not over time. Data currency or actuality refers to the extent to which data are actual (up to date). Given the above definition of accuracy, currency could be considered a special type of accuracy, because when the data are not actual they don’t reflect reality. If currency is considered a standalone data quality dimension, then accuracy refers only to data that are not time dependent.


Written: Jan-2010, Last Reviewed: Mar-2024

13 January 2010

Data Management: Data Quality Dimensions (Part III: Completeness)

Data Management
Data Management Series

Completeness refers to the extent to which data are missing in a dataset, a fact reflected in the number of missing values, also referred to as empty (when an empty string or default value is used) or 'Null' (aka unknown) values, and/or in the number of missing records.

Missing values are typically considered in relation to mandatory attributes, attributes that need a not-Null value for each record, though upon case the notion can be applied to non-mandatory (optional) attributes too, for example when one intends to understand whether the attributes are adequately maintained. It’s interesting that [1] also considers the inapplicable attributes, i.e. attributes not applicable (relevant) for certain scenarios (e.g. physical dimensions for service-based materials), which together with the applicable (relevant) attributes can be considered as another type of categorization for attributes. Whether an attribute is mandatory is decided upon the business context and not necessarily upon the physical structure containing the attribute; in other words, an attribute could be optional as per database schema and mandatory per business rules.
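A minimal Python sketch of such a completeness check, counting Nulls, empty strings and (hypothetical) default placeholder values per mandatory attribute in a list of records:

```python
# Hypothetical records; None stands for Null, '' for empty, 'N/A' for a default placeholder
records = [
    {"vendor": "ACME",     "vat_code": "DE123456789", "city": "Berlin"},
    {"vendor": "Contoso",  "vat_code": "",            "city": None},
    {"vendor": "Fabrikam", "vat_code": "N/A",         "city": "Paris"},
]
mandatory = ["vendor", "vat_code", "city"]
defaults = {"N/A"}  # values treated as 'not maintained'

for attr in mandatory:
    missing = sum(
        1 for r in records
        if r.get(attr) is None or r.get(attr) == "" or r.get(attr) in defaults
    )
    print(f"{attr}: {missing} of {len(records)} records incomplete")
```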

'Missing records' can be a misleading term because it is used in several contexts; however, within the data completeness context it refers only to the cases not covered by data integrity. For example, in parent-child table relations the header data were entered though the detail data are missing, either not entered or deleted; such a case is not covered by referential integrity because there is no missing reference, just a parent without child data (1:n cardinality). 
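As a rough illustration, a minimal Python sketch (with hypothetical header and detail lists) that flags headers without detail records, the kind of 'missing record' not caught by referential integrity:

```python
# Hypothetical Purchase Order headers and lines; referential integrity holds,
# yet PO-1002 has no lines, i.e. a parent without child data
headers = [{"po": "PO-1001"}, {"po": "PO-1002"}, {"po": "PO-1003"}]
lines = [
    {"po": "PO-1001", "line": 10, "item": "A-100"},
    {"po": "PO-1003", "line": 10, "item": "B-200"},
]

pos_with_lines = {l["po"] for l in lines}
orphan_headers = [h["po"] for h in headers if h["po"] not in pos_with_lines]
print("Headers without detail records:", orphan_headers)  # ['PO-1002']
```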

A mixed example occurs when the same entity is split across several tables at the same level of detail. One of the tables must function as a parent, falling under the previously mentioned example (1:1 cardinality). In such a scenario it depends how one reports the nonconformances per record: (1) the error is counted only once, independently of how many dimensions raised an error; (2) the error is counted for each dimension. For (2), when referential integrity fails, an error is raised also for each mandatory attribute. 

Both examples deal with explicit data referents, the 'parent' data, though there are cases in which the referents are implicit, for example when the data are not available for a certain time interval (e.g. period, day) even if needed; also in this case the referents could be made explicit, falling under the previously mentioned examples. In such scenarios all the attributes corresponding to the missing records will be null.

Normally the completeness of parent-child relations is enforced with the help of referential integrity and database transactions, a set of actions performed as a single unit of work, which allow saving the parent data only if the child data were saved successfully, though such constraints are not always necessary.

Data should be cleaned in the source system(s) when feasible, and this applies to incomplete data as well. It might be feasible to clean the values in Excel files or similar tools, by exporting and then reimporting the cleaned values back into the respective systems.

In data migrations or similar scenarios, the completeness in particular and data quality in general must be judged against the target system(s), and thus the dataset must be enriched in an intermediate layer as needed. Upon case, one can consider using default values, though this amounts to technical debt that is unlikely to be addressed later. Moreover, one should prioritize the effort and consider first the attributes needed for the good functioning of the target system(s).

Ideally, data validation techniques should be applied to the mandatory fields in the source systems, when feasible. 


Written: Jan-2010, Last Reviewed: Mar-2024 

References:
[1] David Loshin (2009) "Master Data Management", Morgan Kaufmann OMG Press. ISBN 978-0-12-374225-4.

Data Management: Data Quality Dimensions (Part II: Conformity)

Data Management
Data Management Series

Conformity or format compliance, as named by [1], refers to the extent to which data are in the expected format, each attribute being associated with a set of metadata like type (e.g. text, numeric, alphanumeric, positive), length, precision, scale, or any other formatting patterns (e.g. phone number, decimal and digit grouping symbols).

Because distinct decimal, digit grouping, negative sign and currency symbols can be used to represent numeric values, and different date formats can be used alternatively (e.g. dd-mm-yyyy vs. mm-dd-yyyy), the numeric and date data types are highly sensitive to local computer and application settings, as the same attribute could be stored, processed and represented in different formats. Therefore, it’s preferable to minimize the variations in formatting by applying the same format to all attributes having the same data type and, whenever possible, the format should not be confusing. 

For example, all the dates in a dataset or in a set of datasets belonging to the same global context (e.g. data migration, reporting) should have the same format, a format of type dd-mon-yyyy being preferred because, ignoring the different values the month abbreviation can take for different language settings, it leaves no space for interpretation (e.g. 01-10-2009 vs. 10-01-2009). There are also situations in which the constraints imposed by the various applications used restrain the flexibility of working adequately with the local computer formats.
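A minimal Python sketch of such a format check, assuming the dd-mon-yyyy convention and a hypothetical list of entered values:

```python
from datetime import datetime

def conforms_to_date_format(value, fmt="%d-%b-%Y"):
    """Return True if the value matches the expected dd-mon-yyyy format (e.g. 01-Oct-2009)."""
    try:
        datetime.strptime(value, fmt)
        return True
    except ValueError:
        return False

# Hypothetical entered values
for value in ["01-Oct-2009", "10-01-2009", "2009/10/01"]:
    print(value, "->", "conform" if conforms_to_date_format(value) else "nonconform")
```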

If for decimals and dates there is a limited number of possibilities that can be dealt with, for alphanumeric values things change drastically because, excepting the format masks that could be used during data entry, the adherence to a format depends entirely on the Users and whether they applied the formatting standards defined. In the absence of standards, Users might come up with their own encoding, and even then they might change it over time. 

The use of different encodings could also be required by the standards within a specific country, organization, or other such entity. All these together make alphanumeric attributes the most frequent candidates for data cleaning, and the business rules used can be quite complex, needing to handle each specific case. For example, the VAT code can have a different length from country to country, and more than one encoding could be used, reflecting changes in formatting policy.

In what concerns the format, alphanumeric attributes offer greater flexibility than decimal and date attributes, and their formatting could in theory be ignored unless they are further parsed by other applications. However, considering that such needs change over time, it’s advisable to standardize the various formats used within an organization and use 'standard' delimiters for formatting the various chunks of data with a particular meaning within an alphanumeric attribute, a fact that could considerably reduce the volume of rework needed to cleanse the data for further processing. An encoding could be done without the use of delimiters, e.g. when the length of each chunk of data is the same, though length-based formatting could prove limited when the length of a chunk changes.

Note:
Delimiters should be chosen from the characters that will never be used in the actual chunks of data or in the various applications dealing with the respective data. For example, the pipe (“|”) or semicolon (“;”) could be good candidates for such a delimiter, though they are often used as delimiters when exporting the data to text files; therefore it’s better to use a dash (“-”) or even a combination of characters (e.g. “.-.”) when a dash is not enough, while in some cases even a space or a dot could be used as a delimiter.
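As a rough illustration, a minimal Python sketch parsing a hypothetical delimiter-based encoding (material group - plant - sequence) and validating each chunk against an expected pattern:

```python
import re

# Hypothetical encoding: <material group>-<plant>-<sequence>, e.g. "RAW-0100-00042"
chunk_patterns = [r"[A-Z]{3}", r"\d{4}", r"\d{5}"]

def validate_code(code, delimiter="-"):
    """Check that the code splits into the expected chunks and each chunk matches its pattern."""
    chunks = code.split(delimiter)
    if len(chunks) != len(chunk_patterns):
        return False
    return all(re.fullmatch(p, c) for p, c in zip(chunk_patterns, chunks))

for code in ["RAW-0100-00042", "RAW-100-00042", "RAW|0100|00042"]:
    print(code, "->", "conform" if validate_code(code) else "nonconform")
```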


Written: Jan-2010, Last Reviewed: Mar-2024 

References:
[1] David Loshin (2009) "Master Data Management", Morgan Kaufmann OMG Press. ISBN 978-0-12-374225-4.

Data Management: Data Quality Dimensions (Part I: Uniqueness)

Data Management
Data Management Series

Uniqueness refers to "requirements that entities modeled within the master environment are captured, represented, and referenced uniquely within the relevant application architectures" [1]. An alternative name is that of duplicates, which stresses the existence of duplicate records within a dataset (the non-uniqueness), being a better indicator for the nonconformance, especially when considering datasets.

Why is it required to enforce the uniqueness of entities? An entity is defined using a number of attributes representing the entity’s characteristics; in case the attributes of two entities have the same values, then most likely the two representations refer to the same entity. This holds in most of the cases, though there are situations in which the attribute(s) that make(s) it possible to differentiate between two distinct entities is/are not adequately maintained or not considered at all. The impossibility of identifying an entity uniquely increases the chances of using one of the respective entities wrongly, for example booking an Invoice against the wrong Vendor, with all the implications derived from it. 

For each type of entity there can be one or more attributes that allow identifying it uniquely, for example in the case of a Vendor these could be the Vendor’s name and address. The more such attributes, the more difficult the identification of a Vendor becomes; therefore, even if such a set of attributes exists, as in the Vendor’s case, it’s preferable to use instead a unique identifier, a numeric or alphanumeric key that identifies an entity uniquely.

A Vendor could be uniquely identified by the Vendor ID, though it allows the unique identification of a Vendor only within one data repository, the chances being quite high to have another Vendor ID for the same entity in another data repository. Therefore, in order to guarantee the uniqueness of entities, it is preferable to use instead an attribute that has the same value regardless of the data repository the entity is stored in (e.g. Vendor Number, Item Number, Customer Number, Asset Number, Sales Order Number, etc.). Such attributes could be enforced to be unique across a set of data repositories, though one of them must function as 'master'.
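As a rough illustration, a minimal Python sketch (with hypothetical Vendor records) that groups records by a normalized name and address and flags groups containing more than one Vendor ID as potential duplicates:

```python
from collections import defaultdict

# Hypothetical Vendor records; two IDs point to the same real-world Vendor
vendors = [
    {"vendor_id": 1001, "name": "ACME GmbH",   "address": "Hauptstr. 1, Berlin"},
    {"vendor_id": 1002, "name": "Acme GmbH",   "address": "Hauptstr. 1,  Berlin"},
    {"vendor_id": 1003, "name": "Contoso Ltd", "address": "High St. 5, London"},
]

def normalize(text):
    """Lowercase, drop commas and collapse whitespace so trivial variations match."""
    return " ".join(text.lower().replace(",", " ").split())

groups = defaultdict(list)
for v in vendors:
    key = (normalize(v["name"]), normalize(v["address"]))
    groups[key].append(v["vendor_id"])

duplicates = {key: ids for key, ids in groups.items() if len(ids) > 1}
print(duplicates)  # {('acme gmbh', 'hauptstr. 1 berlin'): [1001, 1002]}
```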

Multiple identifiers for the same entity may exist, though this can easily create confusion, especially when this happens within the same system and people or machines are not aware that the respective identifiers refer to the same entity; the more identifiers we have for the same entity, the higher the chances of creating confusion. Imagine that in the same system you book some of the Invoices against one of the identifiers, and the remaining Invoices against another identifier of the same entity.

Especially in reports this might be quite a problem, as amounts that should appear for the same entity are split across two references, and even if they refer to the same entity the report’s users might not be aware of it, the calculations based on such numbers not reflecting reality. Imagine that you have booked the invoices against two references to the same Vendor and you want to consider the top 10 Vendors with the highest spend volume; it might happen that the aggregate amounts for each of the respective references don’t make it into the top 10, though when considered together they could even take the first position.

There are even situations in which the use of multiple identifiers for the same entity is inevitable; important here are the cases in which several different systems refer to the same entity in their own unique way, often not being designed to use a globally unique identifier for an entity. Under such circumstances what an organization can do is either extend the system(s) to store the global identifier or have a mapping in place with the referents used to indicate the same entity. 

The more systems within an enterprise use/represent the same entity, the more complex it is to manage the correct referencing; the logical thing to do is to have in place a master system that stores the global identifier and integrate it with the other systems, or at least store the global identifier in each of the systems. Such identifiers, used for master or even transactional data, allow creating consolidated reports based on several systems and mitigating possible issues existing in the data.

Even if such keys are designed to enforce uniqueness, this doesn’t mean it necessarily happens; the reason is simple: behind the assignment of a key to an entity there is a defined process, and no matter how robust the process, if it doesn’t cover all the scenarios that might occur or can’t deal (adequately) with various constraints, then at some point in time a flow might lead to a duplicate, a non-unique record. Take for example the creation of a Product Number: suppose a Product Number XYZ is created in an ERP system and one or more of the respective Product’s attributes are incorrect. In many cases such errors can be corrected, though each system comes with its own constraints that don’t allow changing certain attributes (e.g. the Unit of Measure once Inventory was booked). Because an entity with the wrong attributes is inadequate to be used, a new Product Number is created for the same entity, the two identifiers coexisting for some time. (Some systems allow building a relationship between the Products, showing that one Product replaces the other going forward.)

Some systems may allow the deletion of the first entered Product, though that’s not always possible. Even if the two Product Numbers are different, they point to the same Product entity, and even if the first entered Product is marked as obsolete, transactions might already have been made with it before this happens; again, it might be possible to revert the transactions, though that’s not always possible either. And that’s not all: once a Product is marked as obsolete, this doesn’t necessarily imply that a User won’t reactivate the Product in the near or far future. As can be seen, the scenarios are quite complex, and not all systems and processes are designed to handle them. If such deviations from uniqueness can’t be corrected in the legacy systems, they should at least be corrected during conversion and migration.


Written: Jan-2010, Last Reviewed: Mar-2024

References:
[1] David Loshin (2009) "Master Data Management", Morgan Kaufmann OMG Press. ISBN 978-0-12-374225-4.
