21 May 2020
📦Data Migrations (DM): In-house Built Solutions (Part III: The Data Preparation Layer)
28 April 2017
⛏️Data Management: Completeness (Definitions)
"A characteristic of information quality that measures the degree to which there is a value in a field; synonymous with fill rate. Assessed in the data quality dimension of Data Integrity Fundamentals." (Danette McGilvray, "Executing Data Quality Projects", 2008)
"Containing by a composite data all components necessary to full description of the states of a considered object or process." (Juliusz L Kulikowski, "Data Quality Assessment", 2009)
"An inherent quality characteristic that is a measure of the extent to which an attribute has values for all instances of an entity class." (David C Hay, "Data Model Patterns: A Metadata Map", 2010)
"Completeness is a dimension of data quality. As used in the DQAF, completeness implies having all the necessary or appropriate parts; being entire, finished, total. A dataset is complete to the degree that it contains required attributes and a sufficient number of records, and to the degree that attributes are populated in accord with data consumer expectations. For data to be complete, at least three conditions must be met: the dataset must be defined so that it includes all the attributes desired (width); the dataset must contain the desired amount of data (depth); and the attributes must be populated to the extent desired (density). Each of these secondary dimensions of completeness can be measured differently." (Laura Sebastian-Coleman, "Measuring Data Quality for Ongoing Improvement ", 2012)
"Completeness is defined as a measure of the presence of core source data elements that, exclusive of derived fields, must be present in order to complete a given business process." (Rajesh Jugulum, "Competing with High Quality Data", 2014)
"Complete existence of all values or attributes of a record that are necessary." (Boris Otto & Hubert Österle, "Corporate Data Quality", 2015)
"The degree to which all data has been delivered or stored and no values are missing. Examples are empty or missing records." (Piethein Strengholt, "Data Management at Scale", 2020)
"The degree to which elements that should be contained in the model are indeed there." (Panos Alexopoulos, "Semantic Modeling for Data", 2020)
"The degree of data representing all properties and instances of the real-world context." (Zhamak Dehghani, "Data Mesh: Delivering Data-Driven Value at Scale", 2021)
"Data is considered 'complete' when it fulfills expectations of comprehensiveness." (Precisely) [source]
"The degree to which all required measures are known. Values may be designated as “missing” in order not to have empty cells, or missing values may be replaced with default or interpolated values. In the case of default or interpolated values, these must be flagged as such to distinguish them from actual measurements or observations. Missing, default, or interpolated values do not imply that the dataset has been made complete." (CODATA)
27 April 2017
⛏️Data Management: Availability (Definitions)
"Corresponds to the information that should be available when necessary and in the appropriate format." (José M Gaivéo, "Security of ICTs Supporting Healthcare Activities", 2013)
"A property by which the data is available all the time during the business hours. In cloud computing domain, the data availability by the cloud service provider holds a crucial importance." (Sumit Jaiswal et al, "Security Challenges in Cloud Computing", 2015)
"Availability: the ability of the data user to access the data at the desired point in time." (Boris Otto & Hubert Österle, "Corporate Data Quality", 2015)
"It is one of the main aspects of the information security. It means data should be available to its legitimate user all the time whenever it is requested by them. To guarantee availability data is replicated at various nodes in the network. Data must be reliably available." (Omkar Badve et al, "Reviewing the Security Features in Contemporary Security Policies and Models for Multiple Platforms", 2016)
"Timely, reliable access to data and information services for authorized users." (Maurice Dawson et al, "Battlefield Cyberspace: Exploitation of Hyperconnectivity and Internet of Things", 2017)
"A set of principles and metrics that assures the reliability and constant access to data for the authorized individuals or groups." (Gordana Gardašević et al, "Cybersecurity of Industrial Internet of Things", 2020)
"Ensuring the conditions necessary for easy retrieval and use of information and system resources, whenever necessary, with strict conditions of confidentiality and integrity." (Alina Stanciu et al, "Cyberaccounting for the Leaders of the Future", 2020)
"The state when data are in the place needed by the user, at the time the user needs them, and in the form needed by the user." (CODATA)
"The state that exists when data can be accessed or a requested service provided within an acceptable period of time." (NISTIR 4734)
"Timely, reliable access to information by authorized entities." (NIST SP 800-57 Part 1)
21 February 2017
⛏️Data Management: Validity (Definitions)
"A characteristic of the data collected that indicates they are sound and accurate." (Teri Lund & Susan Barksdale, "10 Steps to Successful Strategic Planning", 2006)
"Implies that the test measures what it is supposed to." (Robert McCrie, "Security Operations Management" 2nd Ed., 2006)
"The determination that values in the field are or are not within a set of allowed or valid values. Measured as part of the Data Integrity Fundamentals data quality dimension." (Danette McGilvray, "Executing Data Quality Projects", 2008)
"A data quality dimension that reflects the confirmation of data items to their corresponding value domains, and the extent to which non-confirmation of certain items affects fitness to use. For example, a data item is invalid if it is defined to be integer but contains a non-integer value, linked to a finite set of possible values but contains a value not included in this set, or contains a NULL value where a NULL is not allowed." (G Shankaranarayanan & Adir Even, "Measuring Data Quality in Context", 2009)
"An aspect of data quality consisting in its steadiness despite the natural process of data obsolescence increasing in time." (Juliusz L Kulikowski, "Data Quality Assessment", 2009)
"An inherent quality characteristic that is a measure of the degree of conformance of data to its domain values and business rules." (David C Hay, "Data Model Patterns: A Metadata Map", 2010)
"Validity is a dimension of data quality, defined as the degree to which data conforms to stated rules. As used in the DQAF, validity is differentiated from both accuracy and correctness. Validity is the degree to which data conform to a set of business rules, sometimes expressed as a standard or represented within a defined data domain." (Laura Sebastian-Coleman, "Measuring Data Quality for Ongoing Improvement ", 2012)
"Validity is defined as the extent to which data corresponds to reference tables, lists of values from golden sources documented in metadata, value ranges, etc." (Rajesh Jugulum, "Competing with High Quality Data", 2014)
"the state of consistency between a measurement and the concept that a researcher intended to measure." (Meredith Zozus, "The Data Book: Collection and Management of Research Data", 2017)
[semantic validity:] "The compliance of attribute data to rules regarding consistency and truthfulness of association." (O Sami Saydjari, "Engineering Trustworthy Systems: Get Cybersecurity Design Right the First Time", 2018)
[syntactic validity:] "The compliance of attribute data to format and grammar rules." (O Sami Saydjari, "Engineering Trustworthy Systems: Get Cybersecurity Design Right the First Time", 2018)
"Validity is a data quality dimension that refers to information that doesn’t conform to a specific format or doesn’t follow business rules." (Precisely) [source]
20 February 2017
⛏️Data Management: Timeliness (Definitions)
"Coming early or at the right, appropriate or adapted to the times or the occasion." (Martin J Eppler, "Managing Information Quality" 2nd Ed., 2006)
[timeliness & availability] "A data quality dimension that measures the degree to which data are current and available for use as specified, and in the time frame in which they are expected." (Danette McGilvray, "Executing Data Quality Projects", 2008)
"the ability of a task to repeatedly meet its timeliness requirements." (Bruce P Douglass, "Real-Time Agility: The Harmony/ESW Method for Real-Time and Embedded Systems Development", 2009)
"A pragmatic quality characteristic that is a measure of the relative availability of data to support a given process within the timetable required to perform the process." (David C Hay, "Data Model Patterns: A Metadata Map", 2010)
"1.The degree to which available data meets the currency requirements of information consumers. 2.The length of time between data availability and the event or phenomenon they describe." (DAMA International, "The DAMA Dictionary of Data Management", 2011)
"Timeliness is a dimension of data quality related to the availability and currency of data. As used in the DQAF, timeliness is associated with data delivery, availability, and processing. Timeliness is the degree to which data conforms to a schedule for being updated and made available. For data to be timely, it must be delivered according to schedule." (Laura Sebastian-Coleman, "Measuring Data Quality for Ongoing Improvement ", 2012)
"The degree to which the model contains elements that reflect the current version of the world Transitive Relation When a relation R is transitive then if R links entity A to entity B, and entity B to entity C, then it also links A to C." (Panos Alexopoulos, "Semantic Modeling for Data", 2020)
"The degree to which the actual time and processing time are separated. The timelier the data is, the smaller the gap is between actual time and record time." (Zhamak Dehghani, "Data Mesh: Delivering Data-Driven Value at Scale", 2021)
"Length of time between data availability and the event or phenomenon they describe." (SDMX)
06 April 2012
🧭Business Intelligence: Enterprise Reporting (Part X: Between Potential, Reality, Quality and Stories)
Business Intelligence Series |
2. “Two thirds of executives feel that the quality of and timely access to data is poor and inconsistent” (reports and data quality)
3. “Seven out of ten executives do not get the right information to make business decisions.” (BI value)
4. “Fewer than 10% of organizations have successfully used business intelligence to enhance their organizational and technological infrastructures” (BI alignment)
5. “those with effective business intelligence outperform the market by more than 5% in terms of return on equity” (competitive advantage)
The numbers reflect to some degree my own expectations, though they are more pessimistic than I anticipated. That's not a surprise, considering that such studies can be strongly biased, especially because they reflect expectations, presumptions and personal views on the state of the art within an organization.
KPMG builds on the above numbers and on several other aspects revolving around governance and alignment in order to increase the value BI provides to the business, though I feel they are barely scratching the surface. Governance and alignment look great in studies and academic work, but on their own they can't bring success, no matter how much their importance and usage are accentuated. Sometimes I feel that people hide behind big words without even grasping the facts. The importance of governance and alignment can't be neglected, though the argumentation provided by KPMG isn't flawless. There are statements I can agree with, and many that are circumstantial. Anyway, let's look a little deeper at the above numbers.
I suppose there is no surprise concerning the huge failure rate of BI projects. The value is close to the failure rate of software projects in general. Why would a BI project be an exception from a typical software project, considering that both face almost the same environments and challenges? In fact, given the role played by BI in decision making, I would say that BI projects are more sensitive to the various factors than a typical software project.
KPMG refers to a holistic approach concentrated on shifting the focus from technology to actual needs, together with a change of process and funding. A reflection of this holistic approach is also viewing the BI infrastructure in the context of the entire IT infrastructure, the organization, the network of partners and the end products – mainly models and reports. Many of the problems BI initiatives are confronted with concern the quality of data and its many dimensions (duplicates, conformity, consistency, integrity, accuracy, availability, timeliness, etc.), problems which could in theory be solved in the source systems, mainly through design. Other problems, like dealing with complex infrastructures based on more or less compatible IS or BI tools, might involve virtualization, consolidation or harmonization of such solutions, plus the addition of other tools.
Looking at the whole organization, other problems appear: the use of reports and models without understanding the whole baggage of meaning hiding behind them, the different views on the same data and models, differences of language, problems, requirements and objectives, departmental and organizational politics, the lack of communication, the lack of trust in the existing models and reports, and so on. What all these points have in common are people! People are maybe the most important factor in the adoption and effective usage of BI solutions. It starts with them - identifying their needs - and ends with them, as end users. Making them aware of all contextual requirements, actually treating them as knowledge workers rather than simple machines, could give a boost to your BI strategy.
Partners don't encompass just software vendors, service providers or consultants, but also the internal organizational structures – teams, departments, sites or any other similar structure. Many problems in BI can be traced back to partners and to the way a partnership is understood, to how resources are managed, how different goals and strategies are harmonized, and how people collaborate and coordinate. Maybe the most problematic is the partnership between IT and the other departments on one side, and between IT and external partners on the other. As long as IT is not seen as a partner, is skipped in important decisions or isn't acting as a mediator between its internal and external partners, there are few chances of succeeding. There are many aspects and a lot of material written on this topic, and there are models and methodologies supposed to make things work, but often there is a long distance between theory and practice.
How many of the people you have met blamed the poor quality of the data without actually doing anything to improve it? If the quality of your data is one of your major problems, then why aren't you doing something about it? Taking ownership of your data is a major step on the way to better data quality, though a data management strategy is needed. This involves the design of a framework that facilitates data quality and data consumption, and the design and use of policies, practices and procedures to properly manage the full data lifecycle. This too can be considered part of your BI infrastructure, and given the huge volume, complexity and diversity of data, it is nowadays a must for an organization.
The “right information” is an elusive construct. In order to get the right information you must be capable of defining what you want, of designing your infrastructure with that in mind and of learning how to harness your data. You have to look not only at your data and information but also at the whole DIKW (data-information-knowledge-wisdom) pyramid. The bottom line is that you have to build not only a BI infrastructure but a knowledge management infrastructure, and methodologies like ITIL can help you achieve that, though they are not sufficient. Sooner or later you'll come to blame the whole DIKW pyramid - the difficulty of extracting information from data, knowledge from information, and of the ultimate translation into wisdom. Actually that's also what the third and fourth of the above statements are screaming out loud – it's not so easy to get information out of the silos of data, just as it's not easy to align the transformation process with the organization's strategy.
Timeliness too has a relative meaning. It's true that today's business dynamics require faster access to data, but they also require being proactive, and many organizations lack this level of maturity. In order to be proactive it's necessary to understand your business' dynamics thoroughly, an understanding rooted primarily in your data, in the tools you are using and in the skill set your employees have acquired in order to move between the DIKW layers. I would say that understanding DIKW is essential in harnessing your BI infrastructure.
KPMG considers the 5% increase in return on equity associated with the effective usage of BI a positive sign; not necessarily. The increase can be associated with chance or other factors as well, even if that is unlikely. The increase is also quite small when set against the huge amount of resources spent on BI infrastructure. I believe that BI can do much more for organizations when harnessed adequately. It's just a belief that needs to be backed up by numbers; hopefully that will happen someday soon.
07 February 2010
🗄️Data Management: The Data-Driven Enterprise (Part I: Thoughts on a White Paper)
Data Management Series |
The paper touches on several important aspects of Data Management, approaching concepts like “value of data”, “data quality”, “data integration”, “business involvement”, “data trust”, “relevant data”, “timely data”, “virtualized access”, “compliant reporting” and “Business-IT collaboration”, and highlighting the importance of having adequate processes, infrastructure and culture in order to bring more value to the business. I fully agree with the importance of these concepts, though I think there are many other aspects that need to be considered. Almost all vendors juggle with such concepts; what's often missing is the knowledge/wisdom and the method to put philosophies and technologies into use, to redesign an organization's infrastructure and culture so that it brings the optimum benefit.
Since the appearance of the data warehouse concept, the efficient integration of the various data islands existing within and outside an organization has become a Holy Grail for IT vendors and organizations, though given the fast pace at which new technologies appear, the hunt looks more like chasing a Fata Morgana in the desert. Informatica builds a strong case for data integration in general and for Informatica 9 in particular, their new infrastructure platform aiming to enable organizations to become data-driven by providing a centralized architecture for enforcing data policy and addressing issues like data timeliness, format, semantics, privacy and quality [3]. On the other hand, the grounds on which Informatica builds its launch strategy can be contested, considering the grey zone they sit in.
Quantifying Value of Data
How many organizations could say that they can quantify (easily) the real value of their data when there is no market value to benchmark it against? I would say that data has only a potential value, one that increases with use - once you have learned to explore the data, find patterns and new uses for it, derive knowledge out of it and use that knowledge wisely to gain profit and a competitive advantage - and it might take years to get there.
There are costs that can be quantified, like the number of hours employees spend maintaining duplicate data or correcting issues caused by bad data quality, or more generally the costs related to waste, and there are costs that can't be quantified so easily, like the costs associated with bad decisions or lost opportunities caused by missing data or an inadequate reflection of reality. There is another aspect: even if organizations manage to quantify such costs, without some transparency on how they arrived at the respective numbers it feels like somebody just pulled the numbers out of a magician's hat. It would be great if the quantification of such costs were somehow standardized, though that's difficult given that each organization approaches Data Management from its own perspective and requirements.
From Data to Meaning
Reports are used only to aggregate, analyze and navigate data; it is the Users' responsibility to give adequate meaning to the data and, together with the data analyst, to find the who, how, when, where, what, why, which and by what means - in a word, to understand the factors that impact the business positively/negatively, the correlations between them, and how they can be strengthened/mitigated in order to achieve better quality/outcomes.
People want nice charts and metrics that give them a bird's-eye view of the current state, though the aggregated data can easily hide the reality because of the quality of the data, the quality of the reports themselves and the degree to which they cover the reality. Part of the data-driven philosophy consists in understanding the data and reacting to the data. I have met people who ignored the data, preferring to take wild guesses; sometimes they were right, other times they were wrong.
From Functionality to Usability
There are Users who, once they have a tool, want to find out all about its capabilities, play with the tool, find other uses and even come up with nice-to-have features. There are also Users who don't want to bother getting the data themselves; they just want the data on time and in the format they need. The fact that Informatica allows Users to analyze the data by themselves is quite a big deal, though as I already stressed in a previous post, you can't expect a User to become a data expert overnight - there are even developers who have difficulties handling complex data analysis requirements.
Allowing Users to decide which logic to apply in their reports could prove to be a double-edged sword, with organizations risking ending up with multiple versions of the same story. The various reports need to be aligned, and Users brought onto the same page in terms of expectations and constraints. On the other hand, some Users prefer to prepare the data themselves because they know the issues existing in the data, or because it gives them more flexibility in making the data look positive.
Trust, Relevance and Timeliness
An important part of Informatica's strategy is based on data trust, relevancy and timeliness, three important but hard-to-quantify dimensions of Data Quality. Trust is often correlated with Users' perception of the overall Data Quality, the degree to which the aggregated data presented in reports can be backed up with detailed data, and the visibility they have into the business rules and transformations used. If Users can get a feel for the data with click-through, drill-down or drill-through reports, and if the business rules and transformations are documented, then most probably data trust won't be an issue anymore. Data relevancy and data timeliness are heavily requirement-dependent: for some Users it is enough to work with one-week-old data, while others need live data. To a greater or lesser degree, all data used by the business are relevant, otherwise I don't see why they would be maintained.
Software Tools as Enablers
Sometimes just being aware that there is a problem and doing something to fix it already brings value to the business, and this without investing in complex technologies, simply by handling things methodically and enforcing some management practices – identifying, assessing, addressing, monitoring and controlling issues. I bet this alone could benefit an organization, and everything starts just by recognizing that there is a problem and doing something to fix the root causes. On the other hand, software technologies can enable performing the various tasks more efficiently and effectively, with better quality, fewer resources, in less time and eventually at lower cost. Now, what's the value of the savings from addressing the issue itself, and what's the value of the savings from using a particular software technology?!
Software tools like Informatica are just enablers; they don't guarantee results and don't eliminate barriers unless people know how to use them and make the most of them. For this, experts are needed who know the business and the various software tools involved, as well as good, experienced managers to keep such projects on track. When the objectives are not met or the final solution doesn't satisfy all requirements, people end up developing alternative solutions, which I categorize as personal solutions – spreadsheets, MS Access applications – with the organization ending up with islands of duplicated data/logic. Often Users need such solutions in order to understand their data, and this is an area in which Informatica could easily gain adepts.
Business-IT collaboration
It is no news that IT/IM and the other functional departments don't function as partners: IT initiatives are not adequately supported by the business, while in many technology-related initiatives driven by the business at corporate level the IT department is involved only as an executor and has little say in the decision of using one technology or another, with many such initiatives ignoring aspects specific to IT – the usability of the solution, integration with other solutions, nuances of the internal architecture and infrastructure. Of course phrases like “business struggling in working with IT” appear when IT and the business function as separate entities with a minimum of communication, when the various strategies are not aligned as they are supposed to be.
No matter the slogans and the concepts the vendors juggle with, I'm sorry, but I can't believe that there is one tool that matches all requirements, that provides a fully integrated solution, or that the tool itself is sufficient for eliminating the language and collaboration barriers between the business and IT!
Human Resources & Co.
Many organizations don't have in-house the human resources needed for the various projects related to Data Management, and therefore bring in consultants or outsource parts of the projects. A consultant needs time to understand the processes existing in an organization and the organization's particularities. Even if business analysts manage to turn the requirements into solid specifications, it's difficult to cover all the aspects without deep knowledge of the architecture used, just as it's difficult for consultants to put the pieces of the puzzle together, especially when several of the pieces are missing. Consultants generally expect to get all the pieces of the puzzle, while the other side expects consultants to identify the missing pieces.
When outsourcing tasks (e.g. data analysis) or data-related infrastructure (e.g. data warehouses, data marts), an organization risks losing control over what's happening, with communication issues reflected in longer cycle times for issue resolution, making everything a challenge. There are many other issues related to outsourcing that perhaps deserve to be addressed in detail.
The Lack of Vision, Policy and Strategy
An organization needs to have a vision, a policy and a strategy for data quality in particular and Data Management in general, in order to plan, enforce and coordinate the overall effort toward quality. Their absence can have an unpredictable impact on information systems and the reporting infrastructure in particular and on the business as a whole; without them, data quality initiatives tend to have a local and narrow scope, lack the expected effectiveness, and result in rework and failure stories. The saying “it's better to prevent than to cure” best reflects the philosophy on which Data Management should be centered.
Lack of Ownership
The lack of ownership can also be placed in the context of the lack of policy and strategy, though given its importance it deserves special attention. The saying “each employee is responsible for quality” applies to data quality too: each user and department needs to take ownership of the data they have to maintain, whether for their own or for other departments' scope, just as they have to take ownership of the reports that fall within the scope of their work, assure their quality and the related documentation, and own the explicit and implicit islands of knowledge that exist.
References:
[1] Informatica. (2009). The Data-Driven Enterprise. [Online] Available from: http://www.informatica.com/downloads/7060_data_driven_wp_web.pdf (Accessed: 6 February 2010).
[2] Hertzler. (2006). Eight Aspects of the Data Driven Corporation – Exploring your Gap to Entitlement. [Online] Available from: http://www.hertzler.com/php/portfolio/white.paper.detail.php?article=31 (Accessed: 6 February 2010).
[3] Informatica. (2009). Informatica 9: Infrastructure Platform for the Data-Driven Enterprise, Speaker: Sohaib Abbasi, Chairman and CEO. [Online] Available from: http://www.informatica.com/9/thelibrary.html#page=page-5 (Accessed: 6 February 2010).
25 January 2010
🗄️Data Management: Data Quality Dimensions (Part VII: Structuredness)
Data Management Series |
In general, the term structured data refers to structures that can be easily perceived or known, leaving no doubt about the structure's delimitations. Unstructured data refers to textual data and media content (video, sound, images), in which structural patterns, even if they exist, are hard to discover or are not predefined, while semi-structured data refers to islands of structured data stored with unstructured data, or vice versa.
Data structuredness is important especially when the processing of data by machines is considered, the correct parsing of data being highly dependent on knowledge about the data structure, either defined beforehand or deduced. The more structured the data and the more evident and standardized the structure, the easier it should be to process the data. Merrill Lynch estimates that 85% of the data in an organization is in unstructured form, a number that most probably covers semi-structured data too. Making such data available in a structured format requires a considerable volume of manual work, possibly combined with reliable data/text mining techniques, a fact that reduces the value of such data considerably.
Data stored in text, relational, multidimensional, object, graph or XML-based DBMS are in theory the easiest to process, map and integrate, though that might not be as simple as it looks, given the different architectures vendors come up with and the fact that the structures evolve over time. To bridge the structural and architectural differences, many vendors make it possible to access data over standard interfaces (e.g. ODBC), though there are also systems that provide only proprietary interfaces, making data difficult to obtain in an automated manner. There are also other types of technical issues, related mainly to different data types and data formats, though such issues can usually be overcome easily.
In the context of Data Quality, the structuredness dimension refers to the degree to which the structure in which the data are stored matches expectations - the syntactic set of rules defining it - considered across the whole set of records. Even a minor deviation in the structure of a record can lead to processing errors and unexpected behavior. The simplest example is a delimited text file: if any of the characters used to delimit the structure of the file also appears in the data itself, then there is a high chance that the file will be parsed incorrectly, or that the parsing will fail unless the issues are corrected.
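The delimiter problem can be reproduced in a few lines. The Python sketch below uses a made-up record to show how a comma inside a value breaks naive splitting, and how proper quoting handled by a CSV parser avoids the issue.

```python
# Minimal sketch: a delimiter occurring inside the data breaks naive parsing (made-up example).
import csv
import io

line = 'P1001,Steel bar, 2m,25.50'      # the description "Steel bar, 2m" contains the delimiter

print(line.split(","))                   # naive split -> 4 fields instead of 3

# The same record written with proper quoting parses correctly.
quoted = 'P1001,"Steel bar, 2m",25.50'
print(next(csv.reader(io.StringIO(quoted))))   # -> ['P1001', 'Steel bar, 2m', '25.50']
```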
19 January 2010
⛏️Data Management: Consistency (Definitions)
18 January 2010
🗄️Data Management: Data Quality Dimensions (Part VI: Referential Integrity)
Data Management Series |
During conversions or data migrations it is important to ensure that the various datasets loaded match the referential and domain integrity of the database into which the data will be loaded, otherwise the records not respecting these types of integrity will be rejected. The rejection itself might not be a problem for a few records, though when it happens at a large scale the situation changes dramatically, especially when the system gives no adequate messages about the cause of rejection. A recommended approach is to ensure that the scope is synchronized between the various data elements, and that the referential integrity of the datasets is validated before the data are loaded into the destination database.
Note:
Expect the unforeseeable! It's always a good idea to check whether referential integrity is maintained by a system – there are so many things that can go wrong! In data migration solutions, data warehouses and, more generally, analytical solutions, it is a good idea to have mechanisms in place that check for this kind of issue.
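As a minimal sketch of such a check, the following Python snippet validates the foreign keys of a made-up invoice dataset against the vendor keys already present in the target system, flagging the records that would otherwise be rejected.

```python
# Minimal sketch: checking referential integrity before loading (made-up vendor/invoice data).
vendors = {"V001", "V002", "V003"}                       # keys already present in the target system

invoices = [
    {"invoice_no": "INV-1", "vendor_id": "V001"},
    {"invoice_no": "INV-2", "vendor_id": "V009"},        # orphaned reference
    {"invoice_no": "INV-3", "vendor_id": None},          # missing reference
]

orphans = [i for i in invoices if i["vendor_id"] not in vendors]

if orphans:
    print("records that would be rejected by the target system:")
    for record in orphans:
        print(" ", record)
```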
⛏️Data Management: Accuracy (Definitions)
"(1) A qualitative assessment of correctness, or freedom from error. (2) A quantitative measure of the magnitude of error." (IEEE, "IEEE Standard Glossary of Software Engineering Terminology", 1990)
[accuracy (of measurement):] "Closeness of the agreement between the result of a measurement and a true value of the measurand." (International Vocabulary of Basic and General Terms in Metrology, 1993)
"A qualitative assessment of freedom from error or a quantitative measure of the magnitude of error, expressed as a function of relative error." (William H Inmon, "Building the Data Warehouse", 2005)
"Accuracy is the closeness of a measured value to the true value."
"A data element’s degree of conformity to an established business measurement or definition. Data precision is the degree to which further measurements or definitions will show the same results." (Jill Dyché & Evan Levy, "Customer Data Integration: Reaching a Single Version of the Truth", 2006)
"Degree of conformity of a measure to a standard or a true value. Level of precision or detail." (Martin J Eppler, "Managing Information Quality" 2nd Ed., 2006)
"The accuracy reflects the number of times the model is correct." (Glenn J Myatt, "Making Sense of Data: A Practical Guide to Exploratory Data Analysis and Data Mining", 2006)
"An aspect of numerical data quality connected with a standard statistical error between a real parameter value and the corresponding value given by the data. Data accuracy is inversely proportional to this error." (Juliusz L Kulikowski, "Data Quality Assessment", 2009)
"An inherent quality characteristic that is a measure of the degree to which data agrees with an original source of data (such as a form, document, or unaltered electronic data) received from an acknowledged source outside the control of the organization." (David C Hay, "Data Model Patterns: A Metadata Map", 2010) [accuracy in regard to a surrogate source]
"An inherent quality characteristic that is a measure of the degree to which data accurately reflects the real-world object or event being described. Accuracy is the highest degree of inherent information quality possible." (David C Hay, "Data Model Patterns: A Metadata Map", 2010) [accuracy in regard to reality]
"Freedom from mistakes or error, conformity to truth or to a standard, exactness, the degree of conformity of a measure to a standard or true value. (Michael Brackett, 2011)
"The degree to which a data attribute value closely and correctly describes its business entity instance (the 'real life' entities) as of a point in time." (DAMA International, "The DAMA Dictionary of Data Management", 2011)
"Accuracy is the quality or state of being correct or precise; accurate information is correct in all details (NOAD)." (Laura Sebastian-Coleman, "Measuring Data Quality for Ongoing Improvement ", 2012)
"Within the quality management system, accuracy is an assessment of correctness." (For Dummies, "PMP Certification All-in-One For Dummies" 2nd Ed., 2013)
"How closely a measurement or assessment reflects the true value. Not to be confused with precision [...]" (Kenneth A Shaw, "Integrated Management of Processes and Information", 2013)
"Accuracy is defined as a measure of whether the value of a given data element is correct and reflects the real world as viewed by a valid real-world source (SME, customer, hard-copy record, etc.)." (Rajesh Jugulum, "Competing with High Quality Data", 2014)
"Within the quality management system, accuracy is an assessment of correctness." (Project Management Institute, "A Guide to the Project Management Body of Knowledge (PMBOK® Guide)" 6th Ed., 2017)
"The degree to which the data reflect the truth or reality. A spelling mistake is a good example of inaccurate data." (Piethein Strengholt, "Data Management at Scale", 2020)
"The degree to which the semantic assertions of a model are accepted to be true." (Panos Alexopoulos, "Semantic Modeling for Data", 2020)
"The degree of how closely the data represents the true value of the attribute in the real-world context." (Zhamak Dehghani, "Data Mesh: Delivering Data-Driven Value at Scale", 2021)
"Closeness of computations or estimates to the exact or true values that the statistics were intended to measure." (SDMX)
"The capability of the software product to provide the right or agreed results or effects with the needed degree of precision." [ISO/IEC 25000]
"The closeness of agreement between an observed value and an accepted reference value." (American Society for Quality)
"The term “accuracy” refers to the degree to which information accurately reflects an event or object described." (Precisely) [source]
17 January 2010
🗄️Data Management: Data Quality Dimensions (Part IV: Accuracy)
Another concept related to accuracy and measurement systems is precision, which refers to the degree to which repeated measurements under unchanged conditions lead to the same results; further concepts associated with it are repeatability and reproducibility. Even if accuracy and precision are often confounded, a measurement system can be accurate but not precise, or precise but not accurate (see the target analogy), a valid measurement system thus targeting both aspects. Accuracy and precision can be considered dimensions of correctness.
Coming back to accuracy and its use in determining data quality: accuracy is typically strongly related to the measurement tools used, which means the measurements need to be repeated for all or a sample of the dataset in order to identify whether the requested level of accuracy is met, an approach that can involve quite an effort. Accuracy also depends on whether the systems used to store the data are designed to store it at the requested level of accuracy, a fact reflected by the characteristics of the data types used (e.g. precision, length).
Given that a system stores related data (e.g. weight, height, width, length) that should satisfy physical, business or common-sense rules, such rules can be used to check whether the data satisfy them with the desired level of approximation. For example, knowing the height, width, length and composition of a material (e.g. a metal bar), the approximate weight can be determined and compared with the entered weight; if the difference falls outside a certain interval, then most probably one of the values was entered incorrectly. There are even simpler rules that might apply, for example that physical dimensions must be positive real values or, in a generalized formulation, that they must respect maximal or minimal limits that lead to the identification of outliers, etc. In fact, most of the time determining data accuracy comes down to defining possible value intervals, though there will also be cases in which complex models and specific techniques are built for this purpose.
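The metal-bar example can be turned into a simple plausibility rule: compute the expected weight from the entered dimensions and an assumed material density, and flag the records whose entered weight deviates beyond a tolerance. The values, density and tolerance in the Python sketch below are made up for illustration.

```python
# Minimal sketch: plausibility check of an entered weight against dimensions (made-up values).
DENSITY_STEEL = 7850          # kg/m3, approximate density of steel
TOLERANCE = 0.10              # accept a 10% deviation between entered and computed weight

bars = [
    # length, width, height in metres; weight in kg as entered by the user
    {"item": "BAR-1", "length": 2.0, "width": 0.05, "height": 0.05, "weight": 39.0},
    {"item": "BAR-2", "length": 1.0, "width": 0.05, "height": 0.05, "weight": 95.0},  # implausible
    {"item": "BAR-3", "length": 1.5, "width": 0.04, "height": 0.04, "weight": -3.0},  # negative
]

for bar in bars:
    expected = bar["length"] * bar["width"] * bar["height"] * DENSITY_STEEL
    if bar["weight"] <= 0:
        print(f"{bar['item']}: weight must be a positive value")
    elif abs(bar["weight"] - expected) / expected > TOLERANCE:
        print(f"{bar['item']}: entered {bar['weight']} kg vs. expected ~{expected:.1f} kg")
```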
There is another important aspect related to accuracy, the time dependency of data – whether or not the data change with time. Data currency, or actuality, refers to the extent to which data are current. Given the above definition of accuracy, currency could be considered a special type of accuracy, because when the data are not current they don't reflect reality. If currency is considered a standalone data quality dimension, then accuracy refers only to data that are not time dependent.
13 January 2010
🗄️Data Management: Data Quality Dimensions (Part III: Completeness)
Missing values are typically considered in relation to mandatory attributes, attributes that need a non-Null value for each record, though depending on the case the notion may be applied to non-mandatory (optional) attributes too, for example when the intention is to understand whether the attributes are adequately maintained. It's interesting that [1] also considers inapplicable attributes, referring to attributes that are not applicable (relevant) for certain scenarios (e.g. physical dimensions for service-based materials), which together with the applicable (relevant) attributes can be seen as another way of categorizing attributes. Whether an attribute is mandatory is decided by the business context and not necessarily by the physical structure containing the attribute; in other words, an attribute could be optional as per the database schema and mandatory per the business rules.
'Missing records' can be a misleading term because it is used in several contexts; however, within the context of data completeness it refers only to the cases not covered by data integrity. For example, in parent-child table relations the header data was entered but the detail data is missing, either not entered or deleted; such a case is not covered by referential integrity because there is no missing reference, just a parent without child data (1:n cardinality).
Normally the completeness of parent-child relations is enforced with the help of referential integrity and database transactions - a set of actions performed as a single unit of work - which allow saving the parent data only if the child data were saved successfully, though this type of constraint is not always necessary.
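Both aspects - unfilled mandatory attributes and headers without detail records - can be checked with a few lines of code before drawing conclusions about completeness. The Python sketch below uses made-up order and order-line structures.

```python
# Minimal sketch: completeness checks for mandatory attributes and parent-child data (made-up structures).
mandatory = ["order_no", "customer", "order_date"]

orders = [
    {"order_no": "SO-1", "customer": "Alpha", "order_date": "2010-01-10"},
    {"order_no": "SO-2", "customer": None,    "order_date": "2010-01-11"},   # mandatory field missing
    {"order_no": "SO-3", "customer": "Beta",  "order_date": "2010-01-12"},
]

order_lines = [
    {"order_no": "SO-1", "item": "A", "qty": 2},
    {"order_no": "SO-2", "item": "B", "qty": 1},
    # SO-3 has a header but no lines ("missing records")
]

for order in orders:
    gaps = [a for a in mandatory if order.get(a) in (None, "")]
    if gaps:
        print(f"{order['order_no']}: mandatory attribute(s) not filled: {gaps}")

headers_with_lines = {line["order_no"] for line in order_lines}
for order in orders:
    if order["order_no"] not in headers_with_lines:
        print(f"{order['order_no']}: header without detail records")
```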
References:
[1] David Loshin (2009) "Master Data Management"
🗄️Data Management: Data Quality Dimensions (Part II: Conformity)
Because distinct decimal, digit grouping, negative sign and currency symbols can be used to represent numeric values, just as different date formats can be used alternatively (e.g. dd-mm-yyyy vs. mm-dd-yyyy), the numeric and date data types are highly sensitive to local computer and general application settings, because the same attribute could be stored, processed and represented in different formats. Therefore, it's preferable to minimize the variations in formatting by applying the same format to all attributes having the same data type and, whenever possible, the format should not be confusing.
While for decimals and dates there is a limited number of possibilities that can be dealt with, for alphanumeric values things change drastically because, except for the format masks that could be used during data entry, adherence to a format depends entirely on the Users and on whether they applied the formatting standards defined. In the absence of standards, Users might come up with their own encoding, and even then they might change it over time.
As far as format is concerned, alphanumeric attributes offer greater flexibility than decimal and date attributes, and their formatting could in theory be ignored unless they are further parsed by other applications. However, considering that such needs change over time, it's advisable to standardize the various formats used within an organization and to use 'standard' delimiters for formatting the various chunks of data with a particular meaning within an alphanumeric attribute, which could considerably reduce the volume of rework needed to cleanse the data for further processing. An encoding could be done without delimiters, e.g. when the length of each chunk of data is the same, though length-based formatting could prove limiting when the length of a chunk changes.
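As a small illustration of parsing formatted values, the Python sketch below splits an invented product number encoded with a dash delimiter into its chunks and normalizes two common local date formats to a single standard; both the encoding and the formats are made up for the example.

```python
# Minimal sketch: parsing an encoded alphanumeric value and normalizing date formats (invented examples).
from datetime import datetime

# A product number encoded as family-size-colour, using "-" as delimiter.
product_no = "PUMP-050-BLK"
family, size, colour = product_no.split("-")
print(family, size, colour)

# The same date written in two local formats, normalized to ISO 8601.
for raw, fmt in [("13-01-2010", "%d-%m-%Y"), ("01/13/2010", "%m/%d/%Y")]:
    print(datetime.strptime(raw, fmt).strftime("%Y-%m-%d"))
```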
Note:
Delimiters should be chosen from the characters that will never be used in the actual chunks of data or in the various applications dealing with the respective data. For example, the pipe (“|”) or semicolon (“;”) could be good candidates for such a delimiter, though they are often used as delimiters when exporting data to text files; it's therefore better to use a dash (“-”), or even a combination of characters (e.g. “.-.”) when a dash is not enough, while in some cases even a space or a dot could be used as a delimiter.
🗄️Data Management: Data Quality Dimensions (Part I: Uniqueness)
Data Management Series |
Why is it required to enforce the uniqueness of entities? An entity is defined using a number of attributes representing the entity's characteristics; if the attributes of two entities have the same values, then most likely the two representations refer to the same entity. This holds in most cases, though there are situations in which the attribute(s) that make(s) it possible to differentiate between two distinct entities is/are not adequately maintained or not considered at all. The impossibility of uniquely identifying an entity increases the chances of using one of the respective entities wrongly, for example booking an Invoice against the wrong Vendor, with all the implications derived from it.
Multiple identifiers for the same entity may exist, though this can easily create confusion, especially when it happens within the same system and people or machines are not aware that the respective identifiers refer to the same entity; the more identifiers exist for the same entity, the higher the chances of confusion. Imagine that in the same system you book some of the Invoices against one identifier and the remaining Invoices against another identifier of the same entity.
There are even situations in which the use of multiple identifiers for the same entity is inevitable; important here are the cases in which several different systems refer to the same entity each in their own way, often not being designed to use a globally unique identifier for an entity. Under such circumstances, what an organization can do is either extend the system(s) to store the global identifier or have a mapping in place between the identifiers used to indicate the same entity.
Even if such keys are designed to enforce uniqueness, this doesn't mean it necessarily happens; the reason is simple - behind the assignment of a key to an entity there is a defined process, and no matter how robust the process, if it doesn't cover all the scenarios that might occur or if it can't deal (adequately) with various constraints, then at some point in time a flow might lead to a duplicate, a non-unique record. Take for example the creation of a Product Number - suppose that a Product Number XYZ is created in an ERP system and one or more of the respective Product's attributes are incorrect. In many cases such errors can be corrected, though each system comes with its own constraints that don't allow changing certain attributes (e.g. the Unit of Measure once Inventory was booked). Because an entity with the wrong attributes is inadequate to use, a new Product Number is created for the same entity, the two identifiers coexisting for some time. (Some systems allow building a relationship between the Products, showing that one Product replaces another going forward.)
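A rough first step in spotting such duplicates is to compare records on the descriptive attributes that are supposed to differentiate them. The Python sketch below groups made-up product records by a normalized combination of description and unit of measure and reports the groups with more than one identifier.

```python
# Minimal sketch: finding potential duplicates by grouping on descriptive attributes (hypothetical data).
from collections import defaultdict

products = [
    {"product_no": "P-100", "description": "Steel Bar 2m",  "uom": "PCS"},
    {"product_no": "P-205", "description": "steel bar 2M ", "uom": "pcs"},   # likely the same entity
    {"product_no": "P-300", "description": "Copper Wire",   "uom": "M"},
]

def fingerprint(p):
    # Normalize the attributes used to differentiate entities.
    return (p["description"].strip().lower(), p["uom"].strip().lower())

groups = defaultdict(list)
for p in products:
    groups[fingerprint(p)].append(p["product_no"])

for key, ids in groups.items():
    if len(ids) > 1:
        print("potential duplicates:", ids, "->", key)
```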