SQL Troubles: 🗃️Data Management: Data Quality (Just the Quotes)

27 December 2017

"[...] it is a function of statistical method to emphasize that precise conclusions cannot be drawn from inadequate data." (Egon S Pearson & H O Hartley, "Biometrika Tables for Statisticians" Vol. 1, 1954)

"Not even the most subtle and skilled analysis can overcome completely the unreliability of basic data." (Roy D G Allen, "Statistics for Economists", 1951)

"The enthusiastic use of statistics to prove one side of a case is not open to criticism providing the work is honestly and accurately done, and providing the conclusions are not broader than indicated by the data. This type of work must not be confused with the unfair and dishonest use of both accurate and inaccurate data, which too commonly occurs in business. Dishonest statistical work usually takes the form of: (1) deliberate misinterpretation of data; (2) intentional making of overestimates or underestimates; and (3) biasing results by using partial data, making biased surveys, or using wrong statistical methods." (John R Riggleman & Ira N Frisbee, "Business Statistics", 1951)

"Data are of high quality if they are fit for their intended use in operations, decision-making, and planning." (Joseph M Juran, 1964)

"There is no substitute for honest, thorough, scientific effort to get correct data (no matter how much it clashes with preconceived ideas). There is no substitute for actually reaching a correct chain of reasoning. Poor data and good reasoning give poor results. Good data and poor reasoning give poor results. Poor data and poor reasoning give rotten results." (Edmund C Berkeley, "Computers and Automation", 1969)

"Detailed study of the quality of data sources is an essential part of applied work. [...] Data analysts need to understand more about the measurement processes through which their data come. To know the name by which a column of figures is headed is far from being enough." (John W Tukey, "An Overview of Techniques of Data Analysis, Emphasizing Its Exploratory Aspects", 1982)

"We have found that some of the hardest errors to detect by traditional methods are unsuspected gaps in the data collection (we usually discovered them serendipitously in the course of graphical checking)." (Peter Huber, "Huge data sets", Compstat '94: Proceedings, 1994)

"Data obtained without any external disturbance or corruption are called clean; noisy data mean that a small random ingredient is added to the clean data." (Nikola K Kasabov, "Foundations of Neural Networks, Fuzzy Systems, and Knowledge Engineering", 1996)

"Probability theory is a serious instrument for forecasting, but the devil, as they say, is in the details - in the quality of information that forms the basis of probability estimates." (Peter L Bernstein, "Against the Gods: The Remarkable Story of Risk", 1996)

"Unfortunately, just collecting the data in one place and making it easily available isn’t enough. When operational data from transactions is loaded into the data warehouse, it often contains missing or inaccurate data. How good or bad the data is a function of the amount of input checking done in the application that generates the transaction. Unfortunately, many deployed applications are less than stellar when it comes to validating the inputs. To overcome this problem, the operational data must go through a 'cleansing' process, which takes care of missing or out-of-range values. If this cleansing step is not done before the data is loaded into the data warehouse, it will have to be performed repeatedly whenever that data is used in a data mining operation." (Joseph P Bigus, "Data Mining with Neural Networks: Solving business problems from application development to decision support", 1996)

"If the data is usually bad, and you find that you have to gather some data, what can you do to do a better job? First, recognize what I have repeatedly said to you, the human animal was not designed to be reliable; it cannot count accurately, it can do little or nothing repetitive with great accuracy. [...] Second, you cannot gather a really large amount of data accurately. It is a known fact which is constantly ignored. It is always a matter of limited resources and limited time. [...] Third, much social data is obtained via questionnaires. But it is a well documented fact the way the questions are phrased, the way they are ordered in sequence, the people who ask them or come along and wait for them to be filled out, all have serious effects on the answers." (Richard Hamming, "The Art of Doing Science and Engineering: Learning to Learn", 1997)

"Blissful data consist of information that is accurate, meaningful, useful, and easily accessible to many people in an organization. These data are used by the organization’s employees to analyze information and support their decision-making processes to strategic action. It is easy to see that organizations that have reached their goal of maximum productivity with blissful data can triumph over their competition. Thus, blissful data provide a competitive advantage." (Margaret Y Chu, "Blissful Data", 2004)

"Let’s define dirty data as: ‘… data that are incomplete, invalid, or inaccurate’. In other words, dirty data are simply data that are wrong. […] Incomplete or inaccurate data can result in bad decisions being made. Thus, dirty data are the opposite of blissful data. Problems caused by dirty data are significant; be wary of their pitfalls." (Margaret Y Chu, "Blissful Data", 2004)

"Processes must be implemented to prevent bad data from entering the system as well as propagating to other systems. That is, dirty data must be intercepted at its source. The operational systems are often the source of informational data; thus dirty data must be fixed at the operational data level. Implementing the right processes to cleanse data is, however, not easy." (Margaret Y Chu, "Blissful Data", 2004)

"Equally critical is to include data quality definition and acceptable quality benchmarks into the conversion specifications. No product design skips quality specifications, including quality metrics and benchmarks. Yet rare data conversion follows suit. As a result, nobody knows how successful the conversion project was until data errors get exposed in the subsequent months and years. The solution is to perform comprehensive data quality assessment of the target data upon conversion and compare the results with pre-defined benchmarks." (Arkady Maydanchik, "Data Quality Assessment", 2007)

"Much data in databases has a long history. It might have come from old 'legacy' systems or have been changed several times in the past. The usage of data fields and value codes changes over time. The same value in the same field will mean totally different thing in different records. Knowledge of these facts allows experts to use the data properly. Without this knowledge, the data may be used literally and with sad consequences. The same is about data quality. Data users in the trenches usually know good data from bad and can still use it efficiently. They know where to look and what to check. Without these experts, incorrect data quality assumptions are often made and poor data quality becomes exposed." (Arkady Maydanchik, "Data Quality Assessment", 2007)

"The big part of the challenge is that data quality does not improve by itself or as a result of general IT advancements. Over the years, the onus of data quality improvement was placed on modern database technologies and better information systems. [...] In reality, most IT processes affect data quality negatively. Thus, if we do nothing, data quality will continuously deteriorate to the point where the data will become a huge liability." (Arkady Maydanchik, "Data Quality Assessment", 2007)

"While we might attempt to identify and correct most data errors, as well as try to prevent others from entering the database, the data quality will never be perfect. Perfection is practically unattainable in data quality as with the quality of most other products. In truth, it is also unnecessary since at some point improving data quality becomes more expensive than leaving it alone. The more efficient our data quality program, the higher level of quality we will achieve - but never will it reach 100%. However, accepting imperfection is not the same as ignoring it. Knowledge of the data limitations and imperfections can help use the data wisely and thus save time and money. The challenge, of course, is making this knowledge organized and easily accessible to the target users. The solution is a comprehensive integrated data quality meta data warehouse." (Arkady Maydanchik, "Data Quality Assessment", 2007)

"Achieving a high level of data quality is hard and is affected significantly by organizational and ownership issues. In the short term, bandaging problems rather than addressing the root causes is often the path of least resistance." (Cindi Howson, "Successful Business Intelligence: Secrets to making BI a killer App", 2008)

"Communicate loudly and widely where there are data quality problems and the associated risks with deploying BI tools on top of bad data. Also advise the different stakeholders on what can be done to address data quality problems - systematically and organizationally. Complaining without providing recommendations fixes nothing." (Cindi Howson, "Successful Business Intelligence: Secrets to making BI a killer App", 2008)

"Data quality is such an important issue, and yet one that is not well understood or that excites business users. It’s often perceived as being a problem for IT to handle when it’s not: it’s for the business to own and correct." (Cindi Howson, "Successful Business Intelligence: Secrets to making BI a killer App", 2008)

"Depending on the extent of the data quality issues, be careful about where you deploy BI. Without a reasonable degree of confidence in the data quality, BI should be kept in the hands of knowledge workers and not extended to frontline workers and certainly not to customers and suppliers. Deploy BI in this limited fashion as data quality issues are gradually exposed, understood, and ultimately, addressed. Don’t wait for every last data quality issue to be resolved; if you do, you will never deliver any BI capabilities, business users will never see the problem, and quality will never improve." (Cindi Howson, "Successful Business Intelligence: Secrets to making BI a killer App", 2008)

"Our culture, obsessed with numbers, has given us the idea that what we can measure is more important than what we can't measure. Think about that for a minute. It means that we make quantity more important than quality." (Donella Meadows, "Thinking in Systems: A Primer", 2008)

"The data architecture is the most important technical aspect of your business intelligence initiative. Fail to build an information architecture that is flexible, with consistent, timely, quality data, and your BI initiative will fail. Business users will not trust the information, no matter how powerful and pretty the BI tools. However, sometimes it takes displaying that messy data to get business users to understand the importance of data quality and to take ownership of a problem that extends beyond business intelligence, to the source systems and to the organizational structures that govern a company’s data." (Cindi Howson, "Successful Business Intelligence: Secrets to making BI a killer App", 2008)

"Many new data scientists tend to rush past it to get their data into a minimally acceptable state, only to discover that the data has major quality issues after they apply their (potentially computationally intensive) algorithm and get a nonsense answer as output." (Sandy Ryza, "Advanced Analytics with Spark: Patterns for Learning from Data at Scale", 2015)

"Access to more information isn’t enough - the information needs to be correct, timely, and presented in a manner that enables the reader to learn from it. The current network is full of inaccurate, misleading, and biased information that often crowds out the valid information. People have not learned that 'popular' or 'available' information is not necessarily valid." (Gene Spafford, 2010)

"Are data quality and data governance the same thing? They share the same goal, essentially striving for the same outcome of optimizing data and information results for business purposes. Data governance plays a very important role in achieving high data quality. It deals primarily with orchestrating the efforts of people, processes, objectives, technologies, and lines of business in order to optimize outcomes around enterprise data assets. This includes, among other things, the broader cross-functional oversight of standards, architecture, business processes, business integration, and risk and compliance. Data governance is an organizational structure that oversees the compliance and standards of enterprise data." (Neera Bhansali, "Data Governance: Creating Value from Information Assets", 2014)

"Data governance is about putting people in charge of fixing and preventing data issues and using technology to help aid the process. Any time data is synchronized, merged, and exchanged, there have to be ground rules guiding this. Data governance serves as the method to organize the people, processes, and technologies for data-driven programs like data quality; they are a necessary part of any data quality effort." (Neera Bhansali, "Data Governance: Creating Value from Information Assets", 2014)

"Having data quality as a focus is a business philosophy that aligns strategy, business culture, company information, and technology in order to manage data to the benefit of the enterprise. Data quality is an elusive subject that can defy measurement and yet be critical enough to derail a single IT project, strategic initiative, or even an entire company." (Neera Bhansali, "Data Governance: Creating Value from Information Assets", 2014)

"Accuracy and coherence are related concepts pertaining to data quality. Accuracy refers to the comprehensiveness or extent of missing data, performance of error edits, and other quality assurance strategies. Coherence is the degree to which data - item value and meaning are consistent over time and are comparable to similar variables from other routinely used data sources." (Aileen Rothbard, "Quality Issues in the Use of Administrative Data Records", 2015)

"How good the data quality is can be looked at both subjectively and objectively. The subjective component is based on the experience and needs of the stakeholders and can differ by who is being asked to judge it. For example, the data managers may see the data quality as excellent, but consumers may disagree. One way to assess it is to construct a survey for stakeholders and ask them about their perception of the data via a questionnaire. The other component of data quality is objective. Measuring the percentage of missing data elements, the degree of consistency between records, how quickly data can be retrieved on request, and the percentage of incorrect matches on identifiers (same identifier, different social security number, gender, date of birth) are some examples." (Aileen Rothbard, "Quality Issues in the Use of Administrative Data Records", 2015)
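Two of the objective measures Rothbard lists, the percentage of missing data elements and identifiers that match on key but disagree on an attribute, can be sketched in Python. This is only an illustration under stated assumptions, not a method from the quoted source: the function names `missing_percentage` and `conflicting_identifiers`, the record layout (a list of dictionaries), and the field names are all hypothetical.

```python
from collections import defaultdict
from typing import Any, Hashable


def missing_percentage(records: list[dict[str, Any]], field: str) -> float:
    """Percentage of records in which `field` is absent or None."""
    if not records:
        return 0.0
    missing = sum(1 for record in records if record.get(field) is None)
    return 100.0 * missing / len(records)


def conflicting_identifiers(
    records: list[dict[str, Any]], id_field: str, attribute: str
) -> set[Hashable]:
    """Identifiers that occur with more than one distinct value of `attribute`."""
    values_by_id: dict[Hashable, set] = defaultdict(set)
    for record in records:
        value = record.get(attribute)
        # Missing values are counted by missing_percentage, not treated as conflicts.
        if value is not None:
            values_by_id[record[id_field]].add(value)
    return {key for key, values in values_by_id.items() if len(values) > 1}
```

For example, calling `conflicting_identifiers(records, "id", "dob")` would flag every identifier that is recorded with two different dates of birth, which is the "same identifier, different date of birth" case from the quote.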

"When we find data quality issues due to valid data during data exploration, we should note these issues in a data quality plan for potential handling later in the project. The most common issues in this regard are missing values and outliers, which are both examples of noise in the data." (John D Kelleher et al, "Fundamentals of Machine Learning for Predictive Data Analytics: Algorithms, worked examples, and case studies", 2015)

"A popular misconception holds that the era of Big Data means the end of a need for sampling. In fact, the proliferation of data of varying quality and relevance reinforces the need for sampling as a tool to work efficiently with a variety of data, and minimize bias. Even in a Big Data project, predictive models are typically developed and piloted with samples." (Peter C Bruce & Andrew G Bruce, "Practical Statistics for Data Scientists: 50 Essential Concepts", 2016)

"Metadata is the key to effective data governance. Metadata in this context is the data that defines the structure and attributes of data. This could mean data types, data privacy attributes, scale, and precision. In general, quality of data is directly proportional to the amount and depth of metadata provided. Without metadata, consumers will have to depend on other sources and mechanisms." (Saurabh Gupta et al, "Practical Enterprise Data Lake Insights", 2018)

"The quality of data that flows within a data pipeline is as important as the functionality of the pipeline. If the data that flows within the pipeline is not a valid representation of the source data set(s), the pipeline doesn’t serve any real purpose. It’s very important to incorporate data quality checks within different phases of the pipeline. These checks should verify the correctness of data at every phase of the pipeline. There should be clear isolation between checks at different parts of the pipeline. The checks include checks like row count, structure, and data type validation." (Saurabh Gupta et al, "Practical Enterprise Data Lake Insights", 2018)
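The three checks named in this quote (row count, structure, and data type validation) can be sketched for a single pipeline phase as follows. This is a minimal illustration, not the authors' implementation: the function name `check_batch`, the `expected_schema` mapping of column names to Python types, and the in-memory list-of-dictionaries batch format are all assumptions made for the example.

```python
from typing import Any


def check_batch(
    rows: list[dict[str, Any]],
    expected_schema: dict[str, type],
    expected_count: int,
) -> list[str]:
    """Return a list of problems found in the batch; an empty list means it passed."""
    problems: list[str] = []

    # Row count check: the batch should hold as many rows as the source reported.
    if len(rows) != expected_count:
        problems.append(f"row count: expected {expected_count}, got {len(rows)}")

    for index, row in enumerate(rows):
        # Structure check: each row must carry exactly the expected columns.
        if set(row) != set(expected_schema):
            problems.append(f"row {index}: columns {sorted(row)} do not match schema")
            continue
        # Data type check: None is treated as a missing value, not a type error.
        for column, expected_type in expected_schema.items():
            value = row[column]
            if value is not None and not isinstance(value, expected_type):
                problems.append(
                    f"row {index}: {column} is {type(value).__name__}, "
                    f"expected {expected_type.__name__}"
                )
    return problems
```

Running the same function with a different `expected_schema` after each phase is one way to keep the checks isolated between parts of the pipeline, as the quote recommends.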

"Are your insights based on data that is accurate and reliable? Trustworthy data is correct or valid, free from significant defects and gaps. The trustworthiness of your data begins with the proper collection, processing, and maintenance of the data at its source. However, the reliability of your numbers can also be influenced by how they are handled during the analysis process. Clean data can inadvertently lose its integrity and true meaning depending on how it is analyzed and interpreted." (Brent Dykes, "Effective Data Storytelling: How to Drive Change with Data, Narrative and Visuals", 2019)

"First, from an ethos perspective, the success of your data story will be shaped by your own credibility and the trustworthiness of your data. Second, because your data story is based on facts and figures, the logos appeal will be integral to your message. Third, as you weave the data into a convincing narrative, the pathos or emotional appeal makes your message more engaging. Fourth, having a visualized insight at the core of your message adds the telos appeal, as it sharpens the focus and purpose of your communication. Fifth, when you share a relevant data story with the right audience at the right time (kairos), your message can be a powerful catalyst for change." (Brent Dykes, "Effective Data Storytelling: How to Drive Change with Data, Narrative and Visuals", 2019)

"The one unique characteristic that separates a data story from other types of stories is its fundamental basis in data. [...] The building blocks of every data story are quantitative or qualitative data, which are frequently the results of an analysis or insightful observation. Because each data story is formed from a collection of facts, each one represents a work of nonfiction. While some creativity may be used in how the story is structured and delivered, a true data story won’t stray too far from its factual underpinnings. In addition, the quality and trustworthiness of the data will determine how credible and powerful the data story is." (Brent Dykes, "Effective Data Storytelling: How to Drive Change with Data, Narrative and Visuals", 2019)

"Data is dirty. Let's just get that out there. How is it dirty? In all sorts of ways. Misspelled text values, date format problems, mismatching units, missing values, null values, incompatible geospatial coordinate formats, the list goes on and on." (Ben Jones, "Avoiding Data Pitfalls: How to Steer Clear of Common Blunders When Working with Data and Presenting Analysis and Visualizations", 2020)

"Bad data makes bad models. Bad models instruct people to make ineffective or harmful interventions. Those bad interventions produce more bad data, which is fed into more bad models." (Cory Doctorow, "Machine Learning’s Crumbling Foundations", 2021)

"[...] data mesh introduces a fundamental shift that the owners of the data products must communicate and guarantee an acceptable level of quality and trustworthiness - specific to their domain - as an intrinsic characteristic of their data product. This means cleansing and running automated data integrity tests at the point of the creation of a data product." (Zhamak Dehghani, "Data Mesh: Delivering Data-Driven Value at Scale", 2021)

"Ensure you build into your data literacy strategy learning on data quality. If the individuals who are using and working with data do not understand the purpose and need for data quality, we are not sitting in a strong position for great and powerful insight. What good will the insight be, if the data has no quality within the model?" (Jordan Morrow, "Be Data Literate: The data literacy skills everyone needs to succeed", 2021)

"[...] the governance function is accountable to define what constitutes data quality and how each data product communicates that in a standard way. It’s no longer accountable for the quality of each data product. The platform team is accountable to build capabilities to validate the quality of the data and communicate its quality metrics, and each domain (data product owner) is accountable to adhere to the quality standards and provide quality data products." (Zhamak Dehghani, "Data Mesh: Delivering Data-Driven Value at Scale", 2021)

"Bad data is costly to fix, and it’s more costly the more widespread it is. Everyone who has accessed, used, copied, or processed the data may be affected and may require mitigating action on their part. The complexity is further increased by the fact that not every consumer will “fix” it in the same way. This can lead to divergent results that are divergent with others and can be a nightmare to detect, track down, and rectify." (Adam Bellemare, "Building an Event-Driven Data Mesh: Patterns for Designing and Building Event-Driven Architectures", 2023)

"Data has historically been treated as a second-class citizen, as a form of exhaust or by-product emitted by business applications. This application-first thinking remains the major source of problems in today’s computing environments, leading to ad hoc data pipelines, cobbled together data access mechanisms, and inconsistent sources of similar-yet-different truths. Data mesh addresses these shortcomings head-on, by fundamentally altering the relationships we have with our data. Instead of a secondary by-product, data, and the access to it, is promoted to a first-class citizen on par with any other business service." (Adam Bellemare, "Building an Event-Driven Data Mesh: Patterns for Designing and Building Event-Driven Architectures", 2023)

"In truth, no one knows how much bad data quality costs a company – even companies with mature data quality initiatives in place, who are measuring hundreds of data points for their quality struggle to accurately measure quantitative impact. This is often a deal-breaker for senior leaders when trying to get approval for a budget for data quality work. Data quality initiatives often seek substantial budgets and are up against projects with more tangible benefits." (Robert Hawker, "Practical Data Quality", 2023)

"The biggest mistake that can be made in a data quality initiative is focusing on the wrong data. If you fix data that does not impact a critical business process or drive important decisions, your initiative simply will not make the difference that you want it to." (Robert Hawker, "Practical Data Quality", 2023)

"The data should be monitored in the source, it should be corrected in the source, and it should then feed the secondary source(s) with high-quality data that can be used without workarounds. The reduction in workarounds will make the data engineers, scientists, and data visualization specialists much more productive." (Robert Hawker, "Practical Data Quality", 2023)

"The problem of bad data has existed for a very long time. Data copies diverge as their original source changes. Copies get stale. Errors detected in one data set are not fixed in duplicate ones. Domain knowledge related to interpreting and understanding data remains incomplete, as does support from the owners of the original data." (Adam Bellemare, "Building an Event-Driven Data Mesh: Patterns for Designing and Building Event-Driven Architectures", 2023)

"Data Lakes embrace a schema-on-read approach, storing vast volumes of raw or lightly processed data in native formats with minimal upfront constraints. This design significantly enhances ingestion velocity and accommodates diverse, unstructured, or semi-structured datasets. However, enforcing data quality at scale becomes more complex, as traditional static constraints are absent." (William Smith, "Great Expectations for Modern Data Quality: The Complete Guide for Developers and Engineers", 2025)

"Data quality insights generated through automated profiling and baseline analysis are only as valuable as their visibility and actionability within the broader organizational decision-making context." (William Smith, "Soda Core for Modern Data Quality and Observability: The Complete Guide for Developers and Engineers", 2025)

"Data quality verification, when executed as a set of static, invariant rules, often fails to accommodate the inherent fluidity of real-world datasets and evolving analytical contexts. To ensure robustness and relevance, quality checks must evolve beyond static constraints, incorporating adaptability driven by metadata, runtime information, and domain-specific business logic. This transformation enables the development of dynamic and context-aware validation systems capable of offering intelligent, self-tuning quality enforcement with reduced false positives and operational noise." (William Smith, "Soda Core for Modern Data Quality and Observability: The Complete Guide for Developers and Engineers", 2025)

"Errors using inadequate data are much less than those using no data at all." (Charles Babbage)
