13 November 2011

📉Graphical Representation: Missing Data (Just the Quotes)

"Missing data values pose a particularly sticky problem for symbols. For instance, if the ray corresponding to a missing value is simply left off of a star symbol, the result will be almost indistinguishable from a minimum (i.e., an extreme) value. It may be better either (i) to impute a value, perhaps a median for that variable, or a fitted value from some regression on other variables, (ii) to indicate that the value is missing, possibly with a dashed line, or (iii) not to draw the symbol for a particular observation if any value is missing." (John M Chambers et al, "Graphical Methods for Data Analysis", 1983)

"We often think, naïvely, that missing data are the primary impediments to intellectual progress - just find the right facts and all problems will dissipate. But barriers are often deeper and more abstract in thought. We must have access to the right metaphor, not only to the requisite information. Revolutionary thinkers are not, primarily, gatherers of facts, but weavers of new intellectual structures." (Stephen J Gould, "The Flamingo's Smile: Reflections in Natural History", 1985)

"Statistics depend on collecting information. If questions go unasked, or if they are asked in ways that limit responses, or if measures count some cases but exclude others, information goes ungathered, and missing numbers result. Nevertheless, choices regarding which data to collect and how to go about collecting the information are inevitable." (Joel Best, "More Damned Lies and Statistics: How numbers confuse public issues", 2004)

"People tend to give greater weight to the data that they have just been exposed to than other relevant data. […] This phenomenon, where people give greater attention to recent or easily available data, is often referred to as an availability error." (Alan Graham, "Developing Thinking in Statistics", 2006)

"There are several key issues in the field of statistics that impact our analyses once data have been imported into a software program. These data issues are commonly referred to as the measurement scale of variables, restriction in the range of data, missing data values, outliers, linearity, and nonnormality." (Randall E Schumacker & Richard G Lomax, "A Beginner’s Guide to Structural Equation Modeling" 3rd Ed., 2010)

"[…] events will always occur that cannot be foreseen by following a chain of logical deductive reasoning. Successful prediction requires intuitive leaps and/or information that is not part of the original data available." (John L Casti, "X-Events: The Collapse of Everything", 2012)

"Missing data is the blind spot of statisticians. If they are not paying full attention, they lose track of these little details. Even when they notice, many unwittingly sway things our way. Most ranking systems ignore missing values." (Kaiser Fung, "Numbersense: How To Use Big Data To Your Advantage", 2013)

"Having NUMBERSENSE means: (•) Not taking published data at face value; (•) Knowing which questions to ask; (•) Having a nose for doctored statistics. [...] NUMBERSENSE is that bit of skepticism, urge to probe, and desire to verify. It’s having the truffle hog’s nose to hunt the delicacies. Developing NUMBERSENSE takes training and patience. It is essential to know a few basic statistical concepts. Understanding the nature of means, medians, and percentile ranks is important. Breaking down ratios into components facilitates clear thinking. Ratios can also be interpreted as weighted averages, with those weights arranged by rules of inclusion and exclusion. Missing data must be carefully vetted, especially when they are substituted with statistical estimates. Blatant fraud, while difficult to detect, is often exposed by inconsistency." (Kaiser Fung, "Numbersense: How To Use Big Data To Your Advantage", 2013)

"Accuracy and coherence are related concepts pertaining to data quality. Accuracy refers to the comprehensiveness or extent of missing data, performance of error edits, and other quality assurance strategies. Coherence is the degree to which data - item value and meaning are consistent over time and are comparable to similar variables from other routinely used data sources." (Aileen Rothbard, "Quality Issues in the Use of Administrative Data Records", 2015)

"There are several key issues in the field of statistics that impact our analyses once data have been imported into a software program. These data issues are commonly referred to as the measurement scale of variables, restriction in the range of data, missing data values, outliers, linearity, and nonnormality." (Randall E Schumacker & Richard G Lomax, "A Beginner’s Guide to Structural Equation Modeling" 3rd Ed., 2010)

"[…] people attempt to use highly flexible mathematical structures with large numbers of parameters that can be adjusted to fit the data, the result often being models that fit the data well but lack structural representation of the phenomena and thus are not predictive outside the range of the data. The situation is exacerbated by uncertainty regarding model parameters on account of insufficient data relative to model complexity, which in fact means uncertainty regarding the models themselves. More importantly from the standpoint of epistemology, the amount of available data is often miniscule in comparison to the amount needed for validation. The desire for knowledge has far outstripped experimental/observational capability. We are starved for data." (Edward R Dougherty, "The Evolution of Scientific Knowledge: From certainty to uncertainty", 2016)

"There are other problems with Big Data. In any large data set, there are bound to be inconsistencies, misclassifications, missing data - in other words, errors, blunders, and possibly lies. These problems with individual items occur in any data set, but they are often hidden in a large mass of numbers even when these numbers are generated out of computer interactions." (David S Salsburg, "Errors, Blunders, and Lies: How to Tell the Difference", 2017)

"Unless we’re collecting data ourselves, there’s a limit to how much we can do to combat the problem of missing data. But we can and should remember to ask who or what might be missing from the data we’re being told about. Some missing numbers are obvious […]. Other omissions show up only when we take a close look at the claim in question." (Tim Harford, "The Data Detective: Ten easy rules to make sense of statistics", 2020)

"Correlation does not imply causation: often some other missing third variable is influencing both of the variables you are correlating. […] The need for a scatterplot arose when scientists had to examine bivariate relations between distinct variables directly. As opposed to other graphic forms - pie charts, line graphs, and bar charts - the scatterplot offered a unique advantage: the possibility to discover regularity in empirical data (shown as points) by adding smoothed lines or curves designed to pass 'not through, but among them', so as to pass from raw data to a theory-based description, analysis, and understanding." (Michael Friendly & Howard Wainer, "A History of Data Visualization and Graphic Communication", 2021)
