30 December 2018

🔭Data Science: Data Analysis (Just the Quotes)

"As in Mathematics, so in Natural Philosophy, the Investigation of difficult Things by the Method of Analysis, ought ever to precede the Method of Composition. This Analysis consists in making Experiments and Observations, and in drawing general Conclusions from them by Induction, and admitting of no Objections against the Conclusions but such as are taken from Experiments, or other certain Truths." (Sir Isaac Newton, "Opticks", 1704)

"The errors which arise from the absence of facts are far more numerous and more durable than those which result from unsound reasoning respecting true data." (Charles Babbage, "On the Economy of Machinery and Manufactures", 1832)

"In every branch of knowledge the progress is proportional to the amount of facts on which to build, and therefore to the facility of obtaining data." (James C Maxwell, [letter to Lewis Campbell] 1851)

"Not even the most subtle and skilled analysis can overcome completely the unreliability of basic data." (Roy D G Allen, "Statistics for Economists", 1951)

"The technical analysis of any large collection of data is a task for a highly trained and expensive man who knows the mathematical theory of statistics inside and out. Otherwise the outcome is likely to be a collection of drawings - quartered pies, cute little battleships, and tapering rows of sturdy soldiers in diversified uniforms - interesting enough in the colored Sunday supplement, but hardly the sort of thing from which to draw reliable inferences." (Eric T Bell, "Mathematics: Queen and Servant of Science", 1951)

"If data analysis is to be well done, much of it must be a matter of judgment, and ‘theory’ whether statistical or non-statistical, will have to guide, not command." (John W Tukey, "The Future of Data Analysis", Annals of Mathematical Statistics, Vol. 33 (1), 1962)

"If one technique of data analysis were to be exalted above all others for its ability to be revealing to the mind in connection with each of many different models, there is little doubt which one would be chosen. The simple graph has brought more information to the data analyst’s mind than any other device. It specializes in providing indications of unexpected phenomena." (John W Tukey, "The Future of Data Analysis", Annals of Mathematical Statistics Vol. 33 (1), 1962)

"The most important maxim for data analysis to heed, and one which many statisticians seem to have shunned is this: ‘Far better an approximate answer to the right question, which is often vague, than an exact answer to the wrong question, which can always be made precise.’ Data analysis must progress by approximate answers, at best, since its knowledge of what the problem really is will at best be approximate." (John W Tukey, "The Future of Data Analysis", Annals of Mathematical Statistics, Vol. 33, No. 1, 1962)

"The first step in data analysis is often an omnibus step. We dare not expect otherwise, but we equally dare not forget that this step, and that step, and other step, are all omnibus steps and that we owe the users of such techniques a deep and important obligation to develop ways, often varied and competitive, of replacing omnibus procedures by ones that are more sharply focused." (John W Tukey, "The Future of Processes of Data Analysis", 1965)

"The basic general intent of data analysis is simply stated: to seek through a body of data for interesting relationships and information and to exhibit the results in such a way as to make them recognizable to the data analyzer and recordable for posterity. Its creative task is to be productively descriptive, with as much attention as possible to previous knowledge, and thus to contribute to the mysterious process called insight." (John W Tukey & Martin B Wilk, "Data Analysis and Statistics: An Expository Overview", 1966)

"Comparable objectives in data analysis are (l) to achieve more specific description of what is loosely known or suspected; (2) to find unanticipated aspects in the data, and to suggest unthought-of-models for the data's summarization and exposure; (3) to employ the data to assess the (always incomplete) adequacy of a contemplated model; (4) to provide both incentives and guidance for further analysis of the data; and (5) to keep the investigator usefully stimulated while he absorbs the feeling of his data and considers what to do next." (John W Tukey & Martin B Wilk, "Data Analysis and Statistics: An Expository Overview", 1966)

"Data analysis must be iterative to be effective. [...] The iterative and interactive interplay of summarizing by fit and exposing by residuals is vital to effective data analysis. Summarizing and exposing are complementary and pervasive." (John W Tukey & Martin B Wilk, "Data Analysis and Statistics: An Expository Overview", 1966)

"Every student of the art of data analysis repeatedly needs to build upon his previous statistical knowledge and to reform that foundation through fresh insights and emphasis." (John W Tukey, "Data Analysis, Including Statistics", 1968)

"[...] bending the question to fit the analysis is to be shunned at all costs." (John W Tukey, "Analyzing Data: Sanctification or Detective Work?", 1969)

"Statistical methods are tools of scientific investigation. Scientific investigation is a controlled learning process in which various aspects of a problem are illuminated as the study proceeds. It can be thought of as a major iteration within which secondary iterations occur. The major iteration is that in which a tentative conjecture suggests an experiment, appropriate analysis of the data so generated leads to a modified conjecture, and this in turn leads to a new experiment, and so on." (George E P Box & George C Tjao, "Bayesian Inference in Statistical Analysis", 1973)

"Almost all efforts at data analysis seek, at some point, to generalize the results and extend the reach of the conclusions beyond a particular set of data. The inferential leap may be from past experiences to future ones, from a sample of a population to the whole population, or from a narrow range of a variable to a wider range. The real difficulty is in deciding when the extrapolation beyond the range of the variables is warranted and when it is merely naive. As usual, it is largely a matter of substantive judgment - or, as it is sometimes more delicately put, a matter of 'a priori nonstatistical considerations'." (Edward R Tufte, "Data Analysis for Politics and Policy", 1974)

"[…] it is not enough to say: 'There's error in the data and therefore the study must be terribly dubious'. A good critic and data analyst must do more: he or she must also show how the error in the measurement or the analysis affects the inferences made on the basis of that data and analysis." (Edward R Tufte, "Data Analysis for Politics and Policy", 1974)

"The use of statistical methods to analyze data does not make a study any more 'scientific', 'rigorous', or 'objective'. The purpose of quantitative analysis is not to sanctify a set of findings. Unfortunately, some studies, in the words of one critic, 'use statistics as a drunk uses a street lamp, for support rather than illumination'. Quantitative techniques will be more likely to illuminate if the data analyst is guided in methodological choices by a substantive understanding of the problem he or she is trying to learn about. Good procedures in data analysis involve techniques that help to (a) answer the substantive questions at hand, (b) squeeze all the relevant information out of the data, and (c) learn something new about the world." (Edward R Tufte, "Data Analysis for Politics and Policy", 1974)

"Typically, data analysis is messy, and little details clutter it. Not only confounding factors, but also deviant cases, minor problems in measurement, and ambiguous results lead to frustration and discouragement, so that more data are collected than analyzed. Neglecting or hiding the messy details of the data reduces the researcher's chances of discovering something new." (Edward R Tufte, "Data Analysis for Politics and Policy", 1974)

"[...] be wary of analysts that try to quantify the unquantifiable." (Ralph Keeney & Raiffa Howard, "Decisions with Multiple Objectives: Preferences and Value Trade-offs", 1976)

"[...] exploratory data analysis is an attitude, a state of flexibility, a willingness to look for those things that we believe are not there, as well as for those we believe might be there. Except for its emphasis on graphs, its tools are secondary to its purpose." (John W Tukey, [comment] 1979)

"[...] any hope that we are smart enough to find even transiently optimum solutions to our data analysis problems is doomed to failure, and, indeed, if taken seriously, will mislead us in the allocation of effort, thus wasting both intellectual and computational effort." (John W Tukey, "Choosing Techniques for the Analysis of Data", 1981)

"The fact must be expressed as data, but there is a problem in that the correct data is difficult to catch. So that I always say 'When you see the data, doubt it!' 'When you see the measurement instrument, doubt it!' [...]For example, if the methods such as sampling, measurement, testing and chemical analysis methods were incorrect, data. […] to measure true characteristics and in an unavoidable case, using statistical sensory test and express them as data." (Kaoru Ishikawa, Annual Quality Congress Transactions, 1981)

"Exploratory data analysis, EDA, calls for a relatively free hand in exploring the data, together with dual obligations: (•) to look for all plausible alternatives and oddities - and a few implausible ones, (graphic techniques can be most helpful here) and (•) to remove each appearance that seems large enough to be meaningful - ordinarily by some form of fitting, adjustment, or standardization [...] so that what remains, the residuals, can be examined for further appearances." (John W Tukey, "Introduction to Styles of Data Analysis Techniques", 1982)

"A competent data analysis of an even moderately complex set of data is a thing of trials and retreats, of dead ends and branches." (John W Tukey, Computer Science and Statistics: Proceedings of the 14th Symposium on the Interface, 1983)

"Data in isolation are meaningless, a collection of numbers. Only in context of a theory do they assume significance […]" (George Greenstein, "Frozen Star", 1983)

"Iteration and experimentation are important for all of data analysis, including graphical data display. In many cases when we make a graph it is immediately clear that some aspect is inadequate and we regraph the data. In many other cases we make a graph, and all is well, but we get an idea for studying the data in a different way with a different graph; one successful graph often suggests another." (William S Cleveland, "The Elements of Graphing Data", 1985)

"There are some who argue that a graph is a success only if the important information in the data can be seen within a few seconds. While there is a place for rapidly-understood graphs, it is too limiting to make speed a requirement in science and technology, where the use of graphs ranges from, detailed, in-depth data analysis to quick presentation." (William S Cleveland, "The Elements of Graphing Data", 1985)

"A first analysis of experimental results should, I believe, invariably be conducted using flexible data analytical techniques - looking at graphs and simple statistics - that so far as possible allow the data to 'speak for themselves'. The unexpected phenomena that such a approach often uncovers can be of the greatest importance in shaping and sometimes redirecting the course of an ongoing investigation." (George Box, "Signal to Noise Ratios, Performance Criteria, and Transformations", Technometrics 30, 1988)

"Data analysis is an art practiced by individuals who are skilled at quantitative reasoning and have much experience in looking at numbers and detecting  patterns in data. Usually these individuals have some background in statistics." (David Lubinsky, Daryl Pregibon , "Data analysis as search", Journal of Econometrics Vol. 38 (1–2), 1988)

"Like a detective, a data analyst will experience many dead ends, retrace his steps, and explore many alternatives before settling on a single description of the evidence in front of him." (David Lubinsky & Daryl Pregibon , "Data analysis as search", Journal of Econometrics Vol. 38 (1–2), 1988)

"[…] data analysis in the context of basic mathematical concepts and skills. The ability to use and interpret simple graphical and numerical descriptions of data is the foundation of numeracy […] Meaningful data aid in replacing an emphasis on calculation by the exercise of judgement and a stress on interpreting and communicating results." (David S Moore, "Statistics for All: Why, What and How?", 1990)

"Data analysis is rarely as simple in practice as it appears in books. Like other statistical techniques, regression rests on certain assumptions and may produce unrealistic results if those assumptions are false. Furthermore it is not always obvious how to translate a research question into a regression model." (Lawrence C Hamilton, "Regression with Graphics: A second course in applied statistics", 1991)

"Data analysis typically begins with straight-line models because they are simplest, not because we believe reality is inherently linear. Theory or data may suggest otherwise [...]" (Lawrence C Hamilton, "Regression with Graphics: A second course in applied statistics", 1991)

"90 percent of all problems can be solved by using the techniques of data stratification, histograms, and control charts. Among the causes of nonconformance, only one-fifth or less are attributable to the workers." (Kaoru Ishikawa, The Quality Management Journal Vol. 1, 1993)

"Probabilistic inference is the classical paradigm for data analysis in science and technology. It rests on a foundation of randomness; variation in data is ascribed to a random process in which nature generates data according to a probability distribution. This leads to a codification of uncertainly by confidence intervals and hypothesis tests." (William S Cleveland, "Visualizing Data", 1993)

"Visualization is an approach to data analysis that stresses a penetrating look at the structure of data. No other approach conveys as much information. […] Conclusions spring from data when this information is combined with the prior knowledge of the subject under investigation." (William S Cleveland, "Visualizing Data", 1993)

"When the distributions of two or more groups of univariate data are skewed, it is common to have the spread increase monotonically with location. This behavior is monotone spread. Strictly speaking, monotone spread includes the case where the spread decreases monotonically with location, but such a decrease is much less common for raw data. Monotone spread, as with skewness, adds to the difficulty of data analysis. For example, it means that we cannot fit just location estimates to produce homogeneous residuals; we must fit spread estimates as well. Furthermore, the distributions cannot be compared by a number of standard methods of probabilistic inference that are based on an assumption of equal spreads; the standard t-test is one example. Fortunately, remedies for skewness can cure monotone spread as well." (William S Cleveland, "Visualizing Data", 1993)

"A careful and sophisticated analysis of the data is often quite useless if the statistician cannot communicate the essential features of the data to a client for whom statistics is an entirely foreign language." (Christopher J Wild, "Embracing the ‘Wider view’ of Statistics", The American Statistician 48, 1994)

"Science is not impressed with a conglomeration of data. It likes carefully constructed analysis of each problem." (Daniel E Koshland Jr, Science Vol. 263 (5144), [editorial] 1994)

"So we pour in data from the past to fuel the decision-making mechanisms created by our models, be they linear or nonlinear. But therein lies the logician's trap: past data from real life constitute a sequence of events rather than a set of independent observations, which is what the laws of probability demand. [...] It is in those outliers and imperfections that the wildness lurks." (Peter L Bernstein, "Against the Gods: The Remarkable Story of Risk", 1996)

"Data are generally collected as a basis for action. However, unless potential signals are separated from probable noise, the actions taken may be totally inconsistent with the data. Thus, the proper use of data requires that you have simple and effective methods of analysis which will properly separate potential signals from probable noise." (Donald J Wheeler, "Understanding Variation: The Key to Managing Chaos" 2nd Ed., 2000)

"No matter what the data, and no matter how the values are arranged and presented, you must always use some method of analysis to come up with an interpretation of the data.
While every data set contains noise, some data sets may contain signals. Therefore, before you can detect a signal within any given data set, you must first filter out the noise." (Donald J Wheeler," Understanding Variation: The Key to Managing Chaos" 2nd Ed., 2000)

"The purpose of analysis is insight. The best analysis is the simplest analysis which provides the needed insight." (Donald J Wheeler, "Understanding Variation: The Key to Managing Chaos" 2nd Ed., 2000)

"Without meaningful data there can be no meaningful analysis. The interpretation of any data set must be based upon the context of those data." (Donald J Wheeler, "Understanding Variation: The Key to Managing Chaos" 2nd Ed., 2000)

"Statistical analysis of data can only be performed within the context of selected assumptions, models, and/or prior distributions. A statistical analysis is actually the extraction of substantive information from data and assumptions. And herein lies the rub, understood well by Disraeli and others skeptical of our work: For given data, an analysis can usually be selected which will result in 'information' more favorable to the owner of the analysis then is objectively warranted." (Stephen B Vardeman & Max D Morris, "Statistics and Ethics: Some Advice for Young Statisticians", The American Statistician vol 57, 2003)

"Exploratory Data Analysis is more than just a collection of data-analysis techniques; it provides a philosophy of how to dissect a data set. It stresses the power of visualisation and aspects such as what to look for, how to look for it and how to interpret the information it contains. Most EDA techniques are graphical in nature, because the main aim of EDA is to explore data in an open-minded way. Using graphics, rather than calculations, keeps open possibilities of spotting interesting patterns or anomalies that would not be apparent with a calculation (where assumptions and decisions about the nature of the data tend to be made in advance)." (Alan Graham, "Developing Thinking in Statistics", 2006)

"It is the aim of all data analysis that a result is given in form of the best estimate of the true value. Only in simple cases is it possible to use the data value itself as result and thus as best estimate." (Manfred Drosg, "Dealing with Uncertainties: A Guide to Error Analysis", 2007)

"Put simply, statistics is a range of procedures for gathering, organizing, analyzing and presenting quantitative data. […] Essentially […], statistics is a scientific approach to analyzing numerical data in order to enable us to maximize our interpretation, understanding and use. This means that statistics helps us turn data into information; that is, data that have been interpreted, understood and are useful to the recipient. Put formally, for your project, statistics is the systematic collection and analysis of numerical data, in order to investigate or discover relationships among phenomena so as to explain, predict and control their occurrence." (Reva B Brown & Mark Saunders, "Dealing with Statistics: What You Need to Know", 2008)

"Data analysis is careful thinking about evidence." (Michael Milton, "Head First Data Analysis", 2009)

"Doing data analysis without explicitly defining your problem or goal is like heading out on a road trip without having decided on a destination." (Michael Milton, "Head First Data Analysis", 2009)

"The discrepancy between our mental models and the real world may be a major problem of our times; especially in view of the difficulty of collecting, analyzing, and making sense of the unbelievable amount of data to which we have access today." (Ugo Bardi, "The Limits to Growth Revisited", 2011)

"Data analysis is not generally thought of as being simple or easy, but it can be. The first step is to understand that the purpose of data analysis is to separate any signals that may be contained within the data from the noise in the data. Once you have filtered out the noise, anything left over will be your potential signals. The rest is just details." (Donald J Wheeler," Myths About Data Analysis", International Lean & Six Sigma Conference, 2012)

"The four questions of data analysis are the questions of description, probability, inference, and homogeneity. Any data analyst needs to know how to organize and use these four questions in order to obtain meaningful and correct results. [...] 
THE DESCRIPTION QUESTION: Given a collection of numbers, are there arithmetic values that will summarize the information contained in those numbers in some meaningful way?
THE PROBABILITY QUESTION: Given a known universe, what can we say about samples drawn from this universe? [...] 
THE INFERENCE QUESTION: Given an unknown universe, and given a sample that is known to have been drawn from that unknown universe, and given that we know everything about the sample, what can we say about the unknown universe? [...] 
THE HOMOGENEITY QUESTION: Given a collection of observations, is it reasonable to assume that they came from one universe, or do they show evidence of having come from multiple universes?" (Donald J Wheeler," Myths About Data Analysis", International Lean & Six Sigma Conference, 2012)

"Each systems archetype embodies a particular theory about dynamic behavior that can serve as a starting point for selecting and formulating raw data into a coherent set of interrelationships. Once those relationships are made explicit and precise, the 'theory' of the archetype can then further guide us in our data-gathering process to test the causal relationships through direct observation, data analysis, or group deliberation." (Daniel H Kim, "Systems Archetypes as Dynamic Theories", The Systems Thinker Vol. 24 (1), 2013)

"A complete data analysis will involve the following steps: (i) Finding a good model to fit the signal based on the data. (ii) Finding a good model to fit the noise, based on the residuals from the model. (iii) Adjusting variances, test statistics, confidence intervals, and predictions, based on the model for the noise.(DeWayne R Derryberry, "Basic data analysis for time series with R", 2014)

"The random element in most data analysis is assumed to be white noise - normal errors independent of each other. In a time series, the errors are often linked so that independence cannot be assumed (the last examples). Modeling the nature of this dependence is the key to time series.(DeWayne R Derryberry, "Basic data analysis for time series with R", 2014)

"Statistics is an integral part of the quantitative approach to knowledge. The field of statistics is concerned with the scientific study of collecting, organizing, analyzing, and drawing conclusions from data." (Kandethody M Ramachandran & Chris P Tsokos, "Mathematical Statistics with Applications in R" 2nd Ed., 2015)

"Too often there is a disconnect between the people who run a study and those who do the data analysis. This is as predictable as it is unfortunate. If data are gathered with particular hypotheses in mind, too often they (the data) are passed on to someone who is tasked with testing those hypotheses and who has only marginal knowledge of the subject matter. Graphical displays, if prepared at all, are just summaries or tests of the assumptions underlying the tests being done. Broader displays, that have the potential of showing us things that we had not expected, are either not done at all, or their message is not able to be fully appreciated by the data analyst." (Howard Wainer, Comment, Journal of Computational and Graphical Statistics Vol. 20(1), 2011)

"The dialectical interplay of experiment and theory is a key driving force of modern science. Experimental data do only have meaning in the light of a particular model or at least a theoretical background. Reversely theoretical considerations may be logically consistent as well as intellectually elegant: Without experimental evidence they are a mere exercise of thought no matter how difficult they are. Data analysis is a connector between experiment and theory: Its techniques advise possibilities of model extraction as well as model testing with experimental data." (Achim Zielesny, "From Curve Fitting to Machine Learning" 2nd Ed., 2016)

"Data analysis and data mining are concerned with unsupervised pattern finding and structure determination in data sets. The data sets themselves are explicitly linked as a form of representation to an observational or otherwise empirical domain of interest. 'Structure' has long been understood as symmetry which can take many forms with respect to any transformation, including point, translational, rotational, and many others. Symmetries directly point to invariants, which pinpoint intrinsic properties of the data and of the background empirical domain of interest. As our data models change, so too do our perspectives on analysing data." (Fionn Murtagh, "Data Science Foundations: Geometry and Topology of Complex Hierarchic Systems and Big Data Analytics", 2018)

"[…] the data itself can lead to new questions too. In exploratory data analysis (EDA), for example, the data analyst discovers new questions based on the data. The process of looking at the data to address some of these questions generates incidental visualizations - odd patterns, outliers, or surprising correlations that are worth looking into further." (Danyel Fisher & Miriah Meyer, "Making Data Visual", 2018)

"Analysis is a two-step process that has an exploratory and an explanatory phase. In order to create a powerful data story, you must effectively transition from data discovery (when you’re finding insights) to data communication (when you’re explaining them to an audience). If you don’t properly traverse these two phases, you may end up with something that resembles a data story but doesn’t have the same effect. Yes, it may have numbers, charts, and annotations, but because it’s poorly formed, it won’t achieve the same results." (Brent Dykes, "Effective Data Storytelling: How to Drive Change with Data, Narrative and Visuals", 2019)

"While visuals are an essential part of data storytelling, data visualizations can serve a variety of purposes from analysis to communication to even art. Most data charts are designed to disseminate information in a visual manner. Only a subset of data compositions is focused on presenting specific insights as opposed to just general information. When most data compositions combine both visualizations and text, it can be difficult to discern whether a particular scenario falls into the realm of data storytelling or not." (Brent Dykes, "Effective Data Storytelling: How to Drive Change with Data, Narrative and Visuals", 2019)

"If the data that go into the analysis are flawed, the specific technical details of the analysis don’t matter. One can obtain stupid results from bad data without any statistical trickery. And this is often how bullshit arguments are created, deliberately or otherwise. To catch this sort of bullshit, you don’t have to unpack the black box. All you have to do is think carefully about the data that went into the black box and the results that came out. Are the data unbiased, reasonable, and relevant to the problem at hand? Do the results pass basic plausibility checks? Do they support whatever conclusions are drawn?" (Carl T Bergstrom & Jevin D West, "Calling Bullshit: The Art of Skepticism in a Data-Driven World", 2020)

"We all know that the numerical values on each side of an equation have to be the same. The key to dimensional analysis is that the units have to be the same as well. This provides a convenient way to keep careful track of units when making calculations in engineering and other quantitative disciplines, to make sure one is computing what one thinks one is computing. When an equation exists only for the sake of mathiness, dimensional analysis often makes no sense." (Carl T Bergstrom & Jevin D West, "Calling Bullshit: The Art of Skepticism in a Data-Driven World", 2020)

"Overall [...] everyone also has a need to analyze data. The ability to analyze data is vital in its understanding of product launch success. Everyone needs the ability to find trends and patterns in the data and information. Everyone has a need to ‘discover or reveal (something) through detailed examination’, as our definition says. Not everyone needs to be a data scientist, but everyone needs to drive questions and analysis. Everyone needs to dig into the information to be successful with diagnostic analytics. This is one of the biggest keys of data literacy: analyzing data." (Jordan Morrow, "Be Data Literate: The data literacy skills everyone needs to succeed", 2021)

[Murphy’s Laws of Analysis:] "(1) In any collection of data, the figures that are obviously correct contain errors. (2) It is customary for a decimal to be misplaced. (3) An error that can creep into a calculation, will. Also, it will always be in the direction that will cause the most damage to the calculation." (G C Deakly)

"We must include in any language with which we hope to describe complex data-processing situations the capability for describing data. We must also include a mechanism for determining the priorities to be applied to the data. These priorities are not fixed and are indicated in many cases by the data." (Grace Hopper) 

🔭Data Science: Testing (Just the Quotes)

"We must trust to nothing but facts: These are presented to us by Nature, and cannot deceive. We ought, in every instance, to submit our reasoning to the test of experiment, and never to search for truth but by the natural road of experiment and observation." (Antoin-Laurent de Lavoisiere, "Elements of Chemistry", 1790)

"A law of nature, however, is not a mere logical conception that we have adopted as a kind of memoria technical to enable us to more readily remember facts. We of the present day have already sufficient insight to know that the laws of nature are not things which we can evolve by any speculative method. On the contrary, we have to discover them in the facts; we have to test them by repeated observation or experiment, in constantly new cases, under ever-varying circumstances; and in proportion only as they hold good under a constantly increasing change of conditions, in a constantly increasing number of cases with greater delicacy in the means of observation, does our confidence in their trustworthiness rise." (Hermann von Helmholtz, "Popular Lectures on Scientific Subjects", 1873)

"A discoverer is a tester of scientific ideas; he must not only be able to imagine likely hypotheses, and to select suitable ones for investigation, but, as hypotheses may be true or untrue, he must also be competent to invent appropriate experiments for testing them, and to devise the requisite apparatus and arrangements." (George Gore, "The Art of Scientific Discovery", 1878)

"The preliminary examination of most data is facilitated by the use of diagrams. Diagrams prove nothing, but bring outstanding features readily to the eye; they are therefore no substitutes for such critical tests as may be applied to the data, but are valuable in suggesting such tests, and in explaining the conclusions founded upon them." (Sir Ronald A Fisher, "Statistical Methods for Research Workers", 1925)

"A scientist, whether theorist or experimenter, puts forward statements, or systems of statements, and tests them step by step. In the field of the empirical sciences, more particularly, he constructs hypotheses, or systems of theories, and tests them against experience by observation and experiment." (Karl Popper, "The Logic of Scientific Discovery", 1934)

"Science, in the broadest sense, is the entire body of the most accurately tested, critically established, systematized knowledge available about that part of the universe which has come under human observation. For the most part this knowledge concerns the forces impinging upon human beings in the serious business of living and thus affecting man’s adjustment to and of the physical and the social world. […] Pure science is more interested in understanding, and applied science is more interested in control […]" (Austin L Porterfield, "Creative Factors in Scientific Research", 1941)

"To a scientist a theory is something to be tested. He seeks not to defend his beliefs, but to improve them. He is, above everything else, an expert at ‘changing his mind’." (Wendell Johnson, 1946)

"As usual we may make the errors of I) rejecting the null hypothesis when it is true, II) accepting the null hypothesis when it is false. But there is a third kind of error which is of interest because the present test of significance is tied up closely with the idea of making a correct decision about which distribution function has slipped furthest to the right. We may make the error of III) correctly rejecting the null hypothesis for the wrong reason." (Frederick Mosteller, "A k-Sample Slippage Test for an Extreme Population", The Annals of Mathematical Statistics 19, 1948)

"Errors of the third kind happen in conventional tests of differences of means, but they are usually not considered, although their existence is probably recognized. It seems to the author that there may be several reasons for this among which are 1) a preoccupation on the part of mathematical statisticians with the formal questions of acceptance and rejection of null hypotheses without adequate consideration of the implications of the error of the third kind for the practical experimenter, 2) the rarity with which an error of the third kind arises in the usual tests of significance." (Frederick Mosteller, "A k-Sample Slippage Test for an Extreme Population", The Annals of Mathematical Statistics 19, 1948)

"If significance tests are required for still larger samples, graphical accuracy is insufficient, and arithmetical methods are advised. A word to the wise is in order here, however. Almost never does it make sense to use exact binomial significance tests on such data - for the inevitable small deviations from the mathematical model of independence and constant split have piled up to such an extent that the binomial variability is deeply buried and unnoticeable. Graphical treatment of such large samples may still be worthwhile because it brings the results more vividly to the eye." (Frederick Mosteller & John W Tukey, "The Uses and Usefulness of Binomial Probability Paper?", Journal of the American Statistical Association 44, 1949)

"Statistics is the fundamental and most important part of inductive logic. It is both an art and a science, and it deals with the collection, the tabulation, the analysis and interpretation of quantitative and qualitative measurements. It is concerned with the classifying and determining of actual attributes as well as the making of estimates and the testing of various hypotheses by which probable, or expected, values are obtained. It is one of the means of carrying on scientific research in order to ascertain the laws of behavior of things - be they animate or inanimate. Statistics is the technique of the Scientific Method." (Bruce D Greenschields & Frank M Weida, "Statistics with Applications to Highway Traffic Analyses", 1952)

"The only relevant test of the validity of a hypothesis is comparison of prediction with experience." (Milton Friedman, "Essays in Positive Economics", 1953)

"The main purpose of a significance test is to inhibit the natural enthusiasm of the investigator." (Frederick Mosteller, "Selected Quantitative Techniques", 1954)

"The methods of science may be described as the discovery of laws, the explanation of laws by theories, and the testing of theories by new observations. A good analogy is that of the jigsaw puzzle, for which the laws are the individual pieces, the theories local patterns suggested by a few pieces, and the tests the completion of these patterns with pieces previously unconsidered." (Edwin P Hubble, "The Nature of Science and Other Lectures", 1954)

"Science is the creation of concepts and their exploration in the facts. It has no other test of the concept than its empirical truth to fact." (Jacob Bronowski, "Science and Human Values", 1956)

"Null hypotheses of no difference are usually known to be false before the data are collected [...] when they are, their rejection or acceptance simply reflects the size of the sample and the power of the test, and is not a contribution to science." (I Richard Savage, "Nonparametric statistics", Journal of the American Statistical Association 52, 1957)

"The well-known virtue of the experimental method is that it brings situational variables under tight control. It thus permits rigorous tests of hypotheses and confidential statements about causation. The correlational method, for its part, can study what man has not learned to control. Nature has been experimenting since the beginning of time, with a boldness and complexity far beyond the resources of science. The correlator’s mission is to observe and organize the data of nature’s experiments." (Lee J Cronbach, "The Two Disciplines of Scientific Psychology", The American Psychologist Vol. 12, 1957)

"A satisfactory prediction of the sequential properties of learning data from a single experiment is by no means a final test of a model. Numerous other criteria - and some more demanding - can be specified. For example, a model with specific numerical parameter values should be invariant to changes in independent variables that explicitly enter in the model." (Robert R Bush & Frederick Mosteller,"A Comparison of Eight Models?", Studies in Mathematical Learning Theory, 1959)

"One feature [...] which requires much more justification than is usually given, is the setting up of unplausible null hypotheses. For example, a statistician may set out a test to see whether two drugs have exactly the same effect, or whether a regression line is exactly straight. These hypotheses can scarcely be taken literally." (Cedric A B Smith, "Book review of Norman T. J. Bailey: Statistical Methods in Biology", Applied Statistics 9, 1960)

"The null-hypothesis significance test treats ‘acceptance’ or ‘rejection’ of a hypothesis as though these were decisions one makes. But a hypothesis is not something, like a piece of pie offered for dessert, which can be accepted or rejected by a voluntary physical action. Acceptance or rejection of a hypothesis is a cognitive process, a degree of believing or disbelieving which, if rational, is not a matter of choice but determined solely by how likely it is, given the evidence, that the hypothesis is true." (William W Rozeboom, "The fallacy of the null–hypothesis significance test", Psychological Bulletin 57, 1960)

"It is easy to obtain confirmations, or verifications, for nearly every theory - if we look for confirmations. Confirmations should count only if they are the result of risky predictions. […] A theory which is not refutable by any conceivable event is non-scientific. Irrefutability is not a virtue of a theory (as people often think) but a vice. Every genuine test of a theory is an attempt to falsify it, or refute it." (Karl R Popper, "Conjectures and Refutations: The Growth of Scientific Knowledge", 1963)

"The final test of a theory is its capacity to solve the problems which originated it." (George Dantzig, "Linear Programming and Extensions", 1963)

"The mediation of theory and praxis can only be clarified if to begin with we distinguish three functions, which are measured in terms of different criteria: the formation and extension of critical theorems, which can stand up to scientific discourse; the organization of processes of enlightenment, in which such theorems are applied and can be tested in a unique manner by the initiation of processes of reflection carried on within certain groups toward which these processes have been directed; and the selection of appropriate strategies, the solution of tactical questions, and the conduct of the political struggle. On the first level, the aim is true statements, on the second, authentic insights, and on the third, prudent decisions." (Jürgen Habermas, "Introduction to Theory and Practice", 1963)

"The null hypothesis of no difference has been judged to be no longer a sound or fruitful basis for statistical investigation. […] Significance tests do not provide the information that scientists need, and, furthermore, they are not the most effective method for analyzing and summarizing data." (Cherry A Clark, "Hypothesis Testing in Relation to Statistical Methodology", Review of Educational Research Vol. 33, 1963)

"The usefulness of the models in constructing a testable theory of the process is severely limited by the quickly increasing number of parameters which must be estimated in order to compare the predictions of the models with empirical results" (Anatol Rapoport, "Prisoner's Dilemma: A study in conflict and cooperation", 1965)

"The validation of a model is not that it is 'true' but that it generates good testable hypotheses relevant to important problems.” (Richard Levins, "The Strategy of Model Building in Population Biology”, 1966)

"Discovery always carries an honorific connotation. It is the stamp of approval on a finding of lasting value. Many laws and theories have come and gone in the history of science, but they are not spoken of as discoveries. […] Theories are especially precarious, as this century profoundly testifies. World views can and do often change. Despite these difficulties, it is still true that to count as a discovery a finding must be of at least relatively permanent value, as shown by its inclusion in the generally accepted body of scientific knowledge." (Richard J. Blackwell, "Discovery in the Physical Sciences", 1969)

"Science consists simply of the formulation and testing of hypotheses based on observational evidence; experiments are important where applicable, but their function is merely to simplify observation by imposing controlled conditions." (Henry L Batten, "Evolution of the Earth", 1971)

"A hypothesis is empirical or scientific only if it can be tested by experience. […] A hypothesis or theory which cannot be, at least in principle, falsified by empirical observations and experiments does not belong to the realm of science." (Francisco J Ayala, "Biological Evolution: Natural Selection or Random Walk", American Scientist, 1974)

"An experiment is a failure only when it also fails adequately to test the hypothesis in question, when the data it produces don't prove anything one way or the other." (Robert M Pirsig, "Zen and the Art of Motorcycle Maintenance", 1974)

"Science is systematic organisation of knowledge about the universe on the basis of explanatory hypotheses which are genuinely testable. Science advances by developing gradually more comprehensive theories; that is, by formulating theories of greater generality which can account for observational statements and hypotheses which appear as prima facie unrelated." (Francisco J Ayala, "Studies in the Philosophy of Biology: Reduction and Related Problems", 1974)

"A good scientific law or theory is falsifiable just because it makes definite claims about the world. For the falsificationist, If follows fairly readily from this that the more falsifiable a theory is the better, in some loose sense of more. The more a theory claims, the more potential opportunities there will be for showing that the world does not in fact behave in the way laid down by the theory. A very good theory will be one that makes very wide-ranging claims about the world, and which is consequently highly falsifiable, and is one that resists falsification whenever it is put to the test." (Alan F Chalmers,  "What Is This Thing Called Science?", 1976)

"Tests appear to many users to be a simple way to discharge the obligation to provide some statistical treatment of the data." (H V Roberts, "For what use are tests of hypotheses and tests of significance",  Communications in Statistics [Series A], 1976)

"Prediction can never be absolutely valid and therefore science can never prove some generalization or even test a single descriptive statement and in that way arrive at final truth." (Gregory Bateson, "Mind and Nature, A necessary unity", 1979)

"The fact must be expressed as data, but there is a problem in that the correct data is difficult to catch. So that I always say 'When you see the data, doubt it!' 'When you see the measurement instrument, doubt it!' [...]For example, if the methods such as sampling, measurement, testing and chemical analysis methods were incorrect, data. […] to measure true characteristics and in an unavoidable case, using statistical sensory test and express them as data." (Kaoru Ishikawa, Annual Quality Congress Transactions, 1981)

"All interpretations made by a scientist are hypotheses, and all hypotheses are tentative. They must forever be tested and they must be revised if found to be unsatisfactory. Hence, a change of mind in a scientist, and particularly in a great scientist, is not only not a sign of weakness but rather evidence for continuing attention to the respective problem and an ability to test the hypothesis again and again." (Ernst Mayr, "The Growth of Biological Thought: Diversity, Evolution and Inheritance", 1982)

"Theoretical scientists, inching away from the safe and known, skirting the point of no return, confront nature with a free invention of the intellect. They strip the discovery down and wire it into place in the form of mathematical models or other abstractions that define the perceived relation exactly. The now-naked idea is scrutinized with as much coldness and outward lack of pity as the naturally warm human heart can muster. They try to put it to use, devising experiments or field observations to test its claims. By the rules of scientific procedure it is then either discarded or temporarily sustained. Either way, the central theory encompassing it grows. If the abstractions survive they generate new knowledge from which further exploratory trips of the mind can be planned. Through the repeated alternation between flights of the imagination and the accretion of hard data, a mutual agreement on the workings of the world is written, in the form of natural law." (Edward O Wilson, "Biophilia", 1984)

"Models are often used to decide issues in situations marked by uncertainty. However statistical differences from data depend on assumptions about the process which generated these data. If the assumptions do not hold, the inferences may not be reliable either. This limitation is often ignored by applied workers who fail to identify crucial assumptions or subject them to any kind of empirical testing. In such circumstances, using statistical procedures may only compound the uncertainty." (David A Greedman & William C Navidi, "Regression Models for Adjusting the 1980 Census", Statistical Science Vol. 1 (1), 1986)

"Science has become a social method of inquiring into natural phenomena, making intuitive and systematic explorations of laws which are formulated by observing nature, and then rigorously testing their accuracy in the form of predictions. The results are then stored as written or mathematical records which are copied and disseminated to others, both within and beyond any given generation. As a sort of synergetic, rigorously regulated group perception, the collective enterprise of science far transcends the activity within an individual brain." (Lynn Margulis & Dorion Sagan, "Microcosmos", 1986)

"Beware of the problem of testing too many hypotheses; the more you torture the data, the more likely they are to confess, but confessions obtained under duress may not be admissible in the court of scientific opinion." (Stephen M. Stigler, "Neutral Models in Biology", 1987)

"Prediction can never be absolutely valid and therefore science can never prove some generalization or even test a single descriptive statement and in that way arrive at final truth." (Gregory Bateson, Mind and Nature: A necessary unity", 1988)

"Science doesn't purvey absolute truth. Science is a mechanism. It's a way of trying to improve your knowledge of nature. It's a system for testing your thoughts against the universe and seeing whether they match. And this works, not just for the ordinary aspects of science, but for all of life. I should think people would want to know that what they know is truly what the universe is like, or at least as close as they can get to it." (Isaac Asimov, [Interview by Bill Moyers] 1988)

"The heart of the scientific method is the problem-hypothesis-test process. And, necessarily, the scientific method involves predictions. And predictions, to be useful in scientific methodology, must be subject to test empirically." (Paul Davies, "The Cosmic Blueprint: New Discoveries in Nature's Creative Ability to, Order the Universe", 1988)

"Science doesn’t purvey absolute truth. Science is a mechanism, a way of trying to improve your knowledge of nature. It’s a system for testing your thoughts against the universe, and seeing whether they match." (Isaac Asimov, [interview with Bill Moyers in The Humanist] 1989)

"A little thought reveals a fact widely understood among statisticians: The null hypothesis, taken literally (and that’s the only way you can take it in formal hypothesis testing), is always false in the real world. [...] If it is false, even to a tiny degree, it must be the case that a large enough sample will produce a significant result and lead to its rejection. So if the null hypothesis is always false, what’s the big deal about rejecting it?" (Jacob Cohen, "Things I Have Learned (So Far)", American Psychologist, 1990)

"How has the virtually barren technique of hypothesis testing come to assume such importance in the process by which we arrive at our conclusions from our data?" (Geoffrey R Loftus, "On the tyranny of hypothesis testing in the social sciences", Contemporary Psychology 36, 1991)

"On this view, we recognize science to be the search for algorithmic compressions. We list sequences of observed data. We try to formulate algorithms that compactly represent the information content of those sequences. Then we test the correctness of our hypothetical abbreviations by using them to predict the next terms in the string. These predictions can then be compared with the future direction of the data sequence. Without the development of algorithmic compressions of data all science would be replaced by mindless stamp collecting - the indiscriminate accumulation of every available fact. Science is predicated upon the belief that the Universe is algorithmically compressible and the modern search for a Theory of Everything is the ultimate expression of that belief, a belief that there is an abbreviated representation of the logic behind the Universe's properties that can be written down in finite form by human beings." (John D Barrow, New Theories of Everything", 1991)

"Scientists use mathematics to build mental universes. They write down mathematical descriptions - models - that capture essential fragments of how they think the world behaves. Then they analyse their consequences. This is called 'theory'. They test their theories against observations: this is called 'experiment'. Depending on the result, they may modify the mathematical model and repeat the cycle until theory and experiment agree. Not that it's really that simple; but that's the general gist of it, the essence of the scientific method." (Ian Stewart & Martin Golubitsky, "Fearful Symmetry: Is God a Geometer?", 1992)

"The amount of understanding produced by a theory is determined by how well it meets the criteria of adequacy - testability, fruitfulness, scope, simplicity, conservatism - because these criteria indicate the extent to which a theory systematizes and unifies our knowledge." (Theodore Schick Jr.,  "How to Think about Weird Things: Critical Thinking for a New Age", 1995)

"The science of statistics may be described as exploring, analyzing and summarizing data; designing or choosing appropriate ways of collecting data and extracting information from them; and communicating that information. Statistics also involves constructing and testing models for describing chance phenomena. These models can be used as a basis for making inferences and drawing conclusions and, finally, perhaps for making decisions." (Fergus Daly et al, "Elements of Statistics", 1995)

"Science is distinguished not for asserting that nature is rational, but for constantly testing claims to that or any other affect by observation and experiment." (Timothy Ferris, "The Whole Shebang: A State-of-the Universe’s Report", 1996)

"There are two kinds of mistakes. There are fatal mistakes that destroy a theory; but there are also contingent ones, which are useful in testing the stability of a theory." (Gian-Carlo Rota, [lecture] 1996)

"Validation is the process of testing how good the solutions produced by a system are. The results produced by a system are usually compared with the results obtained either by experts or by other systems. Validation is an extremely important part of the process of developing every knowledge-based system. Without comparing the results produced by the system with reality, there is little point in using it." (Nikola K Kasabov, "Foundations of Neural Networks, Fuzzy Systems, and Knowledge Engineering", 1996)

"The rate of the development of science is not the rate at which you make observations alone but, much more important, the rate at which you create new things to test." (Richard Feynman, "The Meaning of It All", 1998)

"Let us regard a proof of an assertion as a purely mechanical procedure using precise rules of inference starting with a few unassailable axioms. This means that an algorithm can be devised for testing the validity of an alleged proof simply by checking the successive steps of the argument; the rules of inference constitute an algorithm for generating all the statements that can be deduced in a finite number of steps from the axioms." (Edward Beltrami, "What is Random?: Chaos and Order in Mathematics and Life", 1999)

"The greatest plus of data modeling is that it produces a simple and understandable picture of the relationship between the input variables and responses [...] different models, all of them equally good, may give different pictures of the relation between the predictor and response variables [...] One reason for this multiplicity is that goodness-of-fit tests and other methods for checking fit give a yes–no answer. With the lack of power of these tests with data having more than a small number of dimensions, there will be a large number of models whose fit is acceptable. There is no way, among the yes–no methods for gauging fit, of determining which is the better model." (Leo Breiman, "Statistical Modeling: The two cultures", Statistical Science 16(3), 2001)

"When significance tests are used and a null hypothesis is not rejected, a major problem often arises - namely, the result may be interpreted, without a logical basis, as providing evidence for the null hypothesis." (David F Parkhurst, "Statistical Significance Tests: Equivalence and Reverse Tests Should Reduce Misinterpretation", BioScience Vol. 51 (12), 2001)

"Visualizations can be used to explore data, to confirm a hypothesis, or to manipulate a viewer. [...] In exploratory visualization the user does not necessarily know what he is looking for. This creates a dynamic scenario in which interaction is critical. [...] In a confirmatory visualization, the user has a hypothesis that needs to be tested. This scenario is more stable and predictable. System parameters are often predetermined." (Usama Fayyad et al, "Information Visualization in Data Mining and Knowledge Discovery", 2002)

"There is a tendency to use hypothesis testing methods even when they are not appropriate. Often, estimation and confidence intervals are better tools. Use hypothesis testing only when you want to test a well-defined hypothesis." (Larry A Wasserman, "All of Statistics: A concise course in statistical inference", 2004)

"In science, for a theory to be believed, it must make a prediction - different from those made by previous theories - for an experiment not yet done. For the experiment to be meaningful, we must be able to get an answer that disagrees with that prediction. When this is the case, we say that a theory is falsifiable - vulnerable to being shown false. The theory also has to be confirmable, it must be possible to verify a new prediction that only this theory makes. Only when a theory has been tested and the results agree with the theory do we advance the statement to the rank of a true scientific theory." (Lee Smolin, "The Trouble with Physics", 2006)

"A type of error used in hypothesis testing that arises when incorrectly rejecting the null hypothesis, although it is actually true. Thus, based on the test statistic, the final conclusion rejects the Null hypothesis, but in truth it should be accepted. Type I error equates to the alpha (α) or significance level, whereby the generally accepted default is 5%." (Lynne Hambleton, "Treasure Chest of Six Sigma Growth Methods, Tools, and Best Practices", 2007)

"Each systems archetype embodies a particular theory about dynamic behavior that can serve as a starting point for selecting and formulating raw data into a coherent set of interrelationships. Once those relationships are made explicit and precise, the 'theory' of the archetype can then further guide us in our data-gathering process to test the causal relationships through direct observation, data analysis, or group deliberation." (Daniel H Kim, "Systems Archetypes as Dynamic Theories", The Systems Thinker Vol. 24 (1), 2013)

"In common usage, prediction means to forecast a future event. In data science, prediction more generally means to estimate an unknown value. This value could be something in the future (in common usage, true prediction), but it could also be something in the present or in the past. Indeed, since data mining usually deals with historical data, models very often are built and tested using events from the past." (Foster Provost & Tom Fawcett, "Data Science for Business", 2013)

"Another way to secure statistical significance is to use the data to discover a theory. Statistical tests assume that the researcher starts with a theory, collects data to test the theory, and reports the results - whether statistically significant or not. Many people work in the other direction, scrutinizing the data until they find a pattern and then making up a theory that fits the pattern." (Gary Smith, "Standard Deviations", 2014)

"Data clusters are everywhere, even in random data. Someone who looks for an explanation will inevitably find one, but a theory that fits a data cluster is not persuasive evidence. The found explanation needs to make sense and it needs to be tested with uncontaminated data." (Gary Smith, "Standard Deviations", 2014)

"Machine learning is a science and requires an objective approach to problems. Just like the scientific method, test-driven development can aid in solving a problem. The reason that TDD and the scientific method are so similar is because of these three shared characteristics: Both propose that the solution is logical and valid. Both share results through documentation and work over time. Both work in feedback loops." (Matthew Kirk, "Thoughtful Machine Learning", 2015)

"Science, at its core, is simply a method of practical logic that tests hypotheses against experience. Scientism, by contrast, is the worldview and value system that insists that the questions the scientific method can answer are the most important questions human beings can ask, and that the picture of the world yielded by science is a better approximation to reality than any other." (John M Greer, "After Progress: Reason and Religion at the End of the Industrial Age", 2015)

"The dialectical interplay of experiment and theory is a key driving force of modern science. Experimental data do only have meaning in the light of a particular model or at least a theoretical background. Reversely theoretical considerations may be logically consistent as well as intellectually elegant: Without experimental evidence they are a mere exercise of thought no matter how difficult they are. Data analysis is a connector between experiment and theory: Its techniques advise possibilities of model extraction as well as model testing with experimental data." (Achim Zielesny, "From Curve Fitting to Machine Learning" 2nd Ed., 2016)

"Bias is error from incorrect assumptions built into the model, such as restricting an interpolating function to be linear instead of a higher-order curve. [...] Errors of bias produce underfit models. They do not fit the training data as tightly as possible, were they allowed the freedom to do so. In popular discourse, I associate the word 'bias' with prejudice, and the correspondence is fairly apt: an apriori assumption that one group is inferior to another will result in less accurate predictions than an unbiased one. Models that perform lousy on both training and testing data are underfit." (Steven S Skiena, "The Data Science Design Manual", 2017)

"Early stopping and regularization can ensure network generalization when you apply them properly. [...] With early stopping, the choice of the validation set is also important. The validation set should be representative of all points in the training set. When you use Bayesian regularization, it is important to train the network until it reaches convergence. The sum-squared error, the sum-squared weights, and the effective number of parameters should reach constant values when the network has converged. With both early stopping and regularization, it is a good idea to train the network starting from several different initial conditions. It is possible for either method to fail in certain circumstances. By testing several different initial conditions, you can verify robust network performance." (Mark H Beale et al, "Neural Network Toolbox™ User's Guide", 2017)

"Scientists generally agree that no theory is 100 percent correct. Thus, the real test of knowledge is not truth, but utility." (Yuval N Harari, "Sapiens: A brief history of humankind", 2017)

"Variance is error from sensitivity to fluctuations in the training set. If our training set contains sampling or measurement error, this noise introduces variance into the resulting model. [...] Errors of variance result in overfit models: their quest for accuracy causes them to mistake noise for signal, and they adjust so well to the training data that noise leads them astray. Models that do much better on testing data than training data are overfit." (Steven S Skiena, "The Data Science Design Manual", 2017)

"[...] a hypothesis test tells us whether the observed data are consistent with the null hypothesis, and a confidence interval tells us which hypotheses are consistent with the data." (William C Blackwelder)

🔭Data Science: Matching (Just the Quotes)

"A physical theory must accept some actual data as inputs and must be able to generate from them another set of possible data (the output) in such a way that both input and output match the assumptions of the theory - laws, constraints, etc. This concept of matching involves relevance: thus boundary conditions are relevant only to field-like theories such as hydrodynamics and quantum mechanics. But matching is more than relevance: it is also logical compatibility." (Mario Bunge, "Philosophy of Physics", 1973)

"The matching procedure often helps inform the reader what is going on in the data […] Matching has some defects, chiefly that it is difficult to do a very good job of matching in complex situations without a large number of cases. […] One limitation of matching, then, is that quite often the match is not very accurate. A second limitation is that if we want to control for more than one variable using matching procedures, the tables begin to have combinations of categories without any cases at all in them, and they become somewhat more difficult for the reader to understand." (Edward R Tufte, "Data Analysis for Politics and Policy", 1974)

"Generalization is the process of matching new, unknown input data with the problem knowledge in order to obtain the best possible solution, or one close to it. Generalization means reacting properly to new situations, for example, recognizing new images, or classifying new objects and situations. Generalization can also be described as a transition from a particular object description to a general concept description. This is a major characteristic of all intelligent systems." (Nikola K Kasabov, "Foundations of Neural Networks, Fuzzy Systems, and Knowledge Engineering", 1996)

"More than just a new computing architecture, neural networks offer a completely different paradigm for solving problems with computers. […] The process of learning in neural networks is to use feedback to adjust internal connections, which in turn affect the output or answer produced. The neural processing element combines all of the inputs to it and produces an output, which is essentially a measure of the match between the input pattern and its connection weights. When hundreds of these neural processors are combined, we have the ability to solve difficult problems such as credit scoring." (Joseph P Bigus,"Data Mining with Neural Networks: Solving business problems from application development to decision support", 1996)

"Because No Free Lunch theorems dictate that no optimization algorithm can be considered more efficient than any other when considering all possible functions, the desired function class plays a prominent role in the model. In particular, this provides a tractable way to answer the traditionally difficult question of what algorithm is best matched to a particular class of functions. Among the benefits of the model are the ability to specify the function class in a straightforward manner, a natural way to specify noisy or dynamic functions, and a new source of insight into No Free Lunch theorems for optimization." (Christopher K Monson, "No Free Lunch, Bayesian Inference, and Utility: A Decision-Theoretic Approach to Optimization", [thesis] 2006)

"Whereas regression is about attempting to specify the underlying relationship that summarises a set of paired data, correlation is about assessing the strength of that relationship. Where there is a very close match between the scatter of points and the regression line, correlation is said to be 'strong' or 'high' . Where the points are widely scattered, the correlation is said to be 'weak' or 'low'." (Alan Graham, "Developing Thinking in Statistics", 2006)

"A decision theory that rests on the assumptions that human cognitive capabilities are limited and that these limitations are adaptive with respect to the decision environments humans frequently encounter. Decision are thought to be made usually without elaborate calculations, but instead by using fast and frugal heuristics. These heuristics certainly have the advantage of speed and simplicity, but if they are well matched to a decision environment, they can even outperform maximizing calculations with respect to accuracy. The reason for this is that many decision environments are characterized by incomplete information and noise. The information we do have is usually structured in a specific way that clever heuristics can exploit." (E Ebenhoh, "Agent-Based Modelnig with Boundedly Rational Agents", 2007)

"Learning a complicated function that matches the training data closely but fails to recognize the underlying process that generates the data. As a result of overfitting, the model performs poor on new input. Overfitting occurs when the training patterns are sparse in input space and/or the trained networks are too complex." (Frank Padberg, "Counting the Hidden Defects in Software Documents", 2010)

"Unfortunately, creating an objective function that matches the true goal of the data mining is usually impossible, so data scientists often choose based on faith and experience." (Foster Provost & Tom Fawcett, "Data Science for Business", 2013)

"Matching is a family of methods for estimating causal effects by matching similar observations (or units) in the treatment and non-treatment groups. The goal of matching is to make comparisons between similar units in order to achieve as precise an estimate of the true causal effect as possible." (Aleksander Molak, "Causal Inference and Discovery in Python", 2023)

29 December 2018

🔭Data Science: Data (Just the Quotes)

"Before anything can be reasoned upon to a conclusion, certain facts, principles, or data, to reason from, must be established, admitted, or denied." (Thomas Paine, "Rights of Man", 1791) 

"The errors which arise from the absence of facts are far more numerous and more durable than those which result from unsound reasoning respecting true data." (Charles Babbage, "On the Economy of Machinery and Manufactures", 1832)

"In every branch of knowledge the progress is proportional to the amount of facts on which to build, and therefore to the facility of obtaining data." (James C Maxwell, [letter to Lewis Campbell] 1851)

"It is a capital mistake to theorise before one has data." (Arthur C Doyle, "The Adventures of Sherlock Holmes", 1892)

"The man of science, by virtue of his training, is alone capable of realising the difficulties - often enormous - of obtaining accurate data upon which just judgment may be based." (Sir Richard Gregory, "Discovery; or, The Spirit and Service of Science", 1918)

"Not even the most subtle and skilled analysis can overcome completely the unreliability of basic data." (Roy D G Allen, "Statistics for Economists", 1951)

"When evaluating the reliability and generality of data, it is often important to know the aims of the experimenter. When evaluating the importance of experimental results, however, science has a trick of disregarding the experimenter's rationale and finding a more appropriate context for the data than the one he proposed." (Murray Sidman, "Tactics of Scientific Research", 1960)

"Philosophers of science have repeatedly demonstrated that more than one theoretical construction can always be placed upon a given collection of data." (Thomas Kuhn, "The Structure of Scientific Revolutions", 1962) 

"Modern science is characterized by its ever-increasing specialization, necessitated by the enormous amount of data, the complexity of techniques and of theoretical structures within every field. Thus science is split into innumerable disciplines continually generating new subdisciplines. In consequence, the physicist, the biologist, the psychologist and the social scientist are, so to speak, encapusulated in their private universes, and it is difficult to get word from one cocoon to the other." (Ludwig von Bertalanffy, "General System Theory", 1968)

"At root what is needed for scientific inquiry is just receptivity to data, skill in reasoning, and yearning for truth. Admittedly, ingenuity can help too." (Willard v O Quine, "The Web of Belief", 1970)

"Statistical methods of analysis are intended to aid the interpretation of data that are subject to appreciable haphazard variability." (David V. Hinkley & David Cox, "Theoretical Statistics", 1974)

"In a way, science might be described as paranoid thinking applied to Nature: we are looking for natural conspiracies, for connections among apparently disparate data." (Carl Sagan, "The Dragons of Eden", 1977)

"If we gather more and more data and establish more and more associations, however, we will not finally find that we know something. We will simply end up having more and more data and larger sets of correlations." (Kenneth N Waltz, "Theory of International Politics Source: Theory of International Politics", 1979)

"There is a tendency to mistake data for wisdom, just as there has always been a tendency to confuse logic with values, intelligence with insight. Unobstructed access to facts can produce unlimited good only if it is matched by the desire and ability to find out what they mean and where they lead." (Norman Cousins, "Human Options : An Autobiographical Notebook", 1981) 

"Data in isolation are meaningless, a collection of numbers. Only in context of a theory do they assume significance […]" (George Greenstein, "Frozen Star", 1983)

"Data is raw. It simply exists and has no significance beyond its existence (in and of itself). It can exist in any form, usable or not. It does not have meaning of itself. In computer parlance, a spreadsheet generally starts out by holding data." (Russell L Ackoff, "Towards a Systems Theory of Organization, 1985)

"Information is data that has been given meaning by way of relational connection. This "meaning" can be useful, but does not have to be. In computer parlance, a relational database makes information from the data stored within it." (Russell L Ackoff, "Towards a Systems Theory of Organization", 1985)

"Intuition becomes increasingly valuable in the new information society precisely because there is so much data." (John Naisbitt, "Re-Inventing the Corporation", 1988)

"The unit of coding is the most basic segment, or element, of the raw data or information that can be assessed in a meaningful way regarding the phenomenon." (Richard Boyatzis, "Transforming qualitative information", 1998)

"Data are collected as a basis for action. Yet before anyone can use data as a basis for action the data have to be interpreted. The proper interpretation of data will require that the data be presented in context, and that the analysis technique used will filter out the noise."  (Donald J Wheeler, "Understanding Variation: The Key to Managing Chaos" 2nd Ed., 2000)

"While all data contain noise, some data contain signals. Before you can detect a signal, you must filter out the noise." (Donald J Wheeler, "Understanding Variation: The Key to Managing Chaos" 2nd Ed., 2000)

"[…] you simply cannot make sense of any number without a contextual basis. Yet the traditional attempts to provide this contextual basis are often flawed in their execution. [...] Data have no meaning apart from their context. Data presented without a context are effectively rendered meaningless.(Donald J Wheeler, "Understanding Variation: The Key to Managing Chaos" 2nd Ed., 2000)

"The more data we have, the more likely we are to drown in it." (Nassim N Taleb, "Fooled by Randomness", 2001)

"Data is a fact of life. As time goes by, we collect more and more data, making our original reason for collecting the data harder to accomplish. We don't collect data just to waste time or keep busy; we collect data so that we can gain knowledge, which can be used to improve the efficiency of our organization, improve profit margins, and on and on. The problem is that as we collect more data, it becomes harder for us to use the data to derive this knowledge. We are being suffocated by this raw data, yet we need to find a way to use it." (Seth Paul et al. "Preparing and Mining Data with Microsoft SQL Server 2000 and Analysis", 2002)

"Blissful data consist of information that is accurate, meaningful, useful, and easily accessible to many people in an organization. These data are used by the organization’s employees to analyze information and support their decision-making processes to strategic action. It is easy to see that organizations that have reached their goal of maximum productivity with blissful data can triumph over their competition. Thus, blissful data provide a competitive advantage.". (Margaret Y Chu, "Blissful Data", 2004)

"Perception requires imagination because the data people encounter in their lives are never complete and always equivocal. [...] We also use our imagination and take shortcuts to fill gaps in patterns of nonvisual data. As with visual input, we draw conclusions and make judgments based on uncertain and incomplete information, and we conclude, when we are done analyzing the patterns, that out picture is clear and accurate. But is it?" (Leonard Mlodinow, "The Drunkard’s Walk: How Randomness Rules Our Lives", 2008)

"By giving data back to the user, you can create both engagement and revenue. We’re far enough into the data game that most users have realized that they’re not the customer, they’re the product. Their role in the system is to generate data, either to assist in ad targeting or to be sold to the highest bidder, or both." (Dhanurjay Patil, "Data Jujitsu: The Art of Turning Data into Product", 2012)

"Generalizing beyond advertising, when building any data product in which the data is obfuscated (where there isn’t a clear relationship between the user and the result), you can compromise on precision, but not on recall. But when the data is exposed, focus on high precision." (Dhanurjay Patil, "Data Jujitsu: The Art of Turning Data into Product", 2012)

"The value of having numbers - data - is that they aren't subject to someone else's interpretation. They are just the numbers. You can decide what they mean for you." (Emily Oster, "Expecting Better", 2013)

"A study that leaves out data is waving a big red flag. A decision to include or exclude data sometimes makes all the difference in the world. This decision should be based on the relevance and quality of the data, not on whether the data support or undermine a conclusion that is expected or desired." (Gary Smith, "Standard Deviations", 2014)

"Another way to secure statistical significance is to use the data to discover a theory. Statistical tests assume that the researcher starts with a theory, collects data to test the theory, and reports the results - whether statistically significant or not. Many people work in the other direction, scrutinizing the data until they find a pattern and then making up a theory that fits the pattern." (Gary Smith, "Standard Deviations", 2014)

"Data clusters are everywhere, even in random data. Someone who looks for an explanation will inevitably find one, but a theory that fits a data cluster is not persuasive evidence. The found explanation needs to make sense and it needs to be tested with uncontaminated data." (Gary Smith, "Standard Deviations", 2014)

"Data without theory can fuel a speculative stock market bubble or create the illusion of a bubble where there is none. How do we tell the difference between a real bubble and a false alarm? You know the answer: we need a theory. Data are not enough. […] Data without theory is alluring, but misleading." (Gary Smith, "Standard Deviations", 2014)

"These practices - selective reporting and data pillaging - are known as data grubbing. The discovery of statistical significance by data grubbing shows little other than the researcher’s endurance. We cannot tell whether a data grubbing marathon demonstrates the validity of a useful theory or the perseverance of a determined researcher until independent tests confirm or refute the finding. But more often than not, the tests stop there. After all, you won’t become a star by confirming other people’s research, so why not spend your time discovering new theories? The data-grubbed theory consequently sits out there, untested and unchallenged." (Gary Smith, "Standard Deviations", 2014)

"We naturally draw conclusions from what we see […]. We should also think about what we do not see […]. The unseen data may be just as important, or even more important, than the seen data. To avoid survivor bias, start in the past and look forward." (Gary Smith, "Standard Deviations", 2014)

"The term data, unlike the related terms facts and evidence, does not connote truth. Data is descriptive, but data can be erroneous. We tend to distinguish data from information. Data is a primitive or atomic state (as in ‘raw data’). It becomes information only when it is presented in context, in a way that informs. This progression from data to information is not the only direction in which the relationship flows, however; information can also be broken down into pieces, stripped of context, and stored as data. This is the case with most of the data that’s stored in computer systems. Data that’s collected and stored directly by machines, such as sensors, becomes information only when it’s reconnected to its context."  (Stephen Few, "Signal: Understanding What Matters in a World of Noise", 2015)

"To find signals in data, we must learn to reduce the noise - not just the noise that resides in the data, but also the noise that resides in us. It is nearly impossible for noisy minds to perceive anything but noise in data. […] Signals always point to something. In this sense, a signal is not a thing but a relationship. Data becomes useful knowledge of something that matters when it builds a bridge between a question and an answer. This connection is the signal." (Stephen Few, "Signal: Understanding What Matters in a World of Noise", 2015)

"Data is such an incredible lever arm for change, we need to make sure that the change that is coming, is the one we all want to see." (Dhanurjay Patil, "A Code of Ethics for Data Science", 2016)

"The first epistemic principle to embrace is that there is always a gap between our data and the real world. We fall headfirst into a pitfall when we forget that this gap exists, that our data isn't a perfect reflection of the real-world phenomena it's representing. Do people really fail to remember this? It sounds so basic. How could anyone fall into such an obvious trap?" (Ben Jones, "Avoiding Data Pitfalls: How to Steer Clear of Common Blunders When Working with Data and Presenting Analysis and Visualizations", 2020)

"The way we explore data today, we often aren't constrained by rigid hypothesis testing or statistical rigor that can slow down the process to a crawl. But we need to be careful with this rapid pace of exploration, too. Modern business intelligence and analytics tools allow us to do so much with data so quickly that it can be easy to fall into a pitfall by creating a chart that misleads us in the early stages of the process." (Ben Jones, "Avoiding Data Pitfalls: How to Steer Clear of Common Blunders When Working with Data and Presenting Analysis and Visualizations", 2020)

"In general, the more complex the data, the more the analyst has to make prior inferences of what is considered normal for modeling purposes." (Charu C Aggarwal, "Artificial Intelligence: A Textbook", 2021)

"Understanding the entire data ecosystem, from the production of a data point to its consumption in a dashboard or a visualization, provides the ability to invoke action, which is more valuable than the mere sum of its parts." (Jesús Barrasa et al, "Knowledge Graphs: Data in Context for Responsive Businesses", 2021)

"Data has historically been treated as a second-class citizen, as a form of exhaust or by-product emitted by business applications. This application-first thinking remains the major source of problems in today’s computing environments, leading to ad hoc data pipelines, cobbled together data access mechanisms, and inconsistent sources of similar-yet-different truths. Data mesh addresses these shortcomings head-on, by fundamentally altering the relationships we have with our data. Instead of a secondary by-product, data, and the access to it, is promoted to a first-class citizen on par with any other business service." (Adam Bellemare,"Building an Event-Driven Data Mesh: Patterns for Designing and Building Event-Driven Architectures", 2023)

"'Let the data speak'" is a catchy and powerful slogan, but [...] data itself is not always enough. It’s worth remembering that in many cases 'data cannot speak for themselves' and we might need more information than just observations to address some of our questions." (Aleksander Molak, "Causal Inference and Discovery in Python", 2023)

"Data are most valuable at their point of origin. The value of data is directly related to their timeliness." (Lawrence M Miller)

"Too little attention is given to the need for statistical control, or to put it more pertinently, since statistical control (randomness) is so rarely found, too little attention is given to the interpretation of data that arise from conditions not in statistical control." (William E Deming)

More quotes on" Data" at the-web-of-knowledge.blogspot.com.

🔭Data Science: Experience (Just the Quotes)

"[…] it is from long experience chiefly that we are to expect the most certain rules of practice, yet it is withal to be remembered, that observations, and to put us upon the most probable means of improving any art, is to get the best insight we can into the nature and properties of those things which we are desirous to cultivate and improve." (Stephen Hales, "Vegetable Staticks", 1727)

"In order to supply the defects of experience, we will have recourse to the probable conjectures of analogy, conclusions which we will bequeath to our posterity to be ascertained by new observations, which, if we augur rightly, will serve to establish our theory and to carry it gradually nearer to absolute certainty." (Johann H Lambert, "The System of the World", 1800)

"Induction, analogy, hypotheses founded upon facts and rectified continually by new observations, a happy tact given by nature and strengthened by numerous comparisons of its indications with experience, such are the principal means for arriving at truth." (Pierre-Simon Laplace, "A Philosophical Essay on Probabilities", 1814)

"Observation is so wide awake, and facts are being so rapidly added to the sum of human experience, that it appears as if the theorizer would always be in arrears, and were doomed forever to arrive at imperfect conclusion; but the power to perceive a law is equally rare in all ages of the world, and depends but little on the number of facts observed." (Henry D Thoreau, "A Week on the Concord and Merrimack Rivers", 1862)

"Science is the systematic classification of experience." (George H Lewes, "The Physical Basis of Mind", 1877)

"Experience teaches that one will be led to new discoveries almost exclusively by means of special mechanical models." (Ludwig Boltzmann, "Lectures on Gas Theory", 1896)

"Philosophy, like science, consists of theories or insights arrived at as a result of systemic reflection or reasoning in regard to the data of experience. It involves, therefore, the analysis of experience and the synthesis of the results of analysis into a comprehensive or unitary conception. Philosophy seeks a totality and harmony of reasoned insight into the nature and meaning of all the principal aspects of reality." (Joseph A Leighton, "The Field of Philosophy: An outline of lectures on introduction to philosophy", 1919)

"Abstraction is the detection of a common quality in the characteristics of a number of diverse observations […] A hypothesis serves the same purpose, but in a different way. It relates apparently diverse experiences, not by directly detecting a common quality in the experiences themselves, but by inventing a fictitious substance or process or idea, in terms of which the experience can be expressed. A hypothesis, in brief, correlates observations by adding something to them, while abstraction achieves the same end by subtracting something." (Herbert Dingle, Science and Human Experience, 1931)

"It can scarcely be denied that the supreme goal of all theory is to make the irreducible basic elements as simple and as few as possible without having to surrender the adequate representation of a single datum of experience." (Albert Einstein, [lecture] 1933)

"A scientist, whether theorist or experimenter, puts forward statements, or systems of statements, and tests them step by step. In the field of the empirical sciences, more particularly, he constructs hypotheses, or systems of theories, and tests them against experience by observation and experiment." (Karl Popper, "The Logic of Scientific Discovery", 1934)

"Science does not aim, primarily, at high probabilities. It aims at a high informative content, well backed by experience. But a hypothesis may be very probable simply because it tells us nothing, or very little." (Karl Popper, "The Logic of Scientific Discovery", 1934) 

"Science is a system of statements based on direct experience, and controlled by experimental verification. Verification in science is not, however, of single statements but of the entire system or a sub-system of such statements." (Rudolf Carnap, "The Unity of Science", 1934)

"Science is the attempt to make the chaotic diversity of our sense experience correspond to a logically uniform system of thought." (Albert Einstein, "Considerations Concerning the Fundaments of Theoretical Physics", Science Vol. 91 (2369), 1940)

"A model, like a novel, may resonate with nature, but it is not a ‘real’ thing. Like a novel, a model may be convincing - it may ‘ring true’ if it is consistent with our experience of the natural world. But just as we may wonder how much the characters in a novel are drawn from real life and how much is artifice, we might ask the same of a model: How much is based on observation and measurement of accessible phenomena, how much is convenience? Fundamentally, the reason for modeling is a lack of full access, either in time or space, to the phenomena of interest." (Kenneth Belitz, Science, Vol. 263, 1944)

"Every bit of knowledge we gain and every conclusion we draw about the universe or about any part or feature of it depends finally upon some observation or measurement. Mankind has had again and again the humiliating experience of trusting to intuitive, apparently logical conclusions without observations, and has seen Nature sail by in her radiant chariot of gold in an entirely different direction." (Oliver J Lee, "Measuring Our Universe: From the Inner Atom to Outer Space", 1950)

"Statistics is the name for that science and art which deals with uncertain inferences - which uses numbers to find out something about nature and experience." (Warren Weaver, 1952)

"The only relevant test of the validity of a hypothesis is comparison of prediction with experience." (Milton Friedman, "Essays in Positive Economics", 1953)

"Mathematical statistics provides an exceptionally clear example of the relationship between mathematics and the external world. The external world provides the experimentally measured distribution curve; mathematics provides the equation (the mathematical model) that corresponds to the empirical curve. The statistician may be guided by a thought experiment in finding the corresponding equation." (Marshall J Walker, "The Nature of Scientific Thought", 1963)

"Experience without theory teaches nothing." (William E Deming, "Out of the Crisis", 1986)

"A discovery in science, or a new theory, even where it appears most unitary and most all-embracing, deals with some immediate element of novelty or paradox within the framework of far vaster, unanalyzed, unarticulated reserves of knowledge, experience, faith, and presupposition. Our progress is narrow: it takes a vast world unchallenged and for granted." (James R Oppenheimer, "Atom and Void", 1989)

"It is ironic but true: the one reality science cannot reduce is the only reality we will ever know. This is why we need art. By expressing our actual experience, the artist reminds us that our science is incomplete, that no map of matter will ever explain the immateriality of our consciousness." (Jonah Lehrer, "Proust Was a Neuroscientist", 2011)

"Science, at its core, is simply a method of practical logic that tests hypotheses against experience. Scientism, by contrast, is the worldview and value system that insists that the questions the scientific method can answer are the most important questions human beings can ask, and that the picture of the world yielded by science is a better approximation to reality than any other." (John M Greer, "After Progress: Reason and Religion at the End of the Industrial Age", 2015)

"Ideally, a decision maker or a forecaster will combine the outside view and the inside view - or, similarly, statistics plus personal experience. But it’s much better to start with the statistical view, the outside view, and then modify it in the light of personal experience than it is to go the other way around. If you start with the inside view you have no real frame of reference, no sense of scale - and can easily come up with a probability that is ten times too large, or ten times too small." (Tim Harford, "The Data Detective: Ten easy rules to make sense of statistics", 2020)

"Statistical metrics can show us facts and trends that would be impossible to see in any other way, but often they’re used as a substitute for relevant experience, by managers or politicians without specific expertise or a close-up view." (Tim Harford, "The Data Detective: Ten easy rules to make sense of statistics", 2020)

"The contradiction between what we see with our own eyes and what the statistics claim can be very real. […] The truth is more complicated. Our personal experiences should not be dismissed along with our feelings, at least not without further thought. Sometimes the statistics give us a vastly better way to understand the world; sometimes they mislead us. We need to be wise enough to figure out when the statistics are in conflict with everyday experience - and in those cases, which to believe." (Tim Harford, "The Data Detective: Ten easy rules to make sense of statistics", 2020)

Related Posts Plugin for WordPress, Blogger...

About Me

My photo
Koeln, NRW, Germany
IT Professional with more than 24 years experience in IT in the area of full life-cycle of Web/Desktop/Database Applications Development, Software Engineering, Consultancy, Data Management, Data Quality, Data Migrations, Reporting, ERP implementations & support, Team/Project/IT Management, etc.