SQL Troubles

30 April 2006

🖍️Ronald A Fisher - Collected Quotes

"It may often happen that an inefficient statistic is accurate enough to answer the particular questions at issue. There is however, one limitation to the legitimate use of inefficient statistics which should be noted in advance. If we are to make accurate tests of goodness of fit, the methods of fitting employed must not introduce errors of fitting comparable to the errors of random sampling; when this requirement is investigated, it appears that when tests of goodness of fit are required, the statistics employed in fitting must be not only consistent, but must be of 100 percent efficiency. This is a very serious limitation to the use of inefficient statistics, since in the examination of any body of data it is desirable to be able at any time to test the validity of one or more of the provisional assumptions which have been made." (Sir Ronald A Fisher, "Statistical Methods for Research Workers", 1925)

"No human mind is capable of grasping in its entirety the meaning of any considerable quantity of numerical data." (Sir Ronald A Fisher, "Statistical Methods for Research Workers", 1925)

"Statistics may be regarded as (i) the study of populations, (ii) as the study of variation, and (iii) as the study of methods of the reduction of data." (Sir Ronald A Fisher, "Statistical Methods for Research Worker", 1925)

"The conception of statistics as the study of variation is the natural outcome of viewing the subject as the study of populations; for a population of individuals in all respects identical is completely described by a description of anyone individual, together with the number in the group. The populations which are the object of statistical study always display variations in one or more respects. To speak of statistics as the study of variation also serves to emphasise the contrast between the aims of modern statisticians and those of their predecessors." (Sir Ronald A Fisher, "Statistical Methods for Research Workers", 1925)

"The preliminary examination of most data is facilitated by the use of diagrams. Diagrams prove nothing, but bring outstanding features readily to the eye; they are therefore no substitutes for such critical tests as may be applied to the data, but are valuable in suggesting such tests, and in explaining the conclusions founded upon them." (Sir Ronald A Fisher, "Statistical Methods for Research Workers", 1925)

"The problems which arise in the reduction of data may thus conveniently be divided into three types: (i) Problems of Specification, which arise in the choice of the mathematical form of the population. (ii) When a specification has been obtained, problems of Estimation arise. These involve the choice among the methods of calculating, from our sample, statistics fit to estimate the unknow n parameters of the population. (iii) Problems of Distribution include the mathematical deduction of the exact nature of the distributions in random samples of our estimates of the parameters, and of other statistics designed to test the validity of our specification (tests of Goodness of Fit)." (Sir Ronald A Fisher, "Statistical Methods for Research Workers", 1925)

"The statistical examination of a body of data is thus logically similar to the general alternation of inductive and deductive methods throughout the sciences. A hypothesis is conceived and defined with all necessary exactitude; its logical consequences are ascertained by a deductive argument; these consequences are compared with the available observations; if these are completely in, accord with the deductions, the hypothesis is justified at least until fresh and more stringent observations are available." (Sir Ronald A Fisher, "Statistical Methods for Research Workers", 1925)

"In expositions of the scientific use of experimentation it is frequent to find an excessive stress laid on the importance of varying the essential conditions one at a time [...] in the state of knowledge or ignorance in which genuine research, intended to advance knowledge, has to be carried on, this simple formula is not very helpful. We are usually ignorant which, out of innumerable possible factors, may prove ultimately to be the most important, though we may have strong presuppositions that some few of them are particularly worthy of study." (Sir Ronald A Fisher, "The Design of Experiments", 1935)

"In relation to any experiment we may speak of this hypothesis as the 'null hypothesis', and it should be noted that the null hypothesis is never proved or established, but is possibly disproved, in the course of experimentation. Every experiment may be said to exist only in order to give the facts a chance of disproving the null hypothesis." (Sir Ronald A Fisher, "The Design of Experiments", 1935)

"Inductive inference is the only process known to us by which essential new knowledge comes into the world." (Sir Ronald A Fisher, "The Design of Experiments", 1935)

"[…] no isolated experiment, however significant in itself, can suffice for the experimental demonstration of any natural phenomenon; for the ‘one chance in a million’ will undoubtedly occur, with no less and no more than its appropriate frequency, however surprised we may be that it should occur to us." (Sir Ronald A Fisher, "The Design of Experiments", 1935)

"Statistical procedure and experimental design are only two different aspects of the same whole, and that whole is the logical requirements of the complete process of adding to natural knowledge by experimentation." (Sir Ronald A Fisher, "The Design of Experiments", 1935)

"The statistician cannot excuse himself from the duty of getting his head clear on the principles of scientific inference, but equally no other thinking man can avoid a like obligation." (Sir Ronald A Fisher, "The Design of Experiments", 1935)

"To consult the statistician after an experiment is finished is often merely to ask him to conduct a post mortem examination. He can perhaps say what the experiment died of." (Sir Ronald A Fisher, [presidential address] 1938)

"The effects of chance are the most accurately calculable, and therefore the least doubtful of all the factors of an evolutionary situation." (Sir Ronald A Fisher, "Croonian Lecture: Population Genetics", Proceedings of the Royal Society of London Vol. 141, 1955)

"The precise specification of our knowledge is, however, the same as the precise specification of our ignorance." (Sir Ronald A Fisher, "Statistical Methods and Scientific Inference", 1959)

29 April 2006

🖍️Randall E Schumacker - Collected Quotes

"Given the important role that correlation plays in structural equation modeling, we need to understand the factors that affect establishing relationships among multivariable data points. The key factors are the level of measurement, restriction of range in data values (variability, skewness, kurtosis), missing data, nonlinearity, outliers, correction for attenuation, and issues related to sampling variation, confidence intervals, effect size, significance, sample size, and power." (Randall E Schumacker & Richard G Lomax, "A Beginner’s Guide to Structural Equation Modeling" 3rd Ed., 2010)

"Need to consider outliers as they can affect statistics such as means, standard deviations, and correlations. They can either be explained, deleted, or accommodated (using either robust statistics or obtaining additional data to fill-in). Can be detected by methods such as box plots, scatterplots, histograms or frequency distributions." (Randall E Schumacker & Richard G Lomax, "A Beginner’s Guide to Structural Equation Modeling" 3rd Ed., 2010)

"Outliers or influential data points can be defined as data values that are extreme or atypical on either the independent (X variables) or dependent (Y variables) variables or both. Outliers can occur as a result of observation errors, data entry errors, instrument errors based on layout or instructions, or actual extreme values from self-report data. Because outliers affect the mean, the standard deviation, and correlation coefficient values, they must be explained, deleted, or accommodated by using robust statistics." (Randall E Schumacker & Richard G Lomax, "A Beginner’s Guide to Structural Equation Modeling" 3rd Ed., 2010)

"Structural equation modeling is a correlation research method; therefore, the measurement scale, restriction of range in the data values, missing data, outliers, nonlinearity, and nonnormality of data affect the variance–covariance among variables and thus can impact the SEM analysis." (Randall E Schumacker & Richard G Lomax, "A Beginner’s Guide to Structural Equation Modeling" 3rd Ed., 2010)

"Structural equation modeling (SEM) uses various types of models to depict relationships among observed variables, with the same basic goal of providing a quantitative test of a theoretical model hypothesized by the researcher. More specifically, various theoretical models can be tested in SEM that hypothesize how sets of variables define constructs and how these constructs are related to each other." (Randall E Schumacker & Richard G Lomax, "A Beginner’s Guide to Structural Equation Modeling" 3rd Ed., 2010)

"There are several key issues in the field of statistics that impact our analyses once data have been imported into a software program. These data issues are commonly referred to as the measurement scale of variables, restriction in the range of data, missing data values, outliers, linearity, and nonnormality." (Randall E Schumacker & Richard G Lomax, "A Beginner’s Guide to Structural Equation Modeling" 3rd Ed., 2010)

🖍️John W Tukey - Collected Quotes

"[We] need men who can practice science - not a particular science - in a word, we need scientific generalists." (John W Tukey, "The Education of a Scientific Generalist", 1949)

"[...] the whole of modern statistics, philosophy and methods alike, is based on the principle of interpreting what did happen in terms of what might have happened." (John W Tukey, "Standard Methods of Analyzing Data, 1951)

"Just remember that not all statistics has been mathematized - and that we may not have to wait for its mathematization in order to use it." (John W Tukey, "The Growth of Experimental Design in a Research Laboratory, 1953)

"Difficulties in identifying problems have delayed statistics far more than difficulties in solving problems." (John W Tukey, Unsolved Problems of Experimental Statistics, 1954)

"Predictions, prophecies, and perhaps even guidance - those who suggested this title to me must have hoped for such-even though occasional indulgences in such actions by statisticians has undoubtedly contributed to the characterization of a statistician as a man who draws straight lines from insufficient data to foregone conclusions!" (John W Tukey, "Where do We Go From Here?", Journal of the American Statistical Association, Vol. 55 (289), 1960)

"Today one of statistics' great needs is a body of able investigators who make it clear to the intellectual world that they are scientific statisticians. and they are proud of that fact that to them mathematics is incidental, though perhaps indispensable." (John W Tukey, "Statistical and Quantitative Methodology, 1961)

"If data analysis is to be well done, much of it must be a matter of judgment, and ‘theory’ whether statistical or non-statistical, will have to guide, not command." (John W Tukey, "The Future of Data Analysis", Annals of Mathematical Statistics, Vol. 33 (1), 1962)

"The most important maxim for data analysis to heed, and one which many statisticians seem to have shunned is this: ‘Far better an approximate answer to the right question, which is often vague, than an exact answer to the wrong question, which can always be made precise.’ Data analysis must progress by approximate answers, at best, since its knowledge of what the problem really is will at best be approximate." (John W Tukey, "The Future of Data Analysis", Annals of Mathematical Statistics, Vol. 33, No. 1, 1962)

"The physical sciences are used to ‘praying over’ their data, examining the same data from a variety of points of view. This process has been very rewarding, and has led to many extremely valuable insights. Without this sort of flexibility, progress in physical science would have been much slower. Flexibility in analysis is often to be had honestly at the price of a willingness not to demand that what has already been observed shall establish, or prove, what analysis suggests. In physical science generally, the results of praying over the data are thought of as something to be put to further test in another experiment, as indications rather than conclusions." (John W Tukey, "The Future of Data Analysis", Annals of Mathematical Statistics Vol. 33 (1), 1962)

"The histogram, with its columns of area proportional to number, like the bar graph, is one of the most classical of statistical graphs. Its combination with a fitted bell-shaped curve has been common since the days when the Gaussian curve entered statistics. Yet as a graphical technique it really performs quite poorly. Who is there among us who can look at a histogram-fitted Gaussian combination and tell us, reliably, whether the fit is excellent, neutral, or poor? Who can tell us, when the fit is poor, of what the poorness consists? Yet these are just the sort of questions that a good graphical technique should answer at least approximately." (John W Tukey, "The Future of Processes of Data Analysis", 1965)

"The first step in data analysis is often an omnibus step. We dare not expect otherwise, but we equally dare not forget that this step, and that step, and other step, are all omnibus steps and that we owe the users of such techniques a deep and important obligation to develop ways, often varied and competitive, of replacing omnibus procedures by ones that are more sharply focused." (John W Tukey, "The Future of Processes of Data Analysis", 1965)

"The basic general intent of data analysis is simply stated: to seek through a body of data for interesting relationships and information and to exhibit the results in such a way as to make them recognizable to the data analyzer and recordable for posterity. Its creative task is to be productively descriptive, with as much attention as possible to previous knowledge, and thus to contribute to the mysterious process called insight." (John W Tukey & Martin B Wilk, "Data Analysis and Statistics: An Expository Overview", 1966)

"Comparable objectives in data analysis are (l) to achieve more specific description of what is loosely known or suspected; (2) to find unanticipated aspects in the data, and to suggest unthought-of-models for the data's summarization and exposure; (3) to employ the data to assess the (always incomplete) adequacy of a contemplated model; (4) to provide both incentives and guidance for further analysis of the data; and (5) to keep the investigator usefully stimulated while he absorbs the feeling of his data and considers what to do next." (John W Tukey & Martin B Wilk, "Data Analysis and Statistics: An Expository Overview", 1966)

"The science and art of data analysis concerns the process of learning from quantitative records of experience. By its very nature it exists in relation to people. Thus, the techniques and the technology of data analysis must be harnessed to suit human requirements and talents. Some implications for effective data analysis are: (1) that it is essential to have convenience of interaction of people and intermediate results and (2) that at all stages of data analysis the nature and detail of output, both actual and potential, need to be matched to the capabilities of the people who use it and want it." (John W Tukey & Martin B Wilk, "Data Analysis and Statistics: An Expository Overview", 1966)

"In many instances, a picture is indeed worth a thousand words. To make this true in more diverse circumstances, much more creative effort is needed to pictorialize the output from data analysis. Naive pictures are often extremely helpful, but more sophisticated pictures can be both simple and even more informative." (John W Tukey & Martin B Wilk, "Data Analysis and Statistics: An Expository Overview", 1966)

"Data analysis must be iterative to be effective. [...] The iterative and interactive interplay of summarizing by fit and exposing by residuals is vital to effective data analysis. Summarizing and exposing are complementary and pervasive." (John W Tukey & Martin B Wilk, "Data Analysis and Statistics: An Expository Overview", 1966)

"Summarizing data is a process of constrained and partial a process that essentially and inevitably corresponds to description - some sort of fitting, though it need not necessarily involve formal criteria or well-defined computations." (John W Tukey & Martin B Wilk, "Data Analysis and Statistics: An Expository Overview", 1966)

"The typical statistician has learned from bitter experience that negative results are just as important as positive ones, sometimes more so." (John W Tukey, "A Statistician's Comment", 1967)

"It is fair to say that statistics has made its greatest progress by having to move away from certainty [...] If we really want to make progress, we need to identify our next step away from certainty." (John W Tukey, "What Have Statisticians Been Forgetting", 1967)

"'Every student of the art of data analysis repeatedly needs to build upon his previous statistical knowledge and to reform that foundation through fresh insights and emphasis." (John W Tukey, "Data Analysis, Including Statistics", 1968)

"Every graph is at least an indication, by contrast with some common instances of numbers." (John W Tukey, "Data Analysis, Including Statistics", 1968)

"Nothing can substitute for relatively direct assessment of variability." (John W Tukey, "Data Analysis, Including Statistics", 1968)

"No one knows how to appraise a procedure safely except by using different bodies of data from those that determined it." (John W Tukey, "Data Analysis, Including Statistics", 1968)

"The problems of different fields are much more alike than their practitioners think, much more alike than different." (John W Tukey, "Analyzing Data: Sanctification or Detective Work?", 1969)

"[...] bending the question to fit the analysis is to be shunned at all costs." (John W Tukey, "Analyzing Data: Sanctification or Detective Work?", 1969)

"Data analysis is in important ways an antithesis of pure mathematics." (John W Tukey, "Data Analysis, Computation and Mathematics", 1972)

"Undoubtedly, the swing to exploratory data analysis will go somewhat too far. However : It is better to ride a damped pendulum than to be stuck in the mud." (John W Tukey, "Exploratory Data Analysis as Part of a Larger Whole", 1973)

"The greatest value of a picture is when it forces us to notice what we never expected to see." (John W Tukey, "Exploratory Data Analysis", 1977)

"[...] exploratory data analysis is an attitude, a state of flexibility, a willingness to look for those things that we believe are not there, as well as for those we believe might be there. Except for its emphasis on graphs, its tools are secondary to its purpose." (John W Tukey, [comment] 1979)

"There is NO question of teaching confirmatory OR exploratory - we need to teach both." (John W Tukey, "We Need Both Exploratory and Confirmatory", 1980)

"Finding the question is often more important than finding the answer." (John W Tukey, "We Need Both Exploratory and Confirmatory", 1980)

"[...] any hope that we are smart enough to find even transiently optimum solutions to our data analysis problems is doomed to failure, and, indeed, if taken seriously, will mislead us in the allocation of effort, thus wasting both intellectual and computational effort." (John W Tukey, "Choosing Techniques for the Analysis of Data", 1981)

"Detailed study of the quality of data sources is an essential part of applied work. [...] Data analysts need to understand more about the measurement processes through which their data come. To know the name by which a column of figures is headed is far from being enough." (John W Tukey, "An Overview of Techniques of Data Analysis, Emphasizing Its Exploratory Aspects", 1982)

"Exploratory data analysis, EDA, calls for a relatively free hand in exploring the data, together with dual obligations: (•) to look for all plausible alternatives and oddities - and a few implausible ones, (graphic techniques can be most helpful here) and (•) to remove each appearance that seems large enough to be meaningful - ordinarily by some form of fitting, adjustment, or standardization [...] so that what remains, the residuals, can be examined for further appearances." (John W Tukey, "Introduction to Styles of Data Analysis Techniques", 1982)

"The combination of some data and an aching desire for an answer does not ensure that a reasonable answer can be extracted from a given body of data." (John W Tukey, "Sunset Salvo", The American Statistician Vol. 40 (1), 1986)

"The worst, i.e., most dangerous, feature of 'accepting the null hypothesis' is the giving up of explicit uncertainty. […] Mathematics can sometimes be put in such black-and-white terms, but our knowledge or belief about the external world never can." (John W Tukey, "The Philosophy of Multiple Comparisons", Statistical Science Vol. 6 (1), 1991)

"Statistics is the science, the art, the philosophy, and the technique of making inferences from the particular to the general." (John W Tukey)

28 April 2006

🖍️William E Deming - Collected Quotes

"It is important to realize that it is not the one measurement, alone, but its relation to the rest of the sequence that is of interest." (William E Deming, "Statistical Adjustment of Data", 1938)

"The definition of random in terms of a physical operation is notoriously without effect on the mathematical operations of statistical theory because so far as these mathematical operations are concerned random is purely and simply an undefined term." (Walter A Shewhart & William E Deming, "Statistical Method from the Viewpoint of Quality Control", 1939)

"Experience without theory teaches nothing." (William E Deming, "Out of the Crisis", 1986)

"It is important to realize that it is not the one measurement, alone, but its relation to the rest of the sequence that is of interest." (William E Deming, "Statistical Adjustment of Data", 1943)

"Sampling is the science and art of controlling and measuring the reliability of useful statistical information through the theory of probability." (William E Deming, "Some Theory of Sampling", 1950)

"Why waste knowledge?… No company can afford to waste knowledge. Failure of management to breakdown barriers between activities… is one way to waste knowledge. People that are not working together are not contributing their best to the company. People as they work together, feeling secure in the job reinforce their knowledge and efforts. Their combined output, when they are working together, is more than the sum of their separate. " (W Edwards Deming," Quality, Productivity and Competitive Position", 1982)

"Experience by itself teaches nothing [...] Without theory, experience has no meaning. Without theory, one has no questions to ask. Hence without theory there is no learning." (William E Deming, "The New Economics for Industry, Government, Education", 1993)

"Knowledge is theory. We should be thankful if action of management is based on theory. Knowledge has temporal spread. Information is not knowledge. The world is drowning in information but is slow in acquisition of knowledge. There is no substitute for knowledge." (William E Deming, "The New Economics for Industry, Government, Education", 1993)

"What is a system? A system is a network of interdependent components that work together to try to accomplish the aim of the system. A system must have an aim. Without an aim, there is no system. The aim of the system must be clear to everyone in the system. The aim must include plans for the future. The aim is a value judgment." (William E Deming, "The New Economics for Industry, Government, Education", 1993)

"The only useful function of a statistician is to make predictions, and thus to provide a basis for action." (William E Deming)

"Too little attention is given to the need for statistical control, or to put it more pertinently, since statistical control (randomness) is so rarely found, too little attention is given to the interpretation of data that arise from conditions not in statistical control." (William E Deming)

🖍️Donald J Wheeler - Collected Quotes

"Averages, ranges, and histograms all obscure the time-order for the data. If the time-order for the data shows some sort of definite pattern, then the obscuring of this pattern by the use of averages, ranges, or histograms can mislead the user. Since all data occur in time, virtually all data will have a time-order. In some cases this time-order is the essential context which must be preserved in the presentation." (Donald J Wheeler," Understanding Variation: The Key to Managing Chaos" 2nd Ed., 2000)

"Before you can improve any system you must listen to the voice of the system (the Voice of the Process). Then you must understand how the inputs affect the outputs of the system. Finally, you must be able to change the inputs (and possibly the system) in order to achieve the desired results. This will require sustained effort, constancy of purpose, and an environment where continual improvement is the operating philosophy." (Donald J Wheeler, "Understanding Variation: The Key to Managing Chaos" 2nd Ed., 2000)

"Data are collected as a basis for action. Yet before anyone can use data as a basis for action the data have to be interpreted. The proper interpretation of data will require that the data be presented in context, and that the analysis technique used will filter out the noise." (Donald J Wheeler, "Understanding Variation: The Key to Managing Chaos" 2nd Ed., 2000)

"Data are generally collected as a basis for action. However, unless potential signals are separated from probable noise, the actions taken may be totally inconsistent with the data. Thus, the proper use of data requires that you have simple and effective methods of analysis which will properly separate potential signals from probable noise." (Donald J Wheeler, "Understanding Variation: The Key to Managing Chaos" 2nd Ed., 2000)

"No comparison between two values can be global. A simple comparison between the current figure and some previous value and convey the behavior of any time series. […] While it is simple and easy to compare one number with another number, such comparisons are limited and weak. They are limited because of the amount of data used, and they are weak because both of the numbers are subject to the variation that is inevitably present in weak world data. Since both the current value and the earlier value are subject to this variation, it will always be difficult to determine just how much of the difference between the values is due to variation in the numbers, and how much, if any, of the difference is due to real changes in the process." (Donald J Wheeler, "Understanding Variation: The Key to Managing Chaos" 2nd Ed., 2000)

"No matter what the data, and no matter how the values are arranged and presented, you must always use some method of analysis to come up with an interpretation of the data.
While every data set contains noise, some data sets may contain signals. Therefore, before you can detect a signal within any given data set, you must first filter out the noise." (Donald J Wheeler," Understanding Variation: The Key to Managing Chaos" 2nd Ed., 2000)

"We analyze numbers in order to know when a change has occurred in our processes or systems. We want to know about such changes in a timely manner so that we can respond appropriately. While this sounds rather straightforward, there is a complication - the numbers can change even when our process does not. So, in our analysis of numbers, we need to have a way to distinguish those changes in the numbers that represent changes in our process from those that are essentially noise." (Donald J Wheeler, "Understanding Variation: The Key to Managing Chaos" 2nd Ed., 2000)

"When a system is predictable, it is already performing as consistently as possible. Looking for assignable causes is a waste of time and effort. Instead, you can meaningfully work on making improvements and modifications to the process. When a system is unpredictable, it will be futile to try and improve or modify the process. Instead you must seek to identify the assignable causes which affect the system. The failure to distinguish between these two different courses of action is a major source of confusion and wasted effort in business today." (Donald J Wheeler, "Understanding Variation: The Key to Managing Chaos" 2nd Ed., 2000)

"When a process displays unpredictable behavior, you can most easily improve the process and process outcomes by identifying the assignable causes of unpredictable variation and removing their effects from your process." (Donald J Wheeler, "Understanding Variation: The Key to Managing Chaos" 2nd Ed., 2000)

"While all data contain noise, some data contain signals. Before you can detect a signal, you must filter out the noise." (Donald J Wheeler, "Understanding Variation: The Key to Managing Chaos" 2nd Ed., 2000)

"Without meaningful data there can be no meaningful analysis. The interpretation of any data set must be based upon the context of those data. Unfortunately, much of the data reported to executives today are aggregated and summed over so many different operating units and processes that they cannot be said to have any context except a historical one - they were all collected during the same time period. While this may be rational with monetary figures, it can be devastating to other types of data." (Donald J Wheeler, "Understanding Variation: The Key to Managing Chaos" 2nd Ed., 2000)

"[…] you simply cannot make sense of any number without a contextual basis. Yet the traditional attempts to provide this contextual basis are often flawed in their execution. [...] Data have no meaning apart from their context. Data presented without a context are effectively rendered meaningless." (Donald J Wheeler, "Understanding Variation: The Key to Managing Chaos" 2nd Ed., 2000)

"A control chart is a tool for maintaining the status-quo - it was created to monitor a process after that process has been brought to a satisfactory level of operation."(Donald J Wheeler, "Myths About Data Analysis", International Lean & Six Sigma Conference, 2012)

"Data analysis is not generally thought of as being simple or easy, but it can be. The first step is to understand that the purpose of data analysis is to separate any signals that may be contained within the data from the noise in the data. Once you have filtered out the noise, anything left over will be your potential signals. The rest is just details." (Donald J Wheeler," Myths About Data Analysis", International Lean & Six Sigma Conference, 2012)

"Descriptive statistics are built on the assumption that we can use a single value to characterize a single property for a single universe. […] Probability theory is focused on what happens to samples drawn from a known universe. If the data happen to come from different sources, then there are multiple universes with different probability models. If you cannot answer the homogeneity question, then you will not know if you have one probability model or many. [...] Statistical inference assumes that you have a sample that is known to have come from one universe." (Donald J Wheeler, "Myths About Data Analysis", International Lean & Six Sigma Conference, 2012)

"In order to be effective a descriptive statistic has to make sense - it has to distill some essential characteristic of the data into a value that is both appropriate and understandable. […] the justification for computing any given statistic must come from the nature of the data themselves - it cannot come from the arithmetic, nor can it come from the statistic. If the data are a meaningless collection of values, then the summary statistics will also be meaningless - no arithmetic operation can magically create meaning out of nonsense. Therefore, the meaning of any statistic has to come from the context for the data, while the appropriateness of any statistic will depend upon the use we intend to make of that statistic." (Donald J Wheeler, "Myths About Data Analysis", International Lean & Six Sigma Conference, 2012)

[myth] " It has been said that process behavior charts work because of the central limit theorem."(Donald J Wheeler, "Myths About Data Analysis", International Lean & Six Sigma Conference, 2012)

[myth] "It has been said that the data must be normally distributed before they can be placed on a process behavior chart."(Donald J Wheeler, "Myths About Data Analysis", International Lean & Six Sigma Conference, 2012)

[myth] "It has been said that the observations must be independent - data with autocorrelation are inappropriate for process behavior charts." (Donald J Wheeler, "Myths About Data Analysis", International Lean & Six Sigma Conference, 2012)

[myth] "It has been said that the process must be operating in control before you can place the data on a process behavior chart."(Donald J Wheeler, "Myths About Data Analysis", International Lean & Six Sigma Conference, 2012)

"The four questions of data analysis are the questions of description, probability, inference, and homogeneity. Any data analyst needs to know how to organize and use these four questions in order to obtain meaningful and correct results. [...] THE DESCRIPTION QUESTION: Given a collection of numbers, are there arithmetic values that will summarize the information contained in those numbers in some meaningful way?
THE PROBABILITY QUESTION: Given a known universe, what can we say about samples drawn from this universe? [...]
THE INFERENCE QUESTION: Given an unknown universe, and given a sample that is known to have been drawn from that unknown universe, and given that we know everything about the sample, what can we say about the unknown universe? [...]
THE HOMOGENEITY QUESTION: Given a collection of observations, is it reasonable to assume that they came from one universe, or do they show evidence of having come from multiple universes?" (Donald J Wheeler, "Myths About Data Analysis", International Lean & Six Sigma Conference, 2012)

"The simplicity of the process behavior chart can be deceptive. This is because the simplicity of the charts is based on a completely different concept of data analysis than that which is used for the analysis of experimental data. When someone does not understand the conceptual basis for process behavior charts they are likely to view the simplicity of the charts as something that needs to be fixed. Out of these urges to fix the charts all kinds of myths have sprung up resulting in various levels of complexity and obstacles to the use of one of the most powerful analysis techniques ever invented." (Donald J Wheeler, "Myths About Data Analysis", International Lean & Six Sigma Conference, 2012)

[myth] "The standard deviation statistic is more efficient than the range and therefore we should use the standard deviation statistic when computing limits for a process behavior chart."(Donald J Wheeler, "Myths About Data Analysis", International Lean & Six Sigma Conference, 2012)

🖍️Nassim N Taleb - Collected Quotes

"A mistake is not something to be determined after the fact, but in the light of the information until that point." (Nassim N Taleb, "Fooled by Randomness", 2001)

"Probability is not about the odds, but about the belief in the existence of an alternative outcome, cause, or motive." (Nassim N Taleb, "Fooled by Randomness", 2001)

"A Black Swan is a highly improbable event with three principal characteristics: It is unpredictable; it carries a massive impact; and, after the fact, we concoct an explanation that makes it appear less random, and more predictable, than it was. […] The Black Swan idea is based on the structure of randomness in empirical reality. [...] the Black Swan is what we leave out of simplification." (Nassim N Taleb, "The Black Swan" , 2007)

"Prediction, not narration, is the real test of our understanding of the world." (Nassim N Taleb, "The Black Swan", 2007)

"The inability to predict outliers implies the inability to predict the course of history.” (Nassim N Taleb, “The Black Swan”, 2007)

"While in theory randomness is an intrinsic property, in practice, randomness is incomplete information." (Nassim N Taleb, "The Black Swan", 2007)

"Complex systems are full of interdependencies - hard to detect - and nonlinear responses. […] Man-made complex systems tend to develop cascades and runaway chains of reactions that decrease, even eliminate, predictability and cause outsized events. So the modern world may be increasing in technological knowledge, but, paradoxically, it is making things a lot more unpredictable." (Nassim N Taleb, "Antifragile: Things that gain from disorder", 2012)

"Technology is the result of antifragility, exploited by risk-takers in the form of tinkering and trial and error, with nerd-driven design confined to the backstage." (Nassim N Taleb, "Antifragile: Things that gain from disorder", 2012)

"The higher the dimension, in other words, the higher the number of possible interactions, and the more disproportionally difficult it is to understand the macro from the micro, the general from the simple units. This disproportionate increase of computational demands is called the curse of dimensionality." (Nassim N Taleb, "Skin in the Game: Hidden Asymmetries in Daily Life", 2018)

"[…] whenever people make decisions after being supplied with the standard deviation number, they act as if it were the expected mean deviation." (Nassim N Taleb, "Statistical Consequences of Fat Tails: Real World Preasymptotics, Epistemology, and Applications" 2nd Ed., 2022)

27 April 2006

🖍️Andrew Gelman - Collected Quotes

"The idea of optimization transfer is very appealing to me, especially since I have never succeeded in fully understanding the EM algorithm." (Andrew Gelman, "Discussion", Journal of Computational and Graphical Statistics vol 9, 2000)

"The difference between 'statistically significant' and 'not statistically significant' is not in itself necessarily statistically significant. By this, I mean more than the obvious point about arbitrary divisions, that there is essentially no difference between something significant at the 0.049 level or the 0.051 level. I have a bigger point to make. It is common in applied research–in the last couple of weeks, I have seen this mistake made in a talk by a leading political scientist and a paper by a psychologist–to compare two effects, from two different analyses, one of which is statistically significant and one which is not, and then to try to interpret/explain the difference. Without any recognition that the difference itself was not statistically significant." (Andrew Gelman, "The difference between ‘statistically significant’ and ‘not statistically significant’ is not in itself necessarily statistically significant", 2005)

"A naive interpretation of regression to the mean is that heights, or baseball records, or other variable phenomena necessarily become more and more 'average' over time. This view is mistaken because it ignores the error in the regression predicting y from x. For any data point xi, the point prediction for its yi will be regressed toward the mean, but the actual yi that is observed will not be exactly where it is predicted. Some points end up falling closer to the mean and some fall further." (Andrew Gelman & Jennifer Hill, "Data Analysis Using Regression and Multilevel/Hierarchical Models", 2007)

"You might say that there’s no reason to bother with model checking since all models are false anyway. I do believe that all models are false, but for me the purpose of model checking is not to accept or reject a model, but to reveal aspects of the data that are not captured by the fitted model." (Andrew Gelman, "Some thoughts on the sociology of statistics", 2007)

"It’s a commonplace among statisticians that a chi-squared test (and, really, any p-value) can be viewed as a crude measure of sample size: When sample size is small, it’s very difficult to get a rejection (that is, a p-value below 0.05), whereas when sample size is huge, just about anything will bag you a rejection. With large n, a smaller signal can be found amid the noise. In general: small n, unlikely to get small p-values. Large n, likely to find something. Huge n, almost certain to find lots of small p-values." (Andrew Gelman, "The sample size is huge, so a p-value of 0.007 is not that impressive", 2009)

"The arguments I lay out are, briefly, that graphs are a distraction from more serious analysis; that graphs can mislead in displaying compelling patterns that are not statistically significant and that could easily enough be consistent with chance variation; that diagnostic plots could be useful in the development of a model but do not belong in final reports; that, when they take the place of tables, graphs place the careful reader one step further away from the numerical inferences that are the essence of rigorous scientific inquiry; and that the effort spent making flashy graphics would be better spent on the substance of the problem being studied." (Andrew Gelman et al, "Why Tables Are Really Much Better Than Graphs", Journal of Computational and Graphical Statistics, Vol. 20(1), 2011)

"Graphs are gimmicks, substituting fancy displays for careful analysis and rigorous reasoning. It is basically a trade-off: the snazzier your display, the more you can get away with a crappy underlying analysis. Conversely, a good analysis does not need a fancy graph to sell itself. The best quantitative research has an underlying clarity and a substantive importance whose results are best presented in a sober, serious tabular display. And the best quantitative researchers trust their peers enough to present their estimates and standard errors directly, with no tricks, for all to see and evaluate." (Andrew Gelman et al, "Why Tables Are Really Much Better Than Graphs", Journal of Computational and Graphical Statistics, Vol. 20(1), 2011)"

"Eye-catching data graphics tend to use designs that are unique (or nearly so) without being strongly focused on the data being displayed. In the world of Infovis, design goals can be pursued at the expense of statistical goals. In contrast, default statistical graphics are to a large extent determined by the structure of the data (line plots for time series, histograms for univariate data, scatterplots for bivariate nontime-series data, and so forth), with various conventions such as putting predictors on the horizontal axis and outcomes on the vertical axis. Most statistical graphs look like other graphs, and statisticians often think this is a good thing." (Andrew Gelman & Antony Unwin, "Infovis and Statistical Graphics: Different Goals, Different Looks" , Journal of Computational and Graphical Statistics Vol. 22(1), 2013)

"Providing the right comparisons is important, numbers on their own make little sense, and graphics should enable readers to make up their own minds on any conclusions drawn, and possibly see more. On the Infovis side, computer scientists and designers are interested in grabbing the readers' attention and telling them a story. When they use data in a visualization (and data-based graphics are only a subset of the field of Infovis), they provide more contextual information and make more effort to awaken the readers' interest. We might argue that the statistical approach concentrates on what can be got out of the available data and the Infovis approach uses the data to draw attention to wider issues. Both approaches have their value, and it would probably be best if both could be combined." (Andrew Gelman & Antony Unwin, "Infovis and Statistical Graphics: Different Goals, Different Looks" , Journal of Computational and Graphical Statistics Vol. 22(1), 2013)

"Statisticians tend to use standard graphic forms (e.g., scatterplots and time series), which enable the experienced reader to quickly absorb lots of information but may leave other readers cold. We personally prefer repeated use of simple graphical forms, which we hope draw attention to the data rather than to the form of the display." (Andrew Gelman & Antony Unwin, "Infovis and Statistical Graphics: Different Goals, Different Looks" , Journal of Computational and Graphical Statistics Vol. 22(1), 2013)

"[…] we do see a tension between the goal of statistical communication and the more general goal of communicating the qualitative sense of a dataset. But graphic design is not on one side or another of this divide. Rather, design is involved at all stages, especially when several graphics are combined to contribute to the overall picture, something we would like to see more of." (Andrew Gelman & Antony Unwin, "Tradeoffs in Information Graphics", Journal of Computational and Graphical Statistics, 2013)

"Yes, it can sometimes be possible for a graph to be both beautiful and informative […]. But such synergy is not always possible, and we believe that an approach to data graphics that focuses on celebrating such wonderful examples can mislead people by obscuring the tradeoffs between the goals of visual appeal to outsiders and statistical communication to experts." (Andrew Gelman & Antony Unwin, "Tradeoffs in Information Graphics", Journal of Computational and Graphical Statistics, 2013)

"Flaws can be found in any research design if you look hard enough. […] In our experience, it is good scientific practice to refine one's research hypotheses in light of the data. Working scientists are also keenly aware of the risks of data dredging, and they use confidence intervals and p-values as a tool to avoid getting fooled by noise. Unfortunately, a by-product of all this struggle and care is that when a statistically significant pattern does show up, it is natural to get excited and believe it. The very fact that scientists generally don't cheat, generally don't go fishing for statistical significance, makes them vulnerable to drawing strong conclusions when they encounter a pattern that is robust enough to cross the p < 0.05 threshold." (Andrew Gelman & Eric Loken, "The Statistical Crisis in Science", American Scientist Vol. 102(6), 2014)

"There are many roads to statistical significance; if data are gathered with no preconceptions at all, statistical significance can obviously be obtained even from pure noise by the simple means of repeatedly performing comparisons, excluding data in different ways, examining different interactions, controlling for different predictors, and so forth. Realistically, though, a researcher will come into a study with strong substantive hypotheses, to the extent that, for any given data set, the appropriate analysis can seem evidently clear. But even if the chosen data analysis is a deterministic function of the observed data, this does not eliminate the problem posed by multiple comparisons." (Andrew Gelman & Eric Loken, "The Statistical Crisis in Science", American Scientist Vol. 102(6), 2014)

"There is a growing realization that reported 'statistically significant' claims in statistical publications are routinely mistaken. Researchers typically express the confidence in their data in terms of p-value: the probability that a perceived result is actually the result of random variation. The value of p (for 'probability') is a way of measuring the extent to which a data set provides evidence against a so-called null hypothesis. By convention, a p- value below 0.05 is considered a meaningful refutation of the null hypothesis; however, such conclusions are less solid than they appear." (Andrew Gelman & Eric Loken, "The Statistical Crisis in Science", American Scientist Vol. 102(6), 2014)

"I agree with the general message: 'The right variables make a big difference for accuracy. Complex statistical methods, not so much.' This is similar to something Hal Stern told me once: the most important aspect of a statistical analysis is not what you do with the data, it’s what data you use." (Andrew Gelman, "The most important aspect of a statistical analysis is not what you do with the data, it’s what data you use", 2018)

"We thus echo the classical Bayesian literature in concluding that ‘noninformative prior information’ is a contradiction in terms. The flat prior carries information just like any other; it represents the assumption that the effect is likely to be large. This is often not true. Indeed, the signal-to-noise ratios is often very low and then it is necessary to shrink the unbiased estimate. Failure to do so by inappropriately using the flat prior causes overestimation of effects and subsequent failure to replicate them." (Erik van Zwet & Andrew Gelman, "A proposal for informative default priors scaled by the standard error of estimates", The American Statistician 76, 2022)

"Taking a model too seriously is really just another way of not taking it seriously at all." (Andrew Gelman)

26 April 2006

🖍️Gerald van Belle - Collected Quotes

"A bar graph typically presents either averages or frequencies. It is relatively simple to present raw data (in the form of dot plots or box plots). Such plots provide much more information. and they are closer to the original data. If the bar graph categories are linked in some way - for example, doses of treatments - then a line graph will be much more informative. Very complicated bar graphs containing adjacent bars are very difficult to grasp. If the bar graph represents frequencies. and the abscissa values can be ordered, then a line graph will be much more informative and will have substantially reduced chart junk." (Gerald van Belle, "Statistical Rules of Thumb", 2002)

"A good graph displays relationships and structures that are difficult to detect by merely looking at the data." (Gerald van Belle, "Statistical Rules of Thumb", 2002)

"A probability can frequently be expressed as a ratio of the number of events divided by the number of units eligible for the event. What the rule of thumb says is to be aware of what the numerator and denominator are, particularly when assessing probabilities in a personal situation. If someone never goes hang gliding, they clearly do not need to worry about the probability of dying in a hang gliding accident." (Gerald van Belle, "Statistical Rules of Thumb", 2002)

"Assess agreement by addressing accuracy, scale differential, and precision. Accuracy can be thought of as the lack of bias." (Gerald van Belle, "Statistical Rules of Thumb", 2002)

"Before choosing a measure of covariation determine the source of the data (sampling scheme), the nature of the variables, and the symmetry status of the measure." (Gerald van Belle, "Statistical Rules of Thumb", 2002)

"Characterizing variability requires repeatedly observing the variability since the it is not a property inherent in the observation itself. " (Gerald van Belle, "Statistical Rules of Thumb", 2002)

"Displaying numerical information always involves selection. The process of selection needs to be described so that the reader will not be misled." (Gerald van Belle, "Statistical Rules of Thumb", 2002)

"Distinguish among confidence, prediction, and tolerance intervals. Confidence intervals are statements about population means or other parameters. Prediction intervals address future (single or multiple) observations. Tolerance intervals describe the location of a specific proportion of a population, with specified confidence." (Gerald van Belle, "Statistical Rules of Thumb", 2002)

"Do not let the scale of measurement rigidly determine the method of analysis." (Gerald van Belle, "Statistical Rules of Thumb", 2002)

"Everyone agrees that there are degrees of quality of information but when asked to define the criteria there a great deal of disagreement. The simple statistical rule that the inverse of the variance of a statistic is a measure of the information contained in the statistic provides a useful criterion for a point estimate but is clearly inadequate for comparing much bigger chunks of information such as a study." (Gerald van Belle, "Statistical Rules of Thumb", 2002)

"Every statistical analysis is an interpretation of the data, and missingness affects the interpretation. The challenge is that when the reasons for the missingness cannot be determined there is basically no way to make appropriate statistical adjustments. Sensitivity analyses are designed to model and explore a reasonable range of explanations in order to assess the robustness of the results." (Gerald van Belle, "Statistical Rules of Thumb", 2002)

"In assessing change, the spacing of the observations is much more important than the number of observations." (Gerald van Belle, "Statistical Rules of Thumb", 2002)

"In using a database, first look at the metadata, then look at the data. [...] The old computer acronym GIGO (Garbage In, Garbage Out) applies to the use of large databases. The issue is whether the data from the database will answer the research question. In order to determine this, the investigator must have some idea about the nature of the data in the database - that is, the metadata." (Gerald van Belle, "Statistical Rules of Thumb", 2002)

"It is crucial to have a broad understanding of the subject matter involved. Statistical analysis is much more than just carrying out routine computations. Only with keen understanding of the subject matter can statisticians, and statistics, be most usefully engaged." (Gerald van Belle, "Statistical Rules of Thumb", 2002)

"Know what properties a transformation preserves and does not preserve." (Gerald van Belle, "Statistical Rules of Thumb", 2002)

"Models can be viewed and used at three levels. The first is a model that fits the data. A test of goodness-of-fit operates at this level. This level is the least useful but is frequently the one at which statisticians and researchers stop. For example, a test of a linear model is judged good when a quadratic term is not significant. A second level of usefulness is that the model predicts future observations. Such a model has been called a forecast model. This level is often required in screening studies or studies predicting outcomes such as growth rate. A third level is that a model reveals unexpected features of the situation being described, a structural model, [...] However, it does not explain the data." (Gerald van Belle, "Statistical Rules of Thumb", 2002)

"Observation is selection. [...] To observe one thing implies that another is not observed, hence there is selection. This implies that the observation is taken from a larger collective, the statistical population." (Gerald van Belle, "Statistical Rules of Thumb", 2002)

"Ockham's Razor in statistical analysis is used implicitly when models are embedded in richer models -for example, when testing the adequacy of a linear model by incorporating a quadratic term. If the coefficient of the quadratic term is not significant, it is dropped and the linear model is assumed to summarize the data adequately." (Gerald van Belle, "Statistical Rules of Thumb", 2002)

"Precision does not vary linearly with increasing sample size. As is well known, the width of a confidence interval is a function of the square root of the number of observations. But it is more complicate than that. The basic elements determining a confidence interval are the sample size, an estimate of variability, and a pivotal variable associated with the estimate of variability." (Gerald van Belle, "Statistical Rules of Thumb", 2002)

"Randomization puts systematic sources of variability into the error term." (Gerald van Belle, "Statistical Rules of Thumb", 2002)

"Since the analysis of variance is an analysis of variability of means it is possible to plot the means in many ways." (Gerald van Belle, "Statistical Rules of Thumb", 2002)

"Stacked bar graphs do not show data structure well. A trend in one of the stacked variables has to be deduced by scanning along the vertical bars. This becomes especially difficult when the categories do not move in the same direction." (Gerald van Belle, "Statistical Rules of Thumb", 2002)

"Statistics is the analysis of variation. There are many sources and kinds of variation. In environmental studies it is particularly important to understand the kinds of variation and the implications of the difference. Two important categories are variability and uncertainty. Variability refers to variation in environmental quantities (which may have special regulatory interest), uncertainty refers to the degree of precision with which these quantities are estimated." (Gerald van Belle, "Statistical Rules of Thumb", 2002)

"The best rule is: Don't have any missing data, Unfortunately, that is unrealistic. Therefore, plan for missing data and develop strategies to account for them. Do this before starting the study. The strategy should state explicitly how the type of missingness will be examined, how it will be handled, and how the sensitivity of the results to the missing data will be assessed." (Gerald van Belle, "Statistical Rules of Thumb", 2002)

"The bounds on the standard deviation are pretty crude but it is surprising how often the rule will pick up gross errors such as confusing the standard error and standard deviation, confusing the variance and the standard deviation, or reporting the mean in one scale and the standard deviation in another scale." (Gerald van Belle, "Statistical Rules of Thumb", 2002)

"The content and context of the numerical data determines the most appropriate mode of presentation. A few numbers can be listed, many numbers require a table. Relationships among numbers can be displayed by statistics. However, statistics, of necessity, are summary quantities so they cannot fully display the relationships, so a graph can be used to demonstrate them visually. The attractiveness of the form of the presentation is determined by word layout, data structure, and design." (Gerald van Belle, "Statistical Rules of Thumb", 2002)

"The most ubiquitous graph is the pie chart. It is a staple of the business world. [...] Never use a pie chart. Present a simple list of percentages, or whatever constitutes the divisions of the pie chart." (Gerald van Belle, "Statistical Rules of Thumb", 2002)

"[...] there are two problems with the indiscriminate multiplication of probabilities. First, multiplication without adjustment implies that the events represented by the probabilities are treated as independent. Second, since probabilities are always less than 1, the product will become smaller and smaller. If small probabilities are associated with unlikely events then, by a suitable selection, the joint occurrence of events can be made arbitrarily small." (Gerald van Belle, "Statistical Rules of Thumb", 2002)

"This pie chart violates several of the rules suggested by the question posed in the introduction. First, immediacy: the reader has to turn to the legend to find out what the areas represent; and the lack of color makes it very difficult to determine which area belongs to what code. Second, the underlying structure of the data is completely ignored. Third, a tremendous amount of ink is used to display eight simple numbers." (Gerald van Belle, "Statistical Rules of Thumb", 2002)

"Three key aspects of presenting high dimensional data are: rendering, manipulation, and linking. Rendering determines what is to be plotted, manipulation determines the structure of the relationships, and linking determines what information will be shared between plots or sections of the graph." (Gerald van Belle, "Statistical Rules of Thumb", 2002)

"When there is more than one source of variation it is important to identify those sources." (Gerald van Belle, "Statistical Rules of Thumb", 2002)

🖍️George B Dyson - Collected Quotes

"An Internet search engine is a finite-state, deterministic machine, except at those junctures where people, individually and collectively, make a nondeterministic choice as to which results are selected as meaningful and given a click. These clicks are then immediately incorporated into the state of the deterministic machine, which grows ever so incrementally more knowledgeable with every click. This is what Turing defined as an oracle machine." (George B Dyson, "Turing's Cathedral: The Origins of the Digital Universe", 2012)

"If life, by some chance, happens to have originated, and survived, elsewhere in the universe, it will have had time to explore an unfathomable diversity of forms. Those best able to survive the passage of time, adapt to changing environments, and migrate across interstellar distances will become the most widespread. A life form that assumes digital representation, for all or part of its life cycle, will be able to travel at the speed of light." (George B Dyson, "Turing's Cathedral: The Origins of the Digital Universe", 2012)

"In our universe, we measure time with clocks, and computers have a 'clock speed', but the clocks that govern the digital universe are very different from the clocks that govern ours. In the digital universe, clocks exist to synchronize the translation between bits that are stored in memory (as structures in space) and bits that are communicated by code (as sequences in time). They are clocks more in the sense of regulating escapement than in the sense of measuring time." (George B Dyson, "Turing's Cathedral: The Origins of the Digital Universe", 2012)

"It is characteristic of objects of low complexity that it is easier to talk about the object than produce it and easier to predict its properties than to build it." (George B Dyson, "Turing's Cathedral: The Origins of the Digital Universe", 2012)

"Life evolved, so far, by making use of the viral cloud as a source of backup copies and a way to rapidly exchange genetic code. Life may be better adapted to the digital universe than we think." (George B Dyson, "Turing's Cathedral: The Origins of the Digital Universe", 2012)

"Monte Carlo is able to discover practical solutions to otherwise intractable problems because the most efficient search of an unmapped territory takes the form of a random walk. Today’s search engines, long descended from their ENIAC-era ancestors, still bear the imprint of their Monte Carlo origins: random search paths being accounted for, statistically, to accumulate increasingly accurate results. The genius of Monte Carlo - and its search-engine descendants - lies in the ability to extract meaningful solutions, in the face of overwhelming information, by recognizing that meaning resides less in the data at the end points and more in the intervening paths." (George B Dyson, "Turing's Cathedral: The Origins of the Digital Universe", 2012)

"Over long distances, it is expensive to transport structures, and inexpensive to transmit sequences. Turing machines, which by definition are structures that can be encoded as sequences, are already propagating themselves, locally, at the speed of light. The notion that one particular computer resides in one particular location at one time is obsolete. (George Dyson, "Turing's Cathedral: The Origins of the Digital Universe", 2012)

"Random search can be more efficient than nonrandom search - something that Good and Turing had discovered at Bletchley Park. A random network, whether of neurons, computers, words, or ideas, contains solutions, waiting to be discovered, to problems that need not be explicitly defined." (George B Dyson, "Turing's Cathedral: The Origins of the Digital Universe", 2012)

"The brain is a statistical, probabilistic system, with logic and mathematics running as higher-level processes. The computer is a logical, mathematical system, upon which higher-level statistical, probabilistic systems, such as human language and intelligence, could possibly be built." (George B Dyson, "Turing's Cathedral: The Origins of the Digital Universe", 2012)

"The good news is that, as Leibniz suggested, we appear to live in the best of all possible worlds, where the computable functions make life predictable enough to be survivable, while the noncomputable functions make life (and mathematical truth) unpredictable enough to remain interesting, no matter how far computers continue to advance." (George B Dyson, "Turing's Cathedral: The Origins of the Digital Universe", 2012)

"The fundamental, indivisible unit of information is the bit. The fundamental, indivisible unit of digital computation is the transformation of a bit between its two possible forms of existence: as structure (memory) or as sequence (code). This is what a Turing Machine does when reading a mark (or the absence of a mark) on a square of tape, changing its state of mind accordingly, and making (or erasing) a mark somewhere else." (George B Dyson, "Turing's Cathedral: The Origins of the Digital Universe", 2012)

"The genius of Monte Carlo - and its search-engine descendants - lies in the ability to extract meaningful solutions, in the face of overwhelming information, by recognizing that meaning resides less in the data at the end points and more in the intervening paths." (George B Dyson, "Turing's Cathedral: The Origins of the Digital Universe", 2012)

"The paradox of artificial intelligence is that any system simple enough to be understandable is not complicated enough to behave intelligently, and any system complicated enough to behave intelligently is not simple enough to understand. The path to artificial intelligence, suggested Turing, is to construct a machine with the curiosity of a child, and let intelligence evolve." (George B Dyson, "Turing's Cathedral: The Origins of the Digital Universe", 2012)

"Where does meaning come in? If everything is assigned a number, does this diminish the meaning in the world? What Gödel (and Turing) proved is that formal systems will, sooner or later, produce meaningful statements whose truth can be proved only outside the system itself. This limitation does not confine us to a world with any less meaning. It proves, on the contrary, that we live in a world where higher meaning exists." (George B Dyson, "Turing's Cathedral: The Origins of the Digital Universe", 2012)

"Nature uses digital computing for generation-to-generation information storage, combinatorics, and error correction but relies on analog computing for real-time intelligence and control." (George B Dyson, Analogia: The Emergence of Technology Beyond Programmable Control", 2020)

🖍️George E P Box - Collected Quotes

"Statistical criteria should (1) be sensitive to change in the specific factors tested, (2) be insensitive to changes, of a magnitude likely to occur in practice, in extraneous factors." (George E P Box, 1955)

"The method of least squares is used in the analysis of data from planned experiments and also in the analysis of data from unplanned happenings. The word 'regression' is most often used to describe analysis of unplanned data. It is the tacit assumption that the requirements for the validity of least squares analysis are satisfied for unplanned data that produces a great deal of trouble." (George E P Box, "Use and Abuse of Regression", 1966)

"To find out what happens to a system when you interfere with it you have to interfere with it (not just passively observe it)." (George E P Box, "Use and Abuse of Regression", 1966)

"A man in daily muddy contact with field experiments could not be expected to have much faith in any direct assumption of independently distributed normal errors." (George E P Box, "Science and Statistics", Journal of the American Statistical Association 71, 1976)

"For the theory-practice iteration to work, the scientist must be, as it were, mentally ambidextrous; fascinated equally on the one hand by possible meanings, theories, and tentative models to be induced from data and the practical reality of the real world, and on the other with the factual implications deducible from tentative theories, models and hypotheses." (George E P Box, "Science and Statistics", Journal of the American Statistical Association 71, 1976)

"One important idea is that science is a means whereby learning is achieved, not by mere theoretical speculation on the one hand, nor by the undirected accumulation of practical facts on the other, but rather by a motivated iteration between theory and practice." (George E P Box, "Science and Statistics", Journal of the American Statistical Association 71, 1976)

"Since all models are wrong the scientist cannot obtain a ‘correct’ one by excessive elaboration. On the contrary following William of Occam he should seek an economical description of natural phenomena. Just as the ability to devise simple but evocative models is the signature of the great scientist so overelaboration and overparameterization is often the mark of mediocrity." (George E P Box, "Science and Statistics", Journal of the American Statistical Association 71, 1976)

"Since all models are wrong the scientist must be alert to what is importantly wrong. It is inappropriate to be concerned about mice when there are tigers abroad." (George E P Box, "Science and Statistics", Journal of the American Statistical Association 71, 1976)

"Remember that all models are wrong; the practical question is how wrong do they have to be to not be useful." (George E P Box, "Empirical Model-Building and Response Surfaces", 1987)

"The fact that [the model] is an approximation does not necessarily detract from its usefulness because models are approximations. All models are wrong, but some are useful." (George E P Box, 1987)

"Statistics is, or should be, about scientific investigation and how to do it better, but many statisticians believe it is a branch of mathematics." (George E P Box, Commentary, Technometrics 32, 1990)

"The central limit theorem says that, under conditions almost always satisfied in the real world of experimentation, the distribution of such a linear function of errors will tend to normality as the number of its components becomes large. The tendency to normality occurs almost regardless of the individual distributions of the component errors. An important proviso is that several sources of error must make important contributions to the overall error and that no particular source of error dominate the rest." (George E P Box et al, "Statistics for Experimenters: Design, discovery, and innovation" 2nd Ed., 2005)

"Two things explain the importance of the normal distribution: (1) The central limit effect that produces a tendency for real error distributions to be 'normal like'. (2) The robustness to nonnormality of some common statistical procedures, where 'robustness' means insensitivity to deviations from theoretical normality." (George E P Box et al, "Statistics for Experimenters: Design, discovery, and innovation" 2nd Ed., 2005)

"All models are approximations. Essentially, all models are wrong, but some are useful. However, the approximate nature of the model must always be borne in mind." (George E P Box & Norman R Draper, "Response Surfaces, Mixtures, and Ridge Analyses", 2007)

"In my view, statistics has no reason for existence except as the catalyst for investigation and discovery." (George E P Box)

🖍️Cathy O'Neil - Collected Quotes

"A model, after all, is nothing more than an abstract representation of some process, be it a baseball game, an oil company’s supply chain, a foreign government’s actions, or a movie theater’s attendance. Whether it’s running in a computer program or in our head, the model takes what we know and uses it to predict responses in various situations. All of us carry thousands of models in our heads. They tell us what to expect, and they guide our decisions." (Cathy O'Neil, "Weapons of Math Destruction: How Big Data Increases Inequality and Threatens Democracy", 2016)

"Big Data processes codify the past. They do not invent the future. Doing that requires moral imagination, and that’s something only humans can provide. We have to explicitly embed better values into our algorithms, creating Big Data models that follow our ethical lead. Sometimes that will mean putting fairness ahead of profit." (Cathy O'Neil, "Weapons of Math Destruction: How Big Data Increases Inequality and Threatens Democracy", 2016)

"No model can include all of the real world’s complexity or the nuance of human communication. Inevitably, some important information gets left out. […] To create a model, then, we make choices about what’s important enough to include, simplifying the world into a toy version that can be easily understood and from which we can infer important facts and actions.[…] Sometimes these blind spots don’t matter. […] A model’s blind spots reflect the judgments and priorities of its creators. […] Our own values and desires influence our choices, from the data we choose to collect to the questions we ask. Models are opinions embedded in mathematics." (Cathy O'Neil, "Weapons of Math Destruction: How Big Data Increases Inequality and Threatens Democracy", 2016)

"The first question: Even if the participant is aware of being modeled, or what the model is used for, is the model opaque, or even invisible? […] the second question: Does the model work against the subject’s interest? In short, is it unfair? Does it damage or destroy lives? […] The third question is whether a model has the capacity to grow exponentially. As a statistician would put it, can it scale?" (Cathy O'Neil, "Weapons of Math Destruction: How Big Data Increases Inequality and Threatens Democracy", 2016)

"Whether or not a model works is also a matter of opinion. After all, a key component of every model, whether formal or informal, is its definition of success." (Cathy O'Neil, "Weapons of Math Destruction: How Big Data Increases Inequality and Threatens Democracy", 2016)

"While Big Data, when managed wisely, can provide important insights, many of them will be disruptive. After all, it aims to find patterns that are invisible to human eyes. The challenge for data scientists is to understand the ecosystems they are wading into and to present not just the problems but also their possible solutions." (Cathy O'Neil, "Weapons of Math Destruction: How Big Data Increases Inequality and Threatens Democracy", 2016)

25 April 2006

🖍️Darrell Huff - Collected Quotes

"Another thing to watch out for is a conclusion in which a correlation has been inferred to continue beyond the data with which it has been demonstrated." (Darrell Huff, "How to Lie with Statistics", 1954)

"Extrapolations are useful, particularly in the form of soothsaying called forecasting trends. But in looking at the figures or the charts made from them, it is necessary to remember one thing constantly: The trend to now may be a fact, but the future trend represents no more than an educated guess. Implicit in it is 'everything else being equal' and 'present trends continuing'. And somehow everything else refuses to remain equal." (Darrell Huff, "How to Lie with Statistics", 1954)

"If you can't prove what you want to prove, demonstrate something else and pretend that they are the same thing. In the daze that follows the collision of statistics with the human mind, hardly anybody will notice the difference." (Darrell Huff, "How to Lie with Statistics", 1954)

"Keep in mind that a correlation may be real and based on real cause and effect - and still be almost worthless in determining action in any single case." (Darrell Huff, "How to Lie with Statistics", 1954)

"Only when there is a substantial number of trials involved is the law of averages a useful description or prediction." (Darrell Huff, "How to Lie with Statistics", 1954)

"Percentages offer a fertile field for confusion. And like the ever-impressive decimal they can lend an aura of precision to the inexact. […] Any percentage figure based on a small number of cases is likely to be misleading. It is more informative to give the figure itself. And when the percentage is carried out to decimal places, you begin to run the scale from the silly to the fraudulent." (Darrell Huff, "How to Lie with Statistics", 1954)

"Place little faith in an average or a graph or a trend when those important figures are missing." (Darrell Huff, "How to Lie with Statistics", 1954)

"Sometimes the big ado is made about a difference that is mathematically real and demonstrable but so tiny as to have no importance. This is in defiance of the fine old saying that a difference is a difference only if it makes a difference." (Darrell Huff, "How to Lie with Statistics", 1954)

"The fact is that, despite its mathematical base, statistics is as much an art as it is a science. A great many manipulations and even distortions are possible within the bounds of propriety. Often the statistician must choose among methods, a subjective process, and find the one that he will use to represent the facts." (Darrell Huff, "How to Lie with Statistics", 1954)

"The purely random sample is the only kind that can be examined with entire confidence by means of statistical theory, but there is one thing wrong with it. It is so difficult and expensive to obtain for many uses that sheer cost eliminates it." (Darrell Huff, "How to Lie with Statistics", 1954)

"The secret language of statistics, so appealing in a fact-minded culture, is employed to sensationalize, inflate, confuse, and oversimplify. Statistical methods and statistical terms are necessary in reporting the mass data of social and economic trends, business conditions, 'opinion' polls, the census. But without writers who use the words with honesty and understanding and readers who know what they mean, the result can only be semantic nonsense." (Darrell Huff, "How to Lie with Statistics", 1954)

"There are often many ways of expressing any figure. […] The method is to choose the one that sounds best for the purpose at hand and trust that few who read it will recognize how imperfectly it reflects the situation." (Darrell Huff, "How to Lie with Statistics", 1954)

"To be worth much, a report based on sampling must use a representative sample, which is one from which every source of bias has been removed." (Darrell Huff, "How to Lie with Statistics", 1954)

"When numbers in tabular form are taboo and words will not do the work well, as is often the case, there is one answer left: Draw a picture. About the simplest kind of statistical picture or graph, is the line variety. It is very useful for showing trends, something practically everybody is interested in showing or knowing about or spotting or deploring or forecasting." (Darrell Huff, "How to Lie with Statistics", 1954)

"When you are told that something is an average you still don't know very much about it unless you can find out which of the common kinds of average it is - mean, median, or mode. [...] The different averages come out close together when you deal with data, such as those having to do with many human characteristics, that have the grace to fall close to what is called the normal distribution. If you draw a curve to represent it you get something shaped like a bell, and mean, median, and mode fall at the same point." (Darrell Huff, "How to Lie with Statistics", 1954)

"When you find somebody - usually an interested party - making a fuss about a correlation, look first of all to see if it is not one of this type, produced by the stream of events, the trend of the times." (Darrell Huff, "How to Lie with Statistics", 1954)

🖍️John D Kelleher - Collected Quotes

"A predictive model overfits the training set when at least some of the predictions it returns are based on spurious patterns present in the training data used to induce the model. Overfitting happens for a number of reasons, including sampling variance and noise in the training set. The problem of overfitting can affect any machine learning algorithm; however, the fact that decision tree induction algorithms work by recursively splitting the training data means that they have a natural tendency to segregate noisy instances and to create leaf nodes around these instances. Consequently, decision trees overfit by splitting the data on irrelevant features that only appear relevant due to noise or sampling variance in the training data. The likelihood of overfitting occurring increases as a tree gets deeper because the resulting predictions are based on smaller and smaller subsets as the dataset is partitioned after each feature test in the path." (John D Kelleher et al, "Fundamentals of Machine Learning for Predictive Data Analytics: Algorithms, Worked Examples, and Case Studies", 2015)

"Decision trees are also discriminative models. Decision trees are induced by recursively partitioning the feature space into regions belonging to the different classes, and consequently they define a decision boundary by aggregating the neighboring regions belonging to the same class. Decision tree model ensembles based on bagging and boosting are also discriminative models." (John D Kelleher et al, "Fundamentals of Machine Learning for Predictive Data Analytics: Algorithms, Worked Examples, and Case Studies", 2015)

"Decision trees are also considered nonparametric models. The reason for this is that when we train a decision tree from data, we do not assume a fixed set of parameters prior to training that define the tree. Instead, the tree branching and the depth of the tree are related to the complexity of the dataset it is trained on. If new instances were added to the dataset and we rebuilt the tree, it is likely that we would end up with a (potentially very) different tree." (John D Kelleher et al, "Fundamentals of Machine Learning for Predictive Data Analytics: Algorithms, Worked Examples, and Case Studies", 2015)

"It is important to remember that predictive data analytics models built using machine learning techniques are tools that we can use to help make better decisions within an organization and are not an end in themselves. It is paramount that, when tasked with creating a predictive model, we fully understand the business problem that this model is being constructed to address and ensure that it does address it." (John D Kelleher et al, "Fundamentals of Machine Learning for Predictive Data Analytics: Algorithms, worked examples, and case studies", 2015)

"There are two kinds of mistakes that an inappropriate inductive bias can lead to: underfitting and overfitting. Underfitting occurs when the prediction model selected by the algorithm is too simplistic to represent the underlying relationship in the dataset between the descriptive features and the target feature. Overfitting, by contrast, occurs when the prediction model selected by the algorithm is so complex that the model fits to the dataset too closely and becomes sensitive to noise in the data."(John D Kelleher et al, "Fundamentals of Machine Learning for Predictive Data Analytics: Algorithms, Worked Examples, and Case Studies", 2015)

"The main advantage of decision tree models is that they are interpretable. It is relatively easy to understand the sequences of tests a decision tree carried out in order to make a prediction. This interpretability is very important in some domains. [...] Decision tree models can be used for datasets that contain both categorical and continuous descriptive features. A real advantage of the decision tree approach is that it has the ability to model the interactions between descriptive features. This arises from the fact that the tests carried out at each node in the tree are performed in the context of the results of the tests on the other descriptive features that were tested at the preceding nodes on the path from the root. Consequently, if there is an interaction effect between two or more descriptive features, a decision tree can model this." (John D Kelleher et al, "Fundamentals of Machine Learning for Predictive Data Analytics: Algorithms, Worked Examples, and Case Studies", 2015)

"Tree pruning identifies and removes subtrees within a decision tree that are likely to be due to noise and sample variance in the training set used to induce it. In cases where a subtree is deemed to be overfitting, pruning the subtree means replacing the subtree with a leaf node that makes a prediction based on the majority target feature level (or average target feature value) of the dataset created by merging the instances from all the leaf nodes in the subtree. Obviously, pruning will result in decision trees being created that are not consistent with the training set used to build them. In general, however, we are more interested in creating prediction models that generalize well to new data rather than that are strictly consistent with training data, so it is common to sacrifice consistency for generalization capacity." (John D Kelleher et al, "Fundamentals of Machine Learning for Predictive Data Analytics: Algorithms, Worked Examples, and Case Studies", 2015)

"When datasets are small, a parametric model may perform well because the strong assumptions made by the model - if correct - can help the model to avoid overfitting. However, as the size of the dataset grows, particularly if the decision boundary between the classes is very complex, it may make more sense to allow the data to inform the predictions more directly. Obviously the computational costs associated with nonparametric models and large datasets cannot be ignored. However, support vector machines are an example of a nonparametric model that, to a large extent, avoids this problem. As such, support vector machines are often a good choice in complex domains with lots of data." (John D Kelleher et al, "Fundamentals of Machine Learning for Predictive Data Analytics: Algorithms, Worked Examples, and Case Studies", 2015)

"When we find data quality issues due to valid data during data exploration, we should note these issues in a data quality plan for potential handling later in the project. The most common issues in this regard are missing values and outliers, which are both examples of noise in the data." (John D Kelleher et al, "Fundamentals of Machine Learning for Predictive Data Analytics: Algorithms, worked examples, and case studies", 2015)

"A neural network consists of a set of neurons that are connected together. A neuron takes a set of numeric values as input and maps them to a single output value. At its core, a neuron is simply a multi-input linear-regression function. The only significant difference between the two is that in a neuron the output of the multi-input linear-regression function is passed through another function that is called the activation function." (John D Kelleher & Brendan Tierney, "Data Science", 2018)

"Data scientists should have some domain expertise. Most data science projects begin with a real-world, domain-specific problem and the need to design a data-driven solution to this problem. As a result, it is important for a data scientist to have enough domain expertise that they understand the problem, why it is important, an dhow a data science solution to the problem might fit into an organization’s processes. This domain expertise guides the data scientist as she works toward identifying an optimized solution." (John D Kelleher & Brendan Tierney, "Data Science", 2018)

"However, because ML algorithms are biased to look for different types of patterns, and because there is no one learning bias across all situations, there is no one best ML algorithm. In fact, a theorem known as the 'no free lunch theorem' states that there is no one best ML algorithm that on average outperforms all other algorithms across all possible data sets." (John D Kelleher & Brendan Tierney, "Data Science", 2018)

"One of the biggest myths is the belief that data science is an autonomous process that we can let loose on our data to find the answers to our problems. In reality, data science requires skilled human oversight throughout the different stages of the process. [...] The second big myth of data science is that every data science project needs big data and needs to use deep learning. In general, having more data helps, but having the right data is the more important requirement. [...] A third data science myth is that modern data science software is easy to use, and so data science is easy to do. [...] The last myth about data science [...] is the belief that data science pays for itself quickly. The truth of this belief depends on the context of the organization. Adopting data science can require significant investment in terms of developing data infrastructure and hiring staff with data science expertise. Furthermore, data science will not give positive results on every project." (John D Kelleher & Brendan Tierney, "Data Science", 2018)

"One of the most important skills for a data scientist is the ability to frame a real-world problem as a standard data science task." (John D Kelleher & Brendan Tierney, "Data Science", 2018)

"Presenting data in a graphical format makes it much easier to see and understand what is happening with the data. Data visualization applies to all phases of the data science process." (John D Kelleher & Brendan Tierney, "Data Science", 2018)

"The goal of data science is to improve decision making by basing decisions on insights extracted from large data sets. As a field of activity, data science encompasses a set of principles, problem definitions, algorithms, and processes for extracting nonobvious and useful patterns from large data sets. It is closely related to the fields of data mining and machine learning, but it is broader in scope." (John D Kelleher & Brendan Tierney, "Data Science", 2018)

"The patterns that we extract using data science are useful only if they give us insight into the problem that enables us to do something to help solve the problem." (John D Kelleher & Brendan Tierney, "Data Science", 2018)

"The promise of data science is that it provides a way to understand the world through data." (John D Kelleher & Brendan Tierney, "Data Science", 2018)

"Using data science, we can uncover the important patterns in a data set, and these patterns can reveal the important attributes in the domain. The reason why data science is used in so many domains is that it doesn’t matter what the problem domain is: if the right data are available and the problem can be clearly defined, then data science can help." (John D Kelleher & Brendan Tierney, "Data Science", 2018)

"We humans are reasonably good at defining rules that check one, two, or even three attributes (also commonly referred to as features or variables), but when we go higher than three attributes, we can start to struggle to handle the interactions between them. By contrast, data science is often applied in contexts where we want to look for patterns among tens, hundreds, thousands, and, in extreme cases, millions of attributes." (John D Kelleher & Brendan Tierney, "Data Science", 2018)

🖍️Thomas Carlyle - Collected Quotes

"Statistics is a science which ought to be honourable, the basis of many most important sciences; but it is not to be carried on by steam, this science, any more than others are; a wise hand is requisite for carrying it on. Conclusive facts are inseparable from unconclusive except by a head that already understands and knows." (Thomas Carlyle, "Critical and Miscellaneous Essays", 1838)

"A judicious man looks at Statistics, not to get knowledge, but to save himself from having ignorance foisted on him." (Thomas Carlyle, "Chartism", 1840)

"A witty statesman once said, you might prove anything by figures." (Thomas Carlyle, "Chartism", 1840)

"Statistics, one may hope, will improve individually, and become good for something." (Thomas Carlyle, "Chartism", 1840)

"Inquiries wisely gone into, even on this most complex matter, will yield results worth something, not nothing. But it is a most complex matter; on which, whether for the past or the present. Statistic Inquiry, with its limited means, with its short vision and headlong extensive dogmatism, as yet too often throws not light, but error worse than darkness." (Thomas Carlyle, "Chartism", 1840)

"Tables are like cobwebs, like the sieve of Danaides; beautifully reticulated, orderly to look upon, but which will hold no conclusion. Tables are abstractions, and the object a most concrete one, so difficult to read the essence of." (Thomas Carlyle, "Chartism", 1840)

"There are innumerable circumstances; and one circumstance left out may be the vital one on which all turned. Statistics is a science which ought to be honourable, the basis
of many most important sciences; but it is not to be carried on by steam, this science, any more than others are; a wise head is requisite for carrying it on. Conclusive facts are inseparable from inconclusive except by a head that ah-eady understands and knows." (Thomas Carlyle, "Chartism", 1840)

"There is one fact which Statistic Science has communicated, and a most astonishing one ; the inference from which is pregnant as to this matter." (Thomas Carlyle, "Chartism", 1840)

"What constitutes the well-being of a man? Many things; of which the wages he gets, and the bread he buys with them, are but one preliminary item. Grant, however, that the
wages were the whole; that once knowing the wages and the price of bread, we know all; then what are the wages? Statistic Inquiry, in its present unguided condition, cannot
tell. The average rate of day's wages is not correctly ascertained for any portion of this country; not only not for half-centuries, it is not even ascertained anywhere for decades
or years: far from instituting comparisons with the past, the present itself is unknown to us." (Thomas Carlyle, "Chartism", 1840)

"A judicious man uses statistics, not to get knowledge, but to save himself from having ignorance foisted upon him." (Thomas Carlyle)

"A man protesting against error is on the way towards uniting himself with all men that believe in truth." (Thomas Carlyle)

"Conclusive facts are inseparable from inconclusive except by a head that already understands and knows." (Thomas Carlyle)

"In every phenomenon the beginning remains always the most notable moment." (Thomas Carlyle)

"Once turn to practice, error and truth will no longer consort together [...]." (Thomas Carlyle)

"Science rests on reason and experiment, and can meet an opponent with calmness." (Thomas Carlyle)