25 April 2006

🖍️David S Salsburg - Collected Quotes

"A good estimator has to be more than just consistent. It also should be one whose variance is less than that of any other estimator. This property is called minimum variance. This means that if we run the experiment several times, the 'answers' we get will be closer to one another than 'answers' based on some other estimator." (David S Salsburg, "Errors, Blunders, and Lies: How to Tell the Difference", 2017)

"All methods of dealing with big data require a vast number of mind-numbing, tedious, boring mathematical steps." (David S Salsburg, "Errors, Blunders, and Lies: How to Tell the Difference", 2017)

"An estimate (the mathematical definition) is a number derived from observed values that is as close as we can get to the true parameter value. Useful estimators are those that are 'better' in some sense than any others." (David S Salsburg, "Errors, Blunders, and Lies: How to Tell the Difference", 2017)

"Correlation is not equivalent to cause for one major reason. Correlation is well defined in terms of a mathematical formula. Cause is not well defined." (David S Salsburg, "Errors, Blunders, and Lies: How to Tell the Difference", 2017)

"Estimators are functions of the observed values that can be used to estimate specific parameters. Good estimators are those that are consistent and have minimum variance. These properties are guaranteed if the estimator maximizes the likelihood of the observations." (David S Salsburg, "Errors, Blunders, and Lies: How to Tell the Difference", 2017)

"One final warning about the use of statistical models (whether linear or otherwise): The estimated model describes the structure of the data that have been observed. It is unwise to extend this model very far beyond the observed data." (David S Salsburg, "Errors, Blunders, and Lies: How to Tell the Difference", 2017)

"The central limit conjecture states that most errors are the result of many small errors and, as such, have a normal distribution. The assumption of a normal distribution for error has many advantages and has often been made in applications of statistical models." (David S Salsburg, "Errors, Blunders, and Lies: How to Tell the Difference", 2017)

"The degree to which one variable can be predicted from another can be calculated as the correlation between them. The square of the correlation (R^2) is the proportion of the variance of one that can be 'explained' by knowledge of the other." (David S Salsburg, "Errors, Blunders, and Lies: How to Tell the Difference", 2017)

"The elements of this cloud of uncertainty (the set of all possible errors) can be described in terms of probability. The center of the cloud is the number zero, and elements of the cloud that are close to zero are more probable than elements that are far away from that center. We can be more precise in this definition by defining the cloud of uncertainty in terms of a mathematical function, called the probability distribution." (David S Salsburg, "Errors, Blunders, and Lies: How to Tell the Difference", 2017)

"The lack of variability is often a hallmark of faked data. […] The failure of faked data to have sufficient variability holds as long as the liar does not know this. If the liar knows this, his best approach is to start with real data and use it cleverly to adapt it to his needs." (David S Salsburg, "Errors, Blunders, and Lies: How to Tell the Difference", 2017)

"There are other problems with Big Data. In any large data set, there are bound to be inconsistencies, misclassifications, missing data - in other words, errors, blunders, and possibly lies. These problems with individual items occur in any data set, but they are often hidden in a large mass of numbers even when these numbers are generated out of computer interactions." (David S Salsburg, "Errors, Blunders, and Lies: How to Tell the Difference", 2017)

"There is a constant battle between the cold abstract absolutes of pure mathematics and, the sometimes sloppy way in which mathematical methods are applied in science." (David S Salsburg, "Errors, Blunders, and Lies: How to Tell the Difference", 2017)

"Two clouds of uncertainty may have the same center, but one may be much more dispersed than the other. We need a way of looking at the scatter about the center. We need a measure of the scatter. One such measure is the variance. We take each of the possible values of error and calculate the squared difference between that value and the center of the distribution. The mean of those squared differences is the variance." (David S Salsburg, "Errors, Blunders, and Lies: How to Tell the Difference", 2017)

"What properties should a good statistical estimator have? Since we are dealing with probability, we start with the probability that our estimate will be very close to the true value of the parameter. We want that probability to become greater and greater as we get more and more data. This property is called consistency. This is a statement about probability. It does not say that we are sure to get the right answer. It says that it is highly probable that we will be close to the right answer." (David S Salsburg, "Errors, Blunders, and Lies: How to Tell the Difference", 2017)

"When we use algebraic notation in statistical models, the problem becomes more complicated because we cannot 'observe' a probability and know its exact number. We can only estimate probabilities on the basis of observations." (David S Salsburg, "Errors, Blunders, and Lies: How to Tell the Difference", 2017)

24 April 2006

🖍️Schuyler W Huck - Collected Quotes

"As used here, the term statistical misconception refers to any of several widely held but incorrect notions about statistical concepts, about procedures for analyzing data and about the meaning of results produced by such analyses. To illustrate, many people think that (1) normal curves are bell shaped, (2) a correlation coefficient should never be used to address questions of causality, and (3) the level of significance dictates the probability of a Type I error. Some people, of course, have only one or two (rather than all three) of these misconceptions, and a few individuals realize that all three of those beliefs are false." (Schuyler W Huck, "Statistical Misconceptions", 2008)

"Distributional shape is an important attribute of data, regardless of whether scores are analyzed descriptively or inferentially. Because the degree of skewness can be summarized by means of a single number, and because computers have no difficulty providing such measures (or estimates) of skewness, those who prepare research reports should include a numerical index of skewness every time they provide measures of central tendency and variability." (Schuyler W Huck, "Statistical Misconceptions", 2008)

"If a researcher checks the normality assumption by visually inspecting each sample’s data (for example, by looking at a frequency distribution or a histogram), that researcher might incorrectly think that the data are nonnormal because the distribution appears to be too tall and skinny or too flat and squatty. As a result of this misdiagnosis, the researcher might unnecessarily abandon his or her initial plan to use a parametric statistical test in favor of a different procedure, perhaps one that is thought to be distribution-free." (Schuyler W Huck, "Statistical Misconceptions", 2008)

"If data are normally distributed, certain things are known about the group and individual scores in the group. For example, the three most frequently used measures of central tendency - the arithmetic mean, median, and mode - all have the same numerical value in a normal distribution. Moreover, if a distribution is normal, we can determine a person’s percentile if we know his or her z-score or T-score." (Schuyler W Huck, "Statistical Misconceptions", 2008)

"It is dangerous to think that standard scores, such as z and T, form a normal distribution because (1) they don’t have to and (2) they often won’t. If you mistakenly presume that a set of standard scores are normally distributed (when they’re not), your conversion of z-scores (or T-scores) into percentiles can lead to great inaccuracies." (Schuyler W Huck, "Statistical Misconceptions", 2008)

"It should be noted that any finite data set cannot “follow” the normal curve exactly. That’s because a normal curve’s two 'tails' extend out to positive and negative infinity. The curved line that forms a normal curve gets closer and closer to the baseline as the curved line moves further and further away from its middle section; however, the curved line never actually touches the abscissa." (Schuyler W Huck, "Statistical Misconceptions", 2008)

"[…] kurtosis is influenced by the variability of the data. This fact leads to two surprising characteristics of kurtosis. First, not all rectangular distributions have the same amount of kurtosis. Second, certain distributions that are not rectangular are more platykurtic than are rectangular distributions!" (Schuyler W Huck, "Statistical Misconceptions", 2008)

"The shape of a normal curve is influenced by two things: (1) the distance between the baseline and the curve’s apex, and (2) the length, on the baseline, that’s set equal to one standard deviation. The arbitrary values chosen for these distances by the person drawing the normal curve determine the appearance of the resulting picture." (Schuyler W Huck, "Statistical Misconceptions", 2008)

"The concept of kurtosis is often thought to deal with the 'peakedness' of a distribution. Compared to a normal distribution (which is said to have a moderate peak), distributions that have taller peaks are referred to as being leptokurtic, while those with smaller peaks are referred to as being platykurtic. Regarding the second of these terms, authors and instructors often suggest that the word flat (which rhymes with the first syllable of platykurtic) is a good mnemonic device for remembering that platykurtic distributions tend to be flatter than normal." (Schuyler W Huck, "Statistical Misconceptions", 2008)

"The second surprising feature of kurtosis is that rectangular distributions, which are flat, are not maximally platykurtic. Bimodal distributions can yield lower kurtosis values than rectangular distributions, even in those situations where the number of scores and score variability are held constant." (Schuyler W Huck, "Statistical Misconceptions", 2008)

"There are degrees to which a distribution can deviate from normality in terms of peakedness. A platykurtic distribution, for instance, might be slightly less peaked than a normal distribution, moderately less peaked than normal, or totally lacking in any peak. One is tempted to think that any perfectly rectangular distribution, being ultraflat in its shape, would be maximally platykurtic. However, this is not the case." (Schuyler W Huck, "Statistical Misconceptions", 2008)

🖍️Field Cady - Collected Quotes

"A common misconception is that data scientists don’t need visualizations. This attitude is not only inaccurate: it is very dangerous. Most machine learning algorithms are not inherently visual, but it is very easy to misinterpret their outputs if you look only at the numbers; there is no substitute for the human eye when it comes to making intuitive sense of things." (Field Cady, "The Data Science Handbook", 2017)

"AI failed (at least relative to the hype it had generated), and it’s partly out of embarrassment on behalf of their discipline that the term 'artificial intelligence' is rarely used in computer science circles (although it’s coming back into favor, just without the over-hyping). We are as far away from mimicking human intelligence as we have ever been, partly because the human brain is fantastically more complicated than a mere logic engine." (Field Cady, "The Data Science Handbook", 2017)

"At very small time scales, the motion of a particle is more like a random walk, as it gets jostled about by discrete collisions with water molecules. But virtually any random movement on small time scales will give rise to Brownian motion on large time scales, just so long as the motion is unbiased. This is because of the Central Limit Theorem, which tells us that the aggregate of many small, independent motions will be normally distributed." (Field Cady, "The Data Science Handbook", 2017)

"By far the greatest headache in machine learning is the problem of overfitting. This means that your results look great for the data you trained them on, but they don’t generalize to other data in the future. [...] The solution is to train on some of your data and assess performance on other data." (Field Cady, "The Data Science Handbook", 2017) 

"Extracting good features is the most important thing for getting your analysis to work. It is much more important than good machine learning classifiers, fancy statistical techniques, or elegant code. Especially if your data doesn’t come with readily available features (as is the case with web pages, images, etc.), how you reduce it to numbers will make the difference between success and failure." (Field Cady, "The Data Science Handbook", 2017)

"Feature extraction is also the most creative part of data science and the one most closely tied to domain expertise. Typically, a really good feature will correspond to some real‐world phenomenon. Data scientists should work closely with domain experts and understand what these phenomena mean and how to distill them into numbers." (Field Cady, "The Data Science Handbook", 2017)

"Outliers make it very hard to give an intuitive interpretation of the mean, but in fact, the situation is even worse than that. For a real‐world distribution, there always is a mean (strictly speaking, you can define distributions with no mean, but they’re not realistic), and when we take the average of our data points, we are trying to estimate that mean. But when there are massive outliers, just a single data point is likely to dominate the value of the mean and standard deviation, so much more data is required to even estimate the mean, let alone make sense of it." (Field Cady, "The Data Science Handbook", 2017)

"The first step is always to frame the problem: understand the business use case and craft a well‐defined analytics problem (or problems) out of it. This is followed by an extensive stage of grappling with the data and the real‐world things that it describes, so that we can extract meaningful features. Finally, these features are plugged into analytical tools that give us hard numerical results." (Field Cady, "The Data Science Handbook", 2017)

"Theoretically, the normal distribution is most famous because many distributions converge to it, if you sample from them enough times and average the results. This applies to the binomial distribution, Poisson distribution and pretty much any other distribution you’re likely to encounter (technically, any one for which the mean and standard deviation are finite)." (Field Cady, "The Data Science Handbook", 2017)

"With time series though, there is absolutely no substitute for plotting. The pertinent pattern might end up being a sharp spike followed by a gentle taper down. Or, maybe there are weird plateaus. There could be noisy spikes that have to be filtered out. A good way to look at it is this: means and standard deviations are based on the naïve assumption that data follows pretty bell curves, but there is no corresponding 'default' assumption for time series data (at least, not one that works well with any frequency), so you always have to look at the data to get a sense of what’s normal. [...] Along the lines of figuring out what patterns to expect, when you are exploring time series data, it is immensely useful to be able to zoom in and out." (Field Cady, "The Data Science Handbook", 2017)

"The myth of replacing domain experts comes from people putting too much faith in the power of ML to find patterns in the data. [...] ML looks for patterns that are generally pretty crude - the power comes from the sheer scale at which they can operate. If the important patterns in the data are not sufficiently crude then ML will not be able to ferret them out. The most powerful classes of models, like deep learning, can sometimes learn good-enough proxies for the real patterns, but that requires more training data than is usually available and yields complicated models that are hard to understand and impossible to debug. It’s much easier to just ask somebody who knows the domain!" (Field Cady, "Data Science: The Executive Summary: A Technical Book for Non-Technical Professionals", 2021)

🖍️Joel Best - Collected Quotes

"All human knowledge - including statistics - is created  through people's actions; everything we know is shaped by our language, culture, and society. Sociologists call this the social construction of knowledge. Saying that knowledge is socially constructed does not mean that all we know is somehow fanciful, arbitrary, flawed, or wrong. For example, scientific knowledge can be remarkably accurate, so accurate that we may forget the people and social processes that produced it." (Joel Best, "Damned Lies and Statistics: Untangling Numbers from the Media, Politicians, and Activists", 2001)

"Any statistic based on more than a guess requires some sort of counting. Definitions specify what will be counted. Measuring involves deciding how to go about counting. We cannot begin counting until we decide how we will identify and count instances of a social problem. [...] Measurement involves choices. [...] Often, measurement decisions are hidden." (Joel Best, "Damned Lies and Statistics: Untangling Numbers from the Media, Politicians, and Activists", 2001)

"Big numbers warn us that the problem is a common one, compelling our attention, concern, and action. The media like to report statistics because numbers seem to be 'hard facts' - little nuggets of indisputable truth. [...] One common innumerate error involves not distinguishing among large numbers. [...] Because many people have trouble appreciating the differences among big numbers, they tend to uncritically accept social statistics (which often, of course, feature big numbers)." (Joel Best, "Damned Lies and Statistics: Untangling Numbers from the Media, Politicians, and Activists", 2001)

"But people treat mutant statistics just as they do other statistics - that is, they usually accept even the most implausible claims without question. [...] And people repeat bad statistics [...] bad statistics live on; they take on lives of their own. [...] Statistics, then, have a bad reputation. We suspect that statistics may be wrong, that people who use statistics may be 'lying' - trying to manipulate us by using numbers to somehow distort the truth. Yet, at the same time, we need statistics; we depend upon them to summarize and clarify the nature of our complex society. This is particularly true when we talk about social problems." (Joel Best, "Damned Lies and Statistics: Untangling Numbers from the Media, Politicians, and Activists", 2001)

"Changing measures are a particularly common problem with comparisons over time, but measures also can cause problems of their own. [...] We cannot talk about change without making comparisons over time. We cannot avoid such comparisons, nor should we want to. However, there are several basic problems that can affect statistics about change. It is important to consider the problems posed by changing - and sometimes unchanging - measures, and it is also important to recognize the limits of predictions. Claims about change deserve critical inspection; we need to ask ourselves whether apples are being compared to apples - or to very different objects." (Joel Best, "Damned Lies and Statistics: Untangling Numbers from the Media, Politicians, and Activists", 2001)

"Clear, precise definitions are not enough. Whatever is defined must also be measured, and meaningless measurements will produce meaningless statistics." (Joel Best, "Damned Lies and Statistics: Untangling Numbers from the Media, Politicians, and Activists", 2001)

"Compound errors can begin with any of the standard sorts of bad statistics - a guess, a poor sample, an inadvertent transformation, perhaps confusion over the meaning of a complex statistic. People inevitably want to put statistics to use, to explore a number's implications. [...] The strengths and weaknesses of those original numbers should affect our confidence in the second-generation statistics." (Joel Best, "Damned Lies and Statistics: Untangling Numbers from the Media, Politicians, and Activists", 2001)

"First, good statistics are based on more than guessing. [...] Second, good statistics are based on clear, reasonable definitions. Remember, every statistic has to define its subject. Those definitions ought to be clear and made public. [...] Third, good statistics are based on clear, reasonable measures. Again, every statistic involves some sort of measurement; while all measures are imperfect, not all flaws are equally serious. [...] Finally, good statistics are based on good samples." (Joel Best, "Damned Lies and Statistics: Untangling Numbers from the Media, Politicians, and Activists", 2001)

"In order to interpret statistics, we need more than a checklist of common errors. We need a general approach, an orientation, a mind-set that we can use to think about new statistics that we encounter. We ought to approach statistics thoughtfully. This can be hard to do, precisely because so many people in our society treat statistics as fetishes." (Joel Best, "Damned Lies and Statistics: Untangling Numbers from the Media, Politicians, and Activists", 2001)

"Innumeracy - widespread confusion about basic mathematical ideas - means that many statistical claims about social problems don't get the critical attention they deserve. This is not simply because an innumerate public is being manipulated by advocates who cynically promote inaccurate statistics. Often, statistics about social problems originate with sincere, well-meaning people who are themselves innumerate; they may not grasp the full implications of what they are saying. Similarly, the media are not immune to innumeracy; reporters commonly repeat the figures their sources give them without bothering to think critically about them." (Joel Best, "Damned Lies and Statistics: Untangling Numbers from the Media, Politicians, and Activists", 2001)

"Knowledge is factual when evidence supports it and we have great confidence in its accuracy. What we call 'hard fact' is information supported by  strong, convincing evidence; this means evidence that, so far as we know, we cannot deny, however we examine or test it. Facts always can be questioned, but they hold up under questioning. How did people come by this information? How did they interpret it? Are other interpretations possible? The more satisfactory the answers to such questions, the 'harder' the facts." (Joel Best, "Damned Lies and Statistics: Untangling Numbers from the Media, Politicians, and Activists", 2001)

"Like definitions, measurements always involve choices. Advocates of different measures can defend their own choices and criticize those made by their opponents - so long as the various choices being made are known and understood. However, when measurement choices are kept hidden, it becomes difficult to assess the statistics based on those choices." (Joel Best, "Damned Lies and Statistics: Untangling Numbers from the Media, Politicians, and Activists", 2001)

"No definition of a social problem is perfect, but there are two principal ways such definitions can be flawed. On the one hand, we may worry that a definition is too broad, that it encompasses more than it ought to include. That is, broad definitions identify some cases as part of the problem that we might think ought not to be included; statisticians call such cases false positives (that is, they mistakenly identify cases as part of the problem). On the other hand, a definition that is too narrow excludes cases that we might think ought to be included; these are false negatives (incorrectly identified as not being part of the problem)." (Joel Best, "Damned Lies and Statistics: Untangling Numbers from the Media, Politicians, and Activists", 2001)

"Not all statistics start out bad, but any statistic can be made worse. Numbers - even good numbers - can be misunderstood or misinterpreted. Their meanings can be stretched, twisted, distorted, or mangled. These alterations create what we can call mutant statistics - distorted versions of the original figures." (Joel Best, "Damned Lies and Statistics: Untangling Numbers from the Media, Politicians, and Activists", 2001)

"One reason we tend to accept statistics uncritically is that we assume that numbers come from experts who know what they're doing. [...] There is a natural tendency to treat these figures as straightforward facts that cannot be questioned." (Joel Best, "Damned Lies and Statistics: Untangling Numbers from the Media, Politicians, and Activists", 2001)

"People who create or repeat a statistic often feel they have a stake in defending the number. When someone disputes an estimate and offers a very different (often lower) figure, people may rush to defend the original estimate and attack the new number and anyone who dares to use it. [...] any estimate can be defended by challenging the motives of anyone who disputes the figure." (Joel Best, "Damned Lies and Statistics: Untangling Numbers from the Media, Politicians, and Activists", 2001)

"Statistics are not magical. Nor are they always true - or always false. Nor need they be incomprehensible. Adopting a Critical approach offers an effective way of responding to the numbers we are sure to encounter. Being Critical requires more thought, but failing to adopt a Critical mind-set makes us powerless to evaluate what others tell us. When we fail to think critically, the statistics we hear might just as well be magical." (Joel Best, "Damned Lies and Statistics: Untangling Numbers from the Media, Politicians, and Activists", 2001)

"Statisticians can calculate the probability that such random samples represent the population; this is usually expressed in terms of sampling error [...]. The real problem is that few samples are random. Even when researchers know the nature of the population, it can be time-consuming and expensive to draw a random sample; all too often, it is impossible to draw a true random sample because the population cannot be defined. This is particularly true for studies of social problems. [...] The best samples are those that come as close as possible to being random.(Joel Best, "Damned Lies and Statistics: Untangling Numbers from the Media, Politicians, and Activists", 2001)

"The ease with which somewhat complex statistics can produce confusion is important, because we live in a world in which complex numbers are becoming more common. Simple statistical ideas - fractions, percentages, rates - are reasonably well understood by many people. But many social problems involve complex chains of cause and effect that can be understood only through complicated models developed by experts. [...] environment has an influence. Sorting out the interconnected causes of these problems requires relatively complicated statistical ideas - net additions, odds ratios, and the like. If we have an imperfect understanding of these ideas, and if the reporters and other people who relay the statistics to us share our confusion - and they probably do - the chances are good that we'll soon be hearing - and repeating, and perhaps making decisions on the basis of - mutated statistics." (Joel Best, "Damned Lies and Statistics: Untangling Numbers from the Media, Politicians, and Activists", 2001)

"There are two problems with sampling - one obvious, and  the other more subtle. The obvious problem is sample size. Samples tend to be much smaller than their populations. [...] Obviously, it is possible to question results based on small samples. The smaller the sample, the less confidence we have that the sample accurately reflects the population. However, large samples aren't necessarily good samples. This leads to the second issue: the representativeness of a sample is actually far more important than sample size. A good sample accurately reflects (or 'represents') the population." (Joel Best, "Damned Lies and Statistics: Untangling Numbers from the Media, Politicians, and Activists", 2001)

"We often hear warnings that some social problem is 'epidemic'. This expression suggests that the problem's growth is rapid, widespread, and out of control. If things are getting worse, and particularly if they're getting worse fast, we need to act." (Joel Best, "Damned Lies and Statistics: Untangling Numbers from the Media, Politicians, and Activists", 2001)

"Whenever examples substitute for definitions, there is a risk that our understanding of the problem will be distorted." (Joel Best, "Damned Lies and Statistics: Untangling Numbers from the Media, Politicians, and Activists", 2001)

"While some social problems statistics are deliberate deceptions, many - probably the great majority - of bad statistics are the result of confusion, incompetence, innumeracy, or selective, self-righteous efforts to produce numbers that reaffirm principles and interests that their advocates consider just and right. The best response to stat wars is not to try and guess who's lying or, worse, simply to assume that the people we disagree with are the ones telling lies. Rather, we need to watch for the standard causes of bad statistics - guessing, questionable definitions or methods, mutant numbers, and inappropriate comparisons." (Joel Best, "Damned Lies and Statistics: Untangling Numbers from the Media, Politicians, and Activists", 2001)

"Every number has its limitations; every number is a product of choices that inevitably involve compromise. Statistics are intended to help us summarize, to get an overview of part of the world’s complexity. But some information is always sacrificed in the process of choosing what will be counted and how. Something is, in short, always missing. In evaluating statistics, we should not forget what has been lost, if only because this helps us understand what we still have." (Joel Best, "More Damned Lies and Statistics: How numbers confuse public issues", 2004)

"Good statistics are not only products of people counting; the quality of statistics also depends on people’s willingness and ability to count thoughtfully and on their decisions about what, exactly, ought to be counted so that the resulting numbers will be both accurate and meaningful." (Joel Best, "More Damned Lies and Statistics: How numbers confuse public issues", 2004)

"In much the same way, people create statistics: they choose what to count, how to go about counting, which of the resulting numbers they share with others, and which words they use to describe and interpret those figures. Numbers do not exist independent of people; understanding numbers requires knowing who counted what, why they bothered counting, and how they went about it." (Joel Best, "More Damned Lies and Statistics: How numbers confuse public issues", 2004)

"In short, some numbers are missing from discussions of social issues because certain phenomena are hard to quantify, and any effort to assign numeric values to them is subject to debate. But refusing to somehow incorporate these factors into our calculations creates its own hazards. The best solution is to acknowledge the difficulties we encounter in measuring these phenomena, debate openly, and weigh the options as best we can." (Joel Best, "More Damned Lies and Statistics: How numbers confuse public issues", 2004)

"Nonetheless, the basic principles regarding correlations between variables are not that diffcult to understand. We must look for patterns that reveal potential relationships and for evidence that variables are actually related. But when we do spot those relationships, we should not jump to conclusions about causality. Instead, we need to weigh the strength of the relationship and the plausibility of our theory, and we must always try to discount the possibility of spuriousness." (Joel Best, "More Damned Lies and Statistics : How numbers confuse public issues", 2004)

"Statistics depend on collecting information. If questions go unasked, or if they are asked in ways that limit responses, or if measures count some cases but exclude others, information goes ungathered, and missing numbers result. Nevertheless, choices regarding which data to collect and how to go about collecting the information are inevitable." (Joel Best, "More Damned Lies and Statistics: How numbers confuse public issues", 2004)

"When people use statistics, they assume - or, at least, they want their listeners to assume - that the numbers are meaningful. This means, at a minimum, that someone has actually counted something and that they have done the counting in a way that makes sense. Statistical information is one of the best ways we have of making sense of the world’s complexities, of identifying patterns amid the confusion. But bad statistics give us bad information." (Joel Best, "More Damned Lies and Statistics: How numbers confuse public issues", 2004)

23 April 2006

🖍️Michael J Moroney - Collected Quotes

"A good estimator will be unbiased and will converge more and more closely (in the long run) on the true value as the sample size increases. Such estimators are known as consistent. But consistency is not all we can ask of an estimator. In estimating the central tendency of a distribution, we are not confined to using the arithmetic mean; we might just as well use the median. Given a choice of possible estimators, all consistent in the sense just defined, we can see whether there is anything which recommends the choice of one rather than another. The thing which at once suggests itself is the sampling variance of the different estimators, since an estimator with a small sampling variance will be less likely to differ from the true value by a large amount than an estimator whose sampling variance is large." (Michael J Moroney, "Facts from Figures", 1951)

"A piece of self-deception - often dear to the heart of apprentice scientists - is the drawing of a 'smooth curve' (how attractive it sounds!) through a set of points which have about as much trend as the currants in plum duff. Once this is done, the mind, looking for order amidst chaos, follows the Jack-o'-lantern line with scant attention to the protesting shouts of the actual points. Nor, let it be whispered, is it unknown for people who should know better to rub off the offending points and publish the trend line which their foolish imagination has introduced on the flimsiest of evidence. Allied to this sin is that of overconfident extrapolation, i.e. extending the graph by guesswork beyond the range of factual information. Whenever extrapolation is attempted it should be carefully distinguished from the rest of the graph, e.g. by showing the extrapolation as a dotted line in contrast to the full line of the rest of the graph. [...] Extrapolation always calls for justification, sooner or later. Until this justification is forthcoming, it remains a provisional estimate, based on guesswork." (Michael J Moroney, "Facts from Figures", 1951)

"Data should be collected with a clear purpose in mind. Not only a clear purpose, but a clear idea as to the precise way in which they will be analysed so as to yield the desired information." (Michael J Moroney, "Facts from Figures", 1951)

"For the most part, Statistics is a method of investigation that is used when other methods are of no avail; it is often a last resort and a forlorn hope. A statistical analysis, properly conducted, is a delicate dissection of uncertainties, a surgery of suppositions. The surgeon must guard carefully against false incisions with his scalpel. Very often he has to sew up the patient as inoperable. The public knows too little about the statistician as a conscientious and skilled servant of true science." (Michael J Moroney, "Facts from Figures", 1951)

"It is really questionable - though bordering on heresy to put the question - whether we would be any the worse off if the whole bag of tricks were scrapped. So many of these index numbers are so ancient and so out of date, so out of touch with reality, so completely devoid of practical value when they have been computed, that their regular calculation must be regarded as a widespread compulsion neurosis. Only lunatics and public servants with no other choice go on doing silly things and liking it." (Michael J Moroney, "Facts from Figures", 1951)

"It pays to keep wide awake in studying any graph. The thing looks so simple, so frank, and so appealing that the careless are easily fooled. [...] Data and formulae should be given along with the graph, so that the interested reader may look at the details if he wishes." (Michael J Moroney, "Facts from Figures", 1951)

"It will, of course, happen but rarely that the proportions will be identical, even if no real association exists. Evidently, therefore, we need a significance test to reassure ourselves that the observed difference of proportion is greater than could reasonably be attributed to chance. The significance test will test the reality of the association, without telling us anything about the intensity of association. It will be apparent that we need two distinct things: (a) a test of significance, to be used on the data first of all, and (b) some measure of the intensity of the association, which we shall only be justified in using if the significance test confirms that the association is real." (Michael J Moroney, "Facts from Figures", 1951)

"Some distributions [...] are symmetrical about their central value. Other distributions have marked asymmetry and are said to be skew. Skew distributions are divided into two types. If the 'tail' of the distribution reaches out into the larger values of the variate, the distribution is said to show positive skewness; if the tail extends towards the smaller values of the variate, the distribution is called negatively skew." (Michael J Moroney, "Facts from Figures", 1951)

"The economists, of course, have great fun - and show remarkable skill - in inventing more refined index numbers. Sometimes they use geometric averages instead of arithmetic averages (the advantage here being that the geometric average is less upset by extreme oscillations in individual items), sometimes they use the harmonic average. But these are all refinements of the basic idea of the index number [...]" (Michael J Moroney, "Facts from Figures", 1951)

"The mode would form a very poor basis for any further calculations of an arithmetical nature, for it has deliberately excluded arithmetical precision in the interests of presenting a typical result. The arithmetic average, on the other hand, excellent as it is for numerical purposes, has sacrificed its desire to be typical in favour of numerical accuracy. In such a case it is often desirable to quote both measures of central tendency." (Michael J Moroney, "Facts from Figures", 1951)

"The statistician’s job is to draw general conclusions from fragmentary data. Too often the data supplied to him for analysis are not only fragmentary but positively incoherent, so that he can do next to nothing with them. Even the most kindly statistician swears heartily under his breath whenever this happens". (Michael J Moroney, "Facts from Figures", 1951)

"Undoubtedly one of the most elegant, powerful, and useful techniques in modern statistical method is that of the Analysis of Variation and Co-variation by which the total variation in a set of data may be reduced to components associated with possible sources of variability whose relative importance we wish to assess. The precise form which any given analysis will take is intimately connected with the structure of the investigation from which the data are obtained. A simple structure will lead to a simple analysis; a complex structure to a complex analysis." (Michael J Moroney, "Facts from Figures", 1951)

"When the mathematician speaks of the existence of a 'functional relation' between two variable quantities, he means that they are connected by a simple 'formula that is to say, if we are told the value of one of the variable quantities we can find the value of the second quantity by substituting in the formula which tells us how they are related. [...] The thing to be clear about before we proceed further is that a functional relationship in mathematics means an exact and predictable relationship, with no ifs or buts about lt. It is useful in practice so long as the ifs and buts are only tiny voices which even the most ardent protagonist of proportional representation can ignore with a clear conscience." (Michael J Moroney, "Facts from Figures", 1951)

🖍️David Spiegelhalter - Collected Quotes

"A classification tree is perhaps the simplest form of algorithm, since it consists of a series of yes/no questions, the answer to each deciding the next question to be asked, until a conclusion is reached." (David Spiegelhalter, "The Art of Statistics: Learning from Data", 2019)

"Bootstrapping provides an intuitive, computer-intensive way of assessing the uncertainty in our estimates, without making strong assumptions and without using probability theory. But the technique is not feasible when it comes to, say, working out the margins of error on unemployment surveys of 100,000 people. Although bootstrapping is a simple, brilliant and extraordinarily effective idea, it is just too clumsy to bootstrap such large quantities of data, especially when a convenient theory exists that can generate formulae for the width of uncertainty intervals." (David Spiegelhalter, "The Art of Statistics: Learning from Data", 2019)

"But [bootstrap-based] simulations are clumsy and time-consuming, especially with large data sets, and in more complex circumstances it is not straightforward to work out what should be simulated. In contrast, formulae derived from probability theory provide both insight and convenience, and always lead to the same answer since they don’t depend on a particular simulation. But the flip side is that this theory relies on assumptions, and we should be careful not to be deluded by the impressive algebra into accepting unjustified conclusions." (David Spiegelhalter, "The Art of Statistics: Learning from Data", 2019)

"[...] data often has some errors, outliers and other strange values, but these do not necessarily need to be individually identified and excluded. It also points to the benefits of using summary measures that are not unduly affected by odd observations [...] are known as robust measures, and include the median and the inter-quartile range." (David Spiegelhalter, "The Art of Statistics: Learning from Data", 2019)

"Even in an era of open data, data science and data journalism, we still need basic statistical principles in order not to be misled by apparent patterns in the numbers." (David Spiegelhalter, "The Art of Statistics: Learning from Data", 2019)

"[...] in the statistical world, what we see and measure around us can be considered as the sum of a systematic mathematical idealized form plus some random contribution that cannot yet be explained. This is the classic idea of the signal and the noise." (David Spiegelhalter, "The Art of Statistics: Learning from Data", 2019)

"It is convenient to use a single number to summarize a steadily increasing or decreasing relationship between the pairs of numbers shown on a scatter-plot. This is generally chosen to be the Pearson correlation coefficient [...]. A Pearson correlation runs between −1 and 1, and expresses how close to a straight line the dots or data-points fall. A correlation of 1 occurs if all the points lie on a straight line going upwards, while a correlation of −1 occurs if all the points lie on a straight line going downwards. A correlation near 0 can come from a random scatter of points, or any other pattern in which there is no systematic trend upwards or downwards [...]." (David Spiegelhalter, "The Art of Statistics: Learning from Data", 2019)

"It is not enough to give a single summary for a distribution - we need to have an idea of the spread, sometimes known as the variability. [...] The range is a natural choice, but is clearly very sensitive to extreme values [...] In contrast the inter-quartile range (IQR) is unaffected by extremes. This is the distance between the 25th and 75th percentiles of the data and so contains the ‘central half’ of the numbers [...] Finally the standard deviation is a widely used measure of spread. It is the most technically complex measure, but is only really appropriate for well-behaved symmetric data since it is also unduly influenced by outlying values." (David Spiegelhalter, "The Art of Statistics: Learning from Data", 2019)

"[...] just because we act, and something changes, it doesn’t mean we were responsible for the result. Humans seem to find this simple truth difficult to grasp - we are always keen to construct an explanatory narrative, and even keener if we are at its centre. Of course sometimes this interpretation is true - if you flick a switch, and the light comes on, then you are usually responsible. But sometimes your actions are clearly not responsible for an outcome: if you don’t take an umbrella, and it rains, it is not your fault (although it may feel that way). But the consequences of many of our actions are less clear-cut. [...] We have a strong psychological tendency to attribute change to intervention, and this makes before-and-after comparisons treacherous." (David Spiegelhalter, "The Art of Statistics: Learning from Data", 2019)

"Mean-averages can be highly misleading when the raw data do not form a symmetric pattern around a central value but instead are skewed towards one side [...], typically with a large group of standard cases but with a tail of a few either very high (for example, income) or low (for example, legs) values." (David Spiegelhalter, "The Art of Statistics: Learning from Data", 2019)

"Statistical models have two main components. First, a mathematical formula that expresses a deterministic, predictable component, for example the fitted straight line that enables us to make a prediction [...]. But the deterministic part of a model is not going to be a perfect representation of the observed world [...] and the difference between what the model predicts, and what actually happens, is the second component of a model and is known as the residual error - although it is important to remember that in statistical modelling, ‘error’ does not refer to a mistake, but the inevitable inability of a model to exactly represent what we observe." (David Spiegelhalter, "The Art of Statistics: Learning from Data", 2019)

"[...] the Central Limit Theorem [...] says that the distribution of sample means tends towards the form of a normal distribution with increasing sample size, almost regardless of the shape of the original data distribution." (David Spiegelhalter, "The Art of Statistics: Learning from Data", 2019)

"The first rule of communication is to shut up and listen, so that you can get to know about the audience for your communication, whether it might be politicians, professionals or the general public. We have to understand their inevitable limitations and any misunderstandings, and fight the temptation to be too sophisticated and clever, or put in too much detail." (David Spiegelhalter, "The Art of Statistics: Learning from Data", 2019)

"The second rule of communication is to know what you want to achieve. Hopefully the aim is to encourage open debate, and informed decision-making. But there seems no harm in repeating yet again that numbers do not speak for themselves; the context, language and graphic design all contribute to the way the communication is received. We have to acknowledge we are telling a story, and it is inevitable that people will make comparisons and judgements, no matter how much we only want to inform and not persuade. All we can do is try to pre-empt inappropriate gut reactions by design or warning." (David Spiegelhalter, "The Art of Statistics: Learning from Data", 2019)

"There is no ‘correct’ way to display sets of numbers: each of the plots we have used has some advantages: strip-charts show individual points, box-and-whisker plots are convenient for rapid visual summaries, and histograms give a good feel for the underlying shape of the data distribution." (David Spiegelhalter, "The Art of Statistics: Learning from Data", 2019)

"This common view of statistics as a basic ‘bag of tools’ is now facing major challenges. First, we are in an age of data science, in which large and complex data sets are collected from routine sources such as traffic monitors, social media posts and internet purchases, and used as a basis for technological innovations such as optimizing travel routes, targeted advertising or purchase recommendation systems [...]. Statistical training is increasingly seen as just one necessary component of being a data scientist, together with skills in data management, programming and algorithm development, as well as proper knowledge of the subject matter." (David Spiegelhalter, "The Art of Statistics: Learning from Data", 2019)

"Unfortunately, when an ‘average’ is reported in the media, it is often unclear whether this should be interpreted as the mean or median." (David Spiegelhalter, "The Art of Statistics: Learning from Data", 2019)

"When it comes to presenting categorical data, pie charts allow an impression of the size of each category relative to the whole pie, but are often visually confusing, especially if they attempt to show too many categories in the same chart, or use a three-dimensional representation that distorts areas. [...] Multiple pie charts are generally not a good idea, as comparisons are hampered by the difficulty in assessing the relative sizes of areas of different shapes. Comparisons are better based on height or length alone in a bar chart." (David Spiegelhalter, "The Art of Statistics: Learning from Data", 2019)

"When we have all the data, it is straightforward to produce statistics that describe what has been measured. But when we want to use the data to draw broader conclusions about what is going on around us, then the quality of the data becomes paramount, and we need to be alert to the kind of systematic biases that can jeopardize the reliability of any claims." (David Spiegelhalter, "The Art of Statistics: Learning from Data", 2019)

"With the growing availability of massive data sets and user-friendly analysis software, it might be thought that there is less need for training in statistical methods. This would be naïve in the extreme. Far from freeing us from the need for statistical skills, bigger data and the rise in the number and complexity of scientific studies makes it even more difficult to draw appropriate conclusions. More data means that we need to be even more aware of what the evidence is actually worth." (David Spiegelhalter, "The Art of Statistics: Learning from Data", 2019)

"We over-fit when we go too far in adapting to local circumstances, in a worthy but misguided effort to be ‘unbiased’ and take into account all the available information. Usually we would applaud the aim of being unbiased, but this refinement means we have less data to work on, and so the reliability goes down. Over-fitting therefore leads to less bias but at a cost of more uncertainty or variation in the estimates, which is why protection against over-fitting is sometimes known as the bias/variance trade-off." (David Spiegelhalter, "The Art of Statistics: Learning from Data", 2019)

22 April 2006

🖍️Judea Pearl - Collected Quotes

"Despite the prevailing use of graphs as metaphors for communicating and reasoning about dependencies, the task of capturing informational dependencies by graphs is not at all trivial." (Judea Pearl, "Probabilistic Reasoning in Intelligent Systems: Network of Plausible, Inference", 1988)

"Probabilities are summaries of knowledge that is left behind when information is transferred to a higher level of abstraction." (Judea Pearl, "Probabilistic Reasoning in Intelligent Systems: Network of Plausible, Inference", 1988)

"When loops are present, the network is no longer singly connected and local propagation schemes will invariably run into trouble. […] If we ignore the existence of loops and permit the nodes to continue communicating with each other as if the network were singly connected, messages may circulate indefinitely around the loops and process may not converges to a stable equilibrium. […] Such oscillations do not normally occur in probabilistic networks […] which tend to bring all messages to some stable equilibrium as time goes on. However, this asymptotic equilibrium is not coherent, in the sense that it does not represent the posterior probabilities of all nodes of the network." (Judea Pearl, "Probabilistic Reasoning in Intelligent Systems: Network of Plausible, Inference", 1988)

"Traditional statistics is strong in devising ways of describing data and inferring distributional parameters from sample. Causal inference requires two additional ingredients: a science-friendly language for articulating causal knowledge, and a mathematical machinery for processing that knowledge, combining it with data and drawing new causal conclusions about a phenomenon." (Judea Pearl, "Causal inference in statistics: An overview", Statistics Surveys 3, 2009)

"Again, classical statistics only summarizes data, so it does not provide even a language for asking [a counterfactual] question. Causal inference provides a notation and, more importantly, offers a solution. As with predicting the effect of interventions [...], in many cases we can emulate human retrospective thinking with an algorithm that takes what we know about the observed world and produces an answer about the counterfactual world." (Judea Pearl & Dana Mackenzie, "The Book of Why: The new science of cause and effect", 2018)

"Bayesian networks inhabit a world where all questions are reducible to probabilities, or (in the terminology of this chapter) degrees of association between variables; they could not ascend to the second or third rungs of the Ladder of Causation. Fortunately, they required only two slight twists to climb to the top." (Judea Pearl & Dana Mackenzie, "The Book of Why: The new science of cause and effect", 2018)

"Bayesian statistics give us an objective way of combining the observed evidence with our prior knowledge (or subjective belief) to obtain a revised belief and hence a revised prediction of the outcome of the coin’s next toss. [...] This is perhaps the most important role of Bayes’s rule in statistics: we can estimate the conditional probability directly in one direction, for which our judgment is more reliable, and use mathematics to derive the conditional probability in the other direction, for which our judgment is rather hazy. The equation also plays this role in Bayesian networks; we tell the computer the forward  probabilities, and the computer tells us the inverse probabilities when needed." (Judea Pearl & Dana Mackenzie, "The Book of Why: The new science of cause and effect", 2018)

"Deep learning has instead given us machines with truly impressive abilities but no intelligence. The difference is profound and lies in the absence of a model of reality." (Judea Pearl, "The Book of Why: The New Science of Cause and Effect", 2018)

"[…] deep learning has succeeded primarily by showing that certain questions or tasks we thought were difficult are in fact not. It has not addressed the truly difficult questions that continue to prevent us from achieving humanlike AI." (Judea Pearl & Dana Mackenzie, "The Book of Why: The new science of cause and effect", 2018)

"Some scientists (e.g., econometricians) like to work with mathematical equations; others (e.g., hard-core statisticians) prefer a list of assumptions that ostensibly summarizes the structure of the diagram. Regardless of language, the model should depict, however qualitatively, the process that generates the data - in other words, the cause-effect forces that operate in the environment and shape the data generated." (Judea Pearl & Dana Mackenzie, "The Book of Why: The new science of cause and effect", 2018)

"The calculus of causation consists of two languages: causal diagrams, to express what we know, and a symbolic language, resembling algebra, to express what we want to know. The causal diagrams are simply dot-and-arrow pictures that summarize our existing scientific knowledge. The dots represent quantities of interest, called 'variables', and the arrows represent known or suspected causal relationships between those variables - namely, which variable 'listens' to which others." (Judea Pearl & Dana Mackenzie, "The Book of Why: The new science of cause and effect", 2018)

"The main differences between Bayesian networks and causal diagrams lie in how they are constructed and the uses to which they are put. A Bayesian network is literally nothing more than a compact representation of a huge probability table. The arrows mean only that the probabilities of child nodes are related to the values of parent nodes by a certain formula (the conditional probability tables) and that this relation is sufficient. That is, knowing additional ancestors of the child will not change the formula. Likewise, a missing arrow between any two nodes means that they are independent, once we know the values of their parents. [...] If, however, the same diagram has been constructed as a causal diagram, then both the thinking that goes into the construction and the interpretation of the final diagram change." (Judea Pearl & Dana Mackenzie, "The Book of Why: The new science of cause and effect", 2018)

"The transparency of Bayesian networks distinguishes them from most other approaches to machine learning, which tend to produce inscrutable 'black boxes'. In a Bayesian network you can follow every step and understand how and why each piece of evidence changed the network’s beliefs." (Judea Pearl & Dana Mackenzie, "The Book of Why: The new science of cause and effect", 2018)

"When the scientific question of interest involves retrospective thinking, we call on another type of expression unique to causal reasoning called a counterfactual. […] Counterfactuals are the building blocks of moral behavior as well as scientific thought. The ability to reflect on one’s past actions and envision alternative scenarios is the basis of free will and social responsibility. The algorithmization of counterfactuals invites thinking machines to benefit from this ability and participate in this (until now) uniquely human way of thinking about the world." (Judea Pearl & Dana Mackenzie, "The Book of Why: The new science of cause and effect", 2018)

"With Bayesian networks, we had taught machines to think in shades of gray, and this was an important step toward humanlike thinking. But we still couldn’t teach machines to understand causes and effects. [...] By design, in a Bayesian network, information flows in both directions, causal and diagnostic: smoke increases the likelihood of fire, and fire increases the likelihood of smoke. In fact, a Bayesian network can’t even tell what the 'causal direction' is." (Judea Pearl & Dana Mackenzie, "The Book of Why: The new science of cause and effect", 2018)

🖍️Foster Provost - Collected Quotes

"Data mining is a craft. As with many crafts, there is a well-defined process that can help to increase the likelihood of a successful result. This process is a crucial conceptual tool for thinking about data science projects. [...] data mining is an exploratory undertaking closer to research and development than it is to engineering." (Foster Provost, "Data Science for Business", 2013)

"Formulating data mining solutions and evaluating the results involves thinking carefully about the context in which they will be used." (Foster Provost, "Data Science for Business", 2013)

"[…] framing a business problem in terms of expected value can allow us to systematically decompose it into data mining tasks." (Foster Provost & Tom Fawcett, "Data Science for Business", 2013)

"If you look too hard at a set of data, you will find something - but it might not generalize beyond the data you’re looking at. This is referred to as overfitting a dataset. Data mining techniques can be very powerful, and the need to detect and avoid overfitting is one of the most important concepts to grasp when applying data mining to real problems. The concept of overfitting and its avoidance permeates data science processes, algorithms, and evaluation methods." (Foster Provost & Tom Fawcett, "Data Science for Business", 2013)

"In analytics, it’s more important for individuals to be able to formulate problems well, to prototype solutions quickly, to make reasonable assumptions in the face of ill-structured problems, to design experiments that represent good investments, and to analyze results." (Foster Provost & Tom Fawcett, "Data Science for Business", 2013)

"In common usage, prediction means to forecast a future event. In data science, prediction more generally means to estimate an unknown value. This value could be something in the future (in common usage, true prediction), but it could also be something in the present or in the past. Indeed, since data mining usually deals with historical data, models very often are built and tested using events from the past." (Foster Provost & Tom Fawcett, "Data Science for Business", 2013)

"In data science, a predictive model is a formula for estimating the unknown value of interest: the target. The formula could be mathematical, or it could be a logical statement such as a rule. Often it is a hybrid of the two." (Foster Provost & Tom Fawcett, "Data Science for Business", 2013)

"There is another important distinction pertaining to mining data: the difference between (1) mining the data to find patterns and build models, and (2) using the results of data mining. Students often confuse these two processes when studying data science, and managers sometimes confuse them when discussing business analytics. The use of data mining results should influence and inform the data mining process itself, but the two should be kept distinct." (Foster Provost & Tom Fawcett, "Data Science for Business", 2013)

"There is convincing evidence that data-driven decision-making and big data technologies substantially improve business performance. Data science supports data-driven decision-making - and sometimes conducts such decision-making automatically - and depends upon technologies for 'big data' storage and engineering, but its principles are separate." (Foster Provost & Tom Fawcett, "Data Science for Business", 2013)

"Unfortunately, creating an objective function that matches the true goal of the data mining is usually impossible, so data scientists often choose based on faith and experience." (Foster Provost & Tom Fawcett, "Data Science for Business", 2013)

🖍️Joseph P Bigus - Collected Quotes

"Data mining is the efficient discovery of valuable, nonobvious information from a large collection of data. […] Data mining centers on the automated discovery of new facts and relationships in data. The idea is that the raw material is the business data, and the data mining algorithm is the excavator, sifting through the vast quantities of raw data looking for the valuable nuggets of business information." (Joseph P Bigus,"Data Mining with Neural Networks: Solving business problems from application development to decision support", 1996)

"Like modeling, which involves making a static one-time prediction based on current information, time-series prediction involves looking at current information and predicting what is going to happen. However, with time-series predictions, we typically are looking at what has happened for some period back through time and predicting for some point in the future. The temporal or time element makes time-series prediction both more difficult and more rewarding. Someone who can predict the future based on what has occurred in the past can clearly have tremendous advantages over someone who cannot." (Joseph P Bigus,"Data Mining with Neural Networks: Solving business problems from application development to decision support", 1996)

"Many of the basic functions performed by neural networks are mirrored by human abilities. These include making distinctions between items (classification), dividing similar things into groups (clustering), associating two or more things (associative memory), learning to predict outcomes based on examples (modeling), being able to predict into the future (time-series forecasting), and finally juggling multiple goals and coming up with a good- enough solution (constraint satisfaction)." (Joseph P Bigus,"Data Mining with Neural Networks: Solving business problems from application development to decision support", 1996)

"More than just a new computing architecture, neural networks offer a completely different paradigm for solving problems with computers. […] The process of learning in neural networks is to use feedback to adjust internal connections, which in turn affect the output or answer produced. The neural processing element combines all of the inputs to it and produces an output, which is essentially a measure of the match between the input pattern and its connection weights. When hundreds of these neural processors are combined, we have the ability to solve difficult problems such as credit scoring." (Joseph P Bigus,"Data Mining with Neural Networks: Solving business problems from application development to decision support", 1996)

"Neural networks are a computing model grounded on the ability to recognize patterns in data. As a consequence, they have many applications to data mining and analysis." (Joseph P Bigus,"Data Mining with Neural Networks: Solving business problems from application development to decision support", 1996)

"Neural networks are a computing technology whose fundamental purpose is to recognize patterns in data. Based on a computing model similar to the underlying structure of the human brain, neural networks share the brains ability to learn or adapt in response to external inputs. When exposed to a stream of training data, neural networks can discover previously unknown relationships and learn complex nonlinear mappings in the data. Neural networks provide some fundamental, new capabilities for processing business data. However, tapping these new neural network data mining functions requires a completely different application development process from traditional programming." (Joseph P Bigus, "Data Mining with Neural Networks: Solving business problems from application development to decision support", 1996)

"People build practical, useful mental models all of the time. Seldom do they resort to writing a complex set of mathematical equations or use other formal methods. Rather, most people build models relating inputs and outputs based on the examples they have seen in their everyday life. These models can be rather trivial, such as knowing that when there are dark clouds in the sky and the wind starts picking up that a storm is probably on the way. Or they can be more complex, like a stock trader who watches plots of leading economic indicators to know when to buy or sell. The ability to make accurate predictions from complex examples involving many variables is a great asset." (Joseph P Bigus,"Data Mining with Neural Networks: Solving business problems from application development to decision support", 1996)

"Unfortunately, just collecting the data in one place and making it easily available isn’t enough. When operational data from transactions is loaded into the data warehouse, it often contains missing or inaccurate data. How good or bad the data is a function of the amount of input checking done in the application that generates the transaction. Unfortunately, many deployed applications are less than stellar when it comes to validating the inputs. To overcome this problem, the operational data must go through a 'cleansing' process, which takes care of missing or out-of-range values. If this cleansing step is not done before the data is loaded into the data warehouse, it will have to be performed repeatedly whenever that data is used in a data mining operation." (Joseph P Bigus,"Data Mining with Neural Networks: Solving business problems from application development to decision support", 1996)

"When training a neural network, it is important to understand when to stop. […] If the same training patterns or examples are given to the neural network over and over, and the weights are adjusted to match the desired outputs, we are essentially telling the network to memorize the patterns, rather than to extract the essence of the relationships. What happens is that the neural network performs extremely well on the training data. However, when it is presented with patterns it hasn't seen before, it cannot generalize and does not perform well. What is the problem? It is called overtraining." (Joseph P Bigus,"Data Mining with Neural Networks: Solving business problems from application development to decision support", 1996)

"While classification is important, it can certainly be overdone. Making too fine a distinction between things can be as serious a problem as not being able to decide at all. Because we have limited storage capacity in our brain (we still haven't figured out how to add an extender card), it is important for us to be able to cluster similar items or things together. Not only is clustering useful from an efficiency standpoint, but the ability to group like things together (called chunking by artificial intelligence practitioners) is a very important reasoning tool. It is through clustering that we can think in terms of higher abstractions, solving broader problems by getting above all of the nitty-gritty details." (Joseph P Bigus,"Data Mining with Neural Networks: Solving business problems from application development to decision support", 1996)

🖍️Richard E Nisbett - Collected Quotes

"Multiple regression, like all statistical techniques based on correlation, has a severe limitation due to the fact that correlation doesn't prove causation. And no amount of measuring of 'control' variables can untangle the web of causality. What nature hath joined together, multiple regression cannot put asunder." (Richard Nisbett, "2014: What scientific idea is ready for retirement?", 2013)

"What nature hath joined together, multiple regression cannot put asunder."  (Richard Nisbett, "2014: What scientific idea is ready for retirement?", 2013)

"A basic problem with MRA is that it typically assumes that the independent variables can be regarded as building blocks, with each variable taken by itself being logically independent of all the others. This is usually not the case, at least for behavioral data. […] Just as correlation doesn’t prove causation, absence of correlation fails to prove absence of causation. False-negative findings can occur using MRA just as false-positive findings do - because of the hidden web of causation that we’ve failed to identify." (Richard E Nisbett, "Mindware: Tools for Smart Thinking", 2015)

"Deductive and inductive reasoning schemas essentially regulate inferences. They tell us what kinds of inferences are valid and what kinds are invalid. […] Dialectical reasoning isn’t formal or deductive and usually doesn’t deal in abstractions. It’s concerned with reaching true and useful conclusions rather than valid conclusions. In fact, conclusions based on dialectical reasoning can actually be opposed to those based on formal logic." (Richard E Nisbett, "Mindware: Tools for Smart Thinking", 2015)

"Multiple regression analysis (MRA) examines the association between an independent variable and a dependent variable, controlling for the association between the independent variable and other variables, as well as the association of those other variables with the dependent variable. The method can tell us about causality only if all possible causal influences have been identified and measured reliably and validly. In practice, these conditions are rarely met." (Richard E Nisbett, "Mindware: Tools for Smart Thinking", 2015)

"One technique employing correlational analysis is multiple regression analysis (MRA), in which a number of independent variables are correlated simultaneously (or sometimes sequentially, but we won’t talk about that variant of MRA) with some dependent variable. The predictor variable of interest is examined along with other independent variables that are referred to as control variables. The goal is to show that variable A influences variable B 'net of' the effects of all the other variables. That is to say, the relationship holds even when the effects of the control variables on the dependent variable are taken into account." (Richard E Nisbett, "Mindware: Tools for Smart Thinking", 2015)

"Science is often described as a 'seamless web'. What’s meant by that is that the facts, methods, theories, and rules of inference discovered in one field can be helpful for other fields. And philosophy and logic can affect reasoning in literally every field of science."(Richard E Nisbett, "Mindware: Tools for Smart Thinking", 2015)

"The closer that sample-selection procedures approach the gold standard of random selection - for which the definition is that every individual in the population has an equal chance of appearing in the sample - the more we should trust them. If we don’t know whether a sample is random, any statistical measure we conduct may be biased in some unknown way." (Richard E Nisbett, "Mindware: Tools for Smart Thinking", 2015)

"The correlational technique known as multiple regression is used frequently in medical and social science research. This technique essentially correlates many independent (or predictor) variables simultaneously with a given dependent variable (outcome or output). It asks, 'Net of the effects of all the other variables, what is the effect of variable A on the dependent variable?' Despite its popularity, the technique is inherently weak and often yields misleading results. The problem is due to self-selection. If we don’t assign cases to a particular treatment, the cases may differ in any number of ways that could be causing them to differ along some dimension related to the dependent variable. We can know that the answer given by a multiple regression analysis is wrong because randomized control experiments, frequently referred to as the gold standard of research techniques, may give answers that are quite different from those obtained by multiple regression analysis." (Richard E Nisbett, "Mindware: Tools for Smart Thinking", 2015)

"The fundamental problem with MRA, as with all correlational methods, is self-selection. The investigator doesn’t choose the value for the independent variable for each subject (or case). This means that any number of variables correlated with the independent variable of interest have been dragged along with it. In most cases, we will fail to identify all these variables. In the case of behavioral research, it’s normally certain that we can’t be confident that we’ve identified all the plausibly relevant variables." (Richard E Nisbett, "Mindware: Tools for Smart Thinking", 2015)

"The theory behind multiple regression analysis is that if you control for everything that is related to the independent variable and the dependent variable by pulling their correlations out of the mix, you can get at the true causal relation between the predictor variable and the outcome variable. That’s the theory. In practice, many things prevent this ideal case from being the norm." (Richard E Nisbett, "Mindware: Tools for Smart Thinking", 2015)

"We are superb causal-hypothesis generators. Given an effect, we are rarely at a loss for an explanation. Seeing a difference in observations over time, we readily come up with a causal interpretation. Much of the time, no causality at all is going on—just random variation. The compulsion to explain is particularly strong when we habitually see that one event typically occurs in conjunction with another event. Seeing such a correlation almost automatically provokes a causal explanation. It’s tremendously useful to be on our toes looking for causal relationships that explain our world. But there are two problems: (1) The explanations come too easily. If we recognized how facile our causal hypotheses were, we’d place less confidence in them. (2) Much of the time, no causal interpretation at all is appropriate and wouldn’t even be made if we had a better understanding of randomness." (Richard E Nisbett, "Mindware: Tools for Smart Thinking", 2015)

"We don’t recognize how easy it is to generate hypotheses about the world. If we did, we’d generate fewer of them, or at least hold them more tentatively. We sprout causal theories in abundance when we learn of a correlation, and we readily find causal explanations for the failure of the world to confirm our hypotheses. We don’t realize how easy it is for us to explain away evidence that would seem on the surface to contradict our hypotheses. And we fail to generate tests of a hypothesis that could falsify the hypothesis if in fact the hypothesis is wrong. This is one type of confirmation bias." (Richard E Nisbett, "Mindware: Tools for Smart Thinking", 2015)

🖍️Mike Barlow - Collected Quotes

"Applying data science principles to solve social problems and improve the lives of ordinary people seems like a logical idea, but it is by no means a given. Using data science to elevate the human condition won’t happen by accident; groups of people will have to envision it, develop the routine processes and underlying infrastructures required to make it practical, and then commit the time and energy necessary to make it all work." (Mike Barlow, "Learning to Love Data Science", 2015)

"Hollywood loves the myth of a lone scientist working late nights in a dark laboratory on a mysterious island, but the truth is far less melodramatic. Real science is almost always a team sport. Groups of people, collaborating with other groups of people, are the norm in science - and data science is no exception to the rule. When large groups of people work together for extended periods of time, a culture begins to emerge." (Mike Barlow, "Learning to Love Data Science", 2015)

"In other words, real-time denotes the ability to process data as it arrives, rather than storing the data and retrieving it at some point in the future. That’s the primary significance of the term - real-time means that you’re processing data in the present, rather than in the future." (Mike Barlow, "Learning to Love Data Science", 2015)

"The ability to manage large and complex sets of data hasn’t diminished the appetite for more size and greater speed. Every day it seems that a new technique or application is introduced that pushes the edges of the speed-size envelope even further." (Mike Barlow, "Learning to Love Data Science", 2015)

"The cultural component of big data is neither trivial nor free. It is not a list of 'feel-good' or 'fluffy' attributes that are posted on a corporate website. Culture (that is, people and processes) is integral and critical to the success of any new technology deployment or implementation." (Mike Barlow, "Learning to Love Data Science", 2015)

"The whole point of machine learning is automating the learning process itself, enabling the computer program to get better as it consumes more data, without requiring the continual intervention of a programmer." (Mike Barlow, "Learning to Love Data Science", 2015)

🖍️Peter C Bruce - Collected Quotes

"A popular misconception holds that the era of Big Data means the end of a need for sampling. In fact, the proliferation of data of varying quality and relevance reinforces the need for sampling as a tool to work efficiently with a variety of data, and minimize bias. Even in a Big Data project, predictive models are typically developed and piloted with samples." (Peter C Bruce & Andrew G Bruce, "Statistics for Data Scientists: 50 Essential Concepts", 2016)

"Do not confuse standard deviation (which measures the variability of individual data points) with standard error (which measures the variability of a sample metric)." (Peter C Bruce & Andrew G Bruce, "Statistics for Data Scientists: 50 Essential Concepts", 2016)

"In statistical theory, location and variability are referred to as the first and second moments of a distribution. The third and fourth moments are called skewness and kurtosis. Skewness refers to whether the data is skewed to larger or smaller values and kurtosis indicates the propensity of the data to have extreme values. Generally, metrics are not used to measure skewness and kurtosis; instead, these are discovered through visual displays [...]" (Peter C Bruce & Andrew G Bruce, "Statistics for Data Scientists: 50 Essential Concepts", 2016)

"Machine learning tends to be more focused on developing efficient algorithms that scale to large data in order to optimize the predictive model. Statistics generally pays more attention to the probabilistic theory and underlying structure of the model." (Peter C Bruce & Andrew G Bruce, "Statistics for Data Scientists: 50 Essential Concepts", 2016)

"Many classification and regression algorithms optimize a certain criteria or loss function. For example, logistic regression attempts to minimize the deviance. In the literature, some propose to modify the loss function in order to avoid the problems caused by a rare class. In practice, this is hard to do: classification algorithms can be complex and difficult to modify. Weighting is an easy way to change the loss function, discounting errors for records with low weights in favor of records of higher weights." (Peter C Bruce & Andrew G Bruce, "Statistics for Data Scientists: 50 Essential Concepts", 2016)

"Moreover, data science (and business in general) is not so worried about statistical significance, but more concerned with optimizing overall effort and results." (Peter C Bruce & Andrew G Bruce, "Statistics for Data Scientists: 50 Essential Concepts", 2016)

"Statisticians often use the term estimates for values calculated from the data at hand, to draw a distinction between what we see from the data, and the theoretical true or exact state of affairs. Data scientists and business analysts are more likely to refer to such values as a metric. The difference reflects the approach of statistics versus data science: accounting for uncertainty lies at the heart of the discipline of statistics, whereas concrete business or organizational objectives are the focus of data science. Hence, statisticians estimate, and data scientists measure." (Peter C Bruce & Andrew G Bruce, "Statistics for Data Scientists: 50 Essential Concepts", 2016)

"The bootstrap does not compensate for a small sample size; it does not create new data, nor does it fill in holes in an existing data set. It merely informs us about how lots of additional samples would behave when drawn from a population like our original sample." (Peter C Bruce & Andrew G Bruce, "Statistics for Data Scientists: 50 Essential Concepts", 2016)

"The tension between oversmoothing and overfitting is an instance of the bias-variance tradeoff, an ubiquitous problem in statistical model fitting. Variance refers to the modeling error that occurs because of the choice of training data; that is, if you were to choose a different set of training data, the resulting model would be different. Bias refers to the modeling error that occurs because you have not properly identified the underlying real-world scenario; this error would not disappear if you simply added more training data. When a flexible model is overfit, the variance increases. You can reduce this by using a simpler model, but the bias may increase due to the loss of flexibility in modeling the real underlying situation." (Peter C Bruce & Andrew G Bruce, "Statistics for Data Scientists: 50 Essential Concepts", 2016)

"The variance, the standard deviation, mean absolute deviation, and median absolute deviation from the median are not equivalent estimates, even in the case where the data comes from a normal distribution. In fact, the standard deviation is always greater than the mean absolute deviation, which itself is greater than the median absolute deviation. Sometimes, the median absolute deviation is multiplied by a constant scaling factor (it happens to work out to 1.4826) to put MAD on the same scale as the standard deviation in the case of a normal distribution." (Peter C Bruce & Andrew G Bruce, "Statistics for Data Scientists: 50 Essential Concepts", 2016)

"When analysts and researchers use the term regression by itself, they are typically referring to linear regression; the focus is usually on developing a linear model to explain the relationship between predictor variables and a numeric outcome variable. In its formal statistical sense, regression also includes nonlinear models that yield a functional relationship between predictors and outcome variables. In the machine learning community, the term is also occasionally used loosely to refer to the use of any predictive model that produces a predicted numeric outcome (standing in distinction from classification methods that predict a binary or categorical outcome)." (Peter C Bruce & Andrew G Bruce, "Statistics for Data Scientists: 50 Essential Concepts", 2016)

21 April 2006

🖍️Pedro Domingos - Collected Quotes

"A learner that uses Bayes’ theorem and assumes the effects are independent given the cause is called a Naïve Bayes classifier. That’s because, well, that’s such a naïve assumption." (Pedro Domingos, "The Master Algorithm", 2015)

"An algorithm is not just any set of instructions: they have to be precise and unambiguous enough to be executed by a computer. [...] The computer has to know how to execute the algorithm all the way down to turning specific transistors on and off." (Pedro Domingos, "The Master Algorithm", 2015)

"As so often happens in computer science, we’re willing to sacrifice efficiency for generality." (Pedro Domingos, "The Master Algorithm", 2015)

"Believe it or not, every algorithm, no matter how complex, can be reduced to just these three operations: AND, OR, and NOT." (Pedro Domingos, "The Master Algorithm", 2015)

"Designing an algorithm is not easy. Pitfalls abound, and nothing can be taken for granted. Some of your intuitions will turn out to have been wrong, and you’ll have to find another way. On top of designing the algorithm, you have to write it down in a language computers can understand, like Java or Python (at which point it’s called a program). Then you have to debug it: find every error and fix it until the computer runs your program without screwing up. But once you have a program that does what you want, you can really go to town." (Pedro Domingos, "The Master Algorithm", 2015)

"Dimensionality reduction is essential for coping with big data—like the data coming in through your senses every second. A picture may be worth a thousand words, but it’s also a million times more costly to process and remember. [...] A common complaint about big data is that the more data you have, the easier it is to find spurious patterns in it. This may be true if the data is just a huge set of disconnected entities, but if they’re interrelated, the picture changes." (Pedro Domingos, "The Master Algorithm", 2015)

"Every algorithm has an input and an output: the data goes into the computer, the algorithm does what it will with it, and out comes the result. Machine learning turns this around: in goes the data and the desired result and out comes the algorithm that turns one into the other. Learning algorithms - also known as learners - are algorithms that make other algorithms. With machine learning, computers write their own programs, so we don’t have to." (Pedro Domingos, "The Master Algorithm", 2015)

"In machine learning, knowledge is often in the form of statistical models, because most knowledge is statistical [...] Machine learning is a kind of knowledge pump: we can use it to extract a lot of knowledge from data, but first we have to prime the pump." (Pedro Domingos, "The Master Algorithm", 2015)

"Learning is forgetting the details as much as it is remembering the important parts." (Pedro Domingos, "The Master Algorithm", 2015)

"Machine learning takes many different forms and goes by many different names: pattern recognition, statistical modeling, data mining, knowledge discovery, predictive analytics, data science, adaptive systems, self-organizing systems, and more. Each of these is used by different communities and has different associations. Some have a long half-life, some less so." (Pedro Domingos, "The Master Algorithm", 2015)

"Our beliefs are based on our experience, which gives us a very incomplete picture of the world, and it's easy to jump to false conclusions." (Pedro Domingos, "The Master Algorithm", 2015)

"People often think computers are all about numbers, but they’re not. Computers are all about logic." (Pedro Domingos, "The Master Algorithm", 2015)

"Science’s predictions are more trustworthy, but they are limited to what we can systematically observe and tractably model. Big data and machine learning greatly expand that scope. Some everyday things can be predicted by the unaided mind, from catching a ball to carrying on a conversation. Some things, try as we might, are just unpredictable. For the vast middle ground between the two, there’s machine learning." (Pedro Domingos, "The Master Algorithm", 2015)

"To make progress, every field of science needs to have data commensurate with the complexity of the phenomena it studies. [...] With big data and machine learning, you can understand much more complex phenomena than before. In most fields, scientists have traditionally used only very limited kinds of models, like linear regression, where the curve you fit to the data is always a straight line. Unfortunately, most phenomena in the world are nonlinear. [...] Machine learning opens up a vast new world of nonlinear models." (Pedro Domingos, "The Master Algorithm", 2015)

"Today we routinely learn models with millions of parameters, enough to give each elephant in the world his own distinctive wiggle. It’s even been said that data mining means 'torturing the data until it confesses'." (Pedro Domingos, "The Master Algorithm", 2015)

"Traditionally, the only way to get a computer to do something - from adding two numbers to flying an airplane - was to write down an algorithm explaining how, in painstaking detail. But machine-learning algorithms, also known as learners, are different: they figure it out on their own, by making inferences from data. And the more data they have, the better they get. Now we don’t have to program computers; they program themselves." (Pedro Domingos, "The Master Algorithm", 2015)

"Whoever has the best algorithms and the most data wins. A new type of network effect takes hold: whoever has the most customers accumulates the most data, learns the best models, wins the most new customers, and so on in a virtuous circle (or a vicious one, if you’re the competition)." (Pedro Domingos, "The Master Algorithm", 2015)

20 April 2006

🖍️Amit Ray - Collected Quotes

"Artificial intelligence is defined as the branch of science and technology that is concerned with the study of software and hardware to provide machines the ability to learn insights from data and the environment, and the ability to adapt in changing situations with high precision, accuracy and speed." (Amit Ray, "Compassionate Artificial Intelligence", 2018)

"Artificial Intelligence is not just learning patterns from data, but understanding human emotions and its evolution from its depth and not just fulfilling the surface level human requirements, but sensitivity towards human pain, happiness, mistakes, sufferings and well-being of the society are the parts of the evolving new AI systems." (Amit Ray, "Compassionate Artificial Intelligence", 2018)

"Quantum Machine Learning is defined as the branch of science and technology that is concerned with the application of quantum mechanical phenomena such as superposition, entanglement and tunneling for designing software and hardware to provide machines the ability to learn insights and patterns from data and the environment, and the ability to adapt automatically to changing situations with high precision, accuracy and speed." (Amit Ray, "Quantum Computing Algorithms for Artificial Intelligence", 2018)

"Quantum machine learning promises to discover the optimal network topologies and hyperparameters automatically without human intervention. (Amit Ray, "Quantum Computing Algorithms for Artificial Intelligence", 2018)

"The beauty of quantum machine learning is that we do not need to depend on an algorithm like gradient descent or convex objective function. The objective function can be nonconvex or something else." (Amit Ray, "Quantum Computing Algorithms for Artificial Intelligence", 2018)

"You can't understand depth of science, unless you challenge the published scientific data." (Amit Ray)
