04 November 2018

🔭Data Science: Central Limit Theorem (Just the Quotes)

"I know of scarcely anything so apt to impress the imagination as the wonderful form of cosmic order expressed by the ‘Law of Frequency of Error’. The law would have been personified by the Greeks and deified, if they had known of it. It reigns with serenity and in complete self-effacement, amidst the wildest confusion. The huger the mob, and the greater the apparent anarchy, the more perfect is its sway. It is the supreme law of Unreason. Whenever a large sample of chaotic elements are taken in hand and marshalled in the order of their magnitude, an unsuspected and most beautiful form of regularity proves to have been latent all along." (Sir Francis Galton, "Natural Inheritance", 1889)

"The central limit theorem […] states that regardless of the shape of the curve of the original population, if you repeatedly randomly sample a large segment of your group of interest and take the average result, the set of averages will follow a normal curve." (Charles Livingston & Paul Voakes, "Working with Numbers and Statistics: A handbook for journalists", 2005)

"The central limit theorem says that, under conditions almost always satisfied in the real world of experimentation, the distribution of such a linear function of errors will tend to normality as the number of its components becomes large. The tendency to normality occurs almost regardless of the individual distributions of the component errors. An important proviso is that several sources of error must make important contributions to the overall error and that no particular source of error dominate the rest." (George E P Box et al, "Statistics for Experimenters: Design, discovery, and innovation" 2nd Ed., 2005)

"Two things explain the importance of the normal distribution: (1) The central limit effect that produces a tendency for real error distributions to be 'normal like'. (2) The robustness to nonnormality of some common statistical procedures, where 'robustness' means insensitivity to deviations from theoretical normality." (George E P Box et al, "Statistics for Experimenters: Design, discovery, and innovation" 2nd Ed., 2005)

"The central limit theorem differs from laws of large numbers because random variables vary and so they differ from constants such as population means. The central limit theorem says that certain independent random effects converge not to a constant population value such as the mean rate of unemployment but rather they converge to a random variable that has its own Gaussian bell-curve description." (Bart Kosko, "Noise", 2006)

"Normally distributed variables are everywhere, and most classical statistical methods use this distribution. The explanation for the normal distribution’s ubiquity is the Central Limit Theorem, which says that if you add a large number of independent samples from the same distribution the distribution of the sum will be approximately normal." (Ben Bolker, "Ecological Models and Data in R", 2007)

"[…] the Central Limit Theorem says that if we take any sequence of small independent random quantities, then in the limit their sum (or average) will be distributed according to the normal distribution. In other words, any quantity that can be viewed as the sum of many small independent random effects. will be well approximated by the normal distribution. Thus, for example, if one performs repeated measurements of a fixed physical quantity, and if the variations in the measurements across trials are the cumulative result of many independent sources of error in each trial, then the distribution of measured values should be approximately normal." (David Easley & Jon Kleinberg, "Networks, Crowds, and Markets: Reasoning about a Highly Connected World", 2010)

[myth] "It has been said that process behavior charts work because of the central limit theorem."(Donald J Wheeler, "Myths About Data Analysis", International Lean & Six Sigma Conference, 2012)

"Statistical inference is really just the marriage of two concepts that we’ve already discussed: data and probability (with a little help from the central limit theorem)." (Charles Wheelan, "Naked Statistics: Stripping the Dread from the Data", 2012)

"The central limit theorem is often used to justify the assumption of normality when using the sample mean and the sample standard deviation. But it is inevitable that real data contain gross errors. Five to ten percent unusual values in a dataset seem to be the rule rather than the exception. The distribution of such data is no longer Normal." (A S Hedayat & Guoqin Su, "Robustness of the Simultaneous Estimators of Location and Scale From Approximating a Histogram by a Normal Density Curve", The American Statistician 66, 2012)

"The central limit theorem tells us that in repeated samples, the difference between the two means will be distributed roughly as a normal distribution." (Charles Wheelan, "Naked Statistics: Stripping the Dread from the Data", 2012)

"According to the central limit theorem, it doesn’t matter what the raw data look like, the sample variance should be proportional to the number of observations and if I have enough of them, the sample mean should be normal." (Kristin H Jarman, "The Art of Data Analysis: How to answer almost any question using basic statistics", 2013)

"For a confidence interval, the central limit theorem plays a role in the reliability of the interval because the sample mean is often approximately normal even when the underlying data is not. A prediction interval has no such protection. The shape of the interval reflects the shape of the underlying distribution. It is more important to examine carefully the normality assumption by checking the residuals […].(DeWayne R Derryberry, "Basic data analysis for time series with R", 2014)

"When data is not normal, the reason the formulas are working is usually the central limit theorem. For large sample sizes, the formulas are producing parameter estimates that are approximately normal even when the data is not itself normal. The central limit theorem does make some assumptions and one is that the mean and variance of the population exist. Outliers in the data are evidence that these assumptions may not be true. Persistent outliers in the data, ones that are not errors and cannot be otherwise explained, suggest that the usual procedures based on the central limit theorem are not applicable.(DeWayne R Derryberry, "Basic data analysis for time series with R", 2014)

"At very small time scales, the motion of a particle is more like a random walk, as it gets jostled about by discrete collisions with water molecules. But virtually any random movement on small time scales will give rise to Brownian motion on large time scales, just so long as the motion is unbiased. This is because of the Central Limit Theorem, which tells us that the aggregate of many small, independent motions will be normally distributed." (Field Cady, "The Data Science Handbook", 2017)

"The central limit conjecture states that most errors are the result of many small errors and, as such, have a normal distribution. The assumption of a normal distribution for error has many advantages and has often been made in applications of statistical models." (David S Salsburg, "Errors, Blunders, and Lies: How to Tell the Difference", 2017)

"Theoretically, the normal distribution is most famous because many distributions converge to it, if you sample from them enough times and average the results. This applies to the binomial distribution, Poisson distribution and pretty much any other distribution you’re likely to encounter (technically, any one for which the mean and standard deviation are finite)." (Field Cady, "The Data Science Handbook", 2017)

"The central limit theorem in statistics states that, given a sufficiently large sample size, the sampling distribution of the mean for a variable will approximate a normal distribution regardless of that variable’s distribution in the population." (Jim Frost)

"The old rule of trusting the Central Limit Theorem if the sample size is larger than 30 is just that–old. Bootstrap and permutation testing let us more easily do inferences for a wider variety of statistics." (Tim Hesterberg)
