Showing posts with label statistics. Show all posts
Showing posts with label statistics. Show all posts

17 May 2026

🔭Data Science: Misconception (Just the Quotes)

"Science does not begin with facts; one of its tasks is to uncover the facts by removing misconceptions." (Lancelot L Whyte, "Accent on Form", 1954)

"A common misconception is that an effect exists only if it is statistically significant and that it does not exist if it is not [statistically significant]." (Jonas Ranstam, "A common misconception about p-value and its consequences", Acta Orthopaedica Scandinavica 67, 1996)

"[...] the term statistical misconception refers to any of several widely held but incorrect notions about statistical concepts, about procedures for analyzing data and about the meaning of results produced by such analyses. To illustrate, many people think that (1) normal curves are bell shaped, (2) a correlation coeffi cient should never be used to address questios of causality, and (3) the level of signifi cance dictates the probability of a Type I error. Some people, of course, have only one or two (rather than all three) of these misconceptions, and a few individuals realize that all three of those beliefs are false."(Schuyler W Huck, "Statistical Misconceptions", 2008)

"Science would be better understood if we called theories ‘misconceptions’ from the outset, instead of only after we have discovered their successors." (David Deutsch, "Beginning of Infinity", 2011)

"A popular misconception holds that the era of Big Data means the end of a need for sampling. In fact, the proliferation of data of varying quality and relevance reinforces the need for sampling as a tool to work efficiently with a variety of data, and minimize bias. Even in a Big Data project, predictive models are typically developed and piloted with samples." (Peter C Bruce & Andrew G Bruce, "Statistics for Data Scientists: 50 Essential Concepts", 2016)

"An oft-repeated rule of thumb in any sort of statistical model fitting is 'you can't fit a model with more parameters than data points'. This idea appears to be as wide-spread as it is incorrect. On the contrary, if you construct your models carefully, you can fit models with more parameters than datapoints [...]. A model with more parameters than datapoints is known as an under-determined system, and it's a common misperception that such a model cannot be solved in any circumstance. [...] this misconception, which I like to call the 'model complexity myth' [...] is not true in general, it is true in the specific case of simple linear models, which perhaps explains why the myth is so pervasive." (Jake Vanderplas", "The Model Complexity Myth", 2015)


16 May 2026

🔭Data Science: Central Tendency (Just the Quotes)

"An average value is a single value within the range of the data that is used to represent all of the values in the series. Since an average is somewhere within the range of the data, it is sometimes called a measure of central value." (Frederick E Croxton & Dudley J Cowden, "Practical Business Statistics", 1937

"A good estimator will be unbiased and will converge more and more closely (in the long run) on the true value as the sample size increases. Such estimators are known as consistent. But consistency is not all we can ask of an estimator. In estimating the central tendency of a distribution, we are not confined to using the arithmetic mean; we might just as well use the median. Given a choice of possible estimators, all consistent in the sense just defined, we can see whether there is anything which recommends the choice of one rather than another. The thing which at once suggests itself is the sampling variance of the different estimators, since an estimator with a small sampling variance will be less likely to differ from the true value by a large amount than an estimator whose sampling variance is large." (Michael J Moroney, "Facts from Figures", 1951)

"The mode would form a very poor basis for any further calculations of an arithmetical nature, for it has deliberately excluded arithmetical precision in the interests of presenting a typical result. The arithmetic average, on the other hand, excellent as it is for numerical purposes, has sacrificed its desire to be typical in favour of numerical accuracy. In such a case it is often desirable to quote both measures of central tendency.(Michael J Moroney,Facts from Figures", 1951)

"An average is sometimes called a 'measure of central tendency' because individual values of the variable usually cluster around it. Averages are useful, however, for certain types of data in which there is little or no central tendency." (William A Spirr & Charles P Bonini,Statistical Analysis for Business Decisions" 3rd Ed., 1967)

"Central tendency is the formal expression for the notion of where data is centered, best understood by most readers as 'average'. There is no one way of measuring where data are centered, and different measures provide different insights." (Charles Livingston & Paul Voakes,Working with Numbers and Statistics: A handbook for journalists", 2005)

"Distributional shape is an important attribute of data, regardless of whether scores are analyzed descriptively or inferentially. Because the degree of skewness can be summarized by means of a single number, and because computers have no difficul ty providing such measures (or estimates) of skewness, those who prepare research reports should include a numerical index of skewness every time they provide measures of central tendency and variability." (Schuyler W Huck, "Statistical Misconceptions", 2008)

"It is best to think of the various kinds of central tendency indices as falling into three categories based on the computational procedures one uses to summarize the data. One category deals with means, with techniques put into this category if scores are added together and then divided by the number of scores that are summed. The second category involves different kinds of medians, with various techniques grouped here if the goal is to find some sort of midpoint. The third category contains different kinds of modes, with these techniques focused on the frequency with which scores appear in the data." (Schuyler W Huck, "Statistical Misconceptions", 2008)

"Various measures of central tendency have been invented because the proper notion of the 'average' score can vary from study to study. Depending on the kind of data collected, the degree of skewness in the data, and the possible existence of outliers, it may be that the most appropriate measure of central tendency is found by doing something other than (1) dividing the sum of the scores by the number of scores (to get the mean), (2) calculating the midpoint in the distribution (to get the median), or (3) determining the most frequently observed score (to get the mode)." (Schuyler W Huck, "Statistical Misconceptions", 2008)

"Statistical analysis seeks to develop concise summary figures which describe a large body of quantitative data. One of the most widely used set of summary figures is known as measures of location, which are often referred to as averages, measures of central tendency or central location. The purpose for computing an average value for a set of observations is to obtain a single value which is representative of all the items and which the mind can grasp simply and quickly. The single value is the point or location around which the individual items cluster." (Lawrence J Kaplan)

15 May 2026

🔭Data Science: Center (Just the Quotes)

"An average value is a single value within the range of the data that is used to represent all of the values in the series. Since an average is somewhere within the range of the data, it is sometimes called a measure of central value." (Frederick E Croxton & Dudley J Cowden,Practical Business Statistics", 1937)

"Some distributions [...] are symmetrical about their central value. Other distributions have marked asymmetry and are said to be skew. Skew distributions are divided into two types. If the 'tail' of the distribution reaches out into the larger values of the variate, the distribution is said to show positive skewness; if the tail extends towards the smaller values of the variate, the distribution is called negatively skew." (Michael J Moroney,Facts from Figures", 1951)

"Numerical data, which have been recorded at intervals of time, form what is generally described as a time series. [...] The purpose of analyzing time series is not always the determination of the trend by itself. Interest may be centered on the seasonal movement displayed by the series and, in such a case, the determination of the trend is merely a stage in the process of measuring and analyzing the seasonal variation. If a regular basic or under- lying seasonal movement can be clearly established, forecasting of future movements becomes rather less a matter of guesswork and more a matter of intelligent forecasting." (Alfred R Ilersic, "Statistics", 1959)

"Dispersion or spread is the degree of the scatter or variation of the variables about a central value." (Bertram C Brookes & W F L Dick,Introduction to Statistical Method", 1969)

"Equal variability is not always achieved in plots. For instance, if the theoretical distribution for a probability plot has a density that drops off gradually to zero in the tails" (as the normal density does), then the variability of the data in the tails of the probability plot is greater than in the center. Another example is provided by the histogram. Since the height of any one bar has a binomial distribution, the standard deviation of the height is approximately proportional to the square root of the expected height; hence, the variability of the longer bars is greater." (John M Chambers et al,Graphical Methods for Data Analysis", 1983)

"There are several reasons why symmetry is an important concept in data analysis. First, the most important single summary of a set of data is the location of the center, and when data meaning of 'center' is unambiguous. We can take center to mean any of the following things, since they all coincide exactly for symmetric data, and they are together for nearly symmetric data: (l) the center of symmetry. (2) the arithmetic average or center of gravity, (3) the median or 50%. Furthermore, if data a single point of highest concentration instead of several" (that is, they are unimodal), then we can add to the list (4) point of highest concentration. When data are far from symmetric, we may have trouble even agreeing on what we mean by center; in fact, the center may become an inappropriate summary for the data." (John M Chambers et al,Graphical Methods for Data Analysis", 1983)

"A connected graph is appropriate when the time series is smooth, so that perceiving individual values is not important. A vertical line graph is appropriate when it is important to see individual values, when we need to see short-term fluctuations, and when the time series has a large number of values; the use of vertical lines allows us to pack the series tightly along the horizontal axis. The vertical line graph, however, usually works best when the vertical lines emanate from a horizontal line through the center of the data and when there are no long-term trends in the data." (William S Cleveland,The Elements of Graphing Data", 1985)

"If the sample is not representative of the population because the sample is small or biased, not selected at random, or its constituents are not independent of one another, then the bootstrap will fail. […] For a given size sample, bootstrap estimates of percentiles in the tails will always be less accurate than estimates of more centrally located percentiles. Similarly, bootstrap interval estimates for the variance of a distribution will always be less accurate than estimates of central location such as the mean or median because the variance depends strongly upon extreme values in the population." (Phillip I Good & James W Hardin,Common Errors in Statistics" (and How to Avoid Them)", 2003)

"Central tendency is the formal expression for the notion of where data is centered, best understood by most readers as 'average'. There is no one way of measuring where data are centered, and different measures provide different insights." (Charles Livingston & Paul Voakes,Working with Numbers and Statistics: A handbook for journalists", 2005)

"Mean-averages can be highly misleading when the raw data do not form a symmetric pattern around a central value but instead are skewed towards one side [...], typically with a large group of standard cases but with a tail of a few either very high" (for example, income) or low" (for example, legs) values." (David Spiegelhalter,The Art of Statistics: Learning from Data", 2019)

"The elements of this cloud of uncertainty (the set of all possible errors) can be described in terms of probability. The center of the cloud is the number zero, and elements of the cloud that are close to zero are more probable than elements that are far away from that center. We can be more precise in this definition by defining the cloud of uncertainty in terms of a mathematical function, called the probability distribution." (David S Salsburg,Errors, Blunders, and Lies: How to Tell the Difference", 2017)

"Two clouds of uncertainty may have the same center, but one may be much more dispersed than the other. We need a way of looking at the scatter about the center. We need a measure of the scatter. One such measure is the variance. We take each of the possible values of error and calculate the squared difference between that value and the center of the distribution. The mean of those squared differences is the variance." (David S Salsburg,Errors, Blunders, and Lies: How to Tell the Difference", 2017)

10 May 2026

🔭Data Science: Location (Just the Quotes)

"There are several reasons why symmetry is an important concept in data analysis. First, the most important single summary of a set of data is the location of the center, and when data meaning of 'center' is unambiguous. We can take center to mean any of the following things, since they all coincide exactly for symmetric data, and they are together for nearly symmetric data: (l) the center of symmetry. (2) the arithmetic average or center of gravity, (3) the median or 50%. Furthermore, if data a single point of highest concentration instead of several (that is, they are unimodal), then we can add to the list (4) point of highest concentration. When data are far from symmetric, we may have trouble even agreeing on what we mean by center; in fact, the center may become an inappropriate summary for the data." (John M Chambers et al,Graphical Methods for Data Analysis", 1983)

"Data that are skewed toward large values occur commonly. Any set of positive measurements is a candidate. Nature just works like that. In fact, if data consisting of positive numbers range over several powers of ten, it is almost a guarantee that they will be skewed. Skewness creates many problems. There are visualization problems. A large fraction of the data are squashed into small regions of graphs, and visual assessment of the data degrades. There are characterization problems. Skewed distributions tend to be more complicated than symmetric ones; for example, there is no unique notion of location and the median and mean measure different aspects of the distribution. There are problems in carrying out probabilistic methods. The distribution of skewed data is not well approximated by the normal, so the many probabilistic methods based on an assumption of a normal distribution cannot be applied." (William S Cleveland,Visualizing Data", 1993)

"Fitting data means finding mathematical descriptions of structure in the data. An additive shift is a structural property of univariate data in which distributions differ only in location and not in spread or shape. […] The process of identifying a structure in data and then fitting the structure to produce residuals that have the same distribution lies at the heart of statistical analysis. Such homogeneous residuals can be pooled, which increases the power of the description of the variation in the data." (William S Cleveland,Visualizing Data", 1993)

"When the distributions of two or more groups of univariate data are skewed, it is common to have the spread increase monotonically with location. This behavior is monotone spread. Strictly speaking, monotone spread includes the case where the spread decreases monotonically with location, but such a decrease is much less common for raw data. Monotone spread, as with skewness, adds to the difficulty of data analysis. For example, it means that we cannot fit just location estimates to produce homogeneous residuals; we must fit spread estimates as well. Furthermore, the distributions cannot be compared by a number of standard methods of probabilistic inference that are based on an assumption of equal spreads; the standard t-test is one example. Fortunately, remedies for skewness can cure monotone spread as well." (William S Cleveland,Visualizing Data", 1993)

"Since the average is a measure of location, it is common to use averages to compare two data sets. The set with the greater average is thought to ‘exceed’ the other set. While such comparisons may be helpful, they must be used with caution. After all, for any given data set, most of the values will not be equal to the average." (Donald J Wheeler,Understanding Variation: The Key to Managing Chaos" 2nd Ed., 2000)

"Distinguish among confidence, prediction, and tolerance intervals. Confidence intervals are statements about population means or other parameters. Prediction intervals address future" (single or multiple) observations. Tolerance intervals describe the location of a specific proportion of a population, with specified confidence." (Gerald van Belle,Statistical Rules of Thumb", 2002)

"If the sample is not representative of the population because the sample is small or biased, not selected at random, or its constituents are not independent of one another, then the bootstrap will fail. […] For a given size sample, bootstrap estimates of percentiles in the tails will always be less accurate than estimates of more centrally located percentiles. Similarly, bootstrap interval estimates for the variance of a distribution will always be less accurate than estimates of central location such as the mean or median because the variance depends strongly upon extreme values in the population." (Phillip I Good & James W Hardin,Common Errors in Statistics" (and How to Avoid Them)", 2003)

"The central limit theorem is often used to justify the assumption of normality when using the sample mean and the sample standard deviation. But it is inevitable that real data contain gross errors. Five to ten percent unusual values in a dataset seem to be the rule rather than the exception. The distribution of such data is no longer Normal." (A S Hedayat & Guoqin Su,Robustness of the Simultaneous Estimators of Location and Scale From Approximating a Histogram by a Normal Density Curve", The American Statistician 66, 2012)

09 May 2026

🔭Data Science: Guessing (Just the Quotes)

"Summing up, then, it would seem as if the mind of the great discoverer must combine contradictory attributes. He must be fertile in theories and hypotheses, and yet full of facts and precise results of experience. He must entertain the feeblest analogies, and the merest guesses at truth, and yet he must hold them as worthless till they are verified in experiment. When there are any grounds of probability he must hold tenaciously to an old opinion, and yet he must be prepared at any moment to relinquish it when a clearly contradictory fact is encountered." (William S Jevons,The Principles of Science: A Treatise on Logic and Scientific Method", 1874)

"It would be an error to suppose that the great discoverer seizes at once upon the truth, or has any unerring method of divining it. In all probability the errors of the great mind exceed in number those of the less vigorous one. Fertility of imagination and abundance of guesses at truth are among the first requisites of discovery; but the erroneous guesses must be many times as numerous as those that prove well founded. The weakest analogies, the most whimsical notions, the most apparently absurd theories, may pass through the teeming brain, and no record remain of more than the hundredth part. […] The truest theories involve suppositions which are inconceivable, and no limit can really be placed to the freedom of hypotheses." (W Stanley Jevons,The Principles of Science: A Treatise on Logic and Scientific Method", 1877)

"Heuristic reasoning is reasoning not regarded as final and strict but as provisional and plausible only, whose purpose is to discover the solution of the present problem. We are often obliged to use heuristic reasoning. We shall attain complete certainty when we shall have obtained the complete solution, but before obtaining certainty we must often be satisfied with a more or less plausible guess. We may need the provisional before we attain the final. We need heuristic reasoning when we construct a strict proof as we need scaffolding when we erect a building." (George Pólya,How to Solve It", 1945)

"The scientist who discovers a theory is usually guided to his discovery by guesses; he cannot name a method by means of which he found the theory and can only say that it appeared plausible to him, that he had the right hunch or that he saw intuitively which assumption would fit the facts." (Hans Reichenbach,The Rise of Scientific Philosophy", 1951)

"Extrapolations are useful, particularly in the form of soothsaying called forecasting trends. But in looking at the figures or the charts made from them, it is necessary to remember one thing constantly: The trend to now may be a fact, but the future trend represents no more than an educated guess. Implicit in it is 'everything else being equal' and 'present trends continuing'. And somehow everything else refuses to remain equal." (Darell Huff,How to Lie with Statistics", 1954)

"In plausible reasoning the principal thing is to distinguish... a more reasonable guess from a less reasonable guess." (George Pólya,Mathematics and plausible reasoning" Vol. 1, 1954)

"We know many laws of nature and we hope and expect to discover more. Nobody can foresee the next such law that will be discovered. Nevertheless, there is a structure in laws of nature which we call the laws of invariance. This structure is so far-reaching in some cases that laws of nature were guessed on the basis of the postulate that they fit into the invariance structure." (Eugene P Wigner,The Role of Invariance Principles in Natural Philosophy", 1963)

"Another thing I must point out is that you cannot prove a vague theory wrong. If the guess that you make is poorly expressed and rather vague, and the method that you use for figuring out the consequences is a little vague - you are not sure, and you say, 'I think everything's right because it's all due to so and so, and such and such do this and that more or less, and I can sort of explain how this works' […] then you see that this theory is good, because it cannot be proved wrong! Also if the process of computing the consequences is indefinite, then with a little skill any experimental results can be made to look like the expected consequences." (Richard P Feynman,The Character of Physical Law", 1965)

"The method of guessing the equation seems to be a pretty effective way of guessing new laws. This shows again that mathematics is a deep way of expressing nature, and any attempt to express nature in philosophical principles, or in seat-of-the-pants mechanical feelings, is not an efficient way." (Richard Feynman,The Character of Physical Law", 1965)

"Every discovery, every enlargement of the understanding, begins as an imaginative preconception of what the truth might be. The imaginative preconception - a ‘hypothesis’ - arises by a process as easy or as difficult to understand as any other creative act of mind; it is a brainwave, an inspired guess, a product of a blaze of insight. It comes anyway from within and cannot be achieved by the exercise of any known calculus of discovery." (Sir Peter B Medawar,Advice to a Young Scientist", 1979)

"Scientists reach their  conclusions  for the damnedest of reasons: intuition, guesses, redirections after wild-goose chases, all combing with a dollop of rigorous observation and logical  reasoning to be sure […] This  messy and personal side of science should not be  disparaged, or covered up, by  scientists for two  major reasons. First, scientists should proudly show this  human face to  display their kinship with all other  modes of creative human thought […] Second, while biases and references often impede understanding, these  mental idiosyncrasies  may  also serve as powerful, if  quirky and personal, guides to solutions." (Stephen J Gould,Dinosaur in a  Haystack: Reflections in natural  history", 1995)

"Compound errors can begin with any of the standard sorts of bad statistics - a guess, a poor sample, an inadvertent transformation, perhaps confusion over the meaning of a complex statistic. People inevitably want to put statistics to use, to explore a number's implications. [...] The strengths and weaknesses of those original numbers should affect our confidence in the second-generation statistics." (Joel Best,Damned Lies and Statistics: Untangling Numbers from the Media, Politicians, and Activists", 2001)

"First, good statistics are based on more than guessing. [...] Second, good statistics are based on clear, reasonable definitions. Remember, every statistic has to define its subject. Those definitions ought to be clear and made public. [...] Third, good statistics are based on clear, reasonable measures. Again, every statistic involves some sort of measurement; while all measures are imperfect, not all flaws are equally serious. [...] Finally, good statistics are based on good samples." (Joel Best,Damned Lies and Statistics: Untangling Numbers from the Media, Politicians, and Activists", 2001)

"While some social problems statistics are deliberate deceptions, many - probably the great majority - of bad statistics are the result of confusion, incompetence, innumeracy, or selective, self-righteous efforts to produce numbers that reaffirm principles and interests that their advocates consider just and right. The best response to stat wars is not to try and guess who's lying or, worse, simply to assume that the people we disagree with are the ones telling lies. Rather, we need to watch for the standard causes of bad statistics - guessing, questionable definitions or methods, mutant numbers, and inappropriate comparisons." (Joel Best,Damned Lies and Statistics: Untangling Numbers from the Media, Politicians, and Activists", 2001)

"The well-known 'No Free Lunch' theorem indicates that there does not exist a pattern classification method that is inherently superior to any other, or even to random guessing without using additional information. It is the type of problem, prior information, and the amount of training samples that determine the form of classifier to apply. In fact, corresponding to different real-world problems, different classes may have different underlying data structures. A classifier should adjust the discriminant boundaries to fit the structures which are vital for classification, especially for the generalization capacity of the classifier." (Hui Xue et al,SVM: Support Vector Machines", 2009)

"Data science isn’t just about the existence of data, or making guesses about what that data might mean; it’s about testing hypotheses and making sure that the conclusions you’re drawing from the data are valid." (Mike Loukides,What Is Data Science?", 2011)

"GIGO is a famous saying coined by early computer scientists: garbage in, garbage out. At the time, people would blindly put their trust into anything a computer output indicated because the output had the illusion of precision and certainty. If a statistic is composed of a series of poorly defined measures, guesses, misunderstandings, oversimplifications, mismeasurements, or flawed estimates, the resulting conclusion will be flawed." (Daniel J Levitin,Weaponized Lies", 2017)

"In statistical inference and machine learning, we often talk about estimates and estimators. Estimates are basically our best guesses regarding some quantities of interest given" (finite) data. Estimators are computational devices or procedures that allow us to map between a given" (finite) data sample and an estimate of interest." (Aleksander Molak,Causal Inference and Discovery in Python", 2023)


03 May 2026

🔭Data Science: Tails (Just the Quotes)

"Some distributions [...] are symmetrical about their central value. Other distributions have marked asymmetry and are said to be skew. Skew distributions are divided into two types. If the 'tail' of the distribution reaches out into the larger values of the variate, the distribution is said to show positive skewness; if the tail extends towards the smaller values of the variate, the distribution is called negatively skew." (Michael J Moroney,Facts from Figures", 1951)

"Logging size transforms the original skewed distribution into a more symmetrical one by pulling in the long right tail of the distribution toward the mean. The short left tail is, in addition, stretched. The shift toward symmetrical distribution produced by the log transform is not, of course, merely for convenience. Symmetrical distributions, especially those that resemble the normal distribution, fulfill statistical assumptions that form the basis of statistical significance testing in the regression model." (Edward R Tufte,Data Analysis for Politics and Policy", 1974)

"Equal variability is not always achieved in plots. For instance, if the theoretical distribution for a probability plot has a density that drops off gradually to zero in the tails" (as the normal density does), then the variability of the data in the tails of the probability plot is greater than in the center. Another example is provided by the histogram. Since the height of any one bar has a binomial distribution, the standard deviation of the height is approximately proportional to the square root of the expected height; hence, the variability of the longer bars is greater." (John M Chambers et al,Graphical Methods for Data Analysis", 1983)

"If the sample is not representative of the population because the sample is small or biased, not selected at random, or its constituents are not independent of one another, then the bootstrap will fail. […] For a given size sample, bootstrap estimates of percentiles in the tails will always be less accurate than estimates of more centrally located percentiles. Similarly, bootstrap interval estimates for the variance of a distribution will always be less accurate than estimates of central location such as the mean or median because the variance depends strongly upon extreme values in the population." (Phillip I Good & James W Hardin,Common Errors in Statistics" (and How to Avoid Them)", 2003)

"Bell curves don't differ that much in their bells. They differ in their tails. The tails describe how frequently rare events occur. They describe whether rare events really are so rare. This leads to the saying that the devil is in the tails." (Bart Kosko,Noise", 2006)

"Readability in visualization helps people interpret data and make conclusions about what the data has to say. Embed charts in reports or surround them with text, and you can explain results in detail. However, take a visualization out of a report or disconnect it from text that provides context" (as is common when people share graphics online), and the data might lose its meaning; or worse, others might misinterpret what you tried to show." (Nathan Yau,Data Points: Visualization That Means Something", 2013)

"A very different - and very incorrect - argument is that successes must be balanced by failures (and failures by successes) so that things average out. Every coin flip that lands heads makes tails more likely. Every red at roulette makes black more likely. […] These beliefs are all incorrect. Good luck will certainly not continue indefinitely, but do not assume that good luck makes bad luck more likely, or vice versa." (Gary Smith,Standard Deviations", 2014)

"The more complex the system, the more variable (risky) the outcomes. The profound implications of this essential feature of reality still elude us in all the practical disciplines. Sometimes variance averages out, but more often fat-tail events beget more fat-tail events because of interdependencies. If there are multiple projects running, outlier (fat-tail) events may also be positively correlated - one IT project falling behind will stretch resources and increase the likelihood that others will be compromised." (Paul Gibbons,The Science of Successful Organizational Change",  2015)

"Many statistical procedures perform more effectively on data that are normally distributed, or at least are symmetric and not excessively kurtotic" (fat-tailed), and where the mean and variance are approximately constant. Observed time series frequently require some form of transformation before they exhibit these distributional properties, for in their 'raw' form they are often asymmetric." (Terence C Mills,Applied Time Series Analysis: A practical guide to modeling and forecasting", 2019)

"Mean-averages can be highly misleading when the raw data do not form a symmetric pattern around a central value but instead are skewed towards one side [...], typically with a large group of standard cases but with a tail of a few either very high" (for example, income) or low" (for example, legs) values." (David Spiegelhalter,The Art of Statistics: Learning from Data", 2019)

"[…] it is not merely that events in the tails of the distributions matter, happen, play a large role, etc. The point is that these events play the major role and their probabilities are not" (easily) computable, not reliable for any effective use. The implication is that Black Swans do not necessarily come from fat tails; the problem can result from an incomplete assessment of tail events." (Nassim N Taleb,Statistical Consequences of Fat Tails: Real World Preasymptotics, Epistemology, and Applications" 2nd Ed., 2022)

"[…] whenever people make decisions after being supplied with the standard deviation number, they act as if it were the expected mean deviation." (Nassim N Taleb,Statistical Consequences of Fat Tails: Real World Preasymptotics, Epistemology, and Applications" 2nd Ed., 2022)

"Behavioral finance so far makes conclusions from statics not dynamics, hence misses the picture. It applies trade-offs out of context and develops the consensus that people irrationally overestimate tail risk" (hence need to be 'nudged' into taking more of these exposures). But the catastrophic event is an absorbing barrier. No risky exposure can be analyzed in isolation: risks accumulate. If we ride a motorcycle, smoke, fly our own propeller plane, and join the mafia, these risks add up to a near-certain premature death. Tail risks are not a renewable resource." (Nassim N Taleb,Statistical Consequences of Fat Tails: Real World Preasymptotics, Epistemology, and Applications" 2nd Ed., 2022)

"But note that any heavy tailed process, even a power law, can be described in sample" (that is finite number of observations necessarily discretized) by a simple Gaussian process with changing variance, a regime switching process, or a combination of Gaussian plus a series of variable jumps" (though not one where jumps are of equal size […])." (Nassim N Taleb,Statistical Consequences of Fat Tails: Real World Preasymptotics, Epistemology, and Applications" 2nd Ed., 2022)

"Once we know something is fat-tailed, we can use heuristics to see how an exposure there reacts to random events: how much is a given unit harmed by them. It is vastly more effective to focus on being insulated from the harm of random events than try to figure them out in the required details" (as we saw the inferential errors under thick tails are huge). So it is more solid, much wiser, more ethical, and more effective to focus on detection heuristics and policies rather than fabricate statistical properties." (Nassim N Taleb,Statistical Consequences of Fat Tails: Real World Preasymptotics, Epistemology, and Applications" 2nd Ed., 2022)

"No one sees further into a generalization than his own knowledge of detail extends." (William James)

"Remember that a p-value merely indicates the probability of a particular set of data being generated by the null model–it has little to say about the size of a deviation from that model" (especially in the tails of the distribution, where large changes in effect size cause only small changes in p-values)." (Clay Helberg)


02 May 2026

🔭Data Science: Skewness (Just the Quotes)

"Some distributions [...] are symmetrical about their central value. Other distributions have marked asymmetry and are said to be skew. Skew distributions are divided into two types. If the 'tail' of the distribution reaches out into the larger values of the variate, the distribution is said to show positive skewness; if the tail extends towards the smaller values of the variate, the distribution is called negatively skew." (Michael J Moroney, "Facts from Figures", 1951)

"Logging size transforms the original skewed distribution into a more symmetrical one by pulling in the long right tail of the distribution toward the mean. The short left tail is, in addition, stretched. The shift toward symmetrical distribution produced by the log transform is not, of course, merely for convenience. Symmetrical distributions, especially those that resemble the normal distribution, fulfill statistical assumptions that form the basis of statistical significance testing in the regression model." (Edward R Tufte, "Data Analysis for Politics and Policy", 1974)

"Logging skewed variables also helps to reveal the patterns in the data. […] the rescaling of the variables by taking logarithms reduces the nonlinearity in the relationship and removes much of the clutter resulting from the skewed distributions on both variables; in short, the transformation helps clarify the relationship between the two variables. It also […] leads to a theoretically meaningful regression coefficient." (Edward R Tufte, "Data Analysis for Politics and Policy", 1974)

"The logarithmic transformation serves several purposes: (1) The resulting regression coefficients sometimes have a more useful theoretical interpretation compared to a regression based on unlogged variables. (2) Badly skewed distributions - in which many of the observations are clustered together combined with a few outlying values on the scale of measurement - are transformed by taking the logarithm of the measurements so that the clustered values are spread out and the large values pulled in more toward the middle of the distribution. (3) Some of the assumptions underlying the regression model and the associated significance tests are better met when the logarithm of the measured variables is taken." (Edward R Tufte, "Data Analysis for Politics and Policy", 1974)

"The logarithm is an extremely powerful and useful tool for graphical data presentation. One reason is that logarithms turn ratios into differences, and for many sets of data, it is natural to think in terms of ratios. […] Another reason for the power of logarithms is resolution. Data that are amounts or counts are often very skewed to the right; on graphs of such data, there are a few large values that take up most of the scale and the majority of the points are squashed into a small region of the scale with no resolution." (William S. Cleveland, "Graphical Methods for Data Presentation: Full Scale Breaks, Dot Charts, and Multibased Logging", The American Statistician Vol. 38 (4) 1984)

"It is common for positive data to be skewed to the right: some values bunch together at the low end of the scale and others trail off to the high end with increasing gaps between the values as they get higher. Such data can cause severe resolution problems on graphs, and the common remedy is to take logarithms. Indeed, it is the frequent success of this remedy that partly accounts for the large use of logarithms in graphical data display." (William S Cleveland, "The Elements of Graphing Data", 1985)

"If a distribution were perfectly symmetrical, all symmetry-plot points would be on the diagonal line. Off-line points indicate asymmetry. Points fall above the line when distance above the median is greater than corresponding distance below the median. A consistent run of above-the-line points indicates positive skew; a run of below-the-line points indicates negative skew." (Lawrence C Hamilton, "Regression with Graphics: A second course in applied statistics", 1991)

"Skewness is a measure of symmetry. For example, it's zero for the bell-shaped normal curve, which is perfectly symmetric about its mean. Kurtosis is a measure of the peakedness, or fat-tailedness, of a distribution. Thus, it measures the likelihood of extreme values." (John L Casti, "Reality Rules: Picturing the world in mathematics", 1992)

"Data that are skewed toward large values occur commonly. Any set of positive measurements is a candidate. Nature just works like that. In fact, if data consisting of positive numbers range over several powers of ten, it is almost a guarantee that they will be skewed. Skewness creates many problems. There are visualization problems. A large fraction of the data are squashed into small regions of graphs, and visual assessment of the data degrades. There are characterization problems. Skewed distributions tend to be more complicated than symmetric ones; for example, there is no unique notion of location and the median and mean measure different aspects of the distribution. There are problems in carrying out probabilistic methods. The distribution of skewed data is not well approximated by the normal, so the many probabilistic methods based on an assumption of a normal distribution cannot be applied." (William S Cleveland, "Visualizing Data", 1993)

"The logarithm is one of many transformations that we can apply to univariate measurements. The square root is another. Transformation is a critical tool for visualization or for any other mode of data analysis because it can substantially simplify the structure of a set of data. For example, transformation can remove skewness toward large values, and it can remove monotone increasing spread. And often, it is the logarithm that achieves this removal." (William S Cleveland, "Visualizing Data", 1993)

"When the distributions of two or more groups of univariate data are skewed, it is common to have the spread increase monotonically with location. This behavior is monotone spread. Strictly speaking, monotone spread includes the case where the spread decreases monotonically with location, but such a decrease is much less common for raw data. Monotone spread, as with skewness, adds to the difficulty of data analysis. For example, it means that we cannot fit just location estimates to produce homogeneous residuals; we must fit spread estimates as well. Furthermore, the distributions cannot be compared by a number of standard methods of probabilistic inference that are based on an assumption of equal spreads; the standard t-test is one example. Fortunately, remedies for skewness can cure monotone spread as well." (William S Cleveland, "Visualizing Data", 1993)

"Use a logarithmic scale when it is important to understand percent change or multiplicative factors. […] Showing data on a logarithmic scale can cure skewness toward large values." (Naomi B Robbins, "Creating More effective Graphs", 2005)

"Distributional shape is an important attribute of data, regardless of whether scores are analyzed descriptively or inferentially. Because the degree of skewness can be summarized by means of a single number, and because computers have no difficulty providing such measures (or estimates) of skewness, those who prepare research reports should include a numerical index of skewness every time they provide measures of central tendency and variability." (Schuyler W Huck, "Statistical Misconceptions", 2008)

"Given the important role that correlation plays in structural equation modeling, we need to understand the factors that affect establishing relationships among multivariable data points. The key factors are the level of measurement, restriction of range in data values (variability, skewness, kurtosis), missing data, nonlinearity, outliers, correction for attenuation, and issues related to sampling variation, confidence intervals, effect size, significance, sample size, and power." (Randall E Schumacker & Richard G Lomax, "A Beginner’s Guide to Structural Equation Modeling" 3rd Ed., 2010)

"[The normality] assumption is the least important one for the reliability of the statistical procedures under discussion. Violations of the normality assumption can be divided into two general forms: Distributions that have heavier tails than the normal and distributions that are skewed rather than symmetric. If data is skewed, the formulas we are discussing are still valid as long as the sample size is sufficiently large. Although the guidance about 'how skewed' and 'how large a sample' can be quite vague, since the greater the skew, the larger the required sample size. For the data commonly used in time series and for the sample sizes (which are generally quite large) used, skew is not a problem. On the other hand, heavy tails can be very problematic." (DeWayne R Derryberry, "Basic Data Analysis for Time Series with R" 1st Ed, 2014)

"In statistical theory, location and variability are referred to as the first and second moments of a distribution. The third and fourth moments are called skewness and kurtosis. Skewness refers to whether the data is skewed to larger or smaller values and kurtosis indicates the propensity of the data to have extreme values. Generally, metrics are not used to measure skewness and kurtosis; instead, these are discovered through visual displays [...]" (Peter C Bruce & Andrew G Bruce, "Statistics for Data Scientists: 50 Essential Concepts", 2016)

"A histogram represents the frequency distribution of the data. Histograms are similar to bar charts but group numbers into ranges. Also, a histogram lets you show the frequency distribution of continuous data. This helps in analyzing the distribution (for example, normal or Gaussian), any outliers present in the data, and skewness." (Umesh R Hodeghatta & Umesha Nayak, "Business Analytics Using R: A Practical Approach", 2017)

"New information is constantly flowing in, and your brain is constantly integrating it into this statistical distribution that creates your next perception (so in this sense 'reality' is just the product of your brain’s ever-evolving database of consequence). As such, your perception is subject to a statistical phenomenon known in probability theory as kurtosis. Kurtosis in essence means that things tend to become increasingly steep in their distribution [...] that is, skewed in one direction. This applies to ways of seeing everything from current events to ourselves as we lean 'skewedly' toward one interpretation, positive or negative. Things that are highly kurtotic, or skewed, are hard to shift away from. This is another way of saying that seeing differently isn’t just conceptually difficult - it’s statistically difficult." (Beau Lotto, "Deviate: The Science of Seeing Differently", 2017)

"Mean-averages can be highly misleading when the raw data do not form a symmetric pattern around a central value but instead are skewed towards one side [...], typically with a large group of standard cases but with a tail of a few either very high (for example, income) or low (for example, legs) values." (David Spiegelhalter, "The Art of Statistics: Learning from Data", 2019)

"With skewed data, quantiles will reflect the skew, while adding standard deviations assumes symmetry in the distribution and can be misleading." (Robert Grant, "Data Visualization: Charts, Maps and Interactive Graphics", 2019)

"Adjusting scale is an important practice in data visualization. While the log transform is versatile, it doesn’t handle all situations where skew or curvature occurs. For example, at times the values are all roughly the same order of magnitude and the log transformation has little impact. Another transformation to consider is the square root transformation, which is often useful for count data." (Sam Lau et al, "Learning Data Science: Data Wrangling, Exploration, Visualization, and Modeling with Python", 2023)

18 April 2026

🤖〽️Prompt Engineering: Copilot Unabridged (Part 166: How AI Recognizes Subtle Relationships in Data)

Prompt Engineering Series
Prompt Engineering Series


Prompt: "write a post of 600 words on how AI can recognize subtle relationships in data"

Introduction

Artificial Intelligence (AI) has become a powerful tool for uncovering patterns in data - patterns that are too complex, too hidden, or too high‑dimensional for humans to detect on their own. While humans excel at intuitive reasoning and contextual understanding, AI excels at scale. It can analyze millions of data points simultaneously, identify faint correlations, and detect relationships that would otherwise remain invisible. Understanding how AI recognizes subtle relationships in data reveals why these systems are so transformative - and why they must be used thoughtfully.

1. AI Learns Patterns Through High‑Dimensional Representations

At the heart of modern AI is the ability to represent information in high‑dimensional space. Instead of viewing data as simple numbers or labels, AI models encode concepts as vectors - mathematical points with hundreds or thousands of dimensions.

This allows the model to capture:

  • Nuanced similarities between concepts
  • Gradients of meaning rather than binary categories
  • Relationships that span multiple variables at once

For example, a language model can understand that 'king' and 'queen' are related not because it knows gender or royalty, but because their vector representations share structural patterns learned from data.

2. AI Detects Patterns Across Massive Datasets

Humans can only process a limited amount of information at once. AI, however, can analyze enormous datasets containing millions of examples. This scale allows it to detect:

  • Weak correlations that appear only across large samples
  • Rare patterns that humans might overlook
  • Multi‑step relationships that span many variables

In fields like medicine or finance, these subtle patterns can reveal early warning signs, hidden risks, or emerging trends.

3. AI Identifies Non‑Linear Relationships

Traditional statistical methods often assume linear relationships - simple, straight‑line connections between variables. AI models, especially neural networks, can capture far more complex patterns:

  • Curved relationships
  • Interactions between multiple variables
  • Conditional patterns that appear only in specific contexts

This ability to model non‑linear relationships is one reason AI can outperform traditional analytics in tasks like image recognition, fraud detection, and natural language processing.

4. AI Learns Through Layered Abstraction

Deep learning models recognize subtle relationships by building layers of abstraction. Each layer transforms the data into a more meaningful representation:

  • Early layers detect simple features
  • Middle layers combine features into patterns
  • Higher layers capture concepts and relationships

This hierarchical structure allows AI to detect relationships that are not obvious at the surface level. For example, in image analysis, early layers detect edges, while deeper layers detect shapes, objects, and even emotional expressions.

5. AI Uses Attention Mechanisms to Focus on What Matters

Modern AI models often include attention mechanisms - components that help the model identify which parts of the data are most relevant to a given task. Attention allows AI to:

  • Highlight subtle but important signals
  • Ignore irrelevant noise
  • Understand relationships across long sequences

In language models, attention helps the system track relationships between words that are far apart in a sentence, enabling it to understand nuance, tone, and context.

6. AI Learns From Patterns It Was Not Explicitly Taught

One of the most surprising aspects of AI is its ability to discover relationships that were never explicitly labeled. Through unsupervised or self‑supervised learning, AI can:

  • Cluster similar items
  • Infer categories
  • Detect anomalies
  • Reveal hidden structure in data

This makes AI a powerful tool for exploration and discovery.

Closing Statement

AI recognizes subtle relationships in data by combining high‑dimensional representations, massive computational scale, non‑linear modeling, layered abstraction, and attention mechanisms. These capabilities allow AI to uncover patterns that humans cannot easily see, making it invaluable in fields ranging from science and medicine to business and communication. Yet with this power comes responsibility: subtle patterns can illuminate truth, but they can also reflect bias or noise. Understanding how AI detects these relationships helps us use it wisely - leveraging its strengths while ensuring its insights remain grounded, fair, and meaningful.

Disclaimer: The whole text was generated by Copilot (under Windows 11) at the first attempt. This is just an experiment to evaluate feature's ability to answer standard general questions, independently on whether they are correctly or incorrectly posed. Moreover, the answers may reflect hallucinations and other types of inconsistent or incorrect reasoning.

Previous Post <<||>> Next Post

16 March 2026

🤖〽️Prompt Engineering: Copilot Unabridged (Part 139: How AI Understands the World)

Prompt Engineering Series
Prompt Engineering Series

Prompt: "write a post of 600 words on how AI understands the world"

Introduction

Artificial Intelligence (AI) has become a constant presence in modern life - summarizing information, generating text, recognizing images, translating languages, and assisting with decisions. Because AI often produces outputs that feel coherent, insightful, or even creative, it’s natural to wonder: How does AI understand the world? The answer is both fascinating and counterintuitive. AI does not understand in the human sense. It does not perceive, feel, or interpret reality through lived experience. Instead, it constructs a statistical map of patterns found in data. Exploring how this works helps us appreciate both the power and the limits of today’s AI systems.

AI’s 'Understanding' Begins With Patterns, Not Perception

Humans understand the world through sensory experience, memory, emotion, and social interaction. AI, by contrast, begins with data - text, images, audio, or other digital inputs. It does not see a tree, hear a voice, or feel the warmth of sunlight. It processes symbols and patterns.

When an AI model is trained, it analyzes vast amounts of data and learns statistical relationships:

  • Which words tend to appear together
  • What shapes correspond to certain labels
  • How sequences unfold over time

This pattern‑learning process allows AI to generate predictions. For example, when you ask a question, the model predicts the most likely next word, then the next, and so on. The result can feel like understanding, but it is fundamentally pattern completion.

AI Builds Internal Representations - But Not Meaning

Inside an AI model, information is encoded in mathematical structures called representations. These representations capture relationships between concepts: 'cat' is closer to 'animal' than to 'car', for example. This internal structure allows AI to generalize, classify, and generate coherent responses.

But these representations are not grounded in experience. AI does not know what a cat is - it only knows how the word 'cat' behaves in data. Meaning, in the human sense, comes from consciousness, embodiment, and emotion. AI has none of these. Its “understanding” is functional, not experiential.

Context Without Comprehension

One of the most impressive aspects of modern AI is its ability to use context. It can adjust tone, follow instructions, and maintain coherence across long conversations. This gives the impression of comprehension. 

But context for AI is statistical, not conceptual. It identifies patterns in how humans use language in similar situations. It does not grasp intention, nuance, or subtext the way humans do. When AI responds sensitively to a personal story or thoughtfully to a complex question, it is drawing on patterns - not empathy or insight.

AI Understands the World Through Human Data

AI’s worldview is entirely shaped by the data it is trained on. This means:

  • It reflects human knowledge
  • It inherits human biases
  • It mirrors human language
  • It amplifies human patterns

AI does not discover the world; it absorbs the world as humans have recorded it. This makes AI powerful as a tool for synthesis and reasoning, but it also means its understanding is limited by the scope and quality of its data.

The Limits of AI’s Understanding

AI cannot:

  • Form intentions
  • Experience emotion
  • Understand moral or social meaning
  • Interpret ambiguity the way humans do
  • Ground concepts in physical experience

These limitations matter. They remind us that AI is a tooan extraordinary one - but not a mind.

Closing Statement

AI understands the world not through perception or consciousness, but through patterns extracted from human‑generated data. Its 'understanding' is statistical, not experiential; functional, not emotional. Recognizing this helps us use AI wisely - leveraging its strengths in analysis and generation while remembering that meaning, judgment, and lived experience remain uniquely human. As AI continues to evolve, the most powerful outcomes will come from collaboration: human understanding enriched by machine‑driven insight

Disclaimer: The whole text was generated by Copilot (under Windows 11) at the first attempt. This is just an experiment to evaluate feature's ability to answer standard general questions, independently on whether they are correctly or incorrectly posed. Moreover, the answers may reflect hallucinations and other types of inconsistent or incorrect reasoning.

Previous Post <<||>> Next Post

24 April 2025

🧭Business Intelligence: Perspectives (Part 30: The Data Science Connection)

Business Intelligence Series
Business Intelligence Series

Data Science is a collection of quantitative and qualitative methods, respectively techniques, algorithms, principles, processes and technologies used to analyze, and process amounts of raw and aggregated data to extract information or knowledge it contains. Its theoretical basis is rooted within mathematics, mainly statistics, computer science and domain expertise, though it can include further aspects related to communication, management, sociology, ecology, cybernetics, and probably many other fields, as there’s enough space for experimentation and translation of knowledge from one field to another.  

The aim of Data Science is to extract valuable insights from data to support decision-making, problem-solving, drive innovation and probably it can achieve more in time. Reading in between the lines, Data Science sounds like a superhero that can solve all the problems existing out there, which frankly is too beautiful to be true! In theory everything is possible, when in practice there are many hard limitations! Given any amount of data, the knowledge that can be obtained from it can be limited by many factors - the degree to which the data, processes and models built reflect reality, and there can be many levels of approximation, respectively the degree to which such data can be collected consistently. 

Moreover, even if the theoretical basis seems sound, the data, information or knowledge which is not available can be the important missing link in making any sensible progress toward the goals set in Data Science projects. In some cases, one might be aware of what's missing, though for the data scientist not having the required domain knowledge, this can be a hard limit! This gap can be probably bridged with sensemaking, exploration and experimentation approaches, especially by applying models from other domains, though there are no guarantees ahead!

AI can help in this direction by utilizing its capacity to explore fast ideas or models. However, it's questionable how much the models built with AI can be further used if one can't build mechanistical mental models of the processes reflected in the data. It's like devising an algorithm for winning at lottery small amounts, though investing more money in the algorithm doesn't automatically imply greater wins. Even if occasionally the performance is improved, it's questionable how much it can be leveraged for each utilization. Statistics has its utility when one studies data in aggregation and can predict average behavior. It can’t be used to predict the occurrence of events with a high precision. Think how hard the prediction of earthquakes or extreme weather is by just looking at a pile of data reflecting what’s happening only in a certain zone!

In theory, the more data one has from different geographical areas or organizations, the more robust the models can become. However, no two geographies, respectively no two organizations are alike: business models, the people, the events and other aspects make global models less applicable to local context. Frankly, one has more chances of progress if a model is obtained by having a local scope and then attempting to leverage the respective model for a broader scope. Even then, there can be differences between the behavior or phenomena at micro, respectively at macro level (see the law of physics). 

This doesn’t mean that Data Science or AI related knowledge is useless. The knowledge accumulated by applying various techniques, models and programming languages in problem-solving can be more valuable than the results obtained! Experimentation is a must for organizations to innovate, to extend their knowledge base. It’s also questionable how much of the respective knowledge can be retained and put to good use. In the end, each organization must determine this by itself!

22 March 2025

💠🛠️🗒️SQL Server: Statistics Issues [Notes]

Disclaimer: This is work in progress based on notes gathered over the years, intended to consolidate information from the various sources. The content needs yet to be reviewed against the current documentation.

Last updated: 22-Mar-2024

[SQL Server 2005] Statistics

  • {def} objects that contain statistical information about the distribution of values in one or more columns of a table or indexed view [2] [see notes]
  • {issue} inaccurate statistics (aka bad statistics)
    • the primary source for cardinality estimation errors [6]
    • guessed selectivity can lead to a less-than-optimal query plan
    • {cause} missing statistics
      • if an error or other event prevents the query optimizer from creating statistics, the query optimizer creates the query plan without using statistics [22] 
        • the query optimizer marks the statistics as missing and attempts to regenerate the statistics the next time the query is executed [22]
      • [SSMS] indicated as warnings when the execution plan of a query is graphically displayed [22]
    • [SQL Server Profiler] monitoring the Missing Column Statistics event class indicates when statistics are missing [22]
    • troubleshooting
      • {step} verify that AUTO_CREATE_STATISTICS and AUTO_UPDATE_STATISTICS are on [22]
      • verify that the database is not read-only [22]
        • the query optimizer cannot save statistics in read-only databases [22]
      • create the missing statistics by using the CREATE STATISTICS statement [22]
    • {cause} use of local variables
    • {cause} non-constant-foldable expressions
    • {cause} complex expressions
    • {cause} ascending key columns
      • can cause inaccurate statistics in tables with frequent INSERTs because new values all lie outside the histogram [3]
    • it is not necessary for statistics to be highly accurate to be effective [18]
  • {issue} stale statistics (aka outdated statistics)
    • statistics are outdated based on the number of modifications of the statistics (and index) columns [6]
    • SQL Server checks to see if the statistics are outdated when it looks up a plan from the cache [6]
      •  it recompiles the query if they are [6]
        • ⇐ triggers a statistics update [6]
    •  [temporary tables] can increase the number of recompilations triggered by outdated statistics [6]
    • [query hint] KEEPFIXED PLAN
      • prevents query recompilation in cases of outdated statistics [6]
        • queries are recompiled only when the schemas of the underlying tables are changed or the recompilation is forced
        • e.g. when a stored procedure is called using the WITH RECOMPILE clause [6]
    • can be identified based on
      • Stats_Date(object_id, stats_id) returns the last update date
      • Rowmodctr from sys.sysindexes
    • has negative impact on performance 
      • almost always outweighs the performance benefits of avoiding automatic statistics updates/creation [2]
        • ⇐ applies also to missing statistics [2]
        • auto update/create stats should be disabled only when thorough testing has shown that there's no other way to achieve the performance or scalability you require [2]
        • without correct and up-to-date statistical information, can be challenging to determine the best execution plan for a particular query [8]
  • {issue} missing statistics
    • {symptom} excessive number of logical reads 
      • often indicates suboptimal execution plans 
        • due to missing indexes and/or suboptimal join strategies selected 
          • because of incorrect cardinality estimation [6]
      • shouldn't be used as the only criteria during optimization 
        • further factors should be considered
          • resource usage, parallelism, related operators in the execution plan [6]
  • {issue} unused statistics 
    • table can end up having man statistics that serve no purpose
      • it makes sense to review and clean up the statistics as part of general maintenance
        • based on the thresholds for the automatic update
  • {issue} skewed statistics 
    • may happen when the sample rate doesn’t reflect the characteristics of values’ distribution
      • {solution} using a higher sample rate 
        • update statistics manually with fullscan
  • {issue} statistics accumulate over time 
    • statistics objects are persistent and will remain in the database forever until explicitly dropped
      • this becomes expensive over time 
    • {solution} drop all auto stats 
      • will place some temporary stress on the system [11]
      • all queries that adheres to a certain pattern that requires a statistics to be created, will wait [11]
      • typically in a matter of minutes most of the missing statistics will be already back in place and the temporary stress will be over [11]
      • {tip} the cleanup can be performed at off-peak hours and ‘warm-up’ the database by capturing and replaying a trace of the most common read only queries [11]
        • create many of the required statistics objects without impacting your ‘real’ workload [11]
      • {downside} statements may deadlock on schema locks with concurrent sessions [11]
  • {issue} overlapping statistics 
    • column statistics created by SQL Server based on queries that get executed can overlap with statistics for indexes created by users on the same column [2] 

Previous Post <<||>> Next Post

References:
[2] Ken Henderson (2003) Guru's Guide to SQL Server Architecture and Internals
[3] Kendal Van Dyke (2010) More On Overlapping Statistics [link
[6] Dmitri Korotkevitch (2016) Pro SQL Server Internals 2nd Ed.
[18] Joe Chang’s Blog (2012) Decoding STATS_STREAM, by Joe Chang
[22] MSDN (2016) Statistics [link]

Resources:

Acronyms:
SSMS - SQL Server Management Studio
  

19 March 2025

💠🛠️🗒️SQL Server: Statistics [Notes]

Disclaimer: This is work in progress based on notes gathered over the years, intended to consolidate information from the various sources. The content needs yet to be reviewed against the current documentation.

Last updated: 19-Mar-2024

[SQL Server 2005] Statistics

  • {def} objects that contain statistical information about the distribution of values in one or more columns of a table or indexed view [2]
    • used by query optimizer to estimate the cardinality (aka number of rows)  in the query result
    • created on any data type that supports comparison operations [4]
    • metadata maintained about index keys and, optionally, nonindexed column values
      • implemented via statistics objects 
      • can be created over most types 
        • generally data types that support comparisons (such as >, =, and so on) support the creation of statistics [20]
      • the DBA is responsible for keeping the statistics objects up to date in the system [20]
      • {restriction} the combined width of all columns constituting a single statistics set must not be greater than 900 bytes [21]
    • statistics collected
      • time of the last statistics collection (inside STATBLOB) [21]
      • number of rows in the table or index (rows column in SYSINDEXES) [21]
      • number of pages occupied by the table or index (dpages column in SYSINDEXES) [21]
      • number of rows used to produce the histogram and density information (inside STATBLOB, described below) [21]
      • average key length (inside STATBLOB) [21]
      • histogram 
        • measures the frequency of occurrence for each distinct value in a data set [31]
        • computed by the query optimizer on the column values in the first key column of the statistics object, selecting the column values by statistically sampling the rows or by performing a full scan of all rows in the table or view [31]
          • when created from a sampled set of rows, the stored totals for number of rows and number of distinct values are estimates and do not need to be whole integers [31]
          • it sorts the column values, computes the number of values that match each distinct column value and then aggregates the column values into a maximum of 200 contiguous histogram steps [31]
            • each step includes a range of column values followed by an upper bound column value [31]
            • the range includes all possible column values between boundary values, excluding the boundary values themselves [21]
            • the lowest of the sorted column values is the upper boundary value for the first histogram step [31]
        • defines the histogram steps according to their statistical significance [31] 
          • uses a maximum difference algorithm to minimize the number of steps in the histogram while maximizing the difference between the boundary values
            • the maximum number of steps is 200 [31]
            • the number of histogram steps can be fewer than the number of distinct values, even for columns with fewer than 200 boundary points [31]
    • stored only for the first column of a composite index
      • the most selective columns in a multicolumn index should be selected first [2]
        • the histogram will be more useful to the optimizer [2]
      • single column histogram [21]
      • splitting up composite indexes into multiple single-column indexes is sometimes advisable [2]
    • used to estimate the selectivity of nonequality selection predicates, joins, and other operators  [3]
    • contains a sampling of up to 200 values for the index's first key column
      • the values in a given column are sorted in ordered sequence
        • divided into up to 199 intervals 
          • so that the most statistically significant information is captured
          • in general, of nonequal size
  • {type} table-level statistics
    • statistics maintained on each table
    • include: 
      • number of rows in the table [8]
      • number of pages used by the table [8]
      • number of modifications made to the keys of the table since the last update to the statistics [8]
  • {type} index statistics 
    • created with the indexes
    • updated with fullscan when indexes are rebuild the index [13]
      • {exception} [SQL Server 2012] for partitioned indexes when the number of partitions >1000 it uses default sampling [13]
  • {type} column statistics (aka non-index statistics)
    • statistics on non-indexed columns
    • determine the likelihood that a given value might occur in a column [2]
      • gives the optimizer valuable information in determining how best to service a query [2]
        • allows the optimizer to estimate the number of rows that will qualify from a given table involved in a join [2]
          • allows to more accurately select join order [2]
      • if automatic creation or updating of statistics is disabled, the Query Optimizer returns a warning in the showplan output when compiling a query where it thinks it needs this information [20]
    • used by optimizer to provide histogram-type information for the other columns in a multicolumn index [2]
      • the more information the optimizer has about data, the better [2]
      • queries asking for data that is outside the bounds of the histogram will estimate 1 row and typically end up with a bad, serial plan [15]
    • automatic created 
      • when nonindexed column is queried while AUTO_CREATE_STATISTICS is enabled for the database
    • aren’t updated when an index is reorganized or rebuild [10]
      • {exception} [SQL Server 2005] indexes rebuilt with DBCC dbreindex  [10]
        • updates index statistics as well column statistics [10]
        • this feature will be removed in a future version of Microsoft SQL Server
  • {type} [SQL Server 2008] filtered statistics
    • useful when dealing with partitioned data or data that is wildly skewed due to wide ranging data or lots of nulls
    • scope defined with the help of a statistic filter
      • a condition that is evaluated to determine whether a row must be part of the filtered statistics
      • the predicate appears in the WHERE clause of the CREATE STATISTICS or CREATE INDEX statements (in the case when statistics are automatically created as a side effect of creating an index)
  • {type} custom statistics
    • {limitation}[SQL Server] not supported 
  • {type} temporary statistics
    • when statistics on a read-only database or read-only snapshot are missing or stale, the Database Engine creates and maintains temporary statistics in tempdb  [22]
      • a SQL Server restart causes all temporary statistics to disappear [22]
      • the statistics name is appended with the suffix _readonly_database_statistic 
        • to differentiate the temporary statistics from the permanent statistics [22]
        • reserved for statistics generated by SQL Server [2] 
        • scripts for the temporary statistics can be created and reproduced on a read-write database [22]
          • when scripted, SSMS changes the suffix of the statistics name from _readonly_database_statistic to _readonly_database_statistic_scripted [22]
    • {restriction} can be created and updated only by SQL Server [22]
      • can be deleted and monitored [22]
  • {operation} create statistics
    • auto-create (aka auto-create statistics)
      • on by default 
        • samples the rows by default 
          • {exception} when statistics are created as a by-product of index creation, then a full scan is used [3]
      • {exception} statistics may not be created
        • for tables where the cost of the plan execution would be lower than the statistics creation itself [21]
        • when the server is too busy [21]
          • e.g. too many outstanding compilations in progress [21]
    • manually 
      • via CREATE STATISTICS 
        • only generates the statistics for a given column or combination of columns [21]
        • {recommendation} keep the AUTO_CREATE_STATISTICS option on so that the query optimizer continues to routinely create single-column statistics for query predicate columns [22]
  • {operation} update statistics
    • covered by the same SQL Server Profiler event as statistics creation
    • ensures that queries compile with up-to-date statistics [22]
    • {recommendation} statistics should not be updated too frequently [22]
      • because there is a performance tradeoff between improving query plans and the time it takes to recompile queries [22]
        • tradeoffs depend from application to application [22]
    • triggered by the executions of either commands:
      • CREATE INDEX ... WITH DROP EXISTING
        • scans the whole data set
          • the index statistics are initially created without sampling
        • allows to set the sample size in the WITH clause either by specifying
          • FULLSCAN 
          • percentage of data to scan
          • interpreted as an approximation 
      • sp_createstats stored procedure
      • sp_updatestats stored procedure
      • DBCC DBREINDEX
        • rebuilds one or more indexes for a table in the specified database [21]
        • DBCC INDEXDEFRAG or ALTER INDEX REORGANIZE operations don’t update the statistics [22]
    • undocumented options
      • meant for testing and debugging purposes
        •  should never be used on production systems [29]
      • STATS_STREAM = stats_stream 
      • ROWCOUNT = numeric_constant 
        • incl PAGECOUNT, alter the internal metadata of the specified table or index by overriding the counters containing the row and page counts of the object [29]
          • read by the Query Optimizer when processing queries that access the table and/or index in question [29]
            •  cheat the Optimizer into thinking that a table or index is extremely large [29]
              • the content of the actual tables and indexes will remain intact [29]
      • PAGECOUNT = numeric contant 
    • auto-update (aka auto statistics update)
      • on by default 
        • samples the rows by default 
          • always performed by sampling the index or table [21]
      • triggered by either
        • query optimization 
        • execution of a compiled plan
      • involves only a subset of the columns referred to in the query 
      • occurs
        • before query compilation if AUTO_UPDATE_STATISTCS_ASYNC is OFF
        • asynchronously if AUTO_UPDATE_STATISTCS_ASYNC is ON
          • the query that triggered the update proceeds using the old statistics
            • provides more predictable query response time for some workloads [3]
              • particularly the workload with short running queries and very large tables [3]
      • it tracks changes to columns in the statistics
        • {limitation} it doesn’t track changes to columns in the predicate [3]
          • {recommendation} if there are many changes to the columns used in predicates of filtered statistics, consider using manual updates to keep up with the changes [3]
      • enable/disable statistics update 
        • database level
          • ALTER DATABASE dbname SET AUTO_UPDATE_STATISTICS OFF
          • {limitation} it’s not possible to override the database setting of OFF for auto update statistics by setting it ON at the statistics object level [3]
        • table level
          • NORECOMPUTE option of the UPDATE STATISTICS command 
          • CREATE STATISTICS command
          • sp_autostats
        • index
          • sp_autostats
        • statistics object
        • sp_autostats
      • [SQL Server 2005] asynchronous statistics update
        • allows the statistics update operation to be performed on a background thread in a different transaction context
          • avoids the repeating rollback issue [20]
          • the original query continues and uses out-of-date statistical information to compile the query and return it to be executed [20]
          • when the statistics are updated, plans based on those statistics objects are invalidated and are recompiled on their next use [20]
        • {command}
          • ALTER DATABASE... SET AUTO_UPDATE_STATISTICS_ASYNC {ON | OFF}
    • manually
      • {recommendation} don’t update statistics after index defragmentation 
        • update eventually only the column statistics
      • [system tables] it might be needed to update statistics also on system tables when many objects were created
    • triggers a recompilation of the queries [22]
      • {exception} when a plan is trivial [1]
        • won’t be generated a better or different plan [1]
      • {exception} when the plan is non-trivial but no row modifications since last statistics update [1]
        • no insert, delete or update since last statistics update [1]
        • manually free procedure cache in case is needed to generate a new plan [1]
      • schema change 
        • dropping an index defined on a table or an indexed view 
          • only if the index is used by the query plan in question
        • [SQL Server 2000] manually updating or dropping a statistic (not creating!) on a table will cause a recompilation of any query plans that use that table
          • the recompilation happens the next time the query plan in question begins execution
        • [SQL Server 2005] dropping a statistic (not creating or updating!) defined on a table will cause a correctness-related recompilation of any query plans that use that table
          • the recompilations happen the next time the query plan in question begins execution
        • updating a statistic (both manual and auto-update) will cause an optimality-related (data related) recompilation of any query plans that uses this statistic 
      • update scenarios 
        • {scenario} query execution times are slow [22]
          • troubleshooting: ensure that queries have up-to-date statistics before performing additional analysis [22] 
        • {scenario} insert operations occur on ascending or descending key columns [22]
          • statistics on ascending or descending key columns (e.g. IDENTITY or real-time timestamp) might require more frequent statistics updates than the query optimizer performs [22]
          • if statistics are not up-to-date and queries select from the most recently added rows, the current statistics will not have cardinality estimates for these new values [22] 
            • inaccurate cardinality estimates and slow query performance [22]
        • after maintenance operations [22]
          • {recommendation} consider updating statistics after performing maintenance procedures that change the distribution of data (e.g. truncating a table, bulk inserts) [22]
            • this can avoid future delays in query processing while queries wait for automatic statistics updates [22]
            • rebuilding, defragmenting, or reorganizing an index do not change the distribution of data [22]
              • no need to update statistics after performing ALTER INDEX REBUILD, DBCC REINDEX, DBCC INDEXDEFRAG, or ALTER INDEX REORGANIZE operations [22]
  • {operation} dropping statistics
    • via DROP STATISTICS command
    • {limitation} it’s not possible to drop statistics that are a byproduct of an index [21]
      • such statistics are removed only when the index is dropped [21]
    • aging 
      • [SQL Server 2000] ages the automatically created statistics (only those that are not a byproduct of the index creation)
      • after several automatic updates the column statistics are dropped rather than updated [21]
        • if they are needed in the future, they may be created again [21]
        • there is no substantial cost difference between statistics that are updated and created [21]
      • does not affect user-created statistics  [21]
  • {operation} statistics import)
    • performed by running generated scripts (see statistics export )
  • {operation}export (aka statistics export)
    • performed via “Generate Scripts” task
      • WITH STATS_STREAM
        • value can be generated by using DBCC SHOW_STATISTICS WITH STATS_STREAM
  • {operation} save/restore statistics
    • not supported
  • {operation} monitoring
    • via dbcc show_statistics
    • via SQL Server query profiler 
  • {option} AUTO_CREATE_STATISTICS
    • when enabled the query optimizer creates statistics on individual columns in the query predicate, as necessary, to improve cardinality estimates for the query plan [22]
      • these single-column statistics are created on columns that do not already have a histogram in an existing statistics object [22]
      • it applies strictly to single-column statistics for the full table [22]
      • statistics name starts with _WA
    • does not determine whether statistics get created for indexes [22]
    • does not generate filtered statistics [22]
      • call: SELECT DATABASEPROPERTY(<database_name>','IsAutoCreateStatistics') 
      • call: select is_auto_create_stats_on from sys.databases where name = '<database_name'
  • {option}AUTO_UPDATE_STATISTICS option
    • when enabled the query optimizer determines when statistics might be out-of-date and then updates them when they are used by a query [22]
      • statistics become out-of-date after insert, update, delete, or merge operations change the data distribution in the table or indexed view [22]
      • determined by counting the number of data modifications since the last statistics update and comparing the number of modifications to a threshold [22]
        • the threshold is based on the number of rows in the table or indexed view [22]
      • the query optimizer checks for out-of-date statistics 
        • before compiling a query 
          • the query optimizer uses the columns, tables, and indexed views in the query predicate to determine which statistics might be out-of-date [22]
        • before executing a cached query plan
          • the Database Engine verifies that the query plan references up-to-date statistics [22]
    • applies to 
      • statistics objects created for indexes, single-columns in query predicates [22]
      • statistics created with the CREATE STATISTICS statement [22]
      • filtered statistics [22]
    • call: SELECT DATABASEPROPERTY(<database_name>','IsAutoUpdateStatistics') 
    • call: select is_auto_update_stats_on from sys.databases where name = '<database_name'
  • {option} AUTO_UPDATE_STATISTICS_ASYNC
    • determines whether the query optimizer uses synchronous or asynchronous statistics updates 
    • {default} off
      • the query optimizer updates statistics synchronously
        • queries always compile and execute with up-to-date statistics
        • when statistics are out-of-date, the query optimizer waits for updated statistics before compiling and executing the query [22]
        • {recommendation} use when performing operations that change the distribution of data [22]
          • e.g. truncating a table or performing a bulk update of a large percentage of the rows [22]
          •  if the statistics are not updated after completing the operation, using synchronous statistics will ensure statistics are up-to-date before executing queries on the changed data [22]
    • applies to statistics objects created for  [22]
      • indexes
      • single columns in query predicates
      • statistics created with the CREATE STATISTICS statement
    • asynchronous statistics updates
      • queries compile with existing statistics even if the existing statistics are out-of-date
        • the query optimizer could choose a suboptimal query plan if statistics are out-of-date when the query compiles [22]
        • queries that compile after the asynchronous updates have completed will benefit from using the updated statistics [22]
      • {warning} it will do nothing if the “Auto Update Statistics” database option isn’t enabled [30]
      • {recommendation} scenarios
        •  to achieve more predictable query response times
        • an application frequently executes the same query, similar queries, or similar cached query plans
          • the query response times might be more predictable with asynchronous statistics updates than with synchronous statistics updates because the query optimizer can execute incoming queries without waiting for up-to-date statistics [22]
            • this avoids delaying some queries and not others [22]
        • an application experienced client request time outs caused by one or more queries waiting for updated statistics [22]
          • in some cases waiting for synchronous statistics could cause applications with aggressive time outs to fail [22]
      • call: select is_auto_update_stats_async_on from sys.databases where name = '<database_name'
  • [SQL Server 2014] Auto Create Incremental Statistics
    • when active (ON), the statistics created are per partition statistics [22]
    • {default} when disabled (OFF), the statistics tree is dropped and SQL Server re-computes the statistics [22]
    • this setting overrides the database level INCREMENTAL property [22]
  • [partitions] when new partitions are added to a large table, statistics should be updated to include the new partitions [22]
    • however the time required to scan the entire table (FULLSCAN or SAMPLE option) might be quite long [22]
    • scanning the entire table isn't necessary because only the statistics on the new partitions might be needed [22]
    • the incremental option creates and stores statistics on a per partition basis, and when updated, only refreshes statistics on those partitions that need new statistics [22]
    • if per partition statistics are not supported the option is ignored and a warning is generated [22]
    • not supported for several statistics types
      • statistics created with indexes that are not partition-aligned with the base table [22]
      • statistics created on AlwaysOn readable secondary databases [22]
      • statistics created on read-only databases [22]
      • statistics created on filtered indexes [22]
      • statistics created on views [22]
      • statistics created on internal tables [22]
      • statistics created with spatial indexes or XML indexes [22]

References:
[1] Microsoft Learn (2025) SQL Server: Statistics [link]
[2] Microsoft Learn (2025) SQL Server: Configure auto statistics [link]
[3] Microsoft Learn (2009) SQL Server: Statistics Used by the Query Optimizer in Microsoft SQL Server 2008 [link]
[4] Dmitri Korotkevitch (2016) Pro SQL Server Internals 2nd Ed.
[8] Microsoft (2014) Statistical maintenance functionality (autostats) in SQL Server  (kb 195565)
[10] CSS SQL Server Engineers (2015) Does rebuild index update statistics? [link]
[16] Thomas Kejser's Database Blog (2011) The Ascending Key Problem in Fact Tables – Part one: Pain! by Thomas Kejser
[20] Kalen Delaney et al (2013) Microsoft SQL Server 2012 Internals, 2013
[22] MSDN (2016) Statistics [link]
[29] Tips, Tricks, and Advice from the SQL Server Query Optimization Team (2006) UPDATE STATISTICS undocumented options [link]
[31] Microsoft Learn (2025) SQL Server: sys.dm_db_stats_histogram (Transact-SQL) [link]
Related Posts Plugin for WordPress, Blogger...

About Me

My photo
Koeln, NRW, Germany
IT Professional with more than 25 years experience in IT in the area of full life-cycle of Web/Desktop/Database Applications Development, Software Engineering, Consultancy, Data Management, Data Quality, Data Migrations, Reporting, ERP implementations & support, Team/Project/IT Management, etc.