Showing posts with label Data science. Show all posts

24 May 2026

🖍️Cheryl Cihon - Collected Quotes

"A combination of graphical and tabular presentations may be used to good advantage. The former illustrates most effectively qualitative characteristics (e.g., changes of data with time or sequence) while the latter is the best means to present quantitative information." (Cheryl Cihon & John K Taylor, "Statistical Techniques for Data Analysis" 2nd. ed., 2005)

"Each systematic error associated with a given measurement process is always of the same sign and magnitude. It persists measurement after measurement. When its existence is established, such an error is called a bias, and reasonable effort should be made to correct for it. Sometimes the observed bias is the result of the concurrence of several biases that cannot or at least have not been individually identified. One of the purposes of statistical treatment of data is to decide whether an apparently erroneous result is real and indicates a bias or whether it could happen as the result of chance variability, even in a well-behaved measurement system. There can be, of course, biases that have not been identified as such. Also, there are limits to how well one can correct for known biases, and this inadequacy must be considered when limits of uncertainty are assigned to data." (Cheryl Cihon & John K Taylor, "Statistical Techniques for Data Analysis" 2nd. ed., 2005)

"Essentially, the null hypothesis is that there is not a significant difference between two results. It will be seen that differences may have to be quite large in some instances before they are statistically significant, especially in the case of small data sets of high variability. Statistics will not say whether or not an apparent difference is real, but will only give the probability that it could have been as large as it is by chance alone. Often, the answer will be that there is no reason to believe a difference exists other than due to a chance occurrence, based on the statistical evidence available. Remember that this is not saying that there is no difference but that the evidence presented is insufficient to support the belief that the difference is not more than a random effect." (Cheryl Cihon & John K Taylor, "Statistical Techniques for Data Analysis" 2nd. ed., 2005)

"Frequency distributions, commonly called histograms, are special kinds of bar charts that are used widely for displaying variability of scientific and technical information. Such displays may be used to demonstrate that a normal distribution is or is not achieved [...]. Generally, a minimum of 25 data points is required to prepare a good bar chart, and considerably more is highly desirable. The data are divided into groups bounded by cells of fixed limits. The number of cells chosen to cover the range of values for the data is somewhat arbitrary. If too few, a distribution can lack resolution; if too many, there can be numerous unpopulated cells in the case of small data sets. Trial and error may be used in a specific case to decide what is most effective." (Cheryl Cihon & John K Taylor, "Statistical Techniques for Data Analysis" 2nd. ed., 2005)

"Nomographs are effective ways to graphically calculate various functionally related quantities. Nomographs are really graphical computational devices. They were once used widely in engineering situations when calculating was more laborious than at the present time, and they still can be useful when complex relationships are concerned. In brief, scales are laid out in which the scale intervals and placement of the lines are chosen by well-established procedures. A straight edge can then be used to interconnect independent variables so the corresponding values of dependent variables can be read." (Cheryl Cihon & John K Taylor, "Statistical Techniques for Data Analysis" 2nd. ed., 2005)

"Pie charts are more comprehensible as the sectors are approximately equal. A feeling of relationship is lost as very small sectors are placed alongside very large ones. In any case, numerical values need to be inserted in the sectors or related to them by lines or arrows to provide numerical significance, since the eye is not a good quantitative judge of the relative areas of sectors. The total number of sectors used should be reasonably small. While not a hard and fast rule, a maximum of eight sectors is a reasonable number. Sectors may be homogeneous or consist of conglomerates of several items. The information contained in a sector may be displayed as a separate pie chart. This is an effective way to handle conglomerates." (Cheryl Cihon & John K Taylor, "Statistical Techniques for Data Analysis" 2nd. ed., 2005)

"The inevitability of variability complicates the evaluation and use of data. It must be recognized that many uses require data quality that may be difficult to achieve. There are minimum quality standards required for every measurement situation (sometimes called data quality objectives). These standards should be established in advance and both the producer and the user must be able to determine whether they have been met. The only way that this can be accomplished is to attain statistical control of the measurement process and to apply valid statistical procedures in the analysis of the data." (Cheryl Cihon & John K Taylor, "Statistical Techniques for Data Analysis" 2nd. ed., 2005)

"The quantitative accuracy of what is measured is an obvious indicator of data quality. Because of inescapable variability, data will always have some degree of uncertainty. When measurement plans are properly made and adequately executed, it is possible to assign quantitative limits of uncertainty to measured values." (Cheryl Cihon & John K Taylor, "Statistical Techniques for Data Analysis" 2nd. ed., 2005)

"The second type of uncertainty results from random causes that produce fluctuations in both sign and magnitude, the latter within well-defined limits, however. In the long run, the random error averages out to zero. The random error accounts for the variability of individual measurements and it will be shown that it can be statistically characterized by what is called a standard deviation. This term is thus a measure of the dispersion of the data around a mean or average value. When the value of the standard deviation is small, the data cluster closely around the mean; when it is large, the spread is greater." (Cheryl Cihon & John K Taylor, "Statistical Techniques for Data Analysis" 2nd. ed., 2005)

"The use of tables is perhaps the most common method for presentation of data. The format will vary, depending on what information is needed to be conveyed. Even a cursory perusal of the scientific literature will reveal many examples of both good and poor tables. A good table is simply one that presents data in an easily understandable manner. Tables should be relatively simple in order to promote understanding and the columns should have a clear relationship to each other. Column titles should be as brief as possible, consistent with clarity. Footnotes may be needed in some cases to provide further explanation of the headings." (Cheryl Cihon & John K Taylor, "Statistical Techniques for Data Analysis" 2nd. ed., 2005)

"Variability is inevitable in a measurement process. The operation of a measurement process does not produce one number but a variety of numbers. Each time it is applied to a measurement situation it can be expected to produce a slightly different number or sets of numbers. The means of sets of numbers will differ among themselves, but to a lesser degree than the individual values. One must distinguish between natural variability and instability. Gross instability can arise from many sources, including lack of control of the process. Failure to control steps that introduce bias also can introduce variability. Thus, any variability in calibration, done to minimize bias, can produce variability of measured values." (Cheryl Cihon & John K Taylor, "Statistical Techniques for Data Analysis" 2nd. ed., 2005)

22 May 2026

🔭Data Science: Asymmetry (Just the Quotes)

"Some distributions [...] are symmetrical about their central value. Other distributions have marked asymmetry and are said to be skew. Skew distributions are divided into two types. If the 'tail' of the distribution reaches out into the larger values of the variate, the distribution is said to show positive skewness; if the tail extends towards the smaller values of the variate, the distribution is called negatively skew." (Michael J Moroney,Facts from Figures", 1951)

“But seldom is asymmetry merely the absence of symmetry. Even in asymmetric designs one feels symmetry as the norm from which one deviates under the influence of forces of non-formal character.” (Hermann Weyl, “Symmetry”, 1952)

"If a distribution were perfectly symmetrical, all symmetry-plot points would be on the diagonal line. Off-line points indicate asymmetry. Points fall above the line when distance above the median is greater than corresponding distance below the median. A consistent run of above-the-line points indicates positive skew; a run of below-the-line points indicates negative skew." (Lawrence C Hamilton,Regression with Graphics: A second course in applied statistics", 1991)

“An asymmetry in the present is understood as having originated from a past symmetry.” (Michael Leyton, “Symmetry, Causality, Mind”, 1992)

"Chaos demonstrates that deterministic causes can have random effects […] There's a similar surprise regarding symmetry: symmetric causes can have asymmetric effects. […] This paradox, that symmetry can get lost between cause and effect, is called symmetry-breaking. […] From the smallest scales to the largest, many of nature's patterns are a result of broken symmetry; […]" (Ian Stewart & Martin Golubitsky,Fearful Symmetry: Is God a Geometer?", 1992)

“Approximate symmetry is a softening of the hard dichotomy between symmetry and asymmetry. The extent of deviation from exact symmetry that can still be considered approximate symmetry will depend on the context and the application and could very well be a matter of personal taste.” (Joe Rosen, “Symmetry Rules: How Science and Nature Are Founded on Symmetry”, 2008)

"[…] in cybernetics, control is seen not as a function of one agent over something else, but as residing within circular causal networks, maintaining stabilities in a system. Circularities have no beginning, no end and no asymmetries. The control metaphor of communication, by contrast, punctuates this circularity unevenly. It privileges the conceptions and actions of a designated controller by distinguishing between messages sent in order to cause desired effects and feedback that informs the controller of successes or failures." (Klaus Krippendorff,On Communicating: Otherness, Meaning, and Information", 2009)

“[…] asymmetry can be defined only relative to symmetry, and vice versa. Asymmetric elements in paintings or buildings are most effective when superimposed against a background of symmetry.” (Alan Lightman, “The Accidental Universe: The World You Thought You Knew”, 2014)

"The higher the dimension, in other words, the higher the number of possible interactions, and the more disproportionally difficult it is to understand the macro from the micro, the general from the simple units. This disproportionate increase of computational demands is called the curse of dimensionality." (Nassim N Taleb,Skin in the Game: Hidden Asymmetries in Daily Life", 2018)

"Many statistical procedures perform more effectively on data that are normally distributed, or at least are symmetric and not excessively kurtotic" (fat-tailed), and where the mean and variance are approximately constant. Observed time series frequently require some form of transformation before they exhibit these distributional properties, for in their 'raw' form they are often asymmetric." (Terence C Mills,Applied Time Series Analysis: A practical guide to modeling and forecasting", 2019)

17 May 2026

🔭Data Science: Misconceptions (Just the Quotes)

"Science does not begin with facts; one of its tasks is to uncover the facts by removing misconceptions." (Lancelot L Whyte, "Accent on Form", 1954)

"A common misconception is that an effect exists only if it is statistically significant and that it does not exist if it is not [statistically significant]." (Jonas Ranstam, "A common misconception about p-value and its consequences", Acta Orthopaedica Scandinavica 67, 1996)

"The standard deviation (often SD) is a measure of variability. When we calculate the standard deviation of a sample, we are using it as an estimate of the variability of the population from which the sample was drawn. For data with a normal distribution, about 95% of individu als will have values within 2 standard deviations of the mean, the other 5% being equally scattered above and below these limits. Contrary to popular misconception, the standard deviation is a valid measure of variability regardless of the distribution. About 95% of observa tions of any distribution usually fall within the 2 standard deviation limits, though those outside may all be at one end. We may choose a different summary statistic, how ever, when data have a skewed distribution." (Douglas G Altman & J Martin Bland, "Statistics Notes: Standard Deviations And Standard Errors", British Medical Journal Vol. 331 (7521) 2005)

"[...] the term statistical misconception refers to any of several widely held but incorrect notions about statistical concepts, about procedures for analyzing data and about the meaning of results produced by such analyses. To illustrate, many people think that (1) normal curves are bell shaped, (2) a correlation coeffi cient should never be used to address questios of causality, and (3) the level of signifi cance dictates the probability of a Type I error. Some people, of course, have only one or two (rather than all three) of these misconceptions, and a few individuals realize that all three of those beliefs are false."(Schuyler W Huck, "Statistical Misconceptions", 2008)

"Science would be better understood if we called theories ‘misconceptions’ from the outset, instead of only after we have discovered their successors." (David Deutsch, "Beginning of Infinity", 2011)

"A popular misconception holds that the era of Big Data means the end of a need for sampling. In fact, the proliferation of data of varying quality and relevance reinforces the need for sampling as a tool to work efficiently with a variety of data, and minimize bias. Even in a Big Data project, predictive models are typically developed and piloted with samples." (Peter C Bruce & Andrew G Bruce, "Statistics for Data Scientists: 50 Essential Concepts", 2016)

"An oft-repeated rule of thumb in any sort of statistical model fitting is 'you can't fit a model with more parameters than data points'. This idea appears to be as wide-spread as it is incorrect. On the contrary, if you construct your models carefully, you can fit models with more parameters than datapoints [...]. A model with more parameters than datapoints is known as an under-determined system, and it's a common misperception that such a model cannot be solved in any circumstance. [...] this misconception, which I like to call the 'model complexity myth' [...] is not true in general, it is true in the specific case of simple linear models, which perhaps explains why the myth is so pervasive." (Jake Vanderplas", "The Model Complexity Myth", 2015)


16 May 2026

🔭Data Science: Central Tendency (Just the Quotes)

"An average value is a single value within the range of the data that is used to represent all of the values in the series. Since an average is somewhere within the range of the data, it is sometimes called a measure of central value." (Frederick E Croxton & Dudley J Cowden, "Practical Business Statistics", 1937

"A good estimator will be unbiased and will converge more and more closely (in the long run) on the true value as the sample size increases. Such estimators are known as consistent. But consistency is not all we can ask of an estimator. In estimating the central tendency of a distribution, we are not confined to using the arithmetic mean; we might just as well use the median. Given a choice of possible estimators, all consistent in the sense just defined, we can see whether there is anything which recommends the choice of one rather than another. The thing which at once suggests itself is the sampling variance of the different estimators, since an estimator with a small sampling variance will be less likely to differ from the true value by a large amount than an estimator whose sampling variance is large." (Michael J Moroney, "Facts from Figures", 1951)

"The mode would form a very poor basis for any further calculations of an arithmetical nature, for it has deliberately excluded arithmetical precision in the interests of presenting a typical result. The arithmetic average, on the other hand, excellent as it is for numerical purposes, has sacrificed its desire to be typical in favour of numerical accuracy. In such a case it is often desirable to quote both measures of central tendency.(Michael J Moroney,Facts from Figures", 1951)

"An average is sometimes called a 'measure of central tendency' because individual values of the variable usually cluster around it. Averages are useful, however, for certain types of data in which there is little or no central tendency." (William A Spirr & Charles P Bonini,Statistical Analysis for Business Decisions" 3rd Ed., 1967)

"Central tendency is the formal expression for the notion of where data is centered, best understood by most readers as 'average'. There is no one way of measuring where data are centered, and different measures provide different insights." (Charles Livingston & Paul Voakes,Working with Numbers and Statistics: A handbook for journalists", 2005)

"Distributional shape is an important attribute of data, regardless of whether scores are analyzed descriptively or inferentially. Because the degree of skewness can be summarized by means of a single number, and because computers have no difficul ty providing such measures (or estimates) of skewness, those who prepare research reports should include a numerical index of skewness every time they provide measures of central tendency and variability." (Schuyler W Huck, "Statistical Misconceptions", 2008)

"It is best to think of the various kinds of central tendency indices as falling into three categories based on the computational procedures one uses to summarize the data. One category deals with means, with techniques put into this category if scores are added together and then divided by the number of scores that are summed. The second category involves different kinds of medians, with various techniques grouped here if the goal is to find some sort of midpoint. The third category contains different kinds of modes, with these techniques focused on the frequency with which scores appear in the data." (Schuyler W Huck, "Statistical Misconceptions", 2008)

"Various measures of central tendency have been invented because the proper notion of the 'average' score can vary from study to study. Depending on the kind of data collected, the degree of skewness in the data, and the possible existence of outliers, it may be that the most appropriate measure of central tendency is found by doing something other than (1) dividing the sum of the scores by the number of scores (to get the mean), (2) calculating the midpoint in the distribution (to get the median), or (3) determining the most frequently observed score (to get the mode)." (Schuyler W Huck, "Statistical Misconceptions", 2008)

"Statistical analysis seeks to develop concise summary figures which describe a large body of quantitative data. One of the most widely used set of summary figures is known as measures of location, which are often referred to as averages, measures of central tendency or central location. The purpose for computing an average value for a set of observations is to obtain a single value which is representative of all the items and which the mind can grasp simply and quickly. The single value is the point or location around which the individual items cluster." (Lawrence J Kaplan)

15 May 2026

🔭Data Science: Centrality (Just the Quotes)

"An average value is a single value within the range of the data that is used to represent all of the values in the series. Since an average is somewhere within the range of the data, it is sometimes called a measure of central value." (Frederick E Croxton & Dudley J Cowden,Practical Business Statistics", 1937)

"Some distributions [...] are symmetrical about their central value. Other distributions have marked asymmetry and are said to be skew. Skew distributions are divided into two types. If the 'tail' of the distribution reaches out into the larger values of the variate, the distribution is said to show positive skewness; if the tail extends towards the smaller values of the variate, the distribution is called negatively skew." (Michael J Moroney,Facts from Figures", 1951)

"Numerical data, which have been recorded at intervals of time, form what is generally described as a time series. [...] The purpose of analyzing time series is not always the determination of the trend by itself. Interest may be centered on the seasonal movement displayed by the series and, in such a case, the determination of the trend is merely a stage in the process of measuring and analyzing the seasonal variation. If a regular basic or under- lying seasonal movement can be clearly established, forecasting of future movements becomes rather less a matter of guesswork and more a matter of intelligent forecasting." (Alfred R Ilersic, "Statistics", 1959)

"Dispersion or spread is the degree of the scatter or variation of the variables about a central value." (Bertram C Brookes & W F L Dick,Introduction to Statistical Method", 1969)

"Equal variability is not always achieved in plots. For instance, if the theoretical distribution for a probability plot has a density that drops off gradually to zero in the tails" (as the normal density does), then the variability of the data in the tails of the probability plot is greater than in the center. Another example is provided by the histogram. Since the height of any one bar has a binomial distribution, the standard deviation of the height is approximately proportional to the square root of the expected height; hence, the variability of the longer bars is greater." (John M Chambers et al,Graphical Methods for Data Analysis", 1983)

"There are several reasons why symmetry is an important concept in data analysis. First, the most important single summary of a set of data is the location of the center, and when data meaning of 'center' is unambiguous. We can take center to mean any of the following things, since they all coincide exactly for symmetric data, and they are together for nearly symmetric data: (l) the center of symmetry. (2) the arithmetic average or center of gravity, (3) the median or 50%. Furthermore, if data a single point of highest concentration instead of several" (that is, they are unimodal), then we can add to the list (4) point of highest concentration. When data are far from symmetric, we may have trouble even agreeing on what we mean by center; in fact, the center may become an inappropriate summary for the data." (John M Chambers et al,Graphical Methods for Data Analysis", 1983)

"A connected graph is appropriate when the time series is smooth, so that perceiving individual values is not important. A vertical line graph is appropriate when it is important to see individual values, when we need to see short-term fluctuations, and when the time series has a large number of values; the use of vertical lines allows us to pack the series tightly along the horizontal axis. The vertical line graph, however, usually works best when the vertical lines emanate from a horizontal line through the center of the data and when there are no long-term trends in the data." (William S Cleveland,The Elements of Graphing Data", 1985)

"If the sample is not representative of the population because the sample is small or biased, not selected at random, or its constituents are not independent of one another, then the bootstrap will fail. […] For a given size sample, bootstrap estimates of percentiles in the tails will always be less accurate than estimates of more centrally located percentiles. Similarly, bootstrap interval estimates for the variance of a distribution will always be less accurate than estimates of central location such as the mean or median because the variance depends strongly upon extreme values in the population." (Phillip I Good & James W Hardin,Common Errors in Statistics" (and How to Avoid Them)", 2003)

"Central tendency is the formal expression for the notion of where data is centered, best understood by most readers as 'average'. There is no one way of measuring where data are centered, and different measures provide different insights." (Charles Livingston & Paul Voakes,Working with Numbers and Statistics: A handbook for journalists", 2005)

"Mean-averages can be highly misleading when the raw data do not form a symmetric pattern around a central value but instead are skewed towards one side [...], typically with a large group of standard cases but with a tail of a few either very high" (for example, income) or low" (for example, legs) values." (David Spiegelhalter,The Art of Statistics: Learning from Data", 2019)

"The elements of this cloud of uncertainty (the set of all possible errors) can be described in terms of probability. The center of the cloud is the number zero, and elements of the cloud that are close to zero are more probable than elements that are far away from that center. We can be more precise in this definition by defining the cloud of uncertainty in terms of a mathematical function, called the probability distribution." (David S Salsburg,Errors, Blunders, and Lies: How to Tell the Difference", 2017)

"Two clouds of uncertainty may have the same center, but one may be much more dispersed than the other. We need a way of looking at the scatter about the center. We need a measure of the scatter. One such measure is the variance. We take each of the possible values of error and calculate the squared difference between that value and the center of the distribution. The mean of those squared differences is the variance." (David S Salsburg,Errors, Blunders, and Lies: How to Tell the Difference", 2017)

10 May 2026

🔭Data Science: Location (Just the Quotes)

"There are several reasons why symmetry is an important concept in data analysis. First, the most important single summary of a set of data is the location of the center, and when data meaning of 'center' is unambiguous. We can take center to mean any of the following things, since they all coincide exactly for symmetric data, and they are together for nearly symmetric data: (l) the center of symmetry. (2) the arithmetic average or center of gravity, (3) the median or 50%. Furthermore, if data a single point of highest concentration instead of several (that is, they are unimodal), then we can add to the list (4) point of highest concentration. When data are far from symmetric, we may have trouble even agreeing on what we mean by center; in fact, the center may become an inappropriate summary for the data." (John M Chambers et al,Graphical Methods for Data Analysis", 1983)

"Data that are skewed toward large values occur commonly. Any set of positive measurements is a candidate. Nature just works like that. In fact, if data consisting of positive numbers range over several powers of ten, it is almost a guarantee that they will be skewed. Skewness creates many problems. There are visualization problems. A large fraction of the data are squashed into small regions of graphs, and visual assessment of the data degrades. There are characterization problems. Skewed distributions tend to be more complicated than symmetric ones; for example, there is no unique notion of location and the median and mean measure different aspects of the distribution. There are problems in carrying out probabilistic methods. The distribution of skewed data is not well approximated by the normal, so the many probabilistic methods based on an assumption of a normal distribution cannot be applied." (William S Cleveland,Visualizing Data", 1993)

"Fitting data means finding mathematical descriptions of structure in the data. An additive shift is a structural property of univariate data in which distributions differ only in location and not in spread or shape. […] The process of identifying a structure in data and then fitting the structure to produce residuals that have the same distribution lies at the heart of statistical analysis. Such homogeneous residuals can be pooled, which increases the power of the description of the variation in the data." (William S Cleveland,Visualizing Data", 1993)

"When the distributions of two or more groups of univariate data are skewed, it is common to have the spread increase monotonically with location. This behavior is monotone spread. Strictly speaking, monotone spread includes the case where the spread decreases monotonically with location, but such a decrease is much less common for raw data. Monotone spread, as with skewness, adds to the difficulty of data analysis. For example, it means that we cannot fit just location estimates to produce homogeneous residuals; we must fit spread estimates as well. Furthermore, the distributions cannot be compared by a number of standard methods of probabilistic inference that are based on an assumption of equal spreads; the standard t-test is one example. Fortunately, remedies for skewness can cure monotone spread as well." (William S Cleveland,Visualizing Data", 1993)

"Since the average is a measure of location, it is common to use averages to compare two data sets. The set with the greater average is thought to ‘exceed’ the other set. While such comparisons may be helpful, they must be used with caution. After all, for any given data set, most of the values will not be equal to the average." (Donald J Wheeler,Understanding Variation: The Key to Managing Chaos" 2nd Ed., 2000)

"Distinguish among confidence, prediction, and tolerance intervals. Confidence intervals are statements about population means or other parameters. Prediction intervals address future" (single or multiple) observations. Tolerance intervals describe the location of a specific proportion of a population, with specified confidence." (Gerald van Belle,Statistical Rules of Thumb", 2002)

"If the sample is not representative of the population because the sample is small or biased, not selected at random, or its constituents are not independent of one another, then the bootstrap will fail. […] For a given size sample, bootstrap estimates of percentiles in the tails will always be less accurate than estimates of more centrally located percentiles. Similarly, bootstrap interval estimates for the variance of a distribution will always be less accurate than estimates of central location such as the mean or median because the variance depends strongly upon extreme values in the population." (Phillip I Good & James W Hardin,Common Errors in Statistics" (and How to Avoid Them)", 2003)

"The central limit theorem is often used to justify the assumption of normality when using the sample mean and the sample standard deviation. But it is inevitable that real data contain gross errors. Five to ten percent unusual values in a dataset seem to be the rule rather than the exception. The distribution of such data is no longer Normal." (A S Hedayat & Guoqin Su,Robustness of the Simultaneous Estimators of Location and Scale From Approximating a Histogram by a Normal Density Curve", The American Statistician 66, 2012)

09 May 2026

🔭Data Science: Guessing (Just the Quotes)

"Summing up, then, it would seem as if the mind of the great discoverer must combine contradictory attributes. He must be fertile in theories and hypotheses, and yet full of facts and precise results of experience. He must entertain the feeblest analogies, and the merest guesses at truth, and yet he must hold them as worthless till they are verified in experiment. When there are any grounds of probability he must hold tenaciously to an old opinion, and yet he must be prepared at any moment to relinquish it when a clearly contradictory fact is encountered." (William S Jevons,The Principles of Science: A Treatise on Logic and Scientific Method", 1874)

"It would be an error to suppose that the great discoverer seizes at once upon the truth, or has any unerring method of divining it. In all probability the errors of the great mind exceed in number those of the less vigorous one. Fertility of imagination and abundance of guesses at truth are among the first requisites of discovery; but the erroneous guesses must be many times as numerous as those that prove well founded. The weakest analogies, the most whimsical notions, the most apparently absurd theories, may pass through the teeming brain, and no record remain of more than the hundredth part. […] The truest theories involve suppositions which are inconceivable, and no limit can really be placed to the freedom of hypotheses." (W Stanley Jevons,The Principles of Science: A Treatise on Logic and Scientific Method", 1877)

"Heuristic reasoning is reasoning not regarded as final and strict but as provisional and plausible only, whose purpose is to discover the solution of the present problem. We are often obliged to use heuristic reasoning. We shall attain complete certainty when we shall have obtained the complete solution, but before obtaining certainty we must often be satisfied with a more or less plausible guess. We may need the provisional before we attain the final. We need heuristic reasoning when we construct a strict proof as we need scaffolding when we erect a building." (George Pólya,How to Solve It", 1945)

"The scientist who discovers a theory is usually guided to his discovery by guesses; he cannot name a method by means of which he found the theory and can only say that it appeared plausible to him, that he had the right hunch or that he saw intuitively which assumption would fit the facts." (Hans Reichenbach,The Rise of Scientific Philosophy", 1951)

"Extrapolations are useful, particularly in the form of soothsaying called forecasting trends. But in looking at the figures or the charts made from them, it is necessary to remember one thing constantly: The trend to now may be a fact, but the future trend represents no more than an educated guess. Implicit in it is 'everything else being equal' and 'present trends continuing'. And somehow everything else refuses to remain equal." (Darell Huff,How to Lie with Statistics", 1954)

"In plausible reasoning the principal thing is to distinguish... a more reasonable guess from a less reasonable guess." (George Pólya,Mathematics and plausible reasoning" Vol. 1, 1954)

"We know many laws of nature and we hope and expect to discover more. Nobody can foresee the next such law that will be discovered. Nevertheless, there is a structure in laws of nature which we call the laws of invariance. This structure is so far-reaching in some cases that laws of nature were guessed on the basis of the postulate that they fit into the invariance structure." (Eugene P Wigner,The Role of Invariance Principles in Natural Philosophy", 1963)

"Another thing I must point out is that you cannot prove a vague theory wrong. If the guess that you make is poorly expressed and rather vague, and the method that you use for figuring out the consequences is a little vague - you are not sure, and you say, 'I think everything's right because it's all due to so and so, and such and such do this and that more or less, and I can sort of explain how this works' […] then you see that this theory is good, because it cannot be proved wrong! Also if the process of computing the consequences is indefinite, then with a little skill any experimental results can be made to look like the expected consequences." (Richard P Feynman,The Character of Physical Law", 1965)

"The method of guessing the equation seems to be a pretty effective way of guessing new laws. This shows again that mathematics is a deep way of expressing nature, and any attempt to express nature in philosophical principles, or in seat-of-the-pants mechanical feelings, is not an efficient way." (Richard Feynman,The Character of Physical Law", 1965)

"Every discovery, every enlargement of the understanding, begins as an imaginative preconception of what the truth might be. The imaginative preconception - a ‘hypothesis’ - arises by a process as easy or as difficult to understand as any other creative act of mind; it is a brainwave, an inspired guess, a product of a blaze of insight. It comes anyway from within and cannot be achieved by the exercise of any known calculus of discovery." (Sir Peter B Medawar,Advice to a Young Scientist", 1979)

"Scientists reach their  conclusions  for the damnedest of reasons: intuition, guesses, redirections after wild-goose chases, all combing with a dollop of rigorous observation and logical  reasoning to be sure […] This  messy and personal side of science should not be  disparaged, or covered up, by  scientists for two  major reasons. First, scientists should proudly show this  human face to  display their kinship with all other  modes of creative human thought […] Second, while biases and references often impede understanding, these  mental idiosyncrasies  may  also serve as powerful, if  quirky and personal, guides to solutions." (Stephen J Gould,Dinosaur in a  Haystack: Reflections in natural  history", 1995)

"Compound errors can begin with any of the standard sorts of bad statistics - a guess, a poor sample, an inadvertent transformation, perhaps confusion over the meaning of a complex statistic. People inevitably want to put statistics to use, to explore a number's implications. [...] The strengths and weaknesses of those original numbers should affect our confidence in the second-generation statistics." (Joel Best,Damned Lies and Statistics: Untangling Numbers from the Media, Politicians, and Activists", 2001)

"First, good statistics are based on more than guessing. [...] Second, good statistics are based on clear, reasonable definitions. Remember, every statistic has to define its subject. Those definitions ought to be clear and made public. [...] Third, good statistics are based on clear, reasonable measures. Again, every statistic involves some sort of measurement; while all measures are imperfect, not all flaws are equally serious. [...] Finally, good statistics are based on good samples." (Joel Best,Damned Lies and Statistics: Untangling Numbers from the Media, Politicians, and Activists", 2001)

"While some social problems statistics are deliberate deceptions, many - probably the great majority - of bad statistics are the result of confusion, incompetence, innumeracy, or selective, self-righteous efforts to produce numbers that reaffirm principles and interests that their advocates consider just and right. The best response to stat wars is not to try and guess who's lying or, worse, simply to assume that the people we disagree with are the ones telling lies. Rather, we need to watch for the standard causes of bad statistics - guessing, questionable definitions or methods, mutant numbers, and inappropriate comparisons." (Joel Best,Damned Lies and Statistics: Untangling Numbers from the Media, Politicians, and Activists", 2001)

"The well-known 'No Free Lunch' theorem indicates that there does not exist a pattern classification method that is inherently superior to any other, or even to random guessing without using additional information. It is the type of problem, prior information, and the amount of training samples that determine the form of classifier to apply. In fact, corresponding to different real-world problems, different classes may have different underlying data structures. A classifier should adjust the discriminant boundaries to fit the structures which are vital for classification, especially for the generalization capacity of the classifier." (Hui Xue et al,SVM: Support Vector Machines", 2009)

"Data science isn’t just about the existence of data, or making guesses about what that data might mean; it’s about testing hypotheses and making sure that the conclusions you’re drawing from the data are valid." (Mike Loukides,What Is Data Science?", 2011)

"GIGO is a famous saying coined by early computer scientists: garbage in, garbage out. At the time, people would blindly put their trust into anything a computer output indicated because the output had the illusion of precision and certainty. If a statistic is composed of a series of poorly defined measures, guesses, misunderstandings, oversimplifications, mismeasurements, or flawed estimates, the resulting conclusion will be flawed." (Daniel J Levitin,Weaponized Lies", 2017)

"In statistical inference and machine learning, we often talk about estimates and estimators. Estimates are basically our best guesses regarding some quantities of interest given" (finite) data. Estimators are computational devices or procedures that allow us to map between a given" (finite) data sample and an estimate of interest." (Aleksander Molak,Causal Inference and Discovery in Python", 2023)


08 May 2026

🔭Data Science: Heuristics (Just the Quotes)

"Heuristic reasoning is reasoning not regarded as final and strict but as provisional and plausible only, whose purpose is to discover the solution of the present problem. We are often obliged to use heuristic reasoning. We shall attain complete certainty when we shall have obtained the complete solution, but before obtaining certainty we must often be satisfied with a more or less plausible guess. We may need the provisional before we attain the final. We need heuristic reasoning when we construct a strict proof as we need scaffolding when we erect a building." (George Pólya,How to Solve It", 1945)

"The attempt to characterize exactly models of an empirical theory almost inevitably yields a more precise and clearer understanding of the exact character of a theory. The emptiness and shallowness of many classical theories in the social sciences is well brought out by the attempt to formulate in any exact fashion what constitutes a model of the theory. The kind of theory which mainly consists of insightful remarks and heuristic slogans will not be amenable to this treatment. The effort to make it exact will at the same time reveal the weakness of the theory." (Patrick Suppes," A Comparison of the Meaning and Uses of Models in Mathematics and the Empirical Sciences", Synthese  Vol. 12" (2/3), 1960)

"Design problems - generating or discovering alternatives - are complex largely because they involve two spaces, an action space and a state space, that generally have completely different structures. To find a design requires mapping the former of these on the latter. For many, if not most, design problems in the real world systematic algorithms are not known that guarantee solutions with reasonable amounts of computing effort. Design uses a wide range of heuristic devices - like means-end analysis, satisficing, and the other procedures that have been outlined - that have been found by experience to enhance the efficiency of search. Much remains to be learned about the nature and effectiveness of these devices." (Herbert A Simon,The Logic of Heuristic Decision Making", [inThe Logic of Decision and Action"], 1966)

"Intelligence has two parts, which we shall call the epistemological and the heuristic. The epistemological part is the representation of the world in such a form that the solution of problems follows from the facts expressed in the representation. The heuristic part is the mechanism that on the basis of the information solves the problem and decides what to do." (John McCarthy & Patrick J Hayes,Some Philosophical Problems from the Standpoint of Artificial Intelligence", Machine Intelligence 4, 1969)

"Consider any of the heuristics that people have come up with for supervised learning: avoid overfitting, prefer simpler to more complex models, boost your algorithm, bag it, etc. The no free lunch theorems say that all such heuristics fail as often" (appropriately weighted) as they succeed. This is true despite formal arguments some have offered trying to prove the validity of some of these heuristics." (David H Wolpert,The lack of a priori distinctions between learning algorithms", Neural Computation Vol. 8(7), 1996)

"Heuristic (it is of Greek origin) means discovery. Heuristic methods are based on experience, rational ideas, and rules of thumb. Heuristics are based more on common sense than on mathematics. Heuristics are useful, for example, when the optimal solution needs an exhaustive search that is not realistic in terms of time. In principle, a heuristic does not guarantee the best solution, but a heuristic solution can provide a tremendous shortcut in cost and time." (Nikola K Kasabov,Foundations of Neural Networks, Fuzzy Systems, and Knowledge Engineering", 1996)

"Theories of choice are at best approximate and incomplete. One reason for this pessimistic assessment is that choice is a constructive and contingent process. When faced with a complex problem, people employ a variety of heuristic procedures in order to simplify the representation and the evaluation of prospects. These procedures include computational shortcuts and editing operations, such as eliminating common components and discarding nonessential differences. The heuristics of choice do not readily lend themselves to formal analysis because their application depends on the formulation of the problem, the method of elicitation, and the context of choice." (Amos Tversky & Daniel Kahneman,Advances in Prospect Theory: Cumulative Representation of Uncertainty" [inChoices, Values, and Frames"], 2000)

"Behavioural research shows that we tend to use simplifying heuristics when making judgements about uncertain events. These are prone to biases and systematic errors, such as stereotyping, disregard of sample size, disregard for regression to the mean, deriving estimates based on the ease of retrieving instances of the event, anchoring to the initial frame, the gambler’s fallacy, and wishful thinking, which are all affected by our inability to consider more than a few aspects or dimensions of any phenomenon or situation at the same time." (Hans G Daellenbach & Donald C McNickle,Management Science: Decision making through systems thinking", 2005)

"A decision theory that rests on the assumptions that human cognitive capabilities are limited and that these limitations are adaptive with respect to the decision environments humans frequently encounter. Decision are thought to be made usually without elaborate calculations, but instead by using fast and frugal heuristics. These heuristics certainly have the advantage of speed and simplicity, but if they are well matched to a decision environment, they can even outperform maximizing calculations with respect to accuracy. The reason for this is that many decision environments are characterized by incomplete information and noise. The information we do have is usually structured in a specific way that clever heuristics can exploit." (E Ebenhoh,Agent-Based Modelnig with Boundedly Rational Agents", 2007)

"Optimization systems (or optimizers, as they are often referred to) aim to optimize in a systematic way, oftentimes using a heuristics-based approach. Such an approach enables the AI system to use a macro level concept as part of its low-level calculations, accelerating the whole process and making it more light-weight. After all, most of these systems are designed with scalability in mind, so the heuristic approach is most practical." (Yunus E Bulut & Zacharias Voulgaris,AI for Data Science: Artificial Intelligence Frameworks and Functionality for Deep Learning, Optimization, and Beyond", 2018)

"The social world that humans have made for themselves is so complex that the mind simplifies the world by using heuristics, customs, and habits, and by making models or assumptions about how things generally work (the ‘causal structure of the world’). And because people rely upon" (and are invested in) these mental models, they usually prefer that they remain uncontested." (Dr James Brennan,Psychological  Adjustment to Illness and Injury", West of England Medical Journal Vol. 117 (2), 2018)

"Many AI systems employ heuristic decision making, which uses a strategy to find the most likely correct decision to avoid the high cost" (time) of processing lots of information. We can think of those heuristics as shortcuts or rules of thumb that we would use to make fast decisions." (Jesús Barrasa et al,Knowledge Graphs: Data in Context for Responsive Businesses", 2021)

"Once we know something is fat-tailed, we can use heuristics to see how an exposure there reacts to random events: how much is a given unit harmed by them. It is vastly more effective to focus on being insulated from the harm of random events than try to figure them out in the required details" (as we saw the inferential errors under thick tails are huge). So it is more solid, much wiser, more ethical, and more effective to focus on detection heuristics and policies rather than fabricate statistical properties." (Nassim N Taleb,Statistical Consequences of Fat Tails: Real World Preasymptotics, Epistemology, and Applications" 2nd Ed., 2022)

03 May 2026

🔭Data Science: Tails (Just the Quotes)

"Some distributions [...] are symmetrical about their central value. Other distributions have marked asymmetry and are said to be skew. Skew distributions are divided into two types. If the 'tail' of the distribution reaches out into the larger values of the variate, the distribution is said to show positive skewness; if the tail extends towards the smaller values of the variate, the distribution is called negatively skew." (Michael J Moroney,Facts from Figures", 1951)

"Logging size transforms the original skewed distribution into a more symmetrical one by pulling in the long right tail of the distribution toward the mean. The short left tail is, in addition, stretched. The shift toward symmetrical distribution produced by the log transform is not, of course, merely for convenience. Symmetrical distributions, especially those that resemble the normal distribution, fulfill statistical assumptions that form the basis of statistical significance testing in the regression model." (Edward R Tufte,Data Analysis for Politics and Policy", 1974)

"Equal variability is not always achieved in plots. For instance, if the theoretical distribution for a probability plot has a density that drops off gradually to zero in the tails" (as the normal density does), then the variability of the data in the tails of the probability plot is greater than in the center. Another example is provided by the histogram. Since the height of any one bar has a binomial distribution, the standard deviation of the height is approximately proportional to the square root of the expected height; hence, the variability of the longer bars is greater." (John M Chambers et al,Graphical Methods for Data Analysis", 1983)

"If the sample is not representative of the population because the sample is small or biased, not selected at random, or its constituents are not independent of one another, then the bootstrap will fail. […] For a given size sample, bootstrap estimates of percentiles in the tails will always be less accurate than estimates of more centrally located percentiles. Similarly, bootstrap interval estimates for the variance of a distribution will always be less accurate than estimates of central location such as the mean or median because the variance depends strongly upon extreme values in the population." (Phillip I Good & James W Hardin,Common Errors in Statistics" (and How to Avoid Them)", 2003)

"Bell curves don't differ that much in their bells. They differ in their tails. The tails describe how frequently rare events occur. They describe whether rare events really are so rare. This leads to the saying that the devil is in the tails." (Bart Kosko,Noise", 2006)

"Readability in visualization helps people interpret data and make conclusions about what the data has to say. Embed charts in reports or surround them with text, and you can explain results in detail. However, take a visualization out of a report or disconnect it from text that provides context" (as is common when people share graphics online), and the data might lose its meaning; or worse, others might misinterpret what you tried to show." (Nathan Yau,Data Points: Visualization That Means Something", 2013)

"A very different - and very incorrect - argument is that successes must be balanced by failures (and failures by successes) so that things average out. Every coin flip that lands heads makes tails more likely. Every red at roulette makes black more likely. […] These beliefs are all incorrect. Good luck will certainly not continue indefinitely, but do not assume that good luck makes bad luck more likely, or vice versa." (Gary Smith,Standard Deviations", 2014)

"The more complex the system, the more variable (risky) the outcomes. The profound implications of this essential feature of reality still elude us in all the practical disciplines. Sometimes variance averages out, but more often fat-tail events beget more fat-tail events because of interdependencies. If there are multiple projects running, outlier (fat-tail) events may also be positively correlated - one IT project falling behind will stretch resources and increase the likelihood that others will be compromised." (Paul Gibbons,The Science of Successful Organizational Change",  2015)

"Many statistical procedures perform more effectively on data that are normally distributed, or at least are symmetric and not excessively kurtotic" (fat-tailed), and where the mean and variance are approximately constant. Observed time series frequently require some form of transformation before they exhibit these distributional properties, for in their 'raw' form they are often asymmetric." (Terence C Mills,Applied Time Series Analysis: A practical guide to modeling and forecasting", 2019)

"Mean-averages can be highly misleading when the raw data do not form a symmetric pattern around a central value but instead are skewed towards one side [...], typically with a large group of standard cases but with a tail of a few either very high" (for example, income) or low" (for example, legs) values." (David Spiegelhalter,The Art of Statistics: Learning from Data", 2019)

"[…] it is not merely that events in the tails of the distributions matter, happen, play a large role, etc. The point is that these events play the major role and their probabilities are not" (easily) computable, not reliable for any effective use. The implication is that Black Swans do not necessarily come from fat tails; the problem can result from an incomplete assessment of tail events." (Nassim N Taleb,Statistical Consequences of Fat Tails: Real World Preasymptotics, Epistemology, and Applications" 2nd Ed., 2022)

"[…] whenever people make decisions after being supplied with the standard deviation number, they act as if it were the expected mean deviation." (Nassim N Taleb,Statistical Consequences of Fat Tails: Real World Preasymptotics, Epistemology, and Applications" 2nd Ed., 2022)

"Behavioral finance so far makes conclusions from statics not dynamics, hence misses the picture. It applies trade-offs out of context and develops the consensus that people irrationally overestimate tail risk" (hence need to be 'nudged' into taking more of these exposures). But the catastrophic event is an absorbing barrier. No risky exposure can be analyzed in isolation: risks accumulate. If we ride a motorcycle, smoke, fly our own propeller plane, and join the mafia, these risks add up to a near-certain premature death. Tail risks are not a renewable resource." (Nassim N Taleb,Statistical Consequences of Fat Tails: Real World Preasymptotics, Epistemology, and Applications" 2nd Ed., 2022)

"But note that any heavy tailed process, even a power law, can be described in sample" (that is finite number of observations necessarily discretized) by a simple Gaussian process with changing variance, a regime switching process, or a combination of Gaussian plus a series of variable jumps" (though not one where jumps are of equal size […])." (Nassim N Taleb,Statistical Consequences of Fat Tails: Real World Preasymptotics, Epistemology, and Applications" 2nd Ed., 2022)

"Once we know something is fat-tailed, we can use heuristics to see how an exposure there reacts to random events: how much is a given unit harmed by them. It is vastly more effective to focus on being insulated from the harm of random events than try to figure them out in the required details" (as we saw the inferential errors under thick tails are huge). So it is more solid, much wiser, more ethical, and more effective to focus on detection heuristics and policies rather than fabricate statistical properties." (Nassim N Taleb,Statistical Consequences of Fat Tails: Real World Preasymptotics, Epistemology, and Applications" 2nd Ed., 2022)

"No one sees further into a generalization than his own knowledge of detail extends." (William James)

"Remember that a p-value merely indicates the probability of a particular set of data being generated by the null model–it has little to say about the size of a deviation from that model" (especially in the tails of the distribution, where large changes in effect size cause only small changes in p-values)." (Clay Helberg)


02 May 2026

🔭Data Science: Skewness (Just the Quotes)

"Some distributions [...] are symmetrical about their central value. Other distributions have marked asymmetry and are said to be skew. Skew distributions are divided into two types. If the 'tail' of the distribution reaches out into the larger values of the variate, the distribution is said to show positive skewness; if the tail extends towards the smaller values of the variate, the distribution is called negatively skew." (Michael J Moroney, "Facts from Figures", 1951)

"Logging size transforms the original skewed distribution into a more symmetrical one by pulling in the long right tail of the distribution toward the mean. The short left tail is, in addition, stretched. The shift toward symmetrical distribution produced by the log transform is not, of course, merely for convenience. Symmetrical distributions, especially those that resemble the normal distribution, fulfill statistical assumptions that form the basis of statistical significance testing in the regression model." (Edward R Tufte, "Data Analysis for Politics and Policy", 1974)

"Logging skewed variables also helps to reveal the patterns in the data. […] the rescaling of the variables by taking logarithms reduces the nonlinearity in the relationship and removes much of the clutter resulting from the skewed distributions on both variables; in short, the transformation helps clarify the relationship between the two variables. It also […] leads to a theoretically meaningful regression coefficient." (Edward R Tufte, "Data Analysis for Politics and Policy", 1974)

"The logarithmic transformation serves several purposes: (1) The resulting regression coefficients sometimes have a more useful theoretical interpretation compared to a regression based on unlogged variables. (2) Badly skewed distributions - in which many of the observations are clustered together combined with a few outlying values on the scale of measurement - are transformed by taking the logarithm of the measurements so that the clustered values are spread out and the large values pulled in more toward the middle of the distribution. (3) Some of the assumptions underlying the regression model and the associated significance tests are better met when the logarithm of the measured variables is taken." (Edward R Tufte, "Data Analysis for Politics and Policy", 1974)

"The logarithm is an extremely powerful and useful tool for graphical data presentation. One reason is that logarithms turn ratios into differences, and for many sets of data, it is natural to think in terms of ratios. […] Another reason for the power of logarithms is resolution. Data that are amounts or counts are often very skewed to the right; on graphs of such data, there are a few large values that take up most of the scale and the majority of the points are squashed into a small region of the scale with no resolution." (William S. Cleveland, "Graphical Methods for Data Presentation: Full Scale Breaks, Dot Charts, and Multibased Logging", The American Statistician Vol. 38 (4) 1984)

"It is common for positive data to be skewed to the right: some values bunch together at the low end of the scale and others trail off to the high end with increasing gaps between the values as they get higher. Such data can cause severe resolution problems on graphs, and the common remedy is to take logarithms. Indeed, it is the frequent success of this remedy that partly accounts for the large use of logarithms in graphical data display." (William S Cleveland, "The Elements of Graphing Data", 1985)

"If a distribution were perfectly symmetrical, all symmetry-plot points would be on the diagonal line. Off-line points indicate asymmetry. Points fall above the line when distance above the median is greater than corresponding distance below the median. A consistent run of above-the-line points indicates positive skew; a run of below-the-line points indicates negative skew." (Lawrence C Hamilton, "Regression with Graphics: A second course in applied statistics", 1991)

"Skewness is a measure of symmetry. For example, it's zero for the bell-shaped normal curve, which is perfectly symmetric about its mean. Kurtosis is a measure of the peakedness, or fat-tailedness, of a distribution. Thus, it measures the likelihood of extreme values." (John L Casti, "Reality Rules: Picturing the world in mathematics", 1992)

"Data that are skewed toward large values occur commonly. Any set of positive measurements is a candidate. Nature just works like that. In fact, if data consisting of positive numbers range over several powers of ten, it is almost a guarantee that they will be skewed. Skewness creates many problems. There are visualization problems. A large fraction of the data are squashed into small regions of graphs, and visual assessment of the data degrades. There are characterization problems. Skewed distributions tend to be more complicated than symmetric ones; for example, there is no unique notion of location and the median and mean measure different aspects of the distribution. There are problems in carrying out probabilistic methods. The distribution of skewed data is not well approximated by the normal, so the many probabilistic methods based on an assumption of a normal distribution cannot be applied." (William S Cleveland, "Visualizing Data", 1993)

"The logarithm is one of many transformations that we can apply to univariate measurements. The square root is another. Transformation is a critical tool for visualization or for any other mode of data analysis because it can substantially simplify the structure of a set of data. For example, transformation can remove skewness toward large values, and it can remove monotone increasing spread. And often, it is the logarithm that achieves this removal." (William S Cleveland, "Visualizing Data", 1993)

"When the distributions of two or more groups of univariate data are skewed, it is common to have the spread increase monotonically with location. This behavior is monotone spread. Strictly speaking, monotone spread includes the case where the spread decreases monotonically with location, but such a decrease is much less common for raw data. Monotone spread, as with skewness, adds to the difficulty of data analysis. For example, it means that we cannot fit just location estimates to produce homogeneous residuals; we must fit spread estimates as well. Furthermore, the distributions cannot be compared by a number of standard methods of probabilistic inference that are based on an assumption of equal spreads; the standard t-test is one example. Fortunately, remedies for skewness can cure monotone spread as well." (William S Cleveland, "Visualizing Data", 1993)

"The standard deviation (often SD) is a measure of variability. When we calculate the standard deviation of a sample, we are using it as an estimate of the variability of the population from which the sample was drawn. For data with a normal distribution, about 95% of individuals will have values within 2 standard deviations of the mean, the other 5% being equally scattered above and below these limits. Contrary to popular misconception, the standard deviation is a valid measure of variability regardless of the distribution. About 95% of observations of any distribution usually fall within the 2 standard deviation limits, though those outside may all be at one end. We may choose a different summary statistic, however, when data have a skewed distribution." (Douglas G Altman & J Martin Bland, "Statistics Notes: Standard Deviations And Standard Errors", British Medical Journal Vol. 331 (7521) 2005)
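
The 2 standard deviation rule can be checked empirically. The sketch below is illustrative only: the simulated normal and exponential samples and the `two_sd_coverage` helper are assumptions, not part of the cited note. As a general bound, Chebyshev's inequality guarantees only that at least 75% of any distribution lies within 2 standard deviations of the mean, so "about 95%" describes the usual case rather than a guarantee.

```python
import random
import statistics


def two_sd_coverage(values):
    """Return (share within mean +/- 2 SD, count below, count above)."""
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)
    low, high = mean - 2 * sd, mean + 2 * sd
    below = sum(v < low for v in values)
    above = sum(v > high for v in values)
    inside = len(values) - below - above
    return inside / len(values), below, above


rng = random.Random(1)
normal = [rng.gauss(0.0, 1.0) for _ in range(20_000)]
skewed = [rng.expovariate(1.0) for _ in range(20_000)]  # skewed toward large values

normal_share, normal_below, normal_above = two_sd_coverage(normal)
skewed_share, skewed_below, skewed_above = two_sd_coverage(skewed)

print(f"normal:      {normal_share:.1%} inside, {normal_below} below, {normal_above} above")
print(f"exponential: {skewed_share:.1%} inside, {skewed_below} below, {skewed_above} above")
```

For the skewed sample the share inside the limits is still close to 95%, but every observation outside them sits in the upper tail.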

"Use a logarithmic scale when it is important to understand percent change or multiplicative factors. […] Showing data on a logarithmic scale can cure skewness toward large values." (Naomi B Robbins, "Creating More Effective Graphs", 2005)

"Distributional shape is an important attribute of data, regardless of whether scores are analyzed descriptively or inferentially. Because the degree of skewness can be summarized by means of a single number, and because computers have no difficulty providing such measures (or estimates) of skewness, those who prepare research reports should include a numerical index of skewness every time they provide measures of central tendency and variability." (Schuyler W Huck, "Statistical Misconceptions", 2008)

"Given the important role that correlation plays in structural equation modeling, we need to understand the factors that affect establishing relationships among multivariable data points. The key factors are the level of measurement, restriction of range in data values (variability, skewness, kurtosis), missing data, nonlinearity, outliers, correction for attenuation, and issues related to sampling variation, confidence intervals, effect size, significance, sample size, and power." (Randall E Schumacker & Richard G Lomax, "A Beginner’s Guide to Structural Equation Modeling" 3rd Ed., 2010)

"[The normality] assumption is the least important one for the reliability of the statistical procedures under discussion. Violations of the normality assumption can be divided into two general forms: Distributions that have heavier tails than the normal and distributions that are skewed rather than symmetric. If data is skewed, the formulas we are discussing are still valid as long as the sample size is sufficiently large. Although the guidance about 'how skewed' and 'how large a sample' can be quite vague, since the greater the skew, the larger the required sample size. For the data commonly used in time series and for the sample sizes (which are generally quite large) used, skew is not a problem. On the other hand, heavy tails can be very problematic." (DeWayne R Derryberry, "Basic Data Analysis for Time Series with R" 1st Ed, 2014)

"In statistical theory, location and variability are referred to as the first and second moments of a distribution. The third and fourth moments are called skewness and kurtosis. Skewness refers to whether the data is skewed to larger or smaller values and kurtosis indicates the propensity of the data to have extreme values. Generally, metrics are not used to measure skewness and kurtosis; instead, these are discovered through visual displays [...]" (Peter C Bruce & Andrew G Bruce, "Statistics for Data Scientists: 50 Essential Concepts", 2016)

"A histogram represents the frequency distribution of the data. Histograms are similar to bar charts but group numbers into ranges. Also, a histogram lets you show the frequency distribution of continuous data. This helps in analyzing the distribution (for example, normal or Gaussian), any outliers present in the data, and skewness." (Umesh R Hodeghatta & Umesha Nayak, "Business Analytics Using R: A Practical Approach", 2017)

"New information is constantly flowing in, and your brain is constantly integrating it into this statistical distribution that creates your next perception (so in this sense 'reality' is just the product of your brain’s ever-evolving database of consequence). As such, your perception is subject to a statistical phenomenon known in probability theory as kurtosis. Kurtosis in essence means that things tend to become increasingly steep in their distribution [...] that is, skewed in one direction. This applies to ways of seeing everything from current events to ourselves as we lean 'skewedly' toward one interpretation, positive or negative. Things that are highly kurtotic, or skewed, are hard to shift away from. This is another way of saying that seeing differently isn’t just conceptually difficult - it’s statistically difficult." (Beau Lotto, "Deviate: The Science of Seeing Differently", 2017)

"Mean-averages can be highly misleading when the raw data do not form a symmetric pattern around a central value but instead are skewed towards one side [...], typically with a large group of standard cases but with a tail of a few either very high (for example, income) or low (for example, legs) values." (David Spiegelhalter, "The Art of Statistics: Learning from Data", 2019)

"With skewed data, quantiles will reflect the skew, while adding standard deviations assumes symmetry in the distribution and can be misleading." (Robert Grant, "Data Visualization: Charts, Maps and Interactive Graphics", 2019)
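
A small sketch of the contrast between quantiles and "mean plus or minus standard deviations" on skewed data. The lognormal sample is a hypothetical stand-in for strictly positive measurements such as durations or incomes.

```python
import random
import statistics

rng = random.Random(3)

# Hypothetical right-skewed, strictly positive data.
data = [rng.lognormvariate(0.0, 1.0) for _ in range(10_000)]

mean = statistics.fmean(data)
sd = statistics.stdev(data)
sd_low, sd_high = mean - 2 * sd, mean + 2 * sd

# With n=40, quantiles() returns 39 cut points at 2.5%, 5%, ..., 97.5%.
cuts = statistics.quantiles(data, n=40)
q_low, median, q_high = cuts[0], cuts[19], cuts[-1]

print(f"mean +/- 2 SD:        [{sd_low:.2f}, {sd_high:.2f}]")
print(f"2.5% to 97.5% range:  [{q_low:.2f}, {q_high:.2f}]  (median {median:.2f})")
```

The symmetric interval extends below zero even though no value in the sample is negative, while the quantile-based interval stays within the data and is much longer above the median than below it.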

"Adjusting scale is an important practice in data visualization. While the log transform is versatile, it doesn’t handle all situations where skew or curvature occurs. For example, at times the values are all roughly the same order of magnitude and the log transformation has little impact. Another transformation to consider is the square root transformation, which is often useful for count data." (Sam Lau et al, "Learning Data Science: Data Wrangling, Exploration, Visualization, and Modeling with Python", 2023)

24 April 2025

🧭Business Intelligence: Perspectives (Part 30: The Data Science Connection)

Business Intelligence Series

Data Science is a collection of quantitative and qualitative methods, respectively techniques, algorithms, principles, processes and technologies used to analyze and process large amounts of raw and aggregated data to extract the information or knowledge they contain. Its theoretical basis is rooted in mathematics (mainly statistics), computer science and domain expertise, though it can include further aspects related to communication, management, sociology, ecology, cybernetics, and probably many other fields, as there’s enough space for experimentation and translation of knowledge from one field to another.

The aim of Data Science is to extract valuable insights from data to support decision-making and problem-solving and to drive innovation, and it can probably achieve more in time. Reading between the lines, Data Science sounds like a superhero that can solve all the problems out there, which frankly is too good to be true! In theory everything is possible, while in practice there are many hard limitations! Given any amount of data, the knowledge that can be obtained from it can be limited by many factors - the degree to which the data, processes and models built reflect reality (and there can be many levels of approximation), respectively the degree to which such data can be collected consistently.

Moreover, even if the theoretical basis seems sound, the data, information or knowledge that is not available can be the missing link in making any sensible progress toward the goals set in Data Science projects. In some cases, one might be aware of what's missing, though for a data scientist who lacks the required domain knowledge, this can be a hard limit! This gap can probably be bridged with sensemaking, exploration and experimentation approaches, especially by applying models from other domains, though there are no guarantees!

AI can help in this direction through its capacity to explore ideas or models quickly. However, it's questionable how far the models built with AI can be used if one can't build mechanistic mental models of the processes reflected in the data. It's like devising an algorithm for winning small amounts at the lottery: investing more money in the algorithm doesn't automatically imply greater wins. Even if the performance occasionally improves, it's questionable how much it can be leveraged for each utilization. Statistics has its utility when one studies data in aggregation and can predict average behavior. It can’t be used to predict the occurrence of individual events with high precision. Think how hard the prediction of earthquakes or extreme weather is by just looking at a pile of data reflecting what’s happening only in a certain zone!

In theory, the more data one has from different geographical areas or organizations, the more robust the models can become. However, no two geographies, respectively no two organizations, are alike: business models, people, events and other aspects make global models less applicable to the local context. Frankly, one has a better chance of progress if a model is built with a local scope and one then attempts to leverage it for a broader scope. Even then, there can be differences between the behavior or phenomena at micro, respectively at macro level (see the laws of physics).

This doesn’t mean that Data Science or AI-related knowledge is useless. The knowledge accumulated by applying various techniques, models and programming languages in problem-solving can be more valuable than the results obtained! Experimentation is a must for organizations to innovate and to extend their knowledge base. It’s also questionable how much of the respective knowledge can be retained and put to good use. In the end, each organization must determine this by itself!

17 September 2024

#️⃣Software Engineering: Mea Culpa (Part V: All-Knowing Developers are Back in Demand?)

Software Engineering Series

I’ve been reading many job descriptions lately related to my experience and, curiously or not, I observed that many organizations look for developers with Microsoft Dynamics experience in the CRM, respectively Finance and Operations (F&O) and Business Central (BC) areas. It’s a good sign that the adoption of Microsoft solutions for CRM and ERP is increasing, especially when one considers the progress made in the BI and AI areas with the introduction of Microsoft Fabric, which gives Microsoft a considerable boost. Conversely, it seems that the "developers are good for everything" syntagma is back, at least from what one reads in job descriptions.

Of course, it’s useful to have an in-house developer who can address all the aspects of an implementation, though that’s a lot to ask considering the different non-programming areas that need to be addressed. It’s true that a developer with experience can handle Requirements, Data and Process Management, respectively Data Migrations and Business Intelligence topics, though each of these topics can easily become a full-time job before, during and after a project's implementation. I’ve been there and I (hopefully) know what the jobs imply. Even if an experienced programmer can easily handle the different aspects, there will also be times when all the topics combined will be too much for one person!

It's not a novelty that job descriptions are treated like Christmas lists, but it’s difficult to differentiate between essential and nonessential skills. I have read many job descriptions lately in which, among a huge list of demands, one of the requirements is to program in the F&O framework, a sign that D365 programmers are in high demand. I worked for many years as a programmer and Software Engineer, respectively in the BI area, where SQL and non-SQL code is needed. Even if I can understand the code in F&O, does it make sense to learn now to program in X++ and the whole framework?

It's never too late to learn new tricks, respectively another programming language and/or framework. It even helps to provide better solutions in the usual areas, though frankly I would invest my time in other areas, and AI-related topics like AI prompting or Data Science seem to be more interesting in the long run, especially when they are already in demand!

There seems to be a tendency for Data Science professionals to do everything, building their own solutions and ignoring the experience accumulated, respectively the data models built, in the BI and Data Analytics areas, as if the topics and data models were unrelated! It’s also true that AI modeling comes with its own requirements in what concerns data modeling (e.g. translating non-numeric to numeric values), though I believe that common ground can be found!

Similarly, notebook-based programming seems to replicate logic in each solution, which occasionally makes sense, though personally I wouldn’t recommend it as a practice! The other day, I was looking at code developed in Python to mimic the joining of tables, when a view with the same logic could be more easily (re)used, maintained and read, and would probably be more efficient, even if different engines are used. It will be interesting to see how the mix of spaghetti solutions will evolve over time. There are developers already complaining about the number of objects needed when building logic for each layer of the medallion architecture! Even if it makes sense from an architectural standpoint, it will become a nightmare in time.
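
To make the comparison concrete, here is a minimal sketch using an in-memory SQLite database. The tables, columns and the view name `v_customer_orders` are hypothetical and are not taken from the code I was looking at; the point is only that the hand-written join and the view return the same rows, while the view keeps the logic in one reusable place.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
    INSERT INTO customers VALUES (1, 'Alpha'), (2, 'Beta');
    INSERT INTO orders VALUES (10, 1, 100.0), (11, 1, 50.0), (12, 2, 75.0);

    -- The join is defined once and can be reused by any notebook or report.
    CREATE VIEW v_customer_orders AS
    SELECT c.customer_id, c.name, o.order_id, o.amount
    FROM customers c
    JOIN orders o ON o.customer_id = c.customer_id;
    """
)

# Notebook style: load both tables and mimic the join by hand.
customers = dict(conn.execute("SELECT customer_id, name FROM customers"))
manual_join = sorted(
    (customer_id, customers[customer_id], order_id, amount)
    for order_id, customer_id, amount in conn.execute(
        "SELECT order_id, customer_id, amount FROM orders"
    )
    if customer_id in customers
)

# View style: the same result with a single query against the view.
view_rows = sorted(
    conn.execute("SELECT customer_id, name, order_id, amount FROM v_customer_orders")
)

print(manual_join == view_rows)
```

With only two tables the difference looks small, but each additional join condition or filter has to be repeated in every notebook that rebuilds the logic by hand.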

One can also wonder about the nomenclature used – Data Engineering or Prompt Engineering for the simple manipulation of data between structures in data transformations, respectively for structuring the prompts for AI. I believe that engineering involves more than this, no matter the context!


17 February 2024

🧭Business Intelligence: A Software Engineer's Perspective I (Houston, we have a Problem!)

Business Intelligence Series

One of the criticisms addressed to the BI/Data Analytics, Data Engineering and even Data Science fields is their resistance to applying Software Engineering (SE) methods in practice. SE can be regarded as the application of sound methods, methodologies, techniques, principles, and practices to obtain high-quality, economical software in a reproducible manner. At a minimum, SE techniques and practices proven to work should be applied, for example the use of best practices, reference technologies, standardized processes for requirements gathering and management, etc. This doesn't mean that one should apply the full extent of SE but consider a minimum that makes sense to adopt.

Unfortunately, the creation of data artifacts (queries, reports, data models, data pipelines, data visualizations, etc.) as a process seems to follow the principle of least action, though least action means here the minimum interaction needed to push pieces on a board rather than getting things done. At a high level, the process is as follows: get the requirements, build something, present the results, get more requirements, make changes, present the results, and the process is repeated ad infinitum.

Given that the creation of data artifacts finds itself at the intersection of two or more knowledge areas, in which knowledge is exchanged in several iterations between the parties involved until common ground is reached, this process is inefficient from multiple perspectives. First of all, it takes considerably more time than planned to reach a solution, with resources being wasted in the process and multiple forms of waste being involved. Secondly, the exchange and retention of knowledge resulting from the process is minimal, mainly on an as-needed basis. This might look like an efficient approach in the short term, but it is inefficient overall.

BI reflects the general issues from SE - most of the issues can be traced back to requirements - if the requirements are incorrect and there's no magic involved in between, then one can't expect the solution to be correct. The bigger the difference between the initial and final requirements elicited in the process, the more resources are wasted. The more time passes between the start of the development phase and the time a solution is presented to the customer, the longer it takes to build the final solution. The time it takes to establish common ground and other critical success factors involved in the process have the same impact.

One can address these issues through better requirements elicitation, rapid prototyping, the use of agile methodologies and similar approaches, though the general feeling is that even if they bring improvements, they don't address the root causes - lack of data literacy skills, lack of knowledge about the business, lack of maturity in planning and executing tasks, the absence of well-designed processes and procedures, respectively the lack of an engineering mindset.

These inefficiencies have a low impact when building a report occasionally, though they accumulate and tend to create systemic issues in what concerns the overall BI effort. They can be addressed locally by experts and in general through a strategic approach like the elaboration of a BI strategy, though organizations seldom pay attention to them. Some organizations consider that they are automatically addressed as part of the data culture, though data culture focuses in general on data literacy and not on the whole set of aspects mentioned above.

An experienced data professional is more likely to see the inefficiencies and tries to address them locally in his/her interactions with the various stakeholders. He/she can build a business case for addressing them, though it depends on organizations to recognize that they have a problem, respectively to address the inefficiencies in a strategic and systemic manner!


13 February 2024

🧭Business Intelligence: A One-Man Show (Part V: Focus on the Foundation)

Business Intelligence Suite

I tend to agree that one person can no longer do "everything in the data space", as Christopher Laubenthal put it in his article on the topic [1]. He seems to capture the essence of some of the core data roles found in organizations. Summarizing these roles, data architecture is about designing and building a data infrastructure, data engineering is about moving data, database administration is mainly about managing databases, data analysis is about assisting the business with data and reports, information design is about telling stories, while data science can be about studying the impact of various components on the data.

However, I find his analogy between a college's functional structure and the core data roles poorly chosen from multiple perspectives, even if both are about building an infrastructure of some type.

Firstly, the two constructions have different foundations. Data exists in an organization even without data architects, data engineers or database administrators (DBAs)! It's enough to buy one or more information systems functioning as islands and reporting needs will arise. The need for a data architect might come when the systems need to be integrated or maybe when a data warehouse needs to be built, though many organizations are still in business without such constructs. For the others, the more complex the integrations, the bigger the need for a Data Architect. Conversely, some systems can be integrated by design and such capabilities might drive their selection.

Data engineering is needed mainly in the context of the cloud, respectively of data lake-based architectures, where data needs to be moved, processed and prepared for consumption. Conversely, architectures like Microsoft Fabric minimize data movement, the focus being on data processing, on the successive transformations the data needs to undergo in moving from the bronze to the gold layer, respectively on creating an organizational semantic data model. The complexity of the data processing depends on the data's structuredness, quality and other characteristics.

As I mentioned before, modern databases, including the ones in the cloud, reduce the need for DBAs to a considerable degree. Unless the volume of work is big enough to consider a DBA role as an in-house resource, organizations will more likely consider involving a service provider and a contingent to cover the needs. 

Having in-house one or more people acting under the Data Analyst role, people who know and understand the business, respectively the data tools used in the process, can go a long way. Moreover, it's helpful to have an evangelist-like resource in house, a person who is able to raise awareness and knowhow, help diffuse knowledge about tools, techniques, data, results, best practices, respectively act as a mentor for the Data Analyst citizens. From my point of view, these are the people who form the data-related backbone (foundation) of an organization and this is the minimum of what an organization should have!

Once this is established, one can build data warehouses, data integrations and other support architectures, respectively think about BI and Data strategy, Data Governance, etc. Of course, having a Chief Data Officer and a Data Strategy in place can bring more structure in handling the topics at the various levels - strategic, tactical, respectively operational. In construction one starts with a blueprint, and a data strategy can have the same effect if one knows how to write it and implement it accordingly. However, the strategy is just a tool, while the data-knowledgeable workers are the foundation on which organizations should build!

The "build it and they will come" philosophy can work as well, though without knowledgeable and inquisitive people the philosophy is likely to fail.


Resources:
[1] Christopher Laubenthal (2024) "Why One Person Can’t Do Everything In Data" (link)

🧭Business Intelligence: A One-Man Show (Part IV: Data Roles between Past and Future)

Business Intelligence Series

Databases nowadays are highly secure, reliable and available to a degree that reduces the involvement of DBAs to a minimum. The more databases and servers are available in an organization, and the older they are, the bigger the need for dedicated resources to manage them. The number of DBAs involved tends to be proportional to the volume of work required by the database infrastructure. However, if the infrastructure is in the cloud, managed by the cloud providers, it's enough to have a person in the middle who manages the communication between cloud provider(s) and the organization. The person doesn't even need to be a DBA, even if some knowledge in the field is usually recommended.

The requirement for a Data Architect comes when there are several systems in place and there are multiple projects to integrate or build around the respective systems. It's also a question of what drives the respective requirement - is it the knowledge of data architectures, the supervision of changes, and/or the review of technical documents? The requirement is thus driven by the projects in progress and those waiting in the pipeline. Conversely, if all the systems are in the cloud and their integration is standardized or doesn't involve much architectural knowledge, the role becomes obsolete or at least not mandatory.

The Data Engineer role is a bit more challenging to define because it appeared in the context of cloud-based data architectures. It seems to be related to data movement via ETL/ELT pipelines and to data processing and preparation for the various needs. Data modeling or data presentation knowledge isn't mandatory, even if ideal. The role seems to overlap with that of a Data Warehouse professional, be it a simple architect or developer. The role's know-how depends also on the tools involved, because it is one thing to build a solution based on a standard SQL Server, and another thing to use dedicated layers and architectures for the various purposes. The number of engineers should be proportional to the number of data entities involved.

Conversely, the existence of solutions that move and process the data as needed can reduce the volume of work. Moreover, the use of AI-driven tools like Copilot might shift the focus from data to prompt engineering.

The Data Analyst role is kind of a Cinderella - it can involve, depending on the case, everything from requirements elicitation to report writing and the interpretation of results, respectively from data collection and data modeling to data visualization. If you have a special wish related to your data, just add it to the role! The number of analysts should be related to the number of issues existing in the organization where the collection and processing of data could make a difference. Conversely, the Data Citizen, even if it's not a role but a desirable state, could absorb in theory the Data Analyst role.

The Data Scientist is supposed to reveal the gems of knowledge hidden in the data by using Machine Learning, Statistics and other magical tools. The more data available, the higher the chances of finding something, even if it is probably statistically insignificant or incorrect. The role makes sense mainly in the context of big data, even if some opportunities might be available at smaller scales. The number of scientists depends on the number of projects focused on the big questions. Again, one can talk about the Data Scientist citizen.

The Information Designer role seems to be more about data visualization and presentation. It makes sense in organizations that rely heavily on visual content. All the other organizations can rely on the default settings of data visualization tools, independently of whether AI is involved or not.



About Me

Koeln, NRW, Germany
IT Professional with more than 25 years experience in IT in the area of full life-cycle of Web/Desktop/Database Applications Development, Software Engineering, Consultancy, Data Management, Data Quality, Data Migrations, Reporting, ERP implementations & support, Team/Project/IT Management, etc.