12 November 2018

🔭Data Science: Statisticians (Just the Quotes)

"It is difficult to understand why statisticians commonly limit their inquiries to Averages, and do not revel in more comprehensive views. Their souls seem as dull to the charm of variety as that of the native of one of our flat English counties, whose retrospect of Switzerland was that, if its mountains could be thrown into its lakes, two nuisances would be got rid of at once. An Average is but a solitary fact, whereas if a single other fact be added to it, an entire Normal Scheme, which nearly corresponds to the observed one, starts potentially into existence." (Francis Galton, "Natural Inheritance", 1889) 

"Even trained statisticians often fail to appreciate the extent to which statistics are vitiated by the unrecorded assumptions of their interpreters." (George B Shaw, "The Doctor's Dilemma", 1906)

"Figures may not lie, but statistics compiled unscientifically and analyzed incompetently are almost sure to be misleading, and when this condition is unnecessarily chronic the so-called statisticians may be called liars." (Edwin B Wilson, "Bulletin of the American Mathematical Society", Vol 18, 1912)

"The statistician’s job is to draw general conclusions from fragmentary data. Too often the data supplied to him for analysis are not only fragmentary but positively incoherent, so that he can do next to nothing with them. Even the most kindly statistician swears heartily under his breath whenever this happens". (Michael J Moroney, "Facts from Figures", 1927)

"To consult the statistician after an experiment is finished is often merely to ask him to conduct a post mortem examination. He can perhaps say what the experiment died of." (Sir Ronald A Fisher, [presidential address] 1938)

"An inference, if it is to have scientific value, must constitute a prediction concerning future data. If the inference is to be made purely with the help of the distribution theory of statistics, the experiments that constitute evidence for the inference must arise from a state of statistical control; until that state is reached, there is no universe, normal or otherwise, and the statistician’s calculations by themselves are an illusion if not a delusion. The fact is that when distribution theory is not applicable for lack of control, any inference, statistical or otherwise, is little better than a conjecture. The state of statistical control is therefore the goal of all experimentation." (William E Deming, "Statistical Method from the Viewpoint of Quality Control", 1939)

"[Statistics] is both a science and an art. It is a science in that its methods are basically systematic and have general application; and an art in that their successful application depends to a considerable degree on the skill and special experience of the statistician, and on his knowledge of the field of application, e.g. economics." (Leonard H C Tippett, "Statistics", 1943)

"Errors of the third kind happen in conventional tests of differences of means, but they are usually not considered, although their existence is probably recognized. It seems to the author that there may be several reasons for this among which are 1) a preoccupation on the part of mathematical statisticians with the formal questions of acceptance and rejection of null hypotheses without adequate consideration of the implications of the error of the third kind for the practical experimenter, 2) the rarity with which an error of the third kind arises in the usual tests of significance." (Frederick Mosteller, "A k-Sample Slippage Test for an Extreme Population", The Annals of Mathematical Statistics 19, 1948)

"The characteristic which distinguishes the present-day professional statistician, is his interest and skill in the measurement of the fallibility of conclusions." (George W Snedecor, "On a Unique Feature of Statistics", [address] 1948)

"[...] to be a good theoretical statistician one must also compute, and must therefore have the best computing aids." (Frank Yates, "Sampling Methods for Censuses and Surveys", 1949)

"It is very easy to devise different tests which, on the average, have similar properties, [...] hey behave satisfactorily when the null hypothesis is true and have approximately the same power of detecting departures from that hypothesis. Two such tests may, however, give very different results when applied to a given set of data. The situation leads to a good deal of contention amongst statisticians and much discredit of the science of statistics. The appalling position can easily arise in which one can get any answer one wants if only one goes around to a large enough number of statisticians." (Frances Yates, "Discussion on the Paper by Dr. Box and Dr. Andersen", Journal of the Royal Statistical Society B Vol. 17, 1955)

"One feature [...] which requires much more justification than is usually given, is the setting up of unplausible null hypotheses. For example, a statistician may set out a test to see whether two drugs have exactly the same effect, or whether a regression line is exactly straight. These hypotheses can scarcely be taken literally." (Cedric A B Smith, "Book review of Norman T. J. Bailey: Statistical Methods in Biology", Applied Statistics 9, 1960)

"The statistician cannot excuse himself from the duty of getting his head clear on the principles of scientific inference, but equally no other thinking man can avoid a like obligation." (Sir Ronald A Fisher, "The Design of Experiments", 1971)

"Another reason for the applied statistician to care about Bayesian inference is that consumers of statistical answers, at least interval estimates, commonly interpret them as probability statements about the possible values of parameters. Consequently, the answers statisticians provide to consumers should be capable of being interpreted as approximate Bayesian statements." (Donald B Rubin, "Bayesianly justifiable and relevant frequency calculations for the applied statistician", Annals of Statistics 12(4), 1984)

"Too much of what all statisticians do [...] is blatantly subjective for any of us to kid ourselves or the users of our technology into believing that we have operated ‘impartially’ in any true sense. [...] We can do what seems to us most appropriate, but we can not be objective and would do well to avoid language that hints to the contrary." (Steve V Vardeman, Comment, Journal of the American Statistical Association 82, 1987)

"It is clear that a statistician who is involved at the start of an investigation, advises on data collection, and who knows the background and objectives, will generally make a better job of the analysis than a statistician who was called in later on." (Christopher Chatfield, "Problem solving: a statistician’s guide", 1988)

"Statistics is a very powerful and persuasive mathematical tool. People put a lot of faith in printed numbers. It seems when a situation is described by assigning it a numerical value, the validity of the report increases in the mind of the viewer. It is the statistician's obligation to be aware that data in the eyes of the uninformed or poor data in the eyes of the naive viewer can be as deceptive as any falsehoods." (Theoni Pappas, "More Joy of Mathematics: Exploring mathematical insights & concepts", 1991)

"At its core statistics is not about cleverness and technique, but rather about honesty. Its real contribution to society is primarily moral, not technical. It is about doing the right thing when interpreting empirical information. Statisticians are not the world’s best computer scientists, mathematicians, or scientific subject matter specialists. We are (potentially, at least) the best at the principled collection, summarization, and analysis of data." (Stephen B Vardeman & Max D Morris, "Statistics and Ethics: Some Advice for Young Statisticians", The American Statistician vol 57, 2003)

"Today [...] we have high-speed computers and prepackaged statistical routines to perform the necessary calculations [...] statistical software will no more make one a statistician than would a scalpel turn one into a neurosurgeon. Allowing these tools to do our thinking for us is a sure recipe for disaster." (Phillip Good & Hardin James, "Common Errors in Statistics and How to Avoid Them", 2003)

"What is so unconventional about the statistical way of thinking? First, statisticians do not care much for the popular concept of the statistical average; instead, they fixate on any deviation from the average. They worry about how large these variations are, how frequently they occur, and why they exist. [...] Second, variability does not need to be explained by reasonable causes, despite our natural desire for a rational explanation of everything; statisticians are frequently just as happy to pore over patterns of correlation. [...] Third, statisticians are constantly looking out for missed nuances: a statistical average for all groups may well hide vital differences that exist between these groups. Ignoring group differences when they are present frequently portends inequitable treatment. [...] Fourth, decisions based on statistics can be calibrated to strike a balance between two types of errors. Predictably, decision makers have an incentive to focus exclusively on minimizing any mistake that could bring about public humiliation, but statisticians point out that because of this bias, their decisions will aggravate other errors, which are unnoticed but serious. [...] Finally, statisticians follow a specific protocol known as statistical testing when deciding whether the evidence fits the crime, so to speak. Unlike some of us, they don’t believe in miracles. In other words, if the most unusual coincidence must be contrived to explain the inexplicable, they prefer leaving the crime unsolved." (Kaiser Fung, "Numbers Rule the World", 2010)

"Darn right, graphs are not serious. Any untrained, unsophisticated, non-degree-holding civilian can display data. Relying on plots is like admitting you do not need a statistician. Show pictures of the numbers and let people make their own judgments? That can be no better than airing your statistical dirty laundry. People need guidance; they need to be shown what the data are supposed to say. Graphics cannot do that; models can." (William M Briggs, Comment, Journal of Computational and Graphical Statistics Vol. 20(1), 2011)

"When statisticians, trained in math and probability theory, try to assess likely outcomes, they demand a plethora of data points. Even then, they recognize that unless it’s a very simple and controlled action such as flipping a coin, unforeseen variables can exert significant influence." (Zachary Karabell, "The Leading Indicators: A short history of the numbers that rule our world", 2014)

"To be any good, a sample has to be representative. A sample is representative if every person or thing in the group you’re studying has an equally likely chance of being chosen. If not, your sample is biased. […] The job of the statistician is to formulate an inventory of all those things that matter in order to obtain a representative sample. Researchers have to avoid the tendency to capture variables that are easy to identify or collect data on - sometimes the things that matter are not obvious or are difficult to measure." (Daniel J Levitin, "Weaponized Lies", 2017)

"Some scientists (e.g., econometricians) like to work with mathematical equations; others (e.g., hard-core statisticians) prefer a list of assumptions that ostensibly summarizes the structure of the diagram. Regardless of language, the model should depict, however qualitatively, the process that generates the data - in other words, the cause-effect forces that operate in the environment and shape the data generated." (Judea Pearl & Dana Mackenzie, "The Book of Why: The new science of cause and effect", 2018)

🔭Data Science: Regularization (Just the Quotes)

"Neural networks can model very complex patterns and decision boundaries in the data and, as such, are very powerful. In fact, they are so powerful that they can even model the noise in the training data, which is something that definitely should be avoided. One way to avoid this overfitting is by using a validation set in a similar way as with decision trees.[...] Another scheme to prevent a neural network from overfitting is weight regularization, whereby the idea is to keep the weights small in absolute sense because otherwise they may be fitting the noise in the data. This is then implemented by adding a weight size term (e.g., Euclidean norm) to the objective function of the neural network." (Bart Baesens, "Analytics in a Big Data World: The Essential Guide to Data Science and Its Applications", 2014)

"Regularization works because it is the sum of the coefficients of the predictor variables, therefore it’s important that they’re on the same scale or the regularization may find it difficult to converge, and variables with larger absolute coefficient values will greatly influence it, generating an infective regularization. It’s good practice to standardize the predictor values or bind them to a common min‐max, such as the [‐1,+1] range." (Luca Massaron & John P Mueller, "Python for Data Science For Dummies", 2015)

"Neural nets are typically over-parametrized, and hence are prone to overfitting. Originally early stopping was set up as the primary tuning parameter, and the stopping time was determined using a held-out set of validation data. In modern networks the regularization is tuned adaptively to avoid overfitting, and hence it is less of a problem." (Bradley Efron & Trevor Hastie, "Computer Age Statistical Inference: Algorithms, Evidence, and Data Science", 2016)

"Boosting defines an objective function to measure the performance of a model given a certain set of parameters. The objective function contains two parts: regularization and training loss, both of which add to one another. The training loss measures how predictive our model is on the training data. The most commonly used training loss function includes mean squared error and logistic regression. The regularization term controls the complexity of the model, which helps avoid overfitting." (Danish Haroon, "Python Machine Learning Case Studies", 2017)

"Early stopping and regularization can ensure network generalization when you apply them properly. [...] With early stopping, the choice of the validation set is also important. The validation set should be representative of all points in the training set. When you use Bayesian regularization, it is important to train the network until it reaches convergence. The sum-squared error, the sum-squared weights, and the effective number of parameters should reach constant values when the network has converged. With both early stopping and regularization, it is a good idea to train the network starting from several different initial conditions. It is possible for either method to fail in certain circumstances. By testing several different initial conditions, you can verify robust network performance." (Mark H Beale et al, "Neural Network Toolbox™ User's Guide", 2017)

"Feature generation (or engineering, as it is often called) is where the bulk of the time is spent in the machine learning process. As social science researchers or practitioners, you have spent a lot of time constructing features, using transformations, dummy variables, and interaction terms. All of that is still required and critical in the machine learning framework. One difference you will need to get comfortable with is that instead of carefully selecting a few predictors, machine learning systems tend to encourage the creation of lots of features and then empirically use holdout data to perform regularization and model selection. It is common to have models that are trained on thousands of features." (Rayid Ghani & Malte Schierholz, "Machine Learning", 2017)

"The danger of overfitting is particularly severe when the training data is not a perfect gold standard. Human class annotations are often subjective and inconsistent, leading boosting to amplify the noise at the expense of the signal. The best boosting algorithms will deal with overfitting though regularization. The goal will be to minimize the number of non-zero coefficients, and avoid large coefficients that place too much faith in any one classifier in the ensemble." (Steven S Skiena, "The Data Science Design Manual", 2017)

"Even though a natural way of avoiding overfitting is to simply build smaller networks (with fewer units and parameters), it has often been observed that it is better to build large networks and then regularize them in order to avoid overfitting. This is because large networks retain the option of building a more complex model if it is truly warranted. At the same time, the regularization process can smooth out the random artifacts that are not supported by sufficient data. By using this approach, we are giving the model the choice to decide what complexity it needs, rather than making a rigid decision for the model up front (which might even underfit the data)." (Charu C Aggarwal, "Neural Networks and Deep Learning: A Textbook", 2018)

"Regularization is particularly important when the amount of available data is limited. A neat biological interpretation of regularization is that it corresponds to gradual forgetting, as a result of which 'less important' (i.e., noisy) patterns are removed. In general, it is often advisable to use more complex models with regularization rather than simpler models without regularization." (Charu C Aggarwal, "Neural Networks and Deep Learning: A Textbook", 2018)

"The idea behind deeper architectures is that they can better leverage repeated regularities in the data patterns in order to reduce the number of computational units and therefore generalize the learning even to areas of the data space where one does not have examples. Often these repeated regularities are learned by the neural network within the weights as the basis vectors of hierarchical features." (Charu C Aggarwal, "Neural Networks and Deep Learning: A Textbook", 2018)

10 November 2018

🔭Data Science: Criteria (Just the Quotes)

"[Precision] is the very soul of science; and its attainment afford the only criterion, or at least the best, of the truth of theories, and the correctness of experiments." (John F W Herschel, "A Preliminary Discourse on the Study of Natural Philosophy", 1830)

"A primary goal of any learning model is to predict correctly the learning curve - proportions of correct responses versus trials. Almost any sensible model with two or three free parameters, however, can closely fit the curve, and so other criteria must be invoked when one is comparing several models." (Robert R Bush & Frederick Mosteller, "A Comparison of Eight Models?", Studies in Mathematical Learning Theory, 1959)

"A satisfactory prediction of the sequential properties of learning data from a single experiment is by no means a final test of a model. Numerous other criteria - and some more demanding - can be specified. For example, a model with specific numerical parameter values should be invariant to changes in independent variables that explicitly enter in the model." (Robert R Bush & Frederick Mosteller,"A Comparison of Eight Models?", Studies in Mathematical Learning Theory, 1959)

"[...] sciences do not try to explain, they hardly even try to interpret, they mainly make models. By a model is meant a mathematical construct which, with the addition of certain verbal interpretations, describes observed phenomena. The justification of such a mathematical construct is solely and precisely that it is expected to work - that is, correctly to describe phenomena from a reasonably wide area. Furthermore, it must satisfy certain aesthetic criteria - that is, in relation to how much it describes, it must be rather simple.” (John von Neumann, "Method in the physical sciences", 1961)

"Adaptive system - whether on the biological, psychological, or sociocultural level - must manifest (1) some degree of 'plasticity' and 'irritability' vis-a-vis its environment such that it carries on a constant interchange with acting on and reacting to it; (2) some source or mechanism for variety, to act as a potential pool of adaptive variability to meet the problem of mapping new or more detailed variety and constraints in a changeable environment; (3) a set of selective criteria or mechanisms against which the 'variety pool' may be sifted into those variations in the organization or system that more closely map the environment and those that do not; and (4) an arrangement for preserving and/or propagating these 'successful' mappings." (Walter F Buckley," Sociology and modern systems theory", 1967)

"The advantage of this way of proceeding is evident: insights and skills obtained on the model-side can be - certain transference criteria satisfied - transferred to the original, [in this way] the model-builder obtains a new knowledge about the modeled original […]" (Herbert Stachowiak, "Allgemeine Modelltheorie", 1973)

"In most engineering problems, particularly when solving optimization problems, one must have the opportunity of comparing different variants quantitatively. It is therefore important to be able to state a clear-cut quantitative criterion." (Yakov Khurgin, "Did You Say Mathematics?", 1974)

"The principal aim of physical theories is understanding. A theory's ability to find a number is merely a useful criterion for a correct understanding." (Yuri I Manin, "Mathematics and Physics", 1981)

"It is often the scientist’s experience that he senses the nearness of truth when such connections are envisioned. A connection is a step toward simplification, unification. Simplicity is indeed often the sign of truth and a criterion of beauty." (Mahlon B Hoagland, "Toward the Habit of Truth", 1990)

"A model for simulating dynamic system behavior requires formal policy descriptions to specify how individual decisions are to be made. Flows of information are continuously converted into decisions and actions. No plea about the inadequacy of our understanding of the decision-making processes can excuse us from estimating decision-making criteria. To omit a decision point is to deny its presence - a mistake of far greater magnitude than any errors in our best estimate of the process." (Jay W Forrester, "Policies, decisions and information sources for modeling", 1994)

"The amount of understanding produced by a theory is determined by how well it meets the criteria of adequacy - testability, fruitfulness, scope, simplicity, conservatism - because these criteria indicate the extent to which a theory systematizes and unifies our knowledge." (Theodore Schick Jr., "How to Think about Weird Things: Critical Thinking for a New Age", 1995)

09 November 2018

🔭Data Science: Plausibility (Just the Quotes)

"However successful a theory or law may have been in the past, directly it fails to interpret new discoveries its work is finished, and it must be discarded or modified. However plausible the hypothesis, it must be ever ready for sacrifice on the altar of observation." (Joseph W Mellor, "A Comprehensive Treatise on Inorganic and Theoretical Chemistry", 1922) 

"Heuristic reasoning is reasoning not regarded as final and strict but as provisional and plausible only, whose purpose is to discover the solution of the present problem. We are often obliged to use heuristic reasoning. We shall attain complete certainty when we shall have obtained the complete solution, but before obtaining certainty we must often be satisfied with a more or less plausible guess. We may need the provisional before we attain the final. We need heuristic reasoning when we construct a strict proof as we need scaffolding when we erect a building." (George Pólya, "How to Solve It", 1945)

"The scientist who discovers a theory is usually guided to his discovery by guesses; he cannot name a method by means of which he found the theory and can only say that it appeared plausible to him, that he had the right hunch or that he saw intuitively which assumption would fit the facts." (Hans Reichenbach, "The Rise of Scientific Philosophy", 1951)

"In plausible reasoning the principal thing is to distinguish... a more reasonable guess from a less reasonable guess." (George Pólya, "Mathematics and plausible reasoning" Vol. 1, 1954)

"One feature [...] which requires much more justification than is usually given, is the setting up of unplausible null hypotheses. For example, a statistician may set out a test to see whether two drugs have exactly the same effect, or whether a regression line is exactly straight. These hypotheses can scarcely be taken literally." (Cedric A B Smith, "Book review of Norman T. J. Bailey: Statistical Methods in Biology", Applied Statistics 9, 1960)

"[…] the social scientist who lacks a mathematical mind and regards a mathematical formula as a magic recipe, rather than as the formulation of a supposition, does not hold forth much promise. A mathematical formula is never more than a precise statement. It must not be made into a Procrustean bed - and that is what one is driven to by the desire to quantify at any cost. It is utterly implausible that a mathematical formula should make the future known to us, and those who think it can, would once have believed in witchcraft. The chief merit of mathematicization is that it compels us to become conscious of what we are assuming." (Bertrand de Jouvenel, "The Art of Conjecture", 1967)

"In all scientific fields, theory is frequently more important than experimental data. Scientists are generally reluctant to accept the existence of a phenomenon when they do not know how to explain it. On the other hand, they will often accept a theory that is especially plausible before there exists any data to support it." (Richard Morris, 1983)

"The degree of confirmation assigned to any given hypothesis is sensitive to properties of the entire belief system [...] simplicity, plausibility, and conservatism are properties that theories have in virtue of their relation to the whole structure of scientific beliefs taken collectively. A measure of conservatism or simplicity would be a metric over global properties of belief systems." (Jerry Fodor, "Modularity of Mind", 1983)

"Nonetheless, the basic principles regarding correlations between variables are not that difficult to understand. We must look for patterns that reveal potential relationships and for evidence that variables are actually related. But when we do spot those relationships, we should not jump to conclusions about causality. Instead, we need to weigh the strength of the relationship and the plausibility of our theory, and we must always try to discount the possibility of spuriousness." (Joel Best, "More Damned Lies and Statistics: How numbers confuse public issues", 2004)

"Reality dishes out experiences using probability, not plausibility." (Eliezer S Yudkowsky, "A Technical Explanation of Technical Explanation", 2005)

"Don’t just do the calculations. Use common sense to see whether you are answering the correct question, the assumptions are reasonable, and the results are plausible. If a statistical argument doesn’t make sense, think about it carefully - you may discover that the argument is nonsense." (Gary Smith, "Standard Deviations", 2014)

"Facts and values are entangled in science. It's not because scientists are biased, not because they are partial or influenced by other kinds of interests, but because of a commitment to reason, consistency, coherence, plausibility and replicability. These are value commitments." (Alva Noë)

08 November 2018

🔭Data Science: Aggregation (Just the Quotes)

"Statistics may be defined as numerical statements of facts by means of which large aggregates are analyzed, the relations of individual units to their groups are ascertained, comparisons are made between groups, and continuous records are maintained for comparative purposes." (Melvin T Copeland. "Statistical Methods" [in: Harvard Business Studies, Vol. III, Ed. by Melvin T Copeland, 1917])

"The null hypothesis of no difference has been judged to be no longer a sound or fruitful basis for statistical investigation. […] Significance tests do not provide the information that scientists need, and, furthermore, they are not the most effective method for analyzing and summarizing data." (Cherry A Clark, "Hypothesis Testing in Relation to Statistical Methodology", Review of Educational Research Vol. 33, 1963)

"[…] fitting lines to relationships between variables is often a useful and powerful method of summarizing a set of data. Regression analysis fits naturally with the development of causal explanations, simply because the research worker must, at a minimum, know what he or she is seeking to explain." (Edward R Tufte, "Data Analysis for Politics and Policy", 1974)

"Fitting lines to relationships between variables is the major tool of data analysis. Fitted lines often effectively summarize the data and, by doing so, help communicate the analytic results to others. Estimating a fitted line is also the first step in squeezing further information from the data." (Edward R Tufte, "Data Analysis for Politics and Policy", 1974)

"Modern data graphics can do much more than simply substitute for small statistical tables. At their best, graphics are instruments for reasoning about quantitative information. Often the most effective way to describe, explore, and summarize a set of numbers even a very large set - is to look at pictures of those numbers. Furthermore, of all methods for analyzing and communicating statistical information, well-designed data graphics are usually the simplest and at the same time the most powerful." (Edward R Tufte, "The Visual Display of Quantitative Information", 1983)

"The science of statistics may be described as exploring, analyzing and summarizing data; designing or choosing appropriate ways of collecting data and extracting information from them; and communicating that information. Statistics also involves constructing and testing models for describing chance phenomena. These models can be used as a basis for making inferences and drawing conclusions and, finally, perhaps for making decisions." (Fergus Daly et al, "Elements of Statistics", 1995)

"Without meaningful data there can be no meaningful analysis. The interpretation of any data set must be based upon the context of those data. Unfortunately, much of the data reported to executives today are aggregated and summed over so many different operating units and processes that they cannot be said to have any context except a historical one - they were all collected during the same time period. While this may be rational with monetary figures, it can be devastating to other types of data." (Donald J Wheeler, "Understanding Variation: The Key to Managing Chaos" 2nd Ed., 2000)

"Ockham's Razor in statistical analysis is used implicitly when models are embedded in richer models -for example, when testing the adequacy of a linear model by incorporating a quadratic term. If the coefficient of the quadratic term is not significant, it is dropped and the linear model is assumed to summarize the data adequately." (Gerald van Belle, "Statistical Rules of Thumb", 2002)

"Every number has its limitations; every number is a product of choices that inevitably involve compromise. Statistics are intended to help us summarize, to get an overview of part of the world’s complexity. But some information is always sacrificed in the process of choosing what will be counted and how. Something is, in short, always missing. In evaluating statistics, we should not forget what has been lost, if only because this helps us understand what we still have." (Joel Best, "More Damned Lies and Statistics: How numbers confuse public issues", 2004)

"Data often arrive in raw form, as long lists of numbers. In this case your job is to summarize the data in a way that captures its essence and conveys its meaning. This can be done numerically, with measures such as the average and standard deviation, or graphically. At other times you find data already in summarized form; in this case you must understand what the summary is telling, and what it is not telling, and then interpret the information for your readers or viewers." (Charles Livingston & Paul Voakes, "Working with Numbers and Statistics: A handbook for journalists", 2005)

"Whereas regression is about attempting to specify the underlying relationship that summarises a set of paired data, correlation is about assessing the strength of that relationship. Where there is a very close match between the scatter of points and the regression line, correlation is said to be 'strong' or 'high' . Where the points are widely scattered, the correlation is said to be 'weak' or 'low'." (Alan Graham, "Developing Thinking in Statistics", 2006)

"Numerical precision should be consistent throughout and summary statistics such as means and standard deviations should not have more than one extra decimal place (or significant digit) compared to the raw data. Spurious precision should be avoided although when certain measures are to be used for further calculations or when presenting the results of analyses, greater precision may sometimes be appropriate." (Jenny Freeman et al, "How to Display Data", 2008)

"Graphical displays are often constructed to place principal focus on the individual observations in a dataset, and this is particularly helpful in identifying both the typical positions of data points and unusual or influential cases. However, in many investigations, principal interest lies in identifying the nature of underlying trends and relationships between variables, and so it is often helpful to enhance graphical displays in ways which give deeper insight into these features. This can be very beneficial both for small datasets, where variation can obscure underlying patterns, and large datasets, where the volume of data is so large that effective representation inevitably involves suitable summaries." (Adrian W Bowman, "Smoothing Techniques for Visualisation" [in "Handbook of Data Visualization"], 2008)

"There are three possible reasons for [the] absence of predictive power. First, it is possible that the models are misspecified. Second, it is possible that the model’s explanatory factors are measured at too high a level of aggregation [...] Third, [...] the search for statistically significant relationships may not be the strategy best suited for evaluating our model’s ability to explain real world events [...] the lack of predictive power is the result of too much emphasis having been placed on finding statistically significant variables, which may be overdetermined. Statistical significance is generally a flawed way to prune variables in regression models [...] Statistically significant variables may actually degrade the predictive accuracy of a model [...] [By using] models that are constructed on the basis of pruning undertaken with the shears of statistical significance, it is quite possible that we are winnowing our models away from predictive accuracy." (Michael D Ward et al, "The perils of policy by p-value: predicting civil conflicts" Journal of Peace Research 47, 2010)

"In order to be effective a descriptive statistic has to make sense - it has to distill some essential characteristic of the data into a value that is both appropriate and understandable. […] the justification for computing any given statistic must come from the nature of the data themselves - it cannot come from the arithmetic, nor can it come from the statistic. If the data are a meaningless collection of values, then the summary statistics will also be meaningless - no arithmetic operation can magically create meaning out of nonsense. Therefore, the meaning of any statistic has to come from the context for the data, while the appropriateness of any statistic will depend upon the use we intend to make of that statistic." (Donald J Wheeler, "Myths About Data Analysis", International Lean & Six Sigma Conference, 2012)

"[...] things that seem hopelessly random and unpredictable when viewed in isolation often turn out to be lawful and predictable when viewed in aggregate." (Steven Strogatz, "The Joy of X: A Guided Tour of Mathematics, from One to Infinity", 2012)

"In general, when building statistical models, we must not forget that the aim is to understand something about the real world. Or predict, choose an action, make a decision, summarize evidence, and so on, but always about the real world, not an abstract mathematical world: our models are not the reality - a point well made by George Box in his oft-cited remark that "all models are wrong, but some are useful". (David Hand, "Wonderful examples, but let's not close our eyes", Statistical Science 29, 2014)

"Decision trees are also discriminative models. Decision trees are induced by recursively partitioning the feature space into regions belonging to the different classes, and consequently they define a decision boundary by aggregating the neighboring regions belonging to the same class. Decision tree model ensembles based on bagging and boosting are also discriminative models." (John D Kelleher et al, "Fundamentals of Machine Learning for Predictive Data Analytics: Algorithms, Worked Examples, and Case Studies", 2015)

"Just as with aggregated data, an average is a summary statistic that can tell you something about the data - but it is only one metric, and oftentimes a deceiving one at that. By taking all of the data and boiling it down to one value, an average (and other summary statistics) may imply that all of the underlying data is the same, even when it’s not." (John H Johnson & Mike Gluck, "Everydata: The misinformation hidden in the little data you consume every day", 2016)

"At very small time scales, the motion of a particle is more like a random walk, as it gets jostled about by discrete collisions with water molecules. But virtually any random movement on small time scales will give rise to Brownian motion on large time scales, just so long as the motion is unbiased. This is because of the Central Limit Theorem, which tells us that the aggregate of many small, independent motions will be normally distributed." (Field Cady, "The Data Science Handbook", 2017)

"The most accurate but least interpretable form of data presentation is to make a table, showing every single value. But it is difficult or impossible for most people to detect patterns and trends in such data, and so we rely on graphs and charts. Graphs come in two broad types: Either they represent every data point visually (as in a scatter plot) or they implement a form of data reduction in which we summarize the data, looking, for example, only at means or medians." (Daniel J Levitin, "Weaponized Lies", 2017)

"Again, classical statistics only summarizes data, so it does not provide even a language for asking [a counterfactual] question. Causal inference provides a notation and, more importantly, offers a solution. As with predicting the effect of interventions [...], in many cases we can emulate human retrospective thinking with an algorithm that takes what we know about the observed world and produces an answer about the counterfactual world." (Judea Pearl & Dana Mackenzie, "The Book of Why: The new science of cause and effect", 2018)

"While the individual man is an insoluble puzzle, in the aggregate he becomes a mathematical certainty. You can, for example, never foretell what anyone man will be up to, but you can say with precision what an average number will be up to. Individuals vary, but percentages remain constant. So says the statistician." (Sir Arthur C Doyle)

🔭Data Science - Consistency (Just the Quotes)

"A model, like a novel, may resonate with nature, but it is not a ‘real’ thing. Like a novel, a model may be convincing - it may ‘ring true’ if it is consistent with our experience of the natural world. But just as we may wonder how much the characters in a novel are drawn from real life and how much is artifice, we might ask the same of a model: How much is based on observation and measurement of accessible phenomena, how much is convenience? Fundamentally, the reason for modeling is a lack of full access, either in time or space, to the phenomena of interest." (Kenneth Belitz, Science, Vol. 263, 1944)

"Consistency and completeness can also be characterized in terms of models: a theory T is consistent if and only if it has at least one model; it is complete if and only if every sentence of T which is satified in one model is also satisfied in any other model of T. Two theories T1 and T2 are said to be compatible if they have a common consistent extension; this is equivalent to saying that the union of T1 and T2 is consistent." (Alfred Tarski et al, "Undecidable Theories", 1953)

"To be useful data must be consistent - they must reflect periodic recordings of the value of the variable or at least possess logical internal connections. The definition of the variable under consideration cannot change during the period of measurement or enumeration. Also, if the data are to be valuable, they must be relevant to the question to be answered." (Cecil H Meyers, "Handbook of Basic Graphs: A modern approach", 1970)

"When evaluating a model, at least two broad standards are relevant. One is whether the model is consistent with the data. The other is whether the model is consistent with the ‘real world’." (Kenneth A Bollen, "Structural Equations with Latent Variables", 1989)

"The word theory, as used in the natural sciences, doesn’t mean an idea tentatively held for purposes of argument - that we call a hypothesis. Rather, a theory is a set of logically consistent abstract principles that explain a body of concrete facts. It is the logical connections among the principles and the facts that characterize a theory as truth. No one element of a theory [...] can be changed without creating a logical contradiction that invalidates the entire system. Thus, although it may not be possible to substantiate directly a particular principle in the theory, the principle is validated by the consistency of the entire logical structure." (Alan Cromer, "Uncommon Sense: The Heretical Nature of Science", 1993)

"For a given dataset there is not a great deal of advice which can be given on content and context. hose who know their own data should know best for their specific purposes. It is advisable to think hard about what should be shown and to check with others if the graphic makes the desired impression. Design should be let to designers, though some basic guidelines should be followed: consistency is important (sets of graphics should be in similar style and use equivalent scaling); proximity is helpful (place graphics on the same page, or on the facing page, of any text that refers to them); and layout should be checked (graphics should be neither too small nor too large and be attractively positioned relative to the whole page or display)."(Antony Unwin, "Good Graphics?" [in "Handbook of Data Visualization"], 2008)

"It is the consistency of the information that matters for a good story, not its completeness. Indeed, you will often find that knowing little makes it easier to fit everything you know into a coherent pattern." (Daniel Kahneman, "Thinking, Fast and Slow", 2011)

"Accuracy and coherence are related concepts pertaining to data quality. Accuracy refers to the comprehensiveness or extent of missing data, performance of error edits, and other quality assurance strategies. Coherence is the degree to which data - item value and meaning are consistent over time and are comparable to similar variables from other routinely used data sources." (Aileen Rothbard, "Quality Issues in the Use of Administrative Data Records", 2015)

"How good the data quality is can be looked at both subjectively and objectively. The subjective component is based on the experience and needs of the stakeholders and can differ by who is being asked to judge it. For example, the data managers may see the data quality as excellent, but consumers may disagree. One way to assess it is to construct a survey for stakeholders and ask them about their perception of the data via a questionnaire. The other component of data quality is objective. Measuring the percentage of missing data elements, the degree of consistency between records, how quickly data can be retrieved on request, and the percentage of incorrect matches on identifiers (same identifier, different social security number, gender, date of birth) are some examples." (Aileen Rothbard, "Quality Issues in the Use of Administrative Data Records", 2015)

"The danger of overfitting is particularly severe when the training data is not a perfect gold standard. Human class annotations are often subjective and inconsistent, leading boosting to amplify the noise at the expense of the signal. The best boosting algorithms will deal with overfitting though regularization. The goal will be to minimize the number of non-zero coefficients, and avoid large coefficients that place too much faith in any one classifier in the ensemble." (Steven S Skiena, "The Data Science Design Manual", 2017)

"There are other problems with Big Data. In any large data set, there are bound to be inconsistencies, misclassifications, missing data - in other words, errors, blunders, and possibly lies. These problems with individual items occur in any data set, but they are often hidden in a large mass of numbers even when these numbers are generated out of computer interactions." (David S Salsburg, "Errors, Blunders, and Lies: How to Tell the Difference", 2017)

Data Management : Data Fabric (Definitions)

"Enterprise data fabric (EDF) is a data layer that separates data sources from applications, providing the means to solve the gridlock prevalent in distributed environments such as grid computing, service-oriented architecture (SOA) and event-driven architecture (EDA)." (Information Management, 2010)

"A data fabric is an emerging data management and data integration design concept for attaining flexible, reusable and augmented data integration pipelines, services and semantics, in support of various operational and analytics use cases delivered across multiple deployment and orchestration platforms." (Jacob O Lund, "Demystifying the Data Fabric", 2020)

"A data fabric is a data management architecture that can optimize access to distributed data and intelligently curate and orchestrate it for self-service delivery to data consumers." (IBM, "Data Fabric", 2021) [source]

"A data fabric is a modern, distributed data architecture that includes shared data assets and optimized data management and integration processes that you can use to address today’s data challenges in a unified way." (Alice LaPlante, "Data Fabric as Modern Data Architecture", 2021)

"A data fabric is an emerging data management design for attaining flexible and reusable data integration pipelines, services and semantics. A data fabric supports various operational and analytics use cases delivered across multiple deployment and orchestration platforms. Data fabrics support a combination of different data integration styles and leverage active metadata, knowledge graphs, semantics and ML to automate and enhance data integration design and delivery." (Ehtisham Zaidi, "Data Fabric", Gartner's Hype Cycle for Data Management, 2021)

"Is a distributed Data Management platform whose objective is to combine various types of data storage, access, preparation, analytics, and security tools in a fully compliant manner to support seamless Data Management." (Michelle Knight, "What Is a Data Fabric?", 2021)

"A data fabric is a customized combination of architecture and technology. It uses dynamic data integration and orchestration to connect different locations, sources, and types of data. With the right structures and flows as defined within the data fabric platform, companies can quickly access and share data regardless of where it is or how it was generated." (SAP)

"A data fabric is a distributed, memory-based data management platform that uses cluster-wide resources - memory, CPU, network bandwidth, and optionally local disk – to manage application data and application logic (behavior). The data fabric uses dynamic replication and data partitioning techniques to offer continuous availability, very high performance, and linear scalability for data intensive applications, all without compromising on data consistency even when exposed to failure conditions." (VMware)

"A Data Fabric is a technology utilization and implementation design capable of multiple outputs and applied uses." (Gartner)

"A data fabric is an architecture and set of data services that provide consistent capabilities across a choice of endpoints spanning hybrid multicloud environments." (NetApp) [source]

"Data fabric is an end-to-end data integration and management solution, consisting of architecture, data management and integration software, and shared data that helps organizations manage their data. A data fabric provides a unified, consistent user experience and access to data for any member of an organization worldwide and in real-time." (Tibco) [source]

07 November 2018

🔭Data Science: Decision Theory (Just the Quotes)

"Years ago a statistician might have claimed that statistics deals with the processing of data [...] today’s statistician will be more likely to say that statistics is concerned with decision making in the face of uncertainty." (Herman Chernoff & Lincoln E Moses, "Elementary Decision Theory", 1959)

"Another approach to management theory, undertaken by a growing and scholarly group, might be referred to as the decision theory school. This group concentrates on rational approach to decision-the selection from among possible alternatives of a course of action or of an idea. The approach of this school may be to deal with the decision itself, or to the persons or organizational group making the decision, or to an analysis of the decision process. Some limit themselves fairly much to the economic rationale of the decision, while others regard anything which happens in an enterprise the subject of their analysis, and still others expand decision theory to cover the psychological and sociological aspect and environment of decisions and decision-makers." (Harold Koontz, "The Management Theory Jungle," 1961)

"The term hypothesis testing arises because the choice as to which process is observed is based on hypothesized models. Thus hypothesis testing could also be called model testing. Hypothesis testing is sometimes called decision theory. The detection theory of communication theory is a special case." (Fred C Scweppe, "Uncertain dynamic systems", 1973)

"In decision theory, mathematical analysis shows that once the sampling distribution, loss function, and sample are specified, the only remaining basis for a choice among different admissible decisions lies in the prior probabilities. Therefore, the logical foundations of decision theory cannot be put in fully satisfactory form until the old problem of arbitrariness (sometimes called 'subjectiveness') in assigning prior probabilities is resolved." (Edwin T Jaynes, "Prior Probabilities", 1978)

"Decision theory, as it has grown up in recent years, is a formalization of the problems involved in making optimal choices. In a certain sense - a very abstract sense, to be sure - it incorporates among others operations research, theoretical economics, and wide areas of statistics, among others." (Kenneth Arrow, "The Economics of Information", 1984) 

"Cybernetics is concerned with scientific investigation of systemic processes of a highly varied nature, including such phenomena as regulation, information processing, information storage, adaptation, self-organization, self-reproduction, and strategic behavior. Within the general cybernetic approach, the following theoretical fields have developed: systems theory (system), communication theory, game theory, and decision theory." (Fritz B Simon et al, "Language of Family Therapy: A Systemic Vocabulary and Source Book", 1985)

"A field of study that includes a methodology for constructing computer simulation models to achieve better under-standing of social and corporate systems. It draws on organizational studies, behavioral decision theory, and engineering to provide a theoretical and empirical base for structuring the relationships in complex systems." (Virginia Anderson & Lauren Johnson, "Systems Thinking Basics: From Concepts to Casual Loops", 1997) 

"A decision theory that rests on the assumptions that human cognitive capabilities are limited and that these limitations are adaptive with respect to the decision environments humans frequently encounter. Decision are thought to be made usually without elaborate calculations, but instead by using fast and frugal heuristics. These heuristics certainly have the advantage of speed and simplicity, but if they are well matched to a decision environment, they can even outperform maximizing calculations with respect to accuracy. The reason for this is that many decision environments are characterized by incomplete information and noise. The information we do have is usually structured in a specific way that clever heuristics can exploit." (E Ebenhoh, "Agent-Based Modelnig with Boundedly Rational Agents", 2007)

🔭Data Science: Belief (Just the Quotes)

"By degree of probability we really mean, or ought to mean, degree of belief [...] Probability then, refers to and implies belief, more or less, and belief is but another name for imperfect knowledge, or it may be, expresses the mind in a state of imperfect knowledge." (Augustus De Morgan, "Formal Logic: Or, The Calculus of Inference, Necessary and Probable", 1847)

"To a scientist a theory is something to be tested. He seeks not to defend his beliefs, but to improve them. He is, above everything else, an expert at ‘changing his mind’." (Wendell Johnson, 1946)

"A model can not be proved to be correct; at best it can only be found to be reasonably consistant and not to contradict some of our beliefs of what reality is." (Richard W Hamming, "The Art of Probability for Scientists and Engineers", 1991)

"Probability is not about the odds, but about the belief in the existence of an alternative outcome, cause, or motive." (Nassim N Taleb, "Fooled by Randomness", 2001)

"The Bayesian approach is based on the following postulates: (B1) Probability describes degree of belief, not limiting frequency. As such, we can make probability statements about lots of things, not just data which are subject to random variation. […] (B2) We can make probability statements about parameters, even though they are fixed constants. (B3) We make inferences about a parameter θ by producing a probability distribution for θ. Inferences, such as point estimates and interval estimates, may then be extracted from this distribution." (Larry A Wasserman, "All of Statistics: A concise course in statistical inference", 2004)

"The important thing is to understand that frequentist and Bayesian methods are answering different questions. To combine prior beliefs with data in a principled way, use Bayesian inference. To construct procedures with guaranteed long run performance, such as confidence intervals, use frequentist methods. Generally, Bayesian methods run into problems when the parameter space is high dimensional." (Larry A Wasserman, "All of Statistics: A concise course in statistical inference", 2004)

"Our inner weighing of evidence is not a careful mathematical calculation resulting in a probabilistic estimate of truth, but more like a whirlpool blending of the objective and the personal. The result is a set of beliefs - both conscious and unconscious - that guide us in interpreting all the events of our lives." (Leonard Mlodinow, "War of the Worldviews: Where Science and Spirituality Meet - and Do Not", 2011)

"The search for better numbers, like the quest for new technologies to improve our lives, is certainly worthwhile. But the belief that a few simple numbers, a few basic averages, can capture the multifaceted nature of national and global economic systems is a myth. Rather than seeking new simple numbers to replace our old simple numbers, we need to tap into both the power of our information age and our ability to construct our own maps of the world to answer the questions we need answering." (Zachary Karabell, "The Leading Indicators: A short history of the numbers that rule our world", 2014)

"One kind of probability - classic probability - is based on the idea of symmetry and equal likelihood […] In the classic case, we know the parameters of the system and thus can calculate the probabilities for the events each system will generate. […] A second kind of probability arises because in daily life we often want to know something about the likelihood of other events occurring […]. In this second case, we need to estimate the parameters of the system because we don’t know what those parameters are. […] A third kind of probability differs from these first two because it’s not obtained from an experiment or a replicable event - rather, it expresses an opinion or degree of belief about how likely a particular event is to occur. This is called subjective probability […]." (Daniel J Levitin, "Weaponized Lies", 2017)

"Bayesian statistics give us an objective way of combining the observed evidence with our prior knowledge (or subjective belief) to obtain a revised belief and hence a revised prediction of the outcome of the coin’s next toss. [...] This is perhaps the most important role of Bayes’s rule in statistics: we can estimate the conditional probability directly in one direction, for which our judgment is more reliable, and use mathematics to derive the conditional probability in the other direction, for which our judgment is rather hazy. The equation also plays this role in Bayesian networks; we tell the computer the forward  probabilities, and the computer tells us the inverse probabilities when needed." (Judea Pearl & Dana Mackenzie, "The Book of Why: The new science of cause and effect", 2018)

"The transparency of Bayesian networks distinguishes them from most other approaches to machine learning, which tend to produce inscrutable 'black boxes'. In a Bayesian network you can follow every step and understand how and why each piece of evidence changed the network’s beliefs." (Judea Pearl & Dana Mackenzie, "The Book of Why: The new science of cause and effect", 2018)

06 November 2018

🔭Data Science: Tools (Just the Quotes)

"[Statistics] are the only tools by which an opening can be cut through the formidable thicket of difficulties that bars the path of those who pursue the Science of man." (Sir Francis Galton, "Natural Inheritance", 1889)

"Mathematics is merely a shorthand method of recording physical intuition and physical reasoning, but it should not be a formalism leading from nowhere to nowhere, as it is likely to be made by one who does not realize its purpose as a tool." (Charles P Steinmetz, "Transactions of the American Institute of Electrical Engineers", 1909)

"Statistical methods are tools of scientific investigation. Scientific investigation is a controlled learning process in which various aspects of a problem are illuminated as the study proceeds. It can be thought of as a major iteration within which secondary iterations occur. The major iteration is that in which a tentative conjecture suggests an experiment, appropriate analysis of the data so generated leads to a modified conjecture, and this in turn leads to a new experiment, and so on." (George E P Box & George C Tjao, "Bayesian Inference in Statistical Analysis", 1973)

"[...] exploratory data analysis is an attitude, a state of flexibility, a willingness to look for those things that we believe are not there, as well as for those we believe might be there. Except for its emphasis on graphs, its tools are secondary to its purpose." (John W Tukey, [comment] 1979)

"Correlation analysis is a useful tool for uncovering a tenuous relationship, but it doesn't necessarily provide any real understanding of the relationship, and it certainly doesn't provide any evidence that the relationship is one of cause and effect. People who don't understand correlation tend to credit it with being a more fundamental approach than it is." (Robert Hooke, "How to Tell the Liars from the Statisticians", 1983)

"Science usually amounts to a lot more than blind trial and error. Good statistics consists of much more than just significance tests; there are more sophisticated tools available for the analysis of results, such as confidence statements, multiple comparisons, and Bayesian analysis, to drop a few names. However, not all scientists are good statisticians, or want to be, and not all people who are called scientists by the media deserve to be so described." (Robert Hooke, "How to Tell the Liars from the Statisticians", 1983)

"Some methods, such as those governing the design of experiments or the statistical treatment of data, can be written down and studied. But many methods are learned only through personal experience and interactions with other scientists. Some are even harder to describe or teach. Many of the intangible influences on scientific discovery - curiosity, intuition, creativity - largely defy rational analysis, yet they are often the tools that scientists bring to their work." (Committee on the Conduct of Science, "On Being a Scientist", 1989)

"Statistics is a tool. In experimental science you plan and carry out experiments, and then analyse and interpret the results. To do this you use statistical arguments and calculations. Like any other tool - an oscilloscope, for example, or a spectrometer, or even a humble spanner - you can use it delicately or clumsily, skillfully or ineptly. The more you know about it and understand how it works, the better you will be able to use it and the more useful it will be." (Roger Barlow, "Statistics: A Guide to the Use of Statistical Methods in the Physical Sciences", 1989)

"Fitting is essential to visualizing hypervariate data. The structure of data in many dimensions can be exceedingly complex. The visualization of a fit to hypervariate data, by reducing the amount of noise, can often lead to more insight. The fit is a hypervariate surface, a function of three or more variables. As with bivariate and trivariate data, our fitting tools are loess and parametric fitting by least-squares. And each tool can employ bisquare iterations to produce robust estimates when outliers or other forms of leptokurtosis are present." (William S Cleveland, "Visualizing Data", 1993)

"The logarithm is one of many transformations that we can apply to univariate measurements. The square root is another. Transformation is a critical tool for visualization or for any other mode of data analysis because it can substantially simplify the structure of a set of data. For example, transformation can remove skewness toward large values, and it can remove monotone increasing spread. And often, it is the logarithm that achieves this removal." (William S Cleveland, "Visualizing Data", 1993)

"Probability theory is an ideal tool for formalizing uncertainty in situations where class frequencies are known or where evidence is based on outcomes of a sufficiently long series of independent random experiments. Possibility theory, on the other hand, is ideal for formalizing incomplete information expressed in terms of fuzzy propositions." (George Klir, "Fuzzy sets and fuzzy logic", 1995)

"We use mathematics and statistics to describe the diverse realms of randomness. From these descriptions, we attempt to glean insights into the workings of chance and to search for hidden causes. With such tools in hand, we seek patterns and relationships and propose predictions that help us make sense of the world."  (Ivars Peterson, "The Jungles of Randomness: A Mathematical Safari", 1998)

"When an analyst selects the wrong tool, this is a misuse which usually leads to invalid conclusions. Incorrect use of even a tool as simple as the mean can lead to serious misuses. […] But all statisticians know that more complex tools do not guarantee an analysis free of misuses. Vigilance is required on every statistical level."  (Herbert F Spirer et al, "Misused Statistics" 2nd Ed, 1998)

"The key role of representation in thinking is often downplayed because of an ideal of rationality that dictates that whenever two statements are mathematically or logically the same, representing them in different forms should not matter. Evidence that it does matter is regarded as a sign of human irrationality. This view ignores the fact that finding a good representation is an indispensable part of problem solving and that playing with different representations is a tool of creative thinking." (Gerd Gigerenzer, "Calculated Risks: How to know when numbers deceive you", 2002)

"There is a tendency to use hypothesis testing methods even when they are not appropriate. Often, estimation and confidence intervals are better tools. Use hypothesis testing only when you want to test a well-defined hypothesis." (Larry A Wasserman, "All of Statistics: A concise course in statistical inference", 2004)

"Popular accounts of mathematics often stress the discipline’s obsession with certainty, with proof. And mathematicians often tell jokes poking fun at their own insistence on precision. However, the quest for precision is far more than an end in itself. Precision allows one to reason sensibly about objects outside of ordinary experience. It is a tool for exploring possibility: about what might be, as well as what is." (Donal O’Shea, “The Poincaré Conjecture”, 2007)

"The key to understanding randomness and all of mathematics is not being able to intuit the answer to every problem immediately but merely having the tools to figure out the answer." (Leonard Mlodinow,"The Drunkard’s Walk: How Randomness Rules Our Lives", 2008)

"[...] a model is a tool for taking decisions and any decision taken is the result of a process of reasoning that takes place within the limits of the human mind. So, models have eventually to be understood in such a way that at least some layer of the process of simulation is comprehensible by the human mind. Otherwise, we may find ourselves acting on the basis of models that we don’t understand, or no model at all.” (Ugo Bardi, “The Limits to Growth Revisited”, 2011)

"The first and main goal of any graphic and visualization is to be a tool for your eyes and brain to perceive what lies beyond their natural reach." (Alberto Cairo, "The Functional Art", 2011)

"What is really important is to remember that no matter how creative and innovative you wish to be in your graphics and visualizations, the first thing you must do, before you put a finger on the computer keyboard, is ask yourself what users are likely to try to do with your tool." (Alberto Cairo, "The Functional Art", 2011)

"Data mining is a craft. As with many crafts, there is a well-defined process that can help to increase the likelihood of a successful result. This process is a crucial conceptual tool for thinking about data science projects. [...] data mining is an exploratory undertaking closer to research and development than it is to engineering." (Foster Provost, "Data Science for Business", 2013)

"It is important to remember that predictive data analytics models built using machine learning techniques are tools that we can use to help make better decisions within an organization and are not an end in themselves. It is paramount that, when tasked with creating a predictive model, we fully understand the business problem that this model is being constructed to address and ensure that it does address it." (John D Kelleher et al, "Fundamentals of Machine Learning for Predictive Data Analytics: Algorithms, worked examples, and case studies", 2015)

"A popular misconception holds that the era of Big Data means the end of a need for sampling. In fact, the proliferation of data of varying quality and relevance reinforces the need for sampling as a tool to work efficiently with a variety of data, and minimize bias. Even in a Big Data project, predictive models are typically developed and piloted with samples." (Peter C Bruce & Andrew G Bruce, "Statistics for Data Scientists: 50 Essential Concepts", 2016)

"[…] remember that, as with many statistical issues, sampling in and of itself is not a good or a bad thing. Sampling is a powerful tool that allows us to learn something, when looking at the full population is not feasible (or simply isn’t the preferred option). And you shouldn’t be misled to think that you always should use all the data. In fact, using a sample of data can be incredibly helpful." (John H Johnson & Mike Gluck, "Everydata: The misinformation hidden in the little data you consume every day", 2016)

"A theory is nothing but a tool to know the reality. If a theory contradicts reality, it must be discarded at the earliest." (Awdhesh Singh, "Myths are Real, Reality is a Myth", 2018)

"Cross-validation is a useful tool for finding optimal predictive models, and it also works well in visualization. The concept is simple: split the data at random into a 'training' and a 'test' set, fit the model to the training data, then see how well it predicts the test data. As the model gets more complex, it will always fit the training data better and better. It will also start off getting better results on the test data, but there comes a point where the test data predictions start going wrong." (Robert Grant, "Data Visualization: Charts, Maps and Interactive Graphics", 2019)

"Even though data is being thrust on more people, it doesn’t mean everyone is prepared to consume and use it effectively. As our dependence on data for guidance and insights increases, the need for greater data literacy also grows. If literacy is defined as the ability to read and write, data literacy can be defined as the ability to understand and communicate data. Today’s advanced data tools can offer unparalleled insights, but they require capable operators who can understand and interpret data." (Brent Dykes, "Effective Data Storytelling: How to Drive Change with Data, Narrative and Visuals", 2019)

"We know what forecasting is: you start in the present and try to look into the future and imagine what it will be like. Backcasting is the opposite: you state your desired vision of the future as if it’s already happened, and then work backward to imagine the practices, policies, programs, tools, training, and people who worked in concert in a hypothetical past (which takes place in the future) to get you there." (Eben Hewitt, "Technology Strategy Patterns: Architecture as strategy" 2nd Ed., 2019)

"Big data is revolutionizing the world around us, and it is easy to feel alienated by tales of computers handing down decisions made in ways we don’t understand. I think we’re right to be concerned. Modern data analytics can produce some miraculous results, but big data is often less trustworthy than small data. Small data can typically be scrutinized; big data tends to be locked away in the vaults of Silicon Valley. The simple statistical tools used to analyze small datasets are usually easy to check; pattern-recognizing algorithms can all too easily be mysterious and commercially sensitive black boxes." (Tim Harford, "The Data Detective: Ten easy rules to make sense of statistics", 2020)

"I think sometimes organizations are looking at tools or the mythical and elusive data driven culture to be the strategy. Let me emphasize now: culture and tools are not strategies; they are enabling pieces." (Jordan Morrow, "Be Data Literate: The data literacy skills everyone needs to succeed", 2021)

"Visualisation is fundamentally limited by the number of pixels you can pump to a screen. If you have big data, you have way more data than pixels, so you have to summarise your data. Statistics gives you lots of really good tools for this." (Hadley Wickham)

05 November 2018

💠🛠️SQL Server: Administration (End of Life for 2008 and 2008 R2 Versions)

SQL Server 2008 and 2008 R2 versions are heading with steep steps toward the end of support - July 9, 2019. Besides an upgrade to upper versions, it seems there is also the opportunity to migrate to an Azure SQL Server Managed Instance, which allows a near 100% compatibility with an on-premises SQL Server installation, or to a Azure VM (see Franck Mercier’s post).

If you aren’t sure which versions of SQL Server you have in your organization here’s a script that can be run on all SQL Server 2005+ installations via SQL Server Management Studio:

SELECT SERVERPROPERTY('ComputerNamePhysicalNetBIOS') ComputerName
 , SERVERPROPERTY('Edition') Edition
 , SERVERPROPERTY('ProductVersion') ProductVersion
 , CASE Cast(SERVERPROPERTY('ProductVersion') as nvarchar(20))
     WHEN '9.00.5000.00' THEN 'SQL Server 2005'
     WHEN '10.0.6000.29' THEN 'SQL Server 2008' 
     WHEN '10.50.6000.34' THEN 'SQL Server 2008 R2' 
     WHEN '11.0.7001.0' THEN 'SQL Server 2012' 
     WHEN '12.0.6024.0' THEN 'SQL Server 2014' 
     WHEN '13.0.5026.0' THEN 'SQL Server 2016'
     WHEN '14.0.2002.14' THEN 'SQL Server 2017'
     ELSE 'unknown'
   END Product 


Resources:
[1] Microsoft (2018) How to determine the version, edition, and update level of SQL Server and its components [Online] Available from: https://support.microsoft.com/en-us/help/321185/how-to-determine-the-version-edition-and-update-level-of-sql-server-an
[2] Microsoft Blogs (2018) SQL Server 2008 end of support, by Franck Mercier [Online] Available from: https://blogs.technet.microsoft.com/franmer/2018/11/01/sql-server-2008-end-of-support-2/

🔭Data Science: Confidence (Just the Quotes)

"What the use of P [the significance level] implies, therefore, is that a hypothesis that may be true may be rejected because it has not predicted observable results that have not occurred." (Harold Jeffreys, "Theory of Probability", 1939)

"Only by the analysis and interpretation of observations as they are made, and the examination of the larger implications of the results, is one in a satisfactory position to pose new experimental and theoretical questions of the greatest significance." (John A Wheeler, "Elementary Particle Physics", American Scientist, 1947)

"As usual we may make the errors of I) rejecting the null hypothesis when it is true, II) accepting the null hypothesis when it is false. But there is a third kind of error which is of interest because the present test of significance is tied up closely with the idea of making a correct decision about which distribution function has slipped furthest to the right. We may make the error of III) correctly rejecting the null hypothesis for the wrong reason." (Frederick Mosteller, "A k-Sample Slippage Test for an Extreme Population", The Annals of Mathematical Statistics 19, 1948)

"Errors of the third kind happen in conventional tests of differences of means, but they are usually not considered, although their existence is probably recognized. It seems to the author that there may be several reasons for this among which are 1) a preoccupation on the part of mathematical statisticians with the formal questions of acceptance and rejection of null hypotheses without adequate consideration of the implications of the error of the third kind for the practical experimenter, 2) the rarity with which an error of the third kind arises in the usual tests of significance." (Frederick Mosteller, "A k-Sample Slippage Test for an Extreme Population", The Annals of Mathematical Statistics 19, 1948)

"If significance tests are required for still larger samples, graphical accuracy is insufficient, and arithmetical methods are advised. A word to the wise is in order here, however. Almost never does it make sense to use exact binomial significance tests on such data - for the inevitable small deviations from the mathematical model of independence and constant split have piled up to such an extent that the binomial variability is deeply buried and unnoticeable. Graphical treatment of such large samples may still be worthwhile because it brings the results more vividly to the eye." (Frederick Mosteller & John W Tukey, "The Uses and Usefulness of Binomial Probability Paper?", Journal of the American Statistical Association 44, 1949)

"One reason for preferring to present a confidence interval statement (where possible) is that the confidence interval, by its width, tells more about the reliance that can be placed on the results of the experiment than does a YES-NO test of significance." (Mary G Natrella, "The relation between confidence intervals and tests of significance", American Statistician 14, 1960)

"Confidence intervals give a feeling of the uncertainty of experimental evidence, and (very important) give it in the same units [...] as the original observations." (Mary G Natrella, "The relation between confidence intervals and tests of significance", American Statistician 14, 1960)

"The null-hypothesis significance test treats ‘acceptance’ or ‘rejection’ of a hypothesis as though these were decisions one makes. But a hypothesis is not something, like a piece of pie offered for dessert, which can be accepted or rejected by a voluntary physical action. Acceptance or rejection of a hypothesis is a cognitive process, a degree of believing or disbelieving which, if rational, is not a matter of choice but determined solely by how likely it is, given the evidence, that the hypothesis is true." (William W Rozeboom, "The fallacy of the null–hypothesis significance test", Psychological Bulletin 57, 1960)

"The null hypothesis of no difference has been judged to be no longer a sound or fruitful basis for statistical investigation. […] Significance tests do not provide the information that scientists need, and, furthermore, they are not the most effective method for analyzing and summarizing data." (Cherry A Clark, "Hypothesis Testing in Relation to Statistical Methodology", Review of Educational Research Vol. 33, 1963)

"The idea of knowledge as an improbable structure is still a good place to start. Knowledge, however, has a dimension which goes beyond that of mere information or improbability. This is a dimension of significance which is very hard to reduce to quantitative form. Two knowledge structures might be equally improbable but one might be much more significant than the other." (Kenneth E Boulding, "Beyond Economics: Essays on Society", 1968)

"Significance levels are usually computed and reported, but power and confidence limits are not. Perhaps they should be." (Amos Tversky & Daniel Kahneman, "Belief in the law of small numbers", Psychological Bulletin 76(2), 1971)

"Science usually amounts to a lot more than blind trial and error. Good statistics consists of much more than just significance tests; there are more sophisticated tools available for the analysis of results, such as confidence statements, multiple comparisons, and Bayesian analysis, to drop a few names. However, not all scientists are good statisticians, or want to be, and not all people who are called scientists by the media deserve to be so described." (Robert Hooke, "How to Tell the Liars from the Statisticians", 1983)

"It is usually wise to give a confidence interval for the parameter in which you are interested." (David S Moore & George P McCabe, "Introduction to the Practice of Statistics", 1989)

"I do not think that significance testing should be completely abandoned [...] and I don’t expect that it will be. But I urge researchers to provide estimates, with confidence intervals: scientific advance requires parameters with known reliability estimates. Classical confidence intervals are formally equivalent to a significance test, but they convey more information." (Nigel G Yoccoz, "Use, Overuse, and Misuse of Significance Tests in Evolutionary Biology and Ecology", Bulletin of the Ecological Society of America Vol. 72 (2), 1991)

"Whereas hypothesis testing emphasizes a very narrow question (‘Do the population means fail to conform to a specific pattern?’), the use of confidence intervals emphasizes a much broader question (‘What are the population means?’). Knowing what the means are, of course, implies knowing whether they fail to conform to a specific pattern, although the reverse is not true. In this sense, use of confidence intervals subsumes the process of hypothesis testing." (Geoffrey R Loftus, "On the tyranny of hypothesis testing in the social sciences", Contemporary Psychology 36, 1991) 

"We should push for de-emphasizing some topics, such as statistical significance tests - an unfortunate carry-over from the traditional elementary statistics course. We would suggest a greater focus on confidence intervals - these achieve the aim of formal hypothesis testing, often provide additional useful information, and are not as easily misinterpreted." (Gerry Hahn et al, "The Impact of Six Sigma Improvement: A Glimpse Into the Future of Statistics", The American Statistician, 1999)

"[...] they [confidence limits] are rarely to be found in the literature. I suspect that the main reason they are not reported is that they are so embarrassingly large!" (Jacob Cohen, "The earth is round (p<.05)", American Psychologist 49, 1994)

"Given the important role that correlation plays in structural equation modeling, we need to understand the factors that affect establishing relationships among multivariable data points. The key factors are the level of measurement, restriction of range in data values (variability, skewness, kurtosis), missing data, nonlinearity, outliers, correction for attenuation, and issues related to sampling variation, confidence intervals, effect size, significance, sample size, and power." (Randall E Schumacker & Richard G Lomax, "A Beginner’s Guide to Structural Equation Modeling" 3rd Ed., 2010)

"Another way to secure statistical significance is to use the data to discover a theory. Statistical tests assume that the researcher starts with a theory, collects data to test the theory, and reports the results - whether statistically significant or not. Many people work in the other direction, scrutinizing the data until they find a pattern and then making up a theory that fits the pattern." (Gary Smith, "Standard Deviations", 2014)

"There is a growing realization that reported 'statistically significant' claims in statistical publications are routinely mistaken. Researchers typically express the confidence in their data in terms of p-value: the probability that a perceived result is actually the result of random variation. The value of p (for 'probability') is a way of measuring the extent to which a data set provides evidence against a so-called null hypothesis. By convention, a p- value below 0.05 is considered a meaningful refutation of the null hypothesis; however, such conclusions are less solid than they appear." (Andrew Gelman & Eric Loken, "The Statistical Crisis in Science", American Scientist Vol. 102(6), 2014)

04 November 2018

🔭Data Science: Residuals (Just the Quotes)

"Data analysis must be iterative to be effective. [...] The iterative and interactive interplay of summarizing by fit and exposing by residuals is vital to effective data analysis. Summarizing and exposing are complementary and pervasive." (John W Tukey & Martin B Wilk, "Data Analysis and: An Expository Overview", 1966)

"Exploratory data analysis, EDA, calls for a relatively free hand in exploring the data, together with dual obligations: (•) to look for all plausible alternatives and oddities - and a few implausible ones, (graphic techniques can be most helpful here) and (•) to remove each appearance that seems large enough to be meaningful - ordinarily by some form of fitting, adjustment, or standardization [...] so that what remains, the residuals, can be examined for further appearances." (John W Tukey, "Introduction to Styles of Data Analysis Techniques", 1982)

"A good description of the data summarizes the systematic variation and leaves residuals that look structureless. That is, the residuals exhibit no patterns and have no exceptionally large values, or outliers. Any structure present in the residuals indicates an inadequate fit. Looking at the residuals laid out in an overlay helps to spot patterns and outliers and to associate them with their source in the data." (Christopher H Schrnid, "Value Splitting: Taking the Data Apart", 1991)

"A useful description relates the systematic variation to one or more factors; if the residuals dwarf the effects for a factor, we may not be able to relate variation in the data to changes in the factor. Furthermore, changes in the factor may bring no important change in the response. Such comparisons of residuals and effects require a measure of the variation of overlays relative to each other." (Christopher H Schrnid, "Value Splitting: Taking the Data Apart", 1991)

"Fitting data means finding mathematical descriptions of structure in the data. An additive shift is a structural property of univariate data in which distributions differ only in location and not in spread or shape. […] The process of identifying a structure in data and then fitting the structure to produce residuals that have the same distribution lies at the heart of statistical analysis. Such homogeneous residuals can be pooled, which increases the power of the description of the variation in the data." (William S Cleveland, "Visualizing Data", 1993)

"When the distributions of two or more groups of univariate data are skewed, it is common to have the spread increase monotonically with location. This behavior is monotone spread. Strictly speaking, monotone spread includes the case where the spread decreases monotonically with location, but such a decrease is much less common for raw data. Monotone spread, as with skewness, adds to the difficulty of data analysis. For example, it means that we cannot fit just location estimates to produce homogeneous residuals; we must fit spread estimates as well. Furthermore, the distributions cannot be compared by a number of standard methods of probabilistic inference that are based on an assumption of equal spreads; the standard t-test is one example. Fortunately, remedies for skewness can cure monotone spread as well." (William S Cleveland, "Visualizing Data", 1993)

"Residual analysis is similarly unreliable. In a discussion after a presentation of residual analysis in a seminar at Berkeley in 1993, William Cleveland, one of the fathers of residual analysis, admitted that it could not uncover lack of fit in more than four to five dimensions. The papers I have read on using residual analysis to check lack of fit are confined to data sets with two or three variables. With higher dimensions, the interactions between the variables can produce passable residual plots for a variety of models. A residual plot is a goodness-of-fit test, and lacks power in more than a few dimensions. An acceptable residual plot does not imply that the model is a good fit to the data." (Leo Breiman, "Statistical Modeling: The Two Cultures", Statistical Science Vol. 16(3), 2001)

"For a confidence interval, the central limit theorem plays a role in the reliability of the interval because the sample mean is often approximately normal even when the underlying data is not. A prediction interval has no such protection. The shape of the interval reflects the shape of the underlying distribution. It is more important to examine carefully the normality assumption by checking the residuals […]." (DeWayne R Derryberry, "Basic data analysis for time series with R", 2014)

"Using noise (the uncorrelated variables) to fit noise (the residual left from a simple model on the genuinely correlated variables) is asking for trouble." (Steven S Skiena, "The Data Science Design Manual", 2017)

"One of the most common problems that you will encounter when training deep neural networks will be overfitting. What can happen is that your network may, owing to its flexibility, learn patterns that are due to noise, errors, or simply wrong data. [...] The essence of overfitting is to have unknowingly extracted some of the residual variation (i.e., the noise) as if that variation represented the underlying model structure. The opposite is called underfitting - when the model cannot capture the structure of the data." (Umberto Michelucci, "Applied Deep Learning: A Case-Based Approach to Understanding Deep Neural Networks", 2018)

🔭Data Science: Central Limit Theorem (Just the Quotes)

"I know of scarcely anything so apt to impress the imagination as the wonderful form of cosmic order expressed by the ‘Law of Frequency of Error’. The law would have been personified by the Greeks and deified, if they had known of it. It reigns with serenity and in complete self-effacement, amidst the wildest confusion. The huger the mob, and the greater the apparent anarchy, the more perfect is its sway. It is the supreme law of Unreason. Whenever a large sample of chaotic elements are taken in hand and marshalled in the order of their magnitude, an unsuspected and most beautiful form of regularity proves to have been latent all along." (Sir Francis Galton, "Natural Inheritance", 1889)

"The central limit theorem […] states that regardless of the shape of the curve of the original population, if you repeatedly randomly sample a large segment of your group of interest and take the average result, the set of averages will follow a normal curve." (Charles Livingston & Paul Voakes, "Working with Numbers and Statistics: A handbook for journalists", 2005)

"The central limit theorem says that, under conditions almost always satisfied in the real world of experimentation, the distribution of such a linear function of errors will tend to normality as the number of its components becomes large. The tendency to normality occurs almost regardless of the individual distributions of the component errors. An important proviso is that several sources of error must make important contributions to the overall error and that no particular source of error dominate the rest." (George E P Box et al, "Statistics for Experimenters: Design, discovery, and innovation" 2nd Ed., 2005)

"Two things explain the importance of the normal distribution: (1) The central limit effect that produces a tendency for real error distributions to be 'normal like'. (2) The robustness to nonnormality of some common statistical procedures, where 'robustness' means insensitivity to deviations from theoretical normality." (George E P Box et al, "Statistics for Experimenters: Design, discovery, and innovation" 2nd Ed., 2005)

"The central limit theorem differs from laws of large numbers because random variables vary and so they differ from constants such as population means. The central limit theorem says that certain independent random effects converge not to a constant population value such as the mean rate of unemployment but rather they converge to a random variable that has its own Gaussian bell-curve description." (Bart Kosko, "Noise", 2006)

"Normally distributed variables are everywhere, and most classical statistical methods use this distribution. The explanation for the normal distribution’s ubiquity is the Central Limit Theorem, which says that if you add a large number of independent samples from the same distribution the distribution of the sum will be approximately normal." (Ben Bolker, "Ecological Models and Data in R", 2007)

"[…] the Central Limit Theorem says that if we take any sequence of small independent random quantities, then in the limit their sum (or average) will be distributed according to the normal distribution. In other words, any quantity that can be viewed as the sum of many small independent random effects. will be well approximated by the normal distribution. Thus, for example, if one performs repeated measurements of a fixed physical quantity, and if the variations in the measurements across trials are the cumulative result of many independent sources of error in each trial, then the distribution of measured values should be approximately normal." (David Easley & Jon Kleinberg, "Networks, Crowds, and Markets: Reasoning about a Highly Connected World", 2010)

[myth] "It has been said that process behavior charts work because of the central limit theorem."(Donald J Wheeler, "Myths About Data Analysis", International Lean & Six Sigma Conference, 2012)

"Statistical inference is really just the marriage of two concepts that we’ve already discussed: data and probability (with a little help from the central limit theorem)." (Charles Wheelan, "Naked Statistics: Stripping the Dread from the Data", 2012)

"The central limit theorem is often used to justify the assumption of normality when using the sample mean and the sample standard deviation. But it is inevitable that real data contain gross errors. Five to ten percent unusual values in a dataset seem to be the rule rather than the exception. The distribution of such data is no longer Normal." (A S Hedayat & Guoqin Su, "Robustness of the Simultaneous Estimators of Location and Scale From Approximating a Histogram by a Normal Density Curve", The American Statistician 66, 2012)

"The central limit theorem tells us that in repeated samples, the difference between the two means will be distributed roughly as a normal distribution." (Charles Wheelan, "Naked Statistics: Stripping the Dread from the Data", 2012)

"According to the central limit theorem, it doesn’t matter what the raw data look like, the sample variance should be proportional to the number of observations and if I have enough of them, the sample mean should be normal." (Kristin H Jarman, "The Art of Data Analysis: How to answer almost any question using basic statistics", 2013)

"For a confidence interval, the central limit theorem plays a role in the reliability of the interval because the sample mean is often approximately normal even when the underlying data is not. A prediction interval has no such protection. The shape of the interval reflects the shape of the underlying distribution. It is more important to examine carefully the normality assumption by checking the residuals […].(DeWayne R Derryberry, "Basic data analysis for time series with R", 2014)

"When data is not normal, the reason the formulas are working is usually the central limit theorem. For large sample sizes, the formulas are producing parameter estimates that are approximately normal even when the data is not itself normal. The central limit theorem does make some assumptions and one is that the mean and variance of the population exist. Outliers in the data are evidence that these assumptions may not be true. Persistent outliers in the data, ones that are not errors and cannot be otherwise explained, suggest that the usual procedures based on the central limit theorem are not applicable.(DeWayne R Derryberry, "Basic data analysis for time series with R", 2014)

"At very small time scales, the motion of a particle is more like a random walk, as it gets jostled about by discrete collisions with water molecules. But virtually any random movement on small time scales will give rise to Brownian motion on large time scales, just so long as the motion is unbiased. This is because of the Central Limit Theorem, which tells us that the aggregate of many small, independent motions will be normally distributed." (Field Cady, "The Data Science Handbook", 2017)

"The central limit conjecture states that most errors are the result of many small errors and, as such, have a normal distribution. The assumption of a normal distribution for error has many advantages and has often been made in applications of statistical models." (David S Salsburg, "Errors, Blunders, and Lies: How to Tell the Difference", 2017)

"Theoretically, the normal distribution is most famous because many distributions converge to it, if you sample from them enough times and average the results. This applies to the binomial distribution, Poisson distribution and pretty much any other distribution you’re likely to encounter (technically, any one for which the mean and standard deviation are finite)." (Field Cady, "The Data Science Handbook", 2017)

"The central limit theorem in statistics states that, given a sufficiently large sample size, the sampling distribution of the mean for a variable will approximate a normal distribution regardless of that variable’s distribution in the population." (Jim Frost)

"The old rule of trusting the Central Limit Theorem if the sample size is larger than 30 is just that–old. Bootstrap and permutation testing let us more easily do inferences for a wider variety of statistics." (Tim Hesterberg)

03 November 2018

🔭Data Science: Least Squares (Just the Quotes)

"From the foregoing we see that the two justifications each leave something to be desired. The first depends entirely on the hypothetical form of the probability of the error; as soon as that form is rejected, the values of the unknowns produced by the method of least squares are no more the most probable values than is the arithmetic mean in the simplest case mentioned above. The second justification leaves us entirely in the dark about what to do when the number of observations is not large. In this case the method of least squares no longer has the status of a law ordained by the probability calculus but has only the simplicity of the operations it entails to recommend it." (Carl Friedrich Gauss, "Anzeige: Theoria combinationis observationum erroribus minimis obnoxiae: Pars prior", Göttingische gelehrte Anzeigen, 1821)

"The method of least squares is used in the analysis of data from planned experiments and also in the analysis of data from unplanned happenings. The word 'regression' is most often used to describe analysis of unplanned data. It is the tacit assumption that the requirements for the validity of least squares analysis are satisfied for unplanned data that produces a great deal of trouble." (George E P Box, "Use and Abuse of Regression", 1966)

"At the heart of probabilistic statistical analysis is the assumption that a set of data arises as a sample from a distribution in some class of probability distributions. The reasons for making distributional assumptions about data are several. First, if we can describe a set of data as a sample from a certain theoretical distribution, say a normal distribution (also called a Gaussian distribution), then we can achieve a valuable compactness of description for the data. For example, in the normal case, the data can be succinctly described by giving the mean and standard deviation and stating that the empirical (sample) distribution of the data is well approximated by the normal distribution. A second reason for distributional assumptions is that they can lead to useful statistical procedures. For example, the assumption that data are generated by normal probability distributions leads to the analysis of variance and least squares. Similarly, much of the theory and technology of reliability assumes samples from the exponential, Weibull, or gamma distribution. A third reason is that the assumptions allow us to characterize the sampling distribution of statistics computed during the analysis and thereby make inferences and probabilistic statements about unknown aspects of the underlying distribution. For example, assuming the data are a sample from a normal distribution allows us to use the t-distribution to form confidence intervals for the mean of the theoretical distribution. A fourth reason for distributional assumptions is that understanding the distribution of a set of data can sometimes shed light on the physical mechanisms involved in generating the data." (John M Chambers et al, "Graphical Methods for Data Analysis", 1983)

"Least squares' means just what it says: you minimise the (suitably weighted) squared difference between a set of measurements and their predicted values. This is done by varying the parameters you want to estimate: the predicted values are adjusted so as to be close to the measurements; squaring the differences means that greater importance is placed on removing the large deviations." (Roger J Barlow, "Statistics: A guide to the use of statistical methods in the physical sciences", 1989)

"Principal components and principal factor analysis lack a well-developed theoretical framework like that of least squares regression. They consequently provide no systematic way to test hypotheses about the number of factors to retain, the size of factor loadings, or the correlations between factors, for example. Such tests are possible using a different approach, based on maximum-likelihood estimation." (Lawrence C Hamilton, "Regression with Graphics: A second course in applied statistics", 1991)

"Fuzzy models should make good predictions even when they are asked to predict on regions that were not excited during the construction of the model. The generalization capabilities can be controlled by an appropriate initialization of the consequences (prior knowledge) and the use of the recursive least squares to improve the prior choices. The prior knowledge can be obtained from the data." (Jairo Espinosa et al, "Fuzzy Logic, Identification and Predictive Control", 2005)

"Often when people relate essentially the same variable in two different groups, or at two different times, they see this same phenomenon - the tendency of the response variable to be closer to the mean than the predicted value. Unfortunately, people try to interpret this by thinking that the performance of those far from the mean is deteriorating, but it’s just a mathematical fact about the correlation. So, today we try to be less judgmental about this phenomenon and we call it regression to the mean. We managed to get rid of the term 'mediocrity', but the name regression stuck as a name for the whole least squares fitting procedure - and that’s where we get the term regression line." (Richard D De Veaux et al, "Stats: Data and Models", 2016)

Related Posts Plugin for WordPress, Blogger...

About Me

My photo
Koeln, NRW, Germany
IT Professional with more than 24 years experience in IT in the area of full life-cycle of Web/Desktop/Database Applications Development, Software Engineering, Consultancy, Data Management, Data Quality, Data Migrations, Reporting, ERP implementations & support, Team/Project/IT Management, etc.