SQL Troubles

08 November 2018

🔭Data Science: Heuristics (Just the Quotes)

"Heuristic reasoning is reasoning not regarded as final and strict but as provisional and plausible only, whose purpose is to discover the solution of the present problem. We are often obliged to use heuristic reasoning. We shall attain complete certainty when we shall have obtained the complete solution, but before obtaining certainty we must often be satisfied with a more or less plausible guess. We may need the provisional before we attain the final. We need heuristic reasoning when we construct a strict proof as we need scaffolding when we erect a building." (George Pólya,How to Solve It", 1945)

"The attempt to characterize exactly models of an empirical theory almost inevitably yields a more precise and clearer understanding of the exact character of a theory. The emptiness and shallowness of many classical theories in the social sciences is well brought out by the attempt to formulate in any exact fashion what constitutes a model of the theory. The kind of theory which mainly consists of insightful remarks and heuristic slogans will not be amenable to this treatment. The effort to make it exact will at the same time reveal the weakness of the theory." (Patrick Suppes," A Comparison of the Meaning and Uses of Models in Mathematics and the Empirical Sciences", Synthese Vol. 12" (2/3), 1960)

"Design problems - generating or discovering alternatives - are complex largely because they involve two spaces, an action space and a state space, that generally have completely different structures. To find a design requires mapping the former of these on the latter. For many, if not most, design problems in the real world systematic algorithms are not known that guarantee solutions with reasonable amounts of computing effort. Design uses a wide range of heuristic devices - like means-end analysis, satisficing, and the other procedures that have been outlined - that have been found by experience to enhance the efficiency of search. Much remains to be learned about the nature and effectiveness of these devices." (Herbert A Simon,The Logic of Heuristic Decision Making", [inThe Logic of Decision and Action"], 1966)

"Intelligence has two parts, which we shall call the epistemological and the heuristic. The epistemological part is the representation of the world in such a form that the solution of problems follows from the facts expressed in the representation. The heuristic part is the mechanism that on the basis of the information solves the problem and decides what to do." (John McCarthy & Patrick J Hayes,Some Philosophical Problems from the Standpoint of Artificial Intelligence", Machine Intelligence 4, 1969)

"Consider any of the heuristics that people have come up with for supervised learning: avoid overfitting, prefer simpler to more complex models, boost your algorithm, bag it, etc. The no free lunch theorems say that all such heuristics fail as often" (appropriately weighted) as they succeed. This is true despite formal arguments some have offered trying to prove the validity of some of these heuristics." (David H Wolpert,The lack of a priori distinctions between learning algorithms", Neural Computation Vol. 8(7), 1996)

"Heuristic (it is of Greek origin) means discovery. Heuristic methods are based on experience, rational ideas, and rules of thumb. Heuristics are based more on common sense than on mathematics. Heuristics are useful, for example, when the optimal solution needs an exhaustive search that is not realistic in terms of time. In principle, a heuristic does not guarantee the best solution, but a heuristic solution can provide a tremendous shortcut in cost and time." (Nikola K Kasabov,Foundations of Neural Networks, Fuzzy Systems, and Knowledge Engineering", 1996)

"Theories of choice are at best approximate and incomplete. One reason for this pessimistic assessment is that choice is a constructive and contingent process. When faced with a complex problem, people employ a variety of heuristic procedures in order to simplify the representation and the evaluation of prospects. These procedures include computational shortcuts and editing operations, such as eliminating common components and discarding nonessential differences. The heuristics of choice do not readily lend themselves to formal analysis because their application depends on the formulation of the problem, the method of elicitation, and the context of choice." (Amos Tversky & Daniel Kahneman,Advances in Prospect Theory: Cumulative Representation of Uncertainty" [inChoices, Values, and Frames"], 2000)

"Behavioural research shows that we tend to use simplifying heuristics when making judgements about uncertain events. These are prone to biases and systematic errors, such as stereotyping, disregard of sample size, disregard for regression to the mean, deriving estimates based on the ease of retrieving instances of the event, anchoring to the initial frame, the gambler’s fallacy, and wishful thinking, which are all affected by our inability to consider more than a few aspects or dimensions of any phenomenon or situation at the same time." (Hans G Daellenbach & Donald C McNickle,Management Science: Decision making through systems thinking", 2005)

"A decision theory that rests on the assumptions that human cognitive capabilities are limited and that these limitations are adaptive with respect to the decision environments humans frequently encounter. Decision are thought to be made usually without elaborate calculations, but instead by using fast and frugal heuristics. These heuristics certainly have the advantage of speed and simplicity, but if they are well matched to a decision environment, they can even outperform maximizing calculations with respect to accuracy. The reason for this is that many decision environments are characterized by incomplete information and noise. The information we do have is usually structured in a specific way that clever heuristics can exploit." (E Ebenhoh,Agent-Based Modelnig with Boundedly Rational Agents", 2007)

"Optimization systems (or optimizers, as they are often referred to) aim to optimize in a systematic way, oftentimes using a heuristics-based approach. Such an approach enables the AI system to use a macro level concept as part of its low-level calculations, accelerating the whole process and making it more light-weight. After all, most of these systems are designed with scalability in mind, so the heuristic approach is most practical." (Yunus E Bulut & Zacharias Voulgaris,AI for Data Science: Artificial Intelligence Frameworks and Functionality for Deep Learning, Optimization, and Beyond", 2018)

"The social world that humans have made for themselves is so complex that the mind simplifies the world by using heuristics, customs, and habits, and by making models or assumptions about how things generally work (the ‘causal structure of the world’). And because people rely upon" (and are invested in) these mental models, they usually prefer that they remain uncontested." (Dr James Brennan,Psychological Adjustment to Illness and Injury", West of England Medical Journal Vol. 117 (2), 2018)

"Many AI systems employ heuristic decision making, which uses a strategy to find the most likely correct decision to avoid the high cost" (time) of processing lots of information. We can think of those heuristics as shortcuts or rules of thumb that we would use to make fast decisions." (Jesús Barrasa et al,Knowledge Graphs: Data in Context for Responsive Businesses", 2021)

"Once we know something is fat-tailed, we can use heuristics to see how an exposure there reacts to random events: how much is a given unit harmed by them. It is vastly more effective to focus on being insulated from the harm of random events than try to figure them out in the required details" (as we saw the inferential errors under thick tails are huge). So it is more solid, much wiser, more ethical, and more effective to focus on detection heuristics and policies rather than fabricate statistical properties." (Nassim N Taleb,Statistical Consequences of Fat Tails: Real World Preasymptotics, Epistemology, and Applications" 2nd Ed., 2022)

07 November 2018

🔭Data Science: Decision Theory (Just the Quotes)

"Years ago a statistician might have claimed that statistics deals with the processing of data [...] today’s statistician will be more likely to say that statistics is concerned with decision making in the face of uncertainty." (Herman Chernoff & Lincoln E Moses, "Elementary Decision Theory", 1959)

"Another approach to management theory, undertaken by a growing and scholarly group, might be referred to as the decision theory school. This group concentrates on rational approach to decision-the selection from among possible alternatives of a course of action or of an idea. The approach of this school may be to deal with the decision itself, or to the persons or organizational group making the decision, or to an analysis of the decision process. Some limit themselves fairly much to the economic rationale of the decision, while others regard anything which happens in an enterprise the subject of their analysis, and still others expand decision theory to cover the psychological and sociological aspect and environment of decisions and decision-makers." (Harold Koontz, "The Management Theory Jungle," 1961)

"The term hypothesis testing arises because the choice as to which process is observed is based on hypothesized models. Thus hypothesis testing could also be called model testing. Hypothesis testing is sometimes called decision theory. The detection theory of communication theory is a special case." (Fred C Scweppe, "Uncertain dynamic systems", 1973)

"In decision theory, mathematical analysis shows that once the sampling distribution, loss function, and sample are specified, the only remaining basis for a choice among different admissible decisions lies in the prior probabilities. Therefore, the logical foundations of decision theory cannot be put in fully satisfactory form until the old problem of arbitrariness (sometimes called 'subjectiveness') in assigning prior probabilities is resolved." (Edwin T Jaynes, "Prior Probabilities", 1978)

"Decision theory, as it has grown up in recent years, is a formalization of the problems involved in making optimal choices. In a certain sense - a very abstract sense, to be sure - it incorporates among others operations research, theoretical economics, and wide areas of statistics, among others." (Kenneth Arrow, "The Economics of Information", 1984)

"Cybernetics is concerned with scientific investigation of systemic processes of a highly varied nature, including such phenomena as regulation, information processing, information storage, adaptation, self-organization, self-reproduction, and strategic behavior. Within the general cybernetic approach, the following theoretical fields have developed: systems theory (system), communication theory, game theory, and decision theory." (Fritz B Simon et al, "Language of Family Therapy: A Systemic Vocabulary and Source Book", 1985)

"A field of study that includes a methodology for constructing computer simulation models to achieve better under-standing of social and corporate systems. It draws on organizational studies, behavioral decision theory, and engineering to provide a theoretical and empirical base for structuring the relationships in complex systems." (Virginia Anderson & Lauren Johnson, "Systems Thinking Basics: From Concepts to Casual Loops", 1997)

"A decision theory that rests on the assumptions that human cognitive capabilities are limited and that these limitations are adaptive with respect to the decision environments humans frequently encounter. Decision are thought to be made usually without elaborate calculations, but instead by using fast and frugal heuristics. These heuristics certainly have the advantage of speed and simplicity, but if they are well matched to a decision environment, they can even outperform maximizing calculations with respect to accuracy. The reason for this is that many decision environments are characterized by incomplete information and noise. The information we do have is usually structured in a specific way that clever heuristics can exploit." (E Ebenhoh, "Agent-Based Modelnig with Boundedly Rational Agents", 2007)

🔭Data Science: Belief (Just the Quotes)

"By degree of probability we really mean, or ought to mean, degree of belief [...] Probability then, refers to and implies belief, more or less, and belief is but another name for imperfect knowledge, or it may be, expresses the mind in a state of imperfect knowledge." (Augustus De Morgan, "Formal Logic: Or, The Calculus of Inference, Necessary and Probable", 1847)

"To a scientist a theory is something to be tested. He seeks not to defend his beliefs, but to improve them. He is, above everything else, an expert at ‘changing his mind’." (Wendell Johnson, 1946)

"A model can not be proved to be correct; at best it can only be found to be reasonably consistant and not to contradict some of our beliefs of what reality is." (Richard W Hamming, "The Art of Probability for Scientists and Engineers", 1991)

"Sometimes the proprietors of a bad model claim that parts of it are facts, not just beliefs. Evaluation then amounts to determining if facts support the claims, and disciplines like statistics have tools for this task. The difficulty of using statistical tools will vary depending on the problem." (James S Hodges, "Six (or So) Things You Can Do with a Bad Model", 1991)

"Probability is not about the odds, but about the belief in the existence of an alternative outcome, cause, or motive." (Nassim N Taleb, "Fooled by Randomness", 2001)

"The Bayesian approach is based on the following postulates: (B1) Probability describes degree of belief, not limiting frequency. As such, we can make probability statements about lots of things, not just data which are subject to random variation. […] (B2) We can make probability statements about parameters, even though they are fixed constants. (B3) We make inferences about a parameter θ by producing a probability distribution for θ. Inferences, such as point estimates and interval estimates, may then be extracted from this distribution." (Larry A Wasserman, "All of Statistics: A concise course in statistical inference", 2004)

"The important thing is to understand that frequentist and Bayesian methods are answering different questions. To combine prior beliefs with data in a principled way, use Bayesian inference. To construct procedures with guaranteed long run performance, such as confidence intervals, use frequentist methods. Generally, Bayesian methods run into problems when the parameter space is high dimensional." (Larry A Wasserman, "All of Statistics: A concise course in statistical inference", 2004)

"Our inner weighing of evidence is not a careful mathematical calculation resulting in a probabilistic estimate of truth, but more like a whirlpool blending of the objective and the personal. The result is a set of beliefs - both conscious and unconscious - that guide us in interpreting all the events of our lives." (Leonard Mlodinow, "War of the Worldviews: Where Science and Spirituality Meet - and Do Not", 2011)

"The search for better numbers, like the quest for new technologies to improve our lives, is certainly worthwhile. But the belief that a few simple numbers, a few basic averages, can capture the multifaceted nature of national and global economic systems is a myth. Rather than seeking new simple numbers to replace our old simple numbers, we need to tap into both the power of our information age and our ability to construct our own maps of the world to answer the questions we need answering." (Zachary Karabell, "The Leading Indicators: A short history of the numbers that rule our world", 2014)

"One kind of probability - classic probability - is based on the idea of symmetry and equal likelihood […] In the classic case, we know the parameters of the system and thus can calculate the probabilities for the events each system will generate. […] A second kind of probability arises because in daily life we often want to know something about the likelihood of other events occurring […]. In this second case, we need to estimate the parameters of the system because we don’t know what those parameters are. […] A third kind of probability differs from these first two because it’s not obtained from an experiment or a replicable event - rather, it expresses an opinion or degree of belief about how likely a particular event is to occur. This is called subjective probability […]." (Daniel J Levitin, "Weaponized Lies", 2017)

"Bayesian statistics give us an objective way of combining the observed evidence with our prior knowledge (or subjective belief) to obtain a revised belief and hence a revised prediction of the outcome of the coin’s next toss. [...] This is perhaps the most important role of Bayes’s rule in statistics: we can estimate the conditional probability directly in one direction, for which our judgment is more reliable, and use mathematics to derive the conditional probability in the other direction, for which our judgment is rather hazy. The equation also plays this role in Bayesian networks; we tell the computer the forward probabilities, and the computer tells us the inverse probabilities when needed." (Judea Pearl & Dana Mackenzie, "The Book of Why: The new science of cause and effect", 2018)

"The transparency of Bayesian networks distinguishes them from most other approaches to machine learning, which tend to produce inscrutable 'black boxes'. In a Bayesian network you can follow every step and understand how and why each piece of evidence changed the network’s beliefs." (Judea Pearl & Dana Mackenzie, "The Book of Why: The new science of cause and effect", 2018)

♟️Strategic Management: Accuracy (Just the Quotes)

"There is some confusion today as to the meaning of scientific management. This concerns itself with the nature of such management itself, with the scope or field to which such management applies, and with the aims that it desires to attain. Scientific management is simply management that is based upon actual measurement. Its skillful application is an art that must be acquired, but its fundamental principles have the exactness of scientific laws which are open to study by everyone. We have here nothing hidden or occult or secret, like the working practices of an old-time craft; we have here a science that is the result of accurately recorded, exact investigation." (Frank B Gilbreth Sr., "Applied Motion Study", 1917)

"Scientists whose work has no clear, practical implications would want to make their decisions considering such things as: the relative worth of (1) more observations, (2) greater scope of his conceptual model, (3) simplicity, (4) precision of language, (5) accuracy of the probability assignment." (C West Churchman, "Costs, Utilities, and Values", 1956)

"[...] long-range plans are most valuable when they are revised and adjusted and set anew at shorter periods. The five-year plan is reconstructed each year in turn for the following five years. The soundest basis for this change is accurate measurement of the results of the first year's experience with the plan against the target of the plan." (George S Odiorne, "Management by Objectives", 1965)

"However, and conversely, our models fall far short of representing the world fully. That is why we make mistakes and why we are regularly surprised. In our heads, we can keep track of only a few variables at one time. We often draw illogical conclusions from accurate assumptions, or logical conclusions from inaccurate assumptions. Most of us, for instance, are surprised by the amount of growth an exponential process can generate. Few of us can intuit how to damp oscillations in a complex system." (Donella H Meadows, "Limits to Growth", 1972)

"Planning and management by objectives have their point as devices for compelling thought, so long as executives don't forget that any plan worth making is inaccurate; the longer a plan takes to write, the worse it is - just because of its consumption of time. And the more they change plans to suit events, the better they will manage - if you've made a mistake, you had better admit it." (Robert Heller, "The Naked Manager: Games Executives Play", 1972)

"In management, there are few things as dangerous as a comprehensive, accurate answer to the wrong question. This is pseudo-knowledge. It easily misleads management into erroneous actions. Pseudo-knowledge has mushroomed with the advent of computers, which have made available masses of data that answer questions managers found too costly to ask before. In too many instances, however, the data are collected but not used because they answer irrelevant questions." (Dale E. Zand, "Information, Organization, and Power", 1981)

"People in general tend to assume that there is some 'right' way of solving problems. Formal logic, for example, is regarded as a correct approach to thinking, but thinking is always a compromise between the demands of comprehensiveness, speed, and accuracy. There is no best way of thinking." (James L McKenney & Peter G W Keen, Harvard Business Review on Human Relations, 1986)

"Top managers are currently inundated with reams of information concerning the organizational units under their supervision. Behind this information explosion lies a seemingly logical assumption made by information specialists and frequently accepted by line managers: if top management can be supplied with more 'objective' and 'accurate' quantified information, they will make 'better' judgments about the performance of their operating units. [...] A research study we have recently completed indicates that quantified performance information may have a more limited role than is currently assumed or envisioned; in fact, managers rely more on subjective information than they do on so called 'objective' statistics in assessing the overall performance of lower-level units." (Larry E. Greiner et al, Harvard Business Review on Human Relations, 1986)

"[...] incomplete, inaccurate, and invalid data can cause problems for an organization. These problems are not only embarrassing and awkward but will also cause the organization to lose customers, new opportunities, and market share." (Margaret Y Chu, "Blissful Data", 2004)

"So everyone has and uses mental representations. What sets expert performers apart from everyone else is the quality and quantity of their mental representations. Through years of practice, they develop highly complex and sophisticated representations of the various situations they are likely to encounter in their fields - such as the vast number of arrangements of chess pieces that can appear during games. These representations allow them to make faster, more accurate decisions and respond more quickly and effectively in a given situation. This, more than anything else, explains the difference in performance between novices and experts." (Anders Ericsson & Robert Pool, "Peak: Secrets from the New Science of Expertise", 2016)

See also the quotes on "Accuracy" in Data Science, Graphical Representation

06 November 2018

🔭Data Science: Tools (Just the Quotes)

"[Statistics] are the only tools by which an opening can be cut through the formidable thicket of difficulties that bars the path of those who pursue the Science of man." (Sir Francis Galton, "Natural Inheritance", 1889)

"Mathematics is merely a shorthand method of recording physical intuition and physical reasoning, but it should not be a formalism leading from nowhere to nowhere, as it is likely to be made by one who does not realize its purpose as a tool." (Charles P Steinmetz, "Transactions of the American Institute of Electrical Engineers", 1909)

"Statistical methods are tools of scientific investigation. Scientific investigation is a controlled learning process in which various aspects of a problem are illuminated as the study proceeds. It can be thought of as a major iteration within which secondary iterations occur. The major iteration is that in which a tentative conjecture suggests an experiment, appropriate analysis of the data so generated leads to a modified conjecture, and this in turn leads to a new experiment, and so on." (George E P Box & George C Tjao, "Bayesian Inference in Statistical Analysis", 1973)

"[...] exploratory data analysis is an attitude, a state of flexibility, a willingness to look for those things that we believe are not there, as well as for those we believe might be there. Except for its emphasis on graphs, its tools are secondary to its purpose." (John W Tukey, [comment] 1979)

"Correlation analysis is a useful tool for uncovering a tenuous relationship, but it doesn't necessarily provide any real understanding of the relationship, and it certainly doesn't provide any evidence that the relationship is one of cause and effect. People who don't understand correlation tend to credit it with being a more fundamental approach than it is." (Robert Hooke, "How to Tell the Liars from the Statisticians", 1983)

"Science usually amounts to a lot more than blind trial and error. Good statistics consists of much more than just significance tests; there are more sophisticated tools available for the analysis of results, such as confidence statements, multiple comparisons, and Bayesian analysis, to drop a few names. However, not all scientists are good statisticians, or want to be, and not all people who are called scientists by the media deserve to be so described." (Robert Hooke, "How to Tell the Liars from the Statisticians", 1983)

"Some methods, such as those governing the design of experiments or the statistical treatment of data, can be written down and studied. But many methods are learned only through personal experience and interactions with other scientists. Some are even harder to describe or teach. Many of the intangible influences on scientific discovery - curiosity, intuition, creativity - largely defy rational analysis, yet they are often the tools that scientists bring to their work." (Committee on the Conduct of Science, "On Being a Scientist", 1989)

"Statistics is a tool. In experimental science you plan and carry out experiments, and then analyse and interpret the results. To do this you use statistical arguments and calculations. Like any other tool - an oscilloscope, for example, or a spectrometer, or even a humble spanner - you can use it delicately or clumsily, skillfully or ineptly. The more you know about it and understand how it works, the better you will be able to use it and the more useful it will be." (Roger Barlow, "Statistics: A Guide to the Use of Statistical Methods in the Physical Sciences", 1989)

"Fitting is essential to visualizing hypervariate data. The structure of data in many dimensions can be exceedingly complex. The visualization of a fit to hypervariate data, by reducing the amount of noise, can often lead to more insight. The fit is a hypervariate surface, a function of three or more variables. As with bivariate and trivariate data, our fitting tools are loess and parametric fitting by least-squares. And each tool can employ bisquare iterations to produce robust estimates when outliers or other forms of leptokurtosis are present." (William S Cleveland, "Visualizing Data", 1993)

"The logarithm is one of many transformations that we can apply to univariate measurements. The square root is another. Transformation is a critical tool for visualization or for any other mode of data analysis because it can substantially simplify the structure of a set of data. For example, transformation can remove skewness toward large values, and it can remove monotone increasing spread. And often, it is the logarithm that achieves this removal." (William S Cleveland, "Visualizing Data", 1993)

"Probability theory is an ideal tool for formalizing uncertainty in situations where class frequencies are known or where evidence is based on outcomes of a sufficiently long series of independent random experiments. Possibility theory, on the other hand, is ideal for formalizing incomplete information expressed in terms of fuzzy propositions." (George Klir, "Fuzzy sets and fuzzy logic", 1995)

"We use mathematics and statistics to describe the diverse realms of randomness. From these descriptions, we attempt to glean insights into the workings of chance and to search for hidden causes. With such tools in hand, we seek patterns and relationships and propose predictions that help us make sense of the world." (Ivars Peterson, "The Jungles of Randomness: A Mathematical Safari", 1998)

"When an analyst selects the wrong tool, this is a misuse which usually leads to invalid conclusions. Incorrect use of even a tool as simple as the mean can lead to serious misuses. […] But all statisticians know that more complex tools do not guarantee an analysis free of misuses. Vigilance is required on every statistical level." (Herbert F Spirer et al, "Misused Statistics" 2nd Ed, 1998)

"The key role of representation in thinking is often downplayed because of an ideal of rationality that dictates that whenever two statements are mathematically or logically the same, representing them in different forms should not matter. Evidence that it does matter is regarded as a sign of human irrationality. This view ignores the fact that finding a good representation is an indispensable part of problem solving and that playing with different representations is a tool of creative thinking." (Gerd Gigerenzer, "Calculated Risks: How to know when numbers deceive you", 2002)

"There is a tendency to use hypothesis testing methods even when they are not appropriate. Often, estimation and confidence intervals are better tools. Use hypothesis testing only when you want to test a well-defined hypothesis." (Larry A Wasserman, "All of Statistics: A concise course in statistical inference", 2004)

"Popular accounts of mathematics often stress the discipline’s obsession with certainty, with proof. And mathematicians often tell jokes poking fun at their own insistence on precision. However, the quest for precision is far more than an end in itself. Precision allows one to reason sensibly about objects outside of ordinary experience. It is a tool for exploring possibility: about what might be, as well as what is." (Donal O’Shea, “The Poincaré Conjecture”, 2007)

"The key to understanding randomness and all of mathematics is not being able to intuit the answer to every problem immediately but merely having the tools to figure out the answer." (Leonard Mlodinow,"The Drunkard’s Walk: How Randomness Rules Our Lives", 2008)

"[...] a model is a tool for taking decisions and any decision taken is the result of a process of reasoning that takes place within the limits of the human mind. So, models have eventually to be understood in such a way that at least some layer of the process of simulation is comprehensible by the human mind. Otherwise, we may find ourselves acting on the basis of models that we don’t understand, or no model at all.” (Ugo Bardi, “The Limits to Growth Revisited”, 2011)

"The first and main goal of any graphic and visualization is to be a tool for your eyes and brain to perceive what lies beyond their natural reach." (Alberto Cairo, "The Functional Art", 2011)

"What is really important is to remember that no matter how creative and innovative you wish to be in your graphics and visualizations, the first thing you must do, before you put a finger on the computer keyboard, is ask yourself what users are likely to try to do with your tool." (Alberto Cairo, "The Functional Art", 2011)

"Data mining is a craft. As with many crafts, there is a well-defined process that can help to increase the likelihood of a successful result. This process is a crucial conceptual tool for thinking about data science projects. [...] data mining is an exploratory undertaking closer to research and development than it is to engineering." (Foster Provost, "Data Science for Business", 2013)

"It is important to remember that predictive data analytics models built using machine learning techniques are tools that we can use to help make better decisions within an organization and are not an end in themselves. It is paramount that, when tasked with creating a predictive model, we fully understand the business problem that this model is being constructed to address and ensure that it does address it." (John D Kelleher et al, "Fundamentals of Machine Learning for Predictive Data Analytics: Algorithms, worked examples, and case studies", 2015)

"A popular misconception holds that the era of Big Data means the end of a need for sampling. In fact, the proliferation of data of varying quality and relevance reinforces the need for sampling as a tool to work efficiently with a variety of data, and minimize bias. Even in a Big Data project, predictive models are typically developed and piloted with samples." (Peter C Bruce & Andrew G Bruce, "Statistics for Data Scientists: 50 Essential Concepts", 2016)

"[…] remember that, as with many statistical issues, sampling in and of itself is not a good or a bad thing. Sampling is a powerful tool that allows us to learn something, when looking at the full population is not feasible (or simply isn’t the preferred option). And you shouldn’t be misled to think that you always should use all the data. In fact, using a sample of data can be incredibly helpful." (John H Johnson & Mike Gluck, "Everydata: The misinformation hidden in the little data you consume every day", 2016)

"A theory is nothing but a tool to know the reality. If a theory contradicts reality, it must be discarded at the earliest." (Awdhesh Singh, "Myths are Real, Reality is a Myth", 2018)

"Cross-validation is a useful tool for finding optimal predictive models, and it also works well in visualization. The concept is simple: split the data at random into a 'training' and a 'test' set, fit the model to the training data, then see how well it predicts the test data. As the model gets more complex, it will always fit the training data better and better. It will also start off getting better results on the test data, but there comes a point where the test data predictions start going wrong." (Robert Grant, "Data Visualization: Charts, Maps and Interactive Graphics", 2019)

"Even though data is being thrust on more people, it doesn’t mean everyone is prepared to consume and use it effectively. As our dependence on data for guidance and insights increases, the need for greater data literacy also grows. If literacy is defined as the ability to read and write, data literacy can be defined as the ability to understand and communicate data. Today’s advanced data tools can offer unparalleled insights, but they require capable operators who can understand and interpret data." (Brent Dykes, "Effective Data Storytelling: How to Drive Change with Data, Narrative and Visuals", 2019)

"We know what forecasting is: you start in the present and try to look into the future and imagine what it will be like. Backcasting is the opposite: you state your desired vision of the future as if it’s already happened, and then work backward to imagine the practices, policies, programs, tools, training, and people who worked in concert in a hypothetical past (which takes place in the future) to get you there." (Eben Hewitt, "Technology Strategy Patterns: Architecture as strategy" 2nd Ed., 2019)

"Big data is revolutionizing the world around us, and it is easy to feel alienated by tales of computers handing down decisions made in ways we don’t understand. I think we’re right to be concerned. Modern data analytics can produce some miraculous results, but big data is often less trustworthy than small data. Small data can typically be scrutinized; big data tends to be locked away in the vaults of Silicon Valley. The simple statistical tools used to analyze small datasets are usually easy to check; pattern-recognizing algorithms can all too easily be mysterious and commercially sensitive black boxes." (Tim Harford, "The Data Detective: Ten easy rules to make sense of statistics", 2020)

"I think sometimes organizations are looking at tools or the mythical and elusive data driven culture to be the strategy. Let me emphasize now: culture and tools are not strategies; they are enabling pieces." (Jordan Morrow, "Be Data Literate: The data literacy skills everyone needs to succeed", 2021)

"Visualisation is fundamentally limited by the number of pixels you can pump to a screen. If you have big data, you have way more data than pixels, so you have to summarise your data. Statistics gives you lots of really good tools for this." (Hadley Wickham)

05 November 2018

💠🛠️SQL Server: Administration (End of Life for 2008 and 2008 R2 Versions)

SQL Server 2008 and 2008 R2 versions are heading with steep steps toward the end of support - July 9, 2019. Besides an upgrade to upper versions, it seems there is also the opportunity to migrate to an Azure SQL Server Managed Instance, which allows a near 100% compatibility with an on-premises SQL Server installation, or to a Azure VM (see Franck Mercier’s post).

If you aren’t sure which versions of SQL Server you have in your organization here’s a script that can be run on all SQL Server 2005+ installations via SQL Server Management Studio:

SELECT SERVERPROPERTY('ComputerNamePhysicalNetBIOS') ComputerName
 , SERVERPROPERTY('Edition') Edition
 , SERVERPROPERTY('ProductVersion') ProductVersion
 , CASE Cast(SERVERPROPERTY('ProductVersion') as nvarchar(20))
     WHEN '9.00.5000.00' THEN 'SQL Server 2005'
     WHEN '10.0.6000.29' THEN 'SQL Server 2008' 
     WHEN '10.50.6000.34' THEN 'SQL Server 2008 R2' 
     WHEN '11.0.7001.0' THEN 'SQL Server 2012' 
     WHEN '12.0.6024.0' THEN 'SQL Server 2014' 
     WHEN '13.0.5026.0' THEN 'SQL Server 2016'
     WHEN '14.0.2002.14' THEN 'SQL Server 2017'
     ELSE 'unknown'
   END Product

Resources:
[1] Microsoft (2018) How to determine the version, edition, and update level of SQL Server and its components [Online] Available from: https://support.microsoft.com/en-us/help/321185/how-to-determine-the-version-edition-and-update-level-of-sql-server-an
[2] Microsoft Blogs (2018) SQL Server 2008 end of support, by Franck Mercier [Online] Available from: https://blogs.technet.microsoft.com/franmer/2018/11/01/sql-server-2008-end-of-support-2/

🔭Data Science: Confidence (Just the Quotes)

"What the use of P [the significance level] implies, therefore, is that a hypothesis that may be true may be rejected because it has not predicted observable results that have not occurred." (Harold Jeffreys, "Theory of Probability", 1939)

"Only by the analysis and interpretation of observations as they are made, and the examination of the larger implications of the results, is one in a satisfactory position to pose new experimental and theoretical questions of the greatest significance." (John A Wheeler, "Elementary Particle Physics", American Scientist, 1947)

"As usual we may make the errors of I) rejecting the null hypothesis when it is true, II) accepting the null hypothesis when it is false. But there is a third kind of error which is of interest because the present test of significance is tied up closely with the idea of making a correct decision about which distribution function has slipped furthest to the right. We may make the error of III) correctly rejecting the null hypothesis for the wrong reason." (Frederick Mosteller, "A k-Sample Slippage Test for an Extreme Population", The Annals of Mathematical Statistics 19, 1948)

"Errors of the third kind happen in conventional tests of differences of means, but they are usually not considered, although their existence is probably recognized. It seems to the author that there may be several reasons for this among which are 1) a preoccupation on the part of mathematical statisticians with the formal questions of acceptance and rejection of null hypotheses without adequate consideration of the implications of the error of the third kind for the practical experimenter, 2) the rarity with which an error of the third kind arises in the usual tests of significance." (Frederick Mosteller, "A k-Sample Slippage Test for an Extreme Population", The Annals of Mathematical Statistics 19, 1948)

"If significance tests are required for still larger samples, graphical accuracy is insufficient, and arithmetical methods are advised. A word to the wise is in order here, however. Almost never does it make sense to use exact binomial significance tests on such data - for the inevitable small deviations from the mathematical model of independence and constant split have piled up to such an extent that the binomial variability is deeply buried and unnoticeable. Graphical treatment of such large samples may still be worthwhile because it brings the results more vividly to the eye." (Frederick Mosteller & John W Tukey, "The Uses and Usefulness of Binomial Probability Paper?", Journal of the American Statistical Association 44, 1949)

"One reason for preferring to present a confidence interval statement (where possible) is that the confidence interval, by its width, tells more about the reliance that can be placed on the results of the experiment than does a YES-NO test of significance." (Mary G Natrella, "The relation between confidence intervals and tests of significance", American Statistician 14, 1960)

"Confidence intervals give a feeling of the uncertainty of experimental evidence, and (very important) give it in the same units [...] as the original observations." (Mary G Natrella, "The relation between confidence intervals and tests of significance", American Statistician 14, 1960)

"The null-hypothesis significance test treats ‘acceptance’ or ‘rejection’ of a hypothesis as though these were decisions one makes. But a hypothesis is not something, like a piece of pie offered for dessert, which can be accepted or rejected by a voluntary physical action. Acceptance or rejection of a hypothesis is a cognitive process, a degree of believing or disbelieving which, if rational, is not a matter of choice but determined solely by how likely it is, given the evidence, that the hypothesis is true." (William W Rozeboom, "The fallacy of the null–hypothesis significance test", Psychological Bulletin 57, 1960)

"The null hypothesis of no difference has been judged to be no longer a sound or fruitful basis for statistical investigation. […] Significance tests do not provide the information that scientists need, and, furthermore, they are not the most effective method for analyzing and summarizing data." (Cherry A Clark, "Hypothesis Testing in Relation to Statistical Methodology", Review of Educational Research Vol. 33, 1963)

"The idea of knowledge as an improbable structure is still a good place to start. Knowledge, however, has a dimension which goes beyond that of mere information or improbability. This is a dimension of significance which is very hard to reduce to quantitative form. Two knowledge structures might be equally improbable but one might be much more significant than the other." (Kenneth E Boulding, "Beyond Economics: Essays on Society", 1968)

"Significance levels are usually computed and reported, but power and confidence limits are not. Perhaps they should be." (Amos Tversky & Daniel Kahneman, "Belief in the law of small numbers", Psychological Bulletin 76(2), 1971)

"It is usually wise to give a confidence interval for the parameter in which you are interested." (David S Moore & George P McCabe, "Introduction to the Practice of Statistics", 1989)

"I do not think that significance testing should be completely abandoned [...] and I don’t expect that it will be. But I urge researchers to provide estimates, with confidence intervals: scientific advance requires parameters with known reliability estimates. Classical confidence intervals are formally equivalent to a significance test, but they convey more information." (Nigel G Yoccoz, "Use, Overuse, and Misuse of Significance Tests in Evolutionary Biology and Ecology", Bulletin of the Ecological Society of America Vol. 72 (2), 1991)

"Whereas hypothesis testing emphasizes a very narrow question (‘Do the population means fail to conform to a specific pattern?’), the use of confidence intervals emphasizes a much broader question (‘What are the population means?’). Knowing what the means are, of course, implies knowing whether they fail to conform to a specific pattern, although the reverse is not true. In this sense, use of confidence intervals subsumes the process of hypothesis testing." (Geoffrey R Loftus, "On the tyranny of hypothesis testing in the social sciences", Contemporary Psychology 36, 1991)

"We should push for de-emphasizing some topics, such as statistical significance tests - an unfortunate carry-over from the traditional elementary statistics course. We would suggest a greater focus on confidence intervals - these achieve the aim of formal hypothesis testing, often provide additional useful information, and are not as easily misinterpreted." (Gerry Hahn et al, "The Impact of Six Sigma Improvement: A Glimpse Into the Future of Statistics", The American Statistician, 1999)

"[...] they [confidence limits] are rarely to be found in the literature. I suspect that the main reason they are not reported is that they are so embarrassingly large!" (Jacob Cohen, "The earth is round (p<.05)", American Psychologist 49, 1994)

"Given the important role that correlation plays in structural equation modeling, we need to understand the factors that affect establishing relationships among multivariable data points. The key factors are the level of measurement, restriction of range in data values (variability, skewness, kurtosis), missing data, nonlinearity, outliers, correction for attenuation, and issues related to sampling variation, confidence intervals, effect size, significance, sample size, and power." (Randall E Schumacker & Richard G Lomax, "A Beginner’s Guide to Structural Equation Modeling" 3rd Ed., 2010)

"Another way to secure statistical significance is to use the data to discover a theory. Statistical tests assume that the researcher starts with a theory, collects data to test the theory, and reports the results - whether statistically significant or not. Many people work in the other direction, scrutinizing the data until they find a pattern and then making up a theory that fits the pattern." (Gary Smith, "Standard Deviations", 2014)

"There is a growing realization that reported 'statistically significant' claims in statistical publications are routinely mistaken. Researchers typically express the confidence in their data in terms of p-value: the probability that a perceived result is actually the result of random variation. The value of p (for 'probability') is a way of measuring the extent to which a data set provides evidence against a so-called null hypothesis. By convention, a p- value below 0.05 is considered a meaningful refutation of the null hypothesis; however, such conclusions are less solid than they appear." (Andrew Gelman & Eric Loken, "The Statistical Crisis in Science", American Scientist Vol. 102(6), 2014)

04 November 2018

🔭Data Science: Residuals (Just the Quotes)

"Data analysis must be iterative to be effective. [...] The iterative and interactive interplay of summarizing by fit and exposing by residuals is vital to effective data analysis. Summarizing and exposing are complementary and pervasive." (John W Tukey & Martin B Wilk, "Data Analysis and: An Expository Overview", 1966)

"Exploratory data analysis, EDA, calls for a relatively free hand in exploring the data, together with dual obligations: (•) to look for all plausible alternatives and oddities - and a few implausible ones, (graphic techniques can be most helpful here) and (•) to remove each appearance that seems large enough to be meaningful - ordinarily by some form of fitting, adjustment, or standardization [...] so that what remains, the residuals, can be examined for further appearances." (John W Tukey, "Introduction to Styles of Data Analysis Techniques", 1982)

"A good description of the data summarizes the systematic variation and leaves residuals that look structureless. That is, the residuals exhibit no patterns and have no exceptionally large values, or outliers. Any structure present in the residuals indicates an inadequate fit. Looking at the residuals laid out in an overlay helps to spot patterns and outliers and to associate them with their source in the data." (Christopher H Schrnid, "Value Splitting: Taking the Data Apart", 1991)

"A useful description relates the systematic variation to one or more factors; if the residuals dwarf the effects for a factor, we may not be able to relate variation in the data to changes in the factor. Furthermore, changes in the factor may bring no important change in the response. Such comparisons of residuals and effects require a measure of the variation of overlays relative to each other." (Christopher H Schrnid, "Value Splitting: Taking the Data Apart", 1991)

"Fitting data means finding mathematical descriptions of structure in the data. An additive shift is a structural property of univariate data in which distributions differ only in location and not in spread or shape. […] The process of identifying a structure in data and then fitting the structure to produce residuals that have the same distribution lies at the heart of statistical analysis. Such homogeneous residuals can be pooled, which increases the power of the description of the variation in the data." (William S Cleveland, "Visualizing Data", 1993)

"When the distributions of two or more groups of univariate data are skewed, it is common to have the spread increase monotonically with location. This behavior is monotone spread. Strictly speaking, monotone spread includes the case where the spread decreases monotonically with location, but such a decrease is much less common for raw data. Monotone spread, as with skewness, adds to the difficulty of data analysis. For example, it means that we cannot fit just location estimates to produce homogeneous residuals; we must fit spread estimates as well. Furthermore, the distributions cannot be compared by a number of standard methods of probabilistic inference that are based on an assumption of equal spreads; the standard t-test is one example. Fortunately, remedies for skewness can cure monotone spread as well." (William S Cleveland, "Visualizing Data", 1993)

"Residual analysis is similarly unreliable. In a discussion after a presentation of residual analysis in a seminar at Berkeley in 1993, William Cleveland, one of the fathers of residual analysis, admitted that it could not uncover lack of fit in more than four to five dimensions. The papers I have read on using residual analysis to check lack of fit are confined to data sets with two or three variables. With higher dimensions, the interactions between the variables can produce passable residual plots for a variety of models. A residual plot is a goodness-of-fit test, and lacks power in more than a few dimensions. An acceptable residual plot does not imply that the model is a good fit to the data." (Leo Breiman, "Statistical Modeling: The Two Cultures", Statistical Science Vol. 16(3), 2001)

"For a confidence interval, the central limit theorem plays a role in the reliability of the interval because the sample mean is often approximately normal even when the underlying data is not. A prediction interval has no such protection. The shape of the interval reflects the shape of the underlying distribution. It is more important to examine carefully the normality assumption by checking the residuals […]." (DeWayne R Derryberry, "Basic data analysis for time series with R", 2014)

"It is common to view a statistical model as nothing more than a recipe for calculating the fitted values, and to think that the residuals are just the errors made by this model. But we’ll have a richer picture if instead we view the residuals as part of the model. If you’ve ignored the variation in the residuals, then you really haven’t specified a complete forecast." (James G Scott, "Statistical Modeling: A Gentle Introduction", 2017)

"The residuals from a regression model are sometimes called 'errors'. This is especially true in experimental science, where measurements of some Y variable will be taken at different values of the X variable (called design points), and where noisy measurement instruments can introduce random errors into theobservations. But in many cases this interpretation of a residual as an error can be misleading. A regression model can still give a nonzero residual, even if there is no mistake in the measurement of the Y variable. It’s often far more illuminating to think of the residual as the part of the Y variable that it is left unpredicted by X." (James G Scott, "Statistical Modeling: A Gentle Introduction", 2017)

"Using noise (the uncorrelated variables) to fit noise (the residual left from a simple model on the genuinely correlated variables) is asking for trouble." (Steven S Skiena, "The Data Science Design Manual", 2017)

"One of the most common problems that you will encounter when training deep neural networks will be overfitting. What can happen is that your network may, owing to its flexibility, learn patterns that are due to noise, errors, or simply wrong data. [...] The essence of overfitting is to have unknowingly extracted some of the residual variation (i.e., the noise) as if that variation represented the underlying model structure. The opposite is called underfitting - when the model cannot capture the structure of the data." (Umberto Michelucci, "Applied Deep Learning: A Case-Based Approach to Understanding Deep Neural Networks", 2018)

🔭Data Science: Central Limit Theorem (Just the Quotes)

"I know of scarcely anything so apt to impress the imagination as the wonderful form of cosmic order expressed by the ‘Law of Frequency of Error’. The law would have been personified by the Greeks and deified, if they had known of it. It reigns with serenity and in complete self-effacement, amidst the wildest confusion. The huger the mob, and the greater the apparent anarchy, the more perfect is its sway. It is the supreme law of Unreason. Whenever a large sample of chaotic elements are taken in hand and marshalled in the order of their magnitude, an unsuspected and most beautiful form of regularity proves to have been latent all along." (Sir Francis Galton, "Natural Inheritance", 1889)

"The central limit theorem […] states that regardless of the shape of the curve of the original population, if you repeatedly randomly sample a large segment of your group of interest and take the average result, the set of averages will follow a normal curve." (Charles Livingston & Paul Voakes, "Working with Numbers and Statistics: A handbook for journalists", 2005)

"The central limit theorem says that, under conditions almost always satisfied in the real world of experimentation, the distribution of such a linear function of errors will tend to normality as the number of its components becomes large. The tendency to normality occurs almost regardless of the individual distributions of the component errors. An important proviso is that several sources of error must make important contributions to the overall error and that no particular source of error dominate the rest." (George E P Box et al, "Statistics for Experimenters: Design, discovery, and innovation" 2nd Ed., 2005)

"Two things explain the importance of the normal distribution: (1) The central limit effect that produces a tendency for real error distributions to be 'normal like'. (2) The robustness to nonnormality of some common statistical procedures, where 'robustness' means insensitivity to deviations from theoretical normality." (George E P Box et al, "Statistics for Experimenters: Design, discovery, and innovation" 2nd Ed., 2005)

"The central limit theorem differs from laws of large numbers because random variables vary and so they differ from constants such as population means. The central limit theorem says that certain independent random effects converge not to a constant population value such as the mean rate of unemployment but rather they converge to a random variable that has its own Gaussian bell-curve description." (Bart Kosko, "Noise", 2006)

"Normally distributed variables are everywhere, and most classical statistical methods use this distribution. The explanation for the normal distribution’s ubiquity is the Central Limit Theorem, which says that if you add a large number of independent samples from the same distribution the distribution of the sum will be approximately normal." (Ben Bolker, "Ecological Models and Data in R", 2007)

"[…] the Central Limit Theorem says that if we take any sequence of small independent random quantities, then in the limit their sum (or average) will be distributed according to the normal distribution. In other words, any quantity that can be viewed as the sum of many small independent random effects. will be well approximated by the normal distribution. Thus, for example, if one performs repeated measurements of a fixed physical quantity, and if the variations in the measurements across trials are the cumulative result of many independent sources of error in each trial, then the distribution of measured values should be approximately normal." (David Easley & Jon Kleinberg, "Networks, Crowds, and Markets: Reasoning about a Highly Connected World", 2010)

[myth] "It has been said that process behavior charts work because of the central limit theorem."(Donald J Wheeler, "Myths About Data Analysis", International Lean & Six Sigma Conference, 2012)

"Statistical inference is really just the marriage of two concepts that we’ve already discussed: data and probability (with a little help from the central limit theorem)." (Charles Wheelan, "Naked Statistics: Stripping the Dread from the Data", 2012)

"The central limit theorem is often used to justify the assumption of normality when using the sample mean and the sample standard deviation. But it is inevitable that real data contain gross errors. Five to ten percent unusual values in a dataset seem to be the rule rather than the exception. The distribution of such data is no longer Normal." (A S Hedayat & Guoqin Su, "Robustness of the Simultaneous Estimators of Location and Scale From Approximating a Histogram by a Normal Density Curve", The American Statistician 66, 2012)

"The central limit theorem tells us that in repeated samples, the difference between the two means will be distributed roughly as a normal distribution." (Charles Wheelan, "Naked Statistics: Stripping the Dread from the Data", 2012)

"According to the central limit theorem, it doesn’t matter what the raw data look like, the sample variance should be proportional to the number of observations and if I have enough of them, the sample mean should be normal." (Kristin H Jarman, "The Art of Data Analysis: How to answer almost any question using basic statistics", 2013)

"When data is not normal, the reason the formulas are working is usually the central limit theorem. For large sample sizes, the formulas are producing parameter estimates that are approximately normal even when the data is not itself normal. The central limit theorem does make some assumptions and one is that the mean and variance of the population exist. Outliers in the data are evidence that these assumptions may not be true. Persistent outliers in the data, ones that are not errors and cannot be otherwise explained, suggest that the usual procedures based on the central limit theorem are not applicable." (DeWayne R Derryberry, "Basic data analysis for time series with R", 2014)

"At very small time scales, the motion of a particle is more like a random walk, as it gets jostled about by discrete collisions with water molecules. But virtually any random movement on small time scales will give rise to Brownian motion on large time scales, just so long as the motion is unbiased. This is because of the Central Limit Theorem, which tells us that the aggregate of many small, independent motions will be normally distributed." (Field Cady, "The Data Science Handbook", 2017)

"The central limit conjecture states that most errors are the result of many small errors and, as such, have a normal distribution. The assumption of a normal distribution for error has many advantages and has often been made in applications of statistical models." (David S Salsburg, "Errors, Blunders, and Lies: How to Tell the Difference", 2017)

"Theoretically, the normal distribution is most famous because many distributions converge to it, if you sample from them enough times and average the results. This applies to the binomial distribution, Poisson distribution and pretty much any other distribution you’re likely to encounter (technically, any one for which the mean and standard deviation are finite)." (Field Cady, "The Data Science Handbook", 2017)

"The central limit theorem in statistics states that, given a sufficiently large sample size, the sampling distribution of the mean for a variable will approximate a normal distribution regardless of that variable’s distribution in the population." (Jim Frost)

"The old rule of trusting the Central Limit Theorem if the sample size is larger than 30 is just that–old. Bootstrap and permutation testing let us more easily do inferences for a wider variety of statistics." (Tim Hesterberg)

03 November 2018

🔭Data Science: Samples (Just the Quotes)

"Little experience is sufficient to show that the traditional machinery of statistical processes is wholly unsuited to the needs of practical research. Not only does it take a cannon to shoot a sparrow, but it misses the sparrow! The elaborate mechanism built on the theory of infinitely large samples is not accurate enough for simple laboratory data. Only by systematically tackling small sample problems on their metrics does it seem possible to apply accurate tests to practical data." (Ronald A Fisher, "Statistical Methods for Research Workers", 1925)

"The postulate of randomness thus resolves itself into the question, ‘of what population is this a random sample?’ which must frequently be asked by every practical statistician." (Ronald A Fisher, "On the Mathematical Foundation of Theoretical Statistics", Philosophical Transactions of the Royal Society of London Vol. A222, 1922)

"Null hypotheses of no difference are usually known to be false before the data are collected [...] when they are, their rejection or acceptance simply reflects the size of the sample and the power of the test, and is not a contribution to science." (I Richard Savage, "Nonparametric Statistics", Journal of the American Statistical Association 52, 1957)

"Assumptions that we make, such as those concerning the form of the population sampled, are always untrue." (David R Cox, "Some problems connected with statistical inference", Annals of Mathematical Statistics 29, 1958)

"[...] a priori reasons for believing that the null hypothesis is generally false anyway. One of the common experiences of research workers is the very high frequency with which significant results are obtained with large samples." (David Bakan, "The test of significance in psychological research", Psychological Bulletin 66, 1966)

"People have erroneous intuitions about the laws of chance. In particular, they regard a sample randomly drawn from a population as highly representative, that is, similar to the population in all essential characteristics. The prevalence of the belief and its unfortunate consequences for psychological research are illustrated by the responses of professional psychologists to a questionnaire concerning research decisions." (Amos Tversky & Daniel Kahneman, "Belief in the law of small numbers", Psychological Bulletin 76(2), 1971)

"[...] too many users of the analysis of variance seem to regard the reaching of a mediocre level of significance as more important than any descriptive specification of the underlying averages Our thesis is that people have strong intuitions about random sampling; that these intuitions are wrong in fundamental respects; that these intuitions are shared by naive subjects and by trained scientists; and that they are applied with unfortunate consequences in the course of scientific inquiry. We submit that people view a sample randomly drawn from a population as highly representative, that is, similar to the population in all essential characteristics. Consequently, they expect any two samples drawn from a particular population to be more similar to one another and to the population than sampling theory predicts, at least for small samples." (Amos Tversky & Daniel Kahneman, "Belief in the law of small numbers", Psychological Bulletin 76(2), 1971)

"It would help if the standard statistical programs did not generate t statistics in such profusion. The programs might be written to ask, 'Do you really have a probability sample?', 'By what standard would you judge a fitted coefficient large or small?' Or perhaps they could merely say, printed in bold capitals beside each equation, 'So What Else Is New?'" (Donald M McCloskey, "The Loss Function Has Been Mislaid: The Rhetoric of Significance Tests", American Economic Review Vol. 75, 1985)

"Since a point hypothesis is not to be expected in practice to be exactly true, but only approximate, a proper test of significance should almost always show significance for large enough samples. So the whole game of testing point hypotheses, power analysis notwithstanding, is but a mathematical game without empirical importance." (Louis Guttman, "The illogic of statistical inference for cumulative science", Applied Stochastic Models and Data Analysis, 1985)

"A little thought reveals a fact widely understood among statisticians: The null hypothesis, taken literally (and that’s the only way you can take it in formal hypothesis testing), is always false in the real world[...]. If it is false, even to a tiny degree, it must be the case that a large enough sample will produce a significant result and lead to its rejection. So if the null hypothesis is always false, what’s the big deal about rejecting it." (Jacob Cohen, "Things I have learned (so far)", American Psychologist 45, 1990)

"Unfortunately, when applied in a cook-book fashion, such significance tests do not extract the maximum amount of information available from the data. Worse still, misleading conclusions can be drawn. There are at least three problems: (1) a conclusion that there is a significant difference can often be reached merely by collecting enough samples; (2) a statistically significant result is not necessarily practically significant; and (3) reports of the presence or absence of significant differences for multiple tests are not comparable unless identical sample sizes are used." (Graham B McBride et al, "What do significance tests really tell us about the environment?", Environmental Management 17, 1993)

"Statistical hypothesis testing is commonly used inappropriately to analyze data, determine causality, and make decisions about significance in ecological risk assessment,[...] It discourages good toxicity testing and field studies, it provides less protection to ecosystems or their components that are difficult to sample or replicate, and it provides less protection when more treatments or responses are used. It provides a poor basis for decision-making because it does not generate a conclusion of no effect, it does not indicate the nature or magnitude of effects, it does address effects at untested exposure levels, and it confounds effects and uncertainty[...]. Risk assessors should focus on analyzing the relationship between exposure and effects[...]." (Glenn W Suter, "Abuse of hypothesis testing statistics in ecological risk assessment", Human and Ecological Risk Assessment 2, 1996)

"The standard error of most statistics is proportional to 1 over the square root of the sample size. God did this, and there is nothing we can do to change it." (Howard Wainer, "Improving Tabular Displays, With NAEP Tables as Examples and Inspirations", Journal of Educational and Behavioral Statistics Vol 22 (1), 1997)

"It is not always convenient to remember that the right model for a population can fit a sample of data worse than a wrong model - even a wrong model with fewer parameters. We cannot rely on statistical diagnostics to save us, especially with small samples. We must think about what our models mean, regardless of fit, or we will promulgate nonsense." (Leland Wilkinson, "The Grammar of Graphics" 2nd Ed., 2005)

"It’s a commonplace among statisticians that a chi-squared test (and, really, any p-value) can be viewed as a crude measure of sample size: When sample size is small, it’s very difficult to get a rejection (that is, a p-value below 0.05), whereas when sample size is huge, just about anything will bag you a rejection. With large n, a smaller signal can be found amid the noise. In general: small n, unlikely to get small p-values. Large n, likely to find something. Huge n, almost certain to find lots of small p-values." (Andrew Gelman, "The sample size is huge, so a p-value of 0.007 is not that impressive", 2009)

"Why are you testing your data for normality? For large sample sizes the normality tests often give a meaningful answer to a meaningless question (for small samples they give a meaningless answer to a meaningful question)." (Greg Snow, "R-Help", 2014)

"The Dirty Data Theorem states that 'real world' data tends to come from bizarre and unspecifiable distributions of highly correlated variables and have unequal sample sizes, missing data points, non-independent observations, and an indeterminate number of inaccurately recorded values." (Anon, Statistically Speaking)

"While the main emphasis in the development of power analysis has been to provide methods for assessing and increasing power, it should also be noted that it is possible to have too much power. If your sample is too large, nearly any difference, no matter how small or meaningless from a practical standpoint, will be ‘statistically significant’." (Clay Helberg)

🔭Data Science: Least Squares (Just the Quotes)

"From the foregoing we see that the two justifications each leave something to be desired. The first depends entirely on the hypothetical form of the probability of the error; as soon as that form is rejected, the values of the unknowns produced by the method of least squares are no more the most probable values than is the arithmetic mean in the simplest case mentioned above. The second justification leaves us entirely in the dark about what to do when the number of observations is not large. In this case the method of least squares no longer has the status of a law ordained by the probability calculus but has only the simplicity of the operations it entails to recommend it." (Carl Friedrich Gauss, "Anzeige: Theoria combinationis observationum erroribus minimis obnoxiae: Pars prior", Göttingische gelehrte Anzeigen, 1821)

"The method of least squares is used in the analysis of data from planned experiments and also in the analysis of data from unplanned happenings. The word 'regression' is most often used to describe analysis of unplanned data. It is the tacit assumption that the requirements for the validity of least squares analysis are satisfied for unplanned data that produces a great deal of trouble." (George E P Box, "Use and Abuse of Regression", 1966)

"At the heart of probabilistic statistical analysis is the assumption that a set of data arises as a sample from a distribution in some class of probability distributions. The reasons for making distributional assumptions about data are several. First, if we can describe a set of data as a sample from a certain theoretical distribution, say a normal distribution (also called a Gaussian distribution), then we can achieve a valuable compactness of description for the data. For example, in the normal case, the data can be succinctly described by giving the mean and standard deviation and stating that the empirical (sample) distribution of the data is well approximated by the normal distribution. A second reason for distributional assumptions is that they can lead to useful statistical procedures. For example, the assumption that data are generated by normal probability distributions leads to the analysis of variance and least squares. Similarly, much of the theory and technology of reliability assumes samples from the exponential, Weibull, or gamma distribution. A third reason is that the assumptions allow us to characterize the sampling distribution of statistics computed during the analysis and thereby make inferences and probabilistic statements about unknown aspects of the underlying distribution. For example, assuming the data are a sample from a normal distribution allows us to use the t-distribution to form confidence intervals for the mean of the theoretical distribution. A fourth reason for distributional assumptions is that understanding the distribution of a set of data can sometimes shed light on the physical mechanisms involved in generating the data." (John M Chambers et al, "Graphical Methods for Data Analysis", 1983)

"Least squares' means just what it says: you minimise the (suitably weighted) squared difference between a set of measurements and their predicted values. This is done by varying the parameters you want to estimate: the predicted values are adjusted so as to be close to the measurements; squaring the differences means that greater importance is placed on removing the large deviations." (Roger J Barlow, "Statistics: A guide to the use of statistical methods in the physical sciences", 1989)

"Principal components and principal factor analysis lack a well-developed theoretical framework like that of least squares regression. They consequently provide no systematic way to test hypotheses about the number of factors to retain, the size of factor loadings, or the correlations between factors, for example. Such tests are possible using a different approach, based on maximum-likelihood estimation." (Lawrence C Hamilton, "Regression with Graphics: A second course in applied statistics", 1991)

"Fuzzy models should make good predictions even when they are asked to predict on regions that were not excited during the construction of the model. The generalization capabilities can be controlled by an appropriate initialization of the consequences (prior knowledge) and the use of the recursive least squares to improve the prior choices. The prior knowledge can be obtained from the data." (Jairo Espinosa et al, "Fuzzy Logic, Identification and Predictive Control", 2005)

"One of the classical assumptions in linear regression analysis is that of equal variance, which is frequently referred to as homoscedasticity. However, this assumption may not be valid in data analysis arising from many fields (e.g., economics, finance, engineering, and biological science). When heteroscedasticity (nonconstant variance) occurs, the statistical inferences and predictions via the ordinary least squares method are often not reliable. Therefore, it is crucial to study the heteroscedastic error structure in linear model fitting." (Xiaogang Su et al, "Treed Variance", Journal of Computational and Graphical Statistics, Vol. 15 (2), 2006)

"Often when people relate essentially the same variable in two different groups, or at two different times, they see this same phenomenon - the tendency of the response variable to be closer to the mean than the predicted value. Unfortunately, people try to interpret this by thinking that the performance of those far from the mean is deteriorating, but it’s just a mathematical fact about the correlation. So, today we try to be less judgmental about this phenomenon and we call it regression to the mean. We managed to get rid of the term 'mediocrity', but the name regression stuck as a name for the whole least squares fitting procedure - and that’s where we get the term regression line." (Richard D De Veaux et al, "Stats: Data and Models", 2016)

🔭Data Science: Myths (Just the Quotes)

"[myth:] Accuracy is more important than precision. For single best estimates, be it a mean value or a single data value, this question does not arise because in that case there is no difference between accuracy and precision. (Think of a single shot aimed at a target.) Generally, it is good practice to balance precision and accuracy. The actual requirements will differ from case to case." (Manfred Drosg, "Dealing with Uncertainties: A Guide to Error Analysis", 2007)

"[myth:] Counting can be done without error. Usually, the counted number is an integer and therefore without (rounding) error. However, the best estimate of a scientifically relevant value obtained by counting will always have an error. These errors can be very small in cases of consecutive counting, in particular of regular events, e.g., when measuring frequencies." (Manfred Drosg, "Dealing with Uncertainties: A Guide to Error Analysis", 2007)

"[myth:] Random errors can always be determined by repeating measurements under identical conditions. […] this statement is true only for time-related random errors." (Manfred Drosg, "Dealing with Uncertainties: A Guide to Error Analysis", 2007)

"[myth:] Systematic errors can be determined inductively. It should be quite obvious that it is not possible to determine the scale error from the pattern of data values." (Manfred Drosg, "Dealing with Uncertainties: A Guide to Error Analysis", 2007)

[myth] " It has been said that process behavior charts work because of the central limit theorem."(Donald J Wheeler, "Myths About Data Analysis", International Lean & Six Sigma Conference, 2012)

[myth] "It has been said that the data must be normally distributed before they can be placed on a process behavior chart."(Donald J Wheeler, "Myths About Data Analysis", International Lean & Six Sigma Conference, 2012)

[myth] "It has been said that the observations must be independent - data with autocorrelation are inappropriate for process behavior charts." (Donald J Wheeler, "Myths About Data Analysis", International Lean & Six Sigma Conference, 2012)

[myth] " It has been said that the process must be operating in control before you can place the data on a process behavior chart."(Donald J Wheeler, "Myths About Data Analysis", International Lean & Six Sigma Conference, 2012)

[myth] "The standard deviation statistic is more efficient than the range and therefore we should use the standard deviation statistic when computing limits for a process behavior chart."(Donald J Wheeler, "Myths About Data Analysis", International Lean & Six Sigma Conference, 2012)

"An oft-repeated rule of thumb in any sort of statistical model fitting is 'you can't fit a model with more parameters than data points'. This idea appears to be as wide-spread as it is incorrect. On the contrary, if you construct your models carefully, you can fit models with more parameters than datapoints [...]. A model with more parameters than datapoints is known as an under-determined system, and it's a common misperception that such a model cannot be solved in any circumstance. [...] this misconception, which I like to call the 'model complexity myth' [...] is not true in general, it is true in the specific case of simple linear models, which perhaps explains why the myth is so pervasive." (Jake Vanderplas, "The Model Complexity Myth", 2015) [source]

"Hollywood loves the myth of a lone scientist working late nights in a dark laboratory on a mysterious island, but the truth is far less melodramatic. Real science is almost always a team sport. Groups of people, collaborating with other groups of people, are the norm in science - and data science is no exception to the rule. When large groups of people work together for extended periods of time, a culture begins to emerge. " (Mike Barlow, "Learning to Love Data Science", 2015)

"One of the biggest truths about the real–time analytics is that nothing is actually real–time; it's a myth. In reality, it's close to real–time. Depending upon the performance and ability of a solution and the reduction of operational latencies, the analytics could be close to real–time, but, while day-by-day we are bridging the gap between real–time and near–real–time, it's practically impossible to eliminate the gap due to computational, operational, and network latencies." (Shilpi Saxena & Saurabh Gupta, "Practical Real-time Data Processing and Analytics", 2017)

"The field of big-data analytics is still littered with a few myths and evidence-free lore. The reasons for these myths are simple: the emerging nature of technologies, the lack of common definitions, and the non-availability of validated best practices. Whatever the reasons, these myths must be debunked, as allowing them to persist usually has a negative impact on success factors and Return on Investment (RoI). On a positive note, debunking the myths allows us to set the right expectations, allocate appropriate resources, redefine business processes, and achieve individual/organizational buy-in." (Prashant Natarajan et al, "Demystifying Big Data and Machine Learning for Healthcare", 2017)

"The first myth is that prediction is always based on time-series extrapolation into the future (also known as forecasting). This is not the case: predictive analytics can be applied to generate any type of unknown data, including past and present. In addition, prediction can be applied to non-temporal (time-based) use cases such as disease progression modeling, human relationship modeling, and sentiment analysis for medication adherence, etc. The second myth is that predictive analytics is a guarantor of what will happen in the future. This also is not the case: predictive analytics, due to the nature of the insights they create, are probabilistic and not deterministic. As a result, predictive analytics will not be able to ensure certainty of outcomes." (Prashant Natarajan et al, "Demystifying Big Data and Machine Learning for Healthcare", 2017)

"One of the biggest myths is the belief that data science is an autonomous process that we can let loose on our data to find the answers to our problems. In reality, data science requires skilled human oversight throughout the different stages of the process. [...] The second big myth of data science is that every data science project needs big data and needs to use deep learning. In general, having more data helps, but having the right data is the more important requirement. [...] A third data science myth is that modern data science software is easy to use, and so data science is easy to do. [...] The last myth about data science [...] is the belief that data science pays for itself quickly. The truth of this belief depends on the context of the organization. Adopting data science can require significant investment in terms of developing data infrastructure and hiring staff with data science expertise. Furthermore, data science will not give positive results on every project." (John D Kelleher & Brendan Tierney, "Data Science", 2018)

"In the world of data and analytics, people get enamored by the nice, shiny object. We are pulled around by the wind of the latest technology, but in so doing we are pulled away from the sound and intelligent path that can lead us to data and analytical success. The data and analytical world is full of examples of overhyped technology or processes, thinking this thing will solve all of the data and analytical needs for an individual or organization. Such topics include big data or data science. These two were pushed into our minds and down our throats so incessantly over the past decade that they are somewhat of a myth, or people finally saw the light. In reality, both have a place and do matter, but they are not the only solution to your data and analytical needs. Unfortunately, though, organizations bit into them, thinking they would solve everything, and were left at the alter, if you will, when it came time for the marriage of data and analytical success with tools." (Jordan Morrow, "Be Data Literate: The data literacy skills everyone needs to succeed", 2021)

"[...] the focus on Big Data AI seems to be an excuse to put forth a number of vague and hand-waving theories, where the actual details and the ultimate success of neuroscience is handed over to quasi- mythological claims about the powers of large datasets and inductive computation. Where humans fail to illuminate a complicated domain with testable theory, machine learning and big data supposedly can step in and render traditional concerns about finding robust theories. This seems to be the logic of Data Brain efforts today. (Erik J Larson, "The Myth of Artificial Intelligence: Why Computers Can’t Think the Way We Do", 2021)

"The myth of replacing domain experts comes from people putting too much faith in the power of ML to find patterns in the data. [...] ML looks for patterns that are generally pretty crude - the power comes from the sheer scale at which they can operate. If the important patterns in the data are not sufficiently crude then ML will not be able to ferret them out. The most powerful classes of models, like deep learning, can sometimes learn good-enough proxies for the real patterns, but that requires more training data than is usually available and yields complicated models that are hard to understand and impossible to debug. It’s much easier to just ask somebody who knows the domain!" (Field Cady, "Data Science: The Executive Summary: A Technical Book for Non-Technical Professionals", 2021)

"A common myth is that non-linear dimension reduction captures non-linear patterns in the high-dimensional data. It may or may not do this. The term means that the methods transform the data non-linearly into a useful (or not) visual representation." (Dianne Cook & Ursula Laa, "Interactively Exploring High-Dimensional Data and Models in R", 2026)

🔭Data Science: Tails (Just the Quotes)

"Some distributions [...] are symmetrical about their central value. Other distributions have marked asymmetry and are said to be skew. Skew distributions are divided into two types. If the 'tail' of the distribution reaches out into the larger values of the variate, the distribution is said to show positive skewness; if the tail extends towards the smaller values of the variate, the distribution is called negatively skew." (Michael J Moroney,Facts from Figures", 1951)

"Logging size transforms the original skewed distribution into a more symmetrical one by pulling in the long right tail of the distribution toward the mean. The short left tail is, in addition, stretched. The shift toward symmetrical distribution produced by the log transform is not, of course, merely for convenience. Symmetrical distributions, especially those that resemble the normal distribution, fulfill statistical assumptions that form the basis of statistical significance testing in the regression model." (Edward R Tufte,Data Analysis for Politics and Policy", 1974)

"Equal variability is not always achieved in plots. For instance, if the theoretical distribution for a probability plot has a density that drops off gradually to zero in the tails" (as the normal density does), then the variability of the data in the tails of the probability plot is greater than in the center. Another example is provided by the histogram. Since the height of any one bar has a binomial distribution, the standard deviation of the height is approximately proportional to the square root of the expected height; hence, the variability of the longer bars is greater." (John M Chambers et al,Graphical Methods for Data Analysis", 1983)

"If the sample is not representative of the population because the sample is small or biased, not selected at random, or its constituents are not independent of one another, then the bootstrap will fail. […] For a given size sample, bootstrap estimates of percentiles in the tails will always be less accurate than estimates of more centrally located percentiles. Similarly, bootstrap interval estimates for the variance of a distribution will always be less accurate than estimates of central location such as the mean or median because the variance depends strongly upon extreme values in the population." (Phillip I Good & James W Hardin,Common Errors in Statistics" (and How to Avoid Them)", 2003)

"Bell curves don't differ that much in their bells. They differ in their tails. The tails describe how frequently rare events occur. They describe whether rare events really are so rare. This leads to the saying that the devil is in the tails." (Bart Kosko,Noise", 2006)

"Readability in visualization helps people interpret data and make conclusions about what the data has to say. Embed charts in reports or surround them with text, and you can explain results in detail. However, take a visualization out of a report or disconnect it from text that provides context" (as is common when people share graphics online), and the data might lose its meaning; or worse, others might misinterpret what you tried to show." (Nathan Yau,Data Points: Visualization That Means Something", 2013)

"A very different - and very incorrect - argument is that successes must be balanced by failures (and failures by successes) so that things average out. Every coin flip that lands heads makes tails more likely. Every red at roulette makes black more likely. […] These beliefs are all incorrect. Good luck will certainly not continue indefinitely, but do not assume that good luck makes bad luck more likely, or vice versa." (Gary Smith,Standard Deviations", 2014)

"The more complex the system, the more variable (risky) the outcomes. The profound implications of this essential feature of reality still elude us in all the practical disciplines. Sometimes variance averages out, but more often fat-tail events beget more fat-tail events because of interdependencies. If there are multiple projects running, outlier (fat-tail) events may also be positively correlated - one IT project falling behind will stretch resources and increase the likelihood that others will be compromised." (Paul Gibbons,The Science of Successful Organizational Change", 2015)

"Many statistical procedures perform more effectively on data that are normally distributed, or at least are symmetric and not excessively kurtotic" (fat-tailed), and where the mean and variance are approximately constant. Observed time series frequently require some form of transformation before they exhibit these distributional properties, for in their 'raw' form they are often asymmetric." (Terence C Mills,Applied Time Series Analysis: A practical guide to modeling and forecasting", 2019)

"Mean-averages can be highly misleading when the raw data do not form a symmetric pattern around a central value but instead are skewed towards one side [...], typically with a large group of standard cases but with a tail of a few either very high" (for example, income) or low" (for example, legs) values." (David Spiegelhalter,The Art of Statistics: Learning from Data", 2019)

"[…] it is not merely that events in the tails of the distributions matter, happen, play a large role, etc. The point is that these events play the major role and their probabilities are not" (easily) computable, not reliable for any effective use. The implication is that Black Swans do not necessarily come from fat tails; the problem can result from an incomplete assessment of tail events." (Nassim N Taleb,Statistical Consequences of Fat Tails: Real World Preasymptotics, Epistemology, and Applications" 2nd Ed., 2022)

"[…] whenever people make decisions after being supplied with the standard deviation number, they act as if it were the expected mean deviation." (Nassim N Taleb,Statistical Consequences of Fat Tails: Real World Preasymptotics, Epistemology, and Applications" 2nd Ed., 2022)

"Behavioral finance so far makes conclusions from statics not dynamics, hence misses the picture. It applies trade-offs out of context and develops the consensus that people irrationally overestimate tail risk" (hence need to be 'nudged' into taking more of these exposures). But the catastrophic event is an absorbing barrier. No risky exposure can be analyzed in isolation: risks accumulate. If we ride a motorcycle, smoke, fly our own propeller plane, and join the mafia, these risks add up to a near-certain premature death. Tail risks are not a renewable resource." (Nassim N Taleb,Statistical Consequences of Fat Tails: Real World Preasymptotics, Epistemology, and Applications" 2nd Ed., 2022)

"But note that any heavy tailed process, even a power law, can be described in sample" (that is finite number of observations necessarily discretized) by a simple Gaussian process with changing variance, a regime switching process, or a combination of Gaussian plus a series of variable jumps" (though not one where jumps are of equal size […])." (Nassim N Taleb,Statistical Consequences of Fat Tails: Real World Preasymptotics, Epistemology, and Applications" 2nd Ed., 2022)

"No one sees further into a generalization than his own knowledge of detail extends." (William James)

"Remember that a p-value merely indicates the probability of a particular set of data being generated by the null model–it has little to say about the size of a deviation from that model" (especially in the tails of the distribution, where large changes in effect size cause only small changes in p-values)." (Clay Helberg)