08 November 2018

🔭Data Science: Aggregation (Just the Quotes)

"Statistics may be defined as numerical statements of facts by means of which large aggregates are analyzed, the relations of individual units to their groups are ascertained, comparisons are made between groups, and continuous records are maintained for comparative purposes." (Melvin T Copeland. "Statistical Methods" [in: Harvard Business Studies, Vol. III, Ed. by Melvin T Copeland, 1917])

"The null hypothesis of no difference has been judged to be no longer a sound or fruitful basis for statistical investigation. […] Significance tests do not provide the information that scientists need, and, furthermore, they are not the most effective method for analyzing and summarizing data." (Cherry A Clark, "Hypothesis Testing in Relation to Statistical Methodology", Review of Educational Research Vol. 33, 1963)

"[…] fitting lines to relationships between variables is often a useful and powerful method of summarizing a set of data. Regression analysis fits naturally with the development of causal explanations, simply because the research worker must, at a minimum, know what he or she is seeking to explain." (Edward R Tufte, "Data Analysis for Politics and Policy", 1974)

"Fitting lines to relationships between variables is the major tool of data analysis. Fitted lines often effectively summarize the data and, by doing so, help communicate the analytic results to others. Estimating a fitted line is also the first step in squeezing further information from the data." (Edward R Tufte, "Data Analysis for Politics and Policy", 1974)

"Modern data graphics can do much more than simply substitute for small statistical tables. At their best, graphics are instruments for reasoning about quantitative information. Often the most effective way to describe, explore, and summarize a set of numbers even a very large set - is to look at pictures of those numbers. Furthermore, of all methods for analyzing and communicating statistical information, well-designed data graphics are usually the simplest and at the same time the most powerful." (Edward R Tufte, "The Visual Display of Quantitative Information", 1983)

"The science of statistics may be described as exploring, analyzing and summarizing data; designing or choosing appropriate ways of collecting data and extracting information from them; and communicating that information. Statistics also involves constructing and testing models for describing chance phenomena. These models can be used as a basis for making inferences and drawing conclusions and, finally, perhaps for making decisions." (Fergus Daly et al, "Elements of Statistics", 1995)

"Without meaningful data there can be no meaningful analysis. The interpretation of any data set must be based upon the context of those data. Unfortunately, much of the data reported to executives today are aggregated and summed over so many different operating units and processes that they cannot be said to have any context except a historical one - they were all collected during the same time period. While this may be rational with monetary figures, it can be devastating to other types of data." (Donald J Wheeler, "Understanding Variation: The Key to Managing Chaos" 2nd Ed., 2000)

"Ockham's Razor in statistical analysis is used implicitly when models are embedded in richer models -for example, when testing the adequacy of a linear model by incorporating a quadratic term. If the coefficient of the quadratic term is not significant, it is dropped and the linear model is assumed to summarize the data adequately." (Gerald van Belle, "Statistical Rules of Thumb", 2002)

"Every number has its limitations; every number is a product of choices that inevitably involve compromise. Statistics are intended to help us summarize, to get an overview of part of the world’s complexity. But some information is always sacrificed in the process of choosing what will be counted and how. Something is, in short, always missing. In evaluating statistics, we should not forget what has been lost, if only because this helps us understand what we still have." (Joel Best, "More Damned Lies and Statistics: How numbers confuse public issues", 2004)

"Data often arrive in raw form, as long lists of numbers. In this case your job is to summarize the data in a way that captures its essence and conveys its meaning. This can be done numerically, with measures such as the average and standard deviation, or graphically. At other times you find data already in summarized form; in this case you must understand what the summary is telling, and what it is not telling, and then interpret the information for your readers or viewers." (Charles Livingston & Paul Voakes, "Working with Numbers and Statistics: A handbook for journalists", 2005)

"Whereas regression is about attempting to specify the underlying relationship that summarises a set of paired data, correlation is about assessing the strength of that relationship. Where there is a very close match between the scatter of points and the regression line, correlation is said to be 'strong' or 'high' . Where the points are widely scattered, the correlation is said to be 'weak' or 'low'." (Alan Graham, "Developing Thinking in Statistics", 2006)

"Numerical precision should be consistent throughout and summary statistics such as means and standard deviations should not have more than one extra decimal place (or significant digit) compared to the raw data. Spurious precision should be avoided although when certain measures are to be used for further calculations or when presenting the results of analyses, greater precision may sometimes be appropriate." (Jenny Freeman et al, "How to Display Data", 2008)

"Graphical displays are often constructed to place principal focus on the individual observations in a dataset, and this is particularly helpful in identifying both the typical positions of data points and unusual or influential cases. However, in many investigations, principal interest lies in identifying the nature of underlying trends and relationships between variables, and so it is often helpful to enhance graphical displays in ways which give deeper insight into these features. This can be very beneficial both for small datasets, where variation can obscure underlying patterns, and large datasets, where the volume of data is so large that effective representation inevitably involves suitable summaries." (Adrian W Bowman, "Smoothing Techniques for Visualisation" [in "Handbook of Data Visualization"], 2008)

"There are three possible reasons for [the] absence of predictive power. First, it is possible that the models are misspecified. Second, it is possible that the model’s explanatory factors are measured at too high a level of aggregation [...] Third, [...] the search for statistically significant relationships may not be the strategy best suited for evaluating our model’s ability to explain real world events [...] the lack of predictive power is the result of too much emphasis having been placed on finding statistically significant variables, which may be overdetermined. Statistical significance is generally a flawed way to prune variables in regression models [...] Statistically significant variables may actually degrade the predictive accuracy of a model [...] [By using] models that are constructed on the basis of pruning undertaken with the shears of statistical significance, it is quite possible that we are winnowing our models away from predictive accuracy." (Michael D Ward et al, "The perils of policy by p-value: predicting civil conflicts" Journal of Peace Research 47, 2010)

"In order to be effective a descriptive statistic has to make sense - it has to distill some essential characteristic of the data into a value that is both appropriate and understandable. […] the justification for computing any given statistic must come from the nature of the data themselves - it cannot come from the arithmetic, nor can it come from the statistic. If the data are a meaningless collection of values, then the summary statistics will also be meaningless - no arithmetic operation can magically create meaning out of nonsense. Therefore, the meaning of any statistic has to come from the context for the data, while the appropriateness of any statistic will depend upon the use we intend to make of that statistic." (Donald J Wheeler, "Myths About Data Analysis", International Lean & Six Sigma Conference, 2012)

"[...] things that seem hopelessly random and unpredictable when viewed in isolation often turn out to be lawful and predictable when viewed in aggregate." (Steven Strogatz, "The Joy of X: A Guided Tour of Mathematics, from One to Infinity", 2012)

"In general, when building statistical models, we must not forget that the aim is to understand something about the real world. Or predict, choose an action, make a decision, summarize evidence, and so on, but always about the real world, not an abstract mathematical world: our models are not the reality - a point well made by George Box in his oft-cited remark that "all models are wrong, but some are useful". (David Hand, "Wonderful examples, but let's not close our eyes", Statistical Science 29, 2014)

"Decision trees are also discriminative models. Decision trees are induced by recursively partitioning the feature space into regions belonging to the different classes, and consequently they define a decision boundary by aggregating the neighboring regions belonging to the same class. Decision tree model ensembles based on bagging and boosting are also discriminative models." (John D Kelleher et al, "Fundamentals of Machine Learning for Predictive Data Analytics: Algorithms, Worked Examples, and Case Studies", 2015)

"Just as with aggregated data, an average is a summary statistic that can tell you something about the data - but it is only one metric, and oftentimes a deceiving one at that. By taking all of the data and boiling it down to one value, an average (and other summary statistics) may imply that all of the underlying data is the same, even when it’s not." (John H Johnson & Mike Gluck, "Everydata: The misinformation hidden in the little data you consume every day", 2016)

"At very small time scales, the motion of a particle is more like a random walk, as it gets jostled about by discrete collisions with water molecules. But virtually any random movement on small time scales will give rise to Brownian motion on large time scales, just so long as the motion is unbiased. This is because of the Central Limit Theorem, which tells us that the aggregate of many small, independent motions will be normally distributed." (Field Cady, "The Data Science Handbook", 2017)

"The most accurate but least interpretable form of data presentation is to make a table, showing every single value. But it is difficult or impossible for most people to detect patterns and trends in such data, and so we rely on graphs and charts. Graphs come in two broad types: Either they represent every data point visually (as in a scatter plot) or they implement a form of data reduction in which we summarize the data, looking, for example, only at means or medians." (Daniel J Levitin, "Weaponized Lies", 2017)

"Again, classical statistics only summarizes data, so it does not provide even a language for asking [a counterfactual] question. Causal inference provides a notation and, more importantly, offers a solution. As with predicting the effect of interventions [...], in many cases we can emulate human retrospective thinking with an algorithm that takes what we know about the observed world and produces an answer about the counterfactual world." (Judea Pearl & Dana Mackenzie, "The Book of Why: The new science of cause and effect", 2018)

"While the individual man is an insoluble puzzle, in the aggregate he becomes a mathematical certainty. You can, for example, never foretell what anyone man will be up to, but you can say with precision what an average number will be up to. Individuals vary, but percentages remain constant. So says the statistician." (Sir Arthur C Doyle)

🔭Data Science: Consistency (Just the Quotes)

"A model, like a novel, may resonate with nature, but it is not a ‘real’ thing. Like a novel, a model may be convincing - it may ‘ring true’ if it is consistent with our experience of the natural world. But just as we may wonder how much the characters in a novel are drawn from real life and how much is artifice, we might ask the same of a model: How much is based on observation and measurement of accessible phenomena, how much is convenience? Fundamentally, the reason for modeling is a lack of full access, either in time or space, to the phenomena of interest." (Kenneth Belitz, Science, Vol. 263, 1944)

"Consistency and completeness can also be characterized in terms of models: a theory T is consistent if and only if it has at least one model; it is complete if and only if every sentence of T which is satified in one model is also satisfied in any other model of T. Two theories T1 and T2 are said to be compatible if they have a common consistent extension; this is equivalent to saying that the union of T1 and T2 is consistent." (Alfred Tarski et al, "Undecidable Theories", 1953)

"To be useful data must be consistent - they must reflect periodic recordings of the value of the variable or at least possess logical internal connections. The definition of the variable under consideration cannot change during the period of measurement or enumeration. Also, if the data are to be valuable, they must be relevant to the question to be answered." (Cecil H Meyers, "Handbook of Basic Graphs: A modern approach", 1970)

"When evaluating a model, at least two broad standards are relevant. One is whether the model is consistent with the data. The other is whether the model is consistent with the ‘real world’." (Kenneth A Bollen, "Structural Equations with Latent Variables", 1989)

"The word theory, as used in the natural sciences, doesn’t mean an idea tentatively held for purposes of argument - that we call a hypothesis. Rather, a theory is a set of logically consistent abstract principles that explain a body of concrete facts. It is the logical connections among the principles and the facts that characterize a theory as truth. No one element of a theory [...] can be changed without creating a logical contradiction that invalidates the entire system. Thus, although it may not be possible to substantiate directly a particular principle in the theory, the principle is validated by the consistency of the entire logical structure." (Alan Cromer, "Uncommon Sense: The Heretical Nature of Science", 1993)

"For a given dataset there is not a great deal of advice which can be given on content and context. hose who know their own data should know best for their specific purposes. It is advisable to think hard about what should be shown and to check with others if the graphic makes the desired impression. Design should be let to designers, though some basic guidelines should be followed: consistency is important (sets of graphics should be in similar style and use equivalent scaling); proximity is helpful (place graphics on the same page, or on the facing page, of any text that refers to them); and layout should be checked (graphics should be neither too small nor too large and be attractively positioned relative to the whole page or display)."(Antony Unwin, "Good Graphics?" [in "Handbook of Data Visualization"], 2008)

"It is the consistency of the information that matters for a good story, not its completeness. Indeed, you will often find that knowing little makes it easier to fit everything you know into a coherent pattern." (Daniel Kahneman, "Thinking, Fast and Slow", 2011)

"Accuracy and coherence are related concepts pertaining to data quality. Accuracy refers to the comprehensiveness or extent of missing data, performance of error edits, and other quality assurance strategies. Coherence is the degree to which data - item value and meaning are consistent over time and are comparable to similar variables from other routinely used data sources." (Aileen Rothbard, "Quality Issues in the Use of Administrative Data Records", 2015)

"How good the data quality is can be looked at both subjectively and objectively. The subjective component is based on the experience and needs of the stakeholders and can differ by who is being asked to judge it. For example, the data managers may see the data quality as excellent, but consumers may disagree. One way to assess it is to construct a survey for stakeholders and ask them about their perception of the data via a questionnaire. The other component of data quality is objective. Measuring the percentage of missing data elements, the degree of consistency between records, how quickly data can be retrieved on request, and the percentage of incorrect matches on identifiers (same identifier, different social security number, gender, date of birth) are some examples." (Aileen Rothbard, "Quality Issues in the Use of Administrative Data Records", 2015)

"The danger of overfitting is particularly severe when the training data is not a perfect gold standard. Human class annotations are often subjective and inconsistent, leading boosting to amplify the noise at the expense of the signal. The best boosting algorithms will deal with overfitting though regularization. The goal will be to minimize the number of non-zero coefficients, and avoid large coefficients that place too much faith in any one classifier in the ensemble." (Steven S Skiena, "The Data Science Design Manual", 2017)

"There are other problems with Big Data. In any large data set, there are bound to be inconsistencies, misclassifications, missing data - in other words, errors, blunders, and possibly lies. These problems with individual items occur in any data set, but they are often hidden in a large mass of numbers even when these numbers are generated out of computer interactions." (David S Salsburg, "Errors, Blunders, and Lies: How to Tell the Difference", 2017)

Data Management: Data Fabric (Definitions)

"Enterprise data fabric (EDF) is a data layer that separates data sources from applications, providing the means to solve the gridlock prevalent in distributed environments such as grid computing, service-oriented architecture (SOA) and event-driven architecture (EDA)." (Information Management, 2010)

"A data fabric is an emerging data management and data integration design concept for attaining flexible, reusable and augmented data integration pipelines, services and semantics, in support of various operational and analytics use cases delivered across multiple deployment and orchestration platforms." (Jacob O Lund, "Demystifying the Data Fabric", 2020)

"A data fabric is a data management architecture that can optimize access to distributed data and intelligently curate and orchestrate it for self-service delivery to data consumers." (IBM, "Data Fabric", 2021) [source]

"A data fabric is a modern, distributed data architecture that includes shared data assets and optimized data management and integration processes that you can use to address today’s data challenges in a unified way." (Alice LaPlante, "Data Fabric as Modern Data Architecture", 2021)

"A data fabric is an emerging data management design for attaining flexible and reusable data integration pipelines, services and semantics. A data fabric supports various operational and analytics use cases delivered across multiple deployment and orchestration platforms. Data fabrics support a combination of different data integration styles and leverage active metadata, knowledge graphs, semantics and ML to automate and enhance data integration design and delivery." (Ehtisham Zaidi, "Data Fabric", Gartner's Hype Cycle for Data Management, 2021)

"Is a distributed Data Management platform whose objective is to combine various types of data storage, access, preparation, analytics, and security tools in a fully compliant manner to support seamless Data Management." (Michelle Knight, "What Is a Data Fabric?", 2021)

"A data fabric is a customized combination of architecture and technology. It uses dynamic data integration and orchestration to connect different locations, sources, and types of data. With the right structures and flows as defined within the data fabric platform, companies can quickly access and share data regardless of where it is or how it was generated." (SAP)

"A data fabric is a distributed, memory-based data management platform that uses cluster-wide resources - memory, CPU, network bandwidth, and optionally local disk – to manage application data and application logic (behavior). The data fabric uses dynamic replication and data partitioning techniques to offer continuous availability, very high performance, and linear scalability for data intensive applications, all without compromising on data consistency even when exposed to failure conditions." (VMware)

"A Data Fabric is a technology utilization and implementation design capable of multiple outputs and applied uses." (Gartner)

"A data fabric is an architecture and set of data services that provide consistent capabilities across a choice of endpoints spanning hybrid multicloud environments." (NetApp) [source]

"Data fabric is an end-to-end data integration and management solution, consisting of architecture, data management and integration software, and shared data that helps organizations manage their data. A data fabric provides a unified, consistent user experience and access to data for any member of an organization worldwide and in real-time." (Tibco) [source]

07 November 2018

🔭Data Science: Decision Theory (Just the Quotes)

"Years ago a statistician might have claimed that statistics deals with the processing of data [...] today’s statistician will be more likely to say that statistics is concerned with decision making in the face of uncertainty." (Herman Chernoff & Lincoln E Moses, "Elementary Decision Theory", 1959)

"Another approach to management theory, undertaken by a growing and scholarly group, might be referred to as the decision theory school. This group concentrates on rational approach to decision-the selection from among possible alternatives of a course of action or of an idea. The approach of this school may be to deal with the decision itself, or to the persons or organizational group making the decision, or to an analysis of the decision process. Some limit themselves fairly much to the economic rationale of the decision, while others regard anything which happens in an enterprise the subject of their analysis, and still others expand decision theory to cover the psychological and sociological aspect and environment of decisions and decision-makers." (Harold Koontz, "The Management Theory Jungle," 1961)

"The term hypothesis testing arises because the choice as to which process is observed is based on hypothesized models. Thus hypothesis testing could also be called model testing. Hypothesis testing is sometimes called decision theory. The detection theory of communication theory is a special case." (Fred C Scweppe, "Uncertain dynamic systems", 1973)

"In decision theory, mathematical analysis shows that once the sampling distribution, loss function, and sample are specified, the only remaining basis for a choice among different admissible decisions lies in the prior probabilities. Therefore, the logical foundations of decision theory cannot be put in fully satisfactory form until the old problem of arbitrariness (sometimes called 'subjectiveness') in assigning prior probabilities is resolved." (Edwin T Jaynes, "Prior Probabilities", 1978)

"Decision theory, as it has grown up in recent years, is a formalization of the problems involved in making optimal choices. In a certain sense - a very abstract sense, to be sure - it incorporates among others operations research, theoretical economics, and wide areas of statistics, among others." (Kenneth Arrow, "The Economics of Information", 1984) 

"Cybernetics is concerned with scientific investigation of systemic processes of a highly varied nature, including such phenomena as regulation, information processing, information storage, adaptation, self-organization, self-reproduction, and strategic behavior. Within the general cybernetic approach, the following theoretical fields have developed: systems theory (system), communication theory, game theory, and decision theory." (Fritz B Simon et al, "Language of Family Therapy: A Systemic Vocabulary and Source Book", 1985)

"A field of study that includes a methodology for constructing computer simulation models to achieve better under-standing of social and corporate systems. It draws on organizational studies, behavioral decision theory, and engineering to provide a theoretical and empirical base for structuring the relationships in complex systems." (Virginia Anderson & Lauren Johnson, "Systems Thinking Basics: From Concepts to Casual Loops", 1997) 

"A decision theory that rests on the assumptions that human cognitive capabilities are limited and that these limitations are adaptive with respect to the decision environments humans frequently encounter. Decision are thought to be made usually without elaborate calculations, but instead by using fast and frugal heuristics. These heuristics certainly have the advantage of speed and simplicity, but if they are well matched to a decision environment, they can even outperform maximizing calculations with respect to accuracy. The reason for this is that many decision environments are characterized by incomplete information and noise. The information we do have is usually structured in a specific way that clever heuristics can exploit." (E Ebenhoh, "Agent-Based Modelnig with Boundedly Rational Agents", 2007)

🔭Data Science: Belief (Just the Quotes)

"By degree of probability we really mean, or ought to mean, degree of belief [...] Probability then, refers to and implies belief, more or less, and belief is but another name for imperfect knowledge, or it may be, expresses the mind in a state of imperfect knowledge." (Augustus De Morgan, "Formal Logic: Or, The Calculus of Inference, Necessary and Probable", 1847)

"To a scientist a theory is something to be tested. He seeks not to defend his beliefs, but to improve them. He is, above everything else, an expert at ‘changing his mind’." (Wendell Johnson, 1946)

"A model can not be proved to be correct; at best it can only be found to be reasonably consistant and not to contradict some of our beliefs of what reality is." (Richard W Hamming, "The Art of Probability for Scientists and Engineers", 1991)

"Probability is not about the odds, but about the belief in the existence of an alternative outcome, cause, or motive." (Nassim N Taleb, "Fooled by Randomness", 2001)

"The Bayesian approach is based on the following postulates: (B1) Probability describes degree of belief, not limiting frequency. As such, we can make probability statements about lots of things, not just data which are subject to random variation. […] (B2) We can make probability statements about parameters, even though they are fixed constants. (B3) We make inferences about a parameter θ by producing a probability distribution for θ. Inferences, such as point estimates and interval estimates, may then be extracted from this distribution." (Larry A Wasserman, "All of Statistics: A concise course in statistical inference", 2004)

"The important thing is to understand that frequentist and Bayesian methods are answering different questions. To combine prior beliefs with data in a principled way, use Bayesian inference. To construct procedures with guaranteed long run performance, such as confidence intervals, use frequentist methods. Generally, Bayesian methods run into problems when the parameter space is high dimensional." (Larry A Wasserman, "All of Statistics: A concise course in statistical inference", 2004)

"Our inner weighing of evidence is not a careful mathematical calculation resulting in a probabilistic estimate of truth, but more like a whirlpool blending of the objective and the personal. The result is a set of beliefs - both conscious and unconscious - that guide us in interpreting all the events of our lives." (Leonard Mlodinow, "War of the Worldviews: Where Science and Spirituality Meet - and Do Not", 2011)

"The search for better numbers, like the quest for new technologies to improve our lives, is certainly worthwhile. But the belief that a few simple numbers, a few basic averages, can capture the multifaceted nature of national and global economic systems is a myth. Rather than seeking new simple numbers to replace our old simple numbers, we need to tap into both the power of our information age and our ability to construct our own maps of the world to answer the questions we need answering." (Zachary Karabell, "The Leading Indicators: A short history of the numbers that rule our world", 2014)

"One kind of probability - classic probability - is based on the idea of symmetry and equal likelihood […] In the classic case, we know the parameters of the system and thus can calculate the probabilities for the events each system will generate. […] A second kind of probability arises because in daily life we often want to know something about the likelihood of other events occurring […]. In this second case, we need to estimate the parameters of the system because we don’t know what those parameters are. […] A third kind of probability differs from these first two because it’s not obtained from an experiment or a replicable event - rather, it expresses an opinion or degree of belief about how likely a particular event is to occur. This is called subjective probability […]." (Daniel J Levitin, "Weaponized Lies", 2017)

"Bayesian statistics give us an objective way of combining the observed evidence with our prior knowledge (or subjective belief) to obtain a revised belief and hence a revised prediction of the outcome of the coin’s next toss. [...] This is perhaps the most important role of Bayes’s rule in statistics: we can estimate the conditional probability directly in one direction, for which our judgment is more reliable, and use mathematics to derive the conditional probability in the other direction, for which our judgment is rather hazy. The equation also plays this role in Bayesian networks; we tell the computer the forward  probabilities, and the computer tells us the inverse probabilities when needed." (Judea Pearl & Dana Mackenzie, "The Book of Why: The new science of cause and effect", 2018)

"The transparency of Bayesian networks distinguishes them from most other approaches to machine learning, which tend to produce inscrutable 'black boxes'. In a Bayesian network you can follow every step and understand how and why each piece of evidence changed the network’s beliefs." (Judea Pearl & Dana Mackenzie, "The Book of Why: The new science of cause and effect", 2018)

06 November 2018

🔭Data Science: Tools (Just the Quotes)

"[Statistics] are the only tools by which an opening can be cut through the formidable thicket of difficulties that bars the path of those who pursue the Science of man." (Sir Francis Galton, "Natural Inheritance", 1889)

"Mathematics is merely a shorthand method of recording physical intuition and physical reasoning, but it should not be a formalism leading from nowhere to nowhere, as it is likely to be made by one who does not realize its purpose as a tool." (Charles P Steinmetz, "Transactions of the American Institute of Electrical Engineers", 1909)

"Statistical methods are tools of scientific investigation. Scientific investigation is a controlled learning process in which various aspects of a problem are illuminated as the study proceeds. It can be thought of as a major iteration within which secondary iterations occur. The major iteration is that in which a tentative conjecture suggests an experiment, appropriate analysis of the data so generated leads to a modified conjecture, and this in turn leads to a new experiment, and so on." (George E P Box & George C Tjao, "Bayesian Inference in Statistical Analysis", 1973)

"[...] exploratory data analysis is an attitude, a state of flexibility, a willingness to look for those things that we believe are not there, as well as for those we believe might be there. Except for its emphasis on graphs, its tools are secondary to its purpose." (John W Tukey, [comment] 1979)

"Correlation analysis is a useful tool for uncovering a tenuous relationship, but it doesn't necessarily provide any real understanding of the relationship, and it certainly doesn't provide any evidence that the relationship is one of cause and effect. People who don't understand correlation tend to credit it with being a more fundamental approach than it is." (Robert Hooke, "How to Tell the Liars from the Statisticians", 1983)

"Science usually amounts to a lot more than blind trial and error. Good statistics consists of much more than just significance tests; there are more sophisticated tools available for the analysis of results, such as confidence statements, multiple comparisons, and Bayesian analysis, to drop a few names. However, not all scientists are good statisticians, or want to be, and not all people who are called scientists by the media deserve to be so described." (Robert Hooke, "How to Tell the Liars from the Statisticians", 1983)

"Some methods, such as those governing the design of experiments or the statistical treatment of data, can be written down and studied. But many methods are learned only through personal experience and interactions with other scientists. Some are even harder to describe or teach. Many of the intangible influences on scientific discovery - curiosity, intuition, creativity - largely defy rational analysis, yet they are often the tools that scientists bring to their work." (Committee on the Conduct of Science, "On Being a Scientist", 1989)

"Statistics is a tool. In experimental science you plan and carry out experiments, and then analyse and interpret the results. To do this you use statistical arguments and calculations. Like any other tool - an oscilloscope, for example, or a spectrometer, or even a humble spanner - you can use it delicately or clumsily, skillfully or ineptly. The more you know about it and understand how it works, the better you will be able to use it and the more useful it will be." (Roger Barlow, "Statistics: A Guide to the Use of Statistical Methods in the Physical Sciences", 1989)

"Fitting is essential to visualizing hypervariate data. The structure of data in many dimensions can be exceedingly complex. The visualization of a fit to hypervariate data, by reducing the amount of noise, can often lead to more insight. The fit is a hypervariate surface, a function of three or more variables. As with bivariate and trivariate data, our fitting tools are loess and parametric fitting by least-squares. And each tool can employ bisquare iterations to produce robust estimates when outliers or other forms of leptokurtosis are present." (William S Cleveland, "Visualizing Data", 1993)

"The logarithm is one of many transformations that we can apply to univariate measurements. The square root is another. Transformation is a critical tool for visualization or for any other mode of data analysis because it can substantially simplify the structure of a set of data. For example, transformation can remove skewness toward large values, and it can remove monotone increasing spread. And often, it is the logarithm that achieves this removal." (William S Cleveland, "Visualizing Data", 1993)

"Probability theory is an ideal tool for formalizing uncertainty in situations where class frequencies are known or where evidence is based on outcomes of a sufficiently long series of independent random experiments. Possibility theory, on the other hand, is ideal for formalizing incomplete information expressed in terms of fuzzy propositions." (George Klir, "Fuzzy sets and fuzzy logic", 1995)

"We use mathematics and statistics to describe the diverse realms of randomness. From these descriptions, we attempt to glean insights into the workings of chance and to search for hidden causes. With such tools in hand, we seek patterns and relationships and propose predictions that help us make sense of the world."  (Ivars Peterson, "The Jungles of Randomness: A Mathematical Safari", 1998)

"When an analyst selects the wrong tool, this is a misuse which usually leads to invalid conclusions. Incorrect use of even a tool as simple as the mean can lead to serious misuses. […] But all statisticians know that more complex tools do not guarantee an analysis free of misuses. Vigilance is required on every statistical level."  (Herbert F Spirer et al, "Misused Statistics" 2nd Ed, 1998)

"The key role of representation in thinking is often downplayed because of an ideal of rationality that dictates that whenever two statements are mathematically or logically the same, representing them in different forms should not matter. Evidence that it does matter is regarded as a sign of human irrationality. This view ignores the fact that finding a good representation is an indispensable part of problem solving and that playing with different representations is a tool of creative thinking." (Gerd Gigerenzer, "Calculated Risks: How to know when numbers deceive you", 2002)

"There is a tendency to use hypothesis testing methods even when they are not appropriate. Often, estimation and confidence intervals are better tools. Use hypothesis testing only when you want to test a well-defined hypothesis." (Larry A Wasserman, "All of Statistics: A concise course in statistical inference", 2004)

"Popular accounts of mathematics often stress the discipline’s obsession with certainty, with proof. And mathematicians often tell jokes poking fun at their own insistence on precision. However, the quest for precision is far more than an end in itself. Precision allows one to reason sensibly about objects outside of ordinary experience. It is a tool for exploring possibility: about what might be, as well as what is." (Donal O’Shea, “The Poincaré Conjecture”, 2007)

"The key to understanding randomness and all of mathematics is not being able to intuit the answer to every problem immediately but merely having the tools to figure out the answer." (Leonard Mlodinow,"The Drunkard’s Walk: How Randomness Rules Our Lives", 2008)

"[...] a model is a tool for taking decisions and any decision taken is the result of a process of reasoning that takes place within the limits of the human mind. So, models have eventually to be understood in such a way that at least some layer of the process of simulation is comprehensible by the human mind. Otherwise, we may find ourselves acting on the basis of models that we don’t understand, or no model at all.” (Ugo Bardi, “The Limits to Growth Revisited”, 2011)

"The first and main goal of any graphic and visualization is to be a tool for your eyes and brain to perceive what lies beyond their natural reach." (Alberto Cairo, "The Functional Art", 2011)

"What is really important is to remember that no matter how creative and innovative you wish to be in your graphics and visualizations, the first thing you must do, before you put a finger on the computer keyboard, is ask yourself what users are likely to try to do with your tool." (Alberto Cairo, "The Functional Art", 2011)

"Data mining is a craft. As with many crafts, there is a well-defined process that can help to increase the likelihood of a successful result. This process is a crucial conceptual tool for thinking about data science projects. [...] data mining is an exploratory undertaking closer to research and development than it is to engineering." (Foster Provost, "Data Science for Business", 2013)

"It is important to remember that predictive data analytics models built using machine learning techniques are tools that we can use to help make better decisions within an organization and are not an end in themselves. It is paramount that, when tasked with creating a predictive model, we fully understand the business problem that this model is being constructed to address and ensure that it does address it." (John D Kelleher et al, "Fundamentals of Machine Learning for Predictive Data Analytics: Algorithms, worked examples, and case studies", 2015)

"A popular misconception holds that the era of Big Data means the end of a need for sampling. In fact, the proliferation of data of varying quality and relevance reinforces the need for sampling as a tool to work efficiently with a variety of data, and minimize bias. Even in a Big Data project, predictive models are typically developed and piloted with samples." (Peter C Bruce & Andrew G Bruce, "Statistics for Data Scientists: 50 Essential Concepts", 2016)

"[…] remember that, as with many statistical issues, sampling in and of itself is not a good or a bad thing. Sampling is a powerful tool that allows us to learn something, when looking at the full population is not feasible (or simply isn’t the preferred option). And you shouldn’t be misled to think that you always should use all the data. In fact, using a sample of data can be incredibly helpful." (John H Johnson & Mike Gluck, "Everydata: The misinformation hidden in the little data you consume every day", 2016)

"A theory is nothing but a tool to know the reality. If a theory contradicts reality, it must be discarded at the earliest." (Awdhesh Singh, "Myths are Real, Reality is a Myth", 2018)

"Cross-validation is a useful tool for finding optimal predictive models, and it also works well in visualization. The concept is simple: split the data at random into a 'training' and a 'test' set, fit the model to the training data, then see how well it predicts the test data. As the model gets more complex, it will always fit the training data better and better. It will also start off getting better results on the test data, but there comes a point where the test data predictions start going wrong." (Robert Grant, "Data Visualization: Charts, Maps and Interactive Graphics", 2019)

"Even though data is being thrust on more people, it doesn’t mean everyone is prepared to consume and use it effectively. As our dependence on data for guidance and insights increases, the need for greater data literacy also grows. If literacy is defined as the ability to read and write, data literacy can be defined as the ability to understand and communicate data. Today’s advanced data tools can offer unparalleled insights, but they require capable operators who can understand and interpret data." (Brent Dykes, "Effective Data Storytelling: How to Drive Change with Data, Narrative and Visuals", 2019)

"We know what forecasting is: you start in the present and try to look into the future and imagine what it will be like. Backcasting is the opposite: you state your desired vision of the future as if it’s already happened, and then work backward to imagine the practices, policies, programs, tools, training, and people who worked in concert in a hypothetical past (which takes place in the future) to get you there." (Eben Hewitt, "Technology Strategy Patterns: Architecture as strategy" 2nd Ed., 2019)

"Big data is revolutionizing the world around us, and it is easy to feel alienated by tales of computers handing down decisions made in ways we don’t understand. I think we’re right to be concerned. Modern data analytics can produce some miraculous results, but big data is often less trustworthy than small data. Small data can typically be scrutinized; big data tends to be locked away in the vaults of Silicon Valley. The simple statistical tools used to analyze small datasets are usually easy to check; pattern-recognizing algorithms can all too easily be mysterious and commercially sensitive black boxes." (Tim Harford, "The Data Detective: Ten easy rules to make sense of statistics", 2020)

"I think sometimes organizations are looking at tools or the mythical and elusive data driven culture to be the strategy. Let me emphasize now: culture and tools are not strategies; they are enabling pieces." (Jordan Morrow, "Be Data Literate: The data literacy skills everyone needs to succeed", 2021)

"Visualisation is fundamentally limited by the number of pixels you can pump to a screen. If you have big data, you have way more data than pixels, so you have to summarise your data. Statistics gives you lots of really good tools for this." (Hadley Wickham)

05 November 2018

💠🛠️SQL Server: Administration (End of Life for 2008 and 2008 R2 Versions)

SQL Server 2008 and 2008 R2 are fast approaching their end of support on July 9, 2019. Besides upgrading to a newer version, there is also the option of migrating to an Azure SQL Server Managed Instance, which offers near 100% compatibility with an on-premises SQL Server installation, or to an Azure VM (see Franck Mercier’s post [2]).

If you aren’t sure which versions of SQL Server you have in your organization, here’s a script that can be run on all SQL Server 2005+ installations via SQL Server Management Studio:

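-- machine name, edition and build number, plus the product name derived from the build
SELECT SERVERPROPERTY('ComputerNamePhysicalNetBIOS') ComputerName
 , SERVERPROPERTY('Edition') Edition
 , V.ProductVersion
 , CASE -- match on the version prefix, so every service pack and update level is recognized
     WHEN V.ProductVersion LIKE '9.%' THEN 'SQL Server 2005'
     WHEN V.ProductVersion LIKE '10.0.%' THEN 'SQL Server 2008'
     WHEN V.ProductVersion LIKE '10.50.%' THEN 'SQL Server 2008 R2'
     WHEN V.ProductVersion LIKE '11.%' THEN 'SQL Server 2012'
     WHEN V.ProductVersion LIKE '12.%' THEN 'SQL Server 2014'
     WHEN V.ProductVersion LIKE '13.%' THEN 'SQL Server 2016'
     WHEN V.ProductVersion LIKE '14.%' THEN 'SQL Server 2017'
     ELSE 'unknown'
   END Product
FROM ( -- convert the sql_variant returned by SERVERPROPERTY once, for use in LIKE
   SELECT Cast(SERVERPROPERTY('ProductVersion') as nvarchar(20)) ProductVersion
 ) V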


Resources:
[1] Microsoft (2018) How to determine the version, edition, and update level of SQL Server and its components [Online] Available from: https://support.microsoft.com/en-us/help/321185/how-to-determine-the-version-edition-and-update-level-of-sql-server-an
[2] Microsoft Blogs (2018) SQL Server 2008 end of support, by Franck Mercier [Online] Available from: https://blogs.technet.microsoft.com/franmer/2018/11/01/sql-server-2008-end-of-support-2/

🔭Data Science: Confidence (Just the Quotes)

"What the use of P [the significance level] implies, therefore, is that a hypothesis that may be true may be rejected because it has not predicted observable results that have not occurred." (Harold Jeffreys, "Theory of Probability", 1939)

"Only by the analysis and interpretation of observations as they are made, and the examination of the larger implications of the results, is one in a satisfactory position to pose new experimental and theoretical questions of the greatest significance." (John A Wheeler, "Elementary Particle Physics", American Scientist, 1947)

"As usual we may make the errors of I) rejecting the null hypothesis when it is true, II) accepting the null hypothesis when it is false. But there is a third kind of error which is of interest because the present test of significance is tied up closely with the idea of making a correct decision about which distribution function has slipped furthest to the right. We may make the error of III) correctly rejecting the null hypothesis for the wrong reason." (Frederick Mosteller, "A k-Sample Slippage Test for an Extreme Population", The Annals of Mathematical Statistics 19, 1948)

"Errors of the third kind happen in conventional tests of differences of means, but they are usually not considered, although their existence is probably recognized. It seems to the author that there may be several reasons for this among which are 1) a preoccupation on the part of mathematical statisticians with the formal questions of acceptance and rejection of null hypotheses without adequate consideration of the implications of the error of the third kind for the practical experimenter, 2) the rarity with which an error of the third kind arises in the usual tests of significance." (Frederick Mosteller, "A k-Sample Slippage Test for an Extreme Population", The Annals of Mathematical Statistics 19, 1948)

"If significance tests are required for still larger samples, graphical accuracy is insufficient, and arithmetical methods are advised. A word to the wise is in order here, however. Almost never does it make sense to use exact binomial significance tests on such data - for the inevitable small deviations from the mathematical model of independence and constant split have piled up to such an extent that the binomial variability is deeply buried and unnoticeable. Graphical treatment of such large samples may still be worthwhile because it brings the results more vividly to the eye." (Frederick Mosteller & John W Tukey, "The Uses and Usefulness of Binomial Probability Paper?", Journal of the American Statistical Association 44, 1949)

"One reason for preferring to present a confidence interval statement (where possible) is that the confidence interval, by its width, tells more about the reliance that can be placed on the results of the experiment than does a YES-NO test of significance." (Mary G Natrella, "The relation between confidence intervals and tests of significance", American Statistician 14, 1960)

"Confidence intervals give a feeling of the uncertainty of experimental evidence, and (very important) give it in the same units [...] as the original observations." (Mary G Natrella, "The relation between confidence intervals and tests of significance", American Statistician 14, 1960)

"The null-hypothesis significance test treats ‘acceptance’ or ‘rejection’ of a hypothesis as though these were decisions one makes. But a hypothesis is not something, like a piece of pie offered for dessert, which can be accepted or rejected by a voluntary physical action. Acceptance or rejection of a hypothesis is a cognitive process, a degree of believing or disbelieving which, if rational, is not a matter of choice but determined solely by how likely it is, given the evidence, that the hypothesis is true." (William W Rozeboom, "The fallacy of the null–hypothesis significance test", Psychological Bulletin 57, 1960)

"The null hypothesis of no difference has been judged to be no longer a sound or fruitful basis for statistical investigation. […] Significance tests do not provide the information that scientists need, and, furthermore, they are not the most effective method for analyzing and summarizing data." (Cherry A Clark, "Hypothesis Testing in Relation to Statistical Methodology", Review of Educational Research Vol. 33, 1963)

"The idea of knowledge as an improbable structure is still a good place to start. Knowledge, however, has a dimension which goes beyond that of mere information or improbability. This is a dimension of significance which is very hard to reduce to quantitative form. Two knowledge structures might be equally improbable but one might be much more significant than the other." (Kenneth E Boulding, "Beyond Economics: Essays on Society", 1968)

"Significance levels are usually computed and reported, but power and confidence limits are not. Perhaps they should be." (Amos Tversky & Daniel Kahneman, "Belief in the law of small numbers", Psychological Bulletin 76(2), 1971)

"Science usually amounts to a lot more than blind trial and error. Good statistics consists of much more than just significance tests; there are more sophisticated tools available for the analysis of results, such as confidence statements, multiple comparisons, and Bayesian analysis, to drop a few names. However, not all scientists are good statisticians, or want to be, and not all people who are called scientists by the media deserve to be so described." (Robert Hooke, "How to Tell the Liars from the Statisticians", 1983)

"It is usually wise to give a confidence interval for the parameter in which you are interested." (David S Moore & George P McCabe, "Introduction to the Practice of Statistics", 1989)

"I do not think that significance testing should be completely abandoned [...] and I don’t expect that it will be. But I urge researchers to provide estimates, with confidence intervals: scientific advance requires parameters with known reliability estimates. Classical confidence intervals are formally equivalent to a significance test, but they convey more information." (Nigel G Yoccoz, "Use, Overuse, and Misuse of Significance Tests in Evolutionary Biology and Ecology", Bulletin of the Ecological Society of America Vol. 72 (2), 1991)

"Whereas hypothesis testing emphasizes a very narrow question (‘Do the population means fail to conform to a specific pattern?’), the use of confidence intervals emphasizes a much broader question (‘What are the population means?’). Knowing what the means are, of course, implies knowing whether they fail to conform to a specific pattern, although the reverse is not true. In this sense, use of confidence intervals subsumes the process of hypothesis testing." (Geoffrey R Loftus, "On the tyranny of hypothesis testing in the social sciences", Contemporary Psychology 36, 1991) 

"We should push for de-emphasizing some topics, such as statistical significance tests - an unfortunate carry-over from the traditional elementary statistics course. We would suggest a greater focus on confidence intervals - these achieve the aim of formal hypothesis testing, often provide additional useful information, and are not as easily misinterpreted." (Gerry Hahn et al, "The Impact of Six Sigma Improvement: A Glimpse Into the Future of Statistics", The American Statistician, 1999)

"[...] they [confidence limits] are rarely to be found in the literature. I suspect that the main reason they are not reported is that they are so embarrassingly large!" (Jacob Cohen, "The earth is round (p<.05)", American Psychologist 49, 1994)

"Given the important role that correlation plays in structural equation modeling, we need to understand the factors that affect establishing relationships among multivariable data points. The key factors are the level of measurement, restriction of range in data values (variability, skewness, kurtosis), missing data, nonlinearity, outliers, correction for attenuation, and issues related to sampling variation, confidence intervals, effect size, significance, sample size, and power." (Randall E Schumacker & Richard G Lomax, "A Beginner’s Guide to Structural Equation Modeling" 3rd Ed., 2010)

"Another way to secure statistical significance is to use the data to discover a theory. Statistical tests assume that the researcher starts with a theory, collects data to test the theory, and reports the results - whether statistically significant or not. Many people work in the other direction, scrutinizing the data until they find a pattern and then making up a theory that fits the pattern." (Gary Smith, "Standard Deviations", 2014)

"There is a growing realization that reported 'statistically significant' claims in statistical publications are routinely mistaken. Researchers typically express the confidence in their data in terms of p-value: the probability that a perceived result is actually the result of random variation. The value of p (for 'probability') is a way of measuring the extent to which a data set provides evidence against a so-called null hypothesis. By convention, a p- value below 0.05 is considered a meaningful refutation of the null hypothesis; however, such conclusions are less solid than they appear." (Andrew Gelman & Eric Loken, "The Statistical Crisis in Science", American Scientist Vol. 102(6), 2014)

04 November 2018

🔭Data Science: Residuals (Just the Quotes)

"Data analysis must be iterative to be effective. [...] The iterative and interactive interplay of summarizing by fit and exposing by residuals is vital to effective data analysis. Summarizing and exposing are complementary and pervasive." (John W Tukey & Martin B Wilk, "Data Analysis and: An Expository Overview", 1966)

"Exploratory data analysis, EDA, calls for a relatively free hand in exploring the data, together with dual obligations: (•) to look for all plausible alternatives and oddities - and a few implausible ones, (graphic techniques can be most helpful here) and (•) to remove each appearance that seems large enough to be meaningful - ordinarily by some form of fitting, adjustment, or standardization [...] so that what remains, the residuals, can be examined for further appearances." (John W Tukey, "Introduction to Styles of Data Analysis Techniques", 1982)

"A good description of the data summarizes the systematic variation and leaves residuals that look structureless. That is, the residuals exhibit no patterns and have no exceptionally large values, or outliers. Any structure present in the residuals indicates an inadequate fit. Looking at the residuals laid out in an overlay helps to spot patterns and outliers and to associate them with their source in the data." (Christopher H Schrnid, "Value Splitting: Taking the Data Apart", 1991)

"A useful description relates the systematic variation to one or more factors; if the residuals dwarf the effects for a factor, we may not be able to relate variation in the data to changes in the factor. Furthermore, changes in the factor may bring no important change in the response. Such comparisons of residuals and effects require a measure of the variation of overlays relative to each other." (Christopher H Schrnid, "Value Splitting: Taking the Data Apart", 1991)

"Fitting data means finding mathematical descriptions of structure in the data. An additive shift is a structural property of univariate data in which distributions differ only in location and not in spread or shape. […] The process of identifying a structure in data and then fitting the structure to produce residuals that have the same distribution lies at the heart of statistical analysis. Such homogeneous residuals can be pooled, which increases the power of the description of the variation in the data." (William S Cleveland, "Visualizing Data", 1993)

"When the distributions of two or more groups of univariate data are skewed, it is common to have the spread increase monotonically with location. This behavior is monotone spread. Strictly speaking, monotone spread includes the case where the spread decreases monotonically with location, but such a decrease is much less common for raw data. Monotone spread, as with skewness, adds to the difficulty of data analysis. For example, it means that we cannot fit just location estimates to produce homogeneous residuals; we must fit spread estimates as well. Furthermore, the distributions cannot be compared by a number of standard methods of probabilistic inference that are based on an assumption of equal spreads; the standard t-test is one example. Fortunately, remedies for skewness can cure monotone spread as well." (William S Cleveland, "Visualizing Data", 1993)

"Residual analysis is similarly unreliable. In a discussion after a presentation of residual analysis in a seminar at Berkeley in 1993, William Cleveland, one of the fathers of residual analysis, admitted that it could not uncover lack of fit in more than four to five dimensions. The papers I have read on using residual analysis to check lack of fit are confined to data sets with two or three variables. With higher dimensions, the interactions between the variables can produce passable residual plots for a variety of models. A residual plot is a goodness-of-fit test, and lacks power in more than a few dimensions. An acceptable residual plot does not imply that the model is a good fit to the data." (Leo Breiman, "Statistical Modeling: The Two Cultures", Statistical Science Vol. 16(3), 2001)

"For a confidence interval, the central limit theorem plays a role in the reliability of the interval because the sample mean is often approximately normal even when the underlying data is not. A prediction interval has no such protection. The shape of the interval reflects the shape of the underlying distribution. It is more important to examine carefully the normality assumption by checking the residuals […]." (DeWayne R Derryberry, "Basic data analysis for time series with R", 2014)

"Using noise (the uncorrelated variables) to fit noise (the residual left from a simple model on the genuinely correlated variables) is asking for trouble." (Steven S Skiena, "The Data Science Design Manual", 2017)

"One of the most common problems that you will encounter when training deep neural networks will be overfitting. What can happen is that your network may, owing to its flexibility, learn patterns that are due to noise, errors, or simply wrong data. [...] The essence of overfitting is to have unknowingly extracted some of the residual variation (i.e., the noise) as if that variation represented the underlying model structure. The opposite is called underfitting - when the model cannot capture the structure of the data." (Umberto Michelucci, "Applied Deep Learning: A Case-Based Approach to Understanding Deep Neural Networks", 2018)

🔭Data Science: Central Limit Theorem (Just the Quotes)

"I know of scarcely anything so apt to impress the imagination as the wonderful form of cosmic order expressed by the ‘Law of Frequency of Error’. The law would have been personified by the Greeks and deified, if they had known of it. It reigns with serenity and in complete self-effacement, amidst the wildest confusion. The huger the mob, and the greater the apparent anarchy, the more perfect is its sway. It is the supreme law of Unreason. Whenever a large sample of chaotic elements are taken in hand and marshalled in the order of their magnitude, an unsuspected and most beautiful form of regularity proves to have been latent all along." (Sir Francis Galton, "Natural Inheritance", 1889)

"The central limit theorem […] states that regardless of the shape of the curve of the original population, if you repeatedly randomly sample a large segment of your group of interest and take the average result, the set of averages will follow a normal curve." (Charles Livingston & Paul Voakes, "Working with Numbers and Statistics: A handbook for journalists", 2005)

"The central limit theorem says that, under conditions almost always satisfied in the real world of experimentation, the distribution of such a linear function of errors will tend to normality as the number of its components becomes large. The tendency to normality occurs almost regardless of the individual distributions of the component errors. An important proviso is that several sources of error must make important contributions to the overall error and that no particular source of error dominate the rest." (George E P Box et al, "Statistics for Experimenters: Design, discovery, and innovation" 2nd Ed., 2005)

"Two things explain the importance of the normal distribution: (1) The central limit effect that produces a tendency for real error distributions to be 'normal like'. (2) The robustness to nonnormality of some common statistical procedures, where 'robustness' means insensitivity to deviations from theoretical normality." (George E P Box et al, "Statistics for Experimenters: Design, discovery, and innovation" 2nd Ed., 2005)

"The central limit theorem differs from laws of large numbers because random variables vary and so they differ from constants such as population means. The central limit theorem says that certain independent random effects converge not to a constant population value such as the mean rate of unemployment but rather they converge to a random variable that has its own Gaussian bell-curve description." (Bart Kosko, "Noise", 2006)

"Normally distributed variables are everywhere, and most classical statistical methods use this distribution. The explanation for the normal distribution’s ubiquity is the Central Limit Theorem, which says that if you add a large number of independent samples from the same distribution the distribution of the sum will be approximately normal." (Ben Bolker, "Ecological Models and Data in R", 2007)

"[…] the Central Limit Theorem says that if we take any sequence of small independent random quantities, then in the limit their sum (or average) will be distributed according to the normal distribution. In other words, any quantity that can be viewed as the sum of many small independent random effects. will be well approximated by the normal distribution. Thus, for example, if one performs repeated measurements of a fixed physical quantity, and if the variations in the measurements across trials are the cumulative result of many independent sources of error in each trial, then the distribution of measured values should be approximately normal." (David Easley & Jon Kleinberg, "Networks, Crowds, and Markets: Reasoning about a Highly Connected World", 2010)

[myth] "It has been said that process behavior charts work because of the central limit theorem." (Donald J Wheeler, "Myths About Data Analysis", International Lean & Six Sigma Conference, 2012)

"Statistical inference is really just the marriage of two concepts that we’ve already discussed: data and probability (with a little help from the central limit theorem)." (Charles Wheelan, "Naked Statistics: Stripping the Dread from the Data", 2012)

"The central limit theorem is often used to justify the assumption of normality when using the sample mean and the sample standard deviation. But it is inevitable that real data contain gross errors. Five to ten percent unusual values in a dataset seem to be the rule rather than the exception. The distribution of such data is no longer Normal." (A S Hedayat & Guoqin Su, "Robustness of the Simultaneous Estimators of Location and Scale From Approximating a Histogram by a Normal Density Curve", The American Statistician 66, 2012)

"The central limit theorem tells us that in repeated samples, the difference between the two means will be distributed roughly as a normal distribution." (Charles Wheelan, "Naked Statistics: Stripping the Dread from the Data", 2012)

"According to the central limit theorem, it doesn’t matter what the raw data look like, the sample variance should be proportional to the number of observations and if I have enough of them, the sample mean should be normal." (Kristin H Jarman, "The Art of Data Analysis: How to answer almost any question using basic statistics", 2013)

"For a confidence interval, the central limit theorem plays a role in the reliability of the interval because the sample mean is often approximately normal even when the underlying data is not. A prediction interval has no such protection. The shape of the interval reflects the shape of the underlying distribution. It is more important to examine carefully the normality assumption by checking the residuals […]." (DeWayne R Derryberry, "Basic data analysis for time series with R", 2014)

"When data is not normal, the reason the formulas are working is usually the central limit theorem. For large sample sizes, the formulas are producing parameter estimates that are approximately normal even when the data is not itself normal. The central limit theorem does make some assumptions and one is that the mean and variance of the population exist. Outliers in the data are evidence that these assumptions may not be true. Persistent outliers in the data, ones that are not errors and cannot be otherwise explained, suggest that the usual procedures based on the central limit theorem are not applicable." (DeWayne R Derryberry, "Basic data analysis for time series with R", 2014)

"At very small time scales, the motion of a particle is more like a random walk, as it gets jostled about by discrete collisions with water molecules. But virtually any random movement on small time scales will give rise to Brownian motion on large time scales, just so long as the motion is unbiased. This is because of the Central Limit Theorem, which tells us that the aggregate of many small, independent motions will be normally distributed." (Field Cady, "The Data Science Handbook", 2017)

"The central limit conjecture states that most errors are the result of many small errors and, as such, have a normal distribution. The assumption of a normal distribution for error has many advantages and has often been made in applications of statistical models." (David S Salsburg, "Errors, Blunders, and Lies: How to Tell the Difference", 2017)

"Theoretically, the normal distribution is most famous because many distributions converge to it, if you sample from them enough times and average the results. This applies to the binomial distribution, Poisson distribution and pretty much any other distribution you’re likely to encounter (technically, any one for which the mean and standard deviation are finite)." (Field Cady, "The Data Science Handbook", 2017)

"The central limit theorem in statistics states that, given a sufficiently large sample size, the sampling distribution of the mean for a variable will approximate a normal distribution regardless of that variable’s distribution in the population." (Jim Frost)

"The old rule of trusting the Central Limit Theorem if the sample size is larger than 30 is just that–old. Bootstrap and permutation testing let us more easily do inferences for a wider variety of statistics." (Tim Hesterberg)
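
As a rough illustration of the bootstrap idea mentioned above, the sketch below computes a percentile bootstrap interval for a median, a statistic for which a normal approximation is less convenient. The data values, the function name `bootstrap_ci`, and the number of resamples are all made up for the example; this is one simple variant (the percentile method), not the only way to bootstrap.

```python
import random
import statistics


def bootstrap_ci(data, stat, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for stat(data).

    Resamples the data with replacement, recomputes the statistic each
    time, and reads the interval off the sorted resampled statistics.
    """
    rng = random.Random(seed)
    n = len(data)
    estimates = sorted(stat(rng.choices(data, k=n))
                       for _ in range(n_resamples))
    lower = estimates[int((alpha / 2) * n_resamples)]
    upper = estimates[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Hypothetical small, skewed sample (invented for illustration).
data = [2.1, 2.4, 2.5, 2.9, 3.0, 3.3, 3.8, 4.1, 5.6, 9.7]

low, high = bootstrap_ci(data, statistics.median)
print(f"sample median: {statistics.median(data):.2f}")
print(f"95% percentile bootstrap interval: ({low:.2f}, {high:.2f})")
```

Any other statistic (trimmed mean, ratio, quantile) can be passed as `stat` without changing the resampling code, which is the flexibility the quote points to.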

03 November 2018

🔭Data Science: Least Squares (Just the Quotes)

"From the foregoing we see that the two justifications each leave something to be desired. The first depends entirely on the hypothetical form of the probability of the error; as soon as that form is rejected, the values of the unknowns produced by the method of least squares are no more the most probable values than is the arithmetic mean in the simplest case mentioned above. The second justification leaves us entirely in the dark about what to do when the number of observations is not large. In this case the method of least squares no longer has the status of a law ordained by the probability calculus but has only the simplicity of the operations it entails to recommend it." (Carl Friedrich Gauss, "Anzeige: Theoria combinationis observationum erroribus minimis obnoxiae: Pars prior", Göttingische gelehrte Anzeigen, 1821)

"The method of least squares is used in the analysis of data from planned experiments and also in the analysis of data from unplanned happenings. The word 'regression' is most often used to describe analysis of unplanned data. It is the tacit assumption that the requirements for the validity of least squares analysis are satisfied for unplanned data that produces a great deal of trouble." (George E P Box, "Use and Abuse of Regression", 1966)

"At the heart of probabilistic statistical analysis is the assumption that a set of data arises as a sample from a distribution in some class of probability distributions. The reasons for making distributional assumptions about data are several. First, if we can describe a set of data as a sample from a certain theoretical distribution, say a normal distribution (also called a Gaussian distribution), then we can achieve a valuable compactness of description for the data. For example, in the normal case, the data can be succinctly described by giving the mean and standard deviation and stating that the empirical (sample) distribution of the data is well approximated by the normal distribution. A second reason for distributional assumptions is that they can lead to useful statistical procedures. For example, the assumption that data are generated by normal probability distributions leads to the analysis of variance and least squares. Similarly, much of the theory and technology of reliability assumes samples from the exponential, Weibull, or gamma distribution. A third reason is that the assumptions allow us to characterize the sampling distribution of statistics computed during the analysis and thereby make inferences and probabilistic statements about unknown aspects of the underlying distribution. For example, assuming the data are a sample from a normal distribution allows us to use the t-distribution to form confidence intervals for the mean of the theoretical distribution. A fourth reason for distributional assumptions is that understanding the distribution of a set of data can sometimes shed light on the physical mechanisms involved in generating the data." (John M Chambers et al, "Graphical Methods for Data Analysis", 1983)

"'Least squares' means just what it says: you minimise the (suitably weighted) squared difference between a set of measurements and their predicted values. This is done by varying the parameters you want to estimate: the predicted values are adjusted so as to be close to the measurements; squaring the differences means that greater importance is placed on removing the large deviations." (Roger J Barlow, "Statistics: A guide to the use of statistical methods in the physical sciences", 1989)
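
The description above can be made concrete for the simplest case, a straight line y = a + b*x with a known measurement uncertainty per point. The sketch below is an illustration under those assumptions, not code from the quoted book; the data, the weights 1/sigma^2, and the function names are hypothetical.

```python
def weighted_sum_of_squares(xs, ys, sigmas, intercept, slope):
    """Sum of squared differences between measurements and predictions,
    each difference scaled by the measurement uncertainty sigma."""
    return sum(((y - (intercept + slope * x)) / s) ** 2
               for x, y, s in zip(xs, ys, sigmas))


def weighted_line_fit(xs, ys, sigmas):
    """Return (intercept, slope) minimising the weighted sum of squares.

    Uses the closed-form solution of the two normal equations for a
    straight line, with weights w = 1 / sigma**2.
    """
    ws = [1.0 / s ** 2 for s in sigmas]
    sw = sum(ws)
    sx = sum(w * x for w, x in zip(ws, xs))
    sy = sum(w * y for w, y in zip(ws, ys))
    sxx = sum(w * x * x for w, x in zip(ws, xs))
    sxy = sum(w * x * y for w, x, y in zip(ws, xs, ys))
    denom = sw * sxx - sx ** 2
    slope = (sw * sxy - sx * sy) / denom
    intercept = (sxx * sy - sx * sxy) / denom
    return intercept, slope


# Hypothetical measurements with per-point uncertainties.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.1, 2.9, 5.2, 6.8, 9.1]
sigmas = [0.1, 0.1, 0.2, 0.2, 0.5]

a, b = weighted_line_fit(xs, ys, sigmas)
best = weighted_sum_of_squares(xs, ys, sigmas, a, b)
print(f"intercept = {a:.3f}, slope = {b:.3f}, weighted SS = {best:.3f}")
```

Moving either parameter away from the fitted value increases the weighted sum of squares, and points with a small sigma pull the line harder than points with a large one.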

"Principal components and principal factor analysis lack a well-developed theoretical framework like that of least squares regression. They consequently provide no systematic way to test hypotheses about the number of factors to retain, the size of factor loadings, or the correlations between factors, for example. Such tests are possible using a different approach, based on maximum-likelihood estimation." (Lawrence C Hamilton, "Regression with Graphics: A second course in applied statistics", 1991)

"Fuzzy models should make good predictions even when they are asked to predict on regions that were not excited during the construction of the model. The generalization capabilities can be controlled by an appropriate initialization of the consequences (prior knowledge) and the use of the recursive least squares to improve the prior choices. The prior knowledge can be obtained from the data." (Jairo Espinosa et al, "Fuzzy Logic, Identification and Predictive Control", 2005)

"Often when people relate essentially the same variable in two different groups, or at two different times, they see this same phenomenon - the tendency of the response variable to be closer to the mean than the predicted value. Unfortunately, people try to interpret this by thinking that the performance of those far from the mean is deteriorating, but it’s just a mathematical fact about the correlation. So, today we try to be less judgmental about this phenomenon and we call it regression to the mean. We managed to get rid of the term 'mediocrity', but the name regression stuck as a name for the whole least squares fitting procedure - and that’s where we get the term regression line." (Richard D De Veaux et al, "Stats: Data and Models", 2016)

🔭Data Science: Myths (Just the Quotes)

"[myth:] Accuracy is more important than precision. For single best estimates, be it a mean value or a single data value, this question does not arise because in that case there is no difference between accuracy and precision. (Think of a single shot aimed at a target.) Generally, it is good practice to balance precision and accuracy. The actual requirements will differ from case to case." (Manfred Drosg, "Dealing with Uncertainties: A Guide to Error Analysis", 2007)

"[myth:] Counting can be done without error. Usually, the counted number is an integer and therefore without (rounding) error. However, the best estimate of a scientifically relevant value obtained by counting will always have an error. These errors can be very small in cases of consecutive counting, in particular of regular events, e.g., when measuring frequencies." (Manfred Drosg, "Dealing with Uncertainties: A Guide to Error Analysis", 2007)

"[myth:] Random errors can always be determined by repeating measurements under identical conditions. […] this statement is true only for time-related random errors." (Manfred Drosg, "Dealing with Uncertainties: A Guide to Error Analysis", 2007)

"[myth:] Systematic errors can be determined inductively. It should be quite obvious that it is not possible to determine the scale error from the pattern of data values." (Manfred Drosg, "Dealing with Uncertainties: A Guide to Error Analysis", 2007)

[myth] "It has been said that process behavior charts work because of the central limit theorem." (Donald J Wheeler, "Myths About Data Analysis", International Lean & Six Sigma Conference, 2012)

[myth] "It has been said that the data must be normally distributed before they can be placed on a process behavior chart." (Donald J Wheeler, "Myths About Data Analysis", International Lean & Six Sigma Conference, 2012)

[myth] "It has been said that the observations must be independent - data with autocorrelation are inappropriate for process behavior charts." (Donald J Wheeler, "Myths About Data Analysis", International Lean & Six Sigma Conference, 2012)

[myth] "It has been said that the process must be operating in control before you can place the data on a process behavior chart." (Donald J Wheeler, "Myths About Data Analysis", International Lean & Six Sigma Conference, 2012)

[myth] "The standard deviation statistic is more efficient than the range and therefore we should use the standard deviation statistic when computing limits for a process behavior chart." (Donald J Wheeler, "Myths About Data Analysis", International Lean & Six Sigma Conference, 2012)

"The search for better numbers, like the quest for new technologies to improve our lives, is certainly worthwhile. But the belief that a few simple numbers, a few basic averages, can capture the multifaceted nature of national and global economic systems is a myth. Rather than seeking new simple numbers to replace our old simple numbers, we need to tap into both the power of our information age and our ability to construct our own maps of the world to answer the questions we need answering." (Zachary Karabell, "The Leading Indicators: A short history of the numbers that rule our world", 2014)

"An oft-repeated rule of thumb in any sort of statistical model fitting is 'you can't fit a model with more parameters than data points'. This idea appears to be as wide-spread as it is incorrect. On the contrary, if you construct your models carefully, you can fit models with more parameters than datapoints [...]. A model with more parameters than datapoints is known as an under-determined system, and it's a common misperception that such a model cannot be solved in any circumstance. [...] this misconception, which I like to call the 'model complexity myth' [...] is not true in general, it is true in the specific case of simple linear models, which perhaps explains why the myth is so pervasive." (Jake Vanderplas, "The Model Complexity Myth", 2015)
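
One common way to make an under-determined linear model solvable is to add a small penalty on the size of the parameters (ridge regularization). The sketch below is an illustrative choice on my part and not necessarily the construction the quoted author has in mind; the toy data, the penalty value, and the helper names `solve` and `ridge_fit` are hypothetical. It fits three parameters to two data points.

```python
def solve(a, b):
    """Solve the square system a x = b by Gauss-Jordan elimination
    with partial pivoting (fine for tiny, well-conditioned systems)."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [vr - factor * vc for vr, vc in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def ridge_fit(X, y, lam):
    """Minimise sum((y - X beta)^2) + lam * sum(beta^2) by solving
    the regularized normal equations (X^T X + lam I) beta = X^T y."""
    p = len(X[0])
    xtx = [[sum(row[i] * row[j] for row in X) + (lam if i == j else 0.0)
            for j in range(p)] for i in range(p)]
    xty = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(p)]
    return solve(xtx, xty)


# Two data points, three parameters: without the penalty, X^T X is singular.
X = [[1.0, 0.0, 1.0],
     [1.0, 1.0, 2.0]]
y = [1.0, 2.0]

beta = ridge_fit(X, y, lam=1e-3)
pred = [sum(b * x for b, x in zip(beta, row)) for row in X]
print("parameters: ", [round(b, 3) for b in beta])
print("predictions:", [round(p, 3) for p in pred])
```

The penalty picks out one well-defined solution among the infinitely many that fit the two points, which is why the regularized system can be solved even though the unregularized one cannot.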

"Hollywood loves the myth of a lone scientist working late nights in a dark laboratory on a mysterious island, but the truth is far less melodramatic. Real science is almost always a team sport. Groups of people, collaborating with other groups of people, are the norm in science - and data science is no exception to the rule. When large groups of people work together for extended periods of time, a culture begins to emerge." (Mike Barlow, "Learning to Love Data Science", 2015)

"One of the biggest truths about the real–time analytics is that nothing is actually real–time; it's a myth. In reality, it's close to real–time. Depending upon the performance and ability of a solution and the reduction of operational latencies, the analytics could be close to real–time, but, while day-by-day we are bridging the gap between real–time and near–real–time, it's practically impossible to eliminate the gap due to computational, operational, and network latencies." (Shilpi Saxena & Saurabh Gupta, "Practical Real-time Data Processing and Analytics", 2017)

"The field of big-data analytics is still littered with a few myths and evidence-free lore. The reasons for these myths are simple: the emerging nature of technologies, the lack of common definitions, and the non-availability of validated best practices. Whatever the reasons, these myths must be debunked, as allowing them to persist usually has a negative impact on success factors and Return on Investment (RoI). On a positive note, debunking the myths allows us to set the right expectations, allocate appropriate resources, redefine business processes, and achieve individual/organizational buy-in." (Prashant Natarajan et al, "Demystifying Big Data and Machine Learning for Healthcare", 2017) 

"The first myth is that prediction is always based on time-series extrapolation into the future (also known as forecasting). This is not the case: predictive analytics can be applied to generate any type of unknown data, including past and present. In addition, prediction can be applied to non-temporal (time-based) use cases such as disease progression modeling, human relationship modeling, and sentiment analysis for medication adherence, etc. The second myth is that predictive analytics is a guarantor of what will happen in the future. This also is not the case: predictive analytics, due to the nature of the insights they create, are probabilistic and not deterministic. As a result, predictive analytics will not be able to ensure certainty of outcomes." (Prashant Natarajan et al, "Demystifying Big Data and Machine Learning for Healthcare", 2017) 

"One of the biggest myths is the belief that data science is an autonomous process that we can let loose on our data to find the answers to our problems. In reality, data science requires skilled human oversight throughout the different stages of the process. [...] The second big myth of data science is that every data science project needs big data and needs to use deep learning. In general, having more data helps, but having the right data is the more important requirement. [...] A third data science myth is that modern data science software is easy to use, and so data science is easy to do. [...] The last myth about data science [...] is the belief that data science pays for itself quickly. The truth of this belief depends on the context of the organization. Adopting data science can require significant investment in terms of developing data infrastructure and hiring staff with data science expertise. Furthermore, data science will not give positive results on every project." (John D Kelleher & Brendan Tierney, "Data Science", 2018)

"In the world of data and analytics, people get enamored by the nice, shiny object. We are pulled around by the wind of the latest technology, but in so doing we are pulled away from the sound and intelligent path that can lead us to data and analytical success. The data and analytical world is full of examples of overhyped technology or processes, thinking this thing will solve all of the data and analytical needs for an individual or organization. Such topics include big data or data science. These two were pushed into our minds and down our throats so incessantly over the past decade that they are somewhat of a myth, or people finally saw the light. In reality, both have a place and do matter, but they are not the only solution to your data and analytical needs. Unfortunately, though, organizations bit into them, thinking they would solve everything, and were left at the altar, if you will, when it came time for the marriage of data and analytical success with tools." (Jordan Morrow, "Be Data Literate: The data literacy skills everyone needs to succeed", 2021)

"[...] the focus on Big Data AI seems to be an excuse to put forth a number of vague and hand-waving theories, where the actual details and the ultimate success of neuroscience is handed over to quasi-mythological claims about the powers of large datasets and inductive computation. Where humans fail to illuminate a complicated domain with testable theory, machine learning and big data supposedly can step in and render traditional concerns about finding robust theories. This seems to be the logic of Data Brain efforts today." (Erik J Larson, "The Myth of Artificial Intelligence: Why Computers Can’t Think the Way We Do", 2021)

"The myth of replacing domain experts comes from people putting too much faith in the power of ML to find patterns in the data. [...] ML looks for patterns that are generally pretty crude - the power comes from the sheer scale at which they can operate. If the important patterns in the data are not sufficiently crude then ML will not be able to ferret them out. The most powerful classes of models, like deep learning, can sometimes learn good-enough proxies for the real patterns, but that requires more training data than is usually available and yields complicated models that are hard to understand and impossible to debug. It’s much easier to just ask somebody who knows the domain!" (Field Cady, "Data Science: The Executive Summary: A Technical Book for Non-Technical Professionals", 2021)

🔭Data Science: Forecasting (Just the Quotes)

"Extrapolations are useful, particularly in the form of soothsaying called forecasting trends. But in looking at the figures or the charts made from them, it is necessary to remember one thing constantly: The trend to now may be a fact, but the future trend represents no more than an educated guess. Implicit in it is 'everything else being equal' and 'present trends continuing'. And somehow everything else refuses to remain equal." (Darrell Huff, "How to Lie with Statistics", 1954)

"When numbers in tabular form are taboo and words will not do the work well, as is often the case, there is one answer left: Draw a picture. About the simplest kind of statistical picture, or graph, is the line variety. It is very useful for showing trends, something practically everybody is interested in showing or knowing about or spotting or deploring or forecasting." (Darrell Huff, "How to Lie with Statistics", 1954)

"The moment you forecast you know you’re going to be wrong, you just don’t know when and in which direction." (Edgar R Fiedler, 1977)

"Many of the basic functions performed by neural networks are mirrored by human abilities. These include making distinctions between items (classification), dividing similar things into groups (clustering), associating two or more things (associative memory), learning to predict outcomes based on examples (modeling), being able to predict into the future (time-series forecasting), and finally juggling multiple goals and coming up with a good-enough solution (constraint satisfaction)." (Joseph P Bigus, "Data Mining with Neural Networks: Solving business problems from application development to decision support", 1996)

"Probability theory is a serious instrument for forecasting, but the devil, as they say, is in the details - in the quality of information that forms the basis of probability estimates." (Peter L Bernstein, "Against the Gods: The Remarkable Story of Risk", 1996)

"Under conditions of uncertainty, both rationality and measurement are essential to decision-making. Rational people process information objectively: whatever errors they make in forecasting the future are random errors rather than the result of a stubborn bias toward either optimism or pessimism. They respond to new information on the basis of a clearly defined set of preferences. They know what they want, and they use the information in ways that support their preferences." (Peter L Bernstein, "Against the Gods: The Remarkable Story of Risk", 1996)

"Time-series forecasting is essentially a form of extrapolation in that it involves fitting a model to a set of data and then using that model outside the range of data to which it has been fitted. Extrapolation is rightly regarded with disfavour in other statistical areas, such as regression analysis. However, when forecasting the future of a time series, extrapolation is unavoidable." (Chris Chatfield, "Time-Series Forecasting" 2nd Ed, 2000)

"Models can be viewed and used at three levels. The first is a model that fits the data. A test of goodness-of-fit operates at this level. This level is the least useful but is frequently the one at which statisticians and researchers stop. For example, a test of a linear model is judged good when a quadratic term is not significant. A second level of usefulness is that the model predicts future observations. Such a model has been called a forecast model. This level is often required in screening studies or studies predicting outcomes such as growth rate. A third level is that a model reveals unexpected features of the situation being described, a structural model, [...] However, it does not explain the data." (Gerald van Belle, "Statistical Rules of Thumb", 2002)

"Most long-range forecasts of what is technically feasible in future time periods dramatically underestimate the power of future developments because they are based on what I call the 'intuitive linear' view of history rather than the 'historical exponential' view." (Ray Kurzweil, "The Singularity is Near", 2005)

"A forecaster should almost never ignore data, especially when she is studying rare events […]. Ignoring data is often a tip-off that the forecaster is overconfident, or is overfitting her model - that she is interested in showing off rather than trying to be accurate." (Nate Silver, "The Signal and the Noise: Why So Many Predictions Fail-but Some Don't", 2012)

"Whether information comes in a quantitative or qualitative flavor is not as important as how you use it. [...] The key to making a good forecast […] is not in limiting yourself to quantitative information. Rather, it’s having a good process for weighing the information appropriately. […] collect as much information as possible, but then be as rigorous and disciplined as possible when analyzing it. [...] Many times, in fact, it is possible to translate qualitative information into quantitative information." (Nate Silver, "The Signal and the Noise: Why So Many Predictions Fail-but Some Don't", 2012)

"In common usage, prediction means to forecast a future event. In data science, prediction more generally means to estimate an unknown value. This value could be something in the future (in common usage, true prediction), but it could also be something in the present or in the past. Indeed, since data mining usually deals with historical data, models very often are built and tested using events from the past." (Foster Provost & Tom Fawcett, "Data Science for Business", 2013)

"Using random processes in our models allows economists to capture the variability of time series data, but it also poses challenges to model builders. As model builders, we must understand the uncertainty from two different perspectives. Consider first that of the econometrician, standing outside an economic model, who must assess its congruence with reality, inclusive of its random perturbations. An econometrician’s role is to choose among different parameters that together describe a family of possible models to best mimic measured real world time series and to test the implications of these models. I refer to this as outside uncertainty. Second, agents inside our model, be it consumers, entrepreneurs, or policy makers, must also confront uncertainty as they make decisions. I refer to this as inside uncertainty, as it pertains to the decision-makers within the model. What do these agents know? From what information can they learn? With how much confidence do they forecast the future? The modeler’s choice regarding insiders’ perspectives on an uncertain future can have significant consequences for each model’s equilibrium outcomes." (Lars P Hansen, "Uncertainty Outside and Inside Economic Models", [Nobel lecture] 2013)

"One important thing to bear in mind about the outputs of data science and analytics is that in the vast majority of cases they do not uncover hidden patterns or relationships as if by magic, and in the case of predictive analytics they do not tell us exactly what will happen in the future. Instead, they enable us to forecast what may come. In other words, once we have carried out some modelling there is still a lot of work to do to make sense out of the results obtained, taking into account the constraints and assumptions in the model, as well as considering what an acceptable level of reliability is in each scenario." (Jesús Rogel-Salazar, "Data Science and Analytics with Python", 2017)

"Regression describes the relationship between an exploratory variable (i.e., independent) and a response variable (i.e., dependent). Exploratory variables are also referred to as predictors and can have a frequency of more than 1. Regression is being used within the realm of predictions and forecasting. Regression determines the change in response variable when one exploratory variable is varied while the other independent variables are kept constant. This is done to understand the relationship that each of those exploratory variables exhibits." (Danish Haroon, "Python Machine Learning Case Studies", 2017)

"The first myth is that prediction is always based on time-series extrapolation into the future (also known as forecasting). This is not the case: predictive analytics can be applied to generate any type of unknown data, including past and present. In addition, prediction can be applied to non-temporal (time-based) use cases such as disease progression modeling, human relationship modeling, and sentiment analysis for medication adherence, etc. The second myth is that predictive analytics is a guarantor of what will happen in the future. This also is not the case: predictive analytics, due to the nature of the insights they create, are probabilistic and not deterministic. As a result, predictive analytics will not be able to ensure certainty of outcomes." (Prashant Natarajan et al, "Demystifying Big Data and Machine Learning for Healthcare", 2017)

"We know what forecasting is: you start in the present and try to look into the future and imagine what it will be like. Backcasting is the opposite: you state your desired vision of the future as if it’s already happened, and then work backward to imagine the practices, policies, programs, tools, training, and people who worked in concert in a hypothetical past (which takes place in the future) to get you there." (Eben Hewitt, "Technology Strategy Patterns: Architecture as strategy" 2nd Ed., 2019)

"Ideally, a decision maker or a forecaster will combine the outside view and the inside view - or, similarly, statistics plus personal experience. But it’s much better to start with the statistical view, the outside view, and then modify it in the light of personal experience than it is to go the other way around. If you start with the inside view you have no real frame of reference, no sense of scale - and can easily come up with a probability that is ten times too large, or ten times too small." (Tim Harford, "The Data Detective: Ten easy rules to make sense of statistics", 2020)
