14 November 2018

🔭Data Science: Data Exploration (Just the Quotes)

"Exploratory data analysis can never be the whole story, but nothing else can serve as the foundation stone – as the first step." (John W Tukey, "Exploratory Data Analysis", 1977)

"Modern data graphics can do much more than simply substitute for small statistical tables. At their best, graphics are instruments for reasoning about quantitative information. Often the most effective way to describe, explore, and summarize a set of numbers even a very large set - is to look at pictures of those numbers. Furthermore, of all methods for analyzing and communicating statistical information, well-designed data graphics are usually the simplest and at the same time the most powerful." (Edward R Tufte, "The Visual Display of Quantitative Information", 1983)

"Unless exploratory data analysis uncovers indications, usually quantitative ones, there is likely to nothing for confirmatory data analysis to consider." (John W Tukey, "Exploratory Data Analysis", 1977)

"Many of the applications of visualization in this book give the impression that data analysis consists of an orderly progression of exploratory graphs, fitting, and visualization of fits and residuals. Coherence of discussion and limited space necessitate a presentation that appears to imply this. Real life is usually quite different. There are blind alleys. There are mistaken actions. There are effects missed until the very end when some visualization saves the day. And worse, there is the possibility of the nearly unmentionable: missed effects." (William S Cleveland, "Visualizing Data", 1993)

"The scatterplot is a useful exploratory method for providing a first look at bivariate data to see how they are distributed throughout the plane, for example, to see clusters of points, outliers, and so forth." (William S Cleveland, "Visualizing Data", 1993)

"The science of statistics may be described as exploring, analyzing and summarizing data; designing or choosing appropriate ways of collecting data and extracting information from them; and communicating that information. Statistics also involves constructing and testing models for describing chance phenomena. These models can be used as a basis for making inferences and drawing conclusions and, finally, perhaps for making decisions." (Fergus Daly et al, "Elements of Statistics", 1995)

"Compound errors can begin with any of the standard sorts of bad statistics - a guess, a poor sample, an inadvertent transformation, perhaps confusion over the meaning of a complex statistic. People inevitably want to put statistics to use, to explore a number's implications. [...] The strengths and weaknesses of those original numbers should affect our confidence in the second-generation statistics." (Joel Best, "Damned Lies and Statistics: Untangling Numbers from the Media, Politicians, and Activists", 2001)

"Data mining is more of an art than a science. No one can tell you exactly how to choose columns to include in your data mining models. There are no hard and fast rules you can follow in deciding which columns either help or hinder the final model. For this reason, it is important that you understand how the data behaves before beginning to mine it. The best way to achieve this level of understanding is to see how the data is distributed across columns and how the different columns relate to one another. This is the process of exploring the data." (Seth Paul et al. "Preparing and Mining Data with Microsoft SQL Server 2000 and Analysis", 2002)

"Every statistical analysis is an interpretation of the data, and missingness affects the interpretation. The challenge is that when the reasons for the missingness cannot be determined there is basically no way to make appropriate statistical adjustments. Sensitivity analyses are designed to model and explore a reasonable range of explanations in order to assess the robustness of the results." (Gerald van Belle, "Statistical Rules of Thumb", 2002)

"Since the aim of exploratory data analysis is to learn what seems to be, it should be no surprise that pictures play a vital role in doing it well." (John W. Tukey, "John W Tukey’s Works on Interactive Graphics", The Annals of Statistics Vol. 30 (6), 2002)

"If we attempt to map the world of a story before we explore it, we are likely either to (a) prematurely limit our exploration, so as to reduce the amount of material we need to consider, or (b) explore at length but, recognizing the impossibility of taking note of everything, and having no sound basis for choosing what to include, arbitrarily omit entire realms of information. The opportunities are overwhelming." (Peter Turchi, "Maps of the Imagination: The writer as cartographer", 2004)

"Exploratory Data Analysis is more than just a collection of data-analysis techniques; it provides a philosophy of how to dissect a data set. It stresses the power of visualisation and aspects such as what to look for, how to look for it and how to interpret the information it contains. Most EDA techniques are graphical in nature, because the main aim of EDA is to explore data in an open-minded way. Using graphics, rather than calculations, keeps open possibilities of spotting interesting patterns or anomalies that would not be apparent with a calculation (where assumptions and decisions about the nature of the data tend to be made in advance)." (Alan Graham, "Developing Thinking in Statistics", 2006)

"There are two main reasons for using graphic displays of datasets: either to present or to explore data. Presenting data involves deciding what information you want to convey and drawing a display appropriate for the content and for the intended audience. [...] Exploring data is a much more individual matter, using graphics to find information and to generate ideas. Many displays may be drawn. They can be changed at will or discarded and new versions prepared, so generally no one plot is especially important, and they all have a short life span." (Antony Unwin, "Good Graphics?" [in "Handbook of Data Visualization"], 2008)

"All graphics present data and allow a certain degree of exploration of those same data. Some graphics are almost all presentation, so they allow just a limited amount of exploration; hence we can say they are more infographics than visualization, whereas others are mostly about letting readers play with what is being shown, tilting more to the visualization side of our linear scale. But every infographic and every visualization has a presentation and an exploration component: they present, but they also facilitate the analysis of what they show, to different degrees." (Alberto Cairo, "The Functional Art", 2011)

"But if you don’t present your data to readers so they can see it, read it, explore it, and analyze it, why would they trust you?" (Alberto Cairo, "The Functional Art", 2011)

"Data scientists combine entrepreneurship with patience, the willingness to build data products incrementally, the ability to explore, and the ability to iterate over a solution. They are inherently interdisciplinary. They can tackle all aspects of a problem, from initial data collection and data conditioning to drawing conclusions. They can think outside the box to come up with new ways to view the problem, or to work with very broadly defined problems: 'there’s a lot of data, what can you make from it?'" (Mike Loukides, "What Is Data Science?", 2011)

"Don’t rush to write a headline or an entire story or to design a visualization immediately after you find an interesting pattern, data point, or fact. Stop and think. Look for other sources and for people who can help you escape from tunnel vision and confirmation bias. Explore your information at multiple levels of depth and breadth, looking for extraneous factors that may help explain your findings. Only then can you make a decision about what to say, and how to say it, and about what amount of detail you need to show to be true to the data." (Alberto Cairo, "The Functional Art", 2011)

"The process of visually exploring data can be summarized in a single sentence: find patterns and trends lurking in the data and then observe the deviations from those patterns. Interesting stories may arise from both the norm - also called the smooth - and the exceptions." (Alberto Cairo, "The Functional Art", 2011)

"A viewer’s eye must be guided to 'read' the elements in a logical order. The design of an exploratory graphic needs to allow for the additional component of discovery - guiding the viewer to first understand the overall concept and then engage her to further explore the supporting information." (Felice C Frankel & Angela H DePace, "Visual Strategies", 2012)

"The process of visual analysis can potentially go on endlessly, with seemingly infinite combinations of variables to explore, especially with the rich opportunities bigger data sets give us. However, by deploying a disciplined and sensible balance between deductive and inductive enquiry you should be able to efficiently and effectively navigate towards the source of the most compelling stories." (Andy Kirk, "Data Visualization: A successful design process", 2012)

"Data mining is a craft. As with many crafts, there is a well-defined process that can help to increase the likelihood of a successful result. This process is a crucial conceptual tool for thinking about data science projects. [...] data mining is an exploratory undertaking closer to research and development than it is to engineering." (Foster Provost, "Data Science for Business", 2013)

"Early exploration of a dataset can be overwhelming, because you don’t know where to start. Ask questions about the data and let your curiosities guide you. […] Make multiple charts, compare all your variables, and see if there are interesting bits that are worth a closer look. Look at your data as a whole and then zoom in on categories and individual data points. […] Subcategories, the categories within categories (within categories), are often more revealing than the main categories. As you drill down, there can be higher variability and more interesting things to see." (Nathan Yau, "Data Points: Visualization That Means Something", 2013)

"Good visualization is a winding process that requires statistics and design knowledge. Without the former, the visualization becomes an exercise only in illustration and aesthetics, and without the latter, one of only analyses. On their own, these are fine skills, but they make for incomplete data graphics. Having skills in both provides you with the luxury - which is growing into a necessity - to jump back and forth between data exploration and storytelling." (Nathan Yau, "Data Points: Visualization That Means Something", 2013)

"Visualization can be appreciated purely from an aesthetic point of view, but it’s most interesting when it’s about data that’s worth looking at. That’s why you start with data, explore it, and then show results rather than start with a visual and try to squeeze a dataset into it. It’s like trying to use a hammer to bang in a bunch of screws. […] Aesthetics isn’t just a shiny veneer that you slap on at the last minute. It represents the thought you put into a visualization, which is tightly coupled with clarity and affects interpretation." (Nathan Yau, "Data Points: Visualization That Means Something", 2013)

"Exploratory analysis is what you do to understand the data and figure out what might be noteworthy or interesting to highlight to others." (Cole N Knaflic, "Storytelling with Data: A Data Visualization Guide for Business Professionals", 2015)

"Highlighting one aspect can make other things harder to see one word of warning in using preattentive attributes: when you highlight one point in your story, it can actually make other points harder to see. When you’re doing exploratory analysis, you should mostly avoid the use of preattentive attributes for this reason. When it comes to explanatory analysis, however, you should have a specific story you are communicating to your audience. Leverage preattentive attributes to help make that story visually clear." (Cole N Knaflic, "Storytelling with Data: A Data Visualization Guide for Business Professionals", 2015)

"Exploratory data analysis is the search for patterns and trends in a given data set. Visualization techniques play an important part in this quest. Looking carefully at your data is important for several reasons, including identifying mistakes in collection/processing, finding violations of statistical assumptions, and suggesting interesting hypotheses." (Steven S Skiena, "The Data Science Design Manual", 2017)

"Exploring data generates hypotheses about patterns in our data. The visualizations and tools of dynamic interactive graphics ease and improve the exploration, helping us to 'see what our data seem to say'." (Forrest W Young et al, "Visual Statistics: Seeing data with dynamic interactive graphics", 2016)

"With time series though, there is absolutely no substitute for plotting. The pertinent pattern might end up being a sharp spike followed by a gentle taper down. Or, maybe there are weird plateaus. There could be noisy spikes that have to be filtered out. A good way to look at it is this: means and standard deviations are based on the naïve assumption that data follows pretty bell curves, but there is no corresponding 'default' assumption for time series data (at least, not one that works well with any frequency), so you always have to look at the data to get a sense of what’s normal. [...] Along the lines of figuring out what patterns to expect, when you are exploring time series data, it is immensely useful to be able to zoom in and out." (Field Cady, "The Data Science Handbook", 2017)

"Dashboards are a type of multiform visualization used to summarize and monitor data. These are most useful when proxies have been well validated and the task is well understood. This design pattern brings a number of carefully selected attributes together for fast, and often continuous, monitoring - dashboards are often linked to updating data streams. While many allow interactivity for further investigation, they typically do not depend on it. Dashboards are often used for presenting and monitoring data and are typically designed for at-a-glance analysis rather than deep exploration and analysis." (Danyel Fisher & Miriah Meyer, "Making Data Visual", 2018)

"Models are formal structures represented in mathematics and diagrams that help us to understand the world. Mastery of models improves your ability to reason, explain, design, communicate, act, predict, and explore." (Scott E Page, "The Model Thinker", 2018)

"[…] the data itself can lead to new questions too. In exploratory data analysis (EDA), for example, the data analyst discovers new questions based on the data. The process of looking at the data to address some of these questions generates incidental visualizations - odd patterns, outliers, or surprising correlations that are worth looking into further." (Danyel Fisher & Miriah Meyer, "Making Data Visual", 2018)

"Analysis is a two-step process that has an exploratory and an explanatory phase. In order to create a powerful data story, you must effectively transition from data discovery (when you’re finding insights) to data communication (when you’re explaining them to an audience). If you don’t properly traverse these two phases, you may end up with something that resembles a data story but doesn’t have the same effect. Yes, it may have numbers, charts, and annotations, but because it’s poorly formed, it won’t achieve the same results." (Brent Dykes, "Effective Data Storytelling: How to Drive Change with Data, Narrative and Visuals", 2019)

13 November 2018

🔭Data Science: Symmetry (Just the Quotes)

"The framing of hypotheses is, for the enquirer after truth, not the end, but the beginning of his work. Each of his systems is invented, not that he may admire it and follow it into all its consistent consequences, but that he may make it the occasion of a course of active experiment and observation. And if the results of this process contradict his fundamental assumptions, however ingenious, however symmetrical, however elegant his system may be, he rejects it without hesitation. He allows no natural yearning for the offspring of his own mind to draw him aside from the higher duty of loyalty to his sovereign, Truth, to her he not only gives his affections and his wishes, but strenuous labour and scrupulous minuteness of attention." (William Whewell, "Philosophy of the Inductive Sciences" Vol. 2, 1847)

"Rule 2. Any summary of a distribution of numbers in terms of symmetric functions should not give an objective degree of belief in any one of the inferences or predictions to be made therefrom that would cause human action significantly different from what this action would be if the original distributions had been taken as evidence." (Walter A Shewhart, "Economic Control of Quality of Manufactured Product", 1931)

"Logging size transforms the original skewed distribution into a more symmetrical one by pulling in the long right tail of the distribution toward the mean. The short left tail is, in addition, stretched. The shift toward symmetrical distribution produced by the log transform is not, of course, merely for convenience. Symmetrical distributions, especially those that resemble the normal distribution, fulfill statistical assumptions that form the basis of statistical significance testing in the regression model." (Edward R Tufte, "Data Analysis for Politics and Policy", 1974)

"Symmetry is also important because it can simplify our thinking about the distribution of a set of data. If we can establish that the data are (approximately) symmetric, then we no longer need to describe the  shapes of both the right and left halves. (We might even combine the information from the two sides and have effectively twice as much data for viewing the distributional shape.) Finally, symmetry is important because many statistical procedures are designed for, and work best on, symmetric data." (John M Chambers et al, "Graphical Methods for Data Analysis", 1983)

"There are several reasons why symmetry is an important concept in data analysis. First, the most important single summary of a set of data is the location of the center, and when data meaning of 'center' is unambiguous. We can take center to mean any of the following things, since they all coincide exactly for symmetric data, and they are together for nearly symmetric data: (l) the Center Of symmetry. (2) the arithmetic average or center Of gravity, (3) the median or 50%. Furthermore, if data a single point of highest concentration instead of several (that is, they are unimodal), then we can add to the list (4) point of highest concentration. When data are far from symmetric, we may have trouble even agreeing on what we mean by center; in fact, the center may become an inappropriate summary for the data." (John M Chambers et al, "Graphical Methods for Data Analysis", 1983)

"If a distribution were perfectly symmetrical, all symmetry-plot points would be on the diagonal line. Off-line points indicate asymmetry. Points fall above the line when distance above the median is greater than corresponding distance below the median. A consistent run of above-the-line points indicates positive skew; a run of below-the-line points indicates negative skew." (Lawrence C Hamilton, "Regression with Graphics: A second course in applied statistics", 1991)

"Remember that normality and symmetry are not the same thing. All normal distributions are symmetrical, but not all symmetrical distributions are normal. With water use we were able to transform the distribution to be approximately symmetrical and normal, but often symmetry is the most we can hope for. For practical purposes, symmetry (with no severe outliers) may be sufficient. Transformations are not a magic wand, however. Many distributions cannot even be made symmetrical." (Lawrence C Hamilton, "Regression with Graphics: A second course in applied statistics", 1991)

"Chaos demonstrates that deterministic causes can have random effects […] There's a similar surprise regarding symmetry: symmetric causes can have asymmetric effects. […] This paradox, that symmetry can get lost between cause and effect, is called symmetry-breaking. […] From the smallest scales to the largest, many of nature's patterns are a result of broken symmetry; […]" (Ian Stewart & Martin Golubitsky, "Fearful Symmetry: Is God a Geometer?", 1992)

"In everyday language, the words 'pattern' and 'symmetry' are used almost interchangeably, to indicate a property possessed by a regular arrangement of more-or-less identical units […]” (Ian Stewart & Martin Golubitsky, "Fearful Symmetry: Is God a Geometer?", 1992)

"Nature behaves in ways that look mathematical, but nature is not the same as mathematics. Every mathematical model makes simplifying assumptions; its conclusions are only as valid as those assumptions. The assumption of perfect symmetry is excellent as a technique for deducing the conditions under which symmetry-breaking is going to occur, the general form of the result, and the range of possible behaviour. To deduce exactly which effect is selected from this range in a practical situation, we have to know which imperfections are present." (Ian Stewart & Martin Golubitsky, "Fearful Symmetry", 1992)

"Data that are skewed toward large values occur commonly. Any set of positive measurements is a candidate. Nature just works like that. In fact, if data consisting of positive numbers range over several powers of ten, it is almost a guarantee that they will be skewed. Skewness creates many problems. There are visualization problems. A large fraction of the data are squashed into small regions of graphs, and visual assessment of the data degrades. There are characterization problems. Skewed distributions tend to be more complicated than symmetric ones; for example, there is no unique notion of location and the median and mean measure different aspects of the distribution. There are problems in carrying out probabilistic methods. The distribution of skewed data is not well approximated by the normal, so the many probabilistic methods based on an assumption of a normal distribution cannot be applied." (William S Cleveland, "Visualizing Data", 1993)

"A normal distribution is most unlikely, although not impossible, when the observations are dependent upon one another - that is, when the probability of one event is determined by a preceding event. The observations will fail to distribute themselves symmetrically around the mean." (Peter L Bernstein, "Against the Gods: The Remarkable Story of Risk", 1996)

"Symmetry is basically a geometrical concept. Mathematically it can be defined as the invariance of geometrical patterns under certain operations. But when abstracted, the concept applies to all sorts of situations. It is one of the ways by which the human mind recognizes order in nature. In this sense symmetry need not be perfect to be meaningful. Even an approximate symmetry attracts one's attention, and makes one wonder if there is some deep reason behind it." (Eguchi Tohru & ‎K Nishijima ," Broken Symmetry: Selected Papers Of Y Nambu", 1995)

"How deep truths can be defined as invariants – things that do not change no matter what; how invariants are defined by symmetries, which in turn define which properties of nature are conserved, no matter what. These are the selfsame symmetries that appeal to the senses in art and music and natural forms like snowflakes and galaxies. The fundamental truths are based on symmetry, and there’s a deep kind of beauty in that." (K C Cole, "The Universe and the Teacup: The Mathematics of Truth and Beauty", 1997)

"Symmetry and skewness can be judged, but boxplots are not entirely useful for judging shape. It is not possible to use a boxplot to judge whether or not a dataset is bell-shaped, nor is it possible to judge whether or not a dataset may be bimodal." (Jessica M Utts & Robert F Heckard, "Mind on Statistics", 2007)

"The concept of symmetry (invariance) with its rigorous mathematical formulation and generalization has guided us to know the most fundamental of physical laws. Symmetry as a concept has helped mankind not only to define ‘beauty’ but also to express the ‘truth’. Physical laws tries to quantify the truth that appears to be ‘transient’ at the level of phenomena but symmetry promotes that truth to the level of ‘eternity’." (Vladimir G Ivancevic & Tijana T Ivancevic,"Quantum Leap", 2008)

"The concept of symmetry is used widely in physics. If the laws that determine relations between physical magnitudes and a change of these magnitudes in the course of time do not vary at the definite operations (transformations), they say, that these laws have symmetry (or they are invariant) with respect to the given transformations. For example, the law of gravitation is valid for any points of space, that is, this law is in variant with respect to the system of coordinates." (Alexey Stakhov et al, "The Mathematics of Harmony", 2009)

"A pattern is a design or model that helps grasp something. Patterns help connect things that may not appear to be connected. Patterns help cut through complexity and reveal simpler understandable trends. […] Patterns can be temporal, which is something that regularly occurs over time. Patterns can also be spatial, such as things being organized in a certain way. Patterns can be functional, in that doing certain things leads to certain effects. Good patterns are often symmetric. They echo basic structures and patterns that we are already aware of." (Anil K. Maheshwari, "Business Intelligence and Data Mining", 2015)

"One kind of probability - classic probability - is based on the idea of symmetry and equal likelihood […] In the classic case, we know the parameters of the system and thus can calculate the probabilities for the events each system will generate. […] A second kind of probability arises because in daily life we often want to know something about the likelihood of other events occurring […]. In this second case, we need to estimate the parameters of the system because we don’t know what those parameters are. […] A third kind of probability differs from these first two because it’s not obtained from an experiment or a replicable event - rather, it expresses an opinion or degree of belief about how likely a particular event is to occur. This is called subjective probability […]." (Daniel J Levitin, "Weaponized Lies", 2017)

"Variables which follow symmetric, bell-shaped distributions tend to be nice as features in models. They show substantial variation, so they can be used to discriminate between things, but not over such a wide range that outliers are overwhelming." (Steven S Skiena, "The Data Science Design Manual", 2017)

"Data analysis and data mining are concerned with unsupervised pattern finding and structure determination in data sets. The data sets themselves are explicitly linked as a form of representation to an observational or otherwise empirical domain of interest. 'Structure' has long been understood as symmetry which can take many forms with respect to any transformation, including point, translational, rotational, and many others. Symmetries directly point to invariants, which pinpoint intrinsic properties of the data and of the background empirical domain of interest. As our data models change, so too do our perspectives on analysing data." (Fionn Murtagh, "Data Science Foundations: Geometry and Topology of Complex Hierarchic Systems and Big Data Analytics", 2018)

"It is not enough to give a single summary for a distribution - we need to have an idea of the spread, sometimes known as the variability. [...] The range is a natural choice, but is clearly very sensitive to extreme values [...] In contrast the inter-quartile range (IQR) is unaffected by extremes. This is the distance between the 25th and 75th percentiles of the data and so contains the ‘central half’ of the numbers [...] Finally the standard deviation is a widely used measure of spread. It is the most technically complex measure, but is only really appropriate for well-behaved symmetric data since it is also unduly influenced by outlying values." (David Spiegelhalter, "The Art of Statistics: Learning from Data", 2019)

"Many statistical procedures perform more effectively on data that are normally distributed, or at least are symmetric and not excessively kurtotic (fat-tailed), and where the mean and variance are approximately constant. Observed time series frequently require some form of transformation before they exhibit these distributional properties, for in their 'raw' form they are often asymmetric." (Terence C Mills, "Applied Time Series Analysis: A practical guide to modeling and forecasting", 2019)

"Mean-averages can be highly misleading when the raw data do not form a symmetric pattern around a central value but instead are skewed towards one side [...], typically with a large group of standard cases but with a tail of a few either very high (for example, income) or low (for example, legs) values." (David Spiegelhalter, "The Art of Statistics: Learning from Data", 2019)

🔭Data Science: Definitions (Just the Quotes)

"The errors of definitions multiply themselves according as the reckoning proceeds; and lead men into absurdities, which at last they see but cannot avoid, without reckoning anew from the beginning." (Thomas Hobbes, "The Moral and Political Works of Thomas Hobbes of Malmesbury", 1750)

"It is the essence of a scientific definition to be causative, not by introduction of imaginary somewhats, natural or supernatural, under the name of causes, but by announcing the law of action in the particular case, in subordination to the common law of which all the phenomena are modifications or results." (Samuel T Coleridge, "Hints Towards the Formation of a More Comprehensive Theory of Life, The Nature of Life", 1847)

"The dimmed outlines of phenomenal things all merge into one another unless we put on the focusing-glass of theory, and screw it up sometimes to one pitch of definition and sometimes to another, so as to see down into different depths through the great millstone of the world." (James C Maxwell, "Are There Real Analogies in Nature?", 1856)

"Being built on concepts, hypotheses, and experiments, laws are no more accurate or trustworthy than the wording of the definitions and the accuracy and extent of the supporting experiments." (Gerald Holton, "Introduction to Concepts and Theories in Physical Science", 1952)

"We cannot define truth in science until we move from fact to law. And within the body of laws in turn, what impresses us as truth is the orderly coherence of the pieces. They fit together like the characters of a great novel, or like the words of a poem. Indeed, we should keep that last analogy by us always, for science is a language, and like a language it defines its parts by the way they make up a meaning. Every word in a sentence has some uncertainty of definition, and yet the sentence defines its own meaning and that of its words conclusively. It is the internal unity and coherence of science which gives it truth, and which makes it a better system of prediction than any less orderly language." (Jacob Bronowski, "The Common Sense of Science", 1953)

"Scientific method is the way to truth, but it affords, even in principle, no unique definition of truth. Any so-called pragmatic definition of truth is doomed to failure equally." (Willard v O Quine, "Word and Object", 1960)

"This other world is the so-called physical world image; it is merely an intellectual structure. To a certain extent it is arbitrary. It is a kind of model or idealization created in order to avoid the inaccuracy inherent in every measurement and to facilitate exact definition." (Max Planck, "The Philosophy of Physics", 1963)

"The assumptions and definitions of mathematics and science come from our intuition, which is based ultimately on experience. They then get shaped by further experience in using them and are occasionally revised. They are not fixed for all eternity." (Richard Hamming, "Methods of Mathematics Applied to Calculus, Probability, and Statistics", 1985)

"First, good statistics are based on more than guessing. [...] Second, good statistics are based on clear, reasonable definitions. Remember, every statistic has to define its subject. Those definitions ought to be clear and made public. [...] Third, good statistics are based on clear, reasonable measures. Again, every statistic involves some sort of measurement; while all measures are imperfect, not all flaws are equally serious. [...] Finally, good statistics are based on good samples." (Joel Best, "Damned Lies and Statistics: Untangling Numbers from the Media, Politicians, and Activists", 2001)

"While some social problems statistics are deliberate deceptions, many - probably the great majority - of bad statistics are the result of confusion, incompetence, innumeracy, or selective, self-righteous efforts to produce numbers that reaffirm principles and interests that their advocates consider just and right. The best response to stat wars is not to try and guess who's lying or, worse, simply to assume that the people we disagree with are the ones telling lies. Rather, we need to watch for the standard causes of bad statistics - guessing, questionable definitions or methods, mutant numbers, and inappropriate comparisons." (Joel Best, "Damned Lies and Statistics: Untangling Numbers from the Media, Politicians, and Activists", 2001)

"The goal of data science is to improve decision making by basing decisions on insights extracted from large data sets. As a field of activity, data science encompasses a set of principles, problem definitions, algorithms, and processes for extracting nonobvious and useful patterns from large data sets. It is closely related to the fields of data mining and machine learning, but it is broader in scope." (John D Kelleher & Brendan Tierney, "Data Science", 2018)

"Numbers can easily confuse us when they are unmoored from a clear definition." (Tim Harford, "The Data Detective: Ten easy rules to make sense of statistics", 2020)

"The whole discipline of statistics is built on measuring or counting things. […] it is important to understand what is being measured or counted, and how. It is surprising how rarely we do this. Over the years, as I found myself trying to lead people out of statistical mazes week after week, I came to realize that many of the problems I encountered were because people had taken a wrong turn right at the start. They had dived into the mathematics of a statistical claim - asking about sampling errors and margins of error, debating if the number is rising or falling, believing, doubting, analyzing, dissecting - without taking the ti- me to understand the first and most obvious fact: What is being measured, or counted? What definition is being used?" (Tim Harford, "The Data Detective: Ten easy rules to make sense of statistics", 2020)

More quotes on "Definitions" at the-web-of-knowledge.blogspot.com

🔭Data Science: Training (Just the Quotes)

"A neural network is characterized by (1) its pattern of connections between the neurons (called its architecture), (2) its method of determining the weights on the connections (called its training, or learning, algorithm), and (3) its activation function." (Laurene Fausett, "Fundamentals of Neural Networks", 1994)

"An artificial neural network (or simply a neural network) is a biologically inspired computational model that consists of processing elements (neurons) and connections between them, as well as of training and recall algorithms." (Nikola K Kasabov, "Foundations of Neural Networks, Fuzzy Systems, and Knowledge Engineering", 1996)

"[…] an obvious difference between our best classifiers and human learning is the number of examples required in tasks such as object detection. […] the difficulty of a learning task depends on the size of the required hypothesis space. This complexity determines in turn how many training examples are needed to achieve a given level of generalization error. Thus the complexity of the hypothesis space sets the speed limit and the sample complexity for learning." (Tomaso Poggio & Steve Smale, "The Mathematics of Learning: Dealing with Data", Notices of the AMS, 2003)

"Learning a complicated function that matches the training data closely but fails to recognize the underlying process that generates the data. As a result of overfitting, the model performs poor on new input. Overfitting occurs when the training patterns are sparse in input space and/or the trained networks are too complex." (Frank Padberg, "Counting the Hidden Defects in Software Documents", 2010)

"Decision trees are also considered nonparametric models. The reason for this is that when we train a decision tree from data, we do not assume a fixed set of parameters prior to training that define the tree. Instead, the tree branching and the depth of the tree are related to the complexity of the dataset it is trained on. If new instances were added to the dataset and we rebuilt the tree, it is likely that we would end up with a (potentially very) different tree." (John D Kelleher et al, "Fundamentals of Machine Learning for Predictive Data Analytics: Algorithms, Worked Examples, and Case Studies", 2015)

"Boosting defines an objective function to measure the performance of a model given a certain set of parameters. The objective function contains two parts: regularization and training loss, both of which add to one another. The training loss measures how predictive our model is on the training data. The most commonly used training loss function includes mean squared error and logistic regression. The regularization term controls the complexity of the model, which helps avoid overfitting." (Danish Haroon, "Python Machine Learning Case Studies", 2017)

"Decision trees are important for a few reasons. First, they can both classify and regress. It requires literally one line of code to switch between the two models just described, from a classification to a regression. Second, they are able to determine and share the feature importance of a given training set." (Russell Jurney, "Agile Data Science 2.0: Building Full-Stack Data Analytics Applications with Spark", 2017)

"Early stopping and regularization can ensure network generalization when you apply them properly. [...] With early stopping, the choice of the validation set is also important. The validation set should be representative of all points in the training set. When you use Bayesian regularization, it is important to train the network until it reaches convergence. The sum-squared error, the sum-squared weights, and the effective number of parameters should reach constant values when the network has converged. With both early stopping and regularization, it is a good idea to train the network starting from several different initial conditions. It is possible for either method to fail in certain circumstances. By testing several different initial conditions, you can verify robust network performance." (Mark H Beale et al, "Neural Network Toolbox™ User's Guide", 2017)

"Variance is a prediction error due to different sets of training samples. Ideally, the error should not vary from one training sample to another sample, and the model should be stable enough to handle hidden variations between input and output variables. Normally this occurs with the overfitted model." (Umesh R Hodeghatta & Umesha Nayak, "Business Analytics Using R: A Practical Approach", 2017)

"One of the most common problems that you will encounter when training deep neural networks will be overfitting. What can happen is that your network may, owing to its flexibility, learn patterns that are due to noise, errors, or simply wrong data. [...] The essence of overfitting is to have unknowingly extracted some of the residual variation (i.e., the noise) as if that variation represented the underlying model structure. The opposite is called underfitting - when the model cannot capture the structure of the data." (Umberto Michelucci, "Applied Deep Learning: A Case-Based Approach to Understanding Deep Neural Networks", 2018)

"The premise of classification is simple: given a categorical target variable, learn patterns that exist between instances composed of independent variables and their relationship to the target. Because the target is given ahead of time, classification is said to be supervised machine learning because a model can be trained to minimize error between predicted and actual categories in the training data. Once a classification model is fit, it assigns categorical labels to new instances based on the patterns detected during training." (Benjamin Bengfort et al, "Applied Text Analysis with Python: Enabling Language-Aware Data Products with Machine Learning", 2018)

"The trick is to walk the line between underfitting and overfitting. An underfit model has low variance, generally making the same predictions every time, but with extremely high bias, because the model deviates from the correct answer by a significant amount. Underfitting is symptomatic of not having enough data points, or not training a complex enough model. An overfit model, on the other hand, has memorized the training data and is completely accurate on data it has seen before, but varies widely on unseen data. Neither an overfit nor underfit model is generalizable - that is, able to make meaningful predictions on unseen data." (Benjamin Bengfort et al, "Applied Text Analysis with Python: Enabling Language-Aware Data Products with Machine Learning", 2018)

"There is a trade-off between bias and variance [...]. Complexity increases with the number of features, parameters, depth, training epochs, etc. As complexity increases and the model overfits, the error on the training data decreases, but the error on test data increases, meaning that the model is less generalizable." (Benjamin Bengfort et al, "Applied Text Analysis with Python: Enabling Language-Aware Data Products with Machine Learning", 2018)

"Cross-validation is a useful tool for finding optimal predictive models, and it also works well in visualization. The concept is simple: split the data at random into a 'training' and a 'test' set, fit the model to the training data, then see how well it predicts the test data. As the model gets more complex, it will always fit the training data better and better. It will also start off getting better results on the test data, but there comes a point where the test data predictions start going wrong." (Robert Grant, "Data Visualization: Charts, Maps and Interactive Graphics", 2019)

"Any machine learning model is trained based on certain assumptions. In general, these assumptions are the simplistic approximations of some real-world phenomena. These assumptions simplify the actual relationships between features and their characteristics and make a model easier to train. More assumptions means more bias. So, while training a model, more simplistic assumptions = high bias, and realistic assumptions that are more representative of actual phenomena = low bias." (Imran Ahmad, "40 Algorithms Every Programmer Should Know", 2020)

🔭Data Science: Training Data (Just the Quotes)

"Neural networks are a computing technology whose fundamental purpose is to recognize patterns in data. Based on a computing model similar to the underlying structure of the human brain, neural networks share the brains ability to learn or adapt in response to external inputs. When exposed to a stream of training data, neural networks can discover previously unknown relationships and learn complex nonlinear mappings in the data. Neural networks provide some fundamental, new capabilities for processing business data. However, tapping these new neural network data mining functions requires a completely different application development process from traditional programming." (Joseph P Bigus, "Data Mining with Neural Networks: Solving business problems from application development to decision support", 1996)

"[…] an obvious difference between our best classifiers and human learning is the number of examples required in tasks such as object detection. […] the difficulty of a learning task depends on the size of the required hypothesis space. This complexity determines in turn how many training examples are needed to achieve a given level of generalization error. Thus the complexity of the hypothesis space sets the speed limit and the sample complexity for learning." (Tomaso Poggio & Steve Smale, "The Mathematics of Learning: Dealing with Data", Notices of the AMS, 2003)

"Learning a complicated function that matches the training data closely but fails to recognize the underlying process that generates the data. As a result of overfitting, the model performs poor on new input. Overfitting occurs when the training patterns are sparse in input space and/or the trained networks are too complex." (Frank Padberg, "Counting the Hidden Defects in Software Documents", 2010)

"Overfitting occurs when a formula describes a set of data very closely, but does not lead to any sensible explanation for the behavior of the data and does not predict the behavior of comparable data sets. In the case of overfitting, the formula is said to describe the noise of the system rather than the characteristic behavior of the system. Overfitting occurs frequently with models that perform iterative approximations on training data, coming closer and closer to the training data set with each iteration. Neural networks are an example of a data modeling strategy that is prone to overfitting." (Jules H Berman, "Principles of Big Data: Preparing, Sharing, and Analyzing Complex Information", 2013)

"Briefly speaking, to solve a Machine Learning problem means you optimize a model to fit all the data from your training set, and then you use the model to predict the results you want. Therefore, evaluating a model need to see how well it can be used to predict the data out of the training set. Usually there are three types of the models: underfitting, fair and overfitting model [...]. If we want to predict a value, both (a) and (c) in this figure cannot work well. The underfitting model does not capture the structure of the problem at all, and we say it has high bias. The overfitting model tries to fit every sample in the training set and it did it, but we say it is of high variance. In other words, it fails to generalize new data." (Shudong Hao, "A Beginner’s Tutorial for Machine Learning Beginners", 2014)

"Neural networks can model very complex patterns and decision boundaries in the data and, as such, are very powerful. In fact, they are so powerful that they can even model the noise in the training data, which is something that definitely should be avoided. One way to avoid this overfitting is by using a validation set in a similar way as with decision trees.[...] Another scheme to prevent a neural network from overfitting is weight regularization, whereby the idea is to keep the weights small in absolute sense because otherwise they may be fitting the noise in the data. This is then implemented by adding a weight size term (e.g., Euclidean norm) to the objective function of the neural network." (Bart Baesens, "Analytics in a Big Data World: The Essential Guide to Data Science and Its Applications", 2014)

"A predictive model overfits the training set when at least some of the predictions it returns are based on spurious patterns present in the training data used to induce the model. Overfitting happens for a number of reasons, including sampling variance and noise in the training set. The problem of overfitting can affect any machine learning algorithm; however, the fact that decision tree induction algorithms work by recursively splitting the training data means that they have a natural tendency to segregate noisy instances and to create leaf nodes around these instances. Consequently, decision trees overfit by splitting the data on irrelevant features that only appear relevant due to noise or sampling variance in the training data. The likelihood of overfitting occurring increases as a tree gets deeper because the resulting predictions are based on smaller and smaller subsets as the dataset is partitioned after each feature test in the path." (John D Kelleher et al, "Fundamentals of Machine Learning for Predictive Data Analytics: Algorithms, Worked Examples, and Case Studies", 2015)

"Cross-validation is a method of splitting all of your data into two parts: training and validation. The training data is used to build the machine learning model, whereas the validation data is used to validate that the model is doing what is expected. This increases our ability to find and determine the underlying errors in a model." (Matthew Kirk, "Thoughtful Machine Learning", 2015)

"Tree pruning identifies and removes subtrees within a decision tree that are likely to be due to noise and sample variance in the training set used to induce it. In cases where a subtree is deemed to be overfitting, pruning the subtree means replacing the subtree with a leaf node that makes a prediction based on the majority target feature level (or average target feature value) of the dataset created by merging the instances from all the leaf nodes in the subtree. Obviously, pruning will result in decision trees being created that are not consistent with the training set used to build them. In general, however, we are more interested in creating prediction models that generalize well to new data rather than that are strictly consistent with training data, so it is common to sacrifice consistency for generalization capacity." (John D Kelleher et al, "Fundamentals of Machine Learning for Predictive Data Analytics: Algorithms, Worked Examples, and Case Studies", 2015)

"When memorization happens, you may have the illusion that everything is working well because your machine learning algorithm seems to have fitted the in sample data so well. Instead, problems can quickly become evident when you start having it work with out-of-sample data and you notice that it produces errors in its predictions as well as errors that actually change a lot when you relearn from the same data with a slightly different approach. Overfitting occurs when your algorithm has learned too much from your data, up to the point of mapping curve shapes and rules that do not exist [...]. Any slight change in the procedure or in the training data produces erratic predictions." (John P Mueller & Luca Massaron, "Machine Learning for Dummies", 2016)

"Bias is error from incorrect assumptions built into the model, such as restricting an interpolating function to be linear instead of a higher-order curve. [...] Errors of bias produce underfit models. They do not fit the training data as tightly as possible, were they allowed the freedom to do so. In popular discourse, I associate the word 'bias' with prejudice, and the correspondence is fairly apt: an apriori assumption that one group is inferior to another will result in less accurate predictions than an unbiased one. Models that perform lousy on both training and testing data are underfit." (Steven S Skiena, "The Data Science Design Manual", 2017)

"Bias occurs normally when the model is underfitted and has failed to learn enough from the training data. It is the difference between the mean of the probability distribution and the actual correct value. Hence, the accuracy of the model is different for different data sets (test and training sets). To reduce the bias error, data scientists repeat the model-building process by resampling the data to obtain better prediction values." (Umesh R Hodeghatta & Umesha Nayak, "Business Analytics Using R: A Practical Approach", 2017)

"Boosting defines an objective function to measure the performance of a model given a certain set of parameters. The objective function contains two parts: regularization and training loss, both of which add to one another. The training loss measures how predictive our model is on the training data. The most commonly used training loss function includes mean squared error and logistic regression. The regularization term controls the complexity of the model, which helps avoid overfitting." (Danish Haroon, "Python Machine Learning Case Studies", 2017)

"From a typical training set, many alternative decision trees can be created. As a rule, smaller trees are to be preferred, their main advantages being interpretability, removal of irrelevant and redundant attributes, and lower danger of overfitting noisy training data." (Miroslav Kubat, "An Introduction to Machine Learning" 2nd Ed., 2017)

"High-bias models typically produce simpler models that do not overfit and in those cases the danger is that of underfitting. Models with low-bias are typically more complex and that complexity enables us to represent the training data in a more accurate way. The danger here is that the flexibility provided by higher complexity may end up representing not only a relationship in the data but also the noise. Another way of portraying the bias-variance trade-off is in terms of complexity v simplicity." (Jesús Rogel-Salazar, "Data Science and Analytics with Python", 2017) 

"In machine learning, a model is defined as a function, and we describe the learning function from the training data as inductive learning. Generalization refers to how well the concepts are learned by the model by applying them to data not seen before. The goal of a good machine-learning model is to reduce generalization errors and thus make good predictions on data that the model has never seen." (Umesh R Hodeghatta & Umesha Nayak, "Business Analytics Using R: A Practical Approach", 2017)

"Multilayer perceptrons share with polynomial classifiers one unpleasant property. Theoretically speaking, they are capable of modeling any decision surface, and this makes them prone to overfitting the training data."  (Miroslav Kubat," An Introduction to Machine Learning" 2nd Ed., 2017)

"The danger of overfitting is particularly severe when the training data is not a perfect gold standard. Human class annotations are often subjective and inconsistent, leading boosting to amplify the noise at the expense of the signal. The best boosting algorithms will deal with overfitting though regularization. The goal will be to minimize the number of non-zero coefficients, and avoid large coefficients that place too much faith in any one classifier in the ensemble." (Steven S Skiena, "The Data Science Design Manual", 2017)

"Variance is a prediction error due to different sets of training samples. Ideally, the error should not vary from one training sample to another sample, and the model should be stable enough to handle hidden variations between input and output variables. Normally this occurs with the overfitted model." (Umesh R Hodeghatta & Umesha Nayak, "Business Analytics Using R: A Practical Approach", 2017)

"Variance is error from sensitivity to fluctuations in the training set. If our training set contains sampling or measurement error, this noise introduces variance into the resulting model. [...] Errors of variance result in overfit models: their quest for accuracy causes them to mistake noise for signal, and they adjust so well to the training data that noise leads them astray. Models that do much better on testing data than training data are overfit." (Steven S Skiena, "The Data Science Design Manual", 2017)

"The premise of classification is simple: given a categorical target variable, learn patterns that exist between instances composed of independent variables and their relationship to the target. Because the target is given ahead of time, classification is said to be supervised machine learning because a model can be trained to minimize error between predicted and actual categories in the training data. Once a classification model is fit, it assigns categorical labels to new instances based on the patterns detected during training." (Benjamin Bengfort et al, "Applied Text Analysis with Python: Enabling Language-Aware Data Products with Machine Learning", 2018)

"The trick is to walk the line between underfitting and overfitting. An underfit model has low variance, generally making the same predictions every time, but with extremely high bias, because the model deviates from the correct answer by a significant amount. Underfitting is symptomatic of not having enough data points, or not training a complex enough model. An overfit model, on the other hand, has memorized the training data and is completely accurate on data it has seen before, but varies widely on unseen data. Neither an overfit nor underfit model is generalizable - that is, able to make meaningful predictions on unseen data." (Benjamin Bengfort et al, "Applied Text Analysis with Python: Enabling Language-Aware Data Products with Machine Learning", 2018)

"There is a trade-off between bias and variance [...]. Complexity increases with the number of features, parameters, depth, training epochs, etc. As complexity increases and the model overfits, the error on the training data decreases, but the error on test data increases, meaning that the model is less generalizable." (Benjamin Bengfort et al, "Applied Text Analysis with Python: Enabling Language-Aware Data Products with Machine Learning", 2018)

"Cross-validation is a useful tool for finding optimal predictive models, and it also works well in visualization. The concept is simple: split the data at random into a 'training' and a 'test' set, fit the model to the training data, then see how well it predicts the test data. As the model gets more complex, it will always fit the training data better and better. It will also start off getting better results on the test data, but there comes a point where the test data predictions start going wrong." (Robert Grant, "Data Visualization: Charts, Maps and Interactive Graphics", 2019)

"The classifier accuracy would be extra ordinary when the test data and the training data are overlapping. But when the model is applied to a new data it will fail to show acceptable accuracy. This condition is called as overfitting." (Jesu V  Nayahi J & Gokulakrishnan K, "Medical Image Classification", 2019)

12 November 2018

🔭Data Science: Statisticians (Just the Quotes)

"It is difficult to understand why statisticians commonly limit their inquiries to Averages, and do not revel in more comprehensive views. Their souls seem as dull to the charm of variety as that of the native of one of our flat English counties, whose retrospect of Switzerland was that, if its mountains could be thrown into its lakes, two nuisances would be got rid of at once. An Average is but a solitary fact, whereas if a single other fact be added to it, an entire Normal Scheme, which nearly corresponds to the observed one, starts potentially into existence." (Francis Galton, "Natural Inheritance", 1889) 

"Even trained statisticians often fail to appreciate the extent to which statistics are vitiated by the unrecorded assumptions of their interpreters." (George B Shaw, "The Doctor's Dilemma", 1906)

"Figures may not lie, but statistics compiled unscientifically and analyzed incompetently are almost sure to be misleading, and when this condition is unnecessarily chronic the so-called statisticians may be called liars." (Edwin B Wilson, "Bulletin of the American Mathematical Society", Vol 18, 1912)

"The statistician’s job is to draw general conclusions from fragmentary data. Too often the data supplied to him for analysis are not only fragmentary but positively incoherent, so that he can do next to nothing with them. Even the most kindly statistician swears heartily under his breath whenever this happens". (Michael J Moroney, "Facts from Figures", 1927)

"To consult the statistician after an experiment is finished is often merely to ask him to conduct a post mortem examination. He can perhaps say what the experiment died of." (Sir Ronald A Fisher, [presidential address] 1938)

"An inference, if it is to have scientific value, must constitute a prediction concerning future data. If the inference is to be made purely with the help of the distribution theory of statistics, the experiments that constitute evidence for the inference must arise from a state of statistical control; until that state is reached, there is no universe, normal or otherwise, and the statistician’s calculations by themselves are an illusion if not a delusion. The fact is that when distribution theory is not applicable for lack of control, any inference, statistical or otherwise, is little better than a conjecture. The state of statistical control is therefore the goal of all experimentation." (William E Deming, "Statistical Method from the Viewpoint of Quality Control", 1939)

"[Statistics] is both a science and an art. It is a science in that its methods are basically systematic and have general application; and an art in that their successful application depends to a considerable degree on the skill and special experience of the statistician, and on his knowledge of the field of application, e.g. economics." (Leonard H C Tippett, "Statistics", 1943)

"Errors of the third kind happen in conventional tests of differences of means, but they are usually not considered, although their existence is probably recognized. It seems to the author that there may be several reasons for this among which are 1) a preoccupation on the part of mathematical statisticians with the formal questions of acceptance and rejection of null hypotheses without adequate consideration of the implications of the error of the third kind for the practical experimenter, 2) the rarity with which an error of the third kind arises in the usual tests of significance." (Frederick Mosteller, "A k-Sample Slippage Test for an Extreme Population", The Annals of Mathematical Statistics 19, 1948)

"The characteristic which distinguishes the present-day professional statistician, is his interest and skill in the measurement of the fallibility of conclusions." (George W Snedecor, "On a Unique Feature of Statistics", [address] 1948)

"[...] to be a good theoretical statistician one must also compute, and must therefore have the best computing aids." (Frank Yates, "Sampling Methods for Censuses and Surveys", 1949)

"It is very easy to devise different tests which, on the average, have similar properties, [...] hey behave satisfactorily when the null hypothesis is true and have approximately the same power of detecting departures from that hypothesis. Two such tests may, however, give very different results when applied to a given set of data. The situation leads to a good deal of contention amongst statisticians and much discredit of the science of statistics. The appalling position can easily arise in which one can get any answer one wants if only one goes around to a large enough number of statisticians." (Frances Yates, "Discussion on the Paper by Dr. Box and Dr. Andersen", Journal of the Royal Statistical Society B Vol. 17, 1955)

"One feature [...] which requires much more justification than is usually given, is the setting up of unplausible null hypotheses. For example, a statistician may set out a test to see whether two drugs have exactly the same effect, or whether a regression line is exactly straight. These hypotheses can scarcely be taken literally." (Cedric A B Smith, "Book review of Norman T. J. Bailey: Statistical Methods in Biology", Applied Statistics 9, 1960)

"The statistician cannot excuse himself from the duty of getting his head clear on the principles of scientific inference, but equally no other thinking man can avoid a like obligation." (Sir Ronald A Fisher, "The Design of Experiments", 1971)

"Another reason for the applied statistician to care about Bayesian inference is that consumers of statistical answers, at least interval estimates, commonly interpret them as probability statements about the possible values of parameters. Consequently, the answers statisticians provide to consumers should be capable of being interpreted as approximate Bayesian statements." (Donald B Rubin, "Bayesianly justifiable and relevant frequency calculations for the applied statistician", Annals of Statistics 12(4), 1984)

"Too much of what all statisticians do [...] is blatantly subjective for any of us to kid ourselves or the users of our technology into believing that we have operated ‘impartially’ in any true sense. [...] We can do what seems to us most appropriate, but we can not be objective and would do well to avoid language that hints to the contrary." (Steve V Vardeman, Comment, Journal of the American Statistical Association 82, 1987)

"It is clear that a statistician who is involved at the start of an investigation, advises on data collection, and who knows the background and objectives, will generally make a better job of the analysis than a statistician who was called in later on." (Christopher Chatfield, "Problem solving: a statistician’s guide", 1988)

"Statistics is a very powerful and persuasive mathematical tool. People put a lot of faith in printed numbers. It seems when a situation is described by assigning it a numerical value, the validity of the report increases in the mind of the viewer. It is the statistician's obligation to be aware that data in the eyes of the uninformed or poor data in the eyes of the naive viewer can be as deceptive as any falsehoods." (Theoni Pappas, "More Joy of Mathematics: Exploring mathematical insights & concepts", 1991)

"At its core statistics is not about cleverness and technique, but rather about honesty. Its real contribution to society is primarily moral, not technical. It is about doing the right thing when interpreting empirical information. Statisticians are not the world’s best computer scientists, mathematicians, or scientific subject matter specialists. We are (potentially, at least) the best at the principled collection, summarization, and analysis of data." (Stephen B Vardeman & Max D Morris, "Statistics and Ethics: Some Advice for Young Statisticians", The American Statistician vol 57, 2003)

"Today [...] we have high-speed computers and prepackaged statistical routines to perform the necessary calculations [...] statistical software will no more make one a statistician than would a scalpel turn one into a neurosurgeon. Allowing these tools to do our thinking for us is a sure recipe for disaster." (Phillip Good & Hardin James, "Common Errors in Statistics and How to Avoid Them", 2003)

"What is so unconventional about the statistical way of thinking? First, statisticians do not care much for the popular concept of the statistical average; instead, they fixate on any deviation from the average. They worry about how large these variations are, how frequently they occur, and why they exist. [...] Second, variability does not need to be explained by reasonable causes, despite our natural desire for a rational explanation of everything; statisticians are frequently just as happy to pore over patterns of correlation. [...] Third, statisticians are constantly looking out for missed nuances: a statistical average for all groups may well hide vital differences that exist between these groups. Ignoring group differences when they are present frequently portends inequitable treatment. [...] Fourth, decisions based on statistics can be calibrated to strike a balance between two types of errors. Predictably, decision makers have an incentive to focus exclusively on minimizing any mistake that could bring about public humiliation, but statisticians point out that because of this bias, their decisions will aggravate other errors, which are unnoticed but serious. [...] Finally, statisticians follow a specific protocol known as statistical testing when deciding whether the evidence fits the crime, so to speak. Unlike some of us, they don’t believe in miracles. In other words, if the most unusual coincidence must be contrived to explain the inexplicable, they prefer leaving the crime unsolved." (Kaiser Fung, "Numbers Rule the World", 2010)

"Darn right, graphs are not serious. Any untrained, unsophisticated, non-degree-holding civilian can display data. Relying on plots is like admitting you do not need a statistician. Show pictures of the numbers and let people make their own judgments? That can be no better than airing your statistical dirty laundry. People need guidance; they need to be shown what the data are supposed to say. Graphics cannot do that; models can." (William M Briggs, Comment, Journal of Computational and Graphical Statistics Vol. 20(1), 2011)

"When statisticians, trained in math and probability theory, try to assess likely outcomes, they demand a plethora of data points. Even then, they recognize that unless it’s a very simple and controlled action such as flipping a coin, unforeseen variables can exert significant influence." (Zachary Karabell, "The Leading Indicators: A short history of the numbers that rule our world", 2014)

"To be any good, a sample has to be representative. A sample is representative if every person or thing in the group you’re studying has an equally likely chance of being chosen. If not, your sample is biased. […] The job of the statistician is to formulate an inventory of all those things that matter in order to obtain a representative sample. Researchers have to avoid the tendency to capture variables that are easy to identify or collect data on - sometimes the things that matter are not obvious or are difficult to measure." (Daniel J Levitin, "Weaponized Lies", 2017)

"Some scientists (e.g., econometricians) like to work with mathematical equations; others (e.g., hard-core statisticians) prefer a list of assumptions that ostensibly summarizes the structure of the diagram. Regardless of language, the model should depict, however qualitatively, the process that generates the data - in other words, the cause-effect forces that operate in the environment and shape the data generated." (Judea Pearl & Dana Mackenzie, "The Book of Why: The new science of cause and effect", 2018)

🔭Data Science: Regularization (Just the Quotes)

"Neural networks can model very complex patterns and decision boundaries in the data and, as such, are very powerful. In fact, they are so powerful that they can even model the noise in the training data, which is something that definitely should be avoided. One way to avoid this overfitting is by using a validation set in a similar way as with decision trees.[...] Another scheme to prevent a neural network from overfitting is weight regularization, whereby the idea is to keep the weights small in absolute sense because otherwise they may be fitting the noise in the data. This is then implemented by adding a weight size term (e.g., Euclidean norm) to the objective function of the neural network." (Bart Baesens, "Analytics in a Big Data World: The Essential Guide to Data Science and Its Applications", 2014)

"Regularization works because it is the sum of the coefficients of the predictor variables, therefore it’s important that they’re on the same scale or the regularization may find it difficult to converge, and variables with larger absolute coefficient values will greatly influence it, generating an infective regularization. It’s good practice to standardize the predictor values or bind them to a common min‐max, such as the [‐1,+1] range." (Luca Massaron & John P Mueller, "Python for Data Science For Dummies", 2015)

"Neural nets are typically over-parametrized, and hence are prone to overfitting. Originally early stopping was set up as the primary tuning parameter, and the stopping time was determined using a held-out set of validation data. In modern networks the regularization is tuned adaptively to avoid overfitting, and hence it is less of a problem." (Bradley Efron & Trevor Hastie, "Computer Age Statistical Inference: Algorithms, Evidence, and Data Science", 2016)

"Boosting defines an objective function to measure the performance of a model given a certain set of parameters. The objective function contains two parts: regularization and training loss, both of which add to one another. The training loss measures how predictive our model is on the training data. The most commonly used training loss function includes mean squared error and logistic regression. The regularization term controls the complexity of the model, which helps avoid overfitting." (Danish Haroon, "Python Machine Learning Case Studies", 2017)

"Early stopping and regularization can ensure network generalization when you apply them properly. [...] With early stopping, the choice of the validation set is also important. The validation set should be representative of all points in the training set. When you use Bayesian regularization, it is important to train the network until it reaches convergence. The sum-squared error, the sum-squared weights, and the effective number of parameters should reach constant values when the network has converged. With both early stopping and regularization, it is a good idea to train the network starting from several different initial conditions. It is possible for either method to fail in certain circumstances. By testing several different initial conditions, you can verify robust network performance." (Mark H Beale et al, "Neural Network Toolbox™ User's Guide", 2017)

"Feature generation (or engineering, as it is often called) is where the bulk of the time is spent in the machine learning process. As social science researchers or practitioners, you have spent a lot of time constructing features, using transformations, dummy variables, and interaction terms. All of that is still required and critical in the machine learning framework. One difference you will need to get comfortable with is that instead of carefully selecting a few predictors, machine learning systems tend to encourage the creation of lots of features and then empirically use holdout data to perform regularization and model selection. It is common to have models that are trained on thousands of features." (Rayid Ghani & Malte Schierholz, "Machine Learning", 2017)

"The danger of overfitting is particularly severe when the training data is not a perfect gold standard. Human class annotations are often subjective and inconsistent, leading boosting to amplify the noise at the expense of the signal. The best boosting algorithms will deal with overfitting though regularization. The goal will be to minimize the number of non-zero coefficients, and avoid large coefficients that place too much faith in any one classifier in the ensemble." (Steven S Skiena, "The Data Science Design Manual", 2017)

"Even though a natural way of avoiding overfitting is to simply build smaller networks (with fewer units and parameters), it has often been observed that it is better to build large networks and then regularize them in order to avoid overfitting. This is because large networks retain the option of building a more complex model if it is truly warranted. At the same time, the regularization process can smooth out the random artifacts that are not supported by sufficient data. By using this approach, we are giving the model the choice to decide what complexity it needs, rather than making a rigid decision for the model up front (which might even underfit the data)." (Charu C Aggarwal, "Neural Networks and Deep Learning: A Textbook", 2018)

"Regularization is particularly important when the amount of available data is limited. A neat biological interpretation of regularization is that it corresponds to gradual forgetting, as a result of which 'less important' (i.e., noisy) patterns are removed. In general, it is often advisable to use more complex models with regularization rather than simpler models without regularization." (Charu C Aggarwal, "Neural Networks and Deep Learning: A Textbook", 2018)

"The idea behind deeper architectures is that they can better leverage repeated regularities in the data patterns in order to reduce the number of computational units and therefore generalize the learning even to areas of the data space where one does not have examples. Often these repeated regularities are learned by the neural network within the weights as the basis vectors of hierarchical features." (Charu C Aggarwal, "Neural Networks and Deep Learning: A Textbook", 2018)

10 November 2018

🔭Data Science: Criteria (Just the Quotes)

"[Precision] is the very soul of science; and its attainment afford the only criterion, or at least the best, of the truth of theories, and the correctness of experiments." (John F W Herschel, "A Preliminary Discourse on the Study of Natural Philosophy", 1830)

"A primary goal of any learning model is to predict correctly the learning curve - proportions of correct responses versus trials. Almost any sensible model with two or three free parameters, however, can closely fit the curve, and so other criteria must be invoked when one is comparing several models." (Robert R Bush & Frederick Mosteller, "A Comparison of Eight Models?", Studies in Mathematical Learning Theory, 1959)

"A satisfactory prediction of the sequential properties of learning data from a single experiment is by no means a final test of a model. Numerous other criteria - and some more demanding - can be specified. For example, a model with specific numerical parameter values should be invariant to changes in independent variables that explicitly enter in the model." (Robert R Bush & Frederick Mosteller,"A Comparison of Eight Models?", Studies in Mathematical Learning Theory, 1959)

"[...] sciences do not try to explain, they hardly even try to interpret, they mainly make models. By a model is meant a mathematical construct which, with the addition of certain verbal interpretations, describes observed phenomena. The justification of such a mathematical construct is solely and precisely that it is expected to work - that is, correctly to describe phenomena from a reasonably wide area. Furthermore, it must satisfy certain aesthetic criteria - that is, in relation to how much it describes, it must be rather simple.” (John von Neumann, "Method in the physical sciences", 1961)

"Adaptive system - whether on the biological, psychological, or sociocultural level - must manifest (1) some degree of 'plasticity' and 'irritability' vis-a-vis its environment such that it carries on a constant interchange with acting on and reacting to it; (2) some source or mechanism for variety, to act as a potential pool of adaptive variability to meet the problem of mapping new or more detailed variety and constraints in a changeable environment; (3) a set of selective criteria or mechanisms against which the 'variety pool' may be sifted into those variations in the organization or system that more closely map the environment and those that do not; and (4) an arrangement for preserving and/or propagating these 'successful' mappings." (Walter F Buckley," Sociology and modern systems theory", 1967)

"The advantage of this way of proceeding is evident: insights and skills obtained on the model-side can be - certain transference criteria satisfied - transferred to the original, [in this way] the model-builder obtains a new knowledge about the modeled original […]" (Herbert Stachowiak, "Allgemeine Modelltheorie", 1973)

"In most engineering problems, particularly when solving optimization problems, one must have the opportunity of comparing different variants quantitatively. It is therefore important to be able to state a clear-cut quantitative criterion." (Yakov Khurgin, "Did You Say Mathematics?", 1974)

"The principal aim of physical theories is understanding. A theory's ability to find a number is merely a useful criterion for a correct understanding." (Yuri I Manin, "Mathematics and Physics", 1981)

"It is often the scientist’s experience that he senses the nearness of truth when such connections are envisioned. A connection is a step toward simplification, unification. Simplicity is indeed often the sign of truth and a criterion of beauty." (Mahlon B Hoagland, "Toward the Habit of Truth", 1990)

"A model for simulating dynamic system behavior requires formal policy descriptions to specify how individual decisions are to be made. Flows of information are continuously converted into decisions and actions. No plea about the inadequacy of our understanding of the decision-making processes can excuse us from estimating decision-making criteria. To omit a decision point is to deny its presence - a mistake of far greater magnitude than any errors in our best estimate of the process." (Jay W Forrester, "Policies, decisions and information sources for modeling", 1994)

"The amount of understanding produced by a theory is determined by how well it meets the criteria of adequacy - testability, fruitfulness, scope, simplicity, conservatism - because these criteria indicate the extent to which a theory systematizes and unifies our knowledge." (Theodore Schick Jr., "How to Think about Weird Things: Critical Thinking for a New Age", 1995)

09 November 2018

🔭Data Science: Plausibility (Just the Quotes)

"However successful a theory or law may have been in the past, directly it fails to interpret new discoveries its work is finished, and it must be discarded or modified. However plausible the hypothesis, it must be ever ready for sacrifice on the altar of observation." (Joseph W Mellor, "A Comprehensive Treatise on Inorganic and Theoretical Chemistry", 1922) 

"Heuristic reasoning is reasoning not regarded as final and strict but as provisional and plausible only, whose purpose is to discover the solution of the present problem. We are often obliged to use heuristic reasoning. We shall attain complete certainty when we shall have obtained the complete solution, but before obtaining certainty we must often be satisfied with a more or less plausible guess. We may need the provisional before we attain the final. We need heuristic reasoning when we construct a strict proof as we need scaffolding when we erect a building." (George Pólya, "How to Solve It", 1945)

"The scientist who discovers a theory is usually guided to his discovery by guesses; he cannot name a method by means of which he found the theory and can only say that it appeared plausible to him, that he had the right hunch or that he saw intuitively which assumption would fit the facts." (Hans Reichenbach, "The Rise of Scientific Philosophy", 1951)

"In plausible reasoning the principal thing is to distinguish... a more reasonable guess from a less reasonable guess." (George Pólya, "Mathematics and plausible reasoning" Vol. 1, 1954)

"One feature [...] which requires much more justification than is usually given, is the setting up of unplausible null hypotheses. For example, a statistician may set out a test to see whether two drugs have exactly the same effect, or whether a regression line is exactly straight. These hypotheses can scarcely be taken literally." (Cedric A B Smith, "Book review of Norman T. J. Bailey: Statistical Methods in Biology", Applied Statistics 9, 1960)

"[…] the social scientist who lacks a mathematical mind and regards a mathematical formula as a magic recipe, rather than as the formulation of a supposition, does not hold forth much promise. A mathematical formula is never more than a precise statement. It must not be made into a Procrustean bed - and that is what one is driven to by the desire to quantify at any cost. It is utterly implausible that a mathematical formula should make the future known to us, and those who think it can, would once have believed in witchcraft. The chief merit of mathematicization is that it compels us to become conscious of what we are assuming." (Bertrand de Jouvenel, "The Art of Conjecture", 1967)

"In all scientific fields, theory is frequently more important than experimental data. Scientists are generally reluctant to accept the existence of a phenomenon when they do not know how to explain it. On the other hand, they will often accept a theory that is especially plausible before there exists any data to support it." (Richard Morris, 1983)

"The degree of confirmation assigned to any given hypothesis is sensitive to properties of the entire belief system [...] simplicity, plausibility, and conservatism are properties that theories have in virtue of their relation to the whole structure of scientific beliefs taken collectively. A measure of conservatism or simplicity would be a metric over global properties of belief systems." (Jerry Fodor, "Modularity of Mind", 1983)

"Nonetheless, the basic principles regarding correlations between variables are not that difficult to understand. We must look for patterns that reveal potential relationships and for evidence that variables are actually related. But when we do spot those relationships, we should not jump to conclusions about causality. Instead, we need to weigh the strength of the relationship and the plausibility of our theory, and we must always try to discount the possibility of spuriousness." (Joel Best, "More Damned Lies and Statistics: How numbers confuse public issues", 2004)

"Reality dishes out experiences using probability, not plausibility." (Eliezer S Yudkowsky, "A Technical Explanation of Technical Explanation", 2005)

"Don’t just do the calculations. Use common sense to see whether you are answering the correct question, the assumptions are reasonable, and the results are plausible. If a statistical argument doesn’t make sense, think about it carefully - you may discover that the argument is nonsense." (Gary Smith, "Standard Deviations", 2014)

"Facts and values are entangled in science. It's not because scientists are biased, not because they are partial or influenced by other kinds of interests, but because of a commitment to reason, consistency, coherence, plausibility and replicability. These are value commitments." (Alva Noë)

08 November 2018

🔭Data Science: Aggregation (Just the Quotes)

"Statistics may be defined as numerical statements of facts by means of which large aggregates are analyzed, the relations of individual units to their groups are ascertained, comparisons are made between groups, and continuous records are maintained for comparative purposes." (Melvin T Copeland. "Statistical Methods" [in: Harvard Business Studies, Vol. III, Ed. by Melvin T Copeland, 1917])

"The null hypothesis of no difference has been judged to be no longer a sound or fruitful basis for statistical investigation. […] Significance tests do not provide the information that scientists need, and, furthermore, they are not the most effective method for analyzing and summarizing data." (Cherry A Clark, "Hypothesis Testing in Relation to Statistical Methodology", Review of Educational Research Vol. 33, 1963)

"[…] fitting lines to relationships between variables is often a useful and powerful method of summarizing a set of data. Regression analysis fits naturally with the development of causal explanations, simply because the research worker must, at a minimum, know what he or she is seeking to explain." (Edward R Tufte, "Data Analysis for Politics and Policy", 1974)

"Fitting lines to relationships between variables is the major tool of data analysis. Fitted lines often effectively summarize the data and, by doing so, help communicate the analytic results to others. Estimating a fitted line is also the first step in squeezing further information from the data." (Edward R Tufte, "Data Analysis for Politics and Policy", 1974)

"Modern data graphics can do much more than simply substitute for small statistical tables. At their best, graphics are instruments for reasoning about quantitative information. Often the most effective way to describe, explore, and summarize a set of numbers even a very large set - is to look at pictures of those numbers. Furthermore, of all methods for analyzing and communicating statistical information, well-designed data graphics are usually the simplest and at the same time the most powerful." (Edward R Tufte, "The Visual Display of Quantitative Information", 1983)

"The science of statistics may be described as exploring, analyzing and summarizing data; designing or choosing appropriate ways of collecting data and extracting information from them; and communicating that information. Statistics also involves constructing and testing models for describing chance phenomena. These models can be used as a basis for making inferences and drawing conclusions and, finally, perhaps for making decisions." (Fergus Daly et al, "Elements of Statistics", 1995)

"Without meaningful data there can be no meaningful analysis. The interpretation of any data set must be based upon the context of those data. Unfortunately, much of the data reported to executives today are aggregated and summed over so many different operating units and processes that they cannot be said to have any context except a historical one - they were all collected during the same time period. While this may be rational with monetary figures, it can be devastating to other types of data." (Donald J Wheeler, "Understanding Variation: The Key to Managing Chaos" 2nd Ed., 2000)

"Ockham's Razor in statistical analysis is used implicitly when models are embedded in richer models -for example, when testing the adequacy of a linear model by incorporating a quadratic term. If the coefficient of the quadratic term is not significant, it is dropped and the linear model is assumed to summarize the data adequately." (Gerald van Belle, "Statistical Rules of Thumb", 2002)

"Every number has its limitations; every number is a product of choices that inevitably involve compromise. Statistics are intended to help us summarize, to get an overview of part of the world’s complexity. But some information is always sacrificed in the process of choosing what will be counted and how. Something is, in short, always missing. In evaluating statistics, we should not forget what has been lost, if only because this helps us understand what we still have." (Joel Best, "More Damned Lies and Statistics: How numbers confuse public issues", 2004)

"Data often arrive in raw form, as long lists of numbers. In this case your job is to summarize the data in a way that captures its essence and conveys its meaning. This can be done numerically, with measures such as the average and standard deviation, or graphically. At other times you find data already in summarized form; in this case you must understand what the summary is telling, and what it is not telling, and then interpret the information for your readers or viewers." (Charles Livingston & Paul Voakes, "Working with Numbers and Statistics: A handbook for journalists", 2005)

"Whereas regression is about attempting to specify the underlying relationship that summarises a set of paired data, correlation is about assessing the strength of that relationship. Where there is a very close match between the scatter of points and the regression line, correlation is said to be 'strong' or 'high' . Where the points are widely scattered, the correlation is said to be 'weak' or 'low'." (Alan Graham, "Developing Thinking in Statistics", 2006)

"Numerical precision should be consistent throughout and summary statistics such as means and standard deviations should not have more than one extra decimal place (or significant digit) compared to the raw data. Spurious precision should be avoided although when certain measures are to be used for further calculations or when presenting the results of analyses, greater precision may sometimes be appropriate." (Jenny Freeman et al, "How to Display Data", 2008)

"Graphical displays are often constructed to place principal focus on the individual observations in a dataset, and this is particularly helpful in identifying both the typical positions of data points and unusual or influential cases. However, in many investigations, principal interest lies in identifying the nature of underlying trends and relationships between variables, and so it is often helpful to enhance graphical displays in ways which give deeper insight into these features. This can be very beneficial both for small datasets, where variation can obscure underlying patterns, and large datasets, where the volume of data is so large that effective representation inevitably involves suitable summaries." (Adrian W Bowman, "Smoothing Techniques for Visualisation" [in "Handbook of Data Visualization"], 2008)

"There are three possible reasons for [the] absence of predictive power. First, it is possible that the models are misspecified. Second, it is possible that the model’s explanatory factors are measured at too high a level of aggregation [...] Third, [...] the search for statistically significant relationships may not be the strategy best suited for evaluating our model’s ability to explain real world events [...] the lack of predictive power is the result of too much emphasis having been placed on finding statistically significant variables, which may be overdetermined. Statistical significance is generally a flawed way to prune variables in regression models [...] Statistically significant variables may actually degrade the predictive accuracy of a model [...] [By using] models that are constructed on the basis of pruning undertaken with the shears of statistical significance, it is quite possible that we are winnowing our models away from predictive accuracy." (Michael D Ward et al, "The perils of policy by p-value: predicting civil conflicts" Journal of Peace Research 47, 2010)

"In order to be effective a descriptive statistic has to make sense - it has to distill some essential characteristic of the data into a value that is both appropriate and understandable. […] the justification for computing any given statistic must come from the nature of the data themselves - it cannot come from the arithmetic, nor can it come from the statistic. If the data are a meaningless collection of values, then the summary statistics will also be meaningless - no arithmetic operation can magically create meaning out of nonsense. Therefore, the meaning of any statistic has to come from the context for the data, while the appropriateness of any statistic will depend upon the use we intend to make of that statistic." (Donald J Wheeler, "Myths About Data Analysis", International Lean & Six Sigma Conference, 2012)

"[...] things that seem hopelessly random and unpredictable when viewed in isolation often turn out to be lawful and predictable when viewed in aggregate." (Steven Strogatz, "The Joy of X: A Guided Tour of Mathematics, from One to Infinity", 2012)

"In general, when building statistical models, we must not forget that the aim is to understand something about the real world. Or predict, choose an action, make a decision, summarize evidence, and so on, but always about the real world, not an abstract mathematical world: our models are not the reality - a point well made by George Box in his oft-cited remark that "all models are wrong, but some are useful". (David Hand, "Wonderful examples, but let's not close our eyes", Statistical Science 29, 2014)

"Decision trees are also discriminative models. Decision trees are induced by recursively partitioning the feature space into regions belonging to the different classes, and consequently they define a decision boundary by aggregating the neighboring regions belonging to the same class. Decision tree model ensembles based on bagging and boosting are also discriminative models." (John D Kelleher et al, "Fundamentals of Machine Learning for Predictive Data Analytics: Algorithms, Worked Examples, and Case Studies", 2015)

"Just as with aggregated data, an average is a summary statistic that can tell you something about the data - but it is only one metric, and oftentimes a deceiving one at that. By taking all of the data and boiling it down to one value, an average (and other summary statistics) may imply that all of the underlying data is the same, even when it’s not." (John H Johnson & Mike Gluck, "Everydata: The misinformation hidden in the little data you consume every day", 2016)

"At very small time scales, the motion of a particle is more like a random walk, as it gets jostled about by discrete collisions with water molecules. But virtually any random movement on small time scales will give rise to Brownian motion on large time scales, just so long as the motion is unbiased. This is because of the Central Limit Theorem, which tells us that the aggregate of many small, independent motions will be normally distributed." (Field Cady, "The Data Science Handbook", 2017)

"The most accurate but least interpretable form of data presentation is to make a table, showing every single value. But it is difficult or impossible for most people to detect patterns and trends in such data, and so we rely on graphs and charts. Graphs come in two broad types: Either they represent every data point visually (as in a scatter plot) or they implement a form of data reduction in which we summarize the data, looking, for example, only at means or medians." (Daniel J Levitin, "Weaponized Lies", 2017)

"Again, classical statistics only summarizes data, so it does not provide even a language for asking [a counterfactual] question. Causal inference provides a notation and, more importantly, offers a solution. As with predicting the effect of interventions [...], in many cases we can emulate human retrospective thinking with an algorithm that takes what we know about the observed world and produces an answer about the counterfactual world." (Judea Pearl & Dana Mackenzie, "The Book of Why: The new science of cause and effect", 2018)

"While the individual man is an insoluble puzzle, in the aggregate he becomes a mathematical certainty. You can, for example, never foretell what anyone man will be up to, but you can say with precision what an average number will be up to. Individuals vary, but percentages remain constant. So says the statistician." (Sir Arthur C Doyle)

🔭Data Science: Consistency (Just the Quotes)

"A model, like a novel, may resonate with nature, but it is not a ‘real’ thing. Like a novel, a model may be convincing - it may ‘ring true’ if it is consistent with our experience of the natural world. But just as we may wonder how much the characters in a novel are drawn from real life and how much is artifice, we might ask the same of a model: How much is based on observation and measurement of accessible phenomena, how much is convenience? Fundamentally, the reason for modeling is a lack of full access, either in time or space, to the phenomena of interest." (Kenneth Belitz, Science, Vol. 263, 1944)

"Consistency and completeness can also be characterized in terms of models: a theory T is consistent if and only if it has at least one model; it is complete if and only if every sentence of T which is satified in one model is also satisfied in any other model of T. Two theories T1 and T2 are said to be compatible if they have a common consistent extension; this is equivalent to saying that the union of T1 and T2 is consistent." (Alfred Tarski et al, "Undecidable Theories", 1953)

"To be useful data must be consistent - they must reflect periodic recordings of the value of the variable or at least possess logical internal connections. The definition of the variable under consideration cannot change during the period of measurement or enumeration. Also, if the data are to be valuable, they must be relevant to the question to be answered." (Cecil H Meyers, "Handbook of Basic Graphs: A modern approach", 1970)

"When evaluating a model, at least two broad standards are relevant. One is whether the model is consistent with the data. The other is whether the model is consistent with the ‘real world’." (Kenneth A Bollen, "Structural Equations with Latent Variables", 1989)

"The word theory, as used in the natural sciences, doesn’t mean an idea tentatively held for purposes of argument - that we call a hypothesis. Rather, a theory is a set of logically consistent abstract principles that explain a body of concrete facts. It is the logical connections among the principles and the facts that characterize a theory as truth. No one element of a theory [...] can be changed without creating a logical contradiction that invalidates the entire system. Thus, although it may not be possible to substantiate directly a particular principle in the theory, the principle is validated by the consistency of the entire logical structure." (Alan Cromer, "Uncommon Sense: The Heretical Nature of Science", 1993)

"For a given dataset there is not a great deal of advice which can be given on content and context. hose who know their own data should know best for their specific purposes. It is advisable to think hard about what should be shown and to check with others if the graphic makes the desired impression. Design should be let to designers, though some basic guidelines should be followed: consistency is important (sets of graphics should be in similar style and use equivalent scaling); proximity is helpful (place graphics on the same page, or on the facing page, of any text that refers to them); and layout should be checked (graphics should be neither too small nor too large and be attractively positioned relative to the whole page or display)."(Antony Unwin, "Good Graphics?" [in "Handbook of Data Visualization"], 2008)

"It is the consistency of the information that matters for a good story, not its completeness. Indeed, you will often find that knowing little makes it easier to fit everything you know into a coherent pattern." (Daniel Kahneman, "Thinking, Fast and Slow", 2011)

"Accuracy and coherence are related concepts pertaining to data quality. Accuracy refers to the comprehensiveness or extent of missing data, performance of error edits, and other quality assurance strategies. Coherence is the degree to which data - item value and meaning are consistent over time and are comparable to similar variables from other routinely used data sources." (Aileen Rothbard, "Quality Issues in the Use of Administrative Data Records", 2015)

"How good the data quality is can be looked at both subjectively and objectively. The subjective component is based on the experience and needs of the stakeholders and can differ by who is being asked to judge it. For example, the data managers may see the data quality as excellent, but consumers may disagree. One way to assess it is to construct a survey for stakeholders and ask them about their perception of the data via a questionnaire. The other component of data quality is objective. Measuring the percentage of missing data elements, the degree of consistency between records, how quickly data can be retrieved on request, and the percentage of incorrect matches on identifiers (same identifier, different social security number, gender, date of birth) are some examples." (Aileen Rothbard, "Quality Issues in the Use of Administrative Data Records", 2015)

"The danger of overfitting is particularly severe when the training data is not a perfect gold standard. Human class annotations are often subjective and inconsistent, leading boosting to amplify the noise at the expense of the signal. The best boosting algorithms will deal with overfitting though regularization. The goal will be to minimize the number of non-zero coefficients, and avoid large coefficients that place too much faith in any one classifier in the ensemble." (Steven S Skiena, "The Data Science Design Manual", 2017)

"There are other problems with Big Data. In any large data set, there are bound to be inconsistencies, misclassifications, missing data - in other words, errors, blunders, and possibly lies. These problems with individual items occur in any data set, but they are often hidden in a large mass of numbers even when these numbers are generated out of computer interactions." (David S Salsburg, "Errors, Blunders, and Lies: How to Tell the Difference", 2017)

Data Management: Data Fabric (Definitions)

"Enterprise data fabric (EDF) is a data layer that separates data sources from applications, providing the means to solve the gridlock prevalent in distributed environments such as grid computing, service-oriented architecture (SOA) and event-driven architecture (EDA)." (Information Management, 2010)

"A data fabric is an emerging data management and data integration design concept for attaining flexible, reusable and augmented data integration pipelines, services and semantics, in support of various operational and analytics use cases delivered across multiple deployment and orchestration platforms." (Jacob O Lund, "Demystifying the Data Fabric", 2020)

"A data fabric is a data management architecture that can optimize access to distributed data and intelligently curate and orchestrate it for self-service delivery to data consumers." (IBM, "Data Fabric", 2021) [source]

"A data fabric is a modern, distributed data architecture that includes shared data assets and optimized data management and integration processes that you can use to address today’s data challenges in a unified way." (Alice LaPlante, "Data Fabric as Modern Data Architecture", 2021)

"A data fabric is an emerging data management design for attaining flexible and reusable data integration pipelines, services and semantics. A data fabric supports various operational and analytics use cases delivered across multiple deployment and orchestration platforms. Data fabrics support a combination of different data integration styles and leverage active metadata, knowledge graphs, semantics and ML to automate and enhance data integration design and delivery." (Ehtisham Zaidi, "Data Fabric", Gartner's Hype Cycle for Data Management, 2021)

"Is a distributed Data Management platform whose objective is to combine various types of data storage, access, preparation, analytics, and security tools in a fully compliant manner to support seamless Data Management." (Michelle Knight, "What Is a Data Fabric?", 2021)

"A data fabric is a customized combination of architecture and technology. It uses dynamic data integration and orchestration to connect different locations, sources, and types of data. With the right structures and flows as defined within the data fabric platform, companies can quickly access and share data regardless of where it is or how it was generated." (SAP)

"A data fabric is a distributed, memory-based data management platform that uses cluster-wide resources - memory, CPU, network bandwidth, and optionally local disk – to manage application data and application logic (behavior). The data fabric uses dynamic replication and data partitioning techniques to offer continuous availability, very high performance, and linear scalability for data intensive applications, all without compromising on data consistency even when exposed to failure conditions." (VMware)

"A Data Fabric is a technology utilization and implementation design capable of multiple outputs and applied uses." (Gartner)

"A data fabric is an architecture and set of data services that provide consistent capabilities across a choice of endpoints spanning hybrid multicloud environments." (NetApp) [source]

"Data fabric is an end-to-end data integration and management solution, consisting of architecture, data management and integration software, and shared data that helps organizations manage their data. A data fabric provides a unified, consistent user experience and access to data for any member of an organization worldwide and in real-time." (Tibco) [source]
