28 December 2018

🔭Data Science: Statistics' (Mis)usage (Just the Quotes)

"A witty statesman said, you might prove anything by figures." (Thomas Carlyle, "Chartism", 1840)

"It is difficult to understand why statisticians commonly limit their inquiries to Averages, and do not revel in more comprehensive views. Their souls seem as dull to the charm of variety as that of the native of one of our flat English counties, whose retrospect of Switzerland was that, if its mountains could be thrown into its lakes, two nuisances would be got rid of at once. An Average is but a solitary fact, whereas if a single other fact be added to it, an entire Normal Scheme, which nearly corresponds to the observed one, starts potentially into existence." (Sir Francis Galton, "Natural Inheritance", 1889)

"No doubt statistics can be easily misinterpreted; and are often very misleading when first applied to new problems. But many of the worst fallacies involved in the misapplications of statistics are definite and can be definitely exposed, till at last no one ventures to repeat them even when addressing an uninstructed audience: and on the whole arguments which can be reduced to statistical forms, though still in a backward condition, are making more sure and more rapid advances than any others towards obtaining the general acceptance of all who have studied the subjects to which they refer." (Alfred Marshall, "Principles of Economics", 1890)

"A statistical estimate may be good or bad, accurate or the reverse; but in almost all cases it is likely to be more accurate than a casual observer’s impression, and the nature of things can only be disproved by statistical methods." (Sir Arthur L Bowley, "Elements of Statistics", 1901)

"Some of the common ways of producing a false statistical argument are to quote figures without their context, omitting the cautions as to their incompleteness, or to apply them to a group of phenomena quite different to that to which they in reality relate; to take these estimates referring to only part of a group as complete; to enumerate the events favorable to an argument, omitting the other side; and to argue hastily from effect to cause, this last error being the one most often fathered on to statistics. For all these elementary mistakes in logic, statistics is held responsible." (Sir Arthur L Bowley, "Elements of Statistics", 1901)

"Statistics may, for instance, be called the science of counting. Counting appears at first sight to be a very simple operation, which any one can perform or which can be done automatically; but, as a matter of fact, when we come to large numbers, e.g., the population of the United Kingdom, counting is by no means easy, or within the power of an individual; limits of time and place alone prevent it being so carried out, and in no way can absolute accuracy be obtained when the numbers surpass certain limits." (Sir Arthur L Bowley, "Elements of Statistics", 1901)

"Figures may not lie, but statistics compiled unscientifically and analyzed incompetently are almost sure to be misleading, and when this condition is unnecessarily chronic the so-called statisticians may be called liars." (Edwin B Wilson, "Bulletin of the American Mathematical Society", Vol 18, 1912)

"Great discoveries which give a new direction to currents of thoughts and research are not, as a rule, gained by the accumulation of vast quantities of figures and statistics. These are apt to stifle and asphyxiate and they usually follow rather than precede discovery. The great discoveries are due to the eruption of genius into a closely related field, and the transfer of the precious knowledge there found to his own domain." (Theobald Smith, Boston Medical and Surgical Journal Volume 172, 1915)

"Of itself an arithmetic average is more likely to conceal than to disclose important facts; it is the nature of an abbreviation, and is often an excuse for laziness." (Arthur L Bowley, "The Nature and Purpose of the Measurement of Social Phenomena", 1915)

"Averages are like the economic man; they are inventions, not real. When applied to salaries they hide gaunt poverty at the lower end." (Julia Lathrop, 1919)

"A method is a dangerous thing unless its underlying philosophy is understood, and none more dangerous than the statistical. […] Over-attention to technique may actually blind one to the dangers that lurk about on every side- like the gambler who ruins himself with his system carefully elaborated to beat the game. In the long run it is only clear thinking, experienced methods, that win the strongholds of science." (Edwin B Wilson, "The Statistical Significance of Experimental Data", Science, Volume 58 (1493), 1923)

"[…] the methods of statistics are so variable and uncertain, so apt to be influenced by circumstances, that it is never possible to be sure that one is operating with figures of equal weight." (Havelock Ellis, "The Dance of Life", 1923)

"No human mind is capable of grasping in its entirety the meaning of any considerable quantity of numerical data." (Sir Ronald A Fisher, "Statistical Methods for Research Workers", 1925)

"The preliminary examination of most data is facilitated by the use of diagrams. Diagrams prove nothing, but bring outstanding features readily to the eye; they are therefore no substitutes for such critical tests as may be applied to the data, but are valuable in suggesting such tests, and in explaining the conclusions founded upon them." (Sir Ronald A Fisher, "Statistical Methods for Research Workers", 1925)

"Without an adequate understanding of the statistical methods, the investigator in the social sciences may be like the blind man groping in a dark room for a black cat that is not there. The methods of Statistics are useful in an over-widening range of human activities in any field of thought in which numerical data may be had." (Frederick E Croxton & Dudley J Cowden, "Practical Business Statistics", 1937)

"In earlier times they had no statistics and so they had to fall back on lies. Hence the huge exaggerations of primitive literature, giants, miracles, wonders! It's the size that counts. They did it with lies and we do it with statistics: but it's all the same." (Stephen Leacock, "Model memoirs and other sketches from simple to serious", 1939)

"It has long been recognized by public men of all kinds […] that statistics come under the head of lying, and that no lie is so false or inconclusive as that which is based on statistics." (Hilaire Belloc, "The Silence of the Sea", 1940)

"The enthusiastic use of statistics to prove one side of a case is not open to criticism providing the work is honestly and accurately done, and providing the conclusions are not broader than indicated by the data. This type of work must not be confused with the unfair and dishonest use of both accurate and inaccurate data, which too commonly occurs in business. Dishonest statistical work usually takes the form of: (1) deliberate misinterpretation of data; (2) intentional making of overestimates or underestimates; and (3) biasing results by using partial data, making biased surveys, or using wrong statistical methods." (John R Riggleman & Ira N Frisbee, "Business Statistics", 1951)

"By the laws of statistics we could probably approximate just how unlikely it is that it would happen. But people forget - especially those who ought to know better, such as yourself - that while the laws of statistics tell you how unlikely a particular coincidence is, they state just as firmly that coincidences do happen." (Robert A Heinlein, "The Door Into Summer", 1957)

"The statistics themselves prove nothing; nor are they at any time a substitute for logical thinking. There are […] many simple but not always obvious snags in the data to contend with. Variations in even the simplest of figures may conceal a compound of influences which have to be taken into account before any conclusions are drawn from the data." (Alfred R Ilersic, "Statistics", 1959)

"Many people use statistics as a drunkard uses a street lamp - for support rather than illumination. It is not enough to avoid outright falsehood; one must be on the alert to detect possible distortion of truth. One can hardly pick up a newspaper without seeing some sensational headline based on scanty or doubtful data." (Anna C Rogers, "Graphic Charts Handbook", 1961)

"Myth is more individual and expresses life more precisely than does science. Science works with concepts of averages which are far too general to do justice to the subjective variety of an individual life." (Carl G Jung, "Memories, Dreams, Reflections", 1963)

"It has been said that data collection is like garbage collection: before you collect it you should have in mind what you are going to do with it." (Russell Fox et al, "The Science of Science", 1964)

"[…] statistical techniques are tools of thought, and not substitutes for thought." (Abraham Kaplan, "The Conduct of Inquiry", 1964)

"He who accepts statistics indiscriminately will often be duped unnecessarily. But he who distrusts statistics indiscriminately will often be ignorant unnecessarily. There is an accessible alternative between blind gullibility and blind distrust. It is possible to interpret statistics skillfully. The art of interpretation need not be monopolized by statisticians, though, of course, technical statistical knowledge helps. Many important ideas of technical statistics can be conveyed to the non-statistician without distortion or dilution. Statistical interpretation depends not only on statistical ideas but also on ordinary clear thinking. Clear thinking is not only indispensable in interpreting statistics but is often sufficient even in the absence of specific statistical knowledge. For the statistician not only death and taxes but also statistical fallacies are unavoidable. With skill, common sense, patience and above all objectivity, their frequency can be reduced and their effects minimised. But eternal vigilance is the price of freedom from serious statistical blunders." (W Allen Wallis & Harry V Roberts, "The Nature of Statistics", 1965)

"The manipulation of statistical formulas is no substitute for knowing what one is doing." (Hubert M Blalock Jr., "Social Statistics" 2nd Ed., 1972)

"Confidence in the omnicompetence of statistical reasoning grows by what it feeds on." (Harry Hopkins, "The Numbers Game: The Bland Totalitarianism", 1973)

"Probably one of the most common misuses (intentional or otherwise) of a graph is the choice of the wrong scale - wrong, that is, from the standpoint of accurate representation of the facts. Even though not deliberate, selection of a scale that magnifies or reduces - even distorts - the appearance of a curve can mislead the viewer." (Peter H Selby, "Interpreting Graphs and Tables", 1976)

"No matter how much reverence is paid to anything purporting to be ‘statistics’, the term has no meaning unless the source, relevance, and truth are all checked." (Tom Burnam, "The Dictionary of Misinformation", 1975)

"Crude measurement usually yields misleading, even erroneous conclusions no matter how sophisticated a technique is used." (Henry T Reynolds, "Analysis of Nominal Data", 1977)

"Graphs are used to meet the need to condense all the available information into a more usable quantity. The selection process of combining and condensing will inevitably produce a less than complete study and will lead the user in certain directions, producing a potential for misleading." (Anker V Andersen, "Graphing Financial Information: How accountants can use graphs to communicate", 1983)

"It is all too easy to notice the statistical sea that supports our thoughts and actions. If that sea loses its buoyancy, it may take a long time to regain the lost support." (William Kruskal, "Coordination Today: A Disaster or a Disgrace", The American Statistician, Vol. 37, No. 3, 1983)

"There are two kinds of misrepresentation. In one. the numerical data do not agree with the data in the graph, or certain relevant data are omitted. This kind of misleading presentation. while perhaps hard to determine, clearly is wrong and can be avoided. In the second kind of misrepresentation, the meaning of the data is different to the preparer and to the user." (Anker V Andersen, "Graphing Financial Information: How accountants can use graphs to communicate", 1983)

"’Common sense’ is not common but needs to [be] learnt systematically […]. A ‘simple analysis’ can be harder than it looks […]. All statistical techniques, however sophisticated, should be subordinate to subjective judgment." (Christopher Chatfield, "The Initial Examination of Data", Journal of The Royal Statistical Society, Series A, Vol. 148, 1985)

"The combination of some data and an aching desire for an answer does not ensure that a reasonable answer can be extracted from a given body of data." (John Tukey, The American Statistician, 40 (1), 1986)

"Beware of the problem of testing too many hypotheses; the more you torture the data, the more likely they are to confess, but confessions obtained under duress may not be admissible in the court of scientific opinion." (Stephen M. Stigler, "Neutral Models in Biology", 1987)

"[In statistics] you have the fact that the concepts are not very clean. The idea of probability, of randomness, is not a clean mathematical idea. You cannot produce random numbers mathematically. They can only be produced by things like tossing dice or spinning a roulette wheel. With a formula, any formula, the number you get would be predictable and therefore not random. So as a statistician you have to rely on some conception of a world where things happen in some way at random, a conception which mathematicians don’t have." (Lucien LeCam, [interview] 1988)

"Torture numbers, and they will confess to anything." (Gregg Easterbrook, "New Republic", 1989)

"Statistics is a very powerful and persuasive mathematical tool. People put a lot of faith in printed numbers. It seems when a situation is described by assigning it a numerical value, the validity of the report increases in the mind of the viewer. It is the statistician's obligation to be aware that data in the eyes of the uninformed or poor data in the eyes of the naive viewer can be as deceptive as any falsehoods." (Theoni Pappas, "More Joy of Mathematics: Exploring mathematical insights & concepts", 1991)

"When looking at the end result of any statistical analysis, one must be very cautious not to over interpret the data. Care must be taken to know the size of the sample, and to be certain the method for gathering information is consistent with other samples gathered. […] No one should ever base conclusions without knowing the size of the sample and how random a sample it was. But all too often such data is not mentioned when the statistics are given - perhaps it is overlooked or even intentionally omitted." (Theoni Pappas, "More Joy of Mathematics: Exploring mathematical insights & concepts", 1991)

"[…] an honest exploratory study should indicate how many comparisons were made […] most experts agree that large numbers of comparisons will produce apparently statistically significant findings that are actually due to chance. The data torturer will act as if every positive result confirmed a major hypothesis. The honest investigator will limit the study to focused questions, all of which make biologic sense. The cautious reader should look at the number of ‘significant’ results in the context of how many comparisons were made." (James L Mills, "Data torturing", New England Journal of Medicine, 1993)

"Fairy tales lie just as much as statistics do, but sometimes you can find a grain of truth in them." (Sergei Lukyanenko, "The Night Watch", 1998)

"Averages, ranges, and histograms all obscure the time-order for the data. If the time-order for the data shows some sort of definite pattern, then the obscuring of this pattern by the use of averages, ranges, or histograms can mislead the user. Since all data occur in time, virtually all data will have a time-order. In some cases this time-order is the essential context which must be preserved in the presentation." (Donald J Wheeler," Understanding Variation: The Key to Managing Chaos" 2nd Ed., 2000)

"No comparison between two values can be global. A simple comparison between the current figure and some previous value and convey the behavior of any time series. […] While it is simple and easy to compare one number with another number, such comparisons are limited and weak. They are limited because of the amount of data used, and they are weak because both of the numbers are subject to the variation that is inevitably present in weak world data. Since both the current value and the earlier value are subject to this variation, it will always be difficult to determine just how much of the difference between the values is due to variation in the numbers, and how much, if any, of the difference is due to real changes in the process." (Donald J Wheeler, "Understanding Variation: The Key to Managing Chaos" 2nd Ed., 2000)

"Without meaningful data there can be no meaningful analysis. The interpretation of any data set must be based upon the context of those data. Unfortunately, much of the data reported to executives today are aggregated and summed over so many different operating units and processes that they cannot be said to have any context except a historical one - they were all collected during the same time period. While this may be rational with monetary figures, it can be devastating to other types of data." (Donald J Wheeler, "Understanding Variation: The Key to Managing Chaos" 2nd Ed., 2000)

"Since the average is a measure of location, it is common to use averages to compare two data sets. The set with the greater average is thought to ‘exceed’ the other set. While such comparisons may be helpful, they must be used with caution. After all, for any given data set, most of the values will not be equal to the average." (Donald J Wheeler, "Understanding Variation: The Key to Managing Chaos" 2nd Ed., 2000)

"Innumeracy - widespread confusion about basic mathematical ideas - means that many statistical claims about social problems don't get the critical attention they deserve. This is not simply because an innumerate public is being manipulated by advocates who cynically promote inaccurate statistics. Often, statistics about social problems originate with sincere, well-meaning people who are themselves innumerate; they may not grasp the full implications of what they are saying. Similarly, the media are not immune to innumeracy; reporters commonly repeat the figures their sources give them without bothering to think critically about them." (Joel Best, "Damned Lies and Statistics: Untangling Numbers from the Media, Politicians, and Activists", 2001)

"Not all statistics start out bad, but any statistic can be made worse. Numbers - even good numbers - can be misunderstood or misinterpreted. Their meanings can be stretched, twisted, distorted, or mangled. These alterations create what we can call mutant statistics - distorted versions of the original figures." (Joel Best, "Damned Lies and Statistics: Untangling Numbers from the Media, Politicians, and Activists", 2001)

"The ease with which somewhat complex statistics can produce confusion is important, because we live in a world in which complex numbers are becoming more common. Simple statistical ideas - fractions, percentages, rates - are reasonably well understood by many people. But many social problems involve complex chains of cause and effect that can be understood only through complicated models developed by experts. [...] environment has an influence. Sorting out the interconnected causes of these problems requires relatively complicated statistical ideas - net additions, odds ratios, and the like. If we have an imperfect understanding of these ideas, and if the reporters and other people who relay the statistics to us share our confusion - and they probably do - the chances are good that we'll soon be hearing - and repeating, and perhaps making decisions on the basis of - mutated statistics." (Joel Best, "Damned Lies and Statistics: Untangling Numbers from the Media, Politicians, and Activists", 2001)

"While some social problems statistics are deliberate deceptions, many - probably the great majority - of bad statistics are the result of confusion, incompetence, innumeracy, or selective, self-righteous efforts to produce numbers that reaffirm principles and interests that their advocates consider just and right. The best response to stat wars is not to try and guess who's lying or, worse, simply to assume that the people we disagree with are the ones telling lies. Rather, we need to watch for the standard causes of bad statistics - guessing, questionable definitions or methods, mutant numbers, and inappropriate comparisons." (Joel Best, "Damned Lies and Statistics: Untangling Numbers from the Media, Politicians, and Activists", 2001)

"This is true only if you torture the statistics until they produce the confession you want." (Larry Schweikart, "Myths of the 1980s Distort Debate over Tax Cuts", 2001) [source]

"Every number has its limitations; every number is a product of choices that inevitably involve compromise. Statistics are intended to help us summarize, to get an overview of part of the world’s complexity. But some information is always sacrificed in the process of choosing what will be counted and how. Something is, in short, always missing. In evaluating statistics, we should not forget what has been lost, if only because this helps us understand what we still have." (Joel Best, "More Damned Lies and Statistics: How numbers confuse public issues", 2004)

"In short, some numbers are missing from discussions of social issues because certain phenomena are hard to quantify, and any effort to assign numeric values to them is subject to debate. But refusing to somehow incorporate these factors into our calculations creates its own hazards. The best solution is to acknowledge the difficulties we encounter in measuring these phenomena, debate openly, and weigh the options as best we can." (Joel Best, "More Damned Lies and Statistics : How numbers confuse public issues", 2004)

"Another way to obscure the truth is to hide it with relative numbers. […] Relative scales are always given as percentages or proportions. An increase or decrease of a given percentage only tells us part of the story, however. We are missing the anchoring of absolute values." (Brian Suda, "A Practical Guide to Designing with Data", 2010)

"A sin of omission – leaving something out – is a strong one and not always recognized; itʼs hard to ask for something you donʼt know is missing. When looking into the data, even before it is graphed and charted, there is potential for abuse. Simply not having all the data or the correct data before telling your story can cause problems and unhappy endings." (Brian Suda, "A Practical Guide to Designing with Data", 2010)

"The omission of zero magnifies the ups and downs in the data, allowing us to detect changes that might otherwise be ambiguous. However, once zero has been omitted, the graph is no longer an accurate guide to the magnitude of the changes. Instead, we need to look at the actual numbers." (Gary Smith, "Standard Deviations", 2014)

"The search for better numbers, like the quest for new technologies to improve our lives, is certainly worthwhile. But the belief that a few simple numbers, a few basic averages, can capture the multifaceted nature of national and global economic systems is a myth. Rather than seeking new simple numbers to replace our old simple numbers, we need to tap into both the power of our information age and our ability to construct our own maps of the world to answer the questions we need answering." (Zachary Karabell, "The Leading Indicators: A short history of the numbers that rule our world", 2014)

"Even properly done statistics can’t be trusted. The plethora of available statistical techniques and analyses grants researchers an enormous amount of freedom when analyzing their data, and it is trivially easy to ‘torture the data until it confesses’." (Alex Reinhart, "Statistics Done Wrong: The Woefully Complete Guide", 2015)

"GIGO is a famous saying coined by early computer scientists: garbage in, garbage out. At the time, people would blindly put their trust into anything a computer output indicated because the output had the illusion of precision and certainty. If a statistic is composed of a series of poorly defined measures, guesses, misunderstandings, oversimplifications, mismeasurements, or flawed estimates, the resulting conclusion will be flawed." (Daniel J Levitin, "Weaponized Lies", 2017)

"Most of us have difficulty figuring probabilities and statistics in our heads and detecting subtle patterns in complex tables of numbers. We prefer vivid pictures, images, and stories. When making decisions, we tend to overweight such images and stories, compared to statistical information. We also tend to misunderstand or misinterpret graphics." (Daniel J Levitin, "Weaponized Lies", 2017)

"If we don’t understand the statistics, we’re likely to be badly mistaken about the way the world is. It is all too easy to convince ourselves that whatever we’ve seen with our own eyes is the whole truth; it isn’t. Understanding causation is tough even with good statistics, but hopeless without them. [...] And yet, if we understand only the statistics, we understand little. We need to be curious about the world that we see, hear, touch, and smell, as well as the world we can examine through a spreadsheet." (Tim Harford, "The Data Detective: Ten easy rules to make sense of statistics", 2020)

"Do not put faith in what statistics say until you have carefully considered what they do not say." (William W Watt)

"Errors using inadequate data are much less than those using no data at all." (Charles Babbage)

"Facts are stubborn things, but statistics are pliable." (Mark Twain)

"I can prove anything by statistics except the truth." (George Canning

"If the statistics are boring, you've got the wrong numbers." (Edward Tufte)

"If your experiment needs statistics, you ought to have done a better experiment." (Ernest Rutherford)

"It is easy to lie with statistics. It is hard to tell the truth without it." (Andrejs Dunkels)

27 December 2018

🔭Data Science: Experiment (Just the Quotes)

"Those who have not imbibed the prejudices of philosophers, are easily convinced that natural knowledge is to be founded on experiment and observation." (Colin Maclaurin, "An Account of Sir Isaac Newton’s Philosophical Discoveries", 1748)

"We have three principal means: observation of nature, reflection, and experiment. Observation gathers the facts reflection combines them, experiment verifies the result of the combination. It is essential that the observation of nature be assiduous, that reflection be profound, and that experimentation be exact. Rarely does one see these abilities in combination. And so, creative geniuses are not common." (Denis Diderot, "On the Interpretation of Nature", 1753)

"Facts, observations, experiments - these are the materials of a great edifice, but in assembling them we must combine them into classes, distinguish which belongs to which order and to which part of the whole each pertains." (Antoine L Lavoisier, "Mémoires de l’Académie Royale des Sciences", 1777)

"The art of drawing conclusions from experiments and observations consists in evaluating probabilities and in estimating whether they are sufficiently great or numerous enough to constitute proofs. This kind of calculation is more complicated and more difficult than it is commonly thought to be […]" (Antoine-Laurent Lavoisier, cca. 1790)

"We must trust to nothing but facts: These are presented to us by Nature, and cannot deceive. We ought, in every instance, to submit our reasoning to the test of experiment, and never to search for truth but by the natural road of experiment and observation." (Antoin-Laurent de Lavoisiere, "Elements of Chemistry", 1790)

"Conjecture may lead you to form opinions, but it cannot produce knowledge. Natural philosophy must be built upon the phenomena of nature discovered by observation and experiment." (George Adams, "Lectures on Natural and Experimental Philosophy" Vol. 1, 1794)

"[Precision] is the very soul of science; and its attainment afford the only criterion, or at least the best, of the truth of theories, and the correctness of experiments." (John F W Herschel, "A Preliminary Discourse on the Study of Natural Philosophy", 1830)

"The hypothesis, by suggesting observations and experiments, puts us upon the road to that independent evidence if it be really attainable; and till it be attained, the hypothesis ought not to count for more than a suspicion." (John S Mill, "A System of Logic, Ratiocinative and Inductive", 1843)

"The framing of hypotheses is, for the enquirer after truth, not the end, but the beginning of his work. Each of his systems is invented, not that he may admire it and follow it into all its consistent consequences, but that he may make it the occasion of a course of active experiment and observation. And if the results of this process contradict his fundamental assumptions, however ingenious, however symmetrical, however elegant his system may be, he rejects it without hesitation. He allows no natural yearning for the offspring of his own mind to draw him aside from the higher duty of loyalty to his sovereign, Truth, to her he not only gives his affections and his wishes, but strenuous labour and scrupulous minuteness of attention." (William Whewell, "Philosophy of the Inductive Sciences" Vol. 2, 1847)

"An anticipative idea or an hypothesis is, then, the necessary starting point for all experimental reasoning. Without it, we could not make any investigation at all nor learn anything; we could only pile up sterile observations. If we experiment without a preconceived idea, we should move at random […]" (Claude Bernard, "An Introduction to the Study of Experimental Medicine", 1865)

"Isolated facts and experiments have in themselves no value, however great their number may be. They only become valuable in a theoretical or practical point of view when they make us acquainted with the law of a series of uniformly recurring phenomena, or, it may be, only give a negative result showing an incompleteness in our knowledge of such a law, till then held to be perfect." (Hermann von Helmholtz, "The Aim and Progress of Physical Science", 1869)

"It is surprising to learn the number of causes of error which enter into the simplest experiment, when we strive to attain rigid accuracy." (William S Jevons, "The Principles of Science: A Treatise on Logic and Scientific Method", 1874)

"A discoverer is a tester of scientific ideas; he must not only be able to imagine likely hypotheses, and to select suitable ones for investigation, but, as hypotheses may be true or untrue, he must also be competent to invent appropriate experiments for testing them, and to devise the requisite apparatus and arrangements." (George Gore, "The Art of Scientific Discovery", 1878)

"Even one well-made observation will be enough in many cases, just as one well-constructed experiment often suffices for the establishment of a law." (Émile Durkheim, "The Rules of Sociological Method", "The Rules of Sociological Method", 1895)

"Every experiment, every observation has, besides its immediate result, effects which, in proportion to its value, spread always on all sides into ever distant parts of knowledge." (Sir Michael Foster, "Annual Report of the Board of Regents of the Smithsonian Institution", 1898)

"If the number of experiments be very large, we may have precise information as to the value of the mean, but if our sample be small, we have two sources of uncertainty: (I) owing to the 'error of random sampling' the mean of our series of experiments deviates more or less widely from the mean of the population, and (2) the sample is not sufficiently large to determine what is the law of distribution of individuals." William S Gosset, "The Probable Error of a Mean", Biometrika, 1908)

"An experiment is an observation that can be repeated, isolated and varied. The more frequently you can repeat an observation, the more likely are you to see clearly what is there and to describe accurately what you have seen. The more strictly you can isolate an observation, the easier does your task of observation become, and the less danger is there of your being led astray by irrelevant circumstances, or of placing emphasis on the wrong point. The more widely you can vary an observation, the more clearly will be the uniformity of experience stand out, and the better is your chance of discovering laws." (Edward B Titchener, "A Text-Book of Psychology", 1909)

"Theory is the best guide for experiment - that were it not for theory and the problems and hypotheses that come out of it, we would not know the points we wanted to verify, and hence would experiment aimlessly" (Henry Hazlitt,  "Thinking as a Science", 1916)

"A scientist, whether theorist or experimenter, puts forward statements, or systems of statements, and tests them step by step. In the field of the empirical sciences, more particularly, he constructs hypotheses, or systems of theories, and tests them against experience by observation and experiment." (Karl R Popper, "The Logic of Scientific Discovery", 1934)

"While it is true that theory often sets difficult, if not impossible tasks for the experiment, it does, on the other hand, often lighten the work of the experimenter by disclosing cogent relationships which make possible the indirect determination of inaccessible quantities and thus render difficult measurements unnecessary." (Georg Joos, "Theoretical Physics", 1934)

"In relation to any experiment we may speak of this hypothesis as the null hypothesis, and it should be noted that the null hypothesis is never proved or established, but is possibly disproved, in the course of experimentation. Every experiment may be said to exist only in order to give the facts a chance of disproving the null hypothesis." (Ronald Fisher, "The Design of Experiments", 1935)

"Statistics is a scientific discipline concerned with collection, analysis, and interpretation of data obtained from observation or experiment. The subject has a coherent structure based on the theory of Probability and includes many different procedures which contribute to research and development throughout the whole of Science and Technology." (Egon Pearson, 1936)

"Experiment as compared with mere observation has some of the characteristics of cross-examining nature rather than merely overhearing her." (Alan Gregg, "The Furtherance of Medical Research", 1941)

"The well-known virtue of the experimental method is that it brings situational variables under tight control. It thus permits rigorous tests of hypotheses and confidential statements about causation. The correlational method, for its part, can study what man has not learned to control. Nature has been experimenting since the beginning of time, with a boldness and complexity far beyond the resources of science. The correlator’s mission is to observe and organize the data of nature’s experiments." (Lee J Cronbach, "The Two Disciplines of Scientific Psychology", The American Psychologist Vol. 12, 1957)

"A satisfactory prediction of the sequential properties of learning data from a single experiment is by no means a final test of a model. Numerous other criteria - and some more demanding - can be specified. For example, a model with specific numerical parameter values should be invariant to changes in independent variables that explicitly enter in the model." (Robert R Bush & Frederick Mosteller,"A Comparison of Eight Models?", Studies in Mathematical Learning Theory, 1959)

"Mathematical statistics provides an exceptionally clear example of the relationship between mathematics and the external world. The external world provides the experimentally measured distribution curve; mathematics provides the equation (the mathematical model) that corresponds to the empirical curve. The statistician may be guided by a thought experiment in finding the corresponding equation." (Marshall J Walker, "The Nature of Scientific Thought", 1963)

"Observation, reason, and experiment make up what we call the scientific method. (Richard Feynman, "Mainly mechanics, radiation, and heat", 1963)

"Science consists simply of the formulation and testing of hypotheses based on observational evidence; experiments are important where applicable, but their function is merely to simplify observation by imposing controlled conditions." (Henry L Batten, "Evolution of the Earth", 1971)

"In moving from conjecture to experimental data, (D), experiments must be designed which make best use of the experimenter's current state of knowledge and which best illuminate his conjecture. In moving from data to modified conjecture, (A), data must be analyzed so as to accurately present information in a manner which is readily understood by the experimenter." (George E P Box & George C Tjao, "Bayesian Inference in Statistical Analysis", 1973)

"Statistical methods are tools of scientific investigation. Scientific investigation is a controlled learning process in which various aspects of a problem are illuminated as the study proceeds. It can be thought of as a major iteration within which secondary iterations occur. The major iteration is that in which a tentative conjecture suggests an experiment, appropriate analysis of the data so generated leads to a modified conjecture, and this in turn leads to a new experiment, and so on." (George E P Box & George C Tjao, "Bayesian Inference in Statistical Analysis", 1973)

"A hypothesis is empirical or scientific only if it can be tested by experience. […] A hypothesis or theory which cannot be, at least in principle, falsified by empirical observations and experiments does not belong to the realm of science." (Francisco J Ayala, "Biological Evolution: Natural Selection or Random Walk", American Scientist, 1974)

"An experiment is a failure only when it also fails adequately to test the hypothesis in question, when the data it produces don't prove anything one way or the other." (Robert M Pirsig, "Zen and the Art of Motorcycle Maintenance", 1974)

"The essential function of a hypothesis consists in the guidance it affords to new observations and experiments, by which our conjecture is either confirmed or refuted." (Ernst Mach, "Knowledge and Error: Sketches on the Psychology of Enquiry", 1976)

"Theoretical scientists, inching away from the safe and known, skirting the point of no return, confront nature with a free invention of the intellect. They strip the discovery down and wire it into place in the form of mathematical models or other abstractions that define the perceived relation exactly. The now-naked idea is scrutinized with as much coldness and outward lack of pity as the naturally warm human heart can muster. They try to put it to use, devising experiments or field observations to test its claims. By the rules of scientific procedure it is then either discarded or temporarily sustained. Either way, the central theory encompassing it grows. If the abstractions survive they generate new knowledge from which further exploratory trips of the mind can be planned. Through the repeated alternation between flights of the imagination and the accretion of hard data, a mutual agreement on the workings of the world is written, in the form of natural law." (Edward O Wilson, "Biophilia", 1984)

"The only touchstone for empirical truth is experiment and observation." (Heinz Pagels, "Perfect Symmetry: The Search for the Beginning of Time", 1985)

"Any physical theory is always provisional, in the sense that it is only a hypothesis: you can never prove it. No matter how many times the results of experiments agree with some theory, you can never be sure that the next time the result will not contradict the theory." (Stephen Hawking,  "A Brief History of Time", 1988)

"Scientists use mathematics to build mental universes. They write down mathematical descriptions - models - that capture essential fragments of how they think the world behaves. Then they analyse their consequences. This is called 'theory'. They test their theories against observations: this is called 'experiment'. Depending on the result, they may modify the mathematical model and repeat the cycle until theory and experiment agree. Not that it's really that simple; but that's the general gist of it, the essence of the scientific method." (Ian Stewart & Martin Golubitsky, "Fearful Symmetry: Is God a Geometer?", 1992)

"Clearly, science is not simply a matter of observing facts. Every scientific theory also expresses a worldview. Philosophical preconceptions determine where facts are sought, how experiments are designed, and which conclusions are drawn from them." (Nancy R Pearcey & Charles B. Thaxton, "The Soul of Science: Christian Faith and Natural Philosophy", 1994)

"Probability theory is an ideal tool for formalizing uncertainty in situations where class frequencies are known or where evidence is based on outcomes of a sufficiently long series of independent random experiments. Possibility theory, on the other hand, is ideal for formalizing incomplete information expressed in terms of fuzzy propositions." (George Klir, "Fuzzy sets and fuzzy logic", 1995)

"The methods of science include controlled experiments, classification, pattern recognition, analysis, and deduction. In the humanities we apply analogy, metaphor, criticism, and (e)valuation. In design we devise alternatives, form patterns, synthesize, use conjecture, and model solutions." (Béla H Bánáthy, "Designing Social Systems in a Changing World", 1996)

"[…] because observations are all we have, we take them seriously. We choose hard data and the framework of mathematics as our guides, not unrestrained imagination or unrelenting skepticism, and seek the simplest yet most wide-reaching theories capable of explaining and predicting the outcome of today’s and future experiments." (Brian Greene, "The Fabric of the Cosmos", 2004)

"In science, for a theory to be believed, it must make a prediction - different from those made by previous theories - for an experiment not yet done. For the experiment to be meaningful, we must be able to get an answer that disagrees with that prediction. When this is the case, we say that a theory is falsifiable - vulnerable to being shown false. The theory also has to be confirmable, it must be possible to verify a new prediction that only this theory makes. Only when a theory has been tested and the results agree with the theory do we advance the statement to the rank of a true scientific theory." (Lee Smolin, "The Trouble with Physics", 2006)

"Observation and experiment, without a rational hypothesis, is like a man groping at objects at random with his eyes shut." (Henry P Tappan, "Elements of Logic", 2015)

"The dialectical interplay of experiment and theory is a key driving force of modern science. Experimental data do only have meaning in the light of a particular model or at least a theoretical background. Reversely theoretical considerations may be logically consistent as well as intellectually elegant: Without experimental evidence they are a mere exercise of thought no matter how difficult they are. Data analysis is a connector between experiment and theory: Its techniques advise possibilities of model extraction as well as model testing with experimental data." (Achim Zielesny, "From Curve Fitting to Machine Learning" 2nd Ed., 2016)

"If your experiment needs statistics, you ought to have done a better experiment." (Ernest Rutherford)

More quotes on "Experiment" at the-web-of-knowledge.blogspot.com

26 December 2018

🔭Data Science: Uncertainty (Just the Quotes)

"If the number of experiments be very large, we may have precise information as to the value of the mean, but if our sample be small, we have two sources of uncertainty: (I) owing to the 'error of random sampling' the mean of our series of experiments deviates more or less widely from the mean of the population, and (2) the sample is not sufficiently large to determine what is the law of distribution of individuals." (William S Gosset, "The Probable Error of a Mean", Biometrika, 1908)

"The making of decisions, as everyone knows from personal experience, is a burdensome task. Offsetting the exhilaration that may result from correct and successful decision and the relief that follows the termination of a struggle to determine issues is the depression that comes from failure, or error of decision, and the frustration which ensues from uncertainty." (Chester I Barnard, "The Functions of the Executive", 1938)

"Uncertainty is introduced, however, by the impossibility of making generalizations, most of the time, which happens to all members of a class. Even scientific truth is a matter of probability and the degree of probability stops somewhere short of certainty." (Wayne C Minnick, "The Art of Persuasion", 1957)

"Statistics is a body of methods and theory applied to numerical evidence in making decisions in the face of uncertainty." (Lawrence Lapin, "Statistics for Modern Business Decisions", 1973)

"The most dominant decision type [that will have to be made in an organic organization] will be decisions under uncertainty." (Henry L Tosi & Stephen J Carroll, "Management", 1976)

"The greater the uncertainty, the greater the amount of decision making and information processing. It is hypothesized that organizations have limited capacities to process information and adopt different organizing modes to deal with task uncertainty. Therefore, variations in organizing modes are actually variations in the capacity of organizations to process information and make decisions about events which cannot be anticipated in advance." (John K Galbraith, "Organization Design", 1977)

"Probability is the mathematics of uncertainty. Not only do we constantly face situations in which there is neither adequate data nor an adequate theory, but many modem theories have uncertainty built into their foundations. Thus learning to think in terms of probability is essential. Statistics is the reverse of probability (glibly speaking). In probability you go from the model of the situation to what you expect to see; in statistics you have the observations and you wish to estimate features of the underlying model." (Richard W Hamming, "Methods of Mathematics Applied to Calculus, Probability, and Statistics", 1985)

"Probability plays a central role in many fields, from quantum mechanics to information theory, and even older fields use probability now that the presence of 'noise' is officially admitted. The newer aspects of many fields start with the admission of uncertainty." (Richard Hamming, "Methods of Mathematics Applied to Calculus, Probability, and Statistics", 1985)

"Models are often used to decide issues in situations marked by uncertainty. However statistical differences from data depend on assumptions about the process which generated these data. If the assumptions do not hold, the inferences may not be reliable either. This limitation is often ignored by applied workers who fail to identify crucial assumptions or subject them to any kind of empirical testing. In such circumstances, using statistical procedures may only compound the uncertainty." (David A Greedman & William C Navidi, "Regression Models for Adjusting the 1980 Census", Statistical Science Vol. 1 (1), 1986)

"The mathematical theories generally called 'mathematical theories of chance' actually ignore chance, uncertainty and probability. The models they consider are purely deterministic, and the quantities they study are, in the final analysis, no more than the mathematical frequencies of particular configurations, among all equally possible configurations, the calculation of which is based on combinatorial analysis. In reality, no axiomatic definition of chance is conceivable." (Maurice Allais, "An Outline of My Main Contributions to Economic Science", [Noble lecture] 1988)

"The worst, i.e., most dangerous, feature of 'accepting the null hypothesis' is the giving up of explicit uncertainty. [...] Mathematics can sometimes be put in such black-and-white terms, but our knowledge or belief about the external world never can." (John Tukey, "The Philosophy of Multiple Comparisons", Statistical Science Vol. 6 (1), 1991)

"In nonlinear systems - and the economy is most certainly nonlinear - chaos theory tells you that the slightest uncertainty in your knowledge of the initial conditions will often grow inexorably. After a while, your predictions are nonsense." (M Mitchell Waldrop, "Complexity: The Emerging Science at the Edge of Order and Chaos", 1992)

"Statistics as a science is to quantify uncertainty, not unknown." (Chamont Wang, "Sense and Nonsense of Statistical Inference: Controversy, Misuse, and Subtlety", 1993)

"There is a new science of complexity which says that the link between cause and effect is increasingly difficult to trace; that change (planned or otherwise) unfolds in non-linear ways; that paradoxes and contradictions abound; and that creative solutions arise out of diversity, uncertainty and chaos." (Andy P Hargreaves & Michael Fullan, "What’s Worth Fighting for Out There?", 1998)

"Information entropy has its own special interpretation and is defined as the degree of unexpectedness in a message. The more unexpected words or phrases, the higher the entropy. It may be calculated with the regular binary logarithm on the number of existing alternatives in a given repertoire. A repertoire of 16 alternatives therefore gives a maximum entropy of 4 bits. Maximum entropy presupposes that all probabilities are equal and independent of each other. Minimum entropy exists when only one possibility is expected to be chosen. When uncertainty, variety or entropy decreases it is thus reasonable to speak of a corresponding increase in information." (Lars Skyttner, "General Systems Theory: Ideas and Applications", 2001)

"Any scientific data without (a stated) uncertainty is of no avail. Therefore the analysis and description of uncertainty are almost as important as those of the data value itself . It should be clear that the uncertainty itself also has an uncertainty – due to its nature as a scientific quantity – and so on. The uncertainty of an uncertainty is generally not determined." (Manfred Drosg, "Dealing with Uncertainties: A Guide to Error Analysis", 2007)

"As uncertainties of scientific data values are nearly as important as the data values themselves, it is usually not acceptable that a best estimate is only accompanied by an estimated uncertainty. Therefore, only the size of nondominant uncertainties should be estimated. For estimating the size of a nondominant uncertainty we need to find its upper limit, i.e., we want to be as sure as possible that the uncertainty does not exceed a certain value." (Manfred Drosg, "Dealing with Uncertainties: A Guide to Error Analysis", 2007)

"Before best estimates are extracted from data sets by way of a regression analysis, the uncertainties of the individual data values must be determined.In this case care must be taken to recognize which uncertainty components are common to all the values, i.e., those that are correlated (systematic)." (Manfred Drosg, "Dealing with Uncertainties: A Guide to Error Analysis", 2007)

"Due to the theory that underlies uncertainties an infinite number of data values would be necessary to determine the true value of any quantity. In reality the number of available data values will be relatively small and thus this requirement can never be fully met; all one can get is the best estimate of the true value." (Manfred Drosg, "Dealing with Uncertainties: A Guide to Error Analysis", 2007)

"For linear dependences the main information usually lies in the slope. It is obvious that those points that lie far apart have the strongest influence on the slope if all points have the same uncertainty. In this context we speak of the strong leverage of distant points; when determining the parameter 'slope' these distant points carry more effective weight. Naturally, this weight is distinct from the 'statistical' weight usually used in regression analysis." (Manfred Drosg, "Dealing with Uncertainties: A Guide to Error Analysis", 2007)

"In error analysis the so-called 'chi-squared' is a measure of the agreement between the uncorrelated internal and the external uncertainties of a measured functional relation. The simplest such relation would be time independence. Theory of the chi-squared requires that the uncertainties be normally distributed. Nevertheless, it was found that the test can be applied to most probability distributions encountered in practice." (Manfred Drosg, "Dealing with Uncertainties: A Guide to Error Analysis", 2007)

"In many cases systematic errors are interpreted as the systematic difference between nature (which is being questioned by the experimenter in his experiment) and the model (which is used to describe nature). If the model used is not good enough, but the measurement result is interpreted using this model, the final result (the interpretation) will be wrong because it is biased, i.e., it has a systematic deviation (not uncertainty). If we do not use the best model (the best theory) available for the description of a certain phenomenon this procedure is just wrong. It has nothing to do with an uncertainty." (Manfred Drosg, "Dealing with Uncertainties: A Guide to Error Analysis", 2007)

"It is also inevitable for any model or theory to have an uncertainty (a difference between model and reality). Such uncertainties apply both to the numerical parameters of the model and to the inadequacy of the model as well. Because it is much harder to get a grip on these types of uncertainties, they are disregarded, usually." (Manfred Drosg, "Dealing with Uncertainties: A Guide to Error Analysis", 2007)

"It is important that uncertainty components that are independent of each other are added quadratically. This is also true for correlated uncertainty components, provided they are independent of each other, i.e., as long as there is no correlation between the components." (Manfred Drosg, "Dealing with Uncertainties: A Guide to Error Analysis", 2007)

"It is the nature of an uncertainty that it is not known and can never be known, whether the best estimate is greater or less than the true value." (Manfred Drosg, "Dealing with Uncertainties: A Guide to Error Analysis", 2007)

"Outliers or flyers are those data points in a set that do not quite fit within the rest of the data, that agree with the model in use. The uncertainty of such an outlier is seemingly too small. The discrepancy between outliers and the model should be subject to thorough examination and should be given much thought. Isolated data points, i.e., data points that are at some distance from the bulk of the data are not outliers if their values are in agreement with the model in use." (Manfred Drosg, "Dealing with Uncertainties: A Guide to Error Analysis", 2007)

"The fact that the same uncertainty (e.g., scale uncertainty) is uncorrelated if we are dealing with only one measurement, but correlated (i.e., systematic) if we look at more than one measurement using the same instrument shows that both types of uncertainties are of the same nature. Of course, an uncertainty keeps its characteristics (e.g., Poisson distributed), independent of the fact whether it occurs only once or more often." (Manfred Drosg, "Dealing with Uncertainties: A Guide to Error Analysis", 2007)

"To fulfill the requirements of the theory underlying uncertainties, variables with random uncertainties must be independent of each other and identically distributed. In the limiting case of an infinite number of such variables, these are called normally distributed. However, one usually speaks of normally distributed variables even if their number is finite." (Manfred Drosg, "Dealing with Uncertainties: A Guide to Error Analysis", 2007)

"In fact, H [entropy] measures the amount of uncertainty that exists in the phenomenon. If there were only one event, its probability would be equal to 1, and H would be equal to 0 - that is, there is no uncertainty about what will happen in a phenomenon with a single event because we always know what is going to occur. The more events that a phenomenon possesses, the more uncertainty there is about the state of the phenomenon. In other words, the more entropy, the more information." (Diego Rasskin-Gutman, "Chess Metaphors: Artificial Intelligence and the Human Mind", 2009)

"Data always vary randomly because the object of our inquiries, nature itself, is also random. We can analyze and predict events in nature with an increasing amount of precision and accuracy, thanks to improvements in our techniques and instruments, but a certain amount of random variation, which gives rise to uncertainty, is inevitable." (Alberto Cairo, "The Functional Art", 2011)

"The storytelling mind is allergic to uncertainty, randomness, and coincidence. It is addicted to meaning. If the storytelling mind cannot find meaningful patterns in the world, it will try to impose them. In short, the storytelling mind is a factory that churns out true stories when it can, but will manufacture lies when it can't." (Jonathan Gottschall, "The Storytelling Animal: How Stories Make Us Human", 2012)

"The data is a simplification - an abstraction - of the real world. So when you visualize data, you visualize an abstraction of the world, or at least some tiny facet of it. Visualization is an abstraction of data, so in the end, you end up with an abstraction of an abstraction, which creates an interesting challenge. […] Just like what it represents, data can be complex with variability and uncertainty, but consider it all in the right context, and it starts to make sense." (Nathan Yau, "Data Points: Visualization That Means Something", 2013)

"Without precise predictability, control is impotent and almost meaningless. In other words, the lesser the predictability, the harder the entity or system is to control, and vice versa. If our universe actually operated on linear causality, with no surprises, uncertainty, or abrupt changes, all future events would be absolutely predictable in a sort of waveless orderliness." (Lawrence K Samuels, "Defense of Chaos", 2013)

"The greater the uncertainty, the bigger the gap between what you can measure and what matters, the more you should watch out for overfitting - that is, the more you should prefer simplicity." (Brian Christian & Thomas L Griffiths, "Algorithms to Live By: The Computer Science of Human Decisions", 2016)

"A notable difference between many fields and data science is that in data science, if a customer has a wish, even an experienced data scientist may not know whether it’s possible. Whereas a software engineer usually knows what tasks software tools are capable of performing, and a biologist knows more or less what the laboratory can do, a data scientist who has not yet seen or worked with the relevant data is faced with a large amount of uncertainty, principally about what specific data is available and about how much evidence it can provide to answer any given question. Uncertainty is, again, a major factor in the data scientific process and should be kept at the forefront of your mind when talking with customers about their wishes."  (Brian Godsey, "Think Like a Data Scientist", 2017)

"The elements of this cloud of uncertainty (the set of all possible errors) can be described in terms of probability. The center of the cloud is the number zero, and elements of the cloud that are close to zero are more probable than elements that are far away from that center. We can be more precise in this definition by defining the cloud of uncertainty in terms of a mathematical function, called the probability distribution." (David S Salsburg, "Errors, Blunders, and Lies: How to Tell the Difference", 2017)

"Uncertainty is an adversary of coldly logical algorithms, and being aware of how those algorithms might break down in unusual circumstances expedites the process of fixing problems when they occur - and they will occur. A data scientist’s main responsibility is to try to imagine all of the possibilities, address the ones that matter, and reevaluate them all as successes and failures happen." (Brian Godsey, "Think Like a Data Scientist", 2017)

"Bootstrapping provides an intuitive, computer-intensive way of assessing the uncertainty in our estimates, without making strong assumptions and without using probability theory. But the technique is not feasible when it comes to, say, working out the margins of error on unemployment surveys of 100,000 people. Although bootstrapping is a simple, brilliant and extraordinarily effective idea, it is just too clumsy to bootstrap such large quantities of data, especially when a convenient theory exists that can generate formulae for the width of uncertainty intervals." (David Spiegelhalter, "The Art of Statistics: Learning from Data", 2019)

"Entropy is a measure of amount of uncertainty or disorder present in the system within the possible probability distribution. The entropy and amount of unpredictability are directly proportional to each other." (G Suseela & Y Asnath V Phamila, "Security Framework for Smart Visual Sensor Networks", 2019)

"Estimates based on data are often uncertain. If the data were intended to tell us something about a wider population (like a poll of voting intentions before an election), or about the future, then we need to acknowledge that uncertainty. This is a double challenge for data visualization: it has to be calculated in some meaningful way and then shown on top of the data or statistics without making it all too cluttered." (Robert Grant, "Data Visualization: Charts, Maps and Interactive Graphics", 2019)

"Uncertainty confuses many people because they have the unreasonable expectation that science and statistics will unearth precise truths, when all they can yield is imperfect estimates that can always be subject to changes and updates." (Alberto Cairo, "How Charts Lie", 2019)

"We over-fit when we go too far in adapting to local circumstances, in a worthy but misguided effort to be ‘unbiased’ and take into account all the available information. Usually we would applaud the aim of being unbiased, but this refinement means we have less data to work on, and so the reliability goes down. Over-fitting therefore leads to less bias but at a cost of more uncertainty or variation in the estimates, which is why protection against over-fitting is sometimes known as the bias/variance trade-off." (David Spiegelhalter, "The Art of Statistics: Learning from Data", 2019)

🔭Data Science: Causality (Just the Quotes)

"All human actions have one or more of these seven causes: chance, nature, compulsions, habit, reason, passion, desire." (Aristotle, 4th century BC)

"In all disciplines in which there is systematic knowledge of things with principles, causes, or elements, it arises from a grasp of those: we think we have knowledge of a thing when we have found its primary causes and principles, and followed it back to its elements." (Aristotle, "Physics", cca. 350 BC)

"Constantly regard the universe as one living being, having one substance and one soul; and observe how all things have reference to one perception, the perception of this one living being; and how all things act with one movement; and how all things are the cooperating causes of all things which exist; observe too the continuous spinning of the thread and the contexture of the web." (Marcus Aurelius, "Meditations". cca. 121–180 AD)

"The universal cause is one thing, a particular cause another. An effect can be haphazard with respect to the plan of the second, but not of the first. For an effect is not taken out of the scope of one particular cause save by another particular cause which prevents it, as when wood dowsed with water, will not catch fire. The first cause, however, cannot have a random effect in its own order, since all particular causes are comprehended in its causality. When an effect does escape from a system of particular causality, we speak of it as fortuitous or a chance happening […]" (Thomas Aquinas, “Summa Theologica”, cca. 1266-1273)

"All effects follow not with like certainty from their supposed causes." (David Hume, "An Enquiry Concerning Human Understanding", 1748)

"[…] chance, that is, an infinite number of events, with respect to which our ignorance will not permit us to perceive their causes, and the chain that connects them together. Now, this chance has a greater share in our education than is imagined. It is this that places certain objects before us and, in consequence of this, occasions more happy ideas, and sometimes leads us to the greatest discoveries […]" (Claude Adrien Helvetius, "On Mind", 1751)

"If an event can be produced by a number n of different causes, the probabilities of the existence of these causes, given the event (prises de l'événement), are to each other as the probabilities of the event, given the causes: and the probability of each cause is equal to the probability of the event, given that cause, divided by the sum of all the probabilities of the event, given each of the causes.” (Pierre-Simon Laplace, "Mémoire sur la Probabilité des Causes par les Événements", 1774)

"The word ‘chance’ then expresses only our ignorance of the causes of the phenomena that we observe to occur and to succeed one another in no apparent order. Probability is relative in part to this ignorance, and in part to our knowledge.” (Pierre-Simon Laplace, "Mémoire sur les Approximations des Formules qui sont Fonctions de Très Grands Nombres", 1783)

"Man’s mind cannot grasp the causes of events in their completeness, but the desire to find those causes is implanted in man’s soul. And without considering the multiplicity and complexity of the conditions any one of which taken separately may seem to be the cause, he snatches at the first approximation to a cause that seems to him intelligible and says: ‘This is the cause!’" (Leo Tolstoy, "War and Peace", 1867)

"There is a maxim which is often quoted, that ‘The same causes will always produce the same effects.’ To make this maxim intelligible we must define what we mean by the same causes and the same effects, since it is manifest that no event ever happens more that once, so that the causes and effects cannot be the same in all respects. [...] There is another maxim which must not be confounded with that quoted at the beginning of this article, which asserts ‘That like causes produce like effects’. This is only true when small variations in the initial circumstances produce only small variations in the final state of the system. In a great many physical phenomena this condition is satisfied; but there are other cases in which a small initial variation may produce a great change in the final state of the system, as when the displacement of the ‘points’ causes a railway train to run into another instead of keeping its proper course." (James C Maxwell, "Matter and Motion", 1876)

"'Causation' has been popularly used to express the condition of association, when applied to natural phenomena. There is no philosophical basis for giving it a wider meaning than partial or absolute association. In no case has it been proved that there is an inherent necessity in the laws of nature. Causation is correlation. [...] perfect correlation, when based upon sufficient experience, is causation in the scientific sense." (Henry E. Niles, "Correlation, Causation and Wright's Theory of 'Path Coefficients'", Genetics, 1922)

"To apply the category of cause and effect means to find out which parts of nature stand in this relation. Similarly, to apply the gestalt category means to find out which parts of nature belong as parts to functional wholes, to discover their position in these wholes, their degree of relative independence, and the articulation of larger wholes into sub-wholes." (Kurt Koffka, 1931)

"[...] the conception of chance enters in the very first steps of scientific activity in virtue of the fact that no observation is absolutely correct. I think chance is a more fundamental conception that causality; for whether in a concrete case, a cause-effect relation holds or not can only be judged by applying the laws of chance to the observation." (Max Born, 1949)

"There is no correlation between the cause and the effect. The events reveal only an aleatory determination, connected not so much with the imperfection of our knowledge as with the structure of the human world." (Raymond Aron, "The Opium of the Intellectuals", 1955)

"Nature is pleased with simplicity, and affects not the pomp of superfluous causes." (Morris Kline, "Mathematics and the Physical World", 1959) 

"Every part of the system is so related to every other part that a change in a particular part causes a changes in all other parts and in the total system." (Arthur D Hall, "A methodology for systems engineering", 1962)

"In complex systems cause and effect are often not closely related in either time or space. The structure of a complex system is not a simple feedback loop where one system state dominates the behavior. The complex system has a multiplicity of interacting feedback loops. Its internal rates of flow are controlled by nonlinear relationships. The complex system is of high order, meaning that there are many system states (or levels). It usually contains positive-feedback loops describing growth processes as well as negative, goal-seeking loops. In the complex system the cause of a difficulty may lie far back in time from the symptoms, or in a completely different and remote part of the system. In fact, causes are usually found, not in prior events, but in the structure and policies of the system." (Jay W Forrester, "Urban dynamics", 1969)

"We use mathematics and statistics to describe the diverse realms of randomness. From these descriptions, we attempt to glean insights into the workings of chance and to search for hidden causes. With such tools in hand, we seek patterns and relationships and propose predictions that help us make sense of the world."  (Ivars Peterson, "The Jungles of Randomness: A Mathematical Safari", 1998)

"The complexities of cause and effect defy analysis." (Douglas Adams, "Dirk Gently's Holistic Detective Agency", 1987)

"Until we can distinguish between an event that is truly random and an event that is the result of cause and effect, we will never know whether what we see is what we'll get, nor how we got what we got. When we take a risk, we are betting on an outcome that will result from a decision we have made, though we do not know for certain what the outcome will be. The essence of risk management lies in maximizing the areas where we have some control over the outcome while minimizing the areas where we have absolutely no control over the outcome and the linkage between effect and cause is hidden from us." (Peter L Bernstein, "Against the Gods: The Remarkable Story of Risk", 1996)

"Statistical models in the social sciences rely on correlations, generally not causes, of our behavior. It is inevitable that such models of reality do not capture reality well. This explains the excess of false positives and false negatives." (Kaiser Fung, "Numbersense: How To Use Big Data To Your Advantage", 2013)

"Statisticians set a high bar when they assign a cause to an effect. [...] A model that ignores cause–effect relationships cannot attain the status of a model in the physical sciences. This is a structural limitation that no amount of data - not even Big Data - can surmount." (Kaiser Fung, "Numbersense: How To Use Big Data To Your Advantage", 2013)

"Effects without an understanding of the causes behind them, on the other hand, are just bunches of data points floating in the ether, offering nothing useful by themselves. Big Data is information, equivalent to the patterns of light that fall onto the eye. Big Data is like the history of stimuli that our eyes have responded to. And as we discussed earlier, stimuli are themselves meaningless because they could mean anything. The same is true for Big Data, unless something transformative is brought to all those data sets… understanding." (Beau Lotto, "Deviate: The Science of Seeing Differently", 2017)

"Any time you run regression analysis on arbitrary real-world observational data, there’s a significant risk that there’s hidden confounding in your dataset and so causal conclusions from such analysis are likely to be (causally) biased." (Aleksander Molak, "Causal Inference and Discovery in Python", 2023)

"Expert knowledge is a term covering various types of knowledge that can help define or disambiguate causal relations between two or more variables. Depending on the context, expert knowledge might refer to knowledge from randomized controlled trials, laws of physics, a broad scope of experiences in a given area, and more." (Aleksander Molak, "Causal Inference and Discovery in Python", 2023)

"In summary, the relationship between different branches of contemporary machine learning and causality is nuanced. That said, most broadly adopted machine learning models operate on rung one, not having a causal world model." (Aleksander Molak, "Causal Inference and Discovery in Python", 2023)

"The basic goal of causal inference is to estimate the causal effect of one set of variables on another. In most cases, to do it accurately, we need to know which variables we should control for. [...] to accurately control for confounders, we need to go beyond the realm of pure statistics and use the information about the data-generating process, which can be encoded as a (causal) graph. In this sense, the ability to translate between graphical and statistical properties is central to causal inference." (Aleksander Molak, "Causal Inference and Discovery in Python", 2023)

"The causal interpretation of linear regression only holds when there are no spurious relationships in your data. This is the case in two scenarios: when you control for a set of all necessary variables (sometimes this set can be empty) or when your data comes from a properly designed randomized experiment." (Aleksander Molak, "Causal Inference and Discovery in Python", 2023)

"The first level of creativity [for evaluating causal models] is to use the refutation tests [...] The second level of creativity is available when you have access to historical data coming from randomized experiments. You can compare your observational model with the experimental results and try to adjust your model accordingly. The third level of creativity is to evaluate your modeling approach on simulated data with known outcomes. [...] The fourth level of creativity is sensitivity analysis." (Aleksander Molak, "Causal Inference and Discovery in Python", 2023)

"Although to penetrate into the intimate mysteries of nature and hence to learn the true causes of phenomena is not allowed to us, nevertheless it can happen that a certain fictive hypothesis may suffice for explaining many phenomena." (Leonhard Euler)

"Nature is pleased with simplicity, and affects not the pomp of superfluous causes." (Sir Issac Newton)

More quotes on "Causality" at the-web-of-knowledge.blogspot.com.

🔭Data Science: Precision (Just the Quotes)

"Simplicity and precision ought to be the characteristics of a scientific nomenclature: words should signify things, or the analogies of things, and not opinions." (Sir Humphry Davy, Elements of Chemical Philosophy", 1812)

"[Precision] is the very soul of science; and its attainment afford the only criterion, or at least the best, of the truth of theories, and the correctness of experiments." (John F W Herschel, "A Preliminary Discourse on the Study of Natural Philosophy", 1830)

"Numerical facts, like other facts, are but the raw materials of knowledge, upon which our reasoning faculties must be exerted in order to draw forth the principles of nature. [...] Numerical precision is the soul of science [...]" (William S Jevons, "The Principles of Science: A Treatise on Logic and Scientific Method", 1874)

"One is almost tempted to assert that quite apart from its intellectual mission, theory is the most practical thing conceivable, the quintessence of practice as it were, since the precision of its conclusions cannot be reached by any routine of estimating or trial and error; although given the hidden ways of theory, this will hold only for those who walk them with complete confidence." (Ludwig E Boltzmann, "On the Significance of Theories", 1890)

"Physical research by experimental methods is both a broadening and a narrowing field. There are many gaps yet to be filled, data to be accumulated, measurements to be made with great precision, but the limits within which we must work are becoming, at the same time, more and more defined." (Elihu Thomson, "Annual Report of the Board of Regents of the Smithsonian Institution", 1899)

"The apodictic quality of mathematical thought, the certainty and correctness of its conclusions, are due, not to a special mode of ratiocination, but to the character of the concepts with which it deals. What is that distinctive characteristic? I answer: precision, sharpness, completeness of definition. But how comes your mathematician by such completeness? There is no mysterious trick involved; some ideas admit of such precision, others do not; and the mathematician is one who deals with those that do." (Cassius J Keyser, "The Universe and Beyond", Hibbert Journal Vol. 3, 1904–1905)

"It is difficult to find an intelligible account of the meaning of ‘probability’, or of how we are ever to determine the probability of any particular proposition; and yet treatises on the subject profess to arrive at complicated results of the greatest precision and the most profound practical importance." (John M Keynes, "A Treatise on Probability", 1921)

"It is never possible to predict a physical occurrence with unlimited precision." (Max Planck, "A Scientific Autobiography", 1949)

"Precision is expressed by an international standard, viz., the standard error. It measures the average of the difference between a complete coverage and a long series of estimates formed from samples drawn from this complete coverage by a particular procedure or drawing, and processed by a particular estimating formula." (W Edwards Deming, "On the Presentation of the Results of Sample Surveys as Legal Evidence", Journal of the American Statistical Association Vol 49 (268), 1954)

"Scientists whose work has no clear, practical implications would want to make their decisions considering such things as: the relative worth of (1) more observations, (2) greater scope of his conceptual model, (3) simplicity, (4) precision of language, (5) accuracy of the probability assignment." (C West Churchman, "Costs, Utilities, and Values", 1956)

"The precision of a number is the degree of exactness with which it is stated, while the accuracy of a number is the degree of exactness with which it is known or observed. The precision of a quantity is reported by the number of significant figures in it." (Edmund C Berkeley & Lawrence Wainwright, Computers: Their Operation and Applications", 1956)

"The two most important characteristics of the language of statistics are first, that it describes things in quantitative terms, and second, that it gives this description an air of accuracy and precision." (Ely Devons, "Essays in Economics", 1961)

"We all know that in economic statistics particularly, true precision, comparability and accuracy is extremely difficult to achieve, and it is for this reason that the language of economic statistics is so difficult to handle." (Ely Devons, "Essays in Economics", 1961)

"It is of course desirable to work with manageable models which maximize generality, realism, and precision toward the overlapping but not identical goals of understanding, predicting, and modifying nature. But this cannot be done." (Richard Levins, "The strategy of model building in population biology", American Scientist Vol. 54 (4), 1966) 

"In general, complexity and precision bear an inverse relation to one another in the sense that, as the complexity of a problem increases, the possibility of analysing it in precise terms diminishes. Thus 'fuzzy thinking' may not be deplorable, after all, if it makes possible the solution of problems which are much too complex for precise analysis." (Lotfi A Zadeh, "Fuzzy languages and their relation to human intelligence", 1972)

"As the complexity of a system increases, our ability to make precise and yet significant statements about its behavior diminishes until a threshold is reached beyond which precision and significance (or relevance) become almost mutually exclusive characteristics." (Lotfi A Zadeh, 1973)

"Simplicity is worth buying if we do not have to pay too great a loss of precision for it." (George Pólya, "Mathematical Methods in Science", 1977)

"Computational reducibility may well be the exception rather than the rule: Most physical questions may be answerable only through irreducible amounts of computation. Those that concern idealized limits of infinite time, volume, or numerical precision can require arbitrarily long computations, and so be formally undecidable." (Stephen Wolfram, Undecidability and intractability in theoretical physics", Physical Review Letters 54 (8), 1985)

"Negative feedback only improves the precision of goal-seeking, but does not determine it. Feedback devices are only executive mechanisms that operate during the translation of a program." (Ernst Mayr, "Toward a New Philosophy of Biology: Observations of an Evolutionist", 1988)

"A mathematical model uses mathematical symbols to describe and explain the represented system. Normally used to predict and control, these models provide a high degree of abstraction but also of precision in their application." (Lars Skyttner, "General Systems Theory: Ideas and Applications", 2001)

"Precision does not vary linearly with increasing sample size. As is well known, the width of a confidence interval is a function of the square root of the number of observations. But it is more complicate than that. The basic elements determining a confidence interval are the sample size, an estimate of variability, and a pivotal variable associated with the estimate of variability." (Gerald van Belle, "Statistical Rules of Thumb", 2002)

"Statistics can certainly pronounce a fact, but they cannot explain it without an underlying context, or theory. Numbers have an unfortunate tendency to supersede other types of knowing. […] Numbers give the illusion of presenting more truth and precision than they are capable of providing." (Ronald J Baker, "Measure what Matters to Customers: Using Key Predictive Indicators", 2006)

"[myth:] Accuracy is more important than precision. For single best estimates, be it a mean value or a single data value, this question does not arise because in that case there is no difference between accuracy and precision. (Think of a single shot aimed at a target.) Generally, it is good practice to balance precision and accuracy. The actual requirements will differ from case to case." (Manfred Drosg, "Dealing with Uncertainties: A Guide to Error Analysis", 2007)

"Popular accounts of mathematics often stress the discipline’s obsession with certainty, with proof. And mathematicians often tell jokes poking fun at their own insistence on precision. However, the quest for precision is far more than an end in itself. Precision allows one to reason sensibly about objects outside of ordinary experience. It is a tool for exploring possibility: about what might be, as well as what is." (Donal O’Shea, “The Poincaré Conjecture”, 2007)

"Precision and recall are ways of monitoring the power of the machine learning implementation. Precision is a metric that monitors the percentage of true positives. […] Recall is the ratio of true positives to true positive plus false negatives." (Matthew Kirk, "Thoughtful Machine Learning", 2015)

"Repeated observations of the same phenomenon do not always produce the same results, due to random noise or error. Sampling errors result when our observations capture unrepresentative circumstances, like measuring rush hour traffic on weekends as well as during the work week. Measurement errors reflect the limits of precision inherent in any sensing device. The notion of signal to noise ratio captures the degree to which a series of observations reflects a quantity of interest as opposed to data variance. As data scientists, we care about changes in the signal instead of the noise, and such variance often makes this problem surprisingly difficult." (Steven S Skiena, "The Data Science Design Manual", 2017)

🔭Data Science: Weights (Just the Quotes)

"In many cases general probability samples can be thought of in terms of (1) a subdivision of the population into strata, (2) a self-weighting probability sample in each stratum, and (3) combination of the stratum sample means weighted by the size of the stratum." (Frederick Mosteller et al, "Principles of Sampling", Journal of the American Statistical Association Vol. 49 (265), 1954)

"Averaging results, whether weighted or not, needs to be done with due caution and commonsense. Even though a measurement has a small quoted error it can still be, not to put too fine a point on it, wrong. If two results are in blatant and obvious disagreement, any average is meaningless and there is no point in performing it. Other cases may be less outrageous, and it may not be clear whether the difference is due to incompatibility or just unlucky chance." (Roger J Barlow, "Statistics: A guide to the use of statistical methods in the physical sciences", 1989)

"An artificial neural network is an information-processing system that has certain performance characteristics in common with biological neural networks. Artificial neural networks have been developed as generalizations of mathematical models of human cognition or neural biology, based on the assumptions that: (1) Information processing occurs at many simple elements called neurons. (2) Signals are passed between neurons over connection links. (3) Each connection link has an associated weight, which, in a typical neural net, multiplies the signal transmitted. (4) Each neuron applies an activation function (usually nonlinear) to its net input (sum of weighted input signals) to determine its output signal." (Laurene Fausett, "Fundamentals of Neural Networks", 1994)

"A neural network is characterized by (1) its pattern of connections between the neurons (called its architecture), (2) its method of determining the weights on the connections (called its training, or learning, algorithm), and (3) its activation function." (Laurene Fausett, "Fundamentals of Neural Networks", 1994)

"A neural network training method based on presenting input vector x and looking at the output vector calculated by the network. If it is considered 'good', then a 'reward' is given to the network in the sense that the existing connection weights get increased, otherwise the network is "punished"; the connection weights, being considered as 'not appropriately set,' decrease." (Nikola K Kasabov, "Foundations of Neural Networks, Fuzzy Systems, and Knowledge Engineering", 1996)

"More than just a new computing architecture, neural networks offer a completely different paradigm for solving problems with computers. […] The process of learning in neural networks is to use feedback to adjust internal connections, which in turn affect the output or answer produced. The neural processing element combines all of the inputs to it and produces an output, which is essentially a measure of the match between the input pattern and its connection weights. When hundreds of these neural processors are combined, we have the ability to solve difficult problems such as credit scoring." (Joseph P Bigus,"Data Mining with Neural Networks: Solving business problems from application development to decision support", 1996)

"When training a neural network, it is important to understand when to stop. […] If the same training patterns or examples are given to the neural network over and over, and the weights are adjusted to match the desired outputs, we are essentially telling the network to memorize the patterns, rather than to extract the essence of the relationships. What happens is that the neural network performs extremely well on the training data. However, when it is presented with patterns it hasn't seen before, it cannot generalize and does not perform well. What is the problem? It is called overtraining." (Joseph P Bigus,"Data Mining with Neural Networks: Solving business problems from application development to decision support", 1996)

"For linear dependences the main information usually lies in the slope. It is obvious that those points that lie far apart have the strongest influence on the slope if all points have the same uncertainty. In this context we speak of the strong leverage of distant points; when determining the parameter 'slope' these distant points carry more effective weight. Naturally, this weight is distinct from the 'statistical' weight usually used in regression analysis." (Manfred Drosg, "Dealing with Uncertainties: A Guide to Error Analysis", 2007)

"Generally, these programs fall within the techniques of reinforcement learning and the majority use an algorithm of temporal difference learning. In essence, this computer learning paradigm approximates the future state of the system as a function of the present state. To reach that future state, it uses a neural network that changes the weight of its parameters as it learns." (Diego Rasskin-Gutman, "Chess Metaphors: Artificial Intelligence and the Human Mind", 2009)

"Neural networks can model very complex patterns and decision boundaries in the data and, as such, are very powerful. In fact, they are so powerful that they can even model the noise in the training data, which is something that definitely should be avoided. One way to avoid this overfitting is by using a validation set in a similar way as with decision trees.[...] Another scheme to prevent a neural network from overfitting is weight regularization, whereby the idea is to keep the weights small in absolute sense because otherwise they may be fitting the noise in the data. This is then implemented by adding a weight size term (e.g., Euclidean norm) to the objective function of the neural network." (Bart Baesens, "Analytics in a Big Data World: The Essential Guide to Data Science and Its Applications", 2014)

"Keep in mind that a weighted average may be different than a simple (non- weighted) average because a weighted average - by definition - counts certain data points more heavily. When you’re thinking about an average, try to determine if it’s a simple average or a weighted average. If it’s weighted, ask yourself how it’s being weighted, and see which data points count more than others." (John H Johnson & Mike Gluck, "Everydata: The misinformation hidden in the little data you consume every day", 2016)

"Boosting is a non-linear flexible regression technique that helps increase the accuracy of trees by assigning more weights to wrong predictions. The reason for inducing more weight is so the model can emphasize more on these wrongly predicted samples and tune itself to increase accuracy. The gradient boosting method solves the inherent problem in boosting trees (i.e., low speed and human interpretability). The algorithm supports parallelism by specifying the number of threads." (Danish Haroon, "Python Machine Learning Case Studies", 2017)

"Early stopping and regularization can ensure network generalization when you apply them properly. [...] With early stopping, the choice of the validation set is also important. The validation set should be representative of all points in the training set. When you use Bayesian regularization, it is important to train the network until it reaches convergence. The sum-squared error, the sum-squared weights, and the effective number of parameters should reach constant values when the network has converged. With both early stopping and regularization, it is a good idea to train the network starting from several different initial conditions. It is possible for either method to fail in certain circumstances. By testing several different initial conditions, you can verify robust network performance." (Mark H Beale et al, "Neural Network Toolbox™ User's Guide", 2017)

"In Boosting, the selection of samples is done by giving more and more weight to hard-to-classify observations. Gradient boosting classification produces a prediction model in the form of an ensemble of weak predictive models, usually decision trees. It generalizes the model by optimizing for the arbitrary differentiable loss function. At each stage, regression trees fit on the negative gradient of binomial or multinomial deviance loss function." (Danish Haroon, "Python Machine Learning Case Studies", 2017)

"Deep neural networks have an input layer and an output layer. In between, are “hidden layers” that process the input data by adjusting various weights in order to make the output correspond closely to what is being predicted. [...] The mysterious part is not the fancy words, but that no one truly understands how the pattern recognition inside those hidden layers works. That’s why they’re called 'hidden'. They are an inscrutable black box - which is okay if you believe that computers are smarter than humans, but troubling otherwise." (Gary Smith & Jay Cordes, "The 9 Pitfalls of Data Science", 2019)
