04 December 2018

🔭Data Science: Null Hypothesis (Just the Quotes)

"The first step in beginning the scientific study of a problem is to collect the data, which are or ought to be 'facts'." (John A Thomson, "Introduction to Science", 1911)

"In relation to any experiment we may speak of this hypothesis as the null hypothesis, and it should be noted that the null hypothesis is never proved or established, but is possibly disproved, in the course of experimentation. Every experiment may be said to exist only in order to give the facts a chance of disproving the null hypothesis." (Ronald Fisher, "The Design of Experiments", 1935)

"The essential feature is that we express ignorance of whether the new parameter is needed by taking half the prior probability for it as concentrated in the value indicated by the null hypothesis and distributing the other half over the range possible." (Harold Jeffreys, "Theory of Probablitity", 1939)

"What the use of P [the significance level] implies, therefore, is that a hypothesis that may be true may be rejected because it has not predicted observable results that have not occurred." (Harold Jeffreys, "Theory of Probability", 1939)

"As usual we may make the errors of I) rejecting the null hypothesis when it is true, II) accepting the null hypothesis when it is false. But there is a third kind of error which is of interest because the present test of significance is tied up closely with the idea of making a correct decision about which distribution function has slipped furthest to the right. We may make the error of III) correctly rejecting the null hypothesis for the wrong reason." (Frederick Mosteller, "A k-Sample Slippage Test for an Extreme Population", The Annals of Mathematical Statistics 19, 1948)

"Errors of the third kind happen in conventional tests of differences of means, but they are usually not considered, although their existence is probably recognized. It seems to the author that there may be several reasons for this among which are 1) a preoccupation on the part of mathematical statisticians with the formal questions of acceptance and rejection of null hypotheses without adequate consideration of the implications of the error of the third kind for the practical experimenter, 2) the rarity with which an error of the third kind arises in the usual tests of significance." (Frederick Mosteller, "A k-Sample Slippage Test for an Extreme Population", The Annals of Mathematical Statistics 19, 1948)

"It is very easy to devise different tests which, on the average, have similar properties, [...] hey behave satisfactorily when the null hypothesis is true and have approximately the same power of detecting departures from that hypothesis. Two such tests may, however, give very different results when applied to a given set of data. The situation leads to a good deal of contention amongst statisticians and much discredit of the science of statistics. The appalling position can easily arise in which one can get any answer one wants if only one goes around to a large enough number of statisticians." (Frances Yates, "Discussion on the Paper by Dr. Box and Dr. Andersen", Journal of the Royal Statistical Society B Vol. 17, 1955)

"Null hypotheses of no difference are usually known to be false before the data are collected [...] when they are, their rejection or acceptance simply reflects the size of the sample and the power of the test, and is not a contribution to science." (I Richard Savage, "Nonparametric statistics", Journal of the American Statistical Association 52, 1957)

"Closely related to the null hypothesis is the notion that only enough subjects need be used in psychological experiments to obtain ‘significant’ results. This often encourages experimenters to be content with very imprecise estimates of effects." (Jum Nunnally, "The place of statistics in psychology", Educational and Psychological Measurement 20, 1960)

"If rejection of the null hypothesis were the real intention in psychological experiments, there usually would be no need to gather data." (Jum Nunnally, "The place of statistics in psychology", Educational and Psychological Measurement 20, 1960)

"One feature [...] which requires much more justification than is usually given, is the setting up of unplausible null hypotheses. For example, a statistician may set out a test to see whether two drugs have exactly the same effect, or whether a regression line is exactly straight. These hypotheses can scarcely be taken literally." (Cedric A B Smith, "Book review of Norman T. J. Bailey: Statistical Methods in Biology", Applied Statistics 9, 1960)

"[...] the null-hypothesis models [...] share a crippling flaw: in the real world the null hypothesis is almost never true, and it is usually nonsensical to perform an experiment with the sole aim of rejecting the null hypothesis." (Jum Nunnally, "The place of statistics in psychology", Educational and Psychological Measurement 20, 1960)

"The null-hypothesis significance test treats ‘acceptance’ or ‘rejection’ of a hypothesis as though these were decisions one makes. But a hypothesis is not something, like a piece of pie offered for dessert, which can be accepted or rejected by a voluntary physical action. Acceptance or rejection of a hypothesis is a cognitive process, a degree of believing or disbelieving which, if rational, is not a matter of choice but determined solely by how likely it is, given the evidence, that the hypothesis is true." (William W Rozeboom, "The fallacy of the null–hypothesis significance test", Psychological Bulletin 57, 1960)

"The null hypothesis of no difference has been judged to be no longer a sound or fruitful basis for statistical investigation. […] Significance tests do not provide the information that scientists need, and, furthermore, they are not the most effective method for analyzing and summarizing data." (Cherry A Clark, "Hypothesis Testing in Relation to Statistical Methodology", Review of Educational Research Vol. 33, 1963) 

"Operational research is the application of methods of the research scientist to various rather complex practical operations. [...] A paucity of numerical data with which to work is a usual characteristic of the operations to which operational research is applied." (John T Davies, "The Scientific Approach", 1965)

"[...] a priori reasons for believing that the null hypothesis is generally false anyway. One of the common experiences of research workers is the very high frequency with which significant results are obtained with large samples." (David Bakan, "The test of significance in psychological research", Psychological Bulletin 66, 1966)

"[...] we need to get on with the business of generating [...] hypotheses and proceed to do investigations and make inferences which bear on them, instead of [...] testing the statistical null hypothesis in any number of contexts in which we have every reason to suppose that it is false in the first place." (David Bakan, "The test of significance in psychological research", Psychological Bulletin 66, 1966)

"[…] most of us still remain content to build our theoretical castles on the quicksand of merely rejecting the null hypothesis." (Marvin D Dunnette, "Fads, Fashions, and Folderol in Psychology", American Psychologist Vol. 21, 1966)

"What used to be called judgment is now called prejudice, and what used to be called prejudice is now called a null hypothesis." (Anthony W F Edwards. "Likelihood", 1972)

"Failing to reject a null hypothesis is distinctly different from proving a null hypothesis; the difference in these interpretations is not merely a semantic point. Rather, the two interpretations can lead to quite different biological conclusions." (David F Parkhurst, "Interpreting Failure to Reject a Null Hypothesis", Bulletin of the Ecological Society of America Vol. 66, 1985)

"A little thought reveals a fact widely understood among statisticians: The null hypothesis, taken literally (and that’s the only way you can take it in formal hypothesis testing), is always false in the real world. [...] If it is false, even to a tiny degree, it must be the case that a large enough sample will produce a significant result and lead to its rejection. So if the null hypothesis is always false, what’s the big deal about rejecting it?" (Jacob Cohen, "Things I Have Learned (So Far)", American Psychologist, 1990)

"The worst, i.e., most dangerous, feature of 'accepting the null hypothesis' is the giving up of explicit uncertainty. [...] Mathematics can sometimes be put in such black-and-white terms, but our knowledge or belief about the external world never can." (John Tukey, "The Philosophy of Multiple Comparisons", Statistical Science Vol. 6 (1), 1991)

"Rejection of a true null hypothesis at the 0.05 level will occur only one in 20 times. The overwhelming majority of these false rejections will be based on test statistics close to the borderline value. If the null hypothesis is false, the inter-ocular traumatic test ['hit between the eyes'] will often suffice to reject it; calculation will serve only to verify clear intuition." (Ward Edwards et al, "Bayesian Statistical Inference for Psychological Research", 1992)

"If the null hypothesis is not rejected, [Sir Ronald] Fisher's position was that nothing could be concluded. But researchers find it hard to go to all the trouble of conducting a study only to conclude that nothing can be concluded." (Frank L Schmidt, "Statistical Significance Testing and Cumulative Knowledge", "Psychology: Implications for Training of Researchers, Psychological Methods" Vol. 1 (2), 1996)

"When significance tests are used and a null hypothesis is not rejected, a major problem often arises - namely, the result may be interpreted, without a logical basis, as providing evidence for the null hypothesis." (David F Parkhurst, "Statistical Significance Tests: Equivalence and Reverse Tests Should Reduce Misinterpretation", BioScience Vol. 51 (12), 2001)

"For the study of the topology of the interactions of a complex system it is of central importance to have proper random null models of networks, i.e., models of how a graph arises from a random process. Such models are needed for comparison with real world data. When analyzing the structure of real world networks, the null hypothesis shall always be that the link structure is due to chance alone. This null hypothesis may only be rejected if the link structure found differs significantly from an expectation value obtained from a random model. Any deviation from the random null model must be explained by non-random processes." (Jörg Reichardt, "Structure in Complex Networks", 2009)

"There is a growing realization that reported 'statistically significant' claims in statistical publications are routinely mistaken. Researchers typically express the confidence in their data in terms of p-value: the probability that a perceived result is actually the result of random variation. The value of p (for 'probability') is a way of measuring the extent to which a data set provides evidence against a so-called null hypothesis. By convention, a p- value below 0.05 is considered a meaningful refutation of the null hypothesis; however, such conclusions are less solid than they appear." (Andrew Gelman & Eric Loken, "The Statistical Crisis in Science", American Scientist Vol. 102(6), 2014)

"Null hypothesis is something we attempt to find evidence against in the hypothesis tests. Null hypothesis is usually an initial claim that researchers make on the basis of previous knowledge or experience. Alternative hypothesis has a population parameter value different from that of null hypothesis. Alternative hypothesis is something you hope to come out to be true. Statistical tests are performed to decide which of these holds true in a hypothesis test. If the experiment goes in favor of the null hypothesis then we say the experiment has failed in rejecting the null hypothesis." (Danish Haroon, "Python Machine Learning Case Studies", 2017)

"[...] a hypothesis test tells us whether the observed data are consistent with the null hypothesis, and a confidence interval tells us which hypotheses are consistent with the data." (William C Blackwelder)

03 December 2018

🔭Data Science: Observation (Just the Quotes)

"[…] it is not necessary that these hypotheses should be true, or even probably; but it is enough if they provide a calculus which fits the observations […]" (Andrew Osiander, "On the Revolutions of the Heavenly Spheres", 1543)

"[…] it is from long experience chiefly that we are to expect the most certain rules of practice, yet it is withal to be remembered, that observations, and to put us upon the most probable means of improving any art, is to get the best insight we can into the nature and properties of those things which we are desirous to cultivate and improve." (Stephen Hales, "Vegetable Staticks", 1727) 

"Those who have not imbibed the prejudices of philosophers, are easily convinced that natural knowledge is to be founded on experiment and observation." (Colin Maclaurin, "An Account of Sir Isaac Newton’s Philosophical Discoveries", 1748)

"We have three principal means: observation of nature, reflection, and experiment. Observation gathers the facts reflection combines them, experiment verifies the result of the combination. It is essential that the observation of nature be assiduous, that reflection be profound, and that experimentation be exact. Rarely does one see these abilities in combination. And so, creative geniuses are not common." (Denis Diderot, "On the Interpretation of Nature", 1753)

"Facts, observations, experiments - these are the materials of a great edifice, but in assembling them we must combine them into classes, distinguish which belongs to which order and to which part of the whole each pertains." (Antoine L Lavoisier, "Mémoires de l’Académie Royale des Sciences", 1777)

"On the other hand, if we add observation to observation, without attempting to draw no only certain conclusions, but also conjectural views from them, we offend against the very end for which only observations ought to be made." (Friedrich W Herschel, "On the Construction of the Heavens", Philosophical Transactions of the Royal Society of London Vol. LXXV, 1785)

"[It] may be laid down as a general rule that, if the result of a long series of precise observations approximates a simple relation so closely that the remaining difference is undetectable by observation and may be attributed to the errors to which they are liable, then this relation is probably that of nature." (Pierre-Simon Laplace, "Mémoire sur les Inégalites Séculaires des Planètes et des Satellites", 1787)

"The art of drawing conclusions from experiments and observations consists in evaluating probabilities and in estimating whether they are sufficiently great or numerous enough to constitute proofs. This kind of calculation is more complicated and more difficult than it is commonly thought to be […]" (Antoine-Laurent Lavoisier, cca. 1790)

"We must trust to nothing but facts: These are presented to us by Nature, and cannot deceive. We ought, in every instance, to submit our reasoning to the test of experiment, and never to search for truth but by the natural road of experiment and observation." (Antoin-Laurent de Lavoisiere, "Elements of Chemistry", 1790)

"Conjecture may lead you to form opinions, but it cannot produce knowledge. Natural philosophy must be built upon the phenomena of nature discovered by observation and experiment." (George Adams, "Lectures on Natural and Experimental Philosophy" Vol. 1, 1794)

"In order to supply the defects of experience, we will have recourse to the probable conjectures of analogy, conclusions which we will bequeath to our posterity to be ascertained by new observations, which, if we augur rightly, will serve to establish our theory and to carry it gradually nearer to absolute certainty." (Johann H Lambert, "The System of the World", 1800)

"[…] we must not measure the simplicity of the laws of nature by our facility of conception; but when those which appear to us the most simple, accord perfectly with observations of the phenomena, we are justified in supposing them rigorously exact." (Pierre-Simon Laplace, "The System of the World", 1809)

"Primary causes are unknown to us; but are subject to simple and constant laws, which may be discovered by observation, the study of them being the object of natural philosophy." (Jean-Baptiste-Joseph Fourier, "The Analytical Theory of Heat", 1822)

"The aim of every science is foresight. For the laws of established observation of phenomena are generally employed to foresee their succession. All men, however little advanced make true predictions, which are always based on the same principle, the knowledge of the future from the past." (Auguste Compte, "Plan des travaux scientifiques nécessaires pour réorganiser la société", 1822)

"The framing of hypotheses is, for the enquirer after truth, not the end, but the beginning of his work. Each of his systems is invented, not that he may admire it and follow it into all its consistent consequences, but that he may make it the occasion of a course of active experiment and observation. And if the results of this process contradict his fundamental assumptions, however ingenious, however symmetrical, however elegant his system may be, he rejects it without hesitation. He allows no natural yearning for the offspring of his own mind to draw him aside from the higher duty of loyalty to his sovereign, Truth, to her he not only gives his affections and his wishes, but strenuous labour and scrupulous minuteness of attention." (William Whewell, "Philosophy of the Inductive Sciences" Vol. 2, 1847)

"In the fields of observation chance favors only the prepared mind." (Louis Pasteur, [lecture] 1854)

"When a power of nature, invisible and impalpable, is the subject of scientific inquiry, it is necessary, if we would comprehend its essence and properties, to study its manifestations and effects. For this purpose simple observation is insufficient, since error always lies on the surface, whilst truth must be sought in deeper regions." (Justus von Liebig," Familiar Letters on Chemistry", 1859)

"Observation is so wide awake, and facts are being so rapidly added to the sum of human experience, that it appears as if the theorizer would always be in arrears, and were doomed forever to arrive at imperfect conclusion; but the power to perceive a law is equally rare in all ages of the world, and depends but little on the number of facts observed." (Henry D Thoreau, "A Week on the Concord and Merrimack Rivers", 1862)

"The process of discovery is very simple. An unwearied and systematic application of known laws to nature, causes the unknown to reveal themselves. Almost any mode of observation will be successful at last, for what is most wanted is method." (Henry D Thoreau, "A Week on the Concord and Merrimack Rivers", 1862)

"An anticipative idea or an hypothesis is, then, the necessary starting point for all experimental reasoning. Without it, we could not make any investigation at all nor learn anything; we could only pile up sterile observations. If we experiment without a preconceived idea, we should move at random […]" (Claude Bernard, "An Introduction to the Study of Experimental Medicine", 1865)

"Men who have excessive faith in their theories or ideas are not only ill prepared for making discoveries; they also make very poor observations." (Claude Bernard, "An Introduction to the Study of Experimental Medicine", 1865)

"Only within very narrow boundaries can man observe the phenomena which surround him; most of them naturally escape his senses, and mere observation is not enough." (Claude Bernard, "An Introduction to the Study of Experimental Medicine", 1865)

"[…] wrong hypotheses, rightly worked from, have produced more useful results than unguided observation." (Augustus de Morgan, "A Budget of Paradoxes", 1872)

"Every science begins by accumulating observations, and presently generalizes these empirically; but only when it reaches the stage at which its empirical generalizations are included in a rational generalization does it become developed science." (Herbert Spencer, "The Data of Ethics", 1879)

"Science is the observation of things possible, whether present or past; prescience is the knowledge of things which may come to pass, though but slowly." (Leonardo da Vinci, "The Notebooks of Leonardo da Vinci", 1883)

"Even one well-made observation will be enough in many cases, just as one well-constructed experiment often suffices for the establishment of a law." (Émile Durkheim, "The Rules of Sociological Method", "The Rules of Sociological Method", 1895)

"Every experiment, every observation has, besides its immediate result, effects which, in proportion to its value, spread always on all sides into ever distant parts of knowledge." (Sir Michael Foster, "Annual Report of the Board of Regents of the Smithsonian Institution", 1898)

"The primary basis of all scientific thinking is observation." (Douglas Marsland, "Principles of Modern Biology", 1899)

"To observe is not enough. We must use our observations, and to do that we must generalize." (Henri Poincaré, "Science and Hypothesis", 1902)

"An isolated sensation teaches us nothing, for it does not amount to an observation. Observation is a putting together of several results of sensation which are or are supposed to be connected with each other according to the law of causality, so that some represent causes and others their effects." (Thorvald N Thiele, "Theory of Observations", 1903)

"Man's determination not to be deceived is precisely the origin of the problem of knowledge. The question is always and only this: to learn to know and to grasp reality in the midst of a thousand causes of error which tend to vitiate our observation." (Federigo Enriques, "Problems of Science", 1906)

"An experiment is an observation that can be repeated, isolated and varied. The more frequently you can repeat an observation, the more likely are you to see clearly what is there and to describe accurately what you have seen. The more strictly you can isolate an observation, the easier does your task of observation become, and the less danger is there of your being led astray by irrelevant circumstances, or of placing emphasis on the wrong point. The more widely you can vary an observation, the more clearly will be the uniformity of experience stand out, and the better is your chance of discovering laws." (Edward B Titchener, "A Text-Book of Psychology", 1909)

"Neither logic without observation, nor observation without logic, can move one step in the formation of science." (Alfred N Whitehead, "The Organization of Thought", 1916)

"A discovery is rarely, if ever, a sudden achievement, nor is it the work of one man; a long series of observations, each in turn received in doubt and discussed in hostility, are familiarized by time, and lead at last to the gradual disclosure of truth." (Sir Berkeley Moynihan, "Surgery, Gynecology & Obstetrics" Vol. 31, 1920)

"In the world of natural knowledge, no authority is great enough to support a theory when a crucial observation has shown it to be untenable." (Sir Richard A Gregory, "Discovery; or, The Spirit and Service of Science", 1928)

"The rational concept of probability, which is the only basis of probability calculus, applies only to problems in which either the same event repeats itself again and again, or a great number of uniform elements are involved at the same time. Using the language of physics, we may say that in order to apply the theory of probability we must have a practically unlimited sequence of uniform observations." (Richard von Mises, "Probability, Statistics and Truth", 1928)

"An observation is judged significant, if it would rarely have been produced, in the absence of a real cause of the kind we are seeking. It is a common practice to judge a result significant, if it is of such a magnitude that it would have been produced by chance not more frequently than once in twenty trials. This is an arbitrary, but convenient, level of significance for the practical investigator, but it does not mean that he allows himself to be deceived once in every twenty experiments. The test of significance only tells him what to ignore, namely all experiments in which significant results are not obtained. He should only claim that a phenomenon is experimentally demonstrable when he knows how to design an experiment so that it will rarely fail to give a significant result. Consequently, isolated significant results which he does not know how to reproduce are left in suspense pending further investigation." (Ronald A Fisher, "The Statistical Method in Psychical Research", Proceedings of the Society for Psychical Research 39, 1929)

"Science is but a method. Whatever its material, an observation accurately made and free of compromise to bias and desire, and undeterred by consequence, is science." (Hans Zinsser, "Untheological Reflections", The Atlantic Monthly, 1929)

"Abstraction is the detection of a common quality in the characteristics of a number of diverse observations […] A hypothesis serves the same purpose, but in a different way. It relates apparently diverse experiences, not by directly detecting a common quality in the experiences themselves, but by inventing a fictitious substance or process or idea, in terms of which the experience can be expressed. A hypothesis, in brief, correlates observations by adding something to them, while abstraction achieves the same end by subtracting something." (Herbert Dingle, Science and Human Experience, 1931)

"A scientist, whether theorist or experimenter, puts forward statements, or systems of statements, and tests them step by step. In the field of the empirical sciences, more particularly, he constructs hypotheses, or systems of theories, and tests them against experience by observation and experiment." (Karl Popper, "The Logic of Scientific Discovery", 1934)

"Science is the attempt to discover, by means of observation, and reasoning based upon it, first, particular facts about the world, and then laws connecting facts with one another and (in fortunate cases) making it possible to predict future occurrences." (Bertrand Russell, "Religion and Science, Grounds of Conflict", 1935)

"Starting from statistical observations, it is possible to arrive at conclusions which not less reliable or useful than those obtained in any other exact science. It is only necessary to apply a clear and precise concept of probability to such observations. " (Richard von Mises, "Probability, Statistics, and Truth", 1939)

"Experiment as compared with mere observation has some of the characteristics of cross-examining nature rather than merely overhearing her." (Alan Gregg, "The Furtherance of Medical Research", 1941)

"Science, in the broadest sense, is the entire body of the most accurately tested, critically established, systematized knowledge available about that part of the universe which has come under human observation. For the most part this knowledge concerns the forces impinging upon human beings in the serious business of living and thus affecting man’s adjustment to and of the physical and the social world. […] Pure science is more interested in understanding, and applied science is more interested in control […]" (Austin L Porterfield, "Creative Factors in Scientific Research", 1941)

"We see what we want to see, and observation conforms to hypothesis." (Bergen Evans, "The Natural History of Nonsense", 1947)

"[...] the conception of chance enters in the very first steps of scientific activity in virtue of the fact that no observation is absolutely correct. I think chance is a more fundamental conception that causality; for whether in a concrete case, a cause-effect relation holds or not can only be judged by applying the laws of chance to the observation." (Max Born, 1949)

"Every bit of knowledge we gain and every conclusion we draw about the universe or about any part or feature of it depends finally upon some observation or measurement. Mankind has had again and again the humiliating experience of trusting to intuitive, apparently logical conclusions without observations, and has seen Nature sail by in her radiant chariot of gold in an entirely different direction." (Oliver J Lee, "Measuring Our Universe: From the Inner Atom to Outer Space", 1950)

"Science is an interconnected series of concepts and schemes that have developed as a result of experimentation and observation and are fruitful of further experimentation and observation."(James B Conant, "Science and Common Sense", 1951)

"The stumbling way in which even the ablest of the scientists in every generation have had to fight through thickets of erroneous observations, misleading generalizations, inadequate formulations, and unconscious prejudice is rarely appreciated by those who obtain their scientific knowledge from textbooks." (James B Conant, "Science and Common Sense", 1951)

"[...] no batch of observations, however large, either definitively rejects or definitively fails to reject the hypothesis H0." (Richard B Braithwaite, "Scientific Explanation: A Study of the Function of Theory, Probability and Law in Science", 1953) 

"The methods of science may be described as the discovery of laws, the explanation of laws by theories, and the testing of theories by new observations. A good analogy is that of the jigsaw puzzle, for which the laws are the individual pieces, the theories local patterns suggested by a few pieces, and the tests the completion of these patterns with pieces previously unconsidered." (Edwin P Hubble, "The Nature of Science and Other Lectures", 1954)

"Scientists whose work has no clear, practical implications would want to make their decisions considering such things as: the relative worth of (1) more observations, (2) greater scope of his conceptual model, (3) simplicity, (4) precision of language, (5) accuracy of the probability assignment." (C West Churchman, "Costs, Utilities, and Values", 1956)

"Confidence intervals give a feeling of the uncertainty of experimental evidence, and (very important) give it in the same units [...] as the original observations." (Mary G Natrella, "The relation between confidence intervals and tests of significance", American Statistician 14, 1960)

"No observations are absolutely trustworthy. In no field of observation can we entirely rule out the possibility that an observation is vitiated by a large measurement or execution error. If a reading is found to lie a very long way from its fellows in a series of replicate observations, there must be a suspicion that the deviation is caused by a blunder or gross error of some kind. [...] One sufficiently erroneous reading can wreck the whole of a statistical analysis, however many observations there are." (Francis J Anscombe, "Rejection of Outliers", Technometrics Vol. 2 (2), 1960)

"Observation, reason, and experiment make up what we call the scientific method. (Richard Feynman, "Mainly mechanics, radiation, and heat", 1963)

"As soon as we inquire into the reasons for the phenomena, we enter the domain of theory, which connects the observed phenomena and traces them back to a single ‘pure’ phenomena, thus bringing about a logical arrangement of an enormous amount of observational material." (Georg Joos, "Theoretical Physics", 1968)

"[…] the link between observation and formulation is one of the most difficult and crucial in the scientific enterprise. It is the process of interpreting our theory or, as some say, of ‘operationalizing our concepts’. Our creations in the world of possibility must be fitted in the world of probability; in Kant’s epigram, ‘Concepts without precepts are empty’. It is also the process of relating our observations to theory; to finish the epigram, ‘Precepts without concepts are blind’." (Scott Greer, "The Logic of Social Inquiry", 1969)

"Innocent, unbiased observation is a myth." (Sir Peter B Medawar, Induction and Intuition in Scientific Thought, 1969)

"The advantages of models are, on one hand, that they force us to present a 'complete' theory by which I mean a theory taking into account all relevant phenomena and relations and, on the other hand, the confrontation with observation, that is, reality." (Jan Tinbergen, "The Use of Models: Experience," 1969)

"Science consists simply of the formulation and testing of hypotheses based on observational evidence; experiments are important where applicable, but their function is merely to simplify observation by imposing controlled conditions." (Henry L Batten, "Evolution of the Earth", 1971)

"All perceiving is also thinking, all reasoning is also intuition, all observation is also invention." (Rudolf Arnheim, "Entropy and Art: An Essay on Disorder and Order", 1974)

"No theory ever agrees with all the facts in its domain, yet it is not always the theory that is to blame. Facts are constituted by older ideologies, and a clash between facts and theories may be proof of progress. It is also a first step in our attempt to find the principles implicit in familiar observational notions." (Paul K Feyerabend, "Against Method: Outline of an Anarchistic Theory of Knowledge", 1975)

"The essential function of a hypothesis consists in the guidance it affords to new observations and experiments, by which our conjecture is either confirmed or refuted." (Ernst Mach, "Knowledge and Error: Sketches on the Psychology of Enquiry", 1976)

"After all of this it is a miracle that our models describe anything at all successfully. In fact, they describe many things well: we observe what they have predicted, and we understand what we observe. However, this last act of observation and understanding always eludes physical description." (Yuri I Manin, "Mathematics and Physics", 1981)

"Science is a process. It is a way of thinking, a manner of approaching and of possibly resolving problems, a route by which one can produce order and sense out of disorganized and chaotic observations. Through it we achieve useful conclusions and results that are compelling and upon which there is a tendency to agree." (Isaac Asimov, "‘X’ Stands for Unknown", 1984)

"Science is defined as a set of observations and theories about observations." (F Albert Matsen, "The Role of Theory in Chemistry", Journal of Chemical Education Vol. 62 (5), 1985)

"The only touchstone for empirical truth is experiment and observation." (Heinz Pagels, "Perfect Symmetry: The Search for the Beginning of Time", 1985)

"The model is only a suggestive metaphor, a fiction about the messy and unwieldy observations of the real world. In order for it to be persuasive, to convey a sense of credibility, it is important that it not be too complicated and that the assumptions that are made be clearly in evidence. In short, the model must be simple, transparent, and verifiable." (Edward Beltrami, "Mathematics for Dynamic Modeling", 1987)

"A theory is a good theory if it satisfies two requirements: it must accurately describe a large class of observations on the basis of a model that contains only a few arbitrary elements, and it must make definite predictions about the results of future observations." (Stephen Hawking, "A Brief History of Time: From Big Bang To Black Holes", 1988)

"A law explains a set of observations; a theory explains a set of laws. […] a law applies to observed phenomena in one domain (e.g., planetary bodies and their movements), while a theory is intended to unify phenomena in many domains. […] Unlike laws, theories often postulate unobservable objects as part of their explanatory mechanism." (John L Casti, "Searching for Certainty: How Scientists Predict the Future", 1990)

"A model is often judged by how well it 'explains' some observations. There need not be a unique model for a particular situation, nor need a model cover every possible special case. A model is not reality, it merely helps to explain some of our impressions of reality. [...] Different models may thus seem to contradict each other, yet we may use both in their appropriate places." (Richard W Hamming, "The Art of Probability for Scientists and Engineers", 1991)

"The ability of a scientific theory to be refuted is the key criterion that distinguishes science from metaphysics. If a theory cannot be refuted, if there is no observation that will disprove it, then nothing can prove it - it cannot predict anything, it is a worthless myth." (Eric Lerner, "The Big Bang Never Happened", 1991)

"It is in the nature of theoretical science that there can be no such thing as certainty. A theory is only ‘true’ for as long as the majority of the scientific community maintain the view that the theory is the one best able to explain the observations." (Jim Baggott, "The Meaning of Quantum Theory", 1992)

"The art of science is knowing which observations to ignore and which are the key to the puzzle." (Edward W Kolb, "Blind Watchers of the Sky", 1996)

"The rate of the development of science is not the rate at which you make observations alone but, much more important, the rate at which you create new things to test." (Richard Feynman, "The Meaning of It All", 1998)

"[…] because observations are all we have, we take them seriously. We choose hard data and the framework of mathematics as our guides, not unrestrained imagination or unrelenting skepticism, and seek the simplest yet most wide-reaching theories capable of explaining and predicting the outcome of today’s and future experiments." (Brian Greene, "The Fabric of the Cosmos", 2004)

"If any observation has been classed as an outlier, the next step should be if possible to infer the cause[...]attention should be given to the possibility that laboratory and data management techniques have been imperfect: improvements and safeguards for the future should be considered." (David Finney, "Calibration Guidelines Challenge Outlier Practices", The American Statistician Vol 60 (4), 2006)

"One cautious approach is represented by Bernoulli’s more conservative outlook. If there are very strong reasons for believing that an observation has suffered an accident that made the value in the data-file thoroughly untrustworthy, then reject it; in the absence of clear evidence that an observation, identified by formal rule as an outlier, is unacceptable then retain it unless there is lack of trust that the laboratory obtaining it is conscientiously operated by able persons who have '[...] taken every care.'" " (David Finney, "Calibration Guidelines Challenge Outlier Practices", The American Statistician Vol 60 (4), 2006)

"Every messy data is messy in its own way - it’s easy to define the characteristics of a clean dataset (rows are observations, columns are variables, columns contain values of consistent types). If you start to look at real life data you’ll see every way you can imagine data being messy (and many that you can’t)!" (Hadley Wickham, "R-help mailing list", 2008)

"A model is a good model if it:1. Is elegant 2. Contains few arbitrary or adjustable elements 3. Agrees with and explains all existing observations 4. Makes detailed predictions about future observations that can disprove or falsify the model if they are not borne out." (Stephen Hawking & Leonard Mlodinow, "The Grand Design", 2010)

"Whatever actually happened, outliers need to be investigated not omitted. Try to understand what caused some observations to be different from the bulk of the observations. If you understand the reasons, you are then in a better position to judge whether the points can legitimately removed from the data set, or whether you’ve just discovered something new and interesting. Never remove a point just because it is weird." (Rob J Hyndman, "Omitting outliers", 2016)

"The Dirty Data Theorem states that 'real world' data tends to come from bizarre and unspecifiable distributions of highly correlated variables and have unequal sample sizes, missing data points, non-independent observations, and an indeterminate number of inaccurately recorded values." (Unknown, Statistically Speaking)

"When the ratio of the largest to smallest observation is large you should question whether the data are being analyzed in the right metric (transformation)." (George E P Box)

🔭Data Science: Events (Just the Quotes)

"[…] chance, that is, an infinite number of events, with respect to which our ignorance will not permit us to perceive their causes, and the chain that connects them together. Now, this chance has a greater share in our education than is imagined. It is this that places certain objects before us and, in consequence of this, occasions more happy ideas, and sometimes leads us to the greatest discoveries […]" (Claude A Helvetius, "On Mind", 1751)

"But ignorance of the different causes involved in the production of events, as well as their complexity, taken together with the imperfection of analysis, prevents our reaching the same certainty about the vast majority of phenomena. Thus there are things that are uncertain for us, things more or less probable, and we seek to compensate for the impossibility of knowing them by determining their different degrees of likelihood. So it was that we owe to the weakness of the human mind one of the most delicate and ingenious of mathematical theories, the science of chance or probability." (Pierre-Simon Laplace, "Recherches, 1º, sur l'Intégration des Équations Différentielles aux Différences Finies, et sur leur Usage dans la Théorie des Hasards", 1773)

"[…] determine the probability of a future or unknown event not on the basis of the number of possible combinations resulting in this event or in its complementary event, but only on the basis of the knowledge of order of familiar previous events of this kind" (Marquis de Condorcet, "Essai sur l'application de l'analyse à la probabilité des décisions rendues à la pluralité des voix", 1785)

"Probability has reference partly to our ignorance, partly to our knowledge [..] The theory of chance consists in reducing all the events of the same kind to a certain number of cases equally possible, that is to say, to such as we may be equally undecided about in regard to their existence, and in determining the number of cases favorable to the event whose probability is sought. The ratio of this number to that of all cases possible is the measure of this probability, which is thus simply a fraction whose number is the number of favorable cases and whose denominator is the number of all cases possible." (Pierre-Simon Laplace, "Philosophical Essay on Probabilities", 1814)

"Things of all kinds are subject to a universal law which may be called the law of large numbers. It consists in the fact that, if one observes very considerable numbers of events of the same nature, dependent on constant causes and causes which vary irregularly, sometimes in one direction, sometimes in the other, it is to say without their variation being progressive in any definite direction, one shall find, between these numbers, relations which are almost constant." (Siméon-Denis Poisson, "Poisson’s Law of Large Numbers", 1837)

"Some of the common ways of producing a false statistical argument are to quote figures without their context, omitting the cautions as to their incompleteness, or to apply them to a group of phenomena quite different to that to which they in reality relate; to take these estimates referring to only part of a group as complete; to enumerate the events favorable to an argument, omitting the other side; and to argue hastily from effect to cause, this last error being the one most often fathered on to statistics. For all these elementary mistakes in logic, statistics is held responsible." (Sir Arthur L Bowley, "Elements of Statistics", 1901)

"The theory of chance consists in reducing all the events of the same kind to a certain number of cases equally possible, that is to say, to such as we may be equally undecided about in regard to their existence, and in determining the number of cases favorable to the event whose probability is sought." (Pierre-Simon de Laplace, "Philosophical Essay on Probabilities", 1902)

"Every theory of the course of events in nature is necessarily based on some process of simplification and is to some extent, therefore, a fairy tale." (Sir Napier Shaw, "Manual of Meteorology", 1932)

"The most important application of the theory of probability is to what we may call 'chance-like' or 'random' events, or occurrences. These seem to be characterized by a peculiar kind of incalculability which makes one disposed to believe - after many unsuccessful attempts - that all known rational methods of prediction must fail in their case. We have, as it were, the feeling that not a scientist but only a prophet could predict them. And yet, it is just this incalculability that makes us conclude that the calculus of probability can be applied to these events." (Karl R Popper, "The Logic of Scientific Discovery", 1934)

"Multiple equilibria are not necessarily useless, but from the standpoint of any exact science the existence of a uniquely determined equilibrium is, of course, of the utmost importance, even if proof has to be purchased at the price of very restrictive assumptions; without any possibility of proving the existence of (a) uniquely determined equilibrium - or at all events, of a small number of possible equilibria - at however high a level of abstraction, a field of phenomena is really a chaos that is not under analytical control." (Joseph A Schumpeter, "History of Economic Analysis", 1954)

"In fact, it is empirically ascertainable that every event is actually produced by a number of factors, or is at least accompanied by numerous other events that are somehow connected with it, so that the singling out involved in the picture of the causal chain is an extreme abstraction. Just as ideal objects cannot be isolated from their proper context, material existents exhibit multiple interconnections; therefore the universe is not a heap of things but a system of interacting systems." (Mario Bunge, "Causality: The place of the casual principles in modern science", 1959)

"Certain properties are necessary or sufficient conditions for other properties, and the network of causal relations thus established will make the occurrence of one property at least tend, subject to the presence of other properties, to promote or inhibit the occurrence of another. Arguments from models involve those analogies which can be used to predict the occurrence of certain properties or events, and hence the relevant relations are causal, at least in the sense of implying a tendency to co-occur." (Mary B Hesse," Models and Analogies in Science", 1963)

"In complex systems cause and effect are often not closely related in either time or space. The structure of a complex system is not a simple feedback loop where one system state dominates the behavior. The complex system has a multiplicity of interacting feedback loops. Its internal rates of flow are controlled by nonlinear relationships. The complex system is of high order, meaning that there are many system states (or levels). It usually contains positive-feedback loops describing growth processes as well as negative, goal-seeking loops. In the complex system the cause of a difficulty may lie far back in time from the symptoms, or in a completely different and remote part of the system. In fact, causes are usually found, not in prior events, but in the structure and policies of the system." (Jay Wright Forrester, "Urban dynamics", 1969)

"There are different levels of organization in the occurrence of events. You cannot explain the events of one level in terms of the events of another. For example, you cannot explain life in terms of mechanical concepts, nor society in terms of individual psychology. Analysis can only take you down the scale of organization. It cannot reveal the workings of things on a higher level. To some extent the holistic philosophers are right." (Anatol Rapoport, "General Systems" Vol. 14, 1969)

"[I]n probability theory we are faced with situations in which our intuition or some physical experiments we have carried out suggest certain results. Intuition and experience lead us to an assignment of probabilities to events. As far as the mathematics is concerned, any assignment of probabilities will do, subject to the rules of mathematical consistency." (Robert Ash, "Basic probability theory", 1970)

"Perhaps randomness is not merely an adequate description for complex causes that we cannot specify. Perhaps the world really works this way, and many events are uncaused in any conventional sense of the word." (Stephen Jay Gould,"Hen's Teeth and Horse's Toes", 1983)

"If you perceive the world as some place where things happen at random - random events over which you have sometimes very little control, sometimes fairly good control, but still random events - well, one has to be able to have some idea of how these things behave. […] People who are not used to statistics tend to see things in data - there are random fluctuations which can sometimes delude them - so you have to understand what can happen randomly and try to control whatever can be controlled. You have to expect that you are not going to get a clean-cut answer. So how do you interpret what you get? You do it by statistics." (Lucien LeCam, [interview] 1988)

"According to the narrower definition of randomness, a random sequence of events is one in which anything that can ever happen can happen next. Usually it is also understood that the probability that a given event will happen next is the same as the probability that a like event will happen at any later time. [...] According to the broader definition of randomness, a random sequence is simply one in which any one of several things can happen next, even though not necessarily anything that can ever happen can happen next." (Edward N Lorenz, "The Essence of Chaos", 1993)

"So we pour in data from the past to fuel the decision-making mechanisms created by our models, be they linear or nonlinear. But therein lies the logician's trap: past data from real life constitute a sequence of events rather than a set of independent observations, which is what the laws of probability demand.[...] It is in those outliers and imperfections that the wildness lurks." (Peter L Bernstein, "Against the Gods: The Remarkable Story of Risk", 1996)

"Events may appear to us to be random, but this could be attributed to human ignorance about the details of the processes involved." (Brain S Everitt, "Chance Rules", 1999)

"The subject of probability begins by assuming that some mechanism of uncertainty is at work giving rise to what is called randomness, but it is not necessary to distinguish between chance that occurs because of some hidden order that may exist and chance that is the result of blind lawlessness. This mechanism, figuratively speaking, churns out a succession of events, each individually unpredictable, or it conspires to produce an unforeseeable outcome each time a large ensemble of possibilities is sampled."  (Edward Beltrami, "What is Random?: Chaos and Order in Mathematics and Life", 1999)

"Entropy [...] is the amount of disorder or randomness present in any system. All non-living systems tend toward disorder; left alone they will eventually lose all motion and degenerate into an inert mass. When this permanent stage is reached and no events occur, maximum entropy is attained. A living system can, for a finite time, avert this unalterable process by importing energy from its environment. It is then said to create negentropy, something which is characteristic of all kinds of life." (Lars Skyttner, "General Systems Theory: Ideas and Applications", 2001)

"One can be highly functionally numerate without being a mathematician or a quantitative analyst. It is not the mathematical manipulation of numbers (or symbols representing numbers) that is central to the notion of numeracy. Rather, it is the ability to draw correct meaning from a logical argument couched in numbers. When such a logical argument relates to events in our uncertain real world, the element of uncertainty makes it, in fact, a statistical argument." (Eric R Sowey, "The Getting of Wisdom: Educating Statisticians to Enhance Their Clients' Numeracy", The American Statistician 57(2), 2003)

"Randomness is a difficult notion for people to accept. When events come in clusters and streaks, people look for explanations and patterns. They refuse to believe that such patterns - which frequently occur in random data - could equally well be derived from tossing a coin. So it is in the stock market as well." (Didier Sornette, "Why Stock Markets Crash: Critical events in complex financial systems", 2003)

"The basic concept of complexity theory is that systems show patterns of organization without organizer (autonomous or self-organization). Simple local interactions of many mutually interacting parts can lead to emergence of complex global structures. […] Complexity originates from the tendency of large dynamical systems to organize themselves into a critical state, with avalanches or 'punctuations' of all sizes. In the critical state, events which would otherwise be uncoupled became correlated." (Jochen Fromm, "The Emergence of Complexity", 2004)

"[myth:] Counting can be done without error. Usually, the counted number is an integer and therefore without (rounding) error. However, the best estimate of a scientifically relevant value obtained by counting will always have an error. These errors can be very small in cases of consecutive counting, in particular of regular events, e.g., when measuring frequencies." (Manfred Drosg, "Dealing with Uncertainties: A Guide to Error Analysis", 2007)

"[...] in probability theory we are faced with situations in which our intuition or some physical experiments we have carried out suggest certain results. Intuition and experience lead us to an assignment of probabilities to events. As far as the mathematics is concerned, any assignment of probabilities will do, subject to the rules of mathematical consistency." (Robert Ash, "Basic Probability Theory", 2008)

"Regression toward the mean. That is, in any series of random events an extraordinary event is most likely to be followed, due purely to chance, by a more ordinary one." (Leonard Mlodinow, "The Drunkard’s Walk: How Randomness Rules Our Lives", 2008)

"In the network society, the space of flows dissolves time by disordering the sequence of events and making them simultaneous in the communication networks, thus installing society in structural ephemerality: being cancels becoming." (Manuel Castells, "Communication Power", 2009)

"Without precise predictability, control is impotent and almost meaningless. In other words, the lesser the predictability, the harder the entity or system is to control, and vice versa. If our universe actually operated on linear causality, with no surprises, uncertainty, or abrupt changes, all future events would be absolutely predictable in a sort of waveless orderliness." (Lawrence K Samuels, "Defense of Chaos: The Chaology of Politics, Economics and Human Action", 2013)

"The problem of complexity is at the heart of mankind’s inability to predict future events with any accuracy. Complexity science has demonstrated that the more factors found within a complex system, the more chances of unpredictable behavior. And without predictability, any meaningful control is nearly impossible. Obviously, this means that you cannot control what you cannot predict. The ability ever to predict long-term events is a pipedream. Mankind has little to do with changing climate; complexity does." (Lawrence K Samuels, "The Real Science Behind Changing Climate", 2014)

🔭Data Science: Regression (Just the Quotes)

"One feature [...] which requires much more justification than is usually given, is the setting up of unplausible null hypotheses. For example, a statistician may set out a test to see whether two drugs have exactly the same effect, or whether a regression line is exactly straight. These hypotheses can scarcely be taken literally." (Cedric A B Smith, "Book review of Norman T. J. Bailey: Statistical Methods in Biology", Applied Statistics 9, 1960)

"The method of least squares is used in the analysis of data from planned experiments and also in the analysis of data from unplanned happenings. The word 'regression' is most often used to describe analysis of unplanned data. It is the tacit assumption that the requirements for the validity of least squares analysis are satisfied for unplanned data that produces a great deal of trouble." (George E P Box, "Use and Abuse of Regression", 1966)

"[…] fitting lines to relationships between variables is often a useful and powerful method of summarizing a set of data. Regression analysis fits naturally with the development of causal explanations, simply because the research worker must, at a minimum, know what he or she is seeking to explain." (Edward R Tufte, "Data Analysis for Politics and Policy", 1974)

"Logging size transforms the original skewed distribution into a more symmetrical one by pulling in the long right tail of the distribution toward the mean. The short left tail is, in addition, stretched. The shift toward symmetrical distribution produced by the log transform is not, of course, merely for convenience. Symmetrical distributions, especially those that resemble the normal distribution, fulfill statistical assumptions that form the basis of statistical significance testing in the regression model." (Edward R Tufte, "Data Analysis for Politics and Policy", 1974)

"Logging skewed variables also helps to reveal the patterns in the data. […] the rescaling of the variables by taking logarithms reduces the nonlinearity in the relationship and removes much of the clutter resulting from the skewed distributions on both variables; in short, the transformation helps clarify the relationship between the two variables. It also […] leads to a theoretically meaningful regression coefficient." (Edward R Tufte, "Data Analysis for Politics and Policy", 1974)

"The logarithmic transformation serves several purposes: (1) The resulting regression coefficients sometimes have a more useful theoretical interpretation compared to a regression based on unlogged variables. (2) Badly skewed distributions - in which many of the observations are clustered together combined with a few outlying values on the scale of measurement - are transformed by taking the logarithm of the measurements so that the clustered values are spread out and the large values pulled in more toward the middle of the distribution. (3) Some of the assumptions underlying the regression model and the associated significance tests are better met when the logarithm of the measured variables is taken." (Edward R Tufte, "Data Analysis for Politics and Policy", 1974)

"Graphical methodology provides powerful diagnostic tools for conveying properties of the fitted regression, for assessing the adequacy of the fit, and for suggesting improvements. There is seldom any prior guarantee that a hypothesized regression model will provide a good description of the mechanism that generated the data. Standard regression models carry with them many specific assumptions about the relationship between the response and explanatory variables and about the variation in the response that is not accounted for by the explanatory variables. In many applications of regression there is a substantial amount of prior knowledge that makes the assumptions plausible; in many other applications the assumptions are made as a starting point simply to get the analysis off the ground. But whatever the amount of prior knowledge, fitting regression equations is not complete until the assumptions have been examined." (John M Chambers et al, "Graphical Methods for Data Analysis", 1983)

"Stepwise regression is probably the most abused computerized statistical technique ever devised. If you think you need stepwise regression to solve a particular problem you have, it is almost certain that you do not. Professional statisticians rarely use automated stepwise regression." (Leland Wilkinson, "SYSTAT", 1984)

"Someone has characterized the user of stepwise regression as a person who checks his or her brain at the entrance of the computer center." (Dick R Wittink, "The application of regression analysis", 1988)

"Data analysis is rarely as simple in practice as it appears in books. Like other statistical techniques, regression rests on certain assumptions and may produce unrealistic results if those assumptions are false. Furthermore it is not always obvious how to translate a research question into a regression model." (Lawrence C Hamilton, "Regression with Graphics: A second course in applied statistics", 1991)

"Exploratory regression methods attempt to reveal unexpected patterns, so they are ideal for a first look at the data. Unlike other regression techniques, they do not require that we specify a particular model beforehand. Thus exploratory techniques warn against mistakenly fitting a linear model when the relation is curved, a waxing curve when the relation is S-shaped, and so forth." (Lawrence C Hamilton, "Regression with Graphics: A second course in applied statistics", 1991)

"Linear regression assumes that in the population a normal distribution of error values around the predicted Y is associated with each X value, and that the dispersion of the error values for each X value is the same. The assumptions imply normal and similarly dispersed error distributions." (Fred C Pampel, "Linear Regression: A primer", 2000)

"Whereas regression is about attempting to specify the underlying relationship that summarises a set of paired data, correlation is about assessing the strength of that relationship. Where there is a very close match between the scatter of points and the regression line, correlation is said to be 'strong' or 'high' . Where the points are widely scattered, the correlation is said to be 'weak' or 'low'." (Alan Graham, "Developing Thinking in Statistics", 2006)

"Before best estimates are extracted from data sets by way of a regression analysis, the uncertainties of the individual data values must be determined.In this case care must be taken to recognize which uncertainty components are common to all the values, i.e., those that are correlated (systematic)." (Manfred Drosg, "Dealing with Uncertainties: A Guide to Error Analysis", 2007)

"For linear dependences the main information usually lies in the slope. It is obvious that those points that lie far apart have the strongest influence on the slope if all points have the same uncertainty. In this context we speak of the strong leverage of distant points; when determining the parameter 'slope' these distant points carry more effective weight. Naturally, this weight is distinct from the 'statistical' weight usually used in regression analysis." (Manfred Drosg, "Dealing with Uncertainties: A Guide to Error Analysis", 2007)

"Regression toward the mean. That is, in any series of random events an extraordinary event is most likely to be followed, due purely to chance, by a more ordinary one." (Leonard Mlodinow, "The Drunkard’s Walk: How Randomness Rules Our Lives", 2008)

"There are three possible reasons for [the] absence of predictive power. First, it is possible that the models are misspecified. Second, it is possible that the model’s explanatory factors are measured at too high a level of aggregation [...] Third, [...] the search for statistically significant relationships may not be the strategy best suited for evaluating our model’s ability to explain real world events [...] the lack of predictive power is the result of too much emphasis having been placed on finding statistically significant variables, which may be overdetermined. Statistical significance is generally a flawed way to prune variables in regression models [...] Statistically significant variables may actually degrade the predictive accuracy of a model [...] [By using]models that are constructed on the basis of pruning undertaken with the shears of statistical significance, it is quite possible that we are winnowing our models away from predictive accuracy." (Michael D Ward et al, "The perils of policy by p-value: predicting civil conflicts" Journal of Peace Research 47, 2010)

"Regression analysis, like all forms of statistical inference, is designed to offer us insights into the world around us. We seek patterns that will hold true for the larger population. However, our results are valid only for a population that is similar to the sample on which the analysis has been done." (Charles Wheelan, "Naked Statistics: Stripping the Dread from the Data", 2012)

"Multiple regression, like all statistical techniques based on correlation, has a severe limitation due to the fact that correlation doesn't prove causation. And no amount of measuring of 'control' variables can untangle the web of causality. What nature hath joined together, multiple regression cannot put asunder." (Richard Nisbett, "2014 : What scientific idea is ready for retirement?", 2013)

"What nature hath joined together, multiple regression cannot put asunder." (Richard Nisbett, "2014 : What scientific idea is ready for retirement?", 2013)

"A wide variety of statistical procedures (regression, t-tests, ANOVA) require three assumptions: (i) Normal observations or errors. (ii) Independent observations (or independent errors, which is equivalent, in normal linear models, to independent observations). (iii) Equal variance - when that is appropriate (for the one-sample t-test, for example, there is nothing being compared, so equal variances do not apply)." (DeWayne R Derryberry, "Basic data analysis for time series with R", 2014)

"Regression does not describe changes in ability that happen as time passes […]. Regression is caused by performances fluctuating about ability, so that performances far from the mean reflect abilities that are closer to the mean." (Gary Smith, "Standard Deviations", 2014)

"We encounter regression in many contexts - pretty much whenever we see an imperfect measure of what we are trying to measure. Standardized tests are obviously an imperfect measure of ability. [...] Each experimental score is an imperfect measure of “ability,” the benefits from the layout. To the extent there is randomness in this experiment - and there surely is - the prospective benefits from the layout that has the highest score are probably closer to the mean than was the score." (Gary Smith, "Standard Deviations", 2014)

"When a trait, such as academic or athletic ability, is measured imperfectly, the observed differences in performance exaggerate the actual differences in ability. Those who perform the best are probably not as far above average as they seem. Nor are those who perform the worst as far below average as they seem. Their subsequent performances will consequently regress to the mean." (Gary Smith, "Standard Deviations", 2014)

"Working an integral or performing a linear regression is something a computer can do quite effectively. Understanding whether the result makes sense - or deciding whether the method is the right one to use in the first place - requires a guiding human hand. When we teach mathematics we are supposed to be explaining how to be that guide. A math course that fails to do so is essentially training the student to be a very slow, buggy version of Microsoft Excel." (Jordan Ellenberg, "How Not to Be Wrong: The Power of Mathematical Thinking", 2014)

"A basic problem with MRA is that it typically assumes that the independent variables can be regarded as building blocks, with each variable taken by itself being logically independent of all the others. This is usually not the case, at least for behavioral data. […] Just as correlation doesn’t prove causation, absence of correlation fails to prove absence of causation. False-negative findings can occur using MRA just as false-positive findings do—because of the hidden web of causation that we’ve failed to identify." (Richard E Nisbett, "Mindware: Tools for Smart Thinking", 2015)

"One technique employing correlational analysis is multiple regression analysis (MRA), in which a number of independent variables are correlated simultaneously (or sometimes sequentially, but we won’t talk about that variant of MRA) with some dependent variable. The predictor variable of interest is examined along with other independent variables that are referred to as control variables. The goal is to show that variable A influences variable B 'net of' the effects of all the other variables. That is to say, the relationship holds even when the effects of the control variables on the dependent variable are taken into account." (Richard E Nisbett, "Mindware: Tools for Smart Thinking", 2015)

"The fundamental problem with MRA, as with all correlational methods, is self-selection. The investigator doesn’t choose the value for the independent variable for each subject (or case). This means that any number of variables correlated with the independent variable of interest have been dragged along with it. In most cases, we will fail to identify all these variables. In the case of behavioral research, it’s normally certain that we can’t be confident that we’ve identified all the plausibly relevant variables." (Richard E Nisbett, "Mindware: Tools for Smart Thinking", 2015)

"The theory behind multiple regression analysis is that if you control for everything that is related to the independent variable and the dependent variable by pulling their correlations out of the mix, you can get at the true causal relation between the predictor variable and the outcome variable. That’s the theory. In practice, many things prevent this ideal case from being the norm." (Richard E Nisbett, "Mindware: Tools for Smart Thinking", 2015)

"Regression describes the relationship between an exploratory variable (i.e., independent) and a response variable (i.e., dependent). Exploratory variables are also referred to as predictors and can have a frequency of more than 1. Regression is being used within the realm of predictions and forecasting. Regression determines the change in response variable when one exploratory variable is varied while the other independent variables are kept constant. This is done to understand the relationship that each of those exploratory variables exhibits." (Danish Haroon, "Python Machine Learning Case Studies", 2017)

"Any time you run regression analysis on arbitrary real-world observational data, there’s a significant risk that there’s hidden confounding in your dataset and so causal conclusions from such analysis are likely to be (causally) biased." (Aleksander Molak, "Causal Inference and Discovery in Python", 2023)

"Multiple regression provides scientists and analysts with a tool to perform statistical control - a procedure to remove unwanted influence from certain variables in the model." (Aleksander Molak, "Causal Inference and Discovery in Python", 2023)

"The causal interpretation of linear regression only holds when there are no spurious relationships in your data. This is the case in two scenarios: when you control for a set of all necessary variables (sometimes this set can be empty) or when your data comes from a properly designed randomized experiment." (Aleksander Molak, "Causal Inference and Discovery in Python", 2023)

More quotes on "Regression" at the-web-of-knowledge.blogspot.com

02 December 2018

🔭Data Science: Complexity (Just the Quotes)

"If we study the history of science we see happen two inverse phenomena […] Sometimes simplicity hides under complex appearances; sometimes it is the simplicity which is apparent, and which disguises extremely complicated realities. […] No doubt, if our means of investigation should become more and more penetrating, we should discover the simple under the complex, then the complex under the simple, then again the simple under the complex, and so on, without our being able to foresee what will be the last term. We must stop somewhere, and that science may be possible, we must stop when we have found simplicity. This is the only ground on which we can rear the edifice of our generalizations." (Henri Poincaré, "Science and Hypothesis", 1901)

"The aim of science is to seek the simplest explanations of complex facts. We are apt to fall into the error of thinking that the facts are simple because simplicity is the goal of our quest. The guiding motto in the life of every natural philosopher should be, ‘Seek simplicity and distrust it’." (Alfred N Whitehead, "The Concept of Nature", 1919)

"[Disorganized complexity] is a problem in which the number of variables is very large, and one in which each of the many variables has a behavior which is individually erratic, or perhaps totally unknown. However, in spite of this helter-skelter, or unknown, behavior of all the individual variables, the system as a whole possesses certain orderly and analyzable average properties. [...] [Organized complexity is] not problems of disorganized complexity, to which statistical methods hold the key. They are all problems which involve dealing simultaneously with a sizable number of factors which are interrelated into an organic whole. They are all, in the language here proposed, problems of organized complexity." (Warren Weaver, "Science and Complexity", American Scientist Vol. 36, 1948)

"Nor does complexity deny the valid simplification which is part of the process of analysis, and even a method of achieving complex architecture itself." (Robert Venturi, "Complexity and Contradiction in Architecture", 1966)

"The central task of a natural science is to make the wonderful commonplace: to show that complexity, correctly viewed, is only a mask for simplicity; to find pattern hidden in apparent chaos." (Herbert A Simon, "The Sciences of the Artificial", 1969)

"At each level of complexity, entirely new properties appear. [And] at each stage, entirely new laws, concepts, and generalizations are necessary, requiring inspiration and creativity to just as great a degree as in the previous one." (Philip W Anderson, "More Is Different", Science Vol. 177, 1972)

"In general, complexity and precision bear an inverse relation to one another in the sense that, as the complexity of a problem increases, the possibility of analysing it in precise terms diminishes. Thus 'fuzzy thinking' may not be deplorable, after all, if it makes possible the solution of problems which are much too complex for precise analysis." (Lotfi A Zadeh, "Fuzzy languages and their relation to human intelligence", 1972)

"Any intelligent fool can make things bigger, more complex, and more violent. It takes a touch of genius - and a lot of courage to move in the opposite direction." (Ernst F Schumacher, "Small is Beautiful", 1973)

"The aim of the model is of course not to reproduce reality in all its complexity. It is rather to capture in a vivid, often formal, way what is essential to understanding some aspect of its structure or behavior." (Joseph Weizenbaum, "Computer power and human reason: From judgment to calculation", 1976)

"All nature is a continuum. The endless complexity of life is organized into patterns which repeat themselves at each level of system." (James G Miller, "Living Systems", 1978)

"Simplicity does not precede complexity, but follows it." (Alan J Perlis, "Epigrams on Programming", 1982)

"Organized simplicity occurs where a small number of significant factors and a large number of insignificant factors appear initially to be complex, but on investigation display hidden simplicity." (Robert L Flood & Ewart R Carson, "Dealing with Complexity: An introduction to the theory and application of systems", 1988)

"The state of development of mathematical theory in relation to some attributes of complexity is a clear measure of our ability/inability to deal with that attribute […]" (Robert L Flood & Ewart R Carson, "Dealing with Complexity: An introduction to the theory and application of systems", 1988)

"Complexity is not an objective factor but a subjective one. Supersignals reduce complexity, collapsing a number of features into one. Consequently, complexity must be understood in terms of a specific individual and his or her supply of supersignals. We learn supersignals from experience, and our supply can differ greatly from another individual's. Therefore there can be no objective measure of complexity." (Dietrich Dorner, "The Logic of Failure: Recognizing and Avoiding Error in Complex Situations", 1989)

"Modeling in its broadest sense is the cost-effective use of something in place of something else for some [cognitive] purpose. It allows us to use something that is simpler, safer, or cheaper than reality instead of reality for some purpose. A model represents reality for the given purpose; the model is an abstraction of reality in the sense that it cannot represent all aspects of reality. This allows us to deal with the world in a simplified manner, avoiding the complexity, danger and irreversibility of reality." (Jeff Rothenberg, "The Nature of Modeling. In: Artificial Intelligence, Simulation, and Modeling", 1989)

"A measure that corresponds much better to what is usually meant by complexity in ordinary conversation, as well as in scientific discourse, refers not to the length of the most concise description of an entity (which is roughly what AIC [algorithmic information content] is), but to the length of a concise description of a set of the entity’s regularities. Thus something almost entirely random, with practically no regularities, would have effective complexity near zero. So would something completely regular, such as a bit string consisting entirely of zeroes. Effective complexity can be high only in a region intermediate between total order and complete disorder." (Murray Gell-Mann, "What is Complexity?", Complexity Vol 1 (1), 1995)

"The larger, more detailed and complex the model - the less abstract the abstraction – the smaller the number of people capable of understanding it and the longer it takes for its weaknesses and limitations to be found out." (John Adams, "Risk", 1995)

"Complexity is that property of a model which makes it difficult to formulate its overall behaviour in a given language, even when given reasonably complete information about its atomic components and their inter-relations." (Bruce Edmonds, "Syntactic Measures of Complexity", 1999)

"Falling between order and chaos, the moment of complexity is the point at which self-organizing systems emerge to create new patterns of coherence and structures of behaviour." (Mark C Taylor, "The Moment of Complexity: Emerging Network Culture", 2001)

"[…] most earlier attempts to construct a theory of complexity have overlooked the deep link between it and networks. In most systems, complexity starts where networks turn nontrivial." (Albert-László Barabási, "Linked: How Everything Is Connected to Everything Else and What It Means for Business, Science, and Everyday Life", 2002)

"The urge to tinker with a formula is a hunger that keeps coming back. Tinkering almost always leads to more complexity. The more complicated the metric, the harder it is for users to learn how to affect the metric, and the less likely it is to improve it." (Kaiser Fung, "Numbersense: How To Use Big Data To Your Advantage", 2013)

More on "Complexity" at the-web-of-knowledge.blogspot.com.

🔭Data Science: Error (Just the Quotes)

"The probable is something which lies midway between truth and error" (Christian Thomasius, "Institutes of Divine Jurisprudence", 1688)

"Knowledge being to be had only of visible and certain truth, error is not a fault of our knowledge, but a mistake of our judgment, giving assent to that which is not true." (John Locke, "An Essay Concerning Human Understanding", 1689)

"The errors of definitions multiply themselves according as the reckoning proceeds; and lead men into absurdities, which at last they see but cannot avoid, without reckoning anew from the beginning." (Thomas Hobbes, "The Moral and Political Works of Thomas Hobbes of Malmesbury", 1750)

"Men are often led into errors by the love of simplicity, which disposes us to reduce things to few principles, and to conceive a greater simplicity in nature than there really is." (Thomas Reid, "Essays on the Intellectual Powers of Man", 1785)

"The orbits of certainties touch one another; but in the interstices there is room enough for error to go forth and prevail." (Johann Wolfgang von Goethe, "Maxims and Reflections", 1833)

"Nothing hurts a new truth more than an old error." (Johann Wolfgang von Goethe, "Sprüche in Prosa", 1840)

"Every detection of what is false directs us towards what is true: every trial exhausts some tempting form of error. Not only so; but scarcely any attempt is entirely a failure; scarcely any theory, the result of steady thought, is altogether false; no tempting form of error is without some latent charm derived from truth." (William Whewell, "Lectures on the History of Moral Philosophy in England", 1852)

"[…] ideas may be both novel and important, and yet, if they are incorrect - if they lack the very essential support of incontrovertible fact, they are unworthy of credence. Without this, a theory may be both beautiful and grand, but must be as evanescent as it is beautiful, and as unsubstantial as it is grand." (George Brewster, "A New Philosophy of Matter", 1858)

"When a power of nature, invisible and impalpable, is the subject of scientific inquiry, it is necessary, if we would comprehend its essence and properties, to study its manifestations and effects. For this purpose simple observation is insufficient, since error always lies on the surface, whilst truth must be sought in deeper regions." (Justus von Liebig, "Familiar Letters on Chemistry", 1859)

"As in the experimental sciences, truth cannot be distinguished from error as long as firm principles have not been established through the rigorous observation of facts." (Louis Pasteur, "Étude sur la maladie des vers à soie", 1870)

"It would be an error to suppose that the great discoverer seizes at once upon the truth, or has any unerring method of divining it. In all probability the errors of the great mind exceed in number those of the less vigorous one. Fertility of imagination and abundance of guesses at truth are among the first requisites of discovery; but the erroneous guesses must be many times as numerous as those that prove well founded. The weakest analogies, the most whimsical notions, the most apparently absurd theories, may pass through the teeming brain, and no record remain of more than the hundredth part. […] The truest theories involve suppositions which are inconceivable, and no limit can really be placed to the freedom of hypotheses." (W Stanley Jevons, "The Principles of Science: A Treatise on Logic and Scientific Method", 1877)

"Perfect readiness to reject a theory inconsistent with fact is a primary requisite of the philosophic mind. But it would be a mistake to suppose that this candour has anything akin to fickleness; on the contrary, readiness to reject a false theory may be combined with a peculiar pertinacity and courage in maintaining an hypothesis as long as its falsity is not actually apparent." (William S Jevons, "The Principles of Science", 1887)

"One is almost tempted to assert that quite apart from its intellectual mission, theory is the most practical thing conceivable, the quintessence of practice as it were, since the precision of its conclusions cannot be reached by any routine of estimating or trial and error; although given the hidden ways of theory, this will hold only for those who walk them with complete confidence." (Ludwig E Boltzmann, "On the Significance of Theories", 1890)

"[…] to kill an error is as good a service as, and sometimes even better than, the establishing of a new truth or fact." (Charles R Darwin, "More Letters of Charles Darwin", Vol 2, 1903)

"Man's determination not to be deceived is precisely the origin of the problem of knowledge. The question is always and only this: to learn to know and to grasp reality in the midst of a thousand causes of error which tend to vitiate our observation." (Federigo Enriques, "Problems of Science", 1906)

"The aim of science is to seek the simplest explanations of complex facts. We are apt to fall into the error of thinking that the facts are simple because simplicity is the goal of our quest. The guiding motto in the life of every natural philosopher should be, ‘Seek simplicity and distrust it’." (Alfred N Whitehead, "The Concept of Nature", 1919)

"Poor statistics may be attributed to a number of causes. There are the mistakes which arise in the course of collecting the data, and there are those which occur when those data are being converted into manageable form for publication. Still later, mistakes arise because the conclusions drawn from the published data are wrong. The real trouble with errors which arise during the course of collecting the data is that they are the hardest to detect." (Alfred R Ilersic, "Statistics", 1959)

"When using estimated figures, i.e. figures subject to error, for further calculation make allowance for the absolute and relative errors. Above all, avoid what is known to statisticians as 'spurious' accuracy. For example, if the arithmetic Mean has to be derived from a distribution of ages given to the nearest year, do not give the answer to several places of decimals. Such an answer would imply a degree of accuracy in the results of your calculations which are quite unjustified by the data. The same holds true when calculating percentages." (Alfred R Ilersic, "Statistics", 1959)

"While it is true to assert that much statistical work involves arithmetic and mathematics, it would be quite untrue to suggest that the main source of errors in statistics and their use is due to inaccurate calculations." (Alfred R Ilersic, "Statistics", 1959)

"Errors may also creep into the information transfer stage when the originator of the data is unconsciously looking for a particular result. Such situations may occur in interviews or questionnaires designed to gather original data. Improper wording of the question, or improper voice inflections, and other constructional errors may elicit nonobjective responses. Obviously, if the data is incorrectly gathered, any graph based on that data will contain the original error - even though the graph be most expertly designed and beautifully presented." (Cecil H Meyers, "Handbook of Basic Graphs: A modern approach", 1970)

"One grievous error in interpreting approximations is to allow only good approximations." (Preston C Hammer, "Mind Pollution", Cybernetics, Vol. 14, 1971)

"Thus, the construction of a mathematical model consisting of certain basic equations of a process is not yet sufficient for effecting optimal control. The mathematical model must also provide for the effects of random factors, the ability to react to unforeseen variations and ensure good control despite errors and inaccuracies." (Yakov Khurgin, "Did You Say Mathematics?", 1974)

"A mature science, with respect to the matter of errors in variables, is not one that measures its variables without error, for this is impossible. It is, rather, a science which properly manages its errors, controlling their magnitudes and correctly calculating their implications for substantive conclusions." (Otis D Duncan, "Introduction to Structural Equation Models", 1975)

"Most people like to believe something is or is not true. Great scientists tolerate ambiguity very well. They believe the theory enough to go ahead; they doubt it enough to notice the errors and faults so they can step forward and create the new replacement theory. If you believe too much you'll never notice the flaws; if you doubt too much you won't get started. It requires a lovely balance." (Richard W Hamming, "You and Your Research", 1986) 

"We have found that some of the hardest errors to detect by traditional methods are unsuspected gaps in the data collection (we usually discovered them serendipitously in the course of graphical checking)." (Peter Huber, "Huge data sets", Compstat '94: Proceedings, 1994)

"Humans may crave absolute certainty; they may aspire to it; they may pretend, as partisans of certain religions do, to have attained it. But the history of science - by far the most successful claim to knowledge accessible to humans - teaches that the most we can hope for is successive improvement in our understanding, learning from our mistakes, an asymptotic approach to the Universe, but with the proviso that absolute certainty will always elude us. We will always be mired in error. The most each generation can hope for is to reduce the error bars a little, and to add to the body of data to which error bars apply." (Carl Sagan, "The Demon-Haunted World: Science as a Candle in the Dark", 1995)

"[myth:] Counting can be done without error. Usually, the counted number is an integer and therefore without (rounding) error. However, the best estimate of a scientifically relevant value obtained by counting will always have an error. These errors can be very small in cases of consecutive counting, in particular of regular events, e.g., when measuring frequencies." (Manfred Drosg, "Dealing with Uncertainties: A Guide to Error Analysis", 2007)

"In error analysis the so-called 'chi-squared' is a measure of the agreement between the uncorrelated internal and the external uncertainties of a measured functional relation. The simplest such relation would be time independence. Theory of the chi-squared requires that the uncertainties be normally distributed. Nevertheless, it was found that the test can be applied to most probability distributions encountered in practice." (Manfred Drosg, "Dealing with Uncertainties: A Guide to Error Analysis", 2007)

"[myth:] Random errors can always be determined by repeating measurements under identical conditions. […] this statement is true only for time-related random errors." (Manfred Drosg, "Dealing with Uncertainties: A Guide to Error Analysis", 2007)

"[myth:] Systematic errors can be determined inductively. It should be quite obvious that it is not possible to determine the scale error from the pattern of data values." (Manfred Drosg, "Dealing with Uncertainties: A Guide to Error Analysis", 2007)

"What is so unconventional about the statistical way of thinking? First, statisticians do not care much for the popular concept of the statistical average; instead, they fixate on any deviation from the average. They worry about how large these variations are, how frequently they occur, and why they exist. [...] Second, variability does not need to be explained by reasonable causes, despite our natural desire for a rational explanation of everything; statisticians are frequently just as happy to pore over patterns of correlation. [...] Third, statisticians are constantly looking out for missed nuances: a statistical average for all groups may well hide vital differences that exist between these groups. Ignoring group differences when they are present frequently portends inequitable treatment. [...] Fourth, decisions based on statistics can be calibrated to strike a balance between two types of errors. Predictably, decision makers have an incentive to focus exclusively on minimizing any mistake that could bring about public humiliation, but statisticians point out that because of this bias, their decisions will aggravate other errors, which are unnoticed but serious. [...] Finally, statisticians follow a specific protocol known as statistical testing when deciding whether the evidence fits the crime, so to speak. Unlike some of us, they don’t believe in miracles. In other words, if the most unusual coincidence must be contrived to explain the inexplicable, they prefer leaving the crime unsolved." (Kaiser Fung, "Numbers Rule the World", 2010) 

"A key difference between a traditional statistical problem and a time series problem is that often, in time series, the errors are not independent." (DeWayne R Derryberry, "Basic data analysis for time series with R", 2014)

"A wide variety of statistical procedures (regression, t-tests, ANOVA) require three assumptions: (i) Normal observations or errors. (ii) Independent observations (or independent errors, which is equivalent, in normal linear models, to independent observations). (iii) Equal variance - when that is appropriate (for the one-sample t-test, for example, there is nothing being compared, so equal variances do not apply)." (DeWayne R Derryberry, "Basic data analysis for time series with R", 2014)

"If the observations/errors are not independent, the statistical formulations are completely unreliable unless corrections can be made." (DeWayne R Derryberry, "Basic data analysis for time series with R", 2014)

"Once a model has been fitted to the data, the deviations from the model are the residuals. If the model is appropriate, then the residuals mimic the true errors. Examination of the residuals often provides clues about departures from the modeling assumptions. Lack of fit - if there is curvature in the residuals, plotted versus the fitted values, this suggests there may be whole regions where the model overestimates the data and other whole regions where the model underestimates the data. This would suggest that the current model is too simple relative to some better model." (DeWayne R Derryberry, "Basic data analysis for time series with R", 2014)

"The random element in most data analysis is assumed to be white noise - normal errors independent of each other. In a time series, the errors are often linked so that independence cannot be assumed (the last examples). Modeling the nature of this dependence is the key to time series." (DeWayne R Derryberry, "Basic data analysis for time series with R", 2014)

"When data is not normal, the reason the formulas are working is usually the central limit theorem. For large sample sizes, the formulas are producing parameter estimates that are approximately normal even when the data is not itself normal. The central limit theorem does make some assumptions and one is that the mean and variance of the population exist. Outliers in the data are evidence that these assumptions may not be true. Persistent outliers in the data, ones that are not errors and cannot be otherwise explained, suggest that the usual procedures based on the central limit theorem are not applicable." (DeWayne R Derryberry, "Basic data analysis for time series with R", 2014)

"Bias is error from incorrect assumptions built into the model, such as restricting an interpolating function to be linear instead of a higher-order curve. [...] Errors of bias produce underfit models. They do not fit the training data as tightly as possible, were they allowed the freedom to do so. In popular discourse, I associate the word 'bias' with prejudice, and the correspondence is fairly apt: an apriori assumption that one group is inferior to another will result in less accurate predictions than an unbiased one. Models that perform lousy on both training and testing data are underfit." (Steven S Skiena, "The Data Science Design Manual", 2017)

"Repeated observations of the same phenomenon do not always produce the same results, due to random noise or error. Sampling errors result when our observations capture unrepresentative circumstances, like measuring rush hour traffic on weekends as well as during the work week. Measurement errors reflect the limits of precision inherent in any sensing device. The notion of signal to noise ratio captures the degree to which a series of observations reflects a quantity of interest as opposed to data variance. As data scientists, we care about changes in the signal instead of the noise, and such variance often makes this problem surprisingly difficult." (Steven S Skiena, "The Data Science Design Manual", 2017)

"Variance is error from sensitivity to fluctuations in the training set. If our training set contains sampling or measurement error, this noise introduces variance into the resulting model. [...] Errors of variance result in overfit models: their quest for accuracy causes them to mistake noise for signal, and they adjust so well to the training data that noise leads them astray. Models that do much better on training data than testing data are overfit." (Steven S Skiena, "The Data Science Design Manual", 2017)

"Machine learning bias is typically understood as a source of learning error, a technical problem. […] Machine learning bias can introduce error simply because the system doesn’t 'look' for certain solutions in the first place. But bias is actually necessary in machine learning - it’s part of learning itself." (Erik J Larson, "The Myth of Artificial Intelligence: Why Computers Can’t Think the Way We Do", 2021)

🔭Data Science: All Models Are Wrong (Just the Quotes)

“[…] no models are [true] - not even the Newtonian laws. When you construct a model you leave out all the details which you, with the knowledge at your disposal, consider inessential. […] Models should not be true, but it is important that they are applicable, and whether they are applicable for any given purpose must of course be investigated. This also means that a model is never accepted finally, only on trial.” (Georg Rasch, “Probabilistic Models for Some Intelligence and Attainment Tests”, 1960)

“Celestial navigation is based on the premise that the Earth is the center of the universe. The premise is wrong, but the navigation works. An incorrect model can be a useful tool.” (R A J Phillips, “A Day in the Life of Kelvin Throop”, Analog Science Fiction and Science Fact, Vol. 73 No. 5, 1964)

“Since all models are wrong the scientist cannot obtain a ‘correct’ one by excessive elaboration. On the contrary following William of Occam he should seek an economical description of natural phenomena. Just as the ability to devise simple but evocative models is the signature of the great scientist so overelaboration and overparameterization is often the mark of mediocrity.” (George Box, “Science and Statistics", Journal of the American Statistical Association 71, 1976)

“A model of the universe does not require faith, but a telescope. If it is wrong, it is wrong.” (Paul C W Davies, “Space and Time in the Modern Universe”, 1977)

"Competent scientists do not believe their own models or theories, but rather treat them as convenient fictions. […] The issue to a scientist is not whether a model is true, but rather whether there is another whose predictive power is enough better to justify movement from today's fiction to a new one." (Steve Vardeman, "Comment", Journal of the American Statistical Association 82, 1987)

“The fact that [the model] is an approximation does not necessarily detract from its usefulness because models are approximations. All models are wrong, but some are useful.” (George Box, 1987)

"Statistical models for data are never true. The question whether a model is true is irrelevant. A more appropriate question is whether we obtain the correct scientific conclusion if we pretend that the process under study behaves according to a particular statistical model." (Scott Zeger, "Statistical reasoning in epidemiology", American Journal of Epidemiology, 1991)

“[…] it does not seem helpful just to say that all models are wrong. The very word model implies simplification and idealization. The idea that complex physical, biological or sociological systems can be exactly described by a few formulae is patently absurd. The construction of idealized representations that capture important stable aspects of such systems is, however, a vital part of general scientific analysis and statistical models, especially substantive ones, do not seem essentially different from other kinds of model.” (Sir David Cox, "Comment on ‘Model uncertainty, data mining and statistical inference’", Journal of the Royal Statistical Society, Series A 158, 1995)

“I do not know that my view is more correct; I do not even think that ‘right’ and ‘wrong’ are good categories for assessing complex mental models of external reality - for models in science are judged [as] useful or detrimental, not as true or false.” (Stephen Jay Gould, “Dinosaur in a Haystack: Reflections in Natural History”, 1995)

“No matter how beautiful the whole model may be, no matter how naturally it all seems to hang together now, if it disagrees with experiment, then it is wrong.” (John Gribbin, “Almost Everyone’s Guide to Science”, 1999)

“A model is a simplification or approximation of reality and hence will not reflect all of reality. […] Box noted that ‘all models are wrong, but some are useful’. While a model can never be ‘truth’, a model might be ranked from very useful, to useful, to somewhat useful to, finally, essentially useless.” (Kenneth P Burnham & David R Anderson, “Model Selection and Multimodel Inference: A Practical Information-Theoretic Approach” 2nd Ed., 2005)

"You might say that there’s no reason to bother with model checking since all models are false anyway. I do believe that all models are false, but for me the purpose of model checking is not to accept or reject a model, but to reveal aspects of the data that are not captured by the fitted model." (Andrew Gelman, "Some thoughts on the sociology of statistics", 2007)

"First, we affirm that all models are wrong, some of them are useful. Since a model is an abstraction of reality, and that too only from a particular perspective, they are fundamentally wrong because they are not reality. That gives no license to models that are wrongly built - after all, two wrongs don’t make a right. So usefulness, or purpose, is what determines a model’s role, given that it is correctly formed. Models therefore have teleological value even though they are ontologically erroneous." (John Boardman & Brian Sauser, "Systems Thinking: Coping with 21st Century Problems", 2008)

“In general, when building statistical models, we must not forget that the aim is to understand something about the real world. Or predict, choose an action, make a decision, summarize evidence, and so on, but always about the real world, not an abstract mathematical world: our models are not the reality - a point well made by George Box in his oft-cited remark that ‘all models are wrong, but some are useful’.” (David Hand, "Wonderful examples, but let's not close our eyes", Statistical Science 29, 2014)

"A model is a metaphor, a description of a system that helps us to reason more clearly. Like all metaphors, models are approximations, and will never account for every last detail. A useful mantra here is: all models are wrong, but some models are useful." (James G Scott, "Statistical Modeling: A Gentle Introduction", 2017)
