SQL Troubles: November 2018

30 November 2018

🔭Data Science: p-value (Just the Quotes)

"What the use of a p-value implies, therefore, is that a hypothesis that may be true may be rejected because it has not predicted observable results that have not occurred." (Harold Jeffreys, "Theory of Probability", 1939)

"A quotation of a p-value is part of the ritual of science, a sprinkling of the holy waters in an effort to sanctify the data analysis and turn consumers of the results into true believers." (William Cleveland, "Visualizing Data", 1993)

"A common misconception is that an effect exists only if it is statistically significant and that it does not exist if it is not [statistically significant]." (Jonas Ranstam, "A common misconception about p-value and its consequences", Acta Orthopaedica Scandinavica 67, 1996)

"It’s a commonplace among statisticians that a chi-squared test (and, really, any p-value) can be viewed as a crude measure of sample size: When sample size is small, it’s very difficult to get a rejection (that is, a p-value below 0.05), whereas when sample size is huge, just about anything will bag you a rejection. With large n, a smaller signal can be found amid the noise. In general: small n, unlikely to get small p-values. Large n, likely to find something. Huge n, almost certain to find lots of small p-values." (Andrew Gelman, "The sample size is huge, so a p-value of 0.007 is not that impressive", 2009)

"The p-value is a concept so misaligned with intuition that no civilian can hold it firmly in mind. Nor can many statisticians." (Matt Briggs, "Why do statisticians answer silly questions that no one ever asks?", Significance Vol. 9(1), 2012)

"Statistical significance refers to the probability that something is true. It’s a measure of how probable it is that the effect we’re seeing is real (rather than due to chance occurrence), which is why it’s typically measured with a p-value. P, in this case, stands for probability. If you accept p-values as a measure of statistical significance, then the lower your p-value is, the less likely it is that the results you’re seeing are due to chance alone." (John H Johnson & Mike Gluck, "Everydata: The misinformation hidden in the little data you consume every day", 2016)

"When statistical inferences, such as p-values, follow extensive looks at the data, they no longer have their usual interpretation. Ignoring this reality is dishonest: it is like painting a bull’s eye around the landing spot of your arrow. This is known in some circles as p-hacking, and much has been written about its perils and pitfalls." (Robert E Kass et all, "Ten Simple Rules for Effective Statistical Practice", PLoS Comput Biol 12(6), 2016)

"Remember that a p-value merely indicates the probability of a particular set of data being generated by the null model–it has little to say about the size of a deviation from that model (especially in the tails of the distribution, where large changes in effect size cause only small changes in p-values)." (Clay Helberg)

🔭Data Science: Control (Just the Quotes)

"An inference, if it is to have scientific value, must constitute a prediction concerning future data. If the inference is to be made purely with the help of the distribution theory of statistics, the experiments that constitute evidence for the inference must arise from a state of statistical control; until that state is reached, there is no universe, normal or otherwise, and the statistician’s calculations by themselves are an illusion if not a delusion. The fact is that when distribution theory is not applicable for lack of control, any inference, statistical or otherwise, is little better than a conjecture. The state of statistical control is therefore the goal of all experimentation. (William E Deming, "Statistical Method from the Viewpoint of Quality Control", 1939)

"Sampling is the science and art of controlling and measuring the reliability of useful statistical information through the theory of probability." (William E Deming, "Some Theory of Sampling", 1950)

"The well-known virtue of the experimental method is that it brings situational variables under tight control. It thus permits rigorous tests of hypotheses and confidential statements about causation. The correlational method, for its part, can study what man has not learned to control. Nature has been experimenting since the beginning of time, with a boldness and complexity far beyond the resources of science. The correlator’s mission is to observe and organize the data of nature’s experiments." (Lee J Cronbach, "The Two Disciplines of Scientific Psychology", The American Psychologist Vol. 12, 1957)

"In complex systems cause and effect are often not closely related in either time or space. The structure of a complex system is not a simple feedback loop where one system state dominates the behavior. The complex system has a multiplicity of interacting feedback loops. Its internal rates of flow are controlled by nonlinear relationships. The complex system is of high order, meaning that there are many system states (or levels). It usually contains positive-feedback loops describing growth processes as well as negative, goal-seeking loops. In the complex system the cause of a difficulty may lie far back in time from the symptoms, or in a completely different and remote part of the system. In fact, causes are usually found, not in prior events, but in the structure and policies of the system." (Jay W Forrester, "Urban dynamics", 1969)

"To adapt to a changing environment, the system needs a variety of stable states that is large enough to react to all perturbations but not so large as to make its evolution uncontrollably chaotic. The most adequate states are selected according to their fitness, either directly by the environment, or by subsystems that have adapted to the environment at an earlier stage. Formally, the basic mechanism underlying self-organization is the (often noise-driven) variation which explores different regions in the system’s state space until it enters an attractor. This precludes further variation outside the attractor, and thus restricts the freedom of the system’s components to behave independently. This is equivalent to the increase of coherence, or decrease of statistical entropy, that defines self-organization." (Francis Heylighen, "The Science Of Self-Organization And Adaptivity", 1970)

"Science consists simply of the formulation and testing of hypotheses based on observational evidence; experiments are important where applicable, but their function is merely to simplify observation by imposing controlled conditions." (Henry L Batten, "Evolution of the Earth", 1971)

"Thus, the construction of a mathematical model consisting of certain basic equations of a process is not yet sufficient for effecting optimal control. The mathematical model must also provide for the effects of random factors, the ability to react to unforeseen variations and ensure good control despite errors and inaccuracies." (Yakov Khurgin, "Did You Say Mathematics?", 1974)

"Uncontrolled variation is the enemy of quality." (W Edwards Deming, 1980)

"The methods of science include controlled experiments, classification, pattern recognition, analysis, and deduction. In the humanities we apply analogy, metaphor, criticism, and (e)valuation. In design we devise alternatives, form patterns, synthesize, use conjecture, and model solutions." (Béla H Bánáthy, "Designing Social Systems in a Changing World", 1996)

"A mathematical model uses mathematical symbols to describe and explain the represented system. Normally used to predict and control, these models provide a high degree of abstraction but also of precision in their application." (Lars Skyttner, "General Systems Theory: Ideas and Applications", 2001)

"A model is an imitation of reality and a mathematical model is a particular form of representation. We should never forget this and get so distracted by the model that we forget the real application which is driving the modelling. In the process of model building we are translating our real world problem into an equivalent mathematical problem which we solve and then attempt to interpret. We do this to gain insight into the original real world situation or to use the model for control, optimization or possibly safety studies." (Ian T Cameron & Katalin Hangos, "Process Modelling and Model Analysis", 2001)

"Dashboards and visualization are cognitive tools that improve your 'span of control' over a lot of business data. These tools help people visually identify trends, patterns and anomalies, reason about what they see and help guide them toward effective decisions. As such, these tools need to leverage people's visual capabilities. With the prevalence of scorecards, dashboards and other visualization tools now widely available for business users to review their data, the issue of visual information design is more important than ever." (Richard Brath & Michael Peters, "Dashboard Design: Why Design is Important," DM Direct, 2004)

"The methodology of feedback design is borrowed from cybernetics (control theory). It is based upon methods of controlled system model’s building, methods of system states and parameters estimation (identification), and methods of feedback synthesis. The models of controlled system used in cybernetics differ from conventional models of physics and mechanics in that they have explicitly specified inputs and outputs. Unlike conventional physics results, often formulated as conservation laws, the results of cybernetical physics are formulated in the form of transformation laws, establishing the possibilities and limits of changing properties of a physical system by means of control." (Alexander L Fradkov, "Cybernetical Physics: From Control of Chaos to Quantum Control", 2007)

"Put simply, statistics is a range of procedures for gathering, organizing, analyzing and presenting quantitative data. […] Essentially […], statistics is a scientific approach to analyzing numerical data in order to enable us to maximize our interpretation, understanding and use. This means that statistics helps us turn data into information; that is, data that have been interpreted, understood and are useful to the recipient. Put formally, for your project, statistics is the systematic collection and analysis of numerical data, in order to investigate or discover relationships among phenomena so as to explain, predict and control their occurrence." (Reva B Brown & Mark Saunders, "Dealing with Statistics: What You Need to Know", 2008)

"One technique employing correlational analysis is multiple regression analysis (MRA), in which a number of independent variables are correlated simultaneously (or sometimes sequentially, but we won’t talk about that variant of MRA) with some dependent variable. The predictor variable of interest is examined along with other independent variables that are referred to as control variables. The goal is to show that variable A influences variable B 'net of' the effects of all the other variables. That is to say, the relationship holds even when the effects of the control variables on the dependent variable are taken into account." (Richard E Nisbett, "Mindware: Tools for Smart Thinking", 2015)

"The correlational technique known as multiple regression is used frequently in medical and social science research. This technique essentially correlates many independent (or predictor) variables simultaneously with a given dependent variable (outcome or output). It asks, 'Net of the effects of all the other variables, what is the effect of variable A on the dependent variable?' Despite its popularity, the technique is inherently weak and often yields misleading results. The problem is due to self-selection. If we don’t assign cases to a particular treatment, the cases may differ in any number of ways that could be causing them to differ along some dimension related to the dependent variable. We can know that the answer given by a multiple regression analysis is wrong because randomized control experiments, frequently referred to as the gold standard of research techniques, may give answers that are quite different from those obtained by multiple regression analysis." (Richard E Nisbett, "Mindware: Tools for Smart Thinking", 2015)

"The theory behind multiple regression analysis is that if you control for everything that is related to the independent variable and the dependent variable by pulling their correlations out of the mix, you can get at the true causal relation between the predictor variable and the outcome variable. That’s the theory. In practice, many things prevent this ideal case from being the norm." (Richard E Nisbett, "Mindware: Tools for Smart Thinking", 2015)

"Too little attention is given to the need for statistical control, or to put it more pertinently, since statistical control (randomness) is so rarely found, too little attention is given to the interpretation of data that arise from conditions not in statistical control." (William E Deming)

🔭Data Science: Conjecture (Just the Quotes)

"In the discovery of hidden things and the investigation of hidden causes, stronger reasons are obtained from sure experiments and demonstrated arguments than from probable conjectures and the opinions of philosophical speculators of the common sort […]" (William Gilbert, "De Magnete", 1600)

"The art of discovering the causes of phenomena, or true hypothesis, is like the art of deciphering, in which an ingenious conjecture greatly shortens the road." (Gottfried W Leibniz, "New Essays Concerning Human Understanding", 1704 [published 1765])

"We define the art of conjecture, or stochastic art, as the art of evaluating as exactly as possible the probabilities of things, so that in our judgments and actions we can always base ourselves on what has been found to be the best, the most appropriate, the most certain, the best advised; this is the only object of the wisdom of the philosopher and the prudence of the statesman." (Jacob Bernoulli, "Ars Conjectandi", 1713)

"One of the most intimate of all associations in the human mind is that of cause and effect. They suggest one another with the utmost readiness upon all occasions; so that it is almost impossible to contemplate the one, without having some idea of, or forming some conjecture about the other." (Joseph Priestley, "The History and Present State of Electricity", 1767

"We know the effects of many things, but the causes of few; experience, therefore, is a surer guide than imagination, and inquiry than conjecture." (Charles C Colton, "Lacon", 1820)

"The rules of scientific investigation always require us, when we enter the domains of conjecture, to adopt that hypothesis by which the greatest number of known facts and phenomena may be reconciled." (Matthew F Maury, "The Physical Geography of the Sea", 1855)

"Scientific theories are not the digest of observations, but they are inventions - conjectures boldly put forward for trial, to be eliminated if they clashed with observations; with observations which were rarely accidental, but as a rule undertaken with the definite intention of testing a theory by obtaining, if possible, a decisive refutation." (Karl R Popper, "Conjectures and Refutations: The Growth of Scientific Knowledge", 1963)

"We wish to see [...] the typical attitude of the scientist who uses mathematics to understand the world around us [...] In the solution of a problem [...] there are typically three phases. The first phase is entirely or almost entirely a matter of physics; the third, a matter of mathematics; and the intermediate phase, a transition from physics to mathematics. The first phase is the formulation of the physical hypothesis or conjecture; the second, its translation into equations; the third, the solution of the equations. Each phase calls for a different kind of work and demands a different attitude." (George Pólya, "Mathematical Methods in Science", 1963)

"We defined the art of conjecture, or stochastic art, as the art of evaluating as exactly as possible the probabilities of things, so that in our judgments and actions we can always base ourselves on what has been found to be the best, the most appropriate, the most certain, the best advised; this is the only object of the wisdom of the philosopher and the prudence of the statesman." (Bertrand de Jouvenel, "The Art of Conjecture", 1967)

"All advances of scientific understanding, at every level, begin with a speculative adventure, an imaginative preconception of what might be true.[...] [This] conjecture is then exposed to criticism to find out whether or not that imagined world is anything like the real one. Scientific reasoning is, therefore, at all levels an interaction between two episodes of thought - a dialogue between two voices, the one imaginative and the other critical [...]" (Sir Peter B Medawar, "The Hope of Progress", 1972)

"In moving from conjecture to experimental data, (D), experiments must be designed which make best use of the experimenter's current state of knowledge and which best illuminate his conjecture. In moving from data to modified conjecture, (A), data must be analyzed so as to accurately present information in a manner which is readily understood by the experimenter." (George E P Box & George C Tjao, "Bayesian Inference in Statistical Analysis", 1973)

"Statistical methods are tools of scientific investigation. Scientific investigation is a controlled learning process in which various aspects of a problem are illuminated as the study proceeds. It can be thought of as a major iteration within which secondary iterations occur. The major iteration is that in which a tentative conjecture suggests an experiment, appropriate analysis of the data so generated leads to a modified conjecture, and this in turn leads to a new experiment, and so on." (George E P Box & George C Tjao, "Bayesian Inference in Statistical Analysis", 1973)

"The essential function of a hypothesis consists in the guidance it affords to new observations and experiments, by which our conjecture is either confirmed or refuted." (Ernst Mach, "Knowledge and Error: Sketches on the Psychology of Enquiry", 1976)

"The verb 'to theorize' is now conjugated as follows: 'I built a model; you formulated a hypothesis; he made a conjecture.'" (John M Ziman, "Reliable Knowledge", 1978)

"All advances of scientific understanding, at every level, begin with a speculative adventure, an imaginative preconception of what might be true - a preconception that always, and necessarily, goes a little way (sometimes a long way) beyond anything which we have logical or factual authority to believe in. It is the invention of a possible world, or of a tiny fraction of that world. The conjecture is then exposed to criticism to find out whether or not that imagined world is anything like the real one. Scientific reasoning is therefore at all levels an interaction between two episodes of thought - a dialogue between two voices, the one imaginative and the other critical; a dialogue, as I have put it, between the possible and the actual, between proposal and disposal, conjecture and criticism, between what might be true and what is in fact the case." (Sir Peter B Medawar, "Pluto’s Republic: Incorporating the Art of the Soluble and Induction Intuition in Scientific Thought", 1982)

"The everyday usage of 'theory' is for an idea whose outcome is as yet undetermined, a conjecture, or for an idea contrary to evidence. But scientists use the word in exactly the opposite sense. [In science] 'theory' [...] refers only to a collection of hypotheses and predictions that is amenable to experimental test, preferably one that has been successfully tested. It has everything to do with the facts." (Tony Rothman & George Sudarshan, "Doubt and Certainty: The Celebrated Academy: Debates on Science, Mysticism, Reality, in General on the Knowable and Unknowable", 1998)

More quotes on "Conjecture" at the-web-of-knowledge.blogspot.com.

29 November 2018

🔭Data Science: Invariance (Just the Quotes)

"[…] as every law of nature implies the existence of an invariant, it follows that every law of nature is a constraint. […] Science looks for laws; it is therefore much concerned with looking for constraints. […] the world around us is extremely rich in constraints. We are so familiar with them that we take most of them for granted, and are often not even aware that they exist. […] A world without constraints would be totally chaotic." (W Ross Ashby, "An Introduction to Cybernetics", 1956)

"[...] the existence of any invariant over a set of phenomena implies a constraint, for its existence implies that the full range of variety does not occur. The general theory of invariants is thus a part of the theory of constraints. Further, as every law of nature implies the existence of an invariant, it follows that every law of nature is a constraint." (W Ross Ashby, "An Introduction to Cybernetics", 1956)

"Through all the meanings runs the basic idea of an 'invariant': that although the system is passing through a series of changes, there is some aspect that is unchanging; so some statement can be made that, in spite of the incessant changing, is true unchangingly." (W Ross Ashby, "An Introduction to Cybernetics", 1956)

"A satisfactory prediction of the sequential properties of learning data from a single experiment is by no means a final test of a model. Numerous other criteria - and some more demanding - can be specified. For example, a model with specific numerical parameter values should be invariant to changes in independent variables that explicitly enter in the model." (Robert R Bush & Frederick Mosteller,"A Comparison of Eight Models?", Studies in Mathematical Learning Theory, 1959)

"We know many laws of nature and we hope and expect to discover more. Nobody can foresee the next such law that will be discovered. Nevertheless, there is a structure in laws of nature which we call the laws of invariance. This structure is so far-reaching in some cases that laws of nature were guessed on the basis of the postulate that they fit into the invariance structure." (Eugene P Wigner, "The Role of Invariance Principles in Natural Philosophy", 1963)

"[..] principle of equipresence: A quantity present as an independent variable in one constitutive equation is so present in all, to the extent that its appearance is not forbidden by the general laws of Physics or rules of invariance. […] The principle of equipresence states, in effect, that no division of phenomena is to be laid down by constitutive equations." (Clifford Truesdell, "Six Lectures on Modern Natural Philosophy", 1966)

"It is now natural for us to try to derive the laws of nature and to test their validity by means of the laws of invariance, rather than to derive the laws of invariance from what we believe to be the laws of nature." (Eugene P Wigner, "Symmetries and Reflections", 1967)

"As a metaphor - and I stress that it is intended as a metaphor - the concept of an invariant that arises out of mutually or cyclically balancing changes may help us to approach the concept of self. In cybernetics this metaphor is implemented in the ‘closed loop’, the circular arrangement of feedback mechanisms that maintain a given value within certain limits. They work toward an invariant, but the invariant is achieved not by a steady resistance, the way a rock stands unmoved in the wind, but by compensation over time. Whenever we happen to look in a feedback loop, we find the present act pitted against the immediate past, but already on the way to being compensated itself by the immediate future. The invariant the system achieves can, therefore, never be found or frozen in a single element because, by its very nature, it consists in one or more relationships - and relationships are not in things but between them." (Ernst von Glasersfeld German, "Cybernetics, Experience and the Concept of Self", 1970)

"An essential condition for a theory of choice that claims normative status is the principle of invariance: different representations of the same choice problem should yield the same preference. That is, the preference between options should be independent of their description. Two characterizations that the decision maker, on reflection, would view as alternative descriptions of the same problem should lead to the same choice-even without the benefit of such reflection." (Amos Tversky & Daniel Kahneman, "Rational Choice and the Framing of Decisions", The Journal of Business Vol. 59 (4), 1986)

"Axiomatic theories of choice introduce preference as a primitive relation, which is interpreted through specific empirical procedures such as choice or pricing. Models of rational choice assume a principle of procedure invariance, which requires strategically equivalent methods of elicitation to yield the same preference order." (Amos Tversky et al, "The Causes of Preference Reversal", The American Economic Review Vol. 80 (1), 1990)

"Symmetry is basically a geometrical concept. Mathematically it can be defined as the invariance of geometrical patterns under certain operations. But when abstracted, the concept applies to all sorts of situations. It is one of the ways by which the human mind recognizes order in nature. In this sense symmetry need not be perfect to be meaningful. Even an approximate symmetry attracts one's attention, and makes one wonder if there is some deep reason behind it." (Eguchi Tohru & ‎K Nishijima ," Broken Symmetry: Selected Papers Of Y Nambu", 1995)

"How deep truths can be defined as invariants – things that do not change no matter what; how invariants are defined by symmetries, which in turn define which properties of nature are conserved, no matter what. These are the selfsame symmetries that appeal to the senses in art and music and natural forms like snowflakes and galaxies. The fundamental truths are based on symmetry, and there’s a deep kind of beauty in that." (K C Cole, "The Universe and the Teacup: The Mathematics of Truth and Beauty", 1997)

"Cybernetics is the science of effective organization, of control and communication in animals and machines. It is the art of steersmanship, of regulation and stability. The concern here is with function, not construction, in providing regular and reproducible behaviour in the presence of disturbances. Here the emphasis is on families of solutions, ways of arranging matters that can apply to all forms of systems, whatever the material or design employed. [...] This science concerns the effects of inputs on outputs, but in the sense that the output state is desired to be constant or predictable – we wish the system to maintain an equilibrium state. It is applicable mostly to complex systems and to coupled systems, and uses the concepts of feedback and transformations (mappings from input to output) to effect the desired invariance or stability in the result." (Chris Lucas, "Cybernetics and Stochastic Systems", 1999)

"Each of the most basic physical laws that we know corresponds to some invariance, which in turn is equivalent to a collection of changes which form a symmetry group. […] whilst leaving some underlying theme unchanged. […] for example, the conservation of energy is equivalent to the invariance of the laws of motion with respect to translations backwards or forwards in time […] the conservation of linear momentum is equivalent to the invariance of the laws of motion with respect to the position of your laboratory in space, and the conservation of angular momentum to an invariance with respect to directional orientation… discovery of conservation laws indicated that Nature possessed built-in sustaining principles which prevented the world from just ceasing to be." (John D Barrow, "New Theories of Everything", 2007)

"The concept of symmetry (invariance) with its rigorous mathematical formulation and generalization has guided us to know the most fundamental of physical laws. Symmetry as a concept has helped mankind not only to define ‘beauty’ but also to express the ‘truth’. Physical laws tries to quantify the truth that appears to be ‘transient’ at the level of phenomena but symmetry promotes that truth to the level of ‘eternity’." (Vladimir G Ivancevic & Tijana T Ivancevic,"Quantum Leap", 2008)

"The concept of symmetry is used widely in physics. If the laws that determine relations between physical magnitudes and a change of these magnitudes in the course of time do not vary at the definite operations (transformations), they say, that these laws have symmetry (or they are invariant) with respect to the given transformations. For example, the law of gravitation is valid for any points of space, that is, this law is in variant with respect to the system of coordinates." (Alexey Stakhov et al, "The Mathematics of Harmony", 2009)

"In dynamical systems, a bifurcation occurs when a small smooth change made to the parameter values (the bifurcation parameters) of a system causes a sudden 'qualitative' or topological change in its behaviour. Generally, at a bifurcation, the local stability properties of equilibria, periodic orbits or other invariant sets changes." (Gregory Faye, "An introduction to bifurcation theory", 2011)

"Data analysis and data mining are concerned with unsupervised pattern finding and structure determination in data sets. The data sets themselves are explicitly linked as a form of representation to an observational or otherwise empirical domain of interest. 'Structure' has long been understood as symmetry which can take many forms with respect to any transformation, including point, translational, rotational, and many others. Symmetries directly point to invariants, which pinpoint intrinsic properties of the data and of the background empirical domain of interest. As our data models change, so too do our perspectives on analysing data." (Fionn Murtagh, "Data Science Foundations: Geometry and Topology of Complex Hierarchic Systems and Big Data Analytics", 2018)

More quotes on "Invariance" at the-web-of-knowledge.blogspot.com.

🔭Data Science: Analysis (Just the Quotes)

"Analysis is a method where one assumes that which is sought, and from this, through a series of implications, arrives at something which is agreed upon on the basis of synthesis; because in analysis, one assumes that which is sought to be known, proved, or constructed, and examines what this is a consequence of and from what this latter follows, so that by backtracking we end up with something that is already known or is part of the starting points of the theory; we call such a method analysis; it is, in a sense, a solution in reversed direction. In synthesis we work in the opposite direction: we assume the last result of the analysis to be true. Then we put the causes from analysis in their natural order, as consequences, and by putting these together we obtain the proof or the construction of that which is sought. We call this synthesis." (Pappus of Alexandria, cca. 4th century BC)

"Analysis is the obtaining of the thing sought by assuming it and so reasoning up to an admitted truth; synthesis is the obtaining of the thing sought by reasoning up to the inference and proof of it." (Eudoxus, cca. 4th century BC)

"The analysis of concepts is for the understanding nothing more than what the magnifying glass is for sight." (Moses Mendelssohn, 1763)

"As the analysis of a substantial composite terminates only in a part which is not a whole, that is, in a simple part, so synthesis terminates only in a whole which is not a part, that is, the world." (Immanuel Kant, "Inaugural Dissertation", 1770)

"But ignorance of the different causes involved in the production of events, as well as their complexity, taken together with the imperfection of analysis, prevents our reaching the same certainty about the vast majority of phenomena. Thus there are things that are uncertain for us, things more or less probable, and we seek to compensate for the impossibility of knowing them by determining their different degrees of likelihood. So it was that we owe to the weakness of the human mind one of the most delicate and ingenious of mathematical theories, the science of chance or probability." (Pierre-Simon Laplace, "Recherches, 1º, sur l'Intégration des Équations Différentielles aux Différences Finies, et sur leur Usage dans la Théorie des Hasards", 1773)

"It has never yet been supposed, that all the facts of nature, and all the means of acquiring precision in the computation and analysis of those facts, and all the connections of objects with each other, and all the possible combinations of ideas, can be exhausted by the human mind." (Nicolas de Condorcet, "Outlines Of An Historical View Of The Progress Of The Human Mind", 1795)

"It is interesting thus to follow the intellectual truths of analysis in the phenomena of nature. This correspondence, of which the system of the world will offer us numerous examples, makes one of the greatest charms attached to mathematical speculations." (Pierre-Simon Laplace, "Exposition du système du monde", 1799)

"With the synthesis of every new concept in the aggregation of coordinate characteristics the extensive or complex distinctness is increased; with the further analysis of concepts in the series of subordinate characteristics the intensive or deep distinctness is increased. The latter kind of distinctness, as it necessarily serves the thoroughness and conclusiveness of cognition, is therefore mainly the business of philosophy and is carried farthest especially in metaphysical investigations." (Immanuel Kant, "Logic", 1800)

"It is easily seen from a consideration of the nature of demonstration and analysis that there can and must be truths which cannot be reduced by any analysis to identities or to the principle of contradiction but which involve an infinite series of reasons which only God can see through." (Gottfried W Leibniz, "Nouvelles lettres et opuscules inédits", 1857)

"Analysis and synthesis, though commonly treated as two different methods, are, if properly understood, only the two necessary parts of the same method. Each is the relative and correlative of the other. Analysis, without a subsequent synthesis, is incomplete; it is a mean cut of from its end. Synthesis, without a previous analysis, is baseless; for synthesis receives from analysis the elements which it recomposes." (Sir William Hamilton, "Lectures on Metaphysics and Logic: 6th Lecture on Metaphysics", 1858)

"Hence, even in the domain of natural science the aid of the experimental method becomes indispensable whenever the problem set is the analysis of transient and impermanent phenomena, and not merely the observation of persistent and relatively constant objects." (Wilhelm Wundt, "Principles of Physiological Psychology", 1874)

"In fact, the opposition of instinct and reason is mainly illusory. Instinct, intuition, or insight is what first leads to the beliefs which subsequent reason confirms or confutes; but the confirmation, where it is possible, consists, in the last analysis, of agreement with other beliefs no less instinctive. Reason is a harmonising, controlling force rather than a creative one. Even in the most purely logical realms, it is insight that first arrives at what is new." (Bertrand Russell, "Our Knowledge of the External World", 1914)

"In obedience to the feeling of reality, we shall insist that, in the analysis of propositions, nothing 'unreal' is to be admitted. But, after all, if there is nothing unreal, how, it may be asked, could we admit anything unreal? The reply is that, in dealing with propositions, we are dealing in the first instance with symbols, and if we attribute significance to groups of symbols which have no significance, we shall fall into the error of admitting unrealities, in the only sense in which this is possible, namely, as objects described." (Bertrand Russell, "Introduction to Mathematical Philosophy" , 1919)

"It requires a very unusual mind to undertake the analysis of the obvious." (Alfred N Whitehead, "Science in the Modern World", 1925)

"The failure of the social sciences to think through and to integrate their several responsibilities for the common problem of relating the analysis of parts to the analysis of the whole constitutes one of the major lags crippling their utility as human tools of knowledge." (Robert S Lynd, "Knowledge of What?", 1939)

"Analogies are useful for analysis in unexplored fields. By means of analogies an unfamiliar system may be compared with one that is better known. The relations and actions are more easily visualized, the mathematics more readily applied, and the analytical solutions more readily obtained in the familiar system." (Harry F Olson, "Dynamical Analogies", 1943)

"Only by the analysis and interpretation of observations as they are made, and the examination of the larger implications of the results, is one in a satisfactory position to pose new experimental and theoretical questions of the greatest significance." (John A Wheeler, "Elementary Particle Physics", American Scientist, 1947)

"The study of the conditions for change begins appropriately with an analysis of the conditions for no change, that is, for the state of equilibrium." (Kurt Lewin, "Quasi-Stationary Social Equilibria and the Problem of Permanent Change", 1947)

"A synthetic approach where piecemeal analysis is not possible due to the intricate interrelationships of parts that cannot be treated out of context of the whole;" (Walter F Buckley, "Sociology and modern systems theory", 1967)

"In general, complexity and precision bear an inverse relation to one another in the sense that, as the complexity of a problem increases, the possibility of analysing it in precise terms diminishes. Thus 'fuzzy thinking' may not be deplorable, after all, if it makes possible the solution of problems which are much too complex for precise analysis." (Lotfi A Zadeh, "Fuzzy languages and their relation to human intelligence", 1972)

"Discovery is a double relation of analysis and synthesis together. As an analysis, it probes for what is there; but then, as a synthesis, it puts the parts together in a form by which the creative mind transcends the bare limits, the bare skeleton, that nature provides." (Jacob Bronowski, "The Ascent of Man", 1973)

"The complexities of cause and effect defy analysis." (Douglas Adams, "Dirk Gently's Holistic Detective Agency", 1987)

"Either one or the other [analysis or synthesis] may be direct or indirect. The direct procedure is when the point of departure is known-direct synthesis in the elements of geometry. By combining at random simple truths with each other, more complicated ones are deduced from them. This is the method of discovery, the special method of inventions, contrary to popular opinion." (André-Marie Ampère)

🔭Data Science: Change (Just the Quotes)

"A law of nature, however, is not a mere logical conception that we have adopted as a kind of memoria technical to enable us to more readily remember facts. We of the present day have already sufficient insight to know that the laws of nature are not things which we can evolve by any speculative method. On the contrary, we have to discover them in the facts; we have to test them by repeated observation or experiment, in constantly new cases, under ever-varying circumstances; and in proportion only as they hold good under a constantly increasing change of conditions, in a constantly increasing number of cases with greater delicacy in the means of observation, does our confidence in their trustworthiness rise." (Hermann von Helmholtz, "Popular Lectures on Scientific Subjects", 1873)

"It is clear that one who attempts to study precisely things that are changing must have a great deal to do with measures of change." (Charles Cooley, "Observations on the Measure of Change", Journal of the American Statistical Association (21), 1893)

"Given any object, relatively abstracted from its surroundings for study, the behavioristic approach consists in the examination of the output of the object and of the relations of this output to the input. By output is meant any change produced in the surroundings by the object. By input, conversely, is meant any event external to the object that modifies this object in any manner." (Arturo Rosenblueth, Norbert Wiener & Julian Bigelow, "Behavior, Purpose and Teleology", Philosophy of Science 10, 1943)

"The general method involved may be very simply stated. In cases where the equilibrium values of our variables can be regarded as the solutions of an extremum (maximum or minimum) problem, it is often possible regardless of the number of variables involved to determine unambiguously the qualitative behavior of our solution values in respect to changes of parameters." (Paul Samuelson, "Foundations of Economic Analysis", 1947)

"A common and very powerful constraint is that of continuity. It is a constraint because whereas the function that changes arbitrarily can undergo any change, the continuous function can change, at each step, only to a neighbouring value." (W Ross Ashby, "An Introduction to Cybernetics", 1956)

"As a simple trick, the discrete can often be carried over into the continuous, in a way suitable for practical purposes, by making a graph of the discrete, with the values shown as separate points. It is then easy to see the form that the changes will take if the points were to become infinitely numerous and close together." (W Ross Ashby, "An Introduction to Cybernetics", 1956)

"The discrete change has only to become small enough in its jump to approximate as closely as is desired to the continuous change. It must further be remembered that in natural phenomena the observations are almost invariably made at discrete intervals; the 'continuity' ascribed to natural events has often been put there by the observer's imagina- tion, not by actual observation at each of an infinite number of points. Thus the real truth is that the natural system is observed at discrete points, and our transformation represents it at discrete points. There can, therefore, be no real incompatibility." (W Ross Ashby, "An Introduction to Cybernetics", 1956)

"Prediction of the future is possible only in systems that have stable parameters like celestial mechanics. The only reason why prediction is so successful in celestial mechanics is that the evolution of the solar system has ground to a halt in what is essentially a dynamic equilibrium with stable parameters. Evolutionary systems, however, by their very nature have unstable parameters. They are disequilibrium systems and in such systems our power of prediction, though not zero, is very limited because of the unpredictability of the parameters themselves. If, of course, it were possible to predict the change in the parameters, then there would be other parameters which were unchanged, but the search for ultimately stable parameters in evolutionary systems is futile, for they probably do not exist… Social systems have Heisenberg principles all over the place, for we cannot predict the future without changing it." (Kenneth E Boulding, Evolutionary Economics, 1981)

"Model is used as a theory. It becomes theory when the purpose of building a model is to understand the mechanisms involved in the developmental process. Hence as theory, model does not carve up or change the world, but it explains how change takes place and in what way or manner. This leads to build change in the structures." (Laxmi K Patnaik, "Model Building in Political Science", The Indian Journal of Political Science Vol. 50 (2), 1989)

"A useful description relates the systematic variation to one or more factors; if the residuals dwarf the effects for a factor, we may not be able to relate variation in the data to changes in the factor. Furthermore, changes in the factor may bring no important change in the response. Such comparisons of residuals and effects require a measure of the variation of overlays relative to each other." (Christopher H Schrnid, "Value Splitting: Taking the Data Apart", 1991)

"[…] continuity appears when we try to mathematically express continuously changing phenomena, and differentiability is the result of expressing smoothly changing phenomena." (Kenji Ueno & Toshikazu Sunada, "A Mathematical Gift, III: The Interplay Between Topology, Functions, Geometry, and Algebra", Mathematical World Vol. 23, 1996)

"There is a new science of complexity which says that the link between cause and effect is increasingly difficult to trace; that change (planned or otherwise) unfolds in non-linear ways; that paradoxes and contradictions abound; and that creative solutions arise out of diversity, uncertainty and chaos." (Andy P Hargreaves & Michael Fullan, "What’s Worth Fighting for Out There?", 1998)

"We analyze numbers in order to know when a change has occurred in our processes or systems. We want to know about such changes in a timely manner so that we can respond appropriately. While this sounds rather straightforward, there is a complication - the numbers can change even when our process does not. So, in our analysis of numbers, we need to have a way to distinguish those changes in the numbers that represent changes in our process from those that are essentially noise." (Donald J Wheeler, "Understanding Variation: The Key to Managing Chaos" 2nd Ed., 2000)

"After you visualize your data, there are certain things to look for […]: increasing, decreasing, outliers, or some mix, and of course, be sure you’re not mixing up noise for patterns. Also note how much of a change there is and how prominent the patterns are. How does the difference compare to the randomness in the data? Observations can stand out because of human or mechanical error, because of the uncertainty of estimated values, or because there was a person or thing that stood out from the rest. You should know which it is." (Nathan Yau, "Data Points: Visualization That Means Something", 2013)

"In negative feedback regulation the organism has set points to which different parameters (temperature, volume, pressure, etc.) have to be adapted to maintain the normal state and stability of the body. The momentary value refers to the values at the time the parameters have been measured. When a parameter changes it has to be turned back to its set point. Oscillations are characteristic to negative feedback regulation […]" (Gaspar Banfalvi, "Homeostasis - Tumor – Metastasis", 2014)

"Regression does not describe changes in ability that happen as time passes […]. Regression is caused by performances fluctuating about ability, so that performances far from the mean reflect abilities that are closer to the mean." (Gary Smith, "Standard Deviations", 2014)

"When memorization happens, you may have the illusion that everything is working well because your machine learning algorithm seems to have fitted the in sample data so well. Instead, problems can quickly become evident when you start having it work with out-of-sample data and you notice that it produces errors in its predictions as well as errors that actually change a lot when you relearn from the same data with a slightly different approach. Overfitting occurs when your algorithm has learned too much from your data, up to the point of mapping curve shapes and rules that do not exist [...]. Any slight change in the procedure or in the training data produces erratic predictions." (John P Mueller & Luca Massaron, Machine Learning for Dummies, 2016)

More quotes on "Change" at the-web-of-knowledge.blogspot.com.

28 November 2018

🔭Data Science: Classification (Just the Quotes)

"Classification is the process of arranging data into sequences and groups according to their common characteristics, or separating them into different but related parts." (Horace Secrist, "An Introduction to Statistical Methods", 1917)

"Statistics is the fundamental and most important part of inductive logic. It is both an art and a science, and it deals with the collection, the tabulation, the analysis and interpretation of quantitative and qualitative measurements. It is concerned with the classifying and determining of actual attributes as well as the making of estimates and the testing of various hypotheses by which probable, or expected, values are obtained. It is one of the means of carrying on scientific research in order to ascertain the laws of behavior of things - be they animate or inanimate. Statistics is the technique of the Scientific Method." (Bruce D Greenschields & Frank M Weida, "Statistics with Applications to Highway Traffic Analyses", 1952)

"A classification is a scheme for breaking a category into a set of parts, called classes, according to some precisely defined differing characteristics possessed by all the elements of the category." (Alva M Tuttle, "Elementary Business and Economic Statistics", 1957)

"It might be reasonable to expect that the more we know about any set of statistics, the greater the confidence we would have in using them, since we would know in which directions they were defective; and that the less we know about a set of figures, the more timid and hesitant we would be in using them. But, in fact, it is the exact opposite which is normally the case; in this field, as in many others, knowledge leads to caution and hesitation, it is ignorance that gives confidence and boldness. For knowledge about any set of statistics reveals the possibility of error at every stage of the statistical process; the difficulty of getting complete coverage in the returns, the difficulty of framing answers precisely and unequivocally, doubts about the reliability of the answers, arbitrary decisions about classification, the roughness of some of the estimates that are made before publishing the final results. Knowledge of all this, and much else, in detail, about any set of figures makes one hesitant and cautious, perhaps even timid, in using them." (Ely Devons, "Essays in Economics", 1961)

"Many of the basic functions performed by neural networks are mirrored by human abilities. These include making distinctions between items (classification), dividing similar things into groups (clustering), associating two or more things (associative memory), learning to predict outcomes based on examples (modeling), being able to predict into the future (time-series forecasting), and finally juggling multiple goals and coming up with a good- enough solution (constraint satisfaction)." (Joseph P Bigus,"Data Mining with Neural Networks: Solving business problems from application development to decision support", 1996)

"While classification is important, it can certainly be overdone. Making too fine a distinction between things can be as serious a problem as not being able to decide at all. Because we have limited storage capacity in our brain (we still haven't figured out how to add an extender card), it is important for us to be able to cluster similar items or things together. Not only is clustering useful from an efficiency standpoint, but the ability to group like things together (called chunking by artificial intelligence practitioners) is a very important reasoning tool. It is through clustering that we can think in terms of higher abstractions, solving broader problems by getting above all of the nitty-gritty details." (Joseph P Bigus,"Data Mining with Neural Networks: Solving business problems from application development to decision support", 1996)

"We build models to increase productivity, under the justified assumption that it's cheaper to manipulate the model than the real thing. Models then enable cheaper exploration and reasoning about some universe of discourse. One important application of models is to understand a real, abstract, or hypothetical problem domain that a computer system will reflect. This is done by abstraction, classification, and generalization of subject-matter entities into an appropriate set of classes and their behavior." (Stephen J Mellor, "Executable UML: A Foundation for Model-Driven Architecture", 2002)

"Compared to traditional statistical studies, which are often hindsight, the field of data mining finds patterns and classifications that look toward and even predict the future. In summary, data mining can (1) provide a more complete understanding of data by finding patterns previously not seen and (2) make models that predict, thus enabling people to make better decisions, take action, and therefore mold future events." (Robert Nisbet et al, "Handbook of statistical analysis and data mining applications", 2009)

"The well-known 'No Free Lunch' theorem indicates that there does not exist a pattern classification method that is inherently superior to any other, or even to random guessing without using additional information. It is the type of problem, prior information, and the amount of training samples that determine the form of classifier to apply. In fact, corresponding to different real-world problems, different classes may have different underlying data structures. A classifier should adjust the discriminant boundaries to fit the structures which are vital for classification, especially for the generalization capacity of the classifier." (Hui Xue et al, "SVM: Support Vector Machines", 2009)

"A problem in data mining when random variations in data are misclassified as important patterns. Overfitting often occurs when the data set is too small to represent the real world." (Microsoft, "SQL Server 2012 Glossary", 2012)

"Choosing an appropriate classification algorithm for a particular problem task requires practice: each algorithm has its own quirks and is based on certain assumptions. To restate the 'No Free Lunch' theorem: no single classifier works best across all possible scenarios. In practice, it is always recommended that you compare the performance of at least a handful of different learning algorithms to select the best model for the particular problem; these may differ in the number of features or samples, the amount of noise in a dataset, and whether the classes are linearly separable or not." (Sebastian Raschka, "Python Machine Learning", 2015)

"The no free lunch theorem for machine learning states that, averaged over all possible data generating distributions, every classification algorithm has the same error rate when classifying previously unobserved points. In other words, in some sense, no machine learning algorithm is universally any better than any other. The most sophisticated algorithm we can conceive of has the same average performance (over all possible tasks) as merely predicting that every point belongs to the same class. [...] the goal of machine learning research is not to seek a universal learning algorithm or the absolute best learning algorithm. Instead, our goal is to understand what kinds of distributions are relevant to the 'real world' that an AI agent experiences, and what kinds of machine learning algorithms perform well on data drawn from the kinds of data generating distributions we care about." (Ian Goodfellow et al, "Deep Learning", 2015)

"Roughly stated, the No Free Lunch theorem states that in the lack of prior knowledge (i.e. inductive bias) on average all predictive algorithms that search for the minimum classification error (or extremum over any risk metric) have identical performance according to any measure." (N D Lewis, "Deep Learning Made Easy with R: A Gentle Introduction for Data Science", 2016)

"The power of deep learning models comes from their ability to classify or predict nonlinear data using a modest number of parallel nonlinear steps4. A deep learning model learns the input data features hierarchy all the way from raw data input to the actual classification of the data. Each layer extracts features from the output of the previous layer." (N D Lewis, "Deep Learning Made Easy with R: A Gentle Introduction for Data Science", 2016)

"Decision trees are important for a few reasons. First, they can both classify and regress. It requires literally one line of code to switch between the two models just described, from a classification to a regression. Second, they are able to determine and share the feature importance of a given training set." (Russell Jurney, "Agile Data Science 2.0: Building Full-Stack Data Analytics Applications with Spark", 2017)

"Multilayer perceptrons share with polynomial classifiers one unpleasant property. Theoretically speaking, they are capable of modeling any decision surface, and this makes them prone to overfitting the training data." (Miroslav Kubat," An Introduction to Machine Learning" 2nd Ed., 2017)

"The main reason why pruning tends to improve classification performance on future examples is that the removal of low-level tests, which have poor statistical support, usually reduces the danger of overfitting. This, however, works only up to a certain point. If overdone, a very high extent of pruning can (in the extreme) result in the decision being replaced with a single leaf labeled with the majority class." (Miroslav Kubat," An Introduction to Machine Learning" 2nd Ed., 2017)

"There are other problems with Big Data. In any large data set, there are bound to be inconsistencies, misclassifications, missing data - in other words, errors, blunders, and possibly lies. These problems with individual items occur in any data set, but they are often hidden in a large mass of numbers even when these numbers are generated out of computer interactions." (David S Salsburg, "Errors, Blunders, and Lies: How to Tell the Difference", 2017)

"The no free lunch theorems set limits on the range of optimality of any method. That is, each methodology has a ‘catchment area’ where it is optimal or nearly so. Often, intuitively, if the optimality is particularly strong then the effectiveness of the methodology falls off more quickly outside its catchment area than if its optimality were not so strong. Boosting is a case in point: it seems so well suited to binary classification that efforts to date to extend it to give effective classification (or regression) more generally have not been very successful. Overall, it remains to characterize the catchment areas where each class of predictors performs optimally, performs generally well, or breaks down." (Bertrand S Clarke & Jennifer L. Clarke, "Predictive Statistics: Analysis and Inference beyond Models", 2018)

"The premise of classification is simple: given a categorical target variable, learn patterns that exist between instances composed of independent variables and their relationship to the target. Because the target is given ahead of time, classification is said to be supervised machine learning because a model can be trained to minimize error between predicted and actual categories in the training data. Once a classification model is fit, it assigns categorical labels to new instances based on the patterns detected during training." (Benjamin Bengfort et al, "Applied Text Analysis with Python: Enabling Language-Aware Data Products with Machine Learning", 2018)

"A classification tree is perhaps the simplest form of algorithm, since it consists of a series of yes/no questions, the answer to each deciding the next question to be asked, until a conclusion is reached." (David Spiegelhalter, "The Art of Statistics: Learning from Data", 2019)

"An advantage of random forests is that it works with both regression and classification trees so it can be used with targets whose role is binary, nominal, or interval. They are also less prone to overfitting than a single decision tree model. A disadvantage of a random forest is that they generally require more trees to improve their accuracy. This can result in increased run times, particularly when using very large data sets." (Richard V McCarthy et al, "Applying Predictive Analytics: Finding Value in Data", 2019)

"The classifier accuracy would be extra ordinary when the test data and the training data are overlapping. But when the model is applied to a new data it will fail to show acceptable accuracy. This condition is called as overfitting." (Jesu V Nayahi J & Gokulakrishnan K, "Medical Image Classification", 2019)

More quotes on "Classification" at the-web-of-knowledge.blogspot.com.

🔭Data Science: Standard Deviation (Just the Quotes)

"Equal variability is not always achieved in plots. For instance, if the theoretical distribution for a probability plot has a density that drops off gradually to zero in the tails (as the normal density does), then the variability of the data in the tails of the probability plot is greater than in the center. Another example is provided by the histogram. Since the height of any one bar has a binomial distribution, the standard deviation of the height is approximately proportional to the square root of the expected height; hence, the variability of the longer bars is greater." (John M Chambers et al, "Graphical Methods for Data Analysis", 1983)

"The most important reason for portraying standard deviations is that they give us a sense of the relative variability of the points in different regions of the plot." (John M Chambers et al, "Graphical Methods for Data Analysis", 1983)

"Many good things happen when data distributions are well approximated by the normal. First, the question of whether the shifts among the distributions are additive becomes the question of whether the distributions have the same standard deviation; if so, the shifts are additive. […] A second good happening is that methods of fitting and methods of probabilistic inference, to be taken up shortly, are typically simple and on well understood ground. […] A third good thing is that the description of the data distribution is more parsimonious." (William S Cleveland, "Visualizing Data", 1993)

"The bounds on the standard deviation are pretty crude but it is surprising how often the rule will pick up gross errors such as confusing the standard error and standard deviation, confusing the variance and the standard deviation, or reporting the mean in one scale and the standard deviation in another scale." (Gerald van Belle, "Statistical Rules of Thumb", 2002)

"Data often arrive in raw form, as long lists of numbers. In this case your job is to summarize the data in a way that captures its essence and conveys its meaning. This can be done numerically, with measures such as the average and standard deviation, or graphically. At other times you find data already in summarized form; in this case you must understand what the summary is telling, and what it is not telling, and then interpret the information for your readers or viewers." (Charles Livingston & Paul Voakes, "Working with Numbers and Statistics: A handbook for journalists", 2005)

"Roughly stated, the standard deviation gives the average of the differences between the numbers on the list and the mean of that list. If data are very spread out, the standard deviation will be large. If the data are concentrated near the mean, the standard deviation will be small." (Charles Livingston & Paul Voakes, "Working with Numbers and Statistics: A handbook for journalists", 2005)

"A feature shared by both the range and the interquartile range is that they are each calculated on the basis of just two values - the range uses the maximum and the minimum values, while the IQR uses the two quartiles. The standard deviation, on the other hand, has the distinction of using, directly, every value in the set as part of its calculation. In terms of representativeness, this is a great strength. But the chief drawback of the standard deviation is that, conceptually, it is harder to grasp than other more intuitive measures of spread." (Alan Graham, "Developing Thinking in Statistics", 2006)

"Numerical precision should be consistent throughout and summary statistics such as means and standard deviations should not have more than one extra decimal place (or significant digit) compared to the raw data. Spurious precision should be avoided although when certain measures are to be used for further calculations or when presenting the results of analyses, greater precision may sometimes be appropriate." (Jenny Freeman et al, "How to Display Data", 2008)

"Need to consider outliers as they can affect statistics such as means, standard deviations, and correlations. They can either be explained, deleted, or accommodated (using either robust statistics or obtaining additional data to fill-in). Can be detected by methods such as box plots, scatterplots, histograms or frequency distributions." (Randall E Schumacker & Richard G Lomax, "A Beginner’s Guide to Structural Equation Modeling" 3rd Ed., 2010)

"Outliers or influential data points can be defined as data values that are extreme or atypical on either the independent (X variables) or dependent (Y variables) variables or both. Outliers can occur as a result of observation errors, data entry errors, instrument errors based on layout or instructions, or actual extreme values from self-report data. Because outliers affect the mean, the standard deviation, and correlation coefficient values, they must be explained, deleted, or accommodated by using robust statistics." (Randall E Schumacker & Richard G Lomax, "A Beginner’s Guide to Structural Equation Modeling" 3rd Ed., 2010)

[myth] "The standard deviation statistic is more efficient than the range and therefore we should use the standard deviation statistic when computing limits for a process behavior chart." (Donald J Wheeler, "Myths About Data Analysis", International Lean & Six Sigma Conference, 2012)

"Outliers make it very hard to give an intuitive interpretation of the mean, but in fact, the situation is even worse than that. For a real‐world distribution, there always is a mean (strictly speaking, you can define distributions with no mean, but they’re not realistic), and when we take the average of our data points, we are trying to estimate that mean. But when there are massive outliers, just a single data point is likely to dominate the value of the mean and standard deviation, so much more data is required to even estimate the mean, let alone make sense of it." (Field Cady, "The Data Science Handbook", 2017)

"Theoretically, the normal distribution is most famous because many distributions converge to it, if you sample from them enough times and average the results. This applies to the binomial distribution, Poisson distribution and pretty much any other distribution you’re likely to encounter (technically, any one for which the mean and standard deviation are finite)." (Field Cady, "The Data Science Handbook", 2017)

"With time series though, there is absolutely no substitute for plotting. The pertinent pattern might end up being a sharp spike followed by a gentle taper down. Or, maybe there are weird plateaus. There could be noisy spikes that have to be filtered out. A good way to look at it is this: means and standard deviations are based on the naïve assumption that data follows pretty bell curves, but there is no corresponding 'default' assumption for time series data (at least, not one that works well with any frequency), so you always have to look at the data to get a sense of what’s normal. [...] Along the lines of figuring out what patterns to expect, when you are exploring time series data, it is immensely useful to be able to zoom in and out." (Field Cady, "The Data Science Handbook", 2017)

"With skewed data, quantiles will reflect the skew, while adding standard deviations assumes symmetry in the distribution and can be misleading." (Robert Grant, "Data Visualization: Charts, Maps and Interactive Graphics", 2019)

"[…] whenever people make decisions after being supplied with the standard deviation number, they act as if it were the expected mean deviation." (Nassim N Taleb, "Statistical Consequences of Fat Tails: Real World Preasymptotics, Epistemology, and Applications" 2nd Ed., 2022)

27 November 2018

🔭Data Science: Data Science (Just the Quotes)

"Data is frequently missing or incongruous. If data is missing, do you simply ignore the missing points? That isn’t always possible. If data is incongruous, do you decide that something is wrong with badly behaved data (after all, equipment fails), or that the incongruous data is telling its own story, which may be more interesting? It’s reported that the discovery of ozone layer depletion was delayed because automated data collection tools discarded readings that were too low. In data science, what you have is frequently all you’re going to get. It’s usually impossible to get 'better' data, and you have no alternative but to work with the data at hand." (Mike Loukides, "What Is Data Science?", 2011).

"Data science isn’t just about the existence of data, or making guesses about what that data might mean; it’s about testing hypotheses and making sure that the conclusions you’re drawing from the data are valid." (Mike Loukides, "What Is Data Science?", 2011)

"The thread that ties most of these applications together is that data collected from users provides added value. Whether that data is search terms, voice samples, or product reviews, the users are in a feedback loop in which they contribute to the products they use. That’s the beginning of data science." (Mike Loukides, "What Is Data Science?", 2011)

"Using data effectively requires something different from traditional statistics, where actuaries in business suits perform arcane but fairly well-defined kinds of analysis. What differentiates data science from statistics is that data science is a holistic approach. We’re increasingly finding data in the wild, and data scientists are involved with gathering data, massaging it into a tractable form, making it tell its story, and presenting that story to others" (Mike Loukides, "What Is Data Science?", 2011)

"Data science, as a field, is overly concerned with the technical tools for executing problems and not nearly concerned enough with asking the right questions. It is very tempting, given how pleasurable it can be to lose oneself in data science work, to just grab the first or most interesting data set and go to town. Other disciplines have successfully built up techniques for asking good questions and ensuring that, once started, work continues on a productive path. We have much to gain from adapting their techniques to our field." (Max Shron, "Thinking with Data: How to Turn Information into Insights", 2014)

"Data science is an iterative process. It starts with a hypothesis (or several hypotheses) about the system we’re studying, and then we analyze the information. The results allow us to reject our initial hypotheses and refine our understanding of the data. When working with thousands of fields and millions of rows, it’s important to develop intuitive ways to reject bad hypotheses quickly." (Phil Simon, "The Visual Organization: Data Visualization, Big Data, and the Quest for Better Decisions", 2014)

"Hollywood loves the myth of a lone scientist working late nights in a dark laboratory on a mysterious island, but the truth is far less melodramatic. Real science is almost always a team sport. Groups of people, collaborating with other groups of people, are the norm in science - and data science is no exception to the rule. When large groups of people work together for extended periods of time, a culture begins to emerge." (Mike Barlow, "Learning to Love Data Science", 2015)

"One important thing to bear in mind about the outputs of data science and analytics is that in the vast majority of cases they do not uncover hidden patterns or relationships as if by magic, and in the case of predictive analytics they do not tell us exactly what will happen in the future. Instead, they enable us to forecast what may come. In other words, once we have carried out some modelling there is still a lot of work to do to make sense out of the results obtained, taking into account the constraints and assumptions in the model, as well as considering what an acceptable level of reliability is in each scenario." (Jesús Rogel-Salazar, "Data Science and Analytics with Python", 2017)

"One of the biggest myths is the belief that data science is an autonomous process that we can let loose on our data to find the answers to our problems. In reality, data science requires skilled human oversight throughout the different stages of the process. [...] The second big myth of data science is that every data science project needs big data and needs to use deep learning. In general, having more data helps, but having the right data is the more important requirement. [...] A third data science myth is that modern data science software is easy to use, and so data science is easy to do. [...] The last myth about data science [...] is the belief that data science pays for itself quickly. The truth of this belief depends on the context of the organization. Adopting data science can require significant investment in terms of developing data infrastructure and hiring staff with data science expertise. Furthermore, data science will not give positive results on every project." (John D Kelleher & Brendan Tierney, "Data Science", 2018)

"The goal of data science is to improve decision making by basing decisions on insights extracted from large data sets. As a field of activity, data science encompasses a set of principles, problem definitions, algorithms, and processes for extracting nonobvious and useful patterns from large data sets. It is closely related to the fields of data mining and machine learning, but it is broader in scope. (John D Kelleher & Brendan Tierney, "Data Science", 2018)

"The patterns that we extract using data science are useful only if they give us insight into the problem that enables us to do something to help solve the problem." (John D Kelleher & Brendan Tierney, "Data Science", 2018)

"We humans are reasonably good at defining rules that check one, two, or even three attributes (also commonly referred to as features or variables), but when we go higher than three attributes, we can start to struggle to handle the interactions between them. By contrast, data science is often applied in contexts where we want to look for patterns among tens, hundreds, thousands, and, in extreme cases, millions of attributes." (John D Kelleher & Brendan Tierney, "Data Science", 2018)

"Even in an era of open data, data science and data journalism, we still need basic statistical principles in order not to be misled by apparent patterns in the numbers." (David Spiegelhalter, "The Art of Statistics: Learning from Data", 2019)

"Data science is, in reality, something that has been around for a very long time. The desire to utilize data to test, understand, experiment, and prove out hypotheses has been around for ages. To put it simply: the use of data to figure things out has been around since a human tried to utilize the information about herds moving about and finding ways to satisfy hunger. The topic of data science came into popular culture more and more as the advent of ‘big data’ came to the forefront of the business world." (Jordan Morrow, "Be Data Literate: The data literacy skills everyone needs to succeed", 2021)

"Data scientists are advanced in their technical skills. They like to do coding, statistics, and so forth. In its purest form, data science is where an individual uses the scientific method on data." (Jordan Morrow, "Be Data Literate: The data literacy skills everyone needs to succeed", 2021)

"Pure data science is the use of data to test, hypothesize, utilize statistics and more, to predict, model, build algorithms, and so forth. This is the technical part of the puzzle. We need this within each organization. By having it, we can utilize the power that these technical aspects bring to data and analytics. Then, with the power to communicate effectively, the analysis can flow throughout the needed parts of an organization." (Jordan Morrow, "Be Data Literate: The data literacy skills everyone needs to succeed", 2021)

"Aim for simplicity in Data Science. Real creativity won’t make things more complex. Instead, it will simplify them." (Damian D Mingle)

"Data Science is a series of failures punctuated by the occasional success." (Nigel C Lewis)

"Invite your Data Science team to ask questions and assume any system, rule, or way of doing things is open to further consideration." (Damian D Mingle)

🔭Data Science: Planning (Just the Quotes)

"The preparation of clear and simple plans, and a convenient system of numbering the [treatments] that are to be applied, will lighten the work of the man in the field, who is usually operating under averse conditions, is frequently in a hurry, and is sometimes not very certain of the points at issue." (F Yates, "The Design and Analysis of Factorial Experiments" Harpenden Imperial Bureau of Soil Science, 1937)

"The statistician who supposes that his main contribution to the planning of an experiment will involve statistical theory, finds repeatedly that he makes his most valuable contribution simply by persuading the investigator to explain why he wishes to do the experiment, by persuading him to justify the experimental treatments, and to explain why it is that the experiment, when completed, will assist him in his research." (Gertrude Cox, [lecture] 1951)

"What goes wrong [in long-range planning] is that sensible anticipation gets converted into foolish numbers: and their validity always hinges on large loose assumptions." (Robert Heller, "The Naked Manager: Games Executives Play", 1972)

"A good rule of thumb for deciding how long the analysis of the data actually will take is (1) to add up all the time for everything you can think of - editing the data, checking for errors, calculating various statistics, thinking about the results, going back to the data to try out a new idea, and (2) then multiply the estimate obtained in this first step by five." (Edward R Tufte, "Data Analysis for Politics and Policy", 1974)

"Statistics is a tool. In experimental science you plan and carry out experiments, and then analyse and interpret the results. To do this you use statistical arguments and calculations. Like any other tool - an oscilloscope, for example, or a spectrometer, or even a humble spanner - you can use it delicately or clumsily, skillfully or ineptly. The more you know about it and understand how it works, the better you will be able to use it and the more useful it will be." (Roger Barlow, "Statistics: A Guide to the Use of Statistical Methods in the Physical Sciences", 1989)

"An important part of the explanation [of continued use of significance testing] is that researchers hold false beliefs about significance testing, beliefs that tell them that significance testing offers important benefits to researchers that it in fact does not. Three of these beliefs are particularly important. The first is the false belief that the significance level of a study indicates the probability of successful replications of the study [...]. A second false belief widely held by researchers is that statistical significance level provides an index of the importance or size of a difference or relation [...]. The third false belief held by many researchers is the most devastating of all to the research enterprise. This is the belief that if a difference or relation is not statistically significant, then it is zero, or at least so small that it can safely be considered to be zero. This is the belief that if the null hypothesis is not rejected then it is to be accepted. This is the belief that a major benefit from significance tests is that they tell us whether a difference or affect is real or ‘probably just occurred by chance’." (Frank L Schmidt, "Statistical significance testing and cumulative knowledge in psychology: implications for training of researchers", Psychological Methods 1(2), 1996)

"Consideration needs to be given to the most appropriate data to be collected. Often the temptation is to collect too much data and not give appropriate attention to the most important. Filing cabinets and computer files world-wide are filled with data that have been collected because they may be of interest to someone in future. Most is never of interest to anyone and if it is, its existence is unknown to those seeking the information, who will set out to collect the data again, probably in a trial better designed for the purpose. In general, it is best to collect only the data required to answer the questions posed, when setting up the trial, and plan another trial for other data in the future, if necessary." (P Portmann & H Ketata, "Statistical Methods for Plant Variety Evaluation", 1997)

"Meta-analytic thinking is the consideration of any result in relation to previous results on the same or similar questions, and awareness that combination with future results is likely to be valuable. Meta-analytic thinking is the application of estimation thinking to more than a single study. It prompts us to seek meta-analysis of previous related studies at the planning stage of research, then to report our results in a way that makes it easy to include them in future meta-analyses. Meta-analytic thinking is a type of estimation thinking, because it, too, focuses on estimates and uncertainty." (Geoff Cumming, "Understanding the New Statistics", 2012)

"Statistics can be defined as a collection of techniques used when planning a data collection, and when subsequently analyzing and presenting data." (Birger S Madsen, "Statistics for Non-Statisticians", 2016)

"The best time to plan an experiment is after you’ve done it." (Ronald A Fisher)

🔭Data Science: Percentiles & Quantiles (Just the Quotes)

"When distributions are compared, the goal is to understand how the distributions shift in going from one data set to the next. […] The most effective way to investigate the shifts of distributions is to compare corresponding quantiles." (William S Cleveland, "Visualizing Data", 1993)

"If the sample is not representative of the population because the sample is small or biased, not selected at random, or its constituents are not independent of one another, then the bootstrap will fail. […] For a given size sample, bootstrap estimates of percentiles in the tails will always be less accurate than estimates of more centrally located percentiles. Similarly, bootstrap interval estimates for the variance of a distribution will always be less accurate than estimates of central location such as the mean or median because the variance depends strongly upon extreme values in the population." (Phillip I Good & James W Hardin, "Common Errors in Statistics (and How to Avoid Them)", 2003)

"A useful feature of a stem plot is that the values maintain their natural order, while at the same time they are laid out in a way that emphasizes the overall distribution of where the values are concentrated (that is, where the longer branches are). This enables you easily to pick out key values such as the median and quartiles." (Alan Graham, "Developing Thinking in Statistics", 2006)

"Having NUMBERSENSE means: (•) Not taking published data at face value; (•) Knowing which questions to ask; (•) Having a nose for doctored statistics. [...] NUMBERSENSE is that bit of skepticism, urge to probe, and desire to verify. It’s having the truffle hog’s nose to hunt the delicacies. Developing NUMBERSENSE takes training and patience. It is essential to know a few basic statistical concepts. Understanding the nature of means, medians, and percentile ranks is important. Breaking down ratios into components facilitates clear thinking. Ratios can also be interpreted as weighted averages, with those weights arranged by rules of inclusion and exclusion. Missing data must be carefully vetted, especially when they are substituted with statistical estimates. Blatant fraud, while difficult to detect, is often exposed by inconsistency." (Kaiser Fung, "Numbersense: How To Use Big Data To Your Advantage", 2013)

"Percentile points are used to define the percentage of cases equal to and below a certain point in a distribution or set of scores." (Neil J Salkind, "Statistics for People who (think They) Hate Statistics: Excel 2007 Edition", 2010)

"Had we started with this [quantile] plot, noticed that it looks straight and not looked further, we would have missed the important features of the data. The general lesson is important. Theoretical quantile -quantile plots are not a panacea and must be used in conjunction with other displays and analyses to get a full picture of the behavior of the data." (John M Chambers et al, "Graphical Methods for Data Analysis", 2011)

"[...] when measuring performance, it’s worth using percentiles rather than averages. The main advantage of the mean is that it’s easy to calculate, but percentiles are much more meaningful." (Martin Kleppmann, "Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems", 2015)

"Many researchers have fallen into the trap of assuming percentiles are interval data and using them in Statistical procedures that require interval data. The results are somewhat distorted under these conditions since the scores are actually only ordinal data." (Martin L Abbott, "Using Statistics in the Social and Health Sciences with SPSS and Excel", 2016)

"The percentile or rank is the point in a distribution of scores below which a given percentage of scores fall. This is an indication of rank since it establishes score that is above the percentage of a set of scores. [...] Therefore, percentiles describe where a certain score is in relation to others in the distribution. [...] Statistically, it is important to remember that percentile ranks are ranks and therefore not interval data." (Martin L Abbott, "Using Statistics in the Social and Health Sciences with SPSS and Excel", 2016)

"It is not enough to give a single summary for a distribution - we need to have an idea of the spread, sometimes known as the variability. [...] The range is a natural choice, but is clearly very sensitive to extreme values [...] In contrast the inter-quartile range (IQR) is unaffected by extremes. This is the distance between the 25th and 75th percentiles of the data and so contains the ‘central half’ of the numbers [...] Finally the standard deviation is a widely used measure of spread. It is the most technically complex measure, but is only really appropriate for well-behaved symmetric data* since it is also unduly influenced by outlying values." (David Spiegelhalter, "The Art of Statistics: Learning from Data", 2019)

🔭Data Science: Fuzziness (Just the Quotes)

"Today we preach that science is not science unless it is quantitative. We substitute correlation for causal studies, and physical equations for organic reasoning. Measurements and equations are supposed to sharpen thinking, but [...] they more often tend to make the thinking non-causal and fuzzy." (John R Platt, "Strong Inference", Science Vol. 146 (3641), 1964)

"Information that is only partially structured (and therefore contains some 'noise' is fuzzy, inconsistent, and indistinct. Such imperfect information may be regarded as having merit only if it represents an intermediate step in structuring the information into a final meaningful form. If the partially Structured information remains in fuzzy form, it will create a state of dissatisfaction in the mind of the originator and certainly in the mind of the recipient. The natural desire is to continue structuring until clarity, simplicity, precision, and definitiveness are obtained." (Cecil H Meyers, "Handbook of Basic Graphs: A modern approach", 1970)

"Mental models are fuzzy, incomplete, and imprecisely stated. Furthermore, within a single individual, mental models change with time, even during the flow of a single conversation. The human mind assembles a few relationships to fit the context of a discussion. As debate shifts, so do the mental models. Even when only a single topic is being discussed, each participant in a conversation employs a different mental model to interpret the subject. Fundamental assumptions differ but are never brought into the open. […] A mental model may be correct in structure and assumptions but, even so, the human mind - either individually or as a group consensus - is apt to draw the wrong implications for the future." (Jay W Forrester, "Counterintuitive Behaviour of Social Systems", Technology Review, 1971)

"Fuzziness, then, is a concomitant of complexity. This implies that as the complexity of a task, or of a system for performing that task, exceeds a certain threshold, the system must necessarily become fuzzy in nature. Thus, with the rapid increase in the complexity of the information processing tasks which the computers are called upon to perform, we are reaching a point where computers will have to be designed for processing of information in fuzzy form. In fact, it is the capability to manipulate fuzzy concepts that distinguishes human intelligence from the machine intelligence of current generation computers. Without such capability we cannot build machines that can summarize written text, translate well from one natural language to another, or perform many other tasks that humans can do with ease because of their ability to manipulate fuzzy concepts." (Lotfi A Zadeh, "The Birth and Evolution of Fuzzy Logic", 1989)

"Probability theory is an ideal tool for formalizing uncertainty in situations where class frequencies are known or where evidence is based on outcomes of a sufficiently long series of independent random experiments. Possibility theory, on the other hand, is ideal for formalizing incomplete information expressed in terms of fuzzy propositions." (George Klir, "Fuzzy sets and fuzzy logic", 1995)

"[…] interval mathematics and fuzzy logic together can provide a promising alternative to mathematical modeling for many physical systems that are too vague or too complicated to be described by simple and crisp mathematical formulas or equations. When interval mathematics and fuzzy logic are employed, the interval of confidence and the fuzzy membership functions are used as approximation measures, leading to the so-called fuzzy systems modeling." (Guanrong Chen & Trung Tat Pham, "Introduction to Fuzzy Sets, Fuzzy Logic, and Fuzzy Control Systems", 2001)

"Fuzzy relations are developed by allowing the relationship between elements of two or more sets to take on an infinite number of degrees of relationship between the extremes of 'completely related' and 'not related', which are the only degrees of relationship possible in crisp relations. In this sense, fuzzy relations are to crisp relations as fuzzy sets are to crisp sets; crisp sets and relations are more constrained realizations of fuzzy sets and relations." (Timothy J Ross & W Jerry Parkinson, "Fuzzy Set Theory, Fuzzy Logic, and Fuzzy Systems", 2002)

"The vast majority of information that we have on most processes tends to be nonnumeric and nonalgorithmic. Most of the information is fuzzy and linguistic in form." (Timothy J Ross & W Jerry Parkinson, "Fuzzy Set Theory, Fuzzy Logic, and Fuzzy Systems", 2002)

"Each fuzzy set is uniquely defined by a membership function. […] There are two approaches to determining a membership function. The first approach is to use the knowledge of human experts. Because fuzzy sets are often used to formulate human knowledge, membership functions represent a part of human knowledge. Usually, this approach can only give a rough formula of the membership function and fine-tuning is required. The second approach is to use data collected from various sensors to determine the membership function. Specifically, we first specify the structure of membership function and then fine-tune the parameters of membership function based on the data." (Huaguang Zhang & Derong Liu, "Fuzzy Modeling and Fuzzy Control", 2006)

"Granular computing is a general computation theory for using granules such as subsets, classes, objects, clusters, and elements of a universe to build an efficient computational model for complex applications with huge amounts of data, information, and knowledge. Granulation of an object a leads to a collection of granules, with a granule being a clump of points (objects) drawn together by indiscernibility, similarity, proximity, or functionality. In human reasoning and concept formulation, the granules and the values of their attributes are fuzzy rather than crisp. In this perspective, fuzzy information granulation may be viewed as a mode of generalization, which can be applied to any concept, method, or theory." (Salvatore Greco et al, "Granular Computing and Data Mining for Ordered Data: The Dominance-Based Rough Set Approach", 2009)

"We use the term fuzzy logic to refer to all aspects of representing and manipulating knowledge that employ intermediary truth-values. This general, commonsense meaning of the term fuzzy logic encompasses, in particular, fuzzy sets, fuzzy relations, and formal deductive systems that admit intermediary truth-values, as well as the various methods based on them." (Radim Belohlavek & George J Klir, "Concepts and Fuzzy Logic", 2011)
