02 November 2018

Data Science: Torturing the Data in Statistics

Statistics, through its methods, techniques and models rooted in mathematical reasoning, allows exploring, analyzing and summarizing a given set of data. It is used to support decision-making, experiments and theories, and ultimately to gain and communicate insights. When used adequately, statistics can prove to be a useful toolset; however, as soon as its use deviates from the mathematical rigor and principles on which it was built, it can easily be misused. Moreover, results obtained with the help of statistics can easily be distorted in communication, even when the statistical results themselves are valid.

The ease with which statistics can be misused is probably best reflected in sayings like 'if you torture the data long enough it will confess'. The formulation is attributed by several sources to the economist Ronald H Coase; however, according to Coase, the remark he made in the 1960s was slightly different: 'if you torture the data enough, nature will always confess' (see [1]). The latter formulation is not necessarily negative if one considers the persistence researchers need to reveal nature’s secrets. By contrast, the former formulation seems to stress only the negative aspect.

The word 'torture' seems to be used instead of 'abuse' because, as a metaphor, it carries more weight: it draws attention and sticks with the reader or audience. As the Quote Investigator remarks [1], ‘torturing the data’ was employed as a metaphor much earlier. For example, a 1933 article contains the following passage:

"The evidence submitted by the committee from its own questionnaire warrants no such conclusion. To torture the data given in Table I into evidence supporting a twelve-hour minimum of professional training is indeed a statistical feat, but one which the committee accomplishes to its own satisfaction." ("The Elementary School Journal" Vol. 33 (7), 1933)

More than a decade earlier, in a context similar to that of Coase's quote, John Dewey remarked:

"Active experimentation must force the apparent facts of nature into forms different to those in which they familiarly present themselves; and thus make them tell the truth about themselves, as torture may compel an unwilling witness to reveal what he has been concealing." (John Dewey, "Reconstruction in Philosophy", 1920)

Torture has been used metaphorically since the 1600s, if we consider the following quote from Sir Francis Bacon’s 'Advancement of Learning':

"Another diversity of Methods is according to the subject or matter which is handled; for there is a great difference in delivery of the Mathematics, which are the most abstracted of knowledges, and Policy, which is the most immersed […], yet we see how that opinion, besides the weakness of it, hath been of ill desert towards learning, as that which taketh the way to reduce learning to certain empty and barren generalities; being but the very husks and shells of sciences, all the kernel being forced out and expulsed with the torture and press of the method." (Sir Francis Bacon, Advancement of Learning, 1605)

However, a similar metaphor with a closer meaning can be found more than two centuries later:

"One very reprehensible mode of theory-making consists, after honest deductions from a few facts have been made, in torturing other facts to suit the end proposed, in omitting some, and in making use of any authority that may lend assistance to the object desired; while all those which militate against it are carefully put on one side or doubted." (Henry De la Beche, "Sections and Views, Illustrative of Geological Phaenomena", 1830)

The following quote from Goethe probably also deserves some attention:

"Someday someone will write a pathology of experimental physics and bring to light all those swindles which subvert our reason, beguile our judgement and, what is worse, stand in the way of any practical progress. The phenomena must be freed once and for all from their grim torture chamber of empiricism, mechanism, and dogmatism; they must be brought before the jury of man's common sense." (Johann Wolfgang von Goethe)

Alternatives to Coase’s formulation were used in several later sources, replacing 'data' with 'statistics' or 'numbers':

"Beware of the problem of testing too many hypotheses; the more you torture the data, the more likely they are to confess, but confessions obtained under duress may not be admissible in the court of scientific opinion." (Stephen M Stigler, "Neutral Models in Biology", 1987)

"Torture numbers, and they will confess to anything." (Gregg Easterbrook, New Republic, 1989)

"[…] an honest exploratory study should indicate how many comparisons were made […] most experts agree that large numbers of comparisons will produce apparently statistically significant findings that are actually due to chance. The data torturer will act as if every positive result confirmed a major hypothesis. The honest investigator will limit the study to focused questions, all of which make biologic sense. The cautious reader should look at the number of ‘significant’ results in the context of how many comparisons were made." (James L Mills, "Data torturing", New England Journal of Medicine, 1993)

"This is true only if you torture the statistics until they produce the confession you want." (Larry Schweikart, "Myths of the 1980s Distort Debate over Tax Cuts", 2001)

"Even properly done statistics can’t be trusted. The plethora of available statistical techniques and analyses grants researchers an enormous amount of freedom when analyzing their data, and it is trivially easy to ‘torture the data until it confesses’." (Alex Reinhart, "Statistics Done Wrong: The Woefully Complete Guide", 2015)
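The multiple-comparisons problem that Stigler and Mills warn about can be made concrete with a small simulation. The sketch below is only an illustration under stated assumptions: both groups in every comparison are drawn from the same standard normal distribution (so any 'significant' difference is due to chance), the test is a two-sided z-test with known unit variance, and the function names, sample sizes and the 0.05 threshold are hypothetical choices, not taken from any of the quoted sources.

```python
import random
from statistics import NormalDist, mean

ALPHA = 0.05  # assumed significance threshold


def p_value_two_sample(a, b):
    """Two-sided z-test p-value for equal means, assuming unit variance in both groups."""
    z = (mean(a) - mean(b)) / ((1 / len(a) + 1 / len(b)) ** 0.5)
    return 2 * (1 - NormalDist().cdf(abs(z)))


def share_of_studies_with_a_finding(n_comparisons, n_studies=1000, n_per_group=30, seed=0):
    """Fraction of pure-noise studies that report at least one p-value below ALPHA."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_studies):
        for _ in range(n_comparisons):
            # Both groups come from the same distribution: there is no real effect.
            a = [rng.gauss(0, 1) for _ in range(n_per_group)]
            b = [rng.gauss(0, 1) for _ in range(n_per_group)]
            if p_value_two_sample(a, b) < ALPHA:
                hits += 1
                break  # one 'confession' is enough for the data torturer
    return hits / n_studies


def expected_share(n_comparisons, alpha=ALPHA):
    """Analytic chance of at least one false positive among independent comparisons."""
    return 1 - (1 - alpha) ** n_comparisons


if __name__ == "__main__":
    for k in (1, 5, 20):
        simulated = share_of_studies_with_a_finding(k)
        print(f"{k:2d} comparisons: simulated {simulated:.2f}, expected {expected_share(k):.2f}")
```

With a single comparison, roughly 5% of the noise-only studies yield a 'finding'; with 20 independent comparisons the analytic value rises to about 64%, which is why the number of comparisons made should be reported alongside any significant result.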

There is also a psychological component to torturing data or facts until they fit one's view of reality, a tendency that derives from the way the human mind works and from the limits and fallacies associated with its workings.

"What are the models? Well, the first rule is that you’ve got to have multiple models - because if you just have one or two that you’re using, the nature of human psychology is such that you’ll torture reality so that it fits your models, or at least you’ll think it does." (Charles Munger, 1994)

Independently of the formulation and context used, the fact remains: statistics (aka data, numbers) can be easily abused, and the reader/audience should be aware of it!

Previously published on quotablemath.blogspot.com.

🔭Data Science: Intelligence (Just the Quotes)

"To be able to discern that what is true is true, and that what is false is false, - this is the mark and character of intelligence." (Ralph W Emerson, "Essays", 1841)

"We study the complex in the simple; and only from the intuition of the lower can we safely proceed to the intellection of the higher degrees. The only danger lies in the leaping from low to high, with the neglect of the intervening gradations." (Samuel T Coleridge, "Physiology of Life", 1848)

"The accidental causes of science are only 'accidents' relatively to the intelligence of a man." (Chauncey Wright, "The Genesis of Species", North American Review, 1871)

"Does the harmony the human intelligence thinks it discovers in nature exist outside of this intelligence? No, beyond doubt, a reality completely independent of the mind which conceives it, sees or feels it, is an impossibility." (Henri Poincaré, "The Value of Science", 1905)

"No one can predict how far we shall be enabled by means of our limited intelligence to penetrate into the mysteries of a universe immeasurably vast and wonderful; nevertheless, each step in advance is certain to bring new blessings to humanity and new inspiration to greater endeavor." (Theodore W Richards, "The Fundamental Properties of the Elements", [Faraday lecture] 1911)

"It may be impossible for human intelligence to comprehend absolute truth, but it is possible to observe Nature with an unbiased mind and to bear truthful testimony of things seen." (Sir Richard A Gregory, "Discovery, Or, The Spirit and Service of Science", 1916)

"In other words then, if a machine is expected to be infallible, it cannot also be intelligent. There are several theorems which say almost exactly that. But these theorems say nothing about how much intelligence may be displayed if a machine makes no pretense at infallibility." (Alan M Turing, 1946)

"A computer would deserve to be called intelligent if it could deceive a human into believing that it was human." (Alan Turing, "Computing Machinery and Intelligence", Mind Vol. 59, 1950)

"All intelligent endeavor stands with one foot on observation and the other on contemplation." (Gerald Holton & Duane H D Roller, "Foundations of Modern Physical Science", 1950)

"What in fact is the schema of the object? In one essential respect it is a schema belonging to intelligence. To have the concept of an object is to attribute the perceived figure to a substantial basis, so that the figure and the substance that it thus indicates continue to exist outside the perceptual field. The permanence of the object seen from this viewpoint is not only a product of intelligence, but constitutes the very first of those fundamental ideas of conservation which we shall see developing within the thought process." (Jean Piaget, "The Psychology of Intelligence", 1950)

"[…] observation is not enough, and it seems to me that in science, as in the arts, there is very little worth having that does not require the exercise of intuition as well as of intelligence, the use of imagination as well as of information." (Kathleen Lonsdale, "Facts About Crystals", American Scientist Vol. 39 (4), 1951)

"Concepts are for me specific mental abilities exercised in acts of judgment, and expressed in the intelligent use of words (though not exclusively in such use)." (Peter T Geach, "Mental Acts: Their Content and their Objects", 1954)

"The following are some aspects of the artificial intelligence problem: […] If a machine can do a job, then an automatic calculator can be programmed to simulate the machine. […] It may be speculated that a large part of human thought consists of manipulating words according to rules of reasoning and rules of conjecture. From this point of view, forming a generalization consists of admitting a new word and some rules whereby sentences containing it imply and are implied by others. This idea has never been very precisely formulated nor have examples been worked out. […] How can a set of (hypothetical) neurons be arranged so as to form concepts. […] to get a measure of the efficiency of a calculation it is necessary to have on hand a method of measuring the complexity of calculating devices which in turn can be done. […] Probably a truly intelligent machine will carry out activities which may best be described as self-improvement. […] A number of types of 'abstraction' can be distinctly defined and several others less distinctly. […] the difference between creative thinking and unimaginative competent thinking lies in the injection of a some randomness. The randomness must be guided by intuition to be efficient." (John McCarthy et al, "A Proposal for the Dartmouth Summer Research Project on Artificial Intelligence", 1955)

"Solving problems is the specific achievement of intelligence." (George Polya, 1957)

"Computers do not decrease the need for mathematical analysis, but rather greatly increase this need. They actually extend the use of analysis into the fields of computers and computation, the former area being almost unknown until recently, the latter never having been as intensively investigated as its importance warrants. Finally, it is up to the user of computational equipment to define his needs in terms of his problems. In any case, computers can never eliminate the need for problem-solving through human ingenuity and intelligence." (Richard E Bellman & Paul Brock, "On the Concepts of a Problem and Problem-Solving", American Mathematical Monthly 67, 1960)

"Let an ultraintelligent machine be defined as a machine that can far surpass all the intellectual activities of any man however clever. Since the design of machines is one of these intellectual activities, an ultraintelligent machine could design even better machines; there would then unquestionably be an 'intelligence explosion', and the intelligence of man would be left far behind. Thus the first ultraintelligent machine is the last invention that man need ever make." (Irving J Good, "Speculations Concerning the First Ultraintelligent Machine", Advances in Computers Vol. 6, 1965)

"When intelligent machines are constructed, we should not be surprised to find them as confused and as stubborn as men in their convictions about mind-matter, consciousness, free will, and the like." (Marvin Minsky, "Matter, Mind, and Models", Proceedings of the International Federation of Information Processing Congress Vol. 1 (49), 1965)

"Artificial intelligence is the science of making machines do things that would require intelligence if done by men." (Marvin Minsky, 1968)

"Intelligence has two parts, which we shall call the epistemological and the heuristic. The epistemological part is the representation of the world in such a form that the solution of problems follows from the facts expressed in the representation. The heuristic part is the mechanism that on the basis of the information solves the problem and decides what to do." (John McCarthy & Patrick J Hayes, "Some Philosophical Problems from the Standpoint of Artificial Intelligence", Machine Intelligence 4, 1969)

"Questions are the engines of intellect, the cerebral machines which convert energy to motion, and curiosity to controlled inquiry." (David H Fischer, "Historians’ Fallacies", 1970)

"Man is not a machine, [...] although man most certainly processes information, he does not necessarily process it in the way computers do. Computers and men are not species of the same genus. [...] No other organism, and certainly no computer, can be made to confront genuine human problems in human terms. [...] However much intelligence computers may attain, now or in the future, theirs must always be an intelligence alien to genuine human problems and concerns." (Joseph Weizenbaum, "Computer Power and Human Reason: From Judgment to Calculation", 1976)

"Play is the only way the highest intelligence of humankind can unfold." (Joseph C Pearce, "Magical Child: Rediscovering Nature's Plan for Our Children", 1977)

"Because of mathematical indeterminancy and the uncertainty principle, it may be a law of nature that no nervous system is capable of acquiring enough knowledge to significantly predict the future of any other intelligent system in detail. Nor can intelligent minds gain enough self-knowledge to know their own future, capture fate, and in this sense eliminate free will." (Edward O Wilson, "On Human Nature", 1978)

"Collective intelligence emerges when a group of people work together effectively. Collective intelligence can be additive (each adds his or her part which together form the whole) or it can be synergetic, where the whole is greater than the sum of its parts." (Trudy and Peter Johnson-Lenz, "Groupware: Orchestrating the Emergence of Collective Intelligence", ca. 1980)

"Knowing a great deal is not the same as being smart; intelligence is not information alone but also judgement, the manner in which information is coordinated and used." (Carl Sagan, "Cosmos", 1980)

"The basic idea of cognitive science is that intelligent beings are semantic engines - in other words, automatic formal systems with interpretations under which they consistently make sense. We can now see why this includes psychology and artificial intelligence on a more or less equal footing: people and intelligent computers (if and when there are any) turn out to be merely different manifestations of the same underlying phenomenon. Moreover, with universal hardware, any semantic engine can in principle be formally imitated by a computer if only the right program can be found." (John Haugeland, "Semantic Engines: An introduction to mind design", 1981)

"There is a tendency to mistake data for wisdom, just as there has always been a tendency to confuse logic with values, intelligence with insight. Unobstructed access to facts can produce unlimited good only if it is matched by the desire and ability to find out what they mean and where they lead." (Norman Cousins, "Human Options: An Autobiographical Notebook", 1981)

"Cybernetic information theory suggests the possibility of assuming that intelligence is a feature of any feedback system that manifests a capacity for learning." (Paul Hawken et al, "Seven Tomorrows", 1982)

"We lose all intelligence by averaging." (John Naisbitt, "Megatrends: Ten New Directions Transforming Our Lives", 1982)

"Artificial intelligence is based on the assumption that the mind can be described as some kind of formal system manipulating symbols that stand for things in the world. Thus it doesn't matter what the brain is made of, or what it uses for tokens in the great game of thinking. Using an equivalent set of tokens and rules, we can do thinking with a digital computer, just as we can play chess using cups, salt and pepper shakers, knives, forks, and spoons. Using the right software, one system (the mind) can be mapped onto the other (the computer)." (George Johnson, "Machinery of the Mind: Inside the New Science of Artificial Intelligence", 1986)

"Cybernetics is simultaneously the most important science of the age and the least recognized and understood. It is neither robotics nor freezing dead people. It is not limited to computer applications and it has as much to say about human interactions as it does about machine intelligence. Today’s cybernetics is at the root of major revolutions in biology, artificial intelligence, neural modeling, psychology, education, and mathematics. At last there is a unifying framework that suspends long-held differences between science and art, and between external reality and internal belief." (Paul Pangaro, "New Order From Old: The Rise of Second-Order Cybernetics and Its Implications for Machine Intelligence", 1988)

"A popular myth says that the invention of the computer diminishes our sense of ourselves, because it shows that rational thought is not special to human beings, but can be carried on by a mere machine. It is a short step from there to the conclusion that intelligence is mechanical, which many people find to be an affront to all that is most precious and singular about their humanness." (Jeremy Campbell, "The improbable machine", 1989)

"Fuzziness, then, is a concomitant of complexity. This implies that as the complexity of a task, or of a system for performing that task, exceeds a certain threshold, the system must necessarily become fuzzy in nature. Thus, with the rapid increase in the complexity of the information processing tasks which the computers are called upon to perform, we are reaching a point where computers will have to be designed for processing of information in fuzzy form. In fact, it is the capability to manipulate fuzzy concepts that distinguishes human intelligence from the machine intelligence of current generation computers. Without such capability we cannot build machines that can summarize written text, translate well from one natural language to another, or perform many other tasks that humans can do with ease because of their ability to manipulate fuzzy concepts." (Lotfi A Zadeh, "The Birth and Evolution of Fuzzy Logic", 1989)

"Modeling underlies our ability to think and imagine, to use signs and language, to communicate, to generalize from experience, to deal with the unexpected, and to make sense out of the raw bombardment of our sensations. It allows us to see patterns, to appreciate, predict, and manipulate processes and things, and to express meaning and purpose. In short, it is one of the most essential activities of the human mind. It is the foundation of what we call intelligent behavior and is a large part of what makes us human. We are, in a word, modelers: creatures that build and use models routinely, habitually – sometimes even compulsively – to face, understand, and interact with reality." (Jeff Rothenberg, "The Nature of Modeling. In: Artificial Intelligence, Simulation, and Modeling", 1989)

"We haven't worked on ways to develop a higher social intelligence […] We need this higher intelligence to operate socially or we're not going to survive. […] If we don't manage things socially, individual high intelligence is not going to make much difference. [...] Ordinary thought in society is incoherent - it is going in all sorts of directions, with thoughts conflicting and canceling each other out. But if people were to think together in a coherent way, it would have tremendous power." (David Bohm, "New Age Journal", 1989)

"[Language comprehension] involves many components of intelligence: recognition of words, decoding them into meanings, segmenting word sequences into grammatical constituents, combining meanings into statements, inferring connections among statements, holding in short-term memory earlier concepts while processing later discourse, inferring the writer’s or speaker’s intentions, schematization of the gist of a passage, and memory retrieval in answering questions about the passage. [… The reader] constructs a mental representation of the situation and actions being described. […] Readers tend to remember the mental model they constructed from a text, rather than the text itself." (Gordon H Bower & Daniel G Morrow, 1990)

"The insight at the root of artificial intelligence was that these 'bits' (manipulated by computers) could just as well stand as symbols for concepts that the machine would combine by the strict rules of logic or the looser associations of psychology." (Daniel Crevier, "AI: The tumultuous history of the search for artificial intelligence", 1993)

"The leading edge of growth of intelligence is at the cultural and societal level. It is like a mind that is struggling to wake up. This is necessary because the most difficult problems we face are now collective ones. They are caused by complex global interactions and are beyond the scope of individuals to understand and solve. Individual mind, with its isolated viewpoints and narrow interests, is no longer enough." (Jeff Wright, "Basic Beliefs", [email] 1995)

"Adaptation is the process of changing a system during its operation in a dynamically changing environment. Learning and interaction are elements of this process. Without adaptation there is no intelligence." (Nikola K Kasabov, "Foundations of Neural Networks, Fuzzy Systems, and Knowledge Engineering", 1996)

"Artificial intelligence comprises methods, tools, and systems for solving problems that normally require the intelligence of humans. The term intelligence is always defined as the ability to learn effectively, to react adaptively, to make proper decisions, to communicate in language or images in a sophisticated way, and to understand." (Nikola K Kasabov, "Foundations of Neural Networks, Fuzzy Systems, and Knowledge Engineering", 1996)

"Learning is the process of obtaining new knowledge. It results in a better reaction to the same inputs at the next session of operation. It means improvement. It is a step toward adaptation. Learning is a major characteristic of intelligent systems." (Nikola K Kasabov, "Foundations of Neural Networks, Fuzzy Systems, and Knowledge Engineering", 1996)

"Intelligence is: (a) the most complex phenomenon in the Universe; or (b) a profoundly simple process. The answer, of course, is (c) both of the above. It's another one of those great dualities that make life interesting." (Ray Kurzweil, "The Age of Spiritual Machines: When Computers Exceed Human Intelligence", 1999)

"It [collective intelligence] is a form of universally distributed intelligence, constantly enhanced, coordinated in real time, and resulting in the effective mobilization of skills. I'll add the following indispensable characteristic to this definition: The basis and goal of collective intelligence is mutual recognition and enrichment of individuals rather than the cult of fetishized or hypostatized communities." (Pierre Levy, "Collective Intelligence", 1999)

"It is, however, fair to say that very few applications of swarm intelligence have been developed. One of the main reasons for this relative lack of success resides in the fact that swarm-intelligent systems are hard to 'program', because the paths to problem solving are not predefined but emergent in these systems and result from interactions among individuals and between individuals and their environment as much as from the behaviors of the individuals themselves. Therefore, using a swarm-intelligent system to solve a problem requires a thorough knowledge not only of what individual behaviors must be implemented but also of what interactions are needed to produce such or such global behavior." (Eric Bonabeau et al, "Swarm Intelligence: From Natural to Artificial Systems", 1999)

"Once a computer achieves human intelligence it will necessarily roar past it." (Ray Kurzweil, "The Age of Spiritual Machines: When Computers Exceed Human Intelligence", 1999)

"[…] when software systems become so intractable that they can no longer be controlled, swarm intelligence offers an alternative way of designing ‘intelligent’ systems, in which autonomy, emergence, and distributed functioning replace control, preprogramming, and centralization." (Eric Bonabeau et al, "Swarm Intelligence: From Natural to Artificial Systems", 1999)

"With the growing interest in complex adaptive systems, artificial life, swarms and simulated societies, the concept of 'collective intelligence' is coming more and more to the fore. The basic idea is that a group of individuals (e. g. people, insects, robots, or software agents) can be smart in a way that none of its members is. Complex, apparently intelligent behavior may emerge from the synergy created by simple interactions between individuals that follow simple rules." (Francis Heylighen, "Collective Intelligence and its Implementation on the Web", 1999)

"Ecological rationality uses reason – rational reconstruction – to examine the behavior of individuals based on their experience and folk knowledge, who are ‘naïve’ in their ability to apply constructivist tools to the decisions they make; to understand the emergent order in human cultures; to discover the possible intelligence embodied in the rules, norms and institutions of our cultural and biological heritage that are created from human interactions but not by deliberate human design. People follow rules without being able to articulate them, but they can be discovered." (Vernon L Smith, "Constructivist and ecological rationality in economics", 2002)

"But intelligence is not just a matter of acting or behaving intelligently. Behavior is a manifestation of intelligence, but not the central characteristic or primary definition of being intelligent. A moment's reflection proves this: You can be intelligent just lying in the dark, thinking and understanding. Ignoring what goes on in your head and focusing instead on behavior has been a large impediment to understanding intelligence and building intelligent machines." (Jeff Hawkins, "On Intelligence", 2004)

"Evolution moves towards greater complexity, greater elegance, greater knowledge, greater intelligence, greater beauty, greater creativity, and greater levels of subtle attributes such as love. […] Of course, even the accelerating growth of evolution never achieves an infinite level, but as it explodes exponentially it certainly moves rapidly in that direction." (Ray Kurzweil, "The Singularity is Near", 2005)

"Swarm Intelligence can be defined more precisely as: Any attempt to design algorithms or distributed problem-solving methods inspired by the collective behavior of the social insect colonies or other animal societies. The main properties of such systems are flexibility, robustness, decentralization and self-organization." ("Swarm Intelligence in Data Mining", Ed. Ajith Abraham et al, 2006)

"Swarm intelligence is sometimes also referred to as mob intelligence. Swarm intelligence uses large groups of agents to solve complicated problems. Swarm intelligence uses a combination of accumulation, teamwork, and voting to produce solutions. Accumulation occurs when agents contribute parts of a solution to a group. Teamwork occurs when different agents or subgroups of agents accidentally or purposefully work on different parts of a large problem. Voting occurs when agents propose solutions or components of solutions and the other agents vote explicitly by rating the proposal’s quality or vote implicitly by choosing whether to follow the proposal." (Michael J North & Charles M Macal, "Managing Business Complexity: Discovering Strategic Solutions with Agent-Based Modeling and Simulation", 2007)

"The brain and its cognitive mental processes are the biological foundation for creating metaphors about the world and oneself. Artificial intelligence, human beings’ attempt to transcend their biology, tries to enter into these scenarios to learn how they function. But there is another metaphor of the world that has its own particular landscapes, inhabitants, and laws. The brain provides the organic structure that is necessary for generating the mind, which in turn is considered a process that results from brain activity." (Diego Rasskin-Gutman, "Chess Metaphors: Artificial Intelligence and the Human Mind", 2009)

"Cultures are never merely intellectual constructs. They take form through the collective intelligence and memory, through a commonly held psychology and emotions, through spiritual and artistic communion." (Tariq Ramadan, "Islam and the Arab Awakening", 2012)

"An intuition is neither caprice nor a sixth sense but a form of unconscious intelligence." (Gerd Gigerenzer, "Risk Savvy", 2015)

"Artificial intelligence is the elucidation of the human learning process, the quantification of the human thinking process, the explication of human behavior, and the understanding of what makes intelligence possible." (Kai-Fu Lee, "AI Superpowers: China, Silicon Valley, and the New World Order", 2018)

"Deep learning has instead given us machines with truly impressive abilities but no intelligence. The difference is profound and lies in the absence of a model of reality." (Judea Pearl, "The Book of Why: The New Science of Cause and Effect", 2018)

"AI won’t be fool proof in the future since it will only as good as the data and information that we give it to learn. It could be the case that simple elementary tricks could fool the AI algorithm and it may serve a complete waste of output as a result." (Zoltan Andrejkovics, "Together: AI and Human. On the Same Side", 2019)

"People who assume that extensions of modern machine learning methods like deep learning will somehow 'train up', or learn to be intelligent like humans, do not understand the fundamental limitations that are already known. Admitting the necessity of supplying a bias to learning systems is tantamount to Turing’s observing that insights about mathematics must be supplied by human minds from outside formal methods, since machine learning bias is determined, prior to learning, by human designers." (Erik J Larson, "The Myth of Artificial Intelligence: Why Computers Can’t Think the Way We Do", 2021)

More quotes on "Intelligence" at the-web-of-knowledge.blogspot.com

01 November 2018

🔭Data Science: Black Boxes (Just the Quotes)

"The terms 'black box' and 'white box' are convenient and figurative expressions of not very well determined usage. I shall understand by a black box a piece of apparatus, such as four-terminal networks with two input and two output terminals, which performs a definite operation on the present and past of the input potential, but for which we do not necessarily have any information of the structure by which this operation is performed. On the other hand, a white box will be similar network in which we have built in the relation between input and output potentials in accordance with a definite structural plan for securing a previously determined input-output relation." (Norbert Wiener, "Cybernetics: Or Control and Communication in the Animal and the Machine", 1948)

"The definition of a ‘good model’ is when everything inside it is visible, inspectable and testable. It can be communicated effortlessly to others. A ‘bad model’ is a model that does not meet these standards, where parts are hidden, undefined or concealed and it cannot be inspected or tested; these are often labelled black box models." (Hördur V Haraldsson & Harald U Sverdrup, "Finding Simplicity in Complexity in Biogeochemical Modelling" [in "Environmental Modelling: Finding Simplicity in Complexity", Ed. by John Wainwright and Mark Mulligan, 2004])

"Operational thinking is about mapping relationships. It is about capturing interactions, interconnections, the sequence and flow of activities, and the rules of the game. It is about how systems do what they do, or the dynamic process of using elements of the structure to produce the desired functions. In a nutshell, it is about unlocking the black box that lies between system input and system output." (Jamshid Gharajedaghi, "Systems Thinking: Managing Chaos and Complexity A Platform for Designing Business Architecture" 3rd Ed., 2011)

"The transparency of Bayesian networks distinguishes them from most other approaches to machine learning, which tend to produce inscrutable 'black boxes'. In a Bayesian network you can follow every step and understand how and why each piece of evidence changed the network’s beliefs." (Judea Pearl & Dana Mackenzie, "The Book of Why: The new science of cause and effect", 2018)

"A recurring theme in machine learning is combining predictions across multiple models. There are techniques called bagging and boosting which seek to tweak the data and fit many estimates to it. Averaging across these can give a better prediction than any one model on its own. But here a serious problem arises: it is then very hard to explain what the model is (often referred to as a 'black box'). It is now a mixture of many, perhaps a thousand or more, models." (Robert Grant, "Data Visualization: Charts, Maps and Interactive Graphics", 2019)

"Deep neural networks have an input layer and an output layer. In between, are “hidden layers” that process the input data by adjusting various weights in order to make the output correspond closely to what is being predicted. [...] The mysterious part is not the fancy words, but that no one truly understands how the pattern recognition inside those hidden layers works. That’s why they’re called 'hidden'. They are an inscrutable black box - which is okay if you believe that computers are smarter than humans, but troubling otherwise." (Gary Smith & Jay Cordes, "The 9 Pitfalls of Data Science", 2019)

"The concept of integrated information is clearest when applied to networks. Imagine a black box with input and output terminals. Inside are some electronics, such as a network with logic elements (AND, OR, and so on) wired together. Viewed from the outside, it will usually not be possible to deduce the circuit layout simply by examining the cause–effect relationship between inputs and outputs, because functionally equivalent black boxes can be built from very different circuits. But if the box is opened, it’s a different story. Suppose you use a pair of cutters to sever some wires in the network. Now rerun the system with all manner of inputs. If a few snips dramatically alter the outputs, the circuit can be described as highly integrated, whereas in a circuit with low integration the effect of some snips may make no difference at all." (Paul Davies, "The Demon in the Machine: How Hidden Webs of Information Are Solving the Mystery of Life", 2019)

"Big data is revolutionizing the world around us, and it is easy to feel alienated by tales of computers handing down decisions made in ways we don’t understand. I think we’re right to be concerned. Modern data analytics can produce some miraculous results, but big data is often less trustworthy than small data. Small data can typically be scrutinized; big data tends to be locked away in the vaults of Silicon Valley. The simple statistical tools used to analyze small datasets are usually easy to check; pattern-recognizing algorithms can all too easily be mysterious and commercially sensitive black boxes." (Tim Harford, "The Data Detective: Ten easy rules to make sense of statistics", 2020)

"If the data that go into the analysis are flawed, the specific technical details of the analysis don’t matter. One can obtain stupid results from bad data without any statistical trickery. And this is often how bullshit arguments are created, deliberately or otherwise. To catch this sort of bullshit, you don’t have to unpack the black box. All you have to do is think carefully about the data that went into the black box and the results that came out. Are the data unbiased, reasonable, and relevant to the problem at hand? Do the results pass basic plausibility checks? Do they support whatever conclusions are drawn?" (Carl T Bergstrom & Jevin D West, "Calling Bullshit: The Art of Skepticism in a Data-Driven World", 2020)

"This problem with adding additional variables is referred to as the curse of dimensionality. If you add enough variables into your black box, you will eventually find a combination of variables that performs well - but it may do so by chance. As you increase the number of variables you use to make your predictions, you need exponentially more data to distinguish true predictive capacity from luck." (Carl T Bergstrom & Jevin D West, "Calling Bullshit: The Art of Skepticism in a Data-Driven World", 2020)

🔭Data Science: Probabilistic Models (Just the Quotes)

"A deterministic system is one in which the parts interact in a perfectly predictable way. There is never any room for doubt: given a last state of the system and the programme of information by defining its dynamic network, it is always possible to predict, without any risk of error, its succeeding state. A probabilistic system, on the other hand, is one about which no precisely detailed prediction can be given. The system may be studied intently, and it may become more and more possible to say what it is likely to do in any given circumstances. But the system simply is not predetermined, and a prediction affecting it can never escape from the logical limitations of the probabilities in which terms alone its behaviour can be described." (Stafford Beer, "Cybernetics and Management", 1959)

"[...] there can be such a thing as a simple probabilistic system. For example, consider the tossing of a penny. Here is a perfectly simple system, but one which is notoriously unpredictable. It may be described in terms of a binary decision process, with a built-in even probability between the two possible outcomes." (Stafford Beer, "Cybernetics and Management", 1959)

"When loops are present, the network is no longer singly connected and local propagation schemes will invariably run into trouble. [...] If we ignore the existence of loops and permit the nodes to continue communicating with each other as if the network were singly connected, messages may circulate indefinitely around the loops and process may not converge to a stable equilibrium. […] Such oscillations do not normally occur in probabilistic networks […] which tend to bring all messages to some stable equilibrium as time goes on. However, this asymptotic equilibrium is not coherent, in the sense that it does not represent the posterior probabilities of all nodes of the network." (Judea Pearl, "Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference", 1988)

"We will use the convenient expression 'chosen at random' to mean that the probabilities of the events in the sample space are all the same unless some modifying words are near to the words 'at random'. Usually we will compute the probability of the outcome based on the uniform probability model since that is very common in modeling simple situations. However, a uniform distribution does not imply that it comes from a random source; […]" (Richard W Hamming, "The Art of Probability for Scientists and Engineers", 1991)

"Exploratory data analysis (EDA) is a collection of techniques that reveal (or search for) structure in a data set before calculating any probabilistic model. Its purpose is to obtain information about the data distribution (univariate or multivariate), about the presence of outliers and clusters, to disclose relationships and correlations between objects and/or variables." (Ildiko E  Frank & Roberto Todeschini, "The Data Analysis Handbook", 1994)

"To understand what kinds of problems are solvable by the Monte Carlo method, it is important to note that the method enables simulation of any process whose development is influenced by random factors. Second, for many mathematical problems involving no chance, the method enables us to artificially construct a probabilistic model (or several such models), making possible the solution of the problems." (Ilya M Sobol, "A Primer for the Monte Carlo Method", 1994)

"The role of graphs in probabilistic and statistical modeling is threefold: (1) to provide convenient means of expressing substantive assumptions; (2) to facilitate economical representation of joint probability functions; and (3) to facilitate efficient inferences from observations." (Judea Pearl, "Causality: Models, Reasoning, and Inference", 2000)

"The nice thing with Monte Carlo is that you play a game of let’s pretend, like this: first of all there are ten scenarios with different probabilities, so let’s first pick a probability. The dice in this case is a random number generator in the computer. You roll the dice and pick a scenario to work with. Then you roll the dice for a certain speed, and you roll the dice again to see what direction it took. The last thing is that it collided with the bottom at an unknown time so you roll dice for the unknown time. So now you have speed, direction, starting point, time. Given them all, I know precisely where it [could have] hit the bottom. You have the computer put a point there. Rolling dice, I come up with different factors for each scenario. If I had enough patience, I could do it with pencil and paper. We calculated ten thousand points. So you have ten thousand points on the bottom of the ocean that represent equally likely positions of the sub. Then you draw a grid, count the points in each cell of the grid, saying that 10% of the points fall in this cell, 1% in that cell, and those percentages are what you use for probabilities for the prior for the individual distributions." (Henry R Richardson) [in (Sharon B McGrayne, "The Theory That Would Not Die", 2011)]

"A major advantage of probabilistic models is that they can be easily applied to virtually any data type (or mixed data type), as long as an appropriate generative model is available for each mixture component. [...] A downside of probabilistic models is that they try to fit the data to a particular kind of distribution, which may often not be appropriate for the underlying data. Furthermore, as the number of model parameters increases, over-fitting becomes more common. In such cases, the outliers may fit the underlying model of normal data. Many parametric models are also harder to interpret in terms of intensional knowledge, especially when the parameters of the model cannot be intuitively presented to an analyst in terms of underlying attributes. This can defeat one of the important purposes of anomaly detection, which is to provide diagnostic understanding of the abnormal data generative process." (Charu C Aggarwal, "Outlier Analysis", 2013)

"The process of using a probabilistic model to answer a query, given evidence." (Avi Pfeffer, "Practical Probabilistic Programming", 2016)

"Monte Carlo simulations handle uncertainty by using a computer’s random number generator to determine outcomes. Done over and over again, the simulations show the distribution of the possible outcomes. [...] The beauty of these Monte Carlo simulations is that they allow users to see the probabilistic consequences of their decisions, so that they can make informed choices. [...] Monte Carlo simulations are one of the most valuable applications of data science because they can be used to analyze virtually any uncertain situation where we are able to specify the nature of the uncertainty [...]" (Gary Smith & Jay Cordes, "The 9 Pitfalls of Data Science", 2019)

"A simple probabilistic model would not be sufficient to generate the fantastic diversity we see." (Wolfgang Pauli)

31 October 2018

🔭Data Science: Deep Learning (Just the Quotes)

"Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression." (Pankaj Mehta & David J Schwab, "An exact mapping between the Variational Renormalization Group and Deep Learning", 2014)

"Deep learning is about using a stacked hierarchy of feature detectors. [...] we use pattern detectors and we build them into networks that are arranged in hundreds of layers and then we adjust the links between these layers, usually using some kind of gradient descent." (Joscha Bach, "Joscha: Computational Meta-Psychology", 2015)

"The power of deep learning models comes from their ability to classify or predict nonlinear data using a modest number of parallel nonlinear steps. A deep learning model learns the input data features hierarchy all the way from raw data input to the actual classification of the data. Each layer extracts features from the output of the previous layer." (N D Lewis, "Deep Learning Made Easy with R: A Gentle Introduction for Data Science", 2016)

"Although deep learning systems share some similarities with machine learning systems, certain characteristics make them sufficiently distinct. For example, conventional machine learning systems tend to be simpler and have fewer options for training. DL systems are noticeably more sophisticated; they each have a set of training algorithms, along with several parameters regarding the systems’ architecture. This is one of the reasons we consider them a distinct framework in data science." (Yunus E Bulut & Zacharias Voulgaris, "AI for Data Science: Artificial Intelligence Frameworks and Functionality for Deep Learning, Optimization, and Beyond", 2018)

"Deep learning broadly describes the large family of neural network architectures that contain multiple, interacting hidden layers." (Benjamin Bengfort et al, "Applied Text Analysis with Python", 2018)

"Deep learning has instead given us machines with truly impressive abilities but no intelligence. The difference is profound and lies in the absence of a model of reality." (Judea Pearl, "The Book of Why: The New Science of Cause and Effect", 2018)

"DL systems also tend to be more autonomous than their machine counterparts. To some extent, DL systems can do their own feature engineering. More conventional systems tend to require more fine-tuning of the feature-set, and sometimes require dimensionality reduction to provide any decent results. In addition, the generalization of conventional ML systems when provided with additional data generally don’t improve as much as DL systems. This is also one of the key characteristics that makes DL systems a preferable option when big data is involved." (Yunus E Bulut & Zacharias Voulgaris, "AI for Data Science: Artificial Intelligence Frameworks and Functionality for Deep Learning, Optimization, and Beyond", 2018)

"[…] deep learning has succeeded primarily by showing that certain questions or tasks we thought were difficult are in fact not. It has not addressed the truly difficult questions that continue to prevent us from achieving humanlike AI." (Judea Pearl & Dana Mackenzie, "The Book of Why: The new science of cause and effect", 2018)

"In essence, deep learning models are just chains of functions, which means that many deep learning libraries tend to have a functional or verbose, declarative style." (Benjamin Bengfort et al, "Applied Text Analysis with Python", 2018)

"The second big myth of data science is that every data science project needs big data and needs to use deep learning. In general, having more data helps, but having the right data is the more important requirement" (John D Kelleher & Brendan Tierney, "Data Science", 2018)

"People who assume that extensions of modern machine learning methods like deep learning will somehow 'train up', or learn to be intelligent like humans, do not understand the fundamental limitations that are already known. Admitting the necessity of supplying a bias to learning systems is tantamount to Turing’s observing that insights about mathematics must be supplied by human minds from outside formal methods, since machine learning bias is determined, prior to learning, by human designers." (Erik J Larson, "The Myth of Artificial Intelligence: Why Computers Can’t Think the Way We Do", 2021)

30 October 2018

💠🛠️SQL Server: Administration (Troubleshooting Login Failed for User)

    Since the installation of SQL Server 2017 on a virtual machine (VM) in the Microsoft Cloud, records with the following message started to appear in the error log:

Login failed for user '<domain>\<computer>$'. Reason: Could not find a login matching the name provided. [CLIENT: <local machine>]
Error: 18456, Severity: 14, State: 5.


   From the text it seemed like a permission problem, which the documentation confirmed (see [1]): the error number and state correspond to a “User ID is not valid” situation. As a first step I attempted to give permissions to the local account (dollar sign included). The account wasn’t found in the Active Directory (AD), though by typing the account directly in the “Login name” field I managed to temporarily give sysadmin permission to the account. The error continued to appear in the error log. I then looked at the accounts under which the SQL Server services run and found nothing suspect there.

   Except for the error message, which appeared with alarming frequency (a few seconds apart), everything seemed to be working on the server. The volume of records (a few hundred thousand over a few days) bloating the error log, as well as the fact that I didn’t know what was going on, made me take the time to investigate the issue further.

   Looking today at the Windows Logs for Applications, I observed that the error was caused by an account used for the Microsoft SQL Server IaaS Agent and IaaS Query Service. Once I gave permissions to the account, the error disappeared.

   The search for a best practice on what permissions to give to the IaaS Agent and IaaS Query Service led me to [2]. To quote, the “Agent Service needs Local System rights to be able to install and configure SQL Server, attach disks and enable storage pool and manage automated security patching of Windows and SQL server”, while the “IaaS Query Service is started with an NT Service account which is a Sys Admin on the SQL Server”. In fact, this was the only resource I found that made a reference to the IaaS Query Service.
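
   As a rough illustration of what giving permissions to such an account can look like, here is a minimal sketch. The [<account_name>] placeholder is hypothetical and stands for the account reported in the Windows log; the sysadmin membership only mirrors the statement quoted from [2] and is an assumption, not a recommendation, so narrower permissions should be preferred where they suffice.

```sql
-- hypothetical sketch: replace <account_name> with the account reported in the logs
USE master;

-- creating a Windows login for the account, if there is none yet
IF NOT EXISTS (SELECT 1 FROM sys.server_principals WHERE name = N'<account_name>')
    CREATE LOGIN [<account_name>] FROM WINDOWS;

-- assumption: sysadmin membership, as stated for the IaaS Query Service in [2]
ALTER SERVER ROLE sysadmin ADD MEMBER [<account_name>];
```

   After the change, the error log can be checked again to confirm that no new "Login failed" records are written.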

   This was just one of the many scenarios in which the above error appears. For more information see, for example, [3], [4] or [5].

References:
[1] Microsoft (2017) MSSQLSERVER_18456 [Online] Available from: https://docs.microsoft.com/en-us/sql/relational-databases/errors-events/mssqlserver-18456-database-engine-error?view=sql-server-2017
[2] SQL Database Engine Blog (2018) SQL Server IaaS Extension Query Service for SQL Server on Azure VM, by Mine Tokus Altug [Online] Available from:  https://blogs.msdn.microsoft.com/sqlserverstorageengine/2018/10/25/sql-server-iaas-extension-query-service-for-sql-server-on-azure-vm/
[3] Microsoft Support (2018) "Login failed for user" error message when you log on to SQL Server [Online] Available from: https://support.microsoft.com/en-sg/help/555332/login-failed-for-user-error-message-when-you-log-on-to-sql-server
[4] Microsoft Technet (2018) How to Troubleshoot Connecting to the SQL Server Database Engine [Online] Available from: https://social.technet.microsoft.com/wiki/contents/articles/2102.how-to-troubleshoot-connecting-to-the-sql-server-database-engine.aspx
[5] Microsoft Blogs (2011) Troubleshoot Connectivity/Login failures (18456 State x) with SQL Server, by Sakthivel Chidambaram [Online] Available from: https://blogs.msdn.microsoft.com/sqlsakthi/2011/02/06/troubleshoot-connectivitylogin-failures-18456-state-x-with-sql-server/

29 October 2018

💠🛠️SQL Server: Administration (Searching the Error Log)

    Searching for a needle in a haystack is an achievable task, though it may turn out to be daunting. The same can be said about searching for a piece of information in the SQL Server error log. Fortunately, there is xp_readerrorlog, an undocumented (extended) stored procedure, which helps in the process. The stored procedure makes available the content of the error log and provides basic search capabilities via a small set of parameters. For example, it can be used to search for errors, warnings, failed backups, consistency checks, failed logins, database instant file initializations, and so on. It helps identify whether an event occurred and the time at which it occurred.

   The following are the parameters available with the stored procedure:

Parameter  Name          Type          Description
1          FileToRead    int           0 = Current, 1 or 2, 3, … n Archive Number
2          Logtype       int           1 = SQL Error Log and 2 = SQL Agent log
3          String1       varchar(255)  the string to match the logs on
4          String2       varchar(255)  a second string to match in combination with String1 (AND)
5          StartDate     datetime      beginning date to look from
6          EndDate       datetime      ending date to look up to
7          ResultsOrder                ASC or DESC sorting


Note:
If the SQL Server Agent hasn’t been active, then there will be no Agent log and the call to the stored procedure will return an error.
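
   To see which archive numbers can be passed as the first parameter, the available log files can be listed beforehand. The sketch below assumes that the sp_enumerrorlogs stored procedure is available on the instance; like xp_readerrorlog, it is undocumented, so its presence and output should be verified on the version in use.

```sql
-- listing the available SQL Server error logs (archive number, date, size)
-- assumption: sp_enumerrorlogs is undocumented and may change between releases
EXEC sp_enumerrorlogs
```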

   Here are a few examples of using the stored procedure:

-- listing the content of the current SQL Server error log
EXEC xp_readerrorlog 0, 1

-- listing the content of the second SQL Server error log
EXEC xp_readerrorlog 1, 1

-- listing the content of the current SQL Server Agent log
EXEC xp_readerrorlog 0, 2

-- searching for errors 
EXEC xp_readerrorlog 0, 1, N'error'

-- searching for errors that have to do with consistency checks
EXEC xp_readerrorlog 0, 1, N'error', N'CHECKDB'

-- searching for failed backups
EXEC xp_readerrorlog 0, 1, N'failed', N'backup'

-- searching for warnings 
EXEC xp_readerrorlog 0, 1, N'warning'

-- searching who killed a session
EXEC xp_readerrorlog 0, 1, N'kill'

-- searching for I/O information
EXEC xp_readerrorlog 0, 1, N'I/O'

-- searching for consistency checks 
EXEC xp_readerrorlog 0, 1, N'CHECKDB'

-- searching for consistency checks performed via DBCC
EXEC xp_readerrorlog 0, 1, N'DBCC CHECKDB'

-- searching for failed logins  
EXEC xp_readerrorlog 0, 1, N'Login failed'

-- searching for messages tagged with [INFO]
EXEC xp_readerrorlog 0, 1, N'[INFO]'

-- searching for shutdowns 
EXEC xp_readerrorlog 0, 1, N'shutdown'

-- searching for a database instant file initialization event  
EXEC xp_readerrorlog 0, 1, N'database instant file initialization'

   If the error log is too big, it’s important to narrow the search to a given time interval:

-- searching for errors starting with a given date 
DECLARE @StartDate as Date = DateAdd(d, -1, GetDate())
EXEC xp_readerrorlog 0, 1, N'error', N'', @StartDate

-- listing all messages within a time interval, newest first 
-- (run as a separate batch, as @StartDate is also declared in the previous example)
DECLARE @StartDate as Date = DateAdd(d, -14, GetDate())
DECLARE @EndDate as Date = DateAdd(d, -7, GetDate())
EXEC xp_readerrorlog 0, 1, N'', N'', @StartDate, @EndDate, N'desc' 

   The output can be dumped into a table, especially when a detailed analysis of the error log is needed. It might be interesting to check how often an error message occurred, as in the example below. One can thus take advantage of more complex pattern searching.

-- creating the error log table 
CREATE TABLE dbo.ErrorLogMessages (
    LogDate datetime2(0) 
  , ProcessInfo nvarchar(255)
  , [Text] nvarchar(max))

-- loading the errors 
INSERT INTO dbo.ErrorLogMessages
EXEC xp_readerrorlog 0, 1

-- checking the results 
SELECT *
FROM dbo.ErrorLogMessages

-- checking messages frequency 
SELECT [Text]
, count(*) NoOccurrences
, Min(LogDate) FirstOccurrence
FROM dbo.ErrorLogMessages
GROUP BY [Text]
HAVING count(*)>1
ORDER BY NoOccurrences DESC

-- getting the errors and the message logged right after each of them 
-- (rows sharing the same LogDate and ProcessInfo have no guaranteed order)
SELECT *
FROM (
 SELECT *
 , Lead([Text], 1) OVER (PARTITION BY LogDate, ProcessInfo ORDER BY LogDate) NextMessage
 FROM dbo.ErrorLogMessages
 ) DAT
WHERE [Text] LIKE '%error:%[0-9]%'

-- cleaning up 
--DROP TABLE IF EXISTS dbo.ErrorLogMessages 
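
   Once the messages are in a table, they can also be aggregated over time, for example to see how the failed logins are distributed per hour. The query below is only a sketch: it assumes the dbo.ErrorLogMessages table created above is still available and that the messages of interest start with 'Login failed for user'.

```sql
-- counting the failed logins per hour (sketch based on the table loaded above)
SELECT DateAdd(hour, DateDiff(hour, 0, LogDate), 0) LogHour
, count(*) NoOccurrences
FROM dbo.ErrorLogMessages
WHERE [Text] LIKE 'Login failed for user%'
GROUP BY DateAdd(hour, DateDiff(hour, 0, LogDate), 0)
ORDER BY LogHour
```

   The same approach works for any other pattern, by changing the LIKE condition.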

   For those who don’t have admin permissions it is necessary to explicitly give execute permissions on the xp_readerrorlog stored procedure:

-- giving explicit permissions to account 
-- (to be run in master, where the account must exist as a user)
USE master
GRANT EXECUTE ON xp_readerrorlog TO [<account_name>]

   Personally, I’ve been using the stored procedure mainly to check whether error messages were logged for a given time interval and whether the consistency checks ran without problems. Occasionally, I used it to check for failed logins or session terminations (aka kills).

Notes:
Microsoft warns that undocumented objects might change in future releases. Fortunately, xp_readerrorlog made it since SQL Server 2005 to SQL Server 2017, so it might make it further…
The above code was tested also on SQL Server 2017.

Happy coding!

28 October 2018

🔭Data Science: Limits (Just the Quotes)

"Whatever lies beyond the limits of experience, and claims another origin than that of induction and deduction from established data, is illegitimate." (George H Lewes, "The Foundations of a Creed", 1875)

"It is difficult to understand why statisticians commonly limit their inquiries to Averages, and do not revel in more comprehensive views. Their souls seem as dull to the charm of variety as that of the native of one of our flat English counties, whose retrospect of Switzerland was that, if its mountains could be thrown into its lakes, two nuisances would be got rid of at once. An Average is but a solitary fact, whereas if a single other fact be added to it, an entire Normal Scheme, which nearly corresponds to the observed one, starts potentially into existence." (Sir Francis Galton, "Natural Inheritance", 1889)

"Physical research by experimental methods is both a broadening and a narrowing field. There are many gaps yet to be filled, data to be accumulated, measurements to be made with great precision, but the limits within which we must work are becoming, at the same time, more and more defined." (Elihu Thomson, "Annual Report of the Board of Regents of the Smithsonian Institution", 1899)

"Statistics may, for instance, be called the science of counting. Counting appears at first sight to be a very simple operation, which any one can perform or which can be done automatically; but, as a matter of fact, when we come to large numbers, e.g., the population of the United Kingdom, counting is by no means easy, or within the power of an individual; limits of time and place alone prevent it being so carried out, and in no way can absolute accuracy be obtained when the numbers surpass certain limits." (Sir Arthur L Bowley, "Elements of Statistics", 1901)

"The usefulness of the models in constructing a testable theory of the process is severely limited by the quickly increasing number of parameters which must be estimated in order to compare the predictions of the models with empirical results" (Anatol Rapoport, "Prisoner's Dilemma: A study in conflict and cooperation", 1965)

"A real change of theory is not a change of equations - it is a change of mathematical structure, and only fragments of competing theories, often not very important ones conceptually, admit comparison with each other within a limited range of phenomena." (Yuri I Manin, "Mathematics and Physics", 1981)

"Models are often used to decide issues in situations marked by uncertainty. However statistical inferences from data depend on assumptions about the process which generated these data. If the assumptions do not hold, the inferences may not be reliable either. This limitation is often ignored by applied workers who fail to identify crucial assumptions or subject them to any kind of empirical testing. In such circumstances, using statistical procedures may only compound the uncertainty." (David A Freedman & William C Navidi, "Regression Models for Adjusting the 1980 Census", Statistical Science Vol. 1 (1), 1986)

"[…] an honest exploratory study should indicate how many comparisons were made […] most experts agree that large numbers of comparisons will produce apparently statistically significant findings that are actually due to chance. The data torturer will act as if every positive result confirmed a major hypothesis. The honest investigator will limit the study to focused questions, all of which make biologic sense. The cautious reader should look at the number of ‘significant’ results in the context of how many comparisons were made." (James L Mills, "Data torturing", New England Journal of Medicine, 1993)

"In spite of the insurmountable computational limits, we continue to pursue the many problems that possess the characteristics of organized complexity. These problems are too important for our well being to give up on them. The main challenge in pursuing these problems narrows down fundamentally to one question: how to deal with systems and associated problems whose complexities are beyond our information processing limits? That is, how can we deal with these problems if no computational power alone is sufficient?"  (George Klir, "Fuzzy sets and fuzzy logic", 1995)

"The larger, more detailed and complex the model - the less abstract the abstraction – the smaller the number of people capable of understanding it and the longer it takes for its weaknesses and limitations to be found out." (John Adams, "Risk", 1995)

"[...] the NFL theorems mean that if an algorithm does particularly well on average for one class of problems then it must do worse on average over the remaining problems. In particular, if an algorithm performs better than random search on some class of problems then it must perform worse than random search on the remaining problems. Thus comparisons reporting the performance of a particular algorithm with a particular parameter setting on a few sample problems are of limited utility. While such results do indicate behavior on the narrow range of problems considered, one should be very wary of trying to generalize those results to other problems." (David H Wolpert & William G Macready, "No free lunch theorems for optimization", IEEE Transactions on Evolutionary Computation 1 (1), 1997)

"No comparison between two values can be global. A simple comparison between the current figure and some previous value cannot convey the behavior of any time series. […] While it is simple and easy to compare one number with another number, such comparisons are limited and weak. They are limited because of the amount of data used, and they are weak because both of the numbers are subject to the variation that is inevitably present in real world data. Since both the current value and the earlier value are subject to this variation, it will always be difficult to determine just how much of the difference between the values is due to variation in the numbers, and how much, if any, of the difference is due to real changes in the process." (Donald J Wheeler, "Understanding Variation: The Key to Managing Chaos" 2nd Ed., 2000)

"[…] an obvious difference between our best classifiers and human learning is the number of examples required in tasks such as object detection. […] the difficulty of a learning task depends on the size of the required hypothesis space. This complexity determines in turn how many training examples are needed to achieve a given level of generalization error. Thus the complexity of the hypothesis space sets the speed limit and the sample complexity for learning." (Tomaso Poggio & Steve Smale, "The Mathematics of Learning: Dealing with Data", Notices of the AMS, 2003)

"Every number has its limitations; every number is a product of choices that inevitably involve compromise. Statistics are intended to help us summarize, to get an overview of part of the world’s complexity. But some information is always sacrificed in the process of choosing what will be counted and how. Something is, in short, always missing. In evaluating statistics, we should not forget what has been lost, if only because this helps us understand what we still have." (Joel Best, "More Damned Lies and Statistics : How numbers confuse public issues", 2004)

"Statistics depend on collecting information. If questions go unasked, or if they are asked in ways that limit responses, or if measures count some cases but exclude others, information goes ungathered, and missing numbers result. Nevertheless, choices regarding which data to collect and how to go about collecting the information are inevitable." (Joel Best, "More Damned Lies and Statistics: How numbers confuse public issues", 2004)

"The Bayesian approach is based on the following postulates: (B1) Probability describes degree of belief, not limiting frequency. As such, we can make probability statements about lots of things, not just data which are subject to random variation. […] (B2) We can make probability statements about parameters, even though they are fixed constants. (B3) We make inferences about a parameter θ by producing a probability distribution for θ. Inferences, such as point estimates and interval estimates, may then be extracted from this distribution." (Larry A Wasserman, "All of Statistics: A concise course in statistical inference", 2004)

"A population that grows logistically, initially increases exponentially; then the growth lows down and eventually approaches an upper bound or limit. The most well-known form of the model is the logistic differential equation." (Linda J S Allen, "An Introduction to Mathematical Biology", 2007)

"Humans have difficulty perceiving variables accurately […]. However, in general, they tend to have inaccurate perceptions of system states, including past, current, and future states. This is due, in part, to limited ‘mental models’ of the phenomena of interest in terms of both how things work and how to influence things. Consequently, people have difficulty determining the full implications of what is known, as well as considering future contingencies for potential systems states and the long-term value of addressing these contingencies. " (William B. Rouse, "People and Organizations: Explorations of Human-Centered Design", 2007)

"The methodology of feedback design is borrowed from cybernetics (control theory). It is based upon methods of controlled system model’s building, methods of system states and parameters estimation (identification), and methods of feedback synthesis. The models of controlled system used in cybernetics differ from conventional models of physics and mechanics in that they have explicitly specified inputs and outputs. Unlike conventional physics results, often formulated as conservation laws, the results of cybernetical physics are formulated in the form of transformation laws, establishing the possibilities and limits of changing properties of a physical system by means of control." (Alexander L Fradkov, "Cybernetical Physics: From Control of Chaos to Quantum Control", 2007)

"All graphics present data and allow a certain degree of exploration of those same data. Some graphics are almost all presentation, so they allow just a limited amount of exploration; hence we can say they are more infographics than visualization, whereas others are mostly about letting readers play with what is being shown, tilting more to the visualization side of our linear scale. But every infographic and every visualization has a presentation and an exploration component: they present, but they also facilitate the analysis of what they show, to different degrees." (Alberto Cairo, "The Functional Art", 2011)

"There are limits on the data we can gather and the kinds of experiments we can perform."(Charles Wheelan, "Naked Statistics: Stripping the Dread from the Data", 2012)

"Learning theory claims that a machine learning algorithm can generalize well from a finite training set of examples. This seems to contradict some basic principles of logic. Inductive reasoning, or inferring general rules from a limited set of examples, is not logically valid. To logically infer a rule describing every member of a set, one must have information about every member of that set." (Ian Goodfellow et al, "Deep Learning", 2015)

"Science’s predictions are more trustworthy, but they are limited to what we can systematically observe and tractably model. Big data and machine learning greatly expand that scope. Some everyday things can be predicted by the unaided mind, from catching a ball to carrying on a conversation. Some things, try as we might, are just unpredictable. For the vast middle ground between the two, there’s machine learning." (Pedro Domingos, "The Master Algorithm", 2015)

"To make progress, every field of science needs to have data commensurate with the complexity of the phenomena it studies. [...] With big data and machine learning, you can understand much more complex phenomena than before. In most fields, scientists have traditionally used only very limited kinds of models, like linear regression, where the curve you fit to the data is always a straight line. Unfortunately, most phenomena in the world are nonlinear. [...] Machine learning opens up a vast new world of nonlinear models." (Pedro Domingos, "The Master Algorithm", 2015)

"Repeated observations of the same phenomenon do not always produce the same results, due to random noise or error. Sampling errors result when our observations capture unrepresentative circumstances, like measuring rush hour traffic on weekends as well as during the work week. Measurement errors reflect the limits of precision inherent in any sensing device. The notion of signal to noise ratio captures the degree to which a series of observations reflects a quantity of interest as opposed to data variance. As data scientists, we care about changes in the signal instead of the noise, and such variance often makes this problem surprisingly difficult." (Steven S Skiena, "The Data Science Design Manual", 2017)

"Regularization is particularly important when the amount of available data is limited. A neat biological interpretation of regularization is that it corresponds to gradual forgetting, as a result of which 'less important' (i.e., noisy) patterns are removed. In general, it is often advisable to use more complex models with regularization rather than simpler models without regularization." (Charu C Aggarwal, "Neural Networks and Deep Learning: A Textbook", 2018)

"The no free lunch theorems set limits on the range of optimality of any method. That is, each methodology has a ‘catchment area’ where it is optimal or nearly so. Often, intuitively, if the optimality is particularly strong then the effectiveness of the methodology falls off more quickly outside its catchment area than if its optimality were not so strong. Boosting is a case in point: it seems so well suited to binary classification that efforts to date to extend it to give effective classification (or regression) more generally have not been very successful. Overall, it remains to characterize the catchment areas where each class of predictors performs optimally, performs generally well, or breaks down." (Bertrand S Clarke & Jennifer L. Clarke, "Predictive Statistics: Analysis and Inference beyond Models", 2018)

"Unless we’re collecting data ourselves, there’s a limit to how much we can do to combat the problem of missing data. But we can and should remember to ask who or what might be missing from the data we’re being told about. Some missing numbers are obvious […]. Other omissions show up only when we take a close look at the claim in question." (Tim Harford, "The Data Detective: Ten easy rules to make sense of statistics", 2020)

"Despite their predictive power, most analytics and data science practices ignore relationships because it has been historically challenging to process them at scale." (Jesús Barrasa et al, "Knowledge Graphs: Data in Context for Responsive Businesses", 2021)

"Visualisation is fundamentally limited by the number of pixels you can pump to a screen. If you have big data, you have way more data than pixels, so you have to summarise your data. Statistics gives you lots of really good tools for this." (Hadley Wickham)

27 October 2018

🔭Data Science: Research (Just the Quotes)

"The aim of research is the discovery of the equations which subsist between the elements of phenomena." (Ernst Mach, 1898)

"[…] scientific research is somewhat like unraveling complicated tangles of strings, in which luck is almost as vital as skill and accurate observation." (Ernst Mach, "Knowledge and Error: Sketches on the Psychology of Enquiry", 1905)

"Research is fundamentally a state of mind involving continual re­examination of doctrines and axioms upon which current thought and action are based. It is, therefore, critical of existing practices." (Theobald Smith, "The Influence of Research in Bringing into Closer Relationship the Practice of Medicine and Public Health Activities", American Journal of Medical Sciences, 1929)

"In every important advance the physicist finds that the fundamental laws are simplified more and more as experimental research advances. He is astonished to notice how sublime order emerges from what appeared to be chaos. And this cannot be traced back to the workings of his own mind but is due to a quality that is inherent in the world of perception." (Albert Einstein, 1932)

"Statistics is a scientific discipline concerned with collection, analysis, and interpretation of data obtained from observation or experiment. The subject has a coherent structure based on the theory of Probability and includes many different procedures which contribute to research and development throughout the whole of Science and Technology." (Egon Pearson, 1936)

"A successful hypothesis is not necessarily a permanent hypothesis, but it is one which stimulates additional research, opens up new fields, or explains and coordinates previously unrelated facts." (Farrington Daniels, "Outlines of Physical Chemistry", 1948)

"The hypothesis is the principal intellectual instrument in research. Its function is to indicate new experiments and observations and it therefore sometimes leads to discoveries even when not correct itself. We must resist the temptation to become too attached to our hypothesis, and strive to judge it objectively and modify it or discard it as soon as contrary evidence is brought to light. Vigilance is needed to prevent our observations and interpretations being biased in favor of the hypothesis. Suppositions can be used without being believed." (William I B Beveridge, "The Art of Scientific Investigation", 1950)

"Mathematical models for empirical phenomena aid the development of a science when a sufficient body of quantitative information has been accumulated. This accumulation can be used to point the direction in which models should be constructed and to test the adequacy of such models in their interim states. Models, in turn, frequently are useful in organizing and interpreting experimental data and in suggesting new directions for experimental research." (Robert R. Bush & Frederick Mosteller, "A Mathematical Model for Simple Learning", Psychological Review 58, 1951)

"Statistics is the fundamental and most important part of inductive logic. It is both an art and a science, and it deals with the collection, the tabulation, the analysis and interpretation of quantitative and qualitative measurements. It is concerned with the classifying and determining of actual attributes as well as the making of estimates and the testing of various hypotheses by which probable, or expected, values are obtained. It is one of the means of carrying on scientific research in order to ascertain the laws of behavior of things - be they animate or inanimate. Statistics is the technique of the Scientific Method." (Bruce D Greenschields & Frank M Weida, "Statistics with Applications to Highway Traffic Analyses", 1952)

"In a general way it may be said that to think in terms of systems seems the most appropriate conceptual response so far available when the phenomena under study - at any level and in any domain--display the character of being organized, and when understanding the nature of the interdependencies constitutes the research task. In the behavioral sciences, the first steps in building a systems theory were taken in connection with the analysis of internal processes in organisms, or organizations, when the parts had to be related to the whole." (Fred Emery, "The Causal Texture of Organizational Environments", 1963)

"If the null hypothesis is not rejected, [Sir Ronald] Fisher's position was that nothing could be concluded. But researchers find it hard to go to all the trouble of conducting a study only to conclude that nothing can be concluded." (Frank L Schmidt, "Statistical Significance Testing and Cumulative Knowledge", "Psychology: Implications for Training of Researchers, Psychological Methods" Vol. 1 (2), 1996)

"Statisticians can calculate the probability that such random samples represent the population; this is usually expressed in terms of sampling error [...]. The real problem is that few samples are random. Even when researchers know the nature of the population, it can be time-consuming and expensive to draw a random sample; all too often, it is impossible to draw a true random sample because the population cannot be defined. This is particularly true for studies of social problems. [...] The best samples are those that come as close as possible to being random." (Joel Best, "Damned Lies and Statistics: Untangling Numbers from the Media, Politicians, and Activists", 2001)

"Meta-analytic thinking is the consideration of any result in relation to previous results on the same or similar questions, and awareness that combination with future results is likely to be valuable. Meta-analytic thinking is the application of estimation thinking to more than a single study. It prompts us to seek meta-analysis of previous related studies at the planning stage of research, then to report our results in a way that makes it easy to include them in future meta-analyses. Meta-analytic thinking is a type of estimation thinking, because it, too, focuses on estimates and uncertainty." (Geoff Cumming, "Understanding the New Statistics", 2012)

"Statistical cognition is concerned with obtaining cognitive evidence about various statistical techniques and ways to present data. It’s certainly important to choose an appropriate statistical model, use the correct formulas, and carry out accurate calculations. It’s also important, however, to focus on understanding, and to consider statistics as communication between researchers and readers." (Geoff Cumming, "Understanding the New Statistics", 2012)

"Another way to secure statistical significance is to use the data to discover a theory. Statistical tests assume that the researcher starts with a theory, collects data to test the theory, and reports the results - whether statistically significant or not. Many people work in the other direction, scrutinizing the data until they find a pattern and then making up a theory that fits the pattern." (Gary Smith, "Standard Deviations", 2014)

"How can we tell the difference between a good theory and quackery? There are two effective antidotes: common sense and fresh data. If it is a ridiculous theory, we shouldn’t be persuaded by anything less than overwhelming evidence, and even then be skeptical. Extraordinary claims require extraordinary evidence. Unfortunately, common sense is an uncommon commodity these days, and many silly theories have been seriously promoted by honest researchers." (Gary Smith, "Standard Deviations", 2014)

"These practices - selective reporting and data pillaging - are known as data grubbing. The discovery of statistical significance by data grubbing shows little other than the researcher’s endurance. We cannot tell whether a data grubbing marathon demonstrates the validity of a useful theory or the perseverance of a determined researcher until independent tests confirm or refute the finding. But more often than not, the tests stop there. After all, you won’t become a star by confirming other people’s research, so why not spend your time discovering new theories? The data-grubbed theory consequently sits out there, untested and unchallenged." (Gary Smith, "Standard Deviations", 2014)

"A conceptual model is a framework that is initially used in research to outline the possible courses of action or to present an idea or thought. When a conceptual model is developed in a logical manner, it will provide a rigor to the research process." (N Elangovan & R Rajendran, "Conceptual Model: A Framework for Institutionalizing the Vigor in Business Research", 2015)

"Even properly done statistics can’t be trusted. The plethora of available statistical techniques and analyses grants researchers an enormous amount of freedom when analyzing their data, and it is trivially easy to ‘torture the data until it confesses’." (Alex Reinhart, "Statistics Done Wrong: The Woefully Complete Guide", 2015)

"The correlational technique known as multiple regression is used frequently in medical and social science research. This technique essentially correlates many independent (or predictor) variables simultaneously with a given dependent variable (outcome or output). It asks, 'Net of the effects of all the other variables, what is the effect of variable A on the dependent variable?' Despite its popularity, the technique is inherently weak and often yields misleading results. The problem is due to self-selection. If we don’t assign cases to a particular treatment, the cases may differ in any number of ways that could be causing them to differ along some dimension related to the dependent variable. We can know that the answer given by a multiple regression analysis is wrong because randomized control experiments, frequently referred to as the gold standard of research techniques, may give answers that are quite different from those obtained by multiple regression analysis." (Richard E Nisbett, "Mindware: Tools for Smart Thinking", 2015)

"Collecting data through sampling therefore becomes a never-ending battle to avoid sources of bias. [...] While trying to obtain a random sample, researchers sometimes make errors in judgment about whether every person or thing is equally likely to be sampled." (Daniel J Levitin, "Weaponized Lies", 2017)

"Samples give us estimates of something, and they will almost always deviate from the true number by some amount, large or small, and that is the margin of error. […] The margin of error does not address underlying flaws in the research, only the degree of error in the sampling procedure. But ignoring those deeper possible flaws for the moment, there is another measurement or statistic that accompanies any rigorously defined sample: the confidence interval." (Daniel J Levitin, "Weaponized Lies", 2017)

"The job of the statistician is to formulate an inventory of all those things that matter in order to obtain a representative sample. Researchers have to avoid the tendency to capture variables that are easy to identify or collect data on - sometimes the things that matter are not obvious or are difficult to measure." (Daniel J Levitin, "Weaponized Lies", 2017)

