SQL Troubles

16 December 2018

🔭Data Science: Rare Events (Just the Quotes)

"We must rather seek for a cause, for every event whether probable or improbable must have some cause." (Polybius, "The Histories", cca. 100 BC)

"There is nothing in the nature of a miracle that should render it incredible: its credibility depends upon the nature of the evidence by which it is supported. An event of extreme probability will not necessarily command our belief unless upon a sufficiency of proof; and so an event which we may regard as highly improbable may command our belief if it is sustained by sufficient evidence. So that the credibility or incredibility of an event does not rest upon the nature of the event itself, but depends upon the nature and sufficiency of the proof which sustains it." (Charles Babbage, "Passages from the Life of a Philosopher", 1864)

"Events with a sufficiently small probability never occur, or at least we must act, in all circumstances, as if they were impossible." (Émile Borel, "Probabilities and Life", 1962)

"Most accidents in well-designed systems involve two or more events of low probability occurring in the worst possible combination." (Robert E Machol, "Principles of Operations Research", 1975)

"[…] all human beings - professional mathematicians included - are easily muddled when it comes to estimating the probabilities of rare events. Even figuring out the right question to ask can be confusing." (Steven Strogatz, "Sync: The Emerging Science of Spontaneous Order", 2003)

"Bell curves don't differ that much in their bells. They differ in their tails. The tails describe how frequently rare events occur. They describe whether rare events really are so rare. This leads to the saying that the devil is in the tails." (Bart Kosko, "Noise", 2006)

"A Black Swan is a highly improbable event with three principal characteristics: It is unpredictable; it carries a massive impact; and, after the fact, we concoct an explanation that makes it appear less random, and more predictable, than it was. […] The Black Swan idea is based on the structure of randomness in empirical reality. [...] the Black Swan is what we leave out of simplification." (Nassim N Taleb, “The Black Swan”, 2007)

"A forecaster should almost never ignore data, especially when she is studying rare events […]. Ignoring data is often a tip-off that the forecaster is overconfident, or is overfitting her model - that she is interested in showing off rather than trying to be accurate." (Nate Silver, "The Signal and the Noise: Why So Many Predictions Fail-but Some Don't", 2012)

"[…] according to the bell-shaped curve the likelihood of a very-large-deviation event (a major outlier) located in the striped region appears to be very unlikely, essentially zero. The same event, though, is several thousand times more likely if it comes from a set of events obeying a fat-tailed distribution instead of the bell-shaped one." (John L Casti, "X-Events: The Collapse of Everything", 2012)

"[…] both rarity and impact have to go into any meaningful characterization of how black any particular [black] swan happens to be." (John L Casti, "X-Events: The Collapse of Everything", 2012)

"Black Swans (capitalized) are large-scale unpredictable and irregular events of massive consequence - unpredicted by a certain observer, and such un - predictor is generally called the 'turkey' when he is both surprised and harmed by these events. [...] Black Swans hijack our brains, making us feel we 'sort of' or 'almost' predicted them, because they are retrospectively explainable. We don’t realize the role of these Swans in life because of this illusion of predictability. […] An annoying aspect of the Black Swan problem - in fact the central, and largely missed, point - is that the odds of rare events are simply not computable." (Nassim N Taleb, "Antifragile: Things that gain from disorder", 2012)

"Behavioral finance so far makes conclusions from statics not dynamics, hence misses the picture. It applies trade-offs out of context and develops the consensus that people irrationally overestimate tail risk (hence need to be 'nudged' into taking more of these exposures). But the catastrophic event is an absorbing barrier. No risky exposure can be analyzed in isolation: risks accumulate. If we ride a motorcycle, smoke, fly our own propeller plane, and join the mafia, these risks add up to a near-certain premature death. Tail risks are not a renewable resource." (Nassim N Taleb, "Statistical Consequences of Fat Tails: Real World Preasymptotics, Epistemology, and Applications" 2nd Ed., 2022)

"But note that any heavy tailed process, even a power law, can be described in sample (that is finite number of observations necessarily discretized) by a simple Gaussian process with changing variance, a regime switching process, or a combination of Gaussian plus a series of variable jumps (though not one where jumps are of equal size […])." (Nassim N Taleb, "Statistical Consequences of Fat Tails: Real World Preasymptotics, Epistemology, and Applications" 2nd Ed., 2022)

"[…] it is not merely that events in the tails of the distributions matter, happen, play a large role, etc. The point is that these events play the major role and their probabilities are not (easily) computable, not reliable for any effective use. The implication is that Black Swans do not necessarily come from fat tails; the problem can result from an incomplete assessment of tail events." (Nassim N Taleb, "Statistical Consequences of Fat Tails: Real World Preasymptotics, Epistemology, and Applications" 2nd Ed., 2022)

"Once we know something is fat-tailed, we can use heuristics to see how an exposure there reacts to random events: how much is a given unit harmed by them. It is vastly more effective to focus on being insulated from the harm of random events than try to figure them out in the required details (as we saw the inferential errors under thick tails are huge). So it is more solid, much wiser, more ethical, and more effective to focus on detection heuristics and policies rather than fabricate statistical properties." (Nassim N Taleb, "Statistical Consequences of Fat Tails: Real World Preasymptotics, Epistemology, and Applications" 2nd Ed., 2022)

🔭Data Science: Correlation (Just the Quotes)

"Reflection soon made it clear to me that not only were the two new problems identical in principle with the old one of kinship which I had already solved, but that all three of them were no more than special cases of a much more general problem - namely, that of Correlation." (Francis Galton,"Kinship and Correlation", 1890)

"It had appeared from observation, and it was fully confirmed by this theory, that such a thing existed as an 'Index of Correlation', that is to say, a fraction, now commonly written T, that connects with close approximation every value of the deviation on the part of the subject, with the average of all the associated deviations of the Relative [...]" (Francis Galton, "Memories of My Life", 1908)

"One of the main duties of science is the correlation of phenomena, apparently disconnected and even contradictory." (Frederick Soddy, "The Interpretation of Radium and the Structure of the Atom", 1909)

"To speak of the cause of an event is therefore misleading. Any set of antecedents from which the event can theoretically be inferred by means of correlations might be called a cause of the event. But to speak of the cause is to imply a uniqueness [...]." (Bertrand Russell, "Mysticism and Logic: And Other Essays", 1910)

"'Correlation' is a term used to express the relation which exists between two series or groups of data where there is a causal connection. In order to have correlation it is not enough that the two sets of data should both increase or decrease simultaneously. For correlation it is necessary that one set of facts should have some definite causal dependence upon the other set [...]" (Willard C Brinton, "Graphic Methods for Presenting Facts", 1919)

"'Causation' has been popularly used to express the condition of association, when applied to natural phenomena. There is no philosophical basis for giving it a wider meaning than partial or absolute association. In no case has it been proved that there is an inherent necessity in the laws of nature. Causation is correlation. [...] perfect correlation, when based upon sufficient experience, is causation in the scientific sense." (Henry E. Niles, "Correlation, Causation and Wright's Theory of 'Path Coefficients'", Genetics, 1922)

"The futile elaboration of innumerable measures of correlation, and the evasion of the real difficulties of sampling problems under cover of a contempt for small samples, were obviously beginning to make its pretensions ridiculous. These procedures were not only ill-aimed, but for all their elaboration, not sufficiently accurate." (Sir Ronald A Fisher, "Statistical Methods for Research Workers", 1925)

"When the relationship is of a quantitative nature, the appropriate statistical tool for discovering and measuring the relationship and expressing it in a brief formula is known as correlation." (Frederick E Croxton & Dudley J Cowden, "Practical Business Statistics", 1937)

"Graphic methods are very commonly used in business correlation problems. On the whole, carefully handled and skillfully interpreted graphs have certain advantages over mathematical methods of determining correlation in the usual business problems. The elements of judgment and special knowledge of conditions can be more easily introduced in studying correlation graphically. Mathematical correlation is often much too rigid for the data at hand." (John R Riggleman & Ira N Frisbee, "Business Statistics", 1938)

"[…] statistical literacy. That is, the ability to read diagrams and maps; a 'consumer' understanding of common statistical terms, as average, percent, dispersion, correlation, and index number." (Douglas Scates, "Statistics: The Mathematics for Social Problems", 1943)

"Another thing to watch out for is a conclusion in which a correlation has been inferred to continue beyond the data with which it has been demonstrated." (Darell Huff, "How to Lie with Statistics", 1954)

"Keep in mind that a correlation may be real and based on real cause and effect, and still be almost worthless in determining action in any single case." (Darell Huff, "How to Lie with Statistics", 1954)

"When you find somebody - usually an interested party - making a fuss about a correlation, look first of all to see if it is not one of this type, produced by the stream of events, the trend of the times." (Darell Huff, "How to Lie with Statistics", 1954)

"There is no correlation between the cause and the effect. The events reveal only an aleatory determination, connected not so much with the imperfection of our knowledge as with the structure of the human world." (Raymond Aron, "The Opium of the Intellectuals", 1955)

"The well-known virtue of the experimental method is that it brings situational variables under tight control. It thus permits rigorous tests of hypotheses and confidential statements about causation. The correlational method, for its part, can study what man has not learned to control. Nature has been experimenting since the beginning of time, with a boldness and complexity far beyond the resources of science. The correlator’s mission is to observe and organize the data of nature’s experiments." (Lee J Cronbach, "The Two Disciplines of Scientific Psychology", The American Psychologist Vol. 12, 1957)

"It has been said that data collection is like garbage collection: before you collect it you should have in mind what you are going to do with it." (Russell Fox & Max Gorbuny, "The Science of Science: Methods of Interpreting Physical Phenomena", 1964)

"Today we preach that science is not science unless it is quantitative. We substitute correlation for causal studies, and physical equations for organic reasoning. Measurements and equations are supposed to sharpen thinking, but [...] they more often tend to make the thinking non-causal and fuzzy." (John R Platt, "Strong Inference", Science Vol. 146 (3641), 1964)

"If we gather more and more data and establish more and more associations, however, we will not finally find that we know something. We will simply end up having more and more data and larger sets of correlations." (Kenneth N Waltz, "Theory of International Politics Source: Theory of International Politics", 1979)

"The invalid assumption that correlation implies cause is probably among the two or three most serious and common errors of human reasoning." (Stephen J Gould, "The Mismeasure of Man", 1980)

"Correlation analysis is a useful tool for uncovering a tenuous relationship, but it doesn't necessarily provide any real understanding of the relationship, and it certainly doesn't provide any evidence that the relationship is one of cause and effect. People who don't understand correlation tend to credit it with being a more fundamental approach than it is." (Robert Hooke, "How to Tell the Liars from the Statisticians", 1983)

"Only a 0 correlation is uninteresting, and in practice 0 correlations do not occur. When you stuff a bunch of numbers into the correlation formula, the chance of getting exactly 0, even if no correlation is truly present, is about the same as the chance of a tossed coin ending up on edge instead of heads or tails." (Robert Hooke, "How to Tell the Liars from the Statisticians", 1983)

"Correlation and causation are two quite different words, and the innumerate are more prone to mistake them than most." (John A Paulos, "Innumeracy: Mathematical Illiteracy and its Consequences", 1988)

"Nature normally hates power laws. In ordinary systems all quantities follow bell curves, and correlations decay rapidly, obeying exponential laws. But all that changes if the system is forced to undergo a phase transition. Then power laws emerge-nature's unmistakable sign that chaos is departing in favor of order. The theory of phase transitions told us loud and clear that the road from disorder to order is maintained by the powerful forces of self-organization and is paved by power laws. It told us that power laws are not just another way of characterizing a system's behavior. They are the patent signatures of self-organization in complex systems." (Albert-László Barabási, "Linked: How Everything Is Connected to Everything Else and What It Means for Business, Science, and Everyday Life", 2002)

"If you flip a coin three times and it lands on heads each time, it's probably chance. If you flip it a hundred times and it lands on heads each time, you can be pretty sure the coin has heads on both sides. That's the concept behind statistical significance - it's the odds that the correlation (or other finding) is real, that it isn't just random chance." (T Colin Campbell, "The China Study", 2004)

"Nonetheless, the basic principles regarding correlations between variables are not that difficult to understand. We must look for patterns that reveal potential relationships and for evidence that variables are actually related. But when we do spot those relationships, we should not jump to conclusions about causality. Instead, we need to weigh the strength of the relationship and the plausibility of our theory, and we must always try to discount the possibility of spuriousness." (Joel Best, "More Damned Lies and Statistics: How numbers confuse public issues", 2004)

"Before best estimates are extracted from data sets by way of a regression analysis, the uncertainties of the individual data values must be determined.In this case care must be taken to recognize which uncertainty components are common to all the values, i.e., those that are correlated (systematic)." (Manfred Drosg, "Dealing with Uncertainties: A Guide to Error Analysis", 2007)

"Correlation analysis can help us find the size of the formal relation between two properties. An equidirectional variation is present if we observe high values of one variable together with high values of the other variable (or low ones combined with low ones). In this case there is a positive correlation. If high values are combined with low values and low values with high values, the variation is counterdirectional, and the correlation is negative." (Manfred Drosg, "Dealing with Uncertainties: A Guide to Error Analysis", 2007)

"In error analysis the so-called 'chi-squared' is a measure of the agreement between the uncorrelated internal and the external uncertainties of a measured functional relation. The simplest such relation would be time independence. Theory of the chi-squared requires that the uncertainties be normally distributed. Nevertheless, it was found that the test can be applied to most probability distributions encountered in practice." (Manfred Drosg, "Dealing with Uncertainties: A Guide to Error Analysis", 2007)

"It is important that uncertainty components that are independent of each other are added quadratically. This is also true for correlated uncertainty components, provided they are independent of each other, i.e., as long as there is no correlation between the components." (Manfred Drosg, "Dealing with Uncertainties: A Guide to Error Analysis", 2007)

"The fact that the same uncertainty (e.g., scale uncertainty) is uncorrelated if we are dealing with only one measurement, but correlated (i.e., systematic) if we look at more than one measurement using the same instrument shows that both types of uncertainties are of the same nature. Of course, an uncertainty keeps its characteristics (e.g., Poisson distributed), independent of the fact whether it occurs only once or more often." (Manfred Drosg, "Dealing with Uncertainties: A Guide to Error Analysis", 2007)

"Given the important role that correlation plays in structural equation modeling, we need to understand the factors that affect establishing relationships among multivariable data points. The key factors are the level of measurement, restriction of range in data values (variability, skewness, kurtosis), missing data, nonlinearity, outliers, correction for attenuation, and issues related to sampling variation, confidence intervals, effect size, significance, sample size, and power." (Randall E Schumacker & Richard G Lomax, "A Beginner’s Guide to Structural Equation Modeling" 3rd Ed., 2010)

"[...] if you want to show change through time, use a time-series chart; if you need to compare, use a bar chart; or to display correlation, use a scatter-plot - because some of these rules make good common sense." (Alberto Cairo, "The Functional Art", 2011)

"Economists should study financial markets as they actually operate, not as they assume them to operate - observing the way in which information is actually processed, observing the serial correlations, bonanzas, and sudden stops, not assuming these away as noise around the edges of efficient and rational markets." (Adair Turner, "Economics after the Crisis: Objectives and means", 2012)

"Without precise predictability, control is impotent and almost meaningless. In other words, the lesser the predictability, the harder the entity or system is to control, and vice versa. If our universe actually operated on linear causality, with no surprises, uncertainty, or abrupt changes, all future events would be absolutely predictable in a sort of waveless orderliness." (Lawrence K Samuels, "Defense of Chaos", 2013)

"The problem of complexity is at the heart of mankind’s inability to predict future events with any accuracy. Complexity science has demonstrated that the more factors found within a complex system, the more chances of unpredictable behavior. And without predictability, any meaningful control is nearly impossible. Obviously, this means that you cannot control what you cannot predict. The ability ever to predict long-term events is a pipedream. Mankind has little to do with changing climate; complexity does." (Lawrence K Samuels, "The Real Science Behind Changing Climate", LewRockwell.com, August 1, 2014)

"The correlational technique known as multiple regression is used frequently in medical and social science research. This technique essentially correlates many independent (or predictor) variables simultaneously with a given dependent variable (outcome or output). It asks, 'Net of the effects of all the other variables, what is the effect of variable A on the dependent variable?' Despite its popularity, the technique is inherently weak and often yields misleading results. The problem is due to self-selection. If we don’t assign cases to a particular treatment, the cases may differ in any number of ways that could be causing them to differ along some dimension related to the dependent variable. We can know that the answer given by a multiple regression analysis is wrong because randomized control experiments, frequently referred to as the gold standard of research techniques, may give answers that are quite different from those obtained by multiple regression analysis." (Richard E Nisbett, "Mindware: Tools for Smart Thinking", 2015)

"The theory behind multiple regression analysis is that if you control for everything that is related to the independent variable and the dependent variable by pulling their correlations out of the mix, you can get at the true causal relation between the predictor variable and the outcome variable. That’s the theory. In practice, many things prevent this ideal case from being the norm." (Richard E Nisbett, "Mindware: Tools for Smart Thinking", 2015)

"A correlation is simply a bivariate relationship - a fancy way of saying that there is a relationship between two ('bi') variables ('variate'). And a bivariate relationship doesn’t prove that one thing caused the other. Think of it this way: you can observe that two things appear to be related statistically, but that doesn’t tell you the answer to any of the questions you might really care about - why is there a relationship and what does it mean to us as a consumer of data?" (John H Johnson & Mike Gluck, "Everydata: The misinformation hidden in the little data you consume every day", 2016)

"Confirmation bias can affect nearly every aspect of the way you look at data, from sampling and observation to forecasting - so it’s something to keep in mind anytime you’re interpreting data. When it comes to correlation versus causation, confirmation bias is one reason that some people ignore omitted variables - because they’re making the jump from correlation to causation based on preconceptions, not the actual evidence." (John H Johnson & Mike Gluck, "Everydata: The misinformation hidden in the little data you consume every day", 2016)

"In the real world, statistical issues rarely exist in isolation. You’re going to come across cases where there’s more than one problem with the data. For example, just because you identify some sampling errors doesn’t mean there aren’t also issues with cherry picking and correlations and averages and forecasts - or simply more sampling issues, for that matter. Some cases may have no statistical issues, some may have dozens. But you need to keep your eyes open in order to spot them all." (John H Johnson & Mike Gluck, "Everydata: The misinformation hidden in the little data you consume every day", 2016)

"Correlation is not equivalent to cause for one major reason. Correlation is well defined in terms of a mathematical formula. Cause is not well defined." (David S Salsburg, "Errors, Blunders, and Lies: How to Tell the Difference", 2017)

"The degree to which one variable can be predicted from another can be calculated as the correlation between them. The square of the correlation (R^2) is the proportion of the variance of one that can be 'explained' by knowledge of the other." (David S Salsburg, "Errors, Blunders, and Lies: How to Tell the Difference", 2017)

"It is convenient to use a single number to summarize a steadily increasing or decreasing relationship between the pairs of numbers shown on a scatter-plot. This is generally chosen to be the Pearson correlation coefficient [...]. A Pearson correlation runs between −1 and 1, and expresses how close to a straight line the dots or data-points fall. A correlation of 1 occurs if all the points lie on a straight line going upwards, while a correlation of −1 occurs if all the points lie on a straight line going downwards. A correlation near 0 can come from a random scatter of points, or any other pattern in which there is no systematic trend upwards or downwards [...]." (David Spiegelhalter, "The Art of Statistics: Learning from Data", 2019)

"Another problem is that while data visualizations may appear to be objective, the designer has a great deal of control over the message a graphic conveys. Even using accurate data, a designer can manipulate how those data make us feel. She can create the illusion of a correlation where none exists, or make a small difference between groups look big." (Carl T Bergstrom & Jevin D West, "Calling Bullshit: The Art of Skepticism in a Data-Driven World", 2020)

"Correlation doesn't imply causation - but apparently it doesn't sell newspapers either."(Carl T Bergstrom & Jevin D West, "Calling Bullshit: The Art of Skepticism in a Data-Driven World", 2020)

"Correlation quantifies the relationship between features. The purpose of correlation analysis is to understand the dependencies between features, so that observed effects can be explained or desired effects can be achieved." (Thomas A Runkler, "Data Analytics: Models and Algorithms for Intelligent Data Analysis" 3rd Ed., 2020)

"Correlation does not imply causation: often some other missing third variable is influencing both of the variables you are correlating. […] The need for a scatterplot arose when scientists had to examine bivariate relations between distinct variables directly. As opposed to other graphic forms - pie charts, line graphs, and bar charts - the scatterplot offered a unique advantage: the possibility to discover regularity in empirical data (shown as points) by adding smoothed lines or curves designed to pass 'not through, but among them', so as to pass from raw data to a theory-based description, analysis, and understanding." (Michael Friendly & Howard Wainer, "A History of Data Visualization and Graphic Communication", 2021)

"The practice of finding relationships between different sets of data - also known as correlations - is the bread and butter of what data analysis, and by proxy data visualization, is all about." (Kate Strachnyi, "ColorWise: A Data Storyteller’s Guide to the Intentional Use of Color", 2023)

More quotes on "Correlation" at the-web-of-knowledge.blogspot.com.

🔭Data Science: Laws (Just the Quotes)

"[…] we must not measure the simplicity of the laws of nature by our facility of conception; but when those which appear to us the most simple, accord perfectly with observations of the phenomena, we are justified in supposing them rigorously exact." (Pierre-Simon Laplace, "The System of the World", 1809)

"Primary causes are unknown to us; but are subject to simple and constant laws, which may be discovered by observation, the study of them being the object of natural philosophy." (Jean-Baptiste-Joseph Fourier, "The Analytical Theory of Heat", 1822)

"The aim of every science is foresight. For the laws of established observation of phenomena are generally employed to foresee their succession. All men, however little advanced make true predictions, which are always based on the same principle, the knowledge of the future from the past." (Auguste Compte, "Plan des travaux scientifiques nécessaires pour réorganiser la société", 1822)

"But law is no explanation of anything; law is simply a generalization, a category of facts. Law is neither a cause, nor a reason, nor a power, nor a coercive force. It is nothing but a general formula, a statistical table." (Florence Nightingale, "Suggestions for Thought", 1860)

"The process of discovery is very simple. An unwearied and systematic application of known laws to nature, causes the unknown to reveal themselves. Almost any mode of observation will be successful at last, for what is most wanted is method." (Henry D Thoreau, "A Week on the Concord and Merrimack Rivers", 1862)

"Isolated facts and experiments have in themselves no value, however great their number may be. They only become valuable in a theoretical or practical point of view when they make us acquainted with the law of a series of uniformly recurring phenomena, or, it may be, only give a negative result showing an incompleteness in our knowledge of such a law, till then held to be perfect." (Hermann von Helmholtz, "The Aim and Progress of Physical Science", 1869)

"If statistical graphics, although born just yesterday, extends its reach every day, it is because it replaces long tables of numbers and it allows one not only to embrace at glance the series of phenomena, but also to signal the correspondences or anomalies, to find the causes, to identify the laws." (Émile Cheysson, cca. 1877)

"The history of thought should warn us against concluding that because the scientific theory of the world is the best that has yet been formulated, it is necessarily complete and final. We must remember that at bottom the generalizations of science or, in common parlance, the laws of nature are merely hypotheses devised to explain that ever-shifting phantasmagoria of thought which we dignify with the high-sounding names of the world and the universe." (Sir James G Frazer, "The Golden Bough: A Study in Magic and Religion", 1890)

"Even one well-made observation will be enough in many cases, just as one well-constructed experiment often suffices for the establishment of a law." (Émile Durkheim, "The Rules of Sociological Method", "The Rules of Sociological Method", 1895)

"An experiment is an observation that can be repeated, isolated and varied. The more frequently you can repeat an observation, the more likely are you to see clearly what is there and to describe accurately what you have seen. The more strictly you can isolate an observation, the easier does your task of observation become, and the less danger is there of your being led astray by irrelevant circumstances, or of placing emphasis on the wrong point. The more widely you can vary an observation, the more clearly will be the uniformity of experience stand out, and the better is your chance of discovering laws." (Edward B Titchener, "A Text-Book of Psychology", 1909)

"It is well to notice in this connection [the mutual relations between the results of counting and measuring] that a natural law, in the statement of which measurable magnitudes occur, can only be understood to hold in nature with a certain degree of approximation; indeed natural laws as a rule are not proof against sufficient refinement of the measuring tools." (Luitzen E J Brouwer, "Intuitionism and Formalism", Bulletin of the American Mathematical Society, Vol. 20, 1913)

"[…] as the sciences have developed further, the notion has gained ground that most, perhaps all, of our laws are only approximations." (William James, "Pragmatism: A New Name for Some Old Ways of Thinking", 1914)

"Scientific laws, when we have reason to think them accurate, are different in form from the common-sense rules which have exceptions: they are always, at least in physics, either differential equations, or statistical averages." (Bertrand A Russell, "The Analysis of Matter", 1927)

"Science is the attempt to discover, by means of observation, and reasoning based upon it, first, particular facts about the world, and then laws connecting facts with one another and (in fortunate cases) making it possible to predict future occurrences." (Bertrand Russell, "Religion and Science, Grounds of Conflict", 1935)

"Statistics is the fundamental and most important part of inductive logic. It is both an art and a science, and it deals with the collection, the tabulation, the analysis and interpretation of quantitative and qualitative measurements. It is concerned with the classifying and determining of actual attributes as well as the making of estimates and the testing of various hypotheses by which probable, or expected, values are obtained. It is one of the means of carrying on scientific research in order to ascertain the laws of behavior of things - be they animate or inanimate. Statistics is the technique of the Scientific Method." (Bruce D Greenschields & Frank M Weida, "Statistics with Applications to Highway Traffic Analyses", 1952)

"The world is not made up of empirical facts with the addition of the laws of nature: what we call the laws of nature are conceptual devices by which we organize our empirical knowledge and predict the future." (Richard B Braithwaite, "Scientific Explanation", 1953)

"The methods of science may be described as the discovery of laws, the explanation of laws by theories, and the testing of theories by new observations. A good analogy is that of the jigsaw puzzle, for which the laws are the individual pieces, the theories local patterns suggested by a few pieces, and the tests the completion of these patterns with pieces previously unconsidered." (Edwin P Hubble, "The Nature of Science and Other Lectures", 1954)

"Can there be laws of chance? The answer, it would seem should be negative, since chance is in fact defined as the characteristic of the phenomena which follow no law, phenomena whose causes are too complex to permit prediction." (Félix E Borel, "Probabilities and Life", 1962)

"Each piece, or part, of the whole of nature is always merely an approximation to the complete truth, or the complete truth so far as we know it. In fact, everything we know is only some kind of approximation, because we know that we do not know all the laws as yet. Therefore, things must be learned only to be unlearned again or, more likely, to be corrected." (Richard Feynman, "The Feynman Lectures on Physics" Vol. 1, 1964)

"At each level of complexity, entirely new properties appear. [And] at each stage, entirely new laws, concepts, and generalizations are necessary, requiring inspiration and creativity to just as great a degree as in the previous one." (Herb Anderson, 1972)

"A good scientific law or theory is falsifiable just because it makes definite claims about the world. For the falsificationist, If follows fairly readily from this that the more falsifiable a theory is the better, in some loose sense of more. The more a theory claims, the more potential opportunities there will be for showing that the world does not in fact behave in the way laid down by the theory. A very good theory will be one that makes very wide-ranging claims about the world, and which is consequently highly falsifiable, and is one that resists falsification whenever it is put to the test." (Alan F Chalmers, "What Is This Thing Called Science?", 1976)

"Scientific laws give algorithms, or procedures, for determining how systems behave. The computer program is a medium in which the algorithms can be expressed and applied. Physical objects and mathematical structures can be represented as numbers and symbols in a computer, and a program can be written to manipulate them according to the algorithms. When the computer program is executed, it causes the numbers and symbols to be modified in the way specified by the scientific laws. It thereby allows the consequences of the laws to be deduced." (Stephen Wolfram, "Computer Software in Science and Mathematics", 1984)

"The connection between a model and a theory is that a model satisfies a theory; that is, a model obeys those laws of behavior that a corresponding theory explicity states or which may be derived from it. [...[] Computers make possible an entirely new relationship between theories and models. [...] A theory written in the form of a computer program is [...] both a theory and, when placed on a computer and run, a model to which the theory applies." (Joseph Weizenbaum, "Computer Power and Human Reason", 1984)

"We expect to learn new tricks because one of our science based abilities is being able to predict. That after all is what science is about. Learning enough about how a thing works so you'll know what comes next. Because as we all know everything obeys the universal laws, all you need is to understand the laws." (James Burke, "The Day the Universe Changed", 1985)

"A law explains a set of observations; a theory explains a set of laws. […] Unlike laws, theories often postulate unobservable objects as part of their explanatory mechanism." (John L Casti, "Searching for Certainty", 1990)

"So we pour in data from the past to fuel the decision-making mechanisms created by our models, be they linear or nonlinear. But therein lies the logician's trap: past data from real life constitute a sequence of events rather than a set of independent observations, which is what the laws of probability demand. [...] It is in those outliers and imperfections that the wildness lurks." (Peter L Bernstein, "Against the Gods: The Remarkable Story of Risk", 1996)

"A scientific theory is a concise and coherent set of concepts, claims, and laws (frequently expressed mathematically) that can be used to precisely and accurately explain and predict natural phenomena." (Mordechai Ben-Ari, "Just a Theory: Exploring the Nature of Science", 2005)

"[...] things that seem hopelessly random and unpredictable when viewed in isolation often turn out to be lawful and predictable when viewed in aggregate." (Steven Strogatz, "The Joy of X: A Guided Tour of Mathematics, from One to Infinity", 2012)

15 December 2018

🔭Data Science: Probability (Just the Quotes)

"Probability is a degree of possibility." (Gottfried W Leibniz, "On estimating the uncertain", 1676)

"Probability, however, is not something absolute, [it is] drawn from certain information which, although it does not suffice to resolve the problem, nevertheless ensures that we judge correctly which of the two opposites is the easiest given the conditions known to us." (Gottfried W Leibniz, "Forethoughts for an encyclopaedia or universal science", cca. 1679)

"[…] the highest probability amounts not to certainty, without which there can be no true knowledge." (John Locke, "An Essay Concerning Human Understanding", 1689)

"As mathematical and absolute certainty is seldom to be attained in human affairs, reason and public utility require that judges and all mankind in forming their opinions of the truth of facts should be regulated by the superior number of the probabilities on the one side or the other whether the amount of these probabilities be expressed in words and arguments or by figures and numbers." (William Murray, 1773)

"All certainty which does not consist in mathematical demonstration is nothing more than the highest probability; there is no other historical certainty." (Voltaire, "A Philosophical Dictionary", 1881)

"Nature prefers the more probable states to the less probable because in nature processes take place in the direction of greater probability. Heat goes from a body at higher temperature to a body at lower temperature because the state of equal temperature distribution is more probable than a state of unequal temperature distribution." (Max Planck, "The Atomic Theory of Matter", 1909)

"Sometimes the probability in favor of a generalization is enormous, but the infinite probability of certainty is never reached." (William Dampier-Whetham, "Science and the Human Mind", 1912)

"There can be no unique probability attached to any event or behaviour: we can only speak of ‘probability in the light of certain given information’, and the probability alters according to the extent of the information." (Sir Arthur S Eddington, "The Nature of the Physical World", 1928)

"[…] the statistical prediction of the future from the past cannot be generally valid, because whatever is future to any given past, is in tum past for some future. That is, whoever continually revises his judgment of the probability of a statistical generalization by its successively observed verifications and failures, cannot fail to make more successful predictions than if he should disregard the past in his anticipation of the future. This might be called the ‘Principle of statistical accumulation’." (Clarence I Lewis, "Mind and the World-Order: Outline of a Theory of Knowledge", 1929)

"Science does not aim, primarily, at high probabilities. It aims at a high informative content, well backed by experience. But a hypothesis may be very probable simply because it tells us nothing, or very little." (Karl Popper, "The Logic of Scientific Discovery", 1934)

"The most important application of the theory of probability is to what we may call 'chance-like' or 'random' events, or occurrences. These seem to be characterized by a peculiar kind of incalculability which makes one disposed to believe - after many unsuccessful attempts - that all known rational methods of prediction must fail in their case. We have, as it were, the feeling that not a scientist but only a prophet could predict them. And yet, it is just this incalculability that makes us conclude that the calculus of probability can be applied to these events." (Karl R Popper, "The Logic of Scientific Discovery", 1934)

"Equiprobability in the physical world is purely a hypothesis. We may exercise the greatest care and the most accurate of scientific instruments to determine whether or not a penny is symmetrical. Even if we are satisfied that it is, and that our evidence on that point is conclusive, our knowledge, or rather our ignorance, about the vast number of other causes which affect the fall of the penny is so abysmal that the fact of the penny’s symmetry is a mere detail. Thus, the statement 'head and tail are equiprobable' is at best an assumption." (Edward Kasner & James R Newman, "Mathematics and the Imagination", 1940)

"Probabilities must be regarded as analogous to the measurement of physical magnitudes; that is to say, they can never be known exactly, but only within certain approximation." (Emile Borel, "Probabilities and Life", 1943)

"Just as entropy is a measure of disorganization, the information carried by a set of messages is a measure of organization. In fact, it is possible to interpret the information carried by a message as essentially the negative of its entropy, and the negative logarithm of its probability. That is, the more probable the message, the less information it gives. Clichés, for example, are less illuminating than great poems." (Norbert Wiener, "The Human Use of Human Beings", 1950)

"To say that observations of the past are certain, whereas predictions are merely probable, is not the ultimate answer to the question of induction; it is only a sort of intermediate answer, which is incomplete unless a theory of probability is developed that explains what we should mean by ‘probable’ and on what ground we can assert probabilities." (Hans Reichenbach, "The Rise of Scientific Philosophy", 1951)

"Uncertainty is introduced, however, by the impossibility of making generalizations, most of the time, which happens to all members of a class. Even scientific truth is a matter of probability and the degree of probability stops somewhere short of certainty." (Wayne C Minnick, "The Art of Persuasion", 1957)

"Everybody has some idea of the meaning of the term 'probability' but there is no agreement among scientists on a precise definition of the term for the purpose of scientific methodology. It is sufficient for our purpose, however, if the concept is interpreted in terms of relative frequency, or more simply, how many times a particular event is likely to occur in a large population." (Alfred R Ilersic, "Statistics", 1959)

"Incomplete knowledge must be considered as perfectly normal in probability theory; we might even say that, if we knew all the circumstances of a phenomenon, there would be no place for probability, and we would know the outcome with certainty." (Félix E Borel, Probability and Certainty", 1963)

"Probability is the mathematics of uncertainty. Not only do we constantly face situations in which there is neither adequate data nor an adequate theory, but many modem theories have uncertainty built into their foundations. Thus learning to think in terms of probability is essential. Statistics is the reverse of probability (glibly speaking). In probability you go from the model of the situation to what you expect to see; in statistics you have the observations and you wish to estimate features of the underlying model." (Richard W Hamming, "Methods of Mathematics Applied to Calculus, Probability, and Statistics", 1985)

"Probability plays a central role in many fields, from quantum mechanics to information theory, and even older fields use probability now that the presence of 'noise' is officially admitted. The newer aspects of many fields start with the admission of uncertainty." (Richard W Hamming, "Methods of Mathematics Applied to Calculus, Probability, and Statistics", 1985)

"Probabilities are summaries of knowledge that is left behind when information is transferred to a higher level of abstraction." (Judea Pearl, "Probabilistic Reasoning in Intelligent Systems: Network of Plausible, Inference", 1988)

"[In statistics] you have the fact that the concepts are not very clean. The idea of probability, of randomness, is not a clean mathematical idea. You cannot produce random numbers mathematically. They can only be produced by things like tossing dice or spinning a roulette wheel. With a formula, any formula, the number you get would be predictable and therefore not random. So as a statistician you have to rely on some conception of a world where things happen in some way at random, a conception which mathematicians don’t have." (Lucien LeCam, [interview] 1988)

"Often, we use the word random loosely to describe something that is disordered, irregular, patternless, or unpredictable. We link it with chance, probability, luck, and coincidence. However, when we examine what we mean by random in various contexts, ambiguities and uncertainties inevitably arise. Tackling the subtleties of randomness allows us to go to the root of what we can understand of the universe we inhabit and helps us to define the limits of what we can know with certainty." (Ivars Peterson, "The Jungles of Randomness: A Mathematical Safari", 1998)

"In the laws of probability theory, likelihood distributions are fixed properties of a hypothesis. In the art of rationality, to explain is to anticipate. To anticipate is to explain." (Eliezer S. Yudkowsky, "A Technical Explanation of Technical Explanation", 2005)

"For some scientific data the true value cannot be given by a constant or some straightforward mathematical function but by a probability distribution or an expectation value. Such data are called probabilistic. Even so, their true value does not change with time or place, making them distinctly different from most statistical data of everyday life." (Manfred Drosg, "Dealing with Uncertainties: A Guide to Error Analysis", 2007)

"In fact, H [entropy] measures the amount of uncertainty that exists in the phenomenon. If there were only one event, its probability would be equal to 1, and H would be equal to 0 - that is, there is no uncertainty about what will happen in a phenomenon with a single event because we always know what is going to occur. The more events that a phenomenon possesses, the more uncertainty there is about the state of the phenomenon. In other words, the more entropy, the more information." (Diego Rasskin-Gutman, "Chess Metaphors: Artificial Intelligence and the Human Mind", 2009)

"The four questions of data analysis are the questions of description, probability, inference, and homogeneity. [...] Descriptive statistics are built on the assumption that we can use a single value to characterize a single property for a single universe. […] Probability theory is focused on what happens to samples drawn from a known universe. If the data happen to come from different sources, then there are multiple universes with different probability models. [...] Statistical inference assumes that you have a sample that is known to have come from one universe." (Donald J Wheeler," Myths About Data Analysis", International Lean & Six Sigma Conference, 2012)

"Entropy is a measure of amount of uncertainty or disorder present in the system within the possible probability distribution. The entropy and amount of unpredictability are directly proportional to each other." (G Suseela & Y Asnath V Phamila, "Security Framework for Smart Visual Sensor Networks", 2019)

🔬Data Science: Pattern (Definitions)

"A named description of a problem, solution, when to apply the solution, and how to apply the solution in new contexts." (Craig Larman, "Applying UML and Patterns", 2004)

"A named strategy for solving a recurring problem." (Bruce MacIsaac & Per Kroll, "Agility and Discipline Made Easy: Practices from OpenUP and RUP", 2006)

"A sequence of ordinary and special characters that enables a regular expression engine to locate a string. See regular expression." (Michael Fitzgerald, "Learning Ruby", 2007)

"A pattern is a way of documenting a successful solution to a recurring design problem in a non-contextualized way so that the solution can be applied (reused) in many different contexts." (Shirley Agostinho, "Learning Design Representations to Document, Model, and Share Teaching Practice", 2009)

"A proven solution to a recurring problem in a given context." (Pankaj Kamthan, "Pattern-Oriented Use Case Modeling", 2009)

"A common combination of logic, interactions, and behaviors that form a consistent or characteristic arrangement. An important use of patterns is the idea of design templates that are general solutions to integration problems. They will not solve a specific problem, but they provide a sort of architectural outline that may be reused in order to speed up the development process." (David Lyle & John G Schmidt, "Lean Integration", 2010)

"An empirically proven solution to a recurring problem that occurs in a particular context." (Panjak Kamthan & Terrill Fancott, A Knowledge Management Model for Patterns, 2011)

"A recurring combination of data and task management, separate from any specific algorithm. Patterns are universal in that they apply to and can be used in any programming system. Patterns have also been called dwarfs, motifs, and algorithmic skeletons. Patterns are not necessarily tied to any particular hardware architecture or programming language or system. Examples of patterns include the sequence pattern and the object pattern." (Michael McCool et al, "Structured Parallel Programming", 2012)

"The regular order existing in nature or in a manmade design. In nature patterns can be seen as symmetries (e.g., snowflakes) and/or structures having fractal dimension such as spirals, meanders, or surface waves. In computer science, design patterns serve in creating computer programs. In the arts, pattern is an artistic or decorative design made of recurring lines or any repeated elements." (Anna Ursyn, "Visualization as Communication with Graphic Representation", 2015)

"Set of partial results show systematically behavioral traits associated with a particular situation, entity or object." (Mauro Chiarella, "Folds and Refolds: Space Generation, Shapes, and Complex Components", 2016)

14 December 2018

🔭Data Science: Coincidence (Just the Quotes)

"It is no great wonder if in long process of time, while fortune takes her course hither and thither, numerous coincidences should spontaneously occur. If the number and variety of subjects to be wrought upon be infinite, it is all the more easy for fortune, with such an abundance of material, to effect this similarity of results." (Plutarch, Life of Sertorius, 1st century BC)

"Coincidences, in general, are great stumbling blocks in the way of that class of thinkers who have been educated to know nothing of the theory of probabilities - that theory to which the most glorious objects of human research are indebted for the most glorious of illustrations." (Edgar A Poe, "The Murders in the Rue Morgue", 1841)

"Nothing is more certain in scientific method than that approximate coincidence alone can be expected. In the measurement of continuous quantity perfect correspondence must be accidental, and should give rise to suspicion rather than to satisfaction." (William S Jevons, "The Principles of Science: A Treatise on Logic and Scientific Method", 1874)

"Before we can completely explain a phenomenon we require not only to find its true cause, its chief relations to other causes, and all the conditions which determine how the cause operates, and what its effect and amount of effect are, but also all the coincidences." (George Gore, "The Art of Scientific Discovery", 1878)

"As science progress, it becomes more and more difficult to fit in the new facts when they will not fit in spontaneously. The older theories depend upon the coincidences of so many numerical results which can not be attributed to chance. We should not separate what has been joined together." (Henri Poincaré, "The Ether and Matter", 1912)

"By the laws of statistics we could probably approximate just how unlikely it is that it would happen. But people forget - especially those who ought to know better, such as yourself - that while the laws of statistics tell you how unlikely a particular coincidence is, they state just as firmly that coincidences do happen." (Robert A Heinlein, "The Door Into Summer", 1957)

"There is no coherent knowledge, i.e. no uniform comprehensive account of the world and the events in it. There is no comprehensive truth that goes beyond an enumeration of details, but there are many pieces of information, obtained in different ways from different sources and collected for the benefit of the curious. The best way of presenting such knowledge is the list - and the oldest scientific works were indeed lists of facts, parts, coincidences, problems in several specialized domains." (Paul K Feyerabend, "Farewell to Reason", 1987)

"A tendency to drastically underestimate the frequency of coincidences is a prime characteristic of innumerates, who generally accord great significance to correspondences of all sorts while attributing too little significance to quite conclusive but less flashy statistical evidence." (John A Paulos, "Innumeracy: Mathematical Illiteracy and its Consequences", 1988)

"The law of truly large numbers states: With a large enough sample, any outrageous thing is likely to happen." (Frederick Mosteller, "Methods for Studying Coincidences", Journal of the American Statistical Association Vol. 84, 1989)

"Most coincidences are simply chance events that turn out to be far more probable than many people imagine." (Ivars Peterson, "The Jungles of Randomness: A Mathematical Safari", 1997)

"Coincidence surprises us because our intuition about the likelihood of an event is often wildly inaccurate." (Michael Starbird, "Coincidences, Chaos, and All That Math Jazz", 2005)

"With our heads spinning in the world of coincidence and chaos, we nevertheless must make decisions and take steps into the minefield of our future. To avoid explosive missteps, we rely on data and statistical reasoning to inform our thinking." (Michael Starbird, "Coincidences, Chaos, and All That Math Jazz", 2005)

"The human mind delights in finding pattern - so much so that we often mistake coincidence or forced analogy for profound meaning. No other habit of thought lies so deeply within the soul of a small creature trying to make sense of a complex world not constructed for it." (Stephen J Gould, "The Flamingo's Smile: Reflections in Natural History", 2010)

More quotes on "Coincidence" at the-web-of-knowledge.blogspot.com.

🔭Data Science: Algorithms (Just the Quotes)

"Mathematics is an aspect of culture as well as a collection of algorithms." (Carl B Boyer, "The History of the Calculus and Its Conceptual Development", 1959)

"Design problems - generating or discovering alternatives - are complex largely because they involve two spaces, an action space and a state space, that generally have completely different structures. To find a design requires mapping the former of these on the latter. For many, if not most, design problems in the real world systematic algorithms are not known that guarantee solutions with reasonable amounts of computing effort. Design uses a wide range of heuristic devices - like means-end analysis, satisficing, and the other procedures that have been outlined - that have been found by experience to enhance the efficiency of search. Much remains to be learned about the nature and effectiveness of these devices." (Herbert A Simon, "The Logic of Heuristic Decision Making", [in "The Logic of Decision and Action"], 1966)

"An algorithm must be seen to be believed, and the best way to learn what an algorithm is all about is to try it." (Donald E Knuth, The Art of Computer Programming Vol. I, 1968)

"Algorithmic complexity theory and nonlinear dynamics together establish the fact that determinism reigns only over a quite finite domain; outside this small haven of order lies a largely uncharted, vast wasteland of chaos." (Joseph Ford, "Progress in Chaotic Dynamics: Essays in Honor of Joseph Ford's 60th Birthday", 1988)

"On this view, we recognize science to be the search for algorithmic compressions. We list sequences of observed data. We try to formulate algorithms that compactly represent the information content of those sequences. Then we test the correctness of our hypothetical abbreviations by using them to predict the next terms in the string. These predictions can then be compared with the future direction of the data sequence. Without the development of algorithmic compressions of data all science would be replaced by mindless stamp collecting - the indiscriminate accumulation of every available fact. Science is predicated upon the belief that the Universe is algorithmically compressible and the modern search for a Theory of Everything is the ultimate expression of that belief, a belief that there is an abbreviated representation of the logic behind the Universe's properties that can be written down in finite form by human beings." (John D Barrow, New Theories of Everything", 1991)

"Algorithms are a set of procedures to generate the answer to a problem." (Stuart Kauffman, "At Home in the Universe: The Search for Laws of Complexity", 1995)

"Let us regard a proof of an assertion as a purely mechanical procedure using precise rules of inference starting with a few unassailable axioms. This means that an algorithm can be devised for testing the validity of an alleged proof simply by checking the successive steps of the argument; the rules of inference constitute an algorithm for generating all the statements that can be deduced in a finite number of steps from the axioms." (Edward Beltrami, "What is Random?: Chaos and Order in Mathematics and Life", 1999)

"The vast majority of information that we have on most processes tends to be nonnumeric and nonalgorithmic. Most of the information is fuzzy and linguistic in form." (Timothy J Ross & W Jerry Parkinson, "Fuzzy Set Theory, Fuzzy Logic, and Fuzzy Systems", 2002)

"Knowledge is encoded in models. Models are synthetic sets of rules, and pictures, and algorithms providing us with useful representations of the world of our perceptions and of their patterns." (Didier Sornette, "Why Stock Markets Crash - Critical Events in Complex Systems", 2003)

"Swarm Intelligence can be defined more precisely as: Any attempt to design algorithms or distributed problem-solving methods inspired by the collective behavior of the social insect colonies or other animal societies. The main properties of such systems are flexibility, robustness, decentralization and self-organization." ("Swarm Intelligence in Data Mining", Ed. Ajith Abraham et al, 2006)

"The burgeoning field of computer science has shifted our view of the physical world from that of a collection of interacting material particles to one of a seething network of information. In this way of looking at nature, the laws of physics are a form of software, or algorithm, while the material world - the hardware - plays the role of a gigantic computer." (Paul C W Davies, "Laying Down the Laws", New Scientist, 2007)

"An algorithm refers to a successive and finite procedure by which it is possible to solve a certain problem. Algorithms are the operational base for most computer programs. They consist of a series of instructions that, thanks to programmers’ prior knowledge about the essential characteristics of a problem that must be solved, allow a step-by-step path to the solution." (Diego Rasskin-Gutman, "Chess Metaphors: Artificial Intelligence and the Human Mind", 2009)

"Programming is a science dressed up as art, because most of us don’t understand the physics of software and it’s rarely, if ever, taught. The physics of software is not algorithms, data structures, languages, and abstractions. These are just tools we make, use, and throw away. The real physics of software is the physics of people. Specifically, it’s about our limitations when it comes to complexity and our desire to work together to solve large problems in pieces. This is the science of programming: make building blocks that people can understand and use easily, and people will work together to solve the very largest problems." (Pieter Hintjens, "ZeroMQ: Messaging for Many Applications", 2012)

"These nature-inspired algorithms gradually became more and more attractive and popular among the evolutionary computation research community, and together they were named swarm intelligence, which became the little brother of the major four evolutionary computation algorithms." (Yuhui Shi, "Emerging Research on Swarm Intelligence and Algorithm Optimization", Information Science Reference, 2014)

"[...] algorithms, which are abstract or idealized process descriptions that ignore details and practicalities. An algorithm is a precise and unambiguous recipe. It’s expressed in terms of a fixed set of basic operations whose meanings are completely known and specified. It spells out a sequence of steps using those operations, with all possible situations covered, and it’s guaranteed to stop eventually." (Brian W Kernighan, "Understanding the Digital World", 2017)

"An algorithm is the computer science version of a careful, precise, unambiguous recipe or tax form, a sequence of steps that is guaranteed to compute a result correctly." (Brian W Kernighan, "Understanding the Digital World", 2017)

"Again, classical statistics only summarizes data, so it does not provide even a language for asking [a counterfactual] question. Causal inference provides a notation and, more importantly, offers a solution. As with predicting the effect of interventions [...], in many cases we can emulate human retrospective thinking with an algorithm that takes what we know about the observed world and produces an answer about the counterfactual world." (Judea Pearl & Dana Mackenzie, "The Book of Why: The new science of cause and effect", 2018)

"Algorithms describe the solution to a problem in terms of the data needed to represent the problem instance and a set of steps necessary to produce the intended result." (Bradley N Miller et al, "Python Programming in Context", 2019)

"An algorithm, meanwhile, is a step-by-step recipe for performing a series of actions, and in most cases 'algorithm' means simply 'computer program'." (Tim Harford, "The Data Detective: Ten easy rules to make sense of statistics", 2020)

"Big data is revolutionizing the world around us, and it is easy to feel alienated by tales of computers handing down decisions made in ways we don’t understand. I think we’re right to be concerned. Modern data analytics can produce some miraculous results, but big data is often less trustworthy than small data. Small data can typically be scrutinized; big data tends to be locked away in the vaults of Silicon Valley. The simple statistical tools used to analyze small datasets are usually easy to check; pattern-recognizing algorithms can all too easily be mysterious and commercially sensitive black boxes." (Tim Harford, "The Data Detective: Ten easy rules to make sense of statistics", 2020)

"Each of us is sweating data, and those data are being mopped up and wrung out into oceans of information. Algorithms and large datasets are being used for everything from finding us love to deciding whether, if we are accused of a crime, we go to prison before the trial or are instead allowed to post bail. We all need to understand what these data are and how they can be exploited." (Tim Harford, "The Data Detective: Ten easy rules to make sense of statistics", 2020)

"Many people have strong intuitions about whether they would rather have a vital decision about them made by algorithms or humans. Some people are touchingly impressed by the capabilities of the algorithms; others have far too much faith in human judgment. The truth is that sometimes the algorithms will do better than the humans, and sometimes they won’t. If we want to avoid the problems and unlock the promise of big data, we’re going to need to assess the performance of the algorithms on a case-by-case basis. All too often, this is much harder than it should be. […] So the problem is not the algorithms, or the big datasets. The problem is a lack of scrutiny, transparency, and debate." (Tim Harford, "The Data Detective: Ten easy rules to make sense of statistics", 2020)

More quotes on "Algorithms" at the-web-of-knowledge.blogspot.com.

🔭Data Science: Datasets (Just the Quotes)

"Of course statistical graphics, just like statistical calculations, are only as good as what goes into them. An ill-specified or preposterous model or a puny data set cannot be rescued by a graphic (or by calculation), no matter how clever or fancy. A silly theory means a silly graphic." (Edward R Tufte, "The Visual Display of Quantitative Information", 1983)

"No matter what the data, and no matter how the values are arranged and presented, you must always use some method of analysis to come up with an interpretation of the data.

While every data set contains noise, some data sets may contain signals. Therefore, before you can detect a signal within any given data set, you must first filter out the noise." (Donald J Wheeler," Understanding Variation: The Key to Managing Chaos" 2nd Ed., 2000)

"Since the average is a measure of location, it is common to use averages to compare two data sets. The set with the greater average is thought to ‘exceed’ the other set. While such comparisons may be helpful, they must be used with caution. After all, for any given data set, most of the values will not be equal to the average." (Donald J Wheeler, "Understanding Variation: The Key to Managing Chaos" 2nd Ed., 2000)

"Without meaningful data there can be no meaningful analysis. The interpretation of any data set must be based upon the context of those data." (Donald J Wheeler, "Understanding Variation: The Key to Managing Chaos" 2nd Ed., 2000)

"Enabling insight into large and complex datasets is a prevalent theme in current visualization research for which different approaches are pursued. Topology-based methods are built on the idea of abstracting characteristic structures such as the topological skeleton from the data and to construct the visualization accordingly." (Helwig Hauser et al [Eds.], "Topology-based Methods in Visualization", 2007)

"Most mainstream data-mining techniques ignore the fact that real-world datasets are combinations of underlying data, and build single models from them. If such datasets can first be separated into the components that underlie them, we might expect that the quality of the models will improve significantly. (David Skillicorn, "Understanding Complex Datasets: Data Mining with Matrix Decompositions", 2007)

"For a given dataset there is not a great deal of advice which can be given on content and context. hose who know their own data should know best for their specific purposes. It is advisable to think hard about what should be shown and to check with others if the graphic makes the desired impression. Design should be let to designers, though some basic guidelines should be followed: consistency is important (sets of graphics should be in similar style and use equivalent scaling); proximity is helpful (place graphics on the same page, or on the facing page, of any text that refers to them); and layout should be checked (graphics should be neither too small nor too large and be attractively positioned relative to the whole page or display)."(Antony Unwin, "Good Graphics?" [in "Handbook of Data Visualization"], 2008)

"Graphical displays are often constructed to place principal focus on the individual observations in a dataset, and this is particularly helpful in identifying both the typical positions of data points and unusual or influential cases. However, in many investigations, principal interest lies in identifying the nature of underlying trends and relationships between variables, and so it is often helpful to enhance graphical displays in ways which give deeper insight into these features. This can be very beneficial both for small datasets, where variation can obscure underlying patterns, and large datasets, where the volume of data is so large that effective representation inevitably involves suitable summaries." (Adrian W Bowman, "Smoothing Techniques for Visualisation" [in "Handbook of Data Visualization"], 2008)

"The main goal of data visualization is its ability to visualize data, communicating information clearly and effectively. It doesn’t mean that data visualization needs to look boring to be functional or extremely sophisticated to look beautiful. To convey ideas effectively, both aesthetic form and functionality need to go hand in hand, providing insights into a rather sparse and complex dataset by communicating its key aspects in a more intuitive way. Yet designers often tend to discard the balance between design and function, creating gorgeous data visualizations which fail to serve its main purpose - communicate information." (Vitaly Friedman, "Data Visualization and Infographics", Smashing Magazine, 2008)

"There are two main reasons for using graphic displays of datasets: either to present or to explore data. Presenting data involves deciding what information you want to convey and drawing a display appropriate for the content and for the intended audience. [...] Exploring data is a much more individual matter, using graphics to find information and to generate ideas.Many displays may be drawn. They can be changed at will or discarded and new versions prepared, so generally no one plot is especially important, and they all have a short life span." (Antony Unwin, "Good Graphics?" [in "Handbook of Data Visualization"], 2008)

"To extract useful information from such large and structured data sets, a first step is to be able to visualize their structure, identifying interesting patterns, trends, and complex relationships between the items. The main idea of visual data exploration is to produce a representation of the data in such a way that the human eye can gain insight into their structure and patterns." (George Michailidis, "Data Visualization Through Their Graph Representations" [in "Handbook of Data Visualization"], 2008)

"[...] the form of a technological object must depend on the tasks it should help with. This is one of the most important principles to remember when dealing with infographics and visualizations: The form should be constrained by the functions of your presentation. There may be more than one form a data set can adopt so that readers can perform operations with it and extract meanings, but the data cannot adopt any form. Choosing visual shapes to encode information should not be based on aesthetics and personal tastes alone." (Alberto Cairo, "The Functional Art", 2011)

"If you look too hard at a set of data, you will find something - but it might not generalize beyond the data you’re looking at. This is referred to as overfitting a dataset. Data mining techniques can be very powerful, and the need to detect and avoid overfitting is one of the most important concepts to grasp when applying data mining to real problems. The concept of overfitting and its avoidance permeates data science processes, algorithms, and evaluation methods." (Foster Provost & Tom Fawcett, "Data Science for Business", 2013)

"No subjective metric can escape strategic gaming [...] The possibility of mischief is bottomless. Fighting ratings is fruitless, as they satisfy a very human need. If one scheme is beaten down, another will take its place and wear its flaws. Big Data just deepens the danger. The more complex the rating formulas, the more numerous the opportunities there are to dress up the numbers. The larger the data sets, the harder it is to audit them." (Kaiser Fung, "Numbersense: How To Use Big Data To Your Advantage", 2013)

"Visualization can be appreciated purely from an aesthetic point of view, but it’s most interesting when it’s about data that’s worth looking at. That’s why you start with data, explore it, and then show results rather than start with a visual and try to squeeze a dataset into it. It’s like trying to use a hammer to bang in a bunch of screws. […] Aesthetics isn’t just a shiny veneer that you slap on at the last minute. It represents the thought you put into a visualization, which is tightly coupled with clarity and affects interpretation." (Nathan Yau, "Data Points: Visualization That Means Something", 2013)

"Big Data allows us to meaningfully zoom in on small segments of a dataset to gain new insights on who we are." (Seth Stephens-Davidowitz, "Everybody Lies: What the Internet Can Tell Us About Who We Really Are", 2017)

"Effects without an understanding of the causes behind them, on the other hand, are just bunches of data points floating in the ether, offering nothing useful by themselves. Big Data is information, equivalent to the patterns of light that fall onto the eye. Big Data is like the history of stimuli that our eyes have responded to. And as we discussed earlier, stimuli are themselves meaningless because they could mean anything. The same is true for Big Data, unless something transformative is brought to all those data sets… understanding." (Beau Lotto, "Deviate: The Science of Seeing Differently", 2017)

"One way to lie with statistics is to compare things - datasets, populations, types of products - that are different from one another, and pretend that they’re not. As the old idiom says, you can’t compare apples with oranges." (Daniel J Levitin, "Weaponized Lies", 2017)

"There are other problems with Big Data. In any large data set, there are bound to be inconsistencies, misclassifications, missing data - in other words, errors, blunders, and possibly lies. These problems with individual items occur in any data set, but they are often hidden in a large mass of numbers even when these numbers are generated out of computer interactions." (David S Salsburg, "Errors, Blunders, and Lies: How to Tell the Difference", 2017)

"Creating effective visualizations is hard. Not because a dataset requires an exotic and bespoke visual representation - for many problems, standard statistical charts will suffice. And not because creating a visualization requires coding expertise in an unfamiliar programming language [...]. Rather, creating effective visualizations is difficult because the problems that are best addressed by visualization are often complex and ill-formed. The task of figuring out what attributes of a dataset are important is often conflated with figuring out what type of visualization to use. Picking a chart type to represent specific attributes in a dataset is comparatively easy. Deciding on which data attributes will help answer a question, however, is a complex, poorly defined, and user-driven process that can require several rounds of visualization and exploration to resolve." (Danyel Fisher & Miriah Meyer, "Making Data Visual", 2018)

"Data analysis and data mining are concerned with unsupervised pattern finding and structure determination in data sets. The data sets themselves are explicitly linked as a form of representation to an observational or otherwise empirical domain of interest. 'Structure' has long been understood as symmetry which can take many forms with respect to any transformation, including point, translational, rotational, and many others. Symmetries directly point to invariants, which pinpoint intrinsic properties of the data and of the background empirical domain of interest. As our data models change, so too do our perspectives on analysing data." (Fionn Murtagh, "Data Science Foundations: Geometry and Topology of Complex Hierarchic Systems and Big Data Analytics", 2018)

"[…] creating effective visualizations is difficult because the problems that are best addressed by visualization are often complex and ill-formed. The task of figuring out what attributes of a dataset are important is often conflated with figuring out what type of visualization to use. Picking a chart type to represent specific attributes in a dataset is comparatively easy. Deciding on which data attributes will help answer a question, however, is a complex, poorly defined, and user-driven process that can require several rounds of visualization and exploration to resolve." (Danyel Fisher & Miriah Meyer, "Making Data Visual", 2018)

"Every dataset has subtleties; it can be far too easy to slip down rabbit holes of complications. Being systematic about the operationalization can help focus our conversations with experts, only introducing complications when needed." (Danyel Fisher & Miriah Meyer, "Making Data Visual", 2018)

"The goal of data science is to improve decision making by basing decisions on insights extracted from large data sets. As a field of activity, data science encompasses a set of principles, problem definitions, algorithms, and processes for extracting nonobvious and useful patterns from large data sets. It is closely related to the fields of data mining and machine learning, but it is broader in scope. (John D Kelleher & Brendan Tierney, "Data Science", 2018)

"Using data science, we can uncover the important patterns in a data set, and these patterns can reveal the important attributes in the domain. The reason why data science is used in so many domains is that it doesn’t matter what the problem domain is: if the right data are available and the problem can be clearly defined, then data science can help." (John D Kelleher & Brendan Tierney, "Data Science", 2018)

"It’d be nice to fondly imagine that high-quality statistics simply appear in a spreadsheet somewhere, divine providence from the numerical heavens. Yet any dataset begins with somebody deciding to collect the numbers. What numbers are and aren’t collected, what is and isn’t measured, and who is included or excluded are the result of all-too-human assumptions, preconceptions, and oversights." (Tim Harford, "The Data Detective: Ten easy rules to make sense of statistics", 2020)

13 December 2018

🔭Data Science: Approximation (Just the Quotes)

"Man’s mind cannot grasp the causes of events in their completeness, but the desire to find those causes is implanted in man’s soul. And without considering the multiplicity and complexity of the conditions any one of which taken separately may seem to be the cause, he snatches at the first approximation to a cause that seems to him intelligible and says: ‘This is the cause!’" (Leo Tolstoy, "War and Peace", 1867)

"[It] may be laid down as a general rule that, if the result of a long series of precise observations approximates a simple relation so closely that the remaining difference is undetectable by observation and may be attributed to the errors to which they are liable, then this relation is probably that of nature." (Pierre-Simon Laplace, "Mémoire sur les Inégalites Séculaires des Planètes et des Satellites", 1787)

"Although this may seem a paradox, all exact science is dominated by the idea of approximation. When a man tells you that he knows the exact truth about anything, you are safe in inferring that he is an inexact man." (Bertrand Russell, "The Scientific Outlook", 1931)

"We live in a system of approximations. Every end is prospective of some other end, which is also temporary; a round and final success nowhere. We are encamped in nature, not domesticated." (Ralph W Emerson, "Essays", 1865)

"Science does not aim at establishing immutable truths and eternal dogmas; its aim is to approach the truth by successive approximations, without claiming that at any stage final and complete accuracy has been achieved." (Bertrand Russell, "The ABC of Relativity", 1925)

"[…] reality is a system, completely ordered and fully intelligible, with which thought in its advance is more and more identifying itself. We may look at the growth of knowledge […] as an attempt by our mind to return to union with things as they are in their ordered wholeness. […] and if we take this view, our notion of truth is marked out for us. Truth is the approximation of thought to reality […] Its measure is the distance thought has travelled […] toward that intelligible system […] The degree of truth of a particular proposition is to be judged in the first instance by its coherence with experience as a whole, ultimately by its coherence with that further whole, all comprehensive and fully articulated, in which thought can come to rest." (Brand Blanshard, "The Nature of Thought" Vol. II, 1939)

"The most important maxim for data analysis to heed, and one which many statisticians seem to have shunned is this: ‘Far better an approximate answer to the right question, which is often vague, than an exact answer to the wrong question, which can always be made precise.’ Data analysis must progress by approximate answers, at best, since its knowledge of what the problem really is will at best be approximate." (John W Tukey, "The Future of Data Analysis", Annals of Mathematical Statistics, Vol. 33, No. 1, 1962)

"Because engineering is science in action - the practice of decision making at the earliest moment - it has been defined as the art of skillful approximation. No situation in engineering is simple enough to be solved precisely, and none worth evaluating is solved exactly. Never are there sufficient facts, sufficient time, or sufficient money for an exact solution, for if by chance there were, the answer would be of academic and not economic interest to society. These are the circumstances that make engineering so vital and so creative." (Ronald B Smith, "Engineering Is…", Mechanical Engineering Vol. 86 (5), 1964)

"Engineering is the art of skillful approximation; the practice of gamesmanship in the highest form. In the end it is a method broad enough to tame the unknown, a means of combing disciplined judgment with intuition, courage with responsibility, and scientific competence within the practical aspects of time, of cost, and of talent." (Ronald B Smith, "Professional Responsibility of Engineering", Mechanical Engineering Vol. 86 (1), 1964)

"Measurement, we have seen, always has an element of error in it. The most exact description or prediction that a scientist can make is still only approximate." (Abraham Kaplan, "The Conduct of Inquiry: Methodology for Behavioral Science", 1964)

"One grievous error in interpreting approximations is to allow only good approximations." (Preston C Hammer, "Mind Pollution", Cybernetics, Vol. 14, 1971)

"The fact that [the model] is an approximation does not necessarily detract from its usefulness because models are approximations. All models are wrong, but some are useful." (George Box, 1987)

"Science is more than a mere attempt to describe nature as accurately as possible. Frequently the real message is well hidden, and a law that gives a poor approximation to nature has more significance than one which works fairly well but is poisoned at the root." (Robert H March, "Physics for Poets", 1996)

"Most physical systems, particularly those complex ones, are extremely difficult to model by an accurate and precise mathematical formula or equation due to the complexity of the system structure, nonlinearity, uncertainty, randomness, etc. Therefore, approximate modeling is often necessary and practical in real-world applications. Intuitively, approximate modeling is always possible. However, the key questions are what kind of approximation is good, where the sense of 'goodness' has to be first defined, of course, and how to formulate such a good approximation in modeling a system such that it is mathematically rigorous and can produce satisfactory results in both theory and applications." (Guanrong Chen & Trung Tat Pham, "Introduction to Fuzzy Sets, Fuzzy Logic, and Fuzzy Control Systems", 2001)

"Mathematical modeling is as much ‘art’ as ‘science’: it requires the practitioner to (i) identify a so-called ‘real world’ problem (whatever the context may be); (ii) formulate it in mathematical terms (the ‘word problem’ so beloved of undergraduates); (iii) solve the problem thus formulated (if possible; perhaps approximate solutions will suffice, especially if the complete problem is intractable); and (iv) interpret the solution in the context of the original problem." (John A Adam, "Mathematics in Nature", 2003)

"All models are approximations. Essentially, all models are wrong, but some are useful. However, the approximate nature of the model must always be borne in mind." (George E P Box & Norman R Draper, "Response Surfaces, Mixtures, and Ridge Analyses", 2007)

"Science, at its core, is simply a method of practical logic that tests hypotheses against experience. Scientism, by contrast, is the worldview and value system that insists that the questions the scientific method can answer are the most important questions human beings can ask, and that the picture of the world yielded by science is a better approximation to reality than any other." (John M Greer, "After Progress: Reason and Religion at the End of the Industrial Age", 2015)

"Science is about finding ever better approximations rather than pretending you have already found ultimate truth." (Friedrich Nietzsche)

More quotes on "Approximation" at the-web-of-knowledge.blogspot.com.

🔭Data Science: Bayesian Networks (Just the Quotes)

"The best way to convey to the experimenter what the data tell him about theta is to show him a picture of the posterior distribution." (George E P Box & George C Tiao, "Bayesian Inference in Statistical Analysis", 1973)

"In the design of experiments, one has to use some informal prior knowledge. How does one construct blocks in a block design problem for instance? It is stupid to think that use is not made of a prior. But knowing that this prior is utterly casual, it seems ludicrous to go through a lot of integration, etc., to obtain 'exact' posterior probabilities resulting from this prior. So, I believe the situation with respect to Bayesian inference and with respect to inference, in general, has not made progress. Well, Bayesian statistics has led to a great deal of theoretical research. But I don't see any real utilizations in applications, you know. Now no one, as far as I know, has examined the question of whether the inferences that are obtained are, in fact, realized in the predictions that they are used to make." (Oscar Kempthorne, "A conversation with Oscar Kempthorne", Statistical Science, 1995)

"Bayesian methods are complicated enough, that giving researchers user-friendly software could be like handing a loaded gun to a toddler; if the data is crap, you won't get anything out of it regardless of your political bent." (Brad Carlin, "Bayes offers a new way to make sense of numbers", Science, 1999)

"Bayesian inference is a controversial approach because it inherently embraces a subjective notion of probability. In general, Bayesian methods provide no guarantees on long run performance." (Larry A Wasserman, "All of Statistics: A concise course in statistical inference", 2004)

"Bayesian inference is appealing when prior information is available since Bayes’ theorem is a natural way to combine prior information with data. Some people find Bayesian inference psychologically appealing because it allows us to make probability statements about parameters. […] In parametric models, with large samples, Bayesian and frequentist methods give approximately the same inferences. In general, they need not agree." (Larry A Wasserman, "All of Statistics: A concise course in statistical inference", 2004)

"The Bayesian approach is based on the following postulates: (B1) Probability describes degree of belief, not limiting frequency. As such, we can make probability statements about lots of things, not just data which are subject to random variation. […] (B2) We can make probability statements about parameters, even though they are fixed constants. (B3) We make inferences about a parameter θ by producing a probability distribution for θ. Inferences, such as point estimates and interval estimates, may then be extracted from this distribution." (Larry A Wasserman, "All of Statistics: A concise course in statistical inference", 2004)

"The important thing is to understand that frequentist and Bayesian methods are answering different questions. To combine prior beliefs with data in a principled way, use Bayesian inference. To construct procedures with guaranteed long run performance, such as confidence intervals, use frequentist methods. Generally, Bayesian methods run into problems when the parameter space is high dimensional." (Larry A Wasserman, "All of Statistics: A concise course in statistical inference", 2004)

"Bayesian networks can be constructed by hand or learned from data. Learning both the topology of a Bayesian network and the parameters in the CPTs in the network is a difficult computational task. One of the things that makes learning the structure of a Bayesian network so difficult is that it is possible to define several different Bayesian networks as representations for the same full joint probability distribution." (John D Kelleher et al, "Fundamentals of Machine Learning for Predictive Data Analytics: Algorithms, worked examples, and case studies", 2015)

"Bayesian networks provide a more flexible representation for encoding the conditional independence assumptions between the features in a domain. Ideally, the topology of a network should reflect the causal relationships between the entities in a domain. Properly constructed Bayesian networks are relatively powerful models that can capture the interactions between descriptive features in determining a prediction." (John D Kelleher et al, "Fundamentals of Machine Learning for Predictive Data Analytics: Algorithms, worked examples, and case studies", 2015)

"Bayesian networks use a graph-based representation to encode the structural relationships - such as direct influence and conditional independence - between subsets of features in a domain. Consequently, a Bayesian network representation is generally more compact than a full joint distribution (because it can encode conditional independence relationships), yet it is not forced to assert a global conditional independence between all descriptive features. As such, Bayesian network models are an intermediary between full joint distributions and naive Bayes models and offer a useful compromise between model compactness and predictive accuracy." (John D Kelleher et al, "Fundamentals of Machine Learning for Predictive Data Analytics: Algorithms, worked examples, and case studies", 2015)

"Bayesian networks inhabit a world where all questions are reducible to probabilities, or (in the terminology of this chapter) degrees of association between variables; they could not ascend to the second or third rungs of the Ladder of Causation. Fortunately, they required only two slight twists to climb to the top." (Judea Pearl & Dana Mackenzie, "The Book of Why: The new science of cause and effect", 2018)

"The main differences between Bayesian networks and causal diagrams lie in how they are constructed and the uses to which they are put. A Bayesian network is literally nothing more than a compact representation of a huge probability table. The arrows mean only that the probabilities of child nodes are related to the values of parent nodes by a certain formula (the conditional probability tables) and that this relation is sufficient. That is, knowing additional ancestors of the child will not change the formula. Likewise, a missing arrow between any two nodes means that they are independent, once we know the values of their parents. [...] If, however, the same diagram has been constructed as a causal diagram, then both the thinking that goes into the construction and the interpretation of the final diagram change." (Judea Pearl & Dana Mackenzie, "The Book of Why: The new science of cause and effect", 2018)

"The transparency of Bayesian networks distinguishes them from most other approaches to machine learning, which tend to produce inscrutable 'black boxes'. In a Bayesian network you can follow every step and understand how and why each piece of evidence changed the network’s beliefs." (Judea Pearl & Dana Mackenzie, "The Book of Why: The new science of cause and effect", 2018)

"With Bayesian networks, we had taught machines to think in shades of gray, and this was an important step toward humanlike thinking. But we still couldn’t teach machines to understand causes and effects. [...] By design, in a Bayesian network, information flows in both directions, causal and diagnostic: smoke increases the likelihood of fire, and fire increases the likelihood of smoke. In fact, a Bayesian network can’t even tell what the 'causal direction' is." (Judea Pearl & Dana Mackenzie, "The Book of Why: The new science of cause and effect", 2018)