SQL Troubles

24 April 2006

🖍️Joel Best - Collected Quotes

"All human knowledge - including statistics - is created through people's actions; everything we know is shaped by our language, culture, and society. Sociologists call this the social construction of knowledge. Saying that knowledge is socially constructed does not mean that all we know is somehow fanciful, arbitrary, flawed, or wrong. For example, scientific knowledge can be remarkably accurate, so accurate that we may forget the people and social processes that produced it." (Joel Best, "Damned Lies and Statistics: Untangling Numbers from the Media, Politicians, and Activists", 2001)

"Any statistic based on more than a guess requires some sort of counting. Definitions specify what will be counted. Measuring involves deciding how to go about counting. We cannot begin counting until we decide how we will identify and count instances of a social problem. [...] Measurement involves choices. [...] Often, measurement decisions are hidden." (Joel Best, "Damned Lies and Statistics: Untangling Numbers from the Media, Politicians, and Activists", 2001)

"Big numbers warn us that the problem is a common one, compelling our attention, concern, and action. The media like to report statistics because numbers seem to be 'hard facts' - little nuggets of indisputable truth. [...] One common innumerate error involves not distinguishing among large numbers. [...] Because many people have trouble appreciating the differences among big numbers, they tend to uncritically accept social statistics (which often, of course, feature big numbers)." (Joel Best, "Damned Lies and Statistics: Untangling Numbers from the Media, Politicians, and Activists", 2001)

"But people treat mutant statistics just as they do other statistics - that is, they usually accept even the most implausible claims without question. [...] And people repeat bad statistics [...] bad statistics live on; they take on lives of their own. [...] Statistics, then, have a bad reputation. We suspect that statistics may be wrong, that people who use statistics may be 'lying' - trying to manipulate us by using numbers to somehow distort the truth. Yet, at the same time, we need statistics; we depend upon them to summarize and clarify the nature of our complex society. This is particularly true when we talk about social problems." (Joel Best, "Damned Lies and Statistics: Untangling Numbers from the Media, Politicians, and Activists", 2001)

"Changing measures are a particularly common problem with comparisons over time, but measures also can cause problems of their own. [...] We cannot talk about change without making comparisons over time. We cannot avoid such comparisons, nor should we want to. However, there are several basic problems that can affect statistics about change. It is important to consider the problems posed by changing - and sometimes unchanging - measures, and it is also important to recognize the limits of predictions. Claims about change deserve critical inspection; we need to ask ourselves whether apples are being compared to apples - or to very different objects." (Joel Best, "Damned Lies and Statistics: Untangling Numbers from the Media, Politicians, and Activists", 2001)

"Clear, precise definitions are not enough. Whatever is defined must also be measured, and meaningless measurements will produce meaningless statistics." (Joel Best, "Damned Lies and Statistics: Untangling Numbers from the Media, Politicians, and Activists", 2001)

"Compound errors can begin with any of the standard sorts of bad statistics - a guess, a poor sample, an inadvertent transformation, perhaps confusion over the meaning of a complex statistic. People inevitably want to put statistics to use, to explore a number's implications. [...] The strengths and weaknesses of those original numbers should affect our confidence in the second-generation statistics." (Joel Best, "Damned Lies and Statistics: Untangling Numbers from the Media, Politicians, and Activists", 2001)

"First, good statistics are based on more than guessing. [...] Second, good statistics are based on clear, reasonable definitions. Remember, every statistic has to define its subject. Those definitions ought to be clear and made public. [...] Third, good statistics are based on clear, reasonable measures. Again, every statistic involves some sort of measurement; while all measures are imperfect, not all flaws are equally serious. [...] Finally, good statistics are based on good samples." (Joel Best, "Damned Lies and Statistics: Untangling Numbers from the Media, Politicians, and Activists", 2001)

"In order to interpret statistics, we need more than a checklist of common errors. We need a general approach, an orientation, a mind-set that we can use to think about new statistics that we encounter. We ought to approach statistics thoughtfully. This can be hard to do, precisely because so many people in our society treat statistics as fetishes." (Joel Best, "Damned Lies and Statistics: Untangling Numbers from the Media, Politicians, and Activists", 2001)

"Innumeracy - widespread confusion about basic mathematical ideas - means that many statistical claims about social problems don't get the critical attention they deserve. This is not simply because an innumerate public is being manipulated by advocates who cynically promote inaccurate statistics. Often, statistics about social problems originate with sincere, well-meaning people who are themselves innumerate; they may not grasp the full implications of what they are saying. Similarly, the media are not immune to innumeracy; reporters commonly repeat the figures their sources give them without bothering to think critically about them." (Joel Best, "Damned Lies and Statistics: Untangling Numbers from the Media, Politicians, and Activists", 2001)

"Knowledge is factual when evidence supports it and we have great confidence in its accuracy. What we call 'hard fact' is information supported by strong, convincing evidence; this means evidence that, so far as we know, we cannot deny, however we examine or test it. Facts always can be questioned, but they hold up under questioning. How did people come by this information? How did they interpret it? Are other interpretations possible? The more satisfactory the answers to such questions, the 'harder' the facts." (Joel Best, "Damned Lies and Statistics: Untangling Numbers from the Media, Politicians, and Activists", 2001)

"Like definitions, measurements always involve choices. Advocates of different measures can defend their own choices and criticize those made by their opponents - so long as the various choices being made are known and understood. However, when measurement choices are kept hidden, it becomes difficult to assess the statistics based on those choices." (Joel Best, "Damned Lies and Statistics: Untangling Numbers from the Media, Politicians, and Activists", 2001)

"No definition of a social problem is perfect, but there are two principal ways such definitions can be flawed. On the one hand, we may worry that a definition is too broad, that it encompasses more than it ought to include. That is, broad definitions identify some cases as part of the problem that we might think ought not to be included; statisticians call such cases false positives (that is, they mistakenly identify cases as part of the problem). On the other hand, a definition that is too narrow excludes cases that we might think ought to be included; these are false negatives (incorrectly identified as not being part of the problem)." (Joel Best, "Damned Lies and Statistics: Untangling Numbers from the Media, Politicians, and Activists", 2001)

"Not all statistics start out bad, but any statistic can be made worse. Numbers - even good numbers - can be misunderstood or misinterpreted. Their meanings can be stretched, twisted, distorted, or mangled. These alterations create what we can call mutant statistics - distorted versions of the original figures." (Joel Best, "Damned Lies and Statistics: Untangling Numbers from the Media, Politicians, and Activists", 2001)

"One reason we tend to accept statistics uncritically is that we assume that numbers come from experts who know what they're doing. [...] There is a natural tendency to treat these figures as straightforward facts that cannot be questioned." (Joel Best, "Damned Lies and Statistics: Untangling Numbers from the Media, Politicians, and Activists", 2001)

"People who create or repeat a statistic often feel they have a stake in defending the number. When someone disputes an estimate and offers a very different (often lower) figure, people may rush to defend the original estimate and attack the new number and anyone who dares to use it. [...] any estimate can be defended by challenging the motives of anyone who disputes the figure." (Joel Best, "Damned Lies and Statistics: Untangling Numbers from the Media, Politicians, and Activists", 2001)

"Statistics are not magical. Nor are they always true - or always false. Nor need they be incomprehensible. Adopting a Critical approach offers an effective way of responding to the numbers we are sure to encounter. Being Critical requires more thought, but failing to adopt a Critical mind-set makes us powerless to evaluate what others tell us. When we fail to think critically, the statistics we hear might just as well be magical." (Joel Best, "Damned Lies and Statistics: Untangling Numbers from the Media, Politicians, and Activists", 2001)

"Statisticians can calculate the probability that such random samples represent the population; this is usually expressed in terms of sampling error [...]. The real problem is that few samples are random. Even when researchers know the nature of the population, it can be time-consuming and expensive to draw a random sample; all too often, it is impossible to draw a true random sample because the population cannot be defined. This is particularly true for studies of social problems. [...] The best samples are those that come as close as possible to being random." (Joel Best, "Damned Lies and Statistics: Untangling Numbers from the Media, Politicians, and Activists", 2001)

"The ease with which somewhat complex statistics can produce confusion is important, because we live in a world in which complex numbers are becoming more common. Simple statistical ideas - fractions, percentages, rates - are reasonably well understood by many people. But many social problems involve complex chains of cause and effect that can be understood only through complicated models developed by experts. [...] environment has an influence. Sorting out the interconnected causes of these problems requires relatively complicated statistical ideas - net additions, odds ratios, and the like. If we have an imperfect understanding of these ideas, and if the reporters and other people who relay the statistics to us share our confusion - and they probably do - the chances are good that we'll soon be hearing - and repeating, and perhaps making decisions on the basis of - mutated statistics." (Joel Best, "Damned Lies and Statistics: Untangling Numbers from the Media, Politicians, and Activists", 2001)

"There are two problems with sampling - one obvious, and the other more subtle. The obvious problem is sample size. Samples tend to be much smaller than their populations. [...] Obviously, it is possible to question results based on small samples. The smaller the sample, the less confidence we have that the sample accurately reflects the population. However, large samples aren't necessarily good samples. This leads to the second issue: the representativeness of a sample is actually far more important than sample size. A good sample accurately reflects (or 'represents') the population." (Joel Best, "Damned Lies and Statistics: Untangling Numbers from the Media, Politicians, and Activists", 2001)

"We often hear warnings that some social problem is 'epidemic'. This expression suggests that the problem's growth is rapid, widespread, and out of control. If things are getting worse, and particularly if they're getting worse fast, we need to act." (Joel Best, "Damned Lies and Statistics: Untangling Numbers from the Media, Politicians, and Activists", 2001)

"Whenever examples substitute for definitions, there is a risk that our understanding of the problem will be distorted." (Joel Best, "Damned Lies and Statistics: Untangling Numbers from the Media, Politicians, and Activists", 2001)

"While some social problems statistics are deliberate deceptions, many - probably the great majority - of bad statistics are the result of confusion, incompetence, innumeracy, or selective, self-righteous efforts to produce numbers that reaffirm principles and interests that their advocates consider just and right. The best response to stat wars is not to try and guess who's lying or, worse, simply to assume that the people we disagree with are the ones telling lies. Rather, we need to watch for the standard causes of bad statistics - guessing, questionable definitions or methods, mutant numbers, and inappropriate comparisons." (Joel Best, "Damned Lies and Statistics: Untangling Numbers from the Media, Politicians, and Activists", 2001)

"Every number has its limitations; every number is a product of choices that inevitably involve compromise. Statistics are intended to help us summarize, to get an overview of part of the world’s complexity. But some information is always sacrificed in the process of choosing what will be counted and how. Something is, in short, always missing. In evaluating statistics, we should not forget what has been lost, if only because this helps us understand what we still have." (Joel Best, "More Damned Lies and Statistics: How numbers confuse public issues", 2004)

"Good statistics are not only products of people counting; the quality of statistics also depends on people’s willingness and ability to count thoughtfully and on their decisions about what, exactly, ought to be counted so that the resulting numbers will be both accurate and meaningful." (Joel Best, "More Damned Lies and Statistics: How numbers confuse public issues", 2004)

"In much the same way, people create statistics: they choose what to count, how to go about counting, which of the resulting numbers they share with others, and which words they use to describe and interpret those figures. Numbers do not exist independent of people; understanding numbers requires knowing who counted what, why they bothered counting, and how they went about it." (Joel Best, "More Damned Lies and Statistics: How numbers confuse public issues", 2004)

"In short, some numbers are missing from discussions of social issues because certain phenomena are hard to quantify, and any effort to assign numeric values to them is subject to debate. But refusing to somehow incorporate these factors into our calculations creates its own hazards. The best solution is to acknowledge the difficulties we encounter in measuring these phenomena, debate openly, and weigh the options as best we can." (Joel Best, "More Damned Lies and Statistics: How numbers confuse public issues", 2004)

"Nonetheless, the basic principles regarding correlations between variables are not that diffcult to understand. We must look for patterns that reveal potential relationships and for evidence that variables are actually related. But when we do spot those relationships, we should not jump to conclusions about causality. Instead, we need to weigh the strength of the relationship and the plausibility of our theory, and we must always try to discount the possibility of spuriousness." (Joel Best, "More Damned Lies and Statistics : How numbers confuse public issues", 2004)

"Statistics depend on collecting information. If questions go unasked, or if they are asked in ways that limit responses, or if measures count some cases but exclude others, information goes ungathered, and missing numbers result. Nevertheless, choices regarding which data to collect and how to go about collecting the information are inevitable." (Joel Best, "More Damned Lies and Statistics: How numbers confuse public issues", 2004)

"When people use statistics, they assume - or, at least, they want their listeners to assume - that the numbers are meaningful. This means, at a minimum, that someone has actually counted something and that they have done the counting in a way that makes sense. Statistical information is one of the best ways we have of making sense of the world’s complexities, of identifying patterns amid the confusion. But bad statistics give us bad information." (Joel Best, "More Damned Lies and Statistics: How numbers confuse public issues", 2004)

23 April 2006

🖍️Michael J Moroney - Collected Quotes

"A good estimator will be unbiased and will converge more and more closely (in the long run) on the true value as the sample size increases. Such estimators are known as consistent. But consistency is not all we can ask of an estimator. In estimating the central tendency of a distribution, we are not confined to using the arithmetic mean; we might just as well use the median. Given a choice of possible estimators, all consistent in the sense just defined, we can see whether there is anything which recommends the choice of one rather than another. The thing which at once suggests itself is the sampling variance of the different estimators, since an estimator with a small sampling variance will be less likely to differ from the true value by a large amount than an estimator whose sampling variance is large." (Michael J Moroney, "Facts from Figures", 1951)

"A piece of self-deception - often dear to the heart of apprentice scientists - is the drawing of a 'smooth curve' (how attractive it sounds!) through a set of points which have about as much trend as the currants in plum duff. Once this is done, the mind, looking for order amidst chaos, follows the Jack-o'-lantern line with scant attention to the protesting shouts of the actual points. Nor, let it be whispered, is it unknown for people who should know better to rub off the offending points and publish the trend line which their foolish imagination has introduced on the flimsiest of evidence. Allied to this sin is that of overconfident extrapolation, i.e. extending the graph by guesswork beyond the range of factual information. Whenever extrapolation is attempted it should be carefully distinguished from the rest of the graph, e.g. by showing the extrapolation as a dotted line in contrast to the full line of the rest of the graph. [...] Extrapolation always calls for justification, sooner or later. Until this justification is forthcoming, it remains a provisional estimate, based on guesswork." (Michael J Moroney, "Facts from Figures", 1951)

"Data should be collected with a clear purpose in mind. Not only a clear purpose, but a clear idea as to the precise way in which they will be analysed so as to yield the desired information." (Michael J Moroney, "Facts from Figures", 1951)

"For the most part, Statistics is a method of investigation that is used when other methods are of no avail; it is often a last resort and a forlorn hope. A statistical analysis, properly conducted, is a delicate dissection of uncertainties, a surgery of suppositions. The surgeon must guard carefully against false incisions with his scalpel. Very often he has to sew up the patient as inoperable. The public knows too little about the statistician as a conscientious and skilled servant of true science." (Michael J Moroney, "Facts from Figures", 1951)

"It is really questionable - though bordering on heresy to put the question - whether we would be any the worse off if the whole bag of tricks were scrapped. So many of these index numbers are so ancient and so out of date, so out of touch with reality, so completely devoid of practical value when they have been computed, that their regular calculation must be regarded as a widespread compulsion neurosis. Only lunatics and public servants with no other choice go on doing silly things and liking it." (Michael J Moroney, "Facts from Figures", 1951)

"It pays to keep wide awake in studying any graph. The thing looks so simple, so frank, and so appealing that the careless are easily fooled. [...] Data and formulae should be given along with the graph, so that the interested reader may look at the details if he wishes." (Michael J Moroney, "Facts from Figures", 1951)

"It will, of course, happen but rarely that the proportions will be identical, even if no real association exists. Evidently, therefore, we need a significance test to reassure ourselves that the observed difference of proportion is greater than could reasonably be attributed to chance. The significance test will test the reality of the association, without telling us anything about the intensity of association. It will be apparent that we need two distinct things: (a) a test of significance, to be used on the data first of all, and (b) some measure of the intensity of the association, which we shall only be justified in using if the significance test confirms that the association is real." (Michael J Moroney, "Facts from Figures", 1951)

"Some distributions [...] are symmetrical about their central value. Other distributions have marked asymmetry and are said to be skew. Skew distributions are divided into two types. If the 'tail' of the distribution reaches out into the larger values of the variate, the distribution is said to show positive skewness; if the tail extends towards the smaller values of the variate, the distribution is called negatively skew." (Michael J Moroney, "Facts from Figures", 1951)

"The economists, of course, have great fun - and show remarkable skill - in inventing more refined index numbers. Sometimes they use geometric averages instead of arithmetic averages (the advantage here being that the geometric average is less upset by extreme oscillations in individual items), sometimes they use the harmonic average. But these are all refinements of the basic idea of the index number [...]" (Michael J Moroney, "Facts from Figures", 1951)

"The mode would form a very poor basis for any further calculations of an arithmetical nature, for it has deliberately excluded arithmetical precision in the interests of presenting a typical result. The arithmetic average, on the other hand, excellent as it is for numerical purposes, has sacrificed its desire to be typical in favour of numerical accuracy. In such a case it is often desirable to quote both measures of central tendency." (Michael J Moroney, "Facts from Figures", 1951)

"The statistician’s job is to draw general conclusions from fragmentary data. Too often the data supplied to him for analysis are not only fragmentary but positively incoherent, so that he can do next to nothing with them. Even the most kindly statistician swears heartily under his breath whenever this happens". (Michael J Moroney, "Facts from Figures", 1951)

"Undoubtedly one of the most elegant, powerful, and useful techniques in modern statistical method is that of the Analysis of Variation and Co-variation by which the total variation in a set of data may be reduced to components associated with possible sources of variability whose relative importance we wish to assess. The precise form which any given analysis will take is intimately connected with the structure of the investigation from which the data are obtained. A simple structure will lead to a simple analysis; a complex structure to a complex analysis." (Michael J Moroney, "Facts from Figures", 1951)

"When the mathematician speaks of the existence of a 'functional relation' between two variable quantities, he means that they are connected by a simple 'formula that is to say, if we are told the value of one of the variable quantities we can find the value of the second quantity by substituting in the formula which tells us how they are related. [...] The thing to be clear about before we proceed further is that a functional relationship in mathematics means an exact and predictable relationship, with no ifs or buts about lt. It is useful in practice so long as the ifs and buts are only tiny voices which even the most ardent protagonist of proportional representation can ignore with a clear conscience." (Michael J Moroney, "Facts from Figures", 1951)

🖍️David Spiegelhalter - Collected Quotes

"A classification tree is perhaps the simplest form of algorithm, since it consists of a series of yes/no questions, the answer to each deciding the next question to be asked, until a conclusion is reached." (David Spiegelhalter, "The Art of Statistics: Learning from Data", 2019)

"Bootstrapping provides an intuitive, computer-intensive way of assessing the uncertainty in our estimates, without making strong assumptions and without using probability theory. But the technique is not feasible when it comes to, say, working out the margins of error on unemployment surveys of 100,000 people. Although bootstrapping is a simple, brilliant and extraordinarily effective idea, it is just too clumsy to bootstrap such large quantities of data, especially when a convenient theory exists that can generate formulae for the width of uncertainty intervals." (David Spiegelhalter, "The Art of Statistics: Learning from Data", 2019)

"But [bootstrap-based] simulations are clumsy and time-consuming, especially with large data sets, and in more complex circumstances it is not straightforward to work out what should be simulated. In contrast, formulae derived from probability theory provide both insight and convenience, and always lead to the same answer since they don’t depend on a particular simulation. But the flip side is that this theory relies on assumptions, and we should be careful not to be deluded by the impressive algebra into accepting unjustified conclusions." (David Spiegelhalter, "The Art of Statistics: Learning from Data", 2019)

"[...] data often has some errors, outliers and other strange values, but these do not necessarily need to be individually identified and excluded. It also points to the benefits of using summary measures that are not unduly affected by odd observations [...] are known as robust measures, and include the median and the inter-quartile range." (David Spiegelhalter, "The Art of Statistics: Learning from Data", 2019)

"Even in an era of open data, data science and data journalism, we still need basic statistical principles in order not to be misled by apparent patterns in the numbers." (David Spiegelhalter, "The Art of Statistics: Learning from Data", 2019)

"[...] in the statistical world, what we see and measure around us can be considered as the sum of a systematic mathematical idealized form plus some random contribution that cannot yet be explained. This is the classic idea of the signal and the noise." (David Spiegelhalter, "The Art of Statistics: Learning from Data", 2019)

"It is convenient to use a single number to summarize a steadily increasing or decreasing relationship between the pairs of numbers shown on a scatter-plot. This is generally chosen to be the Pearson correlation coefficient [...]. A Pearson correlation runs between −1 and 1, and expresses how close to a straight line the dots or data-points fall. A correlation of 1 occurs if all the points lie on a straight line going upwards, while a correlation of −1 occurs if all the points lie on a straight line going downwards. A correlation near 0 can come from a random scatter of points, or any other pattern in which there is no systematic trend upwards or downwards [...]." (David Spiegelhalter, "The Art of Statistics: Learning from Data", 2019)

"It is not enough to give a single summary for a distribution - we need to have an idea of the spread, sometimes known as the variability. [...] The range is a natural choice, but is clearly very sensitive to extreme values [...] In contrast the inter-quartile range (IQR) is unaffected by extremes. This is the distance between the 25th and 75th percentiles of the data and so contains the ‘central half’ of the numbers [...] Finally the standard deviation is a widely used measure of spread. It is the most technically complex measure, but is only really appropriate for well-behaved symmetric data since it is also unduly influenced by outlying values." (David Spiegelhalter, "The Art of Statistics: Learning from Data", 2019)

"[...] just because we act, and something changes, it doesn’t mean we were responsible for the result. Humans seem to find this simple truth difficult to grasp - we are always keen to construct an explanatory narrative, and even keener if we are at its centre. Of course sometimes this interpretation is true - if you flick a switch, and the light comes on, then you are usually responsible. But sometimes your actions are clearly not responsible for an outcome: if you don’t take an umbrella, and it rains, it is not your fault (although it may feel that way). But the consequences of many of our actions are less clear-cut. [...] We have a strong psychological tendency to attribute change to intervention, and this makes before-and-after comparisons treacherous." (David Spiegelhalter, "The Art of Statistics: Learning from Data", 2019)

"Mean-averages can be highly misleading when the raw data do not form a symmetric pattern around a central value but instead are skewed towards one side [...], typically with a large group of standard cases but with a tail of a few either very high (for example, income) or low (for example, legs) values." (David Spiegelhalter, "The Art of Statistics: Learning from Data", 2019)

"Statistical models have two main components. First, a mathematical formula that expresses a deterministic, predictable component, for example the fitted straight line that enables us to make a prediction [...]. But the deterministic part of a model is not going to be a perfect representation of the observed world [...] and the difference between what the model predicts, and what actually happens, is the second component of a model and is known as the residual error - although it is important to remember that in statistical modelling, ‘error’ does not refer to a mistake, but the inevitable inability of a model to exactly represent what we observe." (David Spiegelhalter, "The Art of Statistics: Learning from Data", 2019)

"[...] the Central Limit Theorem [...] says that the distribution of sample means tends towards the form of a normal distribution with increasing sample size, almost regardless of the shape of the original data distribution." (David Spiegelhalter, "The Art of Statistics: Learning from Data", 2019)

"The first rule of communication is to shut up and listen, so that you can get to know about the audience for your communication, whether it might be politicians, professionals or the general public. We have to understand their inevitable limitations and any misunderstandings, and fight the temptation to be too sophisticated and clever, or put in too much detail." (David Spiegelhalter, "The Art of Statistics: Learning from Data", 2019)

"The second rule of communication is to know what you want to achieve. Hopefully the aim is to encourage open debate, and informed decision-making. But there seems no harm in repeating yet again that numbers do not speak for themselves; the context, language and graphic design all contribute to the way the communication is received. We have to acknowledge we are telling a story, and it is inevitable that people will make comparisons and judgements, no matter how much we only want to inform and not persuade. All we can do is try to pre-empt inappropriate gut reactions by design or warning." (David Spiegelhalter, "The Art of Statistics: Learning from Data", 2019)

"There is no ‘correct’ way to display sets of numbers: each of the plots we have used has some advantages: strip-charts show individual points, box-and-whisker plots are convenient for rapid visual summaries, and histograms give a good feel for the underlying shape of the data distribution." (David Spiegelhalter, "The Art of Statistics: Learning from Data", 2019)

"This common view of statistics as a basic ‘bag of tools’ is now facing major challenges. First, we are in an age of data science, in which large and complex data sets are collected from routine sources such as traffic monitors, social media posts and internet purchases, and used as a basis for technological innovations such as optimizing travel routes, targeted advertising or purchase recommendation systems [...]. Statistical training is increasingly seen as just one necessary component of being a data scientist, together with skills in data management, programming and algorithm development, as well as proper knowledge of the subject matter." (David Spiegelhalter, "The Art of Statistics: Learning from Data", 2019)

"Unfortunately, when an ‘average’ is reported in the media, it is often unclear whether this should be interpreted as the mean or median." (David Spiegelhalter, "The Art of Statistics: Learning from Data", 2019)
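
A small sketch of why the distinction matters: on skewed data the mean and the median can tell quite different stories. The income figures below are invented for illustration only.

```python
import statistics

# Hypothetical yearly incomes in thousands; one very high earner skews the data.
incomes = [20, 22, 25, 27, 30, 250]

mean_income = statistics.mean(incomes)      # pulled upwards by the outlier
median_income = statistics.median(incomes)  # middle of the sorted values

print(f"mean = {mean_income:.1f}, median = {median_income}")
```

Here the mean is more than twice the median, so a report of "average income" would leave a very different impression depending on which of the two was meant.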

"When it comes to presenting categorical data, pie charts allow an impression of the size of each category relative to the whole pie, but are often visually confusing, especially if they attempt to show too many categories in the same chart, or use a three-dimensional representation that distorts areas. [...] Multiple pie charts are generally not a good idea, as comparisons are hampered by the difficulty in assessing the relative sizes of areas of different shapes. Comparisons are better based on height or length alone in a bar chart." (David Spiegelhalter, "The Art of Statistics: Learning from Data", 2019)

"When we have all the data, it is straightforward to produce statistics that describe what has been measured. But when we want to use the data to draw broader conclusions about what is going on around us, then the quality of the data becomes paramount, and we need to be alert to the kind of systematic biases that can jeopardize the reliability of any claims." (David Spiegelhalter, "The Art of Statistics: Learning from Data", 2019)

"With the growing availability of massive data sets and user-friendly analysis software, it might be thought that there is less need for training in statistical methods. This would be naïve in the extreme. Far from freeing us from the need for statistical skills, bigger data and the rise in the number and complexity of scientific studies makes it even more difficult to draw appropriate conclusions. More data means that we need to be even more aware of what the evidence is actually worth." (David Spiegelhalter, "The Art of Statistics: Learning from Data", 2019)

"We over-fit when we go too far in adapting to local circumstances, in a worthy but misguided effort to be ‘unbiased’ and take into account all the available information. Usually we would applaud the aim of being unbiased, but this refinement means we have less data to work on, and so the reliability goes down. Over-fitting therefore leads to less bias but at a cost of more uncertainty or variation in the estimates, which is why protection against over-fitting is sometimes known as the bias/variance trade-off." (David Spiegelhalter, "The Art of Statistics: Learning from Data", 2019)

22 April 2006

🖍️Judea Pearl - Collected Quotes

"Despite the prevailing use of graphs as metaphors for communicating and reasoning about dependencies, the task of capturing informational dependencies by graphs is not at all trivial." (Judea Pearl, "Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference", 1988)

"Probabilities are summaries of knowledge that is left behind when information is transferred to a higher level of abstraction." (Judea Pearl, "Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference", 1988)

"When loops are present, the network is no longer singly connected and local propagation schemes will invariably run into trouble. […] If we ignore the existence of loops and permit the nodes to continue communicating with each other as if the network were singly connected, messages may circulate indefinitely around the loops and the process may not converge to a stable equilibrium. […] Such oscillations do not normally occur in probabilistic networks […] which tend to bring all messages to some stable equilibrium as time goes on. However, this asymptotic equilibrium is not coherent, in the sense that it does not represent the posterior probabilities of all nodes of the network." (Judea Pearl, "Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference", 1988)

"Traditional statistics is strong in devising ways of describing data and inferring distributional parameters from sample. Causal inference requires two additional ingredients: a science-friendly language for articulating causal knowledge, and a mathematical machinery for processing that knowledge, combining it with data and drawing new causal conclusions about a phenomenon." (Judea Pearl, "Causal inference in statistics: An overview", Statistics Surveys 3, 2009)

"Again, classical statistics only summarizes data, so it does not provide even a language for asking [a counterfactual] question. Causal inference provides a notation and, more importantly, offers a solution. As with predicting the effect of interventions [...], in many cases we can emulate human retrospective thinking with an algorithm that takes what we know about the observed world and produces an answer about the counterfactual world." (Judea Pearl & Dana Mackenzie, "The Book of Why: The new science of cause and effect", 2018)

"Bayesian networks inhabit a world where all questions are reducible to probabilities, or (in the terminology of this chapter) degrees of association between variables; they could not ascend to the second or third rungs of the Ladder of Causation. Fortunately, they required only two slight twists to climb to the top." (Judea Pearl & Dana Mackenzie, "The Book of Why: The new science of cause and effect", 2018)

"Bayesian statistics give us an objective way of combining the observed evidence with our prior knowledge (or subjective belief) to obtain a revised belief and hence a revised prediction of the outcome of the coin’s next toss. [...] This is perhaps the most important role of Bayes’s rule in statistics: we can estimate the conditional probability directly in one direction, for which our judgment is more reliable, and use mathematics to derive the conditional probability in the other direction, for which our judgment is rather hazy. The equation also plays this role in Bayesian networks; we tell the computer the forward probabilities, and the computer tells us the inverse probabilities when needed." (Judea Pearl & Dana Mackenzie, "The Book of Why: The new science of cause and effect", 2018)
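
The "forward" and "inverse" probabilities mentioned above can be made concrete with a small worked example. This is a minimal sketch of Bayes's rule for a binary hypothesis; the function name, the smoke/fire reading of the numbers, and the probabilities themselves are invented for illustration.

```python
def inverse_probability(prior, p_evidence_given_h, p_evidence_given_not_h):
    """Return P(H | E) from P(H), P(E | H) and P(E | not H) via Bayes's rule."""
    # Total probability of seeing the evidence, with or without the hypothesis.
    evidence = p_evidence_given_h * prior + p_evidence_given_not_h * (1 - prior)
    return p_evidence_given_h * prior / evidence


# Assumed forward probabilities: fire is rare (1%), fire usually produces
# smoke (90%), and smoke sometimes appears without fire (5%).
p_fire_given_smoke = inverse_probability(
    prior=0.01,
    p_evidence_given_h=0.9,
    p_evidence_given_not_h=0.05,
)

print(round(p_fire_given_smoke, 3))
```

Under these assumed numbers the inverse probability comes out at roughly 0.15: even though fire almost always produces smoke, smoke alone leaves fire fairly unlikely because the prior is so small.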

"Deep learning has instead given us machines with truly impressive abilities but no intelligence. The difference is profound and lies in the absence of a model of reality." (Judea Pearl & Dana Mackenzie, "The Book of Why: The new science of cause and effect", 2018)

"[…] deep learning has succeeded primarily by showing that certain questions or tasks we thought were difficult are in fact not. It has not addressed the truly difficult questions that continue to prevent us from achieving humanlike AI." (Judea Pearl & Dana Mackenzie, "The Book of Why: The new science of cause and effect", 2018)

"Some scientists (e.g., econometricians) like to work with mathematical equations; others (e.g., hard-core statisticians) prefer a list of assumptions that ostensibly summarizes the structure of the diagram. Regardless of language, the model should depict, however qualitatively, the process that generates the data - in other words, the cause-effect forces that operate in the environment and shape the data generated." (Judea Pearl & Dana Mackenzie, "The Book of Why: The new science of cause and effect", 2018)

"The calculus of causation consists of two languages: causal diagrams, to express what we know, and a symbolic language, resembling algebra, to express what we want to know. The causal diagrams are simply dot-and-arrow pictures that summarize our existing scientific knowledge. The dots represent quantities of interest, called 'variables', and the arrows represent known or suspected causal relationships between those variables - namely, which variable 'listens' to which others." (Judea Pearl & Dana Mackenzie, "The Book of Why: The new science of cause and effect", 2018)

"The main differences between Bayesian networks and causal diagrams lie in how they are constructed and the uses to which they are put. A Bayesian network is literally nothing more than a compact representation of a huge probability table. The arrows mean only that the probabilities of child nodes are related to the values of parent nodes by a certain formula (the conditional probability tables) and that this relation is sufficient. That is, knowing additional ancestors of the child will not change the formula. Likewise, a missing arrow between any two nodes means that they are independent, once we know the values of their parents. [...] If, however, the same diagram has been constructed as a causal diagram, then both the thinking that goes into the construction and the interpretation of the final diagram change." (Judea Pearl & Dana Mackenzie, "The Book of Why: The new science of cause and effect", 2018)

"The transparency of Bayesian networks distinguishes them from most other approaches to machine learning, which tend to produce inscrutable 'black boxes'. In a Bayesian network you can follow every step and understand how and why each piece of evidence changed the network’s beliefs." (Judea Pearl & Dana Mackenzie, "The Book of Why: The new science of cause and effect", 2018)

"When the scientific question of interest involves retrospective thinking, we call on another type of expression unique to causal reasoning called a counterfactual. […] Counterfactuals are the building blocks of moral behavior as well as scientific thought. The ability to reflect on one’s past actions and envision alternative scenarios is the basis of free will and social responsibility. The algorithmization of counterfactuals invites thinking machines to benefit from this ability and participate in this (until now) uniquely human way of thinking about the world." (Judea Pearl & Dana Mackenzie, "The Book of Why: The new science of cause and effect", 2018)

"With Bayesian networks, we had taught machines to think in shades of gray, and this was an important step toward humanlike thinking. But we still couldn’t teach machines to understand causes and effects. [...] By design, in a Bayesian network, information flows in both directions, causal and diagnostic: smoke increases the likelihood of fire, and fire increases the likelihood of smoke. In fact, a Bayesian network can’t even tell what the 'causal direction' is." (Judea Pearl & Dana Mackenzie, "The Book of Why: The new science of cause and effect", 2018)

🖍️Foster Provost - Collected Quotes

"Data mining is a craft. As with many crafts, there is a well-defined process that can help to increase the likelihood of a successful result. This process is a crucial conceptual tool for thinking about data science projects. [...] data mining is an exploratory undertaking closer to research and development than it is to engineering." (Foster Provost & Tom Fawcett, "Data Science for Business", 2013)

"Formulating data mining solutions and evaluating the results involves thinking carefully about the context in which they will be used." (Foster Provost & Tom Fawcett, "Data Science for Business", 2013)

"[…] framing a business problem in terms of expected value can allow us to systematically decompose it into data mining tasks." (Foster Provost & Tom Fawcett, "Data Science for Business", 2013)

"If you look too hard at a set of data, you will find something - but it might not generalize beyond the data you’re looking at. This is referred to as overfitting a dataset. Data mining techniques can be very powerful, and the need to detect and avoid overfitting is one of the most important concepts to grasp when applying data mining to real problems. The concept of overfitting and its avoidance permeates data science processes, algorithms, and evaluation methods." (Foster Provost & Tom Fawcett, "Data Science for Business", 2013)

"In analytics, it’s more important for individuals to be able to formulate problems well, to prototype solutions quickly, to make reasonable assumptions in the face of ill-structured problems, to design experiments that represent good investments, and to analyze results." (Foster Provost & Tom Fawcett, "Data Science for Business", 2013)

"In common usage, prediction means to forecast a future event. In data science, prediction more generally means to estimate an unknown value. This value could be something in the future (in common usage, true prediction), but it could also be something in the present or in the past. Indeed, since data mining usually deals with historical data, models very often are built and tested using events from the past." (Foster Provost & Tom Fawcett, "Data Science for Business", 2013)

"In data science, a predictive model is a formula for estimating the unknown value of interest: the target. The formula could be mathematical, or it could be a logical statement such as a rule. Often it is a hybrid of the two." (Foster Provost & Tom Fawcett, "Data Science for Business", 2013)

"There is another important distinction pertaining to mining data: the difference between (1) mining the data to find patterns and build models, and (2) using the results of data mining. Students often confuse these two processes when studying data science, and managers sometimes confuse them when discussing business analytics. The use of data mining results should influence and inform the data mining process itself, but the two should be kept distinct." (Foster Provost & Tom Fawcett, "Data Science for Business", 2013)

"There is convincing evidence that data-driven decision-making and big data technologies substantially improve business performance. Data science supports data-driven decision-making - and sometimes conducts such decision-making automatically - and depends upon technologies for 'big data' storage and engineering, but its principles are separate." (Foster Provost & Tom Fawcett, "Data Science for Business", 2013)

"Unfortunately, creating an objective function that matches the true goal of the data mining is usually impossible, so data scientists often choose based on faith and experience." (Foster Provost & Tom Fawcett, "Data Science for Business", 2013)

🖍️Joseph P Bigus - Collected Quotes

"Data mining is the efficient discovery of valuable, nonobvious information from a large collection of data. […] Data mining centers on the automated discovery of new facts and relationships in data. The idea is that the raw material is the business data, and the data mining algorithm is the excavator, sifting through the vast quantities of raw data looking for the valuable nuggets of business information." (Joseph P Bigus, "Data Mining with Neural Networks: Solving business problems from application development to decision support", 1996)

"Like modeling, which involves making a static one-time prediction based on current information, time-series prediction involves looking at current information and predicting what is going to happen. However, with time-series predictions, we typically are looking at what has happened for some period back through time and predicting for some point in the future. The temporal or time element makes time-series prediction both more difficult and more rewarding. Someone who can predict the future based on what has occurred in the past can clearly have tremendous advantages over someone who cannot." (Joseph P Bigus, "Data Mining with Neural Networks: Solving business problems from application development to decision support", 1996)

"Many of the basic functions performed by neural networks are mirrored by human abilities. These include making distinctions between items (classification), dividing similar things into groups (clustering), associating two or more things (associative memory), learning to predict outcomes based on examples (modeling), being able to predict into the future (time-series forecasting), and finally juggling multiple goals and coming up with a good-enough solution (constraint satisfaction)." (Joseph P Bigus, "Data Mining with Neural Networks: Solving business problems from application development to decision support", 1996)

"More than just a new computing architecture, neural networks offer a completely different paradigm for solving problems with computers. […] The process of learning in neural networks is to use feedback to adjust internal connections, which in turn affect the output or answer produced. The neural processing element combines all of the inputs to it and produces an output, which is essentially a measure of the match between the input pattern and its connection weights. When hundreds of these neural processors are combined, we have the ability to solve difficult problems such as credit scoring." (Joseph P Bigus, "Data Mining with Neural Networks: Solving business problems from application development to decision support", 1996)

"Neural networks are a computing model grounded on the ability to recognize patterns in data. As a consequence, they have many applications to data mining and analysis." (Joseph P Bigus, "Data Mining with Neural Networks: Solving business problems from application development to decision support", 1996)

"Neural networks are a computing technology whose fundamental purpose is to recognize patterns in data. Based on a computing model similar to the underlying structure of the human brain, neural networks share the brain's ability to learn or adapt in response to external inputs. When exposed to a stream of training data, neural networks can discover previously unknown relationships and learn complex nonlinear mappings in the data. Neural networks provide some fundamental, new capabilities for processing business data. However, tapping these new neural network data mining functions requires a completely different application development process from traditional programming." (Joseph P Bigus, "Data Mining with Neural Networks: Solving business problems from application development to decision support", 1996)

"People build practical, useful mental models all of the time. Seldom do they resort to writing a complex set of mathematical equations or use other formal methods. Rather, most people build models relating inputs and outputs based on the examples they have seen in their everyday life. These models can be rather trivial, such as knowing that when there are dark clouds in the sky and the wind starts picking up that a storm is probably on the way. Or they can be more complex, like a stock trader who watches plots of leading economic indicators to know when to buy or sell. The ability to make accurate predictions from complex examples involving many variables is a great asset." (Joseph P Bigus, "Data Mining with Neural Networks: Solving business problems from application development to decision support", 1996)

"Unfortunately, just collecting the data in one place and making it easily available isn’t enough. When operational data from transactions is loaded into the data warehouse, it often contains missing or inaccurate data. How good or bad the data is a function of the amount of input checking done in the application that generates the transaction. Unfortunately, many deployed applications are less than stellar when it comes to validating the inputs. To overcome this problem, the operational data must go through a 'cleansing' process, which takes care of missing or out-of-range values. If this cleansing step is not done before the data is loaded into the data warehouse, it will have to be performed repeatedly whenever that data is used in a data mining operation." (Joseph P Bigus, "Data Mining with Neural Networks: Solving business problems from application development to decision support", 1996)

"When training a neural network, it is important to understand when to stop. […] If the same training patterns or examples are given to the neural network over and over, and the weights are adjusted to match the desired outputs, we are essentially telling the network to memorize the patterns, rather than to extract the essence of the relationships. What happens is that the neural network performs extremely well on the training data. However, when it is presented with patterns it hasn't seen before, it cannot generalize and does not perform well. What is the problem? It is called overtraining." (Joseph P Bigus, "Data Mining with Neural Networks: Solving business problems from application development to decision support", 1996)

"While classification is important, it can certainly be overdone. Making too fine a distinction between things can be as serious a problem as not being able to decide at all. Because we have limited storage capacity in our brain (we still haven't figured out how to add an extender card), it is important for us to be able to cluster similar items or things together. Not only is clustering useful from an efficiency standpoint, but the ability to group like things together (called chunking by artificial intelligence practitioners) is a very important reasoning tool. It is through clustering that we can think in terms of higher abstractions, solving broader problems by getting above all of the nitty-gritty details." (Joseph P Bigus, "Data Mining with Neural Networks: Solving business problems from application development to decision support", 1996)

🖍️Richard E Nisbett - Collected Quotes

"Multiple regression, like all statistical techniques based on correlation, has a severe limitation due to the fact that correlation doesn't prove causation. And no amount of measuring of 'control' variables can untangle the web of causality. What nature hath joined together, multiple regression cannot put asunder." (Richard Nisbett, "2014: What scientific idea is ready for retirement?", 2013)

"A basic problem with MRA is that it typically assumes that the independent variables can be regarded as building blocks, with each variable taken by itself being logically independent of all the others. This is usually not the case, at least for behavioral data. […] Just as correlation doesn’t prove causation, absence of correlation fails to prove absence of causation. False-negative findings can occur using MRA just as false-positive findings do - because of the hidden web of causation that we’ve failed to identify." (Richard E Nisbett, "Mindware: Tools for Smart Thinking", 2015)

"Deductive and inductive reasoning schemas essentially regulate inferences. They tell us what kinds of inferences are valid and what kinds are invalid. […] Dialectical reasoning isn’t formal or deductive and usually doesn’t deal in abstractions. It’s concerned with reaching true and useful conclusions rather than valid conclusions. In fact, conclusions based on dialectical reasoning can actually be opposed to those based on formal logic." (Richard E Nisbett, "Mindware: Tools for Smart Thinking", 2015)

"Multiple regression analysis (MRA) examines the association between an independent variable and a dependent variable, controlling for the association between the independent variable and other variables, as well as the association of those other variables with the dependent variable. The method can tell us about causality only if all possible causal influences have been identified and measured reliably and validly. In practice, these conditions are rarely met." (Richard E Nisbett, "Mindware: Tools for Smart Thinking", 2015)

"One technique employing correlational analysis is multiple regression analysis (MRA), in which a number of independent variables are correlated simultaneously (or sometimes sequentially, but we won’t talk about that variant of MRA) with some dependent variable. The predictor variable of interest is examined along with other independent variables that are referred to as control variables. The goal is to show that variable A influences variable B 'net of' the effects of all the other variables. That is to say, the relationship holds even when the effects of the control variables on the dependent variable are taken into account." (Richard E Nisbett, "Mindware: Tools for Smart Thinking", 2015)

"Science is often described as a 'seamless web'. What’s meant by that is that the facts, methods, theories, and rules of inference discovered in one field can be helpful for other fields. And philosophy and logic can affect reasoning in literally every field of science." (Richard E Nisbett, "Mindware: Tools for Smart Thinking", 2015)

"The closer that sample-selection procedures approach the gold standard of random selection - for which the definition is that every individual in the population has an equal chance of appearing in the sample - the more we should trust them. If we don’t know whether a sample is random, any statistical measure we conduct may be biased in some unknown way." (Richard E Nisbett, "Mindware: Tools for Smart Thinking", 2015)

"The correlational technique known as multiple regression is used frequently in medical and social science research. This technique essentially correlates many independent (or predictor) variables simultaneously with a given dependent variable (outcome or output). It asks, 'Net of the effects of all the other variables, what is the effect of variable A on the dependent variable?' Despite its popularity, the technique is inherently weak and often yields misleading results. The problem is due to self-selection. If we don’t assign cases to a particular treatment, the cases may differ in any number of ways that could be causing them to differ along some dimension related to the dependent variable. We can know that the answer given by a multiple regression analysis is wrong because randomized control experiments, frequently referred to as the gold standard of research techniques, may give answers that are quite different from those obtained by multiple regression analysis." (Richard E Nisbett, "Mindware: Tools for Smart Thinking", 2015)

"The fundamental problem with MRA, as with all correlational methods, is self-selection. The investigator doesn’t choose the value for the independent variable for each subject (or case). This means that any number of variables correlated with the independent variable of interest have been dragged along with it. In most cases, we will fail to identify all these variables. In the case of behavioral research, it’s normally certain that we can’t be confident that we’ve identified all the plausibly relevant variables." (Richard E Nisbett, "Mindware: Tools for Smart Thinking", 2015)

"The theory behind multiple regression analysis is that if you control for everything that is related to the independent variable and the dependent variable by pulling their correlations out of the mix, you can get at the true causal relation between the predictor variable and the outcome variable. That’s the theory. In practice, many things prevent this ideal case from being the norm." (Richard E Nisbett, "Mindware: Tools for Smart Thinking", 2015)

"We are superb causal-hypothesis generators. Given an effect, we are rarely at a loss for an explanation. Seeing a difference in observations over time, we readily come up with a causal interpretation. Much of the time, no causality at all is going on - just random variation. The compulsion to explain is particularly strong when we habitually see that one event typically occurs in conjunction with another event. Seeing such a correlation almost automatically provokes a causal explanation. It’s tremendously useful to be on our toes looking for causal relationships that explain our world. But there are two problems: (1) The explanations come too easily. If we recognized how facile our causal hypotheses were, we’d place less confidence in them. (2) Much of the time, no causal interpretation at all is appropriate and wouldn’t even be made if we had a better understanding of randomness." (Richard E Nisbett, "Mindware: Tools for Smart Thinking", 2015)

"We don’t recognize how easy it is to generate hypotheses about the world. If we did, we’d generate fewer of them, or at least hold them more tentatively. We sprout causal theories in abundance when we learn of a correlation, and we readily find causal explanations for the failure of the world to confirm our hypotheses. We don’t realize how easy it is for us to explain away evidence that would seem on the surface to contradict our hypotheses. And we fail to generate tests of a hypothesis that could falsify the hypothesis if in fact the hypothesis is wrong. This is one type of confirmation bias." (Richard E Nisbett, "Mindware: Tools for Smart Thinking", 2015)

🖍️Mike Barlow - Collected Quotes

"Applying data science principles to solve social problems and improve the lives of ordinary people seems like a logical idea, but it is by no means a given. Using data science to elevate the human condition won’t happen by accident; groups of people will have to envision it, develop the routine processes and underlying infrastructures required to make it practical, and then commit the time and energy necessary to make it all work." (Mike Barlow, "Learning to Love Data Science", 2015)

"Hollywood loves the myth of a lone scientist working late nights in a dark laboratory on a mysterious island, but the truth is far less melodramatic. Real science is almost always a team sport. Groups of people, collaborating with other groups of people, are the norm in science - and data science is no exception to the rule. When large groups of people work together for extended periods of time, a culture begins to emerge." (Mike Barlow, "Learning to Love Data Science", 2015)

"In other words, real-time denotes the ability to process data as it arrives, rather than storing the data and retrieving it at some point in the future. That’s the primary significance of the term - real-time means that you’re processing data in the present, rather than in the future." (Mike Barlow, "Learning to Love Data Science", 2015)

"The ability to manage large and complex sets of data hasn’t diminished the appetite for more size and greater speed. Every day it seems that a new technique or application is introduced that pushes the edges of the speed-size envelope even further." (Mike Barlow, "Learning to Love Data Science", 2015)

"The cultural component of big data is neither trivial nor free. It is not a list of 'feel-good' or 'fluffy' attributes that are posted on a corporate website. Culture (that is, people and processes) is integral and critical to the success of any new technology deployment or implementation." (Mike Barlow, "Learning to Love Data Science", 2015)

"The whole point of machine learning is automating the learning process itself, enabling the computer program to get better as it consumes more data, without requiring the continual intervention of a programmer." (Mike Barlow, "Learning to Love Data Science", 2015)

🖍️Peter C Bruce - Collected Quotes

"A popular misconception holds that the era of Big Data means the end of a need for sampling. In fact, the proliferation of data of varying quality and relevance reinforces the need for sampling as a tool to work efficiently with a variety of data, and minimize bias. Even in a Big Data project, predictive models are typically developed and piloted with samples." (Peter C Bruce & Andrew G Bruce, "Statistics for Data Scientists: 50 Essential Concepts", 2016)

"Do not confuse standard deviation (which measures the variability of individual data points) with standard error (which measures the variability of a sample metric)." (Peter C Bruce & Andrew G Bruce, "Statistics for Data Scientists: 50 Essential Concepts", 2016)
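
To make the distinction tangible, here is a minimal sketch that computes both quantities for the same sample. It assumes the "sample metric" is the sample mean, whose standard error is commonly estimated as s / sqrt(n); the helper name and the data are made up for the example.

```python
import math
import statistics


def standard_error_of_mean(sample):
    """Estimate the standard error of the sample mean as s / sqrt(n)."""
    return statistics.stdev(sample) / math.sqrt(len(sample))


# Hypothetical measurements.
sample = [2, 4, 4, 4, 5, 5, 7, 9]

sd = statistics.stdev(sample)        # spread of the individual data points
se = standard_error_of_mean(sample)  # spread expected of the mean across samples

print(f"standard deviation = {sd:.3f}, standard error = {se:.3f}")
```

The standard error is smaller than the standard deviation by a factor of sqrt(n), and it keeps shrinking as the sample grows, whereas the standard deviation settles around the variability of the underlying data.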

"In statistical theory, location and variability are referred to as the first and second moments of a distribution. The third and fourth moments are called skewness and kurtosis. Skewness refers to whether the data is skewed to larger or smaller values and kurtosis indicates the propensity of the data to have extreme values. Generally, metrics are not used to measure skewness and kurtosis; instead, these are discovered through visual displays [...]" (Peter C Bruce & Andrew G Bruce, "Statistics for Data Scientists: 50 Essential Concepts", 2016)

"Machine learning tends to be more focused on developing efficient algorithms that scale to large data in order to optimize the predictive model. Statistics generally pays more attention to the probabilistic theory and underlying structure of the model." (Peter C Bruce & Andrew G Bruce, "Statistics for Data Scientists: 50 Essential Concepts", 2016)

"Many classification and regression algorithms optimize a certain criteria or loss function. For example, logistic regression attempts to minimize the deviance. In the literature, some propose to modify the loss function in order to avoid the problems caused by a rare class. In practice, this is hard to do: classification algorithms can be complex and difficult to modify. Weighting is an easy way to change the loss function, discounting errors for records with low weights in favor of records of higher weights." (Peter C Bruce & Andrew G Bruce, "Statistics for Data Scientists: 50 Essential Concepts", 2016)

"Moreover, data science (and business in general) is not so worried about statistical significance, but more concerned with optimizing overall effort and results." (Peter C Bruce & Andrew G Bruce, "Statistics for Data Scientists: 50 Essential Concepts", 2016)

"Statisticians often use the term estimates for values calculated from the data at hand, to draw a distinction between what we see from the data, and the theoretical true or exact state of affairs. Data scientists and business analysts are more likely to refer to such values as a metric. The difference reflects the approach of statistics versus data science: accounting for uncertainty lies at the heart of the discipline of statistics, whereas concrete business or organizational objectives are the focus of data science. Hence, statisticians estimate, and data scientists measure." (Peter C Bruce & Andrew G Bruce, "Statistics for Data Scientists: 50 Essential Concepts", 2016)

"The bootstrap does not compensate for a small sample size; it does not create new data, nor does it fill in holes in an existing data set. It merely informs us about how lots of additional samples would behave when drawn from a population like our original sample." (Peter C Bruce & Andrew G Bruce, "Statistics for Data Scientists: 50 Essential Concepts", 2016)

"The tension between oversmoothing and overfitting is an instance of the bias-variance tradeoff, an ubiquitous problem in statistical model fitting. Variance refers to the modeling error that occurs because of the choice of training data; that is, if you were to choose a different set of training data, the resulting model would be different. Bias refers to the modeling error that occurs because you have not properly identified the underlying real-world scenario; this error would not disappear if you simply added more training data. When a flexible model is overfit, the variance increases. You can reduce this by using a simpler model, but the bias may increase due to the loss of flexibility in modeling the real underlying situation." (Peter C Bruce & Andrew G Bruce, "Statistics for Data Scientists: 50 Essential Concepts", 2016)

"The variance, the standard deviation, mean absolute deviation, and median absolute deviation from the median are not equivalent estimates, even in the case where the data comes from a normal distribution. In fact, the standard deviation is always greater than the mean absolute deviation, which itself is greater than the median absolute deviation. Sometimes, the median absolute deviation is multiplied by a constant scaling factor (it happens to work out to 1.4826) to put MAD on the same scale as the standard deviation in the case of a normal distribution." (Peter C Bruce & Andrew G Bruce, "Statistics for Data Scientists: 50 Essential Concepts", 2016)

"When analysts and researchers use the term regression by itself, they are typically referring to linear regression; the focus is usually on developing a linear model to explain the relationship between predictor variables and a numeric outcome variable. In its formal statistical sense, regression also includes nonlinear models that yield a functional relationship between predictors and outcome variables. In the machine learning community, the term is also occasionally used loosely to refer to the use of any predictive model that produces a predicted numeric outcome (standing in distinction from classification methods that predict a binary or categorical outcome)." (Peter C Bruce & Andrew G Bruce, "Statistics for Data Scientists: 50 Essential Concepts", 2016)

21 April 2006

🖍️Pedro Domingos - Collected Quotes

"A learner that uses Bayes’ theorem and assumes the effects are independent given the cause is called a Naïve Bayes classifier. That’s because, well, that’s such a naïve assumption." (Pedro Domingos, "The Master Algorithm", 2015)

"An algorithm is not just any set of instructions: they have to be precise and unambiguous enough to be executed by a computer. [...] The computer has to know how to execute the algorithm all the way down to turning specific transistors on and off." (Pedro Domingos, "The Master Algorithm", 2015)

"As so often happens in computer science, we’re willing to sacrifice efficiency for generality." (Pedro Domingos, "The Master Algorithm", 2015)

"Believe it or not, every algorithm, no matter how complex, can be reduced to just these three operations: AND, OR, and NOT." (Pedro Domingos, "The Master Algorithm", 2015)

"Designing an algorithm is not easy. Pitfalls abound, and nothing can be taken for granted. Some of your intuitions will turn out to have been wrong, and you’ll have to find another way. On top of designing the algorithm, you have to write it down in a language computers can understand, like Java or Python (at which point it’s called a program). Then you have to debug it: find every error and fix it until the computer runs your program without screwing up. But once you have a program that does what you want, you can really go to town." (Pedro Domingos, "The Master Algorithm", 2015)

"Dimensionality reduction is essential for coping with big data—like the data coming in through your senses every second. A picture may be worth a thousand words, but it’s also a million times more costly to process and remember. [...] A common complaint about big data is that the more data you have, the easier it is to find spurious patterns in it. This may be true if the data is just a huge set of disconnected entities, but if they’re interrelated, the picture changes." (Pedro Domingos, "The Master Algorithm", 2015)

"Every algorithm has an input and an output: the data goes into the computer, the algorithm does what it will with it, and out comes the result. Machine learning turns this around: in goes the data and the desired result and out comes the algorithm that turns one into the other. Learning algorithms - also known as learners - are algorithms that make other algorithms. With machine learning, computers write their own programs, so we don’t have to." (Pedro Domingos, "The Master Algorithm", 2015)

"In machine learning, knowledge is often in the form of statistical models, because most knowledge is statistical [...] Machine learning is a kind of knowledge pump: we can use it to extract a lot of knowledge from data, but first we have to prime the pump." (Pedro Domingos, "The Master Algorithm", 2015)

"Learning is forgetting the details as much as it is remembering the important parts." (Pedro Domingos, "The Master Algorithm", 2015)

"Machine learning takes many different forms and goes by many different names: pattern recognition, statistical modeling, data mining, knowledge discovery, predictive analytics, data science, adaptive systems, self-organizing systems, and more. Each of these is used by different communities and has different associations. Some have a long half-life, some less so." (Pedro Domingos, "The Master Algorithm", 2015)

"Our beliefs are based on our experience, which gives us a very incomplete picture of the world, and it's easy to jump to false conclusions." (Pedro Domingos, "The Master Algorithm", 2015)

"People often think computers are all about numbers, but they’re not. Computers are all about logic." (Pedro Domingos, "The Master Algorithm", 2015)

"Science’s predictions are more trustworthy, but they are limited to what we can systematically observe and tractably model. Big data and machine learning greatly expand that scope. Some everyday things can be predicted by the unaided mind, from catching a ball to carrying on a conversation. Some things, try as we might, are just unpredictable. For the vast middle ground between the two, there’s machine learning." (Pedro Domingos, "The Master Algorithm", 2015)

"To make progress, every field of science needs to have data commensurate with the complexity of the phenomena it studies. [...] With big data and machine learning, you can understand much more complex phenomena than before. In most fields, scientists have traditionally used only very limited kinds of models, like linear regression, where the curve you fit to the data is always a straight line. Unfortunately, most phenomena in the world are nonlinear. [...] Machine learning opens up a vast new world of nonlinear models." (Pedro Domingos, "The Master Algorithm", 2015)

"Today we routinely learn models with millions of parameters, enough to give each elephant in the world his own distinctive wiggle. It’s even been said that data mining means 'torturing the data until it confesses'." (Pedro Domingos, "The Master Algorithm", 2015)

"Traditionally, the only way to get a computer to do something - from adding two numbers to flying an airplane - was to write down an algorithm explaining how, in painstaking detail. But machine-learning algorithms, also known as learners, are different: they figure it out on their own, by making inferences from data. And the more data they have, the better they get. Now we don’t have to program computers; they program themselves." (Pedro Domingos, "The Master Algorithm", 2015)

"Whoever has the best algorithms and the most data wins. A new type of network effect takes hold: whoever has the most customers accumulates the most data, learns the best models, wins the most new customers, and so on in a virtuous circle (or a vicious one, if you’re the competition)." (Pedro Domingos, "The Master Algorithm", 2015)

🖍️Richard Levins - Collected Quotes

"A mathematical model is neither an hypothesis nor a theory. Unlike the scientific hypothesis, a model is not verifiable directly by experiment. For all models are both true and false. Almost any plausible proposed relation among aspects of nature is likely to be true in the sense that it occurs (although rarely and slightly). Yet all models leave out a lot and are in that sense false, incomplete, inadequate. The validation of a model is not that it is ' 'true" but that it generates good testable hypotheses relevant to important problems. A model may be discarded in favor of a more powerful one, but it usually is simply outgrown when the live issues are not any longer those for which it was designed." (Richard Levins, "The Strategy of Model Building in Population Biology", American Scientist 54(4), 1966)

"For population genetics, a population is specified by the frequencies of genotypes without reference to the age distribution, physiological state as a reflection of past history, or population density. A single population or species is treated at a time, and evolution is usually assumed to occur in a constant environment. Population ecology, on the other hand, recognizes multispecies systems, describes populations in terms of their age distributions, physiological states, and densities. The environment is allowed to vary but the species are treated as genetically homogeneous, so that evolution is ignored." (Richard Levins, "The Strategy of Model Building in Population Biology", American Scientist 54(4), 1966)

"It is of course desirable to work with manageable models which maximize generality, realism, and precision toward the overlapping but not identical goals of understanding, predicting, and modifying nature. But this cannot be done. Therefore, several alternative strategies have evolved: (1) Sacrifice generality to realism and precision. (2) Sacrifice realism to generality and precision. (3) Sacrifice precision to realism and generality." (Richard Levins, "The strategy of model building in population biology", American Scientist Vol. 54 (4), 1966)

"The multiplicity of models is imposed by the contradictory demands of a complex, heterogeneous nature and a mind that can only cope with few variables at a time; by the contradictory desiderata of generality, realism, and precision; by the need to understand and also to control; even by the opposing esthetic standards which emphasize the stark simplicity and power of a general theorem as against the richness and the diversity of living nature. These conflicts are irreconcilable. Therefore, the alternative approaches even of contending schools are part of a larger mixed strategy. But the conflict is about method, not nature, for the individual models, while they are essential for understanding reality, should not be confused with that reality itself." (Richard Levins, "The Strategy of Model Building in Population Biology", American Scientist 54(4), 1966)

"The validation of a model is not that it is 'true' but that it generates good testable hypotheses relevant to important problems." (Richard Levins, "The Strategy of Model Building in Population Biology", American Scientist 54(4), 1966)

"[…] truth is the intersection of independent lies." (Richard Levins, "The Strategy of Model Building in Population Biology", 1966)

"Unlike the theory, models are restricted by technical considerations to a few components at a time, even in systems which are complex. Thus a satisfactory theory is usually a cluster of models. These models are related to each other in several ways : as coordinate alternative models for the same set of phenomena, they jointly produce robust theorems; as complementary models they can cope with different aspects of the same problem and give complementary as well as overlapping results; as hierarchically arranged 'nested' models, each provides an interpretation of the sufficient parameters of the next higher level where they are taken as given." (Richard Levins, "The Strategy of Model Building in Population Biology", American Scientist 54(4), 1966)

"Parts and wholes evolve in consequence of their relationship, and the relationship itself evolves. These are the properties of things that we call dialectical: that one thing cannot exist without the other, that one acquires its properties from its relation to the other, that the properties of both evolve as a consequence of their interpenetration." (Richard Levins & Richard C Lewontin, "The Dialectical Biologist", 1985)

"The organism cannot be regarded as simply the passive object of autonomous internal and external forces; it is also the subject of its own evolution." (Richard Levins & Richard C Lewontin, "The Dialectical Biologist", 1985)

"We believe that science, in all its sense, is a social process that both causes and is caused by social organisation." (Richard Levins & Richard C Lewontin, "The Dialectical Biologist", 1985)

20 April 2006

🖍️Amit Ray - Collected Quotes

"Artificial intelligence is defined as the branch of science and technology that is concerned with the study of software and hardware to provide machines the ability to learn insights from data and the environment, and the ability to adapt in changing situations with high precision, accuracy and speed." (Amit Ray, "Compassionate Artificial Intelligence", 2018)

"Artificial Intelligence is not just learning patterns from data, but understanding human emotions and its evolution from its depth and not just fulfilling the surface level human requirements, but sensitivity towards human pain, happiness, mistakes, sufferings and well-being of the society are the parts of the evolving new AI systems." (Amit Ray, "Compassionate Artificial Intelligence", 2018)

"Quantum Machine Learning is defined as the branch of science and technology that is concerned with the application of quantum mechanical phenomena such as superposition, entanglement and tunneling for designing software and hardware to provide machines the ability to learn insights and patterns from data and the environment, and the ability to adapt automatically to changing situations with high precision, accuracy and speed." (Amit Ray, "Quantum Computing Algorithms for Artificial Intelligence", 2018)

"Quantum machine learning promises to discover the optimal network topologies and hyperparameters automatically without human intervention. (Amit Ray, "Quantum Computing Algorithms for Artificial Intelligence", 2018)

"The beauty of quantum machine learning is that we do not need to depend on an algorithm like gradient descent or convex objective function. The objective function can be nonconvex or something else." (Amit Ray, "Quantum Computing Algorithms for Artificial Intelligence", 2018)

"You can't understand depth of science, unless you challenge the published scientific data." (Amit Ray)

🖍️Aleksander Molak - Collected Quotes

"An important concept in complexity science is emergence – a phenomenon in which we can observe certain properties at the system level that cannot be observed at its constituent parts’ level. This property is sometimes described as a system being more than the sum of its parts." (Aleksander Molak, "Causal Inference and Discovery in Python", 2023)

"Any time you run regression analysis on arbitrary real-world observational data, there’s a significant risk that there’s hidden confounding in your dataset and so causal conclusions from such analysis are likely to be (causally) biased." (Aleksander Molak, "Causal Inference and Discovery in Python", 2023)

"Expert knowledge is a term covering various types of knowledge that can help define or disambiguate causal relations between two or more variables. Depending on the context, expert knowledge might refer to knowledge from randomized controlled trials, laws of physics, a broad scope of experiences in a given area, and more." (Aleksander Molak, "Causal Inference and Discovery in Python", 2023)

"In statistical inference and machine learning, we often talk about estimates and estimators. Estimates are basically our best guesses regarding some quantities of interest given (finite) data. Estimators are computational devices or procedures that allow us to map between a given (finite) data sample and an estimate of interest." (Aleksander Molak, "Causal Inference and Discovery in Python", 2023)

"In summary, the relationship between different branches of contemporary machine learning and causality is nuanced. That said, most broadly adopted machine learning models operate on rung one, not having a causal world model." (Aleksander Molak, "Causal Inference and Discovery in Python", 2023)

"'Let the data speak'" is a catchy and powerful slogan, but [...] data itself is not always enough. It’s worth remembering that in many cases 'data cannot speak for themselves' and we might need more information than just observations to address some of our questions." (Aleksander Molak, "Causal Inference and Discovery in Python", 2023)

"Matching is a family of methods for estimating causal effects by matching similar observations (or units) in the treatment and non-treatment groups. The goal of matching is to make comparisons between similar units in order to achieve as precise an estimate of the true causal effect as possible." (Aleksander Molak, "Causal Inference and Discovery in Python", 2023)

"Multiple regression provides scientists and analysts with a tool to perform statistical control - a procedure to remove unwanted influence from certain variables in the model." (Aleksander Molak, "Causal Inference and Discovery in Python", 2023)

"Non-linear associations are also quantifiable. Even linear regression can be used to model some non-linear relationships. This is possible because linear regression has to be linear in parameters, not necessarily in the data. More complex relationships can be quantified using entropy-based metrics such as mutual information. Linear models can also handle interaction terms. We talk about interaction when the model’s output depends on a multiplicative relationship between two or more variables." (Aleksander Molak, "Causal Inference and Discovery in Python", 2023)

"The basic goal of causal inference is to estimate the causal effect of one set of variables on another. In most cases, to do it accurately, we need to know which variables we should control for. [...] to accurately control for confounders, we need to go beyond the realm of pure statistics and use the information about the data-generating process, which can be encoded as a (causal) graph. In this sense, the ability to translate between graphical and statistical properties is central to causal inference." (Aleksander Molak, "Causal Inference and Discovery in Python", 2023)

"The causal interpretation of linear regression only holds when there are no spurious relationships in your data. This is the case in two scenarios: when you control for a set of all necessary variables (sometimes this set can be empty) or when your data comes from a properly designed randomized experiment." (Aleksander Molak, "Causal Inference and Discovery in Python", 2023)

"The first level of creativity [for evaluating causal models] is to use the refutation tests [...] The second level of creativity is available when you have access to historical data coming from randomized experiments. You can compare your observational model with the experimental results and try to adjust your model accordingly. The third level of creativity is to evaluate your modeling approach on simulated data with known outcomes. [...] The fourth level of creativity is sensitivity analysis." (Aleksander Molak, "Causal Inference and Discovery in Python", 2023)

"[...] the modularity assumption states that when we perform a (perfect) intervention on one variable in the system, the only structural change that takes place in this system is the removal of this variable’s incoming edges (which is equivalent to the modification of its structural equation) and the rest of the system remains structurally unchanged." (Aleksander Molak, "Causal Inference and Discovery in Python", 2023)

🖍️Manfred Drosg - Collected Quotes

"A histogram consists of the outline of bars of equal width and appropriate length next to each other. By connecting the frequency values at the position of the nominal values (the midpoints of the intervals) with straight lines, a frequency polygon is obtained. Attaching classes with frequency zero at either end makes the area (the integral) under the frequency polygon equal to that under the histogram." (Manfred Drosg, "Dealing with Uncertainties: A Guide to Error Analysis", 2007)

"A valid digit is not necessarily a significant digit. The significance of numbers is a result of its scientific context." (Manfred Drosg, "Dealing with Uncertainties: A Guide to Error Analysis", 2007)

"[myth:] Accuracy is more important than precision. For single best estimates, be it a mean value or a single data value, this question does not arise because in that case there is no difference between accuracy and precision. (Think of a single shot aimed at a target.) Generally, it is good practice to balance precision and accuracy. The actual requirements will differ from case to case." (Manfred Drosg, "Dealing with Uncertainties: A Guide to Error Analysis", 2007)

"Any scientific data without (a stated) uncertainty is of no avail. Therefore the analysis and description of uncertainty are almost as important as those of the data value itself . It should be clear that the uncertainty itself also has an uncertainty – due to its nature as a scientific quantity – and so on. The uncertainty of an uncertainty is generally not determined." (Manfred Drosg, "Dealing with Uncertainties: A Guide to Error Analysis", 2007)

"As uncertainties of scientific data values are nearly as important as the data values themselves, it is usually not acceptable that a best estimate is only accompanied by an estimated uncertainty. Therefore, only the size of nondominant uncertainties should be estimated. For estimating the size of a nondominant uncertainty we need to find its upper limit, i.e., we want to be as sure as possible that the uncertainty does not exceed a certain value." (Manfred Drosg, "Dealing with Uncertainties: A Guide to Error Analysis", 2007)

"Before best estimates are extracted from data sets by way of a regression analysis, the uncertainties of the individual data values must be determined.In this case care must be taken to recognize which uncertainty components are common to all the values, i.e., those that are correlated (systematic)." (Manfred Drosg, "Dealing with Uncertainties: A Guide to Error Analysis", 2007)

"Before discarding a data point one should investigate the possible reasons for this faulty data value." (Manfred Drosg, "Dealing with Uncertainties: A Guide to Error Analysis", 2007)

"Correlation analysis can help us find the size of the formal relation between two properties. An equidirectional variation is present if we observe high values of one variable together with high values of the other variable (or low ones combined with low ones). In this case there is a positive correlation. If high values are combined with low values and low values with high values, the variation is counterdirectional, and the correlation is negative." (Manfred Drosg, "Dealing with Uncertainties: A Guide to Error Analysis", 2007)

"[myth:] Counting can be done without error. Usually, the counted number is an integer and therefore without (rounding) error. However, the best estimate of a scientifically relevant value obtained by counting will always have an error. These errors can be very small in cases of consecutive counting, in particular of regular events, e.g., when measuring frequencies." (Manfred Drosg, "Dealing with Uncertainties: A Guide to Error Analysis", 2007)

"Due to the theory that underlies uncertainties an infinite number of data values would be necessary to determine the true value of any quantity. In reality the number of available data values will be relatively small and thus this requirement can never be fully met; all one can get is the best estimate of the true value." (Manfred Drosg, "Dealing with Uncertainties: A Guide to Error Analysis", 2007)

"For linear dependences the main information usually lies in the slope. It is obvious that those points that lie far apart have the strongest influence on the slope if all points have the same uncertainty. In this context we speak of the strong leverage of distant points; when determining the parameter “slope” these distant points carry more effective weight. Naturally, this weight is distinct from the “statistical” weight usually used in regression analysis." (Manfred Drosg, "Dealing with Uncertainties: A Guide to Error Analysis", 2007)

"For some scientific data the true value cannot be given by a constant or some straightforward mathematical function but by a probability distribution or an expectation value. Such data are called probabilistic. Even so, their true value does not change with time or place, making them distinctly different from most statistical data of everyday life." (Manfred Drosg, "Dealing with Uncertainties: A Guide to Error Analysis", 2007)

"If there is an outlier there are two possibilities: The model is wrong– after all, a theory is the basis on which we decide whether a data point is an outlier (an unexpected value) or not. The value of the data point is wrong because of a failure of the apparatus or a human mistake. There is a third possibility, though: The data point might not be an actual outlier, but part of a (legitimate) statistical fluctuation." (Manfred Drosg, "Dealing with Uncertainties: A Guide to Error Analysis", 2007)

"In error analysis the so-called 'chi-squared' is a measure of the agreement between the uncorrelated internal and the external uncertainties of a measured functional relation. The simplest such relation would be time independence. Theory of the chi-squared requires that the uncertainties be normally distributed. Nevertheless, it was found that the test can be applied to most probability distributions encountered in practice." (Manfred Drosg, "Dealing with Uncertainties: A Guide to Error Analysis", 2007)

"In many cases systematic errors are interpreted as the systematic difference between nature (which is being questioned by the experimenter in his experiment) and the model (which is used to describe nature). If the model used is not good enough, but the measurement result is interpreted using this model, the final result (the interpretation) will be wrong because it is biased, i.e., it has a systematic deviation (not uncertainty). If we do not use the best model (the best theory) available for the description of a certain phenomenon this procedure is just wrong. It has nothing to do with an uncertainty." (Manfred Drosg, "Dealing with Uncertainties: A Guide to Error Analysis", 2007)

"In science we try to explain reality by using models (theories). This is necessary because reality itself is too complex. So we need to come up with a model for that aspect of reality we want to understand – usually with the help of mathematics. Of course, these models or theories can only be simplifications of that part of reality we are looking at. A model can never be a perfect description of reality, and there can never be a part of reality perfectly mirroring a model." (Manfred Drosg, "Dealing with Uncertainties: A Guide to Error Analysis", 2007)

"It is also inevitable for any model or theory to have an uncertainty (a difference between model and reality). Such uncertainties apply both to the numerical parameters of the model and to the inadequacy of the model as well. Because it is much harder to get a grip on these types of uncertainties, they are disregarded, usually." (Manfred Drosg, "Dealing with Uncertainties: A Guide to Error Analysis", 2007)

"It is important that uncertainty components that are independent of each other are added quadratically. This is also true for correlated uncertainty components, provided they are independent of each other, i.e., as long as there is no correlation between the components." (Manfred Drosg, "Dealing with Uncertainties: A Guide to Error Analysis", 2007)

"It is important to pay heed to the following detail: a disadvantage of logarithmic diagrams is that a graphical integration is not possible, i.e., the area under the curve (the integral) is of no relevance." (Manfred Drosg, "Dealing with Uncertainties: A Guide to Error Analysis", 2007)

"It is the aim of all data analysis that a result is given in form of the best estimate of the true value. Only in simple cases is it possible to use the data value itself as result and thus as best estimate." (Manfred Drosg, "Dealing with Uncertainties: A Guide to Error Analysis", 2007)

"It is the nature of an uncertainty that it is not known and can never be known, whether the best estimate is greater or less than the true value." (Manfred Drosg, "Dealing with Uncertainties: A Guide to Error Analysis", 2007)

"Outliers or flyers are those data points in a set that do not quite fit within the rest of the data, that agree with the model in use. The uncertainty of such an outlier is seemingly too small. The discrepancy between outliers and the model should be subject to thorough examination and should be given much thought. Isolated data points, i.e., data points that are at some distance from the bulk of the data are not outliers if their values are in agreement with the model in use." (Manfred Drosg, "Dealing with Uncertainties: A Guide to Error Analysis", 2007)

"[myth:] Random errors can always be determined by repeating measurements under identical conditions. […] this statement is true only for time-related random errors." (Manfred Drosg, "Dealing with Uncertainties: A Guide to Error Analysis", 2007)

"[myth:] Systematic errors can be determined inductively. It should be quite obvious that it is not possible to determine the scale error from the pattern of data values." (Manfred Drosg, "Dealing with Uncertainties: A Guide to Error Analysis", 2007)

"The fact that the same uncertainty (e.g., scale uncertainty) is uncorrelated if we are dealing with only one measurement, but correlated (i.e., systematic) if we look at more than one measurement using the same instrument shows that both types of uncertainties are of the same nature. Of course, an uncertainty keeps its characteristics (e.g., Poisson distributed), independent of the fact whether it occurs only once or more often." (Manfred Drosg, "Dealing with Uncertainties: A Guide to Error Analysis", 2007)

"To fulfill the requirements of the theory underlying uncertainties, variables with random uncertainties must be independent of each other and identically distributed. In the limiting case of an infinite number of such variables, these are called normally distributed. However, one usually speaks of normally distributed variables even if their number is finite." (Manfred Drosg, "Dealing with Uncertainties: A Guide to Error Analysis", 2007)

19 April 2006

🖍️Jesús Rogel-Salazar - Collected Quotes

"[...] a data scientist role goes beyond the collection and reporting on data; it must involve looking at a business application or process from multiple vantage points and determining what the main questions and follow-ups are, as well as recommending the most appropriate ways to employ the data at hand." (Jesús Rogel-Salazar, "Data Science and Analytics with Python", 2017)

"High-bias models typically produce simpler models that do not overfit and in those cases the danger is that of underfitting. Models with low-bias are typically more complex and that complexity enables us to represent the training data in a more accurate way. The danger here is that the flexibility provided by higher complexity may end up representing not only a relationship in the data but also the noise. Another way of portraying the bias-variance trade-off is in terms of complexity v simplicity." (Jesús Rogel-Salazar, "Data Science and Analytics with Python", 2017)
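The trade-off in the quote above can be made concrete with a small experiment. The sketch below is not taken from the book: it assumes a noisy sine curve as data and uses NumPy's `Polynomial.fit` to compare a simple (degree 1), a moderate (degree 3) and a flexible (degree 12) model. The data, the noise level and the chosen degrees are all illustrative assumptions.

```python
import numpy as np
from numpy.polynomial import Polynomial

# Assumed toy data: a sine curve with Gaussian noise (not from the book).
rng = np.random.default_rng(0)
x_train = np.linspace(0.0, 1.0, 20)
y_train = np.sin(2 * np.pi * x_train) + rng.normal(0.0, 0.2, x_train.size)
x_test = np.linspace(0.025, 0.975, 20)
y_test = np.sin(2 * np.pi * x_test) + rng.normal(0.0, 0.2, x_test.size)


def mse(model, x, y):
    """Mean squared error of a fitted polynomial on the points (x, y)."""
    return float(np.mean((model(x) - y) ** 2))


# Map each polynomial degree to its (training error, test error).
results = {}
for degree in (1, 3, 12):
    model = Polynomial.fit(x_train, y_train, degree)
    results[degree] = (mse(model, x_train, y_train), mse(model, x_test, y_test))

for degree, (train_err, test_err) in results.items():
    print(f"degree {degree:2d}: train MSE {train_err:.4f}, test MSE {test_err:.4f}")
```

The training error can only go down as the degree grows, because each higher-degree model contains the lower-degree ones. The test error is the number to watch: when it stops improving or rises while the training error keeps falling, the extra flexibility is likely fitting noise.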

"In terms of characteristics, a data scientist has an inquisitive mind and is prepared to explore and ask questions, examine assumptions and analyse processes, test hypotheses and try out solutions and, based on evidence, communicate informed conclusions, recommendations and caveats to stakeholders and decision makers." (Jesús Rogel-Salazar, "Data Science and Analytics with Python", 2017)

"Munging, or wrangling data is actually the most time-consuming task in the data science workflow. [...] Data preparation is key to the extraction of valuable insight and although some may prefer to concentrate only on the much more fun modelling part, the fact that you get to know your dataset inside out while munging it implies that any new or follow-up questions can probably be attained with less effort." (Jesús Rogel-Salazar, "Data Science and Analytics with Python", 2017)

"The tension between bias and variance, simplicity and complexity, or underfitting and overfitting is an area in the data science and analytics process that can be closer to a craft than a fixed rule. The main challenge is that not only is each dataset different, but also there are data points that we have not yet seen at the moment of constructing the model. Instead, we are interested in building a strategy that enables us to tell something about data from the sample used in building the model." (Jesús Rogel-Salazar, "Data Science and Analytics with Python", 2017)

"One important thing to bear in mind about the outputs of data science and analytics is that in the vast majority of cases they do not uncover hidden patterns or relationships as if by magic, and in the case of predictive analytics they do not tell us exactly what will happen in the future. Instead, they enable us to forecast what may come. In other words, once we have carried out some modelling there is still a lot of work to do to make sense out of the results obtained, taking into account the constraints and assumptions in the model, as well as considering what an acceptable level of reliability is in each scenario." (Jesús Rogel-Salazar, "Data Science and Analytics with Python", 2017)