SQL Troubles

10 April 2006

🖍️Arthur L Bowley - Collected Quotes

"A knowledge of statistics is like a knowledge of foreign languages or of algebra; it may prove of use at any time under any circumstances." (Arthur L Bowley, "Elements of Statistics", 1901)

"A statistical estimate may be good or bad, accurate or the reverse; but in almost all cases it is likely to be more accurate than a casual observer’s impression, and the nature of things can only be disproved by statistical methods." (Arthur L Bowley, "Elements of Statistics", 1901)

"Great numbers are not counted correctly to a unit, they are estimated; and we might perhaps point to this as a division between arithmetic and statistics, that whereas arithmetic attains exactness, statistics deals with estimates, sometimes very accurate, and very often sufficiently so for their purpose, but never mathematically exact." (Arthur L Bowley, "Elements of Statistics", 1901)

"Some of the common ways of producing a false statistical argument are to quote figures without their context, omitting the cautions as to their incompleteness, or to apply them to a group of phenomena quite different to that to which they in reality relate; to take these estimates referring to only part of a group as complete; to enumerate the events favorable to an argument, omitting the other side; and to argue hastily from effect to cause, this last error being the one most often fathered on to statistics. For all these elementary mistakes in logic, statistics is held responsible." (Arthur L Bowley, "Elements of Statistics", 1901)

"[…] statistics is the science of the measurement of the social organism, regarded as a whole, in all its manifestations." (Arthur L Bowley, "Elements of Statistics", 1901)

"Statistics may rightly be called the science of averages. […] Great numbers and the averages resulting from them, such as we always obtain in measuring social phenomena, have great inertia. […] It is this constancy of great numbers that makes statistical measurement possible. It is to great numbers that statistical measurement chiefly applies." (Arthur L Bowley, "Elements of Statistics", 1901)

"Statistics may, for instance, be called the science of counting. Counting appears at first sight to be a very simple operation, which any one can perform or which can be done automatically; but, as a matter of fact, when we come to large numbers, e.g., the population of the United Kingdom, counting is by no means easy, or within the power of an individual; limits of time and place alone prevent it being so carried out, and in no way can absolute accuracy be obtained when the numbers surpass certain limits." (Arthur L Bowley, "Elements of Statistics", 1901)

"By [diagrams] it is possible to present at a glance all the facts which could be obtained from figures as to the increase, fluctuations, and relative importance of prices, quantities, and values of different classes of goods and trade with various countries; while the sharp irregularities of the curves give emphasis to the disturbing causes which produce any striking change." (Arthur L Bowley, "A Short Account of England's Foreign Trade in the Nineteenth Century, its Economic and Social Results", 1905)

"Of itself an arithmetic average is more likely to conceal than to disclose important facts; it is the nature of an abbreviation, and is often an excuse for laziness." (Arthur L Bowley, "The Nature and Purpose of the Measurement of Social Phenomena", 1915)

"[...] the problems of the errors that arise in the process of sampling have been chiefly discussed from the point of view of the universe, not of the sample; that is, the question has been how far will a sample represent a given universe? The practical question is, however, the converse: what can we infer about a universe from a given sample? This involves the difficult and elusive theory of inverse probability, for it may be put in the form, which of the various universes from which the sample may a priori have been drawn may be expected to have yielded that sample?" (Arthur L Bowley, "Elements of Statistics. 5th Ed., 1926)

"Statistics are numerical statements of facts in any department of inquiry, placed in relation to each other; statistical methods are devices for abbreviating and classifying the statements and making clear the relations." (Arthur L Bowley, "An Elementary Manual of Statistics", 1934)

🖍️Alan Turing - Collected Quotes

"A computer would deserve to be called intelligent if it could deceive a human into believing that it was human." (Alan Turing, "Computing Machinery and Intelligence", Mind Vol. 59, 1950)

"If one wants to make a machine mimic the behaviour of the human computer in some complex operation one has to ask him how it is done, and then translate the answer into the form of an instruction table. Constructing instruction tables is usually described as 'programming'." (Alan Turing, "Computing Machinery and Intelligence", Mind Vol. 59, 1950)

"It is unnecessary to design various new machines to do various computing processes. They can all be done with one digital computer, suitably programmed for each case." (Alan Turing, "Computing Machinery and Intelligence", Mind Vol. 59, 1950)

"The idea behind digital computers may be explained by saying that these machines are intended to carry out any operations which could be done by a human computer.” (Alan Turing, “Computing Machinery and Intelligence”, Mind Vol. 59, 1950)

"The original question, 'Can machines think?:, I believe too meaningless to deserve discussion. Nevertheless I believe that at the end of the century the use of words and general educated opinion will have altered so much that one will be able to speak of machines thinking without expecting to be contradicted." (Alan M. Turing, 1950)

"The view that machines cannot give rise to surprises is due, I believe, to a fallacy to which philosophers and mathematicians are particularly subject. This is the assumption that as soon as a fact is presented to a mind all consequences of that fact spring into the mind simultaneously with it. It is a very useful assumption under many circumstances, but one too easily forgets that it is false. A natural consequence of doing so is that one then assumes that there is no virtue in the mere working out of consequences from data and general principles." (Alan Turing, "Computing Machinery and Intelligence", Mind Vol. 59, 1950)

"This model will be a simplification and an idealization, and consequently a falsification. It is to be hoped that the features retained for discussion are those of greatest importance in the present state of knowledge." (Alan M Turing, "The Chemical Basis of Morphogenesis" , Philosophical Transactions of the Royal Society of London, Series B: Biological Sciences, Vol. 237 (641), 1952)

"Almost everyone now acknowledges that theory and experiment, model making, theory construction and linguistics all go together, and that the successful development of a science of behavior depends upon a ‘total approach’ in which, given that the computer ‘is the only large-scale universal model’ that we possess, ‘we may expect to follow the prescription of Simon and construct our models - or most of them - in the form of computer programs’." (Alan M Turing)

"Science is a differential equation. Religion is a boundary condition." (Alan M Turing)

"The whole thinking process is rather mysterious to us, but I believe that the attempt to make a thinking machine will help us greatly in finding out how we think ourselves." (Alan M Turing)

"We do not need to have an infinity of different machines doing different jobs. A single one will suffice. The engineering problem of producing various machines for various jobs is replaced by the office work of "programming" the universal machine to do these jobs." (Alan M Turing)

08 April 2006

🖍️John H Johnson - Collected Quotes

"A correlation is simply a bivariate relationship - a fancy way of saying that there is a relationship between two ('bi') variables ('variate'). And a bivariate relationship doesn’t prove that one thing caused the other. Think of it this way: you can observe that two things appear to be related statistically, but that doesn’t tell you the answer to any of the questions you might really care about - why is there a relationship and what does it mean to us as a consumer of data?" (John H Johnson & Mike Gluck, "Everydata: The misinformation hidden in the little data you consume every day", 2016)

"A good chart can tell a story about the data, helping you understand relationships among data so you can make better decisions. The wrong chart can make a royal mess out of even the best data set." (John H Johnson & Mike Gluck, "Everydata: The misinformation hidden in the little data you consume every day", 2016)

"Although some people use them interchangeably, probability and odds are not the same and people often misuse the terms. Probability is the likelihood that an outcome will occur. The odds of something happening, statistically speaking, is the ratio of favorable outcomes to unfavorable outcomes." (John H Johnson & Mike Gluck, "Everydata: The misibivarinformation hidden in the little data you consume every day", 2016)

"[…] average isn’t something that should be considered in isolation. Your average is only as good as the data that supports it. If your sample isn’t representative of the full population, if you cherry- picked the data, or if there are other issues with your data, your average may be misleading." (John H Johnson & Mike Gluck, "Everydata: The misinformation hidden in the little data you consume every day", 2016)

"Big data is sexy. It makes the headlines. […] But, as you’ve seen already, it’s the little data - the small bits and bytes of data that you’re bombarded with in your everyday life - that often has a huge effect on your health, your wallet, your job, your relationships, and so much more, every single day. From food labels to weather forecasts, your bank account to your doctor’s office, everydata is all around you." (John H Johnson & Mike Gluck, "Everydata: The misinformation hidden in the little data you consume every day", 2016)

"Confirmation bias can affect nearly every aspect of the way you look at data, from sampling and observation to forecasting - so it’s something to keep in mind anytime you’re interpreting data. When it comes to correlation versus causation, confirmation bias is one reason that some people ignore omitted variables - because they’re making the jump from correlation to causation based on preconceptions, not the actual evidence." (John H Johnson & Mike Gluck, "Everydata: The misinformation hidden in the little data you consume every day", 2016)

"Essentially, magnitude is the size of the effect. It’s a way to determine if the results are meaningful. Without magnitude, it’s hard to get a sense of how much something matters. […] the magnitude of an effect can change, depending on the relationship." (John H Johnson & Mike Gluck, "Everydata: The misinformation hidden in the little data you consume every day", 2016)

"First, you need to think about whether the universe of data that is being studied or collected is representative of the underlying population. […] Second, you need to consider what you are analyzing in the data that has been collected - are you analyzing all of the data, or only part of it? […] You always have to ask - can you accurately extend your findings from the sample to the general population? That’s called external validity - when you can extend the results from your sample to draw meaningful conclusions about the full population." (John H Johnson & Mike Gluck, "Everydata: The misinformation hidden in the little data you consume every day", 2016)

"Forecasting is difficult because we don’t know everything about how the world works. There are unforeseen events. Unknown processes. Random occurrences. People are unpredictable, and things don’t always stay the same. The data you’re studying can change - as can your understanding of the underlying process." (John H Johnson & Mike Gluck, "Everydata: The misinformation hidden in the little data you consume every day", 2016)

"Having a large sample size doesn’t guarantee better results if it’s the wrong large sample." (John H Johnson & Mike Gluck, "Everydata: The misinformation hidden in the little data you consume every day", 2016)

"If the underlying data isn’t sampled accurately, it’s like building a house on a foundation that’s missing a few chunks of concrete. Maybe it won’t matter. But if the missing concrete is in the wrong spot - or if there is too much concrete missing - the whole house can come falling down." (John H Johnson & Mike Gluck, "Everydata: The misinformation hidden in the little data you consume every day", 2016)

"If you’re looking at an average, you are - by definition - studying a specific sample set. If you’re comparing averages, and those averages come from different sample sets, the differences in the sample sets may well be manifested in the averages. Remember, an average is only as good as the underlying data." (John H Johnson & Mike Gluck, "Everydata: The misinformation hidden in the little data you consume every day", 2016)

"If your conclusions change dramatically by excluding a data point, then that data point is a strong candidate to be an outlier. In a good statistical model, you would expect that you can drop a data point without seeing a substantive difference in the results. It’s something to think about when looking for outliers." (John H Johnson & Mike Gluck, "Everydata: The misinformation hidden in the little data you consume every day", 2016)

"In the real world, statistical issues rarely exist in isolation. You’re going to come across cases where there’s more than one problem with the data. For example, just because you identify some sampling errors doesn’t mean there aren’t also issues with cherry picking and correlations and averages and forecasts - or simply more sampling issues, for that matter. Some cases may have no statistical issues, some may have dozens. But you need to keep your eyes open in order to spot them all." (John H Johnson & Mike Gluck, "Everydata: The misinformation hidden in the little data you consume every day", 2016)

"Just as with aggregated data, an average is a summary statistic that can tell you something about the data - but it is only one metric, and oftentimes a deceiving one at that. By taking all of the data and boiling it down to one value, an average (and other summary statistics) may imply that all of the underlying data is the same, even when it’s not." (John H Johnson & Mike Gluck, "Everydata: The misinformation hidden in the little data you consume every day", 2016)

"Keep in mind that a weighted average may be different than a simple (non- weighted) average because a weighted average - by definition - counts certain data points more heavily. When you’re thinking about an average, try to determine if it’s a simple average or a weighted average. If it’s weighted, ask yourself how it’s being weighted, and see which data points count more than others." (John H Johnson & Mike Gluck, "Everydata: The misinformation hidden in the little data you consume every day", 2016)

"[…] remember that, as with many statistical issues, sampling in and of itself is not a good or a bad thing. Sampling is a powerful tool that allows us to learn something, when looking at the full population is not feasible (or simply isn’t the preferred option). And you shouldn’t be misled to think that you always should use all the data. In fact, using a sample of data can be incredibly helpful." (John H Johnson & Mike Gluck, "Everydata: The misinformation hidden in the little data you consume every day", 2016)

"Statistical significance is a concept used by scientists and researchers to set an objective standard that can be used to determine whether or not a particular relationship 'statistically' exists in the data. Scientists test for statistical significance to distinguish between whether an observed effect is present in the data (given a high degree of probability), or just due to chance. It is important to note that finding a statistically significant relationship tells us nothing about whether a relationship is a simple correlation or a causal one, and it also can’t tell us anything about whether some omitted factor is driving the result." (John H Johnson & Mike Gluck, "Everydata: The misinformation hidden in the little data you consume every day", 2016)

"Statistical significance refers to the probability that something is true. It’s a measure of how probable it is that the effect we’re seeing is real (rather than due to chance occurrence), which is why it’s typically measured with a p-value. P, in this case, stands for probability. If you accept p-values as a measure of statistical significance, then the lower your p-value is, the less likely it is that the results you’re seeing are due to chance alone." (John H Johnson & Mike Gluck, "Everydata: The misinformation hidden in the little data you consume every day", 2016)

"This idea of looking for answers is related to confirmation bias, which is the tendency to interpret data in a way that reinforces your preconceptions. With confirmation bias, you aren’t just looking for an answer - you’re looking for a specific answer." (John H Johnson & Mike Gluck, "Everydata: The misinformation hidden in the little data you consume every day", 2016)

"The more uncertainty there is in your sample, the more uncertainty there will be in your forecast. A prediction is only as good as the information that goes into it, and in statistics, we call the basis for our forecasts a model. The model represents all the inputs - the factors you determine will predict the future outcomes, the underlying sample data you rely upon, and the relationship you apply mathematically. In other words, the model captures how you think various factors relate to one another." (John H Johnson & Mike Gluck, "Everydata: The misinformation hidden in the little data you consume every day", 2016)

"The process of making statistical conclusions about the data is called drawing an inference. In any statistical analysis, if you’re going to draw an inference, the goal is to make sure you have the right data to answer the question you are asking." (John H Johnson & Mike Gluck, "Everydata: The misinformation hidden in the little data you consume every day", 2016)

"The strength of an average is that it takes all the values in your data set and simplifies them down to a single number. This strength, however, is also the great danger of an average. If every data point is exactly the same (picture a row of identical bricks) then an average may, in fact, accurately reflect something about each one. But if your population isn’t similar along many key dimensions - and many data sets aren’t - then the average will likely obscure data points that are above or below the average, or parts of the data set that look different from the average. […] Another way that averages can mislead is that they typically only capture one aspect of the data." (John H Johnson & Mike Gluck, "Everydata: The misinformation hidden in the little data you consume every day", 2016)

"The tricky part is that there aren’t really any hard- and- fast rules when it comes to identifying outliers. Some economists say an outlier is anything that’s a certain distance away from the mean, but in practice it’s fairly subjective and open to interpretation. That’s why statisticians spend so much time looking at data on a case-by-case basis to determine what is - and isn’t - an outlier." (John H Johnson & Mike Gluck, "Everydata: The misinformation hidden in the little data you consume every day", 2016)

"Using a sample to estimate results in the full population is common in data analysis. But you have to be careful, because even small mistakes can quickly become big ones, given that each observation represents many others. There are also many factors you need to consider if you want to make sure your inferences are accurate." (John H Johnson & Mike Gluck, "Everydata: The misinformation hidden in the little data you consume every day", 2016)

07 April 2006

🖍️Victor Cohn - Collected Quotes

"Different problems require different methods, different numbers. One of the most basic questions in science is: Is the study designed in a way that will allow the researchers to answer the questions that they want answered?" (Victor Cohn & Lewis Cope, "News & Numbers: A writer’s guide to statistics" 3rd Ed, 2012)

"If the group is large enough, even very small differences can become statistically significant." (Victor Cohn & Lewis Cope, "News & Numbers: A writer’s guide to statistics" 3rd Ed, 2012)

"In common language and ordinary logic, a low likelihood of chance alone calling the shots means 'it’s close to certain'. A strong likelihood that chance could have ruled means 'it almost certainly can’t be'." (Victor Cohn & Lewis Cope, "News & Numbers: A writer’s guide to statistics" 3rd Ed, 2012)

"Most importantly, much of statistics involves clear thinking rather than numbers. And much, at least much of the statistical principles that reporters can most readily apply, is good sense." (Victor Cohn & Lewis Cope, "News & Numbers: A writer’s guide to statistics" 3rd Ed, 2012)

"Nature is complex, and almost all methods of observation and experiment are imperfect." (Victor Cohn & Lewis Cope, "News & Numbers: A writer’s guide to statistics" 3rd Ed, 2012)

"[…] nonparametric methods […] are methods of examining data that do not rely on a numerical distribution. As a result, they don’t allow a few very large or very small or very wild numbers to run away with the analysis." (Victor Cohn & Lewis Cope, "News & Numbers: A writer’s guide to statistics" 3rd Ed, 2012)

"Regression toward the mean is the tendency of all values in every field of science – physical, biological, social, and economic – to move toward the average. […] The regression effect is common to all repeated measurements. Regression is part of an even more basic phenomenon: variation, or variability. Virtually everything that is measured varies from measurement to measurement. When repeated, every experiment has at least slightly different results." (Victor Cohn & Lewis Cope, "News & Numbers: A writer’s guide to statistics" 3rd Ed, 2012)

"Statistically, power means the probability of finding something if it’s there.[…] statisticians think of power as a function of both sample size and the accuracy of measurement, because that too affects the probability of finding something." (Victor Cohn & Lewis Cope, "News & Numbers: A writer’s guide to statistics" 3rd Ed, 2012)

"The big problems with statistics, say its best practitioners, have little to do with computations and formulas. They have to do with judgment - how to design a study, how to conduct it, then how to analyze and interpret the results. Journalists reporting on statistics have many chances to do harm by shaky reporting, and so are also called on to make sophisticated judgments. How, then, can we tell which studies seem credible, which we should report?" (Victor Cohn & Lewis Cope, "News & Numbers: A writer’s guide to statistics" 3rd Ed, 2012)

"The first thing that you should understand about science is that it is almost always uncertain. The scientific process allows science to move ahead without waiting for an elusive 'proof positive'. […] How can science afford to act on less than certainty? Because science is a continuing story - always retesting ideas. One scientific finding leads scientists to conduct more research, which may support and expand on the original finding." (Victor Cohn & Lewis Cope, "News & Numbers: A writer’s guide to statistics" 3rd Ed, 2012)

"Where many known, measurable factors are involved, statisticians can use mathematical techniques to account for all the variables and try to find which are the truly important predictors. The terms for this include multiple regression, multivariate analysis, and discriminant analysis, and factor, cluster, path, and two-stage least-squares analyses." (Victor Cohn & Lewis Cope, "News & Numbers: A writer’s guide to statistics" 3rd Ed, 2012)

06 April 2006

🖍️Antoine Cornuéjols - Collected Quotes

"Hence, has machine learning uncovered truths that escaped the notice of philosophy, psychology, and biology? On one hand, it can be argued that machine learning has at least provided grounds for some of the claims of philosophy regarding the nature of knowledge and its acquisition. Against pure empiricism, induction requires prior knowledge, if only in the form of a constrained hypothesis space. In addition, there is a kind of conservation law at play in induction. The more a priori knowledge there is, the easier learning is and the fewer data are needed, and vice versa. The statistical study of machine learning allows quantifying this trade-off." (Antoine Cornuéjol, "The Necessity of Order in Machine Learning: Is Order in Order?", 2007)

"In effect, machine learning research has already brought us several interesting concepts. Most prominently, it has stressed the benefit of distinguishing between the properties of the hypothesis space - its richness and the valuation scheme associated with it - and the characteristics of the actual search procedure in this space, guided by the training data. This in turn suggests two important factors related to sequencing effects, namely forgetting and the nonoptimality of the search procedure. Both are key parameters than need to be thoroughly understood if one is to master sequencing effects." (Antoine Cornuéjol, "The Necessity of Order in Machine Learning: Is Order in Order?", 2007)

"On the other hand, the algorithms produced in machine learning during the last few decades seem quite remote from what can be expected to account for natural cognition. For one thing, there is virtually no notion of knowledge organization in these methods. Learning is supposed to arise on a blank slate, albeit a constrained one, and its output is not supposed to be used for subsequent learning episodes. Neither is there any hierarchy in the 'knowledge' produced. Learning is not conceived as an ongoing activity but rather as a one-shot process more akin to data analysis than to a gradual discovery development or even to an adaptive process. " (Antoine Cornuéjol, "The Necessity of Order in Machine Learning: Is Order in Order?", 2007)

"[...] the theory that establishes a link between the empirical fit of the candidate hypothesis with respect to the data and its expected value on unseen events becomes essentially inoperative if the data are not supposed to be independent of each other. This requirement is obviously at odds with most natural learning settings, where either the learner is actively searching for data or where learning occurs under the guidance of a teacher who is carefully choosing the data and their order of presentation." (Antoine Cornuéjol, "The Necessity of Order in Machine Learning: Is Order in Order?", 2007)

"There are many control parameters to a learning system. The question is to identify, at a sufficiently high level, the ones that can play a key role in sequencing effects. Because learning can be seen as the search for an optimal hypothesis in a given space under an inductive criteria defined over the training set, three means to control learning readily appear. The first one corresponds to a change of the hypothesis space. The second consists in modifying the optimization landscape. This can be done by changing either the training set (for instance, by a forgetting mechanism) or the inductive criteria. Finally, one can also fiddle with the exploration process. For instance, in the case of a gradient search, slowing down the search process can prevent the system from having time to find the local optimum, which, in turn, can introduce sequencing effects." (Antoine Cornuéjol, "The Necessity of Order in Machine Learning: Is Order in Order?", 2007)

"While it has been always considered that a piece of information could at worst be useless, it should now be acknowledged that it can have a negative impact. There is simply no theory of information at the moment offering a framework ready to account for this in general." (Antoine Cornuéjol, "The Necessity of Order in Machine Learning: Is Order in Order?", 2007)

🖍️Nate Silver - Collected Quotes

"A forecaster should almost never ignore data, especially when she is studying rare events […]. Ignoring data is often a tip-off that the forecaster is overconfident, or is overfitting her model - that she is interested in showing off rather than trying to be accurate." (Nate Silver, "The Signal and the Noise: Why So Many Predictions Fail-but Some Don't", 2012)

"Complex systems seem to have this property, with large periods of apparent stasis marked by sudden and catastrophic failures. These processes may not literally be random, but they are so irreducibly complex (right down to the last grain of sand) that it just won’t be possible to predict them beyond a certain level. […] And yet complex processes produce order and beauty when you zoom out and look at them from enough distance." (Nate Silver, "The Signal and the Noise: Why So Many Predictions Fail-but Some Don't", 2012)

"Data-driven predictions can succeed - and they can fail. It is when we deny our role in the process that the odds of failure rise. Before we demand more of our data, we need to demand more of ourselves." (Nate Silver, "The Signal and the Noise: Why So Many Predictions Fail-but Some Don't", 2012)

"Distinguishing the signal from the noise requires both scientific knowledge and self-knowledge." (Nate Silver, "The Signal and the Noise: Why So Many Predictions Fail-but Some Don't", 2012)

"Finding patterns is easy in any kind of data-rich environment; that's what mediocre gamblers do. The key is in determining whether the patterns represent signal or noise." (Nate Silver, "The Signal and the Noise: Why So Many Predictions Fail-but Some Don't", 2012)

"The instinctual shortcut that we take when we have 'too much information' is to engage with it selectively, picking out the parts we like and ignoring the remainder, making allies with those who have made the same choices and enemies of the rest." (Nate Silver, "The Signal and the Noise: Why So Many Predictions Fail-but Some Don't", 2012)

"The most basic tenet of chaos theory is that a small change in initial conditions - a butterfly flapping its wings in Brazil - can produce a large and unexpected divergence in outcomes - a tornado in Texas. This does not mean that the behavior of the system is random, as the term 'chaos' might seem to imply. Nor is chaos theory some modern recitation of Murphy’s Law ('whatever can go wrong will go wrong'). It just means that certain types of systems are very hard to predict." (Nate Silver, "The Signal and the Noise: Why So Many Predictions Fail-but Some Don't", 2012)

"The numbers have no way of speaking for themselves. We speak for them. We imbue them with meaning." (Nate Silver, "The Signal and the Noise: Why So Many Predictions Fail-but Some Don't", 2012)

"The signal is the truth. The noise is what distracts us from the truth." (Nate Silver, "The Signal and the Noise: Why So Many Predictions Fail-but Some Don't", 2012)

"The systems are dynamic, meaning that the behavior of the system at one point in time influences its behavior in the future; And they are nonlinear, meaning they abide by exponential rather than additive relationships." (Nate Silver, "The Signal and the Noise: Why So Many Predictions Fail-but Some Don't", 2012)

"We forget - or we willfully ignore - that our models are simplifications of the world. We figure that if we make a mistake, it will be at the margin. In complex systems, however, mistakes are not measured in degrees but in whole orders of magnitude." (Nate Silver, "The Signal and the Noise: Why So Many Predictions Fail-but Some Don't", 2012)

"We need to stop, and admit it: we have a prediction problem. We love to predict things - and we aren't very good at it." (Nate Silver, "The Signal and the Noise: Why So Many Predictions Fail-but Some Don't", 2012)

"Whether information comes in a quantitative or qualitative flavor is not as important as how you use it. [...] The key to making a good forecast […] is not in limiting yourself to quantitative information. Rather, it’s having a good process for weighing the information appropriately. […] collect as much information as possible, but then be as rigorous and disciplined as possible when analyzing it. [...] Many times, in fact, it is possible to translate qualitative information into quantitative information." (Nate Silver, "The Signal and the Noise: Why So Many Predictions Fail-but Some Don't", 2012)

"Statistics is the science of finding relationships and actionable insights from data." (Nate Silver)

🖍️Beau Lotto - Collected Quotes

"Effects without an understanding of the causes behind them, on the other hand, are just bunches of data points floating in the ether, offering nothing useful by themselves. Big Data is information, equivalent to the patterns of light that fall onto the eye. Big Data is like the history of stimuli that our eyes have responded to. And as we discussed earlier, stimuli are themselves meaningless because they could mean anything. The same is true for Big Data, unless something transformative is brought to all those data sets… understanding." (Beau Lotto, "Deviate: The Science of Seeing Differently", 2017)

"New information is constantly flowing in, and your brain is constantly integrating it into this statistical distribution that creates your next perception (so in this sense 'reality' is just the product of your brain’s ever-evolving database of consequence). As such, your perception is subject to a statistical phenomenon known in probability theory as kurtosis. Kurtosis in essence means that things tend to become increasingly steep in their distribution [...] that is, skewed in one direction. This applies to ways of seeing everything from current events to ourselves as we lean 'skewedly' toward one interpretation, positive or negative. Things that are highly kurtotic, or skewed, are hard to shift away from. This is another way of saying that seeing differently isn’t just conceptually difficult - it’s statistically difficult." (Beau Lotto, "Deviate: The Science of Seeing Differently", 2017)

"Our assumptions are un question ably interconnected. They are nodes with connections (edges) to other nodes. The more foundational the assumption, the more strongly connected it is. What I’m suggesting is that our assumptions and the highly sensitive network of responses, perceptions, behaviors, thoughts, and ideas they create and interact with are a complex system. One of the most basic features of such a network is that when you move or disrupt one thing that is strongly connected, you don’t just affect that one thing, you affect all the other things that are connected to it. Hence small causes can have massive effects (but they don’t have to, and usually don’t actually). In a system of high tension, simple questions targeting basic assumptions have the potential to transform perception in radical and unpredictable ways." (Beau Lotto, "Deviate: The Science of Seeing Differently", 2017)

"Questioning our assumptions is what provokes revolutions, be they tiny or vast, technological or social." (Beau Lotto, "Deviate: The Science of Seeing Differently", 2017)

"Understanding reduces the complexity of data by collapsing the dimensionality of information to a lower set of known variables. s revolutions, be they tiny or vast, technological or social." (Beau Lotto, "Deviate: The Science of Seeing Differently", 2017)

"The basis of complex systems is actually quite simple (and this is not an attempt to be paradoxical, like an art critic who describes a sculpture as 'big yet small'). What makes a system unpredictable and thus nonlinear (which includes you and your perceptual process, or the process of making collective decisions) is that the components making up the system are interconnected." (Beau Lotto, "Deviate: The Science of Seeing Differently", 2017)

"The greatest leaders possess a combination of divergent traits: they are both experts and naïve, creative and efficient, serious and playful, social and reclusive - or at the very least, they surround themselves with this dynamic." (Beau Lotto, "Deviate: The Science of Seeing Differently", 2017)

"The term [Big Data] simply refers to sets of data so immense that they require new methods of mathematical analysis, and numerous servers. Big Data - and, more accurately, the capacity to collect it - has changed the way companies conduct business and governments look at problems, since the belief wildly trumpeted in the media is that this vast repository of information will yield deep insights that were previously out of reach." (Beau Lotto, "Deviate: The Science of Seeing Differently", 2017)

"Trust is fundamental to leading others into the dark, since trust enables fear to be 'actionable' as courage rather than actionable as anger. Since the bedrock of trust is faith that all will be OK within uncertainty, leaders’ fundamental role is to ultimately lead themselves. Research has found that successful leaders share three behavioral traits: they lead by example, admit their mistakes, and see positive qualities in others. All three are linked to spaces of play. Leading by example creates a space that is trusted - and without trust, there is no play. Admitting mistakes is to celebrate uncertainty. Seeing qualities in others is to encourage diversity." (Beau Lotto, "Deviate: The Science of Seeing Differently", 2017)

"Understanding transcends context, since the different contexts collapse according to their previously unknown similarity, which the principle contains. That is what understanding does. And you actually feel it in your brain when it happens. Your 'cognitive load' decreases, your level of stress and anxiety decrease, and your emotional state improves." (Beau Lotto, "Deviate: The Science of Seeing Differently", 2017)

"What defines a good leader? Enabling other people to step into the unseen. […] as the world becomes increasingly connected and thus unpredictable, the concept of leadership too must change. Rather than lead from the front toward efficiency, offering the answers, a good leader is defined by how he or she leads others into darkness - into uncertainty." (Beau Lotto, "Deviate: The Science of Seeing Differently", 2017)

05 April 2006

🖍️Robert Hooke - Collected Quotes

"Accounting figures are a blend of facts and arbitrary procedures that are designed to facilitate the recording and communication of business transactions. Their usefulness in the decision process is sometimes grossly overestimated." (Robert Hooke, "How to Tell the Liars from the Statisticians", 1983)

"All of us learn by experience. Except for pure deductive processes, everything we learn is from someone's experience. All experience is a sample from an immense range of possible experience that no one individual can ever take in. It behooves us to know what parts of the information we get from samples can be trusted and what cannot." (Robert Hooke, "How to Tell the Liars from the Statisticians", 1983)

"Being experimental, however, doesn't necessarily make a scientific study entirely credible. One weakness of experimental work is that it can be out of touch with reality when its controls are so rigid that conclusions are valid only in the experimental situation and don't carryover into the real world." (Robert Hooke, "How to Tell the Liars from the Statisticians", 1983)

"Correlation analysis is a useful tool for uncovering a tenuous relationship, but it doesn't necessarily provide any real understanding of the relationship, and it certainly doesn't provide any evidence that the relationship is one of cause and effect. People who don't understand correlation tend to credit it with being a more fundamental approach than it is." (Robert Hooke, "How to Tell the Liars from the Statisticians", 1983)

"Experiments usually are looking for 'signals' of truth, and the search is always ham pered by 'noise' of one kind or another. In judging someone else's experimental results it's important to find out whether they represent a true signal or whether they are just so much noise." (Robert Hooke, "How to Tell the Liars from the Statisticians", 1983)

"First and foremost an experiment should have a goal, and the goal should be something worth achieving, especially if the experimenter is working on someone else's (for example, the taxpayers') money. 'Worth achieving' implies more than just beneficial; it also should mean that the experiment is the most beneficial thing we can think of doing. Obviously we can't predict accurately the value of an experiment (this may not even be possible after we see how it turns out), but we should feel obliged to make as intelligent a choice as we can. Such a choice is sometimes labeled a 'value judgment'." (Robert Hooke, "How to Tell the Liars from the Statisticians", 1983)

"In general a small-scale test or experiment will not detect a small effect, or small differences among various products." (Robert Hooke, "How to Tell the Liars from the Statisticians", 1983)

"Mistakes arising from retrospective data analysis led to the idea of experimentation, and experience with experimentation led to the idea of controlled experiments and then to the proper design of experiments for efficiency and credibility. When someone is pushing a conclusion at you, it's a good idea to ask where it came from - was there an experiment, and if so, was it controlled and was it relevant?" (Robert Hooke, "How to Tell the Liars from the Statisticians", 1983)

"One important way of developing our powers of discrimination between good and bad statistical studies is to learn about the differences between backward-looking (retrospective or historical) data and data obtained through carefully planned and controlled (forward-looking) experiments." (Robert Hooke, "How to Tell the Liars from the Statisticians", 1983)

"Only a 0 correlation is uninteresting, and in practice 0 correlations do not occur. When you stuff a bunch of numbers into the correlation formula, the chance of getting exactly 0, even if no correlation is truly present, is about the same as the chance of a tossed coin ending up on edge instead of heads or tails." (Robert Hooke, "How to Tell the Liars from the Statisticians", 1983)

"Randomization is usually a cheap and harmless way of improving the effectiveness of experimentation with very little extra effort." (Robert Hooke, "How to Tell the Liars from the Statisticians", 1983)

"Science usually amounts to a lot more than blind trial and error. Good statistics consists of much more than just significance tests; there are more sophisticated tools available for the analysis of results, such as confidence statements, multiple comparisons, and Bayesian analysis, to drop a few names. However, not all scientists are good statisticians, or want to be, and not all people who are called scientists by the media deserve to be so described." (Robert Hooke, "How to Tell the Liars from the Statisticians", 1983)

"Statistical reasoning is such a fundamental part of experimental science that the study of principles of data analysis has become a vital part of the scientist's education. Furthermore, […] the existence of a lot of data does not necessarily mean that any useful information is there ready to be extracted." (Robert Hooke, "How to Tell the Liars from the Statisticians", 1983)

"The idea of statistical significance is valuable because it often keeps us from announcing results that later turn out to be nonresults. A significant result tells us that enough cases were observed to provide reasonable assurance of a real effect. It does not necessarily mean, though, that the effect is big enough to be important." (Robert Hooke, "How to Tell the Liars from the Statisticians", 1983)

"Today's scientific investigations are so complicated that even experts in related fields may not understand them well. But there is a logic in the planning of experiments and in the analysis of their results that all intelligent people can grasp, and this logic is a great help in determining when to believe what we hear and read and when to be skeptical. This logic has a great deal to do with statistics, which is why statisticians have a unique interest in the scientific method, and why some knowledge of statistics can so often be brought to bear in distinguishing good arguments from bad ones." (Robert Hooke, "How to Tell the Liars from the Statisticians", 1983)

"When a real situation involves chance we have to use probability mathematics to understand it quantitatively. Direct mathematical solutions sometimes exist […] but most real systems are too complicated for direct solutions. In these cases the computer, once taught to generate random numbers, can use simulation to get useful answers to otherwise impossible problems." (Robert Hooke, "How to Tell the Liars from the Statisticians", 1983)

🖍️Mike Loukides - Collected Quotes

"Data is frequently missing or incongruous. If data is missing, do you simply ignore the missing points? That isn’t always possible. If data is incongruous, do you decide that something is wrong with badly behaved data (after all, equipment fails), or that the incongruous data is telling its own story, which may be more interesting? It’s reported that the discovery of ozone layer depletion was delayed because automated data collection tools discarded readings that were too low. In data science, what you have is frequently all you’re going to get. It’s usually impossible to get 'better' data, and you have no alternative but to work with the data at hand." (Mike Loukides, "What Is Data Science?", 2011).

"Data science isn’t just about the existence of data, or making guesses about what that data might mean; it’s about testing hypotheses and making sure that the conclusions you’re drawing from the data are valid." (Mike Loukides, "What Is Data Science?", 2011)

"Data scientists combine entrepreneurship with patience, the willingness to build data products incrementally, the ability to explore, and the ability to iterate over a solution. They are inherently interdisciplinary. They can tackle all aspects of a problem, from initial data collection and data conditioning to drawing conclusions. They can think outside the box to come up with new ways to view the problem, or to work with very broadly defined problems: 'there’s a lot of data, what can you make from it?'" (Mike Loukides, "What Is Data Science?", 2011)

"Discovery is the key to building great data products, as opposed to products that are merely good." (Mike Loukides, "The Evolution of Data Products", 2011)

"New interfaces for data products are all about hiding the data itself, and getting to what the user wants." (Mike Loukides, "The Evolution of Data Products", 2011)

"The thread that ties most of these applications together is that data collected from users provides added value. Whether that data is search terms, voice samples, or product reviews, the users are in a feedback loop in which they contribute to the products they use. That’s the beginning of data science." (Mike Loukides, "What Is Data Science?", 2011)

"Using data effectively requires something different from traditional statistics, where actuaries in business suits perform arcane but fairly well-defined kinds of analysis. What differentiates data science from statistics is that data science is a holistic approach. We’re increasingly finding data in the wild, and data scientists are involved with gathering data, massaging it into a tractable form, making it tell its story, and presenting that story to others" (Mike Loukides, "What Is Data Science?", 2011).

"Whether we’re talking about web server logs, tweet streams, online transaction records, 'citizen science', data from sensors, government data, or some other source, the problem isn’t finding data, it’s figuring out what to do with it." (Mike Loukides, "What Is Data Science?", 2011)

🖍️Charles Wheelan - Collected Quotes

"A statistical index has all the potential pitfalls of any descriptive statistic - plus the distortions introduced by combining multiple indicators into a single number. By definition, any index is going to be sensitive to how it is constructed; it will be affected both by what measures go into the index and by how each of those measures is weighted." (Charles Wheelan, "Naked Statistics: Stripping the Dread from the Data", 2012)

"Correlation measures the degree to which two phenomena are related to one another. [...] Two variables are positively correlated if a change in one is associated with a change in the other in the same direction, such as the relationship between height and weight. [...] A correlation is negative if a positive change in one variable is associated with a negative change in the other, such as the relationship between exercise and weight." (Charles Wheelan, "Naked Statistics: Stripping the Dread from the Data", 2012)

"Descriptive statistics give us insight into phenomena that we care about. […] Although the field of statistics is rooted in mathematics, and mathematics is exact, the use of statistics to describe complex phenomena is not exact. That leaves plenty of room for shading the truth." (Charles Wheelan, "Naked Statistics: Stripping the Dread from the Data", 2012)

"Even if you have a solid indicator of what you are trying to measure and manage, the challenges are not over. The good news is that 'managing by statistics' can change the underlying behavior of the person or institution being managed for the better. If you can measure the proportion of defective products coming off an assembly line, and if those defects are a function of things happening at the plant, then some kind of bonus for workers that is tied to a reduction in defective products would presumably change behavior in the right kinds of ways. Each of us responds to incentives (even if it is just praise or a better parking spot). Statistics measure the outcomes that matter; incentives give us a reason to improve those outcomes." (Charles Wheelan, "Naked Statistics: Stripping the Dread from the Data", 2012)

"Even in the best of circumstances, statistical analysis rarely unveils “the truth.” We are usually building a circumstantial case based on imperfect data. As a result, there are numerous reasons that intellectually honest individuals may disagree about statistical results or their implications. At the most basic level, we may disagree on the question that is being answered." (Charles Wheelan, "Naked Statistics: Stripping the Dread from the Data", 2012)

"If the distance from the mean for one variable tends to be broadly consistent with distance from the mean for the other variable (e.g., people who are far from the mean for height in either direction tend also to be far from the mean in the same direction for weight), then we would expect a strong positive correlation. If distance from the mean for one variable tends to correspond to a similar distance from the mean for the second variable in the other direction (e.g., people who are far above the mean in terms of exercise tend to be far below the mean in terms of weight), then we would expect a strong negative correlation. If two variables do not tend to deviate from the mean in any meaningful pattern (e.g., shoe size and exercise) then we would expect little or no correlation." (Charles Wheelan, "Naked Statistics: Stripping the Dread from the Data", 2012)

"Once these different measures of performance are consolidated into a single number, that statistic can be used to make comparisons […] The advantage of any index is that it consolidates lots of complex information into a single number. We can then rank things that otherwise defy simple comparison […] Any index is highly sensitive to the descriptive statistics that are cobbled together to build it, and to the weight given to each of those components. As a result, indices range from useful but imperfect tools to complete charades." (Charles Wheelan, "Naked Statistics: Stripping the Dread from the Data", 2012)

"Probability is the study of events and outcomes involving an element of uncertainty." (Charles Wheelan, "Naked Statistics: Stripping the Dread from the Data", 2012)

"Regression analysis, like all forms of statistical inference, is designed to offer us insights into the world around us. We seek patterns that will hold true for the larger population. However, our results are valid only for a population that is similar to the sample on which the analysis has been done." (Charles Wheelan, "Naked Statistics: Stripping the Dread from the Data", 2012)

"Statistics cannot be any smarter than the people who use them. And in some cases, they can make smart people do dumb things." (Charles Wheelan, "Naked Statistics: Stripping the Dread from the Data", 2012)

"The correlation coefficient has two fabulously attractive characteristics. First, for math reasons that have been relegated to the appendix, it is a single number ranging from –1 to 1. A correlation of 1, often described as perfect correlation, means that every change in one variable is associated with an equivalent change in the other variable in the same direction. A correlation of –1, or perfect negative correlation, means that every change in one variable is associated with an equivalent change in the other variable in the opposite direction. The closer the correlation is to 1 or –1, the stronger the association. […] The second attractive feature of the correlation coefficient is that it has no units attached to it. […] The correlation coefficient does a seemingly miraculous thing: It collapses a complex mess of data measured in different units (like our scatter plots of height and weight) into a single, elegant descriptive statistic." (Charles Wheelan, "Naked Statistics: Stripping the Dread from the Data", 2012)

"The problem is that the mechanics of regression analysis are not the hard part; the hard part is determining which variables ought to be considered in the analysis and how that can best be done. Regression analysis is like one of those fancy power tools. It is relatively easy to use, but hard to use well - and potentially dangerous when used improperly." (Charles Wheelan, "Naked Statistics: Stripping the Dread from the Data", 2012)

"There are limits on the data we can gather and the kinds of experiments we can perform."(Charles Wheelan, "Naked Statistics: Stripping the Dread from the Data", 2012)

"While the main point of statistics is to present a meaningful picture of things we care about, in many cases we also hope to act on these numbers." (Charles Wheelan, "Naked Statistics: Stripping the Dread from the Data", 2012)

04 April 2006

🖍️Brian Godsey - Collected Quotes

"A good software developer (or engineer) and a good data scientist have several traits in common. Both are good at designing and building complex systems with many interconnected parts; both are familiar with many different tools and frameworks for building these systems; both are adept at foreseeing potential problems in those systems before they’re actualized. But in general, software developers design systems consisting of many well-defined components, whereas data scientists work with systems wherein at least one of the components isn’t well defined prior to being built, and that component is usually closely involved with data processing or analysis." (Brian Godsey, "Think Like a Data Scientist", 2017)

"A notable difference between many fields and data science is that in data science, if a customer has a wish, even an experienced data scientist may not know whether it’s possible. Whereas a software engineer usually knows what tasks software tools are capable of performing, and a biologist knows more or less what the laboratory can do, a data scientist who has not yet seen or worked with the relevant data is faced with a large amount of uncertainty, principally about what specific data is available and about how much evidence it can provide to answer any given question. Uncertainty is, again, a major factor in the data scientific process and should be kept at the forefront of your mind when talking with customers about their wishes." (Brian Godsey, "Think Like a Data Scientist", 2017)

"The process of data science begins with preparation. You need to establish what you know, what you have, what you can get, where you are, and where you would like to be. This last one is of utmost importance; a project in data science needs to have a purpose and corresponding goals. Only when you have well-defined goals can you begin to survey the available resources and all the possibilities for moving toward those goals." (Brian Godsey, "Think Like a Data Scientist", 2017)

"Uncertainty is an adversary of coldly logical algorithms, and being aware of how those algorithms might break down in unusual circumstances expedites the process of fixing problems when they occur - and they will occur. A data scientist’s main responsibility is to try to imagine all of the possibilities, address the ones that matter, and reevaluate them all as successes and failures happen." (Brian Godsey, "Think Like a Data Scientist", 2017)

🖍️Ely Devons - Collected Quotes

"Every economic and social situation or problem is now described in statistical terms, and we feel that it is such statistics which give us the real basis of fact for understanding and analysing problems and difficulties, and for suggesting remedies. In the main we use such statistics or figures without any elaborate theoretical analysis; little beyond totals, simple averages and perhaps index numbers. Figures have become the language in which we describe our economy or particular parts of it, and the language in which we argue about policy." (Ely Devons, "Essays in Economics", 1961)

"Indeed the language of statistics is rarely as objective as we imagine. The way statistics are presented, their arrangement in a particular way in tables, the juxtaposition of sets of figures, in itself reflects the judgment of the author about what is significant and what is trivial in the situation which the statistics portray." (Ely Devons, "Essays in Economics", 1961)

"It might be reasonable to expect that the more we know about any set of statistics, the greater the confidence we would have in using them, since we would know in which directions they were defective; and that the less we know about a set of figures, the more timid and hesitant we would be in using them. But, in fact, it is the exact opposite which is normally the case; in this field, as in many others, knowledge leads to caution and hesitation, it is ignorance that gives confidence and boldness. For knowledge about any set of statistics reveals the possibility of error at every stage of the statistical process; the difficulty of getting complete coverage in the returns, the difficulty of framing answers precisely and unequivocally, doubts about the reliability of the answers, arbitrary decisions about classification, the roughness of some of the estimates that are made before publishing the final results. Knowledge of all this, and much else, in detail, about any set of figures makes one hesitant and cautious, perhaps even timid, in using them." (Ely Devons, "Essays in Economics", 1961)

"The art of using the language of figures correctly is not to be over-impressed by the apparent air of accuracy, and yet to be able to take account of error and inaccuracy in such a way as to know when, and when not, to use the figures. This is a matter of skill, judgment, and experience, and there are no rules and short cuts in acquiring this expertness." (Ely Devons, "Essays in Economics", 1961)

"The knowledge that the economist uses in analysing economic problems and in giving advice on them is of thre First, theories of how the economic system works (and why it sometimes does not work so well); second, commonsense maxims about reasonable economic behaviour; and third, knowledge of the facts describing the main features of the economy, many of these facts being statistical." (Ely Devons, "Essays in Economics", 1961)

"The general models, even of the most elaborate kind, serve the simple purpose of demonstrating the interconnectedness of all economic phenomena, and show how, under certain conditions, price may act as a guiding link between them. Looked at in another way such models show how a complex set of interrelations can hang together consistently without any central administrative direction." (Ely Devons, "Essays in Economics", 1961)

"The most important and frequently stressed prescription for avoiding pitfalls in the use of economic statistics, is that one should find out before using any set of published statistics, how they have been collected, analysed and tabulated. This is especially important, as you know, when the statistics arise not from a special statistical enquiry, but are a by-product of law or administration. Only in this way can one be sure of discovering what exactly it is that the figures measure, avoid comparing the non-comparable, take account of changes in definition and coverage, and as a consequence not be misled into mistaken interpretations and analysis of the events which the statistics portray." (Ely Devons, "Essays in Economics", 1961)

"The two most important characteristics of the language of statistics are first, that it describes things in quantitative terms, and second, that it gives this description an air of accuracy and precision. The limitations, as well as the advantages, of the statistical approach arise from these two characteristics. For a description of the quantitative aspect of events never gives us the whole story; and even the best statistics are never, and never can be, completely accurate and precise. To avoid misuse of the language we must, therefore, guard against exaggerating the importance of the elements in any situation that can be described quantitatively, and we must know sufficient about the error and inaccuracy of the figures to be able to use them with discretion." (Ely Devons, "Essays in Economics", 1961)

"There are, indeed, plenty of ways in which statistics can help in the process of decision-taking. But exaggerated claims for the role they can play merely serve to confuse rather than clarify issues of public policy, and lead those responsible for action to oscillate between over-confidence and over-scepticism in using them." (Ely Devons, "Essays in Economics", 1961)

"There is a demand for every issue of economic policy to be discussed in terms of statistics, and even those who profess a general distrust of statistics are usually more impressed by an argument in support of a particular policy if it is backed up by figures. There is a passionate desire in our society to see issues of economic policy decided on what we think are rational grounds. We rebel against any admission of the uncertainty of our knowledge of the future as a confession of weakness." (Ely Devons, "Essays in Economics", 1961)

"There seems to be striking similarities between the role of economic statistics in our society and some of the functions which magic and divination play in primitive society." (Ely Devons, "Essays in Economics", 1961)

"This exaggerated influence of statistics resulting from willingness, indeed eagerness, to be impressed by the 'hard facts' provided by the 'figures', may play an important role in decision-making." (Ely Devons, "Essays in Economics", 1961)

"We all know that in economic statistics particularly, true precision, comparability and accuracy is extremely difficult to achieve, and it is for this reason that the language of economic statistics is so difficult to handle." (Ely Devons, "Essays in Economics", 1961)

03 April 2006

🖍️Kristin H Jarman - Collected Quotes

"A study is any data collection exercise. The purpose of any study is to answer a question. [...] Once the question has been clearly articulated, it’s time to design a study to answer it. At one end of the spectrum, a study can be a controlled experiment, deliberate and structured, where researchers act like the ultimate control freaks, manipulating everything from the gender of their test subjects to the humidity in the room. Scientific studies, the kind run by men in white lab coats and safety goggles, are often controlled experiments. At the other end of the spectrum, an observational study is simply the process of watching something unfold without trying to impact the outcome in any way." (Kristin H Jarman, "The Art of Data Analysis: How to answer almost any question using basic statistics", 2013)

"According to the central limit theorem, it doesn’t matter what the raw data look like, the sample variance should be proportional to the number of observations and if I have enough of them, the sample mean should be normal." (Kristin H Jarman, "The Art of Data Analysis: How to answer almost any question using basic statistics", 2013)

"Although it’s a little more complicated than [replication and random sampling], blocking is a powerful way to eliminate confounding factors. Blocking is the process of dividing a sample into one or more similar groups, or blocks, so that samples in each block have certain factors in common. This technique is a great way to gain a little control over an experiment with lots of uncontrollable factors." (Kristin H Jarman, "The Art of Data Analysis: How to answer almost any question using basic statistics", 2013)

"Any factor you don’t account for can become a confounding factor. A confounding factor is any variable that confuses the conclusions of your study, or makes them ambiguous. [...] Confounding factors can really screw up an otherwise perfectly good statistical analysis." (Kristin H Jarman, "The Art of Data Analysis: How to answer almost any question using basic statistics", 2013)

"Any time you collect data, you have uncertainty to deal with. This uncertainty comes from two places: (1) inherent variation in the values a random variable can take on and (2) the fact that for most studies, you can’t capture the entire population and so you must rely on a sample to make your conclusions." (Kristin H Jarman, "The Art of Data Analysis: How to answer almost any question using basic statistics", 2013)

"Choosing and organizing a sample is a crucial part of the experimental design process. Statistically speaking, the best type of sample is called a random sample. A random sample is a subset of the entire population, chosen so each member is equally likely to be picked. [...] Random sampling is the best way to guarantee you’ve chosen objectively, without personal preference or bias." (Kristin H Jarman, "The Art of Data Analysis: How to answer almost any question using basic statistics", 2013)
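As a rough illustration of the random sample described in the quote above, the following Python sketch draws a subset in which each member of the population is equally likely to be picked. The function name `draw_random_sample`, the population of 100 numbered members, and the seed are assumptions made for this example; none of them come from the book.

```python
import random


def draw_random_sample(population, size, seed=None):
    """Draw `size` distinct members, each equally likely to be chosen.

    `seed` is optional and only makes the draw repeatable for illustration.
    """
    rng = random.Random(seed)
    # random.Random.sample picks without replacement, so no member repeats.
    return rng.sample(list(population), size)


# Hypothetical population: members numbered 1 to 100.
population = list(range(1, 101))
sample = draw_random_sample(population, 10, seed=42)
print(sample)
```

Sampling without replacement is one reasonable reading of "a subset of the entire population"; a study that allows repeats would need a different call.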

"Probability, the mathematical language of uncertainty, describes what are called random experiments, bets, campaigns, trials, games, brawls, and any other situation where the outcome isn’t known beforehand. A probability is a fraction, a value between zero and one that measures the likelihood a given outcome will occur. A probability of zero means the outcome is virtually impossible. A probability of one means it will almost certainly happen. A probability of one-half means the outcome is just as likely to occur as not." (Kristin H Jarman, "The Art of Data Analysis: How to answer almost any question using basic statistics", 2013)

"Replication is the process of taking more than one observation or measurement. [...] Replication helps eliminate negative effects of uncontrollable factors, because it keeps us from getting fooled by a single, unusual outcome." (Kristin H Jarman, "The Art of Data Analysis: How to answer almost any question using basic statistics", 2013)

"The random experiment, or trial, is the situation whose outcome is uncertain, the one you’re watching. A coin toss is a random experiment, because you don’t know beforehand whether it will turn up heads or tails. The sample space is the list of all possible separate and distinct outcomes in your random experiment. The sample space in a coin toss contains the two outcomes heads and tails. The outcome you're interested in calculating a probability for is the event. On a coin toss, that might be the case where the coin lands on heads." (Kristin H Jarman, "The Art of Data Analysis: How to answer almost any question using basic statistics", 2013)
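The terms in the quote above (random experiment, sample space, event) can be sketched in a few lines of Python. This is a minimal example that assumes every outcome in the sample space is equally likely, which holds for a fair coin but is not stated as a general rule in the quote; the function name `event_probability` is hypothetical.

```python
from fractions import Fraction


def event_probability(sample_space, event):
    """Probability of an event, assuming equally likely outcomes."""
    outcomes = set(sample_space)
    # Only outcomes that belong to the sample space can count as favorable.
    favorable = outcomes & set(event)
    return Fraction(len(favorable), len(outcomes))


# Random experiment: one coin toss. Sample space: heads and tails.
coin = ["heads", "tails"]
# Event of interest: the coin lands on heads.
p_heads = event_probability(coin, {"heads"})
print(p_heads)
```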

"The scientific method is the foundation of modern research. It’s how we prove a theory. It’s how we demonstrate cause and effect. It’s how we discover, innovate, and invent. There are five basic steps to the scientific method: (1) Ask a question. (2) Conduct background research. (3) Come up with a hypothesis. (4) Test the hypothesis with data. (5) Revise and retest the hypothesis until a conclusion can be made." (Kristin H Jarman, "The Art of Data Analysis: How to answer almost any question using basic statistics", 2013)

"There are three important requirements for the probability distribution. First, it should be defined for every possible value the random variable can take on. In other words, it should completely describe the sample space of a random experiment. Second, the probability distribution values should always be nonnegative. They’re meant to measure probabilities, after all, and probabilities are never less than zero. Finally, when all the probability distribution values are summed together, they must add to one." (Kristin H Jarman, "The Art of Data Analysis: How to answer almost any question using basic statistics", 2013)
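The three requirements in the quote above can be checked mechanically for a discrete distribution. The sketch below is an illustration only: the function name `is_valid_distribution` and the fair six-sided die are assumptions chosen for the example, and exact fractions are used so that the sum-to-one check is not disturbed by floating-point rounding.

```python
from fractions import Fraction


def is_valid_distribution(pmf, sample_space):
    """Check the three requirements for a discrete probability distribution."""
    # 1. Defined for every possible value of the random variable.
    defined_everywhere = set(pmf) == set(sample_space)
    # 2. Every value is nonnegative.
    nonnegative = all(p >= 0 for p in pmf.values())
    # 3. All values add up to one.
    sums_to_one = sum(pmf.values()) == 1
    return defined_everywhere and nonnegative and sums_to_one


# Hypothetical example: a fair six-sided die.
die = {face: Fraction(1, 6) for face in range(1, 7)}
print(is_valid_distribution(die, range(1, 7)))
```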

♯OOP: Attribute (Definitions)

"Additional characteristics or information defined for an entity." (Owen Williams, "MCSE TestPrep: SQL Server 6.5 Design and Implementation", 1998)

"A named characteristic or property of a class." (Craig Larman, "Applying UML and Patterns", 2004)

"A characteristic, quality, or property of an entity class. For example, the properties 'First Name' and 'Last Name' are attributes of entity class 'Person'." (Danette McGilvray, "Executing Data Quality Projects", 2008)

"Another name for a field, used by convention in many object-oriented programming languages. Scala follows Java’s convention of preferring the term field over attribute." (Dean Wampler & Alex Payne, "Programming Scala", 2009)

"1. (UML diagram) A descriptor of a kind of information captured about an object class. 2. (Relational theory) The definition of a descriptor of a relation." (David C Hay, "Data Model Patterns: A Metadata Map", 2010)

"A fact type element (specifically a characteristic assignment) that is a descriptor of an entity class." (David C Hay, "Data Model Patterns: A Metadata Map", 2010)

"A characteristic of an object." (Requirements Engineering Qualifications Board, "Standard glossary of terms used in Requirements Engineering", 2011)

"An inherent characteristic, an accidental quality, an object closely associated with or belonging to a specific person, place, or office; a word ascribing a quality." (DAMA International, "The DAMA Dictionary of Data Management", 2011)

02 April 2006

🖍️Prashant Natarajan - Collected Quotes

"Data quality in warehousing and BI is typically defined in terms of the 4 C’s—is the data clean, correct, consistent, and complete? When it comes to big data, there are two schools of thought that have different views and expectations of data quality. The first school believes that the gold standard of the 4 C’s must apply to all data (big and little) used for clinical care and performance metrics. The second school believes that in big data environments, a stringent data quality standard is impossible, too costly, or not required. While diametrically opposite opinions may play well in panel discussions, they do little to reconcile the realities of healthcare data quality." (Prashant Natarajan et al, "Demystifying Big Data and Machine Learning for Healthcare", 2017)

"Data warehousing has always been difficult, because leaders within an organization want to approach warehousing and analytics as just another technology or application buy. Viewed in this light, they fail to understand the complexity and interdependent nature of building an enterprise reporting environment." (Prashant Natarajan et al, "Demystifying Big Data and Machine Learning for Healthcare", 2017)

"Generalization is a core concept in machine learning; to be useful, machine-learning algorithms can’t just memorize the past, they must learn from the past. Generalization is the ability to respond properly to new situations based on experience from past situations." (Prashant Natarajan et al, "Demystifying Big Data and Machine Learning for Healthcare", 2017)

"The field of big-data analytics is still littered with a few myths and evidence-free lore. The reasons for these myths are simple: the emerging nature of technologies, the lack of common definitions, and the non-availability of validated best practices. Whatever the reasons, these myths must be debunked, as allowing them to persist usually has a negative impact on success factors and Return on Investment (RoI). On a positive note, debunking the myths allows us to set the right expectations, allocate appropriate resources, redefine business processes, and achieve individual/organizational buy-in." (Prashant Natarajan et al, "Demystifying Big Data and Machine Learning for Healthcare", 2017)

"The first myth is that prediction is always based on time-series extrapolation into the future (also known as forecasting). This is not the case: predictive analytics can be applied to generate any type of unknown data, including past and present. In addition, prediction can be applied to non-temporal (time-based) use cases such as disease progression modeling, human relationship modeling, and sentiment analysis for medication adherence, etc. The second myth is that predictive analytics is a guarantor of what will happen in the future. This also is not the case: predictive analytics, due to the nature of the insights they create, are probabilistic and not deterministic. As a result, predictive analytics will not be able to ensure certainty of outcomes." (Prashant Natarajan et al, "Demystifying Big Data and Machine Learning for Healthcare", 2017)

"Your machine-learning algorithm should answer a very specific question that tells you something you need to know and that can be answered appropriately by the data you have access to. The best first question is something you already know the answer to, so that you have a reference and some intuition to compare your results with. Remember: you are solving a business problem, not a math problem." (Prashant Natarajan et al, "Demystifying Big Data and Machine Learning for Healthcare", 2017)