20 April 2006

🖍️Manfred Drosg - Collected Quotes

"A histogram consists of the outline of bars of equal width and appropriate length next to each other. By connecting the frequency values at the position of the nominal values (the midpoints of the intervals) with straight lines, a frequency polygon is obtained. Attaching classes with frequency zero at either end makes the area (the integral) under the frequency polygon equal  to that under the histogram." (Manfred Drosg, "Dealing with Uncertainties: A Guide to Error Analysis", 2007)

"A valid digit is not necessarily a significant digit. The significance of numbers is a result of its scientific context." (Manfred Drosg, "Dealing with Uncertainties: A Guide to Error Analysis", 2007)

"[myth:] Accuracy is more important than precision. For single best estimates, be it a mean value or a single data value, this question does not arise because in that case there is no difference between accuracy and precision. (Think of a single shot aimed at a target.) Generally, it is good practice to balance precision and accuracy. The actual requirements will differ from case to case." (Manfred Drosg, "Dealing with Uncertainties: A Guide to Error Analysis", 2007)

"Any scientific data without (a stated) uncertainty is of no avail. Therefore the analysis and description of uncertainty are almost as important as those of the data value itself . It should be clear that the uncertainty itself also has an uncertainty – due to its nature as a scientific quantity – and so on. The uncertainty of an uncertainty is generally not determined." (Manfred Drosg, "Dealing with Uncertainties: A Guide to Error Analysis", 2007)

"As uncertainties of scientific data values are nearly as important as the data values themselves, it is usually not acceptable that a best estimate is only accompanied by an estimated uncertainty. Therefore, only the size of nondominant uncertainties should be estimated. For estimating the size of a nondominant uncertainty we need to find its upper limit, i.e., we want to be as sure as possible that the uncertainty does not exceed a certain value." (Manfred Drosg, "Dealing with Uncertainties: A Guide to Error Analysis", 2007)

"Before best estimates are extracted from data sets by way of a regression analysis, the uncertainties of the individual data values must be determined.In this case care must be taken to recognize which uncertainty components are common to all the values, i.e., those that are correlated (systematic)." (Manfred Drosg, "Dealing with Uncertainties: A Guide to Error Analysis", 2007)

"Before discarding a data point one should investigate the possible reasons for this faulty data value." (Manfred Drosg, "Dealing with Uncertainties: A Guide to Error Analysis", 2007)

"Correlation analysis can help us find the size of the formal relation between two properties. An equidirectional variation is present if we observe high values of one variable together with high values of the other variable (or low ones combined with low ones). In this case there is a positive correlation. If high values are combined with low values and low values with high values, the variation is counterdirectional, and the correlation is negative." (Manfred Drosg, "Dealing with Uncertainties: A Guide to Error Analysis", 2007)

"[myth:] Counting can be done without error. Usually, the counted number is an integer and therefore without (rounding) error. However, the best estimate of a scientifically relevant value obtained by counting will always have an error. These errors can be very small in cases of consecutive counting, in particular of regular events, e.g., when measuring frequencies." (Manfred Drosg, "Dealing with Uncertainties: A Guide to Error Analysis", 2007)

"Due to the theory that underlies uncertainties an infinite number of data values would be necessary to determine the true value of any quantity. In reality the number of available data values will be relatively small and thus this requirement can never be fully met; all one can get is the best estimate of the true value." (Manfred Drosg, "Dealing with Uncertainties: A Guide to Error Analysis", 2007)

"For linear dependences the main information usually lies in the slope. It is obvious that those points that lie far apart have the strongest influence on the slope if all points have the same uncertainty. In this context we speak of the strong leverage of distant points; when determining the parameter “slope” these distant points carry more effective weight. Naturally, this weight is distinct from the “statistical” weight usually used in regression analysis." (Manfred Drosg, "Dealing with Uncertainties: A Guide to Error Analysis", 2007)

"For some scientific data the true value cannot be given by a constant or some straightforward mathematical function but by a probability distribution or an expectation value. Such data are called probabilistic. Even so, their true value does not change with time or place, making them distinctly different from  most statistical data of everyday life." (Manfred Drosg, "Dealing with Uncertainties: A Guide to Error Analysis", 2007)

"If there is an outlier there are two possibilities: The model is wrong– after all, a theory is the basis on which we decide whether a data point is an outlier (an unexpected value) or not. The value of the data point is wrong because of a failure of the apparatus or a human mistake. There is a third possibility, though: The data point might not be an actual  outlier, but part of a (legitimate) statistical fluctuation." (Manfred Drosg, "Dealing with Uncertainties: A Guide to Error Analysis", 2007)

"In error analysis the so-called 'chi-squared' is a measure of the agreement between the uncorrelated internal and the external uncertainties of a measured functional relation. The simplest such relation would be time independence. Theory of the chi-squared requires that the uncertainties be normally distributed. Nevertheless, it was found that the test can be applied to most probability distributions encountered in practice." (Manfred Drosg, "Dealing with Uncertainties: A Guide to Error Analysis", 2007)

"In many cases systematic errors are interpreted as the systematic difference between nature (which is being questioned by the experimenter in his experiment) and the model (which is used to describe nature). If the model used is not good enough, but the measurement result is interpreted using this model, the final result (the interpretation) will be wrong because it is biased, i.e., it has a systematic deviation (not uncertainty). If we do not use the best model (the best theory) available for the description of a certain phenomenon this procedure is just wrong. It has nothing to do with an uncertainty." (Manfred Drosg, "Dealing with Uncertainties: A Guide to Error Analysis", 2007)

"In science we try to explain reality by using models (theories). This is necessary because reality itself is too complex. So we need to come up with a model for that aspect of reality we want to understand – usually with the help of mathematics. Of course, these models or theories can only be simplifications of that part of reality we are looking at. A model can never be a perfect description of reality, and there can never be a part of reality perfectly mirroring a model." (Manfred Drosg, "Dealing with Uncertainties: A Guide to Error Analysis", 2007)

"It is also inevitable for any model or theory to have an uncertainty (a difference between model and reality). Such uncertainties apply both to the numerical parameters of the model and to the inadequacy of the model as well. Because it is much harder to get a grip on these types of uncertainties, they are disregarded, usually." (Manfred Drosg, "Dealing with Uncertainties: A Guide to Error Analysis", 2007)

"It is important that uncertainty components that are independent of each other are added quadratically. This is also true for correlated uncertainty components, provided they are independent of each other, i.e., as long as there is no correlation between the components." (Manfred Drosg, "Dealing with Uncertainties: A Guide to Error Analysis", 2007)

"It is important to pay heed to the following detail: a disadvantage of logarithmic diagrams is that a graphical integration is not possible, i.e., the area under the curve (the integral) is of no relevance." (Manfred Drosg, "Dealing with Uncertainties: A Guide to Error Analysis", 2007)

"It is the aim of all data analysis that a result is given in form of the best estimate of the true value. Only in simple cases is it possible to use the data value itself as result and thus as best estimate." (Manfred Drosg, "Dealing with Uncertainties: A Guide to Error Analysis", 2007)

"It is the nature of an uncertainty that it is not known and can never be known, whether the best estimate is greater or less than the true value." (Manfred Drosg, "Dealing with Uncertainties: A Guide to Error Analysis", 2007)

"Outliers or flyers are those data points in a set that do not quite fit within the rest of the data, that agree with the model in use. The uncertainty of such an outlier is seemingly too small. The discrepancy between outliers and the model should be subject to thorough examination and should be given much thought. Isolated data points, i.e., data points that are at some distance from the bulk of the data are not outliers if their values are in agreement with the model in use." (Manfred Drosg, "Dealing with Uncertainties: A Guide to Error Analysis", 2007)

"[myth:] Random errors can always be determined by repeating measurements under identical conditions. […] this statement is true only for time-related random errors ." (Manfred Drosg, "Dealing with Uncertainties: A Guide to Error Analysis", 2007)

"[myth:] Systematic errors can be determined inductively. It should be quite obvious that it is not possible to determine the scale error from the pattern of data values." (Manfred Drosg, "Dealing with Uncertainties: A Guide to Error Analysis", 2007)

"The fact that the same uncertainty (e.g., scale uncertainty) is uncorrelated if we are dealing with only one measurement, but correlated (i.e., systematic) if we look at more than one measurement using the same instrument shows that both types of uncertainties are of the same nature. Of course, an uncertainty keeps its characteristics (e.g., Poisson distributed), independent of the fact whether it occurs only once or more often." (Manfred Drosg, "Dealing with Uncertainties: A Guide to Error Analysis", 2007)

"To fulfill the requirements of the theory underlying uncertainties, variables with random uncertainties must be independent of each other and identically distributed. In the limiting case of an infinite number of such variables, these are called normally distributed. However, one usually speaks of normally distributed variables even if their number is finite." (Manfred Drosg, "Dealing with Uncertainties: A Guide to Error Analysis", 2007)

19 April 2006

🖍️Jesús Rogel-Salazar - Collected Quotes

"[...] a data scientist role goes beyond the collection and reporting on data; it must involve looking at a business The role of a data scientist goes beyond the collection and reporting on data. application or process from multiple vantage points and determining what the main questions and follow-ups are, as well as recommending the most appropriate ways to employ the data at hand." (Jesús Rogel-Salazar, "Data Science and Analytics with Python", 2017)

"High-bias models typically produce simpler models that do not overfit and in those cases the danger is that of underfitting. Models with low-bias are typically more complex and that complexity enables us to represent the training data in a more accurate way. The danger here is that the flexibility provided by higher complexity may end up representing not only a relationship in the data but also the noise. Another way of portraying the bias-variance trade-off is in terms of complexity v simplicity." (Jesús Rogel-Salazar, "Data Science and Analytics with Python", 2017) 

"In terms of characteristics, a data scientist has an inquisitive mind and is prepared to explore and ask questions, examine assumptions and analyse processes, test hypotheses and try out solutions and, based on evidence, communicate informed conclusions, recommendations and caveats to stakeholders and decision makers." (Jesús Rogel-Salazar, "Data Science and Analytics with Python", 2017)

"Munging, or wrangling data is actually the most time-consuming task in the data science workflow. [...] Data preparation is key to the extraction of valuable insight and although some may prefer to concentrate only on the much more fun modelling part, the fact that you get to know your dataset inside out while munging it implies that any new or follow-up questions can probably be attained with less effort." (Jesús Rogel-Salazar, "Data Science and Analytics with Python", 2017)

"The tension between bias and variance, simplicity and complexity, or underfitting and overfitting is an area in the data science and analytics process that can be closer to a craft than a fixed rule. The main challenge is that not only is each dataset different, but also there are data points that we have not yet seen at the moment of constructing the model. Instead, we are interested in building a strategy that enables us to tell something about data from the sample used in building the model." (Jesús Rogel-Salazar, "Data Science and Analytics with Python", 2017)

"One important thing to bear in mind about the outputs of data science and analytics is that in the vast majority of cases they do not uncover hidden patterns or relationships as if by magic, and in the case of predictive analytics they do not tell us exactly what will happen in the future. Instead, they enable us to forecast what may come. In other words, once we have carried out some modelling there is still a lot of work to do to make sense out of the results obtained, taking into account the constraints and assumptions in the model, as well as considering what an acceptable level of reliability is in each scenario." (Jesús Rogel-Salazar, "Data Science and Analytics with Python", 2017)

🖍️Francis Galton - Collected Quotes

"A visual image is the most perfect form of mental representation wherever the shape, position, and relations of objects in space are concerned. It is of importance in every handicraft and profession where design is required." (Francis Galton, "Mental Imagery" [in "Inquiries into Human Faculty and Development"] 1883)

"The object of statistical science is to discover methods of condensing information concerning large groups of allied facts into brief and compendious expressions suitable for discussion. The possibility of doing this is based on the constancy and continuity with which objects of the same species are found to vary." (Sir Francis Galton, "Inquiries into Human Faculty and Its Development, Statistical Methods", 1883) 

"It is always well to retain a clear geometric view of the facts when we are dealing with statistical problems, which abound with dangerous pitfalls, easily overlooked by the unwary, while they are cantering gaily along upon their arithmetic." (Sir Francis Galton, "Natural Inheritance", 1889)

"It is difficult to understand why statisticians commonly limit their inquiries to Averages, and do not revel in more comprehensive views. […] An Average is but a solitary fact, whereas if a single other fact be added to it, an entire Normal Scheme, which nearly corresponds to the observed one, starts potentially into existence. Some people hate the very name of statistics, but I find them full of beauty and interest. Whenever they are not brutalised, but delicately handled by the higher methods, and are warily interpreted, their power of dealing with complicated phenomena is extraordinary. They are the only tools by which an opening can be cut through the formidable thicket of difficulties that bars the path of those who pursue the Science of man." (Sir Francis Galton, "Natural Inheritance", 1889)

"It is difficult to understand why statisticians commonly limit their inquiries to Averages, and do not revel in more comprehensive views. Their souls seem as dull to the charm of variety as that of the native of one of our flat English counties, whose retrospect of Switzerland was that, if its mountains could be thrown into its lakes, two nuisances would be got rid of at once. An Average is but a solitary fact, whereas if a single other fact be added to it, an entire Normal Scheme, which nearly corresponds to the observed one, starts potentially into existence." (Sir Francis Galton, "Natural Inheritance", 1889)

"[Statistics] are the only tools by which an opening can be cut through the formidable thicket of difficulties that bars the path of those who pursue the Science of man." (Sir Francis Galton, "Natural Inheritance", 1889)

"Every statistician wants now and then to test the practical value of some theoretical process, it may be of smoothing, or of interpola- tion, or of obtaining a measure of variability, or of making some particular deduction or inference." (Francis Galton, Nature vol. 42, [letter] 1890)

"It is now beginning to be generally understood, even by merely practical statisticians, that there is truth in the theory that all variability is much the same kind." (Francis Galton, "Kinship and Correlation", North American Review Vol. 150 (11), 1890)

"Reflection soon made it clear to me that not only were the two new problems identical in principle with the old one of kinship which I had already solved, but that all three of them were no more than special cases of a much more general problem - namely, that of Correlation." (Francis Galton,"Kinship and Correlation", 1890) 

"It had appeared from observation, and it was fully confirmed by this theory, that such a thing existed as an 'Index of Correlation', that is to say, a fraction, now commonly written T, that connects with close approximation every value of the deviation on the part of the subject, with the average of all the associated deviations of the Relative [...]" (Francis Galton, "Memories of My Life", 1908)

"[Statistics are] the only tools by which an opening can be cut through the formidable thicket of difficulties that bars the path of those who pursue the Science of Man." (Sir Ronald Galton)

More quotes from the author at quotablemath.blogspot.com

🖍️Frederick Mosteller - Collected Quotes

"As usual we may make the errors of I) rejecting the null hypothesis when it is true, II) accepting the null hypothesis when it is false. But there is a third kind of error which is of interest because the present test of significance is tied up closely with the idea of making a correct decision about which distribution function has slipped furthest to the right. We may make the error of III) correctly rejecting the null hypothesis for the wrong reason." (Frederick Mosteller, "A k-Sample Slippage Test for an Extreme Population", The Annals of Mathematical Statistics 19, 1948)

"Errors of the third kind happen in conventional tests of differences of means, but they are usually not considered, although their existence is probably recognized. It seems to the author that there may be several reasons for this among which are 1) a preoccupation on the part of mathematical statisticians with the formal questions of acceptance and rejection of null hypotheses without adequate consideration of the implications of the error of the third kind for the practical experimenter, 2) the rarity with which an error of the third kind arises in the usual tests of significance." (Frederick Mosteller, "A k-Sample Slippage Test for an Extreme Population", The Annals of Mathematical Statistics 19, 1948)

"For many purposes graphical accuracy is sufficient. The speed of graphical processes, and more especially the advantages of visual presentation in pointing out facts or clues which might otherwise be overlooked, make graphical analysis very valuable." (Frederick Mosteller & John W Tukey, "The Uses and Usefulness of Binomial Probability Paper?", Journal of the American Statistical Association 44, 1949)

"If significance tests are required for still larger samples, graphical accuracy is insufficient, and arithmetical methods are advised. A word to the wise is in order here, however. Almost never does it make sense to use exact binomial significance tests on such data - for the inevitable small deviations from the mathematical model of independence and constant split have piled up to such an extent that the binomial variability is deeply buried and unnoticeable. Graphical treatment of such large samples may still be worthwhile because it brings the results more vividly to the eye." (Frederick Mosteller & John W Tukey, "The Uses and Usefulness of Binomial Probability Paper?", Journal of the American Statistical Association 44, 1949)

"Scientific and technological advances have made the world we live in complex and hard to understand. […] Science itself shows the same growing complexity. We often hear that 'one man can no longer cover a broad enough field' and that 'there is too much narrow specialization'. And yet these complexities must be met - and resolved. At all levels, decisions must be made which involve consideration of more than a single field." (Frederick Mosteller et al, "The Education of a Scientific Generalist", Science 109,1949)

"Mathematical models for empirical phenomena aid the development of a science when a sufficient body of quantitative information has been accumulated. This accumulation can be used to point the direction in which models should be constructed and to test the adequacy of such models in their interim states. Models, in turn, frequently are useful in organizing and interpreting experimental data and in suggesting new directions for experimental research." (Robert R. Bush & Frederick Mosteller, "A Mathematical Model for Simple Learning", Psychological Review 58, 1951)

"Almost any sort of inquiry that is general and not particular involves both sampling and measurement […]. Further, both the measurement and the sampling will be imperfect in almost every case. We can define away either imperfection in certain cases. But the resulting appearance of perfection is usually only an illusion." (Frederick Mosteller et al, "Principles of Sampling", Journal of the American Statistical Association Vol. 49 (265), 1954)

"Because representativeness is inherent in the sampling plan and not in the particular sample at hand, we can never make adequate use of sample results without some measure of how well the results of this particular sample are likely to agree with the results of other samples which the same sampling plan might have provided. The ability to assess stability fairly is as important as the ability to represent the population fairly. Modern sampling plans concentrate on both." (Frederick Mosteller et al, "Principles of Sampling", Journal of the American Statistical Association Vol. 49 (265), 1954)

"By sampling we can learn only about collective properties of populations, not about properties of individuals. We can study the average height, the percentage who wear hats, or the variability in weight of college juniors [...]. The population we study may be small or large, but there must be a population - and what we are studying must be a population characteristic. By sampling, we cannot study individuals as particular entities with unique idiosyncrasies; we can study regularities (including typical variabilities as well as typical levels) in a population as exemplified by the individuals in the sample." (Frederick Mosteller et al, "Principles of Sampling", Journal of the American Statistical Association Vol. 49 (265), 1954)

"In many cases general probability samples can be thought of in terms of (1) a subdivision of the population into strata, (2) a self-weighting probability sample in each stratum, and (3) combination of the stratum sample means weighted by the size of the stratum." (Frederick Mosteller et al, "Principles of Sampling", Journal of the American Statistical Association Vol. 49 (265), 1954)

"That which can be and should be representative is the sampling plan, which includes the manner in which the sample was drawn (essentially a specification of what other samples might have been drawn and what the relative chances of selection were for any two possible samples) and how it is to be analyzed. [...] It is clear that many [...] groups fail to be represented in any particular sample, yet this is not a criticism of that sample. Representation is not, and should not be, by groups. It is, and should be, by individuals as members of the sampled population. Representation is not, and should not be, in any particular sample. It is, and should be, in the sampling plan." (Frederick Mosteller et al, "Principles of Sampling", Journal of the American Statistical Association Vol. 49 (265), 1954)

"The main purpose of a significance test is to inhibit the natural enthusiasm of the investigator." (Frederick Mosteller, "Selected Quantitative Techniques", 1954)

"We must emphasize that such terms as 'select at random', 'choose at random', and the like, always mean that some mechanical device, such as coins, cards, dice, or tables of random numbers, is used." (Frederick Mosteller et al, "Principles of Sampling", Journal of the American Statistical Association Vol. 49 (265), 1954)

"We have made the sampling plan representative, not by giving each individual an equal chance to enter the sample and then weighting them equally, but by a more noticeable process of compensation, where those individuals very likely to enter the sample are weighted less, while those unlikely to enter are weighted more when they do appear. The net result is to give each individual an equal chance of affecting the (weighted) sample mean." (Frederick Mosteller et al, "Principles of Sampling", Journal of the American Statistical Association Vol. 49 (265), 1954)

"We realize that if someone just 'grabs a handful', the individuals in the handful almost always resemble one another (on the average) more than do the members of a simple random sample. Even if the 'grabs' [sampling] are randomly spread around so that every individual has an equal chance of entering the sample, there are difficulties. Since the individuals of grab samples resemble one another more than do individuals of random samples, it follows (by a simple mathematical argument) that the means of grab samples resemble one another less than the means of random samples of the same size. From a grab sample, therefore, we tend to underestimate the variability in the population, although we should have to overestimate it in order to obtain valid estimates of variability of grab sample means by substituting such an estimate into the formula for the variability of means of simple random samples. Thus using simple random sample formulas for grab sample means introduces a double bias, both parts of which lead to an unwarranted appearance of higher stability." (Frederick Mosteller et al, "Principles of Sampling", Journal of the American Statistical Association Vol. 49 (265), 1954)

"Weighing a sample appropriately is no more fudging the data than is correcting a gas volume for barometric pressure." (Frederick Mosteller et al, "Principles of Sampling", Journal of the American Statistical Association Vol. 49 (265), 1954)

"A primary goal of any learning model is to predict correctly the learning curve - proportions of correct responses versus trials. Almost any sensible model with two or three free parameters, however, can closely fit the curve, and so other criteria must be invoked when one is comparing several models." (Robert R Bush & Frederick Mosteller, "A Comparison of Eight Models?", Studies in Mathematical Learning Theory, 1959)

"A satisfactory prediction of the sequential properties of learning data from a single experiment is by no means a final test of a model. Numerous other criteria - and some more demanding - can be specified. For example, a model with specific numerical parameter values should be invariant to changes in independent variables that explicitly enter in the model." (Robert R Bush & Frederick Mosteller,"A Comparison of Eight Models?", Studies in Mathematical Learning Theory, 1959)

"In the testing of a scientific model or theory, one rarely has a general measure of goodness-of-fit, a universal yardstick by which one accepts or rejects the model. Indeed, science does not and should not work this way; a theory is kept until a better one is found. One way that science does work is by comparing two or more theories to determine their relative merits in handling relevant data."(Robert R Bush & Frederick Mosteller, "A Comparison of Eight Models?", Studies in Mathematical Learning Theory, 1959)

"In a problem, the great thing is the challenge. A problem can be challenging for many reasons: because the subject matter is intriguing, because the answer defies unsophisticated intuition, because it illustrates an important principle, because of its vast generality, because of its difficulty, because of a clever solution, or even because of the simplicity or beauty of the answer." (Frederick Mosteller, "Fifty Challenging Problems in Probability with Solutions", 1965)

"Using data from the population as it stands is a dangerous substitute for testing." (Frederick Mosteller & Gale Mosteller, "New Statistical Methods in Public Policy. Part I: Experimentation", Journal of Contemporary Business 8, 1979)

"Although we often hear that data speak for themselves, their voices can be soft and sly." (Frederick Mosteller, "Beginning Statistics with Data Analysis", 1983)

"The law of truly large numbers states: With a large enough sample, any outrageous thing is likely to happen." (Frederick Mosteller, "Methods for Studying Coincidences", Journal of the American Statistical Association Vol. 84, 1989)

"It is easy to lie with statistics, but easier to lie without them [...]" (Frederick Mosteller)

18 April 2006

🖍️Leo Breiman - Collected Quotes

"Probability theory has a right and a left hand. On the right is the rigorous foundational work using the tools of measure theory. The left hand 'thinks probabilistically', reduces problems to gambling situations, coin-tossing, motions of a physical particle." (Leo Breiman, "Probability", 1992) 

"Approaching problems by looking for a data model imposes an a priori straight jacket that restricts the ability of statisticians to deal with a wide range of statistical problems. The best available solution to a data problem might be a data model; then again it might be an algorithmic model. The data and the problem guide the solution. To solve a wider range of data problems, a larger set of tools is needed." (Leo Breiman, "Statistical Modeling: The Two Cultures", Statistical Science 16(3), 2001)

"Data modeling has given the statistics field many successes in analyzing data and getting information about the mechanisms producing the data. But there is also misuse leading to questionable conclusions about the underlying mechanism." (Leo Breiman, "Statistical Modeling: The Two Cultures", Statistical Science 16(3), 2001)

"One goal of statistics is to extract information from the data about the underlying mechanism producing the data. The greatest plus of data modeling is that it produces a simple and understandable picture of the relationship between the input variables and responses." (Leo Breiman, "Statistical Modeling: The Two Cultures", Statistical Science 16(3), 2001)

"Prediction is rarely perfect. There are usually many unmeasured variables whose effect is referred to as 'noise'. But the extent to which the model box emulates nature's box is a measure of how well our model can reproduce the natural phenomenon producing the data." (Leo Breiman, "Statistical Modeling: The Two Cultures", Statistical Science 16(3), 2001)

"The goals in statistics are to use data to predict and to get information about the underlying data mechanism. Nowhere is it written on a stone tablet what kind of model should be used to solve problems involving data. To make my position clear, I am not against data models per se. In some situations they are the most appropriate way to solve the problem. But the emphasis needs to be on the problem and on the data." (Leo Breiman, "Statistical Modeling: The Two Cultures", Statistical Science 16(3), 2001)

"The greatest plus of data modeling is that it produces a simple and understandable picture of the relationship between the input variables and responses [...] different models, all of them equally good, may give different pictures of the relation between the predictor and response variables [...] One reason for this multiplicity is that goodness-of-fit tests and other methods for checking fit give a yes–no answer. With the lack of power of these tests with data having more than a small number of dimensions, there will be a large number of models whose fit is acceptable. There is no way, among the yes–no methods for gauging fit, of determining which is the better model." (Leo Breiman, "Statistical Modeling: The two cultures", Statistical Science 16(3), 2001)

"The point of a model is to get useful information about the relation between the response and predictor variables. Interpretability is a way of getting information. But a model does not have to be simple to provide reliable information about the relation between predictor and response variables; neither does it have to be a data model." (Leo Breiman, "Statistical Modeling: The Two Cultures", Statistical Science 16(3), 2001)

"The roots of statistics, as in science, lie in working with data and checking theory against data." (Leo Breiman, "Statistical Modeling: The Two Cultures", Statistical Science 16(3), 2001)

"There are two cultures in the use of statistical modeling to reach conclusions from data. One assumes that the data are generated by a given stochastic data model. The other uses algorithmic models and treats the data mechanism as unknown. The statistical community has been committed to the almost exclusive use of data models. This commitment has led to irrelevant theory, questionable conclusions, and has kept statisticians from working on a large range of interesting current problems. Algorithmic modeling, both in theory and practice, has developed rapidly in fields outside statistics. It can be used both on large complex data sets and as a more accurate and informative alternative to data modeling on smaller data sets. If our goal as a field is to use data to solve problems, then we need to move away from exclusive dependence on data models and adopt a more diverse set of tools." (Leo Breiman, "Statistical Modeling: The Two Cultures", Statistical Science 16(3), 2001)

🖍️Charu C Aggarwal - Collected Quotes

"A major advantage of probabilistic models is that they can be easily applied to virtually any data type (or mixed data type), as long as an appropriate generative model is available for each mixture component. [...] A downside of probabilistic models is that they try to fit the data to a particular kind of distribution, which may often not be appropriate for the underlying data. Furthermore, as the number of model parameters increases, over-fitting becomes more common. In such cases, the outliers may fit the underlying model of normal data. Many parametric models are also harder to interpret in terms of intensional knowledge, especially when the parameters of the model cannot be intuitively presented to an analyst in terms of underlying attributes. This can defeat one of the important purposes of anomaly detection, which is to provide diagnostic understanding of the abnormal data generative process." (Charu C Aggarwal, "Outlier Analysis", 2013)

"An attempt to use the wrong model for a given data set is likely to provide poor results. Therefore, the core principle of discovering outliers is based on assumptions about the structure of the normal patterns in a given data set. Clearly, the choice of the 'normal' model depends highly upon the analyst’s understanding of the natural data patterns in that particular domain." (Charu C Aggarwal, "Outlier Analysis", 2013)

"Dimensionality reduction and regression modeling are particularly hard to interpret in terms of original attributes, when the underlying data dimensionality is high. This is because the subspace embedding is defined as a linear combination of attributes with positive or negative coefficients. This cannot easily be intuitively interpreted in terms specific properties of the data attributes." (Charu C Aggarwal, "Outlier Analysis", 2013)

"Typically, most outlier detection algorithms use some quantified measure of the outlierness of a data point, such as the sparsity of the underlying region, nearest neighbor based distance, or the fit to the underlying data distribution. Every data point lies on a continuous spectrum from normal data to noise, and finally to anomalies [...] The separation of the different regions of this spectrum is often not precisely defined, and is chosen on an ad-hoc basis according to application-specific criteria. Furthermore, the separation between noise and anomalies is not pure, and many data points created by a noisy generative process may be deviant enough to be interpreted as anomalies on the basis of the outlier score. Thus, anomalies will typically have a much higher outlier score than noise, but this is not a distinguishing factor between the two as a matter of definition. Rather, it is the interest of the analyst, which regulates the distinction between noise and an anomaly." (Charu C Aggarwal, "Outlier Analysis", 2013) 

"Even though a natural way of avoiding overfitting is to simply build smaller networks (with fewer units and parameters), it has often been observed that it is better to build large networks and then regularize them in order to avoid overfitting. This is because large networks retain the option of building a more complex model if it is truly warranted. At the same time, the regularization process can smooth out the random artifacts that are not supported by sufficient data. By using this approach, we are giving the model the choice to decide what complexity it needs, rather than making a rigid decision for the model up front (which might even underfit the data)." (Charu C Aggarwal, "Neural Networks and Deep Learning: A Textbook", 2018)

"Regularization is particularly important when the amount of available data is limited. A neat biological interpretation of regularization is that it corresponds to gradual forgetting, as a result of which 'less important' (i.e., noisy) patterns are removed. In general, it is often advisable to use more complex models with regularization rather than simpler models without regularization." (Charu C Aggarwal, "Neural Networks and Deep Learning: A Textbook", 2018)

"The high generalization error in a neural network may be caused by several reasons. First, the data itself might have a lot of noise, in which case there is little one can do in order to improve accuracy. Second, neural networks are hard to train, and the large error might be caused by the poor convergence behavior of the algorithm. The error might also be caused by high bias, which is referred to as underfitting. Finally, overfitting (i.e., high variance) may cause a large part of the generalization error. In most cases, the error is a combination of more than one of these different factors." (Charu C Aggarwal, "Neural Networks and Deep Learning: A Textbook", 2018)

"The idea behind deeper architectures is that they can better leverage repeated regularities in the data patterns in order to reduce the number of computational units and therefore generalize the learning even to areas of the data space where one does not have examples. Often these repeated regularities are learned by the neural network within the weights as the basis vectors of hierarchical features." (Charu C Aggarwal, "Neural Networks and Deep Learning: A Textbook", 2018)

"A key point is that an increased number of attributes relative to training points provides additional degrees of freedom to the optimization problem, as a result of which irrelevant solutions become more likely. Therefore, a natural solution is to add a penalty for using additional features." (Charu C Aggarwal, "Artificial Intelligence: A Textbook", 2021)

"In general, the more complex the data, the more the analyst has to make prior inferences of what is considered normal for modeling purposes." (Charu C Aggarwal, "Artificial Intelligence: A Textbook", 2021)

"The ability to go beyond human domain knowledge is usually achieved by inductive learning methods that are unfettered from the imperfections in the domain knowledge of deductive methods." (Charu C Aggarwal, "Artificial Intelligence: A Textbook", 2021)

"The Monte Carlo tree search method is naturally suited to non-deterministic settings such as card games or backgammon. Minimax trees are not well suited to non-deterministic settings because of the inability to predict the opponent’s moves while building the tree. On the other hand, Monte Carlo tree search is naturally suited to handling such settings, since the desirability of moves is always evaluated in an expected sense. The randomness in the game can be naturally combined with the randomness in move sampling in order to learn the expected outcomes from each choice of move." (Charu C Aggarwal, "Artificial Intelligence: A Textbook", 2021)

🖍️Umesh R Hodeghatta - Collected Quotes

"A histogram represents the frequency distribution of the data. Histograms are similar to bar charts but group numbers into ranges. Also, a histogram lets you show the frequency distribution of continuous data. This helps in analyzing the distribution (for example, normal or Gaussian), any outliers present in the data, and skewness." (Umesh R Hodeghatta & Umesha Nayak, "Business Analytics Using R: A Practical Approach", 2017)

"Bias occurs normally when the model is underfitted and has failed to learn enough from the training data. It is the difference between the mean of the probability distribution and the actual correct value. Hence, the accuracy of the model is different for different data sets (test and training sets). To reduce the bias error, data scientists repeat the model-building process by resampling the data to obtain better prediction values." (Umesh R Hodeghatta & Umesha Nayak, "Business Analytics Using R: A Practical Approach", 2017)

"Clustering analysis is performed on data to identify hidden groups or to form different sectors. The objective of the clusters is to enable meaningful analysis in ways that help business. Clustering can uncover previously undetected relationships in a data set." (Umesh R Hodeghatta & Umesha Nayak, "Business Analytics Using R: A Practical Approach", 2017)

"Correlation explains the extent of change in one of the variables given the unit change in the value of another variable. Correlation assumes a very significant role in statistics and hence in the field of business analytics as any business cannot make any decision without understanding the relationship between various forces acting in favor of or against it." (Umesh R Hodeghatta & Umesha Nayak, "Business Analytics Using R: A Practical Approach", 2017)

"Graphs represent data visually and provide more details about the data, enabling you to identify outliers in the data, distribute data for each column variable, provide a statistical description of the data, and present the relationship between the two or more variables." (Umesh R Hodeghatta & Umesha Nayak, "Business Analytics Using R: A Practical Approach", 2017)

"If either bias or variance is high, the model can be very far off from reality. In general, there is a trade-off between bias and variance. The goal of any machine-learning algorithm is to achieve low bias and low variance such that it gives good prediction performance. In reality, because of so many other hidden parameters in the model, it is hard to calculate the real bias and variance error. Nevertheless, the bias and variance provide a measure to understand the behavior of the machine-learning algorithm so that the model model can be adjusted to provide good prediction performance." (Umesh R Hodeghatta & Umesha Nayak, "Business Analytics Using R: A Practical Approach", 2017)

"In machine learning, a model is defined as a function, and we describe the learning function from the training data as inductive learning. Generalization refers to how well the concepts are learned by the model by applying them to data not seen before. The goal of a good machine-learning model is to reduce generalization errors and thus make good predictions on data that the model has never seen." (Umesh R Hodeghatta & Umesha Nayak, "Business Analytics Using R: A Practical Approach", 2017)

"Machine learning is about making computers learn and perform tasks better based on past historical data. Learning is always based on observations from the data available. The emphasis is on making computers build mathematical models based on that learning and perform tasks automatically without the intervention of humans." (Umesh R Hodeghatta & Umesha Nayak, "Business Analytics Using R: A Practical Approach", 2017)

"Overfitting and underfitting are two important factors that could impact the performance of machine-learning models. Overfitting occurs when the model performs well with training data and poorly with test data. Underfitting occurs when the model is so simple that it performs poorly with both training and test data. [...]  When the model does not capture and fit the data, it results in poor performance. We call this underfitting. Underfitting is the result of a poor model that typically does not perform well for any data." (Umesh R Hodeghatta & Umesha Nayak, "Business Analytics Using R: A Practical Approach", 2017)

"Variance is a prediction error due to different sets of training samples. Ideally, the error should not vary from one training sample to another sample, and the model should be stable enough to handle hidden variations between input and output variables. Normally this occurs with the overfitted model." (Umesh R Hodeghatta & Umesha Nayak, "Business Analytics Using R: A Practical Approach", 2017)

🖍️George B Dantzig - Collected Quotes

"All such problems can be formulated as mathematical programming problems. Naturally, we can propose many sophisticated algorithms and a theory but the final test of a theory is its capacity to solve the problems which originated it." (George B Dantzig, "Linear Programming and Extensions", 1963)

"If the system exhibits a structure which can be represented by a mathematical equivalent, called a mathematical model, and if the objective can be also so quantified, then some computational method may be evolved for choosing the best schedule of actions among alternatives. Such use of mathematical models is termed mathematical programming." (George B Dantzig, "Linear Programming and Extensions", 1963)

"Linear programming is viewed as a revolutionary development giving man the ability to state general objectives and to find, by means of the simplex method, optimal policy decisions for a broad class of practical decision problems of great complexity. In the real world, planning tends to be ad hoc because of the many special-interest groups with their multiple objectives." (George B Dantzig, "Mathematical Programming: The state of the art", 1983)

"Linear programming and its generalization, mathematical programming, can be viewed as part of a great revolutionary development that has given mankind the ability to state general goals and lay out a path of detailed decisions to be taken in order to 'best' achieve these goals when faced with practical situations of great complexity. The tools for accomplishing this are the models that formulate real-world problems in detailed mathematical terms, the algorithms that solve the models, and the software that execute the algorithms on computers based on the mathematical theory." (George B Dantzig & Mukund N Thapa, "Linear Programming" Vol I, 1997)

"Linear programming is concerned with the maximization or minimization of a linear objective function in many variables subject to linear equality and inequality constraints."  (George B Dantzig & Mukund N Thapa, "Linear Programming" Vol I, 1997)

"Mathematical programming (or optimization theory) is that branch of mathematics dealing with techniques for maximizing or minimizing an objective function subject to linear, nonlinear, and integer constraints on the variables."  (George B Dantzig & Mukund N Thapa, "Linear Programming" Vol I, 1997)

"Models of the real world are not always easy to formulate because of the richness, variety, and ambiguity that exists in the real world or because of our ambiguous understanding of it." (George B Dantzig & Mukund N Thapa, "Linear Programming" Vol I, 1997)

"The linear programming problem is to determine the values of the variables of the system that (a) are nonnegative or satisfy certain bounds, (b) satisfy a system  of linear constraints, and (c) minimize or maximize a linear form in the variables called an objective." (George B Dantzig & Mukund N Thapa, "Linear Programming" Vol I, 1997)

"The mathematical model of a system is the collection of mathematical relationships which, for the purpose of developing a design or plan, characterize the set of feasible solutions of the system." (George B Dantzig & Mukund N Thapa, "Linear Programming" Vol I, 1997)

17 April 2006

🖍️Benjamin Bengfort - Collected Quotes

"Deep learning broadly describes the large family of neural network architectures that contain multiple, interacting hidden layers." (Benjamin Bengfort et al, Applied Text Analysis with Python, 2018)

"Graphs can embed complex semantic representations in a compact form. As such, modeling data as networks of related entities is a powerful mechanism for analytics, both for visual analyses and machine learning. Part of this power comes from performance advantages of using a graph data structure, and the other part comes from an inherent human ability to intuitively interact with small networks." (Benjamin Bengfort et al, "Applied Text Analysis with Python: Enabling Language-Aware Data Products with Machine Learning", 2018)

"In essence, deep learning models are just chains of functions, which means that many deep learning libraries tend to have a functional or verbose, declarative style." (Benjamin Bengfort et al, Applied Text Analysis with Python, 2018)

"Language is unstructured data that has been produced by people to be understood by other people. By contrast, structured or semistructured data includes fields or markup that enable it to be easily parsed by a computer. However, while it does not feature an easily machine-readable structure, unstructured data is not random. On the contrary, it is governed by linguistic properties that make it very understandable to other people." (Benjamin Bengfort et al, "Applied Text Analysis with Python: Enabling Language-Aware Data Products with Machine Learning", 2018)

"Machine learning is often associated with the automation of decision making, but in practice, the process of constructing a predictive model generally requires a human in the loop. While computers are good at fast, accurate numerical computation, humans are instinctively and instantly able to identify patterns. The bridge between these two necessary skill sets lies in visualization - the precise and accurate rendering of data by a computer in visual terms and the immediate assignation of meaning to that data by humans." (Benjamin Bengfort et al, "Applied Text Analysis with Python: Enabling Language-Aware Data Products with Machine Learning", 2018)

"Many model families suffer from 'the curse of dimensionality'; as the feature space increases in dimensions, the data becomes more sparse and less informative to the underlying decision space." (Benjamin Bengfort et al, "Applied Text Analysis with Python: Enabling Language-Aware Data Products with Machine Learning", 2018)

"Neural networks refer to a family of models that are defined by an input layer (a vectorized representation of input data), a hidden layer that consists of neurons and synapses, and an output layer with the predicted values. Within the hidden layer, synapses transmit signals between neurons, which rely on an activation function to buffer incoming signals. The synapses apply weights to incoming values, and the activation function determines if the weighted inputs are sufficiently high to activate the neuron and pass the values on to the next layer of the network." (Benjamin Bengfort et al, "Applied Text Analysis with Python: Enabling Language-Aware Data Products with Machine Learning", 2018)

"The current trade-offs between traditional models and neural networks concern two factors: model complexity and speed. Because neural networks tend to take longer to train, they can impede rapid iteration [...] Neural networks are also typically more complex than traditional models, meaning that their hyperparameters are more difficult to tune and modeling errors are more challenging to diagnose. However, neural networks are not only increasingly practical, they also promise nontrivial performance gains over traditional models. This is because unlike traditional models, which face performance plateaus even as more data become available, neural models continue to improve." (Benjamin Bengfort et al, "Applied Text Analysis with Python: Enabling Language-Aware Data Products with Machine Learning", 2018)

"The premise of classification is simple: given a categorical target variable, learn patterns that exist between instances composed of independent variables and their relationship to the target. Because the target is given ahead of time, classification is said to be supervised machine learning because a model can be trained to minimize error between predicted and actual categories in the training data. Once a classification model is fit, it assigns categorical labels to new instances based on the patterns detected during training." (Benjamin Bengfort et al, "Applied Text Analysis with Python: Enabling Language-Aware Data Products with Machine Learning", 2018)

"The trick is to walk the line between underfitting and overfitting. An underfit model has low variance, generally making the same predictions every time, but with extremely high bias, because the model deviates from the correct answer by a significant amount. Underfitting is symptomatic of not having enough data points, or not training a complex enough model. An overfit model, on the other hand, has memorized the training data and is completely accurate on data it has seen before, but varies widely on unseen data. Neither an overfit nor underfit model is generalizable - that is, able to make meaningful predictions on unseen data." (Benjamin Bengfort et al, "Applied Text Analysis with Python: Enabling Language-Aware Data Products with Machine Learning", 2018)

"There is a trade-off between bias and variance [...]. Complexity increases with the number of features, parameters, depth, training epochs, etc. As complexity increases and the model overfits, the error on the training data decreases, but the error on test data increases, meaning that the model is less generalizable." (Benjamin Bengfort et al, "Applied Text Analysis with Python: Enabling Language-Aware Data Products with Machine Learning", 2018)

"Unfortunately, because the search space is large, automatic techniques for optimization are not sufficient. Instead, the process of selecting an optimal model is complex and iterative, involving repeated cycling through feature engineering, model selection, and hyperparameter tuning. Results are evaluated after each iteration in order to arrive at the best combination of features, model, and parameters that will solve the problem at hand." (Benjamin Bengfort et al, "Applied Text Analysis with Python: Enabling Language-Aware Data Products with Machine Learning", 2018)

"Unsupervised learning or clustering is a way of discovering hidden structures in unlabeled data. Clustering algorithms aim to discover latent patterns in unlabeled data using features to organize instances into meaningfully dissimilar groups." (Benjamin Bengfort et al, "Applied Text Analysis with Python: Enabling Language-Aware Data Products with Machine Learning", 2018)

🖍️Gary Smith - Collected Quotes

"A computer makes calculations quickly and correctly, but doesn’t ask if the calculations are meaningful or sensible. A computer just does what it is told." (Gary Smith, "Standard Deviations", 2014)

"A study that leaves out data is waving a big red flag. A decision to include orxclude data sometimes makes all the difference in the world. This decision should be based on the relevance and quality of the data, not on whether the data support or undermine a conclusion that is expected or desired." (Gary Smith, "Standard Deviations", 2014)

"Another way to secure statistical significance is to use the data to discover a theory. Statistical tests assume that the researcher starts with a theory, collects data to test the theory, and reports the results - whether statistically significant or not. Many people work in the other direction, scrutinizing the data until they find a pattern and then making up a theory that fits the pattern." (Gary Smith, "Standard Deviations", 2014)

"Comparisons are the lifeblood of empirical studies. We can’t determine if a medicine, treatment, policy, or strategy is effective unless we compare it to some alternative. But watch out for superficial comparisons: comparisons of percentage changes in big numbers and small numbers, comparisons of things that have nothing in common except that they increase over time, comparisons of irrelevant data. All of these are like comparing apples to prunes." (Gary Smith, "Standard Deviations", 2014)

"Data clusters are everywhere, even in random data. Someone who looks for an explanation will inevitably find one, but a theory that fits a data cluster is not persuasive evidence. The found explanation needs to make sense and it needs to be tested with uncontaminated data." (Gary Smith, "Standard Deviations", 2014)

"Data without theory can fuel a speculative stock market bubble or create the illusion of a bubble where there is none. How do we tell the difference between a real bubble and a false alarm? You know the answer: we need a theory. Data are not enough. […] Data without theory is alluring, but misleading." (Gary Smith, "Standard Deviations", 2014)

"Don’t just do the calculations. Use common sense to see whether you are answering the correct question, the assumptions are reasonable, and the results are plausible. If a statistical argument doesn’t make sense, think about it carefully - you may discover that the argument is nonsense." (Gary Smith, "Standard Deviations", 2014)

"Graphs can help us interpret data and draw inferences. They can help us see tendencies, patterns, trends, and relationships. A picture can be worth not only a thousand words, but a thousand numbers. However, a graph is essentially descriptive - a picture meant to tell a story. As with any story, bumblers may mangle the punch line and the dishonest may lie." (Gary Smith, "Standard Deviations", 2014)

"Graphs should not be mere decoration, to amuse the easily bored. A useful graph displays data accurately and coherently, and helps us understand the data. Chartjunk, in contrast, distracts, confuses, and annoys. Chartjunk may be well-intentioned, but it is misguided. It may also be a deliberate attempt to mystify." (Gary Smith, "Standard Deviations", 2014)

"How can we tell the difference between a good theory and quackery? There are two effective antidotes: common sense and fresh data. If it is a ridiculous theory, we shouldn’t be persuaded by anything less than overwhelming evidence, and even then be skeptical. Extraordinary claims require extraordinary evidence. Unfortunately, common sense is an uncommon commodity these days, and many silly theories have been seriously promoted by honest researchers." (Gary Smith, "Standard Deviations", 2014)

"If somebody ransacks data to find a pattern, we still need a theory that makes sense. On the other hand, a theory is just a theory until it is tested with persuasive data." (Gary Smith, "Standard Deviations", 2014)

 "[…] many gamblers believe in the fallacious law of averages because they are eager to find a profitable pattern in the chaos created by random chance." (Gary Smith, "Standard Deviations", 2014)

"Numbers are not inherently tedious. They can be illuminating, fascinating, even entertaining. The trouble starts when we decide that it is more important for a graph to be artistic than informative." (Gary Smith, "Standard Deviations", 2014)

"Provocative assertions are provocative precisely because they are counterintuitive - which is a very good reason for skepticism." (Gary Smith, "Standard Deviations", 2014)

"Remember that even random coin flips can yield striking, even stunning, patterns that mean nothing at all. When someone shows you a pattern, no matter how impressive the person’s credentials, consider the possibility that the pattern is just a coincidence. Ask why, not what. No matter what the pattern, the question is: Why should we expect to find this pattern?" (Gary Smith, "Standard Deviations", 2014)

"Self-selection bias occurs when people choose to be in the data - for example, when people choose to go to college, marry, or have children. […] Self-selection bias is pervasive in 'observational data', where we collect data by observing what people do. Because these people chose to do what they are doing, their choices may reflect who they are. This self-selection bias could be avoided with a controlled experiment in which people are randomly assigned to groups and told what to do." (Gary Smith, "Standard Deviations", 2014)

"The omission of zero magnifies the ups and downs in the data, allowing us to detect changes that might otherwise be ambiguous. However, once zero has been omitted, the graph is no longer an accurate guide to the magnitude of the changes. Instead, we need to look at the actual numbers." (Gary Smith, "Standard Deviations", 2014)

"These practices - selective reporting and data pillaging - are known as data grubbing. The discovery of statistical significance by data grubbing shows little other than the researcher’s endurance. We cannot tell whether a data grubbing marathon demonstrates the validity of a useful theory or the perseverance of a determined researcher until independent tests confirm or refute the finding. But more often than not, the tests stop there. After all, you won’t become a star by confirming other people’s research, so why not spend your time discovering new theories? The data-grubbed theory consequently sits out there, untested and unchallenged." (Gary Smith, "Standard Deviations", 2014)

"We are genetically predisposed to look for patterns and to believe that the patterns we observe are meaningful. […] Don’t be fooled into thinking that a pattern is proof. We need a logical, persuasive explanation and we need to test the explanation with fresh data." (Gary Smith, "Standard Deviations", 2014)

"We are hardwired to make sense of the world around us - to notice patterns and invent theories to explain these patterns. We underestimate how easily pat - terns can be created by inexplicable random events - by good luck and bad luck." (Gary Smith, "Standard Deviations", 2014)

"We are seduced by patterns and we want explanations for these patterns. When we see a string of successes, we think that a hot hand has made success more likely. If we see a string of failures, we think a cold hand has made failure more likely. It is easy to dismiss such theories when they involve coin flips, but it is not so easy with humans. We surely have emotions and ailments that can cause our abilities to go up and down. The question is whether these fluctuations are important or trivial." (Gary Smith, "Standard Deviations", 2014)

"We naturally draw conclusions from what we see […]. We should also think about what we do not see […]. The unseen data may be just as important, or even more important, than the seen data. To avoid survivor bias, start in the past and look forward." (Gary Smith, "Standard Deviations", 2014)

"We encounter regression in many contexts - pretty much whenever we see an imperfect measure of what we are trying to measure. Standardized tests are obviously an imperfect measure of ability." (Gary Smith, "Standard Deviations", 2014)

"With fast computers and plentiful data, finding statistical significance is trivial. If you look hard enough, it can even be found in tables of random numbers." (Gary Smith, "Standard Deviations", 2014)

"[...] a mathematically elegant procedure can generate worthless predictions. Principal components regression is just the tip of the mathematical iceberg that can sink models used by well-intentioned data scientists. Good data scientists think about their tools before they use them." (Gary Smith & Jay Cordes, "The 9 Pitfalls of Data Science", 2019)

"A neural-network algorithm is simply a statistical procedure for classifying inputs (such as numbers, words, pixels, or sound waves) so that these data can mapped into outputs. The process of training a neural-network model is advertised as machine learning, suggesting that neural networks function like the human mind, but neural networks estimate coefficients like other data-mining algorithms, by finding the values for which the model’s predictions are closest to the observed values, with no consideration of what is being modeled or whether the coefficients are sensible." (Gary Smith & Jay Cordes, "The 9 Pitfalls of Data Science", 2019)

"Clowns fool themselves. Scientists don’t. Often, the easiest way to differentiate a data clown from a data scientist is to track the successes and failures of their predictions. Clowns avoid experimentation out of fear that they’re wrong, or wait until after seeing the data before stating what they expected to find. Scientists share their theories, question their assumptions, and seek opportunities to run experiments that will verify or contradict them. Most new theories are not correct and will not be supported by experiments. Scientists are comfortable with that reality and don’t try to ram a square peg in a round hole by torturing data or mangling theories. They know that science works, but only if it’s done right." (Gary Smith & Jay Cordes, "The 9 Pitfalls of Data Science", 2019)

"Data-mining tools, in general, tend to be mathematically sophisticated, yet often make implausible assumptions. Too often, the assumptions are hidden in the math and the people who use the tools are more impressed by the math than curious about the assumptions. Instead of being blinded by math, good data scientists use assumptions and models that make sense. Good data scientists use math, but do not worship it. They know that math is an invaluable tool, but it is not a substitute for common sense, wisdom, or expertise." (Gary Smith & Jay Cordes, "The 9 Pitfalls of Data Science", 2019)

"Deep neural networks have an input layer and an output layer. In between, are “hidden layers” that process the input data by adjusting various weights in order to make the output correspond closely to what is being predicted. [...] The mysterious part is not the fancy words, but that no one truly understands how the pattern recognition inside those hidden layers works. That’s why they’re called 'hidden'. They are an inscrutable black box - which is okay if you believe that computers are smarter than humans, but troubling otherwise." (Gary Smith & Jay Cordes, "The 9 Pitfalls of Data Science", 2019)

"Effective data scientists know that they are trying to convey accurate information in an easily understood way. We have never seen a pie chart that was an improvement over a simple table. Even worse, the creative addition of pictures, colors, shading, blots, and splotches may produce chartjunk that confuses the reader and strains the eyes." (Gary Smith & Jay Cordes, "The 9 Pitfalls of Data Science", 2019)

"Good data scientists are careful when they compare samples of different sizes. It is easier for small groups to be lucky. It’s also easier for small groups to be unlucky." (Gary Smith & Jay Cordes, "The 9 Pitfalls of Data Science", 2019)

"Good data scientists consider the reliability of the data, while data clowns don’t. It’s also important to know if there are unreported 'silent data'. If something is surprising about top-ranked groups, ask to see the bottom-ranked groups. Consider the possibility of survivorship bias and self-selection bias. Incomplete, inaccurate, or unreliable data can make fools out of anyone." (Gary Smith & Jay Cordes, "The 9 Pitfalls of Data Science", 2019)

"Good data scientists do not cherry pick data by excluding data that do not support their claims. One of the most bitter criticisms of statisticians is that, 'Figures don’t lie, but liars figure.' An unscrupulous statistician can prove most anything by carefully choosing favorable data and ignoring conflicting evidence." (Gary Smith & Jay Cordes, "The 9 Pitfalls of Data Science", 2019)

"Good data scientists know that, because of inevitable ups and downs in the data for almost any interesting question, they shouldn’t draw conclusions from small samples, where flukes might look like evidence." (Gary Smith & Jay Cordes, "The 9 Pitfalls of Data Science", 2019)

"Good data scientists know that some predictions are inherently difficult and we should not expect anything close to 100 percent accuracy. It is better to construct a reasonable model and acknowledge its uncertainty than to expect the impossible." (Gary Smith & Jay Cordes, "The 9 Pitfalls of Data Science", 2019)

"Good data scientists know that they need to get the assumptions right. It is not enough to have fancy math. Clever math with preposterous premises can be disastrous. [...] Good data scientists think about what they are modeling before making assumptions." (Gary Smith & Jay Cordes, "The 9 Pitfalls of Data Science", 2019)

"In addition to overfitting the data by sifting through a kitchen sink of variables, data scientists can overfit the data by trying a wide variety of nonlinear models." (Gary Smith & Jay Cordes, "The 9 Pitfalls of Data Science", 2019)

"It is certainly good data science practice to set aside data to test models. However, suppose that we data mine lots of useless models, and test them all on set-aside data. Just as some useless models are certain to fit the original data, some, by luck alone, are certain to fit the set-aside data too. Finding a model that fits both the original data and the set-aside data is just another form of data mining. Instead of discovering a model that fits half the data, we discover a model that fits all the data. That makes the problem less likely, but doesn’t solve it." (Gary Smith & Jay Cordes, "The 9 Pitfalls of Data Science", 2019)

"It is tempting to think that because computers can do some things extremely well, they must be highly intelligent, but being useful for specific tasks is very different from having a general intelligence that applies the lessons learned and the skills required for one task to more complex tasks, or completely different tasks." (Gary Smith & Jay Cordes, "The 9 Pitfalls of Data Science", 2019)

"Machines do not know which features to ignore and which to focus on, since that requires real knowledge of the real world. In the absence of such knowledge, computers focus on idiosyncrasies in the data that maximize their success with the training data, without considering whether these idiosyncrasies are useful for making predictions with fresh data. Because they don’t truly understand Real-World, computers cannot distinguish between the meaningful and the meaningless." (Gary Smith & Jay Cordes, "The 9 Pitfalls of Data Science", 2019)

"Mathematicians love math and many non-mathematicians are intimidated by math. This is a lethal combination that can lead to the creation of wildly unrealistic mathematical models. [...] A good mathematical model starts with plausible assumptions and then uses mathematics to derive the implications. A bad model focuses on the math and makes whatever assumptions are needed to facilitate the math." (Gary Smith & Jay Cordes, "The 9 Pitfalls of Data Science", 2019)

"Monte Carlo simulations handle uncertainty by using a computer’s random number generator to determine outcomes. Done over and over again, the simulations show the distribution of the possible outcomes. [...] The beauty of these Monte Carlo simulations is that they allow users to see the probabilistic consequences of their decisions, so that they can make informed choices. [...] Monte Carlo simulations are one of the most valuable applications of data science because they can be used to analyze virtually any uncertain situation where we are able to specify the nature of the uncertainty [...]" (Gary Smith & Jay Cordes, "The 9 Pitfalls of Data Science", 2019)

"Neural-network algorithms do not know what they are manipulating, do not understand their results, and have no way of knowing whether the patterns they uncover are meaningful or coincidental. Nor do the programmers who write the code know exactly how they work and whether the results should be trusted. Deep neural networks are also fragile, meaning that they are sensitive to small changes and can be fooled easily." (Gary Smith & Jay Cordes, "The 9 Pitfalls of Data Science", 2019)

"One of the paradoxical things about computers is that they can excel at things that humans consider difficult (like calculating square roots) while failing at things that humans consider easy (like recognizing stop signs). They do not understand the world the way humans do. They have neither common sense nor wisdom. They are our tools, not our masters. Good data scientists know that data analysis still requires expert knowledge." (Gary Smith & Jay Cordes, "The 9 Pitfalls of Data Science", 2019)

"Outliers are sometimes clerical errors, measurement errors, or flukes that, if not corrected or omitted, will distort the data. At other times, they are the most important observations. Either way, good data scientists look at their data before analyzing them." (Gary Smith & Jay Cordes, "The 9 Pitfalls of Data Science", 2019)

"Regression toward the mean is NOT the fallacious law of averages, also known as the gambler’s fallacy. The fallacious law of averages says that things must balance out - that making a free throw makes a player more likely to miss the next shot; a coin flip that lands heads makes tails more likely on the next flip; and good luck now makes bad luck more likely in the future." (Gary Smith & Jay Cordes, "The 9 Pitfalls of Data Science", 2019)

"Statistical correlations are a poor substitute for expertise. The best way to build models of the real world is to start with theories that are appealing and then test these models. Models that make sense can be used to make useful predictions." (Gary Smith & Jay Cordes, "The 9 Pitfalls of Data Science", 2019)

"The binomial distribution applies to things like coin flips, where every flip has the same constant probability of occurring. Jay saw several problems. [...] the binomial distribution assumes that the outcomes are independent, the way that a coin flip doesn’t depend on previous flips. [...] The binomial distribution is elegant mathematics, but it should be used when its assumptions are true, not because the math is elegant." (Gary Smith & Jay Cordes, "The 9 Pitfalls of Data Science", 2019)

"The label neural networks suggests that these algorithms replicate the neural networks in human brains that connect electrically excitable cells called neurons. They don’t. We have barely scratched the surface in trying to figure out how neurons receive, store, and process information, so we cannot conceivably mimic them with computers."  (Gary Smith & Jay Cordes, "The 9 Pitfalls of Data Science", 2019)

"The logic of regression is simple, but powerful. Our lives are filled with uncertainties. The difference between what we expect to happen and what actually does happen is, by definition, unexpected. We can call these unexpected surprises chance, luck, or some other convenient shorthand. The important point is that, no matter how reasonable or rational our expectations, things sometimes turn out to be higher or lower, larger or smaller, stronger or weaker than expected." (Gary Smith & Jay Cordes, "The 9 Pitfalls of Data Science", 2019)

"The plausibility of the assumptions is more important than the accuracy of the math. There is a well-known saying about data analysis: 'Garbage in, garbage out.' No matter how impeccable the statistical analysis, bad data will yield useless output. The same is true of mathematical models that are used to make predictions. If the assumptions are wrong, the predictions are worthless." (Gary Smith & Jay Cordes, "The 9 Pitfalls of Data Science", 2019)

"The principle behind regression toward the mean is that extraordinary performances exaggerate how far the underlying trait is from average. [...] Regression toward the mean also works for the worst performers. [...] Regression toward the mean is a purely statistical phenomenon that has nothing at all to do with ability improving or deteriorating over time." (Gary Smith & Jay Cordes, "The 9 Pitfalls of Data Science", 2019)

"Useful data analysis requires good data. [...] Good data scientists also consider the reliability of their data. [...] If the data tell you something crazy, there’s a good chance you would be crazy to believe the data." (Gary Smith & Jay Cordes, "The 9 Pitfalls of Data Science", 2019)

16 April 2006

🖍️Danish Haroon - Collected Quotes

"Boosting defines an objective function to measure the performance of a model given a certain set of parameters. The objective function contains two parts: regularization and training loss, both of which add to one another. The training loss measures how predictive our model is on the training data. The most commonly used training loss function includes mean squared error and logistic regression. The regularization term controls the complexity of the model, which helps avoid overfitting." (Danish Haroon, "Python Machine Learning Case Studies", 2017)

"Boosting is a non-linear flexible regression technique that helps increase the accuracy of trees by assigning more weights to wrong predictions. The reason for inducing more weight is so the model can emphasize more on these wrongly predicted samples and tune itself to increase accuracy. The gradient boosting method solves the inherent problem in boosting trees (i.e., low speed and human interpretability). The algorithm supports parallelism by specifying the number of threads." (Danish Haroon, "Python Machine Learning Case Studies", 2017)

"Cluster analysis refers to the grouping of observations so that the objects within each cluster share similar properties, and properties of all clusters are independent of each other. Cluster algorithms usually optimize by maximizing the distance among clusters and minimizing the distance between objects in a cluster. Cluster analysis does not complete in a single iteration but goes through several iterations until the model converges. Model convergence means that the cluster memberships of all objects converge and don’t change with every new iteration." (Danish Haroon, "Python Machine Learning Case Studies", 2017)

"In Boosting, the selection of samples is done by giving more and more weight to hard-to-classify observations. Gradient boosting classification produces a prediction model in the form of an ensemble of weak predictive models, usually decision trees. It generalizes the model by optimizing for the arbitrary differentiable loss function. At each stage, regression trees fit on the negative gradient of binomial or multinomial deviance loss function." (Danish Haroon, "Python Machine Learning Case Studies", 2017)

"Multicollinearity and Singularity are two concepts which undermines the regression modeling, resulting in bizarre and inaccurate results. If exploratory variables are highly correlated, then regression becomes vulnerable to biases. Multicollinearity refers to a correlation of 0.9 or higher, whereas singularity refers to a perfect correlation (i.e., 1)." (Danish Haroon, "Python Machine Learning Case Studies", 2017)

"Multivariate analysis refers to incorporation of multiple exploratory variables to understand the behavior of a response variable. This seems to be the most feasible and realistic approach considering the fact that entities within this world are usually interconnected. Thus the variability in response variable might be affected by the variability in the interconnected exploratory variables." (Danish Haroon, "Python Machine Learning Case Studies", 2017)

"Null hypothesis is something we attempt to find evidence against in the hypothesis tests. Null hypothesis is usually an initial claim that researchers make on the basis of previous knowledge or experience. Alternative hypothesis has a population parameter value different from that of null hypothesis. Alternative hypothesis is something you hope to come out to be true. Statistical tests are performed to decide which of these holds true in a hypothesis test. If the experiment goes in favor of the null hypothesis then we say the experiment has failed in rejecting the null hypothesis." (Danish Haroon, "Python Machine Learning Case Studies", 2017)

"Overfitting refers to the phenomenon where a model is highly fitted on a dataset. This generalization thus deprives the model from making highly accurate predictions about unseen data. [...] Underfitting is a phenomenon where the model is not trained with high precision on data at hand. The treatment of underfitting is subject to bias and variance. A model will have a high bias if both train and test errors are high [...] If a model has a high bias type underfitting, then the remedy can be to increase the model complexity, and if a model is suffering from high variance type underfitting, then the cure can be to bring in more data or otherwise make the model less complex." (Danish Haroon, "Python Machine Learning Case Studies", 2017)

"Regression describes the relationship between an exploratory variable (i.e., independent) and a response variable (i.e., dependent). Exploratory variables are also referred to as predictors and can have a frequency of more than 1. Regression is being used within the realm of predictions and forecasting. Regression determines the change in response variable when one exploratory variable is varied while the other independent variables are kept constant. This is done to understand the relationship that each of those exploratory variables exhibits." (Danish Haroon, "Python Machine Learning Case Studies", 2017)

🖍️Galit Shmueli - Collected Quotes

"Extreme values are values that are unusually large or small compared to other values in the series. Extreme va- lue can affect different forecasting methods to various degrees. The decision whether to remove an extreme value or not must rely on information beyond the data. Is the extreme value the result of a data entry error? Was it due to an unusual event (such as an earthquake) that is unlikely to occur again in the forecast horizon? If there is no grounded justification to remove or replace the extreme value, then the best practice is to generate two sets of forecasts: those based on the series with the extreme values and those based on the series excluding the extreme values." (Galit Shmueli, "Practical Time Series Forecasting: A Hands-On Guide", 2011)

"For the purpose of choosing adequate forecasting methods, it is useful to dissect a time series into a systematic part and a non-systematic part. The systematic part is typically divided into three components: level , trend , and seasonality. The non-systematic part is called noise. The systematic components are assumed to be unobservable, as they characterize the underlying series, which we only observe with added noise." (Galit Shmueli, "Practical Time Series Forecasting: A Hands-On Guide", 2011)

"Forecasting methods attempt to isolate the systematic part and quantify the noise level. The systematic part is used for generating point forecasts and the level of noise helps assess the uncertainty associated with the point forecasts." (Galit Shmueli, "Practical Time Series Forecasting: A Hands-On Guide", 2011)

"Missing values in a time series create "holes" in the series. The presence of missing values has different implications and requires different action depending on the forecasting method." (Galit Shmueli, "Practical Time Series Forecasting: A Hands-On Guide", 2011)

"[…] noise is the random variation that results from measurement error or other causes not accounted for. It is always present in a time series to some degree, although we cannot observe it directly." (Galit Shmueli, "Practical Time Series Forecasting: A Hands-On Guide", 2011)

"Some forecasting methods directly model these components by making assumptions about their structure. For example, a popular assumption about trend is that it is linear or exponential over parts, or all, of the given time period. Another common assumption is about the noise structure: many statistical methods assume that the noise follows a normal distribution. The advantage of methods that rely on such assumptions is that when the assumptions are reasonably met, the resulting forecasts will be more robust and the models more understandable." (Galit Shmueli, "Practical Time Series Forecasting: A Hands-On Guide", 2011)

"Overfitting means that the model is not only fitting the systematic component of the data, but also the noise. An over-fitted model is therefore likely to perform poorly on new data." (Galit Shmueli, "Practical Time Series Forecasting: A Hands-On Guide", 2011)

"Understanding how performance is evaluated affects the choice of forecasting method, as well as the particular details of how a particular forecasting method is executed." (Galit Shmueli, "Practical Time Series Forecasting: A Hands-On Guide", 2011)

"When the purpose of forecasting is to generate accurate forecasts, it is useful to define performance metrics that measure predictive accuracy. Such metrics can tell us how well a particular method performs in general, as well as compared to benchmarks or forecasts from other methods." (Galit Shmueli, "Practical Time Series Forecasting: A Hands-On Guide", 2011)

15 April 2006

🖍️Gerd Gigerenzer - Collected Quotes

"Good numeric representation is a key to effective thinking that is not limited to understanding risks. Natural languages show the traces of various attempts at finding a proper representation of numbers. [...] The key role of representation in thinking is often downplayed because of an ideal of rationality that dictates that whenever two statements are mathematically or logically the same, representing them in different forms should not matter. Evidence that it does matter is regarded as a sign of human irrationality. This view ignores the fact that finding a good representation is an indispensable part of problem solving and that playing with different representations is a tool of creative thinking." (Gerd Gigerenzer, "Calculated Risks: How to know when numbers deceive you", 2002)

"Ignorance of relevant risks and miscommunication of those risks are two aspects of innumeracy. A third aspect of innumeracy concerns the problem of drawing incorrect inferences from statistics. This third type of innumeracy occurs when inferences go wrong because they are clouded by certain risk representations. Such clouded thinking becomes possible only once the risks have been communicated." (Gerd Gigerenzer, "Calculated Risks: How to know when numbers deceive you", 2002)

"In my view, the problem of innumeracy is not essentially 'inside' our minds as some have argued, allegedly because the innate architecture of our minds has not evolved to deal with uncertainties. Instead, I suggest that innumeracy can be traced to external representations of uncertainties that do not match our mind’s design - just as the breakdown of color constancy can be traced to artificial illumination. This argument applies to the two kinds of innumeracy that involve numbers: miscommunication of risks and clouded thinking. The treatment for these ills is to restore the external representation of uncertainties to a form that the human mind is adapted to." (Gerd Gigerenzer, "Calculated Risks: How to know when numbers deceive you", 2002)

"Information needs representation. The idea that it is possible to communicate information in a 'pure' form is fiction. Successful risk communication requires intuitively clear representations. Playing with representations can help us not only to understand numbers (describe phenomena) but also to draw conclusions from numbers (make inferences). There is no single best representation, because what is needed always depends on the minds that are doing the communicating." (Gerd Gigerenzer, "Calculated Risks: How to know when numbers deceive you", 2002)

"Natural frequencies facilitate inferences made on the basis of numerical information. The representation does part of the reasoning, taking care of the multiplication the mind would have to perform if given probabilities. In this sense, insight can come from outside the mind." (Gerd Gigerenzer, "Calculated Risks: How to know when numbers deceive you", 2002)

"Overcoming innumeracy is like completing a three-step program to statistical literacy. The first step is to defeat the illusion of certainty. The second step is to learn about the actual risks of relevant events and actions. The third step is to communicate the risks in an understandable way and to draw inferences without falling prey to clouded thinking. The general point is this: Innumeracy does not simply reside in our minds but in the representations of risk that we choose." (Gerd Gigerenzer, "Calculated Risks: How to know when numbers deceive you", 2002)

"Statistical innumeracy is the inability to think with numbers that represent uncertainties. Ignorance of risk, miscommunication of risk, and clouded thinking are forms of innumeracy. Like illiteracy, innumeracy is curable. Innumeracy is not simply a mental defect 'inside' an unfortunate mind, but is in part produced by inadequate 'outside' representations of numbers. Innumeracy can be cured from the outside." (Gerd Gigerenzer, "Calculated Risks: How to know when numbers deceive you", 2002)

"The creation of certainty seems to be a fundamental tendency of human minds. The perception of simple visual objects reflects this tendency. At an unconscious level, our perceptual systems automatically transform uncertainty into certainty, as depth ambiguities and depth illusions illustrate." (Gerd Gigerenzer, "Calculated Risks: How to know when numbers deceive you", 2002) 

"The key role of representation in thinking is often downplayed because of an ideal of rationality that dictates that whenever two statements are mathematically or logically the same, representing them in different forms should not matter. Evidence that it does matter is regarded as a sign of human irrationality. This view ignores the fact that finding a good representation is an indispensable part of problem solving and that playing with different representations is a tool of creative thinking." (Gerd Gigerenzer, "Calculated Risks: How to know when numbers deceive you", 2002)

"The possibility of translating uncertainties into risks is much more restricted in the propensity view. Propensities are properties of an object, such as the physical symmetry of a die. If a die is constructed to be perfectly symmetrical, then the probability of rolling a six is 1 in 6. The reference to a physical design, mechanism, or trait that determines the risk of an event is the essence of the propensity interpretation of probability. Note how propensity differs from the subjective interpretation: It is not sufficient that someone’s subjective probabilities about the outcomes of a die roll are coherent, that is, that they satisfy the laws of probability. What matters is the die’s design. If the design is not known, there are no probabilities." (Gerd Gigerenzer, "Calculated Risks: How to know when numbers deceive you", 2002)

"When does an uncertainty qualify as a risk? The answer depends on one’s interpretation of probability, of which there are three major versions: degree of belief, propensity, and frequency. Degrees of belief are sometimes called subjective probabilities. Of the three interpretations of probability, the subjective interpretation is most liberal about expressing uncertainties as quantitative probabilities, that is, risks. Subjective probabilities can be assigned even to unique or novel events." (Gerd Gigerenzer, "Calculated Risks: How to know when numbers deceive you", 2002)

"When natural frequencies are transformed into conditional probabilities, the base rate information is taken out (this is called normalization). The benefit of this normalization is that the resulting values fall within the uniform range of 0 and 1. The cost, however, is that when drawing inferences from probabilities (as opposed to natural frequencies), one has to put the base rates back in by multiplying the conditional probabilities by their respective base rates." (Gerd Gigerenzer, "Calculated Risks: How to know when numbers deceive you", 2002)

"Why does representing information in terms of natural frequencies rather than probabilities or percentages foster insight? For two reasons. First, computational simplicity: The representation does part of the computation. And second, evolutionary and developmental primacy: Our minds are adapted to natural frequencies." (Gerd Gigerenzer, "Calculated Risks: How to know when numbers deceive you", 2002)
