
25 March 2024

📊R Language: Regression Analysis with Simulated & Real Data

Before running a regression on a real dataset, one can use a small set of simulated data to test the steps (code adapted from [1]):

# define the model with simulated data
n <- 100
x <- c(1:n)
error <- rnorm(n,0,10)
y <- 1+2*x+error
fit <- lm(y~x)

# plotting the values
plot(x, y, ylab="1+2*x+error")
lines(x, fit$fitted.values)

#using anova (analysis of variance)
anova(fit)

The first step creates the data model, the second plots the data, and the third runs the analysis of variance. For the y variable, any linear function that represents a line in the plane can be used.
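To check how well lm() recovers the simulated parameters, one can also inspect the fitted coefficients (a minimal sketch, not part of the original code; the seed is fixed only to make the output reproducible):

```r
# recreate the simulated fit and check how well lm() recovers
# the true intercept (1) and slope (2)
n <- 100
x <- 1:n
set.seed(1)                    # fixed seed for reproducible output
y <- 1 + 2*x + rnorm(n, 0, 10)
fit <- lm(y ~ x)
summary(fit)$coefficients      # estimates, standard errors, t and p values
confint(fit)                   # 95% confidence intervals for both parameters
```

With a standard deviation of 10 on the errors, the estimates should land close to 1 and 2, and the confidence intervals should usually cover them.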

The rnorm() function generates normally distributed random values based on the parameters given (number of values, mean, and standard deviation), therefore the output will vary between runs of the above code. The bigger the value of the third parameter (the standard deviation), the more dispersed the data are.
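The effect of the third parameter can be seen directly (a small sketch, not in the original post):

```r
# the third argument of rnorm() is the standard deviation:
# the larger it is, the more dispersed the values around the mean
set.seed(42)
error_small <- rnorm(100, mean = 0, sd = 1)
error_large <- rnorm(100, mean = 0, sd = 10)
sd(error_small)   # close to 1
sd(error_large)   # close to 10
```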

To test the code on real data, one can use the Sleuth3 library with the data from [2] (see RPubs):

install.packages("Sleuth3")
library("Sleuth3")

Let's look at the data from the first case, which represents an experiment concerning the effects of intrinsic and extrinsic motivation on creativity, run by the psychologist Teresa Amabile (see [2]):

attach(case0101)
case0101
summary(case0101)  

The regression can be applied to all the data:

# case 0101 (all data)
x <- c(1:nrow(case0101)) # the 47 observations
y <- case0101$Score
fit <- lm(y~x)
plot(x, y, ylab="Score")
lines(x, fit$fitted.values)

However, a more appropriate analysis treats each questionnaire group separately:

# case 0101 (extrinsic vs intrinsic treatments)
extrinsic <- subset(case0101, Treatment %in% "Extrinsic")
intrinsic <- subset(case0101, Treatment %in% "Intrinsic")

par(mfrow = c(1,2)) #1x2 matrix display
x <- c(1:length(extrinsic$Score))
y <- extrinsic$Score
fit <- lm(y~x)
plot(x, y, ylab="Extrinsic Score")
lines(x, fit$fitted.values)

x <- c(1:length(intrinsic$Score))
y <- intrinsic$Score
fit <- lm(y~x)
plot(x, y, ylab="Intrinsic Score")
lines(x, fit$fitted.values)

title("Extrinsic vs. Intrinsic Motivation on Creativity", line = -2, outer = TRUE)
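Since the Treatment factor has exactly two levels, one could also compare the two groups' mean scores directly; the two-sample t-test below is a complementary sketch, not part of the original analysis:

```r
# compare the mean creativity scores of the two treatment groups
library(Sleuth3)
t.test(Score ~ Treatment, data = case0101)
```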

And, here's the output:

Case 0101 Extrinsic vs. Intrinsic Motivation on Creativity

Happy coding!

References:
[1] DeWayne R Derryberry (2014) Basic Data Analysis for Time Series with R 1st Ed.
[2] Fred L Ramsey & Daniel W Schafer (2013) The Statistical Sleuth: A Course in Methods of Data Analysis 3rd Ed.

19 December 2018

🔭Data Science: Errors in Statistics (Just the Quotes)

"[It] may be laid down as a general rule that, if the result of a long series of precise observations approximates a simple relation so closely that the remaining difference is undetectable by observation and may be attributed to the errors to which they are liable, then this relation is probably that of nature." (Pierre-Simon Laplace, "Mémoire sur les Inégalites Séculaires des Planètes et des Satellites", 1787)

"It is surprising to learn the number of causes of error which enter into the simplest experiment, when we strive to attain rigid accuracy." (William S Jevons, "The Principles of Science: A Treatise on Logic and Scientific Method", 1874)

"Some of the common ways of producing a false statistical argument are to quote figures without their context, omitting the cautions as to their incompleteness, or to apply them to a group of phenomena quite different to that to which they in reality relate; to take these estimates referring to only part of a group as complete; to enumerate the events favorable to an argument, omitting the other side; and to argue hastily from effect to cause, this last error being the one most often fathered on to statistics. For all these elementary mistakes in logic, statistics is held responsible." (Sir Arthur L Bowley, "Elements of Statistics", 1901)

"If the number of experiments be very large, we may have precise information as to the value of the mean, but if our sample be small, we have two sources of uncertainty: (I) owing to the 'error of random sampling' the mean of our series of experiments deviates more or less widely from the mean of the population, and (2) the sample is not sufficiently large to determine what is the law of distribution of individuals." (William S Gosset, "The Probable Error of a Mean", Biometrika, 1908)

"We know not to what are due the accidental errors, and precisely because we do not know, we are aware they obey the law of Gauss. Such is the paradox." (Henri Poincaré, "The Foundations of Science", 1913)

"No observations are absolutely trustworthy. In no field of observation can we entirely rule out the possibility that an observation is vitiated by a large measurement or execution error. If a reading is found to lie a very long way from its fellows in a series of replicate observations, there must be a suspicion that the deviation is caused by a blunder or gross error of some kind. [...] One sufficiently erroneous reading can wreck the whole of a statistical analysis, however many observations there are." (Francis J Anscombe, "Rejection of Outliers", Technometrics Vol. 2 (2), 1960)

"It might be reasonable to expect that the more we know about any set of statistics, the greater the confidence we would have in using them, since we would know in which directions they were defective; and that the less we know about a set of figures, the more timid and hesitant we would be in using them. But, in fact, it is the exact opposite which is normally the case; in this field, as in many others, knowledge leads to caution and hesitation, it is ignorance that gives confidence and boldness. For knowledge about any set of statistics reveals the possibility of error at every stage of the statistical process; the difficulty of getting complete coverage in the returns, the difficulty of framing answers precisely and unequivocally, doubts about the reliability of the answers, arbitrary decisions about classification, the roughness of some of the estimates that are made before publishing the final results. Knowledge of all this, and much else, in detail, about any set of figures makes one hesitant and cautious, perhaps even timid, in using them." (Ely Devons, "Essays in Economics", 1961)

"Measurement, we have seen, always has an element of error in it. The most exact description or prediction that a scientist can make is still only approximate." (Abraham Kaplan, "The Conduct of Inquiry: Methodology for Behavioral Science", 1964)

"A mature science, with respect to the matter of errors in variables, is not one that measures its variables without error, for this is impossible. It is, rather, a science which properly manages its errors, controlling their magnitudes and correctly calculating their implications for substantive conclusions." (Otis D Duncan, "Introduction to Structural Equation Models", 1975)

"Pencil and paper for construction of distributions, scatter diagrams, and run-charts to compare small groups and to detect trends are more efficient methods of estimation than statistical inference that depends on variances and standard errors, as the simple techniques preserve the information in the original data." (William E Deming, "On Probability as Basis for Action" American Statistician Vol. 29 (4), 1975)

"When the statistician looks at the outside world, he cannot, for example, rely on finding errors that are independently and identically distributed in approximately normal distributions. In particular, most economic and business data are collected serially and can be expected, therefore, to be heavily serially dependent. So is much of the data collected from the automatic instruments which are becoming so common in laboratories these days. Analysis of such data, using procedures such as standard regression analysis which assume independence, can lead to gross error. Furthermore, the possibility of contamination of the error distribution by outliers is always present and has recently received much attention. More generally, real data sets, especially if they are long, usually show inhomogeneity in the mean, the variance, or both, and it is not always possible to randomize." (George E P Box, "Some Problems of Statistics and Everyday Life", Journal of the American Statistical Association, Vol. 74 (365), 1979)

"Under conditions of uncertainty, both rationality and measurement are essential to decision-making. Rational people process information objectively: whatever errors they make in forecasting the future are random errors rather than the result of a stubborn bias toward either optimism or pessimism. They respond to new information on the basis of a clearly defined set of preferences. They know what they want, and they use the information in ways that support their preferences." (Peter L Bernstein, "Against the Gods: The Remarkable Story of Risk", 1996)

"Linear regression assumes that in the population a normal distribution of error values around the predicted Y is associated with each X value, and that the dispersion of the error values for each X value is the same. The assumptions imply normal and similarly dispersed error distributions." (Fred C Pampel, "Linear Regression: A primer", 2000)

"Compound errors can begin with any of the standard sorts of bad statistics - a guess, a poor sample, an inadvertent transformation, perhaps confusion over the meaning of a complex statistic. People inevitably want to put statistics to use, to explore a number's implications. [...] The strengths and weaknesses of those original numbers should affect our confidence in the second-generation statistics." (Joel Best, "Damned Lies and Statistics: Untangling Numbers from the Media, Politicians, and Activists", 2001)

"Trimming potentially theoretically meaningful variables is not advisable unless one is quite certain that the coefficient for the variable is near zero, that the variable is inconsequential, and that trimming will not introduce misspecification error." (James Jaccard, "Interaction Effects in Logistic Regression", 2001)

"The central limit theorem says that, under conditions almost always satisfied in the real world of experimentation, the distribution of such a linear function of errors will tend to normality as the number of its components becomes large. The tendency to normality occurs almost regardless of the individual distributions of the component errors. An important proviso is that several sources of error must make important contributions to the overall error and that no particular source of error dominate the rest." (George E P Box et al, "Statistics for Experimenters: Design, discovery, and innovation" 2nd Ed., 2005)

"Two things explain the importance of the normal distribution: (1) The central limit effect that produces a tendency for real error distributions to be 'normal like'. (2) The robustness to nonnormality of some common statistical procedures, where 'robustness' means insensitivity to deviations from theoretical normality." (George E P Box et al, "Statistics for Experimenters: Design, discovery, and innovation" 2nd Ed., 2005)

"There are many ways for error to creep into facts and figures that seem entirely straightforward. Quantities can be miscounted. Small samples can fail to accurately reflect the properties of the whole population. Procedures used to infer quantities from other information can be faulty. And then, of course, numbers can be total bullshit, fabricated out of whole cloth in an effort to confer credibility on an otherwise flimsy argument. We need to keep all of these things in mind when we look at quantitative claims. They say the data never lie - but we need to remember that the data often mislead." (Carl T Bergstrom & Jevin D West, "Calling Bullshit: The Art of Skepticism in a Data-Driven World", 2020)

"Always expect to find at least one error when you proofread your own statistics. If you don’t, you are probably making the same mistake twice." (Cheryl Russell)

[Murphy’s Laws of Analysis:] "(1) In any collection of data, the figures that are obviously correct contain errors. (2) It is customary for a decimal to be misplaced. (3) An error that can creep into a calculation, will. Also, it will always be in the direction that will cause the most damage to the calculation." (G C Deakly)

03 December 2018

🔭Data Science: Regression (Just the Quotes)

"One feature [...] which requires much more justification than is usually given, is the setting up of unplausible null hypotheses. For example, a statistician may set out a test to see whether two drugs have exactly the same effect, or whether a regression line is exactly straight. These hypotheses can scarcely be taken literally." (Cedric A B Smith, "Book review of Norman T. J. Bailey: Statistical Methods in Biology", Applied Statistics 9, 1960)

"The method of least squares is used in the analysis of data from planned experiments and also in the analysis of data from unplanned happenings. The word 'regression' is most often used to describe analysis of unplanned data. It is the tacit assumption that the requirements for the validity of least squares analysis are satisfied for unplanned data that produces a great deal of trouble." (George E P Box, "Use and Abuse of Regression", 1966)

"[…] fitting lines to relationships between variables is often a useful and powerful method of summarizing a set of data. Regression analysis fits naturally with the development of causal explanations, simply because the research worker must, at a minimum, know what he or she is seeking to explain." (Edward R Tufte, "Data Analysis for Politics and Policy", 1974)

"Logging size transforms the original skewed distribution into a more symmetrical one by pulling in the long right tail of the distribution toward the mean. The short left tail is, in addition, stretched. The shift toward symmetrical distribution produced by the log transform is not, of course, merely for convenience. Symmetrical distributions, especially those that resemble the normal distribution, fulfill statistical assumptions that form the basis of statistical significance testing in the regression model." (Edward R Tufte, "Data Analysis for Politics and Policy", 1974)

"Logging skewed variables also helps to reveal the patterns in the data. […] the rescaling of the variables by taking logarithms reduces the nonlinearity in the relationship and removes much of the clutter resulting from the skewed distributions on both variables; in short, the transformation helps clarify the relationship between the two variables. It also […] leads to a theoretically meaningful regression coefficient." (Edward R Tufte, "Data Analysis for Politics and Policy", 1974)

"The logarithmic transformation serves several purposes: (1) The resulting regression coefficients sometimes have a more useful theoretical interpretation compared to a regression based on unlogged variables. (2) Badly skewed distributions - in which many of the observations are clustered together combined with a few outlying values on the scale of measurement - are transformed by taking the logarithm of the measurements so that the clustered values are spread out and the large values pulled in more toward the middle of the distribution. (3) Some of the assumptions underlying the regression model and the associated significance tests are better met when the logarithm of the measured variables is taken." (Edward R Tufte, "Data Analysis for Politics and Policy", 1974)

"Graphical methodology provides powerful diagnostic tools for conveying properties of the fitted regression, for assessing the adequacy of the fit, and for suggesting improvements. There is seldom any prior guarantee that a hypothesized regression model will provide a good description of the mechanism that generated the data. Standard regression models carry with them many specific assumptions about the relationship between the response and explanatory variables and about the variation in the response that is not accounted for by the explanatory variables. In many applications of regression there is a substantial amount of prior knowledge that makes the assumptions plausible; in many other applications the assumptions are made as a starting point simply to get the analysis off the ground. But whatever the amount of prior knowledge, fitting regression equations is not complete until the assumptions have been examined." (John M Chambers et al, "Graphical Methods for Data Analysis", 1983)

"Stepwise regression is probably the most abused computerized statistical technique ever devised. If you think you need stepwise regression to solve a particular problem you have, it is almost certain that you do not. Professional statisticians rarely use automated stepwise regression." (Leland Wilkinson, "SYSTAT", 1984)

"Someone has characterized the user of stepwise regression as a person who checks his or her brain at the entrance of the computer center." (Dick R Wittink, "The application of regression analysis", 1988)

"Data analysis is rarely as simple in practice as it appears in books. Like other statistical techniques, regression rests on certain assumptions and may produce unrealistic results if those assumptions are false. Furthermore it is not always obvious how to translate a research question into a regression model." (Lawrence C Hamilton, "Regression with Graphics: A second course in applied statistics", 1991)

"Exploratory regression methods attempt to reveal unexpected patterns, so they are ideal for a first look at the data. Unlike other regression techniques, they do not require that we specify a particular model beforehand. Thus exploratory techniques warn against mistakenly fitting a linear model when the relation is curved, a waxing curve when the relation is S-shaped, and so forth." (Lawrence C Hamilton, "Regression with Graphics: A second course in applied statistics", 1991)

"Linear regression assumes that in the population a normal distribution of error values around the predicted Y is associated with each X value, and that the dispersion of the error values for each X value is the same. The assumptions imply normal and similarly dispersed error distributions." (Fred C Pampel, "Linear Regression: A primer", 2000)

"Whereas regression is about attempting to specify the underlying relationship that summarises a set of paired data, correlation is about assessing the strength of that relationship. Where there is a very close match between the scatter of points and the regression line, correlation is said to be 'strong' or 'high' . Where the points are widely scattered, the correlation is said to be 'weak' or 'low'." (Alan Graham, "Developing Thinking in Statistics", 2006)

"Before best estimates are extracted from data sets by way of a regression analysis, the uncertainties of the individual data values must be determined. In this case care must be taken to recognize which uncertainty components are common to all the values, i.e., those that are correlated (systematic)." (Manfred Drosg, "Dealing with Uncertainties: A Guide to Error Analysis", 2007)

"For linear dependences the main information usually lies in the slope. It is obvious that those points that lie far apart have the strongest influence on the slope if all points have the same uncertainty. In this context we speak of the strong leverage of distant points; when determining the parameter 'slope' these distant points carry more effective weight. Naturally, this weight is distinct from the 'statistical' weight usually used in regression analysis." (Manfred Drosg, "Dealing with Uncertainties: A Guide to Error Analysis", 2007)

"Regression toward the mean. That is, in any series of random events an extraordinary event is most likely to be followed, due purely to chance, by a more ordinary one." (Leonard Mlodinow, "The Drunkard’s Walk: How Randomness Rules Our Lives", 2008)

"There are three possible reasons for [the] absence of predictive power. First, it is possible that the models are misspecified. Second, it is possible that the model’s explanatory factors are measured at too high a level of aggregation [...] Third, [...] the search for statistically significant relationships may not be the strategy best suited for evaluating our model’s ability to explain real world events [...] the lack of predictive power is the result of too much emphasis having been placed on finding statistically significant variables, which may be overdetermined. Statistical significance is generally a flawed way to prune variables in regression models [...] Statistically significant variables may actually degrade the predictive accuracy of a model [...] [By using] models that are constructed on the basis of pruning undertaken with the shears of statistical significance, it is quite possible that we are winnowing our models away from predictive accuracy." (Michael D Ward et al, "The perils of policy by p-value: predicting civil conflicts", Journal of Peace Research 47, 2010)

"Regression analysis, like all forms of statistical inference, is designed to offer us insights into the world around us. We seek patterns that will hold true for the larger population. However, our results are valid only for a population that is similar to the sample on which the analysis has been done." (Charles Wheelan, "Naked Statistics: Stripping the Dread from the Data", 2012)

"Multiple regression, like all statistical techniques based on correlation, has a severe limitation due to the fact that correlation doesn't prove causation. And no amount of measuring of 'control' variables can untangle the web of causality. What nature hath joined together, multiple regression cannot put asunder." (Richard Nisbett, "2014 : What scientific idea is ready for retirement?", 2013)

"A wide variety of statistical procedures (regression, t-tests, ANOVA) require three assumptions: (i) Normal observations or errors. (ii) Independent observations (or independent errors, which is equivalent, in normal linear models, to independent observations). (iii) Equal variance - when that is appropriate (for the one-sample t-test, for example, there is nothing being compared, so equal variances do not apply)." (DeWayne R Derryberry, "Basic Data Analysis for Time Series with R", 2014)

"Regression does not describe changes in ability that happen as time passes […]. Regression is caused by performances fluctuating about ability, so that performances far from the mean reflect abilities that are closer to the mean." (Gary Smith, "Standard Deviations", 2014)

"We encounter regression in many contexts - pretty much whenever we see an imperfect measure of what we are trying to measure. Standardized tests are obviously an imperfect measure of ability. [...] Each experimental score is an imperfect measure of “ability,” the benefits from the layout. To the extent there is randomness in this experiment - and there surely is - the prospective benefits from the layout that has the highest score are probably closer to the mean than was the score." (Gary Smith, "Standard Deviations", 2014)

"When a trait, such as academic or athletic ability, is measured imperfectly, the observed differences in performance exaggerate the actual differences in ability. Those who perform the best are probably not as far above average as they seem. Nor are those who perform the worst as far below average as they seem. Their subsequent performances will consequently regress to the mean." (Gary Smith, "Standard Deviations", 2014)

"Working an integral or performing a linear regression is something a computer can do quite effectively. Understanding whether the result makes sense - or deciding whether the method is the right one to use in the first place - requires a guiding human hand. When we teach mathematics we are supposed to be explaining how to be that guide. A math course that fails to do so is essentially training the student to be a very slow, buggy version of Microsoft Excel." (Jordan Ellenberg, "How Not to Be Wrong: The Power of Mathematical Thinking", 2014)

"A basic problem with MRA is that it typically assumes that the independent variables can be regarded as building blocks, with each variable taken by itself being logically independent of all the others. This is usually not the case, at least for behavioral data. […] Just as correlation doesn’t prove causation, absence of correlation fails to prove absence of causation. False-negative findings can occur using MRA just as false-positive findings do—because of the hidden web of causation that we’ve failed to identify." (Richard E Nisbett, "Mindware: Tools for Smart Thinking", 2015)

"One technique employing correlational analysis is multiple regression analysis (MRA), in which a number of independent variables are correlated simultaneously (or sometimes sequentially, but we won’t talk about that variant of MRA) with some dependent variable. The predictor variable of interest is examined along with other independent variables that are referred to as control variables. The goal is to show that variable A influences variable B 'net of' the effects of all the other variables. That is to say, the relationship holds even when the effects of the control variables on the dependent variable are taken into account." (Richard E Nisbett, "Mindware: Tools for Smart Thinking", 2015)

"The fundamental problem with MRA, as with all correlational methods, is self-selection. The investigator doesn’t choose the value for the independent variable for each subject (or case). This means that any number of variables correlated with the independent variable of interest have been dragged along with it. In most cases, we will fail to identify all these variables. In the case of behavioral research, it’s normally certain that we can’t be confident that we’ve identified all the plausibly relevant variables." (Richard E Nisbett, "Mindware: Tools for Smart Thinking", 2015)

"The theory behind multiple regression analysis is that if you control for everything that is related to the independent variable and the dependent variable by pulling their correlations out of the mix, you can get at the true causal relation between the predictor variable and the outcome variable. That’s the theory. In practice, many things prevent this ideal case from being the norm." (Richard E Nisbett, "Mindware: Tools for Smart Thinking", 2015)

"Regression describes the relationship between an exploratory variable (i.e., independent) and a response variable (i.e., dependent). Exploratory variables are also referred to as predictors and can have a frequency of more than 1. Regression is being used within the realm of predictions and forecasting. Regression determines the change in response variable when one exploratory variable is varied while the other independent variables are kept constant. This is done to understand the relationship that each of those exploratory variables exhibits." (Danish Haroon, "Python Machine Learning Case Studies", 2017)

"Any time you run regression analysis on arbitrary real-world observational data, there’s a significant risk that there’s hidden confounding in your dataset and so causal conclusions from such analysis are likely to be (causally) biased." (Aleksander Molak, "Causal Inference and Discovery in Python", 2023)

"Multiple regression provides scientists and analysts with a tool to perform statistical control - a procedure to remove unwanted influence from certain variables in the model." (Aleksander Molak, "Causal Inference and Discovery in Python", 2023)

"The causal interpretation of linear regression only holds when there are no spurious relationships in your data. This is the case in two scenarios: when you control for a set of all necessary variables (sometimes this set can be empty) or when your data comes from a properly designed randomized experiment." (Aleksander Molak, "Causal Inference and Discovery in Python", 2023)

More quotes on "Regression" at the-web-of-knowledge.blogspot.com

30 November 2018

🔭Data Science: Control (Just the Quotes)

"An inference, if it is to have scientific value, must constitute a prediction concerning future data. If the inference is to be made purely with the help of the distribution theory of statistics, the experiments that constitute evidence for the inference must arise from a state of statistical control; until that state is reached, there is no universe, normal or otherwise, and the statistician’s calculations by themselves are an illusion if not a delusion. The fact is that when distribution theory is not applicable for lack of control, any inference, statistical or otherwise, is little better than a conjecture. The state of statistical control is therefore the goal of all experimentation." (William E Deming, "Statistical Method from the Viewpoint of Quality Control", 1939)

"Sampling is the science and art of controlling and measuring the reliability of useful statistical information through the theory of probability." (William E Deming, "Some Theory of Sampling", 1950)

"The well-known virtue of the experimental method is that it brings situational variables under tight control. It thus permits rigorous tests of hypotheses and confidential statements about causation. The correlational method, for its part, can study what man has not learned to control. Nature has been experimenting since the beginning of time, with a boldness and complexity far beyond the resources of science. The correlator’s mission is to observe and organize the data of nature’s experiments." (Lee J Cronbach, "The Two Disciplines of Scientific Psychology", The American Psychologist Vol. 12, 1957)

"In complex systems cause and effect are often not closely related in either time or space. The structure of a complex system is not a simple feedback loop where one system state dominates the behavior. The complex system has a multiplicity of interacting feedback loops. Its internal rates of flow are controlled by nonlinear relationships. The complex system is of high order, meaning that there are many system states (or levels). It usually contains positive-feedback loops describing growth processes as well as negative, goal-seeking loops. In the complex system the cause of a difficulty may lie far back in time from the symptoms, or in a completely different and remote part of the system. In fact, causes are usually found, not in prior events, but in the structure and policies of the system." (Jay W Forrester, "Urban dynamics", 1969)

"To adapt to a changing environment, the system needs a variety of stable states that is large enough to react to all perturbations but not so large as to make its evolution uncontrollably chaotic. The most adequate states are selected according to their fitness, either directly by the environment, or by subsystems that have adapted to the environment at an earlier stage. Formally, the basic mechanism underlying self-organization is the (often noise-driven) variation which explores different regions in the system’s state space until it enters an attractor. This precludes further variation outside the attractor, and thus restricts the freedom of the system’s components to behave independently. This is equivalent to the increase of coherence, or decrease of statistical entropy, that defines self-organization." (Francis Heylighen, "The Science Of Self-Organization And Adaptivity", 1970)

"Science consists simply of the formulation and testing of hypotheses based on observational evidence; experiments are important where applicable, but their function is merely to simplify observation by imposing controlled conditions." (Henry L Batten, "Evolution of the Earth", 1971)

"Thus, the construction of a mathematical model consisting of certain basic equations of a process is not yet sufficient for effecting optimal control. The mathematical model must also provide for the effects of random factors, the ability to react to unforeseen variations and ensure good control despite errors and inaccuracies." (Yakov Khurgin, "Did You Say Mathematics?", 1974)

"Uncontrolled variation is the enemy of quality." (W Edwards Deming, 1980)

"The methods of science include controlled experiments, classification, pattern recognition, analysis, and deduction. In the humanities we apply analogy, metaphor, criticism, and (e)valuation. In design we devise alternatives, form patterns, synthesize, use conjecture, and model solutions." (Béla H Bánáthy, "Designing Social Systems in a Changing World", 1996)

"A mathematical model uses mathematical symbols to describe and explain the represented system. Normally used to predict and control, these models provide a high degree of abstraction but also of precision in their application." (Lars Skyttner, "General Systems Theory: Ideas and Applications", 2001)

"A model is an imitation of reality and a mathematical model is a particular form of representation. We should never forget this and get so distracted by the model that we forget the real application which is driving the modelling. In the process of model building we are translating our real world problem into an equivalent mathematical problem which we solve and then attempt to interpret. We do this to gain insight into the original real world situation or to use the model for control, optimization or possibly safety studies." (Ian T Cameron & Katalin Hangos, "Process Modelling and Model Analysis", 2001)

"Dashboards and visualization are cognitive tools that improve your 'span of control' over a lot of business data. These tools help people visually identify trends, patterns and anomalies, reason about what they see and help guide them toward effective decisions. As such, these tools need to leverage people's visual capabilities. With the prevalence of scorecards, dashboards and other visualization tools now widely available for business users to review their data, the issue of visual information design is more important than ever." (Richard Brath & Michael Peters, "Dashboard Design: Why Design is Important," DM Direct, 2004)

"The methodology of feedback design is borrowed from cybernetics (control theory). It is based upon methods of controlled system model’s building, methods of system states and parameters estimation (identification), and methods of feedback synthesis. The models of controlled system used in cybernetics differ from conventional models of physics and mechanics in that they have explicitly specified inputs and outputs. Unlike conventional physics results, often formulated as conservation laws, the results of cybernetical physics are formulated in the form of transformation laws, establishing the possibilities and limits of changing properties of a physical system by means of control." (Alexander L Fradkov, "Cybernetical Physics: From Control of Chaos to Quantum Control", 2007)

"Put simply, statistics is a range of procedures for gathering, organizing, analyzing and presenting quantitative data. […] Essentially […], statistics is a scientific approach to analyzing numerical data in order to enable us to maximize our interpretation, understanding and use. This means that statistics helps us turn data into information; that is, data that have been interpreted, understood and are useful to the recipient. Put formally, for your project, statistics is the systematic collection and analysis of numerical data, in order to investigate or discover relationships among phenomena so as to explain, predict and control their occurrence." (Reva B Brown & Mark Saunders, "Dealing with Statistics: What You Need to Know", 2008)

"One technique employing correlational analysis is multiple regression analysis (MRA), in which a number of independent variables are correlated simultaneously (or sometimes sequentially, but we won’t talk about that variant of MRA) with some dependent variable. The predictor variable of interest is examined along with other independent variables that are referred to as control variables. The goal is to show that variable A influences variable B 'net of' the effects of all the other variables. That is to say, the relationship holds even when the effects of the control variables on the dependent variable are taken into account." (Richard E Nisbett, "Mindware: Tools for Smart Thinking", 2015)

"The correlational technique known as multiple regression is used frequently in medical and social science research. This technique essentially correlates many independent (or predictor) variables simultaneously with a given dependent variable (outcome or output). It asks, 'Net of the effects of all the other variables, what is the effect of variable A on the dependent variable?' Despite its popularity, the technique is inherently weak and often yields misleading results. The problem is due to self-selection. If we don’t assign cases to a particular treatment, the cases may differ in any number of ways that could be causing them to differ along some dimension related to the dependent variable. We can know that the answer given by a multiple regression analysis is wrong because randomized control experiments, frequently referred to as the gold standard of research techniques, may give answers that are quite different from those obtained by multiple regression analysis." (Richard E Nisbett, "Mindware: Tools for Smart Thinking", 2015)

"The theory behind multiple regression analysis is that if you control for everything that is related to the independent variable and the dependent variable by pulling their correlations out of the mix, you can get at the true causal relation between the predictor variable and the outcome variable. That’s the theory. In practice, many things prevent this ideal case from being the norm." (Richard E Nisbett, "Mindware: Tools for Smart Thinking", 2015)

"Too little attention is given to the need for statistical control, or to put it more pertinently, since statistical control (randomness) is so rarely found, too little attention is given to the interpretation of data that arise from conditions not in statistical control." (William E Deming)

22 November 2018

🔭Data Science: Regression toward the Mean (Just the Quotes)

"Whenever we make any decision based on the expectation that matters will return to 'normal', we are employing the notion of regression to the mean." (Peter L Bernstein, "Against the Gods: The Remarkable Story of Risk", 1996)

"Regression to the mean occurs when the process produces results that are statistically independent or negatively correlated. With strong negative serial correlation, extremes are likely to be reversed each time (which would reinforce the instructors' error). In contrast, with strong positive dependence, extreme results are quite likely to be clustered together." (Dan Trietsch, "Statistical Quality Control : A loss minimization approach", 1998) 

"Unfortunately, people are poor intuitive scientists, generally failing to reason in accordance with the principles of scientific method. For example, people do not generate sufficient alternative explanations or consider enough rival hypotheses. People generally do not adequately control for confounding variables when they explore a novel environment. People’s judgments are strongly affected by the frame in which the information is presented, even when the objective information is unchanged. People suffer from overconfidence in their judgments (underestimating uncertainty), wishful thinking (assessing desired outcomes as more likely than undesired outcomes), and the illusion of control (believing one can predict or influence the outcome of random events). People violate basic rules of probability, do not understand basic statistical concepts such as regression to the mean, and do not update beliefs according to Bayes’ rule. Memory is distorted by hindsight, the availability and salience of examples, and the desirability of outcomes. And so on."  (John D Sterman, "Business Dynamics: Systems thinking and modeling for a complex world", 2000)

 "People often attribute meaning to phenomena governed only by a regression to the mean, the mathematical tendency for an extreme value of an at least partially chance-dependent quantity to be followed by a value closer to the average. Sports and business are certainly chancy enterprises and thus subject to regression. So is genetics to an extent, and so very tall parents can be expected to have offspring who are tall, but probably not as tall as they are. A similar tendency holds for the children of very short parents." (John A Paulos, "A Mathematician Plays the Stock Market", 2003)

"'Regression to the mean' […] says that, in any series of events where chance is involved, very good or bad performances, high or low scores, extreme events, etc. tend on the average, to be followed by more average performance or less extreme events. If we do extremely well, we're likely to do worse the next time, while if we do poorly, we're likely to do better the next time. But regression to the mean is not a natural law. Merely a statistical tendency. And it may take a long time before it happens." (Peter Bevelin, "Seeking Wisdom: From Darwin to Munger",  2003)

"Another aspect of representativeness that is misunderstood or ignored is the tendency of regression to the mean. Stochastic phenomena where the outcomes vary randomly around stable values (so-called stationary processes) exhibit the general tendency that extreme outcomes are more likely to be followed by an outcome closer to the mean or mode than by other extreme values in the same direction. For example, even a bright student will observe that her or his performance in a test following an especially outstanding outcome tends to be less brilliant. Similarly, extremely low or extremely high sales in a given period tend to be followed by sales that are closer to the stable mean or the stable trend." (Hans G Daellenbach & Donald C McNickle, "Management Science: Decision making through systems thinking", 2005)

"Behavioural research shows that we tend to use simplifying heuristics when making judgements about uncertain events. These are prone to biases and systematic errors, such as stereotyping, disregard of sample size, disregard for regression to the mean, deriving estimates based on the ease of retrieving instances of the event, anchoring to the initial frame, the gambler’s fallacy, and wishful thinking, which are all affected by our inability to consider more than a few aspects or dimensions of any phenomenon or situation at the same time." (Hans G Daellenbach & Donald C McNickle, "Management Science: Decision making through systems thinking", 2005)

"Concluding that the population is becoming more centralized by observing behavior at the extremes is called the 'Regression to the Mean' Fallacy. […] When looking for a change in a population, do not look only at the extremes; there you will always find a motion to the mean. Look at the entire population." (Charles Livingston & Paul Voakes, "Working with Numbers and Statistics: A handbook for journalists", 2005)

"'Regression to the mean' describes a natural phenomenon whereby, after a short period of success, things tend to return to normal immediately afterwards. This notion applies particularly to random events." (Alan Graham, "Developing Thinking in Statistics", 2006)

"regression to the mean: The fact that unexpectedly high or low numbers from the mean are an exception and are usually followed by numbers that are closer to the mean. Over the long haul, we tend to get relatively more numbers that are near the mean compared to numbers that are far from the mean." (Hari Singh, "Framed! Solve an Intriguing Mystery and Master How to Make Smart Choices", 2006)

 "A naive interpretation of regression to the mean is that heights, or baseball records, or other variable phenomena necessarily become more and more 'average' over time. This view is mistaken because it ignores the error in the regression predicting y from x. For any data point xi, the point prediction for its yi will be regressed toward the mean, but the actual yi that is observed will not be exactly where it is predicted. Some points end up falling closer to the mean and some fall further." (Andrew Gelman & Jennifer Hill, "Data Analysis Using Regression and Multilevel/Hierarchical Models", 2007)

"Regression toward the mean. That is, in any series of random events an extraordinary event is most likely to be followed, due purely to chance, by a more ordinary one." (Leonard Mlodinow, "The Drunkard’s Walk: How Randomness Rules Our Lives", 2008)

"Regression does not describe changes in ability that happen as time passes […]. Regression is caused by performances fluctuating about ability, so that performances far from the mean reflect abilities that are closer to the mean." (Gary Smith, "Standard Deviations", 2014)

"We encounter regression in many contexts - pretty much whenever we see an imperfect measure of what we are trying to measure. Standardized tests are obviously an imperfect measure of ability. [...] Each experimental score is an imperfect measure of “ability,” the benefits from the layout. To the extent there is randomness in this experiment - and there surely is - the prospective benefits from the layout that has the highest score are probably closer to the mean than was the score." (Gary Smith, "Standard Deviations", 2014)

"When a trait, such as academic or athletic ability, is measured imperfectly, the observed differences in performance exaggerate the actual differences in ability. Those who perform the best are probably not as far above average as they seem. Nor are those who perform the worst as far below average as they seem. Their subsequent performances will consequently regress to the mean." (Gary Smith, "Standard Deviations", 2014)

"The term shrinkage is used in regression modeling to denote two ideas. The first meaning relates to the slope of a calibration plot, which is a plot of observed responses against predicted responses. When a dataset is used to fit the model parameters as well as to obtain the calibration plot, the usual estimation process will force the slope of observed versus predicted values to be one. When, however, parameter estimates are derived from one dataset and then applied to predict outcomes on an independent dataset, overfitting will cause the slope of the calibration plot (i.e., the shrinkage factor ) to be less than one, a result of regression to the mean. Typically, low predictions will be too low and high predictions too high. Predictions near the mean predicted value will usually be quite accurate. The second meaning of shrinkage is a statistical estimation method that preshrinks regression coefficients towards zero so that the calibration plot for new data will not need shrinkage as its calibration slope will be one." (Frank E. Harrell Jr., "Regression Modeling Strategies: With Applications to Linear Models, Logistic and Ordinal Regression, and Survival Analysis" 2nd Ed, 2015)

"Often when people relate essentially the same variable in two different groups, or at two different times, they see this same phenomenon - the tendency of the response variable to be closer to the mean than the predicted value. Unfortunately, people try to interpret this by thinking that the performance of those far from the mean is deteriorating, but it’s just a mathematical fact about the correlation. So, today we try to be less judgmental about this phenomenon and we call it regression to the mean. We managed to get rid of the term 'mediocrity', but the name regression stuck as a name for the whole least squares fitting procedure - and that’s where we get the term regression line." (Richard D De Veaux et al, "Stats: Data and Models", 2016)

"Regression toward the mean is pervasive. In sports, excellent performance tends to be followed by good, but less outstanding, performance. [...] By contrast, the good news about regression toward the mean is that very poor performance tends to be followed by improved performance. If you got the worst score in your statistics class on the first exam, you probably did not do so poorly on the second exam (but you were probably still below the mean)." (Alan Agresti et al, Statistics: The Art and Science of Learning from Data" 4th Ed., 2018)

18 May 2018

🔬Data Science: Boltzmann Machine (Definitions)

[Boltzmann machine (with learning):] "A net that adjusts its weights so that the equilibrium configuration of the net will solve a given problem, such as an encoder problem" (David H Ackley et al, "A Learning Algorithm for Boltzmann Machines", Cognitive Science Vol. 9 (1), 1985)

[Boltzmann machine (without learning):] "A class of neural networks used for solving constrained optimization problems. In a typical Boltzmann machine, the weights are fixed to represent the constraints of the problem and the function to be optimized. The net seeks the solution by changing the activations (either 1 or 0) of the units based on a probability distribution and the effect that the change would have on the energy function or consensus function for the net." (David H Ackley et al, "A Learning Algorithm for Boltzmann Machines", Cognitive Science Vol. 9 (1), 1985)

"neural-network model otherwise similar to a Hopfield network but having symmetric interconnects and stochastic processing elements. The input-output relation is optimized by adjusting the bistable values of its internal state variables one at a time, relating to a thermodynamically inspired rule, to reach a global optimum." (Teuvo Kohonen, "Self-Organizing Maps 3rd" Ed., 2001)

"A neural network model consisting of interacting binary units in which the probability of a unit being in the active state depends on its integrated synaptic inputs." (Terrence J Sejnowski, "The Deep Learning Revolution", 2018)

"An unsupervised network that maximizes the product of probabilities assigned to the elements of the training set." (Mário P Véstias, "Deep Learning on Edge: Challenges and Trends", 2020)

"Restricted Boltzmann machine (RBM) is an undirected graphical model that falls under deep learning algorithms. It plays an important role in dimensionality reduction, classification and regression. RBM is the basic block of Deep-Belief Networks. It is a shallow, two-layer neural networks. The first layer of the RBM is called the visible or input layer while the second is the hidden layer. In RBM the interconnections between visible units and hidden units are established using symmetric weights." (S Abirami & P Chitra, "The Digital Twin Paradigm for Smarter Systems and Environments: The Industry Use Cases", Advances in Computers, 2020)

"A deep Boltzmann machine (DBM) is a type of binary pairwise Markov random field (undirected probabilistic graphical model) with multiple layers of hidden random variables." (Udit Singhania & B. K. Tripathy, "Text-Based Image Retrieval Using Deep Learning",  2021) 

"A Boltzmann machine is a neural network of symmetrically connected nodes that make their own decisions whether to activate. Boltzmann machines use a straightforward stochastic learning algorithm to discover “interesting” features that represent complex patterns in the database." (DeepAI) [source]

"Boltzmann Machines is a type of neural network model that was inspired by the physical process of thermodynamics and statistical mechanics. [...] Full Boltzmann machines are impractical to train, which is one of the reasons why a limited form, called the restricted Boltzmann machine, is used." (Accenture)

"RBMs [Restricted Boltzmann Machines] are a type of probabilistic graphical model that can be interpreted as a stochastic artificial neural network. RBNs learn a representation of the data in an unsupervised manner. An RBN consists of visible and hidden layer, and connections between binary neurons in each of these layers. RBNs can be efficiently trained using Contrastive Divergence, an approximation of gradient descent." (Wild ML)

10 May 2018

🔬Data Science: Support Vector Machines [SVM] (Definitions)

"A supervised machine learning classification approach with the objective to find the hyperplane maximizing the minimum distance between the plane and the training data points." (Xiaoyan Yu et al, "Automatic Syllabus Classification Using Support Vector Machines", 2009)

"Support vector machines [SVM] is a methodology used for classification and regression. SVMs select a small number of critical boundary instances called support vectors from each class and build a linear discriminant function that separates them as widely as possible." (Yorgos Goletsis et al, "Bankruptcy Prediction through Artificial Intelligence", 2009)

"SVM is a data mining method useful for classification problems. It uses training data and kernel functions to build a model that can appropriately predict the class of an unclassified observation." (Indranil Bose, "Data Mining in Tourism", 2009)

"A modeling technique that assigns points to classes based on the assignment of previous points, and then determines the gap dividing the classes where the gap is furthest from points in both classes." (DAMA International, "The DAMA Dictionary of Data Management", 2011)

"A machine-learning technique that classifies objects. The method starts with a training set consisting of two classes of objects as input. The SVA computes a hyperplane, in a multidimensional space, that separates objects of the two classes. The dimension of the hyperspace is determined by the number of dimensions or attributes associated with the objects. Additional objects (i.e., test set objects) are assigned membership in one class or the other, depending on which side of the hyperplane they reside." (Jules H Berman, "Principles of Big Data: Preparing, Sharing, and Analyzing Complex Information", 2013)

"A machine learning algorithm that works with labeled training data and outputs results to an optimal hyperplane. A hyperplane is a subspace of the dimension minus one (that is, a line in a plane)." (Judith S Hurwitz, "Cognitive Computing and Big Data Analytics", 2015)

"A classification algorithm that finds the hyperplane dividing the training data into given classes. This division by the hyperplane is then used to classify the data further." (David Natingga, "Data Science Algorithms in a Week" 2nd Ed., 2018)

"Machine learning techniques that are used to make predictions of continuous variables and classifications of categorical variables based on patterns and relationships in a set of training data for which the values of predictors and outcomes for all cases are known." (Jonathan Ferrar et al, "The Power of People: Learn How Successful Organizations Use Workforce Analytics To Improve Business Performance", 2017)

"It is a supervised machine learning tool utilized for data analysis, regression, and classification." (Shradha Verma, "Deep Learning-Based Mobile Application for Plant Disease Diagnosis", 2019)

"It is a supervised learning algorithm in ML used for problems in both classification and regression. This uses a technique called the kernel trick to transform the data and then determines an optimal limit between the possible outputs, based on those transformations." (Mehmet A Cifci, "Optimizing WSNs for CPS Using Machine Learning Techniques", 2021)

"Support Vector Machines (SVM) are supervised machine learning algorithms used for classification and regression analysis. Employed in classification analysis, support vector machines can carry out text categorization, image classification, and handwriting recognition." (Accenture)

🔬Data Science: Cross-validation (Definitions)

"A method for assessing the accuracy of a regression or classification model. A data set is divided up into a series of test and training sets, and a model is built with each of the training set and is tested with the separate test set." (Glenn J Myatt, "Making Sense of Data: A Practical Guide to Exploratory Data Analysis and Data Mining", 2006)

"A method for assessing the accuracy of a regression or classification model." (Glenn J Myatt, "Making Sense of Data: A Practical Guide to Exploratory Data Analysis and Data Mining", 2007)

"A statistical method derived from cross-classification which main objective is to detect the outlying point in a population set." (Tomasz Ciszkowski & Zbigniew Kotulski, "Secure Routing with Reputation in MANET", 2008)

"Process by which an original dataset d is divided into a training set t and a validation set v. The training set is used to produce an effort estimation model (if applicable), later used to predict effort for each of the projects in v, as if these projects were new projects for which effort was unknown. Accuracy statistics are then obtained and aggregated to provide an overall measure of prediction accuracy." (Emilia Mendes & Silvia Abrahão, "Web Development Effort Estimation: An Empirical Analysis", 2008)

"A method of estimating predictive error of inducers. Cross-validation procedure splits that dataset into k equal-sized pieces called folds. k predictive function are built, each tested on a distinct fold after being trained on the remaining folds." (Gilles Lebrun et al, EA Multi-Model Selection for SVM, 2009)

"Method to estimate the accuracy of a classifier system. In this approach, the dataset, D, is randomly split into K mutually exclusive subsets (folds) of equal size (D1, D2, …, Dk) and K classifiers are built. The i-th classifier is trained on the union of all Dj ¤ j¹i and tested on Di. The estimate accuracy is the overall number of correct classifications divided by the number of instances in the dataset." (M Paz S Lorente et al, "Ensemble of ANN for Traffic Sign Recognition" [in "Encyclopedia of Artificial Intelligence"], 2009)

"The process of assessing the predictive accuracy of a model in a test sample compared to its predictive accuracy in the learning or training sample that was used to make the model. Cross-validation is a primary way to assure that over learning does not take place in the final model, and thus that the model approximates reality as well as can be obtained from the data available." (Robert Nisbet et al, "Handbook of statistical analysis and data mining applications", 2009)

"Validating a scoring procedure by applying it to another set of data." (Dougal Hutchison, "Automated Essay Scoring Systems", 2009)

"A method for evaluating the accuracy of a data mining model." (Microsoft, "SQL Server 2012 Glossary", 2012)

"Cross-validation is a method of splitting all of your data into two parts: training and validation. The training data is used to build the machine learning model, whereas the validation data is used to validate that the model is doing what is expected. This increases our ability to find and determine the underlying errors in a model." (Matthew Kirk, "Thoughtful Machine Learning", 2015)

"A technique used for validation and model selection. The data is randomly partitioned into K groups. The model is then trained K times, each time with one of the groups left out, on which it is evaluated." (Simon Rogers & Mark Girolami, "A First Course in Machine Learning", 2017)

"A model validation technique for assessing how the results of a statistical analysis will generalize to an independent data set." (Adrian Carballal et al, "Approach to Minimize Bias on Aesthetic Image Datasets", 2019)

18 March 2018

🔬Data Science: Linear Regression (Definitions)

"A regression model that uses the equation for a straight line." (Glenn J Myatt, "Making Sense of Data: A Practical Guide to Exploratory Data Analysis and Data Mining", 2006)

"A quantitative model building tool that relates one or more independent variables (Xs) to a single dependent variable (Y)." (Lynne Hambleton, "Treasure Chest of Six Sigma Growth Methods, Tools, and Best Practices", 2007)

"A regression that deals with a straight-line relationship between variables. It is in the form of Y = a + bX, whereas nonlinear regression involves curvilinear relationships, such as exponential and quadratic functions." (Jae K Shim & Joel G Siegel, "Budgeting Basics and Beyond", 2008)

"In statistics, a method of modeling the relationship between dependent and independent variables. Linear regression creates a model by fitting a straight line to the values in a dataset." (Meta S Brown, "Data Mining For Dummies", 2014)

"Linear regression is a statistical technique for modeling the relationship between a single variable and one or more other variables. In a machine learning context, linear regression refers to a regression model based on this statistical technique." (Alex Thomas, "Natural Language Processing with Spark NLP", 2020)

"is an area of unsupervised machine learning that uses linear predictor functions to understand the relationship between a scalar dependent variable and one or more explanatory variables." (Accenture)

15 March 2018

🔬Data Science: Logistic Regression (Definitions)

"A regression equation used to predict a binary variable." (Glenn J Myatt, "Making Sense of Data: A Practical Guide to Exploratory Data Analysis and Data Mining", 2006)

"A regression model where the dependent variable takes on a limited number of discrete values, often two values representing yes and no." (Peter L Stenberg & Mitchell Morehart, "Characteristics of Farm and Rural Internet Use in the USA", 2008)

"Technique for making predictions when a dependent variable is a categorical dichotomy, and the independent variable(s) are continuous and/or categorical." (Ken J Farion et al, "Clinical Decision Making by Emergency Room Physicians and Residents", 2008)

"A form of regression analysis in which the target variable (response variable) is a binary-level or ordinal-level response and the target estimate is bounded at the extremes." (Robert Nisbet et al, "Handbook of statistical analysis and data mining applications", 2009)

"A modeling technique where unknown values are predicted by known values of other valuables where the dependent variable is binary type." (DAMA International, "The DAMA Dictionary of Data Management", 2011)

"Logistic regression is a method of statistical modeling appropriate for categorical outcome variables. It describes the relationship between a categorical response variable and a set of explanatory variables." (Leping Liu & Livia D’Andrea, "Initial Stages to Create Online Graduate Communities: Assessment and Development", 2011)

"Like linear regression, a statistical method of modeling the relationship between dependent and independent variables based on probability. However, in binary logistic regression, the dependent variable (the effect, or outcome) can have only one of two values, as in, say, a baby’s sex or the results of an election. (Multinomial logistic regression allows for more than two possible values.) A logistic regression model is formed by fitting data to a logit function. (The dependent variable is a 0 or 1, and the regression curve is shaped something like the letter 's'.) market basket analysis: The identification of product combinations frequently purchased within a single transaction." (Meta S Brown, "Data Mining For Dummies", 2014)

"Logistic regression is a statistical method for determining the relationship between independent predictor variables (such as financial ratios) and a dichotomously coded dependent variable (such as default or non-default)." (Niccolò Gordini, "Genetic Algorithms for Small Enterprises Default Prediction: Empirical Evidence from Italy", 2014)

"Logistic regression is a predictive analytic method for describing and explaining the relationships between a categorical dependent variable and one or more continuous or categorical independent variables in the recent and past existing data in efforts to build predictive models for predicting a membership of individuals or products into two groups or categories." (Sema A Kalaian & Rafa M Kasim, "Predictive Analytics", 2015)

"Form of regression analysis where the dependent variable is a category rather than a continuous variable. An example of a continuous variable is sales or profit. In order to understand customer retention, regression analysis would calculate the effects of variables such as age, demographics, products purchased, and competitor information on two categories: retaining the customer and losing the customer." (Brittany Bullard, "Style and Statistics", 2016)

"A regression model that is used when the dependent variable is qualitative and a probability is assigned to an observation for the likelihood that the target variable has a value of 1." (Alan Olinsky et al, Visualization of Predictive Modeling for Big Data Using Various Approaches When There Are Rare Events at Differing Levels, 2018)

"Logistic regression analysis is mainly used in epidemiology. The most common case is to explore the risk factors of a certain disease and predict the probability of the occurrence of a certain disease according to the risk factors." (Chunfa Xu et al, "Crime Hotspot Prediction Using Big Data in China", 2020)

"Logistic regression is a classification algorithm that comes under supervised learning and is used for predictive learning. Logistic regression is used to describe data. It works best for dichotomous (binary) classification." (Astha Baranwal et al, "Machine Learning in Python: Diabetes Prediction Using Machine Learning", 2020)

"Logistic regression is a statistical technique for modeling the probability of an event. In a machine learning context, logistic regression refers to a classification model based on this statistical technique." (Alex Thomas, "Natural Language Processing with Spark NLP", 2020)

"This is a kind of regression analysis often used when the outcome variable is dichotomous and scored 0, 1. Logistic regression is also known as logit regression and when the dependent variable has more than two categories it is called multinomial. Logistic regression is used when predicting whether an event will happen or not." (John K Rugutt & Caroline C Chemosit, "Student Collaborative Learning Strategies: A Logistic Regression Analysis Approach", 2021)

11 February 2018

🔬Data Science: K-nearest neighbors (Definitions)

"A modeling technique that assigns values to points based on the values of the k nearby points, such as average value, or most common value." (DAMA International, "The DAMA Dictionary of Data Management", 2011)

"A simple and popular classifier algorithm that assigns a class (in a preexisting classification) to an object whose class is unknown. [...] From a collection of data objects whose class is known, the algorithm computes the distances from the object of unknown class to k (a number chosen by the user) objects of known class. The most common class (i.e., the class that is assigned most often to the nearest k objects) is assigned to the object of unknown class." (Jules H Berman, "Principles of Big Data: Preparing, Sharing, and Analyzing Complex Information", 2013)

"A method used for classification and regression. Cases are analyzed, and class membership is assigned based on similarity to other cases, where cases that are similar (or 'near' in characteristics) are known as neighbors." (Brenda L Dietrich et al, "Analytics Across the Enterprise", 2014)

"A prediction method, which uses a function of the k most similar observations from the training set to generate a prediction, such as the mean." (Glenn J Myatt, "Making Sense of Data: A Practical Guide to Exploratory Data Analysis and Data Mining", 2006)

"K-Nearest Neighbors classification is an instance-based supervised learning method that works well with distance-sensitive data." (Matthew Kirk, "Thoughtful Machine Learning", 2015)

"An algorithm that estimates an unknown data item as being like the majority of the k-closest neighbors to that item." (David Natingga, "Data Science Algorithms in a Week" 2nd Ed., 2018)

"K-nearest neighbourhood is a algorithm which stores all available cases and classifies new cases based on a similarity measure. It is used in statistical estimation and pattern recognition." (Aman Tyagi, "Healthcare-Internet of Things and Its Components: Technologies, Benefits, Algorithms, Security, and Challenges", 2021)

25 January 2018

🔬Data Science: Regression Analysis (Definitions)

"A set of statistical operations that helps to predict the value of the dependent variable from the values of one or more independent variables." (Sharon Allen & Evan Terry, "Beginning Relational Data Modeling 2nd Ed.", 2005)

"A statistical tool that measures the strength of relationship between one or more independent variables with a dependent variable. It builds upon the correlation concepts to develop an empirical, databased model. Correlation describes the X and Y relationship with a single number (the Pearson’s Correlation Coefficient (r)), whereas regression summarizes the relationship with a line - the regression line." (Lynne Hambleton, "Treasure Chest of Six Sigma Growth Methods, Tools, and Best Practices", 2007)

"A statistical procedure for estimating mathematically the average relationship between the dependent variable (e.g., sales) and one or more independent variables (e.g., price and advertising)." (Jae K Shim & Joel G Siegel, "Budgeting Basics and Beyond", 2008)

"Regression analysis is a statistical technique for estimating the relationship between a set of predictors (independent variables) and an outcome variable (dependent variable). Linear least-squares regression, in which the relationship is expressed in a linear form, is the most common type of regression analysis. The mathematical model used in least-squares linear regression is often called the general linear model (GLM)." (Herbert I Weisberg, "Bias and Causation: Models and Judgment for Valid Comparisons", 2010)

"A statistical technique which seeks to find a line which best fits through a set of data as plotted on a graph, seeking to find the cleanest path which deviates the least from any instance within the set." (DAMA International, "The DAMA Dictionary of Data Management", 2011)

[regression] "Using one data set to predict the results of a second." (DAMA International, "The DAMA Dictionary of Data Management", 2011)

"The statistical process of predicting one or more continuous variables, such as profit or loss, based on other attributes in the dataset." (Microsoft, "SQL Server 2012 Glossary", 2012)

"A family of methods for fitting a line or curve to a dataset, used to simplify or make sense of a number of apparently random data points." (Meta S Brown, "Data Mining For Dummies", 2014)

"An analytic technique where a series of input variables are examined in relation to their corresponding output results in order to develop a mathematical or statistical relationship." (For Dummies, "PMP Certification All-in-One For Dummies" 2nd Ed., 2013)

"A statistical technique for estimating relationships between variables." (Brenda L Dietrich et al, "Analytics Across the Enterprise", 2014)

 "Process to statistically estimate the relationship between different attributes." (Sanjiv K Bhatia & Jitender S Deogun, "Data Mining Tools: Association Rules", 2014)

"Plotting pairs of independent and dependent variables in an XY chart and then finding a linear or exponential equation that best describes the plotted data." (E C Nelson & Stephen L Nelson, "Excel Data Analysis For Dummies", 2015)

"A statistical procedure that produces an equation for predicting a variable (the criterion measure) from one or more other variables (the predictor measures)." (K  N Krishnaswamy et al, "Management Research Methodology: Integration of Principles, Methods and Techniques", 2016)

"A statistical technique used to estimate the mathematical relationship between a dependent variable, such as quantity demanded, and one or more explanatory variables, such as price and income." (Jeffrey M Perloff & James A Brander, "Managerial Economics and Strategy" 2nd Ed., 2016)

"A statistical process for estimating the relationships between variables, often used to forecast the change in a variable based on changes in other variables. Linear regression is used to analyze continuous variables, and logistic regression is used for discrete variables." (Jonathan Ferrar et al, "The Power of People: Learn How Successful Organizations Use Workforce Analytics To Improve Business Performance", 2017)

"In a machine learning context, regression is the task of assigning scalar value to examples." (Alex Thomas, "Natural Language Processing with Spark NLP", 2020)

"Algorithms used to predict values for new data based on training data fed into the system. Areas where regression in machine learning is used to predict future values include drug response modeling, marketing, real estate and financial forecasting." (Accenture)

"To define the dependency between variables. It assumes a one-way causal effect from one variable to the response of another variable." (Analytics Insight)

16 July 2007

Software Quality Assurance: Regression Testing (Definitions)

"A test that exercises the entire application to verify that a new piece of code didn’t break anything." (Rod Stephens, "Beginning Software Engineering", 2015)

[regression test suite:] "A collection of tests that are run against a system on a regular basis to validate that it works according to the tests." (Pramod J Sadalage & Scott W Ambler, "Refactoring Databases: Evolutionary Database Design", 2006)

"Selective retesting of a modified system or component to verify that faults have not been introduced or exposed as a result of the changes, and that the modified system or component still meets its requirements." (Richard D Stutzke, "Estimating Software-Intensive Systems: Projects, Products, and Processes", 2005)

"Testing to verify that previously successfully tested features are still correct. It is necessary after modifications to eliminate undesired side effects." (Lars Dittmann et al, "Automotive SPICE in Practice", 2008)

"Testing a program to see if recent changes to the code have broken any existing features." (Rod Stephens, "Start Here!™ Fundamentals of Microsoft® .NET Programming", 2011)

"Testing a previously tested program or a partial functionality following modification to show that defects have not been introduced or uncovered in unchanged areas of the software as a result of the changes made. It is performed when the software or its environment is changed." (Tilo Linz et al, "Software Testing Foundations" 4th Ed., 2014)

"A software testing method that checks for additional errors in software that may have been introduced in the process of upgrading or patching to fix other problems." (Mike Harwood, "Internet Security: How to Defend Against Attackers on the Web" 2nd Ed., 2015)

04 December 2006

✏️Lawrence C Hamilton - Collected Quotes

"Boxplots provide information at a glance about center (median), spread (interquartile range), symmetry, and outliers. With practice they are easy to read and are especially useful for quick comparisons of two or more distributions. Sometimes unexpected features such as outliers, skew, or differences in spread are made obvious by boxplots but might otherwise go unnoticed." (Lawrence C Hamilton, "Regression with Graphics: A second course in applied statistics", 1991)

"Comparing normal distributions reduces to comparing only means and standard deviations. If standard deviations are the same, the task even simpler: just compare means. On the other hand, means and standard deviations may be incomplete or misleading as summaries for nonnormal distributions." (Lawrence C Hamilton, "Regression with Graphics: A second course in applied statistics", 1991)

"Correlation and covariance are linear regression statistics. Nonlinearity and influential cases cause the same problems for correlations, and hence for principal components/factor analysis, as they do for regression. Scatterplots should be examined routinely to check for nonlinearity and outliers. Diagnostic checks become even more important with maximum-likelihood factor analysis, which makes stronger assumptions and may be less robust than principal components or principal factors." (Lawrence C Hamilton, "Regression with Graphics: A second course in applied statistics", 1991)

"Data analysis is rarely as simple in practice as it appears in books. Like other statistical techniques, regression rests on certain assumptions and may produce unrealistic results if those assumptions are false. Furthermore it is not always obvious how to translate a research question into a regression model." (Lawrence C Hamilton, "Regression with Graphics: A second course in applied statistics", 1991)

"Data analysis typically begins with straight-line models because they are simplest, not because we believe reality is inherently linear. Theory or data may suggest otherwise [...]" (Lawrence C Hamilton, "Regression with Graphics: A second course in applied statistics", 1991)

"Exploratory regression methods attempt to reveal unexpected patterns, so they are ideal for a first look at the data. Unlike other regression techniques, they do not require that we specify a particular model beforehand. Thus exploratory techniques warn against mistakenly fitting a linear model when the relation is curved, a waxing curve when the relation is S-shaped, and so forth." (Lawrence C Hamilton, "Regression with Graphics: A second course in applied statistics", 1991)

"If a distribution were perfectly symmetrical, all symmetry-plot points would be on the diagonal line. Off-line points indicate asymmetry. Points fall above the line when distance above the median is greater than corresponding distance below the median. A consistent run of above-the-line points indicates positive skew; a run of below-the-line points indicates negative skew." (Lawrence C Hamilton, "Regression with Graphics: A second course in applied statistics", 1991)

"Principal components and factor analysis are methods for data reduction. They seek a few underlying dimensions that account for patterns of variation among the observed variables underlying dimensions imply ways to combine variables, simplifying subsequent analysis. For example, a few combined variables could replace many original variables in a regression. Advantages of this approach include more parsimonious models, improved measurement of indirectly observed concepts, new graphical displays, and the avoidance of multicollinearity." (Lawrence C Hamilton, "Regression with Graphics: A second course in applied statistics", 1991)

"Principal components and principal factor analysis lack a well-developed theoretical framework like that of least squares regression. They consequently provide no systematic way to test hypotheses about the number of factors to retain, the size of factor loadings, or the correlations between factors, for example. Such tests are possible using a different approach, based on maximum-likelihood estimation." (Lawrence C Hamilton, "Regression with Graphics: A second course in applied statistics", 1991)

"Remember that normality and symmetry are not the same thing. All normal distributions are symmetrical, but not all symmetrical distributions are normal. With water use we were able to transform the distribution to be approximately symmetrical and normal, but often symmetry is the most we can hope for. For practical purposes, symmetry (with no severe outliers) may be sufficient. Transformations are not a magic wand, however. Many distributions cannot even be made symmetrical." (Lawrence C Hamilton, "Regression with Graphics: A second course in applied statistics", 1991)

"Visually, skewed sample distributions have one 'longer' and one 'shorter' tail. More general terms are 'heavier' and 'lighter' tails. Tail weight reflects not only distance from the center (tail length) but also the frequency of cases at that distance (tail depth, in a histogram). Tail weight corresponds to actual weight if the sample histogram were cut out of wood and balanced like a seesaw on its median (see next section). A positively skewed distribution is heavier to the right of the median; negative skew implies the opposite." (Lawrence C Hamilton, "Regression with Graphics: A second course in applied statistics", 1991)

"A well-constructed graph can show several features of the data at once. Some graphs contain as much information as the original data, and so (unlike numerical summaries) do not actually simplify the data; rather, they express it in visual form. Unexpected or unusual features, which are not obvious within numerical tables, often jump to our attention once we draw a graph. Because the strengths and weaknesses of graphical methods are opposite those of numerical summary methods, the two work best in combination." (Lawrence C Hamilton, "Data Analysis for Social Scientists: A first course in applied statistics", 1995)

"Data analysis [...] begins with a dataset in hand. Our purpose in data analysis is to learn what we can from those data, to help us draw conclusions about our broader research questions. Our research questions determine what sort of data we need in the first place, and how we ought to go about collecting them. Unless data collection has been done carefully, even a brilliant analyst may be unable to reach valid conclusions regarding the original research questions." (Lawrence C Hamilton, "Data Analysis for Social Scientists: A first course in applied statistics", 1995)

"Variance and its square root, the standard deviation, summarize the amount of spread around the mean, or how much a variable varies. Outliers influence these statistics too, even more than they influence the mean. On the other hand. the variance and standard deviation have important mathematical advantages that make them (together with the mean) the foundation of classical statistics. If a distribution appears reasonably symmetrical, with no extreme outliers, then the mean and standard deviation or variance are the summaries most analysts would use." (Lawrence C Hamilton, "Data Analysis for Social Scientists: A first course in applied statistics", 1995)

26 April 2006

✏️George E P Box - Collected Quotes

"Statistical criteria should (1) be sensitive to change in the specific factors tested, (2) be insensitive to changes, of a magnitude likely to occur in practice, in extraneous factors." (George E P Box, 1955)

"The method of least squares is used in the analysis of data from planned experiments and also in the analysis of data from unplanned happenings. The word 'regression' is most often used to describe analysis of unplanned data. It is the tacit assumption that the requirements for the validity of least squares analysis are satisfied for unplanned data that produces a great deal of trouble." (George E P Box, "Use and Abuse of Regression", 1966)

"To find out what happens to a system when you interfere with it you have to interfere with it (not just passively observe it)." (George E P Box, "Use and Abuse of Regression", 1966)

"A man in daily muddy contact with field experiments could not be expected to have much faith in any direct assumption of independently distributed normal errors." (George E P Box, "Science and Statistics", Journal of the American Statistical Association 71, 1976)

"For the theory-practice iteration to work, the scientist must be, as it were, mentally ambidextrous; fascinated equally on the one hand by possible meanings, theories, and tentative models to be induced from data and the practical reality of the real world, and on the other with the factual implications deducible from tentative theories, models and hypotheses." (George E P Box, "Science and Statistics", Journal of the American Statistical Association 71, 1976)

"One important idea is that science is a means whereby learning is achieved, not by mere theoretical speculation on the one hand, nor by the undirected accumulation of practical facts on the other, but rather by a motivated iteration between theory and practice." (George E P Box, "Science and Statistics", Journal of the American Statistical Association 71, 1976)

"Since all models are wrong the scientist cannot obtain a ‘correct’ one by excessive elaboration. On the contrary following William of Occam he should seek an economical description of natural phenomena. Just as the ability to devise simple but evocative models is the signature of the great scientist so overelaboration and overparameterization is often the mark of mediocrity." (George E P Box, "Science and Statistics", Journal of the American Statistical Association 71, 1976)

"Since all models are wrong the scientist must be alert to what is importantly wrong. It is inappropriate to be concerned about mice when there are tigers abroad." (George E P Box, "Science and Statistics", Journal of the American Statistical Association 71, 1976)

"Remember that all models are wrong; the practical question is how wrong do they have to be to not be useful." (George E P Box, "Empirical Model-Building and Response Surfaces", 1987)

"The fact that [the model] is an approximation does not necessarily detract from its usefulness because models are approximations. All models are wrong, but some are useful." (George E P Box, 1987)

"Statistics is, or should be, about scientific investigation and how to do it better, but many statisticians believe it is a branch of mathematics." (George E P Box, Commentary, Technometrics 32, 1990)

"The central limit theorem says that, under conditions almost always satisfied in the real world of experimentation, the distribution of such a linear function of errors will tend to normality as the number of its components becomes large. The tendency to normality occurs almost regardless of the individual distributions of the component errors. An important proviso is that several sources of error must make important contributions to the overall error and that no particular source of error dominate the rest." (George E P Box et al, "Statistics for Experimenters: Design, discovery, and innovation" 2nd Ed., 2005)

"Two things explain the importance of the normal distribution: (1) The central limit effect that produces a tendency for real error distributions to be 'normal like'. (2) The robustness to nonnormality of some common statistical procedures, where 'robustness' means insensitivity to deviations from theoretical normality." (George E P Box et al, "Statistics for Experimenters: Design, discovery, and innovation" 2nd Ed., 2005)

"All models are approximations. Essentially, all models are wrong, but some are useful. However, the approximate nature of the model must always be borne in mind." (George E P Box & Norman R Draper, "Response Surfaces, Mixtures, and Ridge Analyses", 2007)

"In my view, statistics has no reason for existence except as the catalyst for investigation and discovery." (George E P Box)
