22 April 2006

๐Ÿ–️Peter C Bruce - Collected Quotes

"A popular misconception holds that the era of Big Data means the end of a need for sampling. In fact, the proliferation of data of varying quality and relevance reinforces the need for sampling as a tool to work efficiently with a variety of data, and minimize bias. Even in a Big Data project, predictive models are typically developed and piloted with samples." (Peter C Bruce & Andrew G Bruce, "Statistics for Data Scientists: 50 Essential Concepts", 2016)

"Do not confuse standard deviation (which measures the variability of individual data points) with standard error (which measures the variability of a sample metric)." (Peter C Bruce & Andrew G Bruce, "Statistics for Data Scientists: 50 Essential Concepts", 2016)

"In statistical theory, location and variability are referred to as the first and second moments of a distribution. The third and fourth moments are called skewness and kurtosis. Skewness refers to whether the data is skewed to larger or smaller values and kurtosis indicates the propensity of the data to have extreme values. Generally, metrics are not used to measure skewness and kurtosis; instead, these are discovered through visual displays [...]" (Peter C Bruce & Andrew G Bruce, "Statistics for Data Scientists: 50 Essential Concepts", 2016)

"Machine learning tends to be more focused on developing efficient algorithms that scale to large data in order to optimize the predictive model. Statistics generally pays more attention to the probabilistic theory and underlying structure of the model." (Peter C Bruce & Andrew G Bruce, "Statistics for Data Scientists: 50 Essential Concepts", 2016)

"Many classification and regression algorithms optimize a certain criteria or loss function. For example, logistic regression attempts to minimize the deviance. In the literature, some propose to modify the loss function in order to avoid the problems caused by a rare class. In practice, this is hard to do: classification algorithms can be complex and difficult to modify. Weighting is an easy way to change the loss function, discounting errors for records with low weights in favor of records of higher weights." (Peter C Bruce & Andrew G Bruce, "Statistics for Data Scientists: 50 Essential Concepts", 2016)

"Moreover, data science (and business in general) is not so worried about statistical significance, but more concerned with optimizing overall effort and results." (Peter C Bruce & Andrew G Bruce, "Statistics for Data Scientists: 50 Essential Concepts", 2016)

"Statisticians often use the term estimates for values calculated from the data at hand, to draw a distinction between what we see from the data, and the theoretical true or exact state of affairs. Data scientists and business analysts are more likely to refer to such values as a metric. The difference reflects the approach of statistics versus data science: accounting for uncertainty lies at the heart of the discipline of statistics, whereas concrete business or organizational objectives are the focus of data science. Hence, statisticians estimate, and data scientists measure." (Peter C Bruce & Andrew G Bruce, "Statistics for Data Scientists: 50 Essential Concepts", 2016)

"The bootstrap does not compensate for a small sample size; it does not create new data, nor does it fill in holes in an existing data set. It merely informs us about how lots of additional samples would behave when drawn from a population like our original sample." (Peter C Bruce & Andrew G Bruce, "Statistics for Data Scientists: 50 Essential Concepts", 2016)

"The tension between oversmoothing and overfitting is an instance of the bias-variance tradeoff, an ubiquitous problem in statistical model fitting. Variance refers to the modeling error that occurs because of the choice of training data; that is, if you were to choose a different set of training data, the resulting model would be different. Bias refers to the modeling error that occurs because you have not properly identified the underlying real-world scenario; this error would not disappear if you simply added more training data. When a flexible model is overfit, the variance increases. You can reduce this by using a simpler model, but the bias may increase due to the loss of flexibility in modeling the real underlying situation." (Peter C Bruce & Andrew G Bruce, "Statistics for Data Scientists: 50 Essential Concepts", 2016)

"The variance, the standard deviation, mean absolute deviation, and median absolute deviation from the median are not equivalent estimates, even in the case where the data comes from a normal distribution. In fact, the standard deviation is always greater than the mean absolute deviation, which itself is greater than the median absolute deviation. Sometimes, the median absolute deviation is multiplied by a constant scaling factor (it happens to work out to 1.4826) to put MAD on the same scale as the standard deviation in the case of a normal distribution." (Peter C Bruce & Andrew G Bruce, "Statistics for Data Scientists: 50 Essential Concepts", 2016)

"When analysts and researchers use the term regression by itself, they are typically referring to linear regression; the focus is usually on developing a linear model to explain the relationship between predictor variables and a numeric outcome variable. In its formal statistical sense, regression also includes nonlinear models that yield a functional relationship between predictors and outcome variables. In the machine learning community, the term is also occasionally used loosely to refer to the use of any predictive model that produces a predicted numeric outcome (standing in distinction from classification methods that predict a binary or categorical outcome)." (Peter C Bruce & Andrew G Bruce, "Statistics for Data Scientists: 50 Essential Concepts", 2016)

No comments:

Related Posts Plugin for WordPress, Blogger...

About Me

My photo
Koeln, NRW, Germany
IT Professional with more than 24 years experience in IT in the area of full life-cycle of Web/Desktop/Database Applications Development, Software Engineering, Consultancy, Data Management, Data Quality, Data Migrations, Reporting, ERP implementations & support, Team/Project/IT Management, etc.