15 November 2018

Data Science: Transformations (Just the Quotes)

"Logging size transforms the original skewed distribution into a more symmetrical one by pulling in the long right tail of the distribution toward the mean. The short left tail is, in addition, stretched. The shift toward symmetrical distribution produced by the log transform is not, of course, merely for convenience. Symmetrical distributions, especially those that resemble the normal distribution, fulfill statistical assumptions that form the basis of statistical significance testing in the regression model." (Edward R Tufte, "Data Analysis for Politics and Policy", 1974)

"Logging skewed variables also helps to reveal the patterns in the data. […] the rescaling of the variables by taking logarithms reduces the nonlinearity in the relationship and removes much of the clutter resulting from the skewed distributions on both variables; in short, the transformation helps clarify the relationship between the two variables. It also […] leads to a theoretically meaningful regression coefficient." (Edward R Tufte, "Data Analysis for Politics and Policy", 1974)

"The logarithmic transformation serves several purposes: (1) The resulting regression coefficients sometimes have a more useful theoretical interpretation compared to a regression based on unlogged variables. (2) Badly skewed distributions - in which many of the observations are clustered together combined with a few outlying values on the scale of measurement - are transformed by taking the logarithm of the measurements so that the clustered values are spread out and the large values pulled in more toward the middle of the distribution. (3) Some of the assumptions underlying the regression model and the associated significance tests are better met when the logarithm of the measured variables is taken." (Edward R Tufte, "Data Analysis for Politics and Policy", 1974)

"The logarithm is one of many transformations that we can apply to univariate measurements. The square root is another. Transformation is a critical tool for visualization or for any other mode of data analysis because it can substantially simplify the structure of a set of data. For example, transformation can remove skewness toward large values, and it can remove monotone increasing spread. And often, it is the logarithm that achieves this removal." (William S Cleveland, "Visualizing Data", 1993)

"Compound errors can begin with any of the standard sorts of bad statistics - a guess, a poor sample, an inadvertent transformation, perhaps confusion over the meaning of a complex statistic. People inevitably want to put statistics to use, to explore a number's implications. [...] The strengths and weaknesses of those original numbers should affect our confidence in the second-generation statistics." (Joel Best, "Damned Lies and Statistics: Untangling Numbers from the Media, Politicians, and Activists", 2001)

"All forms of complex causation, and especially nonlinear transformations, admittedly stack the deck against prediction. Linear describes an outcome produced by one or more variables where the effect is additive. Any other interaction is nonlinear. This would include outcomes that involve step functions or phase transitions. The hard sciences routinely describe nonlinear phenomena. Making predictions about them becomes increasingly problematic when multiple variables are involved that have complex interactions. Some simple nonlinear systems can quickly become unpredictable when small variations in their inputs are introduced." (Richard N Lebow, "Forbidden Fruit: Counterfactuals and International Relations", 2010)

"Either a logarithmic or a square-root transformation of the data would produce a new series more amenable to fit a simple trigonometric model. It is often the case that periodic time series have rounded minima and sharp-peaked maxima. In these cases, the square root or logarithmic transformation seems to work well most of the time.(DeWayne R Derryberry, "Basic data analysis for time series with R", 2014)

"Transformations of data alter statistics. For example, the mean of a data set can be found, but it is not easy to relate the mean of a data set to the mean of the logarithm of that data set. The median is far friendlier to transformations. If the median of a data set is found, then the logarithm of the data set is analyzed; the median of the log transformed data will be the log of the original median.(DeWayne R Derryberry, "Basic data analysis for time series with R", 2014) 

"Transforming data to measurements of a different kind can clarify and simplify hypotheses that have already been generated and can reveal patterns that would otherwise be hidden." (Forrest W Young et al, "Visual Statistics: Seeing data with dynamic interactive graphics", 2016)

"Feature generation (or engineering, as it is often called) is where the bulk of the time is spent in the machine learning process. As social science researchers or practitioners, you have spent a lot of time constructing features, using transformations, dummy variables, and interaction terms. All of that is still required and critical in the machine learning framework. One difference you will need to get comfortable with is that instead of carefully selecting a few predictors, machine learning systems tend to encourage the creation of lots of features and then empirically use holdout data to perform regularization and model selection. It is common to have models that are trained on thousands of features." (Rayid Ghani & Malte Schierholz, "Machine Learning", 2017)

"Data analysis and data mining are concerned with unsupervised pattern finding and structure determination in data sets. The data sets themselves are explicitly linked as a form of representation to an observational or otherwise empirical domain of interest. 'Structure' has long been understood as symmetry which can take many forms with respect to any transformation, including point, translational, rotational, and many others. Symmetries directly point to invariants, which pinpoint intrinsic properties of the data and of the background empirical domain of interest. As our data models change, so too do our perspectives on analysing data." (Fionn Murtagh, "Data Science Foundations: Geometry and Topology of Complex Hierarchic Systems and Big Data Analytics", 2018)

"Many statistical procedures perform more effectively on data that are normally distributed, or at least are symmetric and not excessively kurtotic (fat-tailed), and where the mean and variance are approximately constant. Observed time series frequently require some form of transformation before they exhibit these distributional properties, for in their 'raw' form they are often asymmetric." (Terence C Mills, "Applied Time Series Analysis: A practical guide to modeling and forecasting", 2019)

No comments:

Related Posts Plugin for WordPress, Blogger...

About Me

My photo
IT Professional with more than 24 years experience in IT in the area of full life-cycle of Web/Desktop/Database Applications Development, Software Engineering, Consultancy, Data Management, Data Quality, Data Migrations, Reporting, ERP implementations & support, Team/Project/IT Management, etc.