"To the untrained eye, randomness appears as regularity or tendency to cluster." (William Feller, "An Introduction to Probability Theory and its Applications", 1950)
"Sometimes clusters of variables tend to vary together in the normal course of events, thereby rendering it difficult to discover the magnitude of the independent effects of the different variables in the cluster. And yet it may be most desirable, from a practical as well as scientific point of view, to disentangle correlated describing variables in order to discover more effective policies to improve conditions. Many economic indicators tend to move together in response to underlying economic and political events."
"The logarithmic transformation serves several purposes: (1) The resulting regression coefficients sometimes have a more useful theoretical interpretation compared to a regression based on unlogged variables. (2) Badly skewed distributions - in which many of the observations are clustered together combined with a few outlying values on the scale of measurement - are transformed by taking the logarithm of the measurements so that the clustered values are spread out and the large values pulled in more toward the middle of the distribution. (3) Some of the assumptions underlying the regression model and the associated significance tests are better met when the logarithm of the measured variables is taken."
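Point (2) of the quote above can be checked numerically. The following is a minimal sketch, assuming a made-up sample (many small clustered values plus a few large outliers) and the moment-based skewness coefficient as the measure of asymmetry; neither the data nor the helper `skewness` comes from the quoted text.

```python
import math

def skewness(values):
    """Moment-based skewness: third central moment over variance**1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

# Hypothetical measurements: most values cluster near the bottom of the
# scale, while a few outliers stretch the upper tail.
data = [1, 2, 2, 3, 3, 3, 4, 5, 8, 50, 400, 3000]

# Taking logarithms spreads out the small values and pulls in the large ones.
logged = [math.log10(v) for v in data]

raw_skew = skewness(data)
log_skew = skewness(logged)
print(f"skewness before log: {raw_skew:.2f}")
print(f"skewness after log:  {log_skew:.2f}")
```

For this sample the skewness stays positive after the transformation but drops markedly, which is the "pulled in more toward the middle" effect the quote describes.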
"The scatterplot is a useful exploratory method for providing a first look at bivariate data to see how they are distributed throughout the plane, for example, to see clusters of points, outliers, and so forth."
"Multivariate techniques often summarize or classify many variables to only a few groups or factors (e.g., cluster analysis or multi-dimensional scaling). Parallel coordinate plots can help to investigate the influence of a single variable or a group of variables on the result of a multivariate procedure. Plotting the input variables in a parallel coordinate plot and selecting the features of interest of the multivariate procedure will show the influence of different input variables." (Martin Theus & Simon Urbanek, "Interactive Graphics for Data Analysis: Principles and Examples", 2009)
"Parallel coordinate plots are often overrated concerning their ability to depict multivariate features. Scatterplots are clearly superior in investigating the relationship between two continuous variables and multivariate outliers do not necessarily stick out in a parallel coordinate plot. Nonetheless, parallel coordinate plots can help to find and understand features such as groups/clusters, outliers and multivariate structures in their multivariate context. The key feature is the ability to select and highlight individual cases or groups in the data, and compare them to other groups or the rest of the data." (Martin Theus & Simon Urbanek, "Interactive Graphics for Data Analysis: Principles and Examples", 2009)
"Be careful not to confuse clustering and stratification. Even though both of these sampling strategies involve dividing the population into subgroups, both the way in which the subgroups are sampled and the optimal strategy for creating the subgroups are different. In stratified sampling, we sample from every stratum, whereas in cluster sampling, we include only selected whole clusters in the sample. Because of this difference, to increase the chance of obtaining a sample that is representative of the population, we want to create homogeneous groups for strata and heterogeneous (reflecting the variability in the population) groups for clusters." (Roxy Peck et al., "Introduction to Statistics and Data Analysis" 4th Ed., 2012)
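The difference between the two sampling schemes in the quote above can be sketched in a few lines. This is an illustrative sketch only: the population, the grade and room labels, and the helpers `group_by`, `stratified_sample`, and `cluster_sample` are all hypothetical and are not taken from the quoted book.

```python
import random

def group_by(population, key_index):
    """Split units into subgroups keyed by one field of each unit."""
    groups = {}
    for unit in population:
        groups.setdefault(unit[key_index], []).append(unit)
    return groups

def stratified_sample(strata, per_stratum, rng):
    """Sample from EVERY stratum: draw per_stratum units from each one."""
    return [u for units in strata.values() for u in rng.sample(units, per_stratum)]

def cluster_sample(clusters, n_clusters, rng):
    """Select a few WHOLE clusters and keep every unit inside them."""
    chosen = rng.sample(sorted(clusters), n_clusters)
    return [u for name in chosen for u in clusters[name]]

# Hypothetical population of 24 units: (id, grade, room).
grades = ["A", "B", "C"]
rooms = ["r1", "r2", "r3", "r4"]
population = [(i, grades[i % 3], rooms[i % 4]) for i in range(24)]

rng = random.Random(0)  # fixed seed so the sketch is repeatable
strat = stratified_sample(group_by(population, 1), per_stratum=2, rng=rng)
clus = cluster_sample(group_by(population, 2), n_clusters=2, rng=rng)

print("grades in stratified sample:", sorted({u[1] for u in strat}))
print("rooms in cluster sample:   ", sorted({u[2] for u in clus}))
```

The stratified sample always contains every grade, while the cluster sample contains only the rooms that happened to be selected, each of them in full.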
"Linking is a powerful dynamic interactive graphics technique that can help us better understand high-dimensional data. This technique works in the following way: When several plots are linked, selecting an observation's point in a plot will do more than highlight the observation in the plot we are interacting with - it will also highlight points in other plots with which it is linked, giving us a more complete idea of its value across all the variables. Selecting is done interactively with a pointing device. The point selected, and corresponding points in the other linked plots, are highlighted simultaneously. Thus, we can select a cluster of points in one plot and see if it corresponds to a cluster in any other plot, enabling us to investigate the high-dimensional shape and density of the cluster of points, and permitting us to investigate the structure of the disease space."
"Dimensionality reduction is a way of reducing a large number of different measures into a smaller set of metrics. The intent is that the reduced metrics are a simpler description of the complex space that retains most of the meaning. […] Clustering techniques are similarly useful for reducing a large number of items into a smaller set of groups. A clustering technique finds groups of items that are logically near each other and gathers them together." (Danyel Fisher & Miriah Meyer, "Making Data Visual", 2018)
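As a rough illustration of "finding groups of items that are near each other", here is a minimal sketch that uses one deliberately simple notion of nearness: sorted one-dimensional values belong to the same group while the gap to the previous value stays within a threshold. The function `group_by_gap`, the threshold, and the data are assumptions made for this example, not a method described by the quoted authors.

```python
def group_by_gap(values, max_gap):
    """Group sorted 1-D values; start a new group when the gap exceeds max_gap."""
    groups = []
    for v in sorted(values):
        if groups and v - groups[-1][-1] <= max_gap:
            groups[-1].append(v)   # close enough to the previous value
        else:
            groups.append([v])     # too far away: open a new group
    return groups

# Hypothetical items that visibly fall into three bunches.
items = [1.0, 1.2, 0.9, 5.1, 5.3, 9.8, 10.0, 10.1]
groups = group_by_gap(items, max_gap=1.0)

for g in groups:
    print(g)
```

Eight items are reduced to three groups, which is the kind of simplification the quote describes; real clustering techniques use richer distance measures and work in more than one dimension.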
"[...] scatterplots had advantages over earlier graphic forms: the ability to see clusters, patterns, trends, and relations in a cloud of points. Perhaps most importantly, it allowed the addition of visual annotations (point symbols, lines, curves, enclosing contours, etc.) to make those relationships more coherent and tell more nuanced stories." (Michael Friendly & Howard Wainer, "A History of Data Visualization and Graphic Communication", 2021)