SQL Troubles: 🖍️Dianne Cook

24 April 2006

🖍️Dianne Cook - Collected Quotes

"A common myth is that non-linear dimension reduction captures non-linear patterns in the high-dimensional data. It may or may not do this. The term means that the methods transform the data non-linearly into a useful (or not) visual representation." (Dianne Cook & Ursula Laa, "Interactively Exploring High-Dimensional Data and Models in R", 2026)

"Bias and variance are conceptual constructs. Bias is not possible to quantify unless a true model is known. It is used for setting up simulations and comparing various models, because in these controlled scenarios bias and variance can be computed. In practice, it is not possible to compute. Using high-dimensional visualisation can help with understanding the shape of the class and separation between classes. This provides a better sense about whether a particular approach will be able to capture the shape of the boundary or not, and will thus likely have low or high bias." (Dianne Cook & Ursula Laa, "Interactively Exploring High-Dimensional Data and Models in R", 2026)

"Defining an appropriate distance metric from the context ofthe problem is a most important decision. For example, if your variables are all numeric, and on the same scale, then Euclidean distance might be best. If your variables are categorical, you might need to use something like Hamming distance." (Dianne Cook & Ursula Laa, "Interactively Exploring High-Dimensional Data and Models in R", 2026)

"Hierarchical clustering is summarised by a dendrogram, which sequentially shows points being joined to form a cluster, with the corresponding distances. Breaking the data into clusters is done by cutting the dendrogram at the long edges. [...] Plotting the dendrogram in the data space can help you understand how the hierarchical clustering has collected the points together into clusters. You can learn if the algorithm has been confused by nuisance patterns in the data, and how different choices of linkage method affect the result." (Dianne Cook & Ursula Laa, "Interactively Exploring High-Dimensional Data and Models in R", 2026)

"High-dimensional data spaces are fascinating places. You may think that there are a lot of ways to plot one or two variables, and a lot of types of patterns that can be found. You might use a density plot and see skewness or a dot plot to find outliers. A scatterplot of two variables might reveal a non-linear relationship or a barrier beyond which no observations exist. We don’t as yet have so many different choices of plot types for high dimensions, but these types of patterns are also what we seek in scatterplots of high-dimensional data. The additional dimensions can clarify these patterns, so that clusters are likely to be more distinct. Observations that did not appear to be very different can be seen to be lonely anomalies in high dimensions, and that no other observations have quite the same combination of values." (Dianne Cook & Ursula Laa, "Interactively Exploring High-Dimensional Data and Models in R", 2026)

"It is important to visualise your data because you might discover things that you could never have anticipated. Although there are many resources available for data visualisation, there are few comprehensive resources on high-dimensional data visualisation. High-dimensional (or multivariate) data arises when many different things are measured for each observation. While we can learn many things from plotting with 1D and 2D or 3D methods there are likely more structures hidden in the higher dimensions." (Dianne Cook & Ursula Laa, "Interactively Exploring High-Dimensional Data and Models in R", 2026)

"Non-linear dimension reduction (NLDR) aims to find a single low-dimensional representation of the high-dimensional data that shows the main features of the data. If there are separated clusters present, then it might be a layout where the clusters are all distinct, in a way that a single linear projection could not reveal. For observations falling on a low-dimensional non-linear manifold in high dimensions the NLDR might unfold or unroll it so that they are represented in a plane where the distances are similar to their distance along the manifold." (Dianne Cook & Ursula Laa, "Interactively Exploring High-Dimensional Data and Models in R", 2026)

"PCA (Principal Component Analysis) is very broadly useful for summarising linear association by using combinations of the variables that are highly correlated. However, high correlation can also occur when there are outliers or clustering. PCA is commonly used to detect these patterns also, although this might NOT be a reliable way to do so. To detect clustering or anomalies, using a different approach that is specifically focused on these types of patterns is advisable. To some extent capturing clustering or anomalies using PCA is actually finding problematic patterns that adversely affect conducting appropriate dimension reduction." (Dianne Cook & Ursula Laa, "Interactively Exploring High-Dimensional Data and Models in R", 2026)

"PCA (Principal Component Analysis) is not very effective when the distribution of the variables is highly skewed, so it can be helpful to transform variables to make them more symmetrically distributed before conducting PCA. It is also possible to summarise different types of structure by generalising the optimisation criteria to any function of projected data, f(XA), which is called projection pursuit (PP)." (Dianne Cook & Ursula Laa, "Interactively Exploring High-Dimensional Data and Models in R", 2026)

"Unsupervised classification, or cluster analysis, organizes observations into similar groups. Clusteranalysis is a commonly used, appealing, and conceptually intuitive statistical method. Some of its uses include market segmentation, where customers are grouped into clusters with similar attributes for targeted marketing; gene expression analysis, where genes with similar expression patterns are grouped together; and the creation of taxonomies for animals, insects, or plants. Clustering can be used as a way of reducing a massive amount of data because observations within a cluster can be summarised by its centre. Also, clustering effectively subsets the data thus simplifying analysis because observations in each cluster can be analysed separately." (Dianne Cook & Ursula Laa, "Interactively Exploring High-Dimensional Data and Models in R", 2026)

"The way variables are scaled can affect the appearance of dimensionity. If the variables are scaled together, using global values, some variables may have smaller variance than others. Scaling variables individually shifts the focus to association between variables, as the predominant reason for reduced dimension." (Dianne Cook & Ursula Laa, "Interactively Exploring High-Dimensional Data and Models in R", 2026)

"To determine which variables are responsible for the reduced dimension look for the axes that extend out of the point cloud. These contribute to smaller variation in the observations, and thus indicate possible dimension reduction using these variables." (Dianne Cook & Ursula Laa, "Interactively Exploring High-Dimensional Data and Models in R", 2026)

"To understand variance, we need to know how the model fit changes when a different training sample is used to fit the model. This is achieved by dividing the training sample into folds and fitting a model to each fold. This is more difficult to evaluate with visual methods because it would require examining multiple samples for small differences." (Dianne Cook & Ursula Laa, "Interactively Exploring High-Dimensional Data and Models in R", 2026)

"Viewing the dendrograms in high dimensions provides insight into how the algorithm has joined points to clusters. For example, single linkage often has edges leading to a single focal point, which might not yield a useful clustering but might help to

identify outliers. If the edges point to multiple focal points, with long edges bridging gaps in the data, the result is more likely yielding a useful clustering." (Dianne Cook & Ursula Laa, "Interactively Exploring High-Dimensional Data and Models in R", 2026)

"When exploring the implicit dimensionality of multivariate data we are looking for projections where the points do not fill the plotting canvas fully. This would indicate that the observed values do not fully populate the high dimensions." (Dianne Cook & Ursula Laa, "Interactively Exploring High-Dimensional Data and Models in R", 2026)
