"Boxplots provide information at a glance about center (median), spread (interquartile range), symmetry, and outliers. With practice they are easy to read and are especially useful for quick comparisons of two or more distributions. Sometimes unexpected features such as outliers, skew, or differences in spread are made obvious by boxplots but might otherwise go unnoticed." (Lawrence C Hamilton, "Regression with Graphics: A second course in applied statistics", 1991)
"Remember that normality and symmetry are not the same thing. All normal distributions are symmetrical, but not all symmetrical distributions are normal. With water use we were able to transform the distribution to be approximately symmetrical and normal, but often symmetry is the most we can hope for. For practical purposes, symmetry (with no severe outliers) may be sufficient. Transformations are not a magic wand, however. Many distributions cannot even be made symmetrical." (Lawrence C Hamilton, "Regression with Graphics: A second course in applied statistics", 1991)
"Fitting is essential to visualizing hypervariate data. The structure of data in many dimensions can be exceedingly complex. The visualization of a fit to hypervariate data, by reducing the amount of noise, can often lead to more insight. The fit is a hypervariate surface, a function of three or more variables. As with bivariate and trivariate data, our fitting tools are loess and parametric fitting by least-squares. And each tool can employ bisquare iterations to produce robust estimates when outliers or other forms of leptokurtosis are present." (William S Cleveland, "Visualizing Data", 1993)
"Variance and its square root, the standard deviation, summarize the amount of spread around the mean, or how much a variable varies. Outliers influence these statistics too, even more than they influence the mean. On the other hand. the variance and standard deviation have important mathematical advantages that make them (together with the mean) the foundation of classical statistics. If a distribution appears reasonably symmetrical, with no extreme outliers, then the mean and standard deviation or variance are the summaries most analysts would use." (Lawrence C Hamilton, "Data Analysis for Social Scientists: A first course in applied statistics", 1995)
"[…] an outlier is an observation that lies an 'abnormal' distance from other values in a batch of data. There are two possible explanations for the occurrence of an outlier. One is that this happens to be a rare but valid data item that is either extremely large or extremely small. The other is that it isa mistake – maybe due to a measuring or recording error.
"After you visualize your data, there are certain things to look for […]: increasing, decreasing, outliers, or some mix, and of course, be sure you’re not mixing up noise for patterns. Also note how much of a change there is and how prominent the patterns are. How does the difference compare to the randomness in the data? Observations can stand out because of human or mechanical error, because of the uncertainty of estimated values, or because there was a person or thing that stood out from the rest. You should know which it is."
"What is good visualization? It is a representation of data that helps you see what you otherwise would have been blind to if you looked only at the naked source. It enables you to see trends, patterns, and outliers that tell you about yourself and what surrounds you. The best visualization evokes that moment of bliss when seeing something for the first time, knowing that what you see has been right in front of you, just slightly hidden. Sometimes it is a simple bar graph, and other times the visualization is complex because the data requires it." (Nathan Yau, "Data Points: Visualization That Means Something", 2013)
"When we find data quality issues due to valid data during data exploration, we should note these issues in a data quality plan for potential handling later in the project. The most common issues in this regard are missing values and outliers, which are both examples of noise in the data." (John D Kelleher et al, "Fundamentals of Machine Learning for Predictive Data Analytics: Algorithms, worked examples, and case studies", 2015)
"Histograms and frequency polygons display a schematic of a numeric variable's frequency distribution. These plots can show us the center and spread of a distribution, can be used to judge the skewness, kurtosis, and modicity of a distribution, can be used to search for outliers, and can help us make decisions about the symmetry and normality of a distribution." (Forrest W Young et al, "Visual Statistics: Seeing data with dynamic interactive graphics", 2016)
"A histogram represents the frequency distribution of the data. Histograms are similar to bar charts but group numbers into ranges. Also, a histogram lets you show the frequency distribution of continuous data. This helps in analyzing the distribution (for example, normal or Gaussian), any outliers present in the data, and skewness." (Umesh R Hodeghatta & Umesha Nayak, "Business Analytics Using R: A Practical Approach", 2017)
"[…] the data itself can lead to new questions too. In exploratory data analysis (EDA), for example, the data analyst discovers new questions based on the data. The process of looking at the data to address some of these questions generates incidental visualizations - odd patterns, outliers, or surprising correlations that are worth looking into further." (Danyel Fisher & Miriah Meyer, "Making Data Visual", 2018)
"When visuals are applied to data, they can enlighten the audience to insights that they wouldn’t see without charts or graphs. Many interesting patterns and outliers in the data would remain hidden in the rows and columns of data tables without the help of data visualizations. They connect with our visual nature as human beings and impart knowledge that couldn’t be obtained as easily using other approaches that involve just words or numbers." (Brent Dykes, "Effective Data Storytelling: How to Drive Change with Data, Narrative and Visuals", 2019)
"Visualizations can remove the background noise from enormous sets of data so that only the most important points stand out to the intended audience. This is particularly important in the era of big data. The more data there is, the more chance for noise and outliers to interfere with the core concepts of the data set." (Kate Strachnyi, "ColorWise: A Data Storyteller’s Guide to the Intentional Use of Color", 2023)
"We see first what stands out. Our eyes go right to change and difference - peaks, valleys, intersections, dominant colors, outliers. Many successful charts - often the ones that please us the most and are shared and talked about - exploit this inclination by showing a single salient point so clearly that we feel we understand the chart’s meaning without even trying." (Scott Berinato, "Good Charts : the HBR guide to making smarter, more persuasive data visualizations", 2023)