05 December 2006

✏️Carl T Bergstrom - Collected Quotes

"[...] although numbers may seem to be pure facts that exist independently from any human judgment, they are heavily laden with context and shaped by decisions - from how they are calculated to the units in which they are expressed." (Carl T Bergstrom & Jevin D West, "Calling Bullshit: The Art of Skepticism in a Data-Driven World", 2020)

"Another problem is that while data visualizations may appear to be objective, the designer has a great deal of control over the message a graphic conveys. Even using accurate data, a designer can manipulate how those data make us feel. She can create the illusion of a correlation where none exists, or make a small difference between groups look big." (Carl T Bergstrom & Jevin D West, "Calling Bullshit: The Art of Skepticism in a Data-Driven World", 2020)

"Confirmation bias is the tendency to notice, believe, and share information that is consistent with our preexisting beliefs. When a claim confirms our beliefs about the world, we are more prone to accept it as true and less inclined to challenge it as possibly false." (Carl T Bergstrom & Jevin D West, "Calling Bullshit: The Art of Skepticism in a Data-Driven World", 2020)

"Correlation doesn't imply causation - but apparently it doesn't sell newspapers either."(Carl T Bergstrom & Jevin D West, "Calling Bullshit: The Art of Skepticism in a Data-Driven World", 2020)

"For numbers to be transparent, they must be placed in an appropriate context. Numbers must presented in a way that allows for fair comparisons." (Carl T Bergstrom & Jevin D West, "Calling Bullshit: The Art of Skepticism in a Data-Driven World", 2020)

"If the data that go into the analysis are flawed, the specific technical details of the analysis don’t matter. One can obtain stupid results from bad data without any statistical trickery. And this is often how bullshit arguments are created, deliberately or otherwise. To catch this sort of bullshit, you don’t have to unpack the black box. All you have to do is think carefully about the data that went into the black box and the results that came out. Are the data unbiased, reasonable, and relevant to the problem at hand? Do the results pass basic plausibility checks? Do they support whatever conclusions are drawn?" (Carl T Bergstrom & Jevin D West, "Calling Bullshit: The Art of Skepticism in a Data-Driven World", 2020)

"If you study one group and assume that your results apply to other groups, this is extrapolation. If you think you are studying one group, but do not manage to obtain a representative sample of that group, this is a different problem. It is a problem so important in statistics that it has a special name: selection bias. Selection bias arises when the individuals that you sample for your study differ systematically from the population of individuals eligible for your study." (Carl T Bergstrom & Jevin D West, "Calling Bullshit: The Art of Skepticism in a Data-Driven World", 2020)

"Jargon may facilitate technical communication within a field, but it also serves to exclude those who have not been initiated into the inner circle of a discipline." (Carl T Bergstrom & Jevin D West, "Calling Bullshit: The Art of Skepticism in a Data-Driven World", 2020)

"Machines are not free of human biases; they perpetuate them, depending on the data they’re fed. [...] When we train machines to make decisions based on data that arise in a biased society, the machines learn and perpetuate those same biases." (Carl T Bergstrom & Jevin D West, "Calling Bullshit: The Art of Skepticism in a Data-Driven World", 2020)

"Mathiness refers to formulas and expressions that may look and feel like math-even as they disregard the logical coherence and formal rigor of actual mathematics. […] These equations make mathematical claims that cannot be supported by positing formal relationships - variables interacting multiplicatively or additively, for example - between ill-defined and impossible-to-measure quantities. In other words, mathiness, like truthiness and like bullshit, involves a disregard for logic or factual accuracy." (Carl T Bergstrom & Jevin D West, "Calling Bullshit: The Art of Skepticism in a Data-Driven World", 2020)

"Numbers are ideal vehicles for promulgating bullshit. They feel objective, but are easily manipulated to tell whatever story one desires. Words are clearly constructs of human minds, but numbers? Numbers seem to come directly from Nature herself. We know words are subjective. We know they are used to bend and blur the truth. Words suggest intuition, feeling, and expressivity. But not numbers. Numbers suggest precision and imply a scientific approach. Numbers appear to have an existence separate from the humans reporting them." (Carl T Bergstrom & Jevin D West, "Calling Bullshit: The Art of Skepticism in a Data-Driven World", 2020)

"People do care about how they are measured. What can we do about this? If you are in the position to measure something, think about whether measuring it will change people’s behaviors in ways that undermine the value of your results. If you are looking at quantitative indicators that others have compiled, ask yourself: Are these numbers measuring what they are intended to measure? Or are people gaming the system and rendering this measure useless?" (Carl T Bergstrom & Jevin D West, "Calling Bullshit: The Art of Skepticism in a Data-Driven World", 2020)

"Reporting numbers as percentages can obscure important changes in net values. […] Percentage calculations can give strange answers when any of the numbers involved are negative." (Carl T Bergstrom & Jevin D West, "Calling Bullshit: The Art of Skepticism in a Data-Driven World", 2020)

"So what does it mean to tell an honest story? Numbers should be presented in ways that allow meaningful comparisons." (Carl T Bergstrom & Jevin D West, "Calling Bullshit: The Art of Skepticism in a Data-Driven World", 2020)

"The problem is the hype, the notion that something magical will emerge if only we can accumulate data on a large enough scale. We just need to be reminded: Big data is not better; it’s just bigger. And it certainly doesn’t speak for itself." (Carl T Bergstrom & Jevin D West, "Calling Bullshit: The Art of Skepticism in a Data-Driven World", 2020)

"There are many ways for error to creep into facts and figures that seem entirely straightforward. Quantities can be miscounted. Small samples can fail to accurately reflect the properties of the whole population. Procedures used to infer quantities from other information can be faulty. And then, of course, numbers can be total bullshit, fabricated out of whole cloth in an effort to confer credibility on an otherwise flimsy argument. We need to keep all of these things in mind when we look at quantitative claims. They say the data never lie - but we need to remember that the data often mislead." (Carl T Bergstrom & Jevin D West, "Calling Bullshit: The Art of Skepticism in a Data-Driven World", 2020)

"This problem with adding additional variables is referred to as the curse of dimensionality. If you add enough variables into your black box, you will eventually find a combination of variables that performs well - but it may do so by chance. As you increase the number of variables you use to make your predictions, you need exponentially more data to distinguish true predictive capacity from luck." (Carl T Bergstrom & Jevin D West, "Calling Bullshit: The Art of Skepticism in a Data-Driven World", 2020)

"To tell an honest story, it is not enough for numbers to be correct. They need to be placed in an appropriate context so that a reader or listener can properly interpret them." (Carl T Bergstrom & Jevin D West, "Calling Bullshit: The Art of Skepticism in a Data-Driven World", 2020)

"We all know that the numerical values on each side of an equation have to be the same. The key to dimensional analysis is that the units have to be the same as well. This provides a convenient way to keep careful track of units when making calculations in engineering and other quantitative disciplines, to make sure one is computing what one thinks one is computing. When an equation exists only for the sake of mathiness, dimensional analysis often makes no sense." (Carl T Bergstrom & Jevin D West, "Calling Bullshit: The Art of Skepticism in a Data-Driven World", 2020)

"Well-designed data graphics provide readers with deeper and more nuanced perspectives, while promoting the use of quantitative information in understanding the world and making decisions." (Carl T Bergstrom & Jevin D West, "Calling Bullshit: The Art of Skepticism in a Data-Driven World", 2020)

"Without knowing the source and context, a particular statistic is worth little. Yet numbers and statistics appear rigorous and reliable simply by virtue of being quantitative, and have a tendency to spread." (Carl T Bergstrom & Jevin D West, "Calling Bullshit: The Art of Skepticism in a Data-Driven World", 2020)

✏️John Hoffmann - Collected Quotes

"A useful way to think about tables and graphics is to visualize layers. Just as photographic files may be manipulated in photo editing software using layers, data presentations are constructed by imagining that layers of an image are placed one on top of another. There are three general layers that apply to visual data presentations: (a) a frame that is typically a rectangle or matrix, (b) axes and coordinate systems (for graphics), and (c) data presented as numbers or geometric objects." (John Hoffmann, "Principles of Data Management and Presentation", 2017)

"Also known as line charts or line plots, this type of graphic displays a series of data points using line segments. […] Do not include too many lines, especially if they are difficult to distinguish. […] it is best to label the lines directly rather than use a legend. […] It is not a good idea to use line graphs with unordered categorical (nominal) data These graphs are simpler to understand when the data are ordered in some way. […] Visual acuity is enhanced when the lines do not touch the x- or y-axis […] There is no need, except under exceptional circumstances, to include a marker to show at what point the line matches a specific value of the x- and y-axes. Line graphs are designed to display patterns and trends rather than data points." (John Hoffmann, "Principles of Data Management and Presentation", 2017)

"Clarity is related to two other principles of good data presentation: precision and efficiency. Precision refers to ensuring that the data are presented accurately with minimal error. This is a topic that is equally important to data presentation as it is to data management. Always keep in mind: don’t mislead the audience. As already mentioned, people can be fooled by visual images, but they can also be misled by the myth of the infallible graphic. This refers to a tendency to believe there is an important association among concepts simply because they are correlated." (John Hoffmann, "Principles of Data Management and Presentation", 2017)

"Contrasts can be a help or a hindrance. Our eyes are drawn to bright colors on muted backgrounds. In addition, warm colors, such as red, are more likely to get attention than cool colors (although the relative brightness affects this phenomenon). Objects in color that are included in black and white or grayscale visuals are quite effective at drawing the eye. Thus, using color to highlight certain parts of a graphic or table can be valuable. However, avoid using these strategies if they will draw attention to extraneous or trivial parts of the data presentation." (John Hoffmann, "Principles of Data Management and Presentation", 2017) 

"If colors are used for different bars in a graphic, use distinguishable shades of the same color rather than distinct colors. If lines are in color in a graph, use those that are easy to discriminate, such as red and blue. But be careful of lines that cross since a red line is perceived as in front of a blue line. If colors are employed in a table, used them to highlight the relevant comparisons you wish to make. […] Use colors to highlight important parts of the graphic. […] But be careful because this practice is easily abused." (John Hoffmann, "Principles of Data Management and Presentation", 2017)

"It is generally a good idea to avoid gridlines, vertical lines, and double lines. Use single horizontal lines to separate the title, headers, and content. Lines are also employed to identify column spanners, which are used to group particular columns of data." (John Hoffmann, "Principles of Data Management and Presentation", 2017)

"Many data presentations spice up the image with background images, embedded visuals, ornate typeface, and bright colors. Our eyes may be drawn to these aspects, rather than to the patterns in the data, thus breaking the principles of clarity and efficiency. It is usually best to take out the clutter: remove the chartjunk." (John Hoffmann, "Principles of Data Management and Presentation", 2017) 

"People tend to comprehend visual images quicker and with fewer errors than words on a page. Visual images also activate memories better than words." (John Hoffmann, "Principles of Data Management and Presentation", 2017) 

"Reference tables show a lot of data with a high degree of precision. They are designed generally to provide users with a way to fi nd particular pieces of data. […] Summary tables provide some type of extraction of data from a reference table or a spreadsheet. The data are usually manipulated, analyzed, or summarized in some way, such as by sorting or providing summary statistics (means, percentages, ranges). The results of statistical models are usually presented in research reports using this type of table." (John Hoffmann, "Principles of Data Management and Presentation", 2017)

"Some experts argue that axes - in particular, the y-axis - should always begin at zero. However, when differences are small, yet the size of the numbers is relatively large, this can make detection difficult. On the other hand, viewers can be misled by manipulating the axes to magnify differences. One guideline is to always use a zero bottom point when judging absolute magnitudes. This is often the case in bar charts." (John Hoffmann, "Principles of Data Management and Presentation", 2017) 

"Titles should clearly specify the content of the table or the graphic. What is being presented? Means and standard deviations? Confidence intervals? Percentages? Trends over time? Furthermore, consider the context, such as when and where the data were gathered, as well as the name of the dataset if using secondary data (although the dataset may also be identified in a source note)." (John Hoffmann, "Principles of Data Management and Presentation", 2017) 

"Whichever scale is used to represent the data, it is important to keep it consistent in data presentations. The principles of clarity, precision, and efficiency are rarely met if the measurement scales change within tables." (John Hoffmann, "Principles of Data Management and Presentation", 2017) 

✏️Tamara Munzner - Collected Quotes

"A fundamental principle of design is to consider multiple alternatives and then choose the best, rather than to immediately fixate on one solution without considering any alternatives. One way to ensure that more than one possibility is considered is to explicitly generate multiple ideas in parallel. " (Tamara Munzner, "Visualization Analysis and Design", 2014)

"As with all design problems, vis design cannot be easily handled as a simple process of optimization because trade-offs abound. A design that does well by one measure will rate poorly on another. The characterization of trade-offs in the vis design space is a very open problem at the frontier of vis research." (Tamara Munzner, "Visualization Analysis and Design", 2014)

"Developing a clear understanding of the requirements of a particular target audience is a tricky problem for a designer.  While it might seem obvious to you that it would be a good idea to understand requirements, it’s a common pitfall for designers to cut corners by making assumptions rather than actually engaging with any target users. " (Tamara Munzner, "Visualization Analysis and Design", 2014)

"Interactivity is crucial for building vis tools that handle complexity. When datasets are large enough, the limitations of both people and displays preclude just showing everything at once; interaction where user actions cause the view to change is the way forward. Moreover, a single static view can show only one aspect of a dataset. For some combinations of simple datasets and tasks, the user may only need to see a single visual encoding. In contrast, an interactively changing display supports many possible queries. " (Tamara Munzner, "Visualization Analysis and Design", 2014)

"Statistical characterization of datasets is a very powerful approach, but it has the intrinsic limitation of losing information through summarization. " (Tamara Munzner, "Visualization Analysis and Design", 2014)

"The effectiveness principle dictates that the importance of the attribute should match the salience of the channel; that is, its noticeability. In other words, the most important attributes should be encoded with the most effective channels in order to be most noticeable, and then decreasingly important attributes can be matched with less effective channels. " (Tamara Munzner, "Visualization Analysis and Design", 2014)

"The expressiveness principle dictates that the visual encoding should express all of, and only, the information in the dataset attributes. The most fundamental expression of this principle is that ordered data should be shown in a way that our perceptual system intrinsically senses as ordered. Conversely, unordered data should not be shown in a way that perceptually implies an ordering that does not exist. Violating this principle is a common beginner’s mistake in vis. " (Tamara Munzner, "Visualization Analysis and Design", 2014)

"The idiom of heatmaps is one of the simplest uses of the matrix alignment: each cell is fully occupied by an area mark encoding a single quantitative value attribute with color. […] The benefit of heatmaps is that visually encoding quantitative data with color using small area marks is very compact, so they are good for providing overviews with high information density. " (Tamara Munzner, "Visualization Analysis and Design", 2014)

"The idiom of parallel coordinates is an approach for visualizing many quantitative attributes at once using spatial position. As the name suggests, the axes are placed parallel to each other, rather than perpendicularly at right angles. While an item is shown with a dot in a scatterplot, with parallel coordinates a single item is represented by a jagged line that zigzags through the parallel axes, crossing each axis exactly once at the location of the item’s value for the associated attribute. " (Tamara Munzner, "Visualization Analysis and Design", 2014)

"The idiom of scatterplots encodes two quantitative value variables using both the vertical and horizontal spatial position channels, and the mark type is necessarily a point. Scatterplots are effective for the abstract tasks of providing overviews and characterizing distributions, and specifically for finding outliers and extreme values. Scatterplots are also highly effective for the abstract task of judging the correlation between two attributes. With this visual encoding, that task corresponds the easy perceptual judgement of noticing whether the points form a line along the diagonal. The stronger the correlation, the closer the points fall along a perfect diagonal line; positive correlation is an upward slope, and negative is downward." (Tamara Munzner, "Visualization Analysis and Design", 2014)

"The most powerful depth cue is occlusion, where some objects can not be seen because they are hidden behind others. The visible objects are interpreted as being closer than the occluded ones. The occlusion relationships between objects change as we move around; this motion parallax allows us to build up an understanding of the relative distances between objects in the world. " (Tamara Munzner, "Visualization Analysis and Design", 2014)

"The phenomenon of change blindness is that we fail to notice even quite drastic changes if our attention is directed elsewhere. […] Although we are very sensitive to changes at the focus of our attention, we are surprisingly blind to changes when our attention is not engaged. The difficulty of tracking complex and widespread changes across multiframe animations is one of the implications of change blindness for vis. " (Tamara Munzner, "Visualization Analysis and Design", 2014)

"Three high-level targets are very broadly relevant, for all kinds of data: trends, outliers, and features. A trend is a high-level characterization of a pattern in the data. Simple examples of trends include increases, decreases, peaks, troughs, and plateaus. Almost inevitably, some data doesn’t fit well with that backdrop; those elements are the outliers. The exact definition of features is task dependent, meaning any particular structures of interest." (Tamara Munzner, "Visualization Analysis and Design", 2014)

✏️John M Chambers - Collected Quotes

"At the heart of probabilistic statistical analysis is the assumption that a set of data arises as a sample from a distribution in some class of probability distributions. The reasons for making distributional assumptions about data are several. First, if we can describe a set of data as a sample from a certain theoretical distribution, say a normal distribution (also called a Gaussian distribution), then we can achieve a valuable compactness of description for the data. For example, in the normal case, the data can be succinctly described by giving the mean and standard deviation and stating that the empirical (sample) distribution of the data is well approximated by the normal distribution. A second reason for distributional assumptions is that they can lead to useful statistical procedures. For example, the assumption that data are generated by normal probability distributions leads to the analysis of variance and least squares. Similarly, much of the theory and technology of reliability assumes samples from the exponential, Weibull, or gamma distribution. A third reason is that the assumptions allow us to characterize the sampling distribution of statistics computed during the analysis and thereby make inferences and probabilistic statements about unknown aspects of the underlying distribution. For example, assuming the data are a sample from a normal distribution allows us to use the t-distribution to form confidence intervals for the mean of the theoretical distribution. A fourth reason for distributional assumptions is that understanding the distribution of a set of data can sometimes shed light on the physical mechanisms involved in generating the data." (John M Chambers et al, "Graphical Methods for Data Analysis", 1983)

"Equal variability is not always achieved in plots. For instance, if the theoretical distribution for a probability plot has a density that drops off gradually to zero in the tails (as the normal density does), then the variability of the data in the tails of the probability plot is greater than in the center. Another example is provided by the histogram. Since the height of any one bar has a binomial distribution, the standard deviation of the height is approximately proportional to the square root of the expected height; hence, the variability of the longer bars is greater." (John M Chambers et al, "Graphical Methods for Data Analysis", 1983)

"Frequently we can increase the informativeness of a graph by removing structure from the data once we have identified it, so that subsequent plots are free of its dominating influence and can help us see finer structure or subtler effects. This usually means (l) partitioning the data, or (2) plotting differences or ratios, or (3) fitting a model and taking the residuals as a new set of data for further study." (John M Chambers et al, "Graphical Methods for Data Analysis", 1983)

"Generally speaking, a good display is one in which the visual impact of its components is matched to their importance in the context of the analysis. Consider the issue of overplotting." (John M Chambers et al, "Graphical Methods for Data Analysis", 1983)

"Graphical methodology provides powerful diagnostic tools for conveying properties of the fitted regression, for assessing the adequacy of the fit, and for suggesting improvements. There is seldom any prior guarantee that a hypothesized regression model will provide a good description of the mechanism that generated the data. Standard regression models carry with them many specific assumptions about the relationship between the response and explanatory variables and about the variation in the response that is not accounted for by the explanatory variables. In many applications of regression there is a substantial amount of prior knowledge that makes the assumptions plausible; in many other applications the assumptions are made as a starting point simply to get the analysis off the ground. But whatever the amount of prior knowledge, fitting regression equations is not complete until the assumptions have been examined." (John M Chambers et al, "Graphical Methods for Data Analysis", 1983)

"Missing data values pose a particularly sticky problem for symbols. For instance, if the ray corresponding to a missing value is simply left off of a star symbol, the result will be almost indistinguishable from a minimum (i.e., an extreme) value. It may be better either (i) to impute a value, perhaps a median for that variable, or a fitted value from some regression on other variables, (ii) to indicate that the value is missing, possibly with a dashed line, or (iii) not to draw the symbol for a particular observation if any value is missing." (John M Chambers et al, "Graphical Methods for Data Analysis", 1983)

"Part of the strategy of regression modelling is to improve the model until the residuals look 'structureless', or like a simple random sample. They should only contain structure that is already taken into account (such as nonconstant variance) or imposed by the fitting process itself. By plotting them against a variety of original and derived variables, we can look for systematic patterns that relate to the model's adequacy. Although we talk about graphics for use after the model is fit, if problems with the fit are discovered at this stage of the analysis, We should take corrective action and refit the equation or a modified form of it." (John M Chambers et al, "Graphical Methods for Data Analysis", 1983)

"Plotting on power-transformed scales (either cube roots or logs) is recommended only in those cases where the distribution is very asymmetric and the reference configuration for the untransformed plot would be a straight line through the origin." (John M Chambers et al, "Graphical Methods for Data Analysis", 1983)

"Symmetry is also important because it can simplify our thinking about the distribution of a set of data. If we can establish that the data are (approximately) symmetric, then we no longer need to describe the  shapes of both the right and left halves. (We might even combine the information from the two sides and have effectively twice as much data for viewing the distributional shape.) Finally, symmetry is important because many statistical procedures are designed for, and work best on, symmetric data." (John M Chambers et al, "Graphical Methods for Data Analysis", 1983)

"The information on a plot should be relevant to the goals of the analysis. This means that in choosing graphical methods we should match the capabilities of the methods to our needs in the context of each application. [...] Scatter plots, with the views carefully selected as in draftsman's displays, casement displays, and multiwindow plots, are likely to be more informative. We must be careful, however, not to confuse what is relevant with what we expect or want to find. Often wholly unexpected phenomena constitute our most important findings." (John M Chambers et al, "Graphical Methods for Data Analysis", 1983)

"The most important reason for portraying standard deviations is that they give us a sense of the relative variability of the points in different regions of the plot." (John M Chambers et al, "Graphical Methods for Data Analysis", 1983)

"The quantile plot is a good general display since it is fairly easy to construct and does a good job of portraying many aspects of a distribution. Three convenient features of the plot are the following: First, in constructing it, we do not make any arbitrary choices of parameter values or cell boundaries [...] and no models for the data are fitted or assumed. Second, like a table, it is not a summary but a display of all the data. Third, on the quantile plot every point is plotted at a distinct location, even if there are duplicates in the data. The number of points that can be portrayed without overlap is limited only by the resolution of the plotting device. For a high resolution device several hundred points distinguished." (John M Chambers et al, "Graphical Methods for Data Analysis", 1983)

"The truth is that one display is better than another if it leads to more understanding. Often a simpler display, one that tries to accomplish less at one time, succeeds in conveying more insight. In order to understand complicated or subtle structure in the data we should be prepared to look at complicated displays when necessary, but to see any particular type of structure we should use the simplest display that shows it."(John M Chambers et al, "Graphical Methods for Data Analysis", 1983)

"There are several reasons why symmetry is an important concept in data analysis. First, the most important single summary of a set of data is the location of the center, and when data meaning of 'center' is unambiguous. We can take center to mean any of the following things, since they all coincide exactly for symmetric data, and they are together for nearly symmetric data: (l) the Center Of symmetry. (2) the arithmetic average or center Of gravity, (3) the median or 50%. Furthermore, if data a single point of highest concentration instead of several (that is, they are unimodal), then we can add to the list (4) point of highest concentration. When data are far from symmetric, we may have trouble even agreeing on what we mean by center; in fact, the center may become an inappropriate summary for the data." (John M Chambers et al, "Graphical Methods for Data Analysis", 1983)

"We can gain further insight into what makes good p!ots by thinking about the process of visual perception. The eye can assimilate large amounts of visual information, perceive unanticipated structure, and recognize complex patterns; however, certain kinds of patterns are more readily perceived than others. If we thoroughly understood the interaction between the brain, eye, and picture, we could organize displays to take advantage of the things that the eye and brain do best, so that the potentially most important patterns are associated with the most easily perceived visual aspects in the display." (John M Chambers et al, "Graphical Methods for Data Analysis", 1983)

"When some interesting structure is seen in a plot, it is an advantage to be able to relate that structure back to the original data in a clear, direct, and meaningful way. Although this seems obvious, interpretability is at once one of the most important, difficult, and controversial issues." (John M Chambers et al, "Graphical Methods for Data Analysis", 1983)

04 December 2006

✏️Lawrence C Hamilton - Collected Quotes

"Boxplots provide information at a glance about center (median), spread (interquartile range), symmetry, and outliers. With practice they are easy to read and are especially useful for quick comparisons of two or more distributions. Sometimes unexpected features such as outliers, skew, or differences in spread are made obvious by boxplots but might otherwise go unnoticed." (Lawrence C Hamilton, "Regression with Graphics: A second course in applied statistics", 1991)

"Comparing normal distributions reduces to comparing only means and standard deviations. If standard deviations are the same, the task even simpler: just compare means. On the other hand, means and standard deviations may be incomplete or misleading as summaries for nonnormal distributions." (Lawrence C Hamilton, "Regression with Graphics: A second course in applied statistics", 1991)

"Correlation and covariance are linear regression statistics. Nonlinearity and influential cases cause the same problems for correlations, and hence for principal components/factor analysis, as they do for regression. Scatterplots should be examined routinely to check for nonlinearity and outliers. Diagnostic checks become even more important with maximum-likelihood factor analysis, which makes stronger assumptions and may be less robust than principal components or principal factors." (Lawrence C Hamilton, "Regression with Graphics: A second course in applied statistics", 1991)

"Data analysis is rarely as simple in practice as it appears in books. Like other statistical techniques, regression rests on certain assumptions and may produce unrealistic results if those assumptions are false. Furthermore it is not always obvious how to translate a research question into a regression model." (Lawrence C Hamilton, "Regression with Graphics: A second course in applied statistics", 1991)

"Data analysis typically begins with straight-line models because they are simplest, not because we believe reality is inherently linear. Theory or data may suggest otherwise [...]" (Lawrence C Hamilton, "Regression with Graphics: A second course in applied statistics", 1991)

"Exploratory regression methods attempt to reveal unexpected patterns, so they are ideal for a first look at the data. Unlike other regression techniques, they do not require that we specify a particular model beforehand. Thus exploratory techniques warn against mistakenly fitting a linear model when the relation is curved, a waxing curve when the relation is S-shaped, and so forth." (Lawrence C Hamilton, "Regression with Graphics: A second course in applied statistics", 1991)

"If a distribution were perfectly symmetrical, all symmetry-plot points would be on the diagonal line. Off-line points indicate asymmetry. Points fall above the line when distance above the median is greater than corresponding distance below the median. A consistent run of above-the-line points indicates positive skew; a run of below-the-line points indicates negative skew." (Lawrence C Hamilton, "Regression with Graphics: A second course in applied statistics", 1991)

"Principal components and factor analysis are methods for data reduction. They seek a few underlying dimensions that account for patterns of variation among the observed variables underlying dimensions imply ways to combine variables, simplifying subsequent analysis. For example, a few combined variables could replace many original variables in a regression. Advantages of this approach include more parsimonious models, improved measurement of indirectly observed concepts, new graphical displays, and the avoidance of multicollinearity." (Lawrence C Hamilton, "Regression with Graphics: A second course in applied statistics", 1991)

"Principal components and principal factor analysis lack a well-developed theoretical framework like that of least squares regression. They consequently provide no systematic way to test hypotheses about the number of factors to retain, the size of factor loadings, or the correlations between factors, for example. Such tests are possible using a different approach, based on maximum-likelihood estimation." (Lawrence C Hamilton, "Regression with Graphics: A second course in applied statistics", 1991)

"Remember that normality and symmetry are not the same thing. All normal distributions are symmetrical, but not all symmetrical distributions are normal. With water use we were able to transform the distribution to be approximately symmetrical and normal, but often symmetry is the most we can hope for. For practical purposes, symmetry (with no severe outliers) may be sufficient. Transformations are not a magic wand, however. Many distributions cannot even be made symmetrical." (Lawrence C Hamilton, "Regression with Graphics: A second course in applied statistics", 1991)

"Visually, skewed sample distributions have one 'longer' and one 'shorter' tail. More general terms are 'heavier' and 'lighter' tails. Tail weight reflects not only distance from the center (tail length) but also the frequency of cases at that distance (tail depth, in a histogram). Tail weight corresponds to actual weight if the sample histogram were cut out of wood and balanced like a seesaw on its median (see next section). A positively skewed distribution is heavier to the right of the median; negative skew implies the opposite." (Lawrence C Hamilton, "Regression with Graphics: A second course in applied statistics", 1991)

"A well-constructed graph can show several features of the data at once. Some graphs contain as much information as the original data, and so (unlike numerical summaries) do not actually simplify the data; rather, they express it in visual form. Unexpected or unusual features, which are not obvious within numerical tables, often jump to our attention once we draw a graph. Because the strengths and weaknesses of graphical methods are opposite those of numerical summary methods, the two work best in combination." (Lawrence C Hamilton, "Data Analysis for Social Scientists: A first course in applied statistics", 1995)

"Data analysis [...] begins with a dataset in hand. Our purpose in data analysis is to learn what we can from those data, to help us draw conclusions about our broader research questions. Our research questions determine what sort of data we need in the first place, and how we ought to go about collecting them. Unless data collection has been done carefully, even a brilliant analyst may be unable to reach valid conclusions regarding the original research questions." (Lawrence C Hamilton, "Data Analysis for Social Scientists: A first course in applied statistics", 1995)

"Variance and its square root, the standard deviation, summarize the amount of spread around the mean, or how much a variable varies. Outliers influence these statistics too, even more than they influence the mean. On the other hand. the variance and standard deviation have important mathematical advantages that make them (together with the mean) the foundation of classical statistics. If a distribution appears reasonably symmetrical, with no extreme outliers, then the mean and standard deviation or variance are the summaries most analysts would use." (Lawrence C Hamilton, "Data Analysis for Social Scientists: A first course in applied statistics", 1995)

✏️William S Cleveland - Collected Quotes

"A graphical form that involves elementary perceptual tasks that lead to more accurate judgments than another graphical form (with the same quantitative in formation) will result in better organization and increase the chances of a correct perception of patterns and behavior." (William S Cleveland & Robert McGill, "Graphical Perception: Theory, Experimentation, and Application to the Development of Graphical Methods", Journal of the American Statistical Association Vol. 79(387), 1984)

"Dot charts are suggested as replacements for bar charts. The replacements allow more effective visual decoding of the quantitative information and can be used for a wider variety of data sets." (William S. Cleveland, "Graphical Methods for Data Presentation: Full Scale Breaks, Dot Charts, and Multibased Logging", The American Statistician Vol. 38 (4) 1984)

"[...] error bars are more effectively portrayed on dot charts than on bar charts. […] On the bar chart the upper values of the intervals stand out well, but the lower values are visually deemphasized and are not as well perceived as a result of being embedded in the bars. This deemphasis does not occur on the dot chart." (William S. Cleveland, "Graphical Methods for Data Presentation: Full Scale Breaks, Dot Charts, and Multibased Logging", The American Statistician Vol. 38 (4) 1984)

"Experimentation with graphical methods for data presentation is important for improving graphical communication in science." (William S. Cleveland, "Graphical Methods for Data Presentation: Full Scale Breaks, Dot Charts, and Multibased Logging", The American Statistician Vol. 38 (4) 1984)

"For certain types of data structures, one cannot always use the most accurate elementary task, judging position along a common scale. But this is not true of the data represented in divided bar charts and pie charts; one can always represent such data along a common scale. A pie chart can always be replaced by a bar chart, thus replacing angle judgments by position judgments. […] A divided bar chart can always be replaced by a grouped bar chart; […]." (William S Cleveland & Robert McGill, "Graphical Perception: Theory, Experimentation, and Application to the Development of Graphical Methods", Journal of the American Statistical Association Vol. 79(387), 1984)

"Of course increased bias does not necessarily imply less overall accuracy. The reasoning, however, is that the mechanism leading to bias might well lead to other types of inaccuracy as well." (William S Cleveland & Robert McGill, "Graphical Perception: Theory, Experimentation, and Application to the Development of Graphical Methods", Journal of the American Statistical Association Vol. 79(387), 1984)

"One must be careful not to fall into a conceptual trap by adopting accuracy as a criterion. We are not saying that the primary purpose of a graph is to convey numbers with as many decimal places as possible. […] The power of a graph is its ability to enable one to take in the quantitative information, organize it, and see patterns and structure not readily revealed by other means of studying the data." (William S Cleveland & Robert McGill, "Graphical Perception: Theory, Experimentation, and Application to the Development of Graphical Methods", Journal of the American Statistical Association Vol. 79(387), 1984)

"The bar of a bar chart has two aspects that can be used to visually decode quantitative information-size (length and area) and the relative position of the end of the bar along the common scale. The changing sizes of the bars is an important and imposing visual factor; thus it is important that size encode something meaningful. The sizes of bars encode the magnitudes of deviations from the baseline. If the deviations have no important interpretation, the changing sizes are wasted energy and even have the potential to mislead." (William S. Cleveland, "Graphical Methods for Data Presentation: Full Scale Breaks, Dot Charts, and Multibased Logging", The American Statistician Vol. 38 (4) 1984) 

"The full break results in a graph with two juxtaposed panels. This use of juxtaposition to provide a full scale break, with each panel having a fill frame and its own scales, shows the scale break about as forcefully as possible and discourages mental visual connections by viewers and actual connections by authors." (William S. Cleveland, "Graphical Methods for Data Presentation: Full Scale Breaks, Dot Charts, and Multibased Logging", The American Statistician Vol. 38 (4) 1984) 

"The logarithm is an extremely powerful and useful tool for graphical data presentation. One reason is that logarithms turn ratios into differences, and for many sets of data, it is natural to think in terms of ratios. […] Another reason for the power of logarithms is resolution. Data that are amounts or counts are often very skewed to the right; on graphs of such data, there are a few large values that take up most of the scale and the majority of the points are squashed into a small region of the scale with no resolution." (William S. Cleveland, "Graphical Methods for Data Presentation: Full Scale Breaks, Dot Charts, and Multibased Logging", The American Statistician Vol. 38 (4) 1984)

"[…] the partial scale break is a weak indicator that the reader can fail to appreciate fully; visually the graph is still a single panel that invites the viewer to see, inappropriately, patterns between the two scales. […] The partial scale break also invites authors to connect points across the break, a poor practice indeed; […]" (William S. Cleveland, "Graphical Methods for Data Presentation: Full Scale Breaks, Dot Charts, and Multibased Logging", The American Statistician Vol. 38 (4) 1984) 

"A connected graph is appropriate when the time series is smooth, so that perceiving individual values is not important. A vertical line graph is appropriate when it is important to see individual values, when we need to see short-term fluctuations, and when the time series has a large number of values; the use of vertical lines allows us to pack the series tightly along the horizontal axis. The vertical line graph, however, usually works best when the vertical lines emanate from a horizontal line through the center of the data and when there are no long-term trends in the data." (William S Cleveland, "The Elements of Graphing Data", 1985)

"A time series is a special case of the broader dependent-independent variable category. Time is the independent variable. One important property of most time series is that for each time point of the data there is only a single value of the dependent variable; there are no repeat measurements. Furthermore, most time series are measured at equally-spaced or nearly equally-spaced points in time." (William S Cleveland, "The Elements of Graphing Data", 1985)

"Another way to obscure data is to graph too much. It is always tempting to show everything that comes to mind on a single graph, but graphing too much can result in less being seen and understood." (William S Cleveland, "The Elements of Graphing Data", 1985)

"Do not allow data labels in the data region to interfere with the quantitative data or to clutter the graph. […] Avoid putting notes, keys, and markers in the data region. Put keys and markers just outside the data region and put notes in the legend or in the text." (William S Cleveland, "The Elements of Graphing Data", 1985)

"Clear vision is a vital aspect of graphs. The viewer must be able to visually disentangle the many different items that appear on a graph." (William S Cleveland, "The Elements of Graphing Data", 1985)

"Graphs that communicate data to others often must undergo reduction and reproduction; these processes, if not done with care, can interfere with visual clarity." (William S Cleveland, "The Elements of Graphing Data", 1985)

"In part, graphing data needs to be iterative because we often do not know what to expect of the data; a graph can help discover unknown aspects of the data, and once the unknown is known, we frequently find ourselves formulating a new question about the data. Even when we understand the data and are graphing them for presentation, a graph will look different from what we had expected; our mind's eye frequently does not do a good job of predicting what our actual eyes will see." (William S Cleveland, "The Elements of Graphing Data", 1985)

"It is common for positive data to be skewed to the right: some values bunch together at the low end of the scale and others trail off to the high end with increasing gaps between the values as they get higher. Such data can cause severe resolution problems on graphs, and the common remedy is to take logarithms. Indeed, it is the frequent success of this remedy that partly accounts for the large use of logarithms in graphical data display." (William S Cleveland, "The Elements of Graphing Data", 1985)

"Iteration and experimentation are important for all of data analysis, including graphical data display. In many cases when we make a graph it is immediately clear that some aspect is inadequate and we regraph the data. In many other cases we make a graph, and all is well, but we get an idea for studying the data in a different way with a different graph; one successful graph often suggests another." (William S Cleveland, "The Elements of Graphing Data", 1985)

"Make the data stand out and avoid superfluity are two broad strategies that serve as an overall guide to the specific principles […] The data - the quantitative and qualitative information in the data region - are the reason for the existence of the graph. The data should stand out. […] We should eliminate superfluity in graphs. Unnecessary parts of a graph add to the clutter and increase the difficulty of making the necessary elements - the data - stand out." (William S Cleveland, "The Elements of Graphing Data", 1985)

"No matter how clever the choice of the information, and no matter how technologically impressive the encoding, a visualization fails if the decoding fails. Some display methods lead to efficient, accurate decoding, and others lead to inefficient, inaccurate decoding. It is only through scientific study of visual perception that informed judgments can be made about display methods." (William S Cleveland, "The Elements of Graphing Data", 1985)

"There are some who argue that a graph is a success only if the important information in the data can be seen within a few seconds. While there is a place for rapidly-understood graphs, it is too limiting to make speed a requirement in science and technology, where the use of graphs ranges from, detailed, in-depth data analysis to quick presentation." (William S Cleveland, "The Elements of Graphing Data", 1985)

"Use a reference line when there is an important value that must be seen across the entire graph, but do not let the line interfere with the data." (William S Cleveland, "The Elements of Graphing Data", 1985)

"When a graph is constructed, quantitative and categorical information is encoded, chiefly through position, size, symbols, and color. When a person looks at a graph, the information is visually decoded by the person's visual system. A graphical method is successful only if the decoding process is effective. No matter how clever and how technologically impressive the encoding, it is a failure if the decoding process is a failure. Informed decisions about how to encode data can be achieved only through an understanding of the visual decoding process, which is called graphical perception." (William S Cleveland, "The Elements of Graphing Data", 1985)

"When magnitudes are graphed on a logarithmic scale, percents and factors are easier to judge since equal multiplicative factors and percents result in equal distances throughout the entire scale." (William S Cleveland, "The Elements of Graphing Data", 1985)

"When the data are magnitudes, it is helpful to have zero included in the scale so we can see its value relative to the value of the data. But the need for zero is not so compelling that we should allow its inclusion to ruin the resolution of the data on the graph." (William S Cleveland, "The Elements of Graphing Data", 1985)

"Data that are skewed toward large values occur commonly. Any set of positive measurements is a candidate. Nature just works like that. In fact, if data consisting of positive numbers range over several powers of ten, it is almost a guarantee that they will be skewed. Skewness creates many problems. There are visualization problems. A large fraction of the data are squashed into small regions of graphs, and visual assessment of the data degrades. There are characterization problems. Skewed distributions tend to be more complicated than symmetric ones; for example, there is no unique notion of location and the median and mean measure different aspects of the distribution. There are problems in carrying out probabilistic methods. The distribution of skewed data is not well approximated by the normal, so the many probabilistic methods based on an assumption of a normal distribution cannot be applied." (William S Cleveland, "Visualizing Data", 1993)

"Fitting data means finding mathematical descriptions of structure in the data. An additive shift is a structural property of univariate data in which distributions differ only in location and not in spread or shape. […] The process of identifying a structure in data and then fitting the structure to produce residuals that have the same distribution lies at the heart of statistical analysis. Such homogeneous residuals can be pooled, which increases the power of the description of the variation in the data." (William S Cleveland, "Visualizing Data", 1993)

"Fitting is essential to visualizing hypervariate data. The structure of data in many dimensions can be exceedingly complex. The visualization of a fit to hypervariate data, by reducing the amount of noise, can often lead to more insight. The fit is a hypervariate surface, a function of three or more variables. As with bivariate and trivariate data, our fitting tools are loess and parametric fitting by least-squares. And each tool can employ bisquare iterations to produce robust estimates when outliers or other forms of leptokurtosis are present." (William S Cleveland, "Visualizing Data", 1993)

"If the underlying pattern of the data has gentle curvature with no local maxima and minima, then locally linear fitting is usually sufficient. But if there are local maxima or minima, then locally quadratic fitting typically does a better job of following the pattern of the data and maintaining local smoothness." (William S Cleveland, "Visualizing Data", 1993)

"Many good things happen when data distributions are well approximated by the normal. First, the question of whether the shifts among the distributions are additive becomes the question of whether the distributions have the same standard deviation; if so, the shifts are additive. […] A second good happening is that methods of fitting and methods of probabilistic inference, to be taken up shortly, are typically simple and on well understood ground. […] A third good thing is that the description of the data distribution is more parsimonious." (William S Cleveland, "Visualizing Data", 1993)

"Many of the applications of visualization in this book give the impression that data analysis consists of an orderly progression of exploratory graphs, fitting, and visualization of fits and residuals. Coherence of discussion and limited space necessitate a presentation that appears to imply this. Real life is usually quite different. There are blind alleys. There are mistaken actions. There are effects missed until the very end when some visualization saves the day. And worse, there is the possibility of the nearly unmentionable: missed effects." (William S Cleveland, "Visualizing Data", 1993)

"One important aspect of reality is improvisation; as a result of special structure in a set of data, or the finding of a visualization method, we stray from the standard methods for the data type to exploit the structure or the finding." (William S Cleveland, "Visualizing Data", 1993)

"Probabilistic inference is the classical paradigm for data analysis in science and technology. It rests on a foundation of randomness; variation in data is ascribed to a random process in which nature generates data according to a probability distribution. This leads to a codification of uncertainly by confidence intervals and hypothesis tests." (William S Cleveland, "Visualizing Data", 1993)

"Sometimes, when visualization thoroughly reveals the structure of a set of data, there is a tendency to underrate the power of the method for the application. Little effort is expended in seeing the structure once the right visualization method is used, so we are mislead into thinking nothing exciting has occurred." (William S Cleveland, "Visualizing Data", 1993)

"The logarithm is one of many transformations that we can apply to univariate measurements. The square root is another. Transformation is a critical tool for visualization or for any other mode of data analysis because it can substantially simplify the structure of a set of data. For example, transformation can remove skewness toward large values, and it can remove monotone increasing spread. And often, it is the logarithm that achieves this removal." (William S Cleveland, "Visualizing Data", 1993)

"The scatterplot is a useful exploratory method for providing a first look at bivariate data to see how they are distributed throughout the plane, for example, to see clusters of points, outliers, and so forth." (William S Cleveland, "Visualizing Data", 1993)

"There are two components to visualizing the structure of statistical data - graphing and fitting. Graphs are needed, of course, because visualization implies a process in which information is encoded on visual displays. Fitting mathematical functions to data is needed too. Just graphing raw data, without fitting them and without graphing the fits and residuals, often leaves important aspects of data undiscovered." (William S Cleveland, "Visualizing Data", 1993)

"Using area to encode quantitative information is a poor graphical method. Effects that can be readily perceived in other visualizations are often lost in an encoding by area." (William S Cleveland, "Visualizing Data", 1993)

"Visualization is an approach to data analysis that stresses a penetrating look at the structure of data. No other approach conveys as much information. […] Conclusions spring from data when this information is combined with the prior knowledge of the subject under investigation." (William S Cleveland, "Visualizing Data", 1993)

"Visualization is an effective framework for drawing inferences from data because its revelation of the structure of data can be readily combined with prior knowledge to draw conclusions. By contrast, because of the formalism of probabilistic methods, it is typically impossible to incorporate into them the full body of prior information." (William S Cleveland, "Visualizing Data", 1993)

"When distributions are compared, the goal is to understand how the distributions shift in going from one data set to the next. […] The most effective way to investigate the shifts of distributions is to compare corresponding quantiles." (William S Cleveland, "Visualizing Data", 1993)

"When the distributions of two or more groups of univariate data are skewed, it is common to have the spread increase monotonically with location. This behavior is monotone spread. Strictly speaking, monotone spread includes the case where the spread decreases monotonically with location, but such a decrease is much less common for raw data. Monotone spread, as with skewness, adds to the difficulty of data analysis. For example, it means that we cannot fit just location estimates to produce homogeneous residuals; we must fit spread estimates as well. Furthermore, the distributions cannot be compared by a number of standard methods of probabilistic inference that are based on an assumption of equal spreads; the standard t-test is one example. Fortunately, remedies for skewness can cure monotone spread as well." (William S Cleveland, "Visualizing Data", 1993)

"Pie charts have severe perceptual problems. Experiments in graphical perception have shown that compared with dot charts, they convey information far less reliably. But if you want to display some data, and perceiving the information is not so important, then a pie chart is fine." (Richard Becker & William S Cleveland," S-Plus Trellis Graphics User's Manual", 1996)

✏️Scott Berinato - Collected Quotes

"A chart that knows its context well will naturally end up looking better because it’s showing what it needs to show and nothing else. Good context begets good design. Good charts are only the means to a more profound end: presenting your ideas effectively. Good charts are not the product you’re after. They’re the way to deliver your product - insight." (Scott Berinato, "Good Charts : the HBR guide to making smarter, more persuasive data visualizations", 2023)

"A perfectly relevant visualization that breaks a few presentation rules is far more valuable - it’s better - than a perfectly executed, beautiful chart that contains the wrong data, communicates the wrong message, or fails to engage its audience." (Scott Berinato, "Good Charts : the HBR guide to making smarter, more persuasive data visualizations", 2023)

"[…] although the relationship between perception and correlation is linear for all types of charts, the linear rate varies between chart types." (Scott Berinato, "Good Charts : the HBR guide to making smarter, more persuasive data visualizations", 2023)

"Bad complexity neither elucidates important salient points nor shows coherent broader trends. It will obfuscate, frustrate, tax the mind, and ultimately convey trendlessness and confusion to the viewer. Good complexity, in contrast, emerges from visualizations that use more data than humans can reasonably process to form a few salient points." (Scott Berinato, "Good Charts : the HBR guide to making smarter, more persuasive data visualizations", 2023)

"But rules are open to interpretation and sometimes arbitrary or even counterproductive when it comes to producing good visualizations. They’re for responding to context, not setting it. Instead of worrying about whether a chart is "right" or "wrong", focus on whether it’s good." (Scott Berinato, "Good Charts : the HBR guide to making smarter, more persuasive data visualizations", 2023)

"Charts used to confirm are less formal, and designed well enough to be interpreted, but they don’t always have to be presentation worthy. […] Or maybe you don’t know what you’re looking for […] This is exploratory work - rougher still in design, usually iterative, sometimes interactive. Most of us don’t do as much exploratory work as we do declarative and confirmatory; we should do more. It’s a kind of data brainstorming." (Scott Berinato, "Good Charts : the HBR guide to making smarter, more persuasive data visualizations", 2023)

"Confirmation is a kind of focused exploration, whereas true exploration is more open-ended. The bigger and more complex the data, and the less you know going in, the more exploratory the work. If confirmation is hiking a new trail, exploration is blazing one." (Scott Berinato, "Good Charts : the HBR guide to making smarter, more persuasive data visualizations", 2023)

"Dataviz has become a competitive imperative for companies. Those that don’t have a critical mass of managers capable of thinking visually will lag behind the ones that do." (Scott Berinato, "Good Charts : the HBR guide to making smarter, more persuasive data visualizations", 2023)

"Good design isn’t just choosing colors and fonts or coming up with an aesthetic for charts. That’s styling - part of design, but by no means the most important part. Rather, people with design talent develop and execute systems for effective visual communication. They understand how to create and edit visuals to focus an audience and distill ideas." (Scott Berinato, "Good Charts : the HBR guide to making smarter, more persuasive data visualizations", 2023)

"Good design serves a more important function than simply pleasing you: It helps you access ideas. It improves your comprehension and makes the ideas more persuasive. Good design makes lesser charts good and good charts transcendent." (Scott Berinato, "Good Charts : the HBR guide to making smarter, more persuasive data visualizations", 2023)

"In general, charts that contain enough data to take minutes, not seconds, to digest will work better on paper or a personal screen, for an individual who’s not being asked to listen to a presentation while trying to take in so much information." (Scott Berinato, "Good Charts : the HBR guide to making smarter, more persuasive data visualizations", 2023)

"Keep in mind that bars, lines, and scatter plots are your workhorses. Those three forms alone will help you arrive at many good charts in most situations. While you shouldn’t shun other forms, you also don’t need to choose dif­ferent ones just to be dif­ferent." (Scott Berinato, "Good Charts : the HBR guide to making smarter, more persuasive data visualizations", 2023)

"People feel data. They don’t just process statistics and come to rational conclusions. They form emotions about the data visualization. We are not informed by charts; we’re affected by them." (Scott Berinato, "Good Charts : the HBR guide to making smarter, more persuasive data visualizations", 2023)

"Sketching bridges idea and visualization. Good sketches are quick, simple, and messy. Don’t think too much about real values or scales or any refining details. In fact, don’t think too much. Just keep in mind those keywords, the possible forms they suggest, and that overarching idea you keep coming back to, the one you wrote down in answer to What am I trying to say (or learn)? And draw. Create shapes, develop a sense of what you want your audience to see. Try anything." (Scott Berinato, "Good Charts : the HBR guide to making smarter, more persuasive data visualizations", 2023)

"To build fluency in this new language, to tap into this vehicle for professional growth, and to give your organization a competitive edge, you first need to recognize a good chart when you see one." (Scott Berinato, "Good Charts : the HBR guide to making smarter, more persuasive data visualizations", 2023)

"Unlike text, visual communication is governed less by an agreed-upon convention between 'writer' and 'reader' than by how our visual systems react to stimuli, often before we’re aware of it. And just as composers use music theory to create music that produces certain predictable effects on an audience, chart makers can use visual perception theory to make more-effective visualizations with similarly predictable effects." (Scott Berinato, "Good Charts : the HBR guide to making smarter, more persuasive data visualizations", 2023)

"Ultimately, when you create a visualization, that’s what you need to know. Is it good? Is it effective? Are you helping people see an idea and learn from it? Are you making your case?" (Scott Berinato, "Good Charts : the HBR guide to making smarter, more persuasive data visualizations", 2023)

"Visualization is an abstraction, a way to reduce complexity […] complexity and color catch the eye; they’re captivating. They can also make it harder to extract meaning from a chart." (Scott Berinato, "Good Charts : the HBR guide to making smarter, more persuasive data visualizations", 2023)

"We see first what stands out. Our eyes go right to change and difference - peaks, valleys, intersections, dominant colors, outliers. Many successful charts - often the ones that please us the most and are shared and talked about - exploit this inclination by showing a single salient point so clearly that we feel we understand the chart’s meaning without even trying." (Scott Berinato, "Good Charts : the HBR guide to making smarter, more persuasive data visualizations", 2023)

"When deeply complex charts work, we find them effective and beautiful, just as we find a symphony beautiful, which is another marvelously complex arrangement of millions of data points that we experience as a coherent whole." (Scott Berinato, "Good Charts : the HBR guide to making smarter, more persuasive data visualizations", 2023)

"Without context, no one […] can say whether that chart is good. In the absence of context, a chart is neither good nor bad. It’s only well built or poorly built. To judge a chart’s value, you need to know more - much more - than whether you used the right chart type, picked good colors, or labeled axes correctly. Those things can help make charts good, but in the absence of context they’re academic considerations. It’s far more important to know Who will see this? What do they want? What do they need? What idea do I want to convey? What could I show? What should I show? Then, after all that, How will I show it?" (Scott Berinato, "Good Charts : the HBR guide to making smarter, more persuasive data visualizations", 2023)

"Your eyes and your brain always notice more dynamic visual information first and fastest. The implicit lesson is to make the idea you want people to see stand out. Conversely, make sure you’re not helping people see something that either doesn’t help convey your idea or actively fights against it." (Scott Berinato, "Good Charts : the HBR guide to making smarter, more persuasive data visualizations", 2023)

✏️Antony Unwin - Collected Quotes

"Deciding on which graphics to use is often a matter of taste. What one person thinks are good graphics for illustrating information may not appeal to someone else. It may also happen that different people interpret the same graphic in quite different ways. (Antony Unwin [in "Graphics of Large Datasets: Visualizing a Million"], 2006) 

"Clearly principles and guidelines for good presentation graphics have a role to play in exploratory graphics, but personal taste and individual working style also play important roles. The same data may be presented in many alternative ways, and taste and customs differ as to what is regarded as a good presentation graphic. Nevertheless, there are principles that should be respected and guidelines that are generally worth following. No one should expect a perfect consensus where graphics are concerned." (Antony Unwin, "Good Graphics?" [in "Handbook of Data Visualization"], 2008)

"Data visualization [...] expresses the idea that it involves more than just representing data in a graphical form (instead of using a table). The information behind the data should also be revealed in a good display; the graphic should aid readers or viewers in seeing the structure in the data. The term data visualization is related to the new field of information visualization. This includes visualization of all kinds of information, not just of data, and is closely associated with research by computer scientists." (Antony Unwin et al, "Introduction" [in "Handbook of Data Visualization"], 2008) 

"For a given dataset there is not a great deal of advice which can be given on content and context. hose who know their own data should know best for their specific purposes. It is advisable to think hard about what should be shown and to check with others if the graphic makes the desired impression. Design should be let to designers, though some basic guidelines should be followed: consistency is important (sets of graphics should be in similar style and use equivalent scaling); proximity is helpful (place graphics on the same page, or on the facing page, of any text that refers to them); and layout should be checked (graphics should be neither too small nor too large and be attractively positioned relative to the whole page or display)." (Antony Unwin, "Good Graphics?" [in "Handbook of Data Visualization"], 2008)

"There are two main reasons for using graphic displays of datasets: either to present or to explore data. Presenting data involves deciding what information you want to convey and drawing a display appropriate for the content and for the intended audience. [...] Exploring data is a much more individual matter, using graphics to find information and to generate ideas.Many displays may be drawn. They can be changed at will or discarded and new versions prepared, so generally no one plot is especially important, and they all have a short life span." (Antony Unwin, "Good Graphics?" [in "Handbook of Data Visualization"], 2008)

"Eye-catching data graphics tend to use designs that are unique (or nearly so) without being strongly focused on the data being displayed. In the world of Infovis, design goals can be pursued at the expense of statistical goals. In contrast, default statistical graphics are to a large extent determined by the structure of the data (line plots for time series, histograms for univariate data, scatterplots for bivariate nontime-series data, and so forth), with various conventions such as putting predictors on the horizontal axis and outcomes on the vertical axis. Most statistical graphs look like other graphs, and statisticians often think this is a good thing." (Andrew Gelman & Antony Unwin, "Infovis and Statistical Graphics: Different Goals, Different Looks" , Journal of Computational and Graphical Statistics Vol. 22(1), 2013)

"Providing the right comparisons is important, numbers on their own make little sense, and graphics should enable readers to make up their own minds on any conclusions drawn, and possibly see more. On the Infovis side, computer scientists and designers are interested in grabbing the readers' attention and telling them a story. When they use data in a visualization (and data-based graphics are only a subset of the field of Infovis), they provide more contextual information and make more effort to awaken the readers' interest. We might argue that the statistical approach concentrates on what can be got out of the available data and the Infovis approach uses the data to draw attention to wider issues. Both approaches have their value, and it would probably be best if both could be combined." (Andrew Gelman & Antony Unwin, "Infovis and Statistical Graphics: Different Goals, Different Looks" , Journal of Computational and Graphical Statistics Vol. 22(1), 2013)

"Statisticians tend to use standard graphic forms (e.g., scatterplots and time series), which enable the experienced reader to quickly absorb lots of information but may leave other readers cold. We personally prefer repeated use of simple graphical forms, which we hope draw attention to the data rather than to the form of the display." (Andrew Gelman & Antony Unwin, "Infovis and Statistical Graphics: Different Goals, Different Looks" , Journal of Computational and Graphical Statistics Vol. 22(1), 2013)

"[…] we do see a tension between the goal of statistical communication and the more general goal of communicating the qualitative sense of a dataset. But graphic design is not on one side or another of this divide. Rather, design is involved at all stages, especially when several graphics are combined to contribute to the overall picture, something we would like to see more of." (Andrew Gelman & Antony Unwin, "Tradeoffs in Information Graphics", Journal of Computational and Graphical Statistics, 2013)

"Yes, it can sometimes be possible for a graph to be both beautiful and informative […]. But such synergy is not always possible, and we believe that an approach to data graphics that focuses on celebrating such wonderful examples can mislead people by obscuring the tradeoffs between the goals of visual appeal to outsiders and statistical communication to experts." (Andrew Gelman & Antony Unwin, "Tradeoffs in Information Graphics", Journal of Computational and Graphical Statistics, 2013) 

✏️Naomi B Robbins - Collected Quotes

"Choose an aspect ratio that shows variation in the data." (Naomi B Robbins, "Creating More effective Graphs", 2005) 

"Choose scales wisely, as they have a profound influence on the interpretation of graphs. Not all scales require that zero be included, but bar graphs and other graphs where area is judged do require it." (Naomi B Robbins, "Creating More effective Graphs", 2005)

"Creating a more effective graph involves choosing a graphical construction in which the visual decoding uses tasks as high as possible on the ordered list of elementary graphical tasks while balancing this ordering with consideration of distance and detection." (Naomi B Robbins, "Creating More effective Graphs", 2005)

"Distance and detection also play a role in our ability to decode information from graphs. The closer together objects are, the easier it is to judge attributes that compare them. As distance between objects increases, accuracy of judgment decreases. It is certainly easier to judge the difference in lengths of two bars if they are next to one another than if they are pages apart." (Naomi B Robbins, "Creating More effective Graphs", 2005) 

"Graphs are for the forest and tables are for the trees. Graphs give you the big picture and show you the trends; tables give you the details." (Naomi B Robbins, "Creating More effective Graphs", 2005) 

"Graphs are pictorial representations of numerical quantities. It therefore seems reasonable to expect that the visual impression we get when looking at a graph is proportional to the numbers that the graph represents. Unfortunately, this is not always the case." (Naomi B Robbins, "Creating More effective Graphs", 2005) 

"One graph is more effective than another if its quantitative information can be decoded more quickly or more easily by most observers. […] This definition of effectiveness assumes that the reason we draw graphs is to communicate information - but there are actually many other reasons to draw graphs." (Naomi B Robbins, "Creating More effective Graphs", 2005) 

"The principles of drawing effective graphs are the same no matter what the medium: strive for clarity and conciseness. However, since a reader may spend more time studying a written report than is possible during a presentation, more detail can be included." (Naomi B Robbins, "Creating More effective Graphs", 2005) 

"Use a logarithmic scale when it is important to understand percent change or multiplicative factors. […] Showing data on a logarithmic scale can cure skewness toward large values." (Naomi B Robbins, "Creating More effective Graphs", 2005) 

"Use a scale break only when necessary. If a break cannot be avoided, use a full scale break. Taking logs can cure the need for a break." (Naomi B Robbins, "Creating More effective Graphs", 2005)

"We make angle judgments when we read a pie chart, but we don't judge angles very well. These judgments are biased; we underestimate acute angles (angles less than 90°) and overestimate obtuse angles (angles greater than 90°). Also, angles with horizontal bisectors (when the line dividing the angle in two is horizontal) appear larger than angles with vertical bisectors." (Naomi B Robbins, "Creating More effective Graphs", 2005)

03 December 2006

✏️Martin Theus - Collected Quotes

"Any conclusion drawn from an analysis of a transformed variable must be retranslated into the original domain - which is usually not an easy task. A special handling of outliers, be it a complete removal, or just visual suppression such as hot-selection or shadowing, must have a cogent motivation. At any rate, transformations of data are usually part of a data preprocessing step that might precede a data analysis. Also it can be motivated by initial findings in a data analysis which revealed yet undiscovered problems in the dataset." (Martin Theus & Simon Urbanek, "Interactive Graphics for Data Analysis: Principles and Examples", 2009) 

"Basically, one can distinguish three motivations for weighted data. The first is a technical motivation. Whenever we look at purely categorical data, it is not necessary to supply a dataset case by case. A breakdown summary can capture the dataset without loss of any information. […] The second situation in which weights are introduced is when sampling unequally from a population. Statistics and graphics must then account for the weights. A third reason to use weights is a change of the sampling population." (Martin Theus & Simon Urbanek, "Interactive Graphics for Data Analysis: Principles and Examples", 2009)

"Choropleth maps are most effective when the range of the color-shading is fully used, i.e., the visual discrimination is maximized. A skewed distribution [...] will shrink the chosen colors to just a fraction of the possible color range. Using a continuously differentiable transformation function [...] is one way to expand the range of colors used. A more effective way to maximize the visual discrimination in a choropleth map is to transform the data to match a target distribution. One option is to force all colors to have the same frequency, i.e., to force the target distribution to be uniform. Another option is to force a normal target distribution. Obviously, the transfer function needed for this transformation is data dependent and piecewise linear." (Martin Theus & Simon Urbanek, "Interactive Graphics for Data Analysis: Principles and Examples", 2009)

"Due to their recursive definition, switching the order of variables in a mosaic plot has a strong impact on what can be read from the plot. For instance, exchanging the two variables in a two-dimensional mosaic plot results in a completely new plot rather than in a mere graphically transposed version of the original plot." (Martin Theus & Simon Urbanek, "Interactive Graphics for Data Analysis: Principles and Examples", 2009)  

"Histograms are powerful in cases where meaningful class breaks can be defined and classes are used to select intervals and groups in the data. However, they often perform poorly when it comes to the visualization of a distribution." (Martin Theus & Simon Urbanek, "Interactive Graphics for Data Analysis: Principles and Examples", 2009)

"Log-linear models aim at modeling interactions between more than just two variables. Depending on how many variables are investigated simultaneously and how many interactions are included in the model/data, different model types can be distinguished by simply looking at the corresponding mosaic plot. Each of these models exhibits a specific pattern in a mosaic plot. If there are less than four variables included in the model, the specific interaction-structure of a model can be read from the mosaic plot." (Martin Theus & Simon Urbanek, "Interactive Graphics for Data Analysis: Principles and Examples", 2009)

"Mosaic plots are defined recursively, i.e., each variable that is introduced in a mosaic plot is plotted conditioned on the groups already established in the plot. As with barcharts, the area of bars or tiles is proportional to the number of observations (or the sum of the observation weights of a class). The direction along which bars are divided by a newly introduced variable is usually alternating, starting with the x-direction." (Martin Theus & Simon Urbanek, "Interactive Graphics for Data Analysis: Principles and Examples", 2009) 

"Mosaic plots become more difficult to read for variables with more than two or three categories. One way out is to assign a constant space for all possible crossings of categories. This way, the data from the r×c table are plotted in a table-like layout. Whereas this regular layout makes it much easier to compare values across rows and columns, the plot space is used less efficiently than in a mosaic plot." (Martin Theus & Simon Urbanek, "Interactive Graphics for Data Analysis: Principles and Examples", 2009)

"Multivariate techniques often summarize or classify many variables to only a few groups or factors (e.g., cluster analysis or multi-dimensional scaling). Parallel coordinate plots can help to investigate the influence of a single variable or a group of variables on the result of a multivariate procedure. Plotting the input variables in a parallel coordinate plot and selecting the features of interest of the multivariate procedure will show the influence of different input variables." (Martin Theus & Simon Urbanek, "Interactive Graphics for Data Analysis: Principles and Examples", 2009)

"No other statistical graphic can hold so much information at a time than the parallel coordinate plot. Thus this plot is ideal to get an initial overview of a dataset, or at the very least a large subgroup of the variables." (Martin Theus & Simon Urbanek, "Interactive Graphics for Data Analysis: Principles and Examples", 2009) 

"One big advantage of parallel coordinate plots over scatterplot matrices. (i.e., the matrix of scatterplots of all variable pairs) is that parallel coordinate plots need less space to plot the same amount of data. On the other hand, parallel coordinate plots with p variables show only p − 1 adjacencies. However, adjacent variables reveal most of the information in a parallel coordinate plot. Reordering variables in a parallel coordinate plot is therefore essential." (Martin Theus & Simon Urbanek, "Interactive Graphics for Data Analysis: Principles and Examples", 2009) 

"Parallel coordinate plots are often overrated concerning their ability to depict multivariate features. Scatterplots are clearly superior in investigating the relationship between two continuous variables and multivariate outliers do not necessarily stick out in a parallel coordinate plot. Nonetheless, parallel coordinate plots can help to find and understand features such as groups/clusters, outliers and multivariate structures in their multivariate context. The key feature is the ability to select and highlight individual cases or groups in the data, and compare them to other groups or the rest of the data." (Martin Theus & Simon Urbanek, "Interactive Graphics for Data Analysis: Principles and Examples", 2009) 

"Presentation graphics face the challenge to depict a key message in - usually a single - graphic which needs to fit very many observers at a time, without the chance to give further explanations or context. Exploration graphics, in contrast, are mostly created and used only by a single researcher, who can use as many graphics as necessary to explore particular questions. In most cases none of these graphics alone gives a comprehensive answer to those questions, but must be seen as a whole in the context of the analysis." (Martin Theus & Simon Urbanek, "Interactive Graphics for Data Analysis: Principles and Examples", 2009)

"Raster maps - often also called raster images - represent measurements on a regular grid. They are usually a result of remote sensing techniques via satellites or airborne surveillance systems. They fit neither the construct of scatterplots nor that of maps. Nevertheless, both scatterplots and maps can be used to display raster maps within statistics software which has no extra GIS capabilities." (Martin Theus & Simon Urbanek, "Interactive Graphics for Data Analysis: Principles and Examples", 2009)

"Shingling is the process of dividing a continuous variable into - possibly overlapping - intervals in order to convert a continuous variable into a discrete variable. Shingling is quite different from conditioning on categorical variables. Overlapping shingles/intervals lead to multiple representation of data within a trellis display, which is not the case for categorical variables. Furthermore, it is challenging to judge which intervals/cases have been chosen to build a shingle. Trellis displays represent the shingle interval visually by an interval of the strip label. Although no plotting space is wasted, the information on the intervals is difficult to read from the strip label. Despite these drawbacks, there is a valid motivation for shingling […]." (Martin Theus & Simon Urbanek, "Interactive Graphics for Data Analysis: Principles and Examples", 2009) 

"Spineplots have the nice property that highlighted proportions can be compared directly. However, it must be noted that the x axis in a spinogram is no longer linear. It is only piecewise linear within the bars. Although this might be confusing at first sight, it yields two interesting characteristics. Areas where only very few cases have been observed are squeezed together and thus get less visual weight. [...] Spineplots use normalized bar lengths while the bar widths are proportional to the number of cases in the category" (Martin Theus & Simon Urbanek, "Interactive Graphics for Data Analysis: Principles and Examples", 2009)

"Sorting data is one of the most efficient actions to derive different views of data in order to see the variables from many angles. Sorting is usually not applied to the data itself, but to statistical objects of a plot. We might want to sort the bars in a barchart, the variables in a parallel boxplot or the categories in a boxplot y by x." (Martin Theus & Simon Urbanek, "Interactive Graphics for Data Analysis: Principles and Examples", 2009)

"The problem of overplotting can be as severe that (smaller) groups can disappear completely, which will not only lead to quantitatively biased inferences, but even to qualitatively inappropriate conclusions." (Martin Theus & Simon Urbanek, "Interactive Graphics for Data Analysis: Principles and Examples", 2009) 

"There are many reasons for the existence of missing values: the failure of a sensor, different recording standards for different parts of a sample, or structural differences of the objects observed that make it impossible to record all attributes for all observed instances." (Martin Theus & Simon Urbanek, "Interactive Graphics for Data Analysis: Principles and Examples", 2009)

"Trellis displays introduce the concept of shingling. Shingling is the process of dividing a continuous variable into - possibly overlapping - intervals in order to convert a continuous variable into a discrete variable. Shingling is quite different from conditioning on categorical variables. Overlapping shingles/intervals lead to multiple representation of data within a trellis display, which is not the case for categorical variables. Furthermore, it is challenging to judge which intervals/cases have been chosen to build a shingle. Trellis displays represent the shingle interval visually by an interval of the strip label. Although no plotting space is wasted, the information on the intervals is difficult to read from the strip label. Despite these drawbacks, there is a valid motivation for shingling," (Martin Theus & Simon Urbanek, "Interactive Graphics for Data Analysis: Principles and Examples", 2009)

"Trellis displays use a lattice-like arrangement to place plots onto so-called panels. Each plot in a trellis display is conditioned upon at least one other variable. The same scales are used in all the panel plots in order to make them comparable across rows and columns. […] Trellis displays are an ideal tool to compare models for different subsets. " (Martin Theus & Simon Urbanek, "Interactive Graphics for Data Analysis: Principles and Examples", 2009) 

✏️Gene Zelazny - Collected Quotes

"[…] a chart is a picture of relationships, and only the picture counts. Everything else - titles, labels, scale values - merely identifies and explains. The most important feature of the picture is the impression you receive. Scaling has an important controlling effect on that impression." (Gene Zelazny. "Say It with Charts: The executive’s guide to visual communication" 4th Ed., 2001)

"A component comparison can best be demonstrated using a pie chart. Because a circle gives such a clear impression of being a total, a pie chart is ideally suited for the one - and only - purpose it serves: showing the size of each part as a percentage of some whole, such as companies that make up an industry." (Gene Zelazny. "Say It with Charts: The executive’s guide to visual communication" 4th Ed., 2001)

"A correlation comparison shows whether the relationship between two variables follows - or fails to follow - the pattern you would normally expect." (Gene Zelazny. "Say It with Charts: The executive’s guide to visual communication" 4th Ed., 2001)

"[…] any point from the data you wish to emphasize - will always lead to one of five basic kinds of comparison, which I’ve chosen to call component, item, time series, frequency distribution, and correlation." (Gene Zelazny. "Say It with Charts: The executive’s guide to visual communication" 4th Ed., 2001)

"A component comparison can best be demonstrated using a pie chart. Because a circle gives such a clear impression of being a total, a pie chart is ideally suited for the one - and only - purpose it serves: showing the size of each part as a percentage of some whole, such as companies that make up an industr" (Gene Zelazny. "Say It with Charts: The executive’s guide to visual communication" 4th Ed., 2001)

"Choosing a chart form without a message in mind is like trying to color coordinate your wardrobe while blindfolded. Choosing the correct chart form depends completely on your being clear about what your message is. It is not the data - be they dollars, percentages, liters, yen, etc. - that determine the chart. It is not the measure - be it profits, return on investment, compensation, etc. - that determines the chart. Rather, it is your message, what  you want to show, the specific point you want to make." (Gene Zelazny. "Say It with Charts: The executive’s guide to visual communication" 4th Ed., 2001)

"Don’t necessarily settle for the first idea that grabs you. Keep looking, playing with the diagrams, so that you find the right fit." (Gene Zelazny. "Say It with Charts: The executive’s guide to visual communication" 4th Ed., 2001)

"I’ve observed that the pie chart is the most popular. It shouldn’t be; it’s the least practical and should account for little more than 5 percent of the charts used in a presentation or report. On the other hand, the bar chart is the least appreciated. It should receive much more attention; it’s the most versatile and should account for as much as 25percent of all charts used. I consider the column chart to be 'good old reliable' and the line chart to be the workhorse; these two should account for half of all charts used. While possibly intimidating at first glance, the dot chart has its place 10 percent of the time." (Gene Zelazny. "Say It with Charts: The executive’s guide to visual communication" 4th Ed., 2001)

"In choosing between a column and a line chart, you can also be guided by the nature of the data. A column chart emphasizes levels or magnitudes and is more suitable for data on activities that occur within a set period of time, suggesting a fresh start for each period. […] A line chart emphasizes movement and angles of change and is therefore the best form for showing data that have a 'carry-over' from one time to the next." (Gene Zelazny. "Say It with Charts: The executive’s guide to visual communication" 4th Ed., 2001)

"In preparing bar charts, make certain that the space separating the bars is smaller than the width of the bars. Use the most contrasting color or shading to emphasize the important item, thereby reinforcing the message title." (Gene Zelazny. "Say It with Charts: The executive’s guide to visual communication" 4th Ed., 2001)

"Naturally, scale values are used in practice, but omitting them should not obscure the relationship each chart illustrates. In fact, it is a good test of your own charts to see whether messages come across clearly without showing the scales. This does not mean that scaling considerations are unimportant to the design of charts. On the contrary, the wrong scale can lead to producing a chart that is misleading or worse, dishonest." (Gene Zelazny. "Say It with Charts: The executive’s guide to visual communication" 4th Ed., 2001)

"[…] no matter what your message is, it will always imply one of the five kinds of comparison. It should come as no surprise that, no matter what the comparison is, it will always lead to one of the five basic chart forms: the pie chart, the bar chart, the column chart, the line chart, and the dot chart." (Gene Zelazny. "Say It with Charts: The executive’s guide to visual communication" 4th Ed., 2001)

"The suggestions for making the most of bar charts also apply to column charts: make the space between the columns smaller than the width of the columns; and use color or shading to emphasize one point in time more than others or to distinguish, say, historical from projected data." (Gene Zelazny. "Say It with Charts: The executive’s guide to visual communication" 4th Ed., 2001)

"When showing numbers, round out the figures and omit decimals whenever they have little effect on your message; […]" (Gene Zelazny. "Say It with Charts: The executive’s guide to visual communication" 4th Ed., 2001)

"When preparing a line chart, make sure the trend line is bolder than the baseline and that the baseline, in turn, is a little bit heavier than the vertical and horizontal scale lines that shape the reference grid." (Gene Zelazny. "Say It with Charts: The executive’s guide to visual communication" 4th Ed., 2001)

"Whenever the form becomes more important than the content - that is, whenever the design of the chart interferes with a clear grasp of the relationship - it does a disservice to the audience or readers who may be basing decisions on the strength of what they see." (Gene Zelazny. "Say It with Charts: The executive’s guide to visual communication" 4th Ed., 2001)
