Graphical Representation Series |
It's difficult, if not impossible, to give general rules on how data visualizations should be built. However, the data professional can use a set of principles, which are less strict than rules, and validate one's work against them. Even then one might need to make concessions and go against the principles or common sense, though such cases should be few, at least in theory. One of such important principles is reflected in Tufte's statement that "if the statistics are boring, then you've got the wrong numbers" [1].
So, the numbers we show should not be boring, but that's the case with most of the numbers we need to show, or we consume in news and other media sources. Unfortunately, in many cases we need to go with the numbers we have and find a way to represent them, ideally by facilitating the reader to make sense of the respective data. This should be our first goal when visualizing data. Secondly, because everybody talks about insights nowadays, one should identify the opportunity for providing views into the data that go beyond the simple visualization, even if this puts more burden on data professional's shoulder. Frankly, from making sense of a set of data and facilitating an 'Aha' moment is a long leap. Thirdly, one should find or use the proper tools and channels for communicating the findings.
A basic requirement for the data professional to be able to address these goals is to have clear definitions of the numbers, have a good understanding of how the numbers reflect the reality, respectively how the numbers can be put into the broader context. Unfortunately, all these assumptions seem to be a luxury. On the other side, the type of data we work with allows us to address at least the first goal. Upon case, our previous experience can help, though there will be also cases in which we can try to do our best.
Let's consider a simple set of data retrieved recently from another post - Discount rates (in percentage) per State, in which the values for 5 neighboring States are considered (see the first two columns from diagram A). Without knowing the meaning of data, one could easily create a chart in Excel or any other visualization tool. Because the State has categorical values, probably some visualization tools will suggest using bar and not column charts. Either by own choice or by using the default settings, the data professional might end up with a column chart (see diagram B), which is Ok for some visualizations.
One can start with a few related questions:
In these charts, the author signalized in titles that red denotes the lowest value, though it just reduces the confusion. One can meet titles in which several colors are used, reminding of a Christmas tree. Frankly, this type of encoding is not esthetically pleasing, and it can annoy the reader.
(6) What's in a name?
The titles and, upon case, the subtitles are important elements in communicating what the data reflects. The title should be in general short and succinct in the information it conveys, having the role of introducing, respectively identifying the chart, especially when multiple charts are used. Some charts can also use a subtitle, which can be longer than the title and have more of a storytelling character by highlighting the message and/or the finding in the data. In diagrams C and D the subtitles were considered as tiles, which is not considerably wrong.
In the media and presentations with influencing character, subtitles help the user understand the message or the main findings, though it's not appropriate for hardcoding the same in dynamic dashboards. Even if a logic was identified to handle the various scenarios, this shifts users' attention, and the chance is high that they'll stop further investigating the visualization. A data professional should present the facts with minimal interference in how the audience and/or users perceive the data.
As a recommendation, one should aim for clear general titles and avoid transmitting own message in charts. As a principle this can be summarized as "aim for clarity and equidistance".
(7) What about meaning?
Until now we barely considered the meaning of data. Unfortunately, there's no information about what the Discount rate means. It could be "the minimum interest rate set by the US Federal Reserve (and some other national banks) for lending to other banks" or "a rate used for discounting bills of exchange", to use the definitions given by the Oxford dictionary. Searching on the web, the results lead to discount rates for royalty savings, resident tuitions, or retail for discount transactions. Most probably the Discount rates from the data set refer to the latter.
We need a definition of the Discount rate to understand what the values represent when they are ordered. For example, when Texas has a value of 25% (see B), does this value have a negative or a positive impact when compared with other values? It depends on how it's used in the associated formula. The last two charts consider that the minimum value has a negative impact, though without more information the encoding might be wrong!
Important formulas and definitions should be considered as side information in the visualization, accompanying text or documentation! If further resources are required for understanding the data, then links to the required resources should be provided as well. At least this assures that the reader can acquire the right information without major overhead.
(8) What do readers look for?
Frankly, this should have been the first question! Readers have different expectations from data visualizations. First of all, it's the curiosity - how the data look in row and/or aggregated form, or in more advanced form how are they shaped (e.g. statistical characteristics like dispersion, variance, outliers). Secondly, readers look in the first phase to understand mainly whether the "results" are good or bad, even if there are many shades of grey in between. Further on, there must be made distinction between readers who want to learn more about the data, models, and processes behind, respectively readers who just want a confirmation of their expectations, opinions and beliefs (aka bias). And, in the end, there are also people who are not interested in the data and what it tells, where the title and/or subtitle provide enough information.
Besides this there are further categories of readers segmented by their role in the decision making, the planning and execution of operational, tactical, or strategic activities. Each of these categories has different needs. However, this exceeds the scope of our analysis.
Returning to our example, one can expect that the average reader will try to identify the smallest and highest Discount rates from the data set, respectively try to compare the values between the different States. Sorting the data and having the values close to each other facilitates the comparison and ranking, otherwise the reader needing to do this by himself/herself. This latter aspect and the fact that bar charts better handle the display of categorical data such as length and number, make from bar charts the tool of choice (see diagram E). So, whenever you see categorical data, consider using a bar chart!
Despite sorting the data, the reader might still need to subtract the various values to identify and compare the differences. The higher the differences between the values, the more complex these operations become. Diagram F is supposed to help in this area, the comparison to the minimal value being shown in orange. Unfortunately, small variances make numbers' display more challenging especially when the visualization tools don't offer display alternatives.
For showing the data from Diagram F were added in the table the third and fourth columns (see diagram A). There's a fifth column which designates the percentage from a percentage (what's the increase in percentages between the current and minimal value). Even if that's mathematically possible, the gain from using such data is neglectable and can create confusion. This opens the door for another principle that applies in other areas as well: "just because you can, it doesn't mean you should!". One should weigh design decisions against common sense or one's intuition on how something can be (mis)used and/or (mis)understood!
The downside of Diagram F is that the comparisons are made only in relation to the minimum value. The variations are small and allow further comparisons. The higher the differences, the more challenging it becomes to make further comparisons. A matrix display (see diagram G) which compares any two values will help if the number of points is manageable. The upper side of the numbers situated on and above the main diagonal were grayed (and can be removed) because they are either nonmeaningful, or the negatives of the numbers found below the diagram. Such diagrams are seldom used, though upon case they prove to be useful.
Choropleth maps (diagram H) are met almost everywhere data have a geographical dimension. Like all the other visuals they have their own advantages (e.g. relative location on the map) and disadvantages (e.g. encoding or displaying data). The diagram shows only the regions with data (remember the data-to-ink ratio principle).
(9) How about the shape of data?
When dealing with numerical data series, it's useful to show aggregated summaries like the average, quartiles, or standard deviation to understand how the data are shaped. Such summaries don't really make sense for our data set given the nature of the numbers (five values with small variance). One can still calculate them and show them in a box plot, though the benefit is neglectable.
(10) Which chart should be used?
As mentioned above, each chart has advantages and disadvantages. Given the simplicity and the number of data points, any of the above diagrams will do. A table is simple enough despite not using any visualization effects. Also, the bar charts are simple enough to use, with a plus maybe for diagram F which shows a further dimension of the data. The choropleth map adds the geographical dimension, which could be important for some readers. The matrix table is more appropriate for technical readers and involves more effort to understand, at least at first sight, though the learning curve is small. The column charts were considered only for exemplification purposes, though they might work as well.
In the end one should go with own experience and consider the audience and the communication channels used. One can also choose 2 different diagrams, especially when they are complementary and offer an additional dimension (e.g. diagrams F and H), though the context may dictate whether their use is appropriate or not. The diagrams should be simple to read and understand, but this doesn't mean that one should stick to the standard visuals. The data professional should explore other means of representing the data, a fresh view having the opportunity of catching the reader's attention.
As a closing remark, nowadays data visualization tools allow building such diagrams without much effort. Conversely, it takes more effort to go beyond the basic functionality and provide more value for thyself and the readers. One should be able to evaluate upfront how much time it makes sense to invest. Hopefully, the few methods, principles and recommendations presented here will help further!
Previous Post <<||>> Next Post
Resources:
[1] Edward R Tufte (1983) "The Visual Display of Quantitative Information"