Showing posts with label data visualization. Show all posts
Showing posts with label data visualization. Show all posts

04 August 2024

Graphical Representation: Graphics We Live By (Part X: Pie and Donut Charts in Power BI and Excel)

Graphical Representation Series

Pie charts are loved and hated by many altogether, and there are many entitled reasons to use them and avoid them, though the most important criteria to evaluate them is whether they do the intended job in an acceptable manner, especially when compared to other representational means. The most important aspect they depict is the part to whole ratio, which even if can be depicted by other graphical tools, few tools are efficient in representing it. 

The pie chart works well as a visualization tool when it has only 3-5 values that are easily recognizable in the visualization, however as soon the size or the number of pieces vary considerably, the more difficult it is to visualize and interpret them, in case their representation has more negative than positive effects. There are many topics that form something like a long tail - the portion of the distribution having many occurrences far from the head or beginning. Displaying the items from the long tail together with the other components together can totally obscure the distribution of the items from the long tail as they become unrecognizable in the diagram. 

One approach to handle this is to group all the items from the long tail together under a piece (e.g. Other) and use a second form of representation to display them separately. For example,  Microsoft Excel offers a way to zoom in the section of a pie chart with small percentages by displaying them in a second pie chart (pie of pie) or bar chart (bar of pie), something like a "zoom in" perspective (see image below). Unfortunately, the feature seems to limit itself only to small percentages, and thus can't be used currently to offer a broader perspective. Ideally, it would be useful to zoom in on any piece of the pie, especially when the items are categorized as a hierarchy with two or even more levels. 


Unfortunately, even modern visualization tools offer limited features in displaying this kind of perspective into a flexible unitary visualization, and thus users are forced to use their creativity in providing proper solutions. In the below example the "Renewables" piece of pie is further broken down into several components of a full pie, an ensemble supposed to function as a single form of representation. With a bit of effort, the reader probably will understand the meaning behind the two pie charts, however the encoding of colors and other elements used are suboptimal in the decoding process. 

Pie Charts - Original Solution

In the above example, the arrow may suggest that in between the two donut charts exists a relationship, reflected also in the description provided, however the readers may still have difficulties in correctly interpreting the diagrams, especially when there's some kind of overlapping or other type of implied or unimplied resemblance. If the colors overlap or have other similarities, are they intentional? If the circles have the same size, does this observed resemblance have a meaning? The reader shouldn't bother himself with this type of questions, but see the resemblance and the meaning of the various elements with a minimum of effort while decoding a chart's elements. Of course, when the meaning is not clear, some guidance should be ideally provided!

Unfortunately, Power BI doesn't seem to have a similar visual like the one from Excel yet, however with a bit of effort one can obtain similar results, even if there are other minor or important limitations. For example, the lines between the two pie charts can't be drawn, so one is forced to use other encodings to show that there's a connection between the Renewable slice and the small pie chart. Moreover, the ensemble thus created isn't treated unitary and handled accordingly. Frankly, the maturity of a graphical representation environment can and should be judged also from this perspective!

The below representation built in Power BI uses a few tricks to display two pie charts together. The smaller pie chart representing the breakdown and pieces' colors are variations of parent's color, attempting to show that there's a relationship between the slice from the first chart and the pie chart with the details. Unfortunately, it wasn't possible to use similar lines like in Excel to show the relation between the two sections. 

Pie of Pie in Power BI

Instead of a pie chart, one can use a donut, like in the original representation. Even if the donut uses a smaller area for representation, in theory the pie chart offers a better basis for comparisons, at least in theory. Stacked column charts can be used as well (see C), however one loses the certainty that the pieces must add up to 100%. Further limitations can appear when one wants to achieve more with the visualizations.

Custom charts can be used as well. The pie chart coming from xViz (see D) allows to increase the size of a pie piece by using another radius, technique which could be used to highlight the piece represented in the second chart. Frankly, sunburst diagrams (see E) are better at representing the parent to child proportions, where the same color encoding has been used. Unfortunately, the more information is shown, the more loaded the visualization seems to be.

Pie of Pie Alternatives in Power BI I

A treemap can prove to be a better representation alternative because it encodes proportions in a unitary way, much like pie charts do, though it takes more space if one wants to make the labels visible. Radial charts (see G) and Aster plots (see I) can be occasionally better choices, especially because they use less space as they display only the main categories. A second diagram chart can be used to display the subcategories, much like in A and B. Sankey charts (see H) can be used as well, even if they don't allow representing any quantitative values unless one encodes them directly in the labels. 

Pie of Pie Alternatives in Power BI II

When one dives into the world of diagrams and goes behind the still limited representational choices provided by the standard tools, one can be surprised by the additional representational choices. However, their appropriateness should be considered against readers' skillset to read and interpret them! Frankly, the alternatives considered above could be a better choice when they will reach a representational maturity. 

Many thanks to Christopher Chin, who in his weekly post on data visualization blunders, suggested the examples used as basis for this post (see [1])!

Previous Post <<||>> Next Post

References:
[1] LinkedIn (2024) Christopher Chin's post (link)

18 May 2024

Graphical Representation: Graphics We Live By (Part IV: Area Charts in MS Excel)

Graphical Representation
Graphical Representation

An area chart or area graph (see A) is a graphical representation of quantitative data based on a line chart for which the areas between axis and the lines of the series are commonly emphasized with colors, textures, or hatchings (Wikipedia). It resembles a combination between line and bar charts. Each data series results in the formation of a region (aka area), allowing thus to identify the overlapping and do comparisons between the lines within the same visual display. This approach works usually well for two or three data series if the lines don't overlap, though if more data series are added to the chart, the higher are the chances for lines to overlap or for one area to be covered by another (see B). This can easily become more than the chart can handle, even if the data series can be filtered dynamically.

Area Charts
Area Charts

Stacked area charts are a variation of area charts in which the areas are stacked, much like stacked bar charts (see C). Research papers abound with such charts, probably because they allow to stack together multiple data series within a small area, reflecting thus the many variables involved. Such charts allow to track individual as well as intermediary and total aggregated trends.

Stacked Area Charts
Stacked Area Charts

Unfortunately, besides the fact that some areas are barely distinguishable or that distant areas can't be compared (especially when one area in between has strong fluctuations), the lack of ticks and/or gridlines (see D) makes it difficult to interpret such charts. Moreover, when the lines are smoothed, it becomes even more difficult to identify the actual points. To address this it makes sense to use markers for data points to show that one works with discrete and not continuous points (see further paragraphs).

In general, it's recommended to reduce the number of data series to 3-5. For example, one can split the data series into 2-3 groups or categories based on series' characteristics (e.g. concentrate on the high values in one chart, respectively the low values in another, or group the low values under an "others" category) which would allow to make better comparisons.

Being able to sort the time series on their average value or other criteria (e.g. showing the areas with minimal variations first) can improve the readability of such charts.

Moreover, areas under curves can easily hide missing data (see F) and occasionally negative values (which is the case of the 8th example), or distort the rate of change when the charts are wider than needed (compare F with C). 

Line Chart, respectively Area Chart based on a subset
Area Charts Variations

Area charts seem to encode a dimension as area, though that's not necessarily the case. It seems natural to display time series of different granularities (day, month, quarter, year), though one needs to be careful about one important aspect! On a time scale, the more one moves away from the day to weeks and months as time units, the bigger the distance between points is. In the end, all the points in a series are discrete points (not continuous), though the bigger the distance, the more category-like these series become (compare F with C, the charts have the same width).

Using the area under the curve as dimension makes sense when there's continuity or the discrete points are close enough to each other to resemble continuity. Thus, area charts are useful when the number of points is high (and the distance between them becomes neglectable), e.g. showing daily values within a year or the months over several years. 

According to [2], [3] and several other sources, using the area to encode quantitative information is a poor graphical method and this applies to pie charts and area charts altogether. By contrast, for a bar chart (see G) one has either height or width to use for comparisons while the points are always as bars delimited. Scatter plots (see H), even if they might miss the time dimension, they better reflect the dispersion of the points along the lines delimited by encoding the color (compare H with E). 

Column Chart and Scatter Plot
Alternatives for Area Charts

The more category-like and the fewer data points the data series have, the higher the chances for other graphical representation tools to be able to better represent the data. For example, year or even quarter-based data can be better visualized with Sankey charts (unfortunately, not available as standard Excel visual yet).

Conversely, there are situations in which the area chart isn't supposed to convey specific values but to get a feeling of areas' shape, or its simplicity is more appropriate, situations in which area charts do a good job. In the end, a graphical representation's utility is linked to a chart's purpose (and audience, of course). 

References:
[1] Wikipedia (2023) Area charts (link)
[2] William S Cleveland (1993) Visualizing Data
[3] Robert L Harris (1996) Information Graphics: A Comprehensive Illustrated Reference

22 March 2024

Business Intelligence: Dashboards (Part I: Dashboards Are Dead & Other Crap)

Business Intelligence
Business Intelligence Series

I find annoying the posts that declare that a technology is dead, as they seem to seek the sensational and, in the end, don't offer enough arguments for the positions taken; all is just surfing though a few random ideas. Almost each time I klick on such a link I find myself disappointed. Maybe it's just me - having too great expectations from ad-hoc experts who haven't understood the role of technologies and their lifecycle.

At least until now dashboards are the only visual tool that allows displaying related metrics in a consistent manner, reflecting business objectives, health, or other important perspective into an organization's performance. More recently notebooks seem to be getting closer given their capabilities of presenting data visualizations and some intermediary steps used to obtain the data, though they are still far away from offering similar capabilities. So, from where could come any justification against dashboard's utility? Even if I heard one or two expert voices saying that they don't need KPIs for managing an organization, organizations still need metrics to understand how the organization is doing as a whole and taken on parts. 

Many argue that the design of dashboards is poor, that they don't reflect data visualization best practices, or that they are too difficult to navigate. There are so many books on dashboard and/or graphic design that is almost impossible not to find such a book in any big library if one wants to learn more about design. There are many resources online as well, though it's tough to fight with a mind's stubbornness in showing no interest in what concerns the topic. Conversely, there's also lot of crap on the social networks that qualify after the mainstream as best practices. 

Frankly, design is important, though as long as the dashboards show the right data and the organization can guide itself on the respective numbers, the perfectionists can say whatever they want, even if they are right! Unfortunately, the numbers shown in dashboards raise entitled questions and the reasons are multiple. Do dashboards show the right numbers? Do they focus on the objectives or important issues? Can the number be trusted? Do they reflect reality? Can we use them in decision-making? 

There are so many things that can go wrong when building a dashboard - there are so many transformations that need to be performed, that the chances of failure are high. It's enough to have several blunders in the code or data visualizations for people to stop trusting the data shown.

Trust and quality are complex concepts and there’s no standard path to address them because they are a matter of perception, which can vary and change dynamically based on the situation. There are, however, approaches that allow to minimize this. One can start for example by providing transparency. For each dashboard provide also detailed reports that through drilldown (or also by running the reports separately if that’s not possible) allow to validate the numbers from the report. If users don’t trust the data or the report, then they should pinpoint what’s wrong. Of course, the two sources must be in synch, otherwise the validation will become more complex.

There are also issues related to the approach - the way a reporting tool was introduced, the way dashboards flooded the space, how people reacted, etc. Introducing a reporting tool for dashboards is also a matter of strategy, tactics and operations and the various aspects related to them must be addressed. Few organizations address this properly. Many organizations work after the principle "build it and they will come" even if they build the wrong thing!

Previous Post <<||>> Next Post

29 February 2024

R Language: Visualizing the Iris Dataset

When working with a dataset that has several numeric features, it's useful to visualize it to understand the shapes of each feature, usually by category or in the case of the iris dataset by species. For this purpose one can use a combination between a boxplot and a stripchart to obtain a visualization like the one below (click on the image for a better resolution):

Iris features by species
Iris features by species (box & jitter plots combined)

And here's the code used to obtain the above visualization:

par(mfrow = c(2,2)) #2x2 matrix display

boxplot(iris$Petal.Width ~ iris$Species) 
stripchart(iris$Petal.Width ~ iris$Species
	, method = "jitter"
	, add = TRUE
	, vertical = TRUE
	, pch = 20
	, jitter = .5
	, col = c('steelblue', 'red', 'purple'))

boxplot(iris$Petal.Length ~ iris$Species) 
stripchart(iris$Petal.Length ~ iris$Species
	, method = "jitter"
	, add = TRUE
	, vertical = TRUE
	, pch = 20
	, jitter = .5
	, col = c('steelblue', 'red', 'purple'))

boxplot(iris$Sepal.Width ~ iris$Species) 
stripchart(iris$Sepal.Width ~ iris$Species
	, method = "jitter"
	, add = TRUE
	, vertical = TRUE
	, pch = 20
	, jitter = .5
	, col = c('steelblue', 'red', 'purple'))

boxplot(iris$Sepal.Length ~ iris$Species) 
stripchart(iris$Sepal.Length ~ iris$Species
	, method = "jitter"
	, add = TRUE
	, vertical = TRUE
	, pch = 20
	, jitter = .5
	, col = c('steelblue', 'red', 'purple'))

mtext("© sql-troubles@blogspot.com 2024", side = 1, line = 4, adj = 1, col = "dodgerblue4", cex = .7)
title("Iris Features (cm) by Species", line = -2, outer = TRUE)

By contrast, one can obtain a similar visualization with just a command:

plot(iris, col = c('steelblue', 'red', 'purple'), pch = 20)
title("Iris Features (cm) by Species", line = -1, outer = TRUE)
mtext("© sql-troubles@blogspot.com 2024", side = 1, line = 4, adj = 1, col = "dodgerblue4", cex = .7)

And here's the output:

Iris features by species (general plot)

One can improve the visualization by using a bigger contrast between colors (I preferred to use the same colors as in the previous visualization).

I find the first data visualization easier to understand and it provides more information about the shape of data even it requires more work.

Histograms make it easier to understand the distribution of values, though the visualizations make sense only when done by species:

Histograms of Setosa's features

And, here's the code:

par(mfrow = c(2,2)) #2x2 matrix display

setosa = subset(iris, Species == 'setosa') #focus only on setosa
hist(setosa$Sepal.Width)
hist(setosa$Sepal.Length)
hist(setosa$Petal.Width)
hist(setosa$Petal.Length)
title("Setosa's Features (cm)", line = -1, outer = TRUE)
mtext("© sql-troubles@blogspot.com 2024", side = 1, line = 4, adj = 1, col = "dodgerblue4", cex = .7)

There's however a visual called staked histogram that allows to delimit the data for each species:


Iris features by species (stacked histograms)

And, here's the code:

#installing plotrix & multcomp
install.packages("plotrix")
install.packages("plotrix")
library(plotrix)
library(multcomp)

par(mfrow = c(2,2)) #1x2 matrix display

histStack(iris$Sepal.Width
	, z = iris$Species
	, col = c('steelblue', 'red', 'purple')
	, main = "Sepal.Width"
	, xlab = "Width"
	, legend.pos = "topright")

histStack(iris$Sepal.Length
	, z = iris$Species
	, col = c('steelblue', 'red', 'purple')
	, main = "Sepal.Length"
	, xlab = "Length"
	, legend.pos = "topright")

histStack(iris$Petal.Width
	, z = iris$Species
	, col = c('steelblue', 'red', 'purple')
	, main = "Petal.Width"
	, xlab = "Width"
	, legend.pos = "topright")

histStack(iris$Petal.Length
	, z = iris$Species
	, col = c('steelblue', 'red', 'purple')
	, main = "Petal.Length"
	, xlab = "Length"
	, legend.pos = "topright")
title("Iris Features (cm) by Species - Histograms", line = -1, outer = TRUE)
mtext("© sql-troubles@blogspot.com 2024", side = 1, line = 4, adj = 1, col = "dodgerblue4", cex = .7)

Conversely, the standard histogram allows drawing the density curves within its boundaries:

par(mfrow = c(2,2)) #1x2 matrix display 

hist(iris$Sepal.Width
	, main = "Sepal.Width"
	, xlab = "Width"
	, las = 1, cex.axis = .8, freq = F)
eq = density(iris$Sepal.Width) # estimate density curve
lines(eq, lwd = 2) # plot density curve

hist(iris$Sepal.Length
	, main = "Sepal.Length"
	, xlab = "Length"
	, las = 1, cex.axis = .8, freq = F)
eq = density(iris$Sepal.Length) # estimate density curve
lines(eq, lwd = 2) # plot density curve

hist(iris$Petal.Width
	, main = "Petal.Width"
	, xlab = "Width"
	, las = 1, cex.axis = .8, freq = F)
eq = density(iris$Petal.Width) # estimate density curve
lines(eq, lwd = 2) # plot density curve

hist(iris$Petal.Length
	, main = "Petal.Length"
	, xlab = "Length"
	, las = 1, cex.axis = .8, freq = F)
eq = density(iris$Petal.Length) # estimate density curve
lines(eq, lwd = 2) # plot density curve

title("Iris Features (cm) by Species - Density plots", line = -1, outer = TRUE)
mtext("© sql-troubles@blogspot.com 2024", side = 1, line = 4, adj = 1, col = "dodgerblue4", cex = .7)

And, here's the diagram:

Iris features aggregated (histograms with density plots)

As final visualization, one can also compare the width and length for the sepal, respectively petal:
 
par(mfrow = c(1,2)) #1x2 matrix display

plot(iris$Sepal.Width, iris$Sepal.Length, main = "Sepal Width vs Length", col = iris$Species)
plot(iris$Petal.Width, iris$Petal.Length, main = "Petal Width vs Length", col = iris$Species)

title("Iris Features (cm) by Species - Scatter Plots", line = -1, outer = TRUE)
mtext("© sql-troubles@blogspot.com 2024", side = 1, line = 4, adj = 1, col = "dodgerblue4", cex = .7)

And, here's the output:
 
Iris features by species (scatter plots)

Happy coding!

27 February 2024

Book Review: Rolf Hichert & Jürgen Faisst's International Business Communication Standards (IBCS Version 1.2)

Over the last months I found several references to Rolf Hichert & Jürgen Faisst's booklet on business communication standards [1]. It draw my attention especially because it attempts to provide a standard for reports and data visualizations, which frankly it seems like a tremendous endeavor if done right. The two authors founded the IBCS institute 20 years ago, which is a host, training institute, and certification body of the Creative Commons project called IBCS.

The 150 pages booklet considers various standardization techniques with the help of more than 180 instructive figures, the overall structure being based on a set of principles and rules rooted in an acronym that spells "SUCCESS" - Say, Unify, Condense, Check, Express, Simplify, Structure. On one side the principles seem to form a solid fundament, however the fundament seems to suffer from the same rigidity resulted from fitting something in a nicely-spelled acronym. 

Say or conveying a message reflects the principle that each report should convey a message, otherwise the report is just a data collection. According to this "definition" most of the operational reports are just collections of data. Conversely, lot of communication in organizations revolve around issues, metrics and decision making, scenarios in which the messages conveyed can be powerful though dependent on the business context. Settling on only one message can make the message fall short.

Unifying or applying semantic notation reflects the principle that things that have same meaning should look the same. There are many patterns out there that can be standardized, however it's questionable how much complex visualizations can be standardized, respectively how much liberty of expressing certain aspects the standardization allows. 

Condense or increasing the information density reflects the requirements that all information necessary to understanding the content should, if possible, be included on one page. This allows to easier navigate the content and prioritize what the audience is able to see. The principle however seems to have more to do with the ink-information ratio principle (see [2]). 

Check or ensuring the visual integrity reflects the principle that the information should be presented in the most truthful and the most easily understood way. This is something that many data visualizations out there lack.

Express or choosing the proper visualizations is based on the principle that the visuals considered should be as intuitive as possible. In theory, the more intuitive a visual the easier is to be understood and reused, however this depends on the "visual vocabulary" and "visual grammar" of each individual. Intuition is something that needs to grow through the interplay of these two areas. Having the expectation of displaying everything in terms of basic elements is unrealistic and suboptimal. 

Simplify or avoiding clutter refers to eliminating the unnecessary from a visualization, when there's nothing to take out without changing the meaning of a visualization. At least, the principle is correctly considered even if is in general difficult to apply because quite often one needs to build something more complex and reduce the complexity through iterative steps until the simple is obtained. 

Structure or organizing the content is based on the principle that content should follow (a logical consistent) structure. The interplay between function and structure is an important topic in itself.

Browsing through the many data visualizations given as example, I'd say that many of the recommendations make sense, though from there to a standardization is still a long way. The reader should evaluate against his/her own judgements the practices described and consider what seems to work. 

The book is available on the IBS website as PDF, though the Kindle version is 40% cheaper. Overall, it is worth a read. 

Previous Post <<<||>> Next Post

Resources:
[1] Rolf Hichert & Jürgen Faisst (2022) "International Business Communication Standards (IBCS Version 1.2): Conceptual, perceptual, and semantic design of comprehensible business reports, presentations, and dashboards" (link)
[2] Edward R Tufte (1983) "The Visual Display of Quantitative Information"
[3] IBCS Institude (2024) About (link)

21 October 2023

Graphical Representation: Overreaching in Data Visualizations

Graphical Representation
Graphical Representation Series 

One of the most important aspects to stress in the context of graphical design is the purpose of graphical representations and the medium in which they are communicated. For example, one needs to differentiate between the graphics propagated on the various media channels that target the public consumption and potential customers (books, newspapers, articles in paper or paperless form, respectively blog posts and similar content) and graphics made for organizational use (reports, dashboards or presentations).

If the former graphics are supposed to back up a story, the reader being led into one direction or another, the author having the freedom of choosing the direction and the message, in the latter, unless the content is supposed to support, persuade or force a decision, the facts and data need to be presented in an equidistant manner, in a form that support insights, decision making or further inference. This applies to data professionals as well to the business users preparing the data.

Data visualization authors tend to use the title and subtitle to highlight in reports and dashboards the most important findings as per their perception, sometimes even stating the obvious. One of the issues with this approach is that the audience might just pick up the respective information without further looking at the chart, missing maybe more important facts. Just highlighting an element in the graphic or providing explanatory headlines is not storytelling, even if it helps in the process. Ideally, the data itself as depicted by the visuals should tell the story! Further information with storytelling character should be provided in the presentation of the data and taylored accordingly for the audience!

With a few exceptions, the information and decisions shouldn't be forced on the audience. There are so many such examples on the various social networks in which data analysts or other types of data professionals seem to imply this in the content they share and this is so wrong on many levels!

No matter how deep a data professional is involved into the business and no matter how extensive is his/her knowledge about the systems, data and processes, the business user and the manager are the closest to the business context and needs, while data professionals might not be aware of the full extent. This lack of context makes it challenging to interpret the trends depicted by the data, respectively to associate the changes observed in trends with decisions made or issues the business dealt with. When such knowledge is not available the data professional tends to extrapolate instead of identifying the chain of causality together with the business (and here annotation capabilities would help considerably). 

Moreover, it falls on management's shoulders to decide which facts, data, metrics, KPIs and information are important for the organization. A data professional can make recommendations, can play with the data and communicate certain insights, gaps or courses of action, though the management decides what's important and how the respective information should be communicated! Overstepping the boundaries can easily lead to unnecessary conflict in which the data professional can easily lose, even if the facts are in his favor. It's enough to deal with missing or incorrect information for the whole story to fall apart. 

It's true that some of the books on graphical design use various highlighting techniques in the explanation process, but they are intended for the readers to understand what the authors want to say. Unfortunately, there are also examples improperly used or authors' opinion diverge from the common sense. Independently of this, the data professional should develop own visual critical thinking and validate the techniques used against own judgement! 

Previous Post <<||>> Next Post

19 October 2023

Graphical Representation: Graphics We Live By II (Discount Rates in MS Excel)

Graphical Representation
Graphical Representation Series

It's difficult, if not impossible, to give general rules on how data visualizations should be built. However, the data professional can use a set of principles, which are less strict than rules, and validate one's work against them. Even then one might need to make concessions and go against the principles or common sense, though such cases should be few, at least in theory. One of such important principles is reflected in Tufte's statement that "if the statistics are boring, then you've got the wrong numbers" [1].

So, the numbers we show should not be boring, but that's the case with most of the numbers we need to show, or we consume in news and other media sources. Unfortunately, in many cases we need to go with the numbers we have and find a way to represent them, ideally by facilitating the reader to make sense of the respective data. This should be our first goal when visualizing data. Secondly, because everybody talks about insights nowadays, one should identify the opportunity for providing views into the data that go beyond the simple visualization, even if this puts more burden on data professional's shoulder. Frankly, from making sense of a set of data and facilitating an 'Aha' moment is a long leap. Thirdly, one should find or use the proper tools and channels for communicating the findings. 

A basic requirement for the data professional to be able to address these goals is to have clear definitions of the numbers, have a good understanding of how the numbers reflect the reality, respectively how the numbers can be put into the broader context. Unfortunately, all these assumptions seem to be a luxury. On the other side, the type of data we work with allows us to address at least the first goal. Upon case, our previous experience can help, though there will be also cases in which we can try to do our best. 

Let's consider a simple set of data retrieved recently from another post - Discount rates (in percentage) per State, in which the values for 5 neighboring States are considered (see the first two columns from diagram A). Without knowing the meaning of data, one could easily create a chart in Excel or any other visualization tool. Because the State has categorical values, probably some visualization tools will suggest using bar and not column charts. Either by own choice or by using the default settings, the data professional might end up with a column chart (see diagram B), which is Ok for some visualizations. 


One can start with a few related questions:
(1) Does it make sense to use a chart to represent 5 values which have small variability (the difference between the first and last value is of only 6%)? 
(2) Does it make sense to use a chart only for the sake of visualizing the data?
(3) Where is the benefit for using a chart as long there's no information conveyed? 

One can see similar examples in the media where non-aggregated values are shown in a chart just for the sake of visualizing the data. Sometimes the authors compensate for the lack of meaning with junk elements, fancy titles or other tricks. Usually, sense-making in a chart takes longer than looking at the values in a table as there are more dimensions or elements to consider. For a table there's the title, headers and the values, nothing more! For a chart one has in addition the axes and some visualization elements that can facilitate or complicate visualization's decoding. Where to add that there are also many tricks to distort the data. 

Tables tend to maximize the amount of digital ink used to represent the data, and minimize the amount used to represent everything else not important to understanding. It's what Tufte calls the data-to-ink ratio (see [1]), a second important principle. This can be translated in (a) removing the border of the chart area, (b) minimizing the number of gridlines shown, (c) minimizing the number of ticks on the axis without leading to information lost, (d) removing redundant information, (e) or information that doesn't help the reader. 

However, the more data is available in the table, the more difficult it becomes to navigate the data. But again, if the chart shows the individual data without any information gained, a table might be still more effective. One shouldn't be afraid to show a table where is the case!

(4) I have a data visualization, what's next?

Ideally, the data professional should try to obtain the maximum of effect with minimum of elements. If this principle aims for the efficiency of design, a fourth related principle aims for the efficiency of effort - one should achieve a good enough visualization with a minimum of effort. Therefore, it's enough maybe if we settle to any of the two above results. 

On the other side, maybe by investing a bit more effort certain aspects can be improved. In this area beginners start playing with the colors, formatting the different elements of the chart. Unfortunately, even if color plays a major role in the encoding and decoding of meaning, is often misused/overused. 

(5) Is there any meaning in the colors used?

In the next examples taken from the web (diagram C and D), the author changed the color of the column with the minimal value to red to contrast it with the other values. Red is usually associated with danger, error, warning, or other similar characteristics with negative impact. The chances are high that the reader will associate the value with a negative connotation, even if red is used also for conveying important information (usually in text). Moreover, the reader will try to interpret the meaning of the other colors. In practice, the color grey has a neutral tone (and calming effect on the mind). Therefore, it's safe to use grey in visualization (see diagram D in contrast with diagram C). Some even advise setting grey as default for the visualization and changing the colors as needed later

In these charts, the author signalized in titles that red denotes the lowest value, though it just reduces the confusion. One can meet titles in which several colors are used, reminding of a Christmas tree. Frankly, this type of encoding is not esthetically pleasing, and it can annoy the reader. 

(6) What's in a name?

The titles and, upon case, the subtitles are important elements in communicating what the data reflects. The title should be in general short and succinct in the information it conveys, having the role of introducing, respectively identifying the chart, especially when multiple charts are used. Some charts can also use a subtitle, which can be longer than the title and have more of a storytelling character by highlighting the message and/or the finding in the data. In diagrams C and D the subtitles were considered as tiles, which is not considerably wrong. 

In the media and presentations with influencing character, subtitles help the user understand the message or the main findings, though it's not appropriate for hardcoding the same in dynamic dashboards. Even if a logic was identified to handle the various scenarios, this shifts users' attention, and the chance is high that they'll stop further investigating the visualization. A data professional should present the facts with minimal interference in how the audience and/or users perceive the data. 

As a recommendation, one should aim for clear general titles and avoid transmitting own message in charts. As a principle this can be summarized as "aim for clarity and equidistance".

(7) What about meaning?

Until now we barely considered the meaning of data. Unfortunately, there's no information about what the Discount rate means. It could be "the minimum interest rate set by the US Federal Reserve (and some other national banks) for lending to other banks" or "a rate used for discounting bills of exchange", to use the definitions given by the Oxford dictionary. Searching on the web, the results lead to discount rates for royalty savings, resident tuitions, or retail for discount transactions. Most probably the Discount rates from the data set refer to the latter.

We need a definition of the Discount rate to understand what the values represent when they are ordered. For example, when Texas has a value of 25% (see B), does this value have a negative or a positive impact when compared with other values? It depends on how it's used in the associated formula. The last two charts consider that the minimum value has a negative impact, though without more information the encoding might be wrong! 

Important formulas and definitions should be considered as side information in the visualization, accompanying text or documentation! If further resources are required for understanding the data, then links to the required resources should be provided as well. At least this assures that the reader can acquire the right information without major overhead. 

(8) What do readers look for? 

Frankly, this should have been the first question! Readers have different expectations from data visualizations. First of all, it's the curiosity - how the data look in row and/or aggregated form, or in more advanced form how are they shaped (e.g. statistical characteristics like dispersion, variance, outliers). Secondly, readers look in the first phase to understand mainly whether the "results" are good or bad, even if there are many shades of grey in between. Further on, there must be made distinction between readers who want to learn more about the data, models, and processes behind, respectively readers who just want a confirmation of their expectations, opinions and beliefs (aka bias). And, in the end, there are also people who are not interested in the data and what it tells, where the title and/or subtitle provide enough information. 

Besides this there are further categories of readers segmented by their role in the decision making, the planning and execution of operational, tactical, or strategic activities. Each of these categories has different needs. However, this exceeds the scope of our analysis. 

Returning to our example, one can expect that the average reader will try to identify the smallest and highest Discount rates from the data set, respectively try to compare the values between the different States. Sorting the data and having the values close to each other facilitates the comparison and ranking, otherwise the reader needing to do this by himself/herself. This latter aspect and the fact that bar charts better handle the display of categorical data such as length and number, make from bar charts the tool of choice (see diagram E). So, whenever you see categorical data, consider using a bar chart!

Despite sorting the data, the reader might still need to subtract the various values to identify and compare the differences. The higher the differences between the values, the more complex these operations become. Diagram F is supposed to help in this area, the comparison to the minimal value being shown in orange. Unfortunately, small variances make numbers' display more challenging especially when the visualization tools don't offer display alternatives.

For showing the data from Diagram F were added in the table the third and fourth columns (see diagram A). There's a fifth column which designates the percentage from a percentage (what's the increase in percentages between the current and minimal value). Even if that's mathematically possible, the gain from using such data is neglectable and can create confusion. This opens the door for another principle that applies in other areas as well: "just because you can, it doesn't mean you should!". One should weigh design decisions against common sense or one's intuition on how something can be (mis)used and/or (mis)understood!

The downside of Diagram F is that the comparisons are made only in relation to the minimum value. The variations are small and allow further comparisons. The higher the differences, the more challenging it becomes to make further comparisons. A matrix display (see diagram G) which compares any two values will help if the number of points is manageable. The upper side of the numbers situated on and above the main diagonal were grayed (and can be removed) because they are either nonmeaningful, or the negatives of the numbers found below the diagram. Such diagrams are seldom used, though upon case they prove to be useful.

Choropleth maps (diagram H) are met almost everywhere data have a geographical dimension. Like all the other visuals they have their own advantages (e.g. relative location on the map) and disadvantages (e.g. encoding or displaying data). The diagram shows only the regions with data (remember the data-to-ink ratio principle).


(9) How about the shape of data?

When dealing with numerical data series, it's useful to show aggregated summaries like the average, quartiles, or standard deviation to understand how the data are shaped. Such summaries don't really make sense for our data set given the nature of the numbers (five values with small variance). One can still calculate them and show them in a box plot, though the benefit is neglectable. 

(10) Which chart should be used?

As mentioned above, each chart has advantages and disadvantages. Given the simplicity and the number of data points, any of the above diagrams will do. A table is simple enough despite not using any visualization effects. Also, the bar charts are simple enough to use, with a plus maybe for diagram F which shows a further dimension of the data. The choropleth map adds the geographical dimension, which could be important for some readers. The matrix table is more appropriate for technical readers and involves more effort to understand, at least at first sight, though the learning curve is small. The column charts were considered only for exemplification purposes, though they might work as well. 

In the end one should go with own experience and consider the audience and the communication channels used. One can also choose 2 different diagrams, especially when they are complementary and offer an additional dimension (e.g. diagrams F and H), though the context may dictate whether their use is appropriate or not. The diagrams should be simple to read and understand, but this doesn't mean that one should stick to the standard visuals. The data professional should explore other means of representing the data, a fresh view having the opportunity of catching the reader's attention.

As a closing remark, nowadays data visualization tools allow building such diagrams without much effort. Conversely, it takes more effort to go beyond the basic functionality and provide more value for thyself and the readers. One should be able to evaluate upfront how much time it makes sense to invest. Hopefully, the few methods, principles and recommendations presented here will help further!

Previous Post <<||>> Next Post

Resources:
[1] Edward R Tufte (1983) "The Visual Display of Quantitative Information"

14 October 2023

Graphical Representation: On Insights II (The Complexity Perspective)

Graphical Representation
Graphical Representation Series

Scientists attempt to discover laws and principles, and for this they conduct experiments, build theories and models rooted in the data they collect. In the business setup, data professionals analyze the data for identifying patterns, trends, outliers or anything else that can lead to new information or knowledge. On one side scientists chose the boundaries of the systems they study, while for data professionals even if the systems are usually given, they can make similar choices. 

In theory, scientists are more flexible in what data they collect, though they might have constraints imposed by the boundaries of their experiments and the tools they use. For data professionals most of the data they need is already there, in the systems the business uses, though the constraints reside in the intrinsic and extrinsic quality of the data, whether the data are fit for the purpose. Both parties need to work around limitations, or attempt to improve the experiments, respectively the systems. 

Even if the data might have different characteristics, this doesn't mean that the methods applied by data professionals can't be used by scientists and vice-versa. The closer data professionals move from Data Analytics to Data Science, the higher the overlap between the business and scientific setup. 

Conversely, the problems data professionals meet have different characteristics. Scientists outlook is directed mainly at the phenomena and processes occurring in nature and society, where randomness, emergence and chaos seem to feel at home. Business processes deal more with predefined controlled structures, cyclicity, higher dependency between processes, feedback and delays. Even if the problems may seem to be different, they can be modeled with systems dynamics. 

Returning to data visualization and the problem of insight, there are multiple questions. Can we use simple designs or characterizations to find the answer to complex problems? Which must be the characteristics of a piece of information or knowledge to generate insight? How can a simple visualization generate an insight moment? 

Appealing to complexity theory, there are several general approaches in handling complexity. One approach resides in answering complexity with complexity. This means building complex data visualizations that attempt to model problem's complexity. For example, this could be done by building a complex model that reflects the problem studied, and build a set of complex visualizations that reflect the different important facets. Many data professionals advise against this approach as it goes against the simplicity principle. On the other hand, starting with something complex and removing the nonessential can prove to be an approachable strategy, even if it involves more effort. 

Another approach resides in reducing the complexity of the problem either by relaxing the constraints, or by breaking the problem into simple problems and addressing each one of them with visualizations. Relaxing the constraints allow studying upon case a more general problem or a linearization of the initial problem. Breaking down the problem into problems that can be easier solved, can help to better understand the general problem though we might lose the sight of emergence and other behavior that characterize complex systems.

Providing simple visualizations to complex problems implies a good understanding of the problem, its solution(s) and the overall context, which frankly is harder to achieve the more complex a problem is. For its understanding a problem requires a minimum of knowledge that needs to be reflected in the visualization(s). Even if some important aspects are assumed as known, they still need to be confirmed by the visualizations, otherwise any deviation from assumptions can lead to a new problem. Therefore, its questionable that simple visualizations can address the complexity of the problems in a general manner. 

Previous Post <<||>> Next Post 

18 April 2023

Graphical Representation: Graphics We Live By I (The Analytics Marathon)

Graphical Representation
Graphical Representation Series

In a diagram adapted from an older article [1], Brent Dykes, the author of "Effective Data Storytelling" [2], makes a parallel between Data Analytics and marathon running, considering that an organization must pass through the depicted milestones, the percentages representing how many organizations reach the respective milestones:



It's a nice visualization and the metaphor makes sense given that running a marathon requires a long-term strategy to address the gaps between the current and targeted physical/mental form and skillset required to run a marathon, respectively for approaching a set of marathons and each course individually. Similarly, implementing a Data Analytics initiative requires a Data Strategy supposed to address the gaps existing between current and targeted state of art, respectively the many projects run to reach organization's goals. 

It makes sense, isn't it? On the other side the devil lies in details and frankly the diagram raises several questions when is compared with practices and processes existing in organizations. This doesn't mean that the diagram is wrong, just that it doesn't seem to reflect entirely the reality. 

The percentages represent author's perception of how many organizations reach the respective milestones, probably in an repeatable manner (as there are several projects). Thus, only 10% have a data strategy, 100% collect data, 80% of them prepare the data, while at the opposite side only 15% communicate insight, respectively 5% act on information.

Considering only the milestones the diagram looks like a funnel and a capability maturity model (CMM). Typically, the CMMs are more complex than this, evolving with technologies' capabilities. All the mentioned milestones have a set of capabilities that increase in complexity and that usually help differentiated organization's maturity. Therefore, the model seems too simple for an actual categorization.  

Typically, data collection has a specific scope resuming to surveys, interviews and/or research. However, the definition can be extended to the storage of data within organizations. Thus, data collection as the gathering of raw data is mainly done as part of their value supporting processes, and given the degree of digitization of data, one can suppose that most organizations gather data for the different purposes, even if only a small part are maybe digitized.

Even if many organizations build data warehouses, marts, lakehouses, mashes or whatever architecture might be en-vogue these days, an important percentage of the reporting needs are covered by standard reports or reporting tools that access directly the source systems without data preparation or even data visualization. The first important question is what is understood by data analytics? Is it only the use of machine learning and statistical analysis? Does it resume only to pattern and insight finding or does it includes also what is typically considered under the Business Intelligence umbrella? 

Pragmatically thinking, Data Analytics should consider BI capabilities as well as its an extension of the current infrastructure to consider analytic capabilities. On the other side Data Warehousing and BI are considered together by DAMA as part of their Data Management methodology. Moreover, organizations may have a Data Strategy and a BI strategy, respectively a Data Analytics strategy as they might have different goals, challenges and bodies to support them. To make it even more complicated, an organization might even consider all these important topics as part of the Data or even Information Governance, or consider BI or Analytics without Data Management. 

So, a Data Strategy might or might not address Data Analytics at all. It's a matter of management philosophy, organizational structure, politics and other factors. Probably, having a strayegy related to data should count. Even if a written and communicated data-related strategy is recommended for all medium to big organizations, only a small percentage of them have one, while small organizations might ignore the topic completely.

At least in the past, data analysis and its various subcomponents was performed before preparing and visualizing the data, or at least in parallel with data visualization. Frankly, it's a strange succession of steps. Or does it refers to exploratory data analysis (EDA) from a statistical perspective, which requires statistical experience to model and interpret the facts? Moreover, data exploration and discovery happen usually in the early stages.

The most puzzling step is the last one - what does the author intended with it? Ideally, data should be actionable, at least that's what one says about KPIs, OKRs and other metrics. Does it make sense to extend Data Analytics into the decision-making process? Where does a data professional's responsibilities end and which are those boundaries? Or does it refer to the actions that need to be performed by data professionals? 

The natural step after communicating insight is for the management to take action and provide feedback. Furthermore, the decisions taken have impact on the artifacts built and a reevaluation of the business problem, assumptions and further components is needed. The many steps of analytics projects are iterative, some iterations affecting the Data Strategy as well. The diagram shows the process as linear, which is not the case.

For sure there's an interface between Data Analytics and Decision-Making and the processes associated with them, however there should be clear boundaries. E.g., it's a data professional's responsibility to make sure that the data/information is actionable and eventually advise upon it, though whether the entitled people act on it is a management topic. Not acting upon an information is also a decision. Overstepping boundaries can put the data professional into a strange situation in which he becomes responsible and eventually accountable for an action not taken, which is utopic.

The final question - is the last mile representative for the analytical process? The challenge is not the analysis and communication of data but of making sure that the feedback processes work and the changes are addressed correspondingly, that value is created continuously from the data analytics infrastructure, that data-related risks and opportunities are addressed as soon they are recognized. 

As any model, a diagram doesn't need to be correct to be useful and might not be even wrong in the right context and argumentation. A data analytics CMM might allow better estimates and comparison between organizations, though it can easily become more complex to use. Between the two models lies probably a better solution for modeling the data analytics process.

Resources:
[1] Brent Dykes (2022) "Data Analytics Marathon: Why Your Organization Must Focus On The Finish", Forbes (link)
[2] Brent Dykes (2019) Effective Data Storytelling: How to Drive Change with Data, Narrative and Visuals (link)

16 April 2023

Book Review: Willard C Brinton's Graphic Methods for Presenting Facts (1919)

"It is often with impotent exasperation that a person having the knowledge sees some fallacious conclusion accepted, or some wrong policy adopted, just because known facts cannot be marshalled and presented in such manner as to be effective." 

This is the conclusion phrase from the first paragraph of Willard C Brinton's "Graphic Methods for Presenting Facts", in which the author expresses his disappointment about the impossibility of bridging the important gap existing between data collection and presentation on side, and the decision-making on the other. Despite being written more than a century ago (1915), the issue seems to be so actual, the average data professional probably met this kind of situation at least once in a lifetime, if not on a regular basis. 

I found out about this book from Bridget Cogley & Vidya Setlur's "Functional Aesthetics for Data Visualization" (2020), which credits Brinton for "shaping the path toward broad use of charts". I found a digitized copy of the book at Internet Archive and browsing though it I found it appealing for a deeper reading and a first review. 

Written in a simple style stripped of any mathematical or statistical formulae, and thus approachable by the average nontechnical reader, the book addresses the techniques and challenges of graphical authors in preparing charts and other graphical content for their consumption in organizations for insight and decision-making, as well for the masses. It mentions also the projecting of graphs as lantern slides to accompany a talk, a precursor of nowadays' forms of presentations.

The engineering and statistical background of the author can be seen in the meticulosity with which the book was written. The book discusses the graphic methods for presenting facts in graphical form, which are the component parts and how can be used to attract readers' attention, respectively present them in an effective manner. Several principle-like statements are considered though the book and listed together in the last chapters, rules that can be found in modern books as well, though probably less exemplified. 

From organization charts to maps, from circle and bar charts to time plots, the number and variety of graphical displays is overwhelming and at the same time surprising for a book that old, especially when we consider the publishing technologies available. As mentioned by the author, color printing of the book was prohibitive given the costs, only one ink color being used. However, this doesn't diminish the quality of visuals considered. Compared with nowadays' books, which seem to attempt compensating the lack of novelty with too much color and mentions of technologies, book's graphics stand out in their simplicity and richness of exemplifications. It is sad to remark that the graphical displays are better chosen and the book is better written than some of nowadays books on data visualization.

Comparing the language and vocabulary used nowadays with the one used then, the reader can see the difficulties of approaching a subject found in its early years, the author recognizing the lack of standards and the difficulties of showing quantitative facts in true proportions. It's also true that more modern authors like Tufte or Cleveland were facing same challenge 70 years later. 

About the author is worth mentioning that he was chairman of the "Joint Committee on Standards for Graphic Presentation" initiated in 1913, committee that published in 1915 their first brief report which consisted of 17 simply basic rules, a first attempt of standardizing the principles of graphic presentation. In 1939 Brinton published a second book on "Graphic Presentation", with less text and abundant colorful graphical displays. Even if some charts are available in the second book as well, overall, the two books seem to complement each other and should be a lecture for the data professional as well for the average reader interested in understanding the use of graphical methods.

Previous Post <<||>> Next Post

References:
[1] Willard C Brinton, "Graphic Methods for Presenting Facts", 1919 (link)
[2] Bridget Cogley & Vidya Setlur, "Functional Aesthetics for Data Visualization" (2020)
[3] Willard C Brinton, "Graphic Presentation", 1939 (link)
[4] Joint Committee on Standards for Graphic Presentation, "Publications of the American Statistical Association" Vol.14 (112), 1915 (Jstor)

17 February 2021

Python: Plotting Data with the Radar Chart

Today's task was to display a set of data using the radar chart available with the matplotlib.pyplot library. For this I considered the iris dataset available with the sklearn learning library. The dataset is stored as an array, therefore for further manipulation was converted into a data frame. As the radar chart allows comparing only a small set of numerical values, I considered displaying only the mean values for each type of iris (setosas versicolor, virginica). 

Unfortunately, the radar chart doesn't seem to complete the polygons based on the available dataset, therefore as workaround I had to duplicate the first column within the result data frame. (It seems that the Ploty library does a better job at displaying radar charts, see example).

Radar Chart

Here's the code:

import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets  
import pandas as pd

#preparing the data frame 
iris = datasets.load_iris()

ds = pd.DataFrame(data = iris.data, columns = iris.feature_names)
dr = ds.assign(target = iris.target) #iris type

group_by_iris = dr.groupby('target').mean()
group_by_iris[''] = group_by_iris[iris.feature_names[0]] #duplicating the first column

# creating the graph
fig = plt.subplots()

angle = np.linspace(start=0, stop=2 * np.pi, num=len(group_by_iris.columns))

plt.figure(figsize=(5, 5))
plt.subplot(polar=True)

values = group_by_iris[:1].values[0]
plt.plot(angle, values, label='Iris-setosa', color='r')
plt.fill(angle, values, 'r', alpha=0.2)

values = group_by_iris[1:2].values[0]
plt.plot(angle, values, label='Iris-versicolor', color='g')
plt.fill(angle, values, 'g', alpha=0.2)

values = group_by_iris[2:3].values[0]
plt.plot(angle, values, label='Iris-virginica', color='b')
plt.fill(angle, values, 'b', alpha=0.2)

#labels
plt.title('Iris comparison', size=15)
labels = plt.thetagrids(np.degrees(angle), labels=group_by_iris.columns.values)
plt.legend()

plt.show()

Happy coding!

Related Posts Plugin for WordPress, Blogger...

About Me

My photo
IT Professional with more than 24 years experience in IT in the area of full life-cycle of Web/Desktop/Database Applications Development, Software Engineering, Consultancy, Data Management, Data Quality, Data Migrations, Reporting, ERP implementations & support, Team/Project/IT Management, etc.