
04 August 2024

Graphical Representation: Graphics We Live By (Part X: Pie and Donut Charts in Power BI and Excel)

Graphical Representation Series

Pie charts are both loved and hated, and there are many legitimate reasons to use them or to avoid them. The most important criterion for evaluating them, however, is whether they do the intended job in an acceptable manner, especially when compared to other representational means. The most important aspect they depict is the part-to-whole ratio, which, even if it can be depicted by other graphical tools, few tools represent efficiently.

A pie chart works well as a visualization tool when it has only 3-5 values that are easily recognizable in the visualization. However, as soon as the size or the number of pieces varies considerably, the chart becomes more difficult to visualize and interpret, to the point where its representation has more negative than positive effects. Many topics form something like a long tail - the portion of the distribution having many occurrences far from the head or beginning. Displaying the items from the long tail together with the other components can totally obscure their distribution, as they become unrecognizable in the diagram.

One approach to handle this is to group all the items from the long tail under a single piece (e.g. Other) and use a second form of representation to display them separately. For example, Microsoft Excel offers a way to zoom in on the section of a pie chart with small percentages by displaying them in a second pie chart (pie of pie) or bar chart (bar of pie), something like a "zoom in" perspective (see image below). Unfortunately, the feature seems to limit itself to small percentages and thus can't currently be used to offer a broader perspective. Ideally, it would be useful to zoom in on any piece of the pie, especially when the items are categorized as a hierarchy with two or even more levels.


Unfortunately, even modern visualization tools offer limited features for displaying this kind of perspective in a flexible, unitary visualization, and thus users are forced to use their creativity to provide proper solutions. In the example below, the "Renewables" piece of the pie is further broken down into the components of a full pie, an ensemble supposed to function as a single form of representation. With a bit of effort, the reader will probably understand the meaning behind the two pie charts, however the encoding of colors and the other elements used are suboptimal for the decoding process.

Pie Charts - Original Solution

In the above example, the arrow may suggest that a relationship exists between the two donut charts, reflected also in the description provided; however, readers may still have difficulties in correctly interpreting the diagrams, especially when there's some kind of overlapping or other type of implied or unimplied resemblance. If the colors overlap or have other similarities, are they intentional? If the circles have the same size, does this observed resemblance have a meaning? The reader shouldn't have to bother with this type of question, but should see the resemblance and the meaning of the various elements with a minimum of effort while decoding a chart's elements. Of course, when the meaning is not clear, some guidance should ideally be provided!

Unfortunately, Power BI doesn't seem to have a visual similar to the one from Excel yet, however with a bit of effort one can obtain similar results, even if there are other minor or important limitations. For example, the lines between the two pie charts can't be drawn, so one is forced to use other encodings to show that there's a connection between the Renewables slice and the small pie chart. Moreover, the ensemble thus created isn't treated as a unit and handled accordingly. Frankly, the maturity of a graphical representation environment can and should be judged also from this perspective!

The representation below, built in Power BI, uses a few tricks to display two pie charts together. The smaller pie chart represents the breakdown, and its pieces' colors are variations of the parent's color, attempting to show that there's a relationship between the slice from the first chart and the pie chart with the details. Unfortunately, it wasn't possible to use lines like in Excel to show the relation between the two sections.

Pie of Pie in Power BI

Instead of a pie chart, one can use a donut, like in the original representation. Even if the donut uses a smaller area for representation, in theory the pie chart offers a better basis for comparisons. Stacked column charts can be used as well (see C), however one loses the certainty that the pieces must add up to 100%. Further limitations can appear when one wants to achieve more with the visualizations.

Custom charts can be used as well. The pie chart coming from xViz (see D) allows increasing the size of a pie piece by using another radius, a technique which could be used to highlight the piece represented in the second chart. Frankly, sunburst diagrams (see E) are better at representing the parent-to-child proportions, where the same color encoding has been used. Unfortunately, the more information is shown, the more loaded the visualization seems to be.

Pie of Pie Alternatives in Power BI I

A treemap can prove to be a better alternative because it encodes proportions in a unitary way, much like pie charts do, though it takes more space if one wants to make the labels visible. Radial charts (see G) and Aster plots (see I) can occasionally be better choices, especially because they use less space, as they display only the main categories. A second chart can be used to display the subcategories, much like in A and B. Sankey charts (see H) can be used as well, even if they don't allow representing any quantitative values unless one encodes them directly in the labels.

Pie of Pie Alternatives in Power BI II

When one dives into the world of diagrams and goes beyond the still limited representational choices provided by the standard tools, one can be surprised by the additional representational choices. However, their appropriateness should be weighed against readers' ability to read and interpret them! Frankly, the alternatives considered above could become a better choice once they reach representational maturity.

Many thanks to Christopher Chin, who in his weekly post on data visualization blunders suggested the examples used as the basis for this post (see [1])!

Previous Post <<||>> Next Post

References:
[1] LinkedIn (2024) Christopher Chin's post (link)

29 May 2024

Graphical Representation: Graphics We Live By (Part VII: Reading a Conversion Rates Chart with ChatGPT and Copilot)

Graphical Representation Series

One of the areas where ChatGPT, Copilot and other similar AI-based chatbots can help is in summarizing a chart saved as an image. Ideally, the chatbots should also be able to approximate the points from the chart (an image is made of pixels and thus areas should be easy to delimit). So, I was wondering how far the chatbots can be used for these purposes. I first used an image copied from the web, though I realized that not all visual elements could be read (e.g. Copilot had issues retrieving the values for some months) and I had no basis data for comparison to identify how big the deviations are.

So, I created the below chart in Power BI (see the original data):

Conversion Rates Dual Axes Chart

Here's the output based on ChatGPT over several attempts:
Sorting | Month | Original data: Conv. | Conv. Rate | First attempt: Conv. | Conv. Rate | Second attempt: Conv. | Conv. Rate | Third attempt: Conv. | Conv. Rate | Fourth attempt: Conv. | Conv. Rate
1 | Jul | 8 | 4 | 10 | 1 | 10 | 1 | 8 | 4 | 8 | 4
2 | Aug | 280 | 16 | 275 | 15 | 275 | 15 | 275 | 18 | 275 | 18
3 | Sep | 100 | 13 | 225 | 12 | 225 | 10 | 225 | 12 | 225 | 12
4 | Oct | 280 | 14 | 275 | 12 | 275 | 11 | 275 | 11 | 275 | 11
5 | Nov | 90 | 4 | 75 | 5 | 75 | 6 | 75 | 6 | 75 | 6
6 | Dec | 85 | 3.5 | 100 | 5 | 100 | 5 | 100 | 5 | 100 | 5
7 | Jan | 70 | 4.5 | 50 | 3 | 50 | 3 | 50 | 4 | 50 | 4
8 | Feb | 30 | 1.5 | 50 | 3 | 25 | 2 | 50 | 2.5 | 50 | 2.5
9 | Mar | 70 | 4 | 25 | 1 | 50 | 2.5 | 25 | 1.5 | 25 | 1.5
10 | Apr | 185 | 11 | 200 | 10 | 200 | 10 | 200 | 10 | 200 | 10
11 | May | 25 | 3.5 | 50 | 4 | 50 | 4 | 50 | 3.5 | 50 | 3.5
12 | Jun | 195 | 4 | 225 | 10 | 225 | 10 | 225 | 11 | 195 | 4

Please note that some values repeat in the original data, to check whether the chatbot is able to identify the resemblance correctly.

Based on the data, several tests were made with ChatGPT4o, where the following prompts were given:

  1. generate the data behind the chart as excel list (see 'First attempt' in the table)
  2. generate the data behind the chart as excel list where blue is represented on the left axis and orange on the right axis (see 'Second attempt');
  3. reevaluate the values from the previous output considering that the values for Jul are 8, respectively 4% (see 'Third attempt');
  4. reevaluate the values from the previous output considering that the values for Jun are 195, respectively 4% (see 'Fourth attempt');
  5. generate the data behind the dual axis chart as list from the last conversation;
  6. summarize the chart by revealing trends and other statistics.
From attempt to attempt, the values get closer to the original data as corrections are made, though there are points in the final output that diverge considerably. ChatGPT does generate a table which can easily be copied to MS Excel, and even gives some instructions on how to build a dual axis chart.

It's great that one can reuse data between sessions and maybe improve the output. ChatGPT's summary is better than expected and it leverages the latest corrections. However, the estimations need to improve so that the inferences can be correct.

Here's the output based on the same prompts in Copilot:
Sorting | Month | Original data: Conv. | Conv. Rate | First attempt: Conv. | Conv. Rate | Second attempt: Conv. | Conv. Rate | Third attempt: Conv. | Conv. Rate | Fourth attempt: Conv. | Conv. Rate
1 | Jul | 8 | 4 | 300 | 18 | 300 | 18 | 8 | 4 | 8 | 4
2 | Aug | 280 | 16 | 250 | 16 | 250 | 16 | 250 | 16 | 250 | 16
3 | Sep | 100 | 13 | 200 | 14 | 200 | 14 | 200 | 14 | 200 | 14
4 | Oct | 280 | 14 | 150 | 12 | 150 | 12 | 150 | 12 | 150 | 12
5 | Nov | 90 | 4 | 100 | 10 | 100 | 10 | 100 | 10 | 100 | 10
6 | Dec | 85 | 3.5 | 50 | 8 | 50 | 8 | 50 | 8 | 50 | 8
7 | Jan | 70 | 4.5 | 300 | 18 | 300 | 18 | 300 | 18 | 300 | 18
8 | Feb | 30 | 1.5 | 250 | 16 | 250 | 16 | 250 | 16 | 250 | 16
9 | Mar | 70 | 4 | 200 | 14 | 200 | 14 | 200 | 14 | 200 | 14
10 | Apr | 185 | 11 | 150 | 12 | 150 | 12 | 150 | 12 | 150 | 12
11 | May | 25 | 3.5 | 100 | 10 | 100 | 10 | 100 | 10 | 100 | 10
12 | Jun | 195 | 4 | 50 | 8 | 50 | 8 | 50 | 8 | 195 | 4

Copilot's estimations are higher than the ones made by ChatGPT and deviate more from the original data. No reevaluations of the other values are done between prompts. The summary provides information that can be used to complement ChatGPT's output.

Overall, ChatGPT seems to perform better than Copilot, at least for this example (though we might be talking about different "generations"). Unfortunately, given that the estimations provided by both chatbots deviate considerably from the expectation, the output needs to be revised and corrected, which decreases the usability of such chatbots. In fact, one can use them to generate an initial set of data and then correct the deviations.

The outputs of other chatbots like Google's Gemini or Claude-3-Haiku (via Poe) can't be compared with the ones from ChatGPT or Copilot yet. Claude-3-Haiku does provide estimated values (even with a decimal comma), though they deviate considerably from the original data.

It would be interesting to test how other charts and plots are processed by chatbots, respectively whether the various visual elements (e.g. gridlines, ticks, markers) make a difference.

Previous Post <<||>> Next Post

18 May 2024

Graphical Representation: Graphics We Live By (Part IV: Area Charts in MS Excel)

Graphical Representation Series

An area chart or area graph (see A) is a graphical representation of quantitative data based on a line chart in which the areas between the axis and the lines of the series are commonly emphasized with colors, textures, or hatchings [1]. It resembles a combination of line and bar charts. Each data series results in the formation of a region (aka area), thus allowing one to identify overlaps and make comparisons between the lines within the same visual display. This approach usually works well for two or three data series if the lines don't overlap, though the more data series are added to the chart, the higher the chances that lines overlap or that one area covers another (see B). This can easily become more than the chart can handle, even if the data series can be filtered dynamically.

Area Charts

Stacked area charts are a variation of area charts in which the areas are stacked on top of each other, much like stacked bar charts (see C). Research papers abound with such charts, probably because they allow stacking multiple data series within a small area, thus reflecting the many variables involved. Such charts allow tracking individual as well as intermediary and total aggregated trends.

Stacked Area Charts

Unfortunately, besides the fact that some areas are barely distinguishable or that distant areas can't be compared (especially when an area in between has strong fluctuations), the lack of ticks and/or gridlines (see D) makes it difficult to interpret such charts. Moreover, when the lines are smoothed, it becomes even more difficult to identify the actual points. To address this, it makes sense to use markers for the data points to show that one works with discrete and not continuous points (see further paragraphs).

In general, it's recommended to reduce the number of data series to 3-5. For example, one can split the data series into 2-3 groups or categories based on the series' characteristics (e.g. concentrate on the high values in one chart and the low values in another, or group the low values under an "others" category), which allows making better comparisons.

Being able to sort the time series on their average value or another criterion (e.g. showing the areas with minimal variations first) can improve the readability of such charts.

Moreover, areas under curves can easily hide missing data (see F) and occasionally negative values (which is the case in the 8th example), or distort the rate of change when the charts are wider than needed (compare F with C).

Line Chart, respectively Area Chart based on a subset
Area Charts Variations

Area charts seem to encode a dimension as area, though that's not necessarily the case. It seems natural to display time series of different granularities (day, month, quarter, year), though one needs to be careful about one important aspect! On a time scale, the more one moves away from days toward weeks and months as time units, the bigger the distance between points becomes. In the end, all the points in a series are discrete (not continuous), though the bigger the distance, the more category-like these series become (compare F with C; the charts have the same width).

Using the area under the curve as a dimension makes sense when there's continuity or the discrete points are close enough to each other to resemble continuity. Thus, area charts are useful when the number of points is high (and the distance between them becomes negligible), e.g. showing daily values within a year or monthly values over several years.

According to [2], [3] and several other sources, using area to encode quantitative information is a poor graphical method, and this applies to pie charts and area charts alike. By contrast, in a bar chart (see G) one has either height or width to use for comparisons, while the points are always delimited as bars. Scatter plots (see H), even if they might miss the time dimension, better reflect the dispersion of the points along the lines delimited by the color encoding (compare H with E).

Column Chart and Scatter Plot
Alternatives for Area Charts

The more category-like and the fewer data points the data series have, the higher the chances that other graphical representation tools can better represent the data. For example, year- or even quarter-based data can be better visualized with Sankey charts (unfortunately, not yet available as a standard Excel visual).

Conversely, there are situations in which the area chart isn't supposed to convey specific values but to give a feeling of the areas' shape, or in which its simplicity is more appropriate, and in these situations area charts do a good job. In the end, a graphical representation's utility is linked to the chart's purpose (and audience, of course).

References:
[1] Wikipedia (2023) Area charts (link)
[2] William S Cleveland (1993) Visualizing Data
[3] Robert L Harris (1996) Information Graphics: A Comprehensive Illustrated Reference

04 March 2024

Business Intelligence: A Software Engineer's Perspective VI (The Data Citizen)

Business Intelligence Series

More than a century ago, Herbert G Wells wrote on mathematical literacy: "[...] the time may not be very remote when it will be understood that for complete initiation as an efficient citizen of one of the new great complex world-wide States that are now developing, it is as necessary to be able to compute, to think in averages and maxima and minima, as it is now to be able to read and write" [1]. The quote is occasionally misattributed as referring to Statistics, though frankly the boundaries of mathematical, statistical, numerical and data literacy tend to melt into each other, with multiple dependencies existing between them.

In the age of big data, data citizens - business people able to use data, data processing and visualization tools to build solutions that support their job - are steadily becoming a necessity for businesses in their quest to make data-driven decisions, gain insight, and realize whatever valuable use data might have for their organizations. The need is not new: Microsoft Access and Excel were used for similar purposes already in the 90s, becoming a maintenance nightmare for IT, with data islands existing throughout the organizations without proper backup or documentation, and diverse numbers being reported that contradicted each other.

Then IT took over, trying to find alternatives for the data islands, implementing concepts like single source(s) of truth, quality gates and supporting processes, designing data models and infrastructures for self-service, allowing users to tap into the data for data exploration, discovery, reporting, etc. Getting all this right required redesigning existing infrastructures, making one step forward and a few steps back - in the end everything is a learning process. Such an effort can easily consume an organization's resources.

Microsoft and other vendors of data-driven solutions keep insisting on how much potential exists in their tools for the data citizen, and how data citizens can bring competitive advantage to organizations by automating business and supporting processes. The potential is not to be neglected, though it requires a considerable investment from organizations in training and mentoring data citizens, and in building data warehouses or data meshes that focus on end users' self-service needs. The data citizen needs time to learn, to play with the data, build solutions, test their usefulness in daily tasks, respectively incorporate and disseminate the knowledge gained within the organization.

There are many scenarios in which results can be obtained with a minimum of effort, however there are also hard limits. Besides the learning effort and the time available, there are cognitive, knowledge and ability limits that vary from person to person. Understanding what good architecture, design and techniques mean is unfortunately not for everybody, and here's where the concept of citizen data analyst or citizen scientist breaks down, independently of the tools used. There are also IT people who face similar challenges.

It must also be recognized that the solutions built in the early stages by data citizens are primarily personal solutions that need to be reviewed and brought to the standards adopted by the organization. In time, such effort is expected to decrease considerably as the data citizen's knowledge and skillset evolve. Without this further work, the solutions built will tend to display some of the shortcomings of the solutions built on MS Access or Excel.

The concept of the data citizen can work as long as the various assumptions and needs are adequately addressed, however progress will not happen overnight. The effort needs to become part of the organization's long-term strategy, and it can be considerable for many organizations. Mentorship in terms of technical and non-technical support is needed. It's advisable to proceed in small iterative steps and gradually integrate the lessons learned.

Previous Post <<||>> Next Post

Resources:

[1] Herbert G Wells (1903) "Mankind in the Making" [Source]

19 October 2023

Graphical Representation: Graphics We Live By II (Discount Rates in MS Excel)

Graphical Representation Series

It's difficult, if not impossible, to give general rules on how data visualizations should be built. However, the data professional can use a set of principles, which are less strict than rules, and validate one's work against them. Even then, one might need to make concessions and go against the principles or common sense, though such cases should be few, at least in theory. One such important principle is reflected in Tufte's statement that "if the statistics are boring, then you've got the wrong numbers" [1].

So, the numbers we show should not be boring, but that's the case with most of the numbers we need to show or that we consume in news and other media sources. Unfortunately, in many cases we need to go with the numbers we have and find a way to represent them, ideally by helping the reader make sense of the respective data. This should be our first goal when visualizing data. Secondly, because everybody talks about insights nowadays, one should identify opportunities for providing views into the data that go beyond simple visualization, even if this puts more burden on the data professional's shoulders. Frankly, from making sense of a set of data to facilitating an 'Aha' moment is a long leap. Thirdly, one should find or use the proper tools and channels for communicating the findings.

A basic requirement for the data professional to be able to address these goals is to have clear definitions of the numbers, a good understanding of how the numbers reflect reality, respectively of how the numbers can be put into the broader context. Unfortunately, all these assumptions seem to be a luxury. On the other hand, the type of data we work with allows us to address at least the first goal. Depending on the case, our previous experience can help, though there will also be cases in which we can only try to do our best.

Let's consider a simple set of data retrieved recently from another post - Discount rates (in percentage) per State - in which the values for 5 neighboring States are considered (see the first two columns of diagram A). Without knowing the meaning of the data, one could easily create a chart in Excel or any other visualization tool. Because State has categorical values, some visualization tools will probably suggest using bar and not column charts. Either by one's own choice or by using the default settings, the data professional might end up with a column chart (see diagram B), which is OK for some visualizations.


One can start with a few related questions:
(1) Does it make sense to use a chart to represent 5 values that have small variability (the difference between the first and last value is only 6%)?
(2) Does it make sense to use a chart only for the sake of visualizing the data?
(3) Where is the benefit of using a chart as long as no information is conveyed?

One can see similar examples in the media where non-aggregated values are shown in a chart just for the sake of visualizing the data. Sometimes the authors compensate for the lack of meaning with junk elements, fancy titles or other tricks. Usually, sense-making in a chart takes longer than looking at the values in a table, as there are more dimensions or elements to consider. A table has the title, the headers and the values, nothing more! A chart has in addition the axes and other visual elements that can facilitate or complicate the decoding. Add to this that there are also many tricks to distort the data.

Tables tend to maximize the amount of ink used to represent the data and minimize the amount used to represent everything else not important to understanding - what Tufte calls the data-ink ratio (see [1]), a second important principle. For charts, this translates into (a) removing the border of the chart area, (b) minimizing the number of gridlines shown, (c) minimizing the number of ticks on the axis without losing information, (d) removing redundant information, (e) or removing information that doesn't help the reader.

However, the more data are available in the table, the more difficult it becomes to navigate them. But again, if the chart shows the individual data without any information gained, a table might still be more effective. One shouldn't be afraid to show a table where that's the case!

(4) I have a data visualization, what's next?

Ideally, the data professional should try to obtain the maximum effect with a minimum of elements. If this principle aims for the efficiency of design, a fourth related principle aims for the efficiency of effort - one should achieve a good enough visualization with a minimum of effort. Therefore, it may be enough to settle for either of the two results above.

On the other hand, by investing a bit more effort certain aspects can be improved. In this area beginners start playing with the colors, formatting the different elements of the chart. Unfortunately, even if color plays a major role in the encoding and decoding of meaning, it is often misused or overused.

(5) Is there any meaning in the colors used?

In the next examples taken from the web (diagrams C and D), the author changed the color of the column with the minimal value to red to contrast it with the other values. Red is usually associated with danger, error, warning, or other similar characteristics with negative impact. The chances are high that the reader will associate the value with a negative connotation, even if red is also used for conveying important information (usually in text). Moreover, the reader will try to interpret the meaning of the other colors. In practice, the color grey has a neutral tone (and a calming effect on the mind). Therefore, it's safe to use grey in visualizations (see diagram D in contrast with diagram C). Some even advise setting grey as the default for the visualization and changing the colors as needed later.

In these charts, the author signaled in the titles that red denotes the lowest value, though this only reduces the confusion. One can meet titles in which several colors are used, reminding of a Christmas tree. Frankly, this type of encoding is not esthetically pleasing, and it can annoy the reader.

(6) What's in a name?

The titles and, upon case, the subtitles are important elements in communicating what the data reflects. The title should in general be short and succinct in the information it conveys, having the role of introducing, respectively identifying the chart, especially when multiple charts are used. Some charts can also use a subtitle, which can be longer than the title and have more of a storytelling character by highlighting the message and/or the findings in the data. In diagrams C and D the subtitles were used as titles, which is not necessarily wrong.

In the media and in presentations with an influencing character, subtitles help the user understand the message or the main findings, though it's not appropriate to hardcode the same in dynamic dashboards. Even if a logic were identified to handle the various scenarios, this shifts users' attention, and the chance is high that they'll stop investigating the visualization further. A data professional should present the facts with minimal interference in how the audience and/or users perceive the data.

As a recommendation, one should aim for clear general titles and avoid transmitting one's own message in charts. As a principle this can be summarized as "aim for clarity and equidistance".

(7) What about meaning?

Until now we barely considered the meaning of the data. Unfortunately, there's no information about what the Discount rate means. It could be "the minimum interest rate set by the US Federal Reserve (and some other national banks) for lending to other banks" or "a rate used for discounting bills of exchange", to use the definitions given by the Oxford dictionary. Searching on the web, the results lead to discount rates for royalty savings, resident tuition, or retail discount transactions. Most probably the Discount rates from the data set refer to the latter.

We need a definition of the Discount rate to understand what the values represent when they are ordered. For example, when Texas has a value of 25% (see B), does this value have a negative or a positive impact when compared with other values? It depends on how it's used in the associated formula. The last two charts consider that the minimum value has a negative impact, though without more information the encoding might be wrong! 

Important formulas and definitions should be provided as side information in the visualization, accompanying text or documentation! If further resources are required for understanding the data, then links to those resources should be provided as well. At least this ensures that the reader can acquire the right information without major overhead.

(8) What do readers look for? 

Frankly, this should have been the first question! Readers have different expectations from data visualizations. First of all, there's curiosity - how the data look in raw and/or aggregated form, or, in a more advanced form, how they are shaped (e.g. statistical characteristics like dispersion, variance, outliers). Secondly, readers look in the first phase to understand mainly whether the "results" are good or bad, even if there are many shades of grey in between. Further on, a distinction must be made between readers who want to learn more about the data, models and processes behind them, and readers who just want a confirmation of their expectations, opinions and beliefs (aka bias). And, in the end, there are also people who are not interested in the data and what they tell, for whom the title and/or subtitle provide enough information.

Besides this, there are further categories of readers segmented by their role in decision making, respectively in the planning and execution of operational, tactical, or strategic activities. Each of these categories has different needs. However, this exceeds the scope of our analysis.

Returning to our example, one can expect that the average reader will try to identify the smallest and highest Discount rates in the data set, respectively try to compare the values between the different States. Sorting the data and having the values close to each other facilitates the comparison and ranking, otherwise the reader needs to do this by himself/herself. This latter aspect, and the fact that bar charts better handle the display of categorical data (encoding the values as length), make bar charts the tool of choice (see diagram E). So, whenever you see categorical data, consider using a bar chart!

Despite sorting the data, the reader might still need to subtract the various values to identify and compare the differences. The higher the differences between the values, the more complex these operations become. Diagram F is supposed to help in this area, the comparison to the minimal value being shown in orange. Unfortunately, small variances make the display of the numbers more challenging, especially when the visualization tools don't offer display alternatives.

To show the data from diagram F, the third and fourth columns were added to the table (see diagram A). There's a fifth column which designates the percentage from a percentage (the increase in percentages between the current and the minimal value). Even if that's mathematically possible, the gain from using such data is negligible and can create confusion. This opens the door for another principle that applies in other areas as well: "just because you can, it doesn't mean you should!". One should weigh design decisions against common sense or one's intuition on how something can be (mis)used and/or (mis)understood!
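For illustration only (the numbers are made up, as the minimal rate isn't spelled out in the text): if the minimal Discount rate were 20% and another State had 25%, the difference is 5 percentage points, while the percentage from a percentage is (25% - 20%)/20% = 25% - two figures that are easy to confuse.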

The downside of diagram F is that the comparisons are made only in relation to the minimum value. Here the variations are small and still allow further comparisons; the higher the differences, the more challenging such comparisons become. A matrix display (see diagram G), which compares any two values, can help if the number of points is manageable. The numbers situated on and above the main diagonal were grayed (and can be removed) because they are either nonmeaningful or the negatives of the numbers found below the diagonal. Such diagrams are seldom used, though upon case they prove to be useful.

Choropleth maps (diagram H) are met almost everywhere data have a geographical dimension. Like all the other visuals, they have their own advantages (e.g. relative location on the map) and disadvantages (e.g. encoding or displaying the data). The diagram shows only the regions with data (remember the data-ink ratio principle).


(9) How about the shape of data?

When dealing with numerical data series, it's useful to show aggregated summaries like the average, quartiles, or standard deviation to understand how the data are shaped. Such summaries don't really make sense for our data set given the nature of the numbers (five values with small variance). One can still calculate them and show them in a box plot, though the benefit is negligible.

(10) Which chart should be used?

As mentioned above, each chart has advantages and disadvantages. Given the simplicity and the number of data points, any of the above diagrams will do. A table is simple enough, despite not using any visualization effects. The bar charts are also simple to use, with a plus maybe for diagram F, which shows a further dimension of the data. The choropleth map adds the geographical dimension, which could be important for some readers. The matrix table is more appropriate for technical readers and involves more effort to understand, at least at first sight, though the learning curve is small. The column charts were considered only for exemplification purposes, though they might work as well.

In the end one should go with one's own experience and consider the audience and the communication channels used. One can also choose two different diagrams, especially when they are complementary and offer an additional dimension (e.g. diagrams F and H), though the context may dictate whether their use is appropriate or not. The diagrams should be simple to read and understand, but this doesn't mean that one should stick to the standard visuals. The data professional should explore other means of representing the data, a fresh view having the chance of catching the reader's attention.

As a closing remark, nowadays data visualization tools allow building such diagrams without much effort. Conversely, it takes more effort to go beyond the basic functionality and provide more value for oneself and the readers. One should be able to evaluate upfront how much time it makes sense to invest. Hopefully, the few methods, principles and recommendations presented here will help further!

Previous Post <<||>> Next Post

Resources:
[1] Edward R Tufte (1983) "The Visual Display of Quantitative Information"

01 April 2021

SQL Reloaded: Processing JSON Files with Flat Matrix Structure in SQL Server 2016+

Besides the CSV format, many of the data files made available under open data initiatives are stored in JSON format, which makes the data more difficult to process, even if JSON offers a richer structure that goes beyond the tabular structure of CSV files. Fortunately, starting with SQL Server 2016, JSON became a native format, which makes processing JSON files relatively easy - the ease with which one can process the data depends on how they are structured.

Let’s consider as an example a JSON file with the world population per country and year that can be downloaded from DataHub (source). The structure behind it resembles a tabular model (see the table on the source website), having a flat structure. Just export the data to a file with the JSON extension (e.g. ‘population-figures-by-country.json’) locally (e.g. ‘D:/Data’). The next step is to understand the file's structure. Some repositories provide good documentation in this respect, though there are also many exceptions. Having a JSON editor like Visual Studio which reveals the structure makes the process easier.

As in the case of CSV files, one needs to infer the data types. There are two alphanumeric fields (Country & Country Code), while the remaining fields are numeric. The only challenge raised by the data seems to be the difference in format between the years 2002 and 2015 with respect to the other years, as the values for those years contain a decimal part (e.g. 18919179.0). All the numeric values should have been whole numbers.

It’s recommended to start small and build the logic iteratively. Therefore, as a first step just look at the file's content via the OPENROWSET function:

-- looking at the JSON file 
SELECT *
FROM OPENROWSET (BULK 'D:\data\population-figures-by-country.json', SINGLE_CLOB)  as jsonfile 

In a second step one can add the OPENJSON function by looking only at the first record: 

-- querying a json file (one record)
SELECT *
FROM OPENROWSET (BULK 'D:\data\population-figures-by-country.json', SINGLE_CLOB)  as jsonfile 
     CROSS APPLY OPENJSON(BulkColumn,'$[0]')

In a third step one can add a few columns (e.g. Country & Country Code) to make sure that the select statement works correctly. 

-- querying a json file (all records, a few fields)
SELECT Country 
, CountryCode 
FROM OPENROWSET (BULK 'D:\data\population-figures-by-country.json', SINGLE_CLOB)  as jsonfile 
     CROSS APPLY OPENJSON(BulkColumn,'$')
 WITH ( 
  Country nvarchar(max) '$.Country'
, CountryCode nvarchar(3) '$.Country_Code'
) AS DAT; 

In the next step, all the columns can be added and the data imported into a table (e.g. dbo.CountryPopulation) on the fly:

-- importing a json file (all records) on the fly
SELECT DAT.Country
, DAT.CountryCode
, DAT.Y1960
, DAT.Y1961
, DAT.Y1962
, DAT.Y1963
, DAT.Y1964
, DAT.Y1965
, DAT.Y1966
, DAT.Y1967
, DAT.Y1968
, DAT.Y1969
, DAT.Y1970
, DAT.Y1971
, DAT.Y1972
, DAT.Y1973
, DAT.Y1974
, DAT.Y1975
, DAT.Y1976
, DAT.Y1977
, DAT.Y1978
, DAT.Y1979
, DAT.Y1980
, DAT.Y1981
, DAT.Y1982
, DAT.Y1983
, DAT.Y1984
, DAT.Y1985
, DAT.Y1986
, DAT.Y1987
, DAT.Y1988
, DAT.Y1989
, DAT.Y1990
, DAT.Y1991
, DAT.Y1992
, DAT.Y1993
, DAT.Y1994
, DAT.Y1995
, DAT.Y1996
, DAT.Y1997
, DAT.Y1998
, DAT.Y1999
, DAT.Y2000
, DAT.Y2001
, Cast(DAT.Y2002 as bigint) Y2002
, Cast(DAT.Y2003 as bigint) Y2003
, Cast(DAT.Y2004 as bigint) Y2004
, Cast(DAT.Y2005 as bigint) Y2005
, Cast(DAT.Y2006 as bigint) Y2006
, Cast(DAT.Y2007 as bigint) Y2007
, Cast(DAT.Y2008 as bigint) Y2008
, Cast(DAT.Y2009 as bigint) Y2009
, Cast(DAT.Y2010 as bigint) Y2010
, Cast(DAT.Y2011 as bigint) Y2011
, Cast(DAT.Y2012 as bigint) Y2012
, Cast(DAT.Y2013 as bigint) Y2013
, Cast(DAT.Y2014 as bigint) Y2014
, Cast(DAT.Y2015 as bigint) Y2015
, DAT.Y2016
INTO dbo.CountryPopulation
FROM OPENROWSET (BULK 'D:\data\population-figures-by-country.json', SINGLE_CLOB)  as jsonfile 
     CROSS APPLY OPENJSON(BulkColumn,'$')
 WITH ( 
  Country nvarchar(max) '$.Country'
, CountryCode nvarchar(3) '$.Country_Code'
, Y1960 bigint '$.Year_1960'
, Y1961 bigint '$.Year_1961'
, Y1962 bigint '$.Year_1962'
, Y1963 bigint '$.Year_1963'
, Y1964 bigint '$.Year_1964'
, Y1965 bigint '$.Year_1965'
, Y1966 bigint '$.Year_1966'
, Y1967 bigint '$.Year_1967'
, Y1968 bigint '$.Year_1968'
, Y1969 bigint '$.Year_1969'
, Y1970 bigint '$.Year_1970'
, Y1971 bigint '$.Year_1971'
, Y1972 bigint '$.Year_1972'
, Y1973 bigint '$.Year_1973'
, Y1974 bigint '$.Year_1974'
, Y1975 bigint '$.Year_1975'
, Y1976 bigint '$.Year_1976'
, Y1977 bigint '$.Year_1977'
, Y1978 bigint '$.Year_1978'
, Y1979 bigint '$.Year_1979'
, Y1980 bigint '$.Year_1980'
, Y1981 bigint '$.Year_1981'
, Y1982 bigint '$.Year_1982'
, Y1983 bigint '$.Year_1983'
, Y1984 bigint '$.Year_1984'
, Y1985 bigint '$.Year_1985'
, Y1986 bigint '$.Year_1986'
, Y1987 bigint '$.Year_1987'
, Y1988 bigint '$.Year_1988'
, Y1989 bigint '$.Year_1989'
, Y1990 bigint '$.Year_1990'
, Y1991 bigint '$.Year_1991'
, Y1992 bigint '$.Year_1992'
, Y1993 bigint '$.Year_1993'
, Y1994 bigint '$.Year_1994'
, Y1995 bigint '$.Year_1995'
, Y1996 bigint '$.Year_1996'
, Y1997 bigint '$.Year_1997'
, Y1998 bigint '$.Year_1998'
, Y1999 bigint '$.Year_1999'
, Y2000 bigint '$.Year_2000'
, Y2001 bigint '$.Year_2001'
, Y2002 decimal(19,1) '$.Year_2002'
, Y2003 decimal(19,1) '$.Year_2003'
, Y2004 decimal(19,1) '$.Year_2004'
, Y2005 decimal(19,1) '$.Year_2005'
, Y2006 decimal(19,1) '$.Year_2006'
, Y2007 decimal(19,1) '$.Year_2007'
, Y2008 decimal(19,1) '$.Year_2008'
, Y2009 decimal(19,1) '$.Year_2009'
, Y2010 decimal(19,1) '$.Year_2010'
, Y2011 decimal(19,1) '$.Year_2011'
, Y2012 decimal(19,1) '$.Year_2012'
, Y2013 decimal(19,1) '$.Year_2013'
, Y2014 decimal(19,1) '$.Year_2014'
, Y2015 decimal(19,1) '$.Year_2015'
, Y2016 bigint '$.Year_2016'
) AS DAT; 

As can be seen, the decimal values were converted to bigint so that all the year columns share the same definition. Moreover, this eases data processing later, as no additional (implicit) conversions are necessary.
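As a side note, the detour over decimal(19,1) is presumably needed because a value with a decimal part can't be converted directly from its textual form to bigint. A minimal sketch illustrating the behavior (with a hardcoded sample value, just for illustration):

-- illustrating the conversion behavior (sample value only)
SELECT TRY_CAST('18919179.0' as bigint) DirectConversion -- returns NULL (a plain CAST would raise an error)
, Cast(Cast('18919179.0' as decimal(19,1)) as bigint) TwoStepConversion -- returns 18919179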

Also, the columns' names were changed, either for simplification/convenience or simply for taste.

Writing such a monster query can be time-consuming, though preparing the metadata in Excel can decrease the effort considerably. With copy-paste and a few tricks (e.g. replacing values, splitting columns based on a delimiter) one can easily prepare such a structure:

Source field | Target field | DataType | Value | Import Clause | Select Clause
Country | Country | nvarchar(max) | Yemen Rep. | , Country nvarchar(max) '$.Country' | , DAT.Country
Country_Code | CountryCode | nvarchar(3) | YEM | , CountryCode nvarchar(3) '$.Country_Code' | , DAT.CountryCode
Year_1960 | Y1960 | bigint | 5172135 | , Y1960 bigint '$.Year_1960' | , DAT.Y1960
Year_1961 | Y1961 | bigint | 5260501 | , Y1961 bigint '$.Year_1961' | , DAT.Y1961
Year_1962 | Y1962 | bigint | 5351799 | , Y1962 bigint '$.Year_1962' | , DAT.Y1962
Year_1963 | Y1963 | bigint | 5446063 | , Y1963 bigint '$.Year_1963' | , DAT.Y1963
Year_1964 | Y1964 | bigint | 5543339 | , Y1964 bigint '$.Year_1964' | , DAT.Y1964
Year_1965 | Y1965 | bigint | 5643643 | , Y1965 bigint '$.Year_1965' | , DAT.Y1965
Year_1966 | Y1966 | bigint | 5748588 | , Y1966 bigint '$.Year_1966' | , DAT.Y1966
Year_1967 | Y1967 | bigint | 5858638 | , Y1967 bigint '$.Year_1967' | , DAT.Y1967
Year_1968 | Y1968 | bigint | 5971407 | , Y1968 bigint '$.Year_1968' | , DAT.Y1968
Year_1969 | Y1969 | bigint | 6083619 | , Y1969 bigint '$.Year_1969' | , DAT.Y1969
Year_1970 | Y1970 | bigint | 6193810 | , Y1970 bigint '$.Year_1970' | , DAT.Y1970
Year_1971 | Y1971 | bigint | 6300554 | , Y1971 bigint '$.Year_1971' | , DAT.Y1971
Year_1972 | Y1972 | bigint | 6407295 | , Y1972 bigint '$.Year_1972' | , DAT.Y1972
Year_1973 | Y1973 | bigint | 6523452 | , Y1973 bigint '$.Year_1973' | , DAT.Y1973
Year_1974 | Y1974 | bigint | 6661566 | , Y1974 bigint '$.Year_1974' | , DAT.Y1974
Year_1975 | Y1975 | bigint | 6830692 | , Y1975 bigint '$.Year_1975' | , DAT.Y1975
Year_1976 | Y1976 | bigint | 7034868 | , Y1976 bigint '$.Year_1976' | , DAT.Y1976
Year_1977 | Y1977 | bigint | 7271872 | , Y1977 bigint '$.Year_1977' | , DAT.Y1977
Year_1978 | Y1978 | bigint | 7536764 | , Y1978 bigint '$.Year_1978' | , DAT.Y1978
Year_1979 | Y1979 | bigint | 7821552 | , Y1979 bigint '$.Year_1979' | , DAT.Y1979
Year_1980 | Y1980 | bigint | 8120497 | , Y1980 bigint '$.Year_1980' | , DAT.Y1980
Year_1981 | Y1981 | bigint | 8434017 | , Y1981 bigint '$.Year_1981' | , DAT.Y1981
Year_1982 | Y1982 | bigint | 8764621 | , Y1982 bigint '$.Year_1982' | , DAT.Y1982
Year_1983 | Y1983 | bigint | 9111097 | , Y1983 bigint '$.Year_1983' | , DAT.Y1983
Year_1984 | Y1984 | bigint | 9472170 | , Y1984 bigint '$.Year_1984' | , DAT.Y1984
Year_1985 | Y1985 | bigint | 9847899 | , Y1985 bigint '$.Year_1985' | , DAT.Y1985
Year_1986 | Y1986 | bigint | 10232733 | , Y1986 bigint '$.Year_1986' | , DAT.Y1986
Year_1987 | Y1987 | bigint | 10628585 | , Y1987 bigint '$.Year_1987' | , DAT.Y1987
Year_1988 | Y1988 | bigint | 11051504 | , Y1988 bigint '$.Year_1988' | , DAT.Y1988
Year_1989 | Y1989 | bigint | 11523267 | , Y1989 bigint '$.Year_1989' | , DAT.Y1989
Year_1990 | Y1990 | bigint | 12057039 | , Y1990 bigint '$.Year_1990' | , DAT.Y1990
Year_1991 | Y1991 | bigint | 12661614 | , Y1991 bigint '$.Year_1991' | , DAT.Y1991
Year_1992 | Y1992 | bigint | 13325583 | , Y1992 bigint '$.Year_1992' | , DAT.Y1992
Year_1993 | Y1993 | bigint | 14017239 | , Y1993 bigint '$.Year_1993' | , DAT.Y1993
Year_1994 | Y1994 | bigint | 14692686 | , Y1994 bigint '$.Year_1994' | , DAT.Y1994
Year_1995 | Y1995 | bigint | 15320653 | , Y1995 bigint '$.Year_1995' | , DAT.Y1995
Year_1996 | Y1996 | bigint | 15889449 | , Y1996 bigint '$.Year_1996' | , DAT.Y1996
Year_1997 | Y1997 | bigint | 16408954 | , Y1997 bigint '$.Year_1997' | , DAT.Y1997
Year_1998 | Y1998 | bigint | 16896210 | , Y1998 bigint '$.Year_1998' | , DAT.Y1998
Year_1999 | Y1999 | bigint | 17378098 | , Y1999 bigint '$.Year_1999' | , DAT.Y1999
Year_2000 | Y2000 | bigint | 17874725 | , Y2000 bigint '$.Year_2000' | , DAT.Y2000
Year_2001 | Y2001 | bigint | 18390135 | , Y2001 bigint '$.Year_2001' | , DAT.Y2001
Year_2002 | Y2002 | decimal(19,1) | 18919179.0 | , Y2002 decimal(19,1) '$.Year_2002' | , Cast(DAT.Y2002 as bigint) Y2002
Year_2003 | Y2003 | decimal(19,1) | 19462086.0 | , Y2003 decimal(19,1) '$.Year_2003' | , Cast(DAT.Y2003 as bigint) Y2003
Year_2004 | Y2004 | decimal(19,1) | 20017068.0 | , Y2004 decimal(19,1) '$.Year_2004' | , Cast(DAT.Y2004 as bigint) Y2004
Year_2005 | Y2005 | decimal(19,1) | 20582927.0 | , Y2005 decimal(19,1) '$.Year_2005' | , Cast(DAT.Y2005 as bigint) Y2005
Year_2006 | Y2006 | decimal(19,1) | 21160534.0 | , Y2006 decimal(19,1) '$.Year_2006' | , Cast(DAT.Y2006 as bigint) Y2006
Year_2007 | Y2007 | decimal(19,1) | 21751605.0 | , Y2007 decimal(19,1) '$.Year_2007' | , Cast(DAT.Y2007 as bigint) Y2007
Year_2008 | Y2008 | decimal(19,1) | 22356391.0 | , Y2008 decimal(19,1) '$.Year_2008' | , Cast(DAT.Y2008 as bigint) Y2008
Year_2009 | Y2009 | decimal(19,1) | 22974929.0 | , Y2009 decimal(19,1) '$.Year_2009' | , Cast(DAT.Y2009 as bigint) Y2009
Year_2010 | Y2010 | decimal(19,1) | 23606779.0 | , Y2010 decimal(19,1) '$.Year_2010' | , Cast(DAT.Y2010 as bigint) Y2010
Year_2011 | Y2011 | decimal(19,1) | 24252206.0 | , Y2011 decimal(19,1) '$.Year_2011' | , Cast(DAT.Y2011 as bigint) Y2011
Year_2012 | Y2012 | decimal(19,1) | 24909969.0 | , Y2012 decimal(19,1) '$.Year_2012' | , Cast(DAT.Y2012 as bigint) Y2012
Year_2013 | Y2013 | decimal(19,1) | 25576322.0 | , Y2013 decimal(19,1) '$.Year_2013' | , Cast(DAT.Y2013 as bigint) Y2013
Year_2014 | Y2014 | decimal(19,1) | 26246327.0 | , Y2014 decimal(19,1) '$.Year_2014' | , Cast(DAT.Y2014 as bigint) Y2014
Year_2015 | Y2015 | decimal(19,1) | 26916207.0 | , Y2015 decimal(19,1) '$.Year_2015' | , Cast(DAT.Y2015 as bigint) Y2015
Year_2016 | Y2016 | bigint | 27584213 | , Y2016 bigint '$.Year_2016' | , DAT.Y2016

Based on this structure, one can add two further formulas in Excel to prepare the statements as above and then copy the fields (the last two columns were generated using the below formulas):

=", " & TRIM(B2) & " " & C2 & " '$." & TRIM(A2) & "'" 
=", DAT." & TRIM(B2)

Consuming data in which the values are stored in a matrix structure can sometimes involve further challenges, even if this type of storage tends to save space. For example, adding the values for a new year would involve extending the table with one more column, while performing calculations between years would involve referencing each column in formulas. Therefore, transforming the data from a matrix to a normalized structure can have some benefits. This can be achieved by writing a query with the UNPIVOT operator:

-- unpivoting the data 
SELECT RES.Country
, RES.CountryCode
, Cast(Replace(RES.[Year], 'Y', '') as int) [Year]
, RES.Population
--INTO dbo.CountryPopulationPerYear
FROM 
( -- basis data
	SELECT Country
	, CountryCode
	, Y1960, Y1961, Y1962, Y1963, Y1964, Y1965, Y1966, Y1967, Y1968, Y1969
	, Y1970, Y1971, Y1972, Y1973, Y1974, Y1975, Y1976, Y1977, Y1978, Y1979
	, Y1980, Y1981, Y1982, Y1983, Y1984, Y1985, Y1986, Y1987, Y1988, Y1989
	, Y1990, Y1991, Y1992, Y1993, Y1994, Y1995, Y1996, Y1997, Y1998, Y1999
	, Y2000, Y2001, Y2002, Y2003, Y2004, Y2005, Y2006, Y2007, Y2008, Y2009
	, Y2010, Y2011, Y2012, Y2013, Y2014, Y2015, Y2016
	FROM dbo.CountryPopulation
) DAT
UNPIVOT  -- unpivot logic
   (Population FOR [Year] IN  (Y1960, Y1961, Y1962, Y1963, Y1964, Y1965, Y1966, Y1967, Y1968, Y1969
, Y1970, Y1971, Y1972, Y1973, Y1974, Y1975, Y1976, Y1977, Y1978, Y1979
, Y1980, Y1981, Y1982, Y1983, Y1984, Y1985, Y1986, Y1987, Y1988, Y1989
, Y1990, Y1991, Y1992, Y1993, Y1994, Y1995, Y1996, Y1997, Y1998, Y1999
, Y2000, Y2001, Y2002, Y2003, Y2004, Y2005, Y2006, Y2007, Y2008, Y2009
, Y2010, Y2011, Y2012, Y2013, Y2014, Y2015, Y2016)
) RES

This can also be performed in two steps: first preparing the query, and in a final step inserting the data into a table (e.g. dbo.CountryPopulationPerYear) on the fly (re-execute the previous query after uncommenting the INTO clause to generate the table).

--reviewing the data 
SELECT Country
, CountryCode
, AVG(Population) AveragePopulation
, Max(Population) - Min(Population) RangePopulation
FROM dbo.CountryPopulationPerYear
WHERE [Year] BETWEEN 2010 AND 2019
GROUP BY Country
, CountryCode
ORDER BY Country

On the other hand, making comparisons between consecutive years is easier when using the matrix structure:

-- comparing consecutive years
SELECT Country
, CountryCode
, Y2016
, Y2010
, Y2016-Y2010 [2016-2010]
, Y2011-Y2010 [2011-2010]
, Y2012-Y2011 [2012-2011]
, Y2013-Y2012 [2013-2012]
, Y2014-Y2013 [2014-2013]
, Y2015-Y2014 [2015-2014]
, Y2016-Y2015 [2016-2015]
FROM dbo.CountryPopulation
ORDER BY Country

Unless storage space is a problem, in theory one can store the data in both formats, as there can be requests which benefit from one structure or the other.
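If only the normalized structure is kept, the matrix form can be rebuilt when needed via the PIVOT operator. Below is a minimal sketch under this assumption (only a few years are listed to keep the query short), based on the dbo.CountryPopulationPerYear table created above:

-- pivoting the normalized data back into a (partial) matrix structure
SELECT Country
, CountryCode
, [2010], [2011], [2012], [2013], [2014], [2015], [2016]
FROM 
( -- basis data
	SELECT Country
	, CountryCode
	, [Year]
	, Population
	FROM dbo.CountryPopulationPerYear
) DAT
PIVOT -- pivot logic
   (Max(Population) FOR [Year] IN ([2010], [2011], [2012], [2013], [2014], [2015], [2016])
) RES
ORDER BY Country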

Happy coding!
