Showing posts with label dataviz. Show all posts
Showing posts with label dataviz. Show all posts

03 May 2025

📊Graphical Representation: Graphics We Live By (Part XI: Comparisons Between Data Series)

Graphical Representation Series
Graphical Representation Series

Over the past 10-20 years it became so easy to create data visualizations just by dropping some of the data available into a tool like Excel and providing a visual depiction of it with just a few clicks. In many cases, the first draft, typically provided by default in the tool used, doesn't even need further work as the objective was reached, while in others the creator must have a minimum skillset for making the visualization useful, appealing, or whatever quality is a final requirement for the work in scope. However, the audience might judge the visualization(s) from different perspectives, and there can be a broad audience with different skills in reading, evaluating and understanding the work.

There are many depictions on the web resembling the one below, taken from a LinkedIn post:

Example Chart - Boing vs. Airbus

Even if the visualization is not perfect, it does a fair job in representing the data. Improvements can be made in the areas of labels, the title and positioning of elements, and the color palette used. At least these were the improvements made in the original post. It must be differentiated also between the environment in which the charts are made available, the print format having different characteristics than the ones in business setups. Unfortunately, the requirements of the two are widely confused, probably also because of the overlapping of the mediums used. 

Probably, it's a good idea to always start with the row data (or summaries of it) when the result consists of only a few data points that can be easily displayed in a table like the one below (the feature to round the decimals for integer values should be available soon in Power BI):

Summary Table

Of course, one can calculate more meaningful values like percentages from the total, standard deviations and other values that offer more perspectives into the data. Even if the values adequately reflect the reality, the reader can but wonder about the local and global minimal/maximal values, without talking much about the meaning of data points, which is easily identifiable in a chart. At least in the case of small data sets, using a table in combination with a chart can provide a more complete perspective and different ways of analyzing the data, especially when the navigation is interactive. 

Column and bar charts do a fair job in comparing values over time, though they do use a lot of ink in the process (see D). While they make it easy to compare neighboring values, the rectangles used tend to occupy a lot of space when they are made too wide or too high to cover the empty space within the display (e.g. when just a few values are displayed, space being wasted in the process). As the main downside, it takes a lot of scanning until the reader identifies the overall trends, and the further away the bars are from each other, the more difficult it becomes to do comparisons. 

In theory, line charts are more efficient in representing the above data points, because the marks are usually small and the line thin enough to provide a better data-ink ratio, while one can see a lot at a glance. In Power BI the creator can use different types of interpolation: linear (A), step (B) or smooth (C). In many cases, it might be a good idea to use a linear interpolation, though when there are no or minimal overlapping, it might be worthwhile to explore the other types if interpolation too (and further request feedback from the users):

Linear, Step and Smooth Line Charts

The nearness of values from different series can raise difficulties in identifying adequately the points, respectively delimiting the lines (see B).When the density of values allows it, it makes sense also to include the averages for each data series to reflect the distance between the two data sets. Unfortunately, the chart can get crowded if further data series or summaries are added to the cart(s). 

If the column chart (E) is close to the redesigned chart provided in the original redesign, the other alternatives can provide upon case more value. Stacked column charts (D) allow also to compare the overall quantity by month, area charts (F) tend to use even more color than needed, while water charts (G) allow to compare the difference between data points per time unit. Tornado charts (H) are a variation of bar charts, allowing easier comparing of the size of the bars, while ribbon charts (I) show well the stacking values. 

Alternatives to Line Charts

One should consider changing the subtitle(s) slightly to reflect the chart type when the patterns shown imply a shift in attention or meaning. Upon case, more that one of the above charts can be used within the same report when two or more perspectives are important. Using a complementary perspective can facilitate data's understanding or of identifying certain patterns that aren't easily identifiable otherwise. 

In general, the graphics creators try to use various representational means of facilitating a data set's understanding, though seldom only two series or a small subset of dimensions provide a complete description. The value of data comes when multiple perspectives are combined. Frankly, the same can be said about the above data series. Yes, there are important differences between the two series, though how do the numbers compare when one looks at the bigger picture, especially when broken down on element types (e.g. airplane size). How about plan vs. actual values, how long does it take more for production or other processes? It's one of a visualization's goals to improve the questions posed, but how efficient are visualizations that barely scratch the surface?

In what concerns the code, the following scripts can be used to prepare the data:

-- Power Query script (Boeing vs Airbus)
= let
    Source = let
    Source = #table({"Sorting", "Month Name", "Serial Date", "Boeing Deliveries", "Airbus Deliveries"},
    {
        {1, "Oct", #date(2023, 10, 31), 30, 50},
        {2, "Nov", #date(2023, 11, 30), 40, 40},
        {3, "Dec", #date(2023, 12, 31), 40, 110},
        {4, "Jan", #date(2024, 1, 31), 20, 30},
        {5, "Feb", #date(2024, 2, 29), 30, 40},  // Leap year adjustment
        {6, "Mar", #date(2024, 3, 31), 30, 60},
        {7, "Apr", #date(2024, 4, 30), 40, 60},
        {8, "May", #date(2024, 5, 31), 40, 50},
        {9, "Jun", #date(2024, 6, 30), 50, 80},
        {10, "Jul", #date(2024, 7, 31), 40, 90},
        {11, "Aug", #date(2024, 8, 31), 40, 50},
        {12, "Sep", #date(2024, 9, 30), 30, 50}
    }
    ),
    #"Changed Types" = Table.TransformColumnTypes(Source, {{"Sorting", Int64.Type}, {"Serial Date", type date}, {"Boeing Deliveries", Int64.Type}, {"Airbus Deliveries", Int64.Type}})
in
    #"Changed Types"
in
    Source

It can be useful to create the labels for the charts dynamically:

-- DAX code for labels
MaxDate = Format(Max('Boeing vs Airbus'[Serial Date]),"MMM-YYYY")
MinDate = FORMAT (Min('Boeing vs Airbus'[Serial Date]),"MMM-YYYY")
MinMaxDate = [MinDate] & " to " & [MaxDate]
Title Boing Airbus = "Boing and Airbus Deliveries " & [MinMaxDate]

Happy coding!

Previous Post <<||>> Next Post

07 February 2025

🌌🏭KQL Reloaded: First Steps (Part VII: Basic Data Visualizations)

One of the greatest aspects of KQL and its environment is that creating a chart is just one instruction away from the dataset generated in the process. Of course, the data still need to be in an appropriate form to be used as source for a visual, though the effort is minimal. Let's consider the example used in the previous post based ln the ContosoSales data, where the visualization part is everything that comes after "| render":

// visualizations by Country: various charts
NewSales
| where SalesAmount <> 0 and ProductCategoryName == 'TV and Video'
| where DateKey >=date(2023-02-01) and DateKey < datetime(2023-03-01)
| summarize count_customers = count_distinct(CustomerKey) by RegionCountryName
| order by count_customers desc
//| render table
//| render linechart
//| render areachart 
//| render stackedchart
//| render columnchart | render piechart with (xtitle="Country", ytitle="# Customers", title="# Customers by Country (pie chart)", legend=hidden)
Output:
# Customers by Country (various charts)

It's enough to use "render" with the chart type without specifying the additional information provided under "with", though the legend can facilitate data's understanding. Unfortunately, the available properties are relatively limited, at least for now. 

Adding one more dimension is quite simple, even if the display may be sometimes confusing as there's no clear delimitation between the entities represented while the legend grows linearly with the number of points. It might be a good idea to use additional charts for the further dimensions in scope. 

// visualizations by Region & Country: various charts
NewSales
| where SalesAmount <> 0 and ProductCategoryName == 'TV and Video'
| where DateKey >=date(2023-02-01) and DateKey < datetime(2023-03-01)
| summarize count_customers = count_distinct(CustomerKey) by ContinentName, RegionCountryName
| order by count_customers desc   
//| render stackedareachart 
//| render linechart 
//| render table 
//| render areachart 
//| render piechart
| render columnchart 
    with (xtitle="Region/Country", ytitle="# Customers",
    title="#Customers by Continent & Country", legend=hidden)
Output:
# Customers by Continent & Country (column chart)

Sometimes, it makes sense to reduce the number of values, recommendation that applies mainly to pie charts:

// visualizations by Zone: pie chart
NewSales
| where SalesAmount <> 0 and ProductCategoryName == 'TV and Video'
| where DateKey >=date(2023-02-01) and DateKey < datetime(2023-03-01)
| summarize count_customers = count_distinct(CustomerKey) by iif(RegionCountryName in ('United States', 'Canada'), RegionCountryName, 'Others')
| render piechart
    with (xtitle="Country", ytitle="Sales volume",
    title="Sales volume by Zone")
Output:
# Customers by Zone (pie chart)

Adding a second set of values (e.g. Total cost) allows to easily create a scatter chart:

// visualization by Occupation: scatter chart
NewSales
| where SalesAmount <> 0 and ProductCategoryName == 'TV and Video'
| where DateKey >=date(2023-02-01) and DateKey < datetime(2023-03-01)
| summarize count_customers = count_distinct(CustomerKey) 
    , TotalCost = sum(TotalCost) by Occupation
| order by count_customers desc
| render scatterchart 
    with (xtitle="# Customers", ytitle="Sales volume",
    title="# Customers vs Sales volume by Occupation", legend=visible )
Output:
# Customers vs Sales volume by Occupation (scatter chart)

The visualizations are pretty simple to build, though one shouldn't expect that one can build a visualization on top of any dataset, at least not without further formatting and eventually code changes. For example, considering the query from the previous post, with a small change one can use the data with a column chart, though this approach might have some limitation (e.g. it doesn't work pie charts):

// calculating percentages from totals: column chart
NewSales
| where SalesAmount <> 0 and ProductCategoryName == 'TV and Video'
//| where DateKey >=date(2023-02-01) and DateKey < datetime(2023-03-01)
| summarize count_customers = count_distinct(CustomerKey)
    , count_customers_US = count_distinctif(CustomerKey, RegionCountryName == 'United States')
    , count_customers_CA = count_distinctif(CustomerKey, RegionCountryName == 'Canada')
    , count_customers_other = count_distinctif(CustomerKey, not(RegionCountryName in ('United States', 'Canada')))
| project Charting = "Country"
    , US = count_customers_US
    , CA = count_customers_CA
    , other = count_customers_other
| render columnchart
    with (xtitle="Region", ytitle="# Customers",
    title="# Customers by Region")
Output:
# Customers by Region (column chart)

There are a few more visuals that will be considered in a next post. Despite the relatively limited set of visuals and properties, the visualizations are useful to get a sense of data's shape, and this with a minimum of changes. Ad-hoc visualizations can help also in data modeling, validating the logic and/or identifying issues in the data when creating the queries, which makes it a great feature. 

Happy coding!

Previous Post <<||>> Next Post

29 January 2025

🌌🏭KQL Reloaded: First Steps (Part I: Simple Queries)

If one has followed the Microsoft training, webcasts and other resources, it becomes clear that the Kusto Query Language (KQL) can't be ignored as it's useful in various scenarios thar deal with large volumes of data. 

Even if KQL was created to query big tables, at least for the first steps it's recommended to start with a small data model, and when the basics were understood one can move to bigger databases. Moreover, it's easier to validate the logic when using small datasets.

Probably, the most important aspect before starting is that KQL is case-sensitive and this applies to everything – table and column names, operators, functions, etc. The below code can be tried in the online explorer (see [2], [3]), which makes available several databases for playing around. The users might need to register beforehand.

The following queries are based on the ContosoSales database (available in the above mentioned link). First, here are some simple projections. Each query is preceded by its short description in which the text was commented via "//" and must be run individually. 

// selecting all records
Customers 

// selecting all records
Customers 
| take 10

// multiple filters
Customers 
| where CityName == 'Berkeley'
| where Occupation != 'Professional'
| take 10

// multiple filters on the same column
Customers 
| where Occupation == 'Professional' and Occupation == 'Clerical')
| take 10

// multiple filters on the same column
Customers 
| where Occupation in ('Professional','Clerical')
| take 10

// multiple filters on the same column
Customers 
| where not(Occupation in ('Professional','Clerical'))
| take 10

//subset of columns
Customers
| take 5
| project ContinentName, CityName, FirstName, LastName, Gender

Here are some example for the selection of unique values, the equivalent of SELECT DISTINCT from SQL:

//distinct values used
Customers
| distinct  Occupation

//distinct values used sorted ascendingly 
Customers
| distinct  Occupation
| sort by Occupation asc

//combinations of values used
Customers
| distinct  Occupation, Education
| sort by Occupation asc, Education asc

When further data is needed, one needs to resume to grouping values, the equivalent of GROUP BY:

// record count
Customers 
| count

// record count for constraint
Customers 
| where CityName == 'Berkeley'
| count

// record count for constraint: returns 0 records (KQL is case sensitive)
Customers 
| where CityName == 'BERKELEY'
| count

// numnber of records by occupation 
Customers
| summarize occupations_count = count() by Occupation

// numnber of records by occupation with bar chart visual
Customers
| summarize occupations_count = count() by Occupation
| render barchart

The last query renders the data directly to a bar chart, which is a cool feature, especially when is needed to understand the distribution of values. Executing the query without the last line renders the initial dataset.

Azure Data Explorer - Chart example
Azure Data Explorer - Chart Example

Here are some first impressions:
1) The language is relatively simple, though the transition from SQL to KQL requires time, even if the thinking process of writing the code is not that far away. For those with no SQL knowledge, the transition might be a bit more challenging, though practice makes perfect!
2) One can try to run the code line by line to understand the steps used.
3) There are also some online code converters from SQL to KQL (see for example msticpy).
4) The intellisense capabilities facilitate the overall experience. 
5) Unfortunately, there seems to be no code formatter to HTML for KQL, so one needs to compromise in one way or another.
6) For a comparison between SQL and KQL see [4].

Happy coding!

|>> Next Post 

References:
[1] Microsoft Learn (2024) Kusto Query Language learning resources [link]
[2] TestingSpot Blog (2021) Learning Kusto Query Language - A tool for performance test engineers [link]
[3] Azure Data Explorer (2025] Create a free cluster to explore your data [link]
[4] Microsoft Learn (2024) SQL to Kusto Query Language cheat sheet [link]

Resources:
[R1] GitHub repository (2022) Sentinel Queries [link]
[R2] GitHub repository (2022) Kusto Query Language Grammar [link]
[R3] GitHub repository (2024) The Definitive Guide to KQL source code [link]
[R4[ Git Hub repository (2022) Learning Kijo [link]
[R5] Kusto gamification [link]
[R6] Clickhouse KQL  [link]
[R6] KQL Cafe user group [link]

12 December 2024

🧭💹Business Intelligence: Perspectives (Part XIX: Data Visualization between Art, Pragmatism and Kitsch)

Business Intelligence Series

The data visualizations (aka dataviz) presented in the media, especially the ones coming from graphical artists, have the power to help us develop what is called graphical intelligence, graphical culture, graphical sense, etc., though without a tutor-like experience the process is suboptimal because it depends on our ability of identifying what is important and which are the steps needed for decoding and interpreting such work, respectively for integrating their messages in our overall understanding about the world.

When such skillset is lacking, without explicit annotations or other form of support, the reader might misinterpret or fail to observe important visual cues even for simple visualizations, with all the implications deriving from this – a false understanding, and further aspects deriving from it, this being probably the most important aspect to consider. Unfortunately, even the most elaborate work can fail if the reader doesn’t have a basic understanding of all that’s implied in the process.

The books of Willard Brinton, Ana Rogers, Jacques Bertin, William Cleveland, Leland Wilkinson, Stephen Few, Albert Cairo, Soctt Berinato and many others can help the readers build a general understanding of the dataviz process and how data visualizations or simple graphics can be used/misused effectively, though each reader must follow his/her own journey. It’s also true that the basics can be easily learned, though the deeper one dives, the more interesting and nontrivial the journey becomes. Fortunately, the average reader can stick to the basics and many visualizations are simple enough to be understood.

To grasp the full extent of the implications, one can make comparisons with the domain of poetry where the author uses basic constructs like metaphor, comparisons, rhythm and epithets to create, communicate and imprint in reader’s mind old and new meanings, images and feelings altogether. Artistic data visualizations tend to offer similar charge as poetry does, even if the impact might not appeal so much to our artistic sensibility. Though dataviz from this perspective is or at least resembles an art form.

Many people can write verses, though only a fraction can write good meaningful poetry, from which a smaller fraction get poems, respectively even fewer get books published. Conversely, not everything can be expressed in verses unless one finds good metaphors and other aspects that can be leveraged in the process. Same can be said about good dataviz.

One can argue that in dataviz the author can explore and learn especially by failing fast (seeing what works and what doesn’t). One can also innovate, though the creator has probably a limited set of tools and rules for communication. Enabling readers to see the obvious or the hidden in complex visualizations or contexts requires skill and some kind of mastery of the visual form.

Therefore, dataviz must be more pragmatic and show the facts. In art one has the freedom to distort or move things around to create new meanings, while in dataviz it’s important for the meaning to be rooted in 'truth', at least by definition. The more the creator of a dataviz innovates, the higher the chances of being misunderstood. Moreover, readers need to be educated in interpreting the new meanings and get used to their continuous use.

Kitsch is a term applied to art and design that is perceived as naïve imitation to the degree that it becomes a waste of resources even if somebody pays the tag price. There’s a trend in dataviz to add elements to visualizations that don’t bring any intrinsic value – images, colors and other elements can be misused to the degree that the result resembles kitsch, and the overall value of the visualization is diminished considerably.

04 August 2024

📊Graphical Representation: Graphics We Live By (Part X: Pie and Donut Charts in Power BI and Excel)

Graphical Representation Series
Graphical Representation Series

Pie charts are loved and hated by many altogether, and there are many entitled reasons to use them and avoid them, though the most important criteria to evaluate them is whether they do the intended job in an acceptable manner, especially when compared to other representational means. The most important aspect they depict is the part to whole ratio, which even if can be depicted by other graphical tools, few tools are efficient in representing it. 

The pie chart works well as a visualization tool when it has only 3-5 values that are easily recognizable in the visualization, however as soon the size or the number of pieces vary considerably, the more difficult it is to visualize and interpret them, in case their representation has more negative than positive effects. There are many topics that form something like a long tail - the portion of the distribution having many occurrences far from the head or beginning. Displaying the items from the long tail together with the other components together can totally obscure the distribution of the items from the long tail as they become unrecognizable in the diagram. 

One approach to handle this is to group all the items from the long tail together under a piece (e.g. Other) and use a second form of representation to display them separately. For example,  Microsoft Excel offers a way to zoom in the section of a pie chart with small percentages by displaying them in a second pie chart (pie of pie) or bar chart (bar of pie), something like a "zoom in" perspective (see image below). Unfortunately, the feature seems to limit itself only to small percentages, and thus can't be used currently to offer a broader perspective. Ideally, it would be useful to zoom in on any piece of the pie, especially when the items are categorized as a hierarchy with two or even more levels. 


Unfortunately, even modern visualization tools offer limited features in displaying this kind of perspective into a flexible unitary visualization, and thus users are forced to use their creativity in providing proper solutions. In the below example the "Renewables" piece of pie is further broken down into several components of a full pie, an ensemble supposed to function as a single form of representation. With a bit of effort, the reader probably will understand the meaning behind the two pie charts, however the encoding of colors and other elements used are suboptimal in the decoding process. 

Pie Charts - Original Solution

In the above example, the arrow may suggest that in between the two donut charts exists a relationship, reflected also in the description provided, however the readers may still have difficulties in correctly interpreting the diagrams, especially when there's some kind of overlapping or other type of implied or unimplied resemblance. If the colors overlap or have other similarities, are they intentional? If the circles have the same size, does this observed resemblance have a meaning? The reader shouldn't bother himself with this type of questions, but see the resemblance and the meaning of the various elements with a minimum of effort while decoding a chart's elements. Of course, when the meaning is not clear, some guidance should be ideally provided!

Unfortunately, Power BI doesn't seem to have a similar visual like the one from Excel yet, however with a bit of effort one can obtain similar results, even if there are other minor or important limitations. For example, the lines between the two pie charts can't be drawn, so one is forced to use other encodings to show that there's a connection between the Renewable slice and the small pie chart. Moreover, the ensemble thus created isn't treated unitary and handled accordingly. Frankly, the maturity of a graphical representation environment can and should be judged also from this perspective!

The below representation built in Power BI uses a few tricks to display two pie charts together. The smaller pie chart representing the breakdown and pieces' colors are variations of parent's color, attempting to show that there's a relationship between the slice from the first chart and the pie chart with the details. Unfortunately, it wasn't possible to use similar lines like in Excel to show the relation between the two sections. 

Pie of Pie in Power BI

Instead of a pie chart, one can use a donut, like in the original representation. Even if the donut uses a smaller area for representation, in theory the pie chart offers a better basis for comparisons, at least in theory. Stacked column charts can be used as well (see C), however one loses the certainty that the pieces must add up to 100%. Further limitations can appear when one wants to achieve more with the visualizations.

Custom charts can be used as well. The pie chart coming from xViz (see D) allows to increase the size of a pie piece by using another radius, technique which could be used to highlight the piece represented in the second chart. Frankly, sunburst diagrams (see E) are better at representing the parent to child proportions, where the same color encoding has been used. Unfortunately, the more information is shown, the more loaded the visualization seems to be.

Pie of Pie Alternatives in Power BI I

A treemap can prove to be a better representation alternative because it encodes proportions in a unitary way, much like pie charts do, though it takes more space if one wants to make the labels visible. Radial charts (see G) and Aster plots (see I) can be occasionally better choices, especially because they use less space as they display only the main categories. A second diagram chart can be used to display the subcategories, much like in A and B. Sankey charts (see H) can be used as well, even if they don't allow representing any quantitative values unless one encodes them directly in the labels. 

Pie of Pie Alternatives in Power BI II

When one dives into the world of diagrams and goes behind the still limited representational choices provided by the standard tools, one can be surprised by the additional representational choices. However, their appropriateness should be considered against readers' skillset to read and interpret them! Frankly, the alternatives considered above could be a better choice when they will reach a representational maturity. 

Many thanks to Christopher Chin, who in his weekly post on data visualization blunders, suggested the examples used as basis for this post (see [1])!

Previous Post <<||>> Next Post

References:
[1] LinkedIn (2024) Christopher Chin's post (link)

14 June 2024

📊Graphical Representation: Graphics We Live By (Part IX: Word Clouds in Power BI)

Graphical Representation Series
Graphical Representation Series

A word cloud (aka tag cloud) is a visual representation of textual data in the form of a cloud - a mass of words in which each word is shown with a different font size and/or color based on its frequency, significance or categorization in the dataset considered. It is used to depict keyword metadata on websites, to visualize free form text or the frequency of specific values within a categorical dimension, respectively to navigate the same. 

Words can be categorized as single or compounded, where special characters like hyphen can be used. A tag is a special type of a word, usually a single word. One can use different direction or arrangement for displaying each word, independently of whether the value is numerical or alphanumerical. Word clouds are usually not sorted, even if the values could be sorted using a spiraled arrangement, which offers and easier way to navigate and compare the data.

Most of the representations are based on words' frequency, though occasionally the frequency is considered against a background corpus (e.g. Wikipedia). The use of tags as categorization methods for content items is seldom done, though needs to be considered as well. 

It makes sense to use word clouds only with categorical data (see below) for which the chances of multiple occurrences is high. Numerical values (see A & D) can be displayed as well when their range is narrow. Moreover, when the number of distinct values is high, one can consider only the top N values. Continuous data may be challenging to represent, though occasionally they can be represented as well, especially when reducing the precision

Word clouds allow to see at a glance what values are available and can be used as an alternative to choropleth maps for filtering and navigating the data. They aren't good for precise comparisons, though further information can be provided in the tooltip. 

In Power BI there are currently two visuals that allow to display word clouds - from Microsoft, respectively Powerviz, which was added recently (see Jun-2024 release [2]). They provide similar functionality, though Powerviz's visual offers more flexibility in what concerns the word options (case, styling, delimiters) direction, shapes (displaying the values within a form), ranking (top vs bottom), exclusion rules and formational formatting. It uses also a radial arrangement, which allows to select or exclude a set of values via the lasso functionality (see E). 

Word Clouds

Previous Post <<||>> Next Post

References:
[1] Wikipedia (2024) Tag cloud (link)
[2] Microsoft Power BI Blog (2004) Power BI June 2024 Feature Summary (link)


Related Posts Plugin for WordPress, Blogger...

About Me

My photo
Koeln, NRW, Germany
IT Professional with more than 25 years experience in IT in the area of full life-cycle of Web/Desktop/Database Applications Development, Software Engineering, Consultancy, Data Management, Data Quality, Data Migrations, Reporting, ERP implementations & support, Team/Project/IT Management, etc.