SQL Troubles

15 June 2024

🗒️Graphical Representation: Bar & Column Charts [Notes]

Disclaimer: This is work in progress intended to consolidate information from various sources and may deviate from them. Please consult the sources for the exact content!
Last updated: 15-Jun-2024

Bar & Column Charts with Variations

Bar & Column Charts (Graphs)

{definition} graphical representation of categorical data with rectangular figures (aka boxes) whose heights (column chart) or lengths (bar chart) are proportional to the values that they represent
{benefit} allow to visually encode/decode quantitative information-size as magnitude and area based on the relative position of the end of the box along the common scale

if the width of the box is the same, it's enough to compare the length

⇒ the basis of comparison is one-dimensional [1]
⇐ orient the reader to the relative magnitudes of the boxes

area is typically encoded when the width varies

⇐ encoding by area is a poor encoding method as it can mislead

can represent negative and positive values
one of the most useful, simple, and adaptable techniques in graphic presentation [1]

easily understood by readers
sometimes avoided because they are so common
almost everything could be a bar chart

the length of each bar is proportional to the quantity or amount of each category represented [1]

⇒the zero line must be shown [1]
⇒the scale must not be broken [1]

{exception} an excessively long bar in a series of bars may be broken off at the end, and the amount involved shown directly beyond it [1]

{benefit} allow to visually represent categorical data

⇒ occasionally represented without scales, grid lines or tick marks
the more data elements are presented, the more difficult it becomes to navigate and/or display the data

{benefit} allow us to easily compare magnitudes

sometimes without looking at the actual values

{type} bar chart

the box is shown horizontally
represents magnitude by length
allows comparing different items as of a specific time

{type} column chart

the box is shown vertically
represents magnitude by height
allows comparing different items over time

⇐ it still displays discrete points

recommended for comparing similar items for different time periods [2]
effective way to show most types of comparisons [2]

{subtype} stacked chart

variation of bar/column charts in which the boxes of a dimension's components are staked over each other

{exception} spaces can be used between boxes if the values aren't cumulative [3]

{benefit} allows encoding a further dimension where the values are staked within the same box
{drawback} do not show data structure well

⇒ make it challenging to compare values across boxes

{subtype} 100-percent chart

variation of stacked chart in which the magnitude totals to 100%
{benefit} allows to display part to whole relationships

⇐ preferable to circle chart's angle and area comparison [1]

{subtype} clustered chart (aka grouped chart)

variation of bar/column charts that allows encoding further quantitative information in distinct boxes tacked together which occasionally overlap

⇐ if there's space, it is usually kept to a minimum
e.g. can be used to display multiple data series

can be used with a secondary axis
{benefit} allows comparisons within the cluster/group as well between clusters/groups
{drawback} more challenging to make comparisons across points

{subtype} area chart (variable-width/variwide chart/graph)

variation of bar/column charts in which the height/width have significance being proportional to some measure or characteristics of the data elements represented [3]
{benefit} allow encoding a further dimension as part of the area

{subtype} deviation chart

variation of bar/column charts that display positive and negative values

{subtype} joined chart

variation of bar/column charts in which the boxes are tacked together
{benefit} allow to better use the space available

{subtype} paired chart

variation of bar/column charts in which the boxes are paired in mirror based on an axis

e.g. the values of one data series are displayed to the left, while the values for a second data series are displayed to the right

{benefit} allows to study the correlation and/or other relationships between the values of two data series
the hidden axes can have different scales

{subtype} circular chart (aka radial chart)

variation of bar/column charts in which the boxes are wrapped into a circle, the various categories being uniformly spaced along the radial or category axis [3]
the value scale can have any upper or lower value and can progress in either direction [3]
{benefit} useful to represent data that have a circular dimension in an aesthetic form

e.g. months, hours

{subtype} waterfall chart (aka progressing chart)

variation of bar/column charts in which the boxes are displayed progressively, the start of a box corresponding the end of the previous box
time and activity charts can be considered as variations of this subtype [3]
{advantage} allows to determine cumulative values, respectively the increase/decrease between consecutive boxes

{subtype}composite chart (aka mixed chart, combination chart, overlay chart)

variation of bar/column charts in besides boxes are used other graphic types of encoding (line, area)

⇐ the different data graphics are overlaid on one another [3]

{benefit} allows to improve clarity or highlight the relationships between several data series [3]
{drawback} overlaying can result in clutter

used to

display totals, averages or frequencies
display time series
display the relationship between two or more items
make a comparison among several items
make a comparison between parts and the whole

can be confounded with

[histograms]

show distribution through the frequency of quantitative values against defined intervals of quantitative values
used for continuous numerical data or data that can be effectively modelled as continuous
it doesn't have spaces between bars

⇐ older use of bar/column charts don't use spaces
if this aspect is ignored, histograms can be considered as a special type of area chart

[vertical line chart] (aka price chart, bar chart)

vertical line charts are sometimes referred as bar charts (see [3])

things to consider

distance between bars

the more distant the bars, the more difficult it becomes to make comparisons and the accuracy of judgment decreases

sorting

sorting the bars/columns by their size facilitates comparisons, though it can impede items' search, especially when there are many categories involved

{exception} not recommended for time series

clutter

displaying too many items in a cluster and/or too many labels can lead to clutter
{recommendation} display at maximum 3-4 clustered boxes

color

one should follow the general recommendations

trend lines

can be used especially with time series especially to represent the linear regression line

dual axis

{benefit} allows to compare the magnitudes of two data series by employing a secondary axis

overlapping

overlapping boxes can make charts easier to read

symbols

can be used to designate reference points of comparison for each of the bars [3]

{alternative} pie chart

can be used to dramatize comparisons in relation to the whole [2]
one should consider the drawbacks

{alternative} choropleth maps

more adequate for geographical dimensions
provide minimal encoding

{alternative} line charts

can be much more informative
provides an optimal dat-ink ratio
reduces the chart junk feeling

{alternative} dot plots

are closer to the original data

References:
[1] Anna C Rogers (1961) "Graphic Charts Handbook"
[2] Robert Lefferts (1981) "Elements of Graphics: How to prepare charts and graphs for effective reports"
[3] Robert L Harris (1996) "Information Graphics: A Comprehensive Illustrated Reference"

14 June 2024

📊Graphical Representation: Graphics We Live By (Part IX: Word Clouds in Power BI)

Graphical Representation Series

A word cloud (aka tag cloud) is a visual representation of textual data in the form of a cloud - a mass of words in which each word is shown with a different font size and/or color based on its frequency, significance or categorization in the dataset considered. It is used to depict keyword metadata on websites, to visualize free form text or the frequency of specific values within a categorical dimension, respectively to navigate the same.

Words can be categorized as single or compounded, where special characters like hyphen can be used. A tag is a special type of a word, usually a single word. One can use different direction or arrangement for displaying each word, independently of whether the value is numerical or alphanumerical. Word clouds are usually not sorted, even if the values could be sorted using a spiraled arrangement, which offers and easier way to navigate and compare the data.

Most of the representations are based on words' frequency, though occasionally the frequency is considered against a background corpus (e.g. Wikipedia). The use of tags as categorization methods for content items is seldom done, though needs to be considered as well.

It makes sense to use word clouds only with categorical data (see below) for which the chances of multiple occurrences is high. Numerical values (see A & D) can be displayed as well when their range is narrow. Moreover, when the number of distinct values is high, one can consider only the top N values. Continuous data may be challenging to represent, though occasionally they can be represented as well, especially when reducing the precision

Word clouds allow to see at a glance what values are available and can be used as an alternative to choropleth maps for filtering and navigating the data. They aren't good for precise comparisons, though further information can be provided in the tooltip.

In Power BI there are currently two visuals that allow to display word clouds - from Microsoft, respectively Powerviz, which was added recently (see Jun-2024 release [2]). They provide similar functionality, though Powerviz's visual offers more flexibility in what concerns the word options (case, styling, delimiters) direction, shapes (displaying the values within a form), ranking (top vs bottom), exclusion rules and formational formatting. It uses also a radial arrangement, which allows to select or exclude a set of values via the lasso functionality (see E).

Word Clouds

Previous Post <<||>> Next Post

References:
[1] Wikipedia (2024) Tag cloud (link)
[2] Microsoft Power BI Blog (2004) Power BI June 2024 Feature Summary (link)

13 June 2024

🧭🏭Business Intelligence: Microsoft Fabric (Part V: One Person Can’t Learn or Do Everything)

Business Intelligence Series

Today’s Explicit Measures webcast [1] considered an article written by Kurt Buhler (The Data Goblins): [Microsoft] "Fabric is a Team Sport: One Person Can’t Learn or Do Everything" [2]. It’s a well-written article that deserves some thought as there are several important points made. I can’t say I agree with the full extent of some statements, even if some disagreements are probably just a matter of semantics.

My main disagreement starts with the title “One Person Can’t Learn or Do Everything”. As clarified in webcast's chat, the author defines “everything" as an umbrella for “all the capabilities and experiences that comprise Fabric including both technical (like Power BI) or non-technical (like adoption data literacy) and everything in between” [1].

For me “everything” is relative and considers a domain's core set of knowledge, while "expertise" (≠ "mastery") refers to the degree to which a person can use the respective knowledge to build back-to-back solutions for a given area. I’d say that it becomes more and more challenging for beginners or average data professionals to cover the core features. Moreover, I’d separate the non-technical skills because then one will also need to consider topics like Data, Project, Information or Knowledge Management.

There are different levels of expertise, and they can vary in depth (specialization) or breadth (covering multiple areas), respectively depend on previous experience (whether one worked with similar technologies). Usually, there’s a minimum of requirements that need to be covered for being considered as expert (e.g. certification, building a solution from beginning to the end, troubleshooting, performance optimization, etc.). It’s also challenging to roughly define when one’s expertise starts (or ends), as there are different perspectives on the topics.

Conversely, the term expert is in general misused extensively, sometimes even with a mischievous intent. As “expert” is usually considered an external consultant or a person who got certified in an area, even if the person may not be able to build solutions that address a customer’s needs.

Even data professionals with many years of experience can be overwhelmed by the volume of knowledge, especially when one considers the different experiences available in MF, respectively the volume of new features released monthly. Conversely, expertise can be considered in respect to only one or more MF experiences or for one area within a certain layer. Lot of the knowledge can be transported from other areas – writing SQL and complex database objects, modelling (enterprise) semantic layers, programming in Python, R or Power Query, building data pipelines, managing SQL databases, etc.

Besides the standard documentation, training sessions, and some reference architectures, Microsoft made available also some labs and other material, which helps discovering the features available, though it doesn’t teach people how to build complete solutions. I find more important than declaring explicitly the role-based audience, the creation of learning paths for the various roles.

During the past 6-7 months I've spent on average 2 days per week learning MF topics. My problem is not the documentation but the lack of maturity of some features, the gaps in functionality, identifying the respective gaps, knowing what and when new features will be made available. The fact that features are made available or changed while learning makes the process more challenging.

My goal is to be able to provide back-to-back solutions and I believe that’s possible, even if I might not consider all the experiences available. During the past 22 years, at least until MF, I could build complete BI solutions starting from requirements elicitation, data extraction, modeling and processing for data consumption, respectively data consumption for the various purposes. At least this was the journey of a Software Engineer into the world of data.

References:
[1] Explicit Measures (2024) Power BI tips Ep.328: Microsoft Fabric is a Team Sport (link)
[2] Data Goblins (2024) Fabric is a Team Sport: One Person Can’t Learn or Do Everything (link)

11 June 2024

🧭🏭Business Intelligence: Microsoft Fabric (Part IV: Is Microsoft Fabric Ready?)

Business Intelligence Series

When writing a Business Case, besides the problem and solution(s) high-level descriptions, is important to roughly estimate how much it costs, how long it takes, respectively how many resources are needed and for what activities. A proof-of-concept (PoC) might not need an explicit business case, though the same high-level information is needed at least for the planning of resources and a formal approval.

Given that there are several analytical experiences in Microsoft Fabric (MF), it’s clear that can’t be anymore a reference architecture that can be recommended for customers. Frankly, that ship has sailed even since the introduction of Microsoft Synapse, if not earlier, with the movement to the cloud. Also, there’s no one size fits all as certain building blocks make sense only in certain scenarios (e.g. organization scale, data volume or source’s type). Moreover, even if MF has been generally available for quite some time, customers and service providers ask themselves whether the available features are enough for building analytics solutions based on it.

“Is Fabric Ready?” was the topic of today’s Explicit Measures webcast [1]. Probably the answer is as usual “it depends” and the general recommendation is to do a PoC to check solution's feasibility. Conversely, MF may be the best approach to consider if integration with other systems (e.g. Dynamics 365, Dataverse) is needed.

What the customers need are some rough realistic estimates they can base any planning upon (at least for a PoC if not for the whole project) in terms of making the data available into OneLake, building a semantic model, respectively processing and making the data available for consumption. Ideally, one needs a translation of the various steps as done earlier. For example, how long it takes to make the data available in OneLake, how long it takes to move the data physically or logically though the various layers, to build semantic models, etc.

Probably, some things can be achieved in a matter of days, at least if one knows what one’s doing. However, we are talking here about a new architecture that may resemble for some of an unknow territory. Even if old and new techniques can be mixed, there are further implications or improvements that can be considered. There are many webcasts, blog posts and other material on how to do things, on what’s possible, though building a functioning solution from beginning to the end, even as PoC, requires more than putting all this together.

Just making the data flow from point A to B or C is not enough - data security, data governance and a few other topics like scalability and availability need to be considered as well. Security and governance are also the areas in which probably more features must be considered. For many customers starting now with MF, the hope is that most of these features will be available during the time the solutions are ready for production.

From a cost perspective, there’s the cost of data at rest, in transit, the licensing for MF and the other components involved. Ideally, one should start small and increase capacities as needed, though small can vary from case to case, while it’s important to find out the optimum. Starting in the middle could be an alternative approach even if may involve higher costs. If one starts small, the costs for PoC can be neglectable, though sooner or later a compromise is needed to provide an acceptable performance.

In terms of human resources, the topic is more complex (see [2]), and it depends largely on the nature of the project. The pool of skillsets is the most important constraint or enabler such projects can have.

Previous Post <<||>> Next Post

References:
[1] Explicit Measures (2024) Power BI tips Ep.327: Is Fabric Ready? (link)
[2] Explicit Measures (2024) Power BI tips Ep.321: Building and BI Team (link)

01 June 2024

📊Graphical Representation: Graphics We Live By (Part VIII: List of Items in Power BI)

Graphical Representation Series

Introduction

There are situations in which one needs to visualize only the rating, other values, or ranking of a list of items (e.g. shopping cart, survey items) on a scale (e.g. 1 to 100, 1 to 10) for a given dimension (e.g. country, department). Besides tables, in Power BI there are 3 main visuals that can be used for this purpose: the clustered bar chart, the line chart (aka line graph), respectively the slopegraph:

Main Display Methods

Main Display Methods

For a small list of items and dimension values probably the best choice would be to use a clustered bar chart (see A). If the chart is big enough, one can display also the values as above. However, the more items in the list, respectively values in the dimension, the more space is needed. One can maybe focus then only on a subset of items from the list (e.g. by grouping several items under a category), respectively choose which dimension values to consider. Another important downside of this method is that one needs to remember the color encodings.

This downside applies also to the next method - the use of a line chart (see B) with categorical data, however applying labels to each line simplifies its navigation and decoding. With line charts the audience can directly see the order of the items, the local and general trends. Moreover, a line chart can better scale with the number of items and dimension values.

The third option (see C), the slopegraph, looks like a line chart though it focuses only on two dimension values (points) and categorizes the line as "down" (downward slope), "neutral" (no change) and "up" (upward slope). For this purpose, one can use parameters fields with measures. Unfortunately, the slopegraph implementation is pretty basic and the labels overlap which makes the graph more difficult to read. Probably, with the new set of changes planned by Microsoft, the use of conditional formatting of lines would allow to implement slope graphs with line charts, creating thus a mix between (B) and (C).

This is one of the cases in which the Y-axis (see B and C) could be broken and start with the meaningful values.

Table Based Displays

Especially when combined with color encodings (see C & G) to create heatmap-like displays or sparklines (see E), tables can provide an alternative navigation of the same data. The color encodings allow to identify the areas of focus (low, average, or high values), while the sparklines allow to show inline the trends. Ideally, it should be possible to combine the two displays.

Table Displays and the Aster Plot

One can vary the use of tables. For example, one can display only the deviations from one of the data series (see F), where the values for the other countries are based on AUS. In (G), with the help of visual calculations one can also display values' ranking.

Pie Charts

Pie charts and their variations appear nowadays almost everywhere. The Aster plot is a variation of the pie charts in which the values are encoded in the height of the pieces. This method was considered because the data used above were encoded in 4 similar plots. Unfortunately, the settings available in Power BI are quite basic - it's not possible to use gradient colors or link the labels as below:

Source Data as Aster Plots

Sankey Diagram

A Sankey diagram is a data visualization method that emphasizes the flow or change from one state (the source) to another (the destination). In theory it could be used to map the items to the dimensions and encode the values in the width of the lines (see I). Unfortunately, the diagram becomes challenging to read because all the lines and most of the labels intersect. Probably this could be solved with more flexible formatting and a rework of the algorithm used for the display of the labels (e.g. align the labels for AUS to the left, while the ones for CAN to the right).

Sankey Diagram

Data Preparation

A variation of the above image with the Aster Plots which contains only the plots was used in ChatGPT to generate the basis data as a table via the following prompts:

retrieve the labels from the four charts by country and value in a table
consolidate the values in a matrix table by label country and value

The first step generated 4 tables, which were consolidated in a matrix table in the second step. Frankly, the data generated in the first step should have been enough because using the matrix table required an additional step in DAX.

Here is the data imported in Power BI as the Industries query:

let
    Source = #table({"Label","Australia","Canada","U.S.","Japan"}
, {
 {"Credit card","67","64","66","68"}
, {"Online retail","55","57","48","53"}
, {"Banking","58","53","57","48"}
, {"Mobile phone","62","55","44","48"}
, {"Social media","74","72","62","47"}
, {"Search engine","66","64","56","42"}
, {"Government","52","52","58","39"}
, {"Health insurance","44","48","50","36"}
, {"Media","52","50","39","23"}
, {"Retail store","44","40","33","23"}
, {"Car manufacturing","29","29","26","20"}
, {"Airline/hotel","35","37","29","16"}
, {"Branded manufacturing","36","33","25","16"}
, {"Loyalty program","45","41","32","12"}
, {"Cable","40","39","29","9"}
}
),
    #"Changed Types" = Table.TransformColumnTypes(Source,{{"Australia", Int64.Type}, {"Canada", Int64.Type}, {"U.S.", Number.Type}, {"Japan", Number.Type}})
in
    #"Changed Types"

Transforming (unpivoting) the matrix to a table with the values by country:

IndustriesT = UNION (
    SUMMARIZECOLUMNS(
     Industries[Label]
     , Industries[Australia]
     , "Country", "Australia"
    )
    , SUMMARIZECOLUMNS(
     Industries[Label]
     , Industries[Canada]
     , "Country", "Canada"
    )
    , SUMMARIZECOLUMNS(
     Industries[Label]
     , Industries[U.S.]
     , "Country", "U.S."
    )
    ,  SUMMARIZECOLUMNS(
     Industries[Label]
     , Industries[Japan]
     , "Country", "Japan"
    )
)

Notes:

The slopechart from MAQ Software requires several R language libraries to be installed (see how to install the R language and optionally the RStudio). Run the following scripts, then reopen Power BI Desktop and enable running visual's scripts.

install.packages("XML")
install.packages("htmlwidgets")
install.packages("ggplot2")
install.packages("plotly")

Happy (de)coding!

Previous Post <<||>> Next Post

29 May 2024

📊Graphical Representation: Graphics We Live By (Part VII: Reading a Conversion Rates Chart with ChatGPT and Copilot)

Graphical Representation Series

One of the areas where ChatGPT, Copilot and other similar AI-based chatbots can help is in summarizing a chart saved as image. Ideally, the chatbots should be able also to approximate the points from the chart as well (an image is made of pixels and thus areas should be easy to delimit). So, I was wondering how far the chatbots can be used for these purposes. I used first an image copied from the web, though I realized that not all visual elements could be read (e.g. Copilot had issues retrieving the values for some months) and I had no basis data for comparisons to identify how big the deviations are.

So, I created a chart in Power BI based on the below chart (see original data):

Conversion Rates Dual Axes Chart

Here's the output based on Copilot over several attempts:

	Original data			First attempt		Second attempt		Third attempt		Fourth attempt
Sorting	Month	Conv.	Conv. Rate	Conv.	Conv. Rate	Conv.	Conv. Rate	Conv.	Conv. Rate	Conv.	Conv. Rate
1	Jul	8	4	10	1	10	1	8	4	8	4
2	Aug	280	16	275	15	275	15	275	18	275	18
3	Sep	100	13	225	12	225	10	225	12	225	12
4	Oct	280	14	275	12	275	11	275	11	275	11
5	Nov	90	4	75	5	75	6	75	6	75	6
6	Dec	85	3.5	100	5	100	5	100	5	100	5
7	Jan	70	4.5	50	3	50	3	50	4	50	4
8	Feb	30	1.5	50	3	25	2	50	2.5	50	2.5
9	Mar	70	4	25	1	50	2.5	25	1.5	25	1.5
10	Apr	185	11	200	10	200	10	200	10	200	10
11	May	25	3.5	50	4	50	4	50	3.5	50	3.5
12	Jun	195	4	225	10	225	10	225	11	195	4

Please note that same values were repeated to identify whether the chatbot is able to identify the resemblance correctly.

Based on the data several tests were made with ChatGPT4o, where the following prompts were given:

generate the data behind the chart as excel list (see 'First attempt' in the table)
generate the data behind the chart as excel list where blue is represented on the left axis and orange on the right axis (see 'Second attempt');
reevaluate the values from the previous output considering that the values for Jul are 8, respectively 4% (see 'Third attempt');
reevaluate the values from the previous output considering that the values for Jun are 195, respectively 4% (see 'Fourth attempt');
generate the data behind the dual axis chart as list from the last conversation;
summarize the chart by revealing trends and other statistics.

From attempt to attempt, the values get closer to the original data, corrections being made, though there are points in the final output that diverge considerably. ChatGPT does generate a table which can be easily copied to MS Excel, and even gives some instructions on how to generate a dual axis chart.

It's great that one can use data between sessions and improve maybe the output. ChatGPT's summary is better than expected and it leverages the latest corrections. However, the estimations need to improve so that the inferences can be correct.

Here's the output based on the same prompts in Copilot:

	Original data			First attempt		Second attempt		Third attempt		Fourth attempt
Sorting	Month	Conv.	Conv. Rate	Conv.	Conv. Rate	Conv.	Conv. Rate	Conv.	Conv. Rate	Conv.	Conv. Rate
1	Jul	8	4	300	18	300	18	8	4	8	4
2	Aug	280	16	250	16	250	16	250	16	250	16
3	Sep	100	13	200	14	200	14	200	14	200	14
4	Oct	280	14	150	12	150	12	150	12	150	12
5	Nov	90	4	100	10	100	10	100	10	100	10
6	Dec	85	3.5	50	8	50	8	50	8	50	8
7	Jan	70	4.5	300	18	300	18	300	18	300	18
8	Feb	30	1.5	250	16	250	16	250	16	250	16
9	Mar	70	4	200	14	200	14	200	14	200	14
10	Apr	185	11	150	12	150	12	150	12	150	12
11	May	25	3.5	100	10	100	10	100	10	100	10
12	Jun	195	4	50	8	50	8	50	8	195	4

Copilot's estimations are higher than the ones made by ChatGPT and deviate more from the original data. No reevaluations are done between prompts for the other values. The summary provides information that can be used to complement ChatGPT's output.

Overall, ChatGPT seems to perform better than Copilot, at least for this example (though we might talk here about different "generations"). Unfortunately, given that the estimations provided by both chatbots deviate considerably from the expectation, the output needs to be revised and corrected, which decreases the usability of such chatbots. In fact, one can use them to generate an initial set of data and correct then the deviations.

The outputs of other chatbots like Google's Gemini or Claude-3-Haiku (via Poe) can't be compared with the ones from ChatGPT or Copilot yet. Claude-3-Haiku does provide estimated values (even with comma), though they deviate considerably from the original data.

It would be interesting to test how other charts and plots are processed by chatbots, respectively whether the various visual elements (e.g. gridlines, ticks, markers) make a difference.

Previous Post <<||>> Next Post

27 May 2024

📊Graphical Representation: Graphics We Live By (Part VI: Conversion Rates in Power BI)

Graphical Representation Series

Introduction

Conversion rates record the percentage of users, customers and other entities who completed a desired action within a set of steps, typically as part of a process. Conversion rates are a way to evaluate the performance of digital marketing processes in respect to marketing campaigns, website traffic and other similar actions.

In data visualizations the conversion rates can be displayed occasionally alone over a time unit (e.g. months, weeks, quarters), though they make sense only in the context of some numbers that reveal the magnitude, either the conversions or the total number of users (as one value can be calculated then based on the other). Thus, it is needed to display two data series with different scales if one considers the conversion rates, respectively display the conversions and the total number of users on the same scale.

For the first approach, one can use (1) a table or heatmap, if the number of values is small (see A, B) or the data can be easily aggregated (see L); (2) a visual with dual axis where the values are displayed as columns, lines or even areas (see E, I, J, K); (3) two different visuals where the X axis represents the time unit (see H); (4) a visual that can handle by default data series with different axis - a scatter chart (see F). For the second approach, one has a wider set of display methods (see C, D, G), though there are other challenges involved.

Conversion Rates in Power BI

Tables/Heatmaps

When the number of values is small, as in the current case, a table with the unaltered values can occasionally be the best approach in terms of clarity, understandability, explicitness, or economy of space. The table can display additional statistics including ranking or moving averages. Moreover, the values contained can be represented as colors or color saturation, with different smooth color gradients for each important column, which allows to easily identify high/low values, respectively values from the same row with different orders of magnitude (see the values for September).

In Power BI, a simple table (see A) allows to display the values as they are, though it doesn't allow to display totals. Conversely, a matrix table (see B) allows to display the totals, though one needs to use measures to calculate the values, and to use sparklines, even if in this case the values displayed are meaningless except the totals. Probably, a better approach would be to display the totals with sparklines in an additional table (see L), which is based on a matrix table. Sparklines better use the space and can be represented inline in tables, though each sparkline follows its own scale of values (which can be advantageous or disadvantageous upon case).

Column/Bar Charts

Column or bar charts are usually the easiest way to encode values as they represent magnitude by their length and are thus easy to decode. To use a single axis one is forced to use the conversions against the totals, and this may work in many cases. Unfortunately, in this case the number of conversions is small compared with the number of "actions", which makes it challenging to make inferences on conversion rates' approximate values. Independently of this, it's probably a good idea to show a visual with the conversion rates anyway (or use dual axes).

In Power BI, besides the standard column/bar chart visuals (see G), one can use also the Tornado visual from Microsoft (see C), which needs to be added manually and is less customizable than the former. It allows to display two data series in mirror and is thus more appropriate for bipartite data (e.g. males vs females), though it allows to display the data labels clearly for both series, and thus more convenient in certain cases.

Dual Axes

A dual-axis chart is usually used to represent the relationship between two variables with different amplitude or scale, encoding more information in a smaller place than two separate visuals would do. The primary disadvantage of such representations is that they take more time and effort to decode, not all users being accustomed with them. However, once the audience is used to interpreting such charts, they can prove to be very useful.

One can use columns/bars, lines and even areas to encode the values, though the standard visuals might not support all the combinations. Power BI provides dual axis support for the line chart, the area chart, the line and staked/clustered column charts (see I), respectively the Power KPI chart (see E). Alternatively, custom visuals from ZoomCharts and other similar vendors could offer more flexibility. For example, ZoomCharts's Drill Down Combo PRO allows to mix columns/bars, lines, and areas with or without smooth lines (see J, K).

Currently, Power BI standard visuals don't allow column/bar charts on both axes concomitantly. In general, using the same encoding on both sides of the axes might not be a good idea because audience's tendency is to compare the values on the same axis as the encoding looks the same. For example, if the values on both sides are encoded as column lengths (see J), the audience may start comparing the length without considering that the scales are different. One needs to translate first the scale equivalence (e.g. 1:3) and might be a good idea to reflect this (e.g. in subtitle or annotation). Therefore, the combination column and line (see I) or column and area (see K) might work better. In the end, the choice depends on the audience or one's feeling what may work.

Radar Chart

Radar charts are seldom an ideal solution for visualizing data, though they can be used occasionally for displaying categorical-like data, in this case monthly based data series. The main advantage of radar charts is that they allow to compare areas overlapping of two or more series when their overlap is not too cluttered. Encoding values as areas is in general not recommended, as areas are more difficult to decode, though in this case the area is a secondary outcome which allows upon case some comparisons.

Scatter Chart

Scatter charts (and bubble charts) allow by design to represent the relationship between two variables with different amplitude or scale, while allowing to infer further information - the type of relationship, respectively how strong the relationship between the variables is. However, each month needs to be considered here as a category, which makes color decoding more challenging, though labels can facilitate the process, even if they might overlap.

Using Distinct Visuals

As soon as one uses distinct visuals to represent each data series, the power of comparison decreases based on the appropriateness of the visuals used. Conversely, one can use the most appropriate visual for each data series. For example, a waterfall chart can be used for conversions, and a line chart for conversion rates (see H). When the time axis scales similarly across both charts, one can remove it.

The Data

The data comes from a chart with dual axes similar to the visual considered in (J). Here's is the Power Query script used to create the table used for the above charts:

let
    Source = #table({"Sorting", "Month" ,"Conversions", "Conversion Rate"}
, {
{1,"Jul",8,0.04},
{2,"Aug",280,0.16},
{3,"Sep",100,0.13},
{4,"Oct",280,0.14},
{5,"Nov",90,0.04},
{6,"Dec",85,0.035},
{7,"Jan",70,0.045},
{8,"Feb",30,0.015},
{9,"Mar",70,0.04},
{10,"Apr",185,0.11},
{11,"May",25,0.035},
{12,"Jun",195,0.04}
}
),
    #"Changed Types" = Table.TransformColumnTypes(Source,{{"Sorting", Int64.Type}, {"Conversions", Int64.Type}, {"Conversion Rate", Number.Type}})
in
    #"Changed Types"

Conclusion

Upon case, depending also on the bigger picture, each of the above visuals can be used. I would go with (H) or an alternative of it (e.g. column chart instead of waterfall chart) because it shows the values for both data series. If the values aren't important and the audience is comfortable with dual axes, then probably I would go with (K) or (I), with a plus for (I) because the line encodes the conversion rates better than an area.

Happy (de)coding!

Previous Post <<||>> Next Post

23 May 2024

🏭🗒️Microsoft Fabric: Domains [Notes]

Disclaimer: This is work in progress intended to consolidate information from various sources for learning purposes. For the latest information please consult the documentation (see the links below)!

Last updated: 29-May-2024

Domains & Entities

[Microsoft Fabric] Domains

{definition} a way of logically grouping together data in an organization that is relevant to a particular area or field [1]

associated with workspaces

{benefit} allows to group data into business domains [1]
all the items in the workspace are then associated with the domain, and they receive a domain attribute as part of their metadata [1]
{benefit} enables a better consumption experience [1]

simplify discovery and consumption

provide a management boundary between tenant and workspace enabling domain admins to have more granular control over multiple workspaces [6]

some tenant-level settings for managing and governing data can be delegated to the domain level [2]

allow to achieve federated governance [7]

by delegating settings to domain admins

⇒ allow provide more granular control over business area [7]

[security] domain roles

Fabric admins (or higher)

can create and edit domains
can specify domain admins and contributors
can associate workspaces with domains [4]
can see, edit, and delete all domains in the admin portal [4]
domain admins

business owners or experts of a domain

can update the domain description
can define contributors
can associate workspaces with the domain [4]
can define and update the domain image
can override tenant settings for any specific settings the tenant admin has delegated to the domain level [4]
can't delete the domain, change the domain name, or add/delete other domain admins
can only see and edit the domains they're admins of.

domain contributors

⇐ must be a workspace admin
can associate their workspaces with a domain or change the current domain association
don’t have access to the Domains page in the admin portal

domain users

can share lakehouse with other domain users without giving access to workspace and other artifacts [4]

{concept} default domain

a domain that has been specified as the default domain for specific users and/or security groups [3]

⇒ when these users/security groups create/update a new/unassigned workspace, that workspace will automatically be assigned to that domain [3]
⇒ generally automatically become domain contributors of the workspaces that are assigned in this manner [3]

{feature} subdomains

a way for fine tuning the logical grouping data under a domain [1]

subdivisions of a domain
only one level is supported in the hierarchy

visible as part of the domains filter and as part of the item location path
no setup available

[planned] some domain settings will be added to subdomains as well

Previous Post <<||>> Next Post

References:
[1] Microsoft Learn (2023) Administer Microsoft Fabric (link)
[2] Microsoft Learn - Fabric (2024) Governance overview and guidance (link)
[3] Microsoft Learn: Fabric (2023) Fabric domains (link)
[4] Establishing Data Mesh architectural pattern with Domains and OneLake on Microsoft Fabric, by Maheswaran Arunachalam (link)
[5] Microsoft Fabric Updates Blog (2024) Easily implement data mesh architecture with domains in Fabric, by Naama Tsafrir (link)
[6] Microsoft (2024) Microsoft Fabric Domains – Data Mesh [with Naama Tsafrir & Assaf Shemesh]
[7] Microsoft Fabric (2024) Fabric Analyst in a Day [course notes]
[8] Microsoft Learn (2025) Tags in Microsoft Fabric [link]

Resources:
[R1] Microsoft Learn (2025) Fabric: What's new in Microsoft Fabric? [link]
[R2] Microsoft Learn (2025) Best practices for planning and creating domains in Microsoft Fabric [link]

Acronyms:
MF - Microsoft Fabric

18 May 2024

📊Graphical Representation: Graphics We Live By (Part IV: Area Charts in MS Excel)

Graphical Representation

An area chart or area graph (see A) is a graphical representation of quantitative data based on a line chart for which the areas between axis and the lines of the series are commonly emphasized with colors, textures, or hatchings (Wikipedia). It resembles a combination between line and bar charts. Each data series results in the formation of a region (aka area), allowing thus to identify the overlapping and do comparisons between the lines within the same visual display. This approach works usually well for two or three data series if the lines don't overlap, though if more data series are added to the chart, the higher are the chances for lines to overlap or for one area to be covered by another (see B). This can easily become more than the chart can handle, even if the data series can be filtered dynamically.

Area Charts

Stacked area charts are a variation of area charts in which the areas are stacked, much like stacked bar charts (see C). Research papers abound with such charts, probably because they allow to stack together multiple data series within a small area, reflecting thus the many variables involved. Such charts allow to track individual as well as intermediary and total aggregated trends.

Stacked Area Charts

Unfortunately, besides the fact that some areas are barely distinguishable or that distant areas can't be compared (especially when one area in between has strong fluctuations), the lack of ticks and/or gridlines (see D) makes it difficult to interpret such charts. Moreover, when the lines are smoothed, it becomes even more difficult to identify the actual points. To address this it makes sense to use markers for data points to show that one works with discrete and not continuous points (see further paragraphs).

In general, it's recommended to reduce the number of data series to 3-5. For example, one can split the data series into 2-3 groups or categories based on series' characteristics (e.g. concentrate on the high values in one chart, respectively the low values in another, or group the low values under an "others" category) which would allow to make better comparisons.

Being able to sort the time series on their average value or other criteria (e.g. showing the areas with minimal variations first) can improve the readability of such charts.

Moreover, areas under curves can easily hide missing data (see F) and occasionally negative values (which is the case of the 8th example), or distort the rate of change when the charts are wider than needed (compare F with C).

Line Chart, respectively Area Chart based on a subset

Area Charts Variations

Area charts seem to encode a dimension as area, though that's not necessarily the case. It seems natural to display time series of different granularities (day, month, quarter, year), though one needs to be careful about one important aspect! On a time scale, the more one moves away from the day to weeks and months as time units, the bigger the distance between points is. In the end, all the points in a series are discrete points (not continuous), though the bigger the distance, the more category-like these series become (compare F with C, the charts have the same width).

Using the area under the curve as dimension makes sense when there's continuity or the discrete points are close enough to each other to resemble continuity. Thus, area charts are useful when the number of points is high (and the distance between them becomes neglectable), e.g. showing daily values within a year or the months over several years.

According to [2], [3] and several other sources, using the area to encode quantitative information is a poor graphical method and this applies to pie charts and area charts altogether. By contrast, for a bar chart (see G) one has either height or width to use for comparisons while the points are always as bars delimited. Scatter plots (see H), even if they might miss the time dimension, they better reflect the dispersion of the points along the lines delimited by encoding the color (compare H with E).

Alternatives for Area Charts

The more category-like and the fewer data points the data series have, the higher the chances for other graphical representation tools to be able to better represent the data. For example, year or even quarter-based data can be better visualized with Sankey charts (unfortunately, not available as standard Excel visual yet).

Conversely, there are situations in which the area chart isn't supposed to convey specific values but to get a feeling of areas' shape, or its simplicity is more appropriate, situations in which area charts do a good job. In the end, a graphical representation's utility is linked to a chart's purpose (and audience, of course).

Previous Post <<||>> Next Post

References:
[1] Wikipedia (2023) Area charts (link)
[2] William S Cleveland (1993) Visualizing Data
[3] Robert L Harris (1996) Information Graphics: A Comprehensive Illustrated Reference

SQL Troubles

Pages

15 June 2024

🗒️Graphical Representation: Bar & Column Charts [Notes]

14 June 2024

📊Graphical Representation: Graphics We Live By (Part IX: Word Clouds in Power BI)

13 June 2024

🧭🏭Business Intelligence: Microsoft Fabric (Part V: One Person Can’t Learn or Do Everything)

11 June 2024

🧭🏭Business Intelligence: Microsoft Fabric (Part IV: Is Microsoft Fabric Ready?)

01 June 2024

📊Graphical Representation: Graphics We Live By (Part VIII: List of Items in Power BI)

29 May 2024

📊Graphical Representation: Graphics We Live By (Part VII: Reading a Conversion Rates Chart with ChatGPT and Copilot)

27 May 2024

📊Graphical Representation: Graphics We Live By (Part VI: Conversion Rates in Power BI)

23 May 2024

🏭🗒️Microsoft Fabric: Domains [Notes]

18 May 2024

📊Graphical Representation: Graphics We Live By (Part IV: Area Charts in MS Excel)

About Me