
01 June 2024

Graphical Representation: Graphics We Live By (Part VIII: List of Items in Power BI)

Graphical Representation Series

Introduction

There are situations in which one needs to visualize only the rating, other values, or the ranking of a list of items (e.g. shopping cart, survey items) on a scale (e.g. 1 to 100, 1 to 10) for a given dimension (e.g. country, department). Besides tables, in Power BI there are 3 main visuals that can be used for this purpose: the clustered bar chart, the line chart (aka line graph) and the slopegraph:

Main Display Methods

For a small list of items and dimension values, probably the best choice is a clustered bar chart (see A). If the chart is big enough, one can also display the values, as above. However, the more items in the list and the more values in the dimension, the more space is needed. One can then focus only on a subset of items from the list (e.g. by grouping several items under a category), respectively choose which dimension values to consider. Another important downside of this method is that one needs to remember the color encodings. 

This downside also applies to the next method - the use of a line chart (see B) with categorical data - however, applying labels to each line simplifies its navigation and decoding. With line charts the audience can directly see the order of the items and the local and general trends. Moreover, a line chart scales better with the number of items and dimension values.

The third option (see C), the slopegraph, looks like a line chart, though it focuses only on two dimension values (points) and categorizes each line as "down" (downward slope), "neutral" (no change) or "up" (upward slope). For this purpose, one can use parameter fields with measures. Unfortunately, the slopegraph implementation is pretty basic and the labels overlap, which makes the graph more difficult to read. Probably, with the new set of changes planned by Microsoft, the use of conditional formatting of lines would allow implementing slopegraphs with line charts, creating thus a mix between (B) and (C).
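As an illustration, here is a minimal sketch of such a categorization measure, based on the Industries query defined in the Data Preparation section below. The pair of countries is hard-coded for simplicity, whereas in the report the two compared points would come from the parameter fields:

Slope Category = 
VAR StartValue = AVERAGE ( Industries[Australia] )
VAR EndValue = AVERAGE ( Industries[Canada] )
RETURN
    SWITCH (
        TRUE (),
        EndValue > StartValue, "up",      -- upward slope
        EndValue < StartValue, "down",    -- downward slope
        "neutral"                         -- no change
    )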

This is one of the cases in which the Y-axis (see B and C) could be broken and start with the meaningful values. 

Table Based Displays

Especially when combined with color encodings (see C & G) to create heatmap-like displays, or with sparklines (see E), tables can provide an alternative way of navigating the same data. The color encodings help identify the areas of focus (low, average, or high values), while the sparklines show the trends inline. Ideally, it should be possible to combine the two displays.  
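One way to drive such heatmap-like encodings is conditional formatting based on a field value. A minimal sketch of a color-returning measure follows; the thresholds and hex colors are arbitrary choices, and the measure would be applied via the visual's cell elements/background color settings:

Heat Color = 
VAR v = AVERAGE ( Industries[Australia] )
RETURN
    SWITCH (
        TRUE (),
        ISBLANK ( v ), BLANK (),
        v < 30, "#C9E7F5",    -- low values
        v < 60, "#FFE699",    -- average values
        "#F4B183"             -- high values
    )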

Table Displays and the Aster Plot

One can vary the use of tables. For example, one can display only the deviations from one of the data series (see F), where the values for the other countries are expressed relative to AUS. In (G), with the help of visual calculations, one can also display the values' ranking. 
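A minimal sketch of both techniques as visual calculations, assuming a table visual that contains the Label and the country value fields (the names below are illustrative):

Diff CAN = [Canada] - [Australia]    -- deviation of CAN from the AUS baseline
Diff US = [U.S.] - [Australia]    -- deviation of U.S. from the AUS baseline
Rank AUS = RANK ( DENSE, ORDERBY ( [Australia], DESC ) )    -- ranking of the AUS values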

Pie Charts

Pie charts and their variations appear nowadays almost everywhere. The Aster plot is a variation of the pie chart in which the values are encoded in the height of the slices. This method was considered because the source data used above were encoded in 4 similar plots. Unfortunately, the settings available in Power BI are quite basic - it's not possible to use gradient colors or link the labels as below:

Source Data as Aster Plots

Sankey Diagram

A Sankey diagram is a data visualization method that emphasizes the flow or change from one state (the source) to another (the destination). In theory it could be used to map the items to the dimensions and encode the values in the width of the lines (see I). Unfortunately, the diagram becomes challenging to read because all the lines and most of the labels intersect. Probably this could be solved with more flexible formatting and a rework of the algorithm used for the display of the labels (e.g. align the labels for AUS to the left and the ones for CAN to the right).

Sankey Diagram

Data Preparation

A variation of the above image with the Aster plots, containing only the plots, was used in ChatGPT to generate the base data as a table via the following prompts:

  • retrieve the labels from the four charts by country and value in a table
  • consolidate the values in a matrix table by label country and value
The first step generated 4 tables, which were consolidated in a matrix table in the second step. Frankly, the data generated in the first step should have been enough because using the matrix table required an additional step in DAX.

Here is the data imported in Power BI as the Industries query:

let
    Source = #table({"Label","Australia","Canada","U.S.","Japan"}
, {
 {"Credit card","67","64","66","68"}
, {"Online retail","55","57","48","53"}
, {"Banking","58","53","57","48"}
, {"Mobile phone","62","55","44","48"}
, {"Social media","74","72","62","47"}
, {"Search engine","66","64","56","42"}
, {"Government","52","52","58","39"}
, {"Health insurance","44","48","50","36"}
, {"Media","52","50","39","23"}
, {"Retail store","44","40","33","23"}
, {"Car manufacturing","29","29","26","20"}
, {"Airline/hotel","35","37","29","16"}
, {"Branded manufacturing","36","33","25","16"}
, {"Loyalty program","45","41","32","12"}
, {"Cable","40","39","29","9"}
}
),
    #"Changed Types" = Table.TransformColumnTypes(Source,{{"Australia", Int64.Type}, {"Canada", Int64.Type}, {"U.S.", Number.Type}, {"Japan", Number.Type}})
in
    #"Changed Types"

Transforming (unpivoting) the matrix to a table with the values by country:

IndustriesT = UNION (
    SUMMARIZECOLUMNS(
     Industries[Label]
     , Industries[Australia]
     , "Country", "Australia"
    )
    , SUMMARIZECOLUMNS(
     Industries[Label]
     , Industries[Canada]
     , "Country", "Canada"
    )
    , SUMMARIZECOLUMNS(
     Industries[Label]
     , Industries[U.S.]
     , "Country", "U.S."
    )
    ,  SUMMARIZECOLUMNS(
     Industries[Label]
     , Industries[Japan]
     , "Country", "Japan"
    )
)
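UNION matches the branches by position and keeps the column names of the first table, so the value column ends up being named after the first country. As an alternative sketch (not the approach used above), SELECTCOLUMNS can be used to name the value column explicitly:

IndustriesT2 = UNION (
    SELECTCOLUMNS ( Industries, "Label", Industries[Label], "Value", Industries[Australia], "Country", "Australia" ),
    SELECTCOLUMNS ( Industries, "Label", Industries[Label], "Value", Industries[Canada], "Country", "Canada" ),
    SELECTCOLUMNS ( Industries, "Label", Industries[Label], "Value", Industries[U.S.], "Country", "U.S." ),
    SELECTCOLUMNS ( Industries, "Label", Industries[Label], "Value", Industries[Japan], "Country", "Japan" )
)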

Notes:
The slopechart from MAQ Software requires several R libraries to be installed (see how to install the R language and optionally RStudio). Run the following commands in R, then reopen Power BI Desktop and enable the running of the visual's R scripts.

install.packages("XML")
install.packages("htmlwidgets")
install.packages("ggplot2")
install.packages("plotly")

Happy (de)coding!

29 April 2024

Power BI: Working with Visual Calculations (Part II: Simple Tables with Square Numbers as Example)

Introduction

The records behind a visual can be mentally represented as a matrix, with visual calculations allowing one to tap into this structure intuitively and simplify many of the visualizations used. After a general test drive of the functionality, it makes sense to dive deeper into the topic to understand more about the limitations, the functions' behavior and what it takes to fill the gaps. This post focuses on simple tables, with a follow-up post to focus on matrices and a few other topics. 

For exemplification, it makes sense to use a simple set of small numbers that are easy to work with, and magic squares seem to match this profile. A magic square is a square matrix of positive sequential numbers in which the sums of each row, each column, and both main diagonals are the same [1]. Thus, a square of order N has N*N numbers from 1 to N*N, the smallest non-trivial case being the one of order 3. From the non-trivial squares, the one of order 5 still has a low order and hopefully provides the minimum needed for exemplification:

R    C1   C2   C3   C4   C5
R1   18   25    2    9   11
R2    4    6   13   20   22
R3   15   17   24    1    8
R4   21    3   10   12   19
R5    7   14   16   23    5

Data Modeling

One magic square should be enough to exemplify the various operations, though for testing purposes it makes sense to have a few more squares readily available. Each square has an [Id], [C1] to [C5] correspond to the matrix's columns, while [R] stores a row identifier which allows sorting the values the way they are stored in the matrix:

let
    Source = #table({"Id","C1","C2","C3","C4","C5","R"}
, {
{1,18,25,2,9,11,"R1"},
{1,4,6,13,20,22,"R2"},
{1,15,17,24,1,8,"R3"},
{1,21,3,10,12,19,"R4"},
{1,7,14,16,23,5,"R5"},
{2,1,7,13,19,25,"R1"},
{2,14,20,21,2,5,"R2"},
{2,22,3,9,15,16,"R3"},
{2,10,11,17,23,4,"R4"},
{2,18,24,5,6,12,"R5"},
{3,1,2,22,25,15,"R1"},
{3,9,10,16,11,19,"R2"},
{3,17,23,13,5,7,"R3"},
{3,24,12,6,20,3,"R4"},
{3,14,18,8,4,21,"R5"},
{4,22,6,3,18,16,"R1"},
{4,4,14,11,15,21,"R2"},
{4,5,8,12,23,17,"R3"},
{4,25,13,19,7,1,"R4"},
{4,9,24,20,2,10,"R5"},
{5,5,9,20,25,6,"R1"},
{5,13,15,2,11,24,"R2"},
{5,21,1,23,3,17,"R3"},
{5,19,18,4,14,10,"R4"},
{5,7,22,16,12,8,"R5"}
}
),
    #"Changed Type to Number" = Table.TransformColumnTypes(Source,{{"C1", Int64.Type}, {"C2", Int64.Type}, {"C3", Int64.Type}, {"C4", Int64.Type}, {"C5", Int64.Type}}),
    #"Sorted Rows" = Table.Sort(#"Changed Type to Number",{{"Id", Order.Ascending}, {"R", Order.Ascending}}),
    #"Added Index" = Table.AddIndexColumn(#"Sorted Rows", "Index", 0, 1, Int64.Type)
in
    #"Added Index"

The column names and the row identifiers could have been numeric values from 1 to 5, though they could have been confused with the actual numeric values.

In addition, the columns [C1] to [C5] were formatted as integers and an index was added after sorting the values by [Id] and [R]. Copy the above code into a Blank Query in Power BI and change the query's name to Magic5. 

Prerequisites

For the next steps you'll need to enable visual calculations in Power BI Desktop via:
File >> Options and settings >> Options >> Preview features >> Visual calculations >> (check)

Into a Table visual drag and drop [R], [C1] to [C5] as columns and make sure that the records are sorted ascending by [R]. To select only one square, add a filter based on the [Id] and select the first square. Use copies of this visual for the further tests. 

Some basic notions of Algebra are recommended but not a must. If you worked with formulas in Excel, then you are set to go. 

In Mathematics a matrix is read starting from the top left corner, moving first along the rows (e.g. 18, 25, 2, ...) and then along the columns. With a few exceptions, in which the reference is based on the latest value from a series (see Exchange rates), this is the direction that will be followed. 

Basic Operations

Same as in Excel, [C1] + [C2] creates a third column in the matrix that stores the sum of the two. The sum can be further applied to all the columns:

Sum(C) = [C1] + [C2] + [C3] + [C4] + [C5] -- sum of all columns (should amount to 65)

The column can be called "Sum", "Sum(C)" or any other allowed unique name, though the names should be meaningful, useful, and succinct, when possible.

Similarly, one can work with constants, linear or nonlinear transformations (each formula is a distinct calculation):

constant = 1 -- constant value
linear = 2*[C1] + 1 -- linear transformation: 2*x+1
linear2 = 2*[C1] + [constant] -- linear transformation: 2*x+1
quadratic = Power([C1],2) + 2*[C1] + 1 -- quadratic transformation: x^2+2*x+1
quadratic2 = Power([C1],2) + [linear] -- quadratic transformation: x^2+2*x+1
Output:
R    C1  constant  linear  linear2  quadratic  quadratic2
R1   18         1      37       37        361         361
R2    4         1       9        9         25          25
R3   15         1      31       31        256         256
R4   21         1      43       43        484         484
R5    7         1      15       15         64          64
Please note that the outputs were reproduced in Excel (instead of taking screenshots).

Similarly, one can build any type of formula based on one or more columns.
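For instance, a few such calculations (the names and weights below are arbitrary):

Ratio = DIVIDE ( [C1], [C2] )    -- ratio of two columns (division safe against zero)
Weighted = 0.6 * [C1] + 0.4 * [C2]    -- weighted combination of two columns
Spread = ABS ( [C1] - [C3] )    -- absolute difference between two columns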

With a simple trick, one can use DAX functions like SUMX, PRODUCTX, MINX or MAXX as well:

Sum2(C) = SUMX({[C1], [C2], [C3], [C4], [C5]}, [Value]) -- sum of all columns
Prod(C) = PRODUCTX({[C1], [C2], [C3], [C4], [C5]}, [Value]) -- product of all columns
Avg(C) = AVERAGEX({[C1], [C2], [C3], [C4], [C5]}, [Value]) -- average of all columns
Min(C) = MINX({[C1], [C2], [C3], [C4], [C5]}, [Value]) -- minimum value of all columns
Max(C) = MAXX({[C1], [C2], [C3], [C4], [C5]}, [Value]) -- maximum value of all columns
Count(C) = COUNTX({[C1], [C2], [C3], [C4], [C5]},[Value]) -- counts the number of columns
Output:
C1  C2  C3  C4  C5  Sum(C)  Avg(C)  Prod(C)  Min(C)  Max(C)  Count(C)
18  25   2   9  11      65      13    89100       2      25         5
 4   6  13  20  22      65      13   137280       4      22         5
15  17  24   1   8      65      13    48960       1      24         5
21   3  10  12  19      65      13   143640       3      21         5
 7  14  16  23   5      65      13   180320       5      23         5

Unfortunately, currently there seems to be no way available for applying such calculations without referencing the individual columns. 

Working across Rows

ROWNUMBER and RANK allow ranking a cell within a column, in the first case independently of its value and in the second based on it:

Ranking = ROWNUMBER() -- returns the rank in the column (independently of the value)
RankA(C) = RANK(DENSE, ORDERBY([C1], ASC)) -- ranking of the value (ascending) 
RankD(C) = RANK(DENSE, ORDERBY([C1], DESC)) -- ranking of the value (descending) 
Output:
R    C1  Ranking  RankA(C)  RankD(C)
R1   18        1         4         2
R2    4        2         1         5
R3   15        3         3         3
R4   21        4         5         1
R5    7        5         2         4

PREVIOUS, NEXT, LAST and FIRST allow referring to the values of other cells within the same column:

Prev(C) = PREVIOUS([C1]) -- previous cell
Next(C) = NEXT([C1])  -- next cell
First(C) = FIRST([C1]) -- first cell
Last(C) = LAST([C1]) -- last cell
Output:
R    C1  Prev(C)  Next(C)  First(C)  Last(C)
R1   18                 4        18        7
R2    4       18       15        18        7
R3   15        4       21        18        7
R4   21       15        7        18        7
R5    7       21                 18        7

OFFSET is a generalization of these functions:

offset(2) = CALCULATE([C1], OFFSET(2)) -- value two rows ahead
offset(-2) = CALCULATE([C1], OFFSET(-2)) -- value two rows behind
Ind = ROWNUMBER() -- index
inverse = CALCULATE([C1], OFFSET(6-2*[Ind])) -- reversing the values based on the index
Output:
R    C1  offset(2)  offset(-2)  Ind  inverse
R1   18         15                1        7
R2    4         21                2       21
R3   15          7          18    3       15
R4   21                      4    4        4
R5    7                     15    5       18

The same functions allow calculating the differences between consecutive values, respectively to the first or last value:

DiffToPrev(C) = [C1] - PREVIOUS([C1]) -- difference to previous 
DiffToNext(C) = [C1] - NEXT([C1]) -- difference to next 
DiffToFirst(C) = [C1] - FIRST([C1]) -- difference to first
DiffToLast(C) = [C1] - LAST([C1]) -- difference to last
Output:
R    C1  DiffToPrev(C)  DiffToNext(C)  DiffToFirst(C)  DiffToLast(C)
R1   18             18             14               0             11
R2    4            -14            -11             -14             -3
R3   15             11             -6              -3              8
R4   21              6             14               3             14
R5    7            -14              7             -11              0
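On the same basis the change can be expressed in relative terms, e.g. as a fraction of the previous value (a small extension of the above, not shown in the output):

PctToPrev(C) = DIVIDE ( [C1] - PREVIOUS ( [C1] ), PREVIOUS ( [C1] ) )    -- relative change vs. the previous row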

DAX makes available several functions for working across the rows of the same column. Two of the useful functions are RUNNINGSUM and MOVINGAVERAGE:

Run Sum(C) = RUNNINGSUM([C1]) -- running sum
Moving Avg3(C) = MOVINGAVERAGE([C1], 3) -- moving average for the past 3 values
Moving Avg2(C) = MOVINGAVERAGE([C1], 2) -- moving average for the past 2 values

Unfortunately, with the functions that don't support the ORDERBY parameter one can use only the default sorting of the table. Therefore, when the table needs to be sorted descending and the RUNNINGSUM calculated ascending, for the moment there's no way to achieve this behavior. However, it appears that Microsoft is planning to implement a solution for this issue.

RUNNINGSUM together with ROWNUMBER can be used to calculate a running average:

Run Avg(C) = DIVIDE(RUNNINGSUM([C1]), ROWNUMBER()) -- running average
Output:
R    C1  Run Sum(C)  Moving Avg3(C)  Moving Avg2(C)  Run Avg(C)
R1   18          18              18              18          18
R2    4          22              11              11          11
R3   15          37           12.33             9.5       12.33
R4   21          58           13.33              18        14.5
R5    7          65           14.33              14          13

With a mathematical trick that allows transforming a product into a sum of elements by applying the Exp (exponential) and Log (logarithm) functions (see the solution in SQL), one can run the PRODUCT across rows, though the values must be small enough to allow their multiplication without running into issues:

Ln(C) = IFERROR(LN([C1]), BLANK()) -- applying the natural logarithm
Sum(Ln(C)) = RUNNINGSUM([Ln(C)]) -- running sum of the logarithms
Run Prod(C) = IF(NOT(ISBLANK([Sum(Ln(C))])), EXP([Sum(Ln(C))])) -- product across rows
Output:
R    C1  Ln(C)  Sum(Ln(C))  Run Prod(C)
R1   18   2.89        2.89           18
R2    4   1.39        4.28           72
R3   15   2.71        6.98         1080
R4   21   3.04       10.03        22680
R5    7   1.95       11.98       158760

These three calculations could be brought into a single formula, though the result could be more difficult to troubleshoot. The test via IsBlank is necessary because otherwise the exponential for the total raises an error. 

Considering that when traversing a column it's enough to remember the previous value, one can build MIN and MAX functionality across a column: 

Run Min = IF(OR(PREVIOUS([C1]) > [C1], ISBLANK(PREVIOUS([C1]))), [C1], PREVIOUS([C1])) -- minimum of the current and previous cell
Run Max = IF(OR(PREVIOUS([C1]) < [C1], ISBLANK(PREVIOUS([C1]))), [C1], PREVIOUS([C1])) -- maximum of the current and previous cell
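As written, these two formulas compare only the current cell with the one immediately before it, so over longer runs the overall minimum or maximum can be lost (e.g. for 18, 4, 15, 21 the running minimum should remain 4). If the DAX window functions can be used in visual calculations the same way as in regular DAX - an assumption, not something confirmed in this post - a running minimum/maximum over all rows up to the current one could be sketched as:

Run Min2 = MINX ( WINDOW ( 1, ABS, 0, REL ), [C1] )    -- minimum from the first row up to the current one
Run Max2 = MAXX ( WINDOW ( 1, ABS, 0, REL ), [C1] )    -- maximum from the first row up to the current one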

Happy coding!

Previous Post <<||>> Next Post

References:
[1] Wikipedia (2024) Magic Squares (online)
[2] Microsoft Learn (2024) Power BI: Using visual calculations [preview] (link)

25 April 2024

Graphical Representation: Graphics We Live By (Part III: Exchange Rates in Power BI)

Graphical Representation Series

An exchange rate (XR) is the rate at which one currency will be exchanged for another currency, and thus XRs are used in everything related to trade, with several processes in Finance relying on them. There are various sources for XRs, like the European Central Bank (ECB), that provide the raw data as well as various analyses, including graphical representations of varying complexity. Conversely, processing XRs offers some opportunities for learning techniques for data visualization. 

The ECB provides monthly, yearly, daily and biannual XRs from EUR to the various currencies, which by triangulation allow creating XRs between any of the currencies involved. If N currencies are involved for one time unit in the process (i.e. N-1 XRs), the triangulation generates NxN values for a single time division, the result being tedious to navigate. A matrix like the one below facilitates identifying the value between any two of the currencies:


The table then needs to be multiplied by 12, the number of months, respectively by the number of years, with filters allowing to navigate the data as needed. For many operations it's enough to look up the XR for a given time division. There are however operations in which one needs a deeper understanding of one or more XRs' evolution over time (e.g. GBP to NOK). 
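The GBP to NOK rate, for example, is not published directly but can be derived by triangulation through EUR (the EUR-to-NOK rate divided by the EUR-to-GBP rate). A minimal sketch of such a measure, assuming a hypothetical table 'XRates' that stores the EUR-based rates with a [Currency] and a [Rate] column:

GBP to NOK = 
DIVIDE (
    CALCULATE ( AVERAGE ( 'XRates'[Rate] ), 'XRates'[Currency] = "NOK" ),
    CALCULATE ( AVERAGE ( 'XRates'[Rate] ), 'XRates'[Currency] = "GBP" )
)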

Moreover, for some operations it is enough to work with two decimals, while for others one needs to use up to 6 or even more decimals for each XR. Occasionally, one can compromise and use 3 decimals, which should be enough for most of the scenarios. Making sense of such numbers is not easy for most of us, especially when one needs to compare values at first sight across multiple columns. Summary tables can help:

Statistics like Min. (minimum), Max. (maximum), Max. - Min. (range), Avg. (average) or even StdDev. (standard deviation) can provide a basis for further analysis, while sparklines are ideal for showing trends over a time interval (e.g. months).
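Such statistics translate into simple measures; a sketch based on the same hypothetical 'XRates' table introduced above (the names are illustrative):

XR Min = MIN ( 'XRates'[Rate] )
XR Max = MAX ( 'XRates'[Rate] )
XR Range = [XR Max] - [XR Min]
XR Avg = AVERAGE ( 'XRates'[Rate] )
XR StdDev = STDEV.P ( 'XRates'[Rate] )    -- standard deviation for population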

Usually, a heatmap helps to some degree to navigate the data, especially when there's a plot associated with it:

In this case filtering by column in the heatmap allows to see how an XR changed for the same month over the years, while the trendline allows to identify the overall tendency (which is sensitive to the number of years considered). Showing tendencies or patterns for the same month over several years complements the yearly perspective shown via sparklines.

Fortunately, there are techniques to reduce the representational complexity of such numbers. For example, one can use as basis the XRs for January (see Base Jan) and represent the other XRs only as differences from that XR. Thus, in the table below, for February is shown the XR difference between February and January (13.32-13.22=0.10). The column for January is zero and could be omitted, though it can still be useful in further calculations based on the respective data (e.g. in the calculation of averages).
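A minimal visual-calculation sketch of this "difference to base" technique, assuming the months are on the visual's axis sorted ascending (so that January is the first point) and that [XR] stands for the exchange-rate field in the visual:

Diff to Base = [XR] - FIRST ( [XR] )    -- difference to the first point on the axis (January)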

This technique works when the variations are relatively small (e.g. the values vary around 0). The above plots show the respective differences for the whole year, respectively only for four months. Given a longer sequence (e.g. 24, 28 months) one can attempt to use the same technique, though there's a point beyond which it becomes difficult to make sense of the results. One can also use the year-end XR or even the yearly average as basis, though this adds unnecessary complexity to the calculations when the values for the whole year aren't yet available. 

Usually, it's recommended to show only 3-5 series in a plot, as one can better distinguish the trends. However, plotting all series allows grasping the overall pattern, if any. Thus, in the first plot it is not important to identify the individual series but to see their tendencies. The two perspectives can be combined into one plot obtained by applying different filtering. 

Of course, a similar perspective can be obtained by looking at the whole XRs:

The Max.-Min. and StdDev (standard deviation for population) between the last and previous tables must match. 

Certain operations require comparing the trends of two currencies. The first plot shows the evolution of NOK and SEK with respect to EUR, while the second shows only the differences between the two XRs:


The first plot will show different values when computed against another currency (e.g. USD); however, the second plot will look similar, even if the points deviate slightly:

Another important difference is the one between monthly and yearly XRs, difference depicted by the below plot:

The value differences between the two XR types can have considerable impact on reporting. Therefore, one must reflect in analyses the rate type used in the actual process. 

Attempting to project data into the future can require complex techniques; however, sometimes it is enough to highlight a probable area, which depends also on the confidence interval (e.g. 85%) and the forecast length (e.g. 10 months):

Every perspective into the data tends to provide something new that helps in sense-making. For some users the first table with flexible filtering (e.g. time unit, currency type, currency from/to) is enough, while for others multiple perspectives are needed. When possible, one should allow users to explore the various perspectives and use the feedback to remove or add perspectives. Including a feedback loop in graphical representation is important not only for tailoring the visuals to users' needs but also for managing their expectations, respectively for learning what works and what doesn't.

Comments:
1) I used GBP to NOK XRs to provide an example based on triangulation.
2) Some experts advise against using borders or grid lines. Borders, as the name indicates, allow delimiting various areas, while grid lines allow making comparisons within a section without needing to sweep across broader areas, adding thus precision to our sense-making. Choosing grey as the color for the background elements minimizes the overhead of coping with more information while allowing better use of the available space.
3) Trend lines are recommended where the number of points is relatively small and only one series is involved, though, as always, there are exceptions too. 
4) In heatmaps one can use a gradient between two colors to show the tendency of moving toward one extreme or the other. One should avoid colors like red or green.
5) Ideally, a color should be used for only one encoding (e.g. one color for the same month across all graphics), though the more elements need to be encoded, the more difficult it becomes to respect this rule. The above graphics might slightly deviate from this as the purpose is to show a representation technique. 
6) In some graphics the XRs are displayed only with two decimals because currently the technique used (visual calculations) doesn't support formatting.
7) All the above graphical elements are based on a Power BI solution. Unfortunately, the tool has its representational limitations, especially when one wants to add additional information into the plots. 
8) Unfortunately, the daily XR values are not easily available from the same source. There are special scenarios for which a daily, hourly or even minute-based analysis is needed.
9) It's a good idea to validate the results against the similar results available on the web (see the ECB website).

Previous Post <<||>> Next Post

19 April 2024

Power BI: Preparatory Steps for Creating a Power BI Report

When creating a Power BI report, consider the following steps before starting the actual work. The first five steps can be saved to a "template" that can be reused as a starting point for each report.

Step 0: Check Power BI Desktop's version

Check whether you have the latest version, otherwise you can download it from the Microsoft website.
Given that most of the documentation, books and other resources are in English, it might be a good idea to install the English version.

Step 1: Enable the recommended options

File >> Options and settings >> Options >> Global >> Data Load:
.>> Time intelligence >> Auto date/time for new files >> (uncheck)
.>> Regional settings >> Application language >> set to English (United States)
.>> Regional settings >> Model language >> set to English (United States)

Depending on the case, you can also consider the following options (e.g. when the relationships are more complex than the feature can handle):
File >> Options and settings >> Options >> Current >> Data load:
.>> Relationship >> Import relationships from data sources on first load >> (uncheck)
.>> Relationship >> Autodetect new relationships after data is loaded >> (uncheck)

Step 2: Enable the options needed by the report

For example, you can enable visual calculations:
File >> Options and settings >> Options >> Preview features >> Visual calculations >> (check)

Comment:
Given that not all preview features are stable enough, instead of activating several features at once, it might be a good idea to do it individually and test first whether they work as expected. 

Step 3: Add a table for managing the measures

Add a new table (e.g. "Results") with a single placeholder column (e.g. "dummy" holding the value "OK"):

Results = ROW("dummy", "OK")

Add a dummy measure that can be deleted later, once there's at least one other measure:
Test = ""

Hide the "dummy" column and with this the table is moved to the top. The measures can be further organized within folders for easier maintenance. 

Step 4: Add the Calendar if time analysis is needed

Add a new table (e.g. "Calendar" with a "Date" column):

Calendar = Calendar(Date(Year(Today()-3*365),1,1),Date(Year(Today()+1*365),12,31))

Add the columns:

Year = Year('Calendar'[Date])
YearQuarter = 'Calendar'[Year] & "-Q" & 'Calendar'[Quarter]
Quarter = Quarter('Calendar'[Date])
QuarterName = "Q" & Quarter('Calendar'[Date])
Month = Month('Calendar'[Date])
MonthName = FORMAT('Calendar'[Date], "mmm")

Even if errors appear (as the columns aren't listed in the order of their dependencies), create all the columns first. Format the Date column in a standard format (e.g. dd-mmm-yy), including for Date/Time values for which the Time part is not needed.

To get the values in the visual sorted by the MonthName:
Table view >> (select MonthName) >> Column tools >> Sort by column >> (select Month)

To get the values in the visual sorted by the QuarterName:
Table view >> (select QuarterName) >> Column tools >> Sort by column >> (select Quarter)

With these changes the filter could look like this:


Step 5: Add the corporate/personal theme

Consider using a corporate/personal theme at this stage. Without this the volume of work that needs to be done later can increase considerably. 

There are also theme generators, e.g. see powerbitips.com, a tool that simplifies the process of creating complex theme files. The tool is free; however, users can save their theme files via a subscription service.

Set canvas settings (e.g. 1080 x 1920 pixels).

Step 6: Get the data

Consider the appropriate connectors for getting the data into the report. 

Step 7: Set/Validate the relationships

Check whether the relationships between tables set by default are correct, respectively set the relationships accordingly.

Step 8: Optimize the data model

Look for ways to optimize the data model.

Step 9: Apply the formatting

Format numeric values to reflect their precision accordingly.
Format the dates in a standard format (e.g. "dd-mmm-yy"), including for Date/Time fields for which the Time part is not needed.

The formatting needs to be considered for the fields, measures and metrics added later as well. 

Step 10: Define the filters

Identify the filters that will most likely be used in pages and use Sync slicers to synchronize the filters between pages, when appropriate:
View >> Sync slicers >> (select Page name) >> (check Sync) >> (check Visible)

Step 11: Add the visuals

At least for the report's validation, consider using a visual that holds the detail data represented in the other visuals on the page. Besides the fact that it allows users to validate the report, it also provides transparency, which facilitates the report's adoption. 

10 March 2024

Power BI: Dataflows Gen 1 (Notes)

Disclaimer: This is work in progress intended to consolidate information from various sources. 

Last updated: 10-Mar-2024

Dataflows Architecture in Power BI
Dataflows Architecture [3]

Dataflow (Gen1)

  • a type of cloud-based ETL tool for building and executing scalable data transformation processes [1]
  • a collection of tables created and managed in workspaces in the Power BI service [4]
  • acts as building blocks on top of one another [4]
  • includes all of the transformations to reduce data prep time and then can be loaded into a new table, included in a Data Pipeline, or used as a data source by data analysts [1]
  • {benefit} promote reusability of underlying data elements
    • prevent the need to create separate connections with your cloud or on-premises data sources.
  • supports a wide range of cloud and on-premises sources [3]
  • {operation} refreshing a dataflow 
    • is required before it can be consumed in a semantic model in Power BI Desktop, or referenced as a linked or computed table [4]
    • can be refreshed at the same frequency as a semantic model [4]
    • {concept} incremental refresh
    • [Premium capacity] can be set to refresh incrementally
      • adds parameters to the dataflow to specify the date range
      • {contraindication} linked tables shouldn't use incremental refresh if they reference a dataflow
        • ⇐ dataflows don't support query folding (even if the table is DirectQuery enabled).
      • {contraindication} semantic models referencing dataflows shouldn't use incremental refresh
        • ⇐ refreshes to dataflows are generally performant, so incremental refreshes shouldn't be necessary
        • if refreshes take too long, consider using the compute engine, or DirectQuery mode
  • {operation} deleting a dataflow
    • if a workspace that contains dataflows is deleted, all its dataflows are also deleted [4]
      • even if recovery of the workspace is possible, one cannot recover the deleted dataflows
        • ⇐ either directly or through support from Microsoft [4]
  • {operation} consuming a dataflow
    • create a linked table from the dataflow
      • allows another dataflow author to use the data [4]
    • create a semantic model from the dataflow
      • allows a user to utilize the data to create reports [4]
    • create a connection from external tools that can read from the CDM format [4]
  • {feature} [premium] Enhanced compute engine.
    • enables premium subscribers to use their capacity to optimize the use of dataflows
    • {advantage} reduces the refresh time required for long-running ETL steps over computed entities, such as performing joins, distinct, filters, and group by [7]
    • {advantage} performs DirectQuery queries over entities [7]
    • individually set for each dataflow
    • {configuration} disabled
    • {configuration|default} optimized
      • automatically turned on when a table in the dataflow is referenced by another table or when the dataflow is connected to another dataflow in the same workspace.
    • {configuration} On
    • {limitation} works only for A3 or larger Power BI capacities [7]
  • {feature} [premium] DirectQuery
    • allows to use DirectQuery to connect directly to dataflows without having to import its data [7]
  • {advantage} avoid separate refresh schedules 
    • removes the need to create an imported semantic model [7]
  • {advantage} filtering data 
    • allows to filter dataflow data and work with the filtered subset [7]
    • {limitation} composite/mixed models that have import and DirectQuery data sources are currently not supported [7]
    • {limitation} large dataflows might have trouble with timeout issues when viewing visualizations [7]
      • {workaround} use Import mode [7]
    • {limitation} under data source settings, the dataflow connector will show invalid credentials [7]
      • the warning doesn't affect the behavior, and the semantic model will work properly [7]
  • {feature} [premium] Computed entities
    • allows to perform calculations on your existing dataflows, and return results [7]
    • enable to focus on report creation and analytics [7]
    • {limitation} work properly only when the entities reside in the same storage account [7]
  • {feature} [premium] Linked Entities
    • allows to reference existing dataflows
    • one can perform calculations on these entities using computed entities [7]
    • allows to create a "single source of the truth" table that can be reused within multiple dataflows [7]
    • {limitation} work properly only when the entities reside in the same storage account [7]
  • {feature} [premium] Incremental refresh
    • adds parameters to the dataflow to specify the date range [7]
  • {concept} table
    • represents the data output of a query created in a dataflow, after the dataflow has been refreshed
    • represents data from a source and, optionally, the transformations that were applied to it
  • {concept} computed tables
    • similar to other tables 
      • one get data from a source and one can apply further transformations to create them
    • their data originates from the storage dataflow used, and not the original data source [6]
      • ⇐ they were previously created by a dataflow and then reused [6]
    • created by referencing a table in the same dataflow or in a different dataflow [6]
  • {concept} [Power Query] custom function
    • a mapping from a set of input values to a single output value [5]
  • {scenario} create reusable transformation logic that can be shared by many semantic models and reports inside Power BI [3]
  • {scenario} persist data in ADL Gen 2 storage, enabling you to expose it to other Azure services outside Power BI [3]
  • {scenario} create a single source of truth
    • encourages uptake by removing analysts' access to underlying data sources [3]
  • {scenario} strengthen security around underlying data sources by exposing data to report creators in dataflows
    • allows to limit access to underlying data sources, reducing the load on source systems [3]
    • gives administrators finer control over data refresh operations [3]
  • {scenario} perform ETL at scale, 
    • dataflows with Power BI Premium scales more efficiently and gives you more flexibility [3]
  • {best practice} choosing the best connector for the task provides the best experience and performance [5]
  • {best practice} filter data in the early stages of the query
    • some connectors can take advantage of filters through query folding [5]
  • {best practice} do expensive operations last
    • helps minimize the amount of time spent waiting for the preview to render each time a new step is added to the query [5]
  • {best practice} temporarily work against a subset of your data
    • if adding new steps to the query is slow, consider using "Keep First Rows" operation and limiting the number of rows you're working against [5]
  • {best practice} use the correct data types
    • some features are contextual to the data type [5]
  • {best practice} explore the data
  • {best practice} document queries by renaming or adding a description to steps, queries, or groups [5]
  • {best practice} take a modular approach
    • split queries that contain a large number of steps into multiple queries
    • {goal} simplify and decouple transformation phases into smaller pieces to make them easier to understand [5]
  • {best practice} future-proof queries 
    • make queries resilient to changes and able to refresh even when some components of data source change [5]
  • {best practice} create queries that are dynamic and flexible via parameters [5]
    • parameters serve as a way to easily store and manage a value that can be reused in many different ways [5]
  • {best practice} create reusable functions
    • can be created from existing queries and parameters [5]
Previous Post <<||>> Next Post

Acronyms:
CDM - Common Data Model
ETL - Extract, Transform, Load

References:
[1] Microsoft Learn: Fabric (2023) Ingest data with Microsoft Fabric (link)
[2] Microsoft Learn: Fabric (2023) Dataflow Gen2 pricing for Data Factory in Microsoft Fabric (link)
[3] Microsoft Learn: Fabric (2023) Introduction to dataflows and self-service data prep (link)
[4] Microsoft Learn: Fabric (2023) Configure and consume a dataflow (link)
[5] Microsoft Learn: Fabric (2023) Dataflows best practices* (link)
[6] Microsoft Learn: Fabric (2023) Computed table scenarios and use cases (link)
[7] Microsoft Learn: Power BI (2024) Premium features of dataflows (link)


31 January 2024

Microsoft Fabric: Parquet Format (Notes)

Disclaimer: This is work in progress intended to consolidate information from various sources. 

Last updated: 31-Jan-2024

Parquet format

  • {definition} open source, column-oriented data file format designed for efficient data storage and retrieval [1]
    • provides efficient data compression and encoding schemes with enhanced performance to handle complex data in bulk [1]
    • designed to be a common interchange format for both batch and interactive workloads [1]
  • {characteristic} open source file format
    • similar to other columnar-storage file formats available in Hadoop [1]
      • e.g. RCFile, ORC
    • became an industry standard 
      •  {benefit} provides interoperability across multiple tools
  • {characteristic} language agnostic [1]
    • different programming languages can be used to manipulate the data
  • {characteristic} column-based format [1]
    • files are organized by column
      • ⇐ rather than by row
      • ⇒ saves storage space and speeds up analytics queries [1]
    •  reads only the needed columns 
      • ⇐ non-relevant data are skipped
      • ⇒ greatly minimizes the IO [1]
        • aggregation queries are less time-consuming compared to row-oriented databases [1]
    • {benefit} increased data throughput and performance [1]
      • ⇒ recommended for analytical workloads
  • {characteristic} highly efficient data compression/decompression [1]
    • supports flexible compression options and efficient encoding schemes [1]
      • data can be compressed by using one of the several codecs available [1]
        • ⇒ different data files can be compressed differently [1]
    •  reduced storage requirements [1]
      • by at least one-third on large datasets
      • ⇒ {benefit} saves on cloud storage space
    •  greatly improves scan and deserialization time [1]
      • ⇒ {benefit} reduces the processing costs
    • {downside} can be slower to write than row-based file formats
      • primarily because they contain metadata about the file contents 
      • though have fast read times
  • {characteristic} supports complex data types and advanced nested data structures [1]
    • implemented using the record-shredding and assembly algorithm
      • accommodates complex data structures that can be used to store the data [1]
      • optimized to work with complex data in bulk and features different ways for efficient data compression and encoding types [1]
        • the approach is best especially for those queries that need to read certain columns from a large table [1]
  • {characteristic} cloud-ready
    • works best with interactive and serverless technologies [1]
  • {characteristic} immutable
    • a file can't be updated to modify the column name, reorder or drop columns [2]
      • ⇐ requires rewriting the whole file [2]
  • {characteristic} binary-based file
    • ⇒ not easily readable (by humans)
  • {characteristic} self-describing 
    •  contains metadata about schema and structure
    • {concept} row groups (aka segments) 
      • contains data from the same columns
        • {constraint} column names are case sensitive
    • {concept} file footer 
      • stores metadata statistics for each row group [2]
        • min/max statistics 
        • the number of rows
        • can be leveraged by data processing engines to run queries more efficiently [2]
          • ⇐ depending on the query, entire row group can be skipped [2]
    • {concept} file header
  •  large datasets can be split across multiple parquet files
    • ⇐ the structure can be flat or hierarchical 
    • managing multiple files has several challenges
    • the files can be used to define a table (aka parquet table)
      • ⇐ {constraint} the files must have the same definition
        • ⇐ schema enforcement must be coded manually [2]
      • {limitation} [Data Lake] no support for ACID transactions [2]
        • ⇒ easy to corrupt [2]
          • partially written files will break any subsequent read operations
            • the compute engine will try to read in the corrupt files and error out [2]
            • corrupted files must be manually identified and deleted to fix the issues [2]
      • {limitation} it's not easy to delete rows from it [2]
        • requires reading all the data, filtering out the data not needed, and then rewriting the entire table [2]
      • {limitation} doesn't support DML transactions [2]
      • {limitation} there is no change data feed [2]
      • {limitation} slow file listing [2]
        • small files require excessive I/O overhead
          • ideally the files should be between 64 MB and 1 GB
          • ideally the files should be compacted into larger files (aka small file compaction, bin-packing)
      • {limitation} expensive footer reads to gather statistics for file skipping [2]
        • fetching all the footers and building the file-level metadata for the entire table is slow [2]
          • ⇐ it requires a file-listing operation [2]
        • the effectiveness of data skipping depends on how many files can be can skipped when performing a query [2]
      • {limitation} doesn't support schema enforcement [2]
      • {limitation} doesn't support check constraints [2]
      • {limitation} doesn't support data versioning [2]
    • {concept} table partitioning
      • {definition} common optimization approach used to store the data of the same table in different directories, with partitioning column values encoded in the path of each partition directory [6]
      • {recommendation} avoid partitioning by columns with very high cardinality
    • {concept} bin-packing (aka compaction, bin-compaction)
      • aims to produce evenly-balanced data files with respect to their size on disk, 
        • ⇐ but not necessarily in respect to the number of tuples per file [7]
      • requires an algorithm that efficiently organizes the files into equal size containers [6]
      • {characteristic} idempotent
        •  if it is run twice on the same dataset, the second run has no effect [7]
  • {feature} [Microsoft Fabric] V-order
    • {definition} write time optimization to the parquet file format that enables lightning-fast reads under the MF compute engines [3]
    • applies special sorting, row group distribution, dictionary encoding and compression on parquet files [3]
      • requires fewer compute engine resources to read it [3]
        • provides further cost efficiency and performance
          • has a 15% impact on average write times but provides up to 50% more compression [3]
    • {characteristic} open-source parquet format compliant
      • all parquet engines can read it as a regular parquet file [3]
      • ⇐ table properties and optimization commands can be used to control V-Order on its partitions [3]
      • compatible with other features [3]
    • applied at parquet file level [3]
    • enabled by default
  • {command} OPTIMIZE
    • merges all changes into bigger, consolidated parquet files (aka bin-compaction) [3]
    • [Spark] dynamically optimizes partitions while generating files with a default 128 MB size [5]
      • the target file size may be changed per workload requirements using configurations [5]
    • properly designing the table physical structure based on the ingestion frequency and expected read patterns is likely more important than running the optimization command [3]
    • running the compaction operation brings the data lake into an unusable state for readers [7]
    • {warning} manually compacting the files is inefficient and error prone [7]
      • no way to differentiate files that contain new data from files that contain existing data that was just compacted into new files [7]
  • [Delta Lake] when ZORDER and VORDER are used together, Apache Spark performs bin-compaction, ZORDER, VORDER sequentially [3]
Previous Post <<||>> Next Post

Acronyms:
ACID - atomicity, consistency, isolation, durability
IO - Input/Output
MF - Microsoft Fabric
ORC - Optimized Row Columnar
RCFile - Record Columnar File

Resources:
[1] Databricks (2023) What is Parquet? (link)
[2] Data Lake (2023) Delta Lake vs. Parquet Comparison (link)
[3] Data Mozart (2023) Parquet file format – everything you need to know! (link)
[4] Microsoft Learn (2023) Query Parquet files using serverless SQL pool in Azure Synapse Analytics (link)
[5] Microsoft Learn (2023) Lakehouse tutorial: Prepare and transform data in the lakehouse (link)
[6] Apache Spark (2023) Spark SQL Guide (link)
[7] Delta Lake (2023) Delta Lake Small File Compaction with OPTIMIZE (link)
[8] Delta Lake (2023) Optimizations (link)

