Showing posts with label tables. Show all posts
Showing posts with label tables. Show all posts

03 May 2025

📊Graphical Representation: Graphics We Live By (Part XI: Comparisons Between Data Series)

Graphical Representation Series
Graphical Representation Series

Over the past 10-20 years it became so easy to create data visualizations just by dropping some of the data available into a tool like Excel and providing a visual depiction of it with just a few clicks. In many cases, the first draft, typically provided by default in the tool used, doesn't even need further work as the objective was reached, while in others the creator must have a minimum skillset for making the visualization useful, appealing, or whatever quality is a final requirement for the work in scope. However, the audience might judge the visualization(s) from different perspectives, and there can be a broad audience with different skills in reading, evaluating and understanding the work.

There are many depictions on the web resembling the one below, taken from a LinkedIn post:

Example Chart - Boing vs. Airbus

Even if the visualization is not perfect, it does a fair job in representing the data. Improvements can be made in the areas of labels, the title and positioning of elements, and the color palette used. At least these were the improvements made in the original post. It must be differentiated also between the environment in which the charts are made available, the print format having different characteristics than the ones in business setups. Unfortunately, the requirements of the two are widely confused, probably also because of the overlapping of the mediums used. 

Probably, it's a good idea to always start with the row data (or summaries of it) when the result consists of only a few data points that can be easily displayed in a table like the one below (the feature to round the decimals for integer values should be available soon in Power BI):

Summary Table

Of course, one can calculate more meaningful values like percentages from the total, standard deviations and other values that offer more perspectives into the data. Even if the values adequately reflect the reality, the reader can but wonder about the local and global minimal/maximal values, without talking much about the meaning of data points, which is easily identifiable in a chart. At least in the case of small data sets, using a table in combination with a chart can provide a more complete perspective and different ways of analyzing the data, especially when the navigation is interactive. 

Column and bar charts do a fair job in comparing values over time, though they do use a lot of ink in the process (see D). While they make it easy to compare neighboring values, the rectangles used tend to occupy a lot of space when they are made too wide or too high to cover the empty space within the display (e.g. when just a few values are displayed, space being wasted in the process). As the main downside, it takes a lot of scanning until the reader identifies the overall trends, and the further away the bars are from each other, the more difficult it becomes to do comparisons. 

In theory, line charts are more efficient in representing the above data points, because the marks are usually small and the line thin enough to provide a better data-ink ratio, while one can see a lot at a glance. In Power BI the creator can use different types of interpolation: linear (A), step (B) or smooth (C). In many cases, it might be a good idea to use a linear interpolation, though when there are no or minimal overlapping, it might be worthwhile to explore the other types if interpolation too (and further request feedback from the users):

Linear, Step and Smooth Line Charts

The nearness of values from different series can raise difficulties in identifying adequately the points, respectively delimiting the lines (see B).When the density of values allows it, it makes sense also to include the averages for each data series to reflect the distance between the two data sets. Unfortunately, the chart can get crowded if further data series or summaries are added to the cart(s). 

If the column chart (E) is close to the redesigned chart provided in the original redesign, the other alternatives can provide upon case more value. Stacked column charts (D) allow also to compare the overall quantity by month, area charts (F) tend to use even more color than needed, while water charts (G) allow to compare the difference between data points per time unit. Tornado charts (H) are a variation of bar charts, allowing easier comparing of the size of the bars, while ribbon charts (I) show well the stacking values. 

Alternatives to Line Charts

One should consider changing the subtitle(s) slightly to reflect the chart type when the patterns shown imply a shift in attention or meaning. Upon case, more that one of the above charts can be used within the same report when two or more perspectives are important. Using a complementary perspective can facilitate data's understanding or of identifying certain patterns that aren't easily identifiable otherwise. 

In general, the graphics creators try to use various representational means of facilitating a data set's understanding, though seldom only two series or a small subset of dimensions provide a complete description. The value of data comes when multiple perspectives are combined. Frankly, the same can be said about the above data series. Yes, there are important differences between the two series, though how do the numbers compare when one looks at the bigger picture, especially when broken down on element types (e.g. airplane size). How about plan vs. actual values, how long does it take more for production or other processes? It's one of a visualization's goals to improve the questions posed, but how efficient are visualizations that barely scratch the surface?

In what concerns the code, the following scripts can be used to prepare the data:

-- Power Query script (Boeing vs Airbus)
= let
    Source = let
    Source = #table({"Sorting", "Month Name", "Serial Date", "Boeing Deliveries", "Airbus Deliveries"},
    {
        {1, "Oct", #date(2023, 10, 31), 30, 50},
        {2, "Nov", #date(2023, 11, 30), 40, 40},
        {3, "Dec", #date(2023, 12, 31), 40, 110},
        {4, "Jan", #date(2024, 1, 31), 20, 30},
        {5, "Feb", #date(2024, 2, 29), 30, 40},  // Leap year adjustment
        {6, "Mar", #date(2024, 3, 31), 30, 60},
        {7, "Apr", #date(2024, 4, 30), 40, 60},
        {8, "May", #date(2024, 5, 31), 40, 50},
        {9, "Jun", #date(2024, 6, 30), 50, 80},
        {10, "Jul", #date(2024, 7, 31), 40, 90},
        {11, "Aug", #date(2024, 8, 31), 40, 50},
        {12, "Sep", #date(2024, 9, 30), 30, 50}
    }
    ),
    #"Changed Types" = Table.TransformColumnTypes(Source, {{"Sorting", Int64.Type}, {"Serial Date", type date}, {"Boeing Deliveries", Int64.Type}, {"Airbus Deliveries", Int64.Type}})
in
    #"Changed Types"
in
    Source

It can be useful to create the labels for the charts dynamically:

-- DAX code for labels
MaxDate = Format(Max('Boeing vs Airbus'[Serial Date]),"MMM-YYYY")
MinDate = FORMAT (Min('Boeing vs Airbus'[Serial Date]),"MMM-YYYY")
MinMaxDate = [MinDate] & " to " & [MaxDate]
Title Boing Airbus = "Boing and Airbus Deliveries " & [MinMaxDate]

Happy coding!

Previous Post <<||>> Next Post

25 February 2025

🏭💠🗒️Microsoft Fabric: T-SQL Notebook [Notes] 🆕

Disclaimer: This is work in progress intended to consolidate information from various sources for learning purposes. For the latest information please consult the documentation (see the links below)! 

Last updated: 25-Feb-2024

[Microsoft Fabric] T-SQL notebook

  • {def} notebook that enables to write and run T-SQL code within a notebook [1]
  • {feature} allows to manage complex queries and write better markdown documentation [1]
  • {feature} allows the direct execution of T-SQL on
    • connected warehouse
    • SQL analytics endpoint
    • ⇐ queries can be run directly on the connected endpoint [1]
      • multiple connections are allowed [1]
  • allows running cross-database queries to gather data from multiple warehouses and SQL analytics endpoints [1]
  • the code is run by the primary warehouse
    • used as default in commands which supports three-part naming, though no warehouse was provided [1]
    • three-part naming consists of 
      • database name
        • the name of the warehouse or SQL analytics endpoint [1]
      • schema name
      • table name
  • {feature} autogenerate T-SQL code using the code template from the object explorer's context [1] menu
  • {concept} code cells
    • allow to create and run T-SQL code
      • each code cell is executed in a separate session [1]
        • {limitation} the variables defined in one cell are not available in another cell [1]
        • one can check the execution summary after the code is executed [1]
      • cells can be run individually or together [1]
      • one cell can contain multiple lines of code [1]
        • users can select and run subparts of a cell’s code [1]
    • {feature} Table tab
      • lists the records from the returned result set
        • if the execution contains multiple result set, you can switch from one to another via the dropdown menu [1]
  • a query can be saved as 
    • view
      • via 'Save as' view
      • {limitation} does not support three-part naming [1]
        • the view is always created in the primary warehouse [1]
          • by setting the warehouse as the primary warehouse [1]
    • table
      • via 'Save as' table
      • saved as CTAS 
    • ⇐ 'Save as' is only available for the selected query text
      • the query text must be selected before using the Save as options
  • {limitation} doesn’t support 
    • parameter cell
      • the parameter passed from pipeline or scheduler can't be used [1]
    • {feature} Recent Run 
      • {workaround} use the current data warehouse monitoring feature to check the execution history of the T-SQL notebook [1]
    • {feature} the monitor URL inside the pipeline execution
    • {feature} snapshot 
    • {feature} Git support 
    • {feature} deployment pipeline support 

References:
[1] Microsoft Learn (2025) T-SQL support in Microsoft Fabric notebooks [link
[2] Microsoft Learn (2025) Create and run a SQL Server notebook [link
[3] Microsoft Learn (2025) T-SQL surface area in Microsoft Fabric [link
[4] Microsoft Fabric Updates Blog (2024) Announcing Public Preview of T-SQL Notebook in Fabric [link]

Resources:
[R1] Microsoft Learn (2025) Fabric: What's new in Microsoft Fabric? [link]

Acronyms
CTAS - Create Table as Select
T-SQL - Transact SQL

23 February 2025

💎🏭SQL Reloaded: Microsoft Fabric's SQL Databases (Part X: Templates for Database Objects)

One of the new features remarked in SQL databases when working on the previous post is the availability of templates in SQL databases. The functionality is useful even if is kept to a minimum. Probably, more value can be obtained when used in combination with Copilot, which requires at least a F12 capacity.

Schemas

Schemas are used to create a logical grouping of objects such as tables, stored procedures, and functions. From a structural and security point of view it makes sense to create additional schemas to manage the various database objects and use the default dbo schema only occasionally (e.g. for global created objects).

-- generated template - schema
CREATE SCHEMA SchemaName

-- create schema
CREATE SCHEMA Test

One can look at the sys.schemas to retrieve all the schemas available:

-- retrieve all schemas
SELECT schema_id
, name
, principal_id
FROM sys.schemas
ORDER BY schema_id

Tables

Tables, as database objects that contain all the data in a database are probably the elements that need the greatest attention in design and data processing. In some cases a table can be dedenormalized and it can store all the data needed, much like in MS Excel, respectively, benormalized in fact and dimension tables. 

Tables can be created explicitly by defining in advance their structure (see Option 1), respectively on the fly (see Option 2). 

-- Option 1
-- create the table manually (alternative to precedent step
CREATE TABLE [Test].[Customers](
	[CustomerId] [int] NOT NULL,
	[AddressID] [int] NULL,
	[Title] [nvarchar](8) NULL,
	[FirstName] [nvarchar](50) NULL,
	[LastName] [nvarchar](50) NULL,
	[CompanyName] [nvarchar](128) NULL,
	[SalesPerson] [nvarchar](256) NULL
) ON [PRIMARY]
GO

-- insert records
INSERT INTO Test.Customers
SELECT CustomerId
, Title
, FirstName 
, LastName
, CompanyName
, SalesPerson
FROM SalesLT.Customer -- checking the output (both scenarios) SELECT top 100 * FROM Test.Customers

One can look at the sys.tables to retrieve all the tables available:

-- retrieve all tables
SELECT schema_name(schema_id) schema_name
, object_id
, name
FROM sys.tables
ORDER BY schema_name
, name

Views

Views are much like virtual table based on the result-set of an SQL statement that combines data from one or multiple tables.  They can be used to encapsulate logic, respectively project horizontally or  vertically a subset of the data. 

-- create view
CREATE OR ALTER VIEW Test.vCustomers
-- Customers 
AS
SELECT CST.CustomerId 
, CST.Title
, CST.FirstName 
, IsNull(CST.MiddleName, '') MiddleName
, CST.LastName 
, CST.CompanyName 
, CST.SalesPerson 
FROM SalesLT.Customer CST

-- test the view 
SELECT *
FROM Test.vCustomers
WHERE CompanyName = 'A Bike Store'

One can look at the sys.views to retrieve all the views available:

-- retrieve all views
SELECT schema_name(schema_id) schema_name
, object_id
, name
FROM sys.views
ORDER BY schema_name
, name

User-Defined Functions

A user-defined function (UDF) allows to create a function by using a SQL expression. It can be used alone or as part of a query, as in the below example.

-- generated template - user defined function
CREATE FUNCTION [dbo].[FunctionName] (
    @param1 INT,
    @param2 INT
)
RETURNS INT AS BEGIN RETURN
    @param1 + @param2
END

-- user-defined function: 
CREATE OR ALTER FUNCTION Test.GetFirstMiddleLastName (
    @FirstName nvarchar(50),
    @MiddleName nvarchar(50),
    @LastName nvarchar(50)
)
RETURNS nvarchar(150) AS 
BEGIN 
   RETURN IsNull(@FirstName, '') + IsNull(' ' + @MiddleName, '') + IsNull(' ' + @LastName, '') 
END

-- test UDF on single values
SELECT Test.GetFirstMiddleLastName ('Jack', NULL, 'Sparrow')
SELECT Test.GetFirstMiddleLastName ('Jack', 'L.', 'Sparrow')

-- test UDF on a whole table
SELECT TOP 100 Test.GetFirstMiddleLastName (FirstName, MiddleName, LastName)
FROM SalesLT.Customer

One can look at the sys.objects to retrieve all the scalar functions available:

-- retrieve all scalar functions
SELECT schema_name(schema_id) schema_name
, name
, object_id
FROM sys.objects 
WHERE type_desc = 'SQL_SCALAR_FUNCTION'
ORDER BY schema_name
, name

However, UDFs prove to be useful when they mix the capabilities of functions with the ones of views allowing to create a "parametrized view" (see next example) or even encapsulate a multi-line statement that returns a dataset. Currently, there seems to be no template available for creating such functions.

-- table-valued function
CREATE OR ALTER FUNCTION Test.tvfGetCustomers (
    @CompanyName nvarchar(50) NULL
)
RETURNS TABLE
-- Customers by Company
AS
RETURN (
	SELECT CST.CustomerId 
	, CST.CompanyName
	, CST.Title
	, IsNull(CST.FirstName, '') + IsNull(' ' + CST.MiddleName, '') + IsNull(' ' + CST.LastName, '') FullName
	, CST.FirstName 
	, CST.MiddleName 
	, CST.LastName 
	FROM SalesLT.Customer CST
	WHERE CST.CompanyName = IsNull(@CompanyName, CST.CompanyName)
);

-- test function for values
SELECT *
FROM Test.tvfGetCustomers ('A Bike Store')
ORDER BY CompanyName
, FullName

-- test function for retrieving all values
SELECT *
FROM Test.tvfGetCustomers (NULL)
ORDER BY CompanyName
, FullName

One can look at the sys.objects to retrieve all the table-valued functions available:

-- retrieve all table-valued functions
SELECT schema_name(schema_id) schema_name
, name
, object_id
FROM sys.objects 
WHERE type_desc = 'SQL_INLINE_TABLE_VALUED_FUNCTION'
ORDER BY schema_name , name

Stored Procedures

A stored procedure is a prepared SQL statement that is stored as a database object and precompiled. Typically, the statements considered in SQL functions can be created also as stored procedure, however the latter doesn't allow to reuse the output directly.

-- get customers by company
CREATE OR ALTER PROCEDURE Test.spGetCustomersByCompany (
    @CompanyName nvarchar(50) NULL
)
AS
BEGIN
	SELECT CST.CustomerId 
	, CST.CompanyName
	, CST.Title
	, IsNull(CST.FirstName, '') + IsNull(' ' + CST.MiddleName, '') + IsNull(' ' + CST.LastName, '') FullName
	, CST.FirstName 
	, CST.MiddleName 
	, CST.LastName 
	FROM SalesLT.Customer CST
	WHERE CST.CompanyName = IsNull(@CompanyName, CST.CompanyName)
	ORDER BY CST.CompanyName
	, FullName
END 

-- test the procedure 
EXEC Test.spGetCustomersByCompany NULL -- all customers
EXEC Test.spGetCustomersByCompany 'A Bike Store' -- individual customer

One can look at the sys.objects to retrieve all the stored procedures available:

-- retrieve all scalar functions
SELECT schema_name(schema_id) schema_name
, name
, object_id
FROM sys.objects 
WHERE type_desc = 'SQL_STORED_PROCEDURE'
ORDER BY schema_name , name

In the end, don't forget to drop the objects created above (note the order of the dependencies):

-- drop function 
DROP FUNCTION IF EXISTS Test.GetFirstMiddleLastName

-- drop function 
DROP FUNCTION IF EXISTS Test.tvfGetCustomers 
-- drop precedure DROP VIEW IF EXISTS Test.Test.spGetCustomersByCompany -- drop view DROP VIEW IF EXISTS Test.vCustomers -- drop schema DROP SCHEMA IF EXISTS Test

Previous Post <<||>> Next Post

References:
[1] Microsoft Learn (2024) Microsoft Fabric: Overview of Copilot in Fabric [link]

💎🏭SQL Reloaded: Microsoft Fabric's SQL Databases (Part IX: From OLTP to OLAP Data Models)

With SQL databases Microsoft brought OLTP to Microsoft Fabric which allows addressing a wider range of requirements, though this involves also some challenges that usually are addressed by the transition from the OLTP to OLAP architectures. Typically, there's an abstraction layer that is built on top of the OLTP data models that allows to address the various OLAP requirements. As soon as OLTP and OLAP models are mixed together, this opens the door to design and data quality issues that have impact on the adoption of solutions by users. Probably, those who worked with MS Access or even MS Excel directly or in combination with SQL Server can still remember the issues they run into.

Ideally, it should be a separation layer between OLTP and the OLAP data. This can be easily achieved in SQL databases by using two different schemas that mimic the interaction between the two types of architectures. So, supposing that the dbo schema from the SalesLT is the data as maintain by the OLTP layer, one can add an additional schema Test in which the OLAP logic is modelled. This scenario is not ideal, though it allows to model the two aspects of the topic considered. The following steps are to be performed in the environment in which the SalesLT database was created. 

Independently in which layer one works, it's ideal to create a set of views that abstracts the logic and ideally simplifies the processing of data. So, in a first step it's recommended to abstract the data from the source by creating a set of views like the one below:

-- drop view (cleaning)
-- DROP VIEW IF EXISTS SalesLT.vCustomerLocations 

-- create view
CREATE VIEW SalesLT.vCustomerLocations
-- Customers with main office
AS
SELECT CST.CustomerId 
, CSA.AddressID
, CST.Title
, CST.FirstName 
, IsNull(CST.MiddleName, '') MiddleName
, CST.LastName 
, CST.CompanyName 
, CST.SalesPerson 
, IsNull(CSA.AddressType, '') AddressType
, IsNull(ADR.City, '') City
, IsNull(ADR.StateProvince, '') StateProvince
, IsNull(ADR.CountryRegion, '') CountryRegion
, IsNull(ADR.PostalCode, '') PostalCode
FROM SalesLT.Customer CST
	 LEFT JOIN SalesLT.CustomerAddress CSA
	   ON CST.CustomerID = CSA.CustomerID
	  AND CSA.AddressType = 'Main Office'
	 	LEFT JOIN SalesLT.Address ADR
		  ON CSA.AddressID = ADR.AddressID

The view uses LEFT instead of FULL joins because this allows more flexibility, respectively identifying the gaps existing between entities (e.g. customers without addresses). In these abstractions, the number of transformations is kept to a minimum to reflect the data as reflected by the source. It may be chosen to minimize the occurrence of NULL values as this simplifies the logic for comparisons (see the use of IsNull).

Once the abstraction from the OLTP layer was built, one can make the data available in the OLAP layer:

-- create schema
CREATE SCHEMA Test

-- dropping the target table (for cleaning)
-- DROP TABLE IF EXISTS Test.CustomerLocations

-- Option 1
-- create the table on the fly
SELECT *
INTO Test.CustomerLocations
FROM SalesLT.vCustomerLocations

-- Option 2
-- create the table manually (alternative to precedent step
CREATE TABLE [Test].[CustomerLocations](
	[CustomerId] [int] NOT NULL,
	[AddressID] [int] NULL,
	[Title] [nvarchar](8) NULL,
	[FirstName] [nvarchar](50) NULL,
	[MiddleName] [nvarchar](50) NULL,
	[LastName] [nvarchar](50) NULL,
	[CompanyName] [nvarchar](128) NULL,
	[SalesPerson] [nvarchar](256) NULL,
	[AddressType] [nvarchar](50) NULL,
	[City] [nvarchar](30) NULL,
	[StateProvince] [nvarchar](50) NULL,
	[CountryRegion] [nvarchar](50) NULL,
	[PostalCode] [nvarchar](15) NULL
) ON [PRIMARY]
GO

-- insert records
INSERT INTO Test.CustomerLocations
SELECT *
FROM SalesLT.vCustomerLocations


-- checking the output (both scenarios)
SELECT top 100 *
FROM Test.CustomerLocations


-- drop the view (for cleaning)
-- DROP VIEW IF EXISTS Test.vCustomerLocations

-- create view
CREATE VIEW Test.vCustomerLocations
-- Customer locations
AS
SELECT CSL.CustomerId 
, CSL.AddressID
, CSL.Title
, CSL.FirstName 
, CSL.MiddleName 
, CSL.LastName 
, Concat(CSL.FirstName, ' ' + CSL.MiddleName, ' ', CSL.LastName) FullName
, CSL.CompanyName 
, CSL.SalesPerson 
, CSL.AddressType
, CSL.City
, CSL.StateProvince
, CSL.CountryRegion 
, CSL.PostalCode
FROM Test.CustomerLocations CSL

-- test the view
SELECT top 100 *
FROM Test.vCustomerLocations

Further on, one can create additional objects as required. Usually, a set of well-designed views is enough, offering the needed flexibility with a minimum of code duplication. In addition, one can build stored procedures and table-valued functions as needed:

-- drop the function (for cleaning)
-- DROP FUNCTION IF EXISTS Test.tvfGetCustomerAddresses

-- generated template - function
CREATE FUNCTION Test.tvfGetCustomerAddresses (
    @CountryRegion nvarchar(50) NULL,
    @StateProvince nvarchar(50) NULL
)
RETURNS TABLE
-- Customers by Country & State province
AS
RETURN (
SELECT CSL.CustomerId 
, CSL.AddressID
, CSL.Title
, CSL.FirstName 
, CSL.MiddleName 
, CSL.LastName 
, CSL.FullName
, CSL.CompanyName 
, CSL.SalesPerson 
, CSL.AddressType 
, CSL.City
, CSL.StateProvince 
, CSL.CountryRegion 
, CSL.PostalCode
FROM Test.vCustomerLocations CSL
WHERE CSL.CountryRegion = IsNull(@CountryRegion, CSL.CountryRegion)
  AND CSL.StateProvince = IsNull(@StateProvince, CSL.StateProvince)
);

-- retrieving all records
SELECT *
FROM Test.tvfGetCustomerAddresses(NULL, NULL)

-- providing parameters
SELECT *
FROM Test.tvfGetCustomerAddresses('United States', 'Utah')

-- filtering on non-parametrized volumns
SELECT *
FROM Test.tvfGetCustomerAddresses('United States', 'Utah')
WHERE City = 'Salt Lake City'



-- drop the procedure (for cleaning)
-- DROP PROCEDURE IF EXISTS Test.spGetCustomerAddresses 

-- generated template - stored procedure
CREATE PROCEDURE Test.spGetCustomerAddresses (
    @CountryRegion nvarchar(50) NULL,
    @StateProvince nvarchar(50) NULL
)
-- Customers by Country & State province
AS
BEGIN
	SELECT CSL.CustomerId 
	, CSL.AddressID
	, CSL.Title
	, CSL.FirstName 
	, CSL.MiddleName 
	, CSL.LastName 
	, CSL.FullName
	, CSL.CompanyName 
	, CSL.SalesPerson 
	, CSL.AddressType 
	, CSL.City
	, CSL.StateProvince 
	, CSL.CountryRegion 
	, CSL.PostalCode
	FROM Test.vCustomerLocations CSL
	WHERE CSL.CountryRegion = IsNull(@CountryRegion, CSL.CountryRegion)
	AND CSL.StateProvince = IsNull(@StateProvince, CSL.StateProvince)
END 

-- retrieving all records
EXEC Test.spGetCustomerAddresses NULL, NULL

-- providing parameters
 EXEC Test.spGetCustomerAddresses 'United States', 'Utah'

These steps can repeated for each entity in scope.

This separation between OLTP and OLAP is usually necessary given that business processes need a certain amount of time until they are correctly reflected as per reporting needs. Otherwise, the gaps can negatively impact the quality of data used for reporting. For some reports these deviation might be acceptable, though there will be probably also (many) exceptions. Independently of the solution used, it's still needed to make sure that the data are appropriate for the processes and reporting. 

If no physical separation is needed between the two types of layers, one can remove the persisted tables from the logic and keep the objects as they are.

Independently of which architecture is chosen, one shouldn't forget to validate one's presumptions in what concerns the data model (e.g. customers without addresses, address types, etc.).

Previous Post <<||>> Next Post

07 February 2025

🌌🏭KQL Reloaded: First Steps (Part VII: Basic Data Visualizations)

One of the greatest aspects of KQL and its environment is that creating a chart is just one instruction away from the dataset generated in the process. Of course, the data still need to be in an appropriate form to be used as source for a visual, though the effort is minimal. Let's consider the example used in the previous post based ln the ContosoSales data, where the visualization part is everything that comes after "| render":

// visualizations by Country: various charts
NewSales
| where SalesAmount <> 0 and ProductCategoryName == 'TV and Video'
| where DateKey >=date(2023-02-01) and DateKey < datetime(2023-03-01)
| summarize count_customers = count_distinct(CustomerKey) by RegionCountryName
| order by count_customers desc
//| render table
//| render linechart
//| render areachart 
//| render stackedchart
//| render columnchart | render piechart with (xtitle="Country", ytitle="# Customers", title="# Customers by Country (pie chart)", legend=hidden)
Output:
# Customers by Country (various charts)

It's enough to use "render" with the chart type without specifying the additional information provided under "with", though the legend can facilitate data's understanding. Unfortunately, the available properties are relatively limited, at least for now. 

Adding one more dimension is quite simple, even if the display may be sometimes confusing as there's no clear delimitation between the entities represented while the legend grows linearly with the number of points. It might be a good idea to use additional charts for the further dimensions in scope. 

// visualizations by Region & Country: various charts
NewSales
| where SalesAmount <> 0 and ProductCategoryName == 'TV and Video'
| where DateKey >=date(2023-02-01) and DateKey < datetime(2023-03-01)
| summarize count_customers = count_distinct(CustomerKey) by ContinentName, RegionCountryName
| order by count_customers desc   
//| render stackedareachart 
//| render linechart 
//| render table 
//| render areachart 
//| render piechart
| render columnchart 
    with (xtitle="Region/Country", ytitle="# Customers",
    title="#Customers by Continent & Country", legend=hidden)
Output:
# Customers by Continent & Country (column chart)

Sometimes, it makes sense to reduce the number of values, recommendation that applies mainly to pie charts:

// visualizations by Zone: pie chart
NewSales
| where SalesAmount <> 0 and ProductCategoryName == 'TV and Video'
| where DateKey >=date(2023-02-01) and DateKey < datetime(2023-03-01)
| summarize count_customers = count_distinct(CustomerKey) by iif(RegionCountryName in ('United States', 'Canada'), RegionCountryName, 'Others')
| render piechart
    with (xtitle="Country", ytitle="Sales volume",
    title="Sales volume by Zone")
Output:
# Customers by Zone (pie chart)

Adding a second set of values (e.g. Total cost) allows to easily create a scatter chart:

// visualization by Occupation: scatter chart
NewSales
| where SalesAmount <> 0 and ProductCategoryName == 'TV and Video'
| where DateKey >=date(2023-02-01) and DateKey < datetime(2023-03-01)
| summarize count_customers = count_distinct(CustomerKey) 
    , TotalCost = sum(TotalCost) by Occupation
| order by count_customers desc
| render scatterchart 
    with (xtitle="# Customers", ytitle="Sales volume",
    title="# Customers vs Sales volume by Occupation", legend=visible )
Output:
# Customers vs Sales volume by Occupation (scatter chart)

The visualizations are pretty simple to build, though one shouldn't expect that one can build a visualization on top of any dataset, at least not without further formatting and eventually code changes. For example, considering the query from the previous post, with a small change one can use the data with a column chart, though this approach might have some limitation (e.g. it doesn't work pie charts):

// calculating percentages from totals: column chart
NewSales
| where SalesAmount <> 0 and ProductCategoryName == 'TV and Video'
//| where DateKey >=date(2023-02-01) and DateKey < datetime(2023-03-01)
| summarize count_customers = count_distinct(CustomerKey)
    , count_customers_US = count_distinctif(CustomerKey, RegionCountryName == 'United States')
    , count_customers_CA = count_distinctif(CustomerKey, RegionCountryName == 'Canada')
    , count_customers_other = count_distinctif(CustomerKey, not(RegionCountryName in ('United States', 'Canada')))
| project Charting = "Country"
    , US = count_customers_US
    , CA = count_customers_CA
    , other = count_customers_other
| render columnchart
    with (xtitle="Region", ytitle="# Customers",
    title="# Customers by Region")
Output:
# Customers by Region (column chart)

There are a few more visuals that will be considered in a next post. Despite the relatively limited set of visuals and properties, the visualizations are useful to get a sense of data's shape, and this with a minimum of changes. Ad-hoc visualizations can help also in data modeling, validating the logic and/or identifying issues in the data when creating the queries, which makes it a great feature. 

Happy coding!

Previous Post <<||>> Next Post

Related Posts Plugin for WordPress, Blogger...

About Me

My photo
Koeln, NRW, Germany
IT Professional with more than 25 years experience in IT in the area of full life-cycle of Web/Desktop/Database Applications Development, Software Engineering, Consultancy, Data Management, Data Quality, Data Migrations, Reporting, ERP implementations & support, Team/Project/IT Management, etc.