04 February 2025

🌌🏭KQL Reloaded: First Steps (Part III: Basic Joins)

As data is usually dispersed over multiple tables, in any query languages is quintessential to know how to bring the data together for further analysis, joins allowing to achieve this is SQL as well in KQL. One usually starts with the main table and joins the further tables to it. In a data warehouse, the main table is a fact table to which the dimension tables are further joined to add details, respectively to filter on the dimensions. 

Before starting to build any logic, it's important to get an idea of tables' cardinality and which are the keys on which the data to be join. Some models, like the one in the ContosoSales model, are straightforward and the columns participating in the joins have the same names, or at least similar names.  

When writing and testing queries, one can start with a small subset of the data, join the tables together, and at the end validate the logic as needed. Selecting a small interval (e.g. a month, week, or even a day) and the records for 1-2 representative records from a dimensional table (e.g. Customers) helps in the process for minimizing the data movement and 

// check how many records are available 
// Sales table: 28980 based on the filter
NewSales
| where SalesAmount <> 0 and ProductCategoryName == 'TV and Video'
| where DateKey >=date(2023-02-01) and DateKey < datetime(2023-03-01)
| summarize record_count = count()

// Products: 2517 records
Products
| summarize record_count = count()

// Customers: 18484 records
Customers 
| summarize record_count = count()

In a second step one can look at the individual values, identify uniqque indentifiers, the columns that participate in joins, respectively other columns of interest. If needed, one can five deeper by checking the number for individual values. Selecting 10-100 records is usually enough for getting an idea about the dataset, though there are also many exceptions:
 
// review the data
// Sales table: 28980 based on the filter
NewSales
| where SalesAmount <> 0 and ProductCategoryName == 'TV and Video'
| where DateKey >=date(2023-02-01) and DateKey < datetime(2023-03-01)
| take 10

// Products: 2517 records
Products
| take 10

// Customers: 18484 records
Customers 
| take 10

In KQL one must start with the fact table (e.g NewSales) and add the dimensions accordingly. Alternatively, it's useful to group only by ProductKey, respectively by CustomerKey.

// review the basis output by Products & Customers
NewSales
| where SalesAmount <> 0 and ProductCategoryName == 'TV and Video'
| where DateKey >=date(2023-02-01) and DateKey < datetime(2023-03-01)
| summarize record_count = count()
    , TotalCost = sum(TotalCost) by ProductKey, CustomerKey
| order by TotalCost desc
| summarize result_record_count = count()

Now let's join the Product dimension table to the NewSales fact table via the join operator

// total by product via join
NewSales
| where SalesAmount <> 0 and ProductCategoryName == 'TV and Video'
| where DateKey >=date(2023-02-01) and DateKey < datetime(2023-03-01)
| summarize record_count = count()
    , TotalCost = sum(TotalCost) by ProductKey
| join kind=inner (
    Products 
    | where  ProductCategoryName == 'TV and Video'
    )
    on ProductKey
| project ProductName, TotalCost
//| summarize record_count = count() 
| order by TotalCost desc

One can join now also the second dimension table and eventually select only the columns in scope, respectively do further calculations (see CustomerName):

// total by product & customer via joins
NewSales
| where SalesAmount <> 0 and ProductCategoryName == 'TV and Video'
| where DateKey >=date(2023-02-01) and DateKey < datetime(2023-03-01)
| summarize record_count = count()
    , TotalCost = sum(TotalCost) by ProductKey, CustomerKey
| join kind=inner (
    Products 
    | where  ProductCategoryName == 'TV and Video'
    | project ProductKey, ProductName 
    )
    on ProductKey
| join kind=inner (
    Customers
    | project CustomerKey, CustomerName = strcat(FirstName, ' ', LastName)
    )
    on CustomerKey
| project CustomerName, ProductName, TotalCost
//| summarize record_count = count() 
| order by CustomerName asc, ProductName asc

Once the query built, one can go ahead and remove the pieces of logic not needed, an reevaluate the number of records with a count (see commented line).

Alternatively, one can rewrite the above query as follows via the lookup operators:

// total by Product & Customer via lookup
NewSales
| where SalesAmount <> 0 and ProductCategoryName == 'TV and Video'
| where DateKey >=date(2023-02-01) and DateKey < datetime(2023-03-01)
| summarize record_count = count()
    , TotalCost = sum(TotalCost) by ProductKey, CustomerKey
| lookup Products on ProductKey
| lookup Customers on CustomerKey 
//| summarize record_count = count() 
| project CustomerName = strcat(FirstName, ' ', LastName), ProductName, TotalCost

The two last queries should produce the same results. The first query allows more flexibility in what concerns the constraints applied, while the second is more compact while it also allows to easier test how the number of records changes by added each join. This set of tests is done to make sure that duplicates aren't created in the process, respectively that records are not removed unnecessarily from the logic. 

If the fields used in joins don't have the same name, which happens in data many models, one can use the $left and $right to refer to the table participating in the join. This syntax makes the query a bit more verbose, though it brings more clarity. Even if some vendors provide similar syntax (e.g. USING in Oracle), one may prefer the syntax that offers clarity.

// total by product via lookup with table alias
NewSales
| where SalesAmount <> 0 and ProductCategoryName == 'TV and Video'
| where DateKey >=date(2023-02-01) and DateKey < datetime(2023-03-01)
| summarize record_count = count()
    , TotalCost = sum(TotalCost) by ProductKey, CustomerKey
| lookup Products on $left.ProductKey == $right.ProductKey
| lookup Customers on $left.CustomerKey == $right.CustomerKey
| project CustomerName = strcat(FirstName, ' ', LastName), ProductName, TotalCost

Notes:
1) For readability purposes, it might be useful to use indentation for the lines that are continuation of the previous line. Therefore, if the line doesn't start with a pipe ("|"), the next line starts a few characters (a tab) further.
2) According to the documentation (see [1]), the "lookup" operator optimizes the performance of queries where a fact table is enriched with data from a dimension table.

Happy coding!

Previous Post <<||>> Next Post

References:
[1] Microsoft Learn (2024) Kusto Tutorial: Join data from multiple tables [link]

No comments:

Related Posts Plugin for WordPress, Blogger...

About Me

My photo
Koeln, NRW, Germany
IT Professional with more than 25 years experience in IT in the area of full life-cycle of Web/Desktop/Database Applications Development, Software Engineering, Consultancy, Data Management, Data Quality, Data Migrations, Reporting, ERP implementations & support, Team/Project/IT Management, etc.