05 February 2025

🌌🏭KQL Reloaded: First Steps (Part IV: Left, Right, Anti-Joins and Unions)

In a standard scenario there is a fact table and multiple dimension table (see previous post), though one can look at the same data from multiple perspectives. In KQL it's recommended to start with the fact table, however in some reports one needs records from the dimension table independently whether there are any records in the fact tale.  

For example, it would be useful to show all the products, independently whether they were sold or not. It's what the below query does via a RIGHT JOIN between the fact table and the Customer dimension:

// totals by customer via right join
NewSales
| where SalesAmount <> 0 and ProductCategoryName == 'TV and Video'
| where DateKey >=date(2023-02-01) and DateKey < datetime(2023-03-01)
| summarize record_count = count()
    , TotalCost = sum(TotalCost) by CustomerKey
| join kind=rightouter (
    Customers
    | where RegionCountryName in ('Canada', 'Australia')
    | project CustomerKey, RegionCountryName, CustomerName = strcat(FirstName, ' ', LastName)
    )
    on CustomerKey
| project RegionCountryName, CustomerName, TotalCost
//| summarize record_count = count() 
| order by CustomerName asc

The Product details can be added then via a LEFT JOIN between the fact table and the Product dimension:

// total by customer via right join with product information
NewSales
| where SalesAmount <> 0 and ProductCategoryName == 'TV and Video'
| where DateKey >=date(2023-02-01) and DateKey < datetime(2023-03-01)
| summarize record_count = count()
    , TotalCost = sum(TotalCost)
    , FirstPurchaseDate = min(DateKey)
    , LastPurchaseDate = max(DateKey) by CustomerKey, ProductKey
| join kind=rightouter (
    Customers
    | where RegionCountryName in ('Canada', 'Australia')
    | project CustomerKey, RegionCountryName, CustomerName = strcat(FirstName, ' ', LastName)
    )
    on CustomerKey
| join kind=leftouter (
    Products 
    | where  ProductCategoryName == 'TV and Video'
    | project ProductKey, ProductName 
    )
    on ProductKey
| project RegionCountryName, CustomerName, ProductName, TotalCost, FirstPurchaseDate, LastPurchaseDate, record_count
//| where record_count>1 // multiple purchases
| order by CustomerName asc, ProductName asc

These kind of queries need adequate validation and for this it might be needed to restructure the queries. 

// validating the multiple records (defailed)
NewSales
| where SalesAmount <> 0 and ProductCategoryName == 'TV and Video'
| where DateKey >=date(2023-02-01) and DateKey < datetime(2023-03-01)
| lookup Products on ProductKey
| lookup Customers on CustomerKey 
| where  FirstName == 'Alexandra' and LastName == 'Sanders'
| project CustomerName = strcat(FirstName, ' ', LastName), ProductName, TotalCost, DateKey, ProductKey, CustomerKey 

// validating the multiple records against the fact table (detailed)
NewSales
| where SalesAmount <> 0 and ProductCategoryName == 'TV and Video'
| where DateKey >=date(2023-02-01) and DateKey < datetime(2023-03-01)
| where CustomerKey == 14912

Mixing RIGHT and  LEFT joins in this way increases the complexity of the queries and sometimes comes with a burden for validating the logic. In SQL one could prefer to start with the Customer table, add the summary data and the other dimensions. In this way one can do a total count individually starting with the Customer table and adding each join which reviewing the record count for each change. 

In special scenario instead of a RIGHT JOIN one could use a FULL JOIN and add the Customers without any orders via a UNION. In some scenarios this approach can offer even a better performance. For this approach, one needs to get the Customers without Orders via an anti-join, more exactly a   rightanti:

// customers without orders
NewSales
| where SalesAmount <> 0 and ProductCategoryName == 'TV and Video'
| where DateKey >=date(2023-02-01) and DateKey < datetime(2023-03-01)
| summarize record_count = count()
    , TotalCost = sum(TotalCost)
    , FirstPurchaseDate = min(DateKey)
, LastPurchaseDate = max(DateKey) by CustomerKey, ProductKey
| join kind=rightanti (
    Customers
    | project CustomerKey, RegionCountryName, CustomerName = strcat(FirstName, ' ', LastName)
    )
    on CustomerKey
| project RegionCountryName, CustomerName, ProductName = '', TotalCost = 0
//| summarize count()
| order by CustomerName asc, ProductName asc

And now, joining the Customers with orders with the ones without orders gives an overview of all the customers:

// total by product via lookup with table alias
NewSales
| where SalesAmount <> 0 and ProductCategoryName == 'TV and Video'
| where DateKey >=date(2023-02-01) and DateKey < datetime(2023-03-01)
| summarize record_count = count()
    , TotalCost = sum(TotalCost) by CustomerKey //, ProductKey 
//| lookup Products on ProductKey 
| lookup Customers on CustomerKey
| project RegionCountryName
    , CustomerName = strcat(FirstName, ' ', LastName)
    //, ProductName
    , TotalCost
//| summarize count()
| union withsource=SourceTable kind=outer (
    // customers without orders
    NewSales
    | where SalesAmount <> 0 and ProductCategoryName == 'TV and Video'
    | where DateKey >=date(2023-02-01) and DateKey < datetime(2023-03-01)
    | summarize record_count = count()
        , TotalCost = sum(TotalCost)
        , FirstPurchaseDate = min(DateKey)
        , LastPurchaseDate = max(DateKey) by CustomerKey
        //, ProductKey
    | join kind=rightanti (
        Customers
        | project CustomerKey
            , RegionCountryName
            , CustomerName = strcat(FirstName, ' ', LastName)
        )
        on CustomerKey
    | project RegionCountryName
        , CustomerName
        //, ProductName = ''
        , TotalCost = 0
    //| summarize count()
)
//| summarize count()
| order by CustomerName asc
//, ProductName asc

And, of course, the number of records returned by the three queries must match. The information related to the Product were left out for this version, though it can be added as needed. Unfortunately, there are queries more complex than this, which makes the queries more difficult to read, understand and troubleshoot. Inline views could be useful to structure the logic as needed. 

let T_customers_with_orders = view () { 
    NewSales
    | where SalesAmount <> 0 and ProductCategoryName == 'TV and Video'
    | where DateKey >=date(2023-02-01) and DateKey < datetime(2023-03-01)
    | summarize record_count = count()
        , TotalCost = sum(TotalCost) 
        by CustomerKey
         //, ProductKey 
         //, DateKey
    //| lookup Products on ProductKey 
    | lookup Customers on CustomerKey
    | project RegionCountryName
        , CustomerName = strcat(FirstName, ' ', LastName)
        //, ProductName
        //, ProductCategoryName
        , TotalCost
        //, DateKey
};
let T_customers_without_orders = view () { 
   // customers without orders
    NewSales
    | where SalesAmount <> 0 and ProductCategoryName == 'TV and Video'
    | where DateKey >=date(2023-02-01) and DateKey < datetime(2023-03-01)
    | summarize record_count = count()
        , TotalCost = sum(TotalCost)
        , FirstPurchaseDate = min(DateKey)
        , LastPurchaseDate = max(DateKey) by CustomerKey
        //, ProductKey
    | join kind=rightanti (
        Customers
        | project CustomerKey
            , RegionCountryName
            , CustomerName = strcat(FirstName, ' ', LastName)
        )
        on CustomerKey
    | project RegionCountryName
        , CustomerName
        //, ProductName = ''
        , TotalCost = 0
    //| summarize count()
};
T_customers_with_orders
| union withsource=SourceTable kind=outer T_customers_without_orders
| summarize count()

In this way the queries should be easier to troubleshoot and restructure the logic in more manageable pieces. 

It would be useful to save the definition of the view (aka stored view), however it's not possible to create views on the machine on which the tests are run:

"Principal 'aaduser=...' is not authorized to write database 'ContosoSales'."

Happy coding!

Previous Post <<||>> Next Post

No comments:

Related Posts Plugin for WordPress, Blogger...

About Me

My photo
Koeln, NRW, Germany
IT Professional with more than 24 years experience in IT in the area of full life-cycle of Web/Desktop/Database Applications Development, Software Engineering, Consultancy, Data Management, Data Quality, Data Migrations, Reporting, ERP implementations & support, Team/Project/IT Management, etc.