SQL Troubles: explain


08 February 2025

🌌🏭KQL Reloaded: First Steps (Part IX: Translating SQL to KQL - More Joins)

The last post exemplified the use of "explain" to translate queries from SQL to KQL. The current post tests the feature against the various join constructs available in SQL by using the NewSales and Products tables. The post presumes that the reader has a basic understanding of the join types from SQL-based environments.

Inner Join

// inner join in SQL
--
explain
SELECT NewSales.CustomerKey
, NewSales.ProductKey
, Products.ProductName
FROM NewSales 
     JOIN Products 
      ON NewSales.ProductKey = Products.ProductKey 

// inner join in KQL
NewSales
| join kind=inner (
    Products
    | project ProductKey, ProductName 
    )
    on ($left.ProductKey == $right.ProductKey)
| project CustomerKey, ProductKey, ProductName
| limit 10

An inner join is probably the most used type of join, given that fact tables presume the existence of dimensions, even if poor data warehouse design can also lead to exceptions. The join retrieves all the matching data from both tables, including any duplicates from either side of the join.
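The remark about duplicates matters in KQL because a join without an explicit kind defaults to innerunique, which deduplicates the left side on the join key, whereas kind=inner keeps every matching pair. The following is a minimal sketch on made-up data: the Sales and Prod tables and their values are assumptions for illustration, not the NewSales and Products tables used in the post.

```kql
// Hypothetical sample data (Sales and Prod are illustrative names)
let Sales = datatable(CustomerKey:long, ProductKey:long)
[
    1, 100,
    2, 100,
    3, 200
];
let Prod = datatable(ProductKey:long, ProductName:string)
[
    100, "Widget",
    100, "Widget (duplicate)",
    300, "Gadget"
];
// count the rows returned by each join kind
union
    (Sales
    | join kind=inner (Prod) on ProductKey
    | summarize Rows = count()
    | extend JoinKind = "inner"),
    (Sales
    | join (Prod) on ProductKey   // no kind specified: innerunique
    | summarize Rows = count()
    | extend JoinKind = "innerunique")
```

With this data the inner join should return 4 rows (two sales times two product records for key 100), while innerunique should return 2, since only one left-side row per key is matched. This is why specifying kind=inner explicitly is the closer equivalent of the SQL JOIN.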

Left Join

// left join in SQL
--
explain
SELECT NewSales.CustomerKey
, NewSales.ProductKey
, Products.ProductName
FROM NewSales 
     LEFT JOIN Products 
      ON NewSales.ProductKey = Products.ProductKey 

// left join in KQL
NewSales
| join kind=leftouter (
    Products
    | project ProductKey
        , Product = ProductName 
    )
    on ($left.ProductKey == $right.ProductKey)
| where isempty(Product) // isnull() always returns false for string values
| project CustomerKey
    , ProductKey
    , ProductName
| limit 10

A left join retrieves all the records from the left table, typically the fact table, regardless of whether matching records were found in the dimension table. One can check whether mismatches exist by retrieving only the records for which no match was found.
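For the mismatch check, KQL also offers a dedicated join kind, leftanti, which returns only the left-side records without a match and thus avoids the null or empty test altogether. A minimal sketch, assuming hypothetical Sales and Prod tables with made-up values:

```kql
// Hypothetical sample data (not the post's NewSales/Products tables)
let Sales = datatable(CustomerKey:long, ProductKey:long, ProductName:string)
[
    1, 100, "Widget",
    2, 200, "Unknown item"
];
let Prod = datatable(ProductKey:long, ProductName:string)
[
    100, "Widget",
    300, "Gadget"
];
// sales records whose ProductKey has no counterpart in the dimension
Sales
| join kind=leftanti (Prod) on ProductKey
| project CustomerKey, ProductKey, ProductName
```

Only the record with ProductKey 200 should be returned, and the output contains just the left table's columns. When the leftouter form is preferred, string columns coming from the right side need to be tested with isempty(), because isnull() always returns false for strings.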

Right Join

// right join in SQL
--
explain
SELECT NewSales.CustomerKey
, Products.ProductKey
, Products.ProductName
FROM NewSales 
     RIGHT JOIN Products 
      ON NewSales.ProductKey = Products.ProductKey 

// right join in KQL
NewSales
| join kind=rightouter (
    Products
    | project DimProductKey = ProductKey
    , DimProductName = ProductName 
    )
    on ($left.ProductKey == $right.DimProductKey)
| where isnull(ProductKey) 
| project CustomerKey
    , DimProductKey
    , DimProductName
| limit 10

A right join retrieves all the records from the dimension together with the matches from the fact table, regardless of whether a match was found in the fact table.
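When only the dimension records without any match in the fact table are of interest (e.g. products never sold), the rightanti kind can be used instead of a rightouter join followed by a filter. The sketch below relies on hypothetical Sales and Prod tables with made-up values:

```kql
// Hypothetical sample data (not the post's NewSales/Products tables)
let Sales = datatable(CustomerKey:long, ProductKey:long)
[
    1, 100,
    2, 100
];
let Prod = datatable(ProductKey:long, ProductName:string)
[
    100, "Widget",
    300, "Gadget"
];
// products without any sales record
Sales
| join kind=rightanti (Prod) on ProductKey
```

The result should contain only the product with ProductKey 300, with the columns of the right table.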

Full Outer Join

// full outer join in SQL
--
explain
SELECT NewSales.CustomerKey
, Coalesce(NewSales.ProductKey, Products.ProductKey) ProductKey
, Coalesce(NewSales.ProductName, Products.ProductName) ProductName
FROM NewSales 
     FULL OUTER JOIN Products 
      ON NewSales.ProductKey = Products.ProductKey 


// full outer join in KQL
NewSales
| join kind=fullouter (
    Products
    | project DimProductKey = ProductKey
    , DimProductName = ProductName 
    )
    on ($left.ProductKey == $right.DimProductKey)
//| where isnull(ProductKey) 
| project CustomerKey
    , ProductKey = coalesce(ProductKey, DimProductKey)
    , ProductName = coalesce(ProductName, DimProductName)
| limit 10

A full outer join retrieves all the data from both sides of the join regardless of whether a match is found. In an RDBMS this type of join performs poorly, especially when further joins are involved or when many records exist on both sides of the join. Therefore, it should be avoided when possible, though in many cases it might be the only feasible solution. There are also alternatives that involve a UNION between a LEFT JOIN and a RIGHT JOIN, the latter retrieving only the records not found in the fact table (see the last query from a previous post). This can be a feasible solution when data sharding is involved.
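The UNION-based alternative can be sketched in KQL as a union between a leftouter join and a rightanti join, the latter contributing only the dimension records missing from the fact table. The tables and values below are hypothetical and serve only to illustrate the pattern:

```kql
// Hypothetical sample data (not the post's NewSales/Products tables)
let Sales = datatable(CustomerKey:long, ProductKey:long)
[
    1, 100,
    2, 200
];
let Prod = datatable(ProductKey:long, ProductName:string)
[
    100, "Widget",
    300, "Gadget"
];
// rename the dimension columns to avoid name clashes
let Dim = Prod
    | project DimProductKey = ProductKey
    , DimProductName = ProductName;
union
    // all sales, with the product details where available
    (Sales
    | join kind=leftouter (Dim) on $left.ProductKey == $right.DimProductKey),
    // products that don't appear in any sales record
    (Sales
    | join kind=rightanti (Dim) on $left.ProductKey == $right.DimProductKey)
| project CustomerKey
    , ProductKey = coalesce(ProductKey, DimProductKey)
    , ProductName = DimProductName
```

With this data the query should return three records: the matched sale for key 100, the unmatched sale for key 200, and the unsold product with key 300. Whether this performs better than kind=fullouter depends on the data and should be tested rather than assumed.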

Notes:
1) If one ignores the unnecessary logic introduced by the translation via explain, the tool is excellent for learning KQL. It would be interesting to understand why the tool used a certain complex translation over another, especially when there's a performance benefit in the use of a certain piece of code.
2) Also in SQL-based queries it's recommended to start with the fact table, respectively with the table having the highest cardinality and/or the lowest level of detail, though the database engine might find an optimal plan independently of which table was written first.

Happy coding!


🌌🏭KQL Reloaded: First Steps (Part VIII: Translating SQL to KQL - Full Joins)

One of the great features of KQL is the possibility of translating SQL code to KQL via the "explain" keyword, which allows porting SQL code to KQL and helps transfer knowledge from one language to the other.

Let's start with a basic example:

// transform SQL to KQL code (run only the part starting with --)
--
explain
SELECT top(10) CustomerKey, FirstName, LastName, CityName, CompanyName 
FROM Customers 
ORDER BY CityName DESC

// output: translated KQL code 
Customers
| project CustomerKey, FirstName, LastName, CityName, CompanyName
| sort by CityName desc nulls first
| take int(10)

The most interesting part of the translation is how "explain" translates joins from SQL to KQL. Let's start with an inner join (a JOIN without a qualifier) from the set of patterns considered in a previous post on SQL joins:

--
explain
SELECT CST.CustomerKey
, CST.FirstName + ' ' + CST.LastName CustomerName
, Cast(SAL.DateKey as Date) DateKey
, SAL.TotalCost
FROM NewSales SAL
    JOIN Customers CST
      ON SAL.CustomerKey = CST.CustomerKey 
WHERE SAL.DateKey > '20240101' AND SAL.DateKey < '20240201'
ORDER BY CustomerName, DateKey, TotalCost DESC

And, here's the translation:

// translated code
NewSales
| project-rename ['SAL.DateKey']=DateKey
| join kind=inner (Customers
| project-rename ['CST.CustomerKey']=CustomerKey
    , ['CST.CityName']=CityName
    , ['CST.CompanyName']=CompanyName
    , ['CST.ContinentName']=ContinentName
    , ['CST.Education']=Education
    , ['CST.FirstName']=FirstName
    , ['CST.Gender']=Gender
    , ['CST.LastName']=LastName
    , ['CST.MaritalStatus']=MaritalStatus
    , ['CST.Occupation']=Occupation
    , ['CST.RegionCountryName']=RegionCountryName
    , ['CST.StateProvinceName']=StateProvinceName) 
    on ($left.CustomerKey == $right.['CST.CustomerKey'])
| where ((['SAL.DateKey'] > todatetime("20240101")) 
    and (['SAL.DateKey'] < todatetime("20240201")))
| project ['CST.CustomerKey']
    , CustomerName=__sql_add(__sql_add(['CST.FirstName'], " "), ['CST.LastName'])
    , DateKey=['SAL.DateKey']
    , TotalCost
| sort by CustomerName asc nulls first
    , DateKey asc nulls first
    , TotalCost desc nulls first
| project-rename CustomerKey=['CST.CustomerKey']

The code was slightly formatted to facilitate its reading. Unfortunately, the tool doesn't work well with table aliases, it also introduces all the fields available from the dimension table, which can become a nightmare for big dimension tables, the concatenation seems strange, and if one looks deeper, further issues can be identified. So, the challenge is how to write a query in SQL so that it minimizes the further changes needed in KQL.

Probably, one approach is to write the backbone of the query in SQL and add the further logic after translation.

--
explain
SELECT NewSales.CustomerKey
, NewSales.DateKey 
, NewSales.TotalCost
FROM NewSales 
    INNER JOIN Customers 
      ON NewSales.CustomerKey = Customers.CustomerKey 
WHERE DateKey > '20240101' AND DateKey < '20240201'
ORDER BY NewSales.CustomerKey
, NewSales.DateKey

And the translation looks simpler:

// transformed query
NewSales
| join kind=inner 
(Customers
| project-rename ['Customers.CustomerKey']=CustomerKey
    , ['Customers.CityName']=CityName
    , ['Customers.CompanyName']=CompanyName
    , ['Customers.ContinentName']=ContinentName
    , ['Customers.Education']=Education
    , ['Customers.FirstName']=FirstName
    , ['Customers.Gender']=Gender
    , ['Customers.LastName']=LastName
    , ['Customers.MaritalStatus']=MaritalStatus
    , ['Customers.Occupation']=Occupation
    , ['Customers.RegionCountryName']=RegionCountryName
    , ['Customers.StateProvinceName']=StateProvinceName) 
    on ($left.CustomerKey == $right.['Customers.CustomerKey'])
| where ((DateKey > todatetime("20240101")) 
    and (DateKey < todatetime("20240201")))
| project CustomerKey, DateKey, TotalCost
| sort by CustomerKey asc nulls first
, DateKey asc nulls first

I would have written the query as follows:

// transformed final query
NewSales
| where (DateKey > todatetime("20240101")) 
    and (DateKey < todatetime("20240201"))
| join kind=inner (
    Customers
    | project CustomerKey
        , FirstName
        , LastName 
    ) on $left.CustomerKey == $right.CustomerKey
| project CustomerKey
    , CustomerName = strcat(FirstName, ' ', LastName)
    , DateKey
    , TotalCost
| sort by CustomerName asc nulls first
    , DateKey asc nulls first

So, it makes sense to create the backbone of a query, translate it to KQL via explain, remove the unnecessary columns and formatting, and then add what's missing. Once the patterns are mastered, there's probably no need to use the translation tool, though there may be exceptions. Anyway, the translation tool helps considerably in learning. Big kudos to the development team!

Notes:
1) The above queries ignore the fact that Customer information is also available in the NewSales table, thus making the joins unnecessary. The joins were considered only for exemplification purposes. Similar joins might still be needed for checking the "quality" of the data (e.g. for dimensions that change over time). Even if such "surprises" shouldn't appear by design, real-life designs continue to surprise...
2) Queries should be less verbose by design (aka simplicity by design)! The more unnecessary code is added, the higher the chances for errors to be overlooked, and the more time is needed to understand and validate the queries!

Happy coding!


