In SQL Server and other RDBMS databases there are many scenarios in which one needs information from a fact table based on a dimension table without requiring information from the dimension table.
Correlated Subquery via EXISTS
Before considering the main example, let's start with a simple subquery:
// subquery in SQL -- explain SELECT CustomerKey , ProductKey , ProductName FROM NewSales WHERE ProductKey IN ( SELECT DISTINCT ProductKey FROM Products WHERE ProductSubcategoryName = 'MP4&MP3' ) // subquery in KQL NewSales | where ProductKey in ( (Products | where (ProductSubcategoryName == "MP4&MP3") | project ProductKey | distinct *)) | project CustomerKey, ProductKey, ProductName
Of course, the ProductKey is unique by design, though there can be dimension, fact tables or subqueries in which the value is not unique.
Now let's consider the correlated subquery pattern, which should provide the same outcome as above, though in RDBMS there are scenarios in which it provides better performance, especially when the number of values from subquery is high.
// correlated subquery in SQL -- explain SELECT CustomerKey , ProductKey , ProductName FROM NewSales WHERE EXISTS ( SELECT Products.ProductKey FROM Products WHERE NewSales.ProductKey = Products.ProductKey)
Unfortunately, trying to translate the code via explain leads to the following error, which confirms that the syntax is not supported in KQL (see [1]):
"Error: Reference to missing column 'NewSales.ProductKey'"
Fortunately, in this case one can use the first version of the query.
Correlated Subquery via CROSS APPLY
Before creating the main query, let's look at the inner query and check whether it gets correctly translate to KQL:
// subquery logic -- explain SELECT sum(TotalCost) TotalCost FROM NewSales WHERE DateKey > '20240101' and DateKey <'20240201'
Now, let's bring the logic within the CROSS APPLY:
// correlated subquery in SQL -- explain SELECT ProductKey , ProductName , TotalCost FROM Products CROSS APPLY ( SELECT sum(TotalCost) TotalCost FROM NewSales WHERE DateKey > '20240101' and DateKey <'20240201' AND Products.ProductKey = NewSales.ProductKey ) DAT
Running the above code leads to the following error:
"Sql node of type 'Microsoft.SqlServer.TransactSql.ScriptDom.UnqualifiedJoin' is not implemented"
Unfortunately, many SQL queries are written following this pattern, especially when an OUTER CROSS APPLY is used, retrieving thus all the records from the dimension table.
In this case one can rewrite the query via a RIGHT JOIN:
// correlated subquery in SQL -- explain SELECT PRD.ProductKey , PRD.ProductName , SAL.TotalCost FROM Products PRD LEFT JOIN ( SELECT ProductKey , sum(TotalCost) TotalCost FROM NewSales WHERE DateKey > '20240101' and DateKey <'20240201' GROUP BY ProductKey ) SAL ON PRD.ProductKey = SAL.ProductKey // direct translation of the query Products | join kind=leftouter (NewSales | where ((DateKey > todatetime("20240101")) and (DateKey < todatetime("20240201"))) | summarize TotalCost=sum(TotalCost) by ProductKey | project ProductKey, TotalCost ) on ($left.ProductKey == $right.ProductKey) | project ProductKey, ProductName, TotalCost //| where isnull(TotalCost) //| summarize record_number = count() // query after restructuring NewSales | where ((DateKey > todatetime("20240101")) and (DateKey < todatetime("20240201"))) | summarize TotalCost=sum(TotalCost) by ProductKey | join kind=rightouter ( Products | project ProductKey, ProductName ) on ($left.ProductKey == $right.ProductKey) | project ProductKey, ProductName, TotalCost //| where isnull(TotalCost) //| summarize record_number = count()
During transformations it's important to check whether the number of records changes between the various versions of the query (including the most general version in which filtering constraints were applied).
Especially when SQL solutions are planned to be migrated to KQL, it's important to know which query patterns can be used in KQL.
Happy coding!
Previous Post <<||>> Next Post
References:
[1] GitHib (2024) Module Ex-01 - Advanced KQL [link]