The last post exemplified the use of "explain" to translate queries from SQL to KQL. The current post attempts to test the feature based on the various join constructs available in SQL by using the NewSales and Products tables. The post presumes that the reader as a basic understanding of the join types from SQL-based environments.
Inner Join
// full join in SQL -- explain SELECT NewSales.CustomerKey , NewSales.ProductKey , Products.ProductName FROM NewSales JOIN Products ON NewSales.ProductKey = Products.ProductKey // full join in KQL NewSales | join kind=inner ( Products | project ProductKey, ProductName ) on ($left.ProductKey == $right.ProductKey) | project CustomerKey, ProductKey, ProductName | limit 10
A full join is probably the most used type of join given that fact tables presume the existence of dimensions, even if poor data warehousing design can lead also to exception. The join retrieves all the data matching from both tables, including the eventual duplicates from both sides of the join.
Left Join
// left join in SQL -- explain SELECT NewSales.CustomerKey , NewSales.ProductKey , Products.ProductName FROM NewSales LEFT JOIN Products ON NewSales.ProductKey = Products.ProductKey // left join in KQL NewSales | join kind=leftouter ( Products | project ProductKey
, Product = ProductName ) on ($left.ProductKey == $right.ProductKey) | where isnull(Product) | project CustomerKey , ProductKey , ProductName | limit 10
A left join retrieves all the records from the left table, typically the fact table, independently whether records were found in the dimension table. One can check whether mismatches exist by retrieving the records where no match was found.
Right Join
// right join in SQL -- explain SELECT NewSales.CustomerKey , Products.ProductKey , Products.ProductName FROM NewSales RIGHT JOIN Products ON NewSales.ProductKey = Products.ProductKey // right join in KQL NewSales | join kind=rightouter ( Products | project DimProductKey = ProductKey , DimProductName = ProductName ) on ($left.ProductKey == $right.DimProductKey) | where isnull(ProductKey) | project CustomerKey , DimProductKey , DimProductName | limit 10
A right join retrieves the records from the dimension together with the matches from the fact table, independently whether a match was found in the fact table.
Full Outer Join
// full outer join in SQL -- explain SELECT NewSales.CustomerKey , Coalesce(NewSales.ProductKey, Products.ProductKey) ProductKey , Coalesce(NewSales.ProductName, Products.ProductName) ProductName FROM NewSales FULL OUTER JOIN Products ON NewSales.ProductKey = Products.ProductKey // full outer join in KQL NewSales | join kind=fullouter ( Products | project DimProductKey = ProductKey , DimProductName = ProductName ) on ($left.ProductKey == $right.DimProductKey) //| where isnull(ProductKey) | project CustomerKey , ProductKey = coalesce(ProductKey, DimProductKey) , ProductName = coalesce(ProductName, DimProductName) | limit 10
A full outer join retrieves all the data from both sides of the join independently on whether a match is found. In RDBMS this type of join performs poorly especially when further joins are considered, respectively when many records are involved on both sides of the join. Therefore it should be avoided when possible, though in many cases it might be the only feasible solution. There are also alternatives that involve a UNION between a LEFT JOIN and a RIGHT JOIN, the letter retrieving only the records which is not found in the fact table (see last query from a previous post). This can be a feasible solution when data sharding is involved.
Notes:
1) If one ignores the unnecessary logic introduced by the translation via explain, the
tool is excellent for learning KQL. It would be interesting to understand why
the tool used a certain complex translation over another, especially when
there's a performance benefit in the use of a certain piece of code.
2) Also in SQL-based queries it's recommended to start with the fact table, respectively with the table having the highest cardinality and/or the lowest level of detail, though the database engine might find an optimal plan independently of which table was written first.
Happy coding!
No comments:
Post a Comment