As data is usually dispersed over multiple tables, in any query languages is quintessential to know how to bring the data together for further analysis, joins allowing to achieve this is SQL as well in KQL. One usually starts with the main table and joins the further tables to it. In a data warehouse, the main table is a fact table to which the dimension tables are further joined to add details, respectively to filter on the dimensions.
Before starting to build any logic, it's important to get an idea of tables' cardinality and which are the keys on which the data to be join. Some models, like the one in the ContosoSales model, are straightforward and the columns participating in the joins have the same names, or at least similar names.
When writing and testing queries, one can start with a small subset of the data, join the tables together, and at the end validate the logic as needed. Selecting a small interval (e.g. a month, week, or even a day) and the records for 1-2 representative records from a dimensional table (e.g. Customers) helps in the process for minimizing the data movement and
// check how many records are available // Sales table: 28980 based on the filter NewSales | where SalesAmount <> 0 and ProductCategoryName == 'TV and Video' | where DateKey >=date(2023-02-01) and DateKey < datetime(2023-03-01) | summarize record_count = count() // Products: 2517 records Products | summarize record_count = count() // Customers: 18484 records Customers | summarize record_count = count()
// review the data // Sales table: 28980 based on the filter NewSales | where SalesAmount <> 0 and ProductCategoryName == 'TV and Video' | where DateKey >=date(2023-02-01) and DateKey < datetime(2023-03-01) | take 10 // Products: 2517 records Products | take 10 // Customers: 18484 records Customers | take 10
In KQL one must start with the fact table (e.g NewSales) and add the dimensions accordingly. Alternatively, it's useful to group only by ProductKey, respectively by CustomerKey.
// review the basis output by Products & Customers NewSales | where SalesAmount <> 0 and ProductCategoryName == 'TV and Video' | where DateKey >=date(2023-02-01) and DateKey < datetime(2023-03-01) | summarize record_count = count() , TotalCost = sum(TotalCost) by ProductKey, CustomerKey | order by TotalCost desc | summarize result_record_count = count()
Now let's join the Product dimension table to the NewSales fact table via the join operator:
// total by product via join NewSales | where SalesAmount <> 0 and ProductCategoryName == 'TV and Video' | where DateKey >=date(2023-02-01) and DateKey < datetime(2023-03-01) | summarize record_count = count() , TotalCost = sum(TotalCost) by ProductKey | join kind=inner ( Products | where ProductCategoryName == 'TV and Video' ) on ProductKey | project ProductName, TotalCost //| summarize record_count = count() | order by TotalCost desc
One can join now also the second dimension table and eventually select only the columns in scope, respectively do further calculations (see CustomerName):
// total by product & customer via joins NewSales | where SalesAmount <> 0 and ProductCategoryName == 'TV and Video' | where DateKey >=date(2023-02-01) and DateKey < datetime(2023-03-01) | summarize record_count = count() , TotalCost = sum(TotalCost) by ProductKey, CustomerKey | join kind=inner ( Products | where ProductCategoryName == 'TV and Video' | project ProductKey, ProductName ) on ProductKey | join kind=inner ( Customers | project CustomerKey, CustomerName = strcat(FirstName, ' ', LastName) ) on CustomerKey | project CustomerName, ProductName, TotalCost //| summarize record_count = count() | order by CustomerName asc, ProductName asc
Once the query built, one can go ahead and remove the pieces of logic not needed, an reevaluate the number of records with a count (see commented line).
Alternatively, one can rewrite the above query as follows via the lookup operators:
// total by Product & Customer via lookup
NewSales
| where SalesAmount <> 0 and ProductCategoryName == 'TV and Video'
| where DateKey >=date(2023-02-01) and DateKey < datetime(2023-03-01)
| summarize record_count = count()
, TotalCost = sum(TotalCost) by ProductKey, CustomerKey
| lookup Products on ProductKey
| lookup Customers on CustomerKey
//| summarize record_count = count()
| project CustomerName = strcat(FirstName, ' ', LastName), ProductName, TotalCost
The two last queries should produce the same results. The first query allows more flexibility in what concerns the constraints applied, while the second is more compact while it also allows to easier test how the number of records changes by added each join. This set of tests is done to make sure that duplicates aren't created in the process, respectively that records are not removed unnecessarily from the logic.
// total by product via lookup with table alias NewSales | where SalesAmount <> 0 and ProductCategoryName == 'TV and Video' | where DateKey >=date(2023-02-01) and DateKey < datetime(2023-03-01) | summarize record_count = count() , TotalCost = sum(TotalCost) by ProductKey, CustomerKey | lookup Products on $left.ProductKey == $right.ProductKey | lookup Customers on $left.CustomerKey == $right.CustomerKey | project CustomerName = strcat(FirstName, ' ', LastName), ProductName, TotalCost
1) For readability purposes, it might be useful to use indentation for the lines that are continuation of the previous line. Therefore, if the line doesn't start with a pipe ("|"), the next line starts a few characters (a tab) further.
Previous Post <<||>> Next Post
References:
[1] Microsoft Learn (2024) Kusto Tutorial: Join data from multiple tables [link]
No comments:
Post a Comment