Showing posts with label anti-join. Show all posts
Showing posts with label anti-join. Show all posts

05 February 2025

🌌🏭KQL Reloaded: First Steps (Part IV: Left, Right, Anti-Joins and Unions)

In a standard scenario there is a fact table and multiple dimension table (see previous post), though one can look at the same data from multiple perspectives. In KQL it's recommended to start with the fact table, however in some reports one needs records from the dimension table independently whether there are any records in the fact tale.  

For example, it would be useful to show all the products, independently whether they were sold or not. It's what the below query does via a RIGHT JOIN between the fact table and the Customer dimension:

// totals by customer via right join
NewSales
| where SalesAmount <> 0 and ProductCategoryName == 'TV and Video'
| where DateKey >=date(2023-02-01) and DateKey < datetime(2023-03-01)
| summarize record_count = count()
    , TotalCost = sum(TotalCost) by CustomerKey
| join kind=rightouter (
    Customers
    | where RegionCountryName in ('Canada', 'Australia')
    | project CustomerKey, RegionCountryName, CustomerName = strcat(FirstName, ' ', LastName)
    )
    on CustomerKey
| project RegionCountryName, CustomerName, TotalCost
//| summarize record_count = count() 
| order by CustomerName asc

The Product details can be added then via a LEFT JOIN between the fact table and the Product dimension:

// total by customer via right join with product information
NewSales
| where SalesAmount <> 0 and ProductCategoryName == 'TV and Video'
| where DateKey >=date(2023-02-01) and DateKey < datetime(2023-03-01)
| summarize record_count = count()
    , TotalCost = sum(TotalCost)
    , FirstPurchaseDate = min(DateKey)
    , LastPurchaseDate = max(DateKey) by CustomerKey, ProductKey
| join kind=rightouter (
    Customers
    | where RegionCountryName in ('Canada', 'Australia')
    | project CustomerKey, RegionCountryName, CustomerName = strcat(FirstName, ' ', LastName)
    )
    on CustomerKey
| join kind=leftouter (
    Products 
    | where  ProductCategoryName == 'TV and Video'
    | project ProductKey, ProductName 
    )
    on ProductKey
| project RegionCountryName, CustomerName, ProductName, TotalCost, FirstPurchaseDate, LastPurchaseDate, record_count
//| where record_count>1 // multiple purchases
| order by CustomerName asc, ProductName asc

These kind of queries need adequate validation and for this it might be needed to restructure the queries. 

// validating the multiple records (defailed)
NewSales
| where SalesAmount <> 0 and ProductCategoryName == 'TV and Video'
| where DateKey >=date(2023-02-01) and DateKey < datetime(2023-03-01)
| lookup Products on ProductKey
| lookup Customers on CustomerKey 
| where  FirstName == 'Alexandra' and LastName == 'Sanders'
| project CustomerName = strcat(FirstName, ' ', LastName), ProductName, TotalCost, DateKey, ProductKey, CustomerKey 

// validating the multiple records against the fact table (detailed)
NewSales
| where SalesAmount <> 0 and ProductCategoryName == 'TV and Video'
| where DateKey >=date(2023-02-01) and DateKey < datetime(2023-03-01)
| where CustomerKey == 14912

Mixing RIGHT and  LEFT joins in this way increases the complexity of the queries and sometimes comes with a burden for validating the logic. In SQL one could prefer to start with the Customer table, add the summary data and the other dimensions. In this way one can do a total count individually starting with the Customer table and adding each join which reviewing the record count for each change. 

In special scenario instead of a RIGHT JOIN one could use a FULL JOIN and add the Customers without any orders via a UNION. In some scenarios this approach can offer even a better performance. For this approach, one needs to get the Customers without Orders via an anti-join, more exactly a   rightanti:

// customers without orders
NewSales
| where SalesAmount <> 0 and ProductCategoryName == 'TV and Video'
| where DateKey >=date(2023-02-01) and DateKey < datetime(2023-03-01)
| summarize record_count = count()
    , TotalCost = sum(TotalCost)
    , FirstPurchaseDate = min(DateKey)
, LastPurchaseDate = max(DateKey) by CustomerKey, ProductKey
| join kind=rightanti (
    Customers
    | project CustomerKey, RegionCountryName, CustomerName = strcat(FirstName, ' ', LastName)
    )
    on CustomerKey
| project RegionCountryName, CustomerName, ProductName = '', TotalCost = 0
//| summarize count()
| order by CustomerName asc, ProductName asc

And now, joining the Customers with orders with the ones without orders gives an overview of all the customers:

// total by product via lookup with table alias
NewSales
| where SalesAmount <> 0 and ProductCategoryName == 'TV and Video'
| where DateKey >=date(2023-02-01) and DateKey < datetime(2023-03-01)
| summarize record_count = count()
    , TotalCost = sum(TotalCost) by CustomerKey //, ProductKey 
//| lookup Products on ProductKey 
| lookup Customers on CustomerKey
| project RegionCountryName
    , CustomerName = strcat(FirstName, ' ', LastName)
    //, ProductName
    , TotalCost
//| summarize count()
| union withsource=SourceTable kind=outer (
    // customers without orders
    NewSales
    | where SalesAmount <> 0 and ProductCategoryName == 'TV and Video'
    | where DateKey >=date(2023-02-01) and DateKey < datetime(2023-03-01)
    | summarize record_count = count()
        , TotalCost = sum(TotalCost)
        , FirstPurchaseDate = min(DateKey)
        , LastPurchaseDate = max(DateKey) by CustomerKey
        //, ProductKey
    | join kind=rightanti (
        Customers
        | project CustomerKey
            , RegionCountryName
            , CustomerName = strcat(FirstName, ' ', LastName)
        )
        on CustomerKey
    | project RegionCountryName
        , CustomerName
        //, ProductName = ''
        , TotalCost = 0
    //| summarize count()
)
//| summarize count()
| order by CustomerName asc
//, ProductName asc

And, of course, the number of records returned by the three queries must match. The information related to the Product were left out for this version, though it can be added as needed. Unfortunately, there are queries more complex than this, which makes the queries more difficult to read, understand and troubleshoot. Inline views could be useful to structure the logic as needed. 

let T_customers_with_orders = view () { 
    NewSales
    | where SalesAmount <> 0 and ProductCategoryName == 'TV and Video'
    | where DateKey >=date(2023-02-01) and DateKey < datetime(2023-03-01)
    | summarize record_count = count()
        , TotalCost = sum(TotalCost) 
        by CustomerKey
         //, ProductKey 
         //, DateKey
    //| lookup Products on ProductKey 
    | lookup Customers on CustomerKey
    | project RegionCountryName
        , CustomerName = strcat(FirstName, ' ', LastName)
        //, ProductName
        //, ProductCategoryName
        , TotalCost
        //, DateKey
};
let T_customers_without_orders = view () { 
   // customers without orders
    NewSales
    | where SalesAmount <> 0 and ProductCategoryName == 'TV and Video'
    | where DateKey >=date(2023-02-01) and DateKey < datetime(2023-03-01)
    | summarize record_count = count()
        , TotalCost = sum(TotalCost)
        , FirstPurchaseDate = min(DateKey)
        , LastPurchaseDate = max(DateKey) by CustomerKey
        //, ProductKey
    | join kind=rightanti (
        Customers
        | project CustomerKey
            , RegionCountryName
            , CustomerName = strcat(FirstName, ' ', LastName)
        )
        on CustomerKey
    | project RegionCountryName
        , CustomerName
        //, ProductName = ''
        , TotalCost = 0
    //| summarize count()
};
T_customers_with_orders
| union withsource=SourceTable kind=outer T_customers_without_orders
| summarize count()

In this way the queries should be easier to troubleshoot and restructure the logic in more manageable pieces. 

It would be useful to save the definition of the view (aka stored view), however it's not possible to create views on the machine on which the tests are run:

"Principal 'aaduser=...' is not authorized to write database 'ContosoSales'."

Happy coding!

Previous Post <<||>> Next Post

03 April 2010

💎SQL Reloaded: Power of Joins V (Semi-Joins and Anti-Joins)

An incursion in the world of joins is incomplete without approaching two other important techniques for writing queries: semi-joins and anti-joins, some of the techniques specific to them allowing to speed up queries and reduce queries’ complexity. An anti-join returns rows from the left table where no matches is found in the right table, while an semi-join returns returns rows from the left table if at least a match is found in the second table. 

The typical techniques for creating semi-joins and anti-joins are based on IN or EXISTS operators, respectively their negations the NOT EXISTS and NOT IN, both operators being available in SQL Server and Oracle. According to the above definitions also the EXCEPT and INTERSECT operators introduced in horizontal joins post could be used to create anti-joins, respectively semi-joins too, they being introduced by Microsoft starting with SQL Server 2005 in order to compare whole rows rather then a smaller set of attributes. Another operators that allows creating anti-joins and semi-joins and introduced with SQL Server 2005 is the CROSS APPLY and OUTER APPLY operators.

Because the  EXCEPT and INTERSECT operators were introduced in a previous post, this post is focused on the other mentioned operators, for exemplification of the techniques following to use again the Production.Product and Purchasing.PurchaseOrderDetail tables from AdventureWorks database.

IN operator allows to test whether a given (attribute) value is in a list of values, the respective list could be based also on the values returned by another query, called actually a subquery. The output is not influenced by the eventual duplicates in the list of values, though for the sake of simplicity and testability it makes sense to provide a list of distinct values.
 
EXISTS operator provides a similar functionality like the IN operator, however it checks only whether a record exists in what is called a correlated suquery (also known as repeated subquery, is a subquery making direct reference to one or more of the attributes of the main query). The subquery used with an IN operator could be easily translated to a correlated subquery by bringing the attribute used to create the list of values into the WHERE clause.

Let’s start with a simple problem to exemplify the use of a semi-join, for example to select only the products that have a PO (Purchase Order) placed, the query could be written as follows:
 
-- Products with PO placed (IN operator) 
SELECT ITM.ProductID  
, ITM.ProductNumber  
, ITM.StandardCost 
FROM Production.Product ITM  
WHERE ITM.ProductID IN (--list of Product IDs with PO placed 
    SELECT DISTINCT POD.ProductID  
     FROM Purchasing.PurchaseOrderDetail POD  
       JOIN Purchasing.PurchaseOrderHeader POH  
          ON POD.PurchaseOrderID = POH.PurchaseOrderID 
      WHERE POH.Status IN (2, 4) -- 2-Approved, 4-Complete  
     )  

The above query could be rewritten using the EXISTS operator, the correlated query being based on ProductID attribute found in both tables:
 
-- Products with PO placed (EXISTS operator) 
SELECT ITM.ProductID  
, ITM.ProductNumber  
, ITM.StandardCost 
FROM Production.Product ITM  
WHERE EXISTS (--list of Product IDs with PO placed 
     SELECT 1 
     FROM Purchasing.PurchaseOrderDetail POD  
       JOIN Purchasing.PurchaseOrderHeader POH  
          ON POD.PurchaseOrderID = POH.PurchaseOrderID 
     WHERE POH.Status IN (2, 4) -- 2-Approved, 4-Complete  
         AND POD.ProductID = ITM.ProductID -- correlated constraint 
)  

Notes:
1.    As usual special attention must be given to the handling of NULL values either by replacing them with a default value or by handing the NULL case in the WHERE clause.
2.   The Execution Plan for the above two queries is the same, SQL Server using a Hash Match (Left Semi Join) for both operators. I was trying to find information on whether there are any recommendations that advices on the use of one of the two operators in the detriment of the other, however I found nothing in this direction for SQL Server. R. Schrag offers a rule of dumb rooted in Oracle documentation and based on the selectivity of a query, a query being highly selective if it returns the least number of rows:
"If the main body of your query is highly selective, then an EXISTS clause might be more appropriate to semi-join to the target table. However, if the main body of your query is not so selective and the subquery (the target of the semi-join) is more selective, then an IN clause might be more appropriate." [1]
Both above solutions are ideal when no data are needed from the tables used in the subquery, however quite often such a need arises, for example to show the volume of PO placed, the last PO Date, and so on, in such cases an inner join could be used instead:
 
-- Products with PO placed (inner join) SELECT ITM.ProductID  
, ITM.ProductNumber  
, ITM.StandardCost 
, POD.TotalOrderQty 
, POD.LastOrderDate 
FROM Production.Product ITM  
JOIN (--aggregated PO details 
    SELECT POD.ProductID  
    , SUM(POD.OrderQty) TotalOrderQty 
    , Max(OrderDate) LastOrderDate 
     FROM Purchasing.PurchaseOrderDetail POD  
       JOIN Purchasing.PurchaseOrderHeader POH  
          ON POD.PurchaseOrderID = POH.PurchaseOrderID 
     WHERE POH.Status IN (2, 4) -- 2-Approved, 4-Complete  
     GROUP BY POD.ProductID 
    ) POD 
    ON POD.ProductID = ITM.ProductID 

For the same purpose could be used also a CROSS APPLY taking care to eliminate the Products for which no record is retrieved in the subquery, this approach is actually a mixture between an inner join and a correlated query:

 -- Products with PO placed (CROSS APPLY) 
SELECT ITM.ProductID  
, ITM.ProductNumber  
, ITM.StandardCost  
, POD.TotalOrderQty  
, POD.LastOrderDate  
FROM Production.Product ITM  
     CROSS APPLY (--aggregated PO details  
     SELECT SUM(POD.OrderQty) TotalOrderQty  
     , Max(OrderDate) LastOrderDate  
     FROM Purchasing.PurchaseOrderDetail POD  
           JOIN Purchasing.PurchaseOrderHeader POH  
              ON POD.PurchaseOrderID = POH.PurchaseOrderID  
     WHERE POH.Status IN (2, 4) -- 2-Approved, 4-Complete  
        AND POD.ProductID = ITM.ProductID  
    ) POD  
WHERE POD.TotalOrderQty IS NOT NULL

Notes:
1.   The IN and EXISTS query versions seems to provide better performance than the use of vertical joins, though from my experience is often needed to provide information also from the subqueries used with the respective operators, therefore the inner join provides more flexibility in the detriment of relatively poorer performance.
2.  Microsoft’s implementation for IN operator allows using only single list of values, in contrast with Oracle that allows working with lists of n-uples. This problem in theory could be solved using the IN operators with a correlated subquery. An example of such a problem is finding the last PO placed for each Product:

 -- Last PO placed per Product (IN + correlated query) 
SELECT POD.PurchaseOrderDetailID 
, POD.ProductID  
, POH.OrderDate 
, POD.OrderQty 
FROM Purchasing.PurchaseOrderDetail POD  
     JOIN Purchasing.PurchaseOrderHeader POH         ON POD.PurchaseOrderID = POH.PurchaseOrderID 
WHERE POH.Status IN (2, 4) -- 2-Approved, 4-Complete  
   AND POH.OrderDate IN (--Last PO Date per Product 
      SELECT Max(POH1.OrderDate) LastOrderDate 
      FROM Purchasing.PurchaseOrderDetail POD1 
          JOIN Purchasing.PurchaseOrderHeader POH1 
             ON POD1.PurchaseOrderID = POH1.PurchaseOrderID 
    WHERE POH1.Status IN (2, 4) -- 2-Approved, 4-Complete  
AND POD1.ProductID = POD.ProductID -- correlated constraint 
) 

This might not be the best approach to solve this problem though the query is intended only to show a technique. The query could be rewritten using an inner join or the EXISTS operator, but I will show only the later approach because it comes with an interesting technique – because the LastOrderDate and ProductID are used as pair of values in order to retrieve the last PO per Product, given the fact that LastOrderDate is an aggregate value, the correlated constraint needs to be added in HAVING clause:
 
-- Last PO placed per Product (correlated query) 
SELECT POD.PurchaseOrderDetailID 
, POD.ProductID  
, POH.OrderDate 
, POD.OrderQty 
FROM Purchasing.PurchaseOrderDetail POD  
     JOIN Purchasing.PurchaseOrderHeader POH 
        ON POD.PurchaseOrderID = POH.PurchaseOrderID 
WHERE POH.Status IN (2, 4) -- 2-Approved, 4-Complete  
     AND EXISTS(--Last PO Date per Product 
     SELECT 1 
     FROM Purchasing.PurchaseOrderDetail POD1 
          JOIN Purchasing.PurchaseOrderHeader POH1 
             ON POD1.PurchaseOrderID = POH1.PurchaseOrderID 
     WHERE POH1.Status IN (2, 4) -- 2-Approved, 4-Complete  
          AND POD1.ProductID = POD.ProductID -- correlated constraint 
     GROUP BY POD1.ProductID 
     HAVING Max(POH1.OrderDate) = POH.OrderDate 
     ) 

The NOT IN and NOT EXISTS versions of the operators simply negates the use of IN and EXISTS operators, showing thus the records for which a match is not found. The complementary problem of showing the Products with a PO placed is to show the Products without a PO placed, for this, as previously mentioned, being enough to add the NOT keyword in from of IN, respectively EXISTS in the first two queries from this post:

-- Products without PO placed (IN operator) 
SELECT ITM.ProductID  
, ITM.ProductNumber  
, ITM.StandardCost 
FROM Production.Product ITM  
WHERE ITM.ProductID NOT IN (--list of Product IDs with PO placed 
    SELECT DISTINCT POD.ProductID 
     FROM Purchasing.PurchaseOrderDetail POD 
       JOIN Purchasing.PurchaseOrderHeader POH 
          ON POD.PurchaseOrderID = POH.PurchaseOrderID 
      WHERE POH.Status IN (2, 4) -- 2-Approved, 4-Complete  
     )  

-- Products without PO placed (EXISTS operator) 
SELECT ITM.ProductID  
, ITM.ProductNumber  
, ITM.StandardCost 
FROM Production.Product ITM  
WHERE NOT EXISTS (--list of Product IDs with PO placed 
     SELECT 1 
     FROM Purchasing.PurchaseOrderDetail POD 
       JOIN Purchasing.PurchaseOrderHeader POH 
          ON POD.PurchaseOrderID = POH.PurchaseOrderID 
     WHERE POH.Status IN (2, 4) -- 2-Approved, 4-Complete  
         AND POD.ProductID = ITM.ProductID -- correlated constraint 
)  

An alternative to NOT IN and NOT EXISTS is the use of a LEFT JOIN transposing the project of A\B problem:
 
  -- Products without PO placed (left join) 
SELECT ITM.ProductID  
, ITM.ProductNumber  
, ITM.StandardCost FROM Production.Product ITM  
    LEFT JOIN (--aggregated PO details 
    SELECT POD.ProductID  
    FROM Purchasing.PurchaseOrderDetail POD 
       JOIN Purchasing.PurchaseOrderHeader POH 
          ON POD.PurchaseOrderID = POH.PurchaseOrderID 
     WHERE POH.Status IN (2, 4) -- 2-Approved, 4-Complete  
     GROUP BY POD.ProductID 
    ) POD 
    ON POD.ProductID = ITM.ProductID   
WHERE POD.ProductID IS NULL 

The CROSS APPLY could be used too, this time the WHERE clause from the main query being modified to show only the records without a match in CROSS APPLY, for this needing to be used a "disguised" aggregate too:
 
 -- Products without PO placed (CROSS APPLY) 
SELECT ITM.ProductID  
, ITM.ProductNumber  
, ITM.StandardCost  FROM Production.Product ITM  
    CROSS APPLY (-- PO details  
    SELECT Max(1) HasPOPlaced 
    FROM Purchasing.PurchaseOrderDetail POD  
          JOIN Purchasing.PurchaseOrderHeader POH  
             ON POD.PurchaseOrderID = POH.PurchaseOrderID  
    WHERE POH.Status IN (2, 4) -- 2-Approved, 4-Complete  
        AND POD.ProductID = ITM.ProductID  
    ) POD 
WHERE POD.HasPOPlaced IS NULL 

As can be seen there are multiple alternative ways of solving the same problem, none of the above described patters could be considered as best approach because in the end must be weighted between performance, flexibility, reusability and complexity.  In many cases the performance of a query is the most important factor, though the performance of any of the above techniques is relative because it needs to be considered factors like table structures and relations’ cardinality, server’s configuration and available resources, etc.

References:
[1] Schrag R. (2005). Speeding Up Queries with Semi-Joins and Anti-Joins: How Oracle Evaluates EXISTS, NOT EXISTS, IN, and NOT IN. [link]

30 March 2010

💎SQL Reloaded: The Power of Joins – Part I: Introduction

Joins are the heart of RDBMS, they allowing to retrieve data (bring data together in a single result dataset) from multiple (related) table-like objects (tables, views, user-defined functions) by defining how and which  records are selected from such objects. In order to reduce the complexity of such a topic I will introduce the joins with regard to tables, the concepts applying also to views and user-defined functions, especially when they are regarded as “virtual” tables. In theory any two tables could be joined independently on whether a direct or indirect relation exists between them, it can be a foreign key, a natural key, a date or any other attribute that could be used to get a certain view into the data. Typically a join involves a minimum of two tables, but before going dipper into the topic it makes sense to look a little at the various terms used to refer to the tables joined and the anatomy of a join.

There are various ways of naming the tables involved in a join:
- left vs. right tables, where the left table is the first listed table participating in the join and the right table the second listed table participating in the join;
- parent vs. dependent tables, refers to tables between which exists a direct relationship, in other words exists a primary key and a foreign key, the dependent table being the table containing the foreign key, and the parent table the table containing the primary key;
- parent vs. child tables, is similar with the previous definition, the child table being an alternative naming for the dependent table, in addition we can talk also about grandchildren when a child table is referenced at its turn by other table called thus, when the three tables are considered together, the grandchild table;
- referent vs. referenced tables, another alternative for naming the tables between which exists a direct relationship, the referenced table being the table containing the foreign key, and the referent table the ones containing the primary key;
- inner vs. outer tables, these terms are used only in the case of outer joins, the inner table being the table from which are selected all the records, while the outer table is the table from which are selected only the matched records, therefore the two are also called sometimes the row-preserving table, respectively the null-supplying table;

None of the above pair of naming conventions are perfect because they can’t be used to address all the join situations in which tables are involved, and some of the terms may be used interchangely, especially when a term is preferred in order to send across adequate meaning, for example referring to a table as a inner table then we imply also that we deal with a left or right join. I realized that in time I used all the the above terms in a situation or another, however I prefer to use the left vs. right tables denomination, and occasionally inner vs. outer tables when referring to left or right joins.

In addition to the two tables joined a query typically includes also a join operator that specifies the join type (union, inner, outer or cross join), the join constraints that specifies which attributes are used to join the tables and the logical operators used, and eventually the non-join constraints referring to other constraints than the join constraints. I’m saying typically because either of the two could be missing, thus if the the join constraint and operator miss then we deal with a Cartesian join, while if the join constraint is appearing in the WHERE instead of FROM clause then there is no need to specify the join operator or in case of a UNION there is no need to specify the join constraint, though I will come back later on this topic.

The work with joins is deeply rooted in the Set Theory (the tables joined referred also as datasets) and the set operations – union, subtraction and intersection, their projection in world of databases coming, from my point of view and not sure if the database literature really discusses about this, into two flavors – vertical vs. horizontal joining. By vertical joining I’m referring to the intrinsic character of joins, in the result dataset the attributes from the two joined datasets appearing on the same row, while in the horizontal joining they never appear on the same row, thus a row being formed exclusively from the records of only one of the tables involved in the join. Thus we can discuss about vertical/horizontal joins as projection of union, intersection and subtraction operations, each of them with different connotations, the horizontal operations being actually similar to the ones from set theory. If we discuss about joins then we inevitably we have to mention also the anti-joins and semi-joins, a special type of queries based on NOT IN, respectively EXISTS operators.

Related Posts Plugin for WordPress, Blogger...

About Me

My photo
Koeln, NRW, Germany
IT Professional with more than 24 years experience in IT in the area of full life-cycle of Web/Desktop/Database Applications Development, Software Engineering, Consultancy, Data Management, Data Quality, Data Migrations, Reporting, ERP implementations & support, Team/Project/IT Management, etc.