03 April 2010

The Power of Joins – Part V: Semi-Joins and Anti-Joins

     An incursion in the world of joins is incomplete without approaching two other important techniques for writing queries: semi-joins and anti-joins, some of the techniques specific to them allowing to speed up queries and reduce queries’ complexity. An anti-join returns rows from the left table where no matches is found in the right table, while an semi-join returns returns rows from the left table if at least a match is found in the second table. The typical techniques for creating semi-joins and anti-joins are based on IN or EXISTS operators, respectively their negations the NOT EXISTS and NOT IN, both operators being available in SQL Server and Oracle. According to the above definitions also the EXCEPT and INTERSECT operators introduced in horizontal joins post could be used to create anti-joins, respectively semi-joins too, they being introduced by Microsoft starting with SQL Server 2005 in order to compare whole rows rather then a smaller set of attributes. Another operators that allows creating anti-joins and semi-joins and introduced with SQL Server 2005 is the CROSS APPLY and OUTER APPLY operators.

     Because the  EXCEPT and INTERSECT operators were introduced in a previous post, this post is focused on the other mentioned operators, for exemplification of the techniques following to use again the Production.Product and Purchasing.PurchaseOrderDetail tables from AdventureWorks database.

      IN operator allows to test whether a given (attribute) value is in a list of values, the respective list could be based also on the values returned by another query, called actually a subquery. The output is not influenced by the eventual duplicates in the list of values, though for the sake of simplicity and testability it makes sense to provide a list of distinct values.
      EXISTS operator provides a similar functionality like the IN operator, however it checks only whether a record exists in what is called a correlated suquery (also known as repeated subquery, is a subquery making direct reference to one or more of the attributes of the main query). The subquery used with an IN operator could be easily translated to a correlated subquery by bringing the attribute used to create the list of values into the WHERE clause.

     Let’s start with a simple problem to exemplify the use of a semi-join, for example to select only the products that have a PO (Purchase Order) placed, the query could be written as follows:
-- Products with PO placed (IN operator)
SELECT ITM.ProductID
,
ITM.ProductNumber

,
ITM.StandardCost

FROM
Production.Product ITM

WHERE
ITM.ProductID IN (--list of Product IDs with PO placed

    SELECT DISTINCT POD.ProductID
     FROM Purchasing.PurchaseOrderDetail POD
       JOIN Purchasing.PurchaseOrderHeader POH
          ON POD.PurchaseOrderID = POH.PurchaseOrderID
      WHERE POH.Status IN (2, 4) -- 2-Approved, 4-Complete

     )
 
 
      The above query could be rewritten using the EXISTS operator, the correlated query being based on ProductID attribute found in both tables:
-- Products with PO placed (EXISTS operator)
SELECT ITM.ProductID
,
ITM.ProductNumber

,
ITM.StandardCost

FROM
Production.Product ITM

WHERE
EXISTS (--list of Product IDs with PO placed

     SELECT 1
     FROM Purchasing.PurchaseOrderDetail POD
       JOIN Purchasing.PurchaseOrderHeader POH
          ON POD.PurchaseOrderID = POH.PurchaseOrderID
     WHERE POH.Status IN (2, 4) -- 2-Approved, 4-Complete

         AND POD.ProductID = ITM.ProductID -- correlated constraint
)
 
Notes:
1.    As usual special attention must be given to the handling of NULL values either by replacing them with a default value or by handing the NULL case in the WHERE clause.
2.   The Execution Plan for the above two queries is the same, SQL Server using a Hash Match (Left Semi Join) for both operators. I was trying to find information on whether there are any recommendations that advices on the use of one of the two operators in the detriment of the other, however I found nothing in this direction for SQL Server. R. Schrag offers a rule of dumb rooted in Oracle documentation and based on the selectivity of a query, a query being highly selective if it returns the least number of rows:
     “If the main body of your query is highly selective, then an EXISTS clause might be more appropriate to semi-join to the target table. However, if the main body of your query is not so selective and the subquery (the target of the semi-join) is more selective, then an IN clause might be more appropriate.” [1]

     Both above solutions are ideal when no data are needed from the tables used in the subquery, however quite often such a need arises, for example to show the volume of PO placed, the last PO Date, and so on, in such cases an inner join could be used instead:
-- Products with PO placed (inner join)
SELECT ITM.ProductID
,
ITM.ProductNumber

,
ITM.StandardCost

,
POD.TotalOrderQty

,
POD.LastOrderDate

FROM
Production.Product ITM
JOIN (--aggregated PO details

    SELECT POD.ProductID
    , SUM(POD.OrderQty) TotalOrderQty
    , Max(OrderDate) LastOrderDate
     FROM Purchasing.PurchaseOrderDetail POD
       JOIN Purchasing.PurchaseOrderHeader POH
          ON POD.PurchaseOrderID = POH.PurchaseOrderID
     WHERE POH.Status IN (2, 4) -- 2-Approved, 4-Complete

     GROUP BY POD.ProductID
    ) POD
    ON POD.ProductID = ITM.ProductID


     For the same purpose could be used also a CROSS APPLY taking care to eliminate the Products for which no record is retrieved in the subquery, this approach is actually a mixture between an inner join and a correlated query:
-- Products with PO placed (CROSS APPLY)
SELECT ITM.ProductID
,
ITM.ProductNumber

,
ITM.StandardCost

,
POD.TotalOrderQty

,
POD.LastOrderDate

FROM
Production.Product ITM
     CROSS APPLY (--aggregated PO details

     SELECT SUM(POD.OrderQty) TotalOrderQty
     , Max(OrderDate) LastOrderDate
     FROM Purchasing.PurchaseOrderDetail POD
           JOIN Purchasing.PurchaseOrderHeader POH
              ON POD.PurchaseOrderID = POH.PurchaseOrderID
     WHERE POH.Status IN (2, 4) -- 2-Approved, 4-Complete

        AND POD.ProductID = ITM.ProductID
    ) POD

WHERE
POD.TotalOrderQty IS NOT NULL


Notes:
1.   The IN and EXISTS query versions seems to provide better performance than the use of vertical joins, though from my experience is often needed to provide information also from the subqueries used with the respective operators, therefore the inner join provides more flexibility in the detriment of relatively poorer performance.
2.  Microsoft’s implementation for IN operator allows using only single list of values, in contrast with Oracle that allows working with lists of n-uples. This problem in theory could be solved using the IN operators with a correlated subquery. An example of such a problem is finding the last PO placed for each Product:
-- Last PO placed per Product (IN + correlated query)
SELECT POD.PurchaseOrderDetailID

,
POD.ProductID

,
POH.OrderDate

,
POD.OrderQty

FROM
Purchasing.PurchaseOrderDetail POD
     JOIN
Purchasing.PurchaseOrderHeader POH
        ON POD.PurchaseOrderID = POH.PurchaseOrderID

WHERE
POH.Status IN (2, 4) -- 2-Approved, 4-Complete

   AND POH.OrderDate IN (--Last PO Date per Product

      SELECT Max(POH1.OrderDate) LastOrderDate
      FROM Purchasing.PurchaseOrderDetail POD1
          JOIN Purchasing.PurchaseOrderHeader POH1
             ON POD1.PurchaseOrderID = POH1.PurchaseOrderID
    WHERE POH1.Status IN (2, 4) -- 2-Approved, 4-Complete
AND POD1.ProductID = POD.ProductID -- correlated constraint
)

      This might not be the best approach to solve this problem though the query is intended only to show a technique. The query could be rewritten using an inner join or the EXISTS operator, but I will show only the later approach because it comes with an interesting technique – because the LastOrderDate and ProductID are used as pair of values in order to retrieve the last PO per Product, given the fact that LastOrderDate is an aggregate value, the correlated constraint needs to be added in HAVING clause:
-- Last PO placed per Product (correlated query)
SELECT POD.PurchaseOrderDetailID

,
POD.ProductID

,
POH.OrderDate

,
POD.OrderQty

FROM
Purchasing.PurchaseOrderDetail POD
     JOIN Purchasing.PurchaseOrderHeader POH
        ON POD.PurchaseOrderID = POH.PurchaseOrderID

WHERE
POH.Status IN (2, 4) -- 2-Approved, 4-Complete

     AND EXISTS(--Last PO Date per Product
     SELECT 1
     FROM Purchasing.PurchaseOrderDetail POD1
          JOIN Purchasing.PurchaseOrderHeader POH1
             ON POD1.PurchaseOrderID = POH1.PurchaseOrderID
     WHERE POH1.Status IN (2, 4) -- 2-Approved, 4-Complete
          AND POD1.ProductID = POD.ProductID -- correlated constraint
     GROUP BY POD1.ProductID
     HAVING Max(POH1.OrderDate) = POH.OrderDate
     )


     The NOT IN and NOT EXISTS versions of the operators simply negates the use of IN and EXISTS operators, showing thus the records for which a match is not found. The complementary problem of showing the Products with a PO placed is to show the Products without a PO placed, for this, as previously mentioned, being enough to add the NOT keyword in from of IN, respectively EXISTS in the first two queries from this post:
-- Products without PO placed (IN operator)
SELECT ITM.ProductID

,
ITM.ProductNumber

,
ITM.StandardCost

FROM
Production.Product ITM

WHERE
ITM.ProductID NOT IN (--list of Product IDs with PO placed

    SELECT DISTINCT POD.ProductID
     FROM Purchasing.PurchaseOrderDetail POD
       JOIN Purchasing.PurchaseOrderHeader POH
          ON POD.PurchaseOrderID = POH.PurchaseOrderID
      WHERE POH.Status IN (2, 4) -- 2-Approved, 4-Complete
     )
 

-- Products without PO placed (EXISTS operator)
SELECT ITM.ProductID

,
ITM.ProductNumber

,
ITM.StandardCost

FROM
Production.Product ITM

WHERE
NOT EXISTS (--list of Product IDs with PO placed

     SELECT 1
     FROM Purchasing.PurchaseOrderDetail POD
       JOIN Purchasing.PurchaseOrderHeader POH
          ON POD.PurchaseOrderID = POH.PurchaseOrderID
     WHERE POH.Status IN (2, 4) -- 2-Approved, 4-Complete
         AND POD.ProductID = ITM.ProductID -- correlated constraint
)
 

      An alternative to NOT IN and NOT EXISTS is the use of a LEFT JOIN transposing the project of A\B problem:
-- Products without PO placed (left join)
SELECT ITM.ProductID

,
ITM.ProductNumber

,
ITM.StandardCost

FROM Production.Product ITM
    LEFT JOIN (--aggregated PO details

    SELECT POD.ProductID 
    FROM Purchasing.PurchaseOrderDetail POD
       JOIN Purchasing.PurchaseOrderHeader POH
          ON POD.PurchaseOrderID = POH.PurchaseOrderID
     WHERE POH.Status IN (2, 4) -- 2-Approved, 4-Complete
     GROUP BY POD.ProductID
    ) POD
    ON POD.ProductID = ITM.ProductID

WHERE POD.ProductID IS NULL

     The CROSS APPLY could be used too, this time the WHERE clause from the main query being modified to show only the records without a match in CROSS APPLY, for this needing to be used a "disguised" aggregate too:
-- Products without PO placed (CROSS APPLY)
SELECT ITM.ProductID
,
ITM.ProductNumber

,
ITM.StandardCost

FROM Production.Product ITM
    CROSS APPLY (-- PO details

    SELECT Max(1) HasPOPlaced
    FROM Purchasing.PurchaseOrderDetail POD
          JOIN Purchasing.PurchaseOrderHeader POH
             ON POD.PurchaseOrderID = POH.PurchaseOrderID
    WHERE POH.Status IN (2, 4) -- 2-Approved, 4-Complete

        AND POD.ProductID = ITM.ProductID
    ) POD

WHERE POD.HasPOPlaced IS NULL

    As can be seen there are multiple alternative ways of solving the same problem, none of the above described patters could be considered as best approach because in the end must be weighted between performance, flexibility, reusability and complexity.  In many cases the performance of a query is the most important factor, though the performance of any of the above techniques is relative because it needs to be considered factors like table structures and relations’ cardinality, server’s configuration and available resources, etc.

References:
[1] Schrag R. (2005). Speeding Up Queries with Semi-Joins and Anti-Joins: How Oracle Evaluates EXISTS, NOT EXISTS, IN, and NOT IN. [Online] Available from: http://www.dbspecialists.com/files/presentations/semijoins.html (Accessed: 3 Apr 2010)

No comments: