SQL Troubles: anti-join

Showing posts with label anti-join. Show all posts

03 April 2010

SQL Reloaded: Power of Joins V (Semi-Joins and Anti-Joins)

     An incursion in the world of joins is incomplete without approaching two other important techniques for writing queries: semi-joins and anti-joins, some of the techniques specific to them allowing to speed up queries and reduce queries’ complexity. An anti-join returns rows from the left table where no matches is found in the right table, while an semi-join returns returns rows from the left table if at least a match is found in the second table. The typical techniques for creating semi-joins and anti-joins are based on IN or EXISTS operators, respectively their negations the NOT EXISTS and NOT IN, both operators being available in SQL Server and Oracle. According to the above definitions also the EXCEPT and INTERSECT operators introduced in horizontal joins post could be used to create anti-joins, respectively semi-joins too, they being introduced by Microsoft starting with SQL Server 2005 in order to compare whole rows rather then a smaller set of attributes. Another operators that allows creating anti-joins and semi-joins and introduced with SQL Server 2005 is the CROSS APPLY and OUTER APPLY operators.

     Because the EXCEPT and INTERSECT operators were introduced in a previous post, this post is focused on the other mentioned operators, for exemplification of the techniques following to use again the Production.Product and Purchasing.PurchaseOrderDetail tables from AdventureWorks database.

      IN operator allows to test whether a given (attribute) value is in a list of values, the respective list could be based also on the values returned by another query, called actually a subquery. The output is not influenced by the eventual duplicates in the list of values, though for the sake of simplicity and testability it makes sense to provide a list of distinct values.

      EXISTS operator provides a similar functionality like the IN operator, however it checks only whether a record exists in what is called a correlated suquery (also known as repeated subquery, is a subquery making direct reference to one or more of the attributes of the main query). The subquery used with an IN operator could be easily translated to a correlated subquery by bringing the attribute used to create the list of values into the WHERE clause.

     Let’s start with a simple problem to exemplify the use of a semi-join, for example to select only the products that have a PO (Purchase Order) placed, the query could be written as follows:

-- Products with PO placed (IN operator) 
SELECT ITM.ProductID  
, ITM.ProductNumber  
, ITM.StandardCost 
FROM Production.Product ITM  
WHERE ITM.ProductID IN (--list of Product IDs with PO placed 
    SELECT DISTINCT POD.ProductID  
     FROM Purchasing.PurchaseOrderDetail POD  
       JOIN Purchasing.PurchaseOrderHeader POH  
          ON POD.PurchaseOrderID = POH.PurchaseOrderID 
      WHERE POH.Status IN (2, 4) -- 2-Approved, 4-Complete  
     )

The above query could be rewritten using the EXISTS operator, the correlated query being based on ProductID attribute found in both tables:

-- Products with PO placed (EXISTS operator) 
SELECT ITM.ProductID  
, ITM.ProductNumber  
, ITM.StandardCost 
FROM Production.Product ITM  
WHERE EXISTS (--list of Product IDs with PO placed 
     SELECT 1 
     FROM Purchasing.PurchaseOrderDetail POD  
       JOIN Purchasing.PurchaseOrderHeader POH  
          ON POD.PurchaseOrderID = POH.PurchaseOrderID 
     WHERE POH.Status IN (2, 4) -- 2-Approved, 4-Complete  
         AND POD.ProductID = ITM.ProductID -- correlated constraint 
)

Notes:
1.    As usual special attention must be given to the handling of NULL values either by replacing them with a default value or by handing the NULL case in the WHERE clause.
2.   The Execution Plan for the above two queries is the same, SQL Server using a Hash Match (Left Semi Join) for both operators. I was trying to find information on whether there are any recommendations that advices on the use of one of the two operators in the detriment of the other, however I found nothing in this direction for SQL Server. R. Schrag offers a rule of dumb rooted in Oracle documentation and based on the selectivity of a query, a query being highly selective if it returns the least number of rows:
     “If the main body of your query is highly selective, then an EXISTS clause might be more appropriate to semi-join to the target table. However, if the main body of your query is not so selective and the subquery (the target of the semi-join) is more selective, then an IN clause might be more appropriate.” [1]

     Both above solutions are ideal when no data are needed from the tables used in the subquery, however quite often such a need arises, for example to show the volume of PO placed, the last PO Date, and so on, in such cases an inner join could be used instead:

-- Products with PO placed (inner join) SELECT ITM.ProductID  
, ITM.ProductNumber  
, ITM.StandardCost 
, POD.TotalOrderQty 
, POD.LastOrderDate 
FROM Production.Product ITM  
JOIN (--aggregated PO details 
    SELECT POD.ProductID  
    , SUM(POD.OrderQty) TotalOrderQty 
    , Max(OrderDate) LastOrderDate 
     FROM Purchasing.PurchaseOrderDetail POD  
       JOIN Purchasing.PurchaseOrderHeader POH  
          ON POD.PurchaseOrderID = POH.PurchaseOrderID 
     WHERE POH.Status IN (2, 4) -- 2-Approved, 4-Complete  
     GROUP BY POD.ProductID 
    ) POD 
    ON POD.ProductID = ITM.ProductID

For the same purpose could be used also a CROSS APPLY taking care to eliminate the Products for which no record is retrieved in the subquery, this approach is actually a mixture between an inner join and a correlated query:

 -- Products with PO placed (CROSS APPLY) 
SELECT ITM.ProductID  
, ITM.ProductNumber  
, ITM.StandardCost  
, POD.TotalOrderQty  
, POD.LastOrderDate  
FROM Production.Product ITM  
     CROSS APPLY (--aggregated PO details  
     SELECT SUM(POD.OrderQty) TotalOrderQty  
     , Max(OrderDate) LastOrderDate  
     FROM Purchasing.PurchaseOrderDetail POD  
           JOIN Purchasing.PurchaseOrderHeader POH  
              ON POD.PurchaseOrderID = POH.PurchaseOrderID  
     WHERE POH.Status IN (2, 4) -- 2-Approved, 4-Complete  
        AND POD.ProductID = ITM.ProductID  
    ) POD  
WHERE POD.TotalOrderQty IS NOT NULL

Notes:
1. The IN and EXISTS query versions seems to provide better performance than the use of vertical joins, though from my experience is often needed to provide information also from the subqueries used with the respective operators, therefore the inner join provides more flexibility in the detriment of relatively poorer performance.
2. Microsoft’s implementation for IN operator allows using only single list of values, in contrast with Oracle that allows working with lists of n-uples. This problem in theory could be solved using the IN operators with a correlated subquery. An example of such a problem is finding the last PO placed for each Product:

 -- Last PO placed per Product (IN + correlated query) 
SELECT POD.PurchaseOrderDetailID 
, POD.ProductID  
, POH.OrderDate 
, POD.OrderQty 
FROM Purchasing.PurchaseOrderDetail POD  
     JOIN Purchasing.PurchaseOrderHeader POH         ON POD.PurchaseOrderID = POH.PurchaseOrderID 
WHERE POH.Status IN (2, 4) -- 2-Approved, 4-Complete  
   AND POH.OrderDate IN (--Last PO Date per Product 
      SELECT Max(POH1.OrderDate) LastOrderDate 
      FROM Purchasing.PurchaseOrderDetail POD1 
          JOIN Purchasing.PurchaseOrderHeader POH1 
             ON POD1.PurchaseOrderID = POH1.PurchaseOrderID 
    WHERE POH1.Status IN (2, 4) -- 2-Approved, 4-Complete  
AND POD1.ProductID = POD.ProductID -- correlated constraint 
)

This might not be the best approach to solve this problem though the query is intended only to show a technique. The query could be rewritten using an inner join or the EXISTS operator, but I will show only the later approach because it comes with an interesting technique – because the LastOrderDate and ProductID are used as pair of values in order to retrieve the last PO per Product, given the fact that LastOrderDate is an aggregate value, the correlated constraint needs to be added in HAVING clause:

-- Last PO placed per Product (correlated query) 
SELECT POD.PurchaseOrderDetailID 
, POD.ProductID  
, POH.OrderDate 
, POD.OrderQty 
FROM Purchasing.PurchaseOrderDetail POD  
     JOIN Purchasing.PurchaseOrderHeader POH 
        ON POD.PurchaseOrderID = POH.PurchaseOrderID 
WHERE POH.Status IN (2, 4) -- 2-Approved, 4-Complete  
     AND EXISTS(--Last PO Date per Product 
     SELECT 1 
     FROM Purchasing.PurchaseOrderDetail POD1 
          JOIN Purchasing.PurchaseOrderHeader POH1 
             ON POD1.PurchaseOrderID = POH1.PurchaseOrderID 
     WHERE POH1.Status IN (2, 4) -- 2-Approved, 4-Complete  
          AND POD1.ProductID = POD.ProductID -- correlated constraint 
     GROUP BY POD1.ProductID 
     HAVING Max(POH1.OrderDate) = POH.OrderDate 
     )

The NOT IN and NOT EXISTS versions of the operators simply negates the use of IN and EXISTS operators, showing thus the records for which a match is not found. The complementary problem of showing the Products with a PO placed is to show the Products without a PO placed, for this, as previously mentioned, being enough to add the NOT keyword in from of IN, respectively EXISTS in the first two queries from this post:

-- Products without PO placed (IN operator) 
SELECT ITM.ProductID  
, ITM.ProductNumber  
, ITM.StandardCost 
FROM Production.Product ITM  
WHERE ITM.ProductID NOT IN (--list of Product IDs with PO placed 
    SELECT DISTINCT POD.ProductID 
     FROM Purchasing.PurchaseOrderDetail POD 
       JOIN Purchasing.PurchaseOrderHeader POH 
          ON POD.PurchaseOrderID = POH.PurchaseOrderID 
      WHERE POH.Status IN (2, 4) -- 2-Approved, 4-Complete  
     )  

-- Products without PO placed (EXISTS operator) 
SELECT ITM.ProductID  
, ITM.ProductNumber  
, ITM.StandardCost 
FROM Production.Product ITM  
WHERE NOT EXISTS (--list of Product IDs with PO placed 
     SELECT 1 
     FROM Purchasing.PurchaseOrderDetail POD 
       JOIN Purchasing.PurchaseOrderHeader POH 
          ON POD.PurchaseOrderID = POH.PurchaseOrderID 
     WHERE POH.Status IN (2, 4) -- 2-Approved, 4-Complete  
         AND POD.ProductID = ITM.ProductID -- correlated constraint 
)

An alternative to NOT IN and NOT EXISTS is the use of a LEFT JOIN transposing the project of A\B problem:

  -- Products without PO placed (left join) 
SELECT ITM.ProductID  
, ITM.ProductNumber  
, ITM.StandardCost FROM Production.Product ITM  
    LEFT JOIN (--aggregated PO details 
    SELECT POD.ProductID  
    FROM Purchasing.PurchaseOrderDetail POD 
       JOIN Purchasing.PurchaseOrderHeader POH 
          ON POD.PurchaseOrderID = POH.PurchaseOrderID 
     WHERE POH.Status IN (2, 4) -- 2-Approved, 4-Complete  
     GROUP BY POD.ProductID 
    ) POD 
    ON POD.ProductID = ITM.ProductID   
WHERE POD.ProductID IS NULL

The CROSS APPLY could be used too, this time the WHERE clause from the main query being modified to show only the records without a match in CROSS APPLY, for this needing to be used a "disguised" aggregate too:

 -- Products without PO placed (CROSS APPLY) 
SELECT ITM.ProductID  
, ITM.ProductNumber  
, ITM.StandardCost  FROM Production.Product ITM  
    CROSS APPLY (-- PO details  
    SELECT Max(1) HasPOPlaced 
    FROM Purchasing.PurchaseOrderDetail POD  
          JOIN Purchasing.PurchaseOrderHeader POH  
             ON POD.PurchaseOrderID = POH.PurchaseOrderID  
    WHERE POH.Status IN (2, 4) -- 2-Approved, 4-Complete  
        AND POD.ProductID = ITM.ProductID  
    ) POD 
WHERE POD.HasPOPlaced IS NULL

As can be seen there are multiple alternative ways of solving the same problem, none of the above described patters could be considered as best approach because in the end must be weighted between performance, flexibility, reusability and complexity. In many cases the performance of a query is the most important factor, though the performance of any of the above techniques is relative because it needs to be considered factors like table structures and relations’ cardinality, server’s configuration and available resources, etc.

References:
[1] Schrag R. (2005). Speeding Up Queries with Semi-Joins and Anti-Joins: How Oracle Evaluates EXISTS, NOT EXISTS, IN, and NOT IN. [Online] Available from: http://www.dbspecialists.com/files/presentations/semijoins.html (Accessed: 3 Apr 2010)

30 March 2010

The Power of Joins – Part I: Introduction

     Joins are the heart of RDBMS, they allowing to retrieve data (bring data together in a single result dataset) from multiple (related) table-like objects (tables, views, user-defined functions) by defining how and which records are selected from such objects. In order to reduce the complexity of such a topic I will introduce the joins with regard to tables, the concepts applying also to views and user-defined functions, especially when they are regarded as “virtual” tables. In theory any two tables could be joined independently on whether a direct or indirect relation exists between them, it can be a foreign key, a natural key, a date or any other attribute that could be used to get a certain view into the data. Typically a join involves a minimum of two tables, but before going dipper into the topic it makes sense to look a little at the various terms used to refer to the tables joined and the anatomy of a join.

     There are various ways of naming the tables involved in a join:
- left vs. right tables, where the left table is the first listed table participating in the join and the right table the second listed table participating in the join;
- parent vs. dependent tables, refers to tables between which exists a direct relationship, in other words exists a primary key and a foreign key, the dependent table being the table containing the foreign key, and the parent table the table containing the primary key;
- parent vs. child tables, is similar with the previous definition, the child table being an alternative naming for the dependent table, in addition we can talk also about grandchildren when a child table is referenced at its turn by other table called thus, when the three tables are considered together, the grandchild table;
- referent vs. referenced tables, another alternative for naming the tables between which exists a direct relationship, the referenced table being the table containing the foreign key, and the referent table the ones containing the primary key;
- inner vs. outer tables, these terms are used only in the case of outer joins, the inner table being the table from which are selected all the records, while the outer table is the table from which are selected only the matched records, therefore the two are also called sometimes the row-preserving table, respectively the null-supplying table;

      None of the above pair of naming conventions are perfect because they can’t be used to address all the join situations in which tables are involved, and some of the terms may be used interchangely, especially when a term is preferred in order to send across adequate meaning, for example referring to a table as a inner table then we imply also that we deal with a left or right join. I realized that in time I used all the the above terms in a situation or another, however I prefer to use the left vs. right tables denomination, and occasionally inner vs. outer tables when referring to left or right joins.

     In addition to the two tables joined a query typically includes also a join operator that specifies the join type (union, inner, outer or cross join), the join constraints that specifies which attributes are used to join the tables and the logical operators used, and eventually the non-join constraints referring to other constraints than the join constraints. I’m saying typically because either of the two could be missing, thus if the the join constraint and operator miss then we deal with a carthezian join, while if the join constraint is appearing in the WHERE instead of FROM clause then there is no need to specify the join operator or in case of a UNION there is no need to specify the join constraint, though I will come back later on this topic.

      The work with joins is deeply rooted in the Set Theory (the tables joined referred also as datasets) and the set operations – union, subtraction and intersection, their projection in world of databases coming, from my point of view and not sure if the database literature really discusses about this, into two flavors – vertical vs. horizontal joining. By vertical joining I’m referring to the intrinsic character of joins, in the result dataset the attributes from the two joined datasets appearing on the same row, while in the horizontal joining they never appear on the same row, thus a row being formed exclusively from the records of only one of the tables involved in the join. Thus we can discuss about vertical/horizontal joins as projection of union, intersection and subtraction operations, each of them with different connotations, the horizontal operations being actually similar to the ones from set theory. If we discuss about joins then we inevitably we have to mention also the anti-joins and semi-joins, a special type of queries based on NOT IN, respectively EXISTS operators.

SQL Troubles

Pages

03 April 2010

SQL Reloaded: Power of Joins V (Semi-Joins and Anti-Joins)

30 March 2010

The Power of Joins – Part I: Introduction

About Me