An incursion in the world of joins is incomplete without approaching two other important techniques for writing queries: semi-joins and anti-joins, some of the techniques specific to them allowing to speed up queries and reduce queries’ complexity. An
anti-join returns rows from the left table where no matches is found in the right table, while an
semi-join returns returns rows from the left table if at least a match is found in the second table. The typical techniques for creating semi-joins and anti-joins are based on IN or EXISTS operators, respectively their negations the NOT EXISTS and NOT IN, both operators being available in SQL Server and Oracle. According to the above definitions also the EXCEPT and INTERSECT operators introduced in
horizontal joins post could be used to create anti-joins, respectively semi-joins too, they being introduced by Microsoft starting with SQL Server 2005 in order to compare whole rows rather then a smaller set of attributes. Another operators that allows creating anti-joins and semi-joins and introduced with SQL Server 2005 is the CROSS APPLY and OUTER APPLY operators.
Because the EXCEPT and INTERSECT operators were introduced in a previous post, this post is focused on the other mentioned operators, for exemplification of the techniques following to use again the Production.Product and Purchasing.PurchaseOrderDetail tables from AdventureWorks database.
IN operator allows to test whether a given (attribute) value is in a list of values, the respective list could be based also on the values returned by another query, called actually a
subquery. The output is not influenced by the eventual duplicates in the list of values, though for the sake of simplicity and testability it makes sense to provide a list of distinct values.
EXISTS operator provides a similar functionality like the IN operator, however it checks only whether a record exists in what is called a
correlated suquery (also known as
repeated subquery, is a subquery making direct reference to one or more of the attributes of the main query). The subquery used with an IN operator could be easily translated to a correlated subquery by bringing the attribute used to create the list of values into the WHERE clause.
Let’s start with a simple problem to exemplify the use of a semi-join, for example to select only the products that have a PO (Purchase Order) placed, the query could be written as follows:
-- Products with PO placed (IN operator)
SELECT ITM.ProductID
, ITM.ProductNumber
, ITM.StandardCost
FROM Production.Product ITM
WHERE ITM.ProductID IN (--list of Product IDs with PO placed
SELECT DISTINCT POD.ProductID
FROM Purchasing.PurchaseOrderDetail POD
JOIN Purchasing.PurchaseOrderHeader POH
ON POD.PurchaseOrderID = POH.PurchaseOrderID
WHERE POH.Status IN (2, 4) -- 2-Approved, 4-Complete
)
The above query could be rewritten using the EXISTS operator, the correlated query being based on ProductID attribute found in both tables:
-- Products with PO placed (EXISTS operator)
SELECT ITM.ProductID
, ITM.ProductNumber
, ITM.StandardCost
FROM Production.Product ITM
WHERE EXISTS (--list of Product IDs with PO placed
SELECT 1
FROM Purchasing.PurchaseOrderDetail POD
JOIN Purchasing.PurchaseOrderHeader POH
ON POD.PurchaseOrderID = POH.PurchaseOrderID
WHERE POH.Status IN (2, 4) -- 2-Approved, 4-Complete
AND POD.ProductID = ITM.ProductID -- correlated constraint
)
Notes:
1. As usual special attention must be given to the handling of NULL values either by replacing them with a default value or by handing the NULL case in the WHERE clause.
2. The Execution Plan for the above two queries is the same, SQL Server using a Hash Match (Left Semi Join) for both operators. I was trying to find information on whether there are any recommendations that advices on the use of one of the two operators in the detriment of the other, however I found nothing in this direction for SQL Server. R. Schrag offers a rule of dumb rooted in Oracle documentation and based on the selectivity of a query, a query being highly selective if it returns the least number of rows:
“If the main body of your query is highly selective, then an EXISTS clause might be more appropriate to semi-join to the target table. However, if the main body of your query is not so selective and the subquery (the target of the semi-join) is more selective, then an IN clause might be more appropriate.” [1]
Both above solutions are ideal when no data are needed from the tables used in the subquery, however quite often such a need arises, for example to show the volume of PO placed, the last PO Date, and so on, in such cases an inner join could be used instead:
-- Products with PO placed (inner join) SELECT ITM.ProductID
, ITM.ProductNumber
, ITM.StandardCost
, POD.TotalOrderQty
, POD.LastOrderDate
FROM Production.Product ITM
JOIN (--aggregated PO details
SELECT POD.ProductID
, SUM(POD.OrderQty) TotalOrderQty
, Max(OrderDate) LastOrderDate
FROM Purchasing.PurchaseOrderDetail POD
JOIN Purchasing.PurchaseOrderHeader POH
ON POD.PurchaseOrderID = POH.PurchaseOrderID
WHERE POH.Status IN (2, 4) -- 2-Approved, 4-Complete
GROUP BY POD.ProductID
) POD
ON POD.ProductID = ITM.ProductID
For the same purpose could be used also a CROSS APPLY taking care to eliminate the Products for which no record is retrieved in the subquery, this approach is actually a mixture between an inner join and a correlated query:
-- Products with PO placed (CROSS APPLY)
SELECT ITM.ProductID
, ITM.ProductNumber
, ITM.StandardCost
, POD.TotalOrderQty
, POD.LastOrderDate
FROM Production.Product ITM
CROSS APPLY (--aggregated PO details
SELECT SUM(POD.OrderQty) TotalOrderQty
, Max(OrderDate) LastOrderDate
FROM Purchasing.PurchaseOrderDetail POD
JOIN Purchasing.PurchaseOrderHeader POH
ON POD.PurchaseOrderID = POH.PurchaseOrderID
WHERE POH.Status IN (2, 4) -- 2-Approved, 4-Complete
AND POD.ProductID = ITM.ProductID
) POD
WHERE POD.TotalOrderQty IS NOT NULL
Notes:
1. The IN and EXISTS query versions seems to provide better performance than the use of vertical joins, though from my experience is often needed to provide information also from the subqueries used with the respective operators, therefore the inner join provides more flexibility in the detriment of relatively poorer performance.
2. Microsoft’s implementation for IN operator allows using only single list of values, in contrast with Oracle that allows working with lists of n-uples. This problem in theory could be solved using the IN operators with a correlated subquery. An example of such a problem is finding the last PO placed for each Product:
-- Last PO placed per Product (IN + correlated query)
SELECT POD.PurchaseOrderDetailID
, POD.ProductID
, POH.OrderDate
, POD.OrderQty
FROM Purchasing.PurchaseOrderDetail POD
JOIN Purchasing.PurchaseOrderHeader POH ON POD.PurchaseOrderID = POH.PurchaseOrderID
WHERE POH.Status IN (2, 4) -- 2-Approved, 4-Complete
AND POH.OrderDate IN (--Last PO Date per Product
SELECT Max(POH1.OrderDate) LastOrderDate
FROM Purchasing.PurchaseOrderDetail POD1
JOIN Purchasing.PurchaseOrderHeader POH1
ON POD1.PurchaseOrderID = POH1.PurchaseOrderID
WHERE POH1.Status IN (2, 4) -- 2-Approved, 4-Complete
AND POD1.ProductID = POD.ProductID -- correlated constraint
)
This might not be the best approach to solve this problem though the query is intended only to show a technique. The query could be rewritten using an inner join or the EXISTS operator, but I will show only the later approach because it comes with an interesting technique – because the LastOrderDate and ProductID are used as pair of values in order to retrieve the last PO per Product, given the fact that LastOrderDate is an aggregate value, the correlated constraint needs to be added in HAVING clause:
-- Last PO placed per Product (correlated query)
SELECT POD.PurchaseOrderDetailID
, POD.ProductID
, POH.OrderDate
, POD.OrderQty
FROM Purchasing.PurchaseOrderDetail POD
JOIN Purchasing.PurchaseOrderHeader POH
ON POD.PurchaseOrderID = POH.PurchaseOrderID
WHERE POH.Status IN (2, 4) -- 2-Approved, 4-Complete
AND EXISTS(--Last PO Date per Product
SELECT 1
FROM Purchasing.PurchaseOrderDetail POD1
JOIN Purchasing.PurchaseOrderHeader POH1
ON POD1.PurchaseOrderID = POH1.PurchaseOrderID
WHERE POH1.Status IN (2, 4) -- 2-Approved, 4-Complete
AND POD1.ProductID = POD.ProductID -- correlated constraint
GROUP BY POD1.ProductID
HAVING Max(POH1.OrderDate) = POH.OrderDate
)
The NOT IN and NOT EXISTS versions of the operators simply negates the use of IN and EXISTS operators, showing thus the records for which a match is not found. The complementary problem of showing the Products with a PO placed is to show the Products without a PO placed, for this, as previously mentioned, being enough to add the NOT keyword in from of IN, respectively EXISTS in the first two queries from this post:
-- Products without PO placed (IN operator)
SELECT ITM.ProductID
, ITM.ProductNumber
, ITM.StandardCost
FROM Production.Product ITM
WHERE ITM.ProductID NOT IN (--list of Product IDs with PO placed
SELECT DISTINCT POD.ProductID
FROM Purchasing.PurchaseOrderDetail POD
JOIN Purchasing.PurchaseOrderHeader POH
ON POD.PurchaseOrderID = POH.PurchaseOrderID
WHERE POH.Status IN (2, 4) -- 2-Approved, 4-Complete
)
-- Products without PO placed (EXISTS operator)
SELECT ITM.ProductID
, ITM.ProductNumber
, ITM.StandardCost
FROM Production.Product ITM
WHERE NOT EXISTS (--list of Product IDs with PO placed
SELECT 1
FROM Purchasing.PurchaseOrderDetail POD
JOIN Purchasing.PurchaseOrderHeader POH
ON POD.PurchaseOrderID = POH.PurchaseOrderID
WHERE POH.Status IN (2, 4) -- 2-Approved, 4-Complete
AND POD.ProductID = ITM.ProductID -- correlated constraint
)
An alternative to NOT IN and NOT EXISTS is the use of a LEFT JOIN transposing the project of A\B problem:
-- Products without PO placed (left join)
SELECT ITM.ProductID
, ITM.ProductNumber
, ITM.StandardCost FROM Production.Product ITM
LEFT JOIN (--aggregated PO details
SELECT POD.ProductID
FROM Purchasing.PurchaseOrderDetail POD
JOIN Purchasing.PurchaseOrderHeader POH
ON POD.PurchaseOrderID = POH.PurchaseOrderID
WHERE POH.Status IN (2, 4) -- 2-Approved, 4-Complete
GROUP BY POD.ProductID
) POD
ON POD.ProductID = ITM.ProductID
WHERE POD.ProductID IS NULL
The CROSS APPLY could be used too, this time the WHERE clause from the main query being modified to show only the records without a match in CROSS APPLY, for this needing to be used a "disguised" aggregate too:
-- Products without PO placed (CROSS APPLY)
SELECT ITM.ProductID
, ITM.ProductNumber
, ITM.StandardCost FROM Production.Product ITM
CROSS APPLY (-- PO details
SELECT Max(1) HasPOPlaced
FROM Purchasing.PurchaseOrderDetail POD
JOIN Purchasing.PurchaseOrderHeader POH
ON POD.PurchaseOrderID = POH.PurchaseOrderID
WHERE POH.Status IN (2, 4) -- 2-Approved, 4-Complete
AND POD.ProductID = ITM.ProductID
) POD
WHERE POD.HasPOPlaced IS NULL
As can be seen there are multiple alternative ways of solving the same problem, none of the above described patters could be considered as best approach because in the end must be weighted between performance, flexibility, reusability and complexity. In many cases the performance of a query is the most important factor, though the performance of any of the above techniques is relative because it needs to be considered factors like table structures and relations’ cardinality, server’s configuration and available resources, etc.
References:
[1] Schrag R. (2005). Speeding Up Queries with Semi-Joins and Anti-Joins: How Oracle Evaluates EXISTS, NOT EXISTS, IN, and NOT IN. [Online] Available from:
http://www.dbspecialists.com/files/presentations/semijoins.html (Accessed: 3 Apr 2010)