05 April 2010

The Power of Joins – Part VI: Not-Equal Joins

    SQL Server 2008 documentation mentions the not-equal joins without giving an explicit definition, the term appearing within the context of joining values in two columns that are not equal. On the other side the not-equal join denomination could be used to refer more generally to the joins whose join constraints involve comparison operators other than equal (“=”), for example <, >, <>, BETWEEN, IN, and its negation NOT IN, LIKE, etc. The difference between the two views resides on the fact that some of the operators (e.g. IN, BETWEEN) could imply the equality of values from joined columns. Another objection could be made on whether the not-equality join constraint could be based only on one of the table’s attributes participating in the join, the other thus missing, such constraints could be considered also as join constraints even if in some scenarios they could have been written in the main WHERE clause as well. For the sake of simplicity and flexibility I would consider a not-equal join or not-equality join as a join in which at least one of the join constraints is based on a operator other than the equal operator. Could be discussed thus about not-equal join constraints referring to the join constraints that involve other operators than equal, and equal join constraints, the two types of joins could coexist in the same join. A not-equal join constraint often involves the possibility of being returned more records then if a equal join constraint was used.

      If we talk about join constraints then more likely we refer to vertical joins, though also anti-joins and semi-joins could include sporadically not-equality operators. In a normal join based on equalities for the set of attributes participating in the join from the left table are matched all the records from the right table, even if this means retrieving more than one record. Same happens also for not-equal constraints even if the business scenarios in which such needs arises are not so intuitive. One relatively simple example I often met is the case when is needed to calculate the incremental aggregation of values, for example to calculate the incremental volume of PO placed over time:
SELECT DII.DateSequence
, IsNull(POH.SubTotal, 0) SubTotal
FROM dbo.fGetDatesInInterval('2002-01-01', '2002-01-31') DII
    CROSS APPLY (--incremental PO Volume

     SELECT SUM(POH.SubTotal) SubTotal
    FROM Purchasing.PurchaseOrderHeader POH
   WHERE POH.Status IN (2, 4) -- 2-Approved, 4-Complete

       AND DATEDIFF(d, POH.OrderDate, DII.DateSequence)>=0 --not-equal constraint
    ) POH

       In order to simplify the query I used the dbo.fGetDatesInInterval table-valued function that returns a sequence of dates within a given time interval, and a CROSS APPLY returning the aggregated value for each date returned by the function, for the calculations being considered all the records having an Order Date smaller or equal with the given (sequence) date. As can be seen from the above query the not-equal constraint is based on DateDiff function, instead we could have written POH.OrderDate<=DII.DateSequence, one of the reasons for its use is the fact that the constraint could be easily modified to show for example only the aggregated values for the past 90 days, the BETWEEN operator entering in scene:
SELECT DII.DateSequence
, IsNull(POH.SubTotal, 0) SubTotal
FROM dbo.fGetDatesInInterval('2002-01-01', '2002-01-31') DII
     CROSS APPLY (--incremental PO Volume

    SELECT SUM(POH.SubTotal) SubTotal
    FROM Purchasing.PurchaseOrderHeader POH
    WHERE POH.Status IN (2, 4) -- 2-Approved, 4-Complete

        AND DATEDIFF(d, POH.OrderDate, DII.DateSequence) BETWEEN 0 AND 90--not-equal constraint

    ) POH

    The not-equal constraint from the query is part of interval-based constraints category which covers the scenarios in which one of the attributes from the tables participating in a join is checked whether it falls within a given interval, typically that being valid for numeric values as above or date value, as in the next query that shows the Last Standard Cost Details per Product.

-- Products & Last Standard Cost details
SELECT ITM.ProductID
,
ITM.ProductNumber

,
ITM.StandardCost

,
PCH.StartDate

,
PCH.EndDate

FROM
Production.Product ITM
    JOIN Production.ProductCostHistory PCH
     ON ITM.ProductID = PCH.ProductID
AND GETDATE() BETWEEN PCH.StartDate AND IsNull(PCH.EndDate, GETDATE())

    Such queries need often to be written when the records in a table have a given validity, no overlap existing between intervals.


    Using list of values within a join constraint and thus the IN or NOT IN operators occurs when for example the base table stores data coming from multiple sources and more encodings are used for the same source, each of them caring its meaning. For example using the Production.TransactionHistory and several other tables coming with AdventureWorks database, and supposing that for Purchase Orders are two transaction types P and B, and for Sales Orders S and I, a general scope query on Transactions could be written as follows:
-- Transaction History
SELECT PTH.TransactionID

,
PTH.ProductID

,
PTH.ReferenceOrderID

,
CASE PTH.TransactionType
     WHEN 'W' THEN 'Work Order'

    WHEN 'S' THEN 'Sales Order'
    WHEN 'I' THEN 'Internal Sales Order'
    WHEN 'P' THEN 'Purchase Order'
    WHEN 'B' THEN 'Blanket Purchase Order'
    ELSE 'n.a'
  END TransactionType
, CASE
    WHEN PTH.TransactionType = 'W' THEN StartDate
    WHEN PTH.TransactionType IN ('S', 'I') THEN SOH.OrderDate
    WHEN PTH.TransactionType IN ('P', 'B') THEN POH.OrderDate
  END ReferenceDate
FROM Production.TransactionHistory PTH
    LEFT JOIN
Purchasing.PurchaseOrderHeader POH
      ON
PTH.ReferenceOrderID = POH.PurchaseOrderID
  AND PTH.TransactionType IN ('P', 'B') --not-equal constraint

   LEFT JOIN Sales.SalesOrderHeader SOH
     ON PTH.ReferenceOrderID = SOH.SalesOrderID
  AND PTH.TransactionType IN ('S', 'I'--not-equal constraint
    LEFT JOIN Production.WorkOrder PWO
      ON PTH.ReferenceOrderID = PWO.WorkOrderID
   AND PTH.TransactionType = 'W'

    The need for writing such a query is seldom and could have been written as 3 distinct queries whose results are joined with two unions, both approaches having their pluses and minuses. 

   In one of the above queries was used a constraint based on DateDiff function, and in theory any other function could be used in a join constraint, including user-defined functions. Such function-based join constraints are handy but should be used with caution because they could impact query’s performance. Pattern-based join constraints used with LIKE operator could be used as well.

    There are cases in which the constraints that typically would be included in the WHERE clause are added in the body of the join, apparently without any good reason. Such based denatured join constraints are base only on the attributes of one of the tables involved in the join, like in the next query:
-- PO for Products with StandardCost>10
SELECT POD.PurchaseOrderDetailID

,
POD.ProductID

,
ITM.ProductNumber

,
ITM.StandardCost

,
POD.UnitPrice

,
POD.OrderQty

FROM
Purchasing.PurchaseOrderDetail POD
    JOIN
Production.Product ITM
      ON POD.ProductID = ITM.ProductID
  AND ITM.StandardCost>10  -- denaturated join constraint

No comments: