About Me

IT Professional with more than 16 years experience in IT especially in the area of full life-cycle of Web/Desktop Applications Development, Database Development, Software Engineering, Consultancy, Data Management, Data Quality, Data Migrations, Reporting, ERP support, etc.

Thursday, April 01, 2010

The Power of Joins – Part II: Horizontal Joins

Introduction
    A horizontal join allows retrieving data from two or more datasets by merging/comparing the rows from each involved dataset, the join of rows being done on vertical, thus a row from the final result dataset containing only the attributes from one of the input datasets, the attributes being matched based on their order in the dataset (first attribute with first attribute from each dataset, second with the second, and so on), the output attribute taking the name of the correspondent attribute from the first. It worth to mention that the comparison is done at row level, thus we could talk about distinct rows if at least one of the attributes is having different values.

    There are two important constraints horizontal joins comes with: the datasets involved in a vertical join must have the same number of attributes and the joined attributes must have the same data type. If one of the attributes is not available in one of the datasets, a NULL or other default value could be provided for the missing attribute, while in case data types can’t be converted implicitly then the conversion must be done explicitly to a data type sufficient to accommodate the data type of the attributes joined under the same attribute.
  
      The general syntax for an horizontal join could be written as follows:
-- Horizontal Join syntax
SELECT A.Attribute1
[, A.Attribute]
FROM TABLE1 A
[WHERE <constraints>]
UNION | INTERSECT | EXCEPT [ALL]

SELECT B.Attribute1
[, B.Attribute]
FROM TABLE2 B
[WHERE <join_constraints>
[ORDER BY <order_attributes>]


Notes:
1.    There are 3 horizontal join operators UNION for union (AUB), INTERSECT (A∩B) for intersection and EXCEPT for subtraction (A\B), and only UNION could be used in combination with ALL keyword in order to show ignore the duplicates. Please note that only the union and intersection operators are commutative: AUB = BUA, respectively A∩B = B∩A, while A\B<>B\A.
2.    The SELECT statement for each dataset could include any complex logic supported, including aggregates, window ranking and aggregate functions, virtual tables (views, user-defined functions), vertical joins, etc.
3.     The horizontal join may contain only one ORDER BY clause placed at the end of the query, and with it could be used directly the attributes’ name or alias given to the columns from first dataset. In case aliases are used, and especially when many attributes are involved or then when the attribute involve complex logic, it makes sense to use the same alias also for the other datasets involved in the join.

   For demonstrating the horizontal operations I will use the following two datasets based on the same table, respectively Products table from AdventureWorks database.
-- Dataset A
SELECT ITM.ProductID
,
ITM.Name ProductName

,
ITM.ProductNumber

,
ITM.StandardCost

, ITM.MakeFlag
FROM
Production.Product ITM

WHERE
ITM.StandardCost BETWEEN 0 AND 200


-- Dataset B
SELECT ITM.ProductID

, ITM.Name ProductName

, ITM.ProductNumber

, ITM.StandardCost

, ITM.MakeFlag

FROM Production.Product ITM

WHERE ITM.StandardCost BETWEEN 100 AND 250


The UNION Join
    The UNION join, also referred as the outer union, allows combining two datasets in one dataset formed from the combined rows of each data set, the eventual duplicate rows being eliminated. In case are needed also the duplicates, then the ALL keyword could be used in combination with the UNION (also referred as UNION ALL operator), this technique being recommended when the two datasets have no common records because when a simple UNION is used the database engine still checks for duplicates.

-- UNION example (AUB)
SELECT ITM.ProductID
, ITM.Name ProductName

, ITM.ProductNumber

, ITM.StandardCost

, ITM.MakeFlag

FROM Production.Product ITM

WHERE ITM.StandardCost BETWEEN 0 AND 200

UNION

SELECT ITM.ProductID

, ITM.Name ProductName

, ITM.ProductNumber

, ITM.StandardCost

, ITM.MakeFlag

FROM Production.Product ITM

WHERE ITM.StandardCost BETWEEN 100 AND 250
 


Note:         
     Because the above two datasets are using the same table the final output is the same with the output of the below query, the reunion being thus applied directly to the constraints used. It is actually recommended to avoid using unions

-- alternative for UNION example (AUB)
SELECTITM.ProductID
, ITM.Name ProductName

, ITM.ProductNumber

, ITM.StandardCost

, ITM.MakeFlag

FROM Production.Product ITM

WHERE ITM.StandardCost BETWEEN 0 AND 250


      In order to exemplify what’s happening in the background I will use only two rows from each of the two datasets used as input, on the right side being shown the output of applying the UNION, respectively UNION ALL operators:
horizontal joins - union presentation
The INTERSECT Join
    The INTERSECT operator combines the two datasets in one dataset formed from the rows found to be present in both datasets.
-- INTERSECT example (AB)
SELECT ITM.ProductID
, ITM.Name ProductName

, ITM.ProductNumber

, ITM.StandardCost

, ITM.MakeFlag

FROM Production.Product ITM

WHERE ITM.StandardCost BETWEEN 0 AND 200

INTERSECT

SELECT ITM.ProductID

, ITM.Name ProductName

, ITM.ProductNumber

, ITM.StandardCost

, ITM.MakeFlag

FROM Production.Product ITM

WHERE ITM.StandardCost BETWEEN 100 AND 250


Note:
   Because the above two datasets are using the same table the final output is the same with the output of the following query, the intersection could be thus applied directly to the constraints used:
-- alternative for INTERSECT example (AB)
SELECT ITM.ProductID
, ITM.Name ProductName

, ITM.ProductNumber

, ITM.StandardCost

, ITM.MakeFlag

FROM Production.Product ITM

WHERE ITM.StandardCost BETWEEN 100 AND 200


     In order to exemplify what’s happening in the background I will use only two rows from each of the two datasets used as input, on the right side being shown the output of applying the INTERSECT operator:
horizontal joins - intersect presentation 
The EXCEPT Join
-- EXCEPT example (A\B)
SELECTITM.ProductID
, ITM.Name ProductName

, ITM.ProductNumber

, ITM.StandardCost

, ITM.MakeFlag

FROM Production.Product ITM

WHERE ITM.StandardCost BETWEEN 0 AND 200

EXCEPT

SELECT ITM.ProductID

, ITM.Name ProductName

, ITM.ProductNumber

, ITM.StandardCost

,ITM.MakeFlag

FROM Production.Product ITM

WHERE ITM.StandardCost BETWEEN 100 AND 250


-- EXCEPT example (B\A)
SELECT ITM.ProductID
, ITM.Name ProductName

, ITM.ProductNumber

, ITM.StandardCost

, ITM.MakeFlag

FROM Production.Product ITM

WHERE ITM.StandardCost BETWEEN 100 AND 250

EXCEPT
SELECT ITM.ProductID
, ITM.Name ProductName

, ITM.ProductNumber

, ITM.StandardCost

, ITM.MakeFlag

FROM Production.Product ITM

WHERE ITM.StandardCost BETWEEN 0 AND 200


   As can be seen from the above two examples there are two possibilities of subtracting two datasets – A\B, respectively B\A, selecting the rows which are only in A and not in B, respectively the rows which are in B and not in A.

Note:
   Because the above two datasets are using the same table the final output is the same with the output of the following queries, the subtraction could be thus applied directly to the constraints used:

-- alternative for SUBTRACTION example (A\B)
SELECT ITM.ProductID
, ITM.Name ProductName

, ITM.ProductNumber

, ITM.StandardCost

, ITM.MakeFlag

FROM Production.Product ITM

WHERE ITM.StandardCost >= 0

    AND ITM.StandardCost < 100

-- alternative for SUBTRACTION example (B\A)
SELECT ITM.ProductID
, ITM.Name ProductName

, ITM.ProductNumber

, ITM.StandardCost
, ITM.MakeFlag
FROMProduction.Product ITM

WHERE ITM.StandardCost > 200

    AND ITM.StandardCost <= 250

      In order to exemplify what’s happening in the background I will use only two rows from each of the two datasets used as input, on the right side being shown the output of applying the EXCEPT for A\B, respectively B\A:
horizontal joins - except presentation

Combining Operators
    For the sake of simplicity all the above exemplifications are using only two datasets, though multiple datasets could be joined in this way, here is an example for the UNION ALL based on three datasets (please note that I avoided to list the attributes):
-- multiple UNION
SELECT *
FROMProduction.Product ITM

WHERE ITM.StandardCost BETWEEN 0 AND 50

UNIONALL

SELECT *

FROM Production.Product ITM

WHERE ITM.StandardCost BETWEEN 100 AND 150

UNION ALL

SELECT *

FROM Production.Product ITM

WHERE ITM.StandardCost BETWEEN 200 AND 250


Note:
     Also in this case the union could be applied directly on the constraints, though because there is no overlap between any combination of the three ranges, the constraint needs to be modified accordingly:
-- multiple UNION
SELECT *
FROM Production.Product ITM

WHERE ITM.StandardCost BETWEEN 0 AND 50

    OR ITM.StandardCost BETWEEN 100 AND 150

    OR ITM.StandardCost BETWEEN 200 AND 250


    When the union and intersection operations are used with multiple datasets it doesn’t matter the order of the datasets because the two operations are associative: (AUB)UC=AU(BUC)=AUBUC, respectively (A∩B)∩C=A∩(B∩C)=A∩B∩C, though not the same could be said about subtraction because (A\B)\C <> A\(B\C) or the cases in which operators are combined, for example (A\B)UC<>A\(BUC), so is needed to use parentheses in order to avoid eventual surprises.
-- (A\B)UC
(
    -- subtraction

    SELECT *
    FROM Production.Product ITM
    WHERE ITM.StandardCost BETWEEN 0 AND 150
    EXCEPT
    SELECT *
    FROM Production.Product ITM
    WHERE ITM.StandardCost BETWEEN 100 AND 150
)
UNION
SELECT *
FROM Production.Product ITM

WHERE ITM.StandardCost BETWEEN 150 AND 250

Note:
      Knowing the basics of Set Theory is important in making your queries returning the expected results, the use of some of the identities allowing to simplify considerably the queries sometimes. Without a set-mindset I think it will be difficult to solve such problems.

No comments: