01 April 2010

SQL Reloaded: The Power of Joins II (Horizontal Joins)

Introduction 
 
    A horizontal join allows retrieving data from two or more datasets by merging/comparing the rows from each involved dataset, the join of rows being done on vertical, thus a row from the final result dataset containing only the attributes from one of the input datasets, the attributes being matched based on their order in the dataset (first attribute with first attribute from each dataset, second with the second, and so on), the output attribute taking the name of the correspondent attribute from the first. It worth to mention that the comparison is done at row level, thus we could talk about distinct rows if at least one of the attributes is having different values.

    There are two important constraints horizontal joins comes with: the datasets involved in a vertical join must have the same number of attributes and the joined attributes must have the same data type. If one of the attributes is not available in one of the datasets, a NULL or other default value could be provided for the missing attribute, while in case data types can’t be converted implicitly then the conversion must be done explicitly to a data type sufficient to accommodate the data type of the attributes joined under the same attribute.
  
      The general syntax for an horizontal join could be written as follows: 

-- Horizontal Join syntax 
SELECT A.Attribute1 
[, A.Attribute] 
FROM TABLE1 A 
[WHERE <constraints>] 
UNION | INTERSECT | EXCEPT [ALL] 
SELECT B.Attribute1 
[, B.Attribute] 
FROM TABLE2 B 
[WHERE <join_constraints>]  
[ORDER BY <order_attributes>] 

Notes:
   
1.    There are 3 horizontal join operators UNION for union (AUB), INTERSECT (A∩B) for intersection and EXCEPT for subtraction (A\B), and only UNION could be used in combination with ALL keyword in order to show ignore the duplicates. Please note that only the union and intersection operators are commutative: AUB = BUA, respectively A∩B = B∩A, while A\B<>B\A. 
 
2.    The SELECT statement for each dataset could include any complex logic supported, including aggregates, window ranking and aggregate functions, virtual tables (views, user-defined functions), vertical joins, etc. 
 
3.     The horizontal join may contain only one ORDER BY clause placed at the end of the query, and with it could be used directly the attributes’ name or alias given to the columns from first dataset. In case aliases are used, and especially when many attributes are involved or then when the attribute involve complex logic, it makes sense to use the same alias also for the other datasets involved in the join.

   For demonstrating the horizontal operations I will use the following two datasets based on the same table, respectively Products table from AdventureWorks database. 
 
-- Dataset A 
SELECT ITM.ProductID 
, ITM.Name ProductName 
, ITM.ProductNumber 
, ITM.StandardCost 
, ITM.MakeFlag 
FROM Production.Product ITM 
WHERE ITM.StandardCost BETWEEN 0 AND 200 

-- Dataset B 
SELECT ITM.ProductID 
, ITM.Name ProductName 
, ITM.ProductNumber 
, ITM.StandardCost 
, ITM.MakeFlag 
FROM Production.Product ITM 
WHERE ITM.StandardCost BETWEEN 100 AND 250 

The UNION Join 

    The UNION join, also referred as the outer union, allows combining two datasets in one dataset formed from the combined rows of each data set, the eventual duplicate rows being eliminated. In case are needed also the duplicates, then the ALL keyword could be used in combination with the UNION (also referred as UNION ALL operator), this technique being recommended when the two datasets have no common records because when a simple UNION is used the database engine still checks for duplicates. 

-- UNION example (AUB) 
SELECT ITM.ProductID 
, ITM.Name ProductName 
, ITM.ProductNumber 
, ITM.StandardCost 
, ITM.MakeFlag 
FROM Production.Product ITM 
WHERE ITM.StandardCost BETWEEN 0 AND 200 
UNION 
SELECT ITM.ProductID 
, ITM.Name ProductName 
, ITM.ProductNumber 
, ITM.StandardCost 
, ITM.MakeFlag 
FROM Production.Product ITM 
WHERE ITM.StandardCost BETWEEN 100 AND 250  

Note:
   
     
     Because the above two datasets are using the same table the final output is the same with the output of the below query, the reunion being thus applied directly to the constraints used. It is actually recommended to avoid using unions:

-- alternative for UNION example (AUB) 
SELECTITM.ProductID 
, ITM.Name ProductName 
, ITM.ProductNumber 
, ITM.StandardCost 
, ITM.MakeFlag 
FROM Production.Product ITM 
WHERE ITM.StandardCost BETWEEN 0 AND 250 

      In order to exemplify what’s happening in the background I will use only two rows from each of the two datasets used as input, on the right side being shown the output of applying the UNION, respectively UNION ALL operators:
horizontal joins - union presentation

The INTERSECT Join 
 
    The INTERSECT operator combines the two datasets in one dataset formed from the rows found to be present in both datasets. 
 
-- INTERSECT example (A∩B) 
SELECT ITM.ProductID 
, ITM.Name ProductName 
, ITM.ProductNumber 
, ITM.StandardCost 
, ITM.MakeFlag 
FROM Production.Product ITM 
WHERE ITM.StandardCost BETWEEN 0 AND 200 
INTERSECT 
SELECT ITM.ProductID 
, ITM.Name ProductName 
, ITM.ProductNumber 
, ITM.StandardCost 
, ITM.MakeFlag 
FROM Production.Product ITM 
WHERE ITM.StandardCost BETWEEN 100 AND 250 

Note:

   Because the above two datasets are using the same table the final output is the same with the output of the following query, the intersection could be thus applied directly to the constraints used: 
  
-- alternative for INTERSECT example (A∩B) 
SELECT ITM.ProductID 
, ITM.Name ProductName 
, ITM.ProductNumber 
, ITM.StandardCost 
, ITM.MakeFlag 
FROM Production.Product ITM 
WHERE ITM.StandardCost BETWEEN 100 AND 200 

     In order to exemplify what’s happening in the background I will use only two rows from each of the two datasets used as input, on the right side being shown the output of applying the INTERSECT operator:
horizontal joins - intersect presentation 
The EXCEPT Join 

-- EXCEPT example (A\B) 
SELECTITM.ProductID 
, ITM.Name ProductName 
, ITM.ProductNumber 
, ITM.StandardCost 
, ITM.MakeFlag 
FROM Production.Product ITM 
WHERE ITM.StandardCost BETWEEN 0 AND 200 
EXCEPT 
SELECT ITM.ProductID 
, ITM.Name ProductName 
, ITM.ProductNumber 
, ITM.StandardCost 
,ITM.MakeFlag 
FROM Production.Product ITM 
WHERE ITM.StandardCost BETWEEN 100 AND 250 

-- EXCEPT example (B\A) 
SELECT ITM.ProductID 
, ITM.Name ProductName 
, ITM.ProductNumber 
, ITM.StandardCost 
, ITM.MakeFlag 
FROM Production.Product ITM 
WHERE ITM.StandardCost BETWEEN 100 AND 250 
EXCEPT 
SELECT ITM.ProductID 
, ITM.Name ProductName 
, ITM.ProductNumber 
, ITM.StandardCost 
, ITM.MakeFlag 
FROM Production.Product ITM 
WHERE ITM.StandardCost BETWEEN 0 AND 200 

   As can be seen from the above two examples there are two possibilities of subtracting two datasets – A\B, respectively B\A, selecting the rows which are only in A and not in B, respectively the rows which are in B and not in A.

Note:
   Because the above two datasets are using the same table the final output is the same with the output of the following queries, the subtraction could be thus applied directly to the constraints used: 

-- alternative for SUBTRACTION example (A\B) 
SELECT ITM.ProductID 
, ITM.Name ProductName 
, ITM.ProductNumber 
, ITM.StandardCost 
, ITM.MakeFlag 
FROM Production.Product ITM 
WHERE ITM.StandardCost >= 0 
    AND ITM.StandardCost < 100 

-- alternative for SUBTRACTION example (B\A) 
SELECT ITM.ProductID 
, ITM.Name ProductName 
, ITM.ProductNumber 
, ITM.StandardCost , ITM.MakeFlag 
FROMProduction.Product ITM 
WHERE ITM.StandardCost > 200 
    AND ITM.StandardCost <= 250 

      In order to exemplify what’s happening in the background I will use only two rows from each of the two datasets used as input, on the right side being shown the output of applying the EXCEPT for A\B, respectively B\A: 
 
horizontal joins - except presentation

Combining Operators 
 
    For the sake of simplicity all the above exemplifications are using only two datasets, though multiple datasets could be joined in this way, here is an example for the UNION ALL based on three datasets (please note that I avoided to list the attributes): 
 
-- multiple UNION 
SELECT * 
FROMProduction.Product ITM 
WHERE ITM.StandardCost BETWEEN 0 AND 50 
UNION ALL
SELECT * FROM Production.Product ITM WHERE ITM.StandardCost BETWEEN 100 AND 150 UNION ALL SELECT * FROM Production.Product ITM WHERE ITM.StandardCost BETWEEN 200 AND 250

Note:
     Also in this case the union could be applied directly on the constraints, though because there is no overlap between any combination of the three ranges, the constraint needs to be modified accordingly: 

  
-- multiple UNION 
SELECT * 
FROM Production.Product ITM 
WHERE ITM.StandardCost BETWEEN 0 AND 50 
    OR ITM.StandardCost BETWEEN 100 AND 150 
    OR ITM.StandardCost BETWEEN 200 AND 250 

    When the union and intersection operations are used with multiple datasets it doesn’t matter the order of the datasets because the two operations are associative: (AUB)UC=AU(BUC)=AUBUC, respectively (A∩B)∩C=A∩(B∩C)=A∩B∩C, though not the same could be said about subtraction because (A\B)\C <> A\(B\C) or the cases in which operators are combined, for example (A\B)UC<>A\(BUC), so is needed to use parentheses in order to avoid eventual surprises. 
  
-- (A\B)UC 
( 
    -- subtraction 
    SELECT * 
    FROM Production.Product ITM 
    WHERE ITM.StandardCost BETWEEN 0 AND 150 
    EXCEPT 
    SELECT * 
    FROM Production.Product ITM 
    WHERE ITM.StandardCost BETWEEN 100 AND 150 
) 
UNION 
SELECT * 
FROM Production.Product ITM 
WHERE ITM.StandardCost BETWEEN 150 AND 250 

Note:

      Knowing the basics of Set Theory is important in making your queries returning the expected results, the use of some of the identities allowing to simplify considerably the queries sometimes. Without a set-mindset I think it will be difficult to solve such problems.

No comments:

Related Posts Plugin for WordPress, Blogger...

About Me

My photo
IT Professional with more than 24 years experience in IT in the area of full life-cycle of Web/Desktop/Database Applications Development, Software Engineering, Consultancy, Data Management, Data Quality, Data Migrations, Reporting, ERP implementations & support, Team/Project/IT Management, etc.