About Me

IT Professional with more than 16 years experience in IT especially in the area of full life-cycle of Web/Desktop Applications Development, Database Development, Software Engineering, Consultancy, Data Management, Data Quality, Data Migrations, Reporting, ERP support, etc.

Friday, April 02, 2010

The Power of Joins – Part III: Vertical Joins

Introduction
      A vertical join allows the possibility of retrieving data from multiple datasets by combining the attributes from one or more datasets on the same row, being possible to include complex expressions using any combination of attributes. The join within a vertical join is based only on the attributes specified in the join constraint(s), typically each constraint taking the form of an identity, in this case the data types must match, though also complex expressions evaluated to a value of true/false could be used.

      Within SQL Server 2000 were possible two syntaxes for vertical joins – the ANSI and non-ANSI syntaxes, however in later versions the non-ANSI syntax could be enabled only by changing database’s compatibility level to 80, Microsoft recommends using only the ANSI standard because the non-ANSI syntax could be dropped in future versions. In ANSI syntax are used the LEFT OTER JOIN, RIGHT OUTER JOIN, INNER JOIN and FULL OUTER JOIN operators in the FROM clause while in non-ANSI syntax the join constraints appear in WHERE clause, being used the *= and =* operators for LEFT OUTER JOIN, respectively RIGHT OUTER JOIN, and = for an INNER JOIN, the FULL OUTER JOIN not being supported in non-ANSI. Starting with SQL Server 2005 were introduced also the CROSS APPLY and OUTER APPLY operators that could be used to model cross joins.

-- Joins ANSI syntax
SELECT A.Attribute1
[, A.Attribute]
[, B.Attribute]

FROM TABLE1 A
    [LEFT|RIGHT|FULL] [INNER|OUTER] TABLE2 B
        ON <join_constraints>

[WHERE <constraints>]

-- Joins non-ANSI syntax
SELECT A.Attribute1
[, A.Attribute]
[, B.Attribute]
FROM TABLE1 A

,
TABLE2 B

[WHERE
<join_constraints>
]
[AND|OR <join_constraints>]


Notes:
1.   In case of horizontal joins the Set Theory applies mainly to the join constraints, an element in the dataset being the n-uple formed by the attributes participating in the join constraints. Many times it’s natural to use the foreign key-primary key attribute pairs though tables’ design and the logic to be modeled don’t always allow this luxury, therefore when choosing the attributes participating in the join is targeted to find the smallest n-uple with unique values across the whole dataset. From a performance standpoint is preferred to use the attributes that are part of an index, actually in many cases an index arrives to be created for the attributes used often in joins in order to increase queries’ performance. If the n-uple used in the join doesn’t have unique values, in other words exists two records in the dataset on which the n-uple takes the same values for each attribute of the n-uple, then duplicates are induced in the join.
2.    The records whose attributes participating in the join constraint have NULL values are ignored, so it might be needed to consider replacing the NULL values with a default value, though must be paid attention also to the possibility of introducing duplicates.
3.    Oracle supports the SQL ANSI syntax together with a slightly different form of non-ANSI syntax.  Especially when considering the portability of code it makes sense to use  the SQL ANSI syntax in any RDBMS that supports it; I mention this aspect because many Oracle developers are still using the non-ANSI syntax.

4.    In case an attribute is found in more than one table then it’s needed to use aliases for the tables in which the attribute is found or prefix the attribute with the table and schema name (e.g. Production.Product.Name). Actually from performance reasons it’s advisable to always use aliases, thus the database engine identifying easier the source table for each attribute.
5.     In theory in the SELECT statement could include all the attributes from the tables participating in the join, though it’s recommendable to use only the attributes needed (when possible should be also avoided SELECT * constructs). For output’s readability it makes sense to group together the attributes coming from the same table or arrange them in a patterns that could be easily understood by the users.
6.     The cross, inner and full outer joins are the only commutative operations, in other words the same results are obtained if the tables are inversed. It can be discussed also about associativity, the ability of changing the table join processing precedence without affecting the result, [1] for example discussing in detail this topic.

      For demonstrating the application of vertical joins I will use the Production.Product, Purchasing.PurchaseOrderDetail POD and HumanResources.Employee tables from AdventureWorks database.

The Inner Join
    The INNER JOIN operators, shortened sometimes as JOIN, allows retrieving the rows matched from both tables based on the join constraints, thus if for one record isn’t found a match in the other table then the record will not appear in the result dataset.
-- Inner Join ANSI syntax
SELECT ITM.ProductID

,
POD.PurchaseOrderID

FROM
Production.Product ITM
    JOIN Purchasing.PurchaseOrderDetail POD 
       ON ITM.ProductID = POD.ProductID

WHERE
ITM.MakeFlag = 0


-- Inner Join non-ANSI syntax
SELECT ITM.ProductID

,
POD.PurchaseOrderID

FROM
Production.Product ITM

Purchasing.PurchaseOrderDetail POD

WHERE ITM.ProductID=POD.ProductID
    AND ITM.MakeFlag = 0


Note:
      The inner join could be considered the projection of the intersection operation from Set Theory applied to the attributes participating in the join constraints.

      In order to exemplify what’s happening in the background I will use a smaller subset from each of the two tables used as input, and the final result for the inner join:
vertical joins - inner join output
     As can be seen from the above image, the 8th, 9th and 10th records from Purchase Orders sample table corresponding to PurchaseOrderID = 7 are not appearing in the result dataset because the corresponding Product IDs (317, 318 and 319) for the respective lines are not found in the Products sample table. Also the second line from Products sample table corresponding to ProductID = 2 is not appearing in the result dataset because there is no Purchase Order placed for the respective Product.

Note:
      The below query summarizes the sample tables used above and the output of the inner join, inline views being used for each of the tables in order to simplify the use of the example with all types of join, thus only the join type needs to be changed for exemplification:
SELECT IsNull(ITM.ProductID, POD.ProductID) ProductID
,
POD.PurchaseOrderID

,
ITM.ProductNumber

,
ITM.StandardCost

,
POD.UnitPrice

,
POD.OrderQty

FROM
( -- sample Products

     SELECT ProductID
     , ProductNumber
     , StandardCost
      FROM Production.Product
      WHERE ProductID IN (1, 2, 359, 360, 530, 4, 512, 513)
      AND MakeFlag = 0) ITM
      JOIN (-- sample Purchase Orders

          SELECT ProductID
         , PurchaseOrderID
         , UnitPrice
         , OrderQty
         FROM Purchasing.PurchaseOrderDetail
         WHERE PurchaseOrderID IN (1,2,3,4,5,6,7)) POD
     ON ITM.ProductID = POD.ProductID

ORDER
BY 2


The Left Outer Join
    The LEFT OUTER JOIN, shortened sometimes as LEFT JOIN, allows retrieving all the rows from the left table and only the matched records from the right table based on the join constraints.

-- Left Join ANSI syntax
SELECT ITM.ProductID
,
POD.PurchaseOrderID

FROM
Production.Product ITM
      LEFT JOIN Purchasing.PurchaseOrderDetail POD
         ON ITM.ProductID = POD.ProductID

WHERE
ITM.MakeFlag = 0


-- Left Join non-ANSI syntax
SELECTITM.ProductID

,
POD.PurchaseOrderID

FROM
Production.Product ITM

,
Purchasing.PurchaseOrderDetail POD

WHERE
ITM.ProductID*=POD.ProductID
    AND ITM.MakeFlag = 0
 
   

      In order to exemplify what’s happening in the background I will use a smaller subset from each of the two tables used as input, and the final result for the left outer join:
vertical joins - left join output 
     Because the Products sample table is used as row-preserving table it will be shown all the records it holds together with the matched records from the Purchase Orders sample table, thus the Purchase Orders sample table corresponding to PurchaseOrderID = 7 are not appearing in the result dataset because the corresponding Product IDs (317, 318 and 319) for the respective lines are not found in the Products sample table.
     If the tables are inversed then the Purchase Orders table becomes the row-preserving table and thus all the records from it are shown, including the records for which no Product is found in the Products sample table.

Notes:
1.   The left join has no direct correspondent operation from Set Theory but it’s the projection of AU(A∩B) formula, which equals to (A\B)U(A∩B). The left join could be used also to determine the projection of A\B by adding in the WHERE clause the constraint that allows retrieving only the records for which no match was found to the right table:
-- LEFT Join ANSI syntax for A\B
SELECT ITM.ProductID

,
POD.PurchaseOrderID

FROM
Production.Product ITM
    LEFT JOIN Purchasing.PurchaseOrderDetail POD
      ON ITM.ProductID = POD.ProductID

WHERE
POD.ProductID IS NULL

2.    Special attention must be given to the way the join and non-join constraints are added to the WHERE clause, because if the constraint is based solely on the null-supplying table and included in the WHERE clause without handling the Null case then an inner join could have been written instead because the rows for the attributes participating in constraints are ignored. In order to preserve left join’s character the constraint should be brought into the join constraint as below:
-- Right Join ANSI syntax modified
SELECT ITM.ProductID
,
POD.PurchaseOrderID

FROM
Purchasing.PurchaseOrderDetail POD
     LEFT JOIN Production.Product ITM
        ON ITM.ProductID = POD.ProductID
AND ITM.MakeFlag = 0

      Because unlike in Oracle the both terms of an non-ANSI left join operator must contain columns (attributes), then the IsNull function could be used in order to handle the NULL values:
SELECT ITM.ProductID
,
POD.PurchaseOrderID

FROM
Purchasing.PurchaseOrderDetail POD

,
Production.Product ITM

WHERE
POD.ProductID *= ITM.ProductID
AND IsNull(ITM.MakeFlag, 0) = 0

      The same method could be used also to keep non-join constraints in the WHERE clause when using the ANSI syntax for outer joins.
3.    There are cases in which is needed to replace with default values the NULL values from the not matched records from the Null-preserving table.

The Right Outer Join
    The RIGHT OUTER JOIN, shortened sometimes as RIGHT JOIN, allows retrieving all the rows from the right table and only the matched records from the right table based on the join constraints. A left join could be rewritten as a right join, just by inversing the tables, in fact the above examples given for left joins are rewritten using a right join:
-- Right Join ANSI syntax
SELECT ITM.ProductID

,
POD.PurchaseOrderID

FROM
Purchasing.PurchaseOrderDetail POD
     RIGHT JOIN Production.Product ITM
          ON ITM.ProductID = POD.ProductID

WHERE
ITM.MakeFlag = 0


-- Right Join ANSI syntax modified
SELECT ITM.ProductID

,
POD.PurchaseOrderID

FROM
Purchasing.PurchaseOrderDetail POD
    RIGHT JOIN Production.Product ITM
         ON ITM.ProductID = POD.ProductID
AND ITM.MakeFlag = 0
 

   In order to exemplify what’s happening in the background I will use a smaller subset from each of the two tables used as input, and the final result for the right outer join:
vertical joins - right join output
     As opposed to the general syntax examples given above for the left outer join in which the Products sample table is used as row-preserving table, in the above image the Purchase Orders sample table is used as row-preserving table, thus are shown all the records it holds and only the matched Products.

The Full Outer Join
    The FULL OUTER JOIN, shortened sometimes as FULL JOIN, allows retrieving the rows matched from both tables based on the join together with the not matched records from both tables.
-- Full Outer Join ANSI syntax
SELECT ITM.ProductID

,
POD.PurchaseOrderID

FROM
Production.Product ITM
    FULL JOIN Purchasing.PurchaseOrderDetail POD
       ON ITM.ProductID = POD.ProductID
 

   In order to exemplify what’s happening in the background I will use a smaller subset from each of the two tables used as input, and the final result for the full outer join:
vertical joins - full outer  join output
    As can be seen from the above image are considered all the rows from both tables, including the Purchase Orders for which no match was found in the Products sample table and the Products for which no Purchase Order is found in the corresponding sample table. 
Note:
1.      As in the left outer join’s case, special attention must be given to the non-join constraints added in the WHERE clause because they could reduce the range of applicability of the full outer join to a left/right outer join or to an inner join. Eventually, if really needed to add non-join constraints, instead of the base table it could be used an inline view as input for the full outer join.
2.      In SQL Server there is no non-ANSI operator equivalent to the ANSI full outer join operator, though the same functionality could be obtained by using the union between the result of a left outer join and the one of a right outer join:

-- Full Outer Join non-ANSI syntax equivalent

SELECT ITM.ProductID
,
POD.PurchaseOrderID

FROM
Production.Product ITM

,
Purchasing.PurchaseOrderDetail POD

WHERE
ITM.ProductID *= POD.ProductID

UNION
SELECT
ITM.ProductID

,
POD.PurchaseOrderID

FROM
Production.Product ITM

,
Purchasing.PurchaseOrderDetail POD

WHERE
ITM.ProductID =* POD.ProductID


    In some cases could be useful to rewrite the above query using the ANSI syntax, but this time it makes sense to use the UNION ALL operator instead and limit the second dataset only to the Products for which no Purchase Order was placed:
-- Full Outer Join ANSI syntax equivalent using Left & Right Outer Joins
SELECT ITM.ProductID

,
POD.PurchaseOrderID

FROM
Production.Product ITM
     LEFT JOIN Purchasing.PurchaseOrderDetail POD 
         ON ITM.ProductID = POD.ProductID

UNION
ALL

SELECT
ITM.ProductID

,
POD.PurchaseOrderID

FROM
Production.Product ITM
      RIGHT JOIN Purchasing.PurchaseOrderDetail POD 
            ON ITM.ProductID = POD.ProductID

WHERE
ITM.ProductID IS NULL



The Cross Join
    The CROSS JOIN resumes at listing the joined tables without specifying the join constraint and it returns the carthezian product between the two tables, a row from the first table being matched to each row from the second table, thus if the first table has m rows and the second n rows, the final query will return m*n rows. If in non-ANSI syntax it’s pretty simple to create a cross join, using ANSI syntax the same could be obtained using a full outer join with a join constraint that always equates to true independently of the tables’ attributes values (e.g. 1=1) or, starting with SQL Server 2005,  when was first introduced, could be used the cross apply operator.
-- Cross Join ANSI syntax (CROSS APPLY)
SELECT ITM.ProductID

,
POD.PurchaseOrderID

FROM
Production.Product ITM
     CROSS APPLY Purchasing.PurchaseOrderDetail POD


-- Cross Join ANSI syntax (FULL OUTER JOIN)
SELECT ITM.ProductID

,
POD.PurchaseOrderID

FROM
Production.Product ITM
    FULL OUTER JOIN Purchasing.PurchaseOrderDetail POD
      ON 1=1


-- Cross Join non-ANSI syntax
SELECT ITM.ProductID

,
POD.PurchaseOrderID

FROM
Production.Product ITM

,
Purchasing.PurchaseOrderDetail POD


The Self Join
     The self join is a special type of vertical join in which the left and right tables in a join represent the same table, and it could take the form of any of the above mentioned types of joins. This typically implies the existence of a self-reference attribute that references the primary key of another record, such structures being useful in modeling hierarchies (e.g. Bill of Materials, Human Resources hierarchies). For example HumanResources.Employee table stores together with the current Employee also ManagerID which stores the EmployeeID of Employee’s Manager, that stores at its turn the reference to Manager’s Manager, thus a whole structure of an organization could be built with self joins. The simplified queries for getting the Manager of Employee’s Manager could be written with a left join, this mainly because there might be Employees who don’t have a manager:
-- Self Join ANSI Syntax (LEFT JOIN)
SELECT EMP.EmployeeID

,
EMP.ManagerID

,
MNG.ManagerID ManagersManagerID

FROM
HumanResources.Employee EMP
    LEFT JOIN HumanResources.Employee MNG
       ON EMP.ManagerID = MNG.EmployeeID


-- Self Join ANSI Syntax (LEFT JOIN)
SELECT EMP.EmployeeID
,
EMP.ManagerID

,
MNG.ManagerID ManagersManagerID

FROM
HumanResources.Employee EMP

,
HumanResources.Employee MNG

WHERE
EMP.ManagerID *= MNG.EmployeeID


Note:
       In order to built the whole hierarchy, starting with SQL Server 2005 instead of using multiple self joins could be used common tables expressions (CTE), they offering more flexibility and better performance.

References:
[1] David M.M., (1999). Advanced ANSI SQL Data Modeling and Structure Processing. Artech House. ISBN: 1-58053-038-9

No comments: