SQL Troubles: horizontal join

Showing posts with label horizontal join. Show all posts

01 April 2010

💎SQL Reloaded: The Power of Joins II (Horizontal Joins)

Introduction

    A horizontal join allows retrieving data from two or more datasets by merging/comparing the rows from each involved dataset, the join of rows being done on vertical, thus a row from the final result dataset containing only the attributes from one of the input datasets, the attributes being matched based on their order in the dataset (first attribute with first attribute from each dataset, second with the second, and so on), the output attribute taking the name of the correspondent attribute from the first. It worth to mention that the comparison is done at row level, thus we could talk about distinct rows if at least one of the attributes is having different values.

    There are two important constraints horizontal joins comes with: the datasets involved in a vertical join must have the same number of attributes and the joined attributes must have the same data type. If one of the attributes is not available in one of the datasets, a NULL or other default value could be provided for the missing attribute, while in case data types can’t be converted implicitly then the conversion must be done explicitly to a data type sufficient to accommodate the data type of the attributes joined under the same attribute.

      The general syntax for an horizontal join could be written as follows:

-- Horizontal Join syntax 
SELECT A.Attribute1 
[, A.Attribute] 
FROM TABLE1 A 
[WHERE <constraints>] 
UNION | INTERSECT | EXCEPT [ALL] 
SELECT B.Attribute1 
[, B.Attribute] 
FROM TABLE2 B 
[WHERE <join_constraints>]  
[ORDER BY <order_attributes>]

Notes:

1. There are 3 horizontal join operators UNION for union (AUB), INTERSECT (A∩B) for intersection and EXCEPT for subtraction (A\B), and only UNION could be used in combination with ALL keyword in order to show ignore the duplicates. Please note that only the union and intersection operators are commutative: AUB = BUA, respectively A∩B = B∩A, while A\B<>B\A.

2. The SELECT statement for each dataset could include any complex logic supported, including aggregates, window ranking and aggregate functions, virtual tables (views, user-defined functions), vertical joins, etc.

3. The horizontal join may contain only one ORDER BY clause placed at the end of the query, and with it could be used directly the attributes’ name or alias given to the columns from first dataset. In case aliases are used, and especially when many attributes are involved or then when the attribute involve complex logic, it makes sense to use the same alias also for the other datasets involved in the join.

For demonstrating the horizontal operations I will use the following two datasets based on the same table, respectively Products table from AdventureWorks database.

-- Dataset A 
SELECT ITM.ProductID 
, ITM.Name ProductName 
, ITM.ProductNumber 
, ITM.StandardCost 
, ITM.MakeFlag 
FROM Production.Product ITM 
WHERE ITM.StandardCost BETWEEN 0 AND 200 

-- Dataset B 
SELECT ITM.ProductID 
, ITM.Name ProductName 
, ITM.ProductNumber 
, ITM.StandardCost 
, ITM.MakeFlag 
FROM Production.Product ITM 
WHERE ITM.StandardCost BETWEEN 100 AND 250

The UNION Join

The UNION join, also referred as the outer union, allows combining two datasets in one dataset formed from the combined rows of each data set, the eventual duplicate rows being eliminated. In case are needed also the duplicates, then the ALL keyword could be used in combination with the UNION (also referred as UNION ALL operator), this technique being recommended when the two datasets have no common records because when a simple UNION is used the database engine still checks for duplicates.

-- UNION example (AUB) 
SELECT ITM.ProductID 
, ITM.Name ProductName 
, ITM.ProductNumber 
, ITM.StandardCost 
, ITM.MakeFlag 
FROM Production.Product ITM 
WHERE ITM.StandardCost BETWEEN 0 AND 200 
UNION 
SELECT ITM.ProductID 
, ITM.Name ProductName 
, ITM.ProductNumber 
, ITM.StandardCost 
, ITM.MakeFlag 
FROM Production.Product ITM 
WHERE ITM.StandardCost BETWEEN 100 AND 250

Note:
Because the above two datasets are using the same table the final output is the same with the output of the below query, the reunion being thus applied directly to the constraints used. It is actually recommended to avoid using unions:

-- alternative for UNION example (AUB) 
SELECTITM.ProductID 
, ITM.Name ProductName 
, ITM.ProductNumber 
, ITM.StandardCost 
, ITM.MakeFlag 
FROM Production.Product ITM 
WHERE ITM.StandardCost BETWEEN 0 AND 250

In order to exemplify what’s happening in the background I will use only two rows from each of the two datasets used as input, on the right side being shown the output of applying the UNION, respectively UNION ALL operators:

The INTERSECT Join

The INTERSECT operator combines the two datasets in one dataset formed from the rows found to be present in both datasets.

-- INTERSECT example (A∩B) 
SELECT ITM.ProductID 
, ITM.Name ProductName 
, ITM.ProductNumber 
, ITM.StandardCost 
, ITM.MakeFlag 
FROM Production.Product ITM 
WHERE ITM.StandardCost BETWEEN 0 AND 200 
INTERSECT 
SELECT ITM.ProductID 
, ITM.Name ProductName 
, ITM.ProductNumber 
, ITM.StandardCost 
, ITM.MakeFlag 
FROM Production.Product ITM 
WHERE ITM.StandardCost BETWEEN 100 AND 250

Note:
Because the above two datasets are using the same table the final output is the same with the output of the following query, the intersection could be thus applied directly to the constraints used:

-- alternative for INTERSECT example (A∩B) 
SELECT ITM.ProductID 
, ITM.Name ProductName 
, ITM.ProductNumber 
, ITM.StandardCost 
, ITM.MakeFlag 
FROM Production.Product ITM 
WHERE ITM.StandardCost BETWEEN 100 AND 200

In order to exemplify what’s happening in the background I will use only two rows from each of the two datasets used as input, on the right side being shown the output of applying the INTERSECT operator:

horizontal joins - intersect presentation

The EXCEPT Join

-- EXCEPT example (A\B) 
SELECTITM.ProductID 
, ITM.Name ProductName 
, ITM.ProductNumber 
, ITM.StandardCost 
, ITM.MakeFlag 
FROM Production.Product ITM 
WHERE ITM.StandardCost BETWEEN 0 AND 200 
EXCEPT 
SELECT ITM.ProductID 
, ITM.Name ProductName 
, ITM.ProductNumber 
, ITM.StandardCost 
,ITM.MakeFlag 
FROM Production.Product ITM 
WHERE ITM.StandardCost BETWEEN 100 AND 250 

-- EXCEPT example (B\A) 
SELECT ITM.ProductID 
, ITM.Name ProductName 
, ITM.ProductNumber 
, ITM.StandardCost 
, ITM.MakeFlag 
FROM Production.Product ITM 
WHERE ITM.StandardCost BETWEEN 100 AND 250 
EXCEPT 
SELECT ITM.ProductID 
, ITM.Name ProductName 
, ITM.ProductNumber 
, ITM.StandardCost 
, ITM.MakeFlag 
FROM Production.Product ITM 
WHERE ITM.StandardCost BETWEEN 0 AND 200

As can be seen from the above two examples there are two possibilities of subtracting two datasets – A\B, respectively B\A, selecting the rows which are only in A and not in B, respectively the rows which are in B and not in A.

Note:
Because the above two datasets are using the same table the final output is the same with the output of the following queries, the subtraction could be thus applied directly to the constraints used:

-- alternative for SUBTRACTION example (A\B) 
SELECT ITM.ProductID 
, ITM.Name ProductName 
, ITM.ProductNumber 
, ITM.StandardCost 
, ITM.MakeFlag 
FROM Production.Product ITM 
WHERE ITM.StandardCost >= 0 
    AND ITM.StandardCost < 100 

-- alternative for SUBTRACTION example (B\A) 
SELECT ITM.ProductID 
, ITM.Name ProductName 
, ITM.ProductNumber 
, ITM.StandardCost , ITM.MakeFlag 
FROMProduction.Product ITM 
WHERE ITM.StandardCost > 200 
    AND ITM.StandardCost <= 250

In order to exemplify what’s happening in the background I will use only two rows from each of the two datasets used as input, on the right side being shown the output of applying the EXCEPT for A\B, respectively B\A:

Combining Operators

For the sake of simplicity all the above exemplifications are using only two datasets, though multiple datasets could be joined in this way, here is an example for the UNION ALL based on three datasets (please note that I avoided to list the attributes):

-- multiple UNION 
SELECT * 
FROMProduction.Product ITM 
WHERE ITM.StandardCost BETWEEN 0 AND 50 
UNION ALL
SELECT * 
FROM Production.Product ITM 
WHERE ITM.StandardCost BETWEEN 100 AND 150 
UNION ALL 
SELECT * 
FROM Production.Product ITM 
WHERE ITM.StandardCost BETWEEN 200 AND 250

Note:
Also in this case the union could be applied directly on the constraints, though because there is no overlap between any combination of the three ranges, the constraint needs to be modified accordingly:

  
-- multiple UNION 
SELECT * 
FROM Production.Product ITM 
WHERE ITM.StandardCost BETWEEN 0 AND 50 
    OR ITM.StandardCost BETWEEN 100 AND 150 
    OR ITM.StandardCost BETWEEN 200 AND 250

When the union and intersection operations are used with multiple datasets it doesn’t matter the order of the datasets because the two operations are associative: (AUB)UC=AU(BUC)=AUBUC, respectively (A∩B)∩C=A∩(B∩C)=A∩B∩C, though not the same could be said about subtraction because (A\B)\C <> A\(B\C) or the cases in which operators are combined, for example (A\B)UC<>A\(BUC), so is needed to use parentheses in order to avoid eventual surprises.

-- (A\B)UC 
( 
    -- subtraction 
    SELECT * 
    FROM Production.Product ITM 
    WHERE ITM.StandardCost BETWEEN 0 AND 150 
    EXCEPT 
    SELECT * 
    FROM Production.Product ITM 
    WHERE ITM.StandardCost BETWEEN 100 AND 150 
) 
UNION 
SELECT * 
FROM Production.Product ITM 
WHERE ITM.StandardCost BETWEEN 150 AND 250

Note:
Knowing the basics of Set Theory is important in making your queries returning the expected results, the use of some of the identities allowing to simplify considerably the queries sometimes. Without a set-mindset I think it will be difficult to solve such problems.

30 March 2010

💎SQL Reloaded: The Power of Joins – Part I: Introduction

Joins are the heart of RDBMS, they allowing to retrieve data (bring data together in a single result dataset) from multiple (related) table-like objects (tables, views, user-defined functions) by defining how and which records are selected from such objects. In order to reduce the complexity of such a topic I will introduce the joins with regard to tables, the concepts applying also to views and user-defined functions, especially when they are regarded as “virtual” tables. In theory any two tables could be joined independently on whether a direct or indirect relation exists between them, it can be a foreign key, a natural key, a date or any other attribute that could be used to get a certain view into the data. Typically a join involves a minimum of two tables, but before going dipper into the topic it makes sense to look a little at the various terms used to refer to the tables joined and the anatomy of a join.

There are various ways of naming the tables involved in a join:
- left vs. right tables, where the left table is the first listed table participating in the join and the right table the second listed table participating in the join;
- parent vs. dependent tables, refers to tables between which exists a direct relationship, in other words exists a primary key and a foreign key, the dependent table being the table containing the foreign key, and the parent table the table containing the primary key;
- parent vs. child tables, is similar with the previous definition, the child table being an alternative naming for the dependent table, in addition we can talk also about grandchildren when a child table is referenced at its turn by other table called thus, when the three tables are considered together, the grandchild table;
- referent vs. referenced tables, another alternative for naming the tables between which exists a direct relationship, the referenced table being the table containing the foreign key, and the referent table the ones containing the primary key;
- inner vs. outer tables, these terms are used only in the case of outer joins, the inner table being the table from which are selected all the records, while the outer table is the table from which are selected only the matched records, therefore the two are also called sometimes the row-preserving table, respectively the null-supplying table;

None of the above pair of naming conventions are perfect because they can’t be used to address all the join situations in which tables are involved, and some of the terms may be used interchangely, especially when a term is preferred in order to send across adequate meaning, for example referring to a table as a inner table then we imply also that we deal with a left or right join. I realized that in time I used all the the above terms in a situation or another, however I prefer to use the left vs. right tables denomination, and occasionally inner vs. outer tables when referring to left or right joins.

In addition to the two tables joined a query typically includes also a join operator that specifies the join type (union, inner, outer or cross join), the join constraints that specifies which attributes are used to join the tables and the logical operators used, and eventually the non-join constraints referring to other constraints than the join constraints. I’m saying typically because either of the two could be missing, thus if the the join constraint and operator miss then we deal with a Cartesian join, while if the join constraint is appearing in the WHERE instead of FROM clause then there is no need to specify the join operator or in case of a UNION there is no need to specify the join constraint, though I will come back later on this topic.

The work with joins is deeply rooted in the Set Theory (the tables joined referred also as datasets) and the set operations – union, subtraction and intersection, their projection in world of databases coming, from my point of view and not sure if the database literature really discusses about this, into two flavors – vertical vs. horizontal joining. By vertical joining I’m referring to the intrinsic character of joins, in the result dataset the attributes from the two joined datasets appearing on the same row, while in the horizontal joining they never appear on the same row, thus a row being formed exclusively from the records of only one of the tables involved in the join. Thus we can discuss about vertical/horizontal joins as projection of union, intersection and subtraction operations, each of them with different connotations, the horizontal operations being actually similar to the ones from set theory. If we discuss about joins then we inevitably we have to mention also the anti-joins and semi-joins, a special type of queries based on NOT IN, respectively EXISTS operators.

|>> Next Post

SQL Troubles

Pages

01 April 2010

💎SQL Reloaded: The Power of Joins II (Horizontal Joins)

30 March 2010

💎SQL Reloaded: The Power of Joins – Part I: Introduction

About Me