A horizontal join allows retrieving data from two or more datasets by merging/comparing the rows from each involved dataset, the join of rows being done on vertical, thus a row from the final result dataset containing only the attributes from one of the input datasets, the attributes being matched based on their order in the dataset (first attribute with first attribute from each dataset, second with the second, and so on), the output attribute taking the name of the correspondent attribute from the first. It worth to mention that the comparison is done at row level, thus we could talk about distinct rows if at least one of the attributes is having different values.
There are two important constraints horizontal joins comes with: the datasets involved in a vertical join must have the same number of attributes and the joined attributes must have the same data type. If one of the attributes is not available in one of the datasets, a NULL or other default value could be provided for the missing attribute, while in case data types can’t be converted implicitly then the conversion must be done explicitly to a data type sufficient to accommodate the data type of the attributes joined under the same attribute.
The general syntax for an horizontal join could be written as follows:
-- Horizontal Join syntax SELECT A.Attribute1 [, A.Attribute] FROM TABLE1 A [WHERE <constraints>] UNION | INTERSECT | EXCEPT [ALL] SELECT B.Attribute1 [, B.Attribute] FROM TABLE2 B [WHERE <join_constraints>] [ORDER BY <order_attributes>]
1. There are 3 horizontal join operators UNION for union (AUB), INTERSECT (A∩B) for intersection and EXCEPT for subtraction (A\B), and only UNION could be used in combination with ALL keyword in order to show ignore the duplicates. Please note that only the union and intersection operators are commutative: AUB = BUA, respectively A∩B = B∩A, while A\B<>B\A.
2. The SELECT statement for each dataset could include any complex logic supported, including aggregates, window ranking and aggregate functions, virtual tables (views, user-defined functions), vertical joins, etc.
3. The horizontal join may contain only one ORDER BY clause placed at the end of the query, and with it could be used directly the attributes’ name or alias given to the columns from first dataset. In case aliases are used, and especially when many attributes are involved or then when the attribute involve complex logic, it makes sense to use the same alias also for the other datasets involved in the join.
For demonstrating the horizontal operations I will use the following two datasets based on the same table, respectively Products table from AdventureWorks database.
-- Dataset A SELECT ITM.ProductID , ITM.Name ProductName , ITM.ProductNumber , ITM.StandardCost , ITM.MakeFlag FROM Production.Product ITM WHERE ITM.StandardCost BETWEEN 0 AND 200 -- Dataset B SELECT ITM.ProductID , ITM.Name ProductName , ITM.ProductNumber , ITM.StandardCost , ITM.MakeFlag FROM Production.Product ITM WHERE ITM.StandardCost BETWEEN 100 AND 250
The UNION Join
The UNION join, also referred as the outer union, allows combining two datasets in one dataset formed from the combined rows of each data set, the eventual duplicate rows being eliminated. In case are needed also the duplicates, then the ALL keyword could be used in combination with the UNION (also referred as UNION ALL operator), this technique being recommended when the two datasets have no common records because when a simple UNION is used the database engine still checks for duplicates.
-- UNION example (AUB) SELECT ITM.ProductID , ITM.Name ProductName , ITM.ProductNumber , ITM.StandardCost , ITM.MakeFlag FROM Production.Product ITM WHERE ITM.StandardCost BETWEEN 0 AND 200 UNION SELECT ITM.ProductID , ITM.Name ProductName , ITM.ProductNumber , ITM.StandardCost , ITM.MakeFlag FROM Production.Product ITM WHERE ITM.StandardCost BETWEEN 100 AND 250
Because the above two datasets are using the same table the final output is the same with the output of the below query, the reunion being thus applied directly to the constraints used. It is actually recommended to avoid using unions:
-- alternative for UNION example (AUB) SELECTITM.ProductID , ITM.Name ProductName , ITM.ProductNumber , ITM.StandardCost , ITM.MakeFlag FROM Production.Product ITM WHERE ITM.StandardCost BETWEEN 0 AND 250
In order to exemplify what’s happening in the background I will use only two rows from each of the two datasets used as input, on the right side being shown the output of applying the UNION, respectively UNION ALL operators:
The INTERSECT Join
The INTERSECT operator combines the two datasets in one dataset formed from the rows found to be present in both datasets.
-- INTERSECT example (A∩B) SELECT ITM.ProductID , ITM.Name ProductName , ITM.ProductNumber , ITM.StandardCost , ITM.MakeFlag FROM Production.Product ITM WHERE ITM.StandardCost BETWEEN 0 AND 200 INTERSECT SELECT ITM.ProductID , ITM.Name ProductName , ITM.ProductNumber , ITM.StandardCost , ITM.MakeFlag FROM Production.Product ITM WHERE ITM.StandardCost BETWEEN 100 AND 250
Because the above two datasets are using the same table the final output is the same with the output of the following query, the intersection could be thus applied directly to the constraints used:
-- alternative for INTERSECT example (A∩B) SELECT ITM.ProductID , ITM.Name ProductName , ITM.ProductNumber , ITM.StandardCost , ITM.MakeFlag FROM Production.Product ITM WHERE ITM.StandardCost BETWEEN 100 AND 200
In order to exemplify what’s happening in the background I will use only two rows from each of the two datasets used as input, on the right side being shown the output of applying the INTERSECT operator:
The EXCEPT Join
-- EXCEPT example (A\B) SELECTITM.ProductID , ITM.Name ProductName , ITM.ProductNumber , ITM.StandardCost , ITM.MakeFlag FROM Production.Product ITM WHERE ITM.StandardCost BETWEEN 0 AND 200 EXCEPT SELECT ITM.ProductID , ITM.Name ProductName , ITM.ProductNumber , ITM.StandardCost ,ITM.MakeFlag FROM Production.Product ITM WHERE ITM.StandardCost BETWEEN 100 AND 250 -- EXCEPT example (B\A) SELECT ITM.ProductID , ITM.Name ProductName , ITM.ProductNumber , ITM.StandardCost , ITM.MakeFlag FROM Production.Product ITM WHERE ITM.StandardCost BETWEEN 100 AND 250 EXCEPT SELECT ITM.ProductID , ITM.Name ProductName , ITM.ProductNumber , ITM.StandardCost , ITM.MakeFlag FROM Production.Product ITM WHERE ITM.StandardCost BETWEEN 0 AND 200
Note:
Because the above two datasets are using the same table the final output is the same with the output of the following queries, the subtraction could be thus applied directly to the constraints used:
-- alternative for SUBTRACTION example (A\B) SELECT ITM.ProductID , ITM.Name ProductName , ITM.ProductNumber , ITM.StandardCost , ITM.MakeFlag FROM Production.Product ITM WHERE ITM.StandardCost >= 0 AND ITM.StandardCost < 100 -- alternative for SUBTRACTION example (B\A) SELECT ITM.ProductID , ITM.Name ProductName , ITM.ProductNumber , ITM.StandardCost , ITM.MakeFlag FROMProduction.Product ITM WHERE ITM.StandardCost > 200 AND ITM.StandardCost <= 250
In order to exemplify what’s happening in the background I will use only two rows from each of the two datasets used as input, on the right side being shown the output of applying the EXCEPT for A\B, respectively B\A:
For the sake of simplicity all the above exemplifications are using only two datasets, though multiple datasets could be joined in this way, here is an example for the UNION ALL based on three datasets (please note that I avoided to list the attributes):
-- multiple UNION SELECT * FROMProduction.Product ITM WHERE ITM.StandardCost BETWEEN 0 AND 50UNION ALLSELECT * FROM Production.Product ITM WHERE ITM.StandardCost BETWEEN 100 AND 150 UNION ALL SELECT * FROM Production.Product ITM WHERE ITM.StandardCost BETWEEN 200 AND 250
Note:
Also in this case the union could be applied directly on the constraints, though because there is no overlap between any combination of the three ranges, the constraint needs to be modified accordingly:
-- multiple UNION SELECT * FROM Production.Product ITM WHERE ITM.StandardCost BETWEEN 0 AND 50 OR ITM.StandardCost BETWEEN 100 AND 150 OR ITM.StandardCost BETWEEN 200 AND 250
When the union and intersection operations are used with multiple datasets it doesn’t matter the order of the datasets because the two operations are associative: (AUB)UC=AU(BUC)=AUBUC, respectively (A∩B)∩C=A∩(B∩C)=A∩B∩C, though not the same could be said about subtraction because (A\B)\C <> A\(B\C) or the cases in which operators are combined, for example (A\B)UC<>A\(BUC), so is needed to use parentheses in order to avoid eventual surprises.
-- (A\B)UC ( -- subtraction SELECT * FROM Production.Product ITM WHERE ITM.StandardCost BETWEEN 0 AND 150 EXCEPT SELECT * FROM Production.Product ITM WHERE ITM.StandardCost BETWEEN 100 AND 150 ) UNION SELECT * FROM Production.Product ITM WHERE ITM.StandardCost BETWEEN 150 AND 250
Knowing the basics of Set Theory is important in making your queries returning the expected results, the use of some of the identities allowing to simplify considerably the queries sometimes. Without a set-mindset I think it will be difficult to solve such problems.
No comments:
Post a Comment