Showing posts with label union. Show all posts
Showing posts with label union. Show all posts

29 April 2024

Power BI: Working with Visual Calculations (Part III: Matrix Tables with Square Numbers as Example)

Introduction

In the previous post I exemplified various operations that can be performed with visual calculations on simple tables based on square numbers. Changing the simple table to a matrix table doesn't bring any benefit. The real benefit comes when one restructures the table to store only a cell per row in a table. 

Data Modelling

For this the Magic5 table can be transformed via the following code, which creates a second table (e.g. M5):

M5 = UNION (
    SUMMARIZECOLUMNS(
     Magic5[Id]
     , Magic5[R]
     , Magic5[Index]
     , Magic5[C1]
     , "Col", "C1"
    )
    , SUMMARIZECOLUMNS(
     Magic5[Id]
     , Magic5[R]
     , Magic5[Index]
     , Magic5[C2]
     , "Col", "C2"
    )
    , SUMMARIZECOLUMNS(
     Magic5[Id]
     , Magic5[R]
     , Magic5[Index]
     , Magic5[C3]
     , "Col", "C3"
    )
    ,  SUMMARIZECOLUMNS(
     Magic5[Id]
     , Magic5[R]
     , Magic5[Index]
     , Magic5[C4]
     , "Col", "C4"
    )
    , SUMMARIZECOLUMNS(
      Magic5[Id]
     , Magic5[R]
     , Magic5[Index]
     , Magic5[C5]
     , "Col", "C5"
    )
)

Once this done, one can add the column [Col] as values for the matrix in a new visual. From now on, all the calculations can be done on copies of this visual. 

Simple Operations

The behavior of the RUNNINGSUM and other functions is different when applied on a matrix table because the formula is applied to every cell of the N*N table, a column with the result being added for each existing column of the matrix.

Moreover, there are four different ways of applying the formula based on the Axis used. ROW calculates the formula by the row within a column:

Run SumByRow(C) = RUNNINGSUM([C], ROWS)
Output:
R C Run Sum(C) C Run Sum(C) C Run Sum(C) C Run Sum(C) C Run Sum(C)
R1 18 18 25 25 2 2 9 9 11 11
R2 4 22 6 31 13 15 20 29 22 33
R3 15 37 17 48 24 39 1 30 8 41
R4 21 58 3 51 10 49 12 42 19 60
R5 7 65 14 65 16 65 23 65 5 65

By providing COLUMNS as parameter for the Axis makes the calculation run by the column within a row: 

Run SumByCol(C) = RUNNINGSUM([C], COLUMNS)
Output:
R C Run Sum(C) C Run Sum(C) C Run Sum(C) C Run Sum(C) C Run Sum(C)
R1 18 18 25 43 2 45 9 54 11 65
R2 4 4 6 10 13 23 20 43 22 65
R3 15 15 17 32 24 56 1 57 8 65
R4 21 21 3 24 10 34 12 46 19 65
R5 7 7 14 21 16 37 23 60 5 65

By providing ROW COLUMNS as parameter for the Axis makes the calculation run by the column and then continuing the next column (without resetting the value at the end of the column):
Run SumByRow-Col(C) = RUNNINGSUM([C],ROWS COLUMNS)
Output:
R C Run Sum(C) C Run Sum(C) C Run Sum(C) C Run Sum(C) C Run Sum(C)
R1 18 18 25 90 2 132 9 204 11 271
R2 4 22 6 96 13 145 20 224 22 293
R3 15 37 17 113 24 169 1 225 8 301
R4 21 58 3 116 10 179 12 237 19 320
R5 7 65 14 130 16 195 23 260 5 325

By providing COLUMNS ROWS as parameter for the Axis makes the calculation run by the row and then continuing the next row (without resetting the value at the end of the column):
Run SumByCol-Row = RUNNINGSUM([C],COLUMNS ROWS)
Output:
R C Run Sum(C) C Run Sum(C) C Run Sum(C) C Run Sum(C) C Run Sum(C)
R1 18 18 25 43 2 45 9 54 11 65
R2 4 69 6 75 13 88 20 108 22 130
R3 15 145 17 162 24 186 1 187 8 195
R4 21 216 3 219 10 229 12 241 19 260
R5 7 267 14 281 16 297 23 320 5 325

Ranking

RANK can be applied independent of the values, or considering the value with ASC or DESC sorting:
RankByRow = RANK(DENSE,ROWS) -- ranking by row independent of values
RankByRow ASC = RANK(DENSE,ROWS, ORDERBY([C],ASC)) -- ranking by row ascending
RankByRow DESC = RANK(DENSE,ROWS, ORDERBY([C], DESC)) -- ranking by row descending
RankByRow-Col ASC = RANK(DENSE,ROWS COLUMNS, ORDERBY([C],ASC)) -- ranking by row columns ascending
RankByRow-Col DESC = RANK(DENSE,ROWS COLUMNS, ORDERBY([C], DESC)) -- ranking by row columns ascending

[RankByRow-Col ASC] matches the actual numbers from the matrix and is thus useful when sorting any numbers accordingly. 

Differences

Differences can be calculated between any of the cells of the matrix:
DiffToPrevByRow = [C] - PREVIOUS([C])  -- difference to previous record
DiffToPrevByRow* = IF(NOT(IsBlank(PREVIOUS([C]))), [C] - PREVIOUS([C])) -- extended difference to previous record
DiffToPrevByRow-Col = [C] - PREVIOUS([C],, ROWS COLUMNS) -- difference to previous record by ROWS COLUMNS
DiffToFirstByRow = [C] - FIRST([C]) -- difference to first record
DiffToPrevByCol = [C] - FIRST([C], COLUMNS) -- difference to previous record COLUMNS

Ranking = RANK(DENSE, ROWS COLUMNS, ORDERBY([C], ASC)) -- ranking of values by ROWS COLUMNS
OffsetDiffToPrevByRow = [C] - calculate([C], OFFSET(1, ROWS, ORDERBY([Ranking],DESC))) -- difference to the previous record by ROW
OffsetDiffToPrevByRow-Col = [C] - calculate([C], OFFSET(1, ROWS COLUMNS, ORDERBY([Ranking],DESC))) -- difference to the previous record by ROW

Ranking has been introduced to facilitate the calculations based on OFFSET.

The other functions [1] can be applied similarly.

Happy coding!

Previous Post <<||>> Next Post

References:
[1] Microsoft Learn (2024) Power BI: Using visual calculations [preview] (link)

05 April 2024

SQL Reloaded: SQL Antipatterns (Part I: JOINs, UNIONs & DISTINCT)

Introduction

SQL antipatterns refer in general to common mistakes made when developing SQL code, though the term can refer also to situations in which even if the code is syntactically and logically correct, it's either suboptimal, unclear or even incorrect. Therefore "mistake" can cover a wide range of scenarios, some that can be ignored, while others need to be addressed accordingly. 

In this post I consider a few antipatterns observed especially in data warehouses (DWHs). Let's look at the below code created to exemplify several scenarios:

-- Products in open orders (initial query)
SELECT DISTINCT ITM.ProductId                                   -- (1) use of DISTINCT
, ITM.ProductNumber
, ITM.Name 
, ITM.Color 
, ITM.Style 
, ITM.Size
FROM Production.Product ITM
    LEFT JOIN (							-- (5) use of JOIN instead of EXISTS
	-- Open Purchase orders 
	SELECT DISTINCT POL.ProductId
	, 'POs' Source                                          -- (7) use columns not needed in output
	FROM Purchasing.PurchaseOrderDetail POL                 
	     LEFT JOIN Purchasing.PurchaseOrderHeader POH       -- (2) use of LEFT JOIN instead of FULL JOIN
		  ON POL.PurchaseOrderID = POH.PurchaseOrderID
	WHERE POH.Status = 1 -- pending 
	UNION					                -- (3) use of UNION
	-- Open Sales orders 
	SELECT DISTINCT SOL.ProductId
	, 'SOs' Source
	FROM Sales.SalesOrderDetail SOL
	    LEFT JOIN Sales.SalesOrderHeader SOH
		  ON SOL.SalesOrderID = SOH.SalesOrderID
	WHERE SOH.Status = 1 -- in process		        -- (4) use of OR instead of IN
	   OR SOH.Status = 2 -- approved
	) DAT
	ON ITM.ProductID = DAT.ProductID
WHERE DAT.ProductID IS NOT NULL 
ORDER BY ITM.ProductNumber			                -- (6) using too many columns in ORDER BY
, ITM.Name 
, ITM.Color 
, ITM.Style 
, ITM.Size

(1) Use of DISTINCT 

DISTINCT is a dirty way to remove the duplicates from a dataset. Sometimes it makes sense to use it to check something fast, though it should be avoided into code intended for a production environment because it can lead to unexpected behavior especially when selecting all the columns using the "*" (SELECT DISTINCT *).

I saw tools and developers adding a DISTINCT in almost each step, independently on whether it was necessary or not. One can thus but wonder whether the DISTINCT was added to fix a bigger issue with the data in the DWH, to remove special duplicates imposed by the logic or just as poor practice. Unfortunately, when it's used frequently, it can become challenging to investigate its use and discover the actual issues from the DWH.

There are several approaches to eliminate DISTINCTs from the code: GROUP BY, ranking functions or upon case also code rewrites.

(2)  Use of LEFT JOIN Instead of FULL JOIN

When refreshing a DWH there can be the case that related data are out of synch. It would be the case of Purchase or Sales orders where the headers and lines are out of synch, headers existing without lines and/or vice versa. A common practice is to use a FULL JOIN and thus to eliminate such exceptions, though there are also entitled uses of a LEFT JOIN. This antipattern resumes however to the cases in which logically should be used a FULL JOIN, though a LEFT JOIN is used instead.

In the above example there are two distinct occurrences of this pattern: the relationship between the header and lines in the inner query, respectively the LEFT JOIN with a NOT NULL constraint in the outer query. The latter use is useful when during testing one wants to see all the Products, though bringing this further into production may rise some eyebrows even if it's not necessarily wrong. Anyway, the database engine should be smart enough to recognize such a scenario. However, for the header vs lines case, upon case the plan generated might be suboptimal. 

One of the best practices when writing SQL queries is to state one's intent clearly in what the logic concerns. Using a LEFT JOIN instead of a FULL JOIN can make people raise questions about the actual need. When the need is not properly documented, some developer may even go and change the joins. There can be for example business cases that are not cover by the current data, but as soon as case appears it will lead to incorrect logic!

Similarly, splitting a piece of logic into two or more steps unnecessarily can create confusion. There can be however also entitled s situations (e.g. query optimization), which ideally should be documented.

(3) Use of UNION

When a UNION is used, the values returned by the first query will be checked against the values of the second query, and thus unnecessary comparisons occur even if they are not needed. This depends also on the business context, which might not be easily to identify from the query (especially when the reviewer doesn't know the business case). 

The misuse of a UNION will not make a big difference when the volume of data is small, though the more data are processed by the query, the higher the impact. 

Besides the proper use of the UNION, there are also situations in which a query rewrite can eliminate the need for a UNION (see the rewritten query below).

(4) use of OR instead of IN

One can occasionally find queries in which a OR was used for 10 to 50 distinct values as in the example above. Even if the database engine generate in both cases the same query plan, it's easier to read and maintain a query that used IN. However, if the number of values go beyond a certain value, other techniques should be used to improve the performance.

The only benefit I see for a OR is to document meaning's values or remove during testing one of the values, especially when the list is a selection coming from the user. Frankly, it's not the appropriate way for documenting logic (even if I'm doing it sometimes in ad-hoc queries).

There's a more extreme scenario in which distinct subqueries are written for each or a set of ORs (e.g. the distinction between open vs closed vs. invoices orders), which can make sense sometimes (e.g. the logic is completely different). Therefore, an antipattern can be dependent also of the context or use case. 

(5) use of JOIN instead of EXISTS

When there are no values returned from the subquery, quite often it makes sense to the EXISTS or not EXISTS operators in the queries (see the rewritten query below). This might not be indicated however for distributed environments like serverless SQL pool in which the distribution of the processing across multiple tasks might benefit when the pieces of the logic distributed don't require heavy reshuffles. 

(6) Using too Many Columns in ORDER BY

The columns specified in an ORDER BY clause need to make sense, otherwise they just add extra burden on the database engine, at least from the perspective of the checks that need to be performed. In the above query, at least the Name doesn't make sense.

It helps also if the columns can use existing indexes, though this depends also on query specifics. 

Another antipattern scenario not exemplified above is the use of ordinals to refer to the columns, which should be avoided in production environments (because the order of the columns can be changed accidentally or even :

-- using ordinals instead of number columns (not recommended)
SELECT ITM.ProductId                                  
, ITM.ProductNumber
, ITM.Name 
, ITM.Color 
, ITM.Style 
, ITM.Size
FROM Production.Product ITM                           
ORDER BY 2, 4, 5, 6

(7) Use Columns Not  Needed in Output

Besides the fact that each column included unnecessarily in the query can increase the size of the data processed (unless the database engine is smart to remove them), there can be also performance issues and/or optimizations involved. For example, if all the other columns are part of a covering index, the database engine might opt for a suboptimal index compared to the case in which the unnecessary columns are removed. 

Conversely, some columns are helpful to troubleshoot the logic (and that's why the Source column was considered) even if they aren't considered in the final output or the logic. It doesn't make sense to bring the query version with the respective fields into production, even if this would mean to have maybe a second version of the query used only for troubleshooting needs. Commenting the unnecessary columns could be a better choice, even if it's not recommended in general as too many such comments can obfuscate the case. 

Rewriting the Query

With the above input the query can be rewritten as follows:

-- Products in open orders (modified query)
SELECT ITM.ProductId                                  
, ITM.ProductNumber
, ITM.Name 
, ITM.Color 
, ITM.Style 
, ITM.Size
FROM Production.Product ITM
WHERE EXISTS (										            
	-- Open Purchase orders 
	SELECT POL.ProductId
	FROM Purchasing.PurchaseOrderDetail POL                 
	     JOIN Purchasing.PurchaseOrderHeader POH      
		  ON POL.PurchaseOrderID = POH.PurchaseOrderID
	WHERE POH.Status = 1 
	  AND ITM.ProductID = POL.ProductID
	)
 OR EXISTS (				                                    
	-- Open Sales orders 
	SELECT SOL.ProductId
	FROM Sales.SalesOrderDetail SOL
	     JOIN Sales.SalesOrderHeader SOH
		  ON SOL.SalesOrderID = SOH.SalesOrderID
	WHERE SOH.Status IN (1, 2)
	  AND ITM.ProductID = SOL.ProductID
	)	                           
ORDER BY ITM.ProductNumber			                           
, ITM.Color 
, ITM.Style 
, ITM.Size

Please note that in general to each rule there are also exceptions which should be considered against one's common sense. When the benefit of addressing an antipattern is neglectable compared with the effort involved and the logic doesn't create any issues, probably it's better to let the code as it. One can still reconsider the antipatterns later with the next refactoring opportunity. 

There are zealous data professionals who treat minor inconveniences (e.g. not using upper case for SQL reserved words, alternate code formatting, alternative writing of words, especially function names, different indentation, the use of "--" for commenting within a query, etc.) as antipatterns. Use your common sense and evaluate the effort against the benefits or risks and, not less important, be patient with others' mistakes!

Happy coding!

19 December 2023

SQL Reloaded: Blocked Products in Dynamics 365 F&O

Besides listing the products released by Legal entity (see previous post), it's useful to know whether they were blocked for Inventory, Sales or Procurement. This is quite easy when doing it over the data entity:

-- check status via the data entity
SELECT PSO.ItemNumber 
, PSO.DataAreaId 
, PSO.IsSalesProcessingStopped
, PSO.IsInventoryProcessingStopped 
, PSO.IsProcurementProcessingStopped
FROM dbo.InventProductSpecificOrderSettingsV3Entity PSO
ORDER BY PSO.ItemNumber 
, PSO.DataAreaId

However, when the data entity is not available, the logic gets a bit more complex because the data are stored in 3 different tables: InventItemInventSetup, InventItemPurchSetup, respectively InventItemSalesSetup. Here's the piece of logic used to get the various statuses in AX 2009 (of course, without the JOINs on Partition) and it works also in Dynamics 365 F&O:

/* Product Specific Order Settings via JOINs */
SELECT ITM.DataAreaId 
, ITM.ItemId 
, IPS.Stopped IsProcurementProcessingStopped
, ILS.Stopped IsInventoryProcessingStopped
, ISS.Stopped IsSalesProcessingStopped
FROM dbo.InventTable ITM
     LEFT JOIN dbo.InventItemPurchSetup IPS
       ON ITM.ItemID = IPS.ItemId
      AND ITM.DataAreaId = IPS.DataAreaId 
      AND ITM.Partition = IPS.Partition
      AND IPS.InventDimId = 'AllBlank'
     LEFT JOIN dbo.InventItemSalesSetup ISS
       ON ITM.ItemID = ISS.ItemId
      AND ITM.DataAreaId = ISS.DataAreaId 
      AND ITM.Partition = ISS.Partition
      AND ISS.InventDimId = 'AllBlank'
     LEFT JOIN dbo.InventItemInventSetup ILS
       ON ITM.ItemID = ILS.ItemId
      AND ITM.DataAreaId = ILS.DataAreaId 
      AND ITM.Partition = ILS.Partition
      AND ILS.InventDimId = 'AllBlank'
ORDER BY ITM.ItemId
,  ITM.DataAreaId 

The constraint on InventDimId is necessary because there can be multiple records for the Product and Legal entity combination. 

Alternatively, one can use UNIONs instead of JOINs and include the logic into a view to simplify the final query (the query was written to test the behavior in a distributed environment like serverless SQL pool):

-- create the view
CREATE OR ALTER VIEW TDM.vProductSpecificOrderSettings
AS
/* Product Specific Order Settings via UNIONs */
SELECT IST.DataAreaId 
, IST.ItemId 
, IST.Partition 
, max(IST.IsInventoryProcessingStopped) IsInventoryProcessingStopped
, max(IST.IsSalesProcessingStopped) IsSalesProcessingStopped
, max(IST.IsProcurementProcessingStopped) IsProcurementProcessingStopped
FROM (
	-- inventory
	SELECT IIS.DataAreaId 
	, IIS.ItemId 
	, IIS.Partition
	, IIS.Stopped IsInventoryProcessingStopped
	, 0 IsSalesProcessingStopped
	, 0 IsProcurementProcessingStopped
	FROM dbo.InventItemInventSetup IIS
	WHERE IIS.InventDimId = 'AllBlank'
	UNION ALL
	-- purchasing
	SELECT IPS.DataAreaId 
	, IPS.ItemId 
	, IPS.Partition 
	, 0
	, 0
	, IPS.Stopped
	FROM dbo.InventItemPurchSetup IPS
	WHERE IPS.InventDimId = 'AllBlank'
	UNION ALL
	-- sales
	SELECT ISS.DataAreaId 
	, ISS.ItemId 
	, ISS.Partition 
	, 0
	, ISS.Stopped
	, 0
	FROM dbo.InventItemSalesSetup ISS
	WHERE ISS.InventDimId = 'AllBlank'
 ) IST
 GROUP BY IST.DataAreaId 
, IST.ItemId 
, IST.Partition 

/* Product Specific Order Settings via UNIONs */
SELECT ITM.DataAreaId 
, ITM.ItemId 
, PSO.IsInventoryProcessingStopped
, PSO.IsSalesProcessingStopped
, PSO.IsProcurementProcessingStopped
FROM dbo.InventTable ITM
     LEFT JOIN TDM.vProductSpecificOrderSettings PSO
	   ON ITM.ItemId = PSO.ItemId 
	  AND ITM.DataAreaId = PSO.DataAreaId 
	  AND ITM.Partition = PSO.Partition
ORDER BY ITM.ItemId 
, ITM.DataAreaId 

At least for the database used for testing on an SQL Server 2022 instance, the last query has slightly lower estimated subtree cost and memory grant than the previous one. It's also interesting that the data entity-based query outperforms the other two queries. 

And here's the combined query from this and previous post:

-- Legal Entities vs Products incl. product order settings
SELECT DAT.Id DataAreaId 
, PRD.DisplayProductNumber ItemId 
, CASE WHEN ITM.DataAreaId IS NOT NULL THEN 1 ELSE 0 END IsReleased
, CASE WHEN ITM.CreatedDatetime <> ITM.ModifiedDateTime THEN 1 ELSE 0 END IsModified
, PSO.IsInventoryProcessingStopped
, PSO.IsSalesProcessingStopped
, PSO.IsProcurementProcessingStopped
FROM dbo.DataArea DAT
     CROSS JOIN dbo.EcoResProduct PRD
	 LEFT JOIN dbo.InventTable ITM
	   ON DAT.Id = ITM.DataAreaId 
	  AND PRD.DisplayProductNumber = ITM.ItemId
	  AND PRD.Partition = ITM.Partition
          LEFT JOIN TDM.vProductSpecificOrderSettings PSO
	        ON ITM.ItemId = PSO.ItemId 
	       AND ITM.DataAreaId = PSO.DataAreaId 
	       AND ITM.Partition = PSO.Partition
WHERE DAT.Id <> 'DAT'
   AND PRD.DisplayProductNumber = 'D0001'
ORDER BY PRD.DisplayProductNumber
, DAT.Id

Happy coding!

09 April 2011

SQL Reloaded: Table Value Constructors at Work

    Paradoxically, the first new SQL Server 2008 feature I learned, soon after the first CTP was made available, was the use of table value constructors, however I managed all this time without using it. This happened because I was either working on previous SQL Server versions or because other techniques were more appropriate to the problems I had to solve. So, a few days ago, the use of table value constructors came again on the table when I was weighting what’s the best approach to a problem posed by a colleague. Actually, the problem is quite simple – given a pair of values (id, new value) is needed to update a table with the “new value” corresponding to the given “id”. Previously, in similar cases, I preferred to load the data in a staging table because this was allowing me to quickly perform several checks on the set of values (e.g. check for duplicates, identify how many records will be updated). But what do you do when you are working directly in a production or any other type of environment in which you don’t have the possibility of using a staging table?

    For such cases you can either create a dataset on the fly with a set of UNIONS and reuse it in queries or, create an UPDATE statement for each pair of values. You can even automate the creation of UNION and UPDATE statements, however their use doesn’t seem so elegant, especially when you can use table value constructors. The feature seems to be hidden in documentation, as a matter of fact I found it only in the online documentation, the local help having no reference to it. Anyway, excepting a few posts on the same subject I can’t say I have seen any code until now making use of it. What’s interesting to note, before going into details, is that the ANSI SQL 94 (the earliest version of documentation I have) mentions vaguely table value constructors (p. 221), therefore I expect they are part of ANSI standard.

   According to MSDN a table value constructor allows a set of row value expressions to be constructed into a table within a single DML statement. As can be seen from the MSDN examples the list of values can contain a combination of list of values and queries, as long the data type matches and only a single scalar value is provided as a row value expression Another limitation concerns the maximum number of rows constructed, the maximum value being 1000. It’s important to know these limitations, because they might force you use other approaches. Before sketching a solution for this post’s problem, let’s look how a table can be constructed on the fly given a list of pair-values:

-- example table value constructor 
SELECT EmployeeId 
, VacationHours 
FROM ( VALUES (101, '23') 
     , (102, '82') 
     , (103, '66') 
     , (104, '17') 
      , (105, '64') 
      , (106, '56') 
      , (107, '107') 
      , (108, '50') 
      , (109, '109') 
      , (110, '48')) AS Data(EmployeeId, VacationHours) 

    The table thus constructed can be further used in an UPDATE statement, the following example being based on HumanResources.Employee from AdventureWorks database.

-- example table value constructor update UPDATE HumanResources.Employee 
SET VacationHours = Data.VacationHours 
FROM ( VALUES (101, '23') 
      , (102, '82') 
      , (103, '66') 
      , (104, '17') 
      , (105, '64') 
      , (106, '56') 
      , (107, '107') 
      , (108, '50') 
      , (109, '109') 
      , (110, '48')) AS Data(EmployeeId, VacationHours) 
WHERE HumanResources.Employee.EmployeeId = Data.EmployeeId 

    A similar construct can be be used to perform a SELECT, UPDATE or DELETE in which multiple columns participate in the JOIN. For example let’s consider a SELECT based on a set of pairs of values from the Sales.SalesOrderHeader table:

-- example table value constructor
SELECT SOH.* 
FROM Sales.SalesOrderHeader SOH 
     JOIN ( -- set of pairs 
        VALUES(10, 5069) 
         , (10, 6070) 
         , (10, 7060) 
         , (11, 407) 
         , (11, 1603) 
         , (11, 2737)) AS Data(CustomerId, CurrencyRateID) 
      ON SOH.CustomerId = Data.CustomerID 
    AND SOH.CurrencyRateID = Data.CurrencyRateID 

    As can be seen table value constructors are pretty simple to use, especially when dealing with a small number of values. If they are the best choice, that depends on the problem you are trying to solve and its specific constraints!

01 April 2010

SQL Reloaded: The Power of Joins II (Horizontal Joins)

Introduction 
 
    A horizontal join allows retrieving data from two or more datasets by merging/comparing the rows from each involved dataset, the join of rows being done on vertical, thus a row from the final result dataset containing only the attributes from one of the input datasets, the attributes being matched based on their order in the dataset (first attribute with first attribute from each dataset, second with the second, and so on), the output attribute taking the name of the correspondent attribute from the first. It worth to mention that the comparison is done at row level, thus we could talk about distinct rows if at least one of the attributes is having different values.

    There are two important constraints horizontal joins comes with: the datasets involved in a vertical join must have the same number of attributes and the joined attributes must have the same data type. If one of the attributes is not available in one of the datasets, a NULL or other default value could be provided for the missing attribute, while in case data types can’t be converted implicitly then the conversion must be done explicitly to a data type sufficient to accommodate the data type of the attributes joined under the same attribute.
  
      The general syntax for an horizontal join could be written as follows: 

-- Horizontal Join syntax 
SELECT A.Attribute1 
[, A.Attribute] 
FROM TABLE1 A 
[WHERE <constraints>] 
UNION | INTERSECT | EXCEPT [ALL] 
SELECT B.Attribute1 
[, B.Attribute] 
FROM TABLE2 B 
[WHERE <join_constraints>]  
[ORDER BY <order_attributes>] 

Notes:
   
1.    There are 3 horizontal join operators UNION for union (AUB), INTERSECT (A∩B) for intersection and EXCEPT for subtraction (A\B), and only UNION could be used in combination with ALL keyword in order to show ignore the duplicates. Please note that only the union and intersection operators are commutative: AUB = BUA, respectively A∩B = B∩A, while A\B<>B\A. 
 
2.    The SELECT statement for each dataset could include any complex logic supported, including aggregates, window ranking and aggregate functions, virtual tables (views, user-defined functions), vertical joins, etc. 
 
3.     The horizontal join may contain only one ORDER BY clause placed at the end of the query, and with it could be used directly the attributes’ name or alias given to the columns from first dataset. In case aliases are used, and especially when many attributes are involved or then when the attribute involve complex logic, it makes sense to use the same alias also for the other datasets involved in the join.

   For demonstrating the horizontal operations I will use the following two datasets based on the same table, respectively Products table from AdventureWorks database. 
 
-- Dataset A 
SELECT ITM.ProductID 
, ITM.Name ProductName 
, ITM.ProductNumber 
, ITM.StandardCost 
, ITM.MakeFlag 
FROM Production.Product ITM 
WHERE ITM.StandardCost BETWEEN 0 AND 200 

-- Dataset B 
SELECT ITM.ProductID 
, ITM.Name ProductName 
, ITM.ProductNumber 
, ITM.StandardCost 
, ITM.MakeFlag 
FROM Production.Product ITM 
WHERE ITM.StandardCost BETWEEN 100 AND 250 

The UNION Join 

    The UNION join, also referred as the outer union, allows combining two datasets in one dataset formed from the combined rows of each data set, the eventual duplicate rows being eliminated. In case are needed also the duplicates, then the ALL keyword could be used in combination with the UNION (also referred as UNION ALL operator), this technique being recommended when the two datasets have no common records because when a simple UNION is used the database engine still checks for duplicates. 

-- UNION example (AUB) 
SELECT ITM.ProductID 
, ITM.Name ProductName 
, ITM.ProductNumber 
, ITM.StandardCost 
, ITM.MakeFlag 
FROM Production.Product ITM 
WHERE ITM.StandardCost BETWEEN 0 AND 200 
UNION 
SELECT ITM.ProductID 
, ITM.Name ProductName 
, ITM.ProductNumber 
, ITM.StandardCost 
, ITM.MakeFlag 
FROM Production.Product ITM 
WHERE ITM.StandardCost BETWEEN 100 AND 250  

Note:
   
     
     Because the above two datasets are using the same table the final output is the same with the output of the below query, the reunion being thus applied directly to the constraints used. It is actually recommended to avoid using unions:

-- alternative for UNION example (AUB) 
SELECTITM.ProductID 
, ITM.Name ProductName 
, ITM.ProductNumber 
, ITM.StandardCost 
, ITM.MakeFlag 
FROM Production.Product ITM 
WHERE ITM.StandardCost BETWEEN 0 AND 250 

      In order to exemplify what’s happening in the background I will use only two rows from each of the two datasets used as input, on the right side being shown the output of applying the UNION, respectively UNION ALL operators:
horizontal joins - union presentation

The INTERSECT Join 
 
    The INTERSECT operator combines the two datasets in one dataset formed from the rows found to be present in both datasets. 
 
-- INTERSECT example (A∩B) 
SELECT ITM.ProductID 
, ITM.Name ProductName 
, ITM.ProductNumber 
, ITM.StandardCost 
, ITM.MakeFlag 
FROM Production.Product ITM 
WHERE ITM.StandardCost BETWEEN 0 AND 200 
INTERSECT 
SELECT ITM.ProductID 
, ITM.Name ProductName 
, ITM.ProductNumber 
, ITM.StandardCost 
, ITM.MakeFlag 
FROM Production.Product ITM 
WHERE ITM.StandardCost BETWEEN 100 AND 250 

Note:

   Because the above two datasets are using the same table the final output is the same with the output of the following query, the intersection could be thus applied directly to the constraints used: 
  
-- alternative for INTERSECT example (A∩B) 
SELECT ITM.ProductID 
, ITM.Name ProductName 
, ITM.ProductNumber 
, ITM.StandardCost 
, ITM.MakeFlag 
FROM Production.Product ITM 
WHERE ITM.StandardCost BETWEEN 100 AND 200 

     In order to exemplify what’s happening in the background I will use only two rows from each of the two datasets used as input, on the right side being shown the output of applying the INTERSECT operator:
horizontal joins - intersect presentation 
The EXCEPT Join 

-- EXCEPT example (A\B) 
SELECTITM.ProductID 
, ITM.Name ProductName 
, ITM.ProductNumber 
, ITM.StandardCost 
, ITM.MakeFlag 
FROM Production.Product ITM 
WHERE ITM.StandardCost BETWEEN 0 AND 200 
EXCEPT 
SELECT ITM.ProductID 
, ITM.Name ProductName 
, ITM.ProductNumber 
, ITM.StandardCost 
,ITM.MakeFlag 
FROM Production.Product ITM 
WHERE ITM.StandardCost BETWEEN 100 AND 250 

-- EXCEPT example (B\A) 
SELECT ITM.ProductID 
, ITM.Name ProductName 
, ITM.ProductNumber 
, ITM.StandardCost 
, ITM.MakeFlag 
FROM Production.Product ITM 
WHERE ITM.StandardCost BETWEEN 100 AND 250 
EXCEPT 
SELECT ITM.ProductID 
, ITM.Name ProductName 
, ITM.ProductNumber 
, ITM.StandardCost 
, ITM.MakeFlag 
FROM Production.Product ITM 
WHERE ITM.StandardCost BETWEEN 0 AND 200 

   As can be seen from the above two examples there are two possibilities of subtracting two datasets – A\B, respectively B\A, selecting the rows which are only in A and not in B, respectively the rows which are in B and not in A.

Note:
   Because the above two datasets are using the same table the final output is the same with the output of the following queries, the subtraction could be thus applied directly to the constraints used: 

-- alternative for SUBTRACTION example (A\B) 
SELECT ITM.ProductID 
, ITM.Name ProductName 
, ITM.ProductNumber 
, ITM.StandardCost 
, ITM.MakeFlag 
FROM Production.Product ITM 
WHERE ITM.StandardCost >= 0 
    AND ITM.StandardCost < 100 

-- alternative for SUBTRACTION example (B\A) 
SELECT ITM.ProductID 
, ITM.Name ProductName 
, ITM.ProductNumber 
, ITM.StandardCost , ITM.MakeFlag 
FROMProduction.Product ITM 
WHERE ITM.StandardCost > 200 
    AND ITM.StandardCost <= 250 

      In order to exemplify what’s happening in the background I will use only two rows from each of the two datasets used as input, on the right side being shown the output of applying the EXCEPT for A\B, respectively B\A: 
 
horizontal joins - except presentation

Combining Operators 
 
    For the sake of simplicity all the above exemplifications are using only two datasets, though multiple datasets could be joined in this way, here is an example for the UNION ALL based on three datasets (please note that I avoided to list the attributes): 
 
-- multiple UNION 
SELECT * 
FROMProduction.Product ITM 
WHERE ITM.StandardCost BETWEEN 0 AND 50 
UNION ALL
SELECT * FROM Production.Product ITM WHERE ITM.StandardCost BETWEEN 100 AND 150 UNION ALL SELECT * FROM Production.Product ITM WHERE ITM.StandardCost BETWEEN 200 AND 250

Note:
     Also in this case the union could be applied directly on the constraints, though because there is no overlap between any combination of the three ranges, the constraint needs to be modified accordingly: 

  
-- multiple UNION 
SELECT * 
FROM Production.Product ITM 
WHERE ITM.StandardCost BETWEEN 0 AND 50 
    OR ITM.StandardCost BETWEEN 100 AND 150 
    OR ITM.StandardCost BETWEEN 200 AND 250 

    When the union and intersection operations are used with multiple datasets it doesn’t matter the order of the datasets because the two operations are associative: (AUB)UC=AU(BUC)=AUBUC, respectively (A∩B)∩C=A∩(B∩C)=A∩B∩C, though not the same could be said about subtraction because (A\B)\C <> A\(B\C) or the cases in which operators are combined, for example (A\B)UC<>A\(BUC), so is needed to use parentheses in order to avoid eventual surprises. 
  
-- (A\B)UC 
( 
    -- subtraction 
    SELECT * 
    FROM Production.Product ITM 
    WHERE ITM.StandardCost BETWEEN 0 AND 150 
    EXCEPT 
    SELECT * 
    FROM Production.Product ITM 
    WHERE ITM.StandardCost BETWEEN 100 AND 150 
) 
UNION 
SELECT * 
FROM Production.Product ITM 
WHERE ITM.StandardCost BETWEEN 150 AND 250 

Note:

      Knowing the basics of Set Theory is important in making your queries returning the expected results, the use of some of the identities allowing to simplify considerably the queries sometimes. Without a set-mindset I think it will be difficult to solve such problems.
Related Posts Plugin for WordPress, Blogger...

About Me

My photo
IT Professional with more than 24 years experience in IT in the area of full life-cycle of Web/Desktop/Database Applications Development, Software Engineering, Consultancy, Data Management, Data Quality, Data Migrations, Reporting, ERP implementations & support, Team/Project/IT Management, etc.