Showing posts with label JOIN. Show all posts
Showing posts with label JOIN. Show all posts

05 April 2024

💎SQL Reloaded: SQL Antipatterns (Part I: JOINs, UNIONs & DISTINCT)

Introduction

SQL antipatterns refer in general to common mistakes made when developing SQL code, though the term can refer also to situations in which even if the code is syntactically and logically correct, it's either suboptimal, unclear or even incorrect. Therefore "mistake" can cover a wide range of scenarios, some that can be ignored, while others need to be addressed accordingly. 

In this post I consider a few antipatterns observed especially in data warehouses (DWHs). Let's look at the below code created to exemplify several scenarios:

-- Products in open orders (initial query)
SELECT DISTINCT ITM.ProductId                                   -- (1) use of DISTINCT
, ITM.ProductNumber
, ITM.Name 
, ITM.Color 
, ITM.Style 
, ITM.Size
FROM Production.Product ITM
    LEFT JOIN (							-- (5) use of JOIN instead of EXISTS
	-- Open Purchase orders 
	SELECT DISTINCT POL.ProductId
	, 'POs' Source                                          -- (7) use columns not needed in output
	FROM Purchasing.PurchaseOrderDetail POL                 
	     LEFT JOIN Purchasing.PurchaseOrderHeader POH       -- (2) use of LEFT JOIN instead of FULL JOIN
		  ON POL.PurchaseOrderID = POH.PurchaseOrderID
	WHERE POH.Status = 1 -- pending 
	UNION					                -- (3) use of UNION
	-- Open Sales orders 
	SELECT DISTINCT SOL.ProductId
	, 'SOs' Source
	FROM Sales.SalesOrderDetail SOL
	    LEFT JOIN Sales.SalesOrderHeader SOH
		  ON SOL.SalesOrderID = SOH.SalesOrderID
	WHERE SOH.Status = 1 -- in process		        -- (4) use of OR instead of IN
	   OR SOH.Status = 2 -- approved
	) DAT
	ON ITM.ProductID = DAT.ProductID
WHERE DAT.ProductID IS NOT NULL 
ORDER BY ITM.ProductNumber			                -- (6) using too many columns in ORDER BY
, ITM.Name 
, ITM.Color 
, ITM.Style 
, ITM.Size

(1) Use of DISTINCT 

DISTINCT is a dirty way to remove the duplicates from a dataset. Sometimes it makes sense to use it to check something fast, though it should be avoided into code intended for a production environment because it can lead to unexpected behavior especially when selecting all the columns using the "*" (SELECT DISTINCT *).

I saw tools and developers adding a DISTINCT in almost each step, independently on whether it was necessary or not. One can thus but wonder whether the DISTINCT was added to fix a bigger issue with the data in the DWH, to remove special duplicates imposed by the logic or just as poor practice. Unfortunately, when it's used frequently, it can become challenging to investigate its use and discover the actual issues from the DWH.

There are several approaches to eliminate DISTINCTs from the code: GROUP BY, ranking functions or upon case also code rewrites.

(2)  Use of LEFT JOIN Instead of FULL JOIN

When refreshing a DWH there can be the case that related data are out of synch. It would be the case of Purchase or Sales orders where the headers and lines are out of synch, headers existing without lines and/or vice versa. A common practice is to use a FULL JOIN and thus to eliminate such exceptions, though there are also entitled uses of a LEFT JOIN. This antipattern resumes however to the cases in which logically should be used a FULL JOIN, though a LEFT JOIN is used instead.

In the above example there are two distinct occurrences of this pattern: the relationship between the header and lines in the inner query, respectively the LEFT JOIN with a NOT NULL constraint in the outer query. The latter use is useful when during testing one wants to see all the Products, though bringing this further into production may rise some eyebrows even if it's not necessarily wrong. Anyway, the database engine should be smart enough to recognize such a scenario. However, for the header vs lines case, upon case the plan generated might be suboptimal. 

One of the best practices when writing SQL queries is to state one's intent clearly in what the logic concerns. Using a LEFT JOIN instead of a FULL JOIN can make people raise questions about the actual need. When the need is not properly documented, some developer may even go and change the joins. There can be for example business cases that are not cover by the current data, but as soon as case appears it will lead to incorrect logic!

Similarly, splitting a piece of logic into two or more steps unnecessarily can create confusion. There can be however also entitled s situations (e.g. query optimization), which ideally should be documented.

(3) Use of UNION

When a UNION is used, the values returned by the first query will be checked against the values of the second query, and thus unnecessary comparisons occur even if they are not needed. This depends also on the business context, which might not be easily to identify from the query (especially when the reviewer doesn't know the business case). 

The misuse of a UNION will not make a big difference when the volume of data is small, though the more data are processed by the query, the higher the impact. 

Besides the proper use of the UNION, there are also situations in which a query rewrite can eliminate the need for a UNION (see the rewritten query below).

(4) use of OR instead of IN

One can occasionally find queries in which a OR was used for 10 to 50 distinct values as in the example above. Even if the database engine generate in both cases the same query plan, it's easier to read and maintain a query that used IN. However, if the number of values go beyond a certain value, other techniques should be used to improve the performance.

The only benefit I see for a OR is to document meaning's values or remove during testing one of the values, especially when the list is a selection coming from the user. Frankly, it's not the appropriate way for documenting logic (even if I'm doing it sometimes in ad-hoc queries).

There's a more extreme scenario in which distinct subqueries are written for each or a set of ORs (e.g. the distinction between open vs closed vs. invoices orders), which can make sense sometimes (e.g. the logic is completely different). Therefore, an antipattern can be dependent also of the context or use case. 

(5) use of JOIN instead of EXISTS

When there are no values returned from the subquery, quite often it makes sense to the EXISTS or not EXISTS operators in the queries (see the rewritten query below). This might not be indicated however for distributed environments like serverless SQL pool in which the distribution of the processing across multiple tasks might benefit when the pieces of the logic distributed don't require heavy reshuffles. 

(6) Using too Many Columns in ORDER BY

The columns specified in an ORDER BY clause need to make sense, otherwise they just add extra burden on the database engine, at least from the perspective of the checks that need to be performed. In the above query, at least the Name doesn't make sense.

It helps also if the columns can use existing indexes, though this depends also on query specifics. 

Another antipattern scenario not exemplified above is the use of ordinals to refer to the columns, which should be avoided in production environments (because the order of the columns can be changed accidentally or even :

-- using ordinals instead of number columns (not recommended)
SELECT ITM.ProductId                                  
, ITM.ProductNumber
, ITM.Name 
, ITM.Color 
, ITM.Style 
, ITM.Size
FROM Production.Product ITM                           
ORDER BY 2, 4, 5, 6

(7) Use Columns Not  Needed in Output

Besides the fact that each column included unnecessarily in the query can increase the size of the data processed (unless the database engine is smart to remove them), there can be also performance issues and/or optimizations involved. For example, if all the other columns are part of a covering index, the database engine might opt for a suboptimal index compared to the case in which the unnecessary columns are removed. 

Conversely, some columns are helpful to troubleshoot the logic (and that's why the Source column was considered) even if they aren't considered in the final output or the logic. It doesn't make sense to bring the query version with the respective fields into production, even if this would mean to have maybe a second version of the query used only for troubleshooting needs. Commenting the unnecessary columns could be a better choice, even if it's not recommended in general as too many such comments can obfuscate the case. 

Rewriting the Query

With the above input the query can be rewritten as follows:

-- Products in open orders (modified query)
SELECT ITM.ProductId                                  
, ITM.ProductNumber
, ITM.Name 
, ITM.Color 
, ITM.Style 
, ITM.Size
FROM Production.Product ITM
WHERE EXISTS (										            
	-- Open Purchase orders 
	SELECT POL.ProductId
	FROM Purchasing.PurchaseOrderDetail POL                 
	     JOIN Purchasing.PurchaseOrderHeader POH      
		  ON POL.PurchaseOrderID = POH.PurchaseOrderID
	WHERE POH.Status = 1 
	  AND ITM.ProductID = POL.ProductID
	)
 OR EXISTS (				                                    
	-- Open Sales orders 
	SELECT SOL.ProductId
	FROM Sales.SalesOrderDetail SOL
	     JOIN Sales.SalesOrderHeader SOH
		  ON SOL.SalesOrderID = SOH.SalesOrderID
	WHERE SOH.Status IN (1, 2)
	  AND ITM.ProductID = SOL.ProductID
	)	                           
ORDER BY ITM.ProductNumber			                           
, ITM.Color 
, ITM.Style 
, ITM.Size

Please note that in general to each rule there are also exceptions which should be considered against one's common sense. When the benefit of addressing an antipattern is neglectable compared with the effort involved and the logic doesn't create any issues, probably it's better to let the code as it. One can still reconsider the antipatterns later with the next refactoring opportunity. 

There are zealous data professionals who treat minor inconveniences (e.g. not using upper case for SQL reserved words, alternate code formatting, alternative writing of words, especially function names, different indentation, the use of "--" for commenting within a query, etc.) as antipatterns. Use your common sense and evaluate the effort against the benefits or risks and, not less important, be patient with others' mistakes!

Happy coding!

Previous Post <<||>> Next Post

19 December 2023

💎💫SQL Reloaded: Blocked Products in Dynamics 365 F&O

Besides listing the products released by Legal entity (see previous post), it's useful to know whether they were blocked for Inventory, Sales or Procurement. This is quite easy when doing it over the data entity:

-- check status via the data entity
SELECT PSO.ItemNumber 
, PSO.DataAreaId 
, PSO.IsSalesProcessingStopped
, PSO.IsInventoryProcessingStopped 
, PSO.IsProcurementProcessingStopped
FROM dbo.InventProductSpecificOrderSettingsV3Entity PSO
ORDER BY PSO.ItemNumber 
, PSO.DataAreaId

However, when the data entity is not available, the logic gets a bit more complex because the data are stored in 3 different tables: InventItemInventSetup, InventItemPurchSetup, respectively InventItemSalesSetup. Here's the piece of logic used to get the various statuses in AX 2009 (of course, without the JOINs on Partition) and it works also in Dynamics 365 F&O:

/* Product Specific Order Settings via JOINs */
SELECT ITM.DataAreaId 
, ITM.ItemId 
, IPS.Stopped IsProcurementProcessingStopped
, ILS.Stopped IsInventoryProcessingStopped
, ISS.Stopped IsSalesProcessingStopped
FROM dbo.InventTable ITM
     LEFT JOIN dbo.InventItemPurchSetup IPS
       ON ITM.ItemID = IPS.ItemId
      AND ITM.DataAreaId = IPS.DataAreaId 
      AND ITM.Partition = IPS.Partition
      AND IPS.InventDimId = 'AllBlank'
     LEFT JOIN dbo.InventItemSalesSetup ISS
       ON ITM.ItemID = ISS.ItemId
      AND ITM.DataAreaId = ISS.DataAreaId 
      AND ITM.Partition = ISS.Partition
      AND ISS.InventDimId = 'AllBlank'
     LEFT JOIN dbo.InventItemInventSetup ILS
       ON ITM.ItemID = ILS.ItemId
      AND ITM.DataAreaId = ILS.DataAreaId 
      AND ITM.Partition = ILS.Partition
      AND ILS.InventDimId = 'AllBlank'
ORDER BY ITM.ItemId
,  ITM.DataAreaId 

The constraint on InventDimId is necessary because there can be multiple records for the Product and Legal entity combination. 

Alternatively, one can use UNIONs instead of JOINs and include the logic into a view to simplify the final query (the query was written to test the behavior in a distributed environment like serverless SQL pool):

-- create the view
CREATE OR ALTER VIEW TDM.vProductSpecificOrderSettings
AS
/* Product Specific Order Settings via UNIONs */
SELECT IST.DataAreaId 
, IST.ItemId 
, IST.Partition 
, max(IST.IsInventoryProcessingStopped) IsInventoryProcessingStopped
, max(IST.IsSalesProcessingStopped) IsSalesProcessingStopped
, max(IST.IsProcurementProcessingStopped) IsProcurementProcessingStopped
FROM (
	-- inventory
	SELECT IIS.DataAreaId 
	, IIS.ItemId 
	, IIS.Partition
	, IIS.Stopped IsInventoryProcessingStopped
	, 0 IsSalesProcessingStopped
	, 0 IsProcurementProcessingStopped
	FROM dbo.InventItemInventSetup IIS
	WHERE IIS.InventDimId = 'AllBlank'
	UNION ALL
	-- purchasing
	SELECT IPS.DataAreaId 
	, IPS.ItemId 
	, IPS.Partition 
	, 0
	, 0
	, IPS.Stopped
	FROM dbo.InventItemPurchSetup IPS
	WHERE IPS.InventDimId = 'AllBlank'
	UNION ALL
	-- sales
	SELECT ISS.DataAreaId 
	, ISS.ItemId 
	, ISS.Partition 
	, 0
	, ISS.Stopped
	, 0
	FROM dbo.InventItemSalesSetup ISS
	WHERE ISS.InventDimId = 'AllBlank'
 ) IST
 GROUP BY IST.DataAreaId 
, IST.ItemId 
, IST.Partition 

/* Product Specific Order Settings via UNIONs */
SELECT ITM.DataAreaId 
, ITM.ItemId 
, PSO.IsInventoryProcessingStopped
, PSO.IsSalesProcessingStopped
, PSO.IsProcurementProcessingStopped
FROM dbo.InventTable ITM
     LEFT JOIN TDM.vProductSpecificOrderSettings PSO
	   ON ITM.ItemId = PSO.ItemId 
	  AND ITM.DataAreaId = PSO.DataAreaId 
	  AND ITM.Partition = PSO.Partition
ORDER BY ITM.ItemId 
, ITM.DataAreaId 

At least for the database used for testing on an SQL Server 2022 instance, the last query has slightly lower estimated subtree cost and memory grant than the previous one. It's also interesting that the data entity-based query outperforms the other two queries. 

And here's the combined query from this and previous post:

-- Legal Entities vs Products incl. product order settings
SELECT DAT.Id DataAreaId 
, PRD.DisplayProductNumber ItemId 
, CASE WHEN ITM.DataAreaId IS NOT NULL THEN 1 ELSE 0 END IsReleased
, CASE WHEN ITM.CreatedDatetime <> ITM.ModifiedDateTime THEN 1 ELSE 0 END IsModified
, PSO.IsInventoryProcessingStopped
, PSO.IsSalesProcessingStopped
, PSO.IsProcurementProcessingStopped
FROM dbo.DataArea DAT
     CROSS JOIN dbo.EcoResProduct PRD
	 LEFT JOIN dbo.InventTable ITM
	   ON DAT.Id = ITM.DataAreaId 
	  AND PRD.DisplayProductNumber = ITM.ItemId
	  AND PRD.Partition = ITM.Partition
          LEFT JOIN TDM.vProductSpecificOrderSettings PSO
	        ON ITM.ItemId = PSO.ItemId 
	       AND ITM.DataAreaId = PSO.DataAreaId 
	       AND ITM.Partition = PSO.Partition
WHERE DAT.Id <> 'DAT'
   AND PRD.DisplayProductNumber = 'D0001'
ORDER BY PRD.DisplayProductNumber
, DAT.Id

Happy coding!

12 December 2023

💎💫SQL Reloaded: Released Products by Legal Entity in Dynamics 365 F&O

To check in which Legal Entity (LE) a Product was released using the standard UI, the user needs to go through each LE and search for the respective Product, action which can be time-consuming and error-prone if done for multiple Products. An alternative would be to use the Excel add-in or the Table Browser and export the data for each LE, respectively aggregate the data in one file and build a power pivot on top of it. If this kind of checks need to be done on a regular basis, then a better solution is to build a simple report.

When a Product is released for a Legal Entity (LE) a record is created in the InventTable table and is accessible over the dbo.EcoResReleasedProductV2Entity data entity as well. One can build the logic as a GROUP BY with an entry for each LE in scope. 

-- Released Products by Legal Entity
SELECT ItemId
, Max(CASE WHEN DataAreaId ='cnmf' THEN 1 ELSE 0 END) [cnmf]
, Max(CASE WHEN DataAreaId ='demf' THEN 1 ELSE 0 END) [demf]
, Max(CASE WHEN DataAreaId ='inmf' THEN 1 ELSE 0 END) [inmf]
, Max(CASE WHEN DataAreaId ='jpmf' THEN 1 ELSE 0 END) [jpmf]
, Max(CASE WHEN DataAreaId ='mymf' THEN 1 ELSE 0 END) [mymf]
, Max(CASE WHEN DataAreaId ='rumf' THEN 1 ELSE 0 END) [rumf]
, Max(CASE WHEN DataAreaId ='samf' THEN 1 ELSE 0 END) [samf]
, Max(CASE WHEN DataAreaId ='thmf' THEN 1 ELSE 0 END) [thmf]
, Max(CASE WHEN DataAreaId ='usmf' THEN 1 ELSE 0 END) [usmf]
, count(*) NoRecords
FROM dbo.InventTable
WHERE DataAreaId <> 'DAT'
-- AND ItemId = 'D0001'
GROUP BY ItemId
ORDER BY NoRecords DESC
-- HAVING count(*)>1 -- 

Unfortunately, the query is static and needs to be adjusted each time a new LE is added or removed. Unless the query is build and executed dynamically, there is apparently no way in SQL to include/exclude the LEs dynamically. However, one can use a cartesian product between LEs and Products, and retrieve the corresponding entry from the Released products table:

-- Legal Entities vs Products 
SELECT DAT.Id DataAreaId 
, PRD.DisplayProductNumber ItemId 
, CASE WHEN ITM.DataAreaId IS NOT NULL THEN 1 ELSE 0 END IsReleased
, CASE WHEN ITM.CreatedDatetime <> ITM.ModifiedDateTime THEN 1 ELSE 0 END IsModified
FROM dbo.DataArea DAT
     CROSS JOIN dbo.EcoResProduct PRD
	 LEFT JOIN dbo.InventTable ITM
	   ON DAT.Id = ITM.DataAreaId 
	  AND PRD.DisplayProductNumber = ITM.ItemId
	  AND PRD.Partition = ITM.Partition
WHERE DAT.Id <> 'DAT'
   AND PRD.DisplayProductNumber = 'D0001'
ORDER BY PRD.DisplayProductNumber
, DAT.Id 

Further on, the data can be aggregated in a matrix display in a paginated report, given that the respective functionality can dynamically display the LEs in scope. 

Notes:
1) The first query will run also in AX 2009, while the second will run only in D365 F&O.
2) Besides checking whether a Product was released (see IsReleased column) it's useful to check whether the Product was modified (see IsModified). If the Created and Modified timestamps are the same, then the Product was not touched after its release. 
3) The two queries can be extended to retrieve further Product-specific data (e.g. Unit of Measure, Base Prices, Default vendor, etc.).
4) It's safer to use the corresponding data entities instead of the tables.
5) It's also useful to check whether the Products are blocked for Sales, Purchasing or Inventory (see next post). 

Happy coding!

04 February 2023

💎SQL Reloaded: Alternatives for Better Code Maintainability in SQL Server & Azure Synapse II

Few of the answers we search resume to simple queries as the one considered in the previous post. Usually, there are at least 5-10 tables involved, with more or less complex joins and different levels of aggregation. With each table added the number of options for implementing the logic increases. Based on analysis' particularities we can use a combination of views, table-valued functions, CTEs, temporary tables or table variables.

Let's consider two more tables involving aggregations on Sales orders and On-hand:

-- products analysis (via JOINs)
SELECT ITM.ProductNumber
, ITM.Name
, IsNull(OHD.OnHand, 0) OnHand
, IsNull(SOL.SalesQty, 0) SalesQty
, IsNull(POL.PurchQty, 0) PurchQty
FROM [Production].[Product] ITM
     LEFT JOIN ( -- cumulated on hand 
		SELECT OHD.ProductId
		, SUM(OHD.Quantity) OnHand
		, 0 SalesQty
		, 0 PurchQty
		FROM [Production].[ProductInventory] OHD
		GROUP BY OHD.ProductId
	 ) OHD
	   ON ITM.ProductId = OHD.ProductId 
	 LEFT JOIN ( -- cumulated sales orders
		SELECT SOL.ProductId
		, 0 OnHand
		, SUM(SOL.OrderQty) SalesQty
		, 0 PurchQty
		FROM [Sales].[SalesOrderDetail] SOL
		GROUP BY SOL.ProductId
	 ) SOL
	   ON ITM.ProductID = SOL.ProductId
	 LEFT JOIN (-- cumulated open purchase orders
		SELECT POL.ProductId 
		, 0 OnHand 
		, 0 SalesQty
		, SUM(POL.OrderQty) PurchQty
		FROM Purchasing.PurchaseOrderDetail POL
		WHERE OrderQty - (ReceivedQty - RejectedQty)>0
		  --AND POL.DueDate BETWEEN '2014-01-01' AND '2014-12-31'
		GROUP BY POL.ProductId 
	 ) POL
	    ON ITM.ProductId = POL.ProductId
WHERE COALESCE(OHD.ProductId, SOL.ProductId, POL.ProductId) IS NOT NULL
ORDER BY ITM.ProductNumber

With the help of CTEs the final query can be simplified considerably:

-- products analysis (via CTEs)
WITH OnHand 
AS ( -- cumulated on hand 
	SELECT OHD.ProductId
	, SUM(OHD.Quantity) OnHand
	, 0 SalesQty
	, 0 PurchQty
	FROM [Production].[ProductInventory] OHD
	GROUP BY OHD.ProductId
)
, SalesOrders
AS ( -- cumulated sales orders
		SELECT SOL.ProductId
		, 0 OnHand
		, SUM(SOL.OrderQty) SalesQty
		, 0 PurchQty
		FROM [Sales].[SalesOrderDetail] SOL
		GROUP BY SOL.ProductId
)
, OpenPOs 
AS ( -- cumulated open purchase orders
	SELECT POL.ProductId 
	, 0 OnHand 
	, 0 SalesQty
	, SUM(POL.OrderQty) PurchQty
	FROM Purchasing.PurchaseOrderDetail POL
	WHERE OrderQty - (ReceivedQty - RejectedQty)>0
		--AND POL.DueDate BETWEEN '2014-01-01' AND '2014-12-31'
	GROUP BY POL.ProductId 
) 
SELECT ITM.ProductNumber
, ITM.Name
, IsNull(OHD.OnHand, 0) OnHand
, IsNull(SOL.SalesQty, 0) SalesQty
, IsNull(POL.PurchQty, 0) PurchQty
FROM [Production].[Product] ITM
     LEFT JOIN OnHand OHD
	   ON ITM.ProductId = OHD.ProductId 
	 LEFT JOIN SalesOrders SOL
	   ON ITM.ProductID = SOL.ProductId
	 LEFT JOIN OpenPOs POL
	    ON ITM.ProductId = POL.ProductId
WHERE COALESCE(OHD.ProductId, SOL.ProductId, POL.ProductId) IS NOT NULL
ORDER BY ITM.ProductNumber

This way of structuring the code allows us in theory also a better way to troubleshoot the logic on segments (each CTE) and a better grip at the end query. In fact, any of the above alternatives would result in a similar writing of the end query and, with the right indexes and processing power, the performance won't be in general an issue. 

A CTE is just a way of reorganizing a query and the changes made by moving the logic to the CTE  shouldn't have any impact on the performance. Even if the constraints and tables appear somewhere else, the database engine will push down predicates and move operators around to obtain an optimal algebraic tree. In theory, the two approaches should lead to the same query plan. Of course, there are also exceptions, e.g. when restructuring the logic inline in a query without CTE can favor one plan over the other.

There's also the temptation to use CTEs for everything. An extreme case is using a CTE for each table on which a filter is applied instead of bringing the constraints into the JOIN or the WHERE clause. Even if the former approach allows troubleshooting individual sources, it can create further overhead as is needed to check back and forth between CTEs and JOINs, taking thus more time and allowing some errors maybe to escape. CTEs are for breaking down complex logic to a level at which the code can be managed and not create further overhead. There must be a balance between usability and effort. 

Moving to a distributed environment like Azure Synapse, where tables need to be distributed between several compute nodes, because of the dependencies existing between tables, queries based on several aggregations might take longer than expected. Besides following the best practices for table design including tables' distribution (see [1], [2]), it's maybe worth playing around with the query structure and see how the database engine will react to it. 

For example, the considered query could be restructured as follows by transforming the many LEFT JOINs to UNIONs: 

-- products analysis (via UNIONs)
SELECT ITM.ProductNumber
, ITM.Name
, SUM(DAT.OnHand) OnHand
, SUM(DAT.SalesQty) SalesQty
, SUM(DAT.PurchQty) PurchQty
FROM [Production].[Product] ITM
     JOIN (
		-- cumulated on hand 
		SELECT OHD.ProductId
		, SUM(OHD.Quantity) OnHand
		, 0 SalesQty
		, 0 PurchQty
		FROM [Production].[ProductInventory] OHD
		GROUP BY OHD.ProductId
		UNION ALL
		-- cumulated sales orders
		SELECT SOL.ProductId
		, 0 OnHand
		, SUM(SOL.OrderQty) SalesQty
		, 0 PurchQty
		FROM [Sales].[SalesOrderDetail] SOL
		GROUP BY SOL.ProductId
		UNION ALL
		-- cumulated open purchase orders
		SELECT POL.ProductId 
		, 0 OnHand 
		, 0 SalesQty
		, SUM(POL.OrderQty) PurchQty
		FROM Purchasing.PurchaseOrderDetail POL
		WHERE OrderQty - (ReceivedQty - RejectedQty)>0
		  --AND POL.DueDate BETWEEN '2014-01-01' AND '2014-12-31'
		GROUP BY POL.ProductId 
	) DAT
	ON ITM.ProductId = DAT.ProductId
GROUP BY ITM.ProductNumber
, ITM.Name
ORDER BY ITM.ProductNumber

In similar situations I found this approach to provide better performance, especially when working with serverless SQL pools. Though, this might depend also on tables' particularities. 

If the queries take long time and the result sets are needed frequently, it might be useful to store the intermediary or final result sets in (base) tables, or in case of serverless SQL pool to parquet files and have the data updated on a regular basis. Upon case, unless the data chance too often, this might prove a much better option than using temporary tables or table variables.

Happy coding!


Resources:
[1] Microsoft Learn (2022) Guidance for designing distributed tables using dedicated SQL pool in Azure Synapse Analytics (link)
[2] Microsoft Learn (2022) Design tables using dedicated SQL pool in Azure Synapse Analytics (link)


16 May 2020

💎🏭SQL Reloaded: Query Patterns in SQL Server (Part II: Joins and Subqueries)

The basis for being able to manipulate data via SQL scripting is a good knowledge of using joins and subqueries as seems fit for the purpose and data models. The following scripts are based on the tables created in a previous post.

-- FULL JOIN
SELECT *
FROM dbo.T_Courses C
     JOIN dbo.T_Students S
        ON C.CourseId = S.CourseId
 ORDER BY C.CourseName
 , S.StudentName 

-- FULL JOIN (deprecated)
 SELECT *
 FROM dbo.T_Courses C
 , dbo.T_Students S
 WHERE C.CourseId = S.CourseId
 ORDER BY C.CourseName
 , S.StudentName 

--LEFT JOIN
 SELECT *
 FROM dbo.T_Courses C
      LEFT JOIN dbo.T_Students S
        ON C.CourseId = S.CourseId 
 WHERE S.CourseId IS NULL
 ORDER BY C.CourseName
 , S.StudentName 

-- RIGHT JOIN
 SELECT *
 FROM dbo.T_Courses C
      RIGHT JOIN dbo.T_Students S
       ON C.CourseId = S.CourseId 
 WHERE C.CourseId IS NULL
 ORDER BY C.CourseName
 , S.StudentName 

-- FULL OUTER JOIN
 SELECT *
 FROM dbo.T_Courses C
      FULL OUTER JOIN dbo.T_Students S
        ON C.CourseId = S.CourseId 
 --WHERE C.CourseId IS NULL
 --WHERE S.CourseId IS NULL
 --WHERE (C.CourseId IS NULL OR S.StudentId IS NULL)
 --WHERE (C.CourseId IS NULL AND S.StudentId IS NULL)
 ORDER BY C.CourseName
 , S.StudentName 

The IN, NOT IN, EXISTS and NOT EXISTS allow using correlated queries, their use being indicated when there are no actual data needed from the tables involved in the correlated queries:

-- EXISTS (correlated subquery)
SELECT *
FROM dbo.T_Courses C
WHERE EXISTS (SELECT StudentId 
        FROM dbo.T_Students S
        WHERE C.CourseId = S.CourseId)
ORDER BY C.CourseName

-- NOT EXISTS (correlated subquery)
 SELECT *
 FROM dbo.T_Courses C
 WHERE EXISTS (SELECT StudentId 
        FROM dbo.T_Students S
        WHERE C.CourseId = S.CourseId)
 ORDER BY C.CourseName

-- IN (subquery)
 SELECT *
 FROM dbo.T_Courses C
 WHERE CourseId IN (SELECT CourseId 
        FROM dbo.T_Students S)
 ORDER BY C.CourseName

Joining multiples tables is done using the same principles as above:

-- joins with more tables 
SELECT A.CourseId 
, C.CourseName 
, A.StudentId 
, S.StudentName 
 , A.StartDate 
 , A.EndDate 
FROM dbo.T_Allocations A
      JOIN dbo.T_Courses C
        ON A.CourseId = C.CourseId 
      JOIN dbo.T_Students S
        ON A.StudentId = S.StudentId 
 ORDER BY C.CourseName 
 , S.StudentName 

One can obtain the same result via correlated subqueries (a technique often met between Oracle developers). From readability reasons I avoid writing such queries, unless there’s a special purpose to do so.

-- correlated subquery for individual values
SELECT A.CourseId 
, (SELECT C.CourseName 
    FROM dbo.T_Courses C
    WHERE A.CourseId = C.CourseId) CourseName
 , A.StudentId 
 , (SELECT S.StudentName 
    FROM dbo.T_Students S
    WHERE A.StudentId = S.StudentId) StudentName
 , A.StartDate 
 , A.EndDate 
FROM dbo.T_Allocations A
 ORDER BY CourseName 
 , StudentName 

When displaying values within a SELECT via a correlated subqueries, some developers feel the need to use MAX or MIN functions to make sure only one value will be returned. For data analysis it may be acceptable, however if the data model imposes it, then a redesign of the solution is more likely necessary.

-- correlated subquery for individual values
 SELECT A.CourseId 
 , (SELECT Max(C.CourseName)
    FROM dbo.T_Courses C
    WHERE A.CourseId = C.CourseId) CourseName
 , A.StudentId 
 , (SELECT Max(S.StudentName)
    FROM dbo.T_Students S
    WHERE A.StudentId = S.StudentId) StudentName
 , A.StartDate 
 , A.EndDate 
 FROM dbo.T_Allocations A
 ORDER BY CourseName 
 , StudentName 

Another technique not recommended is displaying one or more attributes from the same table with the same conditions via individual correlated queries. The use of aggregate functions is more appropriate however with numerical or date values.

Notes:
The queries work also in SQL databases in Microsoft Fabric.

Happy coding!

Previous Post <<||>> Next Post

Related Posts Plugin for WordPress, Blogger...

About Me

My photo
Koeln, NRW, Germany
IT Professional with more than 24 years experience in IT in the area of full life-cycle of Web/Desktop/Database Applications Development, Software Engineering, Consultancy, Data Management, Data Quality, Data Migrations, Reporting, ERP implementations & support, Team/Project/IT Management, etc.