04 February 2023

SQL Reloaded: Alternatives for Better Code Maintainability in SQL Server & Azure Synapse II

Few of the answers we search resume to simple queries as the one considered in the previous post. Usually, there are at least 5-10 tables involved, with more or less complex joins and different levels of aggregation. With each table added the number of options for implementing the logic increases. Based on analysis' particularities we can use a combination of views, table-valued functions, CTEs, temporary tables or table variables.

Let's consider two more tables involving aggregations on Sales orders and On-hand:

-- products analysis (via JOINs)
SELECT ITM.ProductNumber
, ITM.Name
, IsNull(OHD.OnHand, 0) OnHand
, IsNull(SOL.SalesQty, 0) SalesQty
, IsNull(POL.PurchQty, 0) PurchQty
FROM [Production].[Product] ITM
     LEFT JOIN ( -- cumulated on hand 
		SELECT OHD.ProductId
		, SUM(OHD.Quantity) OnHand
		, 0 SalesQty
		, 0 PurchQty
		FROM [Production].[ProductInventory] OHD
		GROUP BY OHD.ProductId
	 ) OHD
	   ON ITM.ProductId = OHD.ProductId 
	 LEFT JOIN ( -- cumulated sales orders
		SELECT SOL.ProductId
		, 0 OnHand
		, SUM(SOL.OrderQty) SalesQty
		, 0 PurchQty
		FROM [Sales].[SalesOrderDetail] SOL
		GROUP BY SOL.ProductId
	 ) SOL
	   ON ITM.ProductID = SOL.ProductId
	 LEFT JOIN (-- cumulated open purchase orders
		SELECT POL.ProductId 
		, 0 OnHand 
		, 0 SalesQty
		, SUM(POL.OrderQty) PurchQty
		FROM Purchasing.PurchaseOrderDetail POL
		WHERE OrderQty - (ReceivedQty - RejectedQty)>0
		  --AND POL.DueDate BETWEEN '2014-01-01' AND '2014-12-31'
		GROUP BY POL.ProductId 
	 ) POL
	    ON ITM.ProductId = POL.ProductId
WHERE COALESCE(OHD.ProductId, SOL.ProductId, POL.ProductId) IS NOT NULL
ORDER BY ITM.ProductNumber

With the help of CTEs the final query can be simplified considerably:

-- products analysis (via CTEs)
WITH OnHand 
AS ( -- cumulated on hand 
	SELECT OHD.ProductId
	, SUM(OHD.Quantity) OnHand
	, 0 SalesQty
	, 0 PurchQty
	FROM [Production].[ProductInventory] OHD
	GROUP BY OHD.ProductId
)
, SalesOrders
AS ( -- cumulated sales orders
		SELECT SOL.ProductId
		, 0 OnHand
		, SUM(SOL.OrderQty) SalesQty
		, 0 PurchQty
		FROM [Sales].[SalesOrderDetail] SOL
		GROUP BY SOL.ProductId
)
, OpenPOs 
AS ( -- cumulated open purchase orders
	SELECT POL.ProductId 
	, 0 OnHand 
	, 0 SalesQty
	, SUM(POL.OrderQty) PurchQty
	FROM Purchasing.PurchaseOrderDetail POL
	WHERE OrderQty - (ReceivedQty - RejectedQty)>0
		--AND POL.DueDate BETWEEN '2014-01-01' AND '2014-12-31'
	GROUP BY POL.ProductId 
) 
SELECT ITM.ProductNumber
, ITM.Name
, IsNull(OHD.OnHand, 0) OnHand
, IsNull(SOL.SalesQty, 0) SalesQty
, IsNull(POL.PurchQty, 0) PurchQty
FROM [Production].[Product] ITM
     LEFT JOIN OnHand OHD
	   ON ITM.ProductId = OHD.ProductId 
	 LEFT JOIN SalesOrders SOL
	   ON ITM.ProductID = SOL.ProductId
	 LEFT JOIN OpenPOs POL
	    ON ITM.ProductId = POL.ProductId
WHERE COALESCE(OHD.ProductId, SOL.ProductId, POL.ProductId) IS NOT NULL
ORDER BY ITM.ProductNumber

This way of structuring the code allows us in theory also a better way to troubleshoot the logic on segments (each CTE) and a better grip at the end query. In fact, any of the above alternatives would result in a similar writing of the end query and, with the right indexes and processing power, the performance won't be in general an issue. 

A CTE is just a way of reorganizing a query and the changes made by moving the logic to the CTE  shouldn't have any impact on the performance. Even if the constraints and tables appear somewhere else, the database engine will push down predicates and move operators around to obtain an optimal algebraic tree. In theory, the two approaches should lead to the same query plan. Of course, there are also exceptions, e.g. when restructuring the logic inline in a query without CTE can favor one plan over the other.

There's also the temptation to use CTEs for everything. An extreme case is using a CTE for each table on which a filter is applied instead of bringing the constraints into the JOIN or the WHERE clause. Even if the former approach allows troubleshooting individual sources, it can create further overhead as is needed to check back and forth between CTEs and JOINs, taking thus more time and allowing some errors maybe to escape. CTEs are for breaking down complex logic to a level at which the code can be managed and not create further overhead. There must be a balance between usability and effort. 

Moving to a distributed environment like Azure Synapse, where tables need to be distributed between several compute nodes, because of the dependencies existing between tables, queries based on several aggregations might take longer than expected. Besides following the best practices for table design including tables' distribution (see [1], [2]), it's maybe worth playing around with the query structure and see how the database engine will react to it. 

For example, the considered query could be restructured as follows by transforming the many LEFT JOINs to UNIONs: 

-- products analysis (via UNIONs)
SELECT ITM.ProductNumber
, ITM.Name
, SUM(DAT.OnHand) OnHand
, SUM(DAT.SalesQty) SalesQty
, SUM(DAT.PurchQty) PurchQty
FROM [Production].[Product] ITM
     JOIN (
		-- cumulated on hand 
		SELECT OHD.ProductId
		, SUM(OHD.Quantity) OnHand
		, 0 SalesQty
		, 0 PurchQty
		FROM [Production].[ProductInventory] OHD
		GROUP BY OHD.ProductId
		UNION ALL
		-- cumulated sales orders
		SELECT SOL.ProductId
		, 0 OnHand
		, SUM(SOL.OrderQty) SalesQty
		, 0 PurchQty
		FROM [Sales].[SalesOrderDetail] SOL
		GROUP BY SOL.ProductId
		UNION ALL
		-- cumulated open purchase orders
		SELECT POL.ProductId 
		, 0 OnHand 
		, 0 SalesQty
		, SUM(POL.OrderQty) PurchQty
		FROM Purchasing.PurchaseOrderDetail POL
		WHERE OrderQty - (ReceivedQty - RejectedQty)>0
		  --AND POL.DueDate BETWEEN '2014-01-01' AND '2014-12-31'
		GROUP BY POL.ProductId 
	) DAT
	ON ITM.ProductId = DAT.ProductId
GROUP BY ITM.ProductNumber
, ITM.Name
ORDER BY ITM.ProductNumber

In similar situations I found this approach to provide better performance, especially when working with serverless SQL pools. Though, this might depend also on tables' particularities. 

If the queries take long time and the result sets are needed frequently, it might be useful to store the intermediary or final result sets in (base) tables, or in case of serverless SQL pool to parquet files and have the data updated on a regular basis. Upon case, unless the data chance too often, this might prove a much better option than using temporary tables or table variables.

Happy coding!

Resources:
[1] Microsoft Learn (2022) Guidance for designing distributed tables using dedicated SQL pool in Azure Synapse Analytics (link)
[2] Microsoft Learn (2022) Design tables using dedicated SQL pool in Azure Synapse Analytics (link)


No comments:

Related Posts Plugin for WordPress, Blogger...

About Me

My photo
IT Professional with more than 24 years experience in IT in the area of full life-cycle of Web/Desktop/Database Applications Development, Software Engineering, Consultancy, Data Management, Data Quality, Data Migrations, Reporting, ERP implementations & support, Team/Project/IT Management, etc.