05 April 2024

SQL Reloaded: SQL Antipatterns (Part I: JOINs, UNIONs & DISTINCT)

Introduction

SQL antipatterns refer in general to common mistakes made when developing SQL code, though the term can refer also to situations in which even if the code is syntactically and logically correct, it's either suboptimal, unclear or even incorrect. Therefore "mistake" can cover a wide range of scenarios, some that can be ignored, while others need to be addressed accordingly. 

In this post I consider a few antipatterns observed especially in data warehouses (DWHs). Let's look at the below code created to exemplify several scenarios:

-- Products in open orders (initial query)
SELECT DISTINCT ITM.ProductId                                   -- (1) use of DISTINCT
, ITM.ProductNumber
, ITM.Name 
, ITM.Color 
, ITM.Style 
, ITM.Size
FROM Production.Product ITM
    LEFT JOIN (							-- (5) use of JOIN instead of EXISTS
	-- Open Purchase orders 
	SELECT DISTINCT POL.ProductId
	, 'POs' Source                                          -- (7) use columns not needed in output
	FROM Purchasing.PurchaseOrderDetail POL                 
	     LEFT JOIN Purchasing.PurchaseOrderHeader POH       -- (2) use of LEFT JOIN instead of FULL JOIN
		  ON POL.PurchaseOrderID = POH.PurchaseOrderID
	WHERE POH.Status = 1 -- pending 
	UNION					                -- (3) use of UNION
	-- Open Sales orders 
	SELECT DISTINCT SOL.ProductId
	, 'SOs' Source
	FROM Sales.SalesOrderDetail SOL
	    LEFT JOIN Sales.SalesOrderHeader SOH
		  ON SOL.SalesOrderID = SOH.SalesOrderID
	WHERE SOH.Status = 1 -- in process		        -- (4) use of OR instead of IN
	   OR SOH.Status = 2 -- approved
	) DAT
	ON ITM.ProductID = DAT.ProductID
WHERE DAT.ProductID IS NOT NULL 
ORDER BY ITM.ProductNumber			                -- (6) using too many columns in ORDER BY
, ITM.Name 
, ITM.Color 
, ITM.Style 
, ITM.Size

(1) Use of DISTINCT 

DISTINCT is a dirty way to remove the duplicates from a dataset. Sometimes it makes sense to use it to check something fast, though it should be avoided into code intended for a production environment because it can lead to unexpected behavior especially when selecting all the columns using the "*" (SELECT DISTINCT *).

I saw tools and developers adding a DISTINCT in almost each step, independently on whether it was necessary or not. One can thus but wonder whether the DISTINCT was added to fix a bigger issue with the data in the DWH, to remove special duplicates imposed by the logic or just as poor practice. Unfortunately, when it's used frequently, it can become challenging to investigate its use and discover the actual issues from the DWH.

There are several approaches to eliminate DISTINCTs from the code: GROUP BY, ranking functions or upon case also code rewrites.

(2)  Use of LEFT JOIN Instead of FULL JOIN

When refreshing a DWH there can be the case that related data are out of synch. It would be the case of Purchase or Sales orders where the headers and lines are out of synch, headers existing without lines and/or vice versa. A common practice is to use a FULL JOIN and thus to eliminate such exceptions, though there are also entitled uses of a LEFT JOIN. This antipattern resumes however to the cases in which logically should be used a FULL JOIN, though a LEFT JOIN is used instead.

In the above example there are two distinct occurrences of this pattern: the relationship between the header and lines in the inner query, respectively the LEFT JOIN with a NOT NULL constraint in the outer query. The latter use is useful when during testing one wants to see all the Products, though bringing this further into production may rise some eyebrows even if it's not necessarily wrong. Anyway, the database engine should be smart enough to recognize such a scenario. However, for the header vs lines case, upon case the plan generated might be suboptimal. 

One of the best practices when writing SQL queries is to state one's intent clearly in what the logic concerns. Using a LEFT JOIN instead of a FULL JOIN can make people raise questions about the actual need. When the need is not properly documented, some developer may even go and change the joins. There can be for example business cases that are not cover by the current data, but as soon as case appears it will lead to incorrect logic!

Similarly, splitting a piece of logic into two or more steps unnecessarily can create confusion. There can be however also entitled s situations (e.g. query optimization), which ideally should be documented.

(3) Use of UNION

When a UNION is used, the values returned by the first query will be checked against the values of the second query, and thus unnecessary comparisons occur even if they are not needed. This depends also on the business context, which might not be easily to identify from the query (especially when the reviewer doesn't know the business case). 

The misuse of a UNION will not make a big difference when the volume of data is small, though the more data are processed by the query, the higher the impact. 

Besides the proper use of the UNION, there are also situations in which a query rewrite can eliminate the need for a UNION (see the rewritten query below).

(4) use of OR instead of IN

One can occasionally find queries in which a OR was used for 10 to 50 distinct values as in the example above. Even if the database engine generate in both cases the same query plan, it's easier to read and maintain a query that used IN. However, if the number of values go beyond a certain value, other techniques should be used to improve the performance.

The only benefit I see for a OR is to document meaning's values or remove during testing one of the values, especially when the list is a selection coming from the user. Frankly, it's not the appropriate way for documenting logic (even if I'm doing it sometimes in ad-hoc queries).

There's a more extreme scenario in which distinct subqueries are written for each or a set of ORs (e.g. the distinction between open vs closed vs. invoices orders), which can make sense sometimes (e.g. the logic is completely different). Therefore, an antipattern can be dependent also of the context or use case. 

(5) use of JOIN instead of EXISTS

When there are no values returned from the subquery, quite often it makes sense to the EXISTS or not EXISTS operators in the queries (see the rewritten query below). This might not be indicated however for distributed environments like serverless SQL pool in which the distribution of the processing across multiple tasks might benefit when the pieces of the logic distributed don't require heavy reshuffles. 

(6) Using too Many Columns in ORDER BY

The columns specified in an ORDER BY clause need to make sense, otherwise they just add extra burden on the database engine, at least from the perspective of the checks that need to be performed. In the above query, at least the Name doesn't make sense.

It helps also if the columns can use existing indexes, though this depends also on query specifics. 

Another antipattern scenario not exemplified above is the use of ordinals to refer to the columns, which should be avoided in production environments (because the order of the columns can be changed accidentally or even :

-- using ordinals instead of number columns (not recommended)
SELECT ITM.ProductId                                  
, ITM.ProductNumber
, ITM.Name 
, ITM.Color 
, ITM.Style 
, ITM.Size
FROM Production.Product ITM                           
ORDER BY 2, 4, 5, 6

(7) Use Columns Not  Needed in Output

Besides the fact that each column included unnecessarily in the query can increase the size of the data processed (unless the database engine is smart to remove them), there can be also performance issues and/or optimizations involved. For example, if all the other columns are part of a covering index, the database engine might opt for a suboptimal index compared to the case in which the unnecessary columns are removed. 

Conversely, some columns are helpful to troubleshoot the logic (and that's why the Source column was considered) even if they aren't considered in the final output or the logic. It doesn't make sense to bring the query version with the respective fields into production, even if this would mean to have maybe a second version of the query used only for troubleshooting needs. Commenting the unnecessary columns could be a better choice, even if it's not recommended in general as too many such comments can obfuscate the case. 

Rewriting the Query

With the above input the query can be rewritten as follows:

-- Products in open orders (modified query)
SELECT ITM.ProductId                                  
, ITM.ProductNumber
, ITM.Name 
, ITM.Color 
, ITM.Style 
, ITM.Size
FROM Production.Product ITM
WHERE EXISTS (										            
	-- Open Purchase orders 
	SELECT POL.ProductId
	FROM Purchasing.PurchaseOrderDetail POL                 
	     JOIN Purchasing.PurchaseOrderHeader POH      
		  ON POL.PurchaseOrderID = POH.PurchaseOrderID
	WHERE POH.Status = 1 
	  AND ITM.ProductID = POL.ProductID
	)
 OR EXISTS (				                                    
	-- Open Sales orders 
	SELECT SOL.ProductId
	FROM Sales.SalesOrderDetail SOL
	     JOIN Sales.SalesOrderHeader SOH
		  ON SOL.SalesOrderID = SOH.SalesOrderID
	WHERE SOH.Status IN (1, 2)
	  AND ITM.ProductID = SOL.ProductID
	)	                           
ORDER BY ITM.ProductNumber			                           
, ITM.Color 
, ITM.Style 
, ITM.Size

Please note that in general to each rule there are also exceptions which should be considered against one's common sense. When the benefit of addressing an antipattern is neglectable compared with the effort involved and the logic doesn't create any issues, probably it's better to let the code as it. One can still reconsider the antipatterns later with the next refactoring opportunity. 

There are zealous data professionals who treat minor inconveniences (e.g. not using upper case for SQL reserved words, alternate code formatting, alternative writing of words, especially function names, different indentation, the use of "--" for commenting within a query, etc.) as antipatterns. Use your common sense and evaluate the effort against the benefits or risks and, not less important, be patient with others' mistakes!

Happy coding!

No comments:

Related Posts Plugin for WordPress, Blogger...

About Me

My photo
Koeln, NRW, Germany
IT Professional with more than 24 years experience in IT in the area of full life-cycle of Web/Desktop/Database Applications Development, Software Engineering, Consultancy, Data Management, Data Quality, Data Migrations, Reporting, ERP implementations & support, Team/Project/IT Management, etc.