SQL Troubles: DISTINCT

Showing posts with label DISTINCT. Show all posts

05 April 2024

💎SQL Reloaded: SQL Antipatterns (Part I: JOINs, UNIONs & DISTINCT)

Introduction

SQL antipatterns refer in general to common mistakes made when developing SQL code, though the term can refer also to situations in which even if the code is syntactically and logically correct, it's either suboptimal, unclear or even incorrect. Therefore "mistake" can cover a wide range of scenarios, some that can be ignored, while others need to be addressed accordingly.

In this post I consider a few antipatterns observed especially in data warehouses (DWHs). Let's look at the below code created to exemplify several scenarios:

-- Products in open orders (initial query)
SELECT DISTINCT ITM.ProductId                                   -- (1) use of DISTINCT
, ITM.ProductNumber
, ITM.Name 
, ITM.Color 
, ITM.Style 
, ITM.Size
FROM Production.Product ITM
    LEFT JOIN (							-- (5) use of JOIN instead of EXISTS
	-- Open Purchase orders 
	SELECT DISTINCT POL.ProductId
	, 'POs' Source                                          -- (7) use columns not needed in output
	FROM Purchasing.PurchaseOrderDetail POL                 
	     LEFT JOIN Purchasing.PurchaseOrderHeader POH       -- (2) use of LEFT JOIN instead of FULL JOIN
		  ON POL.PurchaseOrderID = POH.PurchaseOrderID
	WHERE POH.Status = 1 -- pending 
	UNION					                -- (3) use of UNION
	-- Open Sales orders 
	SELECT DISTINCT SOL.ProductId
	, 'SOs' Source
	FROM Sales.SalesOrderDetail SOL
	    LEFT JOIN Sales.SalesOrderHeader SOH
		  ON SOL.SalesOrderID = SOH.SalesOrderID
	WHERE SOH.Status = 1 -- in process		        -- (4) use of OR instead of IN
	   OR SOH.Status = 2 -- approved
	) DAT
	ON ITM.ProductID = DAT.ProductID
WHERE DAT.ProductID IS NOT NULL 
ORDER BY ITM.ProductNumber			                -- (6) using too many columns in ORDER BY
, ITM.Name 
, ITM.Color 
, ITM.Style 
, ITM.Size

(1) Use of DISTINCT

DISTINCT is a dirty way to remove the duplicates from a dataset. Sometimes it makes sense to use it to check something fast, though it should be avoided into code intended for a production environment because it can lead to unexpected behavior especially when selecting all the columns using the "*" (SELECT DISTINCT *).

I saw tools and developers adding a DISTINCT in almost each step, independently on whether it was necessary or not. One can thus but wonder whether the DISTINCT was added to fix a bigger issue with the data in the DWH, to remove special duplicates imposed by the logic or just as poor practice. Unfortunately, when it's used frequently, it can become challenging to investigate its use and discover the actual issues from the DWH.

There are several approaches to eliminate DISTINCTs from the code: GROUP BY, ranking functions or upon case also code rewrites.

(2) Use of LEFT JOIN Instead of FULL JOIN

When refreshing a DWH there can be the case that related data are out of synch. It would be the case of Purchase or Sales orders where the headers and lines are out of synch, headers existing without lines and/or vice versa. A common practice is to use a FULL JOIN and thus to eliminate such exceptions, though there are also entitled uses of a LEFT JOIN. This antipattern resumes however to the cases in which logically should be used a FULL JOIN, though a LEFT JOIN is used instead.

In the above example there are two distinct occurrences of this pattern: the relationship between the header and lines in the inner query, respectively the LEFT JOIN with a NOT NULL constraint in the outer query. The latter use is useful when during testing one wants to see all the Products, though bringing this further into production may rise some eyebrows even if it's not necessarily wrong. Anyway, the database engine should be smart enough to recognize such a scenario. However, for the header vs lines case, upon case the plan generated might be suboptimal.

One of the best practices when writing SQL queries is to state one's intent clearly in what the logic concerns. Using a LEFT JOIN instead of a FULL JOIN can make people raise questions about the actual need. When the need is not properly documented, some developer may even go and change the joins. There can be for example business cases that are not cover by the current data, but as soon as case appears it will lead to incorrect logic!

Similarly, splitting a piece of logic into two or more steps unnecessarily can create confusion. There can be however also entitled s situations (e.g. query optimization), which ideally should be documented.

(3) Use of UNION

When a UNION is used, the values returned by the first query will be checked against the values of the second query, and thus unnecessary comparisons occur even if they are not needed. This depends also on the business context, which might not be easily to identify from the query (especially when the reviewer doesn't know the business case).

The misuse of a UNION will not make a big difference when the volume of data is small, though the more data are processed by the query, the higher the impact.

Besides the proper use of the UNION, there are also situations in which a query rewrite can eliminate the need for a UNION (see the rewritten query below).

(4) use of OR instead of IN

One can occasionally find queries in which a OR was used for 10 to 50 distinct values as in the example above. Even if the database engine generate in both cases the same query plan, it's easier to read and maintain a query that used IN. However, if the number of values go beyond a certain value, other techniques should be used to improve the performance.

The only benefit I see for a OR is to document meaning's values or remove during testing one of the values, especially when the list is a selection coming from the user. Frankly, it's not the appropriate way for documenting logic (even if I'm doing it sometimes in ad-hoc queries).

There's a more extreme scenario in which distinct subqueries are written for each or a set of ORs (e.g. the distinction between open vs closed vs. invoices orders), which can make sense sometimes (e.g. the logic is completely different). Therefore, an antipattern can be dependent also of the context or use case.

(5) use of JOIN instead of EXISTS

When there are no values returned from the subquery, quite often it makes sense to the EXISTS or not EXISTS operators in the queries (see the rewritten query below). This might not be indicated however for distributed environments like serverless SQL pool in which the distribution of the processing across multiple tasks might benefit when the pieces of the logic distributed don't require heavy reshuffles.

(6) Using too Many Columns in ORDER BY

The columns specified in an ORDER BY clause need to make sense, otherwise they just add extra burden on the database engine, at least from the perspective of the checks that need to be performed. In the above query, at least the Name doesn't make sense.

It helps also if the columns can use existing indexes, though this depends also on query specifics.

Another antipattern scenario not exemplified above is the use of ordinals to refer to the columns, which should be avoided in production environments (because the order of the columns can be changed accidentally or even :

-- using ordinals instead of number columns (not recommended)
SELECT ITM.ProductId                                  
, ITM.ProductNumber
, ITM.Name 
, ITM.Color 
, ITM.Style 
, ITM.Size
FROM Production.Product ITM                           
ORDER BY 2, 4, 5, 6

(7) Use Columns Not Needed in Output

Besides the fact that each column included unnecessarily in the query can increase the size of the data processed (unless the database engine is smart to remove them), there can be also performance issues and/or optimizations involved. For example, if all the other columns are part of a covering index, the database engine might opt for a suboptimal index compared to the case in which the unnecessary columns are removed.

Conversely, some columns are helpful to troubleshoot the logic (and that's why the Source column was considered) even if they aren't considered in the final output or the logic. It doesn't make sense to bring the query version with the respective fields into production, even if this would mean to have maybe a second version of the query used only for troubleshooting needs. Commenting the unnecessary columns could be a better choice, even if it's not recommended in general as too many such comments can obfuscate the case.

Rewriting the Query

With the above input the query can be rewritten as follows:

-- Products in open orders (modified query)
SELECT ITM.ProductId                                  
, ITM.ProductNumber
, ITM.Name 
, ITM.Color 
, ITM.Style 
, ITM.Size
FROM Production.Product ITM
WHERE EXISTS (										            
	-- Open Purchase orders 
	SELECT POL.ProductId
	FROM Purchasing.PurchaseOrderDetail POL                 
	     JOIN Purchasing.PurchaseOrderHeader POH      
		  ON POL.PurchaseOrderID = POH.PurchaseOrderID
	WHERE POH.Status = 1 
	  AND ITM.ProductID = POL.ProductID
	)
 OR EXISTS (				                                    
	-- Open Sales orders 
	SELECT SOL.ProductId
	FROM Sales.SalesOrderDetail SOL
	     JOIN Sales.SalesOrderHeader SOH
		  ON SOL.SalesOrderID = SOH.SalesOrderID
	WHERE SOH.Status IN (1, 2)
	  AND ITM.ProductID = SOL.ProductID
	)	                           
ORDER BY ITM.ProductNumber			                           
, ITM.Color 
, ITM.Style 
, ITM.Size

Please note that in general to each rule there are also exceptions which should be considered against one's common sense. When the benefit of addressing an antipattern is neglectable compared with the effort involved and the logic doesn't create any issues, probably it's better to let the code as it. One can still reconsider the antipatterns later with the next refactoring opportunity.

There are zealous data professionals who treat minor inconveniences (e.g. not using upper case for SQL reserved words, alternate code formatting, alternative writing of words, especially function names, different indentation, the use of "--" for commenting within a query, etc.) as antipatterns. Use your common sense and evaluate the effort against the benefits or risks and, not less important, be patient with others' mistakes!

Happy coding!

Previous Post <<||>> Next Post

20 June 2020

💎🏭SQL Reloaded: When Queries Look Like Ugly Ducks

There’s an expression of abductive reasoning “if it looks like a duck, swims like a duck, and quacks like a duck, then it probably is a duck” used to identify something indefinite based on certain (definite) characteristics. When one looks at a query written by someone else, one tries to identify such characteristics. However, there are also situations in which simple queries are hard to interpret. For example, each time I see a SELECT DISTINCT in a query I ask myself whether the developer wanted to shortcut a GROUP BY, or just wanted to avoid duplicated data from the output. Without looking at the data in detail it’s almost impossible deciding between the two scenarios.

Let’s consider the following simple example based on a set of orders sent over the interface between two systems. Because of poor design, the first position was caught by the logic before entering the full order (the price from it is incorrect):

-- dropping the table
--DROP TABLE IF EXISTS dbo.T_Orders 

-- inserting the records 
SELECT *
INTO dbo.T_Orders 
FROM (VALUES ('P0001', 1,'20200618', '100001','S','black',20, 1.23)
, ('P0001', 1,'20200618', '100001','S','black',20, 1.22)
, ('P0001', 2,'20200618', '100001','S','black',5, 1.23)
, ('P0001', 3,'20200618', '100001','M','black',10, 1.24)
, ('P0002', 1,'20200619', '100001','S','black',5, 1.23)
, ('P0002', 2,'20200619', '100001','S','black',10, 1.22)) DAT (PurchId, Pos, PurchDate, ItemId, SizeId, ColorId, Qty, Price)

-- adding a primary key
ALTER TABLE dbo.T_Orders 
ADD Id int identity(1,1) NOT NULL

-- reviewing the data
SELECT *
FROM dbo.T_Orders 
ORDER BY PurchId
, Pos

The following queries return the same results :

-- simple SELECT DISTINCT 
SELECT DISTINCT PurchId, Pos, ItemId, SizeId, ColorId, Qty
FROM dbo.T_Orders

-- simple GROUP BY
SELECT PurchId, Pos, ItemId, SizeId, ColorId, Qty
FROM dbo.T_Orders
GROUP BY PurchId, Pos, ItemId, SizeId, ColorId, Qty

Both queries hide the fact that a duplicate exists. However, with a small change in the last query (adding a count) one can highlight that, while the first query doesn't allow this flexibility:

-- simple GROUP BY with Count
SELECT PurchId, Pos, ItemId, SizeId, ColorId, Qty
, count(*) NoRecords
FROM dbo.T_Orders
GROUP BY PurchId, Pos, ItemId, SizeId, ColorId, Qty

-- HAVING count(*)>1

This small change can make an important difference when one needs to analyze the data, so at least from this perspective it might be worth to write the query as a GROUP BY from the beginning, especially when writing complex queries.

There are also nested SELECT DISTINCTS that attempt refining the selection in successive steps. Such curiosities seldom make sense:

-- nested SELECT DISTINCT
SELECT DISTINCT PurchId, Pos, ItemId
FROM (
 SELECT DISTINCT PurchId, Pos, ItemId, SizeId, ColorId, Qty
 FROM dbo.T_Orders
) DAT

To remove the duplicates and perform a sum one may choose another curiosity of query - the inner query removes the duplicates (some write it also as a DISTINCT), while in the outer query is done the sum:

-- double grouping to remove duplicates
SELECT PurchId, Pos, ItemId, SizeId, ColorId
, SUM(Qty) Total
FROM (--inner query
 SELECT PurchId, Pos, ItemId, SizeId, ColorId, Qty
 FROM dbo.T_Orders
 GROUP BY PurchId, Pos, ItemId, SizeId, ColorId, Qty
) DAT
GROUP BY PurchId, Pos, ItemId, SizeId, ColorId

Even if such queries (miraculously) work upon case, sooner or later they are predestined to fail. When doing operations on data whose quality is doubtful, one needs to select a strategy for removing the duplicates. The first step is to identify the attributes which make a record unique (e.g. PurchId and Pos), and using a sorting criteria, one can remove the duplicates via a window function like Rank, Dense_Rank or Row_Number (with small exceptions their use is interchangeable). Supposing that the Id attribute determinates the order in which the data were added, one can write a similar query:

-- removing duplicates via ranking partition window function 
SELECT *
FROM (
 SELECT PurchId, Pos, PurchDate, ItemId, SizeId, ColorId, Qty, Price
 , RANK() OVER (PARTITION BY PurchId, Pos ORDER BY Id DESC) Ranking 
 FROM dbo.T_Orders
) DAT 
WHERE Ranking=1

The problem is that such queries need to be repeated in each use. A simpler method is running a regular check for duplicates and marking the duplicates as inactive or pushing them into a separate table. Deleting the duplicates can work as well, though this approach can hit back like a boomerang if not addressed correctly.

To identify the duplicates one of the following approaches can be used:

-- duplicates via grouping in inner query
SELECT DAT.*
, DUP.NoRecords 
FROM dbo.T_Orders DAT
     JOIN (-- duplicates
 SELECT PurchId, Pos
 , count(*) NoRecords
 FROM dbo.T_Orders DUP
 GROUP BY PurchId, Pos
 HAVING count(*)>1 
  ) DUP 
   ON DAT.PurchId = DUP.Purchid 
   AND DAT.Pos = DUP.Pos 

-- duplicates via grouping in correlated query 
SELECT DAT.*
, DUP.NoRecords 
FROM dbo.T_Orders DAT
     CROSS APPLY (-- duplicates
 SELECT count(*) NoRecords
 FROM dbo.T_Orders DUP
 WHERE DAT.PurchId = DUP.Purchid 
   AND DAT.Pos = DUP.Pos 
 GROUP BY PurchId, Pos
 HAVING count(*)>1 
  ) DUP 


-- duplicates via count within aggregate window function 
SELECT *
FROM (
 SELECT PurchId, Pos, PurchDate, ItemId, SizeId, ColorId, Qty, Price
 , count(*) OVER (PARTITION BY PurchId, Pos) NoRecords 
 FROM dbo.T_Orders
) DAT 
WHERE NoRecords>1

Notes:

The queries work also in SQL databases in Microsoft Fabric.

Happy coding!

SQL Troubles

Pages

05 April 2024

💎SQL Reloaded: SQL Antipatterns (Part I: JOINs, UNIONs & DISTINCT)

20 June 2020

💎🏭SQL Reloaded: When Queries Look Like Ugly Ducks

About Me