Showing posts with label antipatterns. Show all posts
Showing posts with label antipatterns. Show all posts

05 April 2024

SQL Reloaded: SQL Antipatterns (Part I: JOINs, UNIONs & DISTINCT)

Introduction

SQL antipatterns refer in general to common mistakes made when developing SQL code, though the term can refer also to situations in which even if the code is syntactically and logically correct, it's either suboptimal, unclear or even incorrect. Therefore "mistake" can cover a wide range of scenarios, some that can be ignored, while others need to be addressed accordingly. 

In this post I consider a few antipatterns observed especially in data warehouses (DWHs). Let's look at the below code created to exemplify several scenarios:

-- Products in open orders (initial query)
SELECT DISTINCT ITM.ProductId                                   -- (1) use of DISTINCT
, ITM.ProductNumber
, ITM.Name 
, ITM.Color 
, ITM.Style 
, ITM.Size
FROM Production.Product ITM
    LEFT JOIN (							-- (5) use of JOIN instead of EXISTS
	-- Open Purchase orders 
	SELECT DISTINCT POL.ProductId
	, 'POs' Source                                          -- (7) use columns not needed in output
	FROM Purchasing.PurchaseOrderDetail POL                 
	     LEFT JOIN Purchasing.PurchaseOrderHeader POH       -- (2) use of LEFT JOIN instead of FULL JOIN
		  ON POL.PurchaseOrderID = POH.PurchaseOrderID
	WHERE POH.Status = 1 -- pending 
	UNION					                -- (3) use of UNION
	-- Open Sales orders 
	SELECT DISTINCT SOL.ProductId
	, 'SOs' Source
	FROM Sales.SalesOrderDetail SOL
	    LEFT JOIN Sales.SalesOrderHeader SOH
		  ON SOL.SalesOrderID = SOH.SalesOrderID
	WHERE SOH.Status = 1 -- in process		        -- (4) use of OR instead of IN
	   OR SOH.Status = 2 -- approved
	) DAT
	ON ITM.ProductID = DAT.ProductID
WHERE DAT.ProductID IS NOT NULL 
ORDER BY ITM.ProductNumber			                -- (6) using too many columns in ORDER BY
, ITM.Name 
, ITM.Color 
, ITM.Style 
, ITM.Size

(1) Use of DISTINCT 

DISTINCT is a dirty way to remove the duplicates from a dataset. Sometimes it makes sense to use it to check something fast, though it should be avoided into code intended for a production environment because it can lead to unexpected behavior especially when selecting all the columns using the "*" (SELECT DISTINCT *).

I saw tools and developers adding a DISTINCT in almost each step, independently on whether it was necessary or not. One can thus but wonder whether the DISTINCT was added to fix a bigger issue with the data in the DWH, to remove special duplicates imposed by the logic or just as poor practice. Unfortunately, when it's used frequently, it can become challenging to investigate its use and discover the actual issues from the DWH.

There are several approaches to eliminate DISTINCTs from the code: GROUP BY, ranking functions or upon case also code rewrites.

(2)  Use of LEFT JOIN Instead of FULL JOIN

When refreshing a DWH there can be the case that related data are out of synch. It would be the case of Purchase or Sales orders where the headers and lines are out of synch, headers existing without lines and/or vice versa. A common practice is to use a FULL JOIN and thus to eliminate such exceptions, though there are also entitled uses of a LEFT JOIN. This antipattern resumes however to the cases in which logically should be used a FULL JOIN, though a LEFT JOIN is used instead.

In the above example there are two distinct occurrences of this pattern: the relationship between the header and lines in the inner query, respectively the LEFT JOIN with a NOT NULL constraint in the outer query. The latter use is useful when during testing one wants to see all the Products, though bringing this further into production may rise some eyebrows even if it's not necessarily wrong. Anyway, the database engine should be smart enough to recognize such a scenario. However, for the header vs lines case, upon case the plan generated might be suboptimal. 

One of the best practices when writing SQL queries is to state one's intent clearly in what the logic concerns. Using a LEFT JOIN instead of a FULL JOIN can make people raise questions about the actual need. When the need is not properly documented, some developer may even go and change the joins. There can be for example business cases that are not cover by the current data, but as soon as case appears it will lead to incorrect logic!

Similarly, splitting a piece of logic into two or more steps unnecessarily can create confusion. There can be however also entitled s situations (e.g. query optimization), which ideally should be documented.

(3) Use of UNION

When a UNION is used, the values returned by the first query will be checked against the values of the second query, and thus unnecessary comparisons occur even if they are not needed. This depends also on the business context, which might not be easily to identify from the query (especially when the reviewer doesn't know the business case). 

The misuse of a UNION will not make a big difference when the volume of data is small, though the more data are processed by the query, the higher the impact. 

Besides the proper use of the UNION, there are also situations in which a query rewrite can eliminate the need for a UNION (see the rewritten query below).

(4) use of OR instead of IN

One can occasionally find queries in which a OR was used for 10 to 50 distinct values as in the example above. Even if the database engine generate in both cases the same query plan, it's easier to read and maintain a query that used IN. However, if the number of values go beyond a certain value, other techniques should be used to improve the performance.

The only benefit I see for a OR is to document meaning's values or remove during testing one of the values, especially when the list is a selection coming from the user. Frankly, it's not the appropriate way for documenting logic (even if I'm doing it sometimes in ad-hoc queries).

There's a more extreme scenario in which distinct subqueries are written for each or a set of ORs (e.g. the distinction between open vs closed vs. invoices orders), which can make sense sometimes (e.g. the logic is completely different). Therefore, an antipattern can be dependent also of the context or use case. 

(5) use of JOIN instead of EXISTS

When there are no values returned from the subquery, quite often it makes sense to the EXISTS or not EXISTS operators in the queries (see the rewritten query below). This might not be indicated however for distributed environments like serverless SQL pool in which the distribution of the processing across multiple tasks might benefit when the pieces of the logic distributed don't require heavy reshuffles. 

(6) Using too Many Columns in ORDER BY

The columns specified in an ORDER BY clause need to make sense, otherwise they just add extra burden on the database engine, at least from the perspective of the checks that need to be performed. In the above query, at least the Name doesn't make sense.

It helps also if the columns can use existing indexes, though this depends also on query specifics. 

Another antipattern scenario not exemplified above is the use of ordinals to refer to the columns, which should be avoided in production environments (because the order of the columns can be changed accidentally or even :

-- using ordinals instead of number columns (not recommended)
SELECT ITM.ProductId                                  
, ITM.ProductNumber
, ITM.Name 
, ITM.Color 
, ITM.Style 
, ITM.Size
FROM Production.Product ITM                           
ORDER BY 2, 4, 5, 6

(7) Use Columns Not  Needed in Output

Besides the fact that each column included unnecessarily in the query can increase the size of the data processed (unless the database engine is smart to remove them), there can be also performance issues and/or optimizations involved. For example, if all the other columns are part of a covering index, the database engine might opt for a suboptimal index compared to the case in which the unnecessary columns are removed. 

Conversely, some columns are helpful to troubleshoot the logic (and that's why the Source column was considered) even if they aren't considered in the final output or the logic. It doesn't make sense to bring the query version with the respective fields into production, even if this would mean to have maybe a second version of the query used only for troubleshooting needs. Commenting the unnecessary columns could be a better choice, even if it's not recommended in general as too many such comments can obfuscate the case. 

Rewriting the Query

With the above input the query can be rewritten as follows:

-- Products in open orders (modified query)
SELECT ITM.ProductId                                  
, ITM.ProductNumber
, ITM.Name 
, ITM.Color 
, ITM.Style 
, ITM.Size
FROM Production.Product ITM
WHERE EXISTS (										            
	-- Open Purchase orders 
	SELECT POL.ProductId
	FROM Purchasing.PurchaseOrderDetail POL                 
	     JOIN Purchasing.PurchaseOrderHeader POH      
		  ON POL.PurchaseOrderID = POH.PurchaseOrderID
	WHERE POH.Status = 1 
	  AND ITM.ProductID = POL.ProductID
	)
 OR EXISTS (				                                    
	-- Open Sales orders 
	SELECT SOL.ProductId
	FROM Sales.SalesOrderDetail SOL
	     JOIN Sales.SalesOrderHeader SOH
		  ON SOL.SalesOrderID = SOH.SalesOrderID
	WHERE SOH.Status IN (1, 2)
	  AND ITM.ProductID = SOL.ProductID
	)	                           
ORDER BY ITM.ProductNumber			                           
, ITM.Color 
, ITM.Style 
, ITM.Size

Please note that in general to each rule there are also exceptions which should be considered against one's common sense. When the benefit of addressing an antipattern is neglectable compared with the effort involved and the logic doesn't create any issues, probably it's better to let the code as it. One can still reconsider the antipatterns later with the next refactoring opportunity. 

There are zealous data professionals who treat minor inconveniences (e.g. not using upper case for SQL reserved words, alternate code formatting, alternative writing of words, especially function names, different indentation, the use of "--" for commenting within a query, etc.) as antipatterns. Use your common sense and evaluate the effort against the benefits or risks and, not less important, be patient with others' mistakes!

Happy coding!

08 December 2007

Software Engineering: Anti-pattern (Definitions)

"An antipattern is just like a pattern, except that instead of a solution it gives something that looks superficially like a solution, but isn’t one." (Andrew Koenig, Journal of Object-Oriented Programming Vol 8(1), 1995)

"[...] an antipattern is a common approach to solving a problem that leaves us worse off than when we started." (Scott W Ambler & Larry L Constantine, "The Unified Process Construction Phase: Best Practices for Completing the Unified Process", 2000)

"An antipattern is a structure used in software design that seems to be beneficial initially, but has significant, unintended negative consequences." (Jens Dietrich, "From Business Rules to Application Code: Code Generation Patterns for Rule Defined Associations", 2009)

"An anti-pattern is a general, proven, and non-beneficial problem (i.e., bad solution) in a software product or process. It strongly classifies the problem that exhibits negative consequences and provides a solution. Built upon similar experiences, anti-patterns represent 'worst-practices' about how to structure or build a software architecture." (Jörg Rech et al, "Knowledge Patterns" [in "Encyclopedia of Knowledge Management" 2nd Ed.], 2011)

"An antipattern is a software design practice that is ineffective or counterproductive—in other words, the opposite of a 'best practice'. To put it another way, an antipattern is something that the software allows you to do, but that may have an adverse functional or performance impact." (Google)

"An inferior process pattern that teams follow or a design solution that teams commonly make. Antipatterns are used to reinforce better planning and provide a problem solving reference point." (IBM)

"Repeated action, process, structure or reusable solution that initially appears to be beneficial and is commonly used but is ineffective and/or counterproductive in practice." (ISQTB)

03 December 2007

Software Engineering: Design Patterns (Just the Quotes)

"A pattern is a fully realized form original, or model accepted or proposed for imitation. With patterns, small piecework is standardized into a larger chunk or unit. Patterns become the building blocks for design and construction. Finding and applying patterns indicates progress in a field of human endeavor." (Peter Coad, "Object-oriented patterns", 1992) 

"A design pattern systematically names, motivates, and explains a general design that addresses a recurring design problem in object-oriented systems. It describes the problem, the solution, when to apply the solution, and its consequences. It also gives implementation hints and examples. The solution is a general arrangement of objects and classes that solve the problem. The solution is customized and implemented to solve the problem in a particular context." (Erich Gamma et al, "Design Patterns: Elements of Reusable Object-Oriented Software", 1994)

"The purpose of a conceptual model is to provide a vocabulary of terms and concepts that can be used to describe problems and/or solutions of design. It is not the purpose of a model to address specific problems, and even less to propose solutions for them. Drawing an analogy with linguistics, a conceptual model is analogous to a language, while design patterns are analogous to rhetorical figures, which are predefined templates of language usages, suited particularly to specific problems." (Peter P Chen [Ed.], "Advances in Conceptual Modeling", 1999)

"In addition to their complexity-management benefit, design patterns can accelerate design discussions by allowing designers to think and discuss at a larger level of granularity." (Steve McConnell, "Code Complete" 2nd Ed., 2004)

"On small, informal projects, a lot of design is done while the programmer sits at the keyboard. 'Design' might be just writing a class interface in pseudocode before writing the details. It might be drawing diagrams of a few class relationships before coding them. It might be asking another programmer which design pattern seems like a better choice. Regardless of how it’s done, small projects benefit from careful design just as larger projects do, and recognizing design as an explicit activity maximizes the benefit you will receive from it." (Steve McConnell, "Code Complete" 2nd Ed., 2004)

"Software design patterns are what allow us to describe design fragments, and reuse design ideas, helping developers leverage the expertise of others. Patterns give a name and form to abstract heuristics, rules and best practices of object-oriented techniques." (Philippe Kruchten, [foreword] 2004)

"Design patterns give names to practical knowledge; they define a high-level vocabulary for understanding and solving business statements graphically. Design patterns are presented in a standard format; they're like recipes in a cookbook or dress patterns in a catalog. Above all, they are practical, first as instructional materials and then as development tools." (Alan Chmura & J Mark Heumann, "Logical Data Modeling: What it is and How to do it", 2005)

"Structural patterns describe how classes and objects can be combined to form larger structures. Patterns for classes describe how inheritance can be used to provide more useful program interfaces. Patterns for objects describe how objects can be composed into larger structures using object composition." (Junji Nakano et al, "Programming Statistical Data Visualization in the Java Language" [in "Handbook of Data Visualization"], 2008)

"Design patterns are high-level abstractions that document successful design solutions. They are fundamental to design reuse in object-oriented development." (Ian Sommerville, "Software Engineering" 9th Ed., 2011)

"Most designers think of design patterns as a way of supporting object-oriented design. Patterns often rely on object characteristics such as inheritance and polymorphism to provide generality. However, the general principle of encapsulating experience in a pattern is one that is equally applicable to all software design approaches." (Ian Sommerville, "Software Engineering" 9th Ed., 2011)

"A design pattern usually suggests a scheme for structuring the classes in a design solution and defines the required interactions among those classes. In other words, a design pattern describes some commonly recurring structure of communicating classes that can be used to solve some general design problems. Design pattern solutions are typically described in terms of classes, their instances, their roles and collaborations." (Rajib Mall, "Fundamentals of Software Engineering" 4th Ed., 2014)

"If a pattern represents a best practice, then an antipattern represents lessons learned from a bad design. [...] Antipatterns are valuable because they help us to recognise why a particular design alternative might seem at first like an attractive solution, but later on lead to complicacies and finally turn out to be a poor solution." (Rajib Mall, "Fundamentals of Software Engineering" 4th Ed., 2014)

"Each pattern describes a problem which occurs over and over again in our environment, and then describes the core of the solution to that problem, in such a way that you can use this solution a million times over, without ever doing it the same way twice.''  (Christopher Alexander)

Related Posts Plugin for WordPress, Blogger...

About Me

My photo
IT Professional with more than 24 years experience in IT in the area of full life-cycle of Web/Desktop/Database Applications Development, Software Engineering, Consultancy, Data Management, Data Quality, Data Migrations, Reporting, ERP implementations & support, Team/Project/IT Management, etc.