There’s an expression of abductive reasoning, “if it looks like a duck, swims like a duck, and quacks like a duck, then it probably is a duck”, used to identify something indefinite based on certain (definite) characteristics. When one looks at a query written by someone else, one tries to identify such characteristics. However, there are also situations in which even simple queries are hard to interpret. For example, each time I see a SELECT DISTINCT in a query I ask myself whether the developer wanted to shortcut a GROUP BY or just wanted to avoid duplicated data in the output. Without looking at the data in detail, it’s almost impossible to decide between the two scenarios.
Let’s consider the following simple example based on a set of orders sent over the interface between two systems. Because of poor design, the first position was picked up by the logic before the full order was entered (hence its price is incorrect):
-- dropping the table
--DROP TABLE IF EXISTS dbo.T_Orders
-- inserting the records
SELECT *
INTO dbo.T_Orders
FROM (VALUES ('P0001', 1,'20200618', '100001','S','black',20, 1.23)
, ('P0001', 1,'20200618', '100001','S','black',20, 1.22)
, ('P0001', 2,'20200618', '100001','S','black',5, 1.23)
, ('P0001', 3,'20200618', '100001','M','black',10, 1.24)
, ('P0002', 1,'20200619', '100001','S','black',5, 1.23)
, ('P0002', 2,'20200619', '100001','S','black',10, 1.22)) DAT (PurchId, Pos, PurchDate, ItemId, SizeId, ColorId, Qty, Price)
-- adding a primary key
ALTER TABLE dbo.T_Orders
ADD Id int identity(1,1) NOT NULL
-- reviewing the data
SELECT *
FROM dbo.T_Orders
ORDER BY PurchId
, Pos
The following queries return the same results:
-- simple SELECT DISTINCT
SELECT DISTINCT PurchId, Pos, ItemId, SizeId, ColorId, Qty
FROM dbo.T_Orders
-- simple GROUP BY
SELECT PurchId, Pos, ItemId, SizeId, ColorId, Qty
FROM dbo.T_Orders
GROUP BY PurchId, Pos, ItemId, SizeId, ColorId, Qty
Both queries hide the fact that a duplicate exists. However, with a small change in the last query (adding a count) one can highlight the duplicate, while the first query doesn't allow this flexibility:
-- simple GROUP BY with Count
SELECT PurchId, Pos, ItemId, SizeId, ColorId, Qty
, count(*) NoRecords
FROM dbo.T_Orders
GROUP BY PurchId, Pos, ItemId, SizeId, ColorId, Qty
-- HAVING count(*)>1
This small change can make an important difference when one needs to analyze the data, so at least from this perspective it might be worth writing the query as a GROUP BY from the beginning, especially when writing complex queries.
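For instance, the same GROUP BY can be extended with further aggregates to check whether the supposed duplicates differ in the attributes left out of the grouping (a quick sketch over the above data, in which the Price differs between the two records):
-- GROUP BY with further aggregates over the non-grouped attributes
SELECT PurchId, Pos, ItemId, SizeId, ColorId, Qty
, count(*) NoRecords
, Min(Price) MinPrice
, Max(Price) MaxPrice
FROM dbo.T_Orders
GROUP BY PurchId, Pos, ItemId, SizeId, ColorId, Qty
HAVING count(*)>1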
There are also nested SELECT DISTINCTs that attempt to refine the selection in successive steps. Such curiosities seldom make sense:
-- nested SELECT DISTINCT
SELECT DISTINCT PurchId, Pos, ItemId
FROM (
SELECT DISTINCT PurchId, Pos, ItemId, SizeId, ColorId, Qty
FROM dbo.T_Orders
) DAT
To remove the duplicates and perform a sum, one may come across another curious query: the inner query removes the duplicates (some write it as a DISTINCT as well), while the outer query does the sum:
-- double grouping to remove duplicates
SELECT PurchId, Pos, ItemId, SizeId, ColorId
, SUM(Qty) Total
FROM (--inner query
SELECT PurchId, Pos, ItemId, SizeId, ColorId, Qty
FROM dbo.T_Orders
GROUP BY PurchId, Pos, ItemId, SizeId, ColorId, Qty
) DAT
GROUP BY PurchId, Pos, ItemId, SizeId, ColorId
Even if such queries (miraculously) work in a given case, sooner or later they are predestined to fail. When doing operations on data whose quality is doubtful, one needs to select a strategy for removing the duplicates. The first step is to identify the attributes which make a record unique (e.g. PurchId and Pos); then, using a sorting criterion, one can remove the duplicates via a ranking window function like Rank, Dense_Rank or Row_Number (with small exceptions their use is interchangeable). Supposing that the Id attribute determines the order in which the data were added, one can write a query like the following:
-- removing duplicates via ranking partition window function
SELECT *
FROM (
SELECT PurchId, Pos, PurchDate, ItemId, SizeId, ColorId, Qty, Price
, RANK() OVER (PARTITION BY PurchId, Pos ORDER BY Id DESC) Ranking
FROM dbo.T_Orders
) DAT
WHERE Ranking=1
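A note on interchangeability: when the sorting criterion isn't unique (e.g. ordering by PurchDate instead of Id), RANK and DENSE_RANK can return several records with Ranking = 1 per partition, while ROW_NUMBER always returns exactly one. The following sketch uses PurchDate purely for illustration, with Id kept as tie-breaker:
-- removing duplicates via ROW_NUMBER (one record per partition even on ties)
SELECT *
FROM (
SELECT PurchId, Pos, PurchDate, ItemId, SizeId, ColorId, Qty, Price
, ROW_NUMBER() OVER (PARTITION BY PurchId, Pos ORDER BY PurchDate DESC, Id DESC) Ranking -- Id breaks the ties
FROM dbo.T_Orders
) DAT
WHERE Ranking=1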
The problem is that such queries need to be repeated each time they are used. A simpler method is running a regular check for duplicates and marking the duplicates as inactive or pushing them into a separate table. Deleting the duplicates can work as well, though this approach can hit back like a boomerang if not handled correctly.
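Deleting, for example, can be done directly over the same ranking via an updatable common table expression (a minimal sketch, assuming that the record with the highest Id is the one to keep):
-- deleting the older duplicates via an updatable CTE
;WITH CTE
AS (
SELECT RANK() OVER (PARTITION BY PurchId, Pos ORDER BY Id DESC) Ranking
FROM dbo.T_Orders
)
DELETE FROM CTE
WHERE Ranking>1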
To identify the duplicates, one of the following approaches can be used:
-- duplicates via grouping in inner query
SELECT DAT.*
, DUP.NoRecords
FROM dbo.T_Orders DAT
JOIN (-- duplicates
SELECT PurchId, Pos
, count(*) NoRecords
FROM dbo.T_Orders DUP
GROUP BY PurchId, Pos
HAVING count(*)>1
) DUP
ON DAT.PurchId = DUP.PurchId
AND DAT.Pos = DUP.Pos
-- duplicates via grouping in correlated query
SELECT DAT.*
, DUP.NoRecords
FROM dbo.T_Orders DAT
CROSS APPLY (-- duplicates
SELECT count(*) NoRecords
FROM dbo.T_Orders DUP
WHERE DAT.PurchId = DUP.PurchId
AND DAT.Pos = DUP.Pos
GROUP BY DUP.PurchId, DUP.Pos
HAVING count(*)>1
) DUP
-- duplicates via count within aggregate window function
SELECT *
FROM (
SELECT PurchId, Pos, PurchDate, ItemId, SizeId, ColorId, Qty, Price
, count(*) OVER (PARTITION BY PurchId, Pos) NoRecords
FROM dbo.T_Orders
) DAT
WHERE NoRecords>1
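As mentioned above, the duplicates can also be pushed into a separate table for further review before any cleanup. Building on the last query (a sketch, with dbo.T_Orders_Duplicates as a hypothetical name):
-- archiving the duplicates into a separate (hypothetical) table
SELECT *
INTO dbo.T_Orders_Duplicates
FROM (
SELECT PurchId, Pos, PurchDate, ItemId, SizeId, ColorId, Qty, Price, Id
, count(*) OVER (PARTITION BY PurchId, Pos) NoRecords
FROM dbo.T_Orders
) DAT
WHERE NoRecords>1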
Happy coding!