Let’s consider the following simple example based on a set of orders sent over the interface between two systems. Because of poor design, the first position was caught by the logic before entering the full order (the price from it is incorrect):
-- dropping the table --DROP TABLE IF EXISTS dbo.T_Orders -- inserting the records SELECT * INTO dbo.T_Orders FROM (VALUES ('P0001', 1,'20200618', '100001','S','black',20, 1.23) , ('P0001', 1,'20200618', '100001','S','black',20, 1.22) , ('P0001', 2,'20200618', '100001','S','black',5, 1.23) , ('P0001', 3,'20200618', '100001','M','black',10, 1.24) , ('P0002', 1,'20200619', '100001','S','black',5, 1.23) , ('P0002', 2,'20200619', '100001','S','black',10, 1.22)) DAT (PurchId, Pos, PurchDate, ItemId, SizeId, ColorId, Qty, Price) -- adding a primary key ALTER TABLE dbo.T_Orders ADD Id int identity(1,1) NOT NULL -- reviewing the data SELECT * FROM dbo.T_Orders ORDER BY PurchId , Pos
The following queries return the same results :
-- simple SELECT DISTINCT SELECT DISTINCT PurchId, Pos, ItemId, SizeId, ColorId, Qty FROM dbo.T_Orders -- simple GROUP BY SELECT PurchId, Pos, ItemId, SizeId, ColorId, Qty FROM dbo.T_Orders GROUP BY PurchId, Pos, ItemId, SizeId, ColorId, Qty
Both queries hide the fact that a duplicate exists. However, with a small change in the last query (adding a count) one can highlight that, while the first query doesn't allow this flexibility:
-- simple GROUP BY with Count SELECT PurchId, Pos, ItemId, SizeId, ColorId, Qty , count(*) NoRecords FROM dbo.T_Orders GROUP BY PurchId, Pos, ItemId, SizeId, ColorId, Qty
-- HAVING count(*)>1
This small change can make an important difference when one needs to analyze the data, so at least from this perspective it might be worth to write the query as a GROUP BY from the beginning, especially when writing complex queries.
There are also nested SELECT DISTINCTS that attempt refining the selection in successive steps. Such curiosities seldom make sense:
-- nested SELECT DISTINCT SELECT DISTINCT PurchId, Pos, ItemId FROM ( SELECT DISTINCT PurchId, Pos, ItemId, SizeId, ColorId, Qty FROM dbo.T_Orders ) DAT
To remove the duplicates and perform a sum one may choose another curiosity of query - the inner query removes the duplicates (some write it also as a DISTINCT), while in the outer query is done the sum:
-- double grouping to remove duplicates SELECT PurchId, Pos, ItemId, SizeId, ColorId , SUM(Qty) Total FROM (--inner query SELECT PurchId, Pos, ItemId, SizeId, ColorId, Qty FROM dbo.T_Orders GROUP BY PurchId, Pos, ItemId, SizeId, ColorId, Qty ) DAT GROUP BY PurchId, Pos, ItemId, SizeId, ColorId
Even if such queries (miraculously) work upon case, sooner or later they are predestined to fail. When doing operations on data whose quality is doubtful, one needs to select a strategy for removing the duplicates. The first step is to identify the attributes which make a record unique (e.g. PurchId and Pos), and using a sorting criteria, one can remove the duplicates via a window function like Rank, Dense_Rank or Row_Number (with small exceptions their use is interchangeable). Supposing that the Id attribute determinates the order in which the data were added, one can write a similar query:
-- removing duplicates via ranking partition window function SELECT * FROM ( SELECT PurchId, Pos, PurchDate, ItemId, SizeId, ColorId, Qty, Price , RANK() OVER (PARTITION BY PurchId, Pos ORDER BY Id DESC) Ranking FROM dbo.T_Orders ) DAT WHERE Ranking=1
The problem is that such queries need to be repeated in each use. A simpler method is running a regular check for duplicates and marking the duplicates as inactive or pushing them into a separate table. Deleting the duplicates can work as well, though this approach can hit back like a boomerang if not addressed correctly.
To identify the duplicates one of the following approaches can be used:
-- duplicates via grouping in inner query SELECT DAT.* , DUP.NoRecords FROM dbo.T_Orders DAT JOIN (-- duplicates SELECT PurchId, Pos , count(*) NoRecords FROM dbo.T_Orders DUP GROUP BY PurchId, Pos HAVING count(*)>1 ) DUP ON DAT.PurchId = DUP.Purchid AND DAT.Pos = DUP.Pos -- duplicates via grouping in correlated query SELECT DAT.* , DUP.NoRecords FROM dbo.T_Orders DAT CROSS APPLY (-- duplicates SELECT count(*) NoRecords FROM dbo.T_Orders DUP WHERE DAT.PurchId = DUP.Purchid AND DAT.Pos = DUP.Pos GROUP BY PurchId, Pos HAVING count(*)>1 ) DUP -- duplicates via count within aggregate window function SELECT * FROM ( SELECT PurchId, Pos, PurchDate, ItemId, SizeId, ColorId, Qty, Price , count(*) OVER (PARTITION BY PurchId, Pos) NoRecords FROM dbo.T_Orders ) DAT WHERE NoRecords>1
Happy coding!
No comments:
Post a Comment