Showing posts with label dense_rank. Show all posts
Showing posts with label dense_rank. Show all posts

08 November 2008

SQL Reloaded: Dealing with data duplicates on SQL Server

Duplicates or duplications can be defined as "result rows that are identical to one or more other rows" [1]. They occur frequently in data collections (e.g. Excel, Access, etc.) with concurrent access that have minimum or no validation at all. More users and/or more frequent the updates on the same information, higher the risks of having duplicated data. Duplicates occur also in modern architectures when validation wasn’t addressed correspondingly or from usage related issues, for example, in information systems the huge amount of data entry makes it hard to overlook documents already processed (e.g. purchase orders, invoices, payments, etc.).

Subject to duplication are whole records, a group of attributes (fields) or only single attributes. I depends from case to case. Often duplicates are easy to identify - it’s enough to let somebody who has the proper knowledge to look over them. But what you do when the volume of data is too large or when is need to automate the process as much as possible? Using the DISTINCT keyword in a SELECT statement might do the trick, while other times it requires more complicated validation, ranging from simple checks to Data Mining techniques.

I will try to exemplify the techniques I use to deal with duplicates with the help of a simple example based on table that tracks information about Assets:

-- create test table
CREATE TABLE [dbo].[Assets](
 [ID] [int] NOT NULL,
 [CreationDate] smalldatetime NOT NULL,
 [Vendor] [varchar](50) NULL,
 [Asset] [varchar](50) NULL,
 [Model] [varchar](50) NULL,
 [Owner] [varchar](50) NULL,
 [Tag] [varchar](50) NULL,
 [Quantity] [decimal](13, 2) NULL
) ON [PRIMARY]

Here's some test data:

-- insert test data (SQL Server 2000+)
INSERT INTO dbo.Assets
VALUES ('1', DATEADD(d,-5, GetDate()), 'IBM','Laptop 1','Model 1','Owner 1','XX0001','1')
INSERT INTO dbo.Assets
VALUES  ('2', DATEADD(d,-4, GetDate()),'IBM','Laptop 2','Model 2','Owner 2','XX0002','1')
INSERT INTO dbo.Assets
VALUES  ('3', DATEADD(d,-3, GetDate()),'Microsoft','Laptop 3','Model 3','Owner 2','WX0001','1')
INSERT INTO dbo.Assets
VALUES ('4', DATEADD(d,-3, GetDate()),'Microsoft','Laptop 3','Model 3','Owner 2','WX0001','1')
INSERT INTO dbo.Assets
VALUES  ('5', DATEADD(d,-3, GetDate()),'Dell','Laptop 4','Model 4','Owner 3','DD0001','1')
INSERT INTO dbo.Assets
VALUES  ('6', DATEADD(d,-1, GetDate()),'Dell','Laptop 4','Model 4','Owner 4','DD0001','1')

    
Let’s check table’s content:

-- review the data
SELECT ID, CreationDate, Vendor, Asset, Model, Owner, Tag, Quantity
FROM dbo.Assets

Output:
ID CreationDate Vendor Asset Model Owner Tag Quantity
1 1/29/2024 10:46:00 PM IBM Laptop 1 Model 1 Owner 1 XX0001 1
2 1/30/2024 10:46:00 PM IBM Laptop 2 Model 2 Owner 2 XX0002 1
3 1/31/2024 10:46:00 PM Microsoft Laptop 3 Model 3 Owner 2 WX0001 1
4 1/31/2024 10:46:00 PM Microsoft Laptop 3 Model 3 Owner 2 WX0001 1
5 1/31/2024 10:46:00 PM Dell Laptop 4 Model 4 Owner 3 DD0001 1
6 2/2/2024 10:46:00 PM Dell Laptop 4 Model 4 Owner 4 DD0001 1

Normally, a Tag or a Serial Number should uniquely identify a product coming from the same Vendor, so in this case duplicates’ identification will resume to the records in which the Tag and Vendor combination appears more than once:

-- retrieve the duplicates
SELECT Vendor, Tag
FROM dbo.Assets A    
GROUP BY Vendor, Tag
HAVING COUNT(*)>1

Output:
Vendor Tag
Dell DD0001
Microsoft WX0001

In many cases showing only the duplicated values would be enough, however there are cases in which is required to see the whole record in which the duplicates appeared, for example to clarify the context in which the issues appear. A simple JOIN with the base table would provide the affected records:

-- retrieve duplicates with details
SELECT A.Id, A.CreationDate, A.Vendor, A.Asset, A.Model, A.Owner, A.Tag, A.Quantity
FROM dbo.Assets A
     JOIN (-- duplicates
	  SELECT Vendor, Tag
	  FROM dbo.Assets A    
	  GROUP BY Vendor, Tag
	  HAVING COUNT(*)>1
     ) B
      ON A.Vendor = B.Vendor 
     AND A.Tag = B.Tag

Output:
Id CreationDate Vendor Asset Model Owner Tag Quantity
5 1/31/2024 10:46:00 PM Dell Laptop 4 Model 4 Owner 3 DD0001 1
6 2/2/2024 10:46:00 PM Dell Laptop 4 Model 4 Owner 4 DD0001 1
3 1/31/2024 10:46:00 PM Microsoft Laptop 3 Model 3 Owner 2 WX0001 1
4 1/31/2024 10:46:00 PM Microsoft Laptop 3 Model 3 Owner 2 WX0001 1

There are two scenarios in which we need to remove the duplicates – in a result set or within an existing table.

In a result set normally it's enough to use the DISTINCT keyword to remove duplicated rows:

-- select unique records
SELECT DISTINCT CreationDate, Vendor, Asset, Model, Owner, Tag, Quantity
FROM dbo.Assets

Output:
CreationDate Vendor Asset Model Owner Tag Quantity
1/29/2024 10:46:00 PM IBM Laptop 1 Model 1 Owner 1 XX0001 1
1/30/2024 10:46:00 PM IBM Laptop 2 Model 2 Owner 2 XX0002 1
1/31/2024 10:46:00 PM Dell Laptop 4 Model 4 Owner 3 DD0001 1
1/31/2024 10:46:00 PM Microsoft Laptop 3 Model 3 Owner 2 WX0001 1
2/2/2024 10:46:00 PM Dell Laptop 4 Model 4 Owner 4 DD0001 1

In our example only some combinations are duplicated while the other attributes might slightly differ, and therefore is needed another approach. First of all we need to identify which one is the most reliable record, in some cases the latest records entry should be the most accurate or closer to reality, but that’s not necessarily the truth. There are also cases in which we don’t care which the record that is selected is, but from experience these cases are few.

Oracle and SQL Server introduced the dense_rank() analytic function, which returns the rank of rows within the partition of a result set, without any gaps in the ranking. In our case the partition is determined by Vendor and Tag, following to identify which the logic used for raking. Supposing that we are always interested in the last record entered, the query would look like this:

-- retrieve duplicates via ranking functions 
SELECT Id, CreationDate, Vendor, Asset, Model, Owner, Tag, Quantity
FROM (--subquery 
 SELECT Id, CreationDate, Vendor, Asset, Model, Owner, Tag, Quantity
 , dense_rank() OVER(PARTITION BY Vendor, Tag ORDER BY CreationDate DESC , Id DESC) RANKING 
 FROM dbo.Assets 
) A 
WHERE RANKING = 1

Output:
CreationDate Vendor Asset Model Owner Tag Quantity
1/29/2024 10:46:00 PM IBM Laptop 1 Model 1 Owner 1 XX0001 1
1/30/2024 10:46:00 PM IBM Laptop 2 Model 2 Owner 2 XX0002 1
1/31/2024 10:46:00 PM Dell Laptop 4 Model 4 Owner 3 DD0001 1
1/31/2024 10:46:00 PM Microsoft Laptop 3 Model 3 Owner 2 WX0001 1
2/2/2024 10:46:00 PM Dell Laptop 4 Model 4 Owner 4 DD0001 1

Things are simple when records’ creation date is available and it's unique for the key used to validate the data, because a descending order would allow selecting the last record first. An Id can be added to the clause for the cases in which multiple records have the same date. At least in this example the same output is obtained by using the row_number analytic function, which returns the sequential number of a row within a partition of a result set, starting at 1 for the first row in each partition. Actually, the dense_rank function assigns the same rank for the same occurrence of the values appearing in ORDER BY clause (within the same partition), and thus forces the developer to choose the exact attributes that make a field unique within a partition, while row_number() will return a sequence no matter of the attributes used in ORDER BY clause. 

Both functions should provide similar performance. Conversely, when the selection of records within a partition is not important, it’s better to use the row_number() function, which would need to do less processing.

Unfortunately, this technique doesn’t work in SQL Server 2000, where a different approach is needed. In most of the cases the unique identifier for a record is a sequential unique number, the highest id corresponding to the latest entered record. This would allow selecting the latest entered record, by using the Max function:

-- nonduplicated records (SQL server 2000+)
SELECT A.Id, A.CreationDate, A.Vendor, A.Asset, A.Model, A.Owner, A.Tag, A.Quantity
FROM dbo.Assets A
     JOIN ( -- last entry
	  SELECT Vendor, Tag, MAX(Id) MaxId
	  FROM dbo.Assets A    
	  GROUP BY Vendor, Tag
	 -- HAVING count(*)>1
     ) B
      ON A.Vendor = B.Vendor 
     AND A.Tag = B.Tag 
     AND A.ID = B.MaxId

Output:
Id CreationDate Vendor Asset Model Owner Tag Quantity
4 1/31/2024 10:46:00 PM Microsoft Laptop 3 Model 3 Owner 2 WX0001 1
2 1/30/2024 10:46:00 PM IBM Laptop 2 Model 2 Owner 2 XX0002 1
1 1/29/2024 10:46:00 PM IBM Laptop 1 Model 1 Owner 1 XX0001 1
6 2/2/2024 10:46:00 PM Dell Laptop 4 Model 4 Owner 4 DD0001 1

The same technique can be used to delete the duplicates from a table:

-- nonduplicated records (SQL server 2000+)
SELECT A.Id, A.CreationDate, A.Vendor, A.Asset, A.Model, A.Owner, A.Tag, A.Quantity
FROM dbo.Assets A
     JOIN ( -- last entry
	  SELECT Vendor, Tag, MAX(Id) MaxId
	  FROM dbo.Assets A    
	  GROUP BY Vendor, Tag
	 -- HAVING count(*)>1
     ) B
      ON A.Vendor = B.Vendor 
     AND A.Tag = B.Tag 
     AND A.ID = B.MaxId

When an Id is not available, then we have most probably to create a temporary table with a sequence (numeric unique identifier), sort the records based on the criteria we chose for selection, and then apply the technique based on Id as above.

Notes:
1. In other scenarios it’s important to select all the records matching extreme values (first, last), the dense_rank function becoming handy, however for versions that doesn’t supports it, a creation date attribute saves the day, when available, and it's unique:

-- nonduplicated records (SQL server 2000+)  
SELECT A.Id, A.CreationDate, A.Vendor, A.Asset, A.Model, A.Owner, A.Tag, A.Quantity
FROM dbo.Assets A
     JOIN (-- last entry
	  SELECT Vendor, Tag, MAX(CreationDate) LastCreationDate
	  FROM dbo.Assets A    
	  GROUP BY Vendor, Tag
	  -- HAVING count(*)>1
     ) B
      ON A.Vendor = B.Vendor 
     AND A.Tag = B.Tag 
     AND DateDiff(d, A.CreationDate, B.LastCreationDate)=0

Output:
Id CreationDate Vendor Asset Model Owner Tag Quantity
6 2/2/2024 10:46:00 PM Dell Laptop 4 Model 4 Owner 4 DD0001 1
1 1/29/2024 10:46:00 PM IBM Laptop 1 Model 1 Owner 1 XX0001 1
2 1/30/2024 10:46:00 PM IBM Laptop 2 Model 2 Owner 2 XX0002 1
3 1/31/2024 10:46:00 PM Microsoft Laptop 3 Model 3 Owner 2 WX0001 1
4 1/31/2024 10:46:00 PM Microsoft Laptop 3 Model 3 Owner 2 WX0001 1

2. Duplicated records can be created by merging two sets of data with UNION ALL, in such cases a simple UNION would remove the duplicates introduced by the queries coming after the first one however this will burden the database with more checks. It is recommended to find other solutions, when possible!

3. Instead of using a single multi-row insertion I used multiple insertion statements because I preferred to make the tutorial usable also on SQL Server 2000. Here’s the single multi-row insertion statement:

-- insert test data (SQL Server 2005+)
INSERT INTO dbo.Assets
VALUES ('1', DATEADD(d,-5, GetDate()), 'IBM','Laptop 1','Model 1','Owner 1','XX0001','1')
, ('2', DATEADD(d,-4, GetDate()),'IBM','Laptop 2','Model 2','Owner 2','XX0002','1')
, ('3', DATEADD(d,-3, GetDate()),'Microsoft','Laptop 3','Model 3','Owner 2','WX0001','1')
, ('4', DATEADD(d,-3, GetDate()),'Microsoft','Laptop 3','Model 3','Owner 2','WX0001','1')
, ('5', DATEADD(d,-3, GetDate()),'Dell','Laptop 4','Model 4','Owner 3','DD0001','1')
, ('6', DATEADD(d,-1, GetDate()),'Dell','Laptop 4','Model 4','Owner 4','DD0001','1')

4. The above techniques should work also in Oracle with two amendments, attributes’ type must be adapted to Oracle ones, while instead of SQL Server GetDate() function should be used the corresponding Oracle SYSDATE function, as below:
 
-- insert test data (Oracle)
INSERT INTO dbo.Assets<br />
VALUES ('1', SYSDATE-1, 'IBM','Laptop 1','Model 1','Owner 1','XX0001','1')

References:
[1] S Hull (1998) Business Objects Glossary
[2] Microsoft Learn (2023) dense_rank() (link)
[3] Microsoft Learn (2023) row_number() (link)

23 September 2007

SQL Reloaded: 6 out of 49 (Part 3: Basic Data Analysis)

The most basic information one can get is the number of drawings in which a number occurred, respectively which was its frequency:

-- number & frequency drawings by number
DECLARE @NumberDrawings int = 2000
SELECT Number 
, Cast(100.0*count(*)/@NumberDrawings as decimal(10,2)) Frequency 
, count(*) NumberDrawings 
FROM dbo.vLottoNumbers 
GROUP BY Number 
ORDER BY Number  

One of the interesting things to see in a data set of drawings is the distance between two drawings in which appears the same numbers. It doesn’t look too easy to compare two consecutive records within the same query; actually this can be done easy with the help of DENSE_RANK function (works only on SQL Server 2005+ and Oracle), which ranks the data within a partition, thus if two values are identical, they have the same ranking. Let’s see how DENSE_RANK function works: 

-- ranked drawings by number
SELECT Number 
, DrawingDate 
, DENSE_RANK() OVER (PARTITION BY Number ORDER BY DrawingDate) Ranking 
FROM dbo.vLottoNumbers   

The partition is created by Number, while the ranking is delimited by DrawingDate. Now all we have to do is to join two such queries by Number with two consecutive Rankings: 

--creating the view 
CREATE VIEW dbo.vLottoConsecutiveDrawings 
AS 
-- consecutive drawings by numbers
WITH DAT 
AS (
	SELECT Number     
	, DrawingDate     
	, DENSE_RANK() OVER (PARTITION BY Number ORDER BY DrawingDate) Ranking   
	FROM dbo.vLottoNumbers  
)
SELECT A.Number 
, A.DrawingDate 
, B.DrawingDate NextDrawingDate 
, IsNull(DateDiff(d, A.DrawingDate, B.DrawingDate), 0) DifferenceDays 
, IsNull(DateDiff(wk, A.DrawingDate, B.DrawingDate), 0) DifferenceWeeks 
FROM DAT A
     LEFT JOIN DAT B 
	   ON A.Number = B.Number 
	  AND A.Ranking = B.Ranking - 1 

The first interesting thing to find out is:     
  • What’s the average number of occurrences?     
  • What’s the minimum/maximum number of consecutive drawings in which the number hasn’t appeared?  
  • How many times a number appeared?      
The following query answers to these questions, by doing a simple grouping by Number over vLottoConsecutiveDrawings output:

-- minimum/maximum differences between drawings
SELECT Number 
, Min(NullIf(DifferenceWeeks, 0)) Minimum 
, Max(NullIf(DifferenceWeeks, 0)) Maximum 
, Avg(NullIf(DifferenceWeeks, 0)) Average 
, count(*) NumberDrawings 
FROM dbo.vLottoConsecutiveDrawings 
GROUP BY Number 
ORDER BY Number    

The output doesn’t give much information, but it’s a good start. The study interval can be broken down by using the DrawingDate in GROUP and/or WHERE clause: 

-- minimum/maximum differences between drawings by year
SELECT Year(DrawingDate) DrawingYear 
, Number 
, Min(NullIf(DifferenceWeeks, 0)) Minimum 
, Max(NullIf(DifferenceWeeks, 0)) Maximum 
, Avg(NullIf(DifferenceWeeks, 0)) Average 
, count(*) NumberDrawings 
FROM vLottoConsecutiveDrawings 
--WHERE Year(DrawingDate) IN (2000, 2001) 
GROUP BY Number 
, Year(DrawingDate) 
ORDER BY Number, DrawingYear

It would be also interesting to know how many numbers fall on a given row or column within the 7x7 matrix:

-- aggregating data by drawing & row
SELECT LN.DrawingDate 
, SUM(CASE WHEN LN.Number BETWEEN 1 AND 7 THEN 1 ELSE 0 END) Bucket1
, SUM(CASE WHEN LN.Number BETWEEN 8 AND 14 THEN 1 ELSE 0 END) Bucket2
, SUM(CASE WHEN LN.Number BETWEEN 15 AND 21 THEN 1 ELSE 0 END) Bucket3
, SUM(CASE WHEN LN.Number BETWEEN 22 AND 28 THEN 1 ELSE 0 END) Bucket4
, SUM(CASE WHEN LN.Number BETWEEN 29 AND 35 THEN 1 ELSE 0 END) Bucket5
, SUM(CASE WHEN LN.Number BETWEEN 36 AND 42 THEN 1 ELSE 0 END) Bucket6
, SUM(CASE WHEN LN.Number BETWEEN 43 AND 49 THEN 1 ELSE 0 END) Bucket7
FROM dbo.vLottoNumbers LN 
GROUP BY LN.DrawingDate 

-- aggregating data by drawing & column
SELECT LN.DrawingDate 
, SUM(CASE WHEN LN.Number %7 = 1 THEN 1 ELSE 0 END) Bucket1
, SUM(CASE WHEN LN.Number %7 = 2 THEN 1 ELSE 0 END) Bucket2
, SUM(CASE WHEN LN.Number %7 = 3 THEN 1 ELSE 0 END) Bucket3
, SUM(CASE WHEN LN.Number %7 = 4 THEN 1 ELSE 0 END) Bucket4
, SUM(CASE WHEN LN.Number %7 = 5 THEN 1 ELSE 0 END) Bucket5
, SUM(CASE WHEN LN.Number %7 = 6 THEN 1 ELSE 0 END) Bucket6
, SUM(CASE WHEN LN.Number %7 = 7 THEN 1 ELSE 0 END) Bucket7
FROM dbo.vLottoNumbers LN 
GROUP BY LN.DrawingDate 

The buckets can be aggregated as follows (observe the replacement of SUM with MAX):
 
-- consolidated drawings by 7 buckets (row)
 SELECT SUM(Bucket1) Bucket1
, SUM(Bucket2) Bucket2
, SUM(Bucket3) Bucket3
, SUM(Bucket4) Bucket4
, SUM(Bucket5) Bucket5
, SUM(Bucket6) Bucket6
, SUM(Bucket7) Bucket7
 FROM (
     -- aggregating data by drawing & row
	 SELECT LN.DrawingDate 
	, Max(CASE WHEN LN.Number BETWEEN 1 AND 7 THEN 1 ELSE 0 END) Bucket1
	, Max(CASE WHEN LN.Number BETWEEN 8 AND 14 THEN 1 ELSE 0 END) Bucket2
	, Max(CASE WHEN LN.Number BETWEEN 15 AND 21 THEN 1 ELSE 0 END) Bucket3
	, Max(CASE WHEN LN.Number BETWEEN 22 AND 28 THEN 1 ELSE 0 END) Bucket4
	, Max(CASE WHEN LN.Number BETWEEN 29 AND 35 THEN 1 ELSE 0 END) Bucket5
	, Max(CASE WHEN LN.Number BETWEEN 36 AND 42 THEN 1 ELSE 0 END) Bucket6
	, Max(CASE WHEN LN.Number BETWEEN 43 AND 49 THEN 1 ELSE 0 END) Bucket7
	 FROM dbo.vLottoNumbers LN 
	 GROUP BY LN.DrawingDate 
  ) DAT

Happy coding!

Related Posts Plugin for WordPress, Blogger...

About Me

My photo
IT Professional with more than 24 years experience in IT in the area of full life-cycle of Web/Desktop/Database Applications Development, Software Engineering, Consultancy, Data Management, Data Quality, Data Migrations, Reporting, ERP implementations & support, Team/Project/IT Management, etc.