Subject to duplication are whole records, a group of attributes (fields) or only single attributes. I depends from case to case. Often duplicates are easy to identify - it’s enough to let somebody who has the proper knowledge to look over them. But what you do when the volume of data is too large or when is need to automate the process as much as possible? Using the DISTINCT keyword in a SELECT statement might do the trick, while other times it requires more complicated validation, ranging from simple checks to Data Mining techniques.
I will try to exemplify the techniques I use to deal with duplicates with the help of a simple example based on table that tracks information about Assets:
-- create test table CREATE TABLE [dbo].[Assets]( [ID] [int] NOT NULL, [CreationDate] smalldatetime NOT NULL, [Vendor] [varchar](50) NULL, [Asset] [varchar](50) NULL, [Model] [varchar](50) NULL, [Owner] [varchar](50) NULL, [Tag] [varchar](50) NULL, [Quantity] [decimal](13, 2) NULL ) ON [PRIMARY]
Here's some test data:
-- insert test data (SQL Server 2000+) INSERT INTO dbo.Assets VALUES ('1', DATEADD(d,-5, GetDate()), 'IBM','Laptop 1','Model 1','Owner 1','XX0001','1') INSERT INTO dbo.Assets VALUES ('2', DATEADD(d,-4, GetDate()),'IBM','Laptop 2','Model 2','Owner 2','XX0002','1') INSERT INTO dbo.Assets VALUES ('3', DATEADD(d,-3, GetDate()),'Microsoft','Laptop 3','Model 3','Owner 2','WX0001','1') INSERT INTO dbo.Assets VALUES ('4', DATEADD(d,-3, GetDate()),'Microsoft','Laptop 3','Model 3','Owner 2','WX0001','1') INSERT INTO dbo.Assets VALUES ('5', DATEADD(d,-3, GetDate()),'Dell','Laptop 4','Model 4','Owner 3','DD0001','1') INSERT INTO dbo.Assets VALUES ('6', DATEADD(d,-1, GetDate()),'Dell','Laptop 4','Model 4','Owner 4','DD0001','1')
-- review the data SELECT ID, CreationDate, Vendor, Asset, Model, Owner, Tag, Quantity FROM dbo.Assets
Output:
ID | CreationDate | Vendor | Asset | Model | Owner | Tag | Quantity |
1 | 1/29/2024 10:46:00 PM | IBM | Laptop 1 | Model 1 | Owner 1 | XX0001 | 1 |
2 | 1/30/2024 10:46:00 PM | IBM | Laptop 2 | Model 2 | Owner 2 | XX0002 | 1 |
3 | 1/31/2024 10:46:00 PM | Microsoft | Laptop 3 | Model 3 | Owner 2 | WX0001 | 1 |
4 | 1/31/2024 10:46:00 PM | Microsoft | Laptop 3 | Model 3 | Owner 2 | WX0001 | 1 |
5 | 1/31/2024 10:46:00 PM | Dell | Laptop 4 | Model 4 | Owner 3 | DD0001 | 1 |
6 | 2/2/2024 10:46:00 PM | Dell | Laptop 4 | Model 4 | Owner 4 | DD0001 | 1 |
-- retrieve the duplicates SELECT Vendor, Tag FROM dbo.Assets A GROUP BY Vendor, Tag HAVING COUNT(*)>1
Vendor | Tag |
Dell | DD0001 |
Microsoft | WX0001 |
-- retrieve duplicates with details SELECT A.Id, A.CreationDate, A.Vendor, A.Asset, A.Model, A.Owner, A.Tag, A.Quantity FROM dbo.Assets A JOIN (-- duplicates SELECT Vendor, Tag FROM dbo.Assets A GROUP BY Vendor, Tag HAVING COUNT(*)>1 ) B ON A.Vendor = B.Vendor AND A.Tag = B.Tag
Id | CreationDate | Vendor | Asset | Model | Owner | Tag | Quantity |
5 | 1/31/2024 10:46:00 PM | Dell | Laptop 4 | Model 4 | Owner 3 | DD0001 | 1 |
6 | 2/2/2024 10:46:00 PM | Dell | Laptop 4 | Model 4 | Owner 4 | DD0001 | 1 |
3 | 1/31/2024 10:46:00 PM | Microsoft | Laptop 3 | Model 3 | Owner 2 | WX0001 | 1 |
4 | 1/31/2024 10:46:00 PM | Microsoft | Laptop 3 | Model 3 | Owner 2 | WX0001 | 1 |
In a result set normally it's enough to use the DISTINCT keyword to remove duplicated rows:
-- select unique records SELECT DISTINCT CreationDate, Vendor, Asset, Model, Owner, Tag, Quantity FROM dbo.Assets
Output:
CreationDate | Vendor | Asset | Model | Owner | Tag | Quantity |
1/29/2024 10:46:00 PM | IBM | Laptop 1 | Model 1 | Owner 1 | XX0001 | 1 |
1/30/2024 10:46:00 PM | IBM | Laptop 2 | Model 2 | Owner 2 | XX0002 | 1 |
1/31/2024 10:46:00 PM | Dell | Laptop 4 | Model 4 | Owner 3 | DD0001 | 1 |
1/31/2024 10:46:00 PM | Microsoft | Laptop 3 | Model 3 | Owner 2 | WX0001 | 1 |
2/2/2024 10:46:00 PM | Dell | Laptop 4 | Model 4 | Owner 4 | DD0001 | 1 |
In our example only some combinations are duplicated while the other attributes might slightly differ, and therefore is needed another approach. First of all we need to identify which one is the most reliable record, in some cases the latest records entry should be the most accurate or closer to reality, but that’s not necessarily the truth. There are also cases in which we don’t care which the record that is selected is, but from experience these cases are few.
Oracle and SQL Server introduced the dense_rank() analytic function, which returns the rank of rows within the partition of a result set, without any gaps in the ranking. In our case the partition is determined by Vendor and Tag, following to identify which the logic used for raking. Supposing that we are always interested in the last record entered, the query would look like this:
-- retrieve duplicates via ranking functions SELECT Id, CreationDate, Vendor, Asset, Model, Owner, Tag, Quantity FROM (--subquery SELECT Id, CreationDate, Vendor, Asset, Model, Owner, Tag, Quantity , dense_rank() OVER(PARTITION BY Vendor, Tag ORDER BY CreationDate DESC , Id DESC) RANKING FROM dbo.Assets ) A WHERE RANKING = 1
Output:
CreationDate | Vendor | Asset | Model | Owner | Tag | Quantity |
1/29/2024 10:46:00 PM | IBM | Laptop 1 | Model 1 | Owner 1 | XX0001 | 1 |
1/30/2024 10:46:00 PM | IBM | Laptop 2 | Model 2 | Owner 2 | XX0002 | 1 |
1/31/2024 10:46:00 PM | Dell | Laptop 4 | Model 4 | Owner 3 | DD0001 | 1 |
1/31/2024 10:46:00 PM | Microsoft | Laptop 3 | Model 3 | Owner 2 | WX0001 | 1 |
2/2/2024 10:46:00 PM | Dell | Laptop 4 | Model 4 | Owner 4 | DD0001 | 1 |
Unfortunately, this technique doesn’t work in SQL Server 2000, where a different approach is needed. In most of the cases the unique identifier for a record is a sequential unique number, the highest id corresponding to the latest entered record. This would allow selecting the latest entered record, by using the Max function:
-- nonduplicated records (SQL server 2000+) SELECT A.Id, A.CreationDate, A.Vendor, A.Asset, A.Model, A.Owner, A.Tag, A.Quantity FROM dbo.Assets A JOIN ( -- last entry SELECT Vendor, Tag, MAX(Id) MaxId FROM dbo.Assets A GROUP BY Vendor, Tag -- HAVING count(*)>1 ) B ON A.Vendor = B.Vendor AND A.Tag = B.Tag AND A.ID = B.MaxId
Output:
Id | CreationDate | Vendor | Asset | Model | Owner | Tag | Quantity |
4 | 1/31/2024 10:46:00 PM | Microsoft | Laptop 3 | Model 3 | Owner 2 | WX0001 | 1 |
2 | 1/30/2024 10:46:00 PM | IBM | Laptop 2 | Model 2 | Owner 2 | XX0002 | 1 |
1 | 1/29/2024 10:46:00 PM | IBM | Laptop 1 | Model 1 | Owner 1 | XX0001 | 1 |
6 | 2/2/2024 10:46:00 PM | Dell | Laptop 4 | Model 4 | Owner 4 | DD0001 | 1 |
-- nonduplicated records (SQL server 2000+) SELECT A.Id, A.CreationDate, A.Vendor, A.Asset, A.Model, A.Owner, A.Tag, A.Quantity FROM dbo.Assets A JOIN ( -- last entry SELECT Vendor, Tag, MAX(Id) MaxId FROM dbo.Assets A GROUP BY Vendor, Tag -- HAVING count(*)>1 ) B ON A.Vendor = B.Vendor AND A.Tag = B.Tag AND A.ID = B.MaxId
Notes:
1. In other scenarios it’s important to select all the records matching extreme values (first, last), the dense_rank function becoming handy, however for versions that doesn’t supports it, a creation date attribute saves the day, when available, and it's unique:
-- nonduplicated records (SQL server 2000+) SELECT A.Id, A.CreationDate, A.Vendor, A.Asset, A.Model, A.Owner, A.Tag, A.Quantity FROM dbo.Assets A JOIN (-- last entry SELECT Vendor, Tag, MAX(CreationDate) LastCreationDate FROM dbo.Assets A GROUP BY Vendor, Tag -- HAVING count(*)>1 ) B ON A.Vendor = B.Vendor AND A.Tag = B.Tag AND DateDiff(d, A.CreationDate, B.LastCreationDate)=0
Id | CreationDate | Vendor | Asset | Model | Owner | Tag | Quantity |
6 | 2/2/2024 10:46:00 PM | Dell | Laptop 4 | Model 4 | Owner 4 | DD0001 | 1 |
1 | 1/29/2024 10:46:00 PM | IBM | Laptop 1 | Model 1 | Owner 1 | XX0001 | 1 |
2 | 1/30/2024 10:46:00 PM | IBM | Laptop 2 | Model 2 | Owner 2 | XX0002 | 1 |
3 | 1/31/2024 10:46:00 PM | Microsoft | Laptop 3 | Model 3 | Owner 2 | WX0001 | 1 |
4 | 1/31/2024 10:46:00 PM | Microsoft | Laptop 3 | Model 3 | Owner 2 | WX0001 | 1 |
3. Instead of using a single multi-row insertion I used multiple insertion statements because I preferred to make the tutorial usable also on SQL Server 2000. Here’s the single multi-row insertion statement:
-- insert test data (SQL Server 2005+) INSERT INTO dbo.Assets VALUES ('1', DATEADD(d,-5, GetDate()), 'IBM','Laptop 1','Model 1','Owner 1','XX0001','1') , ('2', DATEADD(d,-4, GetDate()),'IBM','Laptop 2','Model 2','Owner 2','XX0002','1') , ('3', DATEADD(d,-3, GetDate()),'Microsoft','Laptop 3','Model 3','Owner 2','WX0001','1') , ('4', DATEADD(d,-3, GetDate()),'Microsoft','Laptop 3','Model 3','Owner 2','WX0001','1') , ('5', DATEADD(d,-3, GetDate()),'Dell','Laptop 4','Model 4','Owner 3','DD0001','1') , ('6', DATEADD(d,-1, GetDate()),'Dell','Laptop 4','Model 4','Owner 4','DD0001','1')
4. The above techniques should work also in Oracle with two amendments, attributes’ type must be adapted to Oracle ones, while instead of SQL Server GetDate() function should be used the corresponding Oracle SYSDATE function, as below: