Subject to duplication are whole records, a group of attributes (fields) or only single attributes. I depends from case to case. Often duplicates are easy to identify - it’s enough to let somebody who has the proper knowledge to look over them. But what you do when the volume of data is too large or when is need to automate the process as much as possible? Using the DISTINCT keyword in a SELECT statement might do the trick, while other times it requires more complicated validation, ranging from simple checks to Data Mining techniques.
I will try to exemplify the techniques I use to deal with duplicates with the help of a simple example based on table that tracks information about Assets:
-- create test table CREATE TABLE [dbo].[Assets]( [ID] [int] NOT NULL, [CreationDate] smalldatetime NOT NULL, [Vendor] [varchar](50) NULL, [Asset] [varchar](50) NULL, [Model] [varchar](50) NULL, [Owner] [varchar](50) NULL, [Tag] [varchar](50) NULL, [Quantity] [decimal](13, 2) NULL ) ON [PRIMARY]
Here's some test data:
-- insert test data (SQL Server 2000+) INSERT INTO dbo.Assets VALUES ('1', DATEADD(d,-5, GetDate()), 'IBM','Laptop 1','Model 1','Owner 1','XX0001','1') INSERT INTO dbo.Assets VALUES ('2', DATEADD(d,-4, GetDate()),'IBM','Laptop 2','Model 2','Owner 2','XX0002','1') INSERT INTO dbo.Assets VALUES ('3', DATEADD(d,-3, GetDate()),'Microsoft','Laptop 3','Model 3','Owner 2','WX0001','1') INSERT INTO dbo.Assets VALUES ('4', DATEADD(d,-3, GetDate()),'Microsoft','Laptop 3','Model 3','Owner 2','WX0001','1') INSERT INTO dbo.Assets VALUES ('5', DATEADD(d,-3, GetDate()),'Dell','Laptop 4','Model 4','Owner 3','DD0001','1') INSERT INTO dbo.Assets VALUES ('6', DATEADD(d,-1, GetDate()),'Dell','Laptop 4','Model 4','Owner 4','DD0001','1')
Let’s check table’s content:
Output:
-- review the data SELECT ID, CreationDate, Vendor, Asset, Model, Owner, Tag, Quantity FROM dbo.Assets
Output:
ID | CreationDate | Vendor | Asset | Model | Owner | Tag | Quantity |
1 | 1/29/2024 10:46:00 PM | IBM | Laptop 1 | Model 1 | Owner 1 | XX0001 | 1 |
2 | 1/30/2024 10:46:00 PM | IBM | Laptop 2 | Model 2 | Owner 2 | XX0002 | 1 |
3 | 1/31/2024 10:46:00 PM | Microsoft | Laptop 3 | Model 3 | Owner 2 | WX0001 | 1 |
4 | 1/31/2024 10:46:00 PM | Microsoft | Laptop 3 | Model 3 | Owner 2 | WX0001 | 1 |
5 | 1/31/2024 10:46:00 PM | Dell | Laptop 4 | Model 4 | Owner 3 | DD0001 | 1 |
6 | 2/2/2024 10:46:00 PM | Dell | Laptop 4 | Model 4 | Owner 4 | DD0001 | 1 |
Normally, a Tag or a Serial Number should uniquely identify a product coming from the same Vendor, so in this case duplicates’ identification will resume to the records in which the Tag and Vendor combination appears more than once:
Output:
-- retrieve the duplicates SELECT Vendor, Tag FROM dbo.Assets A GROUP BY Vendor, Tag HAVING COUNT(*)>1
Vendor | Tag |
Dell | DD0001 |
Microsoft | WX0001 |
In many cases showing only the duplicated values would be enough, however there are cases in which is required to see the whole record in which the duplicates appeared, for example to clarify the context in which the issues appear. A simple JOIN with the base table would provide the affected records:
-- retrieve duplicates with details SELECT A.Id, A.CreationDate, A.Vendor, A.Asset, A.Model, A.Owner, A.Tag, A.Quantity FROM dbo.Assets A JOIN (-- duplicates SELECT Vendor, Tag FROM dbo.Assets A GROUP BY Vendor, Tag HAVING COUNT(*)>1 ) B ON A.Vendor = B.Vendor AND A.Tag = B.Tag
Id | CreationDate | Vendor | Asset | Model | Owner | Tag | Quantity |
5 | 1/31/2024 10:46:00 PM | Dell | Laptop 4 | Model 4 | Owner 3 | DD0001 | 1 |
6 | 2/2/2024 10:46:00 PM | Dell | Laptop 4 | Model 4 | Owner 4 | DD0001 | 1 |
3 | 1/31/2024 10:46:00 PM | Microsoft | Laptop 3 | Model 3 | Owner 2 | WX0001 | 1 |
4 | 1/31/2024 10:46:00 PM | Microsoft | Laptop 3 | Model 3 | Owner 2 | WX0001 | 1 |
There are two scenarios in which we need to remove the duplicates – in a result set or within an existing table.
In a result set normally it's enough to use the DISTINCT keyword to remove duplicated rows:
Output:
In a result set normally it's enough to use the DISTINCT keyword to remove duplicated rows:
-- select unique records SELECT DISTINCT CreationDate, Vendor, Asset, Model, Owner, Tag, Quantity FROM dbo.Assets
Output:
CreationDate | Vendor | Asset | Model | Owner | Tag | Quantity |
1/29/2024 10:46:00 PM | IBM | Laptop 1 | Model 1 | Owner 1 | XX0001 | 1 |
1/30/2024 10:46:00 PM | IBM | Laptop 2 | Model 2 | Owner 2 | XX0002 | 1 |
1/31/2024 10:46:00 PM | Dell | Laptop 4 | Model 4 | Owner 3 | DD0001 | 1 |
1/31/2024 10:46:00 PM | Microsoft | Laptop 3 | Model 3 | Owner 2 | WX0001 | 1 |
2/2/2024 10:46:00 PM | Dell | Laptop 4 | Model 4 | Owner 4 | DD0001 | 1 |
In our example only some combinations are duplicated while the other attributes might slightly differ, and therefore is needed another approach. First of all we need to identify which one is the most reliable record, in some cases the latest records entry should be the most accurate or closer to reality, but that’s not necessarily the truth. There are also cases in which we don’t care which the record that is selected is, but from experience these cases are few.
Oracle and SQL Server introduced the dense_rank() analytic function, which returns the rank of rows within the partition of a result set, without any gaps in the ranking. In our case the partition is determined by Vendor and Tag, following to identify which the logic used for raking. Supposing that we are always interested in the last record entered, the query would look like this:
-- retrieve duplicates via ranking functions SELECT Id, CreationDate, Vendor, Asset, Model, Owner, Tag, Quantity FROM (--subquery SELECT Id, CreationDate, Vendor, Asset, Model, Owner, Tag, Quantity , dense_rank() OVER(PARTITION BY Vendor, Tag ORDER BY CreationDate DESC , Id DESC) RANKING FROM dbo.Assets ) A WHERE RANKING = 1
Output:
CreationDate | Vendor | Asset | Model | Owner | Tag | Quantity |
1/29/2024 10:46:00 PM | IBM | Laptop 1 | Model 1 | Owner 1 | XX0001 | 1 |
1/30/2024 10:46:00 PM | IBM | Laptop 2 | Model 2 | Owner 2 | XX0002 | 1 |
1/31/2024 10:46:00 PM | Dell | Laptop 4 | Model 4 | Owner 3 | DD0001 | 1 |
1/31/2024 10:46:00 PM | Microsoft | Laptop 3 | Model 3 | Owner 2 | WX0001 | 1 |
2/2/2024 10:46:00 PM | Dell | Laptop 4 | Model 4 | Owner 4 | DD0001 | 1 |
Things are simple when records’ creation date is available and it's unique for the key used to validate the data, because a descending order would allow selecting the last record first. An Id can be added to the clause for the cases in which multiple records have the same date. At least in this example the same output is obtained by using the row_number analytic function, which returns the sequential number of a row within a partition of a result set, starting at 1 for the first row in each partition. Actually, the dense_rank function assigns the same rank for the same occurrence of the values appearing in ORDER BY clause (within the same partition), and thus forces the developer to choose the exact attributes that make a field unique within a partition, while row_number() will return a sequence no matter of the attributes used in ORDER BY clause.
Both functions should provide similar performance. Conversely, when the selection of records within a partition is not important, it’s better to use the row_number() function, which would need to do less processing.
Unfortunately, this technique doesn’t work in SQL Server 2000, where a different approach is needed. In most of the cases the unique identifier for a record is a sequential unique number, the highest id corresponding to the latest entered record. This would allow selecting the latest entered record, by using the Max function:
Unfortunately, this technique doesn’t work in SQL Server 2000, where a different approach is needed. In most of the cases the unique identifier for a record is a sequential unique number, the highest id corresponding to the latest entered record. This would allow selecting the latest entered record, by using the Max function:
-- nonduplicated records (SQL server 2000+) SELECT A.Id, A.CreationDate, A.Vendor, A.Asset, A.Model, A.Owner, A.Tag, A.Quantity FROM dbo.Assets A JOIN ( -- last entry SELECT Vendor, Tag, MAX(Id) MaxId FROM dbo.Assets A GROUP BY Vendor, Tag -- HAVING count(*)>1 ) B ON A.Vendor = B.Vendor AND A.Tag = B.Tag AND A.ID = B.MaxId
Output:
Id | CreationDate | Vendor | Asset | Model | Owner | Tag | Quantity |
4 | 1/31/2024 10:46:00 PM | Microsoft | Laptop 3 | Model 3 | Owner 2 | WX0001 | 1 |
2 | 1/30/2024 10:46:00 PM | IBM | Laptop 2 | Model 2 | Owner 2 | XX0002 | 1 |
1 | 1/29/2024 10:46:00 PM | IBM | Laptop 1 | Model 1 | Owner 1 | XX0001 | 1 |
6 | 2/2/2024 10:46:00 PM | Dell | Laptop 4 | Model 4 | Owner 4 | DD0001 | 1 |
The same technique can be used to delete the duplicates from a table:
-- nonduplicated records (SQL server 2000+) SELECT A.Id, A.CreationDate, A.Vendor, A.Asset, A.Model, A.Owner, A.Tag, A.Quantity FROM dbo.Assets A JOIN ( -- last entry SELECT Vendor, Tag, MAX(Id) MaxId FROM dbo.Assets A GROUP BY Vendor, Tag -- HAVING count(*)>1 ) B ON A.Vendor = B.Vendor AND A.Tag = B.Tag AND A.ID = B.MaxId
Notes:
1. In other scenarios it’s important to select all the records matching extreme values (first, last), the dense_rank function becoming handy, however for versions that doesn’t supports it, a creation date attribute saves the day, when available, and it's unique:
-- nonduplicated records (SQL server 2000+) SELECT A.Id, A.CreationDate, A.Vendor, A.Asset, A.Model, A.Owner, A.Tag, A.Quantity FROM dbo.Assets A JOIN (-- last entry SELECT Vendor, Tag, MAX(CreationDate) LastCreationDate FROM dbo.Assets A GROUP BY Vendor, Tag -- HAVING count(*)>1 ) B ON A.Vendor = B.Vendor AND A.Tag = B.Tag AND DateDiff(d, A.CreationDate, B.LastCreationDate)=0
Id | CreationDate | Vendor | Asset | Model | Owner | Tag | Quantity |
6 | 2/2/2024 10:46:00 PM | Dell | Laptop 4 | Model 4 | Owner 4 | DD0001 | 1 |
1 | 1/29/2024 10:46:00 PM | IBM | Laptop 1 | Model 1 | Owner 1 | XX0001 | 1 |
2 | 1/30/2024 10:46:00 PM | IBM | Laptop 2 | Model 2 | Owner 2 | XX0002 | 1 |
3 | 1/31/2024 10:46:00 PM | Microsoft | Laptop 3 | Model 3 | Owner 2 | WX0001 | 1 |
4 | 1/31/2024 10:46:00 PM | Microsoft | Laptop 3 | Model 3 | Owner 2 | WX0001 | 1 |
2. Duplicated records can be created by merging two sets of data with UNION ALL, in such cases a simple UNION would remove the duplicates introduced by the queries coming after the first one however this will burden the database with more checks. It is recommended to find other solutions, when possible!
3. Instead of using a single multi-row insertion I used multiple insertion statements because I preferred to make the tutorial usable also on SQL Server 2000. Here’s the single multi-row insertion statement:
4. The above techniques should work also in Oracle with two amendments, attributes’ type must be adapted to Oracle ones, while instead of SQL Server GetDate() function should be used the corresponding Oracle SYSDATE function, as below:
3. Instead of using a single multi-row insertion I used multiple insertion statements because I preferred to make the tutorial usable also on SQL Server 2000. Here’s the single multi-row insertion statement:
-- insert test data (SQL Server 2005+) INSERT INTO dbo.Assets VALUES ('1', DATEADD(d,-5, GetDate()), 'IBM','Laptop 1','Model 1','Owner 1','XX0001','1') , ('2', DATEADD(d,-4, GetDate()),'IBM','Laptop 2','Model 2','Owner 2','XX0002','1') , ('3', DATEADD(d,-3, GetDate()),'Microsoft','Laptop 3','Model 3','Owner 2','WX0001','1') , ('4', DATEADD(d,-3, GetDate()),'Microsoft','Laptop 3','Model 3','Owner 2','WX0001','1') , ('5', DATEADD(d,-3, GetDate()),'Dell','Laptop 4','Model 4','Owner 3','DD0001','1') , ('6', DATEADD(d,-1, GetDate()),'Dell','Laptop 4','Model 4','Owner 4','DD0001','1')
4. The above techniques should work also in Oracle with two amendments, attributes’ type must be adapted to Oracle ones, while instead of SQL Server GetDate() function should be used the corresponding Oracle SYSDATE function, as below:
No comments:
Post a Comment