About Me

IT Professional with more than 16 years experience in IT especially in the area of full life-cycle of Web/Desktop Applications Development, Database Development, Software Engineering, Consultancy, Data Management, Data Quality, Data Migrations, Reporting, ERP support, etc.

Saturday, November 08, 2008

Dealing with data duplicates on SQL Server

    Duplicates or duplications can be defined as “result rows that are identical to one or more other rows” (S.Hull, 1998), they occur frequently in data collections (e.g. Excel, Access, etc.) that have minimum or no validation, more users, higher the risks of having duplicated data. Duplicates occur also in modern architectures when validation wasn’t addressed correspondingly or from usage related issues, for example in information systems, the huge amount of data entry makes it hard to overlook documents already processed (e.g. purchase orders, invoices, payments, etc.).

    Subject to duplication are whole records, a group of attributes (fields) or only single attributes, it depends from case to case. Often duplicates are easy to identify, it’s enough to let somebody who has the proper knowledge to look over them, but what you do when the volume of data is too large or when is need to automate the process as much as possible. Using the DISTINCT keyword in a SELECT statement might do the trick, while other times it requires more complicated validation, ranging from simple checks to data mining techniques.

    I will try to exemplify the techniques I use to deal with duplicates with the help of a self-constructed example, a table having the following structure, which tracks information about Assets:

CREATE TABLE [dbo].[Assets](
[ID] [int] NOT NULL,
[CreationDate] smalldatetime NOT NULL,
[Vendor] [varchar](50) NULL,
[Asset] [varchar](50) NULL,
[Model] [varchar](50) NULL,
[Owner] [varchar](50) NULL,
[Tag] [varchar](50) NULL,
[Quantity] [decimal](13, 2) NULL
) ON [PRIMARY]

    Next step is to provide the test data:
INSERT INTO Assets
VALUES ('1', DATEADD(d,-5, GetDate()), 'IBM','Laptop 1','Model 1','Owner 1','XX0001','1')
INSERT INTO Assets
VALUES ('2', DATEADD(d,-4, GetDate()),'IBM','Laptop 2','Model 2','Owner 2','XX0002','1')
INSERT INTO Assets
VALUES ('3', DATEADD(d,-3, GetDate()),'Microsoft','Laptop 3','Model 3','Owner 2','WX0001','1')
INSERT INTO Assets
VALUES ('4', DATEADD(d,-3, GetDate()),'Microsoft','Laptop 3','Model 3','Owner 2','WX0001','1')
INSERT INTO Assets
VALUES ('5', DATEADD(d,-3, GetDate()),'Dell','Laptop 4','Model 4','Owner 3','DD0001','1')
INSERT INTO Assets
VALUES ('6', DATEADD(d,-1, GetDate()),'Dell','Laptop 4','Model 4','Owner 4','DD0001','1')

    Now let’s check table’s content:

SELECT ID, CreationDate, Vendor, Asset, Model, Owner, Tag, Quantity
FROM Assets

Output: All records selection     Normally a Tag or a Serial Number should uniquely identify a product coming from the same Vendor, so in this case duplicates’ identification will resume to the records in which the Tag-Vendor combination appears more than once:

SELECT Vendor, Tag
FROM Assets A
GROUP BY Vendor, Tag
HAVING COUNT(*)>1

Output: Vendor-Product duplications     In many cases showing only the duplicated values would be enough, however there are cases in which is required to see the whole record in which the duplicates appeared, for example to clarify the context in which the issues appear. A simple JOIN with the base table would provide the affected records:

SELECT A.Id, A.CreationDate, A.Vendor, A.Asset, A.Model, A.Owner, A.Tag, A.Quantity FROM Assets A
JOIN (
SELECT Vendor, Tag
FROM Assets A
GROUP BY Vendor, Tag
HAVING COUNT(*)>1
) B
ON A.Vendor = B.Vendor
AND A.Tag = B.Tag

Output:
selection of duplicates     There are two scenarios in which we need to remove the duplicates – in a result set or within an existing table.

In a result set normally is enough to use the DISTINCT keyword to remove duplicated rows,

SELECT DISTINCT CreationDate, Vendor, Asset, Model, Owner, Tag, Quantity
FROM Assets

Output:
selecting distinct records     In our example only some combinations are duplicated while the other attributes might slightly differ, and therefore is needed another approach. First of all we need to identify which one is the most reliable record, in some cases the latest records entry should be the most accurate or closer to reality, but that’s not necessarily the truth. There are also cases in which we don’t care which the record that is selected is, but from experience these cases are few.

    Oracle and SQL Server introduced the dense_rank() analytic function, which “returns the rank of rows within the partition of a result set, without any gaps in the ranking” (MSDN, 2008 a). In our case the partition is determined by Vendor and Tag, following to identify which the logic used for raking. Supposing that we are always interested in the last record entered, the query would look like this:

SELECT Id, CreationDate, Vendor, Asset, Model, Owner, Tag, Quantity
FROM (--subquery
SELECT Id, CreationDate, Vendor, Asset, Model, Owner, Tag, Quantity
, dense_rank() OVER(PARTITION BY Vendor, Tag ORDER BY CreationDate DESC , Id DESC) RANKING
FROM Assets
) A
WHERE RANKING = 1

Output:
using dense_rank() function     Things are simple when records’ creation date is available, because a descending order would allow selecting the last record first, the Id being added to the clause for the cases in which multiple records have the same date. At least in this example the same output is obtained by using the row_number() analytic function, which “returns the sequential number of a row within a partition of a result set, starting at 1 for the first row in each partition” (MSDN, 2008 b). Actually the dense_rank function assigns the same rank for the same occurrence of the values appearing in ORDER BY clause (within the same partition), and thus forces the developer to choose the exact attributes that make a field unique within a partition, while row_number() will return a sequence no matter of the attributes used in ORDER BY clause. In this case I suppose that both functions should have the same performance, on the other side, when the selection of records within a partition is not important, it’s better to use the row_number() function, which would need to do less processing and thus better performance.

Unfortunately this technique doesn’t work in SQL Server 2000, so a different approach is requested. In most of the cases the unique identifier for a record is a running number, the highest id corresponding to the latest entered record. This would allow selecting the latest entered record, by using the Max function:

SELECT A.Id, A.CreationDate, A.Vendor, A.Asset, A.Model, A.Owner, A.Tag, A.Quantity
FROM Assets A
JOIN (
SELECT Vendor, Tag, MAX(Id) MaxId
FROM Assets A
GROUP BY Vendor, Tag
) B
ON A.Vendor = B.Vendor
AND A.Tag = B.Tag
AND A.ID = B.MaxId
Output:
SQL Server 2000 technique The same technique can be used to delete the duplicates from a table:
DELETE Assets
FROM (
SELECT Vendor, Tag, MAX(Id) MaxId
FROM Assets A
GROUP BY Vendor, Tag
) B
WHERE Assets.Vendor = B.Vendor
AND Assets.Tag = B.Tag
AND Assets.ID <> B.MaxId
    When an Id is not available, then we have most probably to create a temporary table with a sequence (numeric unique identifier), sort the records based on the criteria we chose for selection, and then apply the technique based on Id as above.
Notes:
1. In other scenarios it’s important to select all the records matching extreme values (first, last), the dense_rank function becoming handy, however for versions that doesn’t supports it, a creation date attribute saves the day, when available:
SELECT A.Id, A.CreationDate, A.Vendor, A.Asset, A.Model, A.Owner, A.Tag, A.Quantity
FROM Assets A
JOIN (
SELECT Vendor, Tag, MAX(CreationDate) LastCreationDate
FROM Assets A
GROUP BY Vendor, Tag
) B
ON A.Vendor = B.Vendor
AND A.Tag = B.Tag
AND DateDiff(d, A.CreationDate, B.LastCreationDate)=0

Output:
record matching extreme values 2. Duplicated records can be created by merging two sets of data with UNION ALL, in such cases a simple UNION would remove the duplicates introduced by the queries coming after the first one however this will burden the database with more checks. It is recommended to find other solutions, when possible!

3. Instead of using a single multi-row insertion I used multiple insertion statements because I preferred to make the tutorial portable also on SQL Server 2000. Here’s the single multi-row insertion statement:

INSERT INTO Assets VALUES ('1', DATEADD(d,-5, GetDate()), 'IBM','Laptop 1','Model 1','Owner 1','XX0001','1')
, ('2', DATEADD(d,-4, GetDate()),'IBM','Laptop 2','Model 2','Owner 2','XX0002','1')
, ('3', DATEADD(d,-3, GetDate()),'Microsoft','Laptop 3','Model 3','Owner 2','WX0001','1')
, ('4', DATEADD(d,-3, GetDate()),'Microsoft','Laptop 3','Model 3','Owner 2','WX0001','1')
, ('5', DATEADD(d,-3, GetDate()),'Dell','Laptop 4','Model 4','Owner 3','DD0001','1')
, ('6', DATEADD(d,-1, GetDate()),'Dell','Laptop 4','Model 4','Owner 4','DD0001','1')

4. The above techniques should work also in Oracle with two amendments, the attributes’ type must be adapted to Oracle ones, while instead of SQL Server GetDate() function should be used the corresponding Oracle SYSDATE function as below:
INSERT INTO Assets
VALUES ('1', SYSDATE-1, 'IBM','Laptop 1','Model 1','Owner 1','XX0001','1')

References:
S.Hull. 1998. Business Objects Glossary. [Online] Available from:http://planning.ucsc.edu/IRPS/dwh/BOBGLOSS.HTM (Accessed: 7 November 2008)
MSDN. 2008 a. SQL Server 2008 Books Online, dense_rank(). [Online] Available from: http://msdn.microsoft.com/en-us/library/ms173825.aspx(Accessed: 7 November 2008)
MSDN. 2008 b. SQL Server 2008 Books Online, dense_rank(). [Online] Available from: http://msdn.microsoft.com/en-us/library/ms186734.aspx (Accessed: 7 November 2008)

No comments: