Showing posts with label row_number. Show all posts
Showing posts with label row_number. Show all posts

21 May 2020

SQL Reloaded: Splitting a String (Before and After)

Starting with SQL Server 2016 Microsoft introduced the STRING_SPLIT table-valued function, which splits a string into rows of substrings, based on a specified separator character (e.g. “:”).


-- splitting a string (SQL Server 2016+)
SELECT *
FROM STRING_SPLIT('100001::M:black:', ':') 

The table-valued function object allowed also earlier to implement the same functionality, either by looping through the string or by using a common table expression. Here’s the implementation based on a loop (it was modified from a previous version to include an index):


-- split table-valued function with an index
CREATE FUNCTION [dbo].[SplitListWithIndex]( 
  @ListValues varchar(500) 
, @Delimiter char(1)) 
RETURNS @Temp TABLE(
  Ind int
, Value varchar(50)) 
AS 
BEGIN 
 DECLARE @Index int 
 DECLARE @Length int 
 DECLARE @Ind int 

 SET @Index = CharIndex(@Delimiter, @ListValues) 
 SET @Length = Len(@ListValues) - @Index 
 SET @Ind = 1
   
 WHILE @Index > 0 --if the fatch was successful
 BEGIN 
  INSERT @Temp 
  VALUES(@Ind, Substring(@ListValues, 0, @Index)) 

  SET @ListValues = Substring(@ListValues, @Index+1, @Length) 
  SET @Index = CharIndex(@Delimiter, @ListValues) 
  SET @Length = @Length - @Index 
  SET @Ind = @Ind + 1
 END  
   
 INSERT @Temp 
 VALUES(@Ind, @ListValues) 
RETURN 
END  
GO

The function could be called in code same as the STRING_SPLIT:

-- splitting a string (SQL Server 2000+)
SELECT *
FROM dbo.SplitListWithIndex('100001::M:black:', ':') 

The two functions are helpful when a column stores delimited values. It’s the case of Dynamics 365 which stores a SKU (Stock Keeping Unit) – the Product Numbers together with its Dimensions (ItemId, Configid, InventSizeId, InventColorId, StyleId) delimited by “:”, like in the above examples. Therefore, to parse the records one could write such code:


-- parsing delimited values (SQL Server 2000+)
SELECT DAT.ProductNumber
, Max(CASE WHEN LV.Ind = 1 THEN LV.Value END) ItemId
, Max(CASE WHEN LV.Ind = 2 THEN LV.Value END) Configid
, Max(CASE WHEN LV.Ind = 3 THEN LV.Value END) InventSizeId
, Max(CASE WHEN LV.Ind = 4 THEN LV.Value END) InventColorId
, Max(CASE WHEN LV.Ind = 5 THEN LV.Value END) StyleId
FROM ( VALUES ('100001::S:black:')
 , ('100001::M:black:')
 , ('100001::L:black:')
 , ('100001::XL:black:')
 , ('100001::S:white:')
 , ('100001::M:white:')
 , ('100001::L:white:')
 , ('100001::XL:white:')
) DAT (ProductNumber)
CROSS APPLY dbo.SplitListWithIndex(DAT.ProductNumber, ':') LV
GROUP BY DAT.ProductNumber
ORDER BY DAT.ProductNumber
 Similar output can be obtained via the STRING_SPLIT with the help of row_number() ranking window function introduced with SQL Server 2005:
-- parsing delimited values (SQL Server 2016+)
SELECT LV.ProductNumber
, Max(CASE WHEN LV.Ind = 1 THEN LV.Value END) ItemId
, Max(CASE WHEN LV.Ind = 2 THEN LV.Value END) Configid
, Max(CASE WHEN LV.Ind = 3 THEN LV.Value END) InventSizeId
, Max(CASE WHEN LV.Ind = 4 THEN LV.Value END) InventColorId
, Max(CASE WHEN LV.Ind = 5 THEN LV.Value END) StyleId
FROM (
 SELECT DAT.ProductNumber
 , XT.VALUE
 , ROW_NUMBER() OVER (PARTITION BY DAT.ProductNumber ORDER BY DAT.ProductNumber) Ind
 FROM ( VALUES ('100001::S:black:')
  , ('100001::M:black:')
  , ('100001::L:black:')
  , ('100001::XL:black:')
  , ('100001::S:white:')
  , ('100001::M:white:')
  , ('100001::L:white:')
  , ('100001::XL:white:')
 ) DAT (ProductNumber)
 CROSS APPLY STRING_SPLIT(DAT.ProductNumber, ':') XT
) LV
GROUP BY LV.ProductNumber
ORDER BY LV.ProductNumber

As can be seen the introduction of an index into the dbo.SplitListWithIndex function simplified the code, making the use of a ranking window function unnecessary. It would be useful for the STRING_SPLIT to provide the same, as this time of processing is pretty common.  
Here’s another example based on the PIVOT clause introduced also in SQL 2005:


-- parsing delimited values (SQL Server 2016+)
SELECT P.ProductNumber
, Cast(Trim([1]) as nvarchar(20)) ItemId
, Cast(Trim([2]) as nvarchar(20)) ConfigId
, Cast(Trim([3]) as nvarchar(20)) InventSizeid
, Cast(Trim([4]) as nvarchar(20)) InventColorId
, Cast(Trim([5]) as nvarchar(20)) StyleId
FROM (
 SELECT DAT.ProductNumber
 , XT.VALUE
 , ROW_NUMBER() OVER (PARTITION BY DAT.ProductNumber ORDER BY DAT.ProductNumber) Ranking
 FROM ( VALUES ('100001::S:black:')
  , ('100001::M:black:')
  , ('100001::L:black:')
  , ('100001::XL:black:')
  , ('100001::S:white:')
  , ('100001::M:white:')
  , ('100001::L:white:')
  , ('100001::XL:white:')
 ) DAT (ProductNumber)
 CROSS APPLY STRING_SPLIT(DAT.ProductNumber, ':') XT
) DAT
PIVOT (MAX(DAT.[VALUE]) FOR DAT.Ranking IN ([1],[2],[3],[4],[5])) P

Happy coding!

18 October 2018

SQL Reloaded: String_Split Function

    Today, as I was playing with a data model – although simplistic, the simplicity kicked back when I had to deal with fields in which several values were encoded within the same column. The “challenge” resided in the fact that the respective attributes were quintessential in analyzing and matching the data with other datasets. Therefore, was needed a mix between flexibility and performance. It was the point where the String_Split did its magic. Introduced with SQL Server 2016 and available only under compatibility level 130 and above, the function splits a character expression using specified separator.
Here’s a simplified example of the code I had to write:

-- cleaning up
-- DROP TABLE dbo.ItemList

-- test data 
SELECT A.*
INTO dbo.ItemList 
FROM (
VALUES (1, '1001:a:blue')
, (2, '1001:b:white')
, (3, '1002:a:blue')
, (4, '1002:b:white')
, (5, '1002:c:red')
, (6, '1003:b:white')
, (7, '1003:c:red')) A(Id, List)

-- checking the data
SELECT *
FROM dbo.ItemList 

-- prepared data
SELECT ITM.Id 
, ITM.List 
, DAT.ItemId
, DAT.Size
, DAT.Country
FROM dbo.ItemList ITM
   LEFT JOIN (-- transformed data 
 SELECT DAT.id
 , [1] AS ItemId
 , [2] AS Size
 , [3] AS Country
 FROM(
  SELECT ITM.id
  , TX.Value
  , ROW_NUMBER() OVER (PARTITION BY ITM.id ORDER BY ITM.id) Ranking
  FROM dbo.ItemList ITM
  CROSS APPLY STRING_SPLIT(ITM.List, ':') TX
  ) DAT
 PIVOT (MAX(DAT.Value) FOR DAT.Ranking IN ([1],[2],[3])) 
 DAT
  ) DAT
   ON ITM.Id = DAT.Id 


And, here’s the output:

image_thumb[2]

   For those dealing with former versions of SQL Server the functionality provided by the String_Split can be implemented with the help of user-defined functions, either by using the old fashioned loops (see this), cursors (see this) or more modern common table expressions (see this). In fact, these implementations are similar to the String_Split function, the difference being made mainly by the performance.

Happy coding!

08 November 2008

SQL Reloaded: Dealing with data duplicates on SQL Server

Duplicates or duplications can be defined as "result rows that are identical to one or more other rows" [1]. They occur frequently in data collections (e.g. Excel, Access, etc.) with concurrent access that have minimum or no validation at all. More users and/or more frequent the updates on the same information, higher the risks of having duplicated data. Duplicates occur also in modern architectures when validation wasn’t addressed correspondingly or from usage related issues, for example, in information systems the huge amount of data entry makes it hard to overlook documents already processed (e.g. purchase orders, invoices, payments, etc.).

Subject to duplication are whole records, a group of attributes (fields) or only single attributes. I depends from case to case. Often duplicates are easy to identify - it’s enough to let somebody who has the proper knowledge to look over them. But what you do when the volume of data is too large or when is need to automate the process as much as possible? Using the DISTINCT keyword in a SELECT statement might do the trick, while other times it requires more complicated validation, ranging from simple checks to Data Mining techniques.

I will try to exemplify the techniques I use to deal with duplicates with the help of a simple example based on table that tracks information about Assets:

-- create test table
CREATE TABLE [dbo].[Assets](
 [ID] [int] NOT NULL,
 [CreationDate] smalldatetime NOT NULL,
 [Vendor] [varchar](50) NULL,
 [Asset] [varchar](50) NULL,
 [Model] [varchar](50) NULL,
 [Owner] [varchar](50) NULL,
 [Tag] [varchar](50) NULL,
 [Quantity] [decimal](13, 2) NULL
) ON [PRIMARY]

Here's some test data:

-- insert test data (SQL Server 2000+)
INSERT INTO dbo.Assets
VALUES ('1', DATEADD(d,-5, GetDate()), 'IBM','Laptop 1','Model 1','Owner 1','XX0001','1')
INSERT INTO dbo.Assets
VALUES  ('2', DATEADD(d,-4, GetDate()),'IBM','Laptop 2','Model 2','Owner 2','XX0002','1')
INSERT INTO dbo.Assets
VALUES  ('3', DATEADD(d,-3, GetDate()),'Microsoft','Laptop 3','Model 3','Owner 2','WX0001','1')
INSERT INTO dbo.Assets
VALUES ('4', DATEADD(d,-3, GetDate()),'Microsoft','Laptop 3','Model 3','Owner 2','WX0001','1')
INSERT INTO dbo.Assets
VALUES  ('5', DATEADD(d,-3, GetDate()),'Dell','Laptop 4','Model 4','Owner 3','DD0001','1')
INSERT INTO dbo.Assets
VALUES  ('6', DATEADD(d,-1, GetDate()),'Dell','Laptop 4','Model 4','Owner 4','DD0001','1')

    
Let’s check table’s content:

-- review the data
SELECT ID, CreationDate, Vendor, Asset, Model, Owner, Tag, Quantity
FROM dbo.Assets

Output:
ID CreationDate Vendor Asset Model Owner Tag Quantity
1 1/29/2024 10:46:00 PM IBM Laptop 1 Model 1 Owner 1 XX0001 1
2 1/30/2024 10:46:00 PM IBM Laptop 2 Model 2 Owner 2 XX0002 1
3 1/31/2024 10:46:00 PM Microsoft Laptop 3 Model 3 Owner 2 WX0001 1
4 1/31/2024 10:46:00 PM Microsoft Laptop 3 Model 3 Owner 2 WX0001 1
5 1/31/2024 10:46:00 PM Dell Laptop 4 Model 4 Owner 3 DD0001 1
6 2/2/2024 10:46:00 PM Dell Laptop 4 Model 4 Owner 4 DD0001 1

Normally, a Tag or a Serial Number should uniquely identify a product coming from the same Vendor, so in this case duplicates’ identification will resume to the records in which the Tag and Vendor combination appears more than once:

-- retrieve the duplicates
SELECT Vendor, Tag
FROM dbo.Assets A    
GROUP BY Vendor, Tag
HAVING COUNT(*)>1

Output:
Vendor Tag
Dell DD0001
Microsoft WX0001

In many cases showing only the duplicated values would be enough, however there are cases in which is required to see the whole record in which the duplicates appeared, for example to clarify the context in which the issues appear. A simple JOIN with the base table would provide the affected records:

-- retrieve duplicates with details
SELECT A.Id, A.CreationDate, A.Vendor, A.Asset, A.Model, A.Owner, A.Tag, A.Quantity
FROM dbo.Assets A
     JOIN (-- duplicates
	  SELECT Vendor, Tag
	  FROM dbo.Assets A    
	  GROUP BY Vendor, Tag
	  HAVING COUNT(*)>1
     ) B
      ON A.Vendor = B.Vendor 
     AND A.Tag = B.Tag

Output:
Id CreationDate Vendor Asset Model Owner Tag Quantity
5 1/31/2024 10:46:00 PM Dell Laptop 4 Model 4 Owner 3 DD0001 1
6 2/2/2024 10:46:00 PM Dell Laptop 4 Model 4 Owner 4 DD0001 1
3 1/31/2024 10:46:00 PM Microsoft Laptop 3 Model 3 Owner 2 WX0001 1
4 1/31/2024 10:46:00 PM Microsoft Laptop 3 Model 3 Owner 2 WX0001 1

There are two scenarios in which we need to remove the duplicates – in a result set or within an existing table.

In a result set normally it's enough to use the DISTINCT keyword to remove duplicated rows:

-- select unique records
SELECT DISTINCT CreationDate, Vendor, Asset, Model, Owner, Tag, Quantity
FROM dbo.Assets

Output:
CreationDate Vendor Asset Model Owner Tag Quantity
1/29/2024 10:46:00 PM IBM Laptop 1 Model 1 Owner 1 XX0001 1
1/30/2024 10:46:00 PM IBM Laptop 2 Model 2 Owner 2 XX0002 1
1/31/2024 10:46:00 PM Dell Laptop 4 Model 4 Owner 3 DD0001 1
1/31/2024 10:46:00 PM Microsoft Laptop 3 Model 3 Owner 2 WX0001 1
2/2/2024 10:46:00 PM Dell Laptop 4 Model 4 Owner 4 DD0001 1

In our example only some combinations are duplicated while the other attributes might slightly differ, and therefore is needed another approach. First of all we need to identify which one is the most reliable record, in some cases the latest records entry should be the most accurate or closer to reality, but that’s not necessarily the truth. There are also cases in which we don’t care which the record that is selected is, but from experience these cases are few.

Oracle and SQL Server introduced the dense_rank() analytic function, which returns the rank of rows within the partition of a result set, without any gaps in the ranking. In our case the partition is determined by Vendor and Tag, following to identify which the logic used for raking. Supposing that we are always interested in the last record entered, the query would look like this:

-- retrieve duplicates via ranking functions 
SELECT Id, CreationDate, Vendor, Asset, Model, Owner, Tag, Quantity
FROM (--subquery 
 SELECT Id, CreationDate, Vendor, Asset, Model, Owner, Tag, Quantity
 , dense_rank() OVER(PARTITION BY Vendor, Tag ORDER BY CreationDate DESC , Id DESC) RANKING 
 FROM dbo.Assets 
) A 
WHERE RANKING = 1

Output:
CreationDate Vendor Asset Model Owner Tag Quantity
1/29/2024 10:46:00 PM IBM Laptop 1 Model 1 Owner 1 XX0001 1
1/30/2024 10:46:00 PM IBM Laptop 2 Model 2 Owner 2 XX0002 1
1/31/2024 10:46:00 PM Dell Laptop 4 Model 4 Owner 3 DD0001 1
1/31/2024 10:46:00 PM Microsoft Laptop 3 Model 3 Owner 2 WX0001 1
2/2/2024 10:46:00 PM Dell Laptop 4 Model 4 Owner 4 DD0001 1

Things are simple when records’ creation date is available and it's unique for the key used to validate the data, because a descending order would allow selecting the last record first. An Id can be added to the clause for the cases in which multiple records have the same date. At least in this example the same output is obtained by using the row_number analytic function, which returns the sequential number of a row within a partition of a result set, starting at 1 for the first row in each partition. Actually, the dense_rank function assigns the same rank for the same occurrence of the values appearing in ORDER BY clause (within the same partition), and thus forces the developer to choose the exact attributes that make a field unique within a partition, while row_number() will return a sequence no matter of the attributes used in ORDER BY clause. 

Both functions should provide similar performance. Conversely, when the selection of records within a partition is not important, it’s better to use the row_number() function, which would need to do less processing.

Unfortunately, this technique doesn’t work in SQL Server 2000, where a different approach is needed. In most of the cases the unique identifier for a record is a sequential unique number, the highest id corresponding to the latest entered record. This would allow selecting the latest entered record, by using the Max function:

-- nonduplicated records (SQL server 2000+)
SELECT A.Id, A.CreationDate, A.Vendor, A.Asset, A.Model, A.Owner, A.Tag, A.Quantity
FROM dbo.Assets A
     JOIN ( -- last entry
	  SELECT Vendor, Tag, MAX(Id) MaxId
	  FROM dbo.Assets A    
	  GROUP BY Vendor, Tag
	 -- HAVING count(*)>1
     ) B
      ON A.Vendor = B.Vendor 
     AND A.Tag = B.Tag 
     AND A.ID = B.MaxId

Output:
Id CreationDate Vendor Asset Model Owner Tag Quantity
4 1/31/2024 10:46:00 PM Microsoft Laptop 3 Model 3 Owner 2 WX0001 1
2 1/30/2024 10:46:00 PM IBM Laptop 2 Model 2 Owner 2 XX0002 1
1 1/29/2024 10:46:00 PM IBM Laptop 1 Model 1 Owner 1 XX0001 1
6 2/2/2024 10:46:00 PM Dell Laptop 4 Model 4 Owner 4 DD0001 1

The same technique can be used to delete the duplicates from a table:

-- nonduplicated records (SQL server 2000+)
SELECT A.Id, A.CreationDate, A.Vendor, A.Asset, A.Model, A.Owner, A.Tag, A.Quantity
FROM dbo.Assets A
     JOIN ( -- last entry
	  SELECT Vendor, Tag, MAX(Id) MaxId
	  FROM dbo.Assets A    
	  GROUP BY Vendor, Tag
	 -- HAVING count(*)>1
     ) B
      ON A.Vendor = B.Vendor 
     AND A.Tag = B.Tag 
     AND A.ID = B.MaxId

When an Id is not available, then we have most probably to create a temporary table with a sequence (numeric unique identifier), sort the records based on the criteria we chose for selection, and then apply the technique based on Id as above.

Notes:
1. In other scenarios it’s important to select all the records matching extreme values (first, last), the dense_rank function becoming handy, however for versions that doesn’t supports it, a creation date attribute saves the day, when available, and it's unique:

-- nonduplicated records (SQL server 2000+)  
SELECT A.Id, A.CreationDate, A.Vendor, A.Asset, A.Model, A.Owner, A.Tag, A.Quantity
FROM dbo.Assets A
     JOIN (-- last entry
	  SELECT Vendor, Tag, MAX(CreationDate) LastCreationDate
	  FROM dbo.Assets A    
	  GROUP BY Vendor, Tag
	  -- HAVING count(*)>1
     ) B
      ON A.Vendor = B.Vendor 
     AND A.Tag = B.Tag 
     AND DateDiff(d, A.CreationDate, B.LastCreationDate)=0

Output:
Id CreationDate Vendor Asset Model Owner Tag Quantity
6 2/2/2024 10:46:00 PM Dell Laptop 4 Model 4 Owner 4 DD0001 1
1 1/29/2024 10:46:00 PM IBM Laptop 1 Model 1 Owner 1 XX0001 1
2 1/30/2024 10:46:00 PM IBM Laptop 2 Model 2 Owner 2 XX0002 1
3 1/31/2024 10:46:00 PM Microsoft Laptop 3 Model 3 Owner 2 WX0001 1
4 1/31/2024 10:46:00 PM Microsoft Laptop 3 Model 3 Owner 2 WX0001 1

2. Duplicated records can be created by merging two sets of data with UNION ALL, in such cases a simple UNION would remove the duplicates introduced by the queries coming after the first one however this will burden the database with more checks. It is recommended to find other solutions, when possible!

3. Instead of using a single multi-row insertion I used multiple insertion statements because I preferred to make the tutorial usable also on SQL Server 2000. Here’s the single multi-row insertion statement:

-- insert test data (SQL Server 2005+)
INSERT INTO dbo.Assets
VALUES ('1', DATEADD(d,-5, GetDate()), 'IBM','Laptop 1','Model 1','Owner 1','XX0001','1')
, ('2', DATEADD(d,-4, GetDate()),'IBM','Laptop 2','Model 2','Owner 2','XX0002','1')
, ('3', DATEADD(d,-3, GetDate()),'Microsoft','Laptop 3','Model 3','Owner 2','WX0001','1')
, ('4', DATEADD(d,-3, GetDate()),'Microsoft','Laptop 3','Model 3','Owner 2','WX0001','1')
, ('5', DATEADD(d,-3, GetDate()),'Dell','Laptop 4','Model 4','Owner 3','DD0001','1')
, ('6', DATEADD(d,-1, GetDate()),'Dell','Laptop 4','Model 4','Owner 4','DD0001','1')

4. The above techniques should work also in Oracle with two amendments, attributes’ type must be adapted to Oracle ones, while instead of SQL Server GetDate() function should be used the corresponding Oracle SYSDATE function, as below:
 
-- insert test data (Oracle)
INSERT INTO dbo.Assets<br />
VALUES ('1', SYSDATE-1, 'IBM','Laptop 1','Model 1','Owner 1','XX0001','1')

References:
[1] S Hull (1998) Business Objects Glossary
[2] Microsoft Learn (2023) dense_rank() (link)
[3] Microsoft Learn (2023) row_number() (link)
Related Posts Plugin for WordPress, Blogger...

About Me

My photo
IT Professional with more than 24 years experience in IT in the area of full life-cycle of Web/Desktop/Database Applications Development, Software Engineering, Consultancy, Data Management, Data Quality, Data Migrations, Reporting, ERP implementations & support, Team/Project/IT Management, etc.