19 November 2022

SQL Reloaded: Tricks with Strings via STRING_SPLIT, PATINDEX and TRANSLATE

Searching for a list of words within a column can be easily achieved by using the LIKE operator:

-- searching for several words via LIKE (SQL Server 2000+)
SELECT * 
FROM Production.Product 
WHERE Name LIKE '%chain%'
   OR Name LIKE '%lock%'
   OR Name LIKE '%rim%'
   OR Name LIKE '%spindle%'

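Because each pattern starts with a wildcard, an index defined on the column can't be used for a seek, so the query is answered via a scan, most likely a clustered index scan. For a small table like this one the search is still quite efficient.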

If the list of strings to search for grows, the query becomes harder to maintain. Regular expressions could be a solution. Unfortunately, SQL Server has its limitations in working with patterns. For example, it doesn't have a REGEXP_LIKE function, which in Oracle is used something like this (not tested):

-- Oracle 
SELECT * 
FROM Production.Product 
WHERE REGEXP_LIKE(lower(Name), 'chain|lock|rim|spindle')

However, there's a PATINDEX function which returns the position of a pattern within a string, and which uses the same wildcards that can be used with the LIKE operator:

-- searching for a value via PATINDEX (SQL Server 2000+)
SELECT * 
FROM [Production].[Product] 
WHERE PATINDEX('%rim%', Name)>0

Even though PATINDEX accepts only one pattern at a time, retrieving the values from a table or a table-valued function (TVF) does the trick. If the values need to be reused in several places, they can be stored in a table or view. If they are needed only once, a common table expression is more appropriate:

-- filtering for several words via PATINDEX (SQL Server 2008+)
WITH CTE 
AS (
  -- table from list of values (SQL Server 2008+)
  SELECT * 
  FROM (VALUES ('chain')
  , ('lock')
  , ('rim')
  , ('spindle')) DAT(words)
) 
SELECT * 
FROM Production.Product PRD
WHERE EXISTS (
	SELECT *
	FROM CTE
	WHERE PATINDEX('%'+ CTE.words +'%', PRD.Name)>0
	)

The query should return the same records as the first query above.

Besides one's own UDFs (see SplitListWithIndex or SplitList), starting with SQL Server 2016 the STRING_SPLIT function can be used to return the same values as a TVF:

-- filtering for several words via PATINDEX & STRING_SPLIT (SQL Server 2016+)
SELECT * 
FROM Production.Product PRD
WHERE EXISTS (
	SELECT *
	FROM STRING_SPLIT('chain|lock|rim|spindle', '|') SPL
	WHERE PATINDEX('%'+ SPL.value +'%', PRD.Name)>0
	)
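
To try the technique without the AdventureWorks database, the filter can be sketched against a few hard-coded values. The product names and the @words variable below are made up for illustration and are not taken from the database:

```sql
-- self-contained sketch: filtering hard-coded names via PATINDEX & STRING_SPLIT (SQL Server 2016+)
-- (the sample names and the @words variable are illustrative assumptions)
DECLARE @words nvarchar(100) = N'chain|lock|rim|spindle';

SELECT PRD.Name
FROM (VALUES (N'Chain Stays')
	, (N'Lock Nut 5')
	, (N'HL Road Rim')
	, (N'Blade')
	) PRD(Name)
WHERE EXISTS (
	SELECT *
	FROM STRING_SPLIT(@words, N'|') SPL
	WHERE PATINDEX('%'+ SPL.value +'%', PRD.Name)>0
	);
```

Keep in mind that the patterns match substrings (a word like 'rim' is also found inside 'Trim'), and that whether the match is case-sensitive depends on the collation in use.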

A dynamic list of values can be built as well. For example, the list of words can be obtained from a table and the STRING_SPLIT function:

-- listing the words appearing in a column (SQL Server 2016+)
SELECT DISTINCT SPL.value
FROM Production.Product PRD
     CROSS APPLY STRING_SPLIT(Name, ' ') SPL
ORDER BY SPL.value

One can remove the special characters, the numeric values, and the 1- and 2-letter words:

-- listing the words appearing in a column, without special characters and numbers (SQL Server 2016+)
SELECT DISTINCT SPL.value
FROM Production.Product PRD
     CROSS APPLY STRING_SPLIT(Replace(Replace(Replace(Replace(Name, '-', ' '), ',', ' '), '/', ' '), '''', ' '), ' ') SPL
WHERE IsNumeric(SPL.value) = 0 -- removing numbers
  AND Len(SPL.value)>2 -- removing single/double letters
ORDER BY SPL.value

The output looks better, though the more complex the text, the more replacements need to be made. An alternative to a UDF (see ReplaceSpecialChars) is the TRANSLATE function, which replaces each character from a list with the character found at the same position in a second list. One needs to be careful to have a 1:1 mapping, as the two lists must have the same length, otherwise an error is raised. The REPLICATE function does the trick:

-- replacing special characters via TRANSLATE (SQL Server 2017+)
SELECT TRANSLATE(Name, '-,/''', Replicate(' ', 4))
FROM Production.Product PRD
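
To avoid counting the characters by hand, the length of the replacement list can be derived from the list of special characters. The following is a minimal sketch on a hard-coded value; the @chars variable and the sample name are assumptions made for illustration:

```sql
-- sketch: deriving the replacement list from the list of special characters (SQL Server 2017+)
-- (the @chars variable and the sample name are illustrative assumptions)
DECLARE @chars varchar(10) = '-,/''';

SELECT TRANSLATE('HL Road Frame - Black, 58', @chars, Replicate(' ', Len(@chars))) CleanedName;
```

Len ignores trailing spaces, so this works only as long as the list of special characters doesn't end with a space. Each replaced character leaves a space behind, therefore consecutive spaces result in empty values when the text is split afterwards.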

Now the query becomes:

-- listing the words appearing in a column using TRANSLATE (SQL Server 2017+)
SELECT DISTINCT SPL.value
FROM Production.Product PRD
     CROSS APPLY STRING_SPLIT(TRANSLATE(Name, '-,/''', Replicate(' ', 4)), ' ') SPL
WHERE IsNumeric(SPL.value) = 0 -- removing numbers
  AND Len(SPL.value)>2 -- removing single/double letters
ORDER BY SPL.value

Happy coding!

05 November 2022

SQL Reloaded: STRING_AGG and STRING_SPLIT at Work, and a Bit of Pivoting

Working with strings across records was for a long time a nightmare for SQL developers, until Microsoft introduced STRING_SPLIT in SQL Server 2016 and STRING_AGG in SQL Server 2017. Previously, one was forced to write procedural code or use workarounds until SQL Server 2005, when recursive CTEs (common table expressions), ranking functions and PIVOT were introduced, which allowed handling many scenarios. 

Microsoft provides several examples for the usage of the STRING_SPLIT and STRING_AGG functions based on the AdventureWorks database, though let's look at another example based on the same database. 

Let's say we want to show the concatenated Contacts for a store, a result which can now easily be obtained via STRING_AGG:

-- concatenating names per store via STRING_AGG (SQL Server 2017+)
SELECT BusinessEntityID
, STRING_AGG(Concat(FirstName, ' ', LastName), ';') Contacts
FROM Sales.vStoreWithContacts
GROUP BY BusinessEntityID
HAVING count(*)>1

Observe that a GROUP BY is needed to show one record per Store. Unfortunately, STRING_AGG can't yet be used as a window function. 
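
The order of the concatenated values is not guaranteed unless it is requested explicitly via the WITHIN GROUP clause. A minimal sketch on hard-coded values, where the store identifiers and the contact names are assumptions made for illustration:

```sql
-- sketch: ordering the concatenated values via WITHIN GROUP (SQL Server 2017+)
-- (the sample rows are illustrative assumptions)
SELECT DAT.StoreId
, STRING_AGG(DAT.Contact, ';') WITHIN GROUP (ORDER BY DAT.Contact) Contacts
FROM (VALUES (1, 'Pilar Ackerman')
	, (1, 'Gustavo Achong')
	, (2, 'Catherine Abel')
	) DAT(StoreId, Contact)
GROUP BY DAT.StoreId;
```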

The inverse operation can be performed with the help of the STRING_SPLIT table-valued function (TVF). (If you wonder why a TVF is needed, it is because the initial record needs to be multiplied by the generated output.)

-- reversing the concatenation (SQL Server 2022+, as the third argument enabling the ordinal is needed)
WITH CTE
AS (
	-- concatenating names per store
	SELECT BusinessEntityID
	, STRING_AGG(Concat(FirstName, ' ', LastName), ';') Contacts
	FROM Sales.vStoreWithContacts
	GROUP BY BusinessEntityID
	HAVING count(*)>1
) 
SELECT CTE.BusinessEntityID
, DAT.Value
, DAT.Ordinal 
FROM CTE
    CROSS APPLY STRING_SPLIT(Contacts, ';', 1) DAT
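
As a minimal sketch of what the ordinal is good for, a list can be split and then put back together in the original order. The @contacts variable and its content are assumptions made for illustration, and the third argument of STRING_SPLIT requires SQL Server 2022 or Azure SQL:

```sql
-- sketch: splitting a list and concatenating it back in the original order (SQL Server 2022+)
-- (the @contacts variable and its values are illustrative assumptions)
DECLARE @contacts nvarchar(200) = N'Gustavo Achong;Pilar Ackerman;Catherine Abel';

SELECT STRING_AGG(SPL.value, ';') WITHIN GROUP (ORDER BY SPL.ordinal) Contacts
FROM STRING_SPLIT(@contacts, N';', 1) SPL;
```

Without the ordinal, STRING_SPLIT gives no guarantee about the order in which the values are returned.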

STRING_SPLIT also provides an ordinal field when the third argument is set to 1, which in theory can be used for pivoting the values, though we'd then be back where we started. Instead of using the query just generated, let's exemplify an alternative solution for concatenating strings across records, based on PIVOT, which is available since SQL Server 2005:
 
-- concatenating names per store via PIVOT (SQL Server 2012+)
SELECT BusinessEntityID
, [1] Contact1
, [2] Contact2
, [3] Contact3
, [4] Contact4
, Concat([1], IsNull(';' + [2], ''), IsNull(';' + [3], ''), IsNull(';' + [4], '')) Contacts
FROM (
	-- concatenating names and adding a rank
	SELECT BusinessEntityID
	, Concat(FirstName, ' ', LastName) Contact
	, ROW_NUMBER() OVER(PARTITION BY BusinessEntityID ORDER BY FirstName) Ranking
	FROM Sales.vStoreWithContacts
) PER
PIVOT (
    Max(Contact)
	FOR Ranking IN ([1], [2], [3], [4])
) AS DAT

To port the code to SQL Server 2005, the Concat calls need to be rewritten though, as the function is available only starting with SQL Server 2012. 
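
One possible rewrite replaces Concat with the + operator, wrapping the nullable parts in IsNull. The following sketch runs on hard-coded values instead of the view, and the store identifiers and contact names are assumptions made for illustration:

```sql
-- sketch: Concat-free variant of the PIVOT query (intended for SQL Server 2005+)
-- (the sample rows are illustrative assumptions; UNION ALL is used because
-- the VALUES table constructor is available only starting with SQL Server 2008)
SELECT StoreId
, [1] + IsNull(';' + [2], '') + IsNull(';' + [3], '') Contacts
FROM (
	-- adding a rank per store
	SELECT StoreId
	, Contact
	, ROW_NUMBER() OVER(PARTITION BY StoreId ORDER BY Contact) Ranking
	FROM (
		SELECT 1 StoreId, 'Pilar Ackerman' Contact
		UNION ALL SELECT 1, 'Gustavo Achong'
		UNION ALL SELECT 2, 'Catherine Abel'
	) SRC
) PER
PIVOT (
	Max(Contact)
	FOR Ranking IN ([1], [2], [3])
) AS DAT;
```

Unlike Concat, which treats NULL values as empty strings, the + operator returns NULL under the default settings if any of the operands is NULL, hence the IsNull calls.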

Talking about workarounds for splitting strings, in certain scenarios I used a combination of CutLeft & CutRight functions, which proved to be useful in data migrations, or my own version of STRING_SPLIT (see SplitListWithIndex or SplitList). For concatenations I used mainly CTEs (see example) or, in exceptional cases, cursors (see example).