Searching for a list of words within a column can be easily achieved by using the LIKE operator:
-- searching for several words via LIKE (SQL Server 2000+) SELECT * FROM Production.Product WHERE Name LIKE '%chain%' OR Name LIKE '%lock%' OR Name LIKE '%rim%' OR Name LIKE '%spindle%'
The search is quite efficient, if on the column is defined an index, a clustered index scan being more likely chosen.
If the list of strings to search upon becomes bigger, the query becomes at least more difficult to maintain. Using regular expressions could be a solution. Unfortunately, SQL Server has its limitations in working with patterns. For example, it doesn't have a REGEXP_LIKE function, which is used something like (not tested):
-- Oracle SELECT * FROM Production.Product WHERE REGEXP_LIKE(lower(Name), 'chain|lock|rim|spindle')
However, there's a PATINDEX function which returns the position of a pattern within a string, and which uses the same wildcards that can be used with the LIKE operator:
-- searching for a value via PATINDEX (SQL Server 2000+) SELECT * FROM [Production].[Product] WHERE PATINDEX('%rim%', Name)>0
Even if together with the Name can be provided only one of the values, retrieving the values from a table or a table-valued function (TVF) would do the trick. If the values need to be reused in several places, they can be stored in a table or view. If needed only once, a common table expression is more indicated:
-- filtering for several words via PATHINDEX (SQL Server 2008+) WITH CTE
AS (
-- table from list of values (SQL Server 2008+)
SELECT *
FROM (VALUES ('chain')
, ('lock')
, ('rim')
, ('spindle')) DAT(words)
)
SELECT *
FROM Production.Product PRD
WHERE EXISTS (
SELECT *
FROM CTE
WHERE PATINDEX('%'+ CTE.words +'%', PRD.Name)>0
)
The query should return the same records as above in the first query!
Besides own's UDFs (see SplitListWithIndex or SplitList), starting with SQL Server 2017 can be used the STRING_SPLIT function to return the same values as a TVF:
-- filtering for several words via PATHINDEX & STRING_SPLIT (SQL Server 2017+) SELECT * FROM Production.Product PRD WHERE EXISTS ( SELECT * FROM STRING_SPLIT('chain|lock|rim|spindle', '|') SPL WHERE PATINDEX('%'+ SPL.value +'%', PRD.Name)>0 )
A dynamic list of values can be built as well. For example, the list of words can be obtained from a table and the STRING_SPLIT function:
-- listing the words appearing in a column (SQL Server 2017+) SELECT DISTINCT SPL.value FROM Production.Product PRD CROSS APPLY STRING_SPLIT(Name, ' ') SPL ORDER BY SPL.value
One can remove the special characters, the numeric values, respectively the 1- and 2-letters words:
-- listing the words appearing in a column (SQL Server 2017+) SELECT DISTINCT SPL.value FROM Production.Product PRD CROSS APPLY STRING_SPLIT(Replace(Replace(Replace(Replace(Name, '-', ' '), ',', ' '), '/', ' '), '''', ' '), ' ') SPL WHERE IsNumeric(SPL.value) = 0 -- removing numbers AND Len(SPL.value)>2 -- removing single/double letters ORDER BY SPL.value
The output looks better, though the more complex the text, the more replacements need to be made. An alternative to a UDF (see ReplaceSpecialChars) is the TRANSLATE function, which replaces a list of characters with another. One needs to be careful and have a 1:1 mapping, the REPLICATE function doing the trick:
-- replacing special characters via TRANSLATE (SQL Server 2017+) SELECT TRANSLATE(Name, '-,/''', Replicate(' ', 4)) FROM Production.Product PRD
Now the query becomes:
-- listing the words appearing in a column using TRANSLATE (SQL Server 2017+) SELECT DISTINCT SPL.value FROM Production.Product PRD CROSS APPLY STRING_SPLIT(TRANSLATE(Name, '-,/''', Replicate(' ', 4)), ' ') SPL WHERE IsNumeric(SPL.value) = 0 -- removing numbers AND Len(SPL.value)>2 -- removing single/double letters ORDER BY SPL.value
Notes:
1) The SQL Server-based queries work also in a SQL databases in Microsoft Fabric. Just replace the Production with SalesLT schema (see post, respectively GitHub repository with the changed code).
Happy coding!
No comments:
Post a Comment