Many of the programming languages I worked with provide a function for extracting a substring from a string - Mid (VB/VBScript), Substring in C# Substr in Oracle, etc. SQL Server is actually providing three functions for this purpose: Substring, Left, respectively Right. The Substring function extracts a substring of a given length starting from a given position. The Left and Right functions return, as their name suggests, the left part, respectively the right part of a character string with the specified number of characters. The use of Left and Right functions seems to be redundant, as they are a particular case of Substring, however they can simplify sometimes the code, as can be seen from below example based on the above IBAN example.
-- parsing a VAT Number DECLARE @VAT varchar(50) SET @VAT = 'CH9300762011623852957' SELECT @VAT VATNumber , LEFT(@VAT, 2) CountryCode1 , SUBSTRING(@VAT, 1, 2) CountryCode2 , SUBSTRING(@VAT, 3, 2) CheckDigits , RIGHT(@VAT, Len(@VAT)-4) BBAN1 , SUBSTRING(@VAT, 5, Len(@VAT)-4) BBAN2 , CASE WHEN LEN(@str)<@length THEN Replicate(@padchar, @length-LEN(@str)) + @str ELSE @str END LeftPadding , CASE WHEN LEN(@str)<@length THEN @str + Replicate(@padchar, @length-LEN(@str)) ELSE @str END RightPadding
Even if the IBAN has a variable-length component (the BBAN) given the fact that the other two components are fixed, this allows us to clearly extract each component. The example shows also the equivalent call of Substring function for Left (Country Code extraction), respectively Right (BBAN extraction).
What happens if there are more than one variable-length components? For such scenarios it’s useful to introduce a delimiter, it could be a comma, a dash, space, etc. It’s the case of a First and Last Name stored in the same attribute. Normally only one component qualifies as Last Name, and for simplicity let’s consider it as being stored first and space as delimiter. In order to identify the components, it’s enough to identify the first occurrence of the delimiter by using the CharIndex function.
-- parsing a Person's Name DECLARE @Name varchar(50) SET @Name = 'Stone Sharon' SELECT @Name Name , LEFT(@Name, CHARINDEX(' ', @Name)) LastName , RIGHT(@Name, LEN(@Name) - CHARINDEX(' ', @Name)) FirstName
Name | LastName | FirstName |
Stone Sharon | Stone | Sharon |
The code for cutting the left, respectively right part of a string is pretty simple and over the years I used it quite often, so it makes sense to encapsulate it in a function., like I did in an older post.
When a delimiter is used repeatedly in a string, normally we need to identify each component in the string. Sometimes the number of components can be given, other times not. For this purpose can be used a common table expression, and here is another example in which the space is used as delimiter – extracting the words from a given sentence.
-- extracting the words from a sentence DECLARE @str nvarchar(100) SET @str = 'This is just a test' ;WITH CTE (PrevString, Position, Word) AS ( SELECT LTrim(RTrim( CASE WHEN CharIndex(' ', @str)>0 THEN Right(@str, Len(@str)-CharIndex(' ', @str)) ELSE '' END)) PrevString , 1 Position , LTrim(RTrim(CASE WHEN CharIndex(' ', @str)>0 THEN LEFT(@str, CharIndex(' ', @str)) ELSE @str END)) Word UNION ALL SELECT LTrim(RTrim(CASE WHEN CharIndex(' ', PrevString)>0 THEN Right(PrevString, Len(PrevString)-CharIndex(' ', PrevString)) ELSE '' END)) PrevString , Position + 1 Position , LTrim(RTrim(CASE WHEN CharIndex(' ', PrevString)>0 THEN LEFT(PrevString, CharIndex(' ', PrevString)) ELSE PrevString END)) Word FROM CTE WHERE Len(PrevString)>0 ) SELECT PrevString , Word , Position FROM CTE OPTION (maxrecursion 100)
PrevString | Word | Position |
is just a test | This | 1 |
just a test | is | 2 |
a test | just | 3 |
test | a | 4 |
test | 5 |
The logic works for a sentence, and if we ignore the fact that some punctuation signs are appearing at the end of the words, it might work as well for a whole phrase, considering that the punctuation signs can be replaced from the end result. It would be useful for example to generalize the logic for a set of delimiters, in this case the other punctuation signs (e.g. “,”, “;”, “!”, etc.), however this would mean to identify which of the delimiters is used first or to apply the the same logic for the first delimiter, then for the second and so on. In addition, if the number of encoded elements within a value remain the same, a pivot can be applied on the final result and have thus all values’ elements within the same row.
Happy Coding!
No comments:
Post a Comment