24 February 2011

Pulling the Strings of SQL Server – Part VI: Subparts of a String

    No matter how normalized a database is, there will always be a need to encode multiple semantic entities in a single string and, consequently, to extract them later. Examples include the Street Name and Number, the Country Phone Prefix and Phone Number, or the First and Last Name. Another frequent scenario is the misuse of long string data types to store multiple delimited attributes, or simply a whole text such as a Comment or Description. The easiest scenarios to deal with are the ones in which you know the rules behind the encoding, preferably a fixed-length encoding or a single known delimiter. Examples of fixed-length encodings are the IBAN, the country-specific VAT Number or any other artificially constructed or standardized encoding. According to Wikipedia, the International Bank Account Number (IBAN) consists of an ISO 3166-1 alpha-2 country code, followed by two check digits calculated using a mod-97 technique, and the Basic Bank Account Number (BBAN) of up to thirty alphanumeric characters. Taking the IBAN example provided for Switzerland, it could be stored as “CH93 0076 2011 6238 5295 7” or “CH9300762011623852957”. In either case, when the country is not provided explicitly, it would be useful to extract it from the IBAN together with the BBAN. How would we do that?

    Many of the programming languages I have worked with provide a function for extracting a substring from a string: Mid (VB/VBScript), Substring (C#), Substr (Oracle), etc. SQL Server actually provides three functions for this purpose: Substring, Left and Right. The Substring function extracts a substring of a given length starting from a given position (positions are counted from 1). The Left and Right functions return, as their names suggest, the specified number of characters from the left, respectively the right end of a character string. Left and Right may seem redundant, as they are particular cases of Substring, however they can sometimes simplify the code, as can be seen in the example below, based on the IBAN discussed above.
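-- parsing an IBAN
DECLARE @IBAN varchar(50)
SET @IBAN = 'CH9300762011623852957'

SELECT @IBAN IBAN
, LEFT(@IBAN, 2) CountryCode1 -- first two characters
, SUBSTRING(@IBAN, 1, 2) CountryCode2 -- same result via Substring
, SUBSTRING(@IBAN, 3, 2) CheckDigits -- characters 3 and 4
, RIGHT(@IBAN, LEN(@IBAN) - 4) BBAN1 -- everything after the first four characters
, SUBSTRING(@IBAN, 5, LEN(@IBAN) - 4) BBAN2 -- same result via Substring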

Substrings Example 1

    Even though the IBAN has a variable-length component (the BBAN), the other two components have a fixed length, which allows us to extract each component unambiguously. The example also shows the Substring calls equivalent to Left (Country Code extraction) and Right (BBAN extraction).
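
    The fixed positions hold only for the compact form of the IBAN. For the spaced form mentioned earlier, one possible approach (a minimal sketch, not the only way to do it) is to strip the grouping spaces with the Replace function before applying the same extraction logic:

```sql
-- normalizing an IBAN stored with grouping spaces before parsing it
DECLARE @IBAN varchar(50)
SET @IBAN = 'CH93 0076 2011 6238 5295 7'

-- remove the spaces so that the fixed positions hold again
SET @IBAN = REPLACE(@IBAN, ' ', '')

SELECT @IBAN IBAN
, LEFT(@IBAN, 2) CountryCode
, SUBSTRING(@IBAN, 3, 2) CheckDigits
, SUBSTRING(@IBAN, 5, LEN(@IBAN) - 4) BBAN
```

    After the normalization both storage forms can be handled by the same query.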

    What happens if there is more than one variable-length component? For such scenarios it’s useful to introduce a delimiter, which could be a comma, a dash, a space, etc. This is the case of a First and Last Name stored in the same attribute. Normally only one component qualifies as Last Name, and for simplicity let’s consider it as being stored first, with space as delimiter. In order to separate the components, it’s enough to identify the first occurrence of the delimiter by using the CharIndex function.
-- parsing a Person's Name
DECLARE @Name varchar(50)
SET @Name = 'Stone Sharon'

SELECT @Name Name
, RTRIM(LEFT(@Name, CHARINDEX(' ', @Name))) LastName -- RTrim drops the delimiter kept by Left
, RIGHT(@Name, LEN(@Name) - CHARINDEX(' ', @Name)) FirstName -- everything after the first space

Substrings Example 2
    The code for cutting the left, respectively the right part of a string is pretty simple, and over the years I have used it quite often, so it makes sense to encapsulate it in a function, like I did in an older post.
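
    As an illustration of such an encapsulation, here is a minimal sketch. The names dbo.CutLeft and dbo.CutRight are hypothetical and are not necessarily the ones used in the older post. It is also an assumption of this sketch that the delimiter is a single character, and that a missing delimiter means the whole string belongs to the left part.

```sql
-- hypothetical helper: returns the part of @str before the first @delimiter
CREATE FUNCTION dbo.CutLeft(
  @str nvarchar(max)
, @delimiter nchar(1))
RETURNS nvarchar(max)
AS
BEGIN
    DECLARE @pos int
    SET @pos = CHARINDEX(@delimiter, @str)

    -- no delimiter found: the whole string is considered the left part
    RETURN CASE WHEN @pos > 0 THEN LEFT(@str, @pos - 1) ELSE @str END
END
GO

-- hypothetical helper: returns the part of @str after the first @delimiter
CREATE FUNCTION dbo.CutRight(
  @str nvarchar(max)
, @delimiter nchar(1))
RETURNS nvarchar(max)
AS
BEGIN
    DECLARE @pos int
    SET @pos = CHARINDEX(@delimiter, @str)

    -- no delimiter found: there is no right part
    RETURN CASE WHEN @pos > 0 THEN SUBSTRING(@str, @pos + 1, LEN(@str)) ELSE '' END
END
GO

-- usage
SELECT dbo.CutLeft('Stone Sharon', ' ') LastName
, dbo.CutRight('Stone Sharon', ' ') FirstName
```

    GO is the batch separator used by tools like Management Studio or sqlcmd; it is needed here because each CREATE FUNCTION statement has to be the first statement in its batch.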

    When a delimiter is used repeatedly in a string, we normally need to identify each component of the string. Sometimes the number of components is known in advance, other times not. A recursive common table expression can be used for this purpose; here is another example, in which space is used as delimiter, that extracts the words from a given sentence.
-- extracting the words from a sentence
DECLARE @str nvarchar(100)
SET @str = 'This is just a test'

;WITH CTE (PrevString, Position, Word)
AS
(
    -- anchor member: cut the first word, keep the remainder in PrevString
    SELECT LTrim(RTrim(CASE
            WHEN CharIndex(' ', @str)>0 THEN Right(@str, Len(@str)-CharIndex(' ', @str))
            ELSE ''
        END)) PrevString
    , 1 Position
    , LTrim(RTrim(CASE
            WHEN CharIndex(' ', @str)>0 THEN Left(@str, CharIndex(' ', @str))
            ELSE @str
        END)) Word
    UNION ALL
    -- recursive member: cut the next word from the remainder of the previous step
    SELECT LTrim(RTrim(CASE
            WHEN CharIndex(' ', PrevString)>0 THEN Right(PrevString, Len(PrevString)-CharIndex(' ', PrevString))
            ELSE ''
        END)) PrevString
    , Position + 1 Position
    , LTrim(RTrim(CASE
            WHEN CharIndex(' ', PrevString)>0 THEN Left(PrevString, CharIndex(' ', PrevString))
            ELSE PrevString
        END)) Word
    FROM CTE
    WHERE Len(PrevString)>0 -- stop when nothing is left to split
)
SELECT PrevString
, Word
, Position
FROM CTE
OPTION (maxrecursion 100)

Substrings Example 3
    The logic works for a sentence and, if we ignore the fact that some punctuation signs appear at the end of the words, it might work as well for a whole phrase, considering that the punctuation signs can be removed from the end result. It would be useful, for example, to generalize the logic for a set of delimiters, in this case the other punctuation signs (e.g. “,”, “;”, “!”, etc.), however this would mean either identifying which of the delimiters occurs first, or applying the same logic for the first delimiter, then for the second, and so on. In addition, if the number of encoded elements within a value remains the same, a pivot can be applied on the final result in order to have all of a value’s elements within the same row.
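
    For identifying which delimiter from a set occurs first, one option is the PatIndex function, which accepts a character class in its pattern. The following is only a sketch of the first cutting step, under the assumption that the delimiters are the space, comma, semicolon and exclamation mark; the sample sentence is made up for illustration:

```sql
-- locating the first delimiter out of a set of delimiters
DECLARE @str nvarchar(100)
SET @str = 'This is, just; a test!'

DECLARE @pos int
-- position of the first character that is a space, comma, semicolon or exclamation mark
SET @pos = PATINDEX('%[ ,;!]%', @str)

SELECT @pos FirstDelimiterPosition
, CASE WHEN @pos > 0 THEN LEFT(@str, @pos - 1) ELSE @str END Word
, CASE WHEN @pos > 0 THEN SUBSTRING(@str, @pos + 1, LEN(@str)) ELSE '' END Remainder
```

    The same expressions could replace the CharIndex calls in the recursive query above. Consecutive delimiters, such as a comma followed by a space, would then produce empty words that need to be filtered out from the end result.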
