08 December 2010

Pulling the Strings of SQL Server–Part IX: Special Characters

    Under special characters denomination are categorized typically the characters that don’t belong to the alphabet of a given language, typically English (a-z, A-Z), or the numeric digits (0-9). Characters like umlauts (e.g. ä, ë, ö, ü, ÿ), accents (e.g. é, í, ó, ú) and other type of characters used in specific languages like German, French, Turkish, Hungarian, Romanian, Dutch or Swedish, together with punctuation signs or graphical characters are falling under the designation of special characters. It’s true that SQL Server, like many other databases, supports Unicode, a standard designed to encode such characters, though not all database designs are taking Unicodes into account. Preponderantly the old database and other software solutions use the non-unicode string data types (char, varchar, text) fact that makes the special characters to be displayed inadequately, sometimes undecipherable. In order to avoid this behavior could be decided to port the data types to unicode or use only the standard characters of English language, both solutions with their positive and negative aspects. In such cases, especially during migration project or ETL tasks, eventually as part of a Data Quality initiative, it’s preferred to identify and replace the special characters in a manual, automatic or semi-automatic fashion. In addition, there are also cases in which, from various reasons, some attributes are not allowed or should not include special characters, and also this aspect could be included in a Data Quality initiative.

    During assessment of Data Quality, in an organized or less organized manner, a first step resides in understanding the quality of the data. In the current case this resumes primarily in identifying how many records contain special characters, how could be the data cleaned, and the “cost” for this activity. Actually, before going this far, must be defined the character sets included in special characters, the definition could vary from case to case. For example in some cases could be considered also the space or the important punctuation signs as valid characters, while in others they may not be allowed. There could be identified thus multiple scenarios, though I found out that the range of characters a-z, A-Z, 0-9 and the space are considered as valid character in most of the cases. For this purpose could be built a simple function that iterates through all the characters of a string and identifies if there is any character not belonging to the before mentioned range of valid characters. In order to address this, a few years back I built a function similar with the below one:

-- checks if a string has special characters 
CREATE FUNCTION dbo.HasSpecialCharacters( 
@string nvarchar(1000)) 
RETURNS int 
AS  
    BEGIN 
         DECLARE @retval int 
         DECLARE @index int 
         DECLARE @char nchar(1)  

         SET @retval = 0 
         SET @index = 1  

         WHILE (@index <= IsNull(len(@string), 0) AND @retval=0) 
        BEGIN  
           SET @char = Substring(@string, @index, @index+1) 
           IF NOT (ASCII(@char) BETWEEN 48 AND 57 -- numeric value 
             OR ASCII(@char) BETWEEN 65 AND 90 -- capital letters 
            OR ASCII(@char) BETWEEN 97 AND 122 -- small letters 
            OR ASCII(@char) = 32) --space 
           BEGIN 
                SET @retval = @index 
           END 
           ELSE 
               SET @index = @index + 1  
     END  

    RETURN (@retval) 
END 

    Function’s logic is based on the observation that the ASCII of numeric values could be found in the integer interval between 48 and 57, the capital letters between 65 and 90, while the small letters between 97 and 122. By adding the ASCII for space and eventually several other characters, the check on whether an character is valid resuming thus to only 4 constraints. Here’s the function at work:  

-- testing HasSpecialCharacters function 
SELECT dbo.HasSpecialCharacters('kj324h5kjkj3245k2j3hkj342jj4') Example1 
, dbo.HasSpecialCharacters('Qualität') Example2 
, dbo.HasSpecialCharacters('Änderung') Example3 
, dbo.HasSpecialCharacters('') Example4 
, dbo.HasSpecialCharacters(NULL) Example5 
, dbo.HasSpecialCharacters('Ä') Example6 
, dbo.HasSpecialCharacters('ä') Example7 
, dbo.HasSpecialCharacters('a') Example8 
 
special characters 1

    As can be seen, the function returns the position where a special character is found, fact that enables users to identify the character that causes the problem. A similar function could be built also in order to count the number of special characters found in a string, the change residing in performing a counter rather then returning the position at the first occurrence of a special character.

    The function might not be perfect though it solves the problem. There are also other alternatives, for example of storing the special characters in a table and performing a simple join against the target table. Another solution could be based on the use RegEx functionality, either by using OLE automation or directly CLR functionality. There could be done variations on the above solution too by limiting to check on whether the characters of a string are falling in the range projected by the ASCII function. That’s what the following function does:  

-- checks if a string has special characters falling in an interval  
CREATE FUNCTION dbo.HasCharNotInASCIIRange( 
@string nvarchar(1000) 
, @start int 
, @end int) 
RETURNS int 
AS  
BEGIN      
    DECLARE @retval int 
    DECLARE @index int 
    DECLARE @char nchar(1) 
 
    SET @retval = 0 
    SET @index = 1 
    WHILE (@index <= IsNull(len(@string), 0) AND @retval=0) 
    BEGIN  
         SET @char = Substring(@string, @index, @index+1) 
         IF NOT (ASCII(@char) BETWEEN @start AND @end)  
        BEGIN 
              SET @retval = @index 
        END 
        ELSE 
             SET @index = @index + 1  
     END 
     RETURN (@retval) 
END 

      
   With this function are necessary 4 calls in order to identify if a string contains special characters, though we loose the flexibility of identifying the first character that is invalid. We could still identify the first occurrence by taking the minimum value returned by the 4 calls, however, unlike Oracle (see Least function), SQL Server doesn’t have such a function, so we’ll have eventually to built it. Anyway, here’s the above function at work:  

-- testing HasCharNotInASCIIRange function 
SELECT dbo.HasCharNotInASCIIRange('k12345', 48, 57) Example1 
, dbo.HasCharNotInASCIIRange('12k345', 48, 57) Example2 
, dbo.HasCharNotInASCIIRange('12345', 48, 57) Example3 
, dbo.HasCharNotInASCIIRange(' 12345', 48, 57) Example4 
, dbo.HasCharNotInASCIIRange('12345 ', 48, 57) Example5 
, dbo.HasCharNotInASCIIRange('', 48, 57) Example6 
, dbo.HasCharNotInASCIIRange(NULL, 48, 57) Example7 
, dbo.HasCharNotInASCIIRange('', 48, 57) Example8 
, dbo.HasCharNotInASCIIRange('a', 48, 57) Example9 
, dbo.HasCharNotInASCIIRange('Ä', 48, 57) Example10 
, dbo.HasCharNotInASCIIRange('ä', 32, 32) Example11 

special characters 2

No comments:

Related Posts Plugin for WordPress, Blogger...