During assessment of Data Quality, in an organized or less organized manner, a first step resides in understanding the quality of the data. In the current case this resumes primarily in identifying how many records contain special characters, how could be the data cleaned, and the “cost” for this activity. Actually, before going this far, must be defined the character sets included in special characters, the definition could vary from case to case. For example in some cases could be considered also the space or the important punctuation signs as valid characters, while in others they may not be allowed. There could be identified thus multiple scenarios, though I found out that the range of characters a-z, A-Z, 0-9 and the space are considered as valid character in most of the cases. For this purpose could be built a simple function that iterates through all the characters of a string and identifies if there is any character not belonging to the before mentioned range of valid characters. In order to address this, a few years back I built a function similar with the below one:
-- checks if a string has special characters CREATE FUNCTION dbo.HasSpecialCharacters( @string nvarchar(1000)) RETURNS int AS BEGIN DECLARE @retval int DECLARE @index int DECLARE @char nchar(1) SET @retval = 0 SET @index = 1 WHILE (@index <= IsNull(len(@string), 0) AND @retval=0) BEGIN SET @char = Substring(@string, @index, @index+1) IF NOT (ASCII(@char) BETWEEN 48 AND 57 -- numeric value OR ASCII(@char) BETWEEN 65 AND 90 -- capital letters OR ASCII(@char) BETWEEN 97 AND 122 -- small letters OR ASCII(@char) = 32) --space BEGIN SET @retval = @index END ELSE SET @index = @index + 1 END RETURN (@retval) END
Function’s logic is based on the observation that the ASCII of numeric values could be found in the integer interval between 48 and 57, the capital letters between 65 and 90, while the small letters between 97 and 122. By adding the ASCII for space and eventually several other characters, the check on whether an character is valid resuming thus to only 4 constraints. Here’s the function at work:
-- testing HasSpecialCharacters function SELECT dbo.HasSpecialCharacters('kj324h5kjkj3245k2j3hkj342jj4') Example1 , dbo.HasSpecialCharacters('Qualität') Example2 , dbo.HasSpecialCharacters('Änderung') Example3 , dbo.HasSpecialCharacters('') Example4 , dbo.HasSpecialCharacters(NULL) Example5 , dbo.HasSpecialCharacters('Ä') Example6 , dbo.HasSpecialCharacters('ä') Example7 , dbo.HasSpecialCharacters('a') Example8
As can be seen, the function returns the position where a special character is found, fact that enables users to identify the character that causes the problem. A similar function could be built also in order to count the number of special characters found in a string, the change residing in performing a counter rather then returning the position at the first occurrence of a special character.
The function might not be perfect though it solves the problem. There are also other alternatives, for example of storing the special characters in a table and performing a simple join against the target table. Another solution could be based on the use RegEx functionality, either by using OLE automation or directly CLR functionality. There could be done variations on the above solution too by limiting to check on whether the characters of a string are falling in the range projected by the ASCII function. That’s what the following function does:
-- checks if a string has special characters falling in an interval CREATE FUNCTION dbo.HasCharNotInASCIIRange( @string nvarchar(1000) , @start int , @end int) RETURNS int AS BEGIN DECLARE @retval int DECLARE @index int DECLARE @char nchar(1)
SET @retval = 0 SET @index = 1
WHILE (@index <= IsNull(len(@string), 0) AND @retval=0) BEGIN SET @char = Substring(@string, @index, @index+1) IF NOT (ASCII(@char) BETWEEN @start AND @end) BEGIN SET @retval = @index END ELSE SET @index = @index + 1 END RETURN (@retval) END
With this function are necessary 4 calls in order to identify if a string contains special characters, though we loose the flexibility of identifying the first character that is invalid. We could still identify the first occurrence by taking the minimum value returned by the 4 calls, however, unlike Oracle (see Least function), SQL Server doesn’t have such a function, so we’ll have eventually to built it. Anyway, here’s the above function at work:
-- testing HasCharNotInASCIIRange function SELECT dbo.HasCharNotInASCIIRange('k12345', 48, 57) Example1 , dbo.HasCharNotInASCIIRange('12k345', 48, 57) Example2 , dbo.HasCharNotInASCIIRange('12345', 48, 57) Example3 , dbo.HasCharNotInASCIIRange(' 12345', 48, 57) Example4 , dbo.HasCharNotInASCIIRange('12345 ', 48, 57) Example5 , dbo.HasCharNotInASCIIRange('', 48, 57) Example6 , dbo.HasCharNotInASCIIRange(NULL, 48, 57) Example7 , dbo.HasCharNotInASCIIRange('', 48, 57) Example8 , dbo.HasCharNotInASCIIRange('a', 48, 57) Example9 , dbo.HasCharNotInASCIIRange('Ä', 48, 57) Example10 , dbo.HasCharNotInASCIIRange('ä', 32, 32) Example11
No comments:
Post a Comment