12 July 2010

💎SQL Reloaded: CRUD Stored Procedures from Metadata I (Insert & Update)

    In the Saving Data With Stored Procedures post I showed my favorite technique of creating a new record, respectively updating an existing record, through a stored procedure, while in Creating a Stored Procedure from Table’s Metadata I showed how an Excel macro can be built to create the same type of stored procedure from a table’s metadata in a semiautomatic manner. The Excel approach, even if it reduces the creation time of a stored procedure to only a few minutes, has the disadvantage that the metadata has to be copied into Excel manually for each table (when manual work is involved in the automatic generation of code, we are dealing with a semiautomatic method). Last weekend, after having to use the macro at work on Friday to create such a stored procedure, I wondered whether the same could be achieved using only T-SQL code. The difficulty of such a task resides primarily in creating the repeating code for inserting into and updating the tables, and for declaring the stored procedure parameters, something that can be achieved with a cursor or a common table expression. Of secondary concern is the handling of special data types like uniqueidentifier, xml, date/time or even bit. The good news is that not all data types are used in data modeling, at least not in the same proportions. In practice, when attempting to automate tasks, treating all the possible cases can require quite an effort that may not pay off, therefore it is important to find the balance between effort and utility (sometimes automating 60-70% of a task is quite an achievement, while other times 90-100% can be reached). Here’s the view I created for this task.

-- prepares metadata for CRUD stored procedures 
CREATE VIEW dbo.vCRUDMetadata 
AS 
WITH CTE([schema_name], Table_Name, id, id_data_type, Column_id, [parameters], [values], [attributes], [update]) 
AS 
( SELECT A.[schema_name] 
, A.Table_Name 
, A.Column_Name id 
, A.system_type id_data_type 
, A.Column_id  
, Cast(' @' + A.Column_Name + ' ' + system_type  
+ CASE  
     WHEN A.system_type IN ('char', 'nchar' , 'varchar', 'nvarchar') THEN ' (' + CASE A.max_length WHEN -1 THEN 'max' ELSE Cast(A.max_length AS varchar(10)) END + ')'  
      ELSE '' 
END  
+ ' output' as nvarchar(max)) [parameters] 
, Cast(CASE  
WHEN A.system_type = 'bit' THEN ' IsNull(@' + A.Column_Name + ', 0)'  
WHEN A.system_type = 'uniqueidentifier' THEN ' IsNull(@' + A.Column_Name + ', NEWID())' 
ELSE '@' + A.Column_Name  
END as nvarchar(max)) [values] 
, Cast(A.Column_Name as nvarchar(max)) [attributes] 
, Cast('' as nvarchar(max)) [update] 
FROM dbo.vTableColumns A 
WHERE A.column_id = 1 
UNION ALL 
SELECT A.[schema_name] 
, A.Table_Name 
, B.id 
, B.id_data_type 
, A.column_id 
, B.[parameters] + CHAR(13)+ char(10)  
+ ', @' + A.Column_Name + ' ' + A.system_type  
+ CASE  
WHEN A.system_type IN ('char', 'nchar' , 'varchar', 'nvarchar') THEN ' (' + CASE A.max_length WHEN -1 THEN 'max' ELSE Cast(A.max_length AS varchar(10)) END + ')'  
ELSE '' 
END [parameters] 
, B.[values] + CHAR(13)+ char(10)  
+ ', ' + CASE  
WHEN A.system_type = 'bit' THEN ' IsNull(@' + A.Column_Name + ', 0)'  
WHEN A.system_type = 'uniqueidentifier' THEN ' IsNull(@' + A.Column_Name + ', NEWID())' 
ELSE '@' + A.Column_Name  
END [values] 
, B.[attributes] + CHAR(13)+ char(10) + ', ' + A.Column_Name [attributes] 
, B.[update] + CHAR(13)+ char(10)  
+ ', ' + A.Column_Name + ' = ' 
+ CASE  
WHEN A.system_type = 'bit' THEN ' IsNull(@' + A.Column_Name + ', 0)'  
WHEN A.system_type = 'uniqueidentifier' THEN ' IsNull(@' + A.Column_Name + ', NEWID())' 
ELSE '@' + A.Column_Name  
END [update] 
FROM dbo.vTableColumns A 
JOIN CTE B 
ON A.column_id - 1 = B.column_id 
AND A.[schema_name] = B.[schema_name] 
AND A.Table_Name = B.Table_Name 
) SELECT A.[schema_name] 
, A.Table_Name 
, A.column_id NumberAttributes 
, A.id 
, A.id_data_type 
, A.[parameters] 
, A.[values] 
, A.[attributes] 
, A.[update] 
FROM CTE A 
WHERE column_id IN ( SELECT MAX(column_id) 
FROM CTE B 
WHERE A.[schema_name] = B.[schema_name] 
AND A.Table_Name = B.Table_Name) 

    The view needs additional work because it treats only the simplest cases, the intent being to keep it (relatively) simple. Even so, the logic might not be easy to grasp in a single review - it just combines CASE expressions with concatenation, building one string each for the parameters, values, attributes and update sections.
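To make the string-building mechanism easier to follow outside of the view itself, here is a minimal sketch using a hypothetical three-column table (not one of the AdventureWorks tables from the post): a recursive CTE walks the columns in column_id order and appends each name to a growing string, which is exactly what vCRUDMetadata does for each of its four strings.

```sql
-- minimal sketch of recursive string building (hypothetical Test table)
WITH Cols (Column_Name, column_id)
AS (
    SELECT 'ID', 1 UNION ALL
    SELECT 'Name', 2 UNION ALL
    SELECT 'CreationDate', 3
)
, CTE (column_id, attribute_list)
AS (
    SELECT column_id
    , Cast(Column_Name as nvarchar(max)) -- anchor: the first column starts the string
    FROM Cols
    WHERE column_id = 1
    UNION ALL
    SELECT C.column_id
    , B.attribute_list + ', ' + C.Column_Name -- recursive part: append the next column
    FROM Cols C
        JOIN CTE B
          ON C.column_id - 1 = B.column_id
)
SELECT attribute_list -- returns 'ID, Name, CreationDate'
FROM CTE
WHERE column_id = 3
```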

    Here’s a call to the view for a table:

-- testing the view 
SELECT * 
FROM dbo.vCRUDMetadata 
WHERE [Schema_Name] = 'Person' 
AND Table_Name = 'Address' 

    Now, how do we make it useful? I used the same idea as in Excel, the code for creating the Save stored procedure being implemented in the UDF below:

-- Template for Save stored procedure 
CREATE FUNCTION dbo.TemplateSave( 
@SchemaName nvarchar(50) 
, @TableName nvarchar(50) 
, @id nvarchar(50) 
, @id_data_type nvarchar(50) 
, @parameters nvarchar(max) 
, @values nvarchar(max) 
, @attributes nvarchar(max) 
, @update nvarchar(max) 
) RETURNS nvarchar(max) 
AS BEGIN 
DECLARE @DefaultValue nvarchar(2) 
SET @DefaultValue = CASE 
WHEN @id_data_type IN ('char', 'nchar' , 'varchar', 'nvarchar') THEN '''' 
ELSE '0' 
END 
SET @values = Right(@values, Len(@values) - CharIndex(',', @values)-1) 
SET @attributes = Right(@attributes, Len(@attributes) - CharIndex(',', @attributes)-1) 
SET @update = Right(@update, Len(@update) - CharIndex(',', @update)-1) 
RETURN( 'CREATE PROCEDURE ' + @SchemaName + '.pSave' + @TableName + '(' + CHAR(13)+ char(10) + @parameters + ')' + CHAR(13)+ char(10) +  
     'AS' + CHAR(13)+ char(10) +  
     'BEGIN' + CHAR(13)+ char(10) +  
     'BEGIN TRY' + CHAR(13)+ char(10) +  
     ' IF ISNULL(@' + @ID + ', ' + @DefaultValue + ') = ' + @DefaultValue + CHAR(13)+ char(10) +  
     ' BEGIN' + CHAR(13)+ char(10) +  
     ' -- insert statement' + CHAR(13)+ char(10) +  
     ' INSERT INTO ' + @SchemaName + '.' + @TableName + ' (' + @attributes + ')' + CHAR(13)+ char(10) +  
     ' VALUES (' + @values + ')' + CHAR(13)+ char(10) +  
     ' SET @' + @ID + ' = @@IDENTITY' + CHAR(13)+ char(10) +  
     ' END' + CHAR(13)+ char(10) + 
     ' ELSE' + CHAR(13)+ char(10) + 
     ' BEGIN' + CHAR(13)+ char(10) + 
     ' -- update statement' + CHAR(13)+ char(10) +  
     ' UPDATE ' + @SchemaName + '.' + @TableName + CHAR(13)+ char(10) +  
     ' SET ' + @update + CHAR(13)+ char(10) + 
     ' WHERE ' + @ID + '= @' + @ID + CHAR(13)+ char(10) +  
     ' END' + CHAR(13)+ char(10) + 
    ' SELECT @@Rowcount' + CHAR(13)+ char(10) +  
     'END TRY' + CHAR(13)+ char(10) +  
     'BEGIN CATCH' + CHAR(13)+ char(10) +  
    ' SELECT Cast(ERROR_NUMBER() as varchar(10)) + '':'' + ERROR_MESSAGE()' + CHAR(13)+ char(10) +  
    'END CATCH' + CHAR(13)+ char(10) +  
    'END' ) 
END 
 
    The stored procedure template is slightly modified from the one presented in the previous posts, this time returning the ID through the same parameter used as input, and using a SELECT to return the number of records affected. Here’s the test script which creates the stored procedure, followed by its output:

SELECT dbo.TemplateSave( 
[Schema_Name] , Table_Name  
, id  
, id_data_type  
, [parameters]  
, [values]  
, [attributes]  
, [update]) CodeSP 
FROM dbo.vCRUDMetadata 
WHERE [Schema_Name] = 'Person' 
AND Table_Name = 'Address' 

CREATE PROCEDURE Person.pSaveAddress( 
@AddressID int output 
, @AddressLine1 nvarchar (120) 
, @AddressLine2 nvarchar (120) 
, @City nvarchar (60) 
, @StateProvinceID int 
, @PostalCode nvarchar (30) 
, @rowguid uniqueidentifier 
, @ModifiedDate datetime) 
AS 
BEGIN 
BEGIN TRY 
IF ISNULL(@AddressID, 0) = 0 
BEGIN 
-- insert statement 
     INSERT INTO Person.Address ( AddressLine1 
     , AddressLine2 
     , City 
     , StateProvinceID 
     , PostalCode 
     , rowguid 
     , ModifiedDate) 
    VALUES ( @AddressLine1 
     , @AddressLine2 
     , @City 
     , @StateProvinceID 
     , @PostalCode 
     , IsNull(@rowguid, NEWID()) 
     , @ModifiedDate) 
    SET @AddressID = @@IDENTITY 
END 
ELSE 
BEGIN 
     -- update statement 
    UPDATE Person.Address 
    SET AddressLine1 = @AddressLine1 
    , AddressLine2 = @AddressLine2 
    , City = @City 
    , StateProvinceID = @StateProvinceID 
    , PostalCode = @PostalCode 
    , rowguid = IsNull(@rowguid, NEWID()) 
    , ModifiedDate = @ModifiedDate 
     WHERE AddressID= @AddressID 
END 
SELECT @@Rowcount 
END TRY BEGIN CATCH 
     SELECT Cast(ERROR_NUMBER() as varchar(10)) + ':' + ERROR_MESSAGE() 
END CATCH 
END 
 
  The resulting code can be exported to Excel or a text file and then run in Management Studio. And here are two calls to it, one for update, and the other one for insert:

EXEC Person.pSaveAddress 1, 'AddressLine1', 'AddressLine2', 'City', 1 , 'PostalCode', NULL, '20090101' 
EXEC Person.pSaveAddress 0, 'AddressLine100', 'AddressLine2', 'City', 1 , 'PostalCode', '6D1EB969-C5CF-420C-BA5F-1153F4D89373', '20090101' 
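Because the generated procedure returns the new ID through its first parameter, a caller that wants the value back needs to pass a variable marked with the output keyword - a small sketch (the parameter values are only illustrative):

```sql
-- hedged sketch: capturing the ID returned through the output parameter
DECLARE @AddressID int
SET @AddressID = 0 -- 0 triggers the insert branch of the procedure
EXEC Person.pSaveAddress @AddressID output, 'AddressLine1', 'AddressLine2', 'City', 1, 'PostalCode', NULL, '20100712'
SELECT @AddressID NewAddressID -- the identity value assigned to the new record
```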

    The difference from the Excel version is that this one can easily be run for a whole set of tables, the stored procedures being then created one by one. A higher level of automation could be achieved with a cursor or an SSIS package, each stored procedure being created with EXEC or the sp_executesql stored procedure. The template can be modified as requirements dictate. Eventually the code from the template could be moved into a CLR function and thus take advantage of richer string functionality, though it then becomes more difficult to modify.
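The cursor-based automation could be sketched as below - shown only as an idea, the error handling and the checks for already existing procedures being left out:

```sql
-- hedged sketch: creating the Save procedures for a whole schema with a cursor
DECLARE @code nvarchar(max)
DECLARE crsSP CURSOR FOR
    SELECT dbo.TemplateSave([Schema_Name], Table_Name, id, id_data_type, [parameters], [values], [attributes], [update])
    FROM dbo.vCRUDMetadata
    WHERE [Schema_Name] = 'Person'
OPEN crsSP
FETCH NEXT FROM crsSP INTO @code
WHILE @@FETCH_STATUS = 0
BEGIN
    EXEC sp_executesql @code -- each CREATE PROCEDURE runs in its own batch
    FETCH NEXT FROM crsSP INTO @code
END
CLOSE crsSP
DEALLOCATE crsSP
```

sp_executesql is used here precisely because CREATE PROCEDURE must be the first statement in a batch, and each call provides its own batch.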

11 July 2010

💎SQL Reloaded: Evaluating Textual Expressions with CLR UDFs

Developers with experience in scripting languages like VBScript or JavaScript are most probably used to the Eval (VBScript/JavaScript) or Execute (VBScript) functions, which parse a string and evaluate its content as an expression, respectively execute a given set of instructions, thus allowing code to be created and executed at runtime. On several occasions I needed similar functionality in T-SQL as well, having to evaluate a formula stored in a text field or text variable. The latest situation of this type I can recall was when importing a set of BOM (Bill of Materials) data from Oracle into SQL Server. I had two problems then, one related to Oracle and the second to SQL Server. 

As each level in the (Oracle) BOM has a Quantity which needs to be multiplied with the Quantities from the child levels in order to get the total Required Quantity for each level, I needed a method to aggregate the respective values. Unlike in SQL Server, in Oracle (as far as I know) there is no built-in functionality for that; however, the SYS_CONNECT_BY_PATH function (see example) allows building a string of the quantities available in the BOM structure, and it is even possible to evaluate the resulting string using a function. Though, as I was not allowed to create a function in the production environment at the time, I had to solve the issue in SQL Server (2000) itself. The concatenated string of quantities at each BOM level looked something like ‘1*1*2*4*1’ (here for the 5th BOM level), so I needed a way to easily evaluate the respective product.

    In SQL Server the Cast and Convert functions allow casting a string holding a numeric value to a numeric data type, but their utility stops there.

-- Cast/Convert (working) example 
SELECT Cast('1' as int) * 25 
, Convert(int, '1') * 10 
 
    Many years ago, I remember I “tried my luck” and attempted to run a script similar to the one below:

-- Cast/Convert (not working) example 
SELECT Cast('1*2' as int) * 25 
, Convert(int, '1*2') * 10 
 
    No wonder that I got an error:
Msg 245, Level 16, State 1, Line 1
Conversion failed when converting the varchar value '1*2' to data type int.

   I attempted to find, again without luck, a built-in solution for my problem. For dynamically created scripts the sp_executesql stored procedure can be used in general:

-- expression evaluation using sp_executesql (output: 6) 
DECLARE @retval int; 
DECLARE @param int; 
SET @param = 3 
EXECUTE sp_executesql N'SELECT @retval = 1*2*@param', N'@param int, @retval int output', @param=@param, @retval = @retval output;  
SELECT @retval 

    Unfortunately sp_executesql can’t be called from UDFs, therefore the functionality can’t be used in queries that select from a given object. However, given its simplicity, the string can be regarded as a list of numbers delimited by the “*” (multiplication) sign, so a loop or, starting with SQL Server 2005, a common table expression can solve the problem - though it will work only for this type of multiplication, unless the expression is parsed using proper mathematical algorithms. Back then I wrote a function similar to the one below:
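The common table expression variant could look something like the sketch below, which peels one factor at a time off the list - shown here only to illustrate the idea:

```sql
-- hedged sketch: evaluating a product list with a recursive CTE (SQL Server 2005+)
DECLARE @list varchar(max)
SET @list = '1*1*2*4*1*' -- a trailing delimiter simplifies the parsing
;WITH CTE (Remainder, Product)
AS (
    SELECT @list Remainder
    , Cast(1 as decimal(18,2)) Product
    UNION ALL
    SELECT Right(Remainder, Len(Remainder) - CharIndex('*', Remainder))
    -- the cast keeps the type identical between anchor and recursive part
    , Cast(Product * Cast(Left(Remainder, CharIndex('*', Remainder) - 1) as decimal(18,2)) as decimal(18,2))
    FROM CTE
    WHERE CharIndex('*', Remainder) > 0
)
SELECT Product -- 8.00 for the list above
FROM CTE
WHERE Remainder = ''
```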

CREATE FUNCTION dbo.EvalProduct( 
@ListValues varchar(max)) 
RETURNS decimal(18,2) 
/* 
Purpose: evaluates the product of multiple numbers written as a text expression 
Parameters: @ListValues varchar(max) - List of values to be multiplied 
Sample call: SELECT dbo.EvalProduct('1*2*3*4') 
SELECT dbo.EvalProduct('1.2*2.3*3.4') 
SELECT dbo.EvalProduct('1.2') 
SELECT dbo.EvalProduct('NULL') 
*/ AS 
BEGIN 
DECLARE @Index int 
DECLARE @Product decimal(18,2) 
SET @Index = CharIndex('*', @ListValues) 
SET @Product = CASE 
     WHEN @Index>1 OR (@Index = 0 AND IsNumeric(@ListValues) = 1) THEN 1 
     ELSE NULL 
END  

WHILE @Index > 0  
BEGIN 
    SET @Product = @Product * Cast(Left(@ListValues, @Index-1) as decimal(18,2)) 

    SET @ListValues = Right(@ListValues, Len(@ListValues)-@Index) 
    SET @Index = CharIndex('*', @ListValues) 
END  

IF IsNumeric(@ListValues)=1 
SET @Product = @Product * Cast(@ListValues as decimal(18,2))  

RETURN @Product 
END 

      
    The function, even if handy, has its limitations: first of all it works only with products of positive numeric values, and secondly there is no chance of using it for more complex expressions. Fortunately, starting with SQL Server 2005 there is an alternative. In the last post, Data Profiling Using CLR Functions, I introduced the ExecuteScalarToInt and ExecuteScalarToDecimal CLR functions, which can be used to evaluate a text expression to an integer, respectively a decimal value, and thus they can also be used to solve the above problem:  

-- CLR functions approach (Output: 120, respectively 236.48) 
SELECT dbo.ExecuteScalarToInt('SELECT 1*2*3*4*5') IntExample 
, dbo.ExecuteScalarToDecimal('SELECT 1.2*2.3*3.4*4.5*5.6') DecimalExample  

    The functions can actually be used with any type of expression that returns a numeric value; the expressions can also include built-in or user-defined functions:
 
-- CLR with built-in mathematical functions (Output: 21.76)
SELECT dbo.ExecuteScalarToDecimal('SELECT Pi()*SQRT(47.98)') 

--CLR with UDF (Output: 9.38) 
SELECT dbo.ExecuteScalarToDecimal('SELECT dbo.EvalProduct(''1.2*2.3*3.4'')') 

    This means that the two functions could be used, for example, in queries in which the formulas to be executed are stored in tables or built dynamically. Let’s look at an example based on values stored in a table:

-- creating a simple table 
 CREATE TABLE dbo.TestTextExpressions ( 
Expression varchar(max)) 

--inserting the values 
INSERT INTO dbo.TestTextExpressions 
VALUES ('1*2*3*4') 
, ('Pi()*SQRT(47.98)') 
, ('Cos(10)') 
, ('Year(GetDate())') 
, ('Square(10.45)') 

-- running the formulas 
SELECT Expression 
, dbo.ExecuteScalarToDecimal('SELECT ' + Expression) [Output]      
FROM dbo.TestTextExpressions   


CLR UDF - expression evaluation 
   
Notes:
1.    Instead of adding the SELECT in the parameter string, a CLR function (e.g. EvalToInt) could be created that includes it already. I'm not sure how much benefit could be obtained from this; however, I prefer to have generalized functionality rather than providing specific functions for each scenario.
2.    Most probably there are also exceptions in which this approach can’t be used. As the two CLR functions don’t perform any validation, the usefulness of such functions depends also on the validity of the input. If the input is invalid, an error will be thrown; on the other side, even if it’s possible to catch the error in the CLR function itself, it could prove difficult to alter the function’s behavior (for example returning a NULL value) without impacting the result.
3.    In order to run UDFs from CLR UDFs, SystemDataAccess:=SystemDataAccessKind.Read must be added to the CLR functions’ definitions, otherwise the “This statement has attempted to access data whose access is restricted by the assembly” error message will be thrown.

PS:  It would be nice to receive some feedback on how the above functions helped you to solve a similar problem!

04 July 2010

🗄️Data Management: Data Profiling II (Using CLR functions)

Data Management
Data Management Series

In the previous post on Data Profiling I introduced basic data profiling techniques, the queries used for exemplification being quite simple. Probably many people will ask themselves whether it is possible to automate data profiling for a whole range of tables/datasets and the attributes they contain. The answer is simple: this relies entirely on the database’s capability of running dynamically created scripts in an elegant manner – many such solutions rely on cursors in which the dynamically created scripts are run by a stored procedure (e.g. EXEC or sp_executesql in SQL Server). There are also solutions in which the dynamic queries are created, run and managed by third-party tools. In what concerns SQL Server, CLR functions could prove to be an elegant solution for this purpose.

In a previous post, “Number of Records – The CLR Version”, I described the steps I followed in order to create a CLR function that returns the number of records for a table whose name is provided as parameter. Instead of giving only the name of the table as parameter, the whole query string could be provided instead, thus making it possible to run any type of query that returns a scalar. 

The problem is that a CLR function returns a value of only one data type, thus we could either transform all the values to strings or create a function for each data type. I will describe both approaches in this post, one reason being that initially I thought the second solution more practical, and probably easier to debug, though as I discovered, the first technique is more flexible, allowing the logic to be encapsulated in a single query.

Let’s consider the following functions, which could be added to a new Database Project or to the project created in the previous post on CLR functions. As mentioned above, a function is created for each important data type: thus ExecuteScalarToInt returns an integer value, ExecuteScalarToDate a date value, ExecuteScalarToString a string value, ExecuteScalarToDecimal a decimal value, and ExecuteScalarToBoolean a boolean value:

Partial Public Class UserDefinedFunctions 

    <SqlFunction(DataAccess:=DataAccessKind.Read, SystemDataAccess:=SystemDataAccessKind.Read)> _ 
    Public Shared Function ExecuteScalarToInt(ByVal SqlScript As String) As Integer 
        Using conn As New SqlConnection("context connection=true") 
            conn.Open() 
            Dim cmd As New SqlCommand(SqlScript, conn) 
            Return CType(cmd.ExecuteScalar(), Integer) 
        End Using 
    End Function 

    <SqlFunction(DataAccess:=DataAccessKind.Read, SystemDataAccess:=SystemDataAccessKind.Read)> _ 
    Public Shared Function ExecuteScalarToDate(ByVal SqlScript As String) As Date 
        Using conn As New SqlConnection("context connection=true") 
            conn.Open() 
            Dim cmd As New SqlCommand(SqlScript, conn) 
            Return CType(cmd.ExecuteScalar(), Date) 
        End Using 
    End Function 

    <SqlFunction(DataAccess:=DataAccessKind.Read, SystemDataAccess:=SystemDataAccessKind.Read)> _ 
    Public Shared Function ExecuteScalarToString(ByVal SqlScript As String) As String 
        Using conn As New SqlConnection("context connection=true") 
            conn.Open() 
            Dim cmd As New SqlCommand(SqlScript, conn) 
            Return CType(cmd.ExecuteScalar(), String) 
        End Using 
    End Function 

    <SqlFunction(DataAccess:=DataAccessKind.Read, SystemDataAccess:=SystemDataAccessKind.Read)> _ 
    Public Shared Function ExecuteScalarToDecimal(ByVal SqlScript As String) As Decimal 
        Using conn As New SqlConnection("context connection=true") 
            conn.Open() 
            Dim cmd As New SqlCommand(SqlScript, conn) 
            Return CType(cmd.ExecuteScalar(), Decimal) 
        End Using 
    End Function 

    <SqlFunction(DataAccess:=DataAccessKind.Read, SystemDataAccess:=SystemDataAccessKind.Read)> _ 
    Public Shared Function ExecuteScalarToBoolean(ByVal SqlScript As String) As Boolean 
        Using conn As New SqlConnection("context connection=true") 
            conn.Open() 
            Dim cmd As New SqlCommand(SqlScript, conn) 
            Return CType(cmd.ExecuteScalar(), Boolean) 
        End Using 
    End Function 
End Class

After compiling the project and re-linking the assembly, the following functions must be declared:

-- returns int values 
CREATE FUNCTION dbo.ExecuteScalarToInt(@sql nvarchar(max)) 
RETURNS bigint  
AS  
EXTERNAL NAME SqlServerProject1.[SqlServerProject1.UserDefinedFunctions].ExecuteScalarToInt 
GO 

-- returns decimal values 
CREATE FUNCTION dbo.ExecuteScalarToDecimal(@sql nvarchar(max)) 
RETURNS decimal(18,2)  
AS  
EXTERNAL NAME SqlServerProject1.[SqlServerProject1.UserDefinedFunctions].ExecuteScalarToDecimal 
GO 

-- returns nvarchar values 
CREATE FUNCTION dbo.ExecuteScalarToString(@sql nvarchar(max)) 
RETURNS nvarchar(max)  
AS  
EXTERNAL NAME SqlServerProject1.[SqlServerProject1.UserDefinedFunctions].ExecuteScalarToString 
GO 

-- returns date values 
CREATE FUNCTION dbo.ExecuteScalarToDate(@sql nvarchar(max)) 
RETURNS date  
AS  
EXTERNAL NAME SqlServerProject1.[SqlServerProject1.UserDefinedFunctions].ExecuteScalarToDate 
GO 

-- returns boolean values 
CREATE FUNCTION dbo.ExecuteScalarToBoolean(@sql nvarchar(max)) 
RETURNS bit  
AS  
EXTERNAL NAME SqlServerProject1.[SqlServerProject1.UserDefinedFunctions].ExecuteScalarToBoolean 
GO 


-- testing the functions    
SELECT dbo.ExecuteScalarToInt('SELECT count(*) FROM Production.Product') NumberRecords 
SELECT dbo.ExecuteScalarToString('SELECT Max(ProductNumber) FROM Production.Product') GreatestStringValue 
SELECT dbo.ExecuteScalarToString('SELECT Max(ListPrice) FROM Production.Product') GreatestNumericValue 
SELECT dbo.ExecuteScalarToDate('SELECT Max(SellStartDate) FROM Production.Product') GreatestDateValue 
SELECT dbo.ExecuteScalarToString('SELECT CASE WHEN count(1)>1 THEN 1 ELSE 0 END FROM Production.Product') HasRecords
 
Data Profiling - testing CLR functions

  The first example, which returns the number of records, can easily be adapted in order to return the respective value for a given set of tables:

-- number of records 
SELECT S.name [schema_name] 
, T.name table_name 
, dbo.ExecuteScalarToInt('SELECT count(*) NumberRecords FROM AdventureWorks.' + S.name + '.' + T.name) NumberRecords 
FROM AdventureWorks.sys.tables T 
    JOIN AdventureWorks.sys.schemas S 
      ON T.schema_id = S.schema_id 
WHERE S.name = 'Sales' 
ORDER BY S.name  
, T.name 
 
Data Profiling - number of records

The attribute-related metrics require a lower level of detail, obtained by adding the sys.columns table to the above query; for this purpose I will use the dbo.vTableColumns view created in the previous post on data profiling. And here’s the query based on the AdventureWorks database:

-- data profiling attribute level 
SELECT [Schema_Name]  
, Table_Name  
, Column_Name  
, user_type  
, system_type  
-- number distinct values 
 , CASE  
    WHEN system_type NOT IN ('xml') THEN dbo.ExecuteScalarToInt('SELECT count(DISTINCT [' + Column_Name + ']) NumberRecords FROM AdventureWorks.' + [Schema_Name] + '.' + Table_Name)  
     ELSE -1 
END NumberDistinctValues 
-- number/percentage not null values 
, dbo.ExecuteScalarToInt('SELECT count(1) NumberRecords FROM AdventureWorks.' + [Schema_Name] + '.' + Table_Name + ' WHERE [' + Column_Name + '] IS NOT NULL') NumberNotNullValues  
, CASE  
    WHEN system_type NOT IN ('xml') THEN dbo.ExecuteScalarToDecimal('SELECT Cast(CASE WHEN count(1)>0 THEN Cast(count([' + Column_Name + ']) as decimal(18,2))/count(1) ELSE 0 END as decimal(5,2)) PercentageNotNullValues FROM AdventureWorks.' + [Schema_Name] + '.' + Table_Name) 
    ELSE -1 
END PercentageNotNullValues 
-- min/max length of values 
, CASE  
    WHEN system_type NOT IN ('xml') THEN dbo.ExecuteScalarToInt('SELECT IsNull(Min(Len([' + Column_Name + '])), 0) MinLengthValue FROM AdventureWorks.' + [Schema_Name] + '.' + Table_Name) 
    ELSE -1 
END MinLengthValue 
, CASE  
     WHEN system_type NOT IN ('xml') THEN dbo.ExecuteScalarToInt('SELECT IsNull(Max(Len([' + Column_Name + '])), 0) MaxLengthValue FROM AdventureWorks.' + [Schema_Name] + '.' + Table_Name) 
    ELSE -1 
END MaxLengthValue 
-- Min/Max values 
, CASE  
    WHEN system_type NOT IN ('varbinary', 'uniqueidentifier', 'xml', 'sysname') THEN  
dbo.ExecuteScalarToString('SELECT IsNull(Cast(Min('  
    + CASE WHEN system_type = 'bit' THEN 'Cast(' ELSE '' END 
    + '[' + Column_Name + ']' 
    + CASE WHEN system_type = 'bit' THEN ' as smallint)' ELSE '' END 
     + ') as nvarchar(max)), ''NULL'') MinValue FROM AdventureWorks.' 
     + [Schema_Name] + '.' + Table_Name) 
   ELSE CAST('' as nvarchar(max))  
END MinValue 
, CASE  
     WHEN system_type NOT IN ('varbinary', 'uniqueidentifier', 'xml', 'sysname') THEN  
dbo.ExecuteScalarToString('SELECT IsNull(Cast(Max('  
    + CASE WHEN system_type = 'bit' THEN 'Cast(' ELSE '' END 
    + '[' + Column_Name + ']' 
    + CASE WHEN system_type = 'bit' THEN ' as smallint)' ELSE '' END 
+ ') as nvarchar(max)), ''NULL'') MaxValue FROM AdventureWorks.'  
    + [Schema_Name] + '.' + Table_Name )  
    ELSE CAST('' as nvarchar(max))  
END MaxValue  
FROM dbo.vTableColumns 
WHERE [Schema_Name] = 'Production' 
AND Table_Name = 'Product' 
ORDER BY [Schema_Name]  
, Table_Name  
, Column_Name  
, column_id 

   Here’s the output; the query ran in about 30 seconds for all of the AdventureWorks database’s tables:
  
Data Profiling - attribute level

 More likely, the above query requires some explanations. First of all, some CASEs have been introduced in order to treat the exceptions, given the fact that not all data types support all the aggregate functions, while the xml data type doesn’t support any aggregate function at all. Hopefully I have treated all the scenarios; if not, the query can easily be updated. In other cases I applied a simple conversion, as with the bit data type cast to a smallint. In case the dynamic queries return a value other than an integer, the output value has been cast to nvarchar(max). Most probably, for an easier understanding of the dynamic queries it is also necessary to review the corresponding static queries presented in the previous post.
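For instance, for a bit attribute like Production.Product’s MakeFlag, the Min branch of the query above expands to a dynamic statement of roughly this shape:

```sql
-- shape of the generated Min query for a bit attribute (MakeFlag as example)
SELECT IsNull(Cast(Min(Cast(MakeFlag as smallint)) as nvarchar(max)), 'NULL') MinValue 
FROM AdventureWorks.Production.Product 
```

The inner cast to smallint is needed because Min doesn’t operate on bit directly, while the outer cast to nvarchar(max) aligns the value with the other columns of the profiling output.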

If we use a specific function for each data type, then we either choose to work with one query per data type, or, if we want to combine them with a UNION (ALL), we’ll have to cast the output to nvarchar. So whichever method we choose, in order to create a single dataset out of our profiling implementation, a cast to nvarchar is required. I will confine myself to providing the queries for each data type, more to exemplify the use of the above CLR functions:

-- getting Min/Max integer values 
SELECT [Schema_Name] 
, Table_Name  
, Column_Name 
, user_type 
, dbo.ExecuteScalarToInt('SELECT IsNull(Min(' + Column_Name + '), 0) MinValue FROM AdventureWorks.' + [Schema_Name] + '.' + Table_Name ) MinValue 
, dbo.ExecuteScalarToInt('SELECT IsNull(Max(' + Column_Name + '), 0) MaxValue FROM AdventureWorks.' + [Schema_Name] + '.' + Table_Name ) MaxValue 
FROM dbo.vTableColumns 
WHERE [Schema_Name] = 'Production' 
AND Table_Name = 'Product' 
AND user_type IN ('int', 'bigint') 

-- getting Min/Max String values 
SELECT [Schema_Name] 
, Table_Name  
, Column_Name 
, user_type 
, dbo.ExecuteScalarToString('SELECT IsNull(Min(' + Column_Name + '), '''') MinValue FROM AdventureWorks.' + [Schema_Name] + '.' + Table_Name ) MinValue 
, dbo.ExecuteScalarToString('SELECT IsNull(Max(' + Column_Name + '), '''') MaxValue FROM AdventureWorks.' + [Schema_Name] + '.' + Table_Name ) MaxValue 
FROM dbo.vTableColumns 
WHERE [Schema_Name] = 'Production' AND Table_Name = 'Product' 
AND user_type IN ('nvarchar', 'varchar', 'char', 'nchar') 

-- getting Min/Max Date values 
SELECT [Schema_Name] 
, Table_Name  
, Column_Name 
, user_type 
, dbo.ExecuteScalarToDate('SELECT IsNull(Min(' + Column_Name + '), Cast(''1800-01-01'' as date)) MinDateValue FROM AdventureWorks.' + [Schema_Name] + '.' + Table_Name ) MinDateValue 
, dbo.ExecuteScalarToDate('SELECT IsNull(Max(' + Column_Name + '), Cast(''1800-01-01'' as date)) MaxDateValue FROM AdventureWorks.' + [Schema_Name] + '.' + Table_Name ) MaxDateValue 
FROM dbo.vTableColumns 
WHERE [Schema_Name] = 'Production' 
AND Table_Name = 'Product' 
AND user_type IN ('datetime', 'date') 

-- getting Min/Max Boolean values 
SELECT [Schema_Name] 
, Table_Name  
, Column_Name 
, user_type 
, dbo.ExecuteScalarToBoolean('SELECT IsNull(Min(Cast(' + Column_Name + ' as smallint)), 0) MinValue FROM AdventureWorks.' + [Schema_Name] + '.' + Table_Name ) MinValue 
, dbo.ExecuteScalarToBoolean('SELECT IsNull(Max(Cast(' + Column_Name + ' as smallint)), 0) MaxValue FROM AdventureWorks.' + [Schema_Name] + '.' + Table_Name ) MaxValue 
FROM dbo.vTableColumns 
WHERE [Schema_Name] = 'Production' 
--AND Table_Name = 'Product' 
AND user_type IN ('bit') 

Notes:
1.   The problem with the current form of the functions is that when the query returns a NULL value, the database engine throws the following error message:
Msg 6522, Level 16, State 1, Line 2
A .NET Framework error occurred during execution of user-defined routine or aggregate "ExecuteScalarToDate":
System.InvalidCastException: Conversion from type 'DBNull' to type 'Date' is not valid.
System.InvalidCastException:
   at Microsoft.VisualBasic.CompilerServices.Conversions.ToDate(Object Value)
   at SqlServerProject1.UserDefinedFunctions.ExecuteScalarToDate(String SqlScript)

    
The error can be avoided by handling the NULL values with a default value, for example 0 for numeric values, the empty string for string values, a minimal date for date values, and 0 for bit values. While for numeric and string values this translation has minimal impact, for date and bit values it somewhat reduces the functionality.

2. SQL Server 2008 also provides a Data Profiling Task as part of its SSIS solution, quite easy to use; however, in comparison with it, the above functions allow profiling not only one table but a set of tables, the statistics about attributes can easily be brought together in a tabular format, and the logic can be further encapsulated in a UDF or view for reuse.

