Data Management Series |
In the previous post on Data Profiling I introduced basic data profiling techniques, the queries used for exemplification being quite simple. Probably many people will ask themselves on whether is possible to automate data profiling for a whole range of tables/datasets and the attributes they contain. The answer is simple, this relies entirely on databases’ capabilities of running dynamic scripts in an elegant manner – many such solutions relying on the use of cursors in which the dynamic created scripts are run by a stored procedure (e.g. EXEC or sp_executesql in SQL Server). There are also solution in which the dynamic queries are created, run and managed by third-party tools. In what concerns SQL Server, the CLR functions could prove to be an elegant solution for this purpose.
In a previous post, “Number of Records – The CLR Version”, I described the steps I followed in order to create a CLR function in order to get the number of records for a set of tables, the name of the table being provided as parameter. Instead of giving as parameter only the name of the table it could be provided instead the whole query string, thus having the possibility of running multiple types of query that return only a scalar.
The problem is that the CLR function returns a value only of a certain data type, thus we could transform all the values to string or create a function for each data type. I will try to describe in this post both approaches, a reason for that being also the fact that initially I thought that the second solution is more practical, and probably easier to debug, though as I discovered the first technique is more flexible allowing to encapsulate the logic in a single query.
Let’s consider the following functions that could be added in a new Database Project or in the project created in the previous post on CLR functions. As mentioned above, is created a function for each important data type, thus ExecuteScalarToInt returns an integer value, ExecuteScalarToDate a date value, ExecuteScalarToString a string value, ExecuteScalarToDecimal a decimal value, amd ExecuteScalarToBoolean a boolean value:
After compiling the project and re-linking the assembly, must be declared the following functions:
The first example that returns the number of records could be easily adapted in order to return the respective value for a given set of tables:
The attribute-related metrics require a lower level of detail, to the above query adding the sys.colums table, for this purpose I will use the dbo.vTableColumns view created in the previous post on data profiling. And here’s the query based on AdventureWorks database:
Here’s the output, the query run in about 30 seconds for the whole AdventureWorks database’s table:
More likely, the above query requires some explanations. First of all some CASEs have been introduced in order to treat the exceptions, given the fact that not all data types support all the aggregate functions, while the xml data type doesn’t support any aggregate function at all. Hopefully I treated all the scenarios, if not the query could be easily updated. In other cases I applied a simple conversion, as in the case of bit data type casted to a smallint. In case the dynamic queries return other value than integer, the output value has been casted to a nvarchar(max). Most probably for easier understanding of the dynamic queries used is necessary also a review of the corresponding static queries presented in the previous post.
In case we use a specific function for each data type then we either choose to work with a query for each data type, or in case we want to use a UNION (ALL) join, then we’ll have to cast the output to nvarchar. So whatever method we choose, in order to create only one dataset out of our profiling implementation, a cast to nvarchar is required. I will resume myself to provide the queries for each data type, more for exemplifying the use of the above CLR functions:
Notes:
1. The problem with actual form of the functions is that when the query returns a null value, the database engine throws the following error message:
Msg 6522, Level 16, State 1, Line 2
A .NET Framework error occurred during execution of user-defined routine or aggregate "ExecuteScalarToDate":
System.InvalidCastException: Conversion from type 'DBNull' to type 'Date' is not valid.
System.InvalidCastException:
at Microsoft.VisualBasic.CompilerServices.Conversions.ToDate(Object Value)
at SqlServerProject1.UserDefinedFunctions.ExecuteScalarToDate(String SqlScript)
The error could be avoided by handling the NULL values with a default value, for example 0 for numeric values, the empty string for string values, a minimal value date for date values, and 0 for bit values. If for numeric and string value this translation has a minimum of impact, for date and 0 values kind of reduces from functionality.
2. SQL Server 2008 provides also a Data Profiling Task as part of its SSIS solution, quite easy to use, however in comparison with it, the above functions allow to profile not only a table but a set of tables, the statistics about attributes could be brought easily together in a tabular format and the logic could be further encapsulated in a UDF or view for reuse.
Let’s consider the following functions that could be added in a new Database Project or in the project created in the previous post on CLR functions. As mentioned above, is created a function for each important data type, thus ExecuteScalarToInt returns an integer value, ExecuteScalarToDate a date value, ExecuteScalarToString a string value, ExecuteScalarToDecimal a decimal value, amd ExecuteScalarToBoolean a boolean value:
Partial Public Class UserDefinedFunctions <SqlFunction(DataAccess:=DataAccessKind.Read, SystemDataAccess:=SystemDataAccessKind.Read)> _ Public Shared Function ExecuteScalarToInt(ByVal SqlScript As String) As Integer Using conn As New SqlConnection("context connection=true") conn.Open() Dim cmd As New SqlCommand(SqlScript, conn) Return CType(cmd.ExecuteScalar(), Integer) End Using End Function <SqlFunction(DataAccess:=DataAccessKind.Read, SystemDataAccess:=SystemDataAccessKind.Read)> _ Public Shared Function ExecuteScalarToDate(ByVal SqlScript As String) As Date Using conn As New SqlConnection("context connection=true") conn.Open() Dim cmd As New SqlCommand(SqlScript, conn) Return CType(cmd.ExecuteScalar(), Date) End Using End Function <SqlFunction(DataAccess:=DataAccessKind.Read, SystemDataAccess:=SystemDataAccessKind.Read)> _ Public Shared Function ExecuteScalarToString(ByVal SqlScript As String) As String Using conn As New SqlConnection("context connection=true") conn.Open() Dim cmd As New SqlCommand(SqlScript, conn) Return CType(cmd.ExecuteScalar(), String) End Using End Function <SqlFunction(DataAccess:=DataAccessKind.Read, SystemDataAccess:=SystemDataAccessKind.Read)> _ Public Shared Function ExecuteScalarToDecimal(ByVal SqlScript As String) As Decimal Using conn As New SqlConnection("context connection=true") conn.Open() Dim cmd As New SqlCommand(SqlScript, conn) Return CType(cmd.ExecuteScalar(), Decimal) End Using End Function <SqlFunction(DataAccess:=DataAccessKind.Read, SystemDataAccess:=SystemDataAccessKind.Read)> _ Public Shared Function ExecuteScalarToBoolean(ByVal SqlScript As String) As Boolean Using conn As New SqlConnection("context connection=true") conn.Open() Dim cmd As New SqlCommand(SqlScript, conn) Return CType(cmd.ExecuteScalar(), Boolean) End Using End Function End Class
After compiling the project and re-linking the assembly, must be declared the following functions:
-- returns int values CREATE FUNCTION dbo.ExecuteScalarToInt(@sql nvarchar(max)) RETURNS bigint AS EXTERNAL NAME SqlServerProject1.[SqlServerProject1.UserDefinedFunctions].ExecuteScalarToInt GO -- returns decimal values CREATE FUNCTION dbo.ExecuteScalarToDecimal(@sql nvarchar(max)) RETURNS decimal(18,2) AS EXTERNAL NAME SqlServerProject1.[SqlServerProject1.UserDefinedFunctions].ExecuteScalarToDecimal GO -- returns nvarchar values CREATE FUNCTION dbo.ExecuteScalarToString(@sql nvarchar(max)) RETURNS nvarchar(max) AS EXTERNAL NAME SqlServerProject1.[SqlServerProject1.UserDefinedFunctions].ExecuteScalarToString GO -- returns date values CREATE FUNCTION dbo.ExecuteScalarToDate(@sql nvarchar(max)) RETURNS date AS EXTERNAL NAME SqlServerProject1.[SqlServerProject1.UserDefinedFunctions].ExecuteScalarToDate GO -- returns boolean values CREATE FUNCTION dbo.ExecuteScalarToBoolean(@sql nvarchar(max)) RETURNS bit AS EXTERNAL NAME SqlServerProject1.[SqlServerProject1.UserDefinedFunctions].ExecuteScalarToBoolean GO -- testing the functions SELECT dbo.ExecuteScalarToInt('SELECT count(*) FROM Production.Product') NumberRecords SELECT dbo.ExecuteScalarToString('SELECT Max(ProductNumber) FROM Production.Product') GreatestStringValue SELECT dbo.ExecuteScalarToString('SELECT Max(ListPrice) FROM Production.Product') GreatestNumericValue SELECT dbo.ExecuteScalarToDate('SELECT Max(SellStartDate) FROM Production.Product') GreatestDateValue SELECT dbo.ExecuteScalarToString('SELECT CASE WHEN count(1)>1 THEN 1 ELSE 0 END FROM Production.Product') HasRecords
The first example that returns the number of records could be easily adapted in order to return the respective value for a given set of tables:
-- number of records SELECT S.name [schema_name] , T.name table_name , dbo.ExecuteScalarToInt('SELECT count(*) NumberRecords FROM AdventureWorks.' + S.name + '.' + T.name) NumberRecords FROM AdventureWorks.sys.tables T JOIN AdventureWorks.sys.schemas S ON T.schema_id = S.schema_id WHERE S.name = 'Sales' ORDER BY S.name , T.name
The attribute-related metrics require a lower level of detail, to the above query adding the sys.colums table, for this purpose I will use the dbo.vTableColumns view created in the previous post on data profiling. And here’s the query based on AdventureWorks database:
-- data profiling attribute level SELECT[Schema_Name] , Table_Name , Column_Name , user_type , system_type -- number distinct values , CASE WHEN system_type NOT IN ('xml') THEN dbo.ExecuteScalarToInt('SELECT count(DISTINCT [' + Column_Name + ']) NumberRecords FROM AdventureWorks.' + [Schema_Name] + '.' + Table_Name) ELSE -1 END NumberDistinctValues -- number/percentage not null values
, dbo.ExecuteScalarToInt('SELECT count(1) NumberRecords FROM AdventureWorks.' + [Schema_Name] + '.' + Table_Name + ' WHERE [' + Column_Name + '] IS NOT NULL') NumberNotNullValues , CASE WHEN system_type NOT IN ('xml') THEN dbo.ExecuteScalarToDecimal('SELECT Cast(CASE WHEN count(1) >0 THEN Cast(count([' + Column_Name + ']) as decimal(18,2))/count(1) ELSE 0 END as decimal(5,2)) PercentageNotNullValues FROM AdventureWorks.' + [Schema_Name] + '.' + Table_Name) ELSE -1 END PercentageNotNullValues -- min/max length of values
, CASE WHEN system_type NOT IN ('xml') THEN dbo.ExecuteScalarToInt('SELECT IsNull(Min(Len([' + Column_Name + '])), 0) MinLengthValue FROM AdventureWorks.' + [Schema_Name] + '.' + Table_Name) ELSE -1 END MinLengthValue , CASE WHEN system_type NOT IN ('xml') THEN dbo.ExecuteScalarToInt('SELECT IsNull(Max(Len([' + Column_Name + '])), 0) MaxLengthValue FROM AdventureWorks.' + [Schema_Name] + '.' + Table_Name) ELSE -1 END MaxLengthValue -- Min/Max values , CASE WHEN system_type NOT IN ('varbinary', 'uniqueidentifier', 'xml', 'sysname') THEN dbo.ExecuteScalarToString('SELECT IsNull(Cast(Min(' + CASE WHEN system_type = 'bit' THEN 'Cast(' ELSE '' END + '[' + Column_Name + ']' + CASE WHEN system_type = 'bit' THEN ' as smallint)' ELSE '' END + ') as nvarchar(max)), ''NULL'') MinValue FROM AdventureWorks.' + [Schema_Name] + '.' + Table_Name) ELSE CAST('' as nvarchar(max)) END MinValue , CASE WHEN system_type NOT IN ('varbinary', 'uniqueidentifier', 'xml', 'sysname') THEN dbo.ExecuteScalarToString('SELECT IsNull(Cast(Max(' + CASE WHEN system_type = 'bit' THEN 'Cast(' ELSE '' END + '[' + Column_Name + ']' + CASE WHEN system_type = 'bit' THEN ' as smallint)' ELSE '' END + ') as nvarchar(max)), ''NULL'') MaxValue FROM AdventureWorks.' + [Schema_Name] + '.' + Table_Name ) ELSE CAST('' as nvarchar(max)) END MaxValue FROM dbo.vTableColumns WHERE [Schema_Name] = 'Production' AND Table_Name = 'Product' ORDER BY [Schema_Name] , Table_Name , Column_Name , column_id
Here’s the output, the query run in about 30 seconds for the whole AdventureWorks database’s table:
More likely, the above query requires some explanations. First of all some CASEs have been introduced in order to treat the exceptions, given the fact that not all data types support all the aggregate functions, while the xml data type doesn’t support any aggregate function at all. Hopefully I treated all the scenarios, if not the query could be easily updated. In other cases I applied a simple conversion, as in the case of bit data type casted to a smallint. In case the dynamic queries return other value than integer, the output value has been casted to a nvarchar(max). Most probably for easier understanding of the dynamic queries used is necessary also a review of the corresponding static queries presented in the previous post.
In case we use a specific function for each data type then we either choose to work with a query for each data type, or in case we want to use a UNION (ALL) join, then we’ll have to cast the output to nvarchar. So whatever method we choose, in order to create only one dataset out of our profiling implementation, a cast to nvarchar is required. I will resume myself to provide the queries for each data type, more for exemplifying the use of the above CLR functions:
-- getting Min/Max integer values SELECT [Schema_Name] , Table_Name , Column_Name , user_type , dbo.ExecuteScalarToInt('SELECT IsNull(Min(' + Column_Name + '), 0) MinValue FROM AdventureWorks.' + [Schema_Name] + '.' + Table_Name ) MinValue , dbo.ExecuteScalarToInt('SELECT IsNull(Max(' + Column_Name + '), 0) MaxValue FROM AdventureWorks.' + [Schema_Name] + '.' + Table_Name ) MaxValue FROM dbo.vTableColumns WHERE [Schema_Name] = 'Production' AND Table_Name = 'Product' AND user_type IN ('int', 'bigint') -- getting Min/Max String values SELECT [Schema_Name] , Table_Name , Column_Name , user_type , dbo.ExecuteScalarToString('SELECT IsNull(Min(' + Column_Name + '), '''') MinValue FROM AdventureWorks.' + [Schema_Name] + '.' + Table_Name ) MinValue , dbo.ExecuteScalarToString('SELECT IsNull(Max(' + Column_Name + '), '''') MaxValue FROM AdventureWorks.' + [Schema_Name] + '.' + Table_Name ) MaxValue FROM dbo.vTableColumns WHERE [Schema_Name] = 'Production' AND Table_Name = 'Product' AND user_type IN ('nvarchar', 'varchar', 'char', 'nchar') -- getting Min/Max Date values SELECT [Schema_Name] , Table_Name , Column_Name , user_type , dbo.ExecuteScalarToDate('SELECT IsNull(Min(' + Column_Name + '), Cast(''1800-01-01'' as date)) MinDateValue FROM AdventureWorks.' + [Schema_Name] + '.' + Table_Name ) MinDateValue , dbo.ExecuteScalarToDate('SELECT IsNull(Max(' + Column_Name + '), Cast(''1800-01-01'' as date)) MaxDateValue FROM AdventureWorks.' + [Schema_Name] + '.' + Table_Name ) MaxDateValue FROM dbo.vTableColumns WHERE [Schema_Name] = 'Production' AND Table_Name = 'Product' AND user_type IN ('datetime', 'date') -- getting Min/Max Boolean values SELECT [Schema_Name] , Table_Name , Column_Name , user_type , dbo.ExecuteScalarToBoolean('SELECT IsNull(Min(Cast(' + Column_Name + ' as smallint)), 0) MinValue FROM AdventureWorks.' + [Schema_Name] + '.' + Table_Name ) MinValue , dbo.ExecuteScalarToBoolean('SELECT IsNull(Max(Cast(' + Column_Name + ' as smallint)), 0) MaxValue FROM AdventureWorks.' + [Schema_Name] + '.' + Table_Name ) MaxValue FROM dbo.vTableColumns WHERE [Schema_Name] = 'Production' --AND Table_Name = 'Products' AND user_type IN ('bit')
Notes:
1. The problem with actual form of the functions is that when the query returns a null value, the database engine throws the following error message:
Msg 6522, Level 16, State 1, Line 2
A .NET Framework error occurred during execution of user-defined routine or aggregate "ExecuteScalarToDate":
System.InvalidCastException: Conversion from type 'DBNull' to type 'Date' is not valid.
System.InvalidCastException:
at Microsoft.VisualBasic.CompilerServices.Conversions.ToDate(Object Value)
at SqlServerProject1.UserDefinedFunctions.ExecuteScalarToDate(String SqlScript)
The error could be avoided by handling the NULL values with a default value, for example 0 for numeric values, the empty string for string values, a minimal value date for date values, and 0 for bit values. If for numeric and string value this translation has a minimum of impact, for date and 0 values kind of reduces from functionality.
2. SQL Server 2008 provides also a Data Profiling Task as part of its SSIS solution, quite easy to use, however in comparison with it, the above functions allow to profile not only a table but a set of tables, the statistics about attributes could be brought easily together in a tabular format and the logic could be further encapsulated in a UDF or view for reuse.
Previous Post <<||>> Next Post