
25 December 2024

💎🏭SQL Reloaded: Number of Records IV (via sys.partitions)

To get the exact number of records in a table one can use the COUNT (see post) or the more recent COUNT_BIG function [2], though for big tables this can be an expensive operation for the database engine:

-- number of records via COUNT
SELECT count(*) row_count
FROM SalesLT.Product
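
COUNT_BIG behaves the same way but returns a bigint, which matters for tables with more than 2^31-1 records:

-- number of records via COUNT_BIG
SELECT count_big(*) row_count
FROM SalesLT.Product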

Moreover, sometimes the operation needs to be repeated for a number of tables, e.g. the dropdown tables in Dynamics 365 for Finance and Operations (D365 F&O). Writing the query as a UNION allows one to export the data as a single table and do comparisons (e.g. in Excel). The same approach can be used when multiple columns are used for grouping, though one must account for the additional columns in the other subqueries (see the sketch after the next query). However, the more tables are involved, the more difficult it becomes to maintain the query over time. 
 
-- number of records via COUNT for multiple tables
SELECT 'SalesLT.Product' table_name
, count(*) row_count
FROM SalesLT.Product
UNION ALL
SELECT 'SalesLT.ProductDescription' table_name
, count(*) row_count
FROM SalesLT.ProductDescription
UNION ALL
SELECT 'SalesLT.ProductModel' table_name
, count(*) row_count
FROM SalesLT.ProductModel
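
For example, a sketch of the grouping variant, assuming the Product table's Color column is of interest; subqueries for tables that lack the column must supply a placeholder:

-- number of records with a grouping column
SELECT 'SalesLT.Product' table_name
, Color
, count(*) row_count
FROM SalesLT.Product
GROUP BY Color
UNION ALL
SELECT 'SalesLT.ProductModel' table_name
, NULL Color -- placeholder for the missing column
, count(*) row_count
FROM SalesLT.ProductModel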

There are many scenarios in which an approximate number of the records available in a table is enough and doing a record count might prove to be too expensive. For a quick and dirty solution one can use the sys.partitions catalog view instead:

-- number of records via sys.partitions for a single object
SELECT object_id
, OBJECT_NAME(object_id) object_name
, OBJECT_SCHEMA_NAME(object_id) schema_name
, SUM(Rows) AS row_count
, data_compression_desc AS compression_type
, COUNT(*) partitions_count
FROM sys.partitions 
WHERE index_id < 2 --ignore the partitions from the non-clustered index if any
  AND OBJECT_ID('SalesLT.Product') = object_id
GROUP BY object_id
, data_compression_desc
ORDER BY row_count DESC;

The query is based on the sys.partitions catalog view [1], which contains a row for each partition of all the tables and most types of indexes in the database. The documentation mentions that "rows" indicates only the approximate number of rows in the considered partition.

Alternatively, one can bring more tables into the query to extend its range of applicability. 

-- number of records via DMVs
SELECT S.name + '.' + T.name SearchName
, S.Name SchemaName
, T.name TableName
, P.row_count
, P.compression_type
, P.partitions_count
FROM sys.tables T
     LEFT JOIN (
        SELECT object_id
        , SUM(Rows) AS row_count
        , data_compression_desc AS compression_type
        , COUNT(*) partitions_count
        FROM sys.partitions 
        WHERE index_id < 2 --ignore the partitions from the non-clustered index if any
        --AND OBJECT_ID('SalesLT.Product') = object_id
        GROUP BY object_id
        , data_compression_desc
     ) P
    ON T.object_id = P.object_id
     JOIN sys.schemas as S
	   on S.schema_id = T.schema_id
WHERE S.Name = 'SalesLT'
  AND T.Name LIKE 'Product%'
ORDER BY row_count DESC;

The data can be exported regularly to give an idea of how the tables' cardinality changes over time. One can find this useful as part of the loading process in data warehouses or other solutions (e.g. data migrations). 
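
A minimal sketch of such a snapshot, assuming a hypothetical dbo.TableRowCounts history table:

-- snapshotting the record counts over time (dbo.TableRowCounts is a hypothetical history table)
-- CREATE TABLE dbo.TableRowCounts (snapshot_date datetime, schema_name sysname, table_name sysname, row_count bigint)
INSERT INTO dbo.TableRowCounts (snapshot_date, schema_name, table_name, row_count)
SELECT GETDATE()
, S.name
, T.name
, SUM(P.rows)
FROM sys.tables T
     JOIN sys.schemas S
       ON S.schema_id = T.schema_id
     JOIN sys.partitions P
       ON P.object_id = T.object_id
      AND P.index_id < 2 --ignore the partitions from the non-clustered indexes, if any
GROUP BY S.name
, T.name;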

By adding a filter on the aggregated counts (e.g. WHERE P.row_count > 0), or by using an INNER JOIN instead of the LEFT JOIN together with such a filter, one can retrieve only the tables that actually have records. 

One should consider only the tables in scope, and possibly remove the records associated with the system objects (e.g. from the sys or INFORMATION_SCHEMA schemas, as the case may be).

 -- constraints to be added in the WHERE clause to remove the records related to system objects
 AND OBJECT_NAME(object_id) NOT LIKE 'sys%'
 AND OBJECT_NAME(object_id) NOT LIKE 'queue_%' 
 AND OBJECT_NAME(object_id) NOT LIKE 'filestream_tombstone%' 

There are also scenarios in which the count is needed only for a subset of the data. It's the case of D365 F&O, in which the number of records is needed by DataAreaId (aka company) or another field. A solution can be built using the sp_MSForEachTable stored procedure (see the last query from this post) or a cursor.
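
A minimal sketch of the per-company counts, assuming every table in scope has a DataAreaId column (tables without it will raise an error):

-- number of records by company via sp_MSForEachTable
EXEC sp_MSForEachTable @command1 = 'SELECT ''?'' [Table], DataAreaId, COUNT(*) NoRecords FROM ? GROUP BY DataAreaId'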

Notes:
The code used in this post is also available in the GitHub repository.

Happy coding and Merry Christmas!


Resources:
[1] Microsoft Learn (2024) sys.partitions (Transact-SQL) [link]
[2] Microsoft Learn (2024) COUNT_BIG (Transact-SQL) [link]

27 June 2020

💠🛠️SQL Server: Undocumented (Part II: Execute Command for Each Table)

One neat undocumented feature useful in the administration of SQL Server is the sp_MSforeachtable stored procedure, which allows executing a set of commands against each table available in the current database or, with a few changes, against a set of predefined tables. It allows running up to 3 commands as part of the main processing, as well as a pre-command, typically used to set environment variables or perform initializations, and a post-command, typically used for control and cleanup processes. It thus replaces the need to run the commands within a cursor, even if the latter can occasionally provide more flexibility (see example).

The stored procedure provides the following parameters:
@command1, @command2, @command3: the main commands to be executed (at least the first command must be provided)
@precommand: a command to be executed before @command1
@postcommand: a command to be executed after all the commands were executed successfully
@replacechar: the character used to represent the table within the commands, by default the question mark (?) character
@whereand: used to limit the scope only to a set of tables 
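
A minimal sketch combining the pre- and post-commands, using a global temporary table (a local one would not survive the inner execution scope):

-- collecting the number of records for all tables via pre- and post-commands
EXEC sp_MSforeachtable @precommand = 'CREATE TABLE ##Counts (TableName sysname, NoRecords int)'
, @command1 = 'INSERT INTO ##Counts SELECT ''?'', COUNT(*) FROM ?'
, @postcommand = 'SELECT * FROM ##Counts ORDER BY NoRecords DESC; DROP TABLE ##Counts'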

Before the introduction of Dynamic Management Views with SQL Server 2005, the stored procedure was used to get the number of records for all tables:

-- getting the number of records for each table
EXEC sp_MSForEachTable @command1='SELECT ''?'' [Table], COUNT(*) NoRecords FROM ?'

To restrict running the command(s) to a single table, the Object_id function can be used within the @whereand parameter:

-- getting the number of records for a table
EXEC sp_MSforeachtable @command1='SELECT ''?'' [Table], COUNT(*) NoRecords FROM ?'
      , @whereand = ' And Object_id = Object_id(''[Person].[Address]'')'

In AX 2009 and even the later versions, including Dynamics 365 (as long as direct access to the database is possible), the number of records can be restricted to a given company (aka business unit):

-- getting the number of records from a mandant
EXEC sp_MSforEachTable @command1 = 'SELECT ''?'' [Table], COUNT(*) NoRecords FROM ? WHERE DataAreaId IN (''DAT'')'

The procedure can also be used to perform CRUD operations on each table. During a data migration it was possible to clean out the business units not needed with just a simple script:

--delete the data for multiple data areas
EXEC sp_MSforEachTable @command1 = 'DELETE FROM ? WHERE DataAreaId IN (''m01'')'

Administration tasks can be performed as well, e.g. updating statistics, performing consistency checks for each table, etc.:

-- for each table 
EXEC sp_MSforeachtable 'EXEC sp_spaceused ''?''' -- displaying the disk space reserved/used
EXEC sp_MSforeachtable 'UPDATE STATISTICS ? WITH ALL' -- updating statistics
EXEC sp_MSforeachtable 'DBCC CHECKTABLE (''?'') WITH ALL_ERRORMSGS, NO_INFOMSGS' -- checking table consistency 

A set of tasks can be performed for a single table by providing the constraint within the @whereand parameter:

-- executing multiple commands for a table
EXEC sp_MSforeachtable @command1='UPDATE statistics ? WITH ALL' -- updating statistics
   , @command2= 'DBCC CHECKTABLE (''?'') WITH ALL_ERRORMSGS, NO_INFOMSGS' -- checking table consistency 
   , @whereand = ' And Object_Name(Object_id) = ''[Person].[Address]'''

Executing the commands against a subset of tables involves using a persisted or temporary table to store the names of the tables which should be considered, possibly with further metadata to allow filtering:

-- dropping the table
-- DROP TABLE IF EXISTS #Tables

-- create a temporary table
SELECT TableName
INTO #Tables 
FROM (VALUES ('[Person].[Address]')
, ('[Person].[AddressType]')
, ('[Person].[BusinessEntity]')) DAT(TableName)

-- getting the number of records for the list of tables
EXEC sp_MSForEachTable @command1='SELECT ''?'' [Table], COUNT(*) NoRecords FROM ?'
, @whereand = ' And Object_id In (Select Object_id(TableName) FROM #Tables)'

Warnings:
1) Do not forget to drop the temporary table when finished!
2) The code is provided only for exemplification purposes. Use the above code at your own risk!
3) Undocumented features can be deprecated in future versions of SQL Server, therefore they should be used with caution in long-term solutions.

Happy coding!

09 April 2011

💎SQL Reloaded: Searching a Value within a Database

Some years back, in a Management meeting related to one of the projects I was working on, one of the Managers made a quite interesting statement, something like: "it should be easy to pull into an Excel sheet all the information from the whole database that I am interested in". I kind of smiled, because the information he required spanned different modules and had different levels of detail, thus requiring several different output datasets in order to answer a simple question like "how is the manufacturing cost varying over time?". After some meetings we figured out that it's not so easy to pull the data for the respective question, as some of the needed data were not available in the respective database. It's funny how IS (Information Systems) get more complex day by day but fail to capture and indirectly answer simple questions, also because they were not designed to address such questions. Anyway, this post is not about this.

Some days back I spent a couple of hours trying to identify the tables from which the IDs stored in a log were coming, the log holding no reference to the table, just the IDs. Another example of "good" design! While working diligently, I remembered that I could have easily solved the problem with one of the CLR functions created in a post on Data Profiling Using CLR Functions. Unfortunately, because I was working in a production environment with strong security constraints, I wasn't allowed to register and use any of the respective functions. Still, considering the ease and flexibility of the solution, I thought it would be interesting to put it into practice. Who knows, maybe someday I will find the opportunity to use it!

Let's suppose I am searching for an integer ID value within a given schema of the AdventureWorks database. So, in theory, I need to do a search on each integer column of all the tables from the respective schema. The tables can be found in a matter of seconds using the following query, based on the dbo.ExecuteScalarToInt CLR function introduced in the previously mentioned post and the dbo.vTableColumns view.
 
-- searching a value in whole database 
WITH CTE 
AS (  
    SELECT [Schema_Name]  
    , Table_Name  
    , Column_Name 
    , system_type 
    , dbo.ExecuteScalarToInt('SELECT count(*) FROM ' + Quotename([Schema_Name]) + '.' + Quotename(Table_Name) + ' WHERE ' + Quotename(Column_Name) + ' = ' + Cast(1001 as varchar(20))) NumberRecords 
    FROM dbo.vTableColumns 
    WHERE [Schema_Name]= 'Sales' 
     --AND Table_Name = 'SalesOrderDetail' 
     --AND Column_Name = 'SalesOrderDetailID' 
     AND system_type = 'int' ) 
SELECT * 
FROM CTE 
WHERE NumberRecords>0 
ORDER BY [Schema_Name] 
, Table_Name  

Output:
Schema_Name  Table_Name         Column_Name         system_type  NumberRecords
Sales        ContactCreditCard  CreditCardId        int          1
Sales        CreditCard         CreditCardId        int          1
Sales        CurrencyRate       CurrencyRateId      int          1
Sales        CustomerAddress    AddressId           int          1
Sales        SalesOrderDetail   SalesOrderDetailId  int          1
Sales        SalesOrderHeader   BillToAddressId     int          8
Sales        SalesOrderHeader   CreditCardId        int          1
Sales        SalesOrderHeader   ShipToAddressId     int          8

As can be seen, the dynamically built query, based on the table metadata and the searched value, is provided as parameter to the dbo.ExecuteScalarToInt function. The query calculates the number of occurrences (see the NumberRecords column) of the searched value within a given column. Some attention must be given to the searched column's data type, as it needs to be compatible with the searched value, otherwise a cast is necessary. I haven't tried the query on a big database and I would advise you to be cautious when running it in a production environment. In case you get to use the query on a big database, please let me know how it behaved!
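
For string columns, for example, the searched value must be quoted and the type filter adapted; a sketch, using a hypothetical searched value:

-- searching a string value within the schema's string columns
SELECT [Schema_Name]
, Table_Name
, Column_Name
, dbo.ExecuteScalarToInt('SELECT count(*) FROM ' + Quotename([Schema_Name]) + '.' + Quotename(Table_Name) + ' WHERE ' + Quotename(Column_Name) + ' = ''AW00000001''') NumberRecords
FROM dbo.vTableColumns
WHERE [Schema_Name] = 'Sales'
  AND system_type IN ('varchar', 'nvarchar')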

In the case of the problem I had a few days back, the above query is only a first step in identifying the tables in which the values occur. Because multiple tables are returned, one needs to investigate which of the tables has the highest probability of being the source for the IDs saved in the log. Of course, some DM (data mining) algorithm could probably be used for this purpose, however some programming is needed and also some amount of diligent work to validate the findings. It would be nice to have a tool that performs such analysis or that creates a complex query using any set of attributes within a given database.

17 July 2010

💎SQL Reloaded: Number of Records III (The DMV Approach)

In SQL Server 2000 the safest way to return the total number of records from a table was to do a simple COUNT. I say the safest way because (for big tables) there was a faster but approximate way to obtain the same information by using the sysindexes table, which, as its name denotes, stored one record for each index and table in a given database. When I started to use SQL Server 2005 and later 2008, I naturally followed the same approach, for several reasons not attempting to find out whether there is a better alternative. Yesterday, while roughly scanning the titles from the MSDN blogs, my attention was caught by Martinjnh's post SQL Server–HOW-TO: quickly retrieve accurate row count for table, in which, in addition to COUNT and the sys.sysindexes view kept for compatibility reasons, he gives two other alternatives that use the sys.partitions and sys.dm_db_partition_stats views, which contain "a row for each partition of all the tables and most types of indexes in the database", respectively the "page and row-count information for every partition in the current database".
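
For comparison, a sketch of the sys.dm_db_partition_stats variant, aggregating over the heap/clustered index partitions:

-- number of records via sys.dm_db_partition_stats
SELECT OBJECT_SCHEMA_NAME(object_id) schema_name
, OBJECT_NAME(object_id) table_name
, SUM(row_count) row_count
FROM sys.dm_db_partition_stats
WHERE index_id < 2 --ignore the partitions from the non-clustered indexes, if any
GROUP BY object_id
ORDER BY row_count DESC;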

Of course, I tested the queries and their accuracy by inserting/deleting a few records in one of the AdventureWorks tables. The "statistics" seem to be updated in real time, which is also confirmed by the SQL Server 2005 documentation. I wanted to see the definition of the respective views, though they are not browsable by using the Management Studio. Some time ago I discovered that the definitions of system views, functions and stored procedures are available through the sys.all_sql_modules view, the union between the sys.sql_modules and sys.system_sql_modules views. Here's the query I used:

-- retrieving the definition of system views used to return the number of records 
SELECT * 
FROM master.sys.all_sql_modules 
WHERE CHARINDEX ('sys.dm_db_partition_stats', definition)>0 
      OR CHARINDEX ('sys.sysindexes', definition)>0 
       OR CHARINDEX ('sys.partitions', definition)>0 

Notes:
1.   Another way to get the definition of an object is to use the Object_Definition function:
SELECT object_definition(object_id('sys.sysindexes')) 

2.   Following a post by Mladen Prajdic, an SQL Server MVP, I found out that all the system objects are stored in the (hidden) read-only mssqlsystemresource database that complements the master database.
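
Though the database itself is hidden, a few details about it can be retrieved via SERVERPROPERTY:

-- retrieving the resource database's version and last update
SELECT SERVERPROPERTY('ResourceVersion') ResourceVersion
, SERVERPROPERTY('ResourceLastUpdateDateTime') ResourceLastUpdateDateTime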

3.   Diving into the three views' definitions, it can be seen that all of them use the OpenRowSet function in order to make calls to the INDEPROP, PARTITIONCOUNTS and ALUCOUNT internal SQL Server tables. As it seems, this functionality of the OpenRowSet function can't be (mis-)used by users.

4. I found several interesting posts on sys.dm_db_partition_stats, for example a post by Cindy Gross, a Support Engineer at Microsoft, whose blog contains several scripts and links to more resources.

04 July 2010

🗄️Data Management: Data Profiling II (Using CLR functions)

Data Management
Data Management Series

In the previous post on Data Profiling I introduced basic data profiling techniques, the queries used for exemplification being quite simple. Probably many people will ask themselves whether it is possible to automate data profiling for a whole range of tables/datasets and the attributes they contain. The answer relies entirely on a database's capability of running dynamic scripts in an elegant manner – many such solutions rely on the use of cursors in which the dynamically created scripts are run by a stored procedure (e.g. EXEC or sp_executesql in SQL Server). There are also solutions in which the dynamic queries are created, run and managed by third-party tools. In what concerns SQL Server, the CLR functions could prove to be an elegant solution for this purpose.
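
For comparison, a minimal sketch of the sp_executesql approach, shown for a single table:

-- dynamic record count via sp_executesql
DECLARE @sql nvarchar(max)
, @NumberRecords int;

SET @sql = N'SELECT @NumberRecords = count(*) FROM ' + QUOTENAME('Production') + '.' + QUOTENAME('Product');

EXEC sp_executesql @sql
, N'@NumberRecords int OUTPUT'
, @NumberRecords = @NumberRecords OUTPUT;

SELECT @NumberRecords NumberRecords;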

In a previous post, "Number of Records – The CLR Version", I described the steps I followed in order to create a CLR function that gets the number of records for a set of tables, the name of the table being provided as parameter. Instead of giving as parameter only the name of the table, the whole query string could be provided instead, thus making it possible to run any query that returns a scalar. 

The problem is that a CLR function returns a value only of a certain data type, thus we could either transform all the values to strings or create a function for each data type. I will try to describe both approaches in this post, also because initially I thought that the second solution is more practical and probably easier to debug, though, as I discovered, the first technique is more flexible, allowing to encapsulate the logic in a single query.

Let's consider the following functions that could be added in a new Database Project or in the project created in the previous post on CLR functions. As mentioned above, a function is created for each important data type, thus ExecuteScalarToInt returns an integer value, ExecuteScalarToDate a date value, ExecuteScalarToString a string value, ExecuteScalarToDecimal a decimal value, and ExecuteScalarToBoolean a boolean value:

Imports System.Data.SqlClient
Imports Microsoft.SqlServer.Server

Partial Public Class UserDefinedFunctions

    ' runs the given query and returns its scalar result as an integer
    <SqlFunction(DataAccess:=DataAccessKind.Read, SystemDataAccess:=SystemDataAccessKind.Read)> _
    Public Shared Function ExecuteScalarToInt(ByVal SqlScript As String) As Integer
        Using conn As New SqlConnection("context connection=true")
            conn.Open()
            Dim cmd As New SqlCommand(SqlScript, conn)
            Return CType(cmd.ExecuteScalar(), Integer)
        End Using
    End Function

    ' runs the given query and returns its scalar result as a date
    <SqlFunction(DataAccess:=DataAccessKind.Read, SystemDataAccess:=SystemDataAccessKind.Read)> _
    Public Shared Function ExecuteScalarToDate(ByVal SqlScript As String) As Date
        Using conn As New SqlConnection("context connection=true")
            conn.Open()
            Dim cmd As New SqlCommand(SqlScript, conn)
            Return CType(cmd.ExecuteScalar(), Date)
        End Using
    End Function

    ' runs the given query and returns its scalar result as a string
    <SqlFunction(DataAccess:=DataAccessKind.Read, SystemDataAccess:=SystemDataAccessKind.Read)> _
    Public Shared Function ExecuteScalarToString(ByVal SqlScript As String) As String
        Using conn As New SqlConnection("context connection=true")
            conn.Open()
            Dim cmd As New SqlCommand(SqlScript, conn)
            Return CType(cmd.ExecuteScalar(), String)
        End Using
    End Function

    ' runs the given query and returns its scalar result as a decimal
    <SqlFunction(DataAccess:=DataAccessKind.Read, SystemDataAccess:=SystemDataAccessKind.Read)> _
    Public Shared Function ExecuteScalarToDecimal(ByVal SqlScript As String) As Decimal
        Using conn As New SqlConnection("context connection=true")
            conn.Open()
            Dim cmd As New SqlCommand(SqlScript, conn)
            Return CType(cmd.ExecuteScalar(), Decimal)
        End Using
    End Function

    ' runs the given query and returns its scalar result as a boolean
    <SqlFunction(DataAccess:=DataAccessKind.Read, SystemDataAccess:=SystemDataAccessKind.Read)> _
    Public Shared Function ExecuteScalarToBoolean(ByVal SqlScript As String) As Boolean
        Using conn As New SqlConnection("context connection=true")
            conn.Open()
            Dim cmd As New SqlCommand(SqlScript, conn)
            Return CType(cmd.ExecuteScalar(), Boolean)
        End Using
    End Function

End Class

After compiling the project and re-linking the assembly, the following functions must be declared:

-- returns int values 
CREATE FUNCTION dbo.ExecuteScalarToInt(@sql nvarchar(max)) 
RETURNS int  
AS  
EXTERNAL NAME SqlServerProject1.[SqlServerProject1.UserDefinedFunctions].ExecuteScalarToInt 
GO 

-- returns decimal values 
CREATE FUNCTION dbo.ExecuteScalarToDecimal(@sql nvarchar(max)) 
RETURNS decimal(18,2)  
AS  
EXTERNAL NAME SqlServerProject1.[SqlServerProject1.UserDefinedFunctions].ExecuteScalarToDecimal 
GO 

-- returns nvarchar values 
CREATE FUNCTION dbo.ExecuteScalarToString(@sql nvarchar(max)) 
RETURNS nvarchar(max)  
AS  
EXTERNAL NAME SqlServerProject1.[SqlServerProject1.UserDefinedFunctions].ExecuteScalarToString 
GO 

-- returns date values 
CREATE FUNCTION dbo.ExecuteScalarToDate(@sql nvarchar(max)) 
RETURNS date  
AS  
EXTERNAL NAME SqlServerProject1.[SqlServerProject1.UserDefinedFunctions].ExecuteScalarToDate 
GO 

-- returns boolean values 
CREATE FUNCTION dbo.ExecuteScalarToBoolean(@sql nvarchar(max)) 
RETURNS bit  
AS  
EXTERNAL NAME SqlServerProject1.[SqlServerProject1.UserDefinedFunctions].ExecuteScalarToBoolean 
GO 


-- testing the functions    
SELECT dbo.ExecuteScalarToInt('SELECT count(*) FROM Production.Product') NumberRecords 
SELECT dbo.ExecuteScalarToString('SELECT Max(ProductNumber) FROM Production.Product') GreatestStringValue 
SELECT dbo.ExecuteScalarToString('SELECT Max(ListPrice) FROM Production.Product') GreatestNumericValue 
SELECT dbo.ExecuteScalarToDate('SELECT Max(SellStartDate) FROM Production.Product') GreatestDateValue 
SELECT dbo.ExecuteScalarToString('SELECT CASE WHEN count(1)>0 THEN 1 ELSE 0 END FROM Production.Product') HasRecords
 
Data Profiling - testing CLR functions

The first example that returns the number of records can be easily adapted in order to return the respective value for a given set of tables:

-- number of records 
SELECT S.name [schema_name] 
, T.name table_name 
, dbo.ExecuteScalarToInt('SELECT count(*) NumberRecords FROM AdventureWorks.' + S.name + '.' + T.name) NumberRecords 
FROM AdventureWorks.sys.tables T 
    JOIN AdventureWorks.sys.schemas S 
      ON T.schema_id = S.schema_id 
WHERE S.name = 'Sales' 
ORDER BY S.name  
, T.name 
 
Data Profiling - number of records

The attribute-related metrics require a lower level of detail, the sys.columns table being added to the above query; for this purpose I will use the dbo.vTableColumns view created in the previous post on data profiling. And here's the query based on the AdventureWorks database:

-- data profiling attribute level 
SELECT [Schema_Name]  
, Table_Name  
, Column_Name  
, user_type  
, system_type  
-- number distinct values 
 , CASE  
    WHEN system_type NOT IN ('xml') THEN dbo.ExecuteScalarToInt('SELECT count(DISTINCT [' + Column_Name + ']) NumberRecords FROM AdventureWorks.' + [Schema_Name] + '.' + Table_Name)  
     ELSE -1 
END NumberDistinctValues 
-- number/percentage not null values 
, dbo.ExecuteScalarToInt('SELECT count(1) NumberRecords FROM AdventureWorks.' + [Schema_Name] + '.' + Table_Name + ' WHERE [' + Column_Name + '] IS NOT NULL') NumberNotNullValues  
, CASE  
    WHEN system_type NOT IN ('xml') THEN dbo.ExecuteScalarToDecimal('SELECT Cast(CASE WHEN count(1)>0 THEN Cast(count([' + Column_Name + ']) as decimal(18,2))/count(1) ELSE 0 END as decimal(5,2)) PercentageNotNullValues FROM AdventureWorks.' + [Schema_Name] + '.' + Table_Name) 
    ELSE -1 
END PercentageNotNullValues 
-- min/max length of values 
, CASE  
    WHEN system_type NOT IN ('xml') THEN dbo.ExecuteScalarToInt('SELECT IsNull(Min(Len([' + Column_Name + '])), 0) MinLengthValue FROM AdventureWorks.' + [Schema_Name] + '.' + Table_Name) 
    ELSE -1 
END MinLengthValue 
, CASE  
     WHEN system_type NOT IN ('xml') THEN dbo.ExecuteScalarToInt('SELECT IsNull(Max(Len([' + Column_Name + '])), 0) MaxLengthValue FROM AdventureWorks.' + [Schema_Name] + '.' + Table_Name) 
    ELSE -1 
END MaxLengthValue 
-- Min/Max values 
, CASE  
    WHEN system_type NOT IN ('varbinary', 'uniqueidentifier', 'xml', 'sysname') THEN  
dbo.ExecuteScalarToString('SELECT IsNull(Cast(Min('  
    + CASE WHEN system_type = 'bit' THEN 'Cast(' ELSE '' END 
    + '[' + Column_Name + ']' 
    + CASE WHEN system_type = 'bit' THEN ' as smallint)' ELSE '' END 
     + ') as nvarchar(max)), ''NULL'') MinValue FROM AdventureWorks.' 
     + [Schema_Name] + '.' + Table_Name) 
   ELSE CAST('' as nvarchar(max))  
END MinValue 
, CASE  
     WHEN system_type NOT IN ('varbinary', 'uniqueidentifier', 'xml', 'sysname') THEN  
dbo.ExecuteScalarToString('SELECT IsNull(Cast(Max('  
    + CASE WHEN system_type = 'bit' THEN 'Cast(' ELSE '' END 
    + '[' + Column_Name + ']' 
    + CASE WHEN system_type = 'bit' THEN ' as smallint)' ELSE '' END 
+ ') as nvarchar(max)), ''NULL'') MaxValue FROM AdventureWorks.'  
    + [Schema_Name] + '.' + Table_Name )  
    ELSE CAST('' as nvarchar(max))  
END MaxValue  
FROM dbo.vTableColumns 
WHERE [Schema_Name] = 'Production' 
AND Table_Name = 'Product' 
ORDER BY [Schema_Name]  
, Table_Name  
, Column_Name  
, column_id 

Here's the output; the query ran in about 30 seconds for all of the AdventureWorks database's tables:
  
Data Profiling - attribute level

Most likely, the above query requires some explanations. First of all, some CASEs have been introduced in order to treat the exceptions, given the fact that not all data types support all the aggregate functions, while the xml data type doesn't support any aggregate function at all. Hopefully I treated all the scenarios; if not, the query can be easily updated. In other cases I applied a simple conversion, as in the case of the bit data type cast to a smallint. In case the dynamic queries return a value other than integer, the output value is cast to nvarchar(max). Most probably, for an easier understanding of the dynamic queries a review of the corresponding static queries presented in the previous post is also necessary.

In case we use a specific function for each data type, then we either choose to work with a query for each data type, or, in case we want to use a UNION (ALL), we'll have to cast the output to nvarchar. So whatever method we choose, in order to create only one dataset out of our profiling implementation, a cast to nvarchar is required. I will limit myself to providing the queries for each data type, more for exemplifying the use of the above CLR functions:

-- getting Min/Max integer values 
SELECT [Schema_Name] 
, Table_Name  
, Column_Name 
, user_type 
, dbo.ExecuteScalarToInt('SELECT IsNull(Min(' + Column_Name + '), 0) MinValue FROM AdventureWorks.' + [Schema_Name] + '.' + Table_Name ) MinValue 
, dbo.ExecuteScalarToInt('SELECT IsNull(Max(' + Column_Name + '), 0) MaxValue FROM AdventureWorks.' + [Schema_Name] + '.' + Table_Name ) MaxValue 
FROM dbo.vTableColumns 
WHERE [Schema_Name] = 'Production' 
AND Table_Name = 'Product' 
AND user_type IN ('int', 'bigint') 

-- getting Min/Max String values 
SELECT [Schema_Name] 
, Table_Name  
, Column_Name 
, user_type 
, dbo.ExecuteScalarToString('SELECT IsNull(Min(' + Column_Name + '), '''') MinValue FROM AdventureWorks.' + [Schema_Name] + '.' + Table_Name ) MinValue 
, dbo.ExecuteScalarToString('SELECT IsNull(Max(' + Column_Name + '), '''') MaxValue FROM AdventureWorks.' + [Schema_Name] + '.' + Table_Name ) MaxValue 
FROM dbo.vTableColumns 
WHERE [Schema_Name] = 'Production' AND Table_Name = 'Product' 
AND user_type IN ('nvarchar', 'varchar', 'char', 'nchar') 

-- getting Min/Max Date values 
SELECT [Schema_Name] 
, Table_Name  
, Column_Name 
, user_type 
, dbo.ExecuteScalarToDate('SELECT IsNull(Min(' + Column_Name + '), Cast(''1800-01-01'' as date)) MinDateValue FROM AdventureWorks.' + [Schema_Name] + '.' + Table_Name ) MinDateValue 
, dbo.ExecuteScalarToDate('SELECT IsNull(Max(' + Column_Name + '), Cast(''1800-01-01'' as date)) MaxDateValue FROM AdventureWorks.' + [Schema_Name] + '.' + Table_Name ) MaxDateValue 
FROM dbo.vTableColumns 
WHERE [Schema_Name] = 'Production' 
AND Table_Name = 'Product' 
AND user_type IN ('datetime', 'date') 

-- getting Min/Max Boolean values 
SELECT [Schema_Name] 
, Table_Name  
, Column_Name 
, user_type 
, dbo.ExecuteScalarToBoolean('SELECT IsNull(Min(Cast(' + Column_Name + ' as smallint)), 0) MinValue FROM AdventureWorks.' + [Schema_Name] + '.' + Table_Name ) MinValue 
, dbo.ExecuteScalarToBoolean('SELECT IsNull(Max(Cast(' + Column_Name + ' as smallint)), 0) MaxValue FROM AdventureWorks.' + [Schema_Name] + '.' + Table_Name ) MaxValue 
FROM dbo.vTableColumns 
WHERE [Schema_Name] = 'Production' 
--AND Table_Name = 'Products' 
AND user_type IN ('bit') 

Notes:
1.   The problem with the actual form of the functions is that when the query returns a null value, the database engine throws the following error message:
Msg 6522, Level 16, State 1, Line 2
A .NET Framework error occurred during execution of user-defined routine or aggregate "ExecuteScalarToDate":
System.InvalidCastException: Conversion from type 'DBNull' to type 'Date' is not valid.
System.InvalidCastException:
   at Microsoft.VisualBasic.CompilerServices.Conversions.ToDate(Object Value)
   at SqlServerProject1.UserDefinedFunctions.ExecuteScalarToDate(String SqlScript)

    
The error can be avoided by handling the NULL values with a default value, for example 0 for numeric values, the empty string for string values, a minimal date for date values, and 0 for bit values. While for numeric and string values this translation has minimal impact, for date and bit values it somewhat reduces the functionality.
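
A sketch of such a guard applied inside the dynamic query itself (SellEndDate contains NULL values in AdventureWorks):

-- handling NULLs inside the dynamic query with a default date
SELECT dbo.ExecuteScalarToDate('SELECT IsNull(Max(SellEndDate), Cast(''1800-01-01'' as date)) FROM Production.Product') GreatestEndDate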

2. SQL Server 2008 also provides a Data Profiling Task as part of its SSIS solution, quite easy to use; however, in comparison with it, the above functions allow profiling not only a table but a set of tables, the statistics about attributes can easily be brought together in a tabular format, and the logic can be further encapsulated in a UDF or view for reuse.

