04 July 2010

🗄️Data Management: Data Profiling II (Using CLR functions)

Data Management
Data Management Series

In the previous post on Data Profiling I introduced basic data profiling techniques, the queries used for exemplification being quite simple. Probably many people will ask themselves on whether is possible to automate data profiling for a whole range of tables/datasets and the attributes they contain. The answer is simple, this relies entirely on databases’ capabilities of running dynamic scripts in an elegant manner – many such solutions relying on the use of cursors in which the dynamic created scripts are run by a stored procedure (e.g. EXEC or sp_executesql in SQL Server). There are also solution in which the dynamic queries are created, run and managed by third-party tools. In what concerns SQL Server, the CLR functions could prove to be an elegant solution for this purpose.

In a previous post, “Number of Records – The CLR Version”, I described the steps I followed in order to create a CLR function in order to get the number of records for a set of tables, the name of the table being provided as parameter. Instead of giving as parameter only the name of the table it could be provided instead the whole query string, thus having the possibility of running multiple types of query that return only a scalar. 

The problem is that the CLR function returns a value only of a certain data type, thus we could transform all the values to string or create a function for each data type. I will try to describe in this post both approaches, a reason for that being also the fact that initially I thought that the second solution is more practical, and probably easier to debug, though as I discovered the first technique is more flexible allowing to encapsulate the logic in a single query.

Let’s consider the following functions that could be added in a new Database Project or in the project created in the previous post on CLR functions. As mentioned above, is created a function for each important data type, thus ExecuteScalarToInt returns an integer value, ExecuteScalarToDate a date value, ExecuteScalarToString a string value, ExecuteScalarToDecimal a decimal value, amd ExecuteScalarToBoolean a boolean value:

Partial Public Class UserDefinedFunctions 
<SqlFunction(DataAccess:=DataAccessKind.Read, SystemDataAccess:=SystemDataAccessKind.Read)> _ Public Shared Function ExecuteScalarToInt(ByVal SqlScript As String) As Integer     
Using conn As New SqlConnection("context connection=true") 
conn.Open()     
Dim cmd As New SqlCommand(SqlScript, conn) 
Return CType(cmd.ExecuteScalar(), Integer)     
End Using End Function 

<SqlFunction(DataAccess:=DataAccessKind.Read, SystemDataAccess:=SystemDataAccessKind.Read)> _ Public Shared Function ExecuteScalarToDate(ByVal SqlScript As String) As Date     
Using conn As New SqlConnection("context connection=true") 
    conn.Open()     

    Dim cmd As New SqlCommand(SqlScript, conn)     

    Return CType(cmd.ExecuteScalar(), Date)     
End Using End Function 

<SqlFunction(DataAccess:=DataAccessKind.Read, SystemDataAccess:=SystemDataAccessKind.Read)> _ Public Shared Function ExecuteScalarToString(ByVal SqlScript As String) As String     
Using conn As New SqlConnection("context connection=true") 
    conn.Open()     

    Dim cmd As New SqlCommand(SqlScript, conn)   
  
    Return CType(cmd.ExecuteScalar(), String)     
End Using End Function 

<SqlFunction(DataAccess:=DataAccessKind.Read, SystemDataAccess:=SystemDataAccessKind.Read)> _ Public Shared Function ExecuteScalarToDecimal(ByVal SqlScript As String) As Decimal     
Using conn As New SqlConnection("context connection=true") 
    conn.Open()     

    Dim cmd As New SqlCommand(SqlScript, conn)     

    Return CType(cmd.ExecuteScalar(), Decimal)     
End Using End Function 

<SqlFunction(DataAccess:=DataAccessKind.Read, SystemDataAccess:=SystemDataAccessKind.Read)> _ Public 
Shared Function ExecuteScalarToBoolean(ByVal SqlScript As String) As Boolean     
Using conn As New SqlConnection("context connection=true") 
    conn.Open()     

    Dim cmd As New SqlCommand(SqlScript, conn)   
  
    Return CType(cmd.ExecuteScalar(), Boolean)     
End Using End Function 
End Class

After compiling the project and re-linking the assembly, must be declared the following functions:

-- returns int values 
CREATE FUNCTION dbo.ExecuteScalarToInt(@sql nvarchar(max)) 
RETURNS bigint  
AS  
EXTERNAL NAME SqlServerProject1.[SqlServerProject1.UserDefinedFunctions].ExecuteScalarToInt 
GO 

-- returns decimal values 
CREATE FUNCTION dbo.ExecuteScalarToDecimal(@sql nvarchar(max)) 
RETURNS decimal(18,2)  
AS  
EXTERNAL NAME SqlServerProject1.[SqlServerProject1.UserDefinedFunctions].ExecuteScalarToDecimal 
GO 

-- returns nvarchar values 
CREATE FUNCTION dbo.ExecuteScalarToString(@sql nvarchar(max)) 
RETURNS nvarchar(max)  
AS  
EXTERNAL NAME SqlServerProject1.[SqlServerProject1.UserDefinedFunctions].ExecuteScalarToString 
GO 

-- returns date values 
CREATE FUNCTION dbo.ExecuteScalarToDate(@sql nvarchar(max)) 
RETURNS date  
AS  
EXTERNAL NAME SqlServerProject1.[SqlServerProject1.UserDefinedFunctions].ExecuteScalarToDate 
GO 

-- returns boolean values 
CREATE FUNCTION dbo.ExecuteScalarToBoolean(@sql nvarchar(max)) 
RETURNS bit  
AS  
EXTERNAL NAME SqlServerProject1.[SqlServerProject1.UserDefinedFunctions].ExecuteScalarToBoolean 
GO 


-- testing the functions    
SELECT dbo.ExecuteScalarToInt('SELECT count(*) FROM Production.Product') NumberRecords 
SELECT dbo.ExecuteScalarToString('SELECT Max(ProductNumber) FROM Production.Product') GreatestStringValue 
SELECT dbo.ExecuteScalarToString('SELECT Max(ListPrice) FROM Production.Product') GreatestNumericValue 
SELECT dbo.ExecuteScalarToDate('SELECT Max(SellStartDate) FROM Production.Product') GreatestDateValue 
SELECT dbo.ExecuteScalarToString('SELECT CASE WHEN count(1)&gt;1 THEN 1 ELSE 0 END FROM Production.Product') HasRecords
 
Data Profiling - testing CLR functions

  The first example that returns the number of records could be easily adapted in order to return the respective value for a given set of tables:

-- number of records 
SELECT S.name [schema_name] 
, T.name table_name 
, dbo.ExecuteScalarToInt('SELECT count(*) NumberRecords FROM AdventureWorks.' + S.name + '.' + T.name) NumberRecords 
FROM AdventureWorks.sys.tables T 
    JOIN AdventureWorks.sys.schemas S 
      ON T.schema_id = S.schema_id 
WHERE S.name = 'Sales' 
ORDER BY S.name  
, T.name 
 
Data Profiling - number of records

The attribute-related metrics require a lower level of detail, to the above query adding the sys.colums table, for this purpose I will use the dbo.vTableColumns view created in the previous post on data profiling. And here’s the query based on AdventureWorks database:

-- data profiling attribute level SELECT[Schema_Name]  
, Table_Name  
, Column_Name  
, user_type  
, system_type  
-- number distinct values 
 , CASE  
    WHEN system_type NOT IN ('xml') THEN dbo.ExecuteScalarToInt('SELECT count(DISTINCT [' + Column_Name + ']) NumberRecords FROM AdventureWorks.' + [Schema_Name] + '.' + Table_Name)  
     ELSE -1 
END NumberDistinctValues 
-- number/percentage not null values 
, dbo.ExecuteScalarToInt('SELECT count(1) NumberRecords FROM AdventureWorks.' + [Schema_Name] + '.' + Table_Name + ' WHERE [' + Column_Name + '] IS NOT NULL') NumberNotNullValues  
, CASE  
    WHEN system_type NOT IN ('xml') THEN dbo.ExecuteScalarToDecimal('SELECT Cast(CASE WHEN count(1) &gt;0 THEN Cast(count([' + Column_Name + ']) as decimal(18,2))/count(1) ELSE 0 END as decimal(5,2)) PercentageNotNullValues FROM AdventureWorks.' + [Schema_Name] + '.' + Table_Name) 
    ELSE -1 
END PercentageNotNullValues 
-- min/max length of values 
, CASE  
    WHEN system_type NOT IN ('xml') THEN dbo.ExecuteScalarToInt('SELECT IsNull(Min(Len([' + Column_Name + '])), 0) MinLengthValue FROM AdventureWorks.' + [Schema_Name] + '.' + Table_Name) 
    ELSE -1 
END MinLengthValue 
, CASE  
     WHEN system_type NOT IN ('xml') THEN dbo.ExecuteScalarToInt('SELECT IsNull(Max(Len([' + Column_Name + '])), 0) MaxLengthValue FROM AdventureWorks.' + [Schema_Name] + '.' + Table_Name) 
    ELSE -1 
END MaxLengthValue 
-- Min/Max values , CASE  
    WHEN system_type NOT IN ('varbinary', 'uniqueidentifier', 'xml', 'sysname') THEN  
dbo.ExecuteScalarToString('SELECT IsNull(Cast(Min('  
    + CASE WHEN system_type = 'bit' THEN 'Cast(' ELSE '' END 
    + '[' + Column_Name + ']' 
    + CASE WHEN system_type = 'bit' THEN ' as smallint)' ELSE '' END 
     + ') as nvarchar(max)), ''NULL'') MinValue FROM AdventureWorks.' 
     + [Schema_Name] + '.' + Table_Name) 
   ELSE CAST('' as nvarchar(max))  
END MinValue 
, CASE  
     WHEN system_type NOT IN ('varbinary', 'uniqueidentifier', 'xml', 'sysname') THEN  
dbo.ExecuteScalarToString('SELECT IsNull(Cast(Max('  
    + CASE WHEN system_type = 'bit' THEN 'Cast(' ELSE '' END 
    + '[' + Column_Name + ']' 
    + CASE WHEN system_type = 'bit' THEN ' as smallint)' ELSE '' END 
+ ') as nvarchar(max)), ''NULL'') MaxValue FROM AdventureWorks.'  
    + [Schema_Name] + '.' + Table_Name )  
    ELSE CAST('' as nvarchar(max))  
END MaxValue  
FROM dbo.vTableColumns 
WHERE [Schema_Name] = 'Production' 
AND Table_Name = 'Product' ORDER BY [Schema_Name]  
, Table_Name  
, Column_Name  
, column_id 

   Here’s the output, the query run in about 30 seconds for the whole AdventureWorks database’s table:
  
Data Profiling - attribute level

 More likely, the above query requires some explanations. First of all some CASEs have been introduced in order to treat the exceptions, given the fact that not all data types support all the aggregate functions, while the xml data type doesn’t support any aggregate function at all. Hopefully I treated all the scenarios, if not the query could be easily updated. In other cases I applied a simple conversion, as in the case of bit data type casted to a smallint. In case the dynamic queries return other value than integer, the output value has been casted to a nvarchar(max). Most probably for easier understanding of the dynamic queries used is necessary also a review of the corresponding static queries presented in the previous post.

In case we use a specific function for each data type then we either choose to work with a query for each data type, or in case we want to use a UNION (ALL) join, then we’ll have to cast the output to nvarchar. So whatever method we choose, in order to create only one dataset out of our profiling implementation, a cast to nvarchar is required. I will resume myself to provide the queries for each data type, more for exemplifying the use of the above CLR functions:

-- getting Min/Max integer values 
SELECT [Schema_Name] 
, Table_Name  
, Column_Name 
, user_type 
, dbo.ExecuteScalarToInt('SELECT IsNull(Min(' + Column_Name + '), 0) MinValue FROM AdventureWorks.' + [Schema_Name] + '.' + Table_Name ) MinValue 
, dbo.ExecuteScalarToInt('SELECT IsNull(Max(' + Column_Name + '), 0) MaxValue FROM AdventureWorks.' + [Schema_Name] + '.' + Table_Name ) MaxValue 
FROM dbo.vTableColumns 
WHERE [Schema_Name] = 'Production' 
AND Table_Name = 'Product' 
AND user_type IN ('int', 'bigint') 

-- getting Min/Max String values 
SELECT [Schema_Name] 
, Table_Name  
, Column_Name 
, user_type 
, dbo.ExecuteScalarToString('SELECT IsNull(Min(' + Column_Name + '), '''') MinValue FROM AdventureWorks.' + [Schema_Name] + '.' + Table_Name ) MinValue 
, dbo.ExecuteScalarToString('SELECT IsNull(Max(' + Column_Name + '), '''') MaxValue FROM AdventureWorks.' + [Schema_Name] + '.' + Table_Name ) MaxValue 
FROM dbo.vTableColumns 
WHERE [Schema_Name] = 'Production' AND Table_Name = 'Product' 
AND user_type IN ('nvarchar', 'varchar', 'char', 'nchar') 

-- getting Min/Max Date values 
SELECT [Schema_Name] 
, Table_Name  
, Column_Name 
, user_type 
, dbo.ExecuteScalarToDate('SELECT IsNull(Min(' + Column_Name + '), Cast(''1800-01-01'' as date)) MinDateValue FROM AdventureWorks.' + [Schema_Name] + '.' + Table_Name ) MinDateValue 
, dbo.ExecuteScalarToDate('SELECT IsNull(Max(' + Column_Name + '), Cast(''1800-01-01'' as date)) MaxDateValue FROM AdventureWorks.' + [Schema_Name] + '.' + Table_Name ) MaxDateValue 
FROM dbo.vTableColumns 
WHERE [Schema_Name] = 'Production' 
AND Table_Name = 'Product' 
AND user_type IN ('datetime', 'date') 

-- getting Min/Max Boolean values 
SELECT [Schema_Name] 
, Table_Name  
, Column_Name 
, user_type 
, dbo.ExecuteScalarToBoolean('SELECT IsNull(Min(Cast(' + Column_Name + ' as smallint)), 0) MinValue FROM AdventureWorks.' + [Schema_Name] + '.' + Table_Name ) MinValue 
, dbo.ExecuteScalarToBoolean('SELECT IsNull(Max(Cast(' + Column_Name + ' as smallint)), 0) MaxValue FROM AdventureWorks.' + [Schema_Name] + '.' + Table_Name ) MaxValue 
FROM dbo.vTableColumns 
WHERE [Schema_Name] = 'Production' 
--AND Table_Name = 'Products' AND user_type IN ('bit') 

Notes:
1.   The problem with actual form of the functions is that when the query returns a null value, the database engine throws the following error message:
Msg 6522, Level 16, State 1, Line 2
A .NET Framework error occurred during execution of user-defined routine or aggregate "ExecuteScalarToDate":
System.InvalidCastException: Conversion from type 'DBNull' to type 'Date' is not valid.
System.InvalidCastException:
   at Microsoft.VisualBasic.CompilerServices.Conversions.ToDate(Object Value)
   at SqlServerProject1.UserDefinedFunctions.ExecuteScalarToDate(String SqlScript)

    
The error could be avoided by handling the NULL values with a default value, for example 0 for numeric values, the empty string for string values, a minimal value date for date values, and 0 for bit values. If for numeric and string value this translation has a minimum of impact, for date and 0 values kind of reduces from functionality.

2. SQL Server 2008 provides also a Data Profiling Task as part of its SSIS solution, quite easy to use, however in comparison with it, the above functions allow to profile not only a table but a set of tables, the statistics about attributes could be brought easily together in a tabular format and the logic could be further encapsulated in a UDF or view for reuse.

03 July 2010

🗄️Data Management: Data Profiling I (An Introduction)

Data Management
Data Management Series

Some Generalities

Jack Olson defines data profiling as "the use of analytical techniques to discover the true content, structure, and quality of data" [1]. Understanding the structure of data is quintessential for database developers in order to provide accurate technical solutions (reports, database objects for various purposes), though in order to answer business questions is important to understand also the content, the projection of business processes in database environment and, not less important, is the Data Quality reflected in the value the data has for an enterprise, being highly dependent on the understanding of data content and structure.

Typically, DBMS store also metadata about the data chunks and data structures they contain – tables, the columns they contain and their data types, the relations between tables, various types of indexes and other constraints, database objects like views, stored procedures, or user-defined functions. When taken together they provide together with the logical and semantical data models a good overview over the data structure, though the complexity of such systems, sometimes made of hundreds of tables, if often hard to comprehend, especially when one or more layers of abstraction reside in between the database structure and consumer. 

Conversely, the structure is one thing, while the data contained by the structure is another thing, attributes being more or less correctly misused for various purposes, with different degrees of completeness, actuality, accuracy, etc. I wrote about Data Quality and the most important dimensions in several posts, highlighting the importance of having data with an acceptable quality. Thus, in Data Quality initiatives, especially the ones done during data migration/conversion between various systems, data profiling occupies an important place.

A natural question (actually two): what analytical metrics could be used in order to understand the data and how complex do they have to be? Nowadays data mining, the process of extracting patterns from data, gains day by day in importance, more and more complex techniques being used, visualization techniques together with complex algorithms allowing us to derive rules from data. Do we have to go so far? Statistics offers us for example a simple set of techniques and concepts which could be used to understand the data, though Statistics deals mainly with quantities. 

Therefore, if we could translate the volume of data into various metrics, we could reuse in theory the metrics as input for statistical reasoning. And again, do we have to go that far? As we will see further on, data profiling could be done using several simple metrics, like the number of records, the number of missing values (see completeness), the number of distinct values and the distribution formed by them, the minimum and maximum values, or the minimum and maximum length of values. There are also more complex techniques that attempt to identify the correlation existing between the various attributes, falling already in the direction of data mining techniques.

Volume of Data

The number of records, as the name intuitively denotes, resumes in identifying the number of records each table/dataset from a set of table/datasets contains. It’s only a dimension of the volume of data, reflected also in the number of tables/datasets in scope, and the number of attributes each of them contains. So, the volume of data could be considered as a measure of number of records, number of tables and number of attributes. If we want to be more accurate in measurement, given the fact that each attribute could have a different maximum length, the maximum length could be considered as a fourth dimension of the volume of data. 

Now each database vendor decides on whether it reserves the respective maximum length or it considers only the actual used length (e.g. an attribute of type string could have a maximum length of 250 characters, but it holds values with a maximum length of 30 characters). Things are even more complicated because the structural length is different than the physical lengths (the number of bits allocated for storage). The physical storage occupied by a database could be in theory easily identified, though if only a subset is considered then it might be needed to consider the actual size of the volume of data in scope. The subset resumes not only to tables, but also to attributes as maybe not all the attributes from a table are used, records as there could be also inactive/disabled or other type of irrelevant records, and even data content, in case data could be stripped of not useful content.

Considering a given table/dataset the above-mentioned metrics could be obtained fairly easy by running queries against the table/dataset itself or against the metadata. As I highlighted above, database vendors that store metadata about their data structure, of importance in this context being the tables, the attributes, attributes’ data type and maximum length. There are also vendors  that store the number of records as statistics for various purposes, though the statistics are not always up to date, now it depends also on profiling requirements. On the other side getting the number of records is a simple task, even in an automated manner. Considering for exemplification the AdventureWorks database coming with SQL Server, the number of records from a table can be easily identified with the help of the count aggregate function available in all important RDBMS:  

-- the number of records (returns 54) 
SELECT count(1) NumberRecords 
FROM AdventureWorks.Production.Product 

The number of attributes the table contains could be obtained by doing a simple count against the number of attributes returned by a SELECT * statement or by querying directly the metadata:  

-- the number of attributes (returns 25) 
SELECT COUNT(1) 
FROM sys.columns C 
    JOIN sys.tables T 
      ON C.object_id = T.object_id 
WHERE T.name = 'Product' 

The same tables (catalog views starting with SQL Server 2005) could be used in order to retrieve the name of the attributes, their data type, length related attributes and whether the attribute accepts NULL values. Given the fact that the respective views will be used in multiple queries, it makes sense to encapsulate them in a view:

-- table columns view 
CREATE VIEW dbo.vTableColumns 
AS  
SELECT s.name [schema_name] 
, T.name Table_Name 
, C.name Column_Name 
, C.column_id  
, C.max_length 
, C.precision 
, C.scale 
, C.is_nullable 
, C.is_identity 
, UT.name user_type 
, ST.name system_type 
FROM sys.columns C 
    JOIN sys.types as UT 
       ON C.user_type_id = UT.user_type_id 
    JOIN sys.types ST 
      ON C.system_type_id = ST.user_type_id 
    JOIN sys.tables T 
       ON C.object_id = T.object_id 
         JOIN sys.schemas S 
            ON T.schema_id = S.schema_id 
 
Thus, the needed query becomes: 

-- metadata details 
SELECT * 
FROM dbo.vTableColumns 
WHERE [Schema_Name] = 'Production' 
   AND Table_Name = 'Product' 
ORDER BY column_id 
Output:
schema_name Table_Name Column_Name column_id max_length precision scale is_nullable is_identity user_type system_type
Production Product ProductID 1 4 10 0 0 1 int int
Production Product Name 2 100 0 0 0 0 Name nvarchar
Production Product ProductNumber 3 50 0 0 0 0 nvarchar nvarchar
Production Product MakeFlag 4 1 1 0 0 0 Flag bit
Production Product FinishedGoodsFlag 5 1 1 0 0 0 Flag bit
Production Product Color 6 30 0 0 1 0 nvarchar nvarchar
Production Product SafetyStockLevel 7 2 5 0 0 0 smallint smallint
Production Product ReorderPoint 8 2 5 0 0 0 smallint smallint
Production Product StandardCost 9 8 19 4 0 0 money money
Production Product ListPrice 10 8 19 4 0 0 money money

Note:
For more information on the meaning of user_type and system_type and the other related metadata stored, or on the length of each data types in SQL Server please check the local or online SQL Server documentation (see: Mapping System Tables to System Views, respectively Data Types).

Analyzing the Data Content

Analyzing the content resumes in studying the completeness for each attribute, what values are used and their distribution, the minimal/maximal/average values and minimal/maximal/average length of values.

Completeness could be studied as a metric of the number of records that have not-null values, or the complementary metric, the number of records that have null values, or as a percentage of the two from the total number of records. For example, in order to study the completeness of “Color” attribute could be used the following query:

-- completeness 
SELECT count(1) NumberRecords 
, count(Color) NumberNotNullValues 
, count(1)-count(Color) NumberNullValues 
, CASE  
    WHEN count(1)&gt;0 THEN Cast(100*Cast(count(Color) As decimal (18,2))/count(1) AS decimal (5,2)) 
     ELSE 0.00  
END PercentageNotNullValues 
, CASE  
    WHEN count(1)&gt;0 THEN Cast(100*Cast((count(1)-count(Color)) As decimal (18,2))/count(1) AS decimal (5,2)) 
     ELSE 0.00  
END PercentageNullValues  
FROM Production.Product 
Output:
NumberRecords NumberNotNullValues NumberNullValues PercentageNotNullValues PercentageNullValues
504 256 248 50.79 49.21

Studying completeness of attributes is important especially for the cases in which the attribute becomes mandatory, the number of null records equating with the number of changes that must be made to make the respective values available. In contrast, the not-null values could be a base for further review.

The next step is determining the values used in attributes, in other words the number of distinct values appearing in an attribute. Eventually, especially for list of values, it could be considered as a percentage from the number of values available. For example it could be studied how many accounts are used from the number of accounts defined. On the other side there are also attributes for which the list of possible values are not explicitly delimited, as in the case of colors available for Products, though that doesn’t necessarily happen in all business cases. To determine the number of distinct colors used for queries could be used the following query:
   
-- number distinct values 
 SELECT count(DISTINCT Color) NumberDistinctColors 
, count(DISTINCT IsNull(Color, 'n/a')) NumberDistinctColorsIncludingNulls 
FROM Production.Product 
Output:
NumberDistinctColors NumberDistinctColorsIncludingNulls
9 10

Note:
Please note that the NULL value is ignored by count aggregate function. I’m mentioning that because there are cases in which the NULL value has a meaning of its own, in such cases the NULL could be placed with a meaningful (e.g. ‘’n/a in this case).

Attributes’ usage is quite an important subject when considering metadata, being for example important to study which attributes are referenced in various tables. For example, is of importance studying which Products appear in Purchases, Sales Orders, Inventory, etc., especially the respective Products have to receive special attention in what concerns Data Quality.

The number of distinct values is just a measure in studying attributes’ usage, a more complex analysis being performed based on the frequency distribution of values based on the number of references in a given table/dataset. This can be easily done using a count together with a grouping based on the studied attribute. Here’s the example for color: 

-- distribution of values 
SELECT Color 
, count(1) NumberRecords 
FROM Production.Product 
GROUP BY Color 
ORDER BY NumberRecords DESC
Output:
Color NumberRecords
NULL 248
Black 93
Silver 43
Red 38
Yellow 36
Blue 26
Multi 8
Silver/Black 7
White 4
Grey 1

As can be seen from the above output (called a univariate frequency table), the result shows not only the number of records with NULL values, but also the number of distinct values. Attributes’ usage could be used also as a criterion for prioritizing Data Cleansing by focusing first on the attributes with the highest frequency. Eventually the frequencies could be expressed also as percentages in ratio with the number of records, the percentages being often considered as more representative values in visualization of data.

Often, it’s useful to study the frequency of values based on explicit (e.g. Product Category/Subcategory) or implicit (ad-hoc defined) categories instead of studying values’ distribution.

A third technique used in analyzing attribute’s usage is based on identifying the range of values, here simple statistical tools like minimum, maximum, averages, or standard variation of values could be of importance in understanding the data. The minimum and maximum could be applied for all data types, but they are meaningful mainly for the numeric and date-time data types, and both could take advantage of the whole specific range of statistical tools. Studying the maximum and minimum value within for a string data attribute makes sense only when there is a strong lexicographical motivation, therefore as example I will consider the List Price from same Product table.

-- aggregate values 
 SELECT MIN(ListPrice) MinListPrice 
, MAX(ListPrice) MaxListPrice 
, AVG(ListPrice) AverageListPrice 
, STDEV(ListPrice) SandardDevListPrice  
FROM Production.Product 
Output:
MinListPrice MaxListPrice AverageListPrice SandardDevListPrice
0.00 3578.27 438.6662 773.602842630487
 
The 0 values could be excluded from analysis because at least in this case they represent the lack of a valid List Price. At least in this case the average and standard deviation doesn’t really make sense, while the minimum and maximum reveal the range of values used. As in the case of frequencies, it’s useful to study the frequency of values based on explicit or implicit categories. Here’s an example based on List Price aggregation at Subcategory level:

-- aggregate values per subcategory 
SELECT PS.Name 
, MIN(P.ListPrice) MinListPrice 
, MAX(P.ListPrice) MaxListPrice 
, AVG(P.ListPrice) AverageListPrice 
, STDEV(P.ListPrice) SandardDevListPrice  
FROM Production.Product P 
    LEFT JOIN Production.ProductSubcategory PS 
      ON P.ProductSubcategoryID = PS.ProductSubcategoryID 
WHERE IsNull(P.ListPrice, 0)!=0 
GROUP BY PS.Name 
ORDER BY PS.Name 
Output:
Name MinListPrice MaxListPrice AverageListPrice SandardDevListPrice
NULL 133.34 196.92 159.1333 28.9631334630769
Bib-Shorts 89.99 89.99 89.99 1.34869915234861E-06
Bike Racks 120.00 120.00 120.00
Bike Stands 159.00 159.00 159.00
Bottles and Cages 4.99 9.99 7.99 2.64575131106459
Bottom Brackets 53.99 121.49 92.24 34.6383097162665
Brakes 106.50 106.50 106.50 0
Caps 8.99 8.99 8.99
Chains 20.24 20.24 20.24
Cleaners 7.95 7.95 7.95
  
I tend to believe that such analysis at category level is less important in Data Profiling, though there could be also exceptions, for example the case of ABC analysis

Analyzing the Data Structure

The understanding of the structure of data is facilitated by the metadata available in the database itself, the logical/semantical data models and other types of documentation (e.g. functional specifications, data dictionaries, etc.), but as specified above even if the data are filled in a given predefined structure, when the database is not adequately normalized or that the data attributes are misused, results structures in structures, sometimes people use their own encodings varying from person to person, while other times such encodings are standardized, and even implemented in tools. When we talk about structure, the simplest metrics that could be created are based on the minimal and maximal length of attributes, while the average length could be used for example as a base for calculating the volume of data. Here’s the example script based on Color attribute:

-- aggregate length values 
SELECT MIN(Len(Color)) MinLength 
, MAX(Len(Color)) MaxLength 
, AVG(Len(Color)) AverageLength            
FROM Production.Product
Output:
MinLength MaxLength AverageLength
3 12 5

The discovery of special structures (patterns) used inside of an attribute are not so easy to determine using simple scripts, at least not unless the structure is known in advance. There are tools and I’ve also seen some SQL scripts available that allow identifying the patterns in data, though they have relative precision and I believe human intervention is requested (e.g. semi-automatic methods).

Analyzing Data Relations

Talking about patterns, the discovery of relations between attributes could also be considered as special type of pattern discovery, the above last statement applying for them too, the data mining field providing many tools for this type of discovery. In theory the relation between attributes could be discovered by grouping two or more attributes and analyzing the output and interdependence between them. Whether that can be achieved by a database developer depends also on his skills and knowledge in statistics and data mining field, often being required an experienced Data Analyst for this type of tasks.

Closing Remarks

There are more to say about Data Profiling topic, there are many more (complex) techniques that could be used to profile the data, especially in respect to Data Quality, which mainly focuses on 6 important dimensions of data: accuracy, duplicates (or uniqueness), completeness, consistency, conformity, and referential integrity. If in previous posts I addressed the respective topics from a theoretical point of view, I will try to attempt to develop the subjects from Data Profiling perspective in next posts, showing various techniques that could be used by developers for this purpose.


Written: Jul-2010, Last Reviewed: Mar-2024

References:
[1] Jack E Olson(2003) Data Quality: The Accuracy Dimension
Related Posts Plugin for WordPress, Blogger...

About Me

My photo
Koeln, NRW, Germany
IT Professional with more than 24 years experience in IT in the area of full life-cycle of Web/Desktop/Database Applications Development, Software Engineering, Consultancy, Data Management, Data Quality, Data Migrations, Reporting, ERP implementations & support, Team/Project/IT Management, etc.