03 July 2010

🗄️Data Management: Data Profiling I (An Introduction)

Data Management
Data Management Series

Some Generalities

Jack Olson defines data profiling as "the use of analytical techniques to discover the true content, structure, and quality of data" [1]. Understanding the structure of data is quintessential for database developers in order to provide accurate technical solutions (reports, database objects for various purposes), though in order to answer business questions is important to understand also the content, the projection of business processes in database environment and, not less important, is the Data Quality reflected in the value the data has for an enterprise, being highly dependent on the understanding of data content and structure.

Typically, DBMS store also metadata about the data chunks and data structures they contain – tables, the columns they contain and their data types, the relations between tables, various types of indexes and other constraints, database objects like views, stored procedures, or user-defined functions. When taken together they provide together with the logical and semantical data models a good overview over the data structure, though the complexity of such systems, sometimes made of hundreds of tables, if often hard to comprehend, especially when one or more layers of abstraction reside in between the database structure and consumer. 

Conversely, the structure is one thing, while the data contained by the structure is another thing, attributes being more or less correctly misused for various purposes, with different degrees of completeness, actuality, accuracy, etc. I wrote about Data Quality and the most important dimensions in several posts, highlighting the importance of having data with an acceptable quality. Thus, in Data Quality initiatives, especially the ones done during data migration/conversion between various systems, data profiling occupies an important place.

A natural question (actually two): what analytical metrics could be used in order to understand the data and how complex do they have to be? Nowadays data mining, the process of extracting patterns from data, gains day by day in importance, more and more complex techniques being used, visualization techniques together with complex algorithms allowing us to derive rules from data. Do we have to go so far? Statistics offers us for example a simple set of techniques and concepts which could be used to understand the data, though Statistics deals mainly with quantities. 

Therefore, if we could translate the volume of data into various metrics, we could reuse in theory the metrics as input for statistical reasoning. And again, do we have to go that far? As we will see further on, data profiling could be done using several simple metrics, like the number of records, the number of missing values (see completeness), the number of distinct values and the distribution formed by them, the minimum and maximum values, or the minimum and maximum length of values. There are also more complex techniques that attempt to identify the correlation existing between the various attributes, falling already in the direction of data mining techniques.

Volume of Data

The number of records, as the name intuitively denotes, resumes in identifying the number of records each table/dataset from a set of table/datasets contains. It’s only a dimension of the volume of data, reflected also in the number of tables/datasets in scope, and the number of attributes each of them contains. So, the volume of data could be considered as a measure of number of records, number of tables and number of attributes. If we want to be more accurate in measurement, given the fact that each attribute could have a different maximum length, the maximum length could be considered as a fourth dimension of the volume of data. 

Now each database vendor decides on whether it reserves the respective maximum length or it considers only the actual used length (e.g. an attribute of type string could have a maximum length of 250 characters, but it holds values with a maximum length of 30 characters). Things are even more complicated because the structural length is different than the physical lengths (the number of bits allocated for storage). The physical storage occupied by a database could be in theory easily identified, though if only a subset is considered then it might be needed to consider the actual size of the volume of data in scope. The subset resumes not only to tables, but also to attributes as maybe not all the attributes from a table are used, records as there could be also inactive/disabled or other type of irrelevant records, and even data content, in case data could be stripped of not useful content.

Considering a given table/dataset the above-mentioned metrics could be obtained fairly easy by running queries against the table/dataset itself or against the metadata. As I highlighted above, database vendors that store metadata about their data structure, of importance in this context being the tables, the attributes, attributes’ data type and maximum length. There are also vendors  that store the number of records as statistics for various purposes, though the statistics are not always up to date, now it depends also on profiling requirements. On the other side getting the number of records is a simple task, even in an automated manner. Considering for exemplification the AdventureWorks database coming with SQL Server, the number of records from a table can be easily identified with the help of the count aggregate function available in all important RDBMS:  

-- the number of records (returns 54) 
SELECT count(1) NumberRecords 
FROM AdventureWorks.Production.Product 

The number of attributes the table contains could be obtained by doing a simple count against the number of attributes returned by a SELECT * statement or by querying directly the metadata:  

-- the number of attributes (returns 25) 
SELECT COUNT(1) 
FROM sys.columns C 
    JOIN sys.tables T 
      ON C.object_id = T.object_id 
WHERE T.name = 'Product' 

The same tables (catalog views starting with SQL Server 2005) could be used in order to retrieve the name of the attributes, their data type, length related attributes and whether the attribute accepts NULL values. Given the fact that the respective views will be used in multiple queries, it makes sense to encapsulate them in a view:

-- table columns view 
CREATE VIEW dbo.vTableColumns 
AS  
SELECT s.name [schema_name] 
, T.name Table_Name 
, C.name Column_Name 
, C.column_id  
, C.max_length 
, C.precision 
, C.scale 
, C.is_nullable 
, C.is_identity 
, UT.name user_type 
, ST.name system_type 
FROM sys.columns C 
    JOIN sys.types as UT 
       ON C.user_type_id = UT.user_type_id 
    JOIN sys.types ST 
      ON C.system_type_id = ST.user_type_id 
    JOIN sys.tables T 
       ON C.object_id = T.object_id 
         JOIN sys.schemas S 
            ON T.schema_id = S.schema_id 
 
Thus, the needed query becomes: 

-- metadata details 
SELECT * 
FROM dbo.vTableColumns 
WHERE [Schema_Name] = 'Production' 
   AND Table_Name = 'Product' 
ORDER BY column_id 
Output:
schema_name Table_Name Column_Name column_id max_length precision scale is_nullable is_identity user_type system_type
Production Product ProductID 1 4 10 0 0 1 int int
Production Product Name 2 100 0 0 0 0 Name nvarchar
Production Product ProductNumber 3 50 0 0 0 0 nvarchar nvarchar
Production Product MakeFlag 4 1 1 0 0 0 Flag bit
Production Product FinishedGoodsFlag 5 1 1 0 0 0 Flag bit
Production Product Color 6 30 0 0 1 0 nvarchar nvarchar
Production Product SafetyStockLevel 7 2 5 0 0 0 smallint smallint
Production Product ReorderPoint 8 2 5 0 0 0 smallint smallint
Production Product StandardCost 9 8 19 4 0 0 money money
Production Product ListPrice 10 8 19 4 0 0 money money

Note:
For more information on the meaning of user_type and system_type and the other related metadata stored, or on the length of each data types in SQL Server please check the local or online SQL Server documentation (see: Mapping System Tables to System Views, respectively Data Types).

Analyzing the Data Content

Analyzing the content resumes in studying the completeness for each attribute, what values are used and their distribution, the minimal/maximal/average values and minimal/maximal/average length of values.

Completeness could be studied as a metric of the number of records that have not-null values, or the complementary metric, the number of records that have null values, or as a percentage of the two from the total number of records. For example, in order to study the completeness of “Color” attribute could be used the following query:

-- completeness 
SELECT count(1) NumberRecords 
, count(Color) NumberNotNullValues 
, count(1)-count(Color) NumberNullValues 
, CASE  
    WHEN count(1)>0 THEN Cast(100*Cast(count(Color) As decimal (18,2))/count(1) AS decimal (5,2)) 
     ELSE 0.00  
END PercentageNotNullValues 
, CASE  
    WHEN count(1)>0 THEN Cast(100*Cast((count(1)-count(Color)) As decimal (18,2))/count(1) AS decimal (5,2)) 
     ELSE 0.00  
END PercentageNullValues  
FROM Production.Product 
Output:
NumberRecords NumberNotNullValues NumberNullValues PercentageNotNullValues PercentageNullValues
504 256 248 50.79 49.21

Studying completeness of attributes is important especially for the cases in which the attribute becomes mandatory, the number of null records equating with the number of changes that must be made to make the respective values available. In contrast, the not-null values could be a base for further review.

The next step is determining the values used in attributes, in other words the number of distinct values appearing in an attribute. Eventually, especially for list of values, it could be considered as a percentage from the number of values available. For example it could be studied how many accounts are used from the number of accounts defined. On the other side there are also attributes for which the list of possible values are not explicitly delimited, as in the case of colors available for Products, though that doesn’t necessarily happen in all business cases. To determine the number of distinct colors used for queries could be used the following query:
   
-- number distinct values 
 SELECT count(DISTINCT Color) NumberDistinctColors 
, count(DISTINCT IsNull(Color, 'n/a')) NumberDistinctColorsIncludingNulls 
FROM Production.Product 
Output:
NumberDistinctColors NumberDistinctColorsIncludingNulls
9 10

Note:
Please note that the NULL value is ignored by count aggregate function. I’m mentioning that because there are cases in which the NULL value has a meaning of its own, in such cases the NULL could be placed with a meaningful (e.g. ‘’n/a in this case).

Attributes’ usage is quite an important subject when considering metadata, being for example important to study which attributes are referenced in various tables. For example, is of importance studying which Products appear in Purchases, Sales Orders, Inventory, etc., especially the respective Products have to receive special attention in what concerns Data Quality.

The number of distinct values is just a measure in studying attributes’ usage, a more complex analysis being performed based on the frequency distribution of values based on the number of references in a given table/dataset. This can be easily done using a count together with a grouping based on the studied attribute. Here’s the example for color: 

-- distribution of values 
SELECT Color 
, count(1) NumberRecords 
FROM Production.Product 
GROUP BY Color 
ORDER BY NumberRecords DESC
Output:
Color NumberRecords
NULL 248
Black 93
Silver 43
Red 38
Yellow 36
Blue 26
Multi 8
Silver/Black 7
White 4
Grey 1

As can be seen from the above output (called a univariate frequency table), the result shows not only the number of records with NULL values, but also the number of distinct values. Attributes’ usage could be used also as a criterion for prioritizing Data Cleansing by focusing first on the attributes with the highest frequency. Eventually the frequencies could be expressed also as percentages in ratio with the number of records, the percentages being often considered as more representative values in visualization of data.

Often, it’s useful to study the frequency of values based on explicit (e.g. Product Category/Subcategory) or implicit (ad-hoc defined) categories instead of studying values’ distribution.

A third technique used in analyzing attribute’s usage is based on identifying the range of values, here simple statistical tools like minimum, maximum, averages, or standard variation of values could be of importance in understanding the data. The minimum and maximum could be applied for all data types, but they are meaningful mainly for the numeric and date-time data types, and both could take advantage of the whole specific range of statistical tools. Studying the maximum and minimum value within for a string data attribute makes sense only when there is a strong lexicographical motivation, therefore as example I will consider the List Price from same Product table.

-- aggregate values 
 SELECT MIN(ListPrice) MinListPrice 
, MAX(ListPrice) MaxListPrice 
, AVG(ListPrice) AverageListPrice 
, STDEV(ListPrice) SandardDevListPrice  
FROM Production.Product 
Output:
MinListPrice MaxListPrice AverageListPrice SandardDevListPrice
0.00 3578.27 438.6662 773.602842630487
 
The 0 values could be excluded from analysis because at least in this case they represent the lack of a valid List Price. At least in this case the average and standard deviation doesn’t really make sense, while the minimum and maximum reveal the range of values used. As in the case of frequencies, it’s useful to study the frequency of values based on explicit or implicit categories. Here’s an example based on List Price aggregation at Subcategory level:

-- aggregate values per subcategory 
SELECT PS.Name 
, MIN(P.ListPrice) MinListPrice 
, MAX(P.ListPrice) MaxListPrice 
, AVG(P.ListPrice) AverageListPrice 
, STDEV(P.ListPrice) SandardDevListPrice  
FROM Production.Product P 
    LEFT JOIN Production.ProductSubcategory PS 
      ON P.ProductSubcategoryID = PS.ProductSubcategoryID 
WHERE IsNull(P.ListPrice, 0)!=0 
GROUP BY PS.Name 
ORDER BY PS.Name 
Output:
Name MinListPrice MaxListPrice AverageListPrice SandardDevListPrice
NULL 133.34 196.92 159.1333 28.9631334630769
Bib-Shorts 89.99 89.99 89.99 1.34869915234861E-06
Bike Racks 120.00 120.00 120.00
Bike Stands 159.00 159.00 159.00
Bottles and Cages 4.99 9.99 7.99 2.64575131106459
Bottom Brackets 53.99 121.49 92.24 34.6383097162665
Brakes 106.50 106.50 106.50 0
Caps 8.99 8.99 8.99
Chains 20.24 20.24 20.24
Cleaners 7.95 7.95 7.95
  
I tend to believe that such analysis at category level is less important in Data Profiling, though there could be also exceptions, for example the case of ABC analysis

Analyzing the Data Structure

The understanding of the structure of data is facilitated by the metadata available in the database itself, the logical/semantical data models and other types of documentation (e.g. functional specifications, data dictionaries, etc.), but as specified above even if the data are filled in a given predefined structure, when the database is not adequately normalized or that the data attributes are misused, results structures in structures, sometimes people use their own encodings varying from person to person, while other times such encodings are standardized, and even implemented in tools. When we talk about structure, the simplest metrics that could be created are based on the minimal and maximal length of attributes, while the average length could be used for example as a base for calculating the volume of data. Here’s the example script based on Color attribute:

-- aggregate length values 
SELECT MIN(Len(Color)) MinLength 
, MAX(Len(Color)) MaxLength 
, AVG(Len(Color)) AverageLength            
FROM Production.Product
Output:
MinLength MaxLength AverageLength
3 12 5

The discovery of special structures (patterns) used inside of an attribute are not so easy to determine using simple scripts, at least not unless the structure is known in advance. There are tools and I’ve also seen some SQL scripts available that allow identifying the patterns in data, though they have relative precision and I believe human intervention is requested (e.g. semi-automatic methods).

Analyzing Data Relations

Talking about patterns, the discovery of relations between attributes could also be considered as special type of pattern discovery, the above last statement applying for them too, the data mining field providing many tools for this type of discovery. In theory the relation between attributes could be discovered by grouping two or more attributes and analyzing the output and interdependence between them. Whether that can be achieved by a database developer depends also on his skills and knowledge in statistics and data mining field, often being required an experienced Data Analyst for this type of tasks.

Closing Remarks

There are more to say about Data Profiling topic, there are many more (complex) techniques that could be used to profile the data, especially in respect to Data Quality, which mainly focuses on 6 important dimensions of data: accuracy, duplicates (or uniqueness), completeness, consistency, conformity, and referential integrity. If in previous posts I addressed the respective topics from a theoretical point of view, I will try to attempt to develop the subjects from Data Profiling perspective in next posts, showing various techniques that could be used by developers for this purpose.


Written: Jul-2010, Last Reviewed: Mar-2024

References:
[1] Jack E Olson(2003) Data Quality: The Accuracy Dimension

29 June 2010

💎SQL Reloaded: Number of Records I (The CLR Version)

    While scanning rapidly the daily posts from MSDN blogs, I stopped on Chris Skorlinski’s “Example code executing TSQL for each Table in a database” post, wondering then on whether he’s using other techniques than I used in the past. I run into this type of need several times, especially in cases in which I wanted to report in an automated manner the number of records for a given set of tables, typically during ETL tasks. The problem could be easily solved with the help of a cursor, though it’s not the most elegant technique, even if it solves the problem fairly easy. I remember how I tried to find a way to execute a dynamic statement using a dynamic script inside of a  UDF (user-defined function), though that’s not possible because a stored procedure can’t be executed inside of a function. An alternative would be to use the OPENROWSET function and execute a statement using a linked server, though the OPENROWSET function doesn’t work with dynamically created statements.

    Five years ago, soon after SQL Server 2005 RTM appeared on the market, I tried to create a UDF programmatically using an example from Bob Beauchemin & co.’s “A First Look at SQL Server 2005 for Developers” book. The example worked then, but not without some inherent headaches, having to spent some time on troubleshooting the received errors, and if I remember correctly they were caused by a change in the way the connection was created from the SQL Server SQLClient. Anyway, that’s already history, though since then I was wondering on whether is possible to use a programmatically created UDF to return the number of records for a given table sent as parameter. Since then I haven’t had the chance to test that, even if it seemed logically possible, but seeing Chris Skorlinski’s post and as last week I installed Visual Studio 2010 on my home computer, I said to myself that that’s a nice way to see the tool at work. Thought and done, but again not without problems, running in several small but time consuming issues.

    So armed with some patience, I did a quick search on Microsoft site on how to use CLR scalar-valued functions under SQL Server 2008, running into this MSDN resource, which actually returns the record count using a static query. So I opened Visual Studio 2010 and created a new Database/SQL Server/Visual Basic SQL CLR Database Project named SQLServerProject1 (I forgot to give the project a meaningful name) and added a database reference to the AdventureWorks database using Windows Authentication.

CLR UDF - Database Reference

    Then in the new created project I’ve added a UDF to the project (Add/New Item/User Defined Function), naming the new created class as CLRLibrary, I replaced the existing function with the one from the MSDN article, added a new String input parameter called TableName, modified the query and changed UDF’s name to NumberRecords as per below piece of code:
 
Imports System 
Imports System.Data 
Imports System.Data.SqlClient 
Imports System.Data.SqlTypes 
Imports Microsoft.SqlServer.Server 
Imports System.Runtime.InteropServices 'added manually 
 

Partial Public Class UserDefinedFunctions 
SqlFunction(DataAccess:=DataAccessKind.Read)> _ 
Public Shared Function NumberRecords(ByVal TableName As String) As Integer 
Using conn As New SqlConnection("context connection=true") 
conn.Open() 
Dim cmd As New SqlCommand("SELECT COUNT(1) AS [NumberRecords] FROM " & TableName, conn) 
Return CType(cmd.ExecuteScalar(), Integer) 
End Using 
End Function 
End Class 

    In addition, I had to add also a reference to System.Runtime.InteropServices, and now the function was ready for testing so I built the solution; then I copied CLRLibrary’s Full Path in SQL Server Management Studio and attempted to follow the steps described in the MSDN article. Now it’s time for some troubleshooting fun… First I enabled the CLR integration configuration option by running the below script:

--enable CLR integration use master 
go sp_configure 'clr enabled', 1  

go 
reconfigure 
go 
       
    Using the master database I attempted to create a reference to the assembly, using the Full Path and the dll name.
 
CREATE ASSEMBLY SqlServerProject1 
FROM 'D:\\SqlServerProject1.dll' 
WITH PERMISSION_SET = EXTERNAL_ACCESS 

   The next step was to declare the function:

CREATE FUNCTION dbo.NumberRecords(@table_name nvarchar(100)) 
RETURNS int  
AS  
EXTERNAL NAME SqlServerProject1.[SqlServerProject1.UserDefinedFunctions].NumberRecords 
GO  
 

    Actually is the final form of the DDL script, because initially I used only a combination of the Project, class and function name (e.g. SqlServerProject1.UserDefinedFunctions.NumberRecords), receiving the following error message:
Msg 6505, Level 16, State 2, Procedure NumberRecords, Line 1
Could not find Type 'UserDefinedFunctions' in assembly 'SqlServerProject1'.

    After a quick search on Google I found out from Eggheadcafe forum that the Namespace was missing from the declaration, and once I made this change the above statement worked without problems. The next step was naturally to test on whether the function is working:

SELECT dbo.NumberRecords('AdventureWorks.HumanResources.Department')  

And it worked! So it’s time to test it against a set of tables from AdventureWorks database:
 
SELECT s.name [schema_name]
   , T.name table_Name 
, dbo.NumberRecords('AdventureWorks.' + s.name + '.' + t.name) NumberRecords 
FROM AdventureWorks.sys.tables t 
      JOIN AdventureWorks.sys.schemas s 
        ON t.schema_id = S.schema_id  
ORDER BY s.name  
, T.name 

  Actually I missed some important thing from the above steps, somewhere while attempting to test the function or to register the assembly, I received the following error message:
A .NET Framework error occurred during execution of user-defined routine or aggregate "NumberRecords": System.InvalidOperationException: Data access is not allowed in this context.  Either the context is a function or method not marked with DataAccessKind.Read or SystemDataAccessKind.Read, is a callback to obtain data from FillRow method of a Table Valued Function, or is a UDT validation method.

   A post on MSDN SQL Server forum remembered me that I forgot to check on whether I’m using the 3.5 or 4.0 .Net Framework. So in the SQLServerProject1 project, from main menu/Project/SQLServerProject1 Properties/Compile/Advice Compile Options/Target Framework, I changed the value to .Net Framework 3.5. This change requires the project to be closed and reopened.

    In between the above steps I had also to check on whether the assembly was registered using the below script:

--assembly file details 
SELECT A.name [Assembly] 
, AF.name File_Path 
FROM sys.assemblies A 
     JOIN sys.assembly_files AF 
        ON A.assembly_id = AF.assembly_id 
WHERE A.is_user_defined=1 
--AND A.name='SqlServerProject1' 

    Also, if you’d like to remove the assembly from the database, you’ll have first to drop the function, and only then drop the assembly:

-- cleaning up
DROP FUNCTION NumberRecords 
DROP ASSEMBLY SqlServerProject1 

    It can be further discussed on whether this approach is acceptable as performance or not, on whether it can be misused, etc. Actually a nice misuse is the following statement which passes a WHERE constraint with the table name.

SELECT dbo.NumberRecords('AdventureWorks.HumanResources.Department WHERE Name LIKE ''P%''') 

    It could be created a second parameter for passing the WHERE constraint or, why not, pass the whole query as parameter, as long only a scalar is returned. I hope I haven’t forgot any step, I’m kind of mixing the facts because I attempted also to create the same function using a new created login, running into other type of issues that maybe deserve a second post.

27 June 2010

🛒Market Review: What’s New in Microsoft World II

Microsoft Office - Cloud Computing is the Word

    Two weeks ago, on 15th of June 2010, Microsoft Office was shipped together with Visio and Project 2010, closing the cycle of releases started with SQL Server 2008 R2, Visual Studio 2010, Sharepoint 2010 (all 3 shipped in April 2010) and Windows Azure (available also in April). The words that describe/unite at best these software tools is cloud computing and collaboration, why that? First we have to consider Azure, the new product from Windows’ portfolio, a framework for cloud computing and SaaS (Software as a Service) architectures, and composed of 3 components, namely Windows Azure which allows running applications and accessing data in the cloud, SQL Azure Database  provides data services in the cloud, while Windows Azure platform AppFabric allows the communication between the applications residing in the cloud. Also MS Office 2010 is part of Microsoft’s strategy toward cloud computing, the weight falling on SharePoint 2010, a business collaboration platform that together with the other MS Office tools allow to manage information, automate and manage business processes, facilitate decision making process, etc. A cornerstone of the framework is the co-authoring tool that “allows multiple people to work on a single copy of a document at the same time or at different times, seamlessly, whether they are online or offline”. As it seems are provided also “community features that allows users to share data as they do on Twitter and Facebook”, a step toward social computing. Microsoft plans to offer an online version of Office 2010, called Office Web Apps (OWA), supposed to be also a competitor for Google Docs.

    There are also people who question the steps done by Microsoft toward cloud computing, but in the end is important to establish the software infrastructure in which cloud computing-based applications could be developed, futures that don’t exist currently could appear in future versions or could be provided by third-party vendors.

    Microsoft comes also with some unpleasant surprises, as it seems Microsoft’s SharePoint Server runs only on 64-bit hardware and requires also a 64 bit SQL Server edition, and this could be quite an important constraint for many customers. The most unpleasant surprise is that Microsoft renounces to the well-known upgrade schema, the reason for that, as mentioned in Ars Technica quoting a Microsoft spokesman, from the need to simplify the product lineup and pricing, based on “partner and customer feedback” (I’m sorry but I can’t really buy that!). The same source expects that upgrades will be available with promotions, after Office’s launch. The only promotion I heard of is Microsoft Office 2010 Technology Guarantee program but if refers only to the customers who “purchased, installed, and activated a qualifying Microsoft Office 2007 product between March 5, 2010, and September 30, 2010”, they being eligible to download Office 2010 at no additional cost. How about the ones who bought a Microsoft Office 2007 copy in 2010 but before 5th of March (like I did)?!

Microsoft TechEd North America Sessions are Online

    The Microsoft TechEd North America sessions held in New Orleans were made available online (video and slides), an opportunity for technical professionals to get an overview on the new advancements in Microsoft technologies, being approached topics related to the various platforms of Windows, MS Office, Dynamics, Web, Cloud Computing & Online Services, etc. I really like the way Microsoft makes its technologies available to the public, especially the fact that it provides also Express versions of their software, allowing newbies and developers to get acquainted and use essential basic functionality. The MSDN, TechNet, webcasts, Channel9  and community and personal blogs bring the technical and non-technical closer to the company and its technologies.

25 June 2010

💎SQL Reloaded: Trivial Equalities in Queries

    If I remember correctly from the Math literature, equalities of the type 0=0, 1=1 or more general n=n could be referred also as trivial equalities. I tried to find, without success though, an exact definition of what is intended by trivial equality, the closest I could get is the Wikipedia’s content on the use of trivial adjective in Mathematics for objects that have a simple structure. As per my perception, not sure if I’m correct, a trivial equality is an equality in which one of the members is a constant, the same definition holding also for trivial inequalities.

    Now, what have trivial equalities like 1=1 have to do with queries?! As 1=1 always equates to true, same as 1=0 equates to false, could be used in certain scenarios as a simple technique to return all the records, respectively no records from a database. For example given the fact that 0=1 equates to false I used such a constraint under ADO in queries like the below one in order to retrieve metadata from a database.

SELECT *  
FROM Production.Product 
WHERE 1=0 

    In the same way 1=1 could be used to retrieve all the records from a table:

SELECT *  
FROM Production.Product 
WHERE 1=1 

    Now if in above query we add different constraints the 1=1 allows to comment/uncomment the constraints as we wish with a minimal of changes.

SELECT *  
FROM Production.Product 
WHERE 1=1 
AND Color = 'Black' 
--AND SafetyStockLevel>800 
AND StandardCost>300 

    This makes it quite useful when creating dynamic queries (see the first constraint from the stored procedure given as example in Just In Case – Part IV: Dynamic Queries post).  I actually have seen quite often this type of writing a query, especially between Oracle developers. Long time ago I asked somebody what’s the consideration behind its use, excepting the fact that it allows to add or remove (comment) constraints. The answer was quite fuzzy, being mentioned a possible improvement in query’s performance. Some time ago I remember somebody was mentioning that in older SQL Server database engines such a constraint could lead to an unexpected query plan, and thus poorer performance. Normally the database engine should recognize such statements as meaningless, and ignore them, however from theory to implementation is a long way and anything is possible. An example in this direction could be found in The real cost of performance, the post and the comments that followed revealing the various facets of using equality sign in programming languages like C# or VB. SQL shouldn’t be so complex, as it doesn’t typically used in the work with objects.

    I used such trivial equalities also in cartesian JOINs as an artifice in order to use ANSI syntax in queries, here’s an example using the SplitList table-valued function introduced long time ago:

SELECT * 
FROM dbo.SplitList('1,2,3,4', ',') A  
       JOIN dbo.SplitList('A,B,C,D', ',') B  
          ON 1=1 

    Given the above definition for trivial equalities, in theory all constraints in which one of the members is a hard-coded could be regarded as trivial. Actually all numeric data type constraints could be written as trivial equalities by moving all attributes in one of the members, the other being a constant, for example:

SELECT P.Name Product 
, PM.Name ProductModel 
FROM Production.Product P 
        JOIN Production.ProductModel PM 
            ON P.ProductModelID - PM.ProductModelID = 0 

    Even if such a statement is logically correct, it should be avoided when possible, given the fact that it decreases considerably the performance of the query (just compare the query plan between the above query and the typical join), indexes, if exist, not being used efficiently.

20 June 2010

🎡SSIS: Using Oracle as Data Source

Making SSIS work with Oracle doesn’t seem to be a complicated task especially after several years of experience in doing that, though, as usual, something new appears in the landscape – new software versions, new requirements, an environment with its own particularities, etc. In general, when an application needs a connection to an Oracle server, is needed to install on the client computer several components that come with Oracle Client, add/configure the TNS (Transparent Network Substrate) name, eventually set-up some global variables in case the components were not installed in the default location, and depending on the chosen provider might be needed to configure also a DSN (Data Source Name) pointing to the Oracle server. SSIS makes actually no exception from these steps, once the steps performed you should be in theory ready to develop, test, deploy, schedule and run packages using a connection to an Oracle server.

Unfortunately, from my experience, every 2-3 installations, there is a problem with the Oracle Client and its configuration, most of the times the solution being quite simple – removing some declarations from the set-up files, correcting the global variables or the TNS name. As it seems the Oracle Client is quite sensible to the changes in the default installation path, therefore is indicated to install the Client using the default path unless you’d like to gain more experience in troubleshooting such installation issues. In addition, as any respectable company, some of the products come with their own defects, the patching of such issues being not so easy, therefore it’s advisable to check beforehand the known issues coming with the Client version you’d need to install, preferably in case you need to take advantage of the latest Oracle features, it makes sense to install the latest stable Client.  

Because the Oracle Client downloadable package has a few hundred MB, there is a thinner Oracle alternative to Oracle Client, namely Instant Client package, the components could be downloaded from Oracle site (here) and installed individually, as a minimum being necessary to install Instant Client Package – Basic or Basic Lite versions, the ODBC libraries and the SQL *Plus libraries in case you want to test the connection. Of course, after case could be installed also the JDBC, SDK or any other packages Oracle made available.

On 64 bit platforms might be needed to install in parallel the 32 and 64 bit Oracle/Instant Clients (see SSIS, Oracle and X64 post from Business Vision DEV Team), while in order to troubleshoot the various issues could be a good idea to check the differences between 32 bit and 64 bit registry (here). 

There are several drivers that allows you to connect to an Oracle database using SSIS, the most popular ones:
- Microsoft OLE BD Provider for Oracle
- Oracle Provider for OLE DB
- .Net Framework Data Provider for Oracle
- .Net Framework Data Provider for Odbc
- Oracle Data Provider for .NET

The Microsoft drivers come with MDAC (Microsoft Data Access Components), its latest version being MDAC 2.8. Starting with Windows Vista and Windows Server 2008, Microsoft changed MDAC into WDAC (Windows Data Access Components), including it as part of the operating system, removing thus the need to redistribute the components. See Data Access Technologies Road Map, FAQ and Troubleshooting MDAC/WDAC for MDAC/WDAC architecture, components and releases, respectively troubleshooting. Given the various issues that exist with a particular MDAC/WDAC library, see in KB301202 how you could check the current version by using the registers, while for  Windows version up to Windows Server 2003 could be used the Component Checker MDAC Utility.

If .Net Framework Data Provider for Oracle, respectively for Odbc come with the .Net framework, in exchange, Oracle Data Provider for .NET, see also FAQ is Oracle’s implementation for ADO.NET data access,  supposed to take advantage of advanced Oracle database functionality, it comes with ODAC (Oracle Data Access Components). Please note that you might need in install the respective components in addition to the Oracle Client.

There are also third party drivers for Oracle or even particularly for SSIS, for example Microsoft Connectors Version 1.1 for Oracle and Teradata for use with SSIS from Attunity; see also a few tips from SQL Server Performance blog.

Before taking any decision on which driver to use, it might be a good idea to look also at drivers’ limitations and advantages. In the past I often used the Microsoft Oracle ODBC Driver until I run into in an important limitation, namely its inexistent support for unicodes, this residing, according to KB244661, in the fact that “from Microsoft Data Access Components (MDAC) version 2.5 and later versions, both the Microsoft ODBC Driver and OLE DB Provider support ONLY Oracle 7 and Oracle 8i”. Also Oracle Provider for OLE DB seems to have its own limitations.

SQL Server provides linked servers, the powerful functionality of executing commands against OLE DB data sources on remote servers, including Oracle. As linked servers offer the ability to create cross-vendor distributed queries, in certain scenarios they could prove to be a powerful alternative of querying Oracle databases, in such cases not being needed to create in SSIS package an additional connection to Oracle. KB280106 and an article on Oracle Provider for OLE DB describe how to set up and troubleshoot a linked server to an Oracle database in SQL Server.

You might want to check also the SQL Server Integration Services with Oracle Database 10 White Paper coming from Microsoft. “Connectivity and SQL Server 2005 Integration Services”  MSDN article written by Bob Beauchemin, Scott Barrett’s blog or this SSIS 64 bit – Using Oracle Provider post.

💎SQL Reloaded: Troubleshooting Query Issues III (Logical errors)

    In last post I provided some general guidelines on how to troubleshoot query issues related to errors thrown by the database engine. That’s a fortunate case, because engine’s validation shows that something went wrong. I’m saying fortunate because there are also cases in which there are also logical errors that result in unexpected output, and the errors could remain even for years undiscovered. It’s true that such errors are hard to discover, but not impossible; adequate testing combined with defensive programming techniques could allow decreasing the volume of such errors. The simplest and handiest test could be performed by looking in a first phase at the number of records returned by the query, and secondly by studying the evolution of cardinality with each table added to the query. The clearest signal that something went bad is when no records, too less or too many records are returned (than expected), then the problem most probably resides in the constraints used, namely in JOIN, WHERE or HAVING clause constraints.

WHERE clause constraints

    If your query is just a simple statement without joins of any type, then more likely the problem resides in the constraints used in the WHERE clause, at least one of them being evaluated to false. Here are several possible reasons:
- One of the hardcoded values or one of the provided parameters is not found between the values available in the table;
- Not treated NULL cases;
- Wrong use of predicates or parenthesis.

    The simplest constraint evaluated to false is “0=1” or something similar, actually can be used any similar values (e.g. ‘A’=’B’, ‘ABC’=’’, etc.). In contrast “1=1” is evaluated always to true. The below query returns no records because the constraint is evaluated for false:
 
-- single constraint evaluated to false 
SELECT * 
FROM Production.Product 
WHERE 1=0 

    A more complex scenario is when multiple predicates and at least one equates always to true or to false, the task in such situations being to identify such constraints:

-- multiple constraints, one evaluated to false 
SELECT * 
FROM Production.Product 
WHERE 1=0 
    AND Class = 'L' 
    AND SafetyStockLevel >100 

    The constraints could be grouped also using parenthesis:

-- multiple constraints with parenthesis, one evaluated to false 
SELECT * 
FROM Production.Product 
WHERE 1=0 
    AND ((Class = 'L' 
    AND SafetyStockLevel >100) 
      OR (Class = 'M' 
    AND SafetyStockLevel =100)) 

    Dealing with multiple such constraints is not a question of guessing but pure applied Boolean algebra essential in writing accurate queries. So if you are feeling that you’re not mastering such concepts, then maybe it’s a good idea to consult some material on this topic.

    There could be situations in which the total outcome of a set of constraints it’s not so easy to see, in such cases the constraint could be brought in the SELECT statement. Which constraints should be tested depends also on the particularities of your query, here is an example:  

--testing constraints' outcome for each record 
SELECT * 
, CASE  
     WHEN Class = 'L' AND SafetyStockLevel >100 THEN 1 
     ELSE 0 
  END Constraint1  
, CASE  
     WHEN Class = 'M' AND SafetyStockLevel =100 THEN 1 
     ELSE 0 
  END Constraint2 
, CASE  
     WHEN (Class = 'L' AND SafetyStockLevel >100) 
        OR (Class = 'M' AND SafetyStockLevel =100) THEN 1 
     ELSE 0 
END Constraint3 
FROM Production.Product  

      
  The constraints in which one of the members is NULL and the IS NULL or Null functions are not used, are incorrect evaluated, actually ignored, and in certain cases they might look even correct, though this depends also on the expressions used. For example the below query will return no records, while the next one will return some records. For more on the correct handling of NULLs see Null-ifying the world or similar posts on the web.

--misusing NULL-based constraints 
  SELECT * 
FROM Production.Product 
WHERE (NOT 1=NULL) OR (1=NULL) 
 
JOIN constraints

    JOINs based on single JOIN constraints should in theory pose no problem unless the wrong attributes are used in the JOIN and that’s so easy to do especially when no documentation is available on the topic or it is incomplete. Functional or other technical specifications, physical or semantic models, metadata or other developers’ knowledge could help the developer to figure out about the relations between the various entities. On the other side it’s in human nature to make mistakes, forgetting to add one constraint, using the wrong type of join or the wrong attributes in the join constraint. In addition all the situations described above for WHERE clauses constraints apply also to join constraints, independently on whether the ANSI or non-ANSI syntax is used, the later based on writing the JOIN constraints in the WHERE clause, making the identification of the JOIN constraints not so easy to see, this being one of the reasons for which I recommend you to use ANSI syntax.

    A technique I use on a daily basis in order to test my scripts is to check the changes in the number of records with the adding of each new table to the initial query. In a first phase is enough to count the number of records from the table with the lowest level of detail, and the number of records from the final query, this ignoring the WHERE constraints. For example let’s consider the following query written some time ago (most of attributes were removed for convenience):

-- Purchase Orders 
 SELECT POD.PurchaseOrderDetailID 
, POD.PurchaseOrderID 
, POD.ProductID  
, POH.ShipMethodID 
, POH.VendorID 
FROM Purchasing.PurchaseOrderDetail POD 
     JOIN Purchasing.PurchaseOrderHeader POH 
       ON POD.PurchaseOrderID = POH.PurchaseOrderID 
           JOIN Purchasing.ShipMethod PSM 
              ON POH.ShipMethodID = PSM.ShipMethodID 
           JOIN Purchasing.Vendor SMF 
              ON POH.VendorID = SMF.VendorID  
     JOIN Production.Product ITM 
        ON POD.ProductID = ITM.ProductID 
WHERE POD.StockedQty >100 
     AND ITM.Class = 'L' 
 
    Usually I write the table with the lowest level of detail (in theory also with the highest number of records) first, adding the referenced table gradually, starting with the tables with the lowest number of records, thus the lookup tables will come first. This usually doesn’t affect the way the database engine processes the query (unless special techniques are used), and it allows to use kind of a systematic approach. This allows me also to test the query without making important changes. As I wrote above, the first test is done against the first table, for this commenting the other joins and eventually the WHERE constraints:

-- testing changes in the number of records - test 1 
SELECT count(1) NumberRecords 
FROM Purchasing.PurchaseOrderDetail POD 
/* JOIN Purchasing.PurchaseOrderHeader POH 
     ON POD.PurchaseOrderID = POH.PurchaseOrderID 
          JOIN Purchasing.ShipMethod PSM 
             ON POH.ShipMethodID = PSM.ShipMethodID 
          JOIN Purchasing.Vendor SMF 
             ON POH.VendorID = SMF.VendorID 
      JOIN Production.Product ITM 
         ON POD.ProductID = ITM.ProductID 
WHERE POD.StockedQty >100 
AND ITM.Class = 'L' */ 
       
    In the second step is enough to check the number of records returned by the whole query without the WHERE constraints:

-- testing changes in the number of records - test 2 
SELECT count(1) NumberRecords 
FROM Purchasing.PurchaseOrderDetail POD  
     JOIN Purchasing.PurchaseOrderHeader POH 
        ON POD.PurchaseOrderID = POH.PurchaseOrderID 
            JOIN Purchasing.ShipMethod PSM 
               ON POH.ShipMethodID = PSM.ShipMethodID 
            JOIN Purchasing.Vendor SMF 
               ON POH.VendorID = SMF.VendorID  
      JOIN Production.Product ITM 
          ON POD.ProductID = ITM.ProductID 
/* WHERE POD.StockedQty >100 
     AND ITM.Class = 'L' */       

      
    If the number of records is correct then most probably the query is correct, this because there could be incorrect LEFT JOINS that are not reflected in the number of records but in the fact that the corresponding attributes are missing. So don’t rely entirely on this type of test, but take 1-2 examples for which you are sure that records must be retrieved from the LEFT JOIN, and check whether the expected values are shown. Eventually, only for testing purposes, the LEFT JOIN could be modified as a FULL JOIN to see what records are returned.  

    Now if there are changes in the number of records from one test to the other, then take the first query and move the comment one join further (see below), and repeat the step until the JOIN that causes the variances is identified. 

-- testing changes in the number of records - test 3  
SELECT count(1) NumberRecords 
FROM Purchasing.PurchaseOrderDetail POD 
JOIN Purchasing.PurchaseOrderHeader POH 
ON POD.PurchaseOrderID = POH.PurchaseOrderID 
/*      JOIN Purchasing.ShipMethod PSM 
           ON POH.ShipMethodID = PSM.ShipMethodID 
         JOIN Purchasing.Vendor SMF 
            ON POH.VendorID = SMF.VendorID 
     JOIN Production.Product ITM 
        ON POD.ProductID = ITM.ProductID 
WHERE POD.StockedQty >100       
AND ITM.Class = 'L' */          

 Once the join that causes the variances is found, then might be requested then to add a new join constraint or use a LEFT JOIN instead of a FULL JOIN. For example if in the above query there are ShipMethodID with NULL values, then the query should have been written using a LEFT JOIN. On the other side, if duplicates are found, then do a grouping by using the primary key of the table with the lowest level of detail, in this case the Purchasing.PurchaseOrderDetail.  

HAVING clause constraints

    HAVING clause constraints behave more like WHERE clause constraints, the difference residing in the different levels where the constraints are applied, so the same technique could be applied in this case too, with the difference that the constraints need to be accommodated to support aggregate functions.  

--Vendors with total purchase value >100000 
SELECT POH.VendorID 
, SUM(POH.SubTotal) TotalSubTotal 
FROM Purchasing.PurchaseOrderHeader POH 
WHERE POH.Status IN (2,4) -- 2-Approved, 4-Complete  
GROUP BY POH.VendorID 
HAVING SUM(POH.SubTotal)>100000 

    Sometimes it helps to bring the HAVING constraints into the SELECT, though that offers limited functionality, therefore it’s more productive maybe to re-link the aggregated query back to the base table(s) on which the statement is based:

--Vendors with total purchase value >100000 - base records 
SELECT A.* 
, B.TotalSubTotal 
, CASE 
     WHEN B.VendorID IS NOT NULL THEN 1 
     ELSE 0 
END ConsideredFlag 
FROM Purchasing.PurchaseOrderHeader A 
LEFT JOIN ( --aggregated data 
      SELECT POH.VendorID 
      , SUM(POH.SubTotal) TotalSubTotal 
      FROM Purchasing.PurchaseOrderHeader POH 
      WHERE POH.Status IN (2,4) -- 2-Approved, 4-Complete  
      GROUP BY POH.VendorID 
HAVING SUM(POH.SubTotal)>100000 
) B 
     ON A.VendorID = B.VendorID 
WHERE A.Status IN (2,4) -- 2-Approved, 4-Complete  

  
 
  Also this technique could prove to be limited, though I found it useful in several cases, especially when a complicated formula is included in the aggregated function. The same query could be rewritten with a window aggregate function introduced with SQL Server 2005.

--Vendors with total purchase value >100000 - window aggregate function 
SELECT POH.* 
, SUM(POH.SubTotal) OVER (PARTITION BY POH.VendorID) TotalSubTotal 
, CASE 
     WHEN SUM(POH.SubTotal) OVER (PARTITION BY POH.VendorID)>100000 THEN 1 
     ELSE 0 
END ConsideredFlag 
FROM Purchasing.PurchaseOrderHeader POH 
WHERE POH.Status IN (2,4) -- 2-Approved, 4-Complete 
 
    When working with amounts, quantities or other number values, it could be useful to check the total amount/quantity based on the query logic against the total amount/quantity based on the base table.

💎SQL Reloaded: Troubleshooting Query Issues II (Errors)

    Some weeks ago I wrote a post on troubleshooting, looking at the topic from problem solving perspective. In what concerns the troubleshooting of SQL scripts, now it depends also on the type of DML statement used or on the type of issues the developer deals with – errors thrown by the database engine, wrong number of records returned by a query, performance, accessibility or reusability related issues.

    Many of the errors come normally from wrong usage of syntax and functionality or from misspells, in many situations the database query engine providing enough information, leading the developer closer to issue’s resolution, at least by narrowing the range of search, and here the problem solving approach could be useful. In such situations it is useful to do a quick search in the local/online documentation or resources using the error number or the full/partial error message. I know that’s not rocket science, though I’ve seen people acting like hypnotized by the error message, not knowing what do to next. A quick search using a powerful search engine like Google.com, Bing.com or Yahoo.com could lead you to the solution or help gathering more data about the possible causes. More likely that there was somebody in the same situation and has posted a question or even the answer somewhere in a forum or a blog. It’s always a good idea and time-effective to do a little research before posting a new question in a forum. Especially between beginners there are people who don’t know to make use of the documentation, help tools or even a search engine, therefore it could be useful for them to take some time and look on some tips on how to make better use of such tools.

    In general when an error is thrown is provided an error number that has a specific meaning for the vendor, a general description, and in fortunate cases also the line where it occurs. For example in SQL Server the execution of the below query based on AdventureWorks database will return the next error message:

SELECT * 
FROM dbo.Purchasing.PurchaseOrderHeader POH 

Msg 208, Level 16, State 1, Line 1
Invalid object name 'dbo.Purchasing.PurchaseOrderHeader'


   Even if in this case the error was created deliberately by using the name of a schema (e.g. dbo) instead of database’s name, this doesn’t mean that the cause/solution is so simple. For example the same error message could be thrown by an issue related to KB837970, KB310882, KB289551 and several other articles that could be found by using the following search text in Google:
site:support.microsoft.com Microsoft.com "Msg 208, Level 16"

    Unfortunately in what concerns the errors appearing in queries, the localization of errors is not so clean, depending also on the order in which the query is parsed by the engine, thus the issue might be shown as occurring at a line other than its actual location, namely at the line where the query parsing finds an issue.
  
    When the error message is in other languages than English, it makes sense to try to find the corresponding error message in English and use it in search because, as a quit important volume of documentation and other type of information are in English, the chances to find something are higher. Normally that could be done by using the error code, which in theory should have the same meaning independently of the language used.

    Sometimes, when you didn’t knew how to do something and just used your inspiration and tried to use some functionality, and you are not fully aware of its implications and constraints, then it makes sense to consult the documentation or do a search on the web on how to do a certain task (e.g. How to load data from Oracle to SQL Server). 10 minutes spent in research could help you save a few hours later, so don’t jump blindly into coding! In many situations it could happen that a solution is already available, and if you find already the code, please consider not using it blindly, try to understand the how and why, its limitations and how it could be eventually improved, and some sorrow testing might be necessary too. Don’t try to reinvent the wheel just for the sake of it, in what concerns programming there are more likely others who had the same problem and found a neat solution to it. This doesn’t mean that you’ll have to do that for every problem you’ll have, but for tasks having higher complexity than the tasks you are accustomed with, the need for that decreasing with the level of expertise. Doing some research on a given topic can be time consuming, but could prove to be useful especially when other developers used other technique. Actually once you understood the idea, you could attempt to reconstruct the logic with your own coding and styling conventions, of course, not always it makes sense to do that.

Note:
    You’ll have to consider also copyright infringements, but in theory the developers who posted code on the web for further reuse, want only some minimal credit for that (e.g. posting a link to the website where the code is found, mentioning developers’ name, etc.), while in some situations you could reuse the code only if you add to it substantial changes. Both statements apply not only to code but also other type of work (content) available freely on the web (e.g. articles, tutorials, etc.).

    In case the search was unsuccessful then it makes sense to post your question on a forum, specifically in the area closest to your problem. Posting a question in the wrong forum has quite high chances to lead nowhere! Some forum members are quite critical about the users who post questions without doing some research, especially when there is already an answer available on the same forum. On forums, users have to be aware that they are not only wasting their time, but also readers’ time. On the other side, the more the points where the same question is posed, the higher are the chances for others to reach to them, and eventually get an answer, this depending also on how search engines index the web pages. 
 
    When the same error appears for the second time, in theory the developer should know already how to solve it or the possible cause(s). Because the solutions are not always so simple to remember as is in human nature to forget, it makes sense to document the issues by describing the problem and the solutions (preferably with code) to such a level that it could understood also by other professionals.

    When the number of errors occurring in your SQL scripts is quite high, especially when the same type of errors occurs, then maybe it makes sense to take a step back and ask yourself why that happens. Sometimes it just means that is time to take a break, though, more obvious, is maybe needed to review the theory/documentation, make use of defensive programming techniques or use an editor with powerful IntelliSense functionality that checks the syntax and highlights the errors as you type. If you are a beginner, don’t worry, with time coding and troubleshooting becomes much easier.

    All the above ideas are simple common knowledge between developers, though such knowledge comes from (prolonged) experience – a combination of learning by doing and solving issues – “the school of hard knocks”. I just hope the above lines and the coming posts will be helpful to you.

03 June 2010

🧭Business Intelligence: Enterprise Reporting (VIII: Addressable Questions in Reports’ Creation I)

Business Intelligence
Business Intelligence Series

The creation of a (data) report of any type involves most of the times several iterations until all the requirements are gathered, the query and eventually the final report are created, the reports gets tested, the issues related to deliveries and requirements mitigated, etc. Typically the steps involved in the creation of a report are straightforward for the developer, though for the user and the other people involved in the process everything might look like a black box, often also from the lack of visibility and communication between the involved parties. Actually at least in this case the black box is not the problem but the input and output that’s highly dependent on the input, not to mention that any information omitted from user’s side could have unexpected impact on output.

In many cases I found out that the information on which a report is created are minimal, for example the user comes to the developer with a question (e.g. what’s the total revenue for the past x years) and he’s expecting the developer to come with the result in terms of a report with data reflecting the answer to the respective question. When formulating the requirements, the user might even come with a list of attributes he would like to see in such a report, some of the attributes might not be possible to be shown given the needed report’s level of detail or that it makes sense to add other attributes (e.g. Unit of Measure, Currency, Price Unit, Quantity) in order to avoid possible confusion, miscalculations, and to facilitate the understanding, testing or usage of such a report. There could be cases in which the user has no really idea of what he wants, being difficult to describe the problem and the exact data he might need, and this could happen also because the user isn’t aware of the data and their structure, processes or the various constraints and other business rules that applies.

 When from the whole initial dataset based on the requested attributes the user needs only a smaller subset, this is translated into a set of constraints (e.g. the report is needed is given only for a single Vendor or a time unit), then it makes sense to consider some of the constraints as parameters, allowing to fit the report to a larger set of requirements. In addition the user might need to use as filter additional attributes not already considered in constraints. For an easier report’s understanding the columns could be positioned in a certain order and the records sorted based on a given list of ordered attributes. For the same reason special formatting could be applied for some of the attributes, for example the values could be right/center/left aligned, the dates written is a predefined format (e.g. DD/MM/YY), the rounded to a given number of decimals, the negative values could be highlighted in a another color, etc.  

When the requirements don’t match the semantic or relational data model, then must be found the adequate level of details, for this maybe being needed to renounce to some of the attributes, eventually split the report in summary and detail report, aggregate some of the values, consider only last/first record from a given subtable, all these aspects needing to be mitigated with the users. Some of the calculations the user needs could be included directly in the report, the exact applied formulas needing to be given as input by users, the special cases needing to be mitigated too.

As the same data could come from different sources, having different timeliness, accuracy, consistency, completeness, availability, in other words with different quality, the adequate data source needs again to be mitigated with the users, in theory needing to be considered the data with the most appropriate data quality. In addition to these constraints could be added also the technical constraints derived from the volume of data, report type, storage type, available time, data transformations needed, reporting platform, etc.

There are several other important facts that should be actually considered in requirements gathering phase, namely on whether a similar report already exists, could be used unaltered or needs to be modified by adding additional attributes, filters, formatting, etc. As all such work needs to have also a ROI argumentation, there are also cases in which the costs of creating such a report are higher then the costs associated with the user preparing the report, on the other side such costs are not always so easy to quantify, though it’s a good idea to have the needed reports as much as possible automated. In such cases might be requested some input from the Functional and IT Managers, the enforcement of policies and processes to deal with such aspects.

The above aspects are gathered in the below set of questions, users and developers should address them altogether, and even if some apply only to developers, some awareness from the users could be beneficial to:
Q1: What’s the problem statement? (What issues tries the report to address?)
Q2: What kind of data would need to be gathered (in order to solve or better understand the problem)?
Q3: What level of detail is needed?
Q4: What attributes are need?
Q5: Is there a similar report (that could be extended or used as template)?
Q6: What are report’s definition, purpose and financial argumentation?
Q7: Does it make sense to invest time, effort and money in creating such a report? What's the estimated effort to build the report? What's the halting point?
Q8: What attributes it makes sense to add/remove?
Q9: What’s the (appropriate) source for the attributes?
Q10: What special filter constraints are needed?
Q11: Which of the filter constraints should be considered as parameters?
Q12: Which are the formulas that need to be applied for calculated values?
Q13: What (default) sorting the user needs?
Q14: What’s the needed order of the attributes in the layout?
Q15: Are there any aggregations that need to be added?
Q16: In what form the report should be delivered?
Q17: What formatting should be applied?
Q18: What’s report’s frequency?
Q19: Who will be report’s owner?
Q20: How many records the average report will have?
Q21: What other technical or logistic constraints qualify?
Q22: When is the report needed?
Q23: What documentation is needed?
Q24: Are there any sensitive information contained?
Q25: Who needs access to the report? Do any externals need access to it?
Q26: Are there further uses for the same data? 
Q27: What's the expected life-expectancy for the report?
Q28: What's the reporting platform on which the report should be built?
Q29: Does it makes sense to split the report in multiple perspectives?
Q30: What's the expected performance for the report? Are there any limitations?

Note:
The above list of questions should not be considered as complete, some of the questions could be merged while some of the question could lead to other questions, especially in the area of technical and logistics constraints that need to be addressed adequately.

Previous Post <<||>> Next Post

Market Review: What's New around the World

Instead of Introduction

    May is gone and it was quite a stressful month because given the fact that I had to start with a new job, a new position I will have to grow in, I had also to find a new apartment, move, install myself, so I barely had time for my blogs, the little time I had spending it reading stuff related to the new requirements, trying to improve my German, etc. Even so, I found a few interesting resources, adding them to the pile of other interesting links I collected over time. So, I realized that maybe it makes sense to group some of them in a set of postings under the title “What's New (In My World)”, hopefully they will be of interest for other people too.

Free Universities and Resources for SQL Developers & Co.

    I was kind of positively surprised to discover that several known authorities in their fields opened their own online Universities for the masses, a new free formal form of study on the web that takes advantage of the latest trends in online learning – webcasts, PowerPoint presentations, blogging posts, online communities, new technologies and applications, etc. The most unexpected appearance is the SQL University having as faculty known professionals and bloggers like Jorge Segarra, Adam Machanic, Pinal Dave or Kendal Van Dyke, to mention only the ones whose blogs I previously known, but also others not less important professionals. The posts cover topics like SSIS, PowerShell, parallelism, database administration, etc. and I find them quite a useful resource of knowledge not only for beginners but also for experienced professionals.

   Early this year I discovered the BeyeUniversity that currently offers a series of 5 free certificate programs for BeyeNetwork members, covering Business Intelligence related topics, the online video courses having an easy to follow format, the content being approachable also by non-professionals. Between the instructors could be recognized Bill Inmon, one of the pioneers in data warehousing, leading the course on Unstructured Data and being also a columnist in BeyeNetwork newsletters, several others webcasts with him being also available. Ralph Kimball, the other important personality in the area of data warehousing has opened the Kimball University and even if its courses are not free, on his website could be found a big collection of articles reflecting the ideas of his school of thought. 

   The MIT OpenCoureWare initiative is quite old, though maybe not so many people know that there are made available also courses on Database related topics, for example Database Systems, Database, Internet, and Systems Integration Technologies or Data Mining, the lecture notes and study materials offering a high level overview of the concepts related to the mentioned subjects. Some of the other courses offer also video or audio content, though more such resources could be found on YouTube EDU, Academic Earth, Lecturefox or similar resources, being covered a growing range of database related topics. Similar such resources could be found also with a simple search using the favorite search engine, many Universities have made available online, voluntarily or involuntarily their text, video or audio lecture notes.

Microsoft News

    With SQL Server 2008 R2, Visual Studio 2010 and MS Office 2010 Microsoft closes a new cycle of releases, the products reflecting Microsoft’s philosophy and the degree to which the company made it to hold their promises. It will be interesting to see how the products will be regarded by professionals and press, how they fit in the general technology trends, how they could be used together in order to provide highly scalable and available solutions. I really hope that Microsoft have made some important steps in bridging the gaps existing in their tools integration on 64 bit version, and that the new added functionality makes a difference.
Related Posts Plugin for WordPress, Blogger...

About Me

My photo
Koeln, NRW, Germany
IT Professional with more than 24 years experience in IT in the area of full life-cycle of Web/Desktop/Database Applications Development, Software Engineering, Consultancy, Data Management, Data Quality, Data Migrations, Reporting, ERP implementations & support, Team/Project/IT Management, etc.