Some days back I spent a couple of hours trying to identify the tables from which the IDs stored in a log where coming from, in the log being saved no reference to the table, just the IDs. Another example of "good" design! While working diligently, I remembered that I could have easily solved the problem if I were using one of the CLR functions created in a post on Data Profiling Using CLR Functions. Unfortunately, because I was working in a production environment with strong security constraints, I wasn’t allowed registering and use any of the respective functions. Still, considering the easiness and flexibility of a the solution I thought that would be interesting to put it into practice. Who knows, maybe someday I will find the opportunity to use it!
Let’s suppose I am searching for an integer ID value within a given schema of AdventureWorks database. So, in theory, I need to do a search on each integer column of all the tables from the respective schema. The tables can be found in a matter of seconds using the following query based on the dbo.ExecuteScalarToInt CLR function introduced in the previous mention post and dbo.vTableColumns view.
-- searching a value in whole database WITH CTE AS ( SELECT [Schema_Name] , Table_Name , Column_Name , system_type , dbo.ExecuteScalarToInt('SELECT count(*) FROM ' + Quotename([Schema_Name]) + '.' + Quotename(Table_Name) + ' WHERE ' + Quotename(Column_Name) + ' = ' + Cast(1001 as varchar(20))) NumberRecords FROM dbo.vTableColumns WHERE [Schema_Name]= 'Sales' --AND Table_Name = 'SalesOrderDetail' --AND Column_Name = 'SalesOrderDetailID' AND system_type = 'int' ) SELECT * FROM CTE WHERE NumberRecords>0 ORDER BY [Schema_Name] , Table_Name
Schema_Name | Table_Name | Column_Name | system_type | NumberRecords |
Sales | ContactCeditCard | CreditCardId | int | 1 |
Sales | CreditCard | CreditCardId | int | 1 |
Sales | CurrencyRate | CurrencyRateId | int | 1 |
Sales | CustomerAddress | AddressId | int | 1 |
Sales | SalesOrderDetail | SalesOrderDetailId | int | 1 |
Sales | SalesOrderHeader | BillToAddressId | int | 8 |
Sales | SalesOrderHeader | CreditCardId | int | 1 |
Sales | SalesOrderHeader | ShipToAddressId | int | 8 |
In case of the problem I had a few days back, the above query it’s only a first step in identifying the tables in which the values occur. Because multiple tables are returned, it’s needed to investigate which of the tables has the highest probability to be the source for the ID saved in log. Of course, some DM (data mining) algorithm could be probably used for this purpose, however some programming is needed and also some amount of diligent work to prove the data. It would be nice to have a tool that performs such analysis or that creates a complex query using any set of attributes within a given database.