13 March 2010

SSIS: Percentage Sampling Data Flow Transformation

    One of the problems relatively difficult to address in the past was the random sampling within a SQL Server data set. Of course there is always Excel that could be used for this purpose just with a few tricks, though what do you do when working with huge source dataset? As we will see below SSIS 2008’s Percentage Sampling Data Flow Transformation allows easily addressing this problem.

    Using the template SSIS package defined in Third Magic Class post, copy paste the Package.dtsx in the project and rename it (e.g. Package Percentage Sampling.dtsx), and from Toolbox add an Percentage Sampling Transformation and link it to the OLE DB Source. Access the Percentage Sampling Editor in which modify the Percentage of rows value from 10 to 50. It doesn’t really makes sense to rename the sample and unselected outputs, though you might need to do that when dealing with multiple Percentage Sampling Transformations.

SSIS - Percentage Sampling Transformation Editor
Note:
    The percentage of rows you’d like to work with depends entirely upon request, in many cases it’s indicated to determine statistically the size of your sample. Given the fact that the number of records in this example is quite small I preferred to use a medium dataset size.

    Link the Aggregate Transformation to the OLE DB Destination and in the Input Output Selection dialog select as Output the ‘Sampling Selected Output’, while in the OLE DB Destination Editor create a new table (e.g. Production.BikesSample). 
 
SSIS - Input Output Selection

   In the last step, before testing the whole package, in Control Flow tab change the Execute SQL Task’s SQLStatement property to point to the current destination table: 
 
TRUNCATE TABLE [Production].[BikesSample]  

  
SSIS - Percentage Sampling Data Flow

   Save the project, test (debug) the package (twice) and don’t forget to validate the output data: 
  
SELECT * 
FROM [Production].[BikesSample]   
 

Note:
   I was actually expecting to have 48 or 49 records (97:2=48.5) in the output and not 45, I wonder from where comes the difference?! That’s a topic I still have to investigate. I tried also to change the percentage of rows to 25 resulting an output of 23 of records (23*4=92), 75 resulting an output of 74 records, respectively 100, all the records being this time selected. At least the algorithm used by Microsoft partitions the output in complementary datasets.

No comments:

Related Posts Plugin for WordPress, Blogger...

About Me

My photo
Koeln, NRW, Germany
IT Professional with more than 24 years experience in IT in the area of full life-cycle of Web/Desktop/Database Applications Development, Software Engineering, Consultancy, Data Management, Data Quality, Data Migrations, Reporting, ERP implementations & support, Team/Project/IT Management, etc.