12 March 2010

SSIS: The Union All Data Flow Transformation

    In the yesterday’s post I was showing how to use the Conditional Split Data Flow Transformation in order to vertically partition a dataset based on a set of constraints, allowing thus to dump the data from each partition to its corresponding table. The inverse problem considers merging together the distributed partitions of the same dataset or distinct similar datasets into a common data set. If the datasets needed to be merged are on the same server it makes sense to join them by using UNION or UNION ALL inside of a simple Data Flow Task:
SELECT * -- specify explicitly the attributes
FROM
[Production].[Bikes]

UNION
ALL

SELECT
* -- specify explicitly the attributes

FROM
[Production].[ClothesAccessories]

UNION
ALL

SELECT
* -- specify explicitly the attributes

FROM
[Production].[Components]


   If the datasets are distributed across several data sources then in theory could be created a package for each data source and dump the data locally, however this often implies the synchronization of all involved packages, eventually by calling each package from a parent package. A better and easier approach is to use the Union All Data Flow Transformation that merges the data from multiple sources into a common result set, a data source needing to be defined for each input dataset. As exemplification I will use the three Product tables created in the previous post, namely [Production].[Bikes] and [Production].[ClothesAccessories] and[Production].[Components]. Even if the tables are located on the same server and the datasets could be merged with a UNION, I will use the general approach and reuse the template SSIS package defined in Third Magic Class post. Copy paste the Package.dtsx in the project and rename it (e.g. Package Union All.dtsx), in Data Flow Task rename the Data Source (e.g. OLE DB Source Bikes) and change the SQL command text using the following query:
SELECT ITM.ProductID
,
ITM.ProductName

,
ITM.ProductNumber

,
ITM.ProductModel

,
ITM.ProductSubcategory

,
ITM.ProductCategory

,
ITM.MakeFlag

,
ITM.FinishedGoodsFlag

,
ITM.Color

,
ITM.StandardCost

,
ITM.ListPrice

,
ITM.Class

,
ITM.Style

,
ITM.SellStartDate

,
ITM.SellEndDate

,
ITM.DiscontinuedDate

,
ITM.ModifiedDate

FROM
Production.Bikes ITM


     Once this step completed copy the OLE DB Source, paste it again in the current Data Flow Task twice, and modify the queries of the two new added data sources to point to Production.ClothesAccessories, respectively to Production.Components tables. From Toolbox add a Union All Transformation and link to it the (green) connectors of the three sources, and once you open the Union All Transformation Editor you’ll see that SQL Server mapped the columns automatically:
SSIS - Union All Editor
      Accept the mapping and link the (green) connector of the Union All Transformation to the existing OLE DB Destination. When attempting to use the OLE DB Destination Editor SQL Server will first show the Restore Invalid Column References Editor in which you could update the invalid references.
SSIS - Restor Invalid Column References 
    Accept the default settings and in the OLE DB Destination Editor create a new table based on the existing input (e.g. [Production].[ProductsOfInterest]), and test the Data Flow Task:
SSIS - Union All Data Load Task
   In the last step, before testing the whole package, in Control Flow tab change the Execute SQL Task’s SQLStatement property to point to the current destination table:
TRUNCATE TABLE [Production].[ProductsOfInterest]
   Save the project, test (debug) the package (twice) and don’t forget to validate the output data:
SELECT *
FROM
[Production].[ProductsOfInterest]


Note:
    In case you need to reuse the datasets in multiple packages, it makes sense to load each dataset in its own staging table rather then loading the data over and over again.

No comments: