
13 March 2010

🎡SSIS: Aggregate Data Flow Transformation

    At first look the SSIS Aggregate Data Flow Transformation doesn’t seem to be that useful, given the fact that the same functionality could be easily obtained with an aggregate query, which actually has more flexibility in what concerns data manipulation. However, it must not be forgotten that the power of transformations resides in the way they are combined in order to solve a problem, and not in their isolated use. On the other side, in order to master the basics it makes sense to create simple examples that include a minimum of information/tasks. That’s also the case of this post; for exemplification I will use the following query based on AdventureWorks’ Purchase Order Header/Detail tables:

SELECT POD.PurchaseOrderDetailID 
, POD.PurchaseOrderID 
, POD.ProductID 
, POD.OrderQty 
, POD.ReceivedQty 
, POD.UnitPrice 
, POD.OrderQty * POD.UnitPrice OrderValue 
, POD.ReceivedQty * POD.UnitPrice ReceiptValue 
, POH.OrderDate 
, POD.DueDate 
FROM Purchasing.PurchaseOrderDetail POD 
   JOIN Purchasing.PurchaseOrderHeader POH 
      ON POD.PurchaseOrderID = POH.PurchaseOrderID 
WHERE POH.Status IN (2, 4) -- 2-Approved, 4-Complete 

   Using the template SSIS package defined in the Third Magic Class post, copy and paste the Package.dtsx in the project and rename it (e.g. Package Aggregate.dtsx), then in the Data Flow Task rename the Data Source (e.g. OLE DB Source POs) and change the SQL command text to the above query. From the Toolbox add an Aggregate Transformation, link it to the OLE DB Source and access the Aggregate Transformation Editor; if everything went well until now, a “table” based on the above query will appear, in which you should check, in a first phase, the following attributes: ProductID, OrderQty, ReceivedQty, OrderValue and ReceiptValue. 

    Once an attribute is checked in the table, it appears also as an Input Column in the Editor, the same name being used also for the Output Alias. For ProductID the Editor chose ‘Group by’ as Operation, and ‘Sum’ for OrderValue. Change the Operation to ‘Sum’ for the other three attributes as in the below screenshot. Add also the OrderDate and, once more, the ProductID as Input Columns; for the first choose ‘Maximum’ as Operation, and for the second ‘Count’, in order to get the last OrderDate, respectively the number of records. 
 
SSIS - Aggregate Transformation Editor

     Once the above changes are completed, delete the existing OLE DB Destination and add a new one, link the Aggregate Transformation to it, create a new table (e.g. Purchasing.PurchaseOrderProductAggregation) based on the Aggregate’s input, accept the mappings and test the Data Flow Task: 
 
SSIS - Aggregate Data Flow Task
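
For reference, the table created in the destination editor should resemble the below sketch. It’s indicative only, as the editor derives the data types from the Aggregate’s output, and it assumes the Output Aliases of the last two columns were renamed to LastOrderDate and NumberRecords:

-- indicative definition; the actual data types are derived by the editor 
-- from the Aggregate's output and may differ 
CREATE TABLE [Purchasing].[PurchaseOrderProductAggregation] ( 
  [ProductID] int 
, [OrderQty] int 
, [ReceivedQty] numeric(38, 2) 
, [OrderValue] money 
, [ReceiptValue] numeric(38, 6) 
, [LastOrderDate] datetime 
, [NumberRecords] int 
) 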

     In the last step, before testing the whole package, in the Control Flow tab change the Execute SQL Task’s SQLStatement property to point to the current destination table:

TRUNCATE TABLE [Purchasing].[PurchaseOrderProductAggregation]   

   Save the project, test (debug) the package (twice) and don’t forget to validate the output data: 
 
SELECT * 
FROM [Purchasing].[PurchaseOrderProductAggregation]    

     The output from the above table should actually be the same as the output of the following aggregated query:

-- PO Product Aggregation 
SELECT POD.ProductID 
, SUM(POD.OrderQty) OrderQty 
, SUM(POD.ReceivedQty) ReceivedQty 
, SUM(POD.OrderQty * POD.UnitPrice) OrderValue 
, SUM(POD.ReceivedQty * POD.UnitPrice) ReceiptValue 
, MAX(POH.OrderDate) LastOrderDate 
, COUNT(1) NumberRecords 
FROM Purchasing.PurchaseOrderDetail POD 
   JOIN Purchasing.PurchaseOrderHeader POH 
      ON POD.PurchaseOrderID = POH.PurchaseOrderID 
WHERE POH.Status IN (2, 4) -- 2-Approved, 4-Complete 
GROUP BY POD.ProductID 
ORDER BY POD.ProductID
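
As a quick check, the two result sets can also be compared with an EXCEPT, a minimal sketch assuming the column names from the table sketch above:

-- should return no records if the table matches the aggregated source 
-- (differences in data type precision may surface here) 
SELECT ProductID, OrderQty, ReceivedQty, OrderValue, ReceiptValue, LastOrderDate, NumberRecords 
FROM [Purchasing].[PurchaseOrderProductAggregation] 
EXCEPT 
SELECT POD.ProductID 
, SUM(POD.OrderQty) 
, SUM(POD.ReceivedQty) 
, SUM(POD.OrderQty * POD.UnitPrice) 
, SUM(POD.ReceivedQty * POD.UnitPrice) 
, MAX(POH.OrderDate) 
, COUNT(1) 
FROM Purchasing.PurchaseOrderDetail POD 
   JOIN Purchasing.PurchaseOrderHeader POH 
      ON POD.PurchaseOrderID = POH.PurchaseOrderID 
WHERE POH.Status IN (2, 4) 
GROUP BY POD.ProductID 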

Happy Coding!

12 March 2010

🎡SSIS: The Union All Data Flow Transformation

In yesterday’s post I showed how to use the Conditional Split Data Flow Transformation in order to horizontally partition a dataset based on a set of constraints, allowing thus to dump the data from each partition to its corresponding table. The inverse problem considers merging the distributed partitions of the same dataset, or distinct similar datasets, into a common dataset. If the datasets that need to be merged are on the same server, it makes sense to join them by using UNION or UNION ALL inside of a simple Data Flow Task:

SELECT * -- specify explicitly the attributes 
FROM [Production].[Bikes] 
UNION ALL 
SELECT * -- specify explicitly the attributes 
FROM [Production].[ClothesAccessories] 
UNION ALL 
SELECT * -- specify explicitly the attributes 
FROM [Production].[Components] 

If the datasets are distributed across several data sources then, in theory, a package could be created for each data source in order to dump the data locally, however this often implies the synchronization of all the packages involved, for example by calling each package from a parent package. A better and easier approach is to use the Union All Data Flow Transformation, which merges the data from multiple sources into a common result set, a data source needing to be defined for each input dataset. For exemplification I will use the three Product tables created in the previous post, namely [Production].[Bikes], [Production].[ClothesAccessories] and [Production].[Components]. Even if the tables are located on the same server and the datasets could be merged with a UNION, I will use the general approach and reuse the template SSIS package defined in the Third Magic Class post. Copy and paste the Package.dtsx in the project and rename it (e.g. Package Union All.dtsx), then in the Data Flow Task rename the Data Source (e.g. OLE DB Source Bikes) and change the SQL command text to the following query: 
 
SELECT ITM.ProductID 
, ITM.ProductName 
, ITM.ProductNumber 
, ITM.ProductModel 
, ITM.ProductSubcategory 
, ITM.ProductCategory 
, ITM.MakeFlag 
, ITM.FinishedGoodsFlag 
, ITM.Color 
, ITM.StandardCost 
, ITM.ListPrice 
, ITM.Class 
, ITM.Style 
, ITM.SellStartDate 
, ITM.SellEndDate 
, ITM.DiscontinuedDate 
, ITM.ModifiedDate 
FROM Production.Bikes ITM 

Once this step is completed, copy the OLE DB Source and paste it in the current Data Flow Task twice, then modify the queries of the two newly added data sources to point to the Production.ClothesAccessories, respectively the Production.Components table. From the Toolbox add a Union All Transformation and link to it the (green) connectors of the three sources; once you open the Union All Transformation Editor you’ll see that the columns were mapped automatically: 
 
SSIS - Union All Editor
  
Accept the mapping and link the (green) connector of the Union All Transformation to the existing OLE DB Destination. When attempting to use the OLE DB Destination Editor, the Restore Invalid Column References Editor will be shown first, in which you can update the invalid references. 

SSIS - Restore Invalid Column References  
 
Accept the default settings and in the OLE DB Destination Editor create a new table based on the existing input (e.g. [Production].[ProductsOfInterest]), and test the Data Flow Task: 
 
SSIS - Union All Data Load Task

In the last step, before testing the whole package, in the Control Flow tab change the Execute SQL Task’s SQLStatement property to point to the current destination table: 

TRUNCATE TABLE [Production].[ProductsOfInterest] 
 
Save the project, test (debug) the package (twice) and don’t forget to validate the output data: 
 
SELECT * 
FROM [Production].[ProductsOfInterest] 
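
A quick sanity check is to compare the number of records from the three sources with the number of records from the destination, for example:

-- the destination should contain the sum of the three sources' records 
SELECT (SELECT COUNT(*) FROM [Production].[Bikes]) 
     + (SELECT COUNT(*) FROM [Production].[ClothesAccessories]) 
     + (SELECT COUNT(*) FROM [Production].[Components]) SourceRecords 
, (SELECT COUNT(*) FROM [Production].[ProductsOfInterest]) DestinationRecords 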

Note:
In case you need to reuse the datasets in multiple packages, it makes sense to load each dataset into its own staging table rather than loading the data over and over again, as sketched below.
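
For example, each dataset could be loaded once with a simple SELECT INTO, a minimal sketch in which the Staging schema is only an assumption:

-- one-time load into a staging table (the Staging schema is assumed to exist) 
SELECT * 
INTO [Staging].[Bikes] 
FROM [Production].[Bikes] 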

🎡SSIS: The Conditional Split Data Flow Transformation

In Third Magic Class I showed how to create a Data Flow Task using the SQL Server 2008 Business Intelligence Development Studio (BIDS), the respective SSIS package being based only on the Products with the ‘Bikes’ Product Category. What if it were needed to create a table for the ‘Components’ Product Category, and another one for ‘Clothes’ and ‘Accessories’? Normally this would equate with creating a package for each destination table, though SSIS provides the Conditional Split component, which allows partitioning a dataset horizontally based on a set of constraints.

For the current example copy and paste the Package.dtsx created in the mentioned tutorial and rename it (e.g. Package Conditional Split.dtsx). The first important change is to broaden the scope to the four Product Categories considered - ‘Bikes’, ‘Components’, ‘Clothing’ and ‘Accessories’ - therefore in the ‘OLE DB Source’ change the query as follows: 
 
SELECT ITM.ProductID 
, ITM.ProductName 
, ITM.ProductNumber 
, ITM.ProductModel 
, ITM.ProductSubcategory 
, ITM.ProductCategory 
, ITM.MakeFlag 
, ITM.FinishedGoodsFlag 
, ITM.Color 
, ITM.StandardCost 
, ITM.ListPrice 
, ITM.Class 
, ITM.Style 
, ITM.SellStartDate 
, ITM.SellEndDate 
, ITM.DiscontinuedDate 
, ITM.ModifiedDate 
FROM Production.vProducts ITM 
WHERE ProductCategory IN ('Bikes', 'Components', 'Clothing', 'Accessories') 

From the Toolbox add the Conditional Split component, delete the previous connector and attach it to the newly added transformation element. Double click on the Conditional Split in order to access the Conditional Split Transformation Editor, in which you’ll need to specify the conditions used for the split. Considering as the first condition the case treated in the previous tutorial, define [ProductCategory] == "Bikes" as Condition and rename the default ‘Case 1’ Output Name to ‘Case Bikes’. For the second condition consider [ProductCategory] == "Components" as Condition and ‘Case Components’ as Output Name, as in the below screenshot: 
 
SSIS - Conditional Split
 
There is no need to create a third Case for ‘Clothing’ and ‘Accessories’, because they will be handled by the Conditional Split’s default branch. Optionally you can rename the ‘Conditional Split Default Output’ Default output name to ‘Case Else’ as above.
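
For comparison, the split logic can be mirrored in T-SQL with a CASE expression, a sketch only, the actual routing being done by the Conditional Split:

-- shows to which output each record would be routed 
SELECT ITM.ProductID 
, ITM.ProductCategory 
, CASE ITM.ProductCategory 
     WHEN 'Bikes' THEN 'Case Bikes' 
     WHEN 'Components' THEN 'Case Components' 
     ELSE 'Case Else' 
  END OutputName 
FROM Production.vProducts ITM 
WHERE ITM.ProductCategory IN ('Bikes', 'Components', 'Clothing', 'Accessories') 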

For each Case in the Conditional Split, including the default one, you’ll need to have a destination, therefore drop in two more OLE DB Destinations and link to each of the three destinations a green connector from the Conditional Split. For each connector you’ll have to select the output the branch is supposed to address, and be sure that you selected ‘Case Bikes’ for the destination inherited from the template package! For the other two you’ll have to create two destination tables using the default structure and rename them [Production].[ClothesAccessories], respectively [Production].[Components]. The two tables will have the same definition as the [Production].[Bikes] table, as sketched below. Do not forget to check the ‘Keep identity’ and ‘Keep nulls’ checkboxes too!
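
For reference, the definition proposed by the editor for each of the tables should resemble the below sketch; the data types are assumptions based on the vProducts output and may differ from what the editor generates:

-- indicative definition, the data types being derived from the vProducts output 
CREATE TABLE [Production].[ClothesAccessories] ( 
  [ProductID] int 
, [ProductName] nvarchar(50) 
, [ProductNumber] nvarchar(25) 
, [ProductModel] nvarchar(50) 
, [ProductSubcategory] nvarchar(50) 
, [ProductCategory] nvarchar(50) 
, [MakeFlag] bit 
, [FinishedGoodsFlag] bit 
, [Color] nvarchar(15) 
, [StandardCost] money 
, [ListPrice] money 
, [Class] nchar(2) 
, [Style] nchar(2) 
, [SellStartDate] datetime 
, [SellEndDate] datetime 
, [DiscontinuedDate] datetime 
, [ModifiedDate] datetime 
) 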

Save the package and debug it, the resulting package looking similar to the one from the below screenshot: 
 
SSIS - Conditional Split Package  

We have used the Conditional Split Transformation and tested it, though we haven’t finished yet! As I remarked in the previous post, the data from the destination table(s) needs to be deleted each time the package is run. For this, in the Control Flow tab modify the Execute SQL Task and replace the existing statement in the SQLStatement property with the following lines: 

TRUNCATE TABLE [Production].[Bikes] 
TRUNCATE TABLE [Production].[ClothesAccessories] 
TRUNCATE TABLE [Production].[Components] 

Link the Execute SQL Task connector to the Data Flow Task, make sure that the Constraint option value is set to ‘Success’, save the project, test (debug) the package (twice) and don’t forget to validate the output data!
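
The partitions can be validated, for example, by checking the Product Categories that reached each table:

-- each table should contain only the expected Product Categories 
SELECT DISTINCT ProductCategory FROM [Production].[Bikes] -- Bikes 
SELECT DISTINCT ProductCategory FROM [Production].[ClothesAccessories] -- Clothing, Accessories 
SELECT DISTINCT ProductCategory FROM [Production].[Components] -- Components 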

Note:
I preferred to keep things as simple as possible, especially when considering the constraints used for the Conditional Split, however this shouldn’t stop you from using the supported library of functions and operators in order to create more complex constraints. This post depends on the Third Magic Class post and the SSIS “template” package created in it, so please start with the respective post!
