12 March 2010

🎡SSIS: The Union All Data Flow Transformation

In yesterday’s post I showed how to use the Conditional Split Data Flow Transformation to partition a dataset horizontally based on a set of constraints, dumping the data from each partition into its corresponding table. The inverse problem considers merging the distributed partitions of the same dataset, or several similar datasets, back into a common result set. If the datasets to be merged reside on the same server, it makes sense to join them using UNION or UNION ALL inside a simple Data Flow Task:

SELECT * -- specify explicitly the attributes 
FROM [Production].[Bikes] 
UNION ALL 
SELECT * -- specify explicitly the attributes 
FROM [Production].[ClothesAccessories] 
UNION ALL 
SELECT * -- specify explicitly the attributes 
FROM [Production].[Components] 

If the datasets are distributed across several data sources then in theory a package could be created for each data source in order to dump the data locally, however this often implies synchronizing all the involved packages, eventually by calling each package from a parent package. A better and easier approach is to use the Union All Data Flow Transformation, which merges the data from multiple sources into a common result set, a data source being defined for each input dataset. As exemplification I will use the three Product tables created in the previous post, namely [Production].[Bikes], [Production].[ClothesAccessories] and [Production].[Components]. Even if the tables are located on the same server and the datasets could be merged with a UNION, I will use the general approach and reuse the template SSIS package defined in the Third Magic Class post. Copy paste the Package.dtsx in the project and rename it (e.g. Package Union All.dtsx), in the Data Flow Task rename the Data Source (e.g. OLE DB Source Bikes) and change the SQL command text to the following query: 
 
SELECT ITM.ProductID 
, ITM.ProductName 
, ITM.ProductNumber 
, ITM.ProductModel 
, ITM.ProductSubcategory 
, ITM.ProductCategory 
, ITM.MakeFlag 
, ITM.FinishedGoodsFlag 
, ITM.Color 
, ITM.StandardCost 
, ITM.ListPrice 
, ITM.Class 
, ITM.Style 
, ITM.SellStartDate 
, ITM.SellEndDate 
, ITM.DiscontinuedDate 
, ITM.ModifiedDate 
FROM Production.Bikes ITM 

Once this step is completed, copy the OLE DB Source, paste it twice in the current Data Flow Task, and modify the queries of the two newly added data sources to point to the Production.ClothesAccessories, respectively Production.Components tables. From the Toolbox add a Union All Transformation and link the (green) connectors of the three sources to it; once you open the Union All Transformation Editor you’ll see that SQL Server mapped the columns automatically: 
 
SSIS - Union All Editor
  
Accept the mapping and link the (green) connector of the Union All Transformation to the existing OLE DB Destination. When attempting to use the OLE DB Destination Editor, SQL Server will first show the Restore Invalid Column References Editor, in which you can update the invalid references. 

SSIS - Restore Invalid Column References  
 
Accept the default settings and in the OLE DB Destination Editor create a new table based on the existing input (e.g. [Production].[ProductsOfInterest]), then test the Data Flow Task: 
 
SSIS - Union All Data Load Task

In the last step, before testing the whole package, change the Execute SQL Task’s SQLStatement property in the Control Flow tab to point to the current destination table: 

TRUNCATE TABLE [Production].[ProductsOfInterest] 
 
Save the project, test (debug) the package (twice) and don’t forget to validate the output data: 
 
SELECT * 
FROM [Production].[ProductsOfInterest] 

Note:
In case you need to reuse the datasets in multiple packages, it makes sense to load each dataset into its own staging table rather than loading the data over and over again.

🎡SSIS: The Conditional Split Data Flow Transformation

In Third Magic Class I showed how to create a Data Flow Task using the SQL Server 2008 Business Intelligence Development Studio (BIDS), the respective SSIS package being based only on Products with the Product Category ‘Bikes’. What if a table were needed for the ‘Components’ Product Category, and another one for ‘Clothing’ and ‘Accessories’? Normally this would equate to creating a package for each destination table, though SSIS provides the Conditional Split component, which allows partitioning a data set horizontally based on a set of constraints.

For the current example copy paste the Package.dtsx created in the mentioned tutorial and rename it (e.g. Package Conditional Split.dtsx). The first important change is to enlarge the scope to the four Product Categories considered - 'Bikes', 'Components', 'Clothing' and 'Accessories' - therefore in the ‘OLE DB Source’ change the query as follows: 
 
SELECT ITM.ProductID 
, ITM.ProductName 
, ITM.ProductNumber 
, ITM.ProductModel 
, ITM.ProductSubcategory 
, ITM.ProductCategory 
, ITM.MakeFlag 
, ITM.FinishedGoodsFlag 
, ITM.Color 
, ITM.StandardCost 
, ITM.ListPrice 
, ITM.Class 
, ITM.Style 
, ITM.SellStartDate 
, ITM.SellEndDate 
, ITM.DiscontinuedDate 
, ITM.ModifiedDate 
FROM Production.vProducts ITM 
WHERE ProductCategory IN ('Bikes', 'Components', 'Clothing', 'Accessories') 

From the Toolbox add the Conditional Split component, delete the previous connector and attach it to the newly added transformation element. Double click on the Conditional Split in order to access the Conditional Split Transformation Editor, in which you’ll need to specify the conditions used for the split. Considering as the first condition the case treated in the previous tutorial, define [ProductCategory] == "Bikes" as Condition and rename the default ‘Case 1’ Output Name to ‘Case Bikes’. For the second condition consider [ProductCategory] == "Components" as Condition and ‘Case Components’ as Output Name, as in the below screenshot: 
 
SSIS - Conditional Split
 
There is no need to create a third Case for 'Clothing' and 'Accessories', because they will be handled by the Conditional Split’s Default branch. Eventually you could rename the ‘Conditional Split Default Output’ default output name to ‘Case Else’ as above.

For each Case in the Conditional Split, including the Default one, you’ll need to have a destination, therefore drop two more OLE DB Destinations and link a green connector from the Conditional Split to each of the three destinations. For each connector you’ll have to select the output the branch is supposed to address, and be sure that you selected ‘Case Bikes’ for the destination inherited from the template package! For the other two you’ll have to create two destination tables using the default structure and rename them [Production].[ClothesAccessories], respectively [Production].[Components]. The two tables will have the same definition as the [Production].[Bikes] table (see the sketch below). Do not forget to check the ‘Keep identity’ and ‘Keep nulls’ checkboxes too!
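For reference, the generated scripts should mirror the [Production].[Bikes] definition from the template post; a sketch for one of the two tables (only the table name differs, the other being [Production].[Components]): 

CREATE TABLE [Production].[ClothesAccessories] (
[ProductID] int,
[ProductName] nvarchar(50),
[ProductNumber] nvarchar(25),
[ProductModel] nvarchar(50),
[ProductSubcategory] nvarchar(50),
[ProductCategory] nvarchar(50),
[MakeFlag] bit,
[FinishedGoodsFlag] bit,
[Color] nvarchar(15),
[StandardCost] money,
[ListPrice] money,
[Class] nvarchar(2),
[Style] nvarchar(2),
[SellStartDate] datetime,
[SellEndDate] datetime,
[DiscontinuedDate] datetime,
[ModifiedDate] datetime
)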

Save the package and debug it, the resulting package looking similar to the one from the below screenshot: 
 
SSIS - Conditional Split Package  

We have used the Conditional Split Transformation and tested it, though we haven’t finished yet! As I remarked in the previous post, the data from the destination table(s) need to be deleted each time the package is run. For this, in the Control Flow tab modify the Execute SQL Task and replace the existing SQLStatement with the following lines: 

TRUNCATE TABLE [Production].[Bikes] 
TRUNCATE TABLE [Production].[ClothesAccessories] 
TRUNCATE TABLE [Production].[Components] 

Link the Execute SQL Task connector to the Data Flow Task, assure that the Constraint option value is set to ‘Success’, save the project, test (debug) the package (twice) and don’t forget to validate the output data!

Note:
I preferred to keep things as simple as possible, especially when considering the constraints used for the Conditional Split, however this shouldn't stop you from attempting to use the supported library of functions and operators in order to create more complex constraints. This post depends on the Third Magic Class post and the SSIS “template” package created in it, so please start with the respective post!

11 March 2010

💎SQL Reloaded: Oracle vs. SQL Server: Averages I

   For many people working with averages is not a complicated topic at all, and that’s quite true, though as always it depends on how much you want to complicate everything. The average or arithmetic mean of a data set is calculated as the sum of the values divided by the number of values: 
AVG - Simple Average 
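For reference, the simple average described above can be written as:

\bar{x} = \frac{x_1 + x_2 + \dots + x_n}{n} = \frac{1}{n}\sum_{i=1}^{n}x_i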
   RDBMS vendors provide this functionality within the AVG aggregate function, and even if in theory the result could be calculated using the SUM and COUNT aggregate functions, as per the above definition, given its importance in statistics, providing an AVG function seems to be a natural choice. AVG can be used across the whole data set or within a given “partition” determined by the GROUP BY clause.

-- Oracle/SQL Server: Simple Averages 
SELECT ProductID 
, AVG(UnitPrice) AverageUnitPrice 
, SUM(UnitPrice)/COUNT(1) AverageUnitPriceDef 
FROM Purchasing.PurchaseOrderDetail 
GROUP BY ProductID 


    For most data analysis requirements the simple average would be enough, though there are cases in which it is required to address the sensitivity of values to certain modifiers – for example quantities or the number of lines. For such cases the weighted average (weighted arithmetic mean) is used, calculated as the sum of the products between values and quantifiers, the total being divided by the sum of the quantifiers:
AVG - Weighted Average
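With x_i as values and w_i as quantifiers (weights), the weighted average described above can be written as:

\bar{x}_w = \frac{w_1 x_1 + w_2 x_2 + \dots + w_n x_n}{w_1 + w_2 + \dots + w_n} = \frac{\sum_{i=1}^{n}w_i x_i}{\sum_{i=1}^{n}w_i}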
Note:
     In case the quantifiers are all 1 then the weighted average equates to the simple average, the output being the same also when the values are constant.

      The weighted average is quite a useful tool when performing PO Price Analysis, in which the volume of quantities purchased at a given Unit Price often needs to be considered as well: 
 
-- Oracle/SQL Server: Simple/Weighted Averages 
SELECT ProductID 
, AVG(UnitPrice) AverageUnitPrice 
, SUM(UnitPrice*OrderQty)/SUM(OrderQty) WeightedAverageUnitPrice 
FROM Purchasing.PurchaseOrderDetail 
GROUP BY ProductID  

   The actual implementation in a query might differ based on requirements and preferences – as multiple left joins, an inline view or a correlated query – each of the mentioned approaches coming with its downsides and strengths: 
 
-- Oracle/SQL Server: Averages (inline view) 
SELECT ITM.ProductID 
, ITM.Name ProductName 
, ITM.ProductNumber 
, DAT.AverageUnitPrice 
, DAT.WeightedAverageUnitPrice 
FROM Production.Product ITM 
LEFT JOIN ( --calculated averages 
    SELECT POD.ProductID 
    , AVG(POD.UnitPrice) AverageUnitPrice 
    , SUM(POD.UnitPrice*OrderQty)/SUM(POD.OrderQty) WeightedAverageUnitPrice 
    FROM Purchasing.PurchaseOrderDetail POD 
       JOIN Purchasing.PurchaseOrderHeader POH 
          ON POD.PurchaseOrderID = POH.PurchaseOrderID 
    WHERE POH.Status IN (2, 4) -- 2-Approved, 4-Complete 
   GROUP BY POD.ProductID)DAT 
    ON ITM.ProductID = DAT.ProductID 
ORDER BY ITM.Name 

-- Oracle/SQL Server: Averages (multiple left joins) 
SELECT ITM.ProductID 
, ITM.Name ProductName 
, ITM.ProductNumber 
, AVG(POD.UnitPrice) AverageUnitPrice 
, SUM(POD.UnitPrice*OrderQty)/SUM(POD.OrderQty) WeightedAverageUnitPrice 
FROM Production.Product ITM 
   LEFT JOIN Purchasing.PurchaseOrderDetail POD 
      ON ITM.ProductID = POD.ProductID 
   LEFT JOIN Purchasing.PurchaseOrderHeader POH 
      ON POD.PurchaseOrderID = POH.PurchaseOrderID  
WHERE POH.Status IN (2, 4) -- 2-Approved, 4-Complete 
GROUP BY ITM.ProductID 
,ITM.Name 
,ITM.ProductNumber 
ORDER BY ITM.Name  

-- SQL Server 2005: Averages (OUTER APPLY) 
SELECT ITM.ProductID 
, ITM.Name ProductName 
, ITM.ProductNumber 
, DAT.AverageUnitPrice 
, DAT.WeightedAverageUnitPrice 
FROM Production.Product ITM 
OUTER APPLY ( -- calculated averages 
    SELECT AVG(POD.UnitPrice) AverageUnitPrice 
    , SUM(POD.UnitPrice*OrderQty)/SUM(POD.OrderQty) WeightedAverageUnitPrice 
    FROM Purchasing.PurchaseOrderDetail POD  
       JOIN Purchasing.PurchaseOrderHeader POH 
          ON POD.PurchaseOrderID = POH.PurchaseOrderID  
    WHERE ITM.ProductID = POD.ProductID  
        AND POH.Status IN (2, 4) -- 2-Approved, 4-Complete 
) DAT 
ORDER BY ITM.Name 

      A different type of approach could focus on calculating the simple or weighted average based only on the last n PO Lines, for example the last 3 PO Lines: 
  
-- Oracle/SQL Server 2005: last 3 PO average prices 
SELECT DAT.ProductID 
, MAX(DAT.RANKING) NumberRecords 
, AVG(DAT.UnitPrice) AveragePrice 
, SUM(DAT.UnitPrice* DAT.OrderQty)/SUM(DAT.OrderQty) WeightedAveragePrice 
FROM (-- ranked PO Prices 
    SELECT POD.ProductID 
    , POD.UnitPrice 
    , POD.OrderQty 
    , RANK() OVER(PARTITION BY POD.ProductID ORDER BY POH.OrderDate DESC,   POD.PurchaseOrderDetailID DESC) RANKING 
    FROM Purchasing.PurchaseOrderDetail POD 
       JOIN Purchasing.PurchaseOrderHeader POH 
          ON POD.PurchaseOrderID = POH.PurchaseOrderID 
    WHERE POH.Status IN (2, 4) -- 2-Approved, 4-Complete 
) DAT 
WHERE DAT.RANKING BETWEEN 1 AND 3 
GROUP BY DAT.ProductID
 


    The problem could be complicated even more by selecting all the POs corresponding to the last 3 distinct Prices used; I’m not sure it brings any value, but the formulation is perfectly valid, so don’t despair if a user comes with such a request.
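Here is a possible sketch for this variation (not part of the original queries above), assuming the “last 3 distinct Prices” are the 3 distinct Unit Prices most recently used per Product: 

-- Oracle/SQL Server 2005: averages over the last 3 distinct PO Prices (sketch) 
SELECT DAT.ProductID 
, AVG(DAT.UnitPrice) AveragePrice 
, SUM(DAT.UnitPrice*DAT.OrderQty)/SUM(DAT.OrderQty) WeightedAveragePrice 
FROM (-- Prices ranked by the last date each distinct Price was used 
    SELECT PRC.ProductID 
    , PRC.UnitPrice 
    , PRC.OrderQty 
    , DENSE_RANK() OVER(PARTITION BY PRC.ProductID ORDER BY PRC.LastOrderDate DESC, PRC.UnitPrice DESC) PriceRanking 
    FROM (-- PO lines with the last usage date of their Price 
        SELECT POD.ProductID 
        , POD.UnitPrice 
        , POD.OrderQty 
        , MAX(POH.OrderDate) OVER(PARTITION BY POD.ProductID, POD.UnitPrice) LastOrderDate 
        FROM Purchasing.PurchaseOrderDetail POD 
           JOIN Purchasing.PurchaseOrderHeader POH 
              ON POD.PurchaseOrderID = POH.PurchaseOrderID 
        WHERE POH.Status IN (2, 4) -- 2-Approved, 4-Complete 
    ) PRC 
) DAT 
WHERE DAT.PriceRanking BETWEEN 1 AND 3 
GROUP BY DAT.ProductID 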

    Occasionally it might be needed to study the evolution of prices over time, the previous problem becoming a simple moving average or weighted moving average problem, in which the average is calculated relative to a moving point on the time scale, with respect to the previous k points, or to the previous and following k points including the current point:
AVG - Simple Moving Average
AVG - Weighted Moving Average
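For reference, considering a window formed by the current point t and the previous k-1 points, the two moving averages described above can be written as:

SMA_t = \frac{1}{k}\sum_{i=t-k+1}^{t}x_i \qquad WMA_t = \frac{\sum_{i=t-k+1}^{t}w_i x_i}{\sum_{i=t-k+1}^{t}w_i}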
     G. Priester shows in his Calculating Moving Averages with T-SQL article a nice solution for simple moving averages calculation, exemplifying it on stock market data. Frankly, given its flexibility, I prefer to use the CROSS APPLY operator in combination with the fGetDatesInInterval function introduced in the More Date-related Functionality – Part I post, the function returning all the dates falling within a certain interval. Because in the case of POs there could be multiple transactions per day for the same Product, the problem changes slightly, the averages being calculated within a 31-day window:

-- SQL Server 2005: 31 days moving averages 
SELECT ITM.ProductNumber 
, ITM.Name ProductName 
, DIN.DateSequence ReportingDate 
, DAT.AverageUnitPrice 
, DAT.WeightedAverageUnitPrice 
FROM dbo.fGetDatesInInterval('2003-01-01', '2003-12-31') DIN 
CROSS APPLY (--calculated averages 
    SELECT POD.ProductID 
    , AVG(POD.UnitPrice) AverageUnitPrice 
    , SUM(POD.UnitPrice*OrderQty)/SUM(POD.OrderQty) WeightedAverageUnitPrice 
    FROM Purchasing.PurchaseOrderDetail POD 
       JOIN Purchasing.PurchaseOrderHeader POH 
          ON POD.PurchaseOrderID = POH.PurchaseOrderID 
     WHERE DATEDIFF(d, POH.OrderDate, DIN.DateSequence) BETWEEN 0 AND 30 -- 31 days 
     --WHERE DATEDIFF(d, POH.OrderDate, DIN.DateSequence) BETWEEN -15 AND 15 -- 31 days 
         AND POH.Status IN (2, 4) -- 2-Approved, 4-Complete 
GROUP BY POD.ProductID) DAT 
LEFT JOIN Production.Product ITM 
   ON DAT.ProductID = ITM.ProductID 
WHERE DATEPART(dw, DIN.DateSequence) BETWEEN 2 AND 6 -- Working Days 
ORDER BY ITM.Name 
, DIN.DateSequence 

🎡SSIS: Data Flow Task (Third Magic Class)

Independently of the data sources or destinations used, all ETL tasks have as their cornerstone the simple creation of a Data Flow Task. While in the previous two tutorials I showed how to export data from an Excel file (see First Magic Class), respectively an Oracle database (see Second Magic Class), using the SQL Server Import and Export Wizard, in this tutorial I will show how you can create an SSIS package from scratch using the SQL Server 2008 Business Intelligence Development Studio.

Step 1: Create the View
Before starting to create the SSIS package, and especially when dealing with multiple tables, it’s useful to encapsulate the logic in a view. This primarily allows you to capture, test, refine and stabilize the requirements before creating the package, thus minimizing the changes done in the package itself, and secondly it allows reusing the logic in other packages or database objects. Here’s the source code for the vProducts view that will be used in this tutorial: 
 
CREATE VIEW [Production].[vProducts] 
-- Products (view) 
AS 
SELECT ITM.ProductID 
,ITM.ProductSubcategoryID 
,ITM.ProductModelID 
,ITM.Name ProductName 
,ITM.ProductNumber 
,PPM.Name ProductModel 
,PPM.CatalogDescription 
,PPS.Name ProductSubcategory 
,PPC.Name ProductCategory 
,ITM.MakeFlag 
,ITM.FinishedGoodsFlag 
,ITM.Color 
,ITM.SafetyStockLevel 
,ITM.ReorderPoint 
,ITM.StandardCost 
,ITM.ListPrice 
,ITM.Size 
,ITM.SizeUnitMeasureCode 
,SUOM.Name SizeUnitMeasure 
,ITM.WeightUnitMeasureCode 
,WUOM.Name WeightUnitMeasure 
,ITM.Weight 
,ITM.DaysToManufacture 
,ITM.ProductLine 
,ITM.Class 
,ITM.Style 
,ITM.SellStartDate 
,ITM.SellEndDate 
,ITM.DiscontinuedDate 
,ITM.ModifiedDate 
FROM Production.Product ITM 
    LEFT JOIN Production.UnitMeasure SUOM 
       ON ITM.SizeUnitMeasureCode = SUOM.UnitMeasureCode 
    LEFT JOIN Production.UnitMeasure WUOM 
        ON ITM.WeightUnitMeasureCode = WUOM.UnitMeasureCode 
    LEFT JOIN Production.ProductModel PPM 
       ON ITM.ProductModelID = PPM.ProductModelID 
    LEFT JOIN Production.ProductSubcategory PPS 
        ON ITM.ProductSubcategoryID = PPS.ProductSubcategoryID 
    LEFT JOIN Production.ProductCategory PPC 
       ON PPS.ProductCategoryID = PPC.ProductCategoryID 

Note:
The reuse of database objects like views, stored procedures and user-defined functions comes with its costs: even if they help simplify the maintenance of business logic, a change impacts the whole set of objects and applications referencing the changed object, thus for example some of the SSIS packages using the respective object could be invalidated. Sometimes it’s useful to keep a list of the objects and (especially) applications using your database objects, just in case you want to easily validate a change’s impact.
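On SQL Server 2008 at least the database-side part of such a list can be pulled directly from the catalog; a minimal sketch (it covers only objects living in the same database, not external applications or SSIS packages): 

-- database objects referencing the vProducts view 
SELECT referencing_schema_name 
, referencing_entity_name 
FROM sys.dm_sql_referencing_entities('Production.vProducts', 'OBJECT') 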

Step 2: Create the SSIS Project
From the Start/Microsoft SQL Server 2008 folder open the SQL Server Business Intelligence Development Studio and from its Menu choose File/New/Project; in the New Project dialog choose Integration Services Project as Template, enter a meaningful (Project) Name (e.g. SSIS Package Template), specify the Location (e.g. D:\Work\SQL Server) and eventually change the default Solution Name, though that’s not normally required, then click ‘OK’. 
 
SSIS - New Project
 
The newly created project shell will look as below, with the default Package.dtsx (eventually you could rename the package): 
 
SSIS - Project Shell
 
Step 3: Create the Data Source
     In the Solution Explorer click on Data Sources/New Data Source in order to open the Data Source Wizard, in which you can create a new Data Source for the AdventureWorks database using the Connection Manager as in the previous two tutorials or, in case the source is already available, just select it.

Note:   
      It is not really required to create a Data Source explicitly, though this approach allows you to manage the connections more easily, especially when the same connection is reused in multiple packages.

Step 4: Create the Data Flow
     Within the Package.dtsx, from the Toolbox double click on the Data Flow Task, the Task being added to the Control Flow tab; then, in order to add components to it, click on the Data Flow tab and from the Toolbox add an ‘OLE DB Source’ and an ‘OLE DB Destination’.
     Double click on the ‘OLE DB Source’ and in the ‘OLE DB Source Editor’ dialog just opened select the defined Data Source (e.g. AdventureWorks) as the OLE DB connection manager, in ‘Data access mode’ select ‘SQL command’ from the list of available options, and in the ‘SQL command text’ box below copy paste the following query: 

SELECT ITM.ProductID 
, ITM.ProductName 
, ITM.ProductNumber 
, ITM.ProductModel 
, ITM.ProductSubcategory 
, ITM.ProductCategory 
, ITM.MakeFlag 
, ITM.FinishedGoodsFlag 
, ITM.Color 
, ITM.StandardCost 
, ITM.ListPrice 
, ITM.Class 
, ITM.Style 
, ITM.SellStartDate 
, ITM.SellEndDate 
, ITM.DiscontinuedDate 
, ITM.ModifiedDate 
FROM Production.vProducts ITM 
WHERE ProductCategory = 'Bikes' 
 
SSIS - OLE DB Source Editor 
Notes:
1.    The vProducts view could have included only the above attributes and the required constraint(s), though you always have to consider changes in scope; typically it’s preferred to include all the attributes the business is usually interested in, unless the number of attributes is quite huge or some of them are not used.
2.     Instead of using the ‘SQL command’ Data access mode you could have chosen ‘Table or view’ and thus used the vProducts view directly, in the end selecting only the needed columns in the Columns tab of the ‘OLE DB Source Editor’.
SSIS - OLE DB Source Editor Columns

   Once the data source is created, drag the green connector over the ‘OLE DB Destination’, and double click on the latter in order to access the ‘OLE DB Destination Editor’. In ‘OLE DB connection manager’ select the destination source, which could be the same connection used for the source, and select ‘Table or view – fast load’ as ‘Data access mode’. As the data need to be dumped to a table, we either use an existing table or create one on the fly by clicking on the ‘New’ button, next to the ‘Name of the table or the view’ control. Based on the source attributes, SQL Server already provides the script needed to create the table; all that’s needed is to change the default table name (e.g. OLE DB Destination) to a meaningful one (e.g. Production.Bikes):
CREATE TABLE [Production].[Bikes] (
[ProductID] int,
[ProductName] nvarchar(50),
[ProductNumber] nvarchar(25),
[ProductModel] nvarchar(50),
[ProductSubcategory] nvarchar(50),
[ProductCategory] nvarchar(50),
[MakeFlag] bit,
[FinishedGoodsFlag] bit,
[Color] nvarchar(15),
[StandardCost] money,
[ListPrice] money,
[Class] nvarchar(2),
[Style] nvarchar(2),
[SellStartDate] datetime,
[SellEndDate] datetime,
[DiscontinuedDate] datetime,
[ModifiedDate] datetime
)

    Because it’s useful to keep the identity key (just in case you want to validate the logic) and the null values (unless you want to replace them with default values), check the ‘Keep identity’ and ‘Keep nulls’ checkboxes.
SSIS - OLE DB Destination Editor

Step 5: Testing the Data Flow Task
     Save the Package and from the Menu select Debug/Start debugging in order to run the package. Normally the package should run without problems, the tasks/components being shown in green as they are processed:
SSIS - Running the Package
     As always, it’s a good idea to check whether the data were saved as expected: 
SELECT * 
FROM [Production].[Bikes] 
ORDER BY ProductName 


Step 6: Delete the Previous Records
     The next time the package is run the data will be written to the respective table regardless of the existing records, therefore the previous records need to be deleted before the data are dumped to the table. For this, in the Control Flow tab add an Execute SQL Task and in its Editor select the AdventureWorks Data Source as Connection, while in the SQLStatement add the following statement: 
 
TRUNCATE TABLE [Production].[Bikes]

SSIS - Delete Previous Records 
 
Notes:
1.    It’s preferable to TRUNCATE the table instead of using a DELETE statement: the latter logs the deletion of each individual row in the transaction log, while TRUNCATE removes all the table’s pages, no per-record deletion being actually performed (see the comparison below).
2.    If you click on the ‘Build Query’ button, because the Query Builder doesn’t support DDL statements you’ll get the following syntax error: “The TRUNCATE TABLE SQL construct or statement is not supported”. Just ignore the error message!
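For comparison only (the package uses the TRUNCATE variant): 

-- removes all the table's pages, no per-row logging 
TRUNCATE TABLE [Production].[Bikes] 

-- fully logged alternative: each deleted row is written to the transaction log 
DELETE FROM [Production].[Bikes] 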
 
    Now link the Execute SQL Task connector to the Data Flow Task and, by clicking again on it, review its properties in the Precedence Constraint Editor and assure that the Constraint option value is set to ‘Success’ (the default value), this meaning that the next step in the package will be executed only if the precedent step succeeded. 
 
SSIS - Precedence Constraint Editor

Note:
     When data are deleted or other important actions are performed in the current step, it is always a good idea to process the next step only if the current task succeeded, otherwise your data/logic might be invalidated.

Step 7: Testing the Package
     Now you are ready to test (debug) the whole package, but before this don’t forget to save your project! 
 
SSIS - Template Package

   Typically you’ll need to test the package at least twice in order to assure yourself that the previous data are deleted. It also helps to run the select from the [Production].[Bikes] table again, just to assure yourself that everything went as expected!

Notes:
1.    Actually you should save your work each time an important step is completed, because you never know when a runtime error occurs and your application closes unexpectedly, thus losing your work.
2.    Usually when dealing with multiple tasks and transformations it makes sense to rename the respective objects, allowing you to identify them more easily.

09 March 2010

🕋Data Warehousing: Degenerate Dimension (Definitions)

"Degenerate dimensions represent a business event in the fact table. For example, a business event would consist of an order, invoice, or a shipping ticket. Degenerate dimensions are not dimensional keys or part of the additive numeric facts and do not have an associated dimension table." (Microsoft Corporation, "Microsoft SQL Server 7.0 Data Warehouse Training Kit", 2000)

"A dimension key, such as a transaction number, invoice number, ticket number, or bill-of-lading number, that has no attributes and hence does not join to an actual dimension table." (Ralph Kimball & Margy Ross, "The Data Warehouse Toolkit" 2nd Ed., 2002)

"A dimension whose members are stored in a fact table instead of a dimension table." (Reed Jacobsen & Stacia Misner, "Microsoft SQL Server 2005 Analysis Services Step by Step", 2006)

"A dimension where there are no valid dimensional attributes other than a unique identifier in a one-to-one relationship with a fact table." (DAMA International, "The DAMA Dictionary of Data Management", 2011)

"A relationship between a dimension and a measure group in which the dimension main table is the same as the measure group table." (Microsoft, "SQL Server 2012 Glossary", 2012)

🕋Data Warehousing: Dimensional Modeling (Definitions)

"A methodology for logically modeling data for query performance and ease of use that starts from a set of base measurement events. In the relational DBMS environment, a fact table is constructed generally with one record for each discrete measurement. This fact table is then surrounded by a set of dimension tables describing precisely what is known in the context of each measurement record. Because of the characteristic structure of a dimensional model, it is often called a star schema." (Ralph Kimball & Margy Ross, "The Data Warehouse Toolkit" 2nd Ed., 2002)

"A formal data modeling technique that is used to organize and represent data for analytical and reporting use. The focus is on the business perspective and the representation of data." (Laura Reeves, "A Manager's Guide to Data Warehousing", 2009)

"A generally accepted practice in the data warehouse industry to structure data intended for user access, analysis, and reporting in dimensional data models" (Daniel Linstedt & W H Inmon, "Data Architecture: A Primer for the Data Scientist", 2014)

"The most frequently used data model for data warehouses." (Jan L Harrington, "Relational Database Design and Implementation" 3rd Ed., 2009)

"With dimensional data modeling or denormalization, data is collapsed, combined, or grouped together. Within dimensional data modeling, the concepts of facts (measures) and dimensions (context) are used. If dimensions are collapsed into single structures, the data model is also often called a star schema. If the dimensions are not collapsed, the data model is called snowflake. The dimensional models are typically seen within data warehouse systems." (Piethein Strengholt, "Data Management at Scale", 2020)

08 March 2010

💎SQL Reloaded: TSQL Challenge 24

Beyond Relational came up with an interesting series of TSQL Challenges that has already reached its 25th round. Here’s my answer for TSQL Challenge 24, this being the first time I’ve attempted to come up with a solution:
 
WITH CTE(LogID, EmpName, Activity, StartTime, TopStartTime, NoMinutes, StartDuration, DiffDuration, HasParentFlag, HasChildFlag, Level) 
AS ( -- initialization query
SELECT A.LogID, A.EmpName, A.Activity, A.StartTime
, A.StartTime TopStartTime
, DATEPART(mi, A.Duration) NoMinutes
, DATEPART(mi, A.Duration) StartDuration
, DAT.NoMinutes DiffDuration
, IsNull(DAT.HasParentFlag, 0) HasParentFlag
, ISNULL(DAT.HasChildFlag, 0) HasChildFlag
, 1 Level
FROM TC24_ActivityLog A
LEFT JOIN TC24_ActivityLog B
ON A.EmpName = B.EmpName
AND A.Activity = B.Activity
AND DateDiff(Mi, B.StartTime, A.StartTime) BETWEEN 0 AND 30
AND DateDiff(dd, B.StartTime, A.StartTime)= 0
AND A.LogID <> B.LogID
CROSS APPLY ( -- cumulate
     SELECT SUM(DATEPART(mi, C.Duration)) NoMinutes
     , Max(CASE WHEN D.EmpName IS NOT NULL THEN 1 ELSE 0 END) HasParentFlag
     , Max(CASE WHEN E.EmpName IS NOT NULL THEN 1 ELSE 0 END) HasChildFlag
     , Max(CASE WHEN D.EmpName IS NULL AND E.EmpName IS NULL THEN 1 ELSE 0 END) IsSingleFlag
      FROM TC24_ActivityLog C
        LEFT JOIN TC24_ActivityLog D
          ON C.EmpName = D.EmpName
        AND C.Activity = D.Activity
        AND DateDiff(Mi, D.StartTime, C.StartTime)=30
        AND DateDiff(dd, D.StartTime, C.StartTime)= 0
        LEFT JOIN TC24_ActivityLog E
            ON C.EmpName = E.EmpName
         AND C.Activity = E.Activity
         AND DateDiff(Mi, E.StartTime, C.StartTime)=-30
         AND DateDiff(dd, E.StartTime, C.StartTime)= 0
        WHERE A.EmpName = C.EmpName
         AND A.Activity <> C.Activity
        AND DateDiff(Mi, C.StartTime, A.StartTime)=0) DAT
WHERE B.EmpName IS NULL
UNION ALL
SELECT A.LogID, A.EmpName, A.Activity, A.StartTime
, B.TopStartTime
, B.NoMinutes + DATEPART(mi, A.Duration) NoMinutes
, B.StartDuration
, B.DiffDuration
, B.HasParentFlag
, B.HasChildFlag
, B.Level+1 Level
FROM TC24_ActivityLog A
JOIN CTE B
ON A.EmpName = B.EmpName
AND A.Activity = B.Activity
AND DateDiff(Mi, B.StartTime, A.StartTime)=30)
SELECT
Convert(varchar(10), TCS.StartTime, 126) [Date]
, TCS.EmpName
, TCS.Activity
, Convert(varchar(5), TCS.StartTime, 108) SchSt
, Convert(varchar(5), TCS.Duration, 108) SchDur
, Convert(varchar(5), DateAdd(mi, CASE
WHEN HasParentFlag = 1 THEN TCA.DiffDuration
WHEN HasChildFlag = 1 THEN 0
ELSE 30-TCA.StartDuration
END, TCA.TopStartTime), 108) ActSt --
, Convert(varchar(5), DateAdd(mi, TCA.NoMinutes, Cast('00:00' as time(0))), 108) ActDur
FROM TC24_Schedules TCS
JOIN (-- Activities Log
      SELECT EmpName
      , Activity
      , StartTime
      , TopStartTime
      , NoMinutes
      , StartDuration
     , DiffDuration
      , HasParentFlag
      , HasChildFlag
      , RANK() OVER(PARTITION BY EmpName, Activity, TopStartTime ORDER BY Level DESC) Ranking
     FROM CTE
) TCA
     ON TCS.EmpName = TCA.EmpName
    AND TCS.Activity = TCA.Activity
    AND DateDiff(d, TCS.StartTime, TCA.TopStartTime) = 0
    AND TCA.Ranking = 1
ORDER BY [Date]
, EmpName
, SchSt;

Notes:
1.    The initialization query identifies the top-most activities and checks whether there is another activity (referred to as the co-occurring activity) scheduled within the same 30 minutes time interval by using the CROSS APPLY operator, therefore:
DiffDuration – stores the time scheduled within the same 30 minutes for the co-occurring activity
HasParentFlag – has the value 1 if the co-occurring activity is a continuation from the previous 30 minutes, 0 otherwise
HasChildFlag – has the value 1 if the co-occurring activity continues into the next 30 minutes, 0 otherwise
DiffDuration, HasParentFlag and HasChildFlag are needed in order to calculate the ActSt correctly.
   The recursive query just aggregates the total time, the final output query considering the lowest level in the recursion tree (the end activity).
   The query also works with Examples 1 & 2.
   The query was also tested using Lutz Mueller's test data set.
  
2.  By adding a rank function for the Schedule table (e.g. RANK() OVER(PARTITION BY EmpName, Activity ORDER BY StartTime) SchRanking), thus creating an inline view, adding a dense rank in the final Activities subquery (e.g. DENSE_RANK() OVER(PARTITION BY EmpName, Activity ORDER BY TopStartTime) SchRanking), and matching the result sets on the two attributes (e.g. AND TCS.SchRanking = TCA.SchRanking), the query could be leveraged to handle interrupted activities (activities interrupted by another activity, FAQ11).
