SQL Troubles

11 March 2010

🎡SSIS: Data Flow Task (Third Magic Class)

Independently of the data sources or destinations used, all the ETL tasks have as cornerstone the simple creation of a Data Flow Task. If in the previous two tutorials I was showing how to export data from an Excel file (see First Magic Class), respectively an Oracle database (see Second Magic Class), using the SQL Server Import and Export Wizard, in this tutorial I will show how you can create an SSIS package from the scratch using the SQL Server 2008 Business Intelligence Development Studio.

Step 1: Create the View
Before starting to create the SSIS package and especially when dealing with multiple tables it’s useful to encapsulate the logic in a view, this step allowing you primarily to capture, test, refine and stabilize the requirements before creating the package, thus attempting to minimize the changes done in the package itself, and secondly it allows reusing the logic in other packages or database objects. Here’s the source code for vProducts view that will be used for this tutorial:

CREATEVIEW [Production].[vProducts] 
-- Products (view) 
AS 
SELECT ITM.ProductID 
,ITM.ProductSubcategoryID 
,ITM.ProductModelID 
,ITM.Name ProductName 
,ITM.ProductNumber 
,PPM.Name ProductModel 
,PPM.CatalogDescription 
,PPS.Name ProductSubcategory 
,PPC.Name ProductCategory 
,ITM.MakeFlag 
,ITM.FinishedGoodsFlag 
,ITM.Color 
,ITM.SafetyStockLevel 
,ITM.ReorderPoint 
,ITM.StandardCost 
,ITM.ListPrice 
,ITM.Size 
,ITM.SizeUnitMeasureCode 
,SUOM.Name SizeUnitMeasure 
,ITM.WeightUnitMeasureCode 
,WUOM.Name WeightUnitMeasure 
,ITM.Weight 
,ITM.DaysToManufacture 
,ITM.ProductLine 
,ITM.Class 
,ITM.Style 
,ITM.SellStartDate 
,ITM.SellEndDate 
,ITM.DiscontinuedDate 
,ITM.ModifiedDate 
FROM Production.Product ITM 
    LEFT JOIN Production.UnitMeasure SUOM 
       ON ITM.SizeUnitMeasureCode = SUOM.UnitMeasureCode 
    LEFT JOIN Production.UnitMeasure WUOM 
        ON ITM.WeightUnitMeasureCode = WUOM.UnitMeasureCode 
    LEFT JOIN Production.ProductModel PPM 
       ON ITM.ProductModelID = PPM.ProductModelID 
    LEFT JOIN Production.ProductSubcategory PPS 
        ON ITM.ProductSubcategoryID = PPS.ProductSubcategoryID 
    LEFT JOIN Production.ProductCategory PPC 
       ON PPS.ProductCategoryID = PPC.ProductCategoryID

Note:
The reuse of database objects like views, stored procedures and user-defined functions comes with its costs because even if they are helping to simplify business logic’s maintenance, a change impacts the whole set of objects and applications referencing the changed object, thus for example some of the SSIS packages using the respective object could be invalidated. Sometimes it’s useful to keep a list with the objects and (especially) applications using your database objects, just in case you want to easily validate changes’ impact.

Step 2: Create the SSIS Project
From Start/Microsoft SQL Server 2008 folder open the SQL Server Business Intelligence Development Studio and from its Menu choose File/New/Project, in the New Project dialog choose choose Integration Services Project as Template, enter a meaningful (Project) Name (e.g. SSIS Package Template), specify the Location (e.g. D:\Work\SQL Server) and eventually change the default Solution Name, though that’s not normally required, then click ‘Ok’.

The new created project shell will look as below, with the default Package.dtsx, (eventually you could rename the package):

Step 3: Create the Data Source
     In the Solution Explorer click on Data Sources/New Data Source in order to open the Data Source Wizard in which you can create a new Data Source to AdventureWorks database using the Connection Manager as in the previous two tutorials or, in case the source is already available, just select it.

Notes:
      Is not really required to create a Data Source explicitly, though this approach would allow you to easier manage the connections especially when the same connection will be reused in multiple packages.

Step 4: Create the Data Flow
     Within the Package.dtsx from Toolbox double click on the Data Flow Task, the Task being added to the Control Flow tab, then in order to add components to it click on the Data Flow tab and from Toolbox add an ‘OLE DB Source’ and an ‘OLE DB Destination’.
     Double click on the ‘OLD DB Source’ and in ‘OLE DB Source Editor’ dialog just opened selected the defined Data Source (e.g. AdventureWorks) in the OLE DB connection manager, in ‘Data Access mode’ from the list of available options select ‘SQL Command’ while in the below ‘SQL command text’ copy paste the below query:

SELECT ITM.ProductID 
,ITM.ProductName 
, ITM.ProductNumber 
,ITM.ProductModel 
,ITM.ProductSubcategory 
,ITM.ProductCategory 
,ITM.MakeFlag 
,ITM.FinishedGoodsFlag 
,ITM.Color 
,ITM.StandardCost 
,ITM.ListPrice 
,ITM.Class 
,ITM.Style 
,ITM.SellStartDate 
,ITM.SellEndDate 
,ITM.DiscontinuedDate 
,ITM.ModifiedDate 
FROM Production.vProducts ITM 
WHERE ProductCategory = 'Bikes'

Notes:
1. The vProducts could have included only the above attributes and the required constraint(s), though you always have to consider the changes in scope, typically being preferred to include all the attributes the business is typically interested in – this for the cases in which the number of attributes is quite huge or some of them are not used.
2. Instead of using the ‘SQL Command’ you could have chosen the ‘Table or view’ Data access mode and use thus the vProducts view directly, in the end selecting only the needed columns in the Columns tab of the ‘OLE DB Source Editor’.

Once the data source was created, drag the green connector over the ‘OLE DB Destination’, and double click on the later in order to access the ‘OLE DB Destination Editor’. In ‘OLE DB connection manager’ select the destination source, it could be same connection used for the source, and select ‘Table or view – fast load’ as ‘Data access mode’. As the data needs to be dumped to a table, we either use an existing table or create one on the fly by clicking on the ‘New’ button, next to the ‘name of the table or the view’ control. Based on the source attributes, SQL Server already provides the script needed to create the table, all is needed to do is to change the default table name (e.g. OLE DB Destination) with a meaningful one (e.g. Production.Bikes):
CREATETABLE [Production].[Bikes] (
[ProductID]int,
[ProductName] nvarchar(50),
[ProductNumber] nvarchar(25),
[ProductModel] nvarchar(50),
[ProductSubcategory] nvarchar(50),
[ProductCategory] nvarchar(50),
[MakeFlag] bit,
[FinishedGoodsFlag] bit,
[Color] nvarchar(15),
[StandardCost] money,
[ListPrice] money,
[Class] nvarchar(2),
[Style] nvarchar(2),
[SellStartDate] datetime,
[SellEndDate] datetime,
[DiscontinuedDate] datetime,
[ModifiedDate] datetime
)

Because is useful to keep the identity key (just in case you want to validate the logic) and the null values (unless you want to replace them with default values) check the ‘Keep identity’ and ‘Keep nulls’ checkboxes.

Step 5: Testing the Data Flow Task
Save the Package and from the Menu select Debug/Start debugging in order to run the package. Normally the package should run without problems, the tasks/components being shown in green as they are processed:

As always is a good idea to check if the data were saved as expected:

SELECT * 
FROM [Production].[Bikes] 
ORDER BY ProductName

Step 6: Delete the Previous Records
Because the next time the package will be run the data will be written in the respective table making abstraction of the existing records, therefore is needed to delete the previous records before the data are dumped to the table. For this in Control Flow tab add an Execute SQL Task and in its Editor select the AdventureWorks Data Source as Connection, while in the SQL Statement add the following statement:

TRUNCATE TABLE [Production].[Bikes]

Notes:
1.    It’s preferred to TRUNCATE the table instead of using a DELETE statement, the later logging the deletion of each individual row in the transactional log while the TRUNCATE removes all table’s pages, no record deletion being actually performed.
2.    If you’re clicking on ‘Build Query’ button, because the Query Builder doesn’t support DDL statements you’ll get the following syntax error: “The TRUNCATE TABLE SQL construct or statement is not supported”. Just ignore the error message!

    Now link the Execution SQL Task connector to the Data Flow Task and by clicking again on it review its properties in Precedence Constraint Editor and assure that the Constraint option value is set to ‘Success’ (default value), this meaning that the next step in the package will be executed only if the precedent step succeeded.

Note:
When in the current step data are deleted or important actions are performed, is always a good idea to process the next step only if the current task succeeded, otherwise your data/logic might be invalidated.

Step 7: Testing the Package
Now you are ready to test (debug) the whole package, but before this don’t forget to save your project!

Typically you’ll need to test the package at least twice in order to assure yourself that the previous data are deleted. It also helps to attempt a select from the [Production].[Bikes] table again, just to assure yourself that everything went as expected!

Notes:
1. Actually you should save your work each time an important step is completed, because you never know when a runtime error occurs and your application closes unexpectedly, thus loosing your work.
2. Usually when dealing with multiple tasks and transformations it makes sense to rename the respective objects, this allowing you to easier identify the objects.

09 March 2010

🕋Data Warehousing: Degenerate Dimension (Definitions)

"Degenerate dimensions represent a business event in the fact table. For example, a business event would consist of an order, invoice, or a shipping ticket. Degenerate dimensions are not dimensional keys or part of the additive numeric facts and do not have an associated dimension table." (Microsoft Corporation, "Microsoft SQL Server 7.0 Data Warehouse Training Kit", 2000)

"A dimension key, such as a transaction number, invoice number, ticket number, or bill-of-lading number, that has no attributes and hence does not join to an actual dimension table." (Ralph Kimball & Margy Ross, "The Data Warehouse Toolkit" 2nd Ed., 2002)

"A dimension whose members are stored in a fact table instead of a dimension table." (Reed Jacobsen & Stacia Misner, "Microsoft SQL Server 2005 Analysis Services Step by Step", 2006)

"A dimension where there are no valid dimensional attributes other than a unique identifier in a one-to-one relationship with a fact table." (DAMA International, "The DAMA Dictionary of Data Management", 2011)

"A relationship between a dimension and a measure group in which the dimension main table is the same as the measure group table." (Microsoft, "SQL Server 2012 Glossary", 2012)

🕋Data Warehousing: Dimensional Modeling (Definitions)

"A methodology for logically modeling data for query performance and ease of use that starts from a set of base measurement events. In the relational DBMS environment, a fact table is constructed generally with one record for each discrete measurement. This fact table is then surrounded by a set of dimension tables describing precisely what is known in the context of each measurement record. Because of the characteristic structure of a dimensional model, it is often called a star schema." (Ralph Kimball & Margy Ross, "The Data Warehouse Toolkit" 2nd Ed., 2002)

"A formal data modeling technique that is used to organize and represent data for analytical and reporting use. The focus is on the business perspective and the representation of data." (Laura Reeves, "A Manager's Guide to Data Warehousing", 2009)

"A generally accepted practice in the data warehouse industry to structure data intended for user access, analysis, and reporting in dimensional data models" (Daniel Linstedt & W H Inmon, "Data Architecture: A Primer for the Data Scientist", 2014)

"The most frequently used data model for data warehouses." (Jan L Harrington, "Relational Database Design and Implementation" 3rd Ed., 2009)

"With dimensional data modeling or denormalization, data is collapsed, combined, or grouped together. Within dimensional data modeling, the concepts of facts (measures) and dimensions (context) are used. If dimensions are collapsed into single structures, the data model is also often called a star schema. If the dimensions are not collapsed, the data model is called snowflake. The dimensional models are typically seen within data warehouse systems." (Piethein Strengholt, "Data Management at Scale", 2020)

08 March 2010

💎SQL Reloaded: TSQL Challenge 24

💎Beyond Relational came up with an interesting set of TSQL Challenges contest that reached already its 25th round. Here’s my answer for TSQL Challenge 24, this being the first time I’ve attempted to come with a solution:

WITH CTE(LogID, EmpName, Activity, StartTime, TopStartTime, NoMinutes, StartDuration, DiffDuration, HasParentFlag, HasChildFlag, Level) 
AS ( -- initialization query
SELECT A.LogID, A.EmpName, A.Activity, A.StartTime
, A.StartTime TopStartTime
, DATEPART(mi, A.Duration) NoMinutes
, DATEPART(mi, A.Duration) StartDuration
, DAT.NoMinutes DiffDuration
, IsNull(DAT.HasParentFlag, 0) HasParentFlag
, ISNULL(DAT.HasChildFlag, 0) HasChildFlag
, 1 Level
FROM TC24_ActivityLog A
LEFT JOIN TC24_ActivityLog B
ON A.EmpName = B.EmpName
AND A.Activity = B.Activity
AND DateDiff(Mi, B.StartTime, A.StartTime) BETWEEN 0 AND 30
AND DateDiff(dd, B.StartTime, A.StartTime)= 0
AND A.LogID <> B.LogID
CROSS APPLY ( -- cumulate
     SELECT SUM(DATEPART(mi, C.Duration)) NoMinutes
     , Max(CASE WHEN D.EmpName IS NOT NULL THEN 1 ELSE 0 END) HasParentFlag
     , Max(CASE WHEN E.EmpName IS NOT NULL THEN 1 ELSE 0 END) HasChildFlag
     , Max(CASE WHEN D.EmpName IS NULL AND D.EmpName IS NULL THEN 1 ELSE 0 END)   IsSingleFlag
      FROM TC24_ActivityLog C
        LEFT JOIN TC24_ActivityLog D
          ON C.EmpName = D.EmpName
        AND C.Activity = D.Activity
        AND DateDiff(Mi, D.StartTime, C.StartTime)=30
        AND DateDiff(dd, D.StartTime, C.StartTime)= 0
        LEFT JOIN TC24_ActivityLog E
            ON C.EmpName = E.EmpName
         AND C.Activity = E.Activity
         AND DateDiff(Mi, E.StartTime, C.StartTime)=-30
         AND DateDiff(dd, E.StartTime, C.StartTime)= 0
        WHERE A.EmpName = C.EmpName
         AND A.Activity <> C.Activity
        AND DateDiff(Mi, C.StartTime, A.StartTime)=0) DAT
WHERE B.EmpName IS NULL
UNION ALL
SELECT A.LogID, A.EmpName, A.Activity, A.StartTime
, B.TopStartTime
, B.NoMinutes + DATEPART(mi, A.Duration) NoMinutes
, B.StartDuration
, B.DiffDuration
, B.HasParentFlag
, B.HasChildFlag
, B.Level+1 Level
FROM TC24_ActivityLog A
JOIN CTE B
ON A.EmpName = B.EmpName
AND A.Activity = B.Activity
AND DateDiff(Mi, B.StartTime, A.StartTime)=30)
SELECT
Convert(varchar(10), TCS.StartTime, 126) [Date]
, TCS.EmpName
, TCS.Activity
, Convert(varchar(5), TCS.StartTime, 108) SchSt
, Convert(varchar(5), TCS.Duration, 108) SchDur
, Convert(varchar(5), DateAdd(mi, CASE
WHEN HasParentFlag = 1 THEN TCA.DiffDuration
WHEN HasChildFlag = 1 THEN 0
ELSE 30-TCA.StartDuration
END, TCA.TopStartTime), 108) ActSt --
, Convert(varchar(5), DateAdd(mi, TCA.NoMinutes, Cast('00:00' as time(0))), 108) ActDur
FROM TC24_Schedules TCS
JOIN (-- Activities Log
      SELECT EmpName
      , Activity
      , StartTime
      , TopStartTime
      , NoMinutes
      , StartDuration
     , DiffDuration
      , HasParentFlag
      , HasChildFlag
      , RANK() OVER(PARTITION BY EmpName, Activity, TopStartTime ORDER BY Level DESC) Ranking
     FROM CTE
) TCA
     ON TCS.EmpName = TCA.EmpName
    AND TCS.Activity = TCA.Activity
    AND DateDiff(d, TCS.StartTime, TCA.TopStartTime) = 0
    AND TCA.Ranking = 1
ORDER BY [Date]
, EmpName
, SchSt;

Notes:
1.    The initialization query identifies the top-most activities and checks whether there is another activity (referred as co-occurring activity) scheduled within the same 30 minutes time interval by using the CROSS APPLY operator, therefore:
DiffDuration – stores the time scheduled within the same 30 minutes for the co-occurring activity
HasParentFlag – has the value 1 if the co-occurring activity is a continuation from the previous 30 minutes, 0 otherwise
HasChildFlag – has the value 1 if the co-occurring activity continues on the next 30 minutes, 0 otherwise
DiffDuration, HasParentFlag, HasChildFlag are needed in order to calculate correctly the ActSt.
   The recursive query just aggregates the total time, the final output query considering the lowest level in the recursion tree (the end activity).
   The query works also with the Example 1 & 2.
   The query was also tested using Lutz Mueller's test data set.

2. By adding the rank function for Schedule table (e.g. RANK() OVER(PARTITION BY EmpName, Activity ORDER BY StartTime) SchRanking) thus creating an inline view,
    a dense rank in the final Activites subquery (e.g.   , DENSE_RANK() OVER(PARTITION BY EmpName, Activity ORDER BY TopStartTime) SchRanking
    and matching the result sets on the two attributes (e.g. AND TCS.SchRanking = TCA.SchRanking) the query could be leveraged to handle interrupted activites (activities interrupted by another activity, FAQ11).

06 March 2010

💎⌛SQL Reloaded: Oracle vs. SQL Server (Running Totals)

One of the SQL Server 2005+ functionalities I kind of ignored in the past years is the CROSS APPLY operator that allows returning rows from a table-valued function of a correlated query, allowing for example to calculate running totals, cumulated amounts from the previous records and the current record in a data set. Before the introduction of CROSS APPLY operator such a query could be written as follows:

-- SQL Server/Oracle: Running Totals (simple join)
SELECT POH.PurchaseOrderID
, POH.VendorID
, POH.OrderDate
, POH.SubTotal
, POH.TaxAmt
, POH.Freight
, SUM(POT.SubTotal) RunningSubTotal
, SUM(POT.TaxAmt) RunningTaxAmt
, SUM(POT.Freight) RunningFreight
FROM Purchasing.PurchaseOrderHeader POH
    LEFT JOIN Purchasing.PurchaseOrderHeader POT
        ON POH.PurchaseOrderID >= POT.PurchaseOrderID
GROUP BY POH.PurchaseOrderID
, POH.VendorID
, POH.OrderDate
, POH.SubTotal
, POH.TaxAmt
, POH.Freight
ORDER BY POH.PurchaseOrderID

Each of the attributes not aggregated needs to be added in the GROUP BY clause, fact that could be quite a hit from a performance standpoint. If only one aggregated value is needed then the query could be rewritten, the cumulated value being calculated in a correlated query:

-- SQL Server/Oracle: Running Totals (correlated query)
SELECT POH.PurchaseOrderID
, POH.VendorID
, POH.OrderDate
, POH.SubTotal
, POH.TaxAmt
, POH.Freight
, ( -- calculating aggregated running total
    SELECT SUM(POT.SubTotal)
    FROM Purchasing.PurchaseOrderHeader POT
    WHERE POH.PurchaseOrderID >= POT.PurchaseOrderID
) RunningSubTotal
FROM Purchasing.PurchaseOrderHeader POH

In theory could be added a correlated query for each aggregated value, though this means an additional table scan for each attribute, not the best approach for sure. The CROSS APPLY operator simplifies such queries allowing to do multiple calculations within the same correlated query:

-- SQL Server 2005: Running Totals (CROSS APPLY)
SELECT POH.PurchaseOrderID
, POH.VendorID
, POH.OrderDate
, POH.SubTotal
, POH.TaxAmt
, POH.Freight
, DAT.RunningSubTotal
, DAT.RunningTaxAmt
, DAT.RunningFreight
FROM Purchasing.PurchaseOrderHeader POH
    CROSS APPLY (-- calculating aggregated running total
        SELECT SUM(POT.SubTotal) RunningSubTotal
        , SUM(POT.TaxAmt) RunningTaxAmt
        , SUM(POT.Freight) RunningFreight
        FROM Purchasing.PurchaseOrderHeader POT
        WHERE POH.PurchaseOrderID >= POT.PurchaseOrderID) DAT
ORDER BY POH.PurchaseOrderID

The advantage of the first two approaches is that they work also in Oracle, though also Oracle comes with an alternative using analytic aggregate functions:

-- Oracle: Running Totals (analytic aggregate function)
SELECT POH.PurchaseOrderID
, POH.VendorID
, POH.OrderDate
, POH.SubTotal
, POH.TaxAmt
, POH.Freight
, SUM(POH.SubTotal) OVER (ORDER BY POH.PurchaseOrderID ) RunningSubTotal
, SUM(POH.TaxAmt) OVER (ORDER BY POH.PurchaseOrderID) RunningTaxAmt
, SUM(POH.Freight) OVER (ORDER BY POH.PurchaseOrderID) RunningFreight
FROM PurchaseOrderHeader POH
ORDER BY POH.PurchaseOrderID

Notes:
1. The calculation of running totals requires the existence of a unique identifier for each record that can be sorted, in this case being used the PurchaseOrderID attribute, though typically could be used the creation date (timestamp) when the record was modified, alone or in combination with other attributes; when used in combination then maybe it makes sense to add a running number in order to apply the above techniques. In Oracle the use of analytic aggregate functions offers less headaches from this perspective, being enough to specify the unique combination of attributes in the ORDER BY clause in the body of the function.
2. SQL Server 2005’s window aggregate functions, even if similar with Oracle’s analytic aggregate functions, unlike the window ranking functions they don’t allow the ORDER BY clause in the functions, therefore running totals can’t be modeled in SQL Server with the help of window aggregate functions.
3. In case the running totals are based on other table, then might be needed to use the OUTER APPLY operator, returning thus also the records for which not match is retrieved in the outer dataset.

Sometimes might be requested to create running totals within a given partition, for example by Vendor, in this case all is needed to do is to add the VendorID in the WHERE or JOIN constraint, while in Oracle’s analytic functions case in the PARTITION BY clause. Here are the three queries modified:

-- SQL Server/Oracle: Running Totals per Vendor (simple join)
SELECT POH.PurchaseOrderID
, POH.VendorID
, POH.OrderDate
, POH.SubTotal
, POH.TaxAmt
, POH.Freight
, SUM(POT.SubTotal) RunningSubTotal
, SUM(POT.TaxAmt) RunningTaxAmt
, SUM(POT.Freight) RunningFreight
FROM Purchasing.PurchaseOrderHeader POH
    LEFT JOIN Purchasing.PurchaseOrderHeader POT
        ON POH.VendorID = POT.VendorID
     AND POH.PurchaseOrderID >= POT.PurchaseOrderID
GROUP BY POH.PurchaseOrderID
, POH.VendorID
, POH.OrderDate
, POH.SubTotal
, POH.TaxAmt
, POH.Freight
ORDER BY POH.VendorID
, POH.PurchaseOrderID

-- SQL Server 2005: Running Totals per Vendor (CROSS APPLY)
SELECT POH.PurchaseOrderID
, POH.VendorID
, POH.OrderDate
, POH.SubTotal
, POH.TaxAmt
, POH.Freight
, DAT.RunningSubTotal
, DAT.RunningTaxAmt
, DAT.RunningFreight
FROM Purchasing.PurchaseOrderHeader POH
     CROSS APPLY ( -- calculating aggregated running total
        SELECT SUM(POT.SubTotal) RunningSubTotal
        , SUM(POT.TaxAmt) RunningTaxAmt
        , SUM(POT.Freight) RunningFreight
        FROM Purchasing.PurchaseOrderHeader POT
        WHERE POH.VendorID = POT.VendorID
          AND POH.PurchaseOrderID >= POT.PurchaseOrderID) DAT
ORDER BY POH.VendorID
, POH.PurchaseOrderID

-- Oracle: Running Totals per Vendor (analytic aggregate function)
SELECT POH.PurchaseOrderID
, POH.VendorID
, POH.OrderDate
, POH.SubTotal
, POH.TaxAmt
, POH.Freight
, SUM(POH.SubTotal) OVER (PARTITION BY POH.VendorID ORDER BY POH.PurchaseOrderID ) RunningSubTotal
, SUM(POH.TaxAmt) OVER (PARTITION BY POH.VendorID ORDER BY POH.PurchaseOrderID) RunningTaxAmt
, SUM(POH.Freight) OVER (PARTITION BY POH.VendorID ORDER BY POH.PurchaseOrderID) RunningFreight
FROM PurchaseOrderHeader POH
ORDER BY POH.VendorID
, POH.PurchaseOrderID

Happy coding!

04 March 2010

🕋Data Warehousing: Drill Down (Definitions)

"A technique for navigating through levels of data ranging from the most summarized (up) to the most detailed (down). For example, to view the details of sales data by year, a user can drill down to display sales data by quarter, and drill down further to display data by month." (Microsoft Corporation, "SQL Server 7.0 System Administration Training Kit", 1999)

"The act of adding a row header or replacing a row header in a report to break down the rows of the answer set more finely." (Ralph Kimball & Margy Ross, "The Data Warehouse Toolkit" 2nd Ed., 2002)

"To delve deeper into data by going from a summary value to more detailed values." (Margaret Y Chu, "Blissful Data ", 2004)

"The process of exposing progressively more detail by making selections of items in a dynamic report or further enhancing a query." (Sharon Allen & Evan Terry, "Beginning Relational Data Modeling" 2nd Ed., 2005)

"Method of analysis for retrieving lower levels of detailed data starting from summary data." (Paulraj Ponniah, "Data Warehousing Fundamentals for IT Professionals", 2010)

"A method of exploring detailed data that was used in creating a summary level of data. Drill down levels depend on the granularity of data within a dimension." (DAMA International, "The DAMA Dictionary of Data Management", 2011)

"To decompose data into more atomic components, that is, data at lower levels of aggregation." (Carlos Coronel et al, "Database Systems: Design, Implementation, and Management" 9th Ed., 2011)

"An analytical technique that lets a DSS user navigate among levels of data ranging from the most summarized (up) to the most detailed (down)." (Ciara Heavin & Daniel J Power, "Decision Support, Analytics, and Business Intelligence 3rd Ed.", 2017)

"A method of exploring detailed data that was used in creating a summary level of data. Drill Down levels depend on the granularity of the data in the data warehouse." (Intrafocus)

01 March 2010

🕋Data Warehousing: Granularity (Definitions)

"The degree of specificity of information contained in a data element. A fact table that has fine granularity contains many discrete facts, such as individual sales transactions. A table that has coarse granularity stores facts that are summaries of individual elements, such as sales totals per day." (Microsoft Corporation, "SQL Server 7.0 System Administration Training Kit", 1999)

"The level of detail captured in the data warehouse." (Ralph Kimball & Margy Ross, "The Data Warehouse Toolkit" 2nd Ed., 2002)

"Granularity level is the level of detail of the data in a data warehouse or data mart." (Claudia Imhoff et al, "Mastering Data Warehouse Design", 2003)

"A classification of data based upon the level of detail at which data is recorded. Atomic, base, and summary are all classifications of data granularity." (Sharon Allen & Evan Terry, "Beginning Relational Data Modeling" 2nd Ed., 2005)

"The level of detail contained in a unit of data. The more detail there is, the lower the level of granularity. The less detail there is, the higher the level of granularity." (William H Inmon, "Building the Data Warehouse", 2005)

"Refers to the level of detail represented by the values stored in a table’s row. Data stored at their lowest level of granularity are said to be atomic data." (Carlos Coronel et al, "Database Systems: Design, Implementation, and Management" 9th Ed, 2011)

"Granularity refers to the smallest reusable element of information. See also physical granularity." (Charles Cooper & Ann Rockley, "Managing Enterprise Content: A Unified Content Strategy, 2nd Ed.", 2012)

"A term sometimes used to describe the level of detail used in obtaining information." (Kenneth A Shaw, "Integrated Management of Processes and Information", 2013)

"The level of detail found in a record of data" (Daniel Linstedt & W H Inmon, "Data Architecture: A Primer for the Data Scientist", 2014)

"Granularity refers to the level of detail or precision for a specific information resource property. For instance, the postal address of a particular location might be represented as several different data items, including the number, street name, city, state, country and postal code (a high-granularity model). It might also be represented in one single line including all of the information above (a low-granularity model)." (Robert J Glushko, "The Discipline of Organizing: Professional Edition" 4th Ed, 2016)

🕋Data Warehousing: Extraction, transformation, and loading [ETL] (Definitions)

"A process for acquiring data from source systems, reformatting and cleansing it based on business requirements, and loading it, usually into a data warehouse. The term is more generally used as a means of moving data between systems." (Jill Dyché & Evan Levy, "Customer Data Integration", 2006)

"The process of taking data from one source - the source - and transforming it and loading it into another location - the target." (William H Inmon & Anthony Nesavich, "Tapping into Unstructured Data", 2007)

[ETL system:] "Extract, transformation, and load system consisting of a set of processes by which the operational source data is prepared for the data warehouse. Consists of extracting operational data from source applications, cleaning and conforming the data, and delivering the data to the presentation servers, along with the ongoing management and support functions." (Ralph Kimball, "The Data Warehouse Lifecycle Toolkit", 2008)

"The very essence of business intelligence, this is the process of removing raw data from a data system, processing and cleaning it, and then making it available in a business-intelligence database." (Stuart Mudie et al, "BusinessObjects™ XI Release 2 for Dummies", 2008)

"The process of connecting to a source database, pulling data out of the source database, transforming the data into a standard format, and then loading the data into a destination system." (Ken Withee, "Microsoft Business Intelligence For Dummies", 2010)

"A term that describes the activities used to prepare a collection of data sources for loading into a database, but also used to describe data mapping and transformation processes in general." (John R Talburt, "Entity Resolution and Information Quality", 2011)

"Generally, an approach to data integration from multiple source databases to integrated target databases." (DAMA International, "The DAMA Dictionary of Data Management", 2011)

"In computing, extract, transform, and load refers to a process in database usage and especially in data warehousing that: extracts data from outside sources; transforms it to fit operational needs, which can include quality levels, and loads it into the end target (database, more specifically, operational data store, data mart, or data warehouse)." (Keith Holdaway, "Harness Oil and Gas Big Data with Analytics", 2014)

"Data warehouse process used to extract data from disparate or homogeneous data sources. Transforms the data to align with data structure for storing and analysis." (Dennis C Guster, "Scalable Data Warehouse Architecture: A Higher Education Case Study", 2018)

"An acronym for the Extract, Transform and Load process. This refers to extracting data from the source system, transforming (cleaning and manipulating etc.) the data and then loading the data into the data warehouse." (BI System Builders)

"ETL is a process that extracts, transforms, and loads data from multiple sources to a data warehouse or other unified data repository." (IBM) [source]

"Refers to the process by which data is:
– Extracted from sources
– Transformed or standardized for storing in the proper heterogeneous format
– and Loaded into the final store or warehouse. The ETL process is commonly run in parallel with transformation processes executing as data is being extracted from sources." (Insight Software)

"Short for extraction, transformation, and loading. The process used to populate a data warehouse from disparate existing database systems." (Microstrategy)

"The act of extracting data from various sources, transforming data to consistent types, and loading the transformed data for use by applications." (Microsoft)

"a process in data warehousing responsible for pulling data out of one source, transforming them so that they meet the needs of the processes that will be using them on next stages, and placing them into target database." (KDnuggets)

"ETL is shorthand for – extract, transform, load. An ETL solution facilitates the replication of data from one or more sources that is converted into format suitable for use in analytics and moved into a destination system." (Qlik) [source]

"ETL (extract, transform, load) is three combined processes that are used to pull data from one database and move it to another database. It is a common function in data warehousing." (snowflake) [source]

"ETL (extract, transform, load) processes are three common functions performed on databases. Extract or read data from a database, transform the extracted data into a structure that can be placed into another database, and load or write the data into the target database." (kloudless)

"ETL (extract, transform, load) processes are three database functions. They extract or read data from a database, transform the extracted data into a structure that can be placed into another database, and load or write the data into the target database." (MuleSoft)

"ETL stands for Extract-Transform-Load and it refers to the process used to collect data from numerous disparate databases, applications and systems, transforming the data so that it matches the target system’s required formatting and loading it into a destination database." (Databricks) [source]

"Extract Transform Load refers to a trio of processes that are performed when moving raw data from its source to a data warehouse, data mart, or relational database." (Informatica) [source]

"Extract, Transform and Load (ETL) refers to the process in data warehousing that concurrently reads (or extracts) data from source systems; converts (or transforms) the data into the proper format for querying and analysis; and loads it into a data warehouse, operational data store or data mart). ETL systems commonly integrate data from multiple applications or systems that may be hosted on separate hardware and managed by different groups or users. ETL is commonly used to assemble a temporary subset of data for ad-hoc reporting, migrate data to new databases or convert database into a new format or type." (Teradata) [source]

"Refers to a process in database usage consisting of three phases: extracting data from external sources, transforming it to fit operational needs (can include a quality check), and loading it into a target database or data warehouse." (Board International)

"In business intelligence, an ETL tool extracts data from one or more data-sources, transforms it and cleanses it to be optimized for reporting and analysis, and loads it into a data store or data warehouse. ETL stands for extract, transform, and load." (Logi Analytics) [source]

🕋Data Warehousing: Roll/Drill Up (Definitions)

"The act of removing a row header or replacing a row header in a report to summarize the rows of the answer set. Sometimes called dynamic aggregation." (Ralph Kimball & Margy Ross, "The Data Warehouse Toolkit" 2nd Ed , 2002)

"To present higher levels of summarization. " (Ralph Kimball & Margy Ross, "The Data Warehouse Toolkit" 2nd Ed., 2002)

"Method of analysis for retrieving higher levels of summary data starting from detailed data." (Paulraj Ponniah, "Data Warehousing Fundamentals for IT Professionals", 2010)

"Data analysis performed on a data set with applied mathematical functions, associated with fewer dimensions, higher levels of hierarchy in one or more dimensions, or both." (DAMA International, "The DAMA Dictionary of Data Management", 2011)

"In SQL, an OLAP extension used with the GROUP BY clause to aggregate data by different dimensions. (Rolling up the data is the exact opposite of drilling down the data.)" (Carlos Coronel et al, "Database Systems: Design, Implementation, and Management" 9th Ed., 2011)

"To collect subsets of data from multiple locations in one location." (Microsoft, "SQL Server 2012 Glossary", 2012)

"The act of summarizing or aggregating data to a level higher than the previous level of detail." (Sharon Allen & Evan Terry, "Beginning Relational Data Modeling 2nd Ed.", 2005)

27 February 2010

🕋Data Warehousing: Conformed Dimension (Definitions)

"A shared dimension that applies to two subject areas or data marts. By utilizing conformed dimensions, comparisons across data marts are meaningful." (Microsoft Corporation, "Microsoft SQL Server 7.0 Data Warehouse Training Kit", 2000)

"Dimensions are conformed when they are either exactly the same (including the keys) or one is a perfect subset of the other." (Ralph Kimball & Margy Ross, "The Data Warehouse Toolkit 2nd Ed ", 2002)

"A conformed dimension is one that is built for use by multiple data marts. Conformed dimensions promote consistency by enabling multiple data marts to share the same reference and hierarchy information." (Claudia Imhoff et al, "Mastering Data Warehouse Design", 2003)

"A dimension whose data is reused by more than one dimensional design. When modeling multiple data marts, standards across marts with respect to dimensions are useful. Warehouse users may be confused when a dimension has the similar meaning but different names, structures, levels, or characteristics among multiple marts. Using standard dimensions throughout the warehousing environment can be referred to as 'conformed' dimensions." (Sharon Allen & Evan Terry, "Beginning Relational Data Modeling" 2nd Ed., 2005)

"Dimension tables that are the same, or where one dimension table contains a perfect subset of the attributes of another. Conformance requires that data values be identical, and that the same combination of attribute values is present in each table." (Christopher Adamson, "Mastering Data Warehouse Aggregates", 2006)

"A dimension that is shared between two or more fact tables. It enables the integration of data from different fact tables at query time. This is a foundational principle that enables the longevity of a data warehousing environment. By using conformed dimensions, facts can be used together, aligned along these common dimensions. The beauty of using conformed dimensions is that facts that were designed independently of each other, perhaps over a number of years, can be integrated. The use of conformed dimensions is the central technique for building an enterprise data warehouse from a set of data marts." (Laura Reeves, "A Manager's Guide to Data Warehousing", 2009)

"Two sets of business dimensions represented in dimension tables are said to be conformed if both sets are identical in their attributes or if one set is an exact subset of the other. Conformed dimensions are fundamental in the bus architecture for a family of STARS." (Paulraj Ponniah, "Data Warehousing Fundamentals for IT Professionals", 2010)

"A dimension that means and represents the same thing when linked to different fact tables." (DAMA International, "The DAMA Dictionary of Data Management", 2011)

21 February 2010

💎SQL Reloaded: More Date-related Functionality II

Arithmetic Progression Sets

In yesterday’s post on the same topic I was showing how to create a sequence of dates on the fly with the help of common table expressions and multi-statement table-valued functions. I was mentioning that similar functionality could be obtained using a numeric sequence (interval), following to create the dates when calling the function. Here’s the function that provides the respective functionality, the difference from fGetDatesInInterval, excepting the different data types, is that it has a third parameter for the increment value (default 1), being thus possible to generate a list of odd/even numbers or any other type of arithmetic progressions.

-- SQL Server 2005+: Numeric Interval
CREATE FUNCTION dbo.fGetInterval(
@Start int = 0
, @End int = 0
, @Step int=1)
RETURNS @Interval TABLE(
Value int PRIMARY KEY)
AS
BEGIN
DECLARE @Sign smallint
SET @Sign = IsNull(NullIf(Sign(@End-@Start), 0), 1);
WITH CTE(Sequence)
AS(
     SELECT COALESCE(@Start, @End, 0) Sequence
     UNION ALL
     SELECT CTE.Sequence + @Step * @Sign Sequence
     FROM CTE
     WHERE (@End-CTE.Sequence-@Step)*@Sign>=0
)
INSERT @Interval
SELECT Sequence
FROM CTE
OPTION(MAXRECURSION 0)
RETURN
END

-- numeric interval
SELECT *
FROM dbo.fGetInterval(0, 1000, 1)
-- numeric interval (3rd parameter NULL)
SELECT *
FROM dbo.fGetInterval(0, 999, NULL)
-- even numbers
SELECT *
FROM dbo.fGetInterval(0, 1000, 2)
-- odd numbers
SELECT *
FROM dbo.fGetInterval(1, 1000, 2)
-- single value
SELECT *
FROM dbo.fGetInterval(1, 1, 2)

The numeric values returned by the function could be used to form a date sequence by using the DateAdd function and a Cast to the date data type. The downside with this approach is that if the dates are used in more expressions within the same query, the expression used to calculate the dates increases the visual complexity of the query, a function like fGetDatesInInterval could be easier to use instead.

-- creating a date sequence
SELECT Value
, Cast(DATEADD(d, Value, GETDATE()) as Date) DateSequence
FROM dbo.fGetInterval(0, 1000, 1)

The fGetInterval has a second particularity, the Value attribute from the table used to retrieve the data was declared as PRIMARY KEY, fact that allows to improve queries’ performance when searches are performed on table’s values. The DateSequence from the table returned by fGetDatesInInterval could be defined as PRIMARY KEY too.

The common table expression and its use resumes in writing a single statement, and considering that the logic from fGetInterval function could be written as a single statement, in theory we could write an inline table-valued function instead. This technique works as long the OPTION clause is not used within the inline function, when this is attempted is returned the following error: Incorrect syntax near the keyword 'OPTION'. As it seems the OPTION clause can’t be used with single-statement objects like views or inline table-valued functions. Therefore common table expressions can be used in single-statement objects, the functionality being thus restrained to the 100 default limit number of recursion.

Missing Values in a Data Set

While searching for something on the Web, today I found Aaron Akin’s post on Recursive Common Table Expressions, he’s using them in order to determine the missing numbers/dates in a sequence. This can be easily achieved using for example the fGetDatesInInterval or fGetInterval functions with a left join to the tested data set. Using the AdventureWorks database and fGetDatesInInterval function, supposing that I would like to see the working dates within a time interval with no POs, the query could be written as follows:

-- working days with no PO
SELECT DI.DateSequence
FROM dbo.fGetDatesInInterval('2003-01-01', '2003-12-31') DI
LEFT JOIN ( -- list of PO dates
    SELECT OrderDate
    FROM Purchasing.PurchaseOrderHeader 
    GROUP BY OrderDate
) POH
ON DATEDIFF(d, DI.DateSequence, POH.OrderDate)=0
WHERE DATEPART(dw, DI.DateSequence) BETWEEN 2 AND 6 --working days
AND POH.OrderDate IS NULL

A similar approach could be used also in case of numeric data sets, here’s an example based on fGetInterval and SplitList table-valued functions:

SELECT DI.Value
FROM dbo.fGetInterval(1, 10, 1) DI
   LEFT JOIN dbo.SplitList('1,3,6,7,10', ',') SL
      ON DI.Value = SL.Value
WHERE SL.Value IS NULL