13 March 2010

🔏MS Office: Excel for SQL Developers II (Insert Statements)

    From time to time it is required to load bulk data into a table, and even if SQL Server and Oracle provide tools for this purpose, there are also occasions when this needs to be done with statements, an INSERT statement being prepared for each record from the dataset. It sounds like a task that could easily be done in Excel, doesn't it? For this purpose, in order to concatenate the values on a row, I will use the GetRangeAsList macro defined in the previous post on the same topic. Considering the dataset from the screenshot below, and that the headers match the attributes' names from the destination table (see the table definition below), the formula that allows creating the multiple insert statements would look like this:
 
= "INSERT INTO Production.ProductsTest (" & SUBSTITUTE(GetRangeAsList($A$1:$F$1), "'", "") & ")" & CHAR(13) & "VALUES (" & GetRangeAsList(A2:F2) & ")" 

Excel - Insert statement

After stripping the double quotes from the output, the result would look like below: 

INSERT INTO Production.ProductsTest ( Name, ProductNumber, MakeFlag, FinishedGoodsFlag, Color, SafetyStockLevel) 
VALUES( 'Road-150 Red, 62', 'BK-R93R-62', '1', '1', 'Red', '100') 
INSERT INTO Production.ProductsTest ( Name, ProductNumber, MakeFlag, FinishedGoodsFlag, Color, SafetyStockLevel) 
VALUES( 'Road-150 Red, 44', 'BK-R93R-44', '1', '1', 'Red', '100') 
INSERT INTO Production.ProductsTest ( Name, ProductNumber, MakeFlag, FinishedGoodsFlag, Color, SafetyStockLevel) 
VALUES( 'Road-150 Red, 48', 'BK-R93R-48', '1', '1', 'Red', '100') 
INSERT INTO Production.ProductsTest ( Name, ProductNumber, MakeFlag, FinishedGoodsFlag, Color, SafetyStockLevel) 
VALUES( 'Road-150 Red, 52', 'BK-R93R-52', '1', '1', 'Red', '100') 
INSERT INTO Production.ProductsTest ( Name, ProductNumber, MakeFlag, FinishedGoodsFlag, Color, SafetyStockLevel) 
VALUES( 'Road-150 Red, 56', 'BK-R93R-56', '1', '1', 'Red', '100') 
INSERT INTO Production.ProductsTest ( Name, ProductNumber, MakeFlag, FinishedGoodsFlag, Color, SafetyStockLevel) 
VALUES( 'Road-450 Red, 58', 'BK-R68R-58', '1', '1', 'Red', '100') 

    A much simpler approach is based on the use of UNION ALL to join the records; for this purpose the = " SELECT " & GetRangeAsList(A2:F2) & " UNION ALL" formula could be used, after which the last UNION ALL is stripped off and the INSERT INTO statement added manually: 
 
INSERT INTO Production.ProductsTest ( Name, ProductNumber, MakeFlag, FinishedGoodsFlag, Color, SafetyStockLevel) 
SELECT 'Road-150 Red, 62', 'BK-R93R-62', '1', '1', 'Red', '100' UNION ALL 
SELECT 'Road-150 Red, 44', 'BK-R93R-44', '1', '1', 'Red', '100' UNION ALL 
SELECT 'Road-150 Red, 48', 'BK-R93R-48', '1', '1', 'Red', '100' UNION ALL 
SELECT 'Road-150 Red, 52', 'BK-R93R-52', '1', '1', 'Red', '100' UNION ALL 
SELECT 'Road-150 Red, 56', 'BK-R93R-56', '1', '1', 'Red', '100' UNION ALL 
SELECT 'Road-450 Red, 58', 'BK-R68R-58', '1', '1', 'Red', '100' 

    SQL Server 2008 allows inserting multiple records within a single statement; for this purpose the last formula could be changed to = " (" & GetRangeAsList(A2:F2) & " ) ,", after which the last comma is stripped off and the INSERT INTO statement added manually: 
 
INSERT INTO Production.ProductsTest ( Name, ProductNumber, MakeFlag, FinishedGoodsFlag, Color, SafetyStockLevel) VALUES 
( 'Road-150 Red, 62', 'BK-R93R-62', '1', '1', 'Red', '100' ) , 
( 'Road-150 Red, 44', 'BK-R93R-44', '1', '1', 'Red', '100' ) , 
( 'Road-150 Red, 48', 'BK-R93R-48', '1', '1', 'Red', '100' ) , 
( 'Road-150 Red, 52', 'BK-R93R-52', '1', '1', 'Red', '100' ) , 
( 'Road-150 Red, 56', 'BK-R93R-56', '1', '1', 'Red', '100' ) , 
( 'Road-450 Red, 58', 'BK-R68R-58', '1', '1', 'Red', '100' ) 

   Just in case you want to test the statements, here is the Production.ProductsTest table's definition:

CREATE TABLE [Production].[ProductsTest]( 
[ProductID] [int] IDENTITY(1,1) NOT NULL, 
[Name] [nvarchar](50) NOT NULL, 
[ProductNumber] [nvarchar](25) NOT NULL, 
[MakeFlag] [bit] NOT NULL, 
[FinishedGoodsFlag] [bit] NOT NULL, 
[Color] [nvarchar](15) NULL, 
[SafetyStockLevel] [smallint] NOT NULL) 

Note:
    Of course, for the above 3 cases macros could also be written that produce the same output taking a range as parameter; however, the formulas are fairly simple to use, the only things that need to be changed being the ranges and, in the first example, the table's name.

🔏MS Office: Excel for SQL Developers I (Range to Delimited List of Values)

    For many SQL developers Excel could prove to be a useful tool not only for data analysis and manipulation but also for handling repetitive tasks. One of the simplest such examples I may think of is the formatting of a list of values for use within a query.

    Typically users request a report based on a list of values provided in Excel, for example a list of Products. If the list is quite small, let's say 5-10 products, transforming the list for use in the query is not a big deal. On the other hand, when dealing with hundreds or thousands of records the story totally changes. In all cases what I do is first select the distinct list of values and copy them, eventually in a new sheet, where I use a formula like the one in the screenshot and apply it to the whole dataset. 
 
Excel - List of values formula 
 
Notes:
1.    Within the = ", '" & TRIM(A2) & "'" formula the TRIM function helps to remove additional spaces.
2.    Please note that when the values were copied into Excel from other applications (e.g. web pages) it could happen that additional formatting is stored, resulting in unexpected behavior. What helps to solve this issue is to copy-paste the values into another sheet using the Paste Special feature, to use the Data/Text To Columns functionality for each column involved, or to export the data to a comma-delimited file and re-import it into another Excel sheet.

     Now all I have to do is to copy the transformed output into my SQL Editor of choice and use it in a query.

SELECT * -- specify explicitly the attributes! 
FROM Production.Product 
WHERE ProductNumber IN ( 
'BK-R93R-62' , 'BK-R93R-44' 
, 'BK-R93R-48' 
, 'BK-R93R-52' 
, 'BK-R93R-56' 
, 'BK-R68R-58' 
, 'BK-R68R-60' 
, 'BK-R68R-44' 
, 'BK-R68R-48' 
, 'BK-R68R-52' 
, 'BK-R50R-58' 
, 'BK-R50R-60' 
, 'BK-R50R-62' 
, 'BK-R50R-44' 
, 'BK-R50R-48' 
, 'BK-R50R-52' 
, 'BK-R50B-58' 
, 'BK-R50B-60' 
, 'BK-R50B-62' 
, 'BK-R50B-44') 

     An even easier approach is to create a simple macro that takes as input the range I would like to extract the information from and transforms it directly into a comma-delimited list of values. Here's the function used for this purpose: 
 
Function GetRangeAsList(ByVal rng As Range) As String 
Dim result As String 
Dim cell As Range 
'concatenate the non-empty cells as a delimited list of quoted values 
For Each cell In rng 
    If Len(Trim(cell.Value)) > 0 Then 
       result = result & ", '" & Trim(cell.Value) & "'" 
    End If 
Next 
'strip off the leading delimiter, if any 
GetRangeAsList = IIf(Len(result) > 0, Right(result, Len(result) - 1), result) 
End Function 

   Now all I have to do in the Excel sheet with my data is to type the formula =GetRangeAsList(A2:A21) in an empty cell outside the existing dataset, the function's output being similar to that of the previously used formula, though the list of values is more compact:
'BK-R93R-62', 'BK-R93R-44', 'BK-R93R-48', 'BK-R93R-52', 'BK-R93R-56', 'BK-R68R-58', 'BK-R68R-60', 'BK-R68R-44', 'BK-R68R-48', 'BK-R68R-52', 'BK-R50R-58', 'BK-R50R-60', 'BK-R50R-62', 'BK-R50R-44', 'BK-R50R-48', 'BK-R50R-52', 'BK-R50B-58', 'BK-R50B-60', 'BK-R50B-62', 'BK-R50B-44'

Note:
  The function could easily be changed to take the delimiter as a parameter too, though I haven't met many occasions when another delimiter was required.

🎡SSIS: Percentage Sampling Data Flow Transformation

    One of the problems relatively difficult to address in the past was random sampling within a SQL Server data set. Of course there is always Excel that could be used for this purpose with just a few tricks, though what do you do when working with a huge source dataset? As we will see below, SSIS 2008's Percentage Sampling Data Flow Transformation allows addressing this problem easily.

    Using the template SSIS package defined in the Third Magic Class post, copy-paste the Package.dtsx in the project and rename it (e.g. Package Percentage Sampling.dtsx), then from the Toolbox add a Percentage Sampling Transformation and link it to the OLE DB Source. Access the Percentage Sampling Transformation Editor and modify the Percentage of rows value from 10 to 50. It doesn't really make sense to rename the sample and unselected outputs, though you might need to do that when dealing with multiple Percentage Sampling Transformations.

SSIS - Percentage Sampling Transformation Editor
Note:
    The percentage of rows you'd like to work with depends entirely upon the request; in many cases it's indicated to determine the size of your sample statistically. Given that the number of records in this example is quite small, I preferred to use a medium sample size.

    Link the Percentage Sampling Transformation to the OLE DB Destination and in the Input Output Selection dialog select as Output the ‘Sampling Selected Output’, while in the OLE DB Destination Editor create a new table (e.g. Production.BikesSample). 
 
SSIS - Input Output Selection

   In the last step, before testing the whole package, in the Control Flow tab change the Execute SQL Task’s SQLStatement property to point to the current destination table: 
 
TRUNCATE TABLE [Production].[BikesSample]  

  
SSIS - Percentage Sampling Data Flow

   Save the project, test (debug) the package (twice) and don’t forget to validate the output data: 
  
SELECT * 
FROM [Production].[BikesSample]   
 

Note:
   I was actually expecting to have 48 or 49 records (97/2 = 48.5) in the output and not 45; I wonder where the difference comes from?! That’s a topic I still have to investigate. I also tried to change the percentage of rows to 25, resulting in an output of 23 records (23*4 = 92), then to 75, resulting in an output of 74 records, respectively to 100, all the records being selected this time. At least the algorithm used by Microsoft partitions the output into complementary datasets.

🎡SSIS: Aggregate Data Flow Transformation

    At first look the SSIS Aggregate Data Flow Transformation doesn’t seem to be so useful, given that the same functionality could easily be obtained with an aggregate query, which actually has more flexibility in what concerns data manipulation. However, it must not be forgotten that the power of transformations resides in the way they are combined in order to solve a problem and not in their isolated use. On the other side, in order to master the basics it makes sense to create simple examples that include a minimum of information/tasks. It’s also the case of this post; for exemplification I will use the following query based on AdventureWorks’ Purchase Order Header/Detail tables:

SELECT POD.PurchaseOrderDetailID 
, POD.PurchaseOrderID 
, POD.ProductID 
, POD.OrderQty 
, POD.ReceivedQty 
, POD.UnitPrice 
, POD.OrderQty * POD.UnitPrice OrderValue 
, POD.ReceivedQty * POD.UnitPrice ReceiptValue 
, POH.OrderDate 
, POD.DueDate 
FROM Purchasing.PurchaseOrderDetail POD 
   JOIN Purchasing.PurchaseOrderHeader POH 
      ON POD.PurchaseOrderID = POH.PurchaseOrderID 
WHERE POH.Status IN (2, 4) -- 2-Approved, 4-Complete 

   Using the template SSIS package defined in the Third Magic Class post, copy-paste the Package.dtsx in the project and rename it (e.g. Package Aggregate.dtsx), then in the Data Flow Task rename the Data Source (e.g. OLE DB Source POs) and change the SQL command text using the above query. From the Toolbox add an Aggregate Transformation, link it to the OLE DB Source and access the Aggregate Transformation Editor in which, if everything went well until now, a “table” based on the above query will appear; check in a first phase the following attributes: ProductID, OrderQty, ReceivedQty, OrderValue and ReceiptValue. 

    Once an attribute is checked in the table, it will also appear as an Input Column in the Editor, the same name being used for the Output Alias too. For ProductID the Editor chose ‘Group by’ as Operation, and ‘Sum’ for OrderValue. Change the Operation to Sum for the other three attributes as in the below screenshot. Also add the OrderDate and once more the ProductID as Input Columns; for the first choose ‘Maximum’ as Operation, while for the second ‘Count’, in order to get the Last OrderDate, respectively the Number of Records. 
 
SSIS - Aggregate Transformation Editor

     Once the above changes are completed, delete the existing OLE DB Destination and add another one, link the Aggregate Transformation to it, create a new table (e.g. Purchasing.PurchaseOrderProductAggregation) based on the Aggregate’s output, accept the mappings and test the Data Flow Task: 
 
SSIS - Aggregate Data Flow Task

     In the last step, before testing the whole package, in the Control Flow tab change the Execute SQL Task’s SQLStatement property to point to the current destination table:

TRUNCATE TABLE [Purchasing].[PurchaseOrderProductAggregation]   

   Save the project, test (debug) the package (twice) and don’t forget to validate the output data: 
 
SELECT * 
FROM [Purchasing].[PurchaseOrderProductAggregation]    

     The output from the above table should actually be the same as the output of the following aggregate query:

-- PO Product Aggregation 
SELECT POD.ProductID 
, SUM(POD.OrderQty) OrderQty 
, SUM(POD.ReceivedQty) ReceivedQty 
, SUM(POD.OrderQty * POD.UnitPrice) OrderValue 
, SUM(POD.ReceivedQty * POD.UnitPrice) ReceiptValue 
, Max(POH.OrderDate) LastOrderDate 
, COUNT(1) NumberRecords 
FROM Purchasing.PurchaseOrderDetail POD 
   JOIN Purchasing.PurchaseOrderHeader POH 
      ON POD.PurchaseOrderID = POH.PurchaseOrderID 
WHERE POH.Status IN (2, 4) -- 2-Approved, 4-Complete 
GROUP BY POD.ProductID 
ORDER BY POD.ProductID

Happy Coding!

12 March 2010

🕋Data Warehousing: Pivot/Unpivot (Definitions)

"To rotate rows to columns, and columns to rows, in a cross-tabular data browser. Also refers to choosing dimensions from the set of available dimensions in a multidimensional data structure for display in the rows and columns of a cross-tabular structure." (Microsoft Corporation, "SQL Server 7.0 System Administration Training Kit", 1999)

"The process of rotating the view of data. For example, viewing what was the x-axis in the y-axis’s position and vice versa." (Microsoft Corporation, "Microsoft SQL Server 7.0 Data Warehouse Training Kit", 2000)

"The UNPIVOT operator is used within a SELECT statement to create a normalized data report from data that is stored as a spreadsheet." (Darril Gibson, "MCITP SQL Server 2005 Database Developer All-in-One Exam Guide", 2008)

"The PIVOT operator is used within a SELECT statement. It is used to create cross-tab reports (similar to a spreadsheet) from normalized data." (Darril Gibson, "MCITP SQL Server 2005 Database Developer All-in-One Exam Guide", 2008)

"The act of rotating rows to columns, and columns to rows." (Jim Joseph, "Microsoft SQL Server 2008 Reporting Services Unleashed", 2009)

[unpivot:] "To expand values from multiple columns in a single record into multiple records with the same values in a single column." (Microsoft, "SQL Server 2012 Glossary", 2012)

"To rotate a table-valued expression by turning the unique values from one column in the expression into multiple columns in the output, and perform aggregations where they are required on any remaining column values that are wanted in the final output." (Microsoft, "SQL Server 2012 Glossary", 2012)

"In the Lean start-up world, a pivot is a structured, often rapid, course correction on the basis of new market, customer, and development information." (Pamela Schure & Brian Lawley, "Product Management For Dummies", 2017)

 "1. To rotate rows to columns, and columns to rows, in a cross-tabular data browser. 2. To choose dimensions from the set of available dimensions in a multidimensional data structure for display in the rows and columns of a cross-tabular structure." (Microsoft Technet)

[unpivot:] "In Integration Services, the process of creating a more normalized dataset by expanding data columns in a single record into multiple records." (Microsoft Technet)

🎡SSIS: The Union All Data Flow Transformation

In yesterday’s post I was showing how to use the Conditional Split Data Flow Transformation in order to partition a dataset horizontally based on a set of constraints, allowing thus to dump the data from each partition into its corresponding table. The inverse problem considers merging the distributed partitions of the same dataset, or distinct similar datasets, into a common data set. If the datasets that need to be merged are on the same server it makes sense to join them by using UNION or UNION ALL inside a simple Data Flow Task:

SELECT * -- specify explicitly the attributes 
FROM [Production].[Bikes] 
UNION ALL 
SELECT * -- specify explicitly the attributes 
FROM [Production].[ClothesAccessories] 
UNION ALL 
SELECT * -- specify explicitly the attributes 
FROM [Production].[Components] 

If the datasets are distributed across several data sources then in theory a package could be created for each data source in order to dump the data locally, however this often implies the synchronization of all involved packages, eventually by calling each package from a parent package. A better and easier approach is to use the Union All Data Flow Transformation, which merges the data from multiple sources into a common result set, a data source needing to be defined for each input dataset. As exemplification I will use the three Product tables created in the previous post, namely [Production].[Bikes], [Production].[ClothesAccessories] and [Production].[Components]. Even if the tables are located on the same server and the datasets could be merged with a UNION, I will use the general approach and reuse the template SSIS package defined in the Third Magic Class post. Copy-paste the Package.dtsx in the project and rename it (e.g. Package Union All.dtsx), then in the Data Flow Task rename the Data Source (e.g. OLE DB Source Bikes) and change the SQL command text using the following query: 
 
SELECT ITM.ProductID 
, ITM.ProductName 
, ITM.ProductNumber 
, ITM.ProductModel 
, ITM.ProductSubcategory 
, ITM.ProductCategory 
, ITM.MakeFlag 
, ITM.FinishedGoodsFlag 
, ITM.Color 
, ITM.StandardCost 
, ITM.ListPrice 
, ITM.Class 
, ITM.Style 
, ITM.SellStartDate 
, ITM.SellEndDate 
, ITM.DiscontinuedDate 
, ITM.ModifiedDate 
FROM Production.Bikes ITM 

Once this step is completed, copy the OLE DB Source, paste it in the current Data Flow Task twice, and modify the queries of the two newly added data sources to point to the Production.ClothesAccessories, respectively the Production.Components tables. From the Toolbox add a Union All Transformation and link the (green) connectors of the three sources to it; once you open the Union All Transformation Editor you’ll see that the columns were mapped automatically: 
 
SSIS - Union All Editor
  
Accept the mapping and link the (green) connector of the Union All Transformation to the existing OLE DB Destination. When attempting to use the OLE DB Destination Editor SQL Server will first show the Restore Invalid Column References Editor in which you could update the invalid references. 

SSIS - Restore Invalid Column References  
 
Accept the default settings and in the OLE DB Destination Editor create a new table based on the existing input (e.g. [Production].[ProductsOfInterest]), and test the Data Flow Task: 
 
SSIS - Union All Data Load Task

In the last step, before testing the whole package, in the Control Flow tab change the Execute SQL Task’s SQLStatement property to point to the current destination table: 

TRUNCATE TABLE [Production].[ProductsOfInterest] 
 
Save the project, test (debug) the package (twice) and don’t forget to validate the output data: 
 
SELECT * 
FROM [Production].[ProductsOfInterest] 

Note:
In case you need to reuse the datasets in multiple packages, it makes sense to load each dataset in its own staging table rather than loading the data over and over again.

🎡SSIS: The Conditional Split Data Flow Transformation

In Third Magic Class I showed how to create a Data Flow Task using SQL Server 2008 Business Intelligence Development Studio (BIDS), the respective SSIS package being based only on Products with the Product Category having the value ‘Bikes’. What if it were needed to create a table for the ‘Components’ Product Category, and another one for ‘Clothing’ and ‘Accessories’? Normally this would equate to creating a package for each destination table, though SSIS provides the Conditional Split component that allows partitioning a dataset horizontally based on a set of constraints.

For the current example copy-paste the Package.dtsx created in the mentioned tutorial and rename it (e.g. Package Conditional Split.dtsx). The first important change is to extend the scope to the four Product Categories considered - 'Bikes', 'Components', 'Clothing' and 'Accessories'; therefore in the ‘OLE DB Source’ change the query as follows: 
 
SELECT ITM.ProductID 
, ITM.ProductName 
, ITM.ProductNumber 
, ITM.ProductModel 
, ITM.ProductSubcategory 
, ITM.ProductCategory 
, ITM.MakeFlag 
, ITM.FinishedGoodsFlag 
, ITM.Color 
, ITM.StandardCost 
, ITM.ListPrice 
, ITM.Class 
, ITM.Style 
, ITM.SellStartDate 
, ITM.SellEndDate 
, ITM.DiscontinuedDate 
, ITM.ModifiedDate 
FROM Production.vProducts ITM 
WHERE ProductCategory IN ('Bikes', 'Components', 'Clothing', 'Accessories') 

From the Toolbox add the Conditional Split component, delete the previous connector and attach it to the newly added transformation element. Double click on the Conditional Split in order to access the Conditional Split Transformation Editor, in which you’ll need to specify the conditions used for the split. Considering as the first condition the case treated in the previous tutorial, define [ProductCategory] == ”Bikes” as Condition and rename the default ‘Case 1’ Output Name to ‘Case Bikes’. For the second condition consider [ProductCategory] == "Components" as Condition and ‘Case Components’ as Output Name, as in the below screenshot: 
 
SSIS - Conditional Split
 
There is no need to create a third Case for 'Clothing' and 'Accessories', because they will be handled by the Conditional Split Default branch. You could also rename the ‘Conditional Split Default Output’ default output name to ‘Case Else’ as above.

For each Case in the Conditional Split, including the Default one, you’ll need to have a destination, therefore drop two more OLE DB destinations and link one green connector from the Conditional Split to each of the three destinations. For each connector you’ll have to select the case the branch is supposed to address, and be sure that you selected ‘Case Bikes’ for the destination inherited from the template package! As for the other two, you’ll have to create two destination tables using the default structure and rename them [Production].[ClothesAccessories], respectively [Production].[Components]. The two tables will have the same definition as the [Production].[Bikes] table. Do not forget to check the ‘Keep identity’ and ‘Keep nulls’ checkboxes too!

Save the package and debug it, the resulting package looking similar to the one in the below screenshot: 
 
SSIS - Conditional Split Package  

We have used the Conditional Split Transformation and tested it, though we haven’t finished yet! As I remarked in the previous post, the data from the destination table(s) needs to be deleted each time the package is run. For this, in the Control Flow tab modify the Execute SQL Task and in the SQLStatement property replace the existing statement with the following lines: 

TRUNCATE TABLE [Production].[Bikes] 
TRUNCATE TABLE [Production].[ClothesAccessories] 
TRUNCATE TABLE [Production].[Components] 

Link the Execute SQL Task connector to the Data Flow Task, ensure that the Constraint option value is set to ‘Success’, save the project, test (debug) the package (twice) and don’t forget to validate the output data!

Note:
I preferred to keep things as simple as possible, especially when considering the constraints used for the Conditional Split, however this shouldn't stop you from attempting to use the supported library of functions and operators in order to create more complex constraints. This post is dependent on the Third Magic Class post and the SSIS “template” package created in it, so please start with the respective post!

11 March 2010

💎SQL Reloaded: Oracle vs. SQL Server: Averages I

   For many people working with averages is not a complicated topic at all, and that’s quite true, though as always it depends on how much you want to complicate things. The average or arithmetic mean of a data set is calculated as the sum of the values divided by the number of values: 
AVG - Simple Average 
   RDBMS vendors provide this functionality within the AVG aggregate function, and even if in theory the result could be calculated using the SUM and COUNT aggregate functions as per the above definition, given its importance in statistics, providing an AVG function seems to be a natural choice. AVG could be used across the whole data set or within a given “partition” determined by the GROUP BY clause.

-- Oracle/SQL Server: Simple Averages 
SELECT ProductID 
, AVG(UnitPrice) AverageUnitPrice 
, SUM(UnitPrice)/COUNT(1) AverageUnitPriceDef 
FROM Purchasing.PurchaseOrderDetail 
GROUP BY ProductID 


    For most data analysis requirements the simple average function would be enough, though there are cases in which it is required to address the sensitivity of values to certain modifiers – for example quantities or the number of lines. For such cases the weighted average (weighted arithmetic mean) is used, calculated as the sum of the products between values and quantifiers, the total being divided by the sum of the quantifiers:
AVG - Weighted Average
Note:
     In case the quantifiers are all 1 then the weighted average equates with the simple average, the output being the same also when the values are constant.

      The weighted average is quite a useful tool when performing PO Price Analysis, in which the volume of quantities purchased at a given Unit Price often needs to be considered too: 
 
-- Oracle/SQL Server: Simple/Weighted Averages 
SELECT ProductID 
, AVG(UnitPrice) AverageUnitPrice 
, SUM(UnitPrice*OrderQty)/SUM(OrderQty) WeightedAverageUnitPrice 
FROM Purchasing.PurchaseOrderDetail 
GROUP BY ProductID  

   The actual implementation in a query might differ based on requirements and preferences – as multiple left joins, inline view or correlated query, each of the mentioned approaches coming with their downsides and strengths: 
 
-- Oracle/SQL Server: Averages (inline view) 
SELECT ITM.ProductID 
, ITM.Name ProductName 
, ITM.ProductNumber 
, DAT.AverageUnitPrice 
, DAT.WeightedAverageUnitPrice 
FROM Production.Product ITM 
LEFT JOIN ( --calculated averages 
    SELECT POD.ProductID 
    , AVG(POD.UnitPrice) AverageUnitPrice 
    , SUM(POD.UnitPrice*OrderQty)/SUM(POD.OrderQty) WeightedAverageUnitPrice 
    FROM Purchasing.PurchaseOrderDetail POD 
       JOIN Purchasing.PurchaseOrderHeader POH 
          ON POD.PurchaseOrderID = POH.PurchaseOrderID 
    WHERE POH.Status IN (2, 4) -- 2-Approved, 4-Complete 
   GROUP BY POD.ProductID)DAT 
    ON ITM.ProductID = DAT.ProductID 
ORDER BY ITM.Name 

-- Oracle/SQL Server: Averages (multiple left joins) 
SELECT ITM.ProductID 
, ITM.Name ProductName 
, ITM.ProductNumber 
, AVG(POD.UnitPrice) AverageUnitPrice 
, SUM(POD.UnitPrice*OrderQty)/SUM(POD.OrderQty) WeightedAverageUnitPrice 
FROM Production.Product ITM 
   LEFT JOIN Purchasing.PurchaseOrderDetail POD 
      ON ITM.ProductID = POD.ProductID 
   LEFT JOIN Purchasing.PurchaseOrderHeader POH 
      ON POD.PurchaseOrderID = POH.PurchaseOrderID  
WHERE POH.Status IN (2, 4) -- 2-Approved, 4-Complete 
GROUP BY ITM.ProductID 
,ITM.Name 
,ITM.ProductNumber 
ORDER BY ITM.Name  

-- SQL Server 2005: Averages (OUTER APPLY) 
SELECT ITM.ProductID 
, ITM.Name ProductName 
, ITM.ProductNumber 
, DAT.AverageUnitPrice 
, DAT.WeightedAverageUnitPrice 
FROM Production.Product ITM 
OUTER APPLY ( -- calculated averages 
    SELECT AVG(POD.UnitPrice) AverageUnitPrice 
    , SUM(POD.UnitPrice*OrderQty)/SUM(POD.OrderQty) WeightedAverageUnitPrice 
    FROM Purchasing.PurchaseOrderDetail POD  
       JOIN Purchasing.PurchaseOrderHeader POH 
          ON POD.PurchaseOrderID = POH.PurchaseOrderID  
    WHERE ITM.ProductID = POD.ProductID  
        AND POH.Status IN (2, 4) -- 2-Approved, 4-Complete 
) DAT 
ORDER BY ITM.Name 

      A different type of approach could be focused on calculating the simple or weighted average based only on the last n PO Lines, for example the last 3 PO Lines: 
  
-- Oracle/SQL Server 2005: last 3 PO average prices 
SELECT DAT.ProductID 
, MAX(DAT.RANKING) NumberRecords 
, AVG(DAT.UnitPrice) AveragePrice 
, SUM(DAT.UnitPrice* DAT.OrderQty)/SUM(DAT.OrderQty) WeightedAveragePrice 
FROM (-- ranked PO Prices 
    SELECT POD.ProductID 
    , POD.UnitPrice 
    , POD.OrderQty 
    , RANK() OVER(PARTITION BY POD.ProductID ORDER BY POH.OrderDate DESC,   POD.PurchaseOrderDetailID DESC) RANKING 
    FROM Purchasing.PurchaseOrderDetail POD 
       JOIN Purchasing.PurchaseOrderHeader POH 
          ON POD.PurchaseOrderID = POH.PurchaseOrderID 
    WHERE POH.Status IN (2, 4) -- 2-Approved, 4-Complete 
) DAT 
WHERE DAT.RANKING BETWEEN 1 AND 3 
GROUP BY DAT.ProductID
 


    The problem could be complicated even more by selecting all the POs corresponding to the last 3 distinct Prices used; I’m not sure if it brings any value, but the formulation is perfectly valid, so don’t despair if a user comes with such a request.
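    A possible sketch for such a request, shown below, ranks the distinct Unit Prices per Product by their last usage date and then selects all the PO lines matching the top 3 prices; it is only one of several valid formulations, given only as an illustration:

-- SQL Server 2005: POs corresponding to the last 3 distinct Unit Prices used 
SELECT POD.ProductID 
, POD.UnitPrice 
, POD.OrderQty 
FROM Purchasing.PurchaseOrderDetail POD 
   JOIN Purchasing.PurchaseOrderHeader POH 
      ON POD.PurchaseOrderID = POH.PurchaseOrderID 
   JOIN ( -- distinct Unit Prices ranked by their last usage date 
      SELECT DAT.ProductID 
      , DAT.UnitPrice 
      FROM ( 
          SELECT POD.ProductID 
          , POD.UnitPrice 
          , DENSE_RANK() OVER(PARTITION BY POD.ProductID ORDER BY MAX(POH.OrderDate) DESC) RANKING 
          FROM Purchasing.PurchaseOrderDetail POD 
             JOIN Purchasing.PurchaseOrderHeader POH 
                ON POD.PurchaseOrderID = POH.PurchaseOrderID 
          WHERE POH.Status IN (2, 4) -- 2-Approved, 4-Complete 
          GROUP BY POD.ProductID 
          , POD.UnitPrice) DAT 
      WHERE DAT.RANKING BETWEEN 1 AND 3) LPR 
      ON POD.ProductID = LPR.ProductID 
     AND POD.UnitPrice = LPR.UnitPrice 
WHERE POH.Status IN (2, 4) -- 2-Approved, 4-Complete 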

    Occasionally it might be needed to study the evolution of prices over time, the previous problem becoming a simple moving average or weighted moving average problem, in which the average is calculated with respect to a moving point on the time scale, considering either the previous k points or the previous and following k points, including the current point:
AVG - Simple Moving Average
AVG - Weighted Moving Average
     G. Priester shows in his Calculating Moving Averages with T-SQL article a nice solution for simple moving average calculation, exemplifying it on stock market data. Frankly, given its flexibility, I prefer to use the CROSS APPLY operator in combination with the fGetDatesInInterval function introduced in the More Date-related Functionality – Part I post, the function returning all the dates falling within a certain interval. Because in the case of POs there could be multiple transactions per day for the same Product, the problem is slightly changed, the averages being calculated within a 31-day window:

-- SQL Server 2005: 31 days moving averages 
SELECT ITM.ProductNumber 
, ITM.Name ProductName 
, DIN.DateSequence ReportingDate 
, DAT.AverageUnitPrice 
, DAT.WeightedAverageUnitPrice 
FROM dbo.fGetDatesInInterval('2003-01-01', '2003-12-31') DIN 
CROSS APPLY (--calculated averages 
    SELECT POD.ProductID 
    , AVG(POD.UnitPrice) AverageUnitPrice 
    , SUM(POD.UnitPrice*OrderQty)/SUM(POD.OrderQty) WeightedAverageUnitPrice 
    FROM Purchasing.PurchaseOrderDetail POD 
       JOIN Purchasing.PurchaseOrderHeader POH 
          ON POD.PurchaseOrderID = POH.PurchaseOrderID 
     WHERE DATEDIFF(d, POH.OrderDate, DIN.DateSequence) BETWEEN 0 AND 30 -- 31 days 
     --WHERE DATEDIFF(d, POH.OrderDate, DIN.DateSequence) BETWEEN -15 AND 15 -- 31 days 
         AND POH.Status IN (2, 4) -- 2-Approved, 4-Complete 
GROUP BY POD.ProductID) DAT 
LEFT JOIN Production.Product ITM 
   ON DAT.ProductID = ITM.ProductID 
WHERE DATEPART(dw, DIN.DateSequence) BETWEEN 2 AND 6 -- Working Days 
ORDER BY ITM.Name 
, DIN.DateSequence 
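
    Since the above query depends on it, here is a minimal sketch of how fGetDatesInInterval could look as a table-valued function; the actual definition is the one from the referenced post, the signature and the DateSequence output column being assumed here from the way the function is used above:

-- minimal sketch (assumed signature): one DateSequence row per day in the interval 
CREATE FUNCTION dbo.fGetDatesInInterval( 
  @StartDate smalldatetime 
, @EndDate smalldatetime) 
RETURNS @Dates TABLE (DateSequence smalldatetime) 
AS 
BEGIN 
   WHILE @StartDate <= @EndDate 
   BEGIN 
      INSERT INTO @Dates (DateSequence) VALUES (@StartDate) 
      SET @StartDate = DATEADD(d, 1, @StartDate) 
   END 
   RETURN 
END 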
