Showing posts with label aggregate window functions. Show all posts
Showing posts with label aggregate window functions. Show all posts

17 May 2020

💎🏭SQL Reloaded: Query Patterns in SQL Server (Part IV: Window Functions)

For a long time aggregate functions were the only tool for statistical purposes available for raw SQL scripting. Their limitations become more evident with the introduction of window functions, which allow to apply the aggregates over a defined partition. Besides aggregate window functions, the use of ranking, respectively value window functions, opened the door to a new set of techniques. Here are a few examples based on the tables defined in a previous post.

Ranking windows functions allow ranking a record within the whole dataset or a partition by providing a sorting key:

-- Ranking window functions (RANK, DENSE_RANK, ROW_NUMBER)
SELECT A.CourseId 
, C.CourseName 
, A.StudentId 
, S.StudentName 
, A.StartDate 
, A.EndDate 
, RANK() OVER (ORDER BY A.Mark) [Rank over whole set]
, DENSE_RANK() OVER (ORDER BY A.Mark) [Dense Rank over whole set]
, ROW_NUMBER() OVER (ORDER BY A.Mark) [Row Number over whole set]
--, RANK() OVER (PARTITION BY A.CourseId ORDER BY A.StartDate) [Rank over Course]
--, RANK() OVER (PARTITION BY A.StudentId ORDER BY A.StartDate) [Rank over Student]
--, DENSE_RANK() OVER (PARTITION BY A.CourseId ORDER BY A.StartDate) [Dense Rank over Course]
--, DENSE_RANK() OVER (PARTITION BY A.StudentId ORDER BY A.StartDate) [Dense Rank over Student]
--, ROW_NUMBER() OVER (PARTITION BY A.CourseId ORDER BY A.StartDate) [Dense Rank over Course]
--, ROW_NUMBER() OVER (PARTITION BY A.StudentId ORDER BY A.StartDate) [Dense Rank over Student]
FROM dbo.T_Allocations A
     JOIN dbo.T_Courses C
       ON A.CourseId = C.CourseId 
     JOIN dbo.T_Students S
       ON A.StudentId = S.StudentId 
ORDER BY C.CourseName 
, S.StudentName 

Being able to rank records allows to easier select the last, respectively the first records from a dataset or a given partition:

-- first course 
SELECT *
, A.StartDate 
FROM dbo.T_Students S
     LEFT JOIN (
		SELECT A.CourseId
		, A.StudentId 
		, A.StartDate 
		, RANK() OVER (PARTITION BY A.StudentId ORDER BY A.StartDate) Ranking 
		FROM dbo.T_Allocations A
    ) A
   ON S.StudentId = A.StudentId 
  AND A.Ranking = 1 
ORDER BY S.StudentName

The aggregate window functions function similarly as their simple counterparts, except that they are valid over a partition. If the partition is left out, the functions apply over the whole data set.

-- aggregate window functions 
SELECT A.CourseId 
, C.CourseName 
, A.StudentId 
, S.StudentName 
, A.StartDate 
, A.EndDate 
, SUM(A.Mark) OVER (PARTITION BY A.StudentId) [Sum]
, MAX(A.Mark) OVER (PARTITION BY A.StudentId) [Max]
, MIN(A.Mark) OVER (PARTITION BY A.StudentId) [Min]
, AVG(A.Mark) OVER (PARTITION BY A.StudentId) [Avg]
, COUNT(A.Mark) OVER (PARTITION BY A.StudentId) [Count]
FROM dbo.T_Allocations A
     JOIN dbo.T_Courses C
       ON A.CourseId = C.CourseId 
     JOIN dbo.T_Students S
       ON A.StudentId = S.StudentId 
ORDER BY C.CourseName 
, S.StudentName 

When the ordering attributes are specified, running averages, respectively sums derive:

-- Running averages/sums and ranking via Sum and ORDER BY
SELECT A.StudentId 
, A.CourseId
, A.StartDate
, A.EndDate
, A.Mark 
, AVG(Cast(Mark as decimal(4,2))) OVER (PARTITION BY A.StudentId ORDER BY A.StartDate, A.AllocationId) AvgGrade
, SUM(Mark) OVER (PARTITION BY A.StudentId ORDER BY A.StartDate, A.AllocationId) SumGrade
, RANK() OVER (PARTITION BY A.StudentId ORDER BY A.StartDate, A.AllocationId) Ranking
FROM dbo.T_Allocations A
ORDER BY A.StudentId
, A.StartDate

Running averages were introduced only with SQL Server 2008. Previously, one could still calculate them using common table expressions and a ranking function:
 
-- Running averages/sums and ranking via common table expressions 
WITH CTE AS (
	SELECT A.StudentId 
	, A.CourseId
	, A.StartDate
	, A.EndDate
	, A.Mark 
	, RANK() OVER (PARTITION BY A.StudentId ORDER BY A.StartDate, A.AllocationId) Ranking
	FROM dbo.T_Allocations A
), 
DAT AS (
	SELECT A.StudentId 
	, A.CourseId
	, A.StartDate
	, A.EndDate
	, A.Mark 
	, A.Ranking
	, A.Mark SumMark
	FROM CTE A
	WHERE A.Ranking = 1
	UNION ALL
	SELECT CTE.StudentId 
	, CTE.CourseId
	, CTE.StartDate
	, CTE.EndDate
	, CTE.Mark 
	, CTE.Ranking
	, CTE.Mark + DAT.Mark SumMark 
	FROM CTE
		 JOIN DAT
		   ON CTE.StudentID = DAT.StudentId 
		  AND CTE.Ranking-1 = DAT.Ranking 
) 
SELECT DAT.StudentId 
, DAT.CourseId
, DAT.StartDate
, DAT.EndDate
, DAT.Mark 
, cast(DAT.SumMark/DAT.Ranking as decimal(4,2)) AvgMark
, DAT.SumMark
, DAT.Ranking
FROM DAT
ORDER BY DAT.StudentId
, DAT.StartDate
   

Notes:
The queries work also in SQL databases in Microsoft Fabric.

Happy coding!

03 December 2011

💠SQL Server: Window Functions [new feature]

Introduction

     In the past, in the absence or in parallel with other techniques, aggregate functions proved to be quite useful in order to solve several types of problems that involve the retrieval of first/last record or the display of details together with averages and other aggregates. Typically their use involves two or more joins between a dataset and an aggregation based on the same dataset or a subset of it. An aggregation can involve one or more columns that make the object of analysis. Sometimes it might be needed multiple such aggregations based on different sets of columns. Each such aggregation involves at least a join. Such queries can become quite complex, though they were a price to pay in order to solve such problems.

Partitions

     The introduction of analytic functions in Oracle and of window functions, a similar concept, in SQL Server, allowed the approach of such problems from a different simplified perspective. Central to this feature it’s the partition (of a dataset), its meaning being same as of mathematical partition of a set, defined as a division of a set into non-overlapping and non-empty parts that cover the whole initial set. The introduction of partitions it’s not necessarily something new, as the columns used in a GROUP BY clause determines (implicitly) a partition in a dataset. The difference in analytic/window functions is that the partition is defined explicitly inline together with a ranking or average function evaluated within a partition. If the concept of partition is difficult to grasp, let’s look at the result-set based on two Products (the examples are based on AdventureWorks database):
 
-- Price Details for 2 Products 
SELECT A.ProductID  
, A.StartDate 
, A.EndDate 
, A.StandardCost  
FROM [Production].[ProductCostHistory] A 
WHERE A.ProductID IN (707, 708) 
ORDER BY A.ProductID 
, A.StartDate 

window function - details

   In this case a partition is “created” based on the first Product (ProductId = 707), while a second partition is based on the second Product (ProductId = 708). As a parenthesis, another partitioning could be created based on ProductId and StartDate; considering that the two attributes are a key in the table, this will partition the dataset in partitions of 1 record (each partition will have exactly one record).

Details and Averages

     In order to exemplify the use of simple versus window aggregate functions, let’s consider a problem in which is needed to display Standard Price details together with the Average Standard Price for each ProductId. When a GROUP BY clause is applied in order to retrieve the Average Standard Cost, the query is written under the form: 

-- Average Price for 2 Products 
SELECT A.ProductID  
, AVG(A.StandardCost) AverageStandardCost 
FROM [Production].[ProductCostHistory] A 
WHERE A.ProductID IN (707, 708) 
GROUPBY A.ProductID  
ORDERBY A.ProductID 

window function - GROUP BY 

    In order to retrieve the details, the query can be written with the help of a FULL JOIN as follows:

-- Price Details with Average Price for 2 Products - using JOINs 
SELECT A.ProductID  
, A.StartDate 
, A.EndDate 
, A.StandardCost 
, B.AverageStandardCost 
, A.StandardCost - B.AverageStandardCost DiffStandardCost 
FROM [Production].[ProductCostHistory] A    
  JOIN ( -- average price        
    SELECT A.ProductID         
    , AVG(A.StandardCost) AverageStandardCost         
    FROM [Production].[ProductCostHistory] A        
    WHERE A.ProductID IN (707, 708)        
    GROUP BY A.ProductID      
) B  
    ON A.ProductID = B.ProductID 
WHERE A.ProductID IN (707, 708) 
ORDERBY A.ProductID 
, A.StartDate 

 window function - Average Price JOIN   

    As pointed above the partition is defined by ProductId. The same query written with window functions becomes:

-- Price Details with Average Price for 2 Products - using AVG window function 
SELECT A.ProductID  
, A.StartDate 
, A.EndDate 
, A.StandardCost 
, AVG(A.StandardCost) OVER(PARTITION BY A.ProductID) AverageStandardCost 
, A.StandardCost - AVG(A.StandardCost) OVER(PARTITION BY A.ProductID) DiffStandardCost 
FROM [Production].[ProductCostHistory] A 
WHERE A.ProductID IN (707, 708) 
ORDER BY A.ProductID 
, A.StartDate 

window function - Average Price WF









    As can be seen, in the second example, the AVG function is defined using the OVER clause with PartitionId as partition. Even more, the function is used in a formula to calculate the Difference Standard Cost. More complex formulas can be written making use of multiple window functions.  

The Last Record

     Let’s consider the problem of retrieving the nth record. Because with aggregate functions is easier to retrieve the first or last record, let’s consider that is needed to retrieve the last Standard Price for each ProductId. The aggregate function helps to retrieve the greatest Start Date, which farther helps to retrieve the record containing the Last Standard Price.

-- Last Price Details for 2 Products - using JOINs 
SELECT A.ProductID  
, A.StartDate 
, A.EndDate 
, A.StandardCost 
FROM [Production].[ProductCostHistory] A  
    JOIN ( -- average price          
    SELECT A.ProductID          
    , Max(A.StartDate) LastStartDate          
    FROM [Production].[ProductCostHistory] A          
    WHERE A.ProductID IN (707, 708)          
    GROUP BY A.ProductID      
) B      
   ON A.ProductID = B.ProductID  
  AND A.StartDate = B.LastStartDate 
WHERE A.ProductID IN (707, 708) 
ORDERBY A.ProductID 
,A.StartDate 

window function - Last Price JOIN  

With window functions the query can be rewritten as follows:

-- Last Price Details for 2 Products - using AVG window function 
SELECT * 
FROM (-- ordered prices      
    SELECT A.ProductID      
    , A.StartDate      
    , A.EndDate      
    , A.StandardCost      
    , RANK() OVER(PARTITION BY A.ProductID ORDER BY A.StartDate DESC) Ranking      
    FROM [Production].[ProductCostHistory] A     
    WHERE A.ProductID IN (707, 708) 
  ) A 
WHERE Ranking = 1 
ORDER BY A.ProductID 
, A.StartDate 

window function - Last Price WF  

   As can be seen, in order to retrieve the Last Standard Price, was considered the RANK function, the results being ordered descending by StartDate. Thus, the Last Standard Price will be always positioned on the first record. Because window functions can’t be used in WHERE clauses, it’s needed to encapsulate the initial logic in a subquery. Similarly could be retrieved the First Standard Price, this time ordering ascending the StartDate. The last query can be easily modified to retrieve the nth records (this can prove to be more difficult with simple average functions), the first/last nth records.

Conclusion

    Without going too deep into details, I shown above two representative scenarios in which solutions based on average functions could be simplified by using window functions. In theory the window functions provide greater flexibility but they have their own trade offs too. In the next posts I will attempt to further detail their use, especially in the context of Statistics.

20 June 2010

💎SQL Reloaded: Troubleshooting Query Issues III (Logical errors)

    In last post I provided some general guidelines on how to troubleshoot query issues related to errors thrown by the database engine. That’s a fortunate case, because engine’s validation shows that something went wrong. I’m saying fortunate because there are also cases in which there are also logical errors that result in unexpected output, and the errors could remain even for years undiscovered. It’s true that such errors are hard to discover, but not impossible; adequate testing combined with defensive programming techniques could allow decreasing the volume of such errors. The simplest and handiest test could be performed by looking in a first phase at the number of records returned by the query, and secondly by studying the evolution of cardinality with each table added to the query. The clearest signal that something went bad is when no records, too less or too many records are returned (than expected), then the problem most probably resides in the constraints used, namely in JOIN, WHERE or HAVING clause constraints.

WHERE clause constraints

    If your query is just a simple statement without joins of any type, then more likely the problem resides in the constraints used in the WHERE clause, at least one of them being evaluated to false. Here are several possible reasons:
- One of the hardcoded values or one of the provided parameters is not found between the values available in the table;
- Not treated NULL cases;
- Wrong use of predicates or parenthesis.

    The simplest constraint evaluated to false is “0=1” or something similar, actually can be used any similar values (e.g. ‘A’=’B’, ‘ABC’=’’, etc.). In contrast “1=1” is evaluated always to true. The below query returns no records because the constraint is evaluated for false:
 
-- single constraint evaluated to false 
SELECT * 
FROM Production.Product 
WHERE 1=0 

    A more complex scenario is when multiple predicates and at least one equates always to true or to false, the task in such situations being to identify such constraints:

-- multiple constraints, one evaluated to false 
SELECT * 
FROM Production.Product 
WHERE 1=0 
    AND Class = 'L' 
    AND SafetyStockLevel >100 

    The constraints could be grouped also using parenthesis:

-- multiple constraints with parenthesis, one evaluated to false 
SELECT * 
FROM Production.Product 
WHERE 1=0 
    AND ((Class = 'L' 
    AND SafetyStockLevel >100) 
      OR (Class = 'M' 
    AND SafetyStockLevel =100)) 

    Dealing with multiple such constraints is not a question of guessing but pure applied Boolean algebra essential in writing accurate queries. So if you are feeling that you’re not mastering such concepts, then maybe it’s a good idea to consult some material on this topic.

    There could be situations in which the total outcome of a set of constraints it’s not so easy to see, in such cases the constraint could be brought in the SELECT statement. Which constraints should be tested depends also on the particularities of your query, here is an example:  

--testing constraints' outcome for each record 
SELECT * 
, CASE  
     WHEN Class = 'L' AND SafetyStockLevel >100 THEN 1 
     ELSE 0 
  END Constraint1  
, CASE  
     WHEN Class = 'M' AND SafetyStockLevel =100 THEN 1 
     ELSE 0 
  END Constraint2 
, CASE  
     WHEN (Class = 'L' AND SafetyStockLevel >100) 
        OR (Class = 'M' AND SafetyStockLevel =100) THEN 1 
     ELSE 0 
END Constraint3 
FROM Production.Product  

      
  The constraints in which one of the members is NULL and the IS NULL or Null functions are not used, are incorrect evaluated, actually ignored, and in certain cases they might look even correct, though this depends also on the expressions used. For example the below query will return no records, while the next one will return some records. For more on the correct handling of NULLs see Null-ifying the world or similar posts on the web.

--misusing NULL-based constraints 
  SELECT * 
FROM Production.Product 
WHERE (NOT 1=NULL) OR (1=NULL) 
 
JOIN constraints

    JOINs based on single JOIN constraints should in theory pose no problem unless the wrong attributes are used in the JOIN and that’s so easy to do especially when no documentation is available on the topic or it is incomplete. Functional or other technical specifications, physical or semantic models, metadata or other developers’ knowledge could help the developer to figure out about the relations between the various entities. On the other side it’s in human nature to make mistakes, forgetting to add one constraint, using the wrong type of join or the wrong attributes in the join constraint. In addition all the situations described above for WHERE clauses constraints apply also to join constraints, independently on whether the ANSI or non-ANSI syntax is used, the later based on writing the JOIN constraints in the WHERE clause, making the identification of the JOIN constraints not so easy to see, this being one of the reasons for which I recommend you to use ANSI syntax.

    A technique I use on a daily basis in order to test my scripts is to check the changes in the number of records with the adding of each new table to the initial query. In a first phase is enough to count the number of records from the table with the lowest level of detail, and the number of records from the final query, this ignoring the WHERE constraints. For example let’s consider the following query written some time ago (most of attributes were removed for convenience):

-- Purchase Orders 
 SELECT POD.PurchaseOrderDetailID 
, POD.PurchaseOrderID 
, POD.ProductID  
, POH.ShipMethodID 
, POH.VendorID 
FROM Purchasing.PurchaseOrderDetail POD 
     JOIN Purchasing.PurchaseOrderHeader POH 
       ON POD.PurchaseOrderID = POH.PurchaseOrderID 
           JOIN Purchasing.ShipMethod PSM 
              ON POH.ShipMethodID = PSM.ShipMethodID 
           JOIN Purchasing.Vendor SMF 
              ON POH.VendorID = SMF.VendorID  
     JOIN Production.Product ITM 
        ON POD.ProductID = ITM.ProductID 
WHERE POD.StockedQty >100 
     AND ITM.Class = 'L' 
 
    Usually I write the table with the lowest level of detail (in theory also with the highest number of records) first, adding the referenced table gradually, starting with the tables with the lowest number of records, thus the lookup tables will come first. This usually doesn’t affect the way the database engine processes the query (unless special techniques are used), and it allows to use kind of a systematic approach. This allows me also to test the query without making important changes. As I wrote above, the first test is done against the first table, for this commenting the other joins and eventually the WHERE constraints:

-- testing changes in the number of records - test 1 
SELECT count(1) NumberRecords 
FROM Purchasing.PurchaseOrderDetail POD 
/* JOIN Purchasing.PurchaseOrderHeader POH 
     ON POD.PurchaseOrderID = POH.PurchaseOrderID 
          JOIN Purchasing.ShipMethod PSM 
             ON POH.ShipMethodID = PSM.ShipMethodID 
          JOIN Purchasing.Vendor SMF 
             ON POH.VendorID = SMF.VendorID 
      JOIN Production.Product ITM 
         ON POD.ProductID = ITM.ProductID 
WHERE POD.StockedQty >100 
AND ITM.Class = 'L' */ 
       
    In the second step is enough to check the number of records returned by the whole query without the WHERE constraints:

-- testing changes in the number of records - test 2 
SELECT count(1) NumberRecords 
FROM Purchasing.PurchaseOrderDetail POD  
     JOIN Purchasing.PurchaseOrderHeader POH 
        ON POD.PurchaseOrderID = POH.PurchaseOrderID 
            JOIN Purchasing.ShipMethod PSM 
               ON POH.ShipMethodID = PSM.ShipMethodID 
            JOIN Purchasing.Vendor SMF 
               ON POH.VendorID = SMF.VendorID  
      JOIN Production.Product ITM 
          ON POD.ProductID = ITM.ProductID 
/* WHERE POD.StockedQty >100 
     AND ITM.Class = 'L' */       

      
    If the number of records is correct then most probably the query is correct, this because there could be incorrect LEFT JOINS that are not reflected in the number of records but in the fact that the corresponding attributes are missing. So don’t rely entirely on this type of test, but take 1-2 examples for which you are sure that records must be retrieved from the LEFT JOIN, and check whether the expected values are shown. Eventually, only for testing purposes, the LEFT JOIN could be modified as a FULL JOIN to see what records are returned.  

    Now if there are changes in the number of records from one test to the other, then take the first query and move the comment one join further (see below), and repeat the step until the JOIN that causes the variances is identified. 

-- testing changes in the number of records - test 3  
SELECT count(1) NumberRecords 
FROM Purchasing.PurchaseOrderDetail POD 
JOIN Purchasing.PurchaseOrderHeader POH 
ON POD.PurchaseOrderID = POH.PurchaseOrderID 
/*      JOIN Purchasing.ShipMethod PSM 
           ON POH.ShipMethodID = PSM.ShipMethodID 
         JOIN Purchasing.Vendor SMF 
            ON POH.VendorID = SMF.VendorID 
     JOIN Production.Product ITM 
        ON POD.ProductID = ITM.ProductID 
WHERE POD.StockedQty >100       
AND ITM.Class = 'L' */          

 Once the join that causes the variances is found, then might be requested then to add a new join constraint or use a LEFT JOIN instead of a FULL JOIN. For example if in the above query there are ShipMethodID with NULL values, then the query should have been written using a LEFT JOIN. On the other side, if duplicates are found, then do a grouping by using the primary key of the table with the lowest level of detail, in this case the Purchasing.PurchaseOrderDetail.  

HAVING clause constraints

    HAVING clause constraints behave more like WHERE clause constraints, the difference residing in the different levels where the constraints are applied, so the same technique could be applied in this case too, with the difference that the constraints need to be accommodated to support aggregate functions.  

--Vendors with total purchase value >100000 
SELECT POH.VendorID 
, SUM(POH.SubTotal) TotalSubTotal 
FROM Purchasing.PurchaseOrderHeader POH 
WHERE POH.Status IN (2,4) -- 2-Approved, 4-Complete  
GROUP BY POH.VendorID 
HAVING SUM(POH.SubTotal)>100000 

    Sometimes it helps to bring the HAVING constraints into the SELECT, though that offers limited functionality, therefore it’s more productive maybe to re-link the aggregated query back to the base table(s) on which the statement is based:

--Vendors with total purchase value >100000 - base records 
SELECT A.* 
, B.TotalSubTotal 
, CASE 
     WHEN B.VendorID IS NOT NULL THEN 1 
     ELSE 0 
END ConsideredFlag 
FROM Purchasing.PurchaseOrderHeader A 
LEFT JOIN ( --aggregated data 
      SELECT POH.VendorID 
      , SUM(POH.SubTotal) TotalSubTotal 
      FROM Purchasing.PurchaseOrderHeader POH 
      WHERE POH.Status IN (2,4) -- 2-Approved, 4-Complete  
      GROUP BY POH.VendorID 
HAVING SUM(POH.SubTotal)>100000 
) B 
     ON A.VendorID = B.VendorID 
WHERE A.Status IN (2,4) -- 2-Approved, 4-Complete  

  
 
  Also this technique could prove to be limited, though I found it useful in several cases, especially when a complicated formula is included in the aggregated function. The same query could be rewritten with a window aggregate function introduced with SQL Server 2005.

--Vendors with total purchase value >100000 - window aggregate function 
SELECT POH.* 
, SUM(POH.SubTotal) OVER (PARTITION BY POH.VendorID) TotalSubTotal 
, CASE 
     WHEN SUM(POH.SubTotal) OVER (PARTITION BY POH.VendorID)>100000 THEN 1 
     ELSE 0 
END ConsideredFlag 
FROM Purchasing.PurchaseOrderHeader POH 
WHERE POH.Status IN (2,4) -- 2-Approved, 4-Complete 
 
    When working with amounts, quantities or other number values, it could be useful to check the total amount/quantity based on the query logic against the total amount/quantity based on the base table.

06 March 2010

💎SQL Reloaded: Oracle vs. SQL Server (Running Totals)

    One of the SQL Server 2005+ functionalities I kind of ignored in the past years is the CROSS APPLY operator that allows returning rows from a table-valued function of a correlated query, allowing for example to calculate running totals, cumulated amounts from the previous records and the current record in a data set. Before the introduction of CROSS APPLY operator such a query could be written as follows:
-- SQL Server/Oracle: Running Totals (simple join)
SELECT POH.PurchaseOrderID
,
POH.VendorID

,
POH.OrderDate

,
POH.SubTotal

,
POH.TaxAmt

,
POH.Freight

,
SUM(POT.SubTotal) RunningSubTotal

,
SUM(POT.TaxAmt) RunningTaxAmt

,
SUM(POT.Freight) RunningFreight

FROM
Purchasing.PurchaseOrderHeader POH
    LEFT JOIN Purchasing.PurchaseOrderHeader POT
        ON POH.PurchaseOrderID >= POT.PurchaseOrderID

GROUP
BY POH.PurchaseOrderID

,
POH.VendorID

,
POH.OrderDate

,
POH.SubTotal

,
POH.TaxAmt

,
POH.Freight

ORDER
BY POH.PurchaseOrderID


   Each of the attributes not aggregated needs to be added in the GROUP BY clause, fact that could be quite a hit from a performance standpoint. If only one aggregated value is needed then the query could be rewritten, the cumulated value being calculated in a correlated query:
-- SQL Server/Oracle: Running Totals (correlated query)
SELECT POH.PurchaseOrderID
,
POH.VendorID

,
POH.OrderDate

,
POH.SubTotal

,
POH.TaxAmt

,
POH.Freight

,
( -- calculating aggregated running total

    SELECT SUM(POT.SubTotal)
    FROM Purchasing.PurchaseOrderHeader POT
    WHERE POH.PurchaseOrderID >= POT.PurchaseOrderID
) RunningSubTotal

FROM
Purchasing.PurchaseOrderHeader POH


   In theory could be added a correlated query for each aggregated value, though this means an additional table scan for each attribute, not the best approach for sure. The CROSS APPLY operator simplifies such queries allowing to do multiple calculations within the same correlated query:


-- SQL Server 2005: Running Totals (CROSS APPLY)
SELECT POH.PurchaseOrderID
,
POH.VendorID

,
POH.OrderDate

,
POH.SubTotal

,
POH.TaxAmt

,
POH.Freight

,
DAT.RunningSubTotal

,
DAT.RunningTaxAmt

,
DAT.RunningFreight

FROM
Purchasing.PurchaseOrderHeader POH
    CROSS APPLY (-- calculating aggregated running total

        SELECT SUM(POT.SubTotal) RunningSubTotal
        , SUM(POT.TaxAmt) RunningTaxAmt
        , SUM(POT.Freight) RunningFreight
        FROM Purchasing.PurchaseOrderHeader POT
        WHERE POH.PurchaseOrderID >= POT.PurchaseOrderID) DAT

ORDER
BY POH.PurchaseOrderID


     The advantage of  the first two approaches is that they work also in Oracle, though also Oracle comes with an alternative using analytic aggregate functions:
-- Oracle: Running Totals (analytic aggregate function)
SELECT POH.PurchaseOrderID
,
POH.VendorID

,
POH.OrderDate

,
POH.SubTotal

,
POH.TaxAmt

,
POH.Freight

,
SUM(POH.SubTotal) OVER (ORDER BY POH.PurchaseOrderID ) RunningSubTotal

,
SUM(POH.TaxAmt) OVER (ORDER BY POH.PurchaseOrderID) RunningTaxAmt

,
SUM(POH.Freight) OVER (ORDER BY POH.PurchaseOrderID) RunningFreight

FROM
PurchaseOrderHeader POH

ORDER
BY POH.PurchaseOrderID


Notes:
1.  The calculation of running totals requires the existence of a unique identifier for each record that can be sorted, in this case being used the PurchaseOrderID attribute, though typically could be used the creation date (timestamp) when the record was modified, alone or in combination with other attributes; when used in combination then maybe it makes sense to add a running number in order to apply the above techniques.  In Oracle the use of analytic aggregate functions offers less headaches from this perspective, being enough to specify the unique combination of attributes in the ORDER BY clause in the body of the function.
2.  SQL Server 2005’s window aggregate functions, even if similar with Oracle’s analytic aggregate functions, unlike the window ranking functions they don’t allow the ORDER BY clause in the functions, therefore running totals can’t be modeled in SQL Server with the help of window aggregate functions.
3. In case the running totals are based on other table, then might be needed to use the OUTER APPLY operator, returning thus also the records for which not match is retrieved in the outer dataset.

   Sometimes might be requested to create running totals within a given partition, for example by Vendor, in this case all is needed to do is to add the VendorID in the WHERE or JOIN constraint, while in Oracle’s analytic functions case in the PARTITION BY clause. Here are the three queries modified:
-- SQL Server/Oracle: Running Totals per Vendor (simple join)
SELECT POH.PurchaseOrderID
,
POH.VendorID

,
POH.OrderDate

,
POH.SubTotal

,
POH.TaxAmt

,
POH.Freight

,
SUM(POT.SubTotal) RunningSubTotal

,
SUM(POT.TaxAmt) RunningTaxAmt

,
SUM(POT.Freight) RunningFreight

FROM
Purchasing.PurchaseOrderHeader POH
    LEFT JOIN Purchasing.PurchaseOrderHeader POT
        ON POH.VendorID = POT.VendorID
     AND POH.PurchaseOrderID >= POT.PurchaseOrderID

GROUP
BY POH.PurchaseOrderID

,
POH.VendorID

,
POH.OrderDate

,
POH.SubTotal

,
POH.TaxAmt

,
POH.Freight

ORDER
BY POH.VendorID

,
POH.PurchaseOrderID


-- SQL Server 2005: Running Totals per Vendor (CROSS APPLY)
SELECT POH.PurchaseOrderID
,
POH.VendorID

,
POH.OrderDate

,
POH.SubTotal

,
POH.TaxAmt

,
POH.Freight

,
DAT.RunningSubTotal

,
DAT.RunningTaxAmt

,
DAT.RunningFreight

FROM
Purchasing.PurchaseOrderHeader POH
     CROSS APPLY ( -- calculating aggregated running total

        SELECT SUM(POT.SubTotal) RunningSubTotal
        , SUM(POT.TaxAmt) RunningTaxAmt
        , SUM(POT.Freight) RunningFreight
        FROM Purchasing.PurchaseOrderHeader POT
        WHERE POH.VendorID = POT.VendorID
          AND POH.PurchaseOrderID >= POT.PurchaseOrderID) DAT

ORDER
BY POH.VendorID

,
POH.PurchaseOrderID


-- Oracle: Running Totals per Vendor (analytic aggregate function)
SELECT POH.PurchaseOrderID
,
POH.VendorID

,
POH.OrderDate

,
POH.SubTotal

,
POH.TaxAmt

,
POH.Freight

,
SUM(POH.SubTotal) OVER (PARTITION BY POH.VendorID ORDER BY POH.PurchaseOrderID ) RunningSubTotal

,
SUM(POH.TaxAmt) OVER (PARTITION BY POH.VendorID ORDER BY POH.PurchaseOrderID) RunningTaxAmt

,
SUM(POH.Freight) OVER (PARTITION BY POH.VendorID ORDER BY POH.PurchaseOrderID) RunningFreight

FROM
PurchaseOrderHeader POH

ORDER
BY POH.VendorID

,
POH.PurchaseOrderID
Related Posts Plugin for WordPress, Blogger...

About Me

My photo
Koeln, NRW, Germany
IT Professional with more than 24 years experience in IT in the area of full life-cycle of Web/Desktop/Database Applications Development, Software Engineering, Consultancy, Data Management, Data Quality, Data Migrations, Reporting, ERP implementations & support, Team/Project/IT Management, etc.