SQL Troubles: CROSS APPLY

Showing posts with label CROSS APPLY. Show all posts

17 May 2020

💎🏭SQL Reloaded: Query Patterns in SQL Server (Part III: Aggregate Functions)

As their names denote, aggregate functions are used to summarize information across the whole data set or based on several attributes. The following queries exemplify the use of aggregates based on the tables created in a previous post:

-- aggreates over the whole table 
SELECT COUNT(*) NoRecords 
, MIN(StartDate) StartDate
, MAX(EndDate) EndDate
, DateDiff(d, MIN(StartDate), MAX(EndDate)) [DiffDays]
FROM dbo.T_Allocations A

-- aggregates based on several several attributes 
 SELECT A.CourseId
 , A.StudentId 
 , COUNT(*) NoRecords 
 , MIN(StartDate) StartDate
 , MAX(EndDate) EndDate
 , DateDiff(d, MIN(StartDate), MAX(EndDate)) [DiffDays]
 FROM dbo.T_Allocations A
GROUP BY A.CourseId
 , A.StudentId

Correlated subqueries can be of help either inside of a SELECT or an CROSS, respectively OUTER APPLY:

-- correlated subquery within a SELECT 
 SELECT *
 , (SELECT COUNT(*) NoRecords 
     FROM dbo.T_Allocations A
     WHERE C.CourseId = A.CourseId
     GROUP BY A.CourseId) NoRecords 
 FROM dbo.T_Courses C


 -- CROSS APPLY (correlated subquery)
 SELECT *
 , A.NoRecords
 , A.StartDate
 FROM dbo.T_Courses C
      CROSS APPLY (
         SELECT A.CourseId
         , COUNT(*) NoRecords 
         , MIN(StartDate) StartDate
         FROM dbo.T_Allocations A
         WHERE C.CourseId = A.CourseId
         GROUP BY A.CourseId
     ) A
 ORDER BY C.CourseName

-- OUTER APPLY (correlated subquery)
 SELECT *
 , A.NoRecords
 , A.StartDate
 FROM dbo.T_Courses C
      OUTER APPLY (
         SELECT A.CourseId
         , COUNT(*) NoRecords 
         , MIN(StartDate) StartDate
         FROM dbo.T_Allocations A
         WHERE C.CourseId = A.CourseId
         GROUP BY A.CourseId
     ) A
 ORDER BY C.CourseName

Aggregates are useful in a series of techniques like the listing of duplicates:

-- identifying duplicates via an aggregate 
SELECT A.*
FROM dbo.T_Allocations A
     JOIN (
  SELECT A.CourseId
  , A.StudentId 
  , COUNT(*) NoRecords 
  FROM dbo.T_Allocations A
  GROUP BY A.CourseId
  , A.StudentId 
  HAVING COUNT(*)>1
  ) DUP
   ON A.CourseId = DUP.CourseId 
  AND A.StudentId = DUP.Studentid
ORDER BY A.CourseId
, A.StudentId

A similar technique can be used to remove the duplicates from the base table:

-- removing the duplicates except the last record  
 DELETE dbo.T_Allocations 
 FROM (
 SELECT A.StudentId 
 , A.CourseId
 , A.StartDate
 , A.EndDate
 , Max(A.AllocationId) AllocationId
 FROM dbo.T_Allocations A
 GROUP BY A.StudentId 
 , A.CourseId
 , A.StartDate
 , A.EndDate
 HAVING count(*)>1
  ) A
WHERE dbo.T_Allocations.CourseId = A.CourseId
  AND dbo.T_Allocations.StudentId = A.StudentId
  AND dbo.T_Allocations.StartDate = A.StartDate
  AND dbo.T_Allocations.EndDate = A.EndDate
  AND dbo.T_Allocations.AllocationId <> A.AllocationId

When the number of values is fixed and unique within a grouping, one can use aggregate functions to display the values within a matrix:

-- Matrix display via aggregates 
SELECT S.StudentId 
, S.StudentName 
, CASE WHEN A.Course1 = 1 THEN 'x' ELSE '-' END Course1
, CASE WHEN A.Course2 = 1 THEN 'x' ELSE '-' END Course2
, CASE WHEN A.Course3 = 1 THEN 'x' ELSE '-' END Course3
, CASE WHEN A.Course4 = 1 THEN 'x' ELSE '-' END Course4
, CASE WHEN A.Course5 = 1 THEN 'x' ELSE '-' END Course5
FROM dbo.T_Students S
     LEFT JOIN (
  SELECT A.StudentId 
  , Max(CASE WHEN A.CourseId = 1 THEN 1 ELSE 0 END) Course1
  , Max(CASE WHEN A.CourseId = 2 THEN 1 ELSE 0 END) Course2
  , Max(CASE WHEN A.CourseId = 3 THEN 1 ELSE 0 END) Course3
  , Max(CASE WHEN A.CourseId = 4 THEN 1 ELSE 0 END) Course4
  , Max(CASE WHEN A.CourseId = 5 THEN 1 ELSE 0 END) Course5
  FROM dbo.T_Allocations A
  GROUP BY A.StudentId
    ) A
   ON S.StudentId = A.StudentId 
ORDER BY S.StudentName

Notes:

The queries work also in SQL databases in Microsoft Fabric.

Happy coding!

Previous Post <<||>> Next Post

18 October 2018

💎🏭SQL Reloaded: String_Split Function

Today, as I was playing with a data model – although simplistic, the simplicity kicked back when I had to deal with fields in which several values were encoded within the same column. The “challenge” resided in the fact that the respective attributes were quintessential in analyzing and matching the data with other datasets. Therefore, was needed a mix between flexibility and performance. It was the point where the String_Split did its magic. Introduced with SQL Server 2016 and available only under compatibility level 130 and above, the function splits a character expression using specified separator.
Here’s a simplified example of the code I had to write:

-- cleaning up
-- DROP TABLE dbo.ItemList

-- test data 
SELECT A.*
INTO dbo.ItemList 
FROM (
VALUES (1, '1001:a:blue')
, (2, '1001:b:white')
, (3, '1002:a:blue')
, (4, '1002:b:white')
, (5, '1002:c:red')
, (6, '1003:b:white')
, (7, '1003:c:red')) A(Id, List)

-- checking the data
SELECT *
FROM dbo.ItemList 

-- prepared data
SELECT ITM.Id 
, ITM.List 
, DAT.ItemId
, DAT.Size
, DAT.Country
FROM dbo.ItemList ITM
   LEFT JOIN (-- transformed data 
 SELECT DAT.id
 , [1] AS ItemId
 , [2] AS Size
 , [3] AS Country
 FROM(
  SELECT ITM.id
  , TX.Value
  , ROW_NUMBER() OVER (PARTITION BY ITM.id ORDER BY ITM.id) Ranking
  FROM dbo.ItemList ITM
  CROSS APPLY STRING_SPLIT(ITM.List, ':') TX
  ) DAT
 PIVOT (MAX(DAT.Value) FOR DAT.Ranking IN ([1],[2],[3])) 
 DAT
  ) DAT
   ON ITM.Id = DAT.Id

And, here’s the output:

For those dealing with former versions of SQL Server the functionality provided by the String_Split can be implemented with the help of user-defined functions, either by using the old fashioned loops (see this), cursors (see this) or more modern common table expressions (see this). In fact, these implementations are similar to the String_Split function, the difference being made mainly by the performance.

Notes:
The queries work also in SQL databases in Microsoft Fabric (see file in GitHub repository). You might want to use another schema (e.g. Test), not to interfere with the existing code.

Happy coding!

05 April 2010

💎SQL Reloaded: The Power of Joins VI (Not-Equal Joins)

SQL Server 2008 documentation mentions the not-equal joins without giving an explicit definition, the term appearing within the context of joining values in two columns that are not equal. On the other side the not-equal join denomination could be used to refer more generally to the joins whose join constraints involve comparison operators other than equal (“=”), for example <, >, <>, BETWEEN, IN, and its negation NOT IN, LIKE, etc. The difference between the two views resides on the fact that some of the operators (e.g. IN, BETWEEN) could imply the equality of values from joined columns. Another objection could be made on whether the not-equality join constraint could be based only on one of the table’s attributes participating in the join, the other thus missing, such constraints could be considered also as join constraints even if in some scenarios they could have been written in the main WHERE clause as well. For the sake of simplicity and flexibility I would consider a not-equal join or not-equality join as a join in which at least one of the join constraints is based on a operator other than the equal operator. Could be discussed thus about not-equal join constraints referring to the join constraints that involve other operators than equal, and equal join constraints, the two types of joins could coexist in the same join. A not-equal join constraint often involves the possibility of being returned more records then if a equal join constraint was used.

If we talk about join constraints then more likely we refer to vertical joins, though also anti-joins and semi-joins could include sporadically not-equality operators. In a normal join based on equalities for the set of attributes participating in the join from the left table are matched all the records from the right table, even if this means retrieving more than one record. Same happens also for not-equal constraints even if the business scenarios in which such needs arises are not so intuitive. One relatively simple example I often met is the case when is needed to calculate the incremental aggregation of values, for example to calculate the incremental volume of PO placed over time:

SELECT DII.DateSequence 
, IsNull(POH.SubTotal, 0) SubTotal 
FROM dbo.fGetDatesInInterval('2002-01-01', '2002-01-31') DII 
    CROSS APPLY (--incremental PO Volume 
     SELECT SUM(POH.SubTotal) SubTotal 
    FROM Purchasing.PurchaseOrderHeader POH 
   WHERE POH.Status IN (2, 4) -- 2-Approved, 4-Complete  
       AND DATEDIFF(d, POH.OrderDate, DII.DateSequence)&gt;=0 --not-equal constraint 
    ) POH

In order to simplify the query I used the dbo.fGetDatesInInterval table-valued function that returns a sequence of dates within a given time interval, and a CROSS APPLY returning the aggregated value for each date returned by the function, for the calculations being considered all the records having an Order Date smaller or equal with the given (sequence) date. As can be seen from the above query the not-equal constraint is based on DateDiff function, instead we could have written POH.OrderDate<=DII.DateSequence, one of the reasons for its use is the fact that the constraint could be easily modified to show for example only the aggregated values for the past 90 days, the BETWEEN operator entering in scene:

SELECT DII.DateSequence 
, IsNull(POH.SubTotal, 0) SubTotal 
FROM dbo.fGetDatesInInterval('2002-01-01', '2002-01-31') DII 
     CROSS APPLY (--incremental PO Volume 
    SELECT SUM(POH.SubTotal) SubTotal 
    FROM Purchasing.PurchaseOrderHeader POH 
    WHERE POH.Status IN (2, 4) -- 2-Approved, 4-Complete  
        AND DATEDIFF(d, POH.OrderDate, DII.DateSequence) BETWEEN 0 AND 90--not-equal constraint 
    ) POH

The not-equal constraint from the query is part of interval-based constraints category which covers the scenarios in which one of the attributes from the tables participating in a join is checked whether it falls within a given interval, typically that being valid for numeric values as above or date value, as in the next query that shows the Last Standard Cost Details per Product.

-- Products &amp; Last Standard Cost details SELECT ITM.ProductID  
, ITM.ProductNumber  
, ITM.StandardCost 
, PCH.StartDate 
, PCH.EndDate 
FROM Production.Product ITM 
    JOIN Production.ProductCostHistory PCH 
     ON ITM.ProductID = PCH.ProductID AND GETDATE() BETWEEN PCH.StartDate AND IsNull(PCH.EndDate, GETDATE())

    Such queries need often to be written when the records in a table have a given validity, no overlap existing between intervals.

    Using list of values within a join constraint and thus the IN or NOT IN operators occurs when for example the base table stores data coming from multiple sources and more encodings are used for the same source, each of them caring its meaning. For example using the Production.TransactionHistory and several other tables coming with AdventureWorks database, and supposing that for Purchase Orders are two transaction types P and B, and for Sales Orders S and I, a general scope query on Transactions could be written as follows:

-- Transaction History 

SELECT PTH.TransactionID 
, PTH.ProductID 
, PTH.ReferenceOrderID 
, CASE PTH.TransactionType 
     WHEN 'W' THEN 'Work Order' 
    WHEN 'S' THEN 'Sales Order' 
    WHEN 'I' THEN 'Internal Sales Order' 
    WHEN 'P' THEN 'Purchase Order' 
    WHEN 'B' THEN 'Blanket Purchase Order' 
    ELSE 'n.a' 
  END TransactionType , CASE  
    WHEN PTH.TransactionType = 'W' THEN StartDate 
    WHEN PTH.TransactionType IN ('S', 'I') THEN SOH.OrderDate 
    WHEN PTH.TransactionType IN ('P', 'B') THEN POH.OrderDate 
  END ReferenceDate FROM Production.TransactionHistory PTH 
    LEFT JOIN Purchasing.PurchaseOrderHeader POH       ON PTH.ReferenceOrderID = POH.PurchaseOrderID   AND PTH.TransactionType IN ('P', 'B') --not-equal constraint 
   LEFT JOIN Sales.SalesOrderHeader SOH 
     ON PTH.ReferenceOrderID = SOH.SalesOrderID 
  AND PTH.TransactionType IN ('S', 'I')  --not-equal constraint 
    LEFT JOIN Production.WorkOrder PWO 
      ON PTH.ReferenceOrderID = PWO.WorkOrderID 
   AND PTH.TransactionType = 'W'

    The need for writing such a query is seldom and could have been written as 3 distinct queries whose results are joined with two unions, both approaches having their pluses and minuses.

   In one of the above queries was used a constraint based on DateDiff function, and in theory any other function could be used in a join constraint, including user-defined functions. Such function-based join constraints are handy but should be used with caution because they could impact query’s performance. Pattern-based join constraints used with LIKE operator could be used as well.

    There are cases in which the constraints that typically would be included in the WHERE clause are added in the body of the join, apparently without any good reason. Such based denatured join constraints are base only on the attributes of one of the tables involved in the join, like in the next query:

-- PO for Products with StandardCost&gt;10 SELECT POD.PurchaseOrderDetailID  
, POD.ProductID  
, ITM.ProductNumber  
, ITM.StandardCost 
, POD.UnitPrice 
, POD.OrderQty 
FROM Purchasing.PurchaseOrderDetail POD  
    JOIN Production.Product ITM       ON POD.ProductID = ITM.ProductID 
  AND ITM.StandardCost&gt;10  -- denaturated join constraint

03 April 2010

💎SQL Reloaded: Power of Joins V (Semi-Joins and Anti-Joins)

An incursion in the world of joins is incomplete without approaching two other important techniques for writing queries: semi-joins and anti-joins, some of the techniques specific to them allowing to speed up queries and reduce queries’ complexity. An anti-join returns rows from the left table where no matches is found in the right table, while an semi-join returns returns rows from the left table if at least a match is found in the second table.

The typical techniques for creating semi-joins and anti-joins are based on IN or EXISTS operators, respectively their negations the NOT EXISTS and NOT IN, both operators being available in SQL Server and Oracle. According to the above definitions also the EXCEPT and INTERSECT operators introduced in horizontal joins post could be used to create anti-joins, respectively semi-joins too, they being introduced by Microsoft starting with SQL Server 2005 in order to compare whole rows rather then a smaller set of attributes. Another operators that allows creating anti-joins and semi-joins and introduced with SQL Server 2005 is the CROSS APPLY and OUTER APPLY operators.

Because the EXCEPT and INTERSECT operators were introduced in a previous post, this post is focused on the other mentioned operators, for exemplification of the techniques following to use again the Production.Product and Purchasing.PurchaseOrderDetail tables from AdventureWorks database.

IN operator allows to test whether a given (attribute) value is in a list of values, the respective list could be based also on the values returned by another query, called actually a subquery. The output is not influenced by the eventual duplicates in the list of values, though for the sake of simplicity and testability it makes sense to provide a list of distinct values.

EXISTS operator provides a similar functionality like the IN operator, however it checks only whether a record exists in what is called a correlated suquery (also known as repeated subquery, is a subquery making direct reference to one or more of the attributes of the main query). The subquery used with an IN operator could be easily translated to a correlated subquery by bringing the attribute used to create the list of values into the WHERE clause.

Let’s start with a simple problem to exemplify the use of a semi-join, for example to select only the products that have a PO (Purchase Order) placed, the query could be written as follows:

-- Products with PO placed (IN operator) 
SELECT ITM.ProductID  
, ITM.ProductNumber  
, ITM.StandardCost 
FROM Production.Product ITM  
WHERE ITM.ProductID IN (--list of Product IDs with PO placed 
    SELECT DISTINCT POD.ProductID  
     FROM Purchasing.PurchaseOrderDetail POD  
       JOIN Purchasing.PurchaseOrderHeader POH  
          ON POD.PurchaseOrderID = POH.PurchaseOrderID 
      WHERE POH.Status IN (2, 4) -- 2-Approved, 4-Complete  
     )

The above query could be rewritten using the EXISTS operator, the correlated query being based on ProductID attribute found in both tables:

-- Products with PO placed (EXISTS operator) 
SELECT ITM.ProductID  
, ITM.ProductNumber  
, ITM.StandardCost 
FROM Production.Product ITM  
WHERE EXISTS (--list of Product IDs with PO placed 
     SELECT 1 
     FROM Purchasing.PurchaseOrderDetail POD  
       JOIN Purchasing.PurchaseOrderHeader POH  
          ON POD.PurchaseOrderID = POH.PurchaseOrderID 
     WHERE POH.Status IN (2, 4) -- 2-Approved, 4-Complete  
         AND POD.ProductID = ITM.ProductID -- correlated constraint 
)

Notes:
1. As usual special attention must be given to the handling of NULL values either by replacing them with a default value or by handing the NULL case in the WHERE clause.
2. The Execution Plan for the above two queries is the same, SQL Server using a Hash Match (Left Semi Join) for both operators. I was trying to find information on whether there are any recommendations that advices on the use of one of the two operators in the detriment of the other, however I found nothing in this direction for SQL Server. R. Schrag offers a rule of dumb rooted in Oracle documentation and based on the selectivity of a query, a query being highly selective if it returns the least number of rows:

"If the main body of your query is highly selective, then an EXISTS clause might be more appropriate to semi-join to the target table. However, if the main body of your query is not so selective and the subquery (the target of the semi-join) is more selective, then an IN clause might be more appropriate." [1]

Both above solutions are ideal when no data are needed from the tables used in the subquery, however quite often such a need arises, for example to show the volume of PO placed, the last PO Date, and so on, in such cases an inner join could be used instead:

-- Products with PO placed (inner join) SELECT ITM.ProductID  
, ITM.ProductNumber  
, ITM.StandardCost 
, POD.TotalOrderQty 
, POD.LastOrderDate 
FROM Production.Product ITM  
JOIN (--aggregated PO details 
    SELECT POD.ProductID  
    , SUM(POD.OrderQty) TotalOrderQty 
    , Max(OrderDate) LastOrderDate 
     FROM Purchasing.PurchaseOrderDetail POD  
       JOIN Purchasing.PurchaseOrderHeader POH  
          ON POD.PurchaseOrderID = POH.PurchaseOrderID 
     WHERE POH.Status IN (2, 4) -- 2-Approved, 4-Complete  
     GROUP BY POD.ProductID 
    ) POD 
    ON POD.ProductID = ITM.ProductID

For the same purpose could be used also a CROSS APPLY taking care to eliminate the Products for which no record is retrieved in the subquery, this approach is actually a mixture between an inner join and a correlated query:

 -- Products with PO placed (CROSS APPLY) 
SELECT ITM.ProductID  
, ITM.ProductNumber  
, ITM.StandardCost  
, POD.TotalOrderQty  
, POD.LastOrderDate  
FROM Production.Product ITM  
     CROSS APPLY (--aggregated PO details  
     SELECT SUM(POD.OrderQty) TotalOrderQty  
     , Max(OrderDate) LastOrderDate  
     FROM Purchasing.PurchaseOrderDetail POD  
           JOIN Purchasing.PurchaseOrderHeader POH  
              ON POD.PurchaseOrderID = POH.PurchaseOrderID  
     WHERE POH.Status IN (2, 4) -- 2-Approved, 4-Complete  
        AND POD.ProductID = ITM.ProductID  
    ) POD  
WHERE POD.TotalOrderQty IS NOT NULL

Notes:
1. The IN and EXISTS query versions seems to provide better performance than the use of vertical joins, though from my experience is often needed to provide information also from the subqueries used with the respective operators, therefore the inner join provides more flexibility in the detriment of relatively poorer performance.
2. Microsoft’s implementation for IN operator allows using only single list of values, in contrast with Oracle that allows working with lists of n-uples. This problem in theory could be solved using the IN operators with a correlated subquery. An example of such a problem is finding the last PO placed for each Product:

 -- Last PO placed per Product (IN + correlated query) 
SELECT POD.PurchaseOrderDetailID 
, POD.ProductID  
, POH.OrderDate 
, POD.OrderQty 
FROM Purchasing.PurchaseOrderDetail POD  
     JOIN Purchasing.PurchaseOrderHeader POH         ON POD.PurchaseOrderID = POH.PurchaseOrderID 
WHERE POH.Status IN (2, 4) -- 2-Approved, 4-Complete  
   AND POH.OrderDate IN (--Last PO Date per Product 
      SELECT Max(POH1.OrderDate) LastOrderDate 
      FROM Purchasing.PurchaseOrderDetail POD1 
          JOIN Purchasing.PurchaseOrderHeader POH1 
             ON POD1.PurchaseOrderID = POH1.PurchaseOrderID 
    WHERE POH1.Status IN (2, 4) -- 2-Approved, 4-Complete  
AND POD1.ProductID = POD.ProductID -- correlated constraint 
)

This might not be the best approach to solve this problem though the query is intended only to show a technique. The query could be rewritten using an inner join or the EXISTS operator, but I will show only the later approach because it comes with an interesting technique – because the LastOrderDate and ProductID are used as pair of values in order to retrieve the last PO per Product, given the fact that LastOrderDate is an aggregate value, the correlated constraint needs to be added in HAVING clause:

-- Last PO placed per Product (correlated query) 
SELECT POD.PurchaseOrderDetailID 
, POD.ProductID  
, POH.OrderDate 
, POD.OrderQty 
FROM Purchasing.PurchaseOrderDetail POD  
     JOIN Purchasing.PurchaseOrderHeader POH 
        ON POD.PurchaseOrderID = POH.PurchaseOrderID 
WHERE POH.Status IN (2, 4) -- 2-Approved, 4-Complete  
     AND EXISTS(--Last PO Date per Product 
     SELECT 1 
     FROM Purchasing.PurchaseOrderDetail POD1 
          JOIN Purchasing.PurchaseOrderHeader POH1 
             ON POD1.PurchaseOrderID = POH1.PurchaseOrderID 
     WHERE POH1.Status IN (2, 4) -- 2-Approved, 4-Complete  
          AND POD1.ProductID = POD.ProductID -- correlated constraint 
     GROUP BY POD1.ProductID 
     HAVING Max(POH1.OrderDate) = POH.OrderDate 
     )

The NOT IN and NOT EXISTS versions of the operators simply negates the use of IN and EXISTS operators, showing thus the records for which a match is not found. The complementary problem of showing the Products with a PO placed is to show the Products without a PO placed, for this, as previously mentioned, being enough to add the NOT keyword in from of IN, respectively EXISTS in the first two queries from this post:

-- Products without PO placed (IN operator) 
SELECT ITM.ProductID  
, ITM.ProductNumber  
, ITM.StandardCost 
FROM Production.Product ITM  
WHERE ITM.ProductID NOT IN (--list of Product IDs with PO placed 
    SELECT DISTINCT POD.ProductID 
     FROM Purchasing.PurchaseOrderDetail POD 
       JOIN Purchasing.PurchaseOrderHeader POH 
          ON POD.PurchaseOrderID = POH.PurchaseOrderID 
      WHERE POH.Status IN (2, 4) -- 2-Approved, 4-Complete  
     )  

-- Products without PO placed (EXISTS operator) 
SELECT ITM.ProductID  
, ITM.ProductNumber  
, ITM.StandardCost 
FROM Production.Product ITM  
WHERE NOT EXISTS (--list of Product IDs with PO placed 
     SELECT 1 
     FROM Purchasing.PurchaseOrderDetail POD 
       JOIN Purchasing.PurchaseOrderHeader POH 
          ON POD.PurchaseOrderID = POH.PurchaseOrderID 
     WHERE POH.Status IN (2, 4) -- 2-Approved, 4-Complete  
         AND POD.ProductID = ITM.ProductID -- correlated constraint 
)

An alternative to NOT IN and NOT EXISTS is the use of a LEFT JOIN transposing the project of A\B problem:

  -- Products without PO placed (left join) 
SELECT ITM.ProductID  
, ITM.ProductNumber  
, ITM.StandardCost FROM Production.Product ITM  
    LEFT JOIN (--aggregated PO details 
    SELECT POD.ProductID  
    FROM Purchasing.PurchaseOrderDetail POD 
       JOIN Purchasing.PurchaseOrderHeader POH 
          ON POD.PurchaseOrderID = POH.PurchaseOrderID 
     WHERE POH.Status IN (2, 4) -- 2-Approved, 4-Complete  
     GROUP BY POD.ProductID 
    ) POD 
    ON POD.ProductID = ITM.ProductID   
WHERE POD.ProductID IS NULL

The CROSS APPLY could be used too, this time the WHERE clause from the main query being modified to show only the records without a match in CROSS APPLY, for this needing to be used a "disguised" aggregate too:

 -- Products without PO placed (CROSS APPLY) 
SELECT ITM.ProductID  
, ITM.ProductNumber  
, ITM.StandardCost  FROM Production.Product ITM  
    CROSS APPLY (-- PO details  
    SELECT Max(1) HasPOPlaced 
    FROM Purchasing.PurchaseOrderDetail POD  
          JOIN Purchasing.PurchaseOrderHeader POH  
             ON POD.PurchaseOrderID = POH.PurchaseOrderID  
    WHERE POH.Status IN (2, 4) -- 2-Approved, 4-Complete  
        AND POD.ProductID = ITM.ProductID  
    ) POD 
WHERE POD.HasPOPlaced IS NULL

As can be seen there are multiple alternative ways of solving the same problem, none of the above described patters could be considered as best approach because in the end must be weighted between performance, flexibility, reusability and complexity. In many cases the performance of a query is the most important factor, though the performance of any of the above techniques is relative because it needs to be considered factors like table structures and relations’ cardinality, server’s configuration and available resources, etc.

References:
[1] Schrag R. (2005). Speeding Up Queries with Semi-Joins and Anti-Joins: How Oracle Evaluates EXISTS, NOT EXISTS, IN, and NOT IN. [link]

11 March 2010

💎SQL Reloaded: Oracle vs. SQL Server: Averages I

For many people working with averages is not a complicated topic at all, and that’s quite true, though as always it depends on how much you want to complicate everything. The average or the arithmetic mean is defined as the middle value of the values within a data set calculated as the sum of the values divided by the number of values:

RDBMS vendors provide similar functionality within the AVG aggregate function, and even if in theory the result could be calculated using the SUM and COUNT aggregate functions, as per the above definition, given its importance in statistics, to provide an AVG function seems to be a natural choice. AVG could be used across the whole data set or within a given “partition” determined by the GROUP BY clause.

-- Oracle/SQL Server: Simple Averages 
SELECT ProductID 
, AVG(UnitPrice) AverageUnitPrice 
, SUM(UnitPrice)/COUNT(1) AverageUnitPriceDef 
FROM Purchasing.PurchaseOrderDetail 
GROUP BY ProductID

For most of the data analysis requirements the simple average function would be enough, though there are cases in which is required to address the sensitivity of values to certain modifiers – for example quantities or the number of lines. For such cases is used the weighted average (weighted arithmetic mean) calculated as the sum of values and quantifiers products, the total being divided by the sum of quantifiers:

Note:
In case the quantifiers are all 1 then the weighted average equates with the simple average, the output being the same also when the values are constant.

The weighted average is quite a useful tool when performing PO Price Analysis in which often needs to be be considered also the volume of purchased quantities with a given Price Unit:

-- Oracle/SQL Server: Simple/Weighted Averages 
SELECT ProductID 
, AVG(UnitPrice) AverageUnitPrice 
, SUM(UnitPrice*OrderQty)/SUM(OrderQty) WeightedAverageUnitPrice 
FROM Purchasing.PurchaseOrderDetail 
GROUP BY ProductID

The actual implementation in a query might differ based on requirements and preferences – as multiple left joins, inline view or correlated query, each of the mentioned approaches coming with their downsides and strengths:

-- Oracle/SQL Server: Averages (inline view) 
SELECT ITM.ProductID 
, ITM.Name ProductName 
, ITM.ProductNumber 
, DAT.AverageUnitPrice 
, DAT.WeightedAverageUnitPrice 
FROM Production.Product ITM 
LEFT JOIN ( --calculated averages 
    SELECT POD.ProductID 
    , AVG(POD.UnitPrice) AverageUnitPrice 
    , SUM(POD.UnitPrice*OrderQty)/SUM(POD.OrderQty) WeightedAverageUnitPrice 
    FROM Purchasing.PurchaseOrderDetail POD 
       JOIN Purchasing.PurchaseOrderHeader POH 
          ON POD.PurchaseOrderID = POH.PurchaseOrderID 
    WHERE POH.Status IN (2, 4) -- 2-Approved, 4-Complete 
   GROUP BY POD.ProductID)DAT 
    ON ITM.ProductID = DAT.ProductID 
ORDER BY ITM.Name 

-- Oracle/SQL Server: Averages (multiple left joins) 
SELECT ITM.ProductID 
, ITM.Name ProductName 
, ITM.ProductNumber 
, AVG(POD.UnitPrice) AverageUnitPrice 
, SUM(POD.UnitPrice*OrderQty)/SUM(POD.OrderQty) WeightedAverageUnitPrice 
FROM Production.Product ITM 
   LEFT JOIN Purchasing.PurchaseOrderDetail POD 
      ON ITM.ProductID = POD.ProductID 
   LEFT JOIN Purchasing.PurchaseOrderHeader POH 
      ON POD.PurchaseOrderID = POH.PurchaseOrderID  
WHERE POH.Status IN (2, 4) -- 2-Approved, 4-Complete 
GROUP BY ITM.ProductID 
,ITM.Name 
,ITM.ProductNumber 
ORDER BY ITM.Name  

-- SQL Server 2005: Averages (OUTER APPLY) 
SELECT ITM.ProductID 
, ITM.Name ProductName 
, ITM.ProductNumber 
, DAT.AverageUnitPrice 
, DAT.WeightedAverageUnitPrice 
FROM Production.Product ITM 
OUTER APPLY ( -- calculated averages 
    SELECT AVG(POD.UnitPrice) AverageUnitPrice 
    , SUM(POD.UnitPrice*OrderQty)/SUM(POD.OrderQty) WeightedAverageUnitPrice 
    FROM Purchasing.PurchaseOrderDetail POD  
       JOIN Purchasing.PurchaseOrderHeader POH 
          ON POD.PurchaseOrderID = POH.PurchaseOrderID  
    WHERE ITM.ProductID = POD.ProductID  
        AND POH.Status IN (2, 4) -- 2-Approved, 4-Complete 
) DAT 
ORDER BY ITM.Name

A different type of approach could be focused on calculating the simple or weighted average based only on the last n PO Lines, for example the last 3 PO Lines:

-- Oracle/SQL Server 2005: last 3 PO average prices 
SELECT DAT.ProductID 
, MAX(DAT.RANKING) NumberRecords 
, AVG(DAT.UnitPrice) AveragePrice 
, SUM(DAT.UnitPrice* DAT.OrderQty)/SUM(DAT.OrderQty) WeightedAveragePrice 
FROM (-- ranked PO Prices 
    SELECT POD.ProductID 
    , POD.UnitPrice 
    , POD.OrderQty 
    , RANK() OVER(PARTITION BY POD.ProductID ORDER BY POH.OrderDate DESC,   POD.PurchaseOrderDetailID DESC) RANKING 
    FROM Purchasing.PurchaseOrderDetail POD 
       JOIN Purchasing.PurchaseOrderHeader POH 
          ON POD.PurchaseOrderID = POH.PurchaseOrderID 
    WHERE POH.Status IN (2, 4) -- 2-Approved, 4-Complete 
) DAT 
WHERE DAT.RANKING BETWEEN 1 AND 3 
GROUP BY DAT.ProductID

The problem could be complicated even more by selecting all the POs corresponding to the last 3 distinct Prices used; not sure if it brings any value but the formulation is perfectly valid and don’t despair if a user comes with such a request.

Occasionally it might be needed to study the evolution of prices over time, the previous problem becoming a simple moving average or weighted moving average problem, in which the average is calculated in report with a moving point on the time scale with respect to the previous k points or concerning the previous and following k points including the current point:

G. Priester shows in his Calculating Moving Averages with T-SQL article a nice solution for simple moving averages calculation exemplifying it on stock market data. Frankly, given its flexibility, I prefer to use the CROSS APPLY operator in combination with fGetDatesInInterval function introduced in More Date-related Functionality – Part I post, the function returning all the dates falling in a certain interval. Because in case of POs there could be multiple transactions per day for the same Product, the problem is slightly changed, the averages being based within a 31 window:

-- SQL Server 2005: 31 days moving averages 
SELECT ITM.ProductNumber 
, ITM.Name ProductName 
, DIN.DateSequence ReportingDate 
, DAT.AverageUnitPrice 
, DAT.WeightedAverageUnitPrice 
FROM dbo.fGetDatesInInterval('2003-01-01', '2003-12-31') DIN 
CROSS APPLY (--calculated averages 
    SELECT POD.ProductID 
    , AVG(POD.UnitPrice) AverageUnitPrice 
    , SUM(POD.UnitPrice*OrderQty)/SUM(POD.OrderQty) WeightedAverageUnitPrice 
    FROM Purchasing.PurchaseOrderDetail POD 
       JOIN Purchasing.PurchaseOrderHeader POH 
          ON POD.PurchaseOrderID = POH.PurchaseOrderID 
     WHERE DATEDIFF(d, POH.OrderDate, DIN.DateSequence) BETWEEN 0 AND 30 -- 31 days 
     --WHERE DATEDIFF(d, POH.OrderDate, DIN.DateSequence) BETWEEN -15 AND 15 -- 31 days 
         AND POH.Status IN (2, 4) -- 2-Approved, 4-Complete 
GROUP BY POD.ProductID) DAT 
LEFT JOIN Production.Product ITM 
   ON DAT.ProductID = ITM.ProductID 
WHERE DATEPART(dw, DIN.DateSequence) BETWEEN 2 AND 6 -- Working Days 
ORDER BY ITM.Name 
, DIN.DateSequence

SQL Troubles

Pages