One of the SQL Server 2005+ features I somewhat ignored over the past years is the CROSS APPLY operator, which allows returning rows from a table-valued function or a correlated query. Among other things, it makes it possible to calculate running totals – amounts cumulated from the previous records and the current record in a data set. Before the introduction of the CROSS APPLY operator such a query could be written as follows:
-- SQL Server/Oracle: Running Totals (simple join)
SELECT POH.PurchaseOrderID
, POH.VendorID
, POH.OrderDate
, POH.SubTotal
, POH.TaxAmt
, POH.Freight
, SUM(POT.SubTotal) RunningSubTotal
, SUM(POT.TaxAmt) RunningTaxAmt
, SUM(POT.Freight) RunningFreight
FROM Purchasing.PurchaseOrderHeader POH
LEFT JOIN Purchasing.PurchaseOrderHeader POT
ON POH.PurchaseOrderID >= POT.PurchaseOrderID
GROUP BY POH.PurchaseOrderID
, POH.VendorID
, POH.OrderDate
, POH.SubTotal
, POH.TaxAmt
, POH.Freight
ORDER BY POH.PurchaseOrderID
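To make the triangular-join semantics concrete, here is a minimal Python sketch over hypothetical order data (not the AdventureWorks tables): for each row it sums over all rows with an ID less than or equal to the current one – exactly the `ON POH.PurchaseOrderID >= POT.PurchaseOrderID` condition above.

```python
# Hypothetical order data standing in for Purchasing.PurchaseOrderHeader.
orders = [
    {"id": 1, "subtotal": 100.0},
    {"id": 2, "subtotal": 250.0},
    {"id": 3, "subtotal": 50.0},
]

# For each row, sum over all rows with id <= current id
# (the self-join condition POH.PurchaseOrderID >= POT.PurchaseOrderID).
running = [
    {"id": o["id"],
     "running_subtotal": sum(p["subtotal"] for p in orders if p["id"] <= o["id"])}
    for o in orders
]
print(running)
```

Note the quadratic cost: every output row rescans the whole input, which is exactly why the triangular join gets expensive on large tables.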
Each attribute that is not aggregated needs to be added to the GROUP BY clause, which can be quite a hit from a performance standpoint. If only one aggregated value is needed, the query can be rewritten with the cumulated value calculated in a correlated subquery:
-- SQL Server/Oracle: Running Totals (correlated query)
SELECT POH.PurchaseOrderID
, POH.VendorID
, POH.OrderDate
, POH.SubTotal
, POH.TaxAmt
, POH.Freight
, ( -- calculating aggregated running total
SELECT SUM(POT.SubTotal)
FROM Purchasing.PurchaseOrderHeader POT
WHERE POH.PurchaseOrderID >= POT.PurchaseOrderID
) RunningSubTotal
FROM Purchasing.PurchaseOrderHeader POH
In theory a correlated subquery could be added for each aggregated value, though this means an additional table scan per attribute – certainly not the best approach. The CROSS APPLY operator simplifies such queries by allowing multiple calculations within the same correlated query:
-- SQL Server 2005: Running Totals (CROSS APPLY)
SELECT POH.PurchaseOrderID
, POH.VendorID
, POH.OrderDate
, POH.SubTotal
, POH.TaxAmt
, POH.Freight
, DAT.RunningSubTotal
, DAT.RunningTaxAmt
, DAT.RunningFreight
FROM Purchasing.PurchaseOrderHeader POH
CROSS APPLY (-- calculating aggregated running total
SELECT SUM(POT.SubTotal) RunningSubTotal
, SUM(POT.TaxAmt) RunningTaxAmt
, SUM(POT.Freight) RunningFreight
FROM Purchasing.PurchaseOrderHeader POT
WHERE POH.PurchaseOrderID >= POT.PurchaseOrderID) DAT
ORDER BY POH.PurchaseOrderID
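The gain of the CROSS APPLY version is that one correlated pass per outer row returns several aggregates at once. A Python sketch of that idea, again on hypothetical data:

```python
# Hypothetical order data; the function below plays the role of the DAT subquery.
orders = [
    {"id": 1, "subtotal": 100.0, "tax": 8.0, "freight": 5.0},
    {"id": 2, "subtotal": 250.0, "tax": 20.0, "freight": 12.0},
]

def running_aggregates(row, table):
    """One scan over the table computes all three running sums at once."""
    sub = tax = freight = 0.0
    for r in table:
        if r["id"] <= row["id"]:          # the correlated predicate
            sub += r["subtotal"]
            tax += r["tax"]
            freight += r["freight"]
    return {"running_subtotal": sub, "running_tax": tax, "running_freight": freight}

result = [{**o, **running_aggregates(o, orders)} for o in orders]
```

Contrast this with the correlated-subquery approach, where each additional aggregate would cost its own scan of the table.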
The advantage of the first two approaches is that they also work in Oracle, though Oracle offers an alternative as well, using analytic aggregate functions:
-- Oracle: Running Totals (analytic aggregate function)
SELECT POH.PurchaseOrderID
, POH.VendorID
, POH.OrderDate
, POH.SubTotal
, POH.TaxAmt
, POH.Freight
, SUM(POH.SubTotal) OVER (ORDER BY POH.PurchaseOrderID) RunningSubTotal
, SUM(POH.TaxAmt) OVER (ORDER BY POH.PurchaseOrderID) RunningTaxAmt
, SUM(POH.Freight) OVER (ORDER BY POH.PurchaseOrderID) RunningFreight
FROM PurchaseOrderHeader POH
ORDER BY POH.PurchaseOrderID
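Conceptually, the analytic version is a single ordered pass over the data rather than a join. Python's `itertools.accumulate` mirrors that semantics (values assumed already ordered by PurchaseOrderID):

```python
from itertools import accumulate

# SubTotal values in PurchaseOrderID order (hypothetical data).
subtotals = [100.0, 250.0, 50.0]

# One linear pass, like SUM(SubTotal) OVER (ORDER BY PurchaseOrderID).
running = list(accumulate(subtotals))
print(running)  # [100.0, 350.0, 400.0]
```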
Notes:
1. Calculating running totals requires a sortable unique identifier for each record – here the PurchaseOrderID attribute is used, though typically the creation date (timestamp) of the record could serve, alone or in combination with other attributes; when a combination is used, it may make sense to add a running number in order to apply the above techniques. In Oracle the use of analytic aggregate functions causes fewer headaches from this perspective: it is enough to specify the unique combination of attributes in the ORDER BY clause in the body of the function.
2. SQL Server 2005’s window aggregate functions, even if similar to Oracle’s analytic aggregate functions, don’t allow – unlike the window ranking functions – an ORDER BY clause in the function; therefore running totals can’t be modeled in SQL Server 2005 with the help of window aggregate functions.
3. In case the running totals are based on another table, the OUTER APPLY operator might be needed, thus returning also the records for which no match is retrieved in the outer data set.
Sometimes it might be requested to create running totals within a given partition, for example by Vendor. In this case all that’s needed is to add the VendorID to the WHERE or JOIN constraint, or, in the case of Oracle’s analytic functions, to the PARTITION BY clause. Here are the three queries modified:
-- SQL Server/Oracle: Running Totals per Vendor (simple join)
SELECT POH.PurchaseOrderID
, POH.VendorID
, POH.OrderDate
, POH.SubTotal
, POH.TaxAmt
, POH.Freight
, SUM(POT.SubTotal) RunningSubTotal
, SUM(POT.TaxAmt) RunningTaxAmt
, SUM(POT.Freight) RunningFreight
FROM Purchasing.PurchaseOrderHeader POH
LEFT JOIN Purchasing.PurchaseOrderHeader POT
ON POH.VendorID = POT.VendorID
AND POH.PurchaseOrderID >= POT.PurchaseOrderID
GROUP BY POH.PurchaseOrderID
, POH.VendorID
, POH.OrderDate
, POH.SubTotal
, POH.TaxAmt
, POH.Freight
ORDER BY POH.VendorID
, POH.PurchaseOrderID
-- SQL Server 2005: Running Totals per Vendor (CROSS APPLY)
SELECT POH.PurchaseOrderID
, POH.VendorID
, POH.OrderDate
, POH.SubTotal
, POH.TaxAmt
, POH.Freight
, DAT.RunningSubTotal
, DAT.RunningTaxAmt
, DAT.RunningFreight
FROM Purchasing.PurchaseOrderHeader POH
CROSS APPLY ( -- calculating aggregated running total
SELECT SUM(POT.SubTotal) RunningSubTotal
, SUM(POT.TaxAmt) RunningTaxAmt
, SUM(POT.Freight) RunningFreight
FROM Purchasing.PurchaseOrderHeader POT
WHERE POH.VendorID = POT.VendorID
AND POH.PurchaseOrderID >= POT.PurchaseOrderID) DAT
ORDER BY POH.VendorID
, POH.PurchaseOrderID
-- Oracle: Running Totals per Vendor (analytic aggregate function)
SELECT POH.PurchaseOrderID
, POH.VendorID
, POH.OrderDate
, POH.SubTotal
, POH.TaxAmt
, POH.Freight
, SUM(POH.SubTotal) OVER (PARTITION BY POH.VendorID ORDER BY POH.PurchaseOrderID) RunningSubTotal
, SUM(POH.TaxAmt) OVER (PARTITION BY POH.VendorID ORDER BY POH.PurchaseOrderID) RunningTaxAmt
, SUM(POH.Freight) OVER (PARTITION BY POH.VendorID ORDER BY POH.PurchaseOrderID) RunningFreight
FROM PurchaseOrderHeader POH
ORDER BY POH.VendorID
, POH.PurchaseOrderID
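The partitioned running total can likewise be sketched in Python: `groupby` plays the role of PARTITION BY (the accumulation restarts for each vendor), on hypothetical pre-sorted rows:

```python
from itertools import accumulate, groupby
from operator import itemgetter

# Hypothetical rows, pre-sorted by (vendor, id) — the PARTITION BY / ORDER BY combination.
rows = [
    {"vendor": "A", "id": 1, "subtotal": 100.0},
    {"vendor": "A", "id": 3, "subtotal": 50.0},
    {"vendor": "B", "id": 2, "subtotal": 250.0},
]

result = []
for vendor, group in groupby(rows, key=itemgetter("vendor")):
    group = list(group)
    # accumulate restarts per vendor, like
    # SUM(SubTotal) OVER (PARTITION BY VendorID ORDER BY PurchaseOrderID)
    for row, total in zip(group, accumulate(r["subtotal"] for r in group)):
        result.append({**row, "running_subtotal": total})
```

Note that `groupby` only groups consecutive equal keys, so the input must already be sorted by the partitioning attribute.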
A Software Engineer and data professional's blog on SQL, data, databases, data architectures, data management, programming, Software Engineering, Project Management, ERP implementation and other IT related topics.
Showing posts with label analytic functions. Show all posts
06 March 2010
17 January 2010
🧭Business Intelligence: Enterprise Reporting (Part II: Levels of Detail)
Between two tables, and by extension two data sets, the relations can have any of four types of cardinality: many-to-many (represented as m:n), one-to-one (1:1), one-to-many (1:n) and the reverse many-to-one (n:1). A definition could be given for each cardinality type, though it’s easier to remember that in the x-to-y compounds, between two tables or data sets A and B, x has the value ‘one’ when the number of references from table B to any record from A is at most 1 (there could be records not referenced at all), and ‘many’ when at least one record could be referenced more than once; the same logic applies to y, but inverting the tables’ perspective.
The level of detail (LOD) of a report (aka data granularity) is directly correlated with changes in cardinality: when a data set is added to an existing data set, if a one-to-many or many-to-many cardinality is implied, then the level of detail changes too. In other words, if a record in a given data set is referenced more than once in the newly added table, then the LOD changes. A simple example of a change of LOD occurs in parent-child (master-detail) relations. For example, if the Invoice Lines are added to a data set based on Invoice Headers, then the LOD changes from Invoice Headers to Invoice Lines. If the Payments are added to the resulting data set, then the LOD changes from Invoice Lines to Payments only if there are multiple Payments for an Invoice (one-to-many or many-to-many cardinality); otherwise the LOD remains the same.
Summarizing, the level of detail of a report is the lowest level at which a cardinality change occurs at entity level. One can speak of a lower or higher LOD relative to a given level of detail; for example Invoice Payments have a lower LOD than Invoice Lines, while Invoice Headers have a higher one.
Why is it necessary to introduce the LOD, especially when it rests on a relatively complex notion like cardinality?! It is worth defining mainly because the term is used when defining/mitigating reporting requirements, its use being intuitive rather than well-defined and understood. When creating reports it’s important to find the adequate LOD for a report and to know what methods can be used to pull data that would normally change a report’s LOD without actually changing it.
The limitations imposed by the need to report at a LOD higher than the one implied by the raw data can be overcome with the help of aggregate functions used with GROUP BY constructs. Such constructs make it possible to bring into a report data found at a lower LOD than the reporting LOD by grouping the data based on a set of attributes. The aggregate functions provide functionality to calculate, for a set of values, the maximum, minimum, sum, count of records, standard deviation, and other statistical values; most of them work on numeric data types, while maximum/minimum and count of records also work with dates or alphanumeric data types.
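The grouping idea can be sketched in Python on hypothetical invoice lines: aggregating line amounts up to header level raises the LOD from Invoice Line to Invoice Header.

```python
# Hypothetical invoice lines (Invoice Line LOD).
lines = [
    {"invoice": "INV1", "amount": 10.0},
    {"invoice": "INV1", "amount": 15.0},
    {"invoice": "INV2", "amount": 7.5},
]

# GROUP BY invoice, SUM(amount): one output row per header (Invoice Header LOD).
totals = {}
for line in lines:
    totals[line["invoice"]] = totals.get(line["invoice"], 0.0) + line["amount"]
print(totals)
```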
For example, using aggregate functions the Invoice amounts can be aggregated and shown at Invoice Header level; in contrast, it’s not possible to do the same with quantities, because typically each Invoice Line refers to a specific Product, and as apples can’t be counted together with peaches, in order to see the quantities the level of detail has to be changed from Invoice Header to Invoice Line. The same technique can be applied to similar parent-child (master-detail) relations covering one-to-many or one-to-one cardinality, and also to many-to-many relations, which involve higher reporting complexity.
Direct many-to-many relations are seldom encountered; they mainly involve data sets in which the attributes used in the relation appear more than once in each data set. Taking an example from the ERP world: there can be multiple Receipts and Invoices for the same Purchase Order, so if there is no direct match between Receipts and Invoices that uniquely identifies which Invoice belongs to which Receipt, a report involving all three entities could be barely usable, records being duplicated (e.g. 2 Invoices vs. 3 Receipts result in 6 records in such a scenario). For such reports to be usable, the LOD needs to be changed to a higher level on at least one side; thus, reconsidering the mentioned example, a report could show the Purchase Order details while aggregating the Invoice & Receipt Quantities/Amounts at Purchase Order level, or show detailed Invoices with aggregated Receipt information, respectively detailed Receipts with aggregated Invoice information.
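The duplication described above is just the cross product of the unmatched sides; a Python sketch with hypothetical identifiers makes the 2 × 3 = 6 record count explicit:

```python
from itertools import product

# Two Invoices and three Receipts on the same Purchase Order,
# with no attribute matching Invoices to Receipts directly.
invoices = ["I1", "I2"]
receipts = ["R1", "R2", "R3"]

# Joining only on the shared Purchase Order yields every pairing.
joined = list(product(invoices, receipts))
print(len(joined))  # 6
```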
Unfortunately, even if the aggregate functions are quite handy, they have their own limitations: it is difficult to use them to answer questions like “what was the last product sold to each customer” or “which was the latest invoice placed for each customer”, at least not without considerable effort on the developer’s side. Fortunately database vendors have implemented their own more specialized types of functions – analytic functions in Oracle and window functions in SQL Server – which allow modeling such questions within a report. The downside for developers is that each database vendor comes with its own philosophy, so techniques and features working in one database might not work in another.
About Me
- Adrian
- Koeln, NRW, Germany
- IT Professional with more than 24 years of experience in IT in the area of the full life-cycle of Web/Desktop/Database Applications Development, Software Engineering, Consultancy, Data Management, Data Quality, Data Migrations, Reporting, ERP implementations & support, Team/Project/IT Management, etc.