Showing posts with label GROUP BY. Show all posts

20 June 2020

💎🏭SQL Reloaded: When Queries Look Like Ugly Ducks

There’s an expression of abductive reasoning, "if it looks like a duck, swims like a duck, and quacks like a duck, then it probably is a duck", used to identify something indefinite based on certain (definite) characteristics. When looking at a query written by someone else, one tries to identify such characteristics. However, there are also situations in which even simple queries are hard to interpret. For example, each time I see a SELECT DISTINCT in a query I ask myself whether the developer wanted a shortcut for a GROUP BY, or just wanted to remove duplicates from the output. Without looking at the data in detail it’s almost impossible to decide between the two scenarios.

Let’s consider the following simple example based on a set of orders sent over the interface between two systems. Because of poor design, the logic caught the first position before the full order was entered, so the price on that record is incorrect:

-- dropping the table
--DROP TABLE IF EXISTS dbo.T_Orders 

-- inserting the records 
SELECT *
INTO dbo.T_Orders 
FROM (VALUES ('P0001', 1,'20200618', '100001','S','black',20, 1.23)
, ('P0001', 1,'20200618', '100001','S','black',20, 1.22)
, ('P0001', 2,'20200618', '100001','S','black',5, 1.23)
, ('P0001', 3,'20200618', '100001','M','black',10, 1.24)
, ('P0002', 1,'20200619', '100001','S','black',5, 1.23)
, ('P0002', 2,'20200619', '100001','S','black',10, 1.22)) DAT (PurchId, Pos, PurchDate, ItemId, SizeId, ColorId, Qty, Price)

-- adding a primary key
ALTER TABLE dbo.T_Orders 
ADD Id int identity(1,1) NOT NULL

-- reviewing the data
SELECT *
FROM dbo.T_Orders 
ORDER BY PurchId
, Pos

The following queries return the same results:

-- simple SELECT DISTINCT 
SELECT DISTINCT PurchId, Pos, ItemId, SizeId, ColorId, Qty
FROM dbo.T_Orders

-- simple GROUP BY
SELECT PurchId, Pos, ItemId, SizeId, ColorId, Qty
FROM dbo.T_Orders
GROUP BY PurchId, Pos, ItemId, SizeId, ColorId, Qty

Both queries hide the fact that a duplicate exists. However, with a small change in the GROUP BY query (adding a count) one can highlight the duplicate, a flexibility the DISTINCT query doesn't offer:

-- simple GROUP BY with Count
SELECT PurchId, Pos, ItemId, SizeId, ColorId, Qty
, count(*) NoRecords
FROM dbo.T_Orders
GROUP BY PurchId, Pos, ItemId, SizeId, ColorId, Qty
-- HAVING count(*)>1

This small change can make an important difference when one needs to analyze the data, so at least from this perspective it might be worth writing the query as a GROUP BY from the beginning, especially when writing complex queries.

There are also nested SELECT DISTINCTs that attempt to refine the selection in successive steps. Such curiosities seldom make sense:

-- nested SELECT DISTINCT
SELECT DISTINCT PurchId, Pos, ItemId
FROM (
 SELECT DISTINCT PurchId, Pos, ItemId, SizeId, ColorId, Qty
 FROM dbo.T_Orders
) DAT

To remove the duplicates and perform a sum, one may encounter another curiosity of a query: the inner query removes the duplicates (some write it with a DISTINCT as well), while the outer query performs the sum:

-- double grouping to remove duplicates
SELECT PurchId, Pos, ItemId, SizeId, ColorId
, SUM(Qty) Total
FROM (--inner query
 SELECT PurchId, Pos, ItemId, SizeId, ColorId, Qty
 FROM dbo.T_Orders
 GROUP BY PurchId, Pos, ItemId, SizeId, ColorId, Qty
) DAT
GROUP BY PurchId, Pos, ItemId, SizeId, ColorId

Even if such queries (miraculously) work in some cases, sooner or later they are bound to fail. When doing operations on data whose quality is doubtful, one needs a strategy for removing the duplicates. The first step is to identify the attributes that make a record unique (e.g. PurchId and Pos); given a sorting criterion, one can then remove the duplicates via a ranking window function like RANK, DENSE_RANK or ROW_NUMBER (with small exceptions their use is interchangeable). Supposing that the Id attribute determines the order in which the data were added, one can write a query like the following:

-- removing duplicates via ranking partition window function 
SELECT *
FROM (
 SELECT PurchId, Pos, PurchDate, ItemId, SizeId, ColorId, Qty, Price
 , RANK() OVER (PARTITION BY PurchId, Pos ORDER BY Id DESC) Ranking 
 FROM dbo.T_Orders
) DAT 
WHERE Ranking=1

The problem is that such queries need to be repeated for each use. A simpler method is to run a regular check for duplicates and mark the duplicates as inactive, or push them into a separate table. Deleting the duplicates can work as well, though this approach can come back like a boomerang if not handled correctly.
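As a sketch of the deletion strategy, assuming the same dbo.T_Orders table and that the record with the highest Id per (PurchId, Pos) is the one to keep, the duplicates could be removed via an updatable CTE:

```sql
-- deleting duplicates, keeping the latest record per (PurchId, Pos)
WITH CTE
AS (
	SELECT ROW_NUMBER() OVER (PARTITION BY PurchId, Pos ORDER BY Id DESC) Ranking
	FROM dbo.T_Orders
)
DELETE FROM CTE
WHERE Ranking > 1
```

Because the CTE is based on a single table, the DELETE is applied directly against dbo.T_Orders; rows ranked 2 and higher within each (PurchId, Pos) partition are removed.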

To identify the duplicates, one of the following approaches can be used:

-- duplicates via grouping in inner query
SELECT DAT.*
, DUP.NoRecords 
FROM dbo.T_Orders DAT
     JOIN (-- duplicates
 SELECT PurchId, Pos
 , count(*) NoRecords
 FROM dbo.T_Orders DUP
 GROUP BY PurchId, Pos
 HAVING count(*)>1 
  ) DUP 
   ON DAT.PurchId = DUP.PurchId 
   AND DAT.Pos = DUP.Pos 

-- duplicates via grouping in correlated query 
SELECT DAT.*
, DUP.NoRecords 
FROM dbo.T_Orders DAT
     CROSS APPLY (-- duplicates
 SELECT count(*) NoRecords
 FROM dbo.T_Orders DUP
 WHERE DAT.PurchId = DUP.PurchId 
   AND DAT.Pos = DUP.Pos 
 GROUP BY PurchId, Pos
 HAVING count(*)>1 
  ) DUP 


-- duplicates via count within aggregate window function 
SELECT *
FROM (
 SELECT PurchId, Pos, PurchDate, ItemId, SizeId, ColorId, Qty, Price
 , count(*) OVER (PARTITION BY PurchId, Pos) NoRecords 
 FROM dbo.T_Orders
) DAT 
WHERE NoRecords>1

Notes:
The queries also work in SQL databases in Microsoft Fabric.

Happy coding!

20 October 2018

💫ERP Systems: AX 2009 vs. D365 FO - Product Prices with two Different Queries

   In Dynamics AX 2009, as well as in Dynamics 365 for Finance and Operations (D365 FO), the Inventory, Purchase and Sales Prices are stored in the InventTableModule table on separate rows, independently of the Products stored in InventTable. Thus, for each Product there are three rows that need to be joined. To get all the Prices one needs to write a query like this one:

SELECT ITM.DataAreaId 
, ITM.ItemId 
, ITM.ItemName 
, ILP.UnitId InventUnitId
, IPP.UnitId PurchUnitId
, ISP.UnitId SalesUnitId
, ILP.Price InventPrice
, IPP.Price PurchPrice
, ISP.Price SalesPrice
FROM dbo.InventTable ITM
     JOIN dbo.InventTableModule ILP
       ON ITM.ItemId = ILP.ItemId
      AND ITM.DATAAREAID = ILP.DATAAREAID 
      AND ILP.ModuleType = 0 -- Inventory
     JOIN dbo.InventTableModule IPP
       ON ITM.ItemId = IPP.ItemId
      AND ITM.DATAAREAID = IPP.DATAAREAID 
      AND IPP.ModuleType = 1 -- Purchasing
     JOIN dbo.InventTableModule ISP
       ON ITM.ItemId = ISP.ItemId
      AND ITM.DATAAREAID = ISP.DATAAREAID 
      AND ISP.ModuleType = 2 -- Sales 
WHERE ITM.DataAreaId = 'abc'

Looking at the query plan one can see that SQL Server uses three Merge Joins together with a Clustered Index Seek, and recommends using a Nonclustered Index to increase the performance:

[execution plan screenshot]

   To avoid performing three joins, one can rewrite the query with the help of a GROUP BY, thus reducing the number of Merge Joins from three to one:

SELECT ITD.DataAreaID 
, ITD.ItemId 
, ITM.ItemName  
, ITD.InventPrice
, ITD.InventPriceUnit
, ITD.InventUnitId
, ITD.PurchPrice
, ITD.PurchPriceUnit
, ITD.PurchUnitId
, ITD.SalesPrice
, ITD.SalesPriceUnit
, ITD.SalesUnitId
FROM dbo.InventTable ITM
     LEFT JOIN (-- price details
     SELECT ITD.ITEMID
     , ITD.DATAAREAID 
     , Max(CASE ITD.ModuleType WHEN 0 THEN ITD.Price END) InventPrice
     , Max(CASE ITD.ModuleType WHEN 0 THEN ITD.PriceUnit END) InventPriceUnit
     , Max(CASE ITD.ModuleType WHEN 0 THEN ITD.UnitId END) InventUnitId
     , Max(CASE ITD.ModuleType WHEN 1 THEN ITD.Price END) PurchPrice
     , Max(CASE ITD.ModuleType WHEN 1 THEN ITD.PriceUnit END) PurchPriceUnit
     , Max(CASE ITD.ModuleType WHEN 1 THEN ITD.UnitId END) PurchUnitId
     , Max(CASE ITD.ModuleType WHEN 2 THEN ITD.Price END) SalesPrice
     , Max(CASE ITD.ModuleType WHEN 2 THEN ITD.PriceUnit END) SalesPriceUnit
     , Max(CASE ITD.ModuleType WHEN 2 THEN ITD.UnitId END) SalesUnitId
     FROM dbo.InventTableModule ITD
     GROUP BY ITD.ITEMID
     , ITD.DATAAREAID 
    ) ITD
       ON ITD.ITEMID = ITM.ITEMID
      AND ITD.DATAAREAID = ITM.DATAAREAID
WHERE ITM.DataAreaID = 'abc' -- filtering on ITD would negate the LEFT JOIN
ORDER BY ITD.ItemId

And here’s the plan for it:

[execution plan screenshot]

It seems that despite the overhead introduced by the GROUP BY, the second query performs better, at least when all or almost all records are retrieved.

In Dynamics AX there were seldom occasions when I needed to write similar queries. Probably most of those cases were related to the use of window functions (e.g. the first n Vendors or Vendor Prices). On the other hand, a bug in earlier versions of Oracle, which limited the number of JOINs that could be used, made me consider such queries even when it was maybe not warranted.

What about you? Did you have the chance to write similar queries? What made you consider the second type of query?

Happy coding!

11 March 2010

💎SQL Reloaded: Oracle vs. SQL Server: Averages I

   For many people working with averages is not a complicated topic at all, and that’s quite true, though as always it depends on how much one wants to complicate things. The average or arithmetic mean of a data set is defined as the sum of its values divided by the number of values: 
AVG - Simple Average 
   RDBMS vendors provide this functionality within the AVG aggregate function, and even if in theory the result could be calculated using the SUM and COUNT aggregate functions, as per the above definition, given its importance in statistics, providing an AVG function seems a natural choice. AVG can be applied across the whole data set or within a given “partition” determined by the GROUP BY clause.

-- Oracle/SQL Server: Simple Averages 
SELECT ProductID 
, AVG(UnitPrice) AverageUnitPrice 
, SUM(UnitPrice)/COUNT(1) AverageUnitPriceDef 
FROM Purchasing.PurchaseOrderDetail 
GROUP BY ProductID 


    For most data analysis requirements the simple average would be enough, though there are cases in which one needs to address the sensitivity of the values to certain quantifiers, for example quantities or the number of lines. For such cases the weighted average (weighted arithmetic mean) is used, calculated as the sum of the products of values and weights, divided by the sum of the weights:
AVG - Weighted Average
Note:
     In case the weights are all 1, the weighted average equals the simple average; the output is also the same when the values are constant.

      The weighted average is quite a useful tool when performing PO Price Analysis, in which the volume of quantities purchased at a given Price Unit often needs to be considered as well: 
 
-- Oracle/SQL Server: Simple/Weighted Averages 
SELECT ProductID 
, AVG(UnitPrice) AverageUnitPrice 
, SUM(UnitPrice*OrderQty)/SUM(OrderQty) WeightedAverageUnitPrice 
FROM Purchasing.PurchaseOrderDetail 
GROUP BY ProductID  

   The actual implementation in a query might differ based on requirements and preferences - as multiple left joins, an inline view or a correlated query - each of these approaches coming with its downsides and strengths: 
 
-- Oracle/SQL Server: Averages (inline view) 
SELECT ITM.ProductID 
, ITM.Name ProductName 
, ITM.ProductNumber 
, DAT.AverageUnitPrice 
, DAT.WeightedAverageUnitPrice 
FROM Production.Product ITM 
LEFT JOIN ( --calculated averages 
    SELECT POD.ProductID 
    , AVG(POD.UnitPrice) AverageUnitPrice 
    , SUM(POD.UnitPrice*OrderQty)/SUM(POD.OrderQty) WeightedAverageUnitPrice 
    FROM Purchasing.PurchaseOrderDetail POD 
       JOIN Purchasing.PurchaseOrderHeader POH 
          ON POD.PurchaseOrderID = POH.PurchaseOrderID 
    WHERE POH.Status IN (2, 4) -- 2-Approved, 4-Complete 
   GROUP BY POD.ProductID)DAT 
    ON ITM.ProductID = DAT.ProductID 
ORDER BY ITM.Name 

-- Oracle/SQL Server: Averages (multiple left joins) 
SELECT ITM.ProductID 
, ITM.Name ProductName 
, ITM.ProductNumber 
, AVG(POD.UnitPrice) AverageUnitPrice 
, SUM(POD.UnitPrice*OrderQty)/SUM(POD.OrderQty) WeightedAverageUnitPrice 
FROM Production.Product ITM 
   LEFT JOIN Purchasing.PurchaseOrderDetail POD 
      ON ITM.ProductID = POD.ProductID 
   LEFT JOIN Purchasing.PurchaseOrderHeader POH 
      ON POD.PurchaseOrderID = POH.PurchaseOrderID  
WHERE POH.Status IN (2, 4) -- 2-Approved, 4-Complete 
GROUP BY ITM.ProductID 
,ITM.Name 
,ITM.ProductNumber 
ORDER BY ITM.Name  

-- SQL Server 2005: Averages (OUTER APPLY) 
SELECT ITM.ProductID 
, ITM.Name ProductName 
, ITM.ProductNumber 
, DAT.AverageUnitPrice 
, DAT.WeightedAverageUnitPrice 
FROM Production.Product ITM 
OUTER APPLY ( -- calculated averages 
    SELECT AVG(POD.UnitPrice) AverageUnitPrice 
    , SUM(POD.UnitPrice*OrderQty)/SUM(POD.OrderQty) WeightedAverageUnitPrice 
    FROM Purchasing.PurchaseOrderDetail POD  
       JOIN Purchasing.PurchaseOrderHeader POH 
          ON POD.PurchaseOrderID = POH.PurchaseOrderID  
    WHERE ITM.ProductID = POD.ProductID  
        AND POH.Status IN (2, 4) -- 2-Approved, 4-Complete 
) DAT 
ORDER BY ITM.Name 

      A different type of approach could be focused on calculating the simple or weighted average based only on the last n PO Lines, for example the last 3 PO Lines: 
  
-- Oracle/SQL Server 2005: last 3 PO average prices 
SELECT DAT.ProductID 
, MAX(DAT.RANKING) NumberRecords 
, AVG(DAT.UnitPrice) AveragePrice 
, SUM(DAT.UnitPrice* DAT.OrderQty)/SUM(DAT.OrderQty) WeightedAveragePrice 
FROM (-- ranked PO Prices 
    SELECT POD.ProductID 
    , POD.UnitPrice 
    , POD.OrderQty 
    , RANK() OVER(PARTITION BY POD.ProductID ORDER BY POH.OrderDate DESC,   POD.PurchaseOrderDetailID DESC) RANKING 
    FROM Purchasing.PurchaseOrderDetail POD 
       JOIN Purchasing.PurchaseOrderHeader POH 
          ON POD.PurchaseOrderID = POH.PurchaseOrderID 
    WHERE POH.Status IN (2, 4) -- 2-Approved, 4-Complete 
) DAT 
WHERE DAT.RANKING BETWEEN 1 AND 3 
GROUP BY DAT.ProductID
 


    The problem could be complicated even more by selecting all the POs corresponding to the last 3 distinct Prices used; it's unclear whether this brings any value, but the formulation is perfectly valid, so don't despair if a user comes with such a request.

    Occasionally it might be needed to study the evolution of prices over time, the previous problem becoming a simple moving average or weighted moving average problem, in which the average is calculated with respect to a moving point on the time scale, considering the previous k points, or the previous and following k points including the current point:
AVG - Simple Moving Average
AVG - Weighted Moving Average
     G. Priester shows in his article Calculating Moving Averages with T-SQL a nice solution for calculating simple moving averages, exemplified on stock market data. Frankly, given its flexibility, I prefer to use the CROSS APPLY operator in combination with the fGetDatesInInterval function introduced in the More Date-related Functionality – Part I post, the function returning all the dates falling within a certain interval. Because for POs there can be multiple transactions per day for the same Product, the problem changes slightly, the averages being based on a 31-day window:

-- SQL Server 2005: 31 days moving averages 
SELECT ITM.ProductNumber 
, ITM.Name ProductName 
, DIN.DateSequence ReportingDate 
, DAT.AverageUnitPrice 
, DAT.WeightedAverageUnitPrice 
FROM dbo.fGetDatesInInterval('2003-01-01', '2003-12-31') DIN 
CROSS APPLY (--calculated averages 
    SELECT POD.ProductID 
    , AVG(POD.UnitPrice) AverageUnitPrice 
    , SUM(POD.UnitPrice*OrderQty)/SUM(POD.OrderQty) WeightedAverageUnitPrice 
    FROM Purchasing.PurchaseOrderDetail POD 
       JOIN Purchasing.PurchaseOrderHeader POH 
          ON POD.PurchaseOrderID = POH.PurchaseOrderID 
     WHERE DATEDIFF(d, POH.OrderDate, DIN.DateSequence) BETWEEN 0 AND 30 -- 31 days 
     --WHERE DATEDIFF(d, POH.OrderDate, DIN.DateSequence) BETWEEN -15 AND 15 -- 31 days 
         AND POH.Status IN (2, 4) -- 2-Approved, 4-Complete 
GROUP BY POD.ProductID) DAT 
LEFT JOIN Production.Product ITM 
   ON DAT.ProductID = ITM.ProductID 
WHERE DATEPART(dw, DIN.DateSequence) BETWEEN 2 AND 6 -- Working Days 
ORDER BY ITM.Name 
, DIN.DateSequence 

03 February 2010

💎SQL Reloaded: Ways of Looking at Data III

In the previous posts from the same cycle (see Part I and Part II) I showed how the ranking window functions can be used in SQL Server to show the first/last record appearing in a certain partition. A broader set of reports are those focusing on the top n records matching a certain criterion, for example the top 10 Vendors/Products based on the amount purchased. Before the introduction of ranking window functions in SQL Server 2005, such requirements were addressed using the TOP clause. The following two queries show the top 10 Vendors, respectively the top 10 Vendors/Products, based on the amount purchased, the first query focusing on the active Vendors and the second on all Vendors. It depends on the requirements whether only the active or all Vendors are needed, so the Users should always be asked about this aspect. 
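A minimal sketch of such a TOP-based query, assuming AdventureWorks' Purchasing tables (the exact attributes and filters are illustrative):

```sql
-- top 10 Vendors based on the amount purchased
SELECT TOP 10 POH.VendorID
, SUM(POH.SubTotal) TotalAmount
FROM Purchasing.PurchaseOrderHeader POH
GROUP BY POH.VendorID
ORDER BY SUM(POH.SubTotal) DESC
```

The ORDER BY is what gives the TOP clause its meaning here; without it, an arbitrary set of 10 Vendors would be returned.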

In contrast, the use of ranking/aggregate window functions allows more flexibility in selecting not only the first n Vendors but any interval, which makes this feature usable as a simple pagination technique. Here is the first query modified.
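As a sketch of the pagination idea, assuming the same AdventureWorks tables, an arbitrary interval of Vendors (here positions 11 to 20) can be selected via ROW_NUMBER:

```sql
-- Vendors 11-20 based on the amount purchased (simple pagination)
SELECT DAT.VendorID
, DAT.TotalAmount
FROM (-- ranked Vendors
    SELECT POH.VendorID
    , SUM(POH.SubTotal) TotalAmount
    , ROW_NUMBER() OVER (ORDER BY SUM(POH.SubTotal) DESC) Ranking
    FROM Purchasing.PurchaseOrderHeader POH
    GROUP BY POH.VendorID
) DAT
WHERE DAT.Ranking BETWEEN 11 AND 20
```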
As for the second query, the TOP clause only allows selecting the top 10 Vendors/Products overall; it is not possible to select the top n Vendors for each purchased Product, as ranking functions allow. Here is the query for the top 3 Vendors for a Product.
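A sketch of such a query, assuming AdventureWorks' Purchasing tables, where the ranking is partitioned by Product:

```sql
-- top 3 Vendors per Product based on the amount purchased
SELECT DAT.ProductID
, DAT.VendorID
, DAT.TotalAmount
FROM (-- ranked Vendors per Product
    SELECT POD.ProductID
    , POH.VendorID
    , SUM(POD.LineTotal) TotalAmount
    , RANK() OVER (PARTITION BY POD.ProductID ORDER BY SUM(POD.LineTotal) DESC) Ranking
    FROM Purchasing.PurchaseOrderDetail POD
         JOIN Purchasing.PurchaseOrderHeader POH
           ON POD.PurchaseOrderID = POH.PurchaseOrderID
    GROUP BY POD.ProductID
    , POH.VendorID
) DAT
WHERE DAT.Ranking <= 3
```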
The logic from the current example could be reused in further analysis, therefore it makes sense to encapsulate it in a database object. A view would do, though if we would like to use the number of Vendors as a parameter, then it's more appropriate to use a table-valued function or a stored procedure. Even if in theory stored procedures provide better performance, if we would like to include the above logic in further queries then creating a table-valued function is the better choice.
Once the table-valued function is created, we can go on and use it in other queries; for example, here is a simple query showing a maximum of 5 Vendors for each Product.
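A minimal sketch of such an inline table-valued function, using a hypothetical name dbo.fTopVendorsPerProduct and the AdventureWorks tables (the CREATE statement must be run in its own batch):

```sql
-- inline table-valued function returning the top n Vendors per Product
CREATE FUNCTION dbo.fTopVendorsPerProduct(
  @NumberVendors int)
RETURNS TABLE
AS RETURN (
    SELECT DAT.ProductID
    , DAT.VendorID
    , DAT.TotalAmount
    FROM (-- ranked Vendors per Product
        SELECT POD.ProductID
        , POH.VendorID
        , SUM(POD.LineTotal) TotalAmount
        , RANK() OVER (PARTITION BY POD.ProductID ORDER BY SUM(POD.LineTotal) DESC) Ranking
        FROM Purchasing.PurchaseOrderDetail POD
             JOIN Purchasing.PurchaseOrderHeader POH
               ON POD.PurchaseOrderID = POH.PurchaseOrderID
        GROUP BY POD.ProductID
        , POH.VendorID
    ) DAT
    WHERE DAT.Ranking <= @NumberVendors
)
GO

-- maximum 5 Vendors for each Product
SELECT *
FROM dbo.fTopVendorsPerProduct(5)
```

Being an inline function, the optimizer can expand its definition into the calling query, so the parameterization costs little compared to a view.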
Most probably, Users accustomed to the pivoting functionality of Excel would ask whether it's possible to show all the Vendors together on one line. With SQL Server 2005 Microsoft introduced the PIVOT operator, though it can pivot only one attribute; fortunately there is an alternative using a GROUP BY together with the CASE function.
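A sketch of the GROUP BY plus CASE technique, self-contained over the AdventureWorks tables, pivoting the top 3 Vendors per Product onto one line:

```sql
-- pivoting the top 3 Vendors per Product via GROUP BY and CASE
SELECT DAT.ProductID
, MAX(CASE DAT.Ranking WHEN 1 THEN DAT.VendorID END) Vendor1
, MAX(CASE DAT.Ranking WHEN 2 THEN DAT.VendorID END) Vendor2
, MAX(CASE DAT.Ranking WHEN 3 THEN DAT.VendorID END) Vendor3
FROM (-- ranked Vendors per Product
    SELECT POD.ProductID
    , POH.VendorID
    , RANK() OVER (PARTITION BY POD.ProductID ORDER BY SUM(POD.LineTotal) DESC) Ranking
    FROM Purchasing.PurchaseOrderDetail POD
         JOIN Purchasing.PurchaseOrderHeader POH
           ON POD.PurchaseOrderID = POH.PurchaseOrderID
    GROUP BY POD.ProductID
    , POH.VendorID
) DAT
WHERE DAT.Ranking <= 3
GROUP BY DAT.ProductID
```

Each MAX(CASE ...) column picks the Vendor at one ranking position; unmatched positions stay NULL.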
The downside of this technique is that one can work only with a predefined number of Vendors per Product (here 3); on the other hand, there are few cases in which more than 5 Vendors exist for the same Product. An alternative would be to process the data using a programming language such as VBA, not a complicated thing to do at all. In almost 10 years of experience I found only a few cases in which the functionality provided by the RDBMS needed to be extended with additional coding outside of the RDBMS (e.g. Excel, ASP), mainly when the data needed to be organized beyond a simple tabular structure.

30 January 2010

💎SQL Reloaded: Ways of Looking at Data I (Some Basics)

In two previous posts I talked about the Level of Detail and the Level of Accuracy used in Reports, though from theory to practice is a long way and things are not always as clear as we would like them to be. Initially I wanted to exemplify the ideas with some queries based on an Invoice Header/Lines example, though such an attempt would require creating and populating several tables - not the best thing to do when introducing concepts. The AdventureWorks database's tables seem the easiest way to start, the Purchase Orders (PO) Header/Details tables offering scenarios similar to the Invoice Header/Lines; in addition, there are other tables that could be used together with them in order to demonstrate various techniques and reporting needs.

  First things first: even if in a Parent/Child situation like that of the POs most of the examples deal with reports based on Parent & Children information together, it's a good idea to first understand the Parent and then the Child element, in this case the PO Header and then the PO Lines. A query on PO Headers would also include the referenced tables - Vendors, Shipment Methods, Employees, etc. In the case of big tables a query would typically include the most important attributes, while from the referenced tables the most used elements are included.

Before further using the query it's always a good idea to check the variations in cardinality, starting from the main table and then with each join added, this test allowing one to identify incorrect joins. When a WHERE clause is involved, it's easier to use the comment characters (/* ... */) and uncomment each join gradually.
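A sketch of such a cardinality test on the PO Headers, with the joins commented out so they can be re-enabled one by one (the record count should stay constant if the joins are correct):

```sql
-- checking the variations in cardinality with each join
SELECT count(*) NoRecords
FROM Purchasing.PurchaseOrderHeader POH
/*   JOIN Purchasing.ShipMethod PSM
       ON POH.ShipMethodID = PSM.ShipMethodID */
/*   JOIN Purchasing.Vendor VND
       ON POH.VendorID = VND.VendorID */
```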
Sometimes it makes sense to encapsulate the query in a view, thus allowing one to reuse the logic when needed, reduce maintenance effort, and enforce vertical and horizontal security - to mention just a few of views' benefits. It's said that views come with a slight decrease in performance, though there is always a trade-off between performance and reusability, including in OOP. It's recommended to follow query-writing best practices and design the queries for the best performance. Besides the columns from the initial query, in a view it's recommended to add also the IDs used in the joins, for faster filtering of records or for easier troubleshooting. I prefer to add them in front of the other attributes, though their use could be facilitated by grouping them together with the attributes from the referenced table.
Probably you wonder what the meaning of the Ex suffix is; as you probably deduced from the comment, it stands for Extended. In general I prefer to create a view for each level of detail, in this case one for PO Headers and another one for PO Details, and to use nested views when the decrease in performance is minimal, especially when the views include complex logic I don't want to replicate. In addition, the views could be reused by Users, who don't have to replicate the logic already included in the view, making the analysis of data much easier. On the other hand, when using views from various modules (e.g. Accounts Payable vs. POs, Sales vs. POs), there are many elements which end up being used more than once, in general Master Data (e.g. Products, Vendors, Customers, Locations, etc.); thus, in order to avoid these unnecessary joins, I prefer to partition the views logically - two views won't contain similar data, though this doesn't mean there won't be views that contradict this rule.

 Using the logical partitioning of views, an alternative view on PO Headers would be created without the Vendor and Employee information, using the corresponding views when needed. AdventureWorks includes a view on Vendors called Purchasing.vVendor, though its level of detail is the Contact, and a Vendor could have more than one Contact; therefore that view can't be used for this purpose, and a second view called Purchasing.vVendors needs to be created (if the name creates confusion you could rename it, or rename Purchasing.vVendor to reflect the fact that its level is Vendor Contacts, for example Purchasing.vVendorContacts).
The problem with the Purchasing.vPurchaseOrderHeaderEx view is that if a user needs additional information about Vendors not available in the view, I'll have to add the attributes, ending up recreating most of the logic available in the Purchasing.vVendors view; otherwise the User will rejoin the Purchasing.vVendors view with Purchasing.vPurchaseOrderHeaderEx, having a redundant link to Vendors. The simplified version of the view is the following:
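The view itself is not reproduced above; a minimal sketch of Purchasing.vVendors at Vendor level, assuming the AdventureWorks (2005/2008) Purchasing.Vendor table with a few of its most used attributes, might look as follows:

```sql
-- Vendors view at Vendor level
CREATE VIEW Purchasing.vVendors
AS
SELECT VND.VendorID
, VND.AccountNumber
, VND.Name VendorName
, VND.CreditRating
, VND.ActiveFlag
FROM Purchasing.Vendor VND
```

In later AdventureWorks versions the Vendor key is BusinessEntityID rather than VendorID, so the attribute names would need adjusting accordingly.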
Now the three views - Purchasing.vVendors, HumanResources.vEmployee and Purchasing.vPurchaseOrderHeader - could be joined in a query to obtain the same result as the Purchasing.vPurchaseOrderHeaderEx view, this approach offering much more flexibility, even if the decrease in performance is higher; still, the performance should in theory be better than joining Purchasing.vPurchaseOrderHeaderEx with Purchasing.vVendors and HumanResources.vEmployee in order to cover the missing attributes. Unfortunately there is another small hitch - the HumanResources.vEmployee view doesn't contain the ContactID, therefore the view needs to be modified if this is the case.
The logical partitioning also allows more flexibility and relatively better performance when doing data analysis; for example, in order to get the volume purchased per Vendor, the Purchasing.vPurchaseOrderHeader view could be used to aggregate the data, joining the resulting dataset with the Purchasing.vVendors view.
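A sketch of this aggregation, written directly against the AdventureWorks base tables (the views described above are not reproduced here), would look like this:

```sql
-- volume purchased per Vendor
SELECT VND.VendorID
, VND.Name VendorName
, DAT.TotalAmount
FROM Purchasing.Vendor VND
     JOIN (-- aggregated PO Headers
    SELECT POH.VendorID
    , SUM(POH.SubTotal) TotalAmount
    FROM Purchasing.PurchaseOrderHeader POH
    GROUP BY POH.VendorID
) DAT
   ON VND.VendorID = DAT.VendorID
```

Aggregating in the inline view before joining keeps the join at one row per Vendor, which is also why this pattern scales well when more Vendor attributes are needed.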
A second look at the same sets of data, Vendors and PO Headers, could be based on a left join between the two, thus showing also the Vendors for which no PO was approved or completed; the query could be used to return the same data as the previous data set (see the 'Vendors with POs' constraint). This second query offers greater flexibility, coming with a small decrease in performance; in addition, NULL values must be taken care of.
The above examples showed the cumulated values at Vendor level; what happens when we need to aggregate the data at CountryRegionName level? For this purpose either of the two above queries could be used, simply aggregating the data at CountryRegionName level. Given that the second query provides greater flexibility, it could be encapsulated in a view, thus allowing one to easily aggregate the data at any level we want - CountryRegionName, StateProvinceName or even City.
 
Sometimes simple aggregations are not enough, Users needing more from the data - for example seeing which is the last/first (open) PO placed in the system for each Vendor. Such a report looks simple as a request, though it's not always so easy to provide. Before the introduction of window functions in SQL Server and analytic functions in Oracle, the techniques for such requests were not so elegant... Here is the SQL Server implementation using the RANK ranking window function in order to get the Last/First PO, the same aggregated amounts from the previous query being shown with the help of aggregate window functions.
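A sketch of such an implementation over the AdventureWorks PO Headers, combining a ranking window function (for the last PO per Vendor) with an aggregate window function (for the cumulated amount):

```sql
-- last PO per Vendor via ranking and aggregate window functions
SELECT DAT.VendorID
, DAT.PurchaseOrderID
, DAT.OrderDate
, DAT.TotalAmount
FROM (-- ranked PO Headers
    SELECT POH.VendorID
    , POH.PurchaseOrderID
    , POH.OrderDate
    , RANK() OVER (PARTITION BY POH.VendorID ORDER BY POH.OrderDate DESC, POH.PurchaseOrderID DESC) Ranking
    , SUM(POH.SubTotal) OVER (PARTITION BY POH.VendorID) TotalAmount
    FROM Purchasing.PurchaseOrderHeader POH
) DAT
WHERE DAT.Ranking = 1
```

Ordering the ranking ascending instead of descending would return the first PO per Vendor with the same structure.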
Note:
Most data analysis focuses only on a certain list of Vendors, countries or regions, on a certain time interval, or on other special requirements - for example pending POs, in order to approximate future expenses, or rejected POs, in order to see the volume of waste/rework or to quantify the lost opportunities. The above queries were created for the general case and can be modified for specific requirements if needed.

19 March 2009

🛢DBMS: Scalar Aggregate (Definitions)

"An aggregate function that produces a single value from a select statement that does not include a group by clause. This is true whether the aggregate function is operating on all the rows in a table or on a subset of rows defined by a where clause." (Karen Paulsell et al, "Sybase SQL Server: Performance and Tuning Guide", 1996)

[vector aggregate:] "A value that results from using an aggregate function with a group by clause." (Karen Paulsell et al, "Sybase SQL Server: Performance and Tuning Guide", 1996)

"When aggregate functions are applied to the whole or partial table without the GROUP BY clause and return only one row." (Owen Williams, "MCSE TestPrep: SQL Server 6.5 Design and Implementation", 1998)

[vector aggregates:] "When aggregate functions are used with the GROUP BY clause, they return values for each group. These are called vector aggregates." (Owen Williams, "MCSE TestPrep: SQL Server 6.5 Design and Implementation", 1998)

"A function applied to all of the rows in a table (producing a single value per function). An aggregate function in the select list with no GROUP BY clause applies to the whole table and is an example of a scalar." (Microsoft Corporation, "SQL Server 7.0 System Administration Training Kit", 1999)

[vector aggregate:] "Functions applied to all rows that have the same value in a specified column or expression by using the GROUP BY clause and, optionally, the HAVING clause (producing a value for each group per function)." (Microsoft Corporation, "SQL Server 7.0 System Administration Training Kit", 1999)

"An aggregate value that is calculated on the data source. Depending on the data source, server aggregates can be treated as detail data or as aggregates based on the dataset option InterpretSubtotalsAsDetails." (Microsoft Technet)

[aggregate of aggregates:] "A summary value calculated from aggregates, such as the maximum of a set of sums." (Microsoft Technet)

 "An aggregate function, such as MIN(), MAX(), or AVG(), that is specified in a SELECT statement column list that contains only aggregate functions." (Microsoft Technet)

23 September 2007

💎SQL Reloaded: 6 out of 49 (Part 3: Basic Data Analysis)

The most basic information one can get is the number of drawings in which a number occurred, respectively its frequency:

-- number & frequency drawings by number
DECLARE @NumberDrawings int = 2000
SELECT Number 
, Cast(100.0*count(*)/@NumberDrawings as decimal(10,2)) Frequency 
, count(*) NumberDrawings 
FROM dbo.vLottoNumbers 
GROUP BY Number 
ORDER BY Number  

One of the interesting things to see in a data set of drawings is the distance between two drawings in which the same number appears. It doesn't look too easy to compare two consecutive records within the same query; actually, this can be done easily with the help of the DENSE_RANK function (available in SQL Server 2005+ and Oracle), which ranks the data within a partition, so that if two values are identical they receive the same ranking. Let's see how the DENSE_RANK function works: 

-- ranked drawings by number
SELECT Number 
, DrawingDate 
, DENSE_RANK() OVER (PARTITION BY Number ORDER BY DrawingDate) Ranking 
FROM dbo.vLottoNumbers   

The partition is created by Number, while the ranking is ordered by DrawingDate. Now all we have to do is join two such queries by Number with two consecutive Rankings: 

--creating the view 
CREATE VIEW dbo.vLottoConsecutiveDrawings 
AS 
-- consecutive drawings by numbers
WITH DAT 
AS (
	SELECT Number     
	, DrawingDate     
	, DENSE_RANK() OVER (PARTITION BY Number ORDER BY DrawingDate) Ranking   
	FROM dbo.vLottoNumbers  
)
SELECT A.Number 
, A.DrawingDate 
, B.DrawingDate NextDrawingDate 
, IsNull(DateDiff(d, A.DrawingDate, B.DrawingDate), 0) DifferenceDays 
, IsNull(DateDiff(wk, A.DrawingDate, B.DrawingDate), 0) DifferenceWeeks 
FROM DAT A
     LEFT JOIN DAT B 
	   ON A.Number = B.Number 
	  AND A.Ranking = B.Ranking - 1 

A few interesting things to find out:     
  • What's the average number of occurrences?     
  • What's the minimum/maximum number of consecutive drawings in which a number hasn't appeared?  
  • How many times did a number appear?      
The following query answers these questions by doing a simple grouping by Number over the vLottoConsecutiveDrawings output:

-- minimum/maximum differences between drawings
SELECT Number 
, Min(NullIf(DifferenceWeeks, 0)) Minimum 
, Max(NullIf(DifferenceWeeks, 0)) Maximum 
, Avg(NullIf(DifferenceWeeks, 0)) Average 
, count(*) NumberDrawings 
FROM dbo.vLottoConsecutiveDrawings 
GROUP BY Number 
ORDER BY Number    

The output doesn’t give much information, but it’s a good start. The study interval can be broken down by using the DrawingDate in the GROUP BY and/or WHERE clause: 

-- minimum/maximum differences between drawings by year
SELECT Year(DrawingDate) DrawingYear 
, Number 
, Min(NullIf(DifferenceWeeks, 0)) Minimum 
, Max(NullIf(DifferenceWeeks, 0)) Maximum 
, Avg(NullIf(DifferenceWeeks, 0)) Average 
, count(*) NumberDrawings 
FROM vLottoConsecutiveDrawings 
--WHERE Year(DrawingDate) IN (2000, 2001) 
GROUP BY Number 
, Year(DrawingDate) 
ORDER BY Number, DrawingYear

It would be also interesting to know how many numbers fall on a given row or column within the 7x7 matrix:

-- aggregating data by drawing & row
SELECT LN.DrawingDate 
, SUM(CASE WHEN LN.Number BETWEEN 1 AND 7 THEN 1 ELSE 0 END) Bucket1
, SUM(CASE WHEN LN.Number BETWEEN 8 AND 14 THEN 1 ELSE 0 END) Bucket2
, SUM(CASE WHEN LN.Number BETWEEN 15 AND 21 THEN 1 ELSE 0 END) Bucket3
, SUM(CASE WHEN LN.Number BETWEEN 22 AND 28 THEN 1 ELSE 0 END) Bucket4
, SUM(CASE WHEN LN.Number BETWEEN 29 AND 35 THEN 1 ELSE 0 END) Bucket5
, SUM(CASE WHEN LN.Number BETWEEN 36 AND 42 THEN 1 ELSE 0 END) Bucket6
, SUM(CASE WHEN LN.Number BETWEEN 43 AND 49 THEN 1 ELSE 0 END) Bucket7
FROM dbo.vLottoNumbers LN 
GROUP BY LN.DrawingDate 

-- aggregating data by drawing & column
SELECT LN.DrawingDate 
, SUM(CASE WHEN LN.Number %7 = 1 THEN 1 ELSE 0 END) Bucket1
, SUM(CASE WHEN LN.Number %7 = 2 THEN 1 ELSE 0 END) Bucket2
, SUM(CASE WHEN LN.Number %7 = 3 THEN 1 ELSE 0 END) Bucket3
, SUM(CASE WHEN LN.Number %7 = 4 THEN 1 ELSE 0 END) Bucket4
, SUM(CASE WHEN LN.Number %7 = 5 THEN 1 ELSE 0 END) Bucket5
, SUM(CASE WHEN LN.Number %7 = 6 THEN 1 ELSE 0 END) Bucket6
, SUM(CASE WHEN LN.Number %7 = 0 THEN 1 ELSE 0 END) Bucket7 -- 7, 14, ..., 49
FROM dbo.vLottoNumbers LN 
GROUP BY LN.DrawingDate 

The buckets can be consolidated as follows (observe the replacement of SUM with MAX in the inner query, which counts each bucket at most once per drawing):
 
-- consolidated drawings by 7 buckets (row)
 SELECT SUM(Bucket1) Bucket1
, SUM(Bucket2) Bucket2
, SUM(Bucket3) Bucket3
, SUM(Bucket4) Bucket4
, SUM(Bucket5) Bucket5
, SUM(Bucket6) Bucket6
, SUM(Bucket7) Bucket7
 FROM (
     -- aggregating data by drawing & row
	 SELECT LN.DrawingDate 
	, Max(CASE WHEN LN.Number BETWEEN 1 AND 7 THEN 1 ELSE 0 END) Bucket1
	, Max(CASE WHEN LN.Number BETWEEN 8 AND 14 THEN 1 ELSE 0 END) Bucket2
	, Max(CASE WHEN LN.Number BETWEEN 15 AND 21 THEN 1 ELSE 0 END) Bucket3
	, Max(CASE WHEN LN.Number BETWEEN 22 AND 28 THEN 1 ELSE 0 END) Bucket4
	, Max(CASE WHEN LN.Number BETWEEN 29 AND 35 THEN 1 ELSE 0 END) Bucket5
	, Max(CASE WHEN LN.Number BETWEEN 36 AND 42 THEN 1 ELSE 0 END) Bucket6
	, Max(CASE WHEN LN.Number BETWEEN 43 AND 49 THEN 1 ELSE 0 END) Bucket7
	 FROM dbo.vLottoNumbers LN 
	 GROUP BY LN.DrawingDate 
  ) DAT

Happy coding!


About Me

Koeln, NRW, Germany
IT Professional with more than 24 years experience in IT in the area of full life-cycle of Web/Desktop/Database Applications Development, Software Engineering, Consultancy, Data Management, Data Quality, Data Migrations, Reporting, ERP implementations & support, Team/Project/IT Management, etc.