Showing posts with label PIVOT operator.

09 April 2024

SQL Server New Features: Copilot in Azure SQL Database is in Private Preview (SQL Reloaded)

Following the news on Copilot and its adoption in various areas, you've probably asked yourself when it will become available in SQL Server. The good news is that Copilot in Azure SQL Database is already in Private Preview (see [1], [2]). The bad news is that for the early adoption program Microsoft is looking only for customers who can provide feedback based on real-world use cases, and one needs a valid Tenant ID to sign up for the preview.

The new feature comes with two Azure portal experiences:

  • Natural language to SQL: within the Azure portal query editor the natural language prompts are translated into SQL;
  • Azure Copilot integration: provides customers with self-guided assistance for managing databases and troubleshooting. 
In the Data Exposed webcast with Anna Hoffman [1], the guest, Joe Sack, showcases both experiences. Regarding the first experience, besides simple joins, Joe also showcased the generation of a PIVOT based on the prompt:

"Show me a pivot summary table that displays the total number of properties sold in each year from 2020 to 2023".

-- output from Copilot
SELECT *
FROM 
(
	SELECT YEAR(SaleDate) AS Year, COUNT(*) AS TotalPropertiesSold
	FROM [PropertySearch].[dbo].[PropertyHistory]
	WHERE YEAR(SaleDate) BETWEEN 2020 AND 2023
	GROUP BY YEAR(SaleDate)
) AS PivotDate
PIVOT
(
	SUM(TotalPropertiesSold)
	FOR Year IN ([2020], [2021], [2022], [2023])
) AS PivotTable;

Even if this is a modest example, the great thing is that it generates the skeleton on which the logic can be built, which can really help beginners, though it can also be useful for more seasoned data professionals as it saves time. Probably, in a first phase, this will be the main benefit of Copilot – generating skeletons or templates on which further logic can be built. Hopefully, in time, it will do much more than that.

I wonder how complex the prompts can become and how the logic can be structured to create a multistep scenario. The Copilot versions from other areas showed that complex prompts give results; the question is whether Copilot can devise the steps in an optimal manner, much like a seasoned data professional does.

For the moment, the feature uses table and view names, column names, and primary and foreign key metadata to generate T-SQL code. It would probably make sense to also use index and statistics information, query hints and the rest of the arsenal usually needed by data professionals to optimize a query. Conversely, maybe the second experience could be used for optimizing and troubleshooting the query.

I'd really love to test this feature, though I'll probably need to wait until it becomes publicly available. In the meantime, one can play with GitHub Copilot [3] or install Copilot in Azure Data Studio [4].

References:
[1] Data Exposed (2024) Introducing Copilot in Azure SQL Database (Private Preview) (link)
[2] Azure SQL Blog (2024) Microsoft Copilot in Azure extends capabilities to Azure SQL Database (Private Preview) by Joe Sack (link)
[3] Azure SQL Blog (2023) GitHub Copilot for SQL Developers: Turbocharge your SQL Development, by Subhojit Basak (link)
[4] Microsoft Developer (2023) Copilot is now in Azure Data Studio and this is how it can help you! (link)

05 November 2022

SQL Reloaded: STRING_AGG and STRING_SPLIT at Work, and a Bit of Pivoting

Working with strings across records was long a nightmare for SQL developers until Microsoft introduced STRING_SPLIT in SQL Server 2016 and STRING_AGG in SQL Server 2017. Previously, one was forced to write procedural code or use workarounds, even after SQL Server 2005 introduced recursive CTEs (common table expressions), ranking functions and PIVOT, which allowed handling many scenarios.

Microsoft provides several usage examples for the STRING_SPLIT and STRING_AGG functions based on the AdventureWorks database, though let's look at another example based on the same database.

Let's say we want to show the concatenated Contacts for a Store, a result which can now easily be obtained using STRING_AGG:

-- concatenating names per store via STRING_AGG (SQL Server 2017+)
SELECT BusinessEntityID
, STRING_AGG(Concat(FirstName, ' ', LastName), ';') Contacts
FROM Sales.vStoreWithContacts
GROUP BY BusinessEntityID
HAVING count(*)>1

Observe that a GROUP BY is needed to show one record per Store. Unfortunately, there isn't yet a window-function equivalent for the same.

The inverse operation can be performed with the help of the STRING_SPLIT table-valued function (TVF). (If you wonder why a TVF is needed, it is because the initial record needs to be multiplied by the generated output.)

-- reversing the concatenation (SQL Server 2022+, for the ordinal argument)
WITH CTE
AS (
	-- concatenating names per store
	SELECT BusinessEntityID
	, STRING_AGG(Concat(FirstName, ' ', LastName), ';') Contacts
	FROM Sales.vStoreWithContacts
	GROUP BY BusinessEntityID
	HAVING count(*)>1
) 
SELECT CTE.BusinessEntityID
, DAT.Value
, DAT.Ordinal 
FROM CTE
    CROSS APPLY STRING_SPLIT(Contacts, ';', 1) DAT

STRING_SPLIT also provides an ordinal field, which in theory could be used for pivoting the values, though then we'd return to where we started. Instead of reusing the query just generated, let's exemplify an alternative solution for concatenating strings across records, available since SQL Server 2005:
 
-- concatenating names per store via PIVOT (SQL Server 2012+)
SELECT BusinessEntityID
, [1] Contact1
, [2] Contact2
, [3] Contact3
, [4] Contact4
, Concat([1], IsNull(';' + [2], ''), IsNull(';' + [3], ''), IsNull(';' + [4], '')) Contacts
FROM (
	-- concatenating names and adding a rank
	SELECT BusinessEntityID
	, Concat(FirstName, ' ', LastName) Contact
	, ROW_NUMBER() OVER(PARTITION BY BusinessEntityID ORDER BY FirstName) Ranking
	FROM Sales.vStoreWithContacts
) PER
PIVOT (
    Max(Contact)
	FOR Ranking IN ([1], [2], [3], [4])
) AS DAT

To port the code to SQL Server 2005, the Concat function calls would need to be rewritten, though.
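As a sketch of what that rewrite could look like for the inner query (Concat treats NULLs as empty strings, so the + operator needs explicit IsNull guards; the outer Concat over the pivoted columns would need the same treatment):

-- inner query ported to SQL Server 2005 (sketch)
SELECT BusinessEntityID
, IsNull(FirstName, '') + ' ' + IsNull(LastName, '') Contact
, ROW_NUMBER() OVER(PARTITION BY BusinessEntityID ORDER BY FirstName) Ranking
FROM Sales.vStoreWithContacts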

Talking about workarounds for splitting strings, in certain scenarios I used a combination of CutLeft & CutRight functions, which proved useful in data migrations, or my own versions of STRING_SPLIT (see SplitListWithIndex or SplitList). For concatenations I used mainly CTEs (see example) or, in exceptional cases, cursors (see example).
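For completeness, the concatenation workaround most often seen before STRING_AGG relies on FOR XML PATH in a correlated subquery; a sketch against the same view (special XML characters would need the TYPE directive and .value() to come through unescaped):

-- concatenating names per store via FOR XML PATH (SQL Server 2005+)
SELECT DISTINCT SWC.BusinessEntityID
, Stuff((
	SELECT ';' + FirstName + ' ' + LastName
	FROM Sales.vStoreWithContacts
	WHERE BusinessEntityID = SWC.BusinessEntityID
	FOR XML PATH('')
  ), 1, 1, '') Contacts -- Stuff removes the leading ';'
FROM Sales.vStoreWithContacts SWC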

21 May 2020

SQL Reloaded: Splitting a String (Before and After)

Starting with SQL Server 2016, Microsoft introduced the STRING_SPLIT table-valued function, which splits a string into rows of substrings based on a specified separator character (e.g. ":").


-- splitting a string (SQL Server 2016+)
SELECT *
FROM STRING_SPLIT('100001::M:black:', ':') 

Table-valued functions also allowed implementing the same functionality in earlier versions, either by looping through the string or by using a common table expression. Here's an implementation based on a loop (modified from a previous version to include an index):


-- split table-valued function with an index
CREATE FUNCTION [dbo].[SplitListWithIndex]( 
  @ListValues varchar(500) 
, @Delimiter char(1)) 
RETURNS @Temp TABLE(
  Ind int
, Value varchar(50)) 
AS 
BEGIN 
 DECLARE @Index int 
 DECLARE @Length int 
 DECLARE @Ind int 

 SET @Index = CharIndex(@Delimiter, @ListValues) 
 SET @Length = Len(@ListValues) - @Index 
 SET @Ind = 1
   
 WHILE @Index > 0 --if the fetch was successful
 BEGIN 
  INSERT @Temp 
  VALUES(@Ind, Substring(@ListValues, 0, @Index)) 

  SET @ListValues = Substring(@ListValues, @Index+1, @Length) 
  SET @Index = CharIndex(@Delimiter, @ListValues) 
  SET @Length = @Length - @Index 
  SET @Ind = @Ind + 1
 END  
   
 INSERT @Temp 
 VALUES(@Ind, @ListValues) 
RETURN 
END  
GO

The function can be called in code the same way as STRING_SPLIT:

-- splitting a string (SQL Server 2000+)
SELECT *
FROM dbo.SplitListWithIndex('100001::M:black:', ':') 

The two functions are helpful when a column stores delimited values. It's the case with Dynamics 365, which stores an SKU (Stock Keeping Unit) – the Product Number together with its Dimensions (ItemId, ConfigId, InventSizeId, InventColorId, StyleId) – delimited by ":", like in the above examples. Therefore, to parse the records one could write code like this:


-- parsing delimited values (SQL Server 2000+)
SELECT DAT.ProductNumber
, Max(CASE WHEN LV.Ind = 1 THEN LV.Value END) ItemId
, Max(CASE WHEN LV.Ind = 2 THEN LV.Value END) Configid
, Max(CASE WHEN LV.Ind = 3 THEN LV.Value END) InventSizeId
, Max(CASE WHEN LV.Ind = 4 THEN LV.Value END) InventColorId
, Max(CASE WHEN LV.Ind = 5 THEN LV.Value END) StyleId
FROM ( VALUES ('100001::S:black:')
 , ('100001::M:black:')
 , ('100001::L:black:')
 , ('100001::XL:black:')
 , ('100001::S:white:')
 , ('100001::M:white:')
 , ('100001::L:white:')
 , ('100001::XL:white:')
) DAT (ProductNumber)
CROSS APPLY dbo.SplitListWithIndex(DAT.ProductNumber, ':') LV
GROUP BY DAT.ProductNumber
ORDER BY DAT.ProductNumber

Similar output can be obtained via STRING_SPLIT with the help of the ROW_NUMBER() ranking window function, introduced with SQL Server 2005 (note though that STRING_SPLIT doesn't guarantee the order of its output, so an index derived this way isn't fully reliable):

-- parsing delimited values (SQL Server 2016+)
SELECT LV.ProductNumber
, Max(CASE WHEN LV.Ind = 1 THEN LV.Value END) ItemId
, Max(CASE WHEN LV.Ind = 2 THEN LV.Value END) Configid
, Max(CASE WHEN LV.Ind = 3 THEN LV.Value END) InventSizeId
, Max(CASE WHEN LV.Ind = 4 THEN LV.Value END) InventColorId
, Max(CASE WHEN LV.Ind = 5 THEN LV.Value END) StyleId
FROM (
 SELECT DAT.ProductNumber
 , XT.VALUE
 , ROW_NUMBER() OVER (PARTITION BY DAT.ProductNumber ORDER BY DAT.ProductNumber) Ind
 FROM ( VALUES ('100001::S:black:')
  , ('100001::M:black:')
  , ('100001::L:black:')
  , ('100001::XL:black:')
  , ('100001::S:white:')
  , ('100001::M:white:')
  , ('100001::L:white:')
  , ('100001::XL:white:')
 ) DAT (ProductNumber)
 CROSS APPLY STRING_SPLIT(DAT.ProductNumber, ':') XT
) LV
GROUP BY LV.ProductNumber
ORDER BY LV.ProductNumber

As can be seen, the introduction of an index into the dbo.SplitListWithIndex function simplified the code, making the use of a ranking window function unnecessary. It would be useful for STRING_SPLIT to provide the same, as this type of processing is pretty common.
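This gap was eventually closed: starting with SQL Server 2022, STRING_SPLIT accepts a third argument, enable_ordinal, which adds an ordinal column with the token's position:

-- splitting a string with ordinal (SQL Server 2022+)
SELECT value
, ordinal
FROM STRING_SPLIT('100001::M:black:', ':', 1)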
Here’s another example based on the PIVOT clause introduced also in SQL 2005:


-- parsing delimited values (SQL Server 2016+)
SELECT P.ProductNumber
, Cast(Trim([1]) as nvarchar(20)) ItemId
, Cast(Trim([2]) as nvarchar(20)) ConfigId
, Cast(Trim([3]) as nvarchar(20)) InventSizeid
, Cast(Trim([4]) as nvarchar(20)) InventColorId
, Cast(Trim([5]) as nvarchar(20)) StyleId
FROM (
 SELECT DAT.ProductNumber
 , XT.VALUE
 , ROW_NUMBER() OVER (PARTITION BY DAT.ProductNumber ORDER BY DAT.ProductNumber) Ranking
 FROM ( VALUES ('100001::S:black:')
  , ('100001::M:black:')
  , ('100001::L:black:')
  , ('100001::XL:black:')
  , ('100001::S:white:')
  , ('100001::M:white:')
  , ('100001::L:white:')
  , ('100001::XL:white:')
 ) DAT (ProductNumber)
 CROSS APPLY STRING_SPLIT(DAT.ProductNumber, ':') XT
) DAT
PIVOT (MAX(DAT.[VALUE]) FOR DAT.Ranking IN ([1],[2],[3],[4],[5])) P

Happy coding!

12 April 2010

Data Warehousing: Pivot/Unpivot (Definitions)

"To rotate rows to columns, and columns to rows, in a cross-tabular data browser. Also refers to choosing dimensions from the set of available dimensions in a multidimensional data structure for display in the rows and columns of a cross-tabular structure." (Microsoft Corporation, "SQL Server 7.0 System Administration Training Kit", 1999)

"The process of rotating the view of data. For example, viewing what was the x-axis in the y-axis’s position and vice versa." (Microsoft Corporation, "Microsoft SQL Server 7.0 Data Warehouse Training Kit", 2000)

"The UNPIVOT operator is used within a SELECT statement to create a normalized data report from data that is stored as a spreadsheet." (Darril Gibson, "MCITP SQL Server 2005 Database Developer All-in-One Exam Guide", 2008)

"The PIVOT operator is used within a SELECT statement. It is used to create cross-tab reports (similar to a spreadsheet) from normalized data." (Darril Gibson, "MCITP SQL Server 2005 Database Developer All-in-One Exam Guide", 2008)

"The act of rotating rows to columns, and columns to rows." (Jim Joseph, "Microsoft SQL Server 2008 Reporting Services Unleashed", 2009)

[unpivot:] "To expand values from multiple columns in a single record into multiple records with the same values in a single column." (Microsoft, "SQL Server 2012 Glossary", 2012)

"To rotate a table-valued expression by turning the unique values from one column in the expression into multiple columns in the output, and perform aggregations where they are required on any remaining column values that are wanted in the final output." (Microsoft, "SQL Server 2012 Glossary", 2012)

"In the Lean start-up world, a pivot is a structured, often rapid, course correction on the basis of new market, customer, and development information." (Pamela Schure & Brian Lawley, "Product Management For Dummies", 2017)

 "1. To rotate rows to columns, and columns to rows, in a cross-tabular data browser. 2. To choose dimensions from the set of available dimensions in a multidimensional data structure for display in the rows and columns of a cross-tabular structure." (Microsoft Technet)

[unpivot:] "In Integration Services, the process of creating a more normalized dataset by expanding data columns in a single record into multiple records." (Microsoft Technet)

04 February 2010

SQL Reloaded: Just In CASE III (Quest for Performance)

Arranging branches

    Given the fact that the CASE function stops after the first match, as long as no two expressions evaluate to true for the same value, the CASE can be optimized for performance by arranging the branches in descending order of the number of matches. For example, taking the logic from the 3rd example from Part I and checking the number of occurrences for each value, surprisingly (or maybe not) there are 257 NULL values, 97 'L' values, 82 'H' values, respectively 68 'M' values. If we arrange the branches in this order, the number of evaluations decreases from 82*1+68*2+97*3+257*3=1280 to 257*1+97*2+82*3+68*3=901. The difference is small, though when applied to big data sets the improvement could pay off the effort. Here is the modified logic for Example 3:
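    The original snippet isn't reproduced in this archive; as a sketch, assuming (for illustration) that the values come from the Class attribute of AdventureWorks' Production.Product, the rearranged branches would look like this:

-- CASE branches arranged by descending frequency (sketch)
SELECT ProductID
, Class
, CASE 
	WHEN Class IS NULL THEN 'n/a' -- 257 rows, matched on the 1st test
	WHEN Class = 'L' THEN 'Low'   -- 97 rows, matched on the 2nd test
	WHEN Class = 'H' THEN 'High'  -- 82 rows, matched on the 3rd test
	ELSE 'Medium'                 -- 68 rows reach the ELSE after 3 tests
  END ClassDescription
FROM Production.Product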
    This technique is unfortunately not bulletproof – even if less probable in many cases, the order of the values' occurrences could change overnight, the total number of evaluations varying over time with the number of occurrences for each value.

Rewriting Simple Unions

    Have you ever seen a set of similar queries whose output is merged with the help of the UNION (ALL) operator? I've seen many queries like the one below, in which the only difference between the sub-queries resides in the WHERE constraint(s) applied and maybe a few additional calculations!

    The example is quite simple, though I think it demonstrates the point. The UNION can be translated to a simple CASE as in the 3rd example. The problem with such queries is that if there is no index on the attribute(s) used in the WHERE constraint, a full table scan is performed for each sub-query, which is quite expensive from a performance standpoint. Even if the respective attribute is indexed, there's still a price in performance. Such an approach could be acceptable when in each scenario different tables are joined to the main table, though even then it should be checked which of the two approaches performs better. A recent example I can recall, in which several CASE functions could have been used instead of the UNION, was based on the JOIN of two tables, the query looking something like:
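    The original query isn't preserved in this archive; a minimal sketch of the pattern, first as a UNION and then collapsed into a single scan with a CASE (based, for illustration, on AdventureWorks' Production.Product):

-- merged sub-queries differing only in the WHERE constraint (sketch)
SELECT ProductID, 'High' Class FROM Production.Product WHERE Class = 'H'
UNION ALL
SELECT ProductID, 'Medium' Class FROM Production.Product WHERE Class = 'M'
UNION ALL
SELECT ProductID, 'Low' Class FROM Production.Product WHERE Class = 'L'

-- the CASE-based rewrite: one single scan
SELECT ProductID
, CASE Class WHEN 'H' THEN 'High' WHEN 'M' THEN 'Medium' WHEN 'L' THEN 'Low' END Class
FROM Production.Product
WHERE Class IN ('H', 'M', 'L')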


Rewriting Complex Unions

    Even when the base table is used in several union-merged queries, it doesn't make sense to apply this technique every time, especially when dealing with complex queries using multiple joins. It depends – there are scenarios and scenarios; the trade-off in performance, readability and usability of a query must always be considered when applying any technique. The Person.Address table from AdventureWorks is a good way to exemplify two scenarios in which it does and doesn't make sense to combine the logic of two union queries. The respective table stores the addresses for Customers (Stores & Individuals), Vendors and Employees, thus, supposing that we would like to see the Owner and Owner Type for each Address, we would need to create several unions.

    Let’s consider first the Customer-based Addresses – as there are two types of Customers and because the details for each type are stored in different tables, we might end up creating the following query:

    As can be seen, the two queries are quite similar; most of the joined tables are the same, with 3 exceptions: Sales.Store vs. Sales.Individual & Person.Contact. The query could be rewritten using a CASE and left joins instead of inner joins as follows:
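    The original queries aren't preserved in this archive; a condensed sketch of the combined version, based on the AdventureWorks schema of that time (column list abridged):

-- Store & Individual Customer Addresses via left joins and CASE (sketch)
SELECT PA.AddressID
, PA.AddressLine1
, PA.City
, CASE SC.CustomerType 
	WHEN 'S' THEN SS.Name 
	ELSE PC.FirstName + ' ' + PC.LastName 
  END Owner
, CASE SC.CustomerType 
	WHEN 'S' THEN 'Store Customer' 
	ELSE 'Individual Customer' 
  END OwnerType
FROM Person.Address PA
	JOIN Sales.CustomerAddress SCA ON PA.AddressID = SCA.AddressID
	JOIN Sales.Customer SC ON SCA.CustomerID = SC.CustomerID
	LEFT JOIN Sales.Store SS ON SC.CustomerID = SS.CustomerID
	LEFT JOIN Sales.Individual SI ON SC.CustomerID = SI.CustomerID
	LEFT JOIN Person.Contact PC ON SI.ContactID = PC.ContactID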


    Similar queries could be written in order to get the Vendor and Employee details, and as can be seen, these queries also share many similarities.
    Does it make sense to use the above technique in this case too? Maybe it would be acceptable to merge the Vendor with the Employee Addresses queries, though does it make sense to also merge the Customer Addresses using CASE & left joins? The queries are not so complex, so it's possible to do that, though I think a decision should be taken only after the performance of the two approaches has been studied.

Pivoting Behavior

    Another technique in which a CASE can help eliminate several joins to the same table is presented in a post on List Based Aggregations, the CASE function being used together with aggregate functions in order to cumulate the On Hand for several locations and show it on the same line. This approach was quite useful before the PIVOT operator was introduced in SQL Server 2005 (see the PIVOT Operator Example post), and it still is when more than one attribute needs to be selected for each line.
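    The list-based aggregation technique referenced above can be sketched as follows (abridged to two locations), the CASE inside the aggregate doing the work the PIVOT operator later took over:

-- cumulating On Hand per Location via CASE & aggregation (sketch, SQL Server 2000+)
SELECT PP.Name
, PP.ProductNumber
, SUM(CASE WHEN PL.Name = 'Paint Storage' THEN PPI.Quantity ELSE 0 END) OnHandPaintStorage
, SUM(CASE WHEN PL.Name = 'Frame Forming' THEN PPI.Quantity ELSE 0 END) OnHandFrameForming
, SUM(PPI.Quantity) OnHand -- total across all locations
FROM Production.ProductInventory PPI
	JOIN Production.Product PP ON PPI.ProductID = PP.ProductID
	JOIN Production.Location PL ON PPI.LocationID = PL.LocationID
GROUP BY PP.Name, PP.ProductNumber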

15 September 2007

SQL Reloaded: PIVOT Operator Example

    Starting with SQL Server 2005, Microsoft introduced the PIVOT operator, which tries to implement the pivoting feature available in Excel; however, it does only static pivoting – the values on which the pivoting is done must be mentioned explicitly.
    In List Based Aggregations I gave an example of a query in which the On Hand was cumulated per Location; the same output can be obtained with the below query:
SELECT Name
, ProductNumber
, MakeFlag
, FinishedGoodsFlag
, IsNull([Paint Storage], 0) OnHandPaintStorage
, IsNull([Frame Forming], 0) OnHandFrameForming
, IsNull([Frame Welding], 0) OnHandFrameWelding
, IsNull([Debur and Polish], 0) OnHandDeburPolish
, IsNull([Specialized Paint], 0) OnHandSpecializedPaint
, IsNull([Subassembly], 0) OnHandAssembly
, IsNull([Final Assembly], 0) OnHandFinalAssembly
, IsNull([Paint Storage], 0) + IsNull([Frame Forming], 0) + IsNull([Frame Welding], 0) + IsNull([Debur and Polish], 0) + IsNull([Specialized Paint], 0) + IsNull([Subassembly], 0) + IsNull([Final Assembly], 0) OnHand
, ListPrice
, StandardCost
, (IsNull([Paint Storage], 0) + IsNull([Frame Forming], 0) + IsNull([Frame Welding], 0) + IsNull([Debur and Polish], 0) + IsNull([Specialized Paint], 0) + IsNull([Subassembly], 0) + IsNull([Final Assembly], 0) ) * StandardCost Value
FROM
(
SELECT PP.Name
, PP.ProductNumber
, PP.MakeFlag
, PP.FinishedGoodsFlag
, PL.Name AS Location
, PP.ListPrice
, PP.StandardCost
, PPI.Quantity
FROM Production.ProductInventory PPI
JOIN Production.Product PP
ON PPI.ProductID = PP.ProductID
JOIN Production.Location PL
ON PPI.LocationID = PL.LocationID
) AS A
PIVOT
(
SUM(Quantity)
FOR Location IN ([Paint Storage], [Frame Forming], [Frame Welding], [Debur and Polish], [Specialized Paint], [Subassembly], [Final Assembly])
) PVT

    The query is not rocket science, but it took me some time to figure out that when I need to use multiple tables, I have to first create a subquery, and only then apply the PIVOT operator. This is how I initially tried to write the query:
SELECT PP.Name
, PP.ProductNumber
, PP.MakeFlag
, PP.FinishedGoodsFlag
, IsNull([Paint Storage], 0) OnHandPaintStorage
, IsNull([Frame Forming], 0) OnHandFrameForming
, IsNull([Frame Welding], 0) OnHandFrameWelding
, IsNull([Debur and Polish], 0) OnHandDeburPolish
, IsNull([Specialized Paint], 0) OnHandSpecializedPaint
, IsNull([Subassembly], 0) OnHandAssembly
, IsNull([Final Assembly], 0) OnHandFinalAssembly
, IsNull([Paint Storage], 0) + IsNull([Frame Forming], 0) + IsNull([Frame Welding], 0) + IsNull([Debur and Polish], 0) + IsNull([Specialized Paint], 0) + IsNull([Subassembly], 0) + IsNull([Final Assembly], 0) OnHand
, PP.ListPrice
, PP.StandardCost
, (IsNull([Paint Storage], 0) + IsNull([Frame Forming], 0) + IsNull([Frame Welding], 0) + IsNull([Debur and Polish], 0) + IsNull([Specialized Paint], 0) + IsNull([Subassembly], 0) + IsNull([Final Assembly], 0) ) * StandardCost Value
FROM Production.ProductInventory PPI
JOIN Production.Product PP
ON PPI.ProductID = PP.ProductID
JOIN Production.Location PL
ON PPI.LocationID = PL.LocationID
PIVOT
(
SUM(Quantity)
FOR PL.Name IN ([Paint Storage], [Frame Forming], [Frame Welding], [Debur and Polish], [Specialized Paint], [Subassembly], [Final Assembly])
) PVT
and got the errors:
Msg 8156, Level 16, State 1, Line 1
The column 'ProductID' was specified multiple times for 'PVT'.
Msg 4104, Level 16, State 1, Line 1
The multi-part identifier "PP.Name" could not be bound.
Msg 4104, Level 16, State 1, Line 1
The multi-part identifier "PP.ProductNumber" could not be bound.
Msg 4104, Level 16, State 1, Line 1
The multi-part identifier "PP.MakeFlag" could not be bound.
Msg 4104, Level 16, State 1, Line 1
The multi-part identifier "PP.FinishedGoodsFlag" could not be bound.
Msg 4104, Level 16, State 1, Line 1
The multi-part identifier "PP.ListPrice" could not be bound.
Msg 4104, Level 16, State 1, Line 1
The multi-part identifier "PP.StandardCost" could not be bound.

Notes:
1. It's probably better to create a view for the subquery and use the view instead. I do that most of the time; sometimes I prefer to write table-valued functions – more on this in another post.
2. It seems only one aggregation function can be used with PIVOT, which is pretty bad, because even if you start writing your query using a PIVOT and then the requirements change (e.g. it's needed to output how many records exist for each Location; a more appropriate example is the one of Purchase Orders, in which aggregations could be applied on the Ordered, Receipt and Open Quantities), you'll have to change the whole query. I'm expecting more from the next implementations of the PIVOT operator; I hope Microsoft has that on their list.
3. The benefit of the PIVOT operator is supposed to be that it allows writing shorter code, though not necessarily. Of course, it allows eliminating the redundant code – the multiple CASE WHEN statements – however this works when there is only one column (Location in our example) used to do the split; what if there is additional logic which must be included in the CASEs?
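As a workaround for the single-aggregate limitation mentioned in note 2, conditional aggregation can combine several aggregates at once; a sketch mixing SUM and COUNT per Location (abridged to two locations):

-- several aggregates per pivoted column via CASE (sketch)
SELECT PP.ProductNumber
, SUM(CASE WHEN PL.Name = 'Paint Storage' THEN PPI.Quantity ELSE 0 END) OnHandPaintStorage
, COUNT(CASE WHEN PL.Name = 'Paint Storage' THEN 1 END) RecordsPaintStorage
, SUM(CASE WHEN PL.Name = 'Frame Forming' THEN PPI.Quantity ELSE 0 END) OnHandFrameForming
, COUNT(CASE WHEN PL.Name = 'Frame Forming' THEN 1 END) RecordsFrameForming
FROM Production.ProductInventory PPI
	JOIN Production.Product PP ON PPI.ProductID = PP.ProductID
	JOIN Production.Location PL ON PPI.LocationID = PL.LocationID
GROUP BY PP.ProductNumber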
