SQL Troubles: list of values

Showing posts with label list of values. Show all posts

06 July 2020

🪄SSRS (& Paginated Reports): Design (Part II: Parameter Dependencies in Dropdowns)

Introduction

Quite often there are dependencies between the values of two distinct dropdowns, typically the values representing the different levels of a direct on implied hierarchy. This means that the choice at the upper level determines the values which will be shown at a lower level, the corresponding query needing to handle the value as a parameter. There are several ways on how one can handle this, the choice depending also on the data provider considered for the report.

Because the AdventureWorks database is used within an SQL Server, the available choices are MSOLEDBSQL, SQLOLEDB, ODBC or SQL Server Native Client. It should be noted that the SQLOLEDB and the Native Clients providers have been deprecated and shouldn't be used for new developments! In the project was used the Microsoft OLE DB Provider for SQL Server (MSOLEDBSQL) provider.

For exemplification will be used the reports build previously. The hierarchy will be built based on the State and Territory, a Country belonging to a Territory, while a State to a Country. For the beginning we will build the dependency existing between the Country and State.

First Level of Dependency

The text parameter for StateCountry dataset will be defined similarly to the one consider for the Country, and have as source the following query:

-- State Provinces
SELECT '' Id
, '(All)' Name
UNION ALL
SELECT StateProvinceCode
, StateProvinceCode + ' ' + Name
FROM [Person].[StateProvince]
WHERE CountryRegionCode = ISNULL(NullIf(?, ''), CountryRegionCode)
ORDER BY Id

The Country parameter will be handled in the WHERE constraint - it's the role of the question mark. When an empty string was selected, the equivalent for ignoring the parameter, the formula used will result to an identity, which however makes sense only if there's always a value for the Country Code in the Person.StateProvince table. Otherwise the same line needs to be written as follows:

WHERE IsNull(CountryRegionCode, '') = COALESCE(NullIf(?, ''), CountryRegionCode, '')

This might look strange, though it does the trick of handling the nulls. According to the rules of defensive programming, one should treat the eventuality of nulls. However, this depends also on application's design. If one develops reports with this supposition, then the supposition should be made explicit in the documentation!

Once the query provided, SSRS recognizes that a parameter was provided and requires to map the question mark to the respective parameter. Within dataset's properties click on the "Parameter" tab and selected @Country as Parameter Value:

Parameters in Dataset Properties

A question mark needs to be considered in the query for each parameter, the parameters from the Dataset Properties being displayed in the order in which they appear in the query.

Once the dataset defined, one can go on and define the StateProvince parameter similarly to the Country parameter and consider the StateProvinces dataset as source. The StateProvince parameter needs to be added also in report's header:
= "State Province: " & Parameters!StateProvince.Label

Within the Addresses main dataset there needs to be added the declaration for the StateProvince parameter, while in where the corresponding clause:

-- SQL Parameter
& " DECLARE @StateProvince as nvarchar(3) = '" & Parameters!StateProvince.Value & "'" & vbcrlf 

-- constraint in WHERE clause
& IIf(Parameters!StateProvince.Value<> "", "     AND SIC.StateProvinceCode = @StateProvince", "") & vbcrlf W

With these changes the report should be ready to run.

Notes:
The above query for the StateProvince dataset can be used inside of a formula with a few small changes. This should work independently of the provider used. Just replace the query in the formula and delete the parameters previously created for the data source via the Dataset Properties:

= " -- State Provinces " & vbCrLf
 & " DECLARE @Country as nvarchar(3) = '" & Parameters!Country.Value & "'" & vbcrlf 
 & " SELECT '' Id " & vbCrLf
 & " , '(All)' Name " & vbCrLf
 & " UNION ALL " & vbCrLf
 & " SELECT StateProvinceCode " & vbCrLf
 & " , StateProvinceCode + ' ' + Name " & vbCrLf
 & " FROM [Person].[StateProvince] " & vbCrLf
 & IIf(Parameters!Country.Value<>"", "     WHERE CountryRegionCode = @Country", "") & vbcrlf 
 & " ORDER BY Id " & vbCrLf

Second Level of Dependency

The second dependency is based on the Territory Group formed of North America, Pacific and Europe. There is however no Territory Group table, the relations between Countries and Territory Groups being stored into the Sales.SalesTerritory table. Therefore, the query for the corresponding dropdown can be written via a DISTINCT:

-- Territories
SELECT '' Id
, '(All)' Name
UNION ALL
SELECT DISTINCT [Group] Id
, [Group] Name
FROM Sales.SalesTerritory
ORDER BY Id

When the user selects a Territory, in the Countries and State Province dropdowns are supposed to be shown only the corresponding values for the selected value. We need to modify thus both queries via formulas.
The query for the Country dataset includes the Territory parameter and the corresponding constraint via an EXIST:

= "  -- Countries  " & vbCrLf 
& " DECLARE @Territory as nvarchar(50) = '" & Parameters!Territory.Value & "'" & vbcrlf 
& " SELECT '' Id  " & vbCrLf 
& " , '(All)' Name  " & vbCrLf 
& " UNION ALL  " & vbCrLf 
& " SELECT SSP.CountryRegionCode Id  " & vbCrLf 
& " , SSP.Name   " & vbCrLf 
& " FROM [Person].[CountryRegion] SSP " & vbCrLf 
& IIf(Parameters!Territory.Value<> "", "     WHERE EXISTS (SELECT * FROM [Sales].[SalesTerritory] SST WHERE SSP.CountryRegionCode = SST.CountryRegionCode AND SST.[Group] = @Territory)", "") & vbcrlf 
& " ORDER BY Id " & vbCrLf

In exchange the StateProvince dataset is slightly more complex:

= " -- State Provinces " & vbCrLf
 & " DECLARE @Country as nvarchar(3) = '" & Parameters!Country.Value & "'" & vbcrlf 
 & " DECLARE @Territory as nvarchar(50) = '" & Parameters!Territory.Value & "'" & vbcrlf 
 & " SELECT '' Id " & vbCrLf
 & " , '(All)' Name " & vbCrLf
 & " UNION ALL " & vbCrLf
 & " SELECT PSP.StateProvinceCode " & vbCrLf
 & " , PSP.StateProvinceCode + ' ' + PSP.Name " & vbCrLf
 & " FROM [Person].[StateProvince] PSP" & vbCrLf
 & "      JOIN [Sales].[SalesTerritory] PST" & vbCrLf
 & "       ON PSP.TerritoryId = PST.TerritoryId " & vbCrLf
 & " WHERE 0=0 " & vbCrLf
 & IIf(Parameters!Country.Value<> "", "     AND PSP.CountryRegionCode = @Country", "") & vbcrlf 
 & IIf(Parameters!Territory.Value<> "", "     AND PST.[Group] = @Territory", "") & vbcrlf 
 & " ORDER BY Id " & vbCrLf

Once the datasets changed, is still needed to modify the order of the parameters to reflect the order in which they are called within dependency, respectively Territory, State Province, Country. In the end the report looks like this:

Notes:
(1) If the order of the parameters doesn't reflect the otder in which they are called, you'll get a generic error similar with the following one (the error appears when there are problems with the query):
An error occurred during local report processing. The definition of the report '/Customer Addresses' is invalid. The CommandText expression for the query 'Countries' contains an error: [BC30277] Type character '8e does not match declared data type 'String'.

(2) When designing parameters one should consider also the best practices for avoiding the poor design of parameters.

Previous Post <<||>> Next Post

18 October 2018

💎🏭SQL Reloaded: String_Split Function

Today, as I was playing with a data model – although simplistic, the simplicity kicked back when I had to deal with fields in which several values were encoded within the same column. The “challenge” resided in the fact that the respective attributes were quintessential in analyzing and matching the data with other datasets. Therefore, was needed a mix between flexibility and performance. It was the point where the String_Split did its magic. Introduced with SQL Server 2016 and available only under compatibility level 130 and above, the function splits a character expression using specified separator.
Here’s a simplified example of the code I had to write:

-- cleaning up
-- DROP TABLE dbo.ItemList

-- test data 
SELECT A.*
INTO dbo.ItemList 
FROM (
VALUES (1, '1001:a:blue')
, (2, '1001:b:white')
, (3, '1002:a:blue')
, (4, '1002:b:white')
, (5, '1002:c:red')
, (6, '1003:b:white')
, (7, '1003:c:red')) A(Id, List)

-- checking the data
SELECT *
FROM dbo.ItemList 

-- prepared data
SELECT ITM.Id 
, ITM.List 
, DAT.ItemId
, DAT.Size
, DAT.Country
FROM dbo.ItemList ITM
   LEFT JOIN (-- transformed data 
 SELECT DAT.id
 , [1] AS ItemId
 , [2] AS Size
 , [3] AS Country
 FROM(
  SELECT ITM.id
  , TX.Value
  , ROW_NUMBER() OVER (PARTITION BY ITM.id ORDER BY ITM.id) Ranking
  FROM dbo.ItemList ITM
  CROSS APPLY STRING_SPLIT(ITM.List, ':') TX
  ) DAT
 PIVOT (MAX(DAT.Value) FOR DAT.Ranking IN ([1],[2],[3])) 
 DAT
  ) DAT
   ON ITM.Id = DAT.Id

And, here’s the output:

For those dealing with former versions of SQL Server the functionality provided by the String_Split can be implemented with the help of user-defined functions, either by using the old fashioned loops (see this), cursors (see this) or more modern common table expressions (see this). In fact, these implementations are similar to the String_Split function, the difference being made mainly by the performance.

Notes:
The queries work also in SQL databases in Microsoft Fabric (see file in GitHub repository). You might want to use another schema (e.g. Test), not to interfere with the existing code.

Happy coding!

18 June 2011

💎SQL Reloaded: Pulling the Strings of SQL Server X (Dynamic Queries)

A dynamic query is a query constructed at runtime, techniques often indispensable in many situations that require a certain flexibility in query’s creation. The creation of a dynamic query is nothing but a set of operations with strings, many of the techniques mentioned before becoming handy. SQL Server provides two functions for the execution of dynamic queries, namely EXECUTE statements (or its shortened form EXEC) and sp_executesql stored procedure. Even if the later it’s more flexible allowing passing parameters from and to the caller and allows reusing executions plans (see Using sp_esecutesql), for the following examples will be used only EXEC. But before let’s look how a static could become dynamic. For this let’s consider the following query based on AdventureWorks database:

-- example 1 - simple query   
SELECT *  
FROM Person.Address  
WHERE AddressID = 1 

-- example 2 - query encapsulated in a string: 
EXEC ('SELECT * FROM Person.Address WHERE AddressID = 1') 

-- example 3 - query stored into a string variable      
EXEC ('SELECT * FROM Person.Address WHERE AddressID = 1')

Supposing that the AddressID is considered as parameter we can write:

-- example 4 - static query     
DECLARE @AddressID int 
SET @AddressID = 1 
SELECT *  
FROM Person.Address  
WHERE AddressID = @AddressID  

-- example 5 - dynamic query  
DECLARE @sql varchar(100) 
DECLARE @AddressID int 
SET @AddressID = 1 
SET @sql = 'SELECT * FROM Person.Address WHERE AddressID = ' + CAST(@AddressID as varchar (10)) 
EXEC (@sql)

Until here there is no important conceptual difference. What if is needed to pass multiple AddressIDs? We can create a parameter for which expected values, though that’s not a reasonable solution as the number of values can vary. A more elegant solution would be to create a list of values and provided as a string parameter and then concatenate the original query and the string parameter like below. We just need to accommodate the length of the string variable to the expected size of the list of value.

-- example 6 (dynamic query) 
DECLARE @sql varchar(100) 
DECLARE @AddressIDs varchar(50) -- supposed parameter 
SET @AddressIDs = '1, 2, 4, 5, 6, 10'  
SET @sql = 'SELECT * FROM Person.Address WHERE AddressID IN (' + @AddressIDs + ')' 
EXEC (@sql)

There is actually a third solution. As in the previous post on list of values has been introduced the dbo.StringToTable function, the function can be used thus to transform the list in a table:

-- example 7 (list of values) 
DECLARE @AddressIDs varchar(50) -- supposed parameter 
SET @AddressIDs = '1,2,4,5,6,10'  
SELECT *  
FROM Person.Address  
WHERE AddressID IN ( 
      SELECT value  
      FROM dbo.StringToTable(@AddressIDs, ','))

In the same post was constructed the DoubleList list of values which can be used in a dynamic query in bulk inserts or table-value constructors. The list needs to be slightly modified by replacing the single quote with two single quotes in order to accommodate value’s storage in a string. Considering that there are no integrity constraints on the targeted table, he query for bulk insert can be written as follows:

-- example 8 (list of values & bulk insert) 
DECLARE @sql varchar(200) 
DECLARE @AddressTypes varchar(150)  
SET @AddressTypes = '(6,''Archive''), (1,''Billing''), (2,''Home''), (3,''Main Office''), (4,''Primary''), (5,''Shipping'')' 
SET @sql = 'INSERT Person.AddressType (AddressTypeID, Name) VALUES ' + @AddressTypes  
EXEC (@sql)

The same technique can be used with a table-value constructor:

-- example 9 (list of values &amp; table-value constructor) 
DECLARE @sql varchar(400) 
DECLARE @AddressTypes varchar(150)  
SET @AddressTypes = '(6,''Archive''), (1,''Billing''), (2,''Home''), (3,''Main Office''), (4,''Primary''), (5,''Shipping'')' 
SET @sql = 'SELECT VA.AddressID, VA.AddressTypeID, AT.NAME FROM Purchasing.VendorAddress VA JOIN ( VALUES ' + @AddressTypes + ') AS AT(AddressTypeID, Name) ON VA.AddressTypeID = AT.AddressTypeID' 
EXEC (@sql)

The above examples are basic, in daily problems such queries can involve multiple parameters and operations. In addition, in the last examples the concatenation step was left out.

Happy coding!

17 June 2011

💎SQL Reloaded: Pulling the Strings of SQL Server VII (List of Values)

Introduction

Lists are one of the basic structures in Mathematics, the term referring to an (ordered) set of elements separated by comma, space or any other delimiter (e.g. “:”, “;”). The elements of a list can be numbers, words, functions, or any other type of objects. In the world of databases, a list is typically formed out of the values of a given column or a given record, however it could span also a combination of rows and records, is such cases two delimiters being needed – one for column and one for row. From here comes probably the denomination of list of values. In a more general accept a list of values could be regarded as a delimited/concatenated subset. Such lists are formed when needed to send the data between the layers of an application or applications, this type of encoding being quite natural. In fact, also the data in a database are stored in similar tabular delimited structure, more complex though.

An useful example in which the list of values are quite handy is the passing of multiple values within the parameter of stored procedure or function (see example). This supposes first building the list and then use the values in a dynamic build query (like in the before mentioned example) or by building a table on the fly. We can call the two operations composition, respectively decomposition of list of values.

Composition

Composition, whether on vertical or horizontal is nothing but a concatenation in which the values alternate with one or more delimiters. Let’s reconsider the concatenation based on the values of a Person.AddressType AdventureWorks table. As the logic for concatenating for one or more attributes is the same, the below example concatenates a list based on a single attribute, namely AddressTypeID in SingleList, respectively two attributes, AddressTypeID and Name.

-- concatenation of values across a table 
;WITH CTE (AddressTypeID, Name, Ranking) 
AS (--preparing the data       
     SELECT AddressTypeID  
     , Name 
     , ROW_NUMBER () OVER(ORDER BY Name) Ranking 
     FROM Person.AddressType 
     -- WHERE ... 
) 
, DAT (SingleList, DoubleList, Ranking) 
AS ( -- concatenating the values 
     SELECT Cast(AddressTypeID as varchar(max)) SingleList 
     , Cast('('+ Cast(AddressTypeID as varchar(10)) + ',''' + Name + ''')' as varchar(max)) DoubleList 
     , Ranking 
     FROM CTE 
     WHERE Ranking = 1 
     UNION ALL 
     SELECT DAT.SingleList + ',' + Cast(CTE.AddressTypeID as varchar(20)) SingleList 
    , Cast(DAT.DoubleList + ', ('+ Cast(CTE.AddressTypeID as varchar(10)) + ',''' + CTE.Name + ''')' as varchar(max)) DoubleList 
    , CTE.Ranking  
     FROM CTE          
       JOIN DAT           
          ON CTE.Ranking = DAT.Ranking + 1       
)  

-- the lists 
SELECT SingleList 
, DoubleList 
FROM DAT 
WHERE Ranking = (SELECT MAX(Ranking) FROM DAT)

The second example is based on atypical delimiters, resembling to the structure built for a batch insert or table value constructor-based statement, and as we’ll see later, ideal to be used in a dynamically-built query

Decomposition

Decomposition follows the inverse path, though it’s much easier to exemplify. In fact it’s used the same technique introduced in the last example from the previous post belonging to the same cycle, Subparts of a String, in which a space was used as delimiter. Another example is the dbo.SplitList function which decomposes a string using a loop.

-- decomposition of a string to a table using CTE 
CREATE FUNCTION dbo.StringToTable( 
 @str varchar(500) 
,@Delimiter char(1)) 
RETURNS @Temp TABLE ( 
Id int NOT NULL 
,Value varchar(50)) 
AS 
BEGIN  
     ;WITH CTE (PrevString, Position, Word)  
     AS (  
     SELECT LTrim(RTrim( CASE  
           WHEN CharIndex(@Delimiter, @str)>;0 THEN Right(@str, Len(@str)-CharIndex(@Delimiter, @str))  
           ELSE ''  
      END)) PrevString  
     , 1 Position  
     , LTrim(RTrim(CASE  
           WHEN CharIndex(@Delimiter, @str)>0 THEN LEFT(@str, CharIndex(@Delimiter, @str)-1)  
           ELSE @str  
       END)) Word  
      UNION ALL  
      SELECT LTrim(RTrim(CASE  
            WHEN CharIndex(@Delimiter, PrevString)>0 THEN Right(PrevString, Len(PrevString)-CharIndex(@Delimiter, PrevString))  
             ELSE ''  
       END)) PrevString  
      , Position + 1 Position  
      , LTrim(RTrim(CASE  
           WHEN CharIndex(@Delimiter, PrevString)>0 THEN LEFT(PrevString, CharIndex(@Delimiter, PrevString)-1)  
          ELSE PrevString  
      END)) Word      FROM CTE  
     WHERE Len(PrevString)>0  
    )  
     INSERT @Temp(Id, Value) 
     SELECT Position  
     , Word      FROM CTE  
     OPTION (maxrecursion 100)  
     RETURN 
END

Here are two examples based on the single list created above and another one based on alphabet:

-- decomposing a list
SELECT Id 
, value 
FROM dbo.StringToTable('6,1,2,3,4,5', ',')     


-- decomposing the "alphabet" 
SELECT Id 
, value 
FROM dbo.StringToTable('a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p,q,r,s,t,u,v,w,x,y,z', ',')

Even if the function deals only with a delimiter, it could be used to decompose lists involving multiple delimiters, as long the list is adequately built:

-- decomposing double list 
SELECT Id 
, value 
, Left(value, CHARINDEX(',', value)-1) LeftValue 
, Right(value, len(value)-CHARINDEX(',', value)) RightValue 
FROM dbo.StringToTable('6,Archive;1,Billing;2,Home;3,Main Office;4,Primary;5,Shipping', ';')

The tables built thus from list of values can be further used in queries when needed to create a table on the fly. It would be interesting maybe to show that the composition and decomposition are inverse functions, however that’s out of scope, at least for current set of posts.

05 April 2010

💎SQL Reloaded: The Power of Joins VI (Not-Equal Joins)

SQL Server 2008 documentation mentions the not-equal joins without giving an explicit definition, the term appearing within the context of joining values in two columns that are not equal. On the other side the not-equal join denomination could be used to refer more generally to the joins whose join constraints involve comparison operators other than equal (“=”), for example <, >, <>, BETWEEN, IN, and its negation NOT IN, LIKE, etc. The difference between the two views resides on the fact that some of the operators (e.g. IN, BETWEEN) could imply the equality of values from joined columns. Another objection could be made on whether the not-equality join constraint could be based only on one of the table’s attributes participating in the join, the other thus missing, such constraints could be considered also as join constraints even if in some scenarios they could have been written in the main WHERE clause as well. For the sake of simplicity and flexibility I would consider a not-equal join or not-equality join as a join in which at least one of the join constraints is based on a operator other than the equal operator. Could be discussed thus about not-equal join constraints referring to the join constraints that involve other operators than equal, and equal join constraints, the two types of joins could coexist in the same join. A not-equal join constraint often involves the possibility of being returned more records then if a equal join constraint was used.

If we talk about join constraints then more likely we refer to vertical joins, though also anti-joins and semi-joins could include sporadically not-equality operators. In a normal join based on equalities for the set of attributes participating in the join from the left table are matched all the records from the right table, even if this means retrieving more than one record. Same happens also for not-equal constraints even if the business scenarios in which such needs arises are not so intuitive. One relatively simple example I often met is the case when is needed to calculate the incremental aggregation of values, for example to calculate the incremental volume of PO placed over time:

SELECT DII.DateSequence 
, IsNull(POH.SubTotal, 0) SubTotal 
FROM dbo.fGetDatesInInterval('2002-01-01', '2002-01-31') DII 
    CROSS APPLY (--incremental PO Volume 
     SELECT SUM(POH.SubTotal) SubTotal 
    FROM Purchasing.PurchaseOrderHeader POH 
   WHERE POH.Status IN (2, 4) -- 2-Approved, 4-Complete  
       AND DATEDIFF(d, POH.OrderDate, DII.DateSequence)&gt;=0 --not-equal constraint 
    ) POH

In order to simplify the query I used the dbo.fGetDatesInInterval table-valued function that returns a sequence of dates within a given time interval, and a CROSS APPLY returning the aggregated value for each date returned by the function, for the calculations being considered all the records having an Order Date smaller or equal with the given (sequence) date. As can be seen from the above query the not-equal constraint is based on DateDiff function, instead we could have written POH.OrderDate<=DII.DateSequence, one of the reasons for its use is the fact that the constraint could be easily modified to show for example only the aggregated values for the past 90 days, the BETWEEN operator entering in scene:

SELECT DII.DateSequence 
, IsNull(POH.SubTotal, 0) SubTotal 
FROM dbo.fGetDatesInInterval('2002-01-01', '2002-01-31') DII 
     CROSS APPLY (--incremental PO Volume 
    SELECT SUM(POH.SubTotal) SubTotal 
    FROM Purchasing.PurchaseOrderHeader POH 
    WHERE POH.Status IN (2, 4) -- 2-Approved, 4-Complete  
        AND DATEDIFF(d, POH.OrderDate, DII.DateSequence) BETWEEN 0 AND 90--not-equal constraint 
    ) POH

The not-equal constraint from the query is part of interval-based constraints category which covers the scenarios in which one of the attributes from the tables participating in a join is checked whether it falls within a given interval, typically that being valid for numeric values as above or date value, as in the next query that shows the Last Standard Cost Details per Product.

-- Products &amp; Last Standard Cost details SELECT ITM.ProductID  
, ITM.ProductNumber  
, ITM.StandardCost 
, PCH.StartDate 
, PCH.EndDate 
FROM Production.Product ITM 
    JOIN Production.ProductCostHistory PCH 
     ON ITM.ProductID = PCH.ProductID AND GETDATE() BETWEEN PCH.StartDate AND IsNull(PCH.EndDate, GETDATE())

    Such queries need often to be written when the records in a table have a given validity, no overlap existing between intervals.

    Using list of values within a join constraint and thus the IN or NOT IN operators occurs when for example the base table stores data coming from multiple sources and more encodings are used for the same source, each of them caring its meaning. For example using the Production.TransactionHistory and several other tables coming with AdventureWorks database, and supposing that for Purchase Orders are two transaction types P and B, and for Sales Orders S and I, a general scope query on Transactions could be written as follows:

-- Transaction History 

SELECT PTH.TransactionID 
, PTH.ProductID 
, PTH.ReferenceOrderID 
, CASE PTH.TransactionType 
     WHEN 'W' THEN 'Work Order' 
    WHEN 'S' THEN 'Sales Order' 
    WHEN 'I' THEN 'Internal Sales Order' 
    WHEN 'P' THEN 'Purchase Order' 
    WHEN 'B' THEN 'Blanket Purchase Order' 
    ELSE 'n.a' 
  END TransactionType , CASE  
    WHEN PTH.TransactionType = 'W' THEN StartDate 
    WHEN PTH.TransactionType IN ('S', 'I') THEN SOH.OrderDate 
    WHEN PTH.TransactionType IN ('P', 'B') THEN POH.OrderDate 
  END ReferenceDate FROM Production.TransactionHistory PTH 
    LEFT JOIN Purchasing.PurchaseOrderHeader POH       ON PTH.ReferenceOrderID = POH.PurchaseOrderID   AND PTH.TransactionType IN ('P', 'B') --not-equal constraint 
   LEFT JOIN Sales.SalesOrderHeader SOH 
     ON PTH.ReferenceOrderID = SOH.SalesOrderID 
  AND PTH.TransactionType IN ('S', 'I')  --not-equal constraint 
    LEFT JOIN Production.WorkOrder PWO 
      ON PTH.ReferenceOrderID = PWO.WorkOrderID 
   AND PTH.TransactionType = 'W'

    The need for writing such a query is seldom and could have been written as 3 distinct queries whose results are joined with two unions, both approaches having their pluses and minuses.

   In one of the above queries was used a constraint based on DateDiff function, and in theory any other function could be used in a join constraint, including user-defined functions. Such function-based join constraints are handy but should be used with caution because they could impact query’s performance. Pattern-based join constraints used with LIKE operator could be used as well.

    There are cases in which the constraints that typically would be included in the WHERE clause are added in the body of the join, apparently without any good reason. Such based denatured join constraints are base only on the attributes of one of the tables involved in the join, like in the next query:

-- PO for Products with StandardCost&gt;10 SELECT POD.PurchaseOrderDetailID  
, POD.ProductID  
, ITM.ProductNumber  
, ITM.StandardCost 
, POD.UnitPrice 
, POD.OrderQty 
FROM Purchasing.PurchaseOrderDetail POD  
    JOIN Production.Product ITM       ON POD.ProductID = ITM.ProductID 
  AND ITM.StandardCost&gt;10  -- denaturated join constraint

03 April 2010

💎SQL Reloaded: Power of Joins V (Semi-Joins and Anti-Joins)

An incursion in the world of joins is incomplete without approaching two other important techniques for writing queries: semi-joins and anti-joins, some of the techniques specific to them allowing to speed up queries and reduce queries’ complexity. An anti-join returns rows from the left table where no matches is found in the right table, while an semi-join returns returns rows from the left table if at least a match is found in the second table.

The typical techniques for creating semi-joins and anti-joins are based on IN or EXISTS operators, respectively their negations the NOT EXISTS and NOT IN, both operators being available in SQL Server and Oracle. According to the above definitions also the EXCEPT and INTERSECT operators introduced in horizontal joins post could be used to create anti-joins, respectively semi-joins too, they being introduced by Microsoft starting with SQL Server 2005 in order to compare whole rows rather then a smaller set of attributes. Another operators that allows creating anti-joins and semi-joins and introduced with SQL Server 2005 is the CROSS APPLY and OUTER APPLY operators.

Because the EXCEPT and INTERSECT operators were introduced in a previous post, this post is focused on the other mentioned operators, for exemplification of the techniques following to use again the Production.Product and Purchasing.PurchaseOrderDetail tables from AdventureWorks database.

IN operator allows to test whether a given (attribute) value is in a list of values, the respective list could be based also on the values returned by another query, called actually a subquery. The output is not influenced by the eventual duplicates in the list of values, though for the sake of simplicity and testability it makes sense to provide a list of distinct values.

EXISTS operator provides a similar functionality like the IN operator, however it checks only whether a record exists in what is called a correlated suquery (also known as repeated subquery, is a subquery making direct reference to one or more of the attributes of the main query). The subquery used with an IN operator could be easily translated to a correlated subquery by bringing the attribute used to create the list of values into the WHERE clause.

Let’s start with a simple problem to exemplify the use of a semi-join, for example to select only the products that have a PO (Purchase Order) placed, the query could be written as follows:

-- Products with PO placed (IN operator) 
SELECT ITM.ProductID  
, ITM.ProductNumber  
, ITM.StandardCost 
FROM Production.Product ITM  
WHERE ITM.ProductID IN (--list of Product IDs with PO placed 
    SELECT DISTINCT POD.ProductID  
     FROM Purchasing.PurchaseOrderDetail POD  
       JOIN Purchasing.PurchaseOrderHeader POH  
          ON POD.PurchaseOrderID = POH.PurchaseOrderID 
      WHERE POH.Status IN (2, 4) -- 2-Approved, 4-Complete  
     )

The above query could be rewritten using the EXISTS operator, the correlated query being based on ProductID attribute found in both tables:

-- Products with PO placed (EXISTS operator) 
SELECT ITM.ProductID  
, ITM.ProductNumber  
, ITM.StandardCost 
FROM Production.Product ITM  
WHERE EXISTS (--list of Product IDs with PO placed 
     SELECT 1 
     FROM Purchasing.PurchaseOrderDetail POD  
       JOIN Purchasing.PurchaseOrderHeader POH  
          ON POD.PurchaseOrderID = POH.PurchaseOrderID 
     WHERE POH.Status IN (2, 4) -- 2-Approved, 4-Complete  
         AND POD.ProductID = ITM.ProductID -- correlated constraint 
)

Notes:
1. As usual special attention must be given to the handling of NULL values either by replacing them with a default value or by handing the NULL case in the WHERE clause.
2. The Execution Plan for the above two queries is the same, SQL Server using a Hash Match (Left Semi Join) for both operators. I was trying to find information on whether there are any recommendations that advices on the use of one of the two operators in the detriment of the other, however I found nothing in this direction for SQL Server. R. Schrag offers a rule of dumb rooted in Oracle documentation and based on the selectivity of a query, a query being highly selective if it returns the least number of rows:

"If the main body of your query is highly selective, then an EXISTS clause might be more appropriate to semi-join to the target table. However, if the main body of your query is not so selective and the subquery (the target of the semi-join) is more selective, then an IN clause might be more appropriate." [1]

Both above solutions are ideal when no data are needed from the tables used in the subquery, however quite often such a need arises, for example to show the volume of PO placed, the last PO Date, and so on, in such cases an inner join could be used instead:

-- Products with PO placed (inner join) SELECT ITM.ProductID  
, ITM.ProductNumber  
, ITM.StandardCost 
, POD.TotalOrderQty 
, POD.LastOrderDate 
FROM Production.Product ITM  
JOIN (--aggregated PO details 
    SELECT POD.ProductID  
    , SUM(POD.OrderQty) TotalOrderQty 
    , Max(OrderDate) LastOrderDate 
     FROM Purchasing.PurchaseOrderDetail POD  
       JOIN Purchasing.PurchaseOrderHeader POH  
          ON POD.PurchaseOrderID = POH.PurchaseOrderID 
     WHERE POH.Status IN (2, 4) -- 2-Approved, 4-Complete  
     GROUP BY POD.ProductID 
    ) POD 
    ON POD.ProductID = ITM.ProductID

For the same purpose could be used also a CROSS APPLY taking care to eliminate the Products for which no record is retrieved in the subquery, this approach is actually a mixture between an inner join and a correlated query:

 -- Products with PO placed (CROSS APPLY) 
SELECT ITM.ProductID  
, ITM.ProductNumber  
, ITM.StandardCost  
, POD.TotalOrderQty  
, POD.LastOrderDate  
FROM Production.Product ITM  
     CROSS APPLY (--aggregated PO details  
     SELECT SUM(POD.OrderQty) TotalOrderQty  
     , Max(OrderDate) LastOrderDate  
     FROM Purchasing.PurchaseOrderDetail POD  
           JOIN Purchasing.PurchaseOrderHeader POH  
              ON POD.PurchaseOrderID = POH.PurchaseOrderID  
     WHERE POH.Status IN (2, 4) -- 2-Approved, 4-Complete  
        AND POD.ProductID = ITM.ProductID  
    ) POD  
WHERE POD.TotalOrderQty IS NOT NULL

Notes:
1. The IN and EXISTS query versions seems to provide better performance than the use of vertical joins, though from my experience is often needed to provide information also from the subqueries used with the respective operators, therefore the inner join provides more flexibility in the detriment of relatively poorer performance.
2. Microsoft’s implementation for IN operator allows using only single list of values, in contrast with Oracle that allows working with lists of n-uples. This problem in theory could be solved using the IN operators with a correlated subquery. An example of such a problem is finding the last PO placed for each Product:

 -- Last PO placed per Product (IN + correlated query) 
SELECT POD.PurchaseOrderDetailID 
, POD.ProductID  
, POH.OrderDate 
, POD.OrderQty 
FROM Purchasing.PurchaseOrderDetail POD  
     JOIN Purchasing.PurchaseOrderHeader POH         ON POD.PurchaseOrderID = POH.PurchaseOrderID 
WHERE POH.Status IN (2, 4) -- 2-Approved, 4-Complete  
   AND POH.OrderDate IN (--Last PO Date per Product 
      SELECT Max(POH1.OrderDate) LastOrderDate 
      FROM Purchasing.PurchaseOrderDetail POD1 
          JOIN Purchasing.PurchaseOrderHeader POH1 
             ON POD1.PurchaseOrderID = POH1.PurchaseOrderID 
    WHERE POH1.Status IN (2, 4) -- 2-Approved, 4-Complete  
AND POD1.ProductID = POD.ProductID -- correlated constraint 
)

This might not be the best approach to solve this problem though the query is intended only to show a technique. The query could be rewritten using an inner join or the EXISTS operator, but I will show only the later approach because it comes with an interesting technique – because the LastOrderDate and ProductID are used as pair of values in order to retrieve the last PO per Product, given the fact that LastOrderDate is an aggregate value, the correlated constraint needs to be added in HAVING clause:

-- Last PO placed per Product (correlated query) 
SELECT POD.PurchaseOrderDetailID 
, POD.ProductID  
, POH.OrderDate 
, POD.OrderQty 
FROM Purchasing.PurchaseOrderDetail POD  
     JOIN Purchasing.PurchaseOrderHeader POH 
        ON POD.PurchaseOrderID = POH.PurchaseOrderID 
WHERE POH.Status IN (2, 4) -- 2-Approved, 4-Complete  
     AND EXISTS(--Last PO Date per Product 
     SELECT 1 
     FROM Purchasing.PurchaseOrderDetail POD1 
          JOIN Purchasing.PurchaseOrderHeader POH1 
             ON POD1.PurchaseOrderID = POH1.PurchaseOrderID 
     WHERE POH1.Status IN (2, 4) -- 2-Approved, 4-Complete  
          AND POD1.ProductID = POD.ProductID -- correlated constraint 
     GROUP BY POD1.ProductID 
     HAVING Max(POH1.OrderDate) = POH.OrderDate 
     )

The NOT IN and NOT EXISTS versions of the operators simply negates the use of IN and EXISTS operators, showing thus the records for which a match is not found. The complementary problem of showing the Products with a PO placed is to show the Products without a PO placed, for this, as previously mentioned, being enough to add the NOT keyword in from of IN, respectively EXISTS in the first two queries from this post:

-- Products without PO placed (IN operator) 
SELECT ITM.ProductID  
, ITM.ProductNumber  
, ITM.StandardCost 
FROM Production.Product ITM  
WHERE ITM.ProductID NOT IN (--list of Product IDs with PO placed 
    SELECT DISTINCT POD.ProductID 
     FROM Purchasing.PurchaseOrderDetail POD 
       JOIN Purchasing.PurchaseOrderHeader POH 
          ON POD.PurchaseOrderID = POH.PurchaseOrderID 
      WHERE POH.Status IN (2, 4) -- 2-Approved, 4-Complete  
     )  

-- Products without PO placed (EXISTS operator) 
SELECT ITM.ProductID  
, ITM.ProductNumber  
, ITM.StandardCost 
FROM Production.Product ITM  
WHERE NOT EXISTS (--list of Product IDs with PO placed 
     SELECT 1 
     FROM Purchasing.PurchaseOrderDetail POD 
       JOIN Purchasing.PurchaseOrderHeader POH 
          ON POD.PurchaseOrderID = POH.PurchaseOrderID 
     WHERE POH.Status IN (2, 4) -- 2-Approved, 4-Complete  
         AND POD.ProductID = ITM.ProductID -- correlated constraint 
)

An alternative to NOT IN and NOT EXISTS is the use of a LEFT JOIN transposing the project of A\B problem:

  -- Products without PO placed (left join) 
SELECT ITM.ProductID  
, ITM.ProductNumber  
, ITM.StandardCost FROM Production.Product ITM  
    LEFT JOIN (--aggregated PO details 
    SELECT POD.ProductID  
    FROM Purchasing.PurchaseOrderDetail POD 
       JOIN Purchasing.PurchaseOrderHeader POH 
          ON POD.PurchaseOrderID = POH.PurchaseOrderID 
     WHERE POH.Status IN (2, 4) -- 2-Approved, 4-Complete  
     GROUP BY POD.ProductID 
    ) POD 
    ON POD.ProductID = ITM.ProductID   
WHERE POD.ProductID IS NULL

The CROSS APPLY could be used too, this time the WHERE clause from the main query being modified to show only the records without a match in CROSS APPLY, for this needing to be used a "disguised" aggregate too:

 -- Products without PO placed (CROSS APPLY) 
SELECT ITM.ProductID  
, ITM.ProductNumber  
, ITM.StandardCost  FROM Production.Product ITM  
    CROSS APPLY (-- PO details  
    SELECT Max(1) HasPOPlaced 
    FROM Purchasing.PurchaseOrderDetail POD  
          JOIN Purchasing.PurchaseOrderHeader POH  
             ON POD.PurchaseOrderID = POH.PurchaseOrderID  
    WHERE POH.Status IN (2, 4) -- 2-Approved, 4-Complete  
        AND POD.ProductID = ITM.ProductID  
    ) POD 
WHERE POD.HasPOPlaced IS NULL

As can be seen there are multiple alternative ways of solving the same problem, none of the above described patters could be considered as best approach because in the end must be weighted between performance, flexibility, reusability and complexity. In many cases the performance of a query is the most important factor, though the performance of any of the above techniques is relative because it needs to be considered factors like table structures and relations’ cardinality, server’s configuration and available resources, etc.

References:
[1] Schrag R. (2005). Speeding Up Queries with Semi-Joins and Anti-Joins: How Oracle Evaluates EXISTS, NOT EXISTS, IN, and NOT IN. [link]

SQL Troubles

Pages