SQL Troubles: subquery

Showing posts with label subquery. Show all posts

08 February 2025

🌌🏭KQL Reloaded: First Steps (Part X: Translating SQL to KQL - Correlated Subqueries)

In SQL Server and other RDBMS databases there are many scenarios in which one needs information from a fact table based on a dimension table without requiring information from the dimension table.

Correlated Subquery via EXISTS

Before considering the main example, let's start with a simple subquery:

// subquery in SQL
--
explain
SELECT CustomerKey
, ProductKey
, ProductName
FROM NewSales 
WHERE ProductKey IN (
    SELECT DISTINCT ProductKey
    FROM Products 
    WHERE ProductSubcategoryName = 'MP4&MP3'
    ) 

// subquery in KQL
NewSales
| where ProductKey in (
    (Products
    | where (ProductSubcategoryName == "MP4&MP3")
    | project ProductKey
    | distinct *))
| project CustomerKey, ProductKey, ProductName

Of course, the ProductKey is unique by design, though there can be dimension, fact tables or subqueries in which the value is not unique.

Now let's consider the correlated subquery pattern, which should provide the same outcome as above, though in RDBMS there are scenarios in which it provides better performance, especially when the number of values from subquery is high.

// correlated subquery in SQL
--
explain
SELECT CustomerKey
, ProductKey
, ProductName
FROM NewSales 
WHERE EXISTS (
    SELECT Products.ProductKey
    FROM Products 
    WHERE NewSales.ProductKey = Products.ProductKey)

Unfortunately, trying to translate the code via explain leads to the following error, which confirms that the syntax is not supported in KQL (see [1]):

"Error: Reference to missing column 'NewSales.ProductKey'"

Fortunately, in this case one can use the first version of the query.

Correlated Subquery via CROSS APPLY

Before creating the main query, let's look at the inner query and check whether it gets correctly translate to KQL:

// subquery logic
--
explain
SELECT sum(TotalCost) TotalCost 
FROM NewSales 
WHERE DateKey > '20240101' and DateKey <'20240201'

Now, let's bring the logic within the CROSS APPLY:

// correlated subquery in SQL
--
explain
SELECT ProductKey
, ProductName
, TotalCost
FROM Products
    CROSS APPLY (
        SELECT sum(TotalCost) TotalCost 
        FROM NewSales 
        WHERE DateKey > '20240101' and DateKey <'20240201'
          AND Products.ProductKey = NewSales.ProductKey
    ) DAT

Running the above code leads to the following error:

"Sql node of type 'Microsoft.SqlServer.TransactSql.ScriptDom.UnqualifiedJoin' is not implemented"

Unfortunately, many SQL queries are written following this pattern, especially when an OUTER CROSS APPLY is used, retrieving thus all the records from the dimension table.

In this case one can rewrite the query via a RIGHT JOIN:

// correlated subquery in SQL
--
explain
SELECT PRD.ProductKey
, PRD.ProductName
, SAL.TotalCost
FROM Products PRD
    LEFT JOIN (
        SELECT ProductKey
        , sum(TotalCost) TotalCost 
        FROM NewSales 
        WHERE DateKey > '20240101' and DateKey <'20240201'
        GROUP BY ProductKey
    ) SAL
      ON PRD.ProductKey = SAL.ProductKey

// direct translation of the query
Products
| join kind=leftouter 
    (NewSales
        | where ((DateKey > todatetime("20240101")) and (DateKey < todatetime("20240201")))
        | summarize TotalCost=sum(TotalCost) by ProductKey
        | project ProductKey, TotalCost
    ) on ($left.ProductKey == $right.ProductKey)
| project ProductKey, ProductName, TotalCost
//| where isnull(TotalCost)
//| summarize record_number = count()


// query after restructuring
NewSales
| where ((DateKey > todatetime("20240101")) and (DateKey < todatetime("20240201")))
| summarize TotalCost=sum(TotalCost) by ProductKey
| join kind=rightouter
    (
        Products
        | project ProductKey, ProductName
    ) on ($left.ProductKey == $right.ProductKey)
| project ProductKey, ProductName, TotalCost
//| where isnull(TotalCost)
//| summarize record_number = count()

During transformations it's important to check whether the number of records changes between the various versions of the query (including the most general version in which filtering constraints were applied).

Especially when SQL solutions are planned to be migrated to KQL, it's important to know which query patterns can be used in KQL.

Happy coding!

Previous Post <<||>> Next Post

References:
[1] GitHib (2024) Module Ex-01 - Advanced KQL [link]

16 May 2020

💎🏭SQL Reloaded: Query Patterns in SQL Server (Part II: Joins and Subqueries)

The basis for being able to manipulate data via SQL scripting is a good knowledge of using joins and subqueries as seems fit for the purpose and data models. The following scripts are based on the tables created in a previous post.

-- FULL JOIN
SELECT *
FROM dbo.T_Courses C
     JOIN dbo.T_Students S
        ON C.CourseId = S.CourseId
 ORDER BY C.CourseName
 , S.StudentName 

-- FULL JOIN (deprecated)
 SELECT *
 FROM dbo.T_Courses C
 , dbo.T_Students S
 WHERE C.CourseId = S.CourseId
 ORDER BY C.CourseName
 , S.StudentName 

--LEFT JOIN
 SELECT *
 FROM dbo.T_Courses C
      LEFT JOIN dbo.T_Students S
        ON C.CourseId = S.CourseId 
 WHERE S.CourseId IS NULL
 ORDER BY C.CourseName
 , S.StudentName 

-- RIGHT JOIN
 SELECT *
 FROM dbo.T_Courses C
      RIGHT JOIN dbo.T_Students S
       ON C.CourseId = S.CourseId 
 WHERE C.CourseId IS NULL
 ORDER BY C.CourseName
 , S.StudentName 

-- FULL OUTER JOIN
 SELECT *
 FROM dbo.T_Courses C
      FULL OUTER JOIN dbo.T_Students S
        ON C.CourseId = S.CourseId 
 --WHERE C.CourseId IS NULL
 --WHERE S.CourseId IS NULL
 --WHERE (C.CourseId IS NULL OR S.StudentId IS NULL)
 --WHERE (C.CourseId IS NULL AND S.StudentId IS NULL)
 ORDER BY C.CourseName
 , S.StudentName

The IN, NOT IN, EXISTS and NOT EXISTS allow using correlated queries, their use being indicated when there are no actual data needed from the tables involved in the correlated queries:

-- EXISTS (correlated subquery)
SELECT *
FROM dbo.T_Courses C
WHERE EXISTS (SELECT StudentId 
        FROM dbo.T_Students S
        WHERE C.CourseId = S.CourseId)
ORDER BY C.CourseName

-- NOT EXISTS (correlated subquery)
 SELECT *
 FROM dbo.T_Courses C
 WHERE EXISTS (SELECT StudentId 
        FROM dbo.T_Students S
        WHERE C.CourseId = S.CourseId)
 ORDER BY C.CourseName

-- IN (subquery)
 SELECT *
 FROM dbo.T_Courses C
 WHERE CourseId IN (SELECT CourseId 
        FROM dbo.T_Students S)
 ORDER BY C.CourseName

Joining multiples tables is done using the same principles as above:

-- joins with more tables 
SELECT A.CourseId 
, C.CourseName 
, A.StudentId 
, S.StudentName 
 , A.StartDate 
 , A.EndDate 
FROM dbo.T_Allocations A
      JOIN dbo.T_Courses C
        ON A.CourseId = C.CourseId 
      JOIN dbo.T_Students S
        ON A.StudentId = S.StudentId 
 ORDER BY C.CourseName 
 , S.StudentName

One can obtain the same result via correlated subqueries (a technique often met between Oracle developers). From readability reasons I avoid writing such queries, unless there’s a special purpose to do so.

-- correlated subquery for individual values
SELECT A.CourseId 
, (SELECT C.CourseName 
    FROM dbo.T_Courses C
    WHERE A.CourseId = C.CourseId) CourseName
 , A.StudentId 
 , (SELECT S.StudentName 
    FROM dbo.T_Students S
    WHERE A.StudentId = S.StudentId) StudentName
 , A.StartDate 
 , A.EndDate 
FROM dbo.T_Allocations A
 ORDER BY CourseName 
 , StudentName

When displaying values within a SELECT via a correlated subqueries, some developers feel the need to use MAX or MIN functions to make sure only one value will be returned. For data analysis it may be acceptable, however if the data model imposes it, then a redesign of the solution is more likely necessary.

-- correlated subquery for individual values
 SELECT A.CourseId 
 , (SELECT Max(C.CourseName)
    FROM dbo.T_Courses C
    WHERE A.CourseId = C.CourseId) CourseName
 , A.StudentId 
 , (SELECT Max(S.StudentName)
    FROM dbo.T_Students S
    WHERE A.StudentId = S.StudentId) StudentName
 , A.StartDate 
 , A.EndDate 
 FROM dbo.T_Allocations A
 ORDER BY CourseName 
 , StudentName

Another technique not recommended is displaying one or more attributes from the same table with the same conditions via individual correlated queries. The use of aggregate functions is more appropriate however with numerical or date values.

Notes:
The queries work also in SQL databases in Microsoft Fabric.

Happy coding!

Previous Post <<||>> Next Post

13 June 2009

🛢DBMS: Subquery (Definitions)

"A SELECT statement nested inside another SELECT, INSERT, UPDATE, or DELETE statement, or inside another subquery." (Patrick Dalton, "Microsoft SQL Server Black Book", 1997)

"A SELECT statement that is usually a part of a larger SQL statement; a query nested inside a SELECT, INSERT, UPDATE, or DELETE statement." (Owen Williams, "MCSE TestPrep: SQL Server 6.5 Design and Implementation", 1998)

"A SELECT within another SQL statement, usually within another SELECT." (Peter Gulutzan & Trudy Pelzer, "SQL Performance Tuning", 2002)

"A query that is embedded in a main, or parent, query and used to assist in filtering the result set from a query." (Bob Bryla, "Oracle Database Foundations", 2004)

"A nested query that returns information to an outer query, thereby helping the outer query correctly identify results." (Robert D. Schneider and Darril Gibson, "Microsoft SQL Server 2008 All-In-One Desk Reference For Dummies", 2008)

"A complete SELECT statement that is part of another SELECT." (Jan L Harrington, "SQL Clearly Explained 3rd Ed. ", 2010)

"A SELECT statement that contains one or more subqueries." (SQL Server 2012 Glossary, "Microsoft", 2012)

"A query nested within another SQL statement. Unlike implicit queries, subqueries use a SELECT statement to retrieve data." (Oracle, "Database SQL Tuning Guide Glossary", 2013)

"A form of a full-select that is enclosed within parentheses and used as a component of a query." (Sybase, "Open Server Server-Library/C Reference Manual", 2019)

10 June 2009

🛢DBMS: Correlated Subquery (Definitions)

"A subquery that cannot be evaluated independently, but depends on the outer query for its results. Also called a repeating subquery, since the subquery is executed once for each row that might be selected by the outer query. See also nested query." (Karen Paulsell et al, "Sybase SQL Server: Performance and Tuning Guide", 1996)

"A repeating subquery. Many queries can be evaluated by executing the subquery once and substituting the resulting value or values into the WHERE clause of the outer query. In queries that include a correlated subquery, the subquery depends on the outer query for its values. This means that the subquery is executed repeatedly, one time for each row that is selected by the outer query." (Microsoft Corporation, "SQL Server 7.0 System Administration Training Kit", 1999)

"A subquery that references a column in an outer statement. The inner query is executed for each candidate row in the outer statement." (Anthony Sequeira & Brian Alderman, "The SQL Server 2000 Book", 2003)

"A subquery that contains a reference to a column in the main, or parent, query." (Bob Bryla, "Oracle Database Foundations", 2004)

"A subquery that a DBMS cannot process completely before turning to the outer query. The DBMS must execute the subquery repeatedly for every row in the outer query." (Jan L Harrington, "SQL Clearly Explained 3rd Ed. ", 2010)

"A subquery that contains a correlated reference." (Sybase, "Open Server Server-Library/C Reference Manual", 2019)

30 August 2008

⌛ Oracle Troubleshooting: ANSI 92 JOIN syntax error

Lately I’ve been working a lot with Oracle APPS, doing mainly ad-hoc reporting. One of my nightmares is an Oracle bug related to ANSI 92 syntax:

“ORA-01445: cannot select ROWID from, or sample, a join without a key-preserved table

Unfortunately, even if the bug was solved by Oracle, it seems the update was missed on some servers and the bug haunts my queries almost on a daily basis.

Having an SQL Server background and, for code clearness, I prefer ANSI 92 JOIN syntax:

SELECT A.column1, B.column2
FROM table1 A JOIN table2 B
 ON A.column1 = B.column2

instead of using the old fashioned writing:

SELECT A.column1, B.column2
FROM table1 A , table2 B
WHERE A.column1 = B.column2

In theory the two queries should provide the same output and have, hopefully, similar performance. The problem with ANSI 92 syntax is that, on some Oracle installations, when the number of joins exceeds a certain limit, usually greater than 7, the above error is thrown.

What one can do is to reduce the number of joins to the main table by restructuring the query and grouping multiple tables into subqueries, which are then joined to the main table. For the tables from which is returned only one column, one can move the table into the SELECT statement.

Happy coding!

SQL Troubles

Pages

08 February 2025

🌌🏭KQL Reloaded: First Steps (Part X: Translating SQL to KQL - Correlated Subqueries)

16 May 2020

💎🏭SQL Reloaded: Query Patterns in SQL Server (Part II: Joins and Subqueries)

13 June 2009

🛢DBMS: Subquery (Definitions)

10 June 2009

🛢DBMS: Correlated Subquery (Definitions)

30 August 2008

⌛ Oracle Troubleshooting: ANSI 92 JOIN syntax error

About Me