SQL Troubles

01 February 2023

💎SQL Reloaded: Alternatives for Better Code Maintainability in SQL Server & Azure Synapse I

Introduction

Queries can become quite complex and with the increased complexity they'll be harder to read and/or maintain. Since the early days of SQL Server, views and table-valued user-defined functions (UDFs) are the standard ways of encapsulating logic for reuse in queries, allowing to minimize the duplication of logic. Then came the common table expressions (CTEs), which allow a further layer of structuring the code, independently whether a view or UDF was used.

These are the main 3 options that can be combined in various ways to better structure the code. On the other side, also a temporary table or table variable could be used for the same purpose, though they have further implications.

To exemplify the various approaches, let's consider a simple query based on two tables from the AdventureWorks database. For the sake of simplicity, further business rules have been left out.

Inline Subqueries

-- products with open purchase orders
SELECT ITM.ProductNumber
, ITM.Name
, POL.PurchQty
FROM Production.Product ITM
     JOIN ( -- cumulated open purchase orders by product
		SELECT POL.ProductId 
		, SUM(POL.OrderQty) PurchQty
		FROM Purchasing.PurchaseOrderDetail POL
		WHERE OrderQty - (ReceivedQty - RejectedQty)>0
		GROUP BY POL.ProductId 
	) POL
	ON ITM.ProductId = POL.ProductId
ORDER BY ITM.ProductNumber

As can be seen, the logic for the "Open purchase orders" result set is built within an inline subquery (aka inline view). As its logic becomes more complex, the simplest way to handle this is to move it into a CTE.

Common Table Expressions (CTEs)

A common table expression can be thought of as a temporary result set defined within the execution scope of a single SELECT, INSERT, UPDATE, DELETE or CREATE VIEW statement [1]. Thus, the CTE can't be reused between queries.

The inline query is moved at the beginning within a WITH statement to which is given a proper name that allows easier identification later:

-- products with open purchase orders (common table expression)
WITH OpenPOs
AS (-- cumulated open purchase orders by product
	SELECT POL.ProductId 
	, SUM(POL.OrderQty) PurchQty
	FROM Purchasing.PurchaseOrderDetail POL
	WHERE OrderQty - (ReceivedQty - RejectedQty)>0
	GROUP BY POL.ProductId 
)
SELECT ITM.ProductNumber
, ITM.Name
, POL.PurchQty
FROM Production.Product ITM
     JOIN OpenPOs POL
	   ON ITM.ProductId = POL.ProductId
ORDER BY ITM.ProductNumber

Thus, this allows us to rewrite the JOIN as if it were between two tables. Multiple CTEs can be used as well, with or without any dependencies between them. Moreover, CTEs allow building recursive queries (see example).

There is no performance gain or loss by using a CTE. It's important to know that the result set is not cached, therefore, if the same CTE is called multiple times (within a query), it will be also "executed" for the same number of times. Except the cases in which the database engine uses a spool operator to save intermediate query results for a CTE, there will be created no work table in tempdb for CTEs.

If the inline query needs to be reused in several queries, defining a view is a better alternative.

Views

A view is a database object used to encapsulate a query and that can be referenced from other queries much like a table. In fact, it's also referred as a "virtual table". A view can't be execute by itself (as stored procedures do. No data, only the definition of the view is stored, and the various actions that can be performed on database objects can be performed on views as well.

-- creating the view
CREATE VIEW dbo.vOpenPurchaseOrders
AS
SELECT POL.ProductId 
, SUM(POL.OrderQty) PurchQty
FROM Purchasing.PurchaseOrderDetail POL
WHERE OrderQty - (ReceivedQty - RejectedQty)>0
GROUP BY POL.ProductId 

-- testing the view
SELECT top 10 *
FROM dbo.vOpenPurchaseOrders

Once the view is created, it can be called from any query:

-- products with open purchase orders (table-valued function)
SELECT ITM.ProductNumber
, ITM.Name
, POL.PurchQty
FROM Production.Product ITM
     JOIN dbo.vOpenPurchaseOrders POL
	   ON ITM.ProductId = POL.ProductId
ORDER BY ITM.ProductNumber

Besides the schema binding, there are no additional costs for using views. However, views have several limitations (see [2]). Moreover, it's not possible to use parameters with views, scenarios in which tabled-valued UDFs can help.

Indexed Views

Starting with SQL Server 2015, it's possible to materialize the data in a view, storing the results of the view in a clustered index on the disk in same way a table with a clustered index is stored. This type of view is called an indexed view (aka materialized view, though the concept is used slightly different in Azure Synapse) and for long-running queries can provide considerable performance gains. In case the view contains a GROUP BY is present, its definition must contain COUNT_BIG(*) and must not contain HAVING.

-- dropping the view
--DROP VIEW IF EXISTS Purchasing.vOpenPOs

-- create view
CREATE VIEW Purchasing.vOpenPOs
WITH SCHEMABINDING
AS
SELECT POL.ProductId 
, SUM(POL.OrderQty) PurchQty
, COUNT_BIG(*) Count
FROM Purchasing.PurchaseOrderDetail POL
WHERE OrderQty - (ReceivedQty - RejectedQty)>0
GROUP BY POL.ProductId 
GO

--Create an index on the view.
CREATE UNIQUE CLUSTERED INDEX IDX_vOpenPOs
   ON Purchasing.vOpenPOs (ProductId);

--testing the view
SELECT top 100 *
FROM Purchasing.vOpenPOs

-- products with open purchase orders (indexed view)
SELECT ITM.ProductNumber
, ITM.Name
, POL.PurchQty
FROM [Production].[Product] ITM
     JOIN Purchasing.vOpenPOs POL
	   ON ITM.ProductId = POL.ProductId
ORDER BY ITM.ProductNumber

When an indexed view is defined on a table, the query optimizer may use it to speed up the query execution even if it wasn't referenced in the query. Besides the restriction of the view to be deterministic, further limitations apply (see [6]).

Table-Valued Functions

A table-valued function is a user-defined function in which returns a table as a result, as opposed to a single data value, as scalar functions do.

Let's support that we need to restrict base the logic based on a time interval. We'd need then to provide the StartDate & EndDate as parameters. Compared with other UDFs table-valued functions, as their name implies, need to return a table:

-- creating the UDF function 
CREATE FUNCTION dbo.tvfOpenPurchaseOrdersByProduct( 
  @StartDate date 
, @EndDate date) 
RETURNS TABLE 
AS RETURN ( 
	SELECT POL.ProductId 
	, SUM(POL.OrderQty) PurchQty
	FROM Purchasing.PurchaseOrderDetail POL
	WHERE OrderQty - (ReceivedQty - RejectedQty)>0
	  AND POL.DueDate BETWEEN @StartDate AND @EndDate
	GROUP BY POL.ProductId 
)

-- testing the UDF
SELECT top 10 *
FROM dbo.tvfOpenPurchaseOrdersByProduct('2014-01-01', '2014-12-31')

A table-valued function can be used as a "table with parameters" in JOINs:

-- products with open purchase orders (table-valued function)
SELECT ITM.ProductNumber
, ITM.Name
, POL.PurchQty
FROM Production.Product ITM
     JOIN dbo.tvfOpenPurchaseOrdersByProduct('2014-01-01', '2014-12-31') POL
	   ON ITM.ProductId = POL.ProductId
ORDER BY ITM.ProductNumber

The parameters are optional, though in such cases using a view might still be a better idea. Table-valued functions used to have poor performance in the past compared with views and in certain scenarios they might still perform poorly. Their benefit resides in allowing to pass and use parameters in the logic, which can make them irreplaceable. Moreover, multi-statement table-valued functions can be built as well (see example)!

Notes:
1) When evaluating table-valued functions for usage consider their limitations as well (see [3])!
2) Scalar UDFs can be used to simplify the code as well, though they apply only to single values, therefore they are not considered in here!

Temporary Tables

A temporary table is a base table that is stored and managed in tempdb as any other table. It exists only while the database session in which it was created is active. Therefore, it can be called multiple times, behaving much like a standard table:

-- create the temp table
CREATE TABLE dbo.#OpenPOs (
  ProductId int NOT NULL
, PurchQty decimal(8,2) NOT NULL
)

-- insert the cumulated purchase orders
INSERT INTO #OpenPOs
SELECT POL.ProductId 
, SUM(POL.OrderQty) PurchQty
FROM Purchasing.PurchaseOrderDetail POL
WHERE OrderQty - (ReceivedQty - RejectedQty)>0
GROUP BY POL.ProductId 

-- products with open purchase orders (table-valued function)
SELECT ITM.ProductNumber
, ITM.Name
, POL.PurchQty
FROM [Production].[Product] ITM
     JOIN dbo.#OpenPOs POL
	   ON ITM.ProductId = POL.ProductId
ORDER BY ITM.ProductNumber

-- drop the table (cleaning)
-- DROP TABLE IF EXISTS dbo.#OpenPOs;

Being created in the tempdb, system database shared by several databases, temporary table's performance relies on tempdb's configuration and workload. Moreover, the concurrent creation of temporary tables from many sessions can lead to tempdb metadata contention, as each session attempts updating metadata information in the system based tables.

Temporary tables are logged, which adds more burden on the database engine, however being able to create indexes on them and use statistics can help processing result sets more efficiently, especially when called multiple times.

Also, a temporary table might be cached (see [1]) and not deleted when its purpose ends, which allows operations that drop and create the objects to execute very quickly and reduces page allocation contention.

Table Variables

A table variable is a variable of type TABLE and can be used in functions, stored procedures, and batches. The construct is similar to the temp table and is stored as well in the tempdb and cached under certain scenarios, however they are scoped to the batch or routine in which they are defined and destroyed after that.

-- create the table variable
DECLARE @OpenPOs TABLE (
  ProductId int NOT NULL
, PurchQty decimal(8,2) NOT NULL
)

-- insert the cumulated purchase orders
INSERT INTO @OpenPOs
SELECT POL.ProductId 
, SUM(POL.OrderQty) PurchQty
FROM Purchasing.PurchaseOrderDetail POL
WHERE OrderQty - (ReceivedQty - RejectedQty)>0
GROUP BY POL.ProductId 

-- products with open purchase orders (table variable)
SELECT ITM.ProductNumber
, ITM.Name
, POL.PurchQty
FROM [Production].[Product] ITM
     JOIN @OpenPOs POL
	   ON ITM.ProductId = POL.ProductId
ORDER BY ITM.ProductNumber

Table variables don’t participate in transactions or locking, while DML operations done on them are not logged. There are also no statistics maintained and any data changes impacting the table variable will not cause recompilation. Thus, they are usually faster than temporary variables, especially when their size is small, though their performance depends also on how they are used. On the other side, for big result sets and/or when several calls are involved, a temporary table could prove to be more efficient.

Important!!! Temporary tables and table variables are means of improving the performance of long-running queries. Being able to move pieces of logic around helps in maintaining the code and it also provides a logical structure of the steps, however they shouldn't be used if the performance gain is not the target! Overusing them as technique can considerably decrease the performance of tempdb, which can have impact in other areas!

Azure Synapse

Moving to Azure Synapse there are several important limitations in what concerns the above (see [4]). Even if some features are supported, further limitations might apply. What's important to note is that materialized views act like indexed view in standard SQL Server and that CETAS (Create External Table as SELECT) are available to import/export data to the supported file formats in Hadoop, Azure storage blob or Azure Data Lake Storage Gen2.

Feature	Dedicated	Serverless	SQL Server
CTEs	Yes	Yes	Yes (2015+)
Recursive CTEs	No	No	Yes (2015+)
Views	Yes	Yes	Yes
Indexed views	No	No	Yes
Materialized views	Yes	No	No
Table-valued functions (single statement)	No	Yes	Yes
Table-valued functions (multi-statement)	No	No	Yes
Scalar UDFs	Yes	No	Yes
Tables	Yes	No	Yes
Temporary tables (local)	Yes	Limited	Yes
Temporary tables (global)	No	No	Yes
Table variables	Yes	Yes	Yes
CETAS	Yes	Limited	Yes (2022+)

Notes:
1) CETAS have two important limitations in serverless SQL Pool
a) once the data were exported to a file, they can't be overwritten via the same syntax;
b) logic based on temporary tables can't be exported via pipelines.
2) Temporary tables can be used to replace cursors (see example).

Previous Post <<||>> Next Post

Resources:
[1] Microsoft Learn (2012) Capacity Planning for tempdb (link)
[2] Microsoft Learn (2023) CREATE View (link)
[3] Microsoft Learn (2023) CREATE Function (link)
[4] Microsoft Learn (2023) Transact-SQL features supported in Azure Synapse SQL (link)
[5] Redgate (2018) Choosing Between Table Variables and Temporary Tables (ST011, ST012), by Phil Factor (link)
[6] Microsoft Learn (2023) Create indexed views (link)
[7] Microsoft Learn (2023) CREATE MATERIALIZED VIEW AS SELECT (Transact-SQL) (link)
[8] Microsoft Learn (2023) CETAS with Synapse SQL (link)

28 January 2023

💎SQL Reloaded: Temporary Tables and Tempdb in Serverless SQL Pool

In SQL Server, temporary tables are stored in tempdb database and the front end (FE) SQL Server from a serverless SQL Server pool makes no exception from it, however there's no such database listed in the list of databases available. Watching today Brian Bønk's session on "Azure Synapse serverless - CETAS vs Views vs Parquet" at Data Toboggan winter edition 2023, the speaker pointed out that the tables available in tempdb database can be actually queried:

-- Azure Serverless SQL pool: tempdb tables
SELECT top 10 *
FROM tempdb.information_schema.tables

TABLE_CATALOG	TABLE_SCHEMA	TABLE_NAME	TABLE_TYPE
tempdb	dbo	#dm_continuous_copy_status_..._0000000029C7	BASE TABLE
tempdb	dbo	diff_MonDmSbsConnections	BASE TABLE
tempdb	dbo	diff_MonTceMasterKeys	BASE TABLE
tempdb	dbo	diff_MonTceColEncryptionKey	BASE TABLE
tempdb	dbo	diff_MonAutomaticTuningState	BASE TABLE
tempdb	dbo	diff_MonDmTranActiveTransactions	BASE TABLE
tempdb	dbo	diff_MonTceEnclaveUsageInfo	BASE TABLE
tempdb	dbo	diff_MonTceEnclaveColEncryptionKey	BASE TABLE
tempdb	dbo	diff_MonTceEnclaveMasterKeys	BASE TABLE
tempdb	dbo	diff_MonTceColumnInfo	BASE TABLE

Moreover, the content of the system tables available can be queried, even if some of the tables might have no data:

-- Azure Serverless SQL pool: checking a table's content
SELECT top 10 *
FROM tempdb.dbo.dmv_view_run_history

How about the temporary tables created? To check this, let's reuse a temp table created recently for listing the DMVs available in the serverlss SQL pool:

-- dropping the temp table
DROP TABLE IF EXISTS dbo.#views;

-- create the temp table
CREATE TABLE dbo.#views (
  ranking int NOT NULL
, view_name nvarchar(150) NOT NULL
)

-- inserting a few records
INSERT INTO #views
SELECT row_number() OVER(ORDER BY object_id) ranking
, concat(schema_name(schema_id),'.', name) view_name
FROM sys.all_views obj
WHERE obj.Type = 'V'
  AND obj.is_ms_shipped = 1
  AND obj.name LIKE 'dm_exec_requests%'

-- checking temp table's content
SELECT *
FROM dbo.#views

ranking	view_name
1	sys.dm_exec_requests
2	sys.dm_exec_requests_history

Here's temp table's metadata:

-- checking temp table's metadata
SELECT *
FROM tempdb.information_schema.tables
WHERE table_name like '#views%'

TABLE_CATALOG	TABLE_SCHEMA	TABLE_NAME	TABLE_TYPE
tempdb	dbo	#views_____..._____00000000295B	BASE TABLE

Of course, you can query also the tempdb.sys.all_objects:

-- checking temp table's metadata
SELECT *
FROM tempdb.sys.all_objects
WHERE name like '#views%'

You can use now the table name returned by any of the two queries to call the table (just replace the name accordingly):

-- querying the table via tempdb (name will differ)
SELECT *
FROM tempdb.dbo.[#views______________________________________________________________________________________________________________00000000295B]

ranking	view_name
1	sys.dm_exec_requests
2	sys.dm_exec_requests_history

The benefit of knowing that is neglectable, however the topic is interesting more for the ones who want to know how the pools in Azure Synapse work, what was kept or removed compared with the standard editions of SQL Server.

Thus, another topic interesting for DBAs would be how many files are created for the tempdb, given that the database is used to store intermediate data for the various operations. Finding the optimal number of tempdb files and configuring them was and still is one of the main concerns when configuring an SQL Server instance for optimal performance.

The old query developed for standard SQL Server seems to work as well:

-- Azure Serverless SQL pool: tempdb files 
SELECT dbf.file_id
, dbf.name file_name
--, dbf.physical_name
, dsp.name file_group
--, type 
, dbf.type_desc file_type
--, dbf.growth growth_kb
, Cast(dbf.growth/128.0  as decimal(18,2)) growth_mb
, dbf.is_percent_growth
--, dbf.max_size max_size_kb
, Cast(NullIf(dbf.max_size, -1)/128.0  as decimal(18,2)) max_size_mb
--, dbf.size file_size_kb
, Cast(dbf.size/128.0 as decimal(18,2)) file_size_mb
, dbf.state_desc 
, dbf.is_read_only 
FROM tempdb.sys.database_files dbf
     LEFT JOIN tempdb.sys.data_spaces dsp
       ON dbf.data_space_id = dsp.data_space_id
ORDER BY dbf.Name

file_id	file_name	file_group	file_type	growth_mb	is_percent_growth	max_size_mb	file_size_mb	state_desc	is_read_only
1	tempdev	PRIMARY	ROWS	32	0	102670	16	ONLINE	0
3	tempdev2	PRIMARY	ROWS	32	0	102670	16	ONLINE	0
4	tempdev3	PRIMARY	ROWS	32	0	102670	16	ONLINE	0
5	tempdev4	PRIMARY	ROWS	32	0	102670	16	ONLINE	0
2	templog	NULL	LOG	64	0	16384	255	ONLINE	0

So, there were created 4 data files of 16 MB each, with a fix growth of 32 MB, and the files can grow up to about 100 GB. What would be interesting to check is how the files grow under heavy workload, what kind of issues this raises, etc. At least in serverless SQL pools many of the views one used to use for troubleshooting aren't accessible. Maybe one doesn't have to go that far, given that the resources are managed automatically.

Notes:

(1) Microsoft recommends not to drop the temporary tables explicitly, but let SQL Server handle this cleanup automatically and take thus advantage of the Optimistic Latching Algorithm, which helps prevent contention on TempDB [1].

Happy coding!

Last updated: Oct-2024

Previous Post <<||>> Next Post

References:
[1] Haripriya SB (2024) Do NOT drop #temp tables (link)

25 January 2023

💎SQL Reloaded: Documenting External Tables

When using serverless SQL pool, CETAS (aka Create External Table as Select) are the main mechanism of making data available from the Data Lake for queries. In case is needed to document them, sys.external_tables DMV can be used to export their metadata, much like sys.tables or sys.views can be used for the same:

-- CETAS metadata
SELECT TOP (50) ext.object_id
, ext.name
, schema_name(ext.schema_id) [schema_name]
, ext.type_desc
, ext.location
, ext.data_source_id
, ext.file_format_id
, ext.max_column_id_used
, ext.uses_ansi_nulls
, ext.create_date
, ext.modify_date
FROM sys.external_tables ext

It's interesting that CETAS have type_desc = 'USER_TABLE' in sys.all_objects, same like user-defined tables in SQL Server have:

-- CETAS' metadata via sys.all_objects
SELECT *
FROM sys.all_objects
WHERE object_id = object_id('<schema_name>.<CETAS name>')

The data source and file format can be retrieved via the sys.external_data_sources and sys.external_file_formats DMVs. Moreover, it's useful to include the logic into a view, like the one below:

-- drop view
--DROP VIEW IF EXISTS dbo.vAdminExternalTables

-- create view
CREATE VIEW dbo.vAdminExternalTables
AS
-- external tables - metadata 
SELECT ext.object_id
, sch.name + '.' + ext.name [unique_identifier]
, sch.name [schema]
, ext.name [object]
, ext.type_desc [type]
, ext.max_column_id_used 
, ext.location
, eds.name data_source 
, eff.name file_format 
, ext.create_date 
, ext.modify_date
FROM sys.external_tables ext
     JOIN sys.schemas sch
       ON ext.schema_id = sch.schema_id 
     JOIN sys.external_data_sources eds 
       ON ext.data_source_id = eds.data_source_id
     JOIN sys.external_file_formats eff
       ON ext.file_format_id = eff.file_format_id

-- testing the view
SELECT top 10 *
FROM dbo.vAdminExternalTables

The view can be used then for further queries, for example checking the CETAS created or modified starting with a given date:

-- external tables created after a certain date
SELECT *
FROM dbo.vAdminExternalTables ext
WHERE ext.create_date >= '20230101'
  OR ext.modify_date >= '20230101';

Or, when the CETAS are deployed from one environment to another, one can compare the datasets returned by the same view between environments, something like in the below query:

-- comparison external tables metadata between two databases
SELECT *
FROM (
    SELECT *
    FROM <test_database>.dbo.vAdminExternalTables
    WHERE [Schema] = '<schema_name>'
	) PRD
	FULL OUTER JOIN (
    SELECT *
    FROM <prod_database>.dbo.vAdminExternalTables
    WHERE [Schema] = '<schema_name>'
	) UAT
	ON PRD.[unique_identifier] = UAT.[unique_identifier]
-- WHERE PRD.[unique_identifier] IS NULL OR UAT.[unique_identifier] IS NULL

The definitions for multiple CETAS can be exported from the source database in one step via the Object Explorer Details >> Tables >> External tables >> (select CETAS) >> Script Table as >> ... .

Happy coding!

Previous Post <<||>> Next Post

20 January 2023

💎SQL Reloaded: Monitoring the Synapse serverless SQL pool with Dynamics Management Views II

Identifying the SQL Server DMVs which are accessible for the Serverless SQL pool (see previous post), allowed me to identify besides sys.dm_exec_requests_history three more DMVs with statistics on the statements executed on the server: sys.dm_request_phases, sys.dm_request_phases_task_group_stats and sys.dm_request_phases_exec_task_stats. Untofurtunately, there seems to be no documentation available on these DMVs, and, at the time the post was written, there were also no further hits on google.com or bing.com found on the same.

sys.dm_request_phases

sys.dm_request_phases provides insights in the phases an execution statement goes through, and seems to summarize the other two views:

-- Azure Serverless SQL pool: request phases
SELECT TOP (100) dist_statement_id
, RPH.dist_request_id
, TRY_CAST(RPH.id as bigint) id
, TRY_CAST(RPH.parent_ids as bigint) parent_ids
, RPH.start_time
, RPH.end_time
--, RPH.total_elapsed_time_ms
--, RPH.total_elapsed_time_ms/1000.0 total_elapsed_time_sec
--, RPH.min_time_ms
--, RPH.min_time_ms/1000.0 min_time_sec
--, RPH.max_time_ms
--, RPH.max_time_ms/1000.0 max_time_sec
--, RPH.avg_time_ms
, RPH.avg_time_ms/1000.0 avg_time_sec
--, RPH.stdev_time_ms
--, RPH.stdev_time_ms/1000.0 stdev_time_sec -- it has no values
--, RPH.min_rows
--, RPH.max_rows
--, RPH.avg_rows
--, RPH.stdev_rows -- it has no values
, RPH.total_rows
--, RPH.total_bytes_processed
, RPH.total_bytes_processed/1028.0 total_kb_processed
, RPH.state_desc
, RPH.operation_type
, RPH.input_dop
, RPH.output_dop
, RPH.task_retries
, RPH.error_id
FROM sys.dm_request_phases RPH
ORDER BY Id

dist_statement_id	dist_request_id	id	parent_ids	start_time	end_time	avg_time_sec	total_rows	total_kb_processed	state_desc	operation_type	input_dop	output_dop	task_retries	error_id
8C4386DC...	820E9FC6...	1	2	...09:58:34.213	...09:58:36.337	1.031	2030	2343.310311	succeeded	Shuffle	1	1	0	0
8C4386DC...	820E9FC6...	2	0	...09:58:36.447	...09:58:39.713	1.891	9	7145.193579	succeeded	Return	1	1	0	0
C9524971...	680DCB55...	3	4	...10:05:46.747	...10:05:47.057	0.203	2030	2343.310311	succeeded	Shuffle	1	1	0	0
C9524971...	680DCB55...	4	0	...10:05:47.057	...10:05:48.480	1.406	0	6630.101167	succeeded	Return	1	1	0	0
FD2D17AD...	C9453EF2...	5	6	...11:58:54.060	...11:58:55.297	0.547	10	1534.098249	succeeded	ComputeToControlNode	1	1	0	0
FD2D17AD...	C9453EF2...	6	0	...11:58:55.297	...11:58:55.420	0.125	10	4.074902	succeeded	Return	1	1	0	0
9FB0A268...	CAA533DE...	7	8	...11:59:16.483	...11:59:16.700	0.203	2030	2343.310311	succeeded	Shuffle	1	1	0	0
9FB0A268...	CAA533DE...	8	0	...11:59:16.700	...11:59:18.640	1.922	6	7143.673151	succeeded	Return	1	1	0	0
1732AB0D...	AC1A4F10...	9	10	...11:59:25.950	...11:59:26.140	0.172	2030	2343.310311	succeeded	Shuffle	1	1	0	0
1732AB0D...	AC1A4F10...	10	0	...11:59:26.140	...11:59:27.450	1.297	9	6635.185797	succeeded	Return	1	1	0	0

Notes:
1) The foreign keys and dates (in the above and below queries) were truncated to accomodate all the important attributes in the snapshot of the values returned.
2) Based on the exisitng queries, there are two records for each executed statement, a Shuffle or ComputeToControlNode followed by a Return (see operation_type). In more complex scenario there are several Shuffles and Broadcasts and a Return. According to the Microsoft team, even if for serverless SQL pools there's no Data Movement Service (DMS), there's a similar algorithm responsible for moving the data between the nodes.
3) Because in serverless SQL pool each query has its own distribution statement id, the min, max, avg and total values will have the sames values across the columns. Therefore, the columns with redundant values were commented.
4) The Id of the request phase seems to have numeric values despite being defined as alphanumeric. I tried to cast the values to bigint for sorting purposes.

sys.dm_request_phases_task_group_stats

sys.dm_request_phases_task_group_stats stores metadata about the requests breakdown at task group:

-- Azure Serverless SQL pool: request phases breakdown at task group
SELECT TOP (100) RPT.dist_request_id
, TRY_CAST(RPT.id as bigint) id
, TRY_CAST(RPT.parent_ids as bigint) parent_ids
, RPT.dist_statement_id
, RPT.state_desc
, RPT.start_time
, RPT.end_time
, RPT.input_dop
, RPT.output_dop
, RPT.operation_type
, RPT.task_retries
FROM sys.dm_request_phases_task_group_stats RPT
ORDER BY id

dist_request_id	id	parent_ids	dist_statement_id	state_desc	start_time	end_time	input_dop	output_dop	operation_type	task_retries
820E9FC6...	1	2	8C4386DC...	succeeded	638098055142132551	638098055163382693	1	1	Shuffle	0
820E9FC6...	2	0	8C4386DC...	succeeded	638098055164476163	638098055197133001	1	1	Return	0
680DCB55...	3	4	C9524971...	succeeded	638098059467450021	638098059470574953	1	1	Shuffle	0
680DCB55...	4	0	C9524971...	succeeded	638098059470574953	638098059484793682	1	1	Return	0
C9453EF2...	5	6	FD2D17AD...	succeeded	638098127340607112	638098127352951067	1	1	ComputeToControlNode	0
C9453EF2...	6	0	FD2D17AD...	succeeded	638098127352951067	638098127354202970	1	1	Return	0
CAA533DE...	7	8	9FB0A268...	succeeded	638098127564826084	638098127567013504	1	1	Shuffle	0
CAA533DE...	8	0	9FB0A268...	succeeded	638098127567013504	638098127586388549	1	1	Return	0
AC1A4F10...	9	10	1732AB0D...	succeeded	638098127659513620	638098127661388514	1	1	Shuffle	0
AC1A4F10...	10	0	1732AB0D...	succeeded	638098127661388514	638098127674513601	1	1	Return	0

Notes:
1) The DVM seems to return the same number of records as sys.dm_request_phases.
2) Observe the format of the start_time and end_time, probably the timestamps come from the Spark cluster and were not translated into an SQL Server data type.

sys.dm_request_phases_exec_task_stats

sys.dm_request_phases_exec_task_stats stores metadata about the requests breakdown at task level:

-- Azure Serverless SQL pool: request phases breakdown at task
SELECT TOP (100) RPE.dist_request_id
, TRY_CAST(RPE.id as bigint) id
--, RPE.min_time_ms
--, RPE.max_time_ms
, RPE.avg_time_ms/1000.0 avg_time_sec
--, RPE.stdev_time_ms
, RPE.total_bytes_processed
--, RPE.min_rows
--, RPE.max_rows
--, RPE.avg_rows
--, RPE.stdev_rows
, RPE.total_rows
, RPE.error_id
FROM sys.dm_request_phases_exec_task_stats RPE
ORDER BY id

dist_request_id	id	avg_time_sec	total_kb_processed	total_rows	error_id
820E9FC6...	1	1.031	2343.310311	2030	0
820E9FC6...	2	1.891	7145.193579	9	0
680DCB55...	3	0.203	2343.310311	2030	0
680DCB55...	4	1.406	6630.101167	0	0
C9453EF2...	5	0.547	1534.098249	10	0
C9453EF2...	6	0.125	4.074902	10	0
CAA533DE...	7	0.203	2343.310311	2030	0
CAA533DE...	8	1.922	7143.673151	6	0
AC1A4F10...	9	0.172	2343.310311	2030	0
AC1A4F10...	10	1.297	6635.185797	9	0

What does all this mean?

The lack of documentation makes it challenging to interpret the values of the views besides the data and metadata they offer. In a paper on POLARIS, the code given to the serveless SQL pool engine, a taks is defined as "a careful packaging of data and query processing into units [...] that can be readily moved across compute nodes and re-started at the task level" [1]. Therefore, one can assume that this is the level targetted by the sys.dm_request_phases_exec_task_stats DMV. Further on, the tasks are grouped at phase level according to the sys.dm_request_phases_task_group_stats, the metadata from the two DMVs being further combined into sys.dm_request_phases DMV.

If the meaning is kept from dedicated SQL pools, a shuffle operation indicates that data is moved between the frontend and backend nodes to satisfy a request, while a Result represents the operation of returning the result selt to client. The "ComputeToControlNode" operation involves a simple select (e.g. SELECT top 10) from a CETA and therefore no "Shuffle" is needed.

Requests' history

Further on, one can use the "Distributed statement id" to join the execution request phases with the request history, however matches will be found only for a small subset of the records (probably the executions since the pool started):

-- Azure Serverless SQL pool: requests history with request phase info
SELECT top 100 ERH.status
, ERH.transaction_Id
, ERH.distributed_statement_Id 
, ERH.query_hash 
--, ERH.login_name 
, ERH.start_time
, ERH.end_time 
, ERH.command 
, ERH.query_text 
--, ERH.total_elapsed_time_ms
, ERH.total_elapsed_time_ms/1000.0 total_elapsed_time_sec
--, ERH.data_processed_mb
, ERH.data_processed_mb
, RPH.avg_time_ms/1000.0 avg_time_sec
, RPH.total_rows
, RPH.total_bytes_processed/1028.0/1028.0 total_mb_processed
, RPH.state_desc
, RPH.operation_type
, RPH.input_dop
, RPH.output_dop
, RPH.task_retries
, RPH.error_id
, ERH.error
, ERH.error_code 
FROM sys.dm_exec_requests_history ERH
     JOIN sys.dm_request_phases RPH
	   ON ERH.distributed_statement_Id = RPH.dist_statement_id
	  --AND RPH.parent_ids = 0 -- only the parent
ORDER BY RPH.Id DESC

Here's a subset of the result set focusing only on the statistical values:

distr_statement_Id	start_time	end_time	total_elapsed_time_sec	data_processed_mb	avg_time_sec	total_rows	total_mb_processed	operation_type	id	parent_ids
{8C4386D...	...8:24.4300000	...8:39.8266666	15.396	10	1.031	2030	2.279484738326	Shuffle	1	2
{8C4386D...	...8:24.4300000	...8:39.8266666	15.396	10	1.891	9	6.950577411478	Return	2	0
{C952497...	...5:45.2100000	...5:48.4933333	3.283	10	0.203	2030	2.279484738326	Shuffle	3	4
{C952497...	...5:45.2100000	...5:48.4933333	3.283	10	1.406	0	6.449514753891	Return	4	0
{FD2D17A...	...8:52.1400000	...8:55.4166666	3.276	10	0.547	10	1.492313471789	ComputeToControlNode	5	6
{FD2D17A...	...8:52.1400000	...8:55.4166666	3.276	10	0.125	10	0.003963912451	Return	6	0
{9FB0A26...	...9:15.1300000	...9:18.6366666	3.506	10	0.203	2030	2.279484738326	Shuffle	7	8
{9FB0A26...	...9:15.1300000	...9:18.6366666	3.506	10	1.922	6	6.949098395914	Return	8	0
{1732AB0...	...9:24.6900000	...9:27.4500000	2.76	10	0.172	2030	2.279484738326	Shuffle	9	10
{1732AB0...	...9:24.6900000	...9:27.4500000	2.76	10	1.297	9	6.454460892023	Return	10	0

Notes:
As can be seen, the volume of data processed and the elapsed time values don't match between the two tables, though they are close. The differences probably result from further steps occuring in the process.

Happy coding!

Previous Post <<||>> Next Post

References:
[1] Josep Aguilar-Saborit, Raghu Ramakrishnan et al, "POLARIS: The Distributed SQL Engine in Azure Synapse", VLDB Conferences. PVLDB, 13(12): 3204 – 3216, 2020, DOI: https://doi.org/10.14778/3415478.3415545

15 January 2023

💎🏭SQL Reloaded: Monitoring the Synapse serverless SQL pool with Dynamics Management Views I

I feel sometimes flying blind when I build or troubleshoot SQL queries and I don't have the query plan and/or further statistics to understand how the database engine works, why some queries take longer than expected, etc. Unfortunately, Synapse serverless SQL pool doesn't seem to support showing exection plans in SQL Server Management Studio as per now (SHOWPLAN_XML is not supported for SET). I looked at my old queries based on the the sys.dm_exec_requests and sys.dm_exec_query_stats DMVs, however the results didn't proved to be what I was searching for. (

This weekend, I found Sidney Cirqueira's post on monitoring Synapse serverless SQL pools where he describes how to do that via the Monitoring hub, DMVs, QPI library, respectively Log Analytics. (You should check regularly the Azure Synapse Analytics Blog as it's full of goodies!)

Thus, I found out that there's a new DMV called sys.dm_exec_requests_history which provides at least the duration and the volume of data processes by each statement run on the service:

-- Azure Serverless SQL pool: requests' history 
SELECT top 100 ERH.status
, ERH.transaction_Id
, ERH.distributed_statement_Id 
, ERH.query_hash 
, ERH.login_name 
, ERH.start_time
, ERH.end_time 
, ERH.command 
, ERH.query_text 
--, ERH.total_elapsed_time_ms
, ERH.total_elapsed_time_ms/1000.0 total_elapsed_time_sec

--, ERH.data_processed_mb
, ERH.data_processed_mb/1028.0 data_processed_gb
, ERH.error
, ERH.error_code 
FROM sys.dm_exec_requests_history ERH
ORDER BY ERH.data_processed_mb DESC

It isn't much information, compared with the columns returned by sys.dm_exec_requests, but it's something to start with. At least it allows focusing on the queries with the longest duration (use the above query sorting the records based on the total_elapsed_time_ms descending) or highest volume of data processed:

-- Azure Serverless SQL pool: queries with most data processed
SELECT TOP 50 ERH.query_text  
, COUNT(*) no_runs
, SUM(ERH.total_elapsed_time_ms) total_elapsed_time_ms
, SUM(ERH.data_processed_mb) data_processed_mb
, SUM(ERH.data_processed_mb/1028.0) data_processed_gb
, MIN(ERH.start_time) first_run_date
, MAX(ERH.start_time) last_run_date
FROM sys.dm_exec_requests_history ERH
GROUP BY ERH.query_text
HAVING COUNT(*)>1
ORDER BY data_processed_mb DESC

The same query can be slightly changed to retrieve the volume of data processed by month:

-- Azure Serverless SQL pool: data processed by month
SELECT Convert(nvarchar(7), ERH.start_time, 23) [period]
, COUNT(*) no_runs
, SUM(ERH.total_elapsed_time_ms) total_elapsed_time_ms
, SUM(ERH.data_processed_mb) data_processed_mb
, SUM(ERH.data_processed_mb/1028.0) data_processed_gb
, MIN(ERH.start_time) first_run_date
, MAX(ERH.start_time) last_run_date
FROM sys.dm_exec_requests_history ERH
GROUP BY Convert(nvarchar(7), ERH.start_time, 23)
HAVING COUNT(*)>1
ORDER BY data_processed_mb DESC

One can add in the grouping also the login name to break down the analysis by the login that issued the query. Organization's domain can be used to differentiate between system or organization-baed queries.

The volume of data processed is stored also in the sys.dm_external_data_processed DMV aggregated for the current day, week, respectively month as part of the cost control related feature:

-- Azure Serverless SQL pool: volume of data processed
SELECT type 
, data_processed_mb 
, data_processed_mb/1028.0 data_processed_gb
FROM sys.dm_external_data_processed

And here's how the output looks like:

type	data_processed_mb	data_processed_gb
daily	230	0.223735
weekly	377	0.366731
monthly	223522	217.433852

Notes:

1) I still need to play with the DMVs to understand their scope and limitations.

2) The view appears also in the list of DMVs I idenfitied to be supported the by Synapse serverless SQL pool. As I discoered later, 3 more DMVs are available with useful statistics.

3) ~~The queries based on sys.dm_exec_requests and sys.dm_exec_query_stats DMVs seem to return only the running query based on them.~~ (Actually, the DMVs seem to work.)

4) The view is available also in SQL Server 2022, though it doesn't seem to be used.

5) According to the above-mentioned source, the view is provided for ticket purposes to help customers better troubleshooting the SQL requests. Use the distributed_statement_id in the tickets raised with Microsoft to troubleshoot any issues with Synapse.

6) Unfortunately, also the useful Query Store feature is not yet supported, even if the DMVs related to it seem to be available. Attempting to enable it results in the error:

"Msg 15869, Level 16, State 9, Line 1
QUERY_STORE is not supported for ALTER DATABASE"

Previous Post <<||>> Next Post

Happy coding!

SQL Troubles

Pages

01 February 2023

💎SQL Reloaded: Alternatives for Better Code Maintainability in SQL Server & Azure Synapse I

28 January 2023

💎SQL Reloaded: Temporary Tables and Tempdb in Serverless SQL Pool

25 January 2023

💎SQL Reloaded: Documenting External Tables

20 January 2023

💎SQL Reloaded: Monitoring the Synapse serverless SQL pool with Dynamics Management Views II

15 January 2023

💎🏭SQL Reloaded: Monitoring the Synapse serverless SQL pool with Dynamics Management Views I

About Me