01 February 2023

💎SQL Reloaded: Alternatives for Better Code Maintainability in SQL Server & Azure Synapse I

Introduction

Queries can become quite complex and with the increased complexity they'll be harder to read and/or maintain. Since the early days of SQL Server, views and table-valued user-defined functions (UDFs) are the standard ways of encapsulating logic for reuse in queries, allowing to minimize the duplication of logic. Then came the common table expressions (CTEs), which allow a further layer of structuring the code, independently whether a view or UDF was used. 

These are the main 3 options that can be combined in various ways to better structure the code. On the other side, also a temporary table or table variable could be used for the same purpose, though they have further implications.

To exemplify the various approaches, let's consider a simple query based on two tables from the AdventureWorks database. For the sake of simplicity, further business rules have been left out.

Inline Subqueries

-- products with open purchase orders
SELECT ITM.ProductNumber
, ITM.Name
, POL.PurchQty
FROM Production.Product ITM
     JOIN ( -- cumulated open purchase orders by product
		SELECT POL.ProductId 
		, SUM(POL.OrderQty) PurchQty
		FROM Purchasing.PurchaseOrderDetail POL
		WHERE OrderQty - (ReceivedQty - RejectedQty)>0
		GROUP BY POL.ProductId 
	) POL
	ON ITM.ProductId = POL.ProductId
ORDER BY ITM.ProductNumber

As can be seen, the logic for the "Open purchase orders" result set is built within an inline subquery (aka inline view). As its logic becomes more complex, the simplest way to handle this is to move it into a CTE.

Common Table Expressions (CTEs)

A common table expression can be thought of as a temporary result set defined within the execution scope of a single SELECT, INSERT, UPDATE, DELETE or CREATE VIEW statement [1]. Thus, the CTE can't be reused between queries.

The inline query is moved at the beginning within a WITH statement to which is given a proper name that allows easier identification later:

-- products with open purchase orders (common table expression)
WITH OpenPOs
AS (-- cumulated open purchase orders by product
	SELECT POL.ProductId 
	, SUM(POL.OrderQty) PurchQty
	FROM Purchasing.PurchaseOrderDetail POL
	WHERE OrderQty - (ReceivedQty - RejectedQty)>0
	GROUP BY POL.ProductId 
)
SELECT ITM.ProductNumber
, ITM.Name
, POL.PurchQty
FROM Production.Product ITM
     JOIN OpenPOs POL
	   ON ITM.ProductId = POL.ProductId
ORDER BY ITM.ProductNumber

Thus, this allows us to rewrite the JOIN as if it were between two tables. Multiple CTEs can be used as well, with or without any dependencies between them. Moreover, CTEs allow building recursive queries (see example).

There is no performance gain or loss by using a CTE. It's important to know that the result set is not cached, therefore, if the same CTE is called multiple times (within a query), it will be also "executed" for the same number of times. Except the cases in which the database engine uses a spool operator to save intermediate query results for a CTE, there will be created no work table in tempdb for CTEs.

If the inline query needs to be reused in several queries, defining a view is a better alternative.

Views

A view is a database object used to encapsulate a query and that can be referenced from other queries much like a table. In fact, it's also referred as a "virtual table". A view can't be execute by itself (as stored procedures do. No data, only the definition of the view is stored, and the various actions that can be performed on database objects can be performed on views as well.

-- creating the view
CREATE VIEW dbo.vOpenPurchaseOrders
AS
SELECT POL.ProductId 
, SUM(POL.OrderQty) PurchQty
FROM Purchasing.PurchaseOrderDetail POL
WHERE OrderQty - (ReceivedQty - RejectedQty)>0
GROUP BY POL.ProductId 

-- testing the view
SELECT top 10 *
FROM dbo.vOpenPurchaseOrders

Once the view is created, it can be called from any query:

-- products with open purchase orders (table-valued function)
SELECT ITM.ProductNumber
, ITM.Name
, POL.PurchQty
FROM Production.Product ITM
     JOIN dbo.vOpenPurchaseOrders POL
	   ON ITM.ProductId = POL.ProductId
ORDER BY ITM.ProductNumber

Besides the schema binding, there are no additional costs for using views. However, views have several limitations (see [2]). Moreover, it's not possible to use parameters with views, scenarios in which tabled-valued UDFs can help.

Indexed Views 

Starting with SQL Server 2015, it's possible to materialize the data in a view, storing the results of the view in a clustered index on the disk in same way a table with a clustered index is stored. This type of view is called an indexed view (aka materialized view, though the concept is used slightly different in Azure Synapse) and for long-running queries can provide considerable performance gains. In case the view contains a GROUP BY is present, its definition must contain COUNT_BIG(*) and must not contain HAVING.

-- dropping the view
--DROP VIEW IF EXISTS Purchasing.vOpenPOs

-- create view
CREATE VIEW Purchasing.vOpenPOs
WITH SCHEMABINDING
AS
SELECT POL.ProductId 
, SUM(POL.OrderQty) PurchQty
, COUNT_BIG(*) Count
FROM Purchasing.PurchaseOrderDetail POL
WHERE OrderQty - (ReceivedQty - RejectedQty)>0
GROUP BY POL.ProductId 
GO

--Create an index on the view.
CREATE UNIQUE CLUSTERED INDEX IDX_vOpenPOs
   ON Purchasing.vOpenPOs (ProductId);

--testing the view
SELECT top 100 *
FROM Purchasing.vOpenPOs

-- products with open purchase orders (indexed view)
SELECT ITM.ProductNumber
, ITM.Name
, POL.PurchQty
FROM [Production].[Product] ITM
     JOIN Purchasing.vOpenPOs POL
	   ON ITM.ProductId = POL.ProductId
ORDER BY ITM.ProductNumber

When an indexed view is defined on a table, the query optimizer may use it to speed up the query execution even if it wasn't referenced in the query. Besides the restriction of the view to be deterministic, further limitations apply (see [6]).

Table-Valued Functions

A table-valued function is a user-defined function in which returns a table as a result, as opposed to a single data value, as scalar functions do.

Let's support that we need to restrict base the logic based on a time interval. We'd need then to provide the StartDate & EndDate as parameters. Compared with other UDFs table-valued functions, as their name implies, need to return a table:

-- creating the UDF function 
CREATE FUNCTION dbo.tvfOpenPurchaseOrdersByProduct( 
  @StartDate date 
, @EndDate date) 
RETURNS TABLE 
AS RETURN ( 
	SELECT POL.ProductId 
	, SUM(POL.OrderQty) PurchQty
	FROM Purchasing.PurchaseOrderDetail POL
	WHERE OrderQty - (ReceivedQty - RejectedQty)>0
	  AND POL.DueDate BETWEEN @StartDate AND @EndDate
	GROUP BY POL.ProductId 
)

-- testing the UDF
SELECT top 10 *
FROM dbo.tvfOpenPurchaseOrdersByProduct('2014-01-01', '2014-12-31')

A table-valued function can be used as a "table with parameters" in JOINs:

-- products with open purchase orders (table-valued function)
SELECT ITM.ProductNumber
, ITM.Name
, POL.PurchQty
FROM Production.Product ITM
     JOIN dbo.tvfOpenPurchaseOrdersByProduct('2014-01-01', '2014-12-31') POL
	   ON ITM.ProductId = POL.ProductId
ORDER BY ITM.ProductNumber

The parameters are optional, though in such cases using a view might still be a better idea. Table-valued functions used to have poor performance in the past compared with views and in certain scenarios they might still perform poorly. Their benefit resides in allowing to pass and use parameters in the logic, which can make them irreplaceable. Moreover, multi-statement table-valued functions can be built as well (see example)!

Notes:
1) When evaluating table-valued functions for usage consider their limitations as well (see [3])!
2) Scalar UDFs can be used to simplify the code as well, though they apply only to single values, therefore they are not considered in here!

Temporary Tables 

A temporary table is a base table that is stored and managed in tempdb as any other table. It exists only while the database session in which it was created is active. Therefore, it can be called multiple times, behaving much like a standard table:

-- create the temp table
CREATE TABLE dbo.#OpenPOs (
  ProductId int NOT NULL
, PurchQty decimal(8,2) NOT NULL
)

-- insert the cumulated purchase orders
INSERT INTO #OpenPOs
SELECT POL.ProductId 
, SUM(POL.OrderQty) PurchQty
FROM Purchasing.PurchaseOrderDetail POL
WHERE OrderQty - (ReceivedQty - RejectedQty)>0
GROUP BY POL.ProductId 

-- products with open purchase orders (table-valued function)
SELECT ITM.ProductNumber
, ITM.Name
, POL.PurchQty
FROM [Production].[Product] ITM
     JOIN dbo.#OpenPOs POL
	   ON ITM.ProductId = POL.ProductId
ORDER BY ITM.ProductNumber

-- drop the table (cleaning)
-- DROP TABLE IF EXISTS dbo.#OpenPOs;

Being created in the tempdb, system database shared by several databases, temporary table's performance relies on tempdb's configuration and workload. Moreover, the concurrent creation of temporary tables from many sessions can lead to tempdb metadata contention, as each session attempts updating metadata information in the system based tables.

Temporary tables are logged, which adds more burden on the database engine, however being able to create indexes on them and use statistics can help processing result sets more efficiently, especially when called multiple times. 

Also, a temporary table might be cached (see [1]) and not deleted when its purpose ends, which allows operations that drop and create the objects to execute very quickly and reduces page allocation contention.

Table Variables

A table variable is a variable of type TABLE and can be used in functions, stored procedures, and batches. The construct is similar to the temp table and is stored as well in the tempdb and cached under certain scenarios, however they are scoped to the batch or routine in which they are defined and destroyed after that. 

-- create the table variable
DECLARE @OpenPOs TABLE (
  ProductId int NOT NULL
, PurchQty decimal(8,2) NOT NULL
)

-- insert the cumulated purchase orders
INSERT INTO @OpenPOs
SELECT POL.ProductId 
, SUM(POL.OrderQty) PurchQty
FROM Purchasing.PurchaseOrderDetail POL
WHERE OrderQty - (ReceivedQty - RejectedQty)>0
GROUP BY POL.ProductId 

-- products with open purchase orders (table variable)
SELECT ITM.ProductNumber
, ITM.Name
, POL.PurchQty
FROM [Production].[Product] ITM
     JOIN @OpenPOs POL
	   ON ITM.ProductId = POL.ProductId
ORDER BY ITM.ProductNumber

Table variables don’t participate in transactions or locking, while DML operations done on them are not logged. There are also no statistics maintained and any data changes impacting the table variable will not cause recompilation. Thus, they are usually faster than temporary variables, especially when their size is small, though their performance depends also on how they are used. On the other side, for big result sets and/or when several calls are involved, a temporary table could prove to be more efficient. 

Important!!! Temporary tables and table variables are means of improving the performance of long-running queries. Being able to move pieces of logic around helps in maintaining the code and it also provides a logical structure of the steps, however they shouldn't be used if the performance gain is not the target! Overusing them as technique can considerably decrease the performance of tempdb, which can have impact in other areas!

Azure Synapse

Moving to Azure Synapse there are several important limitations in what concerns the above (see [4]). Even if some features are supported, further limitations might apply. What's important to note is that materialized views act like indexed view in standard SQL Server and that CETAS (Create External Table as SELECT) are available to import/export data to the supported file formats in Hadoop, Azure storage blob or Azure Data Lake Storage Gen2.

FeatureDedicatedServerlessSQL Server
CTEsYesYesYes (2015+)
Recursive CTEsNoNoYes (2015+)
ViewsYesYesYes
Indexed viewsNoNoYes
Materialized viewsYesNoNo
Table-valued functions (single statement)NoYesYes
Table-valued functions (multi-statement)NoNoYes
Scalar UDFs YesNoYes
TablesYesNoYes
Temporary tables (local)YesLimitedYes
Temporary tables (global)NoNoYes
Table variablesYesYesYes
CETASYesLimitedYes (2022+)

Notes:
1) CETAS have two important limitations in serverless SQL Pool
    a) once the data were exported to a file, they can't be overwritten via the same syntax;
    b) logic based on temporary tables can't be exported via pipelines.
2) Temporary tables can be used to replace cursors (see example).

Previous Post  <<||>>  Next Post

Resources:
[1] Microsoft Learn (2012) Capacity Planning for tempdb (link)
[2] Microsoft Learn (2023) CREATE View (link)
[3] Microsoft Learn (2023) CREATE Function (link)
[4] Microsoft Learn (2023) Transact-SQL features supported in Azure Synapse SQL (link)
[5] Redgate (2018) Choosing Between Table Variables and Temporary Tables (ST011, ST012), by Phil Factor (link)
[6] Microsoft Learn (2023) Create indexed views (link)
[7] Microsoft Learn (2023) CREATE MATERIALIZED VIEW AS SELECT (Transact-SQL) (link)
[8] Microsoft Learn (2023) CETAS with Synapse SQL (link)

28 January 2023

💎SQL Reloaded: Temporary Tables and Tempdb in Serverless SQL Pool

In SQL Server, temporary tables are stored in tempdb database and the front end (FE) SQL Server from a serverless SQL Server pool makes no exception from it, however there's no such database listed in the list of databases available. Watching today Brian Bønk's session on "Azure Synapse serverless - CETAS vs Views vs Parquet" at Data Toboggan winter edition 2023, the speaker pointed out that the tables available in tempdb database can be actually queried:

-- Azure Serverless SQL pool: tempdb tables
SELECT top 10 *
FROM tempdb.information_schema.tables
TABLE_CATALOGTABLE_SCHEMATABLE_NAMETABLE_TYPE
tempdbdbo#dm_continuous_copy_status_..._0000000029C7BASE TABLE
tempdbdbodiff_MonDmSbsConnectionsBASE TABLE
tempdbdbodiff_MonTceMasterKeysBASE TABLE
tempdbdbodiff_MonTceColEncryptionKeyBASE TABLE
tempdbdbodiff_MonAutomaticTuningStateBASE TABLE
tempdbdbodiff_MonDmTranActiveTransactionsBASE TABLE
tempdbdbodiff_MonTceEnclaveUsageInfoBASE TABLE
tempdbdbodiff_MonTceEnclaveColEncryptionKeyBASE TABLE
tempdbdbodiff_MonTceEnclaveMasterKeysBASE TABLE
tempdbdbodiff_MonTceColumnInfoBASE TABLE

Moreover, the content of the system tables available can be queried, even if some of the tables might have no data:
 
-- Azure Serverless SQL pool: checking a table's content
SELECT top 10 *
FROM tempdb.dbo.dmv_view_run_history

How about the temporary tables created? To check this, let's reuse a temp table created recently for listing the DMVs available in the serverlss SQL pool:

-- dropping the temp table
DROP TABLE IF EXISTS dbo.#views;

-- create the temp table
CREATE TABLE dbo.#views (
  ranking int NOT NULL
, view_name nvarchar(150) NOT NULL
)

-- inserting a few records
INSERT INTO #views
SELECT row_number() OVER(ORDER BY object_id) ranking
, concat(schema_name(schema_id),'.', name) view_name
FROM sys.all_views obj
WHERE obj.Type = 'V'
  AND obj.is_ms_shipped = 1
  AND obj.name LIKE 'dm_exec_requests%'

-- checking temp table's content
SELECT *
FROM dbo.#views
rankingview_name
1sys.dm_exec_requests
2sys.dm_exec_requests_history

Here's temp table's metadata:
 
-- checking temp table's metadata
SELECT *
FROM tempdb.information_schema.tables
WHERE table_name like '#views%'
TABLE_CATALOGTABLE_SCHEMATABLE_NAMETABLE_TYPE
tempdbdbo#views_____..._____00000000295BBASE TABLE

Of course, you can query also the tempdb.sys.all_objects: 

-- checking temp table's metadata
SELECT *
FROM tempdb.sys.all_objects
WHERE name like '#views%'

You can use now the table name returned by any of the two queries to call the table (just replace the name accordingly):

-- querying the table via tempdb (name will differ)
SELECT *
FROM tempdb.dbo.[#views______________________________________________________________________________________________________________00000000295B]
rankingview_name
1sys.dm_exec_requests
2sys.dm_exec_requests_history

The benefit of knowing that is neglectable, however the topic is interesting more for the ones who want to know how the pools in Azure Synapse work, what was kept or removed compared with the standard editions of SQL Server. 

Thus, another topic interesting for DBAs would be how many files are created for the tempdb, given that the database is used to store intermediate data for the various operations. Finding the optimal number of tempdb files and configuring them was and still is one of the main concerns when configuring an SQL Server instance for optimal performance.

The old query developed for standard SQL Server seems to work as well:
 
-- Azure Serverless SQL pool: tempdb files 
SELECT dbf.file_id
, dbf.name file_name
--, dbf.physical_name
, dsp.name file_group
--, type 
, dbf.type_desc file_type
--, dbf.growth growth_kb
, Cast(dbf.growth/128.0  as decimal(18,2)) growth_mb
, dbf.is_percent_growth
--, dbf.max_size max_size_kb
, Cast(NullIf(dbf.max_size, -1)/128.0  as decimal(18,2)) max_size_mb
--, dbf.size file_size_kb
, Cast(dbf.size/128.0 as decimal(18,2)) file_size_mb
, dbf.state_desc 
, dbf.is_read_only 
FROM tempdb.sys.database_files dbf
     LEFT JOIN tempdb.sys.data_spaces dsp
       ON dbf.data_space_id = dsp.data_space_id
ORDER BY dbf.Name
file_idfile_namefile_groupfile_typegrowth_mbis_percent_growthmax_size_mbfile_size_mbstate_descis_read_only
1tempdevPRIMARYROWS32010267016ONLINE0
3tempdev2PRIMARYROWS32010267016ONLINE0
4tempdev3PRIMARYROWS32010267016ONLINE0
5tempdev4PRIMARYROWS32010267016ONLINE0
2templogNULLLOG64016384255ONLINE0

So, there were created 4 data files of 16 MB each, with a fix growth of 32 MB, and the files can grow up to about 100 GB. What would be interesting to check is how the files grow under heavy workload, what kind of issues this raises, etc. At least in serverless SQL pools many of the views one used to use for troubleshooting aren't accessible. Maybe one doesn't have to go that far, given that the resources are managed automatically. 

Notes
(1) Microsoft recommends not to drop the temporary tables explicitly, but let SQL Server handle this cleanup automatically and take thus advantage of the Optimistic Latching Algorithm, which helps prevent contention on TempDB [1].

Happy coding!

Last updated: Oct-2024


References:
[1] Haripriya SB (2024) Do NOT drop #temp tables (link)

25 January 2023

💎SQL Reloaded: Documenting External Tables

When using serverless SQL pool, CETAS (aka Create External Table as Select) are the main mechanism of making data available from the Data Lake for queries. In case is needed to document them, sys.external_tables DMV can be used to export their metadata, much like sys.tables or sys.views can be used for the same:

-- CETAS metadata
SELECT TOP (50) ext.object_id
, ext.name
, schema_name(ext.schema_id) [schema_name]
, ext.type_desc
, ext.location
, ext.data_source_id
, ext.file_format_id
, ext.max_column_id_used
, ext.uses_ansi_nulls
, ext.create_date
, ext.modify_date
FROM sys.external_tables ext

It's interesting that CETAS have type_desc = 'USER_TABLE' in sys.all_objects, same like user-defined tables in SQL Server have:

-- CETAS' metadata via sys.all_objects
SELECT *
FROM sys.all_objects
WHERE object_id = object_id('<schema_name>.<CETAS name>')

The data source and file format can be retrieved via the sys.external_data_sources and sys.external_file_formats DMVs. Moreover, it's useful to include the logic into a view, like the one below:
 
-- drop view
--DROP VIEW IF EXISTS dbo.vAdminExternalTables

-- create view
CREATE VIEW dbo.vAdminExternalTables
AS
-- external tables - metadata 
SELECT ext.object_id
, sch.name + '.' + ext.name [unique_identifier]
, sch.name [schema]
, ext.name [object]
, ext.type_desc [type]
, ext.max_column_id_used 
, ext.location
, eds.name data_source 
, eff.name file_format 
, ext.create_date 
, ext.modify_date
FROM sys.external_tables ext
     JOIN sys.schemas sch
       ON ext.schema_id = sch.schema_id 
     JOIN sys.external_data_sources eds 
       ON ext.data_source_id = eds.data_source_id
     JOIN sys.external_file_formats eff
       ON ext.file_format_id = eff.file_format_id

-- testing the view
SELECT top 10 *
FROM dbo.vAdminExternalTables

The view can be used then for further queries, for example checking the CETAS created or modified starting with a given date:
 
-- external tables created after a certain date
SELECT *
FROM dbo.vAdminExternalTables ext
WHERE ext.create_date >= '20230101'
  OR ext.modify_date >= '20230101';

Or, when the CETAS are deployed from one environment to another, one can compare the datasets returned by the same view between environments, something like in the below query:
 
-- comparison external tables metadata between two databases
SELECT *
FROM (
    SELECT *
    FROM <test_database>.dbo.vAdminExternalTables
    WHERE [Schema] = '<schema_name>'
	) PRD
	FULL OUTER JOIN (
    SELECT *
    FROM <prod_database>.dbo.vAdminExternalTables
    WHERE [Schema] = '<schema_name>'
	) UAT
	ON PRD.[unique_identifier] = UAT.[unique_identifier]
-- WHERE PRD.[unique_identifier] IS NULL OR UAT.[unique_identifier] IS NULL

The definitions for multiple CETAS can be exported from the source database in one step via the Object Explorer Details >> Tables >> External tables >> (select CETAS) >> Script Table as >> ... . 

Happy coding!

20 January 2023

💎SQL Reloaded: Monitoring the Synapse serverless SQL pool with Dynamics Management Views II

Identifying the SQL Server DMVs which are accessible for the Serverless SQL pool (see previous post), allowed me to identify besides sys.dm_exec_requests_history three more DMVs with statistics on the statements executed on the server: sys.dm_request_phases, sys.dm_request_phases_task_group_stats and sys.dm_request_phases_exec_task_stats. Untofurtunately, there seems to be no documentation available on these DMVs, and, at the time the post was written, there were also no further hits on google.com or bing.com found on the same.

sys.dm_request_phases

sys.dm_request_phases  provides insights in the phases an execution statement goes through, and seems to summarize the other two views:

-- Azure Serverless SQL pool: request phases
SELECT TOP (100) dist_statement_id
, RPH.dist_request_id
, TRY_CAST(RPH.id as bigint) id
, TRY_CAST(RPH.parent_ids as bigint) parent_ids
, RPH.start_time
, RPH.end_time
--, RPH.total_elapsed_time_ms
--, RPH.total_elapsed_time_ms/1000.0 total_elapsed_time_sec
--, RPH.min_time_ms
--, RPH.min_time_ms/1000.0 min_time_sec
--, RPH.max_time_ms
--, RPH.max_time_ms/1000.0 max_time_sec
--, RPH.avg_time_ms
, RPH.avg_time_ms/1000.0 avg_time_sec
--, RPH.stdev_time_ms
--, RPH.stdev_time_ms/1000.0 stdev_time_sec -- it has no values
--, RPH.min_rows
--, RPH.max_rows
--, RPH.avg_rows
--, RPH.stdev_rows -- it has no values
, RPH.total_rows
--, RPH.total_bytes_processed
, RPH.total_bytes_processed/1028.0 total_kb_processed
, RPH.state_desc
, RPH.operation_type
, RPH.input_dop
, RPH.output_dop
, RPH.task_retries
, RPH.error_id
FROM sys.dm_request_phases RPH
ORDER BY Id
dist_statement_iddist_request_ididparent_idsstart_timeend_timeavg_time_sectotal_rowstotal_kb_processedstate_descoperation_typeinput_dopoutput_doptask_retrieserror_id
8C4386DC...820E9FC6...12...09:58:34.213...09:58:36.3371.03120302343.310311succeededShuffle1100
8C4386DC...820E9FC6...20...09:58:36.447...09:58:39.7131.89197145.193579succeededReturn1100
C9524971...680DCB55...34...10:05:46.747...10:05:47.0570.20320302343.310311succeededShuffle1100
C9524971...680DCB55...40...10:05:47.057...10:05:48.4801.40606630.101167succeededReturn1100
FD2D17AD...C9453EF2...56...11:58:54.060...11:58:55.2970.547101534.098249succeededComputeToControlNode1100
FD2D17AD...C9453EF2...60...11:58:55.297...11:58:55.4200.125104.074902succeededReturn1100
9FB0A268...CAA533DE...78...11:59:16.483...11:59:16.7000.20320302343.310311succeededShuffle1100
9FB0A268...CAA533DE...80...11:59:16.700...11:59:18.6401.92267143.673151succeededReturn1100
1732AB0D...AC1A4F10...910...11:59:25.950...11:59:26.1400.17220302343.310311succeededShuffle1100
1732AB0D...AC1A4F10...100...11:59:26.140...11:59:27.4501.29796635.185797succeededReturn1100

Notes:
1) The foreign keys and dates (in the above and below queries) were truncated to accomodate all the important attributes in the snapshot of the values returned.
2) Based on the exisitng queries, there are two records for each executed statement, a Shuffle or ComputeToControlNode followed by a Return (see operation_type). In more complex scenario there are several Shuffles and Broadcasts and a Return. According to the Microsoft team, even if for serverless SQL pools there's no Data Movement Service (DMS), there's a similar algorithm responsible for moving the data between the nodes.
3) Because in serverless SQL pool each query has its own distribution statement id, the min, max, avg and total values will have the sames values across the columns. Therefore, the columns with redundant values were commented.
4) The Id of the request phase seems to have numeric values despite being defined as alphanumeric. I tried to cast the values to bigint for sorting purposes.

sys.dm_request_phases_task_group_stats

sys.dm_request_phases_task_group_stats stores metadata about the requests breakdown at task group:

-- Azure Serverless SQL pool: request phases breakdown at task group
SELECT TOP (100) RPT.dist_request_id
, TRY_CAST(RPT.id as bigint) id
, TRY_CAST(RPT.parent_ids as bigint) parent_ids
, RPT.dist_statement_id
, RPT.state_desc
, RPT.start_time
, RPT.end_time
, RPT.input_dop
, RPT.output_dop
, RPT.operation_type
, RPT.task_retries
FROM sys.dm_request_phases_task_group_stats RPT
ORDER BY id
dist_request_ididparent_idsdist_statement_idstate_descstart_timeend_timeinput_dopoutput_dopoperation_typetask_retries
820E9FC6...128C4386DC...succeeded63809805514213255163809805516338269311Shuffle0
820E9FC6...208C4386DC...succeeded63809805516447616363809805519713300111Return0
680DCB55...34C9524971...succeeded63809805946745002163809805947057495311Shuffle0
680DCB55...40C9524971...succeeded63809805947057495363809805948479368211Return0
C9453EF2...56FD2D17AD...succeeded63809812734060711263809812735295106711ComputeToControlNode0
C9453EF2...60FD2D17AD...succeeded63809812735295106763809812735420297011Return0
CAA533DE...789FB0A268...succeeded63809812756482608463809812756701350411Shuffle0
CAA533DE...809FB0A268...succeeded63809812756701350463809812758638854911Return0
AC1A4F10...9101732AB0D...succeeded63809812765951362063809812766138851411Shuffle0
AC1A4F10...1001732AB0D...succeeded63809812766138851463809812767451360111Return0

Notes:
1) The DVM seems to return the same number of records as sys.dm_request_phases.
2) Observe the format of the start_time and end_time, probably the timestamps come from the Spark cluster and were not translated into an SQL Server data type.

sys.dm_request_phases_exec_task_stats

sys.dm_request_phases_exec_task_stats stores metadata about the requests breakdown at task level:

-- Azure Serverless SQL pool: request phases breakdown at task
SELECT TOP (100) RPE.dist_request_id
, TRY_CAST(RPE.id as bigint) id
--, RPE.min_time_ms
--, RPE.max_time_ms
, RPE.avg_time_ms/1000.0 avg_time_sec
--, RPE.stdev_time_ms
, RPE.total_bytes_processed
--, RPE.min_rows
--, RPE.max_rows
--, RPE.avg_rows
--, RPE.stdev_rows
, RPE.total_rows
, RPE.error_id
FROM sys.dm_request_phases_exec_task_stats RPE
ORDER BY id
dist_request_ididavg_time_sectotal_kb_processedtotal_rowserror_id
820E9FC6...11.0312343.31031120300
820E9FC6...21.8917145.19357990
680DCB55...30.2032343.31031120300
680DCB55...41.4066630.10116700
C9453EF2...50.5471534.098249100
C9453EF2...60.1254.074902100
CAA533DE...70.2032343.31031120300
CAA533DE...81.9227143.67315160
AC1A4F10...90.1722343.31031120300
AC1A4F10...101.2976635.18579790

What does all this mean?

The lack of documentation makes it challenging to interpret the values of the views besides the data and metadata they offer. In a paper on POLARIS, the code given to the serveless SQL pool engine, a taks is defined as "a careful packaging of data and query processing into units [...] that can be readily moved across compute nodes and re-started at the task level" [1]. Therefore, one can assume that this is the level targetted by the sys.dm_request_phases_exec_task_stats DMV. Further on, the tasks are grouped at phase level according to the sys.dm_request_phases_task_group_stats, the metadata from the two DMVs being further combined into sys.dm_request_phases DMV. 

If the meaning is kept from dedicated SQL pools, a shuffle operation indicates that data is moved between the frontend and backend nodes to satisfy a request, while a Result represents the operation of returning the result selt to client. The "ComputeToControlNode" operation involves a simple select (e.g. SELECT top 10) from a CETA and therefore no "Shuffle" is needed.

Requests' history

Further on, one can use the "Distributed statement id" to join the execution request phases with the request history, however matches will be found only for a small subset of the records (probably the executions since the pool started):
 
-- Azure Serverless SQL pool: requests history with request phase info
SELECT top 100 ERH.status
, ERH.transaction_Id
, ERH.distributed_statement_Id 
, ERH.query_hash 
--, ERH.login_name 
, ERH.start_time
, ERH.end_time 
, ERH.command 
, ERH.query_text 
--, ERH.total_elapsed_time_ms
, ERH.total_elapsed_time_ms/1000.0 total_elapsed_time_sec
--, ERH.data_processed_mb
, ERH.data_processed_mb
, RPH.avg_time_ms/1000.0 avg_time_sec
, RPH.total_rows
, RPH.total_bytes_processed/1028.0/1028.0 total_mb_processed
, RPH.state_desc
, RPH.operation_type
, RPH.input_dop
, RPH.output_dop
, RPH.task_retries
, RPH.error_id
, ERH.error
, ERH.error_code 
FROM sys.dm_exec_requests_history ERH
     JOIN sys.dm_request_phases RPH
	   ON ERH.distributed_statement_Id = RPH.dist_statement_id
	  --AND RPH.parent_ids = 0 -- only the parent
ORDER BY RPH.Id DESC

Here's a subset of the result set focusing only on the statistical values:
 
distr_statement_Idstart_timeend_timetotal_elapsed_time_secdata_processed_mbavg_time_sectotal_rowstotal_mb_processedoperation_typeidparent_ids
{8C4386D......8:24.4300000...8:39.826666615.396101.03120302.279484738326Shuffle12
{8C4386D......8:24.4300000...8:39.826666615.396101.89196.950577411478Return20
{C952497......5:45.2100000...5:48.49333333.283100.20320302.279484738326Shuffle34
{C952497......5:45.2100000...5:48.49333333.283101.40606.449514753891Return40
{FD2D17A......8:52.1400000...8:55.41666663.276100.547101.492313471789ComputeToControlNode56
{FD2D17A......8:52.1400000...8:55.41666663.276100.125100.003963912451Return60
{9FB0A26......9:15.1300000...9:18.63666663.506100.20320302.279484738326Shuffle78
{9FB0A26......9:15.1300000...9:18.63666663.506101.92266.949098395914Return80
{1732AB0......9:24.6900000...9:27.45000002.76100.17220302.279484738326Shuffle910
{1732AB0......9:24.6900000...9:27.45000002.76101.29796.454460892023Return100

Notes:
As can be seen, the volume of data processed and the elapsed time values don't match between the two tables, though they are close. The differences probably result from further steps occuring in the process. 

Happy coding!

Previous Post  <<||>>  Next Post

References:
[1] Josep Aguilar-Saborit, Raghu Ramakrishnan et al, "POLARIS: The Distributed SQL Engine in Azure Synapse", VLDB Conferences. PVLDB, 13(12): 3204 – 3216, 2020, DOI: https://doi.org/10.14778/3415478.3415545

15 January 2023

💎🏭SQL Reloaded: Monitoring the Synapse serverless SQL pool with Dynamics Management Views I

I feel sometimes flying blind when I build or troubleshoot SQL queries and I don't have the query plan and/or further statistics to understand how the database engine works, why some queries take longer than expected, etc. Unfortunately, Synapse serverless SQL pool doesn't seem to support showing exection plans in SQL Server Management Studio as per now (SHOWPLAN_XML is not supported for SET). I looked at my old queries based on the the sys.dm_exec_requests and sys.dm_exec_query_stats  DMVs, however the results didn't proved to be what I was searching for. (

This weekend, I found Sidney Cirqueira's post on monitoring Synapse serverless SQL pools where he describes how to do that via the Monitoring hub, DMVs, QPI library, respectively Log Analytics. (You should check regularly the Azure Synapse Analytics Blog as it's full of goodies!)

Thus, I found out that there's a new DMV called sys.dm_exec_requests_history which provides at least the duration and the volume of data processes by each statement run on the service:

-- Azure Serverless SQL pool: requests' history 
SELECT top 100 ERH.status
, ERH.transaction_Id
, ERH.distributed_statement_Id 
, ERH.query_hash 
, ERH.login_name 
, ERH.start_time
, ERH.end_time 
, ERH.command 
, ERH.query_text 
--, ERH.total_elapsed_time_ms
, ERH.total_elapsed_time_ms/1000.0 total_elapsed_time_sec
--, ERH.data_processed_mb
, ERH.data_processed_mb/1028.0 data_processed_gb
, ERH.error
, ERH.error_code 
FROM sys.dm_exec_requests_history ERH
ORDER BY ERH.data_processed_mb DESC

It isn't much information, compared with the columns returned by sys.dm_exec_requests, but it's something to start with. At least it allows focusing on the queries with the longest duration (use the above query sorting the records based on the total_elapsed_time_ms descending) or highest volume of data processed:

-- Azure Serverless SQL pool: queries with most data processed
SELECT TOP 50 ERH.query_text  
, COUNT(*) no_runs
, SUM(ERH.total_elapsed_time_ms) total_elapsed_time_ms
, SUM(ERH.data_processed_mb) data_processed_mb
, SUM(ERH.data_processed_mb/1028.0) data_processed_gb
, MIN(ERH.start_time) first_run_date
, MAX(ERH.start_time) last_run_date
FROM sys.dm_exec_requests_history ERH
GROUP BY ERH.query_text
HAVING COUNT(*)>1
ORDER BY data_processed_mb DESC

The same query can be slightly changed to retrieve the volume of data processed by month:
 
-- Azure Serverless SQL pool: data processed by month
SELECT Convert(nvarchar(7), ERH.start_time, 23) [period]
, COUNT(*) no_runs
, SUM(ERH.total_elapsed_time_ms) total_elapsed_time_ms
, SUM(ERH.data_processed_mb) data_processed_mb
, SUM(ERH.data_processed_mb/1028.0) data_processed_gb
, MIN(ERH.start_time) first_run_date
, MAX(ERH.start_time) last_run_date
FROM sys.dm_exec_requests_history ERH
GROUP BY Convert(nvarchar(7), ERH.start_time, 23)
HAVING COUNT(*)>1
ORDER BY data_processed_mb DESC

One can add in the grouping also the login name to break down the analysis by the login that issued the query. Organization's domain can be used to differentiate between system or organization-baed queries.

The volume of data processed is stored also in the sys.dm_external_data_processed DMV aggregated for the current day, week, respectively month as part of the cost control related feature:
 
-- Azure Serverless SQL pool: volume of data processed
SELECT type 
, data_processed_mb 
, data_processed_mb/1028.0 data_processed_gb
FROM sys.dm_external_data_processed

And here's how the output looks like:
 
typedata_processed_mbdata_processed_gb
daily2300.223735
weekly3770.366731
monthly223522217.433852

Notes:
1) I still need to play with the DMVs to understand their scope and limitations.
2) The view appears also in the list of DMVs I idenfitied to be supported the by Synapse serverless SQL pool. As I discoered later, 3 more DMVs are available with useful statistics.
3) The queries based on sys.dm_exec_requests and sys.dm_exec_query_stats DMVs seem to return only the running query based on them.  (Actually, the DMVs seem to work.)
4) The view is available also in SQL Server 2022, though it doesn't seem to be used.
5) According to the above-mentioned source, the view is provided for ticket purposes to help customers better troubleshooting the SQL requests. Use the distributed_statement_id in the tickets raised with Microsoft to troubleshoot any issues with Synapse.
6) Unfortunately, also the useful Query Store feature is not yet supported, even if the DMVs related to it seem to be available. Attempting to enable it results in the error:
Msg 15869, Level 16, State 9, Line 1
QUERY_STORE is not supported for ALTER DATABASE


Happy coding!
Related Posts Plugin for WordPress, Blogger...

About Me

My photo
Koeln, NRW, Germany
IT Professional with more than 24 years experience in IT in the area of full life-cycle of Web/Desktop/Database Applications Development, Software Engineering, Consultancy, Data Management, Data Quality, Data Migrations, Reporting, ERP implementations & support, Team/Project/IT Management, etc.