
08 December 2023

💎🏭SQL Reloaded: Microsoft Fabric (Part I: Monitoring the Warehouse)

I was exploring the Microsoft Fabric Warehouse this week and observed that there are three views available under the queryinsights schema: exec_requests_history, frequently_run_queries and long_running_queries. According to their definitions, they are based on two database objects, fabric_query_starting and fabric_query_completed, that cannot be called directly.

Announced in the Nov-2023 update, the Query Insights (QI) feature is a "scalable, sustainable, and extendable solution to enhance the SQL analytics experience" (see Microsoft's documentation).

Strangely, the three views appear in the Model view together with the objects defined in dbo or other user-defined schemas. One can hide the queryinsights objects, however doing this in each model is impractical, especially as the number of objects defined in the respective schema will increase over time. On the other side, the respective objects might be useful for building a report that visualizes query performance. (Probably a multi-model solution and/or further settings will allow more flexibility.)

Secondly, the objects from the queryinsights schema are not available in the sys.objects catalog view:

SELECT top 10 *
FROM sys.objects 
WHERE name LIKE 'fabric%'

The exec_requests_history view (similar to the dm_exec_requests_history DMV from the Synapse serverless SQL pool) references identifiers from several DMVs (e.g. session_id, connection_id), and it would be useful to retrieve the corresponding information within the same query:

 -- fabric warehouse
 SELECT erh.distributed_statement_id
, erh.start_time
, erh.end_time
, erh.total_elapsed_time_ms
--, erh.login_name
, erh.row_count
, erh.status
, erh.session_id
, erh.connection_id
, erh.program_name
, erh.batch_id
, erh.root_batch_id
, erh.query_hash
, erh.command 
FROM queryinsights.exec_requests_history erh
WHERE status = 'Succeeded'--'Failed'

However, attempting to retrieve session information via the sys.dm_exec_sessions DMV leads to the below error message:

SELECT *
FROM queryinsights.exec_requests_history erh
     LEFT JOIN sys.dm_exec_sessions ses
       ON erh.session_id = ses.session_id
WHERE erh.status = 'Succeeded'--'Failed'

The query references an object that is not supported in distributed processing mode.
Msg 15816, Level 16, State 7, Code line 11

Using the standard SQL Server system functions seems to work, as long as the view from the queryinsights schema is not considered.
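For instance, the following minimal query (just a sketch; the exact set of supported functions may differ) runs without issues:

-- standard system functions, no queryinsights object involved
SELECT DB_NAME() database_name
, USER_NAME() user_name
, @@SPID session_id;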

According to the documentation, "Some objects, like system views, and functions can't be used while you query data stored in Azure Data Lake or Azure Cosmos DB analytical storage. Avoid using the queries that join external data with system views, load external data in a temp table, or use some security or metadata functions to filter external data."

Thus, one can presume that fabric_query_starting and fabric_query_completed are stored in the Data Lake and behave like standard user-defined tables. Unfortunately, no documentation seems to be available on this.

I tried using a temporary table, as advised above:

-- dropping the temp table
--DROP TABLE IF EXISTS dbo.#requests_history;

-- create the temp table
CREATE TABLE dbo.#requests_history (
  distributed_statement_id uniqueidentifier
, start_time datetime2(6)
, end_time datetime2(6)
, total_elapsed_time_ms bigint
, login_name varchar(255)
, row_count bigint
, status varchar(50)
, session_id bigint
, connection_id uniqueidentifier
, program_name varchar(255)
, batch_id uniqueidentifier
, root_batch_id uniqueidentifier
, query_hash uniqueidentifier
, command varchar(max)
)

-- inserting a few records
INSERT INTO dbo.#requests_history
SELECT erh.distributed_statement_id
, erh.start_time
, erh.end_time
, erh.total_elapsed_time_ms
, erh.login_name
, erh.row_count
, erh.status
, erh.session_id
, erh.connection_id
, erh.program_name
, erh.batch_id
, erh.root_batch_id
, erh.query_hash
, erh.command 
FROM queryinsights.exec_requests_history erh;

-- retrieve the inserted records
SELECT *
FROM dbo.#requests_history;

Unfortunately, the attempt led to the same error message. Further investigating the issue, I arrived at a known issue: "Temp table usage in Data Warehouse and SQL analytics endpoint". Hopefully, the fix will address this scenario as well. Otherwise, it might be easier to import the data into a solution (e.g. Power BI) and do the analysis there.

Using temporary tables with DMVs seems to work (see post).

Please note that the values are case sensitive and only a subset of the standard data types are supported (see documentation).
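A minimal sketch of the temporary table approach with a DMV, mirroring the structure above (assuming, as in the referenced post, that the DMV columns fit the supported data types):

-- dropping the temp table
--DROP TABLE IF EXISTS dbo.#sessions;

-- create the temp table
CREATE TABLE dbo.#sessions (
  session_id smallint
, login_name varchar(255)
, status varchar(50)
);

-- insert a few records from the DMV
INSERT INTO dbo.#sessions
SELECT ses.session_id
, ses.login_name
, ses.status
FROM sys.dm_exec_sessions ses;

-- retrieve the inserted records
SELECT *
FROM dbo.#sessions;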

You might want to check also the queries from the SQL Server System Catalog.

Happy coding!

22 February 2023

💎🏭SQL Reloaded: Automatic Statistics Creation & Dropping for CETAS based on CSV File Format in Serverless SQL Pool

Introduction

The serverless SQL pool query optimizer uses statistics to calculate and compare the cost of various query plans, and then chooses the plan with the lowest cost. Automatic creation of statistics is turned on for the parquet file format, though for the CSV file format statistics are automatically created only when OPENROWSET is used. This means that when creating CETAS based on CSV files, the statistics need to be created manually.
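For reference, manually creating a single statistic looks like below (the table and column names are made up; per the documentation, the FULLSCAN and NORECOMPUTE options are used for external tables in serverless SQL pool):

-- manual statistics creation for one column of an external table (hypothetical names)
CREATE STATISTICS stat_CRM_Customers_CustomerId
ON CRM.Customers (CustomerId)
WITH FULLSCAN, NORECOMPUTE;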

This would be one more reason for holding the files in the Data Lake as parquet files. On the other side, there are also many files already available in CSV format, respectively technologies that allow exporting data only/still as CSV. Moreover, transforming the files into parquet is not always technically feasible.

Using OPENROWSET could also help, though does it make sense to use a different mechanism for the CSV file format? In some scenarios it will. I prefer to have a unitary design when possible. Moreover, even if some columns are not needed, they can still be useful for certain scenarios (e.g. troubleshooting, reevaluating their use, etc.).

There are files, especially the ones coming from ERPs (Enterprise Resource Planning) or similar systems, which have even a few hundred columns (on average between 50 and 100 columns). Manually creating the statistics for the respective tables will cost a lot of time and effort. To automate the process there are mainly three choices:
(1) Creating statistics for all the columns for a given set of tables (e.g. for a given schema).
(2) Finding a way to automatically identify the columns which are actually used.
(3) Storing the list of tables and columns on which statistics should be built (however, the list needs to be maintained manually). 

Fortunately, (1) can be solved relatively easily based on the available table metadata, however it's not the best solution, as a lot of statistics will be created unnecessarily. (2) is possible under certain architectures or with additional effort. (3) takes time, though it's also an approachable solution.

What do we need?

For building the solution, we need table and statistics metadata, and the good news is that the old SQL Server queries still work. To minimize code repetition, it makes sense to encapsulate the logic in views. For table metadata one can use the sys.objects catalog view, as it is more general (one can replace sys.objects with sys.tables to focus only on tables):

-- drop the view (for cleaning)
-- DROP VIEW IF EXISTS dbo.vObjectColumns

-- create view
CREATE OR ALTER VIEW dbo.vObjectColumns
AS 
-- object-based column metadata
SELECT sch.name + '.' + obj.name two_part_name
, sch.Name schema_name
, obj.name object_name
, col.name column_name
, obj.type
, CASE 
	WHEN col.is_ansi_padded = 1 and LEFT(udt.name , 1) = 'n' THEN col.max_length/2
	ELSE col.max_length
  END max_length
, col.precision 
, col.scale
, col.is_nullable 
, col.is_identity
, col.object_id
, col.column_id
, udt.name as data_type
, col.collation_name
, ROW_NUMBER() OVER(PARTITION BY col.object_id ORDER BY col.column_id) ranking
FROM sys.columns col
     JOIN sys.types udt
       ON col.user_type_id = udt.user_type_id
     JOIN sys.objects obj
       ON col.object_id = obj.object_id
     JOIN sys.schemas as sch
       ON sch.schema_id = obj.schema_id

-- testing the view
SELECT obc.*
FROM dbo.vObjectColumns obc
WHERE obc.object_name LIKE '<table name>%'
  AND obc.schema_name = 'CRM'
  AND obc.type = 'U'
ORDER BY obc.two_part_name
, Ranking

The view can also be used as a basis for retrieving the defined statistics:

-- drop the view (for cleaning)
-- DROP VIEW IF EXISTS dbo.vObjectStats

-- create view 
CREATE OR ALTER VIEW dbo.vObjectStats
AS
-- object-based column statistics
SELECT obc.two_part_name + '.' + QuoteName(stt.name) three_part_name
, obc.two_part_name
, obc.schema_name
, obc.object_name
, obc.column_name
, stt.name stats_name
, STATS_DATE(stt.[object_id], stt.stats_id) AS last_updated
, stt.auto_created
, stt.user_created
, stt.no_recompute
, stt.has_filter 
, stt.filter_definition
, stt.is_temporary 
, stt.is_incremental 
, stt.auto_drop 
, stt.stats_generation_method_desc
, stt.[object_id]
, obc.type 
, stt.stats_id
, stc.stats_column_id
, stc.column_id
FROM dbo.vObjectColumns obc
     LEFT JOIN sys.stats_columns stc 
	   ON stc.object_id = obc.object_id
	  AND stc.column_id = obc.column_id 
          LEFT JOIN sys.stats stt
            ON stc.[object_id] = stt.[object_id] 
           AND stc.stats_id = stt.stats_id

-- testing the view 
SELECT *
FROM dbo.vObjectStats obs
WHERE (obs.auto_created = 1 OR obs.user_created = 1)
  AND obs.type = 'U'
  AND obs.object_name = '<table name>'
ORDER BY obs.two_part_name
, obs.column_id

Now we have a basis for the next step. However, before using the stored procedure defined below, one should use the last query to check whether statistics were already defined on a table. For testing, also use a table for which you know that statistics are available.

Create Statistics

The code below is based on a similar stored procedure available in the Microsoft documentation (see [1]). It takes a table's column metadata, stores it in a temporary table and then loops through each record, creates the DDL script and runs it:

-- drop procedure (for cleaning)
--DROP PROCEDURE dbo.pCreateStatistics

-- create stored procedure
CREATE OR ALTER PROCEDURE dbo.pCreateStatistics
(   @schema_name nvarchar(50)
,   @table_name nvarchar(100)
)
AS
-- creates statistics for serverless SQL pool
BEGIN
	DECLARE @query as nvarchar(1000) = ''
	DECLARE @index int = 1, @nr_records int = 0

	-- drop temporary table if it exists 
	DROP TABLE IF EXISTS #stats_ddl;

	-- create temporary table 
	CREATE TABLE #stats_ddl( 
	  schema_name nvarchar(50)
	, table_name nvarchar(128)
	, column_name nvarchar(128)
	, ranking int
	);

	-- fill table
	INSERT INTO #stats_ddl
	SELECT obc.schema_name
	, obc.object_name
	, obc.column_name 
	, ROW_NUMBER() OVER(ORDER BY obc.schema_name, obc.object_name) ranking
	FROM dbo.vObjectColumns obc
	WHERE obc.type = 'U' -- tables
	  AND IsNull(@schema_name, obc.schema_name) = obc.schema_name 
	  AND IsNull(@table_name, obc.object_name) = obc.object_name

	SET @nr_records = (SELECT COUNT(*) FROM #stats_ddl)

	WHILE @index <= @nr_records
	BEGIN
		SET @query = (SELECT 'CREATE STATISTICS '+ QUOTENAME('stat_' + schema_name + '_' + table_name + '_' + column_name) + ' ON '+ QUOTENAME(schema_name) + '.' + QUOTENAME(table_name) + '(' + QUOTENAME(column_name) + ')' 
			   FROM #stats_ddl ddl
			   WHERE ranking = @index);

		BEGIN TRY
		        -- execute ddl
			EXEC sp_executesql @query;
		END TRY
		BEGIN CATCH
			SELECT 'create failed for ' + @query;
		END CATCH

		SET @index+=1;
	END

	DROP TABLE #stats_ddl;
END


-- test stored procedure (various scenario)
EXEC dbo.pCreateStatistics '<schema name>', '<table name>' -- based on schema & table
EXEC dbo.pCreateStatistics '<schema name>', NULL -- based on a schema
EXEC dbo.pCreateStatistics NULL, '<table name>' -- based on a table

Notes:
IMPORTANT!!! I recommend testing the stored procedure in a test environment first for a few tables and not for a whole schema. If there are too many tables, this will take time.

Please note that rerunning the stored procedure without previously deleting the statistics on the tables in scope will make the procedure raise failures for each column (behavior by design), though the error messages can be suppressed by commenting out the code, if needed. One can introduce further validation, e.g. considering only the columns which don't have a statistic defined on them, as sketched below.
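A possible sketch for such a validation, replacing the fill query inside the stored procedure (it relies on the dbo.vObjectStats view defined above):

-- fill query considering only columns without an existing statistic (sketch)
SELECT obc.schema_name
, obc.object_name
, obc.column_name 
, ROW_NUMBER() OVER(ORDER BY obc.schema_name, obc.object_name) ranking
FROM dbo.vObjectColumns obc
WHERE obc.type = 'U' -- tables
  AND IsNull(@schema_name, obc.schema_name) = obc.schema_name 
  AND IsNull(@table_name, obc.object_name) = obc.object_name
  AND NOT EXISTS ( -- no statistic defined yet on the column
      SELECT * 
      FROM dbo.vObjectStats obs 
      WHERE obs.schema_name = obc.schema_name
        AND obs.object_name = obc.object_name
        AND obs.column_name = obc.column_name
        AND (obs.auto_created = 1 OR obs.user_created = 1))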

Further Steps?

What can we do to improve the code? It would be great if we could find a way to identify the columns which are used in the queries. It is possible to retrieve the queries run in serverless SQL pool, however identifying the tables and columns from there or a similar source is not a straightforward solution. 

The design of views based on the external tables can help in the process! I prefer to build on top of the external tables a first level of views (aka "base views") that include only the fields in use (needed by the business), ordered and "grouped" together based on their importance or certain characteristics. The views are based solely on the external table and thus contain no joins. They can include conversions of data types, translations of codes into meaningful values, and quite seldom filters on the data. However, for traceability the names of the columns don't change! This means that if a view's name is easily identifiable based on the external table's name, we could check the view's columns against the ones of the external table and create statistics only for the respective columns. Using a unique prefix (e.g. "v") to derive the views' names from the tables' names would do the trick.
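For illustration, such a base view over a hypothetical external table might look as follows (the table and column names are made up):

-- hypothetical base view over an external table (only the columns in use, no joins)
CREATE OR ALTER VIEW CRM.vCustomers
AS
SELECT CST.CustomerId
, CST.CustomerName
, CST.CountryCode
, Cast(CST.CreatedOn as date) CreatedDate
FROM CRM.Customers CST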

To do that, we need to create a view that reflects the dependencies between objects (we'll be interested only in external tables vs views dependencies):

-- drop view (for cleaning)
-- DROP VIEW IF EXISTS dbo.vObjectsReferenced

-- create view
CREATE OR ALTER VIEW dbo.vObjectsReferenced
AS 
-- retrieving the objects referenced 
SELECT QuoteName(sch.name) + '.' + QuoteName(obj.name) AS two_part_name 
, obj.object_id 
, obj.schema_id 
, sch.name schema_name 
, obj.name object_name 
, obj.type
, QuoteName(scr.name) + '.'+ QuoteName(sed.referenced_entity_name) AS ref_two_part_name 
, obr.object_id ref_object_id
, obr.schema_id ref_schema_id 
, scr.name ref_schema_name 
, obr.name ref_object_name 
, obr.type ref_type
FROM sys.sql_expression_dependencies sed 
     JOIN sys.objects obj
       ON obj.object_id = sed.referencing_id 
	      JOIN sys.schemas as sch
	        ON obj.schema_id = sch.schema_id 
	 JOIN sys.objects obr
	   ON sed.referenced_id = obr.object_id
	      JOIN sys.schemas as scr
	        ON obr.schema_id = scr.schema_id

-- testing the view
SELECT top 10 *
FROM dbo.vObjectsReferenced
WHERE ref_type = 'U'

With this, the query used above to fill the table becomes:

-- fill table query with column selection 
SELECT obc.schema_name
, obc.object_name
, obc.column_name 
, ROW_NUMBER() OVER(ORDER BY obc.schema_name, obc.object_name) ranking
FROM dbo.vObjectColumns obc
WHERE obc.type = 'U' -- tables
	AND IsNull(@schema_name, obc.schema_name) = obc.schema_name 
	AND IsNull(@table_name, obc.object_name) = obc.object_name
	AND EXISTS ( -- select only columns referenced in views
	    SELECT * 
		FROM dbo.vObjectsReferenced obr 
		    JOIN dbo.vObjectColumns obt
			ON obr.object_id = obt.object_id 
		WHERE obt.type = 'V' -- view
		AND obr.object_name  =  'v' + obr.ref_object_name
		AND obc.object_id = obr.ref_object_id
		AND obc.column_name = obt.column_name);

This change will reduce the number of statistics created, on average by 50-80%. Of course, there will also be cases in which further statistics need to be added manually. One can use this as input for an analysis of the columns used, store the metadata in a file, make changes to it and base the statistics' creation on it. 

Drop Statistics

Dropping the statistics comes down to using the dbo.vObjectStats view created above for the schema and/or table provided as parameter. The logic is similar to the statistics' creation:

-- drop stored procedure (for cleaning)
-- DROP PROCEDURE IF EXISTS dbo.pDropStatistics

-- create procedure
CREATE OR ALTER PROCEDURE dbo.pDropStatistics
(   @schema_name nvarchar(50)
,   @table_name nvarchar(128)
)
AS
-- drop statistics for a schema and/or external table in serverless SQL pool
BEGIN

	DECLARE @query as nvarchar(1000) = ''
	DECLARE @index int = 1, @nr_records int = 0

	-- drop temporary table if it exists 
	DROP TABLE IF EXISTS #stats_ddl;

	-- create temporary table 
	CREATE TABLE #stats_ddl( 
	 three_part_name nvarchar(128)
	, ranking int
	);

	-- fill table
	INSERT INTO #stats_ddl
	SELECT obs.three_part_name
	, ROW_NUMBER() OVER(ORDER BY obs.three_part_name) ranking
	FROM dbo.vObjectStats obs
	WHERE obs.type = 'U' -- tables
	  AND IsNull(@schema_name, obs.schema_name) = obs.schema_name 
	  AND IsNull(@table_name, obs.object_name) = obs.object_name

	SET @nr_records = (SELECT COUNT(*) FROM #stats_ddl)

	WHILE @index <= @nr_records
	BEGIN
   
		SET @query = (SELECT 'DROP STATISTICS ' + ddl.three_part_name
			   FROM #stats_ddl ddl
			   WHERE ranking = @index);

		BEGIN TRY
		        -- execute ddl
			EXEC sp_executesql @query;
		END TRY
		BEGIN CATCH
			SELECT 'drop failed for ' + @query;
		END CATCH

		SET @index+=1;
	END

	DROP TABLE #stats_ddl;
END

Note:
IMPORTANT!!!
I recommend testing the stored procedure in a test environment first for a few tables and not for a whole schema. If there are too many tables, this will take time.

Closing Thoughts

The solution for statistics' creation is not perfect, though it's a start! It would have been great if such a feature were provided by Microsoft, and probably it will be, given the importance of statistics in identifying an optimal plan. It would be interesting to understand how much statistics help in a distributed environment and what volume of data is processed for this purpose. 

Please let me know if you found other workarounds for statistics automation.

Happy coding!

Previous Post  <<||>>  Next Post

References:
[1] Microsoft Learn (2022) Statistics in Synapse SQL (link)

01 February 2023

💎SQL Reloaded: Alternatives for Better Code Maintainability in SQL Server & Azure Synapse I

Introduction

Queries can become quite complex and, with the increased complexity, they'll be harder to read and/or maintain. Since the early days of SQL Server, views and table-valued user-defined functions (UDFs) have been the standard ways of encapsulating logic for reuse in queries, allowing to minimize the duplication of logic. Then came the common table expressions (CTEs), which allow a further layer of structuring the code, independently of whether a view or UDF was used. 

These are the three main options that can be combined in various ways to better structure the code. On the other side, a temporary table or table variable could also be used for the same purpose, though they have further implications.

To exemplify the various approaches, let's consider a simple query based on two tables from the AdventureWorks database. For the sake of simplicity, further business rules have been left out.

Inline Subqueries

-- products with open purchase orders
SELECT ITM.ProductNumber
, ITM.Name
, POL.PurchQty
FROM Production.Product ITM
     JOIN ( -- cumulated open purchase orders by product
		SELECT POL.ProductId 
		, SUM(POL.OrderQty) PurchQty
		FROM Purchasing.PurchaseOrderDetail POL
		WHERE OrderQty - (ReceivedQty - RejectedQty)>0
		GROUP BY POL.ProductId 
	) POL
	ON ITM.ProductId = POL.ProductId
ORDER BY ITM.ProductNumber

As can be seen, the logic for the "Open purchase orders" result set is built within an inline subquery (aka inline view). As its logic becomes more complex, the simplest way to handle this is to move it into a CTE.

Common Table Expressions (CTEs)

A common table expression can be thought of as a temporary result set defined within the execution scope of a single SELECT, INSERT, UPDATE, DELETE or CREATE VIEW statement [1]. Thus, the CTE can't be reused between queries.

The inline query is moved to the beginning within a WITH statement and given a proper name that allows easier identification later:

-- products with open purchase orders (common table expression)
WITH OpenPOs
AS (-- cumulated open purchase orders by product
	SELECT POL.ProductId 
	, SUM(POL.OrderQty) PurchQty
	FROM Purchasing.PurchaseOrderDetail POL
	WHERE OrderQty - (ReceivedQty - RejectedQty)>0
	GROUP BY POL.ProductId 
)
SELECT ITM.ProductNumber
, ITM.Name
, POL.PurchQty
FROM Production.Product ITM
     JOIN OpenPOs POL
	   ON ITM.ProductId = POL.ProductId
ORDER BY ITM.ProductNumber

Thus, this allows us to rewrite the JOIN as if it were between two tables. Multiple CTEs can be used as well, with or without any dependencies between them. Moreover, CTEs allow building recursive queries (see example).

There is no inherent performance gain or loss by using a CTE. It's important to know that the result set is not cached; therefore, if the same CTE is referenced multiple times within a query, it will also be "executed" the same number of times, as shown below. Except for the cases in which the database engine uses a spool operator to save intermediate query results for a CTE, no work table is created in tempdb for CTEs.
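For instance, in the query below the CTE is referenced twice and therefore evaluated twice (unless the optimizer decides to spool the intermediate results):

-- CTE referenced twice within the same query
WITH OpenPOs
AS (-- cumulated open purchase orders by product
	SELECT POL.ProductId 
	, SUM(POL.OrderQty) PurchQty
	FROM Purchasing.PurchaseOrderDetail POL
	WHERE OrderQty - (ReceivedQty - RejectedQty)>0
	GROUP BY POL.ProductId 
)
SELECT POL.ProductId
, POL.PurchQty
, POL.PurchQty * 100.0 / TOT.TotalQty percentage_of_total
FROM OpenPOs POL
     CROSS JOIN ( -- total open quantity over all products
		SELECT SUM(PurchQty) TotalQty
		FROM OpenPOs
	 ) TOT
ORDER BY POL.ProductId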

If the inline query needs to be reused in several queries, defining a view is a better alternative.

Views

A view is a database object used to encapsulate a query and that can be referenced from other queries much like a table. In fact, it's also referred to as a "virtual table". A view can't be executed by itself (as stored procedures can be). No data, only the definition of the view is stored, and the various actions that can be performed on database objects can be performed on views as well.

-- creating the view
CREATE VIEW dbo.vOpenPurchaseOrders
AS
SELECT POL.ProductId 
, SUM(POL.OrderQty) PurchQty
FROM Purchasing.PurchaseOrderDetail POL
WHERE OrderQty - (ReceivedQty - RejectedQty)>0
GROUP BY POL.ProductId 

-- testing the view
SELECT top 10 *
FROM dbo.vOpenPurchaseOrders

Once the view is created, it can be called from any query:

-- products with open purchase orders (view)
SELECT ITM.ProductNumber
, ITM.Name
, POL.PurchQty
FROM Production.Product ITM
     JOIN dbo.vOpenPurchaseOrders POL
	   ON ITM.ProductId = POL.ProductId
ORDER BY ITM.ProductNumber

Besides the schema binding, there are no additional costs for using views. However, views have several limitations (see [2]). Moreover, it's not possible to use parameters with views, scenarios in which table-valued UDFs can help.

Indexed Views 

Starting with SQL Server 2000, it's possible to materialize the data in a view, storing the results of the view in a clustered index on disk, the same way a table with a clustered index is stored. This type of view is called an indexed view (aka materialized view, though the concept is used slightly differently in Azure Synapse) and for long-running queries it can provide considerable performance gains. If the view contains a GROUP BY, its definition must contain COUNT_BIG(*) and must not contain HAVING.

-- dropping the view
--DROP VIEW IF EXISTS Purchasing.vOpenPOs

-- create view
CREATE VIEW Purchasing.vOpenPOs
WITH SCHEMABINDING
AS
SELECT POL.ProductId 
, SUM(POL.OrderQty) PurchQty
, COUNT_BIG(*) Count
FROM Purchasing.PurchaseOrderDetail POL
WHERE OrderQty - (ReceivedQty - RejectedQty)>0
GROUP BY POL.ProductId 
GO

--Create an index on the view.
CREATE UNIQUE CLUSTERED INDEX IDX_vOpenPOs
   ON Purchasing.vOpenPOs (ProductId);

--testing the view
SELECT top 100 *
FROM Purchasing.vOpenPOs

-- products with open purchase orders (indexed view)
SELECT ITM.ProductNumber
, ITM.Name
, POL.PurchQty
FROM [Production].[Product] ITM
     JOIN Purchasing.vOpenPOs POL
	   ON ITM.ProductId = POL.ProductId
ORDER BY ITM.ProductNumber

When an indexed view is defined on a table, the query optimizer may use it to speed up query execution even if the view wasn't referenced in the query. Besides the restriction that the view must be deterministic, further limitations apply (see [6]).
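On editions where automatic indexed-view matching is not available, the NOEXPAND table hint can be used to force the use of the indexed view (a small example):

-- forcing the use of the indexed view via the NOEXPAND hint
SELECT top 100 *
FROM Purchasing.vOpenPOs WITH (NOEXPAND)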

Table-Valued Functions

A table-valued function is a user-defined function that returns a table as a result, as opposed to a single data value, as scalar functions do.

Let's suppose that we need to restrict the logic based on a time interval. We'd need then to provide the StartDate & EndDate as parameters. Compared with other UDFs, table-valued functions, as their name implies, need to return a table:

-- creating the UDF function 
CREATE FUNCTION dbo.tvfOpenPurchaseOrdersByProduct( 
  @StartDate date 
, @EndDate date) 
RETURNS TABLE 
AS RETURN ( 
	SELECT POL.ProductId 
	, SUM(POL.OrderQty) PurchQty
	FROM Purchasing.PurchaseOrderDetail POL
	WHERE OrderQty - (ReceivedQty - RejectedQty)>0
	  AND POL.DueDate BETWEEN @StartDate AND @EndDate
	GROUP BY POL.ProductId 
)

-- testing the UDF
SELECT top 10 *
FROM dbo.tvfOpenPurchaseOrdersByProduct('2014-01-01', '2014-12-31')

A table-valued function can be used as a "table with parameters" in JOINs:

-- products with open purchase orders (table-valued function)
SELECT ITM.ProductNumber
, ITM.Name
, POL.PurchQty
FROM Production.Product ITM
     JOIN dbo.tvfOpenPurchaseOrdersByProduct('2014-01-01', '2014-12-31') POL
	   ON ITM.ProductId = POL.ProductId
ORDER BY ITM.ProductNumber

The parameters are optional, though in such cases using a view might still be a better idea. Table-valued functions used to have poor performance in the past compared with views, and in certain scenarios they might still perform poorly. Their benefit resides in allowing parameters to be passed and used in the logic, which can make them irreplaceable. Moreover, multi-statement table-valued functions can be built as well (see example)!
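A minimal multi-statement variant of the same function could look like below (just a sketch; for such simple logic the inline version above is preferable):

-- creating a multi-statement table-valued function (sketch)
CREATE FUNCTION dbo.mtvfOpenPurchaseOrdersByProduct( 
  @StartDate date 
, @EndDate date) 
RETURNS @OpenPOs TABLE (
  ProductId int NOT NULL
, PurchQty decimal(8,2) NOT NULL
)
AS 
BEGIN
	-- fill the return table
	INSERT INTO @OpenPOs
	SELECT POL.ProductId 
	, SUM(POL.OrderQty) PurchQty
	FROM Purchasing.PurchaseOrderDetail POL
	WHERE OrderQty - (ReceivedQty - RejectedQty)>0
	  AND POL.DueDate BETWEEN @StartDate AND @EndDate
	GROUP BY POL.ProductId;

	RETURN;
END

It can be called in queries exactly like the inline version above.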

Notes:
1) When evaluating table-valued functions for usage consider their limitations as well (see [3])!
2) Scalar UDFs can be used to simplify the code as well, though they apply only to single values, therefore they are not considered in here!

Temporary Tables 

A temporary table is a base table that is stored and managed in tempdb as any other table. It exists only while the database session in which it was created is active. Therefore, it can be called multiple times, behaving much like a standard table:

-- create the temp table
CREATE TABLE dbo.#OpenPOs (
  ProductId int NOT NULL
, PurchQty decimal(8,2) NOT NULL
)

-- insert the cumulated purchase orders
INSERT INTO #OpenPOs
SELECT POL.ProductId 
, SUM(POL.OrderQty) PurchQty
FROM Purchasing.PurchaseOrderDetail POL
WHERE OrderQty - (ReceivedQty - RejectedQty)>0
GROUP BY POL.ProductId 

-- products with open purchase orders (temporary table)
SELECT ITM.ProductNumber
, ITM.Name
, POL.PurchQty
FROM [Production].[Product] ITM
     JOIN dbo.#OpenPOs POL
	   ON ITM.ProductId = POL.ProductId
ORDER BY ITM.ProductNumber

-- drop the table (cleaning)
-- DROP TABLE IF EXISTS dbo.#OpenPOs;

Being created in tempdb, a system database shared across the whole instance, a temporary table's performance relies on tempdb's configuration and workload. Moreover, the concurrent creation of temporary tables from many sessions can lead to tempdb metadata contention, as each session attempts to update metadata information in the system base tables.

Temporary tables are logged, which adds more burden on the database engine, however being able to create indexes on them and use statistics can help processing result sets more efficiently, especially when called multiple times. 
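For example, an index can be created on the temporary table defined above (a small sketch; worthwhile mainly when the result set is queried repeatedly):

-- adding an index on the temporary table (useful when the result set is reused several times)
CREATE CLUSTERED INDEX IX_OpenPOs_ProductId
   ON #OpenPOs (ProductId);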

Also, a temporary table might be cached (see [1]) and not deleted when its purpose ends, which allows operations that drop and create the objects to execute very quickly and reduces page allocation contention.

Table Variables

A table variable is a variable of type TABLE and can be used in functions, stored procedures, and batches. The construct is similar to a temp table and is likewise stored in tempdb and cached under certain scenarios, however table variables are scoped to the batch or routine in which they are defined and are destroyed after that. 

-- create the table variable
DECLARE @OpenPOs TABLE (
  ProductId int NOT NULL
, PurchQty decimal(8,2) NOT NULL
)

-- insert the cumulated purchase orders
INSERT INTO @OpenPOs
SELECT POL.ProductId 
, SUM(POL.OrderQty) PurchQty
FROM Purchasing.PurchaseOrderDetail POL
WHERE OrderQty - (ReceivedQty - RejectedQty)>0
GROUP BY POL.ProductId 

-- products with open purchase orders (table variable)
SELECT ITM.ProductNumber
, ITM.Name
, POL.PurchQty
FROM [Production].[Product] ITM
     JOIN @OpenPOs POL
	   ON ITM.ProductId = POL.ProductId
ORDER BY ITM.ProductNumber

Table variables don't participate in transactions or locking, while DML operations done on them are not logged. There are also no statistics maintained, and any data changes impacting the table variable will not cause recompilation. Thus, they are usually faster than temporary tables, especially when their size is small, though their performance depends also on how they are used. On the other side, for big result sets and/or when several calls are involved, a temporary table could prove to be more efficient. 

Important!!! Temporary tables and table variables are means of improving the performance of long-running queries. Being able to move pieces of logic around helps in maintaining the code and it also provides a logical structure of the steps, however they shouldn't be used if the performance gain is not the target! Overusing them as technique can considerably decrease the performance of tempdb, which can have impact in other areas!

Azure Synapse

Moving to Azure Synapse, there are several important limitations concerning the above (see [4]). Even if some features are supported, further limitations might apply. What's important to note is that materialized views act like indexed views in standard SQL Server, and that CETAS (Create External Table As Select) is available to import/export data to the supported file formats in Hadoop, Azure Blob Storage or Azure Data Lake Storage Gen2.
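As a quick illustration, a CETAS statement has the following shape (the ext schema, external data source and file format names are made up and must exist beforehand):

-- CETAS sketch: export the aggregated result set as parquet files (hypothetical names)
CREATE EXTERNAL TABLE ext.OpenPOs
WITH (
  LOCATION = '/curated/openpos/'
, DATA_SOURCE = MyDataLake        -- hypothetical external data source
, FILE_FORMAT = ParquetFileFormat -- hypothetical external file format
)
AS
SELECT POL.ProductId 
, SUM(POL.OrderQty) PurchQty
FROM Purchasing.PurchaseOrderDetail POL
WHERE OrderQty - (ReceivedQty - RejectedQty)>0
GROUP BY POL.ProductId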

Feature                                    | Dedicated | Serverless | SQL Server
CTEs                                       | Yes       | Yes        | Yes (2005+)
Recursive CTEs                             | No        | No         | Yes (2005+)
Views                                      | Yes       | Yes        | Yes
Indexed views                              | No        | No         | Yes
Materialized views                         | Yes       | No         | No
Table-valued functions (single statement)  | No        | Yes        | Yes
Table-valued functions (multi-statement)   | No        | No         | Yes
Scalar UDFs                                | Yes       | No         | Yes
Tables                                     | Yes       | No         | Yes
Temporary tables (local)                   | Yes       | Limited    | Yes
Temporary tables (global)                  | No        | No         | Yes
Table variables                            | Yes       | Yes        | Yes
CETAS                                      | Yes       | Limited    | Yes (2022+)

Notes:
1) CETAS have two important limitations in serverless SQL Pool
    a) once the data were exported to a file, they can't be overwritten via the same syntax;
    b) logic based on temporary tables can't be exported via pipelines.
2) Temporary tables can be used to replace cursors (see example).

Previous Post  <<||>>  Next Post

Resources:
[1] Microsoft Learn (2012) Capacity Planning for tempdb (link)
[2] Microsoft Learn (2023) CREATE View (link)
[3] Microsoft Learn (2023) CREATE Function (link)
[4] Microsoft Learn (2023) Transact-SQL features supported in Azure Synapse SQL (link)
[5] Redgate (2018) Choosing Between Table Variables and Temporary Tables (ST011, ST012), by Phil Factor (link)
[6] Microsoft Learn (2023) Create indexed views (link)
[7] Microsoft Learn (2023) CREATE MATERIALIZED VIEW AS SELECT (Transact-SQL) (link)
[8] Microsoft Learn (2023) CETAS with Synapse SQL (link)

