04 February 2023

💎SQL Reloaded: Alternatives for Better Code Maintainability in SQL Server & Azure Synapse II

Few of the questions we need to answer can be solved with simple queries like the one considered in the previous post. Usually, there are at least 5-10 tables involved, with more or less complex joins and different levels of aggregation. With each table added, the number of options for implementing the logic increases. Depending on the analysis' particularities, we can use a combination of views, table-valued functions, CTEs, temporary tables or table variables.

Let's consider two more tables involving aggregations on sales orders and on-hand inventory:

-- products analysis (via JOINs)
SELECT ITM.ProductNumber
, ITM.Name
, IsNull(OHD.OnHand, 0) OnHand
, IsNull(SOL.SalesQty, 0) SalesQty
, IsNull(POL.PurchQty, 0) PurchQty
FROM [Production].[Product] ITM
     LEFT JOIN ( -- cumulated on hand 
		SELECT OHD.ProductId
		, SUM(OHD.Quantity) OnHand
		, 0 SalesQty
		, 0 PurchQty
		FROM [Production].[ProductInventory] OHD
		GROUP BY OHD.ProductId
	 ) OHD
	   ON ITM.ProductId = OHD.ProductId 
	 LEFT JOIN ( -- cumulated sales orders
		SELECT SOL.ProductId
		, 0 OnHand
		, SUM(SOL.OrderQty) SalesQty
		, 0 PurchQty
		FROM [Sales].[SalesOrderDetail] SOL
		GROUP BY SOL.ProductId
	 ) SOL
	   ON ITM.ProductID = SOL.ProductId
	 LEFT JOIN (-- cumulated open purchase orders
		SELECT POL.ProductId 
		, 0 OnHand 
		, 0 SalesQty
		, SUM(POL.OrderQty) PurchQty
		FROM Purchasing.PurchaseOrderDetail POL
		WHERE OrderQty - (ReceivedQty - RejectedQty)>0
		  --AND POL.DueDate BETWEEN '2014-01-01' AND '2014-12-31'
		GROUP BY POL.ProductId 
	 ) POL
	    ON ITM.ProductId = POL.ProductId
WHERE COALESCE(OHD.ProductId, SOL.ProductId, POL.ProductId) IS NOT NULL
ORDER BY ITM.ProductNumber

With the help of CTEs the final query can be simplified considerably:

-- products analysis (via CTEs)
WITH OnHand 
AS ( -- cumulated on hand 
	SELECT OHD.ProductId
	, SUM(OHD.Quantity) OnHand
	, 0 SalesQty
	, 0 PurchQty
	FROM [Production].[ProductInventory] OHD
	GROUP BY OHD.ProductId
)
, SalesOrders
AS ( -- cumulated sales orders
		SELECT SOL.ProductId
		, 0 OnHand
		, SUM(SOL.OrderQty) SalesQty
		, 0 PurchQty
		FROM [Sales].[SalesOrderDetail] SOL
		GROUP BY SOL.ProductId
)
, OpenPOs 
AS ( -- cumulated open purchase orders
	SELECT POL.ProductId 
	, 0 OnHand 
	, 0 SalesQty
	, SUM(POL.OrderQty) PurchQty
	FROM Purchasing.PurchaseOrderDetail POL
	WHERE OrderQty - (ReceivedQty - RejectedQty)>0
		--AND POL.DueDate BETWEEN '2014-01-01' AND '2014-12-31'
	GROUP BY POL.ProductId 
) 
SELECT ITM.ProductNumber
, ITM.Name
, IsNull(OHD.OnHand, 0) OnHand
, IsNull(SOL.SalesQty, 0) SalesQty
, IsNull(POL.PurchQty, 0) PurchQty
FROM [Production].[Product] ITM
     LEFT JOIN OnHand OHD
	   ON ITM.ProductId = OHD.ProductId 
	 LEFT JOIN SalesOrders SOL
	   ON ITM.ProductID = SOL.ProductId
	 LEFT JOIN OpenPOs POL
	    ON ITM.ProductId = POL.ProductId
WHERE COALESCE(OHD.ProductId, SOL.ProductId, POL.ProductId) IS NOT NULL
ORDER BY ITM.ProductNumber

This way of structuring the code also allows, in theory, a better way to troubleshoot the logic in segments (each CTE) and a better grip on the final query. In fact, any of the above alternatives would result in a similar final query and, with the right indexes and processing power, performance won't generally be an issue.

A CTE is just a way of reorganizing a query, and moving the logic into a CTE shouldn't have any impact on performance. Even if the constraints and tables appear somewhere else, the database engine will push down predicates and move operators around to obtain an optimal algebraic tree. In theory, the two approaches should lead to the same query plan. Of course, there are also exceptions, e.g. when restructuring the logic inline in a query without a CTE can favor one plan over another.
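As a minimal illustration based on the aggregations used above, a filter applied on top of a CTE is typically pushed by the optimizer below the GROUP BY:

-- filtering on top of a CTE (the outer predicate is typically pushed down)
WITH OnHand 
AS (
	SELECT OHD.ProductId
	, SUM(OHD.Quantity) OnHand
	FROM [Production].[ProductInventory] OHD
	GROUP BY OHD.ProductId
)
SELECT OHD.ProductId
, OHD.OnHand
FROM OnHand OHD
WHERE OHD.ProductId = 1 -- in the plan, only this product is aggregated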

There's also the temptation to use CTEs for everything. An extreme case is using a CTE for each table on which a filter is applied, instead of bringing the constraints into the JOIN or the WHERE clause. Even if the former approach allows troubleshooting individual sources, it can create further overhead, as one needs to check back and forth between CTEs and JOINs, taking more time and possibly letting some errors escape. CTEs are for breaking down complex logic to a level at which the code can be managed, not for creating further overhead. There must be a balance between usability and effort.

Moving to a distributed environment like Azure Synapse, where tables are distributed across several compute nodes, queries based on several aggregations might take longer than expected because of the dependencies existing between tables. Besides following the best practices for table design, including tables' distribution (see [1], [2]), it's worth playing around with the query structure to see how the database engine reacts to it.

For example, the considered query could be restructured as follows by transforming the many LEFT JOINs to UNIONs: 

-- products analysis (via UNIONs)
SELECT ITM.ProductNumber
, ITM.Name
, SUM(DAT.OnHand) OnHand
, SUM(DAT.SalesQty) SalesQty
, SUM(DAT.PurchQty) PurchQty
FROM [Production].[Product] ITM
     JOIN (
		-- cumulated on hand 
		SELECT OHD.ProductId
		, SUM(OHD.Quantity) OnHand
		, 0 SalesQty
		, 0 PurchQty
		FROM [Production].[ProductInventory] OHD
		GROUP BY OHD.ProductId
		UNION ALL
		-- cumulated sales orders
		SELECT SOL.ProductId
		, 0 OnHand
		, SUM(SOL.OrderQty) SalesQty
		, 0 PurchQty
		FROM [Sales].[SalesOrderDetail] SOL
		GROUP BY SOL.ProductId
		UNION ALL
		-- cumulated open purchase orders
		SELECT POL.ProductId 
		, 0 OnHand 
		, 0 SalesQty
		, SUM(POL.OrderQty) PurchQty
		FROM Purchasing.PurchaseOrderDetail POL
		WHERE OrderQty - (ReceivedQty - RejectedQty)>0
		  --AND POL.DueDate BETWEEN '2014-01-01' AND '2014-12-31'
		GROUP BY POL.ProductId 
	) DAT
	ON ITM.ProductId = DAT.ProductId
GROUP BY ITM.ProductNumber
, ITM.Name
ORDER BY ITM.ProductNumber

In similar situations I found this approach to provide better performance, especially when working with serverless SQL pools, though this might also depend on the tables' particularities.

If the queries take a long time and the result sets are needed frequently, it might be useful to store the intermediary or final result sets in (base) tables, or, in the case of serverless SQL pools, in parquet files, and have the data updated on a regular basis. Depending on the case, unless the data changes too often, this might prove a much better option than using temporary tables or table variables.
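For example, the cumulated open purchase orders could be persisted into a base table via SELECT ... INTO and refreshed on a schedule (a sketch; dbo.OpenPOsSnapshot is a hypothetical name, and in serverless SQL pools CETAS would be used instead):

-- persisting the aggregation into a base table (hypothetical table name)
SELECT POL.ProductId 
, SUM(POL.OrderQty) PurchQty
INTO dbo.OpenPOsSnapshot
FROM Purchasing.PurchaseOrderDetail POL
WHERE OrderQty - (ReceivedQty - RejectedQty)>0
GROUP BY POL.ProductId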

Happy coding!


Resources:
[1] Microsoft Learn (2022) Guidance for designing distributed tables using dedicated SQL pool in Azure Synapse Analytics (link)
[2] Microsoft Learn (2022) Design tables using dedicated SQL pool in Azure Synapse Analytics (link)


01 February 2023

💎SQL Reloaded: Alternatives for Better Code Maintainability in SQL Server & Azure Synapse I

Introduction

Queries can become quite complex, and with the increased complexity they'll be harder to read and/or maintain. Since the early days of SQL Server, views and table-valued user-defined functions (UDFs) have been the standard ways of encapsulating logic for reuse in queries, allowing one to minimize the duplication of logic. Then came common table expressions (CTEs), which allow a further layer of structuring the code, independently of whether a view or UDF was used.

These are the three main options that can be combined in various ways to better structure the code. On the other hand, a temporary table or a table variable could be used for the same purpose as well, though they have further implications.

To exemplify the various approaches, let's consider a simple query based on two tables from the AdventureWorks database. For the sake of simplicity, further business rules have been left out.

Inline Subqueries

-- products with open purchase orders
SELECT ITM.ProductNumber
, ITM.Name
, POL.PurchQty
FROM Production.Product ITM
     JOIN ( -- cumulated open purchase orders by product
		SELECT POL.ProductId 
		, SUM(POL.OrderQty) PurchQty
		FROM Purchasing.PurchaseOrderDetail POL
		WHERE OrderQty - (ReceivedQty - RejectedQty)>0
		GROUP BY POL.ProductId 
	) POL
	ON ITM.ProductId = POL.ProductId
ORDER BY ITM.ProductNumber

As can be seen, the logic for the "open purchase orders" result set is built within an inline subquery (aka inline view). As its logic becomes more complex, the simplest way to handle it is to move it into a CTE.

Common Table Expressions (CTEs)

A common table expression can be thought of as a temporary result set defined within the execution scope of a single SELECT, INSERT, UPDATE, DELETE or CREATE VIEW statement [1]. Thus, the CTE can't be reused between queries.

The inline query is moved to the beginning, within a WITH statement, and given a proper name that allows easier identification later:

-- products with open purchase orders (common table expression)
WITH OpenPOs
AS (-- cumulated open purchase orders by product
	SELECT POL.ProductId 
	, SUM(POL.OrderQty) PurchQty
	FROM Purchasing.PurchaseOrderDetail POL
	WHERE OrderQty - (ReceivedQty - RejectedQty)>0
	GROUP BY POL.ProductId 
)
SELECT ITM.ProductNumber
, ITM.Name
, POL.PurchQty
FROM Production.Product ITM
     JOIN OpenPOs POL
	   ON ITM.ProductId = POL.ProductId
ORDER BY ITM.ProductNumber

Thus, this allows us to rewrite the JOIN as if it were between two tables. Multiple CTEs can be used as well, with or without any dependencies between them. Moreover, CTEs allow building recursive queries (see example).

There is no inherent performance gain or loss in using a CTE. It's important to know that the result set is not cached; therefore, if the same CTE is referenced multiple times within a query, it will also be "executed" the same number of times. Except for the cases in which the database engine uses a spool operator to save intermediate query results for a CTE, no work table will be created in tempdb for CTEs.
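As a minimal illustration based on the tables used above, a CTE referenced twice within the same query appears twice in the plan as well:

-- the CTE is evaluated once per reference, not cached
WITH SalesOrders
AS (
	SELECT SOL.ProductId
	, SUM(SOL.OrderQty) SalesQty
	FROM [Sales].[SalesOrderDetail] SOL
	GROUP BY SOL.ProductId
)
SELECT SOA.ProductId
, SOA.SalesQty
FROM SalesOrders SOA
     JOIN SalesOrders SOB -- second reference, aggregated again
	   ON SOA.SalesQty = SOB.SalesQty
	  AND SOA.ProductId <> SOB.ProductId -- products sharing the same cumulated quantity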

If the inline query needs to be reused in several queries, defining a view is a better alternative.

Views

A view is a database object used to encapsulate a query and can be referenced from other queries much like a table. In fact, it's also referred to as a "virtual table". A view can't be executed by itself (as stored procedures can). No data is stored, only the definition of the view, and the various actions that can be performed on database objects can be performed on views as well.

-- creating the view
CREATE VIEW dbo.vOpenPurchaseOrders
AS
SELECT POL.ProductId 
, SUM(POL.OrderQty) PurchQty
FROM Purchasing.PurchaseOrderDetail POL
WHERE OrderQty - (ReceivedQty - RejectedQty)>0
GROUP BY POL.ProductId 

-- testing the view
SELECT top 10 *
FROM dbo.vOpenPurchaseOrders

Once the view is created, it can be called from any query:

-- products with open purchase orders (view)
SELECT ITM.ProductNumber
, ITM.Name
, POL.PurchQty
FROM Production.Product ITM
     JOIN dbo.vOpenPurchaseOrders POL
	   ON ITM.ProductId = POL.ProductId
ORDER BY ITM.ProductNumber

Besides the schema binding, there are no additional costs for using views. However, views have several limitations (see [2]). Moreover, it's not possible to use parameters with views, a scenario in which table-valued UDFs can help.

Indexed Views 

Starting with SQL Server 2000, it's possible to materialize the data in a view, storing the results of the view in a clustered index on disk, in the same way a table with a clustered index is stored. This type of view is called an indexed view (aka materialized view, though the concept is used slightly differently in Azure Synapse) and can provide considerable performance gains for long-running queries. If the view contains a GROUP BY, its definition must contain COUNT_BIG(*) and must not contain HAVING.

-- dropping the view
--DROP VIEW IF EXISTS Purchasing.vOpenPOs

-- create view
CREATE VIEW Purchasing.vOpenPOs
WITH SCHEMABINDING
AS
SELECT POL.ProductId 
, SUM(POL.OrderQty) PurchQty
, COUNT_BIG(*) Count
FROM Purchasing.PurchaseOrderDetail POL
WHERE OrderQty - (ReceivedQty - RejectedQty)>0
GROUP BY POL.ProductId 
GO

--Create an index on the view.
CREATE UNIQUE CLUSTERED INDEX IDX_vOpenPOs
   ON Purchasing.vOpenPOs (ProductId);

--testing the view
SELECT top 100 *
FROM Purchasing.vOpenPOs

-- products with open purchase orders (indexed view)
SELECT ITM.ProductNumber
, ITM.Name
, POL.PurchQty
FROM [Production].[Product] ITM
     JOIN Purchasing.vOpenPOs POL
	   ON ITM.ProductId = POL.ProductId
ORDER BY ITM.ProductNumber

When an indexed view is defined on a table, the query optimizer may use it to speed up query execution even if it isn't referenced in the query. Besides the restriction that the view be deterministic, further limitations apply (see [6]).
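Note also that the automatic matching of indexed views by the optimizer is an Enterprise-edition feature; in the other editions the view must be referenced directly with the NOEXPAND hint for its index to be used:

-- forcing the use of the view's index via NOEXPAND
SELECT top 100 *
FROM Purchasing.vOpenPOs WITH (NOEXPAND)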

Table-Valued Functions

A table-valued function is a user-defined function that returns a table as a result, as opposed to a single data value, as scalar functions do.

Let's suppose that we need to restrict the logic based on a time interval. We'd need then to provide the StartDate & EndDate as parameters. Compared with other UDFs, table-valued functions, as their name implies, need to return a table:

-- creating the UDF function 
CREATE FUNCTION dbo.tvfOpenPurchaseOrdersByProduct( 
  @StartDate date 
, @EndDate date) 
RETURNS TABLE 
AS RETURN ( 
	SELECT POL.ProductId 
	, SUM(POL.OrderQty) PurchQty
	FROM Purchasing.PurchaseOrderDetail POL
	WHERE OrderQty - (ReceivedQty - RejectedQty)>0
	  AND POL.DueDate BETWEEN @StartDate AND @EndDate
	GROUP BY POL.ProductId 
)

-- testing the UDF
SELECT top 10 *
FROM dbo.tvfOpenPurchaseOrdersByProduct('2014-01-01', '2014-12-31')

A table-valued function can be used as a "table with parameters" in JOINs:

-- products with open purchase orders (table-valued function)
SELECT ITM.ProductNumber
, ITM.Name
, POL.PurchQty
FROM Production.Product ITM
     JOIN dbo.tvfOpenPurchaseOrdersByProduct('2014-01-01', '2014-12-31') POL
	   ON ITM.ProductId = POL.ProductId
ORDER BY ITM.ProductNumber

Parameters can be made optional by providing default values, though in such cases using a view might still be a better idea. Table-valued functions used to have poor performance compared with views, and in certain scenarios they might still perform poorly. Their benefit resides in allowing one to pass and use parameters in the logic, which can make them irreplaceable. Moreover, multi-statement table-valued functions can be built as well (see example)!
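For example, default values can be assigned to the parameters, though callers must then specify the DEFAULT keyword explicitly (a sketch; the function name and the default values are chosen arbitrarily):

-- variant of the above UDF with default parameter values
CREATE FUNCTION dbo.tvfOpenPOsWithDefaults( 
  @StartDate date = '2014-01-01'
, @EndDate date = '2014-12-31') 
RETURNS TABLE 
AS RETURN ( 
	SELECT POL.ProductId 
	, SUM(POL.OrderQty) PurchQty
	FROM Purchasing.PurchaseOrderDetail POL
	WHERE OrderQty - (ReceivedQty - RejectedQty)>0
	  AND POL.DueDate BETWEEN @StartDate AND @EndDate
	GROUP BY POL.ProductId 
)

-- calling the UDF with the defaults (DEFAULT must be passed explicitly)
SELECT top 10 *
FROM dbo.tvfOpenPOsWithDefaults(DEFAULT, DEFAULT)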

Notes:
1) When evaluating table-valued functions for usage consider their limitations as well (see [3])!
2) Scalar UDFs can be used to simplify the code as well, though they apply only to single values, therefore they are not considered in here!

Temporary Tables 

A temporary table is a base table that is stored and managed in tempdb like any other table. It exists only while the database session in which it was created is active. Therefore, it can be referenced multiple times, behaving much like a standard table:

-- create the temp table
CREATE TABLE dbo.#OpenPOs (
  ProductId int NOT NULL
, PurchQty decimal(8,2) NOT NULL
)

-- insert the cumulated purchase orders
INSERT INTO #OpenPOs
SELECT POL.ProductId 
, SUM(POL.OrderQty) PurchQty
FROM Purchasing.PurchaseOrderDetail POL
WHERE OrderQty - (ReceivedQty - RejectedQty)>0
GROUP BY POL.ProductId 

-- products with open purchase orders (temporary table)
SELECT ITM.ProductNumber
, ITM.Name
, POL.PurchQty
FROM [Production].[Product] ITM
     JOIN dbo.#OpenPOs POL
	   ON ITM.ProductId = POL.ProductId
ORDER BY ITM.ProductNumber

-- drop the table (cleaning)
-- DROP TABLE IF EXISTS dbo.#OpenPOs;

Being created in tempdb, a system database shared by several databases, a temporary table's performance relies on tempdb's configuration and workload. Moreover, the concurrent creation of temporary tables from many sessions can lead to tempdb metadata contention, as each session attempts to update metadata information in the system base tables.

Temporary tables are logged, which adds more burden on the database engine; however, being able to create indexes on them and use statistics can help process result sets more efficiently, especially when they are referenced multiple times.

Also, a temporary table might be cached (see [1]) and not deleted when its purpose ends, which allows operations that drop and create the objects to execute very quickly and reduces page allocation contention.
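For example, an index can be created on the temporary table defined above, much like on a standard table (the index name is chosen arbitrarily):

-- creating an index on the temp table
CREATE CLUSTERED INDEX IX_OpenPOs_ProductId
   ON dbo.#OpenPOs (ProductId);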

Table Variables

A table variable is a variable of type TABLE and can be used in functions, stored procedures, and batches. The construct is similar to the temp table and is likewise stored in tempdb and cached under certain scenarios; however, table variables are scoped to the batch or routine in which they are defined and are destroyed after that.

-- create the table variable
DECLARE @OpenPOs TABLE (
  ProductId int NOT NULL
, PurchQty decimal(8,2) NOT NULL
)

-- insert the cumulated purchase orders
INSERT INTO @OpenPOs
SELECT POL.ProductId 
, SUM(POL.OrderQty) PurchQty
FROM Purchasing.PurchaseOrderDetail POL
WHERE OrderQty - (ReceivedQty - RejectedQty)>0
GROUP BY POL.ProductId 

-- products with open purchase orders (table variable)
SELECT ITM.ProductNumber
, ITM.Name
, POL.PurchQty
FROM [Production].[Product] ITM
     JOIN @OpenPOs POL
	   ON ITM.ProductId = POL.ProductId
ORDER BY ITM.ProductNumber

Table variables don’t participate in transactions or locking, while DML operations done on them are not logged. There are also no statistics maintained, and data changes impacting the table variable will not cause recompilation. Thus, they are usually faster than temporary tables, especially when their size is small, though their performance also depends on how they are used. On the other hand, for big result sets and/or when several calls are involved, a temporary table could prove more efficient.

Important!!! Temporary tables and table variables are means of improving the performance of long-running queries. Being able to move pieces of logic around helps in maintaining the code and provides a logical structure of the steps; however, they shouldn't be used if the performance gain is not the target! Overusing them can considerably decrease the performance of tempdb, which can have an impact in other areas!

Azure Synapse

Moving to Azure Synapse, there are several important limitations in what concerns the above (see [4]). Even if some features are supported, further limitations might apply. What's important to note is that materialized views act like indexed views in standard SQL Server, and that CETAS (Create External Table As Select) is available to import/export data to the supported file formats in Hadoop, Azure Blob Storage or Azure Data Lake Storage Gen2.

Feature | Dedicated | Serverless | SQL Server
CTEs | Yes | Yes | Yes (2005+)
Recursive CTEs | No | No | Yes (2005+)
Views | Yes | Yes | Yes
Indexed views | No | No | Yes
Materialized views | Yes | No | No
Table-valued functions (single statement) | No | Yes | Yes
Table-valued functions (multi-statement) | No | No | Yes
Scalar UDFs | Yes | No | Yes
Tables | Yes | No | Yes
Temporary tables (local) | Yes | Limited | Yes
Temporary tables (global) | No | No | Yes
Table variables | Yes | Yes | Yes
CETAS | Yes | Limited | Yes (2022+)

Notes:
1) CETAS has two important limitations in serverless SQL pools:
    a) once the data has been exported to a file, it can't be overwritten via the same syntax;
    b) logic based on temporary tables can't be exported via pipelines.
2) Temporary tables can be used to replace cursors (see example).


Resources:
[1] Microsoft Learn (2012) Capacity Planning for tempdb (link)
[2] Microsoft Learn (2023) CREATE View (link)
[3] Microsoft Learn (2023) CREATE Function (link)
[4] Microsoft Learn (2023) Transact-SQL features supported in Azure Synapse SQL (link)
[5] Redgate (2018) Choosing Between Table Variables and Temporary Tables (ST011, ST012), by Phil Factor (link)
[6] Microsoft Learn (2023) Create indexed views (link)
[7] Microsoft Learn (2023) CREATE MATERIALIZED VIEW AS SELECT (Transact-SQL) (link)
[8] Microsoft Learn (2023) CETAS with Synapse SQL (link)

28 January 2023

💎SQL Reloaded: Temporary Tables and Tempdb in Serverless SQL Pool

In SQL Server, temporary tables are stored in the tempdb database, and the front-end (FE) SQL Server from a serverless SQL pool makes no exception from it; however, there's no such database listed among the available databases. Watching today Brian Bønk's session on "Azure Synapse serverless - CETAS vs Views vs Parquet" at Data Toboggan winter edition 2023, the speaker pointed out that the tables available in the tempdb database can actually be queried:

-- Azure Serverless SQL pool: tempdb tables
SELECT top 10 *
FROM tempdb.information_schema.tables
TABLE_CATALOG | TABLE_SCHEMA | TABLE_NAME | TABLE_TYPE
tempdb | dbo | #dm_continuous_copy_status_..._0000000029C7 | BASE TABLE
tempdb | dbo | diff_MonDmSbsConnections | BASE TABLE
tempdb | dbo | diff_MonTceMasterKeys | BASE TABLE
tempdb | dbo | diff_MonTceColEncryptionKey | BASE TABLE
tempdb | dbo | diff_MonAutomaticTuningState | BASE TABLE
tempdb | dbo | diff_MonDmTranActiveTransactions | BASE TABLE
tempdb | dbo | diff_MonTceEnclaveUsageInfo | BASE TABLE
tempdb | dbo | diff_MonTceEnclaveColEncryptionKey | BASE TABLE
tempdb | dbo | diff_MonTceEnclaveMasterKeys | BASE TABLE
tempdb | dbo | diff_MonTceColumnInfo | BASE TABLE

Moreover, the content of the system tables available can be queried, even if some of the tables might have no data:
 
-- Azure Serverless SQL pool: checking a table's content
SELECT top 10 *
FROM tempdb.dbo.dmv_view_run_history

How about the temporary tables created? To check this, let's reuse a temp table created recently for listing the DMVs available in the serverless SQL pool:

-- dropping the temp table
DROP TABLE IF EXISTS dbo.#views;

-- create the temp table
CREATE TABLE dbo.#views (
  ranking int NOT NULL
, view_name nvarchar(150) NOT NULL
)

-- inserting a few records
INSERT INTO #views
SELECT row_number() OVER(ORDER BY object_id) ranking
, concat(schema_name(schema_id),'.', name) view_name
FROM sys.all_views obj
WHERE obj.Type = 'V'
  AND obj.is_ms_shipped = 1
  AND obj.name LIKE 'dm_exec_requests%'

-- checking temp table's content
SELECT *
FROM dbo.#views
ranking | view_name
1 | sys.dm_exec_requests
2 | sys.dm_exec_requests_history

Here's temp table's metadata:
 
-- checking temp table's metadata
SELECT *
FROM tempdb.information_schema.tables
WHERE table_name like '#views%'
TABLE_CATALOG | TABLE_SCHEMA | TABLE_NAME | TABLE_TYPE
tempdb | dbo | #views_____..._____00000000295B | BASE TABLE

Of course, you can also query tempdb.sys.all_objects:

-- checking temp table's metadata
SELECT *
FROM tempdb.sys.all_objects
WHERE name like '#views%'

You can now use the table name returned by either of the two queries to query the table directly (just replace the name accordingly):

-- querying the table via tempdb (name will differ)
SELECT *
FROM tempdb.dbo.[#views______________________________________________________________________________________________________________00000000295B]
ranking | view_name
1 | sys.dm_exec_requests
2 | sys.dm_exec_requests_history

The benefit of knowing this is negligible; however, the topic is interesting mostly for those who want to know how the pools in Azure Synapse work and what was kept or removed compared with the standard editions of SQL Server.

Thus, another interesting topic for DBAs would be how many files are created for tempdb, given that the database is used to store intermediate data for various operations. Finding the optimal number of tempdb files and configuring them was and still is one of the main concerns when configuring a SQL Server instance for optimal performance.

The old query developed for standard SQL Server seems to work as well:
 
-- Azure Serverless SQL pool: tempdb files 
SELECT dbf.file_id
, dbf.name file_name
--, dbf.physical_name
, dsp.name file_group
--, type 
, dbf.type_desc file_type
--, dbf.growth growth_kb
, Cast(dbf.growth/128.0  as decimal(18,2)) growth_mb
, dbf.is_percent_growth
--, dbf.max_size max_size_kb
, Cast(NullIf(dbf.max_size, -1)/128.0  as decimal(18,2)) max_size_mb
--, dbf.size file_size_kb
, Cast(dbf.size/128.0 as decimal(18,2)) file_size_mb
, dbf.state_desc 
, dbf.is_read_only 
FROM tempdb.sys.database_files dbf
     LEFT JOIN tempdb.sys.data_spaces dsp
       ON dbf.data_space_id = dsp.data_space_id
ORDER BY dbf.Name
file_id | file_name | file_group | file_type | growth_mb | is_percent_growth | max_size_mb | file_size_mb | state_desc | is_read_only
1 | tempdev | PRIMARY | ROWS | 32 | 0 | 102670 | 16 | ONLINE | 0
3 | tempdev2 | PRIMARY | ROWS | 32 | 0 | 102670 | 16 | ONLINE | 0
4 | tempdev3 | PRIMARY | ROWS | 32 | 0 | 102670 | 16 | ONLINE | 0
5 | tempdev4 | PRIMARY | ROWS | 32 | 0 | 102670 | 16 | ONLINE | 0
2 | templog | NULL | LOG | 64 | 0 | 16384 | 255 | ONLINE | 0

So, there were created 4 data files of 16 MB each, with a fixed growth of 32 MB, and the files can grow up to about 100 GB each. It would be interesting to check how the files grow under heavy workload, what kind of issues this raises, etc. At least in serverless SQL pools, many of the views one used to use for troubleshooting aren't accessible. Maybe one doesn't have to go that far, given that the resources are managed automatically.

Notes
(1) Microsoft recommends not dropping the temporary tables explicitly, but letting SQL Server handle this cleanup automatically, thus taking advantage of the Optimistic Latching Algorithm, which helps prevent contention on tempdb [1].

Happy coding!

Last updated: Oct-2024


References:
[1] Haripriya SB (2024) Do NOT drop #temp tables (link)

25 January 2023

💎SQL Reloaded: Documenting External Tables

When using serverless SQL pools, CETAS (aka Create External Table As Select) is the main mechanism of making data available from the Data Lake for queries. If the external tables need to be documented, the sys.external_tables DMV can be used to export their metadata, much like sys.tables or sys.views can be used for the same purpose:

-- CETAS metadata
SELECT TOP (50) ext.object_id
, ext.name
, schema_name(ext.schema_id) [schema_name]
, ext.type_desc
, ext.location
, ext.data_source_id
, ext.file_format_id
, ext.max_column_id_used
, ext.uses_ansi_nulls
, ext.create_date
, ext.modify_date
FROM sys.external_tables ext

It's interesting that CETAS have type_desc = 'USER_TABLE' in sys.all_objects, the same as user-defined tables in SQL Server:

-- CETAS' metadata via sys.all_objects
SELECT *
FROM sys.all_objects
WHERE object_id = object_id('<schema_name>.<CETAS name>')

The data source and file format can be retrieved via the sys.external_data_sources and sys.external_file_formats DMVs. Moreover, it's useful to include the logic in a view, like the one below:
 
-- drop view
--DROP VIEW IF EXISTS dbo.vAdminExternalTables

-- create view
CREATE VIEW dbo.vAdminExternalTables
AS
-- external tables - metadata 
SELECT ext.object_id
, sch.name + '.' + ext.name [unique_identifier]
, sch.name [schema]
, ext.name [object]
, ext.type_desc [type]
, ext.max_column_id_used 
, ext.location
, eds.name data_source 
, eff.name file_format 
, ext.create_date 
, ext.modify_date
FROM sys.external_tables ext
     JOIN sys.schemas sch
       ON ext.schema_id = sch.schema_id 
     JOIN sys.external_data_sources eds 
       ON ext.data_source_id = eds.data_source_id
     JOIN sys.external_file_formats eff
       ON ext.file_format_id = eff.file_format_id

-- testing the view
SELECT top 10 *
FROM dbo.vAdminExternalTables

The view can then be used for further queries, for example checking the CETAS created or modified starting with a given date:
 
-- external tables created after a certain date
SELECT *
FROM dbo.vAdminExternalTables ext
WHERE ext.create_date >= '20230101'
  OR ext.modify_date >= '20230101';

Or, when the CETAS are deployed from one environment to another, one can compare the datasets returned by the same view between environments, as in the query below:
 
-- comparing external tables metadata between two databases
SELECT *
FROM (
    SELECT *
    FROM <prod_database>.dbo.vAdminExternalTables
    WHERE [Schema] = '<schema_name>'
	) PRD
	FULL OUTER JOIN (
    SELECT *
    FROM <test_database>.dbo.vAdminExternalTables
    WHERE [Schema] = '<schema_name>'
	) TST
	ON PRD.[unique_identifier] = TST.[unique_identifier]
-- WHERE PRD.[unique_identifier] IS NULL OR TST.[unique_identifier] IS NULL

The definitions for multiple CETAS can be exported from the source database in one step via the Object Explorer Details >> Tables >> External tables >> (select CETAS) >> Script Table as >> ... . 

Happy coding!
