Introduction
The serverless SQL pool query optimizer uses statistics to calculate and compare the cost of various query plans, and then choose the plan with the lowest cost. Automatic creation of statistics is turned on for parquet file format, though for CSV file format statistics will be automatically created only when OPENROWSET is used. This means that when creating CETAS based on CSV the statistics need to be created manually.
This would be one more reason for holding the files in the Data Lake as parquet files. On the other side there are also many files already available in CSV format, respectively technoloqies that allows exporting data only/still as CSV. Moreover, transforming the files as parquet is not always technically feasible.
Using OPENROWSET could also help, though does it make sense to use a different mechanismus for the CSV file format? In some scenarios will do. I prefer to have a unitary design, when possible. Moreover, even if some columns are not needed, they can still be useful for certain scenarios (e.g. troubleshooting, reevaluating their use, etc.).
There are files, especially the ones coming from ERPs (Enterprise Resource Planning) or similar systems, which have even a few hundred columns (on average between 50 and 100 columns). Manually creating the statistics for the respective tables will cost lot of time and effort. To automate the process there are mainly three choices:
(1) Creating statistics for all the columns for a given set of tables (e.g. for a given schema).
(2) Finding a way to automatically identify the columns which are actually used.
(3) Storing the list of tables and columns on which statistics should be build (however the list needs to be maintained manually).
Fortunately, (1) can be solved relatively easy, based on the available table metadata, however it's not the best solution, as lot of statistics will be unnecessarily created. (2) is possible under certain architectures or additional effort. (3) takes time, though it's also an approachable solution.
What do we need?
For building the solution, we need table and statistics metadata, and the good news is that the old SQL Server queries still work. To minimize code's repetition, it makes sense to encapsulate the logic in views. For table metadata one can use the sys.objects DMV as is more general (one can replace sys.objects with sys.tables to focus only on tables):
-- drop the view (for cleaning)
-- DROP VIEW IF EXISTS dbo.vAdminObjectColumns
-- create view
CREATE OR ALTER VIEW dbo.vObjectColumns
AS
-- object-based column metadata
SELECT sch.name + '.' + obj.name two_part_name
, sch.Name schema_name
, obj.name object_name
, col.name column_name
, obj.type
, CASE
WHEN col.is_ansi_padded = 1 and LEFT(udt.name , 1) = 'n' THEN col.max_length/2
ELSE col.max_length
END max_length
, col.precision
, col.scale
, col.is_nullable
, col.is_identity
, col.object_id
, col.column_id
, udt.name as data_type
, col.collation_name
, ROW_NUMBER() OVER(PARTITION BY col.object_id ORDER BY col.column_id) ranking
FROM sys.columns col
JOIN sys.types udt
on col.user_type_id= udt.user_type_id
JOIN sys.objects obj
ON col.object_id = obj.object_id
JOIN sys.schemas as sch
on sch.schema_id = obj.schema_id
-- testing the view
SELECT obc.*
FROM dbo.vObjectColumns obc
WHERE obc.object_name LIKE '<table name>%'
AND obc.schema_name = 'CRM'
AND obc.type = 'U'
ORDER BY obc.two_part_name
, Ranking
The view can be used also as basis for getting the defined stats:
-- drop the view (for cleaning)
-- DROP VIEW IF EXISTS dbo.vAdminObjectStats
-- create view
CREATE OR ALTER VIEW dbo.vObjectStats
AS
-- object-based column statistics
SELECT obc.two_part_name + '.' + QuoteName(stt.name) three_part_name
, obc.two_part_name
, obc.schema_name
, obc.object_name
, obc.column_name
, stt.name stats_name
, STATS_DATE(stt.[object_id], stt.stats_id) AS last_updated
, stt.auto_created
, stt.user_created
, stt.no_recompute
, stt.has_filter
, stt.filter_definition
, stt.is_temporary
, stt.is_incremental
, stt.auto_drop
, stt.stats_generation_method_desc
, stt.[object_id]
, obc.type
, stt.stats_id
, stc.stats_column_id
, stc.column_id
FROM dbo.vObjectColumns obc
LEFT JOIN sys.stats_columns stc
ON stc.object_id = obc.object_id
AND stc.column_id = obc.column_id
LEFT JOIN sys.stats stt
ON stc.[object_id] = stt.[object_id]
AND stc.stats_id = stt.stats_id
-- testing the view
SELECT *
FROM dbo.vObjectStats obs
WHERE (obs.auto_created = 1 OR obs.user_created = 1)
AND obs.type = 'U'
AND obs.object_name = '<table name>'
ORDER BY obs.two_part_name
, obs.column_id
Now we have a basis for the next step. However, before using the stored procedure define below, one should use the last query and check whether statistics were defined before on a table. Use for testing also a table for which you know that statistics are available.
Create Statistics
The code below is based on a similar stored procedure available in the Microsoft documentation (see [1]). It uses a table's column metadata, stores them in a temporary table and then looks through each record, create the DDL script and runs it:
-- drop procedure (for cleaning)
--DROP PROCEDURE dbo.pCreateStatistics
-- create stored procedure
CREATE OR ALTER PROCEDURE dbo.pCreateStatistics
( @schema_name nvarchar(50)
, @table_name nvarchar(100)
)
AS
-- creates statistics for serverless SQL pool
BEGIN
DECLARE @query as nvarchar(1000) = ''
DECLARE @index int = 1, @nr_records int = 0
-- drop temporary table if it exists
DROP TABLE IF EXISTS #stats_ddl;
-- create temporary table
CREATE TABLE #stats_ddl(
schema_name nvarchar(50)
, table_name nvarchar(128)
, column_name nvarchar(128)
, ranking int
);
-- fill table
INSERT INTO #stats_ddl
SELECT obc.schema_name
, obc.object_name
, obc.column_name
, ROW_NUMBER() OVER(ORDER BY obc.schema_name, obc.object_name) ranking
FROM dbo.vObjectColumns obc
WHERE obc.type = 'U' -- tables
AND IsNull(@schema_name, obc.schema_name) = obc.schema_name
AND IsNull(@table_name, obc.object_name) = obc.object_name
SET @nr_records = (SELECT COUNT(*) FROM #stats_ddl)
WHILE @index <= @nr_records
BEGIN
SET @query = (SELECT 'CREATE STATISTICS '+ QUOTENAME('stat_' + schema_name + '_' + table_name + '_' + column_name) + ' ON '+ QUOTENAME(schema_name) + '.' + QUOTENAME(table_name) + '(' + QUOTENAME(column_name) + ')'
FROM #stats_ddl ddl
WHERE ranking = @index);
BEGIN TRY
-- execute ddl
EXEC sp_executesql @query;
END TRY
BEGIN CATCH
SELECT 'create failed for ' + @query;
END CATCH
SET @index+=1;
END
DROP TABLE #stats_ddl;
END
-- test stored procedure (various scenario)
EXEC dbo.pCreateStatistics '<schema name>', '<table name>' -- based on schema & table
EXEC dbo.pCreateStatistics '<schema name>', NULL -- based on a schema
EXEC dbo.pCreateStatistics NULL, '<table name>' -- based on a table
Notes:
IMPORTANT!!! I recommend testing the stored procedure in a test environment first for a few tables and not for a whole schema. If there are too many tables, this will take time.
Please note that rerunning the stored procedure without deleting previously the statitics on the tables in scope will make the procedure raise failures for each column (behavior by design), though the error messages can be surpressed by commenting the code, if needed. One can introduce further validation, e.g. considering only the columns which don't have a statistic define on them.
Further Steps?
What can we do to improve the code? It would be great if we could find a way to identify the columns which are used in the queries. It is possible to retrieve the queries run in serverless SQL pool, however identifying the tables and columns from there or a similar source is not a straightforward solution.
The design of views based on the external tables can help in the process! I prefer to build on top of the external tables a first level of views (aka "base views") that include only the fields in use (needed by the business) ordered and "grouped" together based on their importance or certain characteristics. The views are based solely on the external table and thus contain no joins. They can include conversions of data types, translations of codes into meaningful values, and quite seldom filters on the data. However, for traceability the name of the columns don't change! This means that if view's name is easily identifiable based on external table's name, we could check view's columns against the ones of the external table and create statistics only for the respective columns. Using a unique prefix (e.g. "v") to derive views' name from tables' name would do the trick.
To do that, we need to create a view that reflects the dependencies between objects (we'll be interested only in external tables vs views dependencies):
-- drop view (for cleaning)
-- DROP VIEW IF EXISTS dbo.vObjectsReferenced
-- create view
CREATE OR ALTER VIEW dbo.vObjectsReferenced
AS
-- retrieving the objects referenced
SELECT QuoteName(sch.name) + '.' + QuoteName(obj.name) AS two_part_name
, obj.object_id
, obj.schema_id
, sch.name schema_name
, obj.name object_name
, obj.type
, QuoteName(scr.name) + '.'+ QuoteName(sed.referenced_entity_name) AS ref_two_part_name
, obr.object_id ref_object_id
, obj.schema_id ref_schema_id
, scr.name ref_schema_name
, obr.name ref_object_name
, obr.type ref_type
FROM sys.sql_expression_dependencies sed
JOIN sys.objects obj
ON obj.object_id = sed.referencing_id
JOIN sys.schemas as sch
ON obj.schema_id = sch.schema_id
JOIN sys.objects obr
ON sed.referenced_id = obr.object_id
JOIN sys.schemas as scr
ON obr.schema_id = scr.schema_id
-- testing the view
SELECT top 10 *
FROM dbo.vObjectsReferenced
WHERE ref_type = 'U'
With this, the query used above to fill the table becomes:
-- fill table query with column selection
SELECT obc.schema_name
, obc.object_name
, obc.column_name
, ROW_NUMBER() OVER(ORDER BY obc.schema_name, obc.object_name) ranking
FROM dbo.vObjectColumns obc
WHERE obc.type = 'U' -- tables
AND IsNull(@schema_name, obc.schema_name) = obc.schema_name
AND IsNull(@table_name, obc.object_name) = obc.object_name
AND EXISTS ( -- select only columns referenced in views
SELECT *
FROM dbo.vObjectsReferenced obr
JOIN dbo.vAdminObjectColumns obt
ON obr.object_id = obt.object_id
WHERE obt.type = 'V' -- view
AND obr.object_name = 'v' + obr.ref_object_name
AND obc.object_id = obr.ref_object_id
AND obc.column_name = obt.column_name);
This change will reduce the number of statistics created on average by 50-80%. Of course, there will be also cases in which further statistics need to be added manually. One can use this as input for an analysis of the columns used and store the metadata in a file, do changes to it and base on it statistics' creation.
Drop Statistics
Dropping the indexes resumes to using the dbo.vObjectStats view created above for the schema and/or table provided as parameter. The logic is similar to statistics' creation:
-- drop stored procedure (for cleaning)
-- DROP PROCEDURE IF EXISTS dbo.pDropStatistics
-- create procedure
CREATE OR ALTER PROCEDURE dbo.pDropStatistics
( @schema_name nvarchar(50)
, @table_name nvarchar(128)
)
AS
-- drop statistics for a schema and/or external table in serverless SQL pool
BEGIN
DECLARE @query as nvarchar(1000) = ''
DECLARE @index int = 1, @nr_records int = 0
-- drop temporary table if it exists
DROP TABLE IF EXISTS #stats_ddl;
-- create temporary table
CREATE TABLE #stats_ddl(
three_part_name nvarchar(128)
, ranking int
);
-- fill table
INSERT INTO #stats_ddl
SELECT obs.three_part_name
, ROW_NUMBER() OVER(ORDER BY obs.three_part_name) ranking
FROM dbo.vObjectStats obs
WHERE obs.type = 'U' -- tables
AND IsNull(@schema_name, obs.schema_name) = obs.schema_name
AND IsNull(@table_name, obs.object_name) = obs.object_name
SET @nr_records = (SELECT COUNT(*) FROM #stats_ddl)
WHILE @index <= @nr_records
BEGIN
SET @query = (SELECT 'DROP STATISTICS ' + ddl.three_part_name
FROM #stats_ddl ddl
WHERE ranking = @index);
BEGIN TRY
-- execute ddl
EXEC sp_executesql @query;
END TRY
BEGIN CATCH
SELECT 'drop failed for ' + @query;
END CATCH
SET @index+=1;
END
DROP TABLE #stats_ddl;
END
Note:
IMPORTANT!!! I recommend testing the stored procedure in a test environment first for a few tables and not for a whole schema. If there are too many tables, this will take time.
Closing Thoughts
The solution for statistics' creation is not perfect, though it's a start! It would have been great if such a feature would be provided by Microsoft, and probably they will, given the importance of statistics of identifying an optimal plan. It would be intersting to understand how much statistics help in a distributed environment and what's the volume of data processed for this purpose.
Please let me know if you found other workarounds for statistics' automation.
Notes:
The above objects and queries seem to work also in SQL databases in Microsoft Fabric.
Happy coding!
Previous Post
<<||>>
Next Post
References:
[1] Microsoft Learn (2022) Statistics in Synapse SQL (link)