Disclaimer: This is work in progress intended to consolidate information from various sources for learning purposes. For the latest information please consult the documentation (see the links below)!
Last updated: 18-Mar-2024
[Microsoft Fabric] Statistics- {def} objects that contain relevant information about data, to allow query optimizer to estimate plans' costs [1]
- critical for the warehouse and lakehouse SQL endpoint for executing queries quickly and efficiently [3]
- when a query is executed, the engine tries to collect existing statistics for certain columns in the query and use that information to assist in choosing an optimal execution plan [3]
- inaccurate statistics can lead to unoptimized query plans and execution times [5]
- {type} user-defined statistics
- statistics defined manually by the users via DDL statement [1]
- users can create, update and drop statistic
- via CREATE|UPDATE|DROP STATISTICS
- users can review the contents of histogram-based single-column statistics [1]
- via DBCC SHOW_STATISTICS
- only, a limited version of these statements is supported [1]
- {recommendation} focus on columns heavily used in query workloads
- e.g. GROUP BYs, ORDER BYs, filters, and JOINs
- {recommendation} consider updating column-level statistics regularly [1]
- e.g. after data changes that significantly change rowcount or distribution of the data [1]
- {type} automatic statistics
- statistics created and maintained automatically by the query engine at query time [1]
- when a query is issued and query optimizer requires statistics for plan exploration, MF automatically creates those statistics if they don't already exist [1]
- then the query optimizer can utilize them in estimating the plan costs of the triggering query [1]
- if the query engine determines that existing statistics relevant to query no longer accurately reflect the data, those statistics are automatically refreshed [1]
- these automatic operations are done synchronously [1]
- the query duration includes this time [1]
- {object type} histogram statistics
- created per column needing histogram statistics at query time [1]
- contains histogram and density information regarding the distribution of a particular column [1]
- similar to the statistics automatically created at query-time in Azure Synapse Analytics dedicated pools [1]
- name begins with _WA_Sys_.
- contents can be viewed with DBCC SHOW_STATISTICS
- {object type} average column length statistics
- created for variable character columns (varchar) greater than 100 needing average column length at query-time [1]
- contain a value representing the average row size of the varchar column at the time of statistics creation [1]
- name begins with ACE-AverageColumnLength_
- contents cannot be viewed and are nonactionable by users [1]
- {object type} table-based cardinality statistics
- created per table needing cardinality estimation at query-time [1]
- contain an estimate of the rowcount of a table [1]
- named ACE-Cardinality [1]
- contents cannot be viewed and are nonactionable by user [1]
- [lakehouse] SQL analytics endpoint
- uses the same engine as the warehouse to serve high performance, low latency SQL queries [4]
- {feature} automatic metadata discovery
- a seamless process reads the delta logs and from the files folder and ensures SQL metadata for tables is always up to date [4]
- e.g. statistics [4]
- {limitation} only single-column histogram statistics can be manually created and modified [1]
- {limitation} multi-column statistics creation is not supported [1]
- {limitation} other statistics objects might appear in sys.stats
- besides the statistics created manually/automatically [1]
- ⇐ the objects are not used for query optimization [1]
- {limitation} if a transaction has data insertion into an empty table and issues a SELECT before rolling back, the automatically generated statistics can still reflect the uncommitted data, causing inaccurate statistics [5]
- {recommendation} update statistics for the columns mentioned in the SELECT [5]
- {recommendation} ensure all table statistics are updated after large DML transactions [2]
[4] Microsoft Learn (2024) Fabric: Better together: the lakehouse and warehouse [link]
[5] Microsoft Learn (2024) Fabric: Transactions in Warehouse tables in Microsoft Fabric [link]
DDL - Data Definition Language
MF - Microsoft Fabric