18 March 2025

🏭🗒️Microsoft Fabric: Statistics in Warehouse [Notes]

Disclaimer: This is work in progress intended to consolidate information from various sources for learning purposes. For the latest information please consult the documentation (see the links below)! 

Last updated: 18-Mar-2024

[Microsoft Fabric] Statistics
  • {def} objects that contain relevant information about data, to allow query optimizer to estimate plans' costs [1]
    • critical for the warehouse and lakehouse SQL endpoint for executing queries quickly and efficiently [3]
      • when a query is executed, the engine tries to collect existing statistics for certain columns in the query and use that information to assist in choosing an optimal execution plan [3]
      • inaccurate statistics can lead to unoptimized query plans and execution times [5]
  • {type} user-defined statistics
    • statistics defined manually by the users via DDL statement [1]
    • users can create, update and drop statistic 
      • via CREATE|UPDATE|DROP STATISTICS
    • users can review the contents of histogram-based single-column statistics [1]
      • via DBCC SHOW_STATISTICS
        • only, a limited version of these statements is supported [1]
    • {recommendation} focus on columns heavily used in query workloads
      • e.g. GROUP BYs, ORDER BYs, filters, and JOINs
    • {recommendation} consider updating column-level statistics regularly [1]
      • e.g. after data changes that significantly change rowcount or distribution of the data [1]
  • {type} automatic statistics
    • statistics created and maintained automatically by the query engine at query time [1]
    • when a query is issued and query optimizer requires statistics for plan exploration, MF automatically creates those statistics if they don't already exist [1]
      • then the query optimizer can utilize them in estimating the plan costs of the triggering query [1]
      • if the query engine determines that existing statistics relevant to query no longer accurately reflect the data, those statistics are automatically refreshed [1]
        • these automatic operations are done synchronously [1]
          • the query duration includes this time [1]
  • {object type} histogram statistics
    • created per column needing histogram statistics at query time [1]
    • contains histogram and density information regarding the distribution of a particular column [1]
    • similar to the statistics automatically created at query-time in Azure Synapse Analytics dedicated pools [1]
    • name begins with _WA_Sys_.
    • contents can be viewed with DBCC SHOW_STATISTICS
  • {object type} average column length statistics
    • created for variable character columns (varchar) greater than 100 needing average column length at query-time [1]
    • contain a value representing the average row size of the varchar column at the time of statistics creation [1]
    • name begins with ACE-AverageColumnLength_
    • contents cannot be viewed and are nonactionable by users [1]
  • {object type} table-based cardinality statistics
    • created per table needing cardinality estimation at query-time [1]
    • contain an estimate of the rowcount of a table [1]
    • named ACE-Cardinality [1]
    • contents cannot be viewed and are nonactionable by user [1]
  • [lakehouse] SQL analytics endpoint
    • uses the same engine as the warehouse to serve high performance, low latency SQL queries [4]
    • {feature} automatic metadata discovery
      • a seamless process reads the delta logs and from the files folder and ensures SQL metadata for tables is always up to date [4]
        • e.g. statistics [4]
  • {limitation} only single-column histogram statistics can be manually created and modified [1]
  • {limitation} multi-column statistics creation is not supported [1]
  • {limitation} other statistics objects might appear in sys.stats
    • besides the statistics created manually/automatically [1]
      • ⇐ the objects are not used for query optimization [1]
  • {limitation} if a transaction has data insertion into an empty table and issues a SELECT before rolling back, the automatically generated statistics can still reflect the uncommitted data, causing inaccurate statistics [5]
    • {recommendation} update statistics for the columns mentioned in the SELECT [5]
  • {recommendation} ensure all table statistics are updated after large DML transactions [2]


References:
[1] Microsoft Learn (2025) Fabric: Statistics in Fabric data warehousing [link
[2] Microsoft Learn (2025) Fabric: Troubleshoot the Warehouse [link
[3] Microsoft Fabric Updates Blog (2023) Microsoft Fabric July 2023 Update [link]
[4] Microsoft Learn (2024) Fabric: Better together: the lakehouse and warehouse [link]
[5] Microsoft Learn (2024) Fabric: Transactions in Warehouse tables in Microsoft Fabric [link]

Resources:
[R1] Microsoft Learn (2025) Fabric: What's new in Microsoft Fabric? [link]

Acronyms:
DDL - Data Definition Language
MF - Microsoft Fabric

No comments:

Related Posts Plugin for WordPress, Blogger...

About Me

My photo
Koeln, NRW, Germany
IT Professional with more than 25 years experience in IT in the area of full life-cycle of Web/Desktop/Database Applications Development, Software Engineering, Consultancy, Data Management, Data Quality, Data Migrations, Reporting, ERP implementations & support, Team/Project/IT Management, etc.