Showing posts with label indexes. Show all posts
Showing posts with label indexes. Show all posts

20 February 2025

💠🛠️🗒️SQL Server: Nulls [Notes]

Disclaimer: This is work in progress intended to consolidate information from various sources. It considers only on-premise SQL Server, for other platforms please refer to the documentation.

Last updated: 20-Feb-2024

[SQL Server] Null

  • {def} keyword that indicates that the value is unknown [1]
    • different from an empty or zero value [1]
    • no two null values are equal [1]
      • comparisons between two null values, or between a null value and any other value, return unknown because the value of each NULL is unknown [1]
    • indicates the the value is
      • unknown
      • not applicable
      • to be added later
      • ⇒ can't be used as information that is required to distinguish one row in a table from another row in a table [1]
  • can be assigned to a value by
    • explicitly stating NULL in an INSERT or UPDATE statement [1[
    • leaving a column out of an INSERT statement [1]
  • {recommendation} test for null values in queries 
    • via IS NULL or IS NOT NULL in the WHERE clause [1]
    • WHEN present in data, logical and comparison operators can potentially return a third result of UNKNOWN instead of just TRUE or FALSE [1]
      •  ⇐ three-valued logic can be the source for many application errors [1]
    • ⇐ parameters and variables not explicitly initialized can cause problems in code
  • {recommendation} handle null values in logic
    • via IsNull or Coalesce functions
  • {constraint} [primary kyes] if any of the columns considered in a primary key contain NULL values, the PRIMARY KEY constraint can’t be created [3]
  • {constraint} [UNIQUE constraint] allows the columns that make up the constraint to allow NULLs, but it doesn’t allow all key columns to be NULL for more than one row [3]
  • [data warehouse] nullability of columns
    • {best practice} define columns as NOT NULL when appropriate 
      • {benefit} helps the Query Optimizer 
      • {benefit} reduces in some cases the storage space required for the data
      • {benefit} allows SQL Server to avoid unnecessary encoding in columnstore indexes and during batch mode execution [2]
    • {example} [SQL Server 2000+] bigint column
      • when the value is defined as NOT NULL , the value fits into a single CPU register
        • ⇒ operations on the value can be performed more quickly
      • a nullable bigint column requires another, 65th bit to indicate NULL values
        • SQL Server avoids cross-register data storage by storing some of the row values (usually the highest or lowest values) in main memory using special markers to indicate it in the data that resides in the CPU cache [2]
          • ⇒ adds extra load during execution
    • {recommendation} avoid nullable columns in data warehouse environments [2]
      • ⇐ the recommendation can apply also to OLTP databases
        • there are database designs that enforces not null values for all attributes
          • e.g. Dynamics AX 2009/365 F&O
          • {benefit} eliminates the need to test for null values in legacy code
    • {recommendation} use CHECK and UNIQUE constraints or indexes when overhead introduced by constraints or unique indexes is acceptable [2]
    • {recommendation} consider using filtered indexes instead of normal indexes for columns with many null values
      • minimizes the waste of storage space
      • ⇐ understand the characteristics of the columns used in the queries [3]


References:
[1] Microsoft Learn (2024) SQL Server 2022: NULL and UNKNOWN (T-SQL)
[2] Dmitri Korotkevitch (2016) Pro SQL Server Internals 2nd Ed.
[3] Microsoft SQL Server 2012 Internals, by Kalen Delaney, Bob Beauchemin, Conor Cunningham, Jonathan Kehayias, Benjamin Nevarez & Paul S. Randal, Microsoft Press, ISBN: 978-0-7356-5856-1 , 2013

16 February 2025

💠🛠️🗒️SQL Server: Columnstore Indexes [Notes]

Disclaimer: This is work in progress intended to consolidate information from various sources. It considers only on-premise SQL Server, for other platforms please refer to the documentation.

Last updated: 15-Feb-2024

[SQL Server] columnstore indexes (CI)

  • {def} a technology for storing, retrieving and managing data by using a columnar data format (aka columnstore
    • store compressed data on a per-column rather than a per-row basis [5]
  • {benefit} designed for analytics and data warehousing workloads
    • data warehousing
      • {scenario} store fact tables and large dimension tables
        • ⇐ tend to require full table scans rather than table seeks
    • analytics workloads
      • {scenario} [SQL Server 2016 SP1] can be used for real-time analytics on operational databases
        • ⇐ an updatable nonclustered columnstore index can be created on a rowstore table
  • {benefit} performance increase 
    • can achieve up to 100x better performance [4]
    • offers an order of magnitude better performance than a rowstore index
      • {feature} uses batch mode execution
        • improves query performance typically by two to four times
      • have high performance gains for analytic queries that scan large amounts of data, especially on large tables (>1 million rows) 
  • {benefit} reduces significantly the data warehouse storage costs
    • {feature} data compression
      • ⇒ provides high compression rates, typically by 10 times
        • ⇒ reduces total I/O from the physical media
          • ⇐ queries often select only a few columns from a table
          • minimizes or eliminates system I/O bottlenecks
        • reduces significantly the memory footprint
          • ⇒ query performance can improve 
            • because SQL Server can perform more query and data operations in memory
  • {benefit} built in memory
    • ⇒ sufficient memory must be available 
  • {benefit} part of the database engine
    • no special hardware is needed
  • {concept} columnstore 
    • {def} data structure logically organized as a table with rows and columns, and physically stored in a column-wise data format
      • stores values from the same domain which commonly have similar values
        • when a query references a column, then only that column is fetched from disk [3]
          • ⇐ the columns not requested are skipped 
            • ⇒ they are not loaded into memory 
        • when a query is executed, the rows must be reconstructed
          • ⇒ row reconstruction takes some time and uses some CPU and memory resources [3]
        • [SQL Server 2016] columnstore index on rowstore tables
          • columnstore is updated when data changes in the rowstore table
        • both indexes work against the same data
  • {concept}rowstore
    • {def} data that's logically organized as a table with rows and columns, and physically stored in a row-wise data format
      • ⇐ the traditional way to store relational table data
      • refers to a table where the underlying data storage format is either
        • a heap
        • a clustered index
        • a memory-optimized table
  • {concept} rowstore index
    • performs best on queries that seek into the data, when searching for a particular value, or for queries on a small range of values
      • ⇒ appropriate for transactional workloads 
        • because they tend to require mostly table seeks instead of table scans
  • {concept} rowgroup
    • {def} a group of rows that are compressed into columnstore format at the same time
      • {constraint} has a maximum number of rows per rowgroup, which is 1,048,576 =2^20 rows
      • contains one column segment for every column in the table
      • can have more than one delta rowgroup that form the deltastore
        • e.g. when multiple threads create columnstore indexes using parallel execution plans [5]
          • ⇐ each thread will work with its own subset of data, creating separate rowgroups [5]
        • [partitions] each table partition has its own set of row groups [5]
          • ⇐  too many partitions may prevent workloads from benefiting from a CCI [11]
            • ⇐ data aren’t pushed into a compressed columnstore segment until the rowgroup limit is reached
    • {event} rowgroup is compressed
      • marked as read-only [16]
      • a compressed rowgroup is considered as fragmented when either 
        • row number < rowgroup limit but dictionary size reached the maximum
          • nothing can be done to increase the number of rows [15]
          • the trim_reason is other than DICTIONARY_SIZE
        • it has nonzero deleted rows that exceeds a minimum threshold [15]
    • {event} all data from rowgroup deleted 
      • transitions from COMPRESSED into TOMBSTONE state
      • later removed by the tuple-mover background process
    • {event} rows in the columnstore indexes can be moved to different locations
      • row-id in the nonclustered indexes aren’t updated 
        • ⇐ the mappings between old and new row locations are stored in an internal structure (aka mapping index) 
    • {event} rowgroup build
      • all column data are combined on a per-row group basis, encoded and compressed [5]
        • the rows within a row group can be rearranged if that helps to achieve a better compression rate [5]
  • {feature} data compression
    • the table is sliced into rowgroups, and each rowgroup is compresses in a column-wise manner
      • the number of rows in the rowgroup must be 
        • large enough to improve compression rates
        • small enough to benefit from in-memory operations
          • having too many small rowgroups decreases columnstore index’s quality
    • uses its own compression mechanism 
      • ⇒ row or page compression cannot be used on it [3]
      • [SQL Server 2016] page compression has been removed
        • ⇐ in some cases, page compression disallowed the creation of columnstore indexes with a very large number of columns [5]
  • {feature} compression delay
    • computed when a delta rowgroup is closed [7]
    • keeps the ‘active’ rows in delta rowgroup and only transition these rows to compressed rowgroup after a specified delay [7]
      • ⇐ reduces the overall maintenance overhead of NCCI [7]
      • ⇒ leads to a larger number of delta rowgroups [7]
    • {best practice} if the workload is primarily inserting data and querying it, the default COMPRESSION_DELAY of 0 is the recommended option [7]
    • {best practice} [OLTP workload] if > 10% rows are marked deleted in recently compressed rowgroups, then consider a value that accommodates the behavior [7]
      • via: create nonclustered columnstore index with (compression_delay= 150)
  • {feature} data encoding
    • all values in the data are replaced with 64-bit integers using one of two encoding algorithms
    • {concept} dictionary encoding
      • stores distinct values from the data in a separate structure (aka dictionary} 
        • every value in a dictionary has a unique ID assigned [5]
          • the ID is used for replacement
    • {concept} global dictionary
      • shared across all segments that belong to the same index partition [5]
    • {concept} local dictionary
      • created for individual segments using values that are not present in the global dictionary
    • {concept} value-based encoding
      • mainly used for numeric and integer data types that do not have enough duplicated values [5]
        • dictionary encoding would be inefficient [5]
      • converts integer and numeric values to a smaller range of 64-bit integers in 2 steps
        • {step} [numeric data types] are converted to integers using the minimum positive exponent (aka magnitude that allows this conversion) [5]
          • {goal} convert all numeric values to integers [5]
          • [integer data types] the smallest negative exponent is chosen that can be applied to all values without losing their precision [5]
            • {goal} reduce the interval between the minimum and maximum values stored in the segment [5]
        • {step} the minimum value (aka base value) in the segment is identified and subtracted it from all other values [5]
          • ⇒ makes the minimum value in the segment number 0 [5]
      • after encoding the data are compressed and stored as a LOB allocation unit
  • {concept} column segment 
    • {def} a column of data from within the rowgroup
    • is compressed together and stored on physical media
    • SQL Server loads an entire segment to memory when it needs to access its data
  • {concept} segment metadata 
    • store metadata about each segment 
      • e.g. minimum and maximum values
      • ⇐ segments that do not have the required data are skipped [5]
  • {concept} deltastore
    • {def} all of the delta rowgroups of a columnstore index
    • its operations are handled behind the scenes
      • can be in either states
        • {state} open (aka open delta store) 
          • accepts new rows and allow modifications and deletions of data
        • {state} closed (aka closed data store)
          • a delta store is closed when it reaches its rowgroup limit
  • {concept} delta rowgroup 
    • {def} a clustered B-tree index that's used only with columnstore indexes
    • improves columnstore compression and performance by storing rows until the number of rows reaches the rowgroup limit and are then moved into the columnstore
    • {event} reaches the maximum number of rows
      • it transitions from an ‘open’ to ‘closed’ state
      • a closed rowgroup is compressed by the tuple-mover and stored into the columnstore as COMPRESSED rowgroup
    • {event} compressed
      • the existing delta rowgroup transitions into TOMBSTONE state to be removed later by the tuple-mover when there is no reference to it
  • {concept} tuple-mover 
    • background process that checks for closed row group
      • if it finds a closed rowgroup, it compresses the delta rowgroup and stores it into the columnstore as a COMPRESSED rowgroup
  • {concept} clustered columnstore index (CCI) 
    • is the primary storage for the entire table
    • {characteristic) updatable
      • has two structures that support data modifications
        • ⇐ both use the B-Tree format to store data [5]
        • ⇐ created on demand [5]
        • delete bitmap
          • indicates which rows were deleted from a table
          • upon deletion the row continues to be stored into the rowgroup
          • during query execution SQL Server checks the delete bitmap and excludes deleted rows from the processing [5]
        • delta store
          • includes newly inserted rows
          • updating a row triggers the deletion of the existing row and insertion of a new version of a row to a delta store
            • ⇒ the update does not change the row data
            • ⇒ the updated row is inserted to a delete bitmap
        • [partitions] each partition can have a single delete bitmap and multiple delta stores
          • ⇐ this makes each partition self-contained and independent from other partitions
            • ⇒ allows performing a partition switch on tables that have clustered columnstore indexes defined [5]
    • {feature} supports minimal logging for batch sizes >= rowgroup’s limit [12]
    • [SQL Server 2017] supports non-persisted computed columns in clustered columnstore indexes [2]
    • store some data temporarily into a clustered index (aka deltastore) and a btree list of IDs for deleted rows
      • ⇐ {benefit} reduces fragmentation of the column segments and improves performance
      • combines query results from both the columnstore and the deltastore to return the correct query results
    • [partitions] too many partitions can hurt the performance of a clustered columnstore index [11]
  • {concept} nonclustered columnstore index (NCCI)
    • {def} a secondary index that's created on a rowstore table
      • is defined as one or more columns of the table and has an optional condition that filters the rows
      • designed to be used for workloads involving a mix of transactional and analytics workload*
      • functions the same as a clustered columnstore index
        • ⇐ has same performance optimizations (incl. batchmode operators)
        • {exception} doesn’t supports persisted computed columns
          • can’t be created on a columnstore index that has a computed column [2]
        • however behave differently between the various versions of SQL Server
          • [SQL Server 2012|2014] {restriction} readonly
      • contains a copy of part or all of the rows and columns in the underlying table
        • include a row-id , which is either the address of
          • a row in a heap table 
          • a clustered index key value
            • includes all columns from the clustered index even when not explicitly defined in the CREATE statement
              • the not specified columns will not be available in the sys.index_columns view
      • [SQL Server 2016] multiple nonclustered rowstore indexes can be created on a columnstore index and perform efficient table seeks on the underlying columnstore
        • ⇒ once created, makes it possible to drop one or more btree nonclustered indexes
      • enables real-time operational analytics where the OLTP workload uses the underlying clustered index while analytics run concurrently on the columnstore index
  • {concept} batch mode execution (aka vector-based execution, vectorized execution
    • {def} query processing method used to process multiple rows together in groups of rows, or batches, rather than one row at a time
      • SQL Server can push a predicate to the columnstore index scan operator, preventing unnecessary rows from being loaded into the batch [5]
      • queries can process up to 900 rows together
        • enables efficient query execution (by a 3-4x factor) [4]
        • ⇐ the size of the batches varies to fit into the CPU cache
        • ⇒ reduces the number of times that the CPU needs to request external data from memory or other components [5]
      • improves the performance of aggregations, which can be calculated on a per-batch rather than a per-row basis [5]
      • tries to minimize the copy of data between operators by creating and maintaining a special bitmap that indicates if a row is still valid in the batch [5]
        • ⇐ subsequent operators will ignore the non-valid rows
        • every operator has a queue of work items (batches) to process [5]
        • worker threads from a shared pool pick items from queues and process them while migrating from operator to operator [5]
    • is closely integrated with, and optimized around, the columnstore storage format.
      • columnstore indexes use batch mode execution
        • ⇐ improves query performance typically by two to four times
  • {concept} tuple mover
    • single-threaded process that works in the background, preserving system resources
      • runs every five minutes
    • converts closed delta stores to row groups that store data in a column-based storage format [5]
      • can be disabled via trace flag T-634 
      • ⇐ the conversion of closed delta stores to row groups can be forced by reorganizing an index [5]
        • runs in parallel using multiple threads
          • decreases significantly conversion time at a cost of extra CPU load and memory usage [5]
        • via: ALTER INDEX REORGANIZE command
    • it doesn’t prevent other sessions from inserting new data into a table [5]
    • deletions and data modifications would be blocked for the duration of the operation [5]
      • {recommendation} consider forcing index reorganization manually to reduce execution, and therefore locking, time [5]
    • considered fragmented if it has
      • multiple delta rowgroups
      • deleted rows
    • require maintenance like that of regular B-Tree indexes [5]
      • {issue] partially populated row groups
      • {issue} overhead of delta store and delete bitmap scans during query execution
      • rebuilding the columnstore index addresses the issues
      • the strategy depends on the volatility of the data and the ETL processes implemented in the system [5]
        • {recommendation} rebuild indexes when a table has a considerable volme of deleted rows and/or a large number of partially populated rowgroups [5]
        • {recommendation} rebuild partition(s) that still have a large number of rows in open delta stores after the ETL process has completed, especially if the ETL process does not use a bulk insert API [5]
      • creating/dropping/disabling/rebuilding functions like any other index
    • columnstore statistics 
      • a statistics object is created at the time of columnstore index creation; however, it is neither populated nor updated afterward [5]
        • ⇐ SQL Server relies on segment information, B-Tree indexes (when available), and column-level statistics when deciding if a columnstore index needs to be used [5]
        • it is beneficial to create missing column-level statistics on the columns that participate in a columnstore index and are used in query predicates and as join keys [5]
          • ⇐ statistics rarely update automatically on very large tables [5]
            • ⇒ statistics must be updated ‘manually’
      • [SQL Server 2019] included into the schema-only clone of a database functionality [8]
        • enable performance troubleshooting without the need to manual capture the statistics information
      • columnstore indexes has been added to sp_estimate_data_compression_savings. In SQL Server 2019 both 
      • COLUMNSTORE and COLUMNSTORE_ARCHIVE have been added to allow you to estimate the space savings if 
      • either of these indexes are used on a table.
        • via DBCC CLONEDATABASE
    • [in-memory tables] 
      • {limitation} a columnstore index must include all the columns and can’t have a filtered condition [2]
      • {limitation} queries on columnstore indexes run only in InterOP mode, and not in the in-memory native mode [2]
    • {operation} designing columnstore indexes
      • {best practice} understand as much as possible data’s characteristics
      • {best practice} identify workload’s characteristics
    • {operation} create a clustered columnstore index
      • via CREATE CLUSTERED COLUMNSTORE INDEX command
      • not needed to specify any columns in the statement
        • ⇐ the index will include all table columns
    • {operation} index rebuilding 
      • forces SQL Server to remove deleted rows physically from the index and to merge the delta stores’ and row groups’ data [5]
        • all column segments are recreated with row groups fully populated [5]
      • [<SQL Server 2019] offline operation
      • [SQL Server 2019 Enterprise] online operation
        • ⇒ higher availability 
        • ⇐ pausing and resuming create and rebuild operations are not supported [11]
      • very resource intensive process
      • holds a schema modification (Sch-M) lock on the table
        • ⇒ prevents other sessions from accessing it [5]
        • ⇐ the overhead can be mitigated by using table/index partitioning
          • ⇒ indexes will be rebuild on a partition basis for those partition with volatile data [5]
    • {operation} index reorganization 
      • [<SQL Server 2019] a reorganize operation is required to merge smaller COMPRESSED rowgroups, following an internal threshold policy that determines how to remove deleted rows and combine the compressed rowgroups
      • [SQL Server 2019] a background merge task also works to merge COMPRESSED rowgroups from where a large number of rows has been deleted
        • ⇐ after merging smaller rowgroups, the index quality should be improved.
        • the tuple-mover is helped by a background merge task that automatically compresses smaller OPEN delta rowgroups that have existed for some time as determined by an internal threshold, or merges COMPRESSED rowgroups from where a large number of rows has been deleted
        • via: ALTER INDEX REORGANIZE command
      • [SQL Server 2016] performs additional defragmentation
        • removes deleted rows from row groups that have 10 or more percent of the rows logically deleted [5]
        • merges closed row groups together, keeping the total number of rows less than or equal than rowgroup’s limit [5]
        • ⇐ both processes can be done together [5]
      • [SQL Server 2014] the only action performed is compressing and moving the data from closed delta stores to rowgroups [5] 
        • ⇐ delete bitmap and open delta stores stay intact [5]
      • via: ALTER INDEX REORGANIZE
        • uses all available system resources while it is running [5]
          • ⇒ speeds up the execution process 
          • reduce the time during which other sessions cannot modify or delete data in a table [5]
        • close and compress all open row groups
          • via: ALTER INDEX REORGANIZE WITH (COMPRESS_ALL_ROW_GROUPS = ON)
          • row groups aren’t merged during this operation [5]
    • {operation} estimate compression savings
      • [SQL Server 2019] COLUMNSTORE and COLUMNSTORE_ARCHIVE added
        • allows estimating the space savings if either of these indexes are used on a table [8]
        • {limitation} not available in all editions 
      • via: sp_estimate_data_compression_savings 
    • {operation} [bulk loads] when the number of rows is less than deltastore’s limit, all the rows go directly to the deltastore
      • [large bulk load] most of the rows go directly to the columnstore without passing through the deltastore
        • some rows at the end of the bulk load might be too few in number to meet the minimum size of a rowgroup
          • ⇒ the final rows go to the deltastore instead of the columnstore
      • bulk insert operations provide the number of rows in the batch as part of the API call [5]
        • best results are achieved by choosing a batch size that is divisible by rowgroup’s limit [5]
          • ⇐ guarantees that every batch produces one or several fully populated row groups [5]
            • ⇒ reduce the total number of row groups in a table [5]
            • ⇒ improves query performance
          • ⇐ the batch size shouldn’t exceed rowgroup’s limit [5]
            • row groups can be still created on the fly in a manner to similar a bulk insert when the size of the insert batch is close to or exceeds [5]
  • {operation} [non-bulk operations] trickle inserts go directly to a delta store
  • {feature} parallel inserts
    • [SQL Server 2016] requires following conditions for parallel insert on CCI [6]
      • must specify TABLOCK
      • no NCI on the clustered columnstore index
      • no identity column
      • database compatibility is set to 130
  • {recommendation} minimize the use of string columns in facts tables [5]
    • string data use more space
    • their encoding involves additional overhead during batch mode execution [5]
    • queries with predicates on string columns may have less efficient execution plans that also require significantly larger memory grants as compared to their non-string counterparts [5]
  • {recommendation} [SQL Server 2012|2014] do not push string predicates down toward the lowest operators in execution plans.
  • {recommendation} add another dimension table and replace the string value in the facts table with a synthetic, integer-based ID key that references a new table [5]
  • {operation} upgrading to SQL Server 2016
    • make sure that queries against the tables with columnstore indexes can utilize parallelism in case if database compatibility level less than 130 [5]
  • {feature} [SQL Server 2019] automated columnstore index maintenance [8]
  • {improvement} [SQL Server 2019] better columnstore metadata memory management
  • {improvement} [SQL Server 2019] low-memory load path for columnstore tables
  • {improvement} [SQL Server 2019] improved performance for bulk loading to columnstore indexes
  • {improvement} [SQL Server 2019] server startup process has been made faster for databases that use in-memory columnstore tables for HTAP
  • {feature} DMVs

References:
[1] SQL Docs (2020) Columnstore indexes: Overview [link]
[2] Microsoft Learn (2024) SQL: What's new in columnstore indexes  [link]
[3] Dejan Sarka et al (2012) Exam 70-463: Implementing a Data Warehouse with Microsoft SQL Server 2012 (Training Kit)
[4] SQL Docs (2019) Columnstore indexes - Query performance [link]
[5] Dmitri Korotkevitch (2016) Pro SQL Server Internals 2nd Ed.
[6] Microsoft Learn (2016) Columnstore Index: Parallel load into clustered columnstore index from staging table [link]
[7] Microsoft Learn (2016) Columnstore Index Defragmentation using REORGANIZE Command [link]
[8] Microsoft (2018) Microsoft SQL Server 2019: Technical white paper [link]

Acronyms:
CCI - clustered columnstore index
CI - columnstore index
DBCC - Database Console Commands
DMV - Dynamic Management View
ETL - Extract, Transform, Load
HTAP - Hybrid Transactional/Analytical Processing 
LOB - Line of Business
NCCI - nonclustered columnstore index
OLTP - On-Line Transaction Processing
SP - Service Pack

06 January 2025

💎🏭SQL Reloaded: Microsoft Fabric's SQL Databases (Part VI: Index Usage Analysis) [new feature]

There are several system dynamic management views (DMV) available in SQL Server, Azure SQL Server and now in SQL databases that allow to gather more information about indexes' fragmentation and usage. Let's look at the most important information available based on the indexes create in the previous posts. As the data were probably purged from the views, it's needed to run first the select queries based on the SalesLT.Product from the previous post. This step is important, otherwise the DMVs might return no records!

One starting point is to use the sys.dm_db_index_physical_stats DMV to look at the indexes' size and fragmentation information for a given table (or view). The table is used usually as starting point for analyzing indexes' fragmentation and then defragment the indexes with high fragmentation.

-- sys metadata - index & data size and fragmentation information for the data and indexes of the specified table or view
SELECT --db_name() db_name
--, object_name(IND.object_id) table_name
 IND.name index_name
, IND.type_desc
, IPS.page_count
, IPS.record_count
, IPS.index_level
, Cast(IPS.avg_fragmentation_in_percent as decimal(10,2)) avg_fragmentation_perc
, Cast(IPS.avg_page_space_used_in_percent as decimal(10,2)) space_used_perc
--, IPS.*
FROM sys.indexes IND
     CROSS APPLY sys.dm_db_index_physical_stats(DB_ID(), IND.object_id, IND.index_id, NULL, 'DETAILED') IPS
WHERE IND.object_id = OBJECT_ID(N'SalesLT.Product');
Output:
index_name type_desc page_count record_count index_level avg_fragmentation_perc space_used_perc
PK_Product_ProductID CLUSTERED 101 295 0 0.99 87.90
PK_Product_ProductID CLUSTERED 1 101 1 0.00 16.20
AK_Product_rowguid NONCLUSTERED 2 295 0 50.00 74.69
AK_Product_rowguid NONCLUSTERED 1 2 1 0.00 0.59
AK_Product_ProductNumber NONCLUSTERED 2 295 0 50.00 85.79
AK_Product_ProductNumber NONCLUSTERED 1 2 1 0.00 0.49
AK_Product_Name NONCLUSTERED 3 295 0 33.33 87.32
AK_Product_Name NONCLUSTERED 1 3 1 0.00 1.67
IX_SalesLT_Product_Color NONCLUSTERED 1 295 0 0.00 79.24
IX_SalesLT_Product_Color_Size NONCLUSTERED 1 295 0 0.00 94.12
IX_SalesLT_Product_ListPrice_IC NONCLUSTERED 4 295 0 0.00 86.60
IX_SalesLT_Product_ListPrice_IC NONCLUSTERED 1 4 1 0.00 1.01

In a second step one can look at the sys.dm_db_index_usage_stats DMV which provides the counts of the different types of index operations and the time each type of operation was last performed:

-- sys metadata - counts of different types of index operations and the time each type of operation was last performed.
SELECT -- db_name() db_name
--, object_name(IND.object_id) table_name
 IND.name
, IND.type_desc
, IUS.user_seeks 
, IUS.user_scans
, IUS.user_lookups 
, IUS.user_updates
, IUS.last_user_seek
, IUS.last_user_scan 
, IUS.last_user_lookup
, IUS.last_user_update
FROM sys.dm_db_index_usage_stats IUS
     JOIN sys.indexes IND
       ON IUS.index_id = IND.index_id
WHERE IND.object_id = OBJECT_ID(N'SalesLT.Product');
Output:
name type_desc user_seeks user_scans user_lookups user_updates last_user_seek last_user_scan last_user_lookup
PK_Product_ProductID CLUSTERED 0 10 15 0 2025-01-06T14:23:54 2025-01-06T14:23:54
IX_SalesLT_Product_Color_Size NONCLUSTERED 11 0 0 0 2025-01-06T14:23:54
IX_SalesLT_Product_ListPrice_IC NONCLUSTERED 8 0 0 0 2025-01-06T13:38:03

Finally, it might be useful to look also at the sys.dm_db_index_operational_stats DMV which returns the current lower-level I/O, locking, latching, and access method activity for each partition of a table or index in the database (see the documentation for the full list of attrbutes):

-- sys metadata - index operations stats
SELECT -- db_name() db_name
--, object_name(IND.object_id) table_name
 IND.name index_name
, IND.type_desc
, IOS.range_scan_count
, IOS.singleton_lookup_count
, IOS.leaf_insert_count
, IOS.leaf_delete_count
, IOS.leaf_update_count
, IOS.nonleaf_insert_count
, IOS.nonleaf_delete_count
, IOS.nonleaf_update_count
FROM sys.indexes IND
     CROSS APPLY sys.dm_db_index_operational_stats(DB_ID(), IND.object_id, IND.index_id, NULL) IOS
WHERE IND.object_id = OBJECT_ID(N'SalesLT.Product')
 AND IOS.range_scan_count<>0
ORDER BY IND.name;
Output:
index_name type_desc range_scan_count singleton_lookup_count leaf_insert_count leaf_delete_count leaf_update_count nonleaf_insert_count nonleaf_delete_count nonleaf_update_count
IX_SalesLT_Product_Color_Size NONCLUSTERED 11 0 0 0 0 0 0 0
IX_SalesLT_Product_ListPrice_IC NONCLUSTERED 8 0 0 0 0 0 0 0
PK_Product_ProductID CLUSTERED 10 64 0 0 0 0 0 0

For more information on these DMVs check the documentation.

Happy coding!

Previous Post <<||>> Next Post

References:
[1] Microsoft Learn (2024) SQL Server: sys.dm_db_index_physical_stats [link]
[2] Microsoft Learn (2024) SQL Server: sys.dm_db_index_usage_stats [link]
[3] Microsoft Learn (2024) SQL Server: sys.dm_db_index_operational_stats [link]

04 January 2025

💎🏭SQL Reloaded: Microsoft Fabric's SQL Databases (Part V: Manual Index Maintenance) [new feature]

Indexes' maintenance in Microsoft Fabric's SQL databases is supposed to happen automatically in the background via automatic tuning options feature, though the whole functionality is still in its early phases, and therefore many questions regarding the whole process may arise. Probably the most important question is whether indexes can still be created, respectively maintained manually. That's useful for temporary or even periodic workloads, where maybe organizations might still want to maintain indexes manually. 

The tests made below are based on the SalesLT.Product from AdventureWorkds database available in Microsoft Fabric. The target was to create several indexes that could be used for the various testing purposes. Each set of the below scripts was run 5-10 times until records appeared in the sys.dm_db_missing_index_details table for each test case (see further below):

-- batch 1: filter on single column (to be run 5-10 times)
SELECT *
FROM SalesLT.Product 
WHERE Color = 'Red'

SELECT *
FROM SalesLT.Product 
WHERE Color = 'Black'

SELECT *
FROM SalesLT.Product 
WHERE Color = 'White'

-- batch 2: filter on two columns (to be run 5-10 times)
SELECT *
FROM SalesLT.Product 
WHERE Color = 'Red'
  AND Size = '58'

SELECT *
FROM SalesLT.Product 
WHERE Color = 'Black'
  AND Size = '58'

SELECT *
FROM SalesLT.Product 
WHERE Color = 'White'
     AND Size = '58'

-- batch 3: filter with column selection (to be run 5-10 times)
SELECT ProductNumber, Name, Color, ListPrice
FROM SalesLT.Product 
WHERE ListPrice BETWEEN 50 AND 55

SELECT ProductNumber, Name, Color, ListPrice
FROM SalesLT.Product 
WHERE ListPrice BETWEEN 100 and 105

Once the scripts run, one can look at the records created in the above considered dynamic management view:

-- sys metadata -  missing indexes
SELECT MID.statement AS table_name
, MID.equality_columns
, MID.inequality_columns
, MID.included_columns
--, MIG.index_group_handle
--, MIG.index_handle
FROM sys.dm_db_missing_index_details MID 
    JOIN sys.dm_db_missing_index_groups MIG 
     ON MID.index_handle =  MIG.index_handle
ORDER BY MIG.index_group_handle
, MIG.index_handle
Output:
table_name equality_columns inequality_columns included_columns
[AdventureWorks01-...].[SalesLT].[Product] [Color]
[AdventureWorks01-...].[SalesLT].[Product] [Color], [Size]
[AdventureWorks01-...].[SalesLT].[Product] [ListPrice] [Name], [ProductNumber], [Color]

The next step is to create one of the indexes (please note that database's name must be replaced accordingly or used only the 2-part naming convention - schema & table name ):

-- create index on Color
CREATE INDEX IX_SalesLT_Product_Color 
ON [AdventureWorks01-...].[SalesLT].[Product] (Color);

Once the script was run, all the records related to the SalesLT.Product disappeared from the dynamic management view. Therefore, it might be a good idea to take a snapshot with view's data before creating any indexes manually. Probably the same behavior should be expected when the indexes are created by the system.

-- create index on Color & Size
CREATE INDEX IX_SalesLT_Product_Color_Size
ON [SalesLT].[Product] (Color, Size);

-- create index on ListPrice with included columns
CREATE INDEX IX_SalesLT_Product_ListPrice_IC
ON [SalesLT].[Product] (ListPrice) INCLUDE(ProductNumber, Name, Color);

One can use the following query based on the meta.vIndexes (created in a previous post) to look at the indexes created:

-- sys metadata - index columns
SELECt IND.db_name
, IND.schema_name
, IND.table_name
, IND.index_name
, IND.index_type
, IND.principal_type
, IND.auto_created
FROM meta.vIndexes IND
WHERE IND.schema_name = 'SalesLT'
  AND IND.table_name = 'Product'
  AND IND.index_name IN ('IX_SalesLT_Product_Color ','IX_SalesLT_Product_Color_Size'
,'IX_SalesLT_Product_ListPrice_IC')
ORDER BY IND.table_name
, IND.index_name
Output:
db_name schema_name table_name index_name index_type principal_type auto_created
AdventureWorks01-... SalesLT Product IX_SalesLT_Product_Color NONCLUSTERED S False
AdventureWorks01-... SalesLT Product IX_SalesLT_Product_Color_Size NONCLUSTERED S False
AdventureWorks01-... SalesLT Product IX_SalesLT_Product_ListPrice_IC NONCLUSTERED S False

After this model can be created further indexes as needed. It's always a good idea to take a "copy" of the indexes created (or keep a history of the scripts run for indexes' maintenance). This best practice is now more important, when the system can drop indexes as it considers fit. 

Don't forget to clean up the changes made if the indexes aren't needed anymore:

-- cleaning after
DROP INDEX IF EXISTS SalesLT.IX_SalesLT_Product_Color;
DROP INDEX IF EXISTS SalesLT.IX_SalesLT_Product_Color_Size;
DROP INDEX IF EXISTS SalesLT.IX_SalesLT_Product_ListPrice_IC;

So, after these tests, the standard syntax for index's maintenance seems to work also on SQL databases, with all the implications deriving from this (e.g. porting of scripts, database objects, etc.)

Happy coding!

Previous Post <<||>> Next Post

01 January 2025

💎🏭SQL Reloaded: SQL Server Metadata (Part I: Indexes Overview)

There are scenarios in which it's useful to gather information about the available indexes and their definition as a primary step for troubleshooting or index maintenance. Moreover, it's useful to take a baseline of the defined indexes and update it accordingly when indexes change. A minimum of information can be gathered in Excel files or similar repositories, though for the same can be used also metadata tools, at least when they're easy to use and the associated costs aren't neglectable. 

Usually, there are two levels at which the information is needed - at index, respectively at column level. Sometimes it's useful to have an independent query for each level of detail, though in data warehouses and similar use cases it's useful to provide a model on top of the metadata, at least to improve query's maintainability.

The first view encapsulates the logic needed to export the data and it's based on the sys.indexes, sys.objects, sys.schemas and sys.database_principals system objects. 

-- create schema for metadata
CREATE SCHEMA meta;

-- clean after
DROP VIEW IF EXISTS meta.vIndexes

-- create views
CREATE OR ALTER VIEW meta.vIndexes
AS
-- sys metadata - indexes
SELECT OBJ.schema_id
, IND.object_id 
, IND.index_id 
, SCH.name schema_name
, OBJ.name table_name
, IND.name index_name
, IND.type_desc index_type
, IND.is_primary_key
, IND.is_unique_constraint
, IND.fill_factor
, IND.has_filter
, IND.auto_created
, IND.is_unique is_unique_index
, DBP.type principal_type
, DBP.type_desc principal_type_desc
--, INC.*
FROM sys.indexes IND WITH (NOLOCK)
     JOIN sys.objects OBJ WITH (NOLOCK)
       ON IND.object_id = OBJ.object_id
          JOIN sys.schemas SCH WITH (NOLOCK)
            ON OBJ.schema_id = SCH.schema_id
               JOIN sys.database_principals DBP
                 ON SCH.principal_id = DBP.principal_id 
WHERE DBP.type IN ('S', 'E')

Upon case, the needed information can be exported via a query like the one below:

-- sys metadata - index columns
SELECt IND.db_name
, IND.schema_name
, IND.table_name
, IND.index_name
, IND.index_type
, IND.principal_type
FROM meta.vIndexes IND
WHERE IND.schema_name = 'SalesLT'
  AND IND.table_name IN ('Address', 'Customer')
ORDER BY IND.table_name
, IND.index_name

Output:

db_name schema_name table_name index_name index_type principal_type
AdventureWorks01... SalesLT Address AK_Address_rowguid NONCLUSTERED S
AdventureWorks01... SalesLT Address IX_Address_AddressLine1_AddressLine2_City_StateProvince_PostalCode_CountryRegion NONCLUSTERED S
AdventureWorks01... SalesLT Address IX_Address_StateProvince NONCLUSTERED S
AdventureWorks01... SalesLT Address PK_Address_AddressID CLUSTERED S
AdventureWorks01... SalesLT Customer AK_Customer_rowguid NONCLUSTERED S
AdventureWorks01... SalesLT Customer IX_Customer_EmailAddress NONCLUSTERED S
AdventureWorks01... SalesLT Customer PK_Customer_CustomerID CLUSTERED S

Similarly, on top of the above view can be built a similar object that provides also the column-related information by adding the sys.index_columns and sys.columns system object to the logic:

-- clean after
DROP VIEW IF EXISTS meta.vIndexColumns

-- create 
CREATE OR ALTER VIEW meta.vIndexColumns
AS
-- sys metadata - index columns
SELECT IND.db_id
, IND.schema_id
, INC.object_id 
, INC.index_id 
, INC.index_column_id
, INC.column_id
, IND.db_name
, IND.schema_name
, IND.table_name
, IND.index_name
, COL.name column_name
, INC.key_ordinal
, INC.partition_ordinal
, IND.index_type
, IND.is_primary_key
, IND.is_unique_constraint
, IND.fill_factor
, IND.has_filter
, IND.auto_created
, IND.is_unique_index
, INC.is_descending_key
, INC.is_included_column
, IND.principal_type
, IND.principal_type_desc
FROM sys.index_columns INC
     JOIN sys.columns COL WITH (NOLOCK)
       ON INC.object_id = COL.object_id 
      AND INC.column_id = COL.column_id 
     JOIN meta.vIndexes IND WITH (NOLOCK)
       ON INC.object_id = IND.object_id
      AND INC.index_id = IND.index_id

And, there's an example of the query based on this view:

-- sys metadata - index columns
SELECt INC.schema_name
, INC.table_name
, INC.index_name
, INC.column_name
, INC.key_ordinal
, INC.index_type
, INC.principal_type
FROM meta.vIndexColumns INC
WHERE INC.schema_name = 'SalesLT'
  AND INC.table_name IN ('Address', 'Customer')
ORDER BY INC.table_name
, INC.index_name
, INC.key_ordinal

Output:
schema_name table_name index_name column_name key_ordinal index_type principal_type
SalesLT Address AK_Address_rowguid rowguid 1 NONCLUSTERED S
SalesLT Address IX_Address_AddressLine1_AddressLine2_City_StateProvince_PostalCode_CountryRegion AddressLine1 1 NONCLUSTERED S
SalesLT Address IX_Address_AddressLine1_AddressLine2_City_StateProvince_PostalCode_CountryRegion AddressLine2 2 NONCLUSTERED S
SalesLT Address IX_Address_AddressLine1_AddressLine2_City_StateProvince_PostalCode_CountryRegion City 3 NONCLUSTERED S
SalesLT Address IX_Address_AddressLine1_AddressLine2_City_StateProvince_PostalCode_CountryRegion StateProvince 4 NONCLUSTERED S
SalesLT Address IX_Address_AddressLine1_AddressLine2_City_StateProvince_PostalCode_CountryRegion PostalCode 5 NONCLUSTERED S
SalesLT Address IX_Address_AddressLine1_AddressLine2_City_StateProvince_PostalCode_CountryRegion CountryRegion 6 NONCLUSTERED S
SalesLT Address IX_Address_StateProvince StateProvince 1 NONCLUSTERED S
SalesLT Address PK_Address_AddressID AddressID 1 CLUSTERED S
SalesLT Customer AK_Customer_rowguid rowguid 1 NONCLUSTERED S
SalesLT Customer IX_Customer_EmailAddress EmailAddress 1 NONCLUSTERED S
SalesLT Customer PK_Customer_CustomerID CustomerID 1 CLUSTERED S

Notes:
1) As a DBA it's useful to take a baseline of the indexes defined and reevaluate their usefulness over time. These are one of the checks that should be done when one becomes responsible for the administration of a database server, independently of the vendor.
2) The definitions of the views can be extended as needed, though one should try to keep the overall complexity to a minimum. 
3) There are voices against the use of NOLOCK. Feel free to change the objects accordingly!
4) It's useful to work in a dedicated schema (e.g. meta) and have a different naming convention that deviates slightly from one defined by Microsoft. This should make sure that no confusion with the system objects exists.
5) Azure SQL takes over many of the responsibilities for index maintenance. Even if indexes are managed automatically by the system, a baseline is still needed, at least to evaluate functionality's performance, respectively the changes that occurred in the environment. 
6) Except attributes which were added for specific functionality over time, the queries should work at least starting with SQL Server 2005 forward.
7) See also the notes on clustered vs nonclustered indexes.

Happy coding!

Previous Post <<||>> Next Post

Related Posts Plugin for WordPress, Blogger...

About Me

My photo
Koeln, NRW, Germany
IT Professional with more than 24 years experience in IT in the area of full life-cycle of Web/Desktop/Database Applications Development, Software Engineering, Consultancy, Data Management, Data Quality, Data Migrations, Reporting, ERP implementations & support, Team/Project/IT Management, etc.