SQL Troubles: indexes

Showing posts with label indexes. Show all posts

22 March 2025

💠🛠️🗒️SQL Server: Indexed Views [Notes]

Disclaimer: This is work in progress based on notes gathered over the years, intended to consolidate information from the various sources. The content needs yet to be reviewed against the current documentation.

Last updated: 22-Mar-2024

[SQL Server 2005] Indexed View

{def} a materialized view

materializes the data from the view queries, storing it in the database in a way similar to tables [6]

⇒ its definition is computed and the resulting data stored just like a table [3]
the view is indexed by creating a unique clustered index on it

the resulting structure is physically identical to a table with a clustered index

⇐ nonclustered indexes also are supported on this structure

can be created on a partitioned table, respectively can be partitioned [1]

{benefit} can improve the performance of some types of queries [3]

e.g. queries that aggregate many rows
⇐ because the view is stored the same way a table with a clustered index is stored [1]
⇐ not well-suited for underlying data that are frequently updated [3]
more expensive to use and maintain than filtered indexes [5]

[query optimizer]

can use it to speed up the query execution [1]

the view doesn't have to be referenced in the query for the optimizer to consider it for a substitution [1]
{downside} DML query performance can degrade significantly [1]

⇐ in some cases, a query plan can't even be produced [1]
when executing UPDATE, DELETE or INSERT on the base table referenced, the indexed views must be updated as well [1]
{recommendation} test DML queries before production use [1]

analyze the query plan and tune/simplify the DML statemen [1]

can use the structure to return results more efficiently to the user

contains logic to use this index in either of the cases

the original query text referenced the view explicitly [2]
the user submits a query that uses the same components as the view (in any equivalent order) [2]
⇐ the query processor expands indexed views early in the query pipeline and always uses the same matching code for both cases [2]

the WITH(NOEXPAND) hint tells the query processor not to expand the view definition [2]
also instructs the query processor to perform an index scan of the indexed view rather than expand it into its component parts [5]
any extra rows in the indexed view are reported as 8907 errors [5]
any missing rows are reported as 8908 errors [5]

expose some of the benefits of view materialization while retaining the benefits of global reasoning about query operations [2]
expanded (aka in-lined) before optimization begins

gives the Query Optimizer opportunities to optimize queries globally [2]
makes it difficult for the (query) optimizer to consider plans that perform the view evaluation first, then process the rest of the query [2]

arbitrary tree matching is a computationally complex problem, and the feature set of views is too large to perform this operation efficiently [2]

cases in which it does not match the view

indexed views are inserted into the Memo and evaluated against other plan choices

while they are often the best plan choice, this is not always the case [2]
the Query Optimizer can detect logical contradictions between the view definition and the query that references the view [2]

there are also some cases where the Query Optimizer does not recognize an indexed view even when it would be a good plan choice [2]

often, these cases deal with complex interactions between high-level features within the query processor (e.g. computed column matching, the algorithm to explore join orders) [2]
consider the WITH (NOEXPAND) hint to force the query processor to pick that indexed view [2]

this usually is enough to get the plan to include the indexed view [2]

indexed view alternatives

are generated and stored in the Memo
are compared using costing equations against other possible plans
partial matches cost the residual operations as well

an indexed-view plan can be generated but not picked when the Query Optimizer considers other plans to have lower costs [2]

maintained as part of the update processing for tables on which the view is based

this makes sure that the view provides a consistent result if it is selected by the Query Optimizer for any query plan [2]
some query operations are incompatible with this design guarantee

restrictions are placed on the set of supported constructs in indexed views to make sure that the view can be created, matched, and updated efficiently [2]

{operation} updating indexed views

the core question behind the restrictions is “Can the query processor compute the necessary changes to the Indexed View clustered and nonclustered indexes without having to recompute the whole indexed view?” [2]

if so, the query processor can perform these changes efficiently as part of the maintenance of the base tables that are referenced in the view[2]

this property is relatively easy for filters, projections (compute scalar), and inner joins on keys[2]
operators that destroy or create data are more difficult to maintain, so often these are restricted from use in indexed views. [2]

matching indexed views is supported in cases beyond exact matches of the query text to the view definition [2]

it also supports using an indexed view for inexact matches where the definition of the view is broader than the query submitted by the user [2]

then applies residual filters, projections (columns in the select list), and even aggregates to use the view as a partial precomputation of the query result [2]

{concept} statistics on indexed views

normally statistics aren't needed

because the substitution of the indexed views into the query plan is considered only after all the statistics for the underlying tables and indexes are attached to the query plan [3]
used if the view is directly referenced by the NOEXPAND hint in a FROM clause

an error is generated and the plan is not created if the NOEXPAND hint is used on a view that does not also contain an index [3]

can’t be created by using sp_createstats or updated by using sp_updatestats.
auto update and auto create statistics features work for indexed views

created manually

via CREATE STATISTICS on the indexed view columns
via UPDATE STATISTICS to update column or index statistics on indexed views

{operation} creating a view

requires that the underlying object’s schema can’t change
requires WITH SCHEMABINDING option [5]
⇒ must include the two-part names of all referenced tables [5]
⇐ the tables can't be dropped and the columns can't be altetd when participate in a view unless the view is tropped [5]
⇐ an error is raised [5]
the user must hold

the CREATE VIEW permission in the database [1]
ALTER permission on the schema in which the view is being created [1]
if the base table resides within a different schema, the REFERENCES permission on the table is required as a minimum [1]
if the user creating the index differs from the users who created the view, for the index creation alone the ALTER permission on the view is required [1]

{operation} creating an index on the view[

indexes can only be created on views that have the same owner as the referenced table or tables (aka intact ownership chain between the view and the tables) [1]

{operation} dropping a view

makes all indexes on the view to be dropped [1]

⇐ all nonclustered indexes and auto-created statistics on the view are dropped when the clustered index is dropped [1]

{exception} ser-created statistics on the view are maintained [1]

nonclustered indexes can be individually dropped [1]
dropping the clustered index on the view

removes the stored result set [1]
the optimizer returns to processing the view like a standard view [1]

{operation} disable indexes on tables and views

when a clustered index on a table is disabled, indexes on views associated with the table are also disabled [1]

{option} EXPAND VIEWS

allows to prevent the Database Engine from using indexed views [1]

if any of the listed options are incorrectly set, this option prevents the optimizer from using the indexes on the views [1]
via OPTION (EXPAND VIEWS) hint

{recommendation} when using datetime and smalldatetime string literals in indexed views, explicitly convert the literal to the date type by using a deterministic date format style [1]
{limitation} AVG is not allowed {workaround} use SUM and COUNT_BIG (5]
{limitation} impacted by SET options [1]

{restriction} require fixed values for several SET options [1]
{recommendation} set the ARITHABORT user option to ON server-wide as soon as the first indexed view or index on a computed column is created in any database on the server [1]

{limitation} further requirements apply (see [1])
{limitation} aren't supported on top of temporal queries

⇐ queries that use FOR SYSTEM_TIME clause).

{scenario}simplifying SQL queries
{scenario} abstracting data models from user models
{scenario} enforcing user security

Previous Post <<||>> Next Post

References:
[1] Microsoft Learn (2024) SQL Server: Create indexed views [link]
[2] Kalen Delaney et all (2009) Microsoft® SQL Server® 2008 Internals
[3] Microsoft Learn (2024) SQL Server: Views [link]
[4] Microsoft Learn (2024) SQL Server: CREATE INDEX (Transact-SQL) [link]

[5] Kalen Delaney et all (2012) Microsoft® SQL Server® 2012 Internals

[6] Dmitri Korotkevitch (2016) Pro SQL Server Internals 2nd Ed.

Resources:
[R1] Microsoft Learn (2024) SQL Server: Optimize index maintenance to improve query performance and reduce resource consumption [link]

Acronyms:
DML - Data Manipulation Language
QO - Query Optimizer

19 March 2025

💠🛠️🗒️SQL Server: Views [Notes]

Last updated: 19-Mar-2024

[SQL Server 2005] View (aka virtual table)

{def} a database object that encapsulates a SQL statement and that can be used as a virtual table in further SQL statements

cannot be executed by itself

⇐ it must be used within a query [15]

doesn't store any data

except index views
data is dynamically produced from the underlying table when the view is used [32]

views depend on the underlying tables and act like a filter on the underlying tables [32]

used just like regular tables without incurring additional cost

unless the view is indexed [25]

turning a query into a view

remove the ORDER BY clause
assure there are no name duplicates
assure that each column has a name

projected columns

columns included in the view

view’s column list

renames every output column just as if every column had those alias names in the SELECT statement
a view is more self-documenting if the column names of the view are specified in the SELECT statement and not listed separately in the view [27]

{restriction} sorting is not allowed in a view

unless the view includes a TOP predicate

ORDER BY clause serves only to define which rows qualify for the TOP predicate [15]

the only way to logically guarantee sorted results is to define the ORDER BY clause in the executing query [15]

[SQL Server 2005] had a bug in the Query Optimizer that would enable an ORDER BY in a view using a top 100 percent predicate [15]

the behavior was never documented or officially supported [15]

OFFSET FETCH clause

{restriction} parameters can’t be passed to a view [100]

{alternative} use an inline table-valued function

{restriction} cannot reference a variable inside the SELECT statement [100]
{restriction} cannot create a table, whether permanent or temporary

⇒ cannot use the SELECT/INTO syntax in a view

{restriction} can reference only permanent tables

⇒ cannot reference a temporary table [100]

{benefit} present the correct fields to the user
{benefit} enforce security

by specifying

only needed columns

projects a predefined set of columns [15]
hides sensitive, irrelevant, or confusing columns [15]
should be used in parallel with SQL Server–enforced security [15]

only needed records

by allowing users access to the view without the need to give access to the used tables

grant users read permission from only the views, and restrict access to the physical tables [15]

{benefit} maintainability
{benefit} provides a level of abstraction

hides the complexity of the underlying data structures
encapsulates (business)logic
denormalize or flatten complex joins
can consolidate data across databases/servers
can be used as single version of truth

{benefit} allow changing data in the base tables
{downside} layers of nested views require needless overhead for views’ understanding
{downside} single-purpose views quickly become obsolete and clutter the database [15]
{downside} complex views are perceived as having poor performance [15]
{best practice} use generic/standard naming conventions
{best practice} use aliases for cryptic/recurring column names
{best practice} consider only the requested columns
{best practice} group specific purpose view under own schema
{best practice} avoid hardcoding values
{best practice} use views for column-level security together with SQL Server–enforced security
{best practice} limit views to ad-hoc queries and reports

for extensibility and control [15]
⇐ performance isn’t the reason [15]

{poor practices} create views for single-purpose queries (aka one time requests)
{operation} create a view
{operation} drop a view
{operation} alter a view
{operation} select data
{operation} update data

unless the view is a simple single table view, it’s difficult to update the underlying data through the view [15]

{type} inline views

exist only during the execution of a query [32]
simplify the development of a one-time query [32]

allows creating queries in steps

enables troubleshooting

can replace inline UDFs
alternatives

inline UDFs
temporary tables

{type} indexed views

materialize the data, storing the results of the view in a clustered index on disk [15]
similar to a covering index

but with greater control

can include data from multiple data sources [15]
no need to include the clustered index keys [15]

designing an indexed view is more like designing an indexing structure than creating a view [15]

can cause deadlock when two or more of the participating tables is updated/inserted/deleted from two or more sessions in parallel such that they block each other causing a deadlock scenario [29]

{type} compatibility views

allow accessing a subset of the SQL Server 2000 system tables

don’t contain any metadata related to features added after

views have the same names as many of the system tables in previous version, as well as the same column names

⇒ any code that uses the SQL Server 2000 system tables won’t break [16]
there’s no guarantee that will be returned exactly the same results as the ones from the corresponding tables in SQL Server 2000 [16]

accessible from any database
hidden in the resource database

e.g. sysobjects, sysindexes, sysusers, sysdatabases

{type} [SQL Server 2015] catalog views

general interface to the persisted system metadata
built on an inheritance model

⇒ no need to redefine internally sets of attributes common to many objects

available over sys schema

must be included in object’s reference

some of the names are easy to remember because they are similar to the SQL Server 2000 system table names [16]
the columns displayed are very different from the columns in the compatibility views
some metadata appears only in the master database

keeps track of system-wide data (e.g. databases and logins)
other metadata is available in every database (e.g. objects and permissions)
metadata appearing only in the msdb database isn’t available through catalog views but is still available in system tables, in the schema dbo (e.g. backup and restore, replication, Database Maintenance Plans, Integration Services, log shipping, and SQL Server Agent)

{type} partitioned views

allow the data in a large table to be split into smaller member tables

the data is partitioned between the member tables based on ranges of data values in one of the columns [4]
the data ranges for each member table are defined in a CHECK constraint specified on the partitioning column [4]
a view that uses UNION ALL to combine selects of all the member tables into a single result set is then defined [4]
when SELECT statements referencing the view specify a search condition on the partition column, the query optimizer uses the CHECK constraint definitions to determine which member table contains the rows [4]

{type} distributed partition views (DPV)

local partitioned views

a single table is horizontally split into multiple tables, usually all have the same structure [30]

cross database partitioned views

tables are split among different databases on the same server instance

distributed (across server or instance) partitioned views

tables participating in the view reside in different databases which reside ondifferent servers or different instances

{type} nested views

views referred by other views [15]
can lead to an abstraction layer with nested views several layers deep

too difficult to diagnose and maintain [15]

{type} updatable view

view that allows updating the underlying tables

only one table may be updated
if the view includes joins, then the UPDATE statement that references the view must change columns in only one table [15]

typically not a recommended solution for application design
WITH CHECK OPTION causes the WHERE clause of the view to check the data being inserted or updated through the view in addition to the data being retrieved [15]

it makes the WHERE clause a two-way restriction [15]

⇒ can protect the data from undesired inserts and updates [15]

⇒ useful when the view should limit inserts and updates with the same restrictions applied to the WHERE clause [15]
when CHECK OPTION isn’t use, records inserted in the view that don’t match the WHERE constraints will disappear (aka disappearing rows) [15]

{type} non-updatable views

views that don’t allow updating the underlying tables
{workaround} build an INSTEAD OF trigger that inspects the modified data and then performs a legal UPDATE operation based on that data [15]

{type} horizontally positioned views.

used s to enforce row-level security with the help of a WITH CHECK option

{downside} has a high maintenance cost [15]
{alternative} row-level security can be designed using user-access tables and stored procedures [15]

{type} schema-bound views

the SELECT statement must include the schema name for any referenced objects [15]

SELECT * (all columns) is not permitted [15]

{type} subscription views

a view used to export Master Data Services data to subscribing systems

Previous Post <<||>> Next Post

References:
[4] Microsoft (2013) SQL Server 2000 Documentation

[15] Adam Jorgensen et al (2012) Microsoft® SQL Server® 2012 Bible

[16] Bob Beauchemin et al (2012) Microsoft SQL Server 2012 Internals
[25] Basit A Masood-Al-Farooq et al (2014) SQL Server 2014 Development Essentials: Design, implement, and deliver a successful database solution with Microsoft SQL Server 2014

[30] Kevin Cox (2007) Distributed Partitioned Views / Federated Databases: Lessons Learned

[32] Sikha S Bagui & Richard W Earp (2006) Learning SQL on SQL Server 2005

[100] Itzik Ben-Gan et al (2012) Exam 70-461: Querying Microsoft SQL Server 201

Acronyms:
DPV - Distributed Partition Views
UDF - User-Defined Function

20 February 2025

💠🛠️🗒️SQL Server: Nulls [Notes]

Disclaimer: This is work in progress intended to consolidate information from various sources. It considers only on-premise SQL Server, for other platforms please refer to the documentation.

Last updated: 20-Feb-2024

[SQL Server] Null

{def} keyword that indicates that the value is unknown [1]

different from an empty or zero value [1]
no two null values are equal [1]

comparisons between two null values, or between a null value and any other value, return unknown because the value of each NULL is unknown [1]

indicates the the value is

unknown
not applicable
to be added later
⇒ can't be used as information that is required to distinguish one row in a table from another row in a table [1]

can be assigned to a value by

explicitly stating NULL in an INSERT or UPDATE statement [1[
leaving a column out of an INSERT statement [1]

{recommendation} test for null values in queries

via IS NULL or IS NOT NULL in the WHERE clause [1]
WHEN present in data, logical and comparison operators can potentially return a third result of UNKNOWN instead of just TRUE or FALSE [1]

⇐ three-valued logic can be the source for many application errors [1]

⇐ parameters and variables not explicitly initialized can cause problems in code

{recommendation} handle null values in logic

via IsNull or Coalesce functions

{constraint} [primary kyes] if any of the columns considered in a primary key contain NULL values, the PRIMARY KEY constraint can’t be created [3]
{constraint} [UNIQUE constraint] allows the columns that make up the constraint to allow NULLs, but it doesn’t allow all key columns to be NULL for more than one row [3]
[data warehouse] nullability of columns

{best practice} define columns as NOT NULL when appropriate

{benefit} helps the Query Optimizer
{benefit} reduces in some cases the storage space required for the data
{benefit} allows SQL Server to avoid unnecessary encoding in columnstore indexes and during batch mode execution [2]

{example} [SQL Server 2000+] bigint column

when the value is defined as NOT NULL , the value fits into a single CPU register

⇒ operations on the value can be performed more quickly

a nullable bigint column requires another, 65th bit to indicate NULL values

SQL Server avoids cross-register data storage by storing some of the row values (usually the highest or lowest values) in main memory using special markers to indicate it in the data that resides in the CPU cache [2]

⇒ adds extra load during execution

{recommendation} avoid nullable columns in data warehouse environments [2]

⇐ the recommendation can apply also to OLTP databases

there are database designs that enforces not null values for all attributes

e.g. Dynamics AX 2009/365 F&O
{benefit} eliminates the need to test for null values in legacy code

{recommendation} use CHECK and UNIQUE constraints or indexes when overhead introduced by constraints or unique indexes is acceptable [2]
{recommendation} consider using filtered indexes instead of normal indexes for columns with many null values

minimizes the waste of storage space
⇐ understand the characteristics of the columns used in the queries [3]

See also: Null-ifying the World, Dynamic Queries, Handling missing dates

Previous Post <<||>> Next Post

References:

[1] Microsoft Learn (2024) SQL Server 2022: NULL and UNKNOWN (T-SQL)
[2] Dmitri Korotkevitch (2016) Pro SQL Server Internals 2nd Ed.

[3] Microsoft SQL Server 2012 Internals, by Kalen Delaney, Bob Beauchemin, Conor Cunningham, Jonathan Kehayias, Benjamin Nevarez & Paul S. Randal, Microsoft Press, ISBN: 978-0-7356-5856-1 , 2013

16 February 2025

💠🛠️🗒️SQL Server: Columnstore Indexes [Notes]

Disclaimer: This is work in progress intended to consolidate information from various sources. It considers only on-premise SQL Server, for other platforms please refer to the documentation.

Last updated: 15-Feb-2024

[SQL Server] columnstore indexes (CI)

{def} a technology for storing, retrieving and managing data by using a columnar data format (aka columnstore)

store compressed data on a per-column rather than a per-row basis [5]

{benefit} designed for analytics and data warehousing workloads

data warehousing

{scenario} store fact tables and large dimension tables

⇐ tend to require full table scans rather than table seeks

analytics workloads

{scenario} [SQL Server 2016 SP1] can be used for real-time analytics on operational databases

⇐ an updatable nonclustered columnstore index can be created on a rowstore table

{benefit} performance increase

can achieve up to 100x better performance [4]
offers an order of magnitude better performance than a rowstore index

{feature} uses batch mode execution

improves query performance typically by two to four times

have high performance gains for analytic queries that scan large amounts of data, especially on large tables (>1 million rows)

{benefit} reduces significantly the data warehouse storage costs

{feature} data compression

⇒ provides high compression rates, typically by 10 times

⇒ reduces total I/O from the physical media

⇐ queries often select only a few columns from a table
minimizes or eliminates system I/O bottlenecks

reduces significantly the memory footprint

⇒ query performance can improve

because SQL Server can perform more query and data operations in memory

{benefit} built in memory

⇒ sufficient memory must be available

{benefit} part of the database engine

no special hardware is needed

{concept} columnstore

{def} data structure logically organized as a table with rows and columns, and physically stored in a column-wise data format

stores values from the same domain which commonly have similar values

when a query references a column, then only that column is fetched from disk [3]

⇐ the columns not requested are skipped

⇒ they are not loaded into memory

when a query is executed, the rows must be reconstructed

⇒ row reconstruction takes some time and uses some CPU and memory resources [3]

[SQL Server 2016] columnstore index on rowstore tables

columnstore is updated when data changes in the rowstore table

both indexes work against the same data

{concept}rowstore

{def} data that's logically organized as a table with rows and columns, and physically stored in a row-wise data format

⇐ the traditional way to store relational table data
refers to a table where the underlying data storage format is either

a heap
a clustered index
a memory-optimized table

{concept} rowstore index

performs best on queries that seek into the data, when searching for a particular value, or for queries on a small range of values

⇒ appropriate for transactional workloads

because they tend to require mostly table seeks instead of table scans

{concept} rowgroup

{def} a group of rows that are compressed into columnstore format at the same time

{constraint} has a maximum number of rows per rowgroup, which is 1,048,576 =2^20 rows
contains one column segment for every column in the table
can have more than one delta rowgroup that form the deltastore

e.g. when multiple threads create columnstore indexes using parallel execution plans [5]

⇐ each thread will work with its own subset of data, creating separate rowgroups [5]

[partitions] each table partition has its own set of row groups [5]

⇐ too many partitions may prevent workloads from benefiting from a CCI [11]

⇐ data aren’t pushed into a compressed columnstore segment until the rowgroup limit is reached

{event} rowgroup is compressed

marked as read-only [16]
a compressed rowgroup is considered as fragmented when either

row number < rowgroup limit but dictionary size reached the maximum

nothing can be done to increase the number of rows [15]
the trim_reason is other than DICTIONARY_SIZE

it has nonzero deleted rows that exceeds a minimum threshold [15]

{event} all data from rowgroup deleted

transitions from COMPRESSED into TOMBSTONE state
later removed by the tuple-mover background process

{event} rows in the columnstore indexes can be moved to different locations

row-id in the nonclustered indexes aren’t updated

⇐ the mappings between old and new row locations are stored in an internal structure (aka mapping index)

{event} rowgroup build

all column data are combined on a per-row group basis, encoded and compressed [5]

the rows within a row group can be rearranged if that helps to achieve a better compression rate [5]

{feature} data compression

the table is sliced into rowgroups, and each rowgroup is compresses in a column-wise manner

the number of rows in the rowgroup must be

large enough to improve compression rates
small enough to benefit from in-memory operations

having too many small rowgroups decreases columnstore index’s quality

uses its own compression mechanism

⇒ row or page compression cannot be used on it [3]
[SQL Server 2016] page compression has been removed

⇐ in some cases, page compression disallowed the creation of columnstore indexes with a very large number of columns [5]

{feature} compression delay

computed when a delta rowgroup is closed [7]
keeps the ‘active’ rows in delta rowgroup and only transition these rows to compressed rowgroup after a specified delay [7]

⇐ reduces the overall maintenance overhead of NCCI [7]
⇒ leads to a larger number of delta rowgroups [7]

{best practice} if the workload is primarily inserting data and querying it, the default COMPRESSION_DELAY of 0 is the recommended option [7]
{best practice} [OLTP workload] if > 10% rows are marked deleted in recently compressed rowgroups, then consider a value that accommodates the behavior [7]

via: create nonclustered columnstore index with (compression_delay= 150)

{feature} data encoding

all values in the data are replaced with 64-bit integers using one of two encoding algorithms
{concept} dictionary encoding

stores distinct values from the data in a separate structure (aka dictionary}

every value in a dictionary has a unique ID assigned [5]

the ID is used for replacement

{concept} global dictionary

shared across all segments that belong to the same index partition [5]

{concept} local dictionary

created for individual segments using values that are not present in the global dictionary

{concept} value-based encoding

mainly used for numeric and integer data types that do not have enough duplicated values [5]

dictionary encoding would be inefficient [5]

converts integer and numeric values to a smaller range of 64-bit integers in 2 steps

{step} [numeric data types] are converted to integers using the minimum positive exponent (aka magnitude that allows this conversion) [5]

{goal} convert all numeric values to integers [5]
[integer data types] the smallest negative exponent is chosen that can be applied to all values without losing their precision [5]

{goal} reduce the interval between the minimum and maximum values stored in the segment [5]

{step} the minimum value (aka base value) in the segment is identified and subtracted it from all other values [5]

⇒ makes the minimum value in the segment number 0 [5]

after encoding the data are compressed and stored as a LOB allocation unit

{concept} column segment

{def} a column of data from within the rowgroup
is compressed together and stored on physical media
SQL Server loads an entire segment to memory when it needs to access its data

{concept} segment metadata

store metadata about each segment

e.g. minimum and maximum values
⇐ segments that do not have the required data are skipped [5]

{concept} deltastore

{def} all of the delta rowgroups of a columnstore index
its operations are handled behind the scenes

can be in either states

{state} open (aka open delta store)

accepts new rows and allow modifications and deletions of data

{state} closed (aka closed data store)

a delta store is closed when it reaches its rowgroup limit

{concept} delta rowgroup

{def} a clustered B-tree index that's used only with columnstore indexes
improves columnstore compression and performance by storing rows until the number of rows reaches the rowgroup limit and are then moved into the columnstore
{event} reaches the maximum number of rows

it transitions from an ‘open’ to ‘closed’ state
a closed rowgroup is compressed by the tuple-mover and stored into the columnstore as COMPRESSED rowgroup

{event} compressed

the existing delta rowgroup transitions into TOMBSTONE state to be removed later by the tuple-mover when there is no reference to it

{concept} tuple-mover

background process that checks for closed row group

if it finds a closed rowgroup, it compresses the delta rowgroup and stores it into the columnstore as a COMPRESSED rowgroup

{concept} clustered columnstore index (CCI)

is the primary storage for the entire table
{characteristic) updatable

has two structures that support data modifications

⇐ both use the B-Tree format to store data [5]
⇐ created on demand [5]
delete bitmap

indicates which rows were deleted from a table
upon deletion the row continues to be stored into the rowgroup
during query execution SQL Server checks the delete bitmap and excludes deleted rows from the processing [5]

delta store

includes newly inserted rows
updating a row triggers the deletion of the existing row and insertion of a new version of a row to a delta store

⇒ the update does not change the row data
⇒ the updated row is inserted to a delete bitmap

[partitions] each partition can have a single delete bitmap and multiple delta stores

⇐ this makes each partition self-contained and independent from other partitions

⇒ allows performing a partition switch on tables that have clustered columnstore indexes defined [5]

{feature} supports minimal logging for batch sizes >= rowgroup’s limit [12]
[SQL Server 2017] supports non-persisted computed columns in clustered columnstore indexes [2]
store some data temporarily into a clustered index (aka deltastore) and a btree list of IDs for deleted rows

⇐ {benefit} reduces fragmentation of the column segments and improves performance
combines query results from both the columnstore and the deltastore to return the correct query results

[partitions] too many partitions can hurt the performance of a clustered columnstore index [11]

{concept} nonclustered columnstore index (NCCI)

{def} a secondary index that's created on a rowstore table

is defined as one or more columns of the table and has an optional condition that filters the rows
designed to be used for workloads involving a mix of transactional and analytics workload*
functions the same as a clustered columnstore index

⇐ has same performance optimizations (incl. batchmode operators)
{exception} doesn’t supports persisted computed columns

can’t be created on a columnstore index that has a computed column [2]

however behave differently between the various versions of SQL Server

[SQL Server 2012|2014] {restriction} readonly

contains a copy of part or all of the rows and columns in the underlying table

include a row-id , which is either the address of

a row in a heap table
a clustered index key value

includes all columns from the clustered index even when not explicitly defined in the CREATE statement

the not specified columns will not be available in the sys.index_columns view

[SQL Server 2016] multiple nonclustered rowstore indexes can be created on a columnstore index and perform efficient table seeks on the underlying columnstore

⇒ once created, makes it possible to drop one or more btree nonclustered indexes

enables real-time operational analytics where the OLTP workload uses the underlying clustered index while analytics run concurrently on the columnstore index

{concept} batch mode execution (aka vector-based execution, vectorized execution)

{def} query processing method used to process multiple rows together in groups of rows, or batches, rather than one row at a time

SQL Server can push a predicate to the columnstore index scan operator, preventing unnecessary rows from being loaded into the batch [5]
queries can process up to 900 rows together

enables efficient query execution (by a 3-4x factor) [4]
⇐ the size of the batches varies to fit into the CPU cache
⇒ reduces the number of times that the CPU needs to request external data from memory or other components [5]

improves the performance of aggregations, which can be calculated on a per-batch rather than a per-row basis [5]
tries to minimize the copy of data between operators by creating and maintaining a special bitmap that indicates if a row is still valid in the batch [5]

⇐ subsequent operators will ignore the non-valid rows
every operator has a queue of work items (batches) to process [5]
worker threads from a shared pool pick items from queues and process them while migrating from operator to operator [5]

is closely integrated with, and optimized around, the columnstore storage format.

columnstore indexes use batch mode execution

⇐ improves query performance typically by two to four times

{concept} tuple mover

single-threaded process that works in the background, preserving system resources

runs every five minutes

converts closed delta stores to row groups that store data in a column-based storage format [5]

can be disabled via trace flag T-634
⇐ the conversion of closed delta stores to row groups can be forced by reorganizing an index [5]

runs in parallel using multiple threads

decreases significantly conversion time at a cost of extra CPU load and memory usage [5]

via: ALTER INDEX REORGANIZE command

it doesn’t prevent other sessions from inserting new data into a table [5]
deletions and data modifications would be blocked for the duration of the operation [5]

{recommendation} consider forcing index reorganization manually to reduce execution, and therefore locking, time [5]

considered fragmented if it has

multiple delta rowgroups
deleted rows

require maintenance like that of regular B-Tree indexes [5]

{issue] partially populated row groups
{issue} overhead of delta store and delete bitmap scans during query execution
rebuilding the columnstore index addresses the issues
the strategy depends on the volatility of the data and the ETL processes implemented in the system [5]

{recommendation} rebuild indexes when a table has a considerable volme of deleted rows and/or a large number of partially populated rowgroups [5]
{recommendation} rebuild partition(s) that still have a large number of rows in open delta stores after the ETL process has completed, especially if the ETL process does not use a bulk insert API [5]

creating/dropping/disabling/rebuilding functions like any other index

columnstore statistics

a statistics object is created at the time of columnstore index creation; however, it is neither populated nor updated afterward [5]

⇐ SQL Server relies on segment information, B-Tree indexes (when available), and column-level statistics when deciding if a columnstore index needs to be used [5]
it is beneficial to create missing column-level statistics on the columns that participate in a columnstore index and are used in query predicates and as join keys [5]

⇐ statistics rarely update automatically on very large tables [5]

⇒ statistics must be updated ‘manually’

[SQL Server 2019] included into the schema-only clone of a database functionality [8]

enable performance troubleshooting without the need to manual capture the statistics information

columnstore indexes has been added to sp_estimate_data_compression_savings. In SQL Server 2019 both
COLUMNSTORE and COLUMNSTORE_ARCHIVE have been added to allow you to estimate the space savings if
either of these indexes are used on a table.

via DBCC CLONEDATABASE

[in-memory tables]

{limitation} a columnstore index must include all the columns and can’t have a filtered condition [2]
{limitation} queries on columnstore indexes run only in InterOP mode, and not in the in-memory native mode [2]

{operation} designing columnstore indexes

{best practice} understand as much as possible data’s characteristics
{best practice} identify workload’s characteristics

{operation} create a clustered columnstore index

via CREATE CLUSTERED COLUMNSTORE INDEX command
not needed to specify any columns in the statement

⇐ the index will include all table columns

{operation} index rebuilding

forces SQL Server to remove deleted rows physically from the index and to merge the delta stores’ and row groups’ data [5]

all column segments are recreated with row groups fully populated [5]

[<SQL Server 2019] offline operation
[SQL Server 2019 Enterprise] online operation

⇒ higher availability
⇐ pausing and resuming create and rebuild operations are not supported [11]

very resource intensive process
holds a schema modification (Sch-M) lock on the table

⇒ prevents other sessions from accessing it [5]
⇐ the overhead can be mitigated by using table/index partitioning

⇒ indexes will be rebuild on a partition basis for those partition with volatile data [5]

{operation} index reorganization

[<SQL Server 2019] a reorganize operation is required to merge smaller COMPRESSED rowgroups, following an internal threshold policy that determines how to remove deleted rows and combine the compressed rowgroups
[SQL Server 2019] a background merge task also works to merge COMPRESSED rowgroups from where a large number of rows has been deleted

⇐ after merging smaller rowgroups, the index quality should be improved.
the tuple-mover is helped by a background merge task that automatically compresses smaller OPEN delta rowgroups that have existed for some time as determined by an internal threshold, or merges COMPRESSED rowgroups from where a large number of rows has been deleted
via: ALTER INDEX REORGANIZE command

[SQL Server 2016] performs additional defragmentation

removes deleted rows from row groups that have 10 or more percent of the rows logically deleted [5]
merges closed row groups together, keeping the total number of rows less than or equal than rowgroup’s limit [5]
⇐ both processes can be done together [5]

[SQL Server 2014] the only action performed is compressing and moving the data from closed delta stores to rowgroups [5]

⇐ delete bitmap and open delta stores stay intact [5]

via: ALTER INDEX REORGANIZE

uses all available system resources while it is running [5]

⇒ speeds up the execution process
reduce the time during which other sessions cannot modify or delete data in a table [5]

close and compress all open row groups

via: ALTER INDEX REORGANIZE WITH (COMPRESS_ALL_ROW_GROUPS = ON)
row groups aren’t merged during this operation [5]

{operation} estimate compression savings

[SQL Server 2019] COLUMNSTORE and COLUMNSTORE_ARCHIVE added

allows estimating the space savings if either of these indexes are used on a table [8]
{limitation} not available in all editions

via: sp_estimate_data_compression_savings

{operation} [bulk loads] when the number of rows is less than deltastore’s limit, all the rows go directly to the deltastore

[large bulk load] most of the rows go directly to the columnstore without passing through the deltastore

some rows at the end of the bulk load might be too few in number to meet the minimum size of a rowgroup

⇒ the final rows go to the deltastore instead of the columnstore

bulk insert operations provide the number of rows in the batch as part of the API call [5]

best results are achieved by choosing a batch size that is divisible by rowgroup’s limit [5]

⇐ guarantees that every batch produces one or several fully populated row groups [5]

⇒ reduce the total number of row groups in a table [5]
⇒ improves query performance

⇐ the batch size shouldn’t exceed rowgroup’s limit [5]

row groups can be still created on the fly in a manner to similar a bulk insert when the size of the insert batch is close to or exceeds [5]

{operation} [non-bulk operations] trickle inserts go directly to a delta store
{feature} parallel inserts

[SQL Server 2016] requires following conditions for parallel insert on CCI [6]

must specify TABLOCK
no NCI on the clustered columnstore index
no identity column
database compatibility is set to 130

{recommendation} minimize the use of string columns in facts tables [5]

string data use more space
their encoding involves additional overhead during batch mode execution [5]
queries with predicates on string columns may have less efficient execution plans that also require significantly larger memory grants as compared to their non-string counterparts [5]

{recommendation} [SQL Server 2012|2014] do not push string predicates down toward the lowest operators in execution plans.
{recommendation} add another dimension table and replace the string value in the facts table with a synthetic, integer-based ID key that references a new table [5]
{operation} upgrading to SQL Server 2016

make sure that queries against the tables with columnstore indexes can utilize parallelism in case if database compatibility level less than 130 [5]

{feature} [SQL Server 2019] automated columnstore index maintenance [8]
{improvement} [SQL Server 2019] better columnstore metadata memory management
{improvement} [SQL Server 2019] low-memory load path for columnstore tables
{improvement} [SQL Server 2019] improved performance for bulk loading to columnstore indexes
{improvement} [SQL Server 2019] server startup process has been made faster for databases that use in-memory columnstore tables for HTAP
{feature} DMVs

sys.column_store_segments

returns one row for each column per segment

sys.column_store_dictionaries

provides information about the dictionaries used by a columnstore index

sys.dm_db_column_store_row_group_physical_stats

rowgroup statuses

sys.column_store_row_groups

provides clustered columnstore index information on a per-segment basis

Previous Post <<||>> Next Post

References:
[1] SQL Docs (2020) Columnstore indexes: Overview [link]

[2] Microsoft Learn (2024) SQL: What's new in columnstore indexes [link]
[3] Dejan Sarka et al (2012) Exam 70-463: Implementing a Data Warehouse with Microsoft SQL Server 2012 (Training Kit)

[4] SQL Docs (2019) Columnstore indexes - Query performance [link]

[5] Dmitri Korotkevitch (2016) Pro SQL Server Internals 2nd Ed.

[6] Microsoft Learn (2016) Columnstore Index: Parallel load into clustered columnstore index from staging table [link]

[7] Microsoft Learn (2016) Columnstore Index Defragmentation using REORGANIZE Command [link]

[8] Microsoft (2018) Microsoft SQL Server 2019: Technical white paper [link]

Acronyms:
CCI - clustered columnstore index
CI - columnstore index
DBCC - Database Console Commands
DMV - Dynamic Management View
ETL - Extract, Transform, Load
HTAP - Hybrid Transactional/Analytical Processing
LOB - Line of Business
NCCI - nonclustered columnstore index
OLTP - On-Line Transaction Processing
SP - Service Pack

06 January 2025

💎🏭SQL Reloaded: Microsoft Fabric's SQL Databases (Part VI: Index Usage Analysis)🆕

There are several system dynamic management views (DMV) available in SQL Server, Azure SQL Server and now in SQL databases that allow to gather more information about indexes' fragmentation and usage. Let's look at the most important information available based on the indexes create in the previous posts. As the data were probably purged from the views, it's needed to run first the select queries based on the SalesLT.Product from the previous post. This step is important, otherwise the DMVs might return no records!

One starting point is to use the sys.dm_db_index_physical_stats DMV to look at the indexes' size and fragmentation information for a given table (or view). The table is used usually as starting point for analyzing indexes' fragmentation and then defragment the indexes with high fragmentation.

-- sys metadata - index & data size and fragmentation information for the data and indexes of the specified table or view
SELECT --db_name() db_name
--, object_name(IND.object_id) table_name
 IND.name index_name
, IND.type_desc
, IPS.page_count
, IPS.record_count
, IPS.index_level
, Cast(IPS.avg_fragmentation_in_percent as decimal(10,2)) avg_fragmentation_perc
, Cast(IPS.avg_page_space_used_in_percent as decimal(10,2)) space_used_perc
--, IPS.*
FROM sys.indexes IND
     CROSS APPLY sys.dm_db_index_physical_stats(DB_ID(), IND.object_id, IND.index_id, NULL, 'DETAILED') IPS
WHERE IND.object_id = OBJECT_ID(N'SalesLT.Product');

Output:

index_name	type_desc	page_count	record_count	index_level	avg_fragmentation_perc	space_used_perc
PK_Product_ProductID	CLUSTERED	101	295	0	0.99	87.90
PK_Product_ProductID	CLUSTERED	1	101	1	0.00	16.20
AK_Product_rowguid	NONCLUSTERED	2	295	0	50.00	74.69
AK_Product_rowguid	NONCLUSTERED	1	2	1	0.00	0.59
AK_Product_ProductNumber	NONCLUSTERED	2	295	0	50.00	85.79
AK_Product_ProductNumber	NONCLUSTERED	1	2	1	0.00	0.49
AK_Product_Name	NONCLUSTERED	3	295	0	33.33	87.32
AK_Product_Name	NONCLUSTERED	1	3	1	0.00	1.67
IX_SalesLT_Product_Color	NONCLUSTERED	1	295	0	0.00	79.24
IX_SalesLT_Product_Color_Size	NONCLUSTERED	1	295	0	0.00	94.12
IX_SalesLT_Product_ListPrice_IC	NONCLUSTERED	4	295	0	0.00	86.60
IX_SalesLT_Product_ListPrice_IC	NONCLUSTERED	1	4	1	0.00	1.01

In a second step one can look at the sys.dm_db_index_usage_stats DMV which provides the counts of the different types of index operations and the time each type of operation was last performed:

-- sys metadata - counts of different types of index operations and the time each type of operation was last performed.
SELECT -- db_name() db_name
--, object_name(IND.object_id) table_name
 IND.name
, IND.type_desc
, IUS.user_seeks 
, IUS.user_scans
, IUS.user_lookups 
, IUS.user_updates
, IUS.last_user_seek
, IUS.last_user_scan 
, IUS.last_user_lookup
, IUS.last_user_update
FROM sys.dm_db_index_usage_stats IUS
     JOIN sys.indexes IND
       ON IUS.index_id = IND.index_id
WHERE IND.object_id = OBJECT_ID(N'SalesLT.Product');

Output:

name	type_desc	user_seeks	user_scans	user_lookups	user_updates	last_user_seek	last_user_scan	last_user_lookup
PK_Product_ProductID	CLUSTERED	0	10	15	0		2025-01-06T14:23:54	2025-01-06T14:23:54
IX_SalesLT_Product_Color_Size	NONCLUSTERED	11	0	0	0	2025-01-06T14:23:54
IX_SalesLT_Product_ListPrice_IC	NONCLUSTERED	8	0	0	0	2025-01-06T13:38:03

Finally, it might be useful to look also at the sys.dm_db_index_operational_stats DMV which returns the current lower-level I/O, locking, latching, and access method activity for each partition of a table or index in the database (see the documentation for the full list of attrbutes):

-- sys metadata - index operations stats
SELECT -- db_name() db_name
--, object_name(IND.object_id) table_name
 IND.name index_name
, IND.type_desc
, IOS.range_scan_count
, IOS.singleton_lookup_count
, IOS.leaf_insert_count
, IOS.leaf_delete_count
, IOS.leaf_update_count
, IOS.nonleaf_insert_count
, IOS.nonleaf_delete_count
, IOS.nonleaf_update_count
FROM sys.indexes IND
     CROSS APPLY sys.dm_db_index_operational_stats(DB_ID(), IND.object_id, IND.index_id, NULL) IOS
WHERE IND.object_id = OBJECT_ID(N'SalesLT.Product')
 AND IOS.range_scan_count<>0
ORDER BY IND.name;

Output:

index_name	type_desc	range_scan_count	singleton_lookup_count	leaf_insert_count	leaf_delete_count	leaf_update_count	nonleaf_insert_count	nonleaf_delete_count	nonleaf_update_count
IX_SalesLT_Product_Color_Size	NONCLUSTERED	11	0	0	0	0	0	0	0
IX_SalesLT_Product_ListPrice_IC	NONCLUSTERED	8	0	0	0	0	0	0	0
PK_Product_ProductID	CLUSTERED	10	64	0	0	0	0	0	0

For more information on these DMVs check the documentation.

Happy coding!

Previous Post <<||>> Next Post

References:
[1] Microsoft Learn (2024) SQL Server: sys.dm_db_index_physical_stats [link]
[2] Microsoft Learn (2024) SQL Server: sys.dm_db_index_usage_stats [link]
[3] Microsoft Learn (2024) SQL Server: sys.dm_db_index_operational_stats [link]

04 January 2025

💎🏭SQL Reloaded: Microsoft Fabric's SQL Databases (Part V: Manual Index Maintenance) 🆕

Indexes' maintenance in Microsoft Fabric's SQL databases is supposed to happen automatically in the background via automatic tuning options feature, though the whole functionality is still in its early phases, and therefore many questions regarding the whole process may arise. Probably the most important question is whether indexes can still be created, respectively maintained manually. That's useful for temporary or even periodic workloads, where maybe organizations might still want to maintain indexes manually.

The tests made below are based on the SalesLT.Product from AdventureWorkds database available in Microsoft Fabric. The target was to create several indexes that could be used for the various testing purposes. Each set of the below scripts was run 5-10 times until records appeared in the sys.dm_db_missing_index_details table for each test case (see further below):

-- batch 1: filter on single column (to be run 5-10 times)
SELECT *
FROM SalesLT.Product 
WHERE Color = 'Red'

SELECT *
FROM SalesLT.Product 
WHERE Color = 'Black'

SELECT *
FROM SalesLT.Product 
WHERE Color = 'White'

-- batch 2: filter on two columns (to be run 5-10 times)
SELECT *
FROM SalesLT.Product 
WHERE Color = 'Red'
  AND Size = '58'

SELECT *
FROM SalesLT.Product 
WHERE Color = 'Black'
  AND Size = '58'

SELECT *
FROM SalesLT.Product 
WHERE Color = 'White'
     AND Size = '58'

-- batch 3: filter with column selection (to be run 5-10 times)
SELECT ProductNumber, Name, Color, ListPrice
FROM SalesLT.Product 
WHERE ListPrice BETWEEN 50 AND 55

SELECT ProductNumber, Name, Color, ListPrice
FROM SalesLT.Product 
WHERE ListPrice BETWEEN 100 and 105

Once the scripts run, one can look at the records created in the above considered dynamic management view:

-- sys metadata -  missing indexes
SELECT MID.statement AS table_name
, MID.equality_columns
, MID.inequality_columns
, MID.included_columns
--, MIG.index_group_handle
--, MIG.index_handle
FROM sys.dm_db_missing_index_details MID 
    JOIN sys.dm_db_missing_index_groups MIG 
     ON MID.index_handle =  MIG.index_handle
ORDER BY MIG.index_group_handle
, MIG.index_handle

Output:

table_name	equality_columns	inequality_columns	included_columns
[AdventureWorks01-...].[SalesLT].[Product]	[Color]
[AdventureWorks01-...].[SalesLT].[Product]	[Color], [Size]
[AdventureWorks01-...].[SalesLT].[Product]		[ListPrice]	[Name], [ProductNumber], [Color]

The next step is to create one of the indexes (please note that database's name must be replaced accordingly or used only the 2-part naming convention - schema & table name ):

-- create index on Color
CREATE INDEX IX_SalesLT_Product_Color 
ON [AdventureWorks01-...].[SalesLT].[Product] (Color);

Once the script was run, all the records related to the SalesLT.Product disappeared from the dynamic management view. Therefore, it might be a good idea to take a snapshot with view's data before creating any indexes manually. Probably the same behavior should be expected when the indexes are created by the system.

-- create index on Color & Size
CREATE INDEX IX_SalesLT_Product_Color_Size
ON [SalesLT].[Product] (Color, Size);

-- create index on ListPrice with included columns
CREATE INDEX IX_SalesLT_Product_ListPrice_IC
ON [SalesLT].[Product] (ListPrice) INCLUDE(ProductNumber, Name, Color);

One can use the following query based on the meta.vIndexes (created in a previous post) to look at the indexes created:

-- sys metadata - index columns
SELECt IND.db_name
, IND.schema_name
, IND.table_name
, IND.index_name
, IND.index_type
, IND.principal_type
, IND.auto_created
FROM meta.vIndexes IND
WHERE IND.schema_name = 'SalesLT'
  AND IND.table_name = 'Product'
  AND IND.index_name IN ('IX_SalesLT_Product_Color ','IX_SalesLT_Product_Color_Size'
,'IX_SalesLT_Product_ListPrice_IC')
ORDER BY IND.table_name
, IND.index_name

Output:

db_name	schema_name	table_name	index_name	index_type	principal_type	auto_created
AdventureWorks01-...	SalesLT	Product	IX_SalesLT_Product_Color	NONCLUSTERED	S	False
AdventureWorks01-...	SalesLT	Product	IX_SalesLT_Product_Color_Size	NONCLUSTERED	S	False
AdventureWorks01-...	SalesLT	Product	IX_SalesLT_Product_ListPrice_IC	NONCLUSTERED	S	False

After this model can be created further indexes as needed. It's always a good idea to take a "copy" of the indexes created (or keep a history of the scripts run for indexes' maintenance). This best practice is now more important, when the system can drop indexes as it considers fit.

Don't forget to clean up the changes made if the indexes aren't needed anymore:

-- cleaning after
DROP INDEX IF EXISTS SalesLT.IX_SalesLT_Product_Color;
DROP INDEX IF EXISTS SalesLT.IX_SalesLT_Product_Color_Size;
DROP INDEX IF EXISTS SalesLT.IX_SalesLT_Product_ListPrice_IC;

So, after these tests, the standard syntax for index's maintenance seems to work also on SQL databases, with all the implications deriving from this (e.g. porting of scripts, database objects, etc.)

Happy coding!

Previous Post <<||>> Next Post

SQL Troubles

Pages

22 March 2025

💠🛠️🗒️SQL Server: Indexed Views [Notes]

19 March 2025

💠🛠️🗒️SQL Server: Views [Notes]

20 February 2025

💠🛠️🗒️SQL Server: Nulls [Notes]

16 February 2025

💠🛠️🗒️SQL Server: Columnstore Indexes [Notes]

06 January 2025

💎🏭SQL Reloaded: Microsoft Fabric's SQL Databases (Part VI: Index Usage Analysis)🆕

04 January 2025

💎🏭SQL Reloaded: Microsoft Fabric's SQL Databases (Part V: Manual Index Maintenance) 🆕

About Me