SQL Troubles: statistics

Showing posts with label statistics. Show all posts

14 August 2025

🖍️Steve McKillup - Collected Quotes

"A correlation between two variables means they vary together. A positive correlation means that high values of one variable are associated with high values of the other, while a negative correlation means that high values of one variable are associated with low values of the other." (Steve McKillup, "Statistics Explained: An Introductory Guide for Life Scientists", 2005)

"Accuracy is the closeness of a measured value to the true value. Precision is the ‘spread’ or variability of repeated measures of the same value." (Steve McKillup, "Statistics Explained: An Introductory Guide for Life Scientists", 2005)

"Correlation is an exploratory technique used to examine whether the values of two variables are significantly related, meaning whether the values of both variables change together in a consistent way. (For example, an increase in one may be accompanied by a decrease in the other.) There is no expectation that the value of one variable can be predicted from the other, or that there is any causal relationship between them." (Steve McKillup, "Statistics Explained: An Introductory Guide for Life Scientists", 2005)

"Designing a well-controlled, appropriately replicated and realistic experiment has been described by some researchers as an ‘art’. It is not, but there are often several different ways to test the same hypothesis, and hence several different experiments that could be done. Consequently, it is difficult to set a guide to designing experiments beyond an awareness of the general principles." (Steve McKillup, "Statistics Explained: An Introductory Guide for Life Scientists", 2005)

"Even an apparently well-designed mensurative or manipulative experiment may still suffer from a lack of realism." (Steve McKillup, "Statistics Explained: An Introductory Guide for Life Scientists", 2005)

"First, if you already know that the population from which your sample has been taken is normally distributed (perhaps you have data for a variable that has been studied before), you can assume the distribution of sample means from this population will also be normally distributed. Second, the central limit theorem […] states that the distribution of the means of samples of about 25 or more taken from any population will be approximately normal, provided the population is not grossly non-normal (e.g. a population that is bimodal). Therefore, provided your sample size is sufficiently large you can usually do a parametric test. Finally, you can examine your sample. Although there are statistical tests for normality, many statisticians have cautioned that these tests often indicate the sample is significantly non normal even when a t-test will still give reliable results." (Steve McKillup, "Statistics Explained: An Introductory Guide for Life Scientists", 2005)

"Graphs may reveal patterns in data sets that are not obvious from looking at lists or calculating descriptive statistics. Graphs can also provide an easily understood visual summary of a set of results." (Steve McKillup, "Statistics Explained: An Introductory Guide for Life Scientists", 2005)

"Inaccurate and imprecise measurements or a poor or unrealistic sampling design can result in the generation of inappropriate hypotheses. Measurement errors or a poor experimental design can give a false or misleading outcome that may result in the incorrect retention or rejection of an hypothesis." (Steve McKillup, "Statistics Explained: An Introductory Guide for Life Scientists", 2005)

"It has often been said, ‘There is no such thing as a perfect experiment.’ One inherent problem is that, as a design gets better and better, the cost in time and equipment also increases, but the ability to actually do the experiment decreases. An absolutely perfect design may be impossible to carry out. Therefore, every researcher must choose a design that is ‘good enough’ but still practical. There are no rules for this – the decision on design is in the hands of the researcher, and will be eventually judged by their colleagues who examine any report from the work." (Steve McKillup, "Statistics Explained: An Introductory Guide for Life Scientists", 2005)

"It is important to realise that Type 1 error can only occur when the null hypothesis applies. There is absolutely no risk if the null hypothesis is false. Unfortunately, you are most unlikely to know if the null hypothesis applies or not - if you did know, you would not be doing an experiment to test it! If the null hypothesis applies, the risk of Type 1 error is the same as the probability level you have chosen" (Steve McKillup, "Statistics Explained: An Introductory Guide for Life Scientists", 2005)

"Linear correlation analysis assumes that the data are random representatives taken from the larger population of values for each variable, which are normally distributed and have been measured on a ratio, interval or ordinal scale. A scatter plot of these variables will have what is called a bivariate normal distribution. If the data are not normally distributed, or the relationship does not appear to be linear, they may be able to be analysed by nonparametric tests for correlation [...]" (Steve McKillup, "Statistics Explained: An Introductory Guide for Life Scientists", 2005)

"No hypothesis or theory can ever be proven - one day there may be evidence that rejects it and leads to a different explanation (which can include all the successful predictions of the previous hypothesis).Consequently we can only falsify or disprove hypotheses and theories – we can never ever prove them." (Steve McKillup, "Statistics Explained: An Introductory Guide for Life Scientists", 2005)

"One of the nastiest pitfalls is appearing to have a replicated manipulative experimental design, which really is not replicated." (Steve McKillup, "Statistics Explained: An Introductory Guide for Life Scientists", 2005)

"One way of generating hypotheses is to collect data and look for patterns. Often, however, it is difficult to see any pattern from a set of data, which may just be a list of numbers. Graphs and descriptive statistics are very useful for summarising and displaying data in ways that may reveal patterns." (Steve McKillup, "Statistics Explained: An Introductory Guide for Life Scientists", 2005)

"Parametric tests are designed for analyzing data from a known distribution, and the majority assume a normally distributed population. Although parametric tests are quite robust to departures from normality, and major ones can often be reduced by transformation, there are some cases where the population is so grossly non-normal that parametric testing is unwise. In these cases a powerful analysis can often still be done by using a non-parametric test." (Steve McKillup, "Statistics Explained: An Introductory Guide for Life Scientists", 2005)

"Sample statistics like the mean, variance, standard deviation, and especially the standard error of the mean are estimates of population statistics that can be used to predict the range within which 95% of the means of a particular sample size will occur. Knowing this, you can use a parametric test to estimate the probability that a sample mean is the same as an expected value, or the probability that the means of two samples are from the same population." (Steve McKillup, "Statistics Explained: An Introductory Guide for Life Scientists", 2005)

"The essential features of the ‘hypothetico-deductive’ view of scientific method are that a person observes or samples the natural world and uses all the information available to make an intuitive, logical guess, called an hypothesis, about how the system functions. The person has no way of knowing if their hypothesis is correct - it may or may not apply. Predictions made from the hypothesis are tested, either by further sampling or by doing experiments. If the results are consistent with the predictions then the hypothesis is retained. If they are not, it is rejected, and a new hypothesis formulated." (Steve McKillup, "Statistics Explained: An Introductory Guide for Life Scientists", 2005)

"The unavoidable problem with using probability to help you make a decision is that there is always a chance of making a wrong decision and you have no way of telling when you have done this." (Steve McKillup, "Statistics Explained: An Introductory Guide for Life Scientists", 2005)

"The use of a t-test makes three assumptions. The first is that the data are normally distributed. The second is that each sample has been taken at random from its respective population and the third is that for an independent sample test, the variances are the same. It has, however, been shown that t-tests are actually very ‘robust’ – that is, they will still generate statistics that approximate the t distribution and give realistic probabilities even when the data show considerable departure from normality and when sample variances are dissimilar." (Steve McKillup, "Statistics Explained: An Introductory Guide for Life Scientists", 2005)

"Unfortunately, the only way to estimate the appropriate minimum sample size needed in an experiment is to know, or have good estimates of, the effect size and standard deviation of the population(s). Often the only way to estimate these is to do a pilot experiment with a sample. For most tests there are formulae that use these (sample) statistics to give the appropriate sized sample for a desired power." (Steve McKillup, "Statistics Explained: An Introductory Guide for Life Scientists", 2005)

"When expected frequencies are small, the calculated chi-square statistic is inaccurate and tends to be too large, therefore indicating a lower than appropriate probability, which increases the risk of Type 1 error. It used to be recommended that no expected frequency in a goodness of fit test should be less than five, but this has been relaxed somewhat in the light of more recent research, and it is now recommended that no more than 20% of expected frequencies should be less than five." (Steve McKillup, "Statistics Explained: An Introductory Guide for Life Scientists", 2005)

"Whenever you make a decision based on the probability of a result, there is a risk of either a Type 1 or a Type 2 error. There is only a risk of Type 1 error when the null hypothesis applies, and the risk is the chosen probability level. There is only a risk of Type 2 error when the null hypothesis is false." (Steve McKillup, "Statistics Explained: An Introductory Guide for Life Scientists", 2005)

"Without compromising the risk of Type 1 error, the only way a researcher can reduce the risk of Type 2 error to an acceptable level and therefore ensure sufficient power is to increase their sample size. Every researcher has to ask themselves the question, ‘What sample size do I need to ensure the risk of Type 2 error is low and therefore power is high?’ This is an important question because samples are usually costly to take, so there is no point in increasing sample size past the point where power reaches an acceptable level." (Steve McKillup, "Statistics Explained: An Introductory Guide for Life Scientists", 2005)

24 April 2025

🧭Business Intelligence: Perspectives (Part 30: The Data Science Connection)

Business Intelligence Series

Data Science is a collection of quantitative and qualitative methods, respectively techniques, algorithms, principles, processes and technologies used to analyze, and process amounts of raw and aggregated data to extract information or knowledge it contains. Its theoretical basis is rooted within mathematics, mainly statistics, computer science and domain expertise, though it can include further aspects related to communication, management, sociology, ecology, cybernetics, and probably many other fields, as there’s enough space for experimentation and translation of knowledge from one field to another.

The aim of Data Science is to extract valuable insights from data to support decision-making, problem-solving, drive innovation and probably it can achieve more in time. Reading in between the lines, Data Science sounds like a superhero that can solve all the problems existing out there, which frankly is too beautiful to be true! In theory everything is possible, when in practice there are many hard limitations! Given any amount of data, the knowledge that can be obtained from it can be limited by many factors - the degree to which the data, processes and models built reflect reality, and there can be many levels of approximation, respectively the degree to which such data can be collected consistently.

Moreover, even if the theoretical basis seems sound, the data, information or knowledge which is not available can be the important missing link in making any sensible progress toward the goals set in Data Science projects. In some cases, one might be aware of what's missing, though for the data scientist not having the required domain knowledge, this can be a hard limit! This gap can be probably bridged with sensemaking, exploration and experimentation approaches, especially by applying models from other domains, though there are no guarantees ahead!

AI can help in this direction by utilizing its capacity to explore fast ideas or models. However, it's questionable how much the models built with AI can be further used if one can't build mechanistical mental models of the processes reflected in the data. It's like devising an algorithm for winning at lottery small amounts, though investing more money in the algorithm doesn't automatically imply greater wins. Even if occasionally the performance is improved, it's questionable how much it can be leveraged for each utilization. Statistics has its utility when one studies data in aggregation and can predict average behavior. It can’t be used to predict the occurrence of events with a high precision. Think how hard the prediction of earthquakes or extreme weather is by just looking at a pile of data reflecting what’s happening only in a certain zone!

In theory, the more data one has from different geographical areas or organizations, the more robust the models can become. However, no two geographies, respectively no two organizations are alike: business models, the people, the events and other aspects make global models less applicable to local context. Frankly, one has more chances of progress if a model is obtained by having a local scope and then attempting to leverage the respective model for a broader scope. Even then, there can be differences between the behavior or phenomena at micro, respectively at macro level (see the law of physics).

This doesn’t mean that Data Science or AI related knowledge is useless. The knowledge accumulated by applying various techniques, models and programming languages in problem-solving can be more valuable than the results obtained! Experimentation is a must for organizations to innovate, to extend their knowledge base. It’s also questionable how much of the respective knowledge can be retained and put to good use. In the end, each organization must determine this by itself!

Previous Post <<||>> Next Post

22 March 2025

💠🛠️🗒️SQL Server: Statistics Issues [Notes]

Disclaimer: This is work in progress based on notes gathered over the years, intended to consolidate information from the various sources. The content needs yet to be reviewed against the current documentation.

Last updated: 22-Mar-2024

[SQL Server 2005] Statistics

{def} objects that contain statistical information about the distribution of values in one or more columns of a table or indexed view [2] [see notes]
{issue} inaccurate statistics (aka bad statistics)

the primary source for cardinality estimation errors [6]
guessed selectivity can lead to a less-than-optimal query plan
{cause} missing statistics

if an error or other event prevents the query optimizer from creating statistics, the query optimizer creates the query plan without using statistics [22]

the query optimizer marks the statistics as missing and attempts to regenerate the statistics the next time the query is executed [22]

[SSMS] indicated as warnings when the execution plan of a query is graphically displayed [22]

[SQL Server Profiler] monitoring the Missing Column Statistics event class indicates when statistics are missing [22]
troubleshooting

{step} verify that AUTO_CREATE_STATISTICS and AUTO_UPDATE_STATISTICS are on [22]
verify that the database is not read-only [22]

the query optimizer cannot save statistics in read-only databases [22]

create the missing statistics by using the CREATE STATISTICS statement [22]

{cause} use of local variables
{cause} non-constant-foldable expressions
{cause} complex expressions
{cause} ascending key columns

can cause inaccurate statistics in tables with frequent INSERTs because new values all lie outside the histogram [3]

it is not necessary for statistics to be highly accurate to be effective [18]

{issue} stale statistics (aka outdated statistics)

statistics are outdated based on the number of modifications of the statistics (and index) columns [6]
SQL Server checks to see if the statistics are outdated when it looks up a plan from the cache [6]

it recompiles the query if they are [6]

⇐ triggers a statistics update [6]

[temporary tables] can increase the number of recompilations triggered by outdated statistics [6]
[query hint] KEEPFIXED PLAN

prevents query recompilation in cases of outdated statistics [6]

queries are recompiled only when the schemas of the underlying tables are changed or the recompilation is forced
e.g. when a stored procedure is called using the WITH RECOMPILE clause [6]

can be identified based on

Stats_Date(object_id, stats_id) returns the last update date
Rowmodctr from sys.sysindexes

has negative impact on performance

almost always outweighs the performance benefits of avoiding automatic statistics updates/creation [2]

⇐ applies also to missing statistics [2]
auto update/create stats should be disabled only when thorough testing has shown that there's no other way to achieve the performance or scalability you require [2]
without correct and up-to-date statistical information, can be challenging to determine the best execution plan for a particular query [8]

{issue} missing statistics

{symptom} excessive number of logical reads

often indicates suboptimal execution plans

due to missing indexes and/or suboptimal join strategies selected

because of incorrect cardinality estimation [6]

shouldn't be used as the only criteria during optimization

further factors should be considered

resource usage, parallelism, related operators in the execution plan [6]

{issue} unused statistics

table can end up having man statistics that serve no purpose

it makes sense to review and clean up the statistics as part of general maintenance

based on the thresholds for the automatic update

{issue} skewed statistics

may happen when the sample rate doesn’t reflect the characteristics of values’ distribution

{solution} using a higher sample rate

update statistics manually with fullscan

{issue} statistics accumulate over time

statistics objects are persistent and will remain in the database forever until explicitly dropped

this becomes expensive over time

{solution} drop all auto stats

will place some temporary stress on the system [11]
all queries that adheres to a certain pattern that requires a statistics to be created, will wait [11]
typically in a matter of minutes most of the missing statistics will be already back in place and the temporary stress will be over [11]
{tip} the cleanup can be performed at off-peak hours and ‘warm-up’ the database by capturing and replaying a trace of the most common read only queries [11]

create many of the required statistics objects without impacting your ‘real’ workload [11]

{downside} statements may deadlock on schema locks with concurrent sessions [11]

{issue} overlapping statistics

column statistics created by SQL Server based on queries that get executed can overlap with statistics for indexes created by users on the same column [2]

Previous Post <<||>> Next Post

References:
[2] Ken Henderson (2003) Guru's Guide to SQL Server Architecture and Internals

[3] Kendal Van Dyke (2010) More On Overlapping Statistics [link]

[6] Dmitri Korotkevitch (2016) Pro SQL Server Internals 2nd Ed.

[18] Joe Chang’s Blog (2012) Decoding STATS_STREAM, by Joe Chang

[22] MSDN (2016) Statistics [link]

Resources:

Acronyms:
SSMS - SQL Server Management Studio

19 March 2025

💠🛠️🗒️SQL Server: Statistics [Notes]

Last updated: 19-Mar-2024

[SQL Server 2005] Statistics

{def} objects that contain statistical information about the distribution of values in one or more columns of a table or indexed view [2]

used by query optimizer to estimate the cardinality (aka number of rows) in the query result
created on any data type that supports comparison operations [4]
metadata maintained about index keys and, optionally, nonindexed column values

implemented via statistics objects
can be created over most types

generally data types that support comparisons (such as >, =, and so on) support the creation of statistics [20]

the DBA is responsible for keeping the statistics objects up to date in the system [20]
{restriction} the combined width of all columns constituting a single statistics set must not be greater than 900 bytes [21]

statistics collected

time of the last statistics collection (inside STATBLOB) [21]
number of rows in the table or index (rows column in SYSINDEXES) [21]
number of pages occupied by the table or index (dpages column in SYSINDEXES) [21]
number of rows used to produce the histogram and density information (inside STATBLOB, described below) [21]
average key length (inside STATBLOB) [21]
histogram

measures the frequency of occurrence for each distinct value in a data set [31]
computed by the query optimizer on the column values in the first key column of the statistics object, selecting the column values by statistically sampling the rows or by performing a full scan of all rows in the table or view [31]

when created from a sampled set of rows, the stored totals for number of rows and number of distinct values are estimates and do not need to be whole integers [31]
it sorts the column values, computes the number of values that match each distinct column value and then aggregates the column values into a maximum of 200 contiguous histogram steps [31]

each step includes a range of column values followed by an upper bound column value [31]
the range includes all possible column values between boundary values, excluding the boundary values themselves [21]
the lowest of the sorted column values is the upper boundary value for the first histogram step [31]

defines the histogram steps according to their statistical significance [31]

uses a maximum difference algorithm to minimize the number of steps in the histogram while maximizing the difference between the boundary values

the maximum number of steps is 200 [31]
the number of histogram steps can be fewer than the number of distinct values, even for columns with fewer than 200 boundary points [31]

stored only for the first column of a composite index

the most selective columns in a multicolumn index should be selected first [2]

the histogram will be more useful to the optimizer [2]

single column histogram [21]
splitting up composite indexes into multiple single-column indexes is sometimes advisable [2]

used to estimate the selectivity of nonequality selection predicates, joins, and other operators [3]
contains a sampling of up to 200 values for the index's first key column

the values in a given column are sorted in ordered sequence

divided into up to 199 intervals

so that the most statistically significant information is captured
in general, of nonequal size

{type} table-level statistics

statistics maintained on each table
include:

number of rows in the table [8]
number of pages used by the table [8]
number of modifications made to the keys of the table since the last update to the statistics [8]

{type} index statistics

created with the indexes
updated with fullscan when indexes are rebuild the index [13]

{exception} [SQL Server 2012] for partitioned indexes when the number of partitions >1000 it uses default sampling [13]

{type} column statistics (aka non-index statistics)

statistics on non-indexed columns
determine the likelihood that a given value might occur in a column [2]

gives the optimizer valuable information in determining how best to service a query [2]

allows the optimizer to estimate the number of rows that will qualify from a given table involved in a join [2]

allows to more accurately select join order [2]

if automatic creation or updating of statistics is disabled, the Query Optimizer returns a warning in the showplan output when compiling a query where it thinks it needs this information [20]

used by optimizer to provide histogram-type information for the other columns in a multicolumn index [2]

the more information the optimizer has about data, the better [2]
queries asking for data that is outside the bounds of the histogram will estimate 1 row and typically end up with a bad, serial plan [15]

automatic created

when nonindexed column is queried while AUTO_CREATE_STATISTICS is enabled for the database

aren’t updated when an index is reorganized or rebuild [10]

{exception} [SQL Server 2005] indexes rebuilt with DBCC dbreindex [10]

updates index statistics as well column statistics [10]
this feature will be removed in a future version of Microsoft SQL Server

{type} [SQL Server 2008] filtered statistics

useful when dealing with partitioned data or data that is wildly skewed due to wide ranging data or lots of nulls
scope defined with the help of a statistic filter

a condition that is evaluated to determine whether a row must be part of the filtered statistics
the predicate appears in the WHERE clause of the CREATE STATISTICS or CREATE INDEX statements (in the case when statistics are automatically created as a side effect of creating an index)

{type} custom statistics

{limitation}[SQL Server] not supported

{type} temporary statistics

when statistics on a read-only database or read-only snapshot are missing or stale, the Database Engine creates and maintains temporary statistics in tempdb [22]

a SQL Server restart causes all temporary statistics to disappear [22]
the statistics name is appended with the suffix _readonly_database_statistic

to differentiate the temporary statistics from the permanent statistics [22]
reserved for statistics generated by SQL Server [2]
scripts for the temporary statistics can be created and reproduced on a read-write database [22]

when scripted, SSMS changes the suffix of the statistics name from _readonly_database_statistic to _readonly_database_statistic_scripted [22]

{restriction} can be created and updated only by SQL Server [22]

can be deleted and monitored [22]

{operation} create statistics

auto-create (aka auto-create statistics)

on by default

samples the rows by default

{exception} when statistics are created as a by-product of index creation, then a full scan is used [3]

{exception} statistics may not be created

for tables where the cost of the plan execution would be lower than the statistics creation itself [21]
when the server is too busy [21]

e.g. too many outstanding compilations in progress [21]

manually

via CREATE STATISTICS

only generates the statistics for a given column or combination of columns [21]
{recommendation} keep the AUTO_CREATE_STATISTICS option on so that the query optimizer continues to routinely create single-column statistics for query predicate columns [22]

{operation} update statistics

covered by the same SQL Server Profiler event as statistics creation
ensures that queries compile with up-to-date statistics [22]
{recommendation} statistics should not be updated too frequently [22]

because there is a performance tradeoff between improving query plans and the time it takes to recompile queries [22]

tradeoffs depend from application to application [22]

triggered by the executions of either commands:

CREATE INDEX ... WITH DROP EXISTING

scans the whole data set

the index statistics are initially created without sampling

allows to set the sample size in the WITH clause either by specifying

FULLSCAN
percentage of data to scan
interpreted as an approximation

sp_createstats stored procedure
sp_updatestats stored procedure
DBCC DBREINDEX

rebuilds one or more indexes for a table in the specified database [21]
DBCC INDEXDEFRAG or ALTER INDEX REORGANIZE operations don’t update the statistics [22]

undocumented options

meant for testing and debugging purposes

should never be used on production systems [29]

STATS_STREAM = stats_stream
ROWCOUNT = numeric_constant

incl PAGECOUNT, alter the internal metadata of the specified table or index by overriding the counters containing the row and page counts of the object [29]

read by the Query Optimizer when processing queries that access the table and/or index in question [29]

cheat the Optimizer into thinking that a table or index is extremely large [29]

the content of the actual tables and indexes will remain intact [29]

PAGECOUNT = numeric contant

auto-update (aka auto statistics update)

on by default

samples the rows by default

always performed by sampling the index or table [21]

triggered by either

query optimization
execution of a compiled plan

involves only a subset of the columns referred to in the query
occurs

before query compilation if AUTO_UPDATE_STATISTCS_ASYNC is OFF
asynchronously if AUTO_UPDATE_STATISTCS_ASYNC is ON

the query that triggered the update proceeds using the old statistics

provides more predictable query response time for some workloads [3]

particularly the workload with short running queries and very large tables [3]

it tracks changes to columns in the statistics

{limitation} it doesn’t track changes to columns in the predicate [3]

{recommendation} if there are many changes to the columns used in predicates of filtered statistics, consider using manual updates to keep up with the changes [3]

enable/disable statistics update

database level

ALTER DATABASE dbname SET AUTO_UPDATE_STATISTICS OFF
{limitation} it’s not possible to override the database setting of OFF for auto update statistics by setting it ON at the statistics object level [3]

table level

NORECOMPUTE option of the UPDATE STATISTICS command
CREATE STATISTICS command
sp_autostats

index

sp_autostats

statistics object
sp_autostats

[SQL Server 2005] asynchronous statistics update

allows the statistics update operation to be performed on a background thread in a different transaction context

avoids the repeating rollback issue [20]
the original query continues and uses out-of-date statistical information to compile the query and return it to be executed [20]
when the statistics are updated, plans based on those statistics objects are invalidated and are recompiled on their next use [20]

{command}

ALTER DATABASE... SET AUTO_UPDATE_STATISTICS_ASYNC {ON | OFF}

manually

{recommendation} don’t update statistics after index defragmentation

update eventually only the column statistics

[system tables] it might be needed to update statistics also on system tables when many objects were created

triggers a recompilation of the queries [22]

{exception} when a plan is trivial [1]

won’t be generated a better or different plan [1]

{exception} when the plan is non-trivial but no row modifications since last statistics update [1]

no insert, delete or update since last statistics update [1]
manually free procedure cache in case is needed to generate a new plan [1]

schema change

dropping an index defined on a table or an indexed view

only if the index is used by the query plan in question

[SQL Server 2000] manually updating or dropping a statistic (not creating!) on a table will cause a recompilation of any query plans that use that table

the recompilation happens the next time the query plan in question begins execution

[SQL Server 2005] dropping a statistic (not creating or updating!) defined on a table will cause a correctness-related recompilation of any query plans that use that table

the recompilations happen the next time the query plan in question begins execution

updating a statistic (both manual and auto-update) will cause an optimality-related (data related) recompilation of any query plans that uses this statistic

update scenarios

{scenario} query execution times are slow [22]

troubleshooting: ensure that queries have up-to-date statistics before performing additional analysis [22]

{scenario} insert operations occur on ascending or descending key columns [22]

statistics on ascending or descending key columns (e.g. IDENTITY or real-time timestamp) might require more frequent statistics updates than the query optimizer performs [22]
if statistics are not up-to-date and queries select from the most recently added rows, the current statistics will not have cardinality estimates for these new values [22]

inaccurate cardinality estimates and slow query performance [22]

after maintenance operations [22]

{recommendation} consider updating statistics after performing maintenance procedures that change the distribution of data (e.g. truncating a table, bulk inserts) [22]

this can avoid future delays in query processing while queries wait for automatic statistics updates [22]
rebuilding, defragmenting, or reorganizing an index do not change the distribution of data [22]

no need to update statistics after performing ALTER INDEX REBUILD, DBCC REINDEX, DBCC INDEXDEFRAG, or ALTER INDEX REORGANIZE operations [22]

{operation} dropping statistics

via DROP STATISTICS command
{limitation} it’s not possible to drop statistics that are a byproduct of an index [21]

such statistics are removed only when the index is dropped [21]

aging

[SQL Server 2000] ages the automatically created statistics (only those that are not a byproduct of the index creation)
after several automatic updates the column statistics are dropped rather than updated [21]

if they are needed in the future, they may be created again [21]
there is no substantial cost difference between statistics that are updated and created [21]

does not affect user-created statistics [21]

{operation} statistics import)

performed by running generated scripts (see statistics export )

{operation}export (aka statistics export)

performed via “Generate Scripts” task

WITH STATS_STREAM

value can be generated by using DBCC SHOW_STATISTICS WITH STATS_STREAM

{operation} save/restore statistics

not supported

{operation} monitoring

via dbcc show_statistics
via SQL Server query profiler

{option} AUTO_CREATE_STATISTICS

when enabled the query optimizer creates statistics on individual columns in the query predicate, as necessary, to improve cardinality estimates for the query plan [22]

these single-column statistics are created on columns that do not already have a histogram in an existing statistics object [22]
it applies strictly to single-column statistics for the full table [22]
statistics name starts with _WA

does not determine whether statistics get created for indexes [22]
does not generate filtered statistics [22]

call: SELECT DATABASEPROPERTY(<database_name>','IsAutoCreateStatistics')
call: select is_auto_create_stats_on from sys.databases where name = '<database_name'

{option}AUTO_UPDATE_STATISTICS option

when enabled the query optimizer determines when statistics might be out-of-date and then updates them when they are used by a query [22]

statistics become out-of-date after insert, update, delete, or merge operations change the data distribution in the table or indexed view [22]
determined by counting the number of data modifications since the last statistics update and comparing the number of modifications to a threshold [22]

the threshold is based on the number of rows in the table or indexed view [22]

the query optimizer checks for out-of-date statistics

before compiling a query

the query optimizer uses the columns, tables, and indexed views in the query predicate to determine which statistics might be out-of-date [22]

before executing a cached query plan

the Database Engine verifies that the query plan references up-to-date statistics [22]

applies to

statistics objects created for indexes, single-columns in query predicates [22]
statistics created with the CREATE STATISTICS statement [22]
filtered statistics [22]

call: SELECT DATABASEPROPERTY(<database_name>','IsAutoUpdateStatistics')
call: select is_auto_update_stats_on from sys.databases where name = '<database_name'

{option} AUTO_UPDATE_STATISTICS_ASYNC

determines whether the query optimizer uses synchronous or asynchronous statistics updates
{default} off

the query optimizer updates statistics synchronously

queries always compile and execute with up-to-date statistics
when statistics are out-of-date, the query optimizer waits for updated statistics before compiling and executing the query [22]
{recommendation} use when performing operations that change the distribution of data [22]

e.g. truncating a table or performing a bulk update of a large percentage of the rows [22]
if the statistics are not updated after completing the operation, using synchronous statistics will ensure statistics are up-to-date before executing queries on the changed data [22]

applies to statistics objects created for [22]

indexes
single columns in query predicates
statistics created with the CREATE STATISTICS statement

asynchronous statistics updates

queries compile with existing statistics even if the existing statistics are out-of-date

the query optimizer could choose a suboptimal query plan if statistics are out-of-date when the query compiles [22]
queries that compile after the asynchronous updates have completed will benefit from using the updated statistics [22]

{warning} it will do nothing if the “Auto Update Statistics” database option isn’t enabled [30]
{recommendation} scenarios

to achieve more predictable query response times
an application frequently executes the same query, similar queries, or similar cached query plans

the query response times might be more predictable with asynchronous statistics updates than with synchronous statistics updates because the query optimizer can execute incoming queries without waiting for up-to-date statistics [22]

this avoids delaying some queries and not others [22]

an application experienced client request time outs caused by one or more queries waiting for updated statistics [22]

in some cases waiting for synchronous statistics could cause applications with aggressive time outs to fail [22]

call: select is_auto_update_stats_async_on from sys.databases where name = '<database_name'

[SQL Server 2014] Auto Create Incremental Statistics

when active (ON), the statistics created are per partition statistics [22]
{default} when disabled (OFF), the statistics tree is dropped and SQL Server re-computes the statistics [22]
this setting overrides the database level INCREMENTAL property [22]

[partitions] when new partitions are added to a large table, statistics should be updated to include the new partitions [22]

however the time required to scan the entire table (FULLSCAN or SAMPLE option) might be quite long [22]
scanning the entire table isn't necessary because only the statistics on the new partitions might be needed [22]
the incremental option creates and stores statistics on a per partition basis, and when updated, only refreshes statistics on those partitions that need new statistics [22]
if per partition statistics are not supported the option is ignored and a warning is generated [22]
not supported for several statistics types

statistics created with indexes that are not partition-aligned with the base table [22]
statistics created on AlwaysOn readable secondary databases [22]
statistics created on read-only databases [22]
statistics created on filtered indexes [22]
statistics created on views [22]
statistics created on internal tables [22]
statistics created with spatial indexes or XML indexes [22]

Previous Post <<||>> Next Post

References:
[1] Microsoft Learn (2025) SQL Server: Statistics [link]
[2] Microsoft Learn (2025) SQL Server: Configure auto statistics [link]
[3] Microsoft Learn (2009) SQL Server: Statistics Used by the Query Optimizer in Microsoft SQL Server 2008 [link]

[4] Dmitri Korotkevitch (2016) Pro SQL Server Internals 2nd Ed.
[8] Microsoft (2014) Statistical maintenance functionality (autostats) in SQL Server (kb 195565)

[10] CSS SQL Server Engineers (2015) Does rebuild index update statistics? [link]

[16] Thomas Kejser's Database Blog (2011) The Ascending Key Problem in Fact Tables – Part one: Pain! by Thomas Kejser
[20] Kalen Delaney et al (2013) Microsoft SQL Server 2012 Internals, 2013

[22] MSDN (2016) Statistics [link]
[29] Tips, Tricks, and Advice from the SQL Server Query Optimization Team (2006) UPDATE STATISTICS undocumented options [link]
[31] Microsoft Learn (2025) SQL Server: sys.dm_db_stats_histogram (Transact-SQL) [link]

18 March 2025

🏭🗒️Microsoft Fabric: Statistics in Warehouse [Notes]

Disclaimer: This is work in progress intended to consolidate information from various sources for learning purposes. For the latest information please consult the documentation (see the links below)!

Last updated: 18-Mar-2024

[Microsoft Fabric] Statistics

{def} objects that contain relevant information about data, to allow query optimizer to estimate plans' costs [1]

critical for the warehouse and lakehouse SQL endpoint for executing queries quickly and efficiently [3]

when a query is executed, the engine tries to collect existing statistics for certain columns in the query and use that information to assist in choosing an optimal execution plan [3]
inaccurate statistics can lead to unoptimized query plans and execution times [5]

{type} user-defined statistics

statistics defined manually by the users via DDL statement [1]
users can create, update and drop statistic

via CREATE|UPDATE|DROP STATISTICS

users can review the contents of histogram-based single-column statistics [1]

via DBCC SHOW_STATISTICS

only, a limited version of these statements is supported [1]

{recommendation} focus on columns heavily used in query workloads

e.g. GROUP BYs, ORDER BYs, filters, and JOINs

{recommendation} consider updating column-level statistics regularly [1]

e.g. after data changes that significantly change rowcount or distribution of the data [1]

{type} automatic statistics

statistics created and maintained automatically by the query engine at query time [1]
when a query is issued and query optimizer requires statistics for plan exploration, MF automatically creates those statistics if they don't already exist [1]

then the query optimizer can utilize them in estimating the plan costs of the triggering query [1]
if the query engine determines that existing statistics relevant to query no longer accurately reflect the data, those statistics are automatically refreshed [1]

these automatic operations are done synchronously [1]

the query duration includes this time [1]

{object type} histogram statistics

created per column needing histogram statistics at query time [1]
contains histogram and density information regarding the distribution of a particular column [1]
similar to the statistics automatically created at query-time in Azure Synapse Analytics dedicated pools [1]
name begins with _WA_Sys_.
contents can be viewed with DBCC SHOW_STATISTICS

{object type} average column length statistics

created for variable character columns (varchar) greater than 100 needing average column length at query-time [1]
contain a value representing the average row size of the varchar column at the time of statistics creation [1]
name begins with ACE-AverageColumnLength_
contents cannot be viewed and are nonactionable by users [1]

{object type} table-based cardinality statistics

created per table needing cardinality estimation at query-time [1]
contain an estimate of the rowcount of a table [1]
named ACE-Cardinality [1]
contents cannot be viewed and are nonactionable by user [1]

[lakehouse] SQL analytics endpoint

uses the same engine as the warehouse to serve high performance, low latency SQL queries [4]
{feature} automatic metadata discovery

a seamless process reads the delta logs and from the files folder and ensures SQL metadata for tables is always up to date [4]

e.g. statistics [4]

{limitation} only single-column histogram statistics can be manually created and modified [1]
{limitation} multi-column statistics creation is not supported [1]
{limitation} other statistics objects might appear in sys.stats

besides the statistics created manually/automatically [1]

⇐ the objects are not used for query optimization [1]

{limitation} if a transaction has data insertion into an empty table and issues a SELECT before rolling back, the automatically generated statistics can still reflect the uncommitted data, causing inaccurate statistics [5]

{recommendation} update statistics for the columns mentioned in the SELECT [5]

{recommendation} ensure all table statistics are updated after large DML transactions [2]

07 February 2025

🌌🏭KQL Reloaded: First Steps (Part VII: Basic Data Visualizations)

One of the greatest aspects of KQL and its environment is that creating a chart is just one instruction away from the dataset generated in the process. Of course, the data still need to be in an appropriate form to be used as source for a visual, though the effort is minimal. Let's consider the example used in the previous post based ln the ContosoSales data, where the visualization part is everything that comes after "| render":

// visualizations by Country: various charts
NewSales
| where SalesAmount <> 0 and ProductCategoryName == 'TV and Video'
| where DateKey >=date(2023-02-01) and DateKey < datetime(2023-03-01)
| summarize count_customers = count_distinct(CustomerKey) by RegionCountryName
| order by count_customers desc
//| render table
//| render linechart
//| render areachart

//| render stackedchart
//| render columnchart
| render piechart
    with (xtitle="Country", ytitle="# Customers",
    title="# Customers by Country (pie chart)", legend=hidden)

Output:

# Customers by Country (various charts)

It's enough to use "render" with the chart type without specifying the additional information provided under "with", though the legend can facilitate data's understanding. Unfortunately, the available properties are relatively limited, at least for now.

Adding one more dimension is quite simple, even if the display may be sometimes confusing as there's no clear delimitation between the entities represented while the legend grows linearly with the number of points. It might be a good idea to use additional charts for the further dimensions in scope.

// visualizations by Region & Country: various charts
NewSales
| where SalesAmount <> 0 and ProductCategoryName == 'TV and Video'
| where DateKey >=date(2023-02-01) and DateKey < datetime(2023-03-01)
| summarize count_customers = count_distinct(CustomerKey) by ContinentName, RegionCountryName
| order by count_customers desc   
//| render stackedareachart 
//| render linechart 
//| render table 
//| render areachart 
//| render piechart
| render columnchart 
    with (xtitle="Region/Country", ytitle="# Customers",
    title="#Customers by Continent & Country", legend=hidden)

Output:

# Customers by Continent & Country (column chart)

Sometimes, it makes sense to reduce the number of values, recommendation that applies mainly to pie charts:

// visualizations by Zone: pie chart
NewSales
| where SalesAmount <> 0 and ProductCategoryName == 'TV and Video'
| where DateKey >=date(2023-02-01) and DateKey < datetime(2023-03-01)
| summarize count_customers = count_distinct(CustomerKey) by iif(RegionCountryName in ('United States', 'Canada'), RegionCountryName, 'Others')
| render piechart
    with (xtitle="Country", ytitle="Sales volume",
    title="Sales volume by Zone")

Output:

# Customers by Zone (pie chart)

Adding a second set of values (e.g. Total cost) allows to easily create a scatter chart:

// visualization by Occupation: scatter chart
NewSales
| where SalesAmount <> 0 and ProductCategoryName == 'TV and Video'
| where DateKey >=date(2023-02-01) and DateKey < datetime(2023-03-01)
| summarize count_customers = count_distinct(CustomerKey) 
    , TotalCost = sum(TotalCost) by Occupation
| order by count_customers desc
| render scatterchart 
    with (xtitle="# Customers", ytitle="Sales volume",
    title="# Customers vs Sales volume by Occupation", legend=visible )

Output:

# Customers vs Sales volume by Occupation (scatter chart)

The visualizations are pretty simple to build, though one shouldn't expect that one can build a visualization on top of any dataset, at least not without further formatting and eventually code changes. For example, considering the query from the previous post, with a small change one can use the data with a column chart, though this approach might have some limitation (e.g. it doesn't work pie charts):

// calculating percentages from totals: column chart
NewSales
| where SalesAmount <> 0 and ProductCategoryName == 'TV and Video'
//| where DateKey >=date(2023-02-01) and DateKey < datetime(2023-03-01)
| summarize count_customers = count_distinct(CustomerKey)
    , count_customers_US = count_distinctif(CustomerKey, RegionCountryName == 'United States')
    , count_customers_CA = count_distinctif(CustomerKey, RegionCountryName == 'Canada')
    , count_customers_other = count_distinctif(CustomerKey, not(RegionCountryName in ('United States', 'Canada')))
| project Charting = "Country"
    , US = count_customers_US
    , CA = count_customers_CA
    , other = count_customers_other
| render columnchart

    with (xtitle="Region", ytitle="# Customers",
    title="# Customers by Region")

Output:

# Customers by Region (column chart)

There are a few more visuals that will be considered in a next post. Despite the relatively limited set of visuals and properties, the visualizations are useful to get a sense of data's shape, and this with a minimum of changes. Ad-hoc visualizations can help also in data modeling, validating the logic and/or identifying issues in the data when creating the queries, which makes it a great feature.

Happy coding!

Previous Post <<||>> Next Post

06 February 2025

🌌🏭KQL Reloaded: First Steps (Part VI: Actual vs. Estimated Count)

More examples are available nowadays for developers, at least compared with 10-20 years ago when besides the scarce documentation, the blogs and source code from books were the only ways to find how a function or a piece of standard functionality works. Copying code without understanding it may lead to unexpected results, with all the consequences resulting from this.

A recent example in this direction in KQL are the dcount and dcountif functions, which according to the documentation calculates an estimate of the number of distinct values that are taken by a scalar expression in the summary group. An estimate is not the actual number of records, trading performance for accuracy. The best example are the following pieces of code:

// counting records 
NewSales
| summarize record_count = count() // values availanle 
    , aprox_distinct_count = dcount(CustomerKey) // estimated disting values
    , distinct_count = count_distinct(CustomerKey) // actual number of records
    , aprox_distict_count_by_value  = dcountif(CustomerKey, SalesAmount <> 0) //estimated count of records with not null amounts
    , distict_count_by_value  = count_distinctif(CustomerKey, SalesAmount <> 0) // count of records with not null amounts
    , aprox_distict_count_by_value2  = dcountif(CustomerKey, SalesAmount == 0) //estimated count of records with null amounts
    , distict_count_by_value2  = count_distinctif(CustomerKey, SalesAmount == 0) // count of records with not amounts
| extend error_aprox_distinct_count = distinct_count - aprox_distinct_count
    , error_aprox_distict_count_by_value = distict_count_by_value - aprox_distict_count_by_value

Output:

record_count	aprox_distinct_count	distinct_count	aprox_distict_count_by_value	distict_count_by_value	aprox_distict_count_by_value2	distict_count_by_value2
2832193	18497	18484	18497	18484	10251	10219
error_aprox_distinct_count		error_aprox_distict_count_by_value
-13		-13

It's interesting that the same difference is observable also when a narrower time interval is chosen (e.g. 1 month). When using estimate it's important to understand also how big is the error between the actual value and the estimate, and that's the purpose of the last two lines added to the query. In many scenarios the difference might be neglectable until is not.

One can wonder whether the two functions are deterministic, in other words whether they return the same results if given the same input values. It would be also useful to understand what's the performance of the two estimative functions especially when further constraints are applied.

Moreover, the functions accept a third parameter which allows control over the trade between speed and accuracy (see provided table).

// counting records 
NewSales
| summarize record_count = count() // values availanle 
    , distinct_count = count_distinct(CustomerKey) // actual number of records
    , aprox_distinct_count = dcount(CustomerKey) // estimated disting values (default)
    , aprox_distinct_count0 = dcount(CustomerKey, 0) // 0-based accuracy
    , aprox_distinct_count1 = dcount(CustomerKey, 1) // 1-based accuracy (default)
    , aprox_distinct_count2 = dcount(CustomerKey, 2) // 2-based accuracy
    , aprox_distinct_count3 = dcount(CustomerKey, 3) // 3-based accuracy
    , aprox_distinct_count4 = dcount(CustomerKey, 4) // 4-based accuracy

Output:

record_count	distinct_count	aprox_distinct_count	aprox_distinct_count0	aprox_distinct_count1	aprox_distinct_count2	aprox_distinct_count3	aprox_distinct_count4
2832193	18484	18497	18793	18497	18500	18470	18487

It will be interesting to see which one of these parameters are used in practice. The problems usually start when different approximation parameters are used alternatively with no previous agreement. How could one argument in the favor of one parameter over the others?

A natural question: how big will be the error introduced by each parameter? Usually, when approximating values, one needs to specify also the expected error somehow. The documentation provide some guiding value, though are these values enough? Do similar estimate functions make sense also for the other aggregate functions?

In exchange, the count_distinct and count_distinctif seem to be still in preview, with all the consequences derived from this. They are supposed to be more resource-intensive than the estimative counterparts. Conversely, the values returned can be still rounded in dashboards up to the meaningful unit (e.g. thousands), and this usually depends on the context. The question whether the values can be rounded can be put also in the context and the estimative counterparts. It would be interesting to check how far away are the rounded values from each other in the context of the two sets of functions.

In practice, counting is useful for calculating percentages (e.g. how many customers come from a certain zone compared to the total), which are more useful and easier to grasp than big numbers:

// calculating percentages from totals
NewSales
| where SalesAmount <> 0 and ProductCategoryName == 'TV and Video'
| where DateKey >=date(2023-02-01) and DateKey < datetime(2023-03-01)
| summarize count_customers = count_distinct(CustomerKey)
    , count_customers_US = count_distinctif(CustomerKey, RegionCountryName == 'United States')
    , count_customers_CA = count_distinctif(CustomerKey, RegionCountryName == 'Canada')
    , count_customers_other = count_distinctif(CustomerKey, not(RegionCountryName in ('United States', 'Canada')))
| extend percent_customers_US = iif(count_customers<>0, round(100.00 * count_customers_US/count_customers, 2), 0.00)
    , percent_customers_CA = iif(count_customers<>0, round(100.00 * count_customers_CA/count_customers, 2), 0.00)
    , percent_customers_other = iif(count_customers<>0, round(100.00 * count_customers_other/count_customers,2), 0.00)

Output:

count_customers	count_customers_US	count_customers_CA	count_customers_other	percent_customers_US	percent_customers_CA	percent_customers_other
10317	3912	789	5616	37.92	7.65	54.43

Note:
When showing percentages it's important to provide also the "context", the actual count or amount. This allows to understand the scale associated with the percentages.

Happy coding!

Previous Post <<||>> Next Post

Resources:
[R1] Arcane Code (2022) Fun With KQL – DCount, by R C Cain [link]

[R2] M Morowczynsk et al (2024) "The Definitive Guide to KQL" [sample]
[R3] M Zorich (2022) Too much noise in your data? Summarize it! [link]

SQL Troubles

Pages

14 August 2025

🖍️Steve McKillup - Collected Quotes

24 April 2025

🧭Business Intelligence: Perspectives (Part 30: The Data Science Connection)

22 March 2025

💠🛠️🗒️SQL Server: Statistics Issues [Notes]

19 March 2025

💠🛠️🗒️SQL Server: Statistics [Notes]

18 March 2025

🏭🗒️Microsoft Fabric: Statistics in Warehouse [Notes]

07 February 2025

🌌🏭KQL Reloaded: First Steps (Part VII: Basic Data Visualizations)

06 February 2025

🌌🏭KQL Reloaded: First Steps (Part VI: Actual vs. Estimated Count)

About Me