03 January 2025

🔭Data Science: Samples (Just the Quotes)

"Little experience is sufficient to show that the traditional machinery of statistical processes is wholly unsuited to the needs of practical research. Not only does it take a cannon to shoot a sparrow, but it misses the sparrow! The elaborate mechanism built on the theory of infinitely large samples is not accurate enough for simple laboratory data. Only by systematically tackling small sample problems on their metrics does it seem possible to apply accurate tests to practical data." (Ronald A Fisher, "Statistical Methods for Research Workers", 1925)

"The postulate of randomness thus resolves itself into the question, ‘of what population is this a random sample?’ which must frequently be asked by every practical statistician." (Ronald  A Fisher, "On the Mathematical Foundation of Theoretical Statistics", Philosophical Transactions of the Royal Society of London Vol. A222, 1922) 

"Null hypotheses of no difference are usually known to be false before the data are collected [...] when they are, their rejection or acceptance simply reflects the size of the sample and the power of the test, and is not a contribution to science." (I Richard Savage, "Nonparametric Statistics", Journal of the American Statistical Association 52, 1957)

"Assumptions that we make, such as those concerning the form of the population sampled, are always untrue." (David R Cox, "Some problems connected with statistical inference", Annals of Mathematical Statistics 29, 1958)

"[...] a priori reasons for believing that the null hypothesis is generally false anyway. One of the common experiences of research workers is the very high frequency with which significant results are obtained with large samples." (David Bakan, "The test of significance in psychological research", Psychological Bulletin 66, 1966)

"People have erroneous intuitions about the laws of chance. In particular, they regard a sample randomly drawn from a population as highly representative, that is, similar to the population in all essential characteristics. The prevalence of the belief and its unfortunate consequences for psychological research are illustrated by the responses of professional psychologists to a questionnaire concerning research decisions." (Amos Tversky & Daniel Kahneman, "Belief in the law of small numbers", Psychological Bulletin 76(2), 1971)

"[...] too many users of the analysis of variance seem to regard the reaching of a mediocre level of significance as more important than any descriptive specification of the underlying averages Our thesis is that people have strong intuitions about random sampling; that these intuitions are wrong in fundamental respects; that these intuitions are shared by naive subjects and by trained scientists; and that they are applied with unfortunate consequences in the course of scientific inquiry. We submit that people view a sample randomly drawn from a population as highly representative, that is, similar to the population in all essential characteristics. Consequently, they expect any two samples drawn from a particular population to be more similar to one another and to the population than sampling theory predicts, at least for small samples." (Amos Tversky & Daniel Kahneman, "Belief in the law of small numbers", Psychological Bulletin 76(2), 1971) 

"It would help if the standard statistical programs did not generate t statistics in such profusion. The programs might be written to ask, 'Do you really have a probability sample?', 'By what standard would you judge a fitted coefficient large or small?' Or perhaps they could merely say, printed in bold capitals beside each equation, 'So What Else Is New?'" (Donald M McCloskey, "The Loss Function Has Been Mislaid: The Rhetoric of Significance Tests", American Economic Review Vol. 75, 1985)

"Since a point hypothesis is not to be expected in practice to be exactly true, but only approximate, a proper test of significance should almost always show significance for large enough samples. So the whole game of testing point hypotheses, power analysis notwithstanding, is but a mathematical game without empirical importance." (Louis Guttman, "The illogic of statistical inference for cumulative science", Applied Stochastic Models and Data Analysis, 1985)

"A little thought reveals a fact widely understood among statisticians: The null hypothesis, taken literally (and that’s the only way you can take it in formal hypothesis testing), is always false in the real world[...]. If it is false, even to a tiny degree, it must be the case that a large enough sample will produce a significant result and lead to its rejection. So if the null hypothesis is always false, what’s the big deal about rejecting it." (Jacob Cohen, "Things I have learned (so far)", American Psychologist 45, 1990)

"Unfortunately, when applied in a cook-book fashion, such significance tests do not extract the maximum amount of information available from the data. Worse still, misleading conclusions can be drawn. There are at least three problems: (1) a conclusion that there is a significant difference can often be reached merely by collecting enough samples; (2) a statistically significant result is not necessarily practically significant; and (3) reports of the presence or absence of significant differences for multiple tests are not comparable unless identical sample sizes are used." (Graham B McBride et al, "What do significance tests really tell us about the environment?", Environmental Management 17, 1993)

"Statistical hypothesis testing is commonly used inappropriately to analyze data, determine causality, and make decisions about significance in ecological risk assessment,[...] It discourages good toxicity testing and field studies, it provides less protection to ecosystems or their components that are difficult to sample or replicate, and it provides less protection when more treatments or responses are used. It provides a poor basis for decision-making because it does not generate a conclusion of no effect, it does not indicate the nature or magnitude of effects, it does address effects at untested exposure levels, and it confounds effects and uncertainty[...]. Risk assessors should focus on analyzing the relationship between exposure and effects[...]."  (Glenn W Suter, "Abuse of hypothesis testing statistics in ecological risk assessment", Human and Ecological Risk Assessment 2, 1996)

"The standard error of most statistics is proportional to 1 over the square root of the sample size. God did this, and there is nothing we can do to change it." (Howard Wainer, "Improving Tabular Displays, With NAEP Tables as Examples and Inspirations", Journal of Educational and Behavioral Statistics Vol 22 (1), 1997)

"It is not always convenient to remember that the right model for a population can fit a sample of data worse than a wrong model - even a wrong model with fewer parameters. We cannot rely on statistical diagnostics to save us, especially with small samples. We must think about what our models mean, regardless of fit, or we will promulgate nonsense." (Leland Wilkinson, "The Grammar of Graphics" 2nd Ed., 2005)

"It’s a commonplace among statisticians that a chi-squared test (and, really, any p-value) can be viewed as a crude measure of sample size: When sample size is small, it’s very difficult to get a rejection (that is, a p-value below 0.05), whereas when sample size is huge, just about anything will bag you a rejection. With large n, a smaller signal can be found amid the noise. In general: small n, unlikely to get small p-values. Large n, likely to find something. Huge n, almost certain to find lots of small p-values." (Andrew Gelman, "The sample size is huge, so a p-value of 0.007 is not that impressive", 2009)

"Why are you testing your data for normality? For large sample sizes the normality tests often give a meaningful answer to a meaningless question (for small samples they give a meaningless answer to a meaningful question)." (Greg Snow, "R-Help", 2014)

"The Dirty Data Theorem states that 'real world' data tends to come from bizarre and unspecifiable distributions of highly correlated variables and have unequal sample sizes, missing data points, non-independent observations, and an indeterminate number of inaccurately recorded values." (Anon, Statistically Speaking)

"The old rule of trusting the Central Limit Theorem if the sample size is larger than 30 is just that–old. Bootstrap and permutation testing let us more easily do inferences for a wider variety of statistics." (Tim Hesterberg)

"While the main emphasis in the development of power analysis has been to provide methods for assessing and increasing power, it should also be noted that it is possible to have too much power. If your sample is too large, nearly any difference, no matter how small or meaningless from a practical standpoint, will be ‘statistically significant’." (Clay Helberg) 

📉Graphical Representation: Power (Just the Quotes)

"But law is no explanation of anything; law is simply a generalization, a category of facts. Law is neither a cause, nor a reason, nor a power, nor a coercive force. It is nothing but a general formula, a statistical table." (Florence Nightingale, "Suggestions for Thought", 1860)

"The wandering of a line is more powerful in its effect on the mind than a tabulated statement; it shows what is happening and what is likely to take place just as quickly as the eye is capable of working." (A Lester Boddington, "Statistics And Their Application To Commerce", 1921)

"Seeing color isn't always as simple as it may seem. Some colors are not easy to see unless the conditions are just right; some are so easy to see that they overpower everything else; some are easy to see but difficult to distinguish. […] Large masses of color become too visible and easily overwhelm the entire chart. The more visible the color the easier it is to use too much of it." (Kenneth W Haemer, "Color in Chart Presentation", The American Statistician Vol. 4" (2) , 1950)

"To see is to reason. Thus, the use of visual forms of communication has great potential for influencing what a person thinks. Graphic presentation is always much more than a way to present just facts or information. Rather, it is a way to influence thought, and, as such, graphics can be a powerful mode of persuasion." (Robert Lefferts, "Elements of Graphics: How to prepare charts and graphs for effective reports", 1981)

"An especially effective device for enhancing the explanatory power of time-series displays is to add spatial dimensions to the design of the graphic, so that the data are moving over space (in two or three dimensions) as well as over time. […] Occasionally graphics are belligerently multivariate, advertising the technique rather than the data." (Edward R Tufte, "The Visual Display of Quantitative Information", 1983)

"Modern data graphics can do much more than simply substitute for small statistical tables. At their best, graphics are instruments for reasoning about quantitative information. Often the most effective way to describe, explore, and summarize a set of numbers even a very large set - is to look at pictures of those numbers. Furthermore, of all methods for analyzing and communicating statistical information, well-designed data graphics are usually the simplest and at the same time the most powerful." (Edward R Tufte, "The Visual Display of Quantitative Information", 1983)

"The logarithm is an extremely powerful and useful tool for graphical data presentation. One reason is that logarithms turn ratios into differences, and for many sets of data, it is natural to think in terms of ratios. […] Another reason for the power of logarithms is resolution. Data that are amounts or counts are often very skewed to the right; on graphs of such data, there are a few large values that take up most of the scale and the majority of the points are squashed into a small region of the scale with no resolution." (William S. Cleveland, "Graphical Methods for Data Presentation: Full Scale Breaks, Dot Charts, and Multibased Logging", The American Statistician Vol. 38" (4) 1984)

"Confusion and clutter are failures of design, not attributes of information. And so the point is to find design strategies that reveal detail and complexity - rather than to fault the data for an excess of complication. Or, worse, to fault viewers for a lack of understanding. Among the most powerful devices for reducing noise and enriching the content of displays is the technique of layering and separation, visually stratifying various aspects of the data." (Edward R Tufte, "Envisioning Information", 1990)

"Sometimes, when visualization thoroughly reveals the structure of a set of data, there is a tendency to underrate the power of the method for the application. Little effort is expended in seeing the structure once the right visualization method is used, so we are mislead into thinking nothing exciting has occurred." (William S Cleveland, "Visualizing Data", 1993)

"Dashboards aren't all that different from some of the other means of presenting information, but when properly designed the single-screen display of integrated and finely tuned data can deliver insight in an especially powerful way." (Richard Brath & Michael Peters, "Dashboard Design: Why Design is Important," DM Direct, 2004)

"Exploratory Data Analysis is more than just a collection of data-analysis techniques; it provides a philosophy of how to dissect a data set. It stresses the power of visualisation and aspects such as what to look for, how to look for it and how to interpret the information it contains. Most EDA techniques are graphical in nature, because the main aim of EDA is to explore data in an open-minded way. Using graphics, rather than calculations, keeps open possibilities of spotting interesting patterns or anomalies that would not be apparent with a calculation" (where assumptions and decisions about the nature of the data tend to be made in advance)." (Alan Graham, "Developing Thinking in Statistics", 2006)

"Most dashboards fail to communicate efficiently and effectively, not because of inadequate technology (at least not primarily), but because of poorly designed implementations. No matter how great the technology, a dashboard's success as a medium of communication is a product of design, a result of a display that speaks clearly and immediately. Dashboards can tap into the tremendous power of visual perception to communicate, but only if those who implement them understand visual perception and apply that understanding through design principles and practices that are aligned with the way people see and think." (Stephen Few, "Information Dashboard Design", 2006)

"Arbitrary category sequence and misplaced pie chart emphasis lead to general confusion and weaken messages. Although this can be used for quite deliberate and targeted deceit, manipulation of the category axis only really comes into its own with techniques that bend the relationship between the data and the optics in a more calculated way. Many of these techniques are just twins of similar ruses on the value axis. but are none the less powerful for that." (Nicholas Strange, "Smoke and Mirrors: How to bend facts and figures to your advantage", 2007)

"Design has the power to enrich our lives by engaging our emotions through image, form, texture, color, sound, and smell. The intrinsically human-centered nature of design thinking points to the next step: we can use our empathy and understanding of people to design experiences that create opportunities for active engagement and participation." (Tim Brown, "Change by Design: How Design Thinking Transforms Organizations and Inspires Innovation", 2009)

"Histograms are powerful in cases where meaningful class breaks can be defined and classes are used to select intervals and groups in the data. However, they often perform poorly when it comes to the visualization of a distribution." (Martin Theus & Simon Urbanek, "Interactive Graphics for Data Analysis: Principles and Examples", 2009)

"Colour is a very powerful way to draw attention to specific portions of the design. Colour evokes feelings and emotions, making it an essential component in branding." (Brian Suda, "A Practical Guide to Designing with Data", 2010)

"Given the important role that correlation plays in structural equation modeling, we need to understand the factors that affect establishing relationships among multivariable data points. The key factors are the level of measurement, restriction of range in data values" (variability, skewness, kurtosis), missing data, nonlinearity, outliers, correction for attenuation, and issues related to sampling variation, confidence intervals, effect size, significance, sample size, and power." (Randall E Schumacker & Richard G Lomax, "A Beginner’s Guide to Structural Equation Modeling" 3rd Ed., 2010)

"An infographic (short for information graphic) is a type of picture that blends data with design, helping individuals and organizations concisely communicate messages to their audience." (Mark Smiciklas, "The Power of Infographics: Using Pictures to Communicate and Connect with Your Audiences", 2012)

"Color can modify - and possibly even contradict - our intuitive response to value, because of its own powerful connotations." (Joel Katz, "Designing Information: Human factors and common sense in information design", 2012)

"The simplicity of the process behavior chart can be deceptive. This is because the simplicity of the charts is based on a completely different concept of data analysis than that which is used for the analysis of experimental data. When someone does not understand the conceptual basis for process behavior charts they are likely to view the simplicity of the charts as something that needs to be fixed.  Out of these urges to fix the charts all kinds of myths have sprung up resulting in various levels of complexity and obstacles to the use of one of the most powerful analysis techniques ever invented." (Donald J Wheeler, "Myths About Data Analysis", International Lean & Six Sigma Conference, 2012)

"A great infographic leads readers on a visual journey, telling them a story along the way. Powerful infographics are able to capture people’s attention in the first few seconds with a strong title and visual image, and then reel them in to digest the entire message. Infographics have become an effective way to speak for the creator, conveying information and image simultaneously." (Justin Beegel, "Infographics For Dummies", 2014)

"Statistical characterization of datasets is a very powerful approach, but it has the intrinsic limitation of losing information through summarization. " (Tamara Munzner, "Visualization Analysis and Design", 2014)

"The most powerful depth cue is occlusion, where some objects can not be seen because they are hidden behind others. The visible objects are interpreted as being closer than the occluded ones. The occlusion relationships between objects change as we move around; this motion parallax allows us to build up an understanding of the relative distances between objects in the world. " (Tamara Munzner, "Visualization Analysis and Design", 2014)

"The power of infographics is that it keeps things short and sweet. [...] The concise, pointed nature of an infographic works well with the fact that people’s attention span is decreasing significantly." (Justin Beegel, "Infographics For Dummies", 2014)

"Maps also have the disadvantage that they consume the most powerful encoding channels in the visualization toolbox - position and size - on an aspect that is held constant. This leaves less effective encoding channels like color for showing the dimension of interest." (Danyel Fisher & Miriah Meyer, "Making Data Visual", 2018)

"Analysis is a two-step process that has an exploratory and an explanatory phase. In order to create a powerful data story, you must effectively transition from data discovery" (when you’re finding insights) to data communication" (when you’re explaining them to an audience). If you don’t properly traverse these two phases, you may end up with something that resembles a data story but doesn’t have the same effect. Yes, it may have numbers, charts, and annotations, but because it’s poorly formed, it won’t achieve the same results." (Brent Dykes, "Effective Data Storytelling: How to Drive Change with Data, Narrative and Visuals", 2019)

"Aligning on data ink can be a powerful way to build relationships across charts. It can be used to obscure the lines between charts, making the composition feel more seamless. [....] Alignment paradigms can also influence the layout design needed. [...] The layout added to the alignment further supports this relationship." (Vidya Setlur & Bridget Cogley, "Functional Aesthetics for data visualization", 2022)

"When the colors are dull and neutral, they can communicate a sense of uniformity and an aura of calmness. Grays do a great job of mapping out the context of your story so that the more sharp colors highlight what you’re trying to explain. The power of gray comes in handy for all of our supporting details such as the axis, gridlines, and nonessential data that is included for comparative purposes. By using gray as the primary color in a visualization, we automatically draw our viewers’ eyes to whatever isn’t gray. That way, if we are interested in telling a story about one data point, we can do so quite easily. " (Kate Strachnyi, "ColorWise: A Data Storyteller’s Guide to the Intentional Use of Color", 2023)

"Graphic design is not just about making things look good. It is a powerful combination of form and function that uses visual elements to communicate a message. Form refers to the physical appearance of a design, such as its shape, color, and typography. Function refers to the purpose of a design, such as what it is trying to communicate or achieve. A good graphic design is both visually appealing and functional. It uses the right combination of form and function to communicate its message effectively. Graphic design is also a strategic and thoughtful craft. It requires careful planning and execution to create a design that is both effective and aesthetically pleasing." (Faith Aderemi, "The Essential Graphic Design Handbook", 2024)

📉Graphical Representation: Experience (Just the Quotes)

"It is not possible to lay down any hard and fast rules for determining what chart is the best for any given problem. Ordinarily that one is the best which will produce the quickest and clearest results. but unfortunately it is not always possible to construct the clearest one in the least time. Experience is the best guide. Generally speaking, a rectilinear chart is best adapted for equations of the first degree, logarithmic for those other than the first degree and not containing over two variables, and alignment charts where there are three or more variables. However, nearly every person becomes more or less familiar with one type of chart and prefers to adhere to the use of that type because he does not care to take the time and trouble to find out how to use the others. It is best to know what the possibilities of all types are and to be governed accordingly when selecting one or the other for presenting or working out certain data." (Allan C Haskell, "How to Make and Use Graphic Charts", 1919)

"The number of grid lines should be kept to a minimum. This means that there should be just enough coordinate lines in the field so that the eye can readily interpret the values at any point on the curve. No definite rule can be specified as to the optimum number of lines in a grid. This must be left to the discretion of the chart-maker and can come only from experience. The size of the chart, the type and range of the data. the number of curves, the length and detail of the period covered, as well as other factors, will help to determine the number of grid lines." (Calvin F Schmid, "Handbook of Graphic Presentation", 1954)

"The art of using the language of figures correctly is not to be over-impressed by the apparent air of accuracy, and yet to be able to take account of error and inaccuracy in such a way as to know when, and when not, to use the figures. This is a matter of skill, judgment, and experience, and there are no rules and short cuts in acquiring this expertness." (Ely Devons, "Essays in Economics", 1961)

"Nearly all those who produce graphics for mass publication are trained exclusively in the fine arts and have had little experience with the analysis of data. Such experiences are essential for achieving precision and grace in the presence of statistics. [...] Those who get ahead are those who beautified data, never mind statistical integrity." (Edward R Tufte, "The Visual Display of Quantitative Information", 1983)

"Design has the power to enrich our lives by engaging our emotions through image, form, texture, color, sound, and smell. The intrinsically human-centered nature of design thinking points to the next step: we can use our empathy and understanding of people to design experiences that create opportunities for active engagement and participation." (Tim Brown, "Change by Design: How Design Thinking Transforms Organizations and Inspires Innovation", 2009)

"Done well, annotation can help explain and facilitate the viewing and interpretive experience. It is the challenge of creating a layer of user assistance and user insight: how can you maximize the clarity and value of engaging with this visualization design?" (Andy Kirk, "Data Visualization: A successful design process", 2012)

"Geographic maps have the advantage of being true to scale - great for walking. Diagrams have the advantage of being easily imaged and remembered, often true to a non-pedestrian experience, and the ability to open up congestion, reduce empty space, and use real estate efficiently. Hybrids 'mapograms' ? - often have the disadvantages of both map and diagram with none of the corresponding advantages." (Joel Katz, "Designing Information: Human factors and common sense in information design", 2012)

"Visual metaphors are about integrating a certain visual quality in your work that somehow conveys that extra bit of connection between the data, the design, and the topic. It goes beyond just the choice of visual variable, though this will have a strong influence. Deploying the best visual metaphor is something that really requires a strong design instinct and a certain amount of experience." (Andy Kirk, "Data Visualization: A successful design process", 2012)

"[...] communicating with data is less often about telling a specific story and more like starting a guided conversation. It is a dialogue with the audience rather than a monologue. While some data presentations may share the linear approach of a traditional story, other data products" (analytical tools, in particular) give audiences the flexibility for exploration. In our experience, the best data products combine a little of both: a clear sense of direction defined by the author with the ability for audiences to focus on the information that is most relevant to them. The attributes of the traditional story approach combined with the self-exploration approach leads to the guided safari analogy." (Zach Gemignani et al, "Data Fluency", 2014)

"People see bar charts and line charts and pie charts all the time, and those charts are often boring. Boring graphs are forgettable. Different shapes and uncommon forms that move beyond the borders of our typical data visualization experience can draw readers in." (Jonathan Schwabish, "Better Data Visualizations: A guide for scholars, researchers, and wonks", 2021)

"A well-designed dashboard needs to provide a similar experience; information cannot be placed just anywhere on the dashboard. Charts that relate to one another are usually positioned close to one another. Important charts often appear larger and more visually prominent than less important ones. In other words, there are natural sizes for how a dashboard comprises charts based on the task and context." (Vidya Setlur & Bridget Cogley, "Functional Aesthetics for data visualization", 2022)

"Maps are a type of chart that can convey relationships about space and relationships between objects that we relate to in the real world. Their effectiveness as a communication medium is strongly influenced by a host of factors: the nature of spatial data, the form and structure of representation, their intended purpose, the experience of the audience, and the context in the time and space in which the map is viewed. In other words, maps are a ubiquitous representation of spatial information that we can understand and relate to." (Vidya Setlur & Bridget Cogley, "Functional Aesthetics for data visualization", 2022)

"Using symbols is one common way of applying semantics to help make sense of the world. Symbols provide clues to understanding experiences by conveying recognizable meanings that are shared by societies." (Vidya Setlur & Bridget Cogley, "Functional Aesthetics for data visualization", 2022)

"When deeply complex charts work, we find them effective and beautiful, just as we find a symphony beautiful, which is another marvelously complex arrangement of millions of data points that we experience as a coherent whole." (Scott Berinato, "Good Charts : the HBR guide to making smarter, more persuasive data visualizations", 2023)


01 January 2025

💎🏭SQL Reloaded: Microsoft Fabric's SQL Databases (Part IV: Automatic Tuning Options) [new feature]

Automatic tuning in SQL databases, respectively in Azure SQL databases, is a fully managed performance service that uses built-in intelligence to continuously monitor the queries executed and automatically improve their performance [1]. At least in SQL databases the target is to fully automate index maintenance by providing automated tuning, verification of performance gains, rollback and self-correction, respectively a tuning history. The future sounds promising, though the question is what's already available.

The documentation references several objects already made available for this feature. sys.database_automatic_tuning_options returns the tuning options available for the database:

-- SQL databases - automatic tuning options
SELECT ATO.name
, ATO.desired_state
, ATO.desired_state_desc
, ATO.actual_state
, ATO.actual_state_desc
, ATO.reason
, ATO.reason_desc
FROM sys.database_automatic_tuning_options ATO
ORDER BY ATO.name
Output:
name desired_state desired_state_desc actual_state actual_state_desc reason reason_desc
CREATE_INDEX 2 DEFAULT 1 ON 3 INHERITED_FROM_SERVER
DROP_INDEX 2 DEFAULT 1 ON 3 INHERITED_FROM_SERVER
FORCE_LAST_GOOD_PLAN 2 DEFAULT 1 ON 3 INHERITED_FROM_SERVER
MAINTAIN_INDEX 2 DEFAULT 0 OFF 3 INHERITED_FROM_SERVER

For further information see the automatic tuning options in the documentation [2], respectively [1] for an overview of the feature. 

There are also a sys.database_automatic_tuning_mode and a sys.database_automatic_tuning_configurations, though on the SQL database instance tested only the first returns records.

-- SQL databases - automatic tuning mode
SELECT ATM.desired_state
, ATM.desired_state_desc
, ATM.actual_state
, ATM.actual_state_desc
FROM sys.database_automatic_tuning_mode ATM
Output:
desired_state desired_state_desc actual_state actual_state_desc
1 INHERIT 3 AUTO
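
The second object can be queried directly to confirm this. A minimal check (as mentioned above, it returned no rows on the instance tested):

-- SQL databases - automatic tuning configurations (no records on this instance)
SELECT *
FROM sys.database_automatic_tuning_configurations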

Attempting to modify the above settings at database level via ALTER DATABASE leads to the following error message:

-- disabling database properties
ALTER DATABASE [AdventureWorks01]
SET AUTOMATIC_TUNING ( FORCE_LAST_GOOD_PLAN = OFF);
Output:
"Msg 16202, Level 16, State 162, Line 1, Keyword or statement option 'SET' is not supported on the 'Microsoft Fabric' platform."

At least for the moment there seems to be no functionality available to change these settings.
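
For comparison, in Azure SQL databases the options can be changed via the AUTOMATIC_TUNING clause of ALTER DATABASE (see [3], [4]). A minimal sketch of that syntax, which on Fabric is expected to fail with the error shown above:

-- Azure SQL databases - changing the automatic tuning options (not supported on Fabric, see above)
ALTER DATABASE CURRENT
SET AUTOMATIC_TUNING (CREATE_INDEX = ON, DROP_INDEX = OFF, FORCE_LAST_GOOD_PLAN = ON);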

The Automatic indexes tab from the Performance Dashboard for SQL database shows the history and status of automatically created indexes [5]. See Home toolbar in the Query Editor window >> Performance summary >> Automatic indexes.

What's not clear is how the database engine balances between index coverage and performance. How long does it take until the engine identifies a missing-index scenario and creates the index, respectively between the drop of an unused index and the recurrence of the same scenario that led to the index's creation? Moreover, what happens with index fragmentation? The documentation doesn't seem to provide the answers that probably only hands-on experience can provide. Even if AI-based features are used in index maintenance, it's still hard to grasp what lies behind the various features.
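
At least fragmentation can be monitored, for example via the sys.dm_db_index_physical_stats function. A minimal sketch, assuming the function behaves in Fabric's SQL databases as in SQL Server (still to be verified):

-- sys metadata - index fragmentation overview
SELECT OBJECT_SCHEMA_NAME(IPS.object_id) schema_name
, OBJECT_NAME(IPS.object_id) table_name
, IND.name index_name
, IPS.index_type_desc
, IPS.avg_fragmentation_in_percent
, IPS.page_count
FROM sys.dm_db_index_physical_stats(DB_ID(), NULL, NULL, NULL, 'LIMITED') IPS
     JOIN sys.indexes IND
       ON IPS.object_id = IND.object_id
      AND IPS.index_id = IND.index_id
WHERE IPS.page_count > 0
ORDER BY IPS.avg_fragmentation_in_percent DESC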

Are DBAs comfortable enough to relinquish control over index maintenance? It will be interesting to see their feedback. Probably, more control over what the engine does is needed, as sometimes it's enough to have 2-3 major exceptions for a solution to become infeasible. Usually the devil lies in the details.
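
Some insight into what the engine recommends and applies is available in Azure SQL databases via sys.dm_db_tuning_recommendations; whether the view is exposed in Fabric's SQL databases is still to be tested. A minimal sketch:

-- automatic tuning recommendations (Azure SQL databases; availability in Fabric to be verified)
SELECT TRE.name
, TRE.type
, TRE.reason
, TRE.state
, TRE.score
, TRE.details
FROM sys.dm_db_tuning_recommendations TRE
ORDER BY TRE.name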

Migrating from SQL Server or Azure SQL to SQL databases requires some degree of reengineering, probably with more effort and redesign for the first scenario, given the functionality gap. 

Taking a look at the indexes already available in the AdventureWorks database, there seem to be no new indexes created since the database's creation (see the definition of the referenced object):

-- sys metadata - auto-created indexes
SELECT IND.db_name
, IND.schema_name
, IND.table_name
, IND.index_name
, IND.index_type
, IND.principal_type
, IND.auto_created
FROM meta.vIndexes IND
WHERE IND.schema_name = 'SalesLT'
  AND IND.auto_created = 1
ORDER BY IND.table_name
, IND.index_name

Stay tuned! More tests on the way... 

Happy coding! 

References:
[1] Microsoft Learn (2024) Automatic tuning [link]
[2] Microsoft Learn (2024) Performance Dashboard for SQL database in Microsoft Fabric [link]
[3] Microsoft Learn (2024) ALTER DATABASE SET options (Transact-SQL) [link]
[4] Microsoft Learn (2024) Enable automatic tuning in the Azure portal to monitor queries and improve workload performance [link]

💎🏭SQL Reloaded: SQL Server Metadata (Part I: Indexes Overview)

There are scenarios in which it's useful to gather information about the available indexes and their definition as a primary step for troubleshooting or index maintenance. Moreover, it's useful to take a baseline of the defined indexes and update it accordingly when indexes change. A minimum of information can be gathered in Excel files or similar repositories, though metadata tools can also be used for the same purpose, at least when they're easy to use and the associated costs aren't negligible.

Usually, there are two levels at which the information is needed - at index, respectively at column level. Sometimes it's useful to have an independent query for each level of detail, though in data warehouses and similar use cases it's useful to provide a model on top of the metadata, at least to improve the queries' maintainability.

The first view encapsulates the logic needed to export the data and it's based on the sys.indexes, sys.objects, sys.schemas and sys.database_principals system objects. 

-- create schema for metadata
CREATE SCHEMA meta;

-- clean after
DROP VIEW IF EXISTS meta.vIndexes

-- create views
CREATE OR ALTER VIEW meta.vIndexes
AS
-- sys metadata - indexes
SELECT DB_ID() db_id
, DB_NAME() db_name
, OBJ.schema_id
, IND.object_id 
, IND.index_id 
, SCH.name schema_name
, OBJ.name table_name
, IND.name index_name
, IND.type_desc index_type
, IND.is_primary_key
, IND.is_unique_constraint
, IND.fill_factor
, IND.has_filter
, IND.auto_created
, IND.is_unique is_unique_index
, DBP.type principal_type
, DBP.type_desc principal_type_desc
--, INC.*
FROM sys.indexes IND WITH (NOLOCK)
     JOIN sys.objects OBJ WITH (NOLOCK)
       ON IND.object_id = OBJ.object_id
          JOIN sys.schemas SCH WITH (NOLOCK)
            ON OBJ.schema_id = SCH.schema_id
               JOIN sys.database_principals DBP
                 ON SCH.principal_id = DBP.principal_id 
WHERE DBP.type IN ('S', 'E')

Depending on the case, the needed information can be exported via a query like the one below:

-- sys metadata - indexes
SELECT IND.db_name
, IND.schema_name
, IND.table_name
, IND.index_name
, IND.index_type
, IND.principal_type
FROM meta.vIndexes IND
WHERE IND.schema_name = 'SalesLT'
  AND IND.table_name IN ('Address', 'Customer')
ORDER BY IND.table_name
, IND.index_name

Output:

db_name schema_name table_name index_name index_type principal_type
AdventureWorks01... SalesLT Address AK_Address_rowguid NONCLUSTERED S
AdventureWorks01... SalesLT Address IX_Address_AddressLine1_AddressLine2_City_StateProvince_PostalCode_CountryRegion NONCLUSTERED S
AdventureWorks01... SalesLT Address IX_Address_StateProvince NONCLUSTERED S
AdventureWorks01... SalesLT Address PK_Address_AddressID CLUSTERED S
AdventureWorks01... SalesLT Customer AK_Customer_rowguid NONCLUSTERED S
AdventureWorks01... SalesLT Customer IX_Customer_EmailAddress NONCLUSTERED S
AdventureWorks01... SalesLT Customer PK_Customer_CustomerID CLUSTERED S

Similarly, a second object that also provides the column-related information can be built on top of the above view by adding the sys.index_columns and sys.columns system objects to the logic:

-- clean after
DROP VIEW IF EXISTS meta.vIndexColumns

-- create 
CREATE OR ALTER VIEW meta.vIndexColumns
AS
-- sys metadata - index columns
SELECT IND.db_id
, IND.schema_id
, INC.object_id 
, INC.index_id 
, INC.index_column_id
, INC.column_id
, IND.db_name
, IND.schema_name
, IND.table_name
, IND.index_name
, COL.name column_name
, INC.key_ordinal
, INC.partition_ordinal
, IND.index_type
, IND.is_primary_key
, IND.is_unique_constraint
, IND.fill_factor
, IND.has_filter
, IND.auto_created
, IND.is_unique_index
, INC.is_descending_key
, INC.is_included_column
, IND.principal_type
, IND.principal_type_desc
FROM sys.index_columns INC
     JOIN sys.columns COL WITH (NOLOCK)
       ON INC.object_id = COL.object_id 
      AND INC.column_id = COL.column_id 
     JOIN meta.vIndexes IND WITH (NOLOCK)
       ON INC.object_id = IND.object_id
      AND INC.index_id = IND.index_id

And here's an example of a query based on this view:

-- sys metadata - index columns
SELECT INC.schema_name
, INC.table_name
, INC.index_name
, INC.column_name
, INC.key_ordinal
, INC.index_type
, INC.principal_type
FROM meta.vIndexColumns INC
WHERE INC.schema_name = 'SalesLT'
  AND INC.table_name IN ('Address', 'Customer')
ORDER BY INC.table_name
, INC.index_name
, INC.key_ordinal

Output:
schema_name table_name index_name column_name key_ordinal index_type principal_type
SalesLT Address AK_Address_rowguid rowguid 1 NONCLUSTERED S
SalesLT Address IX_Address_AddressLine1_AddressLine2_City_StateProvince_PostalCode_CountryRegion AddressLine1 1 NONCLUSTERED S
SalesLT Address IX_Address_AddressLine1_AddressLine2_City_StateProvince_PostalCode_CountryRegion AddressLine2 2 NONCLUSTERED S
SalesLT Address IX_Address_AddressLine1_AddressLine2_City_StateProvince_PostalCode_CountryRegion City 3 NONCLUSTERED S
SalesLT Address IX_Address_AddressLine1_AddressLine2_City_StateProvince_PostalCode_CountryRegion StateProvince 4 NONCLUSTERED S
SalesLT Address IX_Address_AddressLine1_AddressLine2_City_StateProvince_PostalCode_CountryRegion PostalCode 5 NONCLUSTERED S
SalesLT Address IX_Address_AddressLine1_AddressLine2_City_StateProvince_PostalCode_CountryRegion CountryRegion 6 NONCLUSTERED S
SalesLT Address IX_Address_StateProvince StateProvince 1 NONCLUSTERED S
SalesLT Address PK_Address_AddressID AddressID 1 CLUSTERED S
SalesLT Customer AK_Customer_rowguid rowguid 1 NONCLUSTERED S
SalesLT Customer IX_Customer_EmailAddress EmailAddress 1 NONCLUSTERED S
SalesLT Customer PK_Customer_CustomerID CustomerID 1 CLUSTERED S

Notes:
1) As a DBA it's useful to take a baseline of the indexes defined and reevaluate their usefulness over time (see the usage sketch after these notes). This is one of the checks that should be done when one becomes responsible for the administration of a database server, independently of the vendor.
2) The definitions of the views can be extended as needed, though one should try to keep the overall complexity to a minimum. 
3) There are voices against the use of NOLOCK. Feel free to change the objects accordingly!
4) It's useful to work in a dedicated schema (e.g. meta) and have a naming convention that deviates slightly from the one defined by Microsoft. This should make sure that no confusion with the system objects exists.
5) Azure SQL takes over many of the responsibilities for index maintenance. Even if indexes are managed automatically by the system, a baseline is still needed, at least to evaluate functionality's performance, respectively the changes that occurred in the environment. 
6) Except for attributes which were added for specific functionality over time, the queries should work at least from SQL Server 2005 onward.
7) See also the notes on clustered vs nonclustered indexes.
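
For note 1, a starting point for evaluating the indexes' usefulness is sys.dm_db_index_usage_stats. A minimal sketch (the counters are reset at service restart, so the numbers should be interpreted over a longer period):

-- sys metadata - index usage overview
SELECT SCH.name schema_name
, OBJ.name table_name
, IND.name index_name
, IUS.user_seeks
, IUS.user_scans
, IUS.user_lookups
, IUS.user_updates
FROM sys.indexes IND
     JOIN sys.objects OBJ
       ON IND.object_id = OBJ.object_id
          JOIN sys.schemas SCH
            ON OBJ.schema_id = SCH.schema_id
     LEFT JOIN sys.dm_db_index_usage_stats IUS
       ON IND.object_id = IUS.object_id
      AND IND.index_id = IUS.index_id
      AND IUS.database_id = DB_ID()
WHERE OBJ.type = 'U'
ORDER BY SCH.name
, OBJ.name
, IND.name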

Happy coding!

