26 March 2023

Data Science: Lying with Statistics (Just the Quotes)

"Thus the alteration of the truth which is already manifesting itself in the progressive form of lying and perjury, offers us, in the superlative, the statistics." (Fran├žois Magendie, 18th century) 

"An old jest runs to the effect that there are three degrees of comparison among liars. There are liars, there are outrageous liars, and there are scientific experts. This has lately been adapted to throw dirt upon statistics. There are three degrees of comparison, it is said, in lying. There are lies, there are outrageous lies, and there are statistics." (Robert Giffen, Economic Journal 2 (6), 1892)

"Professor [Joseph] Munro reminded him of an old saying which he rather reluctantly proposed, in that company, to repeat. It was to the effect that there were three gradations of inveracity - there were lies, there were d-d lies, and there were statistics." (Arthur J Balfour, [in Manchester Guardian] 1892)

"Columns of figures are hurled about in the papers, and demonstrate the justice of the witty claim that there are three kinds of untruth : fibs, lies, and statistics." (Herbert B Workman, "The principles of the Gothenburg system", Wesleyan-Methodist Magazine 118, 1895)

"After all, facts are facts, and although we may quote one to another with a chuckle the words of the Wise Statesman, 'Lies - damn lies - and statistics', still there are some easy figures the simplest must understand, and the astutest cannot wriggle out of." (Leonard H. Courtney, [speech] 1895)

"There are three kinds of lies - lies, damned lies and statistics." (Carroll D Wright, New York Times, 1896) 

"Figures often beguile me, particularly when I have the arranging of them myself; in which case the remark attributed to Disraeli would often apply with justice and force: “There are three kinds of lies: lies, damned lies, and statistics." (Mark Twain, [in "Mark Twain’s Autobiography" Vol I, 1904])

"Figures may not lie, but statistics compiled unscientifically and analyzed incompetently are almost sure to be misleading, and when this condition is unnecessarily chronic the so-called statisticians may be called liars." (Edwin B Wilson, "Bulletin of the American Mathematical Society", Vol 18, 1912)

"In earlier times they had no statistics and so they had to fall back on lies. Hence the huge exaggerations of primitive literature, giants, miracles, wonders! It's the size that counts. They did it with lies and we do it with statistics: but it's all the same." (Stephen Leacock, "Model memoirs and other sketches from simple to serious", 1939)

"It has long been recognized by public men of all kinds […] that statistics come under the head of lying, and that no lie is so false or inconclusive as that which is based on statistics." (Hilaire Belloc, "The Silence of the Sea", 1940)

"For many people the first word that comes to mind when they think about statistical charts is 'lie'. No doubt some graphics do distort the underlying data, making it hard for the viewer to learn the truth. But data graphics are no different from words in this regard, for any means of communication can be used to deceive. There is no reason to believe that graphics are especially vulnerable to exploitation by liars; in fact, most of us have pretty good graphical lie detectors that help us see right through frauds." (Edward R Tufte, "The Visual Display of Quantitative Information", 1983)

"The conditions under which many data graphics are produced - the lack of substantive and quantitative skills of the illustrators, dislike of quantitative evidence, and contempt for the intelligence of the audience-guarantee graphic mediocrity. These conditions engender graphics that (1) lie; (2) employ only the simplest designs, often unstandardized time-series based on a small handful of data points; and (3) miss the real news actually in the data." (Edward R Tufte, "The Visual Display of Quantitative Information", 1983)

"Fairy tales lie just as much as statistics do, but sometimes you can find a grain of truth in them." (Sergei Lukyanenko, "The Night Watch", 1998)

"While some social problems statistics are deliberate deceptions, many - probably the great majority - of bad statistics are the result of confusion, incompetence, innumeracy, or selective, self-righteous efforts to produce numbers that reaffirm principles and interests that their advocates consider just and right. The best response to stat wars is not to try and guess who's lying or, worse, simply to assume that the people we disagree with are the ones telling lies. Rather, we need to watch for the standard causes of bad statistics - guessing, questionable definitions or methods, mutant numbers, and inappropriate comparisons." (Joel Best, "Damned Lies and Statistics: Untangling Numbers from the Media, Politicians, and Activists", 2001)

"I believe that the backlash against statistics is due to four primary reasons. The first, and easiest for most people to relate to, is that even the most basic concepts of descriptive and inferential statistics can be difficult to grasp and even harder to explain. […] The second cause for vitriol is that even well-intentioned experts misapply the tools and techniques of statistics far too often, myself included. Statistical pitfalls are numerous and tough to avoid. When we can't trust the experts to get it right, there's a temptation to throw the baby out with the bathwater. The third reason behind all the hate is that those with an agenda can easily craft statistics to lie when they communicate with us  […] And finally, the fourth cause is that often statistics can be perceived as cold and detached, and they can fail to communicate the human element of an issue." (Ben Jones, "Avoiding Data Pitfalls: How to Steer Clear of Common Blunders When Working with Data and Presenting Analysis and Visualizations", 2020)

"It is easy to lie with statistics. It is hard to tell the truth without it." (Andrejs Dunkels)

25 March 2023

Data Science: Trust (Just the Quotes)

"We must trust to nothing but facts: These are presented to us by Nature, and cannot deceive. We ought, in every instance, to submit our reasoning to the test of experiment, and never to search for truth but by the natural road of experiment and observation." (Antoin-Laurent de Lavoisiere, "Elements of Chemistry", 1790)

"A law of nature, however, is not a mere logical conception that we have adopted as a kind of memoria technical to enable us to more readily remember facts. We of the present day have already sufficient insight to know that the laws of nature are not things which we can evolve by any speculative method. On the contrary, we have to discover them in the facts; we have to test them by repeated observation or experiment, in constantly new cases, under ever-varying circumstances; and in proportion only as they hold good under a constantly increasing change of conditions, in a constantly increasing number of cases with greater delicacy in the means of observation, does our confidence in their trustworthiness rise." (Hermann von Helmholtz, "Popular Lectures on Scientific Subjects", 1873)

"It is of the nature of true science to take nothing on trust or on authority. Every fact must be established by accurate observation, experiment, or calculation. Every law and principle must rest on inductive argument. The apostolic motto, ‘Prove all things, hold fast that which is good’, is thoroughly scientific. It is true that the mere reader of popular science must often be content to take that on testimony which he cannot personally verify; but it is desirable that even the most cursory reader should fully comprehend the modes in which facts are ascertained and the reasons on which the conclusions are based." (Sir John W Dawson, "The Chain of Life in Geological Time", 1880)

"The aim of science is to seek the simplest explanations of complex facts. We are apt to fall into the error of thinking that the facts are simple because simplicity is the goal of our quest. The guiding motto in the life of every natural philosopher should be, ‘Seek simplicity and distrust it’." (Alfred N Whitehead, "The Concept of Nature", 1919)

"Every bit of knowledge we gain and every conclusion we draw about the universe or about any part or feature of it depends finally upon some observation or measurement. Mankind has had again and again the humiliating experience of trusting to intuitive, apparently logical conclusions without observations, and has seen Nature sail by in her radiant chariot of gold in an entirely different direction." (Oliver J Lee, "Measuring Our Universe: From the Inner Atom to Outer Space", 1950)

"Being built on concepts, hypotheses, and experiments, laws are no more accurate or trustworthy than the wording of the definitions and the accuracy and extent of the supporting experiments." (Gerald Holton, "Introduction to Concepts and Theories in Physical Science", 1952)

"No observations are absolutely trustworthy. In no field of observation can we entirely rule out the possibility that an observation is vitiated by a large measurement or execution error. If a reading is found to lie a very long way from its fellows in a series of replicate observations, there must be a suspicion that the deviation is caused by a blunder or gross error of some kind. [...] One sufficiently erroneous reading can wreck the whole of a statistical analysis, however many observations there are." (Francis J Anscombe, "Rejection of Outliers", Technometrics Vol. 2 (2), 1960)

"Even properly done statistics can’t be trusted. The plethora of available statistical techniques and analyses grants researchers an enormous amount of freedom when analyzing their data, and it is trivially easy to ‘torture the data until it confesses’." (Alex Reinhart, "Statistics Done Wrong: The Woefully Complete Guide", 2015)

"Science’s predictions are more trustworthy, but they are limited to what we can systematically observe and tractably model. Big data and machine learning greatly expand that scope. Some everyday things can be predicted by the unaided mind, from catching a ball to carrying on a conversation. Some things, try as we might, are just unpredictable. For the vast middle ground between the two, there’s machine learning." (Pedro Domingos, "The Master Algorithm", 2015)

"The closer that sample-selection procedures approach the gold standard of random selection - for which the definition is that every individual in the population has an equal chance of appearing in the sample - the more we should trust them. If we don’t know whether a sample is random, any statistical measure we conduct may be biased in some unknown way." (Richard E Nisbett, "Mindware: Tools for Smart Thinking", 2015)

"GIGO is a famous saying coined by early computer scientists: garbage in, garbage out. At the time, people would blindly put their trust into anything a computer output indicated because the output had the illusion of precision and certainty. If a statistic is composed of a series of poorly defined measures, guesses, misunderstandings, oversimplifications, mismeasurements, or flawed estimates, the resulting conclusion will be flawed." (Daniel J Levitin, "Weaponized Lies", 2017)

"Are your insights based on data that is accurate and reliable? Trustworthy data is correct or valid, free from significant defects and gaps. The trustworthiness of your data begins with the proper collection, processing, and maintenance of the data at its source. However, the reliability of your numbers can also be influenced by how they are handled during the analysis process. Clean data can inadvertently lose its integrity and true meaning depending on how it is analyzed and interpreted." (Brent Dykes, "Effective Data Storytelling: How to Drive Change with Data, Narrative and Visuals", 2019)

"Big data is revolutionizing the world around us, and it is easy to feel alienated by tales of computers handing down decisions made in ways we don’t understand. I think we’re right to be concerned. Modern data analytics can produce some miraculous results, but big data is often less trustworthy than small data. Small data can typically be scrutinized; big data tends to be locked away in the vaults of Silicon Valley. The simple statistical tools used to analyze small datasets are usually easy to check; pattern-recognizing algorithms can all too easily be mysterious and commercially sensitive black boxes." (Tim Harford, "The Data Detective: Ten easy rules to make sense of statistics", 2020)

"I believe that the backlash against statistics is due to four primary reasons. The first, and easiest for most people to relate to, is that even the most basic concepts of descriptive and inferential statistics can be difficult to grasp and even harder to explain. […] The second cause for vitriol is that even well-intentioned experts misapply the tools and techniques of statistics far too often, myself included. Statistical pitfalls are numerous and tough to avoid. When we can't trust the experts to get it right, there's a temptation to throw the baby out with the bathwater. The third reason behind all the hate is that those with an agenda can easily craft statistics to lie when they communicate with us  […] And finally, the fourth cause is that often statistics can be perceived as cold and detached, and they can fail to communicate the human element of an issue." (Ben Jones, "Avoiding Data Pitfalls: How to Steer Clear of Common Blunders When Working with Data and Presenting Analysis and Visualizations", 2020)

08 March 2023

Graphical Representation: Scales (Just the Quotes)

"The time-series plot is the most frequently used form of graphic design. With one dimension marching along to the regular rhythm of seconds, minutes, hours, days, weeks, months, years, centuries, or millennia, the natural ordering of the time scale gives this design a strength and efficiency of interpretation found in no other graphic arrangement." (Edward R Tufte, "The Visual Display of Quantitative Information", 1983)

"It is common for positive data to be skewed to the right: some values bunch together at the low end of the scale and others trail off to the high end with increasing gaps between the values as they get higher. Such data can cause severe resolution problems on graphs, and the common remedy is to take logarithms. Indeed, it is the frequent success of this remedy that partly accounts for the large use of logarithms in graphical data display." (William S Cleveland, "The Elements of Graphing Data", 1985)

"When magnitudes are graphed on a logarithmic scale, percents and factors are easier to judge since equal multiplicative factors and percents result in equal distances throughout the entire scale." (William S Cleveland, "The Elements of Graphing Data", 1985)

"When the data are magnitudes, it is helpful to have zero included in the scale so we can see its value relative to the value of the data. But the need for zero is not so compelling that we should allow its inclusion to ruin the resolution of the data on the graph." (William S Cleveland, "The Elements of Graphing Data", 1985)

"The logarithm is one of many transformations that we can apply to univariate measurements. The square root is another. Transformation is a critical tool for visualization or for any other mode of data analysis because it can substantially simplify the structure of a set of data. For example, transformation can remove skewness toward large values, and it can remove monotone increasing spread. And often, it is the logarithm that achieves this removal." (William S Cleveland, "Visualizing Data", 1993)

"The rule is that a graph of a change in a variable with time should always have a vertical scale that starts with zero. Otherwise, it is inherently misleading." (Douglas A Downing & Jeffrey Clark, "Forgotten Statistics: A Self-Teaching Refresher Course", 1996)

"The more clues to meaning that are supplied elsewhere, the less the need for cluttersome scales." (Eric Meyer, "Designing Infographics", 1997) 

"Choose scales wisely, as they have a profound influence on the interpretation of graphs. Not all scales require that zero be included, but bar graphs and other graphs where area is judged do require it." (Naomi B Robbins, "Creating More effective Graphs", 2005) 

"Use a logarithmic scale when it is important to under- stand percent change or multiplicative factors. […] Showing data on a logarithmic scale can cure skewness toward large values." (Naomi B Robbins, "Creating More effective Graphs", 2005) 

"Use a scale break only when necessary. If a break cannot be avoided, use a full scale break. Taking logs can cure the need for a break." (Naomi B Robbins, "Creating More effective Graphs", 2005)

"Color can tell us where to look, what to compare and contrast, and it can give us a visual scale of measure. Because color can be so effective, it is often used for multiple purposes in the same graphic - which can create graphics that are dazzling but difficult to interpret. Separating the roles that color can play makes it easier to apply color specifically for encouraging different kinds of visual thinking. [...] Choose colors to draw attention, to label, to show relationships (compare and contrast), or to indicate a visual scale of measure." (Felice C Frankel & Angela H DePace, "Visual Strategies", 2012)

"Geographic maps have the advantage of being true to scale - great for walking. Diagrams have the advantage of being easily imaged and remembered, often true to a non-pedestrian experience, and the ability to open up congestion, reduce empty space, and use real estate efficiently. Hybrids 'mapograms' ? - often have the disadvantages of both map and diagram with none of the corresponding advantages." (Joel Katz, "Designing Information: Human factors and common sense in information design", 2012)

"Context (information that lends to better understanding the who, what, when, where, and why of your data) can make the data clearer for readers and point them in the right direction. At the least, it can remind you what a graph is about when you come back to it a few months later. […] Context helps readers relate to and understand the data in a visualization better. It provides a sense of scale and strengthens the connection between abstract geometry and colors to the real world." (Nathan Yau, "Data Points: Visualization That Means Something", 2013)

03 March 2023

Data Warehousing: Building a Modern Data Warehouse with Azure Synapse

Data Warehousing

Introduction

When building a data warehouse (DWH) several key words or derivatives of them appear in requirements: secure, flexible, simple, scalable, reliable, performant, non-redundant, modern, automated, real-timed, etc. As it proves in practice, all these requirements are sometimes challenging to address with the increased complexity of the architecture chosen. There are so many technologies on the DWH market promising all these at low costs, low effort and high ROI, though DWH projects continue to fail addressing the business and technical requirements.

On a basic level for building a DWH is needed a data storage layer and an ETL (Extract, Transfer, Load) tool responsible for the data movement between the various source systems and DWH, and eventually within the DWH itself. After that, each technology added to the landscape tends to increase the overall complexity (and should be regarded with a critical eye in what concerns the advantages and disadvantages).

Data Warehouse Architecture (on-premise)

A Reference Architecture

When building a DWH or a data migration solution, which has many of the characteristics of a DWH, from the many designs, I prefer to keep things as simple as possible.  An approach based on a performant database engine like SQL Server as storage layer and SSIS (SQL Server Integration Services) as ETL proved to be the best choice until now, allowing to address most of the technical requirements by design. Then come the choices on how and where to import and transform the data, at what level of granularity, on how the semantic layer is built, how the data are accessed, etc.

Being able to pull (see extract subprocess) the data from the data sources on a need by basis offers the most flexible approach, however there are cases in which the direct access to source data is not possible, having to rely on a push approach, where data are dumped regularly to a given location (e.g. FTP folder structure), following to be picked up as needed. It's actually a hybrid between a push and pull, because a fully push approach would mean pushing the data directly to the DWH, which can be also acceptable, though might offer lower control on data's movement and involve a few other challenges (e.g. permissions, concurrency). 

Data can be prepared for the DWH in the source systems (e.g. exposed via data objects or API calls), anywhere in between via ETL-based transformations (see transform subprocess) or directly in the DWH. I prefer importing the data (see load subprocess) 1:1 without any transformations from the various sources via SSIS (or similar technologies) into a set of tables that designated the staging area. It's true that in this way the ETL technology is used to a minimum, though unless there's a major benefit to use it for data transformations, using DWH's capabilities and SQL for data processing can provide better performance and flexibility

Besides the selection of the columns in scope (typically columns with meaningful values), it's important not to do any transformations in the extraction layer because the data is imported faster (eventually using fast load options as in SSIS) and it assures a basis for troubleshooting (as the data don't change between loads). Some filters can be applied only when the volume of data is high, and the subset of the data could be identified clearly (e.g. when data are partitioned based on a key like business unit, legal entity or creation date).

For better traceability, the staging schemas can reflect the systems they come from, the tables and the columns should have the same names, respectively same data types. On such tables no constraints are applied and no indexes are needed. They can be constructed however on the production tables (aka base tables) - copy of the tables from production. 

Some DWH architects try replicating the constraints from the source systems and/or add more constraints on top to define the various business rules. Rigor is good in some scenarios, though it can involve a considerable effort and it might be challenging to keep over time, especially when considering the impact of big data on DWH architectures. Instead of using constraints, building a set of SQL scripts that pinpoint the issues as reports allow more flexibility with the risk of having inconsistencies running wild through the reports. The data should be cleaned in the source system and not possible then properly addressed in the DWH. Applying constraints will make the data unavailable for reporting until data are corrected, while being more permissive would allow dirty data. Thus, either case has advantages or disadvantages, though the latter seems to be more appropriate. 

Indexes on the production schema should reflect the characteristics of the queries run on the data and shouldn't replicate the indexes from the source environments, even if some overlaps might exist. In practice, dropping the non-clustered indexes on the production tables before loading the data from staging, and recreating them afterwards proves to provide faster loading (see load optimization techniques). 

The production tables are used for building a "semantic" data model or something similar. Several levels of views, table-valued functions and/or indexed/materialized views allows building the dimensions and facts tables, the latter incorporating the business logic needed by the reports. Upon case, stored-procedures, physical or temporary tables, table variables can be used to prepare the data, though they tend to break the "free" flow of data as steps in-between need to be run. On the other side, in certain scenarios their use is unavoidable. 

The first level of views (aka base views) is based on the base tables without any joins, though they include only the fields in use (needed by the business) ordered and "grouped" together based on their importance or certain characteristics. The views can include conversions of data types, translations of codes into meaningful values, and quite seldom filters on the data. Based on these "base" views the second level is built, which attempts to define the dimension and fact tables at the lowest granularity. These views include joins between tables coming from the same or different systems, respectively mappings of values defined in tables, and whatever it takes to build such entities. However, transformations on individual fields are pushed, when possible, to the lower level to minimize logic redundancy. From similar reasons, the logic could be broken down over two or more "helper" views when visible benefits could be obtained from it (e.g troubleshooting, reuse, maintenance). It's important to balance between creating too many helper views and encapsulating too much logic in a view. 

One of the design principles used in building the entities is to minimize the redundance of the fields used, ideally without having columns duplicated between entities at this level. This would facilitate the traceability of columns to the source tables within the "semantic" layer (typically in the detriment of a few more joins). In practice, one is forced to replicate some columns to simplify some parts of the logic. 

Further views can be built based on the dimension and fact entities to define the logic needed by the reports. Only these objects are used and no direct reference to the "base" tables or views are made. Moreover, to offer better performance when the views can be materialized or, when there's an important benefit, physically saved as table (e.g. having multiple indexes for different scenarios). It's the case of entities with considerable data volume called over and over. 

This approach of building the entities is usually flexible enough to address most of the reporting requirements, independently whether the technical solution has the characteristics of a DWH, data mart or data migration layer. Moreover, the overall architectural approach can be used on-premise as well in cloud architectures, where Azure SQL Server and ADF (Azure Data Factory) provide similar capabilities. Compared with standard SQL Server, some features might not be available, while other features might bring further benefits, though the gaps should be neglectable.

Data Management topics like Master Data Management (MDM), Data Quality Management (DQM) and/or Metadata Management can be addressed as well by using third-party tools or tools from the Microsoft stack - Master Data Services (MDS) and Data Quality Services (DQS) in combination with SSIS help addressing a wide range of scenarios - however these are optional. 

Moving to the Cloud

Within the context of big data, characterized by (high/variable) volume, value, variety, velocity, veracity, and further less important V's, the before technical requirements still apply, however within a cloud environment the overall architecture becomes more complex. Each component becomes a service. There are thus various services for data ingestion, storage, processing, sharing, collaboration, etc. The way data are processed involves also several important transformations: ETL becomes ELT, FTP and local storage by Data Lakes, data packages by data pipelines, stateful by stateless, SMP (Symmetric Multi-Processing) by MPP (Massive Parallel Processing), and so on.

As file storage is less expensive than database storage, there's an increasing trend of dumping business critical data into the Data Lake via data pipelines or features like Link to Data Lake or Export to Data Lake, which synchronize the data between source systems and Data Lake in near real-time at table or entity level. Either saved as csv, parquet, delta lake or any other standard file format, in single files or partitions, the data can be used directly or indirectly for analytics.

Cloud-native warehouses allow addressing topics like scalability, elasticity, fault-tolerance and performance by design, though further challenges appear as compute needs to be decoupled from storage, the workloads need to be estimated for assuring the performance, data may be distributed across data centers spanning geographies, the infrastructure is exposed to attacks, etc. 

Azure Synapse

If one wants to take advantage of the MPP architecture's power, Microsoft provides an analytical architecture based on Azure Synapse, an analytics service that brings together data integration, enterprise DWH, and big data analytics. Besides two types of SQL-based data processing services  (dedicated vs serverless SQL pools) it comes also with a Spark pool for in-memory cluster computing.

A DWH based on Azure Synapse is not that different from the reference architecture described above for an on-premise solution. Actually, a DWH based on a dedicated SQL pool (aka a physical data warehouse) involves the same steps mentioned above. 

Data Warehouse Architecture with Dedicated SQL Pool

The data can be imported via ETL/ELT pipelines in the DWH, though there are also mechanisms for consuming the data directly from the files stored in the Data Lake or Azure storage. CETAS (aka Create External Table as Select) can be defined on top of the data files, the external tables acting as "staging" or "base" tables in the architecture described above. When using a dedicated SQL pool it makes sense to use the CETAS as "staging" tables, the processed data following to be dumped to "optimized" physical tables for consumption and refreshed periodically. However, when this happens the near real-time character of data is lost. Using the CETAs as base tables would keep this characteristic as long the data isn't saved physically in tables or files, maybe in the detriment of performance.

Using a dedicated SQL pool for direct reporting can become expensive as the pool needs to be available at least during business hours for incoming user requests, or at least for importing the data and refreshing the datasets. When using the CETAS as a base table, a serverless (aka on-demand) SQL pool, which uses a per-pay-use billing model could prove to be more cost-effective and flexible in many scenarios. By design, it helps to keep the near real-time character of the data. Moreover, even if the data are actually moved from the source tables into the Data Lake, this architecture has the characteristics of a logical data warehouse:

Data Warehouse Architecture with Serverless SQL Pool

Unfortunately, unless one uses Spark tables, misuses views or adds an Azure SQL database to the architecture, there are no physical tables or materialized views in a serverless SQL pool. There's still the option to use data pipelines for regullarly exporting intermediary data to files (incl. over partitions or folders), even if this involves more overhead as it's not possible to export data over SQL syntax to files more than once (though this might change in the future). For certain scenario it could be useful to store data in a Azure SQL Server or similar database, including a dedicated SQL pool. 

Choosing between serverless and dedicated SQL pool is not an exclusive choice, both or all 3 types of pools (if we consider also the Spark pool) can be used in the architecture for addressing specific challenges, especially when we consider that there are important differences between the features available in each of the pools. Moreover, one can start the PoC based on the serverless SQL pool and when the solution became mature enough and used in all enterprise, parts of the logic or all of it can be migrated to a dedicated SQL pool. This would allow to save costs at the beginning in the detriment of further effort later. 

Talking about the physical storage, data engineers recommend defining within a Data Lake several layers (aka regions, zones) labeled as bronze, silver and gold (and probably platinum will join the club anytime soon). The bronze layer refers to the raw data available in the Data Lake, including the files on which the initial CETAS are defined upon. The silver refers to transformed, cleaned, enriched and integrated data, data resulting from the second layer of views described above. The gold layer refers to the data to which business logic was applied and prepared for consumption, data resulting from the final layer of views. Of course, data pipelines can be used to prepare the data at these stages, though a view-based approach offers more flexibility, are easier to troubleshoot, manage and reuse than data pipelines.

Ideally the gold data should involve no or minimal further transformation before reaching the users, though that's not realistic. Building a DWH takes a considerable time and the business can't usually wait until everything is in place. Therefore, reports based on DWH will continue to coexist with reports directly accessing the source data, which will lead to controversies. Enforcing a single source of truth will help to minimize the gap, though will not eliminate it completely. 

Closing Notes

These are just outlines of a minimal reference architecture. There's more to consider, as there are several alternatives (see [1] [2] [3] [4]) for each of the steps considered in here, each technology, new features or mechanisms opening new opportunities. The advantages and disadvantages should be always considered against the business needs and requirements. One approach, even if recommended, might not work for all, though unless there's an important requirement or an opportunity associated with an additional technology, deviating from reference architectures might not be such a good idea afterall. 

Resources:
[1] Microsoft Learn (2022) Modern data warehouse for small and medium business (link)
[2] Microsoft Learn (2022) Data warehousing and analytics (link)
[3] Microsoft Learn (2022) Enterprise business intelligence (link)
[4] Microsoft Learn (2022) Serverless Modern Data Warehouse Sample using Azure Synapse Analytics and Power BI (link)
[5] Coursera (2023) Data Warehousing with Microsoft Azure Synapse Analytics (link) [course, free to audit]
[6] SQLBits (2020) Mahesh Balija's Building Modern Data Warehouse with Azure Synapse Analytics (link)
[7] Matt How (2020) The Modern Data Warehouse in Azure: Building with Speed and Agility on Microsoft’s Cloud Platform (Amazon)
[8] James Serra's blog (2022) Data lake architecture (link)
[9] SQL Stijn (2022) SQL Building a Modern Lakehouse Data Warehouse with Azure Synapse Analytics: Moving your Database to the lake (link)
[10] Solliance (2022) Azure Synapse Analytics Workshop 400 (link) [GitHub repository]

26 February 2023

Data Science: Data Literacy (Just the Quotes)

"[…] statistical literacy. That is, the ability to read diagrams and maps; a 'consumer' understanding of common statistical terms, as average, percent, dispersion, correlation, and index number."  (Douglas Scates, "Statistics: The Mathematics for Social Problems", 1943)

"Just as by ‘literacy’, in this context, we mean much more than its dictionary sense of the ability to read and write, so by ‘numeracy’ we mean more than mere ability to manipulate the rule of three. When we say that a scientist is ‘illiterate’, we mean that he is not well enough read to be able to communicate effectively with those who have had a literary education. When we say that a historian or a linguist is ‘innumerate’ we mean that he cannot even begin to understand what scientists and mathematicians are talking about." (Sir Geoffrey Crowther, "A Report of the Central Advisory Committee for Education", 1959)

"People often feel inept when faced with numerical data. Many of us think that we lack numeracy, the ability to cope with numbers. […] The fault is not in ourselves, but in our data. Most data are badly presented and so the cure lies with the producers of the data. To draw an analogy with literacy, we do not need to learn to read better, but writers need to be taught to write better." (Andrew Ehrenberg, "The problem of numeracy", American Statistician 35(2), 1981)

"If you give users with low data literacy access to a business query tool and they create incorrect queries because they didn’t understand the different ways revenue could be calculated, the BI tool will be perceived as delivering bad data." (Cindi Howson, "Successful Business Intelligence: Secrets to making BI a killer App", 2008)

"Even with simple and usable models, most organizations will need to upgrade their analytical skills and literacy. Managers must come to view analytics as central to solving problems and identifying opportunities - to make it part of the fabric of daily operations." (Dominic Barton & David Court, "Making Advanced Analytics Work for You", 2012)

"Statistical literacy is more than numeracy. It includes the ability to read and communicate the meaning of data. This quality makes people literate as opposed to just numerate. Wherever words (and pictures) are added to numbers and data in your communication, people need to be able to understand them correctly." (United Nations, "Making Data Meaningful" Part 4: "A guide to improving statistical literacy", 2012)

"Most important, the range of data literacy and familiarity with your data’s context is much wider when you design graphics for a general audience." (Nathan Yau, "Data Points: Visualization That Means Something", 2013)

"Data literacy, simply put, means the ability to read, understand, and communicate with data and the insights derived from it. Some people argue that it’s not like reading text because it requires math skills, implying a greater complexity. I disagree. To the uninitiated, reading text is just as hard as “reading” data or graphs." (Jennifer Belissent, "Data Literacy Matters - Do We Have To Spell It Out?!", 2019)

"Even though data is being thrust on more people, it doesn’t mean everyone is prepared to consume and use it effectively. As our dependence on data for guidance and insights increases, the need for greater data literacy also grows. If literacy is defined as the ability to read and write, data literacy can be defined as the ability to understand and communicate data. Today’s advanced data tools can offer unparalleled insights, but they require capable operators who can understand and interpret data." (Brent Dykes, "Effective Data Storytelling: How to Drive Change with Data, Narrative and Visuals", 2019)

"Data fluency, as defined in this book, is the ability to speak and understand the language of data; it is essentially an ability to communicate with and about data. In different cases around the world, the term data fluency has sometimes been used interchangeably with data literacy. That is not the approach of this book. This book looks to define data literacy as the ability to read, work with, analyze, and communicate with data. Data fluency is the ability to speak and understand the language of data." (Jordan Morrow, "Be Data Literate: The data literacy skills everyone needs to succeed", 2021)

"Data literacy is not a change in an individual’s abilities, talents, or skills within their careers, but more of an enhancement and empowerment of the individual to succeed with data. When it comes to data and analytics succeeding in an organization’s culture, the increase in the workforces’ skills with data literacy will help individuals to succeed with the strategy laid in front of them. In this way, organizations are not trying to run large change management programs; the process is more of an evolution and strengthening of individual’s talents with data. When we help individuals do more with data, we in turn help the organization’s culture do more with data." (Jordan Morrow, "Be Data Literate: The data literacy skills everyone needs to succeed", 2021)

"Overall [...] everyone also has a need to analyze data. The ability to analyze data is vital in its understanding of product launch success. Everyone needs the ability to find trends and patterns in the data and information. Everyone has a need to ‘discover or reveal (something) through detailed examination’, as our definition says. Not everyone needs to be a data scientist, but everyone needs to drive questions and analysis. Everyone needs to dig into the information to be successful with diagnostic analytics. This is one of the biggest keys of data literacy: analyzing data." (Jordan Morrow, "Be Data Literate: The data literacy skills everyone needs to succeed", 2021)

"The process of asking, acquiring, analyzing, integrating, deciding, and iterating should become second nature to you. This should be a part of how you work on a regular basis with data literacy. Again, without a decision, what is the purpose of data literacy? Data literacy should lead you as an individual, and organizations, to make smarter decisions." (Jordan Morrow, "Be Data Literate: The data literacy skills everyone needs to succeed", 2021)

"The reality is, the majority of a workforce doesn’t need to be data scientists, they just need comfort with data literacy." (Jordan Morrow, "Be Data Literate: The data literacy skills everyone needs to succeed", 2021)

"Data literacy is not achieved by mastering a uniform set of competencies that applies to everyone. Those that are relevant to each individual can vary significantly depending on how they interact with data and which part of the data process they are involved in." (Angelika Klidas & Kevin Hanegan, "Data Literacy in Practice", 2022)

"Data literacy is something that affects everyone and every organization. The more people who can debate, analyze, work with, and use data in their daily roles, the better data-informed decision-making will be." (Angelika Klidas & Kevin Hanegan, "Data Literacy in Practice", 2022)

"It is also important to note that data literacy is not about expecting to or becoming an expert; rather, it is a journey that must begin somewhere." (Angelika Klidas & Kevin Hanegan, "Data Literacy in Practice", 2022)

"Organizations must have a plan and vision for data literacy, which they then communicate to all employees. They will need to develop and foster a culture that embraces data literacy and data-informed decisions. They will need to provide employees with access to various learning content specific to data literacy. Along their journey, they will need to make sure they benchmark and measure progress toward their vision and celebrate successes along the way." (Angelika Klidas & Kevin Hanegan, "Data Literacy in Practice", 2022)

25 February 2023

Data Science: Data Stories (Just the Quotes)

"A data story starts out like any other story, with a beginning and a middle. However, the end should never be a fixed event, but rather a set of options or questions to trigger an action from the audience. Never forget that the goal of data storytelling is to encourage and energize critical thinking for business decisions." (James Richardson, 2017)

"All human storytellers bring their subjectivity to their narratives. All have bias, and possibly error. Acknowledging and defusing that bias is a vital part of successfully using data stories. By debating a data story collaboratively and subjecting it to critical thinking, organizations can get much higher levels of engagement with data and analytics and impact their decision making much more than with reports and dashboards alone." (James Richardson, 2017)

"A random collection of interesting but disconnected facts will lack the unifying theme to become a data story - it may be informative, but it won’t be insightful." (Brent Dykes, "Effective Data Storytelling: How to Drive Change with Data, Narrative and Visuals", 2019)

"Analysis is a two-step process that has an exploratory and an explanatory phase. In order to create a powerful data story, you must effectively transition from data discovery (when you’re finding insights) to data communication (when you’re explaining them to an audience). If you don’t properly traverse these two phases, you may end up with something that resembles a data story but doesn’t have the same effect. Yes, it may have numbers, charts, and annotations, but because it’s poorly formed, it won’t achieve the same results." (Brent Dykes, "Effective Data Storytelling: How to Drive Change with Data, Narrative and Visuals", 2019)

"Before you can even consider creating a data story, you must have a meaningful insight to share. One of the essential attributes of a data story is a central or main insight. Without a main point, your data story will lack purpose, direction, and cohesion. A central insight is the unifying theme (telos appeal) that ties your various findings together and guides your audience to a focal point or climax for your data story. However, when you have an increasing amount of data at your disposal, insights can be elusive. The noise from irrelevant and peripheral data can interfere with your ability to pinpoint the important signals hidden within its core." (Brent Dykes, "Effective Data Storytelling: How to Drive Change with Data, Narrative and Visuals", 2019)

"Data storytelling gives your insight the best opportunity to capture attention, be understood, be remembered, and be acted on. An effective data story helps your insight reach its full potential: inspiring others to act and drive change." (Brent Dykes, "Effective Data Storytelling: How to Drive Change with Data, Narrative and Visuals", 2019)

"Data storytelling involves the skillful combination of three key elements: data, narrative, and visuals. Data is the primary building block of every data story. It may sound simple, but a data story should always find its origin in data, and data should serve as the foundation for the narrative and visual elements of your story." (Brent Dykes, "Effective Data Storytelling: How to Drive Change with Data, Narrative and Visuals", 2019)

"Even with a solid narrative and insightful visuals, a data story cannot overcome a weak data foundation. As the master architect, builder, and designer of your data story, you play an instrumental role in ensuring its truthfulness, quality, and effectiveness. Because you are responsible for pouring the data foundation and framing the narrative structure of your data story, you need to be careful during the analysis process. Because all of the data is being processed and interpreted by you before it is shared with others, it can be exposed to cognitive biases and logical fallacies that distort or weaken the data foundation of your story." (Brent Dykes, "Effective Data Storytelling: How to Drive Change with Data, Narrative and Visuals", 2019)

"Numbers are ideal vehicles for promulgating bullshit. They feel objective, but are easily manipulated to tell whatever story one desires. Words are clearly constructs of human minds, but numbers? Numbers seem to come directly from Nature herself. We know words are subjective. We know they are used to bend and blur the truth. Words suggest intuition, feeling, and expressivity. But not numbers. Numbers suggest precision and imply a scientific approach. Numbers appear to have an existence separate from the humans reporting them." (Carl T Bergstrom & Jevin D West, "Calling Bullshit: The Art of Skepticism in a Data-Driven World", 2020)

"So what does it mean to tell an honest story? Numbers should be presented in ways that allow meaningful comparisons." (Carl T Bergstrom & Jevin D West, "Calling Bullshit: The Art of Skepticism in a Data-Driven World", 2020)

"To tell an honest story, it is not enough for numbers to be correct. They need to be placed in an appropriate context so that a reader or listener can properly interpret them." (Carl T Bergstrom & Jevin D West, "Calling Bullshit: The Art of Skepticism in a Data-Driven World", 2020)

22 February 2023

SQL Reloaded: Automatic Statistics Creation & Dropping for CETAS based on CSV File Format in Serverless SQL Pool

Introduction

The serverless SQL pool query optimizer uses statistics to calculate and compare the cost of various query plans, and then choose the plan with the lowest cost. Automatic creation of statistics is turned on for parquet file format, though for CSV file format statistics will be automatically created only when OPENROWSET is used. This means that when creating CETAS based on CSV the statistics need to be created manually. 

This would be one more reason for holding the files in the Data Lake as parquet files. On the other side there are also many files already available in CSV format, respectively technoloqies that allows exporting data only/still as CSV. Moreover, transforming the files as parquet is not always technically feasible.

Using OPENROWSET could also help, though does it make sense to use a different mechanismus for the CSV file format? In some scenarios will do. I prefer to have a unitary design, when possible. Moreover, even if some columns are not needed, they can still be useful for certain scenarios (e.g. troubleshooting, reevaluating their use, etc.). 

There are files, especially the ones coming from ERPs (Enterprise Resource Planning) or similar systems, which have even a few hundred columns (on average between 50 and 100 columns). Manually creating  the statistics for the respective tables will cost lot of time and effort. To automate the process there are mainly three choices:
(1) Creating statistics for all the columns for a given set of tables (e.g. for a given schema).
(2) Finding a way to automatically identify the columns which are actually used.
(3) Storing the list of tables and columns on which statistics should be build (however the list needs to be maintained manually). 

Fortunately, (1) can be solved relatively easy, based on the available table metadata, however it's not the best solution, as lot of statistics will be unnecessarily created. (2) is possible under certain architectures or additional effort. (3) takes time, though it's also an approachable solution.

What do we need?

For building the solution, we need table and statistics metadata, and the good news is that the old SQL Server queries still work. To minimize code's repetition, it makes sense to encapsulate the logic in views. For table metadata one can use the sys.objects DMV as is more general (one can replace sys.objects with sys.tables to focus only on tables):

-- drop the view (for cleaning)
-- DROP VIEW IF EXISTS dbo.vAdminObjectColumns

-- create view
CREATE OR ALTER VIEW dbo.vObjectColumns
AS 
-- object-based column metadata
SELECT sch.name + '.' + obj.name two_part_name
, sch.Name schema_name
, obj.name object_name
, col.name column_name
, obj.type
, CASE 
	WHEN col.is_ansi_padded = 1 and LEFT(udt.name , 1) = 'n' THEN col.max_length/2
	ELSE col.max_length
  END max_length
, col.precision 
, col.scale
, col.is_nullable 
, col.is_identity
, col.object_id
, col.column_id
, udt.name as data_type
, col.collation_name
, ROW_NUMBER() OVER(PARTITION BY col.object_id ORDER BY col.column_id) ranking FROM sys.columns col JOIN sys.types udt on col.user_type_id= udt.user_type_id JOIN sys.objects obj ON col.object_id = obj.object_id JOIN sys.schemas as sch on sch.schema_id = obj.schema_id -- testing the view SELECT obc.* FROM dbo.vObjectColumns obc WHERE obc.object_name LIKE '<table name>%' AND obc.schema_name = 'CRM' AND obc.type = 'U' ORDER BY obc.two_part_name , Ranking

The view can be used also as basis for getting the defined stats:

-- drop the view (for cleaning)
-- DROP VIEW IF EXISTS dbo.vAdminObjectStats

-- create view 
CREATE OR ALTER VIEW dbo.vObjectStats
AS
-- object-based column statistics
SELECT obc.two_part_name + '.' + QuoteName(stt.name) three_part_name
, obc.two_part_name
, obc.schema_name
, obc.object_name
, obc.column_name
, stt.name stats_name
, STATS_DATE(stt.[object_id], stt.stats_id) AS last_updated
, stt.auto_created
, stt.user_created
, stt.no_recompute
, stt.has_filter 
, stt.filter_definition
, stt.is_temporary 
, stt.is_incremental 
, stt.auto_drop 
, stt.stats_generation_method_desc
, stt.[object_id]
, obc.type 
, stt.stats_id
, stc.stats_column_id
, stc.column_id
FROM dbo.vObjectColumns obc
     LEFT JOIN sys.stats_columns stc 
	   ON stc.object_id = obc.object_id
	  AND stc.column_id = obc.column_id 
          LEFT JOIN sys.stats stt
            ON stc.[object_id] = stt.[object_id] 
           AND stc.stats_id = stt.stats_id

-- testing the view 
SELECT * FROM dbo.vObjectStats obs WHERE (obs.auto_created = 1 OR obs.user_created = 1) AND obs.type = 'U' AND obs.object_name = '<table name>' ORDER BY obs.two_part_name , obs.column_id

Now we have a basis for the next step. However, before using the stored procedure define below, one should use the last query and check whether statistics were defined before on a table. Use for testing also a table for which you know that statistics are available.

Create Statistics

The code below is based on a similar stored procedure available in the Microsoft documentation (see [1]). It uses a table's column metadata, stores them in a temporary table and then looks through each record, create the DDL script and runs it:

-- drop procedure (for cleaning)
--DROP PROCEDURE dbo.pCreateStatistics

-- create stored procedure
CREATE OR ALTER PROCEDURE dbo.pCreateStatistics
(   @schema_name nvarchar(50)
,   @table_name nvarchar(100)
)
AS
-- creates statistics for serverless SQL pool
BEGIN
	DECLARE @query as nvarchar(1000) = ''
	DECLARE @index int = 1, @nr_records int = 0

	-- drop temporary table if it exists 
	DROP TABLE IF EXISTS #stats_ddl;

	-- create temporary table 
	CREATE TABLE #stats_ddl( 
	  schema_name nvarchar(50)
	, table_name nvarchar(128)
	, column_name nvarchar(128)
	, ranking int
	);

	-- fill table
	INSERT INTO #stats_ddl
	SELECT obc.schema_name
	, obc.object_name
	, obc.column_name 
	, ROW_NUMBER() OVER(ORDER BY obc.schema_name, obc.object_name) ranking
	FROM dbo.vObjectColumns obc
	WHERE obc.type = 'U' -- tables
	  AND IsNull(@schema_name, obc.schema_name) = obc.schema_name 
	  AND IsNull(@table_name, obc.object_name) = obc.object_name

	SET @nr_records = (SELECT COUNT(*) FROM #stats_ddl)

	WHILE @index <= @nr_records
	BEGIN
		SET @query = (SELECT 'CREATE STATISTICS '+ QUOTENAME('stat_' + schema_name + '_' + table_name + '_' + column_name) + ' ON '+ QUOTENAME(schema_name) + '.' + QUOTENAME(table_name) + '(' + QUOTENAME(column_name) + ')' 
			   FROM #stats_ddl ddl
			   WHERE ranking = @index);

		BEGIN TRY
		        -- execute ddl
			EXEC sp_executesql @query;
		END TRY
		BEGIN CATCH
			SELECT 'create failed for ' + @query;
		END CATCH

		SET @index+=1;
	END

	DROP TABLE #stats_ddl;
END


-- test stored procedure (various scenario)
EXEC dbo.pCreateStatistics '<schema name>', '<table name>' -- based on schema & table
EXEC dbo.pCreateStatistics '<schema name>', NULL -- based on a schema
EXEC dbo.pCreateStatistics NULL, '<table name>' -- based on a table

Notes:
IMPORTANT!!! I recommend testing the stored procedure in a test environment first for a few tables and not for a whole schema. If there are too many tables, this will take time.

Please note that rerunning the stored procedure without deleting previously the statitics on the tables in scope will make the procedure raise failures for each column (behavior by design), though the error messages can be surpressed by commenting the code, if needed. One can introduce further validation, e.g. considering only the columns which don't have a statistic define on them.

Further Steps?

What can we do to improve the code? It would be great if we could find a way to identify the columns which are used in the queries. It is possible to retrieve the queries run in serverless SQL pool, however identifying the tables and columns from there or a similar source is not a straightforward solution. 

The design of views based on the external tables can help in the process! I prefer to build on top of the external tables a first level of views (aka "base views") that include only the fields in use (needed by the business) ordered and "grouped" together based on their importance or certain characteristics. The views are based solely on the external table and thus contain no joins. They can include conversions of data types, translations of codes into meaningful values, and quite seldom filters on the data. However, for traceability the name of the columns don't change! This means that if view's name is easily identifiable based on external table's name, we could check view's columns against the ones of the external table and create statistics only for the respective columns. Using a unique prefix (e.g. "v") to derive views' name from tables' name would do the trick.

To do that, we need to create a view that reflects the dependencies between objects (we'll be interested only in external tables vs views dependencies):

-- drop view (for cleaning)
-- DROP VIEW IF EXISTS dbo.vObjectsReferenced

-- create view
CREATE OR ALTER VIEW dbo.vObjectsReferenced
AS 
-- retrieving the objects referenced 
SELECT QuoteName(sch.name) + '.' + QuoteName(obj.name) AS two_part_name 
, obj.object_id 
, obj.schema_id 
, sch.name schema_name 
, obj.name object_name 
, obj.type
, QuoteName(scr.name) + '.'+ QuoteName(sed.referenced_entity_name) AS ref_two_part_name 
, obr.object_id ref_object_id
, obj.schema_id ref_schema_id 
, scr.name ref_schema_name 
, obr.name ref_object_name 
, obr.type ref_type
FROM sys.sql_expression_dependencies sed 
     JOIN sys.objects obj
       ON obj.object_id = sed.referencing_id 
	      JOIN sys.schemas as sch
	        ON obj.schema_id = sch.schema_id 
	 JOIN sys.objects obr
	   ON sed.referenced_id = obr.object_id
	      JOIN sys.schemas as scr
	        ON obr.schema_id = scr.schema_id

-- testing the view
SELECT top 10 *
FROM dbo.vObjectsReferenced
WHERE ref_type = 'U'

With this, the query used above to fill the table becomes:

-- fill table query with column selection 
SELECT obc.schema_name
, obc.object_name
, obc.column_name 
, ROW_NUMBER() OVER(ORDER BY obc.schema_name, obc.object_name) ranking
FROM dbo.vObjectColumns obc
WHERE obc.type = 'U' -- tables
	AND IsNull(@schema_name, obc.schema_name) = obc.schema_name 
	AND IsNull(@table_name, obc.object_name) = obc.object_name
	AND EXISTS ( -- select only columns referenced in views
	    SELECT * 
		FROM dbo.vObjectsReferenced obr 
		    JOIN dbo.vAdminObjectColumns obt
			ON obr.object_id = obt.object_id 
		WHERE obt.type = 'V' -- view
		AND obr.object_name  =  'v' + obr.ref_object_name
		AND obc.object_id = obr.ref_object_id
		AND obc.column_name = obt.column_name);

This change will reduce the number of statistics created on average by 50-80%. Of course, there will be also cases in which further statistics need to be added manually. One can use this as input for an analysis of the columns used and store the metadata in a file, do changes to it and base on it statistics' creation. 

Drop Statistics

Dropping the indexes resumes to using the dbo.vObjectStats view created above for the schema and/or table provided as parameter. The logic is similar to statistics' creation:

-- drop stored procedure (for cleaning)
-- DROP PROCEDURE IF EXISTS dbo.pDropStatistics

-- create procedure
CREATE OR ALTER PROCEDURE dbo.pDropStatistics
(   @schema_name nvarchar(50)
,   @table_name nvarchar(128)
)
AS
-- drop statistics for a schema and/or external table in serverless SQL pool
BEGIN

	DECLARE @query as nvarchar(1000) = ''
	DECLARE @index int = 1, @nr_records int = 0

	-- drop temporary table if it exists 
	DROP TABLE IF EXISTS #stats_ddl;

	-- create temporary table 
	CREATE TABLE #stats_ddl( 
	 three_part_name nvarchar(128)
	, ranking int
	);

	-- fill table
	INSERT INTO #stats_ddl
	SELECT obs.three_part_name
	, ROW_NUMBER() OVER(ORDER BY obs.three_part_name) ranking
	FROM dbo.vObjectStats obs
	WHERE obs.type = 'U' -- tables
	  AND IsNull(@schema_name, obs.schema_name) = obs.schema_name 
	  AND IsNull(@table_name, obs.object_name) = obs.object_name

	SET @nr_records = (SELECT COUNT(*) FROM #stats_ddl)

	WHILE @index <= @nr_records
	BEGIN
   
		SET @query = (SELECT 'DROP STATISTICS ' + ddl.three_part_name
			   FROM #stats_ddl ddl
			   WHERE ranking = @index);

		BEGIN TRY
		        -- execute ddl
			EXEC sp_executesql @query;
		END TRY
		BEGIN CATCH
			SELECT 'drop failed for ' + @query;
		END CATCH

		SET @index+=1;
	END

	DROP TABLE #stats_ddl;
END

Note:
IMPORTANT!!!
I recommend testing the stored procedure in a test environment first for a few tables and not for a whole schema. If there are too many tables, this will take time.

Closing Thoughts

The solution for statistics' creation is not perfect, though it's a start! It would have been great if such a feature would be provided by Microsoft, and probably they will, given the importance of statistics of identifying an optimal plan. It would be intersting to understand how much statistics help in a distributed environment and what's the volume of data processed for this purpose. 

Please let me know if you found other workarounds for statistics's automation.

Happy coding!

References:
[1] Microsoft Learn (2022) Statistics in Synapse SQL (link)

Related Posts Plugin for WordPress, Blogger...

About Me

My photo
IT Professional with more than 22 years experience in IT in the area of full life-cycle of Web/Desktop/Database Applications Development, Software Engineering, Consultancy, Data Management, Data Quality, Data Migrations, Reporting, ERP implementations & support, Team/Project/IT Management, etc.