28 October 2018

🔭Data Science: Limits (Just the Quotes)

"Whatever lies beyond the limits of experience, and claims another origin than that of induction and deduction from established data, is illegitimate." (George H Lewes, "The Foundations of a Creed", 1875)

"It is difficult to understand why statisticians commonly limit their inquiries to Averages, and do not revel in more comprehensive views. Their souls seem as dull to the charm of variety as that of the native of one of our flat English counties, whose retrospect of Switzerland was that, if its mountains could be thrown into its lakes, two nuisances would be got rid of at once. An Average is but a solitary fact, whereas if a single other fact be added to it, an entire Normal Scheme, which nearly corresponds to the observed one, starts potentially into existence." (Sir Francis Galton, "Natural Inheritance", 1889)

"Physical research by experimental methods is both a broadening and a narrowing field. There are many gaps yet to be filled, data to be accumulated, measurements to be made with great precision, but the limits within which we must work are becoming, at the same time, more and more defined." (Elihu Thomson, "Annual Report of the Board of Regents of the Smithsonian Institution", 1899)

"Statistics may, for instance, be called the science of counting. Counting appears at first sight to be a very simple operation, which any one can perform or which can be done automatically; but, as a matter of fact, when we come to large numbers, e.g., the population of the United Kingdom, counting is by no means easy, or within the power of an individual; limits of time and place alone prevent it being so carried out, and in no way can absolute accuracy be obtained when the numbers surpass certain limits." (Sir Arthur L Bowley, "Elements of Statistics", 1901)

"The usefulness of the models in constructing a testable theory of the process is severely limited by the quickly increasing number of parameters which must be estimated in order to compare the predictions of the models with empirical results" (Anatol Rapoport, "Prisoner's Dilemma: A study in conflict and cooperation", 1965)

"A real change of theory is not a change of equations - it is a change of mathematical structure, and only fragments of competing theories, often not very important ones conceptually, admit comparison with each other within a limited range of phenomena." (Yuri I Manin, "Mathematics and Physics", 1981)

"Models are often used to decide issues in situations marked by uncertainty. However statistical differences from data depend on assumptions about the process which generated these data. If the assumptions do not hold, the inferences may not be reliable either. This limitation is often ignored by applied workers who fail to identify crucial assumptions or subject them to any kind of empirical testing. In such circumstances, using statistical procedures may only compound the uncertainty." (David A Greedman & William C Navidi, "Regression Models for Adjusting the 1980 Census", Statistical Science Vol. 1 (1), 1986)

"[…] an honest exploratory study should indicate how many comparisons were made […] most experts agree that large numbers of comparisons will produce apparently statistically significant findings that are actually due to chance. The data torturer will act as if every positive result confirmed a major hypothesis. The honest investigator will limit the study to focused questions, all of which make biologic sense. The cautious reader should look at the number of ‘significant’ results in the context of how many comparisons were made." (James L Mills, "Data torturing", New England Journal of Medicine, 1993)

"In spite of the insurmountable computational limits, we continue to pursue the many problems that possess the characteristics of organized complexity. These problems are too important for our well being to give up on them. The main challenge in pursuing these problems narrows down fundamentally to one question: how to deal with systems and associated problems whose complexities are beyond our information processing limits? That is, how can we deal with these problems if no computational power alone is sufficient?"  (George Klir, "Fuzzy sets and fuzzy logic", 1995)

"The larger, more detailed and complex the model - the less abstract the abstraction – the smaller the number of people capable of understanding it and the longer it takes for its weaknesses and limitations to be found out." (John Adams, "Risk", 1995)

"[...] the NFL theorems mean that if an algorithm does particularly well on average for one class of problems then it must do worse on average over the remaining problems. In particular, if an algorithm performs better than random search on some class of problems then in must perform worse than random search on the remaining problems. Thus comparisons reporting the performance of a particular algorithm with a particular parameter setting on a few sample problems are of limited utility. While such results do indicate behavior on the narrow range of problems considered, one should be very wary of trying to generalize those results to other problems." (David H Wolpert & William G Macready, "No free lunch theorems for optimization", IEEE Transactions on Evolutionary Computation 1 (1), 1997)

"No comparison between two values can be global. A simple comparison between the current figure and some previous value and convey the behavior of any time series. […] While it is simple and easy to compare one number with another number, such comparisons are limited and weak. They are limited because of the amount of data used, and they are weak because both of the numbers are subject to the variation that is inevitably present in weak world data. Since both the current value and the earlier value are subject to this variation, it will always be difficult to determine just how much of the difference between the values is due to variation in the numbers, and how much, if any, of the difference is due to real changes in the process." (Donald J Wheeler, "Understanding Variation: The Key to Managing Chaos" 2nd Ed., 2000)

"[…] an obvious difference between our best classifiers and human learning is the number of examples required in tasks such as object detection. […] the difficulty of a learning task depends on the size of the required hypothesis space. This complexity determines in turn how many training examples are needed to achieve a given level of generalization error. Thus the complexity of the hypothesis space sets the speed limit and the sample complexity for learning." (Tomaso Poggio & Steve Smale, "The Mathematics of Learning: Dealing with Data", Notices of the AMS, 2003)

"Every number has its limitations; every number is a product of choices that inevitably involve compromise. Statistics are intended to help us summarize, to get an overview of part of the world’s complexity. But some information is always sacrificed in the process of choosing what will be counted and how. Something is, in short, always missing. In evaluating statistics, we should not forget what has been lost, if only because this helps us understand what we still have." (Joel Best, "More Damned Lies and Statistics : How numbers confuse public issues", 2004)

"Statistics depend on collecting information. If questions go unasked, or if they are asked in ways that limit responses, or if measures count some cases but exclude others, information goes ungathered, and missing numbers result. Nevertheless, choices regarding which data to collect and how to go about collecting the information are inevitable." (Joel Best, "More Damned Lies and Statistics: How numbers confuse public issues", 2004)

"The Bayesian approach is based on the following postulates: (B1) Probability describes degree of belief, not limiting frequency. As such, we can make probability statements about lots of things, not just data which are subject to random variation. […] (B2) We can make probability statements about parameters, even though they are fixed constants. (B3) We make inferences about a parameter θ by producing a probability distribution for θ. Inferences, such as point estimates and interval estimates, may then be extracted from this distribution." (Larry A Wasserman, "All of Statistics: A concise course in statistical inference", 2004)

"A population that grows logistically, initially increases exponentially; then the growth lows down and eventually approaches an upper bound or limit. The most well-known form of the model is the logistic differential equation." (Linda J S Allen, "An Introduction to Mathematical Biology", 2007)

"Humans have difficulty perceiving variables accurately […]. However, in general, they tend to have inaccurate perceptions of system states, including past, current, and future states. This is due, in part, to limited ‘mental models’ of the phenomena of interest in terms of both how things work and how to influence things. Consequently, people have difficulty determining the full implications of what is known, as well as considering future contingencies for potential systems states and the long-term value of addressing these contingencies. " (William B. Rouse, "People and Organizations: Explorations of Human-Centered Design", 2007)

"The methodology of feedback design is borrowed from cybernetics (control theory). It is based upon methods of controlled system model’s building, methods of system states and parameters estimation (identification), and methods of feedback synthesis. The models of controlled system used in cybernetics differ from conventional models of physics and mechanics in that they have explicitly specified inputs and outputs. Unlike conventional physics results, often formulated as conservation laws, the results of cybernetical physics are formulated in the form of transformation laws, establishing the possibilities and limits of changing properties of a physical system by means of control." (Alexander L Fradkov, "Cybernetical Physics: From Control of Chaos to Quantum Control", 2007)

"All graphics present data and allow a certain degree of exploration of those same data. Some graphics are almost all presentation, so they allow just a limited amount of exploration; hence we can say they are more infographics than visualization, whereas others are mostly about letting readers play with what is being shown, tilting more to the visualization side of our linear scale. But every infographic and every visualization has a presentation and an exploration component: they present, but they also facilitate the analysis of what they show, to different degrees." (Alberto Cairo, "The Functional Art", 2011)

"There are limits on the data we can gather and the kinds of experiments we can perform."(Charles Wheelan, "Naked Statistics: Stripping the Dread from the Data", 2012)

"Learning theory claims that a machine learning algorithm can generalize well from a finite training set of examples. This seems to contradict some basic principles of logic. Inductive reasoning, or inferring general rules from a limited set of examples, is not logically valid. To logically infer a rule describing every member of a set, one must have information about every member of that set." (Ian Goodfellow et al, "Deep Learning", 2015)

"Science’s predictions are more trustworthy, but they are limited to what we can systematically observe and tractably model. Big data and machine learning greatly expand that scope. Some everyday things can be predicted by the unaided mind, from catching a ball to carrying on a conversation. Some things, try as we might, are just unpredictable. For the vast middle ground between the two, there’s machine learning." (Pedro Domingos, "The Master Algorithm", 2015)

"To make progress, every field of science needs to have data commensurate with the complexity of the phenomena it studies. [...] With big data and machine learning, you can understand much more complex phenomena than before. In most fields, scientists have traditionally used only very limited kinds of models, like linear regression, where the curve you fit to the data is always a straight line. Unfortunately, most phenomena in the world are nonlinear. [...] Machine learning opens up a vast new world of nonlinear models." (Pedro Domingos, "The Master Algorithm", 2015)

"Repeated observations of the same phenomenon do not always produce the same results, due to random noise or error. Sampling errors result when our observations capture unrepresentative circumstances, like measuring rush hour traffic on weekends as well as during the work week. Measurement errors reflect the limits of precision inherent in any sensing device. The notion of signal to noise ratio captures the degree to which a series of observations reflects a quantity of interest as opposed to data variance. As data scientists, we care about changes in the signal instead of the noise, and such variance often makes this problem surprisingly difficult." (Steven S Skiena, "The Data Science Design Manual", 2017)

"Regularization is particularly important when the amount of available data is limited. A neat biological interpretation of regularization is that it corresponds to gradual forgetting, as a result of which 'less important' (i.e., noisy) patterns are removed. In general, it is often advisable to use more complex models with regularization rather than simpler models without regularization." (Charu C Aggarwal, "Neural Networks and Deep Learning: A Textbook", 2018)

"The no free lunch theorems set limits on the range of optimality of any method. That is, each methodology has a ‘catchment area’ where it is optimal or nearly so. Often, intuitively, if the optimality is particularly strong then the effectiveness of the methodology falls off more quickly outside its catchment area than if its optimality were not so strong. Boosting is a case in point: it seems so well suited to binary classification that efforts to date to extend it to give effective classification (or regression) more generally have not been very successful. Overall, it remains to characterize the catchment areas where each class of predictors performs optimally, performs generally well, or breaks down." (Bertrand S Clarke & Jennifer L. Clarke, "Predictive Statistics: Analysis and Inference beyond Models", 2018)

"Unless we’re collecting data ourselves, there’s a limit to how much we can do to combat the problem of missing data. But we can and should remember to ask who or what might be missing from the data we’re being told about. Some missing numbers are obvious […]. Other omissions show up only when we take a close look at the claim in question." (Tim Harford, "The Data Detective: Ten easy rules to make sense of statistics", 2020)

"Despite their predictive power, most analytics and data science practices ignore relationships because it has been historically challenging to process them at scale." (Jesús Barrasa et al, "Knowledge Graphs: Data in Context for Responsive Businesses", 2021)

"Visualisation is fundamentally limited by the number of pixels you can pump to a screen. If you have big data, you have way more data than pixels, so you have to summarise your data. Statistics gives you lots of really good tools for this." (Hadley Wickham)

27 October 2018

🔭Data Science: Research (Just the Quotes)

"The aim of research is the discovery of the equations which subsist between the elements of phenomena." (Ernst Mach, 1898)

"[…] scientific research is somewhat like unraveling complicated tangles of strings, in which luck is almost as vital as skill and accurate observation." (Ernst Mach, "Knowledge and Error: Sketches on the Psychology of Enquiry", 1905)

"Research is fundamentally a state of mind involving continual re­examination of doctrines and axioms upon which current thought and action are based. It is, therefore, critical of existing practices." (Theobald Smith, "The Influence of Research in Bringing into Closer Relationship the Practice of Medicine and Public Health Activities", American Journal of Medical Sciences, 1929)

"In every important advance the physicist finds that the fundamental laws are simplified more and more as experimental research advances. He is astonished to notice how sublime order emerges from what appeared to be chaos. And this cannot be traced back to the workings of his own mind but is due to a quality that is inherent in the world of perception." (Albert Einstein, 1932)

"Statistics is a scientific discipline concerned with collection, analysis, and interpretation of data obtained from observation or experiment. The subject has a coherent structure based on the theory of Probability and includes many different procedures which contribute to research and development throughout the whole of Science and Technology." (Egon Pearson, 1936)

"A successful hypothesis is not necessarily a permanent hypothesis, but it is one which stimulates additional research, opens up new fields, or explains and coordinates previously unrelated facts." (Farrington Daniels, "Outlines of Physical Chemistry", 1948)

"The hypothesis is the principal intellectual instrument in research. Its function is to indicate new experiments and observations and it therefore sometimes leads to discoveries even when not correct itself. We must resist the temptation to become too attached to our hypothesis, and strive to judge it objectively and modify it or discard it as soon as contrary evidence is brought to light. Vigilance is needed to prevent our observations and interpretations being biased in favor of the hypothesis. Suppositions can be used without being believed." (William I B Beveridge, "The Art of Scientific Investigation", 1950)

"Mathematical models for empirical phenomena aid the development of a science when a sufficient body of quantitative information has been accumulated. This accumulation can be used to point the direction in which models should be constructed and to test the adequacy of such models in their interim states. Models, in turn, frequently are useful in organizing and interpreting experimental data and in suggesting new directions for experimental research." (Robert R. Bush & Frederick Mosteller, "A Mathematical Model for Simple Learning", Psychological Review 58, 1951)

"Statistics is the fundamental and most important part of inductive logic. It is both an art and a science, and it deals with the collection, the tabulation, the analysis and interpretation of quantitative and qualitative measurements. It is concerned with the classifying and determining of actual attributes as well as the making of estimates and the testing of various hypotheses by which probable, or expected, values are obtained. It is one of the means of carrying on scientific research in order to ascertain the laws of behavior of things - be they animate or inanimate. Statistics is the technique of the Scientific Method." (Bruce D Greenschields & Frank M Weida, "Statistics with Applications to Highway Traffic Analyses", 1952)

"In a general way it may be said that to think in terms of systems seems the most appropriate conceptual response so far available when the phenomena under study - at any level and in any domain--display the character of being organized, and when understanding the nature of the interdependencies constitutes the research task. In the behavioral sciences, the first steps in building a systems theory were taken in connection with the analysis of internal processes in organisms, or organizations, when the parts had to be related to the whole." (Fred Emery, "The Causal Texture of Organizational Environments", 1963)

"If the null hypothesis is not rejected, [Sir Ronald] Fisher's position was that nothing could be concluded. But researchers find it hard to go to all the trouble of conducting a study only to conclude that nothing can be concluded." (Frank L Schmidt, "Statistical Significance Testing and Cumulative Knowledge", "Psychology: Implications for Training of Researchers, Psychological Methods" Vol. 1 (2), 1996)

"Statisticians can calculate the probability that such random samples represent the population; this is usually expressed in terms of sampling error [...]. The real problem is that few samples are random. Even when researchers know the nature of the population, it can be time-consuming and expensive to draw a random sample; all too often, it is impossible to draw a true random sample because the population cannot be defined. This is particularly true for studies of social problems. [...] The best samples are those that come as close as possible to being random." (Joel Best, "Damned Lies and Statistics: Untangling Numbers from the Media, Politicians, and Activists", 2001)

"Meta-analytic thinking is the consideration of any result in relation to previous results on the same or similar questions, and awareness that combination with future results is likely to be valuable. Meta-analytic thinking is the application of estimation thinking to more than a single study. It prompts us to seek meta-analysis of previous related studies at the planning stage of research, then to report our results in a way that makes it easy to include them in future meta-analyses. Meta-analytic thinking is a type of estimation thinking, because it, too, focuses on estimates and uncertainty." (Geoff Cumming, "Understanding the New Statistics", 2012)

"Statistical cognition is concerned with obtaining cognitive evidence about various statistical techniques and ways to present data. It’s certainly important to choose an appropriate statistical model, use the correct formulas, and carry out accurate calculations. It’s also important, however, to focus on understanding, and to consider statistics as communication between researchers and readers." (Geoff Cumming, "Understanding the New Statistics", 2012)

"Another way to secure statistical significance is to use the data to discover a theory. Statistical tests assume that the researcher starts with a theory, collects data to test the theory, and reports the results - whether statistically significant or not. Many people work in the other direction, scrutinizing the data until they find a pattern and then making up a theory that fits the pattern." (Gary Smith, "Standard Deviations", 2014)

"How can we tell the difference between a good theory and quackery? There are two effective antidotes: common sense and fresh data. If it is a ridiculous theory, we shouldn’t be persuaded by anything less than overwhelming evidence, and even then be skeptical. Extraordinary claims require extraordinary evidence. Unfortunately, common sense is an uncommon commodity these days, and many silly theories have been seriously promoted by honest researchers." (Gary Smith, "Standard Deviations", 2014)

"These practices - selective reporting and data pillaging - are known as data grubbing. The discovery of statistical significance by data grubbing shows little other than the researcher’s endurance. We cannot tell whether a data grubbing marathon demonstrates the validity of a useful theory or the perseverance of a determined researcher until independent tests confirm or refute the finding. But more often than not, the tests stop there. After all, you won’t become a star by confirming other people’s research, so why not spend your time discovering new theories? The data-grubbed theory consequently sits out there, untested and unchallenged." (Gary Smith, "Standard Deviations", 2014)

"A conceptual model is a framework that is initially used in research to outline the possible courses of action or to present an idea or thought. When a conceptual model is developed in a logical manner, it will provide a rigor to the research process." (N Elangovan & R Rajendran, "Conceptual Model: A Framework for Institutionalizing the Vigor in Business Research", 2015)

"Even properly done statistics can’t be trusted. The plethora of available statistical techniques and analyses grants researchers an enormous amount of freedom when analyzing their data, and it is trivially easy to ‘torture the data until it confesses’." (Alex Reinhart, "Statistics Done Wrong: The Woefully Complete Guide", 2015)

"The correlational technique known as multiple regression is used frequently in medical and social science research. This technique essentially correlates many independent (or predictor) variables simultaneously with a given dependent variable (outcome or output). It asks, 'Net of the effects of all the other variables, what is the effect of variable A on the dependent variable?' Despite its popularity, the technique is inherently weak and often yields misleading results. The problem is due to self-selection. If we don’t assign cases to a particular treatment, the cases may differ in any number of ways that could be causing them to differ along some dimension related to the dependent variable. We can know that the answer given by a multiple regression analysis is wrong because randomized control experiments, frequently referred to as the gold standard of research techniques, may give answers that are quite different from those obtained by multiple regression analysis." (Richard E Nisbett, "Mindware: Tools for Smart Thinking", 2015)

"Collecting data through sampling therefore becomes a never-ending battle to avoid sources of bias. [...] While trying to obtain a random sample, researchers sometimes make errors in judgment about whether every person or thing is equally likely to be sampled." (Daniel J Levitin, "Weaponized Lies", 2017)

"Samples give us estimates of something, and they will almost always deviate from the true number by some amount, large or small, and that is the margin of error. […] The margin of error does not address underlying flaws in the research, only the degree of error in the sampling procedure. But ignoring those deeper possible flaws for the moment, there is another measurement or statistic that accompanies any rigorously defined sample: the confidence interval." (Daniel J Levitin, "Weaponized Lies", 2017)

"The job of the statistician is to formulate an inventory of all those things that matter in order to obtain a representative sample. Researchers have to avoid the tendency to capture variables that are easy to identify or collect data on - sometimes the things that matter are not obvious or are difficult to measure." (Daniel J Levitin, "Weaponized Lies", 2017)

💎SQL Reloaded: Wish List (Part I: Replace From)

With SQL Server 2017 Microsoft introduced the Trim function, which not only replaces the combined use of the LTrim and RTrim functions, but also removes other specified characters from the start or end of a string (see my previous post):

-- Trim special characters 
SELECT Trim ('# ' FROM '# 843984 #') Example1
, Trim ('[]' FROM '[843984]') Example2
Output:
Example1   Example2
---------- --------
843984     843984

Similarly, I wish I had a function that replaces special characters within the whole string (not only at its ends), for example:

-- Replace special characters 
SELECT Replace ('# ' FROM '# 84#3984 #', '') Example1
, Replace ('[]' FROM '[84][39][84]', '') Example2

Unfortunately, as far as I know, there is no such simple function. Therefore, in order to replace the “]”, “[“ and “#” special characters from a string one is forced either to write verbose expressions like in the first example or to encapsulate the logic into a user-defined function like in the second:

-- a chain of replacements 
SELECT Replace(Replace(Replace('[#84][#39][#84]', '[' , ''), ']', ''), '#', '') Example1

-- encapsulated replacements
CREATE FUNCTION [dbo].[ReplaceSpecialChars](
  @string nvarchar(max)
, @replacer as nvarchar(1) 
) RETURNS nvarchar(max)
-- replaces the special characters from a string with a given replacer
AS
BEGIN   
  IF CharIndex('#', @string) > 0  
     SET @string = replace(@string, '#', @replacer) 
        
  IF CharIndex('[', @string) > 0  
     SET @string = replace(@string, '[', @replacer) 
    
  IF CharIndex(']', @string) > 0  
     SET @string = replace(@string, ']', @replacer) 
                                
  RETURN Trim(@string)
END

-- testing the function 
SELECT [dbo].[ReplaceSpecialChars]('[#84][#39][#84]', '') Example2

In data cleaning the list of characters to replace can get considerably long (somewhere between 10 and 30 characters). In addition, one may deal with different scenarios in which the characters to be replaced differ, and thus be forced to write multiple such functions.

To the list of special characters one often needs to add language-specific characters like ß, ü, ä and ö, which are replaced with ss, ue, ae and oe, respectively (see also the post).
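A partial workaround exists on SQL Server 2017 and higher: the Translate function maps each character from a list to a corresponding character, so a whole list of special characters can be mapped to a single sentinel character, which is then removed with one Replace call. The sketch below assumes '~' does not occur in the data; one-to-many substitutions like ß to ss still require chained replacements:

```sql
-- removing a list of special characters via Translate (SQL Server 2017+)
SELECT Replace(Translate('[#84][#39][#84]', '[]#', '~~~'), '~', '') Example1

-- language-specific characters still require chained replacements
SELECT Replace(Replace(N'Straße Müller', N'ß', 'ss'), N'ü', 'ue') Example2
```

This keeps the query shorter than a chain of Replace calls, though it still doesn't remove the need for a generic replace-from-list function.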

Personally, I would find such a replace function more than useful. What about you? 

Happy coding!

💎SQL Reloaded: Drop If Exists (Before and After)

    One of the activities of a database developer/administrator is to create and drop objects on the fly. While an object's creation always involves aspects implied by its definition, the verbose syntax needed for its destruction has long seemed unnecessary. For example, for dropping a table, view, stored procedure, function or index, the most used objects, one would need to write statements like these:

-- dropping a table 
IF  EXISTS (SELECT * FROM sys.objects WHERE object_id = OBJECT_ID(N'[dbo].[TestTable]') AND type in (N'U'))
DROP TABLE [dbo].[TestTable]

-- dropping a view 
IF  EXISTS (SELECT * FROM sys.views WHERE object_id = OBJECT_ID(N'[dbo].[TestView]'))
DROP VIEW [dbo].[TestView]

-- dropping a stored procedure 
IF  EXISTS (SELECT * FROM sys.objects WHERE object_id = OBJECT_ID(N'[dbo].[TestProcedure]') AND type in (N'P', N'PC'))
DROP PROCEDURE [dbo].[TestProcedure]

-- dropping a function
IF  EXISTS (SELECT * FROM sys.objects WHERE object_id = OBJECT_ID(N'[dbo].[TestFunction]') AND type in (N'FN', N'IF', N'TF', N'FS', N'FT'))
DROP FUNCTION [dbo].[TestFunction]
 
-- dropping an index
IF EXISTS (SELECT Name FROM sysindexes WHERE Name = 'IX_TestTable') 
DROP INDEX dbo.TestTable.IX_TestTable

   Even if Copy-Paste does its magic and SQL Server allows generating scripts for existing objects, there’s still some work to be done in order to drop an object. Fortunately, with SQL Server 2016 Microsoft introduced a simplified syntax for dropping an object, namely DROP IF EXISTS.

   The general syntax:

DROP <object_type> [ IF EXISTS ] <object_name>

  The above statements can be written as follows:

-- dropping the table 
DROP TABLE IF EXISTS dbo.TestTable

-- dropping the view 
DROP VIEW IF EXISTS dbo.TestView 

-- dropping the procedure 
DROP PROCEDURE IF EXISTS dbo.TestProcedure

-- dropping the function 
DROP FUNCTION IF EXISTS dbo.TestFunction
 
-- dropping the index 
DROP INDEX IF EXISTS dbo.TestTable.IX_TestTable

  Aggregates, assemblies, roles, triggers, rules, databases, schemas, users, sequences, synonyms, etc. can be dropped similarly. The scripts will also run when the objects don’t exist.

  An object can't be dropped if explicit dependencies exist on it, e.g. when the table is referenced by a FOREIGN KEY constraint. For each object there are specific rules that apply, therefore for more details check the documentation.
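To illustrate the point (the table and constraint names below are invented for the example), attempting to drop a table referenced by a foreign key fails even with the new syntax, so the referencing constraint must be dropped first:

```sql
-- a pair of tables linked by a foreign key
CREATE TABLE dbo.TestParent (Id int NOT NULL PRIMARY KEY)

CREATE TABLE dbo.TestChild (Id int NOT NULL
, ParentId int NOT NULL
, CONSTRAINT FK_TestChild_TestParent FOREIGN KEY (ParentId) REFERENCES dbo.TestParent (Id))

-- fails as long as the foreign key exists
--DROP TABLE IF EXISTS dbo.TestParent

-- dropping the referencing constraint first, then the tables
ALTER TABLE dbo.TestChild
DROP CONSTRAINT IF EXISTS FK_TestChild_TestParent

DROP TABLE IF EXISTS dbo.TestParent
DROP TABLE IF EXISTS dbo.TestChild
```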

  To explore the functionality here are the definitions of the above objects and the further scripts to test them:

-- creating the test table
CREATE TABLE dbo.TestTable(City nvarchar(50)
, PostalCode nvarchar(50))

-- creating the test view 
CREATE VIEW dbo.TestView 
AS
SELECT 'Test' as Result

-- creating the test stored procedure  
CREATE PROCEDURE dbo.TestProcedure
AS
BEGIN
SELECT 'Test' as Result
END

-- creating the test function
CREATE FUNCTION dbo.TestFunction()
RETURNS nvarchar(50)
BEGIN
    RETURN 'Test'
END
 
-- creating the test index
CREATE NONCLUSTERED INDEX [IX_TestTable] ON [dbo].[TestTable]
(
 [PostalCode] ASC
)


--testing the table
SELECT *
FROM dbo.TestTable

-- testing the view 
SELECT *
FROM dbo.TestView 

-- testing the procedure 
EXEC dbo.TestProcedure

-- testing the function
SELECT dbo.TestFunction() as Result

   Moreover, IF EXISTS can be used when dropping a constraint or a column of a table:

-- adding a new column 
ALTER TABLE dbo.TestTable
ADD DateFrom datetime2(0)

-- adding a constraint on it
ALTER TABLE dbo.TestTable
ADD CONSTRAINT [DF_DateFrom_Default]  DEFAULT (GetDate()) FOR [DateFrom]

-- inserting test data
INSERT INTO dbo.TestTable(City, PostalCode)
VALUES ('New York', 'XZY')

--testing the changes
SELECT *
FROM dbo.TestTable

-- dropping the constraint
ALTER TABLE dbo.TestTable
DROP CONSTRAINT IF EXISTS DF_DateFrom_Default

-- dropping a column from a table
ALTER TABLE dbo.TestTable
DROP COLUMN IF EXISTS DateFrom

--testing the changes
SELECT *
FROM dbo.TestTable

    If a constraint exists on the column, the constraint must be dropped first and the column after that, as in the above example.

Happy coding!

26 October 2018

💎SQL Reloaded: Trimming Strings (Before and After)

   One of the annoying things when writing queries is the repetitive, lengthy expressions that obfuscate the queries, making them more difficult to read, understand and troubleshoot; sometimes such expressions come with a performance penalty as well.

   Loading data from Excel, text files and other sources involving poorly formatted data often requires trimming (all) the text values. In the early versions of SQL Server, the equivalent of a Trim function was obtained by combining the LTrim and RTrim functions, which resulted in code like this (based on the AdventureWorks 2014 database):

-- trimming via LTrim, RTrim 
SELECT LTrim(RTrim(AddressLine1)) AddressLine1
, LTrim(RTrim(AddressLine2)) AddressLine2
, LTrim(RTrim(City)) City
, LTrim(RTrim(PostalCode)) PostalCode
FROM Person.Address

  This might not look like much, though imagine having to deal with 30-50 text attributes, code not written in a readable format (e.g. the way it is stored in the database), and attributes that require further processing (e.g. removal of special characters, splitting, concatenating).

   Developers often preferred to encapsulate the call to the two functions within a user-defined function:

-- Trim user-defined function
CREATE FUNCTION dbo.Trim(
@string nvarchar(max))
RETURNS nvarchar(max)
BEGIN
    RETURN LTrim(RTrim(@string))
END

   With it the code is somewhat simplified, though not by much, and it incurs the cost of calling a user-defined function:

-- trimming via dbo.Trim
SELECT dbo.Trim(AddressLine1) AddressLine1
, dbo.Trim(AddressLine2) AddressLine2
, dbo.Trim(City) City
, dbo.Trim(PostalCode) PostalCode
FROM Person.Address

    SQL Server 2017 introduced the Trim function, which not only replaces the combined use of the LTrim and RTrim functions, but can also remove other specified characters (including CR, LF and Tab) from the start or end of a string.

    By default the function removes spaces from both sides of a string:

-- trimming via Trim
SELECT Trim(AddressLine1) AddressLine1
, Trim(AddressLine2) AddressLine2
, Trim(City) City
, Trim(PostalCode) PostalCode
FROM Person.Address

    When a set of characters is provided the function removes the specified characters:

SELECT Trim ('#' FROM '# 843984') Example1
, Trim ('[]' FROM '[843984]') Example2
, Trim ('+' FROM '+49127298000') Example3
, Trim ('+-' FROM '+ 49-12729-8000 ') + ';' Example4
, Trim ('+ ' FROM '+ 49-12729-8000 ') + ';' Example5
, ASCII(Left(Trim (char(13) FROM char(13) + '49127298000'), 1)) Example6

Output:
Example1  Example2  Example3     Example4          Example5        Example6
--------  --------  -----------  ----------------  --------------  --------
 843984   843984    49127298000  49-12729-8000 ;   49-12729-8000;  52

  As can be seen, when other characters need to be removed together with the space, the space must be included in the list of characters as well.

Notes:
The dbo.Trim function can be created in SQL Server 2017 environments as well.
The collation of the database will affect the behavior of the Trim function, therefore the results might look different when a case-sensitive collation is used.
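To illustrate the note on collation, here is a hypothetical sketch in which the COLLATE clauses force the behavior regardless of the database default:

-- trimming under explicit collations
SELECT Trim('x' FROM 'XtestX' COLLATE Latin1_General_CS_AS) CaseSensitive
, Trim('x' FROM 'XtestX' COLLATE Latin1_General_CI_AS) CaseInsensitive

Under the case-sensitive collation the 'X' characters should remain in place, while under the case-insensitive one they should be removed.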

Happy coding!



25 October 2018

💎SQL Reloaded: Cursor and Linked Server for Data Import

   There are times when one needs to pull data (repeatedly) from one or more databases for analysis, and SSIS is not available or there is not much time to create individual packages or data imports. In such scenarios one has to rely on the SQL Server engine's built-in support. The data can be easily imported via a linked server into ad-hoc created tables in a local database. In fact, the process can be partially automated with the use of a cursor that iterates through a predefined set of tables. For exemplification I will use a SELECT instead of an EXEC, just to render the results:

-- cleaning up
-- DROP TABLE dbo.LoadTables 

-- defining the scope
SELECT *
INTO dbo.LoadTables
FROM (VALUES ('dbo.InventTable')
           , ('dbo.InventDimCombination')
           , ('dbo.InventDim')
           , ('dbo.InventItemLocation')) DAT ([Table])


-- creating the stored procedure 
CREATE PROCEDURE dbo.pLoadData(
    @Table as nvarchar(50))
AS
/* loads the set of tables defined in dbo.LoadTables */
BEGIN
   DECLARE @cTable nvarchar(50)

   -- creating the cursor
   DECLARE TableList CURSOR FOR
   SELECT [Table]
   FROM dbo.LoadTables
   WHERE [Table] = IsNull(@Table, [Table])
   ORDER BY [Table]

   -- opening the cursor
   OPEN TableList 

   -- fetching next record 
   FETCH NEXT FROM TableList
   INTO @cTable

   -- looping through each record 
   WHILE @@FETCH_STATUS = 0 
   BEGIN
        -- preparing the DROP TABLE statement
        SELECT (' DROP TABLE IF EXISTS ' + @cTable + '')

        -- preparing the SELECT INTO statement
        SELECT (' SELECT *' +
                ' INTO ' + @cTable +
                ' FROM [server].[database].[' + @cTable + ']')

        -- fetching the next record
        FETCH NEXT FROM TableList
        INTO @cTable
   END

   --closing the cursor
   CLOSE TableList 
   -- deallocating the cursor
   DEALLOCATE TableList 
END

Running the stored procedure for all the tables:

 -- Testing the procedure 
 EXEC dbo.pLoadData NULL -- loading all tables 

-- output 
 DROP TABLE IF EXISTS dbo.InventDim
 SELECT * INTO dbo.InventDim FROM [server].[database].[dbo.InventDim]

 DROP TABLE IF EXISTS dbo.InventDimCombination
 SELECT * INTO dbo.InventDimCombination FROM [server].[database].[dbo.InventDimCombination]

 DROP TABLE IF EXISTS dbo.InventItemLocation
 SELECT * INTO dbo.InventItemLocation FROM [server].[database].[dbo.InventItemLocation]

 DROP TABLE IF EXISTS dbo.InventTable
 SELECT * INTO dbo.InventTable FROM [server].[database].[dbo.InventTable]

Running the stored procedure for a specific table:

-- Testing the procedure 
EXEC dbo.pLoadData 'dbo.InventTable' -- loading a specific table

-- output 
DROP TABLE IF EXISTS dbo.InventTable
SELECT * INTO dbo.InventTable FROM [server].[database].[dbo.InventTable]
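To actually load the data rather than render the statements, the two SELECTs inside the loop of dbo.pLoadData can be replaced with EXEC calls over the same dynamically built strings (a sketch; [server] and [database] remain placeholders for the linked server and source database):

-- executing the statements (inside the WHILE loop)
EXEC ('DROP TABLE IF EXISTS ' + @cTable)
EXEC ('SELECT * INTO ' + @cTable + ' FROM [server].[database].[' + @cTable + ']')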

Notes:
Having an old example of using a cursor at hand (see Cursor and Lists), the whole mechanism for loading the data was available in 30 minutes or so.
Tables can be added or removed as needed, and the loading can be made more flexible by adding other parameters to the logic.
The solution is really easy to use and its performance is acceptable in comparison with SSIS packages.
Probably you have already observed the use of DROP TABLE IF EXISTS, introduced with SQL Server 2016 (see also this post).

Advantages:
The stored procedure can be extended to any database for which a linked server can be created.
Structural changes of the source tables are reflected in each load.
Tables can be quickly updated when needed just by executing the stored procedure.

Disadvantages:
Such solutions are more for personal use and their use should be avoided in a production environment.
The metadata will be temporarily unavailable during the time the procedure is run. Indexes need to be created after each load.

Happy Coding!

🔭Data Science: Conclusions (Just the Quotes)

"Before anything can be reasoned upon to a conclusion, certain facts, principles, or data, to reason from, must be established, admitted, or denied." (Thomas Paine, "Rights of Man", 1791) 

"In order to supply the defects of experience, we will have recourse to the probable conjectures of analogy, conclusions which we will bequeath to our posterity to be ascertained by new observations, which, if we augur rightly, will serve to establish our theory and to carry it gradually nearer to absolute certainty." (Johann H Lambert, "The System of the World", 1800)

"Such is the tendency of the human mind to speculation, that on the least idea of an analogy between a few phenomena, it leaps forward, as it were, to a cause or law, to the temporary neglect of all the rest; so that, in fact, almost all our principal inductions must be regarded as a series of ascents and descents, and of conclusions from a few cases, verified by trial on many." (Sir John Herschel, "A Preliminary Discourse on the Study of Natural Philosophy" , 1830)

"Just as data gathered by an incompetent observer are worthless - or by a biased observer, unless the bias can be measured and eliminated from the result - so also conclusions obtained from even the best data by one unacquainted with the principles of statistics must be of doubtful value." (William F White, "A Scrap-Book of Elementary Mathematics: Notes, Recreations, Essays", 1908)

"Ordinarily, facts do not speak for themselves. When they do speak for themselves, the wrong conclusions are often drawn from them. Unless the facts are presented in a clear and interesting manner, they are about as effective as a phonograph record with the phonograph missing." (Willard C Brinton, "Graphic Methods for Presenting Facts", 1919)

"The preliminary examination of most data is facilitated by the use of diagrams. Diagrams prove nothing, but bring outstanding features readily to the eye; they are therefore no substitutes for such critical tests as may be applied to the data, but are valuable in suggesting such tests, and in explaining the conclusions founded upon them." (Sir Ronald A Fisher, "Statistical Methods for Research Workers", 1925) 

"Observed facts must be built up, woven together, ordered, arranged, systematized into conclusions and theories by reflection and reason, if they are to have full bearing on life and the universe. Knowledge is the accumulation of facts. Wisdom is the establishment of relations. And just because the latter process is delicate and perilous, it is all the more delightful." (Gamaliel Bradford, "Darwin", 1926)

"The statistician’s job is to draw general conclusions from fragmentary data. Too often the data supplied to him for analysis are not only fragmentary but positively incoherent, so that he can do next to nothing with them. Even the most kindly statistician swears heartily under his breath whenever this happens". (Michael J Moroney, "Facts from Figures", 1927)

"All statistical analysis in business must aim at the control of action. The possible conclusions are: 1. Certain action must be taken. 2. No action is required. 3. Certain tendencies must be watched. 4. The analysis is not significant and either (a) certain further facts are required, or (b) there are no indications that further facts should be obtained." (John R Riggleman & Ira N Frisbee, "Business Statistics", 1938)

"Starting from statistical observations, it is possible to arrive at conclusions which are not less reliable or useful than those obtained in any other exact science. It is only necessary to apply a clear and precise concept of probability to such observations." (Richard von Mises, "Probability, Statistics, and Truth", 1939)

"The characteristic which distinguishes the present-day professional statistician, is his interest and skill in the measurement of the fallibility of conclusions." (George W Snedecor, "On a Unique Feature of Statistics", [address] 1948)

"The enthusiastic use of statistics to prove one side of a case is not open to criticism providing the work is honestly and accurately done, and providing the conclusions are not broader than indicated by the data. This type of work must not be confused with the unfair and dishonest use of both accurate and inaccurate data, which too commonly occurs in business. Dishonest statistical work usually takes the form of: (1) deliberate misinterpretation of data; (2) intentional making of overestimates or underestimates; and (3) biasing results by using partial data, making biased surveys, or using wrong statistical methods." (John R Riggleman & Ira N Frisbee, "Business Statistics", 1951)

"Another thing to watch out for is a conclusion in which a correlation has been inferred to continue beyond the data with which it has been demonstrated." (Darrell Huff, "How to Lie with Statistics", 1954)

"The statistics themselves prove nothing; nor are they at any time a substitute for logical thinking. There are […] many simple but not always obvious snags in the data to contend with. Variations in even the simplest of figures may conceal a compound of influences which have to be taken into account before any conclusions are drawn from the data." (Alfred R Ilersic, "Statistics", 1959)

"Predictions, prophecies, and perhaps even guidance – those who suggested this title to me must have hoped for such-even though occasional indulgences in such actions by statisticians has undoubtedly contributed to the characterization of a statistician as a man who draws straight lines from insufficient data to foregone conclusions!" (John W Tukey, "Where do We Go From Here?", Journal of the American Statistical Association, Vol. 55, No. 289, 1960)

"Model-making, the imaginative and logical steps which precede the experiment, may be judged the most valuable part of scientific method because skill and insight in these matters are rare. Without them we do not know what experiment to do. But it is the experiment which provides the raw material for scientific theory. Scientific theory cannot be built directly from the conclusions of conceptual models." (Herbert G Andrewartha," Introduction to the Study of Animal Population", 1961)

"Almost all efforts at data analysis seek, at some point, to generalize the results and extend the reach of the conclusions beyond a particular set of data. The inferential leap may be from past experiences to future ones, from a sample of a population to the whole population, or from a narrow range of a variable to a wider range. The real difficulty is in deciding when the extrapolation beyond the range of the variables is warranted and when it is merely naive. As usual, it is largely a matter of substantive judgment - or, as it is sometimes more delicately put, a matter of 'a priori nonstatistical considerations'." (Edward R Tufte, "Data Analysis for Politics and Policy", 1974)

"A mature science, with respect to the matter of errors in variables, is not one that measures its variables without error, for this is impossible. It is, rather, a science which properly manages its errors, controlling their magnitudes and correctly calculating their implications for substantive conclusions." (Otis D Duncan, "Introduction to Structural Equation Models", 1975)

"Just like the spoken or written word, statistics and graphs can lie. They can lie by not telling the full story. They can lead to wrong conclusions by omitting some of the important facts. [...] Always look at statistics with a critical eye, and you will not be the victim of misleading information." (Dyno Lowenstein, "Graphs", 1976)

"Crude measurement usually yields misleading, even erroneous conclusions no matter how sophisticated a technique is used." (Henry T Reynolds, "Analysis of Nominal Data", 1977)

"The word ‘induction’ has two essentially different meanings. Scientific induction is a process by which scientists make observations of particular cases, such as noticing that some crows are black, then leap to the universal conclusion that all crows are black. The conclusion is never certain. There is always the possibility that at least one unobserved crow is not black." (Martin Gardner, "Aha! Insight", 1978)

"Being experimental, however, doesn't necessarily make a scientific study entirely credible. One weakness of experimental work is that it can be out of touch with reality when its controls are so rigid that conclusions are valid only in the experimental situation and don't carry over into the real world." (Robert Hooke, "How to Tell the Liars from the Statisticians", 1983)

"In everyday life, 'estimation' means a rough and imprecise procedure leading to a rough and imprecise result. You 'estimate' when you cannot measure exactly. In statistics, on the other hand, 'estimation' is a technical term. It means a precise and accurate procedure, leading to a result which may be imprecise, but where at least the extent of the imprecision is known. It has nothing to do with approximation. You have some data, from which you want to draw conclusions and produce a 'best' value for some particular numerical quantity (or perhaps for several quantities), and you probably also want to know how reliable this value is, i.e. what the error is on your estimate." (Roger J Barlow, "Statistics: A guide to the use of statistical methods in the physical sciences", 1989)

"Statistical models for data are never true. The question whether a model is true is irrelevant. A more appropriate question is whether we obtain the correct scientific conclusion if we pretend that the process under study behaves according to a particular statistical model." (Scott Zeger, "Statistical reasoning in epidemiology", American Journal of Epidemiology, 1991)

"When looking at the end result of any statistical analysis, one must be very cautious not to over interpret the data. Care must be taken to know the size of the sample, and to be certain the method for gathering information is consistent with other samples gathered. […] No one should ever base conclusions without knowing the size of the sample and how random a sample it was. But all too often such data is not mentioned when the statistics are given - perhaps it is overlooked or even intentionally omitted." (Theoni Pappas, "More Joy of Mathematics: Exploring mathematical insights & concepts", 1991)

"Nature behaves in ways that look mathematical, but nature is not the same as mathematics. Every mathematical model makes simplifying assumptions; its conclusions are only as valid as those assumptions. The assumption of perfect symmetry is excellent as a technique for deducing the conditions under which symmetry-breaking is going to occur, the general form of the result, and the range of possible behaviour. To deduce exactly which effect is selected from this range in a practical situation, we have to know which imperfections are present." (Ian Stewart & Martin Golubitsky, "Fearful Symmetry", 1992)

"Visualization is an approach to data analysis that stresses a penetrating look at the structure of data. No other approach conveys as much information. […] Conclusions spring from data when this information is combined with the prior knowledge of the subject under investigation." (William S Cleveland, "Visualizing Data", 1993)

"Visualization is an effective framework for drawing inferences from data because its revelation of the structure of data can be readily combined with prior knowledge to draw conclusions. By contrast, because of the formalism of probabilistic methods, it is typically impossible to incorporate into them the full body of prior information." (William S Cleveland, "Visualizing Data", 1993)

"The science of statistics may be described as exploring, analyzing and summarizing data; designing or choosing appropriate ways of collecting data and extracting information from them; and communicating that information. Statistics also involves constructing and testing models for describing chance phenomena. These models can be used as a basis for making inferences and drawing conclusions and, finally, perhaps for making decisions." (Fergus Daly et al, "Elements of Statistics", 1995)

"'Garbage in, garbage out' is a sound warning for those in the computer field; it is every bit as sound in the use of statistics. Even if the “garbage” which comes out leads to a correct conclusion, this conclusion is still tainted, as it cannot be supported by logical reasoning. Therefore, it is a misuse of statistics. But obtaining a correct conclusion from faulty data is the exception, not the rule. Bad basic data (the 'garbage in') almost always leads to incorrect conclusions (the 'garbage out'). Unfortunately, incorrect conclusions can lead to bad policy or harmful actions." (Herbert F Spirer et al, "Misused Statistics" 2nd Ed, 1998)

"Information needs representation. The idea that it is possible to communicate information in a 'pure' form is fiction. Successful risk communication requires intuitively clear representations. Playing with representations can help us not only to understand numbers (describe phenomena) but also to draw conclusions from numbers (make inferences). There is no single best representation, because what is needed always depends on the minds that are doing the communicating." (Gerd Gigerenzer, "Calculated Risks: How to know when numbers deceive you", 2002)

"Nonetheless, the basic principles regarding correlations between variables are not that difficult to understand. We must look for patterns that reveal potential relationships and for evidence that variables are actually related. But when we do spot those relationships, we should not jump to conclusions about causality. Instead, we need to weigh the strength of the relationship and the plausibility of our theory, and we must always try to discount the possibility of spuriousness." (Joel Best, "More Damned Lies and Statistics: How numbers confuse public issues", 2004)

"Data, reason, and calculation can only produce conclusions; they do not inspire action. Good numbers are not the result of managing numbers." (Ronald J Baker, "Measure what Matters to Customers: Using Key Predictive Indicators", 2006)

"It is in the nature of human beings to bend information in the direction of desired conclusions." (John Naisbitt, "Mind Set!: Reset Your Thinking and See the Future", 2006) 

"Perception requires imagination because the data people encounter in their lives are never complete and always equivocal. [...] We also use our imagination and take shortcuts to fill gaps in patterns of nonvisual data. As with visual input, we draw conclusions and make judgments based on uncertain and incomplete information, and we conclude, when we are done analyzing the patterns, that our picture is clear and accurate. But is it?" (Leonard Mlodinow, "The Drunkard’s Walk: How Randomness Rules Our Lives", 2008)

"Traditional statistics is strong in devising ways of describing data and inferring distributional parameters from sample. Causal inference requires two additional ingredients: a science-friendly language for articulating causal knowledge, and a mathematical machinery for processing that knowledge, combining it with data and drawing new causal conclusions about a phenomenon." (Judea Pearl, "Causal inference in statistics: An overview", Statistics Surveys 3, 2009)

"Data scientists combine entrepreneurship with patience, the willingness to build data products incrementally, the ability to explore, and the ability to iterate over a solution. They are inherently interdisciplinary. They can tackle all aspects of a problem, from initial data collection and data conditioning to drawing conclusions. They can think outside the box to come up with new ways to view the problem, or to work with very broadly defined problems: 'there’s a lot of data, what can you make from it?'" (Mike Loukides, "What Is Data Science?", 2011)

"Any factor you don’t account for can become a confounding factor. A confounding factor is any variable that confuses the conclusions of your study, or makes them ambiguous. [...] Confounding factors can really screw up an otherwise perfectly good statistical analysis." (Kristin H Jarman, "The Art of Data Analysis: How to answer almost any question using basic statistics", 2013)

"Any time you collect data, you have uncertainty to deal with. This uncertainty comes from two places: (1) inherent variation in the values a random variable can take on and (2) the fact that for most studies, you can’t capture the entire population and so you must rely on a sample to make your conclusions." (Kristin H Jarman, "The Art of Data Analysis: How to answer almost any question using basic statistics", 2013)

"A study that leaves out data is waving a big red flag. A decision to include or exclude data sometimes makes all the difference in the world. This decision should be based on the relevance and quality of the data, not on whether the data support or undermine a conclusion that is expected or desired." (Gary Smith, "Standard Deviations", 2014)

"We naturally draw conclusions from what we see […]. We should also think about what we do not see […]. The unseen data may be just as important, or even more important, than the seen data. To avoid survivor bias, start in the past and look forward." (Gary Smith, "Standard Deviations", 2014)

"If your conclusions change dramatically by excluding a data point, then that data point is a strong candidate to be an outlier. In a good statistical model, you would expect that you can drop a data point without seeing a substantive difference in the results. It’s something to think about when looking for outliers." (John H Johnson & Mike Gluck, "Everydata: The misinformation hidden in the little data you consume every day", 2016)

"GIGO is a famous saying coined by early computer scientists: garbage in, garbage out. At the time, people would blindly put their trust into anything a computer output indicated because the output had the illusion of precision and certainty. If a statistic is composed of a series of poorly defined measures, guesses, misunderstandings, oversimplifications, mismeasurements, or flawed estimates, the resulting conclusion will be flawed." (Daniel J Levitin, "Weaponized Lies", 2017)

"In terms of characteristics, a data scientist has an inquisitive mind and is prepared to explore and ask questions, examine assumptions and analyse processes, test hypotheses and try out solutions and, based on evidence, communicate informed conclusions, recommendations and caveats to stakeholders and decision makers." (Jesús Rogel-Salazar, "Data Science and Analytics with Python", 2017)

"Just because there’s a number on it, it doesn’t mean that the number was arrived at properly. […] There are a host of errors and biases that can enter into the collection process, and these can lead millions of people to draw the wrong conclusions. Although most of us won’t ever participate in the collection process, thinking about it, critically, is easy to learn and within the reach of all of us." (Daniel J Levitin, "Weaponized Lies", 2017)

"But [bootstrap-based] simulations are clumsy and time-consuming, especially with large data sets, and in more complex circumstances it is not straightforward to work out what should be simulated. In contrast, formulae derived from probability theory provide both insight and convenience, and always lead to the same answer since they don’t depend on a particular simulation. But the flip side is that this theory relies on assumptions, and we should be careful not to be deluded by the impressive algebra into accepting unjustified conclusions." (David Spiegelhalter, "The Art of Statistics: Learning from Data", 2019)

"Good data scientists know that, because of inevitable ups and downs in the data for almost any interesting question, they shouldn’t draw conclusions from small samples, where flukes might look like evidence." (Gary Smith & Jay Cordes, "The 9 Pitfalls of Data Science", 2019)

"When we have all the data, it is straightforward to produce statistics that describe what has been measured. But when we want to use the data to draw broader conclusions about what is going on around us, then the quality of the data becomes paramount, and we need to be alert to the kind of systematic biases that can jeopardize the reliability of any claims." (David Spiegelhalter, "The Art of Statistics: Learning from Data", 2019)

"With the growing availability of massive data sets and user-friendly analysis software, it might be thought that there is less need for training in statistical methods. This would be naïve in the extreme. Far from freeing us from the need for statistical skills, bigger data and the rise in the number and complexity of scientific studies makes it even more difficult to draw appropriate conclusions. More data means that we need to be even more aware of what the evidence is actually worth." (David Spiegelhalter, "The Art of Statistics: Learning from Data", 2019)

"Each decision about what data to gather and how to analyze them is akin to standing on a pathway as it forks left and right and deciding which way to go. What seems like a few simple choices can quickly multiply into a labyrinth of different possibilities. Make one combination of choices and you’ll reach one conclusion; make another, equally reasonable, and you might find a very different pattern in the data." (Tim Harford, "The Data Detective: Ten easy rules to make sense of statistics", 2020)

"If the data that go into the analysis are flawed, the specific technical details of the analysis don’t matter. One can obtain stupid results from bad data without any statistical trickery. And this is often how bullshit arguments are created, deliberately or otherwise. To catch this sort of bullshit, you don’t have to unpack the black box. All you have to do is think carefully about the data that went into the black box and the results that came out. Are the data unbiased, reasonable, and relevant to the problem at hand? Do the results pass basic plausibility checks? Do they support whatever conclusions are drawn?" (Carl T Bergstrom & Jevin D West, "Calling Bullshit: The Art of Skepticism in a Data-Driven World", 2020)

"Inference is to bring about a new thought, which in logic amounts to drawing a conclusion, and more generally involves using what we already know, and what we see or observe, to update prior beliefs. […] Inference is also a leap of sorts, deemed reasonable […] Inference is a basic cognitive act for intelligent minds. If a cognitive agent (a person, an AI system) is not intelligent, it will infer badly. But any system that infers at all must have some basic intelligence, because the very act of using what is known and what is observed to update beliefs is inescapably tied up with what we mean by intelligence. If an AI system is not inferring at all, it doesn’t really deserve to be called AI." (Erik J Larson, "The Myth of Artificial Intelligence: Why Computers Can’t Think the Way We Do", 2021)

"Any time you run regression analysis on arbitrary real-world observational data, there’s a significant risk that there’s hidden confounding in your dataset and so causal conclusions from such analysis are likely to be (causally) biased." (Aleksander Molak, "Causal Inference and Discovery in Python", 2023)

23 October 2018

🔭Data Science: Simulations (Just the Quotes)

"The mathematical and computing techniques for making programmed decisions replace man but they do not generally simulate him." (Herbert A Simon, "Corporations 1985", 1960)

"The main object of cybernetics is to supply adaptive, hierarchical models, involving feedback and the like, to all aspects of our environment. Often such modelling implies simulation of a system where the simulation should achieve the object of copying both the method of achievement and the end result. Synthesis, as opposed to simulation, is concerned with achieving only the end result and is less concerned (or completely unconcerned) with the method by which the end result is achieved. In the case of behaviour, psychology is concerned with simulation, while cybernetics, although also interested in simulation, is primarily concerned with synthesis." (Frank H George, "Soviet Cybernetics, the Military and Professor Lerner", New Scientist, 1973)

"Computer based simulation is now in widespread use to analyse system models and evaluate theoretical solutions to observed problems. Since important decisions must rely on simulation, it is essential that its validity be tested, and that its advocates be able to describe the level of authentic representation which they achieved." (Richard Hamming, 1975)

"When a real situation involves chance we have to use probability mathematics to understand it quantitatively. Direct mathematical solutions sometimes exist […] but most real systems are too complicated for direct solutions. In these cases the computer, once taught to generate random numbers, can use simulation to get useful answers to otherwise impossible problems." (Robert Hooke, "How to Tell the Liars from the Statisticians", 1983)
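Hooke's point — that a computer "taught to generate random numbers" can answer otherwise intractable questions — is the essence of Monte Carlo simulation. A minimal sketch (the classic textbook example, not Hooke's own): estimate pi by sampling random points in the unit square and counting how many fall inside the quarter circle.

```python
import random

def estimate_pi(n_samples: int, seed: int = 42) -> float:
    """Monte Carlo estimate of pi from uniform random points."""
    rng = random.Random(seed)
    hits = sum(
        1
        for _ in range(n_samples)
        if rng.random() ** 2 + rng.random() ** 2 <= 1.0
    )
    # Fraction inside the quarter circle approximates pi / 4
    return 4.0 * hits / n_samples

print(estimate_pi(100_000))
```

No geometry beyond the hit test is needed; accuracy improves roughly with the square root of the sample count.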

"The real leverage in most management situations lies in understanding dynamic complexity, not detail complexity. […] Unfortunately, most 'systems analyses' focus on detail complexity not dynamic complexity. Simulations with thousands of variables and complex arrays of details can actually distract us from seeing patterns and major interrelationships. In fact, sadly, for most people 'systems thinking' means 'fighting complexity with complexity', devising increasingly 'complex' (we should really say 'detailed') solutions to increasingly 'complex' problems. In fact, this is the antithesis of real systems thinking." (Peter M Senge, "The Fifth Discipline: The Art and Practice of the Learning Organization", 1990)

"A model for simulating dynamic system behavior requires formal policy descriptions to specify how individual decisions are to be made. Flows of information are continuously converted into decisions and actions. No plea about the inadequacy of our understanding of the decision-making processes can excuse us from estimating decision-making criteria. To omit a decision point is to deny its presence - a mistake of far greater magnitude than any errors in our best estimate of the process." (Jay W Forrester, "Policies, decisions and information sources for modeling", 1994)

"A field of study that includes a methodology for constructing computer simulation models to achieve better understanding of social and corporate systems. It draws on organizational studies, behavioral decision theory, and engineering to provide a theoretical and empirical base for structuring the relationships in complex systems." (Virginia Anderson & Lauren Johnson, "Systems Thinking Basics: From Concepts to Causal Loops", 1997)

"What it means for a mental model to be a structural analog is that it embodies a representation of the spatial and temporal relations among, and the causal structures connecting the events and entities depicted and whatever other information that is relevant to the problem-solving tasks. […] The essential points are that a mental model can be nonlinguistic in form and the mental mechanisms are such that they can satisfy the model-building and simulative constraints necessary for the activity of mental modeling." (Nancy J Nersessian, "Model-based reasoning in conceptual change", 1999)

"A neural network is a particular kind of computer program, originally developed to try to mimic the way the human brain works. It is essentially a computer simulation of a complex circuit through which electric current flows." (Keith J Devlin & Gary Lorden, "The Numbers behind NUMB3RS: Solving crime with mathematics", 2007)

"[...] a model is a tool for taking decisions and any decision taken is the result of a process of reasoning that takes place within the limits of the human mind. So, models have eventually to be understood in such a way that at least some layer of the process of simulation is comprehensible by the human mind. Otherwise, we may find ourselves acting on the basis of models that we don't understand, or no model at all." (Ugo Bardi, "The Limits to Growth Revisited", 2011)

"Not only the mathematical way of thinking, but also simulations assisted by mathematical methods, is quite effective in solving problems. The latter is utilized in various fields, including detection of causes of troubles, optimization of expected performances, and best possible adjustments of usage conditions. Conversely, without the aid of mathematical methods, our problem-solving effort will get stuck most probably [...]" (Shiro Hiruta, "Mathematics Contributing to Innovation of Management", [in "What Mathematics Can Do for You"] 2013)

"System dynamics [...] uses models and computer simulations to understand behavior of an entire system, and has been applied to the behavior of large and complex national issues. It portrays the relationships in systems as feedback loops, lags, and other descriptors to explain dynamics, that is, how a system behaves over time. Its quantitative methodology relies on what are called 'stock-and-flow diagrams' that reflect how levels of specific elements accumulate over time and the rate at which they change. Qualitative systems thinking constructs evolved from this quantitative discipline." (Karen L Higgins, "Economic Growth and Sustainability: Systems Thinking for a Complex World", 2015)
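The stock-and-flow bookkeeping Higgins describes can be sketched in a few lines (a hypothetical toy model; the names and rates are illustrative): a single stock accumulates an inflow and drains at a rate proportional to its own level, the feedback loop that gives system-dynamics models their behavior over time.

```python
def simulate_stock(initial: float, inflow: float, outflow_rate: float,
                   steps: int, dt: float = 1.0) -> list[float]:
    """Euler-step simulation of one stock with a constant inflow
    and a level-dependent outflow (a feedback loop)."""
    levels = [initial]
    stock = initial
    for _ in range(steps):
        stock += (inflow - outflow_rate * stock) * dt
        levels.append(stock)
    return levels

# The level rises toward the equilibrium inflow / outflow_rate = 100
trajectory = simulate_stock(initial=0.0, inflow=10.0, outflow_rate=0.1, steps=100)
print(round(trajectory[-1], 1))
```

Even this one-stock model shows the characteristic goal-seeking curve; real system-dynamics models chain many such stocks and flows together.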

"Optimization is more than finding the best simulation results. It is itself a complex and evolving field that, subject to certain information constraints, allows data scientists, statisticians, engineers, and traders alike to perform reality checks on modeling results." (Chris Conlan, "Automated Trading with R: Quantitative Research and Platform Development", 2016)

"But [bootstrap-based] simulations are clumsy and time-consuming, especially with large data sets, and in more complex circumstances it is not straightforward to work out what should be simulated. In contrast, formulae derived from probability theory provide both insight and convenience, and always lead to the same answer since they don’t depend on a particular simulation. But the flip side is that this theory relies on assumptions, and we should be careful not to be deluded by the impressive algebra into accepting unjustified conclusions." (David Spiegelhalter, "The Art of Statistics: Learning from Data", 2019)
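Spiegelhalter's contrast between simulation and formula can be seen directly (a minimal sketch with illustrative data): bootstrap-resample a sample to estimate the standard error of its mean, then compare with the probability-theory formula s / sqrt(n).

```python
import random
import statistics

def bootstrap_se_of_mean(data: list[float], n_resamples: int = 2000,
                         seed: int = 0) -> float:
    """Standard error of the mean, estimated by resampling with replacement."""
    rng = random.Random(seed)
    means = [
        statistics.fmean(rng.choices(data, k=len(data)))
        for _ in range(n_resamples)
    ]
    return statistics.stdev(means)

rng = random.Random(1)
data = [rng.gauss(50, 10) for _ in range(200)]

boot_se = bootstrap_se_of_mean(data)             # clumsy but assumption-light
formula_se = statistics.stdev(data) / len(data) ** 0.5  # instant, theory-based
print(f"bootstrap: {boot_se:.3f}, formula: {formula_se:.3f}")
```

The two agree here because the formula's assumptions hold; the quote's warning is that the formula stays just as confident when they do not.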
