SQL Troubles

26 October 2018

💎SQL Reloaded: Trimming Strings (Before and After)

One of the annoying things when writing queries is the repetitive lengthy expressions that obfuscate in general the queries making them more difficult to read, understand and troubleshoot, and sometimes such expressions come with a performance penalty as well. Loading data from Excel, text files and other sources involving poorly formatted data often requires trimming (all) the text values. In the early versions of SQL Server, the equivalent of a Trim function was obtained by using the combined LTrim and RTrim functions. This resumed in writing code like this (based on AdventureWorks 2014 database):

-- trimming via LTrim, RTrim 
SELECT LTrim(RTrim(AddressLine1)) AddressLine1
, LTrim(RTrim(AddressLine2)) AddressLine2
, LTrim(RTrim(City)) City
, LTrim(RTrim(PostalCode)) PostalCode
FROM Person.Address

This might not look much though imagine you have to deal with 30-50 text attributes, that the code is not written in a readable format (e.g. the way is stored in database), that some attributes require further processing (e.g. removal of special characters, splitting, concatenating).
Often developers preferred encapsulating the call to the two functions within a user-defined function:

-- Trim user-defiend function
CREATE FUNCTION dbo.Trim(
@string nvarchar(max))
RETURNS nvarchar(max)
BEGIN
    RETURN LTrim(RTrim(@string))
END

With it the code is somehow simplified, but not by much and includes the costs of calling a user-defined function:

-- trimming via dbo.Trim
SELECT dbo.Trim(AddressLine1) AddressLine1
, dbo.Trim(AddressLine2) AddressLine2
, dbo.Trim(City) City
, dbo.Trim(PostalCode) PostalCode
FROM Person.Address

In SQL Server 2017 was introduced the Trim function which not only replaces the combined use of LTrim and RTrim functions, but also replaces other specified characters (including CR, LF, Tab) from the start or end of a string.

By default the function removes the space from both sides of a string:

-- trimming via Trim
SELECT Trim(AddressLine1) AddressLine1
, Trim(AddressLine2) AddressLine2
, Trim(City) City
, Trim(PostalCode) PostalCode
FROM Person.Address

When a set of characters is provided the function removes the specified characters:

SELECT Trim ('#' FROM '# 843984') Example1
, Trim ('[]' FROM '[843984]') Example2
, Trim ('+' FROM '+49127298000') Example3
, Trim ('+-' FROM '+ 49-12729-8000 ') + ';' Example4
, Trim ('+ ' FROM '+ 49-12729-8000 ') + ';' Example5
, ASCII(Left(Trim (char(13) FROM char(13) + '49127298000'), 1)) Example6

Output:
Example1   Example2     Example3        Example4            Example5            Example6
--------          --------          ------------           -----------------       -----------------        -----------
843984      843984        49127298000   49-12729-8000 ; 49-12729-8000;    52

As can be seen when is needed to remove other characters together with the space then is needed to include the space in the list of characters.

Notes:
The dbo.Trim function can be created in SQL Server 2017 environments as well.
The collation of the database will affect the behavior of Trim function, therefore the results might look different when a case sensitive collection is used.

Happy coding!

25 October 2018

💎SQL Reloaded: Cursor and Linked Server for Data Import

There are times when is needed to pull some data (repeatedly) from one or more databases for analysis and SSIS is not available or there’s not much time to create individual packages via data imports. In such scenarios is needed to rely on the use of SQL Server Engine’s built-in support. In this case the data can be easily imported via a linked server into ad-hoc created tables in a local database. In fact, the process can be partially automated with the use of a cursor that iterates through a predefined set of tables.For exemplification I will use a SELECT instead of an EXEC just to render the results:

-- cleaning up
-- DROP TABLE dbo.LoadTables 

-- defining the scope
SELECT *
INTO dbo.LoadTables
FROM (VALUES ('dbo.InventTable')
           , ('dbo.InventDimCombination')
    , ('dbo.InventDim')
    , ('dbo.InventItemLocation')) DAT ([Table])


-- creating the stored procedure 
CREATE PROCEDURE dbo.pLoadData(
    @Table as nvarchar(50))
AS
/* loads the set of tables defiend in dbo.LoadTables */
BEGIN
   DECLARE @cTable varchar(50)

   -- creating the cursor
   DECLARE TableList CURSOR FOR
   SELECT [Table]
   FROM dbo.LoadTables
   WHERE [Table] = IsNull(@Table, [Table])
   ORDER BY [Table]

   -- opening the cursor
   OPEN TableList 

   -- fetching next record 
   FETCH NEXT FROM TableList
   INTO @cTable

   -- looping through each record 
   WHILE @@FETCH_STATUS = 0 
   BEGIN
 --- preparing the DROP TABLE statement 
        SELECT(' DROP TABLE IF EXISTS ' + @cTable + '')

        -- preparing the SELECT INTO STATEMENT
        SELECT( ' SELECT *' +
         ' INTO ' + @cTable +
                ' FROM [server].[database].[' + @cTable + ']')

 -- fetching next record 
 FETCH NEXT FROM TableList
 INTO @cTable
   END

   --closing the cursor
   CLOSE TableList 
   -- deallocating the cursor
   DEALLOCATE TableList 
END

Running the stored procedure for all the tables:

 -- Testing the procedure 
 EXEC dbo.pLoadData NULL -- loading all tables 

-- output 
 DROP TABLE IF EXISTS dbo.InventDim
 SELECT * INTO dbo.InventDim FROM [server].[database].[dbo.InventDim]

 DROP TABLE IF EXISTS dbo.InventDimCombination
 SELECT * INTO dbo.InventDimCombination FROM [server].[database].[dbo.InventDimCombination]

 DROP TABLE IF EXISTS dbo.InventItemLocation
 SELECT * INTO dbo.InventItemLocation FROM [server].[database].[dbo.InventItemLocation]

 DROP TABLE IF EXISTS dbo.InventTable
 SELECT * INTO dbo.InventTable FROM [server].[database].[dbo.InventTable]

Running the stored procedure for a specific table:

-- Testing the procedure

EXEC dbo.pLoadData 'dbo.InventTable' -- loading a specific table

-- output 
DROP TABLE IF EXISTS dbo.InventTable
SELECT * INTO dbo.InventTable FROM [server].[database].[dbo.InventTable]

Notes:
Having an old example of using a cursor (see Cursor and Lists) the whole mechanism for loading the data was available in 30 Minutes or so.
Tables can be added or removed after need, and the loading can be made more flexible by adding other parameters to the logic.
The solution is really easy to use and the performance is as well acceptable in comparison to SSIS packages.
Probably you already observed the use of DROP TABLE IF EXSISTS introduced with SQL Server 2016 (see also post)

Advantages:The stored procedure can be extended to any database for which can be created a linked server.
Structural changes of the source tables are reflected in each load.
Tables can be quickly updated when needed just by executing the stored procedure.

Disadvantages:
Such solutions are more for personal use and their use should be avoided in a production environment.
The metadata will be temporarily unavailable during the time the procedure is run. Indexes need to be created after each load.

Happy Coding!

🔭Data Science: Conclusions (Just the Quotes)

"Before anything can be reasoned upon to a conclusion, certain facts, principles, or data, to reason from, must be established, admitted, or denied." (Thomas Paine, "Rights of Man", 1791)

"In order to supply the defects of experience, we will have recourse to the probable conjectures of analogy, conclusions which we will bequeath to our posterity to be ascertained by new observations, which, if we augur rightly, will serve to establish our theory and to carry it gradually nearer to absolute certainty." (Johann H Lambert, "The System of the World", 1800)

"Such is the tendency of the human mind to speculation, that on the least idea of an analogy between a few phenomena, it leaps forward, as it were, to a cause or law, to the temporary neglect of all the rest; so that, in fact, almost all our principal inductions must be regarded as a series of ascents and descents, and of conclusions from a few cases, verified by trial on many." (Sir John Herschel, "A Preliminary Discourse on the Study of Natural Philosophy" , 1830)

"Just as data gathered by an incompetent observer are worthless - or by a biased observer, unless the bias can be measured and eliminated from the result - so also conclusions obtained from even the best data by one unacquainted with the principles of statistics must be of doubtful value." (William F White, "A Scrap-Book of Elementary Mathematics: Notes, Recreations, Essays", 1908)

"Ordinarily, facts do not speak for themselves. When they do speak for themselves, the wrong conclusions are often drawn from them. Unless the facts are presented in a clear and interesting manner, they are about as effective as a phonograph record with the phonograph missing." (Willard C Brinton, "Graphic Methods for Presenting Facts", 1919)

"The preliminary examination of most data is facilitated by the use of diagrams. Diagrams prove nothing, but bring outstanding features readily to the eye; they are therefore no substitutes for such critical tests as may be applied to the data, but are valuable in suggesting such tests, and in explaining the conclusions founded upon them." (Sir Ronald A Fisher, "Statistical Methods for Research Workers", 1925)

"Observed facts must be built up, woven together, ordered, arranged, systematized into conclusions and theories by reflection and reason, if they are to have full bearing on life and the universe. Knowledge is the accumulation of facts. Wisdom is the establishment of relations. And just because the latter process is delicate and perilous, it is all the more delightful." (Gamaliel Bradford, "Darwin", 1926)

"The statistician’s job is to draw general conclusions from fragmentary data. Too often the data supplied to him for analysis are not only fragmentary but positively incoherent, so that he can do next to nothing with them. Even the most kindly statistician swears heartily under his breath whenever this happens". (Michael J Moroney, "Facts from Figures", 1927)

"All statistical analysis in business must aim at the control of action. The possible conclusions are: 1. Certain action must be taken. 2. No action is required. 3. Certain tendencies must be watched. 4. The analysis is not significant and either (a) certain further facts are required, or (b) there are no indications that further facts should be obtained." (John R Riggleman & Ira N Frisbee, "Business Statistics", 1938)

"Starting from statistical observations, it is possible to arrive at conclusions which not less reliable or useful than those obtained in any other exact science. It is only necessary to apply a clear and precise concept of probability to such observations. " (Richard von Mises, "Probability, Statistics, and Truth", 1939)

"The characteristic which distinguishes the present-day professional statistician, is his interest and skill in the measurement of the fallibility of conclusions." (George W Snedecor, "On a Unique Feature of Statistics", [address] 1948)

"The enthusiastic use of statistics to prove one side of a case is not open to criticism providing the work is honestly and accurately done, and providing the conclusions are not broader than indicated by the data. This type of work must not be confused with the unfair and dishonest use of both accurate and inaccurate data, which too commonly occurs in business. Dishonest statistical work usually takes the form of: (1) deliberate misinterpretation of data; (2) intentional making of overestimates or underestimates; and (3) biasing results by using partial data, making biased surveys, or using wrong statistical methods." (John R Riggleman & Ira N Frisbee, "Business Statistics", 1951)

"Another thing to watch out for is a conclusion in which a correlation has been inferred to continue beyond the data with which it has been demonstrated." (Darell Huff, "How to Lie with Statistics", 1954)

"The statistics themselves prove nothing; nor are they at any time a substitute for logical thinking. There are […] many simple but not always obvious snags in the data to contend with. Variations in even the simplest of figures may conceal a compound of influences which have to be taken into account before any conclusions are drawn from the data." (Alfred R Ilersic, "Statistics", 1959)

"Predictions, prophecies, and perhaps even guidance – those who suggested this title to me must have hoped for such-even though occasional indulgences in such actions by statisticians has undoubtedly contributed to the characterization of a statistician as a man who draws straight lines from insufficient data to foregone conclusions!" (John W Tukey, "Where do We Go From Here?", Journal of the American Statistical Association, Vol. 55, No. 289, 1960)

"Model-making, the imaginative and logical steps which precede the experiment, may be judged the most valuable part of scientific method because skill and insight in these matters are rare. Without them we do not know what experiment to do. But it is the experiment which provides the raw material for scientific theory. Scientific theory cannot be built directly from the conclusions of conceptual models." (Herbert G Andrewartha," Introduction to the Study of Animal Population", 1961)

"Almost all efforts at data analysis seek, at some point, to generalize the results and extend the reach of the conclusions beyond a particular set of data. The inferential leap may be from past experiences to future ones, from a sample of a population to the whole population, or from a narrow range of a variable to a wider range. The real difficulty is in deciding when the extrapolation beyond the range of the variables is warranted and when it is merely naive. As usual, it is largely a matter of substantive judgment - or, as it is sometimes more delicately put, a matter of 'a priori nonstatistical considerations'." (Edward R Tufte, "Data Analysis for Politics and Policy", 1974)

"A mature science, with respect to the matter of errors in variables, is not one that measures its variables without error, for this is impossible. It is, rather, a science which properly manages its errors, controlling their magnitudes and correctly calculating their implications for substantive conclusions." (Otis D Duncan, "Introduction to Structural Equation Models", 1975)

"Just like the spoken or written word, statistics and graphs can lie. They can lie by not telling the full story. They can lead to wrong conclusions by omitting some of the important facts. [...] Always look at statistics with a critical eye, and you will not be the victim of misleading information." (Dyno Lowenstein, "Graphs", 1976)

"Crude measurement usually yields misleading, even erroneous conclusions no matter how sophisticated a technique is used." (Henry T Reynolds, "Analysis of Nominal Data", 1977)

"The word ‘induction’ has two essentially different meanings. Scientific induction is a process by which scientists make observations of particular cases, such as noticing that some crows are black, then leap to the universal conclusion that all crows are black. The conclusion is never certain. There is always the possibility that at least one unobserved crow is not black." (Martin Gardner, "Aha! Insight", 1978)

"Being experimental, however, doesn't necessarily make a scientific study entirely credible. One weakness of experimental work is that it can be out of touch with reality when its controls are so rigid that conclusions are valid only in the experimental situation and don't carryover into the real world." (Robert Hooke, "How to Tell the Liars from the Statisticians", 1983)

"In everyday life, 'estimation' means a rough and imprecise procedure leading to a rough and imprecise result. You 'estimate' when you cannot measure exactly. In statistics, on the other hand, 'estimation' is a technical term. It means a precise and accurate procedure, leading to a result which may be imprecise, but where at least the extent of the imprecision is known. It has nothing to do with approximation. You have some data, from which you want to draw conclusions and produce a 'best' value for some particular numerical quantity (or perhaps for several quantities), and you probably also want to know how reliable this value is, i.e. what the error is on your estimate." (Roger J Barlow, "Statistics: A guide to the use of statistical methods in the physical sciences", 1989)

"Statistical models for data are never true. The question whether a model is true is irrelevant. A more appropriate question is whether we obtain the correct scientific conclusion if we pretend that the process under study behaves according to a particular statistical model." (Scott Zeger, "Statistical reasoning in epidemiology", American Journal of Epidemiology, 1991)

"When looking at the end result of any statistical analysis, one must be very cautious not to over interpret the data. Care must be taken to know the size of the sample, and to be certain the method for gathering information is consistent with other samples gathered. […] No one should ever base conclusions without knowing the size of the sample and how random a sample it was. But all too often such data is not mentioned when the statistics are given - perhaps it is overlooked or even intentionally omitted." (Theoni Pappas, "More Joy of Mathematics: Exploring mathematical insights & concepts", 1991)

"Nature behaves in ways that look mathematical, but nature is not the same as mathematics. Every mathematical model makes simplifying assumptions; its conclusions are only as valid as those assumptions. The assumption of perfect symmetry is excellent as a technique for deducing the conditions under which symmetry-breaking is going to occur, the general form of the result, and the range of possible behaviour. To deduce exactly which effect is selected from this range in a practical situation, we have to know which imperfections are present." (Ian Stewart & Martin Golubitsky, "Fearful Symmetry", 1992)

"Visualization is an approach to data analysis that stresses a penetrating look at the structure of data. No other approach conveys as much information. […] Conclusions spring from data when this information is combined with the prior knowledge of the subject under investigation." (William S Cleveland, "Visualizing Data", 1993)

"Visualization is an effective framework for drawing inferences from data because its revelation of the structure of data can be readily combined with prior knowledge to draw conclusions. By contrast, because of the formalism of probablistic methods, it is typically impossible to incorporate into them the full body of prior information." (William S Cleveland, "Visualizing Data", 1993)

"The science of statistics may be described as exploring, analyzing and summarizing data; designing or choosing appropriate ways of collecting data and extracting information from them; and communicating that information. Statistics also involves constructing and testing models for describing chance phenomena. These models can be used as a basis for making inferences and drawing conclusions and, finally, perhaps for making decisions." (Fergus Daly et al, "Elements of Statistics", 1995)

"'Garbage in, garbage out' is a sound warning for those in the computer field; it is every bit as sound in the use of statistics. Even if the “garbage” which comes out leads to a correct conclusion, this conclusion is still tainted, as it cannot be supported by logical reasoning. Therefore, it is a misuse of statistics. But obtaining a correct conclusion from faulty data is the exception, not the rule. Bad basic data (the 'garbage in') almost always leads to incorrect conclusions (the 'garbage out'). Unfortunately, incorrect conclusions can lead to bad policy or harmful actions." (Herbert F Spirer et al, "Misused Statistics" 2nd Ed, 1998)

"Information needs representation. The idea that it is possible to communicate information in a 'pure' form is fiction. Successful risk communication requires intuitively clear representations. Playing with representations can help us not only to understand numbers (describe phenomena) but also to draw conclusions from numbers (make inferences). There is no single best representation, because what is needed always depends on the minds that are doing the communicating." (Gerd Gigerenzer, "Calculated Risks: How to know when numbers deceive you", 2002)

"Nonetheless, the basic principles regarding correlations between variables are not that difficult to understand. We must look for patterns that reveal potential relationships and for evidence that variables are actually related. But when we do spot those relationships, we should not jump to conclusions about causality. Instead, we need to weigh the strength of the relationship and the plausibility of our theory, and we must always try to discount the possibility of spuriousness." (Joel Best, "More Damned Lies and Statistics: How numbers confuse public issues", 2004)

"Data, reason, and calculation can only produce conclusions; they do not inspire action. Good numbers are not the result of managing numbers." (Ronald J Baker, "Measure what Matters to Customers: Using Key Predictive Indicators", 2006)

"It is in the nature of human beings to bend information in the direction of desired conclusions." (John Naisbitt, "Mind Set!: Reset Your Thinking and See the Future", 2006)

"Perception requires imagination because the data people encounter in their lives are never complete and always equivocal. [...] We also use our imagination and take shortcuts to fill gaps in patterns of nonvisual data. As with visual input, we draw conclusions and make judgments based on uncertain and incomplete information, and we conclude, when we are done analyzing the patterns, that out picture is clear and accurate. But is it?" (Leonard Mlodinow, "The Drunkard’s Walk: How Randomness Rules Our Lives", 2008)

"Traditional statistics is strong in devising ways of describing data and inferring distributional parameters from sample. Causal inference requires two additional ingredients: a science-friendly language for articulating causal knowledge, and a mathematical machinery for processing that knowledge, combining it with data and drawing new causal conclusions about a phenomenon." (Judea Pearl, "Causal inference in statistics: An overview", Statistics Surveys 3, 2009)

"Data scientists combine entrepreneurship with patience, the willingness to build data products incrementally, the ability to explore, and the ability to iterate over a solution. They are inherently interdisciplinary. They can tackle all aspects of a problem, from initial data collection and data conditioning to drawing conclusions. They can think outside the box to come up with new ways to view the problem, or to work with very broadly defined problems: 'there’s a lot of data, what can you make from it?'" (Mike Loukides, "What Is Data Science?", 2011)

"Any factor you don’t account for can become a confounding factor. A confounding factor is any variable that confuses the conclusions of your study, or makes them ambiguous. [...] Confounding factors can really screw up an otherwise perfectly good statistical analysis." (Kristin H Jarman, "The Art of Data Analysis: How to answer almost any question using basic statistics", 2013)

"Any time you collect data, you have uncertainty to deal with. This uncertainty comes from two places: (1) inherent variation in the values a random variable can take on and (2) the fact that for most studies, you can’t capture the entire population and so you must rely on a sample to make your conclusions." (Kristin H Jarman, "The Art of Data Analysis: How to answer almost any question using basic statistics", 2013)

"A study that leaves out data is waving a big red flag. A decision to include or exclude data sometimes makes all the difference in the world. This decision should be based on the relevance and quality of the data, not on whether the data support or undermine a conclusion that is expected or desired." (Gary Smith, "Standard Deviations", 2014)

"We naturally draw conclusions from what we see […]. We should also think about what we do not see […]. The unseen data may be just as important, or even more important, than the seen data. To avoid survivor bias, start in the past and look forward." (Gary Smith, "Standard Deviations", 2014)

"If your conclusions change dramatically by excluding a data point, then that data point is a strong candidate to be an outlier. In a good statistical model, you would expect that you can drop a data point without seeing a substantive difference in the results. It’s something to think about when looking for outliers." (John H Johnson & Mike Gluck, "Everydata: The misinformation hidden in the little data you consume every day", 2016)

"GIGO is a famous saying coined by early computer scientists: garbage in, garbage out. At the time, people would blindly put their trust into anything a computer output indicated because the output had the illusion of precision and certainty. If a statistic is composed of a series of poorly defined measures, guesses, misunderstandings, oversimplifications, mismeasurements, or flawed estimates, the resulting conclusion will be flawed." (Daniel J Levitin, "Weaponized Lies", 2017)

"In terms of characteristics, a data scientist has an inquisitive mind and is prepared to explore and ask questions, examine assumptions and analyse processes, test hypotheses and try out solutions and, based on evidence, communicate informed conclusions, recommendations and caveats to stakeholders and decision makers." (Jesús Rogel-Salazar, "Data Science and Analytics with Python", 2017)

"Just because there’s a number on it, it doesn’t mean that the number was arrived at properly. […] There are a host of errors and biases that can enter into the collection process, and these can lead millions of people to draw the wrong conclusions. Although most of us won’t ever participate in the collection process, thinking about it, critically, is easy to learn and within the reach of all of us." (Daniel J Levitin, "Weaponized Lies", 2017)

"But [bootstrap-based] simulations are clumsy and time-consuming, especially with large data sets, and in more complex circumstances it is not straightforward to work out what should be simulated. In contrast, formulae derived from probability theory provide both insight and convenience, and always lead to the same answer since they don’t depend on a particular simulation. But the flip side is that this theory relies on assumptions, and we should be careful not to be deluded by the impressive algebra into accepting unjustified conclusions." (David Spiegelhalter, "The Art of Statistics: Learning from Data", 2019)

"Good data scientists know that, because of inevitable ups and downs in the data for almost any interesting question, they shouldn’t draw conclusions from small samples, where flukes might look like evidence." (Gary Smith & Jay Cordes, "The 9 Pitfalls of Data Science", 2019)

"When we have all the data, it is straightforward to produce statistics that describe what has been measured. But when we want to use the data to draw broader conclusions about what is going on around us, then the quality of the data becomes paramount, and we need to be alert to the kind of systematic biases that can jeopardize the reliability of any claims." (David Spiegelhalter, "The Art of Statistics: Learning from Data", 2019)

"With the growing availability of massive data sets and user-friendly analysis software, it might be thought that there is less need for training in statistical methods. This would be naïve in the extreme. Far from freeing us from the need for statistical skills, bigger data and the rise in the number and complexity of scientific studies makes it even more difficult to draw appropriate conclusions. More data means that we need to be even more aware of what the evidence is actually worth." (David Spiegelhalter, "The Art of Statistics: Learning from Data", 2019)

"Each decision about what data to gather and how to analyze them is akin to standing on a pathway as it forks left and right and deciding which way to go. What seems like a few simple choices can quickly multiply into a labyrinth of different possibilities. Make one combination of choices and you’ll reach one conclusion; make another, equally reasonable, and you might find a very different pattern in the data." (Tim Harford, "The Data Detective: Ten easy rules to make sense of statistics", 2020)

"If the data that go into the analysis are flawed, the specific technical details of the analysis don’t matter. One can obtain stupid results from bad data without any statistical trickery. And this is often how bullshit arguments are created, deliberately or otherwise. To catch this sort of bullshit, you don’t have to unpack the black box. All you have to do is think carefully about the data that went into the black box and the results that came out. Are the data unbiased, reasonable, and relevant to the problem at hand? Do the results pass basic plausibility checks? Do they support whatever conclusions are drawn?" (Carl T Bergstrom & Jevin D West, "Calling Bullshit: The Art of Skepticism in a Data-Driven World", 2020)

"Inference is to bring about a new thought, which in logic amounts to drawing a conclusion, and more generally involves using what we already know, and what we see or observe, to update prior beliefs. […] Inference is also a leap of sorts, deemed reasonable […] Inference is a basic cognitive act for intelligent minds. If a cognitive agent (a person, an AI system) is not intelligent, it will infer badly. But any system that infers at all must have some basic intelligence, because the very act of using what is known and what is observed to update beliefs is inescapably tied up with what we mean by intelligence. If an AI system is not inferring at all, it doesn’t really deserve to be called AI." (Erik J Larson, "The Myth of Artificial Intelligence: Why Computers Can’t Think the Way We Do", 2021)

"Any time you run regression analysis on arbitrary real-world observational data, there’s a significant risk that there’s hidden confounding in your dataset and so causal conclusions from such analysis are likely to be (causally) biased." (Aleksander Molak, "Causal Inference and Discovery in Python", 2023)

23 October 2018

🔭Data Science: Simulations (Just the Quotes)

"The mathematical and computing techniques for making programmed decisions replace man but they do not generally simulate him." (Herbert A Simon, "Corporations 1985", 1960)

"The main object of cybernetics is to supply adaptive, hierarchical models, involving feedback and the like, to all aspects of our environment. Often such modelling implies simulation of a system where the simulation should achieve the object of copying both the method of achievement and the end result. Synthesis, as opposed to simulation, is concerned with achieving only the end result and is less concerned (or completely unconcerned) with the method by which the end result is achieved. In the case of behaviour, psychology is concerned with simulation, while cybernetics, although also interested in simulation, is primarily concerned with synthesis." (Frank H George, "Soviet Cybernetics, the militairy and Professor Lerner", New Scientist, 1973)

"Computer based simulation is now in wide spread use to analyse system models and evaluate theoretical solutions to observed problems. Since important decisions must rely on simulation, it is essential that its validity be tested, and that its advocates be able to describe the level of authentic representation which they achieved." (Richard Hamming, 1975)

"When a real situation involves chance we have to use probability mathematics to understand it quantitatively. Direct mathematical solutions sometimes exist […] but most real systems are too complicated for direct solutions. In these cases the computer, once taught to generate random numbers, can use simulation to get useful answers to otherwise impossible problems." (Robert Hooke, "How to Tell the Liars from the Statisticians", 1983)

"The real leverage in most management situations lies in understanding dynamic complexity, not detail complexity. […] Unfortunately, most 'systems analyses' focus on detail complexity not dynamic complexity. Simulations with thousands of variables and complex arrays of details can actually distract us from seeing patterns and major interrelationships. In fact, sadly, for most people 'systems thinking' means 'fighting complexity with complexity', devising increasingly 'complex' (we should really say 'detailed') solutions to increasingly 'complex' problems. In fact, this is the antithesis of real systems thinking." (Peter M Senge, "The Fifth Discipline: The Art and Practice of the Learning Organization", 1990)

"A model for simulating dynamic system behavior requires formal policy descriptions to specify how individual decisions are to be made. Flows of information are continuously converted into decisions and actions. No plea about the inadequacy of our understanding of the decision-making processes can excuse us from estimating decision-making criteria. To omit a decision point is to deny its presence - a mistake of far greater magnitude than any errors in our best estimate of the process." (Jay W Forrester, "Policies, decisions and information sources for modeling", 1994)

"A field of study that includes a methodology for constructing computer simulation models to achieve better under-standing of social and corporate systems. It draws on organizational studies, behavioral decision theory, and engineering to provide a theoretical and empirical base for structuring the relationships in complex systems." (Virginia Anderson & Lauren Johnson, "Systems Thinking Basics: From Concepts to Casual Loops", 1997)

"What it means for a mental model to be a structural analog is that it embodies a representation of the spatial and temporal relations among, and the causal structures connecting the events and entities depicted and whatever other information that is relevant to the problem-solving talks. […] The essential points are that a mental model can be nonlinguistic in form and the mental mechanisms are such that they can satisfy the model-building and simulative constraints necessary for the activity of mental modeling." (Nancy J Nersessian, "Model-based reasoning in conceptual change", 1999)

"A neural network is a particular kind of computer program, originally developed to try to mimic the way the human brain works. It is essentially a computer simulation of a complex circuit through which electric current flows." (Keith J Devlin & Gary Lorden, "The Numbers behind NUMB3RS: Solving crime with mathematics", 2007)

"[...] a model is a tool for taking decisions and any decision taken is the result of a process of reasoning that takes place within the limits of the human mind. So, models have eventually to be understood in such a way that at least some layer of the process of simulation is comprehensible by the human mind. Otherwise, we may find ourselves acting on the basis of models that we don’t understand, or no model at all.” (Ugo Bardi, “The Limits to Growth Revisited”, 2011)

"Not only the mathematical way of thinking, but also simulations assisted by mathematical methods, is quite effective in solving problems. The latter is utilized in various fields, including detection of causes of troubles, optimization of expected performances, and best possible adjustments of usage conditions. Conversely, without the aid of mathematical methods, our problem-solving effort will get stuck most probably [...]" (Shiro Hiruta, "Mathematics Contributing to Innovation of Management", [in "What Mathematics Can Do for You"] 2013)

"System dynamics [...] uses models and computer simulations to understand behavior of an entire system, and has been applied to the behavior of large and complex national issues. It portrays the relationships in systems as feedback loops, lags, and other descriptors to explain dynamics, that is, how a system behaves over time. Its quantitative methodology relies on what are called 'stock-and-flow diagrams' that reflect how levels of specific elements accumulate over time and the rate at which they change. Qualitative systems thinking constructs evolved from this quantitative discipline." (Karen L Higgins, "Economic Growth and Sustainability: Systems Thinking for a Complex World", 2015)

"Optimization is more than finding the best simulation results. It is itself a complex and evolving field that, subject to certain information constraints, allows data scientists, statisticians, engineers, and traders alike to perform reality checks on modeling results." (Chris Conlan, "Automated Trading with R: Quantitative Research and Platform Development", 2016)

20 October 2018

💫ERP Systems: AX 2009 vs. D365 FO - Product Prices with two Different Queries

In Dynamics AX 2009 as well in Dynamics 365 for Finance and Operations (D365 FO) the Inventory, Purchase and Sales Prices are stored in InventTableModule table on separate rows, independently of the Products, stored in InventTable. Thus for each Product there are three rows that need to be joined. To get all the Prices one needs to write a query like this one:

SELECT ITM.DataAreaId 
, ITM.ItemId 
, ITM.ItemName 
, ILP.UnitId InventUnitId

, IPP.UnitId PurchUnitId
, ISP.UnitId SalesUnitId
, ILP.Price InventPrice
, IPP.Price PurchPrice
, ISP.Price SalesPrice
FROM dbo.InventTable ITM
     JOIN dbo.InventTableModule ILP
       ON ITM.ItemId = ILP.ItemId
      AND ITM.DATAAREAID = ILP.DATAAREAID 
      AND ILP.ModuleType = 0 -- Inventory
     JOIN dbo.InventTableModule IPP
       ON ITM.ItemId = IPP.ItemId
      AND ITM.DATAAREAID = IPP.DATAAREAID 
      AND IPP.ModuleType = 1 -- Purchasing
     JOIN dbo.InventTableModule ISP
       ON ITM.ItemId = ISP.ItemId
      AND ITM.DATAAREAID = ISP.DATAAREAID 
      AND ISP.ModuleType = 2 -- Sales 
WHERE ITM.DataAreaId = 'abc'

Looking at the query plan one can see that SQL Server uses three Merge Joins together with a Clustered Index Seek, and recommends using a Nonclustered Index to increase the performance:

To avoid needing to perform three joins, one can rewrite the query with the help of a GROUP BY, thus reducing the number of Merge Joins from three to one:

SELECT ITD.DataAreaID 
, ITD.ItemId 
, ITM.ItemName  
, ITD.InventPrice
, ITD.InventPriceUnit
, ITD.InventUnitId
, ITD.PurchPrice
, ITD.PurchPriceUnit
, ITD.PurchUnitId
, ITD.SalesPrice
, ITD.SalesPriceUnit
, ITD.SalesUnitId
FROM dbo.InventTable ITM
     LEFT JOIN (—price details
     SELECT ITD.ITEMID
     , ITD.DATAAREAID 
     , Max(CASE ITD.ModuleType WHEN 0 THEN ITD.Price END) InventPrice
     , Max(CASE ITD.ModuleType WHEN 0 THEN ITD.PriceUnit END) InventPriceUnit
     , Max(CASE ITD.ModuleType WHEN 0 THEN ITD.UnitId END) InventUnitId
     , Max(CASE ITD.ModuleType WHEN 1 THEN ITD.Price END) PurchPrice
     , Max(CASE ITD.ModuleType WHEN 1 THEN ITD.PriceUnit END) PurchPriceUnit
     , Max(CASE ITD.ModuleType WHEN 1 THEN ITD.UnitId END) PurchUnitId
     , Max(CASE ITD.ModuleType WHEN 2 THEN ITD.Price END) SalesPrice
     , Max(CASE ITD.ModuleType WHEN 2 THEN ITD.PriceUnit END) SalesPriceUnit
     , Max(CASE ITD.ModuleType WHEN 2 THEN ITD.UnitId END) SalesUnitId
     FROM dbo.InventTableModule ITD
     GROUP BY ITD.ITEMID
     , ITD.DATAAREAID 
    ) ITD
       ON ITD.ITEMID = ITM.ITEMID
      AND ITD.DATAAREAID = ITM.DATAAREAID
WHERE ITD.DataAreaID = 'abc'
ORDER BY ITD.ItemId

And here’s the plan for it:

It seems that despite the overhead introduced by the GROUP BY, the second query performs better, at least when all or almost all records are performed.

In Dynamics AX there were seldom the occasions when I needed to write similar queries. Probably, most of the cases are related to the use of window functions (e.g. the first n Vendors or Vendor Prices). On the other side, a bug in previous versions of Oracle, which was limiting the number of JOINs that could be used, made me consider such queries also when was maybe not the case.

What about you? Did you had the chance to write similar queries? What made you consider the second type of query?

Happy coding!

18 October 2018

💎SQL Reloaded: String_Split Function

Today, as I was playing with a data model – although simplistic, the simplicity kicked back when I had to deal with fields in which several values were encoded within the same column. The “challenge” resided in the fact that the respective attributes were quintessential in analyzing and matching the data with other datasets. Therefore, was needed a mix between flexibility and performance. It was the point where the String_Split did its magic. Introduced with SQL Server 2016 and available only under compatibility level 130 and above, the function splits a character expression using specified separator.
Here’s a simplified example of the code I had to write:

-- cleaning up
-- DROP TABLE dbo.ItemList

-- test data 
SELECT A.*
INTO dbo.ItemList 
FROM (
VALUES (1, '1001:a:blue')
, (2, '1001:b:white')
, (3, '1002:a:blue')
, (4, '1002:b:white')
, (5, '1002:c:red')
, (6, '1003:b:white')
, (7, '1003:c:red')) A(Id, List)

-- checking the data
SELECT *
FROM dbo.ItemList 

-- prepared data
SELECT ITM.Id 
, ITM.List 
, DAT.ItemId
, DAT.Size
, DAT.Country
FROM dbo.ItemList ITM
   LEFT JOIN (-- transformed data 
 SELECT DAT.id
 , [1] AS ItemId
 , [2] AS Size
 , [3] AS Country
 FROM(
  SELECT ITM.id
  , TX.Value
  , ROW_NUMBER() OVER (PARTITION BY ITM.id ORDER BY ITM.id) Ranking
  FROM dbo.ItemList ITM
  CROSS APPLY STRING_SPLIT(ITM.List, ':') TX
  ) DAT
 PIVOT (MAX(DAT.Value) FOR DAT.Ranking IN ([1],[2],[3])) 
 DAT
  ) DAT
   ON ITM.Id = DAT.Id

And, here’s the output:

For those dealing with former versions of SQL Server the functionality provided by the String_Split can be implemented with the help of user-defined functions, either by using the old fashioned loops (see this), cursors (see this) or more modern common table expressions (see this). In fact, these implementations are similar to the String_Split function, the difference being made mainly by the performance.

Happy coding!

13 October 2018

🔭Data Science: Groups (Just the Quotes)

"The object of statistical science is to discover methods of condensing information concerning large groups of allied facts into brief and compendious expressions suitable for discussion. The possibility of doing this is based on the constancy and continuity with which objects of the same species are found to vary." (Sir Francis Galton, "Inquiries into Human Faculty and Its Development, Statistical Methods", 1883)

"Some of the common ways of producing a false statistical argument are to quote figures without their context, omitting the cautions as to their incompleteness, or to apply them to a group of phenomena quite different to that to which they in reality relate; to take these estimates referring to only part of a group as complete; to enumerate the events favorable to an argument, omitting the other side; and to argue hastily from effect to cause, this last error being the one most often fathered on to statistics. For all these elementary mistakes in logic, statistics is held responsible." (Sir Arthur L Bowley, "Elements of Statistics", 1901)

"Statistics may be defined as numerical statements of facts by means of which large aggregates are analyzed, the relations of individual units to their groups are ascertained, comparisons are made between groups, and continuous records are maintained for comparative purposes." (Melvin T Copeland. "Statistical Methods" [in: Harvard Business Studies, Vol. III, Ed. by Melvin T Copeland, 1917])

"'Correlation' is a term used to express the relation which exists between two series or groups of data where there is a causal connection. In order to have correlation it is not enough that the two sets of data should both increase or decrease simultaneously. For correlation it is necessary that one set of facts should have some definite causal dependence upon the other set [...]" (Willard C Brinton, "Graphic Methods for Presenting Facts", 1919)

"The conception of statistics as the study of variation is the natural outcome of viewing the subject as the study of populations; for a population of individuals in all respects identical is completely described by a description of anyone individual, together with the number in the group. The populations which are the object of statistical study always display variations in one or more respects. To speak of statistics as the study of variation also serves to emphasise the contrast between the aims of modern statisticians and those of their predecessors." (Sir Ronald A Fisher, "Statistical Methods for Research Workers", 1925)

"An average is a single value which is taken to represent a group of values. Such a representative value may be obtained in several ways, for there are several types of averages. […] Probably the most commonly used average is the arithmetic average, or arithmetic mean." (John R Riggleman & Ira N Frisbee, "Business Statistics", 1938)

"The fact that index numbers attempt to measure changes of items gives rise to some knotty problems. The dispersion of a group of products increases with the passage of time, principally because some items have a long-run tendency to fall while others tend to rise. Basic changes in the demand is fundamentally responsible. The averages become less and less representative as the distance from the period increases." (Anna C Rogers, "Graphic Charts Handbook", 1961)

"Pencil and paper for construction of distributions, scatter diagrams, and run-charts to compare small groups and to detect trends are more efficient methods of estimation than statistical inference that depends on variances and standard errors, as the simple techniques preserve the information in the original data." (William E Deming, "On Probability as Basis for Action" American Statistician Vol. 29 (4), 1975)

"When the distributions of two or more groups of univariate data are skewed, it is common to have the spread increase monotonically with location. This behavior is monotone spread. Strictly speaking, monotone spread includes the case where the spread decreases monotonically with location, but such a decrease is much less common for raw data. Monotone spread, as with skewness, adds to the difficulty of data analysis. For example, it means that we cannot fit just location estimates to produce homogeneous residuals; we must fit spread estimates as well. Furthermore, the distributions cannot be compared by a number of standard methods of probabilistic inference that are based on an assumption of equal spreads; the standard t-test is one example. Fortunately, remedies for skewness can cure monotone spread as well." (William S Cleveland, "Visualizing Data", 1993)

"The central limit theorem […] states that regardless of the shape of the curve of the original population, if you repeatedly randomly sample a large segment of your group of interest and take the average result, the set of averages will follow a normal curve." (Charles Livingston & Paul Voakes, "Working with Numbers and Statistics: A handbook for journalists", 2005)

"Random events often come like the raisins in a box of cereal - in groups, streaks, and clusters. And although Fortune is fair in potentialities, she is not fair in outcomes." (Leonard Mlodinow, "The Drunkard’s Walk: How Randomness Rules Our Lives", 2008)

"[...] statisticians are constantly looking out for missed nuances: a statistical average for all groups may well hide vital differences that exist between these groups. Ignoring group differences when they are present frequently portends inequitable treatment." (Kaiser Fung, "Numbers Rule the World", 2010)

"The issue of group differences is fundamental to statistical thinking. The heart of this matter concerns which groups should be aggregated and which shouldn’t." (Kaiser Fung, "Numbers Rule the World", 2010)

"Be careful not to confuse clustering and stratification. Even though both of these sampling strategies involve dividing the population into subgroups, both the way in which the subgroups are sampled and the optimal strategy for creating the subgroups are different. In stratified sampling, we sample from every stratum, whereas in cluster sampling, we include only selected whole clusters in the sample. Because of this difference, to increase the chance of obtaining a sample that is representative of the population, we want to create homogeneous groups for strata and heterogeneous (reflecting the variability in the population) groups for clusters." (Roxy Peck et al, "Introduction to Statistics and Data Analysis" 4th Ed., 2012)

"If the group is large enough, even very small differences can become statistically significant." (Victor Cohn & Lewis Cope, "News & Numbers: A writer’s guide to statistics" 3rd Ed, 2012)

"Self-selection bias occurs when people choose to be in the data - for example, when people choose to go to college, marry, or have children. […] Self-selection bias is pervasive in 'observational data', where we collect data by observing what people do. Because these people chose to do what they are doing, their choices may reflect who they are. This self-selection bias could be avoided with a controlled experiment in which people are randomly assigned to groups and told what to do." (Gary Smith, "Standard Deviations", 2014)

"Often when people relate essentially the same variable in two different groups, or at two different times, they see this same phenomenon - the tendency of the response variable to be closer to the mean than the predicted value. Unfortunately, people try to interpret this by thinking that the performance of those far from the mean is deteriorating, but it’s just a mathematical fact about the correlation. So, today we try to be less judgmental about this phenomenon and we call it regression to the mean. We managed to get rid of the term 'mediocrity', but the name regression stuck as a name for the whole least squares fitting procedure - and that’s where we get the term regression line." (Richard D De Veaux et al, "Stats: Data and Models", 2016)

"Bias is error from incorrect assumptions built into the model, such as restricting an interpolating function to be linear instead of a higher-order curve. [...] Errors of bias produce underfit models. They do not fit the training data as tightly as possible, were they allowed the freedom to do so. In popular discourse, I associate the word 'bias' with prejudice, and the correspondence is fairly apt: an apriori assumption that one group is inferior to another will result in less accurate predictions than an unbiased one. Models that perform lousy on both training and testing data are underfit." (Steven S Skiena, "The Data Science Design Manual", 2017)

"To be any good, a sample has to be representative. A sample is representative if every person or thing in the group you’re studying has an equally likely chance of being chosen. If not, your sample is biased. […] The job of the statistician is to formulate an inventory of all those things that matter in order to obtain a representative sample. Researchers have to avoid the tendency to capture variables that are easy to identify or collect data on - sometimes the things that matter are not obvious or are difficult to measure." (Daniel J Levitin, "Weaponized Lies", 2017)

"If you study one group and assume that your results apply to other groups, this is extrapolation. If you think you are studying one group, but do not manage to obtain a representative sample of that group, this is a different problem. It is a problem so important in statistics that it has a special name: selection bias. Selection bias arises when the individuals that you sample for your study differ systematically from the population of individuals eligible for your study." (Carl T Bergstrom & Jevin D West, "Calling Bullshit: The Art of Skepticism in a Data-Driven World", 2020)

06 October 2018

🔭Data Science: Users (Just the Quotes)

"Knowledge is a potential for a certain type of action, by which we mean that the action would occur if certain tests were run. For example, a library plus its user has knowledge if a certain type of response will be evoked under a given set of stipulations." (C West Churchman, "The Design of Inquiring Systems", 1971)

"To conceive of knowledge as a collection of information seems to rob the concept of all of its life. Knowledge resides in the user and not in the collection. It is how the user reacts to a collection of information that matters." (C West Churchman, "The Design of Inquiring Systems", 1971)

"Averages, ranges, and histograms all obscure the time-order for the data. If the time-order for the data shows some sort of definite pattern, then the obscuring of this pattern by the use of averages, ranges, or histograms can mislead the user. Since all data occur in time, virtually all data will have a time-order. In some cases this time-order is the essential context which must be preserved in the presentation." (Donald J Wheeler," Understanding Variation: The Key to Managing Chaos" 2nd Ed., 2000)

"Visualizations can be used to explore data, to confirm a hypothesis, or to manipulate a viewer. [...] In exploratory visualization the user does not necessarily know what he is looking for. This creates a dynamic scenario in which interaction is critical. [...] In a confirmatory visualization, the user has a hypothesis that needs to be tested. This scenario is more stable and predictable. System parameters are often predetermined." (Usama Fayyad et al, "Information Visualization in Data Mining and Knowledge Discovery", 2002)

"Dashboards and visualization are cognitive tools that improve your 'span of control' over a lot of business data. These tools help people visually identify trends, patterns and anomalies, reason about what they see and help guide them toward effective decisions. As such, these tools need to leverage people's visual capabilities. With the prevalence of scorecards, dashboards and other visualization tools now widely available for business users to review their data, the issue of visual information design is more important than ever." (Richard Brath & Michael Peters, "Dashboard Design: Why Design is Important," DM Direct, 2004)

"Data are raw facts and figures that by themselves may be useless. To be useful, data must be processed into finished information, that is, data converted into a meaningful and useful context for specific users. An increasing challenge for managers is being able to identify and access useful information." (Richard L Daft & Dorothy Marcic, "Understanding Management" 5th Ed., 2006)

"[...] construction of a data model is precisely the selective relevant depiction of the phenomena by the user of the theory required for the possibility of representation of the phenomenon." (Bas C van Fraassen, "Scientific Representation: Paradoxes of Perspective", 2008)

"The thread that ties most of these applications together is that data collected from users provides added value. Whether that data is search terms, voice samples, or product reviews, the users are in a feedback loop in which they contribute to the products they use. That’s the beginning of data science." (Mike Loukides, "What Is Data Science?", 2011)

"What is really important is to remember that no matter how creative and innovative you wish to be in your graphics and visualizations, the first thing you must do, before you put a finger on the computer keyboard, is ask yourself what users are likely to try to do with your tool." (Alberto Cairo, "The Functional Art", 2011)

"As data scientists, we prefer to interact with the raw data. We know how to import it, transform it, mash it up with other data sources, and visualize it. Most of your customers can’t do that. One of the biggest challenges of developing a data product is figuring out how to give data back to the user. Giving back too much data in a way that’s overwhelming and paralyzing is 'data vomit'. It’s natural to build the product that you would want, but it’s very easy to overestimate the abilities of your users. The product you want may not be the product they want." (Dhanurjay Patil, "Data Jujitsu: The Art of Turning Data into Product", 2012)

"By giving data back to the user, you can create both engagement and revenue. We’re far enough into the data game that most users have realized that they’re not the customer, they’re the product. Their role in the system is to generate data, either to assist in ad targeting or to be sold to the highest bidder, or both." (Dhanurjay Patil, "Data Jujitsu: The Art of Turning Data into Product", 2012)

"Generalizing beyond advertising, when building any data product in which the data is obfuscated (where there isn’t a clear relationship between the user and the result), you can compromise on precision, but not on recall. But when the data is exposed, focus on high precision." (Dhanurjay Patil, "Data Jujitsu: The Art of Turning Data into Product", 2012)

"Successful information design in movement systems gives the user the information he needs - and only the information he needs - at every decision point." (Joel Katz, "Designing Information: Human factors and common sense in information design", 2012)

"You can give your data product a better chance of success by carefully setting the users’ expectations. [...] One under-appreciated facet of designing data products is how the user feels after using the product. Does he feel good? Empowered? Or disempowered and dejected?" (Dhanurjay Patil, "Data Jujitsu: The Art of Turning Data into Product", 2012)

"Creating effective visualizations is hard. Not because a dataset requires an exotic and bespoke visual representation - for many problems, standard statistical charts will suffice. And not because creating a visualization requires coding expertise in an unfamiliar programming language [...]. Rather, creating effective visualizations is difficult because the problems that are best addressed by visualization are often complex and ill-formed. The task of figuring out what attributes of a dataset are important is often conflated with figuring out what type of visualization to use. Picking a chart type to represent specific attributes in a dataset is comparatively easy. Deciding on which data attributes will help answer a question, however, is a complex, poorly defined, and user-driven process that can require several rounds of visualization and exploration to resolve." (Danyel Fisher & Miriah Meyer, "Making Data Visual", 2018)

"Designing effective visualizations presents a paradox. On the one hand, visualizations are intended to help users learn about parts of their data that they don’t know about. On the other hand, the more we know about the users’ needs and the context of their data, the better we can design a visualization to serve them." (Danyel Fisher & Miriah Meyer, "Making Data Visual", 2018)