SQL Troubles

01 November 2018

🔭Data Science: Probabilistic Models (Just the Quotes)

"A deterministic system is one in which the parts interact in a perfectly predictable way. There is never any room for doubt: given a last state of the system and the programme of information by defining its dynamic network, it is always possible to predict, without any risk of error, its succeeding state. A probabilistic system, on the other hand, is one about which no precisely detailed prediction can be given. The system may be studied intently, and it may become more and more possible to say what it is likely to do in any given circumstances. But the system simply is not predetermined, and a prediction affecting it can never escape from the logical limitations of the probabilities in which terms alone its behaviour can be described." (Stafford Beer, "Cybernetics and Management", 1959)

"[...] there can be such a thing as a simple probabilistic system. For example, consider the tossing of a penny. Here is a perfectly simple system, but one which is notoriously unpredictable. It maybe described in terms of a binary decision process, with a built-in even probability between the two possible outcomes." (Stafford Beer, "Cybernetics and Management", 1959)

"When loops are present, the network is no longer singly connected and local propagation schemes will invariably run into trouble. [...] If we ignore the existence of loops and permit the nodes to continue communicating with each other as if the network were singly connected, messages may circulate indefinitely around the loops and process may not converges to a stable equilibrium. […] Such oscillations do not normally occur in probabilistic networks […] which tend to bring all messages to some stable equilibrium as time goes on. However, this asymptotic equilibrium is not coherent, in the sense that it does not represent the posterior probabilities of all nodes of the network." (Judea Pearl, "Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference", 1988)

"We will use the convenient expression 'chosen at random' to mean that the probabilities of the events in the sample space are all the same unless some modifying words are near to the words 'at random'. Usually we will compute the probability of the outcome based on the uniform probability model since that is very common in modeling simple situations. However, a uniform distribution does not imply that it comes from a random source; […]" (Richard W Hamming, "The Art of Probability for Scientists and Engineers", 1991)

"Exploratory data analysis (EDA) is a collection of techniques that reveal (or search for) structure in a data set before calculating any probabilistic model. Its purpose is to obtain information about the data distribution (univariate or multivariate), about the presence of outliers and clusters, to disclose relationships and correlations between objects and/or variables." (Ildiko E Frank & Roberto Todeschini, "The Data Analysis Handbook", 1994)

"To understand what kinds of problems are solvable by the Monte Carlo method, it is important to note that the method enables simulation of any process whose development is influenced by random factors. Second, for many mathematical problems involving no chance, the method enables us to artificially construct a probabilistic model (or several such models), making possible the solution of the problems." (Ilya M Sobol, "A Primer for the Monte Carlo Method", 1994)

"The role of graphs in probabilistic and statistical modeling is threefold: (1) to provide convenient means of expressing substantive assumptions; (2) to facilitate economical representation of joint probability functions; and (3) to facilitate efficient inferences from observations." (Judea Pearl, "Causality: Models, Reasoning, and Inference", 2000)

"The nice thing with Monte Carlo is that you play a game of let’s pretend, like this: first of all there are ten scenarios with different probabilities, so let’s first pick a probability. The dice in this case is a random number generator in the computer. You roll the dice and pick a scenario to work with. Then you roll the dice for a certain speed, and you roll the dice again to see what direction it took. The last thing is that it collided with the bottom at an unknown time so you roll dice for the unknown time. So now you have speed, direction, starting point, time. Given them all, I know precisely where it [could have] hit the bottom. You have the computer put a point there. Rolling dice, I come up with different factors for each scenario. If I had enough patience, I could do it with pencil and paper. We calculated ten thousand points. So you have ten thousand points on the bottom of the ocean that represent equally likely positions of the sub. Then you draw a grid, count the points in each cell of the grid, saying that 10% of the points fall in this cell, 1% in that cell, and those percentages are what you use for probabilities for the prior for the individual distributions." (Henry R Richardson) [in (Sharon B McGrayne, "The Theory That Would Not Die", 2011)]

"A major advantage of probabilistic models is that they can be easily applied to virtually any data type (or mixed data type), as long as an appropriate generative model is available for each mixture component. [...] A downside of probabilistic models is that they try to fit the data to a particular kind of distribution, which may often not be appropriate for the underlying data. Furthermore, as the number of model parameters increases, over-fitting becomes more common. In such cases, the outliers may fit the underlying model of normal data. Many parametric models are also harder to interpret in terms of intensional knowledge, especially when the parameters of the model cannot be intuitively presented to an analyst in terms of underlying attributes. This can defeat one of the important purposes of anomaly detection, which is to provide diagnostic understanding of the abnormal data generative process." (Charu C Aggarwal, "Outlier Analysis", 2013)

"The process of using a probabilistic model to answer a query, given evidence." (Avi Pfeffer, "Practical Probabilistic Programming", 2016)

"Monte Carlo simulations handle uncertainty by using a computer’s random number generator to determine outcomes. Done over and over again, the simulations show the distribution of the possible outcomes. [...] The beauty of these Monte Carlo simulations is that they allow users to see the probabilistic consequences of their decisions, so that they can make informed choices. [...] Monte Carlo simulations are one of the most valuable applications of data science because they can be used to analyze virtually any uncertain situation where we are able to specify the nature of the uncertainty [...]" (Gary Smith & Jay Cordes, "The 9 Pitfalls of Data Science", 2019)

"A simple probabilistic model would not be sufficient to generate the fantastic diversity we see." Wolfgang Pauli

31 October 2018

🔭Data Science: Deep Learning (Just the Quotes)

"Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression." (Pankaj Mehta & David J Schwab, "An exact mapping between the Variational Renormalization Group and Deep Learning", 2014)

"Deep learning is about using a stacked hierarchy of feature detectors. [...] we use pattern detectors and we build them into networks that are arranged in hundreds of layers and then we adjust the links between these layers, usually using some kind of gradient descent." (Joscha Bach, "Joscha: Computational Meta-Psychology", 2015)

"The power of deep learning models comes from their ability to classify or predict nonlinear data using a modest number of parallel nonlinear steps4. A deep learning model learns the input data features hierarchy all the way from raw data input to the actual classification of the data. Each layer extracts features from the output of the previous layer." (N D Lewis, "Deep Learning Made Easy with R: A Gentle Introduction for Data Science", 2016)

"Although deep learning systems share some similarities with machine learning systems, certain characteristics make them sufficiently distinct. For example, conventional machine learning systems tend to be simpler and have fewer options for training. DL systems are noticeably more sophisticated; they each have a set of training algorithms, along with several parameters regarding the systems’ architecture. This is one of the reasons we consider them a distinct framework in data science." (Yunus E Bulut & Zacharias Voulgaris, "AI for Data Science: Artificial Intelligence Frameworks and Functionality for Deep Learning, Optimization, and Beyond", 2018)

"Deep learning broadly describes the large family of neural network architectures that contain multiple, interacting hidden layers." (Benjamin Bengfort et al, Applied Text Analysis with Python, 2018)

"Deep learning has instead given us machines with truly impressive abilities but no intelligence. The difference is profound and lies in the absence of a model of reality." (Judea Pearl, "The Book of Why: The New Science of Cause and Effect", 2018)

"DL systems also tend to be more autonomous than their machine counterparts. To some extent, DL systems can do their own feature engineering. More conventional systems tend to require more fine-tuning of the feature-set, and sometimes require dimensionality reduction to provide any decent results. In addition, the generalization of conventional ML systems when provided with additional data generally don’t improve as much as DL systems. This is also one of the key characteristics that makes DL systems a preferable option when big data is involved." (Yunus E Bulut & Zacharias Voulgaris, "AI for Data Science: Artificial Intelligence Frameworks and Functionality for Deep Learning, Optimization, and Beyond", 2018)

"[…] deep learning has succeeded primarily by showing that certain questions or tasks we thought were difficult are in fact not. It has not addressed the truly difficult questions that continue to prevent us from achieving humanlike AI." (Judea Pearl & Dana Mackenzie, "The Book of Why: The new science of cause and effect", 2018)

"In essence, deep learning models are just chains of functions, which means that many deep learning libraries tend to have a functional or verbose, declarative style." (Benjamin Bengfort et al, Applied Text Analysis with Python, 2018)

"The second big myth of data science is that every data science project needs big data and needs to use deep learning. In general, having more data helps, but having the right data is the more important requirement" (John D Kelleher & Brendan Tierney, "Data Science", 2018)

"People who assume that extensions of modern machine learning methods like deep learning will somehow 'train up', or learn to be intelligent like humans, do not understand the fundamental limitations that are already known. Admitting the necessity of supplying a bias to learning systems is tantamount to Turing’s observing that insights about mathematics must be supplied by human minds from outside formal methods, since machine learning bias is determined, prior to learning, by human designers." (Erik J Larson, "The Myth of Artificial Intelligence: Why Computers Can’t Think the Way We Do", 2021)

30 October 2018

💠🛠️SQL Server: Administration (Troubleshooting Login Failed for User)

    Since the installation of an SQL Server 2017 on a virtual machine (VM) in the Microsoft Cloud started to appear in the error log records with the following message:

Login failed for user '<domain>\<computer>$'. Reason: Could not find a login matching the name provided. [CLIENT: <local machine>]
Error: 18456, Severity: 14, State: 5.

   From the text it seemed like a permission problem, thing confirmed by the documentation (see [1]), the Error Number and State correspond to a „User Id is not valid“ situation. In a first step I attempted to give permissions to the local account (dollar sign included). The account wasn’t found in the Active Directory (AD), though by typing the account directly in the “Login name” I managed to give temporarily sysadmin permission to the account. The error continued to appear in the error log. I looked then at the accounts under which the SQL Services run - nothing suspect in there.

   Except the error message, which was appearing with an alarming frequency (a few seconds apart), everything seemed to be working on the server. The volume of records (a few hundred thousands over a few days) bloating the error log, as well the fact that I didn’t knew what’s going on made me take the time and further investigate the issue.

Looking today at the Windows Logs for Applications I observed that the error is caused by an account used for the Microsoft SQL Server IaaS Agent and IaaS Query Service. Once I gave permissions to the account the error disappeared.

   The search for a best practice on what permissions to give to the IaaS Agent and IaaS Query Service lead me to [2]. To quote, the “Agent Service needs Local System rights to be able to install and configure SQL Server, attach disks and enable storage pool and manage automated security patching of Windows and SQL server”, while the “IaaS Query Service is started with an NT Service account which is a Sys Admin on the SQL Server”. In fact, this was the only resource I found that made a reference to the IaaS Query Service.

   This was just one of the many scenarios in which the above error appears. For more information see for example [3], [4] or [5].

References:
[1] Microsoft (2017) MSSQLSERVER_18456 [Online] Available from: https://docs.microsoft.com/en-us/sql/relational-databases/errors-events/mssqlserver-18456-database-engine-error?view=sql-server-2017
[2] SQL Database Engine Blog (2018) SQL Server IaaS Extension Query Service for SQL Server on Azure VM, by Mine Tokus Altug [Online] Available from: https://blogs.msdn.microsoft.com/sqlserverstorageengine/2018/10/25/sql-server-iaas-extension-query-service-for-sql-server-on-azure-vm/
[3] Microsoft Support (2018) "Login failed for user" error message when you log on to SQL Server [Online] Available from: https://support.microsoft.com/en-sg/help/555332/login-failed-for-user-error-message-when-you-log-on-to-sql-server
[4] Microsoft Technet (2018) How to Troubleshoot Connecting to the SQL Server Database [Online] Available from: Engine https://social.technet.microsoft.com/wiki/contents/articles/2102.how-to-troubleshoot-connecting-to-the-sql-server-database-engine.aspx
[5] Microsoft Blogs (2011)Troubleshoot Connectivity/Login failures (18456 State x) with SQL Server, by Sakthivel Chidambaram [Online] Available from: https://blogs.msdn.microsoft.com/sqlsakthi/2011/02/06/troubleshoot-connectivitylogin-failures-18456-state-x-with-sql-server/

29 October 2018

💠🛠️SQL Server: Administration (Searching the Error Log)

Searching for a needle in a haystack is an achievable task though may turn to be daunting. Same can be said about searching for a piece of information in the SQL error log. Fortunately, there is xp_readerrorlog, an undocumented (extended) stored procedure, which helps in the process. The stored procedure makes available the content of the error log and provides basic search capabilities via a small set of parameters. For example, it can be used to search for errors, warnings, failed backups, consistency checks, failed logins, databases instant file initializations, and so on. It helps identify whether an event occurred and the time at which the event occurred.

The following are the parameter available with the stored procedure:

Parameter	Name	Type	Description
1	FileToRead	int	0 = Current, 1 or 2, 3, … n Archive Number
2	Logtype	int	1 = SQL Error Log and 2 = SQL Agent log
3	String1	varchar(255)	the string to match the logs on
4	String2	varchar(255)	a second string to match in combination with String1 (AND)
5	StartDate	datetime	beginning date to look from
6	EndDate	datetime	ending date to look up to
7	ResultsOrder		ASC or DESC sorting

Note:
If the SQL Server Agent hasn’t been active, then there will be no Agent log and the call to the stored procedure will return an error.

Here are a few examples of using the stored procedure:

-- listing the content of the current SQL Server error log
EXEC xp_readerrorlog 0, 1

-- listing the content of the second SQL Server error log
EXEC xp_readerrorlog 1, 1

-- listing the content of the current SQL Server Agent log
EXEC xp_readerrorlog 0, 2

-- searching for errors 
EXEC xp_readerrorlog 0, 1, N'error'

-- searching for errors that have to do with consistency checks
EXEC xp_readerrorlog 0, 1, N'error', N'CHECKDB'

-- searching for errors that have to do with consistency checks
EXEC xp_readerrorlog 0, 1, N'failed', N'backups'

-- searching for warnings 
EXEC xp_readerrorlog 0, 1, N'warning'

-- searching who killed a session
EXEC xp_readerrorlog 0, 1, N'kill'

-- searching for I/O information
EXEC xp_readerrorlog 0, 1, N'I/O'

-- searching for consistency checks 
EXEC xp_readerrorlog 0, 1, N'CHECKDB'

-- searching for consistency checks performed via DBCC
EXEC xp_readerrorlog 0, 1, N'DBCC CHECKDB'

-- searching for failed logins  
EXEC xp_readerrorlog 0, 1, N'Login failed'

-- searching for 
EXEC xp_readerrorlog 0, 1, N'[INFO]'

-- searching for shutdowns 
EXEC xp_readerrorlog 0, 1, N'shutdown'

-- searching for a database instant file initialization event  
EXEC xp_readerrorlog 0, 1, N'database instant file initialization'

If the error log is too big it’s important to narrow the search for a given time interval:

-- searching for errors starting with a given date 
DECLARE @StartDate as Date = DateAdd(d, -1, GetDate())
EXEC xp_readerrorlog 0, 1, N'error', N'', @StartDate

-- searching for errors within a time interval 
DECLARE @StartDate as Date = DateAdd(d, -14, GetDate())
DECLARE @EndDate as Date = DateAdd(d, -7, GetDate())
EXEC xp_readerrorlog 0, 1, N'', N'', @StartDate, @EndDate, N'desc'

The output can be dumped into a table especially when is needed to perform a detailed analysis on the error log. It might be interesting to check how often an error message occurred, like in the below example. One can take thus advantage of more complex pattern searching.

-- creating the error log table 
CREATE TABLE dbo.ErrorLogMessages (
    LogDate datetime2(0) 
  , ProcessInfo nvarchar(255)
  , [Text] nvarchar(max))

-- loading the errors 
INSERT INTO dbo.ErrorLogMessages
EXEC xp_readerrorlog 0, 1

-- checking the results 
SELECT *
FROM dbo.ErrorLogMessages

-- checking messages frequency 
SELECT [Text]
, count(*) NoOccurrences
, Min(LogDate) FirstOccurrence
FROM dbo.ErrorLogMessages
GROUP BY [Text]
HAVING count(*)>1
ORDER BY NoOccurrences DESC

-- getting the errors and their information 
SELECT *
FROM (
 SELECT *
 , Lead([Text], 1) OVER (PARTITION BY LogDate, ProcessInfo ORDER BY LogDate) PrevMessage
 FROM dbo.ErrorLogMessages
 ) DAT
WHERE [Text] LIKE '%error:%[0-9]%'

-- cleaning up 
--DROP TABLE IF EXISTS dbo.ErrorLogMessages

For those who don’t have admin permissions it is necessary to explicitly give execute permissions on the xp_readerrorlog stored procedure:

-- giving explicit permissions to account
GRANT EXECUTE ON xp_readerrorlog TO [<account_name>]

Personally, I’ve been using the stored procedure mainly to check whether error messages were logged for a given time interval and whether the consistency checks run without problems. Occasionally, I used it to check for failed logins or sessions terminations (aka kills).

Notes:
Microsoft warns that undocumented objects might change in future releases. Fortunately, xp_readerrorlog made it since SQL Server 2005 to SQL Server 2017, so it might make it further…
The above code was tested also on SQL Server 2017.

Happy coding!

28 October 2018

🔭Data Science: Limits (Just the Quotes)

"Whatever lies beyond the limits of experience, and claims another origin than that of induction and deduction from established data, is illegitimate." (George H Lewes, "The Foundations of a Creed", 1875)

"It is difficult to understand why statisticians commonly limit their inquiries to Averages, and do not revel in more comprehensive views. Their souls seem as dull to the charm of variety as that of the native of one of our flat English counties, whose retrospect of Switzerland was that, if its mountains could be thrown into its lakes, two nuisances would be got rid of at once. An Average is but a solitary fact, whereas if a single other fact be added to it, an entire Normal Scheme, which nearly corresponds to the observed one, starts potentially into existence." (Sir Francis Galton, "Natural Inheritance", 1889)

"Physical research by experimental methods is both a broadening and a narrowing field. There are many gaps yet to be filled, data to be accumulated, measurements to be made with great precision, but the limits within which we must work are becoming, at the same time, more and more defined." (Elihu Thomson, "Annual Report of the Board of Regents of the Smithsonian Institution", 1899)

"Statistics may, for instance, be called the science of counting. Counting appears at first sight to be a very simple operation, which any one can perform or which can be done automatically; but, as a matter of fact, when we come to large numbers, e.g., the population of the United Kingdom, counting is by no means easy, or within the power of an individual; limits of time and place alone prevent it being so carried out, and in no way can absolute accuracy be obtained when the numbers surpass certain limits." (Sir Arthur L Bowley, "Elements of Statistics", 1901)

"The usefulness of the models in constructing a testable theory of the process is severely limited by the quickly increasing number of parameters which must be estimated in order to compare the predictions of the models with empirical results" (Anatol Rapoport, "Prisoner's Dilemma: A study in conflict and cooperation", 1965)

"A real change of theory is not a change of equations - it is a change of mathematical structure, and only fragments of competing theories, often not very important ones conceptually, admit comparison with each other within a limited range of phenomena." (Yuri I Manin, "Mathematics and Physics", 1981)

"Models are often used to decide issues in situations marked by uncertainty. However statistical differences from data depend on assumptions about the process which generated these data. If the assumptions do not hold, the inferences may not be reliable either. This limitation is often ignored by applied workers who fail to identify crucial assumptions or subject them to any kind of empirical testing. In such circumstances, using statistical procedures may only compound the uncertainty." (David A Greedman & William C Navidi, "Regression Models for Adjusting the 1980 Census", Statistical Science Vol. 1 (1), 1986)

"[…] an honest exploratory study should indicate how many comparisons were made […] most experts agree that large numbers of comparisons will produce apparently statistically significant findings that are actually due to chance. The data torturer will act as if every positive result confirmed a major hypothesis. The honest investigator will limit the study to focused questions, all of which make biologic sense. The cautious reader should look at the number of ‘significant’ results in the context of how many comparisons were made." (James L Mills, "Data torturing", New England Journal of Medicine, 1993)

"In spite of the insurmountable computational limits, we continue to pursue the many problems that possess the characteristics of organized complexity. These problems are too important for our well being to give up on them. The main challenge in pursuing these problems narrows down fundamentally to one question: how to deal with systems and associated problems whose complexities are beyond our information processing limits? That is, how can we deal with these problems if no computational power alone is sufficient?" (George Klir, "Fuzzy sets and fuzzy logic", 1995)

"The larger, more detailed and complex the model - the less abstract the abstraction – the smaller the number of people capable of understanding it and the longer it takes for its weaknesses and limitations to be found out." (John Adams, "Risk", 1995)

"[...] the NFL theorems mean that if an algorithm does particularly well on average for one class of problems then it must do worse on average over the remaining problems. In particular, if an algorithm performs better than random search on some class of problems then in must perform worse than random search on the remaining problems. Thus comparisons reporting the performance of a particular algorithm with a particular parameter setting on a few sample problems are of limited utility. While such results do indicate behavior on the narrow range of problems considered, one should be very wary of trying to generalize those results to other problems." (David H Wolpert & William G Macready, "No free lunch theorems for optimization", IEEE Transactions on Evolutionary Computation 1 (1), 1997)

"No comparison between two values can be global. A simple comparison between the current figure and some previous value and convey the behavior of any time series. […] While it is simple and easy to compare one number with another number, such comparisons are limited and weak. They are limited because of the amount of data used, and they are weak because both of the numbers are subject to the variation that is inevitably present in weak world data. Since both the current value and the earlier value are subject to this variation, it will always be difficult to determine just how much of the difference between the values is due to variation in the numbers, and how much, if any, of the difference is due to real changes in the process." (Donald J Wheeler, "Understanding Variation: The Key to Managing Chaos" 2nd Ed., 2000)

"[…] an obvious difference between our best classifiers and human learning is the number of examples required in tasks such as object detection. […] the difficulty of a learning task depends on the size of the required hypothesis space. This complexity determines in turn how many training examples are needed to achieve a given level of generalization error. Thus the complexity of the hypothesis space sets the speed limit and the sample complexity for learning." (Tomaso Poggio & Steve Smale, "The Mathematics of Learning: Dealing with Data", Notices of the AMS, 2003)

"Every number has its limitations; every number is a product of choices that inevitably involve compromise. Statistics are intended to help us summarize, to get an overview of part of the world’s complexity. But some information is always sacrificed in the process of choosing what will be counted and how. Something is, in short, always missing. In evaluating statistics, we should not forget what has been lost, if only because this helps us understand what we still have." (Joel Best, "More Damned Lies and Statistics : How numbers confuse public issues", 2004)

"Statistics depend on collecting information. If questions go unasked, or if they are asked in ways that limit responses, or if measures count some cases but exclude others, information goes ungathered, and missing numbers result. Nevertheless, choices regarding which data to collect and how to go about collecting the information are inevitable." (Joel Best, "More Damned Lies and Statistics: How numbers confuse public issues", 2004)

"The Bayesian approach is based on the following postulates: (B1) Probability describes degree of belief, not limiting frequency. As such, we can make probability statements about lots of things, not just data which are subject to random variation. […] (B2) We can make probability statements about parameters, even though they are fixed constants. (B3) We make inferences about a parameter θ by producing a probability distribution for θ. Inferences, such as point estimates and interval estimates, may then be extracted from this distribution." (Larry A Wasserman, "All of Statistics: A concise course in statistical inference", 2004)

"A population that grows logistically, initially increases exponentially; then the growth lows down and eventually approaches an upper bound or limit. The most well-known form of the model is the logistic differential equation." (Linda J S Allen, "An Introduction to Mathematical Biology", 2007)

"Humans have difficulty perceiving variables accurately […]. However, in general, they tend to have inaccurate perceptions of system states, including past, current, and future states. This is due, in part, to limited ‘mental models’ of the phenomena of interest in terms of both how things work and how to influence things. Consequently, people have difficulty determining the full implications of what is known, as well as considering future contingencies for potential systems states and the long-term value of addressing these contingencies. " (William B. Rouse, "People and Organizations: Explorations of Human-Centered Design", 2007)

"The methodology of feedback design is borrowed from cybernetics (control theory). It is based upon methods of controlled system model’s building, methods of system states and parameters estimation (identification), and methods of feedback synthesis. The models of controlled system used in cybernetics differ from conventional models of physics and mechanics in that they have explicitly specified inputs and outputs. Unlike conventional physics results, often formulated as conservation laws, the results of cybernetical physics are formulated in the form of transformation laws, establishing the possibilities and limits of changing properties of a physical system by means of control." (Alexander L Fradkov, "Cybernetical Physics: From Control of Chaos to Quantum Control", 2007)

"All graphics present data and allow a certain degree of exploration of those same data. Some graphics are almost all presentation, so they allow just a limited amount of exploration; hence we can say they are more infographics than visualization, whereas others are mostly about letting readers play with what is being shown, tilting more to the visualization side of our linear scale. But every infographic and every visualization has a presentation and an exploration component: they present, but they also facilitate the analysis of what they show, to different degrees." (Alberto Cairo, "The Functional Art", 2011)

"There are limits on the data we can gather and the kinds of experiments we can perform."(Charles Wheelan, "Naked Statistics: Stripping the Dread from the Data", 2012)

"Learning theory claims that a machine learning algorithm can generalize well from a finite training set of examples. This seems to contradict some basic principles of logic. Inductive reasoning, or inferring general rules from a limited set of examples, is not logically valid. To logically infer a rule describing every member of a set, one must have information about every member of that set." (Ian Goodfellow et al, "Deep Learning", 2015)

"Science’s predictions are more trustworthy, but they are limited to what we can systematically observe and tractably model. Big data and machine learning greatly expand that scope. Some everyday things can be predicted by the unaided mind, from catching a ball to carrying on a conversation. Some things, try as we might, are just unpredictable. For the vast middle ground between the two, there’s machine learning." (Pedro Domingos, "The Master Algorithm", 2015)

"To make progress, every field of science needs to have data commensurate with the complexity of the phenomena it studies. [...] With big data and machine learning, you can understand much more complex phenomena than before. In most fields, scientists have traditionally used only very limited kinds of models, like linear regression, where the curve you fit to the data is always a straight line. Unfortunately, most phenomena in the world are nonlinear. [...] Machine learning opens up a vast new world of nonlinear models." (Pedro Domingos, "The Master Algorithm", 2015)

"Repeated observations of the same phenomenon do not always produce the same results, due to random noise or error. Sampling errors result when our observations capture unrepresentative circumstances, like measuring rush hour traffic on weekends as well as during the work week. Measurement errors reflect the limits of precision inherent in any sensing device. The notion of signal to noise ratio captures the degree to which a series of observations reflects a quantity of interest as opposed to data variance. As data scientists, we care about changes in the signal instead of the noise, and such variance often makes this problem surprisingly difficult." (Steven S Skiena, "The Data Science Design Manual", 2017)

"Regularization is particularly important when the amount of available data is limited. A neat biological interpretation of regularization is that it corresponds to gradual forgetting, as a result of which 'less important' (i.e., noisy) patterns are removed. In general, it is often advisable to use more complex models with regularization rather than simpler models without regularization." (Charu C Aggarwal, "Neural Networks and Deep Learning: A Textbook", 2018)

"The no free lunch theorems set limits on the range of optimality of any method. That is, each methodology has a ‘catchment area’ where it is optimal or nearly so. Often, intuitively, if the optimality is particularly strong then the effectiveness of the methodology falls off more quickly outside its catchment area than if its optimality were not so strong. Boosting is a case in point: it seems so well suited to binary classification that efforts to date to extend it to give effective classification (or regression) more generally have not been very successful. Overall, it remains to characterize the catchment areas where each class of predictors performs optimally, performs generally well, or breaks down." (Bertrand S Clarke & Jennifer L. Clarke, "Predictive Statistics: Analysis and Inference beyond Models", 2018)

"Unless we’re collecting data ourselves, there’s a limit to how much we can do to combat the problem of missing data. But we can and should remember to ask who or what might be missing from the data we’re being told about. Some missing numbers are obvious […]. Other omissions show up only when we take a close look at the claim in question." (Tim Harford, "The Data Detective: Ten easy rules to make sense of statistics", 2020)

"Despite their predictive power, most analytics and data science practices ignore relationships because it has been historically challenging to process them at scale." (Jesús Barrasa et al, "Knowledge Graphs: Data in Context for Responsive Businesses", 2021)

"Visualisation is fundamentally limited by the number of pixels you can pump to a screen. If you have big data, you have way more data than pixels, so you have to summarise your data. Statistics gives you lots of really good tools for this." (Hadley Wickham)

27 October 2018

🔭Data Science: Research (Just the Quotes)

"The aim of research is the discovery of the equations which subsist between the elements of phenomena." (Ernst Mach, 1898)

"[…] scientific research is somewhat like unraveling complicated tangles of strings, in which luck is almost as vital as skill and accurate observation." (Ernst Mach, "Knowledge and Error: Sketches on the Psychology of Enquiry", 1905)

"Research is fundamentally a state of mind involving continual reexamination of doctrines and axioms upon which current thought and action are based. It is, therefore, critical of existing practices." (Theobald Smith, "The Influence of Research in Bringing into Closer Relationship the Practice of Medicine and Public Health Activities", American Journal of Medical Sciences, 1929)

"In every important advance the physicist finds that the fundamental laws are simplified more and more as experimental research advances. He is astonished to notice how sublime order emerges from what appeared to be chaos. And this cannot be traced back to the workings of his own mind but is due to a quality that is inherent in the world of perception." (Albert Einstein, 1932)

"Statistics is a scientific discipline concerned with collection, analysis, and interpretation of data obtained from observation or experiment. The subject has a coherent structure based on the theory of Probability and includes many different procedures which contribute to research and development throughout the whole of Science and Technology." (Egon Pearson, 1936)

"A successful hypothesis is not necessarily a permanent hypothesis, but it is one which stimulates additional research, opens up new fields, or explains and coordinates previously unrelated facts." (Farrington Daniels, "Outlines of Physical Chemistry", 1948)

"The hypothesis is the principal intellectual instrument in research. Its function is to indicate new experiments and observations and it therefore sometimes leads to discoveries even when not correct itself. We must resist the temptation to become too attached to our hypothesis, and strive to judge it objectively and modify it or discard it as soon as contrary evidence is brought to light. Vigilance is needed to prevent our observations and interpretations being biased in favor of the hypothesis. Suppositions can be used without being believed." (William I B Beveridge, "The Art of Scientific Investigation", 1950)

"Mathematical models for empirical phenomena aid the development of a science when a sufficient body of quantitative information has been accumulated. This accumulation can be used to point the direction in which models should be constructed and to test the adequacy of such models in their interim states. Models, in turn, frequently are useful in organizing and interpreting experimental data and in suggesting new directions for experimental research." (Robert R. Bush & Frederick Mosteller, "A Mathematical Model for Simple Learning", Psychological Review 58, 1951)

"Statistics is the fundamental and most important part of inductive logic. It is both an art and a science, and it deals with the collection, the tabulation, the analysis and interpretation of quantitative and qualitative measurements. It is concerned with the classifying and determining of actual attributes as well as the making of estimates and the testing of various hypotheses by which probable, or expected, values are obtained. It is one of the means of carrying on scientific research in order to ascertain the laws of behavior of things - be they animate or inanimate. Statistics is the technique of the Scientific Method." (Bruce D Greenschields & Frank M Weida, "Statistics with Applications to Highway Traffic Analyses", 1952)

"In a general way it may be said that to think in terms of systems seems the most appropriate conceptual response so far available when the phenomena under study - at any level and in any domain--display the character of being organized, and when understanding the nature of the interdependencies constitutes the research task. In the behavioral sciences, the first steps in building a systems theory were taken in connection with the analysis of internal processes in organisms, or organizations, when the parts had to be related to the whole." (Fred Emery, "The Causal Texture of Organizational Environments", 1963)

"If the null hypothesis is not rejected, [Sir Ronald] Fisher's position was that nothing could be concluded. But researchers find it hard to go to all the trouble of conducting a study only to conclude that nothing can be concluded." (Frank L Schmidt, "Statistical Significance Testing and Cumulative Knowledge", "Psychology: Implications for Training of Researchers, Psychological Methods" Vol. 1 (2), 1996)

"Statisticians can calculate the probability that such random samples represent the population; this is usually expressed in terms of sampling error [...]. The real problem is that few samples are random. Even when researchers know the nature of the population, it can be time-consuming and expensive to draw a random sample; all too often, it is impossible to draw a true random sample because the population cannot be defined. This is particularly true for studies of social problems. [...] The best samples are those that come as close as possible to being random." (Joel Best, "Damned Lies and Statistics: Untangling Numbers from the Media, Politicians, and Activists", 2001)

"Meta-analytic thinking is the consideration of any result in relation to previous results on the same or similar questions, and awareness that combination with future results is likely to be valuable. Meta-analytic thinking is the application of estimation thinking to more than a single study. It prompts us to seek meta-analysis of previous related studies at the planning stage of research, then to report our results in a way that makes it easy to include them in future meta-analyses. Meta-analytic thinking is a type of estimation thinking, because it, too, focuses on estimates and uncertainty." (Geoff Cumming, "Understanding the New Statistics", 2012)

"Statistical cognition is concerned with obtaining cognitive evidence about various statistical techniques and ways to present data. It’s certainly important to choose an appropriate statistical model, use the correct formulas, and carry out accurate calculations. It’s also important, however, to focus on understanding, and to consider statistics as communication between researchers and readers." (Geoff Cumming, "Understanding the New Statistics", 2012)

"Another way to secure statistical significance is to use the data to discover a theory. Statistical tests assume that the researcher starts with a theory, collects data to test the theory, and reports the results - whether statistically significant or not. Many people work in the other direction, scrutinizing the data until they find a pattern and then making up a theory that fits the pattern." (Gary Smith, "Standard Deviations", 2014)

"How can we tell the difference between a good theory and quackery? There are two effective antidotes: common sense and fresh data. If it is a ridiculous theory, we shouldn’t be persuaded by anything less than overwhelming evidence, and even then be skeptical. Extraordinary claims require extraordinary evidence. Unfortunately, common sense is an uncommon commodity these days, and many silly theories have been seriously promoted by honest researchers." (Gary Smith, "Standard Deviations", 2014)

"These practices - selective reporting and data pillaging - are known as data grubbing. The discovery of statistical significance by data grubbing shows little other than the researcher’s endurance. We cannot tell whether a data grubbing marathon demonstrates the validity of a useful theory or the perseverance of a determined researcher until independent tests confirm or refute the finding. But more often than not, the tests stop there. After all, you won’t become a star by confirming other people’s research, so why not spend your time discovering new theories? The data-grubbed theory consequently sits out there, untested and unchallenged." (Gary Smith, "Standard Deviations", 2014)

"A conceptual model is a framework that is initially used in research to outline the possible courses of action or to present an idea or thought. When a conceptual model is developed in a logical manner, it will provide a rigor to the research process." (N Elangovan & R Rajendran, "Conceptual Model: A Framework for Institutionalizing the Vigor in Business Research", 2015)

"Even properly done statistics can’t be trusted. The plethora of available statistical techniques and analyses grants researchers an enormous amount of freedom when analyzing their data, and it is trivially easy to ‘torture the data until it confesses’." (Alex Reinhart, "Statistics Done Wrong: The Woefully Complete Guide", 2015)

"The correlational technique known as multiple regression is used frequently in medical and social science research. This technique essentially correlates many independent (or predictor) variables simultaneously with a given dependent variable (outcome or output). It asks, 'Net of the effects of all the other variables, what is the effect of variable A on the dependent variable?' Despite its popularity, the technique is inherently weak and often yields misleading results. The problem is due to self-selection. If we don’t assign cases to a particular treatment, the cases may differ in any number of ways that could be causing them to differ along some dimension related to the dependent variable. We can know that the answer given by a multiple regression analysis is wrong because randomized control experiments, frequently referred to as the gold standard of research techniques, may give answers that are quite different from those obtained by multiple regression analysis." (Richard E Nisbett, "Mindware: Tools for Smart Thinking", 2015)

"Collecting data through sampling therefore becomes a never-ending battle to avoid sources of bias. [...] While trying to obtain a random sample, researchers sometimes make errors in judgment about whether every person or thing is equally likely to be sampled." (Daniel J Levitin, "Weaponized Lies", 2017)

"Samples give us estimates of something, and they will almost always deviate from the true number by some amount, large or small, and that is the margin of error. […] The margin of error does not address underlying flaws in the research, only the degree of error in the sampling procedure. But ignoring those deeper possible flaws for the moment, there is another measurement or statistic that accompanies any rigorously defined sample: the confidence interval." (Daniel J Levitin, "Weaponized Lies", 2017)

"The job of the statistician is to formulate an inventory of all those things that matter in order to obtain a representative sample. Researchers have to avoid the tendency to capture variables that are easy to identify or collect data on - sometimes the things that matter are not obvious or are difficult to measure." (Daniel J Levitin, "Weaponized Lies", 2017)

💎SQL Reloaded: Wish List (Part I: Replace From)

With SQL Server 2017 Microsoft introduced the Trim function, which not only replaces the combined use of LTrim and RTrim functions, but also replaces other specified characters from the start or end of a string (see my previous post):

-- Trim special characters 
SELECT Trim ('# ' FROM '# 843984 #') Example1
, Trim ('[]' FROM '[843984]') Example2

Output:
Example1 Example2
---------- --------
843984 843984

Similarly, I wish I had a function that replaces special characters from a whole string (not only the trails), for example:

-- Replace special characters 
SELECT Replace ('# ' FROM '# 84#3984 #', '') Example1
, Replace ('[]' FROM '[84][39][84]', '') Example2

Unfortunately, as far I know, there is no such simple function. Therefore, in order to replace the “]”, “[“ and “#” special characters from a string one is forced either to write verbose expressions like in the first example or to include the logic into a user-defined function like in the second:

-- a chain of replacements 
SELECT Replace(Replace(Replace('[#84][#39][#84]', '[' , ''), ']', ''), '#', '') Example1

-- encapsulated replacements
CREATE FUNCTION [dbo].[ReplaceSpecialChars](
  @string nvarchar(max)
, @replacer as nvarchar(1) 
) RETURNS nvarchar(max)
-- replaces the special characters from a string with a given replacer
AS
BEGIN   
  IF CharIndex('#', @string) > 0  
     SET @string = replace(@string, '#', @replacer) 
        
  IF CharIndex('[', @string) > 0  
     SET @string = replace(@string, '[', @replacer) 
    
  IF CharIndex(']', @string) > 0  
     SET @string = replace(@string, ']', @replacer) 
                                
  RETURN Trim(@string)
END

-- testing the function 
SELECT [dbo].[ReplaceSpecialChars]('[#84][#39][#84]', '') Example2

In data cleaning the list of characters to replace can get considerable big (somewhere between 10 and 30 characters). In addition, one can deal with different scenarios in which the strings to be replaced differ and thus one is forced to write multiple such functions.

To the list of special characters often needs to be considered also language specific characters like ß, ü, ä, ö that are replaced with ss, ue, ae, respectively oe (see also the post).

Personally, I would find such a replace function more than useful. What about you?

Happy coding!

💎🏭SQL Reloaded: Drop If Exists (Before and After)

One of the activities of a database developer/administrator is to create and drop objects on the fly. If in objects' creation there are always some aspects to take into account that are implied by object's definition, the verbose syntax for their destruction seemed to be an unnecessary thing. For example for dropping a table, view, stored procedure, function or index, the most used objects, one would need to write such statements:

-- dropping a table 
IF  EXISTS (SELECT * FROM sys.objects WHERE object_id = OBJECT_ID(N'[dbo].[TestTable]') AND type in (N'U'))
DROP TABLE [dbo].[TestTable]

-- dropping a view 
IF  EXISTS (SELECT * FROM sys.views WHERE object_id = OBJECT_ID(N'[dbo].[TestView]'))
DROP VIEW [dbo].[TestView]

-- dropping a stored procedure 
IF  EXISTS (SELECT * FROM sys.objects WHERE object_id = OBJECT_ID(N'[dbo].[TestProcedure]') AND type in (N'P', N'PC'))
DROP PROCEDURE [dbo].[TestProcedure]

-- dropping a fucntion
IF  EXISTS (SELECT * FROM sys.objects WHERE object_id = OBJECT_ID(N'[dbo].[TestFunction]') AND type in (N'FN', N'IF', N'TF', N'FS', N'FT'))
DROP FUNCTION [dbo].[TestFunction]

-- dropping an index
IF EXISTS (SELECT Name FROM sysindexes WHERE Name = 'IX_TestTable') 
DROP INDEX dbo.TestTable.IX_TestTable

Even if Copy-Paste does its magic and SQL Server allows generating scripts for existing objects, there’s still some work do be done in order to drop an object. Fortunately, with SQL Server 2016 Microsoft introduced a simplified syntax for dropping an object, namely DROP IF EXISTS.

The general syntax:

DROP <object_type> [ IF EXISTS ] <object_name>

The above statements can be written as follows:

-- dropping the table 
DROP TABLE IF EXISTS dbo.TestTable

-- dropping the view 
DROP VIEW IF EXISTS dbo.TestView 

-- dropping the procedure 
DROP PROCEDURE IF EXISTS dbo.TestProcedure

-- dropping the function 
DROP FUNCTION IF EXISTS dbo.TestFunction

-- dropping the index 
DROP INDEX IF EXISTS dbo.TestTable.IX_TestTable

Similarly can be dropped aggregates, assemblies, roles, triggers, rules, databases, schemas, users, sequences, synonyms, etc. The scripts will run also when the objects don’t exist.

An object can't be dropped if explicit dependencies exist on them, e.g. when the table is referenced by a FOREIGN KEY constraint. For each object there are specific rules that apply, therefore for more details check the documentation.

To explore the functionality here are the definitions of the above objects and the further scripts to test them:

-- creating the test table
CREATE TABLE dbo.TestTable(City nvarchar(50)
, PostalCode nvarchar(50))

-- creating the test view 
CREATE VIEW dbo.TestView 
AS
SELECT 'Test' as Result

-- creating the test stored procedure  
CREATE PROCEDURE dbo.TestProcedure
AS
BEGIN
SELECT 'Test' as Result
END

-- creating the test function
CREATE FUNCTION dbo.TestFunction()
RETURNS nvarchar(50)
BEGIN
    RETURN 'Test'
END

-- creating the test index
CREATE NONCLUSTERED INDEX [IX_TestTable] ON [dbo].[TestTable]
(
 [PostalCode] ASC
)


--testing the table
SELECT *
FROM dbo.TestTable

-- testing the view 
SELECT *
FROM dbo.TestView 

-- testing the procedure 
EXEC dbo.TestProcedure

-- testing the function
SELECT dbo.TestFunction() as Result

Moreover, the IF EXISTS can be used when dropping the constraint or column of a table:

-- adding a new column 
ALTER TABLE dbo.TestTable
ADD DateFrom datetime2(0)

-- adding a constraint on it
ALTER TABLE dbo.TestTable
ADD CONSTRAINT [DF_DateFrom_Default]  DEFAULT (GetDate()) FOR [DateFrom]

-- inserting test data
INSERT INTO dbo.TestTable(City, PostalCode)
VALUES ('New York', 'XZY')

--testing the changes
SELECT *
FROM dbo.TestTable

-- dropping the constraint
ALTER TABLE dbo.TestTable
DROP CONSTRAINT IF EXISTS DF_DateFrom_Default

-- dropping a column from a table
ALTER TABLE dbo.TestTable
DROP COLUMN IF EXISTS DateFrom

--testing the changes
SELECT *
FROM dbo.TestTable

If a constraint exists on the column first must be dropped the constraint and after that the column, like in the above example.

Happy coding!

26 October 2018

💎🏭SQL Reloaded: Trimming Strings (Before and After)

One of the annoying things when writing queries is the repetitive lengthy expressions that obfuscate in general the queries making them more difficult to read, understand and troubleshoot, and sometimes such expressions come with a performance penalty as well. Loading data from Excel, text files and other sources involving poorly formatted data often requires trimming (all) the text values. In the early versions of SQL Server, the equivalent of a Trim function was obtained by using the combined LTrim and RTrim functions. This resumed in writing code like this (based on AdventureWorks 2014 database):

-- trimming via LTrim, RTrim 
SELECT LTrim(RTrim(AddressLine1)) AddressLine1
, LTrim(RTrim(AddressLine2)) AddressLine2
, LTrim(RTrim(City)) City
, LTrim(RTrim(PostalCode)) PostalCode
FROM Person.Address

This might not look much though imagine you have to deal with 30-50 text attributes, that the code is not written in a readable format (e.g. the way is stored in database), that some attributes require further processing (e.g. removal of special characters, splitting, concatenating).

Often developers preferred encapsulating the call to the two functions within a user-defined function:

-- Trim user-defiend function
CREATE FUNCTION dbo.Trim(
@string nvarchar(max))
RETURNS nvarchar(max)
BEGIN
    RETURN LTrim(RTrim(@string))
END

With it the code is somehow simplified, but not by much and includes the costs of calling a user-defined function:

-- trimming via dbo.Trim
SELECT dbo.Trim(AddressLine1) AddressLine1
, dbo.Trim(AddressLine2) AddressLine2
, dbo.Trim(City) City
, dbo.Trim(PostalCode) PostalCode
FROM Person.Address

In SQL Server 2017 was introduced the Trim function which not only replaces the combined use of LTrim and RTrim functions, but it also allows to replace other specified characters (including CR, LF, Tab) from the start or end of a string.

By default the function removes the space from both sides of a string:

-- trimming via Trim
SELECT Trim(AddressLine1) AddressLine1
, Trim(AddressLine2) AddressLine2
, Trim(City) City
, Trim(PostalCode) PostalCode
FROM Person.Address

When a set of characters is provided the function removes the specified characters:

SELECT Trim ('#' FROM '# 843984') Example1
, Trim ('[]' FROM '[843984]') Example2
, Trim ('+' FROM '+49127298000') Example3
, Trim ('+-' FROM '+ 49-12729-8000 ') + ';' Example4
, Trim ('+ ' FROM '+ 49-12729-8000 ') + ';' Example5
, ASCII(Left(Trim (char(13) FROM char(13) + '49127298000'), 1)) Example6

Output:
Example1   Example2     Example3        Example4            Example5            Example6
--------          --------          ------------           -----------------       -----------------        -----------
843984      843984        49127298000   49-12729-8000 ; 49-12729-8000;    52

As can be seen, when is needed to remove other characters together with the space then is needed to include the space in the list of characters.

Notes:
1) The dbo.Trim function can be created in SQL Server 2017 environments as well.
2) The collation of the database will affect the behavior of Trim function, therefore the results might look different when a case sensitive collection is used.

3) The queries work also in SQL databases in Microsoft Fabric.

Happy coding!