18 October 2018

💎SQL Reloaded: String_Split Function

    Today I was playing with a data model which, although simplistic, kicked back when I had to deal with fields in which several values were encoded within the same column. The "challenge" resided in the fact that the respective attributes were essential for analyzing and matching the data against other datasets, so a mix of flexibility and performance was needed. This is where STRING_SPLIT did its magic. Introduced with SQL Server 2016 and available only under compatibility level 130 and above, the function splits a character expression based on a specified separator and returns the values as a single-column table, one row per value.
Here’s a simplified example of the code I had to write:

-- cleaning up
-- DROP TABLE dbo.ItemList

-- test data 
SELECT A.*
INTO dbo.ItemList 
FROM (
VALUES (1, '1001:a:blue')
, (2, '1001:b:white')
, (3, '1002:a:blue')
, (4, '1002:b:white')
, (5, '1002:c:red')
, (6, '1003:b:white')
, (7, '1003:c:red')) A(Id, List)

-- checking the data
SELECT *
FROM dbo.ItemList 

-- prepared data
SELECT ITM.Id 
, ITM.List 
, DAT.ItemId
, DAT.Size
, DAT.Country
FROM dbo.ItemList ITM
     LEFT JOIN ( -- transformed data
        SELECT PVT.Id
        , PVT.[1] AS ItemId
        , PVT.[2] AS Size
        , PVT.[3] AS Country
        FROM ( -- number the split values within each Id
               -- (note: STRING_SPLIT does not guarantee the order of the returned values)
            SELECT ITM.Id
            , TX.Value
            , ROW_NUMBER() OVER (PARTITION BY ITM.Id ORDER BY ITM.Id) Ranking
            FROM dbo.ItemList ITM
                 CROSS APPLY STRING_SPLIT(ITM.List, ':') TX
        ) SRC
        PIVOT (MAX(Value) FOR Ranking IN ([1],[2],[3])) PVT
     ) DAT
       ON ITM.Id = DAT.Id 


And, here’s the output:

image_thumb[2]

   For those dealing with earlier versions of SQL Server, the functionality provided by STRING_SPLIT can be implemented with the help of user-defined functions, either by using old-fashioned loops (see this), cursors (see this) or more modern common table expressions (see this). In fact, these implementations behave similarly to STRING_SPLIT, the difference being made mainly by performance.
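
   For illustration only, below is a minimal sketch of such a CTE-based splitter for pre-2016 versions (the name dbo.SplitList and its signature are chosen just for this example; it is not one of the implementations linked above):

-- minimal sketch of a CTE-based splitter (hypothetical name and signature)
CREATE FUNCTION dbo.SplitList(
  @List varchar(8000)
, @Separator char(1))
RETURNS TABLE
AS RETURN (
   WITH Pieces(Pos, StartPos, EndPos) AS (
       -- anchor: the first value starts at position 1
       SELECT 1, 1, CHARINDEX(@Separator, @List)
       UNION ALL
       -- each further value starts right after the previous separator
       SELECT Pos + 1, EndPos + 1, CHARINDEX(@Separator, @List, EndPos + 1)
       FROM Pieces
       WHERE EndPos > 0
   )
   SELECT Pos
   , SUBSTRING(@List, StartPos, CASE WHEN EndPos > 0 THEN EndPos - StartPos ELSE 8000 END) AS Value
   FROM Pieces
)
GO

-- usage against the test data
SELECT ITM.Id
, S.Pos
, S.Value
FROM dbo.ItemList ITM
     CROSS APPLY dbo.SplitList(ITM.List, ':') S

Unlike STRING_SPLIT (as of SQL Server 2016/2017), such a function also returns the position of each value, which makes the ROW_NUMBER workaround from above unnecessary; on the other hand, the default MAXRECURSION setting limits it to roughly 100 values per string.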

Happy coding!

13 October 2018

🔭Data Science: Groups (Just the Quotes)

"The object of statistical science is to discover methods of condensing information concerning large groups of allied facts into brief and compendious expressions suitable for discussion. The possibility of doing this is based on the constancy and continuity with which objects of the same species are found to vary." (Sir Francis Galton, "Inquiries into Human Faculty and Its Development, Statistical Methods", 1883)

"Some of the common ways of producing a false statistical argument are to quote figures without their context, omitting the cautions as to their incompleteness, or to apply them to a group of phenomena quite different to that to which they in reality relate; to take these estimates referring to only part of a group as complete; to enumerate the events favorable to an argument, omitting the other side; and to argue hastily from effect to cause, this last error being the one most often fathered on to statistics. For all these elementary mistakes in logic, statistics is held responsible." (Sir Arthur L Bowley, "Elements of Statistics", 1901)

"Statistics may be defined as numerical statements of facts by means of which large aggregates are analyzed, the relations of individual units to their groups are ascertained, comparisons are made between groups, and continuous records are maintained for comparative purposes." (Melvin T Copeland. "Statistical Methods" [in: Harvard Business Studies, Vol. III, Ed. by Melvin T Copeland, 1917])

"'Correlation' is a term used to express the relation which exists between two series or groups of data where there is a causal connection. In order to have correlation it is not enough that the two sets of data should both increase or decrease simultaneously. For correlation it is necessary that one set of facts should have some definite causal dependence upon the other set [...]" (Willard C Brinton, "Graphic Methods for Presenting Facts", 1919) 

"The conception of statistics as the study of variation is the natural outcome of viewing the subject as the study of populations; for a population of individuals in all respects identical is completely described by a description of anyone individual, together with the number in the group. The populations which are the object of statistical study always display variations in one or more respects. To speak of statistics as the study of variation also serves to emphasise the contrast between the aims of modern statisticians and those of their predecessors." (Sir Ronald A Fisher, "Statistical Methods for Research Workers", 1925)

"An average is a single value which is taken to represent a group of values. Such a representative value may be obtained in several ways, for there are several types of averages. […] Probably the most commonly used average is the arithmetic average, or arithmetic mean." (John R Riggleman & Ira N Frisbee, "Business Statistics", 1938)

"The fact that index numbers attempt to measure changes of items gives rise to some knotty problems. The dispersion of a group of products increases with the passage of time, principally because some items have a long-run tendency to fall while others tend to rise. Basic changes in the demand is fundamentally responsible. The averages become less and less representative as the distance from the period increases." (Anna C Rogers, "Graphic Charts Handbook", 1961)

"Pencil and paper for construction of distributions, scatter diagrams, and run-charts to compare small groups and to detect trends are more efficient methods of estimation than statistical inference that depends on variances and standard errors, as the simple techniques preserve the information in the original data." (William E Deming, "On Probability as Basis for Action" American Statistician Vol. 29 (4), 1975)

"When the distributions of two or more groups of univariate data are skewed, it is common to have the spread increase monotonically with location. This behavior is monotone spread. Strictly speaking, monotone spread includes the case where the spread decreases monotonically with location, but such a decrease is much less common for raw data. Monotone spread, as with skewness, adds to the difficulty of data analysis. For example, it means that we cannot fit just location estimates to produce homogeneous residuals; we must fit spread estimates as well. Furthermore, the distributions cannot be compared by a number of standard methods of probabilistic inference that are based on an assumption of equal spreads; the standard t-test is one example. Fortunately, remedies for skewness can cure monotone spread as well." (William S Cleveland, "Visualizing Data", 1993)

"The central limit theorem […] states that regardless of the shape of the curve of the original population, if you repeatedly randomly sample a large segment of your group of interest and take the average result, the set of averages will follow a normal curve." (Charles Livingston & Paul Voakes, "Working with Numbers and Statistics: A handbook for journalists", 2005)

"Random events often come like the raisins in a box of cereal - in groups, streaks, and clusters. And although Fortune is fair in potentialities, she is not fair in outcomes." (Leonard Mlodinow, "The Drunkard’s Walk: How Randomness Rules Our Lives", 2008)

"[...] statisticians are constantly looking out for missed nuances: a statistical average for all groups may well hide vital differences that exist between these groups. Ignoring group differences when they are present frequently portends inequitable treatment." (Kaiser Fung, "Numbers Rule the World", 2010)

"The issue of group differences is fundamental to statistical thinking. The heart of this matter concerns which groups should be aggregated and which shouldn’t." (Kaiser Fung, "Numbers Rule the World", 2010)

"Be careful not to confuse clustering and stratification. Even though both of these sampling strategies involve dividing the population into subgroups, both the way in which the subgroups are sampled and the optimal strategy for creating the subgroups are different. In stratified sampling, we sample from every stratum, whereas in cluster sampling, we include only selected whole clusters in the sample. Because of this difference, to increase the chance of obtaining a sample that is representative of the population, we want to create homogeneous groups for strata and heterogeneous (reflecting the variability in the population) groups for clusters." (Roxy Peck et al, "Introduction to Statistics and Data Analysis" 4th Ed., 2012)

"If the group is large enough, even very small differences can become statistically significant." (Victor Cohn & Lewis Cope, "News & Numbers: A writer’s guide to statistics" 3rd Ed, 2012)

"Self-selection bias occurs when people choose to be in the data - for example, when people choose to go to college, marry, or have children. […] Self-selection bias is pervasive in 'observational data', where we collect data by observing what people do. Because these people chose to do what they are doing, their choices may reflect who they are. This self-selection bias could be avoided with a controlled experiment in which people are randomly assigned to groups and told what to do." (Gary Smith, "Standard Deviations", 2014)

"Often when people relate essentially the same variable in two different groups, or at two different times, they see this same phenomenon - the tendency of the response variable to be closer to the mean than the predicted value. Unfortunately, people try to interpret this by thinking that the performance of those far from the mean is deteriorating, but it’s just a mathematical fact about the correlation. So, today we try to be less judgmental about this phenomenon and we call it regression to the mean. We managed to get rid of the term 'mediocrity', but the name regression stuck as a name for the whole least squares fitting procedure - and that’s where we get the term regression line." (Richard D De Veaux et al, "Stats: Data and Models", 2016)

"Bias is error from incorrect assumptions built into the model, such as restricting an interpolating function to be linear instead of a higher-order curve. [...] Errors of bias produce underfit models. They do not fit the training data as tightly as possible, were they allowed the freedom to do so. In popular discourse, I associate the word 'bias' with prejudice, and the correspondence is fairly apt: an apriori assumption that one group is inferior to another will result in less accurate predictions than an unbiased one. Models that perform lousy on both training and testing data are underfit." (Steven S Skiena, "The Data Science Design Manual", 2017)

"To be any good, a sample has to be representative. A sample is representative if every person or thing in the group you’re studying has an equally likely chance of being chosen. If not, your sample is biased. […] The job of the statistician is to formulate an inventory of all those things that matter in order to obtain a representative sample. Researchers have to avoid the tendency to capture variables that are easy to identify or collect data on - sometimes the things that matter are not obvious or are difficult to measure." (Daniel J Levitin, "Weaponized Lies", 2017)

"If you study one group and assume that your results apply to other groups, this is extrapolation. If you think you are studying one group, but do not manage to obtain a representative sample of that group, this is a different problem. It is a problem so important in statistics that it has a special name: selection bias. Selection bias arises when the individuals that you sample for your study differ systematically from the population of individuals eligible for your study." (Carl T Bergstrom & Jevin D West, "Calling Bullshit: The Art of Skepticism in a Data-Driven World", 2020)

06 October 2018

🔭Data Science: Users (Just the Quotes)

"Knowledge is a potential for a certain type of action, by which we mean that the action would occur if certain tests were run. For example, a library plus its user has knowledge if a certain type of response will be evoked under a given set of stipulations." (C West Churchman, "The Design of Inquiring Systems", 1971)

"To conceive of knowledge as a collection of information seems to rob the concept of all of its life. Knowledge resides in the user and not in the collection. It is how the user reacts to a collection of information that matters." (C West Churchman, "The Design of Inquiring Systems", 1971)

"Averages, ranges, and histograms all obscure the time-order for the data. If the time-order for the data shows some sort of definite pattern, then the obscuring of this pattern by the use of averages, ranges, or histograms can mislead the user. Since all data occur in time, virtually all data will have a time-order. In some cases this time-order is the essential context which must be preserved in the presentation." (Donald J Wheeler," Understanding Variation: The Key to Managing Chaos" 2nd Ed., 2000)

"Visualizations can be used to explore data, to confirm a hypothesis, or to manipulate a viewer. [...] In exploratory visualization the user does not necessarily know what he is looking for. This creates a dynamic scenario in which interaction is critical. [...] In a confirmatory visualization, the user has a hypothesis that needs to be tested. This scenario is more stable and predictable. System parameters are often predetermined." (Usama Fayyad et al, "Information Visualization in Data Mining and Knowledge Discovery", 2002)

"Dashboards and visualization are cognitive tools that improve your 'span of control' over a lot of business data. These tools help people visually identify trends, patterns and anomalies, reason about what they see and help guide them toward effective decisions. As such, these tools need to leverage people's visual capabilities. With the prevalence of scorecards, dashboards and other visualization tools now widely available for business users to review their data, the issue of visual information design is more important than ever." (Richard Brath & Michael Peters, "Dashboard Design: Why Design is Important," DM Direct, 2004)

"Data are raw facts and figures that by themselves may be useless. To be useful, data must be processed into finished information, that is, data converted into a meaningful and useful context for specific users. An increasing challenge for managers is being able to identify and access useful information." (Richard L Daft & Dorothy Marcic, "Understanding Management" 5th Ed., 2006)

"[...] construction of a data model is precisely the selective relevant depiction of the phenomena by the user of the theory required for the possibility of representation of the phenomenon."  (Bas C van Fraassen, "Scientific Representation: Paradoxes of Perspective", 2008)

"The thread that ties most of these applications together is that data collected from users provides added value. Whether that data is search terms, voice samples, or product reviews, the users are in a feedback loop in which they contribute to the products they use. That’s the beginning of data science." (Mike Loukides, "What Is Data Science?", 2011)

"What is really important is to remember that no matter how creative and innovative you wish to be in your graphics and visualizations, the first thing you must do, before you put a finger on the computer keyboard, is ask yourself what users are likely to try to do with your tool." (Alberto Cairo, "The Functional Art", 2011)

"As data scientists, we prefer to interact with the raw data. We know how to import it, transform it, mash it up with other data sources, and visualize it. Most of your customers can’t do that. One of the biggest challenges of developing a data product is figuring out how to give data back to the user. Giving back too much data in a way that’s overwhelming and paralyzing is 'data vomit'. It’s natural to build the product that you would want, but it’s very easy to overestimate the abilities of your users. The product you want may not be the product they want." (Dhanurjay Patil, "Data Jujitsu: The Art of Turning Data into Product", 2012)

"By giving data back to the user, you can create both engagement and revenue. We’re far enough into the data game that most users have realized that they’re not the customer, they’re the product. Their role in the system is to generate data, either to assist in ad targeting or to be sold to the highest bidder, or both." (Dhanurjay Patil, "Data Jujitsu: The Art of Turning Data into Product", 2012)

"Generalizing beyond advertising, when building any data product in which the data is obfuscated (where there isn’t a clear relationship between the user and the result), you can compromise on precision, but not on recall. But when the data is exposed, focus on high precision." (Dhanurjay Patil, "Data Jujitsu: The Art of Turning Data into Product", 2012)

"Successful information design in movement systems gives the user the information he needs - and only the information he needs - at every decision point." (Joel Katz, "Designing Information: Human factors and common sense in information design", 2012) 

"You can give your data product a better chance of success by carefully setting the users’ expectations. [...] One under-appreciated facet of designing data products is how the user feels after using the product. Does he feel good? Empowered? Or disempowered and dejected?" (Dhanurjay Patil, "Data Jujitsu: The Art of Turning Data into Product", 2012)

"Creating effective visualizations is hard. Not because a dataset requires an exotic and bespoke visual representation - for many problems, standard statistical charts will suffice. And not because creating a visualization requires coding expertise in an unfamiliar programming language [...]. Rather, creating effective visualizations is difficult because the problems that are best addressed by visualization are often complex and ill-formed. The task of figuring out what attributes of a dataset are important is often conflated with figuring out what type of visualization to use. Picking a chart type to represent specific attributes in a dataset is comparatively easy. Deciding on which data attributes will help answer a question, however, is a complex, poorly defined, and user-driven process that can require several rounds of visualization and exploration to resolve." (Danyel Fisher & Miriah Meyer, "Making Data Visual", 2018)

"Designing effective visualizations presents a paradox. On the one hand, visualizations are intended to help users learn about parts of their data that they don’t know about. On the other hand, the more we know about the users’ needs and the context of their data, the better we can design a visualization to serve them." (Danyel Fisher & Miriah Meyer, "Making Data Visual", 2018)

04 October 2018

🔭Data Science: Data Products (Just the Quotes)

"Data scientists combine entrepreneurship with patience, the willingness to build data products incrementally, the ability to explore, and the ability to iterate over a solution. They are inherently interdisciplinary. They can tackle all aspects of a problem, from initial data collection and data conditioning to drawing conclusions. They can think outside the box to come up with new ways to view the problem, or to work with very broadly defined problems: 'there’s a lot of data, what can you make from it?'" (Mike Loukides, "What Is Data Science?", 2011)

"Discovery is the key to building great data products, as opposed to products that are merely good." (Mike Loukides, "The Evolution of Data Products", 2011)

"New interfaces for data products are all about hiding the data itself, and getting to what the user wants." (Mike Loukides, "The Evolution of Data Products", 2011)

"[...] a good definition of a data product is a product that facilitates an end goal through the use of data. It’s tempting to think of a data product purely as a data problem. After all, there’s nothing more fun than throwing a lot of technical expertise and fancy algorithmic work at a difficult problem." (Dhanurjay Patil, "Data Jujitsu: The Art of Turning Data into Product", 2012)

"As data scientists, we prefer to interact with the raw data. We know how to import it, transform it, mash it up with other data sources, and visualize it. Most of your customers can’t do that. One of the biggest challenges of developing a data product is figuring out how to give data back to the user. Giving back too much data in a way that’s overwhelming and paralyzing is 'data vomit'. It’s natural to build the product that you would want, but it’s very easy to overestimate the abilities of your users. The product you want may not be the product they want." (Dhanurjay Patil, "Data Jujitsu: The Art of Turning Data into Product", 2012)

"Generalizing beyond advertising, when building any data product in which the data is obfuscated (where there isn’t a clear relationship between the user and the result), you can compromise on precision, but not on recall. But when the data is exposed, focus on high precision." (Dhanurjay Patil, "Data Jujitsu: The Art of Turning Data into Product", 2012)

"Ideas for data products tend to start simple and become complex; if they start complex, they become impossible." (Dhanurjay Patil, "Data Jujitsu: The Art of Turning Data into Product", 2012)

"In an emergency, a data product that just produces more data is of little use. Data scientists now have the predictive tools to build products that increase the common good, but they need to be aware that building the models is not enough if they do not also produce optimized, implementable outcomes." (Jeremy Howard et al, "Designing Great Data Products", 2012)

"The best way to avoid data vomit is to focus on actionability of data. That is, what action do you want the user to take? If you want them to be impressed with the number of things that you can do with the data, then you’re likely producing data vomit. If you’re able to lead them to a clear set of actions, then you’ve built a product with a clear focus." (Dhanurjay Patil, "Data Jujitsu: The Art of Turning Data into Product", 2012)

"The key aspect of making a data product is putting the 'product' first and 'data' second. Saying it another way, data is one mechanism by which you make the product user-focused. With all products, you should ask yourself the following three questions: (1) What do you want the user to take away from this product? (2) What action do you want the user to take because of the product? (3) How should the user feel during and after using your product?" (Dhanurjay Patil, "Data Jujitsu: The Art of Turning Data into Product", 2012)

"You can give your data product a better chance of success by carefully setting the users’ expectations. [...] One under-appreciated facet of designing data products is how the user feels after using the product. Does he feel good? Empowered? Or disempowered and dejected?" (Dhanurjay Patil, "Data Jujitsu: The Art of Turning Data into Product", 2012)

"To explain a data mesh in one sentence, a data mesh is a centrally managed network of decentralized data products. The data mesh breaks the central data lake into decentralized islands of data that are owned by the teams that generate the data. The data mesh architecture proposes that data be treated like a product, with each team producing its own data/output using its own choice of tools arranged in an architecture that works for them. This team completely owns the data/output they produce and exposes it for others to consume in a way they deem fit for their data." (Aniruddha Deswandikar,"Engineering Data Mesh in Azure Cloud", 2024)

"Data product usage is growing quickly, doubling every year. Obviously, since we made the investment, we'll work with our customer to find applications." (Ken Shelton)

01 October 2018

🔭Data Science: Summaries (Just the Quotes)

"The null hypothesis of no difference has been judged to be no longer a sound or fruitful basis for statistical investigation. […] Significance tests do not provide the information that scientists need, and, furthermore, they are not the most effective method for analyzing and summarizing data." (Cherry A Clark, "Hypothesis Testing in Relation to Statistical Methodology", Review of Educational Research Vol. 33, 1963)

"Comparable objectives in data analysis are (l) to achieve more specific description of what is loosely known or suspected; (2) to find unanticipated aspects in the data, and to suggest unthought-of-models for the data's summarization and exposure; (3) to employ the data to assess the (always incomplete) adequacy of a contemplated model; (4) to provide both incentives and guidance for further analysis of the data; and (5) to keep the investigator usefully stimulated while he absorbs the feeling of his data and considers what to do next." (John W Tukey & Martin B Wilk, "Data Analysis and Statistics: An Expository Overview", 1966)

"Data analysis must be iterative to be effective. [...] The iterative and interactive interplay of summarizing by fit and exposing by residuals is vital to effective data analysis. Summarizing and exposing are complementary and pervasive." (John W Tukey & Martin B Wilk, "Data Analysis and Statistics: An Expository Overview", 1966)

"Summarizing data is a process of constrained and partial a process that essentially and inevitably corresponds to description - some sort of fitting, though it need not necessarily involve formal criteria or well-defined computations." (John W Tukey & Martin B Wilk, "Data Analysis and Statistics: An Expository Overview", 1966)

"[…] fitting lines to relationships between variables is often a useful and powerful method of summarizing a set of data. Regression analysis fits naturally with the development of causal explanations, simply because the research worker must, at a minimum, know what he or she is seeking to explain." (Edward R Tufte, "Data Analysis for Politics and Policy", 1974)

"Fitting lines to relationships between variables is the major tool of data analysis. Fitted lines often effectively summarize the data and, by doing so, help communicate the analytic results to others. Estimating a fitted line is also the first step in squeezing further information from the data." (Edward R Tufte, "Data Analysis for Politics and Policy", 1974)

"Modern data graphics can do much more than simply substitute for small statistical tables. At their best, graphics are instruments for reasoning about quantitative information. Often the most effective way to describe, explore, and summarize a set of numbers even a very large set - is to look at pictures of those numbers. Furthermore, of all methods for analyzing and communicating statistical information, well-designed data graphics are usually the simplest and at the same time the most powerful." (Edward R Tufte, "The Visual Display of Quantitative Information", 1983)

"Probabilities are summaries of knowledge that is left behind when information is transferred to a higher level of abstraction." (Judea Pearl, "Probabilistic Reasoning in Intelligent Systems: Network of Plausible, Inference", 1988)

"A good description of the data summarizes the systematic variation and leaves residuals that look structureless. That is, the residuals exhibit no patterns and have no exceptionally large values, or outliers. Any structure present in the residuals indicates an inadequate fit. Looking at the residuals laid out in an overlay helps to spot patterns and outliers and to associate them with their source in the data." (Christopher H Schrnid, "Value Splitting: Taking the Data Apart", 1991)

"The science of statistics may be described as exploring, analyzing and summarizing data; designing or choosing appropriate ways of collecting data and extracting information from them; and communicating that information. Statistics also involves constructing and testing models for describing chance phenomena. These models can be used as a basis for making inferences and drawing conclusions and, finally, perhaps for making decisions." (Fergus Daly et al, "Elements of Statistics", 1995)

"Ockham's Razor in statistical analysis is used implicitly when models are embedded in richer models -for example, when testing the adequacy of a linear model by incorporating a quadratic term. If the coefficient of the quadratic term is not significant, it is dropped and the linear model is assumed to summarize the data adequately." (Gerald van Belle, "Statistical Rules of Thumb", 2002)

"Every number has its limitations; every number is a product of choices that inevitably involve compromise. Statistics are intended to help us summarize, to get an overview of part of the world’s complexity. But some information is always sacrificed in the process of choosing what will be counted and how. Something is, in short, always missing. In evaluating statistics, we should not forget what has been lost, if only because this helps us understand what we still have." (Joel Best, "More Damned Lies and Statistics: How numbers confuse public issues", 2004)

"Data often arrive in raw form, as long lists of numbers. In this case your job is to summarize the data in a way that captures its essence and conveys its meaning. This can be done numerically, with measures such as the average and standard deviation, or graphically. At other times you find data already in summarized form; in this case you must understand what the summary is telling, and what it is not telling, and then interpret the information for your readers or viewers." (Charles Livingston & Paul Voakes, "Working with Numbers and Statistics: A handbook for journalists", 2005)

"Whereas regression is about attempting to specify the underlying relationship that summarises a set of paired data, correlation is about assessing the strength of that relationship. Where there is a very close match between the scatter of points and the regression line, correlation is said to be 'strong' or 'high' . Where the points are widely scattered, the correlation is said to be 'weak' or 'low'." (Alan Graham, "Developing Thinking in Statistics", 2006)

"Graphical displays are often constructed to place principal focus on the individual observations in a dataset, and this is particularly helpful in identifying both the typical positions of data points and unusual or influential cases. However, in many investigations, principal interest lies in identifying the nature of underlying trends and relationships between variables, and so it is often helpful to enhance graphical displays in ways which give deeper insight into these features. This can be very beneficial both for small datasets, where variation can obscure underlying patterns, and large datasets, where the volume of data is so large that effective representation inevitably involves suitable summaries." (Adrian W Bowman, "Smoothing Techniques for Visualisation" [in "Handbook of Data Visualization"], 2008)

"In order to be effective a descriptive statistic has to make sense - it has to distill some essential characteristic of the data into a value that is both appropriate and understandable. […] the justification for computing any given statistic must come from the nature of the data themselves - it cannot come from the arithmetic, nor can it come from the statistic. If the data are a meaningless collection of values, then the summary statistics will also be meaningless - no arithmetic operation can magically create meaning out of nonsense. Therefore, the meaning of any statistic has to come from the context for the data, while the appropriateness of any statistic will depend upon the use we intend to make of that statistic." (Donald J Wheeler, "Myths About Data Analysis", International Lean & Six Sigma Conference, 2012)

"In general, when building statistical models, we must not forget that the aim is to understand something about the real world. Or predict, choose an action, make a decision, summarize evidence, and so on, but always about the real world, not an abstract mathematical world: our models are not the reality - a point well made by George Box in his oft-cited remark that "all models are wrong, but some are useful". (David Hand, "Wonderful examples, but let's not close our eyes", Statistical Science 29, 2014)

"Just as with aggregated data, an average is a summary statistic that can tell you something about the data - but it is only one metric, and oftentimes a deceiving one at that. By taking all of the data and boiling it down to one value, an average (and other summary statistics) may imply that all of the underlying data is the same, even when it’s not." (John H Johnson & Mike Gluck, "Everydata: The misinformation hidden in the little data you consume every day", 2016)

"Again, classical statistics only summarizes data, so it does not provide even a language for asking [a counterfactual] question. Causal inference provides a notation and, more importantly, offers a solution. As with predicting the effect of interventions [...], in many cases we can emulate human retrospective thinking with an algorithm that takes what we know about the observed world and produces an answer about the counterfactual world." (Judea Pearl & Dana Mackenzie, "The Book of Why: The new science of cause and effect", 2018)

"[...] data often has some errors, outliers and other strange values, but these do not necessarily need to be individually identified and excluded. It also points to the benefits of using summary measures that are not unduly affected by odd observations [...] are known as robust measures, and include the median and the inter-quartile range." (David Spiegelhalter, "The Art of Statistics: Learning from Data", 2019)

"It is convenient to use a single number to summarize a steadily increasing or decreasing relationship between the pairs of numbers shown on a scatter-plot. This is generally chosen to be the Pearson correlation coefficient [...]. A Pearson correlation runs between −1 and 1, and expresses how close to a straight line the dots or data-points fall. A correlation of 1 occurs if all the points lie on a straight line going upwards, while a correlation of −1 occurs if all the points lie on a straight line going downwards. A correlation near 0 can come from a random scatter of points, or any other pattern in which there is no systematic trend upwards or downwards [...]." (David Spiegelhalter, "The Art of Statistics: Learning from Data", 2019)

"A data visualization, or dashboard, is great for summarizing or describing what has gone on in the past, but if people don’t know how to progress beyond looking just backwards on what has happened, then they cannot diagnose and find the ‘why’ behind it." (Jordan Morrow, "Be Data Literate: The data literacy skills everyone needs to succeed", 2021)

"Visualisation is fundamentally limited by the number of pixels you can pump to a screen. If you have big data, you have way more data than pixels, so you have to summarise your data. Statistics gives you lots of really good tools for this." (Hadley Wickham)

24 September 2018

🔭Data Science: Artificial Intelligence (Just the Quotes)

"There is no security against the ultimate development of mechanical consciousness, in the fact of machines possessing little consciousness now. A mollusc has not much consciousness. Reflect upon the extraordinary advance which machines have made during the last few hundred years, and note how slowly the animal and vegetable kingdoms are advancing. The more highly organized machines are creatures not so much of yesterday, as of the last five minutes, so to speak, in comparison with past time." (Samuel Butler, "Erewhon: Or, Over the Range", 1872)

"In other words then, if a machine is expected to be infallible, it cannot also be intelligent. There are several theorems which say almost exactly that. But these theorems say nothing about how much intelligence may be displayed if a machine makes no pretense at infallibility." (Alan M Turing, 1946)

"A computer would deserve to be called intelligent if it could deceive a human into believing that it was human." (Alan Turing, "Computing Machinery and Intelligence", 1950)

"The original question, 'Can machines think?:, I believe too meaningless to deserve discussion. Nevertheless I believe that at the end of the century the use of words and general educated opinion will have altered so much that one will be able to speak of machines thinking without expecting to be contradicted." (Alan M Turing, 1950) 

"The view that machines cannot give rise to surprises is due, I believe, to a fallacy to which philosophers and mathematicians are particularly subject. This is the assumption that as soon as a fact is presented to a mind all consequences of that fact spring into the mind simultaneously with it. It is a very useful assumption under many circumstances, but one too easily forgets that it is false. A natural consequence of doing so is that one then assumes that there is no virtue in the mere working out of consequences from data and general principles." (Alan Turing, "Computing Machinery and Intelligence", Mind Vol. 59, 1950)

"The following are some aspects of the artificial intelligence problem: […] If a machine can do a job, then an automatic calculator can be programmed to simulate the machine. […] It may be speculated that a large part of human thought consists of manipulating words according to rules of reasoning and rules of conjecture. From this point of view, forming a generalization consists of admitting a new word and some rules whereby sentences containing it imply and are implied by others. This idea has never been very precisely formulated nor have examples been worked out. […] How can a set of (hypothetical) neurons be arranged so as to form concepts. […] to get a measure of the efficiency of a calculation it is necessary to have on hand a method of measuring the complexity of calculating devices which in turn can be done. […] Probably a truly intelligent machine will carry out activities which may best be described as self-improvement. […] A number of types of 'abstraction' can be distinctly defined and several others less distinctly. […] the difference between creative thinking and unimaginative competent thinking lies in the injection of a some randomness. The randomness must be guided by intuition to be efficient." (John McCarthy et al, "A Proposal for the Dartmouth Summer Research Project on Artificial Intelligence", 1955)

"We shall therefore say that a program has common sense if it automatically deduces for itself a sufficient wide class of immediate consequences of anything it is told and what it already knows. [...] Our ultimate objective is to make programs that learn from their experience as effectively as humans do." (John McCarthy, "Programs with Common Sense", 1958)

"Although it sounds implausible, it might turn out that above a certain level of complexity, a machine ceased to be predictable, even in principle, and started doing things on its own account, or, to use a very revealing phrase, it might begin to have a mind of its own." (John R Lucas, "Minds, Machines and Gödel", 1959)

"When intelligent machines are constructed, we should not be surprised to find them as confused and as stubborn as men in their convictions about mind-matter, consciousness, free will, and the like." (Marvin Minsky, "Matter, Mind, and Models", Proceedings of the International Federation of Information Processing Congress Vol. 1 (49), 1965)

"Artificial intelligence is the science of making machines do things that would require intelligence if done by men." (Marvin Minsky, 1968)

"There are now machines in the world that think, that learn and create. Moreover, their ability to do these things is going to increase rapidly until - in the visible future - the range of problems they can handle will be coextensive with the range to which the human mind has been applied." (Allen Newell & Herbert A Simon, "Human problem solving", 1976)

"Intelligence has two parts, which we shall call the epistemological and the heuristic. The epistemological part is the representation of the world in such a form that the solution of problems follows from the facts expressed in the representation. The heuristic part is the mechanism that on the basis of the information solves the problem and decides what to do." (John McCarthy & Patrick J Hayes, "Some Philosophical Problems from the Standpoint of Artificial Intelligence", Machine Intelligence 4, 1969)

"It is essential to realize that a computer is not a mere 'number cruncher', or supercalculating arithmetic machine, although this is how computers are commonly regarded by people having no familiarity with artificial intelligence. Computers do not crunch numbers; they manipulate symbols. [...] Digital computers originally developed with mathematical problems in mind, are in fact general purpose symbol manipulating machines." (Margaret A Boden, "Minds and mechanisms", 1981)

"The basic idea of cognitive science is that intelligent beings are semantic engines - in other words, automatic formal systems with interpretations under which they consistently make sense. We can now see why this includes psychology and artificial intelligence on a more or less equal footing: people and intelligent computers (if and when there are any) turn out to be merely different manifestations of the same underlying phenomenon. Moreover, with universal hardware, any semantic engine can in principle be formally imitated by a computer if only the right program can be found." (John Haugeland, "Semantic Engines: An introduction to mind design", 1981)

"The digital-computer field defined computers as machines that manipulated numbers. The great thing was, adherents said, that everything could be encoded into numbers, even instructions. In contrast, scientists in AI [artificial intelligence] saw computers as machines that manipulated symbols. The great thing was, they said, that everything could be encoded into symbols, even numbers." (Allen Newell, "Intellectual Issues in the History of Artificial Intelligence", 1983)

"Artificial intelligence is based on the assumption that the mind can be described as some kind of formal system manipulating symbols that stand for things in the world. Thus it doesn't matter what the brain is made of, or what it uses for tokens in the great game of thinking. Using an equivalent set of tokens and rules, we can do thinking with a digital computer, just as we can play chess using cups, salt and pepper shakers, knives, forks, and spoons. Using the right software, one system (the mind) can be mapped onto the other (the computer)." (George Johnson, Machinery of the Mind: Inside the New Science of Artificial Intelligence, 1986)

"Cybernetics is simultaneously the most important science of the age and the least recognized and understood. It is neither robotics nor freezing dead people. It is not limited to computer applications and it has as much to say about human interactions as it does about machine intelligence. Today’s cybernetics is at the root of major revolutions in biology, artificial intelligence, neural modeling, psychology, education, and mathematics. At last there is a unifying framework that suspends long-held differences between science and art, and between external reality and internal belief." (Paul Pangaro, "New Order From Old: The Rise of Second-Order Cybernetics and Its Implications for Machine Intelligence", 1988)

"The cybernetics phase of cognitive science produced an amazing array of concrete results, in addition to its long-term (often underground) influence: the use of mathematical logic to understand the operation of the nervous system; the invention of information processing machines (as digital computers), thus laying the basis for artificial intelligence; the establishment of the metadiscipline of system theory, which has had an imprint in many branches of science, such as engineering (systems analysis, control theory), biology (regulatory physiology, ecology), social sciences (family therapy, structural anthropology, management, urban studies), and economics (game theory); information theory as a statistical theory of signal and communication channels; the first examples of self-organizing systems. This list is impressive: we tend to consider many of these notions and tools an integrative part of our life […]" (Francisco Varela, "The Embodied Mind", 1991)

"The deep paradox uncovered by AI research: the only way to deal efficiently with very complex problems is to move away from pure logic. [...] Most of the time, reaching the right decision requires little reasoning.[...] Expert systems are, thus, not about reasoning: they are about knowing. [...] Reasoning takes time, so we try to do it as seldom as possible. Instead we store the results of our reasoning for later reference." (Daniel Crevier, "The Tree of Knowledge", 1993)

"The insight at the root of artificial intelligence was that these 'bits' (manipulated by computers) could just as well stand as symbols for concepts that the machine would combine by the strict rules of logic or the looser associations of psychology." (Daniel Crevier, "AI: The tumultuous history of the search for artificial intelligence", 1993)

"Artificial intelligence comprises methods, tools, and systems for solving problems that normally require the intelligence of humans. The term intelligence is always defined as the ability to learn effectively, to react adaptively, to make proper decisions, to communicate in language or images in a sophisticated way, and to understand." (Nikola K Kasabov, "Foundations of Neural Networks, Fuzzy Systems, and Knowledge Engineering", 1996)

"But intelligence is not just a matter of acting or behaving intelligently. Behavior is a manifestation of intelligence, but not the central characteristic or primary definition of being intelligent. A moment's reflection proves this: You can be intelligent just lying in the dark, thinking and understanding. Ignoring what goes on in your head and focusing instead on behavior has been a large impediment to understanding intelligence and building intelligent machines." (Jeff Hawkins, "On Intelligence", 2004)

"The brain and its cognitive mental processes are the biological foundation for creating metaphors about the world and oneself. Artificial intelligence, human beings’ attempt to transcend their biology, tries to enter into these scenarios to learn how they function. But there is another metaphor of the world that has its own particular landscapes, inhabitants, and laws. The brain provides the organic structure that is necessary for generating the mind, which in turn is considered a process that results from brain activity." (Diego Rasskin-Gutman, "Chess Metaphors: Artificial Intelligence and the Human Mind", 2009)

"From a historical viewpoint, computationalism is a sophisticated version of behaviorism, for it only interpolates the computer program between stimulus and response, and does not regard novel programs as brain creations. [...] The root of computationalism is of course the actual similarity between brains and computers, and correspondingly between natural and artificial intelligence. The two are indeed similar because the artifacts in question have been designed to perform analogs of certain brain functions. And the computationalist program is an example of the strategy of treating similars as identicals." (Mario Bunge, "Matter and Mind: A Philosophical Inquiry", 2010)

"Artificial intelligence is a concept that obscures accountability. Our problem is not machines acting like humans - it's humans acting like machines." (John Twelve Hawks, "Spark", 2014)

"AI failed (at least relative to the hype it had generated), and it’s partly out of embarrassment on behalf of their discipline that the term 'artificial intelligence' is rarely used in computer science circles (although it’s coming back into favor, just without the over-hyping). We are as far away from mimicking human intelligence as we have ever been, partly because the human brain is fantastically more complicated than a mere logic engine." (Field Cady, "The Data Science Handbook", 2017)

"AI ever allows us to truly understand ourselves, it will not be because these algorithms captured the mechanical essence of the human mind. It will be because they liberated us to forget about optimizations and to instead focus on what truly makes us human: loving and being loved." (Kai-Fu Lee, "AI Superpowers: China, Silicon Valley, and the New World Order", 2018)

"Artificial intelligence is defined as the branch of science and technology that is concerned with the study of software and hardware to provide machines the ability to learn insights from data and the environment, and the ability to adapt in changing situations with high precision, accuracy and speed." (Amit Ray, "Compassionate Artificial Intelligence", 2018)

"Artificial Intelligence is not just learning patterns from data, but understanding human emotions and its evolution from its depth and not just fulfilling the surface level human requirements, but sensitivity towards human pain, happiness, mistakes, sufferings and well-being of the society are the parts of the evolving new AI systems." (Amit Ray, "Compassionate Artificial Intelligence", 2018)

"Artificial intelligence is the elucidation of the human learning process, the quantification of the human thinking process, the explication of human behavior, and the understanding of what makes intelligence possible." (Kai-Fu Lee, "AI Superpowers: China, Silicon Valley, and the New World Order", 2018) 

"AI won‘t be fool proof in the future since it will only as good as the data and information that we give it to learn. It could be the case that simple elementary tricks could fool the AI algorithm and it may serve a complete waste of output as a result." (Zoltan Andrejkovics, "Together: AI and Human. On the Same Side", 2019)

"It is the field of artificial intelligence in which the population is in the form of agents which search in a parallel fashion with multiple initialization points. The swarm intelligence-based algorithms mimic the physical and natural processes for mathematical modeling of the optimization algorithm. They have the properties of information interchange and non-centralized control structure." (Sajad A Rather & P Shanthi Bala, "Analysis of Gravitation-Based Optimization Algorithms for Clustering and Classification", 2020)

"A significant factor missing from any form of artificial intelligence is the inability of machines to learn based on real life experience. Diversity of life experience is the single most powerful characteristic of being human and enhances how we think, how we learn, our ideas and our ability to innovate. Machines exist in a homogeneous ecosystem, which is ok for solving known challenges, however even Artificial General Intelligence will never challenge humanity in being able to acquire the knowledge, creativity and foresight needed to meet the challenges of the unknown." (Tom Golway, 2021)

"AI is intended to create systems for making probabilistic decisions, similar to the way humans make decisions. […] Today’s AI is not very able to generalize. Instead, it is effective for specific, well-defined tasks. It struggles with ambiguity and mostly lacks transfer learning that humans take for granted. For AI to make humanlike decisions that are more situationally appropriate, it needs to incorporate context." (Jesús Barrasa et al, "Knowledge Graphs: Data in Context for Responsive Businesses", 2021)

"In an era of machine learning, where data is likely to be used to train AI, getting quality and governance under control is a business imperative. Failing to govern data surfaces problems late, often at the point closest to users (for example, by giving harmful guidance), and hinders explainability (garbage data in, machine-learned garbage out)." (Jesús Barrasa et al, "Knowledge Graphs: Data in Context for Responsive Businesses", 2021)

"Many AI systems employ heuristic decision making, which uses a strategy to find the most likely correct decision to avoid the high cost (time) of processing lots of information. We can think of those heuristics as shortcuts or rules of thumb that we would use to make fast decisions." (Jesús Barrasa et al, "Knowledge Graphs: Data in Context for Responsive Businesses", 2021)

"We think of context as the network surrounding a data point of interest that is relevant to a specific AI system. […] AI benefits greatly from context to enable probabilistic decision making for real-time answers, handle adjacent scenarios for broader applicability, and be maximally relevant to a given situation. But all systems, including AI, are only as good as their inputs." (Jesús Barrasa et al, "Knowledge Graphs: Data in Context for Responsive Businesses", 2021)

"Every machine has artificial intelligence. And the more advanced a machine gets, the more advanced artificial intelligence gets as well. But, a machine cannot feel what it is doing. It only follows instructions - our instructions - instructions of the humans. So, artificial intelligence will not destroy the world. Our irresponsibility will destroy the world." (Abhijit Naskar)

More quotes on "Artificial Intelligence" at the-web-of-knowledge.blogspot.com.

23 September 2018

🔭Data Science: Computation (Just the Quotes)

"If the system exhibits a structure which can be represented by a mathematical equivalent, called a mathematical model, and if the objective can be also so quantified, then some computational method may be evolved for choosing the best schedule of actions among alternatives. Such use of mathematical models is termed mathematical programming."  (George Dantzig, "Linear Programming and Extensions", 1959)

"Computers do not decrease the need for mathematical analysis, but rather greatly increase this need. They actually extend the use of analysis into the fields of computers and computation, the former area being almost unknown until recently, the latter never having been as intensively investigated as its importance warrants. Finally, it is up to the user of computational equipment to define his needs in terms of his problems, In any case, computers can never eliminate the need for problem-solving through human ingenuity and intelligence." (Richard E Bellman & Paul Brock, "On the Concepts of a Problem and Problem-Solving", American Mathematical Monthly 67, 1960)

"Cellular automata are discrete dynamical systems with simple construction but complex self-organizing behaviour. Evidence is presented that all one-dimensional cellular automata fall into four distinct universality classes. Characterizations of the structures generated in these classes are discussed. Three classes exhibit behaviour analogous to limit points, limit cycles and chaotic attractors. The fourth class is probably capable of universal computation, so that properties of its infinite time behaviour are undecidable." (Stephen Wolfram, "Nonlinear Phenomena, Universality and complexity in cellular automata", Physica 10D, 1984)

"The formal structure of a decision problem in any area can be put into four parts: (1) the choice of an objective function denning the relative desirability of different outcomes; (2) specification of the policy alternatives which are available to the agent, or decisionmaker, (3) specification of the model, that is, empirical relations that link the objective function, or the variables that enter into it, with the policy alternatives and possibly other variables; and (4) computational methods for choosing among the policy alternatives that one which performs best as measured by the objective function." (Kenneth Arrow, "The Economics of Information", 1984)

"In spite of the insurmountable computational limits, we continue to pursue the many problems that possess the characteristics of organized complexity. These problems are too important for our well being to give up on them. The main challenge in pursuing these problems narrows down fundamentally to one question: how to deal with systems and associated problems whose complexities are beyond our information processing limits? That is, how can we deal with these problems if no computational power alone is sufficient?"  (George Klir, "Fuzzy sets and fuzzy logic", 1995)

"Small changes in the initial conditions in a chaotic system produce dramatically different evolutionary histories. It is because of this sensitivity to initial conditions that chaotic systems are inherently unpredictable. To predict a future state of a system, one has to be able to rely on numerical calculations and initial measurements of the state variables. Yet slight errors in measurement combined with extremely small computational errors (from roundoff or truncation) make prediction impossible from a practical perspective. Moreover, small initial errors in prediction grow exponentially in chaotic systems as the trajectories evolve. Thus, theoretically, prediction may be possible with some chaotic processes if one is interested only in the movement between two relatively close points on a trajectory. When longer time intervals are involved, the situation becomes hopeless."(Courtney Brown, "Chaos and Catastrophe Theories", 1995)

 "An artificial neural network (or simply a neural network) is a biologically inspired computational model that consists of processing elements (neurons) and connections between them, as well as of training and recall algorithms." (Nikola K Kasabov, "Foundations of Neural Networks, Fuzzy Systems, and Knowledge Engineering", 1996)

"In science, it is a long-standing tradition to deal with perceptions by converting them into measurements. But what is becoming increasingly evident is that, to a much greater extent than is generally recognized, conversion of perceptions into measurements is infeasible, unrealistic or counter-productive. With the vast computational power at our command, what is becoming feasible is a counter-traditional move from measurements to perceptions. […] To be able to compute with perceptions it is necessary to have a means of representing their meaning in a way that lends itself to computation." (Lotfi A Zadeh, "The Birth and Evolution of Fuzzy Logic: A personal perspective", 1999)

"Theories of choice are at best approximate and incomplete. One reason for this pessimistic assessment is that choice is a constructive and contingent process. When faced with a complex problem, people employ a variety of heuristic procedures in order to simplify the representation and the evaluation of prospects. These procedures include computational shortcuts and editing operations, such as eliminating common components and discarding nonessential differences. The heuristics of choice do not readily lend themselves to formal analysis because their application depends on the formulation of the problem, the method of elicitation, and the context of choice." (Amos Tversky & Daniel Kahneman, "Advances in Prospect Theory: Cumulative Representation of Uncertainty" [in "Choices, Values, and Frames"], 2000)

"Prime numbers belong to an exclusive world of intellectual conceptions. We speak of those marvelous notions that enjoy simple, elegant description, yet lead to extreme - one might say unthinkable - complexity in the details. The basic notion of primality can be accessible to a child, yet no human mind harbors anything like a complete picture. In modern times, while theoreticians continue to grapple with the profundity of the prime numbers, vast toil and resources have been directed toward the computational aspect, the task of finding, characterizing, and applying the primes in other domains." (Richard Crandall & Carl Pomerance, "Prime Numbers: A Computational Perspective", 2001)

"Complexity Theory is concerned with the study of the intrinsic complexity of computational tasks. Its 'final' goals include the determination of the complexity of any well-defined task. Additional goals include obtaining an understanding of the relations between various computational phenomena (e.g., relating one fact regarding computational complexity to another). Indeed, we may say that the former type of goal is concerned with absolute answers regarding specific computational phenomena, whereas the latter type is concerned with questions regarding the relation between computational phenomena." (Oded Goldreich, "Computational Complexity: A Conceptual Perspective", 2008)

"Granular computing is a general computation theory for using granules such as subsets, classes, objects, clusters, and elements of a universe to build an efficient computational model for complex applications with huge amounts of data, information, and knowledge. Granulation of an object a leads to a collection of granules, with a granule being a clump of points (objects) drawn together by indiscernibility, similarity, proximity, or functionality. In human reasoning and concept formulation, the granules and the values of their attributes are fuzzy rather than crisp. In this perspective, fuzzy information granulation may be viewed as a mode of generalization, which can be applied to any concept, method, or theory." (Salvatore Greco et al, "Granular Computing and Data Mining for Ordered Data: The Dominance-Based Rough Set Approach", 2009)

"How are we to explain the contrast between the matter-of-fact way in which v-1 and other imaginary numbers are accepted today and the great difficulty they posed for learned mathematicians when they first appeared on the scene? One possibility is that mathematical intuitions have evolved over the centuries and people are generally more willing to see mathematics as a matter of manipulating symbols according to rules and are less insistent on interpreting all symbols as representative of one or another aspect of physical reality. Another, less self-congratulatory possibility is that most of us are content to follow the computational rules we are taught and do not give a lot of thought to rationales." (Raymond S Nickerson, "Mathematical Reasoning: Patterns, Problems, Conjectures, and Proofs", 2009)

"It should also be noted that the novel information generated by interactions in complex systems limits their predictability. Without randomness, complexity implies a particular non-determinism characterized by computational irreducibility. In other words, complex phenomena cannot be known a priori." (Carlos Gershenson, "Complexity", 2011)

"The notion of emergence is used in a variety of disciplines such as evolutionary biology, the philosophy of mind and sociology, as well as in computational and complexity theory. It is associated with non-reductive naturalism, which claims that a hierarchy of levels of reality exist. While the emergent level is constituted by the underlying level, it is nevertheless autonomous from the constituting level. As a naturalistic theory, it excludes non-natural explanations such as vitalistic forces or entelechy. As non-reductive naturalism, emergence theory claims that higher-level entities cannot be explained by lower-level entities." (Martin Neumann, "An Epistemological Gap in Simulation Technologies and the Science of Society", 2011)

"Black Swans (capitalized) are large-scale unpredictable and irregular events of massive consequence - unpredicted by a certain observer, and such un - predictor is generally called the 'turkey' when he is both surprised and harmed by these events. [...] Black Swans hijack our brains, making us feel we 'sort of' or 'almost' predicted them, because they are retrospectively explainable. We don’t realize the role of these Swans in life because of this illusion of predictability. […] An annoying aspect of the Black Swan problem - in fact the central, and largely missed, point - is that the odds of rare events are simply not computable." (Nassim N Taleb, "Antifragile: Things that gain from disorder", 2012)

"[…] there exists a close relation between design analysis of algorithm and computational complexity theory. The former is related to the analysis of the resources (time and/or space) utilized by a particular algorithm to solve a problem and the later is related to a more general question about all possible algorithms that could be used to solve the same problem. There are different types of time complexity for different algorithms." (Shyamalendu Kandar, "Introduction to Automata Theory, Formal Languages and Computation", 2013)

"These nature-inspired algorithms gradually became more and more attractive and popular among the evolutionary computation research community, and together they were named swarm intelligence, which became the little brother of the major four evolutionary computation algorithms." (Yuhui Shi, "Emerging Research on Swarm Intelligence and Algorithm Optimization", Information Science Reference, 2014)

"The higher the dimension, in other words, the higher the number of possible interactions, and the more disproportionally difficult it is to understand the macro from the micro, the general from the simple units. This disproportionate increase of computational demands is called the curse of dimensionality." (Nassim N Taleb, "Skin in the Game: Hidden Asymmetries in Daily Life", 2018)

"Computational complexity theory, or just complexity theory, is the study of the difficulty of computational problems. Rather than focusing on specific algorithms, complexity theory focuses on problems." (Rod Stephens, "Essential Algorithms" 2nd Ed., 2019)

19 September 2018

🔭Data Science: Features (Just the Quotes)

"The preliminary examination of most data is facilitated by the use of diagrams. Diagrams prove nothing, but bring outstanding features readily to the eye; they are therefore no substitutes for such critical tests as may be applied to the data, but are valuable in suggesting such tests, and in explaining the conclusions founded upon them." (Sir Ronald A Fisher, "Statistical Methods for Research Workers", 1925)

"Every bit of knowledge we gain and every conclusion we draw about the universe or about any part or feature of it depends finally upon some observation or measurement. Mankind has had again and again the humiliating experience of trusting to intuitive, apparently logical conclusions without observations, and has seen Nature sail by in her radiant chariot of gold in an entirely different direction." (Oliver J Lee, "Measuring Our Universe: From the Inner Atom to Outer Space", 1950)

"Probability is the mathematics of uncertainty. Not only do we constantly face situations in which there is neither adequate data nor an adequate theory, but many modem theories have uncertainty built into their foundations. Thus learning to think in terms of probability is essential. Statistics is the reverse of probability (glibly speaking). In probability you go from the model of the situation to what you expect to see; in statistics you have the observations and you wish to estimate features of the underlying model." (Richard W Hamming, "Methods of Mathematics Applied to Calculus, Probability, and Statistics", 1985)

"Complexity is not an objective factor but a subjective one. Supersignals reduce complexity, collapsing a number of features into one. Consequently, complexity must be understood in terms of a specific individual and his or her supply of supersignals. We learn supersignals from experience, and our supply can differ greatly from another individual's. Therefore there can be no objective measure of complexity." (Dietrich Dorner, "The Logic of Failure: Recognizing and Avoiding Error in Complex Situations", 1989)

"Formulation of a mathematical model is the first step in the process of analyzing the behaviour of any real system. However, to produce a useful model, one must first adopt a set of simplifying assumptions which have to be relevant in relation to the physical features of the system to be modelled and to the specific information one is interested in. Thus, the aim of modelling is to produce an idealized description of reality, which is both expressible in a tractable mathematical form and sufficiently close to reality as far as the physical mechanisms of interest are concerned." (Francois Axisa, "Discrete Systems" Vol. I, 2001)

"Graphical displays are often constructed to place principal focus on the individual observations in a dataset, and this is particularly helpful in identifying both the typical positions of data points and unusual or influential cases. However, in many investigations, principal interest lies in identifying the nature of underlying trends and relationships between variables, and so it is often helpful to enhance graphical displays in ways which give deeper insight into these features. This can be very beneficial both for small datasets, where variation can obscure underlying patterns, and large datasets, where the volume of data is so large that effective representation inevitably involves suitable summaries." (Adrian W Bowman, "Smoothing Techniques for Visualisation" [in "Handbook of Data Visualization"], 2008)

"It is impossible to construct a model that provides an entirely accurate picture of network behavior. Statistical models are almost always based on idealized assumptions, such as independent and identically distributed (i.i.d.) interarrival times, and it is often difficult to capture features such as machine breakdowns, disconnected links, scheduled repairs, or uncertainty in processing rates." (Sean Meyn, "Control Techniques for Complex Networks", 2008)

"In order to deal with these phenomena, we abstract from details and attempt to concentrate on the larger picture - a particular set of features of the real world or the structure that underlies the processes that lead to the observed outcomes. Models are such abstractions of reality. Models force us to face the results of the structural and dynamic assumptions that we have made in our abstractions." (Bruce Hannon and Matthias Ruth, "Dynamic Modeling of Diseases and Pests", 2009)

"Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression." (Pankaj Mehta & David J Schwab, "An exact mapping between the Variational Renormalization Group and Deep Learning", 2014)

"A predictive model overfits the training set when at least some of the predictions it returns are based on spurious patterns present in the training data used to induce the model. Overfitting happens for a number of reasons, including sampling variance and noise in the training set. The problem of overfitting can affect any machine learning algorithm; however, the fact that decision tree induction algorithms work by recursively splitting the training data means that they have a natural tendency to segregate noisy instances and to create leaf nodes around these instances. Consequently, decision trees overfit by splitting the data on irrelevant features that only appear relevant due to noise or sampling variance in the training data. The likelihood of overfitting occurring increases as a tree gets deeper because the resulting predictions are based on smaller and smaller subsets as the dataset is partitioned after each feature test in the path." (John D Kelleher et al, "Fundamentals of Machine Learning for Predictive Data Analytics: Algorithms, Worked Examples, and Case Studies", 2015)

"Bayesian networks provide a more flexible representation for encoding the conditional independence assumptions between the features in a domain. Ideally, the topology of a network should reflect the causal relationships between the entities in a domain. Properly constructed Bayesian networks are relatively powerful models that can capture the interactions between descriptive features in determining a prediction." (John D Kelleher et al, "Fundamentals of Machine Learning for Predictive Data Analytics: Algorithms, worked examples, and case studies", 2015) 

"Bayesian networks use a graph-based representation to encode the structural relationships - such as direct influence and conditional independence - between subsets of features in a domain. Consequently, a Bayesian network representation is generally more compact than a full joint distribution (because it can encode conditional independence relationships), yet it is not forced to assert a global conditional independence between all descriptive features. As such, Bayesian network models are an intermediary between full joint distributions and naive Bayes models and offer a useful compromise between model compactness and predictive accuracy." (John D Kelleher et al, "Fundamentals of Machine Learning for Predictive Data Analytics: Algorithms, worked examples, and case studies", 2015)

"Decision trees are also discriminative models. Decision trees are induced by recursively partitioning the feature space into regions belonging to the different classes, and consequently they define a decision boundary by aggregating the neighboring regions belonging to the same class. Decision tree model ensembles based on bagging and boosting are also discriminative models." (John D Kelleher et al, "Fundamentals of Machine Learning for Predictive Data Analytics: Algorithms, Worked Examples, and Case Studies", 2015)

"There are two kinds of mistakes that an inappropriate inductive bias can lead to: underfitting and overfitting. Underfitting occurs when the prediction model selected by the algorithm is too simplistic to represent the underlying relationship in the dataset between the descriptive features and the target feature. Overfitting, by contrast, occurs when the prediction model selected by the algorithm is so complex that the model fits to the dataset too closely and becomes sensitive to noise in the data."(John D Kelleher et al, "Fundamentals of Machine Learning for Predictive Data Analytics: Algorithms, Worked Examples, and Case Studies", 2015)

"The power of deep learning models comes from their ability to classify or predict nonlinear data using a modest number of parallel nonlinear steps4. A deep learning model learns the input data features hierarchy all the way from raw data input to the actual classification of the data. Each layer extracts features from the output of the previous layer." (N D Lewis, "Deep Learning Made Easy with R: A Gentle Introduction for Data Science", 2016)

"Decision trees are important for a few reasons. First, they can both classify and regress. It requires literally one line of code to switch between the two models just described, from a classification to a regression. Second, they are able to determine and share the feature importance of a given training set." (Russell Jurney, "Agile Data Science 2.0: Building Full-Stack Data Analytics Applications with Spark", 2017)

"Extracting good features is the most important thing for getting your analysis to work. It is much more important than good machine learning classifiers, fancy statistical techniques, or elegant code. Especially if your data doesn’t come with readily available features (as is the case with web pages, images, etc.), how you reduce it to numbers will make the difference between success and failure." (Field Cady, "The Data Science Handbook", 2017)

"Feature extraction is also the most creative part of data science and the one most closely tied to domain expertise. Typically, a really good feature will correspond to some real‐world phenomenon. Data scientists should work closely with domain experts and understand what these phenomena mean and how to distill them into numbers." (Field Cady, "The Data Science Handbook", 2017)

"Variables which follow symmetric, bell-shaped distributions tend to be nice as features in models. They show substantial variation, so they can be used to discriminate between things, but not over such a wide range that outliers are overwhelming." (Steven S Skiena, "The Data Science Design Manual", 2017)

"The idea behind deeper architectures is that they can better leverage repeated regularities in the data patterns in order to reduce the number of computational units and therefore generalize the learning even to areas of the data space where one does not have examples. Often these repeated regularities are learned by the neural network within the weights as the basis vectors of hierarchical features." (Charu C Aggarwal, "Neural Networks and Deep Learning: A Textbook", 2018)

"We humans are reasonably good at defining rules that check one, two, or even three attributes (also commonly referred to as features or variables), but when we go higher than three attributes, we can start to struggle to handle the interactions between them. By contrast, data science is often applied in contexts where we want to look for patterns among tens, hundreds, thousands, and, in extreme cases, millions of attributes." (John D Kelleher & Brendan Tierney, "Data Science", 2018)

"Any machine learning model is trained based on certain assumptions. In general, these assumptions are the simplistic approximations of some real-world phenomena. These assumptions simplify the actual relationships between features and their characteristics and make a model easier to train. More assumptions means more bias. So, while training a model, more simplistic assumptions = high bias, and realistic assumptions that are more representative of actual phenomena = low bias." (Imran Ahmad, "40 Algorithms Every Programmer Should Know", 2020)

18 September 2018

🔭Data Science: Regularities (Just the Quotes)

"By sampling we can learn only about collective properties of populations, not about properties of individuals. We can study the average height, the percentage who wear hats, or the variability in weight of college juniors [...]. The population we study may be small or large, but there must be a population - and what we are studying must be a population characteristic. By sampling, we cannot study individuals as particular entities with unique idiosyncrasies; we can study regularities (including typical variabilities as well as typical levels) in a population as exemplified by the individuals in the sample." (Frederick Mosteller et al, "Principles of Sampling", Journal of the American Statistical Association Vol. 49 (265), 1954)

"Theories are usually introduced when previous study of a class of phenomena has revealed a system of uniformities. […] Theories then seek to explain those regularities and, generally, to afford a deeper and more accurate understanding of the phenomena in question. To this end, a theory construes those phenomena as manifestations of entities and processes that lie behind or beneath them, as it were." (Carl G Hempel, "Philosophy of Natural Science", 1966)

"System' is the concept that refers both to a complex of interdependencies between parts, components, and processes, that involves discernible regularities of relationships, and to a similar type of interdependency between such a complex and its surrounding environment." (Talcott Parsons, "Systems Analysis: Social Systems", 1968)

"The dynamics of any system can be explained by showing the relations between its parts and the regularities of their interactions so as to reveal its organization. For us to fully understand it, however, we need not only to see it as a unity operating in its internal dynamics, but also to see it in its circumstances, i.e., in the context to which its operation connects it. This understanding requires that we adopt a certain distance for observation, a perspective that in the case of historical systems implies a reference to their origin. This can be easy, for instance, in the case of man-made machines, for we have access to every detail of their manufacture. The situation is not that easy, however, as regards living beings: their genesis and their history are never directly visible and can be reconstructed only by fragments."  (Humberto Maturana, "The Tree of Knowledge", 1987)

"The term chaos is used in a specific sense where it is an inherently random pattern of behaviour generated by fixed inputs into deterministic (that is fixed) rules (relationships). The rules take the form of non-linear feedback loops. Although the specific path followed by the behaviour so generated is random and hence unpredictable in the long-term, it always has an underlying pattern to it, a 'hidden' pattern, a global pattern or rhythm. That pattern is self-similarity, that is a constant degree of variation, consistent variability, regular irregularity, or more precisely, a constant fractal dimension. Chaos is therefore order (a pattern) within disorder (random behaviour)." (Ralph D Stacey, "The Chaos Frontier: Creative Strategic Control for Business", 1991)

"A measure that corresponds much better to what is usually meant by complexity in ordinary conversation, as well as in scientific discourse, refers not to the length of the most concise description of an entity (which is roughly what AIC [algorithmic information content] is), but to the length of a concise description of a set of the entity’s regularities. Thus something almost entirely random, with practically no regularities, would have effective complexity near zero. So would something completely regular, such as a bit string consisting entirely of zeroes. Effective complexity can be high only a region intermediate between total order and complete." (Murray Gell-Mann, "What is Complexity?", Complexity Vol 1 (1), 1995)

"The second law of thermodynamics, which requires average entropy (or disorder) to increase, does not in any way forbid local order from arising through various mechanisms of self-organization, which can turn accidents into frozen ones producing extensive regularities. Again, such mechanisms are not restricted to complex adaptive systems." (Murray Gell-Mann, "What is Complexity?", Complexity Vol 1 (1), 1995)

"A form of machine learning in which the goal is to identify regularities in the data. These regularities may include clusters of similar instances within the data or regularities between attributes. In contrast to supervised learning, in unsupervised learning no target attribute is defined in the data set." (John D Kelleher & Brendan Tierney, "Data science", 2018)

"The idea behind deeper architectures is that they can better leverage repeated regularities in the data patterns in order to reduce the number of computational units and therefore generalize the learning even to areas of the data space where one does not have examples. Often these repeated regularities are learned by the neural network within the weights as the basis vectors of hierarchical features." (Charu C Aggarwal, "Neural Networks and Deep Learning: A Textbook", 2018)

"Unexpected phenomena appearing (and often having a regularity or pattern) from a collection of apparently unrelated elements and where the elements themselves do not have the characteristics of the phenomena and that phenomena itself is not contained deductively within the elements." (Jeremy Horne, "Visualizing Big Data From a Philosophical Perspective", Handbook of Research on Big Data Storage and Visualization Techniques, 2018)

17 September 2018

🔭Data Science: Underfitting (Just the Quotes)

"A smaller model with fewer covariates has two advantages: it might give better predictions than a big model and it is more parsimonious (simpler). Generally, as you add more variables to a regression, the bias of the predictions decreases and the variance increases. Too few covariates yields high bias; this called underfitting. Too many covariates yields high variance; this called overfitting. Good predictions result from achieving a good balance between bias and variance. […] finding a good model involves trading of fit and complexity." (Larry A Wasserman, "All of Statistics: A concise course in statistical inference", 2004)

"When generating trees, it is usually optimal to grow a larger tree than is justifiable and then prune it back. The main reason this works well is that stop splitting rules do not look far enough forward. That is, stop splitting rules tend to underfit, meaning that even if a rule stops at a split for which the next candidate splits give little improvement, it may be that splitting them one layer further will give a large improvement in accuracy." (Bertrand Clarke et al, "Principles and Theory for Data Mining and Machine Learning", 2009)

"Briefly speaking, to solve a Machine Learning problem means you optimize a model to fit all the data from your training set, and then you use the model to predict the results you want. Therefore, evaluating a model need to see how well it can be used to predict the data out of the training set. Usually there are three types of the models: underfitting, fair and overfitting model [...]. If we want to predict a value, both (a) and (c) in this figure cannot work well. The underfitting model does not capture the structure of the problem at all, and we say it has high bias. The overfitting model tries to fit every sample in the training set and it did it, but we say it is of high variance. In other words, it fails to generalize new data." (Shudong Hao, "A Beginner’s Tutorial for Machine Learning Beginners", 2014)

"There are two kinds of mistakes that an inappropriate inductive bias can lead to: underfitting and overfitting. Underfitting occurs when the prediction model selected by the algorithm is too simplistic to represent the underlying relationship in the dataset between the descriptive features and the target feature. Overfitting, by contrast, occurs when the prediction model selected by the algorithm is so complex that the model fits to the dataset too closely and becomes sensitive to noise in the data."(John D Kelleher et al, "Fundamentals of Machine Learning for Predictive Data Analytics: Algorithms, Worked Examples, and Case Studies", 2015)

"Underfitting is when a model doesn’t take into account enough information to accurately model real life. For example, if we observed only two points on an exponential curve, we would probably assert that there is a linear relationship there. But there may not be a pattern, because there are only two points to reference. [...] It seems that the best way to mitigate underfitting a model is to give it more information, but this actually can be a problem as well. More data can mean more noise and more problems. Using too much data and too complex of a model will yield something that works for that particular data set and nothing else." (Matthew Kirk, "Thoughtful Machine Learning", 2015)

"Bias is error from incorrect assumptions built into the model, such as restricting an interpolating function to be linear instead of a higher-order curve. [...] Errors of bias produce underfit models. They do not fit the training data as tightly as possible, were they allowed the freedom to do so. In popular discourse, I associate the word 'bias' with prejudice, and the correspondence is fairly apt: an apriori assumption that one group is inferior to another will result in less accurate predictions than an unbiased one. Models that perform lousy on both training and testing data are underfit." (Steven S Skiena, "The Data Science Design Manual", 2017)

"Bias occurs normally when the model is underfitted and has failed to learn enough from the training data. It is the difference between the mean of the probability distribution and the actual correct value. Hence, the accuracy of the model is different for different data sets (test and training sets). To reduce the bias error, data scientists repeat the model-building process by resampling the data to obtain better prediction values." (Umesh R Hodeghatta & Umesha Nayak, "Business Analytics Using R: A Practical Approach", 2017)

"High-bias models typically produce simpler models that do not overfit and in those cases the danger is that of underfitting. Models with low-bias are typically more complex and that complexity enables us to represent the training data in a more accurate way. The danger here is that the flexibility provided by higher complexity may end up representing not only a relationship in the data but also the noise. Another way of portraying the bias-variance trade-off is in terms of complexity v simplicity." (Jesús Rogel-Salazar, "Data Science and Analytics with Python", 2017) 

"If either bias or variance is high, the model can be very far off from reality. In general, there is a trade-off between bias and variance. The goal of any machine-learning algorithm is to achieve low bias and low variance such that it gives good prediction performance. In reality, because of so many other hidden parameters in the model, it is hard to calculate the real bias and variance error. Nevertheless, the bias and variance provide a measure to understand the behavior of the machine-learning algorithm so that the model model can be adjusted to provide good prediction performance." (Umesh R Hodeghatta & Umesha Nayak, "Business Analytics Using R: A Practical Approach", 2017)

"Overfitting and underfitting are two important factors that could impact the performance of machine-learning models. Overfitting occurs when the model performs well with training data and poorly with test data. Underfitting occurs when the model is so simple that it performs poorly with both training and test data. [...]  When the model does not capture and fit the data, it results in poor performance. We call this underfitting. Underfitting is the result of a poor model that typically does not perform well for any data." (Umesh R Hodeghatta & Umesha Nayak, "Business Analytics Using R: A Practical Approach", 2017)

"Overfitting refers to the phenomenon where a model is highly fitted on a dataset. This generalization thus deprives the model from making highly accurate predictions about unseen data. [...] Underfitting is a phenomenon where the model is not trained with high precision on data at hand. The treatment of underfitting is subject to bias and variance. A model will have a high bias if both train and test errors are high [...] If a model has a high bias type underfitting, then the remedy can be to increase the model complexity, and if a model is suffering from high variance type underfitting, then the cure can be to bring in more data or otherwise make the model less complex." (Danish Haroon, "Python Machine Learning Case Studies", 2017)

"The tension between bias and variance, simplicity and complexity, or underfitting and overfitting is an area in the data science and analytics process that can be closer to a craft than a fixed rule. The main challenge is that not only is each dataset different, but also there are data points that we have not yet seen at the moment of constructing the model. Instead, we are interested in building a strategy that enables us to tell something about data from the sample used in building the model." (Jesús Rogel-Salazar, "Data Science and Analytics with Python", 2017)

"Even though a natural way of avoiding overfitting is to simply build smaller networks (with fewer units and parameters), it has often been observed that it is better to build large networks and then regularize them in order to avoid overfitting. This is because large networks retain the option of building a more complex model if it is truly warranted. At the same time, the regularization process can smooth out the random artifacts that are not supported by sufficient data. By using this approach, we are giving the model the choice to decide what complexity it needs, rather than making a rigid decision for the model up front (which might even underfit the data)." (Charu C Aggarwal, "Neural Networks and Deep Learning: A Textbook", 2018)

"One of the most common problems that you will encounter when training deep neural networks will be overfitting. What can happen is that your network may, owing to its flexibility, learn patterns that are due to noise, errors, or simply wrong data. [...] The essence of overfitting is to have unknowingly extracted some of the residual variation (i.e., the noise) as if that variation represented the underlying model structure. The opposite is called underfitting - when the model cannot capture the structure of the data." (Umberto Michelucci, "Applied Deep Learning: A Case-Based Approach to Understanding Deep Neural Networks", 2018)

"The high generalization error in a neural network may be caused by several reasons. First, the data itself might have a lot of noise, in which case there is little one can do in order to improve accuracy. Second, neural networks are hard to train, and the large error might be caused by the poor convergence behavior of the algorithm. The error might also be caused by high bias, which is referred to as underfitting. Finally, overfitting (i.e., high variance) may cause a large part of the generalization error. In most cases, the error is a combination of more than one of these different factors." (Charu C Aggarwal, "Neural Networks and Deep Learning: A Textbook", 2018)

"The trick is to walk the line between underfitting and overfitting. An underfit model has low variance, generally making the same predictions every time, but with extremely high bias, because the model deviates from the correct answer by a significant amount. Underfitting is symptomatic of not having enough data points, or not training a complex enough model. An overfit model, on the other hand, has memorized the training data and is completely accurate on data it has seen before, but varies widely on unseen data. Neither an overfit nor underfit model is generalizable - that is, able to make meaningful predictions on unseen data." (Benjamin Bengfort et al, "Applied Text Analysis with Python: Enabling Language-Aware Data Products with Machine Learning", 2018)

"Any fool can fit a statistical model, given the data and some software. The real challenge is to decide whether it actually fits the data adequately. It might be the best that can be obtained, but still not good enough to use." (Robert Grant, "Data Visualization: Charts, Maps and Interactive Graphics", 2019)
