28 November 2018

🔭Data Science: Standard Deviation (Just the Quotes)

"Equal variability is not always achieved in plots. For instance, if the theoretical distribution for a probability plot has a density that drops off gradually to zero in the tails (as the normal density does), then the variability of the data in the tails of the probability plot is greater than in the center. Another example is provided by the histogram. Since the height of any one bar has a binomial distribution, the standard deviation of the height is approximately proportional to the square root of the expected height; hence, the variability of the longer bars is greater." (John M Chambers et al, "Graphical Methods for Data Analysis", 1983)

"The most important reason for portraying standard deviations is that they give us a sense of the relative variability of the points in different regions of the plot." (John M Chambers et al, "Graphical Methods for Data Analysis", 1983)

"Many good things happen when data distributions are well approximated by the normal. First, the question of whether the shifts among the distributions are additive becomes the question of whether the distributions have the same standard deviation; if so, the shifts are additive. […] A second good happening is that methods of fitting and methods of probabilistic inference, to be taken up shortly, are typically simple and on well understood ground. […] A third good thing is that the description of the data distribution is more parsimonious." (William S Cleveland, "Visualizing Data", 1993)

"The bounds on the standard deviation are pretty crude but it is surprising how often the rule will pick up gross errors such as confusing the standard error and standard deviation, confusing the variance and the standard deviation, or reporting the mean in one scale and the standard deviation in another scale." (Gerald van Belle, "Statistical Rules of Thumb", 2002)

"Data often arrive in raw form, as long lists of numbers. In this case your job is to summarize the data in a way that captures its essence and conveys its meaning. This can be done numerically, with measures such as the average and standard deviation, or graphically. At other times you find data already in summarized form; in this case you must understand what the summary is telling, and what it is not telling, and then interpret the information for your readers or viewers." (Charles Livingston & Paul Voakes, "Working with Numbers and Statistics: A handbook for journalists", 2005)

"Roughly stated, the standard deviation gives the average of the differences between the numbers on the list and the mean of that list. If data are very spread out, the standard deviation will be large. If the data are concentrated near the mean, the standard deviation will be small." (Charles Livingston & Paul Voakes, "Working with Numbers and Statistics: A handbook for journalists", 2005)

"A feature shared by both the range and the interquartile range is that they are each calculated on the basis of just two values - the range uses the maximum and the minimum values, while the IQR uses the two quartiles. The standard deviation, on the other hand, has the distinction of using, directly, every value in the set as part of its calculation. In terms of representativeness, this is a great strength. But the chief drawback of the standard deviation is that, conceptually, it is harder to grasp than other more intuitive measures of spread." (Alan Graham, "Developing Thinking in Statistics", 2006)

"Numerical precision should be consistent throughout and summary statistics such as means and standard deviations should not have more than one extra decimal place (or significant digit) compared to the raw data. Spurious precision should be avoided although when certain measures are to be used for further calculations or when presenting the results of analyses, greater precision may sometimes be appropriate." (Jenny Freeman et al, "How to Display Data", 2008)

"Need to consider outliers as they can affect statistics such as means, standard deviations, and correlations. They can either be explained, deleted, or accommodated (using either robust statistics or obtaining additional data to fill-in). Can be detected by methods such as box plots, scatterplots, histograms or frequency distributions." (Randall E Schumacker & Richard G Lomax, "A Beginner’s Guide to Structural Equation Modeling" 3rd Ed., 2010)

"Outliers or influential data points can be defined as data values that are extreme or atypical on either the independent (X variables) or dependent (Y variables) variables or both. Outliers can occur as a result of observation errors, data entry errors, instrument errors based on layout or instructions, or actual extreme values from self-report data. Because outliers affect the mean, the standard deviation, and correlation coefficient values, they must be explained, deleted, or accommodated by using robust statistics." (Randall E Schumacker & Richard G Lomax, "A Beginner’s Guide to Structural Equation Modeling" 3rd Ed., 2010)

[myth] "The standard deviation statistic is more efficient than the range and therefore we should use the standard deviation statistic when computing limits for a process behavior chart."(Donald J Wheeler, "Myths About Data Analysis", International Lean & Six Sigma Conference, 2012)

"Outliers make it very hard to give an intuitive interpretation of the mean, but in fact, the situation is even worse than that. For a real‐world distribution, there always is a mean (strictly speaking, you can define distributions with no mean, but they’re not realistic), and when we take the average of our data points, we are trying to estimate that mean. But when there are massive outliers, just a single data point is likely to dominate the value of the mean and standard deviation, so much more data is required to even estimate the mean, let alone make sense of it." (Field Cady, "The Data Science Handbook", 2017)

"Theoretically, the normal distribution is most famous because many distributions converge to it, if you sample from them enough times and average the results. This applies to the binomial distribution, Poisson distribution and pretty much any other distribution you’re likely to encounter (technically, any one for which the mean and standard deviation are finite)." (Field Cady, "The Data Science Handbook", 2017)

"With time series though, there is absolutely no substitute for plotting. The pertinent pattern might end up being a sharp spike followed by a gentle taper down. Or, maybe there are weird plateaus. There could be noisy spikes that have to be filtered out. A good way to look at it is this: means and standard deviations are based on the naïve assumption that data follows pretty bell curves, but there is no corresponding 'default' assumption for time series data (at least, not one that works well with any frequency), so you always have to look at the data to get a sense of what’s normal. [...] Along the lines of figuring out what patterns to expect, when you are exploring time series data, it is immensely useful to be able to zoom in and out." (Field Cady, "The Data Science Handbook", 2017)

"With skewed data, quantiles will reflect the skew, while adding standard deviations assumes symmetry in the distribution and can be misleading." (Robert Grant, "Data Visualization: Charts, Maps and Interactive Graphics", 2019)

"[…] whenever people make decisions after being supplied with the standard deviation number, they act as if it were the expected mean deviation." (Nassim N Taleb, "Statistical Consequences of Fat Tails: Real World Preasymptotics, Epistemology, and Applications" 2nd Ed., 2022)

27 November 2018

🔭Data Science: Data Science (Just the Quotes)

"Data is frequently missing or incongruous. If data is missing, do you simply ignore the missing points? That isn’t always possible. If data is incongruous, do you decide that something is wrong with badly behaved data (after all, equipment fails), or that the incongruous data is telling its own story, which may be more interesting? It’s reported that the discovery of ozone layer depletion was delayed because automated data collection tools discarded readings that were too low. In data science, what you have is frequently all you’re going to get. It’s usually impossible to get 'better' data, and you have no alternative but to work with the data at hand." (Mike Loukides, "What Is Data Science?", 2011).

"Data science isn’t just about the existence of data, or making guesses about what that data might mean; it’s about testing hypotheses and making sure that the conclusions you’re drawing from the data are valid." (Mike Loukides, "What Is Data Science?", 2011)

"The thread that ties most of these applications together is that data collected from users provides added value. Whether that data is search terms, voice samples, or product reviews, the users are in a feedback loop in which they contribute to the products they use. That’s the beginning of data science." (Mike Loukides, "What Is Data Science?", 2011)

"Using data effectively requires something different from traditional statistics, where actuaries in business suits perform arcane but fairly well-defined kinds of analysis. What differentiates data science from statistics is that data science is a holistic approach. We’re increasingly finding data in the wild, and data scientists are involved with gathering data, massaging it into a tractable form, making it tell its story, and presenting that story to others" (Mike Loukides, "What Is Data Science?", 2011).

"Data science is an iterative process. It starts with a hypothesis (or several hypotheses) about the system we’re studying, and then we analyze the information. The results allow us to reject our initial hypotheses and refine our understanding of the data. When working with thousands of fields and millions of rows, it’s important to develop intuitive ways to reject bad hypotheses quickly." (Phil Simon, "The Visual Organization: Data Visualization, Big Data, and the Quest for Better Decisions", 2014)

"Hollywood loves the myth of a lone scientist working late nights in a dark laboratory on a mysterious island, but the truth is far less melodramatic. Real science is almost always a team sport. Groups of people, collaborating with other groups of people, are the norm in science - and data science is no exception to the rule. When large groups of people work together for extended periods of time, a culture begins to emerge." (Mike Barlow, "Learning to Love Data Science", 2015) 

"One important thing to bear in mind about the outputs of data science and analytics is that in the vast majority of cases they do not uncover hidden patterns or relationships as if by magic, and in the case of predictive analytics they do not tell us exactly what will happen in the future. Instead, they enable us to forecast what may come. In other words, once we have carried out some modelling there is still a lot of work to do to make sense out of the results obtained, taking into account the constraints and assumptions in the model, as well as considering what an acceptable level of reliability is in each scenario." (Jesús Rogel-Salazar, "Data Science and Analytics with Python", 2017)

"One of the biggest myths is the belief that data science is an autonomous process that we can let loose on our data to find the answers to our problems. In reality, data science requires skilled human oversight throughout the different stages of the process. [...] The second big myth of data science is that every data science project needs big data and needs to use deep learning. In general, having more data helps, but having the right data is the more important requirement. [...] A third data science myth is that modern data science software is easy to use, and so data science is easy to do. [...] The last myth about data science [...] is the belief that data science pays for itself quickly. The truth of this belief depends on the context of the organization. Adopting data science can require significant investment in terms of developing data infrastructure and hiring staff with data science expertise. Furthermore, data science will not give positive results on every project." (John D Kelleher & Brendan Tierney, "Data Science", 2018)

"The goal of data science is to improve decision making by basing decisions on insights extracted from large data sets. As a field of activity, data science encompasses a set of principles, problem definitions, algorithms, and processes for extracting nonobvious and useful patterns from large data sets. It is closely related to the fields of data mining and machine learning, but it is broader in scope. (John D Kelleher & Brendan Tierney, "Data Science", 2018)

"The patterns that we extract using data science are useful only if they give us insight into the problem that enables us to do something to help solve the problem." (John D Kelleher & Brendan Tierney, "Data Science", 2018)

"We humans are reasonably good at defining rules that check one, two, or even three attributes (also commonly referred to as features or variables), but when we go higher than three attributes, we can start to struggle to handle the interactions between them. By contrast, data science is often applied in contexts where we want to look for patterns among tens, hundreds, thousands, and, in extreme cases, millions of attributes." (John D Kelleher & Brendan Tierney, "Data Science", 2018)

"Even in an era of open data, data science and data journalism, we still need basic statistical principles in order not to be misled by apparent patterns in the numbers." (David Spiegelhalter, "The Art of Statistics: Learning from Data", 2019)

"Data science is, in reality, something that has been around for a very long time. The desire to utilize data to test, understand, experiment, and prove out hypotheses has been around for ages. To put it simply: the use of data to figure things out has been around since a human tried to utilize the information about herds moving about and finding ways to satisfy hunger. The topic of data science came into popular culture more and more as the advent of ‘big data’ came to the forefront of the business world." (Jordan Morrow, "Be Data Literate: The data literacy skills everyone needs to succeed", 2021)

"Data scientists are advanced in their technical skills. They like to do coding, statistics, and so forth. In its purest form, data science is where an individual uses the scientific method on data." (Jordan Morrow, "Be Data Literate: The data literacy skills everyone needs to succeed", 2021)

"Pure data science is the use of data to test, hypothesize, utilize statistics and more, to predict, model, build algorithms, and so forth. This is the technical part of the puzzle. We need this within each organization. By having it, we can utilize the power that these technical aspects bring to data and analytics. Then, with the power to communicate effectively, the analysis can flow throughout the needed parts of an organization." (Jordan Morrow, "Be Data Literate: The data literacy skills everyone needs to succeed", 2021)

"Aim for simplicity in Data Science. Real creativity won’t make things more complex. Instead, it will simplify them." (Damian D Mingle)

"Data Science is a series of failures punctuated by the occasional success." (Nigel C Lewis)

"Invite your Data Science team to ask questions and assume any system, rule, or way of doing things is open to further consideration." (Damian D Mingle)

🔭Data Science: Planning (Just the Quotes)

"The preparation of clear and simple plans, and a convenient system of numbering the [treatments] that are to be applied, will lighten the work of the man in the field, who is usually operating under averse conditions, is frequently in a hurry, and is sometimes not very certain of the points at issue." (F Yates, "The Design and Analysis of Factorial Experiments" Harpenden Imperial Bureau of Soil Science, 1937)

"The statistician who supposes that his main contribution to the planning of an experiment will involve statistical theory, finds repeatedly that he makes his most valuable contribution simply by persuading the investigator to explain why he wishes to do the experiment, by persuading him to justify the experimental treatments, and to explain why it is that the experiment, when completed, will assist him in his research." (Gertrude Cox, [lecture] 1951)

"What goes wrong [in long-range planning] is that sensible anticipation gets converted into foolish numbers: and their validity always hinges on large loose assumptions." (Robert Heller, "The Naked Manager: Games Executives Play", 1972)

"A good rule of thumb for deciding how long the analysis of the data actually will take is (1) to add up all the time for everything you can think of - editing the data, checking for errors, calculating various statistics, thinking about the results, going back to the data to try out a new idea, and (2) then multiply the estimate obtained in this first step by five." (Edward R Tufte, "Data Analysis for Politics and Policy", 1974)

"Statistics is a tool. In experimental science you plan and carry out experiments, and then analyse and interpret the results. To do this you use statistical arguments and calculations. Like any other tool - an oscilloscope, for example, or a spectrometer, or even a humble spanner - you can use it delicately or clumsily, skillfully or ineptly. The more you know about it and understand how it works, the better you will be able to use it and the more useful it will be." (Roger Barlow, "Statistics: A Guide to the Use of Statistical Methods in the Physical Sciences", 1989)

"An important part of the explanation [of continued use of significance testing] is that researchers hold false beliefs about significance testing, beliefs that tell them that significance testing offers important benefits to researchers that it in fact does not. Three of these beliefs are particularly important. The first is the false belief that the significance level of a study indicates the probability of successful replications of the study [...]. A second false belief widely held by researchers is that statistical significance level provides an index of the importance or size of a difference or relation [...]. The third false belief held by many researchers is the most devastating of all to the research enterprise. This is the belief that if a difference or relation is not statistically significant, then it is zero, or at least so small that it can safely be considered to be zero. This is the belief that if the null hypothesis is not rejected then it is to be accepted. This is the belief that a major benefit from significance tests is that they tell us whether a difference or affect is real or ‘probably just occurred by chance’."  (Frank L Schmidt, "Statistical significance testing and cumulative knowledge in psychology: implications for training of researchers", Psychological Methods 1(2), 1996)

"Consideration needs to be given to the most appropriate data to be collected. Often the temptation is to collect too much data and not give appropriate attention to the most important. Filing cabinets and computer files world-wide are filled with data that have been collected because they may be of interest to someone in future. Most is never of interest to anyone and if it is, its existence is unknown to those seeking the information, who will set out to collect the data again, probably in a trial better designed for the purpose. In general, it is best to collect only the data required to answer the questions posed, when setting up the trial, and plan another trial for other data in the future, if necessary." (P Portmann & H Ketata, "Statistical Methods for Plant Variety Evaluation", 1997)

"Meta-analytic thinking is the consideration of any result in relation to previous results on the same or similar questions, and awareness that combination with future results is likely to be valuable. Meta-analytic thinking is the application of estimation thinking to more than a single study. It prompts us to seek meta-analysis of previous related studies at the planning stage of research, then to report our results in a way that makes it easy to include them in future meta-analyses. Meta-analytic thinking is a type of estimation thinking, because it, too, focuses on estimates and uncertainty." (Geoff Cumming, "Understanding the New Statistics", 2012)

"Statistics can be defined as a collection of techniques used when planning a data collection, and when subsequently analyzing and presenting data." (Birger S Madsen, "Statistics for Non-Statisticians", 2016)

"The best time to plan an experiment is after you’ve done it." (Ronald A Fisher)

🔭Data Science: Percentiles & Quantiles (Just the Quotes)

"When distributions are compared, the goal is to understand how the distributions shift in going from one data set to the next. […] The most effective way to investigate the shifts of distributions is to compare corresponding quantiles." (William S Cleveland, "Visualizing Data", 1993)

"If the sample is not representative of the population because the sample is small or biased, not selected at random, or its constituents are not independent of one another, then the bootstrap will fail. […] For a given size sample, bootstrap estimates of percentiles in the tails will always be less accurate than estimates of more centrally located percentiles. Similarly, bootstrap interval estimates for the variance of a distribution will always be less accurate than estimates of central location such as the mean or median because the variance depends strongly upon extreme values in the population." (Phillip I Good & James W Hardin, "Common Errors in Statistics (and How to Avoid Them)", 2003)

"A feature shared by both the range and the interquartile range is that they are each calculated on the basis of just two values - the range uses the maximum and the minimum values, while the IQR uses the two quartiles. The standard deviation, on the other hand, has the distinction of using, directly, every value in the set as part of its calculation. In terms of representativeness, this is a great strength. But the chief drawback of the standard deviation is that, conceptually, it is harder to grasp than other more intuitive measures of spread." (Alan Graham, "Developing Thinking in Statistics", 2006)

"A useful feature of a stem plot is that the values maintain their natural order, while at the same time they are laid out in a way that emphasises the overall distribution of where the values are concentrated (that is, where the longer branches are). This enables you easily to pick out key values such as the median and quartiles." (Alan Graham, "Developing Thinking in Statistics", 2006)

"Having NUMBERSENSE means: (•) Not taking published data at face value; (•) Knowing which questions to ask; (•) Having a nose for doctored statistics. [...] NUMBERSENSE is that bit of skepticism, urge to probe, and desire to verify. It’s having the truffle hog’s nose to hunt the delicacies. Developing NUMBERSENSE takes training and patience. It is essential to know a few basic statistical concepts. Understanding the nature of means, medians, and percentile ranks is important. Breaking down ratios into components facilitates clear thinking. Ratios can also be interpreted as weighted averages, with those weights arranged by rules of inclusion and exclusion. Missing data must be carefully vetted, especially when they are substituted with statistical estimates. Blatant fraud, while difficult to detect, is often exposed by inconsistency." (Kaiser Fung, "Numbersense: How To Use Big Data To Your Advantage", 2013)

"Percentile points are used to define the percentage of cases equal to and below a certain point in a distribution or set of scores." (Neil J Salkind, "Statistics for People who (think They) Hate Statistics: Excel 2007 Edition", 2010)

"Had we started with this [quantile] plot, noticed that it looks straight and not looked further, we would have missed the important features of the data. The general lesson is important. Theoretical quantile -quantile plots are not a panacea and must be used in conjunction with other displays and analyses to get a full picture of the behavior of the data." (John M Chambers et al, "Graphical Methods for Data Analysis", 2011)

"[...] when measuring performance, it’s worth using percentiles rather than averages. The main advantage of the mean is that it’s easy to calculate, but percentiles are much more meaningful." (Martin Kleppmann, "Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems", 2015)

"Many researchers have fallen into the trap of assuming percentiles are interval data and using them in Statistical procedures that require interval data. The results are somewhat distorted under these conditions since the scores are actually only ordinal data." (Martin L Abbott, "Using Statistics in the Social and Health Sciences with SPSS and Excel", 2016)

"The percentile or rank is the point in a distribution of scores below which a given percentage of scores fall. This is an indication of rank since it establishes score that is above the percentage of a set of scores. [...] Therefore, percentiles describe where a certain score is in relation to others in the distribution. [...] Statistically, it is important to remember that percentile ranks are ranks and therefore not interval data." (Martin L Abbott, "Using Statistics in the Social and Health Sciences with SPSS and Excel", 2016)

"It is not enough to give a single summary for a distribution - we need to have an idea of the spread, sometimes known as the variability. [...] The range is a natural choice, but is clearly very sensitive to extreme values [...] In contrast the inter-quartile range (IQR) is unaffected by extremes. This is the distance between the 25th and 75th percentiles of the data and so contains the ‘central half’ of the numbers [...] Finally the standard deviation is a widely used measure of spread. It is the most technically complex measure, but is only really appropriate for well-behaved symmetric data* since it is also unduly influenced by outlying values." (David Spiegelhalter, "The Art of Statistics: Learning from Data", 2019)

"With skewed data, quantiles will reflect the skew, while adding standard deviations assumes symmetry in the distribution and can be misleading." (Robert Grant, "Data Visualization: Charts, Maps and Interactive Graphics", 2019)

🔭Data Science: Fuzziness (Just the Quotes)

"Today we preach that science is not science unless it is quantitative. We substitute correlation for causal studies, and physical equations for organic reasoning. Measurements and equations are supposed to sharpen thinking, but [...] they more often tend to make the thinking non-causal and fuzzy." (John R Platt, "Strong Inference", Science Vol. 146 (3641), 1964)

"Information that is only partially structured (and therefore contains some 'noise' is fuzzy, inconsistent, and indistinct. Such imperfect information may be regarded as having merit only if it represents an intermediate step in structuring the information into a final meaningful form. If the partially Structured information remains in fuzzy form, it will create a state of dissatisfaction in the mind of the originator and certainly in the mind of the recipient. The natural desire is to continue structuring until clarity, simplicity, precision, and definitiveness are obtained." (Cecil H Meyers, "Handbook of Basic Graphs: A modern approach", 1970)

"Mental models are fuzzy, incomplete, and imprecisely stated. Furthermore, within a single individual, mental models change with time, even during the flow of a single conversation. The human mind assembles a few relationships to fit the context of a discussion. As debate shifts, so do the mental models. Even when only a single topic is being discussed, each participant in a conversation employs a different mental model to interpret the subject. Fundamental assumptions differ but are never brought into the open. […] A mental model may be correct in structure and assumptions but, even so, the human mind - either individually or as a group consensus - is apt to draw the wrong implications for the future." (Jay W Forrester, "Counterintuitive Behaviour of Social Systems", Technology Review, 1971)

"In general, complexity and precision bear an inverse relation to one another in the sense that, as the complexity of a problem increases, the possibility of analysing it in precise terms diminishes. Thus 'fuzzy thinking' may not be deplorable, after all, if it makes possible the solution of problems which are much too complex for precise analysis." (Lotfi A Zadeh, "Fuzzy languages and their relation to human intelligence", 1972)

"Fuzziness, then, is a concomitant of complexity. This implies that as the complexity of a task, or of a system for performing that task, exceeds a certain threshold, the system must necessarily become fuzzy in nature. Thus, with the rapid increase in the complexity of the information processing tasks which the computers are called upon to perform, we are reaching a point where computers will have to be designed for processing of information in fuzzy form. In fact, it is the capability to manipulate fuzzy concepts that distinguishes human intelligence from the machine intelligence of current generation computers. Without such capability we cannot build machines that can summarize written text, translate well from one natural language to another, or perform many other tasks that humans can do with ease because of their ability to manipulate fuzzy concepts." (Lotfi A Zadeh, "The Birth and Evolution of Fuzzy Logic", 1989)

"Probability theory is an ideal tool for formalizing uncertainty in situations where class frequencies are known or where evidence is based on outcomes of a sufficiently long series of independent random experiments. Possibility theory, on the other hand, is ideal for formalizing incomplete information expressed in terms of fuzzy propositions." (George Klir, "Fuzzy sets and fuzzy logic", 1995)

"[…] interval mathematics and fuzzy logic together can provide a promising alternative to mathematical modeling for many physical systems that are too vague or too complicated to be described by simple and crisp mathematical formulas or equations. When interval mathematics and fuzzy logic are employed, the interval of confidence and the fuzzy membership functions are used as approximation measures, leading to the so-called fuzzy systems modeling." (Guanrong Chen & Trung Tat Pham, "Introduction to Fuzzy Sets, Fuzzy Logic, and Fuzzy Control Systems", 2001)

"Fuzzy relations are developed by allowing the relationship between elements of two or more sets to take on an infinite number of degrees of relationship between the extremes of 'completely related' and 'not related', which are the only degrees of relationship possible in crisp relations. In this sense, fuzzy relations are to crisp relations as fuzzy sets are to crisp sets; crisp sets and relations are more constrained realizations of fuzzy sets and relations."  (Timothy J Ross & W Jerry Parkinson, "Fuzzy Set Theory, Fuzzy Logic, and Fuzzy Systems", 2002)

"The vast majority of information that we have on most processes tends to be nonnumeric and nonalgorithmic. Most of the information is fuzzy and linguistic in form." (Timothy J Ross & W Jerry Parkinson, "Fuzzy Set Theory, Fuzzy Logic, and Fuzzy Systems", 2002)

"Each fuzzy set is uniquely defined by a membership function. […] There are two approaches to determining a membership function. The first approach is to use the knowledge of human experts. Because fuzzy sets are often used to formulate human knowledge, membership functions represent a part of human knowledge. Usually, this approach can only give a rough formula of the membership function and fine-tuning is required. The second approach is to use data collected from various sensors to determine the membership function. Specifically, we first specify the structure of membership function and then fine-tune the parameters of membership function based on the data." (Huaguang Zhang & Derong Liu, "Fuzzy Modeling and Fuzzy Control", 2006)

"Granular computing is a general computation theory for using granules such as subsets, classes, objects, clusters, and elements of a universe to build an efficient computational model for complex applications with huge amounts of data, information, and knowledge. Granulation of an object a leads to a collection of granules, with a granule being a clump of points (objects) drawn together by indiscernibility, similarity, proximity, or functionality. In human reasoning and concept formulation, the granules and the values of their attributes are fuzzy rather than crisp. In this perspective, fuzzy information granulation may be viewed as a mode of generalization, which can be applied to any concept, method, or theory." (Salvatore Greco et al, "Granular Computing and Data Mining for Ordered Data: The Dominance-Based Rough Set Approach", 2009)

"We use the term fuzzy logic to refer to all aspects of representing and manipulating knowledge that employ intermediary truth-values. This general, commonsense meaning of the term fuzzy logic encompasses, in particular, fuzzy sets, fuzzy relations, and formal deductive systems that admit intermediary truth-values, as well as the various methods based on them." (Radim Belohlavek & George J Klir, "Concepts and Fuzzy Logic", 2011)

🔭Data Science: Problems (Just the Quotes)

"The problems which arise in the reduction of data may thus conveniently be divided into three types: (i) Problems of Specification, which arise in the choice of the mathematical form of the population. (ii) When a specification has been obtained, problems of Estimation arise. These involve the choice among the methods of calculating, from our sample, statistics fit to estimate the unknow nparameters of the population. (iii) Problems of Distribution include the mathematical deduction of the exact nature of the distributions in random samples of our estimates of the parameters, and of other statistics designed to test the validity of our specification (tests of Goodness of Fit)." (Sir Ronald A Fisher, "Statistical Methods for Research Workers", 1925)

"The most important maxim for data analysis to heed, and one which many statisticians seem to have shunned is this: ‘Far better an approximate answer to the right question, which is often vague, than an exact answer to the wrong question, which can always be made precise.’ Data analysis must progress by approximate answers, at best, since its knowledge of what the problem really is will at best be approximate." (John W Tukey, "The Future of Data Analysis", Annals of Mathematical Statistics, Vol. 33, No. 1, 1962)

"The validation of a model is not that it is 'true' but that it generates good testable hypotheses relevant to important problems." (Richard Levins, "The Strategy of Model Building in Population Biology”, 1966)

"Statistical methods are tools of scientific investigation. Scientific investigation is a controlled learning process in which various aspects of a problem are illuminated as the study proceeds. It can be thought of as a major iteration within which secondary iterations occur. The major iteration is that in which a tentative conjecture suggests an experiment, appropriate analysis of the data so generated leads to a modified conjecture, and this in turn leads to a new experiment, and so on." (George E P Box & George C Tjao, "Bayesian Inference in Statistical Analysis", 1973)

"The fact must be expressed as data, but there is a problem in that the correct data is difficult to catch. So that I always say 'When you see the data, doubt it!' 'When you see the measurement instrument, doubt it!' [...]For example, if the methods such as sampling, measurement, testing and chemical analysis methods were incorrect, data. […] to measure true characteristics and in an unavoidable case, using statistical sensory test and express them as data." (Kaoru Ishikawa, Annual Quality Congress Transactions, 1981)

"Doing data analysis without explicitly defining your problem or goal is like heading out on a road trip without having decided on a destination." (Michael Milton, "Head First Data Analysis", 2009)

"Data scientists combine entrepreneurship with patience, the willingness to build data products incrementally, the ability to explore, and the ability to iterate over a solution. They are inherently interdisciplinary. They can tackle all aspects of a problem, from initial data collection and data conditioning to drawing conclusions. They can think outside the box to come up with new ways to view the problem, or to work with very broadly defined problems: 'there’s a lot of data, what can you make from it?'" (Mike Loukides, "What Is Data Science?", 2011)

"Smart data scientists don’t just solve big, hard problems; they also have an instinct for making big problems small." (Dhanurjay Patil, "Data Jujitsu: The Art of Turning Data into Product", 2012)

"The big problems with statistics, say its best practitioners, have little to do with computations and formulas. They have to do with judgment - how to design a study, how to conduct it, then how to analyze and interpret the results. Journalists reporting on statistics have many chances to do harm by shaky reporting, and so are also called on to make sophisticated judgments. How, then, can we tell which studies seem credible, which we should report?" (Victor Cohn & Lewis Cope, "News & Numbers: A writer’s guide to statistics" 3rd Ed, 2012)

"We have let ourselves become enchanted by big data only because we exoticize technology. We’re impressed with small feats accomplished by computers alone, but we ignore big achievements from complementarity because the human contribution makes them less uncanny. Watson, Deep Blue, and ever-better machine learning algorithms are cool. But the most valuable companies in the future won’t ask what problems can be solved with computers alone. Instead, they’ll ask: how can computers help humans solve hard problems?" (Peter Thiel & Blake Masters, "Zero to One: Notes on Startups, or How to Build the Future", 2014)

"Machine learning is a science and requires an objective approach to problems. Just like the scientific method, test-driven development can aid in solving a problem. The reason that TDD and the scientific method are so similar is because of these three shared characteristics: Both propose that the solution is logical and valid. Both share results through documentation and work over time. Both work in feedback loops." (Matthew Kirk, "Thoughtful Machine Learning", 2015)

"While Big Data, when managed wisely, can provide important insights, many of them will be disruptive. After all, it aims to find patterns that are invisible to human eyes. The challenge for data scientists is to understand the ecosystems they are wading into and to present not just the problems but also their possible solutions." (Cathy O'Neil, "Weapons of Math Destruction: How Big Data Increases Inequality and Threatens Democracy", 2016)

"The term [Big Data] simply refers to sets of data so immense that they require new methods of mathematical analysis, and numerous servers. Big Data - and, more accurately, the capacity to collect it - has changed the way companies conduct business and governments look at problems, since the belief wildly trumpeted in the media is that this vast repository of information will yield deep insights that were previously out of reach." (Beau Lotto, "Deviate: The Science of Seeing Differently", 2017)

"There are other problems with Big Data. In any large data set, there are bound to be inconsistencies, misclassifications, missing data - in other words, errors, blunders, and possibly lies. These problems with individual items occur in any data set, but they are often hidden in a large mass of numbers even when these numbers are generated out of computer interactions." (David S Salsburg, "Errors, Blunders, and Lies: How to Tell the Difference", 2017)

"Your machine-learning algorithm should answer a very specific question that tells you something you need to know and that can be answered appropriately by the data you have access to. The best first question is something you already know the answer to, so that you have a reference and some intuition to compare your results with. Remember: you are solving a business problem, not a math problem."(Prashant Natarajan et al, "Demystifying Big Data and Machine Learning for Healthcare", 2017)

"Data scientists should have some domain expertise. Most data science projects begin with a real-world, domain-specific problem and the need to design a data-driven solution to this problem. As a result, it is important for a data scientist to have enough domain expertise that they understand the problem, why it is important, an dhow a data science solution to the problem might fit into an organization’s processes. This domain expertise guides the data scientist as she works toward identifying an optimized solution." (John D Kelleher & Brendan Tierney, "Data Science", 2018)

"One of the biggest myths is the belief that data science is an autonomous process that we can let loose on our data to find the answers to our problems. In reality, data science requires skilled human oversight throughout the different stages of the process. [...] The second big myth of data science is that every data science project needs big data and needs to use deep learning. In general, having more data helps, but having the right data is the more important requirement. [...] A third data science myth is that modern data science software is easy to use, and so data science is easy to do. [...] The last myth about data science [...] is the belief that data science pays for itself quickly. The truth of this belief depends on the context of the organization. Adopting data science can require significant investment in terms of developing data infrastructure and hiring staff with data science expertise. Furthermore, data science will not give positive results on every project." (John D Kelleher & Brendan Tierney, "Data Science", 2018)

"The goal of data science is to improve decision making by basing decisions on insights extracted from large data sets. As a field of activity, data science encompasses a set of principles, problem definitions, algorithms, and processes for extracting nonobvious and useful patterns from large data sets. It is closely related to the fields of data mining and machine learning, but it is broader in scope." (John D Kelleher & Brendan Tierney, "Data Science", 2018)

"Many people have strong intuitions about whether they would rather have a vital decision about them made by algorithms or humans. Some people are touchingly impressed by the capabilities of the algorithms; others have far too much faith in human judgment. The truth is that sometimes the algorithms will do better than the humans, and sometimes they won’t. If we want to avoid the problems and unlock the promise of big data, we’re going to need to assess the performance of the algorithms on a case-by-case basis. All too often, this is much harder than it should be. […] So the problem is not the algorithms, or the big datasets. The problem is a lack of scrutiny, transparency, and debate." (Tim Harford, "The Data Detective: Ten easy rules to make sense of statistics", 2020)

"The problem is the hype, the notion that something magical will emerge if only we can accumulate data on a large enough scale. We just need to be reminded: Big data is not better; it’s just bigger. And it certainly doesn’t speak for itself." (Carl T Bergstrom & Jevin D West, "Calling Bullshit: The Art of Skepticism in a Data-Driven World", 2020)

"The way we explore data today, we often aren't constrained by rigid hypothesis testing or statistical rigor that can slow down the process to a crawl. But we need to be careful with this rapid pace of exploration, too. Modern business intelligence and analytics tools allow us to do so much with data so quickly that it can be easy to fall into a pitfall by creating a chart that misleads us in the early stages of the process." (Ben Jones, "Avoiding Data Pitfalls: How to Steer Clear of Common Blunders When Working with Data and Presenting Analysis and Visualizations", 2020) 

🔭Data Science: Facts (Just the Quotes)

"Isolated facts, those that can only be obtained by rough estimate and that require development, can only be presented in memoires; but those that can be presented in a body, with details, and on whose accuracy one can rely, may be expounded in tables." (E Duvillard, "Memoire sur le travail du Bureau de statistique", 1806)

"Facts, however numerous, do not constitute a science. Like innumerable grains of sand on the sea shore, single facts appear isolated, useless, shapeless; it is only when compared, when arranged in their natural relations, when crystallised by the intellect, that they constitute the eternal truths of science." (William Farr, "Observation", Br. Ann. Med. 1, 1837)

"From carefully compiled statistical facts more may be learned [about] the moral nature of Man than can be gathered from all the accumulated experiences of the preceding ages." (Henry Thomas Buckle, "A History of Civilization in England", 1857/1898)

"The graphical method has considerable superiority for the exposition of statistical facts over the tabular. A heavy bank of figures is grievously wearisome to the eye, and the popular mind is as incapable of drawing any useful lessons from it as of extracting sunbeams from cucumbers." (Arthur B Farquhar & Henry Farquhar, "Economic and Industrial Delusions", 1891)

"[…] to kill an error is as good a service as, and sometimes even better than, the establishing of a new truth or fact." (Charles R Darwin, "More Letters of Charles Darwin", Vol 2, 1903)

"Entia non sunt multiplicanda praeter necessitatem. That is to say; before you try a complicated hypothesis, you should make quite sure that no simplification of it will explain the facts equally well." (Charles S Peirce," Pragmatism and Pragmaticism", [lecture] 1903)

"But, once again, what the physical states as the result of an experiment is not the recital of observed facts, but the interpretation and the transposing of these facts into the ideal, abstract, symbolic world created by the theories he regards as established." (Pierre-Maurice-Marie Duhem, "The Aim and Structure of Physical Theory", 1908)

"The facts of greatest outcome are those we think simple; may be they really are so, because they are influenced only by a small number of well-defined circumstances, may be they take on an appearance of simplicity because the various circumstances upon which they depend obey the laws of chance and so come to mutually compensate." (Henri Poincaré, "The Foundations of Science", 1913)

"Statistics may be defined as numerical statements of facts by means of which large aggregates are analyzed, the relations of individual units to their groups are ascertained, comparisons are made between groups, and continuous records are maintained for comparative purposes." (Melvin T Copeland. "Statistical Methods" [in: Harvard Business Studies, Vol. III, Ed. by Melvin T Copeland, 1917])

"The aim of science is to seek the simplest explanations of complex facts. We are apt to fall into the error of thinking that the facts are simple because simplicity is the goal of our quest. The guiding motto in the life of every natural philosopher should be, ‘Seek simplicity and distrust it’." (Alfred N Whitehead, "The Concept of Nature", 1919)

"Observed facts must be built up, woven together, ordered, arranged, systematized into conclusions and theories by reflection and reason, if they are to have full bearing on life and the universe. Knowledge is the accumulation of facts. Wisdom is the establishment of relations. And just because the latter process is delicate and perilous, it is all the more delightful." (Gamaliel Bradford, "Darwin", 1926)

"In scientific thought we adopt the simplest theory which will explain all the facts under consideration and enable us to predict new facts of the same kind. The catch in this criterion lies in the world 'simplest'. It is really an aesthetic canon such as we find implicit in our criticisms of poetry or painting. The layman finds such a law as dx/dt = K(d2x/dy2) much less simple than 'it oozes', of which it is the mathematical statement. The physicist reverses this judgment, and his statement is certainly the more fruitful of the two, so far as prediction is concerned. It is, however, a statement about something very unfamiliar to the plainman, namely, the rate of change of a rate of change." (John B S Haldane, "Possible Worlds", 1927)

"We can invent as many theories we like, and any one of them can be made to fit the facts. But that theory is always preferred which makes the fewest number of assumptions." (Albert Einstein [interview] 1929)

"A system is said to be coherent if every fact in the system is related every other fact in the system by relations that are not merely conjunctive. A deductive system affords a good example of a coherent system." (Lizzie S Stebbing, "A modern introduction to logic", 1930)

"In experimental science facts of the greatest importance are rarely discovered accidentally: more frequently new ideas point the way towards them." (Erwin Schrödinger, "Science and the Human Temperament", 1935)

"Science is the attempt to discover, by means of observation, and reasoning based upon it, first, particular facts about the world, and then laws connecting facts with one another and (in fortunate cases) making it possible to predict future occurrences." (Bertrand Russell, "Religion and Science, Grounds of Conflict", 1935)

"With the help of physical theories we try to find our way through the maze of observed facts, to order and understand the world of our sense impressions." (Albert Einstein & Leopold Infeld, "The Evolution of Physics", 1938)

"Graphs are all inclusive. No fact is too slight or too great to plot to a scale suited to the eye. Graphs may record the path of an ion or the orbit of the sun, the rise of a civilization, or the acceleration of a bullet, the climate of a century or the varying pressure of a heart beat, the growth of a business, or the nerve reactions of a child." (Henry D Hubbard [foreword to Willard C Brinton, "Graphic Presentation", 1939)])

"The graphic art depicts magnitudes to the eye. It does more. It compels the seeing of relations. We may portray by simple graphic methods whole masses of intricate routine, the organization of an enterprise, or the plan of a campaign. Graphs serve as storm signals for the manager, statesman, engineer; as potent narratives for the actuary, statist, naturalist; and as forceful engines of research for science, technology and industry. They display results. They disclose new facts and laws. They reveal discoveries as the bud unfolds the flower."  (Henry D Hubbard [foreword to Willard C Brinton, "Graphic Presentation", 1939)])

"[…] the grand aim of all science […] is to cover the greatest possible number of empirical facts by logical deductions from the smallest possible number of hypotheses or axioms." (Albert Einstein, 1954)

"Science does not begin with facts; one of its tasks is to uncover the facts by removing misconceptions." (Lancelot L Whyte, "Accent on Form", 1954)

"Science is the creation of concepts and their exploration in the facts. It has no other test of the concept than its empirical truth to fact." (Jacob Bronowski, "Science and Human Values", 1956)

"When we meet a fact which contradicts a prevailing theory, we must accept the fact and abandon the theory, even when the theory is supported by great names and generally accepted." (Claude Bernard, "An Introduction to the Study of Experimental Medicine", 1957)

"Science aims at the discovery, verification, and organization of fact and information [...] engineering is fundamentally committed to the translation of scientific facts and information to concrete machines, structures, materials, processes, and the like that can be used by men." (Eric A Walker, "Engineers and/or Scientists", Journal of Engineering Education Vol. 51, 1961)

"A model is a useful (and often indispensable) framework on which to organize our knowledge about a phenomenon. […] It must not be overlooked that the quantitative consequences of any model can be no more reliable than the a priori agreement between the assumptions of the model and the known facts about the real phenomenon. When the model is known to diverge significantly from the facts, it is self-deceiving to claim quantitative usefulness for it by appeal to agreement between a prediction of the model and observation." (John R Philip, 1966)

"To do science is to search for repeated patterns, not simply to accumulate facts, and to do the science of geographical ecology is to search for patterns of plants and animal life that can be put on a map." (Robert H. MacArthur, "Geographical Ecology", 1972)

"No theory ever agrees with all the facts in its domain, yet it is not always the theory that is to blame. Facts are constituted by older ideologies, and a clash between facts and theories may be proof of progress. It is also a first step in our attempt to find the principles implicit in familiar observational notions." (Paul K Feyerabend, "Against Method: Outline of an Anarchistic Theory of Knowledge", 1975)

"Statistical significance testing has involved more fantasy than fact. The emphasis on statistical significance over scientific significance in educational research represents a corrupt form of the scientific method. Educational research would be better off if it stopped testing its results for statistical significance."  (Ronald P. Carver, "The case against statistical testing", Harvard Educational Review 48, 1978)

"Facts and theories are different things, not rungs in a hierarchy of increasing certainty. Facts are the world's data. Theories are structures of ideas that explain and interpret facts. Facts do not go away while scientists debate rival theories for explaining them." (Stephen J Gould "Evolution as Fact and Theory", 1981) 

"Facts do not 'speak for themselves'. They speak for or against competing theories. Facts divorced from theory or visions are mere isolated curiosities." (Thomas Sowell, "A Conflict of Visions: Ideological Origins of Political Struggles", 1987)

"[…] no good model ever accounted for all the facts, since some data was bound to be misleading if not plain wrong. A theory that did fit all the data would have been ‘carpentered’ to do this and would thus be open to suspicion." (Francis H C Crick, "What Mad Pursuit: A Personal View of Scientific Discovery", 1988)

"The common perception of science as a rational activity, in which one confronts the evidence of fact with an open mind, could not be more false. Facts assume significance only within a pre-existing intellectual structure, which may be based as much on intuition and prejudice as on reason." (Walter Gratzer, The Guardian, 1989)

"As a result, surprisingly enough, scientific advance rarely comes solely through the accumulation of new facts. It comes most often through the construction of new theoretical frameworks. [..]  To understand scientific development, it is not enough merely to chronicle new discoveries and inventions. We must also trace the succession of worldviews" (Nancy R Pearcey & Charles B Thaxton, "The Soul of Science: Christian Faith and Natural Philosophy", 1994)

"Modeling involves a style of scientific thinking in which the argument is structured by the model, but in which the application is achieved via a narrative prompted by an external fact, an imagined event or question to be answered." (Uskali Mäki, "Fact and Fiction in Economics: Models, Realism and Social Construction", 2002)

"Although fiction is not fact, paradoxically we need some fictions, particularly mathematical ideas and highly idealized models, to describe, explain, and predict facts.  This is not because the universe is mathematical, but because our brains invent or use refined and law-abiding fictions, not only for intellectual pleasure but also to construct conceptual models of reality." (Mario Bunge, "Chasing Reality: Strife over Realism", 2006)

"There are no surprising facts, only models that are surprised by facts; and if a model is surprised by the facts, it is no credit to that model." (Eliezer S Yudkowsky, "Quantum Explanations", 2008)

"Obviously, the final goal of scientists and mathematicians is not simply the accumulation of facts and lists of formulas, but rather they seek to understand the patterns, organizing principles, and relationships between these facts to form theorems and entirely new branches of human thought." (Clifford A Pickover, "The Math Book", 2009)

"Relevance is not something you can predict. It is something you discover after the fact." (Thomas Sowell, "The Thomas Sowell Reader", 2011)

"Science does not live with facts alone. In addition to facts, it needs models. Scientific models fulfill two main functions with respect to empirical facts." (Andreas Bartels [in "Models, Simulations, and the Reduction of Complexity", Ed. by Ulrich Gähde et al, 2013)

"The whole point of science is that most of it is uncertain. That’s why science is exciting–because we don’t know. Science is all about things we don’t understand. The public, of course, imagines science is just a set of facts. But it’s not. Science is a process of exploring, which is always partial. We explore, and we find out things that we understand. We find out things we thought we understood were wrong. That’s how it makes progress." (Freeman Dyson,  [interview] 2014) 

"A mental representation is a mental structure that corresponds to an object, an idea, a collection of information, or anything else, concrete or abstract, that the brain is thinking about. […] Because the details of mental representations can differ dramatically from field to field, it’s hard to offer an overarching definition that is not too vague, but in essence these representations are preexisting patterns of information - facts, images, rules, relationships, and so on - that are held in long-term memory and that can be used to respond quickly and effectively in certain types of situations." (Anders Ericsson & Robert Pool," Peak: Secrets from  the  New  Science  of  Expertise", 2016)

"Statistics is the science of collecting, organizing, and interpreting numerical facts, which we call data. […] Statistics is the science of learning from data." (Moore McCabe & Alwan Craig, "The Practice of Statistics for Business and Economics" 4th Ed., 2016)

"That is the trouble with facts: they sometimes force you to conclusions that differ with your intuition." (Steven G Krantz, "A Primer of Mathematical Writing" 2nd Ed., 2016)

More quotes on "Facts" at the-web-of-knowledge.blogspot.com

🔭Data Science: Constraints (Just the Quotes)

"A common and very powerful constraint is that of continuity. It is a constraint because whereas the function that changes arbitrarily can undergo any change, the continuous function can change, at each step, only to a neighbouring value." (W Ross Ashby, "An Introduction to Cybernetics", 1956)

"A most important concept […] is that of constraint. It is a relation between two sets, and occurs when the variety that exists under one condition is less than the variety that exists under another. [...] Constraints are of high importance in cybernetics […] because when a constraint exists advantage can usually be taken of it." (W Ross Ashby, "An Introduction to Cybernetics", 1956)

"[…] as every law of nature implies the existence of an invariant, it follows that every law of nature is a constraint. […] Science looks for laws; it is therefore much concerned with looking for constraints. […] the world around us is extremely rich in constraints. We are so familiar with them that we take most of them for granted, and are often not even aware that they exist. […] A world without constraints would be totally chaotic." (W Ross Ashby, "An Introduction to Cybernetics", 1956)

"[...] the existence of any invariant over a set of phenomena implies a constraint, for its existence implies that the full range of variety does not occur. The general theory of invariants is thus a part of the theory of constraints. Further, as every law of nature implies the existence of an invariant, it follows that every law of nature is a constraint." (W Ross Ashby, "An Introduction to Cybernetics", 1956)

"Formulating consists of determining the system inputs, outputs, requirements, objectives, constraints. Structuring the system provides one or more methods of organizing the solution, the method of operation, the selection of parts, and the nature of their performance requirements. It is evident that the processes of formulating a system and structuring it are strongly related." (Harold Chestnut, "Systems Engineering Tools", 1965)

"In general, we can say that the larger the system becomes, the more the parts interact, the more difficult it is to understand environmental constraints, the more obscure becomes the problem of what resources should be made available, and deepest of all, the more difficult becomes the problem of the legitimate values of the system."  (C West Churchman, "The Systems Approach", 1968)

"A physical theory must accept some actual data as inputs and must be able to generate from them another set of possible data (the output) in such a way that both input and output match the assumptions of the theory - laws, constraints, etc. This concept of matching involves relevance: thus boundary conditions are relevant only to field-like theories such as hydrodynamics and quantum mechanics. But matching is more than relevance: it is also logical compatibility." (Mario Bunge, "Philosophy of Physics", 1973)

"Physics is like that. It is important that the models we construct allow us to draw the right conclusions about the behaviour of the phenomena and their causes. But it is not essential that the models accurately describe everything that actually happens; and in general it will not be possible for them to do so, and for much the same reasons. The requirements of the theory constrain what can be literally represented. This does not mean that the right lessons cannot be drawn. Adjustments are made where literal correctness does not matter very much in order to get the correct effects where we want them; and very often, as in the staging example, one distortion is put right by another. That is why it often seems misleading to say that a particular aspect of a model is false to reality: given the other constraints that is just the way to restore the representation." (Nancy Cartwright, "How the Laws of Physics Lie", 1983)

"Indeed, except for the very simplest physical systems, virtually everything and everybody in the world is caught up in a vast, nonlinear web of incentives and constraints and connections. The slightest change in one place causes tremors everywhere else. We can't help but disturb the universe, as T.S. Eliot almost said. The whole is almost always equal to a good deal more than the sum of its parts. And the mathematical expression of that property - to the extent that such systems can be described by mathematics at all - is a nonlinear equation: one whose graph is curvy." (M Mitchell Waldrop, "Complexity: The Emerging Science at the Edge of Order and Chaos", 1992)

"Many of the basic functions performed by neural networks are mirrored by human abilities. These include making distinctions between items (classification), dividing similar things into groups (clustering), associating two or more things (associative memory), learning to predict outcomes based on examples (modeling), being able to predict into the future (time-series forecasting), and finally juggling multiple goals and coming up with a good- enough solution (constraint satisfaction)." (Joseph P Bigus,"Data Mining with Neural Networks: Solving business problems from application development to decision support", 1996)

"A conceptual model is a representation of the system expertise using this formalism. An internal model is derived from the conceptual model and from a specification of the system transactions and the performance constraints." (Zbigniew W. Ras & Andrzej Skowron [Eds.], Foundations of Intelligent Systems: 10th International Symposium Vol 10, 1997)

"Whereas formal systems apply inference rules to logical variables, neural networks apply evolutive principles to numerical variables. Instead of calculating a solution, the network settles into a condition that satisfies the constraints imposed on it." (Paul Cilliers, "Complexity and Postmodernism: Understanding Complex Systems", 1998)

"What it means for a mental model to be a structural analog is that it embodies a representation of the spatial and temporal relations among, and the causal structures connecting the events and entities depicted and whatever other information that is relevant to the problem-solving talks. […] The essential points are that a mental model can be nonlinguistic in form and the mental mechanisms are such that they can satisfy the model-building and simulative constraints necessary for the activity of mental modeling." (Nancy J Nersessian, "Model-based reasoning in conceptual change", 1999)

"To develop a Control, the designer should find aspect systems, subsystems, or constraints that will prevent the negative interferences between elements (friction) and promote positive interferences (synergy). In other words, the designer should search for ways of minimizing frictions that will result in maximization of the global satisfaction" (Carlos Gershenson, "Design and Control of Self-organizing Systems", 2007)

"[chaos theory] presents a universe that is at once deterministic and obeys the fundamental physical laws, but is capable of disorder, complexity, and unpredictability. It shows that predictability is a rare phenomenon operating only within the constraints that science has filtered out from the rich diversity of our complex world." (Ziauddin Sardar & Iwona Abrams, "Introducing Chaos: A Graphic Guide", 2008)

"Cybernetics is the art of creating equilibrium in a world of possibilities and constraints. This is not just a romantic description, it portrays the new way of thinking quite accurately. Cybernetics differs from the traditional scientific procedure, because it does not try to explain phenomena by searching for their causes, but rather by specifying the constraints that determine the direction of their development." (Ernst von Glasersfeld, "Partial Memories: Sketches from an Improbable Life", 2010)

"Optimization is more than finding the best simulation results. It is itself a complex and evolving field that, subject to certain information constraints, allows data scientists, statisticians, engineers, and traders alike to perform reality checks on modeling results." (Chris Conlan, "Automated Trading with R: Quantitative Research and Platform Development", 2016)

"Exponentially growing systems are prevalent in nature, spanning all scales from biochemical reaction networks in single cells to food webs of ecosystems. How exponential growth emerges in nonlinear systems is mathematically unclear. […] The emergence of exponential growth from a multivariable nonlinear network is not mathematically intuitive. This indicates that the network structure and the flux functions of the modeled system must be subjected to constraints to result in long-term exponential dynamics." (Wei-Hsiang Lin et al, "Origin of exponential growth in nonlinear reaction networks", PNAS 117 (45), 2020)

More quotes on "Constraints" at the-web-of-knowledge.blogspot.com

26 November 2018

🔭Data Science: Clustering (Just the Quotes)

"To the untrained eye, randomness appears as regularity or tendency to cluster." (William Feller, "An Introduction to Probability Theory and its Applications", 1950) 

"In scientific information, then, we find that subjects – the themes and topics on which books and articles are written – cluster into fields, each of which can be analysed into its characteristic set of facets of terms." (Brian C Vickery, "Classification and indexing in science", 1958)

"In comparison with Predicate Calculus encoding is of factual knowledge, semantic nets seem more natural and understandable. This is due to the one-to-one correspondence between nodes and the concepts they denote, to the clustering about a particular node of propositions about a particular thing, and to the visual immediacy of 'interrelationships' between concepts, i.e., their connections via sequences of propositional links." (Lenhart K Schubert, "Extending the Expressive Power of Semantic Networks", Artificial Intelligence 7, 1976)

"Cyberspace. A consensual hallucination experienced daily by billions of legitimate operators, in every nation, by children being taught mathematical concepts. [...] A graphic representation of data abstracted from banks of every computer in the human system. Unthinkable complexity. Lines of light ranged in the nonspace of the mind, clusters and constellations of data." (William Gibson, "Neuromancer", 1984)

"While a small domain (consisting of fifty or fewer objects) can generally be analyzed as a unit, large domains must be partitioned to make the analysis a manageable task. To make such a partitioning, we take advantage of the fact that objects on an information model tend to fall into clusters: groups of objects that are interconnected with one another by many relationships. By contrast, relatively few relationships connect objects in different clusters." (Stephen J. Mellor, "Object-Oriented Systems Analysis: Modeling the World In Data", 1988) 

"Randomness is a difficult notion for people to accept. When events come in clusters and streaks, people look for explanations and patterns. They refuse to believe that such patterns - which frequently occur in random data - could equally well be derived from tossing a coin. So it is in the stock market as well." (Burton G Malkiel, "A Random Walk Down Wall Street", 1989)

"Many of the basic functions performed by neural networks are mirrored by human abilities. These include making distinctions between items (classification), dividing similar things into groups (clustering), associating two or more things (associative memory), learning to predict outcomes based on examples (modeling), being able to predict into the future (time-series forecasting), and finally juggling multiple goals and coming up with a good- enough solution (constraint satisfaction)." (Joseph P Bigus,"Data Mining with Neural Networks: Solving business problems from application development to decision support", 1996)

"While classification is important, it can certainly be overdone. Making too fine a distinction between things can be as serious a problem as not being able to decide at all. Because we have limited storage capacity in our brain (we still haven't figured out how to add an extender card), it is important for us to be able to cluster similar items or things together. Not only is clustering useful from an efficiency standpoint, but the ability to group like things together (called chunking by artificial intelligence practitioners) is a very important reasoning tool. It is through clustering that we can think in terms of higher abstractions, solving broader problems by getting above all of the nitty-gritty details." (Joseph P Bigus,"Data Mining with Neural Networks: Solving business problems from application development to decision support", 1996)

"Random events often come like the raisins in a box of cereal - in groups, streaks, and clusters. And although Fortune is fair in potentialities, she is not fair in outcomes." (Leonard Mlodinow, "The Drunkard’s Walk: How Randomness Rules Our Lives", 2008)

"Granular computing is a general computation theory for using granules such as subsets, classes, objects, clusters, and elements of a universe to build an efficient computational model for complex applications with huge amounts of data, information, and knowledge. Granulation of an object a leads to a collection of granules, with a granule being a clump of points (objects) drawn together by indiscernibility, similarity, proximity, or functionality. In human reasoning and concept formulation, the granules and the values of their attributes are fuzzy rather than crisp. In this perspective, fuzzy information granulation may be viewed as a mode of generalization, which can be applied to any concept, method, or theory." (Salvatore Greco et al, "Granular Computing and Data Mining for Ordered Data: The Dominance-Based Rough Set Approach", 2009)

"With the ever increasing amount of empirical information that scientists from all disciplines are dealing with, there exists a great need for robust, scalable and easy to use clustering techniques for data abstraction, dimensionality reduction or visualization to cope with and manage this avalanche of data."  (Jörg Reichardt, "Structure in Complex Networks", 2009)

"Data clusters are everywhere, even in random data. Someone who looks for an explanation will inevitably find one, but a theory that fits a data cluster is not persuasive evidence. The found explanation needs to make sense and it needs to be tested with uncontaminated data." (Gary Smith, "Standard Deviations", 2014)

"Your goal when designing a scattr plot is to make the relationship between two variables as clear as possible, including the overall level of association but also revealing clusters and outliers. This is easier said than done. The data and a few bad design choices can make reading a scatter plot too complex or misleading." (Jorge Camões, "Data at Work: Best practices for creating effective charts and information graphics in Microsoft Excel", 2016)

"Cluster analysis refers to the grouping of observations so that the objects within each cluster share similar properties, and properties of all clusters are independent of each other. Cluster algorithms usually optimize by maximizing the distance among clusters and minimizing the distance between objects in a cluster. Cluster analysis does not complete in a single iteration but goes through several iterations until the model converges. Model convergence means that the cluster memberships of all objects converge and don’t change with every new iteration." (Danish Haroon, "Python Machine Learning Case Studies", 2017)

🔭Data Science: Risk (Just the Quotes)

"A deterministic system is one in which the parts interact in a perfectly predictable way. There is never any room for doubt: given a last state of the system and the programme of information by defining its dynamic network, it is always possible to predict, without any risk of error, its succeeding state. A probabilistic system, on the other hand, is one about which no precisely detailed prediction can be given. The system may be studied intently, and it may become more and more possible to say what it is likely to do in any given circumstances. But the system simply is not predetermined, and a prediction affecting it can never escape from the logical limitations of the probabilities in which terms alone its behaviour can be described." (Stafford Beer, "Cybernetics and Management", 1959)

"It is easy to obtain confirmations, or verifications, for nearly every theory - if we look for confirmations. Confirmations should count only if they are the result of risky predictions. […] A theory which is not refutable by any conceivable event is non-scientific. Irrefutability is not a virtue of a theory (as people often think) but a vice. Every genuine test of a theory is an attempt to falsify it, or refute it." (Karl R Popper, "Conjectures and Refutations: The Growth of Scientific Knowledge", 1963)

"Statistical hypothesis testing is commonly used inappropriately to analyze data, determine causality, and make decisions about significance in ecological risk assessment,[...] It discourages good toxicity testing and field studies, it provides less protection to ecosystems or their components that are difficult to sample or replicate, and it provides less protection when more treatments or responses are used. It provides a poor basis for decision-making because it does not generate a conclusion of no effect, it does not indicate the nature or magnitude of effects, it does address effects at untested exposure levels, and it confounds effects and uncertainty[...]. Risk assessors should focus on analyzing the relationship between exposure and effects[...]."  (Glenn W Suter, "Abuse of hypothesis testing statistics in ecological risk assessment", Human and Ecological Risk Assessment 2, 1996)

"Until we can distinguish between an event that is truly random and an event that is the result of cause and effect, we will never know whether what we see is what we'll get, nor how we got what we got. When we take a risk, we are betting on an outcome that will result from a decision we have made, though we do not know for certain what the outcome will be. The essence of risk management lies in maximizing the areas where we have some control over the outcome while minimizing the areas where we have absolutely no control over the outcome and the linkage between effect and cause is hidden from us." (Peter L Bernstein, "Against the Gods: The Remarkable Story of Risk", 1996)

"Overcoming innumeracy is like completing a three-step program to statistical literacy. The first step is to defeat the illusion of certainty. The second step is to learn about the actual risks of relevant events and actions. The third step is to communicate the risks in an understandable way and to draw inferences without falling prey to clouded thinking. The general point is this: Innumeracy does not simply reside in our minds but in the representations of risk that we choose." (Gerd Gigerenzer, "Calculated Risks: How to know when numbers deceive you", 2002)

"The goal of random sampling is to produce a sample that is likely to be representative of the population. Although random sampling does not guarantee that the sample will be representative, it does allow us to assess the risk of an unrepresentative sample. It is the ability to quantify this risk that will enable us to generalize with confidence from a random sample to the corresponding population." (Roxy Peck et al, "Introduction to Statistics and Data Analysis" 4th Ed., 2012)

"Decision trees are an important tool for decision making and risk analysis, and are usually represented in the form of a graph or list of rules. One of the most important features of decision trees is the ease of their application. Being visual in nature, they are readily comprehensible and applicable. Even if users are not familiar with the way that a decision tree is constructed, they can still successfully implement it. Most often decision trees are used to predict future scenarios, based on previous experience, and to support rational decision making." (Jelena Djuris et al, "Neural computing in pharmaceutical products and process development", Computer-Aided Applications in Pharmaceutical Technology, 2013)

"Without context, data is useless, and any visualization you create with it will also be useless. Using data without knowing anything about it, other than the values themselves, is like hearing an abridged quote secondhand and then citing it as a main discussion point in an essay. It might be okay, but you risk finding out later that the speaker meant the opposite of what you thought." (Nathan Yau, "Data Points: Visualization That Means Something", 2013)

"The more complex the system, the more variable (risky) the outcomes. The profound implications of this essential feature of reality still elude us in all the practical disciplines. Sometimes variance averages out, but more often fat-tail events beget more fat-tail events because of interdependencies. If there are multiple projects running, outlier (fat-tail) events may also be positively correlated - one IT project falling behind will stretch resources and increase the likelihood that others will be compromised." (Paul Gibbons, "The Science of Successful Organizational Change",  2015)

"Roughly stated, the No Free Lunch theorem states that in the lack of prior knowledge (i.e. inductive bias) on average all predictive algorithms that search for the minimum classification error (or extremum over any risk metric) have identical performance according to any measure." (N D Lewis, "Deep Learning Made Easy with R: A Gentle Introduction for Data Science", 2016)

"Premature enumeration is an equal-opportunity blunder: the most numerate among us may be just as much at risk as those who find their heads spinning at the first mention of a fraction. Indeed, if you’re confident with numbers you may be more prone than most to slicing and dicing, correlating and regressing, normalizing and rebasing, effortlessly manipulating the numbers on the spreadsheet or in the statistical package - without ever realizing that you don’t fully understand what these abstract quantities refer to. Arguably this temptation lay at the root of the last financial crisis: the sophistication of mathematical risk models obscured the question of how, exactly, risks were being measured, and whether those measurements were something you’d really want to bet your global banking system on." (Tim Harford, "The Data Detective: Ten easy rules to make sense of statistics", 2020)

"Behavioral finance so far makes conclusions from statics not dynamics, hence misses the picture. It applies trade-offs out of context and develops the consensus that people irrationally overestimate tail risk (hence need to be 'nudged' into taking more of these exposures). But the catastrophic event is an absorbing barrier. No risky exposure can be analyzed in isolation: risks accumulate. If we ride a motorcycle, smoke, fly our own propeller plane, and join the mafia, these risks add up to a near-certain premature death. Tail risks are not a renewable resource." (Nassim N Taleb, "Statistical Consequences of Fat Tails: Real World Preasymptotics, Epistemology, and Applications" 2nd Ed., 2022)

"Any time you run regression analysis on arbitrary real-world observational data, there’s a significant risk that there’s hidden confounding in your dataset and so causal conclusions from such analysis are likely to be (causally) biased." (Aleksander Molak, "Causal Inference and Discovery in Python", 2023)

"[Making reasoned macro calls] starts with having the best and longest-time-series data you can find. You may have to take some risks in terms of the quality of data sources, but it amazes me how people are often more willing to act based on little or no data than to use data that is a challenge to assemble." (Robert J Shiller)

🔭Data Science: Lying with Statistics (Just the Quotes)

"Thus the alteration of the truth which is already manifesting itself in the progressive form of lying and perjury, offers us, in the superlative, the statistics." (François Magendie, 18th century) 

"An old jest runs to the effect that there are three degrees of comparison among liars. There are liars, there are outrageous liars, and there are scientific experts. This has lately been adapted to throw dirt upon statistics. There are three degrees of comparison, it is said, in lying. There are lies, there are outrageous lies, and there are statistics." (Robert Giffen, Economic Journal 2 (6), 1892)

"Professor [Joseph] Munro reminded him of an old saying which he rather reluctantly proposed, in that company, to repeat. It was to the effect that there were three gradations of inveracity - there were lies, there were d-d lies, and there were statistics." (Arthur J Balfour, [in Manchester Guardian] 1892)

"Columns of figures are hurled about in the papers, and demonstrate the justice of the witty claim that there are three kinds of untruth : fibs, lies, and statistics." (Herbert B Workman, "The principles of the Gothenburg system", Wesleyan-Methodist Magazine 118, 1895)

"After all, facts are facts, and although we may quote one to another with a chuckle the words of the Wise Statesman, 'Lies - damn lies - and statistics', still there are some easy figures the simplest must understand, and the astutest cannot wriggle out of." (Leonard H. Courtney, [speech] 1895)

"There are three kinds of lies - lies, damned lies and statistics." (Carroll D Wright, New York Times, 1896) 

"Figures often beguile me, particularly when I have the arranging of them myself; in which case the remark attributed to Disraeli would often apply with justice and force: “There are three kinds of lies: lies, damned lies, and statistics." (Mark Twain, [in "Mark Twain’s Autobiography" Vol I, 1904])

"Figures may not lie, but statistics compiled unscientifically and analyzed incompetently are almost sure to be misleading, and when this condition is unnecessarily chronic the so-called statisticians may be called liars." (Edwin B Wilson, "Bulletin of the American Mathematical Society", Vol 18, 1912)

"In earlier times they had no statistics and so they had to fall back on lies. Hence the huge exaggerations of primitive literature, giants, miracles, wonders! It's the size that counts. They did it with lies and we do it with statistics: but it's all the same." (Stephen Leacock, "Model memoirs and other sketches from simple to serious", 1939)

"It has long been recognized by public men of all kinds […] that statistics come under the head of lying, and that no lie is so false or inconclusive as that which is based on statistics." (Hilaire Belloc, "The Silence of the Sea", 1940)

"Many people use statistics as a drunkard uses a street lamp - for support rather than illumination. It is not enough to avoid outright falsehood; one must be on the alert to detect possible distortion of truth. One can hardly pick up a newspaper without seeing some sensational headline based on scanty or doubtful data." (Anna C Rogers, "Graphic Charts Handbook", 1961)

"Just like the spoken or written word, statistics and graphs can lie. They can lie by not telling the full story. They can lead to wrong conclusions by omitting some of the important facts. [...] Always look at statistics with a critical eye, and you will not be the victim of misleading information." (Dyno Lowenstein, "Graphs", 1976)

"For many people the first word that comes to mind when they think about statistical charts is 'lie'. No doubt some graphics do distort the underlying data, making it hard for the viewer to learn the truth. But data graphics are no different from words in this regard, for any means of communication can be used to deceive. There is no reason to believe that graphics are especially vulnerable to exploitation by liars; in fact, most of us have pretty good graphical lie detectors that help us see right through frauds." (Edward R Tufte, "The Visual Display of Quantitative Information", 1983)

"The conditions under which many data graphics are produced - the lack of substantive and quantitative skills of the illustrators, dislike of quantitative evidence, and contempt for the intelligence of the audience-guarantee graphic mediocrity. These conditions engender graphics that (1) lie; (2) employ only the simplest designs, often unstandardized time-series based on a small handful of data points; and (3) miss the real news actually in the data." (Edward R Tufte, "The Visual Display of Quantitative Information", 1983)

"Fairy tales lie just as much as statistics do, but sometimes you can find a grain of truth in them." (Sergei Lukyanenko, "The Night Watch", 1998)

"While some social problems statistics are deliberate deceptions, many - probably the great majority - of bad statistics are the result of confusion, incompetence, innumeracy, or selective, self-righteous efforts to produce numbers that reaffirm principles and interests that their advocates consider just and right. The best response to stat wars is not to try and guess who's lying or, worse, simply to assume that the people we disagree with are the ones telling lies. Rather, we need to watch for the standard causes of bad statistics - guessing, questionable definitions or methods, mutant numbers, and inappropriate comparisons." (Joel Best, "Damned Lies and Statistics: Untangling Numbers from the Media, Politicians, and Activists", 2001)

"Believe it or not, it’s easy to make statistics lie. It’s called massaging the facts, and people do it all the time. […] To avoid this, graphics reporters should develop a keen eye for spotting problems with statistics in order to avoid the embarrassment and possible liability of reporting incorrect information." (Jennifer George-Palilonis," A Practical Guide to Graphics Reporting: Information Graphics for Print, Web & Broadcast", 2006)

"Another way to obscure the truth is to hide it with relative numbers. […] Relative scales are always given as percentages or proportions. An increase or decrease of a given percentage only tells us part of the story, however. We are missing the anchoring of absolute values." (Brian Suda, "A Practical Guide to Designing with Data", 2010)

"One way a chart can lie is through overemphasis of the size and scale of items, particularly when the dimension of depth isnʼt considered." (Brian Suda, "A Practical Guide to Designing with Data", 2010)

"I believe that the backlash against statistics is due to four primary reasons. The first, and easiest for most people to relate to, is that even the most basic concepts of descriptive and inferential statistics can be difficult to grasp and even harder to explain. […] The second cause for vitriol is that even well-intentioned experts misapply the tools and techniques of statistics far too often, myself included. Statistical pitfalls are numerous and tough to avoid. When we can't trust the experts to get it right, there's a temptation to throw the baby out with the bathwater. The third reason behind all the hate is that those with an agenda can easily craft statistics to lie when they communicate with us  […] And finally, the fourth cause is that often statistics can be perceived as cold and detached, and they can fail to communicate the human element of an issue." (Ben Jones, "Avoiding Data Pitfalls: How to Steer Clear of Common Blunders When Working with Data and Presenting Analysis and Visualizations", 2020)

"It is easy to lie with statistics. It is hard to tell the truth without it." (Andrejs Dunkels)

25 November 2018

🔭Data Science: Trust (Just the Quotes)

"We must trust to nothing but facts: These are presented to us by Nature, and cannot deceive. We ought, in every instance, to submit our reasoning to the test of experiment, and never to search for truth but by the natural road of experiment and observation." (Antoin-Laurent de Lavoisiere, "Elements of Chemistry", 1790)

"A law of nature, however, is not a mere logical conception that we have adopted as a kind of memoria technical to enable us to more readily remember facts. We of the present day have already sufficient insight to know that the laws of nature are not things which we can evolve by any speculative method. On the contrary, we have to discover them in the facts; we have to test them by repeated observation or experiment, in constantly new cases, under ever-varying circumstances; and in proportion only as they hold good under a constantly increasing change of conditions, in a constantly increasing number of cases with greater delicacy in the means of observation, does our confidence in their trustworthiness rise." (Hermann von Helmholtz, "Popular Lectures on Scientific Subjects", 1873)

"It is of the nature of true science to take nothing on trust or on authority. Every fact must be established by accurate observation, experiment, or calculation. Every law and principle must rest on inductive argument. The apostolic motto, ‘Prove all things, hold fast that which is good’, is thoroughly scientific. It is true that the mere reader of popular science must often be content to take that on testimony which he cannot personally verify; but it is desirable that even the most cursory reader should fully comprehend the modes in which facts are ascertained and the reasons on which the conclusions are based." (Sir John W Dawson, "The Chain of Life in Geological Time", 1880)

"The aim of science is to seek the simplest explanations of complex facts. We are apt to fall into the error of thinking that the facts are simple because simplicity is the goal of our quest. The guiding motto in the life of every natural philosopher should be, ‘Seek simplicity and distrust it’." (Alfred N Whitehead, "The Concept of Nature", 1919)

"Every bit of knowledge we gain and every conclusion we draw about the universe or about any part or feature of it depends finally upon some observation or measurement. Mankind has had again and again the humiliating experience of trusting to intuitive, apparently logical conclusions without observations, and has seen Nature sail by in her radiant chariot of gold in an entirely different direction." (Oliver J Lee, "Measuring Our Universe: From the Inner Atom to Outer Space", 1950)

"Being built on concepts, hypotheses, and experiments, laws are no more accurate or trustworthy than the wording of the definitions and the accuracy and extent of the supporting experiments." (Gerald Holton, "Introduction to Concepts and Theories in Physical Science", 1952)

"No observations are absolutely trustworthy. In no field of observation can we entirely rule out the possibility that an observation is vitiated by a large measurement or execution error. If a reading is found to lie a very long way from its fellows in a series of replicate observations, there must be a suspicion that the deviation is caused by a blunder or gross error of some kind. [...] One sufficiently erroneous reading can wreck the whole of a statistical analysis, however many observations there are." (Francis J Anscombe, "Rejection of Outliers", Technometrics Vol. 2 (2), 1960)

"Even properly done statistics can’t be trusted. The plethora of available statistical techniques and analyses grants researchers an enormous amount of freedom when analyzing their data, and it is trivially easy to ‘torture the data until it confesses’." (Alex Reinhart, "Statistics Done Wrong: The Woefully Complete Guide", 2015)

"Science’s predictions are more trustworthy, but they are limited to what we can systematically observe and tractably model. Big data and machine learning greatly expand that scope. Some everyday things can be predicted by the unaided mind, from catching a ball to carrying on a conversation. Some things, try as we might, are just unpredictable. For the vast middle ground between the two, there’s machine learning." (Pedro Domingos, "The Master Algorithm", 2015)

"The closer that sample-selection procedures approach the gold standard of random selection - for which the definition is that every individual in the population has an equal chance of appearing in the sample - the more we should trust them. If we don’t know whether a sample is random, any statistical measure we conduct may be biased in some unknown way." (Richard E Nisbett, "Mindware: Tools for Smart Thinking", 2015)

"GIGO is a famous saying coined by early computer scientists: garbage in, garbage out. At the time, people would blindly put their trust into anything a computer output indicated because the output had the illusion of precision and certainty. If a statistic is composed of a series of poorly defined measures, guesses, misunderstandings, oversimplifications, mismeasurements, or flawed estimates, the resulting conclusion will be flawed." (Daniel J Levitin, "Weaponized Lies", 2017)

"Are your insights based on data that is accurate and reliable? Trustworthy data is correct or valid, free from significant defects and gaps. The trustworthiness of your data begins with the proper collection, processing, and maintenance of the data at its source. However, the reliability of your numbers can also be influenced by how they are handled during the analysis process. Clean data can inadvertently lose its integrity and true meaning depending on how it is analyzed and interpreted." (Brent Dykes, "Effective Data Storytelling: How to Drive Change with Data, Narrative and Visuals", 2019)

"Big data is revolutionizing the world around us, and it is easy to feel alienated by tales of computers handing down decisions made in ways we don’t understand. I think we’re right to be concerned. Modern data analytics can produce some miraculous results, but big data is often less trustworthy than small data. Small data can typically be scrutinized; big data tends to be locked away in the vaults of Silicon Valley. The simple statistical tools used to analyze small datasets are usually easy to check; pattern-recognizing algorithms can all too easily be mysterious and commercially sensitive black boxes." (Tim Harford, "The Data Detective: Ten easy rules to make sense of statistics", 2020)

"I believe that the backlash against statistics is due to four primary reasons. The first, and easiest for most people to relate to, is that even the most basic concepts of descriptive and inferential statistics can be difficult to grasp and even harder to explain. […] The second cause for vitriol is that even well-intentioned experts misapply the tools and techniques of statistics far too often, myself included. Statistical pitfalls are numerous and tough to avoid. When we can't trust the experts to get it right, there's a temptation to throw the baby out with the bathwater. The third reason behind all the hate is that those with an agenda can easily craft statistics to lie when they communicate with us  […] And finally, the fourth cause is that often statistics can be perceived as cold and detached, and they can fail to communicate the human element of an issue." (Ben Jones, "Avoiding Data Pitfalls: How to Steer Clear of Common Blunders When Working with Data and Presenting Analysis and Visualizations", 2020)
