Showing posts with label certainty. Show all posts
Showing posts with label certainty. Show all posts

09 April 2024

🧭Business Intelligence: Why Data Projects Fail to Deliver Real-Life Impact (Part IV: Making It in the Statistics)

Business Intelligence
Business Intelligence Series

Various sources (e.g., [1], [2], [3]) advance the failure rates for data projects somewhere between 70% and 85%, rates which are a bit higher than the failure of standard projects estimated at 60-75% but not by much. This means that only 2-3 out of 10 projects will succeed and that’s another reason to plan for failure, respectively embrace the failure

Unfortunately, the statistics advanced on project failure have no solid fundament and should be regarded with circumspection as long the methodology and information about the population used for the estimates aren’t shared, though they do reflect an important point – many data projects do fail! It would be foolish to think that your project will not fail just because you’re a big company, and you have the best resources, and you have a proven rate of success, and you took all the precautions for the project not to fail.

Usually at the end of a project the team meets together to document the lessons learned in the hope that the next projects will benefit from them. The team did learn something, though as the practice shows even if the team managed to avoid some issues, other issues will impact the next similar project, leading to similar variances. One can summarize this as "on the average the impact of new issues and avoided known issues tends to zero out" or "on average, the plusses and minuses balance each other across projects". It’s probably a question of focus – if organizations focus too much on certain aspects, other aspects are ignored and/or unseen. 

So, your first data project will more likely fail. The question is: what do you do about it? It’s important to be aware of why projects and data projects fail, though starting to consider and monitor each possible issue can prove to be ineffective. One can, however, create a risk register from the list and estimate the rates for each of the potential failures, respectively focus on only the top 3-5 which have the highest risk. Of course, one should reevaluate the estimates on a regular basis though that’s Risk Management 101. 

Besides this, one should focus on how the team can make the project succeed. When adopting a technology, methodology or set of processes, it’s recommended to start with a proof-of-concept (PoC). To make the PoC a helpful experience it’s probably important to start with a topic that’s not too big to handle, but that also involves some complexity that would allow the organization to evaluate the targeted set of tools and technologies. It can also be a topic for which other organizations have made important progress, respectively succeed. The temptation is big to approach the most stringent issues in the organization, respectively to build something big that can have an enormous impact for the organization. Jumping too soon into such topics can just increase the chances of failure. 

One can also formulate the goals, objectives and further requirements in a form that allows the organization to build upon them even if the project fails. A PoC is about learning, building a foundation, doing the groundwork, exploring, mapping the unknown, and identifying what's still missing to make progress, respectively closing the full circle. A PoC is less about overachievement and a big impact, which can happen, though is a consequence of the good work done in the PoC. 

The bottom line, no matter whether you succeed or fail, once you start a project, you’ll still make it in the statistics! More important is what you’ve learnt after the first data project, respectively how you can use the respective knowledge in further projects to make a difference!

Previous Post <<||>> Next Post

References:
[1] Harvard Business Review (2023) Keep Your AI Projects on Track, by Iavor Bojinov (link)
[2] Cognilytica (2023) The Shocking Truth: 70-80% of AI Projects Fail! (link)
[3] VentureBeat (2019) Why do 87% of data science projects never make it into production? (link)

08 April 2024

🧭Business Intelligence: Why Data Projects Fail to Deliver Real-Life Impact (Part III: Failure through the Looking Glass)

Business Intelligence
Business Intelligence Series

There’s a huge volume of material available on project failure – resources that document why individual projects failed, why in general projects fail, why project members, managers and/or executives think projects fail, and there seems to be no other more rewarding activity at the end of a project than to theorize why a project failed, the topic culminating occasionally with the blaming game. Success may generate applause, though it's failure that attracts and stirs the most waves (irony, disapproval, and other similar behavior) and everybody seems to be an expert after the consumed endeavor. 

The mere definition of a project failure – not fulfilling project’s objectives within the set budget and timeframe - is a misnomer because budgets and timelines are estimated based on the information available at the beginning of the project, the amount of uncertainty for many projects being considerable, and data projects are no exceptions from it. The higher the uncertainty the less probable are the two estimates. Even simple projects can reveal uncertainty especially when the broader context of the projects is considered. 

Even if it’s not a common practice, one way to cope with uncertainty is to add a tolerance for the estimates, though even this practice probably will not always accommodate the full extent of the unknown as the tolerances are usually small. The general expectation is to have an accurate and precise landing, which for big or exploratory projects is seldom possible!

Moreover, the assumptions under which the estimates hold are easily invalidated in praxis – resources’ availability, first time right, executive’s support to set priorities, requirements’ quality, technologies’ maturity, etc. If one looks beyond the reasons why projects fail in general, quite often the issues are more organizational than technological, the lack of knowledge and experience being some of the factors. 

Conversely, many projects will not get approved if the estimates don’t look positive, and therefore people are pressured in one way or another to make the numbers fit the expectations. Some projects, given their importance, need to be done even if the numbers don’t look good or can’t be quantified correctly. Other projects represent people’s subsistence on the job, respectively people's self-occupation to create motion, though they can occasionally have also a positive impact for the organizations. These kinds of aspects almost never make it in statistics or surveys. Neither do the big issues people are afraid to talk about. Where to consider that in the light of politics and office’s grapevine the facts get distorted!

Data projects reflect all the symptoms of failure projects have in general, though when words like AI, Statistics or Machine Learning are used, the chances for failure are even higher given that the respective fields require a higher level of expertise, the appropriate use of technologies and adherence to the scientific process for the results to be valid. If projects can benefit from general recipes, respectively established procedures and methods, their range of applicability decreases when the mentioned areas are involved. 

Many data projects have an exploratory nature – seeing what’s possible - and therefore a considerable percentage will not reach production. Moreover, even those that reach that far might arrive to be stopped or discarded sooner or later if they don’t deliver the expected value, and probably many of the models created in the process are biased, irrelevant, or incorrectly apply the theory. Where to add that the mere use of tools and algorithms is not Data Science or Data Analysis. 

The challenge for many data projects is to identify which Project Management (PM) best practices to consider. Following all or no practices at all just increases the risks of failure!

Previous Post <<||>> Next Post

06 April 2024

🧭Business Intelligence: Why Data Projects Fail to Deliver Real-Life Impact (Part II: There's Value in Failure)

Business Intelligence
Business Intelligence Series

"Results are nothing; the energies which produce them
and which again spring from them are everything."
(Wilhelm von Humboldt,  "On Language", 1836)

When the data is not available and is needed on a continuous basis then usually the solution is to redesign the processes and make sure the data becomes available at the needed quality level. Redesign involves additional costs for the business; therefore, it might be tempting to cancel or postpone data projects, at least until they become feasible, though they’re seldom feasible. 

Just because there’s a set of data, this doesn’t mean that there is important knowledge to be extracted from it, respectively that the investment is feasible. There’s however value in building experience in the internal resources, in identifying the challenges and the opportunities, in identifying what needs to be changed for harnessing the data. Unfortunately, organizations expect that somebody else will do the work for them instead of doing the jump by themselves, and this approach more likely will fail. It’s like expecting to get enlightened after a few theoretical sessions with a guru than walking the path by oneself. 

This is reflected also in organizations’ readiness to do the required endeavors for making the jump on the maturity scale. If organizations can’t approach such topics systematically and address the assumptions, opportunities, and risks adequately, respectively to manage the various aspects, it’s hard to believe that their data journey will be positive. 

A data journey shouldn’t be about politics even if some minds need to be changed in the process, at management as well as at lower level. If the leadership doesn’t recognize the importance of becoming an enabler for such initiatives, then the organization probably deserves to keep the status quo. The drive for change should come from the leadership even if we talk about data culture, data strategy, decision-making, or any critical aspect.

An organization will always need to find the balance between time, scope, cost, and quality, and this applies to operations, tactics, and strategies as well as to projects.  There are hard limits and lot of uncertainty associated with data projects and the tasks involved, limits reflected in cost and time estimations (which frankly are just expert’s rough guesses that can change for the worst in the light of new information). Therefore, especially in data projects one needs to be able to compromise, to change scope and timelines as seems fit, and why not, to cancel the projects if the objectives aren’t feasible anymore, respectively if compromises can’t be reached.

An organization must be able to take the risks and invest in failure, otherwise the opportunities for growth don’t change. Being able to split a roadmap into small iterative steps that allow besides breaking down the complexity and making progress to evaluate the progress and the knowledge resulted, respectively incorporate the feedback and knowledge in the next steps, can prove to be what organizations lack in coping with the high uncertainty. Instead, organizations seem to be fascinated by the big bang, thinking that technology can automatically fill the organizational gaps.

Doing the same thing repeatedly and expecting different results is called insanity. Unfortunately, this is what organizations and service providers do in what concerns Project Management in general and data projects in particular. Building something without a foundation, without making sure that the employees have the skillset, maturity and culture to manage the data-related tasks, challenges and opportunities is pure insanity!

Bottom line, harnessing the data requires a certain maturity and it starts with recognizing and pursuing opportunities, setting goals, following roadmaps, learning to fail and getting value from failure, respectively controlling the failure. Growth or instant enlightenment without a fair amount of sweat is possible, though that’s an exception for few in sight!

Previous Post <<||>> Next Post

11 March 2024

🧭🚥Business Intelligence: Key Performance Indicators [KPI] (Between Certainty and Uncertainty)

Business Intelligence
Business Intelligence Series

Despite the huge collection of documented Key Performance Indicators (KPIs) and best practices on which KPIs to choose, choosing a reliable set of KPIs that reflect how the organization performs in achieving its objectives continues to be a challenge for many organizations. Ideally, for each objective there should be only one KPIs that reflects the target and the progress made, though is that realistic?

Let's try to use the driver's metaphor to exemplify several aspects related to the choice of KPIs. A driver's goal is to travel from point A to point B over a distance d in x hours. The goal is SMART (Specific, Measurable, Achievable, Relevant, and Time-bound) if the speed and time are realistic and don't contradict Physics, legal or physical laws. The driver can define the objective as "arriving on time to the destination". 

One can define a set of metrics based on the numbers that can be measured. We have the overall distance and the number of hours planned, from which one can derive an expected average speed v. To track a driver's progress over time there are several metrics that can be thus used: e.g., (1) the current average speed, (2) the number of kilometers to the destination, (3) the number of hours estimated to the destination. However, none of these metrics can be used alone to denote the performance alone. One can compare the expected with the current average speed to get a grasp of the performance, and probably many organizations will use only (1) as KPI, though it's needed to use either (2) or (3) to get the complete picture. So, in theory two KPIs should be enough. Is it so?

When estimating (3) one assumes that there are no impediments and that the average speed can be attained, which might be correct for a road without traffic. There can be several impediments - planned/unplanned breaks, traffic jams, speed limits, accidents or other unexpected events, weather conditions (that depend on the season), etc. Besides the above formula, one needs to quantify such events in one form or another, e.g., through the perspective of the time added to the initial estimation from (3). However, this calculation is based on historical values or navigator's estimation, value which can be higher or lower than the final value. 

Therefore, (3) is an approximation for which is needed also a confidence interval (± t hours). The value can still include a lot of uncertainty that maybe needs to be broken down and quantified separately upon case to identify the deviation from expectations, e.g. on average there are 3 traffic jams (4), if the road crosses states or countries there may be at least 1 control on average (5), etc. These numbers can be included in (3) and the confidence interval, and usually don't need to be reported separately, though probably there are exceptions. 

When planning, one needs to also consider the number of stops for refueling or recharging the car, and the average duration of such stops, which can be included in (3) as well. However, (3) slowly becomes  too complex a formula, and even if there's an estimation, the more facts we're pulling into it, the bigger the confidence interval's variation will be. Sometimes, it's preferable to have instead two-three other metrics with a low confidence interval than one with high variation. Moreover, the longer the distance planned, the higher the uncertainty. One thing is to plan a trip between two neighboring city, and another thing is to plan a trip around the world. 

Another assumption is that the capability of the driver/car to drive is the same over time, which is not always the case. This can be neglected occasionally (e.g. one trip), though it involves a risk (6) that might be useful to quantify, especially when the process is repeatable (e.g. regular commuting). The risk value can increase considering new information, e.g. knowing that every a few thousand kilometers something breaks, or that there's a traffic fine, or an accident. In spite of new information, the objective might also change. Also, the objective might suffer changes, e.g. arrive on-time safe and without fines to the destination. As the objective changes or further objectives are added, more metrics can be defined. It would make sense to measure how many kilometers the driver covered in a lifetime with the car (7), how many accidents (8) or how many fines (9) the driver had. (7) is not related to a driver's performance, but (8) and (9) are. 

As can be seen, simple processes can also become very complex if one attempts to consider all the facts and/or quantify the uncertainty. The driver's metaphor applies to a simple individual, though once the same process is considered across the whole organization (a group of drivers), the more complexity is added and the perspective changes completely. E.g., some drivers might not even reach the destination or not even have a car to start with, and so on. Of course, with this also the objectives change and need to be redefined accordingly. 

The driver's metaphor is good for considering planning activities in which a volume of work needs to be completed in a given time and where a set of constraints apply. Therefore, for some organizations, just using two numbers might be enough for getting a feeling for what's happening. However, as soon one needs to consider other aspects like safety or compliance (considered in aggregation across many drivers), there might be other metrics that qualify as KPIs.

It's tempting to add two numbers and consider for example (8) and (9) together as the two are events that can be cumulated, even if they refer to different things that can overlap (an accident can result in a fine and should be counted maybe only once). One needs to make sure that one doesn't add apples with juice - the quantified values must have the same unit of measure, otherwise they might need to be considered separately. There's the tendency of mixing multiple metrics in a KPI that doesn't say much if the units of measure of its components are not the same. Some conversions can still be made (e.g. how much juice can be obtained from apples), though that's seldom the case.

Previous Post <<||>> Next Post

26 December 2018

🔭Data Science: Uncertainty (Just the Quotes)

"If the number of experiments be very large, we may have precise information as to the value of the mean, but if our sample be small, we have two sources of uncertainty: (I) owing to the 'error of random sampling' the mean of our series of experiments deviates more or less widely from the mean of the population, and (2) the sample is not sufficiently large to determine what is the law of distribution of individuals." (William S Gosset, "The Probable Error of a Mean", Biometrika, 1908)

"The making of decisions, as everyone knows from personal experience, is a burdensome task. Offsetting the exhilaration that may result from correct and successful decision and the relief that follows the termination of a struggle to determine issues is the depression that comes from failure, or error of decision, and the frustration which ensues from uncertainty." (Chester I Barnard, "The Functions of the Executive", 1938)

"Uncertainty is introduced, however, by the impossibility of making generalizations, most of the time, which happens to all members of a class. Even scientific truth is a matter of probability and the degree of probability stops somewhere short of certainty." (Wayne C Minnick, "The Art of Persuasion", 1957)

"Statistics is a body of methods and theory applied to numerical evidence in making decisions in the face of uncertainty." (Lawrence Lapin, "Statistics for Modern Business Decisions", 1973)

"The most dominant decision type [that will have to be made in an organic organization] will be decisions under uncertainty." (Henry L Tosi & Stephen J Carroll, "Management", 1976)

"The greater the uncertainty, the greater the amount of decision making and information processing. It is hypothesized that organizations have limited capacities to process information and adopt different organizing modes to deal with task uncertainty. Therefore, variations in organizing modes are actually variations in the capacity of organizations to process information and make decisions about events which cannot be anticipated in advance." (John K Galbraith, "Organization Design", 1977)

"Probability is the mathematics of uncertainty. Not only do we constantly face situations in which there is neither adequate data nor an adequate theory, but many modem theories have uncertainty built into their foundations. Thus learning to think in terms of probability is essential. Statistics is the reverse of probability (glibly speaking). In probability you go from the model of the situation to what you expect to see; in statistics you have the observations and you wish to estimate features of the underlying model." (Richard W Hamming, "Methods of Mathematics Applied to Calculus, Probability, and Statistics", 1985)

"Probability plays a central role in many fields, from quantum mechanics to information theory, and even older fields use probability now that the presence of 'noise' is officially admitted. The newer aspects of many fields start with the admission of uncertainty." (Richard Hamming, "Methods of Mathematics Applied to Calculus, Probability, and Statistics", 1985)

"Models are often used to decide issues in situations marked by uncertainty. However statistical differences from data depend on assumptions about the process which generated these data. If the assumptions do not hold, the inferences may not be reliable either. This limitation is often ignored by applied workers who fail to identify crucial assumptions or subject them to any kind of empirical testing. In such circumstances, using statistical procedures may only compound the uncertainty." (David A Greedman & William C Navidi, "Regression Models for Adjusting the 1980 Census", Statistical Science Vol. 1 (1), 1986)

"The mathematical theories generally called 'mathematical theories of chance' actually ignore chance, uncertainty and probability. The models they consider are purely deterministic, and the quantities they study are, in the final analysis, no more than the mathematical frequencies of particular configurations, among all equally possible configurations, the calculation of which is based on combinatorial analysis. In reality, no axiomatic definition of chance is conceivable." (Maurice Allais, "An Outline of My Main Contributions to Economic Science", [Noble lecture] 1988)

"The worst, i.e., most dangerous, feature of 'accepting the null hypothesis' is the giving up of explicit uncertainty. [...] Mathematics can sometimes be put in such black-and-white terms, but our knowledge or belief about the external world never can." (John Tukey, "The Philosophy of Multiple Comparisons", Statistical Science Vol. 6 (1), 1991)

"In nonlinear systems - and the economy is most certainly nonlinear - chaos theory tells you that the slightest uncertainty in your knowledge of the initial conditions will often grow inexorably. After a while, your predictions are nonsense." (M Mitchell Waldrop, "Complexity: The Emerging Science at the Edge of Order and Chaos", 1992)

"Statistics as a science is to quantify uncertainty, not unknown." (Chamont Wang, "Sense and Nonsense of Statistical Inference: Controversy, Misuse, and Subtlety", 1993)

"There is a new science of complexity which says that the link between cause and effect is increasingly difficult to trace; that change (planned or otherwise) unfolds in non-linear ways; that paradoxes and contradictions abound; and that creative solutions arise out of diversity, uncertainty and chaos." (Andy P Hargreaves & Michael Fullan, "What’s Worth Fighting for Out There?", 1998)

"Information entropy has its own special interpretation and is defined as the degree of unexpectedness in a message. The more unexpected words or phrases, the higher the entropy. It may be calculated with the regular binary logarithm on the number of existing alternatives in a given repertoire. A repertoire of 16 alternatives therefore gives a maximum entropy of 4 bits. Maximum entropy presupposes that all probabilities are equal and independent of each other. Minimum entropy exists when only one possibility is expected to be chosen. When uncertainty, variety or entropy decreases it is thus reasonable to speak of a corresponding increase in information." (Lars Skyttner, "General Systems Theory: Ideas and Applications", 2001)

"Any scientific data without (a stated) uncertainty is of no avail. Therefore the analysis and description of uncertainty are almost as important as those of the data value itself . It should be clear that the uncertainty itself also has an uncertainty – due to its nature as a scientific quantity – and so on. The uncertainty of an uncertainty is generally not determined." (Manfred Drosg, "Dealing with Uncertainties: A Guide to Error Analysis", 2007)

"As uncertainties of scientific data values are nearly as important as the data values themselves, it is usually not acceptable that a best estimate is only accompanied by an estimated uncertainty. Therefore, only the size of nondominant uncertainties should be estimated. For estimating the size of a nondominant uncertainty we need to find its upper limit, i.e., we want to be as sure as possible that the uncertainty does not exceed a certain value." (Manfred Drosg, "Dealing with Uncertainties: A Guide to Error Analysis", 2007)

"Before best estimates are extracted from data sets by way of a regression analysis, the uncertainties of the individual data values must be determined.In this case care must be taken to recognize which uncertainty components are common to all the values, i.e., those that are correlated (systematic)." (Manfred Drosg, "Dealing with Uncertainties: A Guide to Error Analysis", 2007)

"Due to the theory that underlies uncertainties an infinite number of data values would be necessary to determine the true value of any quantity. In reality the number of available data values will be relatively small and thus this requirement can never be fully met; all one can get is the best estimate of the true value." (Manfred Drosg, "Dealing with Uncertainties: A Guide to Error Analysis", 2007)

"For linear dependences the main information usually lies in the slope. It is obvious that those points that lie far apart have the strongest influence on the slope if all points have the same uncertainty. In this context we speak of the strong leverage of distant points; when determining the parameter 'slope' these distant points carry more effective weight. Naturally, this weight is distinct from the 'statistical' weight usually used in regression analysis." (Manfred Drosg, "Dealing with Uncertainties: A Guide to Error Analysis", 2007)

"In error analysis the so-called 'chi-squared' is a measure of the agreement between the uncorrelated internal and the external uncertainties of a measured functional relation. The simplest such relation would be time independence. Theory of the chi-squared requires that the uncertainties be normally distributed. Nevertheless, it was found that the test can be applied to most probability distributions encountered in practice." (Manfred Drosg, "Dealing with Uncertainties: A Guide to Error Analysis", 2007)

"In many cases systematic errors are interpreted as the systematic difference between nature (which is being questioned by the experimenter in his experiment) and the model (which is used to describe nature). If the model used is not good enough, but the measurement result is interpreted using this model, the final result (the interpretation) will be wrong because it is biased, i.e., it has a systematic deviation (not uncertainty). If we do not use the best model (the best theory) available for the description of a certain phenomenon this procedure is just wrong. It has nothing to do with an uncertainty." (Manfred Drosg, "Dealing with Uncertainties: A Guide to Error Analysis", 2007)

"It is also inevitable for any model or theory to have an uncertainty (a difference between model and reality). Such uncertainties apply both to the numerical parameters of the model and to the inadequacy of the model as well. Because it is much harder to get a grip on these types of uncertainties, they are disregarded, usually." (Manfred Drosg, "Dealing with Uncertainties: A Guide to Error Analysis", 2007)

"It is important that uncertainty components that are independent of each other are added quadratically. This is also true for correlated uncertainty components, provided they are independent of each other, i.e., as long as there is no correlation between the components." (Manfred Drosg, "Dealing with Uncertainties: A Guide to Error Analysis", 2007)

"It is the nature of an uncertainty that it is not known and can never be known, whether the best estimate is greater or less than the true value." (Manfred Drosg, "Dealing with Uncertainties: A Guide to Error Analysis", 2007)

"Outliers or flyers are those data points in a set that do not quite fit within the rest of the data, that agree with the model in use. The uncertainty of such an outlier is seemingly too small. The discrepancy between outliers and the model should be subject to thorough examination and should be given much thought. Isolated data points, i.e., data points that are at some distance from the bulk of the data are not outliers if their values are in agreement with the model in use." (Manfred Drosg, "Dealing with Uncertainties: A Guide to Error Analysis", 2007)

"The fact that the same uncertainty (e.g., scale uncertainty) is uncorrelated if we are dealing with only one measurement, but correlated (i.e., systematic) if we look at more than one measurement using the same instrument shows that both types of uncertainties are of the same nature. Of course, an uncertainty keeps its characteristics (e.g., Poisson distributed), independent of the fact whether it occurs only once or more often." (Manfred Drosg, "Dealing with Uncertainties: A Guide to Error Analysis", 2007)

"To fulfill the requirements of the theory underlying uncertainties, variables with random uncertainties must be independent of each other and identically distributed. In the limiting case of an infinite number of such variables, these are called normally distributed. However, one usually speaks of normally distributed variables even if their number is finite." (Manfred Drosg, "Dealing with Uncertainties: A Guide to Error Analysis", 2007)

"In fact, H [entropy] measures the amount of uncertainty that exists in the phenomenon. If there were only one event, its probability would be equal to 1, and H would be equal to 0 - that is, there is no uncertainty about what will happen in a phenomenon with a single event because we always know what is going to occur. The more events that a phenomenon possesses, the more uncertainty there is about the state of the phenomenon. In other words, the more entropy, the more information." (Diego Rasskin-Gutman, "Chess Metaphors: Artificial Intelligence and the Human Mind", 2009)

"Data always vary randomly because the object of our inquiries, nature itself, is also random. We can analyze and predict events in nature with an increasing amount of precision and accuracy, thanks to improvements in our techniques and instruments, but a certain amount of random variation, which gives rise to uncertainty, is inevitable." (Alberto Cairo, "The Functional Art", 2011)

"The storytelling mind is allergic to uncertainty, randomness, and coincidence. It is addicted to meaning. If the storytelling mind cannot find meaningful patterns in the world, it will try to impose them. In short, the storytelling mind is a factory that churns out true stories when it can, but will manufacture lies when it can't." (Jonathan Gottschall, "The Storytelling Animal: How Stories Make Us Human", 2012)

"The data is a simplification - an abstraction - of the real world. So when you visualize data, you visualize an abstraction of the world, or at least some tiny facet of it. Visualization is an abstraction of data, so in the end, you end up with an abstraction of an abstraction, which creates an interesting challenge. […] Just like what it represents, data can be complex with variability and uncertainty, but consider it all in the right context, and it starts to make sense." (Nathan Yau, "Data Points: Visualization That Means Something", 2013)

"Without precise predictability, control is impotent and almost meaningless. In other words, the lesser the predictability, the harder the entity or system is to control, and vice versa. If our universe actually operated on linear causality, with no surprises, uncertainty, or abrupt changes, all future events would be absolutely predictable in a sort of waveless orderliness." (Lawrence K Samuels, "Defense of Chaos", 2013)

"The greater the uncertainty, the bigger the gap between what you can measure and what matters, the more you should watch out for overfitting - that is, the more you should prefer simplicity." (Brian Christian & Thomas L Griffiths, "Algorithms to Live By: The Computer Science of Human Decisions", 2016)

"A notable difference between many fields and data science is that in data science, if a customer has a wish, even an experienced data scientist may not know whether it’s possible. Whereas a software engineer usually knows what tasks software tools are capable of performing, and a biologist knows more or less what the laboratory can do, a data scientist who has not yet seen or worked with the relevant data is faced with a large amount of uncertainty, principally about what specific data is available and about how much evidence it can provide to answer any given question. Uncertainty is, again, a major factor in the data scientific process and should be kept at the forefront of your mind when talking with customers about their wishes."  (Brian Godsey, "Think Like a Data Scientist", 2017)

"The elements of this cloud of uncertainty (the set of all possible errors) can be described in terms of probability. The center of the cloud is the number zero, and elements of the cloud that are close to zero are more probable than elements that are far away from that center. We can be more precise in this definition by defining the cloud of uncertainty in terms of a mathematical function, called the probability distribution." (David S Salsburg, "Errors, Blunders, and Lies: How to Tell the Difference", 2017)

"Uncertainty is an adversary of coldly logical algorithms, and being aware of how those algorithms might break down in unusual circumstances expedites the process of fixing problems when they occur - and they will occur. A data scientist’s main responsibility is to try to imagine all of the possibilities, address the ones that matter, and reevaluate them all as successes and failures happen." (Brian Godsey, "Think Like a Data Scientist", 2017)

"Bootstrapping provides an intuitive, computer-intensive way of assessing the uncertainty in our estimates, without making strong assumptions and without using probability theory. But the technique is not feasible when it comes to, say, working out the margins of error on unemployment surveys of 100,000 people. Although bootstrapping is a simple, brilliant and extraordinarily effective idea, it is just too clumsy to bootstrap such large quantities of data, especially when a convenient theory exists that can generate formulae for the width of uncertainty intervals." (David Spiegelhalter, "The Art of Statistics: Learning from Data", 2019)

"Entropy is a measure of amount of uncertainty or disorder present in the system within the possible probability distribution. The entropy and amount of unpredictability are directly proportional to each other." (G Suseela & Y Asnath V Phamila, "Security Framework for Smart Visual Sensor Networks", 2019)

"Estimates based on data are often uncertain. If the data were intended to tell us something about a wider population (like a poll of voting intentions before an election), or about the future, then we need to acknowledge that uncertainty. This is a double challenge for data visualization: it has to be calculated in some meaningful way and then shown on top of the data or statistics without making it all too cluttered." (Robert Grant, "Data Visualization: Charts, Maps and Interactive Graphics", 2019)

"Uncertainty confuses many people because they have the unreasonable expectation that science and statistics will unearth precise truths, when all they can yield is imperfect estimates that can always be subject to changes and updates." (Alberto Cairo, "How Charts Lie", 2019)

"We over-fit when we go too far in adapting to local circumstances, in a worthy but misguided effort to be ‘unbiased’ and take into account all the available information. Usually we would applaud the aim of being unbiased, but this refinement means we have less data to work on, and so the reliability goes down. Over-fitting therefore leads to less bias but at a cost of more uncertainty or variation in the estimates, which is why protection against over-fitting is sometimes known as the bias/variance trade-off." (David Spiegelhalter, "The Art of Statistics: Learning from Data", 2019)

15 December 2018

🔭Data Science: Probability (Just the Quotes)

"Probability is a degree of possibility." (Gottfried W Leibniz, "On estimating the uncertain", 1676)

"Probability, however, is not something absolute, [it is] drawn from certain information which, although it does not suffice to resolve the problem, nevertheless ensures that we judge correctly which of the two opposites is the easiest given the conditions known to us." (Gottfried W Leibniz, "Forethoughts for an encyclopaedia or universal science", cca. 1679)

"[…] the highest probability amounts not to certainty, without which there can be no true knowledge." (John Locke, "An Essay Concerning Human Understanding", 1689)

"As mathematical and absolute certainty is seldom to be attained in human affairs, reason and public utility require that judges and all mankind in forming their opinions of the truth of facts should be regulated by the superior number of the probabilities on the one side or the other whether the amount of these probabilities be expressed in words and arguments or by figures and numbers." (William Murray, 1773)

"All certainty which does not consist in mathematical demonstration is nothing more than the highest probability; there is no other historical certainty." (Voltaire, "A Philosophical Dictionary", 1881)

"Nature prefers the more probable states to the less probable because in nature processes take place in the direction of greater probability. Heat goes from a body at higher temperature to a body at lower temperature because the state of equal temperature distribution is more probable than a state of unequal temperature distribution." (Max Planck, "The Atomic Theory of Matter", 1909)

"Sometimes the probability in favor of a generalization is enormous, but the infinite probability of certainty is never reached." (William Dampier-Whetham, "Science and the Human Mind", 1912)

"There can be no unique probability attached to any event or behaviour: we can only speak of ‘probability in the light of certain given information’, and the probability alters according to the extent of the information." (Sir Arthur S Eddington, "The Nature of the Physical World", 1928)

"[…] the statistical prediction of the future from the past cannot be generally valid, because whatever is future to any given past, is in tum past for some future. That is, whoever continually revises his judgment of the probability of a statistical generalization by its successively observed verifications and failures, cannot fail to make more successful predictions than if he should disregard the past in his anticipation of the future. This might be called the ‘Principle of statistical accumulation’." (Clarence I Lewis, "Mind and the World-Order: Outline of a Theory of Knowledge", 1929)

"Science does not aim, primarily, at high probabilities. It aims at a high informative content, well backed by experience. But a hypothesis may be very probable simply because it tells us nothing, or very little." (Karl Popper, "The Logic of Scientific Discovery", 1934)

"The most important application of the theory of probability is to what we may call 'chance-like' or 'random' events, or occurrences. These seem to be characterized by a peculiar kind of incalculability which makes one disposed to believe - after many unsuccessful attempts - that all known rational methods of prediction must fail in their case. We have, as it were, the feeling that not a scientist but only a prophet could predict them. And yet, it is just this incalculability that makes us conclude that the calculus of probability can be applied to these events." (Karl R Popper, "The Logic of Scientific Discovery", 1934)

"Equiprobability in the physical world is purely a hypothesis. We may exercise the greatest care and the most accurate of scientific instruments to determine whether or not a penny is symmetrical. Even if we are satisfied that it is, and that our evidence on that point is conclusive, our knowledge, or rather our ignorance, about the vast number of other causes which affect the fall of the penny is so abysmal that the fact of the penny’s symmetry is a mere detail. Thus, the statement 'head and tail are equiprobable' is at best an assumption." (Edward Kasner & James R Newman, "Mathematics and the Imagination", 1940)

"Probabilities must be regarded as analogous to the measurement of physical magnitudes; that is to say, they can never be known exactly, but only within certain approximation." (Emile Borel, "Probabilities and Life", 1943)

"Just as entropy is a measure of disorganization, the information carried by a set of messages is a measure of organization. In fact, it is possible to interpret the information carried by a message as essentially the negative of its entropy, and the negative logarithm of its probability. That is, the more probable the message, the less information it gives. Clichés, for example, are less illuminating than great poems." (Norbert Wiener, "The Human Use of Human Beings", 1950)

"To say that observations of the past are certain, whereas predictions are merely probable, is not the ultimate answer to the question of induction; it is only a sort of intermediate answer, which is incomplete unless a theory of probability is developed that explains what we should mean by ‘probable’ and on what ground we can assert probabilities." (Hans Reichenbach, "The Rise of Scientific Philosophy", 1951)

"Uncertainty is introduced, however, by the impossibility of making generalizations, most of the time, which happens to all members of a class. Even scientific truth is a matter of probability and the degree of probability stops somewhere short of certainty." (Wayne C Minnick, "The Art of Persuasion", 1957)

"Everybody has some idea of the meaning of the term 'probability' but there is no agreement among scientists on a precise definition of the term for the purpose of scientific methodology. It is sufficient for our purpose, however, if the concept is interpreted in terms of relative frequency, or more simply, how many times a particular event is likely to occur in a large population." (Alfred R Ilersic, "Statistics", 1959)

"Incomplete knowledge must be considered as perfectly normal in probability theory; we might even say that, if we knew all the circumstances of a phenomenon, there would be no place for probability, and we would know the outcome with certainty." (Félix E Borel, Probability and Certainty", 1963)

"Probability is the mathematics of uncertainty. Not only do we constantly face situations in which there is neither adequate data nor an adequate theory, but many modem theories have uncertainty built into their foundations. Thus learning to think in terms of probability is essential. Statistics is the reverse of probability (glibly speaking). In probability you go from the model of the situation to what you expect to see; in statistics you have the observations and you wish to estimate features of the underlying model." (Richard W Hamming, "Methods of Mathematics Applied to Calculus, Probability, and Statistics", 1985) 

"Probability plays a central role in many fields, from quantum mechanics to information theory, and even older fields use probability now that the presence of 'noise' is officially admitted. The newer aspects of many fields start with the admission of uncertainty." (Richard W Hamming, "Methods of Mathematics Applied to Calculus, Probability, and Statistics", 1985)

"Probabilities are summaries of knowledge that is left behind when information is transferred to a higher level of abstraction." (Judea Pearl, "Probabilistic Reasoning in Intelligent Systems: Network of Plausible, Inference", 1988)

"[In statistics] you have the fact that the concepts are not very clean. The idea of probability, of randomness, is not a clean mathematical idea. You cannot produce random numbers mathematically. They can only be produced by things like tossing dice or spinning a roulette wheel. With a formula, any formula, the number you get would be predictable and therefore not random. So as a statistician you have to rely on some conception of a world where things happen in some way at random, a conception which mathematicians don’t have." (Lucien LeCam, [interview] 1988)

"So we pour in data from the past to fuel the decision-making mechanisms created by our models, be they linear or nonlinear. But therein lies the logician's trap: past data from real life constitute a sequence of events rather than a set of independent observations, which is what the laws of probability demand. [...] It is in those outliers and imperfections that the wildness lurks." (Peter L Bernstein, "Against the Gods: The Remarkable Story of Risk", 1996) 

"Often, we use the word random loosely to describe something that is disordered, irregular, patternless, or unpredictable. We link it with chance, probability, luck, and coincidence. However, when we examine what we mean by random in various contexts, ambiguities and uncertainties inevitably arise. Tackling the subtleties of randomness allows us to go to the root of what we can understand of the universe we inhabit and helps us to define the limits of what we can know with certainty." (Ivars Peterson, "The Jungles of Randomness: A Mathematical Safari", 1998)

"In the laws of probability theory, likelihood distributions are fixed properties of a hypothesis. In the art of rationality, to explain is to anticipate. To anticipate is to explain." (Eliezer S. Yudkowsky, "A Technical Explanation of Technical Explanation", 2005)

"For some scientific data the true value cannot be given by a constant or some straightforward mathematical function but by a probability distribution or an expectation value. Such data are called probabilistic. Even so, their true value does not change with time or place, making them distinctly different from  most statistical data of everyday life." (Manfred Drosg, "Dealing with Uncertainties: A Guide to Error Analysis", 2007)

"In fact, H [entropy] measures the amount of uncertainty that exists in the phenomenon. If there were only one event, its probability would be equal to 1, and H would be equal to 0 - that is, there is no uncertainty about what will happen in a phenomenon with a single event because we always know what is going to occur. The more events that a phenomenon possesses, the more uncertainty there is about the state of the phenomenon. In other words, the more entropy, the more information." (Diego Rasskin-Gutman, "Chess Metaphors: Artificial Intelligence and the Human Mind", 2009)

"The four questions of data analysis are the questions of description, probability, inference, and homogeneity. [...] Descriptive statistics are built on the assumption that we can use a single value to characterize a single property for a single universe. […] Probability theory is focused on what happens to samples drawn from a known universe. If the data happen to come from different sources, then there are multiple universes with different probability models.  [...] Statistical inference assumes that you have a sample that is known to have come from one universe." (Donald J Wheeler," Myths About Data Analysis", International Lean & Six Sigma Conference, 2012)

"Entropy is a measure of amount of uncertainty or disorder present in the system within the possible probability distribution. The entropy and amount of unpredictability are directly proportional to each other." (G Suseela & Y Asnath V Phamila, "Security Framework for Smart Visual Sensor Networks", 2019)

14 December 2018

🔭Data Science: Coincidence (Just the Quotes)

"It is no great wonder if in long process of time, while fortune takes her course hither and thither, numerous coincidences should spontaneously occur. If the number and variety of subjects to be wrought upon be infinite, it is all the more easy for fortune, with such an abundance of material, to effect this similarity of results." (Plutarch, Life of Sertorius, 1st century BC)

"Coincidences, in general, are great stumbling blocks in the way of that class of thinkers who have been educated to know nothing of the theory of probabilities - that theory to which the most glorious objects of human research are indebted for the most glorious of illustrations." (Edgar A Poe, "The Murders in the Rue Morgue", 1841)

"Nothing is more certain in scientific method than that approximate coincidence alone can be expected. In the measurement of continuous quantity perfect correspondence must be accidental, and should give rise to suspicion rather than to satisfaction." (William S Jevons, "The Principles of Science: A Treatise on Logic and Scientific Method", 1874)

"Before we can completely explain a phenomenon we require not only to find its true cause, its chief relations to other causes, and all the conditions which determine how the cause operates, and what its effect and amount of effect are, but also all the coincidences." (George Gore, "The Art of Scientific Discovery", 1878)

"As science progress, it becomes more and more difficult to fit in the new facts when they will not fit in spontaneously. The older theories depend upon the coincidences of so many numerical results which can not be attributed to chance. We should not separate what has been joined together." (Henri Poincaré, "The Ether and Matter", 1912)

"By the laws of statistics we could probably approximate just how unlikely it is that it would happen. But people forget - especially those who ought to know better, such as yourself - that while the laws of statistics tell you how unlikely a particular coincidence is, they state just as firmly that coincidences do happen." (Robert A Heinlein, "The Door Into Summer", 1957)

"There is no coherent knowledge, i.e. no uniform comprehensive account of the world and the events in it. There is no comprehensive truth that goes beyond an enumeration of details, but there are many pieces of information, obtained in different ways from different sources and collected for the benefit of the curious. The best way of presenting such knowledge is the list - and the oldest scientific works were indeed lists of facts, parts, coincidences, problems in several specialized domains." (Paul K Feyerabend, "Farewell to Reason", 1987)

"A tendency to drastically underestimate the frequency of coincidences is a prime characteristic of innumerates, who generally accord great significance to correspondences of all sorts while attributing too little significance to quite conclusive but less flashy statistical evidence." (John A Paulos, "Innumeracy: Mathematical Illiteracy and its Consequences", 1988)

"The law of truly large numbers states: With a large enough sample, any outrageous thing is likely to happen." (Frederick Mosteller, "Methods for Studying Coincidences", Journal of the American Statistical Association Vol. 84, 1989)

"Most coincidences are simply chance events that turn out to be far more probable than many people imagine." (Ivars Peterson, "The Jungles of Randomness: A Mathematical Safari", 1997)

"Often, we use the word random loosely to describe something that is disordered, irregular, patternless, or unpredictable. We link it with chance, probability, luck, and coincidence. However, when we examine what we mean by random in various contexts, ambiguities and uncertainties inevitably arise. Tackling the subtleties of randomness allows us to go to the root of what we can understand of the universe we inhabit and helps us to define the limits of what we can know with certainty." (Ivars Peterson, "The Jungles of Randomness: A Mathematical Safari", 1998)

"Coincidence surprises us because our intuition about the likelihood of an event is often wildly inaccurate." (Michael Starbird, "Coincidences, Chaos, and All That Math Jazz", 2005)

"With our heads spinning in the world of coincidence and chaos, we nevertheless must make decisions and take steps into the minefield of our future. To avoid explosive missteps, we rely on data and statistical reasoning to inform our thinking." (Michael Starbird, "Coincidences, Chaos, and All That Math Jazz", 2005)

"The human mind delights in finding pattern - so much so that we often mistake coincidence or forced analogy for profound meaning. No other habit of thought lies so deeply within the soul of a small creature trying to make sense of a complex world not constructed for it." (Stephen J Gould, "The Flamingo's Smile: Reflections in Natural History", 2010)

More quotes on "Coincidence" at the-web-of-knowledge.blogspot.com.

03 December 2018

🔭Data Science: Observation (Just the Quotes)

"[…] it is not necessary that these hypotheses should be true, or even probably; but it is enough if they provide a calculus which fits the observations […]" (Andrew Osiander, "On the Revolutions of the Heavenly Spheres", 1543)

"[…] it is from long experience chiefly that we are to expect the most certain rules of practice, yet it is withal to be remembered, that observations, and to put us upon the most probable means of improving any art, is to get the best insight we can into the nature and properties of those things which we are desirous to cultivate and improve." (Stephen Hales, "Vegetable Staticks", 1727) 

"Those who have not imbibed the prejudices of philosophers, are easily convinced that natural knowledge is to be founded on experiment and observation." (Colin Maclaurin, "An Account of Sir Isaac Newton’s Philosophical Discoveries", 1748)

"We have three principal means: observation of nature, reflection, and experiment. Observation gathers the facts reflection combines them, experiment verifies the result of the combination. It is essential that the observation of nature be assiduous, that reflection be profound, and that experimentation be exact. Rarely does one see these abilities in combination. And so, creative geniuses are not common." (Denis Diderot, "On the Interpretation of Nature", 1753)

"Facts, observations, experiments - these are the materials of a great edifice, but in assembling them we must combine them into classes, distinguish which belongs to which order and to which part of the whole each pertains." (Antoine L Lavoisier, "Mémoires de l’Académie Royale des Sciences", 1777)

"On the other hand, if we add observation to observation, without attempting to draw no only certain conclusions, but also conjectural views from them, we offend against the very end for which only observations ought to be made." (Friedrich W Herschel, "On the Construction of the Heavens", Philosophical Transactions of the Royal Society of London Vol. LXXV, 1785)

"[It] may be laid down as a general rule that, if the result of a long series of precise observations approximates a simple relation so closely that the remaining difference is undetectable by observation and may be attributed to the errors to which they are liable, then this relation is probably that of nature." (Pierre-Simon Laplace, "Mémoire sur les Inégalites Séculaires des Planètes et des Satellites", 1787)

"The art of drawing conclusions from experiments and observations consists in evaluating probabilities and in estimating whether they are sufficiently great or numerous enough to constitute proofs. This kind of calculation is more complicated and more difficult than it is commonly thought to be […]" (Antoine-Laurent Lavoisier, cca. 1790)

"We must trust to nothing but facts: These are presented to us by Nature, and cannot deceive. We ought, in every instance, to submit our reasoning to the test of experiment, and never to search for truth but by the natural road of experiment and observation." (Antoin-Laurent de Lavoisiere, "Elements of Chemistry", 1790)

"Conjecture may lead you to form opinions, but it cannot produce knowledge. Natural philosophy must be built upon the phenomena of nature discovered by observation and experiment." (George Adams, "Lectures on Natural and Experimental Philosophy" Vol. 1, 1794)

"In order to supply the defects of experience, we will have recourse to the probable conjectures of analogy, conclusions which we will bequeath to our posterity to be ascertained by new observations, which, if we augur rightly, will serve to establish our theory and to carry it gradually nearer to absolute certainty." (Johann H Lambert, "The System of the World", 1800)

"[…] we must not measure the simplicity of the laws of nature by our facility of conception; but when those which appear to us the most simple, accord perfectly with observations of the phenomena, we are justified in supposing them rigorously exact." (Pierre-Simon Laplace, "The System of the World", 1809)

"Primary causes are unknown to us; but are subject to simple and constant laws, which may be discovered by observation, the study of them being the object of natural philosophy." (Jean-Baptiste-Joseph Fourier, "The Analytical Theory of Heat", 1822)

"The aim of every science is foresight. For the laws of established observation of phenomena are generally employed to foresee their succession. All men, however little advanced make true predictions, which are always based on the same principle, the knowledge of the future from the past." (Auguste Compte, "Plan des travaux scientifiques nécessaires pour réorganiser la société", 1822)

"The framing of hypotheses is, for the enquirer after truth, not the end, but the beginning of his work. Each of his systems is invented, not that he may admire it and follow it into all its consistent consequences, but that he may make it the occasion of a course of active experiment and observation. And if the results of this process contradict his fundamental assumptions, however ingenious, however symmetrical, however elegant his system may be, he rejects it without hesitation. He allows no natural yearning for the offspring of his own mind to draw him aside from the higher duty of loyalty to his sovereign, Truth, to her he not only gives his affections and his wishes, but strenuous labour and scrupulous minuteness of attention." (William Whewell, "Philosophy of the Inductive Sciences" Vol. 2, 1847)

"In the fields of observation chance favors only the prepared mind." (Louis Pasteur, [lecture] 1854)

"When a power of nature, invisible and impalpable, is the subject of scientific inquiry, it is necessary, if we would comprehend its essence and properties, to study its manifestations and effects. For this purpose simple observation is insufficient, since error always lies on the surface, whilst truth must be sought in deeper regions." (Justus von Liebig," Familiar Letters on Chemistry", 1859)

"Observation is so wide awake, and facts are being so rapidly added to the sum of human experience, that it appears as if the theorizer would always be in arrears, and were doomed forever to arrive at imperfect conclusion; but the power to perceive a law is equally rare in all ages of the world, and depends but little on the number of facts observed." (Henry D Thoreau, "A Week on the Concord and Merrimack Rivers", 1862)

"The process of discovery is very simple. An unwearied and systematic application of known laws to nature, causes the unknown to reveal themselves. Almost any mode of observation will be successful at last, for what is most wanted is method." (Henry D Thoreau, "A Week on the Concord and Merrimack Rivers", 1862)

"An anticipative idea or an hypothesis is, then, the necessary starting point for all experimental reasoning. Without it, we could not make any investigation at all nor learn anything; we could only pile up sterile observations. If we experiment without a preconceived idea, we should move at random […]" (Claude Bernard, "An Introduction to the Study of Experimental Medicine", 1865)

"Men who have excessive faith in their theories or ideas are not only ill prepared for making discoveries; they also make very poor observations." (Claude Bernard, "An Introduction to the Study of Experimental Medicine", 1865)

"Only within very narrow boundaries can man observe the phenomena which surround him; most of them naturally escape his senses, and mere observation is not enough." (Claude Bernard, "An Introduction to the Study of Experimental Medicine", 1865)

"[…] wrong hypotheses, rightly worked from, have produced more useful results than unguided observation." (Augustus de Morgan, "A Budget of Paradoxes", 1872)

"Every science begins by accumulating observations, and presently generalizes these empirically; but only when it reaches the stage at which its empirical generalizations are included in a rational generalization does it become developed science." (Herbert Spencer, "The Data of Ethics", 1879)

"Science is the observation of things possible, whether present or past; prescience is the knowledge of things which may come to pass, though but slowly." (Leonardo da Vinci, "The Notebooks of Leonardo da Vinci", 1883)

"Even one well-made observation will be enough in many cases, just as one well-constructed experiment often suffices for the establishment of a law." (Émile Durkheim, "The Rules of Sociological Method", "The Rules of Sociological Method", 1895)

"Every experiment, every observation has, besides its immediate result, effects which, in proportion to its value, spread always on all sides into ever distant parts of knowledge." (Sir Michael Foster, "Annual Report of the Board of Regents of the Smithsonian Institution", 1898)

"The primary basis of all scientific thinking is observation." (Douglas Marsland, "Principles of Modern Biology", 1899)

"To observe is not enough. We must use our observations, and to do that we must generalize." (Henri Poincaré, "Science and Hypothesis", 1902)

"An isolated sensation teaches us nothing, for it does not amount to an observation. Observation is a putting together of several results of sensation which are or are supposed to be connected with each other according to the law of causality, so that some represent causes and others their effects." (Thorvald N Thiele, "Theory of Observations", 1903)

"Man's determination not to be deceived is precisely the origin of the problem of knowledge. The question is always and only this: to learn to know and to grasp reality in the midst of a thousand causes of error which tend to vitiate our observation." (Federigo Enriques, "Problems of Science", 1906)

"An experiment is an observation that can be repeated, isolated and varied. The more frequently you can repeat an observation, the more likely are you to see clearly what is there and to describe accurately what you have seen. The more strictly you can isolate an observation, the easier does your task of observation become, and the less danger is there of your being led astray by irrelevant circumstances, or of placing emphasis on the wrong point. The more widely you can vary an observation, the more clearly will be the uniformity of experience stand out, and the better is your chance of discovering laws." (Edward B Titchener, "A Text-Book of Psychology", 1909)

"Neither logic without observation, nor observation without logic, can move one step in the formation of science." (Alfred N Whitehead, "The Organization of Thought", 1916)

"A discovery is rarely, if ever, a sudden achievement, nor is it the work of one man; a long series of observations, each in turn received in doubt and discussed in hostility, are familiarized by time, and lead at last to the gradual disclosure of truth." (Sir Berkeley Moynihan, "Surgery, Gynecology & Obstetrics" Vol. 31, 1920)

"In the world of natural knowledge, no authority is great enough to support a theory when a crucial observation has shown it to be untenable." (Sir Richard A Gregory, "Discovery; or, The Spirit and Service of Science", 1928)

"The rational concept of probability, which is the only basis of probability calculus, applies only to problems in which either the same event repeats itself again and again, or a great number of uniform elements are involved at the same time. Using the language of physics, we may say that in order to apply the theory of probability we must have a practically unlimited sequence of uniform observations." (Richard von Mises, "Probability, Statistics and Truth", 1928)

"An observation is judged significant, if it would rarely have been produced, in the absence of a real cause of the kind we are seeking. It is a common practice to judge a result significant, if it is of such a magnitude that it would have been produced by chance not more frequently than once in twenty trials. This is an arbitrary, but convenient, level of significance for the practical investigator, but it does not mean that he allows himself to be deceived once in every twenty experiments. The test of significance only tells him what to ignore, namely all experiments in which significant results are not obtained. He should only claim that a phenomenon is experimentally demonstrable when he knows how to design an experiment so that it will rarely fail to give a significant result. Consequently, isolated significant results which he does not know how to reproduce are left in suspense pending further investigation." (Ronald A Fisher, "The Statistical Method in Psychical Research", Proceedings of the Society for Psychical Research 39, 1929)

"Science is but a method. Whatever its material, an observation accurately made and free of compromise to bias and desire, and undeterred by consequence, is science." (Hans Zinsser, "Untheological Reflections", The Atlantic Monthly, 1929)

"Abstraction is the detection of a common quality in the characteristics of a number of diverse observations […] A hypothesis serves the same purpose, but in a different way. It relates apparently diverse experiences, not by directly detecting a common quality in the experiences themselves, but by inventing a fictitious substance or process or idea, in terms of which the experience can be expressed. A hypothesis, in brief, correlates observations by adding something to them, while abstraction achieves the same end by subtracting something." (Herbert Dingle, Science and Human Experience, 1931)

"A scientist, whether theorist or experimenter, puts forward statements, or systems of statements, and tests them step by step. In the field of the empirical sciences, more particularly, he constructs hypotheses, or systems of theories, and tests them against experience by observation and experiment." (Karl Popper, "The Logic of Scientific Discovery", 1934)

"Science is the attempt to discover, by means of observation, and reasoning based upon it, first, particular facts about the world, and then laws connecting facts with one another and (in fortunate cases) making it possible to predict future occurrences." (Bertrand Russell, "Religion and Science, Grounds of Conflict", 1935)

"Starting from statistical observations, it is possible to arrive at conclusions which not less reliable or useful than those obtained in any other exact science. It is only necessary to apply a clear and precise concept of probability to such observations. " (Richard von Mises, "Probability, Statistics, and Truth", 1939)

"Experiment as compared with mere observation has some of the characteristics of cross-examining nature rather than merely overhearing her." (Alan Gregg, "The Furtherance of Medical Research", 1941)

"Science, in the broadest sense, is the entire body of the most accurately tested, critically established, systematized knowledge available about that part of the universe which has come under human observation. For the most part this knowledge concerns the forces impinging upon human beings in the serious business of living and thus affecting man’s adjustment to and of the physical and the social world. […] Pure science is more interested in understanding, and applied science is more interested in control […]" (Austin L Porterfield, "Creative Factors in Scientific Research", 1941)

"We see what we want to see, and observation conforms to hypothesis." (Bergen Evans, "The Natural History of Nonsense", 1947)

"[...] the conception of chance enters in the very first steps of scientific activity in virtue of the fact that no observation is absolutely correct. I think chance is a more fundamental conception that causality; for whether in a concrete case, a cause-effect relation holds or not can only be judged by applying the laws of chance to the observation." (Max Born, 1949)

"Every bit of knowledge we gain and every conclusion we draw about the universe or about any part or feature of it depends finally upon some observation or measurement. Mankind has had again and again the humiliating experience of trusting to intuitive, apparently logical conclusions without observations, and has seen Nature sail by in her radiant chariot of gold in an entirely different direction." (Oliver J Lee, "Measuring Our Universe: From the Inner Atom to Outer Space", 1950)

"Science is an interconnected series of concepts and schemes that have developed as a result of experimentation and observation and are fruitful of further experimentation and observation."(James B Conant, "Science and Common Sense", 1951)

"The stumbling way in which even the ablest of the scientists in every generation have had to fight through thickets of erroneous observations, misleading generalizations, inadequate formulations, and unconscious prejudice is rarely appreciated by those who obtain their scientific knowledge from textbooks." (James B Conant, "Science and Common Sense", 1951)

"[...] no batch of observations, however large, either definitively rejects or definitively fails to reject the hypothesis H0." (Richard B Braithwaite, "Scientific Explanation: A Study of the Function of Theory, Probability and Law in Science", 1953) 

"The methods of science may be described as the discovery of laws, the explanation of laws by theories, and the testing of theories by new observations. A good analogy is that of the jigsaw puzzle, for which the laws are the individual pieces, the theories local patterns suggested by a few pieces, and the tests the completion of these patterns with pieces previously unconsidered." (Edwin P Hubble, "The Nature of Science and Other Lectures", 1954)

"Scientists whose work has no clear, practical implications would want to make their decisions considering such things as: the relative worth of (1) more observations, (2) greater scope of his conceptual model, (3) simplicity, (4) precision of language, (5) accuracy of the probability assignment." (C West Churchman, "Costs, Utilities, and Values", 1956)

"Confidence intervals give a feeling of the uncertainty of experimental evidence, and (very important) give it in the same units [...] as the original observations." (Mary G Natrella, "The relation between confidence intervals and tests of significance", American Statistician 14, 1960)

"No observations are absolutely trustworthy. In no field of observation can we entirely rule out the possibility that an observation is vitiated by a large measurement or execution error. If a reading is found to lie a very long way from its fellows in a series of replicate observations, there must be a suspicion that the deviation is caused by a blunder or gross error of some kind. [...] One sufficiently erroneous reading can wreck the whole of a statistical analysis, however many observations there are." (Francis J Anscombe, "Rejection of Outliers", Technometrics Vol. 2 (2), 1960)

"Observation, reason, and experiment make up what we call the scientific method. (Richard Feynman, "Mainly mechanics, radiation, and heat", 1963)

"As soon as we inquire into the reasons for the phenomena, we enter the domain of theory, which connects the observed phenomena and traces them back to a single ‘pure’ phenomena, thus bringing about a logical arrangement of an enormous amount of observational material." (Georg Joos, "Theoretical Physics", 1968)

"[…] the link between observation and formulation is one of the most difficult and crucial in the scientific enterprise. It is the process of interpreting our theory or, as some say, of ‘operationalizing our concepts’. Our creations in the world of possibility must be fitted in the world of probability; in Kant’s epigram, ‘Concepts without precepts are empty’. It is also the process of relating our observations to theory; to finish the epigram, ‘Precepts without concepts are blind’." (Scott Greer, "The Logic of Social Inquiry", 1969)

"Innocent, unbiased observation is a myth." (Sir Peter B Medawar, Induction and Intuition in Scientific Thought, 1969)

"The advantages of models are, on one hand, that they force us to present a 'complete' theory by which I mean a theory taking into account all relevant phenomena and relations and, on the other hand, the confrontation with observation, that is, reality." (Jan Tinbergen, "The Use of Models: Experience," 1969)

"Science consists simply of the formulation and testing of hypotheses based on observational evidence; experiments are important where applicable, but their function is merely to simplify observation by imposing controlled conditions." (Henry L Batten, "Evolution of the Earth", 1971)

"All perceiving is also thinking, all reasoning is also intuition, all observation is also invention." (Rudolf Arnheim, "Entropy and Art: An Essay on Disorder and Order", 1974)

"No theory ever agrees with all the facts in its domain, yet it is not always the theory that is to blame. Facts are constituted by older ideologies, and a clash between facts and theories may be proof of progress. It is also a first step in our attempt to find the principles implicit in familiar observational notions." (Paul K Feyerabend, "Against Method: Outline of an Anarchistic Theory of Knowledge", 1975)

"The essential function of a hypothesis consists in the guidance it affords to new observations and experiments, by which our conjecture is either confirmed or refuted." (Ernst Mach, "Knowledge and Error: Sketches on the Psychology of Enquiry", 1976)

"After all of this it is a miracle that our models describe anything at all successfully. In fact, they describe many things well: we observe what they have predicted, and we understand what we observe. However, this last act of observation and understanding always eludes physical description." (Yuri I Manin, "Mathematics and Physics", 1981)

"Science is a process. It is a way of thinking, a manner of approaching and of possibly resolving problems, a route by which one can produce order and sense out of disorganized and chaotic observations. Through it we achieve useful conclusions and results that are compelling and upon which there is a tendency to agree." (Isaac Asimov, "‘X’ Stands for Unknown", 1984)

"Science is defined as a set of observations and theories about observations." (F Albert Matsen, "The Role of Theory in Chemistry", Journal of Chemical Education Vol. 62 (5), 1985)

"The only touchstone for empirical truth is experiment and observation." (Heinz Pagels, "Perfect Symmetry: The Search for the Beginning of Time", 1985)

"The model is only a suggestive metaphor, a fiction about the messy and unwieldy observations of the real world. In order for it to be persuasive, to convey a sense of credibility, it is important that it not be too complicated and that the assumptions that are made be clearly in evidence. In short, the model must be simple, transparent, and verifiable." (Edward Beltrami, "Mathematics for Dynamic Modeling", 1987)

"A theory is a good theory if it satisfies two requirements: it must accurately describe a large class of observations on the basis of a model that contains only a few arbitrary elements, and it must make definite predictions about the results of future observations." (Stephen Hawking, "A Brief History of Time: From Big Bang To Black Holes", 1988)

"A law explains a set of observations; a theory explains a set of laws. […] a law applies to observed phenomena in one domain (e.g., planetary bodies and their movements), while a theory is intended to unify phenomena in many domains. […] Unlike laws, theories often postulate unobservable objects as part of their explanatory mechanism." (John L Casti, "Searching for Certainty: How Scientists Predict the Future", 1990)

"A model is often judged by how well it 'explains' some observations. There need not be a unique model for a particular situation, nor need a model cover every possible special case. A model is not reality, it merely helps to explain some of our impressions of reality. [...] Different models may thus seem to contradict each other, yet we may use both in their appropriate places." (Richard W Hamming, "The Art of Probability for Scientists and Engineers", 1991)

"The ability of a scientific theory to be refuted is the key criterion that distinguishes science from metaphysics. If a theory cannot be refuted, if there is no observation that will disprove it, then nothing can prove it - it cannot predict anything, it is a worthless myth." (Eric Lerner, "The Big Bang Never Happened", 1991)

"It is in the nature of theoretical science that there can be no such thing as certainty. A theory is only ‘true’ for as long as the majority of the scientific community maintain the view that the theory is the one best able to explain the observations." (Jim Baggott, "The Meaning of Quantum Theory", 1992)

"The art of science is knowing which observations to ignore and which are the key to the puzzle." (Edward W Kolb, "Blind Watchers of the Sky", 1996)

"The rate of the development of science is not the rate at which you make observations alone but, much more important, the rate at which you create new things to test." (Richard Feynman, "The Meaning of It All", 1998)

"[…] because observations are all we have, we take them seriously. We choose hard data and the framework of mathematics as our guides, not unrestrained imagination or unrelenting skepticism, and seek the simplest yet most wide-reaching theories capable of explaining and predicting the outcome of today’s and future experiments." (Brian Greene, "The Fabric of the Cosmos", 2004)

"If any observation has been classed as an outlier, the next step should be if possible to infer the cause[...]attention should be given to the possibility that laboratory and data management techniques have been imperfect: improvements and safeguards for the future should be considered." (David Finney, "Calibration Guidelines Challenge Outlier Practices", The American Statistician Vol 60 (4), 2006)

"One cautious approach is represented by Bernoulli’s more conservative outlook. If there are very strong reasons for believing that an observation has suffered an accident that made the value in the data-file thoroughly untrustworthy, then reject it; in the absence of clear evidence that an observation, identified by formal rule as an outlier, is unacceptable then retain it unless there is lack of trust that the laboratory obtaining it is conscientiously operated by able persons who have '[...] taken every care.'" " (David Finney, "Calibration Guidelines Challenge Outlier Practices", The American Statistician Vol 60 (4), 2006)

"Every messy data is messy in its own way - it’s easy to define the characteristics of a clean dataset (rows are observations, columns are variables, columns contain values of consistent types). If you start to look at real life data you’ll see every way you can imagine data being messy (and many that you can’t)!" (Hadley Wickham, "R-help mailing list", 2008)

"A model is a good model if it:1. Is elegant 2. Contains few arbitrary or adjustable elements 3. Agrees with and explains all existing observations 4. Makes detailed predictions about future observations that can disprove or falsify the model if they are not borne out." (Stephen Hawking & Leonard Mlodinow, "The Grand Design", 2010)

"Whatever actually happened, outliers need to be investigated not omitted. Try to understand what caused some observations to be different from the bulk of the observations. If you understand the reasons, you are then in a better position to judge whether the points can legitimately removed from the data set, or whether you’ve just discovered something new and interesting. Never remove a point just because it is weird." (Rob J Hyndman, "Omitting outliers", 2016)

"The Dirty Data Theorem states that 'real world' data tends to come from bizarre and unspecifiable distributions of highly correlated variables and have unequal sample sizes, missing data points, non-independent observations, and an indeterminate number of inaccurately recorded values." (Unknown, Statistically Speaking)

"When the ratio of the largest to smallest observation is large you should question whether the data are being analyzed in the right metric (transformation)." (George E P Box)

02 December 2018

🔭Data Science: Error (Just the Quotes)

"The probable is something which lies midway between truth and error" (Christian Thomasius, "Institutes of Divine Jurisprudence", 1688)

"Knowledge being to be had only of visible and certain truth, error is not a fault of our knowledge, but a mistake of our judgment, giving assent to that which is not true." (John Locke, "An Essay Concerning Human Understanding", 1689)

"The errors of definitions multiply themselves according as the reckoning proceeds; and lead men into absurdities, which at last they see but cannot avoid, without reckoning anew from the beginning." (Thomas Hobbes, "The Moral and Political Works of Thomas Hobbes of Malmesbury", 1750)

"Men are often led into errors by the love of simplicity, which disposes us to reduce things to few principles, and to conceive a greater simplicity in nature than there really is." (Thomas Reid, "Essays on the Intellectual Powers of Man", 1785)

"The orbits of certainties touch one another; but in the interstices there is room enough for error to go forth and prevail." (Johann Wolfgang von Goethe, "Maxims and Reflections", 1833)

"Nothing hurts a new truth more than an old error." (Johann Wolfgang von Goethe, "Sprüche in Prosa", 1840)

"Every detection of what is false directs us towards what is true: every trial exhausts some tempting form of error. Not only so; but scarcely any attempt is entirely a failure; scarcely any theory, the result of steady thought, is altogether false; no tempting form of error is without some latent charm derived from truth." (William Whewell, "Lectures on the History of Moral Philosophy in England", 1852)

"[…] ideas may be both novel and important, and yet, if they are incorrect - if they lack the very essential support of incontrovertible fact, they are unworthy of credence. Without this, a theory may be both beautiful and grand, but must be as evanescent as it is beautiful, and as unsubstantial as it is grand." (George Brewster, "A New Philosophy of Matter", 1858)

"When a power of nature, invisible and impalpable, is the subject of scientific inquiry, it is necessary, if we would comprehend its essence and properties, to study its manifestations and effects. For this purpose simple observation is insufficient, since error always lies on the surface, whilst truth must be sought in deeper regions." (Justus von Liebig," Familiar Letters on Chemistry", 1859)

"As in the experimental sciences, truth cannot be distinguished from error as long as firm principles have not been established through the rigorous observation of facts." (Louis Pasteur, "Étude sur la maladie des vers à soie", 1870)

"It would be an error to suppose that the great discoverer seizes at once upon the truth, or has any unerring method of divining it. In all probability the errors of the great mind exceed in number those of the less vigorous one. Fertility of imagination and abundance of guesses at truth are among the first requisites of discovery; but the erroneous guesses must be many times as numerous as those that prove well founded. The weakest analogies, the most whimsical notions, the most apparently absurd theories, may pass through the teeming brain, and no record remain of more than the hundredth part. […] The truest theories involve suppositions which are inconceivable, and no limit can really be placed to the freedom of hypotheses." (W Stanley Jevons, "The Principles of Science: A Treatise on Logic and Scientific Method", 1877)

"Perfect readiness to reject a theory inconsistent with fact is a primary requisite of the philosophic mind. But it, would be a mistake to suppose that this candour has anything akin to fickleness; on the contrary, readiness to reject a false theory may be combined with a peculiar pertinacity and courage in maintaining an hypothesis as long as its falsity is not actually apparent." (William S Jevons, "The Principles of Science", 1887)

"One is almost tempted to assert that quite apart from its intellectual mission, theory is the most practical thing conceivable, the quintessence of practice as it were, since the precision of its conclusions cannot be reached by any routine of estimating or trial and error; although given the hidden ways of theory, this will hold only for those who walk them with complete confidence." (Ludwig E Boltzmann, "On the Significance of Theories", 1890)

"[…] to kill an error is as good a service as, and sometimes even better than, the establishing of a new truth or fact." (Charles R Darwin, "More Letters of Charles Darwin", Vol 2, 1903)

"Man's determination not to be deceived is precisely the origin of the problem of knowledge. The question is always and only this: to learn to know and to grasp reality in the midst of a thousand causes of error which tend to vitiate our observation." (Federigo Enriques, "Problems of Science", 1906)

"The aim of science is to seek the simplest explanations of complex facts. We are apt to fall into the error of thinking that the facts are simple because simplicity is the goal of our quest. The guiding motto in the life of every natural philosopher should be, ‘Seek simplicity and distrust it’." (Alfred N Whitehead, "The Concept of Nature", 1919)

"Poor statistics may be attributed to a number of causes. There are the mistakes which arise in the course of collecting the data, and there are those which occur when those data are being converted into manageable form for publication. Still later, mistakes arise because the conclusions drawn from the published data are wrong. The real trouble with errors which arise during the course of collecting the data is that they are the hardest to detect." (Alfred R Ilersic, "Statistics", 1959)

"When using estimated figures, i.e. figures subject to error, for further calculation make allowance for the absolute and relative errors. Above all, avoid what is known to statisticians as 'spurious' accuracy. For example, if the arithmetic Mean has to be derived from a distribution of ages given to the nearest year, do not give the answer to several places of decimals. Such an answer would imply a degree of accuracy in the results of your calculations which are quite un- justified by the data. The same holds true when calculating percentages." (Alfred R Ilersic, "Statistics", 1959)

"While it is true to assert that much statistical work involves arithmetic and mathematics, it would be quite untrue to suggest that the main source of errors in statistics and their use is due to inaccurate calculations." (Alfred R Ilersic, "Statistics", 1959)

"Errors may also creep into the information transfer stage when the originator of the data is unconsciously looking for a particular result. Such situations may occur in interviews or questionnaires designed to gather original data. Improper wording of the question, or improper voice inflections. and other constructional errors may elicit nonobjective responses. Obviously, if the data is incorrectly gathered, any graph based on that data will contain the original error - even though the graph be most expertly designed and beautifully presented." (Cecil H Meyers, "Handbook of Basic Graphs: A modern approach", 1970)

"One grievous error in interpreting approximations is to allow only good approximations." (Preston C Hammer, "Mind Pollution", Cybernetics, Vol. 14, 1971)

"Thus, the construction of a mathematical model consisting of certain basic equations of a process is not yet sufficient for effecting optimal control. The mathematical model must also provide for the effects of random factors, the ability to react to unforeseen variations and ensure good control despite errors and inaccuracies." (Yakov Khurgin, "Did You Say Mathematics?", 1974)

"A mature science, with respect to the matter of errors in variables, is not one that measures its variables without error, for this is impossible. It is, rather, a science which properly manages its errors, controlling their magnitudes and correctly calculating their implications for substantive conclusions." (Otis D Duncan, "Introduction to Structural Equation Models", 1975)

"Most people like to believe something is or is not true. Great scientists tolerate ambiguity very well. They believe the theory enough to go ahead; they doubt it enough to notice the errors and faults so they can step forward and create the new replacement theory. If you believe too much you'll never notice the flaws; if you doubt too much you won't get started. It requires a lovely balance." (Richard W Hamming, "You and Your Research", 1986) 

"We have found that some of the hardest errors to detect by traditional methods are unsuspected gaps in the data collection (we usually discovered them serendipitously in the course of graphical checking)." (Peter Huber, "Huge data sets", Compstat '94: Proceedings, 1994)

"Humans may crave absolute certainty; they may aspire to it; they may pretend, as partisans of certain religions do, to have attained it. But the history of science - by far the most successful claim to knowledge accessible to humans - teaches that the most we can hope for is successive improvement in our understanding, learning from our mistakes, an asymptotic approach to the Universe, but with the proviso that absolute certainty will always elude us. We will always be mired in error. The most each generation can hope for is to reduce the error bars a little, and to add to the body of data to which error bars apply." (Carl Sagan, "The Demon-Haunted World: Science as a Candle in the Dark", 1995)

"[myth:] Counting can be done without error. Usually, the counted number is an integer and therefore without (rounding) error. However, the best estimate of a scientifically relevant value obtained by counting will always have an error. These errors can be very small in cases of consecutive counting, in particular of regular events, e.g., when measuring frequencies." (Manfred Drosg, "Dealing with Uncertainties: A Guide to Error Analysis", 2007)

"In error analysis the so-called 'chi-squared' is a measure of the agreement between the uncorrelated internal and the external uncertainties of a measured functional relation. The simplest such relation would be time independence. Theory of the chi-squared requires that the uncertainties be normally distributed. Nevertheless, it was found that the test can be applied to most probability distributions encountered in practice." (Manfred Drosg, "Dealing with Uncertainties: A Guide to Error Analysis", 2007)

"[myth:] Random errors can always be determined by repeating measurements under identical conditions. […] this statement is true only for time-related random errors ." (Manfred Drosg, "Dealing with Uncertainties: A Guide to Error Analysis", 2007)

"[myth:] Systematic errors can be determined inductively. It should be quite obvious that it is not possible to determine the scale error from the pattern of data values." (Manfred Drosg, "Dealing with Uncertainties: A Guide to Error Analysis", 2007)

"What is so unconventional about the statistical way of thinking? First, statisticians do not care much for the popular concept of the statistical average; instead, they fixate on any deviation from the average. They worry about how large these variations are, how frequently they occur, and why they exist. [...] Second, variability does not need to be explained by reasonable causes, despite our natural desire for a rational explanation of everything; statisticians are frequently just as happy to pore over patterns of correlation. [...] Third, statisticians are constantly looking out for missed nuances: a statistical average for all groups may well hide vital differences that exist between these groups. Ignoring group differences when they are present frequently portends inequitable treatment. [...] Fourth, decisions based on statistics can be calibrated to strike a balance between two types of errors. Predictably, decision makers have an incentive to focus exclusively on minimizing any mistake that could bring about public humiliation, but statisticians point out that because of this bias, their decisions will aggravate other errors, which are unnoticed but serious. [...] Finally, statisticians follow a specific protocol known as statistical testing when deciding whether the evidence fits the crime, so to speak. Unlike some of us, they don’t believe in miracles. In other words, if the most unusual coincidence must be contrived to explain the inexplicable, they prefer leaving the crime unsolved." (Kaiser Fung, "Numbers Rule the World", 2010) 

"A key difference between a traditional statistical problems and a time series problem is that often, in time series, the errors are not independent." (DeWayne R Derryberry, "Basic data analysis for time series with R", 2014)

 "A wide variety of statistical procedures (regression, t-tests, ANOVA) require three assumptions: (i) Normal observations or errors. (ii) Independent observations (or independent errors, which is equivalent, in normal linear models to independent observations). (iii) Equal variance - when that is appropriate (for the one-sample t-test, for example, there is nothing being compared, so equal variances do not apply).(DeWayne R Derryberry, "Basic data analysis for time series with R", 2014)

"If the observations/errors are not independent, the statistical formulations are completely unreliable unless corrections can be made.(DeWayne R Derryberry, "Basic data analysis for time series with R", 2014)

"Once a model has been fitted to the data, the deviations from the model are the residuals. If the model is appropriate, then the residuals mimic the true errors. Examination of the residuals often provides clues about departures from the modeling assumptions. Lack of fit - if there is curvature in the residuals, plotted versus the fitted values, this suggests there may be whole regions where the model overestimates the data and other whole regions where the model underestimates the data. This would suggest that the current model is too simple relative to some better model.(DeWayne R Derryberry, "Basic data analysis for time series with R", 2014)

 "The random element in most data analysis is assumed to be white noise - normal errors independent of each other. In a time series, the errors are often linked so that independence cannot be assumed (the last examples). Modeling the nature of this dependence is the key to time series.(DeWayne R Derryberry, "Basic data analysis for time series with R", 2014)

"When data is not normal, the reason the formulas are working is usually the central limit theorem. For large sample sizes, the formulas are producing parameter estimates that are approximately normal even when the data is not itself normal. The central limit theorem does make some assumptions and one is that the mean and variance of the population exist. Outliers in the data are evidence that these assumptions may not be true. Persistent outliers in the data, ones that are not errors and cannot be otherwise explained, suggest that the usual procedures based on the central limit theorem are not applicable.(DeWayne R Derryberry, "Basic data analysis for time series with R", 2014)

"Bias is error from incorrect assumptions built into the model, such as restricting an interpolating function to be linear instead of a higher-order curve. [...] Errors of bias produce underfit models. They do not fit the training data as tightly as possible, were they allowed the freedom to do so. In popular discourse, I associate the word 'bias' with prejudice, and the correspondence is fairly apt: an apriori assumption that one group is inferior to another will result in less accurate predictions than an unbiased one. Models that perform lousy on both training and testing data are underfit." (Steven S Skiena, "The Data Science Design Manual", 2017)

"Repeated observations of the same phenomenon do not always produce the same results, due to random noise or error. Sampling errors result when our observations capture unrepresentative circumstances, like measuring rush hour traffic on weekends as well as during the work week. Measurement errors reflect the limits of precision inherent in any sensing device. The notion of signal to noise ratio captures the degree to which a series of observations reflects a quantity of interest as opposed to data variance. As data scientists, we care about changes in the signal instead of the noise, and such variance often makes this problem surprisingly difficult." (Steven S Skiena, "The Data Science Design Manual", 2017)

"Variance is error from sensitivity to fluctuations in the training set. If our training set contains sampling or measurement error, this noise introduces variance into the resulting model. [...] Errors of variance result in overfit models: their quest for accuracy causes them to mistake noise for signal, and they adjust so well to the training data that noise leads them astray. Models that do much better on testing data than training data are overfit." (Steven S Skiena, "The Data Science Design Manual", 2017)

"Machine learning bias is typically understood as a source of learning error, a technical problem. […] Machine learning bias can introduce error simply because the system doesn’t 'look' for certain solutions in the first place. But bias is actually necessary in machine learning - it’s part of learning itself." (Erik J Larson, "The Myth of Artificial Intelligence: Why Computers Can’t Think the Way We Do", 2021)

Related Posts Plugin for WordPress, Blogger...

About Me

My photo
Koeln, NRW, Germany
IT Professional with more than 24 years experience in IT in the area of full life-cycle of Web/Desktop/Database Applications Development, Software Engineering, Consultancy, Data Management, Data Quality, Data Migrations, Reporting, ERP implementations & support, Team/Project/IT Management, etc.