SQL Troubles: errors

Showing posts with label errors. Show all posts

05 August 2025

🤖〽️Prompt Engineering: Copilot Unabridged (Part 72: When Machines Acknowledge Their Boundaries: How AI Can Recognize Its Own Limitations)

Prompt Engineering Series

Prompt: "write a blogpost of 600 words on how AI can recognize its own limitations"

Introduction

Artificial Intelligence (AI) dazzles with its versatility - from composing symphonies to diagnosing diseases - but what happens when machines encounter tasks beyond their reach? Can AI recognize its own limitations? The answer, intriguingly, is yes. Not in the human sense of self-reflection, but through engineered mechanisms that simulate self-awareness.

What Does "Recognizing Limitations" Mean for AI?

In human terms, recognizing a limitation means knowing what we can’t do and adjusting our behavior accordingly. It involves:

Self-awareness
Emotional intelligence
Experience-based introspection

AI doesn’t possess any of these. However, it can still "recognize" limits through:

Pre-programmed constraints
Statistical confidence levels
Self-monitoring systems
Language cues that express uncertainty

While the recognition isn’t conscious, it’s functionally effective - and surprisingly persuasive in conversation.

Built-In Boundaries

Modern AI models come with explicit design guardrails:

Content filters prevent engagement with harmful or sensitive topics.
Knowledge boundaries are maintained by restricting access to certain real-time data (e.g., financial predictions, medical diagnostics).
Model constraints define what the AI should never claim or fabricate, such as pretending to be sentient or giving legal advice.

These boundaries act as digital ethics - code-level boundaries that help AI "know" when to decline or deflect.

Confidence Estimation and Reasoning

AI systems often attach confidence scores to their outputs:

When solving math problems, diagnosing images, or retrieving factual data, the system evaluates how likely its answer is correct.
If confidence falls below a threshold, it may respond with disclaimers like:
This isn’t emotion-driven humility - it’s probability-based caution. Yet to users, it feels like genuine thoughtfulness.

Language That Mirrors Self-Awareness

One of the most powerful illusions of limitation recognition lies in language. Advanced models can say:

"I don’t have personal beliefs."
"That information is beyond my current knowledge."
"I can’t access real-time data."

These phrases aren’t true reflections of awareness. They’re statistical echoes of human disclaimers, trained from billions of conversational examples. The AI doesn’t "know" it’s limited - but it has learned that people expect limitations to be acknowledged, and adapts accordingly.

Error Detection and Feedback Loops

Some AI systems have self-monitoring capabilities:

They compare outputs against known ground truths.
They flag inconsistencies or hallucinations in generated text.
They correct or retract inaccurate answers based on post-processing feedback.

Think of it as a digital conscience - not moral, but methodical. These loops mimic reflection: a kind of pseudo-reasoning where AI revises itself based on performance metrics.

Recognizing Limitations ≠ Understanding Them

To be clear: AI doesn’t understand its limitations. It doesn’t feel frustration or doubt. But it can:

Identify failure patterns
Communicate constraints
Avoid tasks outside defined parameters

This engineered humility makes AI safer, more trustworthy, and easier to collaborate with.

Why This Matters

When AI "recognizes" its limitations, we get:

More ethical interactions (e.g., declining bias-prone questions)
Greater user trust (knowing the machine won’t pretend it knows everything)
Improved transparency in decision-making and data handling

It also compels us to ask deeper questions: If machines can convincingly simulate self-awareness, how do we differentiate introspection from imitation?

Final Thought

AI doesn’t ponder its limits - it performs them. But in that performance, it holds up a mirror not to itself, but to us. Through design, language, and feedback, we’ve taught machines to "know" their bounds - and in doing so, we remind ourselves of our own.

Disclaimer: The whole text was generated by Copilot (under Windows 11) at the first attempt. This is just an experiment to evaluate feature's ability to answer standard general questions, independently on whether they are correctly or incorrectly posed. Moreover, the answers may reflect hallucinations and other types of inconsistent or incorrect reasoning.

Previous Post <<||>> Next Post

19 May 2025

#️⃣Software Engineering: Mea Culpa (Part VIII: A Look Beyond)

Software Engineering Series

With AI on the verge, blogging and bloggers can easily become obsolete. Why bother navigating through the many blogs to get a broader perspective when the same can be obtained with AI? Just type in a prompt of the type "write a blogpost of 600 words on the importance of AI in society" and Copilot or any other similar AI agent will provide you an answer that may look much better than the first draft of most of the bloggers out there! It doesn't matter whether the text follows a well-articulated idea, a personal perspective or something creative! One gets an acceptable answer with a minimum of effort and that's what matters for many.

The results tend to increase in complexity the more models are assembled together, respectively the more uncontrolled are the experiments. Moreover, solutions that tend to work aren't necessarily optimal. Machines can't offer instant enlightenment or anything close to it. Though they have an incomparable processing power of retrieval, association, aggregation, segregation and/or iteration, which coupled with the vast amount of data, information and knowledge can generate anything in just a matter of seconds. Probably, the only area in which humans can compete with machines is creativity and wisdom, though how many will be able to leverage these at scale? Probably, machines have some characteristics that can be associated with these intrinsic human characteristics, though usually more likely the brute computational power will prevail.

At Microsoft Build, Satya Nadella mentioned that foundry encompasses already more than 1900 supported models. In theory, one can still evaluate and test such models adequately. What will happen when the scale increases with a few orders of magnitude? What will happen when for each person there are one or more personalized AI models? AI can help in many areas by generating and evaluating rapidly many plausible alternatives, though as soon the models deal with some kind of processing randomization, the chances for errors increase exponentially (at least in theory).

It's enough for one or more hallucinations or other unexpected behavior to lead to more unexpected behavior. No matter how well a model was tested, as long as there's no stable predictable mathematical model behind it, the chances for something to go wrong increase with the number of inputs, parameters, uses, or changes of context the model deals with. Unfortunately, all these aspects are seldom documented. It's not like using a formula and you know that given a set of inputs and operations, the result is the same. The evolving nature of such models makes them unpredictable in the long term. Therefore, there must always be a way to observe the changes occurring in models.

One of the important questions is how many errors can we afford in such models? How long it takes until errors impact each other to create effects comparable with a tornado. And what if the tornado increases in magnitude to the degree that it wrecks everything that crosses its path? What if multiple tornadoes join forces? How many tornadoes can destroy a field, a country or a continent? How many or big must be the tornadoes to trigger a warning?

Science-Fiction authors love to create apocalyptic scenarios, and all happens in just a few steps, respectively chapters. In nature, usually it takes many orders of magnitude to generate unpredictable behavior. But, as nature often reveals, unpredictable behavior does happen, probably more often than we expect and wish for. The more we are poking the bear, the higher the chances for something unexpected to happen! Do we really want this? What will be the price we must pay for progress?

Previous Post <<||>> Next Post

06 January 2025

💎🏭SQL Reloaded: Microsoft Fabric's SQL Databases (Part VII: Things That Don't Work) 🆕

Microsoft does relatively a good job in documenting what doesn't work in Microsoft Fabric's SQL Databases. There's a good overview available already in the documentation, though beyond this the current post lists my finding while testing the previously written code on this blog,

USE Database

The standard syntax allows to change via USE the database context to the specified database or database snapshot. Unfortunately, this syntax doesn't seem to be supported currently and unfortunately many scripts seem to abuse of it. Thus, the following line of code throws an error:

-- changing the context
USE master;
GO
USE tempdb;

"Msg 40508, Level 16, State 1, Line 1, USE statement is not supported to switch between databases. Use a new connection to connect to a different database"

However, one can use the 3-part naming convention to reference the various objects:

-- sys metadata - retrieving the database files

SELECT *

FROM tempdb.sys.database_files dbf

ORDER BY name;

Even if the tempdb is not listed in the sys.databases table, it's still available for querying, which can prove helpful for troubleshooting.

DBCC commands

The documentation warns that some DBCC commands won't work, though in some cases there are also alternatives. For example:

-- clearing the procedure cache via DBCC
DBCC FREEPROCCACHE;

Output:

"Msg 2571, Level 14, State 9, Line 1, User '<user>' does not have permission to run DBCC freeproccache."

Alternatively, one can use the following command, which seems to work:

-- clearing the procedure cash via ALTER
ALTER DATABASE SCOPED CONFIGURATION CLEAR PROCEDURE_CACHE;

CHECKDB, which checks the logical and physical integrity of all the objects in the specified database, can't be used as well:

-- Checking the logical and physical integrity of a database
DBCC CHECKDB();

Output:

"Msg 916, Level 14, State 2, Line 1, The server principal "..." is not able to access the database "..." under the current security context."

The same error message is received for CHECKTABLE, utility which checks the integrity of all the pages and structures that make up the table (or indexed view):

-- checking a table's integrity
DBCC CHECKTABLE ('SalesLT.Address');

Output:

"Msg 916, Level 14, State 2, Line 2, The server principal "..." is not able to access the database "..." under the current security context."

A similar error messages is received for SQLPERF, which provides transaction log space usage statistics for all databases:

-- retrieving the LOGSPACE information for all databases
DBCC SQLPERF (LOGSPACE);

Output:

"Msg 297, Level 16, State 10, Line 1, The user does not have permission to perform this action."

There are however DBCC commands like SHOW_STATISTICS or SHRINKDATABASE which do work.

-- current query optimization statistics
DBCC SHOW_STATISTICS('SalesLT.Address','PK_Address_AddressID');

Output:

Name	Updated	Rows	Rows Sampled	Steps	Density	Average key length	String Index	Filter Expression	Unfiltered Rows	Persisted Sample Percent
PK_Address_AddressID	Dec 21 2024 3:02AM	450	450	197	1	4	NO		450	0

SHRINKDATABASE shrinks the size of the data and log files in the specified database:

-- shrinking database
DBCC SHRINKDATABASE([AdventureWorks01-...]) WITH NO_INFOMSGS;

Update 29-Jan-2025: According to an answer from Ask the Expert session on Fabric Database [3], Microsoft seems to be working in bringing more DBCC features to SQL databases.

Happy coding!

Previous Post <<||>> Next Post

References:
[1] Microsoft Learn (2024) SQL Server: USE <database> [link]
[2] Microsoft Learn (2024) Database console commands [link]
[3] Microsoft Reactor (2025) Ask The Expert - Fabric Edition - Fabric Databases [link]

04 April 2021

💼Project Management: Lean Management (Part I: Between Value and Waste I - An Introduction)

Independently on whether Lean Management is considered in the context of Manufacturing, Software Development (SD), Project Management (PM) or any other business-related areas, there are three fundamental business concepts on which the whole scaffolding of the Lean philosophies is built upon, namely the ones of value, value stream and waste.

From an economic standpoint, value refers to the monetary worth of a product, asset or service (further referred as product) to an organization, while from a qualitative perspective, it refers to the perceived benefit associated with its usage. The value is thus reflected in the costs associated with a product’s delivery (producer’s perspective), respectively the price paid on acquiring it and the degree to which the product can fulfill a demand (customer’s perspective).

Without diving too deep into theory of product valuation, the challenges revolve around reducing the costs associated with a product’s delivery, respectively selling it to a price the customer is willing to pay for, typically to address a given set of needs. Moreover, the customer is willing to pay only for the functions that satisfy the needs a product is thought to cover. From this friction of opposing driving forces, a product is designed and valued.

The value stream is the sequence of activities (also steps or processes) needed to deliver a product to customers. This formulation includes value-added and non-value-added activities, internal and external customers, respectively covers the full lifecycle of products and/or services in whatever form it occurs, either if is or not perceived by the customers.

Waste is any activity that consumes resources but creates no value for the customers or, generally, for the stakeholders, be it internal or external. The waste is typically associated with the non-added value activities, activities that don’t produce value for stakeholders, and can increase directly or indirectly the costs of products especially when no attention is given to it and/or not recognized as such. Therefore, eliminating the waste can have an important impact on products’ costs and become one of the goals of Lean Management. Moreover, eliminating the waste is an incremental process that, when put in the context of continuous improvement, can lead to processes redesign and re-engineering.

Taiichi Ohno, the ‘father’ of the Toyota Production System (TPS), originally identified seven forms of waste (Japanese: muda): overproduction, waiting, transporting, inappropriate processing, unnecessary inventory, unnecessary/excess motion, and defects. Within the context of SD and PM, Tom and Marry Poppendieck [1] translated the types of wastes in concepts closer to the language of software developers: partially done work, extra processes, extra features, task switching, waiting, motion and, of course, defects. To this list were added later further types of waste associated with resources, confusion and work conditions.

Defects in form of errors and bugs, ineffective communication, rework and overwork, waiting, repetitive activities like handoffs or even unnecessary meetings are usually the visible part of products and projects and important from the perspective of stakeholders, which in extremis can become sensitive when their volume increases out of proportion.

Unfortunately, lurking in the deep waters of projects and wrecking everything that stands in their way are the other forms of waste less perceivable from stakeholders’ side: unclear requirements/goals, code not released or not tested, specifications not implemented, scrapped code, overutilized/underutilized resources, bureaucracy, suboptimal processes, unnecessary optimization, searching for information, mismanagement, task switching, improper work condition, confusion, to mention just the important activities associated to waste.

Through their elusive nature, independently on whether they are or not visible to stakeholders, they all impact the costs of projects and products when the proper attention is not given to them and not handled accordingly.

References:
[1] Mary Poppendieck & Tom Poppendieck (2003) Lean Software Development: An Agile Toolkit, Addison Wesley, ISBN: 0-321-15078-3

29 April 2019

🗄️Data Management: Data Integration (Part I: From Disintegration to Integration)

Data Management Series

No matter how tight the integration between the various systems or processes there will be always gaps that need to be addressed in one way or another. The problems are in general caused by design errors rooted in the complexity of the logic from the integration layer or from the systems integrated. The errors can range from missing or incorrect validation rules, mappings and parameters to data quality issues.

A unidirectional integration involves distributing data from one system (aka publisher) to one or more systems (aka subscribers), while in bidirectional integrations systems can act as publishers and subscribers, resulting thus complex data flows with multiple endpoints. In simplest integrations the records flow one-to-one between systems, though more complex scenarios can involve logic based on business rules, mappings and other type of transformations. The challenge is to reflect the states as needed by the system with minimal involvement from the users.

Typically, it falls in application/process owners or key users’ responsibilities to make sure that the integration works smoothly. When the integration makes use of interface or staging tables they can be used as starting point for the troubleshooting, however even then the troubleshooting can be troublesome and involve a considerable manual effort. When possible the data can be exported manually from the various systems and matched in Excel or similar solutions. This leads often to personal or departmental solutions hard to maintain, control and support.

A better approach is to automatize the process by importing the data from the integrated systems at regular points in time into the same database (much like in a data warehouse), model the entities and the needed logic in there, and report the differences. Even if this approach involves a small investment in the beginning and some optimization in logic or performance over time, it can become a useful tool for troubleshooting the differences. Such solutions can be used successfully in multiple integration scenarios (e.g. web shop or ERP integrations).

A set of reports for each entity can help identify the differences between the various entities. Starting from the reported differences the users can identify, categorize and devise specific countermeasures for the various issues. The best time to have such a solution is shortly before or during UAT. This would allow to make sure that the integration layer really works, and helps correcting the issues as long they still have a small impact on the systems. Some integration issues might even lead to a postponement of the Go-Live. The second best time is during the time the first important issues were found, as the issues can be used as support for a Business Case for implementing this type of solutions.

In general, it’s recommended to fix the problems in the integration layer and use the reports only for troubleshooting and for assuring that the integration runs smoothly. There are however situations in which the integration problems can’t be fixed without creating more issues. It’s the case in which multiple systems are involved and integrated over an integration bus.

One extreme approach, not advisable though, is to build a second integration to correct the issues of the first. This solution might work in theory however there’s the risk of multiplying the issues is really high and the complexity of troubleshooting increases with the degree of dependency between the two integrations. It would be more advisable to rebuild the integration anew, however also this approach has its advantages and disadvantages.

Bottom line is that integration issues should be addressed while they are small and that an automated solution for comparing the data can help in the process

21 April 2019

#️⃣Software Engineering: Programming (Part VIII: Pair Programming)

Software Engineering Series

“Two heads are better than one” – a proverb whose wisdom is embraced today in the various forms of harnessing the collective intelligence. The use of groups in problem solving is based on principles like “the collective is more than the sum of its individuals” or that “the crowds are better on average at estimations than the experts”. All well and good, based on the rationality of the same proverb has been advanced the idea of having two developers working together on the same piece of code – one doing the programming while the other looks over the shoulder as a observer or navigator (whatever that means), reviewing each line of code as it is written, strategizing or simply being there.

This approach is known as pair programming and considered as an agile software development technique, adhering thus to the agile principles (see the agile manifesto). Beyond some intangible benefits, its intent is to reduce the volume of defects in software and thus ensure an acceptable quality of the deliverables. It’s also an extreme approach of the pear review concept.

Without considering whether pair programming adheres to the agile principles, the concept has several big loopholes. The first time I read about pair programming it took me some time to digest the idea – I was asking myself what programmer will do that on a daily basis, watching as other programmers code or being watched while coding, each line of code being followed by questions, affirmative or negative nodding… Beyond their statute of being lone wolves, programmers can cooperate when the tasks ahead requires it, however to ask a programmer watch actively as others program it won’t work on the long run!

Talking from my own experience as programmer and of a professional working together with other programmers, I know that a programmer sees each task as a challenge, a way of learning, of reaching beyond his own condition. Programming is a way of living, with its pluses and minuses.

Moreover, the complexity of the tasks doesn’t resume at handling the programming language but of resolving the right problem. Solving the right problem is not something that can one overcome with brute force but with intelligence. If using the programming language is the challenge then the problem lies somewhere else and other countermeasures must be taken!

Some studies have identified that the use of pair programming led to a reduction of defects in software, however the numbers are misleading as long they compare apples with pears. To statistically conclude that one method is better than the other means doing the same experiment with the different methods using a representative population. Unless one addressees the requirements of statistics the numbers advanced are just fiction!

Just think again about the main premise! One doubles the expenditure for a theoretical reduction of the defects?! Actually, it's more than double considering that different types of communication takes place. Without a proven basis the effort can be somewhere between 2.2 and 2.5 and for an average project this can be a lot! The costs might be bearable in situations in which the labor is cheap, however programmers’ cooperation is a must.

The whole concept of pair programming seems like a bogus idea, just like two drivers driving the same car! This approach might work when the difference in experience and skills between developers is considerable, that being met in universities or apprenticeship environments, in which the accent is put on learning and forming. It might work on handling complex tasks as some adepts declare, however even then is less likely that the average programmer will willingly do it!

Previous Post <<||>> Next Post

07 January 2019

🤝Governance: Accountability (Just the Quotes)

"To hold a group or individual accountable for activities of any kind without assigning to him or them the necessary authority to discharge that responsibility is manifestly both unsatisfactory and inequitable. It is of great Importance to smooth working that at all levels authority and responsibility should be coterminous and coequal." (Lyndall Urwick, "Dynamic Administration", 1942)

"Complete accountability is established and enforced throughout; and if there there is any error committed, it will be discovered on a comparison with the books and can be traced to its source." (Alfred D Chandler Jr, "The Visible Hand", 1977)

"If responsibility - and particularly accountability - is most obviously upwards, moral responsibility also reaches downwards. The commander has a responsibility to those whom he commands. To forget this is to vitiate personal integrity and the ethical validity of the system." (Roger L Shinn, "Military Ethics", 1987)

"Perhaps nothing in our society is more needed for those in positions of authority than accountability." (Larry Burkett, "Business By The Book: Complete Guide of Biblical Principles for the Workplace", 1990)

"Corporate governance is concerned with holding the balance between economic and social goals and between individual and communal goals. The governance framework is there to encourage the efficient use of resources and equally to require accountability for the stewardship of those resources. The aim is to align as nearly as possible the interests of individuals, corporations and society." (Dominic Cadbury, "UK, Commission Report: Corporate Governance", 1992)

"Accountability is essential to personal growth, as well as team growth. How can you improve if you're never wrong? If you don't admit a mistake and take responsibility for it, you're bound to make the same one again." (Pat Summitt, "Reach for the Summit", 1999)

"Responsibility equals accountability equals ownership. And a sense of ownership is the most powerful weapon a team or organization can have." (Pat Summitt, "Reach for the Summit", 1999)

"There's not a chance we'll reach our full potential until we stop blaming each other and start practicing personal accountability." (John G Miller, "QBQ!: The Question Behind the Question", 2001)

"Democracy is not about trust; it is about distrust. It is about accountability, exposure, open debate, critical challenge, and popular input and feedback from the citizenry." (Michael Parenti, "Superpatriotism", 2004)

"No individual can achieve worthy goals without accepting accountability for his or her own actions." (Dan Miller, "No More Dreaded Mondays", 2008)

"In putting together your standards, remember that it is essential to involve your entire team. Standards are not rules issued by the boss; they are a collective identity. Remember, standards are the things that you do all the time and the things for which you hold one another accountable." (Mike Krzyzewski, "The Gold Standard: Building a World-Class Team", 2009)

"Nobody can do everything well, so learn how to delegate responsibility to other winners and then hold them accountable for their decisions." (George Foreman, "Knockout Entrepreneur: My Ten-Count Strategy for Winning at Business", 2010)

"Failing to hold someone accountable is ultimately an act of selfishness." (Patrick Lencioni, "The Advantage, Enhanced Edition: Why Organizational Health Trumps Everything Else In Business", 2012)

"We cannot have a just society that applies the principle of accountability to the powerless and the principle of forgiveness to the powerful. This is the America in which we currently reside." (Chris Hayes, "Twilight of the Elites: America After Meritocracy", 2012)

"Artificial intelligence is a concept that obscures accountability. Our problem is not machines acting like humans - it's humans acting like machines." (John Twelve Hawks, "Spark", 2014)

"In order to cultivate a culture of accountability, first it is essential to assign it clearly. People ought to clearly know what they are accountable for before they can be held to it. This goes beyond assigning key responsibility areas (KRAs). To be accountable for an outcome, we need authority for making decisions, not just responsibility for execution. It is tempting to refrain from the tricky exercise of explicitly assigning accountability. Executives often hope that their reports will figure it out. Unfortunately, this is easier said than done." (Sriram Narayan, "Agile IT Organization Design: For Digital Transformation and Continuous Delivery", 2015)

"Some hierarchy is essential for the effective functioning of an organization. Eliminating hierarchy has the frequent side effect of slowing down decision making and diffusing accountability." (Sriram Narayan, "Agile IT Organization Design: For Digital Transformation and Continuous Delivery", 2015)

"Accountability makes no sense when it undermines the larger goals of education." (Diane Ravitch, "The Death and Life of the Great American School System", 2016)

"[...] high-accountability teams are characterized by having members that are willing and able to resolve issues within the team. They take responsibility for their own actions and hold each other accountable. They take ownership of resolving disputes and feel empowered to do so without intervention from others. They learn quickly by identifying issues and solutions together, adopting better patterns over time. They are able to work without delay because they don’t need anyone else to resolve problems. Their managers are able to work more strategically without being bogged down by day-to-day conflict resolution." (Morgan Evans, "Engineering Manager's Handbook", 2023)

"In a workplace setting, accountability is the willingness to take responsibility for one’s actions and their outcomes. Accountable team members take ownership of their work, admit their mistakes, and are willing to hold each other accountable as peers." (Morgan Evans, "Engineering Manager's Handbook", 2023)

"Low-accountability teams can be recognized based on their tendency to shift blame, avoid addressing issues within the team, and escalate most problems to their manager. In low-accountability teams, it is difficult to determine the root of problems, failures are met with apathy, and managers have to spend much of their time settling disputes and addressing performance. Members of low-accountability teams believe it is not their role to resolve disputes and instead shift that responsibility up to the manager, waiting for further direction. These teams fall into conflict and avoidance deadlocks, unable to move quickly because they cannot resolve issues within the team."

25 December 2018

🔭Data Science: Trial and Error (Just the Quotes)

"One is almost tempted to assert that quite apart from its intellectual mission, theory is the most practical thing conceivable, the quintessence of practice as it were, since the precision of its conclusions cannot be reached by any routine of estimating or trial and error; although given the hidden ways of theory, this will hold only for those who walk them with complete confidence." (Ludwig E Boltzmann, "On the Significance of Theories", 1890)

"The discoveries in physical science, the triumphs in invention, attest the value of the process of trial and error. In large measure, these advances have been due to experimentation." (Louis Brandeis, "Judicial opinions", 1932)

"We know the laws of trial and error, of large numbers and probabilities. We know that these laws are part of the mathematical and mechanical fabric of the universe, and that they are also at play in biological processes. But, in the name of the experimental method and out of our poor knowledge, are we really entitled to claim that everything happens by chance, to the exclusion of all other possibilities?" (Albert Claude, [Nobel Prize Lecture], 1974)

"The natural as well as the social sciences always start from problems, from the fact that something inspires amazement in us, as the Greek philosophers used to say. To solve these problems, the sciences use fundamentally the same method that common sense employs, the method of trial and error. To be more precise, it is the method of trying out solutions to our problem and then discarding the false ones as erroneous. This method assumes that we work with a large number of experimental solutions. One solution after another is put to the test and eliminated." (Karl R Popper, "All Life is Problem Solving", 1999)

"Heuristics are rules of thumb that help constrain the problem in certain ways (in other words they help you to avoid falling back on blind trial and error), but they don't guarantee that you will find a solution. Heuristics are often contrasted with algorithms that will guarantee that you find a solution - it may take forever, but if the problem is algorithmic you will get there. However, heuristics are also algorithms." (S Ian Robertson, "Problem Solving", 2001)

"Technology is the result of antifragility, exploited by risk-takers in the form of tinkering and trial and error, with nerd-driven design confined to the backstage." (Nassim N Taleb, "Antifragile: Things that gain from disorder", 2012)

"We can simplify the relationships between fragility, errors, and antifragility as follows. When you are fragile, you depend on things following the exact planned course, with as little deviation as possible - for deviations are more harmful than helpful. This is why the fragile needs to be very predictive in its approach, and, conversely, predictive systems cause fragility. When you want deviations, and you don’t care about the possible dispersion of outcomes that the future can bring, since most will be helpful, you are antifragile. Further, the random element in trial and error is not quite random, if it is carried out rationally, using error as a source of information. If every trial provides you with information about what does not work, you start zooming in on a solution - so every attempt becomes more valuable, more like an expense than an error. And of course you make discoveries along the way." (Nassim N Taleb, "Antifragile: Things that gain from disorder", 2012)

"Another crowning achievement of deep learning is its extension to the domain of reinforcement learning. In the context of reinforcement learning, an autonomous agent must learn to perform a task by trial and error, without any guidance from the human operator." (Ian Goodfellow et al, "Deep Learning", 2015)

"A learning algorithm for a robot or a software agent to take actions in an environment so as to maximize the sum of rewards through trial and error." (Tomohiro Yamaguchi et al, "Analyzing the Goal-Finding Process of Human Learning With the Reflection Subtask", 2018)

"Reinforcement learning is also a subset of AI algorithms which creates independent, self-learning systems through trial and error. Any positive action is assigned a reward and any negative action would result in a punishment. Reinforcement learning can be used in training autonomous vehicles where the goal would be obtaining the maximum rewards." (Vijayaraghavan Varadharajan & Akanksha Rajendra Singh, "Building Intelligent Cities: Concepts, Principles, and Technologies", 2021)

"Methodologically, much of modern machine learning practice rests on a variant of trial and error, which we call the train-test paradigm. Practitioners repeatedly build models using any number of heuristics and test their performance to see what works. Anything goes as far as training is concerned, subject only to computational constraints, so long as the performance looks good in testing. Trial and error is sound so long as the testing protocol is robust enough to absorb the pressure placed on it." (Moritz Hardt & Benjamin Recht, "Patterns, Predictions, and Actions: Foundations of Machine Learning", 2022)

22 December 2018

🔭Data Science: Significance (Just the Quotes)

"What the use of P [the significance level] implies, therefore, is that a hypothesis that may be true may be rejected because it has not predicted observable results that have not occurred." (Harold Jeffreys, "Theory of Probability", 1939)

"As usual we may make the errors of I) rejecting the null hypothesis when it is true, II) accepting the null hypothesis when it is false. But there is a third kind of error which is of interest because the present test of significance is tied up closely with the idea of making a correct decision about which distribution function has slipped furthest to the right. We may make the error of III) correctly rejecting the null hypothesis for the wrong reason." (Frederick Mosteller, "A k-Sample Slippage Test for an Extreme Population", The Annals of Mathematical Statistics 19, 1948)

"Errors of the third kind happen in conventional tests of differences of means, but they are usually not considered, although their existence is probably recognized. It seems to the author that there may be several reasons for this among which are 1) a preoccupation on the part of mathematical statisticians with the formal questions of acceptance and rejection of null hypotheses without adequate consideration of the implications of the error of the third kind for the practical experimenter, 2) the rarity with which an error of the third kind arises in the usual tests of significance." (Frederick Mosteller, "A k-Sample Slippage Test for an Extreme Population", The Annals of Mathematical Statistics 19, 1948)

"If significance tests are required for still larger samples, graphical accuracy is insufficient, and arithmetical methods are advised. A word to the wise is in order here, however. Almost never does it make sense to use exact binomial significance tests on such data - for the inevitable small deviations from the mathematical model of independence and constant split have piled up to such an extent that the binomial variability is deeply buried and unnoticeable. Graphical treatment of such large samples may still be worthwhile because it brings the results more vividly to the eye." (Frederick Mosteller & John W Tukey, "The Uses and Usefulness of Binomial Probability Paper?", Journal of the American Statistical Association 44, 1949)

"It will, of course, happen but rarely that the proportions will be identical, even if no real association exists. Evidently, therefore, we need a significance test to reassure ourselves that the observed difference of proportion is greater than could reasonably be attributed to chance. The significance test will test the reality of the association, without telling us anything about the intensity of association. It will be apparent that we need two distinct things: (a) a test of significance, to be used on the data first of all, and (b) some measure of the intensity of the association, which we shall only be justified in using if the significance test confirms that the association is real." (Michael J Moroney, "Facts from Figures", 1951)

"The main purpose of a significance test is to inhibit the natural enthusiasm of the investigator." (Frederick Mosteller, "Selected Quantitative Techniques", 1954)

"Null hypotheses of no difference are usually known to be false before the data are collected [...] when they are, their rejection or acceptance simply reflects the size of the sample and the power of the test, and is not a contribution to science." (I Richard Savage, "Nonparametric Statistics", Journal of the American Statistical Association 52, 1957)

"[...] to make measurements and then ignore their magnitude would ordinarily be pointless. Exclusive reliance on tests of significance obscures the fact that statistical significance does not imply substantive significance." (I Richard Savage, "Nonparametric Statistics", Journal of the American Statistical Association 52, 1957)

"[...] the tests of null hypotheses of zero differences, of no relationships, are frequently weak, perhaps trivial statements of the researcher’s aims [...] in many cases, instead of the tests of significance it would be more to the point to measure the magnitudes of the relationships, attaching proper statements of their sampling variation. The magnitudes of relationships cannot be measured in terms of levels of significance." (Leslie Kish, "Some statistical problems in research design", American Sociological Review 24, 1959)

"There are instances of research results presented in terms of probability values of ‘statistical significance’ alone, without noting the magnitude and importance of the relationships found. These attempts to use the probability levels of significance tests as measures of the strengths of relationships are very common and very mistaken." (Leslie Kish, "Some statistical problems in research design", American Sociological Review 24, 1959)

"The null-hypothesis significance test treats ‘acceptance’ or ‘rejection’ of a hypothesis as though these were decisions one makes. But a hypothesis is not something, like a piece of pie offered for dessert, which can be accepted or rejected by a voluntary physical action. Acceptance or rejection of a hypothesis is a cognitive process, a degree of believing or disbelieving which, if rational, is not a matter of choice but determined solely by how likely it is, given the evidence, that the hypothesis is true." (William W Rozeboom, "The fallacy of the null–hypothesis significance test", Psychological Bulletin 57, 1960)

"The null hypothesis of no difference has been judged to be no longer a sound or fruitful basis for statistical investigation. […] Significance tests do not provide the information that scientists need, and, furthermore, they are not the most effective method for analyzing and summarizing data." (Cherry A Clark, "Hypothesis Testing in Relation to Statistical Methodology", Review of Educational Research Vol. 33, 1963)

"[...] the test of significance has been carrying too much of the burden of scientific inference. It may well be the case that wise and ingenious investigators can find their way to reasonable conclusions from data because and in spite of their procedures. Too often, however, even wise and ingenious investigators [...] tend to credit the test of significance with properties it does not have." (David Bakan, "The test of significance in psychological research", Psychological Bulletin 66, 1966)

"[...] we need to get on with the business of generating [...] hypotheses and proceed to do investigations and make inferences which bear on them, instead of [...] testing the statistical null hypothesis in any number of contexts in which we have every reason to suppose that it is false in the first place." (David Bakan, "The test of significance in psychological research", Psychological Bulletin 66, 1966)

"Science usually amounts to a lot more than blind trial and error. Good statistics consists of much more than just significance tests; there are more sophisticated tools available for the analysis of results, such as confidence statements, multiple comparisons, and Bayesian analysis, to drop a few names. However, not all scientists are good statisticians, or want to be, and not all people who are called scientists by the media deserve to be so described." (Robert Hooke, "How to Tell the Liars from the Statisticians", 1983)

"The idea of statistical significance is valuable because it often keeps us from announcing results that later turn out to be nonresults. A significant result tells us that enough cases were observed to provide reasonable assurance of a real effect. It does not necessarily mean, though, that the effect is big enough to be important." (Robert Hooke, "How to Tell the Liars from the Statisticians", 1983)

"A tendency to drastically underestimate the frequency of coincidences is a prime characteristic of innumerates, who generally accord great significance to correspondences of all sorts while attributing too little significance to quite conclusive but less flashy statistical evidence." (John A Paulos, "Innumeracy: Mathematical Illiteracy and its Consequences", 1988)

"Which I would like to stress are: (1) A significant effect is not necessarily the same thing as an interesting effect. (2) A non-significant effect is not necessarily the same thing as no difference." (Christopher Chatfield, "Problem solving: a statistician’s guide", 1988)

"A little thought reveals a fact widely understood among statisticians: The null hypothesis, taken literally (and that’s the only way you can take it in formal hypothesis testing), is always false in the real world. [...] If it is false, even to a tiny degree, it must be the case that a large enough sample will produce a significant result and lead to its rejection. So if the null hypothesis is always false, what’s the big deal about rejecting it?" (Jacob Cohen,"Things I Have Learned (So Far)", American Psychologist, 1990)

"I do not think that significance testing should be completely abandoned [...] and I don’t expect that it will be. But I urge researchers to provide estimates, with confidence intervals: scientific advance requires parameters with known reliability estimates. Classical confidence intervals are formally equivalent to a significance test, but they convey more information." (Nigel G Yoccoz, "Use, Overuse, and Misuse of Significance Tests in Evolutionary Biology and Ecology", Bulletin of the Ecological Society of America Vol. 72 (2), 1991)

"Rejection of a true null hypothesis at the 0.05 level will occur only one in 20 times. The overwhelming majority of these false rejections will be based on test statistics close to the borderline value. If the null hypothesis is false, the inter-ocular traumatic test ['hit between the eyes'] will often suffice to reject it; calculation will serve only to verify clear intuition." (Ward Edwards et al, "Bayesian Statistical Inference for Psychological Research", 1992)

"Statistical significance testing can involve a tautological logic in which tired researchers, having collected data on hundreds of subjects, then conduct a statistical test to evaluate whether there were a lot of subjects, which the researchers already know, because they collected the data and know they are tired. This tautology has created considerable damage as regards the cumulation of knowledge." (Bruce Thompson, "Two and One-Half Decades of Leadership in Measurement and Evaluation", Journal of Counseling & Development 70 (3), 1992)

"[…] an honest exploratory study should indicate how many comparisons were made […] most experts agree that large numbers of comparisons will produce apparently statistically significant findings that are actually due to chance. The data torturer will act as if every positive result confirmed a major hypothesis. The honest investigator will limit the study to focused questions, all of which make biologic sense. The cautious reader should look at the number of ‘significant’ results in the context of how many comparisons were made." (James L Mills, "Data torturing", New England Journal of Medicine, 1993)

"Graphic misrepresentation is a frequent misuse in presentations to the nonprofessional. The granddaddy of all graphical offenses is to omit the zero on the vertical axis. As a consequence, the chart is often interpreted as if its bottom axis were zero, even though it may be far removed. This can lead to attention-getting headlines about 'a soar' or 'a dramatic rise (or fall)'. A modest, and possibly insignificant, change is amplified into a disastrous or inspirational trend." (Herbert F Spirer et al, "Misused Statistics" 2nd Ed, 1998)

"When significance tests are used and a null hypothesis is not rejected, a major problem often arises - namely, the result may be interpreted, without a logical basis, as providing evidence for the null hypothesis." (David F Parkhurst, "Statistical Significance Tests: Equivalence and Reverse Tests Should Reduce Misinterpretation", BioScience Vol. 51 (12), 2001)

"If you flip a coin three times and it lands on heads each time, it's probably chance. If you flip it a hundred times and it lands on heads each time, you can be pretty sure the coin has heads on both sides. That's the concept behind statistical significance - it's the odds that the correlation (or other finding) is real, that it isn't just random chance." (T Colin Campbell, "The China Study", 2004)

"The dual meaning of the word significant brings into focus the distinction between drawing a mathematical inference and practical inference from statistical results." (Charles Livingston & Paul Voakes, "Working with Numbers and Statistics: A handbook for journalists", 2005)

"A type of error used in hypothesis testing that arises when incorrectly rejecting the null hypothesis, although it is actually true. Thus, based on the test statistic, the final conclusion rejects the Null hypothesis, but in truth it should be accepted. Type I error equates to the alpha (α) or significance level, whereby the generally accepted default is 5%." (Lynne Hambleton, "Treasure Chest of Six Sigma Growth Methods, Tools, and Best Practices", 2007)

"For the study of the topology of the interactions of a complex system it is of central importance to have proper random null models of networks, i.e., models of how a graph arises from a random process. Such models are needed for comparison with real world data. When analyzing the structure of real world networks, the null hypothesis shall always be that the link structure is due to chance alone. This null hypothesis may only be rejected if the link structure found differs significantly from an expectation value obtained from a random model. Any deviation from the random null model must be explained by non-random processes." (Jörg Reichardt, "Structure in Complex Networks", 2009)

"There are three possible reasons for [the] absence of predictive power. First, it is possible that the models are misspecified. Second, it is possible that the model’s explanatory factors are measured at too high a level of aggregation [...] Third, [...] the search for statistically significant relationships may not be the strategy best suited for evaluating our model’s ability to explain real world events [...] the lack of predictive power is the result of too much emphasis having been placed on finding statistically significant variables, which may be overdetermined. Statistical significance is generally a flawed way to prune variables in regression models [...] Statistically significant variables may actually degrade the predictive accuracy of a model [...] [By using] models that are constructed on the basis of pruning undertaken with the shears of statistical significance, it is quite possible that we are winnowing our models away from predictive accuracy." (Michael D Ward et al, "The perils of policy by p-value: predicting civil conflicts" Journal of Peace Research 47, 2010)

"If the group is large enough, even very small differences can become statistically significant." (Victor Cohn & Lewis Cope, "News & Numbers: A writer’s guide to statistics" 3rd Ed, 2012)

"Another way to secure statistical significance is to use the data to discover a theory. Statistical tests assume that the researcher starts with a theory, collects data to test the theory, and reports the results - whether statistically significant or not. Many people work in the other direction, scrutinizing the data until they find a pattern and then making up a theory that fits the pattern." (Gary Smith, "Standard Deviations", 2014)

"These practices - selective reporting and data pillaging - are known as data grubbing. The discovery of statistical significance by data grubbing shows little other than the researcher’s endurance. We cannot tell whether a data grubbing marathon demonstrates the validity of a useful theory or the perseverance of a determined researcher until independent tests confirm or refute the finding. But more often than not, the tests stop there. After all, you won’t become a star by confirming other people’s research, so why not spend your time discovering new theories? The data-grubbed theory consequently sits out there, untested and unchallenged." (Gary Smith, "Standard Deviations", 2014)

"With fast computers and plentiful data, finding statistical significance is trivial. If you look hard enough, it can even be found in tables of random numbers." (Gary Smith, "Standard Deviations", 2014)

"In short, statistical significance does not mean your result has any practical significance. As for statistical insignificance, it doesn’t tell you much. A statistically insignificant difference could be nothing but noise, or it could represent a real effect that can be pinned down only with more data." (Alex Reinhart, "Statistics Done Wrong: The Woefully Complete Guide", 2015)

"Statistical significance is a concept used by scientists and researchers to set an objective standard that can be used to determine whether or not a particular relationship 'statistically' exists in the data. Scientists test for statistical significance to distinguish between whether an observed effect is present in the data (given a high degree of probability), or just due to chance. It is important to note that finding a statistically significant relationship tells us nothing about whether a relationship is a simple correlation or a causal one, and it also can’t tell us anything about whether some omitted factor is driving the result." (John H Johnson & Mike Gluck, "Everydata: The misinformation hidden in the little data you consume every day", 2016)

"Statistical significance refers to the probability that something is true. It’s a measure of how probable it is that the effect we’re seeing is real (rather than due to chance occurrence), which is why it’s typically measured with a p-value. P, in this case, stands for probability. If you accept p-values as a measure of statistical significance, then the lower your p-value is, the less likely it is that the results you’re seeing are due to chance alone." (John H Johnson & Mike Gluck, "Everydata: The misinformation hidden in the little data you consume every day", 2016)

More quotes on "Significance" at the-web-of-knowledge.blogspot.com.

19 December 2018

🔭Data Science: Errors in Statistics (Just the Quotes)

"[It] may be laid down as a general rule that, if the result of a long series of precise observations approximates a simple relation so closely that the remaining difference is undetectable by observation and may be attributed to the errors to which they are liable, then this relation is probably that of nature." (Pierre-Simon Laplace, "Mémoire sur les Inégalites Séculaires des Planètes et des Satellites", 1787)

"It is surprising to learn the number of causes of error which enter into the simplest experiment, when we strive to attain rigid accuracy." (William S Jevons, "The Principles of Science: A Treatise on Logic and Scientific Method", 1874)

"Some of the common ways of producing a false statistical argument are to quote figures without their context, omitting the cautions as to their incompleteness, or to apply them to a group of phenomena quite different to that to which they in reality relate; to take these estimates referring to only part of a group as complete; to enumerate the events favorable to an argument, omitting the other side; and to argue hastily from effect to cause, this last error being the one most often fathered on to statistics. For all these elementary mistakes in logic, statistics is held responsible." (Sir Arthur L Bowley, "Elements of Statistics", 1901)

"If the number of experiments be very large, we may have precise information as to the value of the mean, but if our sample be small, we have two sources of uncertainty: (I) owing to the 'error of random sampling' the mean of our series of experiments deviates more or less widely from the mean of the population, and (2) the sample is not sufficiently large to determine what is the law of distribution of individuals." (William S Gosset, "The Probable Error of a Mean", Biometrika, 1908)

"We know not to what are due the accidental errors, and precisely because we do not know, we are aware they obey the law of Gauss. Such is the paradox." (Henri Poincaré, "The Foundations of Science", 1913)

"No observations are absolutely trustworthy. In no field of observation can we entirely rule out the possibility that an observation is vitiated by a large measurement or execution error. If a reading is found to lie a very long way from its fellows in a series of replicate observations, there must be a suspicion that the deviation is caused by a blunder or gross error of some kind. [...] One sufficiently erroneous reading can wreck the whole of a statistical analysis, however many observations there are." (Francis J Anscombe, "Rejection of Outliers", Technometrics Vol. 2 (2), 1960)

"It might be reasonable to expect that the more we know about any set of statistics, the greater the confidence we would have in using them, since we would know in which directions they were defective; and that the less we know about a set of figures, the more timid and hesitant we would be in using them. But, in fact, it is the exact opposite which is normally the case; in this field, as in many others, knowledge leads to caution and hesitation, it is ignorance that gives confidence and boldness. For knowledge about any set of statistics reveals the possibility of error at every stage of the statistical process; the difficulty of getting complete coverage in the returns, the difficulty of framing answers precisely and unequivocally, doubts about the reliability of the answers, arbitrary decisions about classification, the roughness of some of the estimates that are made before publishing the final results. Knowledge of all this, and much else, in detail, about any set of figures makes one hesitant and cautious, perhaps even timid, in using them." (Ely Devons, "Essays in Economics", 1961)

"The art of using the language of figures correctly is not to be over-impressed by the apparent ai

"Measurement, we have seen, always has an element of error in it. The most exact description or prediction that a scientist can make is still only approximate." (Abraham Kaplan, "The Conduct of Inquiry: Methodology for Behavioral Science", 1964)

"A mature science, with respect to the matter of errors in variables, is not one that measures its variables without error, for this is impossible. It is, rather, a science which properly manages its errors, controlling their magnitudes and correctly calculating their implications for substantive conclusions." (Otis D Duncan, "Introduction to Structural Equation Models", 1975)

"Pencil and paper for construction of distributions, scatter diagrams, and run-charts to compare small groups and to detect trends are more efficient methods of estimation than statistical inference that depends on variances and standard errors, as the simple techniques preserve the information in the original data." (William E Deming, "On Probability as Basis for Action" American Statistician Vol. 29 (4), 1975)

"When the statistician looks at the outside world, he cannot, for example, rely on finding errors that are independently and identically distributed in approximately normal distributions. In particular, most economic and business data are collected serially and can be expected, therefore, to be heavily serially dependent. So is much of the data collected from the automatic instruments which are becoming so common in laboratories these days. Analysis of such data, using procedures such as standard regression analysis which assume independence, can lead to gross error. Furthermore, the possibility of contamination of the error distribution by outliers is always present and has recently received much attention. More generally, real data sets, especially if they are long, usually show inhomogeneity in the mean, the variance, or both, and it is not always possible to randomize." (George E P Box, "Some Problems of Statistics and Everyday Life", Journal of the American Statistical Association, Vol. 74 (365), 1979)

"Under conditions of uncertainty, both rationality and measurement are essential to decision-making. Rational people process information objectively: whatever errors they make in forecasting the future are random errors rather than the result of a stubborn bias toward either optimism or pessimism. They respond to new information on the basis of a clearly defined set of preferences. They know what they want, and they use the information in ways that support their preferences." (Peter L Bernstein, "Against the Gods: The Remarkable Story of Risk", 1996)

"Linear regression assumes that in the population a normal distribution of error values around the predicted Y is associated with each X value, and that the dispersion of the error values for each X value is the same. The assumptions imply normal and similarly dispersed error distributions." (Fred C Pampel, "Linear Regression: A primer", 2000)

"Compound errors can begin with any of the standard sorts of bad statistics - a guess, a poor sample, an inadvertent transformation, perhaps confusion over the meaning of a complex statistic. People inevitably want to put statistics to use, to explore a number's implications. [...] The strengths and weaknesses of those original numbers should affect our confidence in the second-generation statistics." (Joel Best, "Damned Lies and Statistics: Untangling Numbers from the Media, Politicians, and Activists", 2001)

"Trimming potentially theoretically meaningful variables is not advisable unless one is quite certain that the coefficient for the variable is near zero, that the variable is inconsequential, and that trimming will not introduce misspecification error." (James Jaccard, "Interaction Effects in Logistic Regression", 2001)

"The central limit theorem says that, under conditions almost always satisfied in the real world of experimentation, the distribution of such a linear function of errors will tend to normality as the number of its components becomes large. The tendency to normality occurs almost regardless of the individual distributions of the component errors. An important proviso is that several sources of error must make important contributions to the overall error and that no particular source of error dominate the rest." (George E P Box et al, "Statistics for Experimenters: Design, discovery, and innovation" 2nd Ed., 2005)

"Two things explain the importance of the normal distribution: (1) The central limit effect that produces a tendency for real error distributions to be 'normal like'. (2) The robustness to nonnormality of some common statistical procedures, where 'robustness' means insensitivity to deviations from theoretical normality." (George E P Box et al, "Statistics for Experimenters: Design, discovery, and innovation" 2nd Ed., 2005)

"There are many ways for error to creep into facts and figures that seem entirely straightforward. Quantities can be miscounted. Small samples can fail to accurately reflect the properties of the whole population. Procedures used to infer quantities from other information can be faulty. And then, of course, numbers can be total bullshit, fabricated out of whole cloth in an effort to confer credibility on an otherwise flimsy argument. We need to keep all of these things in mind when we look at quantitative claims. They say the data never lie - but we need to remember that the data often mislead." (Carl T Bergstrom & Jevin D West, "Calling Bullshit: The Art of Skepticism in a Data-Driven World", 2020)

"Always expect to find at least one error when you proofread your own statistics. If you don’t, you are probably making the same mistake twice." (Cheryl Russell)

[Murphy’s Laws of Analysis:] "(1) In any collection of data, the figures that are obviously correct contain errors. (2) It is customary for a decimal to be misplaced. (3) An error that can creep into a calculation, will. Also, it will always be in the direction that will cause the most damage to the calculation." (G C Deakly)

09 December 2018

🔭Data Science: Failure (Just the Quotes)

"Every detection of what is false directs us towards what is true: every trial exhausts some tempting form of error. Not only so; but scarcely any attempt is entirely a failure; scarcely any theory, the result of steady thought, is altogether false; no tempting form of error is without some latent charm derived from truth." (William Whewell, "Lectures on the History of Moral Philosophy in England", 1852)

"Scarcely any attempt is entirely a failure; scarcely any theory, the result of steady thought, is altogether false; no tempting form of Error is without some latent charm derived from Truth." (William Whewell, "Lectures on the History of Moral Philosophy in England", 1852)

"We learn wisdom from failure much more than from success. We often discover what will do, by finding out what will not do; and probably he who never made a mistake never made a discovery." (Samuel Smiles, "Facilities and Difficulties", 1859)

"[…] the statistical prediction of the future from the past cannot be generally valid, because whatever is future to any given past, is in tum past for some future. That is, whoever continually revises his judgment of the probability of a statistical generalization by its successively observed verifications and failures, cannot fail to make more successful predictions than if he should disregard the past in his anticipation of the future. This might be called the ‘Principle of statistical accumulation’." (Clarence I Lewis, "Mind and the World-Order: Outline of a Theory of Knowledge", 1929)

"Science condemns itself to failure when, yielding to the infatuation of the serious, it aspires to attain being, to contain it, and to possess it; but it finds its truth if it considers itself as a free engagement of thought in the given, aiming, at each discovery, not at fusion with the thing, but at the possibility of new discoveries; what the mind then projects is the concrete accomplishment of its freedom." (Simone de Beauvoir, "The Ethics of Ambiguity", 1947)

"Common sense […] may be thought of as a series of concepts and conceptual schemes which have proved highly satisfactory for the practical uses of mankind. Some of those concepts and conceptual schemes were carried over into science with only a little pruning and whittling and for a long time proved useful. As the recent revolutions in physics indicate, however, many errors can be made by failure to examine carefully just how common sense ideas should be defined in terms of what the experimenter plans to do." (James B Conant, "Science and Common Sense", 1951)

"Catastrophes are often stimulated by the failure to feel the emergence of a domain, and so what cannot be felt in the imagination is experienced as embodied sensation in the catastrophe. (William I Thompson, "Gaia, a Way of Knowing: Political Implications of the New Biology", 1987)

"What about confusing clutter? Information overload? Doesn't data have to be ‘boiled down’ and ‘simplified’? These common questions miss the point, for the quantity of detail is an issue completely separate from the difficulty of reading. Clutter and confusion are failures of design, not attributes of information." (Edward R Tufte, "Envisioning Information", 1990)

"When a system is predictable, it is already performing as consistently as possible. Looking for assignable causes is a waste of time and effort. Instead, you can meaningfully work on making improvements and modifications to the process. When a system is unpredictable, it will be futile to try and improve or modify the process. Instead you must seek to identify the assignable causes which affect the system. The failure to distinguish between these two different courses of action is a major source of confusion and wasted effort in business today." (Donald J Wheeler, "Understanding Variation: The Key to Managing Chaos" 2nd Ed., 2000)

"[…] in cybernetics, control is seen not as a function of one agent over something else, but as residing within circular causal networks, maintaining stabilities in a system. Circularities have no beginning, no end and no asymmetries. The control metaphor of communication, by contrast, punctuates this circularity unevenly. It privileges the conceptions and actions of a designated controller by distinguishing between messages sent in order to cause desired effects and feedback that informs the controller of successes or failures." (Klaus Krippendorff, "On Communicating: Otherness, Meaning, and Information", 2009)

"To get a true understanding of the work of mathematicians, and the need for proof, it is important for you to experiment with your own intuitions, to see where they lead, and then to experience the same failures and sense of accomplishment that mathematicians experienced when they obtained the correct results. Through this, it should become clear that, when doing any level of mathematics, the roads to correct solutions are rarely straight, can be quite different, and take patience and persistence to explore." (Alan Sultan & Alice F Artzt, "The Mathematics that every Secondary School Math Teacher Needs to Know", 2011)

"A very different - and very incorrect - argument is that successes must be balanced by failures (and failures by successes) so that things average out. Every coin flip that lands heads makes tails more likely. Every red at roulette makes black more likely. […] These beliefs are all incorrect. Good luck will certainly not continue indefinitely, but do not assume that good luck makes bad luck more likely, or vice versa." (Gary Smith, "Standard Deviations", 2014)

"We are seduced by patterns and we want explanations for these patterns. When we see a string of successes, we think that a hot hand has made success more likely. If we see a string of failures, we think a cold hand has made failure more likely. It is easy to dismiss such theories when they involve coin flips, but it is not so easy with humans. We surely have emotions and ailments that can cause our abilities to go up and down. The question is whether these fluctuations are important or trivial." (Gary Smith, "Standard Deviations", 2014)

"Although cascading failures may appear random and unpredictable, they follow reproducible laws that can be quantified and even predicted using the tools of network science. First, to avoid damaging cascades, we must understand the structure of the network on which the cascade propagates. Second, we must be able to model the dynamical processes taking place on these networks, like the flow of electricity. Finally, we need to uncover how the interplay between the network structure and dynamics affects the robustness of the whole system." (Albert-László Barabási, "Network Science", 2016)

More quotes in "Failure" at the-web-of-knowledge.blogspot.com.

🔭Data Science: Distributions (Just the Quotes)

"The problems which arise in the reduction of data may thus conveniently be divided into three types: (i) Problems of Specification, which arise in the choice of the mathematical form of the population. (ii) When a specification has been obtained, problems of Estimation arise. These involve the choice among the methods of calculating, from our sample, statistics fit to estimate the unknow n parameters of the population. (iii) Problems of Distribution include the mathematical deduction of the exact nature of the distributions in random samples of our estimates of the parameters, and of other statistics designed to test the validity of our specification (tests of Goodness of Fit)." (Sir Ronald A Fisher, "Statistical Methods for Research Workers", 1925)

"An inference, if it is to have scientific value, must constitute a prediction concerning future data. If the inference is to be made purely with the help of the distribution theory of statistics, the experiments that constitute evidence for the inference must arise from a state of statistical control; until that state is reached, there is no universe, normal or otherwise, and the statistician’s calculations by themselves are an illusion if not a delusion. The fact is that when distribution theory is not applicable for lack of control, any inference, statistical or otherwise, is little better than a conjecture. The state of statistical control is therefore the goal of all experimentation. (William E Deming, "Statistical Method from the Viewpoint of Quality Control", 1939)

"Normality is a myth; there never was, and never will be, a normal distribution. This is an overstatement from the practical point of view, but it represents a safer initial mental attitude than any in fashion during the past two decades." (Roy C Geary, "Testing for Normality", Biometrika Vol. 34, 1947)

"A good estimator will be unbiased and will converge more and more closely (in the long run) on the true value as the sample size increases. Such estimators are known as consistent. But consistency is not all we can ask of an estimator. In estimating the central tendency of a distribution, we are not confined to using the arithmetic mean; we might just as well use the median. Given a choice of possible estimators, all consistent in the sense just defined, we can see whether there is anything which recommends the choice of one rather than another. The thing which at once suggests itself is the sampling variance of the different estimators, since an estimator with a small sampling variance will be less likely to differ from the true value by a large amount than an estimator whose sampling variance is large." (Michael J Moroney, "Facts from Figures", 1951)

"Some distributions [...] are symmetrical about their central value. Other distributions have marked asymmetry and are said to be skew. Skew distributions are divided into two types. If the 'tail' of the distribution reaches out into the larger values of the variate, the distribution is said to show positive skewness; if the tail extends towards the smaller values of the variate, the distribution is called negatively skew." (Michael J Moroney, "Facts from Figures", 1951)

"[A] sequence is random if it has every property that is shared by all infinite sequences of independent samples of random variables from the uniform distribution." (Joel N Franklin, 1962)

"Mathematical statistics provides an exceptionally clear example of the relationship between mathematics and the external world. The external world provides the experimentally measured distribution curve; mathematics provides the equation (the mathematical model) that corresponds to the empirical curve. The statistician may be guided by a thought experiment in finding the corresponding equation." (Marshall J Walker, "The Nature of Scientific Thought", 1963)

"At the heart of probabilistic statistical analysis is the assumption that a set of data arises as a sample from a distribution in some class of probability distributions. The reasons for making distributional assumptions about data are several. First, if we can describe a set of data as a sample from a certain theoretical distribution, say a normal distribution (also called a Gaussian distribution), then we can achieve a valuable compactness of description for the data. For example, in the normal case, the data can be succinctly described by giving the mean and standard deviation and stating that the empirical (sample) distribution of the data is well approximated by the normal distribution. A second reason for distributional assumptions is that they can lead to useful statistical procedures. For example, the assumption that data are generated by normal probability distributions leads to the analysis of variance and least squares. Similarly, much of the theory and technology of reliability assumes samples from the exponential, Weibull, or gamma distribution. A third reason is that the assumptions allow us to characterize the sampling distribution of statistics computed during the analysis and thereby make inferences and probabilistic statements about unknown aspects of the underlying distribution. For example, assuming the data are a sample from a normal distribution allows us to use the t-distribution to form confidence intervals for the mean of the theoretical distribution. A fourth reason for distributional assumptions is that understanding the distribution of a set of data can sometimes shed light on the physical mechanisms involved in generating the data." (John M Chambers et al, "Graphical Methods for Data Analysis", 1983)

"Equal variability is not always achieved in plots. For instance, if the theoretical distribution for a probability plot has a density that drops off gradually to zero in the tails (as the normal density does), then the variability of the data in the tails of the probability plot is greater than in the center. Another example is provided by the histogram. Since the height of any one bar has a binomial distribution, the standard deviation of the height is approximately proportional to the square root of the expected height; hence, the variability of the longer bars is greater." (John M Chambers et al, "Graphical Methods for Data Analysis", 1983)

"Symmetry is also important because it can simplify our thinking about the distribution of a set of data. If we can establish that the data are (approximately) symmetric, then we no longer need to describe the shapes of both the right and left halves. (We might even combine the information from the two sides and have effectively twice as much data for viewing the distributional shape.) Finally, symmetry is important because many statistical procedures are designed for, and work best on, symmetric data." (John M Chambers et al, "Graphical Methods for Data Analysis", 1983)

"We will use the convenient expression 'chosen at random' to mean that the probabilities of the events in the sample space are all the same unless some modifying words are near to the words 'at random'. Usually we will compute the probability of the outcome based on the uniform probability model since that is very common in modeling simple situations. However, a uniform distribution does not imply that it comes from a random source; […]" (Richard W Hamming, "The Art of Probability for Scientists and Engineers", 1991)

"Data that are skewed toward large values occur commonly. Any set of positive measurements is a candidate. Nature just works like that. In fact, if data consisting of positive numbers range over several powers of ten, it is almost a guarantee that they will be skewed. Skewness creates many problems. There are visualization problems. A large fraction of the data are squashed into small regions of graphs, and visual assessment of the data degrades. There are characterization problems. Skewed distributions tend to be more complicated than symmetric ones; for example, there is no unique notion of location and the median and mean measure different aspects of the distribution. There are problems in carrying out probabilistic methods. The distribution of skewed data is not well approximated by the normal, so the many probabilistic methods based on an assumption of a normal distribution cannot be applied." (William S Cleveland, "Visualizing Data", 1993)

"Fitting data means finding mathematical descriptions of structure in the data. An additive shift is a structural property of univariate data in which distributions differ only in location and not in spread or shape. […] The process of identifying a structure in data and then fitting the structure to produce residuals that have the same distribution lies at the heart of statistical analysis. Such homogeneous residuals can be pooled, which increases the power of the description of the variation in the data." (William S Cleveland, "Visualizing Data", 1993)

"Many good things happen when data distributions are well approximated by the normal. First, the question of whether the shifts among the distributions are additive becomes the question of whether the distributions have the same standard deviation; if so, the shifts are additive. […] A second good happening is that methods of fitting and methods of probabilistic inference, to be taken up shortly, are typically simple and on well understood ground. […] A third good thing is that the description of the data distribution is more parsimonious." (William S Cleveland, "Visualizing Data", 1993)

"Probabilistic inference is the classical paradigm for data analysis in science and technology. It rests on a foundation of randomness; variation in data is ascribed to a random process in which nature generates data according to a probability distribution. This leads to a codification of uncertainly by confidence intervals and hypothesis tests." (William S Cleveland, "Visualizing Data", 1993)

"When distributions are compared, the goal is to understand how the distributions shift in going from one data set to the next. […] The most effective way to investigate the shifts of distributions is to compare corresponding quantiles." (William S Cleveland, "Visualizing Data", 1993)

"When the distributions of two or more groups of univariate data are skewed, it is common to have the spread increase monotonically with location. This behavior is monotone spread. Strictly speaking, monotone spread includes the case where the spread decreases monotonically with location, but such a decrease is much less common for raw data. Monotone spread, as with skewness, adds to the difficulty of data analysis. For example, it means that we cannot fit just location estimates to produce homogeneous residuals; we must fit spread estimates as well. Furthermore, the distributions cannot be compared by a number of standard methods of probabilistic inference that are based on an assumption of equal spreads; the standard t-test is one example. Fortunately, remedies for skewness can cure monotone spread as well." (William S Cleveland, "Visualizing Data", 1993)

"A normal distribution is most unlikely, although not impossible, when the observations are dependent upon one another - that is, when the probability of one event is determined by a preceding event. The observations will fail to distribute themselves symmetrically around the mean." (Peter L Bernstein, "Against the Gods: The Remarkable Story of Risk", 1996)

"The principle of maximum entropy is employed for estimating unknown probabilities (which cannot be derived deductively) on the basis of the available information. According to this principle, the estimated probability distribution should be such that its entropy reaches maximum within the constraints of the situation, i.e., constraints that represent the available information. This principle thus guarantees that no more information is used in estimating the probabilities than available." (George J Klir & Doug Elias, "Architecture of Systems Problem Solving" 2nd Ed, 2003)

"The principle of minimum entropy is employed in the formulation of resolution forms and related problems. According to this principle, the entropy of the estimated probability distribution, conditioned by a particular classification of the given events (e.g., states of the variable involved), is minimum subject to the constraints of the situation. This principle thus guarantees that all available information is used, as much as possible within the given constraints (e.g., required number of states), in the estimation of the unknown probabilities." (George J Klir & Doug Elias, "Architecture of Systems Problem Solving" 2nd Ed, 2003)

"In the laws of probability theory, likelihood distributions are fixed properties of a hypothesis. In the art of rationality, to explain is to anticipate. To anticipate is to explain." (Eliezer S. Yudkowsky, "A Technical Explanation of Technical Explanation", 2005)

"For some scientific data the true value cannot be given by a constant or some straightforward mathematical function but by a probability distribution or an expectation value. Such data are called probabilistic. Even so, their true value does not change with time or place, making them distinctly different from most statistical data of everyday life." (Manfred Drosg, "Dealing with Uncertainties: A Guide to Error Analysis", 2007)

"In error analysis the so-called 'chi-squared' is a measure of the agreement between the uncorrelated internal and the external uncertainties of a measured functional relation. The simplest such relation would be time independence. Theory of the chi-squared requires that the uncertainties be normally distributed. Nevertheless, it was found that the test can be applied to most probability distributions encountered in practice." (Manfred Drosg, "Dealing with Uncertainties: A Guide to Error Analysis", 2007)

"To fulfill the requirements of the theory underlying uncertainties, variables with random uncertainties must be independent of each other and identically distributed. In the limiting case of an infinite number of such variables, these are called normally distributed. However, one usually speaks of normally distributed variables even if their number is finite." (Manfred Drosg, "Dealing with Uncertainties: A Guide to Error Analysis", 2007)

"Traditional statistics is strong in devising ways of describing data and inferring distributional parameters from sample. Causal inference requires two additional ingredients: a science-friendly language for articulating causal knowledge, and a mathematical machinery for processing that knowledge, combining it with data and drawing new causal conclusions about a phenomenon." (Judea Pearl, "Causal inference in statistics: An overview", Statistics Surveys 3, 2009)

"The elements of this cloud of uncertainty (the set of all possible errors) can be described in terms of probability. The center of the cloud is the number zero, and elements of the cloud that are close to zero are more probable than elements that are far away from that center. We can be more precise in this definition by defining the cloud of uncertainty in terms of a mathematical function, called the probability distribution." (David S Salsburg, "Errors, Blunders, and Lies: How to Tell the Difference", 2017)

"It is not enough to give a single summary for a distribution - we need to have an idea of the spread, sometimes known as the variability. [...] The range is a natural choice, but is clearly very sensitive to extreme values [...] In contrast the inter-quartile range (IQR) is unaffected by extremes. This is the distance between the 25th and 75th percentiles of the data and so contains the ‘central half’ of the numbers [...] Finally the standard deviation is a widely used measure of spread. It is the most technically complex measure, but is only really appropriate for well-behaved symmetric data since it is also unduly influenced by outlying values." (David Spiegelhalter, "The Art of Statistics: Learning from Data", 2019)

"[...] the Central Limit Theorem [...] says that the distribution of sample means tends towards the form of a normal distribution with increasing sample size, almost regardless of the shape of the original data distribution." (David Spiegelhalter, "The Art of Statistics: Learning from Data", 2019)

"There is no ‘correct’ way to display sets of numbers: each of the plots we have used has some advantages: strip-charts show individual points, box-and-whisker plots are convenient for rapid visual summaries, and histograms give a good feel for the underlying shape of the data distribution." (David Spiegelhalter, "The Art of Statistics: Learning from Data", 2019)

More quotes on "Distributions" at the-web-of-knowledge.blogspot.com.

SQL Troubles

Pages

05 August 2025

🤖〽️Prompt Engineering: Copilot Unabridged (Part 72: When Machines Acknowledge Their Boundaries: How AI Can Recognize Its Own Limitations)

19 May 2025

#️⃣Software Engineering: Mea Culpa (Part VIII: A Look Beyond)

06 January 2025

💎🏭SQL Reloaded: Microsoft Fabric's SQL Databases (Part VII: Things That Don't Work) 🆕

04 April 2021

💼Project Management: Lean Management (Part I: Between Value and Waste I - An Introduction)

29 April 2019

🗄️Data Management: Data Integration (Part I: From Disintegration to Integration)

21 April 2019

#️⃣Software Engineering: Programming (Part VIII: Pair Programming)

07 January 2019

🤝Governance: Accountability (Just the Quotes)

25 December 2018

🔭Data Science: Trial and Error (Just the Quotes)

22 December 2018

🔭Data Science: Significance (Just the Quotes)

19 December 2018

🔭Data Science: Errors in Statistics (Just the Quotes)

09 December 2018

🔭Data Science: Failure (Just the Quotes)

🔭Data Science: Distributions (Just the Quotes)

About Me