02 November 2018

Data Science: Torturing the Data in Statistics

Statistics, through its methods, techniques and models rooted in mathematical reasoning, allows exploring, analyzing and summarizing a given set of data, and is used to support decision-making, experiments and theories, and ultimately to gain and communicate insights. When used adequately, statistics can prove to be a useful toolset; however, as soon as its use deviates from the mathematical rigor and principles on which it was built, it can easily be misused. Moreover, results obtained with the help of statistics can easily be distorted in communication, even when the statistical results themselves are valid.

The ease with which statistics can be misused is probably best reflected in sayings like 'if you torture the data long enough it will confess'. The formulation is attributed by several sources to the economist Ronald H Coase; however, according to Coase, the reference he made in the 1960s was slightly different: 'if you torture the data enough, nature will always confess' (see [1]). The latter formulation is not necessarily negative if one considers the persistence researchers need in order to reveal nature’s secrets. By contrast, the former formulation seems to stress only the negative aspect.

The word 'torture' seems to be used in place of 'abuse' because, as a metaphor, it carries more weight: it draws attention and sticks with the reader or audience. As the Quote Investigator remarks [1], ‘torturing the data’ was employed as a metaphor much earlier. For example, a 1933 article contains the following passage:

"The evidence submitted by the committee from its own questionnaire warrants no such conclusion. To torture the data given in Table I into evidence supporting a twelve-hour minimum of professional training is indeed a statistical feat, but one which the committee accomplishes to its own satisfaction." ("The Elementary School Journal" Vol. 33 (7), 1933)

More than a decade earlier, in a context similar to that of Coase's quote, John Dewey remarked:

"Active experimentation must force the apparent facts of nature into forms different to those in which they familiarly present themselves; and thus make them tell the truth about themselves, as torture may compel an unwilling witness to reveal what he has been concealing." (John Dewey, "Reconstruction in Philosophy", 1920)

Torture has been used metaphorically since at least the 1600s, if we consider the following quote from Sir Francis Bacon’s 'Advancement of Learning':

"Another diversity of Methods is according to the subject or matter which is handled; for there is a great difference in delivery of the Mathematics, which are the most abstracted of knowledges, and Policy, which is the most immersed […], yet we see how that opinion, besides the weakness of it, hath been of ill desert towards learning, as that which taketh the way to reduce learning to certain empty and barren generalities; being but the very husks and shells of sciences, all the kernel being forced out and expulsed with the torture and press of the method." (Sir Francis Bacon, Advancement of Learning, 1605)

However, a similar metaphor with a closer meaning can be found more than two centuries later:

"One very reprehensible mode of theory-making consists, after honest deductions from a few facts have been made, in torturing other facts to suit the end proposed, in omitting some, and in making use of any authority that may lend assistance to the object desired; while all those which militate against it are carefully put on one side or doubted." (Henry De la Beche, "Sections and Views, Illustrative of Geological Phaenomena", 1830)

The following quote from Goethe probably also deserves some attention:

"Someday someone will write a pathology of experimental physics and bring to light all those swindles which subvert our reason, beguile our judgement and, what is worse, stand in the way of any practical progress. The phenomena must be freed once and for all from their grim torture chamber of empiricism, mechanism, and dogmatism; they must be brought before the jury of man's common sense." (Johann Wolfgang von Goethe)

Alternatives to Coase’s formulation were used in several later sources, replacing 'data' with 'statistics' or 'numbers':

"Beware of the problem of testing too many hypotheses; the more you torture the data, the more likely they are to confess, but confessions obtained under duress may not be admissible in the court of scientific opinion." (Stephen M Stigler, "Neutral Models in Biology", 1987)

"Torture numbers, and they will confess to anything." (Gregg Easterbrook, New Republic, 1989)

"[…] an honest exploratory study should indicate how many comparisons were made […] most experts agree that large numbers of comparisons will produce apparently statistically significant findings that are actually due to chance. The data torturer will act as if every positive result confirmed a major hypothesis. The honest investigator will limit the study to focused questions, all of which make biologic sense. The cautious reader should look at the number of ‘significant’ results in the context of how many comparisons were made." (James L Mills, "Data torturing", New England Journal of Medicine, 1993)

"This is true only if you torture the statistics until they produce the confession you want." (Larry Schweikart, "Myths of the 1980s Distort Debate over Tax Cuts", 2001) [source

"Even properly done statistics can’t be trusted. The plethora of available statistical techniques and analyses grants researchers an enormous amount of freedom when analyzing their data, and it is trivially easy to ‘torture the data until it confesses’." (Alex Reinhart, "Statistics Done Wrong: The Woefully Complete Guide", 2015)

There is also a psychological component to torturing data or facts until they fit one's view of reality, a tendency that derives from the way the human mind works and from the limits and fallacies associated with its workings.

"What are the models? Well, the first rule is that you’ve got to have multiple models - because if you just have one or two that you’re using, the nature of human psychology is such that you’ll torture reality so that it fits your models, or at least you’ll think it does." (Charles Munger, 1994)

Regardless of the formulation and context used, the fact remains: statistics (aka data, numbers) can be easily abused, and the reader or audience should be aware of it!

Previously published on quotablemath.blogspot.com.

