randombio.com | science notes
Wednesday, August 23, 2017

The Earth is Round (p<0.05)

After 23 years, the paper with that title still raises uncomfortable questions.

The Earth is Round (p<0.05) might sound like the title of a 1950s sci-fi movie, but it actually was the title of a famous 1994 paper by Jacob Cohen[1], who discussed the perennial question of whether p<0.05 is a good criterion for statistical significance. We still don't have a good answer to that question, though not for lack of trying.

Statistics in science are essential as a way of forcing us to present our data in a manner amenable to rigorous analysis. But scientists are not really interested in the probability that something is due to chance. What they care about is whether something is true, which is not exactly the same thing.

Anybody who has dealt with statisticians knows how infuriating statistics can be. If you talk to statisticians, you get the impression that anything is acceptable, sometimes, except when it's not. Few mathematicians would ever answer a question with “Well, it sort of depends” and even fewer would say that the result of a mathematical test depends on what you wanted to prove. But that is what statisticians tell us.

[Image: NASA square Earth. Caption: Don't be so sure]

In biology, despite the trend toward publishing actual p values, a p value of 0.05 is still used by reviewers as a decision point about publishability. In the lab, it's considered a decision point for whether or not to continue the experiment. Yet the need to “get it past the reviewers” forces us to choose between inappropriate tests and accepting a criterion we all know to be inadequate.

One-tailed t-tests

We can see this most clearly with the issue of one-tailed tests. One article in a pharmacology journal, by Ludbrook[2], suggests that one-tailed t-tests are often appropriate in pharmacology. This opinion seems to be very much in the minority, and indeed the article elicited a lot of critical response[3–7] (author's reply[8]). Most results are presented using two-tailed statistics, and most authors do not bother to state this explicitly[9].

This isn't some minor academic issue. (Okay, it is, but I'm going to talk about it anyway.) Statisticians have made careers out of proposing solutions[10], and there are scientists out there who will cite Ludbrook's article as evidence that their p values of 0.11 represent a real effect. Earlier this year one drug company did just that. They presented their clinical results using one-tailed t-tests. Investors were not impressed; the company's stock tanked, sending it hurtling toward bankruptcy. If one-tailed tests are viable, why would this happen? Why doesn't everyone use them? Why not base drug approvals on them?

There are actually two interpretations of a one-tailed test. The first, and most common, is that it's only valid when it's physically impossible for the alternative to occur. For example, when counting the number of defective toasters, a toaster is either defective or it's not. Only in this case, say advocates, is a one-tailed test appropriate.

The other interpretation is that a one-tailed test is appropriate whenever the hypothesis is only concerned with one outcome. If one asks whether a cancer drug does or does not cure patients, the thinking is that a one-tailed test is appropriate; the company can ignore the possibility that the drug could make cancer worse.
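To see why the choice of tails matters so much in practice, here is a minimal sketch of how the two p values relate for the same test statistic. It assumes a standard normal (z) statistic rather than a true t distribution, purely so the example needs nothing beyond Python's standard library; the function names and the value z = 1.8 are hypothetical.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # assumption: normal approximation, not a t distribution


def p_two_tailed(z):
    """Probability of a statistic at least this extreme in either direction."""
    return 2 * (1 - STD_NORMAL.cdf(abs(z)))


def p_one_tailed(z):
    """Probability of a statistic at least this large, for H1: effect > 0."""
    return 1 - STD_NORMAL.cdf(z)


z = 1.8  # hypothetical test statistic
print(f"two-tailed p = {p_two_tailed(z):.4f}")   # roughly 0.072
print(f"one-tailed p = {p_one_tailed(z):.4f}")   # roughly 0.036
# An effect in the "wrong" direction gives a large one-tailed p value.
print(f"one-tailed p for z = -1.8: {p_one_tailed(-z):.4f}")
```

When the effect falls in the predicted direction, the one-tailed p value is exactly half the two-tailed one, so the same data can land on either side of 0.05 depending only on which test is chosen.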

The rationale is that both cases have a binary outcome. But any question with a multiplicity of possible outcomes can be broken down into a finite number of binary yes/no questions, depending on how the question is asked. (So, technically, if we use a one-tailed test, we should do it for every possible outcome and run a Bonferroni or some other correction on the result.) But statisticians tell us the choice of test depends on what you believe could be happening. And that brings us to the real problem: why should it matter what we believe?
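The Bonferroni correction mentioned in passing is simple enough to sketch. The version below multiplies each p value by the number of tests and caps the result at 1; the three p values are made up for illustration, and the function name is hypothetical.

```python
def bonferroni(p_values, alpha=0.05):
    """Return (raw p, adjusted p, significant?) for each test.

    Adjusted p = min(1, p * m), where m is the number of tests performed.
    """
    m = len(p_values)
    results = []
    for p in p_values:
        adjusted = min(1.0, p * m)
        results.append((p, adjusted, adjusted < alpha))
    return results


# Hypothetical p values from three separate yes/no questions.
for raw, adjusted, significant in bonferroni([0.01, 0.04, 0.03]):
    print(f"raw p = {raw:.2f}  adjusted p = {adjusted:.2f}  significant: {significant}")
```

In this made-up example, two results that would individually pass p<0.05 no longer do once the correction is applied.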

Statisticians reply that we should always decide what test to use before analyzing the data. In my observation, this never actually happens. Why, a scientist would ask, should ‘when I ask the question’ make a difference? There is no quantum mechanics involved here. If the result is real, it should be real regardless of when I do the test. If what I'm thinking makes a difference as to whether the result is real, have we not re-introduced subjectivity into science?
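The statisticians' side of this argument can at least be illustrated with a small simulation. This is only a sketch under assumptions not found in the text above: the test statistic is a standard normal z, there is no real effect, alpha is 0.05, and "deciding afterwards" means running the one-tailed test in whichever direction the data happened to fall. All names are hypothetical.

```python
import random
from statistics import NormalDist

ALPHA = 0.05
# One-tailed cutoff for the assumed alpha (about 1.645 for a normal statistic).
Z_CRIT = NormalDist().inv_cdf(1 - ALPHA)


def false_positive_rates(trials=100_000, seed=0):
    """Simulate experiments with no real effect and count 'significant' results."""
    rng = random.Random(seed)
    planned = 0   # direction fixed in advance (effect > 0)
    post_hoc = 0  # direction chosen after looking at the data
    for _ in range(trials):
        z = rng.gauss(0.0, 1.0)
        if z > Z_CRIT:
            planned += 1
        if abs(z) > Z_CRIT:
            post_hoc += 1
    return planned / trials, post_hoc / trials


planned_rate, post_hoc_rate = false_positive_rates()
print(f"direction chosen in advance:  {planned_rate:.3f}")
print(f"direction chosen afterwards: {post_hoc_rate:.3f}")
```

Under these assumptions the planned test should produce false positives close to 5% of the time and the after-the-fact test close to 10%. The data are the same in both cases; what changes is how often the procedure declares a result significant when nothing is there.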

Scientists are not really interested in the probability that their hypothesis is false. What they really want to know is: is this result real? This is not a scientific question, but a metaphysical one. Perhaps there is a clue here: it may be that the metaphysical roots of statistics need to be more closely examined.


And what about the plausibility of the hypothesis itself? Years ago Carl Sagan solemnly informed us that extraordinary claims require extraordinary evidence. Is this really true? If so, how can we measure the extraordinariness of a claim?

In statistics we have Type 1 errors and Type 2 errors, which depend on whether the null hypothesis is falsely rejected or falsely accepted. Perhaps we could say a Type 3 error occurs when the hypothesis is so ridiculously improbable that a two-tailed test doesn't do it justice. A Type 4 error could be one that's so absolutely ludicrous that no one in their right mind would even waste time studying it. At the other end of the scale, maybe a Type 0 error could be one where the result is so obviously true that publishing it would be a waste of paper.

This might have been what Ioannidis[11] was getting at in his justly-maligned claim to have ‘proven’ that most hypotheses in the scientific literature are false. That paper is cited as evidence for the uselessness of open-access journals, but it has also become a sort of flytrap for the anti-science brigades. Yet it raises an interesting question: what percentage of possible hypotheses are true?

A non-scientist might imagine that hypotheses are pulled out of thin air. If so, then the vast majority of research would indeed be useless, and most papers would be reporting negative results: “Cancer is not caused by aliens from Mars”; “Cancer is not caused by aliens from Alpha Centauri”; and “Cancer is not caused by the retrograde rotation of Venus.” The fact that this doesn't happen disproves Ioannidis's claim beyond any need to examine the pseudo-mathematics in the paper.
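The question of what fraction of tested hypotheses are true can be made concrete with a standard Bayes-rule calculation. This is a rough sketch, not a reproduction of the model in the paper: it assumes a significance threshold of 0.05 and a statistical power of 0.8 (both hypothetical choices), and ignores bias and multiple testing.

```python
def prob_true_given_significant(prior, alpha=0.05, power=0.8):
    """P(hypothesis is true | result is significant), by Bayes' rule.

    prior: fraction of tested hypotheses that are actually true
    alpha: chance of a significant result when the hypothesis is false
    power: chance of a significant result when the hypothesis is true
    """
    true_positives = power * prior
    false_positives = alpha * (1 - prior)
    return true_positives / (true_positives + false_positives)


# Hypothetical priors, from "carefully chosen" to "pulled out of thin air".
for prior in (0.5, 0.1, 0.01, 0.001):
    ppv = prob_true_given_significant(prior)
    print(f"prior = {prior:<6} P(true | significant) = {ppv:.3f}")
```

With these assumed numbers, a significant result is very likely to be real when half of all tested hypotheses are true, and very unlikely to be real when only one in a thousand is. The answer depends almost entirely on how hypotheses are chosen in the first place.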

In fact, one of our biggest problems is that we spend too much time proving things that are already evident to common sense.

Artificial Intelligence

Ironically, that puts us in a world of mathematics: calculating common sense is the holy grail of much of artificial intelligence research. In AI, there are such things as belief functions, qualitative probability, and plausibility measures. Kraus, Lehmann and Magidor[12] proposed a measure now called preferential entailment, a type of non-monotonic logic, in which something can be tentatively true but later judged false on the basis of new information. Their theory is called KLM, which if nothing else is proof of the TMDA problem: too many darn acronyms.
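A toy example may help show what "tentatively true but later judged false" means. The sketch below uses a ranked-model reading of preferential entailment: a premise tentatively entails a conclusion if the conclusion holds in all of the most normal worlds where the premise is true. The bird and penguin scenario, the rank values, and the function names are all illustrative assumptions and are not taken from the KLM paper.

```python
from itertools import product

ATOMS = ("bird", "penguin", "flies")

# Every possible world: one truth assignment to the three atoms.
WORLDS = [dict(zip(ATOMS, values)) for values in product([False, True], repeat=3)]


def rank(world):
    """Made-up abnormality ranking: 0 is most normal, higher is stranger."""
    if world["penguin"] and not world["bird"]:
        return 3  # a penguin that is not a bird: most abnormal
    if world["penguin"]:
        return 2 if world["flies"] else 1  # flying penguins are stranger
    if world["bird"] and not world["flies"]:
        return 1  # a flightless non-penguin bird is mildly abnormal
    return 0


def entails(premise, conclusion):
    """True if the conclusion holds in all most-normal worlds satisfying the premise."""
    candidates = [w for w in WORLDS if premise(w)]
    if not candidates:
        return True  # nothing satisfies the premise, so entailment holds vacuously
    best = min(rank(w) for w in candidates)
    return all(conclusion(w) for w in candidates if rank(w) == best)


print(entails(lambda w: w["bird"], lambda w: w["flies"]))
print(entails(lambda w: w["bird"] and w["penguin"], lambda w: w["flies"]))
```

Knowing only that something is a bird, the tentative conclusion is that it flies. Adding the information that it is a penguin withdraws that conclusion, which is something ordinary (monotonic) logic never allows.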

Friedman and Halpern[13] say that a plausibility space is a generalization of probability space, and propose properties that will be needed to represent uncertainty. The goal is to help an AI decide whether a particular interpretation of the world makes intuitive sense.

This is all fascinating stuff, in the Spockian sense, but one thing seems clear: whatever solution is found to these simple statistical problems, it probably won't make a scientist's job any easier. In fact, I'd say the probability of that is well below 0.05.

1. Cohen J (1994). The earth is round (p<0.05). American Psychologist 49, 997–1003.

2. Ludbrook J (2013). Should we use one-sided or two-sided P values in tests of significance? Clin. Exp. Pharmacol. Physiol. 40, 357–361.

3. Woodman RJ (2013). Using one-sided hypothesis tests with a clear conscience. Clin. Exp. Pharmacol. Physiol. 40, 595–596.

4. Drummond G (2013). One, two, or lots of sides to a problem? Clin. Exp. Pharmacol. Physiol. 40, 592.

5. Curran-Everett D (2013). Sides of the story. Clin. Exp. Pharmacol. Physiol. 40, 593.

6. Matthews ST (2013). One-tailed significance tests and the accounting for alpha. Clin. Exp. Pharmacol. Physiol. 40, 594.

7. Hurlbert SH, Lombardi CM (2013). One-tailed tests are rarely appropriate in either basic or applied research. Clin. Exp. Pharmacol. Physiol. 40, 591.

8. Ludbrook J (2013). Second thoughts on the sidedness of P. Clin. Exp. Pharmacol. Physiol. 40, 589–590.

9. Lombardi CM, Hurlbert SH (2009). Misprescription and misuse of one-tailed tests. Austral Ecology 34, 447–468.

10. Hurlbert SH, Lombardi CM (2016). Pseudoreplication, one-tailed tests, neofisherianism, multiple comparisons, and pseudofactorialism. Integr. Environ. Assess Manag. 12, 196–197.

11. Ioannidis JPA (2005). Why most published research findings are false. PLoS Med 2(8), e124.

12. Kraus S, Lehmann D, Magidor M (1990). Nonmonotonic reasoning, preferential models and cumulative logics. Artificial Intelligence 44, 167–207.

13. Friedman N, Halpern JY (1995). Plausibility measures: a user's guide. Proc Eleventh Conference on Uncertainty in AI, 175–184.

aug 23, 2017; last edited oct 01, 2017, 7:20 am
