## Why Can't We Replicate?
November 17th, 2011

science, stupid_posts

If researchers publish results at the `p < 0.05` level (5%
likely that the results are due to chance) then we would expect to
fail to confirm their results at most 5% of the time. Unfortunately,
replications fail to find results as strong as the original study much
more than 5% of the time, closer to 41%. [1] Why is this happening,
and what can we do about it?

Several factors are working here. A big one is publication bias: only
a fraction of studies are published, and it's not a random fraction.
Imagine twenty teams investigating the link between jellybeans and acne. Chances are at
least one team will find results at the `p < 0.05` level.
The teams that didn't find anything can't get published, but the one
team that did find a link is much more likely to be. Theodore
Sterling described this
in 1959, but it's still a problem. It comes from no one liking
negative results: they're generally less interesting and less likely
to be cited, so journals avoid them. There is some attempt to fight
this, both with journals of negative results and with study
pre-registration, but many negative results go unpublished.
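
To put a rough number on the twenty-teams example, here is a minimal sketch. It assumes the teams are independent and that there is no real effect, so each team has a 5% chance of a spurious `p < 0.05` result. The function names are mine, made up for illustration.

```python
import random


def chance_of_false_positive(teams: int, alpha: float = 0.05) -> float:
    """Probability that at least one of `teams` independent studies of a
    nonexistent effect reaches significance at level `alpha`."""
    return 1 - (1 - alpha) ** teams


def simulate(teams: int, alpha: float = 0.05, trials: int = 100_000, seed: int = 0) -> float:
    """Monte Carlo check of the closed-form answer above."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        # With no real effect, a p-value is uniform on [0, 1], so each
        # team "finds" a link with probability alpha.
        if any(rng.random() < alpha for _ in range(teams)):
            hits += 1
    return hits / trials


print(round(chance_of_false_positive(20), 3))  # 0.642
```

So under these assumptions there is roughly a 64% chance that at least one team gets a publishable result out of pure noise.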

Another problem, similar to publication bias, is interpretation bias
or data mining. There are often many reasonable ways to interpret a
study: perhaps you were looking for an increase in heart disease rates
but noticed instead that rates of cancer rose. Or perhaps you were
originally looking at all people but found this only affected women.
Or teenagers. With all these ways you can break up your population
and all the outcomes you could report, finding *something* that passes
statistical significance testing at `p < 0.05` is not as
difficult as it should be. As long as there is some noise in the
data, a careless or unscrupulous researcher probably can claim to have
found something real. This is an overfitting problem: the model the
researchers choose for the data has a large enough number of
configurable options (how to divide people up, what to test for) that
it generalizes badly to new data. We could deal with this by
cross-validation: using a random 70% of the data, the researchers would
come up with any claims they thought were likely to be true and wanted
to publish. Then they would see if their predictions held for the 30%
they had not examined. This would probably be more efficient than
doing straight replications for every promising result.
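
Here is a rough sketch of the 70/30 idea on made-up data. Everything in it is an assumption for illustration: the outcomes are pure noise, there are 5 subgroups and 8 outcomes to slice by, and the p-value comes from a simple normal approximation rather than whatever test a real study would use.

```python
import random
from statistics import NormalDist, mean, stdev


def p_value(treated, control):
    """Two-sided p-value from a normal approximation to a two-sample test."""
    se = (stdev(treated) ** 2 / len(treated) + stdev(control) ** 2 / len(control)) ** 0.5
    z = (mean(treated) - mean(control)) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


def make_noise_study(n_people, n_subgroups, n_outcomes, rng):
    """Every outcome is pure noise, so any 'finding' is a false positive."""
    return [
        {
            "treated": rng.random() < 0.5,
            "subgroup": rng.randrange(n_subgroups),
            "outcomes": [rng.gauss(0, 1) for _ in range(n_outcomes)],
        }
        for _ in range(n_people)
    ]


def claim_p_value(people, subgroup, outcome):
    """p-value for 'treatment changes this outcome in this subgroup'."""
    group = [p for p in people if p["subgroup"] == subgroup]
    treated = [p["outcomes"][outcome] for p in group if p["treated"]]
    control = [p["outcomes"][outcome] for p in group if not p["treated"]]
    return p_value(treated, control)


rng = random.Random(1)
people = make_noise_study(4000, 5, 8, rng)
rng.shuffle(people)
cut = int(0.7 * len(people))
explore, holdout = people[:cut], people[cut:]

# Data-mine the exploration set: try every subgroup/outcome combination
# and keep whichever looks most significant.
claims = [(s, o) for s in range(5) for o in range(8)]
best = min(claims, key=lambda c: claim_p_value(explore, *c))

print(f"best claim (subgroup, outcome): {best}")
print(f"p-value on the 70% used to find it: {claim_p_value(explore, *best):.3f}")
print(f"p-value on the untouched 30%:       {claim_p_value(holdout, *best):.3f}")
```

With 40 combinations to choose from, the best-looking one on the exploration set will often pass `p < 0.05` by chance, while the held-out 30% gives it a single honest test.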

What if the hypothesis isn't very likely? John Ioannidis published *Why Most Published Research Findings Are False* (2005) claiming that for most studies there was less than a 50% chance that they had found something true. He looks at the biases above, but the biggest contribution to his claim comes from a-priori probability: before we even ran the experiment, how likely did we think the claim was to be true? For example:

Let us assume that a team of investigators performs a whole genome association study to test whether any of 100,000 gene polymorphisms are associated with susceptibility to schizophrenia. Based on what we know about the extent of heritability of the disease, it is reasonable to expect that probably around ten gene polymorphisms among those tested would be truly associated with schizophrenia, with relatively similar odds ratios around 1.3 for the ten or so polymorphisms and with a fairly similar power to identify any of them. Then R = 10/100,000 = 10^-4, and the pre-study probability for any polymorphism to be associated with schizophrenia is also R/(R + 1) = 10^-4. Let us also suppose that the study has 60% power to find an association with an odds ratio of 1.3 at alpha = 0.05. Then it can be estimated that if a statistically significant association is found with the p-value barely crossing the 0.05 threshold, the post-study probability that this is true increases about 12-fold compared with the pre-study probability, but it is still only 12x10^-4.

The study was only able to shift our belief by a factor of 12: a positive result made the claim 12 times more likely to be true. So we moved from thinking it had odds of being true of about 1:10000 to odds of about 1:1000 (more precisely, 12:10000). Thinking the claim was probably true would mean odds better than 1:1, so the claim is still probably false.
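
The numbers in the quoted example can be checked with a few lines. This sketch uses the no-bias form of the post-study probability from Ioannidis's paper, `PPV = (1 - beta) * R / (R - beta * R + alpha)`, where `1 - beta` is the power; the function name is mine.

```python
def post_study_probability(prior_odds: float, power: float, alpha: float) -> float:
    """Probability a significant finding is true, ignoring bias:
    power * R / (power * R + alpha), with R the pre-study odds."""
    true_positives = power * prior_odds
    false_positives = alpha
    return true_positives / (true_positives + false_positives)


R = 10 / 100_000                # pre-study odds from the quoted example
prior_probability = R / (R + 1)
ppv = post_study_probability(R, power=0.60, alpha=0.05)

print(f"{ppv:.5f}")                      # 0.00120
print(f"{ppv / prior_probability:.1f}")  # 12.0
```

The factor of 12 is just the ratio of power to alpha, 0.60 / 0.05: a significant result is 12 times more likely when the association is real than when it isn't.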

This is kind of a mess. What can we do? The easiest thing is to make
an internal adjustment: if a single study says just `p < 0.05`, there's
a good chance it's false and so we shouldn't pay
much attention to it. If it gets replicated successfully, then we
should take it seriously. Promoting registration would be good too.

**Update 2011-11-21**: As george points out in the facebook
comments below, I don't understand p-values properly. I intend to
remedy this, and then write another post.

[1] John Ioannidis 2005, Contradicted
and Initially Stronger Effects in Highly Cited Clinical Research:

Of 49 highly cited original clinical research studies, 45 claimed that the intervention was effective. Of these, 7 (16%) were contradicted by subsequent studies, 7 others (16%) had found effects that were stronger than those of subsequent studies, 20 (44%) were replicated, and 11 (24%) remained largely unchallenged.

I've not found other analyses of the overall success rate of replication, so if you know of one I would be curious to read it.

I calculated 41% as the fraction of studies that "claimed that the
intervention was effective" (45) and didn't "remain largely
unchallenged" (11) that were either "contradicted by subsequent
studies" (7) or "had found effects that were stronger than those of
subsequent studies" (7), which is `(7+7)/(45-11)`.
