Why Can't We Replicate?

Kaufman, Jeff T.

November 17th, 2011
science, stupid_posts

If we were to attempt to replicate the results of a scientific paper claiming to be significant at the p < 0.05 level (5% likely that the results are due to chance) then we would expect to fail to confirm their results at most 5% of the time. Unfortunately, replications fail to find results as strong as the original study much more than 5% of the time, closer to 41%. [1] Why is this happening, and what can we do about it?

Several factors are working here. A big one is publication bias: only a fraction of studies are published, and it's not a random fraction. Imagine twenty teams investigating the link between jellybeans and acne. Chances are at least one team will find results at the p < 0.05 level. The teams that didn't find anything can't get published, but the one team that did find a link is much more likely to be. Theodore Sterling described this in 1959, but it's still a problem. It comes from no one liking negative results: they're generally less interesting and less likely to be cited, so journals avoid them. There is some attempt to fight this, both with journals of negative results and with study pre-registration, but many negative results go unpublished.
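To put a number on "chances are" in the jellybean example, here is a minimal sketch. It assumes there is no real link at all and that the twenty teams run independent studies; the function name is just for illustration.

```python
def chance_of_any_false_positive(n_teams: int, alpha: float = 0.05) -> float:
    """Probability that at least one of n independent studies of a
    nonexistent effect reaches p < alpha purely by chance."""
    return 1.0 - (1.0 - alpha) ** n_teams


# Twenty teams, each testing at the 0.05 level.
print(f"{chance_of_any_false_positive(20):.2f}")  # prints 0.64
```

So even with nothing to find, there is roughly a 64% chance that some team ends up with a publishable-looking result.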

Another problem, similar to publication bias, is interpretation bias or data mining. There are often many reasonable ways to interpret a study: perhaps you were looking for an increase in heart disease rates but noticed instead that rates of cancer rose. Or perhaps you were originally looking at all people but found the effect only in women. Or teenagers. With all these ways to break up your population and all the outcomes you could report, finding *something* that passes statistical significance testing at p < 0.05 is not as difficult as it should be. As long as there is some noise in the data, a careless or unscrupulous researcher can probably claim to have found something real. This is an overfitting problem: the model the researchers choose for the data has enough configurable options (how to divide people up, what to test for) that it generalizes badly to new data. We could deal with this by cross-validation: using a random 70% of the data, the researchers would come up with any claims they thought were likely to be true and wanted to publish. Then they would see if their predictions held for the 30% they had not examined. This would probably be more efficient than doing straight replications for every promising result.
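The 70/30 idea above (strictly a single holdout split rather than full cross-validation) can be sketched with a small simulation. Everything here is an assumption made for illustration: the data is pure noise, people fall into 20 arbitrary subgroups, the baseline rate is 30%, and significance is checked with a two-proportion z-test using the normal approximation. The researcher picks the best-looking subgroup on the 70% and then tests that one claim on the remaining 30%.

```python
import math
import random


def two_sided_p(hits_a, n_a, hits_b, n_b):
    """Two-proportion z-test p-value, using the normal approximation."""
    pooled = (hits_a + hits_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0  # no variation at all, so no evidence of a difference
    z = (hits_a / n_a - hits_b / n_b) / se
    return math.erfc(abs(z) / math.sqrt(2))


def subgroup_p(rows, subgroup):
    """p-value for treated vs. untreated outcomes within one subgroup."""
    treated = [sick for g, t, sick in rows if g == subgroup and t]
    control = [sick for g, t, sick in rows if g == subgroup and not t]
    if not treated or not control:
        return 1.0
    return two_sided_p(sum(treated), len(treated), sum(control), len(control))


def run_simulation(trials=200, people=1000, n_subgroups=20, seed=0):
    """Return (studies with a 'significant' subgroup, those confirmed on holdout)."""
    rng = random.Random(seed)
    found = confirmed = 0
    for _ in range(trials):
        # Pure noise: subgroup, treatment, and outcome are all independent.
        rows = [(rng.randrange(n_subgroups), rng.random() < 0.5, rng.random() < 0.3)
                for _ in range(people)]
        cut = int(0.7 * people)
        explore, holdout = rows[:cut], rows[cut:]  # rows are already in random order
        # Data mining step: keep whichever subgroup looks best on the 70%.
        best = min(range(n_subgroups), key=lambda g: subgroup_p(explore, g))
        if subgroup_p(explore, best) < 0.05:
            found += 1
            # Confirmation step: test that single claim on the unseen 30%.
            if subgroup_p(holdout, best) < 0.05:
                confirmed += 1
    return found, confirmed


found, confirmed = run_simulation()
print(f"'significant' subgroup found in {found} of 200 noise-only studies")
print(f"of those, {confirmed} also passed on the held-out 30%")
```

Since there is no real effect in this data, most runs should turn up a "finding" in the exploration set, and only a small fraction of those should survive the holdout check.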

What if the hypothesis isn't very likely? John Ioannidis published "Why Most Published Research Findings Are False" (2005), claiming that for most studies there was less than a 50% chance that they had found something true. He looks at the biases above, but the biggest contribution to his claim comes from a-priori probability: before we even ran the experiment, how likely did we think the claim was to be true? For example:

Let us assume that a team of investigators performs a whole genome association study to test whether any of 100,000 gene polymorphisms are associated with susceptibility to schizophrenia. Based on what we know about the extent of heritability of the disease, it is reasonable to expect that probably around ten gene polymorphisms among those tested would be truly associated with schizophrenia, with relatively similar odds ratios around 1.3 for the ten or so polymorphisms and with a fairly similar power to identify any of them. Then R = 10/100,000 = 10^-4, and the pre-study probability for any polymorphism to be associated with schizophrenia is also R/(R + 1) = 10^-4. Let us also suppose that the study has 60% power to find an association with an odds ratio of 1.3 at alpha = 0.05. Then it can be estimated that if a statistically significant association is found with the p-value barely crossing the 0.05 threshold, the post-study probability that this is true increases about 12-fold compared with the pre-study probability, but it is still only 12x10^-4.

The study was only able to shift our belief by a factor of 12: success meant the claim was 12 times more likely to be true. So we moved from thinking it had odds of being true of about 1:10,000 to odds of about 1:800. Thinking the claim was probably true would mean odds better than 1:1 (a probability above 50%), so the claim is probably false.
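The numbers in the quoted example can be checked with a few lines. This is a sketch using the bias-free positive predictive value formula from Ioannidis's paper, PPV = (1 - beta)R / (R - beta R + alpha), where R is the pre-study odds, 1 - beta is the power, and alpha is the significance threshold. The function name is mine, not the paper's.

```python
def post_study_probability(R: float, power: float, alpha: float) -> float:
    """Probability that a statistically significant finding is true,
    ignoring bias: (1 - beta) * R / (R - beta * R + alpha)."""
    beta = 1.0 - power
    return (1.0 - beta) * R / (R - beta * R + alpha)


R = 10 / 100_000              # about ten true associations among 100,000 tested
pre = R / (R + 1)             # pre-study probability, about 10^-4
post = post_study_probability(R, power=0.60, alpha=0.05)

print(f"pre-study probability:  {pre:.6f}")
print(f"post-study probability: {post:.4f}")        # prints 0.0012
print(f"increase:               {post / pre:.1f}x")  # prints 12.0x
```

A post-study probability of about 0.0012 is still far below the 50% needed to call the claim probably true.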

This is kind of a mess. What can we do? The easiest thing is to make an internal adjustment: if a single study reports just p < 0.05, there's a good chance it's false, so we shouldn't pay much attention to it. If it gets replicated successfully, then we should take it seriously. Promoting study pre-registration would be good too.

Update 2011-11-21: As george points out in the facebook comments below, I don't understand p-values properly. I intend to remedy this, and then write another post.

[1] John Ioannidis 2005, Contradicted and Initially Stronger Effects in Highly Cited Clinical Research:

Of 49 highly cited original clinical research studies, 45 claimed that the intervention was effective. Of these, 7 (16%) were contradicted by subsequent studies, 7 others (16%) had found effects that were stronger than those of subsequent studies, 20 (44%) were replicated, and 11 (24%) remained largely unchallenged.

I've not found other analyses of the overall success rate of replication, so if you know of one I would be curious to read it.

I calculated 41% as the fraction of studies that "claimed that the intervention was effective" (45) and didn't "remain largely unchallenged" (11) that were either "contradicted by subsequent studies" (7) or "had found effects that were stronger than those of subsequent studies" (7), which is (7+7)/(45-11).

