
The Replication Crisis

Most published scientific findings do not replicate. This is not a failure of individual scientists — it is a structural consequence of how science is conducted and reported. Here is what it means, why it happened, and how to read a single study properly.

Time: 15 minutes

Opening Hook

In 2010, Harvard Business School psychologist Amy Cuddy co-authored a paper that appeared to show something remarkable: stand in a “power pose” for two minutes before a stressful event and your testosterone rises, your cortisol drops, and you perform better in job interviews. The paper was published in Psychological Science, one of the field’s most prestigious journals. The finding spread immediately. It made the cover of Time magazine. The TED talk Cuddy gave on the subject became one of the most watched in the history of the platform, with over seventy million views.

In 2015, six researchers attempted to replicate the hormonal findings. They used a larger sample, pre-registered their analysis plan, and followed the original protocol as closely as possible. The testosterone and cortisol effects vanished completely. The hormonal changes that were the mechanistic core of the claim simply did not appear. One of the original paper’s own co-authors, Dana Carney, later published a statement saying she no longer believed the effect was real.

Seventy million people had been told something about their biology that was probably untrue.

Power posing is not an isolated case. It is the most visible face of a problem that runs through the foundations of scientific research.

The Concept

In 2011, researchers at the pharmaceutical company Bayer reported an attempt to replicate the results of 67 landmark preclinical studies, the kind of findings that form the basis for new drug development. They could fully replicate only 14 of them. That is roughly 21 percent.

In 2012, Glenn Begley and Lee Ellis reported that scientists at Amgen had attempted to reproduce 53 landmark cancer biology studies. They managed to replicate six. That is roughly 11 percent.

The most systematic investigation came in 2015. The Open Science Collaboration, a group of 270 researchers across dozens of institutions, selected 100 psychology studies published in three leading journals, studies that had passed peer review and were considered solid contributions to the field, and attempted to replicate each one. The original studies had reported statistically significant findings in 97 out of 100 cases. The replications found statistically significant results in only 36.

That figure, often summarised as “roughly half of psychology findings do not replicate,” is actually optimistic. When the replicators measured effect sizes rather than just whether a result cleared the significance threshold, the replicated effects were on average about half the size of the originals, a strong signal that many “significant” original findings were inflated even in the cases where the replication nominally succeeded.

This is what the replication crisis refers to: the discovery, across multiple fields and multiple large-scale efforts, that a substantial fraction of published scientific findings cannot be reproduced by independent researchers following the same methods.

Understanding why it happened requires revisiting some concepts from this area of the curriculum. You already know what a p-value is (Unit 2.7), and you have just finished a unit on the Bayesian framework (Unit 2.10). Both are directly relevant here.

Small samples. The smaller the sample in a study, the greater the random variation in the results. A small sample can produce a large, statistically significant effect by chance alone. Many foundational psychology studies were conducted with sample sizes of 20 to 30 participants, in an era when this was considered acceptable. With so few participants, the noise in the data can easily overwhelm the signal, and a lucky random draw can produce a p-value under 0.05 without any genuine underlying effect.
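
A short simulation makes this concrete. The sketch below, in Python with numpy and scipy, is illustrative only: the sample size, the number of simulated studies, and the use of a two-sample t-test are assumptions, not a model of any particular experiment. It shows that when the true effect is exactly zero, small studies still reach significance about 5 percent of the time, and the ones that do report large effect sizes.

```python
# Sketch: with no true effect and small groups, some studies still "find" one,
# and the ones that do report inflated effect sizes.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_per_group, n_studies = 25, 10_000
significant_effects = []

for _ in range(n_studies):
    control = rng.normal(0, 1, n_per_group)   # true effect is exactly zero
    treated = rng.normal(0, 1, n_per_group)
    res = stats.ttest_ind(treated, control)
    if res.pvalue < 0.05:
        pooled_sd = np.sqrt((treated.var(ddof=1) + control.var(ddof=1)) / 2)
        significant_effects.append(abs(treated.mean() - control.mean()) / pooled_sd)

print(f"studies reaching p < 0.05: {len(significant_effects) / n_studies:.3f}")  # about 0.05, as designed
print(f"mean |Cohen's d| among them: {np.mean(significant_effects):.2f}")        # large, despite no real effect
```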

Publication bias. You will meet this formally in Unit 3.11. The short version: scientific journals strongly prefer to publish positive results. A study that finds an effect gets published. A study that finds no effect often does not. This means the published literature is a biased sample of all research ever conducted. The true null results are sitting in file drawers, invisible to anyone reviewing the evidence. When you read a meta-analysis of, say, ten studies all showing the same positive effect, it is possible that thirty null-result studies on the same question were never published. The ten visible studies look like a consensus. They are not.
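
The file-drawer effect can be simulated too. This is a minimal sketch under stated assumptions: the true effect is zero in every study, and a hypothetical journal publishes only positive, significant results. All numbers are illustrative.

```python
# Sketch: if only significant positive results are published, a naive reading of the
# published literature shows a consistent "effect" even when the true effect is zero.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_per_group, n_studies = 30, 5_000
published = []

for _ in range(n_studies):
    control = rng.normal(0, 1, n_per_group)   # the true effect is zero in every study
    treated = rng.normal(0, 1, n_per_group)
    res = stats.ttest_ind(treated, control)
    effect = treated.mean() - control.mean()  # roughly Cohen's d, since the true sd is 1
    if res.pvalue < 0.05 and effect > 0:      # the "journal" accepts only positive, significant results
        published.append(effect)

print(f"studies run: {n_studies}, studies visible in the literature: {len(published)}")
print(f"average published effect: {np.mean(published):.2f}")  # clearly positive, despite a true effect of zero
```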

P-hacking, the exploitation of researcher degrees of freedom. Every study involves dozens of decisions: which participants to include, which outcome measure to use, which covariates to control for, when to stop collecting data. When a researcher, consciously or not, makes these decisions in ways that push the p-value below 0.05, that is p-hacking. It does not require dishonesty. It requires only that the researcher, believing in their hypothesis and wanting a clean result, keeps adjusting the analysis until the number cooperates. A single p-value under 0.05 at the end of this process does not mean what it is supposed to mean.
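
One of those decisions, when to stop collecting data, is enough on its own to break the guarantee the 5 percent threshold is supposed to provide. A minimal sketch, with a stopping rule chosen purely for illustration:

```python
# Sketch: "peek at the data, add participants if p isn't there yet" inflates the
# false positive rate well above the nominal 5%, even with no true effect at all.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n_sims = 5_000
false_positives = 0

for _ in range(n_sims):
    control = list(rng.normal(0, 1, 10))       # true effect is zero
    treated = list(rng.normal(0, 1, 10))
    while True:
        res = stats.ttest_ind(treated, control)
        if res.pvalue < 0.05:                  # stop and report as soon as the result is "significant"
            false_positives += 1
            break
        if len(control) >= 50:                 # give up at 50 per group
            break
        control.extend(rng.normal(0, 1, 5))    # otherwise add 5 more per group and test again
        treated.extend(rng.normal(0, 1, 5))

print(f"false positive rate with optional stopping: {false_positives / n_sims:.3f}")  # well above 0.05
```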

Flexibility in analysis. Related to p-hacking but broader: the more freedom a researcher has to choose what to test and how, the more likely they are to find something that looks significant by chance. Joseph Simmons and colleagues demonstrated in 2011 that by using entirely legitimate, published analytical techniques in sequence, they could produce a statistically significant result showing that listening to the Beatles song “When I’m Sixty-Four” literally made people younger, as measured by their reported birth dates. The data came from real participants; it was the flexible analysis, not fabrication, that produced the impossible conclusion. The point was to show how flexibility can generate publishable-looking nonsense.
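
A closely related degree of freedom is measuring several plausible outcomes and reporting whichever one clears the threshold. The sketch below assumes five independent outcomes and an arbitrary sample size; the arithmetic is the point, not the specific numbers.

```python
# Sketch: testing several outcome measures and reporting whichever one clears
# p < 0.05 pushes the real false positive rate far above the nominal 5%.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n_per_group, n_outcomes, n_sims = 25, 5, 5_000
hits = 0

for _ in range(n_sims):
    # five independent outcome measures, each with a true effect of zero
    control = rng.normal(0, 1, (n_per_group, n_outcomes))
    treated = rng.normal(0, 1, (n_per_group, n_outcomes))
    p_values = [stats.ttest_ind(treated[:, j], control[:, j]).pvalue for j in range(n_outcomes)]
    if min(p_values) < 0.05:                   # report whichever outcome "worked"
        hits += 1

print(f"chance of at least one significant outcome: {hits / n_sims:.3f}")
# roughly 1 - 0.95**5, about 0.23, not the nominal 0.05
```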

Weak theoretical grounding. In social priming research, a whole subfield of psychology, the theoretical basis for many findings was thin to begin with. Researchers claimed that brief, subtle exposures to words or images related to a concept could produce measurable changes in behaviour. People exposed to words associated with “old age” walked more slowly. People primed with the concept of money became less cooperative. Most of these findings have since failed to replicate at any reasonable sample size. The theoretical mechanism was never well-specified, which meant the hypothesis was compatible with almost any outcome and difficult to falsify.

John Ioannidis, a physician and statistician at Stanford, published a paper in 2005 titled “Why Most Published Research Findings Are False.” The title sounds inflammatory. The argument is mathematical. He showed that when you account for the typical statistical power of studies in a field, the prevalence of true effects being tested, and the multiple comparisons problem (Unit 2.9), the probability that a published significant finding reflects a true underlying effect can be well below 50 percent. This is the Bayesian point restated in epidemiological terms: a positive test result is not very informative if the prior probability of the thing being tested is low and the false positive rate is non-negligible. For many research hypotheses, the prior probability that the specific mechanism proposed is real is low. A single p-value under 0.05 is not strong enough evidence to overcome that prior.
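
The core of that argument fits in a few lines of arithmetic. This is a sketch of the reasoning, not a reproduction of Ioannidis’s calculations: the power, significance threshold, and prior probabilities below are illustrative assumptions.

```python
# Sketch of the Bayesian arithmetic: how often a "significant" finding reflects a
# real effect, given statistical power, the alpha threshold, and the prior.
def prob_finding_is_true(prior, power=0.5, alpha=0.05):
    """Probability a significant result reflects a real effect (bias ignored)."""
    true_positives = prior * power          # real effects that are correctly detected
    false_positives = (1 - prior) * alpha   # null effects that cross the threshold anyway
    return true_positives / (true_positives + false_positives)

# Illustrative priors, not values from the 2005 paper:
for prior in (0.5, 0.1, 0.01):
    print(f"prior {prior:>4}: P(real effect | p < 0.05) = {prob_finding_is_true(prior):.2f}")
# prior  0.5: 0.91
# prior  0.1: 0.53
# prior 0.01: 0.09
```

When the prior probability of the hypothesis is low, a single significant result leaves the claim more likely false than true, which is the paper’s central point.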

What does replication failure mean? It is important to be precise here. A failed replication is not proof that the original finding was fraudulent, that the original researchers were incompetent, or that the underlying effect is definitely zero. Science is noisy. Replication requires careful attention to whether the replication genuinely matched the original conditions. Some failed replications have themselves been questioned. What replication failure tells you is that the evidence for the original claim is weaker than the single published study made it appear.

This is the core lesson for how to read a single study. One study, however well-designed, however prestigious the journal, is an initial observation, not a conclusion. It is a prior in a Bayesian framework, one that should be updated by subsequent evidence, including replication attempts. The appropriate response to a single positive finding is “interesting, we should see if this holds up,” not “this is established fact.”

Why It Matters

The replication crisis would be an interesting problem for academics to work through among themselves if the consequences were contained within universities. They are not.

Policy is regularly built on unreplicated findings. The “ego depletion” hypothesis, the claim that willpower is a limited resource that depletes with use, like a muscle, generated an enormous literature with practical implications for workplace design, diet, and education. Dozens of papers reported the effect. When a large pre-registered replication involving 23 laboratories and over 2,000 participants was conducted in 2016, the effect was essentially zero. The implications for policy and self-help advice were significant, but the corrections never reached the audiences who had absorbed the original message.

The problem is structural in science journalism. A press release about a new study generates a news story. The study turns out not to replicate. The failed replication, if it generates any news coverage at all, receives a fraction of the attention of the original. The asymmetry between “new finding” and “finding retracted” is built into the media economy of attention. Novelty sells. Correction does not.

Medical decisions are affected. The Bayer and Amgen replication failures were in preclinical research, the stage before trials in humans. If preclinical findings routinely fail to reproduce, the pipeline of drugs entering clinical trials is partly built on foundations that do not exist. Resources are consumed, trials are conducted, and patients are occasionally harmed by treatments whose underlying rationale was never solid.

How to Spot It

The tell is the single-study claim.

The best-documented high-profile case is ego depletion. The original finding, published by Roy Baumeister and colleagues in 1998, reported that participants who had to resist eating cookies performed worse on a subsequent task than those who had eaten freely. The interpretation: exerting willpower depleted a resource. The effect appeared to replicate repeatedly in small-sample studies over nearly two decades, generating over 600 published papers and substantial commercial spin-off in productivity and management literature.

The 2016 multi-lab replication, which enrolled 2,141 participants across 23 independent laboratories, pre-registered all its analyses in advance, and was designed precisely to test the ego depletion effect, found an overall effect size of virtually zero. The effect estimate was d = 0.04, where the original reported effects were typically in the range of d = 0.62 to d = 0.85. That is not a modest reduction. It is the difference between a real, reproducible phenomenon and noise.

What the ego depletion case reveals is the mechanism by which false positives accumulate into apparent consensus. Small studies with high variance each have a reasonable chance of producing a significant result by luck alone. When researchers are running these studies with genuine belief in the hypothesis, and when journals are publishing only the ones that reach significance, the visible literature converges on an effect that may not exist. A funnel plot, a tool for detecting publication bias, would show the small published studies piling up on the significant side rather than spreading symmetrically around the true effect, but that picture only becomes visible retrospectively, once someone goes looking for the missing null results.
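
A rough simulation shows the shape of that retrospective picture. The sketch assumes a true effect of zero, studies of widely varying size, and a hypothetical journal that publishes only positive significant results; the asymmetry between small and large published studies is exactly what a funnel plot makes visible.

```python
# Sketch: in a literature filtered for positive, significant results, the smallest
# studies report the biggest effects. That asymmetry is what a funnel plot exposes.
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
published = []                                 # (sample size per group, estimated effect)

for _ in range(20_000):
    n = int(rng.integers(15, 200))             # studies of widely varying size
    control = rng.normal(0, 1, n)              # true effect is zero
    treated = rng.normal(0, 1, n)
    res = stats.ttest_ind(treated, control)
    effect = treated.mean() - control.mean()
    if res.pvalue < 0.05 and effect > 0:       # only positive, significant results are published
        published.append((n, effect))

small = [d for n, d in published if n < 50]
large = [d for n, d in published if n >= 100]
print(f"mean published effect, small studies (n < 50):   {np.mean(small):.2f}")
print(f"mean published effect, large studies (n >= 100): {np.mean(large):.2f}")
# Plotting each published effect against its standard error would give the lopsided
# funnel: small, noisy studies show the biggest effects, and the nulls are missing.
```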

The tell in practice: any claim that traces back to a single published study, especially in psychology, nutrition, or social science, especially from a study with a small sample (under 100 participants is a reasonable flag), especially if the finding has not been pre-registered or subsequently replicated in an independent lab, should be treated as preliminary. Not false, but not established.

Additional red flags: the effect is large and surprising; the study was conducted in a single location with a narrow sample; the original paper was published without a registered analysis plan; the finding has generated widespread popular interest before anyone attempted to replicate it.

Your Challenge

A newspaper publishes a story under the headline: “Scientists find that spending just ten minutes in a park reduces cortisol levels by 20 percent, cutting stress risk.”

The story references a single study. The study enrolled 40 university students in one city. It measured cortisol from saliva samples taken before and after participants either sat in a park or sat in an indoor room. The cortisol reduction in the park group was statistically significant at p = 0.03. The effect size was large. The researchers conclude that brief nature exposure has a measurable physiological stress-reduction effect.

Before accepting this as established fact, what questions would you ask? What features of this study should lower your confidence in the headline claim? What would need to be true for you to update your belief substantially?

There is no answer on this page.

References

Power poses original paper: Carney, D.R., Cuddy, A.J.C., and Yap, A.J. (2010). “Power posing: brief nonverbal displays affect neuroendocrine levels and risk tolerance.” Psychological Science 21(10): 1363–1368.

Power poses replication: Ranehill, E., Dreber, A., Johannesson, M., Leiberg, S., Sul, S., and Weber, R.A. (2015). “Assessing the robustness of power posing: no effect on hormones and risk tolerance in a large sample of men and women.” Psychological Science 26(5): 653–656.

Dana Carney’s statement: Carney, D.R. (2016). “My position on ‘power poses’.” Statement published on her UC Berkeley faculty page. URL: http://faculty.haas.berkeley.edu/dana_carney/pdf_My%20position%20on%20power%20poses.pdf

Bayer replication study: Prinz, F., Schlange, T., and Asadullah, K. (2011). “Believe it or not: how much can we rely on published data on potential drug targets?” Nature Reviews Drug Discovery 10(9): 712.

Amgen replication study: Begley, C.G. and Ellis, L.M. (2012). “Drug development: raise standards for preclinical cancer research.” Nature 483(7391): 531–533.

Open Science Collaboration replication project: Open Science Collaboration (2015). “Estimating the reproducibility of psychological science.” Science 349(6251): aac4716. DOI: 10.1126/science.aac4716.

Ioannidis, “Why Most Published Research Findings Are False”: Ioannidis, J.P.A. (2005). PLOS Medicine 2(8): e124. DOI: 10.1371/journal.pmed.0020124.

Ego depletion original paper: Baumeister, R.F., Bratslavsky, E., Muraven, M., and Tice, D.M. (1998). “Ego depletion: is the active self a limited resource?” Journal of Personality and Social Psychology 74(5): 1252–1265.

Ego depletion multi-lab replication: Hagger, M.S. et al. (2016). “A multilab preregistered replication of the ego-depletion effect.” Perspectives on Psychological Science 11(4): 546–573.

False positive rates and researcher degrees of freedom: Simmons, J.P., Nelson, L.D., and Simonsohn, U. (2011). “False-positive psychology: undisclosed flexibility in data collection and analysis allows presenting anything as significant.” Psychological Science 22(11): 1359–1366.

General account of the replication crisis: Ritchie, S. (2020). Science Fictions: How Fraud, Bias, Negligence, and Hype Undermine the Search for Truth. Metropolitan Books.