★ Essential

The Unrepresentative Sample

The politics of who gets studied and who does not. WEIRD psychology populations, male-dominant clinical trials, the Tuskegee legacy, volunteer bias, and attrition bias — and how conclusions drawn from one group get quietly applied to everyone else.

Time: 15 minutes

Opening Hook

In 1977, the US Food and Drug Administration issued guidance recommending that women of childbearing age be excluded from Phase I and early Phase II clinical trials. The reasoning was precautionary: after the thalidomide disaster, nobody wanted to risk fetal harm. The guidance remained in effect until 1993. During those sixteen years, drugs were approved, doses were fixed, and prescribing guidelines were written, almost entirely on the basis of trials conducted on men.

When the FDA finally looked back at drugs approved in that period, a 2001 report found that eight of the ten drugs withdrawn from the market between 1997 and 2000 had presented greater health risks for women than men. The sample had been male. The population taking the drugs was not.

The most vivid specific case came later. Zolpidem, the sleeping drug sold under the brand name Ambien, had been on the market for two decades when the FDA, in 2013, required manufacturers to cut the recommended starting dose for women in half. Pharmacokinetic studies had found that average plasma zolpidem levels are roughly 50 percent higher in women than in men after the same dose, an effect thought to be driven partly by testosterone's influence on the enzyme that breaks the drug down. Women taking the standard 10 mg dose were waking up with enough drug in their blood to impair driving. The drug had been dosed for a body it had not been tested on.
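The dose change can be sanity-checked with back-of-envelope arithmetic. A minimal sketch, using the roughly 50 percent figure above and assuming plasma levels scale linearly with dose (a simplification, not a pharmacokinetic model):

```python
# Normalised plasma levels, using the ~50% sex difference reported above.
# Assumes exposure scales linearly with dose (a simplifying assumption).
male_level_10mg = 1.0                       # reference exposure at 10 mg
female_level_10mg = male_level_10mg * 1.5   # ~50% higher at the same dose
female_level_5mg = female_level_10mg / 2    # the FDA's halved starting dose

print(f"Woman at 10 mg: {female_level_10mg:.2f}x the male reference")
print(f"Woman at 5 mg:  {female_level_5mg:.2f}x the male reference")
```

Halving the dose brings a woman's expected exposure below the male reference level that the original dosing had implicitly assumed.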

This is not an unusual story. It is closer to the standard one.

The Concept

Unit 2.4 introduced the foundational principle: a sample is only useful if it resembles the population you want to draw conclusions about. This unit covers the specific, recurring, and often invisible ways that the people inside studies systematically differ from the people those studies are meant to describe.

WEIRD populations are the first pattern. In 2010, the psychologists Joseph Henrich, Steven Heine, and Ara Norenzayan published a paper in Behavioral and Brain Sciences with a title that became its own critique: “The Weirdest People in the World?” They pointed out that the vast majority of published psychology research was conducted on participants who were Western, Educated, Industrialised, Rich, and Democratic — WEIRD for short. At the time, this population constituted roughly 12 percent of the world’s population but provided around 96 percent of experimental participants in psychology. The problem, as Henrich and colleagues documented, was that WEIRD participants are not a neutral or average sample of humanity. On many of the dimensions psychologists care about most — visual perception, spatial reasoning, moral intuitions, social cooperation, conformity, and the experience of fairness — WEIRD populations are outliers, not norms. The textbooks describing “how humans think” were, in large part, describing how a particular subset of humans think.
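The two percentages above can be combined into a single per-capita comparison. A quick arithmetic sketch; the 12 and 96 percent figures come from the paper, and the ratio is simply their consequence:

```python
weird_share_of_population = 0.12   # WEIRD share of world population
weird_share_of_subjects = 0.96     # WEIRD share of psychology participants

# Per-capita rate of appearing as a study participant, up to a
# common constant that cancels in the ratio.
weird_rate = weird_share_of_subjects / weird_share_of_population
other_rate = (1 - weird_share_of_subjects) / (1 - weird_share_of_population)

print(f"A WEIRD individual is roughly {weird_rate / other_rate:.0f} times "
      "as likely, per capita, to appear in a psychology study.")
```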

The pattern persists. More than a decade after Henrich’s paper, the proportion of psychology research conducted on US university undergraduates, the most convenient sample available to researchers at US institutions, remains stubbornly high. Findings from these samples continue to be described without qualification, as if they were universal.

Clinical trial demographics extend the same problem into medicine, with more immediate consequences. The exclusion of women was formalised by the 1977 FDA guidance, but it predates that document and has outlasted it. Among cardiovascular trials conducted between 2010 and 2017, women made up only 38 percent of participants, despite heart disease being the leading cause of death for women in the developed world. Women experience different symptoms of heart attack from men, respond differently to some interventions, and metabolise drugs differently. When treatment protocols are built on majority-male trials, the female patient is receiving care calibrated to a body not much like hers.

A similar pattern applies across racial and ethnic groups, older patients, and people with multiple conditions at once (a population often explicitly excluded from trials seeking "clean" results). The trial population is frequently younger, whiter, and healthier than the population who will ultimately receive the treatment.

Volunteer bias is the mechanism that produces these imbalances at the individual level. The people who agree to participate in studies differ, in ways that matter, from the people who decline. Volunteers tend to be more health-conscious, more educated, more compliant with instructions, and in better health to begin with. This is sometimes called the “healthy volunteer effect.” It means that when a trial finds good outcomes in its participants, those outcomes may be partly a product of who the participants were, not solely of the treatment being tested. The treatment looks better than it would if applied to the full, less-selected population.
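The mechanism can be made concrete with a small simulation. This is an illustrative sketch, not a model of any real trial: the baseline health score is invented, and the rule linking health to volunteering is an assumption.

```python
import random

random.seed(0)

# Invented population: each person has a baseline health score.
population = [random.gauss(50, 10) for _ in range(100_000)]

# Healthy-volunteer effect (assumed rule): the probability of
# enrolling rises with baseline health.
volunteers = [h for h in population if random.random() < h / 100]

TREATMENT_EFFECT = 5  # the treatment adds the same 5 points for everyone

trial_mean = sum(h + TREATMENT_EFFECT for h in volunteers) / len(volunteers)
everyone_mean = sum(h + TREATMENT_EFFECT for h in population) / len(population)

print(f"Mean outcome among trial volunteers:   {trial_mean:.1f}")
print(f"Mean outcome if everyone were treated: {everyone_mean:.1f}")
```

The trial mean comes out higher even though the treatment effect is identical for every person; the entire gap is produced by who enrolled.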

Volunteer bias also has a social dimension. Participation in research requires, at minimum, availability, transport, and willingness to engage with medical institutions. Groups who lack these things, or who have historical reasons to distrust medical institutions, are systematically underrepresented. The literature produced by such studies then shapes treatment guidelines that apply to those absent groups.

The Tuskegee effect is the most documented case of how past misconduct shapes present participation. Between 1932 and 1972, the US Public Health Service conducted a study on 399 Black men in Alabama who had syphilis. The men were told they had “bad blood” and were offered various minor treatments, but they were not given penicillin even after it became the standard cure in 1945. The study’s purpose was to observe the natural progression of untreated syphilis. It continued for four decades while the men went untreated, infected their partners, and died. By the time a whistleblower forced it into public view in 1972, 28 men had died directly from syphilis, 100 had died from related complications, 40 wives had been infected, and 19 children had been born with congenital syphilis.

The disclosure of the study produced a measurable public health consequence. Research published by economists Marcella Alsan and Marianne Wanamaker found that life expectancy at age 45 for Black men fell by up to 1.4 years in the years following the 1972 revelation, accounting for roughly 35 percent of the life expectancy gap between Black and white men in 1980. The mechanism was reduced engagement with the healthcare system. Men who had learned not to trust medical institutions stayed away from them, and the health costs were real and lasting.

This is what makes unrepresentative sampling, in some of its forms, not merely a methodological inconvenience. The absence of certain groups from studies is not random. It has structural causes, and some of those causes are the direct legacy of the research enterprise’s own history of exploitation.

Attrition bias is the cousin of volunteer bias, operating not at recruitment but during the study itself. Attrition means dropout: participants who leave the study before it concludes. The problem arises when dropout is not random. If the people who drop out of a trial are the ones who are experiencing side effects, or the ones who are not improving, or the ones for whom the intervention is most burdensome, then the sample that completes the trial is systematically different from the sample that started it. The findings are then the findings for people who stayed — who may be, as a group, more tolerant, more committed, or already doing better. The analysis, if it ignores dropout patterns, overstates the benefit and understates the difficulty.
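A companion sketch for attrition, under the same caveats (invented numbers, an assumed dropout rule): when participants who improve less drop out more often, the completers-only average overstates the benefit.

```python
import random

random.seed(1)

# Invented improvement scores for 10,000 enrolled participants.
improvement = [random.gauss(2, 10) for _ in range(10_000)]

def stays(x):
    # Assumed dropout rule: people who are not improving are far
    # more likely to leave before the trial ends.
    dropout_prob = 0.5 if x < 0 else 0.1
    return random.random() > dropout_prob

completers = [x for x in improvement if stays(x)]

true_mean = sum(improvement) / len(improvement)
completer_mean = sum(completers) / len(completers)

print(f"Mean improvement, everyone enrolled: {true_mean:.1f}")
print(f"Mean improvement, completers only:   {completer_mean:.1f}")
```

An analysis restricted to completers reports roughly double the true average improvement, because the people doing worst have selectively vanished from the data.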

Why It Matters

The drug dosing story is the clearest illustration of direct, measurable harm from an unrepresentative sample. When a drug’s dose is calibrated to a male body and prescribed unchanged to a female body, the woman is an unintended subject of a different experiment. No one consented to that experiment. It was the default consequence of whose data was used to set the dose.

Psychology findings that do not generalise create quieter but similarly pervasive harm. When theories of cognition, moral development, social behaviour, or mental health are built on WEIRD samples and exported universally, they become embedded in interventions, educational systems, and clinical practices applied to populations that were never studied. A therapy developed and tested on white, English-speaking, middle-class North American adults may or may not work in the same way for a recent immigrant from rural Southeast Asia. The question was not asked, because the sample was not built to ask it.

Nutrition research sits in a particularly difficult position. Most large dietary studies rely on what are called food frequency questionnaires (FFQs): instruments that ask participants to recall what they have eaten over the past week, month, or year. The problems with this method are well-documented. Memory is unreliable over long periods. People systematically underreport foods they consider unhealthy and overreport foods they consider healthy. The act of being asked about food changes how people describe their eating. On top of these measurement problems, the participants who enrol in long-term dietary studies are, again, volunteers — more health-conscious, more engaged, and with more stable lives than average. When such studies generate findings about diet and health, those findings rest on recalled data from an unrepresentative sample. The confident headlines that follow often describe the sample, not the world.
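The effect of recall error on a diet-health association can also be sketched. This simulation is illustrative only: the numbers are invented, and "reported intake" bundles together the two problems described above, systematic underreporting and random recall noise.

```python
import random

random.seed(2)

n = 50_000
true_intake = [random.gauss(2000, 300) for _ in range(n)]

# The outcome genuinely depends on true intake (invented slope of 0.01).
outcome = [0.01 * t + random.gauss(0, 5) for t in true_intake]

# FFQ report: systematic underreporting (x0.8) plus random recall noise.
reported = [0.8 * t + random.gauss(0, 300) for t in true_intake]

def slope(x, y):
    # Ordinary least-squares slope of y on x.
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var = sum((a - mx) ** 2 for a in x)
    return cov / var

print(f"Slope using true intake:     {slope(true_intake, outcome):.4f}")
print(f"Slope using reported intake: {slope(reported, outcome):.4f}")
```

The estimated association shrinks toward zero, a phenomenon known as regression dilution: a study built on reported intake understates the real relationship, and more complicated error structures can distort it further.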

How to Spot It

In November 2000, the New England Journal of Medicine published results from a large clinical trial called VIGOR (the Vioxx Gastrointestinal Outcomes Research trial). The trial showed that Vioxx, a painkiller, produced less gastrointestinal bleeding than a full dose of naproxen. The trial excluded patients with a history of cardiovascular disease and required participants to be free of significant comorbidities. It was, in other words, a trial conducted on a relatively healthy, carefully filtered population.

The trial also found an elevated rate of heart attacks in the Vioxx arm, though the signal was initially attributed to a cardioprotective effect of naproxen rather than to harm from Vioxx. Vioxx was withdrawn from the market in 2004, after a subsequent trial confirmed the cardiovascular risk. The FDA estimated that between 88,000 and 140,000 Americans had suffered serious coronary heart disease as a result of Vioxx before withdrawal. The population that actually took the drug, older patients with arthritis and pre-existing cardiovascular risk, was not the population that had been carefully selected for the trial.

The tell for unrepresentative sampling is always in the methods section, which most people never read. Look for the phrase “inclusion criteria” and its companion “exclusion criteria.” Every trial has both. Exclusion criteria tell you who was considered too complicated, too ill, too young, too old, or too risky to include. The more extensive the exclusion list, the further the trial population sits from the general population who will eventually receive the treatment.

For non-clinical research, the tell is the description of who the participants were. If it says "undergraduate psychology students at [university]," the sample is 18- to 22-year-olds with above-average education, selected for availability, and not representative of most of the world. If it says "self-selected participants recruited via social media," add self-selection bias to the list. If a nutrition study says "participants completed a food frequency questionnaire at baseline," you are looking at self-reported dietary recall from a volunteer population.

None of this necessarily invalidates the finding. It constrains it. A result from an unrepresentative sample tells you something. It tells you about the sample. Whether it tells you about the broader population is a separate question, and it is one the abstract usually does not answer.

Your Challenge

A study is published examining the psychological effects of remote working on job satisfaction and mental health. The researchers recruited participants by posting notices on professional networking platforms and in university alumni groups. Participants completed an online questionnaire over four weeks. The study reports that remote working is associated with higher job satisfaction and lower anxiety than office working, across a wide age range.

Before you accept this as a finding about remote workers in general, what do you know about the sample? Who was likely to see the recruitment notice, to have the time and inclination to participate, and to complete four weeks of questionnaires? In what specific ways might these participants differ from remote workers who were not in the study? And what, if anything, does the study actually tell you?

There is no answer on this page. That is the point.

References

FDA 1977 guidance on exclusion of women from early clinical trials, and the 2001 General Accounting Office report finding elevated health risks for women in withdrawn drugs: US General Accounting Office, “Drug Safety: Most Drugs Withdrawn in Recent Years Had Greater Health Risks for Women” (January 2001). GAO-01-286R. Available at https://www.gao.gov/products/gao-01-286r

Zolpidem dosing and the FDA’s 2013 label change: US Food and Drug Administration, Drug Safety Communication, “FDA approves new label changes and dosing for zolpidem products and a recommendation to avoid driving the day after using Ambien CR” (January 2013). https://www.fda.gov/drugs/drug-safety-and-availability/fda-drug-safety-communication-fda-approves-new-label-changes-and-dosing-zolpidem-products-and

Sex differences in zolpidem metabolism: Rubio-Vela, T. et al., “Effect of CYP3A4 metabolism on sex differences in the pharmacokinetics and pharmacodynamics of zolpidem,” PMC (2021). https://pmc.ncbi.nlm.nih.gov/articles/PMC8476623/

Underrepresentation of women in cardiovascular trials: Dent, M.P. et al., reporting that 38.2% of participants in cardiovascular trials 2010–2017 were women, summarised at Northwestern Now: https://news.northwestern.edu/stories/2021/06/women-and-men-are-underrepresented-in-clinical-trials

Women and adverse drug reactions (1.5 to 1.7 times greater risk): summarised in Medidata, “History of Women in Clinical Trials: Overcoming Bias and Exclusion.” https://www.medidata.com/en/life-science-resources/medidata-blog/women-in-clinical-trials-history/

WEIRD populations: Henrich, J., Heine, S.J., and Norenzayan, A., “The Weirdest People in the World?” Behavioral and Brain Sciences 33, nos. 2–3 (2010): 61–83. https://www.cambridge.org/core/journals/behavioral-and-brain-sciences/article/abs/weirdest-people-in-the-world/BF84F7517D56AFF7B7EB58411A554C17. For the Nature summary: “Most people are not WEIRD,” Nature 466 (2010): 29. https://www.nature.com/articles/466029a

The Tuskegee syphilis study: CDC, “About The Untreated Syphilis Study at Tuskegee.” https://www.cdc.gov/tuskegee/about/index.html. For the mortality legacy: Alsan, M. and Wanamaker, M., “Tuskegee and the Health of Black Men,” Quarterly Journal of Economics 133, no. 1 (2018): 407–455. https://pmc.ncbi.nlm.nih.gov/articles/PMC6258045/

Volunteer bias and the healthy volunteer effect: Catalog of Bias, Oxford EBM, “Volunteer Bias.” https://catalogofbias.org/biases/volunteer-bias/. Attrition bias: Catalog of Bias, Oxford EBM, “Attrition Bias.” https://catalogofbias.org/biases/attrition-bias/

Vioxx/rofecoxib cardiovascular risk and FDA withdrawal: US FDA, “Vioxx (rofecoxib) Questions and Answers” (2004). FDA estimated 88,000–140,000 cases of serious coronary heart disease. Multiple accounts including Graham, D.J. et al., “Risk of acute myocardial infarction and sudden cardiac death in patients treated with cyclo-oxygenase 2 selective and non-selective non-steroidal anti-inflammatory drugs,” The Lancet 365, no. 9458 (2005): 475–481.

Food frequency questionnaire reliability: Dietary Assessment Primer, National Cancer Institute, “Principles Underlying Recommendations.” https://dietassessmentprimer.cancer.gov/approach/principles.html. For systematic comparison with biomarkers: Park, Y. et al., American Journal of Clinical Nutrition (2022), showing FFQs underperform 24-hour recalls against recovery biomarkers. https://www.sciencedirect.com/science/article/pii/S0002916522027459