Descriptive Statistics: The Full Picture
Mean and median were the start. This unit adds variance, standard deviation, percentiles, quartiles, the five-number summary, box plots, and weighted averages — the full toolkit for understanding what a dataset actually looks like, and for noticing when someone has shown you only part of it.
Opening Hook
In 2014, the Office for National Statistics reported that mean annual earnings for full-time employees in the UK were £27,200. That is a respectable number. It sounds like a solid middle ground. What the headline did not feature was the median: £22,044. The gap between those two figures is £5,156, and it tells you something the headline number conceals entirely. A small number of people earn very high salaries, and those salaries pull the mean upward. The typical worker, the one sitting exactly in the middle of the pile when you sort everyone by pay, earns considerably less than the average you were given.
This is the problem that descriptive statistics were designed to solve. A single summary number, however accurately calculated, can hide the shape of the data behind it. This unit gives you the rest of the toolkit.
The Concept
You already know that the mean is the sum divided by the count, and that the median is the middle value when the data is sorted. You know they can diverge sharply in skewed datasets, and you know why that matters. What you do not yet have is a way to describe how spread out the data is, which parts of the range contain most of the values, and where a particular value sits relative to everyone else. That is what we are building here.
Variance and standard deviation measure how far values scatter around their centre.
Variance is the average of the squared distances from the mean. You take each value, subtract the mean, square the result, then average all those squared differences. The squaring does two things: it makes all the differences positive (so that values above the mean and values below it do not cancel each other out), and it gives extra weight to values that are far from the centre.
Standard deviation is the square root of variance. The reason we take the square root is practical: variance is measured in squared units, which are awkward to interpret. If your data is in pounds, variance is in pounds squared, which is meaningless. Standard deviation brings it back to the original units.
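Both definitions translate directly into a few lines of code. Here is a minimal Python sketch with invented exam scores; it computes the population variance (dividing by the count, exactly as described above) rather than the sample variance, which divides by the count minus one:

```python
import math

def variance(values):
    # Population variance: the average of the squared distances from the mean
    m = sum(values) / len(values)
    return sum((x - m) ** 2 for x in values) / len(values)

def std_dev(values):
    # Square root of the variance, bringing the result back to the original units
    return math.sqrt(variance(values))

scores = [65, 68, 70, 72, 75]  # invented exam scores
print(variance(scores))  # 11.6, in squared marks
print(std_dev(scores))   # about 3.41, back in marks
```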
A small standard deviation means the values cluster tightly around the mean. A large standard deviation means they are spread wide. Two datasets can have identical means and completely different standard deviations. A classroom where everyone scores between 65 and 75 on an exam and a classroom where scores range from 20 to 100 have very different characters, even if their means coincide. The standard deviation is what captures that difference.
Percentiles tell you where a given value sits within the distribution.
If your exam score is at the 80th percentile, it means 80 percent of scores in the dataset fall at or below yours. The 50th percentile is the median: half the values are below, half above. Percentiles are useful because they give you position rather than distance. Knowing that a salary is £45,000 tells you what it is. Knowing it is at the 75th percentile tells you how it compares to everyone else.
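The definition of a percentile rank is simple enough to compute by hand. A sketch with invented scores; this uses "at or below", matching the definition above, though conventions vary slightly at the boundaries:

```python
def percentile_rank(values, x):
    # Percentage of values in the dataset that fall at or below x
    return 100 * sum(v <= x for v in values) / len(values)

scores = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]  # invented
print(percentile_rank(scores, 80))  # 80.0: eight of the ten values are at or below 80
```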
The 25th percentile is called the first quartile, often written Q1. The 50th percentile is the second quartile, Q2, which is the median. The 75th percentile is the third quartile, Q3. Quartiles divide the sorted dataset into four equal-sized groups.
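Python's standard library computes quartiles directly. A sketch with invented salary figures; note that `statistics.quantiles` offers more than one interpolation method, and different software packages can give slightly different quartile values for the same data:

```python
import statistics

salaries = [21, 25, 28, 30, 33, 35, 38, 41, 45, 52, 60]  # invented, in £000s

q1, q2, q3 = statistics.quantiles(salaries, n=4, method="inclusive")
print(q1, q2, q3)  # 29.0 35.0 43.0
# Q2 is the median: statistics.median(salaries) gives the same 35
```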
The interquartile range (IQR) is the distance between the first and third quartiles: Q3 minus Q1. It tells you how wide the middle half of the distribution is. Because it focuses on the central 50 percent of the data, the IQR is resistant to extreme values in a way that the standard deviation is not. One very high outlier will inflate the standard deviation substantially, but it will not move Q1 or Q3 at all. This makes the IQR a robust measure of spread, particularly useful when the data is skewed or contains outliers.
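A small experiment makes the robustness claim concrete. Replacing the largest value in an invented dataset with an extreme outlier inflates the standard deviation roughly twentyfold while leaving Q1, Q3, and therefore the IQR, untouched:

```python
import statistics

def iqr(values):
    # Interquartile range: the width of the middle half of the data
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return q3 - q1

base = [40, 42, 44, 46, 48, 50, 52, 54, 56, 58, 60]  # invented
with_outlier = base[:-1] + [500]                     # same data, top value now extreme

print(statistics.pstdev(base), statistics.pstdev(with_outlier))  # ~6.3 vs ~129.8
print(iqr(base), iqr(with_outlier))                              # 10.0 vs 10.0
```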
The five-number summary assembles the five values that together give you a compact, honest picture of any dataset:
- The minimum: the smallest value in the dataset
- Q1: the 25th percentile
- The median: the 50th percentile
- Q3: the 75th percentile
- The maximum: the largest value in the dataset
These five numbers show you the centre of the data, where the bulk of values fall, how the distribution is shaped, and whether there are extreme values at either end. They are much harder to manipulate than a single mean, because they make the shape of the data visible.
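All five numbers can be computed in a few lines. A minimal Python sketch with invented data:

```python
import statistics

def five_number_summary(values):
    # Minimum, Q1, median, Q3, maximum: the compact picture of a distribution
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return min(values), q1, median, q3, max(values)

data = [21, 25, 28, 30, 33, 35, 38, 41, 45, 52, 60]  # invented, in £000s
print(five_number_summary(data))  # (21, 29.0, 35.0, 43.0, 60)
```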
Box plots are a graphical representation of the five-number summary. A box is drawn from Q1 to Q3, with a line inside at the median. Lines called whiskers extend from the box to the minimum and maximum (with some conventions placing the whiskers at 1.5 times the IQR from each quartile and showing outliers as individual points beyond that). The result is a compact picture of the whole distribution. The width of the box shows you the IQR. A median line close to the bottom of the box tells you the lower half of the data is more compressed than the upper half. A box with a very long upper whisker tells you there are high-value outliers pulling the distribution rightward. You can learn a lot from a box plot at a glance that a mean and standard deviation would not reveal.
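The 1.5 times IQR whisker convention is straightforward to compute. A sketch with invented data containing one extreme value; this finds the fences and the outliers but does not draw the plot itself:

```python
import statistics

def tukey_fences(values):
    # Whisker limits at 1.5 x IQR beyond each quartile (Tukey's convention);
    # points beyond the fences are drawn as individual outliers
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    spread = q3 - q1
    low, high = q1 - 1.5 * spread, q3 + 1.5 * spread
    return low, high, [x for x in values if x < low or x > high]

data = [21, 25, 28, 30, 33, 35, 38, 41, 45, 52, 120]  # invented, in £000s
low, high, outliers = tukey_fences(data)
print(low, high, outliers)  # 8.0 64.0 [120]
```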
Weighted averages address a problem that simple means cannot.
An ordinary mean gives equal weight to every value in the calculation. Sometimes that is wrong. If a course has three components and they contribute 20 percent, 30 percent, and 50 percent of the final grade, a simple average of the three component scores would give each component an equal one-third weight. That does not match the course structure. A weighted average multiplies each value by its assigned weight, sums the results, and divides by the total weight.
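The course-grade case works out like this. A sketch in which the component scores are invented and the weights are the 20/30/50 split from the text:

```python
def weighted_average(values, weights):
    # Multiply each value by its weight, sum, and divide by the total weight
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

component_scores = [80, 60, 70]  # invented scores for the three components
weights = [0.2, 0.3, 0.5]        # the 20/30/50 course structure

print(weighted_average(component_scores, weights))    # 69.0
print(sum(component_scores) / len(component_scores))  # simple mean: 70.0, ignoring the structure
```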
Weighted averages appear everywhere: GDP figures, inflation indices, stock market indices, school performance league tables. In each case, the weights are choices, and different weight choices produce different results. A house price index weighted by number of sales tells a different story than one weighted by total transaction value, because expensive properties make up a small fraction of sales but a large fraction of total value. The question to ask of any weighted average is: what are the weights, and whose interests did those choices serve?
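The house-price contrast can be made concrete with invented figures. Both results below are weighted averages of the same transactions; only the weights differ, and the choice of weights decides which number gets published:

```python
# Invented quarter: 90 sales at £200,000 and 10 sales at £2,000,000
prices = [200_000] * 90 + [2_000_000] * 10

# Each sale counts equally (weight = 1 per transaction)
sales_weighted = sum(prices) / len(prices)

# Each pound of turnover counts equally (weight = transaction value)
value_weighted = sum(p * p for p in prices) / sum(prices)

print(sales_weighted)         # 380000.0
print(round(value_weighted))  # 1147368
```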
The relationship between mean, median, and skew becomes clearer once you have the full toolkit. In a symmetric distribution, the mean and median are very close together. When a distribution has a long tail to the right, called right-skew or positive skew, the mean gets pulled up toward the outliers while the median stays where the bulk of the data is. Income and wealth distributions are almost always right-skewed: most people earn or own modest amounts, but the extreme values at the top pull the mean well above what any typical person experiences. When someone gives you a mean for right-skewed data, they are, whether intentionally or not, giving you a number that overstates the typical.
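A small invented dataset shows the pull. Most values are modest, two are extreme, and the mean lands well above anything the typical person experiences while the median stays with the bulk:

```python
import statistics

incomes = [22, 24, 25, 26, 27, 28, 30, 32, 35, 40, 150, 400]  # invented, in £000s

print(statistics.mean(incomes))    # about 69.9: pulled up by the two extreme values
print(statistics.median(incomes))  # 29.0: where the bulk of the data sits
```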
Why It Matters
The choice of which summary to publish is not a neutral technical decision. It is an editorial one, and it is made in the context of someone’s interests.
Performance pay schemes provide a clean example. A company announces that average bonus payments last year were £12,000. That is the mean. If you ask for the median, you might find it was £4,500. The divergence happens because a small number of very senior people received bonuses of £100,000 or more, which pulled the mean far above where most employees actually landed. The mean figure is not false. But it creates an impression of generosity that the typical employee’s experience does not match.
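The figures in that example are mutually consistent, and it is worth seeing how. In this invented workforce of 100 people, 95 receive £4,500 and 5 senior people receive £154,500, a figure chosen so the mean comes out at exactly £12,000:

```python
import statistics

bonuses = [4_500] * 95 + [154_500] * 5  # invented workforce of 100 people

print(statistics.mean(bonuses))    # 12000.0: the headline "average bonus"
print(statistics.median(bonuses))  # 4500.0: what the typical employee received
```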
Salary disclosure regulations in the UK now require companies to report their gender pay gap using both mean and median. This was a deliberate policy choice because the two measures tell different stories. The mean gap is typically larger than the median gap, because highly paid positions are more male-dominated and those salaries have a bigger weight in the mean calculation. Publishing only the mean or only the median would produce a different political conversation about the same underlying data.
Exam results illustrate a third pattern. A school publishing its A-level results will frequently lead with the percentage of A and A* grades, or with the mean grade. What it is much less likely to publish is the full distribution: how many students got U grades, what the lowest decile looks like, whether the spread has changed from year to year. A school with a high mean but a wide spread may be doing very well for its top students and failing the rest. The five-number summary would show you this. The mean alone does not.
The general principle is this: whoever publishes a summary statistic has made choices about what to show and what to omit. A mean with no measure of spread is a floor plan with no walls. It tells you something, but it hides the structure of the space.
How to Spot It
The documented case that crystallised this problem for statisticians is the reporting of executive pay in the United States.
The requirement dates to 2010, when the Dodd-Frank Act directed the Securities and Exchange Commission to make US public companies disclose the ratio of CEO pay to median worker pay. The rule was resisted for years: the SEC did not adopt it until 2015, and the disclosure requirement only took effect in 2018. When companies began publishing their figures, the median ratios revealed something the mean-based conversations had obscured. The company with the highest disclosed ratio in the first year of reporting was Mattel, where the CEO earned 4,987 times the median worker’s annual compensation. Other large companies showed ratios of several hundred to one.
Before median disclosure was mandated, these companies routinely reported mean compensation figures for their workforces, which included the executive salaries and compressed the ratio substantially. The SEC rule forced the median into public view. The tell in that story is the resistance to the change: organisations that fight against median disclosure in favour of mean disclosure are, as a matter of arithmetic, organisations whose mean is substantially higher than their median. That gap is the information they do not want you to have.
More generally, the tell for quartile suppression is this: you are given a central tendency figure (usually mean) and either a maximum or a “top performer” figure, with nothing in between. The full distribution is absent. Someone who earned £75,000 in a role where the median is £45,000 and the 75th percentile is £60,000 is a genuine outlier. Someone who earned £75,000 where the median is £68,000 is slightly above average. Without the five-number summary, you cannot tell which story you are being given.
Your Challenge
A recruitment firm publishes a quarterly report on salaries in the technology sector. The headline figure reads: “Average salary placed: £78,500. Top placement this quarter: £220,000.”
You are considering a career move and this report has caught your eye.
What specific information is missing from this summary? What five numbers would you need to form an honest picture of salary distribution in this sector? If the firm told you the median placement was £54,000, what would that tell you about the shape of the distribution? And what further question would you ask about the weights behind that “average” figure?
There is no answer on this page. That is the point.
References
UK earnings data, mean vs median divergence: Office for National Statistics, “Annual Survey of Hours and Earnings” (2014). The 2014 ASHE reported mean full-time earnings of £27,200 and median full-time earnings of £22,044. ONS ASHE data is published annually at https://www.ons.gov.uk/employmentandlabourmarket/peopleinwork/earningsandworkinghours/bulletins/annualsurveyofhoursandearnings
Gender pay gap reporting and mean vs median: UK Government Equalities Office, “Gender pay gap reporting: overview.” The requirement to report both mean and median gender pay gaps under the Equality Act 2010 (Gender Pay Gap Information) Regulations 2017 is described at https://www.gov.uk/guidance/gender-pay-gap-reporting-overview
SEC CEO pay ratio disclosure rule: US Securities and Exchange Commission, “Pay Ratio Disclosure,” 17 CFR Parts 229 and 249 (August 2015, effective 2018). Background on the Dodd-Frank Section 953(b) requirement at https://www.sec.gov/rules/final/2015/33-9877.pdf
Mattel pay ratio (4,987:1): first year of Dodd-Frank pay ratio disclosures reported by the AFL-CIO, “Executive Paywatch” database (2018), and by the Economic Policy Institute, “CEO compensation has grown 940% since 1978” (August 2019) at https://www.epi.org/publication/ceo-compensation-2018/
Five-number summary and box plots: Tukey, J.W., Exploratory Data Analysis (Addison-Wesley, 1977). Tukey introduced the box-and-whisker plot in this volume. A clear reference treatment is also in Spiegelhalter, D., The Art of Statistics (Pelican, 2019), Chapter 2.
Weighted averages and index construction: Office for National Statistics, “Consumer Price Indices Technical Manual” (2019), which describes how expenditure weights are constructed for the CPI. Available at https://www.ons.gov.uk/economy/inflationandpriceindices/methodologies/consumerpriceindicestechnicalmanual2019