Lecture Notes on Simpson's Paradox | MATH 243 | Assignments Probability and Statistics

MATH 243, LECTURE 7

1. Simpson’s paradox

We will not be covering most of the material from chapter 6. But it is useful to be aware of Simpson’s
paradox.

Fact 1 (Simpson’s paradox). It is possible for one individual to outperform another in every category
measured and yet perform worse in the aggregate.

Example 2. Let us look at flight delays for two of our local carriers, Alaska Airlines and America West
Airlines, the former of which has a hub in Seattle, the latter in Phoenix. At their hubs we have the following
data:

        Alaska on time   Alaska delayed   AW on time   AW delayed
PHX          221               12            4840          415
SEA         1841              305             201           61

Calculate the on-time percentage of each airline at each airport. Calculate the on-time percentage over
both airports. Explain what you see.
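The calculation asked for above can be sketched in a few lines of Python. This is only one way to organize it; the names `flights`, `on_time_pct`, and `summarize` are illustrative and do not come from the notes.

```python
# Counts from the table in Example 2: (on time, delayed) per airline and airport.
flights = {
    "Alaska": {"PHX": (221, 12), "SEA": (1841, 305)},
    "America West": {"PHX": (4840, 415), "SEA": (201, 61)},
}


def on_time_pct(on_time, delayed):
    """Percentage of flights that were on time."""
    return 100.0 * on_time / (on_time + delayed)


def summarize(counts_by_airport):
    """Return per-airport on-time percentages and the combined percentage."""
    per_airport = {
        airport: on_time_pct(ot, d) for airport, (ot, d) in counts_by_airport.items()
    }
    total_on_time = sum(ot for ot, _ in counts_by_airport.values())
    total_delayed = sum(d for _, d in counts_by_airport.values())
    return per_airport, on_time_pct(total_on_time, total_delayed)


results = {airline: summarize(counts) for airline, counts in flights.items()}

for airline, (per_airport, combined) in results.items():
    for airport, pct in per_airport.items():
        print(f"{airline:13s} {airport}: {pct:5.1f}% on time")
    print(f"{airline:13s} both airports: {combined:5.1f}% on time")
```

Alaska comes out ahead at each airport (about 94.8% against 92.1% at PHX, and 85.8% against 76.7% at SEA), yet America West is ahead over both airports combined (about 91.4% against 86.7%). Most Alaska flights in this table are at SEA, where both airlines are delayed more often, while most America West flights are at PHX.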

It is easy to see how Simpson’s paradox works if we look at a simple enough example. Suppose Dick
and Jane both take MA 243 and (somehow!) negotiate different weightings to compute their
final grades. Dick has HW count 10% and the final exam count 90%, and Jane has HW count 90% and the
final exam count 10%. They get A and A-, respectively, on their HW, and C and C-, respectively, on their
final exams. So Dick has scored better on both. But he ends up with a C+ and Jane with a B+.
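To check the arithmetic in the Dick and Jane example, here is a minimal sketch. The notes do not say how letter grades convert to numbers, so the 4.0-style scale in `GRADE_POINTS` is an assumption, and `final_grade_points` is a hypothetical helper name.

```python
# ASSUMPTION: a common 4.0 grade-point scale; the notes do not specify one.
GRADE_POINTS = {"A": 4.0, "A-": 3.7, "C": 2.0, "C-": 1.7}


def final_grade_points(hw_grade, exam_grade, hw_weight):
    """Weighted average of HW and final exam, with weights summing to 1."""
    exam_weight = 1.0 - hw_weight
    return hw_weight * GRADE_POINTS[hw_grade] + exam_weight * GRADE_POINTS[exam_grade]


dick = final_grade_points("A", "C", hw_weight=0.10)    # HW 10%, final exam 90%
jane = final_grade_points("A-", "C-", hw_weight=0.90)  # HW 90%, final exam 10%

print(f"Dick: {dick:.2f} grade points")
print(f"Jane: {jane:.2f} grade points")
```

Under this assumed scale Dick lands near 2.2 and Jane near 3.5, which is close to the C+ and B+ quoted in the text. Dick's better marks are swamped because nearly all of his weight sits on the component where both students did poorly.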

Some useful terminology, if thinking in terms of percentages: there are two ways to take an average,
a weighted average, which depends on the sample sizes, and a “straight” average of percentages (which
really is not so straight). The weighted average is the one which calculates the true percentage, but it
is susceptible to Simpson’s paradox. A “straight” average behaves predictably in this respect (whoever
does better in every category also has the better straight average), but the final answer depends on
the categories by which the data has been broken down.
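The two kinds of average can be compared directly on the Alaska Airlines counts from Example 2. The sketch below is illustrative only; the function names `weighted_average` and `straight_average` are chosen here and are not from the notes.

```python
def weighted_average(groups):
    """Pooled proportion: total successes divided by total trials."""
    successes = sum(s for s, _ in groups)
    trials = sum(n for _, n in groups)
    return successes / trials


def straight_average(groups):
    """Unweighted mean of the per-group proportions."""
    return sum(s / n for s, n in groups) / len(groups)


# Alaska Airlines from Example 2, as (on-time flights, total flights).
alaska_by_airport = [(221, 221 + 12), (1841, 1841 + 305)]   # PHX, SEA
alaska_one_category = [(221 + 1841, 221 + 12 + 1841 + 305)]  # both airports lumped

weighted_split = weighted_average(alaska_by_airport)
weighted_single = weighted_average(alaska_one_category)
straight_split = straight_average(alaska_by_airport)
straight_single = straight_average(alaska_one_category)

print(f"weighted, by airport:   {weighted_split:.4f}")
print(f"weighted, one category: {weighted_single:.4f}")
print(f"straight, by airport:   {straight_split:.4f}")
print(f"straight, one category: {straight_single:.4f}")
```

The weighted average is the same however the flights are grouped, since it is always total on-time flights over total flights. The straight average moves from about 0.87 with a single category to about 0.90 when the data is split by airport, which is the dependence on the breakdown described above.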

2. Looking at how data is produced

So far we have taken data as given and analyzed it.

For single variables, we have found the mean, median, and quartiles, and seen the standard deviation. If the
variable is normally distributed, we can find answers to more detailed questions about percentiles.

For many variables, we have taken them two at a time and compared them through scatterplots. Using
the value r and the regression line, we have looked for positive and negative correlations.

But so far we have not questioned how good our data is. That is an important question since data
analysis, like any analysis, obeys the maxim “garbage in, garbage out.” We now start to talk about
generating reliable data.

3. Sampling

Suppose you have a question you wish to answer about a large population. For example,

• What percentage of Americans think George Bush is doing a good job?
• What percentage of Americans know that the Earth orbits the sun in a year?
• What percentage of Oregonians are obese?
