











This chapter focuses on how to design experiments. We bring the assumptions we’ve learned into play and discuss key ideas and principles of good experimental design. Good design concepts are best illustrated by examples, so we provide many in this chapter, mostly involving clinical trials or improving education in developing countries.
While experimental design is a broad topic that is often difficult to get right, there are a few guiding principles that all good designs are built on top of:
sense to compare against a standard treatment currently in use (but that perhaps isn’t very effective) rather than a placebo. Despite a placebo being a sham treatment, it can often actually make a subject feel better compared to not giving any treatment at all! This is explained in the next example panel.
Beware the placebo effect! When you apply a treatment to a subject, that treatment may be ineffective, but can still produce a significant effect simply due to the existence of the treatment. For example, a fake placebo surgery can actually do as well as a common knee surgery! [a] And giving people nonalcoholic drinks but telling them that the drinks are alcoholic can result in a decline in memory powers! [b]
Because of the placebo effect, development of medical treatments demands the stronger standard of outperforming a placebo rather than outperforming no treatment at all, since a placebo alone could already result in a startling improvement in a test subject’s condition, often largely due to psychological factors.
To examine the benefit of a placebo, an experiment could have a control group that receives no treatment, a placebo group that receives a sham treatment, and a treatment group that receives the actual treatment under study. By doing this, the experimenter can measure both the effect of the placebo over the baseline and the effect of the treatment over the placebo.
Closely related to the placebo effect is the Hawthorne effect: in a behavioral study, the behavior of subjects might be due to their reaction to being studied. In a famous experiment in the early 20th century, a factory called the Hawthorne Works wanted to measure the effect of lighting on productivity. An experimental group had their light bulbs changed, and the experimenters wanted to measure the effect on productivity. A control group saw workers change their bulbs, but the new bulbs were identical to the old ones. However, both groups improved after the “new” bulbs were put in: the control group improved purely due to their perception of an effect.
These effects can turn up where you least expect them to! For example, suppose you’re designing an experiment to measure the effect of fertilizers on a farm. While the plants probably aren’t vulnerable to the placebo effect, the farmers could be. A farmer whose field is fertilized might work harder and be more motivated simply by being part of the study. As a result, in such a study, it might be a good idea to have a placebo farm that receives plain dirt. There are also other confounding factors: seemingly uninteresting quantities such as the overall moisture level of the fertilizer may have an impact on the result, so it’s important to make the placebo group as similar to the treatment group as possible!
[a] See Baylor College of Medicine, “Study Finds Common Knee Surgery No Better Than Placebo,” ScienceDaily, July 12, 2002.
[b] See BBC News, “Being drunk ‘a trick of the mind’,” January 7, 2003.
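The three-group design from the panel above (control, placebo, treatment) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `assign_three_arms` and `effect_estimates` are hypothetical, and the outcome scores are made up to show the arithmetic, not taken from any study.

```python
import random
from statistics import mean

def assign_three_arms(subjects, seed=0):
    """Randomly split subjects into control, placebo, and treatment groups."""
    rng = random.Random(seed)
    shuffled = list(subjects)
    rng.shuffle(shuffled)
    arms = ("control", "placebo", "treatment")
    # Deal the shuffled subjects round-robin so group sizes differ by at most one.
    return {arm: shuffled[i::3] for i, arm in enumerate(arms)}

def effect_estimates(outcomes):
    """outcomes maps each arm name to a list of measured outcomes."""
    # Effect of the placebo over the no-treatment baseline.
    placebo_effect = mean(outcomes["placebo"]) - mean(outcomes["control"])
    # Effect of the real treatment over the placebo.
    treatment_effect = mean(outcomes["treatment"]) - mean(outcomes["placebo"])
    return placebo_effect, treatment_effect

groups = assign_three_arms(range(30))
print({arm: len(members) for arm, members in groups.items()})

# Made-up outcome scores, purely for illustration.
made_up = {
    "control": [50, 52, 48],
    "placebo": [55, 57, 53],
    "treatment": [60, 61, 59],
}
print(effect_estimates(made_up))
```

Reporting the two differences separately is what lets the experimenter distinguish "the treatment beats doing nothing" from "the treatment beats a sham treatment".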
If $W_\ell$ is the true proportion of the population that is in stratum $\ell$, and $\sigma_\ell$ is the true standard deviation within stratum $\ell$, then Neyman’s optimal allocation says that the size of the SRS for stratum $\ell$ should be proportional to $W_\ell \sigma_\ell$. This technique allows us to accurately measure the effects of small groups that may have otherwise been missed in an SRS over the whole population (i.e., without stratification). For example, we may want to sample the performance of students in different types of schools. If some school categories are larger (i.e., have more students) than others, then an SRS over the whole population may miss the small categories. A stratified sample would list the categories and sample randomly within each type of school.
However, all of these frameworks have issues:
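As a brief aside on Neyman's optimal allocation described above, the rule can be sketched as follows. The function name `neyman_allocation` and the stratum values are hypothetical, and the sketch assumes the stratum proportions and standard deviations are known, which in practice they usually have to be estimated.

```python
def neyman_allocation(weights, sigmas, total_n):
    """Split a total sample size across strata proportionally to W_l * sigma_l."""
    products = [w * s for w, s in zip(weights, sigmas)]
    total = sum(products)
    # Sizes may be fractional; rounding to whole subjects is left to the user.
    return [total_n * p / total for p in products]

# Hypothetical strata: three types of school.
weights = [0.7, 0.2, 0.1]    # W_l: share of the population in each stratum
sigmas = [5.0, 10.0, 20.0]   # sigma_l: standard deviation within each stratum
print(neyman_allocation(weights, sigmas, total_n=300))
```

In this made-up example the smallest stratum gets as large a sample as the middle one, because its higher variability offsets its smaller share of the population.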
In October 2004, Stanley Presser ran a poll for The New Yorker, where half of respondents were asked “Do you think the United States should allow public speeches against democracy?” and the other half were asked the same question except with “allow” replaced by “forbid”. Whereas 56% answered no to “forbidding”, 39% answered yes to “allowing” despite the two answers corresponding to the same response.
Generally speaking, it is a good idea to word questions as neutrally as possible, and if the questions don’t have some order dependence, to randomize their ordering.
The above issues often make it hard to extend conclusions beyond a study: any analysis we can do is only valid for the population that we sampled from. This is highlighted by the following example, which led to the downfall of the magazine The Literary Digest.
In the 1936 U.S. presidential election, Republican candidate Alfred Landon was running against Democrat Franklin Delano Roosevelt. The Literary Digest projected that Landon would win by a huge margin: a 57% to 43% victory. The magazine had polled 10 million people and received a whopping 2.4 million responses! Yet Franklin Roosevelt won by a landslide, carrying 46 states while Landon only carried 2 states. The win wasn’t just in the electoral college either: Roosevelt won 61% of the popular vote. What had happened? Of course, a mix of things happened, including non-response bias and likely wording issues, but the main issue was selection bias: the questionnaires were sent out to readers of The Literary Digest, people in a phone listing, and people on a listing of car owners. But all these lists contain more rich people than poor people, which led to a heavily skewed poll result. In contrast, a Gallup poll that same year predicted that Roosevelt would receive 56% of the popular vote using a sample size of only 50,000, which turned out to be far more accurate than The Literary Digest’s poll results.
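A small simulation can illustrate why a huge sample drawn from a skewed list does worse than a small simple random sample. Everything in this sketch is an assumption made for illustration: the population counts are invented (they are not historical data), and "on a mailing list" stands in for the magazine's reader, phone, and car-owner lists.

```python
import random

def share_for_roosevelt(votes):
    """Fraction of votes for Roosevelt (encoded 1 = Roosevelt, 0 = Landon)."""
    return sum(votes) / len(votes)

# Hypothetical population of (on_mailing_list, vote) pairs.
population = (
    [(True, 1)] * 30_000 + [(True, 0)] * 70_000        # listed, wealthier voters
    + [(False, 1)] * 150_000 + [(False, 0)] * 50_000   # unlisted voters
)
true_share = share_for_roosevelt([vote for _, vote in population])

rng = random.Random(0)
listed_votes = [vote for on_list, vote in population if on_list]
# Large poll, but drawn only from people who appear on the lists.
biased_poll = rng.sample(listed_votes, 50_000)
# Small poll, but a simple random sample of the whole population.
srs_poll = [vote for _, vote in rng.sample(population, 1_000)]

print("true share:  ", true_share)
print("biased poll: ", share_for_roosevelt(biased_poll))
print("SRS poll:    ", share_for_roosevelt(srs_poll))
```

Increasing the size of the biased poll only makes it a more precise estimate of the wrong quantity: the share among listed voters rather than among all voters.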
Thus, an estimate of the standard error is
$$s_\mu = \frac{s}{\sqrt{n}} \sqrt{\frac{N - n}{N}}$$
from which we derive a 95% confidence interval for the mean height μ:
$$\bar{x} \pm 2 s_\mu = \bar{x} \pm 2 \frac{s}{\sqrt{n}} \sqrt{\frac{N - n}{N}}$$
As a reminder, the coefficient 2 comes from the fact that about 95% of the probability mass of a normal random variable lies within 2 standard deviations of its mean (the more precise value is 1.96).
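As a rough sketch, the interval for the mean can be computed as below. It assumes that n is the sample size, N is the population size, and s is the usual sample standard deviation; the function name `mean_ci_srs` and the height values are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev

def mean_ci_srs(xs, N, z=2.0):
    """Approximate 95% confidence interval for a population mean.

    xs is a simple random sample drawn without replacement from a
    population of size N (assumed meaning of N).
    """
    n = len(xs)
    x_bar = mean(xs)
    s = stdev(xs)  # sample standard deviation (divides by n - 1)
    # Standard error with the finite population factor sqrt((N - n) / N).
    se = (s / sqrt(n)) * sqrt((N - n) / N)
    return x_bar - z * se, x_bar + z * se

heights = [160, 172, 168, 181, 175, 166, 170, 178]  # hypothetical heights in cm
print(mean_ci_srs(heights, N=1000))
```

Note that when the sample is the whole population (n = N), the factor is zero and the interval collapses to a single point, as it should: there is no sampling uncertainty left.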
If instead the $x_i$’s had been binary random variables taking on values 0 or 1 (e.g., we ask each of $n$ people a yes/no question, where “yes” is encoded as a 1), then one could show that an approximate 95% confidence interval is
$$\bar{x} \pm 2 \sqrt{\frac{\bar{x}(1 - \bar{x})}{n - 1} \cdot \frac{N - n}{N}}$$
where in this context, note that $\bar{x}$ estimates the fraction of the population that has value 1 (e.g., the proportion of people who answer “yes” to a poll).
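For yes/no data the interval only needs the count of “yes” answers. The sketch below assumes N is the population size; the function name `proportion_ci_srs` and the poll numbers are hypothetical. The n − 1 in the denominator is consistent with the formula for the mean, since for 0/1 data the sample variance equals n/(n − 1) times x̄(1 − x̄).

```python
from math import sqrt

def proportion_ci_srs(yes_count, n, N, z=2.0):
    """Approximate 95% confidence interval for a population proportion.

    yes_count of the n sampled people answered "yes"; N is the
    population size (assumed meaning of N).
    """
    p_hat = yes_count / n  # this is x-bar for 0/1 data
    se = sqrt(p_hat * (1 - p_hat) / (n - 1) * (N - n) / N)
    return p_hat - z * se, p_hat + z * se

# Hypothetical poll: 560 of 1,000 respondents say "yes", population of one million.
print(proportion_ci_srs(yes_count=560, n=1000, N=1_000_000))
```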
This section covers more experimental designs that are useful for more complex experiments.
Whenever possible, if applying a treatment, it’s best to have paired data, where we obtain measurements for each subject before and after treatment. As we saw with t-tests, paired tests often give us the most power.
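To make the paired idea concrete, here is a minimal sketch of the paired t statistic, computed on the within-subject differences. The function name `paired_t_statistic` and the scores are hypothetical; the resulting value would be compared against a t distribution with n − 1 degrees of freedom.

```python
from math import sqrt
from statistics import mean, stdev

def paired_t_statistic(before, after):
    """t statistic for the mean within-subject difference (after - before)."""
    diffs = [a - b for a, b in zip(after, before)]
    n = len(diffs)
    # Mean difference divided by its estimated standard error.
    return mean(diffs) / (stdev(diffs) / sqrt(n))

before = [72, 65, 80, 58, 77]  # hypothetical scores before treatment
after = [75, 70, 82, 63, 79]   # the same subjects after treatment
print(paired_t_statistic(before, after))
```

Working with differences removes the variation between subjects, which is typically much larger than the variation within a subject; this is where the extra power comes from.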
A generalization of paired tests is repeated measures design. In such a design, we may have multiple (i.e., 2 or more) treatments, and each subject will receive all the treatments. This way, each subject can be thought of as its own control.
For example, suppose we measure the effect of caffeine (in the form of tea and coffee) on student performance. In a repeated measures design, each student would spend a month drinking coffee, a month drinking tea, and a month with no caffeine intake (for control). We may also want to add a month with a decaffeinated drink as a placebo. In such designs, it’s important to randomize the order, and to be wary of temporal effects. In this example, stopping caffeine treatment might lead to worse performance due to withdrawal. As a result, it might be worthwhile to wait in between each “measure”. We can sometimes model these temporal effects with autocorrelation models, where the errors are no longer assumed to be independent, but rather to depend on each other in sequence.
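Randomizing the order of treatments for each subject might look like the following sketch. The function name `randomized_orders` and the subject labels are hypothetical, and the sketch does not handle the washout periods between treatments discussed above.

```python
import random

def randomized_orders(subjects, treatments, seed=0):
    """Give every subject all treatments, each in an independently shuffled order."""
    rng = random.Random(seed)
    orders = {}
    for subject in subjects:
        order = list(treatments)
        rng.shuffle(order)
        orders[subject] = order
    return orders

treatments = ["coffee", "tea", "no caffeine", "decaf placebo"]
subjects = ["student A", "student B", "student C"]
for subject, order in randomized_orders(subjects, treatments).items():
    print(subject, order)
```

Because each subject receives every treatment, temporal effects such as fatigue or withdrawal are spread across treatments rather than always hitting the same one.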
What do we do when we have multiple factors to block on? If the factors don’t depend on each other, then we’ll probably have the same number of sub-blocks within each block. For example, in an experiment where we block on gender and handedness (left or right), we’ll have left-handed and right-handed groups for men, and left-handed and right-handed groups for women. Such a design is called complete, because each sub-block is being tested. We’ll focus here on cases where we have two blocking factors, although the ideas we’ll discuss can be generalized. In a randomized complete block design, we may not have enough data points to replicate within sub-blocks, so we must assign different sub-blocks to different treatment conditions.
Example
For example, suppose we want to measure the effect of giving tablets to students in developing countries. Our experimental condition might be providing students with tablets and giving them an extra hour every day to use them. We would need a control group that receives the normal curriculum, and a placebo group that receives an extra hour of unstructured time (but no tablets) every day. This gives us three levels for the treatment factor: tablet (T), unstructured hour (U), and control (C). Suppose this is a one-year study where we have three terms (fall, spring, and summer), and three (mostly-similar) schools in which to run the experiment. Such a setup is known as a row-column design, and the experimental setup can be illustrated by the following table:
                Time of year
Location        Fall      Spring    Summer
School 1
School 2
School 3
We’ll fill in each entry with the treatment we use for that setup. A first attempt at this design (where T, U, and C stand for tablet, unstructured hour, and control respectively) might look like this:
                Time of year
Location        Fall      Spring    Summer
School 1        T         U         C
School 2        T         U         C
School 3        T         U         C
Unfortunately, this design does not properly take into account the time of year: if we were to run the experiment and see a significant improvement from the tablets, it might have been entirely due to the confounding effect of having the tablet conditions all in the fall! As a result, our ideal design would have each condition appear exactly once per row and once per column (like a Sudoku). Grids that satisfy constraints like this are called Latin squares, and we can produce one by taking the table above and shifting each row:
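The row-shifting construction can also be sketched in code. The function names `cyclic_latin_square` and `is_latin_square` are hypothetical, and the sketch assumes rows correspond to the three schools and columns to fall, spring, and summer.

```python
def cyclic_latin_square(treatments):
    """Build a Latin square by shifting each row one step further to the right."""
    k = len(treatments)
    return [[treatments[(col - row) % k] for col in range(k)] for row in range(k)]

def is_latin_square(square):
    """Check that every symbol appears exactly once in each row and each column."""
    symbols = set(square[0])
    rows_ok = all(len(row) == len(symbols) and set(row) == symbols for row in square)
    cols_ok = all(set(col) == symbols for col in zip(*square))
    return rows_ok and cols_ok

square = cyclic_latin_square(["T", "U", "C"])
for row in square:           # one row per school; columns are fall, spring, summer
    print(" ".join(row))
print(is_latin_square(square))
```

The cyclic construction yields only one of many possible Latin squares; a common further step is to randomly permute the rows, columns, or treatment labels so that the particular layout is itself chosen at random.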