




Study with the several resources on Docsity
Earn points by helping other students or get them with a premium plan
Prepare for your exams
Study with the several resources on Docsity
Earn points to download
Earn points by helping other students or get them with a premium plan
Material Type: Lab; Class: Introduction to Statistical Methods for Life and Health Sciences; Subject: Statistics; University: University of California - Los Angeles; Term: Winter 2007;
Typology: Lab Reports
1 / 8
This page cannot be seen from the preview
Don't miss anything!





With this lab we consider the ways in which we can quantify the variability in an estimate. We have seen two basic approaches in lecture: One computational and the other analytical. We will make explicit links between these two and in the process firm up our language and our methods a bit (“precision” in two senses of the word).
We will again appeal to simulation for this little study. We will simulate the act of conducting a survey. The pieces we need to assemble to do this include: a) a population of interest, b) a numerical value we can assign to all the items/subjects in the population and a summary of these values (the mean, median, proportion) that we will estimate c) a method for drawing simple random samples from the population.
The (custom) R function survey will do this for us. We will take as our population the 20,718 residents of Millbrae, CA (a city in the San Francisco Bay Area). We would like to learn something about this group of people by taking small surveys. This lab will consdider how precise our estimates are, and how this precision varies with sample size.
We will first load up data that we’ll need for this lab.
source("http://www.stat.ucla.edu/~cocteau/stat13/data/precise.R") ls()
You should see a function called survey and a couple of other data sets that con- stitute our populations (people and households). The values in people refer to the ages of every resident of Millbrae; and those in households refer to household incomes in 2007 (in thousands of dollars). We will begin by taking a simple survey of the households Millbrae.
x = survey(100,households)
mean(x)
At a conceptual level, you can imagine selecting 100 households at random (say, from the phonebook), interviewing the head of the household and finding out how much money they made last year. Admitedly this setup is a little strained because household incomes in Millbrae are published by the U.S. Census Bureau (which is ultimately the source of our data). For the purposes of this lab, however, we will let household income represent some aspect of the community that is not known. Think instead about how you might ask about more interesting questions like Internet usage – “How many hours a day are you on the web?” – or perhaps the effects of the current recession – “Have you put off a big purchase in the last six months?”.
y = survey(200,people) mean(y)
With these commands, we imagine creating a simple random sample of 200 residents in Millbrae, contacting them and recording their ages. As with household income, these data are carefully collected by the Census Bureau and so the example is a little artificial from a practical standpoint. You can instead imagine needing information about residents that is not quite so readily available. Think about a health-related question like “Have you eaten peanut butter in the last week?”, a question the CDC and FDA was asking a number of people last month as they tried to uncover the source of the salmonella outbreak.
In each case we constructed a simple random sample of the population, collected some numerical measure for each member of the sample, and then computed the sample mean. As we saw in lecture, this statistic is an estimate of the population mean (the average age of residents in Millbrae or the average income of households in the city). Let’s repeat the last two lines 10 times or so.
y = survey(200,people) mean(y)
You should notice that the mean of your sample varies each time you conduct a survey. If this were a real survey, the effort in contacting 200 people could be considerable. So each time you hit ENTER, you should think about the effort it would take to perform this survey and appreciate the luxury of this magical simulation world where we have the population so readily at our disposal.
The last command asks whether or not the mean of the population (in this case the mean age of people living in Millbrae). The function returns TRUE if the population parameter is contained in the interval and FALSE otherwise. We can now give a precise meaning to a confidence interval. Let us repeat the process above 100 times, each time taking a different survey, forming a confidence interval and seeing how many of the intervals contain the truth.
results = rep(0,100)
for(i in 1:100) { y = survey(200,people) boots = bootstrap(y,5000,mean)
lo = quantile(boots$thetastar,0.025) hi = quantile(boots$thetastar,0.975)
results[i] = contains(lo,hi,mean(people)) print(i) }
This looks bad, I know. But notice that all the commands are things you’ve seen before. You have a loop of 100 iterations and for each one we are conducting a survey, computing a confidence interval from the survey results (the ages of 200 randomly selected people in the city), and then peaking to see if the confidence intervals contained the truth or not. Oh, and we’ve added a call to the command print so you can see R working. If you think about it, we’re doing quite a lot and the whole process will take a little time. By printing a count you know how far R has gotten.
Ultimately, results contains a 1 if the confidence interval contained the population parameter and 0 otherwise. Let’s have a look.
results sum(results)
Question 1. (a) How many of your 100 surveys produced confidence intervals that contained the population parameter? Is this what you expected? Why? (b) Define what we mean by a confidence interval; feel free to use this experiment as a reference.
In the last lecture we discussed an analytical approach to computing confidence intervals. We began with the fact that for the mean, we have a formula for its standard error. Recall that the standard error of an estimate (like the sample mean), is defined to be the standard deviation of its sampling distribution. The narrower the sampling distribution, the more precise our estimate. If our sampling distribution is roughly normal looking, then we can form confidence intervals by adding and subtracting two standard errors from the sample mean. (To be precise, as that is the theme of this lab, we should add and subtract 1.96 times the standard error; technically pnorm(-1.96) is 0.025 and 1-pnorm(1.96) is 0.025. The values of pnorm(-2) and 1-pnorm(2) are 0.0228.)
In a previous lab, we estimated the standard error of the sample mean using our bootstrap replicates. Specifically, we computed something like the following.
y = survey(200,people) boots = bootstrap(y,5000,mean)
seboot = sd(boots$thetastar)
Again, our 5000 bootstrap replicates represent an estimate of the sampling distri- bution and so we can estimate the standard error with their standard deviation.
In class we found that the sample mean, as an estimate, was fairly special in that we have a formula that describes exactly its standard error. With a little math, we can show that the standard error of the sample mean is σ/√n where σ is the standard deviation of the population.
This suggests that another route to estimating the standard error would be to esti- mate σ/√n. In code this might look like the following.
Question 2. (a) How many of your 100 surveys produced confidence intervals that contained the population parameter? Is this what you expected? Why? (b) What is the first thing that struck you about running this code as compared to the code involving the bootstrap? What explains the difference?
Finally, we will consider a statistic not covered in our book. Given the skew in the data, it is often common for people to report median household incomes rather than means. In this last part of the lab, we will consider the performance of the bootstrap confidence intervals for the population median. We will conduct surveys of size 100 this time, and our subjects will be drawn from the households data set.
Below we present the code. It is virtually the same as the code for the means, with a couple slight alterations to reflect the fact that we are working with the income data now and our surveys only involve 100 households.
results = rep(0,100)
for(i in 1:100) {
y = survey(100,households) boots = bootstrap(y,5000,median)
lo = quantile(boots$thetastar,0.025) hi = quantile(boots$thetastar,0.975)
results[i] = contains(lo,hi,median(households))
print(i) }
Question 3. (a) How many of your 100 surveys produced confidence intervals that contained the population parameter? Is this what you expected? Why? (b) Look at lo and hi (these are the values from the last simulation); interpret what the interval [lo,hi] means.