The Bootstrap Lab 6 - Introduction to Statistical Methods for Life | STATS 13, Lab Reports of Statistics

Material Type: Lab; Class: Introduction to Statistical Methods for Life and Health Sciences; Subject: Statistics; University: University of California - Los Angeles; Term: Unknown 1989;


Uploaded on 08/26/2009


Lab 6: The bootstrap (I)

Due 2/24, 3 Questions

Last week we made a move from testing to estimation and introduced the bootstrap as a tool for assessing uncertainty. In this lab we will discuss some of the mechanics behind bootstrapping; in the next lab, we will examine built-in functions that perform the bootstrap with one command (as opposed to writing out the code to generate samples). This lab starts with a dangling thread, with a subject that has been waiting for a week or two to be fully described: the Q-Q plot.

1. Getting started

We will first load up data that we’ll need for this lab.

source("http://www.stat.ucla.edu/~cocteau/stat13/data/fruit.R")
ls()

This lab is a little, well, cluttered datawise. You should see the following separate data sets: banana, broc, grapefruit, lemon, gs, lettuce, lime, orange, pepper, snow, spinach, and tomato (gs stands for Granny Smith apples). Each holds measurements made on samples of produce collected from a chain of supermarkets in Atlanta, GA. So, the unit of observation is, literally, a piece of fruit. The two variables are green and red: green represents the intensity of light reflected off the samples in a subset of the green range (500-565 nm); and red holds intensity measurements from the same sampled produce, but in a subset of the yellow/orange/red range (565-740 nm).

The following commands will give you a basic sense of what’s in these data.

names(banana)
dim(banana)
names(orange)
dim(orange)

You should see that the data sets contain different numbers of items. These data were collected during the second phase of the fruit scanning project mentioned in Lecture 10; the sample sizes are more or less related to the color variation researchers found in different groups of produce.

2. Quantiles

We have been using normal Q-Q plots (what your book calls normal probability plots) for some time now in lecture, but haven’t really let you kick the tires on the tool. It’s time. We have already seen the idea behind a quantile (or percentile in your book’s terminology). The median is the point that separates your data in half (if you have an even number of samples, say). It is also known as the 50th percentile or the 0.5 quantile. Similarly, the lower quartile is the point below which 25% of your data lie; this is also known as the 25th percentile or the 0.25 quantile. Finally, the upper quartile is the point below which 75% of the data lie; it is also known as the 75th percentile or the 0.75 quantile.

In general, for any 0 ≤ q ≤ 1, we can define the q quantile, x_q, to be the point that separates the data into two parts: the proportion of data less than or equal to x_q is q and the proportion strictly greater than x_q is 1 − q. Your book considers just 100 values of q (0.01, 0.02, ..., 0.99, 1.0) and calls the points percentiles (for obvious reasons). The idea of a quantile is more general in that q can take on any value between 0 and 1.
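To see the definition in action, here is a minimal sketch that checks the two proportions directly. The vector x is made-up data (simulated, not one of the lab's data sets), and the names q, xq, prop_below and prop_above are just for illustration. In a finite sample the proportions match q only approximately, and R's quantile function interpolates between data points by default.

```r
# Hypothetical stand-in data; swap in something like banana$green once
# the lab data are loaded.
set.seed(13)
x <- rnorm(200, mean = 2, sd = 0.5)

q  <- 0.25
xq <- quantile(x, q)        # the 0.25 quantile (lower quartile)

prop_below <- mean(x <= xq) # proportion of data at or below xq
prop_above <- mean(x > xq)  # proportion strictly above xq

c(at_or_below = prop_below, above = prop_above)
```

The first number should land close to q and the second close to 1 − q; try other values of q to convince yourself.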

To try this out, consider the following sequence of commands.

hist(banana$green)
median(banana$green)
quantile(banana$green,0.5)
quantile(banana$green,c(0.25,0.5,0.75))

The last command will return a vector of length three. By “concatenating” the proportions 0.25, 0.5 and 0.75, we have created a vector of three items that R then uses as input to the quantile function. You can compare these numbers to the edges of the box (and the line inside it) in a boxplot:

boxplot(banana)
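If you would rather compare numbers than eyeball the plot, the sketch below pulls out the five values R uses to draw a boxplot. It runs on simulated data (x is hypothetical, not a lab data set). One caveat: the box edges are "hinges", which R computes a little differently from the default quartiles, so expect small discrepancies. The whisker ends are the most extreme observations within 1.5 box-lengths of the box, not necessarily the minimum and maximum.

```r
# Hypothetical stand-in data
set.seed(13)
x <- rnorm(200, mean = 2, sd = 0.5)

quartiles <- quantile(x, c(0.25, 0.5, 0.75))

# stats holds: lower whisker, lower hinge, median, upper hinge, upper whisker
box <- boxplot.stats(x)$stats

quartiles
box[2:4]   # the box itself: hinges and median
```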

Question 2: Consider two other kinds of produce and either the red or green variable. For each, make a histogram and a Q-Q plot and explain whether or not the data appear bell-shaped and why.

3. A first pass at the bootstrap

Recall from Lecture 11 that we could use the bootstrap to provide an estimate of the sampling distribution of a statistic we’re interested in, providing us with a fairly simple (at least conceptually) way to assess an estimate’s accuracy. Let’s consider a simple case, the mean. Suppose we would like to estimate the average greenness of grapefruits sold by a particular chain of markets in Atlanta. This will be the population and we would like to say something about the average greenness of this population, a population parameter. Our data are a sample of 767 grapefruit taken from this chain of markets. The mean greenness in our sample is 1.98 (compared to an average greenness of 3.67 for broccoli, say). The sample mean, 1.98, is an estimate of the population mean.

mean(grapefruit$green)
mean(broc$green) # just for comparison

Of course we are not interested in the 767 grapefruits in our sample, but rather, we would like to know what these data say about the greenness of the larger population of grapefruit at this chain of markets. To assess the accuracy of our estimate of 1.98, we will perform a basic bootstrap (re)sampling procedure. The code below draws bootstrap samples of size 767 from our original data (sampling with replacement), each time forming a bootstrap replicate of the mean.

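nsim = 5000
bootmns = rep(0,nsim) # storage for the bootstrap replicates

for(i in 1:nsim) {
  boots = sample(grapefruit$green,767,replace=TRUE) # resample with replacement
  bootmns[i] = mean(boots) # bootstrap replicate of the mean
}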

This code looks very much like an example from the last lab; recall our sequence of commands to simulate the sampling distribution for the odds ratio. Again, we have a loop and then a series of commands that are performed with each iteration. In this case, we are drawing new bootstrap samples each time and computing the bootstrap replicate of our original estimate, the mean.
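It can help to look at a single bootstrap sample before running thousands of them. The sketch below does that, and then repeats the whole resampling step in one line with replicate, which is an alternative to the explicit loop rather than the approach this lab asks for. The vector green is simulated and only stands in for grapefruit$green; its mean and spread are assumptions chosen for illustration.

```r
# Hypothetical stand-in for grapefruit$green (simulated, not the real data)
set.seed(13)
green <- rnorm(767, mean = 1.98, sd = 0.4)
n <- length(green)

# One bootstrap sample: same size as the data, drawn with replacement
boots <- sample(green, n, replace = TRUE)
frac_distinct <- length(unique(boots)) / n  # share of distinct values drawn
frac_distinct                               # some points repeat, others are left out

# Many bootstrap replicates of the mean without writing a loop
nsim <- 5000
bootmns <- replicate(nsim, mean(sample(green, n, replace = TRUE)))

length(bootmns)
c(sample_mean = mean(green), mean_of_replicates = mean(bootmns))
```

Because we sample with replacement, a bootstrap sample typically contains only about 63% of the distinct original observations; the rest of its entries are repeats.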

The distribution of the bootstrap replicates is an estimate of the sampling distribution of our estimator, the mean greenness. In lecture, we saw that in many cases, this distribution is bell-shaped. What do you think?

hist(bootmns)
qqnorm(bootmns)
qqline(bootmns)
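If you are curious what qqnorm actually plots, the sketch below rebuilds its coordinates by hand: the sorted data on one axis and standard normal quantiles at evenly spaced probabilities on the other. Here bootmns is simulated as a stand-in for your bootstrap replicates, and the names theoretical, observed and qq are just for illustration.

```r
# Hypothetical stand-in for the bootstrap replicates
set.seed(13)
bootmns <- rnorm(5000, mean = 1.98, sd = 0.015)
n <- length(bootmns)

theoretical <- qnorm(ppoints(n))  # normal quantiles at n evenly spaced probabilities
observed    <- sort(bootmns)      # sample quantiles: the sorted data

# qqnorm uses the same coordinates (plot.it = FALSE returns them without drawing)
qq <- qqnorm(bootmns, plot.it = FALSE)
all.equal(sort(qq$x), theoretical)
all.equal(sort(qq$y), observed)

# For bell-shaped data the points fall near a straight line
cor(theoretical, observed)
```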

We can estimate the standard error for the mean of our sample of grapefruit by taking the standard deviation of the bootstrap replicates. The standard error provides us with a sense of the accuracy of our estimate, in this case 1.98. Assuming that the bootstrap replicates have a bell-shaped distribution, we can form a (roughly) 95% confidence interval by taking 1.98 plus or minus two estimated standard errors.

se = sd(bootmns)
1.98-2*se
1.98+2*se
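For the mean specifically, there is also a familiar formula for the standard error, the sample standard deviation divided by the square root of the sample size, so you can sanity-check the bootstrap against it. The sketch below does this on simulated data (green is a hypothetical stand-in for grapefruit$green); the two numbers should be close but will not match exactly, since the bootstrap value depends on the random resamples.

```r
# Hypothetical stand-in for grapefruit$green
set.seed(13)
green <- rnorm(767, mean = 1.98, sd = 0.4)
n <- length(green)

bootmns <- replicate(5000, mean(sample(green, n, replace = TRUE)))

se_boot    <- sd(bootmns)          # bootstrap estimate of the standard error
se_formula <- sd(green) / sqrt(n)  # textbook formula for the mean

c(bootstrap = se_boot, formula = se_formula)
```

The attraction of the bootstrap is that the same recipe still works for statistics that have no simple formula.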

Remember from lecture that we could also form a 95% confidence interval using the quantiles of the bootstrap replicates. If the bootstrap replicates have a bell-shaped distribution, these two approaches should give similar results.

quantile(bootmns,c(0.025,0.975))
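To line the two intervals up side by side, something like the following sketch may be handy. It uses simulated data again (green is hypothetical, not the grapefruit measurements), and the names normal_ci and percentile_ci are made up for this example.

```r
# Hypothetical stand-in for grapefruit$green
set.seed(13)
green <- rnorm(767, mean = 1.98, sd = 0.4)
n <- length(green)

bootmns <- replicate(5000, mean(sample(green, n, replace = TRUE)))

# Estimate plus or minus two bootstrap standard errors
normal_ci <- mean(green) + c(-2, 2) * sd(bootmns)

# 2.5% and 97.5% quantiles of the bootstrap replicates
percentile_ci <- unname(quantile(bootmns, c(0.025, 0.975)))

rbind(normal = normal_ci, percentile = percentile_ci)
```

With bell-shaped replicates like these, the endpoints should differ only slightly; a skewed set of replicates would pull the two intervals apart.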

Question 3: Select a (as in one) different variety of produce to work with. Compute the sample mean of either its redness or greenness and estimate its standard error using the bootstrap resampling procedure above. Comment on whether the distribution of the bootstrap replicates (an estimate of the sampling distribution of the sample mean) is bell-shaped, and form the two kinds of confidence intervals examined above. Do they agree? Why?