Regression Analysis - Laboratory Review 8 | STATS 13, Lab Reports of Statistics

Material Type: Lab; Class: Introduction to Statistical Methods for Life and Health Sciences; Subject: Statistics; University: University of California - Los Angeles; Term: Unknown 1989;

Typology: Lab Reports

Pre 2010

Uploaded on 08/21/2009

koofers-user-ks5
koofers-user-ks5 🇺🇸

9 documents

1 / 6

Toggle sidebar

This page cannot be seen from the preview

Don't miss anything!

bg1
Lab 8: Regression analysis
Due with your final writeup: 1 Question, 1 Bonus
In this lab, we will examine the data on mercury contamination considered in lecture.
You can think of it as a prelude to your final, which will largely concern regression
analysis. Perhaps it is more precise to say that the final, with its focus on regression,
will let us explore the big themes addressed during the quarter: Graphical and
numerical summaries, formulating and testing hypotheses, sampling variablity and
confidence intervals. To get us started, we will spend this Lab session learning about
how R implements the simple linear model.
Getting Started
As usual, we begin this lab by loading data into your R session. This time, however,
our data are not stored in R’s special format, but rather as comma-separated values
(CSV). Programs like Excel, for example, can read and write data in this format.
After weeks of the usual routine, we’ve decided to publish lab materials this way so
that you see how to bring your own data into R.
The data on fish contamination is linked to the course home page, or is available
directly at
http://www.stat.ucla.edu/~cocteau/stat13/data/fish.csv
In eiter case, use a browser to download the file, and save it in the “working direc-
tory” for your R session. The command
getwd()
should return the folder where R will look for files to load. Save fish.csv there.
(We have not talked about this much, but when you start using R outside of this
class, you will inevitably want to save your work in different folders, giving each new
project its own folder. In general, you can specify a new working directory with the
command setwd, which takes the location of your project’s folder as an argument.)
pf3
pf4
pf5

Partial preview of the text

Download Regression Analysis - Laboratory Review 8 | STATS 13 and more Lab Reports Statistics in PDF only on Docsity!

Lab 8: Regression analysis

Due with your final writeup: 1 Question, 1 Bonus

In this lab, we will examine the data on mercury contamination considered in lecture. You can think of it as a prelude to your final, which will largely concern regression analysis. Perhaps it is more precise to say that the final, with its focus on regression, will let us explore the big themes addressed during the quarter: Graphical and numerical summaries, formulating and testing hypotheses, sampling variablity and confidence intervals. To get us started, we will spend this Lab session learning about how R implements the simple linear model.

Getting Started

As usual, we begin this lab by loading data into your R session. This time, however, our data are not stored in R’s special format, but rather as comma-separated values (CSV). Programs like Excel, for example, can read and write data in this format. After weeks of the usual routine, we’ve decided to publish lab materials this way so that you see how to bring your own data into R.

The data on fish contamination is linked to the course home page, or is available directly at

http://www.stat.ucla.edu/~cocteau/stat13/data/fish.csv

In eiter case, use a browser to download the file, and save it in the “working direc- tory” for your R session. The command

getwd()

should return the folder where R will look for files to load. Save fish.csv there.

(We have not talked about this much, but when you start using R outside of this class, you will inevitably want to save your work in different folders, giving each new project its own folder. In general, you can specify a new working directory with the command setwd, which takes the location of your project’s folder as an argument.)

Once you have stored fish.csv in your current working directory, you can read the data into R with the following command.

fish = read.csv("fish.csv")

R has a number of functions for handling data with different formats. The function read.csv is specially designed for files of comma separated values. After reading the contents of the file, the command above saves the data in fish.

You can also read in the file without downloading it. Instead you could create a url from the address of the file.

fish = read.csv( url("http://www.stat.ucla.edu/~cocteau/stat13/data/fish.csv"))

Whether you go the url route or you download the file to your computer, the following command

ls()

will give you a list of objects that should include the dataset fish. Recall from lecture that these data were collected to study the extent of mercury concentration in two rivers in North Carolina. Researchers considered only largemouth bass; and each row of this dataset refers to a fish taken from one of two rivers, the Lumber or the Waccamaw.

dim(fish)

You will see that the data consists of 171 rows (fish) and 5 columns. Data for the first five fish can be shown with the command

fish[1:5,]

which should produce the following output.

river station length weight mercury 1 0 0 47.0 1616 1. 2 0 0 48.7 1862 1. 3 0 0 55.7 2855 1. 4 0 0 45.2 1199 0. 5 0 0 44.7 1320 0.

We observed in class that the data have a positive trend in the sense that longer fish tend to have higher mercury content. The relationship is somewhat noisy, however. Still, we decided to posit a population model of the form

(mercury) = β 0 + β 1 (length) + (error)

that is, the mercury content of a fish is described by linear model. Given a fish’s length, its mercury level is determined by a point on the line plus some error; a fish’s length is multiplied by the slope (β 1 ), we add the intercept β 0 to the result, and finally an error term is introduced.

In class we examined the use of least squares as a way of estimating the slope and intercept of the population model from a sample of fish. In R, this is done with the command

fit = lm(mercury~length,data=waccamaw) summary(fit)

Here, the command lm refers to a “linear model.” The expression inside the paren- theses, mercury~length, is a formula in R. It specifies a statistical model, in this case the one given above: that mercury content depends linearly on a fish’s length. Notice that this expression is missing any reference to the population parameters or the error. To make this specification clearer, if we wanted a model of the form

(mercury) = β 0 + β 1 (length) + β 2 (weight) + (error)

our formula would be mercury~length+weight. And with this, you get a sense that the linear model is not just about one input and one response; but more on that later.

In the summary output, we find a table of the Coefficients and their standard errors. Recall that these come from an analytical expression for the sampling dis- tribution that is not dissimilar in spirit to what we discussed for the t-distribution (consult yesterday’s lecture for a list of connections).

The object fit represents your fitted linear model. It contains information about

coefficients(fit)

and

residuals(fit)

and a number of other derived quantities that we’ll cover in our last lecture. You can look at the residuals with

r = residuals(fit) hist(r) qqnorm(r)

You can add the regression line to your scatterplot of mercury content versus weight with the following.

plot(waccamaw$length,waccamaw$mercury) abline(coefficients(fit))

At the risk of beating a dead horse, we can bootstrap the regression coefficients in a fairly straightforward way. Again, the bootstrap gives us a quick assessment of the variability in our estimates, this time β̂ 0 and β̂ 1. Below we bootstrap the slope.

nsim = 5000 bootbeta1 = rep(0,5000)

for(i in 1:nsim) { brows = sample(1:98,replace=T) boots = waccamaw[brows,]

bfit = lm(mercury~length,data=boots) bcoeff = coefficients(bfit)

bootbeta1[i] = bcoeff[2] }

sd(bootbeta1)

Here, we are sampling row numbers (1 to 98) with replacement and then using those to create a new data set boots. We then fit the linear model using the bootstrap data set, pull out the coefficients and keep the second one, the bootstrap replicates