Assignment 4 - Statistical Methods for Bioscience II | HORT 572

Material Type: Assignment; Professor: Ane; Class: Statistical Methods for Bioscience II; Subject: HORTICULTURE; University: University of Wisconsin - Madison; Term: Spring 2009;

Stat/F&W/Hort 572 Ane February 12, 2009
Assignment #4 Due Friday, Feb. 20, 2009, by 4pm
Turn in at lecture, at discussion, or to your TA’s mailbox. Please circle the discussion section where you expect to pick up
this assignment:
311 312 313 314
Purpose: Get experience with logistic regression.
A biologist interested in crustaceans sought evidence of a geographic trend in allelic distribution in the gene
mannose-6-phosphate isomerase (Mpi) in populations of the amphipod crustacean Megalorchestia californiana
located along the Pacific coast. In eight populations ranging from Santa Barbara, California in the south to Port
Townsend, Washington in the north, the biologist genotyped individual crustaceans at the Mpi locus. Sample sizes
ranged from 30 to over 2000. Two alleles, Mpi^90 and Mpi^100, were prevalent. If latitude is helpful in predicting
the allelic frequency of one of the alleles, say Mpi^90, this would be evidence consistent with a story of differential
selection at this locus based on an environmental factor associated with latitude. Latitude is in degrees north of
the equator.
The data is in the file crustacean.txt and is also shown below.
location            latitude  Mpi90  Mpi100
Port Townsend, WA       48.1     47     139
Neskowin, OR            45.2    177     241
Siuslaw R., OR          44.0   1087    1183
Umpqua R., OR           43.7    187     175
Coos Bay, OR            43.5    397     671
San Francisco, CA       37.8     40      14
Carmel, CA              36.6     39      17
Santa Barbara, CA       34.3     30       0
(a) Read the data into R, create a data frame from this data, then add a column with the total sample size, and
another column with the proportion of Mpi^90. Plot these proportions versus latitude. Explain why logistic
regression is an appropriate model to address this question. (Include the plot with your solution.)
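
As a starting point for (a), here is a minimal sketch. It types the table in directly rather than reading crustacean.txt, because the handout does not state the file's delimiter; the column names total and proportion are illustrative choices, not names required by the assignment.

```r
# Data copied from the table in the handout.
# If crustacean.txt is tab-delimited with a header row (an assumption),
# read.delim("crustacean.txt") would be an alternative way to load it.
crustacean <- data.frame(
  location = c("Port Townsend, WA", "Neskowin, OR", "Siuslaw R., OR",
               "Umpqua R., OR", "Coos Bay, OR", "San Francisco, CA",
               "Carmel, CA", "Santa Barbara, CA"),
  latitude = c(48.1, 45.2, 44.0, 43.7, 43.5, 37.8, 36.6, 34.3),
  Mpi90    = c(47, 177, 1087, 187, 397, 40, 39, 30),
  Mpi100   = c(139, 241, 1183, 175, 671, 14, 17, 0)
)

# Total sample size and observed proportion of the Mpi90 allele
crustacean$total      <- crustacean$Mpi90 + crustacean$Mpi100
crustacean$proportion <- crustacean$Mpi90 / crustacean$total

# Observed proportions versus latitude
plot(proportion ~ latitude, data = crustacean, ylim = c(0, 1),
     xlab = "Latitude (degrees north)",
     ylab = "Proportion of Mpi90 allele")
```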
(b) Fit a logistic regression model to this data and report the coefficients. Is the latitude effect significant?
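
One common way to fit a logistic regression to grouped counts in R is glm with a two-column response of successes and failures. The sketch below assumes that approach; treating Mpi90 as the "success" is a choice made here to match the question. It prints the coefficient table so the latitude effect can be judged.

```r
crustacean <- data.frame(
  latitude = c(48.1, 45.2, 44.0, 43.7, 43.5, 37.8, 36.6, 34.3),
  Mpi90    = c(47, 177, 1087, 187, 397, 40, 39, 30),
  Mpi100   = c(139, 241, 1183, 175, 671, 14, 17, 0)
)

# Binomial response given as cbind(successes, failures)
fit <- glm(cbind(Mpi90, Mpi100) ~ latitude,
           family = binomial, data = crustacean)

# Estimated coefficients with standard errors and Wald z-tests
print(summary(fit)$coefficients)

# Likelihood ratio (deviance) test for the latitude term
print(anova(fit, test = "Chisq"))
```

The Wald test from summary and the likelihood ratio test from anova usually agree in direction, but they are not the same test and can give different p-values.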
(c) On a single plot, include
  • the observed proportion of the Mpi^90 allele versus latitude, as in question (a),
  • the total number of samples from each location, next to each point. Hint: the text function can add
    text to a plot. It can work like this:
        text(x=latitudeValues, y=Mpi90proportions, labels=as.character(totalNumbers), pos=4)
    The option pos can be omitted. If provided, the labels will be shifted either down (1), left (2), up
    (3) or right (4).
  • a curve of the estimated probability of the Mpi^90 allele versus latitude. Hint: the predict function can
    take the logistic regression fit and use it to predict the proportion of the Mpi^90 allele at new latitudes. To
    use this function, you may create new latitude values, and a new data set with these values, like this:
        newlatitude = seq(20, 60, by=.5)
        newlatitude
        newdat = data.frame( latitude=newlatitude )

The function lines(x,y) will create a curve: it will join the points with coordinates (x, y) by line segments.
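
The hints for (c) can be combined roughly as follows. This is a sketch, not the required solution: the axis limits, labels, and the name phat are illustrative choices, and type = "response" is used so that predict returns probabilities rather than values on the logit scale.

```r
crustacean <- data.frame(
  latitude = c(48.1, 45.2, 44.0, 43.7, 43.5, 37.8, 36.6, 34.3),
  Mpi90    = c(47, 177, 1087, 187, 397, 40, 39, 30),
  Mpi100   = c(139, 241, 1183, 175, 671, 14, 17, 0)
)
crustacean$total      <- crustacean$Mpi90 + crustacean$Mpi100
crustacean$proportion <- crustacean$Mpi90 / crustacean$total

fit <- glm(cbind(Mpi90, Mpi100) ~ latitude,
           family = binomial, data = crustacean)

# New latitude values, as suggested in the handout
newlatitude <- seq(20, 60, by = 0.5)
newdat <- data.frame(latitude = newlatitude)

# Predicted probabilities of the Mpi90 allele on the response scale
phat <- predict(fit, newdata = newdat, type = "response")

# Observed proportions, sample sizes as labels, and the fitted curve
plot(crustacean$latitude, crustacean$proportion,
     xlim = range(newlatitude), ylim = c(0, 1),
     xlab = "Latitude (degrees north)",
     ylab = "Proportion of Mpi90 allele")
text(x = crustacean$latitude, y = crustacean$proportion,
     labels = as.character(crustacean$total), pos = 4)
lines(newlatitude, phat)
```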

(d) According to the model, at what latitude would we expect a 50/50 distribution of the two Mpi alleles?
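
For (d), a 50/50 split means a probability of 0.5, which is a log-odds of 0. Setting b0 + b1 * latitude = 0 gives latitude = -b0 / b1. A minimal sketch of that calculation follows; the names b and lat50 are illustrative.

```r
crustacean <- data.frame(
  latitude = c(48.1, 45.2, 44.0, 43.7, 43.5, 37.8, 36.6, 34.3),
  Mpi90    = c(47, 177, 1087, 187, 397, 40, 39, 30),
  Mpi100   = c(139, 241, 1183, 175, 671, 14, 17, 0)
)
fit <- glm(cbind(Mpi90, Mpi100) ~ latitude,
           family = binomial, data = crustacean)

b <- coef(fit)

# Latitude at which the fitted log-odds is zero, i.e. probability 0.5
lat50 <- -b[["(Intercept)"]] / b[["latitude"]]
print(lat50)

# Sanity check: the fitted probability at that latitude should be 0.5
print(predict(fit, newdata = data.frame(latitude = lat50),
              type = "response"))
```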

(e) Use the model to predict the proportion of Mpi^90 alleles in a population at latitude 40.0. You are encouraged to check your result with the predict function in R, but you also need to make this prediction by hand, using the coefficients from the fitted model.
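
The by-hand prediction in (e) inverts the logit: with linear predictor eta = b0 + b1 * 40, the probability is exp(eta) / (1 + exp(eta)). The sketch below does that arithmetic from the fitted coefficients and compares it with predict; the names eta, p_hand and p_pred are illustrative.

```r
crustacean <- data.frame(
  latitude = c(48.1, 45.2, 44.0, 43.7, 43.5, 37.8, 36.6, 34.3),
  Mpi90    = c(47, 177, 1087, 187, 397, 40, 39, 30),
  Mpi100   = c(139, 241, 1183, 175, 671, 14, 17, 0)
)
fit <- glm(cbind(Mpi90, Mpi100) ~ latitude,
           family = binomial, data = crustacean)

b <- coef(fit)

# By hand: linear predictor at latitude 40, then inverse logit
eta    <- b[["(Intercept)"]] + b[["latitude"]] * 40.0
p_hand <- exp(eta) / (1 + exp(eta))

# Same prediction using predict on the response (probability) scale
p_pred <- predict(fit, newdata = data.frame(latitude = 40.0),
                  type = "response")

print(c(by_hand = p_hand, predict = unname(p_pred)))
```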

(f) Based on residual plots, evaluate the presence of influential observations. Why are these observations influential? Do the latitude effect and its significance change when influential observations are removed? Hint: to remove observations from a data set, you may use the subset function. For instance, to consider only observations with latitude between 40 and 50, one could do this:

subset( crustacean , latitude<50 & latitude>40)
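
One possible workflow for (f) is sketched below. It looks at deviance residuals and Cook's distances, then refits without one observation using subset. Dropping the single observation with the largest Cook's distance is only an illustrative rule chosen here; which observations count as influential, and why, is for you to judge from the plots.

```r
crustacean <- data.frame(
  location = c("Port Townsend, WA", "Neskowin, OR", "Siuslaw R., OR",
               "Umpqua R., OR", "Coos Bay, OR", "San Francisco, CA",
               "Carmel, CA", "Santa Barbara, CA"),
  latitude = c(48.1, 45.2, 44.0, 43.7, 43.5, 37.8, 36.6, 34.3),
  Mpi90    = c(47, 177, 1087, 187, 397, 40, 39, 30),
  Mpi100   = c(139, 241, 1183, 175, 671, 14, 17, 0)
)
fit <- glm(cbind(Mpi90, Mpi100) ~ latitude,
           family = binomial, data = crustacean)

# Deviance residuals and Cook's distances for each location
dres <- residuals(fit, type = "deviance")
cook <- cooks.distance(fit)
print(data.frame(location = crustacean$location, dres = dres, cook = cook))

# Residuals versus latitude, with a reference line at zero
plot(crustacean$latitude, dres,
     xlab = "Latitude (degrees north)", ylab = "Deviance residual")
abline(h = 0, lty = 2)

# Illustrative choice: drop the location with the largest Cook's distance
drop_loc <- crustacean$location[which.max(cook)]
reduced  <- subset(crustacean, location != drop_loc)

fit_reduced <- glm(cbind(Mpi90, Mpi100) ~ latitude,
                   family = binomial, data = reduced)

# Compare the latitude effect with and without that observation
print(summary(fit)$coefficients)
print(summary(fit_reduced)$coefficients)
```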

(g) Based on the deviance, does the model fit this data well? (You might think about what corrective action would be appropriate if there is lack of fit, although you are not required to.)
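
A common deviance-based check for grouped binomial data compares the residual deviance with a chi-square distribution on the residual degrees of freedom. The sketch below assumes that approach; the approximation is rougher when some groups are small, and the names dev, df and p_gof are illustrative.

```r
crustacean <- data.frame(
  latitude = c(48.1, 45.2, 44.0, 43.7, 43.5, 37.8, 36.6, 34.3),
  Mpi90    = c(47, 177, 1087, 187, 397, 40, 39, 30),
  Mpi100   = c(139, 241, 1183, 175, 671, 14, 17, 0)
)
fit <- glm(cbind(Mpi90, Mpi100) ~ latitude,
           family = binomial, data = crustacean)

# Residual deviance and its degrees of freedom
dev <- deviance(fit)
df  <- df.residual(fit)

# Upper-tail chi-square probability: small values suggest lack of fit
p_gof <- pchisq(dev, df = df, lower.tail = FALSE)

print(c(deviance = dev, df = df, p_value = p_gof))
```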

Reading: Chapter 5 for logistic regression. For model selection, suggested (but not required) reading is Chapter 8 in Linear Models with R by Julian Faraway.