Stat 430: Analyzing Mercury in Fish & Creating Regression Model for Medieval Cathedrals, Assignments of Statistics

Instructions for a statistics problem set in three parts. In the first part, students analyze a dataset on mercury contamination in fish using SAS: creating scatterplots, applying log transformations, identifying outliers, and fitting regression models. In the second part, students create a large SAS dataset by pseudo-random Monte Carlo simulation and use regression diagnostics to recover the known correct model, including a quadratic term and an interaction. In the third part, students fit a multiple regression model for the heights of medieval English cathedrals in terms of style and length. The objective is to demonstrate the use of statistical tools to build accurate models.

Typology: Assignments

Pre 2010

Uploaded on 07/30/2009

Stat 430, Problem Set 6, Due Friday April 24, 2009
For this assignment, provide the SAS program code used as well as the
edited SAS output you produced to answer the questions. You may annotate
your SAS output, in handwritten form if you like, but verbally explain how
your output answers the questions asked, and please do not hand in data or
printed output which is not specifically requested and does not figure in your
answers to questions.
(I). The dataset bass contains data from a study of mercury contamination
in fish that live in Floridian lakes.
(a). Make a scatterplot of average mercury contamination (AvgMercury)
as a function of alkalinity.
(b). Use the log transform on the response, and fit a regression line.
Prepare a scatterplot with regression line, and a residual plot as well.
(c). Construct the Cook’s distance measure from the data. Remove the
outliers that you identify from the residual plot. Do the outliers have the
largest Cook’s distances?
(d). After removing cases 36 and 52, you will see that there is evidence
of another outlier. Keep going until you have deleted everything outlier-like
(i.e. cases 36, 52, 40, 3, and 38). Fit a regression model to the remaining
points and identify the changes in the regression coefficient estimates and
in adj R-sq between your model with all the outliers and your model with
none of them.
(e). Construct the 95% prediction interval at the value of the predictor
near the outlier, using SAS to generate the intervals before and after removing
the outlier. Does the outlier appear to have a great effect on this interval?
Based on your result, decide whether or not you would consider the outlier
an influential case.
(f). Do a scatterplot of the response versus the log(predictor). Why would
this produce prediction intervals that would be difficult to trust? [Note: you
may be able to solve this one by observation, without running the regression
program again].
(g). Based on the results from your analysis, would you hypothesize
that acid rain (which decreases alkalinity) is likely to improve or make worse
the average levels of mercury contamination in fish? Would your conclusion
from this analysis alone be sufficient to have NOAA sending trucks to dump
calcium chloride into Florida lakes? Explain briefly.
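
As a cross-check on the SAS output for parts (b) and (c), Cook's distance for a simple linear regression can be computed by hand from the residuals e_i and leverages h_i as D_i = e_i^2 / (p * MSE) * h_i / (1 - h_i)^2, with p = 2 coefficients. The following is a minimal Python sketch under stated assumptions: the ten data points are made up for illustration (they are not the bass dataset), one outlier is planted deliberately, and the function names are hypothetical.

```python
import math

def fit_line(x, y):
    """Least-squares intercept and slope for y = b0 + b1*x."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = ybar - b1 * xbar
    return b0, b1

def cooks_distances(x, y):
    """Cook's distance for every case in a simple linear regression."""
    n = len(x)
    p = 2  # estimated coefficients: intercept and slope
    b0, b1 = fit_line(x, y)
    xbar = sum(x) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    resid = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
    mse = sum(e * e for e in resid) / (n - p)
    out = []
    for xi, e in zip(x, resid):
        h = 1.0 / n + (xi - xbar) ** 2 / sxx  # leverage of this case
        out.append(e * e / (p * mse) * h / (1.0 - h) ** 2)
    return out

# Made-up illustration data (NOT the bass dataset): an alkalinity-like x and
# a mercury-like y, with one deliberately planted outlier at index 7.
alk = [5.0, 12.0, 20.0, 35.0, 48.0, 60.0, 75.0, 90.0, 105.0, 120.0]
hg = [1.30, 1.10, 0.95, 0.80, 0.62, 0.55, 0.40, 1.60, 0.25, 0.20]
log_hg = [math.log(v) for v in hg]  # log transform of the response, as in (b)

d = cooks_distances(alk, log_hg)
worst = max(range(len(d)), key=lambda i: d[i])
print("case with largest Cook's distance:", worst)
```

In this toy example the planted outlier also has the largest Cook's distance, but that need not happen in general: a case with a large residual and low leverage can have a smaller D_i than a moderate residual at an extreme predictor value, which is the point of the question in (c).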

(II). (a) Create (and save!) by pseudo-random Monte Carlo simulation a large (n=1000) SAS dataset with the columns Y, X, Z defined as follows: X ∼ Uniform[0, 5] and Z ∼ Binom(1, 0.5) are independent random variables in each row, and if V denotes another independent random variable with a t_3 distribution (Student's t with 3 degrees of freedom), then

Y = 1.5 + 2 ∗ X − 0.4 ∗ X^2 + 7 ∗ Z − 0.1 ∗ X ∗ Z + 2 ∗ V

Hint: if you multiply c times a Uniform[0, 1] random variable, you get a Uniform[0, c] random variable; and the easiest way to generate a t_3 random variable is to generate four independent N(0, 1) random variables W_1, W_2, W_3, W_4 using the SAS function RANNOR and then define

V = W_1 ∗ √3 / √(W_2^2 + W_3^2 + W_4^2).

(b) Fit a simple linear regression model of Y on X to your dataset. Examine a residual plot (or use SAS-generated prediction intervals, studentized residuals, or other statistical tools in SAS) to show how you would be guided in this setting to augment the model by including a quadratic (X^2) term and also a term involving Z.
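
The data-generating recipe in part (a), including the t_3 construction from the hint, can be cross-checked outside SAS. The following is a minimal Python sketch, not the required SAS program: the function names and the seed value are assumptions made for illustration, and Python's `random.gauss` stands in for RANNOR.

```python
import math
import random

def t3_variate(rng):
    """One t_3 draw via the hint: W1 * sqrt(3) / sqrt(W2^2 + W3^2 + W4^2)."""
    w1, w2, w3, w4 = (rng.gauss(0.0, 1.0) for _ in range(4))
    return w1 * math.sqrt(3.0) / math.sqrt(w2 * w2 + w3 * w3 + w4 * w4)

def simulate(n=1000, seed=430):
    """Return n rows (Y, X, Z) drawn from the model stated in part (a)."""
    rng = random.Random(seed)  # fixed seed so the dataset can be re-created
    rows = []
    for _ in range(n):
        x = 5.0 * rng.random()             # c * Uniform[0, 1] gives Uniform[0, c]
        z = 1 if rng.random() < 0.5 else 0  # Binom(1, 0.5)
        v = t3_variate(rng)
        y = 1.5 + 2.0 * x - 0.4 * x ** 2 + 7.0 * z - 0.1 * x * z + 2.0 * v
        rows.append((y, x, z))
    return rows

rows = simulate()
print(len(rows), rows[0])
```

Fixing the seed plays the same role as saving the dataset: every later fit is run against exactly the same simulated rows.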

(c) Now fit the multiple regression model with Y modelled in terms of X, X^2, Z. Note that in this problem, we know in advance what the correct model should be. The objective is to show which tools get us to build the correct model. What tools would you use to examine whether a fourth predictor variable X ∗ Z will actually improve the fit of the model?
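
One candidate tool for the question in (c) is a partial F test comparing the model with X, X^2, Z against the model that also includes X ∗ Z: F = (SSE_reduced − SSE_full) / (SSE_full / (n − 5)) on (1, n − 5) degrees of freedom. The following is a minimal Python sketch of that comparison on freshly simulated data; the helper names and the seed are assumptions for illustration, the fit uses the normal equations for brevity, and since the errors are heavy-tailed t_3 rather than normal, the F reference distribution is only approximate.

```python
import math
import random

def solve(a, b):
    """Solve the linear system a * beta = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [bi] for row, bi in zip(a, b)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    beta = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * beta[c] for c in range(r + 1, n))
        beta[r] = (m[r][n] - s) / m[r][r]
    return beta

def sse(design, y):
    """Residual sum of squares from least squares of y on the design columns."""
    k = len(design[0])
    xtx = [[sum(row[i] * row[j] for row in design) for j in range(k)]
           for i in range(k)]
    xty = [sum(row[i] * yi for row, yi in zip(design, y)) for i in range(k)]
    beta = solve(xtx, xty)
    return sum((yi - sum(b * v for b, v in zip(beta, row))) ** 2
               for row, yi in zip(design, y))

# Simulate n = 1000 rows from the model in part (a); the seed is arbitrary.
rng = random.Random(430)
reduced, full, y = [], [], []
for _ in range(1000):
    x = 5.0 * rng.random()
    z = 1.0 if rng.random() < 0.5 else 0.0
    w = [rng.gauss(0.0, 1.0) for _ in range(4)]
    v = w[0] * math.sqrt(3.0) / math.sqrt(w[1] ** 2 + w[2] ** 2 + w[3] ** 2)
    y.append(1.5 + 2 * x - 0.4 * x * x + 7 * z - 0.1 * x * z + 2 * v)
    reduced.append([1.0, x, x * x, z])        # intercept, X, X^2, Z
    full.append([1.0, x, x * x, z, x * z])    # adds the X*Z interaction

sse_r, sse_f = sse(reduced, y), sse(full, y)
df_f = len(y) - 5  # residual degrees of freedom of the full model
f_stat = (sse_r - sse_f) / (sse_f / df_f)
print("partial F for X*Z on (1, %d) df: %.3f" % (df_f, f_stat))
```

Because only one term is added, this F statistic equals the square of the t statistic for the X ∗ Z coefficient in the full model, so the same information appears in the coefficient table of the regression output.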

(d) For the multiple regression model based on the correct model (Y regressed on X, X^2, Z, X ∗ Z), do the residuals look patternless (plotted both against X and against the predicted value Ŷ)? Plot a histogram of the residuals from the final fitted model and examine them for normality. (Use histograms with over-plotted normal densities with the same mean and variance, or a QQ plot.) Do the residuals look normal?
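
The QQ-plot check in (d) pairs each sorted, standardized residual with the standard normal quantile at plotting position (i − 0.5)/n. The following is a minimal Python sketch of that pairing, under a loudly stated assumption: instead of residuals from a fitted model, it uses simulated 2 ∗ V errors (V with a t_3 distribution) as stand-in residuals. The function name and seed are hypothetical.

```python
import math
import random
from statistics import NormalDist, mean, stdev

def normal_qq_pairs(resid):
    """Pair each standardized, sorted residual with its normal quantile."""
    n = len(resid)
    m, s = mean(resid), stdev(resid)
    std_normal = NormalDist()
    # Plotting positions (i - 0.5)/n for i = 1..n, written with 0-based i.
    theo = [std_normal.inv_cdf((i + 0.5) / n) for i in range(n)]
    samp = sorted((e - m) / s for e in resid)
    return list(zip(theo, samp))

# Stand-in residuals (an assumption): the 2*V error term from part (II)(a),
# not residuals from an actual fitted model.
rng = random.Random(430)
resid = []
for _ in range(1000):
    w = [rng.gauss(0.0, 1.0) for _ in range(4)]
    t3 = w[0] * math.sqrt(3.0) / math.sqrt(w[1] ** 2 + w[2] ** 2 + w[3] ** 2)
    resid.append(2.0 * t3)

pairs = normal_qq_pairs(resid)
for theo, samp in pairs[:3] + pairs[-3:]:
    print("normal quantile %6.2f   standardized residual %6.2f" % (theo, samp))
```

Points that fall far from the line y = x at the two ends of the plot indicate tails heavier or lighter than normal, which is the feature to look for when the true errors are t_3.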

(III). The dataset cathedrals contains a list of the heights and lengths of a selection of medieval English cathedrals. Of these, the Romanesque cathedrals are indexed by style=0 and the Gothic by style=1. Find the best model you can to describe height in terms of style and length. (Choose a reasonable criterion for this!) You may want to consider the following:

  • transformations do not seem to be useful,