

Instructions for a statistics problem set consisting of three parts. In the first part, students are required to analyze a dataset on mercury contamination in fish using SAS, including creating scatterplots, applying log transformations, identifying outliers, and fitting regression models. In the second part, students are asked to create a large SAS dataset using pseudo-random Monte Carlo simulation and to use residual diagnostics to arrive at the correct multiple regression model. In the third part, students fit a regression model for the heights of medieval English cathedrals in terms of style and length, considering various predictor combinations and interactions. The objective is to demonstrate the use of statistical tools to build accurate models.
Typology: Assignments


For this assignment, provide the SAS program code you used as well as the edited SAS output you produced to answer the questions. You may annotate your SAS output, by hand if you like, but explain in words how your output answers the questions asked. Please do not hand in data or printed output that is not specifically requested and does not figure in your answers to the questions.
(I). The dataset bass contains data from a study of mercury contamination in fish that live in Floridian lakes.
(a). Make a scatterplot of average mercury contamination (AvgMercury) as a function of alkalinity.
(b). Use the log transform on the response, and fit a regression line. Prepare a scatterplot with regression line, and a residual plot as well.
(c). Construct the Cook's distance measure from the data. Remove the outliers that you identify from the residual plot. Do the outliers have the largest Cook's distances?
(d). After removing cases 36 and 52, you will see that there is evidence of another outlier. Keep going until you have deleted everything outlier-like (i.e. cases 36, 52, 40, 3, and 38). Fit a regression model to the remaining points and identify the changes in the regression coefficient estimates and in adj R-sq between your model with all the outliers and your model with none of them.
(e). Construct the 95% prediction interval at the value of the predictor near the outlier, using SAS to generate the intervals before and after removing the outlier. Does the outlier appear to have a great effect on this interval? Based on your result, decide whether or not you would consider the outlier an influential case.
(f). Do a scatterplot of the response versus the log(predictor). Why would this produce prediction intervals that would be difficult to trust? [Note: you may be able to solve this one by observation, without running the regression program again.]
(g). Based on the results from your analysis, would you hypothesize that acid rain (which decreases alkalinity) is likely to improve or make worse the average levels of mercury contamination in fish? Would your conclusion from this analysis alone be sufficient to have NOAA sending trucks to dump calcium chloride into Florida lakes? Explain briefly.
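As a reminder of what the Cook's distance in part (c) of Problem (I) measures, here is a minimal sketch in Python rather than SAS (the assignment itself still calls for SAS output). It uses the standard formula for a simple linear regression, D_i = e_i^2 / (p * MSE) * h_i / (1 - h_i)^2 with p = 2. The function name `cooks_distances` and the toy numbers are hypothetical illustrations; they are not the bass data.

```python
import math


def cooks_distances(x, y):
    """Cook's distance for every case in a simple linear regression of y on x."""
    n = len(x)
    p = 2  # fitted parameters: intercept and slope
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    slope = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    intercept = ybar - slope * xbar
    resid = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    mse = sum(e * e for e in resid) / (n - p)
    # Leverage of each case in a one-predictor model.
    leverage = [1.0 / n + (xi - xbar) ** 2 / sxx for xi in x]
    return [(e * e / (p * mse)) * h / (1.0 - h) ** 2
            for e, h in zip(resid, leverage)]


# Hypothetical toy values (NOT the bass dataset): a decreasing trend
# with one aberrant case planted at index 5.
alkalinity = [5.0, 10.0, 20.0, 35.0, 50.0, 60.0, 80.0, 100.0, 120.0, 128.0]
avg_mercury = [1.20, 1.10, 0.80, 0.60, 0.45, 1.90, 0.28, 0.20, 0.15, 0.13]
log_mercury = [math.log(m) for m in avg_mercury]  # log-transformed response

d = cooks_distances(alkalinity, log_mercury)
worst = max(range(len(d)), key=d.__getitem__)
print("case with largest Cook's distance:", worst)
```

Because Cook's distance combines the squared residual with the leverage, a case with a large residual but a predictor value near the mean can rank lower than expected, which is what part (c) asks you to check.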
(II). (a) Create (and save!) by pseudo-random Monte Carlo simulation a large (n = 1000) SAS dataset with the columns Y, X, Z defined as follows: X ∼ Uniform[0, 5] and Z ∼ Binom(1, 0.5) are independent random variables in each row, and if V denotes another independent random variable with a t_3 distribution (Student's t with 3 degrees of freedom), then
    Y = 1.5 + 2 ∗ X − 0.4 ∗ X^2 + 7 ∗ Z − 0.1 ∗ X ∗ Z + 2 ∗ V
Hint: if you multiply c times a Uniform[0, 1] random variable, you get a Uniform[0, c] random variable; and the easiest way to generate a t_3 random variable is to generate four independent N(0, 1) random variables W_1, W_2, W_3, W_4 using the SAS function RANNOR and then define
    V = W_1 ∗ √3 / √(W_2^2 + W_3^2 + W_4^2).
(b) Fit a simple linear regression model of Y on X to your dataset. Examine a residuals plot (or use SAS-generated prediction intervals, studentized residuals, or other statistical tools in SAS) to show how you would be guided in this setting to augment the model by including a quadratic (X^2) term in the model and also a term involving Z.
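The data-generating recipe in part (a) can be sketched as follows. This is an illustration in Python, not the SAS program the assignment asks for: `random.gauss` stands in for RANNOR, the function name `simulate` and the seed value are arbitrary assumptions, and the coefficients 0.4 and 0.1 are read from the formula for Y in part (a).

```python
import math
import random


def simulate(n=1000, seed=12345):
    """Return n rows (Y, X, Z) following the recipe in part (a)."""
    rng = random.Random(seed)  # fixed seed so the dataset can be reproduced
    rows = []
    for _ in range(n):
        x = 5.0 * rng.random()              # c * Uniform[0, 1] gives Uniform[0, c]
        z = 1 if rng.random() < 0.5 else 0  # Binom(1, 0.5)
        w1, w2, w3, w4 = (rng.gauss(0.0, 1.0) for _ in range(4))
        # t with 3 df: standard normal over sqrt(chi-square_3 / 3)
        v = w1 * math.sqrt(3.0) / math.sqrt(w2 * w2 + w3 * w3 + w4 * w4)
        y = 1.5 + 2.0 * x - 0.4 * x ** 2 + 7.0 * z - 0.1 * x * z + 2.0 * v
        rows.append((y, x, z))
    return rows


rows = simulate()
print(len(rows), "rows; first row:", rows[0])
```

Fixing the seed plays the same role as saving the dataset: every later model is fit to the same simulated rows.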
(c) Now fit the multiple regression model with Y modelled in terms of X, X^2, Z. Note that in this problem, we know in advance what the correct model should be. The objective is to show which tools get us to build the correct model. What tools would you use to examine whether a fourth predictor variable X ∗ Z will actually improve the fit of the model?
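One tool that fits the question in part (c) is a partial F-test comparing the model with X, X^2, Z against the model that also includes X ∗ Z. For a single added term it is equivalent to the t-test on that coefficient. The sketch below is in Python for illustration only, not SAS. The helper names `ols_sse` and `partial_f` are hypothetical, the data are freshly simulated, and the noise is Gaussian for brevity rather than the 2 ∗ V term of part (a).

```python
import random


def ols_sse(columns, y):
    """Least squares of y on the given columns plus an intercept.

    Returns (residual sum of squares, number of fitted parameters).
    """
    n = len(y)
    design = [[1.0] + [col[i] for col in columns] for i in range(n)]
    p = len(design[0])
    # Augmented normal equations: (X'X | X'y).
    a = [[sum(r[j] * r[k] for r in design) for k in range(p)]
         + [sum(r[j] * yi for r, yi in zip(design, y))] for j in range(p)]
    # Gauss-Jordan elimination with partial pivoting.
    for c in range(p):
        piv = max(range(c, p), key=lambda r: abs(a[r][c]))
        a[c], a[piv] = a[piv], a[c]
        for r in range(p):
            if r != c:
                f = a[r][c] / a[c][c]
                a[r] = [u - f * v for u, v in zip(a[r], a[c])]
    beta = [a[j][p] / a[j][j] for j in range(p)]
    sse = sum((yi - sum(b * v for b, v in zip(beta, r))) ** 2
              for r, yi in zip(design, y))
    return sse, p


def partial_f(reduced, full, y):
    """F statistic for the extra columns in `full` relative to `reduced`."""
    sse_r, p_r = ols_sse(reduced, y)
    sse_f, p_f = ols_sse(full, y)
    df_num, df_den = p_f - p_r, len(y) - p_f
    return ((sse_r - sse_f) / df_num) / (sse_f / df_den), df_num, df_den


rng = random.Random(2024)  # arbitrary seed
n = 1000
x = [5.0 * rng.random() for _ in range(n)]
z = [float(rng.random() < 0.5) for _ in range(n)]
x2 = [xi * xi for xi in x]
xz = [xi * zi for xi, zi in zip(x, z)]
# Assumption: Gaussian noise instead of the assignment's 2 * t_3 noise.
y = [1.5 + 2 * xi - 0.4 * xi ** 2 + 7 * zi - 0.1 * xi * zi + 2 * rng.gauss(0, 1)
     for xi, zi in zip(x, z)]

f_stat, df_num, df_den = partial_f([x, x2, z], [x, x2, z, xz], y)
print(f"F = {f_stat:.3f} on ({df_num}, {df_den}) degrees of freedom")
```

The statistic is compared with an F distribution on the printed degrees of freedom. Because the true interaction coefficient is small relative to the noise, the test may well fail to detect it in a single simulated dataset.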
(d) For the multiple regression model based on the correct model (Y regressed on X, X^2, Z, X ∗ Z), do the residuals look patternless (plotted both against X and against the predicted values Ŷ)? Plot a histogram of the residuals from the final fitted model and examine them for normality. (Use histograms with over-plotted normal densities with the same mean and variance, or a QQ plot.) Do the residuals look normal?
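The QQ plot mentioned in part (d) pairs each sorted, standardized residual with the matching quantile of a standard normal distribution; points that bend away from the 45-degree line in the tails suggest non-normal errors. The following is a minimal Python sketch of how those pairs are formed, not SAS output. The function name `normal_qq_points` is hypothetical, and the input is a stand-in sample of 2 ∗ t_3 draws rather than residuals from a fitted model.

```python
import math
import random
from statistics import NormalDist, mean, stdev


def normal_qq_points(residuals):
    """Pair theoretical N(0, 1) quantiles with sorted standardized residuals."""
    n = len(residuals)
    m, s = mean(residuals), stdev(residuals)
    observed = sorted((r - m) / s for r in residuals)
    std_normal = NormalDist()
    # Plotting positions (i + 0.5) / n stay strictly inside (0, 1).
    theoretical = [std_normal.inv_cdf((i + 0.5) / n) for i in range(n)]
    return list(zip(theoretical, observed))


rng = random.Random(7)  # arbitrary seed


def t3():
    """One t_3 draw built from four standard normals, as in the hint for (II)(a)."""
    w1, w2, w3, w4 = (rng.gauss(0.0, 1.0) for _ in range(4))
    return w1 * math.sqrt(3.0) / math.sqrt(w2 * w2 + w3 * w3 + w4 * w4)


# Stand-in for residuals: the error term 2 * V itself (an assumption).
stand_in = [2.0 * t3() for _ in range(1000)]
points = normal_qq_points(stand_in)

# Inspect the extreme tails, where departures from normality show first.
for q, r in points[:3] + points[-3:]:
    print(f"theoretical {q:6.2f}   observed {r:6.2f}")
```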
(III). The dataset cathedrals contains a list of the heights and lengths of a selection of medieval English cathedrals. Of these, the Romanesque cathedrals are indexed by style=0 and the Gothic by style=1. Find the best model you can to describe height in terms of style and length. (Choose a reasonable criterion for this!) You may want to consider the following: