
22s:

Homework 3: Solutions

Assigned Wednesday, September 17

Due Wednesday, September 24 at class time

Simple linear regression

1. The formula for the variance of the estimated intercept is:

\mathrm{Var}(\hat{\beta}_0) = \sigma^2\left(\frac{1}{n} + \frac{\bar{x}^2}{\sum_{i=1}^{n}(x_i - \bar{x})^2}\right) = \frac{\sigma^2 \sum_{i=1}^{n} x_i^2}{n \sum_{i=1}^{n}(x_i - \bar{x})^2}

Looking at this formula, we see that the variability in the estimated intercept is smallest when x̄ = 0. If x̄ = 0, then we have x-values above and below x = 0, and we can get a pretty good estimate of Ŷᵢ when xᵢ = 0, which is in the 'middle' of the data. If most of the x-values are far away from x = 0 (as in the pictures below), it is much harder to do a good job of estimating Ŷᵢ when xᵢ = 0 (i.e., the estimated intercept)... we don't have a lot of information near x = 0. Depending on which sample you draw, the place where the fitted line crosses the y-axis can change quite a bit.
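As a quick numerical check (our addition, not part of the assigned problem), we can verify this formula on a simulated data set by comparing it to the intercept variance that R itself reports through vcov(); the data and the object names (x, y, fit) below are made up for illustration, and σ̂² stands in for the unknown σ².

> set.seed(1)
> x=runif(40,5,15)               ## x-values well away from 0
> y=15+2*x+rnorm(40,sd=3)        ## true beta0=15, beta1=2
> fit=lm(y~x)
> sigma2.hat=summary(fit)$sigma^2
> Sxx=sum((x-mean(x))^2)
> sigma2.hat*(1/length(x)+mean(x)^2/Sxx)   ## Var(beta0-hat) from the formula
> sigma2.hat*sum(x^2)/(length(x)*Sxx)      ## same thing, second form
> vcov(fit)[1,1]                           ## R's estimate of the same quantity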

[Figure: four scatterplots of y vs. x, each from a random sample of size 40, labeled with its estimated slope and intercept (slopes close to 2, intercepts ranging from about 12 to 17; the decimal digits are truncated in the original).]

The above 4 plots show four random samples of size 40, each drawn from the same X, Y distribution with a known linear relationship (β₀ = 15, β₁ = 2). Though the estimated slopes are all fairly close, the intercepts have a wider range of estimated values.
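To see this sampling variability directly, here is a small simulation sketch (ours, not part of the original solutions) that repeatedly draws samples of size 40 with x-values far from 0 from the same model and compares the spread of the estimated intercepts and slopes.

> set.seed(2)
> sims=replicate(1000,{x=runif(40,5,15); y=15+2*x+rnorm(40,sd=3); coef(lm(y~x))})
> apply(sims,1,sd)   ## the intercept estimates vary much more than the slope estimates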

2. The slope is better estimated when the relationship is easier to detect, or visually more apparent. Recall the example we did in class where we took a subset of 20 data points from a full set of 100 points that showed a strong relationship between x and y. If we took the middle 20 points, the relationship between x and y was less apparent; the range of the y-values in that subset was about -5 to 20, but σ̂ (the estimated variability around the fitted line) was about the same as if we had considered the whole data set. If we took the 20 outermost data points, the range of the y-values went from -30 to 35 (much larger), and again, σ̂ was about the same. The relationship between x and y is much more apparent in this second subset (the one containing the extreme x-values). The fitted line in this second subset explains a large percentage of the variability in the y-values. The x-values in this second subset are much more spread out than in the first subset, which only included the middle data points.

[Figure: three scatterplots of y vs. x over the range -10 to 10: 'All the data', 'Most extreme 20 points', and 'Middle 20 points'.]

When there is no variability in X, all the points fall on a vertical line and the slope is undefined. Mathematically we can see this in the formula for the estimated slope, because the denominator Σ(xᵢ − x̄)² would be 0 and we would be dividing by 0.
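The in-class subsetting example can be sketched in R as well (simulated data, not the actual class data set); the point is that σ̂ is similar in the two subsets, but the standard error of the slope is much smaller, and R² much larger, when the x-values are spread out.

> set.seed(3)
> x=sort(runif(100,-10,10))
> y=2+3*x+rnorm(100,sd=4)
> extreme=c(1:10,91:100)            ## the 20 points with the most extreme x-values
> middle=41:60                      ## the middle 20 points
> fit.ext=lm(y[extreme]~x[extreme])
> fit.mid=lm(y[middle]~x[middle])
> c(summary(fit.ext)$sigma,summary(fit.mid)$sigma)              ## sigma-hat: about the same
> c(coef(summary(fit.ext))[2,2],coef(summary(fit.mid))[2,2])    ## slope SE: much smaller for extreme x
> c(summary(fit.ext)$r.squared,summary(fit.mid)$r.squared)      ## R^2: much larger for extreme x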

3. This problem is on the relationship between PCB levels in fish and their age.

(a) It looks like there’s some curvature in this relationship. It looks non-linear.

> PCB.data=read.csv("PCB_in_fish.csv")

> attach(PCB.data)

> plot(PCB~Age) ## or you could use plot(Age,PCB)

[Figure: scatterplot of PCB vs. Age.]

(b) Some output first:

> lm.out=lm(PCB~Age)

> lm.out$coefficients

(Intercept) Age

-1.451944 1.

> par(mfrow=c(1,2))

> plot(lm.out$fitted.values,lm.out$residuals)

> abline(h=0)

> qqnorm(lm.out$residuals)

> qqline(lm.out$residuals)

[Figure: residuals vs. fitted values (left) and normal Q-Q plot of the residuals (right) for the untransformed fit.]

If we ignore the curvature in the relationship and go ahead and fit an SLR anyway, we definitely have some violations of the error assumptions. It looks like variability increases with the mean (or fitted value), so we're violating the constant variance assumption. Also, the qq-plot suggests there is some concern about non-normality; specifically, there are some unusually large residuals (both positive and negative).

(c) Apply the logₑ transformation...

> plot(log(PCB)~Age,pch=16)

> y.new=log(PCB)

> lm.out=lm(y.new~Age)

> lm.out$coefficients

(Intercept) Age

0.03147247 0.

[Figure: scatterplot of log(PCB) vs. Age, residuals vs. fitted values, and normal Q-Q plot of the residuals for the transformed fit.]

After transformation, the relationship looks fairly linear (first plot), and the constant variance assumption looks pretty good (second plot). There's a slight departure from normality for the largest residuals, but nothing falls too far from the line (like it did in the untransformed data).

(d) > summary(lm.out)

Call:

lm(formula = y.new ~ Age)

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) 0.03147 0.20136 0.156 0.

Age 0.25913 0.03080 8.414 6.78e-09 ***

---

Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.567 on 26 degrees of freedom

Multiple R-squared: 0.7314,Adjusted R-squared: 0.

F-statistic: 70.8 on 1 and 26 DF, p-value: 6.78e-09

Age is a very significant linear predictor for log(PCB). The t-statistic is 8.414 with a p-value of 6.78 × 10⁻⁹.

(e) A 1-year increase in the age of a fish is associated with a 0.25913 increase in the mean log(PCB) level.
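Since the response is the natural log of PCB, we can also express this (our note, not part of the original answer) as a multiplicative effect on the original scale: each additional year of age multiplies the typical (geometric-mean) PCB level by about e^0.25913.

> exp(0.25913)   ## about 1.30, i.e. roughly a 30% increase in PCB per year of age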

> detach(PCB.data)

4. The Robey data set...

(a) > library(car)   ## assuming the Robey data set from the car package
> attach(Robey)
> plot(contraceptors,tfr,pch=16)

[Figure: scatterplot of tfr vs. contraceptors.]

(b) > lm.out=lm(tfr~contraceptors)

> summary(lm.out)

...

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) 6.875085 0.156860 43.83 <2e-16 ***

contraceptors -0.058416 0.003584 -16.30 <2e-16 ***

Contraceptors is a significant linear predictor of total fertility rate. The t-statistic is -16.30 with a p-value < 2 × 10⁻¹⁶.

(c) The estimated standard error for the intercept is 0.156860, and the estimated standard error for the slope is 0.003584.

(d) > confint(lm.out)

2.5 % 97.5 %

(Intercept) 6.55969710 7.

contraceptors -0.06562173 -0.

The 95% C.I. for the intercept is [6.56, 7.19].

The 95% C.I. for the slope is [-0.066, -0.051].
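As a check (our sketch, not part of the original solution), these intervals are just estimate ± t* × SE; matching the confint() output requires t* = qt(0.975, 48), i.e. 48 residual degrees of freedom, which assumes the 50-country version of the Robey data.

> t.star=qt(0.975,df=48)
> 6.875085+c(-1,1)*t.star*0.156860    ## approximately [6.56, 7.19]
> -0.058416+c(-1,1)*t.star*0.003584   ## approximately [-0.066, -0.051]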

(e) Ŷ = 6.875085 − 0.058416(25) = 5.414685.
A country in which 25% of married women of childbearing age use contraceptives is predicted to have a total fertility rate of 5.41 children per woman.
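Equivalently (a one-line sketch of the same computation), the prediction can be obtained from the stored coefficients:

> sum(coef(lm.out)*c(1,25))   ## 1*intercept + 25*slope = 5.414685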

(f) > predict(lm.out,newdata=data.frame(contraceptors=c(10,40)),se.fit=TRUE)

$fit

6.290928 4.

$se.fit

0.12755988 0.

i) What are the predicted values?
When contraceptors is 10%, we expect a total fertility rate of 6.29.
When contraceptors is 40%, we expect a total fertility rate of 4.54.
ii) What are the standard errors for the estimated Ŷ's?
When contraceptors is 10%, the standard error for Ŷ is 0.1276.
When contraceptors is 40%, the standard error for Ŷ is 0.0818.

iii) Compare the size of the standard errors and how this relates to the x-values at which the predictions were made.
The standard error for the predicted mean is smaller when x = 40 because this x-value is near the middle of the data, where we have a lot of information. The standard error for the predicted mean is larger when x = 10; this x-value is at the far left end of the overall distribution of x-values. We don't have a lot of information about the relationship between x and y for x-values less than 10, so there is less information in that region.