Practice Problems for Exam 1 - Methods of Applied Statistics | STAT 420, Study notes of Data Analysis & Statistical Methods

practice Material Type: Notes; Professor: Unger; Class: Methods of Applied Statistics; Subject: Statistics; University: University of Illinois - Urbana-Champaign; Term: Spring 2012;


STAT 420 Fall 2012

Practice Problems 1

1. Do NOT use a computer for this problem.

Alex obtains a random sample of seven students from STAT 100 class he taught in

Fall Semester of 2007 and wants to use it to see if there is a relationship between

the number of absences and students’ final grade. The data and the scatterplot are

given below.

Number of Absences, x    Final Grade Percentage, y
 6                       70
 2                       92
15                       47
 9                       72
11                       49
 5                       96
 8                       78

$\sum x = 56$, $\sum y = 504$, $\sum x^2 = 556$, $\sum y^2 = 38{,}458$, $\sum xy = 3{,}600$,

$\sum (x - \bar{x})^2 = 108$, $\sum (y - \bar{y})^2 = 2{,}170$, $\sum (x - \bar{x})(y - \bar{y}) = \sum (x - \bar{x})\,y = -432$.

Consider the model $Y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$, where the $\varepsilon_i$ are i.i.d. $N(0, \sigma^2)$.

a) Find the equation of the least-squares regression line.

b) What proportion of observed variation in final grade percentage is explained by a

straight-line relationship with the number of absences?

c) Give an estimate for σ, the standard deviation of the observations about the true regression line.

d) Use the F-test to test the hypothesis that absences from class do not affect the final grade percentage. That is, test $H_0: \beta_1 = 0$ vs. $H_1: \beta_1 \neq 0$ at a 5% significance level. Report the value of the test statistic, critical value(s), and decision.

e) Use the t-test to test the hypothesis that absences from class do not affect the final grade percentage. That is, test $H_0: \beta_1 = 0$ vs. $H_1: \beta_1 \neq 0$ at a 5% significance level. Report the value of the test statistic, critical value(s), and decision.

f) Alex claims that each absence lowers the student’s final grade percentage by at least 5 percentage points (on average). Test his claim at a 5% significance level. That is, test $H_0: \beta_1 = -5$ vs. $H_1: \beta_1 > -5$. Report the value of the test statistic, critical value(s), and decision.

g) A student claims that the average final grade percentage for the students without absences is above 90 (the cut-off for an A). Test $H_0: \beta_0 = 90$ vs. $H_1: \beta_0 > 90$ at a 5% significance level. Report the value of the test statistic, critical value(s), and decision.

h) Construct a 90% confidence interval for the average final grade percentage for a student who missed 7 classes.

i) Construct 90% limits of prediction for y when x = 7.

2. Do use a computer to do (☺ and double-check ☺) parts (a), (b), (c), (d), (e), (h), (i) of 1.

3. An agronomist experimented with different amounts of liquid fertilizer on a

sample of equal-size plots. The amount of fertilizer and the yields are:

Plot    Amount of Fertilizer (tons)    Yield (hundreds of bushels)
A       0                              4
B       1                              4.5
C       1                              5.5
D       1.5                            5.5
E       2                              4.5
F       2.5                            6.5
G       3                              5.5
H       3                              7
I       4                              6.5

$\sum x_i = 18$, $\sum y_i = 49.5$, $\sum x_i^2 = 48.5$, $\sum x_i y_i = 107$, $\sum y_i^2 = 280.75$,

$\sum (x_i - \bar{x})^2 = 12.5$, $\sum (x_i - \bar{x})(y_i - \bar{y}) = \sum (x_i - \bar{x})\,y_i = 8$, $\sum (y_i - \bar{y})^2 = 8.5$.

a) The agronomist is interested in predicting yield. What is the dependent variable?

The independent variable?

b) Draw a scatter plot.

c) Find the equation of the least-squares regression

line. Add the regression line to the scatter plot.

d)* Find the sample correlation coefficient.

Consider the model $Y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$, where $Y$ is yield, $x$ is amount of fertilizer, and the $\varepsilon_i$ are i.i.d. $N(0, \sigma^2)$.

e) Find the residuals and plot them against the x -values (amount of fertilizer).

f) What proportion of the observed variation in yield is explained by a straight-line

relationship with the amount of fertilizer?

g) Use the regression line to predict the yield for a plot of land with 7 tons of

fertilizer.

h) Use the regression line to predict the yield for a plot of land with 3.5 tons of

fertilizer.

i) Which prediction ( part (g) or part (h) ) is more reliable? Explain.

j) Compute $s_e^2$.

k) Construct a 90% confidence interval for $\beta_1$.

l) Is there enough evidence to claim that each additional ton of fertilizer increases the yield by less than 100 bushels? I.e., test $H_0: \beta_1 = 1.0$ vs. $H_1: \beta_1 < 1.0$.

(i) Use a 10% level of significance. (ii) Use a 5% level of significance.

m) Construct a 90% confidence interval for the mean response $\mu(x)$ for $x = 5$.

n) Construct a 90% prediction interval for the future $Y$ value if $x = 5$.

o) Test $H_0: \mu(x) = 6.5$ vs. $H_1: \mu(x) > 6.5$ for $x = 5$. Use a 10% level of significance.

p) Test $H_0: \beta_0 = 5$ vs. $H_1: \beta_0 \neq 5$ at a 10% level of significance.

4. A marketing firm wishes to determine whether or not there is a relationship between

the number of television commercials broadcast and the sales of its product. The data,

obtained from 5 different cities, are shown in the following table.

Number of TV Commercials, x    Sales Units, y
3                               7
7                              19
5                              13
9                              15
6                              11

$\sum x = 30$, $\sum y = 65$, $\sum x^2 = 200$, $\sum y^2 = 925$, $\sum xy = 420$,

$\sum (x - \bar{x})^2 = 20$, $\sum (y - \bar{y})^2 = 80$, $\sum (x - \bar{x})(y - \bar{y}) = \sum (x - \bar{x})\,y = 30$.

Consider the model $Y_i = \alpha + \beta x_i + \varepsilon_i$, where the $\varepsilon_i$ are i.i.d. $N(0, \sigma^2)$.

a) Find the equation of the least-squares regression line. Add the least-squares regression

line to the scatter plot.

b) In Anytown, 20 commercials aired. What is your prediction of the sales? Why is it dangerous to predict sales for this particular value of x?

c) Find an estimate for σ, the standard deviation of the observations about the true regression line.

d) What proportion of the observed variation in the sales is explained by a straight-line

relationship with the number of television commercials for the product?

e) Construct a 90% confidence interval for β.

f) Test for the significance of the regression at a 5% level of significance. That is, test $H_0: \beta = 0$ vs. $H_1: \beta \neq 0$.

g) Construct a 95% prediction interval for the sales corresponding to x = 8 TV commercials.

h) Test $H_0: \mu(x = 8) = 20$ vs. $H_1: \mu(x = 8) < 20$ at a 10% level of significance.

i) Test $H_0: \alpha = 0$ vs. $H_1: \alpha \neq 0$ at a 10% level of significance.

Answers:

1. Do NOT use a computer for this problem.


a) Find the equation of the least-squares regression line.

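$\bar{x} = \dfrac{\sum x}{n} = \dfrac{56}{7} = 8$, $\bar{y} = \dfrac{\sum y}{n} = \dfrac{504}{7} = 72$.

$\hat{\beta}_1 = \dfrac{SXY}{SXX} = \dfrac{-432}{108} = -4$.

$\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \cdot \bar{x} = 72 - (-4) \cdot 8 = 104$.

Least-squares regression line: $\hat{y} = 104 - 4x$.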
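The arithmetic in part (a) can be checked from the summary statistics alone. The following is a minimal R sketch, not part of the original solution; the variable names (`sx`, `sy`, `b0`, `b1`) are chosen here for illustration.

```r
# Summary statistics copied from the problem statement.
n   <- 7
sx  <- 56      # sum of x
sy  <- 504     # sum of y
SXX <- 108     # sum of (x - xbar)^2
SXY <- -432    # sum of (x - xbar)(y - ybar)

xbar <- sx / n
ybar <- sy / n

b1 <- SXY / SXX           # slope estimate
b0 <- ybar - b1 * xbar    # intercept estimate

print(c(intercept = b0, slope = b1))
```

Working from the sums rather than the raw data mirrors the "no computer" derivation: only SXY, SXX, and the two means are needed.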

b) What proportion of observed variation in final grade percentage is explained by a

straight-line relationship with the number of absences?

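Need $R^2$.

Residuals: $e_i = y_i - \hat{y}_i = y_i - (\hat{\beta}_0 + \hat{\beta}_1 x_i)$.

 x     y    $\hat{y}$    $e$    $e^2$
 6    70    80           -10    100
 2    92    96            -4     16
15    47    44             3      9
 9    72    68             4     16
11    49    60           -11    121
 5    96    84            12    144
 8    78    72             6     36
Sum:                       0    442

$RSS = \sum e_i^2 = 442$.

OR

$SSRegression = \hat{\beta}_1^2 \cdot SXX = (-4)^2 \cdot 108 = 1728$.

$RSS = SYY - SSRegression = 2170 - 1728 = 442$.

$R^2 = 1 - \dfrac{RSS}{SYY} = 1 - \dfrac{442}{2170} = 0.7963$, i.e. 79.63%.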

c) Give an estimate for σ, the standard deviation of the observations about the true regression line.

$s_e^2 = \dfrac{RSS}{n - 2} = \dfrac{442}{5} = 88.4$, so $s_e = \sqrt{88.4} \approx 9.402$.
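Parts (b) and (c) both follow from the decomposition SYY = SSRegression + RSS. Here is a short R sketch of that chain, using only quantities already derived above; the variable names are illustrative.

```r
# Quantities from the problem statement and part (a).
n   <- 7
SXX <- 108
SYY <- 2170
b1  <- -4

SSReg <- b1^2 * SXX        # regression sum of squares
RSS   <- SYY - SSReg       # residual sum of squares
R2    <- 1 - RSS / SYY     # coefficient of determination
s2    <- RSS / (n - 2)     # estimate of sigma^2
se    <- sqrt(s2)          # estimate of sigma

print(c(SSReg = SSReg, RSS = RSS, R2 = R2, s2 = s2, se = se))
```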

d) Use the F-test to test the hypothesis that absences from class do not affect the final grade percentage. That is, test $H_0: \beta_1 = 0$ vs. $H_1: \beta_1 \neq 0$ at a 5% significance level. Report the value of the test statistic, critical value(s), and decision.

Source        SS                                       DF           MS      F
Regression    $\sum (\hat{y}_i - \bar{y})^2 = 1728$    1            1728    19.55
Residuals     $\sum (y_i - \hat{y}_i)^2 = 442$         n - 2 = 5    88.4
Total         $\sum (y_i - \bar{y})^2 = 2170$          n - 1 = 6

Rejection Region: $F > F_{\alpha}(1, 5) = F_{0.05}(1, 5) = 6.61$.

The test statistic F is in the Rejection Region. Reject $H_0$ at α = 0.05.

( p-value ≈ 0.006884 )
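The critical value and p-value for the F-test can be looked up with R's `qf` and `pf` instead of a table. This is a sketch for checking the numbers above, with illustrative variable names.

```r
# ANOVA quantities from the table above.
SSReg <- 1728; dfReg <- 1
RSS   <- 442;  dfRes <- 5

MSReg <- SSReg / dfReg
MSRes <- RSS / dfRes
Fstat <- MSReg / MSRes

# Upper 5% critical value of F(1, 5) and the upper-tail p-value.
Fcrit <- qf(0.95, df1 = dfReg, df2 = dfRes)
pval  <- pf(Fstat, df1 = dfReg, df2 = dfRes, lower.tail = FALSE)

print(c(F = Fstat, critical = Fcrit, p.value = pval))
reject <- Fstat > Fcrit
```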

e) Use the t-test to test the hypothesis that absences from class do not affect the final grade percentage. That is, test $H_0: \beta_1 = 0$ vs. $H_1: \beta_1 \neq 0$ at a 5% significance level. Report the value of the test statistic, critical value(s), and decision.

$T = \dfrac{\hat{\beta}_1 - \beta_{10}}{s_e / \sqrt{SXX}} = \dfrac{-4 - 0}{\sqrt{88.4} / \sqrt{108}} = -4.4213$. n - 2 = 5 degrees of freedom.

Rejection Region: $T < -t_{\alpha/2}$ or $T > t_{\alpha/2}$. $t_{0.025}(5) = 2.571$.

The Test Statistic T is in the Rejection Region. Reject $H_0$ at α = 0.05.

( p-value ≈ 0.006884 )

f) Alex claims that each absence lowers the student’s final grade percentage by at least 5 percentage points (on average). Test his claim at a 5% significance level. That is, test $H_0: \beta_1 = -5$ vs. $H_1: \beta_1 > -5$. Report the value of the test statistic, critical value(s), and decision.

$T = \dfrac{\hat{\beta}_1 - \beta_{10}}{s_e / \sqrt{SXX}} = \dfrac{-4 - (-5)}{\sqrt{88.4} / \sqrt{108}} = 1.1053$. n - 2 = 5 degrees of freedom.

Rejection Region: $T > t_{\alpha}$. $t_{0.05}(5) = 2.015$.

The Test Statistic T is NOT in the Rejection Region.

Do NOT Reject $H_0$ at α = 0.05. ( p-value ≈ 0.15968 )
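The two slope tests in parts (e) and (f) share one standard error and differ only in the null value and the tail. Here is a hedged R sketch using `qt` and `pt`; the names `t_e`, `t_f`, and so on are illustrative and not from the original.

```r
b1  <- -4       # slope estimate
s2  <- 88.4     # estimate of sigma^2
SXX <- 108
df  <- 5        # n - 2

se_b1 <- sqrt(s2 / SXX)        # standard error of the slope

# Part (e): H0: beta1 = 0 vs. two-sided alternative.
t_e    <- (b1 - 0) / se_b1
crit_e <- qt(0.975, df)
p_e    <- 2 * pt(abs(t_e), df, lower.tail = FALSE)

# Part (f): H0: beta1 = -5 vs. H1: beta1 > -5 (right-tailed).
t_f    <- (b1 - (-5)) / se_b1
crit_f <- qt(0.95, df)
p_f    <- pt(t_f, df, lower.tail = FALSE)

print(c(t_e = t_e, crit_e = crit_e, p_e = p_e))
print(c(t_f = t_f, crit_f = crit_f, p_f = p_f))
```

The two-sided p-value in part (e) matches the F-test p-value in part (d), since T² = F in simple linear regression.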

g) A student claims that the average final grade percentage for the students without absences is above 90 (the cut-off for an A). Test $H_0: \beta_0 = 90$ vs. $H_1: \beta_0 > 90$ at a 5% significance level. Report the value of the test statistic, critical value(s), and decision.

$T = \dfrac{\hat{\beta}_0 - \beta_{00}}{s_e \cdot \sqrt{\dfrac{1}{n} + \dfrac{\bar{x}^2}{SXX}}} = \dfrac{104 - 90}{\sqrt{88.4} \cdot \sqrt{\dfrac{1}{7} + \dfrac{8^2}{108}}} = 1.7363$.

n - 2 = 5 degrees of freedom.

Rejection Region: $T > t_{\alpha}$. $t_{0.05}(5) = 2.015$.

The Test Statistic T is NOT in the Rejection Region.

Do NOT Reject $H_0$ at α = 0.05. ( p-value ≈ 0.07151 )

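h) Construct a 90% confidence interval for the average final grade percentage for a student who missed 7 classes.

Confidence interval for $\mu_{y|x}$:

$\hat{y} \pm t_{\alpha/2} \cdot s_e \cdot \sqrt{\dfrac{1}{n} + \dfrac{(x - \bar{x})^2}{SXX}}$

n - 2 = 5 degrees of freedom. $t_{0.05}(5) = 2.015$. At x = 7, $\hat{y} = 104 - 4 \cdot 7 = 76$.

$76 \pm 2.015 \cdot 9.402 \cdot \sqrt{\dfrac{1}{7} + \dfrac{(7 - 8)^2}{108}}$, i.e. $76 \pm 7.389$.

( 68.611 , 83.389 )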

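i) Construct 90% limits of prediction for y when x = 7.

Limits of prediction for y | x:

$\hat{y} \pm t_{\alpha/2} \cdot s_e \cdot \sqrt{1 + \dfrac{1}{n} + \dfrac{(x - \bar{x})^2}{SXX}}$

n - 2 = 5 degrees of freedom. $t_{0.05}(5) = 2.015$.

$76 \pm 2.015 \cdot 9.402 \cdot \sqrt{1 + \dfrac{1}{7} + \dfrac{(7 - 8)^2}{108}}$, i.e. $76 \pm 20.335$.

( 55.665 , 96.335 )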
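The intervals in parts (h) and (i) differ only by the extra "1 +" under the square root. Here is a minimal R sketch of both formulas from the summary quantities; the variable names are illustrative.

```r
b0 <- 104; b1 <- -4        # fitted coefficients from part (a)
n  <- 7;  xbar <- 8; SXX <- 108
se <- sqrt(88.4)           # residual standard error from part (c)

x0    <- 7
yhat  <- b0 + b1 * x0
tcrit <- qt(0.95, n - 2)   # 90% two-sided interval, so 5% in each tail

half_ci   <- tcrit * se * sqrt(1 / n + (x0 - xbar)^2 / SXX)      # mean response
half_pred <- tcrit * se * sqrt(1 + 1 / n + (x0 - xbar)^2 / SXX)  # new observation

ci   <- yhat + c(-1, 1) * half_ci
pred <- yhat + c(-1, 1) * half_pred

print(ci)
print(pred)
```

The prediction interval is much wider because it includes the variability of a single new observation around the regression line, not just the uncertainty in the estimated mean.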

2. Do use a computer to do (☺ and double-check ☺) parts (a), (b), (c), (d), (e), (h), (i) of 1.

> x = c( 6, 2,15, 9,11, 5, 8)
> y = c(70,92,47,72,49,96,78)
> fit = lm(y ~ x)
> summary(fit)

Call:
lm(formula = y ~ x)

Residuals:
-10 -4 3 4 -11 12 6

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 104.0000     8.0631  12.898 4.99e-05 ***
x            -4.0000     0.9047  -4.421  0.00688 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 9.402 on 5 degrees of freedom
Multiple R-squared: 0.7963, Adjusted R-squared: 0.7556
F-statistic: 19.55 on 1 and 5 DF, p-value: 0.006884

> new = data.frame(x=7)
> predict.lm(fit,new,interval=c("confidence"),level=0.90)
  fit      lwr      upr
1  76 68.61076 83.38924
> predict.lm(fit,new,interval=c("prediction"),level=0.90)
  fit      lwr      upr
1  76 55.66427 96.33573

3.

a) Dependent variable = y = Yield.

Independent variable = x = Amount of Fertilizer.

c) $\bar{x} = 2$, $\bar{y} = 5.5$. Least-squares regression line: $\hat{y} = 4.22 + 0.64x$.

d) r = 0.776114.

e)
Plot    A       B       C      D      E     F      G       H      I
Res.    -0.22   -0.36   0.64   0.32   -1    0.68   -0.64   0.86   -0.28

SSResid = $\sum e_i^2$ = 3.38.

f) $R^2$ = 0.602353, i.e. 60.2353%.

g) $\hat{y}$ = 8.7 hundred bushels. h) $\hat{y}$ = 6.46 hundred bushels.

i) Part (h). x = 3.5 is within the range of the data, x = 7 is not.

j) $s_e^2$ = 0.48286. $s_e$ = 0.69488.

k) 0.64 ± 0.372. ( 7 degrees of freedom )

l) T = -1.832. (i) Reject $H_0$. (ii) Do NOT Reject $H_0$.

m) 7.42 ± 1.2. n) 7.42 ± 1.78.

o) T = 1.452. Reject $H_0$.

p) T = -1.710. Do NOT Reject $H_0$.
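The answers to problem 3 can be cross-checked with `lm`, in the same way problem 2 checks problem 1. This is a sketch under one stated assumption: the yield vector below is the set of values consistent with the listed residuals and with Σy = 49.5, Σy² = 280.75, and Σxy = 107; it is not copied from a complete table.

```r
# Fertilizer amounts (tons) and yields (hundreds of bushels), plots A to I.
# ASSUMPTION: yields inferred from the residuals and summary sums given above.
x <- c(0, 1, 1, 1.5, 2, 2.5, 3, 3, 4)
y <- c(4, 4.5, 5.5, 5.5, 4.5, 6.5, 5.5, 7, 6.5)

fit <- lm(y ~ x)
b   <- coef(fit)                    # parts (c), (g), (h)
r2  <- summary(fit)$r.squared       # part (f)
s2  <- summary(fit)$sigma^2         # part (j)

print(b)
print(c(R2 = r2, s2 = s2))

# Part (k): 90% confidence interval for the slope.
print(confint(fit, "x", level = 0.90))

# Parts (m) and (n): 90% intervals at x = 5.
new <- data.frame(x = 5)
print(predict(fit, new, interval = "confidence", level = 0.90))
print(predict(fit, new, interval = "prediction", level = 0.90))
```

Note that x = 5 lies outside the observed range of 0 to 4 tons, so the intervals in parts (m) and (n) are extrapolations and rely on the linear model holding beyond the data.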

4.

$\sum x = 30$, $\sum y = 65$, $\sum x^2 = 200$, $\sum y^2 = 925$, $\sum xy = 420$,

$\sum (x - \bar{x})^2 = 20$, $\sum (y - \bar{y})^2 = 80$, $\sum (x - \bar{x})(y - \bar{y}) = \sum (x - \bar{x})\,y = 30$.

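a) $\hat{\beta} = \dfrac{\sum (x - \bar{x})(y - \bar{y})}{\sum (x - \bar{x})^2} = \dfrac{30}{20} = 1.5$.

OR

$\hat{\beta} = \dfrac{n \sum xy - \sum x \cdot \sum y}{n \sum x^2 - \left( \sum x \right)^2} = \dfrac{5 \cdot 420 - 30 \cdot 65}{5 \cdot 200 - 30^2} = \dfrac{150}{100} = 1.5$.

$\bar{x} = \dfrac{\sum x}{n} = \dfrac{30}{5} = 6$, $\bar{y} = \dfrac{\sum y}{n} = \dfrac{65}{5} = 13$.

$\hat{\alpha} = \bar{y} - \hat{\beta}\,\bar{x} = 13 - 1.5 \cdot 6 = 4$.

The least-squares regression line: $\hat{y} = 4 + 1.5x$.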
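The two slope formulas in part (a) are algebraically equivalent, which is easy to confirm numerically. Here is a small R sketch using the five (x, y) pairs that appear in the residual table of part (c); the variable names are illustrative.

```r
# Number of TV commercials and sales units for the five cities.
x <- c(3, 7, 5, 9, 6)
y <- c(7, 19, 13, 15, 11)
n <- length(x)

# Deviation form: SXY / SXX.
b_dev <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)

# Raw-sums form: (n*Sxy - Sx*Sy) / (n*Sxx - Sx^2).
b_raw <- (n * sum(x * y) - sum(x) * sum(y)) / (n * sum(x^2) - sum(x)^2)

a_hat <- mean(y) - b_dev * mean(x)   # intercept estimate

print(c(b_dev = b_dev, b_raw = b_raw, a_hat = a_hat))
```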

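b) $\hat{y} = 4 + 1.5 \cdot 20 = 34$. Extrapolation.

c)

x    y     $\hat{y}$    $e$     $e^2$
3     7     8.5         -1.5     2.25
7    19    14.5          4.5    20.25
5    13    11.5          1.5     2.25
9    15    17.5         -2.5     6.25
6    11    13           -2       4

⇒ $\sum (y - \hat{y})^2 = 35$.

OR

$SSRegr = \hat{\beta}^2 \sum (x_i - \bar{x})^2 = \hat{\beta}^2 \cdot SXX = 1.5^2 \cdot 20 = 45$.

⇒ $\sum (y - \hat{y})^2 = SSResid = SYY - SSRegr = 80 - 45 = 35$.

$s_e^2 = \dfrac{\sum (y - \hat{y})^2}{n - 2} = \dfrac{35}{3} \approx 11.6667$.

$s_e = \sqrt{11.6667} \approx 3.4157$.

or $\hat{\sigma}^2 = \dfrac{\sum (y - \hat{y})^2}{n} = \dfrac{35}{5} = 7$. $\hat{\sigma} = \sqrt{7} \approx 2.64575$.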

d) Need $R^2$ (coefficient of determination).

$R^2 = \dfrac{\sum (\hat{y} - \bar{y})^2}{\sum (y - \bar{y})^2} = \dfrac{45}{80} = 0.5625$, i.e. 56.25%.

e) Confidence interval for β:

$\hat{\beta} \pm t_{\alpha/2} \cdot \dfrac{s_e}{\sqrt{\sum (x - \bar{x})^2}}$

α = 0.10. α / 2 = 0.05. n - 2 = 3 degrees of freedom. $t_{0.05}(3) = 2.353$.

$1.5 \pm 2.353 \cdot \sqrt{\dfrac{11.6667}{20}}$, i.e. $1.5 \pm 1.8$. ( -0.3 , 3.3 )

f) Test Statistic:

$T = \dfrac{\hat{\beta} - \beta_0}{s_e / \sqrt{\sum (x_i - \bar{x})^2}} = \dfrac{1.5 - 0}{\sqrt{11.6667 / 20}} \approx 1.964$.

Rejection Region:

Reject $H_0$ if $T < -t_{0.025}$ ( 5 - 2 = 3 df ) or $T > t_{0.025}$ ( 3 df ).

$\pm t_{0.025}$ ( 3 df ) = ± 3.182.

Do NOT Reject $H_0$.

OR

ANOVA table:

Source        SS                                     DF           MS         F
Regression    $\sum (\hat{y}_i - \bar{y})^2 = 45$    1            45         3.857
Residuals     $\sum (y_i - \hat{y}_i)^2 = 35$        n - 2 = 3    11.6667
Total         $\sum (y_i - \bar{y})^2 = 80$          n - 1 = 4

Rejection Region:

Reject $H_0$ if $F > F_{0.05}(1, 3)$. $F_{0.05}(1, 3) = 10.13$.

Do NOT Reject $H_0$.

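g) x = 8. ⇒ $\hat{y} = 4 + 1.5 \cdot x = 4 + 1.5 \cdot 8 = 16$.

Prediction interval for y:

$\hat{y} \pm t_{\alpha/2} \cdot s_e \cdot \sqrt{1 + \dfrac{1}{n} + \dfrac{(x - \bar{x})^2}{\sum (x_i - \bar{x})^2}}$

α = 0.05. α / 2 = 0.025. n - 2 = 3 degrees of freedom. $t_{0.025}(3) = 3.182$.

$16 \pm 3.182 \cdot \sqrt{11.6667} \cdot \sqrt{1 + \dfrac{1}{5} + \dfrac{(8 - 6)^2}{20}}$, i.e. $16 \pm 12.86$.

( 3.14 , 28.86 )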
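The prediction interval in part (g) can be reproduced either from the formula or with `predict`. Here is a hedged R sketch that does both, using the five data pairs from the table in part (c); the variable names are illustrative.

```r
x <- c(3, 7, 5, 9, 6)
y <- c(7, 19, 13, 15, 11)
n <- length(x)

fit <- lm(y ~ x)

# Built-in 95% prediction interval at x = 8.
pred <- predict(fit, data.frame(x = 8), interval = "prediction", level = 0.95)
print(pred)

# The same interval from the formula, as a cross-check.
se    <- summary(fit)$sigma                  # residual standard error
SXX   <- sum((x - mean(x))^2)
tcrit <- qt(0.975, n - 2)
half  <- tcrit * se * sqrt(1 + 1 / n + (8 - mean(x))^2 / SXX)
print(16 + c(-1, 1) * half)
```

With only 3 residual degrees of freedom the critical value is large, which is why the interval spans most of the observed range of sales.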

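h) x = 8. ⇒ $\hat{y} = 4 + 1.5 \cdot x = 4 + 1.5 \cdot 8 = 16$.

Test Statistic:

$T = \dfrac{\hat{y} - \mu_0}{s_e \cdot \sqrt{\dfrac{1}{n} + \dfrac{(x - \bar{x})^2}{\sum (x_i - \bar{x})^2}}} = \dfrac{16 - 20}{\sqrt{11.6667} \cdot \sqrt{\dfrac{1}{5} + \dfrac{(8 - 6)^2}{20}}} = -1.852$.

Rejection Region:

Reject $H_0$ if $T < -t_{0.10}$ ( 5 - 2 = 3 df ).

$-t_{0.10}$ ( 3 df ) = -1.638.

Reject $H_0$.

i) Test Statistic:

$T = \dfrac{\hat{\alpha} - \alpha_0}{s_e \cdot \sqrt{\dfrac{1}{n} + \dfrac{\bar{x}^2}{\sum (x_i - \bar{x})^2}}} = \dfrac{4 - 0}{\sqrt{11.6667} \cdot \sqrt{\dfrac{1}{5} + \dfrac{6^2}{20}}} = 0.828$.

Rejection Region:

Reject $H_0$ if $T < -t_{0.05}$ ( 5 - 2 = 3 df ) or $T > t_{0.05}$ ( 5 - 2 = 3 df ).

$\pm t_{0.05}$ ( 3 df ) = ± 2.353.

Do NOT Reject $H_0$.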