
Stat 139 Final Exam Solutions
Fall 2015
Problem 1: Multiple Choice. [3 points each]. Please circle the best answer; no justification is
needed. Parts are unrelated.
(a) Federal guidelines require that pharmaceutical companies provide evidence that a new drug is
effective by demonstrating that two independently conducted randomized studies both show
statistically significant benefit from the drug at a significance level of 0.025, i.e. α = 0.025. Given that
the null hypothesis is that the drug has no benefit, what type of error has been committed if a
new drug appears beneficial to the FDA when in fact the drug provides no benefit?
(A) Type I error
(B) Type II error
(C) No error was made
(b) Cook’s distance measure is used to
(A) identify influential observations in multiple regression analysis.
(B) determine the significance of an independent variable.
(C) determine if there is significant multicollinearity.
(D) determine if the overall regression model is significant.
Using the Normal quantile plot below, answer the following 2 questions:
(c) This variable can be described as:
(A) Left-skewed
(B) Right-skewed
(C) Symmetric
(d) A friend of yours proposes using the log-transformation
on this variable. The resulting variable is likely to be:
(A) Less skewed.
(B) More skewed.
(C) Unaffected.
[Figure: Normal Q-Q Plot; x-axis: Theoretical Quantiles (−2 to 2), y-axis: Sample Quantiles (0 to 8)]
(e) In the multiple regression model, the adjusted R^2
(A) cannot be negative.
(B) is the percent of variability in the response variable associated with the predictors.
(C) equals the square of the correlation coefficient r.
(D) may decrease when an additional explanatory variable is added.
(f) In the multiple regression model, the least squares estimator is derived by
(A) minimizing the sum of squared prediction errors.
(B) setting the sum of squared errors equal to zero.
(C) minimizing the absolute difference of the residuals.
(D) forcing the smallest distance between the actual and fitted values.

(g) [3 points] To the right is a scatterplot of Y = Brain Weight in grams vs. X = Body Weight in kilograms for a random sample of n = 62 species of mammals. Which of the following is the best linear regression model to model these data?
(A) ŷ = 91.004 + 0.9665(x)
(B) ŷ = 67.79 + 161.0 · ln(x)
(C) ln(ŷ) = 2.903 + 0.001193(x)
(D) ln(ŷ) = 2.1348 + 0.7517 · ln(x)

[Figure: scatterplot of BrainWt (y-axis, 0 to 5000) vs. BodyWt (x-axis, 0 to 6000)]

(h) Imagine you built the following multiple regression model:

Y = β0 + β1(X1) + β2(X2)

where Y = time spent studying for the final exam, X1 = a binary variable which takes on the value 1 for athletes and is 0 for non-athletes, and X2 = a binary variable which takes on the value 1 for non-athletes and is 0 for athletes. If athletes spend more time studying for the final exam, then you would expect
(A) the coefficient for X1 to have a positive sign, and for X2 a negative sign.
(B) both coefficients to be the same distance from the constant, one above and the other below.
(C) β1 estimates the mean of Y for athletes, β2 estimates the mean of Y for non-athletes.
(D) none of the OLS estimators to exist because there is perfect multicollinearity.

(i) When there is an omitted predictor variable in a regression model, which is a determinant of the response variable, then
(A) you cannot measure the effect of the omitted predictor variables, but the coefficient estimators of your included predictors are unaffected.
(B) this has no effect on the coefficient estimators of your included predictors because the omitted predictor variable is not included.
(C) this will always bias the OLS coefficient estimators of the included predictors.
(D) the OLS coefficient estimators are biased if the omitted variable is correlated with the incl

(j) A 99% confidence interval for the average salary of all CEOs in the electronics industry was constructed using the results of a random survey of 45 CEOs. The interval was ($107,815, $122,881). To make more useful inferences from the data, it is desired to reduce the width of the confidence interval. Which of the following will result in a reduced interval width?
(A) Increase the sample size and decrease the confidence level.
(B) Decrease the sample size and decrease the confidence level.
(C) Increase the sample size and increase the confidence level.
(D) Decrease the sample size and increase the confidence level.

(k) The following hypotheses can be tested using an F-test, with the exception of
(A) β2 = 1 and β3 = β4/β5.
(B) β0 = 2.
(C) β1 + β2 = 1 and β3 = −2β4.
(D) β1 = β2 = 0.

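Solution: Using the fact that E(S_X^2) = E(S_Y^2) = σ^2, since they are both unbiased estimates for σ^2:

$$
E\left(S_p^2\right)
= E\left[\frac{(n_X - 1)S_X^2 + (n_Y - 1)S_Y^2}{n_X + n_Y - 2}\right]
= \frac{n_X - 1}{n_X + n_Y - 2}\,E\left(S_X^2\right) + \frac{n_Y - 1}{n_X + n_Y - 2}\,E\left(S_Y^2\right)
$$

$$
= \frac{n_X - 1}{n_X + n_Y - 2}\,\sigma^2 + \frac{n_Y - 1}{n_X + n_Y - 2}\,\sigma^2
= \frac{n_X - 1 + n_Y - 1}{n_X + n_Y - 2}\,\sigma^2 = \sigma^2
$$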

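(e) [6 points] For simple linear regression, let T be the usual 2-sided t-statistic to determine whether H0: β1 = 0 is reasonable, and let F be the usual overall F-statistic for the whole model. Show that F = T^2. Hint: start with the F-statistic, and use the fact that β̂0 = Ȳ − β̂1 X̄ to simplify the expanded numerator.

Solution:

$$
F = \frac{SSM/df_M}{SSE/df_E}
= \frac{\sum_i (\hat{Y}_i - \bar{Y})^2 / 1}{\hat\sigma^2}
= \frac{\sum_i (\hat\beta_0 + \hat\beta_1 X_i - \bar{Y})^2}{\hat\sigma^2}
= \frac{\sum_i (\bar{Y} - \hat\beta_1 \bar{X} + \hat\beta_1 X_i - \bar{Y})^2}{\hat\sigma^2}
$$

$$
= \frac{\sum_i (\hat\beta_1 X_i - \hat\beta_1 \bar{X})^2}{\hat\sigma^2}
= \frac{\hat\beta_1^2 \sum_i (X_i - \bar{X})^2}{\hat\sigma^2}
= \frac{\hat\beta_1^2 \, SSX}{\hat\sigma^2}
= \frac{\hat\beta_1^2}{\left(\hat\sigma / \sqrt{SSX}\right)^2} = T^2
$$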
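The identity F = T^2 can also be checked numerically. The sketch below uses simulated data (the variables x and y, the seed, and the true coefficients are made up for illustration and are not part of the exam) and compares the squared slope t-statistic with the overall F-statistic reported by R.

```r
# Numerical check of F = T^2 for simple linear regression (simulated data; all values hypothetical).
set.seed(139)
x <- rnorm(50)
y <- 2 + 0.5 * x + rnorm(50)

fit <- summary(lm(y ~ x))

# t-statistic for H0: beta_1 = 0, taken from the coefficient table
t_stat <- fit$coefficients["x", "t value"]

# overall F-statistic for the model
f_stat <- fit$fstatistic[["value"]]

# TRUE when the two agree up to floating-point tolerance
all.equal(t_stat^2, f_stat)
```

This only illustrates the result for one dataset; the algebra above is what establishes it in general.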

Problem 3. [27 points total] The economic structure of Major League Baseball (MLB) allows some teams to make substantially more money than others, which in turn allows some teams to spend much more on player salaries. These teams might therefore be expected to have better players and win more games on the field as a result. Over the course of the last 4 years (2012-2015), each of the 30 MLB teams was measured each year and the following data were collected for these 120 observations (note, these data will also be used in Problem 4):

wins: number of games the team won for a specific year
payroll: opening day payroll, in millions of dollars, for the team for a specific year
AL: a binary variable for whether the team is in the American League (AL) (14 of the 30 teams are in the AL; the rest are in the National League (NL))
year: the year in which the measurement was taken

A regression model, Model 1, was run to predict the number of wins a team had from the ln(payroll), and the results in R are shown below, along with some summary statistics:

summary(model1<-lm(wins~log(payroll),data=mlb))

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)    44.076     11.654   3.782 0.000246 ***
log(payroll)    7.963      2.505   3.179 0.001887 **

Residual standard error: 10.5649 on 118 degrees of freedom
Multiple R-squared: 0.07889, Adjusted R-squared: 0.
F-statistic: 10.11 on 1 and 118 DF, p-value: 0.

mean(log(mlb$payroll))
[1] 4.
sd(log(mlb$payroll))
[1] 0.

(a) [2 points] What is the estimated correlation between wins and log(payroll)?

Solution: For simple regression:

r = ±√(R^2) = √0.07889 ≈ 0.281

The sign is positive since the sign of the slope estimate is positive.

(b) [3 points] Interpret the slope coefficient in this model. Be specific.

Solution: A 1-unit change in ln(payroll) is associated with an estimated increase of 7.963 wins, on average. More specifically, a doubling of payroll is associated with an estimated increase of ln(2) · 7.963 = 5.52 wins, on average.
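The ln(2) factor comes from comparing the fitted values at a payroll of x and of 2x. As a short worked step, using the Model 1 slope estimate and writing ŷ(x) for the fitted number of wins at payroll x (notation introduced here for illustration):

```latex
\hat{y}(2x) - \hat{y}(x)
  = \hat\beta_1 \ln(2x) - \hat\beta_1 \ln(x)
  = \hat\beta_1 \ln 2
  = 7.963 \times 0.6931
  \approx 5.52
```

The difference does not depend on x, which is why the doubling interpretation holds at any payroll level.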

(c) [5 points] Calculate Ȳ, the mean number of wins, and s_Y, the standard deviation for the number of wins among the 120 observations.

Solution: Using the fact that the regression line goes through the point (X̄, Ȳ):

Ȳ = β̂0 + β̂1 X̄ = 44.076 + 7.963(4.6371) = 81.0

Using the fact that the slope is equal to the correlation times the ratio of standard deviations:

β̂1 = r · (s_Y / s_X)  ⟹  s_Y = β̂1 · s_X / r = 7.963(0.38667) / 0.2809 ≈ 10.96
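The arithmetic in parts (a) and (c) can be reproduced with a few lines of R. This is a sketch from the printed summary values only: the variable names are made up here, and the mean and standard deviation of log(payroll) are taken from the figures used in the written solutions (4.6371 and 0.38667), since the mlb data frame itself is not available.

```r
# Values read off the Model 1 output and the written solutions (no raw data used).
b0   <- 44.076     # intercept estimate
b1   <- 7.963      # slope estimate for log(payroll)
r_sq <- 0.07889    # multiple R-squared
x_bar <- 4.6371    # mean of log(payroll), as used in the solution to (c)
s_x   <- 0.38667   # sd of log(payroll), as used in the solution to (d)

r     <- sqrt(r_sq)        # positive root because the slope is positive
y_bar <- b0 + b1 * x_bar   # the fitted line passes through (x_bar, y_bar)
s_y   <- b1 * s_x / r      # from b1 = r * s_y / s_x

round(c(r = r, y_bar = y_bar, s_y = s_y), 3)
```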

(d) [5 points] The Red Sox are projected to spend about $180 million on payroll in 2016 (next year). Provide a 95% interval for the number of games they will win next year.

Solution: Since the Red Sox in 2016 is one new observation, a prediction interval should be calculated:

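$$
\hat{Y} \pm t^*_{n-k-1}\,\hat\sigma_e \sqrt{1 + \frac{1}{n} + \frac{(X_0 - \bar{X})^2}{(n-1)s_X^2}}
\approx \hat\beta_0 + \hat\beta_1 \ln(180) \pm 1.984(10.565)\sqrt{1 + \frac{1}{120} + \frac{(\ln(180) - 4.6371)^2}{(119)\,0.38667^2}}
$$

$$
= 85.43 \pm 21.23 = (64.20,\ 106.66)
$$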
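As a rough check of the prediction interval, the sketch below rebuilds it in R from the printed summary values (the variable names are made up for illustration, and the raw data are not used). It takes the exact t quantile from qt() rather than the table value 1.984, so the resulting interval is very slightly narrower than one built from the table.

```r
# Prediction interval for one new observation at payroll = 180 (millions), from summary values.
b0 <- 44.076; b1 <- 7.963   # Model 1 coefficient estimates
sigma_hat <- 10.5649        # residual standard error
n <- 120                    # number of observations
x_bar <- 4.6371             # mean of log(payroll), as used in the solutions
s_x   <- 0.38667            # sd of log(payroll), as used in the solutions

x0    <- log(180)           # the predictor is on the log scale
y_hat <- b0 + b1 * x0

# standard error for predicting a single new response
se_pred <- sigma_hat * sqrt(1 + 1 / n + (x0 - x_bar)^2 / ((n - 1) * s_x^2))

t_crit   <- qt(0.975, df = n - 2)   # exact critical value with 118 df
interval <- y_hat + c(-1, 1) * t_crit * se_pred
interval
```

With the model object available, `predict(model1, newdata = data.frame(payroll = 180), interval = "prediction")` would be the more direct route.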

Note: since n − k − 1 = 118 is not in the t-table, the t-critical value

(e) [8 points] Below are the residual graphs for this model. List out all of the assumptions for this regression model, and comment on whether they are reasonable. Be specific.

[Figure: residual plot of resid(model1) against fitted(model1), and Histogram of resid(model1) with Frequency on the vertical axis]

Solution: For a team in the AL, the coefficients involved in the slope relating wins to ln(payroll) are β1 and β3 added together. Thus a linear combination of coefficients t-test should be used:

H0: β1 + β3 = 0 vs. HA: β1 + β3 ≠ 0

$$
t = \frac{(\hat\beta_1 + \hat\beta_3) - 0}{\sqrt{\widehat{\mathrm{Var}}(\hat\beta_1) + \widehat{\mathrm{Var}}(\hat\beta_3) + 2\,\widehat{\mathrm{Cov}}(\hat\beta_1, \hat\beta_3)}}
$$

This t-statistic has df = 116, which has a critical value a little smaller than 1.984, thus this result is significant. There is evidence to suggest that there is a negative linear relationship relating the number of wins by a team in the AL with how much they spend on payroll (on the log scale).
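The mechanics of this linear-combination t-test can be sketched in R with vcov(). The Model 2 output is not shown here, so the example below runs on simulated data with made-up coefficients, and it assumes the model has the form wins ~ log(payroll) * AL (which matches β1 and β3 being the log(payroll) slope and its interaction with AL). It illustrates the calculation, not the exam's actual numbers.

```r
# Linear-combination t-test for H0: beta_1 + beta_3 = 0 (simulated data; all values hypothetical).
set.seed(1)
n <- 120
sim <- data.frame(payroll = runif(n, 60, 250), AL = rbinom(n, 1, 0.5))
sim$wins <- 60 + 4 * log(sim$payroll) + rnorm(n, sd = 10)

fit <- lm(wins ~ log(payroll) * AL, data = sim)
b <- coef(fit)
V <- vcov(fit)   # estimated covariance matrix of the coefficient estimates

i <- "log(payroll)"      # beta_1
j <- "log(payroll):AL"   # beta_3

est <- b[[i]] + b[[j]]
se  <- sqrt(V[i, i] + V[j, j] + 2 * V[i, j])

t_stat <- est / se
p_val  <- 2 * pt(-abs(t_stat), df = df.residual(fit))
c(estimate = est, se = se, t = t_stat, p = p_val)
```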

A third multiple regression model, Model 3, was run and is shown below:

summary(model3<-lm(wins~(log(payroll)+year)^2,data=mlb))

Coefficients:
                    Estimate Std. Error t value Pr(>|t|)
(Intercept)       -5162.2424 23760.9377  -0.217 0.
log(payroll)       1436.7697  5091.2562   0.282 0.
year                  2.5847    11.8015   0.219 0.
log(payroll):year    -1.7094     1.5287  -1.118 0.

Residual standard error: 10.5363 on 116 degrees of freedom
Multiple R-squared: 0.09943, Adjusted R-squared: 0.
F-statistic: 6.404 on 3 and 116 DF, p-value: 0.

(c) [4 points] Below are 3 selected rows of R’s data matrix used for these models. Provide the respective 3 rows of the design matrix X for Model 3.

Solution:

$$
X = \begin{bmatrix}
1 & \ln(91.52) & 2015 & \ln(91.52) \cdot 2015 \\
1 & \ln(100.68) & 2015 & \ln(100.68) \cdot 2015 \\
1 & \ln(107.407) & 2014 & \ln(107.407) \cdot 2014
\end{bmatrix}
$$

mlb[c(1,23,33),]
   team_name    team league wins payroll year AL
1  Diamondbacks ARI  NL       79  91.520 2015  0
23 Padres       SDP  NL       74 100.680 2015  0
33 Orioles      BAL  AL       96 107.407 2014  1
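These design-matrix rows can be generated with model.matrix(). The sketch below re-enters the three displayed rows by hand (the data frame name rows3 is made up for illustration) and applies the Model 3 formula to them.

```r
# Rebuild the three displayed rows and let R construct the design matrix for Model 3.
rows3 <- data.frame(
  wins    = c(79, 74, 96),
  payroll = c(91.520, 100.680, 107.407),
  year    = c(2015, 2015, 2014)
)

# Columns: intercept, log(payroll), year, and their product (the interaction)
X <- model.matrix(wins ~ (log(payroll) + year)^2, data = rows3)
X
```

The third row uses year = 2014 in both the year column and the interaction column, since the Orioles observation is from 2014.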

(d) [4 points] In 2-3 sentences, explain what the interaction term is attempting to capture in this model.

Solution: The interaction term is estimating the change in the effect of payroll on wins over time (as a linear change as time increases). Note: since this is negative, it means that the effect of payroll on wins has decreased as time has increased.

(e) [6 points] Formally test whether Model 3 is doing a better job of explaining wins than Model 1. The critical value for this test statistic is 3.074.

Solution: This should be tested through an Extra-Sum-of-Squares F-test:

H0: β2 = β3 = 0 vs. HA: β2 ≠ 0 and/or β3 ≠ 0

$$
F = \frac{(SSE_{H_0} - SSE_{H_A}) / (df_{E,H_0} - df_{E,H_A})}{SSE_{H_A} / df_{E,H_A}}
$$

This F-statistic has df = 2, 116. Since our test statistic is not further out in the tail than the critical value of 3.074, we cannot reject the null hypothesis. The association of wins to payroll (on the log scale) may be the same over the 4-year time span the data were collected from.
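The F-statistic can be approximated from the two printed summaries, since SSE equals the squared residual standard error times the residual degrees of freedom. The sketch below does this in R; the variable names are made up for illustration, and because the inputs are rounded printouts the result is only approximate (roughly 1.3).

```r
# Extra-sum-of-squares F-test from printed summary values (approximate, due to rounding).
sse1 <- 10.5649^2 * 118   # Model 1 (null): residual SE^2 times residual df
sse3 <- 10.5363^2 * 116   # Model 3 (alternative)

f_stat <- ((sse1 - sse3) / (118 - 116)) / (sse3 / 116)

# 5% critical value for an F distribution with 2 and 116 degrees of freedom
f_crit <- qf(0.95, df1 = 2, df2 = 116)

c(F = f_stat, critical = f_crit, reject = f_stat > f_crit)
```

With both fitted models available, `anova(model1, model3)` performs the same comparison directly.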

(f) [3 points] Provide an alternative modeling approach that would give more appropriate inferences than seen here. Explain in 1-2 sentences.

Solution: The best alternative method would be to use a random-effects model (aka a mixed-effects or hierarchical model) that accounts for the “clusters” defined by each team: 4 observations (years) were measured within each team. This accounts for the dependence between these 4 observations for each team.

Problem 5. [30 points total] It is often cited that women make 78 cents for every dollar a man makes in the United States. To investigate this phenomenon, you collect data on weekly earnings from 1,744 randomly sampled individuals from the entire US population of working adults: 850 females and 894 males. Next, you calculate their average weekly earnings and find that females in your sample earned $693.96 on average, while males earned $1,035.40 on average.

(a) [5 points] Calculate the average female earnings as a percent of the average male earnings. How would you test whether or not this result is statistically significant? Provide the analysis procedure you would use (do not perform the calculations).

Solution: The estimated ratio of means is 693.96 / 1035.40 = 0.670. To properly test whether this result is statistically significant, a t-test on the log scale would be the most appropriate approach (testing whether the difference in means on the log scale is zero is equivalent to testing whether the ratio of medians is equal to 1, but this is still the best choice for testing whether this ratio is equal to 1).

You recall from an Econ class that additional years of experience are supposed to result in higher earnings; you reason that this is because experience is related to “on the job training.” Measuring age instead can be a proxy for “experience.” You estimate two models initially (standard error of the coefficients are given in parentheses below the estimates):

Earn = 647.40 + 10.30(Age) − 339.56(Female),   R^2 = 0.129, σ̂^2 = 1099.01
       (42.36)  (1.102)      (26.12)

ln(Earn) = 6.342 + 0.015(Age) − 0.421(Female),   R^2 = 0.172, σ̂^2 = 1.232
           (0.097) (0.00209)    (0.0364)

where Earn is weekly earnings in dollars, Age is measured in years, and Female is a binary variable, which takes on the value of 1 if the individual is female and is 0 otherwise.

(b) [5 points] Interpret the coefficient for Female in each model carefully.

Solution: In the first model, women are estimated to make 339.56 dollars less than men, on average, while controlling for age. In the second model, women are estimated to make exp(−0.421) = 0.6564 times as much as men (aka 34.36% less), on average, while controlling for age.
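The back-transformation of the Female coefficient in the log model can be sketched in R as below. The confidence interval at the end is an addition for illustration only: it assumes a normal approximation is adequate with n = 1,744, and it is not part of the exam solution. Variable names are made up here.

```r
# Interpreting the Female coefficient from the log-earnings model.
b_female  <- -0.421    # coefficient estimate
se_female <- 0.0364    # its standard error

ratio    <- exp(b_female)       # multiplicative effect on earnings
pct_less <- 100 * (1 - ratio)   # percent less than men, holding age fixed

# Approximate 95% CI for the ratio (normal approximation; an assumption, see above)
ci_ratio <- exp(b_female + c(-1, 1) * qnorm(0.975) * se_female)

# Unadjusted ratio of sample means from part (a), for comparison
raw_ratio <- 693.96 / 1035.40

round(c(ratio = ratio, pct_less = pct_less, raw_ratio = raw_ratio), 4)
ci_ratio
```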

(c) [4 points] Should you choose the second specification on grounds of the higher R^2 and smaller σ̂^2? Explain in 1-2 sentences.

Solution: No, you cannot use either as a comparison tool since the scales of the response variables are not the same. In order to compare these models, we should compare the errors on the same scale (by transforming the Ŷ's in the 2nd model back to the original scale and comparing them to the observed earnings).
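One way to carry out such a same-scale comparison is sketched below in R. The earnings data are not available, so this uses simulated data with made-up coefficients; the names age, female, earn, fit_lin, and fit_log are hypothetical. It back-transforms the log model's fitted values with exp() and computes the root mean squared error of both models in dollars.

```r
# Comparing a linear and a log-response model on the original (dollar) scale.
# Simulated data; every number here is hypothetical.
set.seed(2015)
n <- 500
age    <- runif(n, 20, 65)
female <- rbinom(n, 1, 0.5)
earn   <- exp(6.3 + 0.015 * age - 0.4 * female + rnorm(n, sd = 0.5))

fit_lin <- lm(earn ~ age + female)
fit_log <- lm(log(earn) ~ age + female)

# Errors measured in dollars for both models
rmse_lin <- sqrt(mean((earn - fitted(fit_lin))^2))
rmse_log <- sqrt(mean((earn - exp(fitted(fit_log)))^2))

c(linear = rmse_lin, log_backtransformed = rmse_log)
```

Note that exp() of a fitted log value estimates a median rather than a mean on the original scale, so this simple back-transformation is only one of several reasonable choices.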