Analyzing Relationship between Predictors and Response Variable in Multiple Regression, Study notes of Statistics

A set of lecture notes from a statistics 102 course focusing on multiple regression analysis. It covers topics such as transforming data, model assumptions, inference in multiple regression, collinearity, and examples of multiple regression models. The notes explain how to estimate the fixed and variable costs of a lease using a regression model and discuss the importance of each predictor in the model.

Uploaded on 03/28/2010

Statistics 102 Multiple Regression
Spring, 2000

Multiple Regression

Project Analysis for Today

First steps
Transforming the data into a form that lets you estimate the fixed and variable costs of a lease using a regression model that meets the three key assumptions.

Review of Multiple Regression from Last Week

Objective
Isolate the key factors that influence the response and separate their effects.

Model
$Y = \beta_0 + \beta_1 X_1 + \cdots + \beta_k X_k + \text{Error}$
$\text{Sales} = \beta_0 + \beta_1\,\text{Adv\$} + \beta_2\,\text{Price} + \text{Error}$
with
- Independence
- Constant variance $\sigma^2$ about the regression line
- Normally distributed errors about the regression line.

Discussion
– Model is additive
– Geometry of multiple regression
– Slopes measure effect of each predictor “holding others fixed”
“Simple” regression slope vs multiple regression slope
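
The contrast between a simple and a multiple regression slope can be sketched numerically. The example below is only an illustration on made-up data (the variables `x1`, `x2`, the true slopes 2 and 3, and the helper `ols` are all assumptions, not part of the course data). Because `x2` moves with `x1`, the simple slope for `x1` absorbs part of the effect of `x2`, while the multiple regression slope measures the effect of `x1` holding `x2` fixed.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Hypothetical predictors: x2 is correlated with x1 (made-up data, not the course data).
x1 = rng.normal(size=n)
x2 = 0.8 * x1 + 0.6 * rng.normal(size=n)

# Additive model with known slopes: y = 1 + 2*x1 + 3*x2 + error.
y = 1.0 + 2.0 * x1 + 3.0 * x2 + rng.normal(scale=0.5, size=n)


def ols(columns, response):
    """Least-squares coefficients (intercept first) for the given predictor columns."""
    design = np.column_stack([np.ones(len(response))] + list(columns))
    coef, *_ = np.linalg.lstsq(design, response, rcond=None)
    return coef


simple = ols([x1], y)        # y regressed on x1 alone
multiple = ols([x1, x2], y)  # y regressed on x1 and x2 together

print("simple slope for x1:  ", round(simple[1], 2))
print("multiple slope for x1:", round(multiple[1], 2))
```

In this setup the simple slope should land near 2 + 3 × 0.8 = 4.4 and the multiple regression slope near 2, which is the sense in which the two slopes answer different questions.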

Relationship between $R^2$ and RMSE
– Both describe “goodness-of-fit”
– $R^2$ is relative whereas RMSE is absolute.
– They are related as follows:
$\text{RMSE}^2 = \text{Var}(\text{residuals}) \approx (1 - R^2)\,\text{Var}(\text{response})$
– Same interpretation in simple (one predictor) and multiple regression.
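
The link between RMSE and R-squared can be checked on a small simulated fit. This is a sketch under assumptions: the data are made up, and RMSE is taken to divide the residual sum of squares by the residual degrees of freedom (n − k − 1), which is the usual convention but is not spelled out in these notes.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200

# Hypothetical data with two predictors (not the course data).
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 5.0 + 1.5 * x1 - 2.0 * x2 + rng.normal(scale=1.0, size=n)

design = np.column_stack([np.ones(n), x1, x2])
coef, *_ = np.linalg.lstsq(design, y, rcond=None)
residuals = y - design @ coef

k = design.shape[1] - 1                  # number of slopes
sse = np.sum(residuals ** 2)             # residual sum of squares
sst = np.sum((y - y.mean()) ** 2)        # total sum of squares
r2 = 1.0 - sse / sst
rmse = np.sqrt(sse / (n - k - 1))        # divides by residual degrees of freedom

approx = (1.0 - r2) * np.var(y, ddof=1)  # (1 - R^2) * Var(response)

print("RMSE^2:                 ", rmse ** 2)
print("(1 - R^2) Var(response):", approx)
```

Under these conventions the two quantities differ only by the factor (n − 1)/(n − k − 1), so they agree closely whenever n is large relative to the number of predictors.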

Inference in Multiple Regression

Inference in multiple regression

– One coefficient: t-ratio (estimate/SE)

“Is this slope different from zero?”

“Does this variable significantly improve a model containing the rest?”

– All coefficients: overall F-ratio (ANOVA table)

“Does this entire model explain significant amounts of variation?”

Analysis of variance (ANOVA) summary (page 141)

– Summary of how much variation is being explained per predictor.

– Example for the car data with weight and horsepower as predictors.

[Table: Analysis of Variance, with rows Model, Error, and C Total and columns DF, Sum of Squares, Mean Square, F Ratio, and Prob>F]

Why do we need different tests?

– Each addresses a specific aspect of the fitted model:

t-ratio considers one coefficient (intercept or slope)

F-ratio considers all slopes simultaneously

– Why not just do a bunch of t-tests, one for each slope?

With 20 predictors and 95% confidence intervals, you can expect about one slope to look significant (different from zero) by chance alone, since 20 × 0.05 = 1. Too many things will appear significant that really are not meaningful.

– Recall the use of multiple comparisons in anova.
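
The risk of running many t-tests can be illustrated with a small simulation. This is a sketch on made-up data: the response is pure noise, so every true slope is zero, and the cutoff `t_crit = 1.97` is an approximate two-sided 5% value for 179 degrees of freedom rather than an exact table entry.

```python
import numpy as np

rng = np.random.default_rng(2)
n, k, reps = 200, 20, 1000
t_crit = 1.97  # approximate two-sided 5% cutoff for n - k - 1 = 179 df (assumption)

false_hits = np.empty(reps)
for r in range(reps):
    # Hypothetical data: 20 predictors that have nothing to do with the response.
    X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
    y = rng.normal(size=n)

    xtx_inv = np.linalg.inv(X.T @ X)
    coef = xtx_inv @ X.T @ y
    resid = y - X @ coef
    sigma2 = resid @ resid / (n - k - 1)     # RMSE squared
    se = np.sqrt(sigma2 * np.diag(xtx_inv))  # standard error of each coefficient
    t_ratio = coef / se

    # Count slopes (skipping the intercept) whose t-ratio looks "significant".
    false_hits[r] = np.sum(np.abs(t_ratio[1:]) > t_crit)

print("average number of 'significant' slopes per fit:", false_hits.mean())
print("share of fits with at least one:", np.mean(false_hits > 0))
```

The average count should come out close to one per fit, and a majority of the fits should show at least one spurious "significant" slope, which is the motivation for looking at the overall F-ratio first.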

Example of Multiple Regression

Automobile design (Car89.jmp, page 109)

“What is the predicted mileage for a 4000 lb. design, and what characteristics of the design are crucial?”

“How much does my 200 pound brother owe me for gas for carrying him 3,000 miles to California?” (Oops, it’s urban mileage in example)

– Initial one-predictor model

• Transform response to gallons per 1000 mile scale.

• Cannot compare $R^2$ values, since the two models use different dependent variables (MPG and GPM)

• Effect of scaling from GPM to GP1000M.

• RMSE = 4.23 (p 111)

• Skewness in residuals from regression with Weight. (p 112)

• Prediction at 4000 lbs = 63.9 GP1000M; adding 200 lbs for 3000 miles ≈ 8.2 gals

– Add variable for Horsepower (p 117)

• $R^2$ increases from 77% to 84% (added variable is significant, t=7.21)

• RMSE drops to 3.

• Predictors are related, both increase together, higher SE for Weight.

• Picture explains the increase in SE due to restricted range (p 120).

• Adding 200 lbs for 3000 miles ≈ 5.3 gals

• Prediction from multiple regression

– Add a predictor less correlated with Weight, use HP/Pound (p 123)

• Weight and HP/Pound less related, more distinct properties of these cars.

• Engineer can manipulate these separately, unlike HP and weight.
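
The two gas estimates quoted for the extra 200 lbs (about 8.2 gallons from the weight-only model, about 5.3 gallons once Horsepower is added) can be turned back into the Weight slopes they imply. This is back-of-the-envelope arithmetic only: the notes do not state the fitted slopes, so the values computed here are inferred from the rounded gallon figures, and the helper `implied_slope` is a hypothetical name.

```python
extra_weight_lb = 200   # the passenger
trip_miles = 3000       # length of the trip


def implied_slope(extra_gallons):
    """Gallons per 1000 miles per pound implied by the extra gallons for the trip."""
    gallons_per_1000_miles = extra_gallons / (trip_miles / 1000)
    return gallons_per_1000_miles / extra_weight_lb


slope_weight_only = implied_slope(8.2)  # model with Weight alone
slope_with_hp = implied_slope(5.3)      # model with Weight and Horsepower

print("implied Weight slope, Weight only:    ", slope_weight_only)
print("implied Weight slope, with Horsepower:", slope_with_hp)
```

The drop in the implied slope reflects the point made in the list above: once Horsepower is in the model, the Weight slope measures the effect of weight among cars of the same horsepower.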

Residual plots

– Show residuals plotted on fitted values

– Inspect for deviations from assumptions (such as lack of constant variance)

Leverage plots (p 125)

– Diagnostic plot, designed especially for multiple regression

– Reveals leveraged observations in multiple regression.

Next steps for this model…

– What other factors are important for the design?

– How small can we make the RMSE?

Example with Extreme Collinearity in Multiple Regression

Stock prices and market indices (Stocks.jmp, page 138)

“What’s the beta for Walmart when regressed on two indices?”

– Fitted slope of stock returns on the market estimates the beta for the stock.

– Huge collinearity (correlation between VW and S&P is 0.993), so almost no unique variation in either one given that the other is in the model.

– Either taken separately is a good predictor, but both show weak effects when used together.

– “Squished” leverage plots... little unique variation in either predictor available to explain the variation in the response. (p 144)

– More complete VW index is better predictor, as financial theory suggests.
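
The size of the collinearity problem can be gauged from the quoted correlation alone. With exactly two predictors, the variance inflation factor of each one is 1 / (1 − r²), where r is the correlation between them. The sketch below applies that identity to r = 0.993; it uses no other numbers from the Stocks.jmp output.

```python
import math

r = 0.993                        # correlation between VW and S&P quoted above

vif = 1.0 / (1.0 - r ** 2)       # variance inflation factor with two predictors
se_inflation = math.sqrt(vif)    # factor by which each slope's SE is inflated

print("VIF:", vif)
print("SE inflation factor:", se_inflation)
```

A VIF of this size means each slope's standard error is several times larger than it would be if the two indices were uncorrelated, which is why each index looks weak when both are in the model.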

Next Time

Categorical predictors…

Categorical predictors allow us to compare regression models for different groups, judging if the models for the different groups are comparable.

[Figure: Horsepower plotted against Weight(lb)]

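$$
\text{SE}(\text{slope estimate for } X_j) \approx \frac{\sigma}{\sqrt{n}} \cdot \frac{1}{\text{SD}(\text{Adjusted } X_j)} = \frac{\sigma}{\sqrt{n}} \cdot \frac{\sqrt{\text{VIF}_j}}{\text{SD}(X_j)} = \sqrt{\text{VIF}_j} \times (\text{SE if no collinearity})
$$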
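
The standard error approximation can be compared with the usual least-squares standard error on simulated data. This is a sketch under assumptions: the data are made up, σ is estimated by the RMSE, SD is the sample standard deviation, and with only two predictors the VIF is computed as 1 / (1 − r²).

```python
import numpy as np

rng = np.random.default_rng(3)
n = 400

# Hypothetical collinear predictors (not the course data).
x1 = rng.normal(size=n)
x2 = 0.9 * x1 + 0.3 * rng.normal(size=n)
y = 1.0 + 2.0 * x1 + 1.0 * x2 + rng.normal(size=n)

X = np.column_stack([np.ones(n), x1, x2])
xtx_inv = np.linalg.inv(X.T @ X)
coef = xtx_inv @ X.T @ y
resid = y - X @ coef

sigma = np.sqrt(resid @ resid / (n - 3))        # RMSE as the estimate of sigma
se_exact = np.sqrt(sigma ** 2 * xtx_inv[1, 1])  # usual least-squares SE for the x1 slope

r = np.corrcoef(x1, x2)[0, 1]
vif = 1.0 / (1.0 - r ** 2)                      # two predictors, so VIF = 1 / (1 - r^2)
se_formula = sigma / np.sqrt(n) * np.sqrt(vif) / np.std(x1, ddof=1)

print("least-squares SE:", se_exact)
print("formula SE:      ", se_formula)
print("VIF:             ", vif)
```

Under these conventions the two values differ only by the factor √((n − 1)/n), which is consistent with the formula being stated as an approximation.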

[Table: Parameter Estimates for Intercept, Weight(lb), and Horsepower, with columns Estimate, Std Error, t Ratio, Prob>|t|, and VIF]

[Figure: Residual plotted against GP1000M City Predicted]

[Table: Correlations among VW, SP500, WALMART, and Sequence Number]

[Figure: Scatterplot Matrix of VW, SP500, WALMART, and Sequence Number]

[Table: Parameter Estimates for Intercept and SP, with columns Estimate, Std Error, t Ratio, Prob>|t|, and VIF]

[Table: Parameter Estimates for Intercept, SP, and VW, with columns Estimate, Std Error, t Ratio, Prob>|t|, and VIF]