Impact of Independent Variables on Dependent Variable in Multiple Regression, Study notes of Statistics

These notes cover multiple linear regression, a statistical method used to understand the relationship between a dependent variable and multiple independent variables. An example using household KWH data illustrates how adding a new independent variable (DRYER) to the model affects the regression equation, the R-squared value, and the sums of squares. The notes also discuss the importance of analyzing the correlation between independent variables and the effect of adding collinear variables to the model.

Typology: Study notes

Pre 2010

Uploaded on 09/17/2009

koofers-user-jbx


Multiple Linear Regression Model

Multiple linear regression refers to regression applications in which there is more than one independent variable, $x_1, x_2, \ldots, x_k$. A multiple linear regression model with $k$ independent variables has the equation

$$y = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k + \varepsilon \tag{1}$$

The error term $\varepsilon$ is a random variable with mean 0 and variance $\sigma^2$. The parameter $\beta_1$ represents the expected change in $y$ resulting from a one-unit change in $x_1$, holding all other independent variables fixed.

A prediction equation for this model fitted to data is

$$\hat{y} = \hat\beta_0 + \hat\beta_1 x_1 + \cdots + \hat\beta_k x_k \tag{2}$$

where $\hat{y}$ denotes the “predicted” value computed from the equation, and $\hat\beta_i$ denotes an estimate of $\beta_i$. These estimates are usually obtained by the method of least squares. This means finding, among all possible values for the parameter estimates, the ones that minimize the sum of squared residuals, $\sum_{i=1}^{n} (y_i - \hat{y}_i)^2$. The least squares estimates yield the best-fitting equation in the sense of minimizing the sum of squared vertical distances from the data points to the fitted plane. The interpretation of the parameter estimates is the same as the interpretation of the model parameters, except that it refers to the fitted model. The parameter estimate $\hat\beta_1$ represents the change in $\hat{y}$ resulting from a one-unit change in $x_1$, holding all other independent variables fixed.

Example of Multiple Linear Regression

An example of a multiple linear regression with two independent variables is given by the KWH data, but now with $x_1$ = AC and $x_2$ = DRYER. Figure 1 shows a plot of KWH versus DRYER.

Figure 1. Plot of KWH versus DRYER.

The plot in Figure 1 clearly shows that KWH increases with the number of dryer runs, but the plot does not take into account the variable AC. Visualizing the simultaneous effects of AC and DRYER on KWH would require a plot in three dimensions, which is difficult to construct.

The model equation would be

$$\mathrm{KWH} = \beta_0 + \beta_1\,\mathrm{AC} + \beta_2\,\mathrm{DRYER} + \varepsilon.$$

The least squares parameter estimates are

$$\hat\beta_0 = 8.11, \quad \hat\beta_1 = 5.47, \quad \hat\beta_2 = 13.22.$$
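As a rough check on these estimates, the fit can be sketched in plain Python by solving the normal equations. The 21 observations are copied from the SAS program at the end of these notes; the helper names `solve` and `fit_ols` are illustrative assumptions, not part of the original analysis, which was done in SAS.

```python
# Sketch: least squares fit of KWH on AC and DRYER via the normal equations.
# Data rows (kwh, ac, dryer) are copied from the SAS program in these notes.
DATA = [
    (35, 1.5, 1), (63, 4.5, 2), (66, 5.0, 2), (17, 2.0, 0), (94, 8.5, 3),
    (79, 6.0, 3), (93, 13.5, 1), (66, 8.0, 1), (94, 12.5, 1), (82, 7.5, 2),
    (78, 6.5, 3), (65, 8.0, 1), (77, 7.5, 2), (75, 8.0, 2), (62, 7.5, 1),
    (85, 12.0, 1), (43, 6.0, 0), (57, 2.5, 3), (33, 5.0, 0), (65, 7.5, 1),
    (33, 6.0, 0),
]


def solve(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit_ols(rows, y):
    """Least squares estimates for y = b0 + b1*x1 + ... + bk*xk."""
    X = [[1.0] + list(r) for r in rows]  # design matrix with intercept column
    p = len(X[0])
    xtx = [[sum(x[i] * x[j] for x in X) for j in range(p)] for i in range(p)]
    xty = [sum(x[i] * yi for x, yi in zip(X, y)) for i in range(p)]
    return solve(xtx, xty)


kwh = [d[0] for d in DATA]
b0, b1, b2 = fit_ols([(d[1], d[2]) for d in DATA], kwh)
print(f"b0 = {b0:.2f}, b1 = {b1:.2f}, b2 = {b2:.2f}")
```

Solving the normal equations directly is adequate for a small, well-conditioned problem like this one; statistical software generally uses more numerically stable decompositions.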

Figure 2. Plot of AC versus DRYER

Compare the values of predicted KWH from the two models. Previously, AC=10 was inserted in the simple linear prediction equation to get

KWH = 27.85 + 5.34(10) = 81.25. (5)

A value of DRYER must also be inserted into the multiple regression equation to get a predicted KWH value. Trying DRYER = 0, 1, and 2 and holding AC=10 gives

KWH = 8.11 + 5.47(10) + 13.22(0) = 62.81,
KWH = 8.11 + 5.47(10) + 13.22(1) = 76.03, (6)
KWH = 8.11 + 5.47(10) + 13.22(2) = 89.25.

Predicted KWH consumption increases by 13.22 as DRYER goes from 0 to 1 and again from 1 to 2, holding AC fixed at 10.
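The arithmetic in these predictions can be reproduced with a few lines of Python. This is only a sketch using the rounded estimates reported above; the function name `predict_kwh` is an illustrative assumption.

```python
# Rounded least squares estimates reported in the notes.
B0, B_AC, B_DRYER = 8.11, 5.47, 13.22


def predict_kwh(ac, dryer):
    """Predicted KWH from the fitted two-variable equation."""
    return B0 + B_AC * ac + B_DRYER * dryer


dryer_values = (0, 1, 2)
preds = [predict_kwh(10, d) for d in dryer_values]
for d, p in zip(dryer_values, preds):
    print(f"AC=10, DRYER={d}: predicted KWH = {p:.2f}")

# With AC held fixed, each extra dryer run changes the prediction by B_DRYER.
step = preds[1] - preds[0]
print(f"change per dryer run = {step:.2f}")
```

The same step of 13.22 would appear for any fixed value of AC, which is what "holding AC fixed" means for a model without interaction terms.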

Analysis of Variance for Multiple Regression Model

An analysis of variance for a multiple linear regression model with k independent variables fitted to a data set with n observations is

Source of Variation    DF        SS       MS
Regression             k         SSR      MSR       (7)
Error                  n-k-1     SSE      MSE
Total                  n-1       SSTot

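The sums of squares SSR, SSE, and SSTot have the same definitions in relation to the model as in simple linear regression:

$$\mathrm{SSR} = \sum_{j=1}^{n} (\hat{y}_j - \bar{y})^2, \qquad \mathrm{SSE} = \sum_{j=1}^{n} (y_j - \hat{y}_j)^2, \qquad \mathrm{SSTot} = \sum_{j=1}^{n} (y_j - \bar{y})^2$$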

Also, SSTot=SSR+SSE. The value of SSTot does not change with the model. It depends only on the values of the dependent variable y. SSE, however, decreases (or at least never increases) as variables are added to a model, and SSR increases by the same amount. This increase in SSR is the amount of variation due to variables in the larger model that was not accounted for by variables in the smaller model. This increase in regression sum of squares is sometimes denoted

SSR( added variables | original variables ), (9)

where original variables represents the list of independent variables that were in the model prior to adding new variables, and added variables represents the list of variables that were added to obtain the new model. The overall SSR for the new model can be partitioned into the variation attributable to the original variables plus the variation due to the added variables that is not due to the original variables ,

SSR( all variables ) = SSR( original variables ) + SSR( added variables | original variables ). (10)
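To make the partition concrete, the following Python sketch computes the sums of squares for the KWH data, taking AC as the original variable and DRYER as the added variable. The data rows are copied from the SAS program at the end of these notes; the helper `sp` and the closed-form two-variable solution are illustrative choices, not the method used in the original analysis.

```python
# (kwh, ac, dryer) rows copied from the SAS program in these notes.
DATA = [
    (35, 1.5, 1), (63, 4.5, 2), (66, 5.0, 2), (17, 2.0, 0), (94, 8.5, 3),
    (79, 6.0, 3), (93, 13.5, 1), (66, 8.0, 1), (94, 12.5, 1), (82, 7.5, 2),
    (78, 6.5, 3), (65, 8.0, 1), (77, 7.5, 2), (75, 8.0, 2), (62, 7.5, 1),
    (85, 12.0, 1), (43, 6.0, 0), (57, 2.5, 3), (33, 5.0, 0), (65, 7.5, 1),
    (33, 6.0, 0),
]
y = [d[0] for d in DATA]
ac = [d[1] for d in DATA]
dryer = [d[2] for d in DATA]


def sp(u, v):
    """Sum of cross products of u and v about their means."""
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    return sum((a - mu) * (b - mv) for a, b in zip(u, v))


ss_tot = sp(y, y)

# Smaller model (AC only): for simple linear regression, SSR = S1y^2 / S11.
ssr_ac = sp(ac, y) ** 2 / sp(ac, ac)

# Larger model (AC and DRYER): solve the centered 2x2 normal equations.
s11, s22, s12 = sp(ac, ac), sp(dryer, dryer), sp(ac, dryer)
s1y, s2y = sp(ac, y), sp(dryer, y)
det = s11 * s22 - s12 ** 2
b1 = (s22 * s1y - s12 * s2y) / det
b2 = (s11 * s2y - s12 * s1y) / det
b0 = (sum(y) - b1 * sum(ac) - b2 * sum(dryer)) / len(y)

ybar = sum(y) / len(y)
fitted = [b0 + b1 * a + b2 * d for a, d in zip(ac, dryer)]
ssr_all = sum((f - ybar) ** 2 for f in fitted)
sse_all = sum((yi - f) ** 2 for yi, f in zip(y, fitted))

ssr_added = ssr_all - ssr_ac  # SSR(DRYER | AC)
print(f"SSTot = {ss_tot:.1f}, SSR(AC, DRYER) + SSE = {ssr_all + sse_all:.1f}")
print(f"SSR(AC) = {ssr_ac:.1f}, SSR(DRYER | AC) = {ssr_added:.1f}")
```

SSR and SSE are computed here directly from the fitted values and residuals, so the agreement of their sum with SSTot is a genuine check rather than something forced by construction.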

Generally speaking, larger values of the coefficient of determination R^2 = SSR/SSTot indicate a better fitting model. The value of R^2 cannot decrease as variables are added to the model, and in practice it almost always increases. However, this does not necessarily mean that the model has actually been improved. The increase in R^2 can be a mathematical artifact rather than a meaningful indication of an improved model. Sometimes an adjusted R^2 is used to overcome this shortcoming of the usual R^2. Most regression computer programs report both versions of R^2.
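The notes do not give a formula for adjusted R^2. The sketch below assumes the usual definition, 1 - (1 - R^2)(n - 1)/(n - k - 1), and applies it to the two R^2 values reported later in these notes for the KWH data (n = 21 observations). The function name `adjusted_r2` is illustrative.

```python
def adjusted_r2(r2, n, k):
    """Adjusted R^2 (usual definition, assumed here) for k independent
    variables fitted to n observations."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)


n = 21  # number of observations in the KWH data
adj_ac = adjusted_r2(0.5856, n, 1)     # model with AC only
adj_both = adjusted_r2(0.9709, n, 2)   # model with AC and DRYER
print(f"AC only:      R2 = 0.5856, adjusted R2 = {adj_ac:.4f}")
print(f"AC and DRYER: R2 = 0.9709, adjusted R2 = {adj_both:.4f}")
```

Because the penalty factor (n - 1)/(n - k - 1) grows with k, the adjusted value rises only when an added variable reduces SSE by more than the loss of an error degree of freedom would explain.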

Statistical Inference for Regression Parameters

Statistical inference about the parameters requires standard errors of the estimates. A 95% confidence interval for $\beta_i$ is

$$\hat\beta_i \pm t_{df,.025}\,\hat\sigma_{\hat\beta_i} \tag{12}$$

where $t_{df,.025}$ is the critical value from a t distribution with df = n-k-1, the degrees of freedom for error, and $\hat\sigma_{\hat\beta_i}$ is the standard error of $\hat\beta_i$.

Standard errors for the parameter estimates in the two-variable model are

$$\hat\sigma_{\hat\beta_0} = 2.48, \quad \hat\sigma_{\hat\beta_1} = 0.28, \quad \hat\sigma_{\hat\beta_2} = 0.86. \tag{13}$$

The critical value from a t distribution with df = 18 is $t_{18,.025} = 2.1$. Thus, a 95% confidence interval for $\beta_1$ is

$$\hat\beta_1 \pm t_{18,.025}\,\hat\sigma_{\hat\beta_1} = 5.47 \pm 2.1(0.28) = 5.47 \pm 0.59.$$

We are 95% confident that the “true” hourly KWH consumption of the AC is between 4.88 and 6.06. This is a considerably shorter interval than the interval 5.34±2.16 that was obtained from the simple linear regression model because the variance estimate (MSE) has been reduced from 208.9 to 15.5.
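The interval arithmetic can be sketched in Python using the estimates, standard errors, and critical value quoted in these notes. The function name `conf_interval` is an illustrative assumption, and the interval for the DRYER coefficient is an extra application of the same formula that the notes do not work out.

```python
def conf_interval(estimate, std_error, t_crit):
    """Confidence interval: estimate +/- t_crit * std_error."""
    half_width = t_crit * std_error
    return estimate - half_width, estimate + half_width


T_CRIT = 2.1  # t critical value for df = 18 at the 95% level, as quoted in the notes

lo_ac, hi_ac = conf_interval(5.47, 0.28, T_CRIT)
print(f"95% CI for the AC coefficient:    ({lo_ac:.2f}, {hi_ac:.2f})")

# Same formula applied to the DRYER coefficient (not worked out in the notes).
lo_dr, hi_dr = conf_interval(13.22, 0.86, T_CRIT)
print(f"95% CI for the DRYER coefficient: ({lo_dr:.2f}, {hi_dr:.2f})")
```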

It seems apparent that the model including both AC and DRYER is superior to the model containing AC alone. The value of R^2 is much higher (.9709 compared to .5856) and MS(Error) is much smaller.

You can conduct a statistical test of significance to compare the two models using the ANOVA table with the partitioned SS(Reg). The test statistic is

F = MS(DRYER|AC)/MS(Error) = 3690.1/15.5 = 238.1,

with numerator df=1 and denominator df=18. This is a huge value of F with these degrees of freedom and is significant at any reasonable level.
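The test statistic can be reproduced with a short Python sketch from the values quoted above. The comparison with the squared t ratio relies on the standard result that, when a single variable is added, the partial F statistic equals the square of that variable's t statistic. The agreement here is only approximate because the inputs from these notes are rounded.

```python
ss_dryer_given_ac = 3690.1  # SS(DRYER | AC), from the notes
added_df = 1                # one variable (DRYER) added to the model
ms_error = 15.5             # MS(Error) for the two-variable model, df = 18

f_stat = (ss_dryer_given_ac / added_df) / ms_error
print(f"F = {f_stat:.1f}")

# Cross-check against the t ratio for DRYER, using the rounded estimate
# and standard error quoted in the notes.
t_ratio = 13.22 / 0.86
print(f"t = {t_ratio:.2f}, t^2 = {t_ratio ** 2:.1f}")
```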

Figure 3. Plot of Residuals versus AC

Figure 4. Plot of Residuals versus DRYER

Regression with Collinear Variables

The example on household KWH consumption used two independent variables that are almost uncorrelated. Thus, when DRYER was added to the model in addition to AC, the AC regression coefficient changed very little. Also, the amount of variation attributable to DRYER is almost the same when it is the only variable in the model as when it is added to a model that already includes AC.
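The claim that AC and DRYER are almost uncorrelated can be checked with a short Python sketch that computes the sample correlation from the 21 observations in the SAS program at the end of these notes. The helper names `sp` and `correlation` are illustrative.

```python
import math

# (ac, dryer) pairs copied from the SAS program in these notes.
AC_DRYER = [
    (1.5, 1), (4.5, 2), (5.0, 2), (2.0, 0), (8.5, 3), (6.0, 3), (13.5, 1),
    (8.0, 1), (12.5, 1), (7.5, 2), (6.5, 3), (8.0, 1), (7.5, 2), (8.0, 2),
    (7.5, 1), (12.0, 1), (6.0, 0), (2.5, 3), (5.0, 0), (7.5, 1), (6.0, 0),
]
ac = [p[0] for p in AC_DRYER]
dryer = [p[1] for p in AC_DRYER]


def sp(u, v):
    """Sum of cross products of u and v about their means."""
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    return sum((a - mu) * (b - mv) for a, b in zip(u, v))


def correlation(u, v):
    """Pearson sample correlation coefficient."""
    return sp(u, v) / math.sqrt(sp(u, u) * sp(v, v))


r = correlation(ac, dryer)
print(f"correlation between AC and DRYER = {r:.3f}")
```

A correlation this close to zero explains why the AC coefficient barely moves (5.34 versus 5.47) when DRYER enters the model.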

The following example illustrates a situation when two highly correlated variables are in a regression model: Students in a graduate statistics course recorded the spans of their left and right hands and their heights, all in inches. The objective was to develop a regression model to predict height from hand span. The variable names are HT, LSPAN, and RSPAN. Of course, LSPAN and RSPAN are highly correlated. This example illustrates the consequence of using two highly correlated variables in a multiple regression equation.

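The two simple linear regression models are

$$\mathrm{HT} = \beta_0 + \beta_1\,\mathrm{LSPAN} + \varepsilon$$

and

$$\mathrm{HT} = \beta_0 + \beta_2\,\mathrm{RSPAN} + \varepsilon.$$

The prediction equations are

$$\mathrm{HT} = 43.62 + 2.88\,\mathrm{LSPAN} \quad\text{and}\quad \mathrm{HT} = 41.35 + 3.17\,\mathrm{RSPAN}.$$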

Not surprisingly, the two equations are quite similar.

Figures 6a and 6b show HT versus LSPAN and RSPAN:

Figure 6a. Regression of HT on LSPAN

Figure 6b. Regression of HT on RSPAN

Figure 7 shows a plot of RSPAN versus LSPAN showing the high degree of collinearity.

Figure 7. Plot showing collinearity between LSPAN and RSPAN

The multiple linear regression model is:

$$\mathrm{HT} = \beta_0 + \beta_1\,\mathrm{LSPAN} + \beta_2\,\mathrm{RSPAN} + \varepsilon \tag{14}$$

The prediction equation is

$$\mathrm{HT} = 41.13 - 4.31\,\mathrm{LSPAN} + 7.53\,\mathrm{RSPAN}$$

At first look, this equation seems to make no sense at all. The regression coefficient on LSPAN is negative, and the coefficient on RSPAN is twice as large as the coefficient


Figures 9a and 9b show the data and fitted plane on similar axes:

Figure 9a. HT plotted versus LSPAN and RSPAN

Figure 9b. Plane of predicted values plotted versus

You can see that although the slope in the LSPAN direction is negative, the plane rises from the smallest values of LSPAN and RSPAN to the largest values. This illustrates the following facts of regression with highly collinear variables:

  1. Individual regression coefficients are practically meaningless.
  2. Predictions over the region of observed values of the collinear independent variables are relatively stable.
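Both facts can be illustrated with a short Python sketch that evaluates the three prediction equations given above. The input hand spans are hypothetical values chosen for illustration, not observations from the class data. Equal left and right spans stand in for points in the observed, collinear region, and a 2-inch difference between hands stands in for a point far outside it.

```python
def ht_from_lspan(lspan):
    """Simple regression of HT on LSPAN (equation from the notes)."""
    return 43.62 + 2.88 * lspan


def ht_from_rspan(rspan):
    """Simple regression of HT on RSPAN (equation from the notes)."""
    return 41.35 + 3.17 * rspan


def ht_from_both(lspan, rspan):
    """Multiple regression of HT on LSPAN and RSPAN (equation from the notes)."""
    return 41.13 - 4.31 * lspan + 7.53 * rspan


# Hypothetical spans with LSPAN = RSPAN, mimicking the collinear pattern.
for span in (7.0, 8.0, 9.0):
    preds = (ht_from_lspan(span), ht_from_rspan(span), ht_from_both(span, span))
    print(f"span {span}: " + ", ".join(f"{p:.2f}" for p in preds))

# Hypothetical points far from the collinear pattern: the three equations
# no longer agree, and the multiple regression prediction swings widely.
off_a = ht_from_both(9.0, 7.0)
off_b = ht_from_both(7.0, 9.0)
print(f"LSPAN=9, RSPAN=7: {off_a:.2f};  LSPAN=7, RSPAN=9: {off_b:.2f}")
```

When the two spans are equal, the multiple regression equation reduces to 41.13 + 3.22 × span, since -4.31 + 7.53 = 3.22, which is close to both simple regression slopes. That is why predictions stay stable along the collinear direction even though the individual coefficients look unreasonable.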

SAS Program for Multiple Linear Regression Analysis of KWH Data

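options nonumber nodate;
Title1 'Household Electricity Consumption Data';

/* Read the 21 observations. The last three rows have missing kwh */
/* so that predictions are produced at ac=10 and dryer=0, 1, 2.   */
data kilowatt;
  input kwh ac dryer;
  cards;
35 1.5 1
63 4.5 2
66 5.0 2
17 2.0 0
94 8.5 3
79 6.0 3
93 13.5 1
66 8.0 1
94 12.5 1
82 7.5 2
78 6.5 3
65 8.0 1
77 7.5 2
75 8.0 2
62 7.5 1
85 12.0 1
43 6.0 0
57 2.5 3
33 5.0 0
65 7.5 1
33 6.0 0
. 10 0
. 10 1
. 10 2
;

proc print data=kilowatt; run;

/* Correlations and sums of squares and cross products */
proc corr sscp data=kilowatt; run;

proc sort data=kilowatt; by ac; run;

/* Grid of ac values with missing kwh, for plotting fitted values */
data acplot;
  do ac = 0 to 15 by .5;
    kwh = .;
    output;
  end;
run;

/* Grid of dryer values with missing kwh */
data dryplot;
  do dryer = 0 to 3 by .1;
    kwh = .;
    output;
  end;
run;

proc print data=dryplot; run;

/* MERGE with BY requires both data sets to be sorted by dryer */
proc sort data=kilowatt; by dryer; run;

data dryplot;
  merge dryplot kilowatt;
  by dryer;
run;

/* Simple linear regression of kwh on dryer, saving predictions and residuals */
proc reg data=dryplot;
  id ac;
  model kwh=dryer / p;
  plot kwh*dryer;
  output out=dryplot1 p=drypred r=dryresid1;
run;

/* Residual plots */
proc gplot data=dryplot1;
  plot dryresid1*dryer;
  plot dryresid1*drypred;
run;

/* Same fit with confidence limits for the mean (clm) and prediction limits (cli) */
proc reg data=dryplot;
  id ac;
  model kwh=dryer / p clm cli;
  plot kwh*dryer;
  output out=dryplot2 p=drypred lclm=lcl uclm=ucl lcl=lpl ucl=upl;
run;

proc print data=dryplot2; run;

/* Overlay data, fitted line, and confidence or prediction limits */
proc gplot data=dryplot2;
  plot kwh*dryer drypred*dryer lcl*dryer ucl*dryer / overlay;
  plot kwh*dryer drypred*dryer lpl*dryer upl*dryer / overlay;
run;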