









Multiple linear regression, a statistical method used to understand the relationship between a dependent variable and multiple independent variables. It uses an example of kwh data to illustrate how adding a new independent variable (dryer) to the model affects the regression equation, r-squared value, and the sum of squares. The document also discusses the importance of analyzing the correlation between independent variables and the impact of adding collinear variables to the model.










Multiple Linear Regression Model
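Multiple linear regression refers to regression applications in which there is more than one independent variable, $x_1, x_2, \ldots, x_k$. A multiple linear regression model with k independent variables has the equation
$y = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k + \varepsilon$,
where $\varepsilon$ is a random error term. Each parameter $\beta_i$ is interpreted with the other variables held constant: for example, $\beta_1$ is the expected change in y resulting from a single unit change in $x_1$, holding all other independent variables fixed.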
A prediction equation for this model fitted to data is
$\hat{y} = \hat\beta_0 + \hat\beta_1 x_1 + \cdots + \hat\beta_k x_k$,
where $\hat{y}$ denotes the predicted value of y and $\hat\beta_i$ denotes the estimate of $\beta_i$. These estimates are usually obtained by the method of least squares. This means finding, among the set of all possible values for the parameter estimates, the ones that minimize the sum of squared residuals, $\sum_{i=1}^{n} (y_i - \hat{y}_i)^2$. These estimates yield the best fitting equation in terms of minimizing the sum of squared vertical distances from the data points to the fitted plane. The interpretation of the parameter estimates is the same as the interpretation of the model parameters, except with respect to the fitted equation: $\hat\beta_1$ is the estimated change in y resulting from a single unit change in $x_1$, holding all other independent variables fixed.
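The least squares idea described above can be sketched in a few lines of Python. This is only an illustration, not the method used in the notes (which rely on SAS): it solves the normal equations $(X'X)b = X'y$ directly, using the KWH observations copied from the SAS program listing at the end of these notes. The helper names `solve` and `least_squares` are made up for this example.

```python
# KWH data copied from the SAS listing in these notes: (kwh, ac, dryer)
DATA = [
    (35, 1.5, 1), (63, 4.5, 2), (66, 5.0, 2), (17, 2.0, 0), (94, 8.5, 3),
    (79, 6.0, 3), (93, 13.5, 1), (66, 8.0, 1), (94, 12.5, 1), (82, 7.5, 2),
    (78, 6.5, 3), (65, 8.0, 1), (77, 7.5, 2), (75, 8.0, 2), (62, 7.5, 1),
    (85, 12.0, 1), (43, 6.0, 0), (57, 2.5, 3), (33, 5.0, 0), (65, 7.5, 1),
    (33, 6.0, 0),
]


def solve(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix (copy)
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * p for x, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def least_squares(xs, y):
    """Return [b0, b1, ..., bk] minimizing the sum of squared residuals."""
    rows = [[1.0] + list(x) for x in xs]  # design matrix with intercept column
    p = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(p)]
    return solve(xtx, xty)


y = [kwh for kwh, _, _ in DATA]
xs = [(ac, dryer) for _, ac, dryer in DATA]
b0, b1, b2 = least_squares(xs, y)
print(f"KWH = {b0:.2f} + {b1:.2f} AC + {b2:.2f} DRYER")
```

The printed coefficients should be close to the estimates used in the example that follows. Solving the normal equations directly is fine for a small, well-conditioned problem like this one; statistical software typically uses more numerically stable decompositions.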
Example of Multiple Linear Regression
An example of a multiple linear regression with two independent variables is given by the KWH data, but now with x1 = AC and x2 = DRYER. Figure 1 shows a plot of KWH versus DRYER.
Figure 1. Plot of KWH versus DRYER.
The plot in Figure 1 clearly shows that KWH increases with increasing runs of the dryer, but the plot does not take into account the variable AC. Visualizing the simultaneous effects of AC and DRYER on KWH would require a plot in three dimensions, which is difficult to construct.
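The model equation would be
$KWH = \beta_0 + \beta_1\,AC + \beta_2\,DRYER + \varepsilon$.
Least squares parameter estimates are
$\hat\beta_0 = 8.11$, $\hat\beta_1 = 5.47$, $\hat\beta_2 = 13.22$.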
Figure 2. Plot of AC versus DRYER
Compare the values of predicted KWH from the two models. Previously, AC=10 was inserted in the simple linear prediction equation to get
KWH = 27.85 + 5.34(10) = 81.25. (5)
A value of DRYER must also be inserted into the multiple regression equation to get a predicted KWH value. Trying DRYER = 0, 1, and 2 and holding AC=10 gives
KWH = 8.11 + 5.47(10) + 13.22(0) = 62.81,
KWH = 8.11 + 5.47(10) + 13.22(1) = 76.03, (6)
KWH = 8.11 + 5.47(10) + 13.22(2) = 89.25.
KWH consumption increases by 13.22 as DRYER goes from 0 to 1 and again from 1 to 2, holding AC fixed at 10.
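The arithmetic in (5) and (6) can be checked with a short Python sketch. The function names `predict_simple` and `predict_multiple` are invented for this illustration; the coefficients are the rounded estimates quoted in the text.

```python
def predict_simple(ac, b0=27.85, b1=5.34):
    """Predicted KWH from the simple linear regression on AC alone."""
    return b0 + b1 * ac


def predict_multiple(ac, dryer, b0=8.11, b1=5.47, b2=13.22):
    """Predicted KWH from the two-variable model with AC and DRYER."""
    return b0 + b1 * ac + b2 * dryer


simple = round(predict_simple(10), 2)
multiple = [round(predict_multiple(10, d), 2) for d in (0, 1, 2)]
# Successive differences equal the DRYER coefficient (up to rounding)
steps = [round(b - a, 2) for a, b in zip(multiple, multiple[1:])]

print(simple, multiple, steps)
```

Because AC is held at 10 in every call, each step in `multiple` reflects only the change in DRYER, which is why both steps equal the DRYER coefficient.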
Analysis of Variance for Multiple Regression Model
An analysis of variance for a multiple linear regression model with k independent variables fitted to a data set with n observations is
Source of Variation    DF       SS       MS
Regression             k        SSR      MSR
Error                  n-k-1    SSE      MSE       (7)
Total                  n-1      SSTot
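The sums of squares SSR, SSE, and SSTot have the same definitions in relation to the model as in simple linear regression:
$SSR = \sum_{j=1}^{n} (\hat{y}_j - \bar{y})^2$, $SSE = \sum_{j=1}^{n} (y_j - \hat{y}_j)^2$, $SSTot = \sum_{j=1}^{n} (y_j - \bar{y})^2$. (8)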
Also, SSTot=SSR+SSE. The value of SSTot does not change with the model. It depends only on the values of the dependent variable y. But SSE decreases as variables are added to a model, and SSR increases by the same amount. This amount of increase in SSR is the amount of variation due to variables in the larger model that was not accounted for by variables in the smaller model. This increase in regression sum of squares is sometimes denoted
SSR( added variables | original variables ), (9)
where original variables represents the list of independent variables that were in the model prior to adding new variables, and added variables represents the list of variables that were added to obtain the new model. The overall SSR for the new model can be partitioned into the variation attributable to the original variables plus the variation due to the added variables that is not due to the original variables,
SSR( all variables ) = SSR( original variables ) + SSR( added variables | original variables ). (10)
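The partition of the regression sum of squares can be illustrated numerically. The sketch below is an assumption-laden illustration rather than the notes' own computation: it fits the AC-only model and the AC plus DRYER model to the KWH data from the SAS listing and takes the difference in SSR as SSR( DRYER | AC ). The helpers `solve` and `sums_of_squares` are hypothetical names.

```python
# KWH data copied from the SAS listing in these notes: (kwh, ac, dryer)
DATA = [
    (35, 1.5, 1), (63, 4.5, 2), (66, 5.0, 2), (17, 2.0, 0), (94, 8.5, 3),
    (79, 6.0, 3), (93, 13.5, 1), (66, 8.0, 1), (94, 12.5, 1), (82, 7.5, 2),
    (78, 6.5, 3), (65, 8.0, 1), (77, 7.5, 2), (75, 8.0, 2), (62, 7.5, 1),
    (85, 12.0, 1), (43, 6.0, 0), (57, 2.5, 3), (33, 5.0, 0), (65, 7.5, 1),
    (33, 6.0, 0),
]


def solve(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * p for x, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def sums_of_squares(xs, y):
    """Fit by least squares and return (SSR, SSE) for that model."""
    rows = [[1.0] + list(x) for x in xs]
    p = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(p)]
    beta = solve(xtx, xty)
    fitted = [sum(b * v for b, v in zip(beta, r)) for r in rows]
    ybar = sum(y) / len(y)
    ssr = sum((f - ybar) ** 2 for f in fitted)
    sse = sum((yi - f) ** 2 for yi, f in zip(y, fitted))
    return ssr, sse


y = [kwh for kwh, _, _ in DATA]
ybar = sum(y) / len(y)
ss_tot = sum((yi - ybar) ** 2 for yi in y)  # depends on y only, not on the model

ssr_ac, sse_ac = sums_of_squares([(ac,) for _, ac, _ in DATA], y)
ssr_full, sse_full = sums_of_squares([(ac, d) for _, ac, d in DATA], y)
ssr_dryer_given_ac = ssr_full - ssr_ac  # extra variation explained by DRYER

print(f"SSTot = {ss_tot:.1f}")
print(f"SSR(AC) = {ssr_ac:.1f}, SSR(DRYER | AC) = {ssr_dryer_given_ac:.1f}")
print(f"SSE (AC only) = {sse_ac:.1f}, SSE (AC, DRYER) = {sse_full:.1f}")
```

SSE falls by exactly the amount that SSR rises when DRYER is added, since SSTot is the same for both models.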
Generally speaking, larger values of the coefficient of determination R^2 = SSR/SSTot indicate a better fitting model. The value of R^2 cannot decrease, and in practice almost always increases, as variables are added to the model. However, this does not necessarily mean that the model has actually been improved. The amount of increase in R^2 can be a mathematical artifact rather than a meaningful indication of an improved model. Sometimes an adjusted R^2 is used to overcome this shortcoming of the usual R^2. Most regression computer programs include both versions of R^2.
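The notes do not give a formula for adjusted R^2. A commonly used definition, stated here as an assumption about what the software reports, replaces each sum of squares by its mean square, so that adding a variable is penalized through the error degrees of freedom. The worked values use n = 21 observations and the R^2 values quoted later in these notes.

```latex
\begin{align*}
R^2_{\mathrm{adj}}
  &= 1 - \frac{SSE/(n-k-1)}{SSTot/(n-1)}
   = 1 - (1 - R^2)\,\frac{n-1}{n-k-1} \\[4pt]
\text{AC only } (k=1):\quad
R^2_{\mathrm{adj}} &= 1 - (1 - 0.5856)\,\tfrac{20}{19} \approx 0.564 \\
\text{AC and DRYER } (k=2):\quad
R^2_{\mathrm{adj}} &= 1 - (1 - 0.9709)\,\tfrac{20}{18} \approx 0.968
\end{align*}
```

Unlike R^2, this quantity can decrease when an added variable reduces SSE by less than the loss of one error degree of freedom would justify.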
Statistical Inference for Regression Parameters
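Statistical inference about the parameters requires standard errors of the estimates. A 95% confidence interval for $\beta_i$ is
$\hat\beta_i \pm t_{df,.025}\,\hat\sigma_{\hat\beta_i}$,
where $\hat\sigma_{\hat\beta_i}$ is the standard error of $\hat\beta_i$ and $t_{df,.025}$ is the critical value from a t distribution with df=n-k-1, the degrees of freedom for error.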
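Each parameter estimate in the two-variable model has its own standard error, $\hat\sigma_{\hat\beta_0}$, $\hat\sigma_{\hat\beta_1}$, and $\hat\sigma_{\hat\beta_2}$. The one needed below is $\hat\sigma_{\hat\beta_1} = 0.28$.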
The critical value from a t distribution with df=18 is t 18,.025 =2.1. Thus, a 95% confidence interval for β 1 is
$\hat\beta_1 \pm t_{18,.025}\,\hat\sigma_{\hat\beta_1} = 5.47 \pm 2.1(0.28) = 5.47 \pm 0.59$.
We are 95% confident that the “true” hourly KWH consumption of the AC is between 4.88 and 6.06. This is a considerably shorter interval than the interval 5.34±2.16 that was obtained from the simple linear regression model because the variance estimate (MSE) has been reduced from 208.9 to 15.5.
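As a rough check on these numbers, the standard error of the AC coefficient can be computed from the data as the square root of MSE times the corresponding diagonal element of $(X'X)^{-1}$. This is a sketch under stated assumptions: the data come from the SAS listing at the end of the notes, the critical value 2.1 is taken from the text rather than computed, and `solve` and `inv_diag` are hypothetical helper names.

```python
import math

# KWH data copied from the SAS listing in these notes: (kwh, ac, dryer)
DATA = [
    (35, 1.5, 1), (63, 4.5, 2), (66, 5.0, 2), (17, 2.0, 0), (94, 8.5, 3),
    (79, 6.0, 3), (93, 13.5, 1), (66, 8.0, 1), (94, 12.5, 1), (82, 7.5, 2),
    (78, 6.5, 3), (65, 8.0, 1), (77, 7.5, 2), (75, 8.0, 2), (62, 7.5, 1),
    (85, 12.0, 1), (43, 6.0, 0), (57, 2.5, 3), (33, 5.0, 0), (65, 7.5, 1),
    (33, 6.0, 0),
]


def solve(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * p for x, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


rows = [[1.0, ac, dryer] for _, ac, dryer in DATA]  # intercept, AC, DRYER
y = [kwh for kwh, _, _ in DATA]
n, p = len(rows), 3                                  # p = k + 1 parameters

xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(p)]
beta = solve(xtx, xty)

residuals = [yi - sum(b * v for b, v in zip(beta, r)) for r, yi in zip(rows, y)]
mse = sum(e * e for e in residuals) / (n - p)        # error df = n - k - 1 = 18


def inv_diag(j):
    """j-th diagonal element of (X'X)^-1, from solving (X'X) c = e_j."""
    unit = [1.0 if i == j else 0.0 for i in range(p)]
    return solve(xtx, unit)[j]


se = [math.sqrt(mse * inv_diag(j)) for j in range(p)]

T_CRIT = 2.1  # t(18, .025) as quoted in the notes; not computed here
lower = beta[1] - T_CRIT * se[1]
upper = beta[1] + T_CRIT * se[1]
print(f"MSE = {mse:.1f}, se(b1) = {se[1]:.2f}, 95% CI = ({lower:.2f}, {upper:.2f})")
```

The interval endpoints may differ from the text in the last digit because the notes round the estimate and standard error before multiplying.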
It seems apparent that the model including both AC and DRYER is superior to the model containing AC alone. The value of R^2 is much higher (.9709 compared to .5856) and MS(Error) is much smaller.
You can conduct a statistical test of significance to compare the two models using the ANOVA table with the partitioned SS(Reg). The test statistic is
F = MS(DRYER|AC)/MS(Error) = 3690.1/15.5 = 238.1,
with numerator df=1 and denominator df=18. This is a huge value of F with these degrees of freedom and is significant at any reasonable level.
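This test is a special case of the general comparison between a smaller model and a larger model that adds q variables. The form below is the standard extra-sum-of-squares F statistic, written in the notation of these notes; the symbol q is introduced here for illustration and does not appear in the original.

```latex
\begin{align*}
F &= \frac{SSR(\text{added variables} \mid \text{original variables})/q}
          {MSE(\text{all variables})},
  \qquad \text{df} = (q,\; n-k-1) \\[4pt]
\text{KWH example } (q = 1,\; n-k-1 = 18):\quad
F &= \frac{3690.1/1}{15.5} \approx 238.1
\end{align*}
```

Here k is the number of independent variables in the larger model, so the denominator mean square and its degrees of freedom come from the model containing all variables.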
Figure 3. Plot of Residuals versus AC
Figure 4. Plot of Residuals versus DRYER
Regression with Collinear Variables
The example on household KWH consumption utilized two independent variables that are almost uncorrelated. Thus, when DRYER was added to the model in addition to AC, the AC regression coefficient changed very little (from 5.34 to 5.47). Also, the amount of variation attributable to DRYER is almost the same when it is the only variable in the model as when it is added to a model that already includes AC.
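The claim that AC and DRYER are almost uncorrelated can be checked directly from the data in the SAS listing. The sketch below computes the Pearson correlation by hand; the function name `correlation` is chosen for this example only.

```python
import math

# AC and DRYER columns copied from the SAS listing in these notes
ac = [1.5, 4.5, 5.0, 2.0, 8.5, 6.0, 13.5, 8.0, 12.5, 7.5, 6.5,
      8.0, 7.5, 8.0, 7.5, 12.0, 6.0, 2.5, 5.0, 7.5, 6.0]
dryer = [1, 2, 2, 0, 3, 3, 1, 1, 1, 2, 3, 1, 2, 2, 1, 1, 0, 3, 0, 1, 0]


def correlation(u, v):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(u)
    ubar, vbar = sum(u) / n, sum(v) / n
    suv = sum((a - ubar) * (b - vbar) for a, b in zip(u, v))
    suu = sum((a - ubar) ** 2 for a in u)
    svv = sum((b - vbar) ** 2 for b in v)
    return suv / math.sqrt(suu * svv)


r = correlation(ac, dryer)
print(f"corr(AC, DRYER) = {r:.3f}")
```

A correlation this close to zero is what allows each coefficient to be estimated almost independently of the other variable.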
The following example illustrates a situation when two highly correlated variables are in a regression model: Students in a graduate statistics course recorded the spans of their left and right hands and their heights, all in inches. The objective was to develop a regression model to predict height from hand span. The variable names are HT, LSPAN, and RSPAN. Of course, LSPAN and RSPAN are highly correlated. This example illustrates the consequence of using two highly correlated variables in a multiple regression equation.
The two simple linear regression models are:
$HT = \beta_0 + \beta_1\,LSPAN + \varepsilon$
and
$HT = \beta_0 + \beta_1\,RSPAN + \varepsilon$.
The prediction equations are:
HT=43.62+2.88LSPAN and HT=41.35+3.17RSPAN
Not surprisingly, the two equations are quite similar.
Figures 6a and 6b show HT versus LSPAN and RSPAN:
Figure 6a. Regression of HT on LSPAN
Figure 6b. Regression of HT on RSPAN
Figure 7 shows a plot of RSPAN versus LSPAN, which reveals the high degree of collinearity.
Figure 7. Plot showing collinearity between LSPAN and RSPAN
The multiple linear regression model is:
$HT = \beta_0 + \beta_1\,LSPAN + \beta_2\,RSPAN + \varepsilon$
The prediction equation is
HT = 41.13 – 4.31LSPAN + 7.53RSPAN
At first look, this equation seems to make no sense at all. The regression coefficient on LSPAN is negative, and the coefficient on RSPAN (7.53) is more than twice as large as the coefficient in the simple linear regression on RSPAN alone (3.17).
Figures 9a and 9b show the data and fitted plane on similar axes:
Figure 9a. HT plotted versus LSPAN and RSPAN
Figure 9b. Plane of predicted values plotted versus LSPAN and RSPAN
You can see that although the slope in the LSPAN direction is negative, the plane rises from the smallest values of LSPAN and RSPAN up to the largest values. This illustrates the following facts of regression with highly collinear variables:
SAS Program for Multiple Linear Regression Analysis of KWH Data
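options nonumber nodate;
title1 'Household Electricity Consumption Data';

/* Observed data, one household per line.                           */
/* The last three rows have missing kwh (ac=10, dryer=0,1,2) and    */
/* are included only so that predicted values are produced.         */
data kilowatt;
  input kwh ac dryer;
  cards;
35 1.5 1
63 4.5 2
66 5.0 2
17 2.0 0
94 8.5 3
79 6.0 3
93 13.5 1
66 8.0 1
94 12.5 1
82 7.5 2
78 6.5 3
65 8.0 1
77 7.5 2
75 8.0 2
62 7.5 1
85 12.0 1
43 6.0 0
57 2.5 3
33 5.0 0
65 7.5 1
33 6.0 0
. 10 0
. 10 1
. 10 2
;

proc print data=kilowatt;
run;

/* Correlations plus sums of squares and cross products */
proc corr sscp data=kilowatt;
run;

proc sort data=kilowatt;
  by ac;
run;

/* Grid of AC values with missing kwh, for plotting a fitted line */
data acplot;
  do ac = 0 to 15 by .5;
    kwh = .;
    output;
  end;
run;

/* Grid of DRYER values with missing kwh, for plotting a fitted line */
data dryplot;
  do dryer = 0 to 3 by .1;
    kwh = .;
    output;
  end;
run;

proc print data=dryplot;
run;

/* MERGE with BY requires both data sets sorted by dryer;           */
/* kilowatt was last sorted by ac, so it is re-sorted here.         */
proc sort data=kilowatt;
  by dryer;
run;

data dryplot;
  merge dryplot kilowatt;
  by dryer;
run;

/* Simple regression of kwh on dryer; save predictions and residuals */
proc reg data=dryplot;
  id ac;
  model kwh = dryer / p;
  plot kwh*dryer;
  output out=dryplot1 p=drypred r=dryresid1;
run;

/* Residual plots */
proc gplot data=dryplot1;
  plot dryresid1*dryer;
  plot dryresid1*drypred;
run;

/* Same model with confidence limits for the mean (clm) and          */
/* prediction limits for individual values (cli)                     */
proc reg data=dryplot;
  id ac;
  model kwh = dryer / p clm cli;
  plot kwh*dryer;
  output out=dryplot2 p=drypred lclm=lcl uclm=ucl lcl=lpl ucl=upl;
run;

proc print data=dryplot2;
run;

/* Overlay data, fitted line, and confidence or prediction limits */
proc gplot data=dryplot2;
  plot kwh*dryer drypred*dryer lcl*dryer ucl*dryer / overlay;
  plot kwh*dryer drypred*dryer lpl*dryer upl*dryer / overlay;
run;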