Multiple Linear Regression and Its Application in Predicting Student's Test Score, Study notes of Data Analysis & Statistical Methods

An in-depth analysis of multiple linear regression, its equation, assumptions, and application in predicting a stat 301 student's test 2 score based on their scores for test 1, lab grades, homework grades, attendance, and whether they handed in the review for exam 2. It includes the calculation of correlations, regression output, and the fitted regression equation. The document also discusses the normality and constant variance assumptions, and the impact of removing certain explanatory variables on the model.

Typology: Study notes

Pre 2010

Uploaded on 07/30/2009

koofers-user-pfq-1
CHAPTER 11: Multiple Regression

With multiple linear regression, more than one explanatory variable is used to explain or predict a single response variable. Introducing several explanatory variables leads to additional considerations. We will not be able to address all these issues, but will outline some basic facts about multiple regression.

Equation of the Multiple Regression Model:

We have data on several explanatory variables $x_1, x_2, x_3, \ldots, x_p$ (where $p$ is the number of explanatory variables in the model) and a response variable $y$.

The regression model for the population is:

$y_i = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p + \varepsilon_i$

The sample prediction equation is:

$\hat{y} = b_0 + b_1 x_1 + b_2 x_2 + \cdots + b_p x_p$

The $i$th residual is:

$e_i = y_i - \hat{y}_i$ = observed response − predicted response

The estimate for the variability of the response about the regression equation is:

$s^2 = \dfrac{\sum e_i^2}{n - p - 1}$

(The degrees of freedom associated with $s^2$ are $n - p - 1$.)
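The quantities defined above can be computed directly. The following is a minimal sketch, assuming NumPy is available; the function name `fit_ols` and the simulated data are illustrative only and are not part of the original notes.

```python
import numpy as np

def fit_ols(X, y):
    """Least-squares fit of y on the columns of X plus an intercept.

    Returns the coefficients (b0, b1, ..., bp), the residuals e_i,
    and s^2 = sum(e_i^2) / (n - p - 1).
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n, p = X.shape
    design = np.column_stack([np.ones(n), X])  # leading column of 1s gives b0
    b, *_ = np.linalg.lstsq(design, y, rcond=None)
    y_hat = design @ b                         # predicted responses
    residuals = y - y_hat                      # observed minus predicted
    s2 = np.sum(residuals ** 2) / (n - p - 1)
    return b, residuals, s2

# Simulated data (an assumption for illustration): true model y = 1 + 2*x1 - 0.5*x2 + noise
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 2))
y = 1.0 + 2.0 * X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=0.3, size=30)

b, residuals, s2 = fit_ols(X, y)
print("b0, b1, b2:", np.round(b, 3))
print("s:", round(float(np.sqrt(s2)), 3))
```

Because the model includes an intercept, the residuals sum to zero up to rounding error, which is a quick sanity check on any fit.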


As with all models, multiple regression relies on assumptions that need to be met.

Multiple Regression Assumptions:

  1. LINEARITY: The regression equation must be of the right form to describe the true underlying relationship among the variables. (To check for linearity, make a scatterplot of y against each predictor variable.)
  2. CONSTANT VARIANCE: The variability of the residuals must be the same for all values of the x variables. (To check for constant variance, make scatterplots of the residuals against the predicted values.)
  3. INDEPENDENCE: The residual at one data value must be independent of the residuals at any other data values.
  4. NORMALITY: The distribution of the residuals must be Normal for the t-tests on the coefficients to follow Student's t-distribution exactly. (To check the normality assumption, make a probability plot of the residuals.)

Collinearity: A multiple regression has a collinearity problem when any of the predictors has a strong linear relationship with any of the other predictors. The standard error of the coefficient of any predictor that is collinear with the others is inflated, leading to a smaller t-statistic and a correspondingly larger (less significant) P-value. One clue that collinearity might be a problem is a regression with a large overall $R^2$ but small t-ratios for the coefficients.

Detecting collinearity: Regress one predictor on the others. If $R^2$ is high for any of these regressions, that predictor is collinear with the others.
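The collinearity check described above (regress each predictor on the remaining ones and look at $R^2$) can be sketched as follows. This is an illustration under stated assumptions: NumPy is available, the helper names `r_squared` and `auxiliary_r2` are hypothetical, and the data are simulated so that the first two predictors are nearly identical.

```python
import numpy as np

def r_squared(X, y):
    """R^2 from a least-squares regression of y on X (with intercept)."""
    n = len(y)
    design = np.column_stack([np.ones(n), X])
    b, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ b
    total = np.sum((y - y.mean()) ** 2)
    return 1.0 - np.sum(resid ** 2) / total

def auxiliary_r2(X):
    """For each predictor, the R^2 from regressing it on all the others."""
    X = np.asarray(X, dtype=float)
    results = []
    for j in range(X.shape[1]):
        others = np.delete(X, j, axis=1)  # every column except column j
        results.append(r_squared(others, X[:, j]))
    return results

# Simulated predictors (assumption): x2 is almost a copy of x1, x3 is unrelated
rng = np.random.default_rng(1)
x1 = rng.normal(size=200)
x2 = x1 + rng.normal(scale=0.1, size=200)
x3 = rng.normal(size=200)

aux = auxiliary_r2(np.column_stack([x1, x2, x3]))
for name, value in zip(["x1", "x2", "x3"], aux):
    print(name, round(value, 3))
```

A high value for a predictor flags it as collinear with the rest; what counts as "high" is a judgment call that the notes do not pin down.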

Any variables left in the equation should ideally have a significant P-value for the individual t-test of the coefficient. Furthermore, the confidence intervals for these coefficients should not contain 0.

Confidence Intervals for individual $\beta_j$:

$b_j \pm t^{*} SE_{b_j}$ (SPSS gives you a 95% CI)
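As a worked illustration of the interval formula, the sketch below uses the Lab_t slope from the four-predictor model in Example 1 of these notes ($b = 1.745$, $SE = .561$, with $n - p - 1 = 90 - 4 - 1 = 85$ degrees of freedom). The critical value $t^{*} \approx 1.988$ is an assumption taken from a t table for 85 degrees of freedom, and the function name `coefficient_ci` is hypothetical.

```python
def coefficient_ci(b, se, t_star):
    """Confidence interval b +/- t* x SE for a regression coefficient."""
    margin = t_star * se
    return b - margin, b + margin

# Assumed critical value: approximate 97.5th percentile of t with 85 df
T_STAR_95_DF85 = 1.988

# Lab_t slope and standard error as printed in the Example 1 coefficient table
lower, upper = coefficient_ci(1.745, 0.561, T_STAR_95_DF85)
print(f"95% CI for the Lab_t slope: ({lower:.3f}, {upper:.3f})")
```

The lower limit agrees with the .629 shown on the printout up to rounding of the inputs, and the interval does not contain 0, which matches the significant t-test for that coefficient.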

Significance Test for $\beta_j$:

(Format)
• State the null and alternative hypotheses:
  $H_0: \beta_j = 0$
  $H_a: \beta_j > 0$, $H_a: \beta_j < 0$, or $H_a: \beta_j \neq 0$
• Find the test statistic on the printout or by using the formula: $t = \dfrac{b_j}{SE_{b_j}}$
• Find the P-value from the printout.
• Compare the P-value to the $\alpha$ level:
  If P-value $\le \alpha$, then reject $H_0$.
  If P-value $> \alpha$, then fail to reject $H_0$.
• State your conclusions in terms of the problem.

Once you have carefully looked at the individual t-tests for the explanatory variables in your model, if any of them have P-values higher than your chosen alpha level, you should remove the one with the highest P-value from your model and examine what happens to your $R^2$ and your $s$.

Checking for bias in the model:

Look at the overall $R^2$ (the squared multiple correlation) for the model. $R^2$ is the proportion of the variation of the response variable $y$ that is explained by the explanatory variables $x_1, x_2, \ldots, x_p$ in a multiple linear regression. Basically, $R^2$ should not decrease too rapidly when we drop a variable, or else that variable should be added back into the model.

Checking for variability in the model: Look at the standard deviation, $s$, for the model. Recall that

$s^2 = \dfrac{\sum e_i^2}{n - p - 1}$

so $s$ is the square root of this quantity; it can be found in the regression output. The value of $s$ should not increase too rapidly when we drop a variable, or else that variable should be added back into the model.

This process should be continued until all insignificant variables have been removed from the model. Only one variable, however, should be removed from the model at a time. Sometimes we choose to keep insignificant variables; however, we should examine the model with and without that variable before we make our decision.

Note: Individual regression coefficients, their standard errors, and significance tests are meaningful only when interpreted in the context of the other explanatory variables in the model.

Another test that is useful is the F-test. It is an overall test that will tell you whether you want to proceed. If you fail to reject the null in the F-test, then none of the explanatory variables in your

Example 1: We will now look at an example in SPSS! Our goal today will be to find the best model to predict a STAT 301 student's test 2 score based on their scores for test 1, lab grades, homework grades, attendance, and whether or not they handed in the review for exam 2. The grades are taken from three of Joan Brenneman’s STAT 301 classes last semester.

  1. For each of the variables in the data set, find the mean, median, and standard deviation. Display each distribution with a histogram.
     SPSS: To get the descriptive statistics: >Analyze >Descriptive statistics >Explore. Pull all variables into the “Dependent List” box. Click OK. Do the histograms individually for each variable.

Descriptives (SPSS Explore output, partial)

Variable     Mean     Minimum  Maximum  Range  Interquartile Range  Skewness  Kurtosis
attendance   17.08    5        21       16     6                    -1.136    .390
Lab_t        16.412                                                 -1.303    2.124
hw_t         14.375                                                 -1.780    3.095
review       .68      0        1        1      1                    -.774     -1.434
Test 1       77.01    33       100      67     24                   -.851     -.004
Test 2       74.26    26       97       71     19                   -.972     .632
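For readers working outside SPSS, the three summary statistics requested in step 1 can be computed with Python's standard library. This is only a sketch: the helper name `describe` and the six scores are hypothetical and are not taken from the class data.

```python
import statistics

def describe(values):
    """Mean, median, and sample standard deviation (n - 1 divisor) of a list."""
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "sd": statistics.stdev(values),
    }

# Hypothetical test scores, for illustration only
scores = [62, 75, 88, 91, 70, 84]
summary = describe(scores)
for name, value in summary.items():
    print(f"{name}: {value:.3f}")
```

`statistics.stdev` uses the $n - 1$ divisor, which is the sample standard deviation that SPSS reports as "Std. Deviation".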

Correlations (Pearson correlation; N = 90 for every pair)

             Lab_t    hw_t     attendance  review   Test 1   Test 2
Lab_t        1        .711**   .686**      .405**   .615**   .719**
hw_t         .711**   1        .685**      .406**   .493**   .609**
attendance   .686**   .685**   1           .558**   .351**   .490**
review       .405**   .406**   .558**      1        .173     .369**
Test 1       .615**   .493**   .351**      .173     1        .706**
Test 2       .719**   .609**   .490**      .369**   .706**   1

**. Correlation is significant at the 0.01 level (2-tailed).
Sig. (2-tailed): .000 for every pair except attendance with Test 1 (.001) and review with Test 1 (.103).

Which of the explanatory variables above are significantly correlated with the response variable? Do you see any potential collinearity problems?
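The significance values in the correlation output come from a t-test on each correlation, with statistic $t = r\sqrt{n-2}/\sqrt{1-r^2}$ on $n - 2$ degrees of freedom. The sketch below applies that standard formula to two of the printed correlations; the function name `correlation_t` is hypothetical, and the comparison value of about 1.987 (two-sided 5% critical value for 88 degrees of freedom) is an assumption taken from a t table.

```python
import math

def correlation_t(r, n):
    """t statistic for testing that a Pearson correlation is zero (df = n - 2)."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)

# Correlations as printed in the SPSS output, with N = 90
t_review_test1 = correlation_t(0.173, 90)  # review with Test 1
t_lab_test2 = correlation_t(0.719, 90)     # Lab_t with Test 2

T_CRIT_DF88 = 1.987  # assumed two-sided 5% critical value for 88 df
print(f"review vs Test 1: t = {t_review_test1:.2f}, significant: {abs(t_review_test1) > T_CRIT_DF88}")
print(f"Lab_t vs Test 2:  t = {t_lab_test2:.2f}, significant: {abs(t_lab_test2) > T_CRIT_DF88}")
```

The review and Test 1 correlation falls short of the critical value, consistent with its printed Sig. of .103, while the Lab_t and Test 2 correlation is far beyond it.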

  1. Make a chart showing each model, its corresponding $R^2$ and $s$, and which of the variables are non-significant in the model, listed from most to least non-significant.

     Model    R^2    s    Non-significant variables

  2. Perform a multiple regression using all the explanatory variables and answer the questions on the next page based on the output.
     SPSS: To get the multiple regression output below: >Analyze >Regression >Linear. Move “Test 2” to the “Dependent” box. Move the remaining variables to the “Independent” box. Select “Statistics” and click on “Confidence Intervals”. Then click “Continue” followed by “OK”.

Model Summary
Model  R       R Square  Adjusted R Square  Std. Error of the Estimate
1      .808^a  .653      .632               9.
a. Predictors: (Constant), Test 1, review, hw_t, attendance, Lab_t

d. What is the value of s, the estimator for the standard deviation?
e. What percent of the variability in test 2 is explained by this regression line?

  1. One variable looks like a good candidate to be dropped. Which is it? Try running the regression again without this variable.

Model Summary
Model  R       R Square  Adjusted R Square  Std. Error of the Estimate
1      .807^a  .651      .634               9.
a. Predictors: (Constant), review, Test 1, hw_t, Lab_t

ANOVA^b
Model 1      Sum of Squares  df  Mean Square  F       Sig.
Regression   14753.108       4   3688.277     39.574  .000^a
Residual     7922.014        85  93.
Total        22675.122       89
a. Predictors: (Constant), review, Test 1, hw_t, Lab_t
b. Dependent Variable: Test 2

Coefficients^a
Model 1      B      Std. Error  Beta  t      Sig.  95% CI Lower Bound  95% CI Upper Bound
(Constant)   5.463  6.065             .901   .370  -6.596              17.
Lab_t        1.745  .561        .322  3.110  .003  .629                2.
Test 1       .400   .077        .428  5.193  .000  .247                .
hw_t         .465   .354        .123  1.312  .193  -.239               1.
review       3.905  2.442       .115  1.599  .114  -.950               8.
(B and Std. Error are the unstandardized coefficients; Beta is the standardized coefficient.)
a. Dependent Variable: Test 2

a. Give the fitted regression equation.
b. Which of the explanatory variables have significant slopes at the 5% level of significance?
c. What is the value of s, the estimator for the standard deviation?
d. What percent of the variability in test 2 is explained by this regression line?
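The summary quantities on the printout are all tied to the ANOVA table: $R^2$ is the regression sum of squares over the total, $s$ is the square root of the residual mean square, and $F$ is the ratio of the two mean squares. The sketch below recomputes them from the sums of squares printed for the four-predictor model; the function name `summarize_anova` is hypothetical, and the adjusted $R^2$ line uses the standard formula $1 - (1 - R^2)(n-1)/(n-p-1)$, which the notes themselves do not state.

```python
import math

def summarize_anova(ss_regression, ss_residual, n, p):
    """Recover R^2, adjusted R^2, s, and F from ANOVA sums of squares."""
    ss_total = ss_regression + ss_residual
    df_residual = n - p - 1
    ms_regression = ss_regression / p
    ms_residual = ss_residual / df_residual
    r2 = ss_regression / ss_total
    return {
        "r2": r2,
        "adj_r2": 1 - (1 - r2) * (n - 1) / df_residual,
        "s": math.sqrt(ms_residual),
        "F": ms_regression / ms_residual,
    }

# Sums of squares as printed for the model with review, Test 1, hw_t, Lab_t
fit = summarize_anova(14753.108, 7922.014, n=90, p=4)
for name, value in fit.items():
    print(f"{name}: {value:.3f}")
```

The recomputed $R^2$, adjusted $R^2$, and $F$ can be compared against the Model Summary and ANOVA tables as a check that the printout is being read correctly.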

  1. Now let's see what happens when we remove hw_t.

Model Summary
Model  R       R Square  Adjusted R Square  Std. Error of the Estimate
1      .802^a  .644      .631               9.
a. Predictors: (Constant), Lab_t, review, Test 1

ANOVA^b
Model 1      Sum of Squares  df  Mean Square  F       Sig.
Regression   14592.579       3   4864.193     51.756  .000^a
Residual     8082.543        86  93.
Total        22675.122       89
a. Predictors: (Constant), Lab_t, review, Test 1
b. Dependent Variable: Test 2

Coefficients^a
Model 1      B      Std. Error  Beta  t      Sig.  95% CI Lower Bound  95% CI Upper Bound
(Constant)   4.395  6.035             .728   .468  -7.603              16.
Test 1       .413   .077        .441  5.374  .000  .260                .
review       4.533  2.405       .133  1.885  .063  -.248               9.
Lab_t        2.132  .479        .394  4.451  .000  1.180               3.
(B and Std. Error are the unstandardized coefficients; Beta is the standardized coefficient.)
a. Dependent Variable: Test 2

a. Give the fitted regression equation.
b. What is the value of s the estimator of standard

Coefficients^a
Model 1      B      Std. Error  Beta  t      Sig.  95% CI Lower Bound  95% CI Upper Bound
(Constant)   2.947  6.073             .485   .629  -9.125              15.
Lab_t        2.479  .449        .458  5.526  .000  1.587               3.
Test 1       .398   .078        .425  5.130  .000  .244                .
(B and Std. Error are the unstandardized coefficients; Beta is the standardized coefficient.)
a. Dependent Variable: Test 2

a. Has $R^2$ changed drastically with the review variable removed?
b. What is the value of s, the estimator of the standard deviation?
c. Which model do you think is best and why?
d. What are some additional things we can look at when deciding which model would be best?

First let’s look at the normal probability plots for two of our potential models. Does the normality assumption appear to be met for our best model based on this normal probability plot?

Model includes: Test 1, Lab_t, and review.

[Figure: Normal P-P plot of the regression residuals. Dependent variable: Test 2.]
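To make clear what a normal P-P plot displays, the sketch below computes its coordinates for a small set of residuals using only the standard library. It is an illustration under stated assumptions: the function name `pp_plot_points` and the residual values are hypothetical, and the plotting position $(i - 0.5)/n$ is one common convention that may differ from the formula SPSS uses.

```python
from statistics import NormalDist, mean, stdev

def pp_plot_points(residuals):
    """Return (observed, expected) cumulative probabilities for a normal P-P plot."""
    m, sd = mean(residuals), stdev(residuals)
    z = sorted((e - m) / sd for e in residuals)       # standardized, in increasing order
    n = len(z)
    observed = [(i + 0.5) / n for i in range(n)]      # empirical cumulative probability
    expected = [NormalDist().cdf(v) for v in z]       # cumulative probability under Normal
    return list(zip(observed, expected))

# Hypothetical residuals, for illustration only
residuals = [-12.0, -5.5, -1.0, 0.5, 2.0, 4.5, 11.5]
points = pp_plot_points(residuals)
for obs, exp in points:
    print(f"{obs:.3f}  {exp:.3f}")
```

If the residuals are roughly Normal, the points lie close to the diagonal line from (0, 0) to (1, 1); systematic bowing away from the diagonal suggests skewness or heavy tails.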

Now let’s look at some of the residual plots:

[Figure: Residual plot for Test 1 (unstandardized residual vs. Test 1)]

[Figure: Residual plot for Lab_t (unstandardized residual vs. Lab_t)]

[Figure: Residual plot for review (unstandardized residual vs. review)]

Does the constant variance assumption appear to be met based on the residual plots above?

Example 2: In-class group activity (Problems 11.45 to 11.51 in Moore and McCabe, fifth edition).

As cheddar cheese matures, a variety of chemical processes take place. The taste of mature cheese is related to the concentration of several chemicals in the final product. In a study of cheddar cheese from the La Trobe Valley of Victoria, Australia, samples of cheese were analyzed for their chemical composition and were subjected to taste tests. Data for one type of cheese-manufacturing process appear below. The variable “Case” is used to number the observations.

  1. Describe the scatterplots below for each pair of variables in the cheese data set. Describe the relationships for each. Examine the correlation for each pair of variables and report which ones have significant correlations.
  2. Fill in the chart below based on the output.

     Model    R^2    s    Non-significant variables

  3. Which model is the best for predicting taste based on the three explanatory variables “Acetic”, “H2S” and “Lactic”?
  4. Does the model you chose meet the assumptions of multiple regression?