















An in-depth analysis of multiple linear regression, its equation, assumptions, and application in predicting a stat 301 student's test 2 score based on their scores for test 1, lab grades, homework grades, attendance, and whether they handed in the review for exam 2. It includes the calculation of correlations, regression output, and the fitted regression equation. The document also discusses the normality and constant variance assumptions, and the impact of removing certain explanatory variables on the model.
Typology: Study notes
















CHAPTER 11: Multiple Regression

With multiple linear regression, more than one explanatory variable is used to explain or predict a single response variable. Introducing several explanatory variables leads to additional considerations. We will not be able to address all these issues, but will outline some basic facts about multiple regression.

Equation of the Multiple Regression Model:

We have data on several explanatory variables $x_1, x_2, x_3, \ldots, x_p$ (where $p$ is the number of explanatory variables in the model) and a response variable $y$. The regression model for the population is:

$$\mu_y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p$$

The sample prediction equation is:

$$\hat{y} = b_0 + b_1 x_1 + b_2 x_2 + \cdots + b_p x_p$$

The $i$th residual is:

$$e_i = y_i - \hat{y}_i = \text{observed response} - \text{predicted response}$$

The estimate for the variability of the response about the regression equation is:

$$s^2 = \frac{\sum e_i^2}{n - p - 1}$$

(The degrees of freedom associated with $s^2$ are $n - p - 1$.)
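The prediction equation, residuals, and $s$ defined above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `fit_multiple_regression` and the small data set are made up for the example, and the coefficients come from solving the least-squares normal equations directly, which is not necessarily how SPSS computes them internally.

```python
def fit_multiple_regression(X, y):
    """Least-squares fit of y-hat = b0 + b1*x1 + ... + bp*xp.

    X is a list of n rows with p explanatory values each; y holds n responses.
    Returns (b, residuals, s), where s**2 = sum(e_i**2) / (n - p - 1).
    """
    n, p = len(X), len(X[0])
    if n <= p + 1:
        raise ValueError("need more than p + 1 observations")
    k = p + 1
    # Design matrix with a leading column of 1s for the intercept b0.
    A = [[1.0] + [float(v) for v in row] for row in X]
    # Normal equations (A^T A) b = A^T y, stored as an augmented matrix.
    M = [[sum(A[i][r] * A[i][c] for i in range(n)) for c in range(k)]
         + [sum(A[i][r] * y[i] for i in range(n))]
         for r in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("explanatory variables are perfectly collinear")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(k):
            if r != col:
                factor = M[r][col] / M[col][col]
                M[r] = [a - factor * c for a, c in zip(M[r], M[col])]
    b = [M[r][k] / M[r][r] for r in range(k)]
    # Residual = observed response - predicted response.
    fitted = [sum(bj * aj for bj, aj in zip(b, row)) for row in A]
    residuals = [yi - fi for yi, fi in zip(y, fitted)]
    s = (sum(e * e for e in residuals) / (n - p - 1)) ** 0.5
    return b, residuals, s


# Hypothetical data: n = 5 observations, p = 2 explanatory variables.
X = [[1, 2], [2, 1], [3, 4], [4, 3], [5, 5]]
y = [9.2, 7.9, 19.1, 17.8, 26.0]
b, residuals, s = fit_multiple_regression(X, y)
print("coefficients b0, b1, b2:", [round(v, 3) for v in b])
print("s on n - p - 1 =", len(y) - 2 - 1, "degrees of freedom:", round(s, 3))
```

Because the model includes an intercept, the residuals from a fit like this sum to zero (up to rounding), which is a quick sanity check on any regression output.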
As with all models, there are some assumptions that need to be met with multiple regression. Multiple Regression Assumptions:
Any variables left in the equation should ideally have a significant P-value for the individual t-test of the coefficient. Furthermore, the confidence intervals for these coefficients should not contain 0. The confidence interval for a coefficient is:

$$b_j \pm t^* \, SE_{b_j}$$

(SPSS gives you a 95% CI.)
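As a rough numerical check of the interval $b_j \pm t^* SE_{b_j}$, the sketch below uses the Lab_t row of the Coefficients output that appears later in these notes ($b = 2.479$, $SE = .449$). The function name `coefficient_t_and_ci` is hypothetical, and the critical value $t^* = 1.988$ is an assumption: an approximate table value for 95% confidence with $n - p - 1 = 87$ degrees of freedom ($n = 90$, $p = 2$).

```python
def coefficient_t_and_ci(b_j, se_bj, t_star):
    """Return the t statistic b_j / SE and the interval b_j +/- t* x SE."""
    t_stat = b_j / se_bj
    margin = t_star * se_bj
    return t_stat, (b_j - margin, b_j + margin)


# Lab_t row of the Coefficients output; t_star is an assumed table value.
t_stat, (lower, upper) = coefficient_t_and_ci(2.479, 0.449, 1.988)
print("t statistic:", round(t_stat, 2))
print("approximate 95% CI:", round(lower, 2), "to", round(upper, 2))
```

Small differences from the SPSS printout are expected, because the printout rounds $b_j$ and $SE_{b_j}$ before displaying them.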
(Format)

1. State the null and alternative hypotheses.
2. Find the test statistic on the printout or by using the formula:

$$t = \frac{b_j}{SE_{b_j}}$$

3. Find the P-value from the printout.
4. Compare the P-value to the $\alpha$ level. If P-value $\le \alpha$, then reject $H_0$. If P-value $> \alpha$, then fail to reject $H_0$.
5. State your conclusions in terms of the problem.

Once you have carefully looked at the individual t-tests for the explanatory variables in your model, if any of them have P-values higher than your chosen alpha level, you should remove the one with the highest P-value from your model and examine what happens to your $R^2$ and your $s$.

Checking for bias in the model:
Look at the overall $R^2$ (the squared multiple correlation) for the model. $R^2$ is the proportion of the variation of the response variable $y$ that is explained by the explanatory variables $x_1, x_2, \ldots, x_p$ in a multiple linear regression. Basically, $R^2$ should not decrease too rapidly when we drop a variable, or else that variable should be added back into the model.

Checking for variability in the model:

Look at the standard deviation, $s$, for the model. Recall that

$$s^2 = \frac{\sum e_i^2}{n - p - 1}$$

and that $s$ can be found in the regression output. The value of $s$ should not increase too rapidly when we drop a variable, or else that variable should be added back into the model.

This process should be continued until all insignificant variables have been removed from the model. Only one variable, however, should be removed from the model at a time. Sometimes we choose to keep insignificant variables; however, we should examine the model with and without that variable before we make our decision.

Note: Individual regression coefficients, their standard errors, and significance tests are meaningful only when interpreted in the context of the other explanatory variables in the model.

Another test that is useful is the F-test. It is an overall test that will tell you whether you want to proceed. If you fail to reject the null in the F-test, then none of the explanatory variables in your
Example 1: We will now look at an example in SPSS! Our goal today will be to find the best model to predict a STAT 301 student's test 2 score based on their scores for test 1, lab grades, homework grades, attendance, and whether or not they handed in the review for exam 2. The grades are taken from three of Joan Brenneman’s STAT 301 classes last semester.
[SPSS output: Descriptives. Statistic and Std. Error columns for attendance, Lab_t, hw_t, review, Test 1 and Test 2, with rows Mean, 95% Confidence Interval for Mean (Lower Bound, Upper Bound), 5% Trimmed Mean, Median, Variance, Std. Deviation, Minimum, Maximum, Range, Interquartile Range, Skewness and Kurtosis.]
Correlations (Pearson Correlation; N = 90 for every pair)

             Lab_t    hw_t     attendance  review   Test 1   Test 2
Lab_t        1        .711**   .686**      .405**   .615**   .719**
hw_t         .711**   1        .685**      .406**   .493**   .609**
attendance   .686**   .685**   1           .558**   .351**   .490**
review       .405**   .406**   .558**      1        .173     .369**
Test 1       .615**   .493**   .351**      .173     1        .706**
Test 2       .719**   .609**   .490**      .369**   .706**   1

**. Correlation is significant at the 0.01 level (2-tailed).
Sig. (2-tailed) is .103 for review with Test 1 and .001 for attendance with Test 1; the remaining values shown in the output are all .000.

Which of the explanatory variables above are significantly correlated with the response variable? Do you see any potential collinearity problems?
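One way to screen for potential collinearity is to list the pairs of explanatory variables that are strongly correlated with each other. The sketch below uses the correlations from the output above; the function name `flag_collinear_pairs` and the cutoff of 0.65 are assumptions chosen only for illustration, since these notes do not set a fixed threshold.

```python
names = ["Lab_t", "hw_t", "attendance", "review", "Test 1", "Test 2"]
corr = [
    [1.0, .711, .686, .405, .615, .719],
    [.711, 1.0, .685, .406, .493, .609],
    [.686, .685, 1.0, .558, .351, .490],
    [.405, .406, .558, 1.0, .173, .369],
    [.615, .493, .351, .173, 1.0, .706],
    [.719, .609, .490, .369, .706, 1.0],
]


def flag_collinear_pairs(names, corr, response, cutoff):
    """Return (name_a, name_b, r) for explanatory pairs with |r| >= cutoff."""
    explanatory = [i for i, name in enumerate(names) if name != response]
    flagged = []
    for pos, i in enumerate(explanatory):
        for j in explanatory[pos + 1:]:
            if abs(corr[i][j]) >= cutoff:
                flagged.append((names[i], names[j], corr[i][j]))
    return flagged


# The 0.65 cutoff is an arbitrary, illustrative choice.
for a, b, r in flag_collinear_pairs(names, corr, response="Test 2", cutoff=0.65):
    print(a, "and", b, "have r =", r)
```

A flagged pair is only a prompt to look more closely: whether it causes trouble shows up in how the coefficients, $R^2$, and $s$ change when one of the two variables is dropped.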
d. What is the value of $s$, the estimator for the standard deviation?
e. What is the percent of the variability in test 2 that is explained by this regression line?

b. Which of the explanatory variables have significant slopes at the 5% level of significance?
c. What is the value of $s$, the estimator for the standard deviation?
d. What is the percent of the variability in test 2 that is explained by this regression line?
Coefficients (a)

Model 1       B        Std. Error   Beta    t       Sig.    Lower Bound   Upper Bound
(Constant)    2.947    6.073                .485    .629    -9.125        15.
Lab_t         2.479    .449         .458    5.526   .000    1.587         3.
Test 1        .398     .078         .425    5.130   .000    .244          .

B and Std. Error are the unstandardized coefficients, Beta is the standardized coefficient, and the bounds are the 95% Confidence Interval for B.
a. Dependent Variable: Test 2

a. Has $R^2$ changed drastically with the review variable removed?
b. What is the value of $s$, the estimator of the standard deviation?
c. Which model do you think is best and why?
d. What are some additional things we can look at when deciding which model would be best?

First let’s look at the normal probability plots for two of our potential models:

Does the normality assumption appear to be met for our best model based on this normal probability plot? Model includes: test 1, lab_t and review.

[Figure: Normal P-P Plot of Regressi… Dependent Variable: Test 2. Axes: Observed Cu… and Expected Cu…]
Now let’s look at some of the residual plots:

[Figure: Residual plot for test_ (Unstandardized Residual against Test 1)]
[Figure: Residual plot for lab_t (Unstandardized Residual against Lab_t)]
[Figure: Residual plot for review (Unstandardized Residual against review)]

Does the constant variance assumption appear to be met based on the residual plots above?

Example 2: In-class group activity (Problems 11.45 to 11.51 in Moore and McCabe, fifth edition)

As cheddar cheese matures, a variety of chemical processes take place. The taste of mature cheese is related to the concentration of several chemicals in the final product. In a study of cheddar cheese from the La Trobe Valley of Victoria, Australia, samples of cheese were analyzed for their chemical composition and were subjected to taste tests. Data for one type of cheese-manufacturing process appear below. The variable “Case” is used to number the observations.
1. Describe the scatterplots below for each pair of variables in the cheese data set. Describe the relationships for each. Examine the correlation for each pair of variables and report which ones have significant correlations.
2. Fill in the chart below based on the output.

Model    $R^2$    $s$    Non-significant variables

3. Which model is the best for predicting taste based on the three explanatory variables “Acetic”, “H2S” and “Lactic”?
4. Does the model you chose meet the assumptions of multiple regression?