Multiple Linear Regression and Its Application in Predicting Student's Test Score, Study notes of Data Analysis & Statistical Methods

Purdue University Data Analysis & Statistical Methods

An in-depth analysis of multiple linear regression, its equation, assumptions, and application in predicting a stat 301 student's test 2 score based on their scores for test 1, lab grades, homework grades, attendance, and whether they handed in the review for exam 2. It includes the calculation of correlations, regression output, and the fitted regression equation. The document also discusses the normality and constant variance assumptions, and the impact of removing certain explanatory variables on the model.

Typology: Study notes

Pre 2010

Uploaded on 07/30/2009

koofers-user-pfq-1 🇺🇸


CHAPTER 11: Multiple Regression

With multiple linear regression, more than one explanatory variable is used to explain or predict a single response variable. Introducing several explanatory variables leads to additional considerations. We will not be able to address all these issues, but will outline some basic facts about multiple regression.

Equation of the Multiple Regression Model:

We have data on several explanatory variables $x_1, x_2, x_3, \ldots, x_p$ (where $p$ is the number of explanatory variables in the model) and a response variable $y$.

The regression model for the population is:

$$y_i = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p + \varepsilon_i$$

The sample prediction equation is:

$$\hat{y} = b_0 + b_1 x_1 + b_2 x_2 + \cdots + b_p x_p$$

The $i$th residual is:

$$e_i = y_i - \hat{y}_i$$

= observed response – predicted response

The estimate for the variability of the response about the regression equation is:

$$s^2 = \frac{\sum e_i^2}{n - p - 1}$$

(The degrees of freedom associated with $s^2$ are $n - p - 1$.)
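The definitions above (prediction equation, residuals, and $s$) can be sketched numerically. The following is a minimal illustration, assuming NumPy is available and using synthetic data generated only for this example; the variable names and the "true" coefficients 3, 2, and -1 are hypothetical and are not taken from the class data set.

```python
import numpy as np

# Synthetic data (an assumption for illustration, not the class data):
# n = 50 observations, p = 2 explanatory variables.
rng = np.random.default_rng(0)
n, p = 50, 2
x = rng.normal(size=(n, p))
y = 3.0 + 2.0 * x[:, 0] - 1.0 * x[:, 1] + rng.normal(scale=0.5, size=n)

# Design matrix with a leading column of ones for the intercept b0.
X = np.column_stack([np.ones(n), x])

# Least-squares estimates b0, b1, ..., bp.
b, *_ = np.linalg.lstsq(X, y, rcond=None)

y_hat = X @ b               # predicted responses
e = y - y_hat               # residuals: observed - predicted
df = n - p - 1              # degrees of freedom associated with s^2
s2 = np.sum(e ** 2) / df    # estimate of the variance about the regression
s = np.sqrt(s2)

print("coefficients:", b)
print("s =", s, "on", df, "degrees of freedom")
```

Because the model includes an intercept, the residuals sum to zero up to rounding error, which is a quick sanity check on any fit.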


As with all models, there are some assumptions that need to be met with multiple regression.

Multiple Regression Assumptions:

LINEARITY: The regression equation must be of the right form to describe the true underlying relationship among the variables. (To check for linearity, make a scatterplot of $y$ against each predictor variable.)
CONSTANT VARIANCE: The variability of the residuals must be the same for all values of the $x$ variables. (To check for constant variance, make scatterplots of the residuals against the predicted values.)
INDEPENDENCE: The residual at one data value must be independent of the residuals at any other data values.
NORMALITY: The distribution of the residuals must be Normal for the t-test on the coefficients to follow Student's t-distribution exactly. (To check the normality assumption, make a probability plot of the residuals.)

Collinearity: A multiple regression has a collinearity problem when any of the predictors has a strong linear relationship with any of the other predictors. The standard error of the coefficient of any predictor that is collinear with the others is inflated, leading to a smaller t-statistic and a correspondingly larger (less significant) p-value. One clue that collinearity might be a problem is a regression with a large overall R-square but small t-ratios for the coefficients.

Detecting collinearity: Regress each predictor on the others. If $R^2$ is high for any of these regressions, that predictor is collinear with the others.
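The collinearity check described above (regress each predictor on the others and look at $R^2$) can be sketched as follows. This is a minimal illustration assuming NumPy; the helper names `r_squared` and `auxiliary_r2` and the synthetic predictors are hypothetical, and what counts as a "high" $R^2$ is left to the analyst.

```python
import numpy as np

def r_squared(X, y):
    """R^2 from a least-squares regression of y on X (intercept added here)."""
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    return 1.0 - resid @ resid / np.sum((y - y.mean()) ** 2)

def auxiliary_r2(X):
    """For each column j, R^2 from regressing column j on the other columns."""
    return [r_squared(np.delete(X, j, axis=1), X[:, j])
            for j in range(X.shape[1])]

# Synthetic predictors (hypothetical): x2 is nearly a copy of x1,
# while x3 is generated independently of both.
rng = np.random.default_rng(1)
x1 = rng.normal(size=200)
x2 = x1 + rng.normal(scale=0.1, size=200)
x3 = rng.normal(size=200)
X = np.column_stack([x1, x2, x3])

aux = auxiliary_r2(X)
print([round(r, 3) for r in aux])
```

In this setup the first two auxiliary $R^2$ values come out close to 1 and the third close to 0, so x1 and x2 would be flagged as collinear while x3 would not.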

Any variable left in the equation ideally should have a significant P-value for the individual t-test of its coefficient. Furthermore, the confidence intervals for these coefficients should not contain 0.

Confidence Intervals for individual $\beta_j$:

$$b_j \pm t^{*} \, SE_{b_j}$$

(SPSS gives you a 95% CI.)
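As a worked instance of the interval formula, the sketch below uses the Lab_t row of the four-predictor SPSS output that appears later in these notes (B = 1.745, Std. Error = .561). The critical value $t^{*} \approx 1.988$ for 85 degrees of freedom is an assumption read from a t table, not a number given in the notes. The code also forms the ratio of the coefficient to its standard error, which is the t statistic printed on the same row.

```python
# Lab_t row of the four-predictor SPSS output in these notes.
b_j = 1.745      # estimated coefficient B
se_bj = 0.561    # its standard error

# Assumed critical value: t* for 95% confidence with
# n - p - 1 = 90 - 4 - 1 = 85 degrees of freedom, read from a t table.
t_star = 1.988

margin = t_star * se_bj
lower, upper = b_j - margin, b_j + margin
t_ratio = b_j / se_bj    # coefficient divided by its standard error

print(f"95% CI for the Lab_t slope: ({lower:.3f}, {upper:.3f})")
print(f"t = {t_ratio:.3f}")
print("interval contains 0:", lower <= 0.0 <= upper)
```

The lower limit and the ratio agree with the printed .629 and 3.110 up to small differences, which arise because the printout rounds B and the standard error to three decimals.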

Significance Test for $\beta_j$:

(Format)

• State the null and alternative hypotheses:

$$H_0: \beta_j = 0$$

$$H_a: \beta_j > 0, \quad H_a: \beta_j \neq 0, \quad \text{or} \quad H_a: \beta_j < 0$$

• Find the test statistic on the printout or by using the formula:

$$t = \frac{b_j}{SE_{b_j}}$$

• Find the P-value from the printout.
• Compare the P-value to the $\alpha$ level:
  If P-value $\le \alpha$, then reject $H_0$.
  If P-value $> \alpha$, then fail to reject $H_0$.
• State your conclusions in terms of the problem.

Once you have carefully looked at the individual t-tests for the explanatory variables in your model, if any of them have p-values higher than your chosen alpha level, you should remove the one with the highest p-value from your model and examine what happens to your $R^2$ and your $s$.

Checking for bias in the model:

Look at the overall $R^2$ (the squared multiple correlation) for the model. $R^2$ is the proportion of the variation of the response variable $y$ that is explained by the explanatory variables $x_1, x_2, \ldots, x_p$ in a multiple linear regression. Basically, $R^2$ should not decrease too rapidly when we drop a variable, or else that variable should be added back into the model.

Checking for variability in the model:

Look at the standard deviation, $s$, for the model. Recall that

$$s^2 = \frac{\sum e_i^2}{n - p - 1}$$

and that $s$ can be found in the regression output. $s$ should not increase too rapidly when we drop a variable, or else that variable should be added back into the model.

This process should be continued until all insignificant variables have been removed from the model. Only one variable, however, should be removed from the model at a time. Sometimes we choose to keep insignificant variables; however, we should examine the model with and without that variable before we make our decision.

Note: Individual regression coefficients, their standard errors, and significance tests are meaningful only when interpreted in the context of the other explanatory variables in the model.

Another test that is useful is the F-test. It is an overall test that will tell you whether you want to proceed. If you fail to reject the null in the F-test, then none of the explanatory variables in your

Example 1: We will now look at an example in SPSS! Our goal today will be to find the best model to predict a STAT 301 student's test 2 score based on their scores for test 1, lab grades, homework grades, attendance, and whether or not they handed in the review for exam 2. The grades are taken from three of Joan Brenneman's STAT 301 classes last semester.

For each of the variables in the data set, find the mean, median, and standard deviation. Display each distribution with a histogram.

SPSS: To get the descriptive statistics:
>Analyze >Descriptive statistics >Explore
Pull all variables into the "Dependent List" box. Click OK.
Do the histograms individually for each variable.

Descriptives (Statistic column of the SPSS Explore output; blank cells, and the rows for the confidence interval, trimmed mean, median, variance, standard deviation, and standard errors, were not recoverable from the source)

Variable     Mean     Minimum  Maximum  Range  Interquartile Range  Skewness  Kurtosis
attendance   17.08    5        21       16     6                    -1.136    .390
Lab_t        16.412                                                 -1.303    2.124
hw_t         14.375                                                 -1.780    3.095
review       .68      0        1        1      1                    -.774     -1.434
Test 1       77.01    33       100      67     24                   -.851     -.004
Test 2       74.26    26       97       71     19                   -.972     .632

Correlations (Pearson correlation; N = 90 for every pair)

             Lab_t    hw_t     attendance  review   Test 1   Test 2
Lab_t        1        .711**   .686**      .405**   .615**   .719**
hw_t         .711**   1        .685**      .406**   .493**   .609**
attendance   .686**   .685**   1           .558**   .351**   .490**
review       .405**   .406**   .558**      1        .173     .369**
Test 1       .615**   .493**   .351**      .173     1        .706**
Test 2       .719**   .609**   .490**      .369**   .706**   1

**. Correlation is significant at the 0.01 level (2-tailed). The only unflagged pair is review and Test 1, with Sig. (2-tailed) = .103; attendance and Test 1 has Sig. = .001, and the other Sig. values that survive in the printout are .000.

Which of the explanatory variables above are significantly correlated with the response variable? Do you see any potential collinearity problems?
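The significance flags in the correlation table can be checked by hand with the usual t statistic for a correlation, $t = r\sqrt{n-2}/\sqrt{1-r^2}$ on $n - 2$ degrees of freedom. The sketch below applies it to two entries of the table; the function name `corr_t` is hypothetical, and the two-sided critical values for 88 degrees of freedom are assumptions read from a t table rather than values given in the notes.

```python
import math

def corr_t(r, n):
    """t statistic for H0: population correlation = 0, with n - 2 df."""
    return r * math.sqrt(n - 2) / math.sqrt(1.0 - r * r)

n = 90
# Assumed two-sided critical values for 88 df, read from a t table.
T_05, T_01 = 1.987, 2.633

for label, r in [("review vs Test 1", 0.173), ("review vs Test 2", 0.369)]:
    t = corr_t(r, n)
    print(f"{label}: t = {t:.3f}, "
          f"significant at .05: {abs(t) > T_05}, at .01: {abs(t) > T_01}")
```

This is consistent with the printout: the review and Test 1 correlation is not flagged, while the review and Test 2 correlation carries the ** marker.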

Make a chart showing each model, its corresponding $R^2$ and $s$, and which of the variables are non-significant in that model, listed from most to least non-significant.

Model    $R^2$    $s$    Non-significant variables

Perform a multiple regression using all the explanatory variables and answer the questions on the next page based on the output.

SPSS: To get the multiple regression output below:
>Analyze >Regression >Linear
Move "Test 2" to the "Dependent" box.
Move the remaining variables to the "Independent" box.
Select "Statistics" and click on "Confidence Intervals".
Then click "Continue" followed by "OK".

Model Summary
Model  R      R Square  Adjusted R Square  Std. Error of the Estimate
1      .808a  .653      .632               9.
a. Predictors: (Constant), Test 1, review, hw_t, attendance, Lab_t

d. What is the value of $s$, the estimator of the standard deviation?
e. What percent of the variability in test 2 is explained by this regression line?

One variable looks like a good candidate to be dropped. Which is it? Try running the regression again without this variable.

Model Summary
Model  R      R Square  Adjusted R Square  Std. Error of the Estimate
1      .807a  .651      .634               9.
a. Predictors: (Constant), review, Test 1, hw_t, Lab_t

ANOVA(b)
Model 1     Sum of Squares  df  Mean Square  F       Sig.
Regression  14753.108       4   3688.277     39.574  .000a
Residual    7922.014        85  93.
Total       22675.122       89
a. Predictors: (Constant), review, Test 1, hw_t, Lab_t
b. Dependent Variable: Test 2

Coefficients(a)
Unstandardized Coefficients: B, Std. Error. Standardized Coefficients: Beta. 95% Confidence Interval for B: Lower Bound, Upper Bound.
Model 1     B      Std. Error  Beta   t      Sig.  Lower Bound  Upper Bound
(Constant)  5.463  6.065              .901   .370  -6.596       17.
Lab_t       1.745  .561        .322   3.110  .003  .629         2.
Test 1      .400   .077        .428   5.193  .000  .247         .
hw_t        .465   .354        .123   1.312  .193  -.239        1.
review      3.905  2.442       .115   1.599  .114  -.950        8.
a. Dependent Variable: Test 2

a. Give the fitted regression equation.

b. Which of the explanatory variables have significant slopes at the 5% level of significance?
c. What is the value of $s$, the estimator of the standard deviation?
d. What percent of the variability in test 2 is explained by this regression line?
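One way to check answers to questions c and d is to recompute the summary quantities from the ANOVA table of the four-predictor model. The sketch below uses only the sums of squares and degrees of freedom printed in the notes, with n = 90 taken from the correlation table; the variable names are hypothetical.

```python
import math

# Four-predictor model (review, Test 1, hw_t, Lab_t), from the ANOVA table.
ss_regression = 14753.108
ss_residual = 7922.014
ss_total = 22675.122
n, p = 90, 4

df_residual = n - p - 1                    # 85, as printed in the table
ms_regression = ss_regression / p
ms_residual = ss_residual / df_residual    # this is s^2
s = math.sqrt(ms_residual)

r2 = ss_regression / ss_total
adj_r2 = 1.0 - ms_residual / (ss_total / (n - 1))
f_stat = ms_regression / ms_residual

print(f"R^2 = {r2:.3f}, adjusted R^2 = {adj_r2:.3f}")
print(f"s = {s:.3f}, F = {f_stat:.3f}")
```

The recomputed $R^2$, adjusted $R^2$, and F agree with the printed .651, .634, and 39.574, which suggests the same arithmetic gives the value of $s$ that the printout reports as the standard error of the estimate.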

Now let's see what happens when we remove hw_t.

Model Summary
Model  R      R Square  Adjusted R Square  Std. Error of the Estimate
1      .802a  .644      .631               9.
a. Predictors: (Constant), Lab_t, review, Test 1

ANOVA(b)
Model 1     Sum of Squares  df  Mean Square  F       Sig.
Regression  14592.579       3   4864.193     51.756  .000a
Residual    8082.543        86  93.
Total       22675.122       89
a. Predictors: (Constant), Lab_t, review, Test 1
b. Dependent Variable: Test 2

Coefficients(a)
Unstandardized Coefficients: B, Std. Error. Standardized Coefficients: Beta. 95% Confidence Interval for B: Lower Bound, Upper Bound.
Model 1     B      Std. Error  Beta   t      Sig.  Lower Bound  Upper Bound
(Constant)  4.395  6.035              .728   .468  -7.603       16.
Test 1      .413   .077        .441   5.374  .000  .260         .
review      4.533  2.405       .133   1.885  .063  -.248        9.
Lab_t       2.132  .479        .394   4.451  .000  1.180        3.
a. Dependent Variable: Test 2

a. Give the fitted regression equation.
b. What is the value of s the estimator of standard

Coefficients(a)
Unstandardized Coefficients: B, Std. Error. Standardized Coefficients: Beta. 95% Confidence Interval for B: Lower Bound, Upper Bound.
Model 1     B      Std. Error  Beta   t      Sig.  Lower Bound  Upper Bound
(Constant)  2.947  6.073              .485   .629  -9.125       15.
Lab_t       2.479  .449        .458   5.526  .000  1.587        3.
Test 1      .398   .078        .425   5.130  .000  .244         .
a. Dependent Variable: Test 2

a. Has $R^2$ changed drastically with the review variable removed?
b. What is the value of $s$, the estimator of the standard deviation?
c. Which model do you think is best and why?
d. What are some additional things we can look at when deciding which model would be best?

First let's look at the normal probability plots for two of our potential models:

Does the normality assumption appear to be met for our best model based on this normal probability plot? Model includes: Test 1, Lab_t, and review.

[Figure: Normal P-P plot of the regression residuals, Dependent Variable: Test 2 (expected against observed cumulative probability)]

Now let's look at some of the residual plots:

[Figure: Residual plot for Test 1 (unstandardized residual against Test 1)]
[Figure: Residual plot for Lab_t (unstandardized residual against Lab_t)]
[Figure: Residual plot for review (unstandardized residual against review)]

Does the constant variance assumption appear to be met based on the residual plots above?

Example 2: In-class group activity (Problems 11.45 to 11.51 in Moore and McCabe, fifth edition).

As cheddar cheese matures, a variety of chemical processes take place. The taste of mature cheese is related to the concentration of several chemicals in the final product. In a study of cheddar cheese from the La Trobe Valley of Victoria, Australia, samples of cheese were analyzed for their chemical composition and were subjected to taste tests. Data for one type of cheese-manufacturing process appear below. The variable "Case" is used to number the observations.

1. Describe the scatterplots below for each pair of variables in the cheese data set. Describe the relationships for each. Examine the correlation for each pair of variables and report which ones have significant correlations.
2. Fill in the chart below based on the output.

Model    $R^2$    $s$    Non-significant variables

3. Which model is the best for predicting taste based on the three explanatory variables "Acetic", "H2S" and "Lactic"?
4. Does the model you chose meet the assumptions of multiple regression?