Cheddar Cheese Taste Analysis with Acetic Acid, Hydrogen Sulfide, and Lactic Acid - Prof. , Study notes of Data Analysis & Statistical Methods

An analysis of the relationship between the taste of cheddar cheese and the concentrations of acetic acid, hydrogen sulfide, and lactic acid using multiple regression, with instructions on how to perform the analysis in SPSS, interpret the results, and refine the model.


Uploaded on 07/30/2009

Chapter 11: Multiple Regression

Multiple regression is what you use when you have 2 or more quantitative explanatory variables which will be used to predict another quantitative response variable.

Simple linear regression (Chapters 2 and 10) is used when you have just 1 quantitative explanatory variable and 1 quantitative response variable.

For simple linear regression (Chapters 2 and 10), our statistical model was:

$y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$

In multiple regression (Chapter 11), our statistical model is:

$y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \dots + \beta_p x_{pi} + \varepsilon_i$

where you have p explanatory variables.

Just because you have data for several x variables doesn’t mean that all of them are important enough to go in your model. We must follow a multi-step procedure to decide which x variables are the most important when describing y.

So what do we do when we have multiple x variables?

1. Look at the variables individually.

  • Means, standard deviations, minimums and maximums, outliers (if any), and stemplots or histograms are all good ways to show what is happening with your individual variables.

  • In SPSS, Analyze → Descriptive Statistics → Explore.

2. Look at the relationships between the variables using correlations and scatterplots.

  • In SPSS, Analyze → Correlate → Bivariate. Put all your variables (all the x’s and y) into the “Variables” box, and hit “OK.”

  • The closer the Pearson Correlation between 2 variables is to 1 or -1, the stronger the linear relationship, and the lower the Sig. (2-tailed), the stronger the evidence that the correlation is not 0. The P-value (Sig.) is the result of the test of $H_0: \rho = 0$ vs. $H_a: \rho \neq 0$ that we did in Chapter 10.

  • Which are the stronger relationships between an x and the y? Which are the stronger x-to-x relationships?

  • Look at scatterplots between each pair of variables, too (you will look at a LOT of graphs).

  • We are only interested in keeping the x variables which had strong correlations with y.

3. Do a regression using the variables you decided were important in step 2.

  • This will include an ANOVA table and coefficient output like what we saw in Chapter 10.

We had ANOVA results for simple linear regression in Ch. 10, too, but since we only had one slope ($\beta_1$), we didn’t need to use them.

ANOVA Table for Multiple Regression:

             SS    df             MS                     F         Significance
Regression   SSM   DFM = p        MSM = SSM/DFM          MSM/MSE   P-value
Residual     SSE   DFE = n-p-1    MSE = SSE/DFE = s^2
Total        SST   DFT = n-1      MST = SST/DFT

s = estimate for the standard deviation = √MSE
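The arithmetic in the ANOVA table can be checked by hand or with a few lines of code. The sketch below is only an illustration: the function name `anova_table` is made up for this example, and the sums of squares are the ones reported in the two-predictor (H2S, Lactic) SPSS output in the cheese example later in these notes, where n = 30 and p = 2.

```python
import math

def anova_table(ssm, sse, n, p):
    """Fill in the regression ANOVA table from the two sums of squares."""
    dfm, dfe, dft = p, n - p - 1, n - 1
    msm = ssm / dfm              # mean square for the model
    mse = sse / dfe              # mean square error, which equals s squared
    return {
        "SST": ssm + sse,        # total sum of squares
        "DFM": dfm, "DFE": dfe, "DFT": dft,
        "MSM": msm, "MSE": mse,
        "F": msm / mse,          # ANOVA F statistic
        "s": math.sqrt(mse),     # estimate of the standard deviation
    }

# SSM and SSE from the cheese example's ANOVA output (n = 30 cases, p = 2 predictors).
table = anova_table(ssm=4993.921, sse=2668.965, n=30, p=2)
for name, value in table.items():
    print(f"{name}: {value:.3f}")
```

The P-value is not computed here because the Python standard library has no F distribution; SPSS reports it in the Sig. column.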

Analysis of Variance F-Test:

In the multiple regression model, the ANOVA F test is a test of the hypotheses

$H_0: \beta_1 = \beta_2 = \dots = \beta_p = 0$

$H_a$: not all of $\beta_1, \beta_2, \dots, \beta_p$ are 0

$H_a$ means at least one $\beta_j \neq 0$. We can’t tell how many regression coefficients are not 0 at this point. We need to do t-tests to be more specific. (Think back: we did Bonferroni multiple-comparison t tests if we could reject the null hypothesis in a one-way ANOVA F test.) If we reject $H_0$, basically we have determined that this problem is worthy of further study.

  • Even if the P-value (Sig.) is small, you need to look at $R^2$. If $R^2$ is small, it means the model (variables) you are using does not do a very good job of explaining the variation in y.

  • You can get a fitted regression equation from the estimates $b_j$ in the SPSS output at this point.

  • The SPSS output will also include confidence intervals, t test statistics, and P-values for the individual $b_j$.

  • The degrees of freedom we will use for the t-tests will be n-p-1, where n is the number of observations and p is the number of explanatory variables in the model.

    o The F-test statistic from ANOVA should get bigger, and the P-value from the ANOVA F-test should get smaller.

    o Any variables left in the equation should have a significant P-value from the t-test of their coefficient (their confidence intervals should not contain 0), unless taking out a slightly insignificant coefficient makes $R^2$ and s move in the wrong direction.

  • Our goal is to keep only the variables which are the most useful to us. Get rid of any excess variables, but balance removing insignificant variables against the change that has on the whole model.

How do we know which variables should be included in our model and which should not?

***Procedure 1:

Start with a model that contains all your explanatory variables with strong correlations, run the regression, and then remove, one at a time, whichever variables aren’t significant according to the t-test until you find that your $R^2$ starts to decrease too rapidly or your s goes up too rapidly. You may end up leaving in one or more variables which are not significant on their own. You just have to see what removing them does to the whole model. (This is the procedure that we will follow in the lecture notes and that you should use for this class.)
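Procedure 1 can be sketched in code as a loop that repeatedly drops the weakest variable. This is a rough sketch under several assumptions, not the SPSS workflow used in class: the data are a made-up, perfectly balanced toy set (not the cheese data), the helper names `invert`, `fit`, and `backward_eliminate` are hypothetical, a fixed cutoff of |t| < 2.0 stands in for reading the P-value, and the tolerances `max_r2_drop` and `max_s_growth` are arbitrary stand-ins for "too rapidly".

```python
import math

def invert(m):
    """Invert a small square matrix by Gauss-Jordan elimination with partial pivoting."""
    k = len(m)
    aug = [list(row) + [1.0 if i == j else 0.0 for j in range(k)] for i, row in enumerate(m)]
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(k):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * w for v, w in zip(aug[r], aug[col])]
    return [row[k:] for row in aug]

def fit(columns, y):
    """Least-squares fit with an intercept; returns slope t statistics, R^2 and s."""
    n, p = len(y), len(columns)
    rows = [[1.0] + [col[i] for col in columns] for i in range(n)]
    k = p + 1
    xtx = [[sum(r[a] * r[b] for r in rows) for b in range(k)] for a in range(k)]
    xty = [sum(r[a] * yi for r, yi in zip(rows, y)) for a in range(k)]
    inv = invert(xtx)
    b = [sum(inv[a][c] * xty[c] for c in range(k)) for a in range(k)]
    sse = sum((yi - sum(bj * xj for bj, xj in zip(b, r))) ** 2 for r, yi in zip(rows, y))
    ybar = sum(y) / n
    sst = sum((yi - ybar) ** 2 for yi in y)
    mse = sse / (n - p - 1)                      # DFE = n - p - 1
    t = [b[a] / math.sqrt(mse * inv[a][a]) for a in range(1, k)]
    return {"t": t, "r2": 1 - sse / sst, "s": math.sqrt(mse)}

def backward_eliminate(data, y, t_cutoff=2.0, max_r2_drop=0.02, max_s_growth=1.05):
    """Drop the weakest variable one at a time while the model does not get worse."""
    names = list(data)
    current = fit([data[v] for v in names], y)
    while len(names) > 1:
        weakest = min(range(len(names)), key=lambda j: abs(current["t"][j]))
        if abs(current["t"][weakest]) >= t_cutoff:
            break                                # every remaining variable looks useful
        trial_names = names[:weakest] + names[weakest + 1:]
        trial = fit([data[v] for v in trial_names], y)
        if current["r2"] - trial["r2"] > max_r2_drop or trial["s"] > current["s"] * max_s_growth:
            break                                # removing it hurts the whole model
        names, current = trial_names, trial
    return names, current

# Toy data: y depends on x1 and x2 only; x3 is unrelated to y by construction.
toy = {
    "x1": [-1, -1, -1, -1, 1, 1, 1, 1],
    "x2": [-1, -1, 1, 1, -1, -1, 1, 1],
    "x3": [-1, 1, -1, 1, -1, 1, -1, 1],
}
y = [5.5, 5.5, 8.5, 8.5, 10.5, 10.5, 15.5, 15.5]
kept, final = backward_eliminate(toy, y)
print(kept, round(final["r2"], 3), round(final["s"], 3))
```

In practice the decision is a judgment call, as the notes say: a variable that is slightly insignificant may still stay if removing it moves $R^2$ and s the wrong way.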

Procedure 2:

Start with a model that contains only one explanatory variable and add one variable at a time until you find that your $R^2$ is no longer increasing rapidly.

Sometimes there may be more than one appropriate choice for your model. The most important thing is to be able to explain why you chose the model you did. Not every model is as easy to define as the one in the CHEESE example below.

Example (Exercises 11.43-11.51):

As cheddar cheese matures, a variety of chemical processes take place. The taste of mature cheese is related to the concentration of several chemicals in the final product. In a study of cheddar cheese from the La Trobe Valley of Victoria, Australia, samples of cheese were analyzed for their chemical composition and were subjected to taste tests.

Data for one type of cheese-manufacturing process appear below. The variable “Case” is used to number the observations from 1 to 30. “Taste” is the response variable of interest. The taste scores were obtained by combining the scores from several tasters.

Three chemicals whose concentrations were measured were acetic acid, hydrogen sulfide, and lactic acid. For acetic acid and hydrogen sulfide, (natural) log transformations were taken. Thus the explanatory variables are the transformed concentrations of acetic acid (“Acetic”) and hydrogen sulfide (“H2S”) and the untransformed concentration of lactic acid (“Lactic”).

a) For each of the 4 variables in the CHEESE data set, find the mean, median, standard deviation, and IQR. Display each distribution by means of a stemplot.

Case   Taste   Acetic   H2S      Lactic
1      12.3    4.543    3.135    0.
2      20.9    5.159    5.043    1.
3      39      5.366    5.438    1.
4      47.9    5.759    7.496    1.
5      5.6     4.663    3.807    0.
6      25.9    5.697    7.601    1.
7      37.3    5.892    8.726    1.
8      21.9    6.078    7.966    1.
9      18.1    4.898    3.85     1.
10     21      5.242    4.174    1.
11     34.9    5.74     6.142    1.
12     57.2    6.446    7.908    1.
13     0.7     4.477    2.996    1.
14     25.9    5.236    4.942    1.
15     54.9    6.151    6.752    1.
16     40.9    6.365    9.588    1.
17     15.9    4.787    3.912    1.
18     6.4     5.412    4.7      1.
19     18      5.247    6.174    1.
20     38.9    5.438    9.064    1.
21     14      4.564    4.949    1.
22     15.2    5.298    5.22     1.
23     32      5.455    9.242    1.
24     56.7    5.855    10.199   2.
25     16.8    5.366    3.664    1.
26     11.6    6.043    3.219    1.
27     26.5    6.458    6.962    1.
28     0.7     5.328    3.912    1.
29     13.4    5.802    6.685    1.
30     5.5     6.176    4.787    1.

Descriptives

Taste                                              Statistic   Std. Error
Mean                                               24.533      2.
95% Confidence Interval for Mean   Lower Bound     18.
                                   Upper Bound     30.
5% Trimmed Mean                                    24.
Median                                             20.
Variance                                           264.
Std. Deviation                                     16.
Minimum                                            .
Maximum                                            57.
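As a cross-check on the SPSS Explore output for part (a), the Taste summary can be recomputed directly from the 30 taste scores listed above. This is a minimal sketch using Python's `statistics` module; the function name `describe` is made up for this example, and the quartile rule is an assumption (Python's default may differ slightly from the one SPSS uses, so the IQR may not match SPSS exactly).

```python
import statistics

# Taste scores for cases 1-30, copied from the data table above.
taste = [12.3, 20.9, 39.0, 47.9, 5.6, 25.9, 37.3, 21.9, 18.1, 21.0,
         34.9, 57.2, 0.7, 25.9, 54.9, 40.9, 15.9, 6.4, 18.0, 38.9,
         14.0, 15.2, 32.0, 56.7, 16.8, 11.6, 26.5, 0.7, 13.4, 5.5]

def describe(values):
    """Return the summary statistics asked for in part (a)."""
    # Assumption: statistics.quantiles with its default "exclusive" method;
    # SPSS may use a slightly different quartile rule.
    q1, _median, q3 = statistics.quantiles(values, n=4)
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "sd": statistics.stdev(values),   # sample standard deviation (divides by n - 1)
        "iqr": q3 - q1,
        "min": min(values),
        "max": max(values),
    }

summary = describe(taste)
for name, value in summary.items():
    print(f"{name:>6}: {value:.3f}")
```

The same function can be applied to the Acetic and H2S columns. The Lactic values in the table above are cut off after the decimal point, so they would need to be taken from the original data set.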

[Stem-and-leaf plots (Frequency, Stem & Leaf) for Taste, Acetic, H2S, and Lactic]

b) Make a scatterplot for each pair of variables in the CHEESE data set (you will have 6 plots). Describe the relationships. Calculate the correlation for each pair of variables and report the P-value for the test of zero population correlation in each case.

Correlations

                               Taste   Acetic   H2S     Lactic
Taste    Pearson Correlation
         Sig. (2-tailed)       .       .002     .000    .
         N                     30      30       30
Acetic   Pearson Correlation
         Sig. (2-tailed)       .002    .        .000    .
         N                     30      30       30
H2S      Pearson Correlation
         Sig. (2-tailed)       .000    .000     .       .
         N                     30      30       30
Lactic   Pearson Correlation
         Sig. (2-tailed)       .000    .000     .       .
         N                     30      30       30

** Correlation is significant at the 0.01 level (2-tailed).
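The correlations and the test of zero population correlation in part (b) can be recomputed from the data table. The sketch below is illustrative only: the function names are made up, only the Taste, Acetic, and H2S columns are used (the Lactic values in the table are cut off), and instead of a P-value it reports the t statistic, t = r·√(n-2)/√(1-r²), to be compared with a t table with n-2 = 28 degrees of freedom.

```python
import math

# Columns copied from the CHEESE data table (cases 1-30).
taste = [12.3, 20.9, 39.0, 47.9, 5.6, 25.9, 37.3, 21.9, 18.1, 21.0,
         34.9, 57.2, 0.7, 25.9, 54.9, 40.9, 15.9, 6.4, 18.0, 38.9,
         14.0, 15.2, 32.0, 56.7, 16.8, 11.6, 26.5, 0.7, 13.4, 5.5]
acetic = [4.543, 5.159, 5.366, 5.759, 4.663, 5.697, 5.892, 6.078, 4.898, 5.242,
          5.74, 6.446, 4.477, 5.236, 6.151, 6.365, 4.787, 5.412, 5.247, 5.438,
          4.564, 5.298, 5.455, 5.855, 5.366, 6.043, 6.458, 5.328, 5.802, 6.176]
h2s = [3.135, 5.043, 5.438, 7.496, 3.807, 7.601, 8.726, 7.966, 3.85, 4.174,
       6.142, 7.908, 2.996, 4.942, 6.752, 9.588, 3.912, 4.7, 6.174, 9.064,
       4.949, 5.22, 9.242, 10.199, 3.664, 3.219, 6.962, 3.912, 6.685, 4.787]

def pearson_r(x, y):
    """Sample Pearson correlation between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def correlation_t(r, n):
    """t statistic for H0: rho = 0, with n - 2 degrees of freedom."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r * r)

r_acetic = pearson_r(acetic, taste)
r_h2s = pearson_r(h2s, taste)
for name, r in [("Acetic", r_acetic), ("H2S", r_h2s)]:
    print(f"Taste vs {name}: r = {r:.3f}, t = {correlation_t(r, len(taste)):.2f}")
```

A t statistic larger in absolute value than about 2.05 (the two-sided 5% critical value for 28 degrees of freedom) corresponds to a Sig. (2-tailed) value below 0.05.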

f) Give the 95% confidence intervals for the regression coefficients of your explanatory variables. Do any of the intervals contain the point 0? (This should verify your answer to part e.)

g) What is the value of s, the estimate of the standard deviation of the model (SPSS calls it the Std. Error of the Estimate)?

h) What percent of the variation in taste is explained by these explanatory variables?

i) One variable looks like a good candidate to be dropped. Which one is it? Try running the multiple regression again without this variable. Look at parts c through h again.

Model Summary

Model   R         R Square   Adjusted R Square   Std. Error of the Estimate
1       .807(a)   .652       .626                9.

a Predictors: (Constant), Lactic, H2S

ANOVA(b)

Model 1      Sum of Squares   df   Mean Square   F        Sig.
Regression   4993.921         2    2496.961      25.260   .000(a)
Residual     2668.965         27   98.
Total        7662.887         29

a Predictors: (Constant), Lactic, H2S
b Dependent Variable: Taste
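The Model Summary values can be recovered from the ANOVA sums of squares, which helps with part (h). A small sketch follows, assuming the usual definitions $R^2$ = SSM/SST and adjusted $R^2$ = 1 - MSE/MST; the function names are made up for this example.

```python
import math

def r_squared(ssm, sst):
    """Proportion of the variation in y explained by the model."""
    return ssm / sst

def adjusted_r_squared(sse, sst, n, p):
    """R^2 adjusted for the number of explanatory variables: 1 - MSE/MST."""
    mse = sse / (n - p - 1)
    mst = sst / (n - 1)
    return 1 - mse / mst

# Sums of squares from the ANOVA output for the model with H2S and Lactic.
ssm, sse, sst = 4993.921, 2668.965, 7662.887
r2 = r_squared(ssm, sst)
adj_r2 = adjusted_r_squared(sse, sst, n=30, p=2)
print(f"R = {math.sqrt(r2):.3f}, R Square = {r2:.3f}, Adjusted R Square = {adj_r2:.3f}")
```

Read as a percentage, R Square says that roughly 65% of the variation in taste is explained by H2S and Lactic together.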

Coefficients(a)

               Unstandardized Coefficients   Standardized Coefficients                    95% Confidence Interval for B
Model 1        B          Std. Error         Beta                        t        Sig.    Lower Bound   Upper Bound
(Constant)     -27.592    8.982                                          -3.072   .005    -46.021       -9.
H2S            3.946      1.136              .516                        3.475    .002    1.616         6.
Lactic         19.887     7.959              .371                        2.499    .019    3.557         36.

a Dependent Variable: Taste
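The t statistics and confidence intervals in the Coefficients table follow from each estimate and its standard error: t = b/SE, and the 95% interval is b ± t*·SE with n-p-1 = 27 degrees of freedom. The sketch below is a hand check only. The critical value t* = 2.052 is taken from a t table (an assumption of this example, not something SPSS prints), and because the inputs are rounded, the bounds can differ from SPSS in the third decimal place.

```python
T_STAR_27 = 2.052  # upper 2.5% point of t with 27 df, looked up in a t table

# (estimate b, standard error) pairs copied from the Coefficients output.
coefficients = {
    "(Constant)": (-27.592, 8.982),
    "H2S": (3.946, 1.136),
    "Lactic": (19.887, 7.959),
}

def t_and_interval(b, se, t_star):
    """Return the t statistic and the confidence interval b +/- t_star * se."""
    margin = t_star * se
    return b / se, (b - margin, b + margin)

results = {name: t_and_interval(b, se, T_STAR_27) for name, (b, se) in coefficients.items()}
for name, (t, (low, high)) in results.items():
    contains_zero = low <= 0 <= high
    print(f"{name:<11} t = {t:6.3f}  95% CI = ({low:.3f}, {high:.3f})  contains 0: {contains_zero}")
```

An interval that does not contain 0 agrees with a t test P-value below 0.05, which is the check asked for in part (f).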

What changed? What stayed the same or improved?

                                      Original (all 3            New (only H2S
                                      explanatory variables)     and Lactic)      Change
R^2
s                                     10.1307                    9.
F, P-value                            16.221, 0                  25.260, 0
Insignificant explanatory variables   Acetic                     None

j) Now look at a residual plot for each of the variables you still have in the model. Do a normal probability plot, too.

[Residual plot: Unstandardized Residual vs. H2S]

[Residual plot: Unstandardized Residual vs. Lactic]

[Normal P-P Plot of Regression Standardized Residual (Dependent Variable: Taste); axes: Observed Cum Prob, Expected Cum Prob]

k) Using the better model, predict the “taste” for an H2S=4 and Lactic=1.
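For part (k), the prediction comes from plugging the values into the fitted equation from the Coefficients output, taste-hat = -27.592 + 3.946·H2S + 19.887·Lactic. The sketch below reads the exercise as H2S = 4 and Lactic = 1; the function name `predict_taste` is made up for this example. Remember that H2S is on the natural-log scale, as described in the data description.

```python
# Estimates b0, b_H2S, b_Lactic copied from the Coefficients output.
B0, B_H2S, B_LACTIC = -27.592, 3.946, 19.887

def predict_taste(h2s, lactic):
    """Predicted taste score from the fitted two-variable regression equation."""
    return B0 + B_H2S * h2s + B_LACTIC * lactic

predicted = predict_taste(h2s=4, lactic=1)
print(f"Predicted taste: {predicted:.3f}")
```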
