Predicting Mercury Concentration in Fish with Linear Regression: Length and Weight, Study notes of Statistics

These notes use multiple linear regression to predict mercury concentration (ppm) in fish from their length (cm) and weight (g). They cover the idea of multiple regression, least-squares estimation, and the interpretation of the coefficients, and they include an example of omitted variables and confounding. Useful for students studying statistics, particularly regression analysis.

Uploaded on 03/28/2010

koofers-user-7v1

Statistics 431:
Statistical Inference
Lecture 19: Multiple linear least-squares regression

Multiple predictor variables: example

  • Too much mercury in the body results in memory loss, depression, irritability, and anxiety: the “mad hatter” syndrome.
  • Rivers and oceans contain a small amount of mercury, which accumulates in fish over their lifetimes.
  • The concentration of mercury in fish tissue can be measured, at considerable expense, by catching fish and sending samples to a lab for analysis.
  • It is important to understand the relationship between mercury concentration and measurable characteristics of a fish, such as length and weight, in order to develop safety guidelines about how much fish to eat.
  • We have data from a study of largemouth bass in the Waccamaw and Lumber rivers of North Carolina. At several stations along each river, a group of fish was caught, weighed, and measured. In addition, a filet from each fish was sent to the lab to determine mercury tissue concentration.
  • We want to predict $Y$ = mercury concentration (ppm), based on $X_1$ = length (cm) and $X_2$ = weight (g).


Multiple regression: least-squares estimation

  • As always, we estimate the regression coefficients by minimizing the sum of squared residuals.
  • Now we have $(p + 1)$ coefficients instead of two, and the mechanics of calculating residuals are different (but the idea is the same):

$$
S(\hat\beta_0, \dots, \hat\beta_p) = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (y_i - \hat y_i)^2 = \sum_{i=1}^{n} \bigl(y_i - (\hat\beta_0 + \hat\beta_1 x_{i1} + \dots + \hat\beta_p x_{ip})\bigr)^2.
$$

  • This optimization problem has a unique solution $\hat\beta_0^{LS}, \dots, \hat\beta_p^{LS}$, provided no predictor is an exact linear combination of the others. Hereafter we drop the “LS” superscript.
  • There are formulas for the $\hat\beta_j$'s in terms of the $y_i$'s and $x_{ij}$'s, but they are not enlightening. (In matrix notation they are revealing, but that is outside our scope.) JMP will do the computations for you.
  • The unbiased estimate of $\sigma^2$ is now $\hat\sigma^2 = \mathrm{RSS}/(n - (p + 1))$, rather than $\mathrm{RSS}/(n - 2)$.
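The recipe above can be sketched in a few lines of Python with NumPy, as an alternative to JMP. This is only an illustration: the six data points are hypothetical toy values (not the fish data), and the names `X`, `beta_hat`, `S`, and `sigma2_hat` are chosen here for convenience.

```python
import numpy as np

# Hypothetical toy data (NOT the fish data): n = 6 observations, p = 2 predictors.
x1 = np.array([30.0, 35.0, 40.0, 45.0, 50.0, 55.0])
x2 = np.array([400.0, 650.0, 900.0, 1400.0, 1800.0, 2600.0])
y = np.array([0.6, 0.9, 1.1, 1.6, 1.7, 2.3])

n, p = len(y), 2

# Design matrix: a column of ones for the intercept, then one column per predictor.
X = np.column_stack([np.ones(n), x1, x2])


def S(beta):
    """Sum of squared residuals for a candidate coefficient vector."""
    resid = y - X @ beta
    return float(resid @ resid)


# Least-squares estimates: the beta that minimizes S.
beta_hat, _, rank, _ = np.linalg.lstsq(X, y, rcond=None)

rss = S(beta_hat)
sigma2_hat = rss / (n - (p + 1))  # divide by n - (p + 1), not n - 2

print("beta_hat:", beta_hat)
print("RSS:", rss, "sigma^2 hat:", sigma2_hat)
```

Nudging any entry of `beta_hat` and re-evaluating `S` gives a larger value, which is a quick way to convince yourself that the fitted coefficients really do minimize the sum of squared residuals.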


Mercury: fitted multiple linear regression

Response: Mercury Concentration

[Figure: Whole Model, Actual by Predicted Plot]

Summary of Fit

  RSquare                       0.
  RSquare Adj                   0.
  Root Mean Square Error        0.
  Mean of Response              1.
  Observations (or Sum Wgts)    171

Analysis of Variance

  Source      DF    Sum of Squares    Mean Square    F Ratio    Prob > F
  Model        2         42.173416        21.0867    62.        <.
  Error      168         56.448858         0.3360
  C. Total   170         98.622274

Parameter Estimates

  Term         Estimate     Std Error    t Ratio    Prob>|t|
  Intercept   -1.496193     0.365801      -4.09     <.
  Length       0.071353     0.011979       5.96     <.
  Weight      -0.000143     0.000117      -1.23     0.
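The summary statistics in JMP output like this are all functions of the Analysis of Variance table. As a sketch using the standard definitions (the variable names are chosen here, not taken from JMP), the following recomputes them from the sums of squares and degrees of freedom printed above:

```python
import math

# Values read from the Analysis of Variance table above.
n, p = 171, 2
ss_model, ss_error, ss_total = 42.173416, 56.448858, 98.622274

df_model = p            # one degree of freedom per predictor
df_error = n - (p + 1)  # n minus the number of estimated coefficients
df_total = n - 1

ms_model = ss_model / df_model
ms_error = ss_error / df_error  # this is sigma^2 hat = RSS / (n - (p + 1))

r2 = ss_model / ss_total                       # fraction of variation explained
r2_adj = 1 - ms_error / (ss_total / df_total)  # penalizes extra predictors
rmse = math.sqrt(ms_error)                     # sigma hat
f_ratio = ms_model / ms_error

print(f"R^2 = {r2:.4f}, adjusted R^2 = {r2_adj:.4f}")
print(f"RMSE = {rmse:.4f}, F = {f_ratio:.2f}")
```

With these inputs the model explains roughly 43% of the variation in mercury concentration, and the F ratio comes out a little under 63.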


Interpreting the coefficients

  • $\mu(x_1, x_2) = E[Y \mid \text{Length} = x_1, \text{Weight} = x_2] = \beta_0 + \beta_1 x_1 + \beta_2 x_2$.
  • Interpretation of $\beta_0$: mean mercury concentration for a zero-dimensional fish (length = 0 cm, weight = 0 g).
  • Interpretation of $\beta_1$: if observed length increases by 1 cm, and other variables remain fixed, then average mercury concentration increases by $\beta_1$ ppm, no matter what the fixed values of the other variables are.
  • Analogous interpretation for $\beta_2$.
  • NOTE: “observed length increases by 1 cm” $\neq$ “we intervene to increase a fish’s length by 1 cm” (growth hormone?)
      • case 1: we catch a longer fish, just like we caught them in the sample.
      • case 2: we manipulate the fish separately from the process which produced fish in our sample.
  • To predict mercury concentration for fish caught like those in the sample, we are in case 1: regression will be helpful.
  • To predict mercury concentration if we do something to the fish, we are in case 2: regression will often not be helpful, because of confounding.
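As a small case-1 illustration, the fitted equation can be evaluated directly with the estimates from the Parameter Estimates table. The function name `predict_mercury` and the example fish (40 cm, 1000 g) are hypothetical choices made here, not values from the study.

```python
# Estimates copied from the Parameter Estimates table.
B0 = -1.496193   # intercept
B1 = 0.071353    # ppm per cm of length, weight held fixed
B2 = -0.000143   # ppm per g of weight, length held fixed


def predict_mercury(length_cm, weight_g):
    """Estimated mean mercury concentration (ppm) for fish caught like those in the sample."""
    return B0 + B1 * length_cm + B2 * weight_g


# Hypothetical fish: 40 cm and 1000 g.
base = predict_mercury(40.0, 1000.0)

# Comparing with an observed fish 1 cm longer at the same weight changes the
# prediction by exactly B1, whatever the fixed weight is.
longer = predict_mercury(41.0, 1000.0)

print(f"predicted mean: {base:.4f} ppm")
print(f"difference for +1 cm at fixed weight: {longer - base:.6f} ppm")
```

The difference describes fish that happen to be longer in the sampled population. It says nothing about what would happen if a given fish were made longer.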


Example: omitted variables and confounding

  • In order to evaluate the benefits of a proposed irrigation scheme in a certain region, suppose that the relation of yield Y (bushels/acre) to rainfall R is investigated over several years. We also measure temperature T each year but do not include it in the initial analysis.

  Year     Y     R     T
  1964    50    10     4
  1965    70    11     5
  1966    70    10     5
  1967    80     9     5
  1968    50     9     4
  1969    60    12     4
  1970    40    11     4

We could consider the simple regression function E(Y|R):

[Figure: Bivariate Fit of Y By R. Scatterplot of Y against R with a fitted line (Linear Fit).]
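As a rough check on the direction of this fit, the simple least-squares line can be computed by hand from the Y and R values of the seven years listed above. This sketch uses only those listed rows, so its numbers need not match the JMP output exactly; the array names are chosen here.

```python
import numpy as np

# Y (yield) and R (rainfall) for the seven years listed above, 1964 to 1970.
rain = np.array([10, 11, 10, 9, 9, 12, 11], dtype=float)
yield_ = np.array([50, 70, 70, 80, 50, 60, 40], dtype=float)

# Simple linear regression: slope = Sxy / Sxx, intercept = ybar - slope * xbar.
sxy = np.sum((rain - rain.mean()) * (yield_ - yield_.mean()))
sxx = np.sum((rain - rain.mean()) ** 2)
slope = sxy / sxx
intercept = yield_.mean() - slope * rain.mean()

print(f"fitted line: Y = {intercept:.2f} + ({slope:.2f}) * R")
```

For these rows the slope is negative: years with more rain had, on average, lower yield.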


What’s going on?

  • A negative slope coefficient tells us that more rainfall went with less yield in this sample. (Causally, we would expect rain to improve yield.)
  • This suggests a confounder: a variable associated with both rainfall and yield.
  • This slope coefficient tells us the change in average yield we observed to be associated with a unit change in rainfall: there’s no guarantee that other variables affecting yield (not included in this regression) were being held fixed. (These are omitted variables.)
  • To decide about irrigation, we want to know what would happen to average yield if we intervened to increase water supply. We need a regression coefficient for water supply (i.e., rainfall) which corresponds to all other effects being held fixed.
  • Of course, we will never even figure out every factor affecting yield, much less measure them all. But perhaps we can find some important ones, and make an argument that the influence of the remaining factors can reasonably be glossed over.
  • For starters, we should stick in temperature.
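One way to make the omitted-variable point precise is the following sketch, which assumes that both relationships are exactly linear. The symbols $\gamma_0$ and $\gamma_1$ are introduced here for illustration and do not appear in the original notes. Suppose

$$
E[Y \mid R, T] = \beta_0 + \beta_1 R + \beta_2 T, \qquad E[T \mid R] = \gamma_0 + \gamma_1 R .
$$

Averaging the first equation over $T$ for a given $R$ gives

$$
E[Y \mid R] = \beta_0 + \beta_1 R + \beta_2\, E[T \mid R] = (\beta_0 + \beta_2 \gamma_0) + (\beta_1 + \beta_2 \gamma_1)\, R .
$$

So the slope in the regression of $Y$ on $R$ alone is $\beta_1 + \beta_2 \gamma_1$, not $\beta_1$. If rain helps yield ($\beta_1 > 0$), temperature helps yield ($\beta_2 > 0$), and rainier years tend to be cooler ($\gamma_1 < 0$), this slope is negative whenever $|\beta_2 \gamma_1| > \beta_1$.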


Rainfall redux

Response: Y

[Figure: Whole Model, Actual by Predicted Plot. Y Actual against Y Predicted; P=0.0201 RSq=0. RMSE=7.]

Summary of Fit

  RSquare                   0.
  RSquare Adj               0.
  Root Mean Square Error    7.


  • Now more rainfall is associated with higher yield, for any given fixed temperature: a conclusion which seems causally less peculiar.
  • Temperature was a confounder for the original regression. In the original data, T was not held constant. It changed from year to year. In fact,
      • temperature and rainfall were inversely related: lower temperature, more rain; higher temperature, less rain.
      • temperature and yield were directly related: lower temperature, lower yield; higher temperature, higher yield.
  • When we observed high rainfall, the corresponding yield included (i) the positive effect of more rain, (ii) the negative effect of the lower temperature that usually accompanied more rain.
  • In this case, (ii) was bigger than (i): the regression of yield on rainfall alone gave a (noisy) negative slope coefficient, because the coefficient included both the direct effect of rain, and the larger indirect effect of temperature (the omitted variable).
  • Are we done?
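The sign flip described above can be reproduced in a small simulation. This is a sketch on synthetic data, not the yield data from the example: every coefficient in the data-generating process below is made up for illustration, and `ols` is a helper defined here.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000

# Hypothetical data-generating process (all numbers are made up):
temp = rng.normal(50.0, 5.0, n)
# Hotter years are drier: rainfall falls as temperature rises.
rain = 20.0 - 0.2 * temp + rng.normal(0.0, 0.5, n)
# Both rain and temperature raise yield.
yield_ = -100.0 + 4.0 * rain + 3.0 * temp + rng.normal(0.0, 5.0, n)


def ols(y, *predictors):
    """Least-squares coefficients (intercept first) for y on the given predictors."""
    X = np.column_stack([np.ones(len(y)), *predictors])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef


b_simple = ols(yield_, rain)        # temperature omitted
b_full = ols(yield_, rain, temp)    # temperature included

print("rain slope, temperature omitted: ", b_simple[1])
print("rain slope, temperature included:", b_full[1])
```

With temperature omitted, the rainfall slope absorbs the indirect temperature effect and comes out negative. With temperature included, it is close to the positive value used to generate the data.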
