Understanding Relationship between Response and Explanatory Variables in Linear Regression, Study notes of Community Health

These notes introduce simple and multiple linear regression models, in which the mean of a response variable is believed to depend on the values of one or more explanatory variables. They explain how to approximate the mean function with a linear function, the role of the intercept and slope coefficients, and how to test the significance of explanatory variables using ANOVA and t-tests. They also cover the calculation of fitted values and prediction intervals.

Uploaded on 10/01/2009


CPH 930 — Fall 2006 — Dr. Charnigo

Lecture 1

The linear regression model

A simple scenario. Suppose that we have a continuous variable $Y$ whose mean is believed to depend on the value of another continuous variable $X$. We refer to $Y$ as a response variable (or outcome variable, dependent variable) and to $X$ as an explanatory variable (or predictor, covariate, independent variable). For example, $Y$ could be birthweight in grams and $X$ could be gestational age in weeks.

Let $m(x)$ denote the mean of $Y$ when $X = x$. So, $m(37)$ would denote the mean birthweight among infants with a gestational age of 37 weeks. Note that some authors prefer to use $E[Y \mid X = x]$, $E[Y \mid x]$, or $\mu_{Y \mid x}$ instead of $m(x)$.

Suppose that we want to estimate $m(x)$ from a data set $(x_1, y_1), \ldots, (x_n, y_n)$. Here $x_i$ is the value of $X$ for the $i$th person on whom we have a measurement, while $y_i$ is the value of $Y$ for this person.

Formulating a statistical model. If $m(x)$ can be closely approximated by a linear function $\alpha + \beta x$, then a complicated problem (estimating $m(x)$ for every possible value of $X$) becomes much simpler (estimating two parameters, $\alpha$ and $\beta$). For this reason, we often assume that (Equation 11.1)

$$m(x) = \alpha + \beta x.$$

Of course, $Y$ does not have to equal $m(x)$, so to finish formulating the statistical model we need to describe how $Y$ differs from $m(x)$. Typically this is done by specifying that (Equation 11.2)

$$Y_i = \alpha + \beta x_i + \varepsilon_i,$$

where the $\varepsilon_i$ are independent and identically distributed normal random variables with mean 0 and (unknown) variance $\sigma^2$.
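To make the model concrete, the following Python sketch simulates responses from Equation 11.2 for a handful of gestational ages. The parameter values (`alpha`, `beta`, `sigma`) and the list of ages are illustrative assumptions chosen for this sketch, not estimates from any real data set.

```python
import random


def mean_function(x, alpha, beta):
    """Return m(x) = alpha + beta * x, the assumed mean response at X = x."""
    return alpha + beta * x


def simulate_responses(x_values, alpha, beta, sigma, seed=0):
    """Draw Y_i = alpha + beta * x_i + eps_i, with independent eps_i ~ N(0, sigma^2)."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    return [mean_function(x, alpha, beta) + rng.gauss(0.0, sigma) for x in x_values]


# Hypothetical values, for illustration only.
gestational_ages = [34, 35, 36, 37, 38, 39, 40, 41]  # weeks
birthweights = simulate_responses(gestational_ages, alpha=-3000.0, beta=170.0, sigma=400.0)

for x, y in zip(gestational_ages, birthweights):
    print(f"x = {x} weeks: m(x) = {mean_function(x, -3000.0, 170.0):.0f} g, simulated y = {y:.0f} g")
```

Each simulated response scatters around its mean $m(x_i)$, which is the behavior the error term in the model is meant to describe.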

Remarks on the statistical model. The above model (Equation 11.2) is often called a “simple linear regression model”. The word “simple” is meant to convey that there is only one explanatory variable. The basic idea of simple linear regression is to look at the scatterplot of $(x_1, y_1), \ldots, (x_n, y_n)$ and draw a line that best describes the trend in the data. Although we can perceive visually whether a line effectively describes the trend in the data, there is a mathematical procedure for choosing the best line. Identifying this best line is tantamount to estimating the parameters $\alpha$ and $\beta$. We call $\alpha$ the intercept (coefficient) and $\beta$ the slope (coefficient). The intercept is the mean response when $X = 0$ (if it is meaningful to have $X = 0$). The slope is the change in the mean response associated with a one-unit increase in $X$. To see this, note that

$$m(x + 1) - m(x) = \alpha + \beta(x + 1) - \alpha - \beta x = \beta$$

for any $x$. A positive slope implies a tendency for large values of $Y$ to accompany large values of $X$ (Figure 11.3-a), a negative slope implies a tendency for small values of $Y$ to accompany large values of $X$ (Figure 11.3-b), and a zero slope implies no tendency in either direction (Figure 11.3-c).

Generalizing to multiple explanatory variables. Now suppose (more realistically!) that the mean of $Y$ may depend on the values of several explanatory variables $X_1, \ldots, X_k$. Typically we think of $X_1, \ldots, X_k$ as continuous, but we can allow some (or all) of them to be dichotomous. Let $m(x_1, \ldots, x_k)$ or $E[Y \mid X_1 = x_1, \ldots, X_k = x_k]$ denote the mean of $Y$ when $X_1 = x_1, \ldots, X_k = x_k$.

If the guesses are good, then

$$r_0 + r_1 x_{1,i} + \cdots + r_k x_{k,i} \approx \alpha + \beta_1 x_{1,i} + \cdots + \beta_k x_{k,i} = m(x_{1,i}, \ldots, x_{k,i}) \approx y_i,$$

so that $g(r_0, r_1, \ldots, r_k)$ is small. If the guesses are poor, then $g(r_0, r_1, \ldots, r_k)$ is large. Hence, we are led to choose $r_0, r_1, \ldots, r_k$ to make $g(r_0, r_1, \ldots, r_k)$ as small as possible. Such choices of $r_0, r_1, \ldots, r_k$ become our estimates of $\alpha, \beta_1, \ldots, \beta_k$ and are denoted either as $a, b_1, \ldots, b_k$ or $\hat\alpha, \hat\beta_1, \ldots, \hat\beta_k$. This estimation procedure is referred to as “least squares” and must be carried out with statistical software (unless $k = 1$).
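The definition of $g$ is not reproduced in this excerpt. The Python sketch below assumes the usual least-squares criterion, namely the sum of squared differences between each $y_i$ and the guessed linear function, and treats the case $k = 1$, where the minimizing values have a closed form. The data and the function names are hypothetical.

```python
def sum_of_squares(r0, r1, xs, ys):
    """Assumed criterion g(r0, r1): sum over i of (y_i - (r0 + r1 * x_i))^2."""
    return sum((y - (r0 + r1 * x)) ** 2 for x, y in zip(xs, ys))


def least_squares_line(xs, ys):
    """Closed-form least-squares estimates (a, b) for one explanatory variable."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b = sxy / sxx          # slope estimate
    a = y_bar - b * x_bar  # intercept estimate
    return a, b


# Hypothetical data, for illustration only.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

a, b = least_squares_line(xs, ys)
print(f"a = {a:.3f}, b = {b:.3f}")
print(f"g at the estimates:    {sum_of_squares(a, b, xs, ys):.4f}")
print(f"g at a poorer guess:   {sum_of_squares(a + 0.5, b - 0.2, xs, ys):.4f}")
```

Any other pair of guesses gives a larger value of the criterion than the least-squares estimates do, which is what "as small as possible" means here.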

Fitted values and fitted model. Given specific numbers $x_1, \ldots, x_k$, let

$$\hat y := a + b_1 x_1 + \cdots + b_k x_k.$$

We refer to $\hat y$ as a “fitted value” or a “predicted value”. These two terms are used interchangeably. However, “fitted value” seems more appropriate when we think of $\hat y$ as an estimate of $m(x_1, \ldots, x_k)$, whereas “predicted value” seems preferable when we think of $\hat y$ as a prediction for a not-yet-observed response. If $x_1, \ldots, x_k$ are not specified, then the full equation

$$\hat y = a + b_1 x_1 + \cdots + b_k x_k$$

is called a “fitted model”.

Example (fitted values and fitted model). Refer to {MLRExample.pdf}. The response variable, which we will denote as $Y$, is total cholesterol (TC). There are three explanatory variables, which we will denote as $X_1$ through $X_3$: polyunsaturated fat intake (POLYFAT), alcohol intake (ALCOHOL), and body mass index (BMI). In the “Parameter Estimates” box on page 1, we see that $a = 69.470$, $b_1 = 2.069$, $b_2 = 0.338$, and $b_3 = 4.403$. With $x_1 = 6.4$, $x_2 = 23.8$, and $x_3 = 26.7$, as for the first subject in {Cholesterol.xls}, we have

$$\hat y = 69.470 + 2.069(6.4) + 0.338(23.8) + 4.403(26.7) = 208.3,$$

as shown in the “Predicted Value” column on page 2 of {MLRExample.pdf}. The fitted model is

$$\hat y = 69.470 + 2.069 x_1 + 0.338 x_2 + 4.403 x_3.$$
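The arithmetic in this example can be reproduced with a few lines of Python. The coefficients and the first subject's values are the ones quoted above; the function name `fitted_value` is simply a label chosen for this sketch.

```python
def fitted_value(intercept, slopes, x_values):
    """Return y_hat = a + b_1 * x_1 + ... + b_k * x_k."""
    if len(slopes) != len(x_values):
        raise ValueError("need exactly one x value per slope coefficient")
    return intercept + sum(b * x for b, x in zip(slopes, x_values))


a = 69.470                        # intercept estimate
slopes = [2.069, 0.338, 4.403]    # b_1 (POLYFAT), b_2 (ALCOHOL), b_3 (BMI)
first_subject = [6.4, 23.8, 26.7]

y_hat = fitted_value(a, slopes, first_subject)
print(round(y_hat, 1))  # 208.3
```

Leaving `x_values` unspecified corresponds to the fitted model; plugging in one subject's values gives that subject's fitted value.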

Inferences in linear regression

ANOVA test for model significance. Just because we go to the trouble of fitting a linear regression model doesn’t mean that we are going to be satisfied with the results. To begin with, we may ask whether the regression model even explains a significant portion of the variability in the response. We can address this by decomposing a gross measure of variability in the response (“total sum of squares”) into parts that represent variability accounted for by the model (“regression sum of squares”) and variability not accounted for by the model (“residual sum of squares”):

$$\sum_{i=1}^{n} (y_i - \bar y)^2 = \sum_{i=1}^{n} (\hat y_i - \bar y)^2 + \sum_{i=1}^{n} (y_i - \hat y_i)^2$$

or

$$\text{Tot SS} = \text{Reg SS} + \text{Res SS}.$$
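As a numerical check of this decomposition, the Python sketch below fits a least-squares line to a small hypothetical data set (one explanatory variable, for simplicity) and computes the three sums of squares. The data and function names are assumptions made for illustration.

```python
def fit_line(xs, ys):
    """Least-squares intercept and slope for a single explanatory variable."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b = sxy / sxx
    return y_bar - b * x_bar, b


def sums_of_squares(xs, ys):
    """Return (Tot SS, Reg SS, Res SS) for the least-squares line."""
    a, b = fit_line(xs, ys)
    y_bar = sum(ys) / len(ys)
    fitted = [a + b * x for x in xs]
    tot_ss = sum((y - y_bar) ** 2 for y in ys)
    reg_ss = sum((f - y_bar) ** 2 for f in fitted)
    res_ss = sum((y - f) ** 2 for y, f in zip(ys, fitted))
    return tot_ss, reg_ss, res_ss


# Hypothetical data, for illustration only.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

tot_ss, reg_ss, res_ss = sums_of_squares(xs, ys)
print(f"Tot SS = {tot_ss:.4f}")
print(f"Reg SS + Res SS = {reg_ss + res_ss:.4f}")
```

The identity holds only when the fitted values come from a least-squares fit that includes an intercept; arbitrary guesses for the coefficients would not, in general, split the total sum of squares this way.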

The regression sum of squares is large when the $\hat y_i$ differ considerably from $\bar y$ (i.e., knowledge of the $x_{1,i}, \ldots, x_{k,i}$ permits us to make better “predictions”

Example (the t-test for a partial slope coefficient). Refer to {MLRExample.pdf}. In the “Parameter Estimates” box on page 1, we see that $b_1 = 2.069$, $\mathrm{se}(b_1) = 0.914$, and $t = 2.26$. The corresponding p-value is 0.0282, so we reject $H_0: \beta_1 = 0$. Polyunsaturated fat intake is useful in predicting total cholesterol even after we control for alcohol intake and body mass index.
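The definition of the test statistic falls outside this excerpt. Assuming the usual form, in which the estimate is divided by its standard error, the reported value can be checked directly:

$$t = \frac{b_1}{\mathrm{se}(b_1)} = \frac{2.069}{0.914} \approx 2.26.$$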

Confidence interval for a partial slope coefficient. The $100(1 - \alpha)\%$ confidence interval for $\beta_1$ is

$$b_1 \pm t_{n-(k+1),\,1-\alpha/2}\;\mathrm{se}(b_1).$$

Confidence intervals for $\beta_2, \ldots, \beta_k$ are constructed analogously.
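A minimal Python sketch of this interval follows. The estimate and standard error are the ones quoted for $b_1$ in the example; the critical value is a hypothetical placeholder, because the sample size $n$ is not stated in this excerpt and the correct quantile would have to be looked up for $n - (k+1)$ degrees of freedom.

```python
def slope_confidence_interval(estimate, std_error, t_critical):
    """Return (lower, upper) = estimate -/+ t_critical * std_error."""
    if std_error < 0 or t_critical < 0:
        raise ValueError("standard error and critical value must be non-negative")
    margin = t_critical * std_error
    return estimate - margin, estimate + margin


b1 = 2.069       # estimate quoted in the example
se_b1 = 0.914    # standard error quoted in the example
t_critical = 2.0 # HYPOTHETICAL placeholder, not the quantile for this data set

lower, upper = slope_confidence_interval(b1, se_b1, t_critical)
print(f"({lower:.3f}, {upper:.3f})")
```

With the true quantile in place of the placeholder, an interval that excludes 0 corresponds to rejecting $H_0: \beta_1 = 0$ at the matching significance level.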

The partial f-test. Suppose that we want to know whether $X_{k-1}$ and $X_k$ are useful in predicting $Y$ when we control for $X_1, \ldots, X_{k-2}$. At first glance it seems that we can reach a conclusion simply by carrying out the t-tests with respective null hypotheses $H_0: \beta_{k-1} = 0$ and $H_0: \beta_k = 0$. There are two problems with such an approach. First, what do we conclude if one null hypothesis is accepted and the other is rejected? Second, and more importantly, such an approach can fail in the following situation: knowing $X_k$ does not help if we already know $X_{k-1}$ (so $H_0: \beta_k = 0$ is accepted), knowing $X_{k-1}$ does not help if we already know $X_k$ (so $H_0: \beta_{k-1} = 0$ is accepted), but knowing at least one of $X_{k-1}, X_k$ is helpful. Hence, we need a way to test $H_0: \beta_{k-1} = \beta_k = 0$. In general, we need a way to test $H_0: \beta_{k-m+1} = \cdots = \beta_k = 0$, where $m$ is between 1 and $k$. [Note that I could equally well pose the problem in terms of a null hypothesis involving any $m$ partial slope coefficients.] The approach we will use is called the partial f-test; Rosner gives the formulation for $m = 1$ only in Equation 11.33, although he makes a subtle mistake in step (1). To carry out the partial f-test, we fit a “reduced” model that excludes $X_{k-m+1}, \ldots, X_k$. Denote the residual sum of squares for that reduced model by $\mathrm{ResSS}_{\mathrm{red}}$. Let

$$f := \frac{(\mathrm{ResSS}_{\mathrm{red}} - \mathrm{ResSS})/m}{\mathrm{ResMS}}.$$

We reject $H_0: \beta_{k-m+1} = \cdots = \beta_k = 0$ if $f > f_{m,\,n-(k+1),\,1-\alpha}$.
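The statistic can be sketched in Python as follows. The definition of ResMS is not shown in this excerpt; the sketch assumes it is the residual mean square of the full model, ResSS divided by $n - (k+1)$, which matches the denominator degrees of freedom in the rejection rule. All numbers in the usage lines are hypothetical.

```python
def partial_f_statistic(res_ss_reduced, res_ss_full, m, n, k):
    """Partial f statistic for dropping m of the k explanatory variables.

    Assumes ResMS = ResSS / (n - (k + 1)) for the full model.
    """
    if not 1 <= m <= k:
        raise ValueError("m must be between 1 and k")
    residual_df = n - (k + 1)
    if residual_df <= 0:
        raise ValueError("need n > k + 1 observations")
    res_ms = res_ss_full / residual_df
    return ((res_ss_reduced - res_ss_full) / m) / res_ms


# Hypothetical sums of squares, for illustration only.
f = partial_f_statistic(res_ss_reduced=1200.0, res_ss_full=1000.0, m=2, n=54, k=3)
print(f)  # compare with the f quantile on m = 2 and n - (k + 1) = 50 degrees of freedom
```

A large value of `f` indicates that dropping the $m$ variables inflates the residual sum of squares by more than chance alone would explain.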

Example (the partial f-test). Refer to {MLRpartf.pdf}. Now we have ten explanatory variables: the three already introduced and seven new ones (ENERGY, TOTFAT, SATFAT, VEGFAT, ANIMFAT, CHOL, FIBER). The question is whether adding the seven new explanatory variables has improved our ability to predict total cholesterol. The result of the partial f-test is given on page 3. We have $f = 0.81$ with a corresponding p-value of 0.5871, so the answer to our question is “No”.

Special cases. The ANOVA test for model significance is actually a special case of the partial f-test with $m = k$. In this case, $\mathrm{ResSS}_{\mathrm{red}} = \mathrm{TotSS}$. Also, if $m = 1$, then the partial f-test is equivalent to the t-test described earlier in the sense that the same p-value is obtained.
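Both special cases can be seen at once in simple linear regression, where $m = k = 1$: the reduced model contains only the intercept, so its residual sum of squares is the total sum of squares, and the resulting f statistic equals the square of the t statistic for the slope. The Python sketch below checks this numerically on hypothetical data, using the standard simple-regression standard error $\mathrm{se}(b) = \sqrt{\mathrm{ResMS} / \sum_i (x_i - \bar x)^2}$.

```python
import math


def simple_regression_tests(xs, ys):
    """Return (t, f) for testing a zero slope in simple linear regression."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = y_bar - b * x_bar

    res_ss = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    tot_ss = sum((y - y_bar) ** 2 for y in ys)  # ResSS of the intercept-only model
    res_ms = res_ss / (n - 2)                   # n - (k + 1) with k = 1

    t = b / math.sqrt(res_ms / sxx)             # t statistic for the slope
    f = ((tot_ss - res_ss) / 1) / res_ms        # partial f statistic with m = 1
    return t, f


# Hypothetical data, for illustration only.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

t, f = simple_regression_tests(xs, ys)
print(f"t^2 = {t ** 2:.4f}, f = {f:.4f}")
```

Because $f = t^2$ and the f distribution on 1 and $n - 2$ degrees of freedom is the distribution of a squared t variable on $n - 2$ degrees of freedom, the two tests yield the same p-value.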

Estimation of the mean response and prediction

Estimation of the mean response. Suppose that, with specific numbers $x_1, \ldots, x_k$ in mind, we want to provide a confidence interval for $m(x_1, \ldots, x_k)$.

We will use laboratory time to pursue the discussion questions below and to work through the first eight pages of the computing handout. This handout is available from my web page as either {SAS930F06.ps} or {SAS930F06.pdf}.

Discussion questions

  1. Consider the fitted model

$$\hat y = 69.470 + 2.069 x_1 + 0.338 x_2 + 4.403 x_3.$$

Suppose that someone makes the following statement: “For every one-unit increase in body mass index, total cholesterol increases by 4.403 points.” Is this statement correct?

  2. Consider the fitted model

$$\hat y = 69.470 + 2.069 x_1 + 0.338 x_2 + 4.403 x_3.$$

Suppose that Person A has $x_1 = 6.4$ and $x_2 = 23.8$, while Person B has $x_1 = 6.9$ and $x_2 = 19.8$. If Person A and Person B have the same body mass index, for whom do we predict higher total cholesterol?

  3. What is Rosner’s mistake in step (1) of Equation 11.33?