






These notes introduce simple and multiple linear regression models, in which the mean of a response variable is believed to depend on the values of one or more explanatory variables. They explain how to approximate the mean function with a linear function, the role of the intercept and slope coefficients, and how to test the significance of explanatory variables using ANOVA and t-tests. They also cover the calculation of fitted values and prediction intervals.







Lecture 1
The linear regression model
A simple scenario. Suppose that we have a continuous variable Y whose mean is believed to depend on the value of another continuous variable X. We refer to Y as a response variable (or outcome variable, dependent variable) and to X as an explanatory variable (or predictor, covariate, independent variable). For example, Y could be birthweight in grams and X could be gestational age in weeks. Let m(x) denote the mean of Y when X = x. So, m(37) would denote the mean birthweight among infants with a gestational age of 37 weeks. Note that some authors prefer to use $E[Y \mid X = x]$, $E[Y \mid x]$, or $\mu_{Y \mid x}$ instead of m(x). Suppose that we want to estimate m(x) from a data set $(x_1, y_1), \ldots, (x_n, y_n)$. Here $x_i$ is the value of X for the $i$th person on whom we have a measurement, while $y_i$ is the value of Y for this person.
Formulating a statistical model. If m(x) can be closely approximated by a linear function α + βx, then a complicated problem (estimating m(x) for every possible value of X) becomes much simpler (estimating two parameters, α and β). For this reason, we often assume that (Equation 11.1)
m(x) = α + βx.
Of course, Y does not have to equal m(x), so to finish formulating the statistical model we need to describe how Y differs from m(x). Typically this is done by specifying that (Equation 11.2)
$$Y_i = \alpha + \beta x_i + \varepsilon_i,$$
where the $\varepsilon_i$ are independent and identically distributed normal random variables with mean 0 and (unknown) variance $\sigma^2$.
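To make the model concrete, the following minimal sketch simulates responses from Equation 11.2. The function name `simulate_simple_regression` and all parameter values (intercept, slope, standard deviation, gestational ages) are hypothetical and chosen only for illustration; they do not come from any data set in these notes.

```python
import random

def simulate_simple_regression(xs, alpha, beta, sigma, seed=0):
    """Draw one response per x from Y_i = alpha + beta * x_i + eps_i."""
    rng = random.Random(seed)  # fixed seed so the draws are reproducible
    # The eps_i are independent N(0, sigma^2) draws.
    return [alpha + beta * x + rng.gauss(0.0, sigma) for x in xs]

# Hypothetical values for illustration only (not estimates from real data).
gestational_age = [34, 35, 36, 37, 38, 39, 40, 41]
birthweight = simulate_simple_regression(
    gestational_age, alpha=-3000.0, beta=170.0, sigma=400.0
)

for x, y in zip(gestational_age, birthweight):
    print(x, round(y))
```

Each simulated response scatters around the line α + βx. With sigma set to 0, the responses fall exactly on that line.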
Remarks on the statistical model. The above model (Equation 11.2) is often called a “simple linear regression model”. The word “simple” is meant to convey that there is only one explanatory variable. The basic idea of simple linear regression is to look at the scatterplot of $(x_1, y_1), \ldots, (x_n, y_n)$ and draw a line that best describes the trend in the data. Although we can perceive visually whether a line effectively describes the trend in the data, there is a mathematical procedure for choosing the best line. Identifying this best line is tantamount to estimating the parameters α and β. We call α the intercept (coefficient) and β the slope (coefficient). The intercept is the mean response when X = 0 (if it is meaningful to have X = 0). The slope is the change in the mean response associated with a one-unit increase in X. To see this, note that
m(x + 1) − m(x) = α + β(x + 1) − α − βx = β
for any x. A positive slope implies a tendency for large values of Y to accompany large values of X (Figure 11.3-a), a negative slope implies a tendency for small values of Y to accompany large values of X (Figure 11.3-b), and a zero slope implies no tendency in either direction (Figure 11.3-c).
Generalizing to multiple explanatory variables. Now suppose (more realistically!) that the mean of Y may depend on the values of several explanatory variables $X_1, \ldots, X_k$. Typically we think of $X_1, \ldots, X_k$ as continuous, but we can allow some (or all) of them to be dichotomous. Let $m(x_1, \ldots, x_k)$ or $E[Y \mid X_1 = x_1, \ldots, X_k = x_k]$ denote the mean of Y when $X_1 = x_1, \ldots, X_k = x_k$.
If the guesses are good, then
$$r_0 + r_1 x_{1,i} + \cdots + r_k x_{k,i} \approx \alpha + \beta_1 x_{1,i} + \cdots + \beta_k x_{k,i} = m(x_{1,i}, \ldots, x_{k,i}) \approx y_i,$$
so that $g(r_0, r_1, \ldots, r_k)$ is small. If the guesses are poor, then $g(r_0, r_1, \ldots, r_k)$ is large. Hence, we are led to choose $r_0, r_1, \ldots, r_k$ to make $g(r_0, r_1, \ldots, r_k)$ as small as possible. Such choices of $r_0, r_1, \ldots, r_k$ become our estimates of $\alpha, \beta_1, \ldots, \beta_k$ and are denoted either as $a, b_1, \ldots, b_k$ or $\hat{\alpha}, \hat{\beta}_1, \ldots, \hat{\beta}_k$. This estimation procedure is referred to as “least squares” and must be carried out with statistical software (unless k = 1).
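For the case k = 1, the minimizing values can be written down directly. The sketch below assumes that g is the usual sum of squared differences between each $y_i$ and the guess $r_0 + r_1 x_i$ (the definition of g is not visible in this excerpt). The function names and the toy data are hypothetical.

```python
def sum_of_squares(r0, r1, xs, ys):
    """g(r0, r1): sum of squared gaps between y_i and the guess r0 + r1 * x_i.

    Assumed form of g; the notes' own definition is not shown in this excerpt.
    """
    return sum((y - (r0 + r1 * x)) ** 2 for x, y in zip(xs, ys))

def least_squares_line(xs, ys):
    """Closed-form least-squares intercept a and slope b for k = 1."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b = sxy / sxx            # slope estimate
    a = y_bar - b * x_bar    # intercept estimate
    return a, b

# Toy data, invented for illustration.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

a, b = least_squares_line(xs, ys)
print(a, b)
print(sum_of_squares(a, b, xs, ys))        # smallest attainable value of g
print(sum_of_squares(a + 0.1, b, xs, ys))  # any other guess does worse
```

Moving either estimate away from (a, b) increases g, which is what "least squares" means here.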
Fitted values and fitted model. Given specific numbers $x_1, \ldots, x_k$, let
$$\hat{y} := a + b_1 x_1 + \cdots + b_k x_k.$$
We refer to $\hat{y}$ as a “fitted value” or a “predicted value”. These two terms are used interchangeably. However, “fitted value” seems more appropriate when we think of $\hat{y}$ as an estimate of $m(x_1, \ldots, x_k)$, whereas “predicted value” seems preferable when we think of $\hat{y}$ as a prediction for a not-yet-observed response. If $x_1, \ldots, x_k$ are not specified, then the full equation
$$\hat{y} = a + b_1 x_1 + \cdots + b_k x_k$$
is called a “fitted model”.
Example (fitted values and fitted model). Refer to {MLRExample.pdf}. The response variable, which we will denote as Y, is total cholesterol (TC). There are three explanatory variables, which we will denote as $X_1$ through $X_3$: polyunsaturated fat intake (POLYFAT), alcohol intake (ALCOHOL), and body mass index (BMI). In the “Parameter Estimates” box on page 1, we see that $a = 69.470$, $b_1 = 2.069$, $b_2 = 0.338$, and $b_3 = 4.403$. With $x_1 = 6.4$, $x_2 = 23.8$, and $x_3 = 26.7$, as for the first subject in {Cholesterol.xls}, we have
$$\hat{y} = 69.470 + 2.069(6.4) + 0.338(23.8) + 4.403(26.7) = 208.3,$$
as shown in the “Predicted Value” column on page 2 of {MLRExample.pdf}. The fitted model is
$$\hat{y} = 69.470 + 2.069 x_1 + 0.338 x_2 + 4.403 x_3.$$
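The arithmetic in this example can be reproduced with a few lines of code. The coefficients and the first subject's values are those quoted above; the helper name `fitted_value` is hypothetical.

```python
def fitted_value(intercept, slopes, xs):
    """Evaluate y-hat = a + b_1 * x_1 + ... + b_k * x_k."""
    return intercept + sum(b * x for b, x in zip(slopes, xs))

a = 69.470
slopes = [2.069, 0.338, 4.403]     # POLYFAT, ALCOHOL, BMI
first_subject = [6.4, 23.8, 26.7]  # x_1, x_2, x_3 for the first subject

y_hat = fitted_value(a, slopes, first_subject)
print(round(y_hat, 1))  # 208.3
```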
Inferences in linear regression
ANOVA test for model significance. Just because we go to the trouble of fitting a linear regression model doesn’t mean that we are going to be satisfied with the results. To begin with, we may ask whether the regression model even explains a significant portion of the variability in the response. We can address this by decomposing a gross measure of variability in the response (“total sum of squares”) into parts that represent variability accounted for by the model (“regression sum of squares”) and variability not accounted for by the model (“residual sum of squares”):
$$\sum_{i=1}^{n} (y_i - \bar{y})^2 = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2 + \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$
or
Tot SS = Reg SS + Res SS.
The regression sum of squares is large when the $\hat{y}_i$ differ considerably from $\bar{y}$ (i.e., knowledge of the $x_{1,i}, \ldots, x_{k,i}$ permits us to make better “predictions”
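The decomposition Tot SS = Reg SS + Res SS can be checked numerically. The sketch below does so for a least-squares line with one explanatory variable; the toy data and the helper name `fit_line` are hypothetical.

```python
def fit_line(xs, ys):
    """Least-squares intercept and slope for one explanatory variable."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b = sxy / sxx
    return y_bar - b * x_bar, b

# Toy data, invented for illustration.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

a, b = fit_line(xs, ys)
y_bar = sum(ys) / len(ys)
fitted = [a + b * x for x in xs]

tot_ss = sum((y - y_bar) ** 2 for y in ys)               # total variability
reg_ss = sum((f - y_bar) ** 2 for f in fitted)           # accounted for by the model
res_ss = sum((y - f) ** 2 for y, f in zip(ys, fitted))   # not accounted for

print(tot_ss, reg_ss + res_ss)  # equal up to floating-point rounding
```

The identity holds for least-squares fits that include an intercept; for arbitrary guesses of the coefficients it generally fails.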
Example (the t-test for a partial slope coefficient). Refer to {MLRExample.pdf}. In the “Parameter Estimates” box on page 1, we see that $b_1 = 2.069$, $\mathrm{se}(b_1) = 0.914$, and $t = 2.26$. The corresponding p-value is 0.0282, so we reject $H_0: \beta_1 = 0$. Polyunsaturated fat intake is useful in predicting total cholesterol even after we control for alcohol intake and body mass index.
Confidence interval for a partial slope coefficient. The 100(1 − α)% confidence interval for $\beta_1$ is
$$b_1 \pm t_{n-(k+1),\,1-\alpha/2}\,\mathrm{se}(b_1).$$
Confidence intervals for $\beta_2, \ldots, \beta_k$ are constructed analogously.
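As a rough sketch, the t statistic and the confidence interval follow directly from the estimate and its standard error. The values of $b_1$ and $\mathrm{se}(b_1)$ are those quoted in the example. The critical value `t_crit = 2.0` is only a placeholder assumption, because the sample size n (and hence the degrees of freedom n − (k + 1)) is not stated in this excerpt; in practice it would be looked up for the actual degrees of freedom. The function names are hypothetical.

```python
def t_statistic(estimate, std_error):
    """t = b / se(b) for testing H0: beta = 0."""
    return estimate / std_error

def confidence_interval(estimate, std_error, t_crit):
    """b +/- t_crit * se(b), where t_crit is the 1 - alpha/2 quantile of t."""
    half_width = t_crit * std_error
    return estimate - half_width, estimate + half_width

b1, se_b1 = 2.069, 0.914  # from the "Parameter Estimates" box

t = t_statistic(b1, se_b1)
print(round(t, 2))  # 2.26, matching the reported t

# Placeholder critical value (assumption): replace with the t quantile
# for n - (k + 1) degrees of freedom once n is known.
t_crit = 2.0
lower, upper = confidence_interval(b1, se_b1, t_crit)
print(lower, upper)
```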
The partial f-test. Suppose that we want to know whether $X_{k-1}$ and $X_k$ are useful in predicting Y when we control for $X_1, \ldots, X_{k-2}$. At first glance it seems that we can reach a conclusion simply by carrying out the t-tests with respective null hypotheses $H_0: \beta_{k-1} = 0$ and $H_0: \beta_k = 0$. There are two problems with such an approach. First, what do we conclude if one null hypothesis is accepted and the other is rejected? Second, and more importantly, such an approach can fail in the following situation: knowing $X_k$ does not help if we already know $X_{k-1}$ (so $H_0: \beta_k = 0$ is accepted), knowing $X_{k-1}$ does not help if we already know $X_k$ (so $H_0: \beta_{k-1} = 0$ is accepted), but knowing at least one of $X_{k-1}$, $X_k$ is helpful. Hence, we need a way to test $H_0: \beta_{k-1} = \beta_k = 0$. In general, we need a way to test $H_0: \beta_{k-m+1} = \cdots = \beta_k = 0$, where m is between 1 and k. [Note that I could equally well pose the problem in terms of a null hypothesis involving any m partial slope coefficients.] The approach we will use is called the partial f-test; Rosner gives the formulation for m = 1 only in Equation 11.33, although he makes a subtle mistake in step (1). To carry out the partial f-test, we fit a “reduced” model that excludes $X_{k-m+1}, \ldots, X_k$. Denote the residual sum of squares for that reduced model by $\mathrm{ResSS}_{\mathrm{red}}$. Let
$$f := \frac{(\mathrm{ResSS}_{\mathrm{red}} - \mathrm{ResSS})/m}{\mathrm{ResMS}}.$$
We reject $H_0: \beta_{k-m+1} = \cdots = \beta_k = 0$ if $f > f_{m,\,n-(k+1),\,1-\alpha}$.
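The statistic is simple to compute once both models have been fitted. The sketch below assumes the usual definition ResMS = ResSS / (n − (k + 1)) for the full model, which is not restated in this excerpt. The function name `partial_f` and the numerical inputs are hypothetical and are not taken from {MLRpartf.pdf}.

```python
def partial_f(res_ss_reduced, res_ss_full, m, n, k):
    """Partial f statistic for dropping m of the k explanatory variables.

    Assumes ResMS = ResSS / (n - (k + 1)), the residual mean square of
    the full model with k explanatory variables and n observations.
    """
    res_ms = res_ss_full / (n - (k + 1))
    return ((res_ss_reduced - res_ss_full) / m) / res_ms

# Made-up inputs for illustration only.
f = partial_f(res_ss_reduced=520.0, res_ss_full=400.0, m=2, n=45, k=4)
print(f)  # compare with the 1 - alpha quantile of F with (m, n - (k + 1)) df
```

The reduced model can never fit better than the full model, so the numerator is nonnegative; a large f indicates that the excluded variables account for a substantial share of the variability.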
Example (the partial f-test). Refer to {MLRpartf.pdf}. Now we have ten explanatory variables: the three already introduced and seven new ones (ENERGY, TOTFAT, SATFAT, VEGFAT, ANIMFAT, CHOL, FIBER). The question is whether adding the seven new explanatory variables has improved our ability to predict total cholesterol. The result of the partial f-test is given on page 3. We have f = 0.81 with a corresponding p-value of 0.5871, so the answer to our question is “No”.
Special cases. The ANOVA test for model significance is actually a special case of the partial f-test with m = k. In this case, $\mathrm{ResSS}_{\mathrm{red}} = \mathrm{TotSS}$. Also, if m = 1, then the partial f-test is equivalent to the t-test described earlier in the sense that the same p-value is obtained.
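Both special cases can be seen at once in a simple regression, where k = m = 1. The reduced model then contains only the intercept, so its residual sum of squares is Tot SS, and the f statistic equals the square of the t statistic for the slope, which is why the two tests give the same p-value. The sketch below uses invented toy data and the standard formula se(b) = sqrt(ResMS / Sxx) for the simple-regression slope.

```python
import math

# Toy data, invented for illustration.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n
sxx = sum((x - x_bar) ** 2 for x in xs)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))

b = sxy / sxx          # least-squares slope
a = y_bar - b * x_bar  # least-squares intercept

tot_ss = sum((y - y_bar) ** 2 for y in ys)
res_ss = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
res_ms = res_ss / (n - 2)  # n - (k + 1) with k = 1

# Partial f-test with m = k = 1: the reduced model is intercept-only,
# so its residual sum of squares is Tot SS.
f = ((tot_ss - res_ss) / 1) / res_ms

# t-test for the slope.
se_b = math.sqrt(res_ms / sxx)
t = b / se_b

print(f, t ** 2)  # equal up to floating-point rounding
```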
Estimation of the mean response and prediction
Estimation of the mean response. Suppose that, with specific numbers $x_1, \ldots, x_k$ in mind, we want to provide a confidence interval for $m(x_1, \ldots, x_k)$.
We will use laboratory time to pursue the discussion questions below and to work through the first eight pages of the computing handout. This handout is available from my web page as either {SAS930F06.ps} or {SAS930F06.pdf}.
Discussion questions
$$\hat{y} = 69.470 + 2.069 x_1 + 0.338 x_2 + 4.403 x_3.$$
Suppose that someone makes the following statement: “For every one-unit increase in body mass index, total cholesterol increases by 4.403 points.” Is this statement correct?
$$\hat{y} = 69.470 + 2.069 x_1 + 0.338 x_2 + 4.403 x_3.$$
Suppose that Person A has $x_1 = 6.4$ and $x_2 = 23.8$, while Person B has $x_1 = 6.9$ and $x_2 = 19.8$. If Person A and Person B have the same body mass index, for whom do we predict higher total cholesterol?