The Truth About Linear Regression
36-402, Advanced Data Analysis
Prof. Cosma Shalizi
13 January 2011
Contents
1 Optimal Linear Prediction: Multiple Variables
1.1 Collinearity
1.2 Estimating the Optimal Linear Predictor
2 Shifting Distributions, Omitted Variables, and Transformations
2.1 Changing Slopes
2.1.1 R^2: Distraction or Nuisance?
2.2 Omitted Variables and Shifting Distributions
2.3 Errors in Variables
2.4 Transformation
3 Adding Probabilistic Assumptions
3.1 Examine the Residuals
4 Linear Regression Is Not the Philosopher’s Stone
5 Exercises
A Where the χ^2 Likelihood Ratio Test Comes From

We need to say some more about linear regression, and especially about how it really works and how it can fail. Linear regression is important because

  1. it’s a fairly straightforward technique which often works reasonably well;
  2. it’s a simple foundation for some more sophisticated techniques;
  3. it’s a standard method so people use it to communicate; and
  4. it’s a standard method so people have come to confuse it with prediction and even with causal inference as such.

We need to go over (1)–(3), and provide prophylaxis against (4).

In addition to the readings from last time, a very good resource on regression is Berk (2004). It omits technical details, but is superb on the high-level picture, and especially on what must be assumed in order to do certain things with regression, and what cannot be done under any assumption.

1 Optimal Linear Prediction: Multiple Variables

We have a response variable $Y$ and a $p$-dimensional vector of predictor variables or features $\vec{X}$. To simplify the book-keeping, we’ll take these to be centered; we can always un-center them later. We would like to predict $Y$ using $\vec{X}$. We saw last time that the best predictor we could use, at least in a mean-squared sense, is the conditional expectation,

$$ r(\vec{x}) = \mathbf{E}\left[ Y \mid \vec{X} = \vec{x} \right] \tag{1} $$

Let’s approximate $r(\vec{x})$ by a linear function of $\vec{x}$, say $\vec{x} \cdot \beta$. This is not an assumption about the world, but rather a decision on our part; a choice, not a hypothesis. This decision can be good (the approximation can be accurate) even if the linear hypothesis is wrong. One reason to think it’s not a crazy decision is that we may hope $r$ is a smooth function. If it is, then we can Taylor expand it about our favorite point, say $\vec{u}$:

$$ r(\vec{x}) = r(\vec{u}) + \sum_{i=1}^{p} \left. \frac{\partial r}{\partial x_i} \right|_{\vec{u}} (x_i - u_i) + O(\|\vec{x} - \vec{u}\|^2) \tag{2} $$

or, in the more compact vector calculus notation,

$$ r(\vec{x}) = r(\vec{u}) + (\vec{x} - \vec{u}) \cdot \nabla r(\vec{u}) + O(\|\vec{x} - \vec{u}\|^2) \tag{3} $$
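
As a quick numerical check of the expansion, here is a minimal sketch in one dimension. The function `r` (a square root) and the expansion point `u = 1` are illustrative assumptions, not anything fixed by the argument above. If the remainder really is second order, the approximation error divided by h^2 should settle down to a constant as the distance h from `u` shrinks.

```python
import math

def r(x):
    # Hypothetical smooth regression function, chosen only for illustration.
    return math.sqrt(x)

def r_prime(x):
    # Exact derivative of the illustrative r above.
    return 0.5 / math.sqrt(x)

def linear_approx(x, u):
    # First-order Taylor expansion of r about the point u.
    return r(u) + (x - u) * r_prime(u)

u = 1.0
errors = {}
for h in (0.4, 0.2, 0.1, 0.05):
    errors[h] = abs(r(u + h) - linear_approx(u + h, u))
    print(f"h = {h:4.2f}   |error| = {errors[h]:.6f}   error / h^2 = {errors[h] / h ** 2:.4f}")
```

Halving h cuts the error by roughly a factor of four, which is what a quadratic remainder predicts.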

If we only look at points $\vec{x}$ which are close to $\vec{u}$, then the remainder terms $O(\|\vec{x} - \vec{u}\|^2)$ are small, and a linear approximation is a good one. Of course there are lots of linear functions so we need to pick one, and we may as well do that by minimizing mean-squared error again:

$$ MSE(\beta) = \mathbf{E}\left[ \left( Y - \vec{X} \cdot \beta \right)^2 \right] \tag{4} $$

Going through the optimization is parallel to the one-dimensional case (see last handout), with the conclusion that the optimal β is

$$ \beta = \mathbf{V}^{-1} \mathrm{Cov}\left[ \vec{X}, Y \right] \tag{5} $$

where $\mathbf{V}$ is the covariance matrix of $\vec{X}$, i.e., $V_{ij} = \mathrm{Cov}[X_i, X_j]$, and $\mathrm{Cov}[\vec{X}, Y]$ is the vector of covariances between the predictor variables and $Y$, i.e. $\mathrm{Cov}[\vec{X}, Y]_i = \mathrm{Cov}[X_i, Y]$. Notice that if the input variables were uncorrelated, $\mathbf{V}$ is diagonal ($V_{ij} = 0$ unless $i = j$), and so is $\mathbf{V}^{-1}$. Then doing multiple regression breaks up into a sum of separate simple regressions across each input variable. In the general case, where $\mathbf{V}$ is not diagonal, we can think of it as de-correlating $\vec{X}$ ...

1.2 Estimating the Optimal Linear Predictor

To actually estimate $\beta$ from data, we need to make some probabilistic assumptions about where the data comes from. A comparatively weak but sufficient assumption is that observations $(\vec{X}_i, Y_i)$ are independent for different values of $i$, with unchanging covariances. Then if we look at the sample covariances, they will converge on the true covariances:

$$ \frac{1}{n} \mathbf{X}^T \mathbf{Y} \rightarrow \mathrm{Cov}\left[ \vec{X}, Y \right] $$

$$ \frac{1}{n} \mathbf{X}^T \mathbf{X} \rightarrow \mathbf{V} $$

where as before $\mathbf{X}$ is the data-frame matrix with one row for each data point and one column for each feature, and similarly for $\mathbf{Y}$. So, by continuity,

$$ \hat{\beta} = (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{Y} \rightarrow \beta \tag{6} $$

and we have a consistent estimator. On the other hand, we could start with the residual sum of squares

$$ RSS(\beta) \equiv \sum_{i=1}^{n} (y_i - \vec{x}_i \cdot \beta)^2 \tag{7} $$

and try to minimize it. The minimizer is the same $\hat{\beta}$ we got by plugging in the sample covariances. No probabilistic assumption is needed to do this, but it doesn’t let us say anything about the convergence of $\hat{\beta}$.

(One can also show that the least-squares estimate is the linear prediction with the minimax prediction risk. That is, its worst-case performance, when everything goes wrong and the data are horrible, will be better than any other linear method. This is some comfort, especially if you have a gloomy and pessimistic view of data, but other methods of estimation may work better in less-than-worst-case scenarios.)
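
The agreement between the plug-in estimate and the least-squares minimizer can be checked numerically. The following is a minimal sketch for p = 2 using only the Python standard library; the simulated data, the coefficients in `true_beta`, and the helper names are all assumptions made for illustration. It forms the sample versions of V and Cov[X, Y] from centered data, solves for the estimate with the explicit 2x2 inverse, and confirms that moving away from that estimate only raises the residual sum of squares.

```python
import random

random.seed(402)  # arbitrary seed, only for reproducibility
n = 5000
true_beta = (2.0, -1.0)  # hypothetical coefficients for the simulation

# Two correlated, mean-zero features and a noisy linear response.
rows = []
for _ in range(n):
    f1 = random.gauss(0, 1)
    f2 = 0.6 * f1 + random.gauss(0, 0.8)
    resp = true_beta[0] * f1 + true_beta[1] * f2 + random.gauss(0, 0.5)
    rows.append((f1, f2, resp))

def center(values):
    # Subtract the sample mean, matching the centering convention of the text.
    m = sum(values) / len(values)
    return [v - m for v in values]

x1, x2, y = (center(col) for col in zip(*rows))

def mean_prod(a, b):
    # Sample covariance of two centered columns: (1/n) * sum(a_i * b_i).
    return sum(ai * bi for ai, bi in zip(a, b)) / len(a)

# Sample versions of V and Cov[X, Y].
v11, v12, v22 = mean_prod(x1, x1), mean_prod(x1, x2), mean_prod(x2, x2)
c1, c2 = mean_prod(x1, y), mean_prod(x2, y)

# beta_hat = V^{-1} Cov[X, Y], using the explicit inverse of a 2x2 matrix.
det = v11 * v22 - v12 * v12
beta_hat = ((v22 * c1 - v12 * c2) / det, (v11 * c2 - v12 * c1) / det)

def rss(b):
    # Residual sum of squares for the coefficient pair b.
    return sum((yi - b[0] * a - b[1] * c) ** 2 for a, c, yi in zip(x1, x2, y))

print("beta_hat:", beta_hat)
print("RSS at beta_hat:", rss(beta_hat))
print("RSS after nudging either coefficient:",
      rss((beta_hat[0] + 0.05, beta_hat[1])),
      rss((beta_hat[0], beta_hat[1] - 0.05)))
```

With many features one would solve the linear system with a numerical library rather than write out the inverse by hand; the 2x2 case is used here only to keep the sketch dependency-free.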

2 Shifting Distributions, Omitted Variables, and Transformations

2.1 Changing Slopes

I said earlier that the best $\beta$ in linear regression will depend on the distribution of the predictor variable, unless the conditional mean is exactly linear. Here is an illustration. For simplicity, let’s say that $p = 1$, so there’s only one predictor variable. I generated data from $Y = \sqrt{X} + \epsilon$, with $\epsilon \sim N(0, 0.05^2)$ (i.e. the standard deviation of the noise was 0.05). Figure 1 shows the regression lines inferred from samples with three different distributions of $X$: the black points are $X \sim \mathrm{Unif}(0, 1)$, the blue are $X \sim N(0.5, 0.01)$ and the red $X \sim \mathrm{Unif}(2, 3)$. The regression lines are shown as colored solid lines; those from the blue and the black data are quite similar, and similarly wrong. The dashed black line is the regression line fitted to the complete data set. Finally, the light grey curve is the true regression function, $r(x) = \sqrt{x}$.
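
A minimal sketch of this kind of experiment, using only the Python standard library. It is not the code behind Figure 1: the sample size, the seed, and the helper names are assumptions, and the second argument of N(0.5, 0.01) is read as a variance (standard deviation 0.1), by analogy with how the noise distribution is written. The point is only that the fitted slope tracks where the X values happen to fall.

```python
import math
import random

random.seed(1)  # arbitrary seed for reproducibility

def simulate(draw_x, n=2000):
    # Draw n points from Y = sqrt(X) + noise, with noise sd 0.05.
    xs = [draw_x() for _ in range(n)]
    ys = [math.sqrt(x) + random.gauss(0, 0.05) for x in xs]
    return xs, ys

def ols_slope(xs, ys):
    # Simple-regression slope: sample Cov[X, Y] / sample Var[X].
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx

def positive_gauss(mu, sd):
    # N(mu, sd^2), redrawn in the (very rare) case of a non-positive value,
    # so that the square root is defined.
    x = random.gauss(mu, sd)
    while x <= 0:
        x = random.gauss(mu, sd)
    return x

designs = {
    "Unif(0, 1)": lambda: random.uniform(0, 1),
    "N(0.5, 0.01)": lambda: positive_gauss(0.5, 0.1),
    "Unif(2, 3)": lambda: random.uniform(2, 3),
}
slopes = {name: ols_slope(*simulate(draw)) for name, draw in designs.items()}
for name, slope in slopes.items():
    print(f"X ~ {name:13s} fitted slope = {slope:.3f}")
```

The first two slopes come out fairly close to each other and the third is much flatter, because the square root is steep near 0.5 and nearly flat between 2 and 3.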

2.1.1 R^2: Distraction or Nuisance?

This little set-up, by the way, illustrates that $R^2$ is not a stable property of the distribution either. For the black points, $R^2 = 0.92$; for the blue, $R^2 = 0.70$; for the red, $R^2 = 0.77$; and for the complete data, 0.96. Other sets of $x_i$ values would give other values for $R^2$. Note that while the global linear fit isn’t even a good approximation anywhere in particular, it has the highest $R^2$.

This kind of perversity can happen even in a completely linear set-up. Suppose now that $Y = aX + \epsilon$, and we happen to know $a$ exactly. The variance of $Y$ will be $a^2 \mathrm{Var}[X] + \mathrm{Var}[\epsilon]$. The amount of variance our regression “explains” (really, the variance of our predictions) will be $a^2 \mathrm{Var}[X]$. So

$$ R^2 = \frac{a^2 \mathrm{Var}[X]}{a^2 \mathrm{Var}[X] + \mathrm{Var}[\epsilon]} $$

This goes to zero as $\mathrm{Var}[X] \rightarrow 0$ and it goes to 1 as $\mathrm{Var}[X] \rightarrow \infty$. It thus has little to do with the quality of the fit, and a lot to do with how spread out the independent variable is. Notice also how easy it is to get a very high $R^2$ even when the true model is not linear!
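
To make the dependence on spread concrete, here is a small sketch that evaluates the population formula for R^2 in the linear set-up. The slope `a = 2` and the noise variance of 1 are hypothetical values; only the variance of X changes between rows, so the model is equally "correct" in every row.

```python
def population_r2(a, var_x, var_noise):
    # R^2 = a^2 Var[X] / (a^2 Var[X] + Var[noise]) for Y = a X + noise.
    explained = a ** 2 * var_x
    return explained / (explained + var_noise)

a, var_noise = 2.0, 1.0  # hypothetical values, held fixed throughout
r2_by_spread = {var_x: population_r2(a, var_x, var_noise)
                for var_x in (0.001, 0.1, 1.0, 10.0, 1000.0)}
for var_x, r2 in r2_by_spread.items():
    print(f"Var[X] = {var_x:8.3f}   R^2 = {r2:.4f}")
```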

2.2 Omitted Variables and Shifting Distributions

That the optimal regression coefficients can change with the distribution of the predictor features is annoying, but one could after all notice that the distribution has shifted, and so be cautious about relying on the old regression. More subtle is that the regression coefficients can depend on variables which you do not measure, and those can shift without your noticing anything. Mathematically, the issue is that

$$ \mathbf{E}\left[ Y \mid \vec{X} \right] = \mathbf{E}\left[ \mathbf{E}\left[ Y \mid Z, \vec{X} \right] \mid \vec{X} \right] $$

Now, if $Y$ is independent of $Z$ given $\vec{X}$, then the extra conditioning in the inner expectation does nothing and changing $Z$ doesn’t alter our predictions. But in general there will be plenty of variables $Z$ which we don’t measure (so they’re not included in $\vec{X}$) but which have some non-redundant information about the response (so that $Y$ depends on $Z$ even conditional on $\vec{X}$). If the distribution of $Z$ given $\vec{X}$ changes, then the optimal regression of $Y$ on $\vec{X}$ should change too.

Here’s an example. $X$ and $Z$ are both $N(0, 1)$, but with a positive correlation of 0.1. In reality, $Y \sim N(X + Z, 0.01)$. Figure 2 shows a scatterplot of all three variables together ($n = 100$). Now I change the correlation between $X$ and $Z$ to $-0.1$. This leaves both marginal distributions alone, and is barely detectable by eye (Figure 3). [2]

[2] I’m sure there’s a way to get R to combine the scatterplots, but it’s beyond me.
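
The effect of that shift on the regression of Y on X alone can be sketched in a few lines. This is a hedged illustration, not the code used for the figures: the sample size is far larger than the n = 100 of the plots so that the slopes are stable, and the seed and function name are assumptions. Since Cov[X, Y] = Var[X] + Cov[X, Z], the population slope is one plus the correlation, so it should move from about 1.1 to about 0.9 even though X itself looks the same in both regimes.

```python
import math
import random

def slope_y_on_x(rho, n=50000, seed=0):
    # X and Z are standard Gaussians with correlation rho; Y = X + Z + noise.
    # Z is never shown to the regression: it plays the omitted variable.
    rng = random.Random(seed)
    xs, ys = [], []
    for _ in range(n):
        x = rng.gauss(0, 1)
        z = rho * x + math.sqrt(1 - rho ** 2) * rng.gauss(0, 1)
        xs.append(x)
        ys.append(x + z + rng.gauss(0, 0.1))  # noise variance 0.01
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx

slope_pos = slope_y_on_x(+0.1)
slope_neg = slope_y_on_x(-0.1)
print(f"corr(X, Z) = +0.1: slope of Y on X = {slope_pos:.3f}")
print(f"corr(X, Z) = -0.1: slope of Y on X = {slope_neg:.3f}")
```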

Figure 2: Scatter-plot of response variable Y (vertical axis) and two variables which influence it (horizontal axes): X, which is included in the regression, and Z, which is omitted. X and Z have a correlation of +0.1. (Figure created using the cloud command in the package lattice.)

Figure 3: As in Figure 2, but shifting so that the correlation between X and Z is now −0.1, though the marginal distributions, and the distribution of Y given X and Z, are unchanged.

Figure 4: Joint distribution of X and Y from Figure 2 (black, with a positive correlation between X and Z) and from Figure 3 (blue, with a negative correlation between X and Z). Tick-marks on the axes show the marginal distributions, which are manifestly little-changed.

or not. Similarly, the reasoning last time about the actual regression function being the over-all optimal predictor, etc., is unaffected. If in the future we will continue to have $\vec{X}$ rather than $\vec{U}$ available to us for prediction, then Eq. 9 is irrelevant for prediction. Without better data, the relationship of $Y$ to $\vec{U}$ is just one of the unanswerable questions the world is full of, as much as “what song the sirens sang, or what name Achilles took when he hid among the women”.

Now, if you are willing to assume that $\vec{\eta}$ is a very nicely behaved Gaussian and you know its variance, then there are standard solutions to the error-in-variables problem for linear regression: ways of estimating the coefficients you’d get if you could regress $Y$ on $\vec{U}$. I’m not going to go over them, partly because they’re in standard textbooks, but mostly because the assumptions are hopelessly demanding. [4]

2.4 Transformation

Let’s look at a simple non-linear example, $Y \mid X \sim N(\log X, 1)$. The problem with smoothing data from this source on to a straight line is that the true regression curve isn’t very straight, $\mathbf{E}[Y \mid X = x] = \log x$. (Figure 5.) This suggests replacing the variables we have with ones where the relationship is linear, and then undoing the transformation to get back to what we actually measure and care about.

We have two choices: we can transform the response $Y$, or the predictor $X$. Here transforming the response would mean regressing $\exp Y$ on $X$, and transforming the predictor would mean regressing $Y$ on $\log X$. Both kinds of transformations can be worth trying, but transforming the predictors is, in my experience, often a better bet, for three reasons.

  1. Mathematically, $\mathbf{E}[f(Y)] \neq f(\mathbf{E}[Y])$. A mean-squared optimal prediction of $f(Y)$ is not necessarily close to the transformation of an optimal prediction of $Y$. And $Y$ is, presumably, what we really want to predict. (Here, however, it works out.)
  2. Imagine that $Y = \sqrt{X} + \log Z$. There’s not going to be any particularly nice transformation of $Y$ that makes everything linear; though there will be transformations of the features.
  3. This generalizes to more complicated models with features built from multiple covariates.

Figure 6 shows the effect of these transformations. Here transforming the predictor does, indeed, work out more nicely; but of course I chose the example so that it does so.

To expand on that last point, imagine a model like so:

$$ r(\vec{x}) = \sum_{j=1}^{q} c_j f_j(\vec{x}) \tag{10} $$

[4] Non-parametric error-in-variable methods are an active topic of research (Carroll et al., 2009).

Figure 6: Transforming the predictor (left column) and the response (right column) in the data from Figure 5, displayed in both the transformed coordinates (top row) and the original coordinates (middle row). The bottom row superimposes the two estimated curves (transformed X in black, transformed Y in blue). The true regression curve is always shown in grey.

If we know the functions $f_j$, we can estimate the optimal values of the coefficients $c_j$ by least squares: this is a regression of the response on new features, which happen to be defined in terms of the old ones. Because the parameters are outside the functions, that part of the estimation works just like linear regression. Models embraced under the heading of Eq. 10 include linear regressions with interactions between the independent variables (set $f_j = x_i x_k$, for various combinations of $i$ and $k$), and polynomial regression. There is however nothing magical about using products and powers of the independent variables; we could regress $Y$ on $\sin x$, $\sin 2x$, $\sin 3x$, etc.

To apply models like Eq. 10, we can either (a) fix the functions $f_j$ in advance, based on guesses about what should be good features for this problem; (b) fix the functions in advance by always using some “library” of mathematically convenient functions, like polynomials or trigonometric functions; or (c) try to find good functions from the data. Option (c) takes us beyond the realm of linear regression as such, into things like additive models. Later, after we have seen how additive models work, we’ll examine how to automatically search for transformations of both sides of a regression model.
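
As a minimal sketch of option (a), the code here regresses the response on a single hand-picked feature, using data simulated from the log example of this section. Drawing X uniformly on (0, 1], the sample size, the seed, and the helper names are assumptions for illustration. It compares the feature f(x) = x (a plain straight line) with the feature f(x) = log x, measuring each fitted curve by its mean squared gap from the true regression function.

```python
import math
import random

random.seed(10)  # arbitrary seed for reproducibility
n = 10000
xs = [1.0 - random.random() for _ in range(n)]       # uniform on (0, 1], so log is defined
ys = [math.log(x) + random.gauss(0, 1) for x in xs]  # Y | X ~ N(log X, 1)

def fit_line(features, responses):
    # Least-squares intercept and slope for a single feature.
    mf, mr = sum(features) / len(features), sum(responses) / len(responses)
    slope = (sum((f - mf) * (r - mr) for f, r in zip(features, responses))
             / sum((f - mf) ** 2 for f in features))
    return mr - slope * mf, slope

def curve_error(predict):
    # Mean squared gap between a fitted curve and the true regression log(x).
    return sum((predict(x) - math.log(x)) ** 2 for x in xs) / n

a_raw, b_raw = fit_line(xs, ys)                         # feature f(x) = x
a_log, b_log = fit_line([math.log(x) for x in xs], ys)  # feature f(x) = log x

err_raw = curve_error(lambda x: a_raw + b_raw * x)
err_log = curve_error(lambda x: a_log + b_log * math.log(x))
print(f"feature x:      y = {a_raw:.2f} + {b_raw:.2f} x      curve error = {err_raw:.4f}")
print(f"feature log x:  y = {a_log:.2f} + {b_log:.2f} log x  curve error = {err_log:.4f}")
```

The estimation step is identical in both fits; only the feature handed to it differs, which is the sense in which the parameters sit outside the functions.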

3 Adding Probabilistic Assumptions

The usual treatment of linear regression adds many more probabilistic assumptions. Specifically, the assumption is that

$$ Y \mid \vec{X} \sim N(\vec{X} \cdot \beta, \sigma^2) \tag{11} $$

with all $Y$ values being independent conditional on their $\vec{X}$ values. So now we are assuming that the regression function is exactly linear; we are assuming that at each $\vec{X}$ the scatter of $Y$ around the regression function is Gaussian; we are assuming that the variance of this scatter is constant; and we are assuming that there is no dependence between this scatter and anything else.

None of these assumptions was needed in deriving the optimal linear predictor. None of them is so mild that it should go without comment or without at least some attempt at testing.

Leaving that aside just for the moment, why make those assumptions? As you know from your earlier classes, they let us write down the likelihood of the observed responses $y_1, y_2, \ldots y_n$ (conditional on the covariates $\vec{x}_1, \ldots \vec{x}_n$), and then estimate $\beta$ and $\sigma^2$ by maximizing this likelihood. As you also know, the maximum likelihood estimate of $\beta$ is exactly the same as the $\beta$ obtained by minimizing the residual sum of squares. This coincidence would not hold in other models, with non-Gaussian noise.

We saw earlier that $\hat{\beta}$ is consistent under comparatively weak assumptions, i.e., that it converges to the optimal coefficients. But then there might, possibly, still be other estimators which are also consistent, but which converge faster. If we make the extra statistical assumptions, so that $\hat{\beta}$ is also the maximum likelihood estimate, we can lay that worry to rest. The MLE is generically (and certainly here!) asymptotically efficient, meaning that it converges as fast as any

These properties — Gaussianity, homoskedasticity, lack of correlation — are all testable properties. When they all hold, we say that the residuals are white noise. One would never expect them to hold exactly in any finite sample, but if you do test for them and find them strongly violated, you should be extremely suspicious of your model. These tests are much more important than checking whether the coefficients are significantly different from zero. Every time someone uses linear regression and does not test whether the residuals are white noise, an angel loses its wings.
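
What such checks might look like, in a deliberately crude form: the sketch here computes three summary numbers from the residuals of a simple regression, one for each property. These are rough diagnostics rather than formal tests, and the simulated data, seed, and function names are assumptions for illustration. One data set satisfies the Gaussian, constant-variance, independent-noise assumptions; in the other the noise standard deviation grows with x.

```python
import random

def fit_and_residuals(xs, ys):
    # Ordinary least squares with an intercept; returns the residuals.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    intercept = my - slope * mx
    return [y - intercept - slope * x for x, y in zip(xs, ys)]

def correlation(a, b):
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    sab = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    saa = sum((u - ma) ** 2 for u in a)
    sbb = sum((v - mb) ** 2 for v in b)
    return sab / (saa * sbb) ** 0.5

def diagnostics(xs, ys):
    res = fit_and_residuals(xs, ys)
    n = len(res)
    var = sum(e ** 2 for e in res) / n  # residuals average to zero by construction
    return {
        # Gaussian noise has excess kurtosis 0.
        "excess_kurtosis": sum(e ** 4 for e in res) / n / var ** 2 - 3,
        # Constant variance means squared residuals do not trend with x.
        "corr_sq_resid_x": correlation([e ** 2 for e in res], xs),
        # Uncorrelated noise means successive residuals are unrelated.
        "lag1_autocorr": correlation(res[:-1], res[1:]),
    }

rng = random.Random(3)  # arbitrary seed
xs = [rng.uniform(0, 1) for _ in range(5000)]
ys_ok = [1 + 2 * x + rng.gauss(0, 0.5) for x in xs]  # assumptions hold
ys_bad = [1 + 2 * x + rng.gauss(0, x) for x in xs]   # noise sd grows with x

diag_ok, diag_bad = diagnostics(xs, ys_ok), diagnostics(xs, ys_bad)
for name in diag_ok:
    print(f"{name:17s} white noise: {diag_ok[name]: .3f}   heteroskedastic: {diag_bad[name]: .3f}")
```

For the well-behaved data all three numbers sit near zero. For the second data set the first two move well away from zero, while the lag-one autocorrelation stays small, since those errors are still independent of one another.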

4 Linear Regression Is Not the Philosopher’s Stone

The philosopher’s stone, remember, was supposed to be able to transmute base metals (e.g., lead) into the perfect metal, gold (Eliade, 1971). Many people treat linear regression as though it had a similar ability to transmute a correlation matrix into a scientific theory. In particular, people often argue that:

  1. because a variable has a non-zero regression coefficient, it must influence the response;
  2. because a variable has a zero regression coefficient, it must not influence the response;
  3. if the independent variables change, we can predict how much the response will change by plugging in to the regression.

All of this is wrong, or at best right only under very particular circumstances.

We have already seen examples where influential variables have regression coefficients of zero. We have also seen examples of situations where a variable with no influence has a non-zero coefficient (e.g., because it is correlated with an omitted variable which does have influence). If there are no nonlinearities and if there are no omitted influential variables and if the noise terms are always independent of the predictor variables, are we good?

No. Remember from Equation 5 that the optimal regression coefficients depend on both the marginal distribution of the predictors and the joint distribution (covariances) of the response and the predictors. There is no reason whatsoever to suppose that if we change the system, this will leave the conditional distribution of the response alone.

A simple example may drive the point home. Suppose we surveyed all the cars in Pittsburgh, recording the maximum speed they reach over a week, and how often they are waxed and polished. I don’t think anyone doubts that there will be a positive correlation here, and in fact that there will be a positive regression coefficient, even if we add in many other variables as predictors. Let us even postulate that the relationship is linear (perhaps after a suitable transformation). Would anyone believe that waxing cars will make them go faster? Manifestly not; at best the causality goes the other way. But this is exactly how people interpret regressions in all kinds of applied fields: instead of saying waxing makes cars go faster, it might be saying that receiving targeted ads makes customers buy more, or that consuming dairy foods makes diabetes progress faster, or.... Those claims might be true, but the regressions could easily come out the same way if the claims were false. Hence, the regression results provide little or no evidence for the claims.

Similar remarks apply to the idea of using regression to “control for” extra variables. If we are interested in the relationship between one predictor, or a few predictors, and the response, it is common to add a bunch of other variables to the regression, both to check whether the apparent relationship might be due to correlations with something else, and to “control for” those other variables. The regression coefficient is then interpreted as how much the response would change, on average, if the independent variable were increased by one unit, “holding everything else constant”. There is a very particular sense in which this is true: it’s a prediction about the changes in the conditional expectation of the response (conditional on the given values for the other predictors), assuming that observations are randomly drawn from the same population we used to fit the regression.

In a word, what regression does is probabilistic prediction. It says what will happen if we keep drawing from the same population, but select a sub-set of the observations, namely those with given values of the independent variables. A causal or counter-factual prediction would say what would happen if we (or Someone) made those variables take on those values. There may be no difference between selection and intervention, in which case regression can work as a tool for causal inference [6]; but in general there is. Probabilistic prediction is a worthwhile endeavor, but it’s important to be clear that this is what regression does.

Every time someone thoughtlessly uses regression for causal inference, an angel not only loses its wings, but is cast out of Heaven and falls in most extreme agony into the everlasting fire.

[6] In particular, if we assign values of the independent variables in a way which breaks possible dependencies with omitted variables and noise (either by randomization or by experimental control), then regression can, in fact, work for causal inference.
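
The gap between selection and intervention can be sketched with a toy version of the car example. Everything about the model here is a made-up assumption for illustration: a hidden `trait` of the owner drives both how often a car is waxed and how fast it is driven, and waxing has no effect on speed at all. In the observational regime the regression of speed on waxing still has a clearly positive slope; when waxing is assigned at random, the slope is close to zero.

```python
import random

def ols_slope(xs, ys):
    # Simple-regression slope: sample Cov[X, Y] / sample Var[X].
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx

rng = random.Random(6)  # arbitrary seed
n = 20000

# Hidden owner trait; it is omitted from both regressions.
trait = [rng.gauss(0, 1) for _ in range(n)]

# Observational regime: the trait influences waxing and speed separately.
# Waxing itself has NO effect on speed in this made-up model.
wax_obs = [t + rng.gauss(0, 1) for t in trait]
speed_obs = [2 * t + rng.gauss(0, 1) for t in trait]

# Interventional regime: waxing is assigned at random, independently of the
# trait, with the same variance as before; speed is generated the same way.
wax_rand = [rng.gauss(0, 2 ** 0.5) for _ in range(n)]
speed_rand = [2 * t + rng.gauss(0, 1) for t in trait]

slope_obs = ols_slope(wax_obs, speed_obs)
slope_rand = ols_slope(wax_rand, speed_rand)
print(f"observational slope of speed on waxing: {slope_obs:.3f}")
print(f"randomized slope of speed on waxing:    {slope_rand:.3f}")
```

Both regressions are perfectly good probabilistic predictions for their own populations; only the randomized one says anything about what waxing does.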

A Where the χ^2 Likelihood Ratio Test Comes From

This appendix is optional. Here is a very hand-wavy explanation for Eq. 12. We’re assuming that the true parameter value, call it $\theta$, lies in the restricted class of models $\omega$. So there are $q$ components to $\theta$ which matter, and the other $p - q$ are fixed by the constraints defining $\omega$. To simplify the book-keeping, let’s say those constraints are all that the extra parameters are zero, so $\theta = (\theta_1, \theta_2, \ldots \theta_q, 0, \ldots 0)$, with $p - q$ zeroes at the end. The restricted MLE $\hat{\theta}$ obeys the constraints, so

$$ \hat{\theta} = (\hat{\theta}_1, \hat{\theta}_2, \ldots \hat{\theta}_q, 0, \ldots 0) \tag{13} $$

The unrestricted MLE does not have to obey the constraints, so it’s

$$ \hat{\Theta} = (\hat{\Theta}_1, \hat{\Theta}_2, \ldots \hat{\Theta}_q, \hat{\Theta}_{q+1}, \ldots \hat{\Theta}_p) \tag{14} $$

Because both MLEs are consistent, we know that $\hat{\theta}_i \rightarrow \theta_i$, $\hat{\Theta}_i \rightarrow \theta_i$ if $1 \leq i \leq q$, and that $\hat{\Theta}_i \rightarrow 0$ if $q + 1 \leq i \leq p$.

Very roughly speaking, it’s the last extra terms which end up making $L(\hat{\Theta})$ larger than $L(\hat{\theta})$. Each of them tends towards a mean-zero Gaussian whose variance is $O(1/n)$, but their impact on the log-likelihood depends on the square of their sizes, and the square of a mean-zero Gaussian is, up to a scale factor, $\chi^2$-distributed with one degree of freedom. A whole bunch of factors cancel out, leaving us with a sum of $p - q$ independent $\chi^2_1$ variables, which has a $\chi^2_{p-q}$ distribution.
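
A small simulation can make this plausible before going through the algebra. The sketch assumes the simplest possible setting, which is not taken from the notes: two independent Gaussian samples with known unit variance, an unrestricted model with one free mean per sample (p = 2), and a restricted model that sets both means to zero (q = 0). It uses the standard likelihood-ratio statistic, twice the gap in maximized log-likelihood, and compares its simulated mean, variance, and upper tail with those of a chi-squared distribution with p − q = 2 degrees of freedom (mean 2, variance 4, 95th percentile about 5.991).

```python
import random

def log_lik(sample, mu):
    # Gaussian log-likelihood with known unit variance, up to an additive constant.
    return -0.5 * sum((x - mu) ** 2 for x in sample)

def lr_statistic(samples):
    # 2 * [L(unrestricted MLE) - L(restricted MLE)], where the restriction
    # sets every mean to zero and the unrestricted MLE is the sample mean.
    stat = 0.0
    for sample in samples:
        mle = sum(sample) / len(sample)
        stat += 2 * (log_lik(sample, mle) - log_lik(sample, 0.0))
    return stat

rng = random.Random(19)  # arbitrary seed
n, extra_params, reps = 50, 2, 4000
stats = []
for _ in range(reps):
    # Data are generated under the restricted model: all true means are zero.
    samples = [[rng.gauss(0, 1) for _ in range(n)] for _ in range(extra_params)]
    stats.append(lr_statistic(samples))

mean_stat = sum(stats) / reps
var_stat = sum((s - mean_stat) ** 2 for s in stats) / reps
frac_above = sum(s > 5.991 for s in stats) / reps

print(f"mean of statistic:     {mean_stat:.3f}  (chi-squared with 2 d.f.: 2)")
print(f"variance of statistic: {var_stat:.3f}  (chi-squared with 2 d.f.: 4)")
print(f"fraction above 5.991:  {frac_above:.3f}  (nominal: 0.05)")
```

In this Gaussian case the statistic is exactly chi-squared for every n; in general the result is only asymptotic.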

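In slightly more detail, we know that $L(\hat{\Theta}) \geq L(\hat{\theta})$, because the former is a maximum over a larger space than the latter. Let’s try to see how big the difference is by doing a Taylor expansion around $\hat{\Theta}$, which we’ll take out to second order.

$$
\begin{aligned}
L(\hat{\theta}) &\approx L(\hat{\Theta}) + \sum_{i=1}^{p} (\hat{\Theta}_i - \hat{\theta}_i) \left. \frac{\partial L}{\partial \theta_i} \right|_{\hat{\Theta}} + \frac{1}{2} \sum_{i=1}^{p} \sum_{j=1}^{p} (\hat{\Theta}_i - \hat{\theta}_i) \left. \frac{\partial^2 L}{\partial \theta_i \partial \theta_j} \right|_{\hat{\Theta}} (\hat{\Theta}_j - \hat{\theta}_j) \\
&= L(\hat{\Theta}) + \frac{1}{2} \sum_{i=1}^{p} \sum_{j=1}^{p} (\hat{\Theta}_i - \hat{\theta}_i) \left. \frac{\partial^2 L}{\partial \theta_i \partial \theta_j} \right|_{\hat{\Theta}} (\hat{\Theta}_j - \hat{\theta}_j)
\end{aligned}
\tag{15}
$$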

All the first-order terms go away, because $\hat{\Theta}$ is a maximum of the likelihood, and so the first derivatives are all zero there. Now we’re left with the second-order terms. Writing all the partials out repeatedly gets tiresome, so abbreviate $\partial^2 L / \partial \theta_i \partial \theta_j$ as $L_{,ij}$.

To simplify the book-keeping, suppose that the second-derivative matrix, or Hessian, is diagonal. (This should seem like a swindle, but we get the same conclusion without this supposition, only we need to use a lot more algebra: we diagonalize the Hessian by an orthogonal transformation.) That is, suppose $L_{,ij} = 0$ unless $i = j$. Now we can write

$$ L(\hat{\Theta}) - L(\hat{\theta}) \approx -\frac{1}{2} \sum_{i=1}^{p} (\hat{\Theta}_i - \hat{\theta}_i)^2 L_{,ii} \tag{16} $$

$$ 2\left[ L(\hat{\Theta}) - L(\hat{\theta}) \right] \approx -\sum_{i=1}^{q} (\hat{\Theta}_i - \hat{\theta}_i)^2 L_{,ii} - \sum_{i=q+1}^{p} (\hat{\Theta}_i)^2 L_{,ii} \tag{17} $$

At this point, we need a fact about the asymptotic distribution of maximum likelihood estimates: they’re generally Gaussian, centered around the true value, and with a shrinking variance that depends on the negative of the expected Hessian (per observation) evaluated at the true parameter value; this is called the Fisher information, $F$ or $I$. (Call it $F$.) If the Hessian is diagonal, then we can say that

$$ \hat{\Theta}_i \rightsquigarrow N(\theta_i, 1/nF_{ii}) $$

$$ \hat{\theta}_i \rightsquigarrow N(\theta_i, 1/nF_{ii}) \qquad 1 \leq i \leq q $$

$$ \hat{\theta}_i = 0 \qquad q + 1 \leq i \leq p $$

Also, $(1/n) L_{,ii} \rightarrow -F_{ii}$. Putting all this together, we see that each term in the second summation in Eq. 17 is (to abuse notation a little)

$$ -\frac{1}{nF_{ii}} (N(0, 1))^2 L_{,ii} \rightarrow \chi^2_1 \tag{18} $$

so the whole second summation has a $\chi^2_{p-q}$ distribution. The first summation, meanwhile, goes to zero because $\hat{\Theta}_i$ and $\hat{\theta}_i$ are actually strongly correlated, so their difference is $O(1/n)$, and their difference squared is $O(1/n^2)$. Since $L_{,ii}$ is only $O(n)$, that summation drops out.

A somewhat less hand-wavy version of the argument uses the fact that the MLE is really a vector, with a multivariate normal distribution which depends on the inverse of the Fisher information matrix:

$$ \hat{\Theta} \rightsquigarrow N(\theta, (1/n) F^{-1}) \tag{19} $$

Then, at the cost of more linear algebra, we don’t have to assume that the Hessian is diagonal.

References

Berk, Richard A. (2004). Regression Analysis: A Constructive Critique. Thousand Oaks, California: Sage.

Carroll, Raymond J., Aurore Delaigle and Peter Hall (2009). “Nonparametric Prediction in Measurement Error Models.” Journal of the American Statistical Association, 104: 993–1003. doi:10.1198/jasa.2009.tm07543.

Eliade, Mircea (1971). The Forge and the Crucible: The Origin and Structure of Alchemy. New York: Harper and Row.