Optimal Smoothing in Regression: Multivariate Gaussian Distribution, Slides of Advanced Data Analysis

The concept of smoothing in regression analysis, specifically the use of the multivariate gaussian distribution. How smoothing methods, such as k-nearest neighbors and kernel regression, help reduce error in regression analysis by averaging values and controlling bias and variance. The document also includes examples and code for implementing smoothing with the npreg function in r.

Typology: Slides

2010/2011

Uploaded on 11/02/2011

bridge
bridge 🇺🇸

4.9

(13)

287 documents

1 / 23

Toggle sidebar

This page cannot be seen from the preview

Don't miss anything!

bg1
Using Nonparametric Smoothing in Regression
36-402, Data Analysis
20 January 2011
Contents
1 How Much Should We Smooth? 1
2 Adapting to Unknown Roughness 2
3 Kernel Regression with Multiple Inputs 15
4 Interpreting Smoothers: Plots 20
A The Multivariate Gaussian Distribution 22
We are still talking about using smoothing methods in regression.
Recall the basic kind of smoothing we are interested in: we have a response
variable Y, some input variables which we bind up into a vector X, and a
collection of data values, (x1, y1),(x2, y2),...xn, yn). By “smoothing”, I mean
that predictions are going to be weighted averages of the observed responses in
the training data:
br(x) =
n
X
i=1
yiw(x, xi, h) (1)
Most smoothing methods have a control setting, which here I write h, that
determines how much smoothing we do. With knearest neighbors, for instance,
the weights are 1/k if xiis one of the k-nearest points to x, and w= 0 otherwise,
so large kmeans that each prediction is an average over many training points.
Similarly with kernel regression, where the degree of smoothing is controlled by
the bandwidth h.
Why do we want to do this? How do we pick how much smoothing to do?
1 How Much Should We Smooth?
When we smooth very little (h0), then we can match very small or sharp
aspects of the true regression function, if there are such. Smoothing less leads
to less bias. At the same time, each of our predictions is going to be an average
1
pf3
pf4
pf5
pf8
pf9
pfa
pfd
pfe
pff
pf12
pf13
pf14
pf15
pf16
pf17

Partial preview of the text

Download Optimal Smoothing in Regression: Multivariate Gaussian Distribution and more Slides Advanced Data Analysis in PDF only on Docsity!

Using Nonparametric Smoothing in Regression

36-402, Data Analysis

20 January 2011

Contents

1 How Much Should We Smooth? 1

2 Adapting to Unknown Roughness 2

3 Kernel Regression with Multiple Inputs 15

4 Interpreting Smoothers: Plots 20

A The Multivariate Gaussian Distribution 22 We are still talking about using smoothing methods in regression. Recall the basic kind of smoothing we are interested in: we have a response variable Y , some input variables which we bind up into a vector X, and a collection of data values, (x 1 , y 1 ), (x 2 , y 2 ),... xn, yn). By “smoothing”, I mean that predictions are going to be weighted averages of the observed responses in the training data:

̂ r(x) =

∑^ n

i=

yiw(x, xi, h) (1)

Most smoothing methods have a control setting, which here I write h, that determines how much smoothing we do. With k nearest neighbors, for instance, the weights are 1/k if xi is one of the k-nearest points to x, and w = 0 otherwise, so large k means that each prediction is an average over many training points. Similarly with kernel regression, where the degree of smoothing is controlled by the bandwidth h. Why do we want to do this? How do we pick how much smoothing to do?

1 How Much Should We Smooth?

When we smooth very little (h → 0), then we can match very small or sharp aspects of the true regression function, if there are such. Smoothing less leads to less bias. At the same time, each of our predictions is going to be an average

over (in effect) just a few observations, so it becomes noisier. So smoothing less increases the variance of our estimate. Since

(totalerror) = (noise) + (bias)^2 + (variance)

(see lecture 1, section 4.1), if we plot the different components of error as a function of h, we typically get something that looks like Figure 1. Because changing the amount of smoothing has opposite effects on the bias and the variance, there is an optimal amount of smoothing, where we can’t reduce one source of error without increasing the other. We therefore want to find that optimal amount of smoothing, which is where (as explained last time) cross- validation comes in. You should note, at this point, that the optimal amount of smoothing de- pends on (1) the real regression curve, (2) our smoothing method, and (3) how much data we have. This is because the variance contribution generally shrinks as we get more data.^1 If we get more data, we go from Figure 1 to Figure 2. The minimum of the over-all error curve has shifted to the left, and we should smooth less. Strictly speaking, parameters are properties of the data-generating process alone, so the optimal amount of smoothing is not really a parameter. If you do think of it as a parameter, you have the problem of why the “true” value changes as you get more data. It’s better thought of as a setting or control variable in the smoothing method, to be adjusted as convenient.

2 Adapting to Unknown Roughness

Consider Figure 3, which graphs two functions, f and g. Both are “smooth” functions in the qualitative, mathematical sense (C∞: they’re not only con- tinuous, their derivatives exist and are continuous to all orders). We could Taylor expand both functions to approximate their values anywhere, just from knowing enough derivatives at one point x 0.^2 Alternately, if instead of know- ing the derivatives at x 0 , we have the values of the functions at a sequence of points x 1 , x 2 ,... xn, we could use interpolation to fill out the rest of the curve. Quantitatively, however, f (x) is less smooth than g(x) — it changes much more rapidly, with many reversals of direction. For the same degree of inaccuracy in the interpolation f (·) needs more, and more closely spaced, training points xi than goes g(·). Now suppose that we don’t get to actually get to see f (x) and g(x), but rather just f (x) +  and g(x) + η, where  and η are noise. (To keep things simple I’ll assume they’re the usual mean-zero, constant-variance, IID Gaussian noises, say with σ = 0.15.) The data now look something like Figure 4. Can we now recover the curves? If we had multiple measurements at the same x, then we could recover the expectation value by averaging: since the regression curve r(x) = E [Y |X = x],

(^1) Sometimes bias changes as well. Noise does not (why?). (^2) Technically, a function whose Taylor series converges everywhere is analytic.

0.0 0.2 0.4 0.6 0.8 1.

Smoothing

Generalization error

Figure 2: Consequences of adding more data to the components of error: noise (dotted) and bias (dashed) are unchanged, but the new variance curve (dotted and dashed, black) is to the left of the old (greyed), so the new over-all er- ror curve (solid black) is lower, and has its minimum at a smaller amount of smoothing than the old (solid grey).

0.0 0.5 1.0 1.5 2.0 2.5 3.

-1.

x

f(x)

0.0 0.5 1.0 1.5 2.0 2.5 3.

x

g(x)

par(mfcol=c(2,1)) curve(sin(x)cos(20x),from=0,to=3,xlab="x",ylab=expression(f(x))) curve(log(x+1),from=0,to=3,xlab="x",ylab=expression(g(x)))

Figure 3: Two curves for the running example. Above, f (x); below, g(x). (As it happens, f (x) = sin x cos 20x, and g(x) = log x + 1, but that doesn’t really matter.)

if we had many observations at the same xi, the average of the corresponding yi would (by the law of large numbers) converge on r(x). Generally, however, we have at most one measurement per value of x, so simple averaging won’t work. Even if we just confine ourselves to the xi where we have observations, the mean-squared error will always be σ^2 , the noise variance. However, our estimate will be unbiased. What smoothing methods try to use is that we may have multiple measure- ments at points xi which are near the point of interest x. If the regression function is smooth, as we’re assuming it is, r(xi) will be close to r(x). Re- member that the mean-squared error is the sum of bias (squared) and variance. Averaging values at xi 6 = x is going to introduce bias, but averaging many in- dependent terms together also reduces variance. If by smoothing we get rid of more variance than we gain bias, we come out ahead. Here’s a little math to see it. Let’s assume that we can do a first-order Taylor expansion, so r(xi) ≈ r(x) + (xi − x)r′(x) (2)

and yi ≈ r(x) + (xi − x)r′(x) + i (3)

Now we average: to keep the notation simple, abbreviate the weight w(xi, x, h) by just wi.

̂ r(x) =

n

∑^ n

i=

yiwi

n

∑^ n

i=

(r(x) + (xi − x)r′(x) + i)wi

= r(x) +

∑^ n

i=

wii +

∑^ n

i=

wi(xi − x)r′(x)

̂ r(x) − r(x) =

∑^ n

i=

wii +

∑^ n

i=

wi(xi − x)r′(x)

E

[

(̂ r(x) − r(x))^2

]

= σ^2

∑^ n

i=

w i^2 + E

( (^) n ∑

i=

wi(xi − x)r′(x)

(Remember that:

wi = 1, that E [i] = 0, that the noise is uncorrelated with everything, and that E [i] = σ^2 .) The first term on the final right-hand side is variance, which will tend to shrink as n grows. (If wi = 1/n, the unweighted averaging case, we get back the familiar σ^2 /n.) The second term, on the other hand, is bias, which grows with how far the xi are from x, and the magnitude of the derivative, i.e., how smooth or wiggly the regression function is. For this to work, wi had better shrink as xi − x and r′(x) grow.^3 Finally, all else being equal, wi should also shrink with (^3) The higher derivatives of r also matter, since we should really be keeping more than just the first term in the Taylor expansion, but you get the idea.

n, so that the over-all size of the sum shrinks as we get more data. To illustrate, let’s try to estimate f (1.6) and g(1.6) from the noisy observa- tions. We’ll try a simple approach, just averaging all values of f (xi) + i and g(xi) + ηi for 1. 5 < xi < 1 .7 with equal weights. For f , this gives 0.46, while f (1.6) = 0.89. For g, this gives 0.98, with g(1.6) = 0.95. (See figure 5). The same size window introduces a much larger bias with the rougher, more rapidly changing f than with the smoother, more slowly changing g. Varying the size of the averaging window will change the amount of error, and it will change it in different ways for the two functions. If we look at the expression for the mean-squared error of the smoother, we can see that it’s quadratic in the weights wi. However, once we pick the smoother and take our data, the weights wi are all functions of h, the control setting which determines the degree of smoothing. So in principle there will be an optimal choice of h. We can find this through calculus — take the derivative of the MSE with respect to h (via the chain rule) and set it equal to zero — but the expression for the optimal h involves the derivative r′(x) of the regression function. Of course, if we knew the derivative of the regression function, we would basically know the function itself (just integrate), so we seem to be in a vicious circle, where we need to know the function before we can learn it. One way of expressing this is to talk about how well a smoothing procedure would work, if an Oracle were to tell us the derivative, or (to cut to the chase) the optimal bandwidth hopt. Since most of us do not have access to such oracles, we need to estimate hopt. Once we have this estimate, ̂h, then we get out weights and our predictions, and so a certain mean-squared error. Basically, our MSE will be the Oracle’s MSE, plus an extra term which depends on how far ̂h is to hopt, and how sensitive the smoother is to the choice of bandwidth. What would be really nice would be an adaptive procedure, one where our actual MSE, using ̂h, approaches the Oracle’s MSE, which it gets from hopt. This would mean that, in effect, we are figuring out how rough the underlying regression function is, and so how much smoothing to do, rather than having to guess or be told. An adaptive procedure, if we can find one, is a partial^4 substitute for prior knowledge. The most straight-forward way to pick a bandwidth, and one which generally manages to be adaptive, is in fact cross-validation; k-fold CV is usually some- what better than leave-one-out, but the latter often works acceptably too. The random-division CV would work in the usual way, going over a grid of possible bandwidths. Here is how it would work with the input variable being in the vector x (one dimensional) and the response in the vector y (one dimensional), and using the npreg function from the np library (?).^5 The return value has three parts. The first is the actual best bandwidth. The second is a vector which gives the cross-validated mean-squared mean- squared errors of all the different bandwidths in the vector bandwidths. The

(^4) Only partial, because we’d always do better if the Oracle would just tell us hopt. (^5) The np package actually has a function, npregbw, which automatically selects bandwidths through a sophisticated combination of cross-validation and optimization techniques. It tends to be very slow.

0.0 0.2 0.4 0.6 0.8 1.

Radius of averaging window

Absolute value of error

loc_ave_err <- function(h,y,y0) {abs(y0-mean(y[(1.6-h < x) & (1.6+h>x)]))} yf0=sin(1.6)cos(201.6) yg0=log(1+1.6) f.LAE = sapply(0:100/100,loc_ave_err,y=yf,y0=yf0) g.LAE = sapply(0:100/100,loc_ave_err,y=yg,y0=yg0) plot(0:100/100,f.LAE,xlab="Radius of averaging window", ylab="Absolute value of error",type="l") lines(0:100/100,g.LAE,lty=2) abline(h=0.15,col="grey")

Figure 6: Estimating f (1.6) and g(1.6) from averaging observed values at 1. 6 − h < x < 1 .6 + h, for different radii h. Solid line: error of estimates of f (1.6); dashed line: error of estimates of g(1.6); grey line: σ, the standard deviation of the noise.

Multi-fold cross-validation for univariate kernel regression

cv_bws_npreg <- function(x,y,bandwidths=(1:50)/50,num.folds=10) { require(np) n <- length(x) stopifnot(n> 1, length(y) == n) stopifnot(length(bandwidths) > 1) stopifnot(num.folds > 0, num.folds==trunc(num.folds))

fold_MSEs <- matrix(0,nrow=num.folds,ncol=length(bandwidths)) colnames(fold_MSEs) = bandwidths

case.folds <- rep(1:num.folds,length.out=n) case.folds <- sample(case.folds) for (fold in 1:num.folds) { train.rows = which(case.folds==fold) x.train = x[train.rows] y.train = y[train.rows] x.test = x[-train.rows] y.test = y[-train.rows] for (bw in bandwidths) { fit <- npreg(txdat=x.train,tydat=y.train, exdat=x.test,eydat=y.test,bws=bw) fold_MSEs[fold,paste(bw)] <- fit$MSE } } CV_MSEs = colMeans(fold_MSEs) best.bw = bandwidths[which.min(CV_MSEs)] return(list(best.bw=best.bw,CV_MSEs=CV_MSEs,fold_MSEs=fold_MSEs)) }

Code Example 1: Comments omitted here to save space; see the accompany- ing R file on the class website. The colnames trick: component names have to be character strings; other data types will be coerced into characters when we assign them to be names. Later, when we want to refer to a bandwidth column by its name, we wrap the name in another coercing function, such as paste. — The default vector of default bandwidths is pretty arbitrary; one could do better.

0.0 0.1 0.2 0.3 0.4 0.

Bandwidth

Root CV MSE

fbws <- cv_bws_npreg(x,yf,bandwidths=(1:100)/200) gbws <- cv_bws_npreg(x,yg,bandwidths=(1:100)/200) plot(1:100/200,sqrt(fbws$CV_MSEs),xlab="Bandwidth",ylab="Root CV MSE", type="l",ylim=c(0,0.6)) lines(1:100/200,sqrt(gbws$CV_MSEs),lty=2) abline(h=0.15,col="grey")

Figure 7: Cross-validated estimate of the (root) mean-squard error as a function of the bandwidth. Solid curve: data from f (x); dashed curve: data from g(x); grey line: true σ. Notice that the rougher curve is more sensitive to the choice of bandwidth, and that the smoother curve is more predictable at every choice of bandwidth. Also notice that CV does not completely compensate for the optimism of in-sample fitting (see where the dashed curve falls below the grey line). CV selects bandwidths of 0.015 for f and 0.165 for g.

0.0 0.5 1.0 1.5 2.0 2.5 3.

-1.

x

f(x

)^ +

ε

0.0 0.5 1.0 1.5 2.0 2.5 3.

x

g(

x)

η

x.ord=order(x) par(mfcol=c(2,1)) plot(x,yf,xlab="x",ylab=expression(f(x)+epsilon)) lines(x[x.ord],fitted(npreg(bws=fbws$best.bw,txdat=x,tydat=yf))[x.ord],lwd=4) curve(sin(x)cos(20x),col="grey",add=TRUE,lwd=2) plot(x,yg,xlab="x",ylab=expression(g(x)+eta)) lines(x[x.ord],fitted(npreg(bws=fbws$best.bw,txdat=x,tydat=yg))[x.ord],lwd=4) curve(log(x+1),col="grey",add=TRUE,lwd=2)

Figure 8: Data from the running examples (circles), true regression functions (grey) and kernel estimates of regression functions with CV-selected bandwidths (black). The widths of the regression functions are exaggerated. Since the x values aren’t sorted, we need to put them in order if we want to draw lines connecting the fitted values; then we need to put the fitted values in the same order. An alternative would be to use predict on the sorted values, as in the next section.

the others, cross-validation will tend to give them the maximum possible band- width, and smooth away their influence. We will look, later, at formal tests based on this idea.) Whether we use product kernels or some multivariate distribution including correlations, kernel regression will recover almost any regression function. This is true even when the true regression function involves lots of interactions among the input variables, perhaps in complicated forms that would be very hard to express in linear regression. For instance, Figure 9 shows a contour plot of a reasonably complicated regression surface, at least if one were to write it as polynomials in x^1 and x^2 , which would be the usual approach. Figure 11 shows the estimate we get with a product of Gaussian kernels and only 1000 noisy data points. It’s not perfect, of course (in particular the estimated contours aren’t as perfectly smooth and round as the true ones), but the important thing is that we got this without having to know, and describe in analytic geometry, the type of shape we were looking for. Kernel smoothing discovered the right general form.

0

1

2

3

0

1

2

3

x^1

x^2

y

x.points <- seq(-3,3,length.out=100) y.points <- x.points xygrid <- expand.grid(x=x.points,y=y.points) z <- matrix(0,nrow=100,ncol=100) for (i in 1:100) { for (j in 1:100) { z[i,j] <- f(x.points[i],y.points[j]) } } library(lattice) wireframe(z~xygrid$x*xygrid$y,scales=list(arrows=FALSE),xlab=expression(x^1), ylab=expression(x^2),zlab="y")

Figure 9: An example of a regression surface that would be very hard to learn by piling together interaction terms in a linear regression framework. (Can you guess what the function f is?) — wireframe is from the graphics library lattice.

0

1

2

3

0

1

2

3

x^1

x^2

y

noise.np <- npreg(z~x+y,data=noise) z.out <- matrix(0,100,100) for (i in 1:100) { for (j in 1:100) { z.out[i,j] <- predict(noise.np,newdata=data.frame(x=x.points[i], y=y.points[j])) } } wireframe(z.out~xygrid$x*xygrid$y,scales=list(arrows=FALSE), xlab=expression(x^1),ylab=expression(x^2),zlab="y")

Figure 11: Gaussian kernel regression of the points in Figure 10. Notice that the estimated function will make predictions are arbitrary points, not just the places where there was training data.

4 Interpreting Smoothers: Plots

In a linear regression without interactions, it is fairly easy to interpret the coefficients. The expected response changes by βi for a one-unit change in the ith^ input variable. The coefficients are also the derivatives of the expected response with respect to the inputs. And it is easy to draw pictures of how the output changes as the inputs are varied, though the pictures are somewhat boring (straight lines or planes). As soon as we introduce interactions, all this becomes harder, even for para- metric regression. If there is an interaction between two components of the input, say x^1 and x^2 , then we can’t talk about the change in the expected re- sponse for a one-unit change in x^1 without saying what x^2 is. We might average over x^2 values, and we’ll see next time a reasonable way of doing this, but the flat statement “increasing x^1 by one unit increases the response by β 1 ” is just false, no matter what number we fill in for β 1. Likewise for derivatives; we’ll come back to them next time as well. What about pictures? If there are only two input variables, then we can make plots like the wireframes in the previous section, or contour- or level- plots, which will show the predictions for different combinations of the two variables. But suppose we want to look at one variable at a time? Suppose there are more than two input variables? A reasonable way of producing a curve for each input variable is to set all the others to some “typical” value, such as the mean or the median, and to then plot the predicted response as a function of the one remaining variable of interest. See Figure 12 for an example of this. Of course, when there are interactions, changing the values of the other inputs will change the response to the input of interest, so it may be a good idea to produce a couple of curves, possibly super-imposed (again, see Figure 12). If there are three or more input variables, we can look at the interactions of any two of them, taken together, by fixing the others and making three- dimensional or contour plots, along the same principles. The fact that smoothers don’t give us a simple story about how each input is associated with the response may seem like a disadvantage compared to using linear regression. Whether it really is a disadvantage depends on whether there really is a simple story to be told — and, if there isn’t, how big a lie you are prepared to tell in order to keep your story simple.