Smoothing Splines: Confidence Bands and Mathematical Derivation, Slides of Advanced Data Analysis

How to find a good approximation to the true conditional expectation or regression function using smoothing splines. It covers the minimization problem of the spline objective function, the properties of smoothing splines, and the calculation of confidence bands. The text also includes the mathematical derivation of the smoothing spline equation.

Typology: Slides

2010/2011

Uploaded on 11/02/2011

bridge
bridge 🇺🇸

4.9

(13)

287 documents

1 / 16

Toggle sidebar

This page cannot be seen from the preview

Don't miss anything!

bg1
Lecture 11: Splines
36-402, Advanced Data Analysis
15 February 2011
Reading: Chapter 11 in Faraway; chapter 2, pp. 49–73 in Berk.
Contents
1 Smoothing by Directly Penalizing Curve Flexibility 1
1.1 The Meaning of the Splines . . . . . . . . . . . . . . . . . . . . . 3
2 An Example 5
2.1 Confidence Bands for Splines . . . . . . . . . . . . . . . . . . . . 7
3 Basis Functions 10
4 Splines in Multiple Dimensions 12
5 Smoothing Splines versus Kernel Regression 13
A Constraints, Lagrange multipliers, and penalties 14
1 Smoothing by Directly Penalizing Curve Flex-
ibility
Let’s go back to the problem of smoothing one-dimensional data. We imagine,
that is to say, that we have data points (x1, y1),(x2, y2), . . . (xn, yn), and we
want to find a function ˆr(x) which is a good approximation to the true condi-
tional expectation or regression function r(x). Previously, we rather indirectly
controlled how irregular we allowed our estimated regression curve to be, by
controlling the bandwidth of our kernels. But why not be more direct, and
directly control irregularity?
A natural way to do this, in one dimension, is to minimize the spline ob-
jective function
L(m, λ)1
n
n
X
i=1
(yim(xi))2+λZdx(m00(x))2(1)
1
pf3
pf4
pf5
pf8
pf9
pfa
pfd
pfe
pff

Partial preview of the text

Download Smoothing Splines: Confidence Bands and Mathematical Derivation and more Slides Advanced Data Analysis in PDF only on Docsity!

Lecture 11: Splines

36-402, Advanced Data Analysis

15 February 2011

Reading: Chapter 11 in Faraway; chapter 2, pp. 49–73 in Berk.

Contents

1 Smoothing by Directly Penalizing Curve Flexibility 1 1.1 The Meaning of the Splines..................... 3

2 An Example 5 2.1 Confidence Bands for Splines.................... 7

3 Basis Functions 10

4 Splines in Multiple Dimensions 12

5 Smoothing Splines versus Kernel Regression 13

A Constraints, Lagrange multipliers, and penalties 14

1 Smoothing by Directly Penalizing Curve Flex-

ibility

Let’s go back to the problem of smoothing one-dimensional data. We imagine, that is to say, that we have data points (x 1 , y 1 ), (x 2 , y 2 ),... (xn, yn), and we want to find a function ˆr(x) which is a good approximation to the true condi- tional expectation or regression function r(x). Previously, we rather indirectly controlled how irregular we allowed our estimated regression curve to be, by controlling the bandwidth of our kernels. But why not be more direct, and directly control irregularity? A natural way to do this, in one dimension, is to minimize the spline ob- jective function

L(m, λ) ≡

n

∑^ n

i=

(yi − m(xi))^2 + λ

dx(m′′(x))^2 (1)

Figure 1: A craftsman’s spline, from Wikipedia, s.v. “Flat spline”.

The first term here is just the mean squared error of using the curve m(x) to predict y. We know and like this; it is an old friend. The second term, however, is something new for us. m′′^ is the second deriva- tive of m with respect to x — it would be zero if m were linear, so this measures the curvature of m at x. The sign of m′′^ says whether the curvature is concave or convex, but we don’t care about that so we square it. We then integrate this over all x to say how curved m is, on average. Finally, we multiply by λ and add that to the MSE. This is adding a penalty to the MSE criterion — given two functions with the same MSE, we prefer the one with less average curvature. In fact, we are willing to accept changes in m that increase the MSE by 1 unit if they also reduce the average curvature by at least λ. The solution to this minimization problem,

rˆλ = argmin m

L(m, λ) (2)

is a function of x, or curve, called a smoothing spline, or smoothing spline function^1. (^1) The name “spline” actually comes from a simple tool used by craftsmen to draw smooth curves, which was a thin strip of a flexible material like a soft wood, as in Figure 1. (A few years ago, when the gas company dug up my front yard, the contractors they hired to put the driveway back used a plywood board to give a smooth, outward-curve edge to the new driveway. The “knots” metal stakes which the board was placed between, and the curve of the board was a spline, and they poured concrete to one side of the board, which they left standing until the concrete dried.) Bending such a material takes energy — the stiffer the material, the more energy has to go into bending it through the same shape, and so the straighter the curve it will make between given points. For smoothing splines, using a stiffer material corresponds to increasing λ.

even faster approximation called “generalized cross-validation” or GCV. The details of how to rapidly compute the LOOCV or GCV scores are not especially important for us, but can be found, if you want them, in many books, such as Simonoff (1996, §5.6.3).

2 An Example

The default R function for fitting a smoothing spline is called smooth.spline. The syntax is

smooth.spline(x, y, cv=FALSE)

where x should be a vector of values for input variable, y is a vector of values for the response (in the same order), and the switch cv controls whether to pick λ by generalized cross-validation (the default) or by leave-one-out cross-validation. The object which smooth.spline returns has an $x component, re-arranged in increasing order, a $y component of fitted values, a $yin component of original values, etc. See help(smooth.spline) for more. Figure 2 revisits the stock-market data from Homework 4 (and lecture 7). The vector sp contains the log-returns of the S & P 500 stock index on 2528 consecutive trading days; we want to use the log-returns on one day to predict what they will be on the next. The horizontal axis in the figure shows the log- returns for each of 2527 days t, and the vertical axis shows the corresponding log-return for the succeeding day t − 1q You saw in the homework that a linear model fitted to this data displays a slope of − 0 .0822 (grey line in the figure). Fitting a smoothing spline with cross-validation selects λ = 0.0513, and the black curve:

> sp.today <- sp[-length(sp)] > sp.tomorrow <- sp[-1] > sp.spline <- smooth.spline(x=sp[-length(sp)],y=sp[-1],cv=TRUE) Warning message: In smooth.spline(sp[-length(sp)], sp[-1], cv = TRUE) : crossvalidation with non-unique ’x’ values seems doubtful > sp.spline Call: smooth.spline(x = sp[-length(sp)], y = sp[-1], cv = TRUE)

Smoothing Parameter spar= 1.389486 lambda= 0.05129822 (14 iterations) Equivalent Degrees of Freedom (Df): 4. Penalized Criterion: 0. PRESS: 0. > sp.spline$lambda [1] 0.

(PRESS is the “prediction sum of squares”, i.e., the sum of the squared leave- one-out prediction errors. Also, the warning about cross-validation, while well- intentioned, is caused here by there being just two days with log-returns of zero.) This is the curve shown in black in the figure. The curves shown in blue are for large values of λ, and clearly approach the linear regression; the curves shown in red are for smaller values of λ. The spline can also be used for prediction. For instance, if we want to know what the return to expect following a day when the log return was +0.01,

> predict(sp.spline,x=0.01) $x [1] 0. $y [1] -0.

i.e., a very slightly negative log-return. (Giving both an x and a y value like this means that we can directly feed this output into many graphics routines, like points or lines.)

2.1 Confidence Bands for Splines

Continuing the example, the smoothing spline selected by cross-validation has a negative slope everywhere, like the regression line, but it’s asymmetric — the slope is more negative to the left, and then levels off towards the regression line. (See Figure 2 again.) Is this real, or might the asymmetry be a sampling artifact? We’ll investigate by finding confidence bands for the spline, much as we did in Lecture 8 for kernel regression. Again, we need to bootstrap, and we can do it either by resampling the residuals or resampling whole data points. Let’s take the latter approach, which assumes less about the data. We’ll need a simulator:

sp.frame <- data.frame(today=sp.today,tomorrow=sp.tomorrow) sp.resampler <- function() { n <- nrow(sp.frame) resample.rows <- sample(1:n,size=n,replace=TRUE) return(sp.frame[resample.rows,]) }

This treats the points in the scatterplot as a complete population, and then draws a sample from them, with replacement, just as large as the original. We’ll also need an estimator. What we want to do is get a whole bunch of spline curves, one on each simulated data set. But since the values of the input variable will change from one simulation to another, to make everything comparable we’ll evaluate each spline function on a fixed grid of points, that runs along the range of the data.

sp.spline.estimator <- function(data,m=300) {

Fit spline to data, with cross-validation to pick lambda

fit <- smooth.spline(x=data[,1],y=data[,2],cv=TRUE)

Set up a grid of m evenly-spaced points on which to evaluate the spline

eval.grid <- seq(from=min(sp.today),to=max(sp.today),length.out=m)

Slightly inefficient to re-define the same grid every time we call this,

but not a big overhead

Do the prediction and return the predicted values

return(predict(fit,x=eval.grid)$y) # We only want the predicted values }

This sets the number of evaluation points to 300, which is large enough to give visually smooth curves, but not so large as to be computationally unwieldly. Now put these together to get confidence bands:

sp.spline.cis <- function(B,alpha,m=300) { spline.main <- sp.spline.estimator(sp.frame,m=m)

Draw B boottrap samples, fit the spline to each

spline.boots <- replicate(B,sp.spline.estimator(sp.resampler(),m=m))

Result has m rows and B columns

cis.lower <- 2spline.main - apply(spline.boots,1,quantile,probs=1-alpha/2) cis.upper <- 2spline.main - apply(spline.boots,1,quantile,probs=alpha/2) return(list(main.curve=spline.main,lower.ci=cis.lower,upper.ci=cis.upper, x=seq(from=min(sp.today),to=max(sp.today),length.out=m))) }

The return value here is a list which includes the original fitted curve, the lower and upper confidence limits, and the points at which all the functions were evaluated. Figure 3 shows the resulting 95% confidence limits, based on B=1000 boot- strap replications. (Doing all the bootstrapping took 45 seconds on my laptop.) These are pretty clearly asymmetric in the same way as the curve fit to the whole data, but notice how wide they are, and how they get wider the further we go from the center of the distribution in either direction.

3 Basis Functions

This section is optional reading.

Splines, I said, are piecewise cubic polynomials. To see how to fit them, let’s think about how to fit a global cubic polynomial. We would define four basis functions,

B 1 (x) = 1 (3) B 2 (x) = x (4) B 3 (x) = x^2 (5) B 4 (x) = x^3 (6)

with the hypothesis being that the regression function is a weight sum of these,

r(x) =

∑^4

j=

βj Bj (x) (7)

That is, the regression would be linear in the transformed variable B 1 (x),... B 4 (x), even though it is nonlinear in x. To estimate the coefficients of the cubic polynomial, we would apply each basis function to each data point xi and gather the results in an n × 4 matrix B, Bij = Bj (xi) (8)

Then we would do OLS using the B matrix in place of the usual data matrix x:

βˆ = (BT^ B)−^1 BT^ y (9)

Since splines are piecewise cubics, things proceed similarly, but we need to be a little more careful in defining the basis functions. Recall that we have n values of the input variable x, x 1 , x 2 ,... xn. For the rest of this section, I will assume that these are in increasing order, because it simplifies the notation. These n “knots” define n + 1 pieces or segments n − 1 of them between the knots, one from −∞ to x 1 , and one from xn to +∞. A third-order polynomial on each segment would seem to need a constant, linear, quadratic and cubic term per segment. So the segment running from xi to xi+1 would need the basis functions

(^1) (xi,xi+1)(x), x (^1) (xi,xi+1)(x), x^21 (xi,xi+1)(x), x^31 (xi,xi+1)(x) (10)

where as usual the indicator function (^1) (xi,xi+1)(x) is 1 if x ∈ (xi, xi+1) and 0 otherwise. This makes it seem like we need 4(n + 1) = 4n + 4 basis functions. However, we know from linear algebra that the number of basis vectors we need is equal to the number of dimensions of the vector space. The number of dimensions for an arbitrary piecewise cubic with n+1 segments is indeed 4n+4, but splines are constrained to be smooth. The spline must be continuous, which

means that at each xi, the value of the cubic from the left, defined on (xi− 1 , xi), must match the value of the cubic from the right, defined on (xi, xi+1). This gives us one constraint per data point, reducing the dimensionality to at most 3 n + 4. Since the first and second derivatives are also continuous, we come down to a dimensionality of just n + 4. Finally, we know that the spline function is linear outside the range of the data, i.e., on (−∞, x 1 ) and on (xn, ∞), lowering the dimension to n. There are no more constraints, so we end up needing only n basis functions. And in fact, from linear algebra, any set of n piecewise cubic functions which are linearly independent^3 can be used as a basis. One common choice is

B 1 (x) = 1 (11) B 2 (x) = x (12)

Bi+2(x) =

(x − xi)^3 + − (x − xn)^3 + xn − xi

(x − xn− 1 )^3 + − (x − xn)^3 + xn − xn− 1

where (a)+ = a if a > 0, and = 0 otherwise. This rather unintuitive-looking basis has the nice property that the second and third derivatives of each Bj are zero outside the interval (x 1 , xn). Now that we have our basis functions, we can once again write the spline as a weighted sum of them,

m(x) =

∑^ m

j=

βj Bj (x) (14)

and put together the matrix B where Bij = Bj (xi). We can write the spline objective function in terms of the basis functions,

nL = (y − Bβ)T^ (y − Bβ) + nλβT^ Ωβ (15)

where the matrix Ω encodes information about the curvature of the basis func- tions:

Ωjk =

dxB j′′ (x)B′′ k (x) (16)

Notice that only the quadratic and cubic basis functions will contribute to Ω. With the choice of basis above, the second derivatives are non-zero on, at most, the interval (x 1 , xn), so each of the integrals in Ω is going to be finite. This is something we (or, realistically, R) can calculate once, no matter what λ is. Now we can find the smoothing spline by differentiating with respect to β:

0 = − 2 BT^ y + 2BT^ B βˆ + 2nλΩ βˆ (17) BT^ y = (BT^ B + nλΩ) βˆ (18) βˆ = (BT^ B + nλΩ)−^1 BT^ y (19)

Once again, if this were ordinary linear regression, the OLS estimate of the coefficients would be (xT^ x)−^1 xT^ y. In comparison to that, we’ve made two

(^3) Recall that vectors ~v 1 , ~v 2 ,... ~vd are linearly independent when there is no way to write any one of the vectors as a weighted sum of the others. The same definition applies to functions.

Doing all possible multiplications of one set of numbers or functions with another is said to give their outer product or tensor product, so this is known as a tensor product spline or tensor spline. We have to chose the number of terms to include for each variable (M 1 and M 2 ), since using n for each would give n^2 basis functions, and fitting n^2 coefficients to n data points is asking for trouble.

5 Smoothing Splines versus Kernel Regression

For one input variable and one output variable, smoothing splines can basically do everything which kernel regression can do^5. The advantages of splines are their computational speed and (once we’ve calculated the basis functions) sim- plicity, as well as the clarity of controlling curvature directly. Kernels however are easier to program (if slower to run), easier to analyze mathematically^6 , and extend more straightforwardly to multiple variables, and to combinations of discrete and continuous variables.

Further Reading

In addition to the sections in the textbooks mentioned on p. 1, there are good discussions of splines in Simonoff (1996, §5), Hastie et al. (2009, ch. 5) and Wasserman (2006, §5.5). The classic reference, by one of the people who really developed splines as a useful statistical tool, is Wahba (1990), which is great if you already know what a Hilbert space is and how to manipulate it.

References

Hastie, Trevor, Robert Tibshirani and Jerome Friedman (2009). The El- ements of Statistical Learning: Data Mining, Inference, and Prediction. Berlin: Springer, 2nd edn. URL http://www-stat.stanford.edu/~tibs/ ElemStatLearn/.

Simonoff, Jeffrey S. (1996). Smoothing Methods in Statistics. Berlin: Springer- Verlag.

Wahba, Grace (1990). Spline Models for Observational Data. Philadelphia: Society for Industrial and Applied Mathematics.

Wasserman, Larry (2006). All of Nonparametric Statistics. Berlin: Springer- Verlag.

(^5) In fact, as n → ∞, smoothing splines approach the kernel regression curve estimated with a specific, rather non-Gaussian kernel. See Simonoff (1996, §5.6.2). (^6) Most of the bias-variance analysis for kernel regression can be done with basic calculus, much as we did for kernel density estimation in lecture 6. The corresponding analysis for splines requires working in infinite-dimensional function spaces called “Hilbert spaces”. It’s a pretty theory, if you like that sort of thing.

A Constraints, Lagrange multipliers, and penal-

ties

Suppose we want to minimize^7 a function L(u, v) of two variables u and v. (It could be more, but this will illustrate the pattern.) Ordinarily, we know exactly what to do: we take the derivatives of L with respect to u and to v, and solve for the u∗, v∗^ which makes the derivatives equal to zero, i.e., solve the system of equations

∂L ∂u

∂L

∂v

If necessary, we take the second derivative matrix of L and check that it is positive. Suppose however that we want to impose a constraint on u and v, to de- mand that they satisfy some condition which we can express as an equation, g(u, v) = c. The old, unconstrained minimum u∗, v∗^ generally will not satisfy the constraint, so there will be a different, constrained minimum, say ˆu, ˆv. How do we find it? We could attempt to use the constraint to eliminate either u or v — take the equation g(u, v) = c and solve for u as a function of v, say u = h(v, c). Then L(u, v) = L(h(v, c), v), and we can minimize this over v, using the chain rule:

dL dv

∂L

∂v

∂L

∂u

∂h ∂v

which we then set to zero and solve for v. Except in quite rare cases, this is messy. A superior alternative is the method of Lagrange multipliers. We intro- duce a new variable λ, the Lagrange multiplier, and a new objective function, the Lagrangian,

L(u, v, λ) = L(u, v) + λ(g(u, v) − c) (25)

which we minimize with respect to λ, u and v and λ. That is, we solve

∂L ∂λ

∂L

∂u

∂L

∂v

Notice that minimize L with respect to λ always gives us back the constraint equation, because ∂ ∂λL = g(u, v) − c. Moreover, when the constraint is satisfied,

(^7) Maximizing L is of course just minimizing −L.

has to be the same as the one which minimizes

L(u, v) + λg(u, v) (33)

This is a penalized optimization problem. Changing the magnitude of the penalty λ corresponds to changing the level c of the constraint. Conversely, if we start with a penalized problem, it implicitly corresponds to a constraint on the value of the penalty function g(u, v). So, generally speaking, constrained optimization corresponds to penalized optimization, and vice versa. For splines, λ is the shadow price of curvature in units of mean squared error. Turned around, each value of λ corresponds to constraining the average curvature not to exceed some level.