




This document describes various methods for validating regression models, including calibration on the entire available data, validation with a time series segment of the predictand withheld from calibration, and cross-validation. It discusses the advantages and disadvantages of each method and provides equations for calculating validation statistics such as the sum of squares of errors, mean squared error, root mean squared error, and R-squared.





Regression R-squared, even if adjusted for the loss of degrees of freedom due to the number of predictors in the model, can give a misleading, overly optimistic view of the accuracy of prediction when the model is applied outside the calibration period. Application outside the calibration period is the rule rather than the exception in dendroclimatology. The calibration-period statistics are typically biased because the model is “tuned” for maximum agreement in the calibration period. Sometimes too large a pool of potential predictors is offered to the automated procedures that select the final predictors, which worsens the bias. Another possible problem is that the calibration period itself may be anomalous in terms of the relationships between the variables: modeled relationships may hold up for some periods of time but not for others.

It is advisable therefore to “validate” the regression model by testing it on data not used to fit the model. Several approaches to validation are available. Among these are cross-validation and split-sample validation. In cross-validation, a series of regression models is fit, each time deleting a different observation from the calibration set and using the model to predict the predictand for the deleted observation. The merged series of predictions for deleted observations is then checked for accuracy against the observed data. In split-sample validation, the model is fit to some portion of the data (say, the second half), and accuracy is measured on the predictions for the other half of the data. The calibration and validation periods are then exchanged and the process repeated.

In any regression problem it is also important to keep in mind that modeled relationships may not be valid for periods when the predictors are outside their ranges for the calibration period: the multivariate distribution of the predictors for some observations outside the calibration period may have no analog in the calibration period.
The distinction of predictions as extrapolations versus interpolations is useful in flagging such occurrences.
Validation strategies. Several alternative strategies for validation are available, and some may be better than others depending on the data and purpose of analysis. Three different ways of validating are:
1. Compare predictions made by the model with records of some proxy for the predictand.
   a) Calibrate on the entire available length of the overlap of predictors and predictand
   b) Apply the model to predict outside the calibration period
   c) Compare the predictions outside the calibration period with observations of some proxy for the predictand
   d) Pro: uses all available data for calibration; a long calibration time series generally gives a more stable model
   e) Con: validation is semi-qualitative
2. Validate the model with a time series segment of the predictand withheld from calibration.
   a) Calibrate on just a part of the period of overlap of predictors and predictand
   b) Apply the model to generate predictions for the data withheld from calibration
   c) Compare the predicted and observed predictand for the period withheld from calibration
   d) Use the model from (a) for final prediction
   e) Pro: the model validated is the same model used for final prediction
   f) Con: requires a long time series of the predictand if some data are to be sacrificed to validation
3. Validate the model by cross-validation.

Cross-validation methods might be referred to as “leave n out.” At one extreme, n is half the sample length. This type of cross-validation is split-sample validation. In split-sample validation the model is calibrated on some fraction (say, the first half) of the data and validated on the other fraction (Snee 1977). Then the calibration and validation periods are exchanged and the calibration and validation are done again. The final prediction model is then calibrated using the full available length of predictand data.
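The split-sample procedure can be sketched in a few lines. This is a minimal illustration for a single predictor, not the procedure of any particular package; the helper names `fit_line` and `split_sample_errors` and the data values are hypothetical.

```python
def fit_line(x, y):
    """Ordinary least squares for y = b0 + b1*x; returns (b0, b1)."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    return y_mean - b1 * x_mean, b1


def split_sample_errors(x, y):
    """Calibrate on each half in turn; return validation errors for the other half."""
    half = len(x) // 2
    early, late = slice(0, half), slice(half, len(x))
    results = {}
    for name, cal, val in (("calibrate_late", late, early),
                           ("calibrate_early", early, late)):
        b0, b1 = fit_line(x[cal], y[cal])
        # Validation error = observed minus predicted, for withheld years only
        results[name] = [yi - (b0 + b1 * xi) for xi, yi in zip(x[val], y[val])]
    return results


# Hypothetical tree-ring indices (x) and precipitation (y)
x = [0.6, 0.9, 1.1, 1.4, 0.8, 1.2, 1.0, 1.5]
y = [30.0, 41.0, 52.0, 60.0, 39.0, 55.0, 47.0, 66.0]

errors = split_sample_errors(x, y)
final_model = fit_line(x, y)  # final model is calibrated on the full record
```

Each list in `errors` holds the validation errors for one half of the record; the model actually used for prediction is `final_model`, fit to all the data.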
At the other extreme is “leave-one-out” cross-validation, which is equivalent to “cross-validation” as described by Michaelsen (1987) and to the predicted-residual-sum-of-squares procedure, or PRESS procedure, as described by Weisberg (1985). Say the full available period for calibration is n years long. Models are repeatedly estimated using data sets of n − 1 years, each time omitting a different observation from calibration and using the estimated model to generate a predicted value of the predictand for the deleted observation. At the end of this procedure, a time series of n predictions assembled from the deleted observations is compared with the observed predictand to compute validation statistics of model accuracy and error.
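The leave-one-out loop can be sketched as follows for a single predictor. The function names and the data are hypothetical and serve only to make the steps concrete.

```python
def fit_line(x, y):
    """Ordinary least squares for y = b0 + b1*x; returns (b0, b1)."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    return y_mean - b1 * x_mean, b1


def leave_one_out_predictions(x, y):
    """Predict each year from a model fit to the other n - 1 years."""
    predictions = []
    for i in range(len(x)):
        x_cal = x[:i] + x[i + 1:]   # drop observation i from calibration
        y_cal = y[:i] + y[i + 1:]
        b0, b1 = fit_line(x_cal, y_cal)
        predictions.append(b0 + b1 * x[i])
    return predictions


# Hypothetical data
x = [0.6, 0.9, 1.1, 1.4, 0.8, 1.2, 1.0, 1.5]
y = [30.0, 41.0, 52.0, 60.0, 39.0, 55.0, 47.0, 66.0]

loo_predictions = leave_one_out_predictions(x, y)
loo_errors = [yi - pi for yi, pi in zip(y, loo_predictions)]
```

The series `loo_predictions` has one value per year, and `loo_errors` is the series of validation errors from which the validation statistics are computed.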
Validation statistics. Validation statistics measure the error or accuracy of the prediction for the validation period. The statistics can generally be expressed as functions of just a few simple terms, or building blocks. We begin by defining the building blocks.
Validation errors. All of the statistics described here are computed as some function of the validation error, which is the difference of the observed and predicted values:
$$\hat{e}_{(i)} = y_i - \hat{y}_{(i)} \qquad (1)$$

where $y_i$ and $\hat{y}_{(i)}$ are the observed and predicted values of the predictand in year $i$, and the notation $(i)$ indicates that data for year $i$ were not used in fitting the model that generated the prediction $\hat{y}_{(i)}$.
Sum of squares of errors, validation ($\mathrm{SSE}_v$). $\mathrm{SSE}_v$ is the sum of the squared differences of the observed and predicted values:

$$\mathrm{SSE}_v = \sum_{i=1}^{n_v} \hat{e}_{(i)}^2 \qquad (2)$$

where the summation is over the $n_v$ years making up the validation period.
Mean squared error of validation ($\mathrm{MSE}_v$). $\mathrm{MSE}_v$ is the average squared error for the validation data, or the sum of squares of errors divided by the length of the validation period:

$$\mathrm{MSE}_v = \frac{\mathrm{SSE}_v}{n_v} \qquad (3)$$
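As a small worked example, the building blocks can be computed directly from paired observed and predicted values. The function name and the numbers are hypothetical; the root mean squared error is taken here as the square root of the mean squared error of validation.

```python
import math


def validation_stats(observed, predicted):
    """Return SSE_v, MSE_v and RMSE_v for a validation period."""
    errors = [o - p for o, p in zip(observed, predicted)]  # observed minus predicted
    sse_v = sum(e * e for e in errors)
    mse_v = sse_v / len(errors)
    return {"SSE_v": sse_v, "MSE_v": mse_v, "RMSE_v": math.sqrt(mse_v)}


# Hypothetical validation period of four years
stats = validation_stats([10.0, 12.0, 9.0, 11.0], [11.0, 10.0, 9.0, 13.0])
# errors are -1, 2, 0, -2, so SSE_v = 9.0, MSE_v = 2.25, RMSE_v = 1.5
```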
The similarity in form of the equations for $R^2$ and RE (equations (7) and (5)) suggests that RE be used as a validation equivalent of regression $R^2$, and that a value of RE “close to” the value of $R^2$ be considered as evidence of validation. The rationale for this comparison is easily seen for leave-one-out cross-validation. In both equations, the numerator is a sum of squares of prediction errors, and the denominator is the sum of squares of departures of the observed values of the predictand from a constant. For leave-one-out cross-validation the constant is equal to the calibration-period mean for both (5) and (7). This is so because for leave-one-out cross-validation the aggregate “validation” period is essentially the same as the calibration period: each year of the calibration period is individually and separately used as a validation period in the iterative cross-validation, and the aggregate of these validation years is the “validation period.”
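The comparison can be made concrete with a short sketch. It assumes the usual form of the reduction-of-error statistic, one minus the ratio of the validation sum of squared errors to the sum of squared departures of the observed values from the calibration-period mean; that form is an assumption consistent with the description above, and the function name is hypothetical.

```python
def reduction_of_error(observed, predicted, calibration_mean):
    """RE = 1 - SSE_v / SSE_ref (assumed form).

    SSE_ref is the sum of squared departures of the observed validation
    values from the calibration-period mean of the predictand.
    """
    sse_v = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    sse_ref = sum((o - calibration_mean) ** 2 for o in observed)
    return 1.0 - sse_v / sse_ref


observed = [1.0, 2.0, 3.0, 4.0]
re_perfect = reduction_of_error(observed, observed, 2.5)         # perfect predictions
re_no_skill = reduction_of_error(observed, [2.5] * 4, 2.5)       # always predict the mean
```

Under this form, predictions no better than the calibration mean give an RE of zero, and negative values indicate predictions worse than that reference.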
PRESS Statistic. PRESS is an acronym for “predicted residual sum of squares” (Weisberg 1985, p. 217). The PRESS procedure is equivalent to “leave-1-out” cross-validation, as described previously. The PRESS statistic is defined as
$$\mathrm{PRESS} = \sum_{i=1}^{n} \hat{e}_{(i)}^2 \qquad (8)$$
where $\hat{e}_{(i)}$ is the residual for observation $i$, computed as the difference between the observed value of the predictand and the prediction from a regression model calibrated on the set of $n - 1$ observations from which observation $i$ was excluded. The PRESS statistic is therefore identical to the sum of squares of residuals for validation, $\mathrm{SSE}_v$, defined in equation (2).
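PRESS does not require refitting the model n times. The sketch below relies on a standard least-squares identity that is not stated in the text: the deleted residual equals the ordinary residual divided by one minus the leverage $h_{ii}$. It is written for a single predictor, where $h_{ii} = 1/n + (x_i - \bar{x})^2 / S_{xx}$, and checks the shortcut against the brute-force loop. The function names and data are hypothetical.

```python
def fit_line(x, y):
    """Ordinary least squares for y = b0 + b1*x; returns (b0, b1)."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    return y_mean - b1 * x_mean, b1


def press_brute_force(x, y):
    """Refit the model n times, each time leaving one observation out."""
    total = 0.0
    for i in range(len(x)):
        b0, b1 = fit_line(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
        total += (y[i] - (b0 + b1 * x[i])) ** 2
    return total


def press_from_leverage(x, y):
    """Single fit; deleted residual = ordinary residual / (1 - h_ii)."""
    n = len(x)
    b0, b1 = fit_line(x, y)
    x_mean = sum(x) / n
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    total = 0.0
    for xi, yi in zip(x, y):
        h_ii = 1.0 / n + (xi - x_mean) ** 2 / sxx  # leverage, simple regression
        total += ((yi - (b0 + b1 * xi)) / (1.0 - h_ii)) ** 2
    return total


# Hypothetical data
x = [0.6, 0.9, 1.1, 1.4, 0.8, 1.2, 1.0, 1.5]
y = [30.0, 41.0, 52.0, 60.0, 39.0, 55.0, 47.0, 66.0]
press_slow = press_brute_force(x, y)
press_fast = press_from_leverage(x, y)
```

The two values agree to rounding error, so the shortcut is convenient when n is large.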
As described earlier, the automated entry of predictors into the regression equation runs the risk of over-fitting, as $R^2$ cannot decrease, and in practice increases, with each predictor entering the model. The adjusted $R^2$ is one alternative criterion to identify when to halt entry of predictors (e.g., Meko et al. 1980), but the adjusted $R^2$ has two major drawbacks. First, the theory behind adjusted $R^2$ assumes the predictors are independent, while in practice the predictors are often intercorrelated. Consequently, entry of an additional predictor does not necessarily mean the loss of one degree of freedom for estimation of the model. Second, the adjusted $R^2$ does not address the problem of selecting the predictors from a pool, sometimes a large pool, of potential predictors. If the pool of potential predictors is large, $R^2$ can be seriously biased (high), and the bias will not be accounted for by the adjustment for the number of variables in the model used by the algorithm for adjusted $R^2$ (Rencher and Pun 1980).

An alternative method of guarding against over-fitting the regression model is to use cross-validation as a guide for stopping the entry of additional predictors (Wilks 1995). By evaluating the performance of the model on data withheld from calibration at every step of the stepwise procedure, the level of complexity (number of predictors) above which the model is over-fit can be estimated. Graphs of change in calibration and validation accuracy statistics as a function of step in forward stepwise entry of predictors can be used as a guide for cutting off entry of predictors into the model. For example, in a graph of $\mathrm{RMSE}_v$ against step in a model run out to many steps (e.g., 10 steps), the step at which $\mathrm{RMSE}_v$ is minimized (or approximately so) can be set as the final step for the model. The same result would be achieved from a plot of RE against step, except that the maximum in RE indicates the “best” model.
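A cross-validation stopping rule can be sketched with NumPy as below. This is an illustration under stated assumptions rather than the procedure of any cited author: predictors enter in order of largest reduction in calibration sum of squares, the validation statistic is the leave-one-out root mean squared error, and the selection itself is not repeated inside each leave-one-out fold. All function names and the synthetic data are hypothetical.

```python
import numpy as np


def design(X, cols):
    """Design matrix with a leading column of ones for the constant."""
    return np.column_stack([np.ones(X.shape[0]), X[:, cols]])


def calibration_sse(X, y, cols):
    """Sum of squared calibration residuals for the chosen predictors."""
    A = design(X, cols)
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return float(np.sum((y - A @ coef) ** 2))


def loo_rmse(X, y, cols):
    """Leave-one-out root mean squared error for the chosen predictors."""
    A = design(X, cols)
    n = len(y)
    errors = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i
        coef, *_ = np.linalg.lstsq(A[keep], y[keep], rcond=None)
        errors[i] = y[i] - A[i] @ coef
    return float(np.sqrt(np.mean(errors ** 2)))


def forward_stepwise_with_cv(X, y, max_steps):
    """Enter predictors one at a time; record validation RMSE at each step."""
    chosen, remaining, path = [], list(range(X.shape[1])), []
    for _ in range(min(max_steps, len(remaining))):
        best = min(remaining, key=lambda j: calibration_sse(X, y, chosen + [j]))
        chosen.append(best)
        remaining.remove(best)
        path.append((list(chosen), loo_rmse(X, y, chosen)))
    # Zero-based index of the step with the smallest validation RMSE
    final_step = min(range(len(path)), key=lambda s: path[s][1])
    return path, final_step


# Synthetic example: only the first two of ten candidate predictors matter
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 10))
y = 3.0 * X[:, 0] + 2.0 * X[:, 1] + rng.normal(scale=0.1, size=40)
path, final_step = forward_stepwise_with_cv(X, y, max_steps=6)
```

Plotting the second element of each entry in `path` against step gives the validation curve described above; `final_step` marks its minimum.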
Extending the entry of predictors beyond the indicated steps amounts to “over-fitting” the model. Over-fitting refers to the tuning of the model to noise rather than to any real relationship between the variables. In the extreme, over-fitting is illustrated by a model whose number of predictors equals the number of observations for calibration: the model will explain 100% of the variance of the predictand even if the predictor data is merely random noise.
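The extreme case is easy to reproduce numerically. In the sketch below, both the predictand and the predictors are random noise (all names and the seed are arbitrary choices for illustration), yet the calibration $R^2$ climbs to 1 as the number of predictors approaches the number of observations.

```python
import numpy as np


def calibration_r2(X, y):
    """Calibration R^2 of an OLS fit with a constant term."""
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    sse = np.sum((y - A @ coef) ** 2)
    sst = np.sum((y - y.mean()) ** 2)
    return float(1.0 - sse / sst)


rng = np.random.default_rng(42)
n = 10
y = rng.normal(size=n)           # predictand: pure noise
noise = rng.normal(size=(n, n))  # n candidate predictors: pure noise

# R^2 as the first k noise predictors are included, k = 1..n
r2_by_k = [calibration_r2(noise[:, :k], y) for k in range(1, n + 1)]
```

The last entries of `r2_by_k` are 1 to within rounding error, even though no real relationship exists between the variables.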
Predictions are the values of the predictand obtained when the prediction equation
$$\hat{y}_i = \hat{b}_0 + \hat{b}_1 x_{i,1} + \hat{b}_2 x_{i,2} + \dots + \hat{b}_K x_{i,K} \qquad (9)$$
is applied outside the period used to fit the model. For example, in dendroclimatology, the tree-ring indices (the x's) for the long-term record are substituted into (9) to get estimates of past climate. The prediction is called a reconstruction in this case because the estimates are extended into the past rather than the future. Once the regression model has been estimated, the generation of the reconstruction is a trivial mathematical step, but important assumptions are made in taking the step. First, the multivariate relationship between predictand and predictors in the calibration period is assumed to have applied in the past. This assumption might be violated for many possible reasons. For example, in a tree-ring reconstruction, the climate for the calibration period may have been much different from that of the earlier period, such that a threshold of response was exceeded in the earlier period. Or the quality of the tree-ring data might have decreased back in time because of a drop-off in sample size (number of trees) in the chronologies. Many other data-dependent scenarios could be envisioned that would invalidate the application of the regression model to reconstruct past climate. For time series in general, regardless of the physical system, it is important to statistically check the ability of the model to predict outside its calibration period, that is, to validate the model, as described in the preceding section.
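Generating the reconstruction is simply a matter of evaluating the prediction equation for each year of the long record. A minimal sketch, with hypothetical coefficients, years, and tree-ring indices:

```python
def predict(coefficients, predictors):
    """Evaluate b0 + b1*x1 + ... + bK*xK for one year."""
    b0, slopes = coefficients[0], coefficients[1:]
    if len(slopes) != len(predictors):
        raise ValueError("need one predictor value per slope coefficient")
    return b0 + sum(b * x for b, x in zip(slopes, predictors))


# Hypothetical estimated coefficients: constant, then one slope per chronology
coefficients = [5.0, 20.0, 10.0]

# Hypothetical tree-ring indices for years before the calibration period
long_record = {1601: [0.8, 1.1], 1602: [1.2, 0.9], 1603: [0.5, 0.6]}

reconstruction = {year: predict(coefficients, xs) for year, xs in long_record.items()}
```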
A reconstruction should always be accompanied by some estimate of its uncertainty. The uncertainty is frequently summarized by error bars on a time series plot of the reconstruction. Error bars can be derived by different methods:
mean squared residuals, MSE. Following Wilks (1995, p. 176), the Gaussian assumption leads to an expected 95% confidence interval of roughly

$$\mathrm{CI}: \; \hat{y}_t \pm 2 s_e \qquad (10)$$

Confidence bands by this method are the same width for all reconstructed values. The $\pm 2 s_e$ rule of thumb is often a good approximation to the 95% confidence interval, especially if the sample size for calibration is large (Wilks 1995, p. 176). But because of uncertainty in the sample mean of the predictand and in the estimates of the regression coefficients, the prediction variance for data not used to fit the model is somewhat larger than indicated by MSE, and is not the same for all predicted values. This consideration gives rise to a slightly more precise estimate of prediction error called the standard error of prediction (see next section). Also note that the “2”
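If the model is used to predict data outside the calibration period, and the predictor data for some year to be predicted are given by the vector $x_*$, the predicted value $\hat{y}_*$ for that year is obtained by substituting $x_*$ into the prediction equation (9). The variance of the prediction, conditional on $x_*$, is

$$s_{\hat{y}_*}^2 = \mathrm{varpred}(\hat{y}_* \mid x_*) = \sigma^2 \left[ 1 + x_*^T (X^T X)^{-1} x_* \right] = \sigma^2 (1 + h_*) \qquad (14)$$

The standard error of prediction is the square root of the above conditional variance.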
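The standard error of prediction can be computed directly from the conditional variance. The NumPy sketch below assumes that $\sigma^2$ is estimated by the residual mean square of the calibration fit (sum of squared residuals divided by n minus the number of estimated coefficients), which is a common choice but an assumption here; the function name and data are hypothetical.

```python
import numpy as np


def standard_error_of_prediction(X, y, x_new):
    """Return (prediction, standard error of prediction) for one new year.

    X holds calibration predictors (n rows), y the calibration predictand,
    and x_new the predictor values for the year to be predicted.
    """
    n = len(y)
    A = np.column_stack([np.ones(n), X])                 # ones for the constant
    a_new = np.concatenate([[1.0], np.atleast_1d(x_new)])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    sigma2 = resid @ resid / (n - A.shape[1])            # assumed estimate of sigma^2
    h_star = a_new @ np.linalg.solve(A.T @ A, a_new)     # x*' (X'X)^-1 x*
    return float(a_new @ coef), float(np.sqrt(sigma2 * (1.0 + h_star)))


# Hypothetical calibration data with one predictor
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1, 11.9])

pred_mean, se_mean = standard_error_of_prediction(x, y, x.mean())
pred_far, se_far = standard_error_of_prediction(x, y, 12.0)
```

The standard error is smallest when the new predictor value sits at the calibration mean and grows as the new value moves away from it, which is why these error bars are not the same width for all predicted values.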
procedure, $\mathrm{RMSE}_v = \sqrt{\mathrm{PRESS}/n_v}$ is the validation equivalent of the standard error of prediction and, if normality is assumed, can be used in the same way as described for $s_e$ or $s_{\hat{y}}$ to place confidence bands at a desired significance level around the predictions. For example, an approximate 95% confidence interval is $\hat{y}_i \pm 2\,\mathrm{RMSE}_v$. Weisberg (1985, p. 230) recommends this approach as a “sensible estimate of average prediction error.”
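A constant-width band of this kind takes one line to compute. The function name and the values below are hypothetical; the default multiplier of 2 corresponds to the approximate 95% interval under the normality assumption.

```python
def rmse_confidence_band(predictions, rmse_v, multiplier=2.0):
    """Return (lower, upper) limits around each prediction."""
    half_width = multiplier * rmse_v
    return [(p - half_width, p + half_width) for p in predictions]


# Hypothetical reconstructed values and validation RMSE
band = rmse_confidence_band([10.0, 12.5], rmse_v=1.5)
# band == [(7.0, 13.0), (9.5, 15.5)]
```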
A regression equation is estimated on a data set called the construction data set, or calibration data set. For this construction set the predictors have a defined range. For example, in regressing annual precipitation on tree-ring indices, perhaps the tree-ring data for the calibration period range between 0.4 and 1.8, or 40% to 180% of “normal” growth. The relationship between the predictand and predictors expressed by the regression equation applies strictly only when the predictors are “similar” to their values in the calibration period. If the form of the regression equation is not known a priori, then we have no information on the relationship outside the observed range for the predictor in the calibration period. When the model is applied to generate predictions outside the calibration period, an important question is how “different” the predictor data can be from their values in the calibration period before the predictions are considered invalid.

When the predictors are acceptably similar to their values in the calibration period, the predictions are called interpolations. Otherwise, the predictions are called extrapolations. Extrapolations in a dendroclimatic reconstruction model present a dilemma: the most interesting observations are often extrapolations, while the regression model is strictly valid only for interpolations. A compromise that avoids simply tossing out extrapolations is to flag them in the reconstruction.

Algorithm for identifying extrapolations. Extrapolations are identified by locating the predictor data for any given prediction year relative to the multivariate “cloud” of the predictor data for the calibration period. Identification is trivial for simple linear regression, as any prediction year for which the predictor is outside its range for the calibration period can be regarded as an extrapolation. For MLR, any prediction for which the predictor data fall outside the predictor “cloud” for the calibration period can be regarded as an extrapolation. In MLR, extrapolations can be defined more specifically as observations that fall outside an ellipsoid that encloses the predictor values for the calibration period. This is an ellipsoid in p-dimensional space, where p is the number of predictors. For the simple case of one predictor, the “ellipsoid” is one-dimensional, and any value of x outside the range of x for the calibration period would lead to an extrapolation. For MLR with two variables, the ellipsoid is an ellipse in the space defined by the axes for variables $x_1$ and $x_2$.
For the general case of an MLR model with p predictors and a calibration period of n years, Weisberg (1985, p. 236) suggests an ellipsoid defined by constant values of the diagonal of the “hat” matrix H, defined in matrix algebra as

$$H = X (X^T X)^{-1} X^T \qquad (16)$$

where X is the n by (p + 1) time series matrix of predictors, with ones in the first column to allow for the constant of regression. For each prediction year with predictor values in the vector $x_*$, the scalar quantity
$$h_* = x_*^T (X^T X)^{-1} x_* \qquad (17)$$

is computed, and any prediction for which

$$h_* > h_{\max} \qquad (18)$$

where $h_{\max}$ is the largest $h_{ii}$ in the diagonal of the hat matrix H, is regarded as an extrapolation.
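The flagging rule can be sketched with NumPy as follows. The function name and the synthetic data are hypothetical; the computation itself follows the definitions of the hat-matrix diagonal and of $h_*$ given above, with a leading 1 added to each predictor vector for the constant.

```python
import numpy as np


def flag_extrapolations(X_cal, X_new):
    """Return a boolean array: True where a prediction year is an extrapolation.

    X_cal: calibration predictors, shape (n, p).
    X_new: predictors for the years to be predicted, shape (m, p).
    """
    A = np.column_stack([np.ones(len(X_cal)), X_cal])  # n by (p + 1)
    B = np.column_stack([np.ones(len(X_new)), X_new])  # m by (p + 1)
    xtx_inv = np.linalg.inv(A.T @ A)
    # Row-wise quadratic forms a' (X'X)^-1 a
    h_cal = np.einsum("ij,jk,ik->i", A, xtx_inv, A)    # diagonal of the hat matrix
    h_new = np.einsum("ij,jk,ik->i", B, xtx_inv, B)    # h* for each prediction year
    return h_new > h_cal.max()


# Synthetic calibration cloud with two predictors
rng = np.random.default_rng(1)
X_cal = rng.normal(size=(30, 2))

# One year at the centroid of the cloud, one far outside it
X_new = np.array([X_cal.mean(axis=0), [10.0, 10.0]])
flags = flag_extrapolations(X_cal, X_new)
```

Because the criterion is multivariate, a year can be flagged even when each predictor taken alone lies within its calibration range, for example when two strongly correlated predictors take an unusual combination of values.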
Fritts, H.C., Guiot, J., and Gordon, G.A., 1990, Verification, in Cook, E.R., and Kairiukstis, L.A., eds., Methods of Dendrochronology, Applications in the Environmental Sciences: Kluwer Academic Publishers, p. 178-185.
Fritts, H.C., 1976, Tree rings and climate: London, Academic Press, 567 p.
Meko, D.M., Stockton, C.W., and Boggess, W.R., 1980, A tree-ring reconstruction of drought in southern California: Water Resources Bulletin, v. 16, no. 4, p. 594-600.
Michaelsen, J., 1987, Cross-validation in statistical climate forecast models: Journal of Climate and Applied Meteorology, v. 26, p. 1589-1600.
Rencher, A.C., and Pun, F.C., 1980, Inflation of R2 in best subset regression: Technometrics, v. 22, no. 1, p. 49-53.
Snee, R.D., 1977, Validation of regression models: Methods and examples: Technometrics, v. 19, p. 415-428.
Weisberg, S., 1985, Applied Linear Regression, 2nd ed.: New York, John Wiley, 324 p.
Wilks, D.S., 1995, Statistical methods in the atmospheric sciences: Academic Press, 467 p.