Validating Regression Models: Methods and Techniques - Prof. Meko | Study notes Geology

Notes_12, GEOS 585A, Spring 2009

12 Validating the Regression Model

Regression R-squared, even if adjusted for loss of degrees of freedom due to the number of predictors in the model, can give a misleading, overly optimistic view of accuracy of prediction when the model is applied outside the calibration period. Application outside the calibration period is the rule rather than the exception in dendroclimatology. The calibration-period statistics are typically biased because the model is “tuned” for maximum agreement in the calibration period. The bias can be aggravated when automated procedures select the final predictors from too large a pool of potential predictors. Another possible problem is that the calibration period itself may be anomalous in terms of the relationships between the variables: modeled relationships may hold up for some periods of time but not for others.

It is advisable therefore to “validate” the regression model by testing the model on data not used to fit the model. Several approaches to validation are available. Among these are cross-validation and split-sample validation. In cross-validation, a series of regression models is fit, each time deleting a different observation from the calibration set and using the model to predict the predictand for the deleted observation. The merged series of predictions for deleted observations is then checked for accuracy against the observed data. In split-sample validation, the model is fit to some portion of the data (say, the second half), and accuracy is measured on the predictions for the other half of the data. The calibration and validation periods are then exchanged and the process repeated.

In any regression problem it is also important to keep in mind that modeled relationships may not be valid for periods when the predictors are outside their ranges for the calibration period: the multivariate distribution of the predictors for some observations outside the calibration period may have no analog in the calibration period. The distinction of predictions as extrapolations versus interpolations is useful in flagging such occurrences.
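The leave-one-out form of cross-validation described above can be sketched in a few lines. This is a minimal illustration, not the procedure used in the course: the function name `loo_cross_validate`, the synthetic data, and the choice of RMSE and correlation as accuracy measures are all assumptions made for the example.

```python
import numpy as np

def loo_cross_validate(X, y):
    """Leave-one-out cross-validation for ordinary least squares.

    X : (n,) or (n, p) array of predictors, without an intercept column
    y : (n,) array of the predictand
    Returns the (n,) merged series of predictions for the deleted observations.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(y)
    A = np.column_stack([np.ones(n), X])      # design matrix with intercept
    y_cv = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i              # delete observation i
        coef, *_ = np.linalg.lstsq(A[keep], y[keep], rcond=None)
        y_cv[i] = A[i] @ coef                 # predict the deleted observation
    return y_cv

# Hypothetical synthetic data, for illustration only
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 2))
y = 1.0 + X @ np.array([0.8, -0.5]) + rng.normal(scale=0.5, size=40)

y_cv = loo_cross_validate(X, y)
rmse_cv = np.sqrt(np.mean((y - y_cv) ** 2))   # accuracy of the merged predictions
r_cv = np.corrcoef(y, y_cv)[0, 1]
print(f"cross-validation RMSE = {rmse_cv:.3f}, r = {r_cv:.3f}")
```

Each prediction in `y_cv` comes from a model that never saw the corresponding observation, so statistics computed from `y_cv` are not "tuned" to the calibration data in the way the ordinary fitted values are.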

12.1 Validation

Validation strategies. Several alternative strategies for validation are available, and some may be better than others depending on the data and purpose of analysis. Three different ways of validating are:

1) Compare predictions made by the model with records of some proxy for the predictand.
   a) Calibrate on the entire available length of the overlap of predictors and predictand
   b) Apply the model to predict outside the calibration period
   c) Compare the predictions outside the calibration period with observations of some proxy for the predictand
   d) Pro: uses all available data for calibration; a long calibration time series generally gives a more stable model
   e) Con: validation semi-qualitative

2) Validate the model with a time series segment of the predictand withheld from calibration.
   a) Calibrate on just a part of the period of overlap of predictors and predictand
   b) Apply model to generate predictions for the data withheld from calibration
   c) Compare the predicted and observed predictand for the period withheld from calibration
   d) Use the model from (a) for final prediction
   e) Pro: model validated is same model used for final prediction
   f) Con: requires long time series of predictand if some data are to be sacrificed to validation
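Steps (a) to (c) of strategy 2 can be sketched as follows. This is an illustrative outline under stated assumptions: the helper names (`fit_ols`, `predict_ols`, `split_sample_validate`), the synthetic data, and the use of RMSE and correlation as the comparison in step (c) are hypothetical choices, not taken from the notes.

```python
import numpy as np

def fit_ols(X, y):
    """Least-squares coefficients (intercept first)."""
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

def predict_ols(coef, X):
    """Predictions from coefficients returned by fit_ols."""
    A = np.column_stack([np.ones(len(X)), X])
    return A @ coef

def split_sample_validate(X, y, calib):
    """Calibrate where `calib` is True, validate on the withheld remainder."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    calib = np.asarray(calib, dtype=bool)
    coef = fit_ols(X[calib], y[calib])             # step (a)
    y_hat = predict_ols(coef, X[~calib])           # step (b)
    err = y[~calib] - y_hat                        # step (c)
    return {
        "coef": coef,
        "rmse": float(np.sqrt(np.mean(err ** 2))),
        "r": float(np.corrcoef(y[~calib], y_hat)[0, 1]),
        "n_valid": int((~calib).sum()),
    }

# Hypothetical 60-"year" record; calibrate on the second half
rng = np.random.default_rng(1)
X = rng.normal(size=(60, 2))
y = 0.5 + X @ np.array([1.0, 0.3]) + rng.normal(scale=0.6, size=60)
second_half = np.arange(60) >= 30

result = split_sample_validate(X, y, second_half)
print(f"validation RMSE = {result['rmse']:.3f}, r = {result['r']:.3f}")
```

Passing `~second_half` instead of `second_half` exchanges the calibration and validation periods.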



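$$\mathrm{SSE} = \sum_i \hat{e}_i^{2}$$

$$R^2 = 1 - \frac{\mathrm{SSE}}{\mathrm{SST}} \tag{7}$$

$$\mathrm{PRESS} = \sum_i \hat{e}_{(i)}^{2}$$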
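As a rough illustration of the quantities named above (SSE, SST, the R² of equation (7), and PRESS), the sketch below computes them for an ordinary least-squares fit. Two assumptions are made because the surrounding text is incomplete: PRESS is taken in its usual sense as the sum of squared residuals for deleted observations, and those residuals are obtained with the standard identity ê_(i) = ê_i / (1 − h_ii), where h_ii is the leverage of observation i, rather than by refitting the model n times. The function name `calibration_and_press` is hypothetical.

```python
import numpy as np

def calibration_and_press(X, y):
    """Return SSE, SST, R2 and PRESS for an OLS fit with intercept."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(y)
    A = np.column_stack([np.ones(n), X])       # design matrix with intercept
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    e = y - A @ coef                           # calibration residuals
    Q, _ = np.linalg.qr(A)                     # reduced QR; leverages from Q
    h = np.sum(Q ** 2, axis=1)                 # h_ii, diagonal of the hat matrix
    e_del = e / (1.0 - h)                      # deleted (cross-validation) residuals
    sse = float(e @ e)
    sst = float(np.sum((y - y.mean()) ** 2))
    press = float(e_del @ e_del)
    return {"SSE": sse, "SST": sst, "R2": 1.0 - sse / sst, "PRESS": press}

# Hypothetical synthetic data, for illustration only
rng = np.random.default_rng(2)
X = rng.normal(size=(30, 3))
y = X @ np.array([0.6, 0.0, -0.4]) + rng.normal(scale=0.7, size=30)

stats = calibration_and_press(X, y)
print({k: round(v, 3) for k, v in stats.items()})
```

Because every leverage lies between 0 and 1, each deleted residual is at least as large in magnitude as the corresponding calibration residual, so PRESS is never smaller than SSE.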

12.2 Cross-validation stopping rule

12.3 Prediction (Reconstruction)

12.4 Error bars for predictions

where $\sigma^2$ is generally estimated as the residual mean square, or $s_e^2$. The estimated standard error of prediction is

$$s_{\hat{y}} = \mathrm{sepred}(y_* \mid x_*) = \hat{\sigma}\sqrt{1 + h_*} \tag{15}$$
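A minimal numerical sketch of equation (15) follows. It assumes, since the definition is not visible in this excerpt, that h* is the usual quadratic form x*ᵀ(XᵀX)⁻¹x* with x* including the intercept term, and that σ is estimated by the square root of the residual mean square. The function name `prediction_standard_error` and the data are hypothetical.

```python
import numpy as np

def prediction_standard_error(X, y, X_new):
    """Predictions and their standard errors, s_e * sqrt(1 + h*)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    X_new = np.asarray(X_new, dtype=float)
    n = len(y)
    A = np.column_stack([np.ones(n), X])               # calibration design matrix
    A_new = np.column_stack([np.ones(len(X_new)), X_new])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    dof = n - A.shape[1]                               # n minus number of coefficients
    s_e = np.sqrt(resid @ resid / dof)                 # root residual mean square
    XtX_inv = np.linalg.inv(A.T @ A)
    # h* for each new observation: x*^T (X^T X)^{-1} x*
    h_star = np.einsum("ij,jk,ik->i", A_new, XtX_inv, A_new)
    return A_new @ coef, s_e * np.sqrt(1.0 + h_star)

# Hypothetical data: one predictor, predictions at the centre and far outside
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.0, 3.0, 2.0, 5.0, 4.0])
y_new, se_new = prediction_standard_error(x, y, np.array([2.0, 10.0]))
print(np.round(y_new, 3), np.round(se_new, 3))
```

The standard error grows as the new predictor values move away from the centre of the calibration data, because h* increases with that distance. Error bars could then be drawn as the prediction plus or minus a chosen multiple of this standard error.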

12.5 Interpolation vs extrapolation

$$H = X\,(X^{T}X)^{-1}X^{T} \tag{16}$$
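The hat matrix of equation (16) can be used to flag predictions whose predictor values have no analog in the calibration period. The sketch below adopts one common convention, which is an assumption here rather than something confirmed by the visible text: a prediction is treated as an extrapolation when its leverage h* = x*ᵀ(XᵀX)⁻¹x* exceeds the largest diagonal element of H from the calibration period. The function name `flag_extrapolations` and the data are hypothetical.

```python
import numpy as np

def flag_extrapolations(X_calib, X_new):
    """Flag new observations whose leverage exceeds the calibration maximum."""
    Xc = np.asarray(X_calib, dtype=float)
    Xn = np.asarray(X_new, dtype=float)
    A = np.column_stack([np.ones(len(Xc)), Xc])        # calibration design matrix
    A_new = np.column_stack([np.ones(len(Xn)), Xn])
    XtX_inv = np.linalg.inv(A.T @ A)
    H = A @ XtX_inv @ A.T                              # hat matrix, equation (16)
    h_max = np.diag(H).max()                           # largest calibration leverage
    h_new = np.einsum("ij,jk,ik->i", A_new, XtX_inv, A_new)
    return h_new > h_max, h_new, h_max

# Hypothetical calibration data: two strongly correlated predictors
rng = np.random.default_rng(3)
z = rng.normal(size=50)
X_calib = np.column_stack([z, z + 0.2 * rng.normal(size=50)])

# First point follows the calibration pattern; second has an unusual combination
X_new = np.array([[0.5, 0.5],
                  [1.5, -1.5]])
is_extrap, h_new, h_max = flag_extrapolations(X_calib, X_new)
print(is_extrap, np.round(h_new, 3), round(float(h_max), 3))
```

A leverage-based check looks at the joint distribution of the predictors, so it can flag a combination of values that is unusual even when each predictor taken alone lies inside its calibration range.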

12.6 References