Data Mining: Predictive Modeling and Evaluation - Prof. Jennifer L. Neville
Data Mining
CS57300 / STAT 59800-
Purdue University, February 19, 2009
Predictive modeling: evaluation
Score functions
- Zero-one loss
- Accuracy
- Sensitivity/specificity
- Precision/Recall/F
- Absolute loss
- Squared loss
- Likelihood/conditional likelihood
- Area under the ROC curve
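Several of the score functions listed above can be sketched in a few lines. This is a minimal illustration, not a reference implementation: the function names are hypothetical, the losses are averaged over instances (an assumption, since a sum is equally common), labels are assumed to be 0/1 for the likelihood and AUC functions, and AUC is computed as the fraction of positive/negative pairs ranked correctly, with ties counted as one half.

```python
import math

def zero_one_loss(y_true, y_pred):
    """Mean zero-one loss: fraction of instances with a wrong predicted class."""
    return sum(t != p for t, p in zip(y_true, y_pred)) / len(y_true)

def accuracy(y_true, y_pred):
    """Accuracy is the complement of the mean zero-one loss."""
    return 1.0 - zero_one_loss(y_true, y_pred)

def absolute_loss(y_true, y_pred):
    """Mean absolute difference between numeric targets and predictions."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def squared_loss(y_true, y_pred):
    """Mean squared difference between numeric targets and predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def conditional_log_likelihood(y_true, p_positive):
    """Log-likelihood of 0/1 labels given predicted P(y=1 | x) for each instance."""
    return sum(math.log(p if t == 1 else 1.0 - p)
               for t, p in zip(y_true, p_positive))

def auc(y_true, scores):
    """Area under the ROC curve via pairwise ranking of positives vs. negatives."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

The pairwise AUC is quadratic in the number of instances, which is fine for a sketch but would normally be replaced by a sort-based computation on large test sets.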
Simple measures on tables

              Actual +   Actual -
Predicted +   TP         FP
Predicted -   FN         TN

- True positive rate (TPR) = TP/(TP+FN)
- False positive rate (FPR) = FP/(FP+TN)
- Recall = TP/(TP+FN) = TPR
- Precision = TP/(TP+FP)
- Specificity = TN/(FP+TN)
- Sensitivity = TPR
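The table measures can be computed directly from the four cell counts. The sketch below assumes binary labels with a designated positive class; the function names are hypothetical, and F is taken to be the F1 score (the harmonic mean of precision and recall), which is an assumption about what the slides mean by "F". Zero denominators are not handled.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, FP, FN, TN for a binary problem."""
    tp = fp = fn = tn = 0
    for t, p in zip(y_true, y_pred):
        if p == positive and t == positive:
            tp += 1
        elif p == positive:
            fp += 1
        elif t == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn

def table_measures(tp, fp, fn, tn):
    """Measures defined on the 2x2 table (no guard against empty rows or columns)."""
    recall = tp / (tp + fn)          # same as TPR and sensitivity
    precision = tp / (tp + fp)
    return {
        "tpr": recall,
        "fpr": fp / (fp + tn),
        "recall": recall,
        "sensitivity": recall,
        "precision": precision,
        "specificity": tn / (fp + tn),
        "f1": 2 * precision * recall / (precision + recall),  # assumed meaning of "F"
    }

y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]
measures = table_measures(*confusion_counts(y_true, y_pred))
```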
Cost-sensitive models
- Define a score function based on a cost matrix
- If ~y is the predicted class and y is the true class, then we need to define a matrix of costs C(~y,y)
- C(~y,y) reflects the severity of classifying an instance with true class y as class ~y
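A cost-based score can be sketched as the average of C(~y,y) over the test instances. The cost values below are made up for illustration (a false negative is assumed to be ten times as costly as a false positive), and the function name is hypothetical.

```python
def average_cost(y_true, y_pred, cost):
    """Average C(predicted, true) over all instances.

    cost is a dict keyed by (predicted_class, true_class).
    """
    return sum(cost[(p, t)] for t, p in zip(y_true, y_pred)) / len(y_true)

# Illustrative cost matrix (assumed values): correct predictions cost nothing.
cost = {
    ("+", "+"): 0.0, ("-", "-"): 0.0,
    ("-", "+"): 10.0,   # false negative
    ("+", "-"): 1.0,    # false positive
}

y_true = ["+", "+", "-", "-"]
pred_a = ["-", "+", "-", "-"]   # one false negative
pred_b = ["+", "+", "+", "-"]   # one false positive

cost_a = average_cost(y_true, pred_a, cost)
cost_b = average_cost(y_true, pred_b, cost)
```

Both sets of predictions have the same accuracy (3 of 4 correct), yet the cost-based score separates them, which is the point of using a cost matrix.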
Bias-variance analysis
Conventional bias/variance framework
[Figure: training set -> samples -> models M_1, M_2, M_3; test set -> model predictions.]
Findings
- Bias
- Often related to size of model space
- More complex models tend to have lower bias
- Variance
- Often related to size of dataset
- When the dataset is large enough to estimate parameters well, models have lower variance
- Simple models can perform surprisingly well due to lower variance
Bias/variance tradeoff
[Figure: expected MSE plotted against size of parameter space, with a high bias / low variance region and a low bias / high variance region.]
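The tradeoff can be illustrated with a small simulation in the spirit of the conventional framework: draw many training sets, learn a model on each, and measure at fixed test points how far the average prediction is from the truth (squared bias) and how much predictions vary across training sets (variance). Everything here is an assumption chosen for illustration: the true function, the noise level, and the two models (a constant mean predictor as the simple model, 1-nearest-neighbor as the complex one).

```python
import math
import random

def true_f(x):
    """Assumed true regression function for the simulation."""
    return math.sin(2 * math.pi * x)

def sample_training_set(rng, n=20, noise=0.3):
    """Draw n noisy (x, y) pairs with x uniform on [0, 1)."""
    return [(x, true_f(x) + rng.gauss(0.0, noise))
            for x in (rng.random() for _ in range(n))]

def fit_mean(train):
    """Simple model: always predict the mean training target."""
    mean_y = sum(y for _, y in train) / len(train)
    return lambda x: mean_y

def fit_1nn(train):
    """Complex model: predict the target of the nearest training point."""
    return lambda x: min(train, key=lambda xy: abs(xy[0] - x))[1]

def bias_variance(fit, n_sets=200, seed=0):
    """Estimate average squared bias and variance over a grid of test points."""
    rng = random.Random(seed)
    test_xs = [i / 20 for i in range(21)]
    preds = [[] for _ in test_xs]
    for _ in range(n_sets):
        model = fit(sample_training_set(rng))
        for j, x in enumerate(test_xs):
            preds[j].append(model(x))
    bias_sq = var = 0.0
    for x, p in zip(test_xs, preds):
        mean_p = sum(p) / len(p)
        bias_sq += (mean_p - true_f(x)) ** 2
        var += sum((v - mean_p) ** 2 for v in p) / len(p)
    return bias_sq / len(test_xs), var / len(test_xs)

bias_mean, var_mean = bias_variance(fit_mean)
bias_1nn, var_1nn = bias_variance(fit_1nn)
```

Under these assumptions the mean predictor should show the larger squared bias and the nearest-neighbor model the larger variance.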
Ensemble methods
- Motivation
- Too difficult to construct a single model that optimizes performance (why?)
- Approach
- Construct many models on different versions of the training set and combine them during prediction
- Goal: reduce bias and/or variance
Bagging
- Given a training data set D = {(x_1, y_1), ..., (x_N, y_N)}
- For m = 1:M
- Obtain a bootstrap sample D_m by drawing N instances with replacement from D
- Learn model M_m from D_m
- To classify test instance t, apply all models to t and take the majority vote
- Models have less correlated errors because their training sets differ (each bootstrap sample contains ~63% of the distinct instances in D)
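The bagging procedure above can be sketched as follows. The base learner is passed in as a function; `learn_stump` is a hypothetical example learner (a single threshold on one numeric feature with 0/1 labels) included only so the sketch runs end to end.

```python
import random
from collections import Counter

def bootstrap_sample(data, rng):
    """Draw len(data) instances with replacement from data."""
    n = len(data)
    return [data[rng.randrange(n)] for _ in range(n)]

def bagging(data, learn, n_models, seed=0):
    """Learn one model per bootstrap sample."""
    rng = random.Random(seed)
    return [learn(bootstrap_sample(data, rng)) for _ in range(n_models)]

def predict_majority(models, x):
    """Apply every model to x and return the most common prediction."""
    votes = Counter(model(x) for model in models)
    return votes.most_common(1)[0][0]

def learn_stump(sample):
    """Hypothetical base learner: threshold with the fewest training errors."""
    best = None
    for thr in sorted({x for x, _ in sample}):
        for left, right in ((0, 1), (1, 0)):
            errors = sum((left if x <= thr else right) != y for x, y in sample)
            if best is None or errors < best[0]:
                best = (errors, thr, left, right)
    _, thr, left, right = best
    return lambda x: left if x <= thr else right

# Toy data: label is 1 when the feature is at least 0.5.
data = [(i / 10, int(i >= 5)) for i in range(10)]
models = bagging(data, learn_stump, n_models=25)
```

Calling `len(set(bootstrap_sample(list(range(10000)), random.Random(1)))) / 10000` gives an empirical check of how many distinct instances a bootstrap sample contains.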
Boosting
- Main assumption
- Combining many weak (but stable) predictors in an ensemble produces a strong predictor
- Weak predictor: only weakly predicts correct class of instances (e.g., tree stumps, 1-R)
- Model space: non-parametric, can model any function if an appropriate base model is used
Boosting
- Assign every example in D an equal weight (1/N)
- For m = 1:M
- Learn model M_m with D_m
- Calculate the error err_m of M_m and up-weight the examples that are incorrectly classified to form D_{m+1}
- Normalize weights in D_{m+1} to sum to 1
- Set α_m = log((1-err_m)/err_m)
- To classify test instance t, apply all models to t and take a weighted vote of the predictions (i.e., using α_m)
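The boosting loop above can be sketched as below. The slides do not say by how much misclassified examples are up-weighted; this sketch assumes the AdaBoost.M1-style factor (1-err)/err, and it uses a hypothetical weighted decision stump on one numeric feature with 0/1 labels as the weak learner. The error is clamped away from zero to keep the logarithm finite, which is also an assumption.

```python
import math

def learn_weighted_stump(data, weights):
    """Hypothetical weak learner: threshold with the lowest weighted error."""
    best = None
    for thr in sorted({x for x, _ in data}):
        for left, right in ((0, 1), (1, 0)):
            err = sum(w for (x, y), w in zip(data, weights)
                      if (left if x <= thr else right) != y)
            if best is None or err < best[0]:
                best = (err, thr, left, right)
    err, thr, left, right = best
    return (lambda x: left if x <= thr else right), err

def boost(data, n_rounds):
    """Return a list of (weight, model) pairs."""
    n = len(data)
    weights = [1.0 / n] * n                      # equal initial weights
    ensemble = []
    for _ in range(n_rounds):
        model, err = learn_weighted_stump(data, weights)
        if err >= 0.5:                           # no better than chance: stop
            break
        err = max(err, 1e-10)                    # avoid log of infinity
        alpha = math.log((1.0 - err) / err)
        factor = (1.0 - err) / err               # assumed up-weighting factor
        weights = [w * factor if model(x) != y else w
                   for (x, y), w in zip(data, weights)]
        total = sum(weights)
        weights = [w / total for w in weights]   # normalize to sum to 1
        ensemble.append((alpha, model))
    return ensemble

def predict_weighted(ensemble, x):
    """Weighted vote of the model predictions."""
    votes = {}
    for alpha, model in ensemble:
        label = model(x)
        votes[label] = votes.get(label, 0.0) + alpha
    return max(votes, key=votes.get)

# Toy data that no single stump can fit: 1s, then 0s, then 1s again.
data = list(zip(range(10), [1, 1, 1, 0, 0, 0, 1, 1, 1, 1]))
single = boost(data, 1)
three = boost(data, 3)
```

On this toy data a single stump misclassifies three instances, while the weighted vote of three stumps fits all ten, which illustrates how weak predictors combine into a stronger one.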
Pathologies
Overfitting (cont)
(Oates & Jensen 1999)
Oversearching
[Figure: accuracy on the training set and on the test set, by search method (heuristic search vs. exhaustive search).]
(Quinlan and Cameron-Jones 1995; Murthy and Salzberg 1995)
Attribute selection errors
[Figure: attributes with few possible values vs. many possible values; accuracy on the training set and on the test set, by number of possible values.]
(Quinlan 1998; Liu and White 1994)
Evaluation functions are estimators
- Evaluation functions are functions f(m,D) on models (m) and data samples (D)
- Samples vary in their “representativeness”: f(m,D_1) = x_1 ≠ x_2 = f(m,D_2)
- Each score x is an estimate of some population parameter
Example: Dice rolling
For a fair die with six outcomes (H_0: all outcomes are equally likely), what is the sampling distribution of X_i?
[Figure: sampling distribution of X_i over the outcomes 1 to 6.]
E(X_i|H_0) = 3.5
p(X_i>5|H_0) = 1/6 ≈ 0.17
Example: Dice rolling
For the maximum of ten dice (H_0: all outcomes are equally likely), what is the sampling distribution of X_max?
[Figure: sampling distribution of X_max over the outcomes 1 to 6.]
E(X_max|H_0) ≈ 5.82
p(X_max>5|H_0) ≈ 0.84
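Both sampling distributions can be computed exactly. The sketch below uses the identity P(X_max <= k) = (k/6)^n for n independent fair dice, so P(X_max = k) = (k/6)^n - ((k-1)/6)^n; the function names are hypothetical, and exact fractions are used to avoid rounding.

```python
from fractions import Fraction

def max_distribution(n_dice, sides=6):
    """Exact P(X_max = k) for the maximum of n_dice fair dice."""
    return {k: Fraction(k, sides) ** n_dice - Fraction(k - 1, sides) ** n_dice
            for k in range(1, sides + 1)}

def expectation(dist):
    """Expected value of a distribution given as {outcome: probability}."""
    return sum(k * p for k, p in dist.items())

def prob_greater(dist, threshold):
    """P(X > threshold)."""
    return sum(p for k, p in dist.items() if k > threshold)

single = max_distribution(1)    # one die: the uniform distribution
ten = max_distribution(10)      # maximum of ten dice

for name, dist in (("X_i", single), ("X_max", ten)):
    print(name, float(expectation(dist)), float(prob_greater(dist, 5)))
```

With one die the function reduces to the uniform distribution, so the same code covers both slides and makes the shift toward 6 under the maximum easy to see.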
Using the right sampling distribution
- The sampling distribution of X_max differs from the sampling distribution of X_i
- A direct analogy exists between dice rolling and searching multiple models, model components, attributes, etc.
- The evaluation of any given score varies with the number of models (or components, attributes, etc.) compared during search.
Multiple comparisons are ubiquitous in learning...
- Used to select:
- Settings: A>1, A>2, A>4…
- Components: A>3, B=4, C>56.3…
- Models: Tree 1, Tree 2, Tree 3…
- Methods: trees, rules, networks…
- Parameters: depth=4, depth=5, depth=6...
Overfitting
- Many components are available to use in a given model.
- Algorithms select the component with the maximum score.
- The correct sampling distribution depends on the number of components evaluated.
- Most learning algorithms do not adjust for the number of components.
Biased parameter estimates
- Sample scores are routinely used as estimates of population parameters.
- Any single score x_i is often an unbiased estimator of the population score.
- But x_max is almost always a biased estimator.
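The bias of the maximum score can be seen in a small simulation. The setup is an assumption chosen for illustration: every candidate model guesses at random, so its population accuracy is exactly 0.5, and each model is scored on a fresh sample of test instances. The names are hypothetical.

```python
import random

def random_model_accuracy(rng, n_instances):
    """Sample accuracy of a model whose population accuracy is 0.5."""
    return sum(rng.random() < 0.5 for _ in range(n_instances)) / n_instances

def mean_max_score(n_models, n_instances=50, n_trials=500, seed=0):
    """Average, over trials, of the best sample score among n_models models."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_trials):
        total += max(random_model_accuracy(rng, n_instances)
                     for _ in range(n_models))
    return total / n_trials

one = mean_max_score(1)       # a single score: close to 0.5
five = mean_max_score(5)      # best of 5: biased upward
twenty = mean_max_score(20)   # best of 20: biased further upward
```

No model here is better than chance, yet the selected maximum looks better as more models are compared, which is why directly comparing maxima from search spaces of different sizes is misleading.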
Oversearching
- Two or more search spaces contain different numbers of models.
- Maximum scores in each space are biased to differing degrees.
- Most algorithms directly compare scores.
- Attribute selection errors can be explained in an analogous way.
Adjusting for multiple comparisons
- Remove bias by testing on withheld data
- New data (e.g., Oates & Jensen 1999)
- Cross-validation (e.g., Weiss and Kulikowski 1991)
- Estimate sampling distribution accurately
- Randomization tests (e.g., Jensen 1992)
- Adjust probability calculation
- Bonferroni adjustment (e.g., Jensen & Schmill 1997)
- Alter evaluation function to incorporate complexity penalty
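Two of the adjustments listed above can be sketched briefly. The Bonferroni function multiplies a p-value by the number of comparisons and caps it at 1. The randomization test estimates the sampling distribution of the maximum score by shuffling the labels and repeating the whole selection step each time. The scoring function (accuracy of a single binary attribute, in either polarity) and all names are assumptions made for illustration, not the procedures from the cited papers.

```python
import random

def bonferroni_adjust(p_value, n_comparisons):
    """Bonferroni-adjusted p-value: multiply by the number of comparisons, cap at 1."""
    return min(1.0, p_value * n_comparisons)

def attribute_score(values, labels):
    """Assumed score: accuracy of predicting the label from one binary attribute."""
    agree = sum(v == y for v, y in zip(values, labels)) / len(labels)
    return max(agree, 1.0 - agree)

def max_score(attributes, labels):
    """Score of the best attribute, i.e. what a greedy learner would select."""
    return max(attribute_score(a, labels) for a in attributes)

def randomization_p_value(attributes, labels, n_permutations=1000, seed=0):
    """Share of label shufflings whose best score reaches the observed best score."""
    rng = random.Random(seed)
    observed = max_score(attributes, labels)
    shuffled = list(labels)
    count = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)
        if max_score(attributes, shuffled) >= observed:
            count += 1
    return (count + 1) / (n_permutations + 1)

labels = [0, 1] * 20
attributes = [list(labels), [0] * 40, [1] * 40]   # one perfect, two uninformative
p_value = randomization_p_value(attributes, labels)
```

Because the maximum is recomputed inside every shuffle, the reference distribution already accounts for the number of attributes compared, so no separate correction is applied to the resulting p-value.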