Data Mining: Predictive Modeling and Evaluation - Prof. Jennifer L. Neville, Study notes of Data Analysis & Statistical Methods

This document from Purdue University covers various aspects of predictive modeling and evaluation in data mining. Topics include score functions, cost-sensitive models, ROC curves, bias-variance analysis, ensemble methods, and pathologies. Measures such as accuracy, precision, recall, and F1 score are discussed, along with concepts like overfitting, oversearching, and attribute selection errors.

Typology: Study notes

Pre 2010

Uploaded on 07/30/2009

Data Mining
CS57300 / STAT 59800-024
Purdue University
February 19, 2009
Predictive modeling: evaluation

Score functions

  • Zero-one loss
    • Accuracy
    • Sensitivity/specificity
    • Precision/Recall/F
  • Absolute loss
  • Squared loss
    • Root mean-squared error
  • Likelihood/conditional likelihood
  • Area under the ROC curve

Simple measures on tables

                   Actual +    Actual -
    Predicted +    TP          FP
    Predicted -    FN          TN

  • True positive rate (TPR) = TP/(TP+FN)
  • False positive rate (FPR) = FP/(FP+TN)
  • Recall = TP/(TP+FN) = TPR
  • Precision = TP/(TP+FP)
  • Specificity = TN/(FP+TN)
  • Sensitivity = TPR
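
As a minimal sketch of the table measures above, the function below computes each one from the four cell counts. The function name and the example counts are hypothetical and chosen only for illustration; accuracy and F1 (the harmonic mean of precision and recall) are included because they appear in the score-function list.

```python
def table_measures(tp, fp, fn, tn):
    """Simple measures computed from the four counts of a 2x2 table."""
    tpr = tp / (tp + fn)            # true positive rate = recall = sensitivity
    fpr = fp / (fp + tn)            # false positive rate
    precision = tp / (tp + fp)
    specificity = tn / (fp + tn)    # equals 1 - FPR
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    f1 = 2 * precision * tpr / (precision + tpr)  # harmonic mean of precision and recall
    return {
        "tpr": tpr,
        "fpr": fpr,
        "precision": precision,
        "specificity": specificity,
        "accuracy": accuracy,
        "f1": f1,
    }


# Hypothetical counts, not taken from the lecture.
measures = table_measures(tp=40, fp=10, fn=20, tn=30)
for name, value in measures.items():
    print(f"{name}: {value:.3f}")
```

Note that specificity and FPR always sum to 1, so an ROC curve (TPR against FPR) carries the same information as a plot of sensitivity against 1 - specificity.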

Cost-sensitive models

  • Define a score function based on a cost matrix
  • If ~y is the predicted class and y is the true class, define a matrix of costs C(~y,y)
  • C(~y,y) reflects the severity of classifying an instance with true class y as class ~y
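
A cost-based score can be sketched as the sum of C(~y,y) over the evaluated instances. The cost values, labels, and the `total_cost` helper below are assumptions made up for illustration; the slides do not give a specific matrix.

```python
def total_cost(y_true, y_pred, cost):
    """Sum C(predicted, true) over all instances.

    `cost` is a dict keyed by (predicted class, true class).
    """
    return sum(cost[(p, t)] for p, t in zip(y_pred, y_true))


# Hypothetical cost matrix: missing a positive costs five times a false alarm.
cost = {
    ("+", "+"): 0,
    ("-", "-"): 0,
    ("+", "-"): 1,   # false positive
    ("-", "+"): 5,   # false negative
}

y_true = ["+", "+", "-", "-", "+"]
y_pred = ["+", "-", "-", "+", "+"]

cost_score = total_cost(y_true, y_pred, cost)
zero_one_loss = sum(p != t for p, t in zip(y_pred, y_true))
print(cost_score, zero_one_loss)
```

Zero-one loss is the special case in which every off-diagonal cost is 1, so the two scores differ only when some errors are more severe than others.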

Bias-variance analysis

Conventional bias/variance framework

[Figure: training set → samples → models M1, M2, M3 → model predictions on the test set]

Findings

  • Bias
    • Often related to size of model space
    • More complex models tend to have lower bias
  • Variance
    • Often related to size of dataset
    • When data is large enough to estimate parameters well then models have lower variance
  • Simple models can perform surprisingly well due to lower variance

Bias/variance tradeoff

[Figure: expected MSE versus size of parameter space, ranging from high bias and low variance (small spaces) to low bias and high variance (large spaces)]
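
The tradeoff is commonly stated through the standard decomposition of expected squared error. The notation here is an assumption, not taken from the slides: the target is y = f(x) + ε with E[ε] = 0 and Var(ε) = σ², f̂ is the model learned from a random training sample, and expectations are over training samples and noise.

```latex
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
  = \underbrace{\sigma^2}_{\text{noise}}
  + \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}\big[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\big]}_{\text{variance}}
```

Under these assumptions, enlarging the parameter space tends to shrink the bias term while inflating the variance term, which is why expected MSE is typically lowest at an intermediate model size.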

Ensemble methods

  • Motivation
    • Too difficult to construct a single model that optimizes performance (why?)
  • Approach
    • Construct many models on different versions of the training set and combine them during prediction
  • Goal: reduce bias and/or variance

Bagging

  • Given a training data set D = {(x1,y1), ..., (xN,yN)}
  • For m=1:M
    • Obtain a bootstrap sample Dm by drawing N instances with replacement from D
    • Learn model Mm from Dm
  • To classify test instance t, apply all models to t and take majority vote
  • Models' errors are less correlated because their training sets differ (each bootstrap sample contains about 63% of the distinct instances in D, since 1 - 1/e ≈ 0.632)
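
The bagging steps above can be sketched as follows. The base learner (`fit_stump`, a one-feature threshold rule), the toy data, and all names are assumptions for illustration; any learner could take its place. The last lines estimate the fraction of distinct instances in one bootstrap sample, which approaches 1 - 1/e ≈ 0.632 for large N.

```python
import random
from collections import Counter


def fit_stump(data):
    """Hypothetical base learner: best single-threshold rule on a 1-D feature."""
    best = None
    for thr in sorted({x for x, _ in data}):
        for left, right in ((0, 1), (1, 0)):
            errors = sum((left if x <= thr else right) != y for x, y in data)
            if best is None or errors < best[0]:
                best = (errors, thr, left, right)
    _, thr, left, right = best
    return lambda x: left if x <= thr else right


def bagging(data, n_models, rng):
    """Learn one model per bootstrap sample (N draws with replacement from D)."""
    n = len(data)
    return [fit_stump([rng.choice(data) for _ in range(n)]) for _ in range(n_models)]


def predict(models, x):
    """Majority vote over the ensemble."""
    return Counter(model(x) for model in models).most_common(1)[0][0]


rng = random.Random(0)

# Toy data set (made up): the label is 1 when x > 5.
D = [(x, int(x > 5)) for x in range(11)]
models = bagging(D, n_models=25, rng=rng)
votes = [predict(models, x) for x in (0, 10)]
print(votes)

# Fraction of distinct instances that appear in one bootstrap sample of size N.
N = 10_000
frac_unique = len({rng.randrange(N) for _ in range(N)}) / N
print(round(frac_unique, 3))
```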

Boosting

  • Main assumption
    • Combining many weak (but stable) predictors in an ensemble produces a strong predictor
    • Weak predictor: only weakly predicts correct class of instances (e.g., tree stumps, 1-R)
  • Model space: non-parametric, can model any function if an appropriate base model is used

Boosting

  • Assign every example in D an equal weight (1/N)
  • For m=1:M
    • Learn model Mm with Dm
    • Calculate the error errm of Mm and up-weight the examples that are incorrectly classified to form Dm+1
    • Normalize weights in Dm+1 to sum to 1
    • Set αm = log((1-errm)/errm)
  • To classify test instance t, apply all models to t and take a weighted vote of predictions (i.e., using αm)
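
A minimal sketch of the boosting loop above, under several assumptions not stated on the slide: classes are coded as +1/-1, the weak learner is a hypothetical weighted threshold stump on one feature, and misclassified examples are up-weighted by the factor (1-err)/err, as in the AdaBoost.M1 variant. The toy data are made up.

```python
import math


def fit_weighted_stump(xs, ys, w):
    """Hypothetical weak learner: 1-D threshold stump minimizing weighted error."""
    best = None
    for thr in sorted(set(xs)):
        for sign in (1, -1):
            preds = [sign if x <= thr else -sign for x in xs]
            err = sum(wi for wi, p, y in zip(w, preds, ys) if p != y)
            if best is None or err < best[0]:
                best = (err, thr, sign)
    err, thr, sign = best
    return err, (lambda x: sign if x <= thr else -sign)


def boost(xs, ys, n_rounds):
    n = len(xs)
    w = [1.0 / n] * n                           # equal initial weights
    ensemble = []
    for _ in range(n_rounds):
        err, model = fit_weighted_stump(xs, ys, w)
        err = min(max(err, 1e-10), 1 - 1e-10)   # guard against log(0) and division by zero
        alpha = math.log((1 - err) / err)       # vote weight of this model
        # Up-weight misclassified examples (assumed factor: exp(alpha)), then renormalize.
        w = [wi * math.exp(alpha) if model(x) != y else wi
             for wi, x, y in zip(w, xs, ys)]
        total = sum(w)
        w = [wi / total for wi in w]
        ensemble.append((alpha, model))
    return ensemble


def predict(ensemble, x):
    """Weighted vote of the ensemble, using the alpha of each model."""
    score = sum(alpha * model(x) for alpha, model in ensemble)
    return 1 if score >= 0 else -1


# Toy data (made up): positives sit in the middle, so no single stump is perfect.
xs = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
ys = [-1, -1, -1, 1, 1, 1, 1, -1, -1, -1]
ensemble = boost(xs, ys, n_rounds=3)
train_acc = sum(predict(ensemble, x) == y for x, y in zip(xs, ys)) / len(xs)
print(train_acc)
```

On this toy set the best single stump misclassifies three of the ten examples, while the weighted vote of three stumps separates the middle interval from both ends.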

Pathologies

Overfitting (cont)

(Oates & Jensen 1999)

Oversearching

[Figure: accuracy on the training set and on the test set by search method, heuristic search versus exhaustive search]

(Quinlan and Cameron-Jones 1995; Murthy and Salzberg 1995)

Attribute selection errors

[Figure: accuracy on the training set and on the test set by number of possible values, comparing an attribute with few possible values against an attribute with many possible values]

(Quinlan 1998; Liu and White 1994)

Evaluation functions are estimators

  • Evaluation functions are functions f(m,D) on models (m) and data samples (D)
  • Samples vary in their “representativeness”: f(m,D1) = x1 ≠ x2 = f(m,D2)
  • Each score x is an estimate of some population parameter

Example: Dice rolling

  • For a fair die with six outcomes 1, 2, 3, 4, 5, 6 (H0: all outcomes are equally likely), what is the sampling distribution of Xi?
    • E(Xi|H0) = 3.5
    • p(Xi>5|H0) = 1/6 ≈ 0.17

Example: Dice rolling

  • For the maximum of ten dice (H0: all outcomes equally likely), what is the sampling distribution of Xmax?
    • E(Xmax|H0) ≈ 5.82
    • p(Xmax>5|H0) = 1 - (5/6)^10 ≈ 0.84
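
The two dice-rolling questions can be answered exactly. The sketch below uses only the fact that, for independent fair dice, P(Xmax ≤ k) = (k/6)^n; the variable and function names are illustrative.

```python
from fractions import Fraction

faces = range(1, 7)
p = Fraction(1, 6)

# Single fair die under H0.
e_single = sum(x * p for x in faces)
p_single_gt5 = sum(p for x in faces if x > 5)


def cdf_max(k, n_dice):
    """P(Xmax <= k) for the maximum of n_dice independent fair dice."""
    return Fraction(k, 6) ** n_dice


# Maximum of ten dice under H0.
n_dice = 10
pmf_max = {k: cdf_max(k, n_dice) - cdf_max(k - 1, n_dice) for k in faces}
e_max = sum(k * pk for k, pk in pmf_max.items())
p_max_gt5 = 1 - cdf_max(5, n_dice)

print(float(e_single), float(p_single_gt5))
print(float(e_max), float(p_max_gt5))
```

A six on a single die is fairly unlikely, yet a six somewhere among ten dice is the expected outcome. Judging a maximum against the single-die distribution would therefore make an ordinary result look remarkable.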

Using the right sampling distribution

  • The sampling distribution of Xmax differs from the sampling distribution of Xi
  • A direct analogy exists between dice rolling and searching multiple models, model components, attributes, etc.
  • The evaluation of any given score varies with the number of models (or components, attributes, etc.) compared during search.

Multiple comparisons are ubiquitous in learning...

  • Used to select:
    • Settings: A>1, A>2, A>4…
    • Components: A>3, B=4, C>56.3…
    • Models: Tree 1, Tree 2, Tree 3…
    • Methods: trees, rules, networks…
    • Parameters: depth=4, depth=5, depth=6...

Overfitting

  • Many components are available to use in a given model.
  • Algorithms select the component with the maximum score.
  • The correct sampling distribution depends on number of components evaluated.
  • Most learning algorithms do not adjust for number of components.

Biased parameter estimates

  • Sample scores are routinely used as estimates of population parameters.
  • Any xi score is often an unbiased estimator of the population score.
  • But xmax is almost always a biased estimator.
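
A small simulation can illustrate the bias of the maximum score. It assumes a hypothetical setting in which every candidate model has the same true accuracy (0.5) and is scored on its own random sample; the sample size, number of models, and number of trials are arbitrary choices.

```python
import random


def sample_accuracy(true_acc, n, rng):
    """Accuracy of one model measured on a random sample of n instances."""
    return sum(rng.random() < true_acc for _ in range(n)) / n


rng = random.Random(0)
true_acc, n, n_models, n_trials = 0.5, 50, 20, 2000

single_scores, max_scores = [], []
for _ in range(n_trials):
    scores = [sample_accuracy(true_acc, n, rng) for _ in range(n_models)]
    single_scores.append(scores[0])   # a model fixed before looking at any score
    max_scores.append(max(scores))    # the model selected because it scored best

mean_single = sum(single_scores) / n_trials
mean_max = sum(max_scores) / n_trials
print(round(mean_single, 3), round(mean_max, 3))
```

The average score of the pre-selected model stays close to the true accuracy, while the average of the selected maximum sits well above it even though no model is actually better than another.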

Oversearching

  • Two or more search spaces contain different numbers of models.
  • Maximum scores in each space are biased to differing degrees.
  • Most algorithms directly compare scores.
  • Attribute selection errors can be explained in an analogous way.

Adjusting for multiple comparisons

  • Remove bias by testing on withheld data
    • New data (e.g., Oates & Jensen 1999)
    • Cross-validation (e.g., Weiss and Kulikowski 1991)
  • Estimate sampling distribution accurately
    • Randomization tests (e.g., Jensen 1992)
  • Adjust probability calculation
    • Bonferroni adjustment (e.g., Jensen & Schmill 1997)
  • Alter evaluation function to incorporate complexity penalty
    • MDL, BIC, etc.
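
As one illustration of adjusting the probability calculation, the sketch below applies a Bonferroni adjustment. It assumes n independent comparisons, each tested at significance level alpha with all null hypotheses true; the function names and the values alpha = 0.05 and n = 10 are hypothetical.

```python
def familywise_error(alpha, n):
    """P(at least one of n independent tests rejects when every null is true)."""
    return 1 - (1 - alpha) ** n


def bonferroni_alpha(alpha, n):
    """Per-comparison significance level under the Bonferroni adjustment."""
    return alpha / n


alpha, n = 0.05, 10
unadjusted = familywise_error(alpha, n)
adjusted = familywise_error(bonferroni_alpha(alpha, n), n)
print(round(unadjusted, 3), round(adjusted, 3))
```

Without adjustment, the chance of at least one spurious "significant" component among ten is roughly 40%; dividing alpha by the number of comparisons brings it back under the nominal 5%.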