Data Mining: Predictive Modeling and Evaluation - Prof. Jennifer L. Neville, Study notes of Data Analysis & Statistical Methods

Purdue University Data Analysis & Statistical Methods

Prof. Jennifer L. Neville

This document from Purdue University covers various aspects of predictive modeling and evaluation in data mining. Topics include score functions, cost-sensitive models, ROC curves, bias-variance analysis, ensemble methods, and pathologies. Measures such as accuracy, precision, recall, and F1 score are discussed, along with concepts like overfitting, oversearching, and attribute selection errors.

Typology: Study notes

Pre 2010

Uploaded on 07/30/2009

koofers-user-htd-1 🇺🇸


Data Mining

CS57300 / STAT 59800-024

Purdue University

February 19, 2009

Predictive modeling: evaluation

Score functions

Zero-one loss
- Accuracy
- Sensitivity/specificity
- Precision/Recall/F
Absolute loss
Squared loss
- Root mean-squared error
Likelihood/conditional likelihood
Area under the ROC curve

Simple measures on tables

               Actual +    Actual -
Predicted +    TP          FP
Predicted -    FN          TN

- True positive rate (TPR) = TP/(TP+FN)
- False positive rate (FPR) = FP/(FP+TN)
- Recall = TP/(TP+FN) = TPR
- Precision = TP/(TP+FP)
- Specificity = TN/(FP+TN)
- Sensitivity = TPR
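
As a rough illustration, the measures above can be computed directly from the four cells of the table. This is a minimal sketch: the function names, the "+"/"-" label encoding, and the convention of returning 0.0 for an empty denominator are assumptions made here, and the F measure is taken to be the usual F1 (harmonic mean of precision and recall).

```python
def confusion_counts(actual, predicted, positive="+"):
    """Count TP, FP, FN, TN, treating `positive` as the positive class."""
    tp = fp = fn = tn = 0
    for a, p in zip(actual, predicted):
        if p == positive:
            if a == positive:
                tp += 1
            else:
                fp += 1
        else:
            if a == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, fn, tn


def ratio(numerator, denominator):
    """Assumed convention: an undefined ratio (zero denominator) is reported as 0.0."""
    return numerator / denominator if denominator else 0.0


def simple_measures(tp, fp, fn, tn):
    """Measures defined on the 2x2 table of predicted vs. actual classes."""
    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)  # same as TPR and sensitivity
    return {
        "accuracy": ratio(tp + tn, tp + fp + fn + tn),
        "tpr": recall,
        "fpr": ratio(fp, fp + tn),
        "specificity": ratio(tn, fp + tn),
        "precision": precision,
        "recall": recall,
        "f1": ratio(2 * precision * recall, precision + recall),
    }


if __name__ == "__main__":
    actual = list("+++----+")
    predicted = list("++-+---+")
    print(simple_measures(*confusion_counts(actual, predicted)))
```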

Cost-sensitive models

Define a score function based on a cost matrix
If ~y is the predicted class and y is the true class, then we need to define a matrix of costs C(~y,y)
C(~y,y) reflects the severity of classifying an instance with true class y as class ~y
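
One way to turn a cost matrix into a score function is to average C(~y,y) over the evaluated instances. The sketch below assumes the matrix is stored as a dictionary keyed by (predicted, true); the function names and the example costs are hypothetical. The second function, which picks the class with the lowest expected cost given class probabilities, is a common companion to such a score and is not taken from the slides.

```python
def average_cost(actual, predicted, cost):
    """Score function: mean of C(predicted, true) over all instances."""
    total = sum(cost[(p, a)] for a, p in zip(actual, predicted))
    return total / len(actual)


def min_expected_cost_class(class_probs, cost):
    """Pick the prediction whose expected cost under class_probs is lowest.

    class_probs maps each true class y to an estimated probability p(y).
    """
    def expected_cost(candidate):
        return sum(cost[(candidate, y)] * p for y, p in class_probs.items())

    return min(class_probs, key=expected_cost)


if __name__ == "__main__":
    # Hypothetical costs: missing a positive is five times worse than a false alarm.
    cost = {("+", "+"): 0, ("+", "-"): 1, ("-", "+"): 5, ("-", "-"): 0}
    print(average_cost(list("++--"), list("+-+-"), cost))
    print(min_expected_cost_class({"+": 0.3, "-": 0.7}, cost))
```

With asymmetric costs like these, an instance can be assigned to the positive class even when the negative class is more probable.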

Bias-variance analysis

[Figure: Conventional bias/variance framework. Training set samples are used to learn models M1, M2, M3, whose predictions are compared on a test set.]

Findings

Bias
- Often related to size of model space
- More complex models tend to have lower bias
Variance
- Often related to size of dataset
- When the dataset is large enough to estimate parameters well, models have lower variance
Simple models can perform surprisingly well due to lower variance
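
These findings can be checked empirically by learning a model on many independent training set samples and looking at how its predictions behave at fixed test points. The sketch below is only an illustration under stated assumptions: the target function x*x, the noise level, the two learners (a constant predictor and a 1-nearest-neighbor predictor), and all function names are choices made here, not part of the notes. It uses squared loss, so bias is measured as the squared gap between the average prediction and the true value.

```python
import random


def true_function(x):
    """Hypothetical target used only for this illustration."""
    return x * x


def sample_training_set(n, rng, noise_sd=0.3):
    """Draw n noisy (x, y) pairs with x uniform on [-1, 1]."""
    xs = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    return [(x, true_function(x) + rng.gauss(0.0, noise_sd)) for x in xs]


def fit_constant(data):
    """Very simple model: always predict the mean training target."""
    mean_y = sum(y for _, y in data) / len(data)
    return lambda x: mean_y


def fit_nearest_neighbor(data):
    """More flexible model: predict the target of the closest training x."""
    return lambda x: min(data, key=lambda pair: abs(pair[0] - x))[1]


def bias_variance(fit, test_xs, n_sets=200, n_train=20, seed=0):
    """Estimate average squared bias and variance of predictions at test_xs."""
    rng = random.Random(seed)
    models = [fit(sample_training_set(n_train, rng)) for _ in range(n_sets)]
    bias_sq = 0.0
    variance = 0.0
    for x in test_xs:
        preds = [model(x) for model in models]
        mean_pred = sum(preds) / len(preds)
        bias_sq += (mean_pred - true_function(x)) ** 2
        variance += sum((p - mean_pred) ** 2 for p in preds) / len(preds)
    return bias_sq / len(test_xs), variance / len(test_xs)


if __name__ == "__main__":
    test_xs = [-0.9, -0.5, 0.0, 0.5, 0.9]
    print("constant:", bias_variance(fit_constant, test_xs))
    print("1-NN:    ", bias_variance(fit_nearest_neighbor, test_xs))
```

In this setup the constant model should show the higher bias and the nearest-neighbor model the higher variance, which matches the pattern described in the findings.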

Bias/variance tradeoff

[Figure: Expected MSE plotted against size of parameter space, with regions labeled "High bias, low variance" and "Low bias, high variance".]

Ensemble methods

Motivation
- Too difficult to construct a single model that optimizes performance (why?)
Approach
- Construct many models on different versions of the training set and combine them during prediction
Goal: reduce bias and/or variance

Bagging

Given a training data set D={(x1,y1),..., (xN,yN)}
For m=1:M
- Obtain a bootstrap sample Dm by drawing N instances with replacement from D
- Learn model Mm from Dm
To classify test instance t, apply all models to t and take majority vote
Models have approximately uncorrelated errors due to differences in training sets (each bootstrap sample contains ~63% of the distinct instances in D)
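
The bagging procedure above can be sketched in a few lines. The base learner here is a hypothetical one-feature decision stump chosen only to keep the example self-contained; any learner that returns a callable model could be passed in its place. Function names and the toy data are assumptions.

```python
import random
from collections import Counter


def bootstrap_sample(data, rng):
    """Draw len(data) instances from data with replacement."""
    n = len(data)
    return [data[rng.randrange(n)] for _ in range(n)]


def learn_stump(data):
    """Hypothetical base learner: best single threshold on one numeric feature.

    data is a list of (x, y) pairs with y in {0, 1}.
    """
    values = sorted({x for x, _ in data})
    thresholds = [values[0] - 1.0] + [(a + b) / 2 for a, b in zip(values, values[1:])]
    best = None
    for t in thresholds:
        for sign in (1, -1):
            errors = sum(1 for x, y in data if (1 if sign * (x - t) > 0 else 0) != y)
            if best is None or errors < best[0]:
                best = (errors, t, sign)
    _, t, sign = best
    return lambda x: 1 if sign * (x - t) > 0 else 0


def bagging(data, n_models, learn, seed=0):
    """Learn one model per bootstrap sample of the training data."""
    rng = random.Random(seed)
    return [learn(bootstrap_sample(data, rng)) for _ in range(n_models)]


def predict_majority(models, x):
    """Apply every model to x and return the most common prediction."""
    votes = Counter(model(x) for model in models)
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    data = [(x, 1 if x > 5 else 0) for x in range(11)]
    models = bagging(data, n_models=25, learn=learn_stump)
    print(predict_majority(models, 9), predict_majority(models, 1))
```

Using an odd number of models avoids ties in the two-class vote.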

Boosting

Main assumption
- Combining many weak (but stable) predictors in an ensemble produces a strong predictor
- Weak predictor: only weakly predicts correct class of instances (e.g., tree stumps, 1-R)
Model space: non-parametric, can model any function if an appropriate base model is used

Boosting

Assign every example in D an equal weight (1/N)
For m=1:M
- Learn model Mm with Dm
- Calculate the error of Mm and up-weight the examples that are incorrectly classified to form Dm+1
- Normalize weights in Dm+1 to sum to 1
- Set αm = log((1-errm)/errm)
To classify test instance t, apply all models to t and take weighted vote of predictions (i.e., using αm)
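
A minimal sketch of the boosting loop above, assuming labels in {-1, +1} and a hypothetical weighted decision stump as the weak predictor. The slides do not say by how much misclassified examples are up-weighted; multiplying their weights by exp(log((1-errm)/errm)), as in the common AdaBoost.M1 formulation, is an assumption made here, as are the function names and the early stop when the weighted error reaches 0.5.

```python
import math


def learn_weighted_stump(xs, ys, weights):
    """Hypothetical weak learner: threshold on one feature minimizing weighted error."""
    values = sorted(set(xs))
    thresholds = [values[0] - 1.0] + [(a + b) / 2 for a, b in zip(values, values[1:])]
    best = None
    for t in thresholds:
        for sign in (1, -1):
            err = sum(w for x, y, w in zip(xs, ys, weights)
                      if (sign if x > t else -sign) != y)
            if best is None or err < best[0]:
                best = (err, t, sign)
    err, t, sign = best
    return (lambda x: sign if x > t else -sign), err


def boost(xs, ys, n_rounds):
    """Return a list of (vote weight, model) pairs."""
    n = len(xs)
    weights = [1.0 / n] * n  # every example starts with equal weight
    ensemble = []
    for _ in range(n_rounds):
        model, err = learn_weighted_stump(xs, ys, weights)
        if err >= 0.5:  # assumed stopping rule: no better than chance
            break
        err = max(err, 1e-10)  # guard against log of infinity for a perfect model
        alpha = math.log((1 - err) / err)
        ensemble.append((alpha, model))
        # Up-weight misclassified examples (assumed factor exp(alpha)), then normalize.
        weights = [w * math.exp(alpha) if model(x) != y else w
                   for x, y, w in zip(xs, ys, weights)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble


def predict(ensemble, x):
    """Weighted vote of all models in the ensemble."""
    score = sum(alpha * model(x) for alpha, model in ensemble)
    return 1 if score > 0 else -1


if __name__ == "__main__":
    xs = [1, 2, 3, 4, 5, 6, 7, 8, 9]
    ys = [1, 1, 1, -1, -1, -1, 1, 1, 1]  # no single stump separates these
    ensemble = boost(xs, ys, n_rounds=3)
    print([predict(ensemble, x) for x in xs])
```

No single stump fits the toy labels, yet the weighted vote of three stumps classifies all nine training points correctly.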

Pathologies

Overfitting (cont)

[Figure from Oates & Jensen 1999]

Oversearching

[Figure: accuracy on the training set and test set by search method, heuristic search vs. exhaustive search (Quinlan and Cameron-Jones 1995; Murthy and Salzberg 1995)]

Attribute selection errors

[Figure: accuracy on the training set and test set by number of possible attribute values, few vs. many possible values (Quinlan 1998; Liu and White 1994)]

Evaluation functions are estimators

Evaluation functions are functions f(m,D) on models (m) and data samples (D)
Samples vary in their “representativeness”: f(m,D1) = x1 ≠ x2 = f(m,D2)
Each score x is an estimate of some population parameter

Example: Dice rolling

For a fair die with six outcomes (H0: all outcomes are equally likely)
What is the sampling distribution of Xi?
[Figure: sampling distribution of Xi over the outcomes 1 to 6]
- E(Xi|H0) = 3.5
- p(Xi>5|H0) ≈ 0.167

Example: Dice rolling

For the maximum of ten dice (H0: all outcomes equally likely)
What is the sampling distribution of Xmax?
[Figure: sampling distribution of Xmax over the outcomes 1 to 6]
- E(Xmax|H0) ≈ 5.82
- p(Xmax>5|H0) ≈ 0.838
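
Both dice examples can be worked out exactly rather than by simulation, since the maximum of n independent fair dice satisfies p(Xmax <= k) = (k/6)^n. The sketch below uses exact fractions; the function names are assumptions.

```python
from fractions import Fraction


def max_of_dice_pmf(n_dice, sides=6):
    """Exact distribution of the maximum of n_dice fair dice.

    Uses p(max <= k) = (k/sides)**n_dice, so
    p(max = k) = (k/sides)**n_dice - ((k-1)/sides)**n_dice.
    """
    return {
        k: Fraction(k, sides) ** n_dice - Fraction(k - 1, sides) ** n_dice
        for k in range(1, sides + 1)
    }


def expectation(pmf):
    """Expected value of a distribution given as {outcome: probability}."""
    return sum(k * p for k, p in pmf.items())


def tail_probability(pmf, threshold):
    """p(X > threshold)."""
    return sum(p for k, p in pmf.items() if k > threshold)


if __name__ == "__main__":
    for n in (1, 10):
        pmf = max_of_dice_pmf(n)
        print(n, float(expectation(pmf)), float(tail_probability(pmf, 5)))
```

For a single die this gives an expectation of 7/2 and a tail probability of 1/6; for the maximum of ten dice the expectation is about 5.82 and the tail probability 1 - (5/6)^10, about 0.838.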

Using the right sampling distribution

The sampling distribution of Xmax differs from the sampling distribution of Xi
A direct analogy exists between dice rolling and searching multiple models, model components, attributes, etc.
The evaluation of any given score varies with the number of models (or components, attributes, etc.) compared during search.

Multiple comparisons are ubiquitous in learning...

Used to select:
- Settings: A>1, A>2, A>4…
- Components: A>3, B=4, C>56.3…
- Models: Tree 1, Tree 2, Tree 3…
- Methods: trees, rules, networks…
- Parameters: depth=4, depth=5, depth=6...

Overfitting

Many components are available to use in a given model.
Algorithms select the component with the maximum score.
The correct sampling distribution depends on the number of components evaluated.
Most learning algorithms do not adjust for the number of components.

Biased parameter estimates

Sample scores are routinely used as estimates of population parameters.
Any xi score is often an unbiased estimator of the population score.
But xmax is almost always a biased estimator
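
The size of this bias can be computed exactly in a simplified setting. The sketch below assumes k models whose sample accuracies are independent, each with the same true accuracy p and each scored on n instances, so every individual score is Binomial(n, p)/n. These independence and equal-accuracy assumptions, and the function name, are illustrative choices made here.

```python
from math import comb


def expected_max_accuracy(n_models, n_instances, true_accuracy=0.5):
    """Expected value of the best sample accuracy among n_models models.

    Assumes independent scores, each distributed as Binomial(n_instances, p) / n_instances.
    """
    p = true_accuracy
    n = n_instances
    pmf = [comb(n, j) * p ** j * (1 - p) ** (n - j) for j in range(n + 1)]
    expected_correct = 0.0
    cdf_prev = 0.0  # p(single score <= j - 1)
    for j in range(n + 1):
        cdf = cdf_prev + pmf[j]
        # p(max = j) = p(max <= j) - p(max <= j - 1)
        expected_correct += j * (cdf ** n_models - cdf_prev ** n_models)
        cdf_prev = cdf
    return expected_correct / n


if __name__ == "__main__":
    for k in (1, 10, 50):
        print(k, expected_max_accuracy(k, n_instances=20))
```

With one model the expected score equals the true accuracy; as more models are compared, the expected maximum rises even though no model is actually better than chance.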

Oversearching

Two or more search spaces contain different numbers of models.
Maximum scores in each space are biased to differing degrees.
Most algorithms directly compare scores.
Attribute selection errors can be explained in an analogous way.

Adjusting for multiple comparisons

Remove bias by testing on withheld data
- New data (e.g., Oates & Jensen 1999)
- Cross-validation (e.g., Weiss and Kulikowski 1991)
Estimate sampling distribution accurately
- Randomization tests (e.g., Jensen 1992)
Adjust probability calculation
- Bonferroni adjustment (e.g., Jensen & Schmill 1997)
Alter evaluation function to incorporate complexity penalty
- MDL, BIC, etc.
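
As one illustration of estimating the sampling distribution directly, a randomization test can be applied to the maximum score itself: shuffle the class labels, recompute the best score over all attributes, and see how often chance alone matches the observed maximum. This is a sketch under assumptions, not the procedure from the cited papers; the score (training accuracy of a one-attribute rule on binary data) and the function names are hypothetical.

```python
import random


def max_attribute_accuracy(columns, labels):
    """Best training accuracy of a one-attribute rule over all binary attributes.

    Each rule predicts either the attribute value or its complement.
    """
    n = len(labels)
    best = 0.0
    for column in columns:
        agree = sum(1 for a, y in zip(column, labels) if a == y)
        best = max(best, agree / n, (n - agree) / n)
    return best


def randomization_p_value(columns, labels, n_rounds=1000, seed=0):
    """Estimate p(max score >= observed) when labels are unrelated to attributes."""
    rng = random.Random(seed)
    observed = max_attribute_accuracy(columns, labels)
    shuffled = list(labels)
    at_least_as_good = 0
    for _ in range(n_rounds):
        rng.shuffle(shuffled)  # breaks any real attribute/label relationship
        if max_attribute_accuracy(columns, shuffled) >= observed:
            at_least_as_good += 1
    # Add-one correction keeps the estimate strictly positive.
    return observed, (at_least_as_good + 1) / (n_rounds + 1)


if __name__ == "__main__":
    rng = random.Random(1)
    labels = [i % 2 for i in range(40)]
    noise = [[rng.randint(0, 1) for _ in range(40)] for _ in range(20)]
    print(randomization_p_value(noise, labels, n_rounds=200))
    print(randomization_p_value(noise + [labels], labels, n_rounds=200))
```

Because each shuffled round takes the maximum over the same number of attributes as the real search, the reference distribution already accounts for the number of comparisons.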
