Data Mining: Predictive Modeling and Search Algorithms - Prof. Jennifer L. Neville, Study notes of Data Analysis & Statistical Methods

Purdue University Data Analysis & Statistical Methods

Prof. Jennifer L. Neville

This document from Purdue University, dated February 10, 2009, covers several aspects of data mining: predictive modeling, learning algorithms, search techniques, problem solving as search, and specific examples such as the cannibals and missionaries problem and the eight-queens problem. It also discusses decision tree learning and evaluation functions.

Typology: Study notes

Pre 2010

Uploaded on 07/30/2009


Data Mining

CS57300 / STAT 59800-024

Purdue University

February 10, 2009

Predictive modeling: learning

Learning algorithms

Evaluation function
Search technique

Search

Cannibal and Missionaries problem

Three cannibals and three missionaries come to a crocodile-infested river.
There is a boat on their side that can be used by either one or two persons.
If cannibals outnumber the missionaries at any time, the cannibals eat the missionaries.

How can they use the boat to cross the river so that all missionaries survive?

Problem formulation

What is the goal to be achieved?
What are the actions?
What is the state space?

Search algorithms

Generate a search tree by:
- Considering a particular state
- Testing to see if it is the goal state
- And if not, expanding the node to generate successor states (by applying all possible actions)

Search strategies differ in their choice of which state to expand
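
As a rough illustration of the loop above, here is a minimal sketch that applies it to the cannibals and missionaries problem. The state encoding (missionaries and cannibals on the starting bank, plus the boat side), the breadth-first expansion order, and all function names are assumptions made for this example; the slides do not prescribe them. The sketch also remembers visited states, so strictly speaking it is a graph search.

```python
from collections import deque

TOTAL = 3  # three missionaries and three cannibals
# Possible boat loads as (missionaries, cannibals): one or two people.
ACTIONS = [(1, 0), (2, 0), (0, 1), (0, 2), (1, 1)]


def is_safe(m, c):
    """True if missionaries are not outnumbered on either bank."""
    start_ok = m == 0 or m >= c
    far_ok = (TOTAL - m) == 0 or (TOTAL - m) >= (TOTAL - c)
    return start_ok and far_ok


def successors(state):
    """Expand a node: apply every legal action to (m, c, boat_at_start)."""
    m, c, boat_at_start = state
    sign = -1 if boat_at_start else 1
    for dm, dc in ACTIONS:
        nm, nc = m + sign * dm, c + sign * dc
        if 0 <= nm <= TOTAL and 0 <= nc <= TOTAL and is_safe(nm, nc):
            yield (nm, nc, not boat_at_start)


def breadth_first_search(start, goal):
    """Expand the shallowest unexpanded state first; return the path found."""
    frontier = deque([start])
    parent = {start: None}  # doubles as the visited set
    while frontier:
        state = frontier.popleft()
        if state == goal:  # goal test
            path = []
            while state is not None:
                path.append(state)
                state = parent[state]
            return path[::-1]
        for nxt in successors(state):
            if nxt not in parent:
                parent[nxt] = state
                frontier.append(nxt)
    return None


solution = breadth_first_search((3, 3, True), (0, 0, False))
```

Swapping the `deque` for a stack or a priority queue changes which state is expanded next, which is the sense in which search strategies differ.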

Local search

Used when path to goal state does not matter
Basic idea:
- Only use current state
- Don’t save paths followed
Why use local search?
- Low memory requirements: Usually constant
- Effective: Can often find good solutions in extremely large state spaces
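
One way to picture local search is hill climbing on the eight-queens problem. The sketch below is only an illustration under stated assumptions: the one-queen-per-column encoding, the steepest-descent move rule, and the random restarts are choices made for this example, not something the slides specify. Note that only the current board is stored, never the path that led to it.

```python
import random

N = 8


def conflicts(state):
    """Number of attacking queen pairs; state[col] is the queen's row."""
    count = 0
    for i in range(N):
        for j in range(i + 1, N):
            if state[i] == state[j] or abs(state[i] - state[j]) == j - i:
                count += 1
    return count


def hill_climb(rng):
    """Steepest descent from a random state; stops at a local minimum."""
    state = [rng.randrange(N) for _ in range(N)]
    while True:
        current = conflicts(state)
        best, best_move = current, None
        for col in range(N):
            original = state[col]
            for row in range(N):
                if row == original:
                    continue
                state[col] = row
                score = conflicts(state)
                if score < best:
                    best, best_move = score, (col, row)
            state[col] = original
        if best_move is None:  # no neighbor improves the score
            return state, current
        state[best_move[0]] = best_move[1]


def solve(seed=0):
    """Restart from fresh random states until a conflict-free board is found."""
    rng = random.Random(seed)
    while True:
        state, score = hill_climb(rng)
        if score == 0:
            return state


solution = solve()
```

The `conflicts` function plays the role of the evaluation function here: it is the only information the search uses to pick its next state.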

Evaluation functions

Guide search inside of algorithms
- Select path in heuristic search
- Decide when to stop
- Identify whether to include model or pattern in output
Evaluate results outside of algorithms
- Measure the absolute quality of model or pattern
- Compare the relative quality of different algorithms or outputs

Decision tree learning

Learning predictive models

Choose a data representation
Select a knowledge representation (a “model”)
Use search to identify possible models
- Search a space of alternative model structures and/or parameters
- Score possible models with an evaluation function to determine the model with the best fit to the data

Tree models

Easy-to-understand knowledge representation
Can handle mixed variables
Recursive, divide-and-conquer learning method

Efficient inference

Tree models

Most well-known systems
- CART: Breiman, Friedman, Olshen and Stone
- ID3, C4.5: Quinlan
How do they differ?
- Split evaluation function
- Stopping criterion
- Pruning mechanism

Choosing an attribute/feature

Idea: a good attribute splits the examples into subsets that are (ideally) "all positive" or "all negative"

Split evaluation

Instance Based Methods
Classification by decision tree induction
Classification by Neural Networks
Classification by Support Vector Machines (SVM)

Prediction

Training Dataset

age       income   student   credit_rating   buys_computer
<= 30     high     no        fair            no
<= 30     high     no        excellent       no
31 … 40   high     no        fair            yes
> 40      medium   no        fair            yes
> 40      low      yes       fair            yes
> 40      low      yes       excellent       no
31 … 40   low      yes       excellent       yes
<= 30     medium   no        fair            no
<= 30     low      yes       fair            yes
> 40      medium   yes       fair            yes
<= 30     medium   yes       excellent       yes
31 … 40   medium   no        excellent       yes
31 … 40   high     yes       fair            yes
> 40      medium   no        excellent       no

[Figure: candidate split on Income with branches High, Med, Low]

Income   Buy   No buy
High     2     2
Med      4     2
Low      3     1

Entropy

Used to quantify the amount of randomness of a probability distribution.
Definition: The entropy H(X) of a discrete random variable X is defined by:

$$H(X) = -\sum_{x} p(x) \log_2 p(x)$$
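
A small sketch of the definition, assuming base-2 logarithms (which matches the worked values in these notes, where an even two-way split has entropy 1). The function name and the example distributions are illustrative; the 9/14 and 5/14 proportions are the yes/no counts of buys_computer in the training dataset.

```python
from math import log2


def entropy(probabilities):
    """Shannon entropy in bits; zero-probability outcomes contribute 0."""
    return -sum(p * log2(p) for p in probabilities if p > 0)


h_fair = entropy([0.5, 0.5])        # maximally random two-outcome variable
h_certain = entropy([1.0])          # no randomness at all
h_buys = entropy([9 / 14, 5 / 14])  # class distribution of buys_computer
```

Here `h_buys` comes out at roughly 0.94 bits, the value used as the entropy of the full dataset in the information gain calculation.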

Information gain

[Figure: split on Income with branches High, Med, Low]

Entropy(Income=high)
= -2/4 log2(2/4) - 2/4 log2(2/4) = 1

Entropy(Income=med)
= -4/6 log2(4/6) - 2/6 log2(2/6) = 0.9183

Entropy(Income=low)
= -3/4 log2(3/4) - 1/4 log2(1/4) = 0.8113

Gain(D,Income)
= 0.9400 - (4/14 [1] + 6/14 [0.9183] + 4/14 [0.8113]) ≈ 0.029
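
The same calculation can be sketched in code. This is an illustrative helper, not part of the original notes: the function names and the dictionary layout (attribute value mapped to its Buy / No buy counts) are assumptions.

```python
from math import log2


def entropy(counts):
    """Entropy in bits of a class distribution given as raw counts."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)


def information_gain(table):
    """Entropy before the split minus the weighted entropy after it."""
    rows = list(table.values())
    class_totals = [sum(col) for col in zip(*rows)]
    n = sum(class_totals)
    remainder = sum(sum(row) / n * entropy(row) for row in rows)
    return entropy(class_totals) - remainder


# (Buy, No buy) counts for each value of Income.
income = {"high": (2, 2), "med": (4, 2), "low": (3, 1)}
gain_income = information_gain(income)
```

For the Income table this gives a gain of about 0.029, so Income on its own separates buyers from non-buyers only weakly.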

Example

What is the information gain for Type?

What is the information gain for Patrons?

Gini gain

Similar to information gain
Uses the Gini index instead of entropy
Measures the decrease in Gini index after a split
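
The slide's formula did not survive in these notes, so the sketch below assumes the standard definition of the Gini index, one minus the sum of squared class proportions, and computes the gain as the index before the split minus the size-weighted index after it. Function names and the table layout are illustrative.

```python
def gini(counts):
    """Gini index of a class distribution given as raw counts."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)


def gini_gain(table):
    """Gini index before the split minus the weighted index after it."""
    rows = list(table.values())
    class_totals = [sum(col) for col in zip(*rows)]
    n = sum(class_totals)
    weighted = sum(sum(row) / n * gini(row) for row in rows)
    return gini(class_totals) - weighted


# (Buy, No buy) counts for each value of Income.
income = {"high": (2, 2), "med": (4, 2), "low": (3, 1)}
gain_income = gini_gain(income)
```

On the Income table the Gini gain is about 0.019, small for the same reason the information gain is small: every branch still mixes both classes.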

Contingency tables

Income   Buy   No buy
High     2     2
Med      4     2
Low      3     1

Example calculation

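Observed

Income   Buy    No buy
High     2      2
Med      4      2
Low      3      1

Expected

Income   Buy    No buy
High     2.57   1.43
Med      3.86   2.14
Low      2.57   1.43

$$\chi^2 = \sum_{i=1}^{k} \frac{(o_i - e_i)^2}{e_i}$$

Here o_i and e_i are the observed and expected counts in cell i, and k is the number of cells in the table.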
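
A short sketch of this calculation on the observed Income table. It assumes the usual independence model, where each expected count is the row total times the column total divided by the grand total; the function names are illustrative.

```python
def expected_counts(observed):
    """Expected cell counts if row and column variables were independent."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]


def chi_square(observed):
    """Sum of (observed - expected)^2 / expected over all cells."""
    expected = expected_counts(observed)
    return sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(observed, expected)
        for o, e in zip(obs_row, exp_row)
    )


# Rows: High, Med, Low income. Columns: Buy, No buy.
observed = [[2, 2], [4, 2], [3, 1]]
expected = expected_counts(observed)
statistic = chi_square(observed)
```

The statistic for this table is about 0.57, a small value that indicates the observed counts sit close to what independence between Income and buying would predict.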

Tree search algorithm

Input:
- Initial state?
- Set of actions?
- Goal test?
- Path cost function?
Output:

Algorithm comparison

• CART
- Evaluation criterion: Gini index
- Search algorithm: Simple to complex, hill-climbing
- Stopping criterion: When leaves are pure
- Pruning mechanism: Cross-validation to select Gini threshold

• C4.5
- Evaluation criterion: Information gain
- Search algorithm: Simple to complex, hill-climbing
- Stopping criterion: When leaves are pure
- Pruning mechanism: Reduced error pruning

When to stop growing

Full growth methods (require pruning later)
- All samples at a node belong to the same class
- There are no attributes left for further splits
- There are no samples left
Prepruning methods
- Stop when split evaluation measure is not significant
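
To make the full-growth stopping rules concrete, here is a minimal recursive tree grower run on the training dataset from these notes. It is a sketch under stated assumptions, not the CART or C4.5 implementation: it splits on information gain, does no pruning or prepruning, and the names `grow`, `classify`, `ROWS` and `DOMAIN` are invented for this example. The age value `31...40` is written with ASCII dots.

```python
from collections import Counter
from math import log2

ATTRS = ["age", "income", "student", "credit_rating"]
ROWS = [  # last field is the class label buys_computer
    ("<=30", "high", "no", "fair", "no"),
    ("<=30", "high", "no", "excellent", "no"),
    ("31...40", "high", "no", "fair", "yes"),
    (">40", "medium", "no", "fair", "yes"),
    (">40", "low", "yes", "fair", "yes"),
    (">40", "low", "yes", "excellent", "no"),
    ("31...40", "low", "yes", "excellent", "yes"),
    ("<=30", "medium", "no", "fair", "no"),
    ("<=30", "low", "yes", "fair", "yes"),
    (">40", "medium", "yes", "fair", "yes"),
    ("<=30", "medium", "yes", "excellent", "yes"),
    ("31...40", "medium", "no", "excellent", "yes"),
    ("31...40", "high", "yes", "fair", "yes"),
    (">40", "medium", "no", "excellent", "no"),
]
DOMAIN = {i: sorted({r[i] for r in ROWS}) for i in range(len(ATTRS))}


def entropy(labels):
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())


def gain(rows, i):
    """Information gain of splitting rows on attribute index i."""
    remainder = 0.0
    for value in {r[i] for r in rows}:
        subset = [r[-1] for r in rows if r[i] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return entropy([r[-1] for r in rows]) - remainder


def grow(rows, attrs, default=None):
    """Return a class label (leaf) or (attribute index, {value: subtree})."""
    if not rows:                    # there are no samples left
        return default
    labels = [r[-1] for r in rows]
    majority = Counter(labels).most_common(1)[0][0]
    if len(set(labels)) == 1:       # all samples belong to the same class
        return labels[0]
    if not attrs:                   # no attributes left for further splits
        return majority
    best = max(attrs, key=lambda i: gain(rows, i))
    rest = [i for i in attrs if i != best]
    children = {
        value: grow([r for r in rows if r[best] == value], rest, majority)
        for value in DOMAIN[best]
    }
    return (best, children)


def classify(tree, row):
    """Follow branches until a leaf label is reached."""
    while isinstance(tree, tuple):
        index, children = tree
        tree = children[row[index]]
    return tree


tree = grow(ROWS, list(range(len(ATTRS))))
```

Because the tree is grown until one of the three conditions fires, it fits the training rows exactly; a prepruning variant would add a significance test on the split measure before recursing.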

Movement in the table

[Figure: 2x2 table with rows labeled + (cells a, b) and - (cells c, d); axis labels "Actual" and "P"]

Can be used to make search more efficient

Other issues

Why not just use accuracy?
Are the different “roles” of evaluation functions compatible?

Next class

Reading: Chapter 8 PDM
Topic: Learning cont’
Due: Homework #
