Data Mining: Predictive Modeling and Search Algorithms - Prof. Jennifer L. Neville, Study notes of Data Analysis & Statistical Methods

This document from Purdue University, dated February 10, 2009, covers several aspects of data mining, including predictive modeling, learning algorithms, search techniques, and problem solving as search, with specific examples such as the cannibals and missionaries problem and the eight-queens problem. The document also discusses decision tree learning and evaluation functions.

Typology: Study notes

Pre 2010

Uploaded on 07/30/2009

Data Mining
CS57300 / STAT 59800-024
Purdue University
February 10, 2009
Predictive modeling: learning

Learning algorithms

  • Evaluation function
  • Search technique


Search

Cannibal and Missionaries problem

  • Three cannibals and three missionaries come to a crocodile-infested river.
  • There is a boat on their side that can be used by either one or two persons.
  • If cannibals outnumber the missionaries at any time, the cannibals eat the missionaries.
  • How can they use the boat to cross the river so that all missionaries survive?


Problem formulation

  • What is the goal to be achieved?
  • What are the actions?
  • What is the state space?

Search algorithms

  • Generate a search tree by:
    • Considering a particular state
    • Testing to see if it is the goal state
    • And if not, expanding the node to generate successor states (by applying all possible actions)

  • Search strategies differ in their choice of which state to expand
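The formulation and search loop above can be sketched for the cannibals and missionaries problem. This is a minimal illustration, not the lecture's own code: the state encoding `(missionaries, cannibals, boat)` counted on the starting bank, the helper names, and the choice of breadth-first expansion are all assumptions. It also records visited states, so strictly speaking it is a graph search rather than a pure tree search.

```python
from collections import deque

TOTAL = 3  # three missionaries and three cannibals
# Possible boat loads as (missionaries, cannibals): one or two people.
MOVES = [(1, 0), (2, 0), (0, 1), (0, 2), (1, 1)]


def is_safe(m, c):
    """True if missionaries are not outnumbered on either bank.

    m and c are the counts on the starting bank (assumed encoding).
    """
    start_ok = m == 0 or m >= c
    far_ok = (TOTAL - m) == 0 or (TOTAL - m) >= (TOTAL - c)
    return start_ok and far_ok


def successors(state):
    """Apply every possible action and yield the legal successor states."""
    m, c, boat = state  # boat == 1 means the boat is on the starting bank
    sign = -1 if boat == 1 else 1
    for dm, dc in MOVES:
        nm, nc = m + sign * dm, c + sign * dc
        if 0 <= nm <= TOTAL and 0 <= nc <= TOTAL and is_safe(nm, nc):
            yield (nm, nc, 1 - boat)


def bfs(start=(3, 3, 1), goal=(0, 0, 0)):
    """Breadth-first search: always expand the shallowest unexpanded state."""
    frontier = deque([start])
    parent = {start: None}  # also serves as the visited set
    while frontier:
        state = frontier.popleft()
        if state == goal:  # goal test
            path = []
            while state is not None:
                path.append(state)
                state = parent[state]
            return path[::-1]
        for nxt in successors(state):
            if nxt not in parent:
                parent[nxt] = state
                frontier.append(nxt)
    return None


path = bfs()
```

Swapping the `deque` for a stack or a priority queue changes which state is expanded next, which is the only place the search strategies differ.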


Local search

  • Used when path to goal state does not matter
  • Basic idea:
    • Only use current state
    • Don’t save paths followed
  • Why use local search?
    • Low memory requirements: Usually constant
    • Effective: Can often find good solutions in extremely large state spaces
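As an illustration of these points, here is a small hill-climbing sketch for the eight-queens problem. It is an assumed example rather than the lecture's algorithm: the board encoding (one queen per row, `cols[r]` giving its column), the random-restart wrapper, and the function names are choices made for this sketch. Only the current board is kept, and no path is stored.

```python
import random


def conflicts(cols):
    """Number of attacking queen pairs; cols[r] is the column of the queen in row r."""
    n = len(cols)
    return sum(
        1
        for i in range(n)
        for j in range(i + 1, n)
        if cols[i] == cols[j] or abs(cols[i] - cols[j]) == j - i
    )


def hill_climb(n, rng, max_steps=1000):
    """Steepest-descent hill climbing from a random board.

    Returns a conflict-free board, or None if stuck in a local minimum.
    """
    cols = [rng.randrange(n) for _ in range(n)]
    for _ in range(max_steps):
        current = conflicts(cols)
        if current == 0:
            return cols
        best_score, best_move = current, None
        # Evaluate every single-queen move and remember the best improvement.
        for row in range(n):
            original = cols[row]
            for col in range(n):
                if col == original:
                    continue
                cols[row] = col
                score = conflicts(cols)
                if score < best_score:
                    best_score, best_move = score, (row, col)
            cols[row] = original
        if best_move is None:
            return None  # no neighbor is better: local minimum
        cols[best_move[0]] = best_move[1]
    return None


def solve_queens(n=8, seed=0, max_restarts=1000):
    """Restart hill climbing from fresh random boards until one succeeds."""
    rng = random.Random(seed)
    for _ in range(max_restarts):
        result = hill_climb(n, rng)
        if result is not None:
            return result
    return None


solution = solve_queens()
```

The restarts are needed because plain hill climbing often stops at a board where no single move reduces the number of conflicts.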

Evaluation functions

  • Guide search inside of algorithms
    • Select path in heuristic search
    • Decide when to stop
    • Identify whether to include model or pattern in output
  • Evaluate results outside of algorithms
    • Measure the absolute quality of model or pattern
    • Compare the relative quality of different algorithms or outputs


Decision tree learning

Learning predictive models

  • Choose a data representation
  • Select a knowledge representation (a “model”)
  • Use search to identify possible models
    • Search a space of alternative model structures and/or parameters
    • Score possible models with an evaluation function to determine the model with the best fit to the data


Tree models

  • Easy to understand knowledge representation
  • Can handle mixed variables
  • Recursive, divide and conquer learning method
  • Efficient inference

Tree models

  • Most well-known systems
    • CART: Breiman, Friedman, Olshen and Stone
    • ID3, C4.5: Quinlan
  • How do they differ?
    • Split evaluation function
    • Stopping criterion
    • Pruning mechanism


Choosing an attribute/feature

  • Idea: a good attribute splits the examples into subsets that are (ideally) "all positive" or "all negative"

Split evaluation


CS 590 D

  • Instance Based Methods
  • Classification by decision tree induction
  • Classification by Neural Networks
  • Classification by Support Vector Machines (SVM)
  • Prediction


Training Dataset

age        income   student   credit_rating   buys_computer
<= 30      high     no        fair            no
<= 30      high     no        excellent       no
31 … 40    high     no        fair            yes
> 40       medium   no        fair            yes
> 40       low      yes       fair            yes
> 40       low      yes       excellent       no
31 … 40    low      yes       excellent       yes
<= 30      medium   no        fair            no
<= 30      low      yes       fair            yes
> 40       medium   yes       fair            yes
<= 30      medium   yes       excellent       yes
31 … 40    medium   no        excellent       yes
31 … 40    high     yes       fair            yes
> 40       medium   no        excellent       no

[Figure: the 14 buys_computer labels split on Income into High, Med, and Low branches]

Income   Buy   No buy
High     2     2
Med      4     2
Low      3     1


Entropy

  • Used to quantify the amount of randomness of a probability distribution.
  • Definition: The entropy H(X) of a discrete random variable X is defined by:

$$H(X) = -\sum_{x} p(x) \log_2 p(x)$$
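The definition can be checked numerically with a short sketch. The function name `entropy` and the use of base-2 logarithms (so that entropy is measured in bits) are assumptions of this example; the label counts come from the buys_computer column of the training dataset (9 yes, 5 no).

```python
import math
from collections import Counter


def entropy(labels):
    """Entropy in bits of the empirical distribution of a list of labels."""
    n = len(labels)
    return -sum((k / n) * math.log2(k / n) for k in Counter(labels).values())


# Class labels of the 14 training examples: 9 "yes" and 5 "no".
labels = ["yes"] * 9 + ["no"] * 5
h = entropy(labels)
print(round(h, 4))
```

A 50/50 split gives the maximum of 1 bit for two classes, and a pure set gives 0.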


Information gain

[Figure: the 14 buys_computer labels split on Income into High, Med, and Low branches]

Entropy(Income=high)
= -2/4 log 2/4 - 2/4 log 2/4 = 1

Entropy(Income=med)
= -4/6 log 4/6 - 2/6 log 2/6 = 0.9183

Entropy(Income=low)
= -3/4 log 3/4 - 1/4 log 1/4 = 0.8113

Gain(D, Income)
= 0.9400 - (4/14 [1] + 6/14 [0.9183] + 4/14 [0.8113]) ≈ 0.029

(All logarithms are base 2.)
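The same calculation can be reproduced from the Income contingency counts. This is a sketch under assumed names (`entropy_from_counts`, `information_gain`, and the `income` dictionary are not from the slides); it computes the entropy of the full dataset and subtracts the weighted entropy of the three Income subsets.

```python
import math


def entropy_from_counts(counts):
    """Entropy in bits of a class distribution given as raw counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def information_gain(table):
    """table maps each attribute value to its [buy, no_buy] counts."""
    n = sum(sum(row) for row in table.values())
    class_totals = [sum(col) for col in zip(*table.values())]
    before = entropy_from_counts(class_totals)
    # Weighted average entropy of the subsets produced by the split.
    after = sum(sum(row) / n * entropy_from_counts(row) for row in table.values())
    return before - after


income = {"high": [2, 2], "med": [4, 2], "low": [3, 1]}
gain = information_gain(income)
print(round(gain, 4))
```

The small gain reflects that each Income subset is still nearly as mixed as the full dataset.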


Example

  • What is the information gain for Type?
  • What is the information gain for Patrons?


Gini gain

  • Similar to information gain
  • Uses the Gini index instead of entropy
  • Measures the decrease in Gini index after a split
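The formula from the original slide did not survive extraction, so the sketch below relies on the commonly used definitions as an assumption: the Gini index of a node is one minus the sum of squared class proportions, and the Gini gain is the parent's index minus the size-weighted index of the children. The function names and the reuse of the Income counts are choices made for this example.

```python
def gini(counts):
    """Gini index of a class distribution given as raw counts (assumed definition)."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)


def gini_gain(table):
    """table maps each attribute value to its [buy, no_buy] counts."""
    n = sum(sum(row) for row in table.values())
    class_totals = [sum(col) for col in zip(*table.values())]
    before = gini(class_totals)
    # Size-weighted Gini index of the subsets produced by the split.
    after = sum(sum(row) / n * gini(row) for row in table.values())
    return before - after


income = {"high": [2, 2], "med": [4, 2], "low": [3, 1]}
g = gini_gain(income)
print(round(g, 4))
```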


Contingency tables

Income   Buy   No buy
High     2     2
Med      4     2
Low      3     1


Example calculation

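Observed

Income   Buy    No buy
High     2      2
Med      4      2
Low      3      1

Expected (row total × column total / 14)

Income   Buy    No buy
High     2.57   1.43
Med      3.86   2.14
Low      2.57   1.43

$$\chi^2 = \sum_{i=1}^{k} \frac{(o_i - e_i)^2}{e_i}$$

$$\chi^2 = \frac{(2 - 2.57)^2}{2.57} + \frac{(2 - 1.43)^2}{1.43} + \frac{(4 - 3.86)^2}{3.86} + \frac{(2 - 2.14)^2}{2.14} + \frac{(3 - 2.57)^2}{2.57} + \frac{(1 - 1.43)^2}{1.43} \approx 0.57$$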
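The expected counts and the chi-square score can be recomputed from the observed table with a few lines. This is a minimal sketch with assumed function names; each expected count is the row total times the column total divided by the grand total, and the score sums the squared deviations over all six cells.

```python
def expected_counts(observed):
    """Expected cell counts under independence of rows and columns."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]


def chi_square(observed):
    """Sum of (o - e)^2 / e over every cell of the contingency table."""
    expected = expected_counts(observed)
    return sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(observed, expected)
        for o, e in zip(obs_row, exp_row)
    )


# Rows: High, Med, Low income. Columns: Buy, No buy.
observed = [[2, 2], [4, 2], [3, 1]]
expected = expected_counts(observed)
score = chi_square(observed)
print(round(score, 3))
```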


Tree search algorithm

  • Input:
    • Initial state?
    • Set of actions?
    • Goal test?
    • Path cost function?
  • Output:

Algorithm comparison

• CART
  • Evaluation criterion: Gini index
  • Search algorithm: Simple to complex, hill-climbing search
  • Stopping criterion: When leaves are pure
  • Pruning mechanism: Cross-validation to select Gini threshold
• C4.5
  • Evaluation criterion: Information gain
  • Search algorithm: Simple to complex, hill-climbing search
  • Stopping criterion: When leaves are pure
  • Pruning mechanism: Reduced error pruning


When to stop growing

  • Full growth methods (require pruning later)
    • All samples at a node belong to the same class
    • There are no attributes left for further splits
    • There are no samples left
  • Prepruning methods
    • Stop when split evaluation measure is not significant

Movement in the table

               Actual
Predicted +    a    b
Predicted -    c    d

Can be used to make search more efficient


Other issues

  • Why not just use accuracy?
  • Are the different “roles” of evaluation functions compatible?

Next class

  • Reading: Chapter 8 PDM
  • Topic: Learning, cont'd
  • Due: Homework #