Data Mining: Predictive Modeling and Search Algorithms - Prof. Jennifer L. Neville
Data Mining
CS57300 / STAT 59800-
Purdue University
February 10, 2009
Predictive modeling: learning
Learning algorithms
- Evaluation function
- Search technique
Search
Cannibals and missionaries problem
- Three cannibals and three missionaries come to a crocodile-infested river.
- There is a boat on their side that can be used by either one or two persons.
- If cannibals outnumber the missionaries on either bank at any time, the cannibals eat the missionaries.
- How can they use the boat to cross the river so that all missionaries survive?
Problem formulation
- What is the goal to be achieved?
- What are the actions?
- What is the state space?
Search algorithms
- Generate a search tree by:
- Considering a particular state
- Testing to see if it is the goal state
- And if not, expanding the node to generate successor states (by applying all possible actions)
- Search strategies differ in their choice of which state to expand
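As an illustration of the formulation and the generate-test-expand loop above, here is a minimal sketch that solves the cannibals and missionaries problem with breadth-first search. The state encoding `(missionaries, cannibals, boat)` counted on the starting bank, and the function names, are assumptions made for this example; breadth-first is only one possible choice of which state to expand next.

```python
from collections import deque

def is_safe(m, c):
    """A bank is safe if it has no missionaries or they are not outnumbered."""
    return m == 0 or m >= c

def successors(state):
    """Apply all possible actions: one or two people cross in the boat."""
    m, c, boat = state  # people on the starting bank; boat == 1 if the boat is there
    direction = -1 if boat == 1 else 1
    for dm, dc in [(1, 0), (2, 0), (0, 1), (0, 2), (1, 1)]:
        nm, nc = m + direction * dm, c + direction * dc
        if 0 <= nm <= 3 and 0 <= nc <= 3 and is_safe(nm, nc) and is_safe(3 - nm, 3 - nc):
            yield (nm, nc, 1 - boat)

def breadth_first_search(start=(3, 3, 1), goal=(0, 0, 0)):
    frontier = deque([start])
    parent = {start: None}  # also serves as the set of visited states
    while frontier:
        state = frontier.popleft()       # consider a particular state
        if state == goal:                # test whether it is the goal state
            path = []
            while state is not None:
                path.append(state)
                state = parent[state]
            return path[::-1]
        for nxt in successors(state):    # otherwise expand the node
            if nxt not in parent:
                parent[nxt] = state
                frontier.append(nxt)
    return None

path = breadth_first_search()
for step in path:
    print(step)
```

Swapping the `deque` for a stack or a priority queue changes the search strategy without touching the problem formulation.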
Local search
- Used when path to goal state does not matter
- Basic idea:
- Only use current state
- Don’t save paths followed
- Why use local search?
- Low memory requirements: Usually constant
- Effective: Can often find good solutions in extremely large state spaces
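The basic idea above can be sketched as greedy hill climbing, one common form of local search. The toy objective and neighborhood below are hypothetical and chosen only to keep the example small; note that the loop stores nothing but the current state.

```python
def hill_climb(score, neighbors, state, max_steps=1000):
    """Greedy local search: keep only the current state, never the path."""
    for _ in range(max_steps):
        best = max(neighbors(state), key=score, default=state)
        if score(best) <= score(state):
            return state  # local optimum: no neighbor improves the score
        state = best
    return state

# Hypothetical toy problem: maximize -(x - 7)^2 over the integers 0..20.
def score(x):
    return -(x - 7) ** 2

def neighbors(x):
    return [n for n in (x - 1, x + 1) if 0 <= n <= 20]

result = hill_climb(score, neighbors, state=0)
print(result)
```

Memory use is constant here, but on objectives with several peaks the same procedure can stop at a local optimum rather than the global one.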
Evaluation functions
- Guide search inside of algorithms
- Select path in heuristic search
- Decide when to stop
- Identify whether to include model or pattern in output
- Evaluate results outside of algorithms
- Measure the absolute quality of model or pattern
- Compare the relative quality of different algorithms or outputs
Decision tree learning
Learning predictive models
- Choose a data representation
- Select a knowledge representation (a “model”)
- Use search to identify possible models
- Search a space of alternative model structures and/or parameters
- Score possible models with an evaluation function to determine the model
with the best fit to the data
Tree models
- Easy to understand knowledge representation
- Can handle mixed variables
- Recursive, divide and conquer learning method
Tree models
- Most well-known systems
- CART: Breiman, Friedman, Olshen and Stone
- ID3, C4.5: Quinlan
- How do they differ?
- Split evaluation function
- Stopping criterion
- Pruning mechanism
Choosing an attribute/feature
- Idea: a good attribute splits the examples into subsets that are (ideally) "all positive" or "all negative"
Split evaluation
CS 590 D 30
- Instance Based Methods
- Classification by decision tree induction
- Classification by Neural Networks
- Classification by Support Vector Machines (SVM)
Training Dataset
age       income  student  credit_rating  buys_computer
<= 30     high    no       fair           no
<= 30     high    no       excellent      no
31 … 40   high    no       fair           yes
> 40      medium  no       fair           yes
> 40      low     yes      fair           yes
> 40      low     yes      excellent      no
31 … 40   low     yes      excellent      yes
<= 30     medium  no       fair           no
<= 30     low     yes      fair           yes
> 40      medium  yes      fair           yes
<= 30     medium  yes      excellent      yes
31 … 40   medium  no       excellent      yes
31 … 40   high    yes      fair           yes
> 40      medium  no       excellent      no
[Figure: the 14 training examples (9 yes, 5 no) split on Income into High, Med, and Low branches]
        Buy   No buy
High    2     2
Med     4     2
Low     3     1
Entropy
- Used to quantify the amount of randomness of a probability distribution.
- Definition: The entropy H(X) of a discrete random variable X is defined by:
$$H(X) = -\sum_{x} p(x) \log_2 p(x)$$
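A small sketch of the entropy definition, using base-2 logarithms so that the result is in bits. The function name `entropy` and the example distributions are choices made for this illustration; the second distribution is the class split (9 yes, 5 no) of the training dataset.

```python
import math

def entropy(probs):
    """Entropy in bits of a discrete distribution given as probabilities."""
    # Terms with p == 0 contribute nothing (0 * log 0 is taken to be 0).
    return -sum(p * math.log2(p) for p in probs if p > 0)

h_uniform = entropy([0.5, 0.5])        # two equally likely outcomes
h_classes = entropy([9 / 14, 5 / 14])  # buys_computer: 9 yes, 5 no
print(h_uniform, round(h_classes, 4))
```

Entropy is largest when the outcomes are equally likely and falls to zero when one outcome is certain, which is why a pure subset scores best under this measure.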
Information gain
[Figure: the 14 training examples (9 yes, 5 no) split on Income into High, Med, and Low branches]
Entropy(Income=high)
= -2/4 log2 2/4 - 2/4 log2 2/4 = 1
Entropy(Income=med)
= -4/6 log2 4/6 - 2/6 log2 2/6 = 0.9183
Entropy(Income=low)
= -3/4 log2 3/4 - 1/4 log2 1/4 = 0.8113
Gain(D,Income)
= 0.9400 - (4/14 [1] + 6/14 [0.9183] + 4/14 [0.8113]) ≈ 0.029
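The calculation above can be checked with a short sketch. The dictionary layout (attribute value mapped to `[buy, no buy]` counts) and the function names are assumptions made for this example; the counts are those of the Income split.

```python
import math

def entropy(counts):
    """Entropy in bits of a class distribution given as counts."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)

def information_gain(table):
    """Entropy before the split minus the weighted entropy of the subsets."""
    parent = [sum(col) for col in zip(*table.values())]  # class counts before splitting
    n = sum(parent)
    remainder = sum(sum(counts) / n * entropy(counts) for counts in table.values())
    return entropy(parent) - remainder

income = {"high": [2, 2], "med": [4, 2], "low": [3, 1]}  # [buy, no buy]
gain = information_gain(income)
print(round(gain, 4))
```

The same function can score any categorical attribute once its counts are tabulated, which is how competing attributes are compared at a node.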
Example
- What is the information gain for Type?
- What is the information gain for Patrons?
Gini gain
- Similar to information gain
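- Uses Gini index instead of entropy
- Measures decrease in Gini index after split:
$$\mathrm{Gini}(D) = 1 - \sum_{k} p_k^2, \qquad \mathrm{GiniGain}(D, A) = \mathrm{Gini}(D) - \sum_{v} \frac{|D_v|}{|D|}\,\mathrm{Gini}(D_v)$$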
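A minimal sketch of Gini gain, assuming the standard definition of the Gini index as one minus the sum of squared class proportions. The function names and the table layout (attribute value mapped to `[buy, no buy]` counts) are choices made for this illustration, applied to the Income split of the training dataset.

```python
def gini(counts):
    """Gini index of a class distribution given as counts."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

def gini_gain(table):
    """Gini index before the split minus the weighted Gini index of the subsets."""
    parent = [sum(col) for col in zip(*table.values())]  # class counts before splitting
    n = sum(parent)
    weighted = sum(sum(counts) / n * gini(counts) for counts in table.values())
    return gini(parent) - weighted

income = {"high": [2, 2], "med": [4, 2], "low": [3, 1]}  # [buy, no buy]
g = gini_gain(income)
print(round(g, 4))
```

The structure mirrors information gain exactly; only the impurity measure changes.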
Contingency tables
        Buy   No buy
High    2     2
Med     4     2
Low     3     1
Example calculation
Observed:
        Buy    No buy
High    2      2
Med     4      2
Low     3      1
Expected (row total × column total / 14):
        Buy    No buy
High    2.57   1.43
Med     3.86   2.14
Low     2.57   1.43
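$$\chi^2 = \sum_{i=1}^{k} \frac{(o_i - e_i)^2}{e_i}$$
$$\chi^2 = \frac{(2-2.57)^2}{2.57} + \frac{(2-1.43)^2}{1.43} + \frac{(4-3.86)^2}{3.86} + \frac{(2-2.14)^2}{2.14} + \frac{(3-2.57)^2}{2.57} + \frac{(1-1.43)^2}{1.43} \approx 0.57$$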
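The chi-square score for the Income table can be reproduced with a short sketch. It assumes the usual expected count for a cell, row total times column total divided by the grand total; the function name and list-of-rows layout are choices made for this example.

```python
def chi_square(observed):
    """Pearson chi-square statistic for a contingency table given as a list of rows."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(observed):
        for j, o in enumerate(row):
            e = row_totals[i] * col_totals[j] / n  # expected count under independence
            stat += (o - e) ** 2 / e
    return stat

observed = [[2, 2],   # High: buy, no buy
            [4, 2],   # Med
            [3, 1]]   # Low
chi2 = chi_square(observed)
print(round(chi2, 3))
```

A 3 × 2 table has 2 degrees of freedom, for which the 0.05 critical value is about 5.99, so a score this small would not count as a significant split.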
Tree search algorithm
- Input:
- Initial state?
- Set of actions?
- Goal test?
- Path cost function?
- Output:
Algorithm comparison
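• CART
- Evaluation criterion: Gini index
- Search algorithm: Simple to complex, hill-climbing search
- Stopping criterion: When leaves are pure
- Pruning mechanism: Cross-validation to select Gini threshold
• C4.5
- Evaluation criterion: Information gain
- Search algorithm: Simple to complex, hill-climbing search
- Stopping criterion: When leaves are pure
- Pruning mechanism: Reduced-error pruning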
When to stop growing
- Full growth methods (require pruning later)
- All samples at a node belong to the same class
- There are no attributes left for further splits
- There are no samples left
- Prepruning methods
- Stop when split evaluation measure is not significant
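The full-growth stopping conditions above can be seen in a compact recursive sketch that grows a tree on the training dataset using information gain. This is an illustration under stated assumptions, not CART or C4.5: the nested-dictionary tree format, the function names, and the shortened value spellings (`<=30`, `31..40`) are all choices made for this example, and there is no pruning.

```python
import math
from collections import Counter

ATTRS = ["age", "income", "student", "credit_rating"]
ROWS = [  # last field is the class label buys_computer
    ("<=30", "high", "no", "fair", "no"), ("<=30", "high", "no", "excellent", "no"),
    ("31..40", "high", "no", "fair", "yes"), (">40", "medium", "no", "fair", "yes"),
    (">40", "low", "yes", "fair", "yes"), (">40", "low", "yes", "excellent", "no"),
    ("31..40", "low", "yes", "excellent", "yes"), ("<=30", "medium", "no", "fair", "no"),
    ("<=30", "low", "yes", "fair", "yes"), (">40", "medium", "yes", "fair", "yes"),
    ("<=30", "medium", "yes", "excellent", "yes"), ("31..40", "medium", "no", "excellent", "yes"),
    ("31..40", "high", "yes", "fair", "yes"), (">40", "medium", "no", "excellent", "no"),
]

def entropy(rows):
    counts = Counter(r[-1] for r in rows)
    return -sum(c / len(rows) * math.log2(c / len(rows)) for c in counts.values())

def gain(rows, i):
    """Information gain of splitting rows on attribute index i."""
    sizes = Counter(r[i] for r in rows)
    remainder = sum(n / len(rows) * entropy([r for r in rows if r[i] == v])
                    for v, n in sizes.items())
    return entropy(rows) - remainder

def grow(rows, attrs, default="no"):
    if not rows:                       # no samples left (only reachable if called
        return default                 # with empty data, since we branch on seen values)
    labels = Counter(r[-1] for r in rows)
    majority = labels.most_common(1)[0][0]
    if len(labels) == 1 or not attrs:  # all samples in one class, or no attributes left
        return majority
    best = max(attrs, key=lambda i: gain(rows, i))
    rest = [i for i in attrs if i != best]
    return {(ATTRS[best], v): grow([r for r in rows if r[best] == v], rest, majority)
            for v in sorted({r[best] for r in rows})}

tree = grow(ROWS, list(range(len(ATTRS))))
print(tree)
```

A prepruning variant would add one more test before the split, for example returning `majority` when the best split's evaluation score falls below a chosen threshold.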
Movement in the table
                Actual
Predicted   +   a   b
            -   c   d
Can be used to make search more efficient
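To make the table concrete, here is a small sketch that derives common evaluation measures from its four cells. The slide does not label the Actual columns, so the reading used here is an assumption: `a` = predicted + and actually +, `b` = predicted + but actually -, `c` = predicted - but actually +, `d` = predicted - and actually -. The counts in the example are hypothetical.

```python
def table_measures(a, b, c, d):
    """Measures from a 2x2 table (cell meanings are assumed, see the text above)."""
    total = a + b + c + d
    return {
        "accuracy": (a + d) / total,  # fraction of all examples predicted correctly
        "precision": a / (a + b),     # requires at least one positive prediction
        "recall": a / (a + c),        # requires at least one actual positive
    }

# Hypothetical counts for a rare positive class: 10 actual positives out of 100.
m = table_measures(a=1, b=0, c=9, d=90)
print(m)
```

With these counts accuracy is high while recall is low, one illustration of why a single measure may not tell the whole story when classes are imbalanced.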
Other issues
- Why not just use accuracy?
- Are the different “roles” of evaluation functions compatible?
Next class
- Reading: Chapter 8 PDM
- Topic: Learning cont’d
- Due: Homework #