




























































































February 10, 2014
Data Mining: Concepts and Techniques
Chapter 6. Classification and Prediction
What is classification? What is prediction?
Issues regarding classification and prediction
Classification by decision tree induction
Bayesian classification
Rule-based classification
Classification by back propagation
Support Vector Machines (SVM)
Associative classification
Lazy learners (or learning from your neighbors)
Other classification methods
Prediction
Accuracy and error measures
Ensemble methods
Model selection
Summary
Classification: A Two-Step Process
Model construction: describing a set of predetermined classes
  Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute
  The set of tuples used for model construction is the training set
  The model is represented as classification rules, decision trees, or mathematical formulae
Model usage: for classifying future or unknown objects
  Estimate the accuracy of the model
    The known label of each test sample is compared with the classified result from the model
    Accuracy rate is the percentage of test set samples that are correctly classified by the model
    The test set must be independent of the training set; otherwise the estimate is over-optimistic and over-fitting goes undetected
  If the accuracy is acceptable, use the model to classify data tuples whose class labels are not known
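The two steps can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the "model" is a trivial majority-class predictor (not a method from the slides), and the training and test tuples are made-up examples, not data from the chapter.

```python
from collections import Counter


def train_majority(training_set):
    """Step 1 (model construction): learn the most common class label.

    A deliberately trivial stand-in for a real classifier.
    """
    majority = Counter(label for _, label in training_set).most_common(1)[0][0]
    return lambda features: majority


def accuracy(model, test_set):
    """Step 2 (model usage): fraction of test tuples classified correctly."""
    correct = sum(1 for features, label in test_set if model(features) == label)
    return correct / len(test_set)


# Hypothetical tuples: (features, class label). The test set is kept
# separate from the training set so the estimate is not inflated.
training_set = [
    ({"rank": "Assistant Prof"}, "no"),
    ({"rank": "Professor"}, "yes"),
    ({"rank": "Assistant Prof"}, "no"),
]
test_set = [
    ({"rank": "Associate Prof"}, "no"),
    ({"rank": "Professor"}, "yes"),
]

model = train_majority(training_set)
test_accuracy = accuracy(model, test_set)
print(f"accuracy on held-out tuples: {test_accuracy:.0%}")
```

Evaluating on `training_set` instead would report a different figure, which is why the held-out tuples are used for the estimate.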
Process (1): Model Construction
NAME   RANK             TENURED
Mike   Assistant Prof   no
Mary   Assistant Prof   yes
Bill   Professor        yes
Jim    Associate Prof   yes
Dave   Assistant Prof   no
Anne   Associate Prof   no
Supervised vs. Unsupervised Learning
Issues: Evaluating Classification Methods
Algorithm for Decision Tree Induction
Basic algorithm (a greedy algorithm)
  Tree is constructed in a top-down recursive divide-and-conquer manner
  At start, all the training examples are at the root
  Attributes are categorical (if continuous-valued, they are discretized in advance)
  Examples are partitioned recursively based on selected attributes
  Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain)
Conditions for stopping partitioning
  All samples for a given node belong to the same class
  There are no remaining attributes for further partitioning; majority voting is employed for classifying the leaf
  There are no samples left
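The greedy procedure and its three stopping conditions can be sketched as follows. This is a minimal illustration, not the textbook's pseudocode: the nested-dict tree representation, the function names, and the six training tuples are all assumptions made for the example, and information gain is used as the selection measure.

```python
import math
from collections import Counter


def entropy(rows, target):
    """Expected information needed to classify a tuple in rows."""
    counts = Counter(row[target] for row in rows)
    total = len(rows)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def expected_info(rows, attribute, target):
    """Weighted entropy of the partitions induced by attribute."""
    groups = {}
    for row in rows:
        groups.setdefault(row[attribute], []).append(row)
    return sum(len(g) / len(rows) * entropy(g, target) for g in groups.values())


def majority_label(rows, target):
    return Counter(row[target] for row in rows).most_common(1)[0][0]


def build_tree(rows, attributes, target, parent_majority=None):
    # Stop: no samples left. Branches below are only created for observed
    # values, so this triggers only if the caller passes an empty set.
    if not rows:
        return parent_majority
    labels = {row[target] for row in rows}
    # Stop: all samples at this node belong to the same class.
    if len(labels) == 1:
        return labels.pop()
    # Stop: no remaining attributes, so the leaf is labelled by majority vote.
    if not attributes:
        return majority_label(rows, target)
    # Greedy step: the highest information gain is the lowest expected info.
    best = min(attributes, key=lambda a: expected_info(rows, a, target))
    remaining = [a for a in attributes if a != best]
    majority = majority_label(rows, target)
    branches = {}
    for value in sorted({row[best] for row in rows}):
        subset = [row for row in rows if row[best] == value]
        branches[value] = build_tree(subset, remaining, target, majority)
    return {best: branches}


def classify(tree, row):
    """Follow branches until a leaf (a plain label) is reached.

    Raises KeyError for an attribute value never seen in training.
    """
    while isinstance(tree, dict):
        attribute, branches = next(iter(tree.items()))
        tree = branches[row[attribute]]
    return tree


# Hypothetical training tuples, invented for this sketch.
rows = [
    {"student": "no", "credit": "fair", "buys": "no"},
    {"student": "no", "credit": "excellent", "buys": "no"},
    {"student": "no", "credit": "fair", "buys": "no"},
    {"student": "yes", "credit": "fair", "buys": "yes"},
    {"student": "yes", "credit": "excellent", "buys": "no"},
    {"student": "yes", "credit": "fair", "buys": "yes"},
]
tree = build_tree(rows, ["student", "credit"], "buys")
print(tree)
```

On this toy data the root test is `student`, because splitting on it leaves less expected information than splitting on `credit`; the `student = yes` partition is then split again on `credit`.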
Attribute Selection: Information Gain
Class P: buys_computer = “yes”
Class N: buys_computer = “no”

$$Info(D) = I(9,5) = -\frac{9}{14}\log_2\frac{9}{14} - \frac{5}{14}\log_2\frac{5}{14} = 0.940$$

$$Info_{age}(D) = \frac{5}{14} I(2,3) + \frac{4}{14} I(4,0) + \frac{5}{14} I(3,2) = 0.694$$

Here $\frac{5}{14} I(2,3)$ means “age <=30” has 5 out of 14 samples, with 2 yes’es and 3 no’s. Hence

$$Gain(age) = Info(D) - Info_{age}(D) = 0.246$$

Similarly, the gain can be computed for income, student, and credit_rating from the training data:

age    income  student  credit_rating  buys_computer
<=30   high    no       fair           no
<=30   high    no       excellent      no
31…    high    no       fair           yes
>      medium  no       fair           yes
>      low     yes      fair           yes
>      low     yes      excellent      no
31…    low     yes      excellent      yes
<=30   medium  no       fair           no
<=30   low     yes      fair           yes
>      medium  yes      fair           yes
<=30   medium  yes      excellent      yes
31…    medium  no       excellent      yes
31…    high    yes      fair           yes
>      medium  no       excellent      no
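The gain of every attribute can be checked directly against the 14 training tuples. The sketch below is an illustration only: the function names `info` and `gain` are chosen here, and the age labels are copied as they appear in the table (two of them are truncated in the source and are treated as opaque category names).

```python
import math
from collections import Counter, defaultdict

COLUMNS = ("age", "income", "student", "credit_rating", "buys_computer")
ROWS = [
    ("<=30", "high", "no", "fair", "no"),
    ("<=30", "high", "no", "excellent", "no"),
    ("31…", "high", "no", "fair", "yes"),
    (">", "medium", "no", "fair", "yes"),
    (">", "low", "yes", "fair", "yes"),
    (">", "low", "yes", "excellent", "no"),
    ("31…", "low", "yes", "excellent", "yes"),
    ("<=30", "medium", "no", "fair", "no"),
    ("<=30", "low", "yes", "fair", "yes"),
    (">", "medium", "yes", "fair", "yes"),
    ("<=30", "medium", "yes", "excellent", "yes"),
    ("31…", "medium", "no", "excellent", "yes"),
    ("31…", "high", "yes", "fair", "yes"),
    (">", "medium", "no", "excellent", "no"),
]


def info(labels):
    """Info(D): expected information (entropy) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def gain(rows, column):
    """Gain(A) = Info(D) - Info_A(D) for the attribute named column."""
    idx = COLUMNS.index(column)
    partitions = defaultdict(list)
    for row in rows:
        partitions[row[idx]].append(row[-1])  # class label is the last field
    info_after = sum(len(p) / len(rows) * info(p) for p in partitions.values())
    return info([row[-1] for row in rows]) - info_after


gains = {column: gain(ROWS, column) for column in COLUMNS[:-1]}
for column, value in gains.items():
    print(f"Gain({column}) = {value:.3f}")
```

Age has the largest gain of the four attributes, so it would be selected as the test at the root. Computed at full precision the value is about 0.247; the figure 0.246 is what results from subtracting the rounded values 0.940 and 0.694.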
Computing Information-Gain for Continuous-Valued Attributes
$(a_i + a_{i+1})/2$ is the midpoint between the values of $a_i$ and $a_{i+1}$
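One common way to use these midpoints, sketched here as an assumption since only the midpoint definition survives on this slide, is to treat each midpoint between adjacent distinct sorted values as a candidate split point and keep the one with the smallest expected information. The ages and labels below are hypothetical.

```python
import math
from collections import Counter


def info(labels):
    """Entropy of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def best_split_point(values, labels):
    """Return (midpoint, expected_info) for the best binary split A <= midpoint.

    Returns None when there are fewer than two distinct values.
    """
    pairs = list(zip(values, labels))
    distinct = sorted(set(values))
    best = None
    for a_i, a_next in zip(distinct, distinct[1:]):
        midpoint = (a_i + a_next) / 2
        left = [label for v, label in pairs if v <= midpoint]
        right = [label for v, label in pairs if v > midpoint]
        expected = (len(left) * info(left) + len(right) * info(right)) / len(pairs)
        if best is None or expected < best[1]:
            best = (midpoint, expected)
    return best


# Hypothetical continuous attribute and class labels.
ages = [22, 25, 28, 35, 41, 46]
buys = ["no", "no", "no", "yes", "yes", "yes"]
split, expected = best_split_point(ages, buys)
print(f"best split: age <= {split}")
```

Here the midpoint between 28 and 35 separates the two classes completely, so it is the split that leaves no remaining uncertainty.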
Gini index (CART, IBM IntelligentMiner)
If a data set D contains examples from n classes, the gini index gini(D) is defined as

$$gini(D) = 1 - \sum_{j=1}^{n} p_j^2$$

where $p_j$ is the relative frequency of class j in D

If a data set D is split on A into two subsets $D_1$ and $D_2$, the gini index $gini_A(D)$ is defined as

$$gini_A(D) = \frac{|D_1|}{|D|}\, gini(D_1) + \frac{|D_2|}{|D|}\, gini(D_2)$$

Reduction in Impurity:

$$\Delta gini(A) = gini(D) - gini_A(D)$$

The attribute that provides the smallest $gini_{split}(D)$, that is, the smallest $gini_A(D)$ over its candidate splits (or the largest reduction in impurity), is chosen to split the node (need to enumerate all the possible splitting points for each attribute)
Gini index (CART, IBM IntelligentMiner)
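Ex. D has 9 tuples in buys_computer = “yes” and 5 in “no”

$$gini(D) = 1 - \left(\frac{9}{14}\right)^2 - \left(\frac{5}{14}\right)^2 = 0.459$$

Suppose the attribute income partitions D into 10 in $D_1$: {low, medium} and 4 in $D_2$: {high}

$$gini_{income \in \{low, medium\}}(D) = \frac{10}{14}\, gini(D_1) + \frac{4}{14}\, gini(D_2)$$

In the training data, $D_1$ holds 7 “yes” and 3 “no” tuples, so $gini(D_1) = 0.42$, and $D_2$ holds 2 of each, so $gini(D_2) = 0.5$, which gives $gini_{income \in \{low, medium\}}(D) = 0.443$
The other two binary splits give 0.450 for {medium, high} and 0.458 for {low, high}, so the split on {low, medium} is the best since it is the lowest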
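The binary splits on income can be enumerated in a few lines. The sketch below is illustrative: the function names are chosen here, and the (income, buys_computer) pairs are copied from the 14-tuple training table in row order.

```python
from collections import Counter

# (income, buys_computer) pairs taken from the training table.
DATA = [
    ("high", "no"), ("high", "no"), ("high", "yes"), ("medium", "yes"),
    ("low", "yes"), ("low", "no"), ("low", "yes"), ("medium", "no"),
    ("low", "yes"), ("medium", "yes"), ("medium", "yes"), ("medium", "yes"),
    ("high", "yes"), ("medium", "no"),
]


def gini(labels):
    """gini(D) = 1 - sum of squared relative class frequencies."""
    total = len(labels)
    return 1 - sum((n / total) ** 2 for n in Counter(labels).values())


def gini_split(pairs, subset):
    """gini_A(D) for the binary split: value in subset vs. value not in subset."""
    d1 = [label for value, label in pairs if value in subset]
    d2 = [label for value, label in pairs if value not in subset]
    total = len(pairs)
    return len(d1) / total * gini(d1) + len(d2) / total * gini(d2)


gini_d = gini([label for _, label in DATA])

# With three income values, every binary partition is one value against
# the other two, so the single-value subsets enumerate all of them.
values = sorted({value for value, _ in DATA})
splits = {value: gini_split(DATA, {value}) for value in values}
best = min(splits, key=splits.get)

print(f"gini(D) = {gini_d:.3f}")
for value, score in splits.items():
    print(f"{{{value}}} vs. rest: gini = {score:.3f}, reduction = {gini_d - score:.3f}")
```

The key `high` corresponds to the partition {low, medium} versus {high}. It has the lowest gini of the three partitions and therefore the largest reduction in impurity.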
All attributes are assumed continuous-valued
May need other tools, e.g., clustering, to get the possible split values
Can be modified for categorical attributes