Decision Trees for Classification: Impurity Measures and Information Gain - Prof. Narendra, Study notes of Computer Science

An overview of classification using decision trees, with a focus on choosing the attributes for splitting the data based on impurity measures and information gain. The notes cover the concept of impurity and its measures, including the Gini index, entropy, and misclassification rate. They also describe how to calculate the reduction in impurity due to knowing the value of an attribute and how to select the attribute for the root node of the decision tree. Finally, the notes introduce BOAT, an optimistic approach to building decision trees using bootstrapping, along with its coarse splitting criteria and cleanup phase.

CS 6604: Data Mining, Fall 2007

Lecture 9, 17 Sept 2007

Lecture: Naren Ramakrishnan
Scribe: Mahima Gopalakrishnan

1 Overview

We move from itemset mining and associations to the next topic: classification. In classification, we are given a set of labeled examples (labeled with the class information) and we want to learn a classifier that predicts the class for new examples (instances). One of the common classification methods is a decision tree, which can be viewed as a compaction of many rules.

2 Decision Trees

A decision tree is a sequence of conditions factored into a tree-structured series of branches. The following (partial) tree is an example of a classifier that can predict whether an email is SPAM or NON-SPAM based on some features.

In the above figure, $f_1$ and $f_2$ are questions that could be as simple as a "YES" or "NO" question or a more complex one whose answer is a range of values. But how do we know the exact order in which the questions must be posed? This is the main difficulty in inducing a decision tree.

2.1 Choosing the Attributes

2.1.1 Splitting Criteria

To decide the order in which attributes must be chosen to split the data, we need a measure that allows us to compare the attributes on some scale and choose one over another. One measure for selecting the "best" question or attribute is based on the level of impurity in the resulting classes of data. Impurity can be defined as the amount of uncertainty present in the data; the attribute that reduces the impurity the most should be chosen.

  • Examples of pure datasets: {"All emails are classified SPAM"}, {"All emails are classified NON-SPAM"}
  • An example of an impure dataset: {"Mix of SPAM and NON-SPAM emails"}

Given probability p, some of the impurity measures studied in the literature are:

  • Gini Index: $2p(1 - p)$
  • Entropy: $-[p \log p + (1 - p) \log (1 - p)]$
  • Misclassification Rate: $1 - \max(p, 1 - p)$

In the context of our example, we can define

P(spam): the probability of an email being SPAM = p, and

P(non-spam): the probability of an email being NON-SPAM = 1 - p.

In general, when the dataset is divided into two classes, p is the proportion of instances in the database that have one value of the target attribute and 1 - p is the proportion of instances that have the other value.
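The three two-class measures can be compared numerically with a minimal sketch such as the following. The function names are illustrative, and the logarithm is taken to base 2 (an assumption; the notes do not fix the base), with 0 log 0 treated as 0.

```python
import math

def gini(p):
    """Two-class Gini index: 2p(1 - p)."""
    return 2 * p * (1 - p)

def entropy(p):
    """Two-class entropy in bits (base-2 log is an assumption); 0*log(0) is taken as 0."""
    return sum(-x * math.log2(x) for x in (p, 1 - p) if x > 0)

def misclassification(p):
    """Two-class misclassification rate: 1 - max(p, 1 - p)."""
    return 1 - max(p, 1 - p)

# All three measures are 0 for a pure dataset (p = 0 or p = 1)
# and largest for an evenly mixed one (p = 0.5).
for p in (0.0, 0.1, 0.5, 0.9, 1.0):
    print(p, gini(p), entropy(p), misclassification(p))
```

All three functions vanish on pure datasets and peak at p = 0.5, which is what makes each of them usable as a measure of uncertainty.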

2.1.2 Generalization of the Impurity Measures

In the previous section, we assumed that the instances in the database fall into two classes. When the number of classes is three or more, for example $C_1$, $C_2$ and $C_3$, where

$P(C_1) = p$

$P(C_2) = q$

$P(C_3) = 1 - p - q$

then the impurity measures can be generalized as follows.

  • Gini Index: $\sum_{i \neq j} p_i p_j = 1 - \sum_i p_i^2$
  • Entropy: $-[p \log p + q \log q + (1 - p - q) \log (1 - p - q)]$, or in general $-\sum_i p_i \log p_i$
  • Misclassification Rate: $1 - \max(p, q, 1 - p - q)$
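As a rough illustration of the generalized measures, the sketch below computes them from a list of class proportions. The helper `class_proportions`, the label names, and the base-2 logarithm are assumptions made for the example.

```python
import math
from collections import Counter

def class_proportions(labels):
    """Proportion p_i of each class among the given labels."""
    n = len(labels)
    return [count / n for count in Counter(labels).values()]

def gini(probs):
    """Generalized Gini index: 1 - sum_i p_i^2."""
    return 1 - sum(p * p for p in probs)

def entropy(probs):
    """Generalized entropy: -sum_i p_i log2 p_i, skipping zero proportions."""
    return sum(-p * math.log2(p) for p in probs if p > 0)

def misclassification(probs):
    """Generalized misclassification rate: 1 - max_i p_i."""
    return 1 - max(probs)

# Three classes with p = 0.5, q = 0.25 and 1 - p - q = 0.25.
labels = ["C1", "C1", "C2", "C3"]
probs = class_proportions(labels)
print(gini(probs), entropy(probs), misclassification(probs))
```

With two classes these functions reduce to the two-class formulas, since $1 - p^2 - (1 - p)^2 = 2p(1 - p)$.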

$$\text{(Information Gain)}_B = H(C) - \sum_{i \in B} \frac{|C_i|}{|C|} H(C_i)$$

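where $H(C_i)$ is the entropy of the partition $C_i$, the subset of the database in which attribute B takes its i-th value; the weighted sum is the entropy of C given B. In essence, we calculate the entropy of each 'partition' of the database induced by the attribute B and take the weighted sum of these entropies, with weights $|C_i|/|C|$, for use in the above equation.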
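The information gain formula can be sketched in a few lines. The toy email rows, the attribute names `has_link` and `all_caps`, and the base-2 logarithm are hypothetical choices for illustration only.

```python
import math
from collections import Counter, defaultdict

def entropy(labels):
    """Entropy H of a list of class labels, in bits."""
    n = len(labels)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(rows, attribute, target):
    """H(C) minus the weighted entropies of the partitions induced by `attribute`."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[row[attribute]].append(row[target])
    total = len(rows)
    weighted = sum(len(part) / total * entropy(part) for part in partitions.values())
    return entropy([row[target] for row in rows]) - weighted

# Hypothetical toy data: has_link separates the classes perfectly, all_caps not at all.
rows = [
    {"has_link": "yes", "all_caps": "yes", "label": "SPAM"},
    {"has_link": "yes", "all_caps": "no", "label": "SPAM"},
    {"has_link": "no", "all_caps": "yes", "label": "NON-SPAM"},
    {"has_link": "no", "all_caps": "no", "label": "NON-SPAM"},
]
print(information_gain(rows, "has_link", "label"))  # 1.0
print(information_gain(rows, "all_caps", "label"))  # 0.0
```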

To select the attribute for the root node of the decision tree, we calculate the information gain for every attribute B in the database and choose the one that yields the largest reduction in entropy. Once the attribute with the highest information gain is chosen, the same procedure is carried out recursively on the subsets of the database that are induced at each level of the tree.
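The recursive procedure can be sketched as follows. This is a simplified illustration rather than a full induction algorithm: it assumes categorical attributes, stops when a subset is pure or no attributes remain (falling back to the majority label), and does no pruning. The toy rows and attribute names are hypothetical.

```python
import math
from collections import Counter, defaultdict

def entropy(labels):
    n = len(labels)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())

def partition(rows, attribute):
    """Group rows by their value of `attribute`."""
    parts = defaultdict(list)
    for row in rows:
        parts[row[attribute]].append(row)
    return parts

def information_gain(rows, attribute, target):
    weighted = sum(
        len(part) / len(rows) * entropy([r[target] for r in part])
        for part in partition(rows, attribute).values()
    )
    return entropy([r[target] for r in rows]) - weighted

def build_tree(rows, attributes, target):
    """Return a class label (leaf) or {attribute: {value: subtree}}."""
    labels = [row[target] for row in rows]
    if len(set(labels)) == 1 or not attributes:
        return Counter(labels).most_common(1)[0][0]  # pure node or majority vote
    best = max(attributes, key=lambda a: information_gain(rows, a, target))
    remaining = [a for a in attributes if a != best]
    return {best: {value: build_tree(part, remaining, target)
                   for value, part in partition(rows, best).items()}}

rows = [
    {"has_link": "yes", "all_caps": "yes", "label": "SPAM"},
    {"has_link": "yes", "all_caps": "no", "label": "SPAM"},
    {"has_link": "no", "all_caps": "yes", "label": "SPAM"},
    {"has_link": "no", "all_caps": "no", "label": "NON-SPAM"},
    {"has_link": "yes", "all_caps": "no", "label": "SPAM"},
]
tree = build_tree(rows, ["has_link", "all_caps"], "label")
print(tree)
```

On this toy data `has_link` has the higher information gain, so it becomes the root, and `all_caps` is only consulted in the branch that is still impure.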

4 BOAT - Decision Trees using Bootstrapping

BOAT [1] is an optimistic approach to building decision trees using a statistical technique called 'bootstrapping'. The main idea of the algorithm is to construct a sample tree T′ from small subsets of data drawn from the original database D and then to refine that tree during a subsequent scan of all the data.

Typical decision tree construction algorithms have two parts: a growth phase and a pruning phase. BOAT concentrates on the first of these, the growth phase. Since the whole database D cannot fit into main memory, a large subset D′ of the database is held in memory to compute T′. Bootstrapping is then applied to this subset.

4.1 Coarse Splitting Criteria

The aim of bootstrapping is to obtain coarse splitting criteria. A coarse splitting criterion restricts the set of possible splitting criteria at a node n, so it is a coarse view of the final splitting criterion.

From the subset D′ of the dataset, we sample b training datasets with replacement and construct b bootstrap trees according to the procedure outlined above. For each node n, we check whether the splitting attributes are identical in all b trees. If not, the node n and its subtree are removed from all the bootstrap trees. If the splitting attribute at n is numerical, then we have b split points, from which we can obtain a confidence interval such that the final split point is expected to lie inside this interval. If the splitting attribute at n is categorical, then the subsets induced by the split should be identical in all b trees; otherwise, n and its subtree are removed from all b trees. In effect, the b bootstrap trees are overlaid on top of each other and only the identical parts are kept. The result is the sample tree T′.
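The bootstrapping step can be illustrated with a heavily simplified sketch that handles only the root node and only numerical attributes. It is not the BOAT implementation: the Gini-based `best_split`, the use of the minimum and maximum bootstrap split points as the "confidence interval", the function names, and the toy data are all assumptions made for the example.

```python
import random
from collections import Counter

def gini(labels):
    n = len(labels)
    return 1 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(rows, attributes, target):
    """Return (attribute, threshold) minimizing the weighted Gini of the split x <= threshold."""
    best = None
    for attr in attributes:
        values = sorted({row[attr] for row in rows})
        for low, high in zip(values, values[1:]):
            threshold = (low + high) / 2
            left = [r[target] for r in rows if r[attr] <= threshold]
            right = [r[target] for r in rows if r[attr] > threshold]
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(rows)
            if best is None or score < best[0]:
                best = (score, attr, threshold)
    return None if best is None else best[1:]

def coarse_root_criterion(sample, attributes, target, b=20, seed=0):
    """Overlay b bootstrap root splits; return (attribute, (low, high)) or None on disagreement."""
    rng = random.Random(seed)
    splits = []
    for _ in range(b):
        boot = [rng.choice(sample) for _ in sample]  # sample with replacement
        split = best_split(boot, attributes, target)
        if split is None:
            return None
        splits.append(split)
    chosen = {attr for attr, _ in splits}
    if len(chosen) != 1:
        return None  # the bootstrap trees disagree, so the node is discarded
    points = [threshold for _, threshold in splits]
    # Assumption: the range of bootstrap split points stands in for a confidence interval.
    return chosen.pop(), (min(points), max(points))

# Hypothetical in-memory sample D': x determines the class, y is uninformative.
sample = [{"x": i, "y": i % 2, "label": "SPAM" if i >= 10 else "NON-SPAM"}
          for i in range(20)]
result = coarse_root_criterion(sample, ["x", "y"], "label")
print(result)
```

A full version would apply the same agreement check recursively at every node of the b trees and would also handle categorical attributes by comparing the induced subsets.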

4.2 Cleanup Phase

The resulting T′ can be said to be 'close' to the final tree T only when, at every node n in T′ and T, the splitting attribute X is identical. In addition, if X is a numerical attribute, then the final split point x at n in T should be inside the confidence interval for x in T′. If X is a categorical attribute, then the splitting criteria at n should be identical in T and T′.

In this phase, the algorithm takes the sample tree T′ and the coarse splitting criteria and tries to compute the final splitting criteria in one scan over the training database. At every node, the algorithm chooses the criterion that minimizes the value of the impurity function. If the splitting attribute at node n in T′ is categorical, then a splitting criterion has already been computed, and the algorithm checks whether the sample splitting subsets are equal to the final splitting subsets.

If the sample splitting attribute at n is numerical, then the final split point should lie inside its confidence interval with high probability. Let x′ be the value of the sample splitting attribute that gives the minimum value, i′, of the impurity function. To verify that the best split point does not lie outside the confidence interval, the global minimum of the impurity function is calculated and compared with i′. The same global minimum is also used to verify that the sample splitting attribute X is the final splitting attribute.

Therefore, this method corrects for any difference between the sample and the final decision trees by checking the validity of the splitting criteria during the cleanup phase itself. In the end the method outputs the final tree, which is the same tree that would have been constructed by scanning the database once for every node.
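The numerical-attribute check in the cleanup phase can be sketched for a single node as follows. This is only an illustration under stated assumptions: x′ is taken to be the best split point inside the interval, the global minimum is found by exhaustively trying every candidate split (a stand-in for whatever computation the actual algorithm uses), the impurity function is the weighted Gini index, and the names and data are hypothetical.

```python
from collections import Counter

def gini(labels):
    n = len(labels)
    return 1 - sum((c / n) ** 2 for c in Counter(labels).values())

def weighted_gini(rows, attr, threshold, target):
    """Impurity of the split attr <= threshold, weighted by partition size."""
    left = [r[target] for r in rows if r[attr] <= threshold]
    right = [r[target] for r in rows if r[attr] > threshold]
    return (len(left) * gini(left) + len(right) * gini(right)) / len(rows)

def candidate_thresholds(rows, attr):
    values = sorted({r[attr] for r in rows})
    return [(a + b) / 2 for a, b in zip(values, values[1:])]

def cleanup_node(rows, attributes, target, coarse_attr, interval):
    """Return the final (attribute, split point), or None if the coarse criterion is rejected."""
    low, high = interval
    inside = [t for t in candidate_thresholds(rows, coarse_attr) if low <= t <= high]
    if not inside:
        return None
    # x' is the best split point inside the interval and i' is its impurity.
    x_prime = min(inside, key=lambda t: weighted_gini(rows, coarse_attr, t, target))
    i_prime = weighted_gini(rows, coarse_attr, x_prime, target)
    # Assumption: exhaustive search stands in for the global-minimum computation.
    global_min = min(weighted_gini(rows, a, t, target)
                     for a in attributes for t in candidate_thresholds(rows, a))
    if global_min < i_prime:
        return None  # a better split exists elsewhere, so this node must be rebuilt
    return coarse_attr, x_prime

# Hypothetical full database D.
rows = [{"x": i, "y": i % 2, "label": "SPAM" if i >= 10 else "NON-SPAM"}
        for i in range(20)]
print(cleanup_node(rows, ["x", "y"], "label", "x", (8.0, 11.0)))  # ('x', 9.5)
print(cleanup_node(rows, ["x", "y"], "label", "x", (2.0, 4.0)))   # None
```

The second call shows the corrective behaviour: when the interval misses the true optimum, the comparison against the global minimum detects it and the coarse criterion is rejected.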

References

[1] Johannes Gehrke, Venkatesh Ganti, Raghu Ramakrishnan and Wei-Yin Loh, BOAT Optimistic Decision Tree Construction, Proceedings of the 1999 ACM SIGMOD International Conference on Management of Data, 169–180, 1999.