Decision Trees and Classification , Lecture Notes - Computer Science, Study notes of Artificial Intelligence

Prof. David C Parkes, Computer Science, Decision Trees, Classification, Inductive Bias, Restriction bias, Algorithm for Learning Decision Trees, Harvard, Lecture Notes

CS181 Lecture 2 — Decision Trees and Classification

Avi Pfeffer; Revised by David Parkes

Jan 23, 2010

In introducing supervised learning we consider the special problem of learning a Boolean concept from training examples, some of which satisfy the concept and some of which do not. This is a classical machine learning problem. After discussing some basic ideas for supervised learning, we will turn to a particular learning algorithm, decision trees.

Optional readings for the next two lectures: Chapter 18 (through 18.4) of Russell & Norvig, and Chapters 1 & 3 of Mitchell, “Machine Learning”.

1 The Task: Supervised Learning

Whenever we talk about learning, there is a hierarchy of tasks to consider. We must first talk about the ultimate task to be performed, the thing we are trying to learn to do. Then we can talk about the task of learning how to perform the ultimate task. Finally, we can consider the task of designing a learning algorithm.

For the next few classes, we will focus on classification as the ultimate task to be performed. Classification means determining what category an object falls into, based on its features (or attributes). For example, we might try to classify a plant as nutritious or poisonous, based on biological features such as color, leaf shape and so on. Or we might try to classify a pixel image as being a particular digit.

Classification. An object is described by a set of features $X_1, \ldots, X_m$. There is a set $Y = \{1, \ldots, c\}$ of possible classes. Given the features $\mathbf{x} = (x_1, \ldots, x_m) \in X$ of a particular object, a classifier needs to determine its true class $f(\mathbf{x}) = y \in Y$.^1 Thus a classifier is a function $h : X \to Y$. The performance of a classifier $h$ on a new instance $(\mathbf{x}, y)$ is measured by an error function, for example

$$\Delta(y, y') = \begin{cases} 0 & \text{if } y = y' \\ 1 & \text{otherwise,} \end{cases} \qquad (1)$$

where $y' = h(\mathbf{x})$. In some domains a more complex error function can be appropriate. For example, in a medical domain, the cost of false negatives (missing a disease diagnosis) is likely higher than that of false positives (incorrectly diagnosing disease when there is none).

Classification is a special case of the general problem of supervised learning.

Supervised Learning. The goal of supervised learning is to learn a classifier from a set of labeled data $D$. Each instance $(\mathbf{x}, y) \in D$ defines feature values $\mathbf{x} = (x_1, \ldots, x_m)$ and a target value $y \in Y$. Together, we have $D = \{(\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_n, y_n)\}$, a set of $n$ labeled examples. A supervised learning algorithm takes this data and outputs a function $h : X \to Y$. Thus a supervised learning algorithm can itself be considered to define a function $L$ from labeled training data to classifiers. For a particular training set $D$, $L(D)$ is a classifier.

(^1) We will generally use boldface to denote vectors or matrices and capital letters to denote sets, with small letters to denote particular elements of these sets.

How do we evaluate the performance of a learning algorithm L on training data D? Since the goal is to produce a classifier, we measure the performance of the learning algorithm through the performance of the classifier on future instances. A key point is that the classifier’s performance is measured relative to future examples that it has not yet seen, and not the training set. The critical question is whether the learning algorithm is able to produce classifiers that generalize to data that it did not see.
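The definitions so far can be made concrete with a small Python sketch. It is only an illustration under stated assumptions: the learner `majority_learner`, the helper `average_error`, and the tiny data sets are made up for this example and are not part of the lecture. The sketch shows the 0/1 error function, a learning algorithm viewed as a function from training data to a classifier, and the difference between error on the training set and error on instances the learner never saw.

```python
from collections import Counter

def zero_one_loss(y, y_pred):
    """0/1 error function: 0 if the prediction is correct, 1 otherwise."""
    return 0 if y == y_pred else 1

def majority_learner(data):
    """A toy learning algorithm L (hypothetical): it ignores the features
    and returns a classifier that always predicts the most common class in D."""
    most_common = Counter(y for _, y in data).most_common(1)[0][0]
    return lambda x: most_common

def average_error(h, data):
    """Average 0/1 loss of classifier h over a list of (x, y) pairs."""
    return sum(zero_one_loss(y, h(x)) for x, y in data) / len(data)

# Made-up data: each instance is (feature vector, class).
train = [((0, 1), 1), ((1, 1), 1), ((0, 0), 0)]
held_out = [((1, 0), 0), ((1, 1), 1)]

h = majority_learner(train)          # L(D) is a classifier
print(average_error(h, train))       # error on the data the learner saw
print(average_error(h, held_out))    # error on unseen instances
```

The second number is the one that matters for judging the learner, since it estimates performance on future instances rather than on the training set.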

When we have to choose a particular learning algorithm for a particular domain, we do not generally know exactly what the correct function will be, or exactly what the training set will look like. We need to design the learning algorithm so that it will perform well in the domain, given the particular characteristics of the domain, and given the amount of training data it will get. For example, we need to decide which features to generate to describe an object. We may also have parameters of the algorithm that we need to set. Different algorithms are good at learning different kinds of functions. Also, as we will see, some algorithms need a lot of data to work well. A key aspect of studying machine learning is not only to understand the learning algorithms themselves, but to understand what aspects of the domain make them work well.

Binary classification. Often, the possible classes in a classification problem will be true and false. In this case, the learning task is sometimes called concept learning, where the learned function provides a definition of a concept; e.g., it could represent the concept rainy, or sad, or the digit 9. For concept learning, the training data consists of positive and negative examples: instances that are, or are not, examples of the concept. If the features themselves are also Boolean, then the problem is that of learning Boolean concepts. In this case, the concept is defined by a Boolean formula over the features.

We will focus here mostly on the case of learning Boolean concepts, because they are simple, but already cover most of the issues that come up.

There are some useful distinctions we can make concerning learning problems. The first concerns whether the classification task is deterministic or non-deterministic. This is a distinction concerning the underlying prediction problem.

Deterministic. In a deterministic domain, if two objects have the same features, then they necessarily have the same classification: in symbols, $(\mathbf{x}_i = \mathbf{x}_j) \Rightarrow (y_i = y_j)$.

Non-deterministic. In a non-deterministic domain, two instances may have the same features but different classifications. A non-deterministic domain is often called noisy. Noise can happen either because of inherent non-determinism in the domain, or because of errors and mis-labeling in the data.

The second distinction is similar, but concerns the training data itself. The training data is consistent if there are no two instances that have the same features but different classifications; otherwise it is inconsistent. Note that a deterministic domain will necessarily have a consistent training set. Still, a particular training set for a non-deterministic domain may also be consistent.
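Consistency of a training set is easy to check mechanically. The following Python sketch is one way to do it, assuming each instance is a pair of a hashable feature tuple and a label; the function name `is_consistent` and the example data are illustrative only.

```python
def is_consistent(data):
    """Return True if no two instances have the same features
    but different classifications."""
    seen = {}  # maps feature tuple -> first label observed for it
    for x, y in data:
        # setdefault stores y the first time x is seen and returns the stored label
        if seen.setdefault(x, y) != y:
            return False
    return True

# Made-up Boolean data: (features, label)
clean = [((True, False), True), ((True, True), False), ((True, False), True)]
noisy = [((True, False), True), ((True, False), False)]

print(is_consistent(clean))
print(is_consistent(noisy))
```

Note that a repeated instance with the same label, as in `clean`, does not make the data inconsistent.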

2 Is Learning Possible?

Before we go on to consider how learning is done, we need to consider whether learning is possible at all. In fact, there are strong arguments that learning is logically impossible!

Consider the problem of learning Boolean concepts, and assume the target concept is deterministic. We want to learn a Boolean formula over the features $X_1, \ldots, X_m$ from data consisting of positive and negative examples. For notational convenience, let’s order the data so that the positive examples appear first. So the data consists of positive examples $\{\mathbf{x}_1, \ldots, \mathbf{x}_k\}$ and negative examples $\{\mathbf{x}_{k+1}, \ldots, \mathbf{x}_n\}$. Suppose we want to classify a new instance $\mathbf{x}$. What should we do? There are two cases. If $\mathbf{x}$ appears in the data, we should provide exactly the same classification (because the domain is deterministic). If it does

3 Inductive Bias

The answer is that, for learning to be possible, we need to make some assumptions. These assumptions are called our inductive bias. They mean that a learned classifier will sometimes be incorrect, but this is necessary in order to allow for learning. Implicitly, by introducing a bias into a learning algorithm we are assuming that some true concepts $f : X \to Y$ are more likely than others. There are two basic kinds of inductive bias.

Restriction bias. The first kind of bias of a learning algorithm is called restriction bias, because the learning algorithm does not consider all possible classifiers $h : X \to Y$, but only those from some subset of this function space. The set $H$ of classifiers that the algorithm considers is called the hypothesis space of the algorithm, and the different possible classifiers $h \in H$ are called hypotheses. An example is an algorithm for learning Boolean concepts that only considers conjunctive concepts. A conjunctive concept is formed from the conjunction of literals, where a literal is a single feature or its negation. Thus $X_1 \wedge \neg X_2$ is a conjunctive concept, but $X_1 \vee \neg X_2$ is not.

An algorithm that uses a restriction bias is able to learn. There may be only one hypothesis in its hypothesis space $H$ consistent with the training data, in which case it returns that hypothesis. It is also possible that no hypothesis in $H$ is consistent with the training data, even if the training data is itself consistent. In that case, the learning algorithm may look to return the hypothesis that best matches the data, even though the match is imperfect.
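To illustrate the restriction to conjunctive concepts, here is a brute-force Python sketch that enumerates every conjunction over $m$ Boolean features and keeps those consistent with the data. The encoding of a hypothesis as a tuple of `True`, `False` or `None`, the function names, and the two small data sets are assumptions made for this example; a practical learner would not enumerate the whole space.

```python
from itertools import product

def conjunctive_hypotheses(m):
    """Every conjunction over m Boolean features. A hypothesis is a tuple
    with one entry per feature: True (literal X_k), False (literal not X_k)
    or None (the feature is not mentioned)."""
    return product((True, False, None), repeat=m)

def satisfies(h, x):
    """True if instance x makes every literal of conjunction h true."""
    return all(lit is None or lit == xk for lit, xk in zip(h, x))

def consistent_hypotheses(data, m):
    """All conjunctions that agree with the label of every instance."""
    return [h for h in conjunctive_hypotheses(m)
            if all(satisfies(h, x) == y for x, y in data)]

# Labels generated by X1 AND NOT X2 (a conjunctive concept).
data_and = [((True, False), True), ((True, True), False), ((False, False), False)]
# Labels generated by X1 OR NOT X2 (consistent data, but not a conjunctive concept).
data_or = [((True, True), True), ((True, False), True),
           ((False, False), True), ((False, True), False)]

print(consistent_hypotheses(data_and, 2))
print(consistent_hypotheses(data_or, 2))
```

For `data_and` exactly one conjunction survives, while for `data_or` the list is empty: the training data is consistent, yet no hypothesis in the restricted space matches it.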

Preference bias. The second kind of bias of a learning algorithm is called a preference bias. A preference bias indicates a relative ranking between different hypotheses the algorithm can consider. For example, in learning Boolean concepts, one may prefer more specific hypotheses, in which case the first formula above will be returned.

An algorithm with no restriction bias that does have a preference bias is able to learn. Given a consistent training set, it may choose the most preferred hypothesis consistent with the data, or even go with something inconsistent. For an inconsistent data set, it will trade off the preferability of a hypothesis against the degree to which it agrees with the data.

There are two basic ways of achieving a preference bias. One is to define a penalty function that assigns a penalty to different hypotheses, so that $h_1 : X \to Y$ is preferred to $h_2 : X \to Y$ if it has a lower penalty. This might reflect the “complexity” of the hypothesis (to be defined). The second is to define a search process that searches through the hypothesis space in a particular way, looking for hypotheses that fit the data well. Hypotheses that appear early in the search process are preferred.
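The penalty-function approach can be sketched in a few lines of Python. Everything specific here is an assumption for illustration: the three candidate hypotheses, the use of the number of literals as the penalty, and the trade-off weight `lam` are not taken from the lecture. The score simply adds the number of training errors to a weighted penalty and picks the hypothesis with the lowest score.

```python
from typing import Callable, NamedTuple

class Hypothesis(NamedTuple):
    name: str
    predict: Callable
    penalty: int  # hypothetical complexity measure: number of literals

candidates = [
    Hypothesis("always true", lambda x: True, 0),
    Hypothesis("X1", lambda x: x[0], 1),
    Hypothesis("X1 and not X2", lambda x: x[0] and not x[1], 2),
]

def select(hypotheses, data, lam=0.1):
    """Pick the hypothesis minimizing (training errors + lam * penalty)."""
    def score(h):
        errors = sum(1 for x, y in data if h.predict(x) != y)
        return errors + lam * h.penalty
    return min(hypotheses, key=score)

# Made-up data on which both "X1" and "X1 and not X2" make no errors.
data = [((True, False), True), ((False, True), False), ((False, False), False)]
print(select(candidates, data).name)
```

Two of the candidates fit this data perfectly, so the penalty alone decides between them. A different penalty (for example, one that favors more specific hypotheses) would rank them the other way.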

Understanding the inductive bias is the most important thing to know about any learning algorithm. When considering an algorithm for a particular problem, it is critical to ask whether or not its inductive bias actually fits what you believe to be true about the domain. For a deterministic domain, one might ask whether the true concept is actually in its hypothesis space (or at least, whether there is a hypothesis in its hypothesis space that is close to the truth). Another natural question to ask is whether its preference bias leads it to prefer hypotheses that are appropriate, that is to say likely, for the domain. Different algorithms have stronger or weaker bias. Algorithms with weaker bias require more training data to learn well. Because of this, another natural question to ask is whether the algorithm has the right amount of bias for the training data that is available.

4 Learning Conjunctive Concepts with Decision Trees

Given these concepts, we are ready to plunge into studying decision trees. The decision tree representation and its associated learning algorithms are widely used in practice because they are simple and generally applicable.

In describing the decision tree framework, we need to describe two things: the decision trees themselves, which are the classifiers to be learned, and an algorithm for learning a decision tree from data.

A decision tree is a representation of a function from a set of (discrete) features to a classification. As its name suggests, a decision tree can be understood as a rule that is structured as a tree. Each internal node of the tree is labeled by a feature $X_k$. There is a branch leaving the $X_k$ node corresponding to each possible value of $X_k$. A leaf of the tree is associated with the training data that respects the branching decisions from the root to the leaf, and eventually (once the tree is built) with a prediction.

For example, let us consider the Nutritious vs Poisonous classification problem for predicting whether or not a plant might be good to eat.^2 Suppose there are four features, each with the following values:

  • Skin (smooth, rough or scaly)
  • Color (pink, purple or orange)
  • Thorny (true or false)
  • Flowering (true or false)

A possible decision tree for this domain is as follows:

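[Figure: decision tree. Root: Skin, with branches smooth, rough and scaly. The smooth branch leads to Thorny (branches T, F; leaves P, N); the rough branch leads to the leaf N; the scaly branch leads to Flowering (branches T, F; leaves P, N).]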

The feature branched on at the root is “Skin”. If this is smooth, then the next feature branched on is “Thorny”, which is either True or False. On the leaves (there are five), we see the assigned labels of (P)oisonous, (N)utritious, (N)utritious, (P)oisonous and (N)utritious. Note that the decision tree does not need to branch on all possible features on a path from the root to a leaf, nor on the same features on every path.

Given a decision tree, we use it to classify a new instance as follows. Beginning at the root, select the appropriate subtree based on the feature values of the instance, until reaching a leaf. The prediction is the label associated with the leaf. Formally, a tree classifier $h$ classifies an instance $\mathbf{x}$ as follows:

  • at a feature node $X_k$, follow the subtree reached by the branch labeled $x_k$, where $\mathbf{x} = (x_1, \ldots, x_k, \ldots, x_m)$
  • at a leaf with class label $y'$, the classification is $y'$.
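The classification procedure can be written as a short loop. In this Python sketch the tree representation (a leaf is a label string, an internal node is a pair of a feature name and a dictionary of branches) is an assumption chosen for brevity. The example tree follows the plant figure, with the pairing of the T and F branches to the P and N leaves read off left to right, which is also an assumption.

```python
# Internal node: (feature name, {feature value: subtree}); leaf: class label.
plant_tree = ("Skin", {
    "smooth": ("Thorny", {True: "P", False: "N"}),
    "rough": "N",
    "scaly": ("Flowering", {True: "P", False: "N"}),
})

def classify(tree, instance):
    """Follow the branch matching the instance's value at each feature node
    until a leaf is reached, then return the leaf's label."""
    while isinstance(tree, tuple):
        feature, branches = tree
        tree = branches[instance[feature]]
    return tree

plant = {"Skin": "smooth", "Color": "pink", "Thorny": True, "Flowering": False}
print(classify(plant_tree, plant))
```

The features Color and Flowering are never consulted for this plant, which illustrates that a path need not test every feature.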

What is the expressive power of decision trees? In fact, any function f : X → Y from discrete (and finite) features X to discrete (and finite) Y can be represented as a decision tree. To see this, consider the complete tree, which splits on each of the features in turn, so that a leaf is reached only after assigning values to all the features. This means that a learning algorithm that learns decision trees does not have a restriction bias. Therefore, if the algorithm is to be successful, it must have a preference bias.

(^2) This example is based on Chapter 3 of Mitchell.

    Label(T) = the most common classification in D
    Return T

  Choose the best splitting feature Xk in X   // described in the “information gain” section
  Label(T) = Xk
  For each value x of Xk
    Let Dx be the subset of instances in D that have value x for Xk
    If Dx is empty   // no data left
      Let Tx be a new tree
      Label(Tx) = the most common classification in D
    Else
      Tx = ID3(Dx, X − {Xk})
    Add a branch from T to Tx, labeled by x
  Return T

Careful: A common mistake is to forget that D described in the above recursive definition of ID3 is the current data set, at that node. It is NOT the complete set of training data.

We already see that ID3 can stop growing the tree in any recursive call for one of three reasons:

  • There is no data left, i.e. no training data that respects the branching decisions made between this node and the root. In this case, there is no need to grow the tree further, because any further elaboration would introduce complexity that is not supported by the training data. This is one place where the preference bias of the algorithm comes into play.
  • All training instances that respect the branching decisions made between this node and the root have the same classification y. In this case the constant function that returns y can be returned as the perfect classifier for future data with these feature values, and there is no need to elaborate further. Again, this is an example of where the algorithm includes a preference bias.
  • There are no features left to split on. Since all features have already been split on, all instances in the training data that respect the branching decisions made between this node and the root must have the same value for all features, so there is no point in splitting any further. If this case arises and the instances do not all share the same classification, then the training data must be inconsistent (and thus the domain noisy).

We left undefined how ID3 selects the “best splitting feature $X_k$ in $X$”. In defining this, we will see an additional reason why ID3 will stop splitting. We explain this next.

6 Information Gain: Deciding How to Branch

The one thing that remains to be defined is the criterion for choosing which feature to split on. What criterion should we choose? Consider the ultimate goal: to produce short trees. This is the most basic preference bias of ID3. ID3 is a greedy algorithm: it grows the tree one split at a time, choosing each split so that the resulting tree tends to be short. Therefore, the criterion for choosing which feature to split on should tend to produce short trees.

In general, the more “orderly” the data is, the shorter the tree needed to represent it. At one extreme, if all instances have the same classification, then ID3 will generate a tree of depth 0. Information theory gives us a precise concept of the “orderliness” of data. The key notion in information theory is entropy, defined as follows:

$$\mathrm{Entropy}(D) = -\sum_{y \in Y} \frac{n_y}{n} \log_2\left(\frac{n_y}{n}\right)$$

where $n$ is the total number of instances in $D$, and $n_y$ is the number of instances that have classification $y$. Entropy can be interpreted as the amount of disorder in the data. If the data has high entropy, it is less orderly. One information theoretic interpretation is that the entropy is the (average) number of bits per instance required to encode the data given that the distribution is itself common knowledge.

For example, let us consider binary classification. At one extreme, if all instances are positive (or all negative), $\mathrm{Entropy}(D) = -1 \log_2 1 - 0 \log_2 0 = 0$, where we take $0 \log_2 0 = 0$. I can tell you that all the data is positive (or negative), and then I don’t need to send you any information about individual instances. At the other extreme, half the instances are positive and half negative, and $\mathrm{Entropy}(D) = -\frac{1}{2} \log_2 \frac{1}{2} - \frac{1}{2} \log_2 \frac{1}{2} = 1$. In this case, I can do no better than telling you explicitly whether each individual instance is positive or negative. The graph of entropy, as a function of the fraction of instances that are positive, looks like this:

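[Figure: entropy (vertical axis, from 0 to 1) plotted against the fraction of instances that are positive (horizontal axis, from 0 to 1).]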

Since entropy is a measure of the amount of information required to encode the data, and we want to encode the data in as compact a way as possible, we will choose a feature to split on that provides us with the most information. That is, we find the feature $X_k$ to split on that minimizes the amount of information still required to encode the data after splitting on the feature. For this, let $D'$ denote the data that is consistent with the branching decisions made so far between the root and the current node. Let $D'_x$ denote the subset of this data where the value of feature $X_k$ is $x$. Remember that we will only consider features $X_k$ that have not already been branched on. Define:

$$\mathrm{Remainder}(X_k, D') = \sum_{x \in X_k} \frac{n_x}{n'} \, \mathrm{Entropy}(D'_x) \qquad (5)$$

where $n_x$ is the number of data instances in $D'$ that have feature value $x \in X_k$, and $n' = |D'|$ is the number of instances in $D'$. The remainder captures the expected number of bits needed to encode the data if we test on feature $X_k$. One can think about the information gain from branching next on feature $X_k$ as follows:

$$\mathrm{Gain}(X_k, D') = \mathrm{Entropy}(D') - \mathrm{Remainder}(X_k, D') \qquad (6)$$

Our criterion for choosing which feature to split on is to choose the one with highest information gain (breaking ties at random). Equivalently, we select $X_k$ with the lowest remainder. It is a basic result of information theory that the information gain cannot be negative for any feature. However, it can be zero, and an information gain of zero means that splitting on a feature is useless and provides no additional information. If the information gain for all features is zero, then ID3 stops splitting.
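Putting the pieces together, the following Python sketch implements entropy, the remainder (5), the gain (6), and a small recursive tree builder in the spirit of ID3. It is a sketch under several assumptions rather than a definitive implementation: instances are dictionaries from feature name to value, a tree is either a class label or a pair of a feature name and a dictionary of branches, ties are broken deterministically (by feature name) instead of at random, a small tolerance stands in for "zero gain", and the plant data is made up.

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy of a list of class labels, in bits. Uses the equivalent form
    sum p*log2(1/p); classes with zero count never appear, so 0 log 0 = 0."""
    n = len(labels)
    return sum((c / n) * math.log2(n / c) for c in Counter(labels).values())

def remainder(data, k):
    """Equation (5): weighted entropy of the subsets obtained by splitting on k."""
    subsets = {}
    for x, y in data:
        subsets.setdefault(x[k], []).append(y)
    return sum(len(ys) / len(data) * entropy(ys) for ys in subsets.values())

def gain(data, k):
    """Equation (6): entropy before the split minus the remainder after it."""
    return entropy([y for _, y in data]) - remainder(data, k)

def id3(data, features, values):
    """data: non-empty list of (x, y); features: set of unused feature names;
    values: dict mapping each feature to its possible values."""
    labels = [y for _, y in data]
    majority = Counter(labels).most_common(1)[0][0]
    if len(set(labels)) == 1 or not features:
        return majority                      # pure node, or nothing left to split on
    best = max(sorted(features), key=lambda k: gain(data, k))
    if gain(data, best) <= 1e-12:
        return majority                      # every split is uninformative
    branches = {}
    for v in values[best]:
        subset = [(x, y) for x, y in data if x[best] == v]
        # An empty subset becomes a leaf labeled with the majority class of
        # the current data at this node, not of the full training set.
        branches[v] = id3(subset, features - {best}, values) if subset else majority
    return (best, branches)

values = {"Skin": ["smooth", "rough", "scaly"], "Thorny": [True, False]}
data = [
    ({"Skin": "smooth", "Thorny": True}, "P"),
    ({"Skin": "smooth", "Thorny": False}, "N"),
    ({"Skin": "rough", "Thorny": True}, "N"),
    ({"Skin": "rough", "Thorny": False}, "N"),
    ({"Skin": "scaly", "Thorny": True}, "P"),
]
print(gain(data, "Skin"), gain(data, "Thorny"))
print(id3(data, {"Skin", "Thorny"}, values))
```

On this data Skin has the higher gain, so it is chosen at the root, and only the smooth branch needs a further split on Thorny.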