Machine Learning Assignment 3: Decision Tree with Cross Validation | Assignments Computer Science

CSCI 5622, Sec 001 Professor Mozer

Machine Learning Spring 2001

Assignment 3

Assigned: Tue Feb 13, 2001

Due: Thu Feb 22 2001

In this assignment, you will implement a decision-tree learning system, much like ID3 described in the text-

book. You will also use cross validation to estimate the performance of the decision tree. The decision tree

will have a parameter—the maximum depth of the tree, or

maxdepth

—which will control the complexity of

the resulting decision tree. You will explore how varying

maxdepth

affects the decision tree’s performance

on both the training set and the test set.

The data set available for this assignment is based on the U.S. congress voting record from 1984. The data

set consists of the votes (

yes

) on sixteen issues for each of the 435 members of congress. From the

voting record, the task of the machine learning system is to predict whether the member of congress is a

Republican or a Democrat. The voting data set is described in the UCI Machine Learning Repository. How-

ever, I have cleaned up the data set and made it available at

ftp://ftp.cs.colorado.edu/users/mozer/5622/votes.tar

The original data set contained many

missing values

, i.e., votes in which a member of congress failed to

participate. Dealing with missing values is tricky, so I made your task simpler by inserting—for each absent

vote—the voting decision of the majority. The result is that each record in the data base looks something

like the following:

D y y y n n n y y y n n n n n y y

The D (or R) indicates the individual is a Democrat (Republican), and the symbols y and n denote yes and

no votes. The column in which the y or n appears corresponds to the particular issue being voted on.

Because all input attributes are binary, the resulting decision tree will also be binary (two branches from

every non-leaf node).

The votes.tar file contains five copies of the data, each split into training and test sets. The training sets

are named votes-train[0-4].data and the test sets are named votes-test[0-4].data. The

union of the five test sets is the entire data base. Thus, training and testing on each of the five sets corre-

sponds to five-fold cross validation. I’ve split up the data in this manner to make your task easier, and to

ensure that everyone should obtain the same result.

Your decision tree program should take three inputs: (1) the name of the training file, (2) the name of the

test file, and (3)

maxdepth

. The tree should be prevented from going deeper than

maxdepth

(

maxdepth=1

means a root node—level 0—and leaves—level 1). The program should build a decision tree subject to the

maxdepth

constraint, and then output the classification accuracy on both the training and test sets. You

should perform this experiment for the five hold-out (test) sets, and compute the total proportion correct on

the training and test sets for a particular value of

maxdepth

. Repeat this procedure for

maxdepth

ranging

from 0 to 8.

Hand in code for your decision tree, a table and possibly a graph showing performance on the training and

test sets as

maxdepth

is varied.

To begin constructing your tree, start with a root node. Associate all training examples with the root node.

Consider branching the tree along each attribute dimension, and choose the dimension that yields the

greatest gain, as quantified in equation 3.4 of the text. Create two child nodes—the two branches from the

root—and associate with each child node the training examples that would be passed to that node. Repeat

this process, stopping with the maximum tree depth is reached, when all examples associated with a node

have the same classification, or when all examples associated with a node have the exact same voting

record. For each of these leaf nodes, label the node as D or R, based on whether a majority of the training

examples are Democrats or Republicans. If there’s exactly the same number of Democrats and Republi-

Machine Learning Assignment 3: Decision Tree with Cross Validation, Assignments of Computer Science

Related documents

Partial preview of the text

Download Machine Learning Assignment 3: Decision Tree with Cross Validation and more Assignments Computer Science in PDF only on Docsity!

Assignment 3

Assigned: Tue Feb 13, 2001

Due: Thu Feb 22 2001