Implementing Naïve Bayes Classification – Programming Project 2 | CSC 573 | Study Guides, Projects, Research Computer Science

CSC 573: Data Mining

Programming Project #2: Implementing Naïve Bayes Classification

Instructor: Ratko Orlandic

Your task is to implement two programs for Naïve Bayes classification. From the given data set described below, the first program should produce a training set and a testing set after performing a simple “randomization” of data (see below). The second program should derive the Naïve Bayes classification model from the training set and evaluate its accuracy on the testing set (see below). The programs can be developed in a programming language of your choice (Java, C, or C++), but they should run on Windows. While this assignment text gives you some implementation freedom, you should make every effort to follow this specification exactly.

Data Set

Your program must work on the tab-delimited data file “iris-discretized.txt” included in Project2Files.zip. This is a version of the “iris” data set whose non-class attributes have been discretized using equi-width binning. The file has 150 instances (data points), i.e. rows in the file. The first 4 attributes (1. sepallength, 2. sepalwidth, 3. petallength, and 4. petalwidth) are assumed to be nominal (values 1-5). Note that each value 1-5 represents a range of values in the original “iris” data set from which I derived the “iris-discretized.txt” file. The last (5th) attribute is the class dimension (nominal values 1-3 denoting types of iris). Note that a row in the data file represents a data point (instance) and that a column represents an attribute (dimension).
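The row layout described above can be sketched in Java as follows. This is only an illustration, not part of the specification: the class name IrisRowParser and its methods are hypothetical, and the sketch assumes the file has no header row and that every non-empty line holds exactly five tab-separated integer codes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper (not named in the assignment): turns lines of the
// tab-delimited data file into integer arrays.
final class IrisRowParser {
    // 4 nominal attributes (values 1-5) followed by the class label (values 1-3).
    static final int NUM_COLUMNS = 5;

    // Converts one tab-delimited line into five integer codes.
    static int[] parseLine(String line) {
        String[] fields = line.trim().split("\t");
        if (fields.length != NUM_COLUMNS) {
            throw new IllegalArgumentException(
                "Expected " + NUM_COLUMNS + " fields, got " + fields.length);
        }
        int[] row = new int[NUM_COLUMNS];
        for (int i = 0; i < NUM_COLUMNS; i++) {
            row[i] = Integer.parseInt(fields[i].trim());
        }
        return row;
    }

    // Parses every non-empty line; assumes there is no header row.
    static List<int[]> parseAll(List<String> lines) {
        List<int[]> rows = new ArrayList<>();
        for (String line : lines) {
            if (!line.trim().isEmpty()) {
                rows.add(parseLine(line));
            }
        }
        return rows;
    }

    public static void main(String[] args) {
        // Made-up sample line, used only to show the column layout.
        int[] row = parseLine("2\t3\t1\t1\t1");
        System.out.println(Arrays.toString(row));
    }
}
```

In a full program, the lines passed to parseAll would come from reading “iris-discretized.txt”, for example with java.nio.file.Files.readAllLines.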

Data Randomization

The first program should be called “randomize”. It should: a) read the given data file “iris-discretized.txt”, b) perform a simple randomization of data points in it, and c) output two text files, called “iris-train.txt” and “iris-test.txt”, which will later be used as the training and test data sets, respectively. The program can be specific to the “iris-discretized.txt” data set and it should not take any other input parameter.

For the purposes of “randomizing” the data, use the following simple procedure:

1) Read the “iris-discretized.txt” set into an array of data points.

2) Assume that the numbering of data points (rows) is between 0 and 149.

3) For each data point i (between 0 and 149) in the array:

3a) If i mod 3 = 0 or 1, include the data point i in the “iris-train.txt” set.

3b) If i mod 3 = 2, include the data point i in the “iris-test.txt” set.

Note that mod is the “modulo” operator that, in this case, computes the remainder of the integer division of the number i by 3. This procedure will create a training data set with 100 data points (instances) and a test set with 50 instances. You can create temporary “training” and “test” arrays before writing the data points into the “iris-train.txt” and “iris-test.txt” files.
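The mod-3 procedure above can be sketched in Java as follows. The class name ModThreeSplit and its methods are hypothetical; the sketch works on any in-memory list and leaves out reading the input file and writing “iris-train.txt” and “iris-test.txt”.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not named in the assignment): applies the
// i mod 3 rule to a list of data points held in memory.
final class ModThreeSplit {
    // True when data point i goes to the training set (i mod 3 is 0 or 1).
    static boolean isTraining(int i) {
        return i % 3 != 2;
    }

    // Returns two lists: index 0 holds the training rows, index 1 the test rows.
    static <T> List<List<T>> split(List<T> rows) {
        List<T> train = new ArrayList<>();
        List<T> test = new ArrayList<>();
        for (int i = 0; i < rows.size(); i++) {
            if (isTraining(i)) {
                train.add(rows.get(i));
            } else {
                test.add(rows.get(i));
            }
        }
        List<List<T>> parts = new ArrayList<>();
        parts.add(train);
        parts.add(test);
        return parts;
    }

    public static void main(String[] args) {
        // Use the row numbers 0..149 themselves as stand-ins for the data points.
        List<Integer> indices = new ArrayList<>();
        for (int i = 0; i < 150; i++) {
            indices.add(i);
        }
        List<List<Integer>> parts = split(indices);
        System.out.println(parts.get(0).size() + " training, "
            + parts.get(1).size() + " test");
    }
}
```

Run on the 150 row numbers, the sketch reports 100 training and 50 test points, matching the counts stated above.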

Data Classification

To implement the second program, called “classify”, please read relevant course slides (Part 4) and section 6.4 in the textbook describing the Naïve Bayes classification. For the purposes of this assignment, you should ignore the “zero-frequency problem” (i.e., assume that the computed posterior probabilities can be 0.0). You also need not worry about the Naïve Bayes handling of missing values and numeric attributes.
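A minimal Java sketch of a frequency-count Naïve Bayes classifier under the simplifications above (no smoothing, so a score may be exactly 0.0; no missing values; nominal attributes only). It is not the course's reference solution. The class name NaiveBayesSketch, its methods, and the toy rows in main are hypothetical, and breaking ties in favor of the lowest class label is an assumption, since the text does not say how ties should be resolved. The score is the unnormalized posterior: the class prior times the product of the conditional relative frequencies.

```java
// Hypothetical sketch (not the assignment's reference solution).
final class NaiveBayesSketch {
    static final int NUM_ATTRS = 4;   // non-class attributes
    static final int NUM_VALUES = 5;  // nominal values 1-5
    static final int NUM_CLASSES = 3; // class labels 1-3

    // Index 0 is unused so that values and labels can index the arrays directly.
    final int[] classCount = new int[NUM_CLASSES + 1];
    final int[][][] valueCount =
        new int[NUM_ATTRS][NUM_VALUES + 1][NUM_CLASSES + 1];
    int total = 0;

    // Counts class frequencies and per-class attribute-value frequencies.
    void train(int[][] rows) {
        for (int[] row : rows) {
            int c = row[NUM_ATTRS];
            classCount[c]++;
            total++;
            for (int a = 0; a < NUM_ATTRS; a++) {
                valueCount[a][row[a]][c]++;
            }
        }
    }

    // Unnormalized posterior P(c) * prod_a P(x_a | c); no smoothing, may be 0.0.
    double score(int[] row, int c) {
        if (classCount[c] == 0) {
            return 0.0; // class never seen in training
        }
        double p = (double) classCount[c] / total;
        for (int a = 0; a < NUM_ATTRS; a++) {
            p *= (double) valueCount[a][row[a]][c] / classCount[c];
        }
        return p;
    }

    // Picks the class with the highest score; ties go to the lowest label
    // (an assumption, not stated in the assignment).
    int predict(int[] row) {
        int best = 1;
        double bestScore = score(row, 1);
        for (int c = 2; c <= NUM_CLASSES; c++) {
            double s = score(row, c);
            if (s > bestScore) {
                bestScore = s;
                best = c;
            }
        }
        return best;
    }

    // Fraction of rows whose predicted class equals the label in column 5.
    double accuracy(int[][] rows) {
        int correct = 0;
        for (int[] row : rows) {
            if (predict(row) == row[NUM_ATTRS]) {
                correct++;
            }
        }
        return (double) correct / rows.length;
    }

    public static void main(String[] args) {
        // Made-up toy rows for illustration only; not taken from the iris file.
        int[][] toy = {
            {1, 1, 1, 1, 1}, {1, 2, 1, 1, 1},
            {3, 3, 3, 3, 2}, {3, 2, 3, 3, 2},
            {5, 5, 5, 5, 3}
        };
        NaiveBayesSketch nb = new NaiveBayesSketch();
        nb.train(toy);
        System.out.println("Predicted class: " + nb.predict(new int[] {3, 3, 3, 3}));
    }
}
```

In the actual “classify” program, train would receive the rows of “iris-train.txt” and accuracy the rows of “iris-test.txt”.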

