Implementing Naïve Bayes Classification – Programming Project 2 | CSC 573

Material Type: Project; Class: Data Mining; Subject: Computer Science; University: University of Illinois Springfield; Term: Unknown 1989;

Uploaded on 08/19/2009

CSC 573: Data Mining
Programming Project #2: Implementing Naïve Bayes Classification
Instructor: Ratko Orlandic
Your task is to implement two programs for Naïve Bayes classification. From the given data set
described below, the first program should produce a training set and a testing set after performing a
simple “randomization” of data (see below). The second program should derive the Naïve Bayes
classification model from the training set and evaluate its accuracy on the testing set (see below).
The programs can be developed in a programming language of your choice (Java, C, or C++), but
they should run on Windows. While this assignment text gives you some implementation freedom,
you should make every effort to follow this specification exactly.
Data Set
Your program must work on the tab-delimited data file “iris-discretized.txt” included in
Project2Files.zip. This is a version of the “iris” data set whose non-class attributes have been
discretized using equi-width binning. The file has 150 instances (data points), i.e. rows in the file.
The first 4 attributes (1. sepallength, 2. sepalwidth, 3. petallength, and 4. petalwidth) are assumed
to be nominal (values 1-5). Note that each value 1-5 represents a range of values in the original
“iris” data set from which I derived the “iris-discretized.txt” file. The last (5th) attribute is the class
dimension (nominal values 1-3 denoting types of iris). Note that a row in the data file represents a
data point (instance) and that a column represents an attribute (dimension).
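
As an illustration only, one row of the file could be read as in the C++ sketch below. The names `Instance` and `parseRow` are hypothetical and not part of this specification, and the sketch assumes that the file has no header row and that every field is an integer.

```cpp
#include <array>
#include <sstream>
#include <string>

// One data point: four discretized attribute values (1-5) followed by the class (1-3).
// "Instance" is a hypothetical name chosen for this sketch.
using Instance = std::array<int, 5>;

// Parses one tab-delimited row such as "2\t3\t1\t1\t1".
// Returns false if the row does not hold five integers in the expected ranges.
bool parseRow(const std::string& line, Instance& out) {
    std::istringstream in(line);
    for (int col = 0; col < 5; ++col) {
        // operator>> skips whitespace, which includes the tab delimiters.
        if (!(in >> out[col])) return false;      // missing or non-numeric field
        const int maxValue = (col < 4) ? 5 : 3;   // attributes: 1-5, class: 1-3
        if (out[col] < 1 || out[col] > maxValue) return false;
    }
    return true;
}
```

Rejecting out-of-range values at this point keeps later array indexing (value minus 1) within bounds.
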
Data Randomization
The first program should be called “randomize”. It should: a) read the given data file
“iris-discretized.txt”, b) perform a simple randomization of data points in it, and c) output two text
files, called “iris-train.txt” and “iris-test.txt”, which will later be used as the training and test data
sets, respectively. The program can be specific to the “iris-discretized.txt” data set and it should not
take any other input parameter.
For the purposes of “randomizing” the data, use the following simple procedure:
1) Read the “iris-discretized.txt” set into an array of data points.
2) Assume that the numbering of data points (rows) is between 0 and 149.
3) For each data point i (between 0 and 149) in the array:
3a) If i mod 3 = 0 or 1, include the data point i in the “iris-train.txt” set.
3b) If i mod 3 = 2, include the data point i in the “iris-test.txt” set.
Note that mod is the “modulo” operator that, in this case, computes the remainder of the integer
division of the number i by 3. This procedure will create a training data set with 100 data points
(instances) and a test set with 50 instances. You can create temporary “training” and “test” arrays
before writing the data points into the “iris-train.txt” and “iris-test.txt” files.
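
The procedure above might be sketched in C++ as follows. This is a minimal illustration, not a required design: the helper names `readRows`, `splitRows`, and `writeRows` are hypothetical, and the sketch assumes the data file has no header row and that blank lines can be skipped.

```cpp
#include <cstddef>
#include <fstream>
#include <string>
#include <utility>
#include <vector>

using Rows = std::vector<std::string>;

// Reads every non-empty line of a text file (assumption: no header row).
Rows readRows(const std::string& path) {
    Rows rows;
    std::ifstream in(path);
    std::string line;
    while (std::getline(in, line)) {
        if (!line.empty()) rows.push_back(line);
    }
    return rows;
}

// Splits rows by position: i mod 3 == 0 or 1 goes to the training set,
// i mod 3 == 2 goes to the test set. Returns {training, test}.
std::pair<Rows, Rows> splitRows(const Rows& rows) {
    Rows train, test;
    for (std::size_t i = 0; i < rows.size(); ++i) {
        if (i % 3 == 2) test.push_back(rows[i]);
        else train.push_back(rows[i]);
    }
    return {train, test};
}

// Writes one row per line to the given file.
void writeRows(const std::string& path, const Rows& rows) {
    std::ofstream out(path);
    for (const std::string& row : rows) out << row << '\n';
}
```

A main function would then call `readRows("iris-discretized.txt")`, pass the result to `splitRows`, and write the two parts to “iris-train.txt” and “iris-test.txt” with `writeRows`.
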
Data Classification
To implement the second program, called “classify”, please read relevant course slides (Part 4) and
section 6.4 in the textbook describing the Naïve Bayes classification. For the purposes of this
assignment, you should ignore the “zero-frequency problem” (i.e., assume that the computed
posterior probabilities can be 0.0). You also need not worry about the Naïve Bayes handling of
missing values and numeric attributes.

The “classify” program should operate in two phases. In the first phase, the program should read the “iris-train.txt” set into an array and compute all prior probabilities, which will constitute your classification model. More precisely, for each value Vi (1-5) of a non-class attribute Ai (1-4), compute the conditional probability P(Vi|C) for each of the 3 classes C. In other words, compute the fraction (between 0.0 and 1.0) of instances of class C (1-3) in the training set that have the value Vi (1-5) for the attribute Ai (1-4). Then, compute the unconditional probability P(C) of each class C (i.e., the fraction of instances in the training set that have class C (1-3)).

You should use two arrays for the purposes of maintaining the computed prior probabilities. The first should be a 3-dimensional array with 60 elements (4 attributes x 5 values x 3 classes), each of which holds the conditional probability of an attribute value given a class. The second should be a 1-dimensional array with 3 elements holding the unconditional probability of each class. All 63 probabilities computed in this phase should be recorded in the “output1.txt” file described below.
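
One possible way to fill the two arrays is sketched below in C++. The names `Instance`, `Model`, and `train` are hypothetical, values and classes are stored at index (value minus 1), and the sketch assumes every instance has already been validated to lie in the ranges 1-5 and 1-3. No smoothing is applied, so a probability can be exactly 0.0.

```cpp
#include <array>
#include <vector>

// Four attribute values (1-5) followed by the class (1-3); hypothetical name.
using Instance = std::array<int, 5>;

// The classification model: 60 conditional and 3 unconditional probabilities.
struct Model {
    double cond[4][5][3] = {};  // cond[a][v][c] = P(value v+1 of attribute a+1 | class c+1)
    double prior[3] = {};       // prior[c] = P(class c+1)
};

// Derives the model from the training instances by counting and dividing.
Model train(const std::vector<Instance>& data) {
    Model m;
    int classCount[3] = {};
    for (const Instance& e : data) {
        const int c = e[4] - 1;
        ++classCount[c];
        // Use cond as a counter first; it is turned into fractions below.
        for (int a = 0; a < 4; ++a) m.cond[a][e[a] - 1][c] += 1.0;
    }
    for (int c = 0; c < 3; ++c) {
        m.prior[c] = data.empty() ? 0.0
                                  : static_cast<double>(classCount[c]) / data.size();
        for (int a = 0; a < 4; ++a) {
            for (int v = 0; v < 5; ++v) {
                // Fraction of class-c instances that have value v+1 for attribute a+1.
                m.cond[a][v][c] = (classCount[c] > 0)
                                      ? m.cond[a][v][c] / classCount[c]
                                      : 0.0;
            }
        }
    }
    return m;
}
```

For each attribute and class, the five conditional probabilities sum to 1.0 whenever the class occurs in the training set, which is a convenient sanity check on the counts.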

The second phase of the “classify” program should begin by reading the “iris-test.txt” set with 50 instances into a new array. Then, using the probabilities computed in the first phase, it should predict the class of each instance in the testing set and compute the accuracy of the model. For the purposes of predicting the class of each instance E, follow the Naïve Bayes classification method. More precisely, compute the conditional probabilities P(C|E) of each class C given the values of the non-class attributes of the instance E, and then select the class C with the highest value P(C|E). If more than one class has the highest probability, choose any of them as the predicted class. Since the data set has 4 non-class attributes A1-A4 and we are using the naïve assumption, each probability P(C|E) is computed as P(V1|C)⋅P(V2|C)⋅P(V3|C)⋅P(V4|C)⋅P(C), where Vi is the value of the instance for the attribute Ai, and C is the given class. Strictly speaking, this product is proportional to P(C|E) rather than equal to it, but the omitted normalizing factor P(E) is the same for every class and does not affect which class is selected.
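
The prediction step could look roughly like the C++ sketch below. The names `Model`, `classScores`, and `predictClass` are hypothetical, and the model layout (index = value minus 1) is an assumption of this sketch. Ties are resolved in favor of the lowest class number, which is one of the choices the text permits.

```cpp
#include <array>

// Four attribute values (1-5) followed by the class (1-3); hypothetical name.
using Instance = std::array<int, 5>;

// Assumed model layout: cond[a][v][c] = P(value v+1 of attribute a+1 | class c+1),
// prior[c] = P(class c+1).
struct Model {
    double cond[4][5][3];
    double prior[3];
};

// Computes P(V1|C) * P(V2|C) * P(V3|C) * P(V4|C) * P(C) for each class C.
// The class attribute e[4] of the instance is deliberately not read here.
std::array<double, 3> classScores(const Model& m, const Instance& e) {
    std::array<double, 3> score{};
    for (int c = 0; c < 3; ++c) {
        score[c] = m.prior[c];
        for (int a = 0; a < 4; ++a) score[c] *= m.cond[a][e[a] - 1][c];
    }
    return score;
}

// Returns the class (1-3) with the highest score; the lowest class wins a tie.
int predictClass(const std::array<double, 3>& score) {
    int best = 0;
    for (int c = 1; c < 3; ++c) {
        if (score[c] > score[best]) best = c;
    }
    return best + 1;
}
```

Because the zero-frequency problem is ignored, all three scores can be 0.0 for an instance; in that case this sketch reports class 1.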

Note that the actual class value (5th attribute) of a given test instance E is not used in the process of predicting the class of E. However, it is used to determine the accuracy of the model. For that, simply compare the predicted and the actual class of each test instance E. In order to derive the accuracy of the model, at the end of the second phase, divide the number of correctly classified instances in the test set by 50 (the number of instances in the test set). All output of the second phase should be recorded in the “output2.txt” file described below.
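
The accuracy computation itself is small; a hedged C++ sketch follows. The function name `accuracy` is hypothetical, and the sketch divides by the number of test instances it is given (50 for this assignment) rather than hard-coding the constant.

```cpp
#include <cstddef>
#include <vector>

// Fraction of test instances whose predicted class equals the actual class.
// Returns 0.0 if the inputs are empty or differ in length.
double accuracy(const std::vector<int>& predicted, const std::vector<int>& actual) {
    if (actual.empty() || predicted.size() != actual.size()) return 0.0;
    std::size_t correct = 0;
    for (std::size_t i = 0; i < actual.size(); ++i) {
        if (predicted[i] == actual[i]) ++correct;
    }
    return static_cast<double>(correct) / actual.size();
}
```

For example, 47 correct predictions out of 50 test instances give an accuracy of 47/50 = 0.94.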

Output of the Classification Program

As noted earlier, the output of the first phase of the “classify” program should be recorded in the output file called “output1.txt”. This file should record the 63 probabilities computed by this phase, i.e. 60 conditional probabilities P(Vi|C) (one for each combination of the attribute, value, and class) and 3 unconditional probabilities P(C) (one for each class). The requested format of the “output1.txt” file is given in the “output1file-format.txt” file included in “Project2Files.zip”.

The output of the second phase of the “classify” program should be recorded in the output file called “output2.txt”. Please follow the requested format of the “output2.txt” file, which is given in the “output2file-format.txt” file included in “Project2Files.zip”. Note that for each test instance E, “output2.txt” should record: the order number of the instance (0-49), the predicted class of E (1-3), the actual class of E (1-3), as well as the computed posterior probabilities P(C|E) for each of the 3 classes C. At the end of the “output2.txt” file, record the overall accuracy of the model (number of correctly classified test instances divided by 50).