

Material Type: Project; Class: Data Mining; Subject: Computer Science; University: University of Illinois Springfield; Term: Unknown 1989;
Typology: Study Guides, Projects, Research


CSC 573: Data Mining
Programming Project #2: Implementing Naïve Bayes Classification
Instructor: Ratko Orlandic
Your task is to implement two programs for Naïve Bayes classification. From the given data set described below, the first program should produce a training set and a testing set after performing a simple “randomization” of data (see below). The second program should derive the Naïve Bayes classification model from the training set and evaluate its accuracy on the testing set (see below). The programs can be developed in a programming language of your choice (Java, C, or C++), but they should run on Windows. While this assignment text gives you some implementation freedom, you should make every effort to follow this specification exactly.
Your program must work on the tab-delimited data file “iris-discretized.txt” included in Project2Files.zip. This is a version of the “iris” data set whose non-class attributes have been discretized using equi-width binning. The file has 150 instances (data points), i.e. rows in the file. The first 4 attributes (1. sepallength, 2. sepalwidth, 3. petallength, and 4. petalwidth) are assumed to be nominal (values 1-5). Note that each value 1-5 represents a range of values in the original “iris” data set from which I derived the “iris-discretized.txt” file. The last (5th) attribute is the class dimension (nominal values 1-3 denoting types of iris). Note that a row in the data file represents a data point (instance) and that a column represents an attribute (dimension).
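The file layout described above can be read with a few lines of code. The following is a minimal sketch in Python, used here only for brevity (the assignment itself asks for Java, C, or C++). The function names parse_rows and load_rows are hypothetical, and the sketch assumes the file has no header row, which this text does not state.

```python
def parse_rows(text):
    """Parse tab-delimited text into a list of rows of ints.

    Each row is expected to hold 5 integers: four attribute values (1-5)
    followed by the class label (1-3). Blank lines are skipped.
    Assumption: the data has no header row.
    """
    rows = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # ignore empty lines, e.g. a trailing newline
        fields = [int(f) for f in line.split("\t")]
        if len(fields) != 5:
            raise ValueError(f"line {line_no}: expected 5 fields, got {len(fields)}")
        *attrs, label = fields
        if not all(1 <= v <= 5 for v in attrs) or not 1 <= label <= 3:
            raise ValueError(f"line {line_no}: value out of range")
        rows.append(fields)
    return rows


def load_rows(path):
    """Read a whole data file (e.g. "iris-discretized.txt") and parse it."""
    with open(path, encoding="utf-8") as f:
        return parse_rows(f.read())
```

Validating the value ranges while reading catches a malformed file early, before any probabilities are computed from it.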
The first program should be called “randomize”. It should: a) read the given data file “iris-discretized.txt”, b) perform a simple randomization of data points in it, and c) output two text files, called “iris-train.txt” and “iris-test.txt”, which will later be used as the training and test data sets, respectively. The program can be specific to the “iris-discretized.txt” data set and it should not take any other input parameter.
For the purposes of “randomizing” the data, use the following simple procedure:
To implement the second program, called “classify”, please read relevant course slides (Part 4) and section 6.4 in the textbook describing the Naïve Bayes classification. For the purposes of this assignment, you should ignore the “zero-frequency problem” (i.e., assume that the computed posterior probabilities can be 0.0). You also need not worry about the Naïve Bayes handling of missing values and numeric attributes.
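To see what ignoring the zero-frequency problem means in practice, consider the short Python sketch below (Python is used only for brevity; the function name and all the numbers are hypothetical). If a single conditional probability is 0.0 because an attribute value never occurs with a class in the training set, the whole product for that class becomes 0.0, regardless of the other factors.

```python
def naive_bayes_score(cond_probs, class_prior):
    """Multiply the per-attribute conditional probabilities by the class prior."""
    score = class_prior
    for p in cond_probs:
        score *= p
    return score


# Hypothetical values: the third attribute value was never seen with this class.
with_zero = naive_bayes_score([0.5, 0.25, 0.0, 0.5], 0.5)     # 0.0
# Same class prior, but every attribute value was seen at least once.
without_zero = naive_bayes_score([0.5, 0.25, 0.5, 0.5], 0.5)  # 0.015625
```

For this assignment such a 0.0 result is acceptable and needs no correction.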
The “classify” program should operate in two phases. In the first phase, the program should read the “iris-train.txt” set into an array and compute all prior probabilities, which will constitute your classification model. More precisely, for each value Vi (1-5) of a non-class attribute Ai (1-4), compute the conditional probability P(Vi|C) for each of the 3 classes C. In other words, compute the fraction (between 0.0 and 1.0) of instances of class C (1-3) in the training set that have the value Vi (1-5) for the attribute Ai (1-4). Then, compute the unconditional probability P(C) of each class C (i.e., the fraction of instances in the training set that have class C (1-3)).
You should use two arrays to hold the computed prior probabilities. The first should be a 3-dimensional array with 60 elements (4 attributes x 5 values x 3 classes), each of which holds the conditional probability of an attribute value given a class. The second should be a 1-dimensional array with 3 elements holding the unconditional probability of each class. All 63 probabilities computed in this phase should be recorded in the “output1.txt” file described below.
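The first phase can be sketched as follows, in Python for brevity (the assignment itself asks for Java, C, or C++). The names train, cond, and prior are hypothetical. The sketch assumes each row holds four attribute values followed by the class, and it stores 0.0 for a class that never appears in the training set, a case this text does not address.

```python
NUM_ATTRS, NUM_VALUES, NUM_CLASSES = 4, 5, 3


def train(rows):
    """Return (cond, prior) estimated from rows of [a1, a2, a3, a4, class].

    cond[a][v][c] holds P(value v+1 of attribute a+1 | class c+1);
    prior[c] holds P(class c+1). Indices are 0-based, data values 1-based.
    """
    if not rows:
        raise ValueError("training set is empty")

    class_counts = [0] * NUM_CLASSES
    value_counts = [[[0] * NUM_CLASSES for _ in range(NUM_VALUES)]
                    for _ in range(NUM_ATTRS)]
    for row in rows:
        c = row[NUM_ATTRS] - 1          # class label is the last column
        class_counts[c] += 1
        for a in range(NUM_ATTRS):
            value_counts[a][row[a] - 1][c] += 1

    # P(C): fraction of training instances that belong to class C.
    prior = [n / len(rows) for n in class_counts]
    # P(Vi|C): fraction of class-C instances with value Vi for attribute Ai.
    # Assumption: a class with no training instances gets 0.0 everywhere.
    cond = [[[value_counts[a][v][c] / class_counts[c] if class_counts[c] else 0.0
              for c in range(NUM_CLASSES)]
             for v in range(NUM_VALUES)]
            for a in range(NUM_ATTRS)]
    return cond, prior
```

A useful sanity check on the result: for any attribute and any class present in the training set, the five conditional probabilities sum to 1.0.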
The second phase of the “classify” program should begin by reading the “iris-test.txt” set with 50 instances into a new array. Then, using the probabilities computed in the first phase, it should predict the class of each instance in the testing set and compute the accuracy of the model. For the purposes of predicting the class of each instance E, follow the Naïve Bayes classification method. More precisely, compute the conditional probabilities P(C|E) of each class C given the values of the non-class attributes of the instance E, and then select the class C with the highest value P(C|E). If more than one class has the highest probability, choose any of them as the predicted class. Since the data set has 4 non-class attributes A1-A4 and we are using the naïve assumption, each probability P(C|E) is computed as P(V1|C)⋅P(V2|C)⋅P(V3|C)⋅P(V4|C)⋅P(C), where Vi is the value of the instance for the attribute Ai, and C is the given class. Strictly speaking, this product is proportional to P(C|E) rather than equal to it: the normalizing term P(E) is the same for every class, so omitting it does not change which class has the highest value.
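The prediction step can be sketched as below, again in Python for brevity. The names class_scores and predict are hypothetical, and the sketch assumes the model is stored as cond[attribute][value][class] and prior[class] with 0-based indices. On a tie, it returns the lowest-numbered class among those tied, which is one permissible choice.

```python
def class_scores(cond, prior, instance):
    """Return P(V1|C)*P(V2|C)*P(V3|C)*P(V4|C)*P(C) for each class C.

    Only the first four values of the instance are used; a fifth value
    (the actual class), if present, is ignored.
    """
    scores = []
    for c in range(len(prior)):
        score = prior[c]
        for a, value in enumerate(instance[:4]):
            score *= cond[a][value - 1][c]   # data values are 1-based
        scores.append(score)
    return scores


def predict(cond, prior, instance):
    """Return (predicted class in 1-3, list of per-class scores)."""
    scores = class_scores(cond, prior, instance)
    # max() keeps the first maximal index, so ties go to the lowest class.
    best = max(range(len(scores)), key=scores.__getitem__)
    return best + 1, scores
```

Returning the per-class scores alongside the prediction keeps them available for recording later, without recomputing them.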
Note that the actual class value (5th attribute) of a given test instance E is not used in the process of predicting the class of E. However, it is used to determine the accuracy of the model. For that, simply compare the predicted and the actual class of each test instance E. In order to derive the accuracy of the model, at the end of the second phase, divide the number of correctly classified instances in the test set by 50 (the number of instances in the test set). All output of the second phase should be recorded in the “output2.txt” file described below.
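The accuracy computation amounts to a count and a division. A minimal Python sketch, with the hypothetical function name accuracy, might look like this. It divides by the number of test instances rather than a hard-coded 50, which gives the same result for the 50-instance test set described here.

```python
def accuracy(predicted, actual):
    """Fraction of instances whose predicted class equals the actual class."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("need two non-empty lists of equal length")
    correct = sum(1 for p, a in zip(predicted, actual) if p == a)
    return correct / len(actual)


# Hypothetical labels for four test instances: three of four are correct.
example = accuracy([1, 2, 3, 1], [1, 2, 2, 1])  # 0.75
```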
As noted earlier, the output of the first phase of the “classify” program should be recorded in the output file called “output1.txt”. This file should record the 63 probabilities computed by this phase, i.e. 60 conditional probabilities P(Vi|C) (one for each combination of the attribute, value, and class) and 3 unconditional probabilities P(C) (one for each class). The requested format of the “output1.txt” file is given in the “output1file-format.txt” included in “Project2Files.zip”.
The output of the second phase of the “classify” program should be recorded in the output file called “output2.txt”. Please follow the requested format of the “output2.txt” file, which is given in the “output2file-format.txt” file included in “Project2Files.zip”. Note that for each test instance E, “output2.txt” should record: the order number of the instance (0-49), the predicted class of E (1-3), the actual class of E (1-3), as well as the computed posterior probabilities P(C|E) for each of the 3 classes C. At the end of the “output2.txt” file, record the overall accuracy of the model (number of correctly classified test instances divided by 50).