



Material Type: Assignment; Class: Data Mining; Subject: Information Systems; University: George Mason University; Term: Unknown 1989;




This is an individual assignment. Please ensure that the assignment is submitted in class, in hard copy, before the start of class. No late submissions are allowed. The first part of the assignment consists of a few questions from Chapter 1 and Chapter 2. The second part gives you a feel for the KDD process and for dealing with data, and lets you get acquainted with WEKA.
This assignment was borrowed from the TCSS 555A Data Mining class taught at the University of Washington, Tacoma Branch.

Assignment preparation
This assignment uses the Weka data mining tool. Weka is an open source Java development environment for data mining from the University of Waikato in New Zealand. It can be downloaded freely from http://www.cs.waikato.ac.nz/ml/weka/.

Heart disease datasets
The dataset studied is the heart disease dataset from the UCI repository. Two different datasets are provided: heart-h.arff (Hungarian data) and heart-c.arff (Cleveland data). These datasets describe factors of heart disease. Both datasets are available to you on the assignment page.

The goal of the data mining project is to better understand the risk factors for heart disease, as represented in the 14th attribute: num (<50 means no disease, and values <50-1 to <50-4 represent increasing levels of heart disease). The question on which this machine learning study concentrates is whether it is possible to predict heart disease from the other known data about a patient. The data mining task chosen to answer this question is classification/prediction, and several different algorithms will be used to find which one provides the best predictive power. However, this exercise focuses on the various aspects of the KDD process.
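Before opening the files in Weka, it can help to see how an ARFF file declares its attributes, since the class attribute (here the last one, num) is defined in the header. The following is a minimal sketch in Python, not part of Weka: the function names and the three-attribute sample are hypothetical and made up for illustration, and the parser assumes attribute names without spaces.

```python
def read_arff_header(text):
    """Return (relation, [(name, type_spec), ...]) from ARFF text.

    Simplified sketch: assumes attribute names contain no spaces.
    """
    relation = None
    attributes = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("%"):  # skip blanks and comments
            continue
        lower = line.lower()
        if lower.startswith("@relation"):
            relation = line.split(None, 1)[1].strip()
        elif lower.startswith("@attribute"):
            _, name, type_spec = line.split(None, 2)
            attributes.append((name.strip("'\""), type_spec.strip()))
        elif lower.startswith("@data"):  # header ends where the data begins
            break
    return relation, attributes


def nominal_values(type_spec):
    """Split a nominal declaration such as '{a, b}' into its labels."""
    return [v.strip().strip("'\"") for v in type_spec.strip("{}").split(",")]


# Hypothetical excerpt for illustration only; NOT the real heart-c.arff.
SAMPLE = """\
% toy header with three attributes
@relation toy-heart
@attribute age numeric
@attribute sex {female, male}
@attribute num {healthy, sick}
@data
63,male,healthy
41,female,?
"""

relation, attributes = read_arff_header(SAMPLE)
class_name, class_type = attributes[-1]  # assume the class is the last attribute
print(relation, len(attributes), class_name, nominal_values(class_type))
```

Running the same function on the text of heart-c.arff or heart-h.arff should list the attributes each file declares. The "?" in the sample data row is how ARFF marks a missing value.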
1. Data preparation - integration
a. From the documentation provided in the dataset, how many attributes were originally in these datasets?
b. With Weka, attribute selection can be achieved either from the dedicated Select attributes tab or within the Preprocess tab. List the different options in Weka for selecting attributes, with a short explanation of the corresponding method.
4. Data preparation - cleaning
Data cleaning deals with such defects of real-world data as incompleteness, noise, and inconsistencies. In Weka, data cleaning can be accomplished by applying filters to the data in the Preprocess tab.
a. Missing values. List the methods seen in class for dealing with missing values, and which Weka filters implement them, if available. Remove the missing values with the method of your choice, explaining which filter you are using and why you made this choice. If a filter is not available for your method of choice, develop a new one and add it to the available filters as a Java class. (That should be exciting and fun... send me an email if you plan to do this.)
b. Noisy data. List the methods seen in class for dealing with noisy data, and which Weka filters implement them, if available.
c. Save the cleaned dataset into heart-cleaned.arff, and paste here a screenshot showing at least the first 10 rows of this dataset, with all the columns.
5. Data preparation - transformation
a. Stay in the Preprocess tab for now. Besides attribute selection, another reduction method is to select rows from a dataset. This is called sampling. How do you perform sampling with Weka filters? Can they perform the two main methods: Simple Random Sample Without Replacement and Simple Random Sample With Replacement?
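To make the difference between the two sampling schemes concrete, here is a minimal sketch in plain Python. It does not use Weka; the function names srswor and srswr are hypothetical, and the rows are stand-in integers, not patient records. Without replacement, each row can be drawn at most once, so the sample cannot be larger than the dataset. With replacement, a row may be drawn several times, so the sample can be any size.

```python
import random


def srswor(rows, n, seed=None):
    """Simple random sample WITHOUT replacement: no row appears twice."""
    rng = random.Random(seed)
    # random.sample raises ValueError if n exceeds len(rows)
    return rng.sample(rows, n)


def srswr(rows, n, seed=None):
    """Simple random sample WITH replacement: rows may repeat."""
    rng = random.Random(seed)
    return rng.choices(rows, k=n)


rows = list(range(10))                # stand-in for 10 dataset rows
without = srswor(rows, 5, seed=1)     # 5 distinct rows
with_repl = srswr(rows, 15, seed=1)   # 15 draws from 10 rows, so some repeat

print("without replacement:", sorted(without))
print("with replacement:   ", sorted(with_repl))
```

When exploring Weka's instance filters for this question, the thing to look for is whether a filter exposes an option that switches between these two behaviours, and how the sample size is specified.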