Assignment 1 on Data Mining - Fall 2008 | INFS 755, Assignments of Information Technology

Material Type: Assignment; Class: Data Mining; Subject: Information Systems; University: George Mason University; Term: Unknown 1989;


Uploaded on 02/12/2009

koofers-user-8nt-1


INFS 755 (Fall 2008)

Assignment 1 (Due on 09/22/2008)

This is an individual assignment. Please ensure that the assignment is submitted in class, in hard copy, before the start of class. No late submissions are allowed. The first part of the assignment consists of a few questions from Chapter 1 and Chapter 2. The second part gives you a feel for the KDD process and for dealing with data, and lets you get acquainted with WEKA.

Part 1 (50 points) (Questions borrowed from the Tan et al. book)

1. Discuss whether or not each of the following activities is a data-mining task. [2 points each = 10 points]
   a. Sorting a student database based on the student identification number.
   b. Predicting the future stock price of a company using historical records.
   c. Monitoring the heart rate of a patient for abnormalities.
   d. Monitoring seismic waves for earthquake activities.
   e. Extracting the frequencies of a sound wave.
2. Classify the following attributes as binary, discrete, or continuous. Also classify them as qualitative (nominal or ordinal) or quantitative (interval or ratio). Some cases may have more than one interpretation, so briefly indicate your reasoning if you think there may be some ambiguity. [3 points each = 15 points]
   a. Brightness as measured by a light meter.
   b. Angles as measured in degrees between 0 degrees and 360 degrees.
   c. Bronze, Silver, and Gold medals as awarded at the Olympics.
   d. ISBN numbers for books.
   e. Military rank.
   f. Coat check number.
3. Which of the following quantities is likely to show more temporal autocorrelation: daily rainfall or daily temperature? Why? [5 points]
4. Discuss why a document-term matrix is an example of a data set that has asymmetric discrete or asymmetric continuous features. [5 points]
5. For the following vectors, x and y, calculate the indicated similarity or distance measures. [3 points each = 9 points]
   a. x = (1, 1, 1, 1), y = (2, 2, 2, 2): cosine, correlation, and Euclidean.
   b. x = (0, 1, 0, 1), y = (1, 0, 1, 0): cosine, correlation, Euclidean, and Jaccard.
   c. x = (0, -1, 0, 1), y = (1, 0, -1, 0): cosine, correlation, and Euclidean.
6. Given a similarity measure with values in the interval [0, 1], describe two ways to transform this similarity value into a dissimilarity value in the interval [0, ∞]. [6 points]
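
As a reminder of how the similarity and distance measures named above are defined, here is a minimal plain-Python sketch. The function names and the example vectors `u` and `v` are illustrative assumptions, not part of the assignment, and the Jaccard version assumes binary (0/1) vectors.

```python
import math


def cosine(x, y):
    """Cosine similarity: dot(x, y) / (||x|| * ||y||)."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)


def correlation(x, y):
    """Pearson correlation; returns NaN if either vector has zero variance."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    dev_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    dev_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if dev_x == 0 or dev_y == 0:
        return float("nan")  # correlation is undefined for a constant vector
    return cov / (dev_x * dev_y)


def euclidean(x, y):
    """Euclidean distance: square root of the summed squared differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def jaccard(x, y):
    """Jaccard coefficient for binary vectors: f11 / (f01 + f10 + f11)."""
    f11 = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)
    mismatches = sum(1 for a, b in zip(x, y) if a != b)
    if f11 + mismatches == 0:
        return float("nan")  # both vectors are all zeros
    return f11 / (f11 + mismatches)


# Hypothetical example vectors (not the ones from the question).
u = (1, 0, 1, 1)
v = (1, 1, 0, 1)
print(cosine(u, v), correlation(u, v), euclidean(u, v), jaccard(u, v))
```

Note that the correlation is undefined when one of the vectors is constant, which is why the sketch returns NaN in that case rather than dividing by zero.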

Part 2 (50 points): The KDD Process in Weka

This assignment was borrowed from the TCSS 555A Data Mining class taught at the University of Washington, Tacoma Branch.

Assignment preparation

This assignment uses the Weka data mining tool. Weka is an open source Java development environment for data mining from the University of Waikato in New Zealand. It can be downloaded freely from http://www.cs.waikato.ac.nz/ml/weka/.

Heart disease datasets

The dataset studied is the heart disease dataset from the UCI repository. Two different datasets are provided: heart-h.arff (Hungarian data) and heart-c.arff (Cleveland data). These datasets describe factors of heart disease. Both data sets are available to you on the assignment page.

The goal of the data mining project is to better understand the risk factors for heart disease, as represented in the 14th attribute: num (<50 means no disease, and values <50-1 to <50-4 represent increasing levels of heart disease). The question on which this machine learning study concentrates is whether it is possible to predict heart disease from the other known data about a patient. The data mining task of choice to answer this question is classification/prediction, and several different algorithms will be used to find which one provides the best predictive power. However, this exercise focuses on the various aspects of the KDD process.

1. Data preparation - integration

a. From the documentation provided with the dataset, how many attributes were originally in these datasets?
b. With Weka, attribute selection can be achieved either from the dedicated Select attributes tab or within the Preprocess tab. List the different options in Weka for selecting attributes, with a short explanation of the corresponding method.

4. Data preparation - cleaning

Data cleaning deals with such defects of real-world data as incompleteness, noise, and inconsistencies. In Weka, data cleaning can be accomplished by applying filters to the data in the Preprocess tab.

a. Missing values. List the methods seen in class for dealing with missing values, and which Weka filters implement them, if available. Remove the missing values with the method of your choice, explaining which filter you are using and why you made this choice. If a filter is not available for your method of choice, develop a new one and add it to the available filters as a Java class. (That should be exciting and fun; send me an email if you plan to do this.)
b. Noisy data. List the methods seen in class for dealing with noisy data, and which Weka filters implement them, if available.
c. Save the cleaned dataset into heart-cleaned.arff, and paste here a screenshot showing at least the first 10 rows of this dataset, with all the columns.

5. Data preparation - transformation

Among the different data transformation techniques, explore those available through the Weka filters. Stay in the Preprocess tab for now. Study the following data transformations only:

a. Attribute construction, for example adding an attribute representing the sum of two other ones. Which Weka filter makes this possible?
b. Normalize an attribute. Which Weka filter makes this possible? Can this filter perform min-max normalization? Z-score normalization? Decimal normalization? Provide detailed information about how to perform these in Weka.
c. Normalize all real attributes in the dataset using the method of your choice, and state which one you chose.
d. Save the normalized dataset into heart-normal.arff, and paste here a screenshot showing at least the first 10 rows of this dataset, with all the columns.

6. Data preparation - reduction

Often, data mining datasets are too large to process directly. Data reduction techniques are used to preprocess the data. Once the data mining project has been successful on the reduced data, the larger dataset can be processed too.
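
For reference, the three normalization schemes named in step 5b (min-max, z-score, and decimal scaling) each reduce to a short formula. The sketch below is plain Python under stated assumptions: it is not Weka's implementation, the function names are hypothetical, the sample values are made up, and the z-score uses the population standard deviation.

```python
import math


def min_max(values, new_min=0.0, new_max=1.0):
    """Min-max normalization: map [min, max] linearly onto [new_min, new_max]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [new_min for _ in values]  # constant attribute: no spread to scale
    scale = (new_max - new_min) / (hi - lo)
    return [new_min + (v - lo) * scale for v in values]


def z_score(values):
    """Z-score normalization: (v - mean) / std, using the population std."""
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    if std == 0:
        return [0.0 for _ in values]  # constant attribute
    return [(v - mean) / std for v in values]


def decimal_scaling(values):
    """Decimal scaling: divide by 10^j, with j the smallest integer
    such that max(|v|) / 10^j < 1."""
    max_abs = max(abs(v) for v in values)
    j = 0
    while max_abs / (10 ** j) >= 1:
        j += 1
    return [v / (10 ** j) for v in values]


# Hypothetical readings for one numeric attribute.
readings = [120, 140, 160]
print(min_max(readings))
print(z_score(readings))
print(decimal_scaling(readings))
```

Min-max and decimal scaling keep values in a bounded range, while z-score output is unbounded and centered on zero, which matters when choosing a method for step 5c.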

a. Stay in the Preprocess tab for now. Besides attribute selection, another reduction method is to select rows from a dataset; this is called sampling. How can sampling be performed with Weka filters? Can they perform the two main methods: Simple Random Sample Without Replacement and Simple Random Sample With Replacement?
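
The difference between the two sampling methods can be illustrated with a minimal plain-Python sketch. This is not how Weka's filters are implemented; the function names, the `seed` parameter, and the toy row list are assumptions made for illustration.

```python
import random


def srs_without_replacement(rows, k, seed=None):
    """Simple random sample without replacement: each row is drawn at most
    once, so k cannot exceed len(rows)."""
    rng = random.Random(seed)
    return rng.sample(rows, k)


def srs_with_replacement(rows, k, seed=None):
    """Simple random sample with replacement: every draw is made from the
    full row list, so the same row may appear several times."""
    rng = random.Random(seed)
    return rng.choices(rows, k=k)


# Hypothetical dataset: 100 rows identified by their index.
rows = list(range(100))
without = srs_without_replacement(rows, 10, seed=1)
with_repl = srs_with_replacement(rows, 200, seed=1)
print(len(set(without)), len(set(with_repl)))
```

Sampling with replacement can return more rows than the dataset contains, which is impossible without replacement; a fixed seed makes either sample reproducible.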