Assignment 1 on Data Mining - Fall 2008 | INFS 755, Assignments of Information Technology

George Mason University (GMU), Information Technology

Material Type: Assignment; Class: Data Mining; Subject: Information Systems; University: George Mason University; Term: Unknown 1989;

Uploaded on 02/12/2009

INFS 755 (Fall 2008)

Assignment 1 (Due on 09/22/2008)

This is an individual assignment. Please ensure that the assignment is submitted in class, in hard copy, before the start of class. No late submissions are allowed. The first part of the assignment consists of a few questions from Chapter 1 and Chapter 2. The second part gives you a feel for the KDD process and for dealing with data, and lets you get acquainted with WEKA.

Part 1 (50 points) (Questions borrowed from the Tan et al. book)

1. Discuss whether or not each of the following activities is a data-mining task. [2 points each – 10 points]
   a. Sorting a student database based on the student identification number.
   b. Predicting the future stock price of a company using historical records.
   c. Monitoring the heart rate of a patient for abnormalities.
   d. Monitoring seismic waves for earthquake activities.
   e. Extracting the frequencies of a sound wave.

2. Classify the following attributes as binary, discrete, or continuous. Also classify them as qualitative (nominal or ordinal) or quantitative (interval or ratio). Some cases may have more than one interpretation, so briefly indicate your reasoning if you think there may be some ambiguity. [3 points each – 15 points]
   a. Brightness as measured by a light meter.
   b. Angles as measured in degrees between 0 degrees and 360 degrees.
   c. Bronze, Silver, and Gold medals as awarded at the Olympics.
   d. ISBN numbers for books.
   e. Military rank.
   f. Coat check number.

3. Which of the following quantities is likely to show more temporal autocorrelation: daily rainfall or daily temperature? Why? [5 points]

4. Discuss why a document-term matrix is an example of a data set that has asymmetric discrete or asymmetric continuous features. [5 points]

5. For the following vectors, x and y, calculate the indicated similarity or distance measures. [3 points each = 9 points]
   a. x = (1, 1, 1, 1), y = (2, 2, 2, 2): cosine, correlation, and Euclidean.
   b. x = (0, 1, 0, 1), y = (1, 0, 1, 0): cosine, correlation, Euclidean, and Jaccard.
   c. x = (0, -1, 0, 1), y = (1, 0, -1, 0): cosine, correlation, and Euclidean.

6. Given a similarity measure with values in the interval [0, 1], describe two ways to transform this similarity value into a dissimilarity value in the interval [0, ∞]. [6 points]
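As a reference for the measures named in question 5, the following is a minimal sketch in plain Python, not part of the original handout. The function names and the example vectors u and v are illustrative assumptions, and the vectors are deliberately different from those in the question. Correlation is computed as the cosine of the mean-centered vectors, and the sketch returns None where a measure is undefined (for example, correlation with a constant vector).

```python
import math


def cosine(x, y):
    """Cosine similarity; None if either vector has zero length."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0 or norm_y == 0:
        return None
    return dot / (norm_x * norm_y)


def correlation(x, y):
    """Pearson correlation, computed as the cosine of the mean-centered vectors.

    Returns None when either vector is constant (zero variance).
    """
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    centered_x = [a - mean_x for a in x]
    centered_y = [b - mean_y for b in y]
    return cosine(centered_x, centered_y)


def euclidean(x, y):
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def jaccard(x, y):
    """Jaccard coefficient for binary (0/1) vectors: f11 / (f01 + f10 + f11)."""
    both_one = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)
    mismatches = sum(1 for a, b in zip(x, y) if a != b)
    denominator = both_one + mismatches
    return both_one / denominator if denominator else None


if __name__ == "__main__":
    # Hypothetical example vectors, not taken from the assignment.
    u = (1, 0, 1, 1)
    v = (1, 1, 0, 1)
    print("cosine     ", cosine(u, v))
    print("correlation", correlation(u, v))
    print("euclidean  ", euclidean(u, v))
    print("jaccard    ", jaccard(u, v))
```

Checking hand calculations against a sketch like this is most useful for the edge cases, since a constant vector makes the correlation undefined rather than zero.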

Part 2 (50 points): The KDD Process in Weka

This assignment was borrowed from the TCSS 555A Data Mining class taught at the University of Washington, Tacoma Branch.

Assignment preparation

This assignment uses the Weka data mining tool. Weka is an open source Java development environment for data mining from the University of Waikato in New Zealand. It can be downloaded freely from http://www.cs.waikato.ac.nz/ml/weka/

Heart disease datasets

The dataset studied is the heart disease dataset from the UCI repository. Two different datasets are provided: heart-h.arff (Hungarian data) and heart-c.arff (Cleveland data). These datasets describe factors of heart disease. Both data sets are available to you on the assignment page.

The goal of the data mining project is to better understand the risk factors for heart disease, as represented in the 14th attribute, num (<50 means no disease, and values >50_1 to >50_4 represent increasing levels of heart disease). The question on which this machine learning study concentrates is whether it is possible to predict heart disease from the other known data about a patient. The data mining task chosen to answer this question is classification/prediction, and several different algorithms will be used to find which one provides the best predictive power. However, this exercise focuses on the various aspects of the KDD process.
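To make the structure of an .arff file and the role of the num class attribute concrete, here is a minimal sketch in plain Python, independent of Weka. The sample text is a made-up, abbreviated stand-in with four attributes rather than the real heart-c.arff, the class labels are assumed to be written as '<50' and '>50_1' to '>50_4', and parse_arff is a hypothetical helper that handles only simple, comma-free values.

```python
# Made-up, abbreviated ARFF text for illustration; the rows are not real patient data.
SAMPLE_ARFF = """\
% comment lines start with a percent sign
@relation heart-sample
@attribute age numeric
@attribute sex {female,male}
@attribute chol numeric
@attribute num {'<50','>50_1','>50_2','>50_3','>50_4'}
@data
52,male,210,'<50'
61,male,290,'>50_2'
45,female,?,'<50'
"""


def parse_arff(text):
    """Return (attributes, rows) from simple ARFF text.

    attributes is a list of (name, type) pairs, where type is either a
    lowercase type name such as 'numeric' or a list of nominal values.
    Values containing commas or spaces are not supported by this sketch.
    """
    attributes = []
    rows = []
    in_data = False
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("%"):
            continue  # skip blank lines and comments
        if in_data:
            rows.append([value.strip().strip("'") for value in line.split(",")])
        elif line.lower().startswith("@attribute"):
            _, name, spec = line.split(None, 2)
            if spec.startswith("{"):
                values = [v.strip().strip("'") for v in spec.strip("{}").split(",")]
                attributes.append((name, values))
            else:
                attributes.append((name, spec.lower()))
        elif line.lower().startswith("@data"):
            in_data = True
    return attributes, rows


if __name__ == "__main__":
    attributes, rows = parse_arff(SAMPLE_ARFF)
    print("number of attributes:", len(attributes))
    print("class attribute:", attributes[-1])
    # Collapse the class into a binary label: anything other than '<50' counts as disease.
    print("has disease:", [row[-1] != "<50" for row in rows])
    # ARFF marks missing values with '?'.
    print("missing values:", sum(value == "?" for row in rows for value in row))
```

The same header inspection is what Weka's Preprocess tab shows when a file is opened, so a sketch like this is mainly useful for understanding the file format, not for replacing the tool.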

1. Data preparation - integration

a. From the documentation provided in the dataset, how many attributes were originally in these datasets?
b. With Weka, attribute selection can be achieved either from the specific Select attributes tab or within the Preprocess tab. List the different options in Weka for selecting attributes, with a short explanation of the corresponding method.

4. Data preparation - cleaning

Data cleaning deals with such defects of real-world data as incompleteness, noise, and inconsistencies. In Weka, data cleaning can be accomplished by applying filters to the data in the Preprocess tab.

a. Missing values. List the methods seen in class for dealing with missing values, and which Weka filters implement them, if available. Remove the missing values with the method of your choice, explaining which filter you are using and why you made this choice. If a filter is not available for your method of choice, develop a new one and add it to the available filters as a Java class. (That should be exciting and fun; send me an email if you plan to do this.)
b. Noisy data. List the methods seen in class for dealing with noisy data, and which Weka filters implement them, if available.
c. Save the cleaned dataset into heart-cleaned.arff, and paste here a screenshot showing at least the first 10 rows of this dataset, with all the columns.

5. Data preparation - transformation

Among the different data transformation techniques, explore those available through the Weka filters. Stay in the Preprocess tab for now. Study the following data transformations only:

a. Attribute construction, for example adding an attribute representing the sum of two other ones. Which Weka filter allows this?
b. Normalize an attribute. Which Weka filter allows this? Can this filter perform min-max normalization? Z-score normalization? Decimal normalization? Provide detailed information about how to perform these in Weka.
c. Normalize all real attributes in the dataset using the method of your choice, and state which one you chose.
d. Save the normalized dataset into heart-normal.arff, and paste here a screenshot showing at least the first 10 rows of this dataset, with all the columns.

6. Data preparation - reduction

Often, data mining datasets are too large to process directly. Data reduction techniques are used to preprocess the data. Once the data mining project has been successful on the reduced data, the larger dataset can be processed too.

a. Stay in the Preprocess tab for now. Besides attribute selection, another reduction method is to select rows from a dataset. This is called sampling. How can sampling be performed with Weka filters? Can they perform the two main methods: Simple Random Sample Without Replacement and Simple Random Sample With Replacement?
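The three normalization methods named in step 5b and the two sampling schemes named in step 6a can be sketched in plain Python as follows. This is a minimal illustration of the definitions, assuming a single numeric column held in a list; it says nothing about how any Weka filter is implemented or configured, and all function names are hypothetical. The z-score version uses the population standard deviation, which is one of two common conventions.

```python
import random
import statistics


def min_max(values, new_min=0.0, new_max=1.0):
    """Rescale values linearly so that min maps to new_min and max to new_max."""
    low, high = min(values), max(values)
    if high == low:
        return [new_min for _ in values]  # constant column: nothing to spread out
    scale = (new_max - new_min) / (high - low)
    return [new_min + (v - low) * scale for v in values]


def z_score(values):
    """Subtract the mean and divide by the population standard deviation."""
    mean = statistics.fmean(values)
    std_dev = statistics.pstdev(values)
    if std_dev == 0:
        return [0.0 for _ in values]
    return [(v - mean) / std_dev for v in values]


def decimal_scaling(values):
    """Divide by 10**j, with j the smallest integer such that max(|v|) / 10**j < 1."""
    max_abs = max(abs(v) for v in values)
    j = 0
    while max_abs / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values]


def sample_without_replacement(rows, n, seed=None):
    """Simple random sample without replacement: each row is picked at most once."""
    return random.Random(seed).sample(rows, n)


def sample_with_replacement(rows, n, seed=None):
    """Simple random sample with replacement: the same row may be picked repeatedly."""
    return random.Random(seed).choices(rows, k=n)


if __name__ == "__main__":
    column = [10, 20, 30]  # made-up values for illustration
    print(min_max(column))
    print(z_score(column))
    print(decimal_scaling([-986, 917]))
    rows = list(range(10))
    print(sample_without_replacement(rows, 5, seed=0))
    print(sample_with_replacement(rows, 15, seed=0))
```

Note that sampling with replacement can return more rows than the dataset contains, while sampling without replacement cannot; that difference is a quick way to tell the two schemes apart when experimenting with a filter's settings.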