Homework 2 - Machine Learning and Data Mining | CS 434, Assignments of Computer Science

Oregon State University (OSU)Computer Science

Material Type: Assignment; Class: MACHINE LEARNING AND DATA MINING; Subject: Computer Science; University: Oregon State University; Term: Unknown 1989;

Typology: Assignments

Pre 2010

Uploaded on 08/30/2009

koofers-user-gfl 🇺🇸


CS434 HW2

Due Oct 24 in Class

PART I

In Part I, you will use WEKA to analyze the two artificial data sets we generated and one real data set. You will apply the learning algorithms we have learned to each data set and compare their performance.

• Learning Algorithms. We will compare the perceptron (in this case, the voted perceptron), KNN (i.e., IBk), and decision tree (i.e., J48). You should use the defaults that WEKA sets for these algorithms, with the following exceptions:
1. trees>J48: Set unpruned to True.
2. lazy>IBk: Set KNN to 1 (which is the default; we will experiment with other values below).

• Data Sets. We will apply these algorithms to the data sets hw2-1, hw2-2, and br. These data sets are available here: http://web.engr.oregonstate.edu/~xfern/classes/cs434/data/data.html. Each data set has one or more training data files and one test data file:

br data files:
br-test.arff       br test data file
br-train.arff      br training data file

hw2-1 data files:
hw2-1-10.arff      10 training examples
hw2-1-20.arff      20 training examples
hw2-1-50.arff      50 training examples
hw2-1-100.arff     100 training examples
hw2-1-200.arff     200 training examples
hw2-1-test.arff    test data file

hw2-2 data files:
hw2-2-25.arff      25 training examples
hw2-2-50.arff      50 training examples
hw2-2-100.arff     100 training examples
hw2-2-200.arff     200 training examples
hw2-2-600.arff     600 training examples
hw2-2-test.arff    test data file

In case you are curious, here is how we generated the two synthetic data sets. The data set hw2-1 is generated from two Gaussian distributions, one centered at (1,0) and the other at (0,1). Both have the same covariance matrix:

[ 2 0 ]
[ 0 1 ]
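The description of hw2-1 can be sketched in a few lines of Python. This is only an illustration, not the script used to produce the course files: the source does not say which Gaussian belongs to which class or how often each class is drawn, so the label names, the equal class probabilities, and the function name `sample_hw2_1` are all assumptions.

```python
import math
import random

# Means and the shared diagonal covariance come from the text above.
# Which mean is "positive" and which is "negative" is an assumption.
MEANS = {"positive": (1.0, 0.0), "negative": (0.0, 1.0)}
VARIANCES = (2.0, 1.0)  # diagonal entries of the covariance matrix


def sample_hw2_1(n, seed=0):
    """Draw n labeled points; each class is chosen with probability 1/2 (assumed)."""
    rng = random.Random(seed)
    points = []
    for _ in range(n):
        label = rng.choice(sorted(MEANS))
        mean_x, mean_y = MEANS[label]
        # A diagonal covariance matrix means the two coordinates are
        # independent, so each is drawn from its own 1-D Gaussian.
        x = rng.gauss(mean_x, math.sqrt(VARIANCES[0]))
        y = rng.gauss(mean_y, math.sqrt(VARIANCES[1]))
        points.append((x, y, label))
    return points


data = sample_hw2_1(200)
```

Because the two class means are only about 1.4 units apart while the standard deviations are 1 or more, the classes overlap heavily, so no classifier should be expected to reach zero test error on this data.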

hw2-2 is generated as follows. The x coordinate is generated from an exponential distribution with parameter 1.0. The y coordinate is generated from a uniform random distribution in the interval [0,1]. The class is assigned as follows: if x > 0.5, the example belongs to the positive class; otherwise it belongs to the negative class. However, the class label is then flipped with probability 0.1 (so-called "10% label noise").
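The hw2-2 recipe can be sketched the same way. This is a minimal illustration under stated assumptions, not the original generator: the function name `sample_hw2_2` and the seed handling are made up here, and "parameter 1.0" is read as the rate of the exponential distribution.

```python
import random


def sample_hw2_2(n, noise=0.1, seed=0):
    """Draw n labeled points following the hw2-2 description."""
    rng = random.Random(seed)
    examples = []
    for _ in range(n):
        x = rng.expovariate(1.0)  # exponential with rate 1.0 (assumed reading)
        y = rng.random()          # uniform on [0, 1); carries no class information
        label = "positive" if x > 0.5 else "negative"
        # Label noise: flip the class with probability `noise`.
        if rng.random() < noise:
            label = "negative" if label == "positive" else "positive"
        examples.append((x, y, label))
    return examples


train = sample_hw2_2(200)
```

Note that y is irrelevant to the class, and that the 10% flipped labels put a floor of roughly 10% on the achievable test error rate.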


br is a handwritten letter data set that contains the letters b and r. Each example is described by 16 attributes corresponding to the 16 pixels of a 4 by 4 image.

You will run the learning algorithms on each training data file and evaluate the results on the corresponding test data files.

• Results. You should turn in the following. Please provide a printout of the results.
1. A table in the following format:

          N     Method1  Method2  Method3
hw2-1:    10    xxx      yyy      zzz
          20    xxx      yyy      zzz
          50    xxx      yyy      zzz
          100   xxx      yyy      zzz
          200   xxx      yyy      zzz
hw2-2:    25    xxx      yyy      zzz
          50    xxx      yyy      zzz
          100   xxx      yyy      zzz
          200   xxx      yyy      zzz
          600   xxx      yyy      zzz
br:       614   xxx      yyy      zzz

Where xxx, yyy, zzz give the error rates of each method on the test data. (Use “Supplied test set” for “Test Option” in the classify tab)

2. Graphs of the results for hw2-1 and hw2-2, plotting the performance of each algorithm as a function of the size of the training data set (known as a "learning curve"). I recommend using Matlab, Gnuplot, or Excel for constructing the graphs. WEKA does not provide an easy way to do this.
3. A plot of the data points for hw2-1-200 and hw2-2-200 with lines showing the decision boundaries learned by the decision tree (J48). This will require that you read the decision tree and understand the decision boundary. J48 displays the tree in the following format:

x1 <= 1.0: positive (75.0/17.0)
x1 > 1.0
|   x2 <= 5.0: negative (42.0/12.0)
|   x2 > 5.0: positive (33.0/10.0)
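One way to read a tree in this format is as nested threshold tests, each of which cuts the plane with an axis-parallel line. The sketch below hand-translates the example tree shown here; the function name `classify` is hypothetical, and this is not output produced by WEKA.

```python
def classify(x1, x2):
    """Hand translation of the example J48 tree into nested threshold tests."""
    if x1 <= 1.0:
        return "positive"   # left of the vertical line x1 = 1.0
    if x2 <= 5.0:
        return "negative"   # right of x1 = 1.0 and at or below x2 = 5.0
    return "positive"       # right of x1 = 1.0 and above x2 = 5.0


# Each leaf corresponds to one axis-aligned region of the plane.
regions = [
    ("x1 <= 1.0", "positive"),
    ("x1 > 1.0 and x2 <= 5.0", "negative"),
    ("x1 > 1.0 and x2 > 5.0", "positive"),
]
```

For this example tree the predicted label changes only along the segment x1 = 1.0 with x2 <= 5.0 and along the half-line x2 = 5.0 with x1 > 1.0, so those two pieces are the decision boundary to draw.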

The first line indicates a split on feature x1 with threshold 1.0. The first branch leads to a leaf labeled "positive". The numbers in parentheses indicate that this

Once we have chosen an algorithm, it will be listed next to the “Choose” button with its default parameter choices. To change these choices, click on the algorithm name; you will be given an interface for modifying the parameters. Click the “More” button to get more information about the parameters. After setting the parameters, click OK. Now we are ready to run the algorithm. Make sure you have selected the right test option and then click the "Start" button. The Classifier Output window will show the output from the classifier. This output consists of several sections:

Run Information: Details of the data set
Classifier model: The learned model. This part will be different for different algorithms. For example for Decision tree, it will display the learned decision tree.
Evaluation on test set: This gives various statistics. The key item is the second one: Incorrectly Classified Instances will be expressed as a count and a percentage. You should report the percentages in your answer. One other item of interest comes at the very end: The Confusion Matrix. This shows how many false positive and false negative errors were made.
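The relationship between the confusion matrix and the reported error percentage can be sketched as follows. The counts are made-up placeholders, not results for any of the assignment's data sets, and treating the first class as "positive" is an assumption; the layout assumed is rows for the actual class and columns for the predicted class.

```python
def summarize(confusion):
    """Return (incorrect count, error percentage) for a square confusion matrix.

    confusion[i][j] = number of test instances of actual class i
    that were classified as class j.
    """
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    incorrect = total - correct
    return incorrect, 100.0 * incorrect / total


# Placeholder counts for a two-class problem (hypothetical values).
matrix = [
    [40, 10],  # actual class 0: 40 classified correctly, 10 misclassified
    [5, 45],   # actual class 1: 5 misclassified, 45 classified correctly
]
incorrect, error_pct = summarize(matrix)

# With class 0 treated as "positive" (an assumption):
false_negatives = matrix[0][1]
false_positives = matrix[1][0]
```

The off-diagonal entries are the errors, so their sum divided by the total number of test instances gives the percentage to report in the table.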

PART II

Probability

1. (10pts) We have two identical bags. Bag A contains 4 red marbles and 6 black marbles, and bag B contains 5 red marbles and 5 black marbles. We randomly choose a bag and draw a marble from it, and the marble turns out to be black. What is the probability that the chosen bag is bag A?
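A question of this kind is a direct application of Bayes' rule, P(A | black) = P(black | A) P(A) / P(black). The sketch below sets up that computation with exact fractions; it assumes that "randomly choose a bag" means each bag is picked with probability 1/2, which the problem statement implies but does not spell out.

```python
from fractions import Fraction

# Prior over bags (assumed uniform) and likelihood of drawing black from each.
prior = {"A": Fraction(1, 2), "B": Fraction(1, 2)}
p_black_given = {"A": Fraction(6, 10), "B": Fraction(5, 10)}

# Total probability of drawing a black marble.
p_black = sum(prior[bag] * p_black_given[bag] for bag in prior)

# Bayes' rule: posterior probability that the bag is A given a black marble.
posterior_a = prior["A"] * p_black_given["A"] / p_black
print(posterior_a)
```

Using `Fraction` keeps the arithmetic exact, which makes it easy to check the result by hand.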

2. (6pts) Suppose we have a class variable $Y$ and three attributes $X_1$, $X_2$, $X_3$, and we wish to calculate $P(Y \mid X_1, X_2, X_3)$, and we have no conditional independence information.
(a) Which of the following sets of probabilities are sufficient for the calculation?
i. $P(Y)$, $P(X_1 \mid Y)$, $P(X_2 \mid Y)$, $P(X_3 \mid Y)$
ii. $P(X_1, X_2, X_3)$, $P(Y)$, $P(X_1, X_2, X_3 \mid Y)$
iii. $P(X_1, X_2, X_3)$, $P(Y \mid X_1)$, $P(Y \mid X_2)$, $P(Y \mid X_3)$
(b) Now suppose we know that the variables $X_1$, $X_2$, $X_3$ are conditionally independent given the class variable $Y$. Which of the above three sets are sufficient now?

Decision tree

(20 pts) Given the following data set:

The task is to build a decision tree for classifying Y.
(a) Compute the information gain of attributes X, V, and W respectively.
(b) Use information gain to select tests and produce the full decision tree generated by the top-down greedy algorithm described in class. (Stopping criterion: stop if all the instances belong to the same class.)
(c) Consider the following two strategies for avoiding overfitting.
i. The first strategy stops growing the tree when the information gain of the best test is less than a given threshold ε.
ii. The second strategy grows the full tree first and then prunes the tree bottom-up: start from the lowest level of the tree and prune a sub-tree if the information gain of the test is less than a given threshold ε. (Note that you should stop checking level t if none of the sub-trees at level t+1 satisfies the pruning criterion.)
Let ε be 0.001 in both cases. Write down the resulting tree for each strategy and compare their training errors.
(d) Discuss the advantages and disadvantages of each of these two strategies.
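Information gain, as used in parts (a) through (c), is the entropy of the class labels minus the weighted entropy of the labels after splitting on an attribute. The sketch below is a minimal implementation under that standard definition. The assignment's data table is not reproduced in this text, so `toy_rows` is a hypothetical stand-in and does not correspond to the homework data.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(rows, attribute, target="Y"):
    """Entropy of the target minus its expected entropy after splitting on attribute."""
    gain = entropy([row[target] for row in rows])
    for value in {row[attribute] for row in rows}:
        subset = [row[target] for row in rows if row[attribute] == value]
        gain -= len(subset) / len(rows) * entropy(subset)
    return gain


# Hypothetical toy data: X determines Y exactly, V is unrelated to Y.
toy_rows = [
    {"X": 0, "V": 0, "Y": 0},
    {"X": 0, "V": 1, "Y": 0},
    {"X": 1, "V": 0, "Y": 1},
    {"X": 1, "V": 1, "Y": 1},
]
gain_x = information_gain(toy_rows, "X")
gain_v = information_gain(toy_rows, "V")
```

On the toy rows, splitting on X removes all uncertainty about Y while splitting on V removes none, which is the kind of contrast the threshold ε in part (c) is meant to detect.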
