Assignment 5 Problems - Machine Learning | CSCI 567, Assignments of Computer Science

University of Southern California (USC)Computer Science

Material Type: Assignment; Class: Machine Learning; Subject: Computer Science; University: University of Southern California; Term: Fall 2008;

Typology: Assignments

Pre 2010

Uploaded on 02/24/2010

koofers-user-jlt 🇺🇸


CSCI567 Machine Learning (Fall 2008) Assignment #5

Instructor: Dr. Sofus A. Macskassy

TA: Cheol Han

Due time: 5:00pm, Nov 25, 2008

Student Name: ______________________

Student ID: ______________________


1. (Cost-Sensitive Learning, 20 points)

Suppose you run a learner on a data set, and it comes back with 21 distinct thresholds. You rank all instances by these thresholds, and for each threshold you calculate the true-positive rate and false-positive rate and get the following table:

Plotting the values, you get the following ROC curve:

Find the best threshold to use if:

a. FP’s are 5 times as costly as FN’s.

b. FP’s are as costly as FN’s.

c. FN’s are 3 times as costly as FP’s.

d. FN’s are 2 times as costly as FP’s.

Report the threshold and plot the lines for each of these constraints.

FP                   TP                   Threshold
0                    0                    100
0.0370943183851526   0.118230593753712    95
0.0968031236781045   0.238230593753712    93
0.107443653905474    0.352533774326488    90
0.132230725946249    0.402533774326488    83
0.164140599771191    0.454572062224442    76
0.263709739192864    0.534923219475437    71
0.32281131283499     0.57860503948066     60
0.382049216107127    0.681091064190152    45
0.414339785178702    0.721091064190152    41
0.479355121381361    0.79588320792435     38
0.569489556290122    0.835630358834613    33
0.604857471240753    0.855630358834613    29
0.629625969142352    0.88287327921599     27
0.67453458226317     0.900015074992161    23
0.725507232671498    0.920015074992161    20
0.766334495213198    0.93118655127536     16
0.838523833441139    0.942478390342781    13
0.859502764379531    0.952478390342781    12
0.952413400915812    0.979718100089046    4
0.993844820740749    1.00                 0
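As a rough illustration of how a cost ratio picks out an operating point on an ROC table, the sketch below minimizes expected misclassification cost over a list of (FP rate, TP rate, threshold) rows. The function name `best_threshold`, the `pos_fraction` parameter, and the four-row `toy_roc` table are all hypothetical; the assignment does not state the class distribution, so the default of 0.5 is only a placeholder assumption.

```python
def best_threshold(roc_points, cost_fp, cost_fn, pos_fraction=0.5):
    """Return the threshold with the lowest expected misclassification cost.

    roc_points: iterable of (fp_rate, tp_rate, threshold) tuples.
    pos_fraction: assumed share of positive instances (an assumption;
    the assignment text does not give the class distribution).
    """
    neg_fraction = 1.0 - pos_fraction

    def expected_cost(point):
        fp_rate, tp_rate, _ = point
        fn_rate = 1.0 - tp_rate
        # Cost per instance: false positives come from negatives,
        # false negatives come from positives.
        return (cost_fp * neg_fraction * fp_rate
                + cost_fn * pos_fraction * fn_rate)

    return min(roc_points, key=expected_cost)[2]


# Hypothetical four-row ROC table, for illustration only.
toy_roc = [(0.0, 0.0, 100), (0.1, 0.6, 70), (0.4, 0.9, 40), (1.0, 1.0, 0)]

print(best_threshold(toy_roc, cost_fp=5, cost_fn=1))  # 70
print(best_threshold(toy_roc, cost_fp=1, cost_fn=3))  # 40
```

Under these assumptions, points of equal expected cost lie on lines in ROC space with slope (cost_fp * neg_fraction) / (cost_fn * pos_fraction), so minimizing the cost is equivalent to sliding such a line down from the top-left corner until it first touches the curve.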

4. (Weka Experiments with Bagging and Boosting, 45 points)

In this part of the homework, you will experiment with Bagging and Boosting.

Learning Algorithms. Bagging and AdaBoostM1 are available under the "Meta" category in Weka. Please use the following settings:

• Bagging: set numIterations to 30. You will run experiments with the classifier set to Trees.J48, Functions.logistic, and Bayes.naiveBayesSimple.

• AdaBoostM1: set maxIterations to 30. Set weightThreshold to 100000. You will run experiments with the classifier set to the same three algorithms as for Bagging.

For J48, set the "unpruned" option to True (this is done in the meta-classifier dialogue box). You can use the default settings for all other parameters of J48, NaiveBayesSimple, and Logistic Regression. Optional: Rerun the experiments with pruning turned on and see if it makes any difference.

In addition to running Bagging and AdaBoostM1, you should rerun a single decision tree, a single Naive Bayes, and a single logistic regression.
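For intuition about what a bagging meta-classifier does around its base learner, here is a minimal pure-Python sketch. It is not Weka's implementation: the base learner is a hypothetical one-dimensional decision stump, and the names `train_stump`, `bagging`, `error_rate`, and the toy data are assumptions made for illustration.

```python
import random
from collections import Counter


def train_stump(data):
    """Fit a 1-D stump: predict high_label when x >= cut, else the other label."""
    best = None
    for cut in sorted({x for x, _ in data}):
        for high_label in (0, 1):
            errors = sum((high_label if x >= cut else 1 - high_label) != y
                         for x, y in data)
            if best is None or errors < best[0]:
                best = (errors, cut, high_label)
    _, cut, high_label = best
    return lambda x: high_label if x >= cut else 1 - high_label


def bagging(data, num_iterations, rng):
    """Train one stump per bootstrap sample; predict by majority vote."""
    models = []
    for _ in range(num_iterations):
        # Bootstrap sample: draw len(data) instances with replacement.
        sample = [rng.choice(data) for _ in data]
        models.append(train_stump(sample))

    def predict(x):
        votes = Counter(model(x) for model in models)
        return votes.most_common(1)[0][0]

    return predict


def error_rate(predict, data):
    """Fraction of instances in data that predict gets wrong."""
    return sum(predict(x) != y for x, y in data) / len(data)


# Hypothetical toy data: the label is 1 when x >= 5.
train = [(x, int(x >= 5)) for x in range(10)]
single = train_stump(train)
ensemble = bagging(train, num_iterations=30, rng=random.Random(0))
print(error_rate(single, train), error_rate(ensemble, train))
```

The same `error_rate` helper applied to a held-out test set gives the kind of number the results tables ask for; the single model and the ensemble differ only in how the prediction is formed.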

Data Sets. You will apply these three algorithms to the same data sets that you have been using before: hw_gmm, hw_step, and statlog. You will not construct learning curves this time. Instead, you should just train and test on the following files:

Domain     Training Data File   Test Data File
statlog    statlog.arff         statlog_test.arff
hw_gmm     hw_gmm-250.arff      hw_gmm-test.arff
hw_step    hw_step-250.arff     hw_step-test.arff

NOTE: You should use the train and test files for statlog that were provided at the same location as this homework rather than the files you created in homework 3.

Results. You should turn in three tables in the following format (10 points):

hw_gmm:
Base learner   Single   Bagging   Boosting
J48            xxx      yyy       zzz
Logistic       xxx      yyy       zzz
NaiveBayes     xxx      yyy       zzz

hw_step:
Base learner   Single   Bagging   Boosting
J48            xxx      yyy       zzz
Logistic       xxx      yyy       zzz
NaiveBayes     xxx      yyy       zzz

statlog:
Base learner   Single   Bagging   Boosting
J48            xxx      yyy       zzz
Logistic       xxx      yyy       zzz
NaiveBayes     xxx      yyy       zzz

Where xxx gives the error rate of a single classifier of the indicated base learner, yyy gives the error rate of Bagging (30 iterations), and zzz gives the error rate of AdaBoostM1 (maximum 30 iterations).

Answer the following questions (5 points each):

a) Which algorithms+data sets are improved by Bagging?

b) Which algorithms+data sets are improved by Boosting?

c) Can you explain these results in terms of the bias and variance of the learning algorithms applied to these domains? Are some of the learning algorithms unbiased for some of the domains? Which ones?

Now, set the number of iterations to 3, 5, 10, 20, and 50 (for both Bagging and Boosting) and run J48, Logistic, and Naive Bayes on the hw_gmm, hw_step, and statlog data sets. Provide six tables in the following format (10 points):

DATASET (LEARNER)
            Bagging                 Boosting
Iteration   TrainError  TestError   TrainError  TestError  ActualIterations
3           xxx         yyy         zzz         www        kkk
5
10
20
50


Where DATASET is hw_gmm, hw_step, or statlog and LEARNER is J48, Logistic, or Naïve Bayes. You can get the training error by selecting 'test on train set' in the test set options. The number of actual iterations for boosting is reported when you run it.

Answer the following (5 points each):

d) Do the training and test error follow the pattern that you would expect? If yes, why is this what you would expect, and if no, why not?

e) Explain why the number of actual iterations for Boosting is not always the same as the number of iterations that you requested.
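To make the notion of "actual iterations" concrete, the sketch below follows the textbook AdaBoost.M1 loop, which stops as soon as a base model has weighted error of zero or of at least 0.5. It is a simplified illustration under stated assumptions, not Weka's code: the base learner is a hypothetical 1-D weighted stump, and `train_weighted_stump`, `adaboost_m1`, and the toy data sets are invented for this example.

```python
import math


def train_weighted_stump(data, weights):
    """Return (weighted_error, model) for the best 1-D stump."""
    best = None
    for cut in sorted({x for x, _ in data}):
        for high_label in (0, 1):
            err = sum(w for (x, y), w in zip(data, weights)
                      if (high_label if x >= cut else 1 - high_label) != y)
            if best is None or err < best[0]:
                best = (err, cut, high_label)
    err, cut, high_label = best
    return err, (lambda x: high_label if x >= cut else 1 - high_label)


def adaboost_m1(data, max_iterations):
    """Return (weighted_models, actual_iterations) for textbook AdaBoost.M1."""
    n = len(data)
    weights = [1.0 / n] * n
    models = []
    for _ in range(max_iterations):
        err, model = train_weighted_stump(data, weights)
        if err == 0 or err >= 0.5:
            # Early stop: the base model is perfect or no better than chance.
            if not models:
                models.append((1.0, model))  # keep at least one model
            break
        beta = err / (1.0 - err)
        models.append((math.log(1.0 / beta), model))
        # Down-weight correctly classified instances, then renormalize.
        weights = [w * (beta if model(x) == y else 1.0)
                   for (x, y), w in zip(data, weights)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return models, len(models)


# Hypothetical toy data sets.
separable = [(x, int(x >= 5)) for x in range(10)]
noisy = [(0, 0), (1, 0), (2, 1), (3, 1), (4, 0), (5, 0)]

print(adaboost_m1(separable, max_iterations=30)[1])  # 1: first stump is perfect
print(adaboost_m1(noisy, max_iterations=10)[1])
```

On the separable toy set the loop ends after a single model even though 30 were requested, which is the kind of gap the ActualIterations column records.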