Pattern Recognition and Machine Learning - Machine Learning | CS 446, Quizzes of Computer Science

Material Type: Quiz; Professor: Roth; Class: Machine Learning; Subject: Computer Science; University: University of Illinois - Urbana-Champaign; Term: Fall 2010;

Uploaded on 12/15/2010



CS446: Pattern Recognition and Machine Learning Fall 2010

Class Exercise 4

Date: November 4, 2010 Name (NetID):

Instructions:

  • Please write your name and NetID at the top of this sheet before you return it to the instructor.
  • The goal of the exercises is to help you recall previous lectures and homeworks and think about them. If you want, you may refer to your class notes to answer the questions.
  • Answer: The solutions are highlighted.

Multi-class Classification

Consider a multi-class classification problem with $k$ class labels $\{1, 2, \ldots, k\}$. Assume that we are given $m$ examples, each labeled with one of the $k$ class labels. Assume for simplicity that we have $m/k$ examples of each type. Assume that you have a learning algorithm $L$ that can be used to learn Boolean functions (e.g., think of $L$ as the perceptron algorithm). We would like to explore several ways to develop learning algorithms for the multi-class classification problem.

  1. Suggest two schemes to use the algorithm $L$ on the given data set and produce a multi-class classification. In each case, determine:
    • How will you train $L$? That is, what is the input data, what are the positive and negative examples, etc.? Indicate how many “copies” of $L$ you will use.
      Answer:
      Scheme 1: We will have $k$ classifiers (that is, $k$ weight vectors). The $i$-th weight vector will assign a confidence score to the $i$-th class. To train this, we create $k$ binary problems as follows: for the $i$-th class, the positive examples will be all examples with label $i$ and the negative examples will be the examples with all other labels.
      Scheme 2: We will have $\frac{1}{2}k(k-1)$ weight vectors. Each weight vector $w_{i,j}$ will assign a preference between classes $i$ and $j$. To train this, we create binary problems as follows: for training $w_{i,j}$, the positive examples will be the examples labeled $i$ and the negative examples will be those labeled $j$.
    • How will you use your final hypothesis given a new example?
      Answer:
      Scheme 1: The label can be chosen as the one that achieves the maximum score. That is, for an input $x$, $y^* = \arg\max_i w_i^T x$.
      Scheme 2: There are several ways to use the $\frac{1}{2}k(k-1)$ classifiers. One approach would be to use all of them on the example and have each classifier vote on the class. Then the label with the highest number of votes would be the winner. Another approach is to conduct a tournament between the labels.
  2. In the first scheme proposed above you used $k$ classifiers. We call this scheme 1-vs-all.
    • Can you invent a similar scheme that only makes use of $\log_2 k$ classifiers? Answer: We need $\log_2 k$ bits to represent all the labels in binary representation. Each bit can be either 0 or 1, so we can train one classifier per bit. At prediction time, the predictions of the $\log_2 k$ classifiers form a binary string of length $\log_2 k$, which is the predicted label.
    • Think about one disadvantage of this scheme. Answer: This scheme is extremely sensitive to noise. If even one of the classifiers is incorrect, our final prediction will be wrong.
    • How can we deal with this problem? Answer: Using the error correcting code scheme (See below and class slides).
    • The error correcting code scheme uses redundancy to address the problem. For simplicity, assume $k = 8$ class labels. Instead of using 3 classifiers, use 5.
      - How many elements are there in the output space? Answer: $2^5 = 32$.
      - How will you use the 5 classifiers to distinguish the $k = 8$ labels? Answer: Since we need to represent $8 = 2^3$ labels using 5 bits, we can use the two extra bits to design an error correcting code for the labels. For example, consider the following assignment:

        Label   Code
        0       0 0 0 0 0
        1       0 0 1 0 1
        2       0 1 1 1 0
        3       0 1 0 1 1
        4       1 1 0 0 0
        5       1 0 0 0 1
        6       1 0 1 1 0
        7       1 1 1 1 1

        Each codeword differs from every other codeword in at least two bits. This way, if one of the classifiers makes an incorrect prediction, the resulting string is not a valid codeword, so the error is always detected, and decoding to the nearest codeword recovers the label whenever that nearest codeword is unique. (Guaranteed correction of every single-bit error would require a minimum distance of three.)
      - What problems do you see with this scheme? Answer: The main problem with this scheme is with the meaning of the codes. For example, according to the above encoding, the classifier for the least significant bit should learn to separate labels 0, 2, 4, 6 from labels 1, 3, 5, 7. Why should this be separable?
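
The code assignment above can be checked mechanically. The following is a minimal sketch, not part of the original exercise: the helper names `hamming`, `min_distance`, and `decode` are our own. It computes the smallest pairwise distance of the table and decodes a predicted bit string to the nearest codeword(s), returning all ties so that ambiguous cases are visible.

```python
from itertools import combinations

# Code table copied from the exercise: label -> 5-bit codeword.
CODES = {
    0: (0, 0, 0, 0, 0),
    1: (0, 0, 1, 0, 1),
    2: (0, 1, 1, 1, 0),
    3: (0, 1, 0, 1, 1),
    4: (1, 1, 0, 0, 0),
    5: (1, 0, 0, 0, 1),
    6: (1, 0, 1, 1, 0),
    7: (1, 1, 1, 1, 1),
}


def hamming(a, b):
    """Number of positions in which two equal-length bit tuples differ."""
    return sum(x != y for x, y in zip(a, b))


def min_distance(codes):
    """Smallest Hamming distance between any two distinct codewords."""
    return min(hamming(a, b) for a, b in combinations(codes.values(), 2))


def decode(bits, codes=CODES):
    """Return every label whose codeword is nearest to the predicted bits.

    One label means the prediction decodes unambiguously; several labels
    mean a tie that the code cannot resolve on its own.
    """
    dists = {label: hamming(bits, code) for label, code in codes.items()}
    best = min(dists.values())
    return sorted(label for label, d in dists.items() if d == best)


print(min_distance(CODES))      # 2
print(decode((0, 0, 0, 1, 0)))  # [0]: one flipped bit, unique nearest codeword
print(decode((0, 0, 0, 0, 1)))  # [0, 1, 5]: one flipped bit, three-way tie
```

Returning all tied labels, rather than picking one, is a design choice made here so that the limits of a distance-two code show up directly in the output.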

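Going back to the first question, the two reductions (one weight vector per class, and one per pair of classes) might be sketched as follows. This is only an illustration under stated assumptions: $L$ is taken to be a plain perceptron without a bias term, the epoch count is arbitrary, and all function names are hypothetical.

```python
from itertools import combinations


def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))


def perceptron(data, epochs=20):
    """Binary learner L: data is a list of (x, s) with s in {+1, -1}."""
    w = [0.0] * len(data[0][0])
    for _ in range(epochs):
        for x, s in data:
            if s * dot(w, x) <= 0:  # mistake (or zero score): update
                w = [wi + s * xi for wi, xi in zip(w, x)]
    return w


def train_one_vs_all(data, k):
    """Scheme 1: k copies of L; class i is positive, all others negative."""
    return {i: perceptron([(x, 1 if y == i else -1) for x, y in data])
            for i in range(1, k + 1)}


def train_all_pairs(data, k):
    """Scheme 2: k(k-1)/2 copies of L; for (i, j), i is positive, j negative."""
    return {(i, j): perceptron([(x, 1 if y == i else -1)
                                for x, y in data if y in (i, j)])
            for i, j in combinations(range(1, k + 1), 2)}


def predict_one_vs_all(ws, x):
    """Pick the label whose weight vector gives the highest score."""
    return max(ws, key=lambda i: dot(ws[i], x))


def predict_by_vote(ws, x, k):
    """Each pairwise classifier votes for i or j; the most votes wins."""
    votes = {i: 0 for i in range(1, k + 1)}
    for (i, j), w in ws.items():
        votes[i if dot(w, x) > 0 else j] += 1
    return max(votes, key=votes.get)
```

Note that ties in the vote are broken here in favor of the smallest label, which is an arbitrary choice; the tournament approach mentioned in the answer is not implemented.
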
for each example $(x, i)$ (that is, the label of $x$ is $i$) do
    for all $(i, j)$, $j \neq i$ do
        if $(w_i - w_j)^T \cdot x < 0$ (mistaken prediction) then
            $w_i \leftarrow w_i + x$ (promotion)
            $w_j \leftarrow w_j - x$ (demotion)
        end if
    end for
end for
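
A runnable version of the loop above could look like the following sketch. It makes a few assumptions that are not in the notes: labels are numbered $0, \ldots, k-1$, the data is swept for a fixed number of epochs, and the mistake test uses `<=` instead of `<` so that the all-zero initial weights already trigger updates.

```python
def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))


def train_all_constraints(data, k, epochs=20):
    """Multiclass perceptron with one weight vector per label (labels 0..k-1).

    For an example (x, i), every competitor j whose score is not strictly
    below that of the true label i triggers a promotion of w_i and a
    demotion of w_j.
    """
    n = len(data[0][0])
    w = [[0.0] * n for _ in range(k)]
    for _ in range(epochs):
        for x, i in data:
            for j in range(k):
                if j == i:
                    continue
                # Assumption: "<=" rather than "<", so that zero weights
                # count as a mistake and training can start.
                if dot(w[i], x) - dot(w[j], x) <= 0:
                    w[i] = [a + b for a, b in zip(w[i], x)]  # promotion
                    w[j] = [a - b for a, b in zip(w[j], x)]  # demotion
    return w


def predict(w, x):
    """Return the label with the highest score w_y . x."""
    return max(range(len(w)), key=lambda y: dot(w[y], x))
```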

We get the minimal margin of a data set by minimizing over all examples in it. To make the general case even closer to the balanced case, we present the conservative update scheme. When training via constraint classification, we do not want to penalize all components of $w$. Rather, we only want to update the component that corresponds to the toughest competitor of the correct label $i$, that is, the label with the smallest margin.

From Multiclass to Structure Prediction

In this section, we rewrite the algorithm above in a way that can later be generalized to a more general setting, that of Structure Prediction. We now go back to the global weight vector view, in which we concatenate the $k$ weight vectors into

$w = (w_1, w_2, \ldots, w_k) \in \mathbb{R}^{nk}$.

In this view, an example $(x, i)$ is embedded in an $nk$-dimensional vector, with $x$ embedded in the $i$-th part of it, and 0 in all the other dimensions. Writing $(x, y)$ for this embedded vector, we note that

$f(x, y) = w^T \cdot (x, y) = w_y^T \cdot x$

In this notation the prediction we make is:

$y^* = \arg\max_{y \in [k]} f(x, y)$.
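
To make the embedding concrete, here is a small sketch. The helper names `embed`, `score`, and `predict` are our own, and labels are numbered $0, \ldots, k-1$ for indexing convenience. It shows that the inner product of the global weight vector with the embedded example only involves the block that belongs to the chosen label.

```python
def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))


def embed(x, y, k):
    """Place x in the y-th block of an n*k vector, zeros everywhere else."""
    n = len(x)
    v = [0.0] * (n * k)
    v[y * n:(y + 1) * n] = x
    return v


def score(w, x, y, k):
    """f(x, y): global weight vector times the embedded example."""
    return dot(w, embed(x, y, k))


def predict(w, x, k):
    """Return the label y in 0..k-1 that maximizes f(x, y)."""
    return max(range(k), key=lambda y: score(w, x, y, k))


w = [1, 0, 0, 2, -1, -1]   # k = 3 blocks of dimension n = 2
x = [3, 4]
print([score(w, x, y, 3) for y in range(3)])  # [3.0, 8.0, -7.0]
print(predict(w, x, 3))                       # 1
```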

Here is how we can write the conservative update in this view:

for each example $(x, i)$ (that is, the label of $x$ is $i$) do
    let $j^* = \arg\min_{j \in [k] \setminus \{i\}} (w_i^T \cdot x - w_j^T \cdot x)$
    if $(w_i - w_{j^*})^T \cdot x < 0$ (mistaken prediction) then
        $w \leftarrow w + (x, i) - (x, j^*)$
    end if
end for
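
The conservative update can be sketched in Python on the concatenated weight vector as follows. As before, this is an illustration under assumptions that are not in the notes: labels are numbered $0, \ldots, k-1$, the epoch count is arbitrary, the mistake test uses `<=` so that zero weights trigger an update, and ties for the toughest competitor go to the smallest label.

```python
def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))


def block_score(w, x, y):
    """w_y . x, reading w_y as the y-th block of the global vector w."""
    n = len(x)
    return dot(w[y * n:(y + 1) * n], x)


def conservative_update(w, x, i, k):
    """One step on example (x, i); modifies and returns w."""
    n = len(x)
    # Toughest competitor: the wrong label with the smallest margin.
    j_star = min((j for j in range(k) if j != i),
                 key=lambda j: block_score(w, x, i) - block_score(w, x, j))
    # Assumption: "<=" rather than "<", so zero weights count as a mistake.
    if block_score(w, x, i) - block_score(w, x, j_star) <= 0:
        for t in range(n):
            w[i * n + t] += x[t]        # w <- w + (x, i)
            w[j_star * n + t] -= x[t]   # w <- w - (x, j*)
    return w


def train(data, k, epochs=20):
    w = [0.0] * (len(data[0][0]) * k)
    for _ in range(epochs):
        for x, i in data:
            conservative_update(w, x, i, k)
    return w


def predict(w, x, k):
    return max(range(k), key=lambda y: block_score(w, x, y))
```

Compared with the all-pairs loop, at most two blocks of $w$ change per example, which is what the single vector update $w \leftarrow w + (x, i) - (x, j^*)$ expresses.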