Material Type: Quiz; Professor: Roth; Class: Machine Learning; Subject: Computer Science; University: University of Illinois - Urbana-Champaign; Term: Fall 2010;


CS446: Pattern Recognition and Machine Learning
Fall 2010

Class Exercise 4
Date: November 4, 2010
Name (NetID):

Instructions:

• Please write your name and NetID at the top of this sheet before you return it to the instructor.

• The goal of the exercises is to help you recall previous lectures and homeworks and think about them. If you want, you may refer to your class notes to answer the questions.

• Answer: The solutions are highlighted.

Multi-class Classification

Consider a multi-class classification problem with k class labels {1, 2, ..., k}. Assume that we are given m examples, each labeled with one of the k class labels, and, for simplicity, that we have m/k examples of each label. Assume that you have a learning algorithm L that can be used to learn Boolean functions (e.g., think of L as the perceptron algorithm). We would like to explore several ways to develop learning algorithms for the multi-class classification problem.

1. Suggest two schemes that use the algorithm L on the given data set to produce a multi-class classifier. In each case, determine:

• How will you train L? That is, what is the input data, what are the positive and negative examples, etc.? Indicate how many "copies" of L you will use.

Answer:

Scheme 1: We will have k classifiers (that is, k weight vectors). The i-th weight vector will assign a confidence score to the i-th class. To train these, we create k binary problems as follows: for the i-th class, the positive examples are all examples with label i, and the negative examples are all examples with any other label.

Scheme 2: We will have k(k-1)/2 weight vectors. Each weight vector w_{i,j} will assign a preference between classes i and j.
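Scheme 1 can be sketched in code. The following is a minimal illustration (Python, with a plain perceptron standing in for L); the function names and the tiny synthetic data set are ours, not part of the exercise.

```python
import numpy as np

def perceptron(X, y, epochs=20):
    """Train a perceptron (our stand-in for L) on one binary problem.
    X: (m, d) array of examples; y: array of +1/-1 labels."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            if yi * (w @ xi) <= 0:  # mistake-driven update
                w += yi * xi
    return w

def train_one_vs_all(X, labels, k):
    """Scheme 1: k binary problems, class i vs. all other labels."""
    return [perceptron(X, np.where(labels == i, 1.0, -1.0)) for i in range(k)]

def predict(weight_vectors, x):
    """Choose the class whose weight vector gives the highest score."""
    return int(np.argmax([w @ x for w in weight_vectors]))

# Toy data: three well-separated classes in the plane.
X = np.array([[10., 0.], [10., 1.], [0., 10.], [1., 10.],
              [-10., -10.], [-9., -10.]])
labels = np.array([0, 0, 1, 1, 2, 2])
weight_vectors = train_one_vs_all(X, labels, k=3)
print([predict(weight_vectors, x) for x in X])
```

On this linearly separable toy set the perceptron converges within the given epochs, so each binary classifier is strictly positive on its own class and negative on the rest, and the argmax recovers the training labels.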
To train these, we create binary problems as follows: to train w_{i,j}, the positive examples are the examples labeled i, and the negative examples are those labeled j.

• How will you use your final hypothesis, given a new example?

Answer:

Scheme 1: Choose the label that achieves the maximum score. That is, for an input x, predict y* = argmax_i (w_i^T x).

Scheme 2: There are several ways to use the k(k-1)/2 classifiers. One approach is to run all of them on the example and have each classifier vote on a class; the label with the highest number of votes wins. Another approach is to conduct a tournament between the labels.

2. In the first scheme proposed above you used k classifiers. We call this scheme one-vs-all.

• Can you invent a similar scheme that makes use of only log2(k) classifiers?

Answer: We need ceil(log2(k)) bits to represent all the labels in binary. Each bit can be either 0 or 1, so we can train one classifier per bit. At prediction time, we concatenate the predictions of the classifiers into a binary string, which is the predicted label.

• Think about one disadvantage of this scheme.

Answer: This scheme is extremely sensitive to noise: if even one of the classifiers is incorrect, the final prediction is wrong.

• How can we deal with this problem?

Answer: By using an error-correcting code scheme (see below and the class slides).

• The error-correcting code scheme uses redundancy to address the problem. For simplicity, assume k = 8 class labels. Instead of using 3 classifiers, use 5.

– How many elements are there in the output space?

Answer: 2^5 = 32.

– How will you use the 5 classifiers to distinguish the k = 8 labels?

Answer: Since we need to represent 8 = 2^3 labels using 5 bits, we can use the remaining two bits to design an error-correcting code for each label.
For example, consider the following assignment:

Label   Code
0       0 0 0 0 0
1       0 0 1 0 1
2       0 1 1 1 0
3       0 1 0 1 1
4       1 1 0 0 0
5       1 0 0 0 1
6       1 0 1 1 0
7       1 1 1 1 1

Each code is at least two bits away from every other code, so a single incorrect classifier can be detected: the resulting 5-bit string is no longer a valid codeword. (To guarantee that a single error can also be corrected by decoding to the nearest codeword, the codes would need to be at least three bits apart.)

– What problems do you see with this scheme?

Answer: The main problem with this scheme is the meaning of the codes. For example, according to the above encoding, the classifier for the least significant bit must learn to separate labels 0, 2, 4, 6 from labels 1, 3, 5, 7. Why should this be separable?
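The encoding above can be sketched as follows (Python; decoding to the nearest codeword in Hamming distance is the standard way to use an error-correcting output code, and the helper names here are ours):

```python
from itertools import combinations

# The 5-bit codes from the table above, one per label.
CODES = {
    0: (0, 0, 0, 0, 0),
    1: (0, 0, 1, 0, 1),
    2: (0, 1, 1, 1, 0),
    3: (0, 1, 0, 1, 1),
    4: (1, 1, 0, 0, 0),
    5: (1, 0, 0, 0, 1),
    6: (1, 0, 1, 1, 0),
    7: (1, 1, 1, 1, 1),
}

def hamming(a, b):
    """Number of bit positions where two codes differ."""
    return sum(x != y for x, y in zip(a, b))

def decode(bits):
    """Map the 5 classifiers' bit predictions to the label whose
    codeword is nearest in Hamming distance."""
    return min(CODES, key=lambda label: hamming(CODES[label], bits))

# Minimum pairwise distance of this code: one flipped bit is always
# detectable (the string is no valid codeword), but with distance 2
# it is not always uniquely correctable.
min_dist = min(hamming(a, b) for a, b in combinations(CODES.values(), 2))
print(min_dist)  # -> 2
```

Any exact codeword decodes to its own label; a string one bit away from two different codewords illustrates why distance 2 detects but does not reliably correct a single classifier error.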