






Instructions for a Matlab exercise on learning maximum margin linear and kernel discriminants. Students are required to write functions to learn discriminant functions using different generative models and evaluate their performance. The exercise covers both L2 margin and soft margin classifiers, as well as kernel classifiers with polynomial kernels.







Spring 2009 Department of Computer Science and Engineering Wright State University
The first three due: 23:59:59, Wednesday, May 13
The last two due: 23:59:59, Monday, June 01
Instructor: Shaojun Wang, Joshi 387, 775-5140, [email protected]
Note: These questions require you to write small Matlab programs which are to be submitted by email. When finished, please send a single tar file containing all of your .m files and all plots, tables, and explanations to me, at [email protected] with a subject heading “CS 714 A2 solutions”.
In this exercise you will write simple Matlab functions to learn maximum margin linear discriminants and test them on simulated data. You will need to be familiar with quadprog.
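We will consider three ways of generating data X, y. The training data will have the form

$$
X = \begin{bmatrix} x_{1,1} & \cdots & x_{1,n} \\ \vdots & & \vdots \\ x_{t,1} & \cdots & x_{t,n} \end{bmatrix},
\qquad
y = \begin{bmatrix} y_1 \\ \vdots \\ y_t \end{bmatrix}
$$

for $x_{i,j} \in \mathbb{R}$ and $y_i \in \{-1, 1\}$.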
Data generation:

    n = 2                  % dimension
    t = 10                 % training size
    u = ones(n,1)          % target weights (models 1 & 2)
    v = 0.5*n              % target offset (models 1 & 2)
    p_pos = 0.5            % prob of positive example
    mu_pos = ones(n,1)     % mean loc for pos (model 3)
    mu_neg = zeros(n,1)    % mean loc for neg (model 3)

Generative model 1: target linear discriminant

    X = rand(t,n)
    y = sign( X * u - v )

Generative model 2: target quadratic discriminant

    X = rand(t,n)
    y = sign( X.^2 * u - v )

Generative model 3: noisy linear discriminant (Naive Bayes, Gaussian)

    X = randn(t,n)
    y = 2 * (rand(t,1) < p_pos) - 1
    pos = find(y > 0)
    neg = find(y < 0)
    X(pos,1) = X(pos,1) + mu_pos(1); X(pos,2) = X(pos,2) + mu_pos(2)
    X(neg,1) = X(neg,1) + mu_neg(1); X(neg,2) = X(neg,2) + mu_neg(2)
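The model 3 listing shifts columns 1 and 2 explicitly, which only covers n = 2. One possible way to package all three models for arbitrary n is sketched below. The function name `gendata` and the convention of mapping `sign(...) == 0` to +1 are assumptions made for illustration, not part of the assignment.

```matlab
function [X, y] = gendata(model, t, n)
% GENDATA  Hypothetical helper: draw t examples in n dimensions from
% generative model 1, 2, or 3 using the constants given in the handout.
u      = ones(n,1);    % target weights (models 1 & 2)
v      = 0.5*n;        % target offset  (models 1 & 2)
p_pos  = 0.5;          % probability of a positive example (model 3)
mu_pos = ones(n,1);    % mean location for positives (model 3)
mu_neg = zeros(n,1);   % mean location for negatives (model 3)
switch model
    case 1   % target linear discriminant
        X = rand(t,n);
        y = sign(X*u - v);
    case 2   % target quadratic discriminant
        X = rand(t,n);
        y = sign(X.^2*u - v);
    case 3   % noisy linear discriminant (Gaussian class conditionals)
        y = 2*(rand(t,1) < p_pos) - 1;
        X = randn(t,n);
        pos = (y > 0);
        % shift every column at once instead of columns 1 and 2 only
        X(pos,:)  = X(pos,:)  + repmat(mu_pos', sum(pos), 1);
        X(~pos,:) = X(~pos,:) + repmat(mu_neg', sum(~pos), 1);
    otherwise
        error('gendata:model', 'model must be 1, 2, or 3');
end
y(y == 0) = 1;   % assumption: a point exactly on the boundary is labeled +1
end
```

A call such as `[X,y] = gendata(3, 10, 2)` would then reproduce the setting in the listing above.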
(a) Write a Matlab function [w,b] = maxL2marg(X,y) which takes a t × n matrix X and t × 1 vector of target labels y and returns: an n × 1 vector of weights w and a scalar offset b, corresponding to the maximum L2 margin linear discriminant classifier ŷ = sign(w · x − b). Your function must be able to handle arbitrary n and t.
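As a rough sketch of how part (a) can be posed for quadprog, the block below stacks the unknowns as z = [w; b] and minimizes (1/2)‖w‖² subject to y_i (w · x_i − b) ≥ 1. This formulation is an assumption about the intended approach, not the instructor's reference solution, and it presumes the data are linearly separable (otherwise quadprog reports an infeasible problem).

```matlab
function [w, b] = maxL2marg(X, y)
% Sketch of the primal hard-margin QP:
%   minimize (1/2)*w'*w   subject to   y_i*(x_i*w - b) >= 1 for all i.
% The optimization variable is z = [w; b].
[t, n] = size(X);
H = blkdiag(eye(n), 0);      % quadratic cost on w only, none on b
f = zeros(n+1, 1);           % no linear term
% y_i*(x_i*w - b) >= 1   <=>   [-y_i*x_i, y_i] * z <= -1
A    = [-diag(y)*X, y];
rhs  = -ones(t, 1);
z = quadprog(H, f, A, rhs);
w = z(1:n);
b = z(n+1);
end
```

Writing the constraint with `diag(y)*X` rather than `y.*X` avoids relying on implicit expansion, which older Matlab releases do not support.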
(e) For each generative model: Repeat (d) parts A, C and D 100 times and accumulate the sum of mean misclassification errors for each classifier in two matrices: one for the training errors and one for the testing errors. Report the averages of each kind of mean error for each classifier in two tables (one for training error and the other for testing error).
In this exercise you will write simple Matlab functions to learn maximum margin linear discriminants once again. However, this time you will implement the “dual” form of the algorithms. Here we will use the same generative models as in Question 1. As before, you will need to be familiar with quadprog.
(a) Write a Matlab function [lambda,b] = dualL2marg(X,y) which takes a t × n matrix X and t × 1 vector of target labels y and returns: a 1 × t vector of Lagrange multipliers lambda and a scalar offset b, corresponding to the maximum L2 margin linear discriminant classifier

$$
\hat{y} = \operatorname{sign}\Big( \Big( \sum_{i=1}^{t} \lambda_i y_i \, x_i \cdot x \Big) - b \Big).
$$

Note: For dualL2marg all you need to do is compute the vector of Lagrange multipliers λ that maximizes the objective

$$
L(\lambda) = \sum_{i=1}^{t} \lambda_i - \frac{1}{2} \sum_{i=1}^{t} \sum_{k=1}^{t} \lambda_i \lambda_k y_i y_k \, x_i \cdot x_k \tag{1}
$$

subject to the constraints $\sum_{i=1}^{t} \lambda_i y_i = 0$ and $\lambda_i \ge 0$. You can do this using Matlab's quadprog function to recover the vector of Lagrange multipliers λ. To recover the offset value b, just solve for b in the equation

$$
\lambda_k \Big( y_k \Big( \Big( \sum_{i=1}^{t} \lambda_i y_i \, x_i \cdot x_k \Big) - b \Big) - 1 \Big) = 0
$$

corresponding to the largest Lagrange multiplier $\lambda_k$. Your function must be able to handle arbitrary n and t.
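One way the dual might be handed to quadprog is sketched below. It assumes the quadratic term of objective (1) carries the conventional factor of 1/2, which matches quadprog's own form (1/2)λ'Hλ + f'λ, and it turns the maximization into a minimization by negating the linear term. Treat it as an illustration under those assumptions; on non-separable data the hard-margin dual is unbounded and the solver will not return a useful answer.

```matlab
function [lambda, b] = dualL2marg(X, y)
% Sketch of the hard-margin dual solved with quadprog.
t = size(X, 1);
G = X*X';                    % Gram matrix of inner products x_i . x_k
H = (y*y') .* G;             % H(i,k) = y_i*y_k*(x_i . x_k)
H = (H + H')/2;              % remove round-off asymmetry
f = -ones(t, 1);             % minimizing -sum(lambda) + (1/2)*l'*H*l
Aeq = y';  beq = 0;          % sum_i lambda_i*y_i = 0
lb  = zeros(t, 1);           % lambda_i >= 0
lam = quadprog(H, f, [], [], Aeq, beq, lb, []);
[~, k] = max(lam);           % largest multiplier marks a support vector
% y_k*(sum_i lam_i*y_i*(x_i . x_k) - b) = 1 and y_k = +/-1 give:
b = G(k,:)*(lam.*y) - y(k);
lambda = lam';               % 1-by-t, as the question requests
end
```

The offset formula uses the fact that 1/y_k = y_k for labels in {−1, +1}.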
(b) Write a Matlab function [yhat] = dualclassify(Xtest,lambda,b,X,y) which takes a te × n matrix Xtest, a 1 × t vector lambda, a scalar b, a t × n matrix X, and a t × 1 vector y, and returns a te × 1 vector of classifications yhat on the test patterns. Your function must be able to handle arbitrary n, t, and te, and must not explicitly compute a weight vector w (instead you must use the Lagrange multiplier vector λ, as shown above).
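A minimal sketch of the classification rule in part (b) follows. It evaluates the sum over training points through the te × t matrix of inner products `Xtest*X'`, so no weight vector is ever formed; the intermediate name `alpha` is introduced here only for illustration.

```matlab
function yhat = dualclassify(Xtest, lambda, b, X, y)
% Classify test rows using only inner products with the training rows.
alpha = lambda(:) .* y(:);               % t-by-1 vector of lambda_i*y_i
yhat  = sign((Xtest*X')*alpha - b);      % te-by-1 vector of predictions
end
```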
(c) Write a Matlab function [lambda,b] = dualsoftL2(X,y,c) which takes an additional scalar argument c and returns lambda and b corresponding to the maximum “soft” margin linear discriminant classifier. Note: For dualsoftL2 all you have to do is compute the vector of Lagrange multipliers λ that maximizes the same objective as above (1) subject to the same set of constraints, except for the slight modification that $0 \le \lambda_i \le c$. This recovers the vector of Lagrange multipliers lambda. (Note that the slack variables $s_i$ actually disappear in the dual formulation.) To recover the offset value b, just use the same procedure as in Part (a) (using any $\lambda_k$ such that $0 < \lambda_k < c$). Your function must be able to handle arbitrary n and t.
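In practice a numerical solver returns multipliers such as 1e-9 or c − 1e-9 rather than exact bounds, so the test 0 < λ_k < c usually needs a tolerance. The helper below is a hypothetical illustration of one way to pick such an index; the name `marginindex` and the tolerance of 1e-6·c are assumptions, not part of the assignment.

```matlab
function k = marginindex(lambda, c)
% Hypothetical helper: index of a multiplier strictly inside (0, c),
% preferring the one closest to the midpoint c/2.
tol  = 1e-6 * c;                                 % assumed tolerance
free = find(lambda > tol & lambda < c - tol);
if isempty(free)
    error('marginindex:none', 'no multiplier strictly between 0 and c');
end
[~, j] = min(abs(lambda(free) - c/2));
k = free(j);
end
```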
(d) For each of the generative models 1, 2 and 3:
In this exercise you will write simple Matlab functions to learn maximum margin kernel classifiers. You will need to be familiar with feval. Once you have written these programs, you will then apply them to the real world classification problem of recognizing images of handwritten digits.
(a) Write a Matlab function [K] = polykernel(X1,X2,d) which takes a t1 × n matrix X1, a t2 × n matrix X2, and a degree parameter d, and returns a t1 × t2 matrix K of kernel values for the polynomial kernel. $K_{ij}$ is the polynomial kernel value obtained by comparing row vector $x_i^1$ from X1 with row vector $x_j^2$ from X2; that is, $K_{ij} = (x_i^1 \cdot x_j^2 + 1)^d$. Your function must be able to handle arbitrary t1, t2, n and d.
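Since every pairwise inner product between rows of the two matrices is an entry of `X1*X2'`, the kernel matrix can be sketched in a single vectorized line, for example:

```matlab
function K = polykernel(X1, X2, d)
% K(i,j) = (X1(i,:) . X2(j,:) + 1)^d for every pair of rows.
K = (X1*X2' + 1).^d;     % element-wise power, not a matrix power
end
```

For instance, comparing the rows [1 0] and [0 1] against the single row [1 1] with d = 2 gives (1 + 1)^2 = 4 in both entries.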
(b) Write a Matlab function [lambda,b] = kernelL2marg(X,y,c,kernfun,par) which takes as input a t × n matrix of observations X, a t × 1 vector of target labels y, a slack parameter c, the name of a kernel function kernfun, and a parameter value for the kernel par. The outputs are a t × 1 vector of Lagrange multipliers lambda and a scalar offset b corresponding to the maximum margin classifier in the feature space. Note: This function is the same as dualsoftL2 (Question 2(c)) except that instead of taking simple inner products between row vectors, $x_i \cdot x_j$, you use the value calculated by the kernfun, $k(x_i, x_j)$. An example call would be [lambda,b] = kernelL2marg(X,y,100,'polykernel',2) using the polykernel function written in Part (a) above. Your function must be able to handle arbitrary n, t, te and d.
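The sketch below shows how feval can stand in for the inner products: `feval(kernfun, X, X, par)` calls the kernel by name (or by function handle) to build the Gram matrix, and the rest follows the soft-margin dual with box constraints 0 ≤ λ_i ≤ c. It assumes the quadratic term carries the usual factor of 1/2, uses an assumed tolerance of 1e-6·c to decide which multipliers lie strictly inside (0, c), and will fail at `free(1)` if no such multiplier exists.

```matlab
function [lambda, b] = kernelL2marg(X, y, c, kernfun, par)
% Sketch of the soft-margin dual with a kernel supplied by name or handle.
t = size(X, 1);
K = feval(kernfun, X, X, par);    % t-by-t Gram matrix k(x_i, x_k)
H = (y*y') .* K;
H = (H + H')/2;                   % remove round-off asymmetry
f = -ones(t, 1);
lambda = quadprog(H, f, [], [], y', 0, zeros(t,1), c*ones(t,1));
tol  = 1e-6 * c;                  % assumed tolerance
free = find(lambda > tol & lambda < c - tol);
k = free(1);                      % any multiplier with 0 < lambda_k < c
b = K(k,:)*(lambda.*y) - y(k);    % same offset recovery as in the linear dual
end
```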
(c) Write a Matlab function [yhat] = kernelclassify(Xtest,lambda,b,X,y,kernfun,par) which takes as input a te × n matrix of test observations Xtest, a t × 1 vector of Lagrange multipliers lambda, a scalar offset b, a t × n matrix of training observations X, a t × 1 vector of training labels y, the name of a kernel function kernfun, and a parameter for the kernel function par. The output is a te × 1 vector of classifications yhat on the test patterns. Your function must be able to handle arbitrary n, t, te and d.
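A possible shape for the kernel classifier is sketched below: the te × t matrix of kernel values between test and training rows replaces `Xtest*X'` from the linear dual classifier. This is an illustration only, and it accepts either a function name or a function handle for `kernfun`, since feval handles both.

```matlab
function yhat = kernelclassify(Xtest, lambda, b, X, y, kernfun, par)
% Classify test rows using kernel values against the training rows.
K    = feval(kernfun, Xtest, X, par);    % te-by-t matrix k(xtest_i, x_j)
yhat = sign(K*(lambda(:).*y(:)) - b);    % te-by-1 vector of predictions
end
```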
(d) On the course webpage, download the file data2.mat. Then type "load data2.mat" in Matlab. This will load the training data into a matrix X and a vector y and the test data into a matrix Xtest and a vector ytest. Each row of the matrices corresponds to a 256 dimensional vector representing a 16 × 16 grayscale image of a handwritten digit. The images are of handwritten '2's and '3's. The corresponding entry in the associated y-vector gives a label indicating which digit the image represents, where −1 corresponds to '2' and +1 corresponds to '3'. Note that you can easily view the training images in Matlab by first typing "colormap gray" to set up the colormap, and then viewing image i in matrix X (or Xtest) by typing "imagesc(reshape(X(i,:),16,16)')".
The goal of this question is to learn a function which can accurately distinguish images of ’2’s from images of ’3’s. Here, functions will be learned on the training data and then tested on the separate test data.
Use the following parameters to learn different classification functions:
    [ll bl] = kernelL2marg(X,y,100,'polykernel',1)
    [lq bq] = kernelL2marg(X,y,100,'polykernel',2)
    [lc bc] = kernelL2marg(X,y,100,'polykernel',3)
Report the number of misclassification errors that each of these three functions make on both the training and the test data (in a 3 × 2 table).
Assume boolean attributes $X_j \in \{0, 1\}$, $Y \in \{0, 1\}$.
(a) Specify the parameters of the CAN model in a manner similar to the NB model above. How many free parameters are there (as a function of n)? Also, given training data

$$
\begin{bmatrix} x_{11} & \cdots & x_{1n} & y_1 \\ \vdots & & \vdots & \vdots \\ x_{t1} & \cdots & x_{tn} & y_t \end{bmatrix}
$$

what are the maximum likelihood estimates for the CAN parameters?
(b) Derive a simple form for the minimum-expected-error classification rule for this model in a similar manner to the NB model above. What form does this decision rule have? (Hint: it is not a linear discriminant, but it is close.)
For the remainder of this question we will investigate the consequences of learning probability models when incorrect assumptions are made about the structure of the distribution. The lesson is that even principled learning under bad modeling assumptions can lead to suboptimal results.
Assume that data is generated according to a CAN model where
$$
P(Y = 1) = \tfrac{2}{3}, \qquad P(X_1 = 1 \mid Y = 1) = P(X_1 = 1 \mid Y = 0) = \tfrac{1}{2}, \qquad X_j = X_{j-1} \oplus Y \ \text{ for } j = 2, \dots, n, \ n \ge 2
$$

where ⊕ denotes “exclusive or”. (That is, $X_j$ is a deterministic function of $X_{j-1}$ and $Y$, and therefore $P(X_j = x_{j-1} \oplus y \mid X_{j-1} = x_{j-1}, Y = y) = 1$.) Assume also that we are (incorrectly) attempting to learn from this data source by using a Naive Bayes model.
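To make the data source concrete, the sketch below samples from the chain just described, reading the stated probabilities as P(Y = 1) = 2/3 and P(X_1 = 1 | Y) = 1/2. The helper name `samplecan` is hypothetical and is introduced only for illustration; it can be used to check answers to the later parts empirically.

```matlab
function [X, y] = samplecan(t, n)
% Hypothetical helper: draw t examples from the CAN source with n >= 2.
y = double(rand(t,1) < 2/3);             % P(Y = 1) = 2/3
X = zeros(t, n);
X(:,1) = double(rand(t,1) < 1/2);        % X_1 is a fair coin, independent of Y
for j = 2:n
    X(:,j) = double(xor(X(:,j-1), y));   % X_j = X_{j-1} XOR Y, deterministic
end
end
```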
(c) What are the parameters of the optimal NB model for this CAN data source? (Note: we haven’t formally discussed optimal approximations for probability distributions in this course, but for this question just assume it means computing the corresponding conditional probabilities under the true model.) What is the minimum-expected-error classification rule for the optimal NB model?
(d) What is the misclassification probability of this NB classifier when test examples are drawn from the CAN model? What is the misclassification probability of the optimal CAN classifier? (Note: just compute the probability—you do not have to show a derivation of the optimal CAN classifier.)
(e) For this part, assume that n = 2. (That is, we are only considering the variables $Y$, $X_1$ and $X_2$.) Write down the coefficients $w_1$, $w_2$ and $b$ of a linear discriminant that achieves better misclassification probability than the solution of Part (c). What is the misclassification probability of your alternative linear discriminant?
In this question you will derive an EM training algorithm for the Naive Bayes probability model introduced in Question 4. The outcome will be a learning method that can produce a predictor $h : X^n \to Y$ based on exploiting both labeled $\langle x_i, y_i \rangle$ and unlabeled $\langle x_i, - \rangle$ training examples. Recall that the Naive Bayes model assumes there is a single random variable $Y \in \{0, 1\}$ whose value determines the conditional distribution over the remaining variables $X_1, \dots, X_n$ ($X_j \in \{0, 1\}$), where, in particular, the conditional distributions of $X_1, \dots, X_n$ are independent given $Y = y$. This model is defined by 2n + 1 parameters $\theta = \langle \theta, \{\theta_{j0}\}_{j=1}^{n}, \{\theta_{j1}\}_{j=1}^{n} \rangle$. Consider a scenario where all of the variables $X_j$ are always observed but the $Y$ labels are sometimes missing. That is, consider training data of the form

$$
\begin{bmatrix} x_{11} & \cdots & x_{1n} & z_1 \\ \vdots & & \vdots & \vdots \\ x_{t1} & \cdots & x_{tn} & z_t \end{bmatrix}
$$
where the $z_i$ labels are chosen from the set $\{0, 1, -1\}$ such that $z_i = -1$ indicates that the original $y_i$ value was unobserved. Recall that EM works as follows:

$$
Q_i^{(k)}(y) =
\begin{cases}
0 & \text{if } z_i \in \{0, 1\} \text{ and } y \ne z_i \\
1 & \text{if } z_i \in \{0, 1\} \text{ and } y = z_i \\
P(Y = y \mid X_1 = x_{i1}, \dots, X_n = x_{in}, \theta^{(k)}) & \text{if } z_i = -1
\end{cases}
\tag{2}
$$

$$
\theta^{(k+1)} = \arg\max_{\theta} \sum_{i=1}^{t} \sum_{y=0}^{1} Q_i^{(k)}(y) \log P(X_1 = x_{i1}, \dots, X_n = x_{in}, Y = y \mid \theta)
\tag{3}
$$
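As an illustration of equation (2) only, the sketch below computes $Q_i^{(k)}(1)$ for every example under an assumed parameter convention: `theta` = P(Y = 1), `theta0(j)` = P(X_j = 1 | Y = 0) and `theta1(j)` = P(X_j = 1 | Y = 1). That convention, the function name `nbestep`, and the requirement that all parameters lie strictly between 0 and 1 are assumptions made here, since the NB model definition is not reproduced in this handout.

```matlab
function Q1 = nbestep(X, z, theta, theta0, theta1)
% Hypothetical helper: Q1(i) = Q_i(1) from equation (2); Q_i(0) = 1 - Q1(i).
% X is t-by-n with entries in {0,1}; z is t-by-1 with entries in {0,1,-1}.
theta0 = theta0(:);  theta1 = theta1(:);
% log P(x_i, Y = y | theta) for y = 1 and y = 0 under the NB factorization
l1 = log(theta)     + X*log(theta1) + (1 - X)*log(1 - theta1);
l0 = log(1 - theta) + X*log(theta0) + (1 - X)*log(1 - theta0);
Q1 = 1 ./ (1 + exp(l0 - l1));    % posterior P(Y = 1 | x_i, theta)
Q1(z == 0) = 0;                  % observed label 0: all mass on y = 0
Q1(z == 1) = 1;                  % observed label 1: all mass on y = 1
end
```

Working with log-probabilities and the logistic form of the posterior avoids underflow when n is large.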
In this question you will derive a simplified form of the EM update for the Naive Bayes model, which will allow you to implement it efficiently in Question 3 below.