CS229 Problem Set #1

CS 229, Public Course
Problem Set #1: Supervised Learning

1. Newton's method for computing least squares

In this problem, we will prove that if we use Newton's method to solve the least squares optimization problem, then we only need one iteration to converge to $\theta^*$.

(a) Find the Hessian of the cost function $J(\theta) = \frac{1}{2} \sum_{i=1}^m (\theta^T x^{(i)} - y^{(i)})^2$.

(b) Show that the first iteration of Newton's method gives us $\theta^* = (X^T X)^{-1} X^T \vec{y}$, the solution to our least squares problem.

2. Locally-weighted logistic regression

In this problem you will implement a locally-weighted version of logistic regression, where we weight different training examples differently according to the query point. The locally-weighted logistic regression problem is to maximize

$$\ell(\theta) = -\frac{\lambda}{2} \theta^T \theta + \sum_{i=1}^m w^{(i)} \left[ y^{(i)} \log h_\theta(x^{(i)}) + (1 - y^{(i)}) \log(1 - h_\theta(x^{(i)})) \right].$$

The term $-\frac{\lambda}{2} \theta^T \theta$ in $\ell(\theta)$ is known as a regularization term, which will be discussed in a future lecture, but which we include here because it is needed for Newton's method to perform well on this task. For the entirety of this problem you can use the value $\lambda = 0.0001$.

Using this definition, the gradient of $\ell(\theta)$ is given by

$$\nabla_\theta \ell(\theta) = X^T z - \lambda \theta$$

where $z \in \mathbb{R}^m$ is defined by

$$z_i = w^{(i)} (y^{(i)} - h_\theta(x^{(i)}))$$

and the Hessian is given by

$$H = X^T D X - \lambda I$$

where $D \in \mathbb{R}^{m \times m}$ is a diagonal matrix with

$$D_{ii} = -w^{(i)} h_\theta(x^{(i)}) (1 - h_\theta(x^{(i)})).$$

For the sake of this problem you can just use the above formulas, but you should try to derive these results for yourself as well.

Given a query point $x$, we compute the weights

$$w^{(i)} = \exp\left( -\frac{\|x - x^{(i)}\|^2}{2\tau^2} \right).$$

Much like the locally weighted linear regression that was discussed in class, this weighting scheme gives more weight to the "nearby" points when predicting the class of a new example.
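To make the weighting formula concrete, here is a minimal sketch of the weight computation for a single query point. It is written in plain Python rather than the MATLAB used by lwlr.m, and the function name `lwlr_weights` and the toy data are hypothetical, chosen only for illustration.

```python
import math

def lwlr_weights(X_train, x, tau):
    """Return w[i] = exp(-||x - x_i||^2 / (2 * tau^2)) for each training row x_i."""
    weights = []
    for row in X_train:
        # Squared Euclidean distance between the query point and this training point.
        sq_dist = sum((xj - rj) ** 2 for xj, rj in zip(x, row))
        weights.append(math.exp(-sq_dist / (2.0 * tau ** 2)))
    return weights

# Hypothetical toy data: three 2-D training points and a query at the origin.
X_toy = [[0.0, 0.0], [1.0, 0.0], [3.0, 4.0]]
w_toy = lwlr_weights(X_toy, [0.0, 0.0], tau=1.0)
```

A training point that coincides with the query gets weight 1, and the weight decays toward 0 as the distance grows relative to the bandwidth tau.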
(a) Implement the Newton-Raphson algorithm for optimizing $\ell(\theta)$ for a new query point $x$, and use this to predict the class of $x$.

The q2/ directory contains data and code for this problem. You should implement the y = lwlr(X_train, y_train, x, tau) function in the lwlr.m file. This function takes as input the training set (the X_train and y_train matrices, in the form described in the class notes), a new query point x and the weight bandwidth tau. Given this input the function should 1) compute weights $w^{(i)}$ for each training example, using the formula above, 2) maximize $\ell(\theta)$ using Newton's method, and finally 3) output $y = 1\{h_\theta(x) > 0.5\}$ as the prediction.

We provide two additional functions that might help. The [X_train, y_train] = load_data; function will load the matrices from files in the data/ folder. The function plot_lwlr(X_train, y_train, tau, resolution) will plot the resulting classifier (assuming you have properly implemented lwlr.m). This function evaluates the locally weighted logistic regression classifier over a large grid of points and plots the resulting prediction as blue (predicting y = 0) or red (predicting y = 1). Depending on how fast your lwlr function is, creating the plot might take some time, so we recommend debugging your code with resolution = 50; and later increasing it to at least 200 to get a better idea of the decision boundary.

(b) Evaluate the system with a variety of different bandwidth parameters $\tau$. In particular, try $\tau = 0.01, 0.05, 0.1, 0.5, 1.0, 5.0$. How does the classification boundary change when varying this parameter? Can you predict what the decision boundary of ordinary (unweighted) logistic regression would look like?

3. Multivariate least squares

So far in class, we have only considered cases where our target variable $y$ is a scalar value. Suppose that instead of trying to predict a single output, we have a training set with multiple outputs for each example:

$$\{(x^{(i)}, y^{(i)}),\ i = 1, \ldots, m\}, \quad x^{(i)} \in \mathbb{R}^n,\ y^{(i)} \in \mathbb{R}^p.$$

Thus for each training example, $y^{(i)}$ is vector-valued, with $p$ entries. We wish to use a linear model to predict the outputs, as in least squares, by specifying the parameter matrix $\Theta$ in $y = \Theta^T x$, where $\Theta \in \mathbb{R}^{n \times p}$.

(a) The cost function for this case is

$$J(\Theta) = \frac{1}{2} \sum_{i=1}^m \sum_{j=1}^p \left( (\Theta^T x^{(i)})_j - y_j^{(i)} \right)^2.$$

Write $J(\Theta)$ in matrix-vector notation (i.e., without using any summations). [Hint: Start with the $m \times n$ design matrix

$$X = \begin{bmatrix} (x^{(1)})^T \\ (x^{(2)})^T \\ \vdots \\ (x^{(m)})^T \end{bmatrix}$$

CS229 Problem Set #1 Solutions

CS 229, Public Course
Problem Set #1 Solutions: Supervised Learning

1. Newton's method for computing least squares

In this problem, we will prove that if we use Newton's method to solve the least squares optimization problem, then we only need one iteration to converge to $\theta^*$.

(a) Find the Hessian of the cost function $J(\theta) = \frac{1}{2} \sum_{i=1}^m (\theta^T x^{(i)} - y^{(i)})^2$.

Answer: As shown in the class notes

$$\frac{\partial J(\theta)}{\partial \theta_j} = \sum_{i=1}^m (\theta^T x^{(i)} - y^{(i)})\, x_j^{(i)}.$$

So

$$\frac{\partial^2 J(\theta)}{\partial \theta_j \partial \theta_k} = \sum_{i=1}^m \frac{\partial}{\partial \theta_k} (\theta^T x^{(i)} - y^{(i)})\, x_j^{(i)} = \sum_{i=1}^m x_j^{(i)} x_k^{(i)} = (X^T X)_{jk}.$$

Therefore, the Hessian of $J(\theta)$ is $H = X^T X$. This can also be derived by simply applying rules from the lecture notes on Linear Algebra.

(b) Show that the first iteration of Newton's method gives us $\theta^* = (X^T X)^{-1} X^T \vec{y}$, the solution to our least squares problem.

Answer: Given any $\theta^{(0)}$, Newton's method finds $\theta^{(1)}$ according to

$$
\begin{aligned}
\theta^{(1)} &= \theta^{(0)} - H^{-1} \nabla_\theta J(\theta^{(0)}) \\
&= \theta^{(0)} - (X^T X)^{-1} (X^T X \theta^{(0)} - X^T \vec{y}) \\
&= \theta^{(0)} - \theta^{(0)} + (X^T X)^{-1} X^T \vec{y} \\
&= (X^T X)^{-1} X^T \vec{y}.
\end{aligned}
$$

Therefore, no matter what $\theta^{(0)}$ we pick, Newton's method always finds $\theta^*$ after one iteration.

2. Locally-weighted logistic regression

In this problem you will implement a locally-weighted version of logistic regression, where we weight different training examples differently according to the query point. The locally-weighted logistic regression problem is to maximize

$$\ell(\theta) = -\frac{\lambda}{2} \theta^T \theta + \sum_{i=1}^m w^{(i)} \left[ y^{(i)} \log h_\theta(x^{(i)}) + (1 - y^{(i)}) \log(1 - h_\theta(x^{(i)})) \right].$$
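As a sanity check on the objective above, the following sketch evaluates $\ell(\theta)$ directly from its definition, assuming the usual logistic hypothesis $h_\theta(x) = 1/(1 + e^{-\theta^T x})$. It is plain Python rather than MATLAB, and the names `sigmoid`, `lwlr_objective`, and the toy data are hypothetical illustrations, not part of the provided q2/ code.

```python
import math

def sigmoid(t):
    """Logistic function 1 / (1 + exp(-t))."""
    return 1.0 / (1.0 + math.exp(-t))

def lwlr_objective(theta, X, y, w, lam=1e-4):
    """Evaluate -(lam/2) theta^T theta + sum_i w_i [y_i log h_i + (1 - y_i) log(1 - h_i)]."""
    reg = -0.5 * lam * sum(t * t for t in theta)
    total = 0.0
    for x_i, y_i, w_i in zip(X, y, w):
        # h_i is the predicted probability that y_i = 1 under theta.
        h = sigmoid(sum(t * xj for t, xj in zip(theta, x_i)))
        total += w_i * (y_i * math.log(h) + (1 - y_i) * math.log(1.0 - h))
    return reg + total

# Hypothetical toy inputs: three 2-D examples with fixed weights.
X_toy = [[1.0, 2.0], [-1.0, 0.5], [0.0, -3.0]]
y_toy = [1, 0, 1]
w_toy = [1.0, 0.5, 0.25]
value_at_zero = lwlr_objective([0.0, 0.0], X_toy, y_toy, w_toy)
```

At $\theta = 0$ every $h_\theta(x^{(i)})$ equals 0.5, so the objective reduces to $-\log 2$ times the sum of the weights, which is a convenient value to check an implementation against.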
The term $-\frac{\lambda}{2} \theta^T \theta$ in $\ell(\theta)$ is known as a regularization term, which will be discussed in a future lecture, but which we include here because it is needed for Newton's method to perform well on this task. For the entirety of this problem you can use the value $\lambda = 0.0001$.

Using this definition, the gradient of $\ell(\theta)$ is given by

$$\nabla_\theta \ell(\theta) = X^T z - \lambda \theta$$

where $z \in \mathbb{R}^m$ is defined by

$$z_i = w^{(i)} (y^{(i)} - h_\theta(x^{(i)}))$$

and the Hessian is given by

$$H = X^T D X - \lambda I$$

where $D \in \mathbb{R}^{m \times m}$ is a diagonal matrix with

$$D_{ii} = -w^{(i)} h_\theta(x^{(i)}) (1 - h_\theta(x^{(i)})).$$

For the sake of this problem you can just use the above formulas, but you should try to derive these results for yourself as well.

Given a query point $x$, we compute the weights

$$w^{(i)} = \exp\left( -\frac{\|x - x^{(i)}\|^2}{2\tau^2} \right).$$

Much like the locally weighted linear regression that was discussed in class, this weighting scheme gives more weight to the "nearby" points when predicting the class of a new example.

(a) Implement the Newton-Raphson algorithm for optimizing $\ell(\theta)$ for a new query point $x$, and use this to predict the class of $x$.

The q2/ directory contains data and code for this problem. You should implement the y = lwlr(X_train, y_train, x, tau) function in the lwlr.m file. This function takes as input the training set (the X_train and y_train matrices, in the form described in the class notes), a new query point x and the weight bandwidth tau. Given this input the function should 1) compute weights $w^{(i)}$ for each training example, using the formula above, 2) maximize $\ell(\theta)$ using Newton's method, and finally 3) output $y = 1\{h_\theta(x) > 0.5\}$ as the prediction.

We provide two additional functions that might help. The [X_train, y_train] = load_data; function will load the matrices from files in the data/ folder. The function plot_lwlr(X_train, y_train, tau, resolution) will plot the resulting classifier (assuming you have properly implemented lwlr.m).
This function evaluates the locally weighted logistic regression classifier over a large grid of points and plots the resulting prediction as blue (predicting y = 0) or red (predicting y = 1). Depending on how fast your lwlr function is, creating the plot might take some time, so we recommend debugging your code with resolution = 50; and later increasing it to at least 200 to get a better idea of the decision boundary.

Answer: Our implementation of lwlr.m:

    function y = lwlr(X_train, y_train, x, tau)

    m = size(X_train,1);
    n = size(X_train,2);
    theta = zeros(n,1);

    % compute weights w(i) = exp(-||x - x(i)||^2 / (2*tau^2))
    w = exp(-sum((X_train - repmat(x', m, 1)).^2, 2) / (2*tau^2));

    % perform Newton's method until the gradient is (numerically) zero
    g = ones(n,1);
    while (norm(g) > 1e-6)
      h = 1 ./ (1 + exp(-X_train * theta));
      % gradient X'*z - lambda*theta, with lambda = 1e-4
      g = X_train' * (w.*(y_train - h)) - 1e-4*theta;
      % Hessian X'*D*X - lambda*I, with D_ii = -w(i)*h(i)*(1-h(i))
      H = -X_train' * diag(w.*h.*(1-h)) * X_train - 1e-4*eye(n);
      theta = theta - H \ g;
    end

    % return predicted y (h_theta(x) > 0.5 exactly when x'*theta > 0)
    y = double(x'*theta > 0);

(b) Evaluate the system with a variety of different bandwidth parameters $\tau$. In particular, try $\tau = 0.01, 0.05, 0.1, 0.5, 1.0, 5.0$. How does the classification boundary change when varying this parameter? Can you predict what the decision boundary of ordinary (unweighted) logistic regression would look like?

Answer: These are the resulting decision boundaries, for the different values of $\tau$.

[Figure: six decision-boundary plots, one per bandwidth: tau = 0.01, 0.05, 0.1, 0.5, 1.0, 5.]

For smaller $\tau$, the classifier appears to overfit the data set, obtaining zero training error but producing a sporadic-looking decision boundary. As $\tau$ grows, the resulting decision boundary becomes smoother, eventually converging (in the limit as $\tau \to \infty$) to the unweighted logistic regression solution.

3. Multivariate least squares

So far in class, we have only considered cases where our target variable $y$ is a scalar value. Suppose that instead of trying to predict a single output, we have a training set with

4. Naive Bayes

In this problem, we look at maximum likelihood parameter estimation using the naive Bayes assumption. Here, the input features $x_j$, $j = 1, \ldots, n$ to our model are discrete, binary-valued variables, so $x_j \in \{0, 1\}$. We call $x = [x_1\ x_2\ \cdots\ x_n]^T$ the input vector. For each training example, our output target is a single binary value $y \in \{0, 1\}$. Our model is then parameterized by $\phi_{j|y=0} = p(x_j = 1 | y = 0)$, $\phi_{j|y=1} = p(x_j = 1 | y = 1)$, and $\phi_y = p(y = 1)$. We model the joint distribution of $(x, y)$ according to

$$
\begin{aligned}
p(y) &= (\phi_y)^y (1 - \phi_y)^{1-y} \\
p(x | y = 0) &= \prod_{j=1}^n p(x_j | y = 0) = \prod_{j=1}^n (\phi_{j|y=0})^{x_j} (1 - \phi_{j|y=0})^{1 - x_j} \\
p(x | y = 1) &= \prod_{j=1}^n p(x_j | y = 1) = \prod_{j=1}^n (\phi_{j|y=1})^{x_j} (1 - \phi_{j|y=1})^{1 - x_j}
\end{aligned}
$$

(a) Find the joint log-likelihood function $\ell(\varphi) = \log \prod_{i=1}^m p(x^{(i)}, y^{(i)}; \varphi)$ in terms of the model parameters given above. Here, $\varphi$ represents the entire set of parameters $\{\phi_y, \phi_{j|y=0}, \phi_{j|y=1},\ j = 1, \ldots, n\}$.

Answer:

$$
\begin{aligned}
\ell(\varphi) &= \log \prod_{i=1}^m p(x^{(i)}, y^{(i)}; \varphi) \\
&= \log \prod_{i=1}^m p(x^{(i)} | y^{(i)}; \varphi)\, p(y^{(i)}; \varphi) \\
&= \log \prod_{i=1}^m \left( \prod_{j=1}^n p(x_j^{(i)} | y^{(i)}; \varphi) \right) p(y^{(i)}; \varphi) \\
&= \sum_{i=1}^m \left( \log p(y^{(i)}; \varphi) + \sum_{j=1}^n \log p(x_j^{(i)} | y^{(i)}; \varphi) \right) \\
&= \sum_{i=1}^m \left[ y^{(i)} \log \phi_y + (1 - y^{(i)}) \log(1 - \phi_y) + \sum_{j=1}^n \left( x_j^{(i)} \log \phi_{j|y^{(i)}} + (1 - x_j^{(i)}) \log(1 - \phi_{j|y^{(i)}}) \right) \right]
\end{aligned}
$$

(b) Show that the parameters which maximize the likelihood function are the same as those given in the lecture notes; i.e., that

$$
\begin{aligned}
\phi_{j|y=0} &= \frac{\sum_{i=1}^m 1\{x_j^{(i)} = 1 \wedge y^{(i)} = 0\}}{\sum_{i=1}^m 1\{y^{(i)} = 0\}} \\
\phi_{j|y=1} &= \frac{\sum_{i=1}^m 1\{x_j^{(i)} = 1 \wedge y^{(i)} = 1\}}{\sum_{i=1}^m 1\{y^{(i)} = 1\}} \\
\phi_y &= \frac{\sum_{i=1}^m 1\{y^{(i)} = 1\}}{m}.
\end{aligned}
$$

Answer: The only terms in $\ell(\varphi)$ which have non-zero gradient with respect to $\phi_{j|y=0}$ are those which include $\phi_{j|y^{(i)}}$. Therefore,

$$
\begin{aligned}
\nabla_{\phi_{j|y=0}} \ell(\varphi) &= \nabla_{\phi_{j|y=0}} \sum_{i=1}^m \left( x_j^{(i)} \log \phi_{j|y^{(i)}} + (1 - x_j^{(i)}) \log(1 - \phi_{j|y^{(i)}}) \right) \\
&= \nabla_{\phi_{j|y=0}} \sum_{i=1}^m \left( x_j^{(i)} \log(\phi_{j|y=0})\, 1\{y^{(i)} = 0\} + (1 - x_j^{(i)}) \log(1 - \phi_{j|y=0})\, 1\{y^{(i)} = 0\} \right) \\
&= \sum_{i=1}^m \left( x_j^{(i)} \frac{1}{\phi_{j|y=0}} 1\{y^{(i)} = 0\} - (1 - x_j^{(i)}) \frac{1}{1 - \phi_{j|y=0}} 1\{y^{(i)} = 0\} \right).
\end{aligned}
$$
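Setting $\nabla_{\phi_{j|y=0}} \ell(\varphi) = 0$ and multiplying through by $\phi_{j|y=0} (1 - \phi_{j|y=0})$ gives

$$
\begin{aligned}
0 &= \sum_{i=1}^m \left( x_j^{(i)} \frac{1}{\phi_{j|y=0}} 1\{y^{(i)} = 0\} - (1 - x_j^{(i)}) \frac{1}{1 - \phi_{j|y=0}} 1\{y^{(i)} = 0\} \right) \\
&= \sum_{i=1}^m \left( x_j^{(i)} (1 - \phi_{j|y=0})\, 1\{y^{(i)} = 0\} - (1 - x_j^{(i)})\, \phi_{j|y=0}\, 1\{y^{(i)} = 0\} \right) \\
&= \sum_{i=1}^m \left( (x_j^{(i)} - \phi_{j|y=0})\, 1\{y^{(i)} = 0\} \right) \\
&= \sum_{i=1}^m \left( x_j^{(i)} \cdot 1\{y^{(i)} = 0\} \right) - \phi_{j|y=0} \sum_{i=1}^m 1\{y^{(i)} = 0\} \\
&= \sum_{i=1}^m 1\{x_j^{(i)} = 1 \wedge y^{(i)} = 0\} - \phi_{j|y=0} \sum_{i=1}^m 1\{y^{(i)} = 0\}.
\end{aligned}
$$

We then arrive at our desired result

$$\phi_{j|y=0} = \frac{\sum_{i=1}^m 1\{x_j^{(i)} = 1 \wedge y^{(i)} = 0\}}{\sum_{i=1}^m 1\{y^{(i)} = 0\}}.$$

The solution for $\phi_{j|y=1}$ proceeds in the identical manner.

To solve for $\phi_y$,

$$
\begin{aligned}
\nabla_{\phi_y} \ell(\varphi) &= \nabla_{\phi_y} \sum_{i=1}^m \left( y^{(i)} \log \phi_y + (1 - y^{(i)}) \log(1 - \phi_y) \right) \\
&= \sum_{i=1}^m \left( y^{(i)} \frac{1}{\phi_y} - (1 - y^{(i)}) \frac{1}{1 - \phi_y} \right).
\end{aligned}
$$

Then setting $\nabla_{\phi_y} \ell(\varphi) = 0$ and multiplying through by $\phi_y (1 - \phi_y)$ gives us

$$
\begin{aligned}
0 &= \sum_{i=1}^m \left( y^{(i)} \frac{1}{\phi_y} - (1 - y^{(i)}) \frac{1}{1 - \phi_y} \right) \\
&= \sum_{i=1}^m \left( y^{(i)} (1 - \phi_y) - (1 - y^{(i)})\, \phi_y \right) \\
&= \sum_{i=1}^m y^{(i)} - \sum_{i=1}^m \phi_y.
\end{aligned}
$$

Since $\sum_{i=1}^m \phi_y = m \phi_y$ and $y^{(i)} = 1\{y^{(i)} = 1\}$, we therefore have

$$\phi_y = \frac{\sum_{i=1}^m 1\{y^{(i)} = 1\}}{m}.$$

(c) Consider making a prediction on some new data point $x$ using the most likely class estimate generated by the naive Bayes algorithm. Show that the hypothesis returned by naive Bayes is a linear classifier, i.e., if $p(y = 0 | x)$ and $p(y = 1 | x)$ are the class probabilities returned by naive Bayes, show that there exists some $\theta \in \mathbb{R}^{n+1}$ such that

$$p(y = 1 | x) \ge p(y = 0 | x) \quad \text{if and only if} \quad \theta^T \begin{bmatrix} 1 \\ x \end{bmatrix} \ge 0.$$

(Assume $\theta_0$ is an intercept term.)

Answer:

$$
\begin{aligned}
& p(y = 1 | x) \ge p(y = 0 | x) \\
\iff{} & \frac{p(y = 1 | x)}{p(y = 0 | x)} \ge 1 \\
\iff{} & \frac{\left( \prod_{j=1}^n p(x_j | y = 1) \right) p(y = 1)}{\left( \prod_{j=1}^n p(x_j | y = 0) \right) p(y = 0)} \ge 1 \\
\iff{} & \frac{\left( \prod_{j=1}^n (\phi_{j|y=1})^{x_j} (1 - \phi_{j|y=1})^{1 - x_j} \right) \phi_y}{\left( \prod_{j=1}^n (\phi_{j|y=0})^{x_j} (1 - \phi_{j|y=0})^{1 - x_j} \right) (1 - \phi_y)} \ge 1 \\
\iff{} & \sum_{j=1}^n \left( x_j \log\left( \frac{\phi_{j|y=1}}{\phi_{j|y=0}} \right) + (1 - x_j) \log\left( \frac{1 - \phi_{j|y=1}}{1 - \phi_{j|y=0}} \right) \right) + \log\left( \frac{\phi_y}{1 - \phi_y} \right) \ge 0 \\
\iff{} & \sum_{j=1}^n x_j \log\left( \frac{(\phi_{j|y=1})(1 - \phi_{j|y=0})}{(\phi_{j|y=0})(1 - \phi_{j|y=1})} \right) + \sum_{j=1}^n \log\left( \frac{1 - \phi_{j|y=1}}{1 - \phi_{j|y=0}} \right) + \log\left( \frac{\phi_y}{1 - \phi_y} \right) \ge 0 \\
\iff{} & \theta^T \begin{bmatrix} 1 \\ x \end{bmatrix} \ge 0,
\end{aligned}
$$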