CS229 Problem Set #3

CS 229, Public Course
Problem Set #3: Learning Theory and Unsupervised Learning

1. Uniform convergence and Model Selection

In this problem, we will prove a bound on the error of a simple model selection procedure.

Let there be a binary classification problem with labels y ∈ {0, 1}, and let H1 ⊆ H2 ⊆ . . . ⊆ Hk be k different finite hypothesis classes (|Hi| < ∞). Given a dataset S of m i.i.d. training examples, we will divide it into a training set Strain consisting of the first (1 − β)m examples and a hold-out cross-validation set Scv consisting of the remaining βm examples, where β ∈ (0, 1).

Let ĥi = arg min_{h∈Hi} ε̂Strain(h) be the hypothesis in Hi with the lowest training error (on Strain). Thus, ĥi is the hypothesis that would be returned by training (with empirical risk minimization) using hypothesis class Hi and dataset Strain. Also let h⋆i = arg min_{h∈Hi} ε(h) be the hypothesis in Hi with the lowest generalization error.

Suppose that our algorithm first finds all the ĥi's using empirical risk minimization, and then uses the hold-out cross-validation set to select the hypothesis from the set {ĥ1, . . . , ĥk} with minimum cross-validation error. That is, the algorithm will output

    ĥ = arg min_{h ∈ {ĥ1,...,ĥk}} ε̂Scv(h).

For this question you will prove the following bound. Let any δ > 0 be fixed. Then with probability at least 1 − δ, we have that

    ε(ĥ) ≤ min_{i=1,...,k} ( ε(h⋆i) + √( 2/((1 − β)m) · log(4|Hi|/δ) ) ) + √( 2/(βm) · log(4k/δ) ).

(a) Prove that with probability at least 1 − δ/2, for all ĥi,

    |ε(ĥi) − ε̂Scv(ĥi)| ≤ √( 1/(2βm) · log(4k/δ) ).

(b) Use part (a) to show that with probability at least 1 − δ/2,

    ε(ĥ) ≤ min_{i=1,...,k} ε(ĥi) + √( 2/(βm) · log(4k/δ) ).

(c) Let j = arg min_i ε(ĥi). We know from class that for Hj, with probability at least 1 − δ/2,

    |ε(ĥj) − ε(h⋆j)| ≤ √( 2/((1 − β)m) · log(4|Hj|/δ) ).
Use this to prove the final bound given at the beginning of this problem.

2. VC Dimension

Let the input domain of a learning problem be X = R. Give the VC dimension for each of the following classes of hypotheses. In each case, if you claim that the VC dimension is d, then you need to show that the hypothesis class can shatter d points, and explain why there are no d + 1 points it can shatter.

• h(x) = 1{a < x}, with parameter a ∈ R.
• h(x) = 1{a < x < b}, with parameters a, b ∈ R.
• h(x) = 1{a sin x > 0}, with parameter a ∈ R.
• h(x) = 1{sin(x + a) > 0}, with parameter a ∈ R.

3. ℓ1 regularization for least squares

In the previous problem set, we looked at the least squares problem in which the objective function is augmented with an additional regularization term λ‖θ‖₂². In this problem we'll consider a similar regularized objective, but this time with a penalty on the ℓ1 norm of the parameters, λ‖θ‖₁, where ‖θ‖₁ is defined as Σi |θi|. That is, we want to minimize the objective

    J(θ) = (1/2) Σ_{i=1}^{m} (θᵀx(i) − y(i))² + λ Σ_{i=1}^{n} |θi|.

There has been a great deal of recent interest in ℓ1 regularization, which, as we will see, has the benefit of producing sparse solutions (i.e., many components of the resulting θ are exactly zero). The ℓ1-regularized least squares problem is more difficult than the unregularized or ℓ2-regularized cases because the ℓ1 term is not differentiable. However, many efficient algorithms have been developed for this problem that work very well in practice. One very straightforward approach, which we have already seen in class, is the coordinate descent method. In this problem you'll derive and implement a coordinate descent algorithm for ℓ1-regularized least squares, and apply it to test data.

(a) Here we'll derive the coordinate descent update for a given θi. Given the X and ~y matrices, as defined in the class notes, as well as a parameter vector θ, how can we adjust θi so as to minimize the optimization objective?
To answer this question, we'll rewrite the optimization objective above as

    J(θ) = (1/2)‖Xθ − ~y‖₂² + λ‖θ‖₁ = (1/2)‖Xθ̄ + Xiθi − ~y‖₂² + λ‖θ̄‖₁ + λ|θi|,

where Xi ∈ R^m denotes the ith column of X, and θ̄ is equal to θ except with θ̄i = 0; all we have done in rewriting the above expression is make the θi term explicit in the objective. However, this still contains the |θi| term, which is non-differentiable and therefore difficult to optimize. To get around this, we observe that the sign of θi must be either non-negative or non-positive. But if we knew the sign of θi, then |θi| would become just a linear term. That is, we can rewrite the objective as

    J(θ) = (1/2)‖Xθ̄ + Xiθi − ~y‖₂² + λ‖θ̄‖₁ + λ·si·θi,

where si denotes the sign of θi, si ∈ {−1, 1}. In order to update θi, we can just compute the optimal θi for both possible values of si (making sure that we restrict the optimal θi to match the assumed sign).

CS229 Problem Set #3 Solutions

CS 229, Public Course
Problem Set #3 Solutions: Learning Theory and Unsupervised Learning

1. Uniform convergence and Model Selection

In this problem, we will prove a bound on the error of a simple model selection procedure.

Let there be a binary classification problem with labels y ∈ {0, 1}, and let H1 ⊆ H2 ⊆ . . . ⊆ Hk be k different finite hypothesis classes (|Hi| < ∞). Given a dataset S of m i.i.d. training examples, we will divide it into a training set Strain consisting of the first (1 − β)m examples and a hold-out cross-validation set Scv consisting of the remaining βm examples, where β ∈ (0, 1).

Let ĥi = arg min_{h∈Hi} ε̂Strain(h) be the hypothesis in Hi with the lowest training error (on Strain). Thus, ĥi is the hypothesis that would be returned by training (with empirical risk minimization) using hypothesis class Hi and dataset Strain. Also let h⋆i = arg min_{h∈Hi} ε(h) be the hypothesis in Hi with the lowest generalization error.
Suppose that our algorithm first finds all the ĥi's using empirical risk minimization, and then uses the hold-out cross-validation set to select the hypothesis from the set {ĥ1, . . . , ĥk} with minimum cross-validation error. That is, the algorithm will output

    ĥ = arg min_{h ∈ {ĥ1,...,ĥk}} ε̂Scv(h).

For this question you will prove the following bound. Let any δ > 0 be fixed. Then with probability at least 1 − δ, we have that

    ε(ĥ) ≤ min_{i=1,...,k} ( ε(h⋆i) + √( 2/((1 − β)m) · log(4|Hi|/δ) ) ) + √( 2/(βm) · log(4k/δ) ).

(a) Prove that with probability at least 1 − δ/2, for all ĥi,

    |ε(ĥi) − ε̂Scv(ĥi)| ≤ √( 1/(2βm) · log(4k/δ) ).

Answer: For each ĥi, the empirical error on the cross-validation set, ε̂Scv(ĥi), is the average of βm random variables with mean ε(ĥi), so by the Hoeffding inequality, for any ĥi,

    P(|ε(ĥi) − ε̂Scv(ĥi)| ≥ γ) ≤ 2 exp(−2γ²βm).

As in the class notes, to ensure that this holds for all ĥi simultaneously, we take the union bound over all k of the ĥi's:

    P(∃i s.t. |ε(ĥi) − ε̂Scv(ĥi)| ≥ γ) ≤ 2k exp(−2γ²βm).

Setting this bound equal to δ/2 and solving for γ yields

    γ = √( 1/(2βm) · log(4k/δ) ),

proving the desired bound.

(b) Use part (a) to show that with probability at least 1 − δ/2,

    ε(ĥ) ≤ min_{i=1,...,k} ε(ĥi) + √( 2/(βm) · log(4k/δ) ).

Answer: Let j = arg min_i ε(ĥi). Using part (a), with probability at least 1 − δ/2,

    ε(ĥ) ≤ ε̂Scv(ĥ) + √( 1/(2βm) · log(4k/δ) )
         = min_i ε̂Scv(ĥi) + √( 1/(2βm) · log(4k/δ) )
         ≤ ε̂Scv(ĥj) + √( 1/(2βm) · log(4k/δ) )
         ≤ ε(ĥj) + 2√( 1/(2βm) · log(4k/δ) )
         = min_{i=1,...,k} ε(ĥi) + √( 2/(βm) · log(4k/δ) ).

(c) Let j = arg min_i ε(ĥi). We know from class that for Hj, with probability at least 1 − δ/2,

    |ε(ĥj) − ε(h⋆j)| ≤ √( 2/((1 − β)m) · log(4|Hj|/δ) ).

Use this to prove the final bound given at the beginning of this problem.

Answer: By the union bound, the events in parts (a) and (c) each fail with probability at most δ/2, so both hold simultaneously with probability at least 1 − δ. Thus, with probability greater than 1 − δ,

    ε(ĥ) ≤ ε(h⋆j) + √( 2/((1 − β)m) · log(4|Hj|/δ) ) + √( 2/(βm) · log(4k/δ) ),

which is equivalent to the bound we want to show.
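As a numerical sanity check (not part of the assignment), the two penalty terms of the final bound are easy to evaluate. The sketch below uses hypothetical helper names and an invented setting of k, |Hi|, m, β, and δ:

```python
import math

def cv_term(k, beta, m, delta):
    # Part (b): hold-out selection penalty, sqrt(2/(beta*m) * log(4k/delta))
    return math.sqrt(2.0 / (beta * m) * math.log(4.0 * k / delta))

def erm_term(H_size, beta, m, delta):
    # Part (c): uniform-convergence penalty for ERM over a class of size |Hi|,
    # sqrt(2/((1-beta)*m) * log(4|Hi|/delta))
    return math.sqrt(2.0 / ((1.0 - beta) * m) * math.log(4.0 * H_size / delta))

# Hypothetical setting: k = 10 nested classes with |Hi| = 2^i,
# m = 10,000 examples, 30% held out, delta = 0.05.
k, beta, m, delta = 10, 0.3, 10000, 0.05
print(round(cv_term(k, beta, m, delta), 4))
print([round(erm_term(2 ** i, beta, m, delta), 4) for i in (1, 5, 10)])
```

Note how both penalties grow only logarithmically in k and |Hi| but shrink as 1/√m, which is why the procedure can afford a fairly large menu of hypothesis classes.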
2. VC Dimension

Let the input domain of a learning problem be X = R. Give the VC dimension for each of the following classes of hypotheses. In each case, if you claim that the VC dimension is d, then you need to show that the hypothesis class can shatter d points, and explain why there are no d + 1 points it can shatter.

• h(x) = 1{a < x}, with parameter a ∈ R.
  Answer: VC-dimension = 1.
  (a) It can shatter the point {0}: choosing a = −2 labels it 1, and choosing a = 2 labels it 0.
  (b) It cannot shatter any two points {x1, x2}, x1 < x2, because the labelling x1 = 1, x2 = 0 cannot be realized.

• h(x) = 1{a < x < b}, with parameters a, b ∈ R.
  Answer: VC-dimension = 2.
  (a) It can shatter the points {0, 2}: choosing (a, b) to be (3, 5), (−1, 1), (1, 3), (−1, 3) realizes the labellings (0, 0), (1, 0), (0, 1) and (1, 1), respectively.
  (b) It cannot shatter any three points {x1, x2, x3}, x1 < x2 < x3, because the labelling x1 = x3 = 1, x2 = 0 cannot be realized.

• h(x) = 1{a sin x > 0}, with parameter a ∈ R.
  Answer: VC-dimension = 1. Here a controls the amplitude of the sine curve.
  (a) It can shatter the point {π/2}: choosing a = 1 labels it 1, and choosing a = −1 labels it 0.
  (b) It cannot shatter any two points {x1, x2}, since the labels of x1 and x2 flip together as a changes sign. If x1 = x2 = 1 is realized by some a, then no a realizes x1 ≠ x2; if x1 ≠ x2 is realized by some a, then no a realizes x1 = x2 = 1 (although x1 = x2 = 0 can always be achieved by setting a = 0).

• h(x) = 1{sin(x + a) > 0}, with parameter a ∈ R.
  Answer: VC-dimension = 2. Here a controls the phase of the sine curve.
  (a) It can shatter the points {π/4, 3π/4}: choosing a to be 0, π/2, π and 3π/2 realizes the labellings (1, 1), (1, 0), (0, 0) and (0, 1), respectively.
  (b) It cannot shatter any three points {x1, x2, x3}. Since sine has a period of 2π, let x′i = xi mod 2π, and assume w.l.o.g. that x′1 < x′2 < x′3. If the labelling x1 = x2 = x3 = 1 can be realized, then the labelling x1 = x3 = 1, x2 = 0 cannot: the region where sin(x + a) > 0 is an arc of length π on the circle of circumference 2π, so if some arc of length π contains all three points, they span less than π, and then no arc of length π can contain x′1 and x′3 while excluding x′2. Notice the similarity to the second question.
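The two-point shattering claim for the interval class 1{a < x < b} can also be verified by brute force. This short check (not part of the assignment; the helper name is our own) enumerates the labellings realizable over a small grid of thresholds:

```python
def shatters(points):
    # Brute-force check for the hypothesis class h(x) = 1{a < x < b}:
    # enumerate (a, b) over a threshold grid straddling each point and
    # collect every labelling that some interval realizes.
    grid = sorted({p + d for p in points for d in (-0.5, 0.5)})
    realizable = set()
    for a in grid:
        for b in grid:
            realizable.add(tuple(1 if a < x < b else 0 for x in points))
    return len(realizable) == 2 ** len(points)

print(shatters([0, 2]))     # True: all four labellings realizable
print(shatters([0, 2, 4]))  # False: the labelling (1, 0, 1) is impossible
```

For integer points, thresholds at offsets ±0.5 suffice, since shifting a or b between two consecutive grid values cannot change any label.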
3. ℓ1 regularization for least squares

In the previous problem set, we looked at the least squares problem in which the objective function is augmented with an additional regularization term λ‖θ‖₂². In this problem we'll consider a similar regularized objective, but this time with a penalty on the ℓ1 norm of the parameters, λ‖θ‖₁, where ‖θ‖₁ is defined as Σi |θi|. That is, we want to minimize the objective

    J(θ) = (1/2) Σ_{i=1}^{m} (θᵀx(i) − y(i))² + λ Σ_{i=1}^{n} |θi|.

There has been a great deal of recent interest in ℓ1 regularization, which, as we will see, has the benefit of producing sparse solutions (i.e., many components of the resulting θ are exactly zero). The ℓ1-regularized least squares problem is more difficult than the unregularized or ℓ2-regularized cases because the ℓ1 term is not differentiable. However, many efficient algorithms have been developed for this problem that work very well in practice. One very straightforward approach, which we have already seen in class, is the coordinate descent method. In this problem you'll derive and implement a coordinate descent algorithm for ℓ1-regularized least squares, and apply it to test data.

and k = 4. Plot the cluster assignments and centroids for each iteration of the algorithm using the draw_clusters(X, clusters, centroids) function. For each k, be sure to run the algorithm several times using different initial centroids.
Answer: The following is our implementation of k_means.m:

function [clusters, centroids] = k_means(X, k)

m = size(X, 1);
n = size(X, 2);

oldcentroids = zeros(k, n);
centroids = X(ceil(rand(k, 1) * m), :);

while (norm(oldcentroids - centroids) > 1e-15)
  oldcentroids = centroids;

  % compute cluster assignments
  for i = 1:m,
    dists = sum((repmat(X(i,:), k, 1) - centroids).^2, 2);
    [min_dist, clusters(i,1)] = min(dists);
  end
  draw_clusters(X, clusters, centroids);
  pause(0.1);

  % compute cluster centroids
  for i = 1:k,
    centroids(i,:) = mean(X(clusters == i, :));
  end
end

Below we show the centroid evolution for two typical runs with k = 3. Note that the different starting positions of the clusters lead to different final clusterings.

[Figures: cluster assignments and centroid positions over the iterations of two runs with k = 3.]

5. The Generalized EM algorithm

When attempting to run the EM algorithm, it may sometimes be difficult to perform the M-step exactly; recall that we often need to use numerical optimization to perform the maximization, which can be costly.
Therefore, instead of finding the global maximum of our lower bound on the log-likelihood, an alternative is to just increase this lower bound a little, for example by taking one step of gradient ascent. This is commonly known as the Generalized EM (GEM) algorithm.

Put slightly more formally, recall that the M-step of the standard EM algorithm performs the maximization

    θ := arg max_θ Σ_i Σ_{z(i)} Qi(z(i)) log( p(x(i), z(i); θ) / Qi(z(i)) ).

The GEM algorithm, in contrast, performs the following update in the M-step:

    θ := θ + α ∇θ Σ_i Σ_{z(i)} Qi(z(i)) log( p(x(i), z(i); θ) / Qi(z(i)) ),

where α is a learning rate which we assume is chosen small enough that we do not decrease the objective function when taking this gradient step.

(a) Prove that the GEM algorithm described above converges. To do this, you should show that the likelihood is monotonically improving, as it does for the EM algorithm; i.e., show that ℓ(θ(t+1)) ≥ ℓ(θ(t)).

Answer: We use the same logic as for the standard EM algorithm. Specifically, just as for EM, we have for the GEM algorithm that

    ℓ(θ(t+1)) ≥ Σ_i Σ_{z(i)} Q(t)i(z(i)) log( p(x(i), z(i); θ(t+1)) / Q(t)i(z(i)) )
             ≥ Σ_i Σ_{z(i)} Q(t)i(z(i)) log( p(x(i), z(i); θ(t)) / Q(t)i(z(i)) )
             = ℓ(θ(t)),

where, as in EM, the first inequality holds by Jensen's inequality, and the last line holds because we choose the Q distribution to make the bound hold with equality at θ(t). The only difference between EM and GEM is the reason the second inequality holds: for EM it held because θ(t+1) was chosen to maximize this quantity, whereas for GEM it holds by our assumption that we take a gradient step small enough not to decrease the objective function.

(b) Instead of using the EM algorithm at all, suppose we just want to apply gradient ascent to maximize the log-likelihood directly.
In other words, we are trying to maximize the (non-convex) function

    ℓ(θ) = Σ_i log Σ_{z(i)} p(x(i), z(i); θ),

so we could simply use the update

    θ := θ + α ∇θ Σ_i log Σ_{z(i)} p(x(i), z(i); θ).

Show that this procedure in fact gives the same update as the GEM algorithm described above.

Answer: Differentiating the log-likelihood directly, we get

    ∂/∂θj Σ_i log Σ_{z(i)} p(x(i), z(i); θ)
      = Σ_i ( 1 / Σ_{z(i)} p(x(i), z(i); θ) ) Σ_{z(i)} ∂/∂θj p(x(i), z(i); θ)
      = Σ_i Σ_{z(i)} ( 1 / p(x(i); θ) ) · ∂/∂θj p(x(i), z(i); θ).

For the GEM algorithm,

    ∂/∂θj Σ_i Σ_{z(i)} Qi(z(i)) log( p(x(i), z(i); θ) / Qi(z(i)) )
      = Σ_i Σ_{z(i)} ( Qi(z(i)) / p(x(i), z(i); θ) ) · ∂/∂θj p(x(i), z(i); θ).

But the E-step of the GEM algorithm chooses

    Qi(z(i)) = p(z(i) | x(i); θ) = p(x(i), z(i); θ) / p(x(i); θ),

so

    Σ_i Σ_{z(i)} ( Qi(z(i)) / p(x(i), z(i); θ) ) · ∂/∂θj p(x(i), z(i); θ)
      = Σ_i Σ_{z(i)} ( 1 / p(x(i); θ) ) · ∂/∂θj p(x(i), z(i); θ),

which is the same as the derivative of the log-likelihood.
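This equality of gradients can be checked numerically on a toy model. The sketch below (our own illustration, not part of the solutions) uses a two-component, unit-variance Gaussian mixture with means +θ and −θ, and compares the Q-weighted gradient from the E-step against a finite-difference derivative of the log-likelihood:

```python
import math

def joint(x, z, theta):
    # p(x, z; theta): mixing weight 1/2, mean +theta (z = 0) or -theta (z = 1)
    mu = theta if z == 0 else -theta
    return 0.5 * math.exp(-0.5 * (x - mu) ** 2) / math.sqrt(2 * math.pi)

def loglik(data, theta):
    # l(theta) = sum_i log sum_z p(x_i, z; theta)
    return sum(math.log(joint(x, 0, theta) + joint(x, 1, theta)) for x in data)

def gem_gradient(data, theta):
    # Gradient of the EM lower bound, with Q_i set by the E-step to p(z | x_i; theta)
    g = 0.0
    for x in data:
        px = joint(x, 0, theta) + joint(x, 1, theta)
        for z in (0, 1):
            q = joint(x, z, theta) / px          # E-step posterior
            mu, dmu = (theta, 1.0) if z == 0 else (-theta, -1.0)
            g += q * (x - mu) * dmu              # d/dtheta log p(x, z; theta)
    return g

data = [0.3, -1.2, 2.0, 0.7]
theta, eps = 0.9, 1e-6
fd = (loglik(data, theta + eps) - loglik(data, theta - eps)) / (2 * eps)
print(abs(fd - gem_gradient(data, theta)) < 1e-5)
```

The finite-difference derivative of ℓ(θ) agrees with the Q-weighted gradient to numerical precision, matching the identity derived above.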