Unsupervised Learning and Reinforcement Learning - Machine Learning Assignment 4 (Exercises)

Machine learning assignment with solutions, covering unsupervised learning and reinforcement learning. Prof. Andrew Ng, Stanford University.

CS 229, Public Course
Problem Set #4: Unsupervised Learning and Reinforcement Learning

1. EM for supervised learning

In class we applied EM to the unsupervised learning setting. In particular, we represented p(x) by marginalizing over a latent random variable

    p(x) = \sum_z p(x, z) = \sum_z p(x|z) p(z).

However, EM can also be applied to the supervised learning setting, and in this problem we discuss a "mixture of linear regressors" model; this is an instance of what is often called the Hierarchical Mixture of Experts model. We want to represent p(y|x), x ∈ R^n and y ∈ R, and we do so by again introducing a discrete latent random variable

    p(y|x) = \sum_z p(y, z|x) = \sum_z p(y|x, z) p(z|x).

For simplicity we'll assume that z is binary valued, that p(y|x, z) is a Gaussian density, and that p(z|x) is given by a logistic regression model. More formally,

    p(z|x; φ) = g(φ^T x)^z (1 − g(φ^T x))^{1−z}

    p(y|x, z = i; θ_i) = \frac{1}{\sqrt{2π}\,σ} \exp\left( \frac{−(y − θ_i^T x)^2}{2σ^2} \right),    i = 0, 1

where σ is a known parameter and φ, θ_0, θ_1 ∈ R^n are parameters of the model (here we use the subscript on θ to denote two different parameter vectors, not to index a particular entry in these vectors).

Intuitively, the process behind the model can be thought of as follows. Given a data point x, we first determine whether the data point belongs to one of two hidden classes z = 0 or z = 1, using a logistic regression model. We then determine y as a linear function of x (different linear functions for different values of z) plus Gaussian noise, as in the standard linear regression model. For example, the following data set could be well represented by the model, but not by standard linear regression. (The figure of the example data set is not included in this preview.)

(a) Suppose x, y, and z are all observed, so that we obtain a training set {(x^{(1)}, y^{(1)}, z^{(1)}), ..., (x^{(m)}, y^{(m)}, z^{(m)})}. Write the log-likelihood of the parameters, and derive the maximum likelihood estimates for φ, θ_0, and θ_1. Note that because p(z|x) is a logistic regression model, there will not exist a closed form estimate of φ. In this case, derive the gradient and the Hessian of the likelihood with respect to φ; in practice, these quantities can be used to numerically compute the ML estimate.

(b) Now suppose z is a latent (unobserved) random variable. Write the log-likelihood of the parameters, and derive an EM algorithm to maximize the log-likelihood. Clearly specify the E-step and M-step (again, the M-step will require a numerical solution, so find the appropriate gradients and Hessians).
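(As a concrete illustration, not part of the original problem set: the following Octave/MATLAB sketch samples a synthetic data set from the mixture-of-linear-regressors generative process described in this problem. All parameter values are made up for illustration.)

    % Illustrative sketch only: sample m points from the mixture-of-linear-regressors model.
    m = 200; n = 2;
    X = [ones(m,1), linspace(-1,1,m)'];   % design matrix with an intercept feature
    phi    = [0; 3];                      % logistic-gate parameters (illustrative)
    theta0 = [1; -2];                     % regression parameters for the z = 0 class
    theta1 = [-1; 2];                     % regression parameters for the z = 1 class
    sigma  = 0.1;                         % known noise standard deviation
    g = @(a) 1 ./ (1 + exp(-a));          % logistic function
    z = rand(m,1) < g(X*phi);             % z | x ~ Bernoulli(g(phi' x))
    y = (1-z).*(X*theta0) + z.*(X*theta1) + sigma*randn(m,1);
    plot(X(:,2), y, '.');                 % two linear trends, gated by z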
2. Factor Analysis and PCA

In this problem we look at the relationship between two unsupervised learning algorithms we discussed in class: Factor Analysis and Principal Component Analysis. Consider the following joint distribution over (x, z), where z ∈ R^k is a latent random variable:

    z ∼ N(0, I)
    x|z ∼ N(Uz, σ^2 I)

where U ∈ R^{n×k} is a model parameter and σ^2 is assumed to be a known constant. This model is often called Probabilistic PCA. Note that this is nearly identical to the factor analysis model, except that we assume the variance of x|z is a known scaled identity matrix rather than the diagonal parameter matrix Φ, and we do not add an additional µ term to the mean (though this last difference is just for simplicity of presentation). However, as we will see, it turns out that as σ^2 → 0, this model is equivalent to PCA.

For simplicity, you can assume for the remainder of the problem that k = 1, i.e., that U is a column vector in R^n.

(a) Use the rules for manipulating Gaussian distributions to determine the joint distribution over (x, z) and the conditional distribution of z|x. [Hint: for later parts of this problem, it will help significantly if you simplify your solution for the conditional distribution using the identity we first mentioned in problem set #1: (λI + BA)^{−1}B = B(λI + AB)^{−1}.]

(b) Using these distributions, derive an EM algorithm for the model. Clearly state the E-step and the M-step of the algorithm.

(c) As σ^2 → 0, show that if the EM algorithm converges to a parameter vector U⋆ (and such convergence is guaranteed by the argument presented in class), then U⋆ must be an eigenvector of the sample covariance matrix Σ = \frac{1}{m} \sum_{i=1}^m x^{(i)} x^{(i)T} — i.e., U⋆ must satisfy

    λU⋆ = ΣU⋆.

[Hint: When σ^2 → 0, Σ_{z|x} → 0, so the E-step only needs to compute the means µ_{z|x} and not the variances. Let w ∈ R^m be a vector containing all these means, w_i = µ_{z^{(i)}|x^{(i)}}, and show that the E-step and M-step can be expressed as

    w = \frac{XU}{U^T U},    U = \frac{X^T w}{w^T w}

respectively. Finally, show that if U doesn't change after this update, it must satisfy the eigenvector equation shown above.]
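(For intuition only, not part of the original problem set: a small Octave/MATLAB sketch that draws data from this model with k = 1; the values of U and σ are illustrative. Following the later parts of the problem, the rows of the data matrix are the examples x^{(i)T}.)

    % Illustrative sketch only: sample from the probabilistic PCA model with k = 1.
    m = 500; n = 2;
    U = [2; 1];                       % loading direction (illustrative)
    sigma = 0.1;                      % known noise level
    z = randn(1, m);                  % z^(i) ~ N(0, 1)
    X = (U*z + sigma*randn(n, m))';   % rows of X are (U z^(i) + noise)'
    plot(X(:,1), X(:,2), '.');        % points scatter around the line spanned by U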
5. Reinforcement Learning: The Mountain Car

(Figure: the Mountain Car domain, a car on a hillside; horizontal position ranging from roughly −1.2 to 0.6.)

All states except those at the top of the hill have a constant reward R(s) = −1, while the goal state at the hilltop has reward R(s) = 0; thus an optimal agent will try to get to the top of the hill as fast as possible (when the car reaches the top of the hill, the episode is over, and the car is reset to its initial position). However, when starting at the bottom of the hill, the car does not have enough power to reach the top by driving forward, so it must first accelerate backwards, building up enough momentum to reach the top of the hill. This strategy of moving away from the goal in order to reach the goal makes the problem difficult for many classical control algorithms.

As discussed in class, Q-learning maintains a table of Q-values, Q(s, a), for each state and action. These Q-values are useful because, in order to select an action in state s, we only need to check to see which Q-value is greatest. That is, in state s we take the action

    arg max_{a∈A} Q(s, a).

The Q-learning algorithm adjusts its estimates of the Q-values as follows. If an agent is in state s, takes action a, then ends up in state s′, Q-learning will update Q(s, a) by

    Q(s, a) := (1 − α) Q(s, a) + α (R(s′) + γ max_{a′∈A} Q(s′, a′)).

At each time, your implementation of Q-learning can execute the greedy policy π(s) = arg max_{a∈A} Q(s, a).

Implement the [q, steps_per_episode] = qlearning(episodes) function in the q5/ directory. As input, the function takes the total number of episodes (each episode starts with the car at the bottom of the hill, and lasts until the car reaches the top), and outputs a matrix of the Q-values and a vector indicating how many steps it took before the car was able to reach the top of the hill. You should use the [x, s, absorb] = mountain_car(x, actions(a)) function to simulate one control cycle for the task — the x variable describes the true (continuous) state of the system, whereas the s variable describes the discrete index of the state, which you'll use to build the Q values.

Plot a graph showing the average number of steps before the car reaches the top of the hill versus the episode number (there is quite a bit of variation in this quantity, so you will probably want to average these over a large number of episodes, as this will give you a better idea of how the number of steps before reaching the hilltop is decreasing). You can also visualize your resulting controller by calling the draw_mountain_car(q) function.

CS 229, Public Course
Problem Set #4 Solutions: Unsupervised Learning and Reinforcement Learning

1. EM for supervised learning

In class we applied EM to the unsupervised learning setting. In particular, we represented p(x) by marginalizing over a latent random variable

    p(x) = \sum_z p(x, z) = \sum_z p(x|z) p(z).

However, EM can also be applied to the supervised learning setting, and in this problem we discuss a "mixture of linear regressors" model; this is an instance of what is often called the Hierarchical Mixture of Experts model. We want to represent p(y|x), x ∈ R^n and y ∈ R, and we do so by again introducing a discrete latent random variable

    p(y|x) = \sum_z p(y, z|x) = \sum_z p(y|x, z) p(z|x).

For simplicity we'll assume that z is binary valued, that p(y|x, z) is a Gaussian density, and that p(z|x) is given by a logistic regression model. More formally,

    p(z|x; φ) = g(φ^T x)^z (1 − g(φ^T x))^{1−z}

    p(y|x, z = i; θ_i) = \frac{1}{\sqrt{2π}\,σ} \exp\left( \frac{−(y − θ_i^T x)^2}{2σ^2} \right),    i = 0, 1

where σ is a known parameter and φ, θ_0, θ_1 ∈ R^n are parameters of the model (here we use the subscript on θ to denote two different parameter vectors, not to index a particular entry in these vectors).

Intuitively, the process behind the model can be thought of as follows. Given a data point x, we first determine whether the data point belongs to one of two hidden classes z = 0 or z = 1, using a logistic regression model. We then determine y as a linear function of x (different linear functions for different values of z) plus Gaussian noise, as in the standard linear regression model. For example, the following data set could be well represented by the model, but not by standard linear regression.

(a) Suppose x, y, and z are all observed, so that we obtain a training set {(x^{(1)}, y^{(1)}, z^{(1)}), ..., (x^{(m)}, y^{(m)}, z^{(m)})}. Write the log-likelihood of the parameters, and derive the maximum likelihood estimates for φ, θ_0, and θ_1. Note that because p(z|x) is a logistic regression model, there will not exist a closed form estimate of φ. In this case, derive the gradient and the Hessian of the likelihood with respect to φ; in practice, these quantities can be used to numerically compute the ML estimate.

Answer: The log-likelihood is given by

    ℓ(φ, θ_0, θ_1) = \log \prod_{i=1}^m p(y^{(i)} | x^{(i)}, z^{(i)}; θ_0, θ_1)\, p(z^{(i)} | x^{(i)}; φ)
                   = \sum_{i: z^{(i)}=0} \log \left( (1 − g(φ^T x^{(i)})) \frac{1}{\sqrt{2π}\,σ} \exp\left( \frac{−(y^{(i)} − θ_0^T x^{(i)})^2}{2σ^2} \right) \right)
                   + \sum_{i: z^{(i)}=1} \log \left( g(φ^T x^{(i)}) \frac{1}{\sqrt{2π}\,σ} \exp\left( \frac{−(y^{(i)} − θ_1^T x^{(i)})^2}{2σ^2} \right) \right).

Differentiating with respect to θ_0 and setting the result to zero,

    0 = ∇_{θ_0} ℓ(φ, θ_0, θ_1) = ∇_{θ_0} \sum_{i: z^{(i)}=0} −(y^{(i)} − θ_0^T x^{(i)})^2.

But this is just a least-squares problem on a subset of the data. In particular, if we let X_0 and \vec{y}_0 be the design matrix and target vector formed by considering only those examples with z^{(i)} = 0, then using the same logic as for the derivation of the least squares solution we get the maximum likelihood estimate of θ_0,

    θ_0 = (X_0^T X_0)^{−1} X_0^T \vec{y}_0.

The derivation for θ_1 proceeds in the identical manner.

Differentiating with respect to φ, and ignoring terms that do not depend on φ,

    ∇_φ ℓ(φ, θ_0, θ_1) = ∇_φ \left( \sum_{i: z^{(i)}=0} \log(1 − g(φ^T x^{(i)})) + \sum_{i: z^{(i)}=1} \log g(φ^T x^{(i)}) \right)
                       = ∇_φ \sum_{i=1}^m \left( (1 − z^{(i)}) \log(1 − g(φ^T x^{(i)})) + z^{(i)} \log g(φ^T x^{(i)}) \right).

This is just the standard logistic regression objective function, for which we already know the gradient and Hessian:

    ∇_φ ℓ(φ, θ_0, θ_1) = X^T (\vec{z} − \vec{h}),    \vec{h}_i = g(φ^T x^{(i)}),
    H = X^T D X,    D_{ii} = g(φ^T x^{(i)}) (1 − g(φ^T x^{(i)})).
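(A minimal implementation sketch of these fully observed updates, not part of the original solutions; it assumes a design matrix X whose rows are x^{(i)T}, a target vector y, and an observed 0/1 label vector z are already in the workspace, and the number of Newton iterations is arbitrary.)

    % Illustrative sketch: ML estimates when x, y, and z are all observed.
    X0 = X(z == 0, :);  y0 = y(z == 0);   % examples with z^(i) = 0
    X1 = X(z == 1, :);  y1 = y(z == 1);   % examples with z^(i) = 1
    theta0 = (X0'*X0) \ (X0'*y0);         % theta0 = (X0' X0)^{-1} X0' y0
    theta1 = (X1'*X1) \ (X1'*y1);         % theta1 = (X1' X1)^{-1} X1' y1

    g = @(a) 1 ./ (1 + exp(-a));          % logistic function
    phi = zeros(size(X,2), 1);
    for iter = 1:10                       % Newton's method for the logistic term
        h = g(X*phi);
        grad = X' * (z - h);              % gradient of the log-likelihood in phi
        D = diag(h .* (1 - h));
        phi = phi + (X'*D*X) \ grad;      % Newton ascent step
    end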
(b) Now suppose z is a latent (unobserved) random variable. Write the log-likelihood of the parameters, and derive an EM algorithm to maximize the log-likelihood. Clearly specify the E-step and M-step (again, the M-step will require a numerical solution, so find the appropriate gradients and Hessians).

2. Factor Analysis and PCA

where in both cases the last equality comes from the identity in the hint.

(b) Using these distributions, derive an EM algorithm for the model. Clearly state the E-step and the M-step of the algorithm.

Answer: Even though z^{(i)} is a scalar value, in this problem we continue to use the notation z^{(i)T}, etc., to make the similarities to the Factor Analysis case obvious. For the E-step, we compute the distribution Q_i(z^{(i)}) = p(z^{(i)} | x^{(i)}; U) by computing µ_{z^{(i)}|x^{(i)}} and Σ_{z^{(i)}|x^{(i)}} using the above formulas. For the M-step, we need to maximize

    \sum_{i=1}^m \int_{z^{(i)}} Q_i(z^{(i)}) \log \frac{p(x^{(i)} | z^{(i)}; U)\, p(z^{(i)})}{Q_i(z^{(i)})} \, dz^{(i)}
    = \sum_{i=1}^m E_{z^{(i)} ∼ Q_i} \left[ \log p(x^{(i)} | z^{(i)}; U) + \log p(z^{(i)}) − \log Q_i(z^{(i)}) \right].

Taking the gradient with respect to U, dropping terms that don't depend on U, and omitting the subscript on the expectation, this becomes

    ∇_U \sum_{i=1}^m E\left[ \log p(x^{(i)} | z^{(i)}; U) \right]
    = ∇_U \sum_{i=1}^m E\left[ −\frac{1}{2σ^2} (x^{(i)} − U z^{(i)})^T (x^{(i)} − U z^{(i)}) \right]
    = −\frac{1}{2σ^2} \sum_{i=1}^m ∇_U E\left[ \mathrm{tr}\, z^{(i)T} U^T U z^{(i)} − 2\, \mathrm{tr}\, z^{(i)T} U^T x^{(i)} \right]
    = −\frac{1}{σ^2} \sum_{i=1}^m E\left[ U z^{(i)} z^{(i)T} − x^{(i)} z^{(i)T} \right]
    = \frac{1}{σ^2} \sum_{i=1}^m \left( −U\, E[z^{(i)} z^{(i)T}] + x^{(i)} E[z^{(i)T}] \right),

using the same reasoning as in the Factor Analysis class notes. Setting this derivative to zero gives

    U = \left( \sum_{i=1}^m x^{(i)} E[z^{(i)T}] \right) \left( \sum_{i=1}^m E[z^{(i)} z^{(i)T}] \right)^{−1}
      = \left( \sum_{i=1}^m x^{(i)} µ_{z^{(i)}|x^{(i)}}^T \right) \left( \sum_{i=1}^m Σ_{z^{(i)}|x^{(i)}} + µ_{z^{(i)}|x^{(i)}} µ_{z^{(i)}|x^{(i)}}^T \right)^{−1}.

All these terms were calculated in the E-step, so this is our final M-step update.

(c) As σ^2 → 0, show that if the EM algorithm converges to a parameter vector U⋆ (and such convergence is guaranteed by the argument presented in class), then U⋆ must be an eigenvector of the sample covariance matrix Σ = \frac{1}{m} \sum_{i=1}^m x^{(i)} x^{(i)T} — i.e., U⋆ must satisfy λU⋆ = ΣU⋆. [Hint: When σ^2 → 0, Σ_{z|x} → 0, so the E-step only needs to compute the means µ_{z|x} and not the variances. Let w ∈ R^m be a vector containing all these means, w_i = µ_{z^{(i)}|x^{(i)}}, and show that the E-step and M-step can be expressed as

    w = \frac{XU}{U^T U},    U = \frac{X^T w}{w^T w}

respectively. Finally, show that if U doesn't change after this update, it must satisfy the eigenvector equation shown above.]

Answer: For the E-step, when σ^2 → 0, µ_{z^{(i)}|x^{(i)}} = \frac{U^T x^{(i)}}{U^T U}, so using w as defined in the hint we have w = \frac{XU}{U^T U} as desired. As mentioned in the hint, when σ^2 → 0, Σ_{z^{(i)}|x^{(i)}} = 0, so

    U = \left( \sum_{i=1}^m x^{(i)} µ_{z^{(i)}|x^{(i)}}^T \right) \left( \sum_{i=1}^m Σ_{z^{(i)}|x^{(i)}} + µ_{z^{(i)}|x^{(i)}} µ_{z^{(i)}|x^{(i)}}^T \right)^{−1}
      = \left( \sum_{i=1}^m x^{(i)} w_i \right) \left( \sum_{i=1}^m w_i w_i \right)^{−1}
      = \frac{X^T w}{w^T w}.

For U to remain unchanged after an update requires that

    U = \frac{X^T \frac{XU}{U^T U}}{\left( \frac{XU}{U^T U} \right)^T \frac{XU}{U^T U}}
      = \frac{X^T X U \, (U^T U)}{U^T X^T X U}
      = X^T X U \, \frac{1}{λ},

where λ = \frac{U^T X^T X U}{U^T U} is a scalar. Rearranging gives λU = X^T X U; since X^T X = mΣ, U is an eigenvector of the sample covariance matrix Σ (with eigenvalue λ/m), proving the desired equation.
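(The fixed point can also be checked numerically. The following sketch is illustrative only and not part of the original solutions: it runs the two σ^2 → 0 updates from the hint on synthetic data and verifies that the limit satisfies the eigenvector equation for X^T X.)

    % Illustrative check: the sigma^2 -> 0 EM updates converge to an eigenvector of X'X.
    m = 500; n = 3;
    X = randn(m, n) * diag([3, 1, 0.5]);   % synthetic data; rows of X are x^(i)'
    U = randn(n, 1);                       % arbitrary initialization
    for iter = 1:100
        w = X*U / (U'*U);                  % E-step: w_i = mu_{z^(i)|x^(i)}
        U = X'*w / (w'*w);                 % M-step
    end
    lambda = (U'*(X'*X)*U) / (U'*U);       % Rayleigh quotient
    disp(norm(X'*X*U - lambda*U));         % approximately zero at convergence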
3. PCA and ICA for Natural Images

In this problem we'll apply Principal Component Analysis and Independent Component Analysis to image patches collected from "natural" image scenes (pictures of leaves, grass, etc.). This is one of the classical applications of the ICA algorithm, and it sparked a great deal of interest in the algorithm; it was observed that the bases recovered by ICA closely resemble image filters present in the first layer of the visual cortex.

The q3/ directory contains the data and several useful pieces of code for this problem. The raw images are stored in the images/ subdirectory, though you will not need to work with these directly, since we provide code for loading and normalizing the images. Calling the function [X_ica, X_pca] = load_images; will load the images, break them into 16x16 image patches, and place all these patches into the columns of the matrices X_ica and X_pca. We create two different data sets for PCA and ICA because the algorithms require slightly different methods of preprocessing the data.¹

For this problem you'll implement the ica.m and pca.m functions, using the PCA and ICA algorithms described in the class notes. While the PCA implementation should be straightforward, getting a good implementation of ICA can be a bit trickier. Here is some general advice for getting a good implementation on this data set:

• Picking a good learning rate is important. In our experiments we used α = 0.0005 on this data set.

• Batch gradient descent doesn't work well for ICA (this has to do with the fact that the ICA objective function is not concave), but the pure stochastic gradient described in the notes can be slow. (There are about 20,000 16x16 image patches in the data set, so one pass over the data using the stochastic gradient rule described in the notes requires inverting the 256x256 W matrix 20,000 times.) Instead, a good compromise is to use a hybrid stochastic/batch gradient descent where we calculate the gradient with respect to several examples at a time (100 worked well for us), and use this to update W. Our implementation makes 10 total passes over the entire data set.

• It is a good idea to randomize the order of the examples presented to stochastic gradient descent before each pass over the data.

• Vectorize your Matlab code as much as possible. For general examples of how to do this, look at the Matlab review session.

For reference, computing the ICA W matrix for the entire set of image patches takes about 5 minutes on a 1.6 GHz laptop using our implementation. After you've learned the U matrix for PCA (the columns of U should contain the principal components of the data) and the W matrix of ICA, you can plot the basis functions using the plot_ica_bases(W); and plot_pca_bases(U); functions we have provided. Comment briefly on the difference between the two sets of basis functions.

¹ Recall that the first step of performing PCA is to subtract the mean and normalize the variance of the features. For the image data we're using, the preprocessing step for the ICA algorithm is slightly different, though the precise mechanism and justification is not important for the sake of this problem. Those who are curious about the details should read Bell and Sejnowski's paper "The 'Independent Components' of Natural Scenes are Edge Filters," which provided the basis for the implementation we use in this problem.
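(For reference when reading the implementation below: the per-example stochastic gradient ascent rule from the class notes that the advice above refers to is, as best it can be restated here,

    W := W + α \left( \left( 1 − 2\,g(W x^{(i)}) \right) x^{(i)T} + (W^T)^{−1} \right),

where g is the sigmoid function applied elementwise. The hybrid stochastic/batch variant sums the first term over the examples in a mini-batch and scales the (W^T)^{−1} term by the batch size, which is what the ica.m code below does.)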
Answer: The following are our implementations of pca.m and ica.m:

    function U = pca(X)
    % PCA basis: the principal components are the eigenvectors of X*X',
    % obtained here from the singular value decomposition.
    [U, S, V] = svd(X*X');

    function W = ica(X)
    % ICA by hybrid stochastic/batch gradient ascent, 10 passes over the data.
    [n, m] = size(X);
    chunk = 100;                           % mini-batch size
    alpha = 0.0005;                        % learning rate
    W = eye(n);
    for iter = 1:10,
        disp([num2str(iter)]);
        X = X(:, randperm(m));             % shuffle the examples before each pass
        for i = 1:floor(m/chunk),
            Xc = X(:, (i-1)*chunk+1:i*chunk);
            dW = (1 - 2./(1 + exp(-W*Xc)))*Xc' + chunk*inv(W');
            W = W + alpha*dW;
        end
    end

PCA produces the following bases: (the figure of the PCA bases is not included in this preview).

Applying part (a),

    V^π(s) ≤ B^{π′}(V^π)(s)   ⇒   B^{π′}(V^π)(s) ≤ B^{π′}(B^{π′}(V^π))(s).

Continually applying this property, and applying part (b), we obtain

    V^π(s) ≤ B^{π′}(V^π)(s) ≤ B^{π′}(B^{π′}(V^π))(s) ≤ ... ≤ B^{π′}(B^{π′}(... B^{π′}(V^π) ...))(s) = V^{π′}(s).

(d) Use the preceding exercises to show that policy iteration will eventually converge (i.e., produce a policy π′ = π). Furthermore, show that it must converge to the optimal policy π⋆. For the latter part, you may use the property that if some value function satisfies

    V(s) = R(s) + γ max_{a∈A} \sum_{s′∈S} P_{sa}(s′) V(s′)

then V = V⋆.

Answer: We know that policy iteration must converge because there are only a finite number of possible policies (if there are |S| states, each with |A| actions, then that leads to a total of |A|^{|S|} possible policies). Since the policies are monotonically improving, as we showed in part (c), at some point we must stop generating new policies, so the algorithm must produce π′ = π. Using the assumptions stated in the question, it is easy to show convergence to the optimal policy. If π′ = π, then using the same logic as in part (c),

    V^π(s) = V^{π′}(s) = R(s) + γ max_{a∈A} \sum_{s′∈S} P_{sa}(s′) V^π(s′),

so V = V⋆, and therefore π = π⋆.

5. Reinforcement Learning: The Mountain Car

In this problem you will implement the Q-Learning reinforcement learning algorithm described in class on a standard control domain known as the Mountain Car.² The Mountain Car domain simulates a car trying to drive up a hill, as shown in the figure below.

(Figure: the Mountain Car domain, a car on a hillside; horizontal position ranging from roughly −1.2 to 0.6.)

² The dynamics of this domain were taken from Sutton and Barto, 1998.

All states except those at the top of the hill have a constant reward R(s) = −1, while the goal state at the hilltop has reward R(s) = 0; thus an optimal agent will try to get to the top of the hill as fast as possible (when the car reaches the top of the hill, the episode is over, and the car is reset to its initial position). However, when starting at the bottom of the hill, the car does not have enough power to reach the top by driving forward, so it must first accelerate backwards, building up enough momentum to reach the top of the hill. This strategy of moving away from the goal in order to reach the goal makes the problem difficult for many classical control algorithms.

As discussed in class, Q-learning maintains a table of Q-values, Q(s, a), for each state and action. These Q-values are useful because, in order to select an action in state s, we only need to check to see which Q-value is greatest. That is, in state s we take the action

    arg max_{a∈A} Q(s, a).

The Q-learning algorithm adjusts its estimates of the Q-values as follows. If an agent is in state s, takes action a, then ends up in state s′, Q-learning will update Q(s, a) by

    Q(s, a) := (1 − α) Q(s, a) + α (R(s′) + γ max_{a′∈A} Q(s′, a′)).

At each time, your implementation of Q-learning can execute the greedy policy π(s) = arg max_{a∈A} Q(s, a).
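(As a worked numerical example of this update, with illustrative values only: taking α = 0.05 and γ = 0.99 as in the solution below, if Q(s, a) = −10, R(s′) = −1, and max_{a′} Q(s′, a′) = −8, the new estimate is

    Q(s, a) = 0.95·(−10) + 0.05·(−1 + 0.99·(−8)) = −9.5 − 0.446 = −9.946,

i.e., the old estimate moves a small step toward the one-step bootstrapped target.)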
Implement the [q, steps_per_episode] = qlearning(episodes) function in the q5/ directory. As input, the function takes the total number of episodes (each episode starts with the car at the bottom of the hill, and lasts until the car reaches the top), and outputs a matrix of the Q-values and a vector indicating how many steps it took before the car was able to reach the top of the hill. You should use the [x, s, absorb] = mountain_car(x, actions(a)) function to simulate one control cycle for the task — the x variable describes the true (continuous) state of the system, whereas the s variable describes the discrete index of the state, which you'll use to build the Q values.

Plot a graph showing the average number of steps before the car reaches the top of the hill versus the episode number (there is quite a bit of variation in this quantity, so you will probably want to average these over a large number of episodes, as this will give you a better idea of how the number of steps before reaching the hilltop is decreasing). You can also visualize your resulting controller by calling the draw_mountain_car(q) function.

Answer: The following is our implementation of qlearning.m:

    function [q, steps_per_episode] = qlearning(episodes)

    % set up parameters and initialize q values
    alpha = 0.05;
    gamma = 0.99;
    num_states = 100;
    num_actions = 2;
    actions = [-1, 1];
    q = zeros(num_states, num_actions);

    for i = 1:episodes,
        % start each episode with the car at the bottom of the hill
        [x, s, absorb] = mountain_car([0.0 -pi/6], 0);
        [maxq, a] = max(q(s,:));
        if (q(s,1) == q(s,2)) a = ceil(rand*num_actions); end;
        steps = 0;
        while (~absorb)
            % execute the best action or a random action
            [x, sn, absorb] = mountain_car(x, actions(a));
            % reward is -1 until the absorbing goal state is reached
            reward = -double(absorb == 0);
            % find the best action for the next state and update q value
            [maxq, an] = max(q(sn,:));
            if (q(sn,1) == q(sn,2)) an = ceil(rand*num_actions); end
            q(s,a) = (1 - alpha)*q(s,a) + alpha*(reward + gamma*maxq);
            a = an;
            s = sn;
            steps = steps + 1;
        end
        steps_per_episode(i) = steps;
    end

Within 10000 episodes, the algorithm converges to a policy that usually gets the car up the hill in around 52-53 steps. The following plot shows the number of steps per episode (averaged over 500 episodes) versus the number of episodes. We generated the plot using the following code:

    % average the steps-per-episode curves over 10 independent runs of 10000 episodes
    for i = 1:10,
        [q, ep_steps] = qlearning(10000);
        all_ep_steps(i,:) = ep_steps;
    end
    plot(mean(reshape(mean(all_ep_steps), 500, 20)));

(Figure: Average Steps per Episode versus Episode Number, plotted over 10000 episodes.)
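(A possible usage sketch, illustrative only, assuming the qlearning implementation above and the draw_mountain_car function provided with the problem set:)

    % Illustrative usage: train for 10000 episodes, smooth the per-episode step
    % counts with a causal 500-episode moving average, and visualize the controller.
    [q, steps_per_episode] = qlearning(10000);
    plot(filter(ones(1,500)/500, 1, steps_per_episode));
    xlabel('Episode Number');
    ylabel('Average Steps per Episode');
    draw_mountain_car(q);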