CS 229 Notes: Machine Learning

These notes on machine learning cover supervised learning, learning theory, unsupervised learning, and reinforcement learning and control. They list the prerequisites for the course (knowledge of computer science principles, basic probability theory, and linear algebra), note that programming assignments use MATLAB or its open-source alternative Octave, mention the significant final project and the many resources available for project ideas, and record definitions of machine learning from Arthur Samuel and Tom Mitchell.

…so it conveniently represents the predictions. Also, let y = (y^(1), …, y^(m))ᵀ. Then one obtains the cost function

J(θ) = (1/2)(Xθ − y)ᵀ(Xθ − y).

Here (Xθ − y) is the error vector, so you can see how it is squared. Now one can take the derivative, though several steps are skipped in the derivation:

∇_θ (1/2)(Xθ − y)ᵀ(Xθ − y) = (1/2)∇_θ (θᵀXᵀ − yᵀ)(Xθ − y) = XᵀXθ − Xᵀy.

Setting the last expression to zero gives the normal equation, XᵀXθ = Xᵀy. This is useful because in certain situations it can be solved in one step: θ = (XᵀX)⁻¹Xᵀy.

3. LINEAR ALGEBRA AND MATRIX CALCULUS PROBLEM SESSION: 9/27/13

This section reviews the basics of linear algebra and matrix calculus, and was taught by three of the TAs. To start, it will be helpful to define the lecture's notation. A column vector looks like x = (x₁, …, xₙ)ᵀ ∈ Rⁿ. Throughout this lecture, x will be a vector in Rⁿ and A ∈ R^{m×n} a matrix. On Rⁿ we have some norms:
• The 2-norm is ‖x‖₂ = sqrt(∑_{i=1}^n x_i²).
• The 1-norm is ‖x‖₁ = ∑_{i=1}^n |x_i|.
• The infinity norm is ‖x‖_∞ = max_i |x_i|.

An m×n matrix can be seen as n column vectors, A = (c₁, …, cₙ), or as m row vectors r₁ᵀ, …, r_mᵀ stacked on top of each other. Two vectors x, y ∈ Rⁿ are multiplied with the inner product x·y = xᵀy = yᵀx = ∑_{i=1}^n x_i y_i, so ‖x‖₂² = xᵀx.

Since there are several notations for a matrix, the matrix-vector product can be read in several ways: using the row notation, for example, Ax = (r₁ᵀx, …, r_mᵀx)ᵀ, a column vector. Additionally, one has the matrix-matrix product, defined when A ∈ R^{m×n} and B ∈ R^{n×p}; the product lies in R^{m×p}, and if C = AB, then c_{ij} = ∑_{k=1}^n a_{ik} b_{kj}. Here are some properties of matrix multiplication:
• Associativity: (AB)C = A(BC); the grouping of the factors does not matter.
• Distributivity: A(B + C) = AB + AC.
• In general, the matrix product is not commutative: AB ≠ BA. Do not make this mistake.
These properties are relatively easy to prove using the formula given above.

Now, a few operators on matrices. The transpose operator just flips a matrix: (Aᵀ)_{ij} = A_{ji}. It has a few nice properties:
• (A + B)ᵀ = Aᵀ + Bᵀ.
• (AB)ᵀ = BᵀAᵀ.
• (Aᵀ)ᵀ = A.
Again, these are things you can prove yourself if you want.

Definition. A matrix is symmetric if Aᵀ = A. A matrix for which Aᵀ = −A is called anti-symmetric.

Symmetric matrices have interesting properties, and thus come up in the material more often. The inverse of a matrix doesn't always exist; if it does, it is the matrix A⁻¹ such that A⁻¹A = AA⁻¹ = I (the identity matrix, with 1 along the diagonal and 0 elsewhere). If such an inverse exists, A is called nonsingular. We have some more properties:
• (A⁻¹)⁻¹ = A.
• (Aᵀ)⁻¹ = (A⁻¹)ᵀ, sometimes denoted A⁻ᵀ.
• (AB)⁻¹ = B⁻¹A⁻¹.
These properties hold when everything is well-behaved and invertible, so don't forget that and misapply them.

We saw the trace tr(A) in class, which is linear: tr(A + B) = tr(A) + tr(B). Additionally, tr(Aᵀ) = tr(A) and tr(AB) = tr(BA). This last property is particularly useful for computations, and can be generalized to larger products: tr(ABC) = tr(CAB) = tr(BCA), and so on. However, this is not true for arbitrary reorderings; the permutations must be cyclic.

Definition. The Frobenius norm of a matrix A is ‖A‖_F = sqrt(tr(AAᵀ)) = sqrt(∑_{i=1}^m ∑_{j=1}^n A_{ij}²).

Proving the last equality above would be a good way to get better at using these definitions.
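As a quick illustration of the normal equation, here is a minimal Octave/MATLAB sketch (not part of the original notes; the synthetic data and variable names are assumptions):

% Least squares via the normal equation X'X theta = X'y.
m = 50; n = 3;
X = [ones(m, 1), randn(m, n - 1)];     % design matrix with an intercept column
theta_true = [1; 2; -0.5];
y = X * theta_true + 0.1 * randn(m, 1);
theta = (X' * X) \ (X' * y);           % backslash solves the linear system

Using the backslash solve is numerically preferable to forming (XᵀX)⁻¹ explicitly, though it produces the same θ.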
Another useful function is the determinant. It will be briefly discussed here; a better explanation, from a more geometric angle (no pun intended), is presented in the lecture notes. The formal definition is

det A = |A| = ∑_{i=1}^m (−1)^{i+j} a_{ij} |A_{(i,j)}|,

where A_{(i,j)} denotes A with the i-th row and the j-th column removed (so (−1)^{i+j}|A_{(i,j)}| is the (i, j) cofactor). The choice of the column j is arbitrary; the value does not depend on it. Here are some properties of the determinant:
• det(AB) = det(A)det(B).
• det(A) = 0 iff A is singular (i.e. noninvertible).
• det(Aᵀ) = det(A).
• If A is nonsingular, det(A⁻¹) = 1/det(A).

Definition. A family u₁, …, uₙ of vectors is linearly independent if for every (α₁, …, αₙ) ∈ Rⁿ, α₁u₁ + ⋯ + αₙuₙ = 0 implies that α₁ = ⋯ = αₙ = 0.

The notion of linear independence can be used to define the rank of a matrix; see the lecture notes for this concept.

Definition. The range of a matrix A ∈ R^{m×n}, denoted R(A), is the set R(A) = {Ax | x ∈ Rⁿ}. The null space of A, denoted N(A), is the set N(A) = {x ∈ Rⁿ | Ax = 0}.

Notice that R(A) ⊂ R^m while N(A) ⊂ Rⁿ, which can be remembered by tracking which lives in the output space and which lives in the input space. The null space is sometimes also called the kernel, but not in this class.

Definition. (λ, x) is an eigenvalue-eigenvector pair of A if x ≠ 0 and Ax = λx.

For small matrices, one finds the eigenvalues by computing the roots of det(λI − A) = 0. This polynomial (since it ends up being a polynomial) is called the characteristic polynomial. Eigenvectors and eigenvalues play a role in a lot of fields, but will be most useful in this class in principal component analysis.

Definition. A matrix U ∈ R^{m×m} is orthogonal if UᵀU = UUᵀ = I; in other words, U⁻¹ = Uᵀ.

Claim. If U is an orthogonal matrix, then ‖Ux‖₂ = ‖x‖₂ for any x ∈ R^m.

To make calculations more tractable, one takes the logarithm of the likelihood, producing the creatively named log-likelihood:

ℓ(θ) = log L(θ)
     = log ∏_{i=1}^m (1/(σ√(2π))) exp(−(y^(i) − θᵀx^(i))²/(2σ²))
     = ∑_{i=1}^m [ log(1/(σ√(2π))) − (y^(i) − θᵀx^(i))²/(2σ²) ]
     = m log(1/(σ√(2π))) − ∑_{i=1}^m (y^(i) − θᵀx^(i))²/(2σ²),

and since the first term is constant, to maximize ℓ(θ), and therefore L(θ), one wishes to minimize

J(θ) = (1/2) ∑_{i=1}^m (y^(i) − θᵀx^(i))².

Notice that the final answer doesn't depend on σ². This will be significant when this class turns to generalized linear models.

Logistic Regression. Now we turn to a classification problem, where y^(i) ∈ {0, 1}. Here a linear regression isn't the smartest choice, though it does work sometimes. One can add a training example with y^(m+1) = 1 to the far right, which makes the fitted line change substantially. This is a problem, because the new point shouldn't make any difference to the prediction. This is why it is bad to use linear regression on classification problems; sometimes it works, but often it doesn't, and in any case predicting values outside of [0, 1] is a problem. Thus, an algorithm called logistic regression is used to address both of these problems. Here, hθ(x) ∈ [0, 1] is given by

hθ(x) = 1/(1 + e^{−θᵀx}) = g(θᵀx),

where g(z) = 1/(1 + e^{−z}) is called the sigmoid or logistic function, with graph given in Figure 2.

FIGURE 2. The sigmoid function g(z). Source: Wikipedia.

This is convenient because statistical outliers that messed up the linear model are okay for the logistic one, and all of the predicted values lie in [0, 1].
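A one-line Octave/MATLAB sketch of the sigmoid is shown below (an assumption about the implementation, not code from the notes; the logistic-regression code later in these notes expects such a sigmoid.m to be on the path):

function g = sigmoid(z)
% SIGMOID  Element-wise logistic function g(z) = 1 ./ (1 + exp(-z)).
g = 1 ./ (1 + exp(-z));
end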
The function goes smoothly from 0 to 1 along the real line, but it is used here mainly because it makes the resulting optimization problem convex (which implies a unique global solution), and because it ends up being a special case of a generalized linear model. This makes it possible for more powerful, general algorithms to work on logistic regression. (Footnote 11: logistic regression is a classification algorithm, not a regression algorithm; take care not to confuse the two.)

The probabilistic interpretation of this model is that P(y = 1 | x; θ) = hθ(x), and similarly P(y = 0 | x; θ) = 1 − hθ(x). In other words, since y ∈ {0, 1},

P(y | x; θ) = hθ(x)^y (1 − hθ(x))^{1−y}.

This can be checked by filling in the two possibilities for y. Now, let's apply a maximum-likelihood estimate:

L(θ) = ∏_{i=1}^m P(y^(i) | x^(i); θ) = ∏_{i=1}^m hθ(x^(i))^{y^(i)} (1 − hθ(x^(i)))^{1−y^(i)}.

It is again possible to calculate the log-likelihood ℓ(θ) = log L(θ) as before. This allows one to use gradient ascent to maximize the log-likelihood, updating the parameters by

θ_j := θ_j + α (∂/∂θ_j) ℓ(θ),

where we are maximizing ℓ(θ) rather than minimizing (which is the same after flipping a sign, so it's not a huge difference). Here α is again the learning rate. After all the math is bashed out (as is done in the provided lecture notes), the rule is

θ_j := θ_j + α ∑_{i=1}^m (y^(i) − hθ(x^(i))) x_j^(i).

The surprising thing is that, cosmetically, this looks identical to the gradient-descent rule for linear regression. These two algorithms were invented independently, and people at first wondered if they were the same; but since hθ is different in each case, these are different algorithms, and in particular they produce different results on the same data. These algorithms aren't particularly new; computers called Perceptrons used a similar algorithm in the 1960s. In general, logistic regression is a very common machine learning algorithm; targeted advertisements, for example, often use it or a small variant.

Newton's Method. There are multiple ways of optimizing a function; here's another one, called Newton's method. For the presentation in today's lecture, we consider θ ∈ R rather than θ ∈ R^{n+1}. The goal is to maximize the log-likelihood, so define f(θ) = (d/dθ)ℓ(θ); then the goal is to find a θ such that f(θ) = 0. Unlike linear regression, there is no known way to solve this in closed form (hello, Galois!). However, Newton's method is a very effective way of approximating solutions.

Similarly to gradient descent, this method starts with an initial guess θ₀. Take the tangent line to f at θ₀, and let θ₁ be the place where that line intersects the x-axis. Then repeat with the tangent line to f at θ₁, and so on. Now for the math: how does one obtain θ₁ from θ₀? Let Δ = |θ₁ − θ₀|. The slope of f at θ₀ is f′(θ₀), but it is also f(θ₀)/Δ, so Δ = f(θ₀)/f′(θ₀), and the update step is

θ^(t+1) := θ^(t) − f(θ^(t))/f′(θ^(t)).

Alternatively, dispensing with f entirely,

θ^(t+1) = θ^(t) − ℓ′(θ^(t))/ℓ″(θ^(t)).

Newton's method converges very quickly. Specifically, it has quadratic convergence: each iteration, the error is roughly squared (which sounds bad, but for small errors is of course very good!). This is much faster than gradient descent, though each iteration is more computationally intensive. In the more general case where θ ∈ R^{n+1}, the update becomes

θ^(t+1) = θ^(t) − H⁻¹ ∇_θ ℓ(θ^(t)),

where H is the Hessian matrix of ℓ(θ). This is explained more fully in the notes and the problem set.
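For concreteness, here is a sketch of the multidimensional Newton update applied to the logistic-regression log-likelihood (my own illustrative Octave/MATLAB code, not from the notes; it assumes X is an m×(n+1) design matrix that already includes an intercept column and y is an m×1 vector of 0/1 labels):

% Newton's method for maximizing the logistic-regression log-likelihood.
theta = zeros(size(X, 2), 1);
for t = 1:10                                 % Newton usually converges in a few steps
    h    = 1 ./ (1 + exp(-X * theta));       % h_theta(x^(i)) for every example
    grad = X' * (y - h);                     % gradient of the log-likelihood
    H    = -X' * diag(h .* (1 - h)) * X;     % Hessian of the log-likelihood
    theta = theta - H \ grad;                % theta := theta - H^{-1} * gradient
end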
Inverting a matrix is generally computationally expensive, which is the primary drawback of this method.

5. GENERALIZED LINEAR MODELS: 10/2/13

Though the logistic function was presented in the last class, it's not clear why that particular function was used; there are lots of smooth functions increasing from 0 to 1. That answer, as well as others, will be given below.

The Exponential Family of Probability Distributions. Recall that the Bernoulli distribution with parameter φ has P(y = 1; φ) = φ. It is convenient to think of the set of all Bernoulli distributions over all parameters φ. Similarly, consider the set of all Gaussians N(µ, σ²), with elements such as N(0, 1), N(10, 0.5), etc. It turns out that both of these are special cases of the exponential class of distributions (here, "class" just means "set"; there aren't any fancy set-theoretic difficulties to be found).

Definition. A class of distributions is an exponential family if it can be written as

P(y; η) = b(y) exp(ηᵀT(y) − a(η))        (2)

for some choices of b, T, and a, such that the resulting function is still a probability distribution. Here η is called the natural parameter and T(y) is called the sufficient statistic. In most, but not all, examples of exponential-family distributions, T(y) = y.

This definition yields a set of distributions parameterized by η once b, T, and a are fixed; one says "a distribution is in the exponential family" if some choice of b, T, and a yields that distribution, as a set of probability distributions on y. For example, to demonstrate that the Bernoulli distributions form an exponential family, there must be a bijection φ ↦ η for some choice of parameters. Rewrite the Bernoulli distribution as P(y; φ) = φ^y (1 − φ)^{1−y}. Thus,

P(y; φ) = exp( log(φ^y (1 − φ)^{1−y}) ) = exp( y log(φ/(1 − φ)) + log(1 − φ) ).

Thus, setting b(y) = 1, T(y) = y, η = log(φ/(1 − φ)), and a(η) = −log(1 − φ), one obtains (2). Solving for φ, one obtains φ = 1/(1 + e^{−η}) (which is just the logistic formula!), and a(η) = log(1 + e^η). It turns out there are multiple mappings for which this works, such as multiplying T by 1/2 and a by 2, but usually there is a most natural one.

Now, consider the Gaussian. For the rest of this lecture, follow the convention that for N(µ, σ²), σ² = 1. The simplification, after a bit of algebra, is

(1/√(2π)) exp(−(1/2)(y − µ)²) = [ (1/√(2π)) e^{−y²/2} ] · exp( µy − (1/2)µ² ),

so b(y) = (1/√(2π)) e^{−y²/2}, a(η) = (1/2)µ², and the natural mapping is just µ ↦ η; T(y) = y again.

Most of the distributions mentioned in introductory probability classes are exponential: the multivariate Gaussian, the Poisson, the Beta, the Gamma, the Dirichlet, and of course the Exponential distribution. Since these cover a large number of different real-world scenarios, abstracting a problem away to an exponential-family distribution allows the use of a few general algorithms (specifically, the GLM described below) in many situations.

The Generalized Linear Model (GLM). Given some distribution P(y | x; θ), one can use a generalized linear model under the following assumptions:
(1) y | x; θ ∼ ExpFam(η) for some choice of parameters.
(2) The goal is to model P(y | x; θ), and to calculate the expected value E[T(y) | x; θ].
(3) The relationship between η and θ is η = θᵀx. This assumption leads to the very nice algorithm.

Example 5.1. Suppose one has a bunch of features of tumors and wants to classify them as malignant or benign.
Choose parameters such that ExpFam(η) is the set of Bernoulli distributions, assume y | x; θ ∼ ExpFam(η) with η = θᵀx, and let the algorithm output hθ(x) = E[y | x; θ] = P(y = 1 | x; θ), since the expected value of a Bernoulli random variable is just the probability that it equals one. Write this probability as φ, though there is an unstated dependence φ(x, θ). Since we know how η depends on x and θ, this simplifies to

hθ(x) = φ = 1/(1 + e^{−η}) = 1/(1 + e^{−θᵀx}).

Thus, the logistic regression algorithm just falls out of the mathematics! In general, the steps are to: …

It's possible to have multiple random variables in a scenario, in which case one has joint probability distributions. Then one takes a double sum or double integral over all values (depending on discreteness or continuity) to get 1. Similarly, one has P(Y = y) = ∑_x P(X = x, Y = y); this is useful if one is given the joint distribution but not the individual ones. One can also write down a conditional distribution, P_{Y|X}(y | x) = P_{X,Y}(x, y)/P_X(x); this works in both the continuous and the discrete cases. Additionally, one has Bayes' rule: P(A | B) = P(B | A)P(A)/P(B). This has far-reaching implications in law, medicine, etc., relating to how two important events are related, in some sense reversing a correlation to understand whether it is causation. I don't know why this was lumped into the section on multiple variables, because it's so much more general than that.

With two random variables there's a generalization of variance called covariance: Cov[X, Y] = E[(X − E[X])(Y − E[Y])]. Notice that Cov(X, X) = Var(X). A large covariance means that X tends to be large when Y is large, and vice versa; if X and Y are not very related (no correlation), the covariance is about zero, and if they are oppositely related, the covariance is negative. It can take on any real value depending on the dataset in question.

The covariance matrix Σ of n variables X₁, …, Xₙ has (i, j)-th entry Cov(X_i, X_j) (so the diagonal entries are the variances of these n variables). This allows one to extend the Gaussian to multiple dimensions: in the single-variable case one has

f(x; µ, σ) = (1/(σ√(2π))) e^{−(x−µ)²/(2σ²)},

but in the multidimensional case σ becomes the matrix Σ:

f(x₁, …, xₙ; µ, Σ) = (1/((2π)^{n/2} det(Σ)^{1/2})) exp( −(1/2)(x − µ)ᵀΣ⁻¹(x − µ) ).        (4)

Notice that the mean becomes a vector of means of X₁, …, Xₙ.

7. GENERATIVE LEARNING ALGORITHMS: 10/7/13

Given some data that can be classified into two parts, logistic regression attempts to find a line that separates them; this is an example of a discriminative algorithm. However, one might take a different approach: instead of finding P(y | x) directly (or h(x)), one could learn both P(x | y) and P(y). In the tumor-classification example: what do the features look like in a benign tumor, and in a malignant tumor? Here P(y) is called the class prior, which captures the odds of malignancy before any features are seen. By Bayes' rule, these allow one to compute P(y = 1 | x) = P(x | y = 1)P(y = 1)/P(x), where P(x) = P(x | y = 1)P(y = 1) + P(x | y = 0)P(y = 0).

Gaussian Discriminant Analysis (GDA). For this algorithm, assume the features come from a continuous distribution, so that x ∈ Rⁿ (the convention x₀ = 1 is not used here). Furthermore, it is assumed that P(x | y) is (multivariate) Gaussian.
The multivariate Gaussian is a generalization of the single-variable normal distribution to vectors. It takes the form z ∼ N(µ, Σ), where z, µ ∈ Rⁿ and Σ ∈ R^{n×n}. Then E[z] = µ and Cov(z) = E[(z − µ)(z − µ)ᵀ] = Σ. For example, when n = 2,

Σ = [ E[(z₁−µ₁)²]           E[(z₁−µ₁)(z₂−µ₂)]
      E[(z₂−µ₂)(z₁−µ₁)]     E[(z₂−µ₂)²]        ].

The diagonal entries of Σ are variances: making them smaller concentrates the density around the mean (a thinner, taller bump), while making them larger spreads the data out. The off-diagonal values control the correlation between pairs of variables, distorting the level sets into ellipses; larger off-diagonal values mean the two variables are more strongly correlated. Varying µ moves the center of the distribution. The probability density function is as in (4); nobody actually memorizes it, but make sure it's somewhere you can refer to it.

FIGURE 3. A graph of the multivariate Gaussian density when n = 2. Source: Wikipedia.

For GDA, the model is P(y) = φ^y (1 − φ)^{1−y}, P(x | y = 0) is multivariate Gaussian with mean µ₀, and P(x | y = 1) is multivariate Gaussian with mean µ₁; the same covariance matrix Σ is used in both. Thus, with parameters φ, µ₀, µ₁, and Σ and a training set (x^(1), y^(1)), …, (x^(m), y^(m)), one wishes to maximize the joint likelihood

L(φ, µ₀, µ₁, Σ) = ∏_{i=1}^m P(x^(i), y^(i)) = ∏_{i=1}^m P(x^(i) | y^(i)) P(y^(i)).

Compare this to the likelihood function used in a discriminative algorithm, which looks very different: there one maximizes the conditional likelihood ∏ P(y^(i) | x^(i); θ), because the discriminative model only defines P(y | x), whereas the generative model defines P(x | y) and P(y), so a different tack is taken. After taking the log-likelihood, the maximum-likelihood estimates are

φ = (1/m) ∑_{i=1}^m y^(i) = (1/m) ∑_{i=1}^m 1{y^(i) = 1},
µ₀ = ∑_{i=1}^m 1{y^(i) = 0} x^(i) / ∑_{i=1}^m 1{y^(i) = 0},

which is the mean of all of the examples x^(i) such that y^(i) = 0 (which is what makes the most sense), and µ₁ is calculated in effectively the same way:

µ₁ = ∑_{i=1}^m 1{y^(i) = 1} x^(i) / ∑_{i=1}^m 1{y^(i) = 1}.

Σ is given in the lecture notes, since it's a little messier. Then the prediction for y is

argmax_y P(y | x) = argmax_y P(x | y)P(y)/P(x).

If you haven't seen argmax before, it just means the value of y at which the quantity is maximized: min_z (z − 5)² = 0, but argmin_z (z − 5)² = 5. In some sense, rather than partitioning the plane into two parts, one fits a Gaussian to each of the positive and negative datasets and asks which one is more likely.

The decision boundary is actually different in these two approaches: if one plots P(y = 1 | x) (for example, in a single-variable problem), it has a logistic form as one moves from µ₀ to µ₁, given the assumptions of GDA. However, the converse is not true: if this function is logistic, the assumptions of the generative model might not hold. This means that generative algorithms make stronger assumptions than discriminative ones: specifically, they assume y ∼ Bernoulli(φ) and that x | y = 0 and x | y = 1 are both Gaussian. When those assumptions hold, this model will perform better, but otherwise it might be smarter to try something else. Here's another set of assumptions that makes P(y = 1 | x) logistic: x | y = 0 ∼ Poisson(λ₀), x | y = 1 ∼ Poisson(λ₁), and y ∼ Bernoulli(φ) again; there are many such sets of assumptions that lead to a logistic posterior. Logistic regression can handle any exponential-family class-conditional model, which shows that it is more robust in some cases; generative models, on the other hand, are often more useful for small datasets.
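The maximum-likelihood estimates above translate almost directly into code. A minimal Octave/MATLAB sketch (my own illustration, assuming X is an m×n matrix of real-valued features and y is an m×1 vector of 0/1 labels):

% Gaussian discriminant analysis: maximum-likelihood parameter estimates.
m     = size(X, 1);
phi   = mean(y == 1);
mu0   = sum(X(y == 0, :), 1)' / sum(y == 0);
mu1   = sum(X(y == 1, :), 1)' / sum(y == 1);
Xc    = X - double(y == 1) * mu1' - double(y == 0) * mu0';  % subtract each row's class mean
Sigma = (Xc' * Xc) / m;                                     % shared covariance matrix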
Naïve Bayes. One nice advantage of generative learning algorithms is that they are fairly easy to implement and train, since there's no iterative optimization, and so on. Thus, though they might not be the best algorithms for the job, generative algorithms are useful for quick prototyping (such as for the project). A good example of this is the Naïve Bayes algorithm, which is rarely the best algorithm for the problem, but is fairly quick to set up.

For a motivating example, consider spam detection, in which one examines an email and tries to classify it as spam or ham. More generally, one might want to sort emails into categories such as social, professional, etc., or eBay might wish to classify its auctions into different genres (which is actually what happens; they don't ask the user to choose a category). Emails are interesting because they have different lengths, so obtaining a feature set requires a little thinking. A common way to do this is to take a list of English words {a, aa, …, zymurgy} and let the feature vector of an email be a vector of the size of this list, with a 1 in a given entry if the email contains that word and a 0 otherwise. (A more practical strategy is to take only the 10000 words most commonly seen in the recipient's inbox.) So write n = 10000, so that x ∈ {0, 1}^10000. There are then 2^10000 possibilities for x, so modeling this distribution explicitly over all of these outcomes (e.g. with a multinomial distribution) clearly goes badly, and a different approach is necessary.

Naïve Bayes adds one additional assumption, called the conditional independence assumption or the Naïve Bayes assumption: that the x_i are independent given y. Without any assumption,

P(x₁, x₂, …, x₁₀₀₀₀ | y) = ∏_{i=1}^{10000} P(x_i | x₁, …, x_{i−1}, y),

which is always true; under the Naïve Bayes assumption, this simplifies to

P(x₁, x₂, …, x₁₀₀₀₀ | y) = ∏_{i=1}^{10000} P(x_i | y).

One way of explaining this: suppose x₃ is the word "buy" and x₁₀₀₀ is the word "drugs." Plenty of emails associate the two words; but once you know an email is spam, there's no further correlation between these two words. This is not completely true, but it is convenient enough to be useful, and is the "conditional" in conditional independence.

The parameters of this model are φ_{j|y=1} = P(x_j = 1 | y = 1), the probability of word j appearing in a spam message; φ_{j|y=0} = P(x_j = 1 | y = 0), the probability of word j appearing in a ham message; and φ_y = P(y = 1). Then one can write down the joint likelihood

L(φ_y, φ_{j|y=1}, φ_{j|y=0}) = ∏_{i=1}^m P(x^(i), y^(i)).

Once one has a training set, one can fit the parameters by maximum likelihood:

φ_y = (1/m) ∑_{i=1}^m 1{y^(i) = 1},
φ_{j|y=1} = ∑_{i=1}^m 1{x_j^(i) = 1, y^(i) = 1} / ∑_{i=1}^m 1{y^(i) = 1},        (5)
φ_{j|y=0} = ∑_{i=1}^m 1{x_j^(i) = 1, y^(i) = 0} / ∑_{i=1}^m 1{y^(i) = 0}.        (6)

For the φ_{j|y}, the intuition is to ask: of all emails in this bucket (spam or ham), what fraction of them contain this word? To implement a Naïve Bayes classifier, compute these parameters and then classify by comparing P(x | y = 1)P(y = 1) with P(x | y = 0)P(y = 0), i.e. using P(y | x) ∝ P(x | y)P(y), which follows from the above. In summary, generative models look at each component of the classification separately, forming independent models that lead to the classification.
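A minimal Octave/MATLAB sketch of these estimates and the resulting classifier (my own illustration; X is assumed to be an m×n binary matrix, y an m×1 vector of 0/1 labels, and x a new 1×n example; note that without the Laplace smoothing of the next section, a zero count produces log(0)):

% Naive Bayes with binary features: parameter estimates and prediction.
phi_y  = mean(y == 1);
phi_j1 = sum(X(y == 1, :), 1) / sum(y == 1);   % P(x_j = 1 | y = 1), a 1-by-n vector
phi_j0 = sum(X(y == 0, :), 1) / sum(y == 0);   % P(x_j = 1 | y = 0)

% Compare the two unnormalized log-posteriors log P(x | y) + log P(y).
score1 = sum(x .* log(phi_j1) + (1 - x) .* log(1 - phi_j1)) + log(phi_y);
score0 = sum(x .* log(phi_j0) + (1 - x) .* log(1 - phi_j0)) + log(1 - phi_y);
prediction = score1 > score0;                  % 1 means "spam" under this model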
8. LAPLACE SMOOTHING: 10/9/13

"I actually know the president of [the University of] Arizona pretty well. He's a nice guy, an interesting guy."

The network diagram represents the following computation: a₁ = g(xᵀθ^(1)), a₂ = g(xᵀθ^(2)), a₃ = g(xᵀθ^(3)), and hθ(x) = g(aᵀθ^(4)), where a = (1, a₁, a₂, a₃)ᵀ. Thus hθ(x) is a function of the parameters θ^(1), …, θ^(4), so the goal is to minimize the squared error, or some optimization objective akin to

min_θ (1/2) ∑_{i=1}^m (hθ(x^(i)) − y^(i))².

This algorithm can output nonlinear classifiers because the sigmoid function is nonlinear, so one learns a set of nonlinear features {a₁, a₂, a₃}, on which logistic regression can be performed again. The minimization problem is often solved with gradient descent, though in this specific context it tends to be called back-propagation. However, this optimization problem is nonconvex, so one might not find the global optimum, but it seems to work well enough in practice, if slowly. More complicated neural networks exist, with several layers of stacked "neurons." One use of this was handwriting recognition for zip codes (though today this is performed on the entire address); it was able to handle some pretty pathological examples of digits. Another example is text-to-speech conversion.

9. MATLAB TUTORIAL: 10/11/13

This class heavily stresses MATLAB (or its open-source equivalent, Octave) for the implementation. This is because it makes an effective prototyping tool for quick development and testing; after everything is set up, it can be translated to a faster or lower-level language. Everything in this lecture can be found in the file session.m.

In MATLAB, all numbers are treated as doubles, so 1/2 evaluates to 0.5 rather than being truncated as in integer division. Most operations are the same as in other languages, though "not equal" is ~= and xor is written xor( , ). Strings are denoted with single quotes, and expressions are perfectly fine on the right-hand side of an assignment (e.g. c = 3 >= 1). Every operation produces some sort of output, which can be suppressed by appending a semicolon.

How about printing? If one sets a = pi, which does what one might expect, printing a truncates it to four places. However, the disp and sprintf commands allow one to work around this: disp(sprintf('6 decimals: %0.6f', a)) will print π to six decimal places, preceded by the words "6 decimals." Some commands make this easier: format long permanently sets decimals to display with higher accuracy, and format short returns to four digits.

The power of MATLAB is in its vectors and matrices. For example, the command A = [1 2; 3 4; 5 6] generates the 3×2 matrix with rows (1, 2), (3, 4), and (5, 6); the row vector (1, 2, 3) is generated as v = [1 2 3], and the column vector (1, 2, 3)ᵀ as v = [1; 2; 3]. Vectors can be generated automatically: v = [1:0.1:2] generates a vector that steps from 1 to 2 with a step size of 0.1 (the step size defaults to 1, so 1:6 is equivalent to the array [1 2 3 4 5 6]). Some commonly used matrices have shortcuts: ones(m,n) creates an m×n matrix of entirely ones (which reduces to a vector if m or n is 1). The zeros command does the same thing, but returns the zero matrix. Similarly, the rand function takes arguments in the same way and returns a matrix of random numbers uniformly distributed in [0,1], and the randn function returns random numbers drawn from the standard normal distribution; all of these are called in exactly the same way. The empty matrix or vector, a 0×0 matrix, is given by e = [], and the n×n identity matrix by eye(n).
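A few of those construction commands in one place (an illustrative sketch, not taken from session.m):

A = [1 2; 3 4; 5 6];   % 3-by-2 matrix
v = 1:0.1:2;           % row vector from 1 to 2 in steps of 0.1
B = ones(2, 3);        % 2-by-3 matrix of ones
Z = zeros(3);          % 3-by-3 zero matrix
R = rand(2, 2);        % entries uniform on [0, 1]
N = randn(2, 2);       % entries from the standard normal
I = eye(4);            % 4-by-4 identity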
If a matrix is read from a file (for example) and you don't know its dimensions, you can use the size command, which returns a vector of the dimensions of the matrix (or vector, in which case one of the entries is 1). The length command returns the length of the longest dimension; this is mostly useful for vectors, and can do unexpected things with matrices if you're not careful.

Now, on to loading files: the pwd command does exactly what it does in any actual shell, and so do cd and ls. The command load filename will read the file and store its contents as directed by the file. One can then use the who command to see which variables exist (and therefore which were affected by loading the file), and whos provides more information (e.g. the dimensions of the variables). The clear command deletes a given variable, but without an argument it deletes every variable, so be careful! Dually, save filename variable saves the given variable to the specified file. Note that Octave appends an Octave-specific header unless one passes the ascii flag to save.

The command v = A(1:10, 1) takes the first ten rows of the first column of A and places them into v. Notice that everything is one-indexed! The second row of A is given by A(2, :), and the second column by A(:, 2). The word end is understood to be the last row or column, so A(1, end) returns the last entry of the first row of A. There are lots of interesting indexing conventions in MATLAB; for example, A(1, 3) returns the (1,3) element of A, but A([1, 3], :) returns the first and third rows of A. Matrices can be edited, as in A(:, 2) = [10; 11; 12], or even joined, as in A = [A, [100; 101; 102]]. This is usually efficient if you don't abuse it too strongly. Finally, the notation A(:) returns all of the elements of A, flattened into a single column vector.

Matrix operations are pretty nice. Matrix multiplication (non-pointwise) is A*B, and pointwise multiplication is A .* C; each of these only works when the dimensions are valid. Element-wise powers of a matrix (taking each element to some power) are written A .^ 2, and the element-wise reciprocal is 1 ./ A. Logarithm, exponential, negation, addition of scalars, and absolute value are all element-wise by default. The transpose of a matrix A is A'.

There are lots of functions available by default in MATLAB, some of which will be discussed here. If a is a vector, then max(a) returns the maximum value of the list; alternatively, [val, ind] = max(a) stores the maximum in val and the argmax in ind. The help command accepts functions as arguments and returns documentation on them. The find function returns the indices matching a predicate: for example, find(a < 3) returns the indices of a at which the element is less than 3. The command magic(n) returns an n×n magic square. The sum command allows summing across lists and matrices: if A is a matrix, then sum(A, 1) is the list of its column sums and sum(A, 2) is the list of its row sums. Maximums work the same way: max(A, [], 1) returns the maximum of each column and max(A, [], 2) the maximum of each row. For a vector, sum(a) adds up all of the elements (and prod is analogous); to sum every entry of a matrix, use sum(A(:)). The inverse of a matrix is given by A^-1, and the pseudoinverse of a not-necessarily-invertible matrix by pinv(A).
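The indexing and reduction commands above, in one short sketch (again illustrative, not from session.m):

A = magic(4);             % 4-by-4 magic square
r = A(2, :);              % second row
c = A(:, end);            % last column
S = A([1 3], :);          % first and third rows
colSums = sum(A, 1);      % sum of each column
rowMax  = max(A, [], 2);  % maximum of each row
total   = sum(A(:));      % sum of every entry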
The function isempty returns whether a list or matrix is empty, and numel(A) returns the number of elements in a matrix, akin to the product of the entries of size(A). (Please do not confuse the numel function with the Pokémon Numel.)

One can also do matrix reshaping. Suppose A is a 3×4 matrix; to rearrange A to have four rows and three columns, with the entries kept in the same (column-major) order, use the command reshape(A, [4,3]). The general syntax is the matrix first and then the dimensions. Note that the resulting matrix is not equal to the transpose Aᵀ. More exciting possibilities exist, such as reshape(A, [2,6]).

If A is a matrix and v a vector, then A + v returns A with v added to every column. One can do similar things with repmat(v, [1,4]), which makes a matrix of four columns, each of which is v. A function called bsxfun allows for more nifty binary operations; for example, bsxfun(@plus, A, v) does the same as the operations described above by expanding the third argument into a matrix of the same size as the second argument, and then applying the first argument (the operation) to the pair.

Finally, it will be useful to know how to use the graphical side of MATLAB. (Any good mad scientist understands the importance of plotting.) Let t be some array, such as t = [0:0.01:1], and then y1 = sin(2*pi*4*t), so that y1 contains samples of a function. Then one plots y1 against t as plot(t, y1). More than one graph is allowed on the same plot, but the hold on command must be used to signify this. Further arguments to plot determine the style of the graph; for example, plot(t, y1, 'r--') colors it red (and the rest of the string determines the line style, and so on). Commands such as xlabel, ylabel, and title accept strings and do the obvious thing.

Loops are reasonable: the syntax is for i = arr, where arr is some sort of array, and the loop is terminated with an end statement. The while loop is similar: while length(w) < 3 … end. Control statements are also reasonable: if w(1) == 0 … end. Chains of cascading else statements are written with elseif and finally else.

Finally, we implement the logistic regression update in MATLAB, which provides a useful example of working code. Recall that the gradient was ∇_θ ℓ = ∑_i (y^(i) − hθ(x^(i))) x^(i), leading to the update rule θ := θ + αXᵀ(y − hθ(X)). The important part is that J is no longer part of this, which makes the logic and implementation easier.

% MATLAB functions are a little bit interesting. The function name must be the
% same as the filename.
function [theta, ll] = logistic_grad_ascent(X, y)
% rows of X are training samples
% rows of y are corresponding 0/1 values
% output ll: vector of log-likelihood values at each iteration
% output theta: parameters

alpha = 0.0001;
[m, n] = size(X);
max_iters = 500;

% It's necessary to append a column of ones so that there are the correct
% number of parameters, and so that there is a nonzero intercept term.
X = [ones(size(X,1), 1), X];
theta = zeros(n+1, 1);           % initialize theta

for k = 1:max_iters
    % This requires the file sigmoid.m to be in the same directory.
    % A copy can be found at
    % http://cs229.stanford.edu/section/matlab/sigmoid.m
    hx = sigmoid(X*theta);
    theta = theta + alpha * X' * (y - hx);
    % log-likelihood
    ll(k) = sum(y .* log(hx) + (1 - y) .* log(1 - hx));
end
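As a hypothetical usage example (not from the notes; the synthetic data here is made up), one might call the function above like this:

m = 200;
X = randn(m, 2);                        % two raw features; the intercept column is added inside
y = double(X(:,1) + 2*X(:,2) > 0);      % 0/1 labels from a linear rule
[theta, ll] = logistic_grad_ascent(X, y);
plot(ll); xlabel('iteration'); ylabel('log-likelihood');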
10. SUPPORT VECTOR MACHINES: 10/14/13

Support vector machines (SVMs) are among the most powerful off-the-shelf learning algorithms. The goal is to build a classifier that can handle nonlinear data that doesn't necessarily have a clear margin of separation. (That said, the first algorithm we consider will only work in the case of linearly separable data.)

The Functional and Geometric Margins. The SVM is a classification algorithm. Recall that such a classifier has the form hθ(x) = g(θᵀx): if θᵀx > 0, predict 1, and otherwise predict 0. For a good algorithm, one would want θᵀx^(i) ≫ 0 if y^(i) = 1 and θᵀx^(i) ≪ 0 if y^(i) = 0.

Now, look at the idea of a geometric margin. A data set is separable if it is possible to achieve zero training error; then there are multiple different lines that separate the data, and one would like a way of measuring how "good" a given line is. That measure is the geometric margin. In order to define it, it will be necessary to introduce a slight change in notation, which will follow us throughout the discussion of SVMs:
• Instead of y ∈ {0, 1}, one writes y ∈ {−1, 1}.
• h outputs values in {−1, 1} as well, rather than being a smooth function.
• g(z) is effectively the sign function: g(z) = 1 if z ≥ 0, and g(z) = −1 otherwise.
• The hypothesis is now written h_{w,b}(x) = g(wᵀx + b), where w ∈ Rⁿ and b ∈ R. Notice that w and x are in Rⁿ, not R^{n+1}; a parameter vector θ ∈ R^{n+1} can be separated out into b = θ₀ and w = (θ₁, …, θₙ)ᵀ.

Definition. Using this notation, the functional margin of (w, b) with respect to (x^(i), y^(i)) is γ̂^(i) = y^(i)(wᵀx^(i) + b). The functional margin with respect to the whole data set is γ̂ = min_i γ̂^(i).

A large functional margin is good: if y^(i) = 1, the margin is large when wᵀx^(i) + b ≫ 0, and similarly, if y^(i) = −1, the margin is large when wᵀx^(i) + b ≪ 0. The overall functional margin represents the worst case, the data point that is least confidently classified. One interesting nuance is that if (w, b) is multiplied by some scalar λ, the functional margin is also multiplied by λ, but the separating hyperplane doesn't actually change. Thus, it might seem useful to impose a normalization, replacing (w, b) with (w/‖w‖, b/‖w‖) so that ‖w‖ = 1. However, it will eventually be more useful to allow γ̂ to be rescaled, so this is not done.

Definition. The geometric margin of a data point is its signed distance from the boundary: γ^(i) = y^(i)(wᵀx^(i) + b)/‖w‖. The geometric margin of the data set is, again, γ = min_i γ^(i).

This means that γ^(i) = γ̂^(i)/‖w‖, which is convenient; it captures the idea that normalization is a good thing. During the derivation of the SVM, optimizing the geometric margin will generally make more sense, but occasionally the functional margin will also be useful. Like the functional margin, the geometric margin is better if it is larger, which should be no surprise given their relationship. Notice also that if ‖w‖ = 1, then γ̂ = γ, and more generally rescaling w changes γ̂ without changing γ.

Optimal Margin Classifier. This algorithm works only in special cases, but will be generalized in a lecture or two with something called kernels. The goal here is to maximize the geometric margin, in particular solving for γ, w, and b: specifically, max_{γ,w,b} γ subject to the constraints y^(i)(wᵀx^(i) + b) ≥ γ for all i = 1, …, m and ‖w‖ = 1. If one treats this as an …

Thus, the dual problem is to maximize

Θ_D(α) = −∞ if ∑_i α_i y^(i) ≠ 0, and W(α) = ∑_i α_i − (1/2) ∑_{i,j} y^(i) y^(j) α_i α_j ⟨x^(i), x^(j)⟩ otherwise.

Confusingly, the "otherwise" part of the above equation is also called W. The hard part of this is to recover w from α*.
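The margin definitions are easy to compute; here is a small illustrative Octave/MATLAB sketch (my own, with assumed variable names: X is m×n, y is m×1 with entries in {−1, +1}, and (w, b) is a candidate hyperplane):

% Functional and geometric margins of a candidate hyperplane.
fhat      = y .* (X * w + b);      % functional margin of each example
gamma_hat = min(fhat);             % functional margin of the whole data set
gamma     = gamma_hat / norm(w);   % geometric margin of the whole data set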
Here w may live in a high-dimensional (or even infinite-dimensional) space, but this representation allows the parameters to be found easily, as seen in the lecture notes; it can all be dealt with elegantly using inner products. To actually use this to classify points, one sets w = ∑_{i=1}^m α_i y^(i) x^(i), so that

h_{w,b}(x) = g(wᵀx + b) = g( ∑_i α_i y^(i) ⟨x^(i), x⟩ + b ).

This is kind of nice: as long as it's possible to compute inner products, one doesn't really need to touch the feature vectors themselves, creating a nice increase in efficiency. Once w is known, the decision boundary is a hyperplane orthogonal to it, and which hyperplane in particular is determined by b. Though w ∈ Rⁿ for a large (or infinite) n, b ∈ R, which makes it a little easier to obtain. The precise formula can be found in the lecture notes.

Kernels. This algorithm has an interesting property: it can be written out entirely in terms of inner products of the training examples. That is, the training data is only used inside these inner products. Additionally, the only reason one cares about w* is to make predictions. Thus, if one can compute inner products between examples, it's not necessary to compute everything explicitly. The idea behind kernels is that in circumstances where x^(i) is high-dimensional (or infinite-dimensional), explicitly touching it is bad, but perhaps inner products with it can be computed efficiently. This sounds magical, and indeed only works in a small subset of cases.

Example 11.1. Looking at the usual housing prediction problem, suppose x is the size of a house, and let φ(x) = (x, x², x³, x⁴)ᵀ be some set of nonlinear features added to x, called the feature vector. Then everywhere the algorithm uses ⟨x^(i), x^(j)⟩, it can be replaced with ⟨φ(x^(i)), φ(x^(j))⟩.

The goal is to implement a kernel function K(x^(i), x^(j)) = ⟨φ(x^(i)), φ(x^(j))⟩ that is efficient to compute, even though φ is not as easy to compute (since it is a high-dimensional quantity). Suppose x, z ∈ Rⁿ and K(x, z) = (xᵀz)². This simplifies to

K(x, z) = ( ∑_{i=1}^n x_i z_i )² = ∑_{i,j=1}^n x_i x_j z_i z_j.

In other words, K(x, z) = φ(x)ᵀφ(z), where φ(x) is the vector whose entries are all of the products x_i x_j. Notice that φ(x) ∈ R^{n²}, so it takes O(n²) time to compute φ(x). However, it takes only O(n) time to compute K(x, z), because the inner product is O(n) and then it is just squared.

Not all terms in φ(x) must be quadratic in the entries of x; some could be linear terms (e.g. multiplication by a constant). If one has K(x, z) = (xᵀz + c)^d, this corresponds to a φ(x) containing all monomials in the entries of x of degree up to d, with entries such as x₁x₂₃x₁₄, x₃²x₅, etc. This means that φ(x) has (n+d choose d) features, which is O(n^d) features! Thus computing φ(x) is O(n^d), but K(x, z) is still O(n) time.

Crucial to using these kernels was the fact that once the dual problem was written out, everything was in terms of inner products, meaning that kernels could be substituted in. When one asks whether a function is a valid kernel, the question is equivalent to: is there a φ such that K(x, z) = φ(x)ᵀφ(z)? For example, K(x, z) = −1 won't work. Semantically, an invalid kernel might still be called a kernel, in which case one distinguishes valid kernels from invalid ones.

The Gaussian kernel is one of the most widely used kernels in applications, and is a good one to try if you don't have any other good ideas. It looks like

K(x, z) = exp( −‖x − z‖² / (2σ²) ).
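To see the O(n²)-versus-O(n) point concretely, here is a small Octave/MATLAB sketch (my own illustration) computing the quadratic kernel both through the explicit feature map and directly:

n = 5;
x = randn(n, 1);  z = randn(n, 1);
phi = @(v) reshape(v * v', [], 1);   % explicit feature map: all products v_i * v_j, n^2 entries
k_explicit = phi(x)' * phi(z);       % O(n^2) work
k_direct   = (x' * z)^2;             % O(n) work; the two values agree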
The Gaussian kernel does correspond to a feature map φ(x), but that feature map turns out to be infinite-dimensional. This allows one to implement nonlinear classifiers: for example, if the region where y = 0 lies entirely within the region where y = 1, it might be possible to classify linearly in a very high-dimensional feature space, which projects down to a nonlinear separator in the lower-dimensional space. Thus, a typical path is to take some dataset, use a φ to map it into a higher-dimensional or infinite-dimensional feature space, and then project the classifier back down into Rⁿ, where it is a useful, nonlinear decision boundary.

12. CONVEX OPTIMIZATION I: 10/18/13

Convex optimization is used in many places inside and outside of machine learning, in general wherever an optimization problem exists. These notes are based on Stephen Boyd's. A large amount of the work in convex optimization is formulating the problem as a convex problem; once this is done (typically by algebraic manipulation, such as juggling formulas around and performing variable substitutions), commercial solvers exist to assist.

Definition. A set C is convex if for any two points x, y ∈ C, any convex combination of them (i.e. an element of the form θx + (1 − θ)y for 0 ≤ θ ≤ 1) is also within the set.

This means that for any two points in the set, the line segment between them is contained within the set. Thus a pentagon is a convex set, but a jellybean is not. Other useful examples include Rⁿ and Rⁿ₊ (the set of vectors with nonnegative entries); these can both be proven convex, which is mostly a matter of checking the definition. Balls under some norm are also convex sets: for example, the set {x ∈ Rⁿ | ‖x‖₂ ≤ 1} (for the L2-norm, where the balls are round; for the L1-norm they are diamonds, and for the L∞-norm they are boxes). Another set of examples is affine spaces, sets of the form {x ∈ Rⁿ | Ax = b} for some matrix A ∈ R^{m×n}. This is because

A(θx + (1 − θ)y) = θAx + (1 − θ)Ay = θb + (1 − θ)b = b.

The intersection of two convex sets is convex, but in general the union is not. The set Sⁿ₊ of positive semidefinite matrices is also convex (here Sⁿ ⊂ R^{n×n} is the set of n×n symmetric matrices): if xᵀAx ≥ 0 and xᵀBx ≥ 0, then

xᵀ(θA + (1 − θ)B)x = θxᵀAx + (1 − θ)xᵀBx ≥ 0,

since each of the quantities xᵀAx, xᵀBx, θ, and 1 − θ is nonnegative.

Definition. A convex function on a set S is a function f : S → R such that for any x, y ∈ S and 0 ≤ θ ≤ 1, f(θx + (1 − θ)y) ≤ θf(x) + (1 − θ)f(y). f is called strictly convex if the inequality is strict. If the inequality goes in the other direction, the function is called concave.

The intuition is that if you take the graph of the function, the secant line between any two points lies above the graph. If f is differentiable (i.e. ∇_x f(x) exists), there's an equivalent characterization: f(y) ≥ f(x) + ∇_x f(x)ᵀ(y − x) for any x, y ∈ S. This says that the tangent line lies below the graph, which is equivalent. This is one of the reasons convex optimization is so powerful: with a derivative at a single point, we have a global underestimator of the function. If the function is twice-differentiable, so that ∇²_x f(x) exists, there's another equivalent condition: that the Hessian is positive semidefinite, written ∇²_x f(x) ⪰ 0. (The notation ⪰ has two different meanings: for vectors x, y, x ⪰ y means that x_i ≥ y_i in each component, and for matrices it means positive (semi)definiteness. Neither is a total order, as there exist plenty of pairs x, y with x ⋡ y and y ⋡ x.)

The definition of a convex function can be extended from two points to several, leading to a result called Jensen's inequality: the value of f at a weighted average of points is at most the weighted average of the values,

f( ∑_{i=1}^k θ_i x_i ) ≤ ∑_{i=1}^k θ_i f(x_i),

as long as ∑_{i=1}^k θ_i = 1 and θ_i ≥ 0 for all i.
In other words, Jensen's inequality holds whenever the θ_i form a discrete probability distribution. Thus, it seems reasonable to generalize to a continuous distribution: if p(x) is a probability density on S, then Jensen's inequality states that

f( ∫ p(x) x dx ) ≤ ∫ p(x) f(x) dx.

Finally, there's a statement in terms of expected value: f(E[x]) ≤ E[f(x)]. These are all equivalent statements with the same intuition (the secant line lying above the graph).

Definition. If f : D → R is a convex function, an α-sublevel set is the set {x ∈ D | f(x) ≤ α}.

An α-sublevel set of a convex function is itself convex.

Example 12.1. Examples of convex functions:
• The exponential function f(x) = e^{ax}, because f″(x) = a²e^{ax} ≥ 0.
• The negative logarithm f(x) = −log x, because f″(x) = 1/x² > 0.
• Affine functions f(x) = Ax + b are both convex and concave, because ∇²_x f(x) = 0.
• All norms are convex functions, which follows from the triangle inequality.
• Nonnegative weighted sums of convex functions are still convex. Similarly, the pointwise maximum of a set of convex functions is still convex.

Convex Optimization Problems. Once a problem is placed in this form, off-the-shelf solvers will be able to tackle it. The specific form is to minimize f(x) subject to x ∈ C, where f is a convex function and C a convex set. C is typically written as g_i(x) ≤ 0 for i = 1, …, m and h_i(x) = 0 for i = 1, …, p; here the g_i are required to be convex and the h_i are required to be affine. This can be intuited by looking at 0-sublevel sets: each constraint g_i(x) ≤ 0 is a convex set, and intersecting these with the affine equality sets keeps the feasible region convex. Since f is convex, any local minimum is a global minimum, so the optimal value is well-defined (though there may be more than one argmin). This is generally written p* = min{f(x) | g_i(x) ≤ 0, h_i(x) = 0}. Sometimes f is unbounded below, which is written p* = −∞; correspondingly, p* = ∞ means that the constraints are infeasible.

For an example of a convex optimization problem, a linear program minimizes an affine function cᵀx + d subject to affine constraints Gx ⪯ h and Ax = b. These constraints force x to lie within some polyhedron, and the minimum lies at a vertex. A generalization of this is quadratic programming, in which the goal is to minimize a quadratic function (1/2)xᵀPx + cᵀx + d subject to the same affine constraints; now, however, there's no guarantee that the minimum lies at one of the vertices of the polyhedron. Another generalization is the quadratically constrained quadratic program, in which the constraints are replaced with quadratic ones of the form (1/2)xᵀQ_i x + r_iᵀx + s_i ≤ 0 for i = 1, …, m. Finally, there's yet another generalization called semidefinite programming, which is covered in the lecture notes (or in the convex optimization class).

Using a library called CVX (which is free for non-commercial use), one can solve convex problems in MATLAB. For example, suppose one wants to minimize (1/2)‖w‖₂² + C∑ξ_i such that y^(i)(wᵀx^(i) + b) ≥ 1 − ξ_i for all i, and ξ_i ≥ 0; this sort of problem is useful for SVMs. Then the syntax is as easy as

cvx_begin
    variables w(n) b xi(m)
    minimize 1/2*sum(w.*w) + C*sum(xi)
    y.*(X*w + b) >= 1 - xi;
    xi >= 0;
cvx_end

This fills w with the values in the optimal solution, and prints a fair amount of information (e.g. how many iterations were necessary) to the console.

13. SUPPORT VECTOR MACHINES III: 10/21/13

Mercer Kernels.
To review, we want to maximize the function

W(α) = ∑_{i=1}^m α_i − (1/2) ∑_{i=1}^m ∑_{j=1}^m y^(i) y^(j) α_i α_j ⟨x^(i), x^(j)⟩,

such that ∑_i y^(i)α_i = 0 and α_i ≥ 0. By invoking the KKT theorem, one can relate this dual optimization problem to the primal problem, implying that

h_{w,b}(x) = g( ∑_{i=1}^m α_i y^(i) x^(i)ᵀx + b ).

Defining K(x^(i), x) = x^(i)ᵀx, this can be generalized by sending x ↦ φ(x), so that φ(x)ᵀφ(z) = K(x, z). K is a valid kernel iff such a φ exists. One ought to think of the kernel function as a way of measuring similarity, though not all similarity functions are kernels: if there is an x such that K(x, x) < 0, no φ can satisfy the above equations, so that K is not a kernel. We also have the kernel matrix K given by K_ij = K(x^(i), x^(j)); this matrix is positive semidefinite.

Theorem 13.1 (Mercer). A function K is a valid kernel iff for any {x^(1), …, x^(d)}, the corresponding kernel matrix K ∈ R^{d×d} is symmetric and positive semidefinite.

A valid kernel is also known as a Mercer kernel. The forward direction was done before, and symmetry follows from expanding out K(x, z) = φ(x)ᵀφ(z) = φ(z)ᵀφ(x) = K(z, x). This is a useful test for valid kernels, but there are other tests as well.

…(then AAAB, and so on, up to ZZZZ). This feature vector has dimension 26⁴, which is a lot. However, there exists an efficient algorithm for computing inner products of this feature, since most of the entries are zero, and the (Knuth–)Morris–Pratt dynamic programming approach to string matching is useful for handling this.

Part 2. Learning Theory

14. LEARNING THEORY: 10/23/13

The second of four major topics we will cover in this class is learning theory. This is not deep, beautiful math so much as a deeper understanding of the material, which is very important in understanding why algorithms work or don't, and so on.

Consider a set of five data points in R². One could fit a line to them, which doesn't completely accurately capture the training set. A fourth-degree polynomial will fit the data perfectly, but at the expense of making the model much less appropriate for new data samples. Somewhere in between, a second- or third-degree polynomial might be the most appropriate. These cases are called, respectively, underfitting (the model fails to capture important aspects of the data), overfitting (too much attention paid to the idiosyncrasies of this particular dataset), and a good fit. The underfit model has a problem known as bias, which is almost a mathematical idea; dually, the overfit model has too much variance. These will be explained more precisely when uniform convergence is discussed. If an algorithm is working poorly, it's useful to know whether the hypothesis has high bias or high variance, to determine whether (for example) more or fewer features, respectively, are needed. There are also plenty of other ways to handle these problems. Classification problems can also suffer from high bias or high variance: logistic regression can create some surprisingly complicated separating curves, which can overfit a dataset, and correspondingly insufficiently complicated curves can have high bias.

An important step in the theory is to formulate an abstract model of machine learning. This is something that isn't completely descriptive, but will be very helpful for understanding things and proving theorems nonetheless. Additionally, though the theory works equally well for classification and regression, it's easier to formulate for classification problems, which is what will be done below.
The goal is to classify y ∈ {0, 1} using a hypothesis function hθ taking values in {0, 1}. The training set is S = {(x^(i), y^(i))}_{i=1}^m; for example, hθ(x) = g(θᵀx) for g as defined before, etc. Assume that the samples in S are independently and identically distributed.

Definition. The training error of hθ is

ε̂_S(hθ) = ε̂(hθ) = (1/m) ∑_{i=1}^m 1{hθ(x^(i)) ≠ y^(i)}.

Since ML people aren't good at naming things, this is occasionally also known as the empirical risk. In other words: how often does the hypothesis misclassify the training samples?

Definition. The generalization error is ε(hθ) = P_{(x,y)∼D}(hθ(x) ≠ y). Here D is some hypothetical distribution over all possible input data.

The point of generalization error is for the algorithm to generalize well to data it hasn't seen before. The goal is thus to minimize this error, yet there is no way to measure it directly; thus, the training error is used as an approximation. Under mild conditions, the approximation gets better and better as m increases. One can think of supervised learning as finding the hypothesis with the smallest training error: θ̂ = argmin_θ ε̂(hθ). This is known as empirical risk minimization, or ERM. It isn't implemented as a direct algorithm, as it ends up being an NP-hard problem in many cases, so supervised learning algorithms should be thought of as convex relaxations of this problem.

But this is still too low-level, so abstract away the θ. Let H = {hθ : θ ∈ R^{n+1}}, the class of hypothesis functions under consideration; in the specific example of logistic regression, H corresponds to the set of all straight lines (which correspond to decision boundaries). Then ERM is restated as ĥ = argmin_{h∈H} ε̂(h). (There are many variables wearing hats today; a hat generally denotes an approximation to the hatless quantity, as with ε̂ and ε.) H is the set of functions (that we care about) mapping the x to {0, 1}. This now allows us to tease out the difference between high bias and high variance: in the case of high bias, ε̂(ĥ) is high, and in the case of high variance, ε̂(ĥ) is low (but in both cases ε(ĥ) is high, illustrating that an underfit model is a poor predictor and an overfit model generalizes badly).

Lemma 14.1 (Union Bound). Let A₁, …, A_k be k events (not necessarily independent). Then

P( ∪_{i=1}^k A_i ) ≤ ∑_{i=1}^k P(A_i).

This makes the most sense when one thinks about Venn diagrams: since the A_i may overlap, the probability of their union is at most the sum of their individual probabilities, with equality exactly when they don't intersect. It turns out this is essentially an axiom of probability, called the σ-subadditivity of probability.

Lemma 14.2 (Hoeffding's inequality). Let Z₁, …, Z_m ∼ Bernoulli(φ) be independently and identically distributed random variables and φ̂ = (1/m) ∑_{i=1}^m Z_i. Then, for any γ > 0,

P(|φ̂ − φ| > γ) ≤ 2e^{−2γ²m}.

This lemma is also sometimes known as the Chernoff bound. By the Central Limit Theorem, φ̂ has a roughly Gaussian distribution, and this inequality looks at a specific window of that distribution; as m increases, the variance of φ̂ shrinks.

Suppose H = {h₁, …, h_k} is some finite set (note that this isn't logistic regression anymore). This will make some analysis easier, and can be generalized to infinite H later. In the finite case, ε(ĥ) is bounded by showing that ε̂(h) ≈ ε(h) for all h ∈ H. Fix a hypothesis h_j and let Z_i = 1{h_j(x^(i)) ≠ y^(i)}, the indicator of whether h_j misclassifies the i-th example, so Z_i ∈ {0, 1} and P(Z_i = 1) = ε(h_j).
When Hoeffding's inequality is applied, ε(h_j) plays the role of φ, and φ̂ is

ε̂(h_j) = (1/m) ∑_{i=1}^m Z_i = (1/m) ∑_{i=1}^m 1{h_j(x^(i)) ≠ y^(i)},

so by Lemma 14.2, P(|ε̂(h_j) − ε(h_j)| > γ) ≤ 2 exp(−2γ²m). Let A_j be the event that |ε̂(h_j) − ε(h_j)| > γ. Then, using the union bound,

P(∃ h_j ∈ H s.t. |ε̂(h_j) − ε(h_j)| > γ) = P( ∪_{j=1}^k A_j ) ≤ ∑_{j=1}^k P(A_j) = ∑_{j=1}^k 2e^{−2γ²m} = 2k e^{−2γ²m}.

Thus, there is a very good probability that there is no h_j for which the difference between the training error and the generalization error is large: formally,

P(for every h_j, |ε̂(h_j) − ε(h_j)| ≤ γ) ≥ 1 − 2k e^{−2γ²m}.

This result is known as a uniform convergence bound, because it applies to every hypothesis simultaneously. Let δ = 2k e^{−2γ²m}, so that the probability becomes 1 − δ. There are three parameters floating around this result: γ, the error; m, the size of the training set; and δ, the failure probability. It is possible to solve for any one in terms of the other two, which is useful for understanding the uniform convergence. This is explained in greater detail in the lecture notes, but for example, if

m ≥ (1/(2γ²)) log(2k/δ),

then with probability 1 − δ, |ε(h) − ε̂(h)| ≤ γ for all h ∈ H; this is known as a sample complexity bound. The final form, solving for γ, gives an error bound: for fixed m and δ, with probability at least 1 − δ,

|ε̂(h) − ε(h)| ≤ sqrt( (1/(2m)) log(2k/δ) ) =: γ   for all h ∈ H.

Let h* = argmin_{h∈H} ε(h). This is impossible to determine definitively, but it is approximated by ĥ, the hypothesis with the lowest training error. Thus ε̂(ĥ) ≤ ε̂(h) for any other h, so

ε(ĥ) ≤ ε̂(ĥ) + γ ≤ ε̂(h*) + γ ≤ (ε(h*) + γ) + γ = ε(h*) + 2γ.

This also uses the fact that the training and generalization errors are within γ of each other for all hypotheses. Thus, the bound on the generalization-error difference is 2γ. The above discussion can be compacted into a theorem.

Theorem 14.3. Let |H| = k, and fix m ∈ N and δ ∈ (0, 1). Then, with probability 1 − δ,

ε(ĥ) ≤ min_{h∈H} ε(h) + 2 sqrt( (1/(2m)) log(2k/δ) ),

where the first term on the right is ε(h*) and the second is 2γ.

This leads to an understanding of bias versus variance: if the hypothesis class is too narrow, then the best (i.e. minimal) ε(h) isn't very good, so the first term on the right-hand side is the bias term. The second term is a variance term: with so many hypotheses, it's not clear how to choose between them.

VC Dimension. One shortcoming of this approach is that it requires H to be a finite set. A notion called VC dimension can be used to handle the infinite case.

Definition. Suppose H is some infinite hypothesis class, and let d points be given. If H can realize any of the 2^d labelings on the set S of these d points, then H is said to shatter S.

In other words, for any possible labeling, H can achieve zero training error on S. Notice that if H is the set of linear classifiers on R², then it can shatter some sets of 3 points, but no set of four or more points. Intuitively, this says something about its power.

Definition. The VC dimension of a hypothesis class H is the size of the largest set it can shatter.

As noted above, the set of linear decision boundaries in R² can shatter some sets of three points (but, significantly, not all!) and no set of four points, so VC(H) = 3 in this case. (Logistic regression does better in higher dimensions.) This can be used as an alternative way of measuring the complexity of the hypothesis class, rather than just counting hypotheses; a higher VC dimension makes overfitting more likely, and a lower VC dimension correlates with bias.
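The sample-complexity bound is easy to evaluate; a quick Octave/MATLAB sketch (with made-up numbers) for the smallest m guaranteeing |ε(h) − ε̂(h)| ≤ γ for every hypothesis with probability at least 1 − δ:

k = 1e6; gamma = 0.05; delta = 0.01;               % hypothesis-class size, tolerance, failure probability
m = ceil(1 / (2 * gamma^2) * log(2 * k / delta))   % smallest training-set size satisfying the bound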
15. CONVEX OPTIMIZATION II: 10/25/13

To review, a convex optimization problem is to minimize a convex function f(x) subject to the constraint that x lies within some convex set C. Typically, C is presented in terms of some convex inequality constraints g_i(x) ≤ 0 for i = 1, . . . , m, and affine equality constraints h_i(x) = 0 for i = 1, . . . , p. One of the advantages of using a library such as CVX is that one can just as easily obtain the dual solution (and therefore a bound for the primal problem) from the problem.

Though one might have seen the Lagrangian before, it's a little more interesting with the inequality constraints (the g_i). The Lagrangian itself is

L(x, α, β) = f(x) + ∑_{i=1}^m α_i g_i(x) + ∑_{i=1}^p β_i h_i(x).

These α_i and β_i can be thought of as the costs for violating the constraints: a high value of α_i means that the corresponding constraint is particularly significant. The primal problem then becomes

min_x ( max_{α,β : α_i ≥ 0} L(x, α, β) ) = min_x Θ_P(x).   (8)

If the constraints are realizable, the problem is called primal feasible, and the solution is denoted p∗ = Θ_P(x∗). The dual problem is the same, but with the maximum and minimum switched:

max_{α,β : α_i ≥ 0} ( min_x L(x, α, β) ) = max_{α,β : α_i ≥ 0} Θ_D(α, β).

If α and β admit such a solution, then the problem is called dual feasible, and the solution is written d∗ = Θ_D(α∗, β∗). Notice how this doesn't depend on x.

If you stare at the constraints for long enough in the right way, it becomes clear that if the constraints can't be met at a given x, it's possible to make the inner maximum in (8) arbitrarily large. Thus, one gets the formula Θ_P(x) = f(x) if x is feasible, and Θ_P(x) = ∞ otherwise. This is the same problem as before, but with the constraints embedded into the objective, in some sense.

Model Selection. The motivation of model selection is to choose what sort of H to classify with. One might have this parameterized by the number of parameters, the bandwidth, the SVM quantity C, etc. One algorithm that doesn't work is to simply minimize the training error. Instead, there's a technique called holdout cross-validation (also called simple CV), in which one randomly splits the data into two sets S_train (with about 70% of the data) and S_CV (the remaining 30%). Then, train each model on S_train, test it on S_CV, and choose the model that performs best. Notice how this mitigates the issue of overfitting relative to the simpler algorithm first described. Finally, once the model is selected, one can either output the model and then retrain on the whole dataset, or just stop with the hypothesis trained on S_train. Either of these will end up working, and for some of the less stable learning algorithms (none of the ones yet discussed in this class), not retraining can help preserve stability. Not retraining can also save time or computation.

One criticism of this is that in some cases, data may be very expensive or hard-won, and leaving out some of it is really bad. Similarly, one might want to fit a more complicated hypothesis, which requires more data. Thus, one could use a technique called k-fold cross-validation, which uses more of the data but is more computationally expensive. The data is split into k pieces, and then, for 1 ≤ i ≤ k, train on the other k − 1 pieces and test on the remaining piece, the i-th piece. Then, average the errors from the k steps. The choice of k = 10 is by far the most common. Thus, there are Mk different training runs to iterate through if M models are considered, which is a lot, but it uses all of the data; a minimal sketch of the procedure is given below. The final idea that will be discussed is leave-one-out cross-validation.
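Before that, here is a minimal Octave/MATLAB sketch of k-fold cross-validation over M candidate models. Here trainModel and evalError are hypothetical stand-ins for whatever training routine and error metric are actually in use, X is the m × n design matrix, and y holds the labels; setting k = m recovers leave-one-out cross-validation.

k = 10;
m = size(X, 1);
folds = mod(randperm(m), k) + 1;            % randomly assign each example to one of k folds
cvErr = zeros(M, 1);
for model = 1:M
  errs = zeros(k, 1);
  for i = 1:k
    testIdx  = (folds == i);                % hold out the i-th fold
    trainIdx = ~testIdx;                    % train on the other k - 1 folds
    theta    = trainModel(model, X(trainIdx, :), y(trainIdx));
    errs(i)  = evalError(theta, X(testIdx, :), y(testIdx));
  end
  cvErr(model) = mean(errs);                % average error over the k folds
end
[~, best] = min(cvErr);                     % choose the model with the lowest CV error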
Leave-one-out cross-validation is useful when data is difficult to obtain and very important to use (e.g. in the medical field, where each data point represents a quantifiable amount of human suffering). It is just m-fold cross-validation: each training example is tested after a model has been trained on the other m − 1 training examples. This is even more computationally expensive, so it should be avoided on larger data sets (e.g. 150 or more examples). In general, increasing k in k-fold cross-validation gives better but increasingly diminishing returns, and is more and more computationally expensive.

One common subcase of model selection is feature selection. Take as an example some text classification problem with 10000 features (e.g. a vocabulary of 10000 words); some of these words are obviously going to be more relevant than others (such as "the" and "of," which don't really indicate whether an email is spam or not). The naïve way to select features gives 2^10000 models, since there are that many subsets of the feature set! But even halving the number of features reduces the VC dimension and the probability of overfitting, so this would be a good thing to accomplish.

An algorithm called forward search is a heuristic that avoids running through all of these subsets; it can be considered an approximate search algorithm. It proceeds as follows: start with the empty feature set, then pick the single feature that reduces the cross-validation error the most and add it. This requires iterating through all of the candidate features, but that is linear, not exponential! Then, repeat, adding the next most useful feature, and then the next, and so on. This is an example of a greedy algorithm, and it's not guaranteed to find the optimal set of features, just a (generally) pretty good one. A related algorithm called backwards search starts with the full set of features and removes the worst ones one at a time. This is described more in the lecture notes, but is less commonly used than forward search.

The problem with forward search is that it's still fairly computationally expensive, even though it's a drastic improvement over the exhaustive search. Thus, a yet faster algorithm can be used on large data sets, called filter feature selection. In this algorithm, for each i, calculate a score that indicates how useful x_i is for estimating y; then, select the k best features under this score. One reasonable score is the correlation between x_i and y, though it's far more common to use MI(x_i, y), the mutual information between x_i and y, defined (for binary features and labels) as

MI(x_i, y) = ∑_{x_i∈{0,1}} ∑_{y∈{0,1}} p(x_i, y) log [ p(x_i, y) / (p(x_i) p(y)) ] = KL( p(x_i, y) ‖ p(x_i) p(y) ),

i.e. how different the joint distribution is from the product of the two marginal distributions. This concept arises in information theory. In essence, one computes the scores and then picks the k best features. But how many features do you want? One good strategy is to choose k itself by cross-validation.

17. BAYESIAN STATISTICS: 10/30/13

The techniques we discussed last time — model selection, feature selection, and so on — were ways of simplifying a problem to avoid overfitting. However, there's an alternate approach to this that uses ideas from Bayesian statistics. Thus far, the calculation of the MLE for logistic regression has proceeded in a frequentist manner: the parameters θ are viewed as fixed values that definitely exist, and the goal is to estimate them.
Alternatively, one could imagine that the universe (or God) chose θ at random, so there is no single fixed true value. The uncertainty in θ is captured by a prior distribution P(θ). In the case of logistic regression, suppose the prior is θ ∼ N(0, τ²I). Then, given a training set S = {(x^(i), y^(i))}_{i=1}^m, the goal is to calculate P(θ | S) = P(S | θ)P(θ)/P(S), and one might choose argmax_θ P(θ | S). This might seem like meaningless semantics, but in the history of statistics, this debate was quite significant!

The argmax actually ends up being^28

argmax_θ P(θ | S) = argmax_θ P(S | θ)P(θ) = argmax_θ ( ∏_{i=1}^m P(y^(i) | x^(i), θ) ) P(θ),

so after taking logs, one obtains

argmax_θ ∑_{i=1}^m log P(y^(i) | x^(i), θ) − λ‖θ‖²,   (10)

where λ = 1/(2τ²). Notice this is very similar to the MLE for θ, except for the −λ‖θ‖² term. That term equates to shrinking θ, which often makes the hypothesis less complicated. In general, adding a term like −λ‖θ‖² to an objective to prevent overfitting is called regularization. It turns out that for text classification with logistic regression, regularization is a useful way to get more out of the same data; it yields a pretty good text classification algorithm. Since regularization is easy to implement, it's a great choice for decreasing overfitting. However, one needs a good value of λ: if it is too big, the hypothesis is underfit, so try several values of λ with cross-validation to pick the best one. Equation (10) is also known as the MAP (maximum a posteriori) estimate.

28 Here, θ is considered a random variable, so one can condition on it: one writes P(y^(i) | x^(i), θ) rather than P(y^(i) | x^(i); θ), since θ is not a fixed parameter.

Online Learning. Though it's unlikely to come up in the course project, online learning is very useful in the real world. Thus far, we have considered batch learning, in which the training data and testing data are kept separate. Online learning takes a different approach, where test examples can also be used to train the algorithm. For example, web applications of machine learning (especially ad networks) use feedback from what users do to sharpen their own algorithms: user x^(1) shows up, and the algorithm makes a prediction ŷ^(1); then, the user does something, giving an actual label y^(1). This data is fed back into the algorithm, and the next user arrives, creating more data points x^(2), ŷ^(2), y^(2), etc.

Definition. The total online error of such an algorithm is ∑_{i=1}^m 1{ŷ^(i) ≠ y^(i)}.

One example of an online algorithm is the perceptron algorithm, in which θ = 0 at the beginning; then, for the i-th example, update θ_j := θ_j + α(y^(i) − hθ(x^(i))) x^(i)_j, where h uses the sign (threshold) function rather than a logistic function. Thus, it isn't in fact a gradient-descent algorithm, but it's close, and it's very quick to compute. One interesting theorem (deferred to the lecture notes) shows that as long as ‖θ‖ stays small, the perceptron won't overfit, even if x lies in some infinite-dimensional space.

Advice for Applying Machine Learning Algorithms. In the real world, machine learning leads to a huge number of choices, and it's very helpful to learn some strategies to get learning algorithms to work in different settings. This section, while not very mathematical, may be among the hardest conceptually. The key ideas are diagnostics for debugging learning algorithms, error analysis and ablative analysis, how to approach starting an algorithm, and the idea of premature optimization.

For example, suppose one wants to make a spam classifier with a small set of 100 words, using Bayesian logistic regression as in (10).
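As a rough illustration of what fitting (10) looks like, here is a minimal Octave/MATLAB sketch using batch gradient ascent. The learning rate alpha, the value of lambda, and the iteration count are arbitrary choices, and X (an m × n design matrix including an intercept column) and y (an m × 1 vector with entries in {0,1}) are assumed to be given.

g = @(z) 1 ./ (1 + exp(-z));                 % logistic function
theta = zeros(size(X, 2), 1);
alpha = 0.01; lambda = 1;
for iter = 1:500
  h = g(X * theta);                          % predicted probabilities, m x 1
  grad = X' * (y - h) - 2 * lambda * theta;  % gradient of the objective in (10)
  theta = theta + alpha * grad;              % ascend the regularized log-likelihood
end

In practice one usually would not regularize the intercept term, and λ itself would be chosen by cross-validation, as discussed above.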
However, it’s consistently getting 20% error, which is unacceptable. What next? Here are some ideas. • You can always throw more data at the problem. • Try a smaller set of features. 100 is still a large number of features. • Sometimes, you need a larger set of features. Occasionally, this works. • Try using different features, e.g. including the email header. • Try running gradient descent for more iterations. • Maybe Newton’s method? • Use a different value of λ. • Try with an SVM. The first, best thing to do is to consider multiple options. This already places you ahead of many teams. To determine whether the classifier has high bias or high variance, calculate the training error and the test error: if the training error is significantly lower, the variance is the problem. This makes a significant difference in which approach one should take. One useful diagnostic is the learning curve, where m is plotted against the test error. See Figure 5 for more information. In the case of high variance, it is useful to add more training examples or try a smaller set of features. However, trying a larger set of features or switching out the features fixes high bias. Knowing which one to test will be very helpful and save a lot of time. But in the case of high bias, where the test curve rapidly flattens out, and thus throwing more data at the problem will not fix it. 28Here, θ is considered a random variable, so one can condition on it, so one writes P (y(i) | x(i),θ) rather than P (y(i) | x(i);θ); θ is not a parameter. 36 FIGURE 5. Left: a typical learning curve with high bias, in which there is a significant difference between the training error and testing error. Right: a typical learning curve with high bias, where the training error is about equal to the testing error. The plots might be noisier in practice, but the shape is fairly characteristic. Source: CS 229 Lecture Slides Note that there are different diagnostics than just bias vs. variance, though it is most common. The next most common is the issue of convergence of the learning algorithm. For example, consider running a spam classifier that gets 2% error on both spam and non-spam; the former is OK, but we don’t want to filter valid messages so frequently! Suppose also that an SVM works, but because of online learning it’s better to go with the Naïve Bayes approach. Then, it might be worth checking how well the algorithm converges, graphing the number of iterations of the algorithm against the objective. However, it’s generally hard to tell where the plot flattens. Alternatively, are you even classifying the right function? One could adjust the tradeoffs by assigning weights to the different classifier values to promote misclassification of spam versus non-spam. For Bayesian logistic regression, one could also lower the norm by a different λ, for an SVM the constant C can be pushed around, etc. If one has multiple models (e.g. Bayesian logistic regression and an SVM), and the goal is to maximize α(θ), one can look at the objective function J (θ) of the Bayesian algorithm. One can determine whether there is a good correlation between J and α: if optimizing one has no effect on the other, you have the wrong function! In some sense, the accuracy function α(θ) is intractable: it’s hard to optimize it, which is why J is used as a substitute. 
If J tracks α well, then J is a good surrogate but the optimization algorithm isn't working; correspondingly, it could be that J is being optimized well but is simply off the mark as a surrogate for α, in which case one wants a different J. In this running example, the specific diagnostic is to compare J(θ_SVM) and J(θ_BLR), given that α(θ_SVM) > α(θ_BLR). If J(θ_SVM) > J(θ_BLR), then Bayesian logistic regression is failing to maximize its own objective, and the issue is with the algorithm. If instead J(θ_SVM) ≤ J(θ_BLR), then Bayesian logistic regression is maximizing J just fine and still losing on α, so optimizing J is unrelated to optimizing α, and the problem lies with the objective function. Once this is known, solutions exist: increasing the number of iterations or using Newton's method in place of gradient ascent fixes the algorithm, while changing the parameter λ or switching to an SVM addresses the objective function J.

Part 3. Unsupervised Learning

18. CLUSTERING: 11/4/13

Before moving into the unsupervised learning material, there will be a bit more advice on applying machine learning. But first, a helicopter! An autonomous helicopter is flown in a simulator and controlled with reinforcement learning. There is some reward function, e.g. R(s) = −‖s − s_desired‖², where s is the position of the helicopter, and the goal is to maximize V^π_RL = E[R(s_0) + · · · + R(s_T)] and thereby obtain some policy π_RL. Now, what happens if the resulting controller π_RL does much worse than a human pilot? How might one fix that?^29 Suppose that:

1. The helicopter simulator is accurate.
2. The reinforcement learning algorithm correctly controls the helicopter so as to maximize the expected payoff (i.e. minimize the cost function).
3. Maximizing the expected payoff corresponds to correct autonomous flight.

These seem like they ought to force it to fly well, so which assumption (or assumptions) is wrong? Here are some diagnostics.

• If π_RL flies well in the simulator but not in the real world, then the simulator is not very accurate.
• If the human pilot does better on the cost function than π_RL (in terms of minimizing the sum-of-squares error), then the reinforcement learning algorithm isn't actually maximizing the reward function, and a different algorithm (or a variant) should be used.
• If π_RL does at least as well as the human on the cost function but still flies worse, then maximizing this reward function doesn't correspond to good autonomous flight, and the reward function should be changed.

29 This is not very straightforward; improving this specific algorithm made for at least two PhD theses.

PCA can also be used for an application called matching, or data cleaning. If one has noisy data (e.g. in image recognition), two images can be compared by computing their Euclidean distance in R^n, but if PCA captures intrinsic features of the objects, then comparing their distances in the subspace is often cleaner, and can be used to determine whether two pictures correspond to the same object. This is all nice and wonderful, but before using PCA, one should check what happens with the original data; this is a great example of premature optimization. It's not that PCA is a bad algorithm; it's very useful, but it is also somewhat overused.

Now, there are four unsupervised learning algorithms to choose from; when is each best? It depends what one wants to do: they can be divided into models of P(x) and non-probabilistic models, and additionally the goal might be to find lower-dimensional subspaces in the data, or clusters. Specifically:

• If one wants to model P(x) and use subspaces, use factor analysis.
• If one wants to use subspaces but non-probabilistically, use PCA.
• If one wants to model P(x) and obtain clusters, use a mixture of Gaussians.
• If one wants to use clusters but non-probabilistically, use k-means.
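Of these four, k-means is simple enough to sketch in a few lines of Octave/MATLAB. This is a generic version of the standard algorithm rather than code from the course; X is an m × n data matrix, the number of clusters k and the fixed iteration count are arbitrary choices, and the centroids are initialized at random examples.

k = 3;
m = size(X, 1);
mu = X(randperm(m, k), :);                   % initialize centroids at k random examples
for iter = 1:100
  % assignment step: label each point with its nearest centroid
  d = zeros(m, k);
  for j = 1:k
    d(:, j) = sum(bsxfun(@minus, X, mu(j, :)).^2, 2);
  end
  [~, c] = min(d, [], 2);
  % update step: move each centroid to the mean of the points assigned to it
  for j = 1:k
    if any(c == j)
      mu(j, :) = mean(X(c == j, :), 1);
    end
  end
end

In practice one would run this to convergence rather than for a fixed number of iterations, and restart from several initializations, keeping the clustering with the lowest distortion, since k-means only finds a local optimum.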
22. INDEPENDENT COMPONENTS ANALYSIS: 11/18/13

To recall, suppose we have a cumulative distribution function F(s) = P(S ≤ s), so that the density is p_S(s) = F′(s). In the independent components analysis problem, one has n sources (speakers) recorded by n microphones to separate out, where s^(i)_j is the emission of speaker j at time i. Then, one has s^(i), x^(i) ∈ R^n, where x^(i) = A s^(i) for some mixing matrix A, and the goal is to find its inverse, the unmixing matrix W = A^(−1).

A good way to do this is to write down P(x^(i); W) and do maximum likelihood estimation. One could try to use a density function p_s(s) and write p_x(x) = p_s(W x), but this is wrong. Consider s ∼ Uniform(0,1), and let x = 2s. Then x ∼ Uniform(0,2), so p_x(x) = (1/2) p_s(W x), where W = 1/2. Thus, the correct formula looks like this: p_x(x) = p_s(W x) |det(W)|.

In the ICA model, the components of s are assumed to be independent, so p(s) = ∏_{j=1}^n p_s(s_j), and therefore

p(x) = |det(W)| ∏_{j=1}^n p_s(w_j^T x),

where w_j^T is the j-th row of W. Since one wants to write down a CDF that isn't Gaussian, a good choice is the logistic function g(s) = 1/(1 + e^(−s)). This is nicer than the Gaussian because it has "fatter tails": its density tends to zero less quickly than the Gaussian's, which is useful because these sorts of applications sometimes produce very large values. Then, the log-likelihood is

ℓ(W) = ∑_{i=1}^m ( ∑_{j=1}^n log p_s(w_j^T x^(i)) + log |det(W)| ),

and one uses gradient ascent as usual: W := W + α ∇_W ℓ(W). The actual calculation of the gradient is deferred to the lecture notes.

One interesting application of this is separating out EEG data, removing, for example, irregularities induced by heartbeats; this is actually the standard use in industry. Another interesting application that is a little more crazy-futuristic is to use ICA to find independent components in image data. . . which turn out to correspond to edges in the image. This is a surprisingly effective way to find them, though a little more has to be done to make it work.

Moving on to an introduction to reinforcement learning: I'm not sure exactly what it is yet, but you use it to fly helicopters. The idea here is not to supply the right answer to the problem, but instead some sort of reward function, much like the Pavlovian training of a dog. For example, in a game of chess, there isn't generally a "right" move, but one can design a reward function that favors good moves.

Definition. A Markov decision process (MDP) is a 5-tuple (S, A, {P_sa}, γ, R), where:
• S is some set of states,
• A is a set of actions,
• P_sa is a state transition distribution (for each state s and action a, the probabilities P_sa(s′) over next states sum to 1),
• γ is a discount factor, and
• R is a reward function.

As an example, suppose one has a 3 × 4 grid with a robot^30 that can move in any cardinal direction: A = {N, S, E, W}. At grid location (3,1), taking the action N, it has a probability 0.8 of moving north and 0.1 of moving each of east or west; this is written P_{(3,1),N}((3,2)) = 0.8, P_{(3,1),N}((4,1)) = 0.1, etc. Then, each square has a reward associated with it, e.g. R((4,3)) = 1, R((4,2)) = −1, and so on. Generally, one might set R(s) = −0.02 for all other states s, so that the robot doesn't wander around too much.

30 Unfortunately, this isn't Karel.

Part 4. Reinforcement Learning and Control

23. INTRODUCTION TO REINFORCEMENT LEARNING: 11/20/13

Recall that a Markov decision process is a tuple (S, A, {P_sa}, γ, R) as defined before, and the goal is to find a policy π : S → A that maximizes the reward. The value of a state under a policy π is the expected value of the immediate reward plus γ times the future reward:

V^π(s) = E[ R(s_0) + γ R(s_1) + · · · | π, s_0 = s ].
Note that since π isn't a random variable, it's an abuse of notation to condition on it. One good way to rewrite the value function is Bellman's equation:

V^π(s) = R(s) + γ ∑_{s′} P_{s,π(s)}(s′) V^π(s′).

The optimal value function is V∗(s) = max_π V^π(s). Basically, one plans to keep acting optimally, so both instances of V^π in the previous equation can be replaced with V∗. In other words,

V∗(s) = R(s) + max_a γ ∑_{s′} P_sa(s′) V∗(s′).

Then, the optimal policy is π∗(s) = argmax_a ∑_{s′} P_sa(s′) V∗(s′). Of course, naïvely searching over all policies is exponential in the number of states (there are |A|^|S| of them), and is not pleasant to compute. A better idea is an algorithm called value iteration:

• Initialize V(s) = 0 for all s.
• Repeatedly, for every s, update V(s) := R(s) + max_a γ ∑_{s′} P_sa(s′) V(s′).

There's actually a slight ambiguity in the above algorithm: one could update V(s) for all states at once, which is called synchronous updating, or use each update in the computation of the next, which is called asynchronous updating. Both of these converge to the optimal value, though for the synchronous update it's much easier to show (and, correspondingly, the asynchronous version is easier to implement). The synchronous update can be thought of as V := B(V) for some operator B. Then, π∗ can be found by substituting into one of the previous equations.

This is all fine if all of the quantities are known, but sometimes this isn't the case: what if P_sa and/or R are unknown? Typically, the best idea is to estimate them from collected data, e.g. estimating P_sa(s′) as the fraction of times that taking action a in state s led to s′. Some state-action pairs may never appear in the data, so rather than dividing by zero, initialize those estimates to 1/|S|. In summary, repeat the following:

• Take actions using π to get experience in the MDP.
• Update the estimates of P_sa.
• Use value iteration to estimate V∗.
• Update π(s) = argmax_a ∑_{s′} P_sa(s′) V∗(s′).

Okay, excellent. But often, the state space is continuous, which is a wrinkle that must be dealt with in many real-world applications such as autonomous driving, where there is a six-dimensional state space: x, y, z, ẋ, ẏ, and ż. For helicopters, one also needs three axes of rotation and their derivatives. One tends to write s ∈ R^n, though technically you would want to use the rotation group instead of some of the components.

Generally, one discretizes the problem by taking a large finite number of the continuous states and using them as the set S. Discretization isn't just for reinforcement learning: for a regression model, one could discretize at the data points, obtaining a piecewise constant function that is fairly close to the original one. The advantage of discretization is that it's easy to implement and greatly simplifies the problem. However, as the number of dimensions increases, it takes more and more points, increasing exponentially; this is known as the "curse of dimensionality." It's more pervasive than one might think: if the state space is even six-dimensional, one must start to be careful about how the dimensions are discretized, e.g. picking more points in one direction than another. Anything much past R^8 is out of the question.

In these cases, it is better to approximate V∗(s) as a function of s. Here, it's necessary to assume there is a model or simulator of the MDP, i.e. a black box that takes a state s and an action a as input and outputs an s′ ∼ P_sa. Models could come out of physics simulators; alternatively, one can just use machine learning to learn a model.

24. STATE ACTION MODELS AND LINEAR DYNAMICAL SYSTEMS: 12/2/13

"If you close your eyes and drive your car. . . don't do this. . . you can kind of stay in your lane for a few seconds.
When I was younger, to impress my date. . . don't ever do this. . . I was driving with the lights off at night. Well, it turns out that driving with your headlights off is a bad idea."

Recall that we talked about the value iteration algorithm, where V(s) gets iteratively updated as V(s) := R(s) + max_a γ ∑_{s′} P_sa(s′) V(s′). Then, V → V∗ asymptotically, and π∗(s) = argmax_a ∑_{s′} P_sa(s′) V∗(s′) recovers the optimal policy.

In a state-action reward model, one has R : S × A → R rather than just R : S → R, i.e. the actions are factored into the reward too. This could incentivize a robot to stay in place if it costs too much power to move, etc. Then, one has the following form of Bellman's equation:

V∗(s) = max_a [ R(s, a) + γ ∑_{s′} P_sa(s′) V∗(s′) ],

in other words a sum of the immediate reward and the anticipated future payoff; the factor γ controls how these are balanced.

Definition. A finite horizon MDP is a 5-tuple (S, A, {P_sa}, T, R), where most of the quantities are the same as in a standard MDP, but T is a horizon time, corresponding to some finite amount of time the model is allowed to run (e.g. a plane with a finite amount of fuel).

This means that the optimal action isn't always just a function of the state. For example, if a reward of +1 is close by and a reward of +100 is far away, then which one to go for depends on how much time is left. One says that the optimal policy is non-stationary, i.e. time-dependent. Correspondingly, the transition probabilities may be non-stationary, s_{t+1} ∼ P^(t)_{s_t a_t}, and the reward function may also be non-stationary, as R^(t)(s_t, a_t). So now

V∗_t(s) = E[ ∑_{i=t}^T R^(i)(s_i, a_i) | π∗, s_t = s ].

Value iteration now becomes a dynamic programming algorithm:

V∗_t(s) = max_a [ R^(t)(s, a) + ∑_{s′} P^(t)_sa(s′) V∗_{t+1}(s′) ],

where the last term is evaluated at time t + 1 because time advances when calculating the future payoff. There is no discount factor here because, by convention, finite horizon MDPs usually omit it. The above equation is the inductive step; the base case is V∗_T(s) = max_a R^(T)(s, a), and then one computes V∗_t for decreasing values of t. Then, one can compute π∗ in the usual way:

π∗_t(s) = argmax_a [ R^(t)(s, a) + ∑_{s′} P^(t)_sa(s′) V∗_{t+1}(s′) ].

This means that the policy is time-dependent, finessing the problem raised earlier.

This finite horizon model really opens up the range of problems that can be handled by reinforcement learning. It leads into an algorithm called linear quadratic regulation (LQR), which computes the optimal value exactly; it only works for some problems, though. Let the state space be S = R^n and the set of actions be A = R^d, which is completely reasonable in a lot of cases. LQR then makes the stricter assumption that P_sa is given by

s_{t+1} = A^(t) s_t + B^(t) a_t + ω_t,

where (leaving off the non-stationarity to simplify the derivation) ω_t ∼ N(0, Σ_ω), so s_{t+1} | s_t, a_t ∼ N(A s_t + B a_t, Σ_ω), with A ∈ R^{n×n} and B ∈ R^{n×d}. Then, the reward function is R(s_t, a_t) = −(s_t^T U s_t + a_t^T V a_t), where U ∈ R^{n×n} and V ∈ R^{d×d} are both positive semidefinite, so the reward function is always nonpositive (and the goal is to keep it close to zero). This can express the desire that s_t ≈ 0: if U = I and V = I, then R(s_t, a_t) = −‖s_t‖² − ‖a_t‖². Thus, to fit this model from trajectory data, one could find

min_{A,B} ∑_{i=1}^m ∑_{t=0}^{T−1} ‖ s^(i)_{t+1} − (A s^(i)_t + B a^(i)_t) ‖².

If one instead learns or is given a nonlinear model, one must then linearize the function found.
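The minimization over A and B above is an ordinary least-squares problem and can be solved in closed form. Here is a minimal Octave/MATLAB sketch, assuming the trajectories have been stacked so that S0 holds the states s_t, A0 the corresponding actions a_t, and S1 the corresponding next states s_{t+1}, one transition per row; these variable names are illustrative, not from the notes.

% S0: N x n states s_t, A0: N x d actions a_t, S1: N x n next states s_{t+1},
% with the rows of the three matrices aligned transition by transition.
Z = [S0, A0];                 % N x (n + d) regressor matrix
Theta = Z \ S1;               % least-squares solution of Z * Theta = S1
n = size(S0, 2);
A = Theta(1:n, :)';           % n x n dynamics matrix
B = Theta(n+1:end, :)';       % n x d control matrix

With A and B in hand (and, if needed, an estimate of Σ_ω from the residuals S1 − Z * Theta), the dynamics are in exactly the form that LQR assumes.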