Generative Models - Introduction to Machine Learning - Lecture 15 - Computer Science (lecture notes)

Lecture 15: Generative models
TTIC 31020: Introduction to Machine Learning
Instructor: Greg Shakhnarovich
TTI-Chicago
October 29, 2010

Reminder: optimal classification

The expected classification error is minimized by
$$ h(x) = \arg\max_c \, p(y = c \mid x) = \arg\max_c \, \frac{p(x \mid y = c)\, p(y = c)}{p(x)}. $$
The Bayes classifier:
$$ h^*(x) = \arg\max_c \frac{p(x \mid y = c)\, p(y = c)}{p(x)} = \arg\max_c \, p(x \mid y = c)\, p(y = c) = \arg\max_c \, \{ \log p_c(x) + \log P_c \}. $$
Note: $p(x)$ is the same for all $c$, and can be ignored.

Bayes risk

[Figure: two overlapping class-conditional densities over a 1D feature; the regions labeled "y = 2, h*(x) = 1" and "y = 1, h*(x) = 2" are where the Bayes classifier errs.]

The risk (probability of error) of the Bayes classifier $h^*$ is called the Bayes risk $R^*$. This is the minimal achievable risk for the given $p(x, y)$ with any classifier! In a sense, $R^*$ measures the inherent difficulty of the classification problem.
$$ R^* = 1 - \int_x \max_c \, \{ p(x \mid y = c)\, P_c \} \, dx $$

Discriminant functions

We can construct, for each class $c$, a discriminant function
$$ \delta_c(x) \triangleq \log p_c(x) + \log P_c $$
such that $h^*(x) = \arg\max_c \delta_c(x)$.

We can simplify $\delta_c$ by removing terms and factors common to all $\delta_c$, since they will not affect the decision boundary. For example, if $P_c = 1/C$ for all $c$, we can drop the prior term: $\delta_c(x) = \log p_c(x)$.

Two-category case

In the case of two classes $y \in \{\pm 1\}$, the Bayes classifier is
$$ h^*(x) = \arg\max_{c = \pm 1} \delta_c(x) = \operatorname{sign}\left( \delta_{+1}(x) - \delta_{-1}(x) \right). $$
The decision boundary is given by $\delta_{+1}(x) - \delta_{-1}(x) = 0$.

• Sometimes $f(x) = \delta_{+1}(x) - \delta_{-1}(x)$ itself is referred to as a discriminant function.

With equal priors, this is equivalent to the (log-)likelihood ratio test:
$$ h^*(x) = \operatorname{sign}\left[ \log \frac{p(x \mid y = +1)}{p(x \mid y = -1)} \right]. $$

Equal covariance Gaussian case

Consider the case $p_c(x) = \mathcal{N}(x;\, \mu_c, \Sigma)$ with equal priors for all classes. Then
$$ \delta_k(x) = \log p(x \mid y = k) = \underbrace{-\tfrac{d}{2}\log(2\pi) - \tfrac{1}{2}\log|\Sigma|}_{\text{same for all } k} - \tfrac{1}{2}(x - \mu_k)^T \Sigma^{-1} (x - \mu_k) $$
$$ \propto \mathrm{const} - x^T \Sigma^{-1} x + \mu_k^T \Sigma^{-1} x + x^T \Sigma^{-1} \mu_k - \mu_k^T \Sigma^{-1} \mu_k. $$
Dropping the terms common to all classes (including $-x^T \Sigma^{-1} x$) and using $\mu_k^T \Sigma^{-1} x = x^T \Sigma^{-1} \mu_k$, for two classes $k$ and $q$ we have
$$ \delta_k(x) \propto 2\mu_k^T \Sigma^{-1} x - \mu_k^T \Sigma^{-1} \mu_k, \qquad \delta_q(x) \propto 2\mu_q^T \Sigma^{-1} x - \mu_q^T \Sigma^{-1} \mu_q. $$

Linear discriminant

The difference of the two class discriminants is linear in $x$:
$$ \delta_k(x) - \delta_q(x) = 2(\mu_k - \mu_q)^T \Sigma^{-1} x - \mu_k^T \Sigma^{-1} \mu_k + \mu_q^T \Sigma^{-1} \mu_q = w^T x + w_0, $$
with $w = 2\Sigma^{-1}(\mu_k - \mu_q)$ and $w_0 = \mu_q^T \Sigma^{-1} \mu_q - \mu_k^T \Sigma^{-1} \mu_k$.

If we know what $\mu_1, \ldots, \mu_C$ and $\Sigma$ are, we can compute the optimal $w$, $w_0$ directly. What should we do when we don't know the Gaussians?

Maximum likelihood density estimation

Let $X = \{x_1, \ldots, x_N\}$ be a set of data points.

• No labels; in the current context, all of $X$ comes from class $c$.

We assume a parametric distribution model $p(x; \theta)$. The (log-)likelihood of $\theta$ given $X$ (assuming i.i.d. sampling) is
$$ \log p(X; \theta) \triangleq \sum_{i=1}^N \log p(x_i; \theta). $$
The ML estimate of $\theta$:
$$ \hat{\theta}_{\mathrm{ML}} \triangleq \arg\max_\theta \, \log p(X; \theta). $$

• Intuitively: the observed data is most likely (has the highest probability) under this setting of $\theta$.
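For concreteness, here is a minimal sketch of the recipe above: estimate the class means and a shared covariance by maximum likelihood, then classify with the resulting linear discriminant. This is not code from the lecture; it assumes NumPy, and the function and variable names are made up for illustration.

```python
import numpy as np

def fit_shared_covariance_gaussians(X, y):
    """ML estimates of per-class means, class priors, and one pooled covariance.

    X: (N, d) data matrix; y: (N,) labels in {0, ..., C-1}.
    """
    classes = np.unique(y)
    N, d = X.shape
    means = {c: X[y == c].mean(axis=0) for c in classes}
    priors = {c: np.mean(y == c) for c in classes}
    # Pooled covariance: average outer product of deviations from each
    # point's own class mean (the ML estimate under a shared Sigma).
    centered = X - np.stack([means[c] for c in y])
    Sigma = centered.T @ centered / N
    return means, priors, Sigma

def linear_discriminant(means, priors, Sigma, k, q):
    """w, w0 such that sign(w^T x + w0) decides between classes k and q."""
    Sigma_inv = np.linalg.inv(Sigma)
    w = 2 * Sigma_inv @ (means[k] - means[q])
    w0 = (means[q] @ Sigma_inv @ means[q]
          - means[k] @ Sigma_inv @ means[k]
          + 2 * (np.log(priors[k]) - np.log(priors[q])))
    return w, w0

# Usage on toy 2D data: two Gaussian classes with a shared covariance.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal([0, 0], 1.0, size=(100, 2)),
               rng.normal([2, 2], 1.0, size=(100, 2))])
y = np.array([0] * 100 + [1] * 100)
means, priors, Sigma = fit_shared_covariance_gaussians(X, y)
w, w0 = linear_discriminant(means, priors, Sigma, k=1, q=0)
pred = (X @ w + w0 > 0).astype(int)   # 1 where delta_1(x) > delta_0(x)
print("training accuracy:", np.mean(pred == y))
```

With equal priors the log-prior term in `w0` vanishes, and `w`, `w0` reduce exactly to the expressions derived on the "Linear discriminant" slide (scaled by 2, which does not change the decision).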
Gaussians with unequal covariances

What if we remove the restriction that $\Sigma_c = \Sigma$ for all $c$? We compute the ML estimates of $\mu_c$ and $\Sigma_c$ for each $c$, and we get discriminants (and decision boundaries) that are quadratic in $x$:
$$ \delta_c(x) = -\tfrac{1}{2} x^T \Sigma_c^{-1} x + \mu_c^T \Sigma_c^{-1} x - \langle \text{const in } x \rangle $$
(as shown in PS1). A quadratic form in $x$: $x^T A x$.

Quadratic decision boundaries

What do quadratic boundaries look like in 2D? Second-degree curves can be any conic section (ellipse, parabola, hyperbola, or a pair of lines). Can all of these arise from two Gaussian classes?

Sources of error in generative models

Reminder: there are three sources of error: noise variance (irreducible), structural error due to our choice of model class, and estimation error due to our choice of model from that class. In a generative model, estimation error may be due to overfitting. Two issues: regularization (MAP estimation instead of ML), and controlling the number of parameters (degrees of freedom) in the model.

Parameters in Gaussian ML

A single Gaussian in $\mathbb{R}^d$ has $d$ parameters for the mean, plus the parameters of the covariance:

• full covariance $\Sigma$, with variances $\sigma_1^2, \ldots, \sigma_d^2$ on the diagonal and covariances $\sigma_{ij}$ off the diagonal: $d + d(d-1)/2$ parameters;
• diagonal covariance, $\Sigma = \mathrm{diag}(\sigma_1^2, \ldots, \sigma_d^2)$: $d$ parameters;
• spherical covariance, $\Sigma = \sigma^2 I$: $1$ parameter.

A diagonal covariance effectively means assuming feature independence.

Naive Bayes for the Gaussian model

$$ p(x \mid c) = p(\phi_1(x), \ldots, \phi_m(x) \mid c) = \prod_{j=1}^m p(\phi_j(x) \mid c). $$
With $\phi_j(x) = x_j$, the NB assumption of independence is equivalent to $\Sigma = \mathrm{diag}(\sigma_1^2, \ldots, \sigma_d^2)$. We need to estimate the $d$ marginal 1D Gaussian densities (one for each component of $x$).
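A minimal sketch of this Gaussian naive Bayes model (again not from the lecture; it assumes NumPy, and the helper names are illustrative): per class, fit the marginal 1D Gaussians by ML and classify by summing their log densities plus the log prior.

```python
import numpy as np

def fit_gaussian_nb(X, y):
    """Per-class ML estimates of per-feature means and variances
    (equivalent to a diagonal covariance matrix), plus class priors."""
    params = {}
    for c in np.unique(y):
        Xc = X[y == c]
        params[c] = {
            "mean": Xc.mean(axis=0),
            "var": Xc.var(axis=0),          # ML (biased) variance per feature
            "log_prior": np.log(np.mean(y == c)),
        }
    return params

def log_discriminant(x, p):
    """delta_c(x) = sum_j log N(x_j; mu_cj, sigma_cj^2) + log P_c."""
    log_marginals = -0.5 * (np.log(2 * np.pi * p["var"])
                            + (x - p["mean"]) ** 2 / p["var"])
    return log_marginals.sum() + p["log_prior"]

def predict(x, params):
    return max(params, key=lambda c: log_discriminant(x, params[c]))

# Usage on toy data: two classes with different per-feature variances.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal([0, 0], [1.0, 0.5], size=(200, 2)),
               rng.normal([3, 1], [0.5, 2.0], size=(200, 2))])
y = np.array([0] * 200 + [1] * 200)
params = fit_gaussian_nb(X, y)
preds = np.array([predict(x, params) for x in X])
print("training accuracy:", np.mean(preds == y))
```

Note that only $2d$ numbers per class are estimated here, versus $d + d(d+1)/2$ for a full covariance, which is exactly the parameter-count trade-off in the table above.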
Example: generative models for documents

A common task: given an e-mail message, classify it as SPAM or "ham" (a legitimate e-mail). Define a set of keywords $W_1, \ldots, W_m$ and let
$$ \phi_j(x) = \begin{cases} 1 & \text{if document } x \text{ includes } W_j, \\ 0 & \text{otherwise.} \end{cases} $$
A document $x$ (of arbitrary length!) is now represented as a vector in $\{0,1\}^m$:
$$ \Phi(x) = [\phi_1(x), \ldots, \phi_m(x)]^T. $$
A natural distribution for $\phi_j(x)$ is the Bernoulli, $\Pr(\phi_j(x) = 1; \theta_j) = \theta_j$:
$$ p(\phi_j \mid y = 1) = \theta_{j1}^{\phi_j} (1 - \theta_{j1})^{1 - \phi_j}, \qquad p(\phi_j \mid y = 0) = \theta_{j0}^{\phi_j} (1 - \theta_{j0})^{1 - \phi_j}. $$

Classifying a document

Given a new document $x = [\phi_1, \ldots, \phi_m]^T$:
$$ \hat{y} = 1 \;\Longleftrightarrow\; \sum_{j=1}^m \phi_j \log\theta_{j1} + \sum_{j=1}^m (1 - \phi_j) \log(1 - \theta_{j1}) - \sum_{j=1}^m \phi_j \log\theta_{j0} - \sum_{j=1}^m (1 - \phi_j) \log(1 - \theta_{j0}) + \log P_1 - \log P_0 \;\ge\; 0. $$
There are a total of $2 + 2m$ parameters to estimate in this model.

Problems with ML estimation

Recall the coin-tossing experiments:

• ML is too sensitive to the data, and may violate some "reasonable" beliefs about $\theta$, e.g., that $\theta = 1$ is very unlikely.

This is a real problem in text classification. Zipf's law for English texts: the $n$-th most common word has a relative frequency of $1/n^a$, with $a \approx 1$.

• Relative frequency means #(this word) / #(all words).

According to ML, when a message contains a word that we have never seen in SPAM, we must decide it is legitimate. If the same message also contains a word never seen in non-SPAM, what do we do?

MAP estimate for word counts

We can use the Beta prior:
$$ B(\theta; a, b) \triangleq \frac{\Gamma(a+b)}{\Gamma(a)\,\Gamma(b)}\, \theta^{a-1} (1 - \theta)^{b-1}, $$
which is conjugate for the Bernoulli likelihood:
$$ p(\theta \mid X) \propto B(\theta;\, N_1 + a,\, N_0 + b). $$
Interpretation: $a$ and $b$ are pseudocounts.

• The prior $p(\theta) = B(\theta; a, b)$ is equivalent to having seen $a + b$ observations, $a$ of which were ones and $b$ of which were zeros, before we observed the actual data $X_N$.
• An alternative phrasing: $a/(a+b)$ is the default value for $\theta$, and $a + b$ is how strongly we believe in that value. The posterior $p(\theta \mid X_N)$ updates this by adding the actual counts to the pseudocounts.
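To tie the last few slides together, here is a minimal sketch of the Bernoulli naive Bayes spam classifier with Beta-prior pseudocount smoothing. It is not code from the lecture: the keyword list, training documents, pseudocount values, and function names are all invented for illustration, and only NumPy is assumed.

```python
import numpy as np

# Hypothetical keyword list W_1, ..., W_m (for illustration only).
KEYWORDS = ["free", "winner", "meeting", "viagra", "project", "click"]

def features(document):
    """Phi(x): binary indicator of each keyword's presence in the document."""
    words = set(document.lower().split())
    return np.array([1 if w in words else 0 for w in KEYWORDS])

def fit_bernoulli_nb(docs, labels, a=1, b=1):
    """Pseudocount-smoothed estimates theta_jc = (N_jc + a) / (N_c + a + b),
    where N_jc counts class-c documents containing keyword j.  As on the
    'MAP estimate' slide, the Beta(a, b) prior acts as a extra ones and
    b extra zeros; a = b = 1 is one pseudo-observation of each outcome."""
    Phi = np.array([features(d) for d in docs])
    y = np.array(labels)
    theta, log_prior = {}, {}
    for c in (0, 1):
        Phi_c = Phi[y == c]
        theta[c] = (Phi_c.sum(axis=0) + a) / (len(Phi_c) + a + b)
        log_prior[c] = np.log(len(Phi_c) / len(y))
    return theta, log_prior

def predict(document, theta, log_prior):
    """Return 1 (spam) iff the log-odds from the 'Classifying a document'
    slide is non-negative."""
    phi = features(document)
    score = 0.0
    for c, sign in ((1, +1), (0, -1)):
        score += sign * (phi @ np.log(theta[c])
                         + (1 - phi) @ np.log(1 - theta[c])
                         + log_prior[c])
    return int(score >= 0)

# Toy usage with made-up training documents.
docs = ["free winner click now", "project meeting at noon",
        "click here winner", "notes from the project meeting"]
labels = [1, 0, 1, 0]
theta, log_prior = fit_bernoulli_nb(docs, labels)
print(predict("free click", theta, log_prior))             # likely 1 (spam)
print(predict("project meeting notes", theta, log_prior))  # likely 0 (ham)
```

Because of the pseudocounts, no keyword ever gets $\theta_{jc} = 0$ or $1$, so a single unseen word can no longer force the decision the way plain ML estimation would.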