Probabilistic models

• Decision trees, instance-based learning, and transformation-based learning are called non-parametric methods because they don't use an explicit probabilistic model
• Parametric machine learning methods assume a particular (typically probabilistic) model
• Parametric methods (usually) search a much more restrictive hypothesis space than non-parametric methods → large bias, small variance
• Suppose we have a representation of an instance as a feature vector x and we want to predict its class c
• If we have a way of modeling P(c|x), the Bayes Decision Rule says our predicted ĉ should be:
  ĉ = argmax_{c∈C} P(c|x)
• This minimizes the expected error:
  P(error|x) = 1 − P(ĉ|x)
  P(error) = ∑_x P(error|x) P(x)

Baseline classifier

• We often compute a 'baseline' for a classification task by simply assigning the most frequent class to each instance:
  ĉ = argmax_{c∈C} P(c)
• Here we assume that P(c|x) = P(c), i.e., X and C are independent
• The extra error a baseline classifier makes is:
  ∑_x P(x) [P(x,c) − P(x)P(c)]

Bayes Optimal Classifiers

• Call a particular model h, chosen from the hypothesis space H, and let d be the training data
• The maximum likelihood hypothesis selects:
  ĉ = argmax_{c∈C, h∈H} P(c|x,h) P(d|h)
• The maximum a posteriori hypothesis selects:
  ĉ = argmax_{c∈C, h∈H} P(c|x,h) P(d|h) P(h)
• Both of these commit us to choosing one h, which may or may not wind up being the best choice
• The Bayes Optimal Classifier selects:
  ĉ = argmax_{c∈C} ∑_{h∈H} P(c|x,h) P(d|h) P(h)
• We remove the dependence on a particular h by averaging over all possible hs
• This is almost always impossible to apply in practice, but it can be used to establish a lower bound on the error rate
• We can also sometimes approximate it, e.g., by randomly drawing h from the posterior distribution P(h|d) ∝ P(d|h) P(h)

Naive Bayes classifiers

• To apply a generative Bayesian classifier, we need P(x,c)
• We can break this down into two parts: the class prior P(c) and a likelihood P(x|c)
• The class priors are easy to estimate from training data:
  P̂(c) = (# of instances in class c) / (# of instances)
• This won't work for P(x|c), since any particular feature vector x is unlikely to turn up in the training data:
  P̂(x|c) = (# of instances of x in c) / (# of instances in c) ≈ 0
• To get a better estimate of P(x|c), we can make the simplifying assumption that the dimensions x_i of x are independent given the class, so that:
  P(x|c) = ∏_i P(x_i|c)
• Now we only need estimates of P(x_i|c) from the data for each x_i, which we can get in the usual way:
  P̂(x_i|c) = (# of instances with x_i in c) / (# of instances in c)
• Both P̂(c) and P̂(x_i|c) can be estimated using whatever tricks we have available
• The naive Bayes classifier selects the class ĉ such that:
  ĉ = argmax_{c∈C} P(c) ∏_i P(x_i|c)
• Naive Bayes classifiers have been used primarily for classifying texts (Maron 1961)
• We treat a text as a set or bag of words, an unordered collection of all the words that appear in the text
• "We treat a text as a set or bag of words" ≡ { a, a, as, bag, of, or, set, text, treat, we, words }

Feature selection

• A straight bag-of-words model leads to positing a very large number of features
• Some of those features will not be relevant for the task (stop words)
• Many of the features will appear relevant, but won't be: we can't avoid the Curse of Dimensionality
• So, we want to select a subset of features which appear promising, usually by information gain
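A minimal sketch of information-gain feature selection over bag-of-words features, in the spirit of the last bullet above. This is illustrative only: the function names and the toy documents are made up here, and ties and smoothing are ignored.

import math
from collections import Counter

# Rank binary "document contains word w" features by information gain with
# respect to the class label, then keep the top k.

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(has_feature, labels):
    # has_feature[i] is True iff document i contains the word
    gain = entropy(labels)
    for value in (True, False):
        subset = [c for f, c in zip(has_feature, labels) if f == value]
        if subset:
            gain -= (len(subset) / len(labels)) * entropy(subset)
    return gain

def select_features(docs, labels, k):
    # docs are bags (lists) of words; score every vocabulary word, keep the top k
    vocab = {w for doc in docs for w in doc}
    scores = {w: information_gain([w in doc for doc in docs], labels) for w in vocab}
    return sorted(scores, key=scores.get, reverse=True)[:k]

docs = [["free", "cable", "tv"], ["meeting", "agenda"], ["free", "offer"], ["agenda", "notes"]]
labels = ["spam", "ham", "spam", "ham"]
print(select_features(docs, labels, k=3))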
Multivariate Bernoulli event model

• If we represent a document as a set of words, then each feature x_i is a Bernoulli variable, where:
  P(x_i|c_j) = P(x_i=1|c_j)^{x_i} (1 − P(x_i=1|c_j))^{1−x_i}
• If there are v words in the vocabulary, a document is constructed by flipping v coins
• Call p_ij = P(x_i=1|c_j). Substituting this in, we get:
  P(c_j|x) = P(c_j) ∏_i P(x_i|c_j) / P(x) = P(c_j) ∏_i p_ij^{x_i} (1 − p_ij)^{1−x_i} / P(x)
• And taking the log gives us:
  log P(c_j|x) = log P(c_j) + ∑_i x_i log p_ij + ∑_i (1 − x_i) log(1 − p_ij) − log P(x)
               = log P(c_j) + ∑_i x_i log p_ij + ∑_i log(1 − p_ij) − ∑_i x_i log(1 − p_ij) − log P(x)
               = log P(c_j) + ∑_i x_i log [p_ij / (1 − p_ij)] + ∑_i log(1 − p_ij) − log P(x)
• Suppose we only have two classes. Then P(c_1|x) = 1 − P(c_2|x), and the posterior log odds are:
  log [P(c_1|x) / (1 − P(c_1|x))] = ∑_i x_i log [p_i1 (1 − p_i2) / ((1 − p_i1) p_i2)] + ∑_i log [(1 − p_i1) / (1 − p_i2)] + log [P(c_1) / (1 − P(c_1))]

Multinomial event model

• This underestimates P(x|c_j), since lots of ordered sequences correspond to the same bag of words
• How many different ways are there to draw word w_1 x_1 times, word w_2 x_2 times, and so on?
• We can use the multinomial coefficient:
  (n choose n_1, n_2, …) = (n choose n_1) × (n − n_1 choose n_2) × ···
                         = [n! / (n_1! (n − n_1)!)] × [(n − n_1)! / (n_2! (n − n_1 − n_2)!)] × ···
                         = n! / (n_1! n_2! ···)
• So, if we draw N = ∑_i x_i words, we have:
  P(x|c_j) = (N choose x_1, x_2, …) ∏_i P(w_i|c_j)^{x_i} = N! ∏_i [P(w_i|c_j)^{x_i} / x_i!]
• To be completely correct, we also need to think about the probability of finding a document of a particular length:
  P(x|c_j) = P(N|c_j) (∑_i x_i)! ∏_i [P(w_i|c_j)^{x_i} / x_i!]
  but in practice this can be hard to do
• The parameters of the multinomial model are the individual word probabilities P(w_i|c_j)
• Since these are the parameters of a multinomial distribution, we need to maintain:
  ∑_i P(w_i|c_j) = 1
• We can estimate them from training data as:
  P̂(w_i|c_j) = (# of times w_i occurs in documents in c_j) / (# of words in documents in class c_j)
• As always, smoothing is important

Naive Bayes classifiers

• Paul Graham wrote an article on using naive Bayes classifiers to filter junk mail, an approach which has become a standard method. A typical example:
  Free CableTV!No more pay!%RND_SYB
  requisite silt administer orphanage teach hypothalamus diatomic conflict atlas moser cofactor electret coffin diversionary solicitous becalm absent satiable blurb mackerel sibilant tehran delivery germicidal barometer falmouth capricorn
• Maron (1961): "It is feasible to have a computing machine read a document and to decide automatically the subject category to which the item in question belongs. No real intellectual breakthroughs are required before a machine will be able to index rather well. Just as in the case of machine translation of natural language, the road is gradual but, by and large, straightforward."
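A sketch of a multinomial-event-model naive Bayes text classifier, working in log space. The slides only note that smoothing is important; add-one (Laplace) smoothing is assumed here as one simple choice, and the function names and toy data are illustrative.

import math
from collections import Counter, defaultdict

# Multinomial naive Bayes over bags of words. The multinomial coefficient
# N! / (x_1! x_2! ...) is the same for every class, so it drops out of the argmax.

def train_multinomial(docs, labels):
    class_counts = Counter(labels)            # documents per class
    word_counts = defaultdict(Counter)        # class -> word -> count
    vocab = set()
    for words, c in zip(docs, labels):
        word_counts[c].update(words)
        vocab.update(words)
    return class_counts, word_counts, vocab

def predict_multinomial(words, class_counts, word_counts, vocab):
    n_docs = sum(class_counts.values())
    v = len(vocab)
    def log_score(c):
        total = sum(word_counts[c].values())  # total words in documents of class c
        score = math.log(class_counts[c] / n_docs)                # log of estimated P(c)
        for w in words:
            # add-one smoothed estimate of P(w|c)
            score += math.log((word_counts[c][w] + 1) / (total + v))
        return score
    return max(class_counts, key=log_score)   # argmax_c log P(c) + sum_i x_i log P(w_i|c)

docs = [["free", "cable", "tv", "free"], ["meeting", "agenda", "notes"]]
labels = ["spam", "ham"]
print(predict_multinomial(["free", "tv", "offer"], *train_multinomial(docs, labels)))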
Zero-one loss

• Given its obvious deficiencies, why does naive Bayes work as well as it does?
• Its probability estimates are only as good as the independence assumptions are valid (i.e., not very)
• But we don't evaluate a naive Bayes classifier on its probability estimates
• Instead, we measure its misclassification error, or zero-one loss
• The two measures need not be closely related
• Suppose there are two classes, and let p = P(c_1|x), r = P(c_1) ∏_i P(x_i|c_1), and s = P(c_2) ∏_i P(x_i|c_2)
• For any instance x, naive Bayes is optimal under zero-one loss if and only if:
  (p ≥ 1/2 ∧ r ≥ s) ∨ (p ≤ 1/2 ∧ r ≤ s)
• That means naive Bayes is optimal under zero-one loss for half the volume of the space of possible values of (p, r, s)!
• The naive Bayes probabilities themselves are optimal only along the line where the planes r = p and s = 1 − p intersect
• A necessary condition: naive Bayes can only be optimal (for discrete features) for concepts that are linearly separable
• For discrete features, conjunctions and disjunctions of (possibly negated) features are linearly separable
• This isn't a sufficient condition, since there are linearly separable concepts on which naive Bayes performs poorly (m-of-n concepts)
• Naive Bayes is optimal for conjunctions of features and for disjunctions of features
• This points to one way to improve naive Bayes: introduce new features which are disjunctions (or conjunctions) of other features
• Even when naive Bayes is not optimal, it may outperform other methods with greater representational power (e.g., C4.5)
• Zero-one loss is relatively insensitive to bias, but can be highly sensitive to variance
• When there isn't enough training data, a high-bias, low-variance learner will give a lower zero-one loss than a low-bias, high-variance learner
• We've seen this before: a simple model can outperform a more complex one, even when the assumptions of the simple model are false
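To tie the linear-separability point back to the posterior log-odds formula from the multivariate Bernoulli slides: for two classes, Bernoulli naive Bayes reduces to a linear decision rule over the binary features, sign(∑_i w_i x_i + b). A small sketch, with made-up parameter values purely for illustration:

import math

# Two-class multivariate Bernoulli naive Bayes as a linear classifier, with
#   w_i = log [p_i1 (1 - p_i2) / ((1 - p_i1) p_i2)]
#   b   = sum_i log [(1 - p_i1) / (1 - p_i2)] + log [P(c_1) / (1 - P(c_1))]

def linear_form(p1, p2, prior1):
    w = [math.log(pi1 * (1 - pi2) / ((1 - pi1) * pi2)) for pi1, pi2 in zip(p1, p2)]
    b = sum(math.log((1 - pi1) / (1 - pi2)) for pi1, pi2 in zip(p1, p2))
    b += math.log(prior1 / (1 - prior1))
    return w, b

def classify(x, w, b):
    # positive posterior log odds -> class 1, otherwise class 2
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b >= 0 else 2

# p1[i] = P(x_i = 1 | c_1), p2[i] = P(x_i = 1 | c_2)  (toy values)
w, b = linear_form(p1=[0.9, 0.2, 0.7], p2=[0.1, 0.6, 0.3], prior1=0.5)
print(classify([1, 0, 1], w, b))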