Machine Learning (CS 567), Fall 2008
Lecture 11: Bayesian Learning

Time: Tue-Thu 5:00pm - 6:20pm
Location: GFS 118
Instructor: Sofus A. Macskassy ([email protected])
Office: SAL 216
Office hours: by appointment
Teaching assistant: Cheol Han ([email protected])
Office: SAL 229
Office hours: M 2-3pm, W 11-12
Class web page: http://www-scf.usc.edu/~csci567/index.html

Lecture 11 Outline
• Bayesian Learning
  – Probability theory
  – Bayesian Classification

Learning Bayesian Networks: Naïve and non-Naïve Bayes
• Hypothesis Space
  – fixed size
  – stochastic
  – continuous parameters
• Learning Algorithm
  – direct computation
  – eager
  – batch

But first… basic probability theory
• Random variables
• Distributions
• Statistical formulae

Random Variables
• A random variable is a random number (or value) determined by chance, or more formally, drawn according to a probability distribution
  – The probability distribution can be estimated from observed data (e.g., throwing dice)
  – The probability distribution can be synthetic
  – Discrete & continuous variables
• Typical random variables in machine learning problems
  – The input data
  – The output data
  – Noise
• Important concept in learning: the data generating model
  – E.g., what is the data generating model for: i) throwing dice, ii) regression, iii) classification, iv) visual perception?

Classic Discrete Distributions (I): Bernoulli Distribution
• A Bernoulli random variable takes on only two values, 0 and 1.
• P(1) = p and P(0) = 1 - p, or in compact notation: P(x) = p^x (1 - p)^(1 - x) for x in {0, 1}
• The performance of a fixed number of trials with fixed probability of success (p) on each trial is known as a Bernoulli trial.
• [Figure: bar plot of P(x) for p = 0.3]

Classic Discrete Distributions (II): Binomial Distribution
• Like the Bernoulli distribution: binary input variables, 0 or 1, with probability P(1) = p and P(0) = 1 - p
• What is the probability of k successes, P(k), in a series of n independent trials? (n >= k)
• P(k) is a binomial random variable: P(k) = (n choose k) p^k (1 - p)^(n - k)
• The Bernoulli distribution is the special case of the binomial distribution with n = 1
• [Figures: binomial distributions P(k), k = 0..20, for p = 0.1, p = 0.25, and p = 0.5]

Continuous Probability Distributions (cont'd)
• Probability density function p(x)
• Probability of an event: P(x ∈ [a, b]) = ∫_a^b p(x) dx

Classic Continuous Distributions (I): Normal Distribution
• The most used distribution
• Also called the Gaussian distribution, after C. F. Gauss who proposed it
• Justified by the Central Limit Theorem:
  – Roughly: "if a random variable is the sum of a large number of independent random variables, it is approximately normally distributed"
  – Many observed variables are the sum of several random variables
• Shorthand: x ~ N(μ, σ^2)

Classic Continuous Distributions (II): Uniform Distribution
• All data is equally probable within a bounded region R: p(x) = 1/|R|, where |R| is the size of R
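To make the discrete distributions above concrete, here is a minimal Python sketch (not part of the lecture; the function names are my own) that evaluates the Bernoulli and binomial probability mass functions just defined:

from math import comb

def bernoulli_pmf(x: int, p: float) -> float:
    """P(x) = p^x * (1-p)^(1-x) for x in {0, 1}."""
    return p**x * (1 - p)**(1 - x)

def binomial_pmf(k: int, n: int, p: float) -> float:
    """P(k successes in n independent trials) = C(n, k) p^k (1-p)^(n-k)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

# Bernoulli with p = 0.3, as in the slide's bar plot: P(0) = 0.7, P(1) = 0.3
print([bernoulli_pmf(x, 0.3) for x in (0, 1)])

# Binomial with n = 20, p = 0.25, as in one of the slide's plots
print([round(binomial_pmf(k, 20, 0.25), 3) for k in range(21)])

# The Bernoulli distribution is the n = 1 special case of the binomial
assert binomial_pmf(1, 1, 0.3) == bernoulli_pmf(1, 0.3)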
Variance and Standard Deviation
• Variance: Var(x) = E[(x - μ)^2] = σ^2
• σ = sqrt(Var(x)) is the standard deviation of x. The variance is never negative and approaches 0 as the probability mass is concentrated at one point.
• The standard deviation is a simple measure of how far values of x are likely to depart from the mean
  – i.e., the standard or typical amount one should expect a randomly drawn value of x to deviate or differ from the mean μ.

Sample Variance and Covariance
• Sample variance: s^2 = 1/(N-1) Σ_{i=1..N} (x_i - x̄)^2
  – Why division by (N-1)? This is to obtain an unbiased estimate of the variance (unbiased estimate: E[s^2] = σ^2).
• Covariance: Cov(x, y) = E[(x - μ_x)(y - μ_y)]
• Sample covariance: 1/(N-1) Σ_{i=1..N} (x_i - x̄)(y_i - ȳ)

Biased vs. Unbiased Variance
• Biased variance: V' = 1/n Σ_{i=1..n} (x_i - x̄)^2
• "Anti-biased" (unbiased) variance: V* = 1/(n-1) Σ_{i=1..n} (x_i - x̄)^2
• The two are related by (n-1) V* = n V', i.e., V* = n/(n-1) V'.

Conditional Probability
• P(x|y) is the probability of the occurrence of event x given that y occurred, and is given as: P(x|y) = P(x, y) / P(y)
• Knowing that y occurred reduces the sample space to y, and the part of it where x also occurred is (x, y)
• This is only defined if P(y) > 0. Also, because the joint event (x, y) is commutative, i.e., P(x, y) = P(y, x), we have: P(x|y) P(y) = P(y|x) P(x)

Statistical Independence
• If x and y are independent then we have P(x, y) = P(x) P(y)
• From there it follows that P(x|y) = P(x) and P(y|x) = P(y)
• In other words, knowing that y occurred does not change the probability that x occurs (and vice versa).
• [Figure: Venn diagram of events x and y and their overlap (x, y)]

Bayes Rule
• Remember: P(x, y) = P(x|y) P(y) = P(y|x) P(x)
• Bayes Rule: P(y|x) = P(x|y) P(y) / P(x)
• Interpretation
  – P(y) is the PRIOR knowledge about y
  – x is new evidence to be incorporated to update my belief about y
  – P(x|y) is the LIKELIHOOD of x given that y was observed
  – Both prior and likelihood can often be generated beforehand, e.g., by histogram statistics
  – P(x) is a normalizing factor, corresponding to the marginal distribution of x. Often it need not be evaluated explicitly, but it can become a great computational burden. "P(x) is an enumeration of all possible combinations of x, and the probability of their occurrence."
  – P(y|x) is the POSTERIOR probability of y, i.e., the belief in y after one discovers x.

Bayes Theorem
• Consider hypothesis space H
• P(h) = prior probability of hypothesis h ∈ H
• P(D) = prior probability of training data D
• P(h|D) = probability of h given D
• P(D|h) = probability of D given h
• Bayes theorem: P(h|D) = P(D|h) P(h) / P(D)

Choosing Hypotheses
The natural choice is the most probable hypothesis given the training data, or maximum a posteriori hypothesis hMAP:
    hMAP = argmax_{h ∈ H} P(h|D) = argmax_{h ∈ H} P(D|h) P(h)
If we assume P(hi) = P(hj) for all i, j, then we can further simplify and choose the maximum likelihood (ML) hypothesis:
    hML = argmax_{h ∈ H} P(D|h)

Bayes Theorem: Example
Does the patient have cancer or not?
  – A patient takes a lab test and the result comes back positive. The test returns a correct positive result in 98% of the cases in which the disease is actually present, and a correct negative result in 97% of the cases in which the disease is not present. Furthermore, 0.008 of the entire population have this cancer.
    P(cancer) = 0.008        P(¬cancer) = 0.992
    P(+|cancer) = 0.98       P(-|cancer) = 0.02
    P(+|¬cancer) = 0.03      P(-|¬cancer) = 0.97
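A quick numerical check of the cancer example (a sketch of my own, not shown on the slides): applying Bayes rule to the numbers above gives the posterior of each hypothesis after a positive test.

# Bayes rule applied to the lab-test example above
p_cancer = 0.008
p_not_cancer = 0.992
p_pos_given_cancer = 0.98
p_pos_given_not_cancer = 0.03

# Unnormalized posteriors P(+|h) P(h) for each hypothesis h
joint_cancer = p_pos_given_cancer * p_cancer              # 0.00784
joint_not_cancer = p_pos_given_not_cancer * p_not_cancer  # 0.02976

# Normalize by P(+) = sum of the joint terms
p_pos = joint_cancer + joint_not_cancer
print("P(cancer | +) =", joint_cancer / p_pos)          # about 0.21
print("P(not cancer | +) =", joint_not_cancer / p_pos)  # about 0.79

So even after a positive test, hMAP = ¬cancer, because the unnormalized score 0.0298 exceeds 0.0078.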
Evolution of Posterior Probabilities
• As data is added, the certainty of hypotheses increases.

Classifying New Instances
So far we've sought the most probable hypothesis given the data D (i.e., hMAP).
Given a new instance x, what is its most probable classification?
• hMAP(x) is not the most probable classification!

Classification Example
Consider:
• Three possible hypotheses: P(h1|D) = .4, P(h2|D) = .3, P(h3|D) = .3
• Given new instance x: h1(x) = +, h2(x) = -, h3(x) = -
• What is hMAP(x)?
• What is the most probable classification of x?

Error of Gibbs
Noteworthy fact [Haussler 1994]: Assume target concepts are drawn at random from H according to the priors on H. Then:
    E[errorGibbs] <= 2 E[errorBayesOptimal]
Suppose a correct, uniform prior distribution over H. Then:
• Pick any hypothesis consistent with the data, with uniform probability
• Its expected error is no worse than twice that of the Bayes optimal classifier

Naïve Bayes Classifier
Along with decision trees, neural networks, and kNN, one of the most practical and most used learning methods.
When to use:
• Moderate or large training set available
• Attributes that describe instances are conditionally independent given the classification
Successful applications:
• Diagnosis
• Classifying text documents

Naïve Bayes Assumption
• Suppose the features xi are discrete
• Assume the xi are conditionally independent given y. In other words, assume that:
    P(x1, ..., xn | y) = Π_i P(xi | y)
• Then we have:
    yNB = argmax_y P(y) Π_i P(xi | y)
• For binary features, instead of O(2^n) numbers to describe a model, we only need O(n)!

Naïve Bayes: Example
• Consider the PlayTennis problem and the new instance
    <Outlook = sun, Temp = cool, Humid = high, Wind = strong>
  We want to compute:
    P(y) P(sun|y) P(cool|y) P(high|y) P(strong|y) = .005
    P(n) P(sun|n) P(cool|n) P(high|n) P(strong|n) = .021
• So, yNB = n

Naïve Bayes: Subtleties
• The conditional independence assumption
    P(x1, x2, ..., xn | yj) = Π_i P(xi | yj)
  is often violated...
• ...but it works surprisingly well anyway. Note that we don't need the estimated posteriors P(yj|x) to be correct; we need only that the argmax over yj of the estimated P(yj) Π_i P(xi|yj) equals the argmax of the true P(yj) P(x1, ..., xn | yj)
• See Domingos & Pazzani [1996] for analysis
• Naïve Bayes posteriors are often unrealistically close to 1 or 0

Decision Boundary of Naïve Bayes with Binary Features
• The parameters of the model are θ_{i,1} = P(xi=1|y=1), θ_{i,0} = P(xi=1|y=0), and π_1 = P(y=1)
• What is the decision surface?
• Using the log trick (taking the log of the ratio of the two class posteriors), we get:
    log [P(y=1|x) / P(y=0|x)] = log [π_1 / (1 - π_1)] + Σ_i [ xi log(θ_{i,1}/θ_{i,0}) + (1 - xi) log((1 - θ_{i,1})/(1 - θ_{i,0})) ]
  which is linear in the xi, so the decision surface is a hyperplane (predict y = 1 when the expression is positive).
• Note that in the equation above, each xi is 1 or 0, depending on whether the corresponding feature value was present in the instance.

Representing P(xj|y) – Discrete Values – Multinomial/Binomial
• If xj is a discrete random variable, xj ∈ {v1, ..., vm}, then we construct the conditional probability table:

              y = 1             y = 2             ...   y = K
    xj = v1   P(xj=v1 | y=1)    P(xj=v1 | y=2)    ...   P(xj=v1 | y=K)
    xj = v2   P(xj=v2 | y=1)    P(xj=v2 | y=2)    ...   P(xj=v2 | y=K)
    ...       ...               ...               ...   ...
    xj = vm   P(xj=vm | y=1)    P(xj=vm | y=2)    ...   P(xj=vm | y=K)
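As an illustration of the Naïve Bayes computation above, here is a minimal Python sketch of my own (not from the lecture). The conditional probability tables are the standard estimates from the 14-example PlayTennis data in Mitchell's textbook, which I assume is where the slide's .005 and .021 come from.

# Priors and per-feature conditional probability tables, estimated from
# the classic PlayTennis data (assumed; not given explicitly on the slide)
priors = {"yes": 9/14, "no": 5/14}
cpt = {
    # P(feature value | class)
    ("Outlook", "sun"):  {"yes": 2/9, "no": 3/5},
    ("Temp", "cool"):    {"yes": 3/9, "no": 1/5},
    ("Humid", "high"):   {"yes": 3/9, "no": 4/5},
    ("Wind", "strong"):  {"yes": 3/9, "no": 3/5},
}

def naive_bayes_score(instance, label):
    """Unnormalized score P(y) * prod_i P(x_i | y)."""
    score = priors[label]
    for feature_value in instance.items():
        score *= cpt[feature_value][label]
    return score

x = {"Outlook": "sun", "Temp": "cool", "Humid": "high", "Wind": "strong"}
scores = {y: naive_bayes_score(x, y) for y in priors}
print(scores)                                 # roughly {'yes': 0.005, 'no': 0.021}
print("y_NB =", max(scores, key=scores.get))  # 'no'

In practice one would also smooth the counts (e.g., Laplace add-one smoothing) so that an unseen feature value does not zero out the whole product.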
Discretization via Mutual Information
• Many discretization algorithms have been studied. One of the best is based on mutual information [Fayyad & Irani 93].
  – To discretize feature xj, grow a decision tree considering only splits on xj. Each leaf of the resulting tree will correspond to a single value of the discretized xj.
  – Stopping rule (applied at each node): stop when
        I(xj; y) < log2(N - 1)/N + Δ/N
    where
        Δ = log2(3^K - 2) - [K·H(S) - Kl·H(Sl) - Kr·H(Sr)]
  – Here S is the training data in the parent node; Sl and Sr are the examples in the left and right child; K, Kl, and Kr are the corresponding numbers of classes present in these examples; I is the mutual information; H is the entropy; and N is the number of examples in the node.

Kernel Density Estimators (2)
• This is equivalent to placing a Gaussian "bump" of height 1/Nk on each training data point from class k and then adding them up (a short sketch in code follows the gradient-ascent slide below).
• [Figure: individual Gaussian bumps contributing to P(xj|y)]

Kernel Density Estimators (3)
• Resulting probability density
• [Figure: the summed kernel density estimate of P(xj|y)]

The value chosen for σ is critical
• [Figures: kernel density estimates for σ = 0.15 (???) and σ = 0.50]

Naïve Bayes Applied to Diabetes Diagnosis
• Bayes nets and causality
  – Bayes nets work best when arrows follow the direction of causality
    • two things with a common cause are likely to be conditionally independent given the cause; arrows in the causal direction capture this independence
  – In a Naïve Bayes network, arrows are often not in the causal direction
    • diabetes does not cause pregnancies
    • diabetes does not cause age
  – But some arrows are correct
    • diabetes does cause the level of blood insulin and blood glucose

Non-Naïve Bayes
• Manually construct a graph in which all arcs are causal
• Learning the probability tables is still easy. For example, P(Mass | Age, Preg) involves counting the number of patients of a given age and number of pregnancies that have a given body mass
• Classification:
    P(D = d | A, P, M, I, G) = P(I | D = d) P(G | I, D = d) P(D = d | A, M, P) / P(I, G)

Bayesian Belief Network
A network represents a set of conditional independence assertions:
• Each node is asserted to be conditionally independent of its nondescendants, given its immediate predecessors.
• Directed acyclic graph

Inference in Practice
In practice, inference can succeed in many cases:
• Exact inference methods work well for some network structures (small "induced width")
• Monte Carlo methods "simulate" the network randomly to calculate approximate solutions
• Now used as a primitive in more advanced learning and reasoning scenarios (e.g., in relational learning)

Learning Bayes Nets
Suppose the structure is known and the variables are partially observable:
e.g., we observe ForestFire, Storm, BusTourGroup, Thunder, but not Lightning, Campfire...
This is similar to training a neural network with hidden units:
• In fact, we can learn the network's conditional probability tables using gradient ascent!
• Converge to the network h that (locally) maximizes P(D|h)

Gradient Ascent for BNs
Let wijk denote one entry in the conditional probability table for variable Yi in the network, i.e., wijk = P(Yi = yij | Parents(Yi) take the values uik);
e.g., if Yi = Campfire, then uik might be <Storm = T, BusTourGroup = F>.
Perform gradient ascent by repeatedly:
1. Updating all wijk using the training data D
2. Then renormalizing the wijk to assure that Σ_j wijk = 1 and 0 <= wijk <= 1
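Returning to the kernel density slides above, here is a minimal Python sketch of my own (not from the lecture) of a class-conditional Gaussian kernel density estimate of P(xj|y). It uses the standard normalization, in which each bump contributes probability mass 1/Nk rather than literally having height 1/Nk, and it shows how the bandwidth σ controls smoothness.

import math

def kde_gaussian(x, points, sigma):
    """Kernel density estimate: p(x) = (1/N) * sum_i N(x; x_i, sigma^2),
    i.e., one Gaussian 'bump' per training point, summed and averaged."""
    norm = 1.0 / (math.sqrt(2 * math.pi) * sigma)
    return sum(norm * math.exp(-(x - xi) ** 2 / (2 * sigma ** 2))
               for xi in points) / len(points)

# Hypothetical one-dimensional feature values x_j observed for class k
class_k_points = [1.0, 1.2, 1.9, 2.3, 3.1]

# A small sigma gives a jagged estimate; a large sigma over-smooths
for sigma in (0.15, 0.50):
    density = [round(kde_gaussian(x / 10, class_k_points, sigma), 3)
               for x in range(0, 41, 5)]
    print(f"sigma={sigma}: {density}")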
Naïve Bayes Summary
• Advantages of Bayesian networks
  – Produce stochastic classifiers
    • can be combined with utility functions to make optimal decisions
  – Easy to incorporate causal knowledge
    • resulting probabilities are easy to interpret
  – Very simple learning algorithms
    • if all variables are observed in the training data
• Disadvantages of Bayesian networks
  – Fixed-size hypothesis space
    • may underfit or overfit the data
    • may not contain any good classifiers if prior knowledge is wrong
  – Harder to handle continuous features

Evaluation of Naïve Bayes

  Criterion                  LMS   Logistic  LDA   Trees  NNbr  Nets  NB
  Mixed data                 no    no        no    yes    no    no    yes
  Missing values             no    no        yes   yes    some  no    yes
  Outliers                   no    yes       no    yes    yes   yes   disc
  Monotone transformations   no    no        no    yes    no    some  disc
  Scalability                yes   yes       yes   yes    no    yes   yes
  Irrelevant inputs          no    no        no    some   no    no    some
  Linear combinations        yes   yes       yes   no     some  yes   yes
  Interpretable              yes   yes       yes   yes    no    no    yes
  Accurate                   yes   yes       yes   no     no    yes   yes

• Naïve Bayes is very popular, particularly in natural language processing and information retrieval, where there are many features compared to the number of examples
• In applications with lots of data, Naïve Bayes does not usually perform as well as more sophisticated methods