
Intro. to machine learning (CSI 5325)

Lecture 14: Bayesian learning

Greg Hamerly

Some content from Tom Mitchell.

Outline

1 Naive Bayes classifier example: Learning over text data

2 Bayesian belief networks

3 Expectation Maximization algorithm

Naive Bayes classifier example: Learning over text data

Learning to Classify Text

Why?
- Learn which news articles are of interest
- Learn to classify web pages by topic

Naive Bayes is among the most effective algorithms. SVMs are even better... but NB is much simpler.

What attributes shall we use to represent text documents?

Naive Bayes classifier example: Learning over text data

Learning to Classify Text

Target concept Interesting?: Document → {+, −}

1. Represent each document by a vector of words: one attribute per word position in the document.
2. Learning: Use training examples to estimate P(+), P(−), P(doc|+), P(doc|−).

Naive Bayes conditional independence assumption:

P(doc | vj) = ∏_{i=1}^{length(doc)} P(ai = wk | vj)

where P(ai = wk | vj) is the probability that the word in position i is wk, given vj.

One more assumption: P(ai = wk | vj) = P(am = wk | vj), ∀ i, m

Naive Bayes classifier example: Learning over text data

Learning to Classify Text

Learn_naive_Bayes_text(Examples, V)

1. Collect all words and other tokens that occur in Examples:
   Vocabulary ← all distinct words and other tokens in Examples
2. Calculate the required P(vj) and P(wk | vj) probability terms.
3. For each target value vj in V do:
   - docsj ← subset of Examples for which the target value is vj
   - P(vj) ← |docsj| / |Examples|
   - Textj ← a single document created by concatenating all members of docsj
   - n ← total number of words in Textj (counting duplicate words multiple times)
   - for each word wk in Vocabulary:
     - nk ← number of times word wk occurs in Textj
     - P(wk | vj) ← (nk + 1) / (n + |Vocabulary|)
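
Below is a minimal Python sketch of this training procedure, assuming each document arrives already tokenized as a list of strings; the function name train_naive_bayes and the returned dictionary layout are illustrative choices, not something specified in the slides.

```python
from collections import Counter
from math import log

def train_naive_bayes(examples, targets):
    """Estimate P(v_j) and P(w_k | v_j) with add-one smoothing.

    examples: list of documents, each a list of tokens
    targets:  list of class labels, one per document
    """
    vocabulary = {w for doc in examples for w in doc}
    priors, cond_log_probs = {}, {}
    for v in set(targets):
        docs_v = [doc for doc, t in zip(examples, targets) if t == v]
        priors[v] = len(docs_v) / len(examples)          # P(v_j) = |docs_j| / |Examples|
        text_v = [w for doc in docs_v for w in doc]      # Text_j: all docs of class v concatenated
        n = len(text_v)                                  # total words, duplicates counted
        counts = Counter(text_v)
        # P(w_k | v_j) = (n_k + 1) / (n + |Vocabulary|), stored as logs for later use
        cond_log_probs[v] = {
            w: log((counts[w] + 1) / (n + len(vocabulary))) for w in vocabulary
        }
    return vocabulary, priors, cond_log_probs
```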

Naive Bayes classifier example: Learning over text data

Learning to Classify Text

Classify_naive_Bayes_text(Doc)

- positions ← all word positions in Doc that contain tokens found in Vocabulary
- Return vNB, where

vNB = argmax_{vj ∈ V} P(vj) ∏_{i ∈ positions} P(ai | vj)
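
A matching sketch of the classification step, computed in log space to avoid underflow when multiplying many small probabilities; it assumes the model returned by the hypothetical train_naive_bayes above.

```python
from math import log

def classify_naive_bayes(doc, vocabulary, priors, cond_log_probs):
    """Return v_NB = argmax_v P(v) * prod_{i in positions} P(a_i | v)."""
    positions = [w for w in doc if w in vocabulary]   # ignore out-of-vocabulary tokens
    best_class, best_score = None, float("-inf")
    for v, prior in priors.items():
        score = log(prior) + sum(cond_log_probs[v][w] for w in positions)
        if score > best_score:
            best_class, best_score = v, score
    return best_class

# e.g.: model = train_naive_bayes(train_docs, train_labels)
#       classify_naive_bayes(["puck", "goal", "goal"], *model)
```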

Naive Bayes classifier example: Learning over text data

Bag of Words model

Representing a text document as a vector over a fixed vocabulary, where each element indicates the presence or count of a word and each word’s position is discarded, is called the ‘bag of words’ model. Also known as ‘unigram.’

Why is it useful? What does it lose?

Alternative methods which also discard position but improve context: bi-grams, tri-grams, etc.
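
A small illustrative sketch of the bag-of-words representation, plus a bigram helper; the vocabulary and tokens below are made up for the example.

```python
from collections import Counter

def bag_of_words(tokens, vocabulary):
    """Map a token list to a fixed-length count vector; word order is discarded."""
    counts = Counter(t for t in tokens if t in vocabulary)
    return [counts[w] for w in sorted(vocabulary)]

def bigrams(tokens):
    """Adjacent token pairs: a cheap way to retain a little local context."""
    return list(zip(tokens, tokens[1:]))

vocab = {"puck", "goal", "crypto", "key"}
print(bag_of_words(["goal", "puck", "goal"], vocab))   # [0, 2, 0, 1] for the sorted vocab
print(bigrams(["the", "hard", "shot"]))                # [('the', 'hard'), ('hard', 'shot')]
```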

Naive Bayes classifier example: Learning over text data

20 news groups dataset

Given 1000 training documents from each group, learn to classify new documents according to which newsgroup each came from:

alt.atheism               comp.graphics             comp.os.ms-windows.misc
comp.sys.ibm.pc.hardware  comp.sys.mac.hardware     comp.windows.x
misc.forsale              rec.autos                 rec.motorcycles
rec.sport.baseball        rec.sport.hockey          sci.crypt
sci.electronics           sci.med                   sci.space
soc.religion.christian    talk.politics.guns        talk.politics.mideast
talk.politics.misc        talk.religion.misc

Naive Bayes achieves 89% classification accuracy.

Naive Bayes classifier example: Learning over text data

from rec.sport.hockey

Path: cantaloupe.srv.cs.cmu.edu!das-news.harvard.edu!ogicse!uwm.edu
From: [email protected] (John Doe)
Subject: Re: This year’s biggest and worst (opinion)...
Date: 5 Apr 93 09:53:39 GMT

I can only comment on the Kings, but the most obvious candidate for pleasant surprise is Alex Zhitnik. He came highly touted as a defensive defenseman, but he’s clearly much more than that. Great skater and hard shot (though wish he were more accurate). In fact, he pretty much allowed the Kings to trade away that huge defensive liability Paul Coffey. Kelly Hrudey is only the biggest disappointment if you thought he was any good to begin with. But, at best, he’s only a

Naive Bayes classifier example: Learning over text data

Curve for 20 Newsgroups

[Figure: learning curve on 20 Newsgroups — accuracy (0–100%) vs. training set size (100 to 10000 documents, 1/3 withheld for test), comparing the TFIDF, Bayes, and PRTFIDF methods.]

Naive Bayes classifier example: Learning over text data

Naive Bayes with other probability distributions

Usually, the naive Bayes classifier is described with histogram-type probability distributions, i.e.

P(w | v) = (number of times w seen in class v) / (number of times class v seen)

We could alternatively use any probability distribution like Gaussian, binomial, etc.

We could even use a different distribution for each attribute.
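
As an illustration, here is a sketch of how a Gaussian class-conditional density could stand in for the histogram estimate on a real-valued attribute; the attribute ("document length") and the numbers are invented for the example.

```python
import math

def fit_gaussian(values):
    """Estimate the mean and variance of one real-valued attribute within one class."""
    mean = sum(values) / len(values)
    var = sum((x - mean) ** 2 for x in values) / len(values)
    return mean, var

def gaussian_pdf(x, mean, var):
    """Class-conditional density P(a = x | v) under a Gaussian assumption."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

# e.g., document length for class '+' (illustrative numbers)
mean, var = fit_gaussian([120, 95, 143, 110])
print(gaussian_pdf(100, mean, var))
```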

Naive Bayes classifier example: Learning over text data

Naive Bayes classifier applications

Text classification:
- Spam classification
- Finding articles ‘of interest’
- News article classification (juggling!?)
- Identifying ‘high-quality’ posts in a forum
- Grading papers?

Failure prediction (e.g. hard disk drives, motors, software, etc.)

Others?

Naive Bayes classifier example: Learning over text data

Conditional independence

The NB classifier builds a distribution of the attributes for each class:
- k classes → k distributions
- Each distribution considers the attributes to be independent, given the class.

However, this does not mean that the attributes are considered completely independent!

That is, P(a1, a2 | v) = P(a1 | v) P(a2 | v), but not P(a1, a2) = P(a1) P(a2)

Naive Bayes classifier example: Learning over text data

Conditional independence example

P(a1, a2 | red) = P(a1 | red) P(a2 | red), but not P(a1, a2) = P(a1) P(a2)

[Figure: scatter plot of the two attributes a1 (horizontal axis) and a2 (vertical axis) for two classes, illustrating attributes that are independent given the class.]
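
A small numeric sketch of the same point, under the assumption that each class is a spherical Gaussian with its own center: within each class the two attributes are uncorrelated, but over the pooled data they are clearly correlated.

```python
import numpy as np

rng = np.random.default_rng(0)

# Within each class, a1 and a2 are sampled independently, but the two classes
# have different centers, so a1 and a2 are correlated when the classes are pooled.
red  = rng.normal(loc=[0.0, 1.0], scale=1.0, size=(5000, 2))
blue = rng.normal(loc=[3.0, 5.0], scale=1.0, size=(5000, 2))
both = np.vstack([red, blue])

print(np.corrcoef(red[:, 0], red[:, 1])[0, 1])    # near 0: independent given the class
print(np.corrcoef(both[:, 0], both[:, 1])[0, 1])  # clearly nonzero: dependent marginally
```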

Bayesian belief networks

Bayesian Belief Networks

Interesting because:
- The Naive Bayes assumption of conditional independence is too restrictive
- But learning is intractable without some such assumptions...
- Bayesian belief networks describe conditional independence among subsets of variables
  → allows combining prior knowledge about (in)dependencies among variables with observed training data

(also called Bayes Nets)

Bayesian belief networks

Conditional Independence

Definition: X is conditionally independent of Y given Z if the probability distribution governing X is independent of the value of Y given the value of Z; that is, if

(∀ xi, yj, zk)  P(X = xi | Y = yj, Z = zk) = P(X = xi | Z = zk)

More compactly, we write

P(X | Y, Z) = P(X | Z)

Bayesian belief networks

Conditional Independence

X is conditionally independent of Y given Z:

P(X | Y, Z) = P(X | Z)

Example: Thunder is conditionally independent of Rain, given Lightning:

P(Thunder | Rain, Lightning) = P(Thunder | Lightning)

Naive Bayes uses conditional independence to justify

P(X, Y | Z) = P(X | Y, Z) P(Y | Z) = P(X | Z) P(Y | Z)

Bayesian belief networks

Bayesian Belief Network

[Figure: Bayesian belief network over Storm, BusTourGroup, Lightning, Campfire, Thunder, and ForestFire, together with the conditional probability table for Campfire given Storm (S) and BusTourGroup (B).]

The network represents a set of conditional independence assertions:
- Each node is asserted to be conditionally independent of its nondescendants, given its immediate predecessors.
- The network is a directed acyclic graph.

Bayesian belief networks

[Figure: the same Bayesian belief network and Campfire conditional probability table as above.]

The network represents the joint probability distribution over all variables, e.g., P(Storm, BusTourGroup, ..., ForestFire). In general,

P(y1, ..., yn) = ∏_{i=1}^{n} P(yi | Parents(Yi))

where Parents(Yi) denotes the immediate predecessors of Yi in the graph. The joint distribution is fully defined by the graph and the terms P(yi | Parents(Yi)).
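
A sketch of this factorization on a fragment of the network above (Storm, BusTourGroup, and Campfire only); the probability numbers in the tables are illustrative, not taken from the slides.

```python
import math

# Parents and conditional probability tables (CPTs) for a three-node fragment.
parents = {"Storm": [], "BusTourGroup": [], "Campfire": ["Storm", "BusTourGroup"]}
cpt = {
    "Storm":        {(): 0.3},                       # P(Storm = T)
    "BusTourGroup": {(): 0.5},                       # P(BusTourGroup = T)
    "Campfire":     {(True, True): 0.4, (True, False): 0.1,
                     (False, True): 0.8, (False, False): 0.2},
}

def prob(var, value, assignment):
    """P(var = value | Parents(var)), read from the table."""
    key = tuple(assignment[p] for p in parents[var])
    p_true = cpt[var][key]
    return p_true if value else 1.0 - p_true

def joint(assignment):
    """P(y_1, ..., y_n) = product over nodes of P(y_i | Parents(Y_i))."""
    return math.prod(prob(v, assignment[v], assignment) for v in parents)

# P(Storm=T, BusTourGroup=F, Campfire=T) = 0.3 * 0.5 * 0.1 = 0.015
print(joint({"Storm": True, "BusTourGroup": False, "Campfire": True}))
```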

Bayesian belief networks

Inference in Bayesian Networks

How can one infer the (probabilities of) values of one or more network variables, given observed values of others?
- The Bayes net contains all information needed for this inference
- If only one variable has an unknown value, it is easy to infer it
- In the general case, the problem is NP-hard
- In practice, we can succeed in many cases:
  - Exact inference methods work well for some network structures
  - Monte Carlo methods “simulate” the network randomly to calculate approximate solutions (a sketch follows below)
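
A minimal rejection-sampling sketch of the Monte Carlo idea, reusing the hypothetical parents/cpt/prob definitions from the previous sketch: forward-sample the network, discard samples inconsistent with the evidence, and average what remains.

```python
import random

def forward_sample():
    """Sample each variable in topological order (parents before children)."""
    s = {}
    for var in ["Storm", "BusTourGroup", "Campfire"]:
        s[var] = random.random() < prob(var, True, s)   # prob() from the previous sketch
    return s

def rejection_query(query_var, evidence, n=100_000):
    """Approximate P(query_var = T | evidence) from consistent samples only."""
    kept = hits = 0
    for _ in range(n):
        s = forward_sample()
        if all(s[v] == val for v, val in evidence.items()):
            kept += 1
            hits += s[query_var]
    return hits / kept if kept else float("nan")

# e.g., estimate P(Storm = T | Campfire = T); with the illustrative CPTs above
# the exact answer is 0.075 / 0.425 ≈ 0.18.
print(rejection_query("Storm", {"Campfire": True}))
```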

Bayesian belief networks

Learning of Bayesian Networks

Several variants of this learning task:
- The network structure might be known or unknown
- Training examples might provide values of all network variables, or just some
- If the structure is known and we observe all variables, then it is as easy as training a Naive Bayes classifier

Bayesian belief networks

Learning Bayes Nets

Suppose structure known, variables partially observable

- e.g., observe ForestFire, Storm, BusTourGroup, Thunder, but not Lightning, Campfire...
- Similar to training a neural network with hidden units
- In fact, we can learn the network conditional probability tables using gradient ascent
- Converge to the network h that (locally) maximizes P(D|h)

Bayesian belief networks

Gradient Ascent for Bayes Nets

Let wijk denote one entry in the conditional probability table for variable Yi in the network:

wijk = P(Yi = yij | Parents(Yi) = the list uik of values)

e.g., if Yi = Campfire, then uik might be 〈Storm = T, BusTourGroup = F〉.

Perform gradient ascent by repeatedly:

1. Update all wijk using training data D:

   wijk ← wijk + η ∑_{d ∈ D} Ph(yij, uik | d) / wijk

2. Then renormalize the wijk to assure ∑_j wijk = 1 and 0 ≤ wijk ≤ 1.
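
A sketch of this update, assuming the table entries are stored as nested dictionaries w[i][j][k] and that a routine posterior(i, j, k, d) is available to supply Ph(yij, uik | d) via inference in the current network h; both the storage layout and that routine are assumptions, since the slides leave them unspecified.

```python
def gradient_ascent_step(w, data, posterior, eta=0.01):
    """One gradient-ascent step on CPT entries w[i][j][k] = P(Y_i = y_ij | u_ik).

    posterior(i, j, k, d) is a hypothetical helper returning Ph(y_ij, u_ik | d)
    for training example d under the current hypothesis h.
    """
    for i, table in w.items():
        # 1. update every entry using the gradient of ln P(D|h)
        for j in table:
            for k in table[j]:
                grad = sum(posterior(i, j, k, d) / table[j][k] for d in data)
                table[j][k] += eta * grad
        # 2. renormalize so that the w_ijk sum to 1 over j for each parent configuration k
        for k in next(iter(table.values())):
            total = sum(table[j][k] for j in table)
            for j in table:
                table[j][k] /= total
    return w
```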

Bayesian belief networks

More on Learning Bayes Nets

The EM algorithm can also be used. Repeatedly:

1. Calculate probabilities of unobserved variables, assuming h.
2. Calculate new wijk to maximize E[ln P(D|h)], where D now includes both observed and (calculated probabilities of) unobserved variables.

When the structure is unknown... algorithms use greedy search to add/subtract edges and nodes.

Bayesian belief networks

Summary: Bayesian Belief Networks

- Combine prior knowledge with observed data
- The impact of prior knowledge (when correct!) is to lower the sample complexity
- Further extensions:
  - Extend from boolean to real-valued variables
  - Parameterized distributions instead of tables
  - Extend to first-order instead of propositional systems
  - More effective inference methods
  - ...