Lecture notes for Introduction to Machine Learning (Computer Science), Baylor University.
Greg Hamerly
Some content from Tom Mitchell.
Outline
1 Naive Bayes classifier example: Learning over text data
2 Bayesian belief networks
3 Expectation-Maximization (EM) algorithm
Naive Bayes classifier example: Learning over text data

Why?
- Learn which news articles are of interest
- Learn to classify web pages by topic

Naive Bayes is among the most effective algorithms for this task. SVMs are even better... but naive Bayes is much simpler...

What attributes shall we use to represent text documents?
Naive Bayes classifier example: Learning over text data

Target concept Interesting?: Document → {+, −}
1 Represent each document by a vector of words: one attribute per word position in the document.
2 Learning: use training examples to estimate P(+), P(−), P(doc|+), P(doc|−).

Naive Bayes conditional independence assumption:

P(doc | v_j) = ∏_{i=1}^{length(doc)} P(a_i = w_k | v_j)

where P(a_i = w_k | v_j) is the probability that the word in position i is w_k, given class v_j.

One more assumption: P(a_i = w_k | v_j) = P(a_m = w_k | v_j), ∀ i, m
Naive Bayes classifier example: Learning over text data

Learn_naive_Bayes_text(Examples, V)
1 Collect all words and other tokens that occur in Examples:
  Vocabulary ← all distinct words and other tokens in Examples
2 Calculate the required P(v_j) and P(w_k | v_j) probability terms.
3 For each target value v_j in V do:
  - docs_j ← subset of Examples for which the target value is v_j
  - P(v_j) ← |docs_j| / |Examples|
  - Text_j ← a single document created by concatenating all members of docs_j
  - n ← total number of words in Text_j (counting duplicate words multiple times)
  - for each word w_k in Vocabulary:
    - n_k ← number of times word w_k occurs in Text_j
    - P(w_k | v_j) ← (n_k + 1) / (n + |Vocabulary|)
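A minimal sketch of this training step, assuming each example is a (tokens, label) pair and using the Laplace-smoothed estimate (n_k + 1)/(n + |Vocabulary|); the function and variable names are illustrative, not from the original notes.

```python
from collections import Counter, defaultdict

def learn_naive_bayes_text(examples):
    """examples: list of (tokens, label) pairs, where tokens is a list of words.
    Returns class priors P(v), smoothed word probabilities P(w|v), and the vocabulary."""
    vocabulary = {w for tokens, _ in examples for w in tokens}
    priors, cond = {}, defaultdict(dict)
    labels = {label for _, label in examples}
    for v in labels:
        docs_v = [tokens for tokens, label in examples if label == v]
        priors[v] = len(docs_v) / len(examples)
        text_v = [w for tokens in docs_v for w in tokens]   # concatenate all docs of class v
        n = len(text_v)                                     # total word count, duplicates included
        counts = Counter(text_v)
        for w in vocabulary:
            # Laplace ("add-one") smoothing: (n_k + 1) / (n + |Vocabulary|)
            cond[v][w] = (counts[w] + 1) / (n + len(vocabulary))
    return priors, cond, vocabulary
```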
Naive Bayes classifier example: Learning over text data

Classify_naive_Bayes_text(Doc)
- positions ← all word positions in Doc that contain tokens found in Vocabulary
- Return v_NB, where

v_NB = argmax_{v_j ∈ V} P(v_j) ∏_{i ∈ positions} P(a_i | v_j)
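A matching sketch of the classification step, using log probabilities to avoid underflow on long documents; it assumes a model in the form returned by the hypothetical learn_naive_bayes_text sketch above.

```python
import math

def classify_naive_bayes_text(doc_tokens, priors, cond, vocabulary):
    """Return argmax_v P(v) * prod_i P(a_i | v), computed in log space.
    Positions whose token is not in the vocabulary are skipped, as in the slide."""
    positions = [w for w in doc_tokens if w in vocabulary]
    best_label, best_score = None, float("-inf")
    for v, p_v in priors.items():
        score = math.log(p_v) + sum(math.log(cond[v][w]) for w in positions)
        if score > best_score:
            best_label, best_score = v, score
    return best_label
```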
Naive Bayes classifier example: Learning over text data

Representing a text document as a vector over a fixed vocabulary, where each element indicates the presence or count of a word in the document and word positions are discarded, is called the 'bag of words' model. Also known as 'unigram.'

Why is it useful? What does it lose?

Alternative methods which also discard position but improve context: bigrams, trigrams, etc.
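A small illustration of the two representations (the helper names are hypothetical): a bag-of-words count vector discards word order entirely, while bigrams keep pairs of adjacent words and so recover some local context.

```python
from collections import Counter

def bag_of_words(tokens):
    # word -> count; all position information is discarded
    return Counter(tokens)

def bigrams(tokens):
    # adjacent word pairs; still no absolute positions, but some local context
    return Counter(zip(tokens, tokens[1:]))

doc = "the quick brown fox jumps over the lazy dog".split()
print(bag_of_words(doc))   # e.g. Counter({'the': 2, 'quick': 1, ...})
print(bigrams(doc))        # e.g. Counter({('the', 'quick'): 1, ('quick', 'brown'): 1, ...})
```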
Naive Bayes classifier example: Learning over text data

Given 1000 training documents from each group, learn to classify new documents according to which newsgroup each came from:

alt.atheism, comp.graphics, comp.os.ms-windows.misc, comp.sys.ibm.pc.hardware, comp.sys.mac.hardware, comp.windows.x, misc.forsale, rec.autos, rec.motorcycles, rec.sport.baseball, rec.sport.hockey, sci.crypt, sci.electronics, sci.med, sci.space, soc.religion.christian, talk.politics.guns, talk.politics.mideast, talk.politics.misc, talk.religion.misc

Naive Bayes achieves 89% classification accuracy.
Naive Bayes classifier example: Learning over text data

Path: cantaloupe.srv.cs.cmu.edu!das-news.harvard.edu!ogicse!uwm.edu
From: [email protected] (John Doe)
Subject: Re: This year's biggest and worst (opinion)...
Date: 5 Apr 93 09:53:39 GMT

I can only comment on the Kings, but the most obvious candidate for pleasant surprise is Alex Zhitnik. He came highly touted as a defensive defenseman, but he's clearly much more than that. Great skater and hard shot (though wish he were more accurate). In fact, he pretty much allowed the Kings to trade away that huge defensive liability Paul Coffey. Kelly Hrudey is only the biggest disappointment if you thought he was any good to begin with. But, at best, he's only a
Naive Bayes classifier example: Learning over text data

[Figure: learning curve on the 20News data — classification accuracy (0-100%) versus training set size (roughly 100 to 10000 documents, with 1/3 withheld for test), comparing Bayes, TFIDF, and PRTFIDF classifiers.]
Naive Bayes classifier example: Learning over text data

Usually, the naive Bayes classifier is described with histogram-type probability distributions, i.e.

P(w | v) = |times w seen in class v| / |times class v seen|
We could alternatively use any probability distribution like Gaussian, binomial, etc.
We could even use a different distribution for each attribute.
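As one concrete alternative, here is a minimal sketch of a naive Bayes classifier that models each real-valued attribute with a per-class Gaussian; the function names and the smoothing constant are assumptions for illustration, not part of the original notes.

```python
import math
from collections import defaultdict

def fit_gaussian_nb(X, y):
    """X: list of real-valued attribute vectors, y: list of class labels.
    Fit a separate (mean, variance) per class and attribute."""
    params, priors = defaultdict(list), {}
    for v in set(y):
        rows = [x for x, label in zip(X, y) if label == v]
        priors[v] = len(rows) / len(X)
        for j in range(len(X[0])):
            col = [r[j] for r in rows]
            mean = sum(col) / len(col)
            var = sum((c - mean) ** 2 for c in col) / len(col) + 1e-9  # avoid zero variance
            params[v].append((mean, var))
    return priors, params

def log_gaussian(x, mean, var):
    # log of the Gaussian density at x
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)

def predict(x, priors, params):
    # argmax over classes of log P(v) + sum_j log P(a_j | v)
    return max(priors, key=lambda v: math.log(priors[v])
               + sum(log_gaussian(xj, m, s2) for xj, (m, s2) in zip(x, params[v])))
```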
Naive Bayes classifier example: Learning over text data

- Text classification
  - Spam classification
  - Finding articles 'of interest'
  - News article classification (juggling!?)
  - Identifying 'high-quality' posts in a forum
  - Grading papers?
- Failure prediction (e.g. hard disk drives, motors, software, etc.)
- Others?
Naive Bayes classifier example: Learning over text data

The NB classifier builds a distribution over the attributes for each class: k classes → k distributions. Each distribution treats the attributes as independent, given the class.

However, this does not mean that the attributes are considered completely independent!

That is, P(a_1, a_2 | v) = P(a_1 | v) P(a_2 | v), but not P(a_1, a_2) = P(a_1) P(a_2).
Naive Bayes classifier example: Learning over text data

P(a_1, a_2 | red) = P(a_1 | red) P(a_2 | red), but not P(a_1, a_2) = P(a_1) P(a_2)

[Figure: scatter plot of attribute a_1 versus a_2, illustrating independence given the class without marginal independence.]
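A quick numerical check of this point, with made-up parameters: within each class the two attributes are sampled independently, yet because the class shifts both means, the attributes are correlated when the class is ignored.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
labels = rng.integers(0, 2, size=n)              # two classes, e.g. "red" / "blue"
means = np.array([[0.0, 0.0], [4.0, 5.0]])       # class-dependent means (illustrative values)
X = means[labels] + rng.standard_normal((n, 2))  # a_1, a_2 independent *given* the class

print(np.corrcoef(X[labels == 0].T)[0, 1])  # ~0: independent within a class
print(np.corrcoef(X.T)[0, 1])               # clearly nonzero: dependent marginally
```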
Bayesian belief networks
Bayesian belief networks (also called Bayes nets) are interesting because:
- The naive Bayes assumption of conditional independence is too restrictive
- But learning is intractable without some such assumptions...
- Bayesian belief networks describe conditional independence among subsets of variables
- → this allows combining prior knowledge about (in)dependencies among variables with observed training data
Bayesian belief networks
Definition: X is conditionally independent of Y given Z if the probability distribution governing X is independent of the value of Y given the value of Z; that is, if

(∀ x_i, y_j, z_k)  P(X = x_i | Y = y_j, Z = z_k) = P(X = x_i | Z = z_k)

More compactly, we write

P(X | Y, Z) = P(X | Z)
Bayesian belief networks
X is conditionally independent of Y given Z :
P(X |Y , Z ) = P(X |Z )
Example: Thunder is conditionally independent of Rain, given Lightning:

P(Thunder | Rain, Lightning) = P(Thunder | Lightning)

Naive Bayes uses conditional independence to justify

P(X, Y | Z) = P(X | Y, Z) P(Y | Z) = P(X | Z) P(Y | Z)
Bayesian belief networks
Storm
Lightning Campfire
Thunder ForestFire
Campfire
C ¬C
¬S,B ¬S,¬B
S,¬B
BusTourGroup
S,B
Network represents a set of conditional independence assertions: Each node is asserted to be conditionally independent of its nondescendants, given its immediate predecessors. Directed acyclic graph
Bayesian belief networks

[Same network figure as above.]

The network represents the joint probability distribution over all variables, e.g., P(Storm, BusTourGroup, ..., ForestFire). In general,

P(y_1, ..., y_n) = ∏_{i=1}^{n} P(y_i | Parents(Y_i))

where Parents(Y_i) denotes the immediate predecessors of Y_i in the graph. The joint distribution is fully defined by the graph and the terms P(y_i | Parents(Y_i)).
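A small sketch of this factorization on a toy three-node chain (Storm → Lightning → Thunder) with made-up probability tables; the joint probability of a full assignment is just the product of each node's CPT entry given its parents.

```python
# Each node maps to (list of parents, CPT from parent-value tuples to P(node = True | parents)).
# The numbers below are illustrative only, not taken from the notes.
network = {
    "Storm":     ([], {(): 0.2}),
    "Lightning": (["Storm"], {(True,): 0.7, (False,): 0.05}),
    "Thunder":   (["Lightning"], {(True,): 0.9, (False,): 0.01}),
}

def joint_probability(assignment, network):
    """P(y_1, ..., y_n) = prod_i P(y_i | Parents(Y_i)), read off the CPTs."""
    p = 1.0
    for var, (parents, cpt) in network.items():
        p_true = cpt[tuple(assignment[parent] for parent in parents)]
        p *= p_true if assignment[var] else 1.0 - p_true
    return p

print(joint_probability({"Storm": True, "Lightning": True, "Thunder": False}, network))
# 0.2 * 0.7 * (1 - 0.9) = 0.014
```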
Bayesian belief networks
How can one infer the (probabilities of) values of one or more network variables, given observed values of others?
- The Bayes net contains all information needed for this inference.
- If only one variable has an unknown value, it is easy to infer it.
- In the general case, the problem is NP-hard.
- In practice, inference can succeed in many cases:
  - Exact inference methods work well for some network structures.
  - Monte Carlo methods "simulate" the network randomly to calculate approximate solutions.
Bayesian belief networks
Several variants of this learning task:
- The network structure might be known or unknown.
- Training examples might provide values of all network variables, or just some.
- If the structure is known and we observe all variables, then it is as easy as training a naive Bayes classifier.
Bayesian belief networks
Suppose the structure is known and the variables are partially observable, e.g., we observe ForestFire, Storm, BusTourGroup, and Thunder, but not Lightning or Campfire...
- This is similar to training a neural network with hidden units.
- In fact, we can learn the network's conditional probability tables using gradient ascent.
- This converges to a network h that (locally) maximizes P(D | h).
Bayesian belief networks
Let w_ijk denote one entry in the conditional probability table for variable Y_i in the network:

w_ijk = P(Y_i = y_ij | Parents(Y_i) = the list u_ik of values)

e.g., if Y_i = Campfire, then u_ik might be ⟨Storm = T, BusTourGroup = F⟩.

Perform gradient ascent by repeatedly:
1 Update all w_ijk using training data D:

w_ijk ← w_ijk + η ∑_{d ∈ D} P_h(y_ij, u_ik | d) / w_ijk

2 Then renormalize the w_ijk to ensure ∑_j w_ijk = 1 and 0 ≤ w_ijk ≤ 1.
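A sketch of one such gradient-ascent step under the update rule above; here `posterior(i, j, k, d, w)` is a stand-in for an inference routine that computes P_h(y_ij, u_ik | d), which is assumed rather than implemented, and the nested-list CPT layout is purely illustrative.

```python
def gradient_ascent_step(w, data, posterior, eta=0.01):
    """w[i][j][k] = P(Y_i = y_ij | Parents(Y_i) = u_ik).
    Apply  w_ijk <- w_ijk + eta * sum_d P_h(y_ij, u_ik | d) / w_ijk,
    then renormalize so that sum_j w_ijk = 1 for every (i, k)."""
    for i in range(len(w)):
        for j in range(len(w[i])):
            for k in range(len(w[i][j])):
                grad = sum(posterior(i, j, k, d, w) for d in data)
                w[i][j][k] += eta * grad / w[i][j][k]
    # renormalize over j (the values of Y_i) for each parent configuration k
    for i in range(len(w)):
        for k in range(len(w[i][0])):
            total = sum(w[i][j][k] for j in range(len(w[i])))
            for j in range(len(w[i])):
                w[i][j][k] /= total
    return w
```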
Bayesian belief networks
The EM algorithm can also be used. Repeatedly:
1 Calculate probabilities of unobserved variables, assuming h.
2 Calculate new w_ijk to maximize E[ln P(D | h)], where D now includes both observed variables and the (calculated probabilities of) unobserved variables.

When the structure is unknown... algorithms use greedy search to add/subtract edges and nodes.
Bayesian belief networks
- Combine prior knowledge with observed data.
- The impact of prior knowledge (when correct!) is to lower the sample complexity.
- Further extensions:
  - Extend from Boolean to real-valued variables
  - Parameterized distributions instead of tables
  - Extend to first-order instead of propositional systems
  - More effective inference methods
  - ...