CS 188: Artificial Intelligence Machine Learning, Lecture notes of Machine Learning

CS 188: Artificial Intelligence
Naïve Bayes
Instructors: Dan Klein and Pieter Abbeel --- University of California, Berkeley
[These slides were created by Dan Klein and Pieter Abbeel for CS188 Intro to AI at UC Berkeley. All CS188 materials are available at http://ai.berkeley.edu.]
Machine Learning
Up until now: how to use a model to make optimal decisions
Machine learning: how to acquire a model from data / experience
Learning parameters (e.g. probabilities)
Learning structure (e.g. BN graphs)
Learning hidden concepts (e.g. clustering, neural nets)
Today: model-based classification with Naive Bayes

Classification

Example: Spam Filter

• Input: an email
• Output: spam/ham
• Setup:
  • Get a large collection of example emails, each labeled “spam” or “ham”
  • Note: someone has to hand label all this data!
  • Want to learn to predict labels of new, future emails
• Features: the attributes used to make the ham / spam decision
  • Words: FREE!
  • Text patterns: $dd, CAPS
  • Non-text: SenderInContacts, WidelyBroadcast
  • …

Example emails:

Dear Sir. First, I must solicit your confidence in this transaction, this is by virture of its nature as being utterly confidencial and top secret. …

TO BE REMOVED FROM FUTURE MAILINGS, SIMPLY REPLY TO THIS MESSAGE AND PUT "REMOVE" IN THE SUBJECT. 99 MILLION EMAIL ADDRESSES FOR ONLY $

Ok, Iknow this is blatantly OT but I'm beginning to go insane. Had an old Dell Dimension XPS sitting in the corner and decided to put it to use, I know it was working pre being stuck in the corner, but when I plugged it in, hit the power nothing happened.
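A minimal sketch of how an email might be mapped to features of the kinds listed above. The function name, feature names, thresholds, and the sample message are all illustrative assumptions, not part of the course code.

```python
import re

def extract_features(email_text, sender, contacts):
    """Map an email to a dict of binary features (all names are illustrative)."""
    words = email_text.split()
    # A word counts as CAPS if it has letters and all of them are uppercase.
    caps_words = [w for w in words if len(w) > 1 and w.isupper()]
    return {
        "contains_FREE": "FREE" in email_text,                          # word feature
        "has_dollar_amount": re.search(r"\$\d\d", email_text) is not None,  # $dd pattern
        "mostly_CAPS": len(words) > 0 and len(caps_words) / len(words) > 0.5,
        "SenderInContacts": sender in contacts,                         # non-text feature
    }

# Hypothetical message and address book.
features = extract_features(
    "GET FREE PILLS FOR ONLY $10 TODAY",
    sender="unknown@example.com",
    contacts={"friend@example.com"},
)
print(features)
```

Each feature becomes one observed variable for the classifier; which features to include is a modeling choice.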

Model-Based Classification


• Model-based approach
  • Build a model (e.g. Bayes’ net) where both the output label and input features are random variables
  • Instantiate any observed features
  • Query for the distribution of the label conditioned on the features
• Challenges
  • What structure should the BN have?
  • How should we learn its parameters?

Naïve Bayes for Digits

• Naïve Bayes: Assume all features are independent effects of the label
• Simple digit recognition version:
  • One feature (variable) F_ij for each grid position <i,j>
  • Feature values are on / off, based on whether intensity is more or less than 0.5 in the underlying image
  • Each input maps to a feature vector
  • Here: lots of features, each is binary valued
• Naïve Bayes model:
• What do we need to learn?

[Figure: Bayes net with label Y as the parent of features F1, F2, …, Fn]
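A small sketch of the on / off feature mapping described above, assuming the image arrives as a list of rows of intensities in [0, 1]. The helper name and the tiny 3x3 image are hypothetical; treating an intensity of exactly 0.5 as "off" is also an assumption.

```python
def binarize(image, threshold=0.5):
    """Turn a grid of intensities into on/off features keyed by grid position (i, j)."""
    return {
        (i, j): intensity > threshold
        for i, row in enumerate(image)
        for j, intensity in enumerate(row)
    }

# Hypothetical 3x3 "image" with a bright vertical stroke in the middle column.
tiny_image = [
    [0.0, 0.9, 0.1],
    [0.2, 0.8, 0.0],
    [0.0, 0.7, 0.3],
]
features = binarize(tiny_image)
print(features[(0, 1)], features[(0, 0)])
```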

General Naïve Bayes

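• A general Naive Bayes model:
  • Full joint distribution over Y, F1, …, Fn: |Y| x |F|^n values
  • Prior P(Y): |Y| parameters
  • Conditionals P(Fi|Y) for all n features: n x |F| x |Y| parameters
• We only have to specify how each feature depends on the class
• Total number of parameters is linear in n
• Model is very simplistic, but often works anyway

[Figure: Bayes net with label Y as the parent of features F1, F2, …, Fn]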
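A sketch of the factorization these bullets describe, written in standard notation. The lowercase f_i for observed feature values is an assumption of this sketch.

```latex
% Joint distribution under the Naive Bayes assumption:
P(Y, F_1, \ldots, F_n) = P(Y) \prod_{i=1}^{n} P(F_i \mid Y)

% Classification query: posterior over the label given observed features
P(Y \mid f_1, \ldots, f_n) \propto P(Y) \prod_{i=1}^{n} P(f_i \mid Y)
```

The prior contributes |Y| numbers and each of the n conditional tables contributes |F| x |Y|, which is why the parameter count grows linearly in n while the full joint table grows as |F|^n.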

Example: Conditional Probabilities

[Three probability tables indexed by digit label (1, 2, …, 9, 0); the numeric values are truncated to "0." in this copy]

Naïve Bayes for Text

• Bag-of-words Naïve Bayes:
  • Features: W_i is the word at position i
  • As before: predict label conditioned on feature variables (spam vs. ham)
  • As before: assume features are conditionally independent given label
  • New: each W_i is identically distributed
• Generative model:
• “Tied” distributions and bag-of-words
  • Usually, each variable gets its own conditional probability distribution P(F|Y)
  • In a bag-of-words model
    • Each position is identically distributed
    • All positions share the same conditional probs P(W|Y)
    • Why make this assumption?
  • Called “bag-of-words” because model is insensitive to word order or reordering

Note: W_i is the word at position i, not the i-th word in the dictionary!
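A minimal sketch of bag-of-words classification with a single tied table P(W|Y) per label, scored in log space. The function names and the toy probabilities are hypothetical. A word missing from a table would raise a KeyError here; handling unseen words is the job of smoothing.

```python
import math

def log_joint(words, label, prior, word_probs):
    """log P(label) + sum over positions of log P(w_i | label).

    Every position uses the same table word_probs[label] (tied distributions)."""
    total = math.log(prior[label])
    for w in words:
        total += math.log(word_probs[label][w])
    return total

def classify(words, prior, word_probs):
    """Return the label with the highest joint log-probability."""
    return max(prior, key=lambda y: log_joint(words, y, prior, word_probs))

# Hypothetical toy parameters.
prior = {"spam": 0.5, "ham": 0.5}
word_probs = {
    "spam": {"free": 0.6, "money": 0.3, "meeting": 0.1},
    "ham": {"free": 0.1, "money": 0.2, "meeting": 0.7},
}

print(classify(["free", "money"], prior, word_probs))
print(classify(["meeting", "money"], prior, word_probs))
```

Because the score is a sum over positions with a shared table, permuting the words leaves it unchanged (up to floating-point rounding), which is the sense in which the model ignores word order.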

Example: Spam Filtering

• Model:
• What are the parameters?

Spam Example

Class prior:

  ham : 0.
  spam: 0.

Two word tables (side by side):

  the  : 0.      the  : 0.
  to   : 0.      to   : 0.
  and  : 0.      of   : 0.
  of   : 0.      2002 : 0.
  you  : 0.      with : 0.
  a    : 0.      from : 0.
  with : 0.      and  : 0.
  from : 0.      a    : 0.

• Where do these tables come from?

Word      P(w|spam)  P(w|ham)  Tot Spam  Tot Ham
(prior)   0.33333    0.66666   -1.1      -0.
Gary      0.00002    0.00021   -11.8     -8.
would     0.00069    0.00084   -19.1     -16.
you       0.00881    0.00304   -23.8     -21.
like      0.00086    0.00083   -30.9     -28.
to        0.01517    0.01339   -35.1     -33.
lose      0.00008    0.00002   -44.5     -44.
weight    0.00016    0.00002   -53.3     -55.
while     0.00027    0.00027   -61.5     -63.
you       0.00881    0.00304   -66.2     -69.
sleep     0.00006    0.00001   -76.0     -80.

P(spam | w) = 98.
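A sketch that recomputes the running totals from the per-word probabilities in the table above, assuming the "Tot" columns are cumulative natural-log probabilities. Because the table's probabilities are rounded, the printed totals may differ from the table in the last digit.

```python
import math

prior = {"spam": 0.33333, "ham": 0.66666}

# (word, P(w|spam), P(w|ham)) rows copied from the table above.
rows = [
    ("Gary", 0.00002, 0.00021),
    ("would", 0.00069, 0.00084),
    ("you", 0.00881, 0.00304),
    ("like", 0.00086, 0.00083),
    ("to", 0.01517, 0.01339),
    ("lose", 0.00008, 0.00002),
    ("weight", 0.00016, 0.00002),
    ("while", 0.00027, 0.00027),
    ("you", 0.00881, 0.00304),
    ("sleep", 0.00006, 0.00001),
]

log_spam = math.log(prior["spam"])
log_ham = math.log(prior["ham"])
for word, p_spam, p_ham in rows:
    log_spam += math.log(p_spam)
    log_ham += math.log(p_ham)
    print(f"{word:8s} {log_spam:8.1f} {log_ham:8.1f}")

# Normalize without exponentiating the very small joint probabilities directly:
# P(spam | w) = 1 / (1 + exp(log_ham - log_spam))
p_spam_given_w = 1.0 / (1.0 + math.exp(log_ham - log_spam))
print(f"P(spam | w) = {100 * p_spam_given_w:.1f}%")
```

Working in log space avoids underflow: the raw joint probabilities here are on the order of e^-76.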

Important Concepts

• Data: labeled instances (e.g. emails marked spam/ham)
  • Training set
  • Held out set
  • Test set
• Features: attribute-value pairs which characterize each x
• Experimentation cycle
  • Learn parameters (e.g. model probabilities) on training set
  • (Tune hyperparameters on held-out set)
  • Compute accuracy on test set
  • Very important: never “peek” at the test set!
• Evaluation (many metrics possible, e.g. accuracy)
  • Accuracy: fraction of instances predicted correctly
• Overfitting and generalization
  • Want a classifier which does well on test data
  • Overfitting: fitting the training data very closely, but not generalizing well
  • We’ll investigate overfitting and generalization formally in a few lectures

[Figure: the data split into Training Data, Held-Out Data, and Test Data]
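A minimal sketch of the three-way split and the accuracy metric described above. The 80/10/10 proportions, the fixed seed, and the function names are assumptions chosen for illustration.

```python
import random

def split_data(examples, train_frac=0.8, held_out_frac=0.1, seed=0):
    """Shuffle once, then cut into training, held-out, and test portions."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_train = int(train_frac * len(shuffled))
    n_held = int(held_out_frac * len(shuffled))
    train = shuffled[:n_train]
    held_out = shuffled[n_train:n_train + n_held]
    test = shuffled[n_train + n_held:]  # touched only for the final evaluation
    return train, held_out, test

def accuracy(predict, labeled_examples):
    """Fraction of (x, y) pairs for which predict(x) == y."""
    correct = sum(1 for x, y in labeled_examples if predict(x) == y)
    return correct / len(labeled_examples)
```

In this workflow parameters are fit on `train`, hyperparameters are compared using `accuracy` on `held_out`, and `test` is scored once at the end.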

Generalization and Overfitting

[Figure: Degree 15 polynomial]

Overfitting

Example: Overfitting

2 wins!!

Parameter Estimation


• Estimating the distribution of a random variable
• Elicitation: ask a human (why is this hard?)
• Empirically: use training data (learning!)
  • E.g.: for each outcome x, look at the empirical rate of that value:
  • This is the estimate that maximizes the likelihood of the data

Sample outcomes: r r b r b b b r b b b r r b b r b b
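A short sketch of the empirical-rate estimate, P_ML(x) = count(x) / total samples, applied to a hypothetical three-draw sample. The function name is an assumption.

```python
from collections import Counter

def ml_estimate(samples):
    """Relative-frequency (maximum likelihood) estimate of P(x)."""
    counts = Counter(samples)
    total = len(samples)
    return {x: c / total for x, c in counts.items()}

# Hypothetical sample: two r draws and one b draw.
estimate = ml_estimate(["r", "r", "b"])
print(estimate)
```

Outcomes that never appear in the sample get no entry at all, that is, an implicit probability of zero.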

Smoothing

Maximum Likelihood?

• Relative frequencies are the maximum likelihood estimates
• Another option is to consider the most likely parameter value given the data
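The two options can be written side by side in standard notation. This is a sketch: the symbols θ for the parameters and X for the observed data are assumptions.

```latex
% Maximum likelihood: parameters that make the observed data most probable
\theta_{ML} = \arg\max_{\theta} P(\mathbf{X} \mid \theta)

% Most likely parameter value given the data (MAP); P(\mathbf{X}) does not
% depend on \theta, so it drops out of the argmax
\theta_{MAP} = \arg\max_{\theta} P(\theta \mid \mathbf{X})
             = \arg\max_{\theta} P(\mathbf{X} \mid \theta) \, P(\theta)
```

The only difference between the two is the prior P(θ), which is where smoothing enters.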

Laplace Smoothing

• Laplace’s estimate (extended):
  • Pretend you saw every outcome k extra times
  • What’s Laplace with k = 0?
  • k is the strength of the prior
• Laplace for conditionals:
  • Smooth each condition independently:

Sample: r r b
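A sketch of the add-k estimate in its usual form, P_LAP,k(x) = (c(x) + k) / (N + k|X|), applied to the sample r r b. The function name and the explicit `outcomes` argument are assumptions. With k = 0 and an empty sample the denominator would be zero.

```python
from collections import Counter

def laplace_estimate(samples, outcomes, k=1):
    """Add-k smoothed estimate: pretend every outcome was seen k extra times."""
    counts = Counter(samples)
    denom = len(samples) + k * len(outcomes)
    return {x: (counts[x] + k) / denom for x in outcomes}

sample = ["r", "r", "b"]
for k in (0, 1, 100):
    print(k, laplace_estimate(sample, outcomes=["r", "b"], k=k))
```

With k = 0 this reduces to the relative-frequency estimate; as k grows the estimate is pulled toward uniform. For conditionals, one way to smooth each condition independently is to call the same function once per value of y on the samples with that label.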

Estimation: Linear Interpolation*

• In practice, Laplace often performs poorly for P(X|Y):
  • When |X| is very large
  • When |Y| is very large
• Another option: linear interpolation
  • Also get the empirical P(X) from the data
  • Make sure the estimate of P(X|Y) isn’t too different from the empirical P(X)
  • What if α is 0? 1?
• For even better ways to estimate parameters, as well as details of the math, see cs281a, cs
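A sketch of linear interpolation in its usual form, P_LIN(x|y) = α P̂(x|y) + (1 − α) P̂(x), where P̂ denotes an empirical estimate. The function name and the toy distributions are hypothetical.

```python
def interpolate(p_x_given_y, p_x, alpha):
    """Mix the empirical conditional P(x|y) with the empirical marginal P(x).

    Outcomes missing from p_x_given_y are treated as having empirical
    conditional probability 0."""
    return {
        x: alpha * p_x_given_y.get(x, 0.0) + (1 - alpha) * p_x[x]
        for x in p_x
    }

# Hypothetical estimates for one class y; "sleep" was never seen with this class.
p_x = {"free": 0.2, "money": 0.3, "sleep": 0.5}
p_x_given_y = {"free": 0.5, "money": 0.5}

mixed = interpolate(p_x_given_y, p_x, alpha=0.8)
print(mixed)
```

With α = 1 the result is the unsmoothed conditional, and with α = 0 the class is ignored entirely; values in between give unseen words a nonzero probability.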

Real NB: Smoothing

• For real classification problems, smoothing is critical
• New odds ratios:

  helvetica : 11.
  seems : 10.
  group : 10.
  ago : 8.
  areas : 8.
  ...

  verdana : 28.
  Credit : 28.
  ORDER : 27.
  : 26.
  money : 26.
  ...

Do these make more sense?

Tuning

Errors, and What to Do

• Examples of errors

Dear GlobalSCAPE Customer, GlobalSCAPE has partnered with ScanSoft to offer you the latest version of OmniPage Pro, for just $99.99* - the regular list price is $499! The most common question we've received about this offer is - Is this genuine? We would like to assure you that this offer is authorized by ScanSoft, is genuine and valid. You can get the...

... To receive your $30 Amazon.com promotional certificate, click through to http://www.amazon.com/apparel and see the prominent link for the $30 offer. All details are there. We hope you enjoyed receiving this message. However, if you'd rather not receive future e-mails announcing new store launches, please click...

What to Do About Errors?

• Need more features: words aren’t enough!
  • Have you emailed the sender before?
  • Have 1K other people just gotten the same email?
  • Is the sending information consistent?
  • Is the email in ALL CAPS?
  • Do inline URLs point where they say they point?
  • Does the email address you by (your) name?
• Can add these information sources as new variables in the NB model
• Next class we’ll talk about classifiers which let you add arbitrary features more easily, and, later, how to induce new features

Baselines

• First step: get a baseline
  • Baselines are very simple “straw man” procedures
  • Help determine how hard the task is
  • Help know what a “good” accuracy is
• Weak baseline: most frequent label classifier
  • Gives all test instances whatever label was most common in the training set
  • E.g. for spam filtering, might label everything as ham
  • Accuracy might be very high if the problem is skewed
  • E.g. calling everything “ham” gets 66%, so a classifier that gets 70% isn’t very good…
• For real research, usually use previous work as a (strong) baseline
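A sketch of the most frequent label baseline on a hypothetical data set with the same 66% ham skew mentioned above. The function name and the data are illustrative, and the toy test set reuses the training labels only to reproduce the skew.

```python
from collections import Counter

def most_frequent_label_classifier(train_labels):
    """Return a classifier that ignores its input and always predicts
    the label seen most often in training."""
    most_common_label, _ = Counter(train_labels).most_common(1)[0]
    return lambda example: most_common_label

# Hypothetical skewed data: 66 ham and 34 spam labels.
train_labels = ["ham"] * 66 + ["spam"] * 34
baseline = most_frequent_label_classifier(train_labels)

# Toy stand-in for a test set with the same label skew.
test_set = [("email %d" % i, label) for i, label in enumerate(train_labels)]
baseline_accuracy = sum(baseline(x) == y for x, y in test_set) / len(test_set)
print(baseline_accuracy)
```

Any learned classifier should be judged against this number rather than against 0% or 50%.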

Confidences from a Classifier

• The confidence of a probabilistic classifier:
  • Posterior probability of the top label
  • Represents how sure the classifier is of the classification
  • Any probabilistic model will have confidences
  • No guarantee confidence is correct
• Calibration
  • Weak calibration: higher confidences mean higher accuracy
  • Strong calibration: confidence predicts accuracy rate
  • What’s the value of calibration?
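One rough way to inspect calibration is to bucket predictions by confidence and compare each bucket's mean confidence with its accuracy. This is a sketch under assumptions: the function name, the equal-width bins, and the toy data are all hypothetical.

```python
def calibration_table(confidences, correct, n_bins=5):
    """Group predictions into equal-width confidence bins and report
    (mean confidence, accuracy, count) for each non-empty bin."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[idx].append((conf, ok))
    table = []
    for b in bins:
        if b:
            mean_conf = sum(c for c, _ in b) / len(b)
            acc = sum(1 for _, ok in b if ok) / len(b)
            table.append((mean_conf, acc, len(b)))
    return table

# Hypothetical classifier outputs: top-label posterior and whether it was right.
confidences = [0.9, 0.9, 0.9, 0.9, 0.6, 0.6]
correct = [True, True, True, False, True, False]
for mean_conf, acc, n in calibration_table(confidences, correct):
    print(f"confidence ~{mean_conf:.2f}: accuracy {acc:.2f} over {n} predictions")
```

If accuracy rises with confidence across the bins, the classifier is weakly calibrated in the sense above; if the two columns roughly agree, it is closer to strongly calibrated.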