










This lecture was delivered by Dr. Ramya Riya at Ankit Institute of Technology and Science. It is part of a lecture series for the Machine Learning and Artificial Intelligence course. Topics include: System, Design, Prioritizing, Spam, Classification, Building, Subject, Supervised, Learning, Features, Routing, Information, Algorithm, Detect
Typology: Slides











Andrew Ng
Building a spam classifier

From: [email protected]
To: [email protected]
Subject: Buy now!
Deal of the week! Buy now!
Rolex w4tchs - $
Med1cine (any kind) - $
Also low cost M0rgages available.

From: Alfred Ng
To: [email protected]
Subject: Christmas dates?
Hey Andrew,
Was talking to Mom about plans for Xmas. When do you get off work. Meet Dec 22?
Alf

Building a spam classifier
How to spend your time to make it have low error?
- Collect lots of data, e.g. "honeypot" project.
- Develop sophisticated features based on email routing information (from email header).
- Develop sophisticated features for message body, e.g. should "discount" and "discounts" be treated as the same word? How about "deal" and "Dealer"? Features about punctuation?
- Develop sophisticated algorithm to detect misspellings (e.g. m0rtgage, med1cine, w4tches).
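
The last option above, detecting deliberate misspellings, could begin with something as simple as undoing common digit-for-letter swaps before a word is looked up. The sketch below is only an illustration under stated assumptions: the substitution table `LEET_MAP` and the helper `normalize_token` are hypothetical names, not something given in the lecture.

```python
# Hypothetical digit-to-letter substitutions (an assumption for illustration,
# not a list taken from the lecture).
LEET_MAP = str.maketrans({"0": "o", "1": "i", "3": "e", "4": "a", "5": "s"})


def normalize_token(token: str) -> str:
    """Lower-case a token and undo digit-for-letter swaps.

    The swap is applied only when the token mixes letters and digits, so a
    plain number such as "22" is left untouched.
    """
    has_alpha = any(ch.isalpha() for ch in token)
    has_digit = any(ch.isdigit() for ch in token)
    lowered = token.lower()
    if has_alpha and has_digit:
        return lowered.translate(LEET_MAP)
    return lowered


if __name__ == "__main__":
    for word in ["m0rtgage", "med1cine", "w4tches", "22"]:
        print(word, "->", normalize_token(word))
```

Whether a feature like this is worth the effort is exactly the kind of question that should be settled by measuring error rather than by intuition.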

Error Analysis
500 examples in cross validation set.
Algorithm misclassifies 100 emails.
Manually examine the 100 errors, and categorize them based on:
(i) What type of email it is
(ii) What cues (features) you think would have helped the algorithm classify them correctly.

Pharma:
Replica/fake:
Steal passwords:
Other:

Deliberate misspellings (m0rgage, med1cine, etc.):
Unusual email routing:
Unusual (spamming) punctuation:
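
Once the errors have been labeled by hand, tallying them is mechanical. The following is a minimal sketch, assuming each misclassified email was recorded as a small dictionary; the records in `labeled_errors` are invented for illustration and are not the counts from the lecture.

```python
from collections import Counter

# Hypothetical hand labels for a few misclassified emails; in practice there
# would be one entry per error found in the cross validation set.
labeled_errors = [
    {"type": "pharma", "cues": ["misspelling"]},
    {"type": "steal passwords", "cues": ["routing", "punctuation"]},
    {"type": "pharma", "cues": []},
    {"type": "replica/fake", "cues": ["punctuation"]},
    {"type": "steal passwords", "cues": ["misspelling"]},
    {"type": "steal passwords", "cues": []},
]

# Count how often each email type and each helpful cue appears.
type_counts = Counter(e["type"] for e in labeled_errors)
cue_counts = Counter(cue for e in labeled_errors for cue in e["cues"])

# Largest categories first: these suggest where feature work may pay off.
for name, count in type_counts.most_common():
    print(f"{name}: {count}")
for name, count in cue_counts.most_common():
    print(f"{name}: {count}")
```

The largest categories are the natural candidates for new features, since fixing them could remove the most errors.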

The importance of numerical evaluation
Should discount/discounts/discounted/discounting be treated as the same word?
Can use "stemming" software (e.g. "Porter stemmer"). Stemming can also conflate unrelated words, e.g. universe/university.
Error analysis may not be helpful for deciding if this is likely to improve performance. Only solution is to try it and see if it works.
Need numerical evaluation (e.g., cross validation error) of algorithm's performance with and without stemming.
Without stemming:
With stemming:
Distinguish upper vs. lower case (Mom/mom):
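
A with/without comparison of this kind can be sketched as follows. Everything here is an assumption made for illustration: `crude_stem` is a rough suffix stripper standing in for a real stemmer (it is not the Porter algorithm), `predict_spam` is a toy keyword rule rather than a trained classifier, and the four-example `cv_set` is invented.

```python
def crude_stem(word: str) -> str:
    """Very rough suffix stripping; a stand-in for a real stemmer, NOT Porter."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 4:
            return word[: -len(suffix)]
    return word


def predict_spam(text: str, spam_words: set, use_stemming: bool) -> int:
    """Toy rule: predict spam (1) if any token matches a known spam word."""
    tokens = text.lower().split()
    if use_stemming:
        tokens = [crude_stem(t) for t in tokens]
    return int(any(t in spam_words for t in tokens))


def cv_error(examples, spam_words: set, use_stemming: bool) -> float:
    """Fraction of cross validation examples that are misclassified."""
    wrong = sum(
        predict_spam(text, spam_words, use_stemming) != label
        for text, label in examples
    )
    return wrong / len(examples)


# Hypothetical cross validation set of (text, label) pairs; 1 = spam.
cv_set = [
    ("huge discounts today", 1),
    ("discounted watches", 1),
    ("discount inside", 1),
    ("meeting moved to friday", 0),
]
spam_words = {"discount"}

error_without = cv_error(cv_set, spam_words, use_stemming=False)
error_with = cv_error(cv_set, spam_words, use_stemming=True)
print(f"without stemming: {error_without:.2f}, with stemming: {error_with:.2f}")
```

On this toy data stemming happens to help, but that says nothing about a real spam corpus. The point is only that a single number per variant makes the decision quick.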

function y = predictCancer(x)
  y = 0;  % ignore x!
return

Precision/Recall
y = 1 in presence of rare class that we want to detect
Precision
(Of all patients where we predicted y = 1, what fraction actually has cancer?)
Recall
(Of all patients that actually have cancer, what fraction did we correctly detect as having cancer?)
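
The two definitions can be checked on a small example. The sketch below uses Python for illustration (the slide's own snippet is Octave). The label vectors are invented, and reporting a zero-denominator ratio as 0.0 is a convention assumed here, not something stated in the lecture.

```python
def precision_recall(y_true, y_pred):
    """Return (precision, recall) for the positive class y = 1.

    A ratio with a zero denominator is reported as 0.0 (an assumed
    convention; some libraries warn or return NaN instead).
    """
    true_pos = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    predicted_pos = sum(1 for p in y_pred if p == 1)
    actual_pos = sum(1 for t in y_true if t == 1)
    precision = true_pos / predicted_pos if predicted_pos else 0.0
    recall = true_pos / actual_pos if actual_pos else 0.0
    return precision, recall


# Hypothetical skewed data: 2 positives out of 200 examples.
y_true = [1, 1] + [0] * 198

# A predictor that ignores its input and always outputs 0.
always_zero = [0] * len(y_true)
accuracy = sum(t == p for t, p in zip(y_true, always_zero)) / len(y_true)
print(accuracy, precision_recall(y_true, always_zero))

# A predictor that flags 4 patients, 1 of whom really has cancer.
some_model = [1, 0, 1, 1, 1] + [0] * 195
print(precision_recall(y_true, some_model))
```

The always-zero predictor scores high accuracy on such skewed data yet has zero recall, which is why accuracy alone is a poor metric when the positive class is rare.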

[Figure: precision (vertical axis, up to 1) plotted against recall (horizontal axis, up to 1)]
precision = true positives / no. of predicted positive
recall = true positives / no. of actual positive

              Precision (P)   Recall (R)   Average   F1 Score
Algorithm 1   0.5             0.4          0.45      0.
Algorithm 2   0.7             0.1          0.4       0.
Algorithm 3   0.02            1.0          0.51      0.
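
The F1 score is the harmonic mean of precision and recall, 2PR / (P + R). As a minimal sketch, the code below recomputes the simple average and the F1 score from the (P, R) pairs in the table; the helper name `f1_from_pr` is hypothetical, and returning 0.0 when P + R is 0 is an assumed convention.

```python
def f1_from_pr(precision: float, recall: float) -> float:
    """F1 score: 2PR / (P + R), defined here as 0.0 when P + R is 0."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


# (P, R) pairs copied from the table above.
algorithms = {
    "Algorithm 1": (0.5, 0.4),
    "Algorithm 2": (0.7, 0.1),
    "Algorithm 3": (0.02, 1.0),
}

for name, (p, r) in algorithms.items():
    average = (p + r) / 2
    print(f"{name}: average = {average:.2f}, F1 = {f1_from_pr(p, r):.3f}")
```

The simple average ranks Algorithm 3 highest even though its precision is only 0.02, whereas F1 ranks it lowest and Algorithm 1 highest, because F1 is low whenever either precision or recall is low.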

Designing a high accuracy learning system [Banko and Brill, 2001]
E.g. Classify between confusable words.
{to, two, too}, {then, than}
For breakfast I ate _____ eggs.
Algorithms
- Perceptron (Logistic regression)
- Winnow
- Memory-based
- Naïve Bayes
[Figure: accuracy (vertical axis) plotted against training set size in millions (horizontal axis)]
"It's not who has the best algorithm that wins. It's who has the most data."

Large data rationale
Assume feature x has sufficient information to predict y accurately.
Example: For breakfast I ate _____ eggs.
Counterexample: Predict housing price from only size (feet^2) and no other features.
Useful test: Given the input x, can a human expert confidently predict y?