Text Classification Techniques: From Naive Bayes to Support Vector Machines, Slides of Artificial Intelligence

Johns Hopkins University (JHU), Artificial Intelligence

An overview of various text classification techniques, including Naive Bayes, kNN, logistic regression, and support vector machines (SVMs), and their applications. The document also discusses the challenges of text classification, such as high dimensionality and sparsity, and the importance of feature selection. It includes examples of text classification datasets and models.

Typology: Slides

2010/2011

Uploaded on 11/09/2011

stagist 🇺🇸


Week 6 Overview

• Alternative Models of Retrieval
• Text Classification
  – Naive Bayes (Chapter 13, IIR)
  – kNN (Chapter 14, IIR)
  – SVMs (Chapter 15, IIR)

‘Cover Density Ranking’

• Developed by Clarke et al. at U. Waterloo
• Like Coordination Level Ranking
  – But adds relative rankings within each level
• Key ideas
  – Documents that possess most of the query terms, together in close proximity, are likely to be relevant
  – Documents with many such spans are more likely to be relevant
• Requires a different kind of inverted file
  – Word positions must be stored for each word occurrence
• Suited for short queries
  – 4 words or fewer


Example

Cover Set Ranking

• A document is scored by summing the scores for each span in the cover set
• Each span is scored as:

Linked Dependence

• Linked Dependence Assumption: there exists a positive real number K such that the following two conditions hold:
  – P(A,B|R) = K P(A|R) P(B|R)
  – P(A,B|¬R) = K P(A|¬R) P(B|¬R)
• When K=1 this is the same as binary independence

Logistic Regression

• Estimates for relevance based on log-linear model with various statistical measures of document content as independent variables.
• Log odds of relevance is a linear function of attributes
• Term contributions are summed
• Probability of relevance is obtained by inverting the log odds (the logistic function)
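The logistic form described above can be sketched as follows. The coefficient symbols $c_0, c_i$ and the attribute symbols $X_i$ are notation assumed here for illustration, not taken from the slide; the second equation is just the standard conversion from log odds back to a probability.

```latex
\log O(R \mid Q, D) = c_0 + \sum_{i=1}^{n} c_i X_i
\qquad
P(R \mid Q, D) = \frac{1}{1 + e^{-\log O(R \mid Q, D)}}
```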

Attributes for Logistic Regression

• Average Absolute Query Frequency
• Query Length
• Average Absolute Document Frequency
• Document Length
• Average Inverse Document Frequency
• Inverse Document Frequency
• Number of Terms in common between query and document -- logged

Text Classification

• Text Classification
• Methods
  – Feature Selection
  – NB, kNN, SVMs, etc…
  – Experiments on Reuters
• Other filtering tasks
  – Document Clustering, Topic Detection, Spam filtering
• Collaborative Filtering
• Filtering at TREC
  – Batch & Adaptive Tasks

User satisfaction

• This is not a ranked retrieval problem
  – You want only new information
• Too much irrelevant information takes you and your staff too much time to sift
  – You want just the good stuff
• How can we build a system that meets this need?

Enter Machine Learning

• The leading strategy is to learn to separate data into classes using machine learning techniques
  – Neural Networks, Bayesian methods, Decision Trees, kNN, Support Vector Machines, others
  – Often problems are cast as 2-class problems
    - e.g., relevant or not relevant
• This approach requires supervised training data
  – For a high-volume application getting labeled exemplars is reasonable
• IIR Chapters 16-17 focus on unsupervised scenarios (Clustering)

Unique Problems with Text

• Many 1000s of features
• Sparse vectors
• Non-orthogonal features
• Individual features not especially discriminating
  – “captured the gold medal”
  – “battled against cancer”
  – Here “captured/battled” are not about military conflict
  – If any term were a certain predictor, we could just use regular expressions to filter the data

Reuters-21578

• Approx 22k docs, 28 MB
• Most common text classification test set
• http://www.daviddlewis.com/resources/testcollections/reuters21578/

Category set   Example         # categories   w/ 1+ occurrences   w/ 20+ occurrences
topics         coconut, gold   135            120                 57
places         australia       175            147                 60
people         andriessen      267            114                 15
orgs           gatt            56             32                  9
exchanges      nasdaq          39             32                  7

Naïve Bayes

• We want to select all and only the relevant documents
• It would be nice if we could estimate the probability of a class assignment (ci) given a particular document, d
• Unfortunately, we don’t generally know how to compute this

Naïve Bayes derivation

• Applying Bayes Rule: P(ci|d) = P(d|ci) P(ci) / P(d)
• For ranking classes, if we just maximize the numerator, we will select the best class
• Even P(d|ci) is hard to estimate
  – Requires looking at each combination of words
  – In how many docs of the class did features tire and crash appear, but none of the other features appear? How about tire, crash, and drunk? How many combinations are there?

Conditional Independence

• Assume that each document feature (xj) is independent of the others, given that the document belongs to the class
• This is a naïve assumption
• To Do:
  – Estimate P(ci)
  – Estimate P(xj|ci) for all xj
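Under this assumption the class-conditional document probability factorizes over features, which leads to the usual decision rule. This is a sketch in assumed notation: $m$ is the number of features and $\hat{c}$ is the selected class, neither of which is named in the slides.

```latex
P(d \mid c_i) \approx \prod_{j=1}^{m} P(x_j \mid c_i)
\qquad
\hat{c} = \arg\max_{c_i} \; P(c_i) \prod_{j=1}^{m} P(x_j \mid c_i)
```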

NB Training (learning estimates)

• P(ci)
• P(xj|ci)
• (This is the Bernoulli / Binomial model in text)
• What if a word in the document (some xj) never occurred in training documents?
  – P(xj|ci) is zero
  – Solution: smoothing (add one to counts)
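The two estimates and the add-one fix can be illustrated with a minimal Bernoulli-style sketch. The function names, the toy documents, and the choice of adding 2 to the denominator (one per outcome, present or absent) are assumptions made for this example rather than details given in the slides.

```python
import math
from collections import defaultdict

def train_bernoulli_nb(docs, labels, vocab):
    """docs: list of term sets; labels: parallel class names; vocab: set of features."""
    n_docs = len(docs)
    class_counts = defaultdict(int)                       # training docs per class
    term_counts = defaultdict(lambda: defaultdict(int))   # docs in class containing term
    for terms, c in zip(docs, labels):
        class_counts[c] += 1
        for t in terms & vocab:
            term_counts[c][t] += 1
    # P(ci): fraction of training documents in each class
    priors = {c: class_counts[c] / n_docs for c in class_counts}
    # P(xj|ci) with add-one smoothing, so unseen terms never get probability zero
    cond = {c: {t: (term_counts[c][t] + 1) / (class_counts[c] + 2) for t in vocab}
            for c in class_counts}
    return priors, cond

def classify(terms, priors, cond, vocab):
    """Pick the class maximizing log P(ci) + sum over features of log P(xj|ci)."""
    best, best_score = None, -math.inf
    for c in priors:
        score = math.log(priors[c])
        for t in vocab:
            p = cond[c][t]
            score += math.log(p if t in terms else 1.0 - p)
        if score > best_score:
            best, best_score = c, score
    return best

# Hypothetical toy data
docs = [{"tire", "crash"}, {"crash", "drunk"}, {"gold", "medal"}, {"medal", "win"}]
labels = ["accident", "accident", "sports", "sports"]
vocab = set().union(*docs)
priors, cond = train_bernoulli_nb(docs, labels, vocab)
print(classify({"tire", "crash"}, priors, cond, vocab))
```

Working in log space avoids numeric underflow when many small probabilities are multiplied.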

Feature Selection

• A critical problem in text classification is selecting which features to use to represent documents
• Typically words or stemmed words are used
• Rather than use 20-100,000 words, often only the top-ranked features are used
• For a classifier about Turkey, words like “Istanbul” and “Ankara” might be selected, but “the” and “water” probably wouldn’t.
• How should features be identified?

Same metrics as term similarity

Contingency table for Ankara (term) and Turkey (class)

           Present   Absent
Present    a         b         a+b
Absent     c         d         c+d
           a+c       b+d       N

• A paper by Y. Yang and J. Pedersen (A Comparative Study of Feature Selection in Text Categorization) concludes that Chi-squared, Information-Gain, and Document-Frequency are better than Mutual-Information. ICML-
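As an illustration of scoring a term from the contingency counts a, b, c, d, here is a minimal sketch of the standard chi-squared statistic for a 2x2 table. The function name and the example counts are hypothetical, and the slides do not say which exact formula variant was used in the course.

```python
def chi_squared(a, b, c, d):
    """Chi-squared statistic for a 2x2 contingency table with cells a, b, c, d."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0  # an empty marginal means the table shows no association
    return n * (a * d - b * c) ** 2 / denom

# Hypothetical counts: a term concentrated in one class versus a term spread evenly
print(chi_squared(20, 0, 0, 20))    # strongly associated term
print(chi_squared(10, 10, 10, 10))  # term independent of the class
```

Features would then be ranked by this score and only the top-ranked ones kept.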

k Nearest Neighbor Classification

• To classify document d into class c
• Define k-neighborhood as k nearest neighbors of d
• Count number of documents, n, that belong to c
• Estimate P(c|d) as n/k
• Choose argmax_c P(c|d) (majority class)
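The steps above can be sketched as follows. Cosine similarity over sparse term-weight dictionaries is an assumption made for this example (the slide does not name a distance measure), and the function names and toy vectors are hypothetical.

```python
import math
from collections import Counter

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as {term: weight} dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def knn_classify(d, training, k):
    """training: list of (vector, class) pairs. Returns (best class, P(c|d) estimates)."""
    # The k-neighborhood: the k training documents most similar to d
    neighbors = sorted(training, key=lambda pair: cosine(d, pair[0]), reverse=True)[:k]
    counts = Counter(c for _, c in neighbors)          # n for each class c
    probs = {c: n / k for c, n in counts.items()}      # P(c|d) estimated as n/k
    return max(probs, key=probs.get), probs

# Hypothetical toy training set
training = [
    ({"senate": 1.0, "vote": 1.0}, "Government"),
    ({"vote": 1.0, "law": 1.0}, "Government"),
    ({"atom": 1.0, "quark": 1.0}, "Science"),
]
print(knn_classify({"vote": 1.0, "senate": 1.0}, training, k=3))
```

Note that every call compares d against all training documents, which is why classification cost grows with the size of the training set.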

Classes in a Vector Space

[Figure: classes labeled Government, Science, Arts]

KNN vs NB

• Naïve Bayes works fairly well
  – It is standard (40+ year history)
  – Some concerns about independence assumptions and over training on the data
• KNN requires no training, but...
  – Classification requires comparing to all training documents to find the best k categories
  – May be slow
• Both are used, but a newer ML algorithm is increasingly popular

Support Vector Machines

• Good with high-dimensional spaces
  – Hundreds of thousands of features
  – Just like text
• Avoid over-training on data
  – If vectors are sparse
• Other approaches use few dimensions
  – But low-ranked dimensions still have value!
• Joachims compared SVMs to Naïve Bayes, Rocchio, kNN, & Decision Trees
  – Paper on course web page

Support Vector Machines

• Large margin classification technique that can work well for sparse high dimensional classification problems
• Not all training vectors are used in model
• See http://www.support-vector.net/

[Figure: Non maximal margin]
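To make the margin idea concrete, the following sketch evaluates two given separating hyperplanes on a toy data set and reports each one's margin and the points lying on it. It does not train an SVM; the hyperplanes, data points, and function names are all hypothetical choices for illustration.

```python
import math

def geometric_margins(w, b, points):
    """Signed distance y * (w.x + b) / ||w|| of each labeled point from the hyperplane."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return [y * (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm for x, y in points]

def margin_and_support(w, b, points, tol=1e-9):
    """Return the hyperplane's margin and the closest points (the ones that define it)."""
    margins = geometric_margins(w, b, points)
    margin = min(margins)
    support = [p for p, m in zip(points, margins) if abs(m - margin) <= tol]
    return margin, support

# Toy 2-D data (hypothetical): each item is ((x1, x2), label) with label +1 or -1
points = [((2.0, 2.0), 1), ((3.0, 3.0), 1), ((0.0, 0.0), -1), ((-1.0, 0.0), -1)]

print(margin_and_support((1.0, 1.0), -2.0, points))  # separator midway between classes
print(margin_and_support((1.0, 0.0), -0.5, points))  # also separates, but smaller margin
```

Only the closest points determine the margin, which is the sense in which not all training vectors are used in the model.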

Joachims’s Experimental Setup

• Looked at 10 most-common Reuters (topic) categories
• Measured ‘precision-recall breakeven point’

            NB     Rocchio   DT     kNN    SVM (poly/3)
earn        95.9   96.1      96.1   97.3   98.
acq         91.5   92.1      85.3   92.0   95.
money-fx    62.9   67.6      69.4   78.2   75.
grain       72.5   79.5      89.1   82.2   92.
crude       81.0   81.5      75.5   85.7   88.
trade       50.0   77.4      59.2   77.4   76.
interest    58.0   72.5      49.1   74.0   73.
ship        78.7   83.1      80.9   79.2   86.
wheat       60.6   79.4      85.5   76.6   85.
corn        47.3   62.2      87.7   77.9   85.
avg         72.0   79.9      79.4   82.3   85.
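The precision-recall breakeven point is the value at which precision and recall are equal. One simple way to compute it for a ranked list is sketched below, assuming binary labels and a rank cutoff equal to the number of positive examples; the function name and sample scores are hypothetical, and the exact procedure used in the reported experiments (for example, interpolation or tie handling) is not described in the slides.

```python
def breakeven_point(scores, labels):
    """Precision (equal to recall) when the number retrieved equals the number of positives.

    scores: classifier scores, higher means more likely positive.
    labels: 1 for documents in the category, 0 otherwise.
    """
    n_pos = sum(labels)
    if n_pos == 0:
        raise ValueError("need at least one positive example")
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    true_pos = sum(label for _, label in ranked[:n_pos])
    # With n_pos documents retrieved, precision = recall = true_pos / n_pos
    return true_pos / n_pos

# Hypothetical scores for five documents, three of which belong to the category
print(breakeven_point([0.9, 0.8, 0.7, 0.6, 0.5], [1, 0, 1, 1, 0]))
```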

Learning taxonomies

• Labrou and Finin (UMBC) looked at classifying pages against Yahoo categories (CIKM-99)
• Downloaded whole tree and built train/test sets
• Approach
  – Score page against Yahoo categories and pick best
• Any page can be assigned to a Yahoo category
• Dumais et al. did similar work in SIGIR-2000 using top two levels and SVMs

Spam Filtering

• Spam detection can be set up as a 2-class problem
• Many individual words have good correlations
  – make, money, fast, $$$, XXX, job, free
• Other features
  – like source address / server
  – written with ALL CAPS
• TREC ran spam detection evaluations in 2005-
• Excellent 2007 article in CACM by Goodman, Cormack, and Heckerman
  – Paper on course web page (optional read)

First, do no harm...

Spam Filtering

• Sahami et al. created a collection of ~2000 documents
  – Using a 99.9 percent confidence threshold, lost 3 actual messages over one year of use.
  – One was a message, “See this spam!” passed on by a colleague
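The effect of a high confidence threshold can be illustrated with a small sketch that counts how many legitimate messages would be lost and how much spam would be caught at a given cutoff. The function name, the probabilities, and the labels below are hypothetical and are not data from the study mentioned above.

```python
def evaluate_threshold(p_spam, is_spam, threshold=0.999):
    """Count legitimate messages lost and spam caught when filtering at `threshold`.

    p_spam: classifier's estimated probability that each message is spam.
    is_spam: True for actual spam, False for legitimate mail.
    """
    lost = sum(1 for p, s in zip(p_spam, is_spam) if p >= threshold and not s)
    caught = sum(1 for p, s in zip(p_spam, is_spam) if p >= threshold and s)
    return lost, caught

# Hypothetical classifier outputs for four messages
p_spam = [0.9995, 0.95, 0.9999, 0.2]
is_spam = [True, True, False, False]
print(evaluate_threshold(p_spam, is_spam, threshold=0.999))
print(evaluate_threshold(p_spam, is_spam, threshold=0.9))
```

Raising the threshold trades missed spam for fewer lost legitimate messages, which matches the "do no harm" priority.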
