Text Classification Techniques: From Naive Bayes to Support Vector Machines, Slides of Artificial Intelligence

An overview of various text classification techniques, including Naive Bayes, kNN, logistic regression, and support vector machines (SVMs), and their applications. The document also discusses the challenges of text classification, such as high dimensionality and sparsity, and the importance of feature selection. It includes examples of text classification datasets and models.

Typology: Slides

2010/2011

Uploaded on 11/09/2011

stagist

Week 6 Overview

• Alternative Models of Retrieval
• Text Classification
  • Naive Bayes (Chapter 13, IIR)
  • kNN (Chapter 14, IIR)
  • SVMs (Chapter 15, IIR)

‘Cover Density Ranking’

• Developed by Clarke et al. at U. Waterloo
• Like Coordination Level Ranking
  • But adds relative rankings within each level
• Key ideas
  • Documents that possess most of the query terms, together in close proximity, are likely to be relevant
  • Documents with many such spans are more likely to be relevant
• Requires a different kind of inverted file
  • Word positions must be stored for each word occurrence
• Suited for short queries
  • 4 words or fewer

Example

Cover Set Ranking

• A document is scored by summing the scores for each span in the cover set
• Each span is scored as:
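
The span-scoring formula itself did not survive in this text. The sketch below assumes one formulation associated with Clarke et al.'s cover density ranking: a span from word position p to q scores 1 if it is no longer than a constant K, and K divided by its length otherwise. The function names, the inclusive-position convention, and the default K=16 are assumptions made for illustration, not values taken from the slides.

```python
def span_score(p, q, k=16):
    """Score one cover span covering word positions p..q (inclusive).

    Assumed formulation: short spans score 1, longer spans are
    discounted in proportion to their length. k is a hypothetical default.
    """
    length = q - p + 1
    return 1.0 if length <= k else k / length


def document_score(covers, k=16):
    """Sum the span scores over a document's cover set.

    covers is a list of (p, q) position pairs, one per span.
    """
    return sum(span_score(p, q, k) for p, q in covers)
```

A document with one tight span and one loose span, for example `document_score([(3, 6), (0, 31)])`, adds a full point for the first span and a discounted amount for the second.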

Linked Dependence

• Linked Dependence Assumption: there exists a positive real number K such that the following two conditions hold:
  • P(A,B|R) = K P(A|R) P(B|R)
  • P(A,B|¬R) = K P(A|¬R) P(B|¬R)
  • When K=1 this is the same as binary independence

Logistic Regression

• Estimates for relevance are based on a log-linear model with various statistical measures of document content as independent variables.
• Log odds of relevance is a linear function of attributes
• Term contributions are summed
• Probability of relevance is obtained by inverting the log odds
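
The formulas for this slide were not preserved. As a hedged reconstruction of the general shape only, a logistic regression model of relevance can be written as below, where the attribute values X_i, the coefficients c_i, and the intercept c_0 are generic symbols assumed here rather than the slide's own notation.

```latex
\[
\log O(R \mid Q, D) = c_0 + \sum_{i=1}^{n} c_i X_i
\]
\[
P(R \mid Q, D) = \frac{1}{1 + e^{-\log O(R \mid Q, D)}}
\]
```

The second line is the step that turns log odds back into a probability between 0 and 1.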

Attributes for Logistic Regression

• Average Absolute Query Frequency
• Query Length
• Average Absolute Document Frequency
• Document Length
• Average Inverse Document Frequency
• Inverse Document Frequency
• Number of Terms in common between query and document (logged)

Text Classification

• Text Classification
• Methods
  • Feature Selection
  • NB, kNN, SVMs, etc.
  • Experiments on Reuters
• Other filtering tasks
  • Document Clustering, Topic Detection, Spam filtering
• Collaborative Filtering
• Filtering at TREC
  • Batch & Adaptive Tasks

User satisfaction

• This is not a ranked retrieval problem
  • You want only new information
• Too much irrelevant information takes you and your staff too much time to sift
  • You want just the good stuff
• How can we build a system that meets this need?

Enter Machine Learning

• The leading strategy is to learn to separate data into classes using machine learning techniques
  • Neural Networks, Bayesian methods, Decision Trees, kNN, Support Vector Machines, others
  • Often problems are cast as 2-class problems
    • e.g., relevant or not relevant
• This approach requires supervised training data
  • For a high-volume application getting labeled exemplars is reasonable
• IIR Chapters 16-17 focus on unsupervised scenarios (clustering)

Unique Problems with Text

• Many 1000s of features
• Sparse vectors
• Non-orthogonal features
• Individual features not especially discriminating
  • “captured the gold medal”
  • “battled against cancer”
  • Here “captured” and “battled” are not about military conflict
  • If any term were a certain predictor, we could just use regular expressions to filter the data

Reuters-21578

• Approx 22k docs, 28 MB
• Most common text classification test set
• http://www.daviddlewis.com/resources/testcollections/reuters21578/

Category set   Example          # categories   w/ 1+ occurrences   w/ 20+ occurrences
topics         coconut, gold    135            120                 57
places         australia        175            147                 60
people         andriessen       267            114                 15
orgs           gatt             56             32                  9
exchanges      nasdaq           39             32                  7

Naïve Bayes

• We want to select all and only the relevant documents
• It would be nice if we could estimate the probability of a class assignment (ci) given a particular document, d
• Unfortunately, we don’t generally know how to compute this

Naïve Bayes derivation

• Applying Bayes Rule: P(ci|d) = P(d|ci) P(ci) / P(d)
• For ranking classes, if we just maximize the numerator, we will select the best class
• Even P(d|ci) is hard to estimate

  • Requires looking at each combination of words
  • In how many docs of the class, did features tire and crash appear, but none of the other features appear? How about tire, crash, and drunk? How many combinations are there?

Conditional Independence

• Assume that each document feature (xj) is independent from another, given belonging to the class
• This is a naïve assumption
• To Do:

  • Estimate P(ci)
  • Estimate P(xj|ci) for all xj
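
Under the conditional independence assumption, the hard-to-estimate P(d|ci) factors into per-feature terms. The block below is a standard way to write this and the resulting decision rule; the hat notation for the chosen class is an assumption of this sketch, not the slides' notation.

```latex
\[
P(d \mid c_i) \approx \prod_{j} P(x_j \mid c_i)
\]
\[
\hat{c} = \arg\max_{c_i} \; P(c_i) \prod_{j} P(x_j \mid c_i)
\]
```

With n binary features there are 2^n possible feature combinations per class, but only n per-feature probabilities per class need to be estimated once the product form is assumed.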

NB Training (learning estimates)

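• P(ci): the fraction of training documents that belong to class ci
• P(xj|ci): the fraction of training documents in class ci that contain feature xj
• (This is the Bernoulli / Binomial model in text)
• What if a word in the document (some xj) never occurred in training documents?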

  • P(xj|ci) is zero
  • Solution: smoothing (add one to counts)
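
The training estimates and add-one smoothing can be sketched as follows. This is a minimal illustration of a Bernoulli-style Naïve Bayes classifier, assuming documents are represented as sets of terms; the function names and the (count + 1) / (n_c + 2) form of smoothing for a binary feature are choices made for this sketch.

```python
import math


def train_bernoulli_nb(docs, labels, vocab):
    """Estimate P(c) and smoothed P(x|c) from sets of terms.

    docs: list of sets of terms; labels: parallel list of class names.
    """
    n_docs = len(docs)
    prior, cond = {}, {}
    for c in set(labels):
        class_docs = [d for d, y in zip(docs, labels) if y == c]
        n_c = len(class_docs)
        prior[c] = n_c / n_docs
        # Add-one smoothing keeps every estimate strictly between 0 and 1.
        cond[c] = {
            t: (sum(1 for d in class_docs if t in d) + 1) / (n_c + 2)
            for t in vocab
        }
    return prior, cond


def classify(doc, prior, cond, vocab):
    """Return the class maximizing log P(c) + sum_j log P(x_j|c)."""
    def log_score(c):
        score = math.log(prior[c])
        for t in vocab:
            p = cond[c][t]
            # Absent terms also contribute in the Bernoulli model.
            score += math.log(p if t in doc else 1.0 - p)
        return score
    return max(prior, key=log_score)
```

Working in log space avoids numeric underflow when many small probabilities are multiplied.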

Feature Selection

• A critical problem in text classification is selecting which features to use to represent documents
• Typically words or stemmed words are used
• Rather than use 20-100,000 words, often only the top-ranked features are used
• For a classifier about Turkey, words like “Istanbul” and “Ankara” might be selected, but “the” and “water” probably wouldn’t.

  • How should features be identified?

Same metrics as term similarity

Contingency table for Ankara (term) and Turkey (class):

            Present   Absent
Present     a         b         a+b
Absent      c         d         c+d
            a+c       b+d       N

A paper by Y. Yang and J. Pedersen (A Comparative Study of Feature Selection in Text Categorization) concludes that Chi-squared, Information-Gain, and Document-Frequency are better than Mutual-Information. ICML-
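
As an illustration of one of the metrics named above, the chi-squared statistic can be computed directly from the four cells of a 2x2 contingency table. This is a minimal sketch using the standard 2x2 formula; the function name is hypothetical, and the result does not depend on which variable is placed on the rows.

```python
def chi_squared(a, b, c, d):
    """Chi-squared statistic for a 2x2 term/class contingency table.

    a: both present, d: both absent, b and c: exactly one present.
    """
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        # A zero marginal means the term or class never varies.
        return 0.0
    return n * (a * d - b * c) ** 2 / denom
```

Terms can then be ranked by this score for each class, keeping only the top-ranked ones as features.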

k Nearest Neighbor Classification

• To classify document d into class c
• Define k-neighborhood as k nearest neighbors of d
• Count number of documents in the neighborhood, n, that belong to c
• Estimate P(c|d) as n/k
• Choose argmax_c P(c|d) (majority class)
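
The steps above can be sketched in a few lines. The slides do not say how nearness is measured, so this sketch assumes cosine similarity over sparse term-weight dictionaries; the function names and the data layout are likewise assumptions.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def knn_classify(doc, training, k=3):
    """Classify doc by majority vote among its k most similar examples.

    training is a list of (vector, label) pairs.
    Returns the chosen class and the n/k estimates of P(c|d).
    """
    neighbors = sorted(
        training, key=lambda ex: cosine(doc, ex[0]), reverse=True
    )[:k]
    counts = Counter(label for _, label in neighbors)
    probs = {c: n / k for c, n in counts.items()}
    return max(probs, key=probs.get), probs
```

Note that every call compares the document against all training examples, which is the cost discussed on the next slide.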

Classes in a Vector Space

[Figure: documents from the Government, Science, and Arts classes plotted in a vector space]

KNN vs NB

• Naïve Bayes works fairly well
  • It is standard (40+ year history)
  • Some concerns about independence assumptions and over-training on the data
• KNN requires no training, but...
  • Classification requires comparing to all training documents to find the k nearest neighbors
  • May be slow
• Both are used, but a newer ML algorithm is increasingly popular

Support Vector Machines

• Good with high-dimensional spaces
  • Hundreds of thousands of features
  • Just like text
• Avoid over-training on data
  • If vectors are sparse
• Other approaches use few dimensions
  • But low-ranked dimensions still have value!
• Joachims compared SVMs to Naïve Bayes, Rocchio, kNN, & Decision Trees
  • Paper on course web page

Support Vector Machines

• Large margin classification technique that can work well for sparse high dimensional classification problems
• Not all training vectors are used in model
• See http://www.support-vector.net/
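
To illustrate why not all training vectors end up in the model, the sketch below takes an already-known linear separator (weights w and bias b, both hypothetical inputs here) and picks out the training points that lie on or inside the margin. It is not an SVM trainer; it only shows that points well outside the margin have zero hinge loss and so do not constrain the boundary.

```python
def functional_margin(w, b, x, y):
    """y * (w . x + b) for a label y in {-1, +1}."""
    return y * (sum(wi * xi for wi, xi in zip(w, x)) + b)


def hinge_loss(w, b, x, y):
    """Zero for points classified correctly with margin of at least 1."""
    return max(0.0, 1.0 - functional_margin(w, b, x, y))


def margin_points(w, b, data, tol=1e-9):
    """Return the (x, y) pairs on or inside the margin.

    These are the only points that could act as support vectors
    for the given separator.
    """
    return [
        (x, y) for x, y in data
        if functional_margin(w, b, x, y) <= 1.0 + tol
    ]
```

Removing any point that is not returned by `margin_points` leaves the separator's margin constraints unchanged.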

Non maximal margin

Joachims’s Experimental Setup

• Looked at 10 most-common Reuters (topic) categories
• Measured ‘precision-recall breakeven point’

            NB     Rocchio   DT     kNN    SVM (poly/3)
earn        95.9   96.1      96.1   97.3   98.
acq         91.5   92.1      85.3   92.0   95.
money-fx    62.9   67.6      69.4   78.2   75.
grain       72.5   79.5      89.1   82.2   92.
crude       81.0   81.5      75.5   85.7   88.
trade       50.0   77.4      59.2   77.4   76.
interest    58.0   72.5      49.1   74.0   73.
ship        78.7   83.1      80.9   79.2   86.
wheat       60.6   79.4      85.5   76.6   85.
corn        47.3   62.2      87.7   77.9   85.
avg         72.0   79.9      79.4   82.3   85.

Learning taxonomies

• Labrou and Finin (UMBC) looked at classifying pages against Yahoo categories (CIKM-99)
• Downloaded whole tree and built train/test sets
• Approach
  • Score page against Yahoo categories and pick best
• Any page can be assigned to a Yahoo category
• Dumais et al. did similar work in SIGIR-2000 using top two levels and SVMs

Spam Filtering

• Spam detection can be set up as a 2-class problem
• Many individual words have good correlations
  • make, money, fast, $$$, XXX, job, free
• Other features
  • like source address / server
  • written with ALL CAPS
• TREC ran spam detection evaluations in 2005-
• Excellent 2007 article in CACM by Goodman, Cormack, and Heckerman
• Paper on course web page (optional read)

First, do no harm...

Spam Filtering

• Sahami et al. created a collection of ~2000 documents
  • Using a 99.9 percent confidence threshold, the filter lost 3 legitimate messages over one year of use.
  • One was a message, “See this spam!”, passed on by a colleague
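
A high confidence threshold of this kind can be sketched as a simple routing rule: a message is filtered only when the classifier's estimated spam probability reaches the threshold, and everything else is delivered. The function name and the (message, probability) input format are assumptions made for illustration; the probabilities would come from a trained classifier that is not shown here.

```python
def route_messages(scored_messages, threshold=0.999):
    """Split (message, p_spam) pairs into an inbox and a spam folder.

    Only messages whose estimated spam probability is at least the
    threshold are filtered; borderline messages are delivered.
    """
    inbox, spam_folder = [], []
    for message, p_spam in scored_messages:
        if p_spam >= threshold:
            spam_folder.append(message)
        else:
            inbox.append(message)
    return inbox, spam_folder
```

Raising the threshold trades more spam in the inbox for fewer legitimate messages lost.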