Week 6 Overview
• Alternative Models of Retrieval
• Text Classification
- Naive Bayes (Chapter 13, IIR)
- kNN (Chapter 14, IIR)
- SVMs (Chapter 15, IIR)
"Cover Density Ranking"
• Developed by Clarke et al. at U. Waterloo
• Like Coordination Level Ranking
- But adds relative rankings within each level
• Key ideas
- Documents that possess most of the query terms, together in close proximity, are likely to be relevant
- Documents with many such spans are more likely to be relevant
• Requires a different kind of inverted file
- Word positions must be stored for each word occurrence
• Suited for short queries
- 4 words or fewer
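The positional requirement above can be illustrated with a minimal sketch. It assumes whitespace tokenization and shows only how stored word positions let us find the shortest span that contains every query term; it is not Clarke et al.'s scoring method, and the names `build_positional_index` and `shortest_span` are hypothetical.

```python
from collections import defaultdict


def build_positional_index(docs):
    """Map term -> {doc_id: [positions]} for whitespace-tokenized documents."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, token in enumerate(text.lower().split()):
            index[token].setdefault(doc_id, []).append(pos)
    return index


def shortest_span(index, doc_id, query_terms):
    """Return (start, end) of the shortest span in doc_id that contains every
    query term, or None if some term is absent from the document."""
    needed = set(query_terms)
    # Every occurrence of any query term in this document, in position order.
    hits = sorted(
        (pos, term)
        for term in needed
        for pos in index.get(term, {}).get(doc_id, [])
    )
    counts = defaultdict(int)  # occurrences of each term inside the window
    covered = 0                # distinct query terms inside the window
    best = None
    left = 0
    for pos_right, term_right in hits:
        counts[term_right] += 1
        if counts[term_right] == 1:
            covered += 1
        # Shrink from the left while the window still holds every term.
        while covered == len(needed):
            pos_left, term_left = hits[left]
            if best is None or pos_right - pos_left < best[1] - best[0]:
                best = (pos_left, pos_right)
            counts[term_left] -= 1
            if counts[term_left] == 0:
                covered -= 1
            left += 1
    return best


# Illustrative document, not taken from any test collection.
docs = {"d1": "gold prices fell as the medal winner sold her gold medal"}
index = build_positional_index(docs)
span = shortest_span(index, "d1", ["gold", "medal"])
```

Without stored positions, the index could only say that both terms occur somewhere in `d1`, not that they occur next to each other.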
Example
Cover Set Ranking
• A document is scored by summing the scores for each span in the cover set
• Each span is scored as:
Linked Dependence
• Linked Dependence Assumption: there exists a positive real number K such that the following two conditions hold:
- $P(A,B \mid R) = K\,P(A \mid R)\,P(B \mid R)$
- $P(A,B \mid \bar{R}) = K\,P(A \mid \bar{R})\,P(B \mid \bar{R})$
- When K=1 this is the same as binary independence
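A short derivation shows why this assumption is useful. Write $\bar{R}$ for non-relevance (notation assumed here) and divide the relevant-case condition by the non-relevant-case condition. K cancels, so the likelihood ratio for the pair of clues factors into per-clue ratios even when K is not 1:

\[
\frac{P(A,B \mid R)}{P(A,B \mid \bar{R})}
= \frac{K\,P(A \mid R)\,P(B \mid R)}{K\,P(A \mid \bar{R})\,P(B \mid \bar{R})}
= \frac{P(A \mid R)}{P(A \mid \bar{R})} \cdot \frac{P(B \mid R)}{P(B \mid \bar{R})}
\]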
Logistic Regression
• Estimates for relevance based on a log-linear model with various statistical measures of document content as independent variables.
• Log odds of relevance is a linear function of attributes:
• Term contributions summed:
• Probability of relevance is inverse of log odds:
Attributes for Logistic Regression
- Average Absolute Query Frequency
- Query Length
- Average Absolute Document Frequency
- Document Length
- Average Inverse Document Frequency
- Inverse Document Frequency
- Number of Terms in common between query and document (logged)
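As a rough sketch of the logistic regression model described above: the log odds are a linear combination of attribute values, and the logistic function turns log odds back into a probability. The attribute values, coefficients, and intercept below are hypothetical placeholders, not fitted values from any published model.

```python
import math


def relevance_probability(attributes, coefficients, intercept):
    """Turn a linear log-odds score into a probability of relevance."""
    # Log odds of relevance: intercept plus the weighted sum of attributes.
    log_odds = intercept + sum(c * x for c, x in zip(coefficients, attributes))
    # Invert the log odds with the logistic function.
    return 1.0 / (1.0 + math.exp(-log_odds))


# Hypothetical numbers for illustration only; real coefficients would be
# fit to relevance judgments.
attributes = [0.5, 2.0, 1.5]
coefficients = [1.0, -0.5, 0.2]
intercept = -0.3
p = relevance_probability(attributes, coefficients, intercept)
```

A log-odds score of zero maps to a probability of 0.5, and the probability rises monotonically with the score, so ranking by either gives the same order.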
Text Classification
• Text Classification
• Methods
- Feature Selection
- NB, kNN, SVMs, etc.
- Experiments on Reuters
• Other filtering tasks
- Document Clustering, Topic Detection, Spam filtering
• Collaborative Filtering
• Filtering at TREC
- Batch & Adaptive Tasks
User satisfaction
• This is not a ranked retrieval problem
- You want only new information
• Too much irrelevant information takes you and your staff too much time to sift through
- You want just the good stuff
• How can we build a system that meets this need?
Enter Machine Learning
• The leading strategy is to learn to separate data into classes using machine learning techniques
- Neural Networks, Bayesian methods, Decision Trees, kNN, Support Vector Machines, others
- Often problems are cast as 2-class problems
- e.g., relevant or not relevant
• This approach requires supervised training data
- For a high-volume application, getting labeled exemplars is reasonable
• IIR Chapters 16-17 focus on unsupervised scenarios (clustering)
Unique Problems with Text
• Many 1000s of features
• Sparse vectors
• Non-orthogonal features
• Individual features not especially discriminating
- "captured the gold medal"
- "battled against cancer"
- Here "captured" and "battled" are not about military conflict
- If any term were a certain predictor, we could just use regular expressions to filter the data
Reuters-21578
• Approx 22k docs, 28 MB
• Most common text classification test set
• http://www.daviddlewis.com/resources/testcollections/reuters21578/

             example          # categories   w/ 1+   w/ 20+
topics       coconut, gold    135            120     57
places       australia        175            147     60
people       andriessen       267            114     15
orgs         gatt             56             32      9
exchanges    nasdaq           39             32      7
Naïve Bayes
• We want to select all and only the relevant documents
• It would be nice if we could estimate the probability of a class assignment (ci) given a particular document, d
• Unfortunately, we don't generally know how to compute this
Naïve Bayes derivation
• Applying Bayes' Rule: $P(c_i \mid d) = \frac{P(d \mid c_i)\,P(c_i)}{P(d)}$
• For ranking classes, P(d) is the same for every class, so if we just maximize the numerator, we will select the best class
• Even P(d|ci) is hard to estimate
- Requires looking at each combination of words
- In how many docs of the class did the features tire and crash appear, but none of the other features? How about tire, crash, and drunk? How many combinations are there?
Conditional Independence
• Assume that each document feature (xj) is independent of the others, given that the document belongs to the class
• This is a naïve assumption
• To Do:
- Estimate P(ci)
- Estimate P(xj|ci) for all xj
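Written out, the conditional independence assumption replaces the hard-to-estimate document likelihood with a product of per-feature terms, and the classifier picks the class that maximizes the resulting score. The symbol $\hat{c}$ for the chosen class is notation introduced here:

\[
P(d \mid c_i) = \prod_{j} P(x_j \mid c_i),
\qquad
\hat{c} = \arg\max_{c_i} \; P(c_i) \prod_{j} P(x_j \mid c_i)
\]

With n binary features, a full table of P(d|ci) needs $2^n - 1$ probabilities per class, while the factored form needs only n, one P(xj|ci) per feature.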
NB Training (learning estimates)
• P(ci)
• P(xj|ci)
• (This is the Bernoulli / Binomial model in text)
• What if a word in the document (some xj) never occurred in the training documents for a class?
- P(xj|ci) is zero, which zeroes the whole product for that class
- Solution: smoothing (add one to counts)
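The training step can be sketched as follows. This is a minimal Bernoulli-style illustration under stated assumptions: documents are sets of terms, the vocabulary is fixed in advance, P(ci) is the fraction of training documents in the class, and P(xj|ci) uses add-one smoothing with 2 added to the denominator (one for each of the two outcomes, present or absent). The function names and the toy data are hypothetical.

```python
import math
from collections import defaultdict


def train_bernoulli_nb(docs, labels, vocabulary):
    """docs: list of sets of terms; labels: parallel list of class names."""
    class_counts = defaultdict(int)
    term_counts = defaultdict(lambda: defaultdict(int))
    for terms, label in zip(docs, labels):
        class_counts[label] += 1
        for term in terms & vocabulary:
            term_counts[label][term] += 1
    # P(ci): fraction of training documents that belong to the class.
    priors = {c: n / len(docs) for c, n in class_counts.items()}
    # P(xj|ci): fraction of class documents containing the term, with
    # add-one smoothing so that no estimate is exactly 0 or 1.
    cond = {
        c: {t: (term_counts[c][t] + 1) / (class_counts[c] + 2) for t in vocabulary}
        for c in class_counts
    }
    return priors, cond


def classify(terms, priors, cond):
    """Return the class with the highest log posterior score."""
    best_class, best_score = None, -math.inf
    for c, prior in priors.items():
        score = math.log(prior)
        for t, p in cond[c].items():
            # Present terms contribute P(xj|ci); absent terms 1 - P(xj|ci).
            score += math.log(p if t in terms else 1.0 - p)
        if score > best_score:
            best_class, best_score = c, score
    return best_class


# Toy training data for illustration only.
vocabulary = {"tire", "crash", "drunk", "gold", "medal"}
docs = [{"tire", "crash"}, {"crash", "drunk"}, {"gold", "medal"}, {"medal"}]
labels = ["accident", "accident", "sports", "sports"]
priors, cond = train_bernoulli_nb(docs, labels, vocabulary)
predicted = classify({"crash"}, priors, cond)
```

Here "crash" never appears in a sports training document, yet its smoothed estimate for that class is 1/4 rather than 0, so the sports score stays finite.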
Feature Selection
• A critical problem in text classification is selecting which features to use to represent documents
• Typically words or stemmed words are used
• Rather than use 20,000-100,000 words, often only the top-ranked features are used
• For a classifier about Turkey, words like "Istanbul" and "Ankara" might be selected, but "the" and "water" probably wouldn't.
- How should features be identified?
Same metrics as term similarity
Contingency table

                           Turkey (class)
                           Present   Absent
Ankara (term)   Present    a         b         a+b
                Absent     c         d         c+d
                           a+c       b+d       N

A paper by Y. Yang and J. Pedersen (A Comparative Study of Feature Selection in Text Categorization) concludes that Chi-squared, Information-Gain, and Document-Frequency are better than Mutual-Information. ICML-
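As an illustration of one of the metrics named above, the chi-squared statistic can be computed directly from the four cells of a 2x2 contingency table. The sketch assumes a counts documents where term and class are both present, b and c count documents where exactly one is present, and d counts documents where neither is. It uses the standard 2x2 formula, which may differ in detail from what the cited paper uses.

```python
def chi_squared(a, b, c, d):
    """Chi-squared statistic for a 2x2 term/class contingency table."""
    n = a + b + c + d
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    if denominator == 0:
        # A row or column is empty, so the term carries no information.
        return 0.0
    return n * (a * d - b * c) ** 2 / denominator


# Hypothetical counts: the term always co-occurs with the class ...
strongly_associated = chi_squared(10, 0, 0, 10)
# ... versus a term that is spread evenly across the class and its complement.
independent = chi_squared(5, 5, 5, 5)
```

Ranking every vocabulary term by this score and keeping the top-ranked ones is one way to carry out the selection step described above.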
k Nearest Neighbor Classification
• To classify document d into class c
• Define the k-neighborhood as the k nearest neighbors of d
• Count the number of documents, n, in the k-neighborhood that belong to c
• Estimate P(c|d) as n/k
• Choose argmax_c P(c|d) (majority class)
Classes in a Vector Space
[Figure: the classes Government, Science, and Arts in a vector space]
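The k nearest neighbor procedure can be sketched in a few lines. The slides do not name a similarity measure, so cosine similarity over sparse term-weight dictionaries is an assumption here, as are the function names and the toy training set.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def knn_classify(doc, training, k):
    """training: list of (vector, label) pairs. Returns (label, estimates)."""
    # The k-neighborhood: the k training documents most similar to doc.
    neighbors = sorted(training, key=lambda ex: cosine(doc, ex[0]), reverse=True)[:k]
    votes = Counter(label for _, label in neighbors)
    # Estimate P(c|d) as n/k for each class seen in the neighborhood.
    estimates = {c: n / len(neighbors) for c, n in votes.items()}
    return max(estimates, key=estimates.get), estimates


# Toy training set for illustration only.
training = [
    ({"gold": 1.0, "medal": 1.0}, "sports"),
    ({"medal": 1.0, "race": 1.0}, "sports"),
    ({"tire": 1.0, "crash": 1.0}, "accident"),
]
label, estimates = knn_classify({"medal": 1.0}, training, k=2)
```

Every call compares the query document against all training documents, which is the cost that makes kNN slow at classification time.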
KNN vs NB
• Naïve Bayes works fairly well
- It is standard (40+ year history)
- Some concerns about independence assumptions and overtraining on the data
• KNN requires no training, but...
- Classification requires comparing to all training documents to find the k nearest neighbors
- May be slow
• Both are used, but a newer ML algorithm is increasingly popular
Support Vector Machines
• Good with high-dimensional spaces
- Hundreds of thousands of features
- Just like text
• Avoid overtraining on data
• Other approaches use few dimensions
- But low-ranked dimensions still have value!
• Joachims compared SVMs to Naïve Bayes, Rocchio, kNN, & Decision Trees
Support Vector Machines
• Large margin classification technique that can work well for sparse, high-dimensional classification problems
• Not all training vectors are used in the model
• See http://www.support-vector.net/
Non maximal margin
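To make the margin idea concrete, the sketch below compares two linear separators on the same four hypothetical 2-D points. It does not train an SVM. It only measures the geometric margin of a given separator (w, b) and lists the training points that sit exactly on that margin, which are the ones a large-margin model depends on.

```python
import math


def signed_distance(w, b, x, y):
    """Distance of labeled point (x, y) from the hyperplane w.x + b = 0,
    positive when the point is on the correct side."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return y * (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm


def geometric_margin(w, b, examples):
    """Smallest signed distance over all training examples."""
    return min(signed_distance(w, b, x, y) for x, y in examples)


def support_vectors(w, b, examples, tol=1e-9):
    """Training points that lie exactly on the margin."""
    margin = geometric_margin(w, b, examples)
    return [x for x, y in examples if abs(signed_distance(w, b, x, y) - margin) <= tol]


# Hypothetical training points with labels +1 / -1.
examples = [((2.0, 2.0), 1), ((3.0, 3.0), 1), ((0.0, 0.0), -1), ((-1.0, 0.0), -1)]

# Separator A lies halfway between the closest opposite-class points.
margin_a = geometric_margin((1.0, 1.0), -2.0, examples)
# Separator B also classifies every point correctly, but with less room.
margin_b = geometric_margin((1.0, 0.0), -1.0, examples)
on_margin_a = support_vectors((1.0, 1.0), -2.0, examples)
```

Both separators classify the training points correctly, but A leaves a wider gap, and only two of the four points determine it. Moving or deleting the other two (while keeping them outside the margin) would not change separator A.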
Joachims's Experimental Setup
• Looked at 10 most-common Reuters (topic) categories
• Measured "precision-recall breakeven point"

            NB      Rocchio   DT      kNN     SVM (poly/3)
earn        95.9    96.1      96.1    97.3    98.
acq         91.5    92.1      85.3    92.0    95.
money-fx    62.9    67.6      69.4    78.2    75.
grain       72.5    79.5      89.1    82.2    92.
crude       81.0    81.5      75.5    85.7    88.
trade       50.0    77.4      59.2    77.4    76.
interest    58.0    72.5      49.1    74.0    73.
ship        78.7    83.1      80.9    79.2    86.
wheat       60.6    79.4      85.5    76.6    85.
corn        47.3    62.2      87.7    77.9    85.
avg         72.0    79.9      79.4    82.3    85.
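For readers unfamiliar with the measure, here is one simple way to compute a precision-recall breakeven point for a single category. It assumes the classifier gives each document a score, and it uses the fact that when the number of documents retrieved equals the number of relevant documents, precision and recall are equal by construction. The exact procedure used in the experiments above may differ (for example, by interpolating between cutoffs).

```python
def breakeven_point(scores, relevant):
    """scores: classifier scores; relevant: parallel list of booleans."""
    n_relevant = sum(relevant)
    if n_relevant == 0:
        return 0.0
    # Rank documents by score, best first.
    ranked = sorted(zip(scores, relevant), key=lambda pair: pair[0], reverse=True)
    # Retrieve exactly as many documents as there are relevant ones; then
    # precision = hits / retrieved and recall = hits / relevant coincide.
    hits = sum(1 for _, is_relevant in ranked[:n_relevant] if is_relevant)
    return hits / n_relevant


# Hypothetical scores and judgments for four documents.
value = breakeven_point([0.9, 0.8, 0.7, 0.6], [True, False, True, False])
```

Multiplying the result by 100 gives a figure on the same scale as the table entries.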
Learning taxonomies
• Labrou and Finin (UMBC) looked at classifying pages against Yahoo categories (CIKM-99)
• Downloaded whole tree and built train/test sets
• Approach
- Score page against Yahoo categories and pick best
• Any page can be assigned to a Yahoo category
• Dumais et al. did similar work in SIGIR-2000 using top two levels and SVMs
Spam Filtering
• Spam detection can be set up as a 2-class problem
• Many individual words have good correlations
- make, money, fast, $$$, XXX, job, free
• Other features
- like source address / server
- written with ALL CAPS
• TREC ran spam detection evaluations in 2005-
• Excellent 2007 article in CACM by Goodman, Cormack, and Heckerman
• Paper on course web page (optional read)
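The features listed above could be extracted along these lines. This is a hedged sketch: the cue-word list comes from the slide, while the tokenization, the `blocked_domains` parameter, and the feature names are hypothetical choices made for illustration.

```python
import re

# Cue words taken from the list above, lowercased for matching.
CUE_WORDS = {"make", "money", "fast", "$$$", "xxx", "job", "free"}


def spam_features(subject, body, sender_domain, blocked_domains=frozenset()):
    """Extract a few simple features for a 2-class spam classifier."""
    text = subject + " " + body
    # Tokens are runs of letters or runs of dollar signs.
    tokens = re.findall(r"\$+|[A-Za-z]+", text)
    words = [t for t in tokens if t.isalpha()]
    caps = [w for w in words if len(w) > 1 and w.isupper()]
    lowered = {t.lower() for t in tokens}
    return {
        "cue_words": sorted(CUE_WORDS & lowered),
        "all_caps_fraction": len(caps) / len(words) if words else 0.0,
        "blocked_sender": sender_domain.lower() in blocked_domains,
    }


# Hypothetical message and sender for illustration only.
features = spam_features(
    subject="MAKE MONEY FAST",
    body="Earn $$$ now, it is free",
    sender_domain="example.com",
    blocked_domains=frozenset({"example.com"}),
)
```

A feature dictionary like this would then be passed to any of the classifiers discussed earlier, such as Naïve Bayes.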
First, do no harm...
Spam Filtering
• Sahami et al. created a collection of ~2000 documents
- Using a 99.9 percent confidence threshold, lost 3 legitimate messages over one year of use.
- One was a message, "See this spam!", passed on by a colleague
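The thresholding idea can be shown in a couple of lines. This sketch assumes some classifier has already produced a spam probability for the message. The function name is hypothetical, and the default threshold mirrors the 99.9 percent figure above.

```python
def route_message(p_spam, threshold=0.999):
    """Filter a message only when the classifier is very sure it is spam."""
    # Losing a legitimate message costs far more than letting spam through,
    # so anything below the threshold is delivered.
    return "spam" if p_spam >= threshold else "inbox"


# A message that is probably spam, but not certainly, is still delivered.
likely = route_message(0.95)
certain = route_message(0.9995)
```

Raising the threshold trades more delivered spam for fewer lost legitimate messages, in keeping with "First, do no harm".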