Week 5: Other Retrieval Models
Review: Computing Cosine
Probabilistic Model
- IIR Chapter 11
- Ignore ‘Inference Networks’
Language Models
Query Evaluation (for ranked retrieval)
Multi-step process
1. Initialization
2. Identify the terms in the query
3. Look up information about each term
- file offset into inverted file
- the document frequency
4. Loop over each inverted file entry for terms
- compute a partial score for each document
5. Sort documents by score
6. Report results
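The six steps can be sketched as a term-at-a-time loop. This is a minimal illustration, not the course's implementation: the toy `INVERTED_FILE`, the three documents, `N_DOCS`, and the tf x idf weighting with a base-2 logarithm are all assumptions made for the example.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy data standing in for the lexicon and inverted file:
# term -> list of (doc_id, term frequency in that document).
INVERTED_FILE = {
    "survivor": [("Doc1", 1), ("Doc3", 1)],
    "crusoe": [("Doc1", 1)],
    "island": [("Doc2", 1), ("Doc3", 1)],
    "bermuda": [("Doc2", 1)],
}
N_DOCS = 3  # assumed size of the toy collection


def idf(term):
    """Inverse document frequency from the posting-list length (the df)."""
    return math.log2(N_DOCS / len(INVERTED_FILE[term]))


def document_lengths():
    """Vector length of every document under tf x idf weights."""
    squares = defaultdict(float)
    for term, postings in INVERTED_FILE.items():
        for doc_id, tf in postings:
            squares[doc_id] += (tf * idf(term)) ** 2
    return {doc_id: math.sqrt(total) for doc_id, total in squares.items()}


# [1] Initialization: done once, reused for every query.
DOC_LENGTHS = document_lengths()


def evaluate(query):
    scores = defaultdict(float)                 # [1] every score starts at 0
    query_tf = Counter(query.lower().split())   # [2] identify query terms
    for term, qtf in query_tf.items():
        postings = INVERTED_FILE.get(term)      # [3] look up the term
        if not postings:
            continue
        weight = idf(term)
        for doc_id, tf in postings:             # [4] partial score per entry
            scores[doc_id] += (qtf * weight) * (tf * weight)
    for doc_id in scores:                       # normalize by document length
        scores[doc_id] /= DOC_LENGTHS[doc_id]
    # [5] sort by score, best first; [6] the caller reports the list
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

Calling `evaluate("survivor island")` ranks the hypothetical Doc3 first, since it is the only document containing both query terms.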
[1] Initialization
We assume that the lexicon and inverted
file already exist
Since multiple queries may be presented,
we don’t want to reload data structures
multiple times
So load the lexicon data structure into
memory
Set scores for all documents to 0
Load/Compute document vector lengths
[2] Identify terms in the query
Treat queries like documents
- Tokenize and identify terms the same way
- Don’t want mismatches (e.g., Case vs case)
Note query term frequency for each term
- Example: “fast cars”
  - fast occurred once
  - cars occurred once
- Example: “bookish book lovers”
  - With stems:
    - book occurred twice
    - love occurred once
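A small sketch of counting query term frequencies. The function name and the lookup-table "stemmer" are hypothetical and cover only the slide's two examples; a real system would reuse whatever tokenizer and stemmer built the index.

```python
import re
from collections import Counter

# Hypothetical stand-in for a stemmer; handles only the slide's example.
TOY_STEMS = {"bookish": "book", "lovers": "love"}


def toy_stem(token):
    return TOY_STEMS.get(token, token)


def query_term_frequencies(query, stem=None):
    """Case-fold and tokenize the query, optionally stem, then count terms."""
    tokens = re.findall(r"[a-z0-9]+", query.lower())
    if stem is not None:
        tokens = [stem(token) for token in tokens]
    return Counter(tokens)
```

With these definitions, `query_term_frequencies("Fast cars")` counts `fast` and `cars` once each, and passing `toy_stem` for “bookish book lovers” counts `book` twice and `love` once.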
Computing cosine term-by-term
For purposes of ranking docs against a query
- Can ignore query vector ‘length’
- Only need consider terms which occur in both query and doc
Part of the equation isn’t dependent on the query
- Precompute document length
The sum can be computed term by term
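The points above can be written out as follows. The symbols are assumptions for illustration: $q_t$ and $d_t$ stand for the weights of term $t$ in the query and the document (for example tf x idf), and $|q|$, $|d|$ for the vector lengths.

```latex
\cos(q, d) = \frac{\sum_{t} q_t \, d_t}{|q| \, |d|}
\qquad\Longrightarrow\qquad
\mathrm{score}(q, d) = \frac{1}{|d|} \sum_{t \in q \cap d} q_t \, d_t ,
\qquad
|d| = \sqrt{\sum_{t} d_t^{2}}
```

Dropping $|q|$ rescales every document's score by the same factor, so the ranking is unchanged; $|d|$ does not depend on the query and can be computed once per document.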
Example
Query = survivor island
Doc1 = survivor crusoe
Doc2 = island bermuda
N = 524,000

term       df      idf
survivor   578     9.
island     9635    5.
crusoe     30      14.
bermuda    591     9.
Example, cont’d
Query = survivor island
Doc1 = survivor crusoe
Doc2 = island bermuda
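A hedged worked version of this example. The weighting is an assumption, since the slide's own arithmetic is not preserved: idf = log2(N / df), which reproduces the integer parts 9, 5, 14, 9 of the idf column, every term frequency equal to 1, and scores normalized by document length only.

```python
import math

N = 524_000
DF = {"survivor": 578, "island": 9635, "crusoe": 30, "bermuda": 591}

# Assumed idf definition: base-2 logarithm of N / df.
IDF = {term: math.log2(N / df) for term, df in DF.items()}


def vector(text):
    """tf x idf vector for a text in which every term occurs once."""
    return {term: IDF[term] for term in text.split()}


def score(query, doc):
    """Dot product over shared terms, divided by the document length."""
    q, d = vector(query), vector(doc)
    doc_length = math.sqrt(sum(w * w for w in d.values()))
    dot = sum(q[term] * d[term] for term in q if term in d)
    return dot / doc_length


doc1_score = score("survivor island", "survivor crusoe")
doc2_score = score("survivor island", "island bermuda")
```

Under these assumptions Doc1 scores higher than Doc2: it matches the rarer query term, survivor, while Doc2 matches the more common island.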
[5] Sort documents by score
Use any old sorting algorithm
- Quicksort (average case) or Mergesort = O(n log(n))
But, if we want only the top, say 100 docs
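One common way to get only the top k without sorting everything is heap-based selection, which takes roughly O(n log k) time. This sketch is an assumption about what could be used here, not necessarily the approach the slide went on to describe; `top_k` is a hypothetical helper name.

```python
import heapq


def top_k(scores, k=100):
    """Return the k highest-scoring (doc_id, score) pairs, best first."""
    return heapq.nlargest(k, scores.items(), key=lambda item: item[1])
```

For example, `top_k({"a": 0.1, "b": 0.9, "c": 0.5}, k=2)` keeps only the entries for `b` and `c`, in that order.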
Probabilistic Models
Rigorous formal model attempts to
predict the probability that a given
document will be relevant to a given
query
Ranks retrieved documents according to
this probability of relevance (Probability
Ranking Principle)
Relies on accurate estimates of
probabilities
Probability Ranking Principle
If a reference retrieval system’s response to
each request is a ranking of the documents in
the collections in the order of decreasing
probability of usefulness to the user who
submitted the request, where the probabilities
are estimated as accurately as possible on the
basis of whatever data has been made
available to the system for this purpose, then
the overall effectiveness of the system to its
users will be the best that is obtainable on the
basis of that data.
Stephen E. Robertson, J. Documentation 1977
Probabilistic Model
Angle between two vectors: how empirical!
- Want to compute the probability that a document satisfies a user’s need
Biased towards an interactive view of the process
- Initially posit a set of candidate documents
- The user will rate initial documents
- Now the system can make a better guess
Probabilistic framework
- Compute odds of relevance for a document
- Binary weights are used
- In practice, no human feedback is required
Recall a few probability basics
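For events a and b:
$$p(a, b) = p(a \cap b) = p(a \mid b)\, p(b) = p(b \mid a)\, p(a)$$
Bayes’ Rule
$$p(a \mid b) = \frac{p(b \mid a)\, p(a)}{p(b)}$$
where p(a | b) is the posterior and p(a) is the prior.
Odds:
$$O(a) = \frac{p(a)}{p(\bar{a})} = \frac{p(a)}{1 - p(a)}$$
Courtesy of Manning and Raghavan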
Probabilistic Ranking
Basic concept:
"For a given query, if we know some documents that are relevant, terms that occur in those documents should be given greater weighting in searching for other relevant documents. By making assumptions about the distribution of terms and applying Bayes Theorem, it is possible to derive weights theoretically."
Van Rijsbergen
Binary Independence Model
Traditionally used in conjunction with PRP
“Binary” = Boolean: documents are represented
as binary incidence vectors of terms (cf.
lecture 1):
- $\vec{x} = (x_1, \ldots, x_n)$
- $x_i = 1$ iff term i is present in document x.
“Independence”: terms occur in documents
independently
Different documents could theoretically be
modeled as the same vector
Binary Independence Model
Queries: binary term incidence vectors
Given query q ,
- for each document d we need to compute
p ( R | q,d ).
- replace with computing p ( R | q,x ) where x is
binary term incidence vector representing d
Interested only in ranking
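Will use odds and Bayes’ Rule:
$$O(R \mid q, \vec{x}) = \frac{p(R \mid q, \vec{x})}{p(NR \mid q, \vec{x})} = \frac{p(R \mid q)\, p(\vec{x} \mid R, q) \,/\, p(\vec{x} \mid q)}{p(NR \mid q)\, p(\vec{x} \mid NR, q) \,/\, p(\vec{x} \mid q)}$$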
Binary Independence Model
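$$O(R \mid q, \vec{x}) = \underbrace{\frac{p(R \mid q)}{p(NR \mid q)}}_{\text{constant for a given query}} \cdot \underbrace{\frac{p(\vec{x} \mid R, q)}{p(\vec{x} \mid NR, q)}}_{\text{needs estimation}}$$
- Using Independence Assumption:
$$\frac{p(\vec{x} \mid R, q)}{p(\vec{x} \mid NR, q)} = \prod_{i} \frac{p(x_i \mid R, q)}{p(x_i \mid NR, q)}$$
- So:
$$O(R \mid q, \vec{x}) = O(R \mid q) \cdot \prod_{i} \frac{p(x_i \mid R, q)}{p(x_i \mid NR, q)}$$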
Binary Independence Model
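Let $p_i = p(x_i = 1 \mid R, q)$ and $r_i = p(x_i = 1 \mid NR, q)$, and assume $p_i = r_i$ for all terms not occurring in the query ($q_i = 0$). Then:
$$O(R \mid q, \vec{x}) = O(R \mid q) \cdot \underbrace{\prod_{x_i = q_i = 1} \frac{p_i (1 - r_i)}{r_i (1 - p_i)}}_{\text{only quantity to be estimated for rankings}} \cdot \underbrace{\prod_{q_i = 1} \frac{1 - p_i}{1 - r_i}}_{\text{constant for each query}}$$
- Retrieval Status Value:
$$RSV = \log \prod_{x_i = q_i = 1} \frac{p_i (1 - r_i)}{r_i (1 - p_i)} = \sum_{x_i = q_i = 1} \log \frac{p_i (1 - r_i)}{r_i (1 - p_i)}$$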
Binary Independence Model
- All boils down to computing RSV:
$$RSV = \sum_{x_i = q_i = 1} c_i, \qquad c_i = \log \frac{p_i (1 - r_i)}{r_i (1 - p_i)}$$
- So, how do we compute the $c_i$’s from our data?
Binary Independence Model
- Estimating RSV coefficients.
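- For each term i look at this table of document counts:

Documents    Relevant     Non-relevant         Total
x_i = 1      s_i          n_i - s_i            n_i
x_i = 0      S - s_i      N - n_i - S + s_i    N - n_i
Total        S            N - S                N

Here N is the number of documents, S the number of relevant documents, n_i the number of documents containing term i, and s_i the number of relevant documents containing term i.
- Estimates:
$$p_i \approx \frac{s_i}{S}, \qquad r_i \approx \frac{n_i - s_i}{N - S}, \qquad c_i \approx \log \frac{s_i \,/\, (S - s_i)}{(n_i - s_i) \,/\, (N - n_i - S + s_i)}$$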
Initial Assumptions
Problem: we don’t know which are the relevant
and non-relevant docs
- Needed to compute values of contingency table...
Lacking any other knowledge, we can assume
that all terms are equally likely to appear in
relevant documents
- So estimate pi = 0.5 (a fixed constant)
Assume terms occur in irrelevant documents
just as often as in the corpus as a whole
- So estimate ri = ni/N, where ni is the number of documents in which term ki appears
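A minimal sketch of what these initial estimates imply, assuming the standard BIM term weight c_i = log[p_i (1 - r_i) / (r_i (1 - p_i))]. With p_i = 0.5 it reduces to log((N - n_i) / n_i), an idf-like quantity. The function names are hypothetical, and the document frequencies are borrowed from the earlier cosine example.

```python
import math


def bim_term_weight(n_i, N, p_i=0.5):
    """c_i under the initial assumptions: p_i fixed, r_i = n_i / N."""
    r_i = n_i / N
    return math.log((p_i * (1 - r_i)) / (r_i * (1 - p_i)))


def rsv(query_terms, doc_terms, df, N):
    """Sum c_i over the terms present in both the query and the document."""
    shared = set(query_terms) & set(doc_terms)
    return sum(bim_term_weight(df[term], N) for term in shared)


N = 524_000
DF = {"survivor": 578, "island": 9635, "crusoe": 30, "bermuda": 591}
rsv_doc1 = rsv(["survivor", "island"], ["survivor", "crusoe"], DF, N)
rsv_doc2 = rsv(["survivor", "island"], ["island", "bermuda"], DF, N)
```

No relevance judgments are needed at this stage: the weight depends only on N and the document frequency, so rarer terms receive larger weights. The logarithm base does not affect the ranking.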
What comes next?
shut ...
please close the ...
give me a ....
i like to eat ... for breakfast
I like to eat ...
In LDC2006T13 (GOOG 5-grams)
- please close the browser
- please close the browser window
- please close the bug
- please close the door "
- please close the door ,
- please close the door
- please close the door?
- please close the door and
- please close the door behind
- please close the door on
- please close the gate behind
- please close the new browser
- please close the new window
- please close the outside link
- please close the popup window
- please close the queue ,
- please close the thread
- please close the window
- please close the window and
- please close the window by
- please close the window to
- please close this?
- please close this box
Language Modelling Motivation
http://www.youtube.com/watch?v=6hcoT6yxFoU
Statistical Language Models
Around 1998-2000 three groups developed a
model based on statistical language modelling
- Ponte and Croft, (SIGIR-98)
- Miller, Leek, and Schwartz, (SIGIR-99)
- Hiemstra and de Vries, ( CTIT Tech. Report , May
Can be viewed as a Markov process
Appears to outperform vector cosine
Stochastic Language Models
Model probability of generating any
string
word        P(word | M1)    P(word | M2)
the         0.2             0.2
class       0.01            0.0001
sayst       0.0001          0.03
pleaseth    0.0001          0.02
yon         0.0001          0.1
maiden      0.0005          0.01
woman       0.01            0.0001

s  = the    class    pleaseth  yon      maiden
M1:  0.2    0.01     0.0001    0.0001   0.0005
M2:  0.2    0.0001   0.02      0.1      0.01

P(s|M2) > P(s|M1)
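The comparison can be checked by multiplying the per-word probabilities listed for each model. This sketch treats both models as unigram models, so each word is generated independently; the dictionary and function names are illustrative only.

```python
import math

M1 = {"the": 0.2, "class": 0.01, "sayst": 0.0001, "pleaseth": 0.0001,
      "yon": 0.0001, "maiden": 0.0005, "woman": 0.01}
M2 = {"the": 0.2, "class": 0.0001, "sayst": 0.03, "pleaseth": 0.02,
      "yon": 0.1, "maiden": 0.01, "woman": 0.0001}


def string_probability(s, model):
    """P(s | M) for a unigram model: product of the word probabilities."""
    return math.prod(model[word] for word in s.split())


s = "the class pleaseth yon maiden"
p_m1 = string_probability(s, M1)
p_m2 = string_probability(s, M2)
```

The product comes to about 1e-14 under M1 and about 4e-10 under M2, so M2 is far more likely to have generated s.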
$$P(w_1 w_2 w_3 w_4 \mid M) = P(w_1 \mid M)\, P(w_2 \mid M, w_1)\, P(w_3 \mid M, w_1 w_2)\, P(w_4 \mid M, w_1 w_2 w_3)$$
Stochastic Language Models
A statistical model for generating text
- Probability distribution over strings in a
given language
Unigram and higher-order models
$$P(w_1 w_2 w_3 w_4) = P(w_1)\, P(w_2 \mid w_1)\, P(w_3 \mid w_1 w_2)\, P(w_4 \mid w_1 w_2 w_3)$$
Unigram Language Models
$$P(w_1)\, P(w_2)\, P(w_3)\, P(w_4)$$
Easy. Effective!
Bigram (generally, n-gram) Language Models
$$P(w_1)\, P(w_2 \mid w_1)\, P(w_3 \mid w_2)\, P(w_4 \mid w_3)$$
Using Language Models in IR
Treat each document as the basis for a model
(e.g., unigram sufficient statistics)
Rank document d based on P(d | q)
P(d | q) = P(q | d) x P(d) / P(q)
- P(q) is the same for all documents, so ignore
- P(d) [the prior] is often treated as the same for all d
- But we could use criteria like authority, length, genre
- P(q | d) is the probability of q given d’s model
Very general formal approach
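A minimal query-likelihood sketch under stated assumptions: each document's model is an unsmoothed maximum-likelihood unigram model, P(d) is treated as uniform, and the two documents are hypothetical toy texts. Function names are illustrative.

```python
from collections import Counter


def unigram_model(doc_text):
    """Maximum-likelihood unigram model: relative term frequencies."""
    counts = Counter(doc_text.lower().split())
    total = sum(counts.values())
    return {term: count / total for term, count in counts.items()}


def query_likelihood(query, model):
    """P(q | d's model) as a product of per-term probabilities."""
    probability = 1.0
    for term in query.lower().split():
        probability *= model.get(term, 0.0)  # unseen term contributes 0
    return probability


def rank(query, docs):
    """Rank documents by P(q | d), assuming a uniform prior P(d)."""
    scored = {doc_id: query_likelihood(query, unigram_model(text))
              for doc_id, text in docs.items()}
    return sorted(scored.items(), key=lambda item: item[1], reverse=True)


# Hypothetical toy documents.
DOCS = {
    "d1": "please close the door",
    "d2": "please close the browser window",
}
ranking = rank("close the door", DOCS)
```

Here d1 scores (1/4)^3 and d2 scores 0, because "door" never occurs in d2. That zero illustrates why unsmoothed estimates are too harsh in practice; smoothing the document model with collection statistics is the usual remedy.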