Information Retrieval: Text Mining and Probabilistic Models, Slides of Artificial Intelligence

These slides cover information retrieval techniques, focusing on text mining and probabilistic models. Topics include identifying terms in queries, looking up information for each term, computing cosine similarity, sorting documents by score, probabilistic models, and the motivation for language modeling. The slides also cover the probability ranking principle and the binary independence model.

Typology: Slides

2010/2011

Uploaded on 11/09/2011

stagist


Week 5: Other Retrieval Models

 Review: Computing Cosine

 Probabilistic Model

  • IIR Chapter 11
  • Ignore ‘Inference Networks’

 Language Models

  • IIR Chapter 12

Query Evaluation (for ranked retrieval)

Multi-step process

1. Initialization

2. Identify the terms in the query

3. Look up information about each term

  • file offset into inverted file
  • the document frequency

4. Loop over each inverted file entry for terms

  • compute a partial score for each document

5. Sort documents by score

6. Report results

[1] Initialization

 We assume that the lexicon and inverted file already exist

 Since multiple queries may be presented, we don’t want to reload data structures multiple times

 So load the lexicon data structure into memory

 Set scores for all documents to 0

 Load/Compute document vector lengths

[2] Identify terms in the query

 Treat queries like documents

  • Tokenize and identify terms the same way
  • Don’t want mismatches (e.g., Case vs case)

 Note query term frequency for each term

  • Example: “fast cars”
    • fast occurred once
    • cars occurred once
  • Example: “bookish book lovers” (with stems)
    • book occurred twice
    • love occurred once
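
The query-side processing above can be sketched in a few lines of Python. This is a minimal illustration, not the course's implementation: the `TOY_STEMS` table and the function name `query_term_frequencies` are hypothetical, and a real system would apply the same tokenizer and stemmer that were used at indexing time.

```python
import re
from collections import Counter

# Hypothetical stem table for illustration only; a real system would
# reuse the stemmer that was applied when the documents were indexed.
TOY_STEMS = {"bookish": "book", "lovers": "love"}

def query_term_frequencies(query: str) -> Counter:
    """Tokenize a query like a document and count each resulting term."""
    # Case-fold first so that "Case" and "case" map to the same term.
    tokens = re.findall(r"[a-z0-9]+", query.lower())
    terms = [TOY_STEMS.get(tok, tok) for tok in tokens]
    return Counter(terms)

print(query_term_frequencies("fast cars"))
print(query_term_frequencies("Bookish book lovers"))
```

The counts returned here are the query term frequencies that the later scoring step multiplies into each partial score.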

Computing cosine term-by-term

 For purposes of ranking docs against a query

  • Can ignore query vector ‘length’
  • Only need consider terms which occur in both query and doc

 Part of the equation isn’t dependent on the query

  • Precompute document length

 The sum can be computed term by term
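
Writing the cosine out makes these points visible. In the sketch below, the symbols $w_{t,q}$ and $w_{t,d}$ (the weights of term $t$ in the query and in the document) are notation assumed here, not taken from the slides. For a fixed query, dividing by $\lVert q \rVert$ rescales every document's score by the same factor, so it can be dropped without changing the ranking:

```latex
\[
\cos(q,d)
  = \frac{\sum_{t} w_{t,q}\, w_{t,d}}{\lVert q \rVert \, \lVert d \rVert}
  \;\propto\;
  \frac{1}{\lVert d \rVert} \sum_{t \,\in\, q \cap d} w_{t,q}\, w_{t,d}
\]
```

The factor $1/\lVert d \rVert$ does not depend on the query and can be precomputed, and the sum only runs over terms that occur in both the query and the document, so it can be accumulated one term at a time.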

Example

 Query = survivor island

 Doc1 = survivor crusoe

 Doc2 = island bermuda

N = 524,000

term       df     idf
survivor   578    9.
island     9635   5.
crusoe     30     14.
bermuda    591    9.
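
The idf values can be recomputed from N and df. The sketch below assumes idf = log2(N / df); the slides do not state the formula, but this choice is consistent with the integer parts shown in the example.

```python
import math

N = 524_000  # collection size from the example
df = {"survivor": 578, "island": 9635, "crusoe": 30, "bermuda": 591}

def idf(term: str) -> float:
    """Inverse document frequency, assuming idf = log2(N / df)."""
    return math.log2(N / df[term])

for term in df:
    print(f"{term:10s} df={df[term]:5d} idf={idf(term):.2f}")
```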

Example, cont’d

 Query = survivor island

 Doc1 = survivor crusoe

 Doc2 = island bermuda
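
A minimal term-at-a-time sketch of this example follows. It is written under stated assumptions rather than as the course's scoring code: weights are tf × idf, idf is log2(N / df), scores are divided by the document length only, and the names `inverted`, `doc_len`, and `scores` are chosen here for illustration.

```python
import math
from collections import defaultdict

N = 524_000
df = {"survivor": 578, "island": 9635, "crusoe": 30, "bermuda": 591}
docs = {"Doc1": ["survivor", "crusoe"], "Doc2": ["island", "bermuda"]}
query = ["survivor", "island"]

def idf(term):
    return math.log2(N / df[term])  # assumed idf formula

# Build a tiny inverted file: term -> list of (doc_id, tf) postings.
inverted = defaultdict(list)
for doc_id, terms in docs.items():
    for term in set(terms):
        inverted[term].append((doc_id, terms.count(term)))

# Precompute document vector lengths; these do not depend on the query.
doc_len = {
    doc_id: math.sqrt(sum((terms.count(t) * idf(t)) ** 2 for t in set(terms)))
    for doc_id, terms in docs.items()
}

# Term-at-a-time: walk each query term's postings and add a partial score.
scores = defaultdict(float)
for term in set(query):
    w_q = query.count(term) * idf(term)
    for doc_id, tf in inverted.get(term, []):
        scores[doc_id] += w_q * (tf * idf(term))

# Normalize by document length only; the query length is ignored.
for doc_id in scores:
    scores[doc_id] /= doc_len[doc_id]

ranking = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
print(ranking)
```

Under these assumptions Doc1 ranks above Doc2, because the term it shares with the query (survivor) is rarer than island and therefore carries a larger weight.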

[5] Sort documents by score

 Use any old sorting algorithm

  • Quicksort or Mergesort = O(n log(n))

 But, if we want only the top, say 100 docs

  • Heapsort = O(n log(k))
    • for k << n, it matters
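
For the top-k case, Python's standard library already offers a heap-based selection. The sketch below uses `heapq.nlargest`; the helper name `top_k` and the sample scores are made up for illustration.

```python
import heapq

def top_k(scores: dict, k: int):
    """Return the k highest-scoring (doc_id, score) pairs, best first."""
    # nlargest maintains a heap of at most k items while scanning all n scores.
    return heapq.nlargest(k, scores.items(), key=lambda kv: kv[1])

scores = {"d1": 0.2, "d2": 0.9, "d3": 0.5, "d4": 0.7}
print(top_k(scores, 2))
```

This avoids sorting all n documents when only the first k are reported.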

Probabilistic Models

 Rigorous formal model attempts to predict the probability that a given document will be relevant to a given query

 Ranks retrieved documents according to this probability of relevance (Probability Ranking Principle)

 Relies on accurate estimates of probabilities

Probability Ranking Principle

 If a reference retrieval system’s response to each request is a ranking of the documents in the collections in the order of decreasing probability of usefulness to the user who submitted the request, where the probabilities are estimated as accurately as possible on the basis of whatever data has been made available to the system for this purpose, then the overall effectiveness of the system to its users will be the best that is obtainable on the basis of that data.

Stephen E. Robertson, J. Documentation 1977

Probabilistic Model

 Angle between two vectors: how empirical!

  • Want to compute the probability that a document satisfies a user’s need

 Biased towards an interactive view of the process

  • Initially posit a set of candidate documents
  • The user will rate initial documents
  • Now the system can make a better guess

 Probabilistic framework

  • Compute odds of relevance for a document
  • Binary weights are used
  • In practice, no human feedback is required

Recall a few probability basics

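 For events a and b:

$$p(a, b) = p(a \cap b) = p(a \mid b)\, p(b) = p(b \mid a)\, p(a)$$

 Bayes’ Rule

$$p(a \mid b) = \frac{p(b \mid a)\, p(a)}{p(b)}$$

  • $p(a \mid b)$ is the posterior; $p(a)$ is the prior

 Odds:

$$O(a) = \frac{p(a)}{p(\bar{a})} = \frac{p(a)}{1 - p(a)}$$

Courtesy of Manning and Raghavan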

Probabilistic Ranking

Basic concept:

"For a given query, if we know some documents that are relevant, terms that occur in those documents should be given greater weighting in searching for other relevant documents. By making assumptions about the distribution of terms and applying Bayes Theorem, it is possible to derive weights theoretically."

Van Rijsbergen

Binary Independence Model

 Traditionally used in conjunction with PRP

 “Binary” = Boolean: documents are represented as binary incidence vectors of terms (cf. lecture 1):

  • $x = (x_1, \ldots, x_n)$
  • $x_i = 1$ iff term i is present in document x.

 “Independence”: terms occur in documents independently

 Different documents could theoretically be modeled as the same vector

Binary Independence Model

 Queries: binary term incidence vectors

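 Given query q,

  • for each document d we need to compute p(R | q, d).
  • replace with computing p(R | q, x) where x is the binary term incidence vector representing d

 Interested only in ranking

 Will use odds and Bayes’ Rule (NR denotes non-relevance):

$$O(R \mid q, x) = \frac{p(R \mid q, x)}{p(NR \mid q, x)} = \frac{p(R \mid q)\, p(x \mid R, q) \,/\, p(x \mid q)}{p(NR \mid q)\, p(x \mid NR, q) \,/\, p(x \mid q)}$$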

Binary Independence Model

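$$O(R \mid q, x) = \frac{p(R \mid q)}{p(NR \mid q)} \cdot \frac{p(x \mid R, q)}{p(x \mid NR, q)}$$

  • First factor: constant for a given query
  • Second factor: needs estimation (NR denotes non-relevance)
  • Using Independence Assumption:

$$\frac{p(x \mid R, q)}{p(x \mid NR, q)} = \prod_{i=1}^{n} \frac{p(x_i \mid R, q)}{p(x_i \mid NR, q)}$$

  • So:

$$O(R \mid q, x) = O(R \mid q) \cdot \prod_{i=1}^{n} \frac{p(x_i \mid R, q)}{p(x_i \mid NR, q)}$$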

Binary Independence Model

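  • Let $p_i = p(x_i = 1 \mid R, q)$ and $r_i = p(x_i = 1 \mid NR, q)$, and assume $p_i = r_i$ for all terms not occurring in the query ($q_i = 0$). Then:

$$O(R \mid q, x) = O(R \mid q) \cdot \prod_{x_i = q_i = 1} \frac{p_i (1 - r_i)}{r_i (1 - p_i)} \cdot \prod_{q_i = 1} \frac{1 - p_i}{1 - r_i}$$

  • $O(R \mid q)$ and the product over $q_i = 1$: constant for each query
  • The product over $x_i = q_i = 1$: only quantity to be estimated for rankings
  • Retrieval Status Value:

$$RSV = \log \prod_{x_i = q_i = 1} \frac{p_i (1 - r_i)}{r_i (1 - p_i)} = \sum_{x_i = q_i = 1} \log \frac{p_i (1 - r_i)}{r_i (1 - p_i)}$$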

Binary Independence Model

  • All boils down to computing RSV.

$$RSV = \sum_{x_i = q_i = 1} c_i, \qquad c_i = \log \frac{p_i (1 - r_i)}{r_i (1 - p_i)}$$

  • So, how do we compute ci’s from our data?

Binary Independence Model

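  • Estimating RSV coefficients.
  • For each term i look at this table of document counts:

Documents   Relevant   Non-Relevant    Total
x_i = 1     s          n - s           n
x_i = 0     S - s      N - n - S + s   N - n
Total       S          N - S           N

  • Estimates:

$$p_i \approx \frac{s}{S}, \qquad r_i \approx \frac{n - s}{N - S}, \qquad c_i \approx K(N, n, S, s) = \log \frac{s / (S - s)}{(n - s) / (N - n - S + s)}$$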
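
A sketch of the coefficient estimate in Python, assuming the usual contingency counts: N documents in the collection, n of them containing term i, S relevant documents, and s relevant documents containing term i. The function name `rsv_coefficient`, the `smooth` parameter (a constant added to every cell to avoid zero counts), and the sample counts are assumptions for illustration, not taken from the slides.

```python
import math

def rsv_coefficient(N: int, n: int, S: int, s: int, smooth: float = 0.5) -> float:
    """Estimate c_i = log of (odds of term in relevant docs / odds in non-relevant docs).

    N: documents in the collection      n: documents containing term i
    S: relevant documents               s: relevant documents containing term i
    smooth: added to each cell so that empty cells do not cause log(0)
            or a division by zero (an assumption, not from the slides).
    """
    relevant_odds = (s + smooth) / (S - s + smooth)
    nonrelevant_odds = (n - s + smooth) / (N - n - S + s + smooth)
    return math.log(relevant_odds / nonrelevant_odds)

# Hypothetical counts: the term appears in 8 of 10 relevant documents.
print(rsv_coefficient(N=1000, n=100, S=10, s=8))
```

A term that is concentrated in the relevant documents gets a large positive coefficient; a term spread evenly over relevant and non-relevant documents gets a coefficient near zero.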

Initial Assumptions

 Problem: we don’t know which are the relevant and non-relevant docs

  • Needed to compute values of contingency table...

 Lacking any other knowledge, we can assume that all terms are equally likely to appear in relevant documents

  • So estimate pi = 0.5 (a fixed constant)

 Assume terms occur in irrelevant documents just as often as in the corpus as a whole

  • So estimate ri = ni/N, where ni is the number of documents in which term ki appears
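
With these two estimates the term weight reduces to log((N - ni) / ni), which behaves like an idf weight. The sketch below plugs the estimates into c_i = log(p_i (1 - r_i) / (r_i (1 - p_i))) and reuses N and the document frequencies from the earlier cosine example; the function names `c_initial` and `rsv` are chosen here for illustration.

```python
import math

N = 524_000
doc_freq = {"survivor": 578, "island": 9635, "crusoe": 30, "bermuda": 591}

def c_initial(term: str, p: float = 0.5) -> float:
    """Term weight c_i under the initial assumptions pi = 0.5 and ri = ni / N."""
    r = doc_freq[term] / N
    return math.log((p * (1 - r)) / (r * (1 - p)))

def rsv(query_terms, doc_terms) -> float:
    """Sum c_i over the terms present in both the query and the document."""
    return sum(c_initial(t) for t in set(query_terms) & set(doc_terms))

query = ["survivor", "island"]
print(rsv(query, ["survivor", "crusoe"]))
print(rsv(query, ["island", "bermuda"]))
```

No relevance judgments are needed for this starting point, which matches the remark that in practice no human feedback is required.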

What comes next?

 shut ...

 please close the ...

 give me a ...

 i like to eat ...

 for breakfast I like to eat ...

In LDC2006T13 (GOOG 5-grams):

  • please close the browser
  • please close the browser window
  • please close the bug
  • please close the door "
  • please close the door ,
  • please close the door
  • please close the door?
  • please close the door and
  • please close the door behind
  • please close the door on
  • please close the gate behind
  • please close the new browser
  • please close the new window
  • please close the outside link
  • please close the popup window
  • please close the queue ,
  • please close the thread
  • please close the window
  • please close the window and
  • please close the window by
  • please close the window to
  • please close this?
  • please close this box

Language Modelling Motivation

 http://www.youtube.com/watch?v=6hcoT6yxFoU

Statistical Language Models

 Around 1998-2000 three groups developed a model based on statistical language modelling

  • Ponte and Croft, (SIGIR-98)
  • Miller, Leek, and Schwartz, (SIGIR-99)
  • Hiemstra and de Vries, (CTIT Tech. Report, May

 Can be viewed as a Markov process

 Appears to outperform vector cosine

Stochastic Language Models

 Model probability of generating any string

Model M1             Model M2
0.2     the          0.2     the
0.01    class        0.0001  class
0.0001  sayst        0.03    sayst
0.0001  pleaseth     0.02    pleaseth
0.0001  yon          0.1     yon
0.0005  maiden       0.01    maiden
0.01    woman        0.0001  woman

s     the    class    pleaseth   yon      maiden
M1    0.2    0.01     0.0001     0.0001   0.0005
M2    0.2    0.0001   0.02       0.1      0.01

P(s|M2) > P(s|M1)
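
The comparison can be checked by multiplying the per-word probabilities, which is how a unigram model scores a string. A minimal sketch follows; the function name `string_probability` is chosen here, and the two dictionaries copy the word probabilities listed for M1 and M2.

```python
import math

M1 = {"the": 0.2, "class": 0.01, "sayst": 0.0001, "pleaseth": 0.0001,
      "yon": 0.0001, "maiden": 0.0005, "woman": 0.01}
M2 = {"the": 0.2, "class": 0.0001, "sayst": 0.03, "pleaseth": 0.02,
      "yon": 0.1, "maiden": 0.01, "woman": 0.0001}

def string_probability(model: dict, words: list) -> float:
    """Probability of a word sequence under a unigram model: product of word probabilities."""
    return math.prod(model[w] for w in words)

s = "the class pleaseth yon maiden".split()
p1 = string_probability(M1, s)
p2 = string_probability(M2, s)
print(p1, p2, p2 > p1)
```

M2 assigns the archaic words much higher probability than M1 does, which outweighs its lower probability for "class".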

$$P(w_1 w_2 w_3 w_4 \mid M) = P(w_1 \mid M)\, P(w_2 \mid M, w_1)\, P(w_3 \mid M, w_1 w_2)\, P(w_4 \mid M, w_1 w_2 w_3)$$

Stochastic Language Models

 A statistical model for generating text

  • Probability distribution over strings in a given language

Unigram and higher-order models

$$P(w_1 w_2 w_3 w_4) = P(w_1)\, P(w_2 \mid w_1)\, P(w_3 \mid w_1 w_2)\, P(w_4 \mid w_1 w_2 w_3)$$

 Unigram Language Models

$$P(w_1)\, P(w_2)\, P(w_3)\, P(w_4)$$

  • Easy. Effective!

 Bigram (generally, n-gram) Language Models

$$P(w_1)\, P(w_2 \mid w_1)\, P(w_3 \mid w_2)\, P(w_4 \mid w_3)$$
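
The difference between the two approximations shows up on a toy example. The sketch below estimates both models by maximum likelihood from a made-up eight-word text; the corpus and the function names `p_unigram` and `p_bigram` are illustrative assumptions, and no smoothing is applied.

```python
from collections import Counter

# Toy training text (hypothetical), used only to estimate the probabilities.
corpus = "please close the door please close the window".split()

unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
starts = Counter(corpus[:-1])  # how often each word is followed by another word
total = len(corpus)

def p_unigram(words):
    """P(w1) P(w2) ... P(wn) with maximum-likelihood unigram estimates."""
    p = 1.0
    for w in words:
        p *= unigrams[w] / total
    return p

def p_bigram(words):
    """P(w1) P(w2|w1) ... P(wn|wn-1) with maximum-likelihood estimates."""
    p = unigrams[words[0]] / total
    for prev, cur in zip(words, words[1:]):
        if starts[prev] == 0:
            return 0.0  # prev was never followed by anything in the corpus
        p *= bigrams[(prev, cur)] / starts[prev]
    return p

phrase = "please close the door".split()
scrambled = "door the close please".split()
print(p_unigram(phrase), p_unigram(scrambled))
print(p_bigram(phrase), p_bigram(scrambled))
```

The unigram model gives the phrase and its scrambled version the same probability because it ignores word order, while the bigram model gives the scrambled version zero because none of its word pairs were observed.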

Using Language Models in IR

 Treat each document as the basis for a model (e.g., unigram sufficient statistics)

 Rank document d based on P(d | q)

 P(d | q) = P(q | d) x P(d) / P(q)

  • P(q) is the same for all documents, so ignore
  • P(d) [the prior] is often treated as the same for all d
    • But we could use criteria like authority, length, genre
  • P(q | d) is the probability of q given d’s model

 Very general formal approach
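
A minimal query-likelihood sketch of this ranking follows. It assumes a uniform prior P(d), so documents are ranked by P(q | d) alone, and it mixes each document's unigram model with a collection-wide model so that a query term missing from a document does not force the score to zero. That linear interpolation and the weight `lam` are assumptions added here, as are the toy documents and the function name `query_likelihood`.

```python
from collections import Counter

# Toy documents (hypothetical).
docs = {
    "d1": "please close the door".split(),
    "d2": "please close the browser window".split(),
}
collection = [w for words in docs.values() for w in words]
coll_counts, coll_len = Counter(collection), len(collection)

def query_likelihood(query, doc_words, lam=0.5):
    """P(q | d) under a unigram model of d, interpolated with the collection model."""
    counts, length = Counter(doc_words), len(doc_words)
    p = 1.0
    for term in query:
        p_doc = counts[term] / length           # estimate from the document
        p_coll = coll_counts[term] / coll_len   # estimate from the whole collection
        p *= lam * p_doc + (1 - lam) * p_coll
    return p

query = "close window".split()
ranking = sorted(docs, key=lambda d: query_likelihood(query, docs[d]), reverse=True)
print(ranking)
```

A non-uniform prior, for example one based on authority or length, would enter as an extra factor P(d) multiplied into each document's score.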
