Information Retrieval: Text Mining and Probabilistic Models, Slides of Artificial Intelligence

Johns Hopkins University (JHU), Artificial Intelligence

Various techniques for information retrieval, focusing on text mining and probabilistic models. Topics include identifying terms in queries, finding information for each term, computing cosine similarity, sorting documents by score, probabilistic models, and language modeling motivation. The document also covers the probabilistic ranking principle and the binary independence model.

Typology: Slides

2010/2011

Uploaded on 11/09/2011

stagist 🇺🇸


Week 5: Other Retrieval Models

• Review: Computing Cosine
• Probabilistic Model
  – IIR Chapter 11
  – Ignore ‘Inference Networks’
• Language Models
  – IIR Chapter 12

Query Evaluation (for ranked retrieval)

Multi-step process

1. Initialization

2. Identify the terms in the query

3. Look up information about each term

• file offset into inverted file

• the document frequency

4. Loop over each inverted file entry for terms

• compute a partial score for each document

5. Sort documents by score

6. Report results
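
The six steps above can be sketched as a term-at-a-time loop over a toy in-memory index. The dictionary-based index, the function names `build_index` and `evaluate`, and the tf × idf weighting with a base-2 logarithm are assumptions made for illustration; document length normalization is left out to keep the sketch short.

```python
import math
from collections import Counter, defaultdict


def build_index(docs):
    """Build a toy inverted file: term -> list of (doc_id, term frequency)."""
    postings = defaultdict(list)
    for doc_id, text in docs.items():
        for term, tf in Counter(text.lower().split()).items():
            postings[term].append((doc_id, tf))
    return postings


def evaluate(query, postings, num_docs, top_k=10):
    # [1] Initialization: score accumulators implicitly start at 0.
    scores = defaultdict(float)
    # [2] Identify the query terms, tokenized the same way as the documents.
    query_tf = Counter(query.lower().split())
    for term, qtf in query_tf.items():
        # [3] Look up the term; here one dict plays the role of the lexicon
        #     and the inverted file (assumption for brevity).
        entries = postings.get(term, [])
        if not entries:
            continue
        idf = math.log2(num_docs / len(entries))  # len(entries) is the df
        # [4] Loop over the inverted file entries, adding partial scores.
        for doc_id, tf in entries:
            scores[doc_id] += (qtf * idf) * (tf * idf)
    # [5] Sort documents by score, [6] report the best top_k.
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return ranked[:top_k]
```

Calling `evaluate("survivor island", build_index(docs), len(docs))` returns a list of `(doc_id, score)` pairs, best first; documents that share no term with the query never get an accumulator and are not reported.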


[1] Initialization

• We assume that the lexicon and inverted file already exist
• Since multiple queries may be presented, we don’t want to reload data structures multiple times
• So load the lexicon data structure into memory
• Set scores for all documents to 0
• Load/Compute document vector lengths

[2] Identify terms in the query

• Treat queries like documents
  – Tokenize and identify terms the same way
  – Don’t want mismatches (e.g., Case vs case)
• Note query term frequency for each term
  – Example: “fast cars”
    - fast occurred once
    - cars occurred once
  – “bookish book lovers”
    - With stems:
    - book occurred twice
    - love occurred once
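
A minimal sketch of this step, assuming lowercase alphanumeric tokens and a tiny lookup table in place of a real stemmer. `TOY_STEMS` and `query_term_frequencies` are hypothetical names; an actual system would reuse whatever tokenizer and stemmer built the index.

```python
import re
from collections import Counter

# Hypothetical stem table, just large enough for the slide's example.
# A real system would apply the same stemmer that was used at indexing time.
TOY_STEMS = {"bookish": "book", "lovers": "love"}


def query_term_frequencies(query, stems=TOY_STEMS):
    """Tokenize, case-fold, stem, and count the terms of a query."""
    tokens = re.findall(r"[a-z0-9]+", query.lower())
    return Counter(stems.get(token, token) for token in tokens)
```

With this table, “bookish book lovers” yields book twice and love once, and “Case case” collapses to a single term with frequency two.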

Computing cosine term-by-term

• For purposes of ranking docs against a query
  – Can ignore query vector ‘length’
  – Only need consider terms which occur in both query and doc
• Part of the equation isn’t dependent on the query
  – Precompute document length
• The sum can be computed term by term
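
These points amount to the following decomposition of the cosine. The weight symbols w_{t,q} and w_{t,d} (the weight of term t in the query and in the document) are notation assumed here, not taken from the slides.

```latex
\cos(q, d)
  = \frac{\sum_{t} w_{t,q}\, w_{t,d}}{\lVert q \rVert \, \lVert d \rVert}
  \;\propto\;
  \frac{1}{\lVert d \rVert} \sum_{t \in q \cap d} w_{t,q}\, w_{t,d}
```

Dropping the query length changes every score for a given query by the same positive factor, so the ranking is unaffected; the document length does not depend on the query and can be stored ahead of time.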

Example

• Query = survivor island
• Doc1 = survivor crusoe
• Doc2 = island bermuda

N = 524,000

term        df      idf
survivor    578     9.
island      9635    5.
crusoe      30      14.
bermuda     591     9.

Example, cont’d

• Query = survivor island
• Doc1 = survivor crusoe
• Doc2 = island bermuda
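
The example can be worked through in a few lines. This is a sketch under two assumptions that the slides do not state: idf is taken as log2(N / df), which reproduces the integer parts shown in the table, and every term weight is simply its idf because each term occurs once. `cosine_rank_score` is a hypothetical helper name.

```python
import math

N = 524_000
DF = {"survivor": 578, "island": 9635, "crusoe": 30, "bermuda": 591}


def idf(term):
    # Assumed definition: base-2 log of (collection size / document frequency).
    return math.log2(N / DF[term])


def cosine_rank_score(query_terms, doc_terms):
    """Term-by-term cosine for ranking; the query vector length is ignored."""
    # Assumed weighting: tf is 1 for every term, so the weight is the idf.
    doc_weights = {t: idf(t) for t in doc_terms}
    doc_length = math.sqrt(sum(w * w for w in doc_weights.values()))
    shared = set(query_terms) & set(doc_terms)
    return sum(idf(t) * doc_weights[t] for t in shared) / doc_length


query = ["survivor", "island"]
doc1 = ["survivor", "crusoe"]
doc2 = ["island", "bermuda"]
score1 = cosine_rank_score(query, doc1)
score2 = cosine_rank_score(query, doc2)
```

Under these assumptions Doc1 scores higher than Doc2 for this query.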

[5] Sort documents by score

• Use any old sorting algorithm
  – Quicksort or Mergesort = O(n log(n))
• But, if we want only the top, say 100 docs
  – Heap-based selection (keep a heap of the k best seen so far) = O(n log(k))
    - for k << n, it matters
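
A minimal sketch of the top-k case, assuming the scores sit in a dictionary keyed by document id. `top_k` is a hypothetical name; the standard library's `heapq.nlargest` offers the same behaviour ready-made.

```python
import heapq


def top_k(scores, k):
    """Return the k highest-scoring (doc_id, score) pairs, best first."""
    if k <= 0:
        return []
    heap = []  # min-heap of (score, doc_id); never grows beyond k entries
    for doc_id, score in scores.items():
        if len(heap) < k:
            heapq.heappush(heap, (score, doc_id))
        elif score > heap[0][0]:
            # The new score beats the weakest of the current top k.
            heapq.heapreplace(heap, (score, doc_id))
    return [(doc_id, score) for score, doc_id in sorted(heap, reverse=True)]
```

Each of the n documents costs at most one O(log k) heap operation, and only the k survivors are sorted at the end.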

Probabilistic Models

• Rigorous formal model attempts to predict the probability that a given document will be relevant to a given query
• Ranks retrieved documents according to this probability of relevance (Probability Ranking Principle)
• Relies on accurate estimates of probabilities

Probability Ranking Principle

• If a reference retrieval system’s response to each request is a ranking of the documents in the collections in the order of decreasing probability of usefulness to the user who submitted the request, where the probabilities are estimated as accurately as possible on the basis of whatever data has been made available to the system for this purpose, then the overall effectiveness of the system to its users will be the best that is obtainable on the basis of that data.

Stephen E. Robertson, J. Documentation 1977

Probabilistic Model

• Angle between two vectors: how empirical!
  – Want to compute the probability that a document satisfies a user’s need
• Biased towards an interactive view of the process
  – Initially posit a set of candidate documents
  – The user will rate initial documents
  – Now the system can make a better guess
• Probabilistic framework
  – Compute odds of relevance for a document
  – Binary weights are used
  – In practice, no human feedback is required

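Recall a few probability basics

• For events a and b:
    $p(a, b) = p(a \cap b) = p(a \mid b)\,p(b) = p(b \mid a)\,p(a)$
• Bayes’ Rule
    $p(a \mid b) = \frac{p(b \mid a)\,p(a)}{p(b)}$
  – Posterior: $p(a \mid b)$; Prior: $p(a)$
• Odds:
    $O(a) = \frac{p(a)}{p(\bar{a})} = \frac{p(a)}{1 - p(a)}$

Courtesy of Manning and Raghavan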

Probabilistic Ranking

Basic concept: "For a given query, if we know some documents that are relevant, terms that occur in those documents should be given greater weighting in searching for other relevant documents. By making assumptions about the distribution of terms and applying Bayes Theorem, it is possible to derive weights theoretically." Van Rijsbergen

Courtesy of Manning and Raghavan

Binary Independence Model

• Traditionally used in conjunction with PRP
• “Binary” = Boolean: documents are represented as binary incidence vectors of terms (cf. lecture 1):
  – $x = (x_1, \ldots, x_n)$
  – $x_i = 1$ iff term i is present in document x.
• “Independence”: terms occur in documents independently
• Different documents could theoretically be modeled as the same vector

Courtesy of Manning and Raghavan

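Binary Independence Model

• Queries: binary term incidence vectors
• Given query q,
  – for each document d we need to compute p(R | q,d).
  – replace with computing p(R | q,x) where x is the binary term incidence vector representing d
  – Interested only in ranking
• Will use odds and Bayes’ Rule (NR = non-relevant):
    $O(R \mid q, x) = \frac{p(R \mid q, x)}{p(NR \mid q, x)} = \frac{p(R \mid q)\,p(x \mid R, q)\,/\,p(x \mid q)}{p(NR \mid q)\,p(x \mid NR, q)\,/\,p(x \mid q)}$

Courtesy of Manning and Raghavan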

Binary Independence Model

Using Independence Assumption: Constant for a given query Needs estimation
So : Courtesy of Manning and Raghavan

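Binary Independence Model

• Let $p_i = p(x_i = 1 \mid R, q)$ and $r_i = p(x_i = 1 \mid NR, q)$, and assume $p_i = r_i$ for terms not occurring in the query. Then:
    $O(R \mid q, x) = O(R \mid q) \cdot \prod_{x_i = q_i = 1} \frac{p_i (1 - r_i)}{r_i (1 - p_i)} \cdot \prod_{q_i = 1} \frac{1 - p_i}{1 - r_i}$
  – $O(R \mid q)$ and the product over $q_i = 1$: constant for each query
  – Product over $x_i = q_i = 1$: only quantity to be estimated for rankings
• Retrieval Status Value:
    $RSV = \log \prod_{x_i = q_i = 1} \frac{p_i (1 - r_i)}{r_i (1 - p_i)} = \sum_{x_i = q_i = 1} \log \frac{p_i (1 - r_i)}{r_i (1 - p_i)}$

Courtesy of Manning and Raghavan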

Binary Independence Model

• All boils down to computing RSV.
    $RSV = \sum_{x_i = q_i = 1} c_i$, where $c_i = \log \frac{p_i (1 - r_i)}{r_i (1 - p_i)}$
• So, how do we compute ci’s from our data?

Courtesy of Manning and Raghavan

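Binary Independence Model

• Estimating RSV coefficients.
• For each term i look at this table of document counts:

  Documents   Relevant   Non-relevant    Total
  x_i = 1     s          n - s           n
  x_i = 0     S - s      N - n - S + s   N - n
  Total       S          N - S           N

• Estimates:
    $p_i \approx \frac{s}{S}$, $r_i \approx \frac{n - s}{N - S}$
    $c_i \approx \log \frac{s / (S - s)}{(n - s) / (N - n - S + s)}$

Courtesy of Manning and Raghavan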
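
A minimal sketch of the estimate, assuming the usual four counts per term are known: `s` relevant documents containing the term, `S` relevant documents in total, `n` documents containing the term, and `N` documents in the collection. The count names follow IIR Chapter 11 and the function name is hypothetical; counts that make a probability 0 or 1 would need smoothing, which is not handled here.

```python
import math


def term_weight(s, S, n, N):
    """c_i = log[ p_i (1 - r_i) / (r_i (1 - p_i)) ] from document counts.

    s: relevant documents containing the term
    S: relevant documents in total
    n: documents containing the term
    N: documents in the collection
    """
    p_i = s / S              # P(term present | relevant)
    r_i = (n - s) / (N - S)  # P(term present | non-relevant)
    return math.log((p_i * (1 - r_i)) / (r_i * (1 - p_i)))
```

A term found in most relevant documents but few non-relevant ones gets a large positive weight; a term that is equally common in both gets a weight of zero.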

Initial Assumptions

• Problem: we don’t know which are the relevant and non-relevant docs
  – Needed to compute values of contingency table...
• Lacking any other knowledge, we can assume that all terms are equally likely to appear in relevant documents
  – So estimate pi = 0.5 (a fixed constant)
• Assume terms occur in irrelevant documents just as often as in the corpus as a whole
  – So estimate ri = ni/N, where ni is the number of documents in which term ki appears and N is the number of documents in the corpus
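
Plugging these two estimates into the term weight log[pi(1 - ri) / (ri(1 - pi))] leaves log((N - ni) / ni), which behaves much like idf. The sketch below assumes that weight definition; `initial_term_weight` and `rsv` are hypothetical names, and the natural logarithm is an arbitrary choice.

```python
import math


def initial_term_weight(n_i, N, p_i=0.5):
    """Term weight with p_i fixed at 0.5 and r_i estimated as n_i / N."""
    r_i = n_i / N
    return math.log((p_i * (1 - r_i)) / (r_i * (1 - p_i)))


def rsv(query_terms, doc_terms, doc_freq, N):
    """Sum the weights of the query terms that are present in the document."""
    shared = set(query_terms) & set(doc_terms)
    return sum(initial_term_weight(doc_freq[t], N) for t in shared)
```

Because the weights depend only on document frequencies, no relevance judgments are needed to produce an initial ranking.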

What comes next?

• shut ...
• please close the ...
• give me a ....
• i like to eat ...
• for breakfast I like to eat ...

In LDC2006T13 (GOOG 5-grams):

• please close the browser
• please close the browser window
• please close the bug
• please close the door "
• please close the door ,
• please close the door
• please close the door?
• please close the door and
• please close the door behind
• please close the door on
• please close the gate behind
• please close the new browser
• please close the new window
• please close the outside link
• please close the popup window
• please close the queue ,
• please close the thread
• please close the window
• please close the window and
• please close the window by
• please close the window to
• please close this?
• please close this box

Language Modelling Motivation

• http://www.youtube.com/watch?v=6hcoT6yxFoU

Statistical Language Models

• Around 1998-2000 three groups developed a model based on statistical language modelling
  – Ponte and Croft, (SIGIR-98)
  – Miller, Leek, and Schwartz, (SIGIR-99)
  – Hiemstra and de Vries, (CTIT Tech. Report, May
• Can be viewed as a Markov process
• Appears to outperform vector cosine

Stochastic Language Models

• Model probability of generating any string

  Model M1            Model M2
  0.2     the         0.2     the
  0.01    class       0.0001  class
  0.0001  sayst       0.03    sayst
  0.0001  pleaseth    0.02    pleaseth
  0.0001  yon         0.1     yon
  0.0005  maiden      0.01    maiden
  0.01    woman       0.0001  woman

  s     the   class    pleaseth  yon     maiden
  M1    0.2   0.01     0.0001    0.0001  0.
  M2    0.2   0.0001   0.02      0.1     0.

  P(s|M2) > P(s|M1)

Courtesy of Manning and Raghavan
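
The comparison can be checked by multiplying the per-word probabilities from the two models. This sketch treats each model as a unigram model and covers only the entries listed on the slide; `string_probability` is a hypothetical name and words missing from a model are not handled.

```python
import math

# Per-word probabilities as listed for the two models (partial vocabularies).
M1 = {"the": 0.2, "class": 0.01, "sayst": 0.0001, "pleaseth": 0.0001,
      "yon": 0.0001, "maiden": 0.0005, "woman": 0.01}
M2 = {"the": 0.2, "class": 0.0001, "sayst": 0.03, "pleaseth": 0.02,
      "yon": 0.1, "maiden": 0.01, "woman": 0.0001}


def string_probability(words, model):
    """Probability of a word sequence under a unigram model."""
    return math.prod(model[word] for word in words)


s = "the class pleaseth yon maiden".split()
p_m1 = string_probability(s, M1)
p_m2 = string_probability(s, M2)
```

The products come to roughly 1e-14 under M1 and 4e-10 under M2, so M2 is the more likely source of the string.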

Stochastic Language Models

• A statistical model for generating text
  – Probability distribution over strings in a given language
  – For a model M: $P(t_1 t_2 t_3 t_4 \mid M) = P(t_1 \mid M)\,P(t_2 \mid M, t_1)\,P(t_3 \mid M, t_1 t_2)\,P(t_4 \mid M, t_1 t_2 t_3)$

Courtesy of Manning and Raghavan

Unigram and higher-order models

• Chain rule: $P(t_1 t_2 t_3 t_4) = P(t_1)\,P(t_2 \mid t_1)\,P(t_3 \mid t_1 t_2)\,P(t_4 \mid t_1 t_2 t_3)$
• Unigram Language Models: $P(t_1)\,P(t_2)\,P(t_3)\,P(t_4)$
  – Easy. Effective!
• Bigram (generally, n-gram) Language Models: $P(t_1)\,P(t_2 \mid t_1)\,P(t_3 \mid t_2)\,P(t_4 \mid t_3)$

Courtesy of Manning and Raghavan

Using Language Models in IR

• Treat each document as the basis for a model (e.g., unigram sufficient statistics)
• Rank document d based on P(d | q)
• P(d | q) = P(q | d) x P(d) / P(q)
  – P(q) is the same for all documents, so ignore
  – P(d) [the prior] is often treated as the same for all d
    - But we could use criteria like authority, length, genre
  – P(q | d) is the probability of q given d’s model
• Very general formal approach

Courtesy of Manning and Raghavan
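
A minimal sketch of ranking by P(q | d) with a uniform prior P(d), assuming a unigram model estimated from each document's term counts. The linear mixing with a collection-wide model (weight `lam`) is an assumption added here so that a query term missing from a document does not force the product to zero; the slides do not specify any smoothing. Function names are hypothetical and documents are assumed to be non-empty token lists.

```python
from collections import Counter


def query_likelihood(query, doc, collection, lam=0.5):
    """P(q | d): product over query terms of a mixed unigram probability."""
    doc_tf, coll_tf = Counter(doc), Counter(collection)
    p = 1.0
    for term in query:
        p_doc = doc_tf[term] / len(doc)                # document model
        p_coll = coll_tf[term] / len(collection)       # collection model
        p *= lam * p_doc + (1 - lam) * p_coll          # assumed smoothing
    return p


def rank(query, docs):
    """Order document names by P(q | d); the prior P(d) is taken as uniform."""
    collection = [term for tokens in docs.values() for term in tokens]
    scored = {name: query_likelihood(query, tokens, collection)
              for name, tokens in docs.items()}
    return sorted(scored, key=scored.get, reverse=True)
```

A non-uniform prior based on authority, length, or genre could be folded in by multiplying each score by P(d) before sorting.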
