Understanding the Vector Space Model for Information Retrieval - Prof. Douglas William Oar, Study notes of School management&administration

University of Maryland School management&administration

Prof. Douglas William Oard

This document, from LBSC 796/CMSC828o Session 3, held on February 9, 2004, by Douglas W. Oard, discusses the vector space model for information retrieval. Topics include thinking about search, design strategies, decomposing the search component, Boolean "free text" retrieval, the "bag of terms" representation, proximity operators, ranked retrieval, and passage retrieval. It also explains how machines and humans can work together in the search process, the concept of relevance, and the strengths and weaknesses of Boolean retrieval.

Typology: Study notes

Pre 2010

Uploaded on 07/30/2009

The Vector Space Model

LBSC 796/CMSC828o

Session 3, February 9, 2004

Douglas W. Oard

Agenda

•Thinking about search

•Design strategies

•Decomposing the search component

•Boolean “free text” retrieval

•The “bag of terms” representation

•Proximity operators

•Ranked retrieval

•Vector space model

•Passage retrieval

Supporting the Search Process

Design Strategies


• Foster human-machine synergy
    • Exploit complementary strengths
    • Accommodate shared weaknesses
• Divide-and-conquer
    • Divide task into stages with well-defined interfaces
    • Continue dividing until problems are easily solved
• Co-design related components
    • Iterative process of joint optimization

Human-Machine Synergy

Machines are good at:
Doing simple things accurately and quickly
Scaling to larger collections in sublinear time
People are better at:
Accurately recognizing what they are looking for
Evaluating intangibles such as “quality”

Humans start with an information need
Machines start with a query
Humans match documents to information needs
Machines match document & query representations

Search Component Model

Relevance

• Relevance relates a topic and a document
    • Duplicates are equally relevant, by definition
    • Constant over time and across users
• Pertinence relates a task and a document
    • Accounts for quality, complexity, language, …
• Utility relates a user and a document
    • Accounts for prior knowledge
• We seek utility, but relevance is what we get!

“Bag of Terms” Representation

Bag = a “set” that can contain duplicates
“The quick brown fox jumped over the lazy dog’s back” → {back, brown, dog, fox, jump, lazy, over, quick, the, the}

Vector = values recorded in any consistent order
{back, brown, dog, fox, jump, lazy, over, quick, the, the} → [1 1 1 1 1 1 1 1 2]
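
The bag-to-vector step can be sketched in a few lines of Python. This is a minimal illustration, assuming the terms have already been normalized the way the slide shows ("jumped" becomes "jump", "dog's" becomes "dog"); the normalizer itself is not shown, and alphabetical order is just one possible "consistent order".

```python
from collections import Counter

# Terms as they appear in the slide's bag, after normalization (assumed done elsewhere).
terms = ["the", "quick", "brown", "fox", "jump", "over", "the", "lazy", "dog", "back"]

bag = Counter(terms)                    # a bag: each term mapped to its count
vocabulary = sorted(bag)                # any consistent order works; alphabetical here
vector = [bag[t] for t in vocabulary]   # one count per vocabulary entry

print(vocabulary)
print(vector)
```

With alphabetical ordering, "the" lands in the last position, which is why the only 2 in the slide's vector is at the end.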

Bag of Terms Example

Boolean “Free Text” Retrieval

Limit the bag of words to “absent” and “present”
“Boolean” values, represented as 0 and 1
Represent terms as a “bag of documents”
Same representation, but rows rather than columns
Combine the rows using “Boolean operators”
AND, OR, NOT
Result set: every document with a 1 remaining
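
As a rough sketch of the steps above, the following Python inverts a small presence/absence collection into one "bag of documents" per term and combines those rows with set operations. The four documents are hypothetical and are not the collection from the lecture's example table.

```python
# Hypothetical toy collection: doc id -> set of terms present (presence only, no counts).
docs = {
    1: {"quick", "brown", "fox"},
    2: {"lazy", "dog"},
    3: {"dog", "fox"},
    4: {"quick", "dog"},
}

# Invert the collection: each term becomes a "bag of documents" (one row of the matrix).
postings = {}
for doc_id, terms in docs.items():
    for term in terms:
        postings.setdefault(term, set()).add(doc_id)

all_docs = set(docs)

def AND(a, b):
    return a & b

def OR(a, b):
    return a | b

def NOT(a):
    return all_docs - a

dog, fox = postings["dog"], postings["fox"]
dog_and_fox = sorted(AND(dog, fox))
dog_not_fox = sorted(AND(dog, NOT(fox)))   # "dog NOT fox" read as dog AND (NOT fox)
dog_or_fox = sorted(OR(dog, fox))

print(dog_and_fox, dog_not_fox, dog_or_fox)
```

Sets stand in for the 0/1 rows here; a bit vector per term would behave the same way under bitwise AND, OR, and NOT.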

Boolean Operators

Boolean Free Text Example

dog AND fox
Doc 3, Doc 5
dog NOT fox

The Perfect Query Paradox

Every information need has a perfect doc set
If not, there would be no sense doing retrieval
Almost every document set has a perfect query
AND every word to get a query for document 1
Repeat for each document in the set
OR every document query to get the set query
But users find Boolean query formulation hard
They get too much, too little, useless stuff, …
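
The construction described above can be written out mechanically. The sketch below builds the query string for a chosen document set; the function name `perfect_query` and the toy documents are hypothetical. It also hints at why the slide says "almost": a document outside the set that happens to contain every word of a wanted document would still match.

```python
def perfect_query(doc_terms_by_id, wanted_ids):
    """AND every word of each wanted document, then OR the per-document queries."""
    per_doc = []
    for doc_id in wanted_ids:
        terms = sorted(doc_terms_by_id[doc_id])
        per_doc.append("(" + " AND ".join(terms) + ")")
    return " OR ".join(per_doc)

# Hypothetical collection: doc id -> set of terms.
docs = {1: {"quick", "fox"}, 2: {"lazy", "dog"}, 3: {"dog", "fox"}}

query = perfect_query(docs, [1, 3])
print(query)  # (fox AND quick) OR (dog AND fox)
```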

Why Boolean Retrieval Fails

Natural language is way more complex
She saw the man on the hill with a telescope
AND “discovers” nonexistent relationships
Terms in different paragraphs, chapters, …
Guessing terminology for OR is hard
good, nice, excellent, outstanding, awesome, …
Guessing terms to exclude is even harder!
Democratic party, party to a lawsuit, …

Proximity Operators

• More precise versions of AND
    • “NEAR n” allows at most n-1 intervening terms
    • “WITH” requires terms to be adjacent and in order
• Easy to implement, but less efficient
    • Store a list of positions for each word in each doc
    • Stopwords become very important!
    • Perform normal Boolean computations
    • Treat WITH and NEAR like AND with an extra constraint
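
A minimal positional-index sketch of NEAR and WITH follows. The two documents are hypothetical, the helper names `near` and `with_` are made up for illustration, and NEAR is assumed to ignore term order (the slide only states the distance limit). Stopwords are kept because they occupy positions.

```python
# Hypothetical documents; positions are 0-based token offsets, stopwords included.
docs = {
    1: "the quick brown fox jumped over the lazy dog".split(),
    2: "now is the time for all good men to come".split(),
}

# Positional index: term -> {doc id -> [positions]}
index = {}
for doc_id, tokens in docs.items():
    for pos, term in enumerate(tokens):
        index.setdefault(term, {}).setdefault(doc_id, []).append(pos)

def both(a, b):
    """Plain Boolean AND: documents containing both terms."""
    return index.get(a, {}).keys() & index.get(b, {}).keys()

def near(a, b, n):
    """AND plus a constraint: at most n-1 intervening terms (assumed order-free)."""
    return {d for d in both(a, b)
            if any(abs(pa - pb) <= n
                   for pa in index[a][d] for pb in index[b][d])}

def with_(a, b):
    """AND plus a constraint: b immediately follows a."""
    return {d for d in both(a, b)
            if any(pb - pa == 1
                   for pa in index[a][d] for pb in index[b][d])}

print(both("time", "come"), near("time", "come", 2))
print(near("quick", "fox", 2), with_("quick", "fox"), with_("quick", "brown"))
```

Both operators start from the ordinary AND result and then filter it by position, which is the "extra constraint" reading of the last bullet.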

Proximity Operator Example

time AND come → Doc 2
time (NEAR 2) come → Empty
quick (NEAR 2) fox → Doc 1
quick WITH fox → Empty

Display them one screen at a time

Advantages of Ranked Retrieval

Closer to the way people think
Some documents are better than others
Enriches browsing behavior
Decide how far down the list to go as you read it
Allows more flexible queries
Long and short queries can produce useful results

Ranked Retrieval Challenges

“Best first” is easy to say but hard to do!
The best we can hope for is to approximate it
Will the user understand the process?
It is hard to use a tool that you don’t understand
Efficiency becomes a concern
Only a problem for long queries, though

Partial-Match Ranking

Form several result sets from one long query
Query for the first set is the AND of all the terms
Then all but the 1st term, all but the 2nd, …
Then all but the first two terms, …
And so on until each single term query is tried
Remove duplicates from subsequent sets
Display the sets in the order they were made
Document rank within a set is arbitrary
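
One way to sketch this procedure in Python is to try term subsets from largest to smallest, AND the terms in each subset, and append only documents not already listed. The collection and the function name `partial_match` are hypothetical, and subsets of the same size are tried in `itertools.combinations` order, which may differ from the order used in the lecture.

```python
from itertools import combinations

# Hypothetical collection: doc id -> set of terms.
docs = {
    1: {"quick", "brown", "fox"},
    2: {"quick", "fox"},
    3: {"brown", "dog"},
    4: {"lazy", "dog"},
}

def partial_match(query_terms, docs):
    ranked, seen = [], set()
    # All terms first, then all but one, ..., down to single-term queries.
    for size in range(len(query_terms), 0, -1):
        for subset in combinations(query_terms, size):
            result = {d for d, terms in docs.items() if set(subset) <= terms}
            # Drop documents already shown; order within a set is arbitrary,
            # so sorting is only for reproducible output.
            new = sorted(result - seen)
            ranked.extend(new)
            seen.update(new)
    return ranked

ranking = partial_match(["quick", "brown", "fox"], docs)
print(ranking)  # [1, 2, 3]
```

The number of subsets grows exponentially with query length, which matches the efficiency concern raised for long queries.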

Partial Match Example

Similarity-Based Queries

Treat the query as if it were a document
Create a query bag-of-words
Find the similarity of each document
Using the coordination measure, for example
Rank order the documents by similarity
Most similar to the query first
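
The steps above can be sketched as follows, taking the coordination measure to be the number of distinct terms a document shares with the query (an assumption about the intended definition). The documents are hypothetical, and ties are broken by document id only to keep the output deterministic.

```python
def coordination(query_terms, doc_terms):
    """Number of distinct terms the query and the document have in common."""
    return len(set(query_terms) & set(doc_terms))

# Hypothetical collection: doc id -> set of terms.
docs = {
    1: {"quick", "brown", "fox"},
    2: {"lazy", "dog"},
    3: {"quick", "brown", "dog"},
    4: {"time", "come"},
}

query = "quick brown dog".split()   # the query is treated like a tiny document

scores = {d: coordination(query, terms) for d, terms in docs.items()}
ranking = sorted(docs, key=lambda d: (-scores[d], d))   # most similar first

print(scores)
print(ranking)  # [3, 1, 2, 4]
```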

Documents tell us about terms
“the” is in every document -- not discriminating
Documents are most likely described well by rare terms that occur in them frequently

Higher “term frequency” is stronger evidence
Low “collection frequency” makes it stronger still

The Document Length Effect

Humans look for documents with useful parts
But probabilities are computed for the whole
Document lengths vary in many collections
So probability calculations could be inconsistent
Two strategies
Adjust probability estimates for document length
Divide the documents into equal “passages”

Incorporating Term Frequency

High term frequency is evidence of meaning
And high IDF is evidence of term importance

Recompute the bag-of-words
Compute TF * IDF for every element
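
A small sketch of recomputing the bag-of-words with TF * IDF weights is given below. The slide does not spell out the IDF formula in this excerpt, so the common form log(N / df) is assumed, where N is the number of documents and df is the number of documents containing the term. The three documents are hypothetical.

```python
import math
from collections import Counter

# Hypothetical collection: doc id -> token list.
docs = {
    1: "the quick brown fox".split(),
    2: "the lazy dog the dog".split(),
    3: "the fox and the dog".split(),
}

N = len(docs)
tf = {d: Counter(tokens) for d, tokens in docs.items()}            # term frequency per doc
df = Counter(term for counts in tf.values() for term in counts)    # docs containing each term
idf = {term: math.log(N / n) for term, n in df.items()}            # assumed IDF: log(N / df)

# Replace every count in every bag with TF * IDF.
weights = {d: {t: c * idf[t] for t, c in counts.items()}
           for d, counts in tf.items()}

for d in sorted(weights):
    print(d, {t: round(w, 3) for t, w in sorted(weights[d].items())})
```

Under this assumed formula, "the" appears in every document and receives weight zero, while a term that occurs in only one document gets the largest IDF.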

Weighted Matching Schemes

• Unweighted queries
    • Add up the weights for every matching term
• User-specified query term weights
    • For each term, multiply the query and doc weights
    • Then add up those values
• Automatically computed query term weights
    • Most queries lack useful TF, but IDF may be useful
    • Used just like user-specified query term weights
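
All three schemes reduce to the same sum of products once an unweighted query is treated as having weight 1 on every term. The sketch below illustrates that reading; the document weights, the IDF values, and the function name `score` are hypothetical.

```python
def score(query_weights, doc_weights):
    """Sum of (query weight * document weight) over terms present in both."""
    return sum(qw * doc_weights[t]
               for t, qw in query_weights.items() if t in doc_weights)

# Hypothetical TF*IDF weights for one document.
doc = {"fox": 0.4, "quick": 1.1, "brown": 1.1}

# Unweighted query: every query term counts with weight 1.
unweighted = score({"fox": 1, "dog": 1, "quick": 1}, doc)

# User-specified weights: the user says "fox" matters four times as much as "quick".
user_weighted = score({"fox": 2.0, "quick": 0.5}, doc)

# Automatically computed weights: hypothetical IDF values used as query weights.
idf = {"fox": 0.4, "quick": 1.1}
auto_weighted = score(idf, doc)

print(unweighted, user_weighted, auto_weighted)
```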

TF*IDF Example

Document Length Normalization

Long documents have an unfair advantage
They use a lot of terms
So they get more matches than short documents
And they use the same words repeatedly
So they have much higher term frequencies

Every document is most similar to itself
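
The normalization formula itself is not preserved in this excerpt. As an illustration only, the sketch below assumes cosine normalization, a common choice in the vector space model: the sum of products is divided by the lengths of both vectors, so repeating the same words more often no longer raises the score, and a document compared with itself scores 1. The weights are hypothetical.

```python
import math

def cosine(a, b):
    """Inner product of two sparse weight vectors divided by both vector lengths."""
    dot = sum(w * b[t] for t, w in a.items() if t in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def inner(a, b):
    """Unnormalized score, for comparison."""
    return sum(w * b[t] for t, w in a.items() if t in b)

query = {"fox": 1.0, "dog": 1.0}
short_doc = {"fox": 1.0, "dog": 1.0}
long_doc = {"fox": 5.0, "dog": 5.0, "quick": 5.0, "lazy": 5.0}   # more terms, repeated often

print(inner(query, short_doc), inner(query, long_doc))    # the long document wins
print(cosine(query, short_doc), cosine(query, long_doc))  # the short document wins
```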

“Okapi” Term Weights

Passage Retrieval

Another approach to long-document problem
Break it up into coherent units
Recognizing topic boundaries is hard
But overlapping 300 word passages work fine
Document rank is best passage rank
And passage information can help guide browsing
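
A rough sketch of the passage approach: cut each document into overlapping fixed-length windows, score every window, and let the best window stand for the document. The 300-word size comes from the slide; the 150-word step (a 50% overlap), the simple term-count score, and the function names are assumptions made for illustration.

```python
def passages(tokens, size=300, step=150):
    """Overlapping fixed-length windows; step < size produces the overlap."""
    if len(tokens) <= size:
        return [tokens]
    starts = range(0, len(tokens) - size + step, step)
    return [tokens[s:s + size] for s in starts]

def passage_score(passage, query_terms):
    """Hypothetical score: number of query-term occurrences in the passage."""
    return sum(1 for tok in passage if tok in query_terms)

def document_score(tokens, query_terms, size=300, step=150):
    """Document rank is driven by its best passage; also return that passage's index."""
    scored = [(passage_score(p, query_terms), i)
              for i, p in enumerate(passages(tokens, size, step))]
    best_score, best_index = max(scored, key=lambda x: (x[0], -x[1]))
    return best_score, best_index

# Small windows so the behavior is easy to follow by hand.
tokens = ["filler"] * 10 + ["fox", "dog"] + ["filler"] * 10
result = document_score(tokens, {"fox", "dog"}, size=4, step=2)
print(result)  # (2, 4)
```

Returning the index of the best passage is what lets an interface point the reader at the useful part of a long document.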

Summary

Before You Go!

On a sheet of paper, please briefly answer the following question (no names):

What was the muddiest point in today’s lecture?