Understanding the Vector Space Model for Information Retrieval - Prof. Douglas W. Oard

This document, from LBSC 796/CMSC828o Session 3, held on February 9, 2004, by Douglas W. Oard, discusses the vector space model for information retrieval. Topics include thinking about search, design strategies, decomposing the search component, Boolean 'free text' retrieval, the 'bag of terms' representation, proximity operators, ranked retrieval, and passage retrieval. It also explains how machines and humans can work together in the search process, the concept of relevance, and the strengths and weaknesses of Boolean retrieval.


The Vector Space Model

LBSC 796/CMSC828o

Session 3, February 9, 2004

Douglas W. Oard

Agenda

  • Thinking about search
  • Design strategies
  • Decomposing the search component
  • Boolean “free text” retrieval
  • The “bag of terms” representation
  • Proximity operators
  • Ranked retrieval
  • Vector space model
  • Passage retrieval

Supporting the Search Process

Design Strategies

  • Foster human-machine synergy
      • Exploit complementary strengths
      • Accommodate shared weaknesses
  • Divide-and-conquer
      • Divide task into stages with well-defined interfaces
      • Continue dividing until problems are easily solved
  • Co-design related components
      • Iterative process of joint optimization

Human-Machine Synergy

  • Machines are good at:
      • Doing simple things accurately and quickly
      • Scaling to larger collections in sublinear time
  • People are better at:
      • Accurately recognizing what they are looking for
      • Evaluating intangibles such as “quality”
  • Humans start with an information need
  • Machines start with a query
  • Humans match documents to information needs
  • Machines match document & query representations

Search Component Model

Relevance

  • Relevance relates a topic and a document
      • Duplicates are equally relevant, by definition
      • Constant over time and across users
  • Pertinence relates a task and a document
      • Accounts for quality, complexity, language, …
  • Utility relates a user and a document
      • Accounts for prior knowledge
  • We seek utility, but relevance is what we get!

“Bag of Terms” Representation

  • Bag = a “set” that can contain duplicates
      • “The quick brown fox jumped over the lazy dog’s back” → {back, brown, dog, fox, jump, lazy, over, quick, the, the}
  • Vector = values recorded in any consistent order
      • {back, brown, dog, fox, jump, lazy, over, quick, the, the} → [1 1 1 1 1 1 1 1 2]
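
The bag-to-vector step above can be sketched in a few lines of Python. This is only an illustration: the token list is assumed to be already lowercased and stemmed the way the slide shows ("jumped" to "jump", "dog's" to "dog"), and alphabetical order is just one possible "consistent order".

```python
from collections import Counter

# Assumed pre-normalized tokens for the slide's sentence (stemming and
# lowercasing are taken as already done; they are not implemented here).
tokens = ["the", "quick", "brown", "fox", "jump", "over", "the", "lazy", "dog", "back"]

# A bag is a multiset: each term maps to the number of times it occurs.
bag = Counter(tokens)

# Pick one consistent term order (alphabetical here) and read off the counts.
vocabulary = sorted(bag)
vector = [bag[term] for term in vocabulary]

print(vocabulary)  # ['back', 'brown', 'dog', 'fox', 'jump', 'lazy', 'over', 'quick', 'the']
print(vector)      # [1, 1, 1, 1, 1, 1, 1, 1, 2]
```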

Bag of Terms Example

Boolean “Free Text” Retrieval

  • Limit the bag of words to “absent” and “present”
      • “Boolean” values, represented as 0 and 1
  • Represent terms as a “bag of documents”
      • Same representation, but rows rather than columns
  • Combine the rows using “Boolean operators”
      • AND, OR, NOT
  • Result set: every document with a 1 remaining
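
As a rough sketch of the row-combining idea above, the following Python builds one 0/1 row per term and applies the operators element by element. The three-document collection and the helper names (`bool_and`, `bool_or`, `bool_not`, `result_set`) are hypothetical and chosen only for illustration.

```python
# Hypothetical collection; documents are numbered from 1 in the results.
docs = ["quick brown fox", "lazy dog", "quick dog chased fox"]
doc_terms = [set(text.split()) for text in docs]
vocabulary = sorted(set().union(*doc_terms))

# One row per term, one column per document: 1 = present, 0 = absent.
rows = {term: [int(term in terms) for terms in doc_terms] for term in vocabulary}

def bool_and(a, b):
    return [x & y for x, y in zip(a, b)]

def bool_or(a, b):
    return [x | y for x, y in zip(a, b)]

def bool_not(a):
    return [1 - x for x in a]

def result_set(row):
    """Every document with a 1 remaining."""
    return [i + 1 for i, bit in enumerate(row) if bit]

print(result_set(bool_and(rows["dog"], rows["fox"])))            # [3]
print(result_set(bool_or(rows["dog"], rows["fox"])))             # [1, 2, 3]
print(result_set(bool_and(rows["dog"], bool_not(rows["fox"]))))  # [2]
```

The last line treats "dog NOT fox" as "dog AND (NOT fox)", which is an assumption about how the binary NOT is meant to be read.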

Boolean Operators

Boolean Free Text Example

  • dog AND fox
      • Doc 3, Doc 5
  • dog NOT fox

The Perfect Query Paradox

  • Every information need has a perfect doc set
      • If not, there would be no sense doing retrieval
  • Almost every document set has a perfect query
      • AND every word to get a query for document 1
      • Repeat for each document in the set
      • OR every document query to get the set query
  • But users find Boolean query formulation hard
      • They get too much, too little, useless stuff, …
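
The construction described above (AND every word of each document, then OR the per-document queries) might be sketched as follows. The collection and the function names are hypothetical; each AND clause is represented simply as the set of terms it requires.

```python
# Hypothetical collection: each document reduced to its set of terms.
docs = {
    1: {"quick", "fox"},
    2: {"lazy", "dog"},
    3: {"quick", "lazy", "fox"},
}

def document_query(doc_id):
    """AND of every word in one document, stored as the set of required terms."""
    return frozenset(docs[doc_id])

def set_query(doc_ids):
    """OR of the per-document queries, stored as a list of AND clauses."""
    return [document_query(d) for d in sorted(doc_ids)]

def evaluate(query):
    """A document matches if it satisfies at least one AND clause."""
    return {d for d, terms in docs.items() if any(clause <= terms for clause in query)}

print(sorted(evaluate(set_query({2}))))  # [2]
print(sorted(evaluate(set_query({1}))))  # [1, 3]
```

In this toy collection the query built for document 1 also retrieves document 3, because document 3 contains every word of document 1. That may be one reason the slide says "almost" every document set has a perfect query, though the slide itself does not spell this out.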

Why Boolean Retrieval Fails

  • Natural language is way more complex
      • She saw the man on the hill with a telescope
  • AND “discovers” nonexistent relationships
      • Terms in different paragraphs, chapters, …
  • Guessing terminology for OR is hard
      • good, nice, excellent, outstanding, awesome, …
  • Guessing terms to exclude is even harder!
      • Democratic party, party to a lawsuit, …

Proximity Operators

  • More precise versions of AND
      • “NEAR n” allows at most n-1 intervening terms
      • “WITH” requires terms to be adjacent and in order
  • Easy to implement, but less efficient
      • Store a list of positions for each word in each doc
      • Stopwords become very important!
      • Perform normal Boolean computations
      • Treat WITH and NEAR like AND with an extra constraint

Proximity Operator Example

  • time AND come
      • Doc 2
  • time (NEAR 2) come
      • Empty
  • quick (NEAR 2) fox
      • Doc 1
  • quick WITH fox
      • Empty
  • Display them one screen at a time

Advantages of Ranked Retrieval

  • Closer to the way people think
      • Some documents are better than others
  • Enriches browsing behavior
      • Decide how far down the list to go as you read it
  • Allows more flexible queries
      • Long and short queries can produce useful results

Ranked Retrieval Challenges

  • “Best first” is easy to say but hard to do!
      • The best we can hope for is to approximate it
  • Will the user understand the process?
      • It is hard to use a tool that you don’t understand
  • Efficiency becomes a concern
      • Only a problem for long queries, though

Partial-Match Ranking

  • Form several result sets from one long query
      • Query for the first set is the AND of all the terms
      • Then all but the 1st term, all but the 2nd, …
      • Then all but the first two terms, …
      • And so on until each single term query is tried
  • Remove duplicates from subsequent sets
  • Display the sets in the order they were made
      • Document rank within a set is arbitrary
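
One way to sketch the procedure above is to enumerate which query terms are dropped: none first, then each single term, then each pair, and so on. The postings below (term to set of document IDs) and the function name are hypothetical; sorting inside a set is only there to make the output repeatable, since rank within a set is arbitrary.

```python
from itertools import combinations

def partial_match_ranking(query_terms, postings):
    """Return (sub-query, new documents) pairs in the order the sets are made."""
    ranked, seen = [], set()
    n = len(query_terms)
    for n_dropped in range(n):  # drop 0 terms, then 1, ... down to single-term queries
        for dropped in combinations(range(n), n_dropped):
            subset = tuple(t for i, t in enumerate(query_terms) if i not in dropped)
            # AND of the remaining terms = intersection of their document sets.
            matches = set.intersection(*(postings.get(t, set()) for t in subset))
            new_docs = sorted(matches - seen)  # remove duplicates from later sets
            if new_docs:
                ranked.append((subset, new_docs))
            seen |= matches
    return ranked

# Hypothetical postings: term -> set of document IDs containing it.
postings = {"quick": {1, 3}, "brown": {1}, "fox": {1, 3, 4}}

for subset, doc_ids in partial_match_ranking(["quick", "brown", "fox"], postings):
    print(" AND ".join(subset), "->", doc_ids)
# quick AND brown AND fox -> [1]
# quick AND fox -> [3]
# fox -> [4]
```

Note that the number of sub-queries grows exponentially with query length, which is consistent with the efficiency concern raised for long queries.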

Partial Match Example

Similarity-Based Queries

  • Treat the query as if it were a document
      • Create a query bag-of-words
  • Find the similarity of each document
      • Using the coordination measure, for example
  • Rank order the documents by similarity
      • Most similar to the query first
  • Documents tell us about terms
      • “the” is in every document -- not discriminating
  • Documents are most likely described well by rare terms that occur in them frequently
      • Higher “term frequency” is stronger evidence
      • Low “collection frequency” makes it stronger still
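
A small sketch of ranking by the coordination measure, taken here to mean the number of distinct terms a document shares with the query (the slide names the measure but does not define it, so that reading is an assumption). The documents and query are hypothetical.

```python
def coordination(query_terms, doc_terms):
    """Number of distinct terms the query and the document have in common."""
    return len(set(query_terms) & set(doc_terms))

# Hypothetical collection and query, treated as bags of words.
docs = {
    1: "quick brown fox",
    2: "lazy dog",
    3: "quick brown dog barked",
}
query = "quick brown dog".split()

scores = {doc_id: coordination(query, text.split()) for doc_id, text in docs.items()}

# Most similar to the query first.
ranking = sorted(scores, key=scores.get, reverse=True)

print(scores)   # {1: 2, 2: 1, 3: 3}
print(ranking)  # [3, 1, 2]
```

Every shared term counts equally here, so a match on "the" would be worth as much as a match on a rare term, which is the weakness the term-weighting bullets above point at.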

The Document Length Effect

  • Humans look for documents with useful parts
      • But probabilities are computed for the whole
  • Document lengths vary in many collections
      • So probability calculations could be inconsistent
  • Two strategies
      • Adjust probability estimates for document length
      • Divide the documents into equal “passages”

Incorporating Term Frequency

  • High term frequency (TF) is evidence of meaning
  • And high inverse document frequency (IDF) is evidence of term importance
  • Recompute the bag-of-words
      • Compute TF * IDF for every element
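
The recomputation above can be sketched as follows. The slide text does not give an IDF formula, so this assumes the common form IDF = log(N / DF), where N is the number of documents and DF is the number of documents containing the term; the tokenized documents are hypothetical.

```python
import math
from collections import Counter

# Hypothetical tokenized documents.
docs = {
    1: ["the", "quick", "fox"],
    2: ["the", "lazy", "dog", "dog"],
    3: ["the", "fox", "fox"],
}

n_docs = len(docs)

# Document frequency: in how many documents does each term appear?
df = Counter(term for terms in docs.values() for term in set(terms))

# Assumed IDF form: log(N / DF). A term found in every document gets 0.
idf = {term: math.log(n_docs / df[term]) for term in df}

# Replace each raw count in the bag with TF * IDF.
weights = {
    doc_id: {term: tf * idf[term] for term, tf in Counter(terms).items()}
    for doc_id, terms in docs.items()
}

for doc_id, w in weights.items():
    print(doc_id, {term: round(value, 3) for term, value in w.items()})
```

With these assumptions "the" gets weight 0 everywhere, while "dog" in document 2 gets the largest weight because it is both frequent in that document and rare in the collection.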

Weighted Matching Schemes

  • Unweighted queries
      • Add up the weights for every matching term
  • User-specified query term weights
      • For each term, multiply the query and doc weights
      • Then add up those values
  • Automatically computed query term weights
      • Most queries lack useful TF, but IDF may be useful
      • Used just like user-specified query term weights
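
The matching schemes above all reduce to the same computation if an unweighted query is treated as one with weight 1 on every term. A minimal sketch, with hypothetical document weights standing in for TF*IDF values:

```python
def score(query_weights, doc_weights):
    """Multiply query and document weights term by term, then add them up."""
    return sum(qw * doc_weights.get(term, 0.0) for term, qw in query_weights.items())

# Hypothetical document term weights (for example, TF*IDF values).
doc = {"fox": 2.0, "quick": 1.0, "dog": 3.0}

# Unweighted query: every query term has weight 1, so the score is just
# the sum of the document weights for the matching terms.
unweighted = {"fox": 1.0, "dog": 1.0}

# User-specified weights: the user says "fox" matters more than "dog".
# Automatically computed weights (such as IDF) would be used the same way.
user_weighted = {"fox": 2.0, "dog": 0.5}

print(score(unweighted, doc))     # 5.0
print(score(user_weighted, doc))  # 5.5
```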

TF*IDF Example

Document Length Normalization

  • Long documents have an unfair advantage
      • They use a lot of terms
          • So they get more matches than short documents
      • And they use the same words repeatedly
          • So they have much higher term frequencies
  • Every document is most similar to itself
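
One common way to remove the length advantage is cosine normalization: divide each document's weights by the Euclidean length of its weight vector. The slide text does not name a specific scheme, so treat this as an assumed example; the weights and function names are hypothetical.

```python
import math

def normalize(weights):
    """Scale a term-weight vector to unit Euclidean length."""
    length = math.sqrt(sum(w * w for w in weights.values()))
    if length == 0:
        return dict(weights)
    return {term: w / length for term, w in weights.items()}

def inner_product(a, b):
    return sum(w * b.get(term, 0.0) for term, w in a.items())

# Hypothetical term weights: the long document repeats the query terms more often.
short_doc = {"fox": 1.0, "dog": 1.0}
long_doc = {"fox": 3.0, "dog": 3.0, "quick": 4.0, "lazy": 2.0}
query = {"fox": 1.0, "dog": 1.0}

raw_short = inner_product(query, short_doc)
raw_long = inner_product(query, long_doc)
norm_short = inner_product(query, normalize(short_doc))
norm_long = inner_product(query, normalize(long_doc))

print(raw_short, raw_long)                        # 2.0 6.0
print(round(norm_short, 3), round(norm_long, 3))  # 1.414 0.973
```

Before normalization the long document wins on raw term frequency alone; afterwards the short, focused document ranks higher. With unit-length vectors, the inner product of a document with itself is 1 and no other unit vector can score higher against it, which matches the observation that every document is most similar to itself.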

“Okapi” Term Weights

Passage Retrieval

  • Another approach to the long-document problem
      • Break it up into coherent units
  • Recognizing topic boundaries is hard
      • But overlapping 300-word passages work fine
  • Document rank is best passage rank
      • And passage information can help guide browsing
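
A sketch of the overlapping-passage approach above. The 300-word passage size comes from the slide; the 150-word step (50% overlap) and the toy scoring function that counts query-term occurrences are assumptions made for illustration.

```python
def passages(tokens, size=300, step=150):
    """Split a token list into overlapping fixed-size passages."""
    if len(tokens) <= size:
        return [tokens]
    starts = range(0, len(tokens) - size + step, step)
    return [tokens[s:s + size] for s in starts]

def passage_score(passage, query_terms):
    """Toy score (an assumption): number of query-term occurrences in the passage."""
    return sum(1 for token in passage if token in query_terms)

def document_score(tokens, query_terms, size=300, step=150):
    """Score a document by its best passage; also return that passage's index."""
    scored = [passage_score(p, query_terms) for p in passages(tokens, size, step)]
    best = max(range(len(scored)), key=scored.__getitem__)
    return scored[best], best

# Tiny example with 4-word passages so the behavior is easy to follow.
tokens = "fox x x x x x fox fox x x".split()
print(document_score(tokens, {"fox"}, size=4, step=2))  # (2, 2)
```

Returning the index of the best passage alongside the score is one way the passage information could be used to guide browsing, for example by jumping straight to that part of the document.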

Summary

  • Goal: find documents most similar to the query
  • Compute normalized document term weights
      • Some combination of TF, DF, and Length
  • Optionally, get query term weights from the user
      • Estimate of term importance
  • Compute inner product of query and doc vectors
      • Multiply corresponding elements and then add

Before You Go!

On a sheet of paper, please briefly answer the following question (no names):

What was the muddiest point in today’s lecture?