




An overview of information retrieval, focusing on models used, boolean retrieval, and the vector space model. It covers concepts such as models, query formulation, and document representation. The document also discusses the limitations of boolean retrieval and the advantages of the vector space model.
Typology: Study notes





Boolean and Vector Space Models
Muddy Points
| Statistics, significance tests
| Precision-recall curve, interpolation
| MAP
| Math, math, and more math!
| Reading the book
The Information Retrieval Cycle
[Figure: the information retrieval cycle. Labels: source reselection; system discovery, vocabulary discovery, concept discovery, document discovery]
What is a model?
| A model is a construct designed to help us understand a complex system
| Models inevitably make simplifying assumptions
| Different types of models:
The Central Problem in IR
The IR Black Box
Today’s Topics
| Boolean model
| Vector space model
Next Time…
| Language models
Representing Text
How do we represent text?
| How do we represent the complexities of language?
| Simple, yet effective approach: “bag of words”
Sample Document
Fast-food chain to reduce certain types of fat in its french fries with new cooking oil. NEW YORK (CNN/Money) - McDonald's Corp. is cutting the amount of "bad" fat in its french fries nearly in half, the fast-food chain said Tuesday as it moves to make all its fried menu items healthier. But does that mean the popular shoestring fries won't taste the same? The company says no. "It's a win-win for our customers because they are getting the same great french-fry taste along with an even healthier nutrition profile," said Mike Roberts, president of McDonald's USA. But others are not so sure. McDonald's will not specifically discuss the kind of oil it plans to use, but at least one nutrition expert says playing with the formula could mean a different taste. Shares of Oak Brook, Ill.-based McDonald's (MCD: down $0.54 to $23.22, Research, Estimates) were lower Tuesday afternoon. It was unclear Tuesday whether competitors Burger King and Wendy's International (WEN: down $0.80 to $34.91, Research, Estimates) would follow suit. Neither company could immediately be reached for comment. …
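A minimal sketch of the bag-of-words idea, applied to one sentence of the sample document above. The tokenizer (lowercasing, keeping runs of letters with internal apostrophes or hyphens) and the function name `bag_of_words` are assumptions made for illustration; the notes do not specify a tokenization scheme.

```python
import re
from collections import Counter


def bag_of_words(text: str) -> Counter:
    """Map each lowercased token to its count, discarding word order."""
    # Assumed tokenizer: runs of letters, optionally joined by ' or -.
    tokens = re.findall(r"[a-z]+(?:['-][a-z]+)*", text.lower())
    return Counter(tokens)


excerpt = (
    "McDonald's Corp. is cutting the amount of \"bad\" fat in its french "
    "fries nearly in half, the fast-food chain said Tuesday as it moves to "
    "make all its fried menu items healthier."
)

bag = bag_of_words(excerpt)
print(bag.most_common(5))

# Word order is lost: these two texts have identical representations.
print(bag_of_words("dog bites man") == bag_of_words("man bites dog"))
```

The last line shows the simplification the representation makes: two texts with opposite meanings become indistinguishable once order is discarded.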
What’s the point?
| Retrieving relevant information is hard!
| To operationalize information retrieval, we must vastly simplify the picture
| Bag-of-words approach:
Boolean View of a Collection
[Table: term-document incidence matrix. Rows are the terms quick, brown, fox, over, lazy, dog, back, now, time, all, good, men, come, jump, aid, their, party; columns are Doc 1 through Doc 8; a cell is 1 if the term occurs in the document and 0 otherwise]
Sample Queries
Query        Doc 1  Doc 2  Doc 3  Doc 4  Doc 5  Doc 6  Doc 7  Doc 8
fox          0      0      1      0      1      0      1      0
dog          0      0      1      0      1      0      0      0
dog ∧ fox    0      0      1      0      1      0      0      0
dog ∨ fox    0      0      1      0      1      0      1      0
dog ¬ fox    0      0      0      0      0      0      0      0
fox ¬ dog    0      0      0      0      0      0      1      0

Terms: good (g), party (p), over (o)

Query        Doc 1  Doc 2  Doc 3  Doc 4  Doc 5  Doc 6  Doc 7  Doc 8
over         1      0      1      0      1      0      1      1
g ∧ p        0      0      0      0      0      1      0      1
g ∧ p ¬ o    0      0      0      0      0      1      0      0
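The sample queries can be sketched as element-wise operations on 0/1 incidence vectors, one entry per document. In this sketch the `over` vector and the `good ∧ party` result are copied from the rows above, while the `fox` and `dog` vectors are inferred from the four dog/fox query results rather than read off directly. The helper names are hypothetical.

```python
from typing import List

Vector = List[int]  # one 0/1 entry per document, Doc 1 .. Doc 8

# `fox` and `dog` are inferred from the query results (an assumption);
# `over` and `good_and_party` are copied from the sample query rows.
fox: Vector = [0, 0, 1, 0, 1, 0, 1, 0]
dog: Vector = [0, 0, 1, 0, 1, 0, 0, 0]
over: Vector = [1, 0, 1, 0, 1, 0, 1, 1]
good_and_party: Vector = [0, 0, 0, 0, 0, 1, 0, 1]


def bool_and(a: Vector, b: Vector) -> Vector:
    return [x & y for x, y in zip(a, b)]


def bool_or(a: Vector, b: Vector) -> Vector:
    return [x | y for x, y in zip(a, b)]


def bool_not(a: Vector) -> Vector:
    return [1 - x for x in a]


def matching_docs(v: Vector) -> List[int]:
    """Return 1-based document numbers whose bit is set."""
    return [i + 1 for i, bit in enumerate(v) if bit]


print(matching_docs(bool_and(dog, fox)))                       # [3, 5]
print(matching_docs(bool_or(dog, fox)))                        # [3, 5, 7]
print(matching_docs(bool_and(fox, bool_not(dog))))             # [7]
print(matching_docs(bool_and(good_and_party, bool_not(over)))) # [6]
```

Every document either matches or does not; nothing in the result says which match is better, which is the gap ranked retrieval addresses.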
Proximity Operators
| More “precise” versions of AND
| Relatively easy to implement, but less efficient
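One way to sketch a proximity operator is with a positional index that records where each term occurs. The two documents, the whitespace tokenizer, and the names `build_index` and `near` are all hypothetical; real systems differ in how they define and evaluate proximity.

```python
from collections import defaultdict
from typing import Dict, List

# term -> document id -> list of word positions
PositionalIndex = Dict[str, Dict[int, List[int]]]


def build_index(docs: Dict[int, str]) -> PositionalIndex:
    index: PositionalIndex = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.lower().split()):
            index[term].setdefault(doc_id, []).append(pos)
    return index


def near(index: PositionalIndex, a: str, b: str, k: int) -> List[int]:
    """Documents in which a and b occur within k word positions."""
    postings_a = index.get(a, {})
    postings_b = index.get(b, {})
    hits = []
    # Only documents containing both terms can match (the plain AND step).
    for doc_id in sorted(postings_a.keys() & postings_b.keys()):
        # Naive pairwise position check: simple, but extra work beyond AND.
        if any(abs(pa - pb) <= k
               for pa in postings_a[doc_id]
               for pb in postings_b[doc_id]):
            hits.append(doc_id)
    return hits


# Hypothetical documents, made up for this sketch.
docs = {
    1: "the quick brown fox jumped over the lazy dog",
    2: "the dog slept while far away a quick red fox ran",
}
index = build_index(docs)

print(near(index, "fox", "dog", 5))   # [1]
print(near(index, "fox", "dog", 10))  # [1, 2]
```

Both documents satisfy `fox AND dog`, but only the first satisfies the tighter proximity constraint. The position comparison is the added cost relative to plain AND.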
Proximity Operator Example
[Table: the terms quick, brown, fox, over, lazy, dog, back, now, time, all, good, men, come, jump, aid, their, party against Doc 1 and Doc 2]
Other Extensions
| Ability to search on fields
| Wildcards
| Special treatment of dates, names, companies, etc.
WESTLAW® Query Examples
| LIMIT! /3 STATUTE ACTION /S FEDERAL /2 TORT /3 CLAIM
| (741 +3 824) FACTOR ELEMENT STATUS FACT /P VESSEL SHIP BOAT /P (46 +3 688) “JONES ACT” /P INJUR! /S SEAMAN CREWMAN WORKER
| NOT NEGLECT! FAIL! NEGLIG! /5 MAINT! REPAIR! /P NAVIGAT! /5 AID EQUIP! LIGHT BUOY “CHANNEL MARKER”
| EXCUS! /3 DELAY /P (LIMIT! /3 STATUTE ACTION) LACHES /P “JONES ACT” “DEATH ON THE HIGH SEAS ACT” (46 +3 761)
Why Boolean Retrieval Works
| Boolean operators approximate natural language
| AND can discover relationships between concepts
| OR can discover alternate terminology
| NOT can discover alternate meanings
The Perfect Query Paradox
| Every information need has a perfect set of documents
| Every document set has a perfect query
| But can users realistically be expected to formulate this perfect query?
Why Boolean Retrieval Fails
| Natural language is way more complex
| AND “discovers” nonexistent relationships
| Guessing terminology for OR is hard
| Guessing terms to exclude is even harder!
Strengths and Weaknesses
| Strengths
| Weaknesses
Ranked Retrieval
| Order documents by how likely they are to be relevant to the information need
| Attempts to retrieve relevant documents directly, not merely provide tools for doing so
Why Ranked Retrieval?
| Arranging documents by relevance is
| Best (partial) match: documents need not have all query terms
| Easier said than done!
How do we weight doc terms?
| Here’s the intuition:
| How do we capture this mathematically?
TF.IDF Term Weighting
| Simple, yet effective!
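$$w_{i,j} = \mathrm{tf}_{i,j} \cdot \log \frac{N}{n_i}$$
$w_{i,j}$: weight assigned to term $i$ in document $j$
$\mathrm{tf}_{i,j}$: number of occurrences of term $i$ in document $j$
$N$: number of documents in the entire collection
$n_i$: number of documents containing term $i$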
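The TF.IDF weighting (term frequency times the log of N over n_i) can be sketched in a few lines. The three-document collection below is hypothetical, and base-10 logarithms are an assumption, since the notes do not state the base.

```python
import math
from collections import Counter
from typing import Dict, List


def tf_idf(docs: List[List[str]]) -> List[Dict[str, float]]:
    """Return one {term: weight} dict per document."""
    n_docs = len(docs)               # N: documents in the collection
    doc_freq: Counter = Counter()    # n_i: documents containing term i
    for doc in docs:
        doc_freq.update(set(doc))

    weights = []
    for doc in docs:
        tf = Counter(doc)            # tf_ij: occurrences of term i in doc j
        weights.append({
            # Base-10 log is an assumption of this sketch.
            term: count * math.log10(n_docs / doc_freq[term])
            for term, count in tf.items()
        })
    return weights


# Hypothetical toy collection, made up for this sketch.
collection = [
    ["nuclear", "fallout", "nuclear", "information"],
    ["information", "retrieval"],
    ["information", "retrieval", "retrieval"],
]

for doc_weights in tf_idf(collection):
    print({term: round(w, 3) for term, w in doc_weights.items()})
```

In this toy collection "information" appears in every document, so its weight is zero everywhere: a term that occurs everywhere does not help distinguish documents.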
TF.IDF Example
[Table: tf, idf, and $w_{i,j}$ values for the terms nuclear, fallout, siberia, contaminated, interesting, complicated, information, retrieval]
Normalizing Document Vectors
| Recall our similarity function:
| Normalize document vectors in advance
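$$\mathrm{sim}(\vec{d}_j, \vec{d}_k) = \frac{\vec{d}_j \cdot \vec{d}_k}{|\vec{d}_j|\,|\vec{d}_k|} = \frac{\sum_{i=1}^{n} w_{i,j}\, w_{i,k}}{\sqrt{\sum_{i=1}^{n} w_{i,j}^2}\,\sqrt{\sum_{i=1}^{n} w_{i,k}^2}}$$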
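Assuming the similarity function above is the usual cosine similarity, the sketch below shows why normalizing in advance helps: once each vector has unit length, the similarity reduces to a plain dot product. The vectors are hypothetical and are assumed to be nonzero.

```python
import math
from typing import List


def dot(u: List[float], v: List[float]) -> float:
    return sum(a * b for a, b in zip(u, v))


def length(u: List[float]) -> float:
    return math.sqrt(dot(u, u))


def cosine(u: List[float], v: List[float]) -> float:
    """Cosine similarity; assumes neither vector is all zeros."""
    return dot(u, v) / (length(u) * length(v))


def normalize(u: List[float]) -> List[float]:
    """Scale u to unit length; assumes u is not all zeros."""
    norm = length(u)
    return [a / norm for a in u]


# Hypothetical term-weight vectors, made up for this sketch.
d_j = [3.0, 4.0, 0.0]
d_k = [4.0, 3.0, 0.0]

print(cosine(d_j, d_k))
print(dot(normalize(d_j), normalize(d_k)))  # same value, up to rounding
```

The division by vector lengths is paid once per document at indexing time instead of once per document per query.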
Normalization Example
[Table: tf, idf, $w_{i,j}$, and normalized weights $w'_{i,j}$ for the terms nuclear, fallout, siberia, contaminated, interesting, complicated, information, retrieval]
Retrieval Example
[Table: query and normalized document weights $w'_{i,j}$ for the terms nuclear, fallout, siberia, contaminated, interesting, complicated, information, retrieval]
Weighted Retrieval
[Table: weighted query and normalized document weights $w'_{i,j}$ for the terms nuclear, fallout, siberia, contaminated, interesting, complicated, information, retrieval]
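A minimal sketch of ranked retrieval over normalized document vectors, with both an unweighted and a weighted query. All document weights, query weights, and names here are hypothetical; they are not the values from the examples above.

```python
import math
from typing import Dict, List, Tuple

SparseVector = Dict[str, float]  # term -> weight; absent terms weigh 0


def normalize(vec: SparseVector) -> SparseVector:
    """Scale to unit length; assumes at least one nonzero weight."""
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {term: w / norm for term, w in vec.items()}


def rank(query: SparseVector,
         docs: Dict[str, SparseVector]) -> List[Tuple[str, float]]:
    """Score each document by the dot product of the query with the
    normalized document vector; return matches, best first."""
    scored = []
    for doc_id, vec in docs.items():
        unit = normalize(vec)
        score = sum(w * unit.get(term, 0.0) for term, w in query.items())
        if score > 0:
            scored.append((doc_id, score))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical term weights, made up for this sketch.
docs = {
    "doc1": {"nuclear": 3.0, "fallout": 4.0},
    "doc2": {"information": 1.0, "retrieval": 1.0},
    "doc3": {"nuclear": 1.0, "information": 2.0, "retrieval": 2.0},
}

query = {"nuclear": 1.0, "fallout": 1.0}
weighted_query = {"nuclear": 1.0, "retrieval": 3.0}

print(rank(query, docs))
print(rank(weighted_query, docs))
```

For the first query, doc3 is still retrieved although it lacks "fallout", which illustrates best (partial) match. In the second query, the higher weight on "retrieval" moves doc3 and doc2 ahead of doc1.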
What’s the point?
| Information seeking behavior is incredibly complex
| In order to build actual systems, we must make many simplifications
| Know what these limitations are!
Summary
| Boolean retrieval is powerful in the hands of a trained searcher
| Ranked retrieval is preferred in other circumstances
| Key ideas in the vector space model
One Minute Paper
| What was the muddiest point in today’s class?