Understanding Information Retrieval: Models, Boolean Retrieval, and Vector Space Model

These notes give an overview of information retrieval, focusing on the models used, Boolean retrieval, and the vector space model. They cover what a model is, query formulation, and document representation, and they discuss the limitations of Boolean retrieval and the advantages of the vector space model.

LBSC 796/INFM 718R: Week 3

Boolean and Vector Space Models

Jimmy Lin

College of Information Studies

University of Maryland

Monday, February 13, 2006

Muddy Points

• Statistics, significance tests
• Precision-recall curve, interpolation
• MAP
• Math, math, and more math!
• Reading the book

The Information Retrieval Cycle

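[Figure: the information retrieval cycle. Stages: Source Selection → Query Formulation → Search → Selection → Examination → Delivery. Labels on the arrows between stages: Resource, Query, Ranked List, Documents, Documents. Feedback loops: system discovery, vocabulary discovery, concept discovery, document discovery, source reselection.]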

What is a model?

• A model is a construct designed to help us understand a complex system
    ◦ A particular way of “looking at things”
• Models inevitably make simplifying assumptions
    ◦ What are the limitations of the model?
• Different types of models:
    ◦ Conceptual models
    ◦ Physical analog models
    ◦ Mathematical models
    ◦ …

The Central Problem in IR

[Figure: the information seeker's concepts are expressed as query terms; the authors' concepts are expressed as document terms.]

Do these represent the same concepts?

The IR Black Box

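[Figure: the query and the documents each pass through a representation function, producing a query representation and a document representation. The document representation feeds the index. A comparison function matches the query representation against the index and returns hits.]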

Today’s Topics

• Boolean model
    ◦ Based on the notion of sets
    ◦ Documents are retrieved only if they satisfy Boolean conditions specified in the query
    ◦ Does not impose a ranking on retrieved documents
    ◦ Exact match
• Vector space model
    ◦ Based on geometry, the notion of vectors in high dimensional space
    ◦ Documents are ranked based on their similarity to the query (ranked retrieval)
    ◦ Best/partial match

Next Time…

• Language models
    ◦ Based on the notion of probabilities and processes for generating text
    ◦ Documents are ranked based on the probability that they generated the query
    ◦ Best/partial match

Representing Text

[Figure: the IR black box diagram, repeated. Query and documents → representation functions → query representation and document representation → comparison function and index → hits.]

How do we represent text?

• How do we represent the complexities of language?
    ◦ Keeping in mind that computers don’t “understand” documents or queries
• Simple, yet effective approach: “bag of words”
    ◦ Treat all the words in a document as index terms for that document
    ◦ Assign a “weight” to each term based on its “importance”
    ◦ Disregard order, structure, meaning, etc. of the words

What’s a “word”? We’ll return to this in a few lectures…

Sample Document

McDonald's slims down spuds

Fast-food chain to reduce certain types of fat in its french fries with new cooking oil. NEW YORK (CNN/Money) - McDonald's Corp. is cutting the amount of "bad" fat in its french fries nearly in half, the fast-food chain said Tuesday as it moves to make all its fried menu items healthier. But does that mean the popular shoestring fries won't taste the same? The company says no. "It's a win-win for our customers because they are getting the same great french-fry taste along with an even healthier nutrition profile," said Mike Roberts, president of McDonald's USA. But others are not so sure. McDonald's will not specifically discuss the kind of oil it plans to use, but at least one nutrition expert says playing with the formula could mean a different taste. Shares of Oak Brook, Ill.-based McDonald's (MCD: down $0.54 to $23.22, Research, Estimates) were lower Tuesday afternoon. It was unclear Tuesday whether competitors Burger King and Wendy's International (WEN: down $0.80 to $34.91, Research, Estimates) would follow suit. Neither company could immediately be reached for comment. …

“Bag of Words”

16 × said
14 × McDonalds
12 × fat
11 × fries
8 × new
6 × company, french, nutrition
5 × food, oil, percent, reduce, taste, Tuesday
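The counting behind a bag of words can be sketched in a few lines. This is only an illustration: the tokenizer (lowercase, alphabetic runs only) is an assumption, since the slides defer the question of what counts as a word, and the snippet is just the opening of the sample article, so its counts are smaller than the ones listed above.

```python
import re
from collections import Counter

def bag_of_words(text):
    """Lowercase the text, split it into alphabetic tokens, and count them.

    The tokenization rule is an assumption made for this sketch.
    """
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(tokens)

snippet = ("Fast-food chain to reduce certain types of fat in its french fries "
           "with new cooking oil. McDonald's Corp. is cutting the amount of bad "
           "fat in its french fries nearly in half.")
bag = bag_of_words(snippet)

# Print the most frequent terms as "count x term".
for term, count in bag.most_common(5):
    print(count, "x", term)
```

Because only counts are kept, two texts with the same words in a different order produce the same bag, which is exactly the "disregard order" simplification.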

What’s the point?

• Retrieving relevant information is hard!
    ◦ Evolving, ambiguous user needs, context, etc.
    ◦ Complexities of language
• To operationalize information retrieval, we must vastly simplify the picture
• Bag-of-words approach:
    ◦ Information retrieval is all (and only) about matching words in documents with words in queries
    ◦ Obviously, not true…
    ◦ But it works pretty well!

Boolean View of a Collection

[Table: term-document incidence matrix. Rows (terms): quick, brown, fox, over, lazy, dog, back, now, time, all, good, men, come, jump, aid, their, party. Columns: Doc 1 through Doc 8. Each cell is 1 if the document contains the term and 0 otherwise.]

Each column represents the view of a particular document: What terms are contained in this document?

Each row represents the view of a particular term: What documents contain this term?

To execute a query, pick out the rows corresponding to the query terms and then apply the logic table of the corresponding Boolean operator.
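The row and column views can be made concrete with a small sketch. The three documents below are hypothetical and are not the eight documents from the slides. Splitting on whitespace is also an assumption.

```python
def incidence_matrix(docs):
    """Return (terms, matrix), where matrix[t][d] is 1 if term t occurs in document d."""
    terms = sorted({word for text in docs for word in text.split()})
    matrix = [[1 if term in text.split() else 0 for text in docs] for term in terms]
    return terms, matrix

# Hypothetical toy collection.
docs = ["quick brown fox", "lazy dog", "quick dog"]
terms, matrix = incidence_matrix(docs)
row = dict(zip(terms, matrix))

# Row view: which documents contain "quick"?
print(row["quick"])

# Column view: which terms does the third document contain?
column = [bits[2] for bits in matrix]
third_doc_terms = [term for term, bit in zip(terms, column) if bit]
print(third_doc_terms)
```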

Sample Queries

Term         Doc 1  Doc 2  Doc 3  Doc 4  Doc 5  Doc 6  Doc 7  Doc 8
fox            0      0      1      0      1      0      1      0
dog            0      0      1      0      1      0      0      0
dog ∧ fox      0      0      1      0      1      0      0      0
dog ∨ fox      0      0      1      0      1      0      1      0
dog ¬ fox      0      0      0      0      0      0      0      0
fox ¬ dog      0      0      0      0      0      0      1      0

dog AND fox → Doc 3, Doc 5
dog OR fox → Doc 3, Doc 5, Doc 7
dog NOT fox → empty
fox NOT dog → Doc 7

Term         Doc 1  Doc 2  Doc 3  Doc 4  Doc 5  Doc 6  Doc 7  Doc 8
over           1      0      1      0      1      0      1      1
g ∧ p          0      0      0      0      0      1      0      1
g ∧ p ¬ o      0      0      0      0      0      1      0      0

(g = good, p = party, o = over)

good AND party → Doc 6, Doc 8
good AND party NOT over → Doc 6
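The sample queries can be replayed with element-wise logic on the term rows. A minimal sketch: the `fox` and `dog` rows are the ones implied by the query results above, `over` and the combined `good AND party` row are copied from the table, and the function names are made up for this example.

```python
# One bit per document, Doc 1 through Doc 8.
fox            = [0, 0, 1, 0, 1, 0, 1, 0]
dog            = [0, 0, 1, 0, 1, 0, 0, 0]
over           = [1, 0, 1, 0, 1, 0, 1, 1]
good_and_party = [0, 0, 0, 0, 0, 1, 0, 1]

def and_(a, b):
    """Documents that match both rows."""
    return [x & y for x, y in zip(a, b)]

def or_(a, b):
    """Documents that match at least one row."""
    return [x | y for x, y in zip(a, b)]

def and_not(a, b):
    """Documents that match a but not b (the slides' "a NOT b")."""
    return [x & (1 - y) for x, y in zip(a, b)]

def hits(row):
    """Translate a result row into document names."""
    return [f"Doc {i + 1}" for i, bit in enumerate(row) if bit]

print(hits(and_(dog, fox)))
print(hits(or_(dog, fox)))
print(hits(and_not(dog, fox)))
print(hits(and_not(fox, dog)))
print(hits(and_not(good_and_party, over)))
```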

Proximity Operators

• More “precise” versions of AND
    ◦ “NEAR n” allows at most n-1 intervening terms
    ◦ “WITH” requires terms to be adjacent and in order
    ◦ Other extensions: within n sentences, within n paragraphs, etc.
• Relatively easy to implement, but less efficient
    ◦ Store position information for each word in the document vectors
    ◦ Perform normal Boolean computations, but treat WITH and NEAR as extra constraints

Proximity Operator Example

[Table: the terms quick, brown, fox, over, lazy, dog, back, now, time, all, good, men, come, jump, aid, their, party against Doc 1 and Doc 2, with position information for each term occurrence.]

time AND come → Doc 2

time (NEAR 2) come → empty

quick (NEAR 2) fox → Doc 1

quick WITH fox → empty
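One way to treat NEAR and WITH as extra constraints on stored positions is sketched below. The two document texts are hypothetical, chosen only so that the results agree with the four queries above. Counting every word, including function words, toward positions is likewise an assumption.

```python
def positions(text):
    """Map each term to the list of word offsets at which it occurs."""
    index = {}
    for offset, word in enumerate(text.lower().split()):
        index.setdefault(word, []).append(offset)
    return index

def near(index, a, b, n):
    """a (NEAR n) b: at most n-1 intervening terms, in either order."""
    return any(0 < abs(p - q) <= n
               for p in index.get(a, []) for q in index.get(b, []))

def with_(index, a, b):
    """a WITH b: b immediately follows a."""
    return any(q - p == 1
               for p in index.get(a, []) for q in index.get(b, []))

# Hypothetical document texts, not taken from the slides.
doc1 = positions("the quick brown fox jumped over the lazy dog")
doc2 = positions("now is the time for all good men to come to the aid of their party")

print("time" in doc2 and "come" in doc2)   # time AND come on Doc 2
print(near(doc2, "time", "come", 2))       # time (NEAR 2) come on Doc 2
print(near(doc1, "quick", "fox", 2))       # quick (NEAR 2) fox on Doc 1
print(with_(doc1, "quick", "fox"))         # quick WITH fox on Doc 1
```

The plain Boolean test runs first and the position check only narrows its result, which is why these operators cost extra storage and time.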

Other Extensions

• Ability to search on fields
    ◦ Leverage document structure: title, headings, etc.
• Wildcards
    ◦ lov* = love, loving, loves, loved, etc.
• Special treatment of dates, names, companies, etc.
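A trailing wildcard such as lov* can be read as a prefix match against the index vocabulary. The sketch below assumes that reading and handles only a single trailing `*`. The vocabulary is made up for illustration.

```python
def expand_wildcard(pattern, vocabulary):
    """Expand a pattern with one trailing '*' into all vocabulary terms sharing the prefix."""
    prefix = pattern.rstrip("*")
    return sorted(term for term in vocabulary if term.startswith(prefix))

# Hypothetical vocabulary.
vocabulary = {"love", "loving", "loves", "loved", "lot", "glove"}
expanded = expand_wildcard("lov*", vocabulary)
print(expanded)
```

The expanded terms can then be combined with OR, so a wildcard query reduces to an ordinary Boolean query.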

WESTLAW® Query Examples

• What is the statute of limitations in cases involving the federal tort claims act?
    ◦ LIMIT! /3 STATUTE ACTION /S FEDERAL /2 TORT /3 CLAIM
• What factors are important in determining what constitutes a vessel for purposes of determining liability of a vessel owner for injuries to a seaman under the “Jones Act” (46 USC 688)?
    ◦ (741 +3 824) FACTOR ELEMENT STATUS FACT /P VESSEL SHIP BOAT /P (46 +3 688) “JONES ACT” /P INJUR! /S SEAMAN CREWMAN WORKER
• Are there any cases which discuss negligent maintenance or failure to maintain aids to navigation such as lights, buoys, or channel markers?
    ◦ NOT NEGLECT! FAIL! NEGLIG! /5 MAINT! REPAIR! /P NAVIGAT! /5 AID EQUIP! LIGHT BUOY “CHANNEL MARKER”
• What cases have discussed the concept of excusable delay in the application of statutes of limitations or the doctrine of laches involving actions in admiralty or under the “Jones Act” or the “Death on the High Seas Act”?
    ◦ EXCUS! /3 DELAY /P (LIMIT! /3 STATUTE ACTION) LACHES /P “JONES ACT” “DEATH ON THE HIGH SEAS ACT” (46 +3 761)

Why Boolean Retrieval Works

• Boolean operators approximate natural language
    ◦ Find documents about a good party that is not over
• AND can discover relationships between concepts
    ◦ good party
• OR can discover alternate terminology
    ◦ excellent party, wild party, etc.
• NOT can discover alternate meanings
    ◦ Democratic party

The Perfect Query Paradox

• Every information need has a perfect set of documents
    ◦ If not, there would be no sense doing retrieval
• Every document set has a perfect query
    ◦ AND every word in a document to get a query for it
    ◦ Repeat for each document in the set
    ◦ OR every document query to get the set query
• But can users realistically be expected to formulate this perfect query?
    ◦ Boolean query formulation is hard!
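The three-step construction of the perfect query can be written out mechanically. This is a sketch with two hypothetical documents. The query syntax (parentheses, AND, OR) is an assumption for this example, not any particular system's.

```python
def perfect_query(doc_set):
    """AND every distinct word of each document, then OR the per-document queries."""
    per_doc = ["(" + " AND ".join(sorted(set(text.split()))) + ")" for text in doc_set]
    return " OR ".join(per_doc)

# Hypothetical set of relevant documents.
relevant = ["good party", "wild party tonight"]
query = perfect_query(relevant)
print(query)
```

Even for two tiny documents the query is already unwieldy. For real documents it would contain every word of every relevant document, which no user could be expected to write by hand.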

Why Boolean Retrieval Fails

• Natural language is way more complex
• AND “discovers” nonexistent relationships
    ◦ Terms in different sentences, paragraphs, …
• Guessing terminology for OR is hard
    ◦ good, nice, excellent, outstanding, awesome, …
• Guessing terms to exclude is even harder!
    ◦ Democratic party, party to a lawsuit, …

Strengths and Weaknesses

• Strengths
    ◦ Precise, if you know the right strategies
    ◦ Precise, if you have an idea of what you’re looking for
    ◦ Efficient for the computer
• Weaknesses
    ◦ Users must learn Boolean logic
    ◦ Boolean logic insufficient to capture the richness of language
    ◦ No control over size of result set: either too many documents or none
    ◦ When do you stop reading? All documents in the result set are considered “equally good”
    ◦ What about partial matches? Documents that “don’t quite match” the query may be useful also

Ranked Retrieval

• Order documents by how likely they are to be relevant to the information need
    ◦ Present hits one screen at a time
    ◦ At any point, users can continue browsing through ranked list or reformulate query
• Attempts to retrieve relevant documents directly, not merely provide tools for doing so

Why Ranked Retrieval?

• Arranging documents by relevance is
    ◦ Closer to how humans think: some documents are “better” than others
    ◦ Closer to user behavior: users can decide when to stop reading
• Best (partial) match: documents need not have all query terms
    ◦ Although documents with more query terms should be “better”
• Easier said than done!

How do we weight doc terms?

• Here’s the intuition:
    ◦ Terms that appear often in a document should get high weights. The more often a document contains the term “dog”, the more likely that the document is “about” dogs.
    ◦ Terms that appear in many documents should get low weights. Words like “the”, “a”, “of” appear in (nearly) all documents.
• How do we capture this mathematically?
    ◦ Term frequency
    ◦ Inverse document frequency

TF.IDF Term Weighting

• Simple, yet effective!

\[ w_{i,j} = \mathrm{tf}_{i,j} \cdot \log \frac{N}{n_i} \]

where
    $w_{i,j}$ = weight assigned to term i in document j
    $\mathrm{tf}_{i,j}$ = number of occurrences of term i in document j
    $N$ = number of documents in entire collection
    $n_i$ = number of documents with term i
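The weighting formula translates directly into code. In this sketch the base-10 logarithm is an assumption, since the formula above does not fix the base. The four-document collection is made up.

```python
import math

def tf_idf(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    N = len(docs)                      # number of documents in the collection
    n = {}                             # n[i]: number of documents containing term i
    for tokens in docs:
        for term in set(tokens):
            n[term] = n.get(term, 0) + 1
    weights = []
    for tokens in docs:
        tf = {}                        # tf[i]: occurrences of term i in this document
        for term in tokens:
            tf[term] = tf.get(term, 0) + 1
        # Base-10 log is an assumption; any base gives the same ranking of terms.
        weights.append({t: tf[t] * math.log10(N / n[t]) for t in tf})
    return weights

# Hypothetical toy collection.
docs = [["dog", "dog", "the"], ["the", "cat"], ["the", "dog"], ["the"]]
weights = tf_idf(docs)
print(weights[0])
```

A term that occurs in every document ("the" here) gets weight 0 because log(N/N) = 0, while "dog" in the first document is boosted by its term frequency of 2.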

TF.IDF Example

[Table: tf, idf, and $w_{i,j}$ values per document for the terms nuclear, fallout, siberia, contaminated, interesting, complicated, information, retrieval.]

Normalizing Document Vectors

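• Recall our similarity function:

\[ \mathrm{sim}(d_j, d_k) = \frac{\vec{d_j} \cdot \vec{d_k}}{|\vec{d_j}|\,|\vec{d_k}|} = \frac{\sum_{i=1}^{n} w_{i,j}\, w_{i,k}}{\sqrt{\sum_{i=1}^{n} w_{i,j}^{2}}\;\sqrt{\sum_{i=1}^{n} w_{i,k}^{2}}} \]

• Normalize document vectors in advance
    ◦ Use the “cosine normalization” method: divide each term weight by the length of the vector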
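The effect of normalizing in advance can be checked numerically: once both vectors have length 1, the similarity function reduces to a plain inner product. A minimal sketch with two hypothetical weight vectors:

```python
import math

def length(vector):
    """Euclidean length of a weight vector."""
    return math.sqrt(sum(w * w for w in vector))

def normalize(vector):
    """Cosine normalization: divide each term weight by the vector's length."""
    norm = length(vector)
    return [w / norm for w in vector] if norm else list(vector)

def sim(dj, dk):
    """Cosine similarity computed from the unnormalized vectors."""
    dot = sum(a * b for a, b in zip(dj, dk))
    return dot / (length(dj) * length(dk))

# Hypothetical term weights for two documents over three terms.
dj, dk = [3.0, 4.0, 0.0], [0.0, 4.0, 3.0]
inner = sum(a * b for a, b in zip(normalize(dj), normalize(dk)))
print(round(sim(dj, dk), 2), round(inner, 2))
```

Doing the division once at indexing time means query-time scoring needs only multiplications and additions.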

Normalization Example

[Table: tf, idf, $w_{i,j}$, and normalized weights $w'_{i,j}$ per document for the terms nuclear, fallout, siberia, contaminated, interesting, complicated, information, retrieval.]

Length 1.70 0.97 2.67 0.

Retrieval Example

Query: contaminated retrieval

[Table: query vector and normalized document weights $w'_{i,j}$ for the terms nuclear, fallout, siberia, contaminated, interesting, complicated, information, retrieval.]

similarity score 0.29 0.9 0.19 0.

Ranked list: Doc 2, Doc 4, Doc 1, Doc 3

Do we need to normalize the query vector?
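Ranking by inner product can be sketched as follows. The three normalized document vectors are hypothetical and do not reproduce the table above. The term order (nuclear, contaminated, retrieval) is chosen only for illustration. The last lines probe the question about the query vector: multiplying the query by a positive constant scales every score equally, so the order of the ranked list does not change.

```python
def rank(query, docs):
    """Score each document vector by its inner product with the query, best first."""
    scores = {name: sum(q * w for q, w in zip(query, vec)) for name, vec in docs.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Hypothetical normalized weights over the terms (nuclear, contaminated, retrieval).
docs = {
    "Doc A": [0.6, 0.8, 0.0],
    "Doc B": [0.0, 0.6, 0.8],
    "Doc C": [1.0, 0.0, 0.0],
}

query = [0, 1, 1]                     # contaminated retrieval
ranking = [name for name, _ in rank(query, docs)]
print(ranking)

# Scaling the query changes the scores but not the order.
scaled = [name for name, _ in rank([0, 2, 2], docs)]
print(scaled == ranking)
```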

Weighted Retrieval

Query: contaminated(3) retrieval

Weight query terms by assigning different term weights to the query vector.

[Table: weighted query vector and normalized document weights $w'_{i,j}$ for the terms nuclear, fallout, siberia, contaminated, interesting, complicated, information, retrieval.]

similarity score 0.87 1.16 0.47 0.

Ranked list: Doc 2, Doc 1, Doc 4, Doc 3
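Query term weights can change the ranking, as the two ranked lists in these examples show. The sketch below illustrates the effect with sparse vectors stored as dictionaries. The two documents and their weights are hypothetical, not the values from the slides.

```python
def score(query, doc):
    """Inner product of two sparse vectors given as {term: weight} dicts."""
    return sum(weight * doc.get(term, 0.0) for term, weight in query.items())

def ranked(query, docs):
    """Document names ordered by decreasing score."""
    return sorted(docs, key=lambda name: score(query, docs[name]), reverse=True)

# Hypothetical normalized document vectors.
docs = {
    "Doc X": {"contaminated": 1.0},
    "Doc Y": {"contaminated": 0.6, "retrieval": 0.8},
}

unweighted = ranked({"contaminated": 1, "retrieval": 1}, docs)
weighted = ranked({"contaminated": 3, "retrieval": 1}, docs)
print(unweighted)   # Doc Y matches both terms and comes first
print(weighted)     # tripling "contaminated" moves Doc X ahead
```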

What’s the point?

• Information seeking behavior is incredibly complex
• In order to build actual systems, we must make many simplifications
    ◦ Absolutely unrealistic assumptions!
    ◦ But the resulting systems are nevertheless useful
• Know what these limitations are!

Summary

• Boolean retrieval is powerful in the hands of a trained searcher
• Ranked retrieval is preferred in other circumstances
• Key ideas in the vector space model
    ◦ Goal: find documents most similar to the query
    ◦ Geometric interpretation: measure similarity in terms of angles between vectors in high dimensional space
    ◦ Document weights are some combination of TF, DF, and length
    ◦ Length normalization is critical
    ◦ Similarity is calculated via the inner product

One Minute Paper

• What was the muddiest point in today’s class?