Understanding Information Retrieval: Models, Boolean Retrieval, and Vector Space Model - P, Study notes of School management&administration

University of Maryland School management&administration

Prof. Jimmy Jr-Pin Lin

An overview of information retrieval, focusing on models used, boolean retrieval, and the vector space model. It covers concepts such as models, query formulation, and document representation. The document also discusses the limitations of boolean retrieval and the advantages of the vector space model.

Typology: Study notes

Pre 2010

Uploaded on 02/13/2009


LBSC 796/INFM 718R: Week 3

Boolean and Vector Space Models

Jimmy Lin

College of Information Studies

University of Maryland

Monday, February 13, 2006

Muddy Points

- Statistics, significance tests
- Precision-recall curve, interpolation
- MAP
- Math, math, and more math!
- Reading the book

The Information Retrieval Cycle

[Figure: the information retrieval cycle. Stages: Source Selection, Query Formulation, Search, Selection, Examination, Delivery. Artifacts passed between stages: Resource, Query, Ranked List, Documents. Feedback loops: source reselection; system discovery, vocabulary discovery, concept discovery, document discovery.]

What is a model?

- A model is a construct designed to help us understand a complex system
  - A particular way of “looking at things”
- Models inevitably make simplifying assumptions
  - What are the limitations of the model?
- Different types of models:
  - Conceptual models
  - Physical analog models
  - Mathematical models
  - …

The Central Problem in IR

Information Seeker    Authors
Concepts              Concepts
Query Terms           Document Terms

Do these represent the same concepts?

The IR Black Box

[Figure: the IR black box. The query and the documents each pass through a representation function, producing a query representation and a document representation (the index). A comparison function matches the two and returns hits.]


Today’s Topics

- Boolean model
  - Based on the notion of sets
  - Documents are retrieved only if they satisfy Boolean conditions specified in the query
  - Does not impose a ranking on retrieved documents
  - Exact match
- Vector space model
  - Based on geometry, the notion of vectors in high dimensional space
  - Documents are ranked based on their similarity to the query (ranked retrieval)
  - Best/partial match

Next Time…

- Language models
  - Based on the notion of probabilities and processes for generating text
  - Documents are ranked based on the probability that they generated the query
  - Best/partial match

Representing Text

[Figure: the IR black box diagram, repeated.]

How do we represent text?

- How do we represent the complexities of language?
  - Keeping in mind that computers don’t “understand” documents or queries
- Simple, yet effective approach: “bag of words”
  - Treat all the words in a document as index terms for that document
  - Assign a “weight” to each term based on its “importance”
  - Disregard order, structure, meaning, etc. of the words

What’s a “word”? We’ll return to this in a few lectures…

Sample Document

McDonald's slims down spuds

Fast-food chain to reduce certain types of fat in its french fries with new cooking oil. NEW YORK (CNN/Money) - McDonald's Corp. is cutting the amount of "bad" fat in its french fries nearly in half, the fast-food chain said Tuesday as it moves to make all its fried menu items healthier. But does that mean the popular shoestring fries won't taste the same? The company says no. "It's a win-win for our customers because they are getting the same great french-fry taste along with an even healthier nutrition profile," said Mike Roberts, president of McDonald's USA. But others are not so sure. McDonald's will not specifically discuss the kind of oil it plans to use, but at least one nutrition expert says playing with the formula could mean a different taste. Shares of Oak Brook, Ill.-based McDonald's (MCD: down $0.54 to $23.22, Research, Estimates) were lower Tuesday afternoon. It was unclear Tuesday whether competitors Burger King and Wendy's International (WEN: down $0.80 to $34.91, Research, Estimates) would follow suit. Neither company could immediately be reached for comment. …

“Bag of Words” (term counts for the sample document):

16 × said
14 × McDonalds
12 × fat
11 × fries
8 × new
6 × company, french, nutrition
5 × food, oil, percent, reduce, taste, Tuesday
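A minimal sketch of how such term counts could be produced. The tokenizer here (lowercase, alphabetic runs only) is an assumption made for illustration; the slides defer the question of what counts as a "word", and `bag_of_words` is a hypothetical helper name.

```python
import re
from collections import Counter

def bag_of_words(text):
    """Lowercase the text, split it into alphabetic tokens, and count them.

    Word order and structure are discarded; only the counts survive.
    """
    tokens = re.findall(r"[a-z]+", text.lower())  # crude tokenizer (assumption)
    return Counter(tokens)

# A short excerpt from the sample document.
excerpt = (
    "McDonald's Corp. is cutting the amount of \"bad\" fat in its french fries "
    "nearly in half, the fast-food chain said Tuesday as it moves to make all "
    "its fried menu items healthier."
)

bag = bag_of_words(excerpt)
print(bag["fries"], bag["in"], bag["its"])
```

Note that this crude tokenizer splits "McDonald's" into two tokens, which is exactly the kind of decision a real system has to make deliberately.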

What’s the point?

- Retrieving relevant information is hard!
  - Evolving, ambiguous user needs, context, etc.
  - Complexities of language
- To operationalize information retrieval, we must vastly simplify the picture
- Bag-of-words approach:
  - Information retrieval is all (and only) about matching words in documents with words in queries
  - Obviously, not true…
  - But it works pretty well!

Boolean View of a Collection

[Table: term-document incidence matrix. Rows (terms): quick, brown, fox, over, lazy, dog, back, now, time, all, good, men, come, jump, aid, their, party. Columns: Doc 1 through Doc 8. Cell values are not recoverable from the extraction.]

Each column represents the view of a particular document: What terms are contained in this document?

Each row represents the view of a particular term: What documents contain this term?

To execute a query, pick out the rows corresponding to the query terms and then apply the logic table of the corresponding Boolean operator.

Sample Queries

Term         Doc 1  Doc 2  Doc 3  Doc 4  Doc 5  Doc 6  Doc 7  Doc 8
fox            0      0      1      0      1      0      1      0
dog            0      0      1      0      1      0      0      0
dog ∧ fox      0      0      1      0      1      0      0      0
dog ∨ fox      0      0      1      0      1      0      1      0
dog ¬ fox      0      0      0      0      0      0      0      0
fox ¬ dog      0      0      0      0      0      0      1      0

dog AND fox → Doc 3, Doc 5
dog OR fox → Doc 3, Doc 5, Doc 7
dog NOT fox → empty
fox NOT dog → Doc 7

Term                    Doc 1  Doc 2  Doc 3  Doc 4  Doc 5  Doc 6  Doc 7  Doc 8
over                      1      0      1      0      1      0      1      1
good ∧ party              0      0      0      0      0      1      0      1
good ∧ party ¬ over       0      0      0      0      0      1      0      0

good AND party → Doc 6, Doc 8
good AND party NOT over → Doc 6
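The row operations above can be sketched with Python sets, where each term maps to the set of documents containing it. The sets for fox, dog, and over follow the rows on the slide; the individual sets for good and party are assumptions chosen only so that their intersection is Doc 6 and Doc 8.

```python
# Postings sets: the documents (numbered 1..8) that contain each term.
postings = {
    "fox": {3, 5, 7},
    "dog": {3, 5},
    "over": {1, 3, 5, 7, 8},
    "good": {2, 4, 6, 8},  # assumed for illustration
    "party": {6, 8},       # assumed for illustration
}

def docs(term):
    """Return the set of documents containing the term (empty if unknown)."""
    return postings.get(term, set())

# AND is intersection, OR is union, NOT (as "a NOT b") is set difference.
dog_and_fox = docs("dog") & docs("fox")
dog_or_fox = docs("dog") | docs("fox")
dog_not_fox = docs("dog") - docs("fox")
fox_not_dog = docs("fox") - docs("dog")
good_party_not_over = (docs("good") & docs("party")) - docs("over")

for label, result in [
    ("dog AND fox", dog_and_fox),
    ("dog OR fox", dog_or_fox),
    ("dog NOT fox", dog_not_fox),
    ("fox NOT dog", fox_not_dog),
    ("good AND party NOT over", good_party_not_over),
]:
    print(label, "->", sorted(result) if result else "empty")
```

Every retrieved document satisfies the query exactly, and the result is an unordered set: nothing here says which hit is better.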

Proximity Operators

- More “precise” versions of AND
  - “NEAR n” allows at most n-1 intervening terms
  - “WITH” requires terms to be adjacent and in order
  - Other extensions: within n sentences, within n paragraphs, etc.
- Relatively easy to implement, but less efficient
  - Store position information for each word in the document vectors
  - Perform normal Boolean computations, but treat WITH and NEAR as extra constraints

Proximity Operator Example

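[Table: term-document matrix with position information for Doc 1 and Doc 2. Rows (terms): quick, brown, fox, over, lazy, dog, back, now, time, all, good, men, come, jump, aid, their, party. Cell values are not recoverable from the extraction.]

time AND come → Doc 2
time (NEAR 2) come → empty
quick (NEAR 2) fox → Doc 1
quick WITH fox → empty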
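A minimal sketch of NEAR and WITH as extra constraints on top of Boolean AND, using a positional index. The two token sequences are hypothetical: they are chosen to be consistent with the four results above, since the actual position table did not survive. Treating NEAR as order-insensitive is also an assumption.

```python
from collections import defaultdict

# Hypothetical index-term sequences for the two documents (an assumption).
docs = {
    1: "quick brown fox jump over lazy dog back".split(),
    2: "now time all good men come aid their party".split(),
}

def build_positional_index(docs):
    """Map each term to {doc_id: [positions of the term in that document]}."""
    index = defaultdict(dict)
    for doc_id, tokens in docs.items():
        for pos, term in enumerate(tokens):
            index[term].setdefault(doc_id, []).append(pos)
    return index

def candidates(index, a, b):
    """Plain Boolean AND: documents containing both terms."""
    return set(index.get(a, {})) & set(index.get(b, {}))

def near(index, a, b, n):
    """a NEAR n b: at most n-1 intervening terms, in either order."""
    return {
        d for d in candidates(index, a, b)
        if any(0 < abs(pa - pb) <= n
               for pa in index[a][d] for pb in index[b][d])
    }

def with_(index, a, b):
    """a WITH b: b immediately follows a."""
    return {
        d for d in candidates(index, a, b)
        if any(pb - pa == 1 for pa in index[a][d] for pb in index[b][d])
    }

index = build_positional_index(docs)
print(candidates(index, "time", "come"))
print(near(index, "time", "come", 2))
print(near(index, "quick", "fox", 2))
print(with_(index, "quick", "fox"))
```

The proximity check only runs on documents that already passed the AND, which mirrors the "normal Boolean computation plus extra constraint" description.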

Other Extensions

- Ability to search on fields
  - Leverage document structure: title, headings, etc.
- Wildcards
  - lov* = love, loving, loves, loved, etc.
- Special treatment of dates, names, companies, etc.
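One simple way to handle a trailing wildcard such as lov* is to expand it against the index vocabulary and OR the matching terms together. The sketch below shows only the expansion step; `expand_wildcard` and the small vocabulary are hypothetical, and real systems typically use a sorted term dictionary or trie rather than a linear scan.

```python
def expand_wildcard(pattern, vocabulary):
    """Expand a trailing-* pattern such as 'lov*' into matching index terms."""
    if not pattern.endswith("*"):
        # No wildcard: the pattern is an ordinary term.
        return [pattern] if pattern in vocabulary else []
    prefix = pattern[:-1]
    return sorted(term for term in vocabulary if term.startswith(prefix))

# Hypothetical vocabulary for illustration.
vocabulary = {"love", "loving", "loves", "loved", "low", "glove"}

print(expand_wildcard("lov*", vocabulary))
```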

WESTLAW® Query Examples

- What is the statute of limitations in cases involving the federal tort claims act?
  - LIMIT! /3 STATUTE ACTION /S FEDERAL /2 TORT /3 CLAIM
- What factors are important in determining what constitutes a vessel for purposes of determining liability of a vessel owner for injuries to a seaman under the “Jones Act” (46 USC 688)?
  - (741 +3 824) FACTOR ELEMENT STATUS FACT /P VESSEL SHIP BOAT /P (46 +3 688) “JONES ACT” /P INJUR! /S SEAMAN CREWMAN WORKER
- Are there any cases which discuss negligent maintenance or failure to maintain aids to navigation such as lights, buoys, or channel markers?
  - NOT NEGLECT! FAIL! NEGLIG! /5 MAINT! REPAIR! /P NAVIGAT! /5 AID EQUIP! LIGHT BUOY “CHANNEL MARKER”
- What cases have discussed the concept of excusable delay in the application of statutes of limitations or the doctrine of laches involving actions in admiralty or under the “Jones Act” or the “Death on the High Seas Act”?
  - EXCUS! /3 DELAY /P (LIMIT! /3 STATUTE ACTION) LACHES /P “JONES ACT” “DEATH ON THE HIGH SEAS ACT” (46 +3 761)

Why Boolean Retrieval Works

- Boolean operators approximate natural language
  - Find documents about a good party that is not over
- AND can discover relationships between concepts
  - good party
- OR can discover alternate terminology
  - excellent party, wild party, etc.
- NOT can discover alternate meanings
  - Democratic party

The Perfect Query Paradox

- Every information need has a perfect set of documents
  - If not, there would be no sense doing retrieval
- Every document set has a perfect query
  - AND every word in a document to get a query for it
  - Repeat for each document in the set
  - OR every document query to get the set query
- But can users realistically be expected to formulate this perfect query?
  - Boolean query formulation is hard!
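The construction of the "perfect query" can be sketched mechanically: AND the words within each document, then OR the per-document queries. The function names and the two toy documents are hypothetical.

```python
def document_query(terms):
    """AND together every distinct word in one document."""
    return "(" + " AND ".join(sorted(set(terms))) + ")"

def set_query(documents):
    """OR together the per-document queries for every document in the set."""
    return " OR ".join(document_query(terms) for terms in documents)

# Hypothetical target set of two tiny documents.
relevant = [
    ["good", "party"],
    ["excellent", "party", "tonight"],
]

print(set_query(relevant))
```

The sketch makes the paradox concrete: building the query requires already knowing every word of every document in the target set, which is precisely what the searcher does not have.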

Why Boolean Retrieval Fails

- Natural language is way more complex
- AND “discovers” nonexistent relationships
  - Terms in different sentences, paragraphs, …
- Guessing terminology for OR is hard
  - good, nice, excellent, outstanding, awesome, …
- Guessing terms to exclude is even harder!
  - Democratic party, party to a lawsuit, …

Strengths and Weaknesses

- Strengths
  - Precise, if you know the right strategies
  - Precise, if you have an idea of what you’re looking for
  - Efficient for the computer
- Weaknesses
  - Users must learn Boolean logic
  - Boolean logic insufficient to capture the richness of language
  - No control over size of result set: either too many documents or none
  - When do you stop reading? All documents in the result set are considered “equally good”
  - What about partial matches? Documents that “don’t quite match” the query may be useful also

Ranked Retrieval

- Order documents by how likely they are to be relevant to the information need
  - Present hits one screen at a time
  - At any point, users can continue browsing through the ranked list or reformulate the query
- Attempts to retrieve relevant documents directly, not merely provide tools for doing so

Why Ranked Retrieval?

- Arranging documents by relevance is
  - Closer to how humans think: some documents are “better” than others
  - Closer to user behavior: users can decide when to stop reading
- Best (partial) match: documents need not have all query terms
  - Although documents with more query terms should be “better”
- Easier said than done!

How do we weight doc terms?

- Here’s the intuition:
  - Terms that appear often in a document should get high weights
  - Terms that appear in many documents should get low weights
- How do we capture this mathematically?
  - Term frequency
  - Inverse document frequency

The more often a document contains the term “dog”, the …

Words like “the”, “a”, “of” appear in (nearly) all documents.

TF.IDF Term Weighting

- Simple, yet effective!

$$ w_{i,j} = \mathrm{tf}_{i,j} \cdot \log \frac{N}{n_i} $$

where

- $w_{i,j}$: weight assigned to term i in document j
- $\mathrm{tf}_{i,j}$: number of occurrences of term i in document j
- $N$: number of documents in the entire collection
- $n_i$: number of documents with term i
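A minimal sketch of this weighting over a toy collection. The base-10 logarithm is an assumption (the slide does not state the base), and the four token lists are made up for illustration; `tf_idf` is a hypothetical helper name.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per document, with w = tf * log10(N / n_i)."""
    N = len(docs)                                   # documents in the collection
    term_counts = [Counter(tokens) for tokens in docs]
    doc_freq = Counter()                            # n_i: documents containing term i
    for counts in term_counts:
        doc_freq.update(counts.keys())
    return [
        {term: tf * math.log10(N / doc_freq[term]) for term, tf in counts.items()}
        for counts in term_counts
    ]

# Toy collection (an assumption, not the values from the lecture's example).
docs = [
    ["nuclear", "fallout", "nuclear", "information"],
    ["information", "retrieval", "information"],
    ["contaminated", "siberia", "information", "fallout"],
    ["retrieval", "complicated", "information"],
]

weights = tf_idf(docs)
print(weights[0])
```

Here "information" occurs in every document, so its weight is zero everywhere, while "nuclear" occurs twice in a single document and receives the largest weight in the first document.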

TF.IDF Example

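[Table: tf, idf, and W_{i,j} values per document for the terms nuclear, fallout, siberia, contaminated, interesting, complicated, information, retrieval. Cell values are not recoverable from the extraction.]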

Normalizing Document Vectors

- Recall our similarity function:

$$ \mathrm{sim}(d_j, d_k) = \frac{\vec{d_j} \cdot \vec{d_k}}{|\vec{d_j}|\,|\vec{d_k}|} = \frac{\sum_{i=1}^{n} w_{i,j}\, w_{i,k}}{\sqrt{\sum_{i=1}^{n} w_{i,j}^{2}}\,\sqrt{\sum_{i=1}^{n} w_{i,k}^{2}}} $$

- Normalize document vectors in advance
  - Use the “cosine normalization” method: divide each term weight by the length of the vector
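A minimal sketch of cosine normalization: once each vector has been divided by its length, the similarity reduces to a plain inner product. Vectors are stored as sparse {term: weight} dicts, and the three example vectors are hypothetical.

```python
import math

def normalize(vec):
    """Divide each term weight by the Euclidean length of the vector."""
    length = math.sqrt(sum(w * w for w in vec.values()))
    if length == 0:
        return dict(vec)  # leave an all-zero vector unchanged
    return {term: w / length for term, w in vec.items()}

def inner_product(a, b):
    """Sum of products over the terms the two vectors share."""
    return sum(w * b[term] for term, w in a.items() if term in b)

def cosine(a, b):
    """Cosine similarity: inner product of the length-normalized vectors."""
    return inner_product(normalize(a), normalize(b))

# Hypothetical weight vectors.
d_j = {"nuclear": 3.0, "fallout": 4.0}
d_k = {"nuclear": 6.0, "fallout": 8.0}   # same direction as d_j, twice as long
d_m = {"retrieval": 2.0}                 # shares no terms with d_j

print(cosine(d_j, d_k), cosine(d_j, d_m))
```

Normalizing in advance means the lengths are computed once at indexing time rather than once per query, and a long document does not outscore a short one merely by being long.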

Normalization Example

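[Table: tf, idf, W_{i,j}, and normalized W'_{i,j} values per document for the terms nuclear, fallout, siberia, contaminated, interesting, complicated, information, retrieval. Cell values are not recoverable from the extraction.]

Length (Doc 1 to Doc 4): 1.70, 0.97, 2.67, 0.…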

Retrieval Example

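Query: contaminated retrieval

[Table: query vector and normalized W'_{i,j} weights for the terms nuclear, fallout, siberia, contaminated, interesting, complicated, information, retrieval. Cell values are not recoverable from the extraction.]

Similarity score (Doc 1 to Doc 4): 0.29, 0.9, 0.19, 0.…

Ranked list:
1. Doc 2
2. Doc 4
3. Doc 1
4. Doc 3

Do we need to normalize the query vector?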

Weighted Retrieval

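Query: contaminated(3) retrieval

Weight query terms by assigning different term weights to the query vector.

[Table: weighted query vector and normalized W'_{i,j} weights for the terms nuclear, fallout, siberia, contaminated, interesting, complicated, information, retrieval. Cell values are not recoverable from the extraction.]

Similarity score (Doc 1 to Doc 4): 0.87, 1.16, 0.47, 0.…

Ranked list:
1. Doc 2
2. Doc 1
3. Doc 4
4. Doc 3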
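A minimal sketch of ranked retrieval with query term weights: each document is scored by the inner product of its weight vector with the query vector, and documents are sorted by score. The document weights below are hypothetical and are not the values from the lecture's example; `rank` is an illustrative name.

```python
def rank(query, doc_vectors):
    """Score each document by its inner product with the query, best first."""
    scores = {
        doc_id: sum(weight * vec.get(term, 0.0) for term, weight in query.items())
        for doc_id, vec in doc_vectors.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Hypothetical document weights (assumed, for illustration only).
doc_vectors = {
    "Doc 1": {"contaminated": 0.5, "nuclear": 0.8},
    "Doc 2": {"retrieval": 0.8, "information": 0.6},
    "Doc 3": {"retrieval": 0.2, "complicated": 0.9},
}

unweighted = {"contaminated": 1.0, "retrieval": 1.0}  # contaminated retrieval
weighted = {"contaminated": 3.0, "retrieval": 1.0}    # contaminated(3) retrieval

print([doc for doc, _ in rank(unweighted, doc_vectors)])
print([doc for doc, _ in rank(weighted, doc_vectors)])
```

With these assumed weights, tripling the weight on "contaminated" moves Doc 1 ahead of Doc 2, which illustrates how query term weights can reorder the ranked list without changing which documents match.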

What’s the point?

- Information seeking behavior is incredibly complex
- In order to build actual systems, we must make many simplifications
  - Absolutely unrealistic assumptions!
  - But the resulting systems are nevertheless useful
- Know what these limitations are!

Summary

- Boolean retrieval is powerful in the hands of a trained searcher
- Ranked retrieval is preferred in other circumstances
- Key ideas in the vector space model
  - Goal: find documents most similar to the query
  - Geometric interpretation: measure similarity in terms of angles between vectors in high dimensional space
  - Document weights are some combination of TF, DF, and length
  - Length normalization is critical
  - Similarity is calculated via the inner product

One Minute Paper

- What was the muddiest point in today’s class?
