Information Retrieval Overview, Lecture Slides - Computer Science, Slides of Artificial Intelligence

Johns Hopkins University (JHU), Artificial Intelligence

Prof. Paul McNamee, Information Retrieval, Computer Science, Artificial Intelligence, Johns Hopkins University, What is IR

Typology: Slides

2010/2011

Uploaded on 11/09/2011


http://apl.jhu.edu/~paulmac/ir.html

CS 605.744 Information Retrieval

Spring 2011

Paul McNamee

Johns Hopkins University

[email protected]

Overview

• Course Pragmatics
– Schedule of topics
– Grading policy
• Overview of Text Retrieval
• Boolean
– Document Representations
– Queries
• Tokenization


Objectives of the Course

• To get a basic theoretical understanding of IR
– Representing and indexing text documents
– Retrieval models
– Implementing querying efficiently
• To examine application areas and research topics such as:
– Text Classification
– Cross-language retrieval
– Retrieval on the Web
– Speech Retrieval
• To understand how IR performance is assessed:
– Recall/Precision and other metrics
• To gain hands-on experience building an IR system
• To introduce selected topics in computational linguistics

Course Philosophy

• Somewhere between a lecture, seminar, and laboratory
– Lecture, but more interactive == more interesting
– Studying assigned readings prior to a class will elevate the class discussions
– Homeworks and projects provide hands-on experience
– Students present paper summaries and projects to the whole class

Research Project

• Goal is to research an area or develop an idea that you would like to explore in greater depth
• Deliverables include a written report and an oral presentation
• More details in 2-3 weeks

General Introduction

• What is Information Retrieval?
– How does it differ from database querying?
• History of IR
– Field over 40 years old
– Why so popular now?
– Impact of the Web
• Beyond vanilla document retrieval
• Resources

I never waste memory on things that can easily be stored and retrieved from elsewhere – A. Einstein

What is Information Retrieval?

• Field concerned with the organization, storage, and retrieval of information
– Especially text
– Also retrieval of semistructured data (XML), video, images, speech, music
• Requires algorithms and data structures
– For manipulating natural language
– To efficiently store and process data
• Related fields
– Natural Language Processing, Library Science
– Computational Linguistics, Digital Libraries

What makes IR a hard problem

• Under good circumstances
– Text is unstructured
– In the hardest cases, it requires understanding of semantics
– Human language presents distinct problems (e.g., ambiguity)
• Under hard circumstances
– Patent retrieval: applications tend to use low-content words; why?
– One estimate is that 40% of web pages change monthly; many pages ‘lie’ about their content; new pages aren’t linked to
• Multimedia information
– Hard to store (size), represent, and compare

Reason #3: Ambiguity

• English provides no canonical way to reference people and things
– President Carter, Pres. Carter, Jimmy Carter; the 39th president, Rosalynn Carter’s husband
• Ambiguity pervasive
– jaguar, bank, see, hornet, red, aa
• Distinctions vary in granularity
– cool (popular) vs. cool (low in temperature)
– list (to name items in a list) vs. list (to include in a list)

Reason #4: Word Choices

• Speakers of a language learn preferential ways of expressing things:
– strong tea / powerful computers
• Documents have a limited vocabulary with discrete occurrences; words have many synonyms
– query: ‘fast automobiles’
- should match ‘fast cars’
• Inflectional forms
– query about ‘juggling’
- should match jugglers, juggler, jongleur
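
The mismatch described above can be made concrete with a small sketch. The synonym table and suffix list below are hypothetical, hand-made stand-ins (a real system would more likely use a thesaurus and a proper stemmer), and the function names are illustrative only.

```python
# Hypothetical, hand-made resources for illustration only.
SYNONYMS = {"automobiles": "cars", "automobile": "car"}
SUFFIXES = ("ing", "ers", "er", "s")  # crude suffix stripping, not a real stemmer


def normalize(term):
    """Map a term to a rough canonical form: lowercase, synonym, stripped suffix."""
    term = term.lower()
    term = SYNONYMS.get(term, term)
    for suffix in SUFFIXES:
        # Keep at least three characters so short words are left alone.
        if term.endswith(suffix) and len(term) - len(suffix) >= 3:
            return term[: -len(suffix)]
    return term


def exact_match(query, document):
    """True if every query word appears verbatim in the document."""
    return set(query.lower().split()) <= set(document.lower().split())


def normalized_match(query, document):
    """True if every normalized query word appears among the normalized document words."""
    query_terms = {normalize(t) for t in query.split()}
    doc_terms = {normalize(t) for t in document.split()}
    return query_terms <= doc_terms
```

With these toy tables, `exact_match("fast automobiles", "fast cars")` is false while `normalized_match` is true, and ‘juggling’, ‘jugglers’, and ‘juggler’ all reduce to the same form. ‘jongleur’ does not, which suggests that suffix stripping alone cannot cover every related form.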

Pre-history of IR

• 300 BCE Euclid’s treatise, The Elements
• 300 BCE Ptolemy I founds Great Library at Alexandria, which grows to include 700,000+ volumes (scrolls)
• 391 Great Library destroyed by fanatics (implications for the Web?)
• 600 Number 0 used in India
• 825 Muhammad ibn Musa Al-Khowarizmi writes treatise on algebra; the English word algorithm is derived from his name
• 1230s St. Anthony (of Padova) creates concordance for Latin Vulgate
• 1247 Cardinal Hugo employs 500 monks to build a concordance
• 1450s Johannes Gutenberg creates printing press
• 1550 First English concordance of entire Bible
• 1642 Blaise Pascal develops mechanical calculator. It performed subtraction by adding complements
• 1714 Henry Mill conceives of the typewriter
• 1837 Morse Code is an early text encoding scheme
• 1857 Sir Charles Wheatstone stores Morse codes on paper tapes; they could be prepared offline and transmitted later

The Great Library Rebuilt (2002)

Early IR Systems

• Card-based
• Uniterm (Casey, Perry, Berry, Kent: 1958; developed and used from the mid-1940s)

Example cards (document numbers arranged in columns by final digit):

EXCURSION 43821
    0    1    2    3    4    5    6    7    8    9
   90  241   52   63   34   25   66   17   58   49
  130  281   92   83   44   75   86   57   88  119
  640        122   93  104  115  146   97  158  139
  870        342                       157  178  199
                                       207  248  269
                                            298

LUNAR 12457
    0    1    2    3    4    5    6    7    8    9
  110  181   12   73   44   15   46    7   28   39
  430  241   42  113   74   85   76   17   78   79
  820  761  602  233  134   95  136   37  118  109
       901  982       194  165       127  198  179
                                       377  288
                                       407

Advent of Computer Science

• 1962 First Comp Sci. degree program offered by Purdue U.
• 1963 ASCII standard developed
• 1965 CD-ROM technology invented (James Russell)
• 1969 ARPANET contains 4 hosts (23 in 1971)
• 1969 UNIX operating system (Ritchie & Thompson)
• 1972 Tomlinson sends first email message
• 1975 Microsoft founded by Gates and Allen
• 1977 Apple II personal computer
• 1981 IBM PC
• 1982 TCP/IP basis for NSFNet
• 1984 Apple Macintosh with windowing interface
• 1984 1,000 Internet hosts
• 1988 Robert Morris, a Cornell U. graduate student, unleashes the ‘Internet Worm’
• 1989 100,000 Internet hosts

Birth of the Web

• 1989 Tim Berners-Lee invents World-Wide-Web
• 1992 1,000,000 Internet hosts, but only 50 web sites
• 1994 Two Stanford graduate students found Yahoo, a manually built online directory
• 1995 AltaVista indexes 15 million web pages
• 1996 Two other Stanford graduate students collaborate on Google
• 1997 Lawrence and Giles paper characterizing Web
• 1999 Excite search engine sold for $6.7 billion; around the same time the automotive division of Volvo sold for $6.3 billion
• 2000 1 billion web pages on public web; 10 million web sites; 93 million or so Internet hosts
• 2002 Google claims 3 billion page index
• 2004 Google IPO
• 2004 Microsoft unleashes Web search engine
• 2006 Google’s stock value exceeds $150 billion (> Coke, IBM, AT&T)
• 2009 Microsoft rebrands Web search as Bing

Sources:
http://www.mcs.net/~jorn/html/net/timeline.html
http://ei.cs.vt.edu/~history/
http://www.maxmon.com/history.htm
http://www.computerhistory.org/
http://www.let.leidenuniv.nl/history/

Why is IR thriving today?

• Dropping prices for external storage is the greatest factor
• Other factors
– Increased expectations and demonstrated utility
– Web 2.0 / electronic commerce
– Ease of use

From www.lesk.com

Beyond Text

• Images
– Content-based methods are difficult
– Can try to make inferences based on filenames or coordinate text
– Take up much more storage than text
• Video
– Usually use sampled sequences of images
• Broadcast speech
– 1000s of radio stations from around the world
– Typical approach: transcribe speech into text (with errors) and treat as ‘normal’ text
• Scanned text
– Like speech, scan (w/ errors) and index
• Maps, Diagrams, Music (open problems)

Google Images

[Figure: Google Images results. Annotation: document text is the basis for search]

Page text: “Small, oval, potato-brown figures with faces. Like Fry Guy and Fry Girl they too wear a variety of costumes.”

Beyond Document Retrieval

• Users typically do not want to merely find documents of interest
• A. Broder (CTO AltaVista) taxonomy (11/00)
– Informational needs
– Navigation (e.g., surrogate bookmarks)
– Transactional
• J. Prange (ARDA), Advanced Question-Answering
– “A bridge too far or the right technology at the right time?”

Question-Answering Systems

• Not database front-ends
• FAQ-Finder
– Indexes FAQ lists and tries to find responsive answers to common questions
• Ask Jeeves
– Looks for web pages likely to contain answers to common, simple questions
- (e.g., “How do I make an apple pie?”)
• eHow
– Web 2.0: free, user-contributed, ranked answers to common questions
• The text retrieval community has studied question answering for several years
– TREC-8 evaluation (1999) was the first

Beyond Single Requests

• Commercial engines have high volume
– 1999: Infoseek anecdotally reports
- ~50M queries / day
- ~600 queries / second over 10^8 collection
– 2010 estimates
- Google: 1-3 billion / day
- Yahoo: 180 M / day
- Bing: 80 M / day
• Ideally, user context should be leveraged
– System can learn a profile over time
– Benefits successive queries
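
As a rough check on the volume figures above, daily query counts can be converted to an average per-second rate. This sketch assumes load is spread evenly over 24 hours, which real traffic is not, so peak rates would be higher. The function name is illustrative.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def per_second(queries_per_day):
    """Average queries per second, assuming a perfectly uniform load."""
    return queries_per_day / SECONDS_PER_DAY


# Daily volumes quoted on the slide.
infoseek_1999 = per_second(50_000_000)
google_2010_low = per_second(1_000_000_000)
google_2010_high = per_second(3_000_000_000)
```

Under that assumption, 50M queries per day averages about 579 queries per second, in line with the quoted ~600 per second, and 1 to 3 billion per day corresponds to roughly 11,600 to 34,700 per second.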

Beyond Surfing

• Dual problem to ad hoc retrieval
– Filter incoming messages relevant to a defined profile
- Push technology vs. pull
- Examples: Bloomberg news, book or movie recommendations, targeted advertisements, spam filtering
• Scenario:
– You are a safety engineer for a large automotive manufacturer. You want to keep track of reports of accidents in a new vehicle
- Don’t have access to a static collection of documents; instead, news stories and reports trickle in over time; relevance decisions must be made immediately
- Can’t be plagued by too many false alarms, but also don’t want to miss relevant reports
• Technology isn’t mainstream, yet
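
The filtering scenario can be sketched as a standing profile applied to documents as they arrive. The weighted-term profile, scoring rule, and threshold here are assumptions made for illustration, not a method given in the slides; in practice a profile would probably be learned rather than written by hand.

```python
def score(profile, text):
    """Sum the weights of the profile terms that occur in the text."""
    words = set(text.lower().split())
    return sum(weight for term, weight in profile.items() if term in words)


def filter_stream(profile, stream, threshold):
    """Yield each incoming document whose score meets the threshold.

    Each decision is made once, when the document arrives, with no
    access to documents that come later.
    """
    for document in stream:
        if score(profile, document) >= threshold:
            yield document


# Hypothetical profile for the safety-engineer scenario.
profile = {"accident": 2.0, "recall": 2.0, "brakes": 1.0}
incoming = [
    "Quarterly sales rise",
    "Brakes fail in accident on highway",
    "New model wins award",
]
alerts = list(filter_stream(profile, incoming, threshold=2.5))
```

Raising the threshold reduces false alarms at the cost of missing more relevant reports; lowering it does the reverse.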

Research Software Systems

• Wumpus
– U. Waterloo (Open source, C++)
• Terrier
– Glasgow (Open source, Java)
• Lemur / Indri
– Carnegie Mellon / UMass (C++ & Java bindings)
• Lucene
– Apache/Jakarta (Java?)
• SMART
– Developed at Cornell University (C)
• mg
– From the authors of Managing Gigabytes (C)
• INQUERY
– Univ. Massachusetts (Amherst). Available?

Some Commercial Systems

• Opentext
• Lexis-Nexis
• Verity
• Claritech
• Thomson / Reuters
– bought Westlaw
• Autonomy
• Oracle
• Many others…

Boolean Queries

• INFIX operators
– ((cat AND dog) OR (collar AND leash))
• NOT is UNARY PREFIX operator
– ((cat AND dog) OR (collar AND (NOT dog)))
• AND and OR can be n-ary operators
– (cat AND dog AND rabies AND shot)
• De Morgan’s Laws
– NOT(a) AND NOT(b) = NOT(a OR b)
– NOT(a) OR NOT(b) = NOT(a AND b)
– NOT(NOT(a)) = a
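
One way to see these operators at work is to treat each term as the set of documents that contain it, so that AND, OR, and NOT become set intersection, union, and complement. The index contents and document IDs below are made up for illustration.

```python
# Hypothetical collection of eight documents, identified by ID.
ALL_DOCS = set(range(1, 9))

# Hypothetical index: term -> set of IDs of the documents containing it.
INDEX = {
    "cat": {1, 2, 3, 5},
    "dog": {2, 3, 4, 7},
    "collar": {3, 4, 6},
    "leash": {4, 6, 8},
}


def docs(term):
    """Documents containing the term (empty set for unknown terms)."""
    return INDEX.get(term, set())


def NOT(doc_set):
    """Complement relative to the whole collection."""
    return ALL_DOCS - doc_set


# ((cat AND dog) OR (collar AND (NOT dog)))
result = (docs("cat") & docs("dog")) | (docs("collar") & NOT(docs("dog")))

# De Morgan's laws and double negation, checked on two of the sets.
a, b = docs("cat"), docs("dog")
de_morgan_and = NOT(a) & NOT(b) == NOT(a | b)
de_morgan_or = NOT(a) | NOT(b) == NOT(a & b)
double_negation = NOT(NOT(a)) == a
```

For this toy index the query selects documents 2, 3, and 6. Note that NOT needs to know the full collection, which is one reason it is usually combined with AND rather than evaluated on its own.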

Boolean Queries

• (Cat OR Dog) AND (Collar OR Leash)
– Which of the following combinations satisfies this statement?

[Table: candidate combinations, one per column, with an x marking which of Cat, Dog, Collar, and Leash are present]
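
Since the candidate combinations are just presence/absence patterns over four terms, the question can be answered by enumerating all sixteen patterns. This is a small sketch; the function and variable names are illustrative.

```python
from itertools import product

TERMS = ("Cat", "Dog", "Collar", "Leash")


def satisfies(cat, dog, collar, leash):
    """Evaluate (Cat OR Dog) AND (Collar OR Leash) for one combination."""
    return (cat or dog) and (collar or leash)


def present_terms(combo):
    """Names of the terms marked present in a combination."""
    return [name for name, present in zip(TERMS, combo) if present]


# Every presence/absence pattern over the four terms that satisfies the query.
matching = [
    combo for combo in product([False, True], repeat=len(TERMS))
    if satisfies(*combo)
]
```

Nine of the sixteen patterns qualify: a document needs at least one of Cat or Dog and at least one of Collar or Leash, so a document with only Cat and Dog does not match.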

The merge (Boolean AND)

• Walk through the two postings simultaneously, in time linear in the total number of postings entries
– If the list lengths are x and y, the merge takes O(x+y) operations.
– Crucial: postings sorted by docID.

Courtesy of Manning and Raghavan
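
The merge can be sketched as a two-pointer walk over both lists. This assumes each postings list is a Python list of docIDs in ascending order with no duplicates; the example lists are made up.

```python
def intersect(postings_a, postings_b):
    """Intersect two postings lists that are sorted by docID.

    Each step advances at least one pointer, so the loop runs at most
    len(postings_a) + len(postings_b) times.
    """
    answer = []
    i = j = 0
    while i < len(postings_a) and j < len(postings_b):
        if postings_a[i] == postings_b[j]:
            answer.append(postings_a[i])  # docID present in both lists
            i += 1
            j += 1
        elif postings_a[i] < postings_b[j]:
            i += 1  # smaller docID cannot appear later in the other list
        else:
            j += 1
    return answer


# Hypothetical postings for two terms.
common = intersect([2, 4, 8, 16, 32, 64, 128], [1, 2, 3, 5, 8, 13, 21, 34])
```

The sort order is what makes it safe to discard the smaller docID at each step; on unsorted lists the same walk would miss matches.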

Processing Boolean Queries

• If sorted document lists are available
– A new ‘array’ can be created from existing arrays of documents
• Otherwise
– Use a linear-time algorithm
- Hashtables support union, intersection and set difference
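
When the lists are not sorted, a hash table gives expected linear-time versions of the same operations. The sketch below uses Python's built-in `set`, which is hash-based, and assumes each input list has no repeated docIDs; the function names are illustrative.

```python
def intersection(docs_a, docs_b):
    """Documents in both lists, in the order they appear in docs_b."""
    seen = set(docs_a)  # hash table built in one pass over docs_a
    return [d for d in docs_b if d in seen]


def union(docs_a, docs_b):
    """Documents in either list: docs_a first, then new ones from docs_b."""
    seen = set(docs_a)
    return list(docs_a) + [d for d in docs_b if d not in seen]


def difference(docs_a, docs_b):
    """Documents in docs_a that are not in docs_b."""
    excluded = set(docs_b)
    return [d for d in docs_a if d not in excluded]


# Hypothetical unsorted document lists.
a = [7, 2, 9, 4]
b = [4, 11, 7]
```

Each function makes one pass over each list, with expected constant-time membership tests, so the total expected work is proportional to the combined list lengths. The results come back unsorted, unlike the output of the sorted merge.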
