Information Retrieval Overview, Lecture Slides - Computer Science, Slides of Artificial Intelligence

Prof. Paul McNamee, Information Retrieval, Computer Science, Artificial Intelligence, Johns Hopkins University, What is IR

Typology: Slides

2010/2011

Uploaded on 11/09/2011

http://apl.jhu.edu/~paulmac/ir.html
CS 605.744 Information Retrieval
Spring 2011
Paul McNamee
Johns Hopkins University
Overview

 Course Pragmatics

  • Schedule of topics
  • Grading policy

 Overview of Text Retrieval

 Boolean

  • Document Representations
  • Queries

 Tokenization


Objectives of the Course

 To get a basic theoretical understanding of IR

  • Representing and indexing text documents
  • Retrieval models
  • Implementing querying efficiently

 To examine application areas and research topics such as:

  • Text Classification
  • Cross-language retrieval
  • Retrieval on the Web
  • Speech Retrieval

 To understand how IR performance is assessed:

  • Recall/Precision and other metrics

 Gain hands-on experience building an IR system

 To introduce selected topics in computational linguistics

Course Philosophy

 Somewhere between a lecture, seminar, and laboratory

  • Lecture, but more interactive == more interesting
  • Studying assigned readings prior to a class will elevate the class discussions
  • Homeworks and projects provide hands-on experience
  • Students present paper summaries and projects to whole class

Research Project

 Goal is to research an area or develop an idea that you would like to explore in greater depth

 Deliverables include a written report and an oral presentation

 More details in 2-3 weeks

General Introduction

 What is Information Retrieval?

  • How does it differ from database querying?

 History of IR

  • Field over 40 years old
  • Why so popular now?
  • Impact of the Web

 Beyond vanilla document retrieval

 Resources

I never waste memory on things that can easily be stored and retrieved from elsewhere – A. Einstein

What is Information Retrieval?

 Field concerned with the organization, storage, and retrieval of information

  • Especially text
  • Also retrieval of semistructured data (XML), video, images, speech, music

 Requires algorithms and data structures

  • For manipulating natural language
  • To efficiently store and process data

 Related fields

  • Natural Language Processing, Library Science
  • Computational Linguistics, Digital Libraries

What makes IR a hard problem

 Under good circumstances

  • Text is unstructured
  • In the hardest cases, it requires understanding of semantics
  • Human language presents distinct problems (e.g., ambiguity)

 Under hard circumstances

  • Patent retrieval: applications tend to use low content words; why?
  • One estimate is that 40% of web pages change monthly, many pages ‘lie’ about their content, new pages aren’t linked to

 Multimedia information

  • Hard to store (size), represent, and compare

Reason #3: Ambiguity

 English provides no canonical way to reference people and things

  • President Carter, Pres. Carter, Jimmy Carter; the 39th president, Rosalynn Carter’s husband

 Ambiguity pervasive

  • jaguar, bank, see, hornet, red, aa

 Distinctions vary in granularity

  • cool (popular) vs. cool (low in temperature)
  • list (to name items in a list) vs. list (to include in a list)

Reason #4: Word Choices

 Speakers of a language learn preferential ways of expressing things:

  • strong tea / powerful computers

 Documents have a limited vocabulary with discrete occurrences; words have many synonyms

  • query: ‘fast automobiles’
    • should match ‘fast cars’

 Inflectional forms

  • query about ‘juggling’
    • should match jugglers, juggler, jongleur
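A minimal sketch of how a system might bridge these word-choice gaps, assuming a hand-made synonym table and a crude suffix stripper. `SYNONYMS`, `SUFFIXES`, `normalize`, and `matches` are hypothetical names invented for this illustration; real systems use far more careful stemmers and thesauri.

```python
# Hypothetical synonym table and suffix list, chosen only for this illustration.
SYNONYMS = {"automobile": "car"}
SUFFIXES = ("ing", "ers", "er", "s")


def normalize(token):
    """Lowercase, strip at most one crude suffix, then map synonyms to one form."""
    t = token.lower()
    for suffix in SUFFIXES:
        # Keep at least three characters so short words are left alone.
        if t.endswith(suffix) and len(t) - len(suffix) >= 3:
            t = t[: -len(suffix)]
            break
    return SYNONYMS.get(t, t)


def matches(query, document):
    """True if every normalized query term occurs in the normalized document."""
    doc_terms = {normalize(w) for w in document.split()}
    return all(normalize(w) in doc_terms for w in query.split())


print(matches("fast automobiles", "fast cars"))                                # True
print(normalize("juggling") == normalize("jugglers") == normalize("juggler"))  # True
print(normalize("juggling") == normalize("jongleur"))                          # False
```

The last line shows the limit of suffix stripping: ‘jongleur’ is related in meaning but not in surface form, so it would need something like a synonym entry to match.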

Pre-history of IR

 300 BCE Euclid’s treatise, The Elements
 300 BCE Ptolemy I founds Great Library at Alexandria, which grows to include 700,000+ volumes (scrolls)
 391 Great Library destroyed by fanatics (implications for the Web?)
 600 Number 0 used in India
 825 Muhammad ibn Musa Al-Khowarizmi writes treatise on algebra; the English word algorithm is derived from his name
 1230s St. Anthony (of Padova) creates concordance for Latin Vulgate
 1247 Cardinal Hugo employs 500 monks to build a concordance
 1450s Johannes Gutenberg creates printing press
 1550 First English concordance of entire Bible
 1642 Blaise Pascal develops mechanical calculator. It performed subtraction by adding complements
 1714 Henry Mill conceives of the typewriter
 1837 Morse Code is an early text encoding scheme
 1857 Sir Charles Wheatstone stores Morse codes on paper tapes; they could be prepared offline and transmitted later

The Great Library Rebuilt (2002)

Early IR Systems

 Card-based

 Uniterm (Casey, Perry, Berry, Kent: 1958; developed and used from the mid-1940s)

EXCURSION 43821 90 241 52 63 34 25 66 17 58 49 130 281 92 83 44 75 86 57 88 119 640 122 93 104 115 146 97 158 139 870 342 157 178 199 207 248 269 298

LUNAR 12457 110 181 12 73 44 15 46 7 28 39 430 241 42 113 74 85 76 17 78 79 820 761 602 233 134 95 136 37 118 109 901 982 194 165 127 198 179 377 288 407

Advent of Computer Science

 1962 First Comp Sci. degree program offered by Purdue U.
 1963 ASCII standard developed
 1965 CD-ROM technology invented (James Russell)
 1969 ARPANET contains 4 hosts (23 in 1971)
 1969 UNIX operating system (Ritchie & Thompson)
 1972 Tomlinson sends first email message
 1975 Microsoft founded by Gates and Allen
 1977 Apple II personal computer
 1981 IBM PC
 1982 TCP/IP basis for NSFNet
 1984 Apple Macintosh with windowing interface
 1984 1,000 Internet hosts
 1988 Robert Morris, a Cornell U. graduate student, unleashes the ‘Internet Worm’
 1989 100,000 Internet hosts

Birth of the Web

 1989 Tim Berners-Lee invents World-Wide-Web
 1992 1,000,000 Internet hosts, but only 50 web sites
 1994 Two Stanford graduate students found Yahoo, a manually built online directory
 1995 AltaVista indexes 15 million web pages
 1996 Two other Stanford graduate students collaborate on Google
 1997 Lawrence and Giles paper characterizing Web
 1999 Excite search engine sold for $6.7 billion; around same time automotive division of Volvo sold for $6.3 billion.
 2000 1 billion web pages on public web; 10 million web sites, 93 million or so Internet hosts
 2002 Google claims 3 billion page index
 2004 Google IPO
 2004 Microsoft unleashes Web search engine
 2006 Google’s stock value exceeds $150 billion (> Coke, IBM, AT&T)
 2009 Microsoft rebrands Web search as Bing

Sources:
http://www.mcs.net/~jorn/html/net/timeline.html
http://ei.cs.vt.edu/~history/
http://www.maxmon.com/history.htm
http://www.computerhistory.org/
http://www.let.leidenuniv.nl/history/

Why is IR thriving today?

 Dropping prices for external storage is the greatest factor

 Other factors

  • Increased expectations and demonstrated utility
  • Web 2.0 / electronic commerce
  • Ease of use

From www.lesk.com

Beyond Text

 Images

  • Content based methods are difficult
  • Can try to make inferences based on filenames or coordinate text
  • Take up much more storage than text

 Video

  • Usually use sampled sequences of images

 Broadcast speech

  • 1000s of radio stations from around the world
  • Typical approach: transcribe speech into text (with errors) and treat as ‘normal’ text

 Scanned text

  • Like speech, scan (w/ errors) and index

 Maps, Diagrams, Music (open problems)

Google Images

Small, oval, potato-brown figures with faces. Like Fry Guy and Fry Girl they too wear a variety of costumes.

document text basis for search

Beyond Document Retrieval

 Users typically do not want to merely find documents of interest

 A. Broder (CTO AltaVista) taxonomy (11/00)

  • Informational needs
  • Navigation (e.g., surrogate bookmarks)
  • Transactional

 J. Prange (ARDA), Advanced Question-Answering

  • “A bridge too far or the right technology at the right time?”

Question-Answering Systems

 Not database front-ends

 FAQ-Finder

  • Indexes FAQ lists and tries to find responsive answers to common questions

 Ask Jeeves

  • Looks for web pages likely to contain answers to common, simple questions (e.g., “How do I make an apple pie?”)

 eHow

  • Web 2.0: free, user-contributed, ranked answers to common questions

 Text Retrieval Community studied for several years

  • TREC-8 evaluation (1999) was the first

Beyond Single Requests

 Commercial engines have high volume

  • 1999: Infoseek anecdotally reports
    • ~50M queries / day
    • ~600 queries / second over 10^8 collection
  • 2010 estimates
    • Google: 1-3 billion / day
    • Yahoo: 180 M / day
    • Bing: 80 M / day

 Ideally, user context should be leveraged

  • System can learn a profile over time
  • Benefits successive queries
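As a quick check on the volume figures above, daily query counts can be converted to an average per-second rate. This is a rough sketch: it assumes load is spread evenly over the day (real traffic peaks well above the average), and `queries_per_second` is a name made up for this example.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def queries_per_second(queries_per_day):
    """Average rate, assuming queries arrive evenly over 24 hours."""
    return queries_per_day / SECONDS_PER_DAY


# Daily volumes as quoted on the slide.
daily_volumes = {
    "Infoseek (1999)": 50e6,
    "Google, low estimate (2010)": 1e9,
    "Google, high estimate (2010)": 3e9,
    "Yahoo (2010)": 180e6,
    "Bing (2010)": 80e6,
}

for name, per_day in daily_volumes.items():
    print(f"{name}: ~{queries_per_second(per_day):,.0f} queries/second")
```

For Infoseek, 50M / 86,400 comes to about 579 queries per second, which agrees with the ~600 figure quoted above.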

Beyond Surfing

 Dual problem to ad hoc retrieval

  • Filter incoming messages relevant to a defined profile
    • Push technology vs. pull
    • Examples: Bloomberg news, book or movie recommendations, targeted advertisements, spam filtering

 Scenario:

  • You are a safety engineer for a large automotive manufacturer. You want to keep track of reports of accidents in a new vehicle
    • Don’t have access to a static collection of documents; instead, news stories and reports trickle in over time; relevance decisions must be made immediately
    • Can’t be plagued by too many false alarms, but also don’t want to miss relevant reports

 Technology isn’t mainstream, yet
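The filtering scenario can be sketched as a standing profile applied to each document as it arrives. This is only an illustration under strong assumptions: the profile is a plain set of terms, relevance is the fraction of profile terms present, and `make_filter`, the profile terms, and the sample reports are all invented here.

```python
def make_filter(profile_terms, threshold):
    """Build a yes/no relevance test from a fixed term profile."""
    profile = {t.lower() for t in profile_terms}
    if not profile:
        raise ValueError("profile must contain at least one term")

    def is_relevant(document):
        # Decide immediately, using only this one document.
        words = {w.strip(".,;:!?").lower() for w in document.split()}
        overlap = len(profile & words)
        return overlap / len(profile) >= threshold

    return is_relevant


# Hypothetical profile for the safety-engineer scenario.
is_relevant = make_filter(["accident", "crash", "recall", "brake"], threshold=0.5)

stream = [
    "Brake failure blamed for highway crash",
    "New model wins design award",
    "Dealer reports minor accident during test drive",
]
for report in stream:
    if is_relevant(report):
        print("ALERT:", report)
```

With a threshold of 0.5 only the first report raises an alert; lowering it to 0.25 would also catch the third. That is the trade-off between false alarms and missed reports described in the scenario.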

Research Software Systems

 Wumpus

  • U. Waterloo (Open source, C++)

 Terrier

  • Glasgow (Open source, Java)

 Lemur / Indri

  • Carnegie Mellon / UMass (C++ & Java bindings)

 Lucene

  • Apache/Jakarta (Java?)

 SMART

  • Developed at Cornell University (C)

 mg

  • From the authors of Managing Gigabytes (C)

 INQUERY

  • Univ. Massachusetts (Amherst). Available?

Some Commercial Systems

 Opentext

 Lexis-Nexis

 Verity

 Claritech

 Thomson / Reuters

  • bought Westlaw

 Autonomy

 Oracle

 Many others…

Boolean Queries

 INFIX operators

  • ((cat AND dog) OR (collar AND leash))

 NOT is UNARY PREFIX operator

  • ((cat AND dog) OR (collar AND (NOT dog)))

 AND and OR can be n-ary operators

  • (cat AND dog AND rabies AND shot)

 De Morgan’s Laws

  • NOT(a) AND NOT(b) = NOT(a OR b)
  • NOT(a) OR NOT(b) = NOT(a AND b)
  • NOT(NOT(a)) = a
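One way to make the operator rules concrete is to write a query as a nested tuple and evaluate it against the set of terms in a document. The tuple encoding and the `evaluate` function are assumptions made for this sketch, not a standard representation. The loop at the end checks the three identities above over every truth assignment.

```python
from itertools import product


def evaluate(expr, terms):
    """Evaluate a nested-tuple Boolean query against a document's term set."""
    if isinstance(expr, str):  # a bare term: is it in the document?
        return expr in terms
    op, *args = expr
    if op == "NOT":  # unary prefix operator
        (arg,) = args
        return not evaluate(arg, terms)
    if op == "AND":  # n-ary: every operand must hold
        return all(evaluate(a, terms) for a in args)
    if op == "OR":  # n-ary: at least one operand must hold
        return any(evaluate(a, terms) for a in args)
    raise ValueError(f"unknown operator: {op}")


# ((cat AND dog) OR (collar AND (NOT dog)))
query = ("OR", ("AND", "cat", "dog"), ("AND", "collar", ("NOT", "dog")))
print(evaluate(query, {"cat", "dog"}))       # True
print(evaluate(query, {"collar", "leash"}))  # True
print(evaluate(query, {"dog", "collar"}))    # False

# The identities hold for every combination of truth values.
for a, b in product([False, True], repeat=2):
    assert ((not a) and (not b)) == (not (a or b))
    assert ((not a) or (not b)) == (not (a and b))
    assert (not (not a)) == a
```

The n-ary form works the same way, e.g. `("AND", "cat", "dog", "rabies", "shot")`.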

Boolean Queries

 (Cat OR Dog) AND (Collar OR Leash)

  • Which of the following combinations satisfies this statement:

 Cat x x x x

 Dog x x x x x

 Collar x x x x

 Leash x x x
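To check answers to an exercise like this, one can enumerate every presence/absence combination of the four terms and test the query on each. This sketch lists all sixteen combinations rather than the specific candidates on the slide; `TERMS` and `satisfies` are names invented for the example.

```python
from itertools import product

TERMS = ("cat", "dog", "collar", "leash")


def satisfies(cat, dog, collar, leash):
    """(Cat OR Dog) AND (Collar OR Leash)"""
    return (cat or dog) and (collar or leash)


# Every assignment of present/absent to the four terms that satisfies the query.
satisfying = [combo for combo in product([False, True], repeat=4) if satisfies(*combo)]

for combo in satisfying:
    present = [term for term, flag in zip(TERMS, combo) if flag]
    print(", ".join(present))

print(len(satisfying), "of 16 combinations satisfy the query")  # 9 of 16
```

A combination satisfies the query exactly when it contains at least one of cat/dog and at least one of collar/leash, which gives 3 × 3 = 9 cases.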

The merge (Boolean AND)

 Walk through the two postings simultaneously, in time linear in the total number of postings entries

If the list lengths are x and y, the merge takes O(x+y) operations.

Crucial: postings sorted by docID.

Courtesy of Manning and Raghavan
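The merge described above can be sketched as a two-pointer walk over two postings lists sorted by docID. The function name `intersect` and the sample lists are illustrative assumptions.

```python
def intersect(p1, p2):
    """Boolean AND of two postings lists, each sorted by docID."""
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            # docID appears in both lists: keep it and advance both pointers.
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1  # p1 is behind; advance it
        else:
            j += 1  # p2 is behind; advance it
    return answer


# Hypothetical postings lists for two terms.
cat = [2, 4, 8, 16, 32, 64, 128]
dog = [1, 2, 3, 5, 8, 13, 21, 34]
print(intersect(cat, dog))  # [2, 8]
```

Each loop iteration advances at least one pointer, so the loop runs at most x+y times. Without the sort order, the comparison could not tell which pointer to advance.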

Processing Boolean Queries

 If sorted document lists are available

  • A new ‘array’ can be created from existing arrays of documents

 Otherwise

  • Use a linear-time algorithm
    • Hashtables support union, intersection and set difference
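When postings are not sorted, hash-based sets offer the same three operations. A minimal sketch using Python's built-in `set` type, which is hash-based; the docIDs are made up for the example.

```python
# Hypothetical unsorted postings, stored as hash sets of docIDs.
cat = {1, 2, 4, 7}
dog = {2, 3, 7, 9}
collar = {4, 7, 8}

print(sorted(cat & dog))     # intersection (AND):    [2, 7]
print(sorted(cat | dog))     # union (OR):            [1, 2, 3, 4, 7, 9]
print(sorted(collar - dog))  # difference (AND NOT):  [4, 8]

# (cat AND dog) OR (collar AND (NOT dog))
result = (cat & dog) | (collar - dog)
print(sorted(result))        # [2, 4, 7, 8]
```

Set difference handles `x AND (NOT y)` without building the complement of `y`, which would require enumerating every document in the collection.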