http://apl.jhu.edu/~paulmac/ir.html
CS 605.744 Information Retrieval
Spring 2011
Paul McNamee
Johns Hopkins University
Overview
Course Pragmatics
- Schedule of topics
- Grading policy
Overview of Text Retrieval
Boolean
- Document Representations
- Queries
Tokenization
Objectives of the Course
To get a basic theoretical understanding of IR
- Representing and indexing text documents
- Retrieval models
- Implementing querying efficiently
To examine application areas and research topics such as:
- Text Classification
- Cross-language retrieval
- Retrieval on the Web
- Speech Retrieval
To understand how IR performance is assessed:
- Recall/Precision and other metrics
Gain hands-on experience building an IR system
To introduce selected topics in computational linguistics
Course Philosophy
Somewhere between a lecture, seminar,
and laboratory
- Lecture, but more interactive == more
interesting
- Studying assigned readings prior to a class
will elevate the class discussions
- Homeworks and projects provide hands-on
experience
- Students present paper summaries and
projects to whole class
Research Project
Goal is to research an area or develop an
idea that you would like to explore in
greater depth
Deliverables include a written report and
an oral presentation
More details in 2-3 weeks
General Introduction
What is Information Retrieval?
- How does it differ from database querying?
History of IR
- Field over 40 years old
- Why so popular now?
- Impact of the Web
Beyond vanilla document retrieval
Resources
I never waste memory on things that can easily be stored and retrieved from elsewhere – A. Einstein
What is Information Retrieval?
Field concerned with the organization, storage,
and retrieval of information
- Especially text
- Also retrieval of semistructured data (XML), video images, speech, music
Requires algorithms and data structures
- For manipulating natural language
- To efficiently store and process data
Related fields
- Natural Language Processing, Library Science
- Computational Linguistics, Digital Libraries
What makes IR a hard problem
Under good circumstances
- Text is unstructured
- In the hardest cases, it requires understanding of semantics
- Human language presents distinct problems (e.g., ambiguity)
Under hard circumstances
- Patent retrieval: applications tend to use low content words; why?
- One estimate is that 40% of web pages change monthly, many pages ‘lie’ about their content, new pages aren’t linked to
Multimedia information
- Hard to store (size), represent, and compare
Reason #3: Ambiguity
English provides no canonical way to
reference people and things
- President Carter, Pres. Carter, Jimmy
Carter; the 39th president, Rosalynn Carter’s
husband
Ambiguity pervasive
- jaguar, bank, see, hornet, red, aa,
Distinctions vary in granularity
- cool (popular) vs. cool (low in temperature)
- list (to name items in a list) vs. list (to include in
a list)
Reason #4: Word Choices
Speakers of a language learn preferential
ways of expressing things:
- strong tea / powerful computers
Documents have a limited vocabulary
with discrete occurrences; words have
many synonyms
- query: ‘fast automobiles’
Inflectional forms
- query about ‘juggling’
- should match jugglers, juggler, jongleur
Pre-history of IR
300 BCE Euclid’s treatise, The Elements
300 BCE Ptolemy I founds Great Library at Alexandria, which grows to include 700,000+ volumes (scrolls)
391 Great Library destroyed by fanatics (implications for the Web?)
600 Number 0 used in India
825 Muhammad ibn Musa Al-Khowarizmi writes treatise on algebra; the English word algorithm is derived from his name
1230s St. Anthony (of Padova) creates concordance for Latin Vulgate
1247 Cardinal Hugo employs 500 monks to build a concordance
1470s Johannes Gutenberg creates printing press
1550 First English concordance of entire Bible
1640 Blaise Pascal develops mechanical calculator. It performed subtraction by adding complements
1714 Henry Mills conceives of the typewriter
1837 Morse Code is an early text encoding scheme
1857 Sir Charles Wheatstone stores Morse codes on paper tapes; they could be prepared offline and transmitted later
The Great Library Rebuilt (2002)
Early IR Systems
Card-based
Uniterm (Casey, Perry, Berry, Kent: 1958;
developed and used from the mid-1940s)
EXCURSION 43821
    0    1    2    3    4    5    6    7    8    9
   90  241   52   63   34   25   66   17   58   49
  130  281   92   83   44   75   86   57   88  119
  640       122   93  104  115  146   97  158  139
  870       342                      157  178  199
                                     207  248  269
                                          298
LUNAR 12457
    0    1    2    3    4    5    6    7    8    9
  110  181   12   73   44   15   46    7   28   39
  430  241   42  113   74   85   76   17   78   79
  820  761  602  233  134   95  136   37  118  109
       901  982       194  165       127  198  179
                                     377  288
                                     407
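Each Uniterm card lists the numbers of the documents indexed under one term, so documents on both topics are the numbers the two cards share. A minimal sketch of that lookup follows; the function name `coordinate` is an illustrative assumption, and the document numbers are transcribed from the EXCURSION and LUNAR cards.

```python
# Document numbers transcribed from the EXCURSION and LUNAR cards.
EXCURSION = {
    90, 241, 52, 63, 34, 25, 66, 17, 58, 49,
    130, 281, 92, 83, 44, 75, 86, 57, 88, 119,
    640, 122, 93, 104, 115, 146, 97, 158, 139,
    870, 342, 157, 178, 199, 207, 248, 269, 298,
}
LUNAR = {
    110, 181, 12, 73, 44, 15, 46, 7, 28, 39,
    430, 241, 42, 113, 74, 85, 76, 17, 78, 79,
    820, 761, 602, 233, 134, 95, 136, 37, 118, 109,
    901, 982, 194, 165, 127, 198, 179, 377, 288, 407,
}


def coordinate(*cards):
    """Return the document numbers that appear on every card (hypothetical helper)."""
    common = set(cards[0])
    for card in cards[1:]:
        common &= set(card)
    return sorted(common)


print(coordinate(EXCURSION, LUNAR))  # [17, 44, 241]
```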
Advent of Computer Science
1962 First Comp Sci. degree program offered by Purdue U.
1963 ASCII standard developed
1965 CD-ROM technology invented (James Russell)
1969 ARPANET contains 4 hosts (23 in 1971)
1969 UNIX operating system (Ritchie & Thompson)
1972 Tomlinson sends first email message
1975 Microsoft founded by Gates and Allen
1977 Apple II personal computer
1981 IBM PC
1982 TCP/IP basis for NSFNet
1984 Apple Macintosh with windowing interface
1984 1,000 Internet hosts
1988 Robert Morris, a Cornell U. graduate student, unleashes the ‘Internet Worm’
1989 100,000 Internet hosts
Birth of the Web
1989 Tim Berners-Lee invents World-Wide-Web
1992 1,000,000 Internet hosts, but only 50 web sites
1994 Two Stanford graduate students found Yahoo, a manually built online directory
1995 AltaVista indexes 15 million web pages
1996 Two other Stanford graduate students collaborate on Google
1997 Lawrence and Giles paper characterizing Web
1999 Excite search engine sold for $6.7 billion; around same time automotive division of Volvo sold for $6.3 billion.
2000 1 billion web pages on public web; 10 million web sites, 93 million or so Internet hosts
2002 Google claims 3 billion page index
2004 Google IPO
2004 Microsoft unleashes Web search engine
2006 Google’s stock value exceeds $150 billion (> Coke, IBM, AT&T)
2009 Microsoft rebrands Web search as Bing
Sources:
http://www.mcs.net/~jorn/html/net/timeline.html
http://ei.cs.vt.edu/~history/
http://www.maxmon.com/history.htm
http://www.computerhistory.org/
http://www.let.leidenuniv.nl/history/
Why is IR thriving today?
Dropping prices for
external storage are
the greatest factor
Other factors
- Increased expectations and demonstrated utility
- Web 2.0 / electronic commerce
- Ease of use
From www.lesk.com
Beyond Text
Images
- Content based methods are difficult
- Can try to make inferences based on filenames or coordinate text
- Take up much more storage than text
Video
- Usually use sampled sequences of images
Broadcast speech
- 1000s of radio stations from around the world
- Typical approach: transcribe speech into text (with errors) and treat as ‘normal’ text
Scanned text
- Like speech, scan (w/ errors) and index
Maps, Diagrams, Music (open problems)
Google Images
Small, oval, potato-brown figures with faces. Like Fry Guy and Fry Girl they too wear a variety of costumes.
Document text is the basis for search
Beyond Document Retrieval
Users typically do not want to merely
find documents of interest
A. Broder (CTO AltaVista) taxonomy (11/00)
- Informational needs
- Navigation (e.g., surrogate bookmarks)
- Transactional
J. Prange (ARDA), Advanced Question-Answering
- “A bridge too far or the right technology at
the right time?”
Question-Answering Systems
Not database front-ends
FAQ-Finder
- Indexes FAQ lists and tries to find responsive answers to common questions
Ask Jeeves
- Looks for web pages likely to contain answers to common, simple questions (e.g., “How do I make an apple pie?”)
eHow
- Web 2.0: free, user-contributed, ranked answers to common questions
The text retrieval community has studied question answering for several
years
- The TREC-8 evaluation (1999) was the first
Beyond Single Requests
Commercial engines have high volume
- 1999: Infoseek anecdotally reports
  - ~50M queries / day
  - ~600 queries / second over 10^8 collection
- 2010 estimates
  - Google: 1-3 billion / day
  - Yahoo: 180 M / day
  - Bing: 80 M / day
Ideally, user context should be leveraged
- System can learn a profile over time
- Benefits successive queries
Beyond Surfing
Dual problem to ad hoc retrieval
- Filter incoming messages relevant to a defined profile
- Push technology vs. pull
- Examples: Bloomberg news, book or movie recommendations, targeted advertisements, spam filtering
Scenario:
- You are a safety engineer for a large automotive manufacturer. You want to keep track of reports of accidents in a new vehicle
- Don’t have access to a static collection of documents; instead, news stories and reports trickle in over time; relevance decisions must be made immediately
- Can’t be plagued by too many false alarms, but also don’t want to miss relevant reports
Technology isn’t mainstream, yet
Research Software Systems
Wumpus
- U. Waterloo (Open source, C++)
Terrier
- Glasgow (Open source, Java)
Lemur / Indri
- Carnegie Mellon / UMass (C++ & Java bindings)
Lucene
- Apache/Jakarta (Java?)
SMART
- Developed at Cornell University (C)
mg
- From the authors of Managing Gigabytes (C)
INQUERY
- Univ. Massachusetts (Amherst). Available?
Some Commercial Systems
Opentext
Lexis-Nexis
Verity
Claritech
Thomson / Reuters
Autonomy
Oracle
Many others…
Boolean Queries
INFIX operators
- ((cat AND dog) OR (collar AND leash))
NOT is UNARY PREFIX operator
- ((cat AND dog) OR (collar AND (NOT dog)))
AND and OR can be n-ary operators
- (cat AND dog AND rabies AND shot)
De Morgan’s Laws
- NOT(a) AND NOT(b) = NOT(a OR b)
- NOT(a) OR NOT(b)= NOT(a AND b)
- NOT(NOT(a)) = a
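These identities can be checked exhaustively, since two Boolean variables have only four combinations. The sketch below does so; the function name `de_morgan_holds` is an illustrative assumption, not part of the course material.

```python
from itertools import product


def de_morgan_holds():
    """Check the three identities for every assignment of a and b."""
    for a, b in product([False, True], repeat=2):
        if ((not a) and (not b)) != (not (a or b)):
            return False
        if ((not a) or (not b)) != (not (a and b)):
            return False
        if (not (not a)) != a:
            return False
    return True


print(de_morgan_holds())  # True
```

In a retrieval system the same rewrites can help avoid computing large complements, for example by evaluating NOT(a OR b) once instead of complementing a and b separately.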
Boolean Queries
(Cat OR Dog) AND (Collar OR Leash)
- Which of the following combinations
satisfies this statement:
Cat x x x x
Dog x x x x x
Collar x x x x
Leash x x x
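One way to check a combination against the statement is to treat each document as a set of terms and evaluate the two OR groups. This is a small sketch under that assumption; the function `matches` and the four sample documents are hypothetical and do not correspond to the columns of the table.

```python
def matches(terms):
    """True if the term set satisfies (Cat OR Dog) AND (Collar OR Leash)."""
    t = {w.lower() for w in terms}
    has_animal = bool(t & {"cat", "dog"})
    has_gear = bool(t & {"collar", "leash"})
    return has_animal and has_gear


# Hypothetical documents, each reduced to the query terms it contains.
docs = {
    1: {"cat", "collar"},
    2: {"cat", "dog"},
    3: {"dog", "leash"},
    4: {"collar", "leash"},
}

hits = [doc_id for doc_id, terms in sorted(docs.items()) if matches(terms)]
print(hits)  # [1, 3]
```

Documents 2 and 4 fail because each satisfies only one of the two parenthesized groups.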
The merge (Boolean AND)
Walk through the two postings
simultaneously, in time linear in the total
number of postings entries
If the list lengths are x and y, the merge takes O(x+y) operations.
Crucial: postings sorted by docID.
Courtesy of Manning and Raghavan
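The merge can be sketched as a two-pointer walk over both lists. The version below is a minimal illustration that assumes each postings list is a Python list of docIDs in ascending order; the function name `intersect` and the sample postings are made up for the example.

```python
def intersect(p1, p2):
    """Return docIDs present in both sorted postings lists."""
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            # Same docID in both lists: keep it and advance both pointers.
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            # p1[i] cannot occur later in p2, so skip it.
            i += 1
        else:
            j += 1
    return answer


# Hypothetical postings for two terms.
cat = [1, 2, 4, 11, 31, 45, 173, 174]
dog = [2, 31, 54, 101]
print(intersect(cat, dog))  # [2, 31]
```

Every loop iteration advances at least one pointer, which is where the O(x+y) bound comes from. If either list were unsorted, the skip steps could miss matches.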
Processing Boolean Queries
If sorted document lists are available
- A new ‘array’ can be created from existing
arrays of documents
Otherwise
- Use a linear-time algorithm
- Hashtables support union, intersection and set difference
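When document lists are not sorted, a hash-based set gives the same Boolean operations. The sketch below uses Python's built-in `set`, which is hash-based; the document IDs are invented for illustration.

```python
# Hypothetical unsorted document lists for two terms.
cat = {7, 2, 9, 4}
dog = {4, 11, 2}

cat_and_dog = cat & dog        # intersection: cat AND dog
cat_or_dog = cat | dog         # union: cat OR dog
cat_not_dog = cat - dog        # set difference: cat AND (NOT dog)

print(sorted(cat_and_dog))     # [2, 4]
print(sorted(cat_or_dog))      # [2, 4, 7, 9, 11]
print(sorted(cat_not_dog))     # [7, 9]
```

Results come back in no particular order, so they are sorted here only for display.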