
















These notes cover various models of information retrieval, including Boolean, vector, structure, and probabilistic models. They also cover text processing techniques such as decoding, filtering, tokenization, stopword elimination, stemming, and selecting index terms, and they emphasize the importance of pre-processing text to improve the accuracy of information retrieval.

















INFO 202 - 12 November 2008
Bob Glushko
Recall and Precision
Models for Information Retrieval
Text Processing Operations and Challenges
The Boolean Model
We begin with introductory conceptual and technical foundations for IR. The issues and models get progressively more complicated over the next 5 lectures. On the Monday after Thanksgiving we have a lecture on multimedia retrieval (with lots of demos), followed by a lecture on applications of IR and natural language processing. The last new material is "alumni day" (December 8), when some former students talk about their jobs, which emphasize IO and IR. The last class meeting of the semester (December 10) is a course review to prepare you for a three-hour final exam on December 15.
RECALL is the proportion of the relevant documents that are retrieved.
PRECISION is the proportion of the retrieved documents that are relevant.
Goal: high recall and high precision. Get as much good stuff as possible while getting as little junk as possible.
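The two definitions above can be made concrete with a small sketch. The function name `precision_recall` and the document identifiers are hypothetical, chosen only for illustration; the sketch assumes that the set of relevant documents for a query is already known.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a single query.

    retrieved: document ids returned by the system
    relevant:  document ids judged relevant (assumed to be known)
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant  # relevant documents that were retrieved
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical query result: 4 documents retrieved, 5 relevant overall.
retrieved = ["d1", "d2", "d3", "d4"]
relevant = ["d2", "d4", "d7", "d8", "d9"]
p, r = precision_recall(retrieved, relevant)
print(f"precision={p:.2f} recall={r:.2f}")  # precision=0.50 recall=0.40
```

Two of the four retrieved documents are relevant (precision 0.5), and two of the five relevant documents were found (recall 0.4).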
STRUCTURE models -- combine representations of terms with information about structures within documents (i.e., hierarchical organization) and between documents (i.e., hypertext links and other explicit relationships) to determine which parts of documents and which documents are most important and relevant.
PROBABILISTIC models -- documents are represented by index terms, and the key assumption is that the terms are distributed differently in relevant and non-relevant documents.
A document is any individually retrievable item in the "pile of text" that makes up the COLLECTION. Sometimes the boundaries that define documents are obvious or conventional (web search returns a web page), but sometimes they aren't. "Carving up" or "chunking" large documents into smaller text passages may be required for some collections or some user interfaces. A collection might contain any number of documents; web search engines index billions of pages.
A query is the expression of a user's information needs and can take many forms: a natural language description of the need, or an artificial and restricted language. Restrictions on the vocabulary limit the words that can be used in queries. Restrictions on syntax limit the ways words can be combined in logical expressions. These restrictions mean that queries may be unable to express the information need completely or accurately. The user interface(s) to the IR system influence the kinds of queries that the user can express (or express easily).
Not all words are equally useful indicators of what a document is about. Nouns and noun groups carry more "aboutness" than adjectives, adverbs, and verbs. Very frequent words that occur in all or most documents add NOISE because they cannot discriminate between documents. So it is worthwhile to pre-process the text of documents to select a smaller set of terms that better represent them; these are called the INDEX terms.
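One simple way to act on this observation is to drop any word that occurs in too large a share of the documents. The sketch below is only an illustration: the function name `select_index_terms`, the 80% threshold, and the sample documents are assumptions, and real systems often use a fixed stopword list instead.

```python
from collections import Counter


def select_index_terms(docs, max_doc_fraction=0.8):
    """Keep only terms that occur in at most max_doc_fraction of the documents."""
    tokenized = [set(doc.lower().split()) for doc in docs]
    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter(term for terms in tokenized for term in terms)
    cutoff = max_doc_fraction * len(docs)
    return [sorted(t for t in terms if doc_freq[t] <= cutoff) for terms in tokenized]


docs = ["the cat sat on the mat", "the dog chased the cat", "the bird sang"]
for terms in select_index_terms(docs):
    print(terms)
```

Here "the" occurs in every document, so it cannot discriminate between them and is dropped; "cat" occurs in two of three documents and is kept.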
FILTERING is removing surrounding header or format information from the text to be processed. What you filter depends on the encoding format or document type. You'd probably discard HTML markup before indexing. You'd almost certainly save XML tags for indexing. You'd probably want to use the rich metadata in email headers.
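As a rough sketch of discarding HTML markup before indexing, the following uses `html.parser.HTMLParser` from the Python standard library. The class name `TextExtractor` and the helper `strip_html` are hypothetical. Note the simplification: text inside `<script>` and `<style>` elements is not filtered out here.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect only the text content of an HTML document, ignoring tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called by the parser for each run of text between tags.
        self.chunks.append(data)


def strip_html(markup):
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()
    # Join the text runs and normalize whitespace.
    return " ".join(" ".join(parser.chunks).split())


page = "<html><body><h1>Recall</h1><p>and <b>precision</b></p></body></html>"
print(strip_html(page))  # Recall and precision
```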
Some character sequences are hard to tokenize because the tokens include complex alphanumeric structure or punctuation syntax:
[email protected]
10/26/
October 26, 1953
55 B.C
B-
128.32.226.
My PGP key is 324a3df234ch23e
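A minimal sketch of one way to handle such tokens: try a few special-case patterns before falling back to plain words. The pattern set, the name `tokenize`, and the sample values (`user@example.com`, `192.0.2.1`, `11/12/2008`) are illustrative assumptions, not a complete tokenizer and not the examples from the slide.

```python
import re

# Alternatives are tried left to right, so the special cases come first.
TOKEN_RE = re.compile(
    r"""
    [\w.+-]+@[\w-]+(?:\.[\w-]+)+    # email address
  | \d{1,3}(?:\.\d{1,3}){3}         # dotted-quad IP address
  | \d{1,2}/\d{1,2}/\d{2,4}         # numeric date such as 11/12/2008
  | \w+(?:'\w+)?                    # plain word, optionally with an apostrophe
    """,
    re.VERBOSE,
)


def tokenize(text):
    return TOKEN_RE.findall(text)


text = "Mail user@example.com from 192.0.2.1 on 11/12/2008."
print(tokenize(text))
# ['Mail', 'user@example.com', 'from', '192.0.2.1', 'on', '11/12/2008']
print(re.findall(r"\w+", text))  # a naive tokenizer breaks the same tokens apart
```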
The language that the characters represent needs to be identified during decoding because it influences the order and nature of tokenization. In languages that are written right-to-left, like Arabic and Hebrew, left-to-right text such as numbers and dollar amounts can be interspersed.
In German, compound nouns don't have spaces between the tokens: Lebensversicherungsgesellschaftsangestellter = "life insurance company employee".
And these problems in "segmented languages" that use white space and punctuation to delimit words seem trivial compared to the problems of tokenizing Oriental languages that are "non-segmented". These languages have ideographic characters that can appear as one-character words, but they can also combine to create new words. The analogous problem in English would be the word "TOGETHER" -- do we treat it as one word or as three separate words, "TO GET HER"?
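The "TOGETHER" ambiguity can be illustrated with greedy longest-match segmentation against a word list. This is one common baseline approach and is offered only as a sketch; the function name `max_match` and both lexicons are hypothetical.

```python
def max_match(text, lexicon):
    """Greedy left-to-right segmentation: always take the longest known word."""
    tokens, i = [], 0
    while i < len(text):
        # Try the longest candidate starting at position i first.
        for j in range(len(text), i, -1):
            if text[i:j] in lexicon:
                tokens.append(text[i:j])
                i = j
                break
        else:
            # No known word starts here: emit a single character and move on.
            tokens.append(text[i])
            i += 1
    return tokens


print(max_match("together", {"to", "get", "her", "together"}))  # ['together']
print(max_match("together", {"to", "get", "her"}))              # ['to', 'get', 'her']
```

The result depends entirely on what the lexicon contains, which is why segmentation of non-segmented text is much harder than splitting on white space.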
MORPHOLOGY is the part of linguistics concerned with the mechanisms by which natural languages create words and word forms from smaller units. These basic building blocks are called MORPHEMES and can express semantic concepts (when they are called ROOTS) or abstract features like "pastness" or "plural". Every natural language contains about 10,000 morphemes, and because of how they combine to create words, the number of words is an order of magnitude greater.
INFLECTION is the morphological mechanism that changes the form of a word to handle tense, aspect, agreement, etc. It never changes the part-of-speech (grammatical category).
dog, dogs
tengo, tienes, tenemos, tienen
DERIVATION is the mechanism for creating new words, usually of a different part-of-speech category, by adding a BOUND MORPH to a BASE MORPH.
build + ing -> building; health + y -> healthy
Morphological analysis of a language is often used in information retrieval and other low-level text processing applications (hyphenation, spelling correction) because solving problems using root forms and rules is more scalable and robust than solving them using word lists. Natural languages are generative, with new words continually being invented. Many misspellings of common words are obscure low-frequency words, so adding them to a misspelling list would make it impossible to check spellings for the latter.
STEMMING is morphological processing to remove prefixes and suffixes to leave the root form of words. Stemming reduces many related words and word forms to a common canonical form. This makes it possible to retrieve documents when they contain the meaning we're looking for even if the form of the search word doesn't exactly match what's in the documents. In English, inflectional morphology is relatively easy to handle and "dumb" stemmers (e.g., iteratively remove suffixes, matching the longest sequence in a rewrite rule) perform acceptably. Derivational morphology is more difficult.
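A "dumb" suffix-stripping stemmer of the kind described above might look like the following sketch. The rule list, the minimum stem length, and the name `stem` are assumptions made for illustration; this is not the Porter stemmer or any other published algorithm.

```python
# (suffix, replacement) rewrite rules, ordered longest suffix first.
RULES = [("ies", "y"), ("ing", ""), ("ed", ""), ("es", ""), ("s", "")]


def stem(word, min_stem=3):
    """Iteratively strip the longest matching suffix while the stem stays long enough."""
    word = word.lower()
    changed = True
    while changed:
        changed = False
        for suffix, replacement in RULES:
            if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
                word = word[: len(word) - len(suffix)] + replacement
                changed = True
                break  # restart from the longest rule
    return word


for w in ["retrieving", "retrieved", "retrieves", "queries", "dogs", "sing"]:
    print(w, "->", stem(w))
```

The three forms of "retrieve" all reduce to the same stem, "retriev", so a query using one form can match documents containing another. The stem need not be a real word. Rules this crude also over-strip some words, which is part of why derivational morphology needs more care.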
An index is a data structure that records information about the occurrences of terms in documents. A term-document matrix -- rows for terms, columns for documents -- is one such data structure.
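A term-document matrix for a toy collection can be built as follows. The function name `term_document_matrix` and the three sample documents are hypothetical, and tokenization is reduced to lowercasing and splitting on white space.

```python
def term_document_matrix(docs):
    """Return (terms, matrix) where matrix[i][j] counts term i in document j."""
    tokenized = [doc.lower().split() for doc in docs]
    terms = sorted({term for tokens in tokenized for term in tokens})
    matrix = [[tokens.count(term) for tokens in tokenized] for term in terms]
    return terms, matrix


docs = ["recall and precision", "precision matters", "recall recall"]
terms, matrix = term_document_matrix(docs)
for term, row in zip(terms, matrix):
    print(f"{term:<10} {row}")
```

Each row is a term and each column a document. Even in this tiny example several cells are zero.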
Using a term-document matrix index representation is both infeasible and nonsensical for any substantial collection of documents, because the matrix would be enormous and almost entirely empty. So instead we divide the index into two parts. A DICTIONARY is a list of the terms. A POSTINGS LIST is the list of documents in which each term occurs (usually with frequency and position information within each document).
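The dictionary-plus-postings split can be sketched in a few lines. The name `build_inverted_index`, the use of list positions as document ids, and the in-memory dict layout are all assumptions for illustration; production indexes use compressed on-disk structures.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Return (dictionary, postings).

    dictionary: sorted list of terms
    postings:   term -> {doc_id: [positions of the term in that document]}
    """
    postings = defaultdict(dict)
    for doc_id, text in enumerate(docs):
        for position, term in enumerate(text.lower().split()):
            postings[term].setdefault(doc_id, []).append(position)
    return sorted(postings), dict(postings)


docs = ["recall and precision", "precision matters", "recall recall"]
dictionary, postings = build_inverted_index(docs)
for term in dictionary:
    # The frequency within a document is the number of recorded positions.
    summary = {doc_id: len(pos) for doc_id, pos in postings[term].items()}
    print(f"{term:<10} positions={postings[term]} frequencies={summary}")
```

Only the documents that actually contain a term appear in its postings list, so nothing is stored for the zero cells of the matrix.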