Information Retrieval Models and Text Processing - Prof. R. J. Glushko, Study notes of Information Technology

These notes cover various models of information retrieval, including Boolean, vector, structure, and probabilistic models. They also cover text processing techniques such as decoding, filtering, tokenization, stopword elimination, stemming, and selecting index terms, and they emphasize the importance of pre-processing text to improve the accuracy of information retrieval.

22. Text Processing and Boolean Models
INFO 202 - 12 November 2008
Bob Glushko
Plan for Today's Class
Recall and Precision
Models for Information Retrieval
Text Processing Operations and Challenges
The Boolean Model

Overview of Remainder of Semester

We begin with introductory conceptual and technical foundations for IR.
The issues and models get progressively more complicated for the next 5 lectures.
On the Monday after Thanksgiving we have a lecture on multimedia retrieval (with lots of demos), followed by a lecture on applications of IR and natural language processing.
The last new material is "alumni day" (December 8), when some former students talk about their jobs, which emphasize IO and IR.
The last class meeting of the semester (December 10) is a course review to prepare you for a three-hour final exam on December 15.

Schematic View of Classical Search

Recall and Precision [2]

RECALL is the proportion of the relevant documents that are retrieved.
PRECISION is the proportion of the retrieved documents that are relevant.
Goal: high recall and high precision, that is, get as much good stuff as possible while getting as little junk as possible.
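
The two definitions can be computed directly from sets of document ids. The following is a minimal sketch, not part of the original slides; the function name and the document ids are hypothetical, and it assumes relevance judgments are available as a set.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a single query.

    retrieved: ids of the documents the system returned
    relevant:  ids of the documents judged relevant in the collection
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant  # relevant documents that were retrieved
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical query: 4 documents retrieved, 3 of them relevant,
# out of 6 relevant documents in the whole collection.
p, r = precision_recall({"d1", "d2", "d3", "d4"},
                        {"d1", "d2", "d3", "d7", "d8", "d9"})
print(p, r)  # 0.75 0.5
```

Returning everything in the collection would push recall to 1.0 while precision collapses, which is the "high recall but low precision" case.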

High Recall but Low Precision

Low Recall but High Precision

High Recall and High Precision

Models of Information Retrieval [3]

STRUCTURE models combine representations of terms with information about structures within documents (e.g., hierarchical organization) and between documents (e.g., hypertext links and other explicit relationships) to determine which parts of documents and which documents are most important and relevant.
PROBABILISTIC models represent documents by index terms; the key assumption is that the terms are distributed differently in relevant and non-relevant documents.

What is a "Document" in Information Retrieval?

A document is any individually retrievable item in the "pile of text" that makes up the COLLECTION.
Sometimes the boundaries that define documents are obvious or conventional (e.g., web search returns a web page), but sometimes they aren't.
"Carving up" or "chunking" large documents into smaller text passages may be required for some collections or some user interfaces.
A collection might contain any number of documents; web search engines index billions of pages.

What is a Query?

A query is the expression of a user’s information needs and can take many forms:
A natural language description of the need
An artificial and restricted language, in which restrictions on the vocabulary limit the words that can be used in queries and restrictions on syntax limit the ways words can be combined in logical expressions
These restrictions mean that queries may be unable to express the information need completely or accurately.
The user interface(s) to the IR system influence the kinds of queries that the user can express (or express easily).

Text Processing: Motivation

Not all words are equally useful indicators of what a document is about.
Nouns and noun groups carry more "aboutness" than adjectives, adverbs, and verbs.
Very frequent words that occur in all or most documents add NOISE because they cannot discriminate between documents.
So it is worthwhile to pre-process the text of documents to select a smaller set of terms that better represent them; these are called the INDEX terms.
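
One simple way to act on the "noise" observation is to drop any word that occurs in more than some fraction of the documents. This is a sketch under stated assumptions: the 0.8 threshold, the function name, and the toy documents are all hypothetical, and real systems often use a fixed stopword list instead of or in addition to a frequency cutoff.

```python
from collections import Counter


def select_index_terms(docs, max_doc_ratio=0.8):
    """docs: dict mapping doc_id -> list of lowercase tokens.

    Returns the same mapping with words removed when they occur in more
    than max_doc_ratio of the documents (an assumed cutoff).
    """
    n_docs = len(docs)
    doc_freq = Counter()
    for tokens in docs.values():
        doc_freq.update(set(tokens))  # count each word once per document
    noisy = {w for w, df in doc_freq.items() if df / n_docs > max_doc_ratio}
    return {doc_id: [t for t in tokens if t not in noisy]
            for doc_id, tokens in docs.items()}


docs = {
    "d1": "the cat sat on the mat".split(),
    "d2": "the dog chased the cat".split(),
    "d3": "the bird sang".split(),
}
index_terms = select_index_terms(docs)
print(index_terms["d1"])  # ['cat', 'sat', 'on', 'mat']
```

Here "the" occurs in every document and is dropped, while "cat" (two of three documents) stays under the cutoff and is kept.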

Guess That Encoding [1]

Guess That Encoding [2]

Guess That Encoding [3]

Governor's Mansion, 1526 H Street Sacramento California 95814
916 323-3047

Filtering

Filtering removes surrounding header or format information from the text to be processed.
What you filter depends on the encoding format or document type:
You'd probably discard HTML markup before indexing.
You'd almost certainly save XML tags for indexing.
You'd probably want to use the rich metadata in email headers.
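
For the HTML case, filtering can be sketched with Python's standard `html.parser` module. The class and function names below are hypothetical, and the choice to also drop `script` and `style` content is an assumption about what counts as "markup" rather than something the slides specify.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text content; drops tags and script/style bodies."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.parts.append(data)


def strip_html(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    # Join the pieces and normalize runs of whitespace to single spaces.
    return " ".join(" ".join(parser.parts).split())


text = strip_html("<html><head><title>IR</title><style>p{}</style></head>"
                  "<body><p>Recall and <b>precision</b></p></body></html>")
print(text)  # IR Recall and precision
```

For XML, where the tags themselves carry meaning, the same parser hooks could record tag names alongside the text instead of discarding them.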

Tokenization Challenges [1]

Character sequences where the tokens include complex alphanumeric structure or punctuation syntax:
[email protected]
10/26/
October 26, 1953
55 B.C
B-
128.32.226.
My PGP key is 324a3df234ch23e
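
To see why such sequences are hard, compare a naive tokenizer that splits on anything non-alphanumeric with one that tries a few specific patterns first. This is only a sketch: the patterns are deliberately loose, and the email address, date, and IP address in the examples are hypothetical stand-ins, not the examples from the slide.

```python
import re

# Alternatives are tried left to right, so specific patterns come first.
TOKEN_RE = re.compile(r"""
    [\w.+-]+@[\w-]+(?:\.[\w-]+)+      # email address
  | \d{1,3}(?:\.\d{1,3}){3}           # IPv4-style dotted number
  | \d{1,2}/\d{1,2}/\d{2,4}           # numeric date
  | (?:[A-Za-z]\.){2,}                # abbreviation such as B.C.
  | \w+(?:-\w+)*                      # word, optionally hyphenated
""", re.VERBOSE)


def tokenize(text):
    """Return the tokens found by TOKEN_RE, in order."""
    return TOKEN_RE.findall(text)


sample = "Mail user@example.com on 10/26/1953"
naive = re.findall(r"[A-Za-z0-9]+", sample)
smart = tokenize(sample)
print(naive)  # ['Mail', 'user', 'example', 'com', 'on', '10', '26', '1953']
print(smart)  # ['Mail', 'user@example.com', 'on', '10/26/1953']
```

The naive version breaks the address and the date into fragments that no longer mean anything on their own; the pattern-based version keeps them whole but needs a new rule for every new kind of token.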

Tokenization Challenges [2]

Tokenization Challenges [3]

The language that the characters represent needs to be identified during decoding because it influences the order and nature of tokenization.
In languages that are written right-to-left, like Arabic and Hebrew, left-to-right text such as numbers and dollar amounts can be interspersed.

In German, compound nouns don't have spaces between the tokens: Lebensversicherungsgesellschaftsangestellter = "life insurance company employee"

Tokenization in "Non-Segmented" Languages

The problems in "segmented" languages, which use white space and punctuation to delimit words, seem trivial compared to the problems of tokenizing East Asian languages that are "non-segmented".
These languages have ideographic characters that can appear as one-character words, but they can also combine to create new words.
The analogous problem in English would be the word "TOGETHER": do we treat it as one word or as three separate words, "TO GET HER"?
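
The "TOGETHER" ambiguity can be reproduced with a small word list. The sketch below enumerates every dictionary-consistent segmentation and also shows greedy longest-match ("maximum matching"), a common baseline for non-segmented text. The lexicon and function names are hypothetical.

```python
def all_segmentations(text, lexicon):
    """Return every way to split text into words found in lexicon."""
    if not text:
        return [[]]
    results = []
    for end in range(1, len(text) + 1):
        word = text[:end]
        if word in lexicon:
            for rest in all_segmentations(text[end:], lexicon):
                results.append([word] + rest)
    return results


def max_match(text, lexicon):
    """Greedy segmentation: always take the longest known word.

    Falls back to a single character when nothing in the lexicon matches.
    """
    tokens, i = [], 0
    while i < len(text):
        for end in range(len(text), i, -1):
            if text[i:end] in lexicon or end == i + 1:
                tokens.append(text[i:end])
                i = end
                break
    return tokens


LEXICON = {"to", "get", "her", "together", "he", "the"}
print(all_segmentations("together", LEXICON))
# [['to', 'get', 'her'], ['together']]
print(max_match("together", LEXICON))
# ['together']
```

Longest-match picks one reading without looking at context, so it will sometimes choose the wrong one; that is the core difficulty the slide describes.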

One Minute Morphology

MORPHOLOGY is the part of linguistics concerned with the mechanisms by which natural languages create words and word forms from smaller units.
These basic building blocks are called MORPHEMES and can express semantic concepts (when they are called ROOTS) or abstract features like "pastness" or "plural".
Every natural language contains about 10,000 morphemes, and because of how they combine to create words, the number of words is an order of magnitude greater.

Inflection and Derivation

INFLECTION is the morphological mechanism that changes the form of a word to handle tense, aspect, agreement, etc. It never changes the part of speech (grammatical category).
dog, dogs
tengo, tienes, tenemos, tienen
DERIVATION is the mechanism for creating new words, usually of a different part-of-speech category, by adding a BOUND MORPH to a BASE MORPH.
build + ing -> building; health + y -> healthy

Morphological Processing

Morphological analysis of a language is often used in information retrieval and other low-level text processing applications (hyphenation, spelling correction) because solving problems using root forms and rules is more scalable and robust than solving them using word lists.
Natural languages are generative, with new words continually being invented.
Many misspellings of common words are obscure low-frequency words, so adding them to a misspelling list would make it impossible to check spellings for the latter.

Stemming

STEMMING is morphological processing to remove prefixes and suffixes to leave the root form of words.
Stemming reduces many related words and word forms to a common canonical form.
This makes it possible to retrieve documents when they contain the meaning we're looking for even if the form of the search word doesn't exactly match what's in the documents.
In English, inflectional morphology is relatively easy to handle, and "dumb" stemmers (e.g., iteratively remove suffixes, matching the longest sequence in a rewrite rule) perform acceptably.
Derivational morphology is more difficult.
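
A "dumb" stemmer of the kind described can be sketched in a few lines: repeatedly strip the longest matching suffix until no rule applies. The rule table and the minimum stem length below are hypothetical and far smaller than any real rule set (this is not the Porter stemmer).

```python
# Hypothetical rewrite rules: suffix -> replacement.
RULES = {
    "ies": "y",
    "ing": "",
    "ed": "",
    "es": "",
    "s": "",
}
MIN_STEM = 3  # assumed guard: keep at least this many characters


def stem(word):
    """Iteratively strip suffixes, trying the longest suffix first."""
    word = word.lower()
    changed = True
    while changed:
        changed = False
        for suffix in sorted(RULES, key=len, reverse=True):
            if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM:
                word = word[:len(word) - len(suffix)] + RULES[suffix]
                changed = True
                break
    return word


for w in ["dogs", "walked", "ponies", "buildings"]:
    print(w, "->", stem(w))
# dogs -> dog
# walked -> walk
# ponies -> pony
# buildings -> build
```

Note that "buildings" is reduced all the way to "build", so the noun and the verb collapse to one form. That helps recall and can hurt precision, and it hints at why derivational morphology is harder to handle with simple rules.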

The Index -- Logical View

An index is a data structure that records information about the occurrences of terms in documents.
A term-document matrix, with rows for terms and columns for documents, is one such data structure.
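
As a concrete illustration, a term-document matrix for a toy collection can be built as a list of rows. This is a sketch, assuming each cell holds a raw occurrence count (a Boolean matrix would store 0 or 1 instead); the function name and documents are hypothetical.

```python
def term_document_matrix(docs):
    """docs: dict mapping doc_id -> list of tokens.

    Returns (terms, doc_ids, matrix), where matrix[i][j] is the number of
    times terms[i] occurs in the document doc_ids[j].
    """
    doc_ids = sorted(docs)
    terms = sorted({t for tokens in docs.values() for t in tokens})
    row = {t: i for i, t in enumerate(terms)}
    matrix = [[0] * len(doc_ids) for _ in terms]
    for j, doc_id in enumerate(doc_ids):
        for t in docs[doc_id]:
            matrix[row[t]][j] += 1
    return terms, doc_ids, matrix


docs = {"d1": ["boolean", "model"], "d2": ["vector", "model", "model"]}
terms, doc_ids, matrix = term_document_matrix(docs)
for term, counts in zip(terms, matrix):
    print(term, counts)
# boolean [1, 0]
# model [1, 2]
# vector [0, 1]
```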

The "Inverted" Index

Using a term-document matrix as the index representation is both infeasible and nonsensical for any substantial collection of documents, because most terms occur in only a small fraction of the documents and nearly all of the cells would be empty.
So instead we divide the index into two parts:
A DICTIONARY is a list of the terms.
A POSTINGS LIST is the list of documents in which each term occurs (usually with frequency and position information within each document).
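
The dictionary-plus-postings idea maps naturally onto a nested dictionary. The sketch below is one possible in-memory layout, not the structure any particular search engine uses; the function names are hypothetical, positions are token offsets, and frequency is taken to be the number of recorded positions.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """docs: dict mapping doc_id -> list of tokens.

    Returns dict term -> {doc_id: [positions]}. The keys play the role of
    the dictionary; each value is that term's postings list.
    """
    index = defaultdict(dict)
    for doc_id in sorted(docs):
        for pos, term in enumerate(docs[doc_id]):
            index[term].setdefault(doc_id, []).append(pos)
    return dict(index)


def boolean_and(index, term_a, term_b):
    """Ids of documents that contain both terms."""
    return sorted(set(index.get(term_a, {})) & set(index.get(term_b, {})))


docs = {"d1": ["boolean", "model"], "d2": ["vector", "model", "model"]}
index = build_inverted_index(docs)
print(index["model"])                         # {'d1': [1], 'd2': [1, 2]}
print(len(index["model"]["d2"]))              # 2
print(boolean_and(index, "vector", "model"))  # ['d2']
```

Only the term-document pairs that actually occur are stored, so the size of the index grows with the text rather than with the number of terms times the number of documents.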

Indexing Step 1 -- Term List

Indexing Step 2 -- Alphabetize and Merge