






Various techniques used in text processing and indexing for information retrieval systems. Topics include bucket compression for memory saving, lexical processing for document preparation, tokenization for term extraction, stemming for reducing morphological variants, and content-based ranking for arranging search results. The vector-space model is also introduced for document representation.
Typology: Slides







- Reduce memory for each pointer in the buckets:
  - for each term, sort occurrences by DID
  - store as a list of gaps: the sequence of differences between successive DIDs
- Advantage: significant memory saving
  - frequent terms produce many small gaps
  - small integers are encoded by short variable-length codewords
- Example:
  - the sequence of DIDs: (14, 22, 38, 42, 66, 122, 131, 226)
  - a sequence of gaps: (14, 8, 16, 4, 24, 56, 9, 95)
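The gap idea can be sketched in a few lines of Python. The function names are illustrative, and the variable-byte code is only one possible variable-length codeword scheme, chosen here as an assumption; the slides do not say which code is used.

```python
def to_gaps(dids):
    """Sort document IDs and return the differences between successive IDs."""
    gaps = []
    previous = 0
    for did in sorted(dids):
        gaps.append(did - previous)
        previous = did
    return gaps


def from_gaps(gaps):
    """Rebuild the sorted document IDs from a gap list by running summation."""
    dids = []
    total = 0
    for gap in gaps:
        total += gap
        dids.append(total)
    return dids


def vbyte_encode(n):
    """Variable-byte code (an assumed scheme): 7 data bits per byte,
    with the high bit set on the final byte of each number."""
    out = [(n & 0x7F) | 0x80]  # lowest 7 bits, marked as the final byte
    n >>= 7
    while n:
        out.append(n & 0x7F)   # higher-order groups, high bit clear
        n >>= 7
    return bytes(reversed(out))


dids = [14, 22, 38, 42, 66, 122, 131, 226]
gaps = to_gaps(dids)
raw_size = sum(len(vbyte_encode(d)) for d in dids)
gap_size = sum(len(vbyte_encode(g)) for g in gaps)
print(gaps, raw_size, gap_size)
```

Under this scheme every gap in the example fits in one byte, while the two DIDs above 127 need two bytes each, which is the memory saving the slide describes.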
- Performed prior to indexing or converting documents to vector representations
  - Tokenization
    - extraction of terms from a document
  - Text conflation and vocabulary reduction
    - Stemming
      - reducing words to their root forms
    - Removing stop words
      - common words, such as articles, prepositions, non-informative adverbs
      - 20-30% index size reduction
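A minimal sketch of tokenization followed by stop-word removal is given below. The token pattern (lowercase runs of letters and digits) and the tiny `STOP_WORDS` set are assumptions made for illustration; a real system would use a much larger stop list.

```python
import re

# Hypothetical, deliberately small stop list (articles, prepositions, adverbs).
STOP_WORDS = {"a", "an", "the", "of", "in", "on", "to", "and", "is", "very"}


def tokenize(text):
    """Lowercase the text and extract runs of letters and digits as terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Drop tokens that appear in the stop list, keeping the original order."""
    return [token for token in tokens if token not in stop_words]


tokens = tokenize("The Web is a graph of pages.")
print(remove_stop_words(tokens))
```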
- Want to reduce all morphological variants of a word to a single index term
  - e.g. a document containing words like fish and fisher may not be retrieved by a query containing fishing (no fishing explicitly contained in the document)
- Stemming: reduce words to their root form
  - e.g. fish
  - e.g. if suffix=IZATION and prefix contains at least one vowel followed by a consonant, replace with suffix=IZE
    - BINARIZATION => BINARIZE
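The IZATION rule above can be sketched as a single function. This covers only that one rule, not a full stemmer; the function name is illustrative, and treating only a, e, i, o, u as vowels (with y always counted as a consonant) is a simplifying assumption. The function works in lowercase.

```python
import re

# A vowel immediately followed by a consonant (y is treated as a consonant here).
VOWEL_THEN_CONSONANT = re.compile(r"[aeiou][b-df-hj-np-tv-z]")


def apply_ization_rule(word):
    """Replace the suffix 'ization' with 'ize' when the remaining prefix
    contains at least one vowel followed by a consonant."""
    word = word.lower()
    suffix = "ization"
    if word.endswith(suffix):
        prefix = word[: -len(suffix)]
        if VOWEL_THEN_CONSONANT.search(prefix):
            return prefix + "ize"
    return word  # rule does not apply; leave the word unchanged


print(apply_ization_rule("BINARIZATION"))
```

A word such as "ization" itself is left unchanged, because its prefix is empty and so fails the vowel-consonant condition.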
- A boolean query
  - results in several matching documents
  - e.g., a user query in Google, 'Web AND graphs', results in 4,040,000 matches
- Problem: the user can examine only a fraction of the results
- Content-based ranking: arrange results in the order of relevance to the user
- Text documents are mapped to a high-dimensional vector space
- Each document
  - represented as a sequence of terms
- Unique terms in a set of documents
  - determine the dimension of a vector space
document   text                       terms
d_1        web web graph              web graph
d_2        graph web net graph net    graph web net
d_3        page web complex           page web complex

Boolean representation of vectors:
V   = [ web, graph, net, page, complex ]
V_1 = [ 1 1 0 0 0 ]
V_2 = [ 1 1 1 0 0 ]
V_3 = [ 1 0 0 1 1 ]
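The boolean vectors for the three example documents can be reproduced with a short sketch. The function names are illustrative, and ordering the vocabulary by first appearance is an assumption that happens to match the term order used above.

```python
docs = {
    "d1": "web web graph",
    "d2": "graph web net graph net",
    "d3": "page web complex",
}


def build_vocabulary(docs):
    """Collect unique terms in order of first appearance across the documents."""
    vocab = []
    for text in docs.values():
        for term in text.split():
            if term not in vocab:
                vocab.append(term)
    return vocab


def boolean_vector(text, vocab):
    """Return 1 for each vocabulary term present in the text, otherwise 0."""
    present = set(text.split())
    return [1 if term in present else 0 for term in vocab]


vocab = build_vocabulary(docs)
vectors = {name: boolean_vector(text, vocab) for name, text in docs.items()}
print(vocab)
print(vectors)
```

The number of unique terms (five here) is the dimension of the vector space.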
- A term that appears many times within a document is likely to be more important than a term that appears only once
- $n_{ij}$: number of occurrences of a term $t_j$ in a document $d_i$
- Term frequency:

$$TF_{ij} = \frac{n_{ij}}{|d_i|}$$
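As a worked example, term frequency can be computed for a document by dividing each term's count by the document length. This sketch assumes the denominator is the total number of term occurrences in the document; the function name is illustrative.

```python
from collections import Counter


def term_frequencies(text):
    """Map each term to its count divided by the number of terms in the text."""
    terms = text.split()
    counts = Counter(terms)
    return {term: n / len(terms) for term, n in counts.items()}


tf = term_frequencies("graph web net graph net")
print(tf)
```

For the text "graph web net graph net", the terms graph and net each occur twice out of five terms and web occurs once, so graph and net get a higher weight than web.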