© Goharian, Grossman, Frieder 2002, 2005, 2008
Token Processing
(CS429)
Nazli Goharian
Slides are mostly based on Information Retrieval Algorithms and Heuristics, Grossman, Frieder
Token Processing
Identifying document units for indexing
- whole document
- chapter
- paragraph
- …
Unit too large
Cons: potential for more irrelevant documents, and more difficult for the user to find relevant information
Unit too small
Cons: may lose some relevant documents, as the terms are distributed over small units
Token Processing
Documents may be written in various languages.
Web: ~ 60% in English
A given document may contain foreign-language
terms and phrases.
The collection must be indexed!
Token Processing
Identifying the tokens in a document unit for
indexing
Special Tokens
- Dates: 2005; Oct 10, 2005; 10/10/2005; 10/10/
- Digit-alphabet: 1-hour
- Alphabet-digit: F-16; I-
- Hyphenation: co-existence; black-tie party
- All caps: CNN, BBC
- Cap period (initial): N.
- Digit.digit: 8.
- Digit,digit: 8,
- Currency symbol: $, …
- Culturally known names: MAS*H
- Email address: [email protected]
- URLs: http://www.cnn.com
- IP address: 123.67.65.
- Names: New York; Los Angeles (Los Angeles-New York flights ????)
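One way to recognize a few of these special tokens is an ordered list of regular expressions, tried from most to least specific. The sketch below is only illustrative: the pattern names, the patterns themselves, and the `tokenize` helper are assumptions, not part of the slides, and they cover only some of the token classes listed above.

```python
import re

# Hypothetical patterns, ordered from most specific to least specific.
SPECIAL_TOKEN_PATTERNS = [
    ("url", r"https?://[^\s]+"),
    ("email", r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    ("ip", r"\d{1,3}(?:\.\d{1,3}){3}"),
    ("date", r"\d{1,2}/\d{1,2}/\d{4}"),
    ("acronym", r"(?:[A-Z]\.){2,}"),
    ("hyphenated", r"[A-Za-z0-9]+(?:-[A-Za-z0-9]+)+"),
    ("word", r"[A-Za-z0-9]+"),
]

# One alternation with a named group per token class.
TOKEN_RE = re.compile(
    "|".join(f"(?P<{name}>{pattern})" for name, pattern in SPECIAL_TOKEN_PATTERNS)
)


def tokenize(text):
    """Return (token_class, token) pairs in order of appearance."""
    return [(match.lastgroup, match.group()) for match in TOKEN_RE.finditer(text)]


if __name__ == "__main__":
    for token_class, token in tokenize("F-16 at http://www.cnn.com on 10/10/2005"):
        print(token_class, token)
```

Because earlier alternatives win, "F-16" is kept as one hyphenated token instead of being split into "F" and "16". A real parser would need more classes (currency, names, multi-word dates) and a policy for trailing punctuation.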
Normalization of Tokens
- Using equivalence classes of terms. Example rules:
  - Ph.D --> Phd
  - U.S.A. --> USA
  - 10/10/2005 --> Oct 10, 2005
  - F-16 --> F
  - Variations of umlaut words in German
  - …
- What about these rules?
  - Windows --> window (what if one is the OS and one is a window???)
  - C.A.D. --> cad (different meaning????)
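Equivalence-class rules can be sketched as a lookup table that maps each surface form to one canonical term, applied to both documents and queries. The table name `EQUIVALENCE_RULES` and the function `normalize` are hypothetical; only the two fully legible rules from the slide are included.

```python
# Hypothetical rule table: surface form -> canonical term.
# Only rules that appear in full on the slide are listed here.
EQUIVALENCE_RULES = {
    "Ph.D": "Phd",
    "U.S.A.": "USA",
}


def normalize(token):
    """Map a token to its equivalence class; unknown tokens pass through."""
    return EQUIVALENCE_RULES.get(token, token)


if __name__ == "__main__":
    query_term = normalize("U.S.A.")
    document_term = normalize("USA")
    # Both sides now use the same index term, so they match.
    print(query_term == document_term)
```

Adding a rule such as Windows --> window to this table would merge the operating system and the object into one class, which is exactly the risk the questions above point at.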
Normalization of Tokens (cont'd)
- Case folding - reduces the term index by ~17%, but it is a lossy compression
- Convert all to lower case (most practical); or some to lower case
- Spelling variations (neighbor vs. neighbour; a foreign name)
- Accents on letters (naïve vs. naive; many foreign language terms)
- Variant transliteration (Den-Haag vs. The Hague)
- Use phonetic equivalence; the best-known such algorithm is Soundex
More on normalization under Stemming…
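Case folding and accent removal can be sketched with the Python standard library. The function name `fold` is an assumption, and this shows one possible policy (fold everything to lower case, strip combining accents), not the only one.

```python
import unicodedata


def fold(token):
    """Lower-case a token and strip accents (a lossy normalization)."""
    # NFKD splits an accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFKD", token)
    without_accents = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    return without_accents.lower()


if __name__ == "__main__":
    print(fold("Naïve"))    # same index term as "naive"
    print(fold("Windows"))  # no longer distinguishable from "windows"
```

The second call shows why the folding is lossy: once folded, the capitalized product name and the common noun share one index entry.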
Phrase processing
- Phrase recognition is based on the goal of indexing
meaningful phrases like
- "Lincoln Town Car"
- "San Francisco"
- "apple pie"
- Doing this would use word order to assist with
effectiveness -- otherwise we are assuming the
query and documents are just a "bag of words"
- ~ 10% of web queries are explicit phrase queries
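The difference between a bag of words and order-aware phrases can be sketched with adjacent word pairs. Using word bigrams as phrase candidates is an assumption made for illustration; the helper names are hypothetical.

```python
def bag_of_words(text):
    """Order-insensitive view: the sorted list of lower-cased words."""
    return sorted(text.lower().split())


def word_bigrams(text):
    """Order-sensitive view: each pair of adjacent words is a candidate phrase."""
    words = text.lower().split()
    return [" ".join(pair) for pair in zip(words, words[1:])]


if __name__ == "__main__":
    # Identical as bags of words, different as phrases.
    print(bag_of_words("apple pie") == bag_of_words("pie apple"))
    print(word_bigrams("apple pie") == word_bigrams("pie apple"))
    print(word_bigrams("Lincoln Town Car"))
```

Indexing every bigram is the simplest possible phrase policy and also the most expensive in index size.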
Constructing Phrases
Using Part-of-Speech Tagging
• Can take advantage of NLP techniques:
• Using part-of-speech tagging to identify
key components of a sentence (S-V-OBJ, …)
• Use to identify phrases
- Keep all noun phrases "Republic of China", or
- Keep adjective followed by noun "Red Carpet"
Constructing Phrases
Using Named Entity Tagging
• Finding structured data within an unstructured
document
- People's names, organizations, locations, amounts, etc.
Phrase Processing Summary
• Pro
- Often found to improve effectiveness by 10%
• Con
- Dramatically increases size of term dictionary and
the size of the index
Parser Generators
• Goal is to allow users to specify parsing
rules as grammars.
• Grammars provide a very flexible means of
expressing all valid strings in a language.
Stemming Algorithms
• Rule-Based
- Porter (1980)
- Lovins (1968)
• Dictionary-based
• Co-Occurrence-Based (1994)
• Others
Porter Stemmer
• An incoming word is cleaned up in the
initialization phase; one prefix-trimming
phase then takes place, followed by five
suffix-trimming phases.
• Note: The entire algorithm will not be
covered -- we will leave out some obscure
rules.
Initialization
• First the word is cleaned up: it is converted
to lower case, and only letters and digits are kept.
• F-16 is converted to f16.
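The clean-up step described above can be sketched in a couple of lines. The function name `initialize` is hypothetical.

```python
def initialize(word):
    """Lower-case the word and keep only letters and digits."""
    return "".join(ch for ch in word.lower() if ch.isalnum())


if __name__ == "__main__":
    print(initialize("F-16"))
```

Note that `str.isalnum` also accepts non-ASCII letters and digits; restricting the step to ASCII would be a separate design choice.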
Porter Stemming
• Remove prefixes:
"kilo", "micro", "milli", "intra", "ultra",
"mega", "nano", "pico", "pseudo"
So megabyte and kilobyte both become "byte".
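The prefix-trimming phase can be sketched as a single pass over the prefix list. The function name `strip_prefix` is hypothetical, and two details are assumptions not stated on the slide: at most one prefix is removed, and a word that consists only of a prefix is left alone.

```python
PREFIXES = (
    "kilo", "micro", "milli", "intra", "ultra",
    "mega", "nano", "pico", "pseudo",
)


def strip_prefix(word):
    """Remove the first matching prefix, if something remains afterwards."""
    for prefix in PREFIXES:
        # Assumption: never reduce a word to the empty string.
        if word.startswith(prefix) and len(word) > len(prefix):
            return word[len(prefix):]
    return word


if __name__ == "__main__":
    print(strip_prefix("megabyte"), strip_prefix("kilobyte"))
```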
Step 3
- With what is left, replace any suffix on the left with the suffix on the right ("--" means the suffix is removed):
  icate     ic     fabricate --> fabric (think about this one)
  ative     --     combative --> comb (another good one)
  alize     al     nationalize --> national
  iciti     ic
  ical      ic     tropical --> tropic
  ful       --     faithful --> faith
  iveness   ive    inventiveness --> inventive
  ness      --     harness --> har
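The Step 3 rules can be sketched as a suffix table that is checked longest suffix first, so that "iveness" wins over "ness". The names `STEP3_RULES` and `step3` are hypothetical. The published Porter algorithm also places conditions on the remaining stem; the slide does not state them, so this sketch applies none.

```python
# Suffix -> replacement, as listed on the slide ("" means remove the suffix).
STEP3_RULES = {
    "icate": "ic",
    "ative": "",
    "alize": "al",
    "iciti": "ic",
    "ical": "ic",
    "ful": "",
    "iveness": "ive",
    "ness": "",
}

# Try longer suffixes first so the most specific rule applies.
_ORDERED_SUFFIXES = sorted(STEP3_RULES, key=len, reverse=True)


def step3(word):
    """Apply at most one Step 3 rule (no stem-measure condition is checked)."""
    for suffix in _ORDERED_SUFFIXES:
        if word.endswith(suffix):
            return word[: len(word) - len(suffix)] + STEP3_RULES[suffix]
    return word


if __name__ == "__main__":
    for word in ("fabricate", "combative", "inventiveness", "harness"):
        print(word, "-->", step3(word))
```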
Step 4
• Remove remaining standard suffixes
al, ance, ence, er, ic, able, ible, ant, ement,
ment, ent, sion, tion, ou, ism, ate, iti, ous, ive,
ize, ise
Step 5
• Remove a trailing "e" if the remaining stem does
not end in a vowel
- hinge --> hing
- free --> free
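A minimal sketch of Step 5, reading the rule as "drop the final e unless the letter before it is a vowel", which is the reading consistent with both examples. The function name `step5` and the vowel set are assumptions.

```python
VOWELS = set("aeiou")


def step5(word):
    """Drop a trailing 'e' when the letter before it is not a vowel."""
    if len(word) > 1 and word.endswith("e") and word[-2] not in VOWELS:
        return word[:-1]
    return word


if __name__ == "__main__":
    print(step5("hinge"), step5("free"))
```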
Porter Summary
• Pros
- Commonly used and has shown good results
• Cons
- Many words with different meanings share a
common stem (e.g., fabricate and fabric)
- Many stems are not words
Co-Occurrence
- Pro
- Language independent (no need for a dictionary)
- Based on the assumption that terms in the same class co-occur with each other, e.g., "hippo" will co-occur with "hippos"
- Improves effectiveness
- Con
- Computationally expensive to build the co-occurrence matrix (but you only do it every now and then)
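The co-occurrence matrix itself can be sketched as a counter over term pairs that appear in the same document. This only shows the counting step; how pairs are then merged into stem classes is not given on the slide. The function name and the toy documents are hypothetical.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(documents):
    """Count, for each pair of distinct terms, the documents containing both."""
    counts = Counter()
    for document in documents:
        terms = sorted(set(document.lower().split()))
        # Each unordered pair is stored once, in alphabetical order.
        for first, second in combinations(terms, 2):
            counts[(first, second)] += 1
    return counts


if __name__ == "__main__":
    docs = [
        "hippo eats grass",
        "hippos and hippo swim",
        "a hippo and hippos",
    ]
    print(cooccurrence_counts(docs)[("hippo", "hippos")])
```

The pairwise loop is quadratic in the number of distinct terms per document, which illustrates why building the matrix is expensive on a large collection.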
N-grams
- Noise such as OCR (Optical Character
Recognition) errors or misspellings lowers the query
processing accuracy in a term-based search.
- The premise is:
- Terms are all strings of length n
- Substrings of a term may help to find a match in the noisy cases
- Replace terms with n-grams
- Language-independent -- no stemming or stop
word removal needed
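A minimal sketch of character n-grams, showing how a misspelled term still shares substrings with the correct one. The helper names are hypothetical, and this version works on single terms without padding or word-boundary markers, which is a simplification.

```python
def char_ngrams(term, n=5):
    """All overlapping substrings of length n (empty if the term is shorter)."""
    return [term[i:i + n] for i in range(len(term) - n + 1)]


def shared_ngrams(first, second, n=5):
    """N-grams that two terms have in common, in sorted order."""
    return sorted(set(char_ngrams(first, n)) & set(char_ngrams(second, n)))


if __name__ == "__main__":
    print(char_ngrams("misspelled"))
    print(shared_ngrams("misspelled", "mispelled"))
```

As whole terms, "misspelled" and "mispelled" do not match at all; as 5-grams they overlap on "spell", "pelle", and "elled".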
5-Gram Example
• Q: What technique works on noise and
misspelled words?
• D1: N-grams work on noisy mispelled text.
_work _on_no on_noi n_nois
spell pelle elled lled_
- 8 terms are matched
- No stemming of work, noise
- Partial match of misspelled
word
N-gram Summary
• Pro
- Language independent
- Works on garbled text (OCR, etc.)
• Con
- There can be a LOT of n-grams; the dictionary may
no longer fit in memory
- Query processing requires more resources
- query processing requires more resources