Token Processing in Information Retrieval: Techniques and Algorithms - Prof. Nazli Goharian

These notes cover various techniques and algorithms used in token processing during information retrieval. Topics include identifying document units, token identification, stop words, special tokens, normalization of tokens, phrase processing, parser generators, stemming, and co-occurrence. The notes also cover the advantages and disadvantages of each approach.

Uploaded on 08/19/2009

© Goharian, Grossman, Frieder 2002, 2005, 2008

Token Processing

(CS429)

Nazli Goharian

nazli@ir.iit.edu

Slides are mostly based on Information Retrieval: Algorithms and Heuristics, Grossman, Frieder

Token Processing

Identifying document units for indexing

  • whole document
  • chapter
  • paragraph
  • ….

Too large unit

Cons: potential of having more irrelevant documents & more difficult for the user to find relevant information

Too small unit

Cons: may lose some relevant docs as the terms are distributed over small units


Token Processing

Documents may be written in various languages.

Web: ~60% in English

A given document may contain foreign-language terms and phrases.

The collection must be indexed!

Token Processing

Identifying the tokens in a document unit for indexing

  • Parsing
  • Stemming
  • n-grams

Special Tokens

  • Dates: 2005; Oct 10, 2005; 10/10/2005; 10/10/
  • Digit-alphabet: 1-hour
  • Alphabet-digit: F-16; I-
  • Hyphenation: co-existence; black-tie party
  • All caps: CNN, BBC
  • Cap period (initial): N.
  • Digit.digit: 8.
  • Digit,digit: 8,
  • Currency symbol: $, ….
  • Culturally known names: M*A*S*H
  • Email address: [email protected]
  • URLs: http://www.cnn.com
  • IP address: 123.67.65.
  • Names: New York; Los Angeles (Los Angeles-New York flights?)
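
A minimal Python sketch of how a few of these special tokens might be recognized with regular expressions. The pattern names, the regexes themselves, and the `tokenize` helper are illustrative assumptions, not rules taken from the slides, and they cover only some of the categories listed.

```python
import re

# Hypothetical pattern table (names and regexes are assumptions).
# Earlier entries win, so more specific patterns are listed first.
TOKEN_PATTERNS = [
    ("URL", r"https?://\S+"),
    ("EMAIL", r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    ("IP", r"\d{1,3}(?:\.\d{1,3}){3}"),
    ("DATE", r"\d{1,2}/\d{1,2}/\d{4}"),
    ("HYPHENATED", r"[A-Za-z0-9]+(?:-[A-Za-z0-9]+)+"),
    ("NUMBER", r"\d+(?:[.,]\d+)*"),
    ("ALL_CAPS", r"[A-Z]{2,}\b"),
    ("WORD", r"[A-Za-z]+"),
]

# One alternation with a named group per token type.
MASTER = re.compile(
    "|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_PATTERNS)
)

def tokenize(text):
    """Return (token_type, token_text) pairs; unmatched characters are skipped."""
    return [(m.lastgroup, m.group()) for m in MASTER.finditer(text)]

if __name__ == "__main__":
    for kind, token in tokenize("CNN flew an F-16 on 10/10/2005"):
        print(kind, token)
```

Order matters in this sketch: the hyphenated pattern is tried before the plain number pattern, so `1-hour` stays one token instead of being split into `1` and `hour`.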

Normalization of Tokens

  • Using equivalence classes of terms. Example rules:
    • Ph.D --> Phd
    • U.S.A. --> USA
    • 10/10/2005 --> Oct 10, 2005
    • F-16 --> F
    • Variations of umlaut words in German
    • …………..
  • What about these rules?
    • Windows --> window (what if one is the OS and the other is a window?)
    • C.A.D. --> cad (different meanings?)
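
One way to picture equivalence classes is as an explicit rule table, sketched below in Python. The `RULES` table and the `normalize` function are hypothetical names for illustration; only the two mappings shown come from the example rules above.

```python
import re

# Hypothetical rule table: each regex describes the surface variants of one
# equivalence class, paired with the class label used in the index.
RULES = [
    (re.compile(r"ph\.?d\.?", re.IGNORECASE), "Phd"),
    (re.compile(r"u\.?s\.?a\.?", re.IGNORECASE), "USA"),
]

def normalize(token):
    """Map a token to its class label, or return it unchanged if no rule applies."""
    for pattern, label in RULES:
        if pattern.fullmatch(token):
            return label
    return token

if __name__ == "__main__":
    for token in ["Ph.D", "U.S.A.", "usa", "Windows", "C.A.D."]:
        print(token, "-->", normalize(token))
```

Because every class is listed explicitly, tokens such as Windows or C.A.D. are left alone unless someone deliberately adds a rule for them, which is one way to avoid the questionable merges raised above.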

Normalization of Tokens (cont’d)

  • Case folding: reduces the term index by ~17%, but it is a lossy compression
    • Convert all to lower case (most practical), or only some to lower case
  • Spelling variations (neighbor vs. neighbour; a foreign name)
  • Accents on letters (naïve vs. naive; many foreign-language terms)
  • Variant transliteration (Den Haag vs. The Hague)
    • Use phonetic equivalence; the best-known such algorithm is Soundex

More on normalization under Stemming….
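
Case folding and accent removal can be sketched with Python's standard `unicodedata` module. The `fold` function name is an assumption, and this is only one possible approach: it decomposes accented letters, drops the combining marks, and then lowercases.

```python
import unicodedata

def fold(token):
    """Strip combining accent marks, then case-fold the token."""
    # NFKD splits an accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFKD", token)
    without_marks = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    return without_marks.casefold()

if __name__ == "__main__":
    for token in ["Naïve", "naive", "CNN"]:
        print(token, "-->", fold(token))
```

Both steps are lossy: after folding, the index can no longer tell the original variants apart.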

Phrase processing

  • Phrase recognition is based on the goal of indexing meaningful phrases like
    • “Lincoln Town Car”
    • “San Francisco”
    • “apple pie”
  • Doing this would use word order to assist with effectiveness -- otherwise we are assuming the query and documents are just a “bag of words”
  • ~10% of web queries are explicit phrase queries
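
The difference between a bag-of-words match and a phrase match can be sketched in a few lines of Python. Both function names are hypothetical, and whitespace splitting stands in for a real tokenizer.

```python
def bag_of_words_match(query, doc):
    """True if every query word occurs somewhere in the document, in any order."""
    return set(query.lower().split()) <= set(doc.lower().split())

def phrase_match(query, doc):
    """True only if the query words occur consecutively and in the same order."""
    q = query.lower().split()
    d = doc.lower().split()
    return any(d[i:i + len(q)] == q for i in range(len(d) - len(q) + 1))

if __name__ == "__main__":
    doc = "she baked a pie with an apple"
    print(bag_of_words_match("apple pie", doc))  # word order is ignored
    print(phrase_match("apple pie", doc))        # word order is required
```

For the document in the example, the bag-of-words test succeeds while the phrase test fails, which is exactly the information word order adds.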

Constructing Phrases

Using Part-of-Speech Tagging

• Can take advantage of NLP techniques:

• Using part-of-speech tagging to identify key components of a sentence (S-V-OBJ, …)

• Use to identify phrases

  • Keep all noun phrases, e.g. “Republic of China”, or
  • Keep an adjective followed by a noun, e.g. “Red Carpet”
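
A small Python sketch of the adjective-followed-by-noun rule. It assumes the text has already been tagged; the tag names `ADJ` and `NOUN`, the hand-tagged input, and the function name are all illustrative assumptions, and a real system would obtain the tags from a part-of-speech tagger.

```python
def adjective_noun_phrases(tagged_tokens):
    """Collect two-word phrases where an adjective is directly followed by a noun."""
    phrases = []
    # Walk over adjacent (word, tag) pairs.
    for (word1, tag1), (word2, tag2) in zip(tagged_tokens, tagged_tokens[1:]):
        if tag1 == "ADJ" and tag2 == "NOUN":
            phrases.append(f"{word1} {word2}")
    return phrases

if __name__ == "__main__":
    # Hand-assigned tags, for illustration only.
    tagged = [("the", "DET"), ("red", "ADJ"), ("carpet", "NOUN"),
              ("was", "VERB"), ("long", "ADJ")]
    print(adjective_noun_phrases(tagged))
```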

Constructing Phrases

Using Named Entity Tagging

• Finding structured data within an unstructured document

  • People’s names, organizations, locations, amounts, etc.

Phrase Processing Summary

• Pro

  • Often found to improve effectiveness by 10%

• Con

  • Dramatically increases the size of the term dictionary and the size of the index

Parser Generators

• Goal is to allow users to specify parsing rules as grammars.

• Grammars provide a very flexible means of expressing all valid strings in a language.

Stemming Algorithms

• Rule-Based

  • Porter (1980)
  • Lovins (1968)

• Dictionary-based

  • K-stem (1989, 1993)

• Co-Occurrence-Based (1994)

• Others

Porter Stemmer

• An incoming word is cleaned up in the initialization phase; one prefix-trimming phase then takes place, followed by five suffix-trimming phases.

• Note: The entire algorithm will not be covered -- we will leave out some obscure rules.

Initialization

• First the word is cleaned up: it is converted to lower case, and only letters and digits are kept.

• F-16 is converted to f16.

Porter Stemming

• Remove prefixes:

"kilo", "micro", "milli", "intra", "ultra", "mega", "nano", "pico", "pseudo"

So megabyte and kilobyte both become “byte”.
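
The initialization and prefix-trimming phases as described in these slides can be sketched in Python as follows. The function names are hypothetical, and the guard that refuses to strip a prefix when nothing would remain is an added assumption, not something the slides state.

```python
PREFIXES = ("kilo", "micro", "milli", "intra", "ultra",
            "mega", "nano", "pico", "pseudo")

def initialize(word):
    """Lowercase the word and keep only letters and digits."""
    return "".join(ch for ch in word.lower() if ch.isalnum())

def strip_prefix(word):
    """Remove the first listed prefix the word starts with, if any."""
    for prefix in PREFIXES:
        # Assumption: never strip the prefix if it is the entire word.
        if word.startswith(prefix) and len(word) > len(prefix):
            return word[len(prefix):]
    return word

if __name__ == "__main__":
    for raw in ["F-16", "Megabyte", "kilobyte"]:
        print(raw, "-->", strip_prefix(initialize(raw)))
```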

Step 3

  • With what is left, replace any suffix on the left with the suffix on the right ...
    • icate --> ic: fabricate --> fabric (think about this one)
    • ative --> (removed): combative --> comb (another good one)
    • alize --> al: nationalize --> national
    • iciti --> ic
    • ical --> ic: tropical --> tropic
    • ful --> (removed): faithful --> faith
    • iveness --> ive: inventiveness --> inventive
    • ness --> (removed): harness --> har
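
A rough Python sketch of this replacement step, using only the suffix pairs listed on the slide. The names `STEP3_RULES` and `step3` are hypothetical, and the sketch deliberately omits the extra conditions on the remaining stem that the published Porter algorithm applies.

```python
# Suffix --> replacement pairs from the slide ("" means the suffix is removed).
STEP3_RULES = [
    ("icate", "ic"),
    ("ative", ""),
    ("alize", "al"),
    ("iciti", "ic"),
    ("ical", "ic"),
    ("ful", ""),
    ("iveness", "ive"),
    ("ness", ""),
]

def step3(word):
    """Apply the first matching rule, trying longer suffixes before shorter ones."""
    for suffix, replacement in sorted(STEP3_RULES, key=lambda r: len(r[0]), reverse=True):
        if word.endswith(suffix):
            return word[:-len(suffix)] + replacement
    return word

if __name__ == "__main__":
    for word in ["fabricate", "nationalize", "tropical", "faithful",
                 "inventiveness", "harness"]:
        print(word, "-->", step3(word))
```

Trying longer suffixes first means a word ending in "iveness" is handled by that rule rather than by the shorter "ness" rule.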

Step 4

• Remove remaining standard suffixes

al, ance, ence, er, ic, able, ible, ant, ement, ment, ent, sion, tion, ou, ism, ate, iti, ous, ive, ize, ise
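
A simplified Python sketch of this removal step, using the suffix list above. The `step4` name and the `min_stem` length guard are assumptions standing in for the stem conditions of the published Porter algorithm, which this sketch does not implement.

```python
STEP4_SUFFIXES = [
    "al", "ance", "ence", "er", "ic", "able", "ible", "ant", "ement",
    "ment", "ent", "sion", "tion", "ou", "ism", "ate", "iti", "ous", "ive",
    "ize", "ise",
]

def step4(word, min_stem=2):
    """Remove the longest matching suffix, provided enough of the word remains."""
    for suffix in sorted(STEP4_SUFFIXES, key=len, reverse=True):
        stem = word[:-len(suffix)]
        # Assumption: keep at least `min_stem` characters of the word.
        if word.endswith(suffix) and len(stem) >= min_stem:
            return stem
    return word

if __name__ == "__main__":
    for word in ["replacement", "national", "inventive", "ate"]:
        print(word, "-->", step4(word))
```

Checking longer suffixes first matters because several entries nest inside each other: "ement" contains "ment", which in turn contains "ent".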

Step 5

• Remove a trailing “e” if the letter before it is not a vowel

  • hinge --> hing
  • free --> free
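
A short Python sketch of this step, reading the rule as "drop a final e when the letter before it is not a vowel", which is the reading consistent with both examples. The `step5` name is hypothetical, and the published Porter algorithm states its final step in different terms.

```python
VOWELS = set("aeiou")

def step5(word):
    """Drop a trailing 'e' when the preceding letter is a consonant."""
    if len(word) > 1 and word.endswith("e") and word[-2] not in VOWELS:
        return word[:-1]
    return word

if __name__ == "__main__":
    for word in ["hinge", "free"]:
        print(word, "-->", step5(word))
```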

Porter Summary

• Pros

  • Commonly used and has shown good results

• Cons

  • Many words with different meanings have common stems (e.g., fabricate and fabric)
  • A lot of stems are not words

Co-Occurrence

  • Pro
    • Language independent (no need for a dictionary)
    • Based on the assumption that terms in the same class will co-occur with one another, e.g. “hippo” will co-occur with “hippos”
    • Improves effectiveness
  • Con
    • Computationally expensive to build the co-occurrence matrix (but it only has to be rebuilt every now and then)
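
To make the idea concrete, here is a small Python sketch that counts document-level co-occurrence and scores each term pair with the Dice coefficient. The choice of Dice, the document-level counting, and the function name are illustrative assumptions; the slides do not specify which measure the co-occurrence-based stemmer uses.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_scores(docs):
    """Score each term pair by Dice: 2 * n_ab / (n_a + n_b), counted over documents."""
    term_df = Counter()  # number of documents containing each term
    pair_df = Counter()  # number of documents containing both terms of a pair
    for doc in docs:
        terms = sorted(set(doc.lower().split()))
        term_df.update(terms)
        # Every pair of distinct terms in the document: quadratic in its vocabulary.
        pair_df.update(combinations(terms, 2))
    return {
        (a, b): 2 * n_ab / (term_df[a] + term_df[b])
        for (a, b), n_ab in pair_df.items()
    }

if __name__ == "__main__":
    docs = ["hippo hippos river", "hippo hippos zoo", "zoo tickets"]
    scores = cooccurrence_scores(docs)
    print(scores[("hippo", "hippos")], scores[("hippo", "zoo")])
```

The pairwise counting inside the loop is where the cost noted above comes from: the number of pairs grows quadratically with the number of distinct terms.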

N-grams

  • Noise such as OCR (Optical Character Recognition) errors or misspellings lowers query-processing accuracy in a term-based search.
  • The premise is:
    • Terms are all strings of length n
    • Substrings of a term may help to find a match in the noisy cases
  • Replace terms with n-grams
  • Language-independent -- no stemming or stop-word removal needed
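
A minimal Python sketch of character n-gram generation and matching. Marking word boundaries with an underscore and padding both ends of the text are assumptions, as are the function names.

```python
def char_ngrams(text, n):
    """Return all overlapping character n-grams, with '_' marking word boundaries."""
    padded = "_" + "_".join(text.lower().split()) + "_"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

def shared_ngrams(query, doc, n):
    """Return the set of n-grams that the query and the document have in common."""
    return set(char_ngrams(query, n)) & set(char_ngrams(doc, n))

if __name__ == "__main__":
    print(char_ngrams("noise", 3))
    print(sorted(shared_ngrams("noise", "noisy", 3)))
```

Although "noise" and "noisy" are different terms, they still share several 3-grams, so a partial match is found without any stemming.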

5-Gram Example

• Q: What technique works on noise and misspelled words?

• D1: N-grams work on noisy mispelled text.

_work _on_no on_noi n_nois

spell pelle elled lled_

  • 8 terms are matched
  • No stemming of work, noise
  • Partial match of misspelled word

N-gram Summary

• Pro

  • Language independent
  • Works on garbled text (OCR, etc.)

• Con

  • There can be a LOT of n-grams; the dictionary may no longer fit in memory
  • Query processing requires more resources