Token Processing in Information Retrieval: Techniques and Algorithms - Prof. Nazli Goharia, Study notes of Computer Science

Illinois Institute of Technology (IIT), Computer Science

Prof. Nazli Goharian

Various techniques and algorithms used in token processing during information retrieval. Topics include identifying document units, token identification, stop words, special tokens, normalization of tokens, phrase processing, parser generators, stemming, and co-occurrence. The document also covers the advantages and disadvantages of each approach.

Typology: Study notes

Pre 2010

Uploaded on 08/19/2009

© Goharian, Grossman, Frieder 2002, 2005, 2008

Token Processing

(CS429)

Nazli Goharian

nazli@ir.iit.edu

Slides are mostly based on Information Retrieval Algorithms and Heuristics, Grossman, Frieder

Token Processing

Identifying document units for indexing

- whole document
- chapter
- paragraph
- ….

Too large a unit

Cons: more irrelevant documents may be retrieved, and it is more difficult for the user to find relevant information

Too small a unit

Cons: may lose some relevant docs, as the terms are distributed over several small units

Token Processing

Documents may be written in various languages.

Web: ~ 60% in English

A given document may contain foreign-language terms and phrases.

The collection must be indexed!

Token Processing

Identifying the tokens in a document unit for indexing

Parsing
Stemming
n-grams

Special Tokens

Dates: 2005; Oct 10, 2005; 10/10/2005; 10/10/
Digit-alphabet: 1-hour
Alphabet-digit: F-16; I-
Hyphenation: co-existence; black-tie party
All caps: CNN, BBC
Cap period (initial): N.
Digit.digit: 8.
Digit,digit: 8,
Currency symbol: $, ….
Culturally known names: MAS*H
Email address: [email protected]
URLs: http://www.cnn.com
IP address: 123.67.65.
Names: New York; Los Angeles (Los Angeles-New York flights?)
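
One way to keep special tokens like these intact is to try specific patterns before falling back to plain words. The following is a minimal sketch, not the parser used in the course: the pattern set, the names `SPECIAL_TOKEN` and `tokenize`, and the sample sentence are assumptions made for illustration, and only a few of the categories above are covered.

```python
import re

# Alternatives are tried left to right, so the most specific patterns come first.
SPECIAL_TOKEN = re.compile(
    r"""
    https?://[^\s,;]+                  # URLs such as http://www.cnn.com
  | [\w.+-]+@[\w-]+(?:\.[\w-]+)+       # email addresses
  | \d{1,3}(?:\.\d{1,3}){3}            # IP addresses (four dotted numbers)
  | \d{1,2}/\d{1,2}/\d{2,4}            # numeric dates such as 10/10/2005
  | [A-Za-z0-9]+(?:-[A-Za-z0-9]+)+     # hyphenated forms: F-16, 1-hour, co-existence
  | \w+                                # fallback: an ordinary word or number
    """,
    re.VERBOSE,
)


def tokenize(text):
    """Return the tokens of text, keeping the special forms above in one piece."""
    return SPECIAL_TOKEN.findall(text)


print(tokenize("An F-16 left on 10/10/2005, see http://www.cnn.com"))
```

A token such as MAS*H is still split by this sketch (into MAS and H), which is why culturally known names usually need a lookup list rather than a pattern.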

Normalization of Tokens

Using equivalence classes of terms. Example rules:
- Ph.D --> Phd
- U.S.A. --> USA
- 10/10/2005 --> Oct 10, 2005
- F-16 --> F
- Variations of umlaut words in German
- …………..
What about these rules?
- Windows --> window (what if one is the OS and the other is a window?)
- C.A.D. --> cad (different meaning?)
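
Equivalence classing can be sketched as a function that maps each token to a class key, with all tokens that share a key indexed together. The rule used here (drop periods, then fold case) and the names `normalize` and `equivalence_classes` are assumptions chosen for illustration; they are not the course's rule set.

```python
from collections import defaultdict


def normalize(token):
    """Map a token to its class key: drop periods, then fold case."""
    return token.replace(".", "").casefold()


def equivalence_classes(tokens):
    """Group tokens by class key."""
    classes = defaultdict(set)
    for token in tokens:
        classes[normalize(token)].add(token)
    return dict(classes)


classes = equivalence_classes(["Ph.D", "Phd", "U.S.A.", "USA", "C.A.D.", "cad"])
for key, members in sorted(classes.items()):
    print(key, sorted(members))
```

The same rule that usefully merges U.S.A. with USA also merges C.A.D. with cad, which is exactly the risk raised by the last two rules above.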

Normalization of Tokens (cont’d)

Case folding - reduces the term index by ~17%, but is a lossy compression
- Convert all to lower case (most practical); or some to lower case
Spelling variations (neighbor vs. neighbour; a foreign name)
Accents on letters (naïve vs. naive; many foreign language terms)
Variant transliteration (Den-Haag vs. The Hague)
- Use phonetic equivalence, best such algorithm: Soundex!
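
The slide names Soundex without giving its rules. The sketch below follows the commonly described American Soundex coding (first letter kept, consonants mapped to digits, repeated codes collapsed, result padded to four characters); the function name `soundex` and the test words are assumptions for illustration.

```python
# Letters that share a digit are treated as sounding alike.
GROUPS = {"1": "bfpv", "2": "cgjkqsxz", "3": "dt", "4": "l", "5": "mn", "6": "r"}
CODES = {ch: digit for digit, letters in GROUPS.items() for ch in letters}


def soundex(name):
    """Return the four-character Soundex code of name ('' if it has no letters)."""
    letters = [ch for ch in name.lower() if ch.isalpha()]
    if not letters:
        return ""
    first = letters[0]
    digits = []
    prev = CODES.get(first, "")
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not separate two letters with the same code
        code = CODES.get(ch, "")
        if code and code != prev:
            digits.append(code)
        prev = code  # a vowel resets prev, so a repeated code after it is kept
    return (first.upper() + "".join(digits) + "000")[:4]


print(soundex("neighbor"), soundex("neighbour"))
print(soundex("Haag"), soundex("Hague"))
```

Both spellings in each pair receive the same code, so a lookup by Soundex code treats them as one term.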

More on normalization under Stemming….

Phrase processing

Phrase recognition is based on the goal of indexing meaningful phrases like

“Lincoln Town Car”
“San Francisco”
“apple pie”

Doing this would use word order to assist with effectiveness -- otherwise we are assuming the query and documents are just a “bag of words”

~ 10% of web queries are explicit phrase queries
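
The difference between a bag-of-words match and a phrase match can be shown on a single document. This is a minimal sketch under simple assumptions (whitespace tokenization, lower-casing only); the function names and sample documents are made up for illustration.

```python
def bag_of_words_match(query, doc):
    """True if every query word occurs somewhere in the document."""
    return set(query.lower().split()) <= set(doc.lower().split())


def phrase_match(query, doc):
    """True if the query words occur in the document adjacent and in order."""
    q = query.lower().split()
    d = doc.lower().split()
    return any(d[i:i + len(q)] == q for i in range(len(d) - len(q) + 1))


doc = "a pie made with one apple"
print(bag_of_words_match("apple pie", doc))  # word order is ignored
print(phrase_match("apple pie", doc))        # word order is required
```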

Constructing Phrases

Using Part-of-Speech Tagging

• Can take advantage of NLP techniques:

• Use part-of-speech tagging to identify key components of a sentence (S-V-OBJ, …)

• Use these to identify phrases

Keep all noun phrases (“Republic of China”), or
Keep an adjective followed by a noun (“Red Carpet”)
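
The adjective-followed-by-noun rule can be sketched once tokens carry part-of-speech tags. The tagger itself is out of scope here: the input below is tagged by hand, and the tag names `ADJ` and `NOUN` as well as the function name are assumptions for illustration.

```python
def adjective_noun_phrases(tagged_tokens):
    """Return two-word phrases where an adjective is directly followed by a noun."""
    pairs = zip(tagged_tokens, tagged_tokens[1:])
    return [
        f"{word1} {word2}"
        for (word1, tag1), (word2, tag2) in pairs
        if tag1 == "ADJ" and tag2 == "NOUN"
    ]


# Hand-tagged sentence; a real system would obtain the tags from a POS tagger.
tagged = [("the", "DET"), ("red", "ADJ"), ("carpet", "NOUN"),
          ("was", "VERB"), ("long", "ADJ")]
print(adjective_noun_phrases(tagged))
```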

Constructing Phrases

Using Named Entity Tagging

• Finding structured data within an unstructured document

People’s names, organizations, locations, amounts, etc.

Phrase Processing Summary

• Pro

Often found to improve effectiveness by 10%

• Con

Dramatically increases the size of the term dictionary and the size of the index

Parser Generators

• Goal is to allow users to specify parsing rules as grammars.

• Grammars provide a very flexible means of expressing all valid strings in a language.

Stemming Algorithms

• Rule-Based

Porter (1980)
Lovins (1968)

• Dictionary-based

K-stem (1989, 1993)

• Co-Occurrence-Based (1994)

• Others

Porter Stemmer

• An incoming word is cleaned up in the initialization phase; one prefix-trimming phase then takes place, followed by five suffix-trimming phases.

• Note: The entire algorithm will not be covered -- we will leave out some obscure rules.

Initialization

• First the word is cleaned up: it is converted to lower case, and only letters and digits are kept.

• F-16 is converted to f16.
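
The clean-up step is small enough to sketch directly. The function name `clean` is an assumption; the behavior follows the description above (lower-case, keep only letters and digits).

```python
def clean(word):
    """Lower-case the word and drop every character that is not a letter or digit."""
    return "".join(ch for ch in word.lower() if ch.isalnum())


print(clean("F-16"))
```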

Porter Stemming

• Remove prefixes:

“kilo”, “micro”, “milli”, “intra”, “ultra”, “mega”, “nano”, “pico”, “pseudo”

So megabyte and kilobyte both become “byte”.
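
Prefix trimming with the list on this slide can be sketched as follows. The names `PREFIXES` and `strip_prefix` are assumptions, and so is the guard that leaves a word alone when it consists of nothing but the prefix; the slide does not say what happens in that case.

```python
PREFIXES = ("kilo", "micro", "milli", "intra", "ultra",
            "mega", "nano", "pico", "pseudo")


def strip_prefix(word):
    """Remove one leading prefix from the list, if something remains afterwards."""
    for prefix in PREFIXES:
        if word.startswith(prefix) and len(word) > len(prefix):
            return word[len(prefix):]
    return word


print(strip_prefix("megabyte"), strip_prefix("kilobyte"))
```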

Step 3

With what is left, replace any suffix on the left with the suffix on the right ...

icate      ic     fabricate --> fabric (think about this one)
ative      --     combativ --> comb (another good one)
alize      al     nationalize --> national
iciti      ic
ical       ic     tropical --> tropic
ful        --     faithful --> faith
iveness    ive    inventiveness --> inventive
ness       --     harness --> har
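
The Step 3 table can be applied with a simple longest-suffix-first lookup. This is a sketch of the table as shown on the slide only: it assumes that "--" means the suffix is deleted, it ignores any stem-length conditions that a full Porter implementation checks, and the names `STEP3_RULES` and `step3` are made up for illustration.

```python
# (suffix, replacement) pairs from the slide; "" means the suffix is removed.
STEP3_RULES = [
    ("icate", "ic"), ("ative", ""), ("alize", "al"), ("iciti", "ic"),
    ("ical", "ic"), ("ful", ""), ("iveness", "ive"), ("ness", ""),
]


def step3(word):
    """Apply the first matching rule, trying longer suffixes before shorter ones."""
    for suffix, replacement in sorted(STEP3_RULES, key=lambda rule: -len(rule[0])):
        if word.endswith(suffix):
            return word[: -len(suffix)] + replacement
    return word


for w in ["fabricate", "nationalize", "tropical", "faithful", "harness"]:
    print(w, "-->", step3(w))
```

Trying the longest suffix first matters for words like "inventiveness", where both "iveness" and "ness" match.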

Step 4

• Remove remaining standard suffixes

al, ance, ence, er, ic, able, ible, ant, ement, ment, ent, sion, tion, ou, ism, ate, iti, ous, ive, ize, ise

Step 5

• Remove a trailing “e” if the rest of the word does not end in a vowel

hinge --> hing
free --> free
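
Read against the two examples, the rule appears to test the letter just before the final "e". The sketch below uses that reading, which is an interpretation rather than something the slide states outright; the function name `step5` and the vowel set `aeiou` are assumptions.

```python
VOWELS = "aeiou"


def step5(word):
    """Drop a final 'e' unless the letter before it is a vowel."""
    if len(word) > 1 and word.endswith("e") and word[-2] not in VOWELS:
        return word[:-1]
    return word


print(step5("hinge"), step5("free"))
```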

Porter Summary

• Pros

Commonly used and has shown good results

• Cons

Many words with different meanings have common stems (e.g., fabricate and fabric)

A lot of stems are not words

Co-Occurrence

Pro
- Language independent (no dictionary needed)
- Based on the assumption that terms in a class will co-occur with each other: “hippo” will co-occur with “hippos”
- Improves effectiveness
Con
- computationally expensive to build co-occurrence matrix (but you only do it every now and then)
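
The co-occurrence idea can be sketched by counting how often two candidate variants appear in the same document and scoring the pair. The Dice coefficient used below is a stand-in chosen for simplicity, not necessarily the measure used by the 1994 method; the function names and the toy collection are assumptions.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(docs):
    """Return (document frequency per term, document co-occurrence per term pair)."""
    doc_freq = Counter()
    pair_freq = Counter()
    for doc in docs:
        terms = set(doc.lower().split())
        doc_freq.update(terms)
        pair_freq.update(frozenset(p) for p in combinations(sorted(terms), 2))
    return doc_freq, pair_freq


def dice(a, b, doc_freq, pair_freq):
    """Dice coefficient: 2 * (docs containing both) / (docs with a + docs with b)."""
    total = doc_freq[a] + doc_freq[b]
    return 2 * pair_freq[frozenset((a, b))] / total if total else 0.0


docs = [
    "the hippo and other hippos swim",
    "a hippo sleeps",
    "hippos eat grass",
    "hip music",
]
doc_freq, pair_freq = cooccurrence_counts(docs)
print(dice("hippo", "hippos", doc_freq, pair_freq))
print(dice("hippo", "hip", doc_freq, pair_freq))
```

Counting every pair in every document is what makes the matrix expensive to build, which is the cost noted under Con.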

N-grams

Noise such as OCR (Optical Character Recognition) errors or misspellings lowers the query processing accuracy in a term-based search.

The premise is:
- Terms are all strings of length n
- Substrings of a term may help to find a match in the noisy cases
Replace terms with n-grams
Language-independent -- no stemming or stop word removal needed

5-Gram Example

• Q: What technique works on noise and misspelled words?

• D1: N-grams work on noisy mispelled text.

_work _on_no on_noi n_nois

spell pelle elled lled_

8 terms are matched
No stemming of work, noise
Partial match of misspelled word
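
The example can be reproduced with a few lines that slide a 5-character window over each text. This is a sketch under assumptions not stated on the slide: text is lower-cased, punctuation is dropped, word boundaries are written as underscores, and one underscore is added at each end. The exact tally depends on those boundary choices, so it need not equal the count given above.

```python
import re


def char_ngrams(text, n=5):
    """Return the set of character n-grams of text, with '_' marking word boundaries."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    padded = "_" + "_".join(words) + "_"
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


query = "What technique works on noise and misspelled words?"
doc = "N-grams work on noisy mispelled text."
shared = char_ngrams(query) & char_ngrams(doc)
print(sorted(shared))
```

The shared set contains "_work" without any stemming of "works", and "spell", "pelle", "elled" and "lled_" survive the misspelling in the document.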

N-gram Summary

• Pro

Language independent
Works on garbled text (OCR, etc.)

• Con

There can be a LOT of n-grams; the dictionary may no longer fit in memory

query processing requires more resources