Information Retrieval: Finding Relevant Objects through Text in IR Systems, Study notes of Linguistics

An overview of information retrieval (IR), the science of finding objects such as text, pictures, and videos that are relevant to a user's query. The notes cover text retrieval, the history of IR, and the main components of IR systems. They also discuss models for query processing, including Boolean, vector space, probabilistic, and latent semantic indexing (LSI), as well as approaches to improving retrieval, such as query expansion, popularity measures, user profiles, and linguistically motivated indexing (LMI).


Uploaded on 09/02/2009

LING 120: Computers and Language
Topic: Information Retrieval
Recommended Reading: Tzoukermann et al., 2003
Information retrieval: The science of finding objects (text, pictures, video, etc.) in any medium
that are relevant to a user's query.
Text retrieval: The science of finding text relevant to a user's query in a collection of documents.
The query may be formulated by the user, or may be pre-formulated based on a user profile.
Dates back to 1890, when Herman Hollerith invented a machine to tabulate US census data.
For years, most uses were in the scientific, legal, and medical fields. Use became widespread with
the advent of the World Wide Web.
Main components of any IR system:
Document processing (represent documents in a searchable form)
Query processing (represent queries in a searchable form)
Matching and retrieval (mechanism to measure document relevance to query)
Document processing:
Documents are usually represented as indexed keywords (“inverted index”).
build → 12
school → 85
The numbers show the position of each word in a document, measured in characters. An inverted
index is built for each document.
To save time/space, “stop words” (i.e., function words) are removed first.
Keywords are also sometimes “stemmed” (i.e., prefixes and suffixes are removed) before
indexing.
There is still debate on stemming due to mixed results. Stemming typically increases
“recall”, but reduces “precision” (see notes on text categorization).
Stemming could be done using
a traditional stemmer (see Chapter 3 of textbook; exercise 3.3.4 number 3),
or a full-fledged morphological analyzer.
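
The document processing steps above can be sketched in a few lines of Python. This is a minimal illustration, not the method of any particular IR system: the stop-word list, the sample sentence, and the `naive_stem` helper are all hypothetical, and the stemmer strips only a handful of suffixes (no prefixes), which is far cruder than a traditional stemmer or a morphological analyzer.

```python
import re
from collections import defaultdict

# Hypothetical stop-word list, for illustration only.
STOP_WORDS = {"a", "an", "the", "of", "to", "in", "is", "and"}

def naive_stem(word):
    """Strip one common English suffix; a crude stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        # Keep at least three characters so short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def build_index(text):
    """Map each stemmed keyword to the character offsets where it occurs."""
    index = defaultdict(list)
    for match in re.finditer(r"[a-z]+", text.lower()):
        word = match.group()
        if word in STOP_WORDS:
            continue  # stop words are dropped before indexing
        index[naive_stem(word)].append(match.start())
    return dict(index)

index = build_index("The school is building a school")
print(index)
```

With this sample sentence, "building" and a hypothetical "builds" would share the entry "build", which is how stemming can raise recall while also matching forms the user did not intend.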
Query processing:
A number of models are generally used, e.g., Boolean, vector space, and probabilistic.
Boolean systems: Queries are built based on the presence or absence of query terms in
documents.
These systems are very efficient and are widely used in bibliographic and other database
searches.
Decisions are binary (relevant or not relevant). Therefore, these systems may miss documents
that are only somewhat relevant.
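
A Boolean system of the kind described above can be sketched with Python sets. The three-document collection and the `containing` helper are hypothetical, chosen only to show how AND, OR, and NOT map onto set operations and why binary decisions can exclude a partly relevant document.

```python
# Hypothetical three-document collection, for illustration only.
docs = {
    1: "information retrieval with boolean queries",
    2: "vector space retrieval",
    3: "boolean algebra",
}

# Postings: term -> set of IDs of the documents that contain the term.
postings = {}
for doc_id, text in docs.items():
    for term in text.split():
        postings.setdefault(term, set()).add(doc_id)

all_ids = set(docs)

def containing(term):
    """Return the IDs of documents containing the term (empty set if none)."""
    return postings.get(term, set())

both = containing("boolean") & containing("retrieval")    # boolean AND retrieval
either = containing("boolean") | containing("retrieval")  # boolean OR retrieval
not_boolean = all_ids - containing("boolean")             # NOT boolean

print(both, either, not_boolean)
```

Here the AND query returns only document 1. Document 2 is about retrieval and might be somewhat relevant, but it is excluded outright because it lacks one query term.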

Vector space models: Documents are represented as vectors of TF.IDF weights.
Documents are then ranked based on their “similarity” to the query.
Similarity is measured based on the geometric positions of the query and documents in the vector space.
These models are simple, fast, and popular. Mostly used in Internet search engines.
Probabilistic models: Calculate the probability of a document being relevant given the query.
The problem is that the relevance of documents is not known at the time the query is formed.
Also, this approach doesn't take into account the frequencies of words in documents.
Having been devised in the mid-1970s, this approach is not popular these days.
Latent Semantic Indexing (LSI): Based on “Singular Value Decomposition” from linear algebra, this technique allows for the retrieval of potentially relevant documents that contain words semantically related to the query terms, not necessarily those words per se.
That is, LSI represents documents in a “concept space,” rather than a “term space.”
Increases recall substantially, but reduces precision.
Approaches to improving retrieval:
Query expansion: Add words from the most relevant documents to the query (e.g., the “more like this” feature of Google), or add synonyms (requires disambiguation).
Use popularity measures: Give higher scores to documents that are pointed to frequently, or that are viewed by users frequently.
Use user profiles: Give higher scores to documents that are similar to what the user has seen or searched for before.
Use Linguistically Motivated Indexing (LMI): Index based on linguistic phrases (e.g., noun phrases, or collocations).
References
Tzoukermann, E., Klavans, J., & Strzalkowski, T. (2003). Information Retrieval. In R. Mitkov (Ed.), The Oxford Handbook of Computational Linguistics (pp. 529-544). Oxford: Oxford University Press.
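
The vector space ranking described in these notes can be sketched as follows. This is a minimal illustration under stated assumptions: the three sample documents are hypothetical, the weight is raw term frequency times log(N / document frequency), which is only one of several common TF.IDF variants, and "geometric position" is measured here by cosine similarity.

```python
import math
from collections import Counter

# Hypothetical collection, for illustration only.
docs = {
    "d1": "information retrieval finds relevant documents",
    "d2": "boolean retrieval uses binary decisions",
    "d3": "census data tabulation",
}

def tokenize(text):
    return text.lower().split()

doc_tokens = {name: tokenize(text) for name, text in docs.items()}
n_docs = len(docs)

# Document frequency: in how many documents each term appears.
df = Counter()
for tokens in doc_tokens.values():
    df.update(set(tokens))

def tfidf(tokens):
    """Weight = raw term frequency * log(N / df); terms unseen in the collection are skipped."""
    counts = Counter(tokens)
    return {t: c * math.log(n_docs / df[t]) for t, c in counts.items() if t in df}

def cosine(u, v):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

doc_vectors = {name: tfidf(tokens) for name, tokens in doc_tokens.items()}

def rank(query):
    """Return document names ordered from most to least similar to the query."""
    q = tfidf(tokenize(query))
    scores = {name: cosine(q, vec) for name, vec in doc_vectors.items()}
    return sorted(scores, key=scores.get, reverse=True)

ranking = rank("relevant information retrieval")
print(ranking)
```

Unlike a Boolean match, the ranking is graded: a document that shares only one query term still receives a small nonzero score and appears below documents that share more of the query.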