Information Retrieval: Finding Relevant Objects through Text in IR Systems, Study notes of Linguistics

Iowa State University (ISU), Linguistics

An overview of information retrieval (IR), the science of finding objects such as text, pictures, and videos that are relevant to a user's query. The notes cover text retrieval, the history of IR, and the main components of IR systems. They also discuss models for query processing, including Boolean, vector space, probabilistic, and latent semantic indexing (LSI), as well as approaches to improving retrieval, such as query expansion, popularity measures, user profiles, and linguistically motivated indexing (LMI).

Typology: Study notes

Pre 2010

Uploaded on 09/02/2009

koofers-user-2sg 🇺🇸


LING 120: Computers and Language

Topic: Information Retrieval

Recommended Reading: Tzoukermann et al., 2003

●Information retrieval: The science of finding objects (text, pictures, video, etc.) in any medium that are relevant to a user query.

●Text retrieval: The science of finding text relevant to a user query in a collection of documents.

○Query may be formulated by user, or may be pre-formulated based on a user profile.

●Dates back to the 1890 US census, for which Herman Hollerith built a machine to tabulate the data.

●For years, most uses were in the scientific, legal, and medical fields. Use became widespread with the advent of the World Wide Web.

●Main components of any IR system:

○Document processing (represent documents in a searchable form)

○Query processing (represent queries in a searchable form)

○Matching and retrieval (mechanism to measure document relevance to query)

●Document processing:

○Documents are usually represented as indexed keywords (“inverted index”).

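build → 12

school → 85

The numbers give the position of each word in the document, counted in characters. In these notes an index is built for each document; a full IR system typically combines them into one inverted index that maps each term to the documents (and positions) in which it occurs.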

○To save time/space, “stop words” (i.e., function words) are removed first.

○Keywords are also sometimes “stemmed” (i.e., prefixes and suffixes are removed) before indexing.

■There is still debate about stemming because results have been mixed. Stemming typically increases “recall” but reduces “precision” (see notes on text categorization).

■Stemming could be done using

●a traditional stemmer (see Chapter 3 of textbook; exercise 3.3.4 number 3),

●or a full-fledged morphological analyzer.
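The document-processing steps above (stop-word removal, stemming, and indexing by character position) can be sketched roughly as follows. This is only an illustration under stated assumptions: the stop-word list, the suffix list, the sample sentence, and the function names `naive_stem` and `build_index` are all made up for the example, and the suffix stripper is a crude stand-in for a real stemmer or morphological analyzer.

```python
import re
from collections import defaultdict

# Assumed, very short stop-word list; real systems use much longer ones.
STOP_WORDS = {"a", "an", "and", "are", "in", "is", "of", "the", "to", "we"}

# Assumed suffixes for a crude stemmer; this is NOT the Porter algorithm.
SUFFIXES = ("ing", "ed", "s")


def naive_stem(word):
    """Strip the first matching suffix, keeping a stem of at least 3 letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def build_index(text):
    """Map each stemmed, non-stop word to its character offsets in text."""
    index = defaultdict(list)
    for match in re.finditer(r"[a-z]+", text.lower()):
        word = match.group()
        if word in STOP_WORDS:
            continue  # stop words are dropped before indexing
        index[naive_stem(word)].append(match.start())
    return dict(index)


# Hypothetical sample document.
doc = "We are building the school and the schools are built"
index = build_index(doc)
print(index)  # {'build': [7], 'school': [20, 35], 'built': [47]}
```

Note that "school" and "schools" collapse to one entry, which is how stemming can raise recall, while the irregular form "built" is missed by suffix stripping; a morphological analyzer could relate it to "build".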

●Query processing:

○A number of models are generally used, e.g., Boolean, vector space, and probabilistic.

○Boolean systems: Queries combine terms with operators such as AND, OR, and NOT, and documents are matched on the presence or absence of those terms.

■These systems are very efficient and are widely used in bibliographic and other database searches.

■Decisions are binary (relevant or not relevant). Therefore, these systems may miss documents that are only somewhat relevant.
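As a rough illustration of Boolean matching, the sketch below treats each term's posting list as a set of document ids, so that AND, OR, and NOT become set intersection, union, and difference. The toy collection and the helper names `build_postings` and `term` are assumptions made for this example, not part of the notes.

```python
# Hypothetical toy collection: document id -> text.
DOCS = {
    1: "information retrieval finds relevant documents",
    2: "boolean systems match query terms",
    3: "vector space systems rank relevant documents",
}


def build_postings(docs):
    """Map each term to the set of document ids that contain it."""
    postings = {}
    for doc_id, text in docs.items():
        for word in text.split():
            postings.setdefault(word, set()).add(doc_id)
    return postings


POSTINGS = build_postings(DOCS)


def term(word):
    """Return the documents containing word (empty set if unseen)."""
    return POSTINGS.get(word, set())


# AND is intersection, OR is union, NOT is set difference.
relevant_and_documents = term("relevant") & term("documents")
systems_or_retrieval = term("systems") | term("retrieval")
systems_not_vector = term("systems") - term("vector")

print(relevant_and_documents)  # {1, 3}
print(systems_or_retrieval)    # {1, 2, 3}
print(systems_not_vector)      # {2}
```

Each document is either in the result set or not; there is no score, which is why a partially matching document (for example, one containing "relevant" but not "documents") is simply dropped from an AND query.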


○Vector space models: Documents and queries are represented as vectors of TF-IDF (term frequency × inverse document frequency) weights.

■Documents are then ranked based on their “similarity” to the query.

■Similarity is measured based on the geometric positions of the query and documents in the vector space.

■These models are simple, fast, and popular. They are mostly used in Internet search engines.

○Probabilistic models: Calculate the probability that a document is relevant given the query. The problem is that the relevance of documents is not known at the time the query is formed. Also, this approach doesn't take into account the frequencies of words in documents. Devised in the mid-1970s, this approach is not popular these days.

○Latent Semantic Indexing (LSI): Based on “Singular Value Decomposition” from linear algebra, this technique allows for the retrieval of potentially relevant documents that contain words semantically related to the query terms, not necessarily those words per se. That is, LSI represents documents in a “concept space” rather than a “term space.” It increases recall substantially, but reduces precision.

●Approaches to improving retrieval:

○Query expansion: Add words from the most relevant documents to the query (e.g., the “more like this” feature of Google), or add synonyms (requires disambiguation).

○Use popularity measures: Give higher points to documents that are frequently linked to, or that are frequently viewed by users.

○Use user profiles: Give higher points to documents that are similar to what the user has seen or searched for before.

○Use Linguistically Motivated Indexing (LMI): Index based on linguistic phrases (e.g., noun phrases or collocations).

References

Tzoukermann, E., Klavans, J., & Strzalkowski, T. (2003). Information retrieval. In R. Mitkov (Ed.), The Oxford Handbook of Computational Linguistics (pp. 529-544). Oxford: Oxford University Press.
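As an illustration of the vector space model described in these notes, the sketch below weights terms by TF-IDF and ranks documents by the cosine of the angle between the query vector and each document vector. The specific weighting (raw term count times ln(N/df)), the toy collection, and the function names are assumptions chosen for the example; the notes do not specify a formula, and real systems use many variants.

```python
import math
from collections import Counter

# Hypothetical toy collection: document id -> text.
DOCS = {
    "d1": "school build school",
    "d2": "build road",
    "d3": "census data machine",
}


def idf_table(docs):
    """Inverse document frequency, assumed here to be ln(N / df)."""
    n = len(docs)
    df = Counter()
    for text in docs.values():
        df.update(set(text.split()))  # count each term once per document
    return {t: math.log(n / count) for t, count in df.items()}


def tfidf_vector(text, idf):
    """Sparse vector: term -> raw term count * idf (unknown terms dropped)."""
    tf = Counter(text.split())
    return {t: count * idf[t] for t, count in tf.items() if t in idf}


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank(query, docs):
    """Return (doc_id, score) pairs, most similar to the query first."""
    idf = idf_table(docs)
    q = tfidf_vector(query, idf)
    scores = {d: cosine(q, tfidf_vector(text, idf)) for d, text in docs.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


ranking = rank("school build", DOCS)
for doc_id, score in ranking:
    print(doc_id, round(score, 3))
```

Unlike a Boolean match, every document receives a graded score: here "d1" (which shares both query terms) ranks above "d2" (which shares only "build"), and "d3" scores zero.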
