

An overview of information retrieval (IR), a science that enables users to find relevant objects, such as text, pictures, and videos, based on their queries. The notes cover text retrieval, the history of IR, and the main components of IR systems. They also discuss various models for query processing, including Boolean, vector space, probabilistic, and latent semantic indexing (LSI), as well as approaches to improving retrieval, such as query expansion, popularity measures, user profiles, and linguistically motivated indexing (LMI).
Typology: Study notes


● Information retrieval: The science of finding objects (text, pictures, video, etc.) in any media that are relevant to a user query.
● Text retrieval: The science of finding text relevant to a user query in a collection of documents.
  ○ The query may be formulated by the user, or may be pre-formulated based on a user profile.
● Dates back to 1890, when Herman Hollerith invented a machine to tabulate US census data.
● For years, most uses were in scientific, legal, and medical fields. Use became widespread with the advent of the World Wide Web.
● Main components of any IR system:
  ○ Document processing (represent documents in a searchable form)
  ○ Query processing (represent queries in a searchable form)
  ○ Matching and retrieval (mechanism to measure document relevance to query)
● Document processing:
  ○ Documents are usually represented as indexed keywords (“inverted index”).
      build → 12
      school → 85
    The numbers show the position of each word in a document, in characters. An inverted index is built for each document.
  ○ To save time/space, “stop words” (i.e., function words) are removed first.
  ○ Also, keywords are sometimes “stemmed” (i.e., prefixes and suffixes are removed) before indexing.
    ■ There is still debate on stemming due to mixed results. Stemming typically increases “recall” but reduces “precision” (see notes on text categorization).
    ■ Stemming could be done using
      ● a traditional stemmer (see Chapter 3 of textbook; exercise 3.3.4 number 3),
      ● or a full-fledged morphological analyzer.
● Query processing
  ○ A number of models are generally used, e.g., Boolean, vector space, and probabilistic.
  ○ Boolean systems: Queries are built based on the presence or absence of query terms in documents.
    ■ These systems are very efficient and are widely used in bibliographic and other database searches.
    ■ Decisions are binary (relevant or not relevant). Therefore, they may miss documents that are only somewhat relevant.
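The document processing and Boolean matching steps described above can be sketched as follows. This is a minimal illustration, not a production design: the stop-word list, the regular-expression tokenizer, and the names `index_document` and `boolean_and` are assumptions made for the example, and stemming is left out.

```python
import re
from collections import defaultdict

# Hypothetical, tiny stop-word list; real systems use much longer ones.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is"}


def index_document(text):
    """Map each non-stop word to the character offsets where it occurs."""
    index = defaultdict(list)
    for match in re.finditer(r"[a-z]+", text.lower()):
        word = match.group()
        if word not in STOP_WORDS:
            index[word].append(match.start())
    return dict(index)


def boolean_and(indexes, terms):
    """Return the ids of documents that contain every (lowercase) query term."""
    return sorted(
        doc_id
        for doc_id, index in indexes.items()
        if all(term in index for term in terms)
    )


# Toy collection, invented for illustration.
docs = {
    "d1": "The plan is to build a school",
    "d2": "A school of fish",
    "d3": "Build the bridge",
}

# One index per document, as in the notes.
indexes = {doc_id: index_document(text) for doc_id, text in docs.items()}

print(indexes["d1"])                              # word -> character positions
print(boolean_and(indexes, ["build", "school"]))  # documents matching both terms
```

The binary nature of the decision is visible here: a document either contains all query terms or is dropped, with no notion of partial relevance.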
  ○ Vector space models: Documents are represented as vectors of TF.IDF weights.
    ■ Documents are then ranked based on their “similarity” to the query.
    ■ Similarity is measured based on the geometric positions of the query and documents in the vector space.
    ■ These models are simple, fast, and popular. They are mostly used in Internet search engines.
  ○ Probabilistic models: Calculate the probability of a document being relevant given the query. The problem is that the relevance of documents is not known at the time the query is formed. Also, in its original form this approach doesn't take into account the frequencies of words in documents. Devised in the mid-1970s, this approach is not popular these days.
  ○ Latent Semantic Indexing (LSI): Based on “Singular Value Decomposition” from linear algebra, this technique allows for the retrieval of potentially relevant documents that contain words semantically related to the query terms, not necessarily those words per se. That is, LSI represents documents in a “concept space” rather than a “term space.” It increases recall substantially, but reduces precision.
● Approaches to improving retrieval
  ○ Query expansion: Add words from the most relevant documents to the query (e.g., the “more like this” feature of Google), or add synonyms (requires disambiguation).
  ○ Use popularity measures: Give higher points to documents that are pointed to frequently, or that are viewed by users frequently.
  ○ Use user profiles: Give higher points to documents that are similar to what the user has seen or searched for before.
  ○ Use Linguistically Motivated Indexing (LMI): Index based on linguistic phrases (e.g., noun phrases or collocations).

References
Tzoukermann, E., Klavans, J., & Strzalkowski, T. (2003). Information Retrieval. In Mitkov, R. (ed.) The Oxford Handbook of Computational Linguistics, pp. 529-544. Oxford: Oxford University Press.
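The vector space model from the query processing notes can be sketched as follows. This is one common variant rather than the definitive formulation: it assumes raw term frequency multiplied by log(N / document frequency) as the TF.IDF weight, cosine similarity as the geometric measure, and a whitespace tokenizer. The function names and the toy collection are invented for the example.

```python
import math
from collections import Counter


def tokenize(text):
    """Assumed tokenizer: lowercase and split on whitespace."""
    return text.lower().split()


def build_idf(tokenized_docs):
    """IDF per term, computed here as log(N / number of docs containing it)."""
    n_docs = len(tokenized_docs)
    doc_freq = Counter()
    for tokens in tokenized_docs.values():
        doc_freq.update(set(tokens))
    return {term: math.log(n_docs / df) for term, df in doc_freq.items()}


def tfidf_vector(tokens, idf):
    """Sparse TF.IDF vector; terms unseen in the collection are ignored."""
    counts = Counter(tokens)
    return {term: tf * idf[term] for term, tf in counts.items() if term in idf}


def cosine(u, v):
    """Cosine of the angle between two sparse vectors (0.0 if either is empty)."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank(query, docs):
    """Return (doc_id, score) pairs, most similar to the query first."""
    tokenized = {doc_id: tokenize(text) for doc_id, text in docs.items()}
    idf = build_idf(tokenized)
    query_vec = tfidf_vector(tokenize(query), idf)
    scores = {
        doc_id: cosine(query_vec, tfidf_vector(tokens, idf))
        for doc_id, tokens in tokenized.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Toy collection, invented for illustration.
docs = {
    "d1": "build school build library",
    "d2": "school bus route",
    "d3": "city bus route map",
}

results = rank("build school", docs)
for doc_id, score in results:
    print(doc_id, round(score, 3))
```

Unlike the Boolean approach, every document receives a graded score, so a document that matches only one query term is ranked lower rather than discarded.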