

























An overview of the challenges and approaches to automatically summarizing and understanding English-language text, specifically webpages. It covers difficulties in natural language processing, information retrieval (IR) systems, query languages, user tasks, and document modeling. The text also discusses data cleaning, the bag of words model, and evaluating an IR system.
Chris Brooks
Department of Computer Science
University of San Francisco
Department of Computer Science — University of San Francisco – p.1/
Searching and browsing

Searching: the user has a specific information need and wants a document that meets that need.
“Find me an explanation of the re module in Python”

Browsing: the user has a broadly defined set of interests and wants information that satisfies his/her interests.
“Find me interesting pages about Python”

These different modes have different models of success.
model of the document.

This might be:
- A category or description (as in a library)
- A set of extracted phrases or keywords
- The full text of the document
- Full text with filtering
Order is discarded; we just count how often each word appears.
No semantics involved.
Intuition: Frequently appearing words give an indication of subject matter.
Advantage: No need to parse; computationally tractable for large collections.
Disadvantage: Contextual information and meaning are lost.
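As a minimal sketch of the counting idea above, the following Python builds a bag of words with `collections.Counter`. The function name `bag_of_words` and the regex tokenizer are assumptions made for illustration; the slides do not specify how text is split into words.

```python
import re
from collections import Counter


def bag_of_words(text):
    """Return a Counter mapping each lowercased word to its frequency."""
    # Assumption: a word is a run of letters, digits, or apostrophes.
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return Counter(tokens)


doc = "The dog chased the cat. The cat ran."
bag = bag_of_words(doc)
print(bag.most_common(2))  # [('the', 3), ('cat', 2)]

# Word order is discarded, so these two sentences get the same model.
print(bag_of_words("dog bites man") == bag_of_words("man bites dog"))  # True
```

The second comparison shows the stated disadvantage: two sentences with opposite meanings are indistinguishable once order is thrown away.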
Stemming might also be performed.
Word suffixes, such as pluralization, past tense, and -ing, are removed.
run, runs, running, runner all become run.
Advantages: If we’re just counting words, this lets us correctly count different forms of a word.
Disadvantages: dealing with abnormal forms (person/people, run/ran), potential misgrouping (university, universal).
The stemmer can be tuned to minimize either false positives (accidentally stemming a word it shouldn’t) or false negatives (not stemming a word it should).
There’s some debate in the research community about
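To make suffix removal and its trade-off concrete, here is a deliberately naive sketch in Python. It is not a real stemming algorithm such as Porter's; the suffix list, the function name `naive_stem`, and the `min_stem_len` knob are all assumptions chosen for illustration.

```python
# Hypothetical suffix list, checked in order (longest first).
SUFFIXES = ("ing", "er", "ed", "s")


def naive_stem(word, min_stem_len=3):
    """Strip one common suffix, keeping at least min_stem_len characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            stem = word[: -len(suffix)]
            # Undo consonant doubling, e.g. "runn" -> "run".
            if len(stem) >= 2 and stem[-1] == stem[-2]:
                stem = stem[:-1]
            return stem
    return word


print([naive_stem(w) for w in ("run", "runs", "running", "runner")])
# ['run', 'run', 'run', 'run']

print(naive_stem("ran"))     # ran  (abnormal form, not grouped with "run")
print(naive_stem("butter"))  # but  (false positive)

# A stricter setting avoids the false positive but misses a true one.
print(naive_stem("butter", min_stem_len=5))  # butter
print(naive_stem("runner", min_stem_len=5))  # runner  (false negative)
```

Raising `min_stem_len` makes the sketch more conservative: it stops mangling "butter" but also stops reducing "runner", which is the false positive versus false negative trade-off described above.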
\[ \text{precision} = \frac{\text{matchingDocs}}{\text{totalDocsReturned}} \]

Recall measures the fraction of relevant documents returned.

\[ \text{recall} = \frac{\text{relevantDocsReturned}}{\text{totalRelevantDocs}} \]

When might we want high precision? High recall?
Often, we can trade precision against recall.
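Both measures can be computed from the set of documents a system returned and the set of documents that are actually relevant. The sketch below assumes documents are identified by hypothetical string IDs, and it returns 0.0 when a denominator is empty; that convention is an assumption, not something the slides state.

```python
def precision_recall(returned, relevant):
    """Return (precision, recall) for the given document ID collections."""
    returned, relevant = set(returned), set(relevant)
    matching = returned & relevant  # relevant documents that were returned
    # Assumption: report 0.0 rather than dividing by zero.
    precision = len(matching) / len(returned) if returned else 0.0
    recall = len(matching) / len(relevant) if relevant else 0.0
    return precision, recall


relevant = ["d2", "d4", "d5", "d6", "d7"]

# A narrow result set: everything returned is relevant, but much is missed.
print(precision_recall(["d2", "d4"], relevant))  # (1.0, 0.4)

# Returning the whole (hypothetical) 10-document collection finds every
# relevant document but dilutes the results.
all_docs = [f"d{i}" for i in range(1, 11)]
print(precision_recall(all_docs, relevant))  # (0.5, 1.0)
```

The two calls illustrate the trade-off: returning fewer documents tends to raise precision at the cost of recall, and returning more does the opposite.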