Challenges & Approaches in AI Text Processing & Retrieval, Study notes of Computer Science

University of San Francisco (USF), Computer Science

An overview of the challenges and approaches to automatically summarizing and understanding English-language text, specifically webpages. It covers difficulties in natural language processing, information retrieval systems, query languages, user tasks, and document modeling. The text also discusses data cleaning, the bag-of-words model, and evaluating an IR system.

Typology: Study notes

Pre 2010

Uploaded on 07/30/2009

Artificial Intelligence Programming

Information Retrieval

Chris Brooks

Department of Computer Science

University of San Francisco

Processing text
- Now that we know a little bit about how to consider and compare different states, we’ll think about a problem with a harder representation.
- English-language text
  - Specifically, webpages
- We’ll look at several different approaches to automatically summarizing and understanding documents.

Department of Computer Science — University of San Francisco – p.1/

Difficulties
- What makes working with natural language hard?
  - Nonunique parses
  - Synonyms and multiple meanings
  - Anaphora
  - Slang and technical terms
  - Analogy and metaphor
  - Misspelling and incorrect grammar

Information Retrieval
- Information retrieval deals with the storage, retrieval, organization of, and access to information items
- Overlaps with:
  - Databases (more of a focus on content)
  - AI
  - Search engines

Query Languages
- What are some sorts of query languages?
  - Keyword - Google, Yahoo!, etc.
  - Natural language - Ask.com
  - SQL-style
  - Similar item - Netflix, Amazon
  - Multimedia - Flickr

User tasks
- We’ll also distinguish between different types of user tasks. The most common are searching and browsing.
- Searching - the user has a specific information need, and wants a document that meets that need.
  - “Find me an explanation of the re module in Python”
- Browsing - the user has a broadly defined set of interests, and wants information that satisfies his/her interests.
  - “Find me interesting pages about Python”
- These different modes have different models of success.

Modeling a Document
- In order to match a query to a document, an IR system must have a model of the document.
- This might be:
  - A category or description (as in a library)
  - A set of extracted phrases or keywords
  - The full text of the document
  - Full text with filtering

“Bag of words” model
- The techniques we’ll look at today treat a document as a bag of words.
- Order is discarded; we just count how often each word appears.
- No semantics involved
- Intuition: Frequently-appearing words give an indication of subject matter.
- Advantage: No need to parse, computationally tractable for large collections.
- Disadvantage: Contextual information and meaning are lost.

Data Cleaning
- Stemming might also be performed.
- Word suffixes, such as pluralization, past tense, -ing are removed.
  - run, runs, running, runner all become run.
- Advantages: If we’re just counting words, this lets us correctly count different forms of a word.
- Disadvantages: dealing with abnormal forms (person/people, run/ran), potential misgrouping (university, universal)
- The stemmer can be tuned to minimize either false positives (accidentally stemming a word it shouldn’t) or false negatives (not stemming a word it should).
- There’s some debate in the research community about
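
The suffix-stripping idea on this slide can be sketched in a few lines of Python. This is only an illustration, not the stemmer used in the course: the function name `naive_stem`, the suffix list, and the minimum stem length of 3 are all assumptions chosen so that the slide's run/runs/running/runner example works.

```python
# Hypothetical suffix list; a real stemmer uses a much richer rule set.
SUFFIXES = ("ing", "er", "ed", "s")


def naive_stem(word):
    """Strip one common suffix, then undo consonant doubling (runn -> run)."""
    word = word.lower()
    for suffix in SUFFIXES:
        # Require at least 3 characters to remain after stripping (assumed threshold).
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            stem = word[:-len(suffix)]
            if len(stem) >= 2 and stem[-1] == stem[-2] and stem[-1] not in "aeiou":
                stem = stem[:-1]
            return stem
    return word


for w in ["run", "runs", "running", "runner", "ran", "news"]:
    print(w, "->", naive_stem(w))
```

The last two inputs show the disadvantages listed above: "ran" is an abnormal form that is left unstemmed (a false negative), while "news" is cut down to "new" (a false positive).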

“Bag of Words”
- Once a document has been cleaned, the simplest model just counts how many times each word occurs in the document.
- This is typically represented as a dictionary.
- You built this in assignment 1.
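
As a reminder of what such a dictionary looks like, here is a minimal sketch. It is not the assignment 1 solution; the function name `bag_of_words` and the tokenizing regular expression are assumptions, and real cleaning would usually do more than lowercase the text.

```python
import re
from collections import Counter


def bag_of_words(text):
    """Return a dict mapping each lowercased word to how often it occurs."""
    # Assumed tokenizer: runs of letters, digits, and apostrophes.
    words = re.findall(r"[a-z0-9']+", text.lower())
    return dict(Counter(words))


bag = bag_of_words("The cat chased the other cat.")
print(bag)  # {'the': 2, 'cat': 2, 'chased': 1, 'other': 1}
```

Word order is gone from the result: any reordering of the same words produces the same dictionary.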

Precision and Recall
- Precision measures how well returned documents match a query.

$$\text{precision} = \frac{\text{matchingDocs}}{\text{totalDocsReturned}}$$

- Recall measures the fraction of relevant documents returned.

$$\text{recall} = \frac{\text{relevantDocsReturned}}{\text{totalRelevantDocs}}$$

- When might we want high precision? High recall?
- Often, we can trade precision against recall.
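
Both measures are easy to compute once the returned and relevant documents are available as sets. The sketch below assumes documents are identified by hypothetical string IDs and that "matchingDocs" means returned documents that are actually relevant; returning 0.0 for an empty set is also an assumption.

```python
def precision_recall(returned, relevant):
    """Compute (precision, recall) from collections of document IDs."""
    returned, relevant = set(returned), set(relevant)
    hits = returned & relevant  # relevant documents that were returned
    precision = len(hits) / len(returned) if returned else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


p, r = precision_recall(
    returned=["d1", "d2", "d3", "d4"],
    relevant=["d1", "d3", "d5", "d6", "d7"],
)
print(p, r)  # 0.5 0.4
```

Here two of the four returned documents are relevant (precision 2/4), but only two of the five relevant documents were found (recall 2/5). Returning every document in the collection would push recall to 1.0 while lowering precision, which is the trade-off mentioned above.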

Boolean Queries
- Boolean queries are simple, but not very practical.
- User provides a set of keywords.
  - Possibly also OR terms
- All documents containing all keywords are returned.
- This is the sort of query model that databases use
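
A minimal sketch of this query model follows. The representation is an assumption made for illustration: a query is a list of sets, each set is a group of OR alternatives, and every group must be satisfied. The name `boolean_search` and the sample documents are hypothetical.

```python
def boolean_search(docs, query):
    """Return names of documents that satisfy every OR-group in the query.

    docs:  dict mapping document name -> text
    query: list of sets; a single keyword is a one-element set
    """
    results = []
    for name, text in docs.items():
        words = set(text.lower().split())
        # A group matches if the document contains any of its alternatives.
        if all(group & words for group in query):
            results.append(name)
    return results


docs = {
    "a": "cat dog bunny",
    "b": "cat fish",
    "c": "dog snake cat",
}
# cat AND (bunny OR snake)
print(boolean_search(docs, [{"cat"}, {"bunny", "snake"}]))  # ['a', 'c']
```

Document "b" is dropped entirely even though it matches half of the query, which is one reason this model is not very practical for search.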

Probabilistic Queries
- A simple extension is to allow partial matches on queries
- Score documents according to the fraction of query terms matched
- Return documents according to score
  - Example: Document contains “cat cat dog bunny fish”
  - Query is “cat dog (bunny OR snake) bird”
  - Score is 3/4.
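
The example score can be reproduced with a short sketch. As an assumption for illustration, the query is written as a list of sets, with each set holding the OR alternatives for one query term; the function name `match_score` is hypothetical.

```python
def match_score(doc_text, query):
    """Fraction of query terms (OR-groups) that the document matches."""
    if not query:
        return 0.0
    words = set(doc_text.lower().split())
    matched = sum(1 for group in query if group & words)
    return matched / len(query)


doc = "cat cat dog bunny fish"
# cat dog (bunny OR snake) bird
query = [{"cat"}, {"dog"}, {"bunny", "snake"}, {"bird"}]
print(match_score(doc, query))  # 0.75
```

Three of the four terms match (cat, dog, and the bunny-or-snake group), giving 3/4. The second "cat" in the document does not change the score, because the document is reduced to a set of words before matching.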

Probabilistic Queries
- Weaknesses:
  - Still requires logical queries
  - Doesn’t deal with word frequency
  - Dependent on query length - short queries will have a hard time getting differentiated scores.
  - The average Google query is only three words long!
