Effect of Stemming on Mean Average Precision in IR and the Web, Slides of Artificial Intelligence

Johns Hopkins University (JHU), Artificial Intelligence

Data on the effect of stemming on mean average precision (MAP) in information retrieval (IR) and the web. The data includes results by student code for the effect of stemming on MAP and on recall. The document also discusses challenges of the web, including distributed data, volatility, scale, lack of structure, and quality, and covers the heterogeneous nature of the web and how to study it.

Typology: Slides

2010/2011

Uploaded on 11/09/2011

stagist 🇺🇸


HW#3 Evaluation

 Results are for scientific interest only

– Not competitive (rank doesn’t affect grading)

 Stats

– (Normal / Stems)
– MAP-Max: 0.5822 / 0.5910
– MAP-Mean: 0.4219 / 0.4250
– Recall-Max: 263 / 268

 Submissions were reasonable

– Few catastrophic failures

 Stemming: recall up in 12 cases, down in 1 case
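
For reference, MAP and the "relevant found" count can be computed from a ranked result list and a set of relevance judgments. The following is a minimal sketch of the standard non-interpolated definition. The function names and the toy document IDs are illustrative assumptions, not the course's grading script.

```python
from typing import Dict, List, Set


def average_precision(ranked: List[str], relevant: Set[str]) -> float:
    """Non-interpolated average precision for a single query."""
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this relevant hit
    # Relevant documents that were never retrieved contribute zero.
    return precision_sum / len(relevant)


def mean_average_precision(runs: Dict[str, List[str]],
                           qrels: Dict[str, Set[str]]) -> float:
    """Mean of the per-query average precision over all judged queries."""
    scores = [average_precision(runs.get(q, []), rel) for q, rel in qrels.items()]
    return sum(scores) / len(scores) if scores else 0.0


def relevant_found(runs: Dict[str, List[str]],
                   qrels: Dict[str, Set[str]]) -> int:
    """Total number of relevant documents retrieved across all queries."""
    return sum(len(set(runs.get(q, [])) & rel) for q, rel in qrels.items())


# Toy judgments and rankings (hypothetical data, for illustration only).
qrels = {"q1": {"d1", "d3"}, "q2": {"d2"}}
runs = {"q1": ["d1", "d2", "d3"], "q2": ["d1", "d2"]}

print(mean_average_precision(runs, qrels))
print(relevant_found(runs, qrels))
```

Running the same functions over a "Normal" run and a "Stems" run for each submission would give the two numbers compared in the stats above.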

Mean Average Precision

[Figure: Effect of Stemming - MAP. Mean average precision (0.00 to 0.60) by student code (1 to 14), Normal vs. Stems.]


2010: Mean Average Precision

[Figure: Effect of Stemming - MAP, 2010. Mean average precision by student code (1 to 23), Normal vs. Stems.]

Relevant Found

[Figure: Effect of Stemming - Recall. Relevant documents found (0 to 300) by student code (1 to 14), Normal vs. Stems.]

Distributed data

 New servers, pages can appear anytime (without notice)

 No central registry for web servers
– Virtual hosting makes this more complicated (more than one server per IP address)

 Millions of servers, billions of pages
– To index Web data, you must make a large number of network connections
– How long does it take to circumnavigate the Web?

Volatility and Scale

 Perhaps 40% of the Web changes monthly
– Indexes quickly grow outdated or inaccurate

 Growth appears exponential

 True numbers are larger
– Much of the web isn’t static
– ASP/JSP front-ends to databases, catalogs
– Firewalls block access to Intranets

Lack of Structure

 Duplication (30%)

 Plain text is easiest to index
– Most HTML pages are not syntactically valid
– Plethora of other formats (MS Word, PDF)

 Naming
– URL names are ugly and difficult to canonicalize
– Pages migrate to new servers
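
To illustrate why canonicalizing URLs takes some care, here is a minimal sketch that normalizes a few common variations (case of scheme and host, default ports, fragments, empty paths). The `canonicalize` function and the rules it applies are assumptions chosen for illustration. A real crawler would need more rules, and this sketch deliberately drops any user:password part.

```python
from urllib.parse import urlsplit, urlunsplit

# Ports that are implied by the scheme and can be dropped.
DEFAULT_PORTS = {"http": 80, "https": 443}


def canonicalize(url: str) -> str:
    """Return a normalized form of url so trivially different spellings compare equal."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()  # hostname is case-insensitive
    port = parts.port
    if port is None or DEFAULT_PORTS.get(scheme) == port:
        netloc = host
    else:
        netloc = f"{host}:{port}"
    path = parts.path or "/"  # treat an empty path as the root
    # The fragment never reaches the server, so it is discarded.
    return urlunsplit((scheme, netloc, path, parts.query, ""))


print(canonicalize("HTTP://Example.COM:80/a/b#frag"))
print(canonicalize("http://example.com"))
```

Two URLs that canonicalize to the same string can share one entry in the crawler's "already seen" set.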

Quality

 No editorial review
– Authors tend to make more grammatical and spelling errors in web pages

 Anybody can publish
– Blessing and a curse

 Undesirable content
– Filtering objectionable content is technically difficult

Search Engine Coverage

 Lawrence & Giles (cont’d)
– Took 1050 ‘real world’ queries
– Pooled set of responses from 11 engines
– Determined each engine’s coverage
  - Extrapolated to estimated coverage of indexable web
– Estimated currency

Engine: NL S AV HB MS IS G Y E L EU
Coverage: 38.3 37.1 37.1 27.1 20.3 19.2 18.6 17.6 13.6 5.9 5.
Cov (web): 16.0 15.5 15.5 11.3 8.5 8.0 7.8 7.4 5.6 2.5 2.
% invalid: 9.8 2.8 6.7 2.2 2.6 5.5 7.0 2.9 2.7 14.0 2.6 5.
Mean age: 141 240 166 192 194 148 235 206 174 186

Lawrence & Giles: Conclusions

 Size
– Continues to grow (Moore’s law)

 Currency
– Engines not keeping pace with growth
– Why?

 Bias
– Popular pages much more likely to be indexed (especially Google)
– What effect?

Benefits of the Web?

 The Web presents many challenges, but are there any benefits for IR?

 There is a particular kind of value-added annotation
– HTML

Ranking Ideas for the Web

 Exploit links
– Possibly, words near a hyperlink are more important

 Currency
– Assumes most recent data is best

 Popularity
– Use estimates of what a large number of people think about a page or site
– Estimate based on easy-to-obtain data
  - number of inbound links to ‘that’ page
  - called ‘backlink frequency’

 Authority
– Harder to estimate than popularity
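
As a small illustration of backlink frequency, the sketch below counts inbound links per page from a table of outgoing links. The `backlink_frequency` function is an assumption, as are the choices to count each linking page once and to ignore self-links. The toy graph is the link structure implied by the update rules in the PageRank example in these slides.

```python
from collections import Counter
from typing import Dict, List


def backlink_frequency(outlinks: Dict[str, List[str]]) -> Counter:
    """Count, for each page, how many distinct other pages link to it."""
    counts = Counter({page: 0 for page in outlinks})
    for source, targets in outlinks.items():
        for target in set(targets):   # a page linking twice still counts once
            if target != source:      # ignore self-links
                counts[target] += 1
    return counts


# Outgoing links per page (A links to B and C, and so on).
links = {"A": ["B", "C"], "B": ["D"], "C": ["B"], "D": ["C"]}
freq = backlink_frequency(links)
print(freq.most_common())
```

Backlink frequency treats every inbound link as equal. PageRank refines this by weighting each link by the rank of the page it comes from.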

Google’s measure of authority

 PageRank simulates a user navigating randomly in the Web who jumps to a random page with probability q or follows a random hyperlink (on the current page) with probability 1 - q

 This process can be modeled with a Markov chain, from which the stationary probability of being in each page can be computed

 Let C(a) be the number of outgoing links of page a and suppose that page a is pointed to by pages p1 to pn

PageRank

typical q = 0.
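
With q, C(a), and p1 to pn as defined on the previous slide, the PageRank of page a is commonly written as below. This is a reconstruction from those definitions and from the update rule used in the worked example in these slides (where the constant term is q itself rather than q divided by the number of pages), not the slide's own rendering.

```latex
PR(a) = q + (1 - q) \sum_{i=1}^{n} \frac{PR(p_i)}{C(p_i)}
```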

PageRank’s advantages

 Google can rank unseen pages!
– Corollary: Google can rank non-text content

 Estimates of page quality (for unseen pages) can be used for crawl ordering
– “Efficient crawling through URL ordering”, Cho, Garcia-Molina, and Page, WWW-7.

PageRank Example

[Figure: graph of pages A, B, C, D]

        A      B      C      D
t=0     0.25   0.25   0.25   0.
t=1     0.15   0.468  0.468  0.
t=2     0.15   0.612  0.522  0.
t=3     0.15   0.657  0.680  0.
t=4     0.15   0.792  0.784  0.
t=5     0.15   0.880  0.816  0.
t=30    0.15   1.297  1.277  1.

Using teleport prob. of 0.15:
PR(A,t=1) = 0.15 + 0
PR(B,t=1) = 0.15 + 0.85 * (PR(A,t=0)/2 + PR(C,t=0)/1)
PR(C,t=1) = 0.15 + 0.85 * (PR(A,t=0)/2 + PR(D,t=0)/1)
PR(D,t=1) = 0.15 + 0.85 * (PR(B,t=0)/1)
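
The iteration in this example can be sketched in a few lines of Python. The link structure is read off the four update rules above (A links to B and C, B to D, C to B, D to C), and each step uses only the previous step's values. The function name `pagerank_step` and the fixed count of 30 iterations are assumptions for illustration.

```python
from typing import Dict, List


def pagerank_step(pr: Dict[str, float],
                  outlinks: Dict[str, List[str]],
                  q: float = 0.15) -> Dict[str, float]:
    """One synchronous update: PR(a) = q + (1 - q) * sum(PR(p) / C(p))."""
    new_pr = {page: q for page in outlinks}  # teleport term for every page
    for source, targets in outlinks.items():
        if not targets:
            continue  # a page with no outgoing links passes on nothing here
        share = pr[source] / len(targets)    # PR(p) / C(p)
        for target in targets:
            new_pr[target] += (1 - q) * share
    return new_pr


# Outgoing links implied by the update rules of the example.
outlinks = {"A": ["B", "C"], "B": ["D"], "C": ["B"], "D": ["C"]}

pr = {page: 0.25 for page in outlinks}       # t=0: uniform start
history = [pr]
for _ in range(30):
    pr = pagerank_step(pr, outlinks)
    history.append(pr)

for t in (0, 1, 2, 3, 30):
    print(t, {page: round(value, 3) for page, value in history[t].items()})
```

Page A has no inbound links, so its value stays at the teleport term q from the first step onward, as in the table.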

Search on the Web (Spink ’98)

 Topics
– Genealogy/Public Figure: 12%
– Computer related: 12%
– Business: 12%
– Entertainment: 8%
– Medical: 8%
– Politics & Government: 7%
– News: 7%
– Hobbies: 6%
– General info/surfing: 6%
– Science: 6%
– Travel: 5%
– Arts/education/shopping/images: 14%

Popular terms from AOL query log

Taxonomy of Search Requests

 Andrei Broder (AV) classified users’ requests into three main categories:
– Informational: Find information about X
– Transactional: E.g., buying airline tickets
– Navigational:
  - I know I saw a page on X last week but I didn’t bookmark it
  - Or, where can I download Adobe Acrobat Reader from?

Web Search Queries

 Web search queries are SHORT
– ~2.4 words on average (Aug 2000)
– Has increased, was 1.7 (~1997)

 User Expectations
– Many say “the first item shown should be what I want to see”!
– This works if the user has the most popular/common notion in mind

Commercial Issues

 Internet search is commercially driven
– Technical quality isn’t the most important facet
  - On the other hand, many CTOs for search engine companies are former researchers
– Sometimes motivations are commercial rather than technical
  - Goto.com used payments to determine ranking order
  - Buying more disks would enable search over a larger portion of the Web, but would cost $$

Searches per Day

~2010 figures: Google (1+ billion), Yahoo (180 million), Bing (80 million)

Number of indexed pages, self-reported

Internet Archive

 Wayback Machine
– http://web.archive.org/collections/web/advanced.html

 Look for HW#1 solution

Crawling Procedure

 Start with a set of URLs

 Extract other URLs, which are followed recursively
– breadth-first, depth-first, or some other measure
– don’t revisit pages

 Store a local copy of each page
– and assorted meta-information

 Index your local collection
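
The steps above can be sketched as a small breadth-first crawler. To keep the example runnable without network access, pages come from an in-memory dictionary. `FAKE_WEB`, the `example.test` URLs, the `fetch` callback, and the `max_pages` limit are all assumptions for illustration. A real crawler would fetch over HTTP and follow the robot rules covered later in these slides.

```python
from collections import deque
from html.parser import HTMLParser
from typing import Callable, Dict, List, Optional
from urllib.parse import urldefrag, urljoin


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag fed to the parser."""

    def __init__(self) -> None:
        super().__init__()
        self.links: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seeds: List[str],
          fetch: Callable[[str], Optional[str]],
          max_pages: int = 100) -> Dict[str, str]:
    """Breadth-first crawl from seeds; returns a local copy {url: html}."""
    frontier = deque(seeds)
    seen = set(seeds)            # URLs already queued, so none is revisited
    store: Dict[str, str] = {}
    while frontier and len(store) < max_pages:
        url = frontier.popleft()  # FIFO queue gives breadth-first order
        html = fetch(url)
        if html is None:
            continue              # page could not be retrieved
        store[url] = html         # keep a local copy for later indexing
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            absolute, _fragment = urldefrag(urljoin(url, href))
            if absolute not in seen:
                seen.add(absolute)
                frontier.append(absolute)
    return store


# Hypothetical three-page site standing in for the Web.
FAKE_WEB = {
    "http://example.test/": '<a href="/a">A</a> <a href="/b">B</a>',
    "http://example.test/a": '<a href="/b">B</a> <a href="/">home</a>',
    "http://example.test/b": '<a href="/a#top">A again</a>',
}

pages = crawl(["http://example.test/"], FAKE_WEB.get)
print(list(pages))
```

Swapping `popleft()` for `pop()` would turn the queue into a stack and give depth-first order instead.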

Laws of Web Robotics

 A Web Robot Must Show Identification
– User-Agent & From headers

 A Web Robot Must Not Hog Resources
– HEAD requests, opportune times, validate URLs, don’t loop or overwhelm

 A Web Robot Must Obey Exclusion Standards
– robots.txt

Identification, consideration, robots.txt
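
A minimal sketch of the first and third laws, using Python's standard `urllib.robotparser`. The robots.txt content, the bot name `ExampleCourseBot/0.1`, the contact address, and the `example.test` URLs are hypothetical. The rules are parsed from a string here so the example runs without network access.

```python
from typing import Dict
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, as a site might serve it.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

USER_AGENT = "ExampleCourseBot/0.1"  # hypothetical robot name


def build_parser(robots_text: str) -> RobotFileParser:
    """Parse robots.txt text into a rule set that can be queried per URL."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


def polite_headers(contact: str) -> Dict[str, str]:
    """Headers that identify the robot and give the operator's address."""
    return {"User-Agent": USER_AGENT, "From": contact}


rules = build_parser(ROBOTS_TXT)
allowed = rules.can_fetch(USER_AGENT, "http://example.test/index.html")
blocked = rules.can_fetch(USER_AGENT, "http://example.test/private/data.html")
print(allowed, blocked)
print(polite_headers("crawler-admin@example.test"))
```

A crawler would check `can_fetch` before every request and send the identifying headers with each one.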

HTTP Requests and Responses

 Request
GET /~gates/ HTTP/1.
Header1: …
Header2: …
…
HeaderN: …
Blank Line
– All request headers are optional except for Host (required only for HTTP/1.1 requests)

 Response
HTTP/1.0 200 OK
Content-Type: text/html
Header2: …
…
HeaderN: …
Blank Line
…
– All response headers are optional except for Content-Type
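
To make the message layout concrete, the sketch below builds a request as raw bytes and splits a response into status code, headers, and body. The host name `www.example.test` and both function names are assumptions. The request uses HTTP/1.1 so that the required Host header appears, and the parser handles only this simple, well-formed case.

```python
from typing import Dict, Optional, Tuple


def build_request(path: str, host: str,
                  extra_headers: Optional[Dict[str, str]] = None) -> bytes:
    """Request line, Host header, optional headers, then a blank line."""
    lines = [f"GET {path} HTTP/1.1", f"Host: {host}"]
    for name, value in (extra_headers or {}).items():
        lines.append(f"{name}: {value}")
    # Lines end with CRLF; the empty line marks the end of the headers.
    return ("\r\n".join(lines) + "\r\n\r\n").encode("ascii")


def parse_response(raw: bytes) -> Tuple[int, Dict[str, str], bytes]:
    """Split a raw response into (status code, headers, body)."""
    head, _, body = raw.partition(b"\r\n\r\n")  # blank line separates the body
    status_line, *header_lines = head.decode("iso-8859-1").split("\r\n")
    _version, code, *_reason = status_line.split(" ")
    headers: Dict[str, str] = {}
    for line in header_lines:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()  # names are case-insensitive
    return int(code), headers, body


request = build_request("/~gates/", "www.example.test")
status, headers, body = parse_response(
    b"HTTP/1.0 200 OK\r\nContent-Type: text/html\r\n\r\n<html>hi</html>"
)
print(request)
print(status, headers, body)
```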

HTTP 1.1 Request Headers

 Accept
– Indicates MIME types browser can handle
– Can send different content to different clients

 Accept-Encoding
– Indicates encodings (e.g., gzip) browser can handle

 Authorization
– User identification for password-protected pages

 Connection
– In HTTP 1.0, keep-alive means browser can handle persistent connection. In HTTP 1.1, persistent connection is default.
– Persistent connections mean that the server can reuse the same socket over again for requests very close together from the same client
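
As a small illustration of Accept and Accept-Encoding from the client side, the sketch below attaches both headers to a `urllib.request.Request` object (without sending it) and shows how a gzip-encoded body would be decoded. The URL and the `decode_body` helper are assumptions for illustration.

```python
import gzip
from urllib.request import Request

# Build (but do not send) a request that advertises what the client accepts.
req = Request(
    "http://www.example.test/~gates/",  # hypothetical URL
    headers={
        "Accept": "text/html",
        "Accept-Encoding": "gzip",
    },
)


def decode_body(body: bytes, content_encoding: str) -> bytes:
    """Undo the encoding named in the response's Content-Encoding header."""
    if content_encoding.lower() == "gzip":
        return gzip.decompress(body)
    return body  # anything else is passed through unchanged


# urllib stores header names capitalized as "Accept-encoding".
print(req.get_header("Accept-encoding"))
print(decode_body(gzip.compress(b"<html>hi</html>"), "gzip"))
```

A client that sends Accept-Encoding: gzip has to be ready to decompress the response, since the server may then return a compressed body.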
