Effect of Stemming on Mean Average Precision in IR and the Web, Slides of Artificial Intelligence

Slides on the effect of stemming on mean average precision (MAP) in information retrieval (IR), followed by material on web search. The results are reported per student code and cover the effect of stemming on MAP and on recall. The slides also discuss challenges of the web, including distributed data, volatility, scale, lack of structure, and quality, as well as the heterogeneous nature of the web and how to study it.

Typology: Slides

2010/2011

Uploaded on 11/09/2011

stagist

HW#3 Evaluation

 Results are for scientific interest only

  • Not competitive (rank doesn’t affect grading)

 Stats

  • Normal / Stems
  • MAP-Max 0.5822 / 0.5910
  • MAP-Mean 0.4219 / 0.4250
  • Recall-Max 263 / 268

 Submissions were reasonable

  • Few catastrophic failures

 Stemming: recall up in 12 cases, down in 1 case
Mean Average Precision
0.00
0.10
0.20
0.30
0.40
0.50
0.60
1
2
3
4
5
6
7
8
9
10
11
12
13
14
Mean Average Precision
Student Code
Normal
Stems
pf3
pf4
pf5
pf8
pf9
pfa
pfd
pfe
pff
pf12
pf13
pf14
pf15
pf16
pf17
pf18
pf19
pf1a

Partial preview of the text

Download Effect of Stemming on Mean Average Precision in IR and the Web and more Slides Artificial Intelligence in PDF only on Docsity!

HW#3 Evaluation

 Results are for scientific interest only

  • Not competitive (rank doesn’t effect grading)

 Stats

  • Normal / Stems
  • MAP-Max 0.5822 / 0.
  • MAP-Mean 0.4219 / 0.
  • Recall-Max 263 / 268

 Submissions were reasonable

  • Few catastrophic failures

 Stemming recall up 12 cases, dn 1 case

Mean Average Precision

[Figure: Effect of Stemming - MAP. Mean average precision by student code (1 to 14), Normal vs. Stems.]

2010: Mean Average Precision

[Figure: Effect of Stemming - MAP. Mean average precision by student code (1 to 23), Normal vs. Stems.]
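The charts report mean average precision per submission. As a reminder of how that number is usually computed, here is a minimal sketch; the function names and the toy rankings are illustrative assumptions, not the HW#3 grading script.

```python
def average_precision(ranked_ids, relevant_ids):
    """Average of precision@k over the ranks k at which a relevant document appears."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank
    # Dividing by the number of relevant documents penalizes
    # relevant documents that were never retrieved.
    return precision_sum / len(relevant)


def mean_average_precision(runs):
    """runs: list of (ranked_ids, relevant_ids) pairs, one per query."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)


# Toy data (hypothetical): two queries with known relevant sets.
runs = [
    (["d1", "d2", "d3", "d4"], {"d1", "d3"}),  # AP = (1/1 + 2/3) / 2
    (["d5", "d6", "d7"], {"d6"}),              # AP = (1/2) / 1
]
print(round(mean_average_precision(runs), 4))
```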

Relevant Found

[Figure: Effect of Stemming - Recall. Relevant documents found (0 to 300) by student code (1 to 14), Normal vs. Stems.]
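One plausible reason stemming tends to raise the number of relevant documents found is that it lets a query term match its morphological variants. The sketch below illustrates this with a deliberately crude suffix stripper (a toy stand-in, not the Porter stemmer) and a hypothetical four-document collection.

```python
def crude_stem(word):
    """Strip a few common English suffixes. A toy stand-in, not the Porter stemmer."""
    for suffix in ("ing", "ers", "er", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def matching_docs(query, docs, normalize=lambda w: w):
    """Return ids of documents sharing at least one normalized term with the query."""
    query_terms = {normalize(w) for w in query.lower().split()}
    return {
        doc_id
        for doc_id, text in docs.items()
        if query_terms & {normalize(w) for w in text.lower().split()}
    }


# Hypothetical collection.
docs = {
    1: "crawler design notes",
    2: "crawlers and indexing",
    3: "crawling the web",
    4: "ranking with links",
}
plain = matching_docs("crawler", docs)
stemmed = matching_docs("crawler", docs, normalize=crude_stem)
print(sorted(plain), sorted(stemmed))
```

The same conflation can also hurt: this toy stemmer maps "notes" to "not", which is the kind of over-stemming that can lower precision or, occasionally, recall.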

Distributed data

 New servers, pages can appear anytime (without notice)

 No central registry for web servers

  • Virtual hosting makes this more complicated
  • (more than one server per IP number)

 Millions of servers, Billions of pages

  • To index Web data, you must make a large number of network connections
  • How long does it take to circumnavigate the Web?
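A back-of-the-envelope answer to the last question depends on fetch rate and parallelism. The sketch below is only arithmetic; the page count, per-connection rate, and connection count are hypothetical inputs, not figures from the slides.

```python
SECONDS_PER_DAY = 86_400


def crawl_days(num_pages, pages_per_second_per_connection, parallel_connections):
    """Estimate days needed to fetch every page once, ignoring failures and politeness delays."""
    total_rate = pages_per_second_per_connection * parallel_connections
    return num_pages / total_rate / SECONDS_PER_DAY


# Hypothetical inputs: 1 billion pages, 2 pages/s per connection, 100 connections.
print(round(crawl_days(1_000_000_000, 2.0, 100), 1))
```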

Volatility and Scale

 Perhaps 40% of Web changes monthly

  • Indexes quickly grow outdated or inaccurate

 Growth appears exponential

 True numbers are larger

  • Much of web isn’t static
  • ASP/JSP front-ends to databases, catalogs
  • Firewalls block access to Intranets

Lack of Structure

 Duplication (30%)

 Plain text is easiest to index

  • Most HTML pages are not syntactically valid
  • Plethora of other formats (MS Word, PDF)

 Naming

  • URL names are ugly and difficult to canonicalize
  • Pages migrate to new servers
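To make the canonicalization problem concrete, here is a minimal sketch using Python's standard `urllib.parse`. The normalization rules chosen here (lowercase host, drop default port and fragment, collapse `.` and `..` segments) are one reasonable assumption; real crawlers apply more rules, and this version also discards any userinfo part.

```python
import posixpath
from urllib.parse import urlsplit, urlunsplit

DEFAULT_PORTS = {"http": 80, "https": 443}


def canonicalize(url):
    """Map superficially different URLs for the same page to one spelling."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    port = parts.port
    # Keep the port only when it is not the default for the scheme.
    netloc = host if port in (None, DEFAULT_PORTS.get(scheme)) else f"{host}:{port}"
    # Collapse "." and ".." segments; an empty path becomes "/".
    path = posixpath.normpath(parts.path) if parts.path else "/"
    if parts.path.endswith("/") and not path.endswith("/"):
        path += "/"
    # The fragment never reaches the server, so drop it.
    return urlunsplit((scheme, netloc, path, parts.query, ""))


print(canonicalize("HTTP://Example.COM:80/a/./b/../c.html#top"))
```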

Quality

 No editorial review

  • Authors tend to make more grammatical and spelling errors in web pages

 Anybody can publish

  • Blessing and a curse

 Undesirable content

  • Filtering objectionable content is technically difficult

Search Engine Coverage

 Lawrence & Giles (cont’d)

  • Took 1050 ‘real world’ queries
  • Pooled set of responses from 11 engines
  • Determined each engine’s coverage
    • Extrapolated to estimated coverage of indexable web
  • Estimated currency

             NL    S     AV    HB    MS    IS    G     Y     E     L     EU
Coverage     38.3  37.1  37.1  27.1  20.3  19.2  18.6  17.6  13.6  5.9   5.
Cov (web)    16.0  15.5  15.5  11.3  8.5   8.0   7.8   7.4   5.6   2.5   2.
% invalid    9.8 2.8 6.7 2.2 2.6 5.5 7.0 2.9 2.7 14.0 2.6 5.
mean age     141 240 166 192 194 148 235 206 174 186
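The pooling step can be sketched as follows: each engine's coverage is its share of the union of all engines' results. The engine names and result sets below are hypothetical, and this is only an illustration of the idea on the slide, not Lawrence & Giles' actual procedure. In the table, Cov (web) is roughly 0.42 times Coverage in the complete columns (for example 16.0 / 38.3), which is consistent with scaling by an estimate of how much of the indexable web the pooled set covers.

```python
def relative_coverage(results_by_engine):
    """Fraction of the pooled (union) result set that each engine returned."""
    pooled = set().union(*results_by_engine.values())
    return {
        engine: len(set(urls)) / len(pooled)
        for engine, urls in results_by_engine.items()
    }


def scale_to_web(coverage, pooled_fraction_of_web):
    """Extrapolate pooled-set coverage to the whole indexable web."""
    return {engine: c * pooled_fraction_of_web for engine, c in coverage.items()}


# Hypothetical engines and results.
results = {"X": {"a", "b", "c"}, "Y": {"c", "d"}}
coverage = relative_coverage(results)
print(coverage, scale_to_web(coverage, 0.42))
```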

Lawrence & Giles: Conclusions

 Size

  • Continues to grow (Moore’s law)

 Currency

  • Engines not keeping pace with growth
  • Why?

 Bias

  • Popular pages much more likely to be indexed (especially Google)
  • What effect?

Benefits of the Web?

 The Web presents many challenges, but are there any benefits for IR?

 There is a particular kind of value-added annotation

HTML

Ranking Ideas for the Web

 Exploit links

  • Possibly, words near a hyperlink are more important

 Currency

  • Assumes most recent data is best

 Popularity

  • Use estimates of what a large number of people think about a page or site
  • Estimate based on easy to obtain data
  • number of inbound links to ‘that’ page
  • called ‘backlink frequency’

 Authority

  • Harder to estimate than popularity
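Backlink frequency is simple to compute once a link list is available. The sketch below assumes links arrive as (source, target) pairs and counts each linking page once per target; ignoring self-links is a design choice made here, not something the slides specify.

```python
from collections import Counter


def backlink_frequency(links):
    """Count distinct pages linking to each target, ignoring self-links."""
    distinct = {(src, dst) for src, dst in links if src != dst}
    return Counter(dst for _, dst in distinct)


# Hypothetical link list; the repeated ("a", "c") link is counted once.
links = [("a", "c"), ("b", "c"), ("a", "c"), ("c", "c"), ("c", "a")]
print(backlink_frequency(links).most_common())
```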

Google’s measure of authority

 PageRank simulates a user navigating randomly in the Web who jumps to a random page with probability q or follows a random hyperlink (on the current page) with probability 1 - q

 This process can be modeled with a Markov chain, from which the stationary probability of being in each page can be computed

 Let C(a) be the number of outgoing links of page a and suppose that page a is pointed to by pages p1 to pn

PageRank

typical q = 0.
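The formula itself did not survive on this slide. A form consistent with the definitions of q, C(a), and p1 to pn given earlier, and with the update rule used in the PageRank example (0.15 + 0.85 * ...), would be the following; treat it as a reconstruction under those assumptions rather than the slide's exact notation.

```latex
PR(a) = q + (1 - q) \sum_{i=1}^{n} \frac{PR(p_i)}{C(p_i)}
```

Each page p_i that links to a passes on its own score divided by its number of outgoing links C(p_i), and the constant q accounts for random jumps.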

PageRank’s advantages

 Google can rank unseen pages!

  • Corollary, Google can rank non-text content

 Estimates of page quality (for unseen pages) can be used for crawl ordering

“Efficient crawling through URL ordering”, Cho, Garcia-Molina, and Page, WWW-7.

PageRank Example

[Figure: link graph over pages A, B, C, D]

        A      B      C      D
t=0     0.25   0.25   0.25   0.
t=1     0.15   0.468  0.468  0.
t=2     0.15   0.612  0.522  0.
t=3     0.15   0.657  0.680  0.
t=4     0.15   0.792  0.784  0.
t=5     0.15   0.880  0.816  0.
t=30    0.15   1.297  1.277  1.

Using teleport prob. of 0.15:

PR(A,t=1) = 0.15 + 0
PR(B,t=1) = 0.15 + 0.85 * (PR(A,t=0)/2 + PR(C,t=0)/1)
PR(C,t=1) = 0.15 + 0.85 * (PR(A,t=0)/2 + PR(D,t=0)/1)
PR(D,t=1) = 0.15 + 0.85 * (PR(B,t=0)/1)
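The iteration can be reproduced with a few lines of code. The link structure below (A links to B and C, B to D, C to B, D to C) is inferred from the four update equations, so it is an assumption about the missing figure; the function names are illustrative.

```python
# Link structure inferred from the update equations (an assumption):
# A -> B, A -> C; B -> D; C -> B; D -> C.
OUT_LINKS = {"A": ["B", "C"], "B": ["D"], "C": ["B"], "D": ["C"]}


def pagerank_step(pr, out_links, q=0.15):
    """One synchronous update: PR(a) = q + (1 - q) * sum(PR(p) / C(p)) over pages p linking to a."""
    new = {page: q for page in out_links}
    for src, targets in out_links.items():
        share = pr[src] / len(targets)  # src splits its score over its out-links
        for dst in targets:
            new[dst] += (1 - q) * share
    return new


def pagerank(out_links, steps, q=0.15):
    """Start from a uniform distribution and apply `steps` updates."""
    pr = {page: 1 / len(out_links) for page in out_links}
    for _ in range(steps):
        pr = pagerank_step(pr, out_links, q)
    return pr


for t in (1, 2, 3, 30):
    row = pagerank(OUT_LINKS, t)
    print(t, {page: round(score, 3) for page, score in row.items()})
```

Under this assumed graph, the first steps agree with the A, B, and C columns of the table for t=1 to t=3 to within rounding.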

Search on the Web (Spink ’98)

 Topics

  • Genealogy/Public Figure: 12%
  • Computer related: 12%
  • Business: 12%
  • Entertainment: 8%
  • Medical: 8%
  • Politics & Government: 7%
  • News: 7%
  • Hobbies: 6%
  • General info/surfing: 6%
  • Science: 6%
  • Travel: 5%
  • Arts/education/shopping/images: 14%

Popular terms from AOL query log

Taxonomy of Search Requests

 Andrei Broder (AV) classified users’ requests into three main categories:

  • Informational: Find information about X
  • Transactional: E.g., buying airline tickets
  • Navigational:
    • I know I saw a page on X last week but I didn’t bookmark it
    • Or, where can I download Adobe Acrobat Reader from?

Web Search Queries

 Web search queries are SHORT

  • ~2.4 words on average (Aug 2000)
  • Has increased, was 1.7 (~1997)

 User Expectations

  • Many say “the first item shown should be what I want to see”!
  • This works if the user has the most popular/common notion in mind

Commercial Issues

 Internet search is commercially driven

  • Technical quality isn’t the most important facet
    • On the other hand, many CTOs for search engine companies are former researchers
  • Sometimes motivations are commercial rather than technical
    • Goto.com used payments to determine ranking order
    • Buying more disks would enable search over a larger portion of the Web, but would cost $$.

Searches per Day

~2010 figures: Google (1+ billion), Yahoo (180 million), Bing (80 million)

Number of indexed pages, self-reported

Internet Archive

 Wayback Machine

  • http://web.archive.org/collections/web/advanced.html

 Look for HW#1 solution

Crawling Procedure

 Start with a set of URLs

 Extract other URLs which are followed recursively

  • breadth-first, depth-first, or some other measure
  • don’t revisit pages

 Store a local copy of each page

  • and assorted meta-information

 Index your local collection
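The procedure above can be sketched as a breadth-first traversal with a frontier queue and a seen set. To stay self-contained, the example crawls a hypothetical in-memory "web" instead of the network; `fetch` and `extract_links` are placeholder callables that a real crawler would replace with HTTP fetching and HTML link extraction.

```python
from collections import deque


def crawl(seeds, fetch, extract_links, max_pages=1000):
    """Breadth-first crawl: returns a local store mapping URL -> page."""
    frontier = deque(seeds)
    seen = set(seeds)           # don't revisit pages
    store = {}                  # local copy of each page
    while frontier and len(store) < max_pages:
        url = frontier.popleft()  # FIFO queue gives breadth-first order
        page = fetch(url)
        if page is None:          # fetch failed or URL unknown
            continue
        store[url] = page
        for link in extract_links(page):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return store


# Hypothetical in-memory web: each "page" is just its list of out-links.
FAKE_WEB = {"a": ["b", "c"], "b": ["d", "a"], "c": ["d"], "d": []}
store = crawl(["a"], FAKE_WEB.get, lambda page: page)
print(list(store))
```

Swapping `popleft()` for `pop()` would turn the same loop into a depth-first crawl; a priority queue would give "some other measure".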

Laws of Web Robotics

 A Web Robot Must Show Identifications

  • User Agent & From headers

 A Web Robot Must Not Hog Resources

  • HEAD requests, opportune times, validate URLs, don’t loop or overwhelm

 A Web Robot Must Obey Exclusion Standards

  • robots.txt
  • identification consideration robots.txt
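Python's standard library can check a URL against robots.txt rules. The sketch below parses an inline, hypothetical robots.txt rather than fetching one, and the robot name `ExampleBot` and host `example.com` are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def allowed(robots_txt, user_agent, url):
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


print(allowed(ROBOTS_TXT, "ExampleBot", "http://example.com/public/page.html"))
print(allowed(ROBOTS_TXT, "ExampleBot", "http://example.com/private/page.html"))
```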

HTTP Requests and Responses

 Request

GET /~gates/ HTTP/1.
Header1: …
Header2: …
…
HeaderN: …
Blank Line

  • All request headers are optional except for Host (required only for HTTP/1.1 requests)

 Response

HTTP/1.0 200 OK
Content-Type: text/html
Header2: …
…
HeaderN: …
Blank Line
…

  • All response headers are optional except for Content-Type
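The message layout above (request or status line, headers, blank line, body) can be illustrated without touching the network. This sketch builds an HTTP/1.1 GET request as bytes and parses the head of a response; the host `www.example.com` and the `ExampleBot/0.1` user agent are hypothetical, and choosing HTTP/1.1 for the request line is an assumption made so that the Host header is required.

```python
def build_request(path, host, headers=None):
    """Serialize a GET request: request line, Host, extra headers, then a blank line."""
    lines = [f"GET {path} HTTP/1.1", f"Host: {host}"]
    for name, value in (headers or {}).items():
        lines.append(f"{name}: {value}")
    return ("\r\n".join(lines) + "\r\n\r\n").encode("ascii")


def parse_response_head(raw):
    """Split a raw response into (version, status code, reason, headers)."""
    head, _, _body = raw.partition(b"\r\n\r\n")  # blank line ends the headers
    status_line, *header_lines = head.decode("iso-8859-1").split("\r\n")
    version, code, reason = status_line.split(" ", 2)
    headers = {}
    for line in header_lines:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()  # header names are case-insensitive
    return version, int(code), reason, headers


request = build_request("/~gates/", "www.example.com", {"User-Agent": "ExampleBot/0.1"})
response = b"HTTP/1.0 200 OK\r\nContent-Type: text/html\r\n\r\n<html></html>"
print(request.decode("ascii"))
print(parse_response_head(response))
```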

HTTP 1.1 Request Headers

 Accept

  • Indicates MIME types browser can handle
  • Can send different content to different clients

 Accept-Encoding

  • Indicates encodings (e.g., gzip) browser can handle

 Authorization

  • User identification for password-protected pages.

 Connection

  • In HTTP 1.0, keep-alive means browser can handle persistent connection. In HTTP 1.1, persistent connection is default. Persistent connections mean that the server can reuse the same socket for requests sent very close together by the same client