HW#3 Evaluation
Results are for scientific interest only
- Not competitive (rank doesn’t affect grading)
Stats
- Normal / Stems
- MAP-Max 0.5822 / 0.
- MAP-Mean 0.4219 / 0.
- Recall-Max 263 / 268
Submissions were reasonable
- Few catastrophic failures
Stemming: recall up in 12 cases, down in 1 case
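For reference, the MAP figures above can be read against the usual (non-interpolated) definition of Mean Average Precision. The sketch below assumes that definition; the homework's actual evaluation script is not shown in these slides, and the function names and document IDs are hypothetical.

```python
from typing import Iterable, Sequence, Set, Tuple


def average_precision(ranking: Sequence[str], relevant: Set[str]) -> float:
    """Average of precision@k over the ranks k at which a relevant doc appears."""
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for k, doc_id in enumerate(ranking, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / k
    # Divide by ALL relevant docs, so relevant docs never retrieved contribute 0.
    return precision_sum / len(relevant)


def mean_average_precision(runs: Iterable[Tuple[Sequence[str], Set[str]]]) -> float:
    """Mean of per-query average precision; runs holds (ranking, relevant) pairs."""
    scores = [average_precision(ranking, relevant) for ranking, relevant in runs]
    return sum(scores) / len(scores) if scores else 0.0


# Two hypothetical queries with made-up document IDs.
runs = [
    (["d1", "d2", "d3", "d4"], {"d1", "d3"}),
    (["d5", "d6"], {"d6"}),
]
map_score = mean_average_precision(runs)
```

Running the same function on the stemmed and unstemmed rankings of one system gives the kind of Normal / Stems comparison reported in the stats.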
Mean Average Precision
[Figure: Effect of Stemming - MAP. x-axis: Student Code (1 to 14); y-axis: Mean Average Precision; series: Normal, Stems]
2010: Mean Average Precision
[Figure: Effect of Stemming - MAP. x-axis: Student Code (1 to 23); y-axis: Mean Average Precision; series: Normal, Stems]
Relevant Found
[Figure: Effect of Stemming - Recall. x-axis: Student Code (1 to 14); y-axis: 0 to 300 (labeled Mean Average Precision in the original); series: Normal, Stems]
Distributed data
New servers, pages can appear anytime
(without notice)
No central registry for web servers
- Virtual hosting makes this more complicated
- (more than one server per IP number)
Millions of servers, Billions of pages
- To index Web data, you must make a large number of network connections
- How long does it take to circumnavigate the Web?
Volatility and Scale
Perhaps 40% of the Web changes monthly
- Indexes quickly grow outdated or inaccurate
Growth appears exponential
True numbers are larger
- Much of the Web isn’t static
- ASP/JSP front-ends to databases, catalogs
- Firewalls block access to Intranets
Lack of Structure
Duplication (30%)
Plain text is easiest to index
- Most HTML pages are not syntactically valid
- Plethora of other formats (MS Word, PDF)
Naming
- URL names are ugly and difficult to canonicalize
- Pages migrate to new servers
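To make the canonicalization problem concrete, here is a minimal sketch using Python's standard `urllib.parse`. The specific rules (lowercase the host, drop the default port, drop the fragment, use "/" for an empty path) are one common choice assumed for illustration, not a rule set given in the slides, and `canonicalize` is a hypothetical helper name.

```python
from urllib.parse import urlsplit, urlunsplit

# Default ports that can be dropped without changing which server is meant.
DEFAULT_PORTS = {"http": 80, "https": 443}


def canonicalize(url: str) -> str:
    """Return a normalized form of url (illustrative rules only)."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()  # hostnames are case-insensitive
    port = parts.port
    if port is None or port == DEFAULT_PORTS.get(scheme):
        netloc = host
    else:
        netloc = f"{host}:{port}"
    path = parts.path or "/"  # treat an empty path as the root
    # The fragment is dropped: it is never sent to the server.
    # Any user:password part is also dropped in this sketch.
    return urlunsplit((scheme, netloc, path, parts.query, ""))


same_page = {
    canonicalize("HTTP://WWW.Example.COM:80/a/b.html#top"),
    canonicalize("http://www.example.com/a/b.html"),
}
```

Even with such rules, two different canonical URLs can still serve the same page (mirrors, session IDs in the query string), which is part of why the slide calls canonicalization difficult.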
Quality
No editorial review
- Authors tend to make more grammatical and spelling errors in web pages
Anybody can publish
Undesirable content
- Filtering objectionable content is technically difficult
Search Engine Coverage
Lawrence & Giles (cont’d)
- Took 1050 ‘real world’ queries
- Pooled set of responses from 11 engines
- Determined each engine’s coverage
- Extrapolated to estimated coverage of indexable web
- Estimated currency
Engine: NL S AV HB MS IS G Y E L EU
Coverage: 38.3 37.1 37.1 27.1 20.3 19.2 18.6 17.6 13.6 5.9 5.
Cov (web): 16.0 15.5 15.5 11.3 8.5 8.0 7.8 7.4 5.6 2.5 2.
% invalid: 9.8 2.8 6.7 2.2 2.6 5.5 7.0 2.9 2.7 14.0 2.6 5.
mean age: 141 240 166 192 194 148 235 206 174 186
Lawrence & Giles: Conclusions
Size
- Continues to grow (Moore’s law)
Currency
- Engines not keeping pace with growth
- Why?
Bias
- Popular pages much more likely to be indexed (especially Google)
- What effect?
Benefits of the Web?
The Web presents many challenges, but
are there any benefits for IR?
There is a particular kind of value-added
annotation
HTML
Ranking Ideas for the Web
Exploit links
- Possibly, words near a hyperlink are more important
Currency
- Assumes most recent data is best
Popularity
- Use estimates of what a large number of people think about a page or site
- Estimate based on easy to obtain data
- number of inbound links to ‘that’ page
- called ‘backlink frequency’
Authority
- Harder to estimate than popularity
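Backlink frequency, the popularity estimate named above, is cheap to compute once a crawl has recorded each page's outgoing links. The sketch below assumes the link data is a dictionary from page to its outgoing links; the four-page graph and the function name are hypothetical.

```python
from collections import Counter
from typing import Dict, List


def backlink_frequency(out_links: Dict[str, List[str]]) -> Counter:
    """Count, for each page, how many distinct other pages link to it."""
    counts: Counter = Counter()
    for source, targets in out_links.items():
        for target in set(targets):  # a page linking twice still counts once
            if target != source:     # ignore self-links
                counts[target] += 1
    return counts


# Hypothetical link data: page -> pages it links to.
links = {"A": ["B", "C"], "B": ["D"], "C": ["B"], "D": ["C"]}
popularity = backlink_frequency(links)
```

Every inbound link counts equally here, regardless of who is linking, which is one reason authority is harder to estimate than popularity.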
Google’s measure of authority
PageRank simulates a user navigating randomly on the Web, who jumps to a random page with probability q or follows a random hyperlink (on the current page) with probability 1 - q.
This process can be modeled with a Markov chain, from which the stationary probability of being on each page can be computed.
Let C(a) be the number of outgoing links of page a, and suppose that page a is pointed to by pages p1 to pn.
PageRank
typical q = 0.
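The formula for this slide did not survive in this copy. A form consistent with the definitions of q, C(a), and p1 to pn given earlier, and with the update equations in the PageRank Example slide, would be the following (a reconstruction, not a quotation from the slide):

```latex
PR(a) = q + (1 - q) \sum_{i=1}^{n} \frac{PR(p_i)}{C(p_i)}
```

In this unnormalized form the values behave like scores averaging 1 per page rather than probabilities summing to 1. This matches the example table, where values above 1 appear.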
PageRank’s advantages
Google can rank unseen pages!
- Corollary, Google can rank non-text content
Estimates of page quality (for unseen
pages) can be used for crawl ordering
“Efficient crawling through URL ordering”, Cho, Garcia-Molina, and Page, WWW-7.
PageRank Example
[Figure: link graph over pages A, B, C, D]
      A     B      C      D
t=0   0.25  0.25   0.25   0.
t=1   0.15  0.468  0.468  0.
t=2   0.15  0.612  0.522  0.
t=3   0.15  0.657  0.680  0.
t=4   0.15  0.792  0.784  0.
t=5   0.15  0.880  0.816  0.
t=30  0.15  1.297  1.277  1.
Using teleport prob. of 0.15:
PR(A,t=1) = 0.15 + 0
PR(B,t=1) = 0.15 + 0.85 * (PR(A,t=0)/2 + PR(C,t=0)/1)
PR(C,t=1) = 0.15 + 0.85 * (PR(A,t=0)/2 + PR(D,t=0)/1)
PR(D,t=1) = 0.15 + 0.85 * (PR(B,t=0)/1)
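The iteration in this example can be reproduced with a few lines of Python. The edge list below is inferred from the four update equations (A links to B and C; B to D; C to B; D to C); it is an assumption, since the graph drawing itself is not available here. All pages are updated at once from the previous step's values, which matches the table.

```python
from typing import Dict, List


def pagerank_step(ranks: Dict[str, float],
                  out_links: Dict[str, List[str]],
                  q: float = 0.15) -> Dict[str, float]:
    """One synchronous update: PR(a) = q + (1 - q) * sum(PR(p) / C(p))."""
    new_ranks = {}
    for page in ranks:
        incoming = sum(
            ranks[src] / len(targets)        # PR(p) / C(p)
            for src, targets in out_links.items()
            if page in targets               # only pages p that link to `page`
        )
        new_ranks[page] = q + (1 - q) * incoming
    return new_ranks


# Edge list inferred from the update equations (an assumption).
out_links = {"A": ["B", "C"], "B": ["D"], "C": ["B"], "D": ["C"]}

ranks = {page: 0.25 for page in out_links}   # t = 0
history = [ranks]
for _ in range(30):
    ranks = pagerank_step(ranks, out_links)
    history.append(ranks)

for t in (0, 1, 2, 3):
    print(t, {page: round(value, 3) for page, value in history[t].items()})
```

Page A has no inbound links, so its score stays at the teleport value 0.15 from t=1 onward.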
Search on the Web (Spink ’98)
Topics
- Genealogy/Public Figure: 12%
- Computer related: 12%
- Business: 12%
- Entertainment: 8%
- Medical: 8%
- Politics & Government 7%
- News 7%
- Hobbies 6%
- General info/surfing 6%
- Science 6%
- Travel 5%
- Arts/education/shopping/images 14%
Popular terms from AOL query log
Taxonomy of Search Requests
Andrei Broder (AV) classified users’
requests into three main categories:
- Informational: Find information about X
- Transactional: E.g., buying airline tickets
- Navigational:
- I know I saw a page on X last week but I didn’t bookmark it
- Or, where can I download Adobe Acrobat Reader from?
Web Search Queries
Web search queries are SHORT
- ~2.4 words on average (Aug 2000)
- Has increased, was 1.7 (~1997)
User Expectations
- Many say “the first item shown should be what I want to see”!
- This works if the user has the most popular/common notion in mind
Commercial Issues
Internet search is commercially driven
- Technical quality isn’t the most important facet
- On the other hand, many CTOs for search engine companies are former researchers
- Sometimes motivations are commercial rather than technical
- Goto.com used payments to determine ranking order
- Buying more disks would enable search over a larger portion of the Web, but would cost $$.
Searches per Day
~2010 figures: Google (1+ billion), Yahoo (180 million), Bing (80 million)
Number of indexed pages, self-reported
Internet Archive
Wayback Machine
- http://web.archive.org/collections/web/advanced.html
Look for HW#1 solution
Crawling Procedure
Start with a set of URLs
Extract other URLs which are followed
recursively
- breadth-first, depth-first, or some other measure
- don’t revisit pages
Store a local copy of each page
- and assorted meta-information
Index your local collection
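The procedure above can be sketched as a breadth-first crawl. To keep the example runnable offline, page fetching is passed in as a function and demonstrated on a small in-memory "web"; the URLs, the `FAKE_WEB` dictionary, and the regex-based link extraction are all simplifying assumptions (a real crawler would use an HTML parser, resolve relative URLs, and store meta-information too).

```python
import re
from collections import deque
from typing import Callable, Dict, Iterable, Optional

# Crude link extractor; real HTML needs a proper parser.
HREF = re.compile(r'href="([^"]+)"')


def crawl(seeds: Iterable[str],
          fetch: Callable[[str], Optional[str]],
          max_pages: int = 100) -> Dict[str, str]:
    """Breadth-first crawl from seeds; returns {url: page text} as the local copy."""
    frontier = deque(seeds)      # FIFO queue gives breadth-first order
    seen = set(frontier)         # every URL ever queued, so pages aren't revisited
    store: Dict[str, str] = {}
    while frontier and len(store) < max_pages:
        url = frontier.popleft()
        html = fetch(url)
        if html is None:         # fetch failed or page does not exist
            continue
        store[url] = html        # store a local copy
        for link in HREF.findall(html):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return store


# Hypothetical in-memory web used instead of real network access.
FAKE_WEB = {
    "http://a.example/": '<a href="http://b.example/">b</a> <a href="http://c.example/">c</a>',
    "http://b.example/": '<a href="http://a.example/">a</a>',
    "http://c.example/": '<a href="http://d.example/">d</a>',
}
pages = crawl(["http://a.example/"], FAKE_WEB.get)
```

Swapping `deque.popleft()` for `deque.pop()` turns the same loop into a depth-first crawl. The resulting `pages` dictionary is the local collection that would then be indexed.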
Laws of Web Robotics
A Web Robot Must Show Identification
- User Agent & From headers
A Web Robot Must Not Hog Resources
- HEAD requests, opportune times, validate URLs, don’t loop or overwhelm
A Web Robot Must Obey Exclusion
Standards
- robots.txt
- identification consideration robots.txt
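Two of these rules are easy to illustrate with Python's standard library: identifying the robot through request headers, and checking robots.txt before fetching. The robot name, contact address, and robots.txt content below are hypothetical, and the rules are parsed from a string so the sketch runs without network access.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical identification for the robot (User-Agent and From headers).
USER_AGENT = "ExampleCourseBot/0.1"
HEADERS = {"User-Agent": USER_AGENT, "From": "crawler-admin@example.org"}

# Hypothetical robots.txt content; a real robot would download it from
# http://<host>/robots.txt before requesting anything else from that host.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(url: str) -> bool:
    """True if the exclusion rules permit this robot to fetch url."""
    return parser.can_fetch(USER_AGENT, url)


ok = allowed("http://example.com/index.html")
blocked = allowed("http://example.com/private/data.html")
```

A crawler would call `allowed(url)` before each fetch and send `HEADERS` with every request.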
HTTP Requests and Responses
Request
GET /~gates/ HTTP/1.
Header1: …
Header2: …
…
HeaderN: …
Blank Line
- All request headers are optional except for Host (required only for HTTP/1.1 requests)
Response
HTTP/1.0 200 OK
Content-Type: text/html
Header2: …
…
HeaderN: …
Blank Line
…
- All response headers are optional except for Content-Type
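The message layout above can be made concrete by building a request and splitting a response as plain strings. This is a minimal sketch: the host name is hypothetical, HTTP/1.1 is assumed for the request (hence the Host header), and the parser handles only the simple layout shown on the slide, not the full protocol.

```python
from typing import Dict, Optional, Tuple


def build_get_request(host: str, path: str = "/",
                      extra_headers: Optional[Dict[str, str]] = None) -> str:
    """Request line, Host header, optional headers, then a blank line."""
    lines = [f"GET {path} HTTP/1.1", f"Host: {host}"]
    for name, value in (extra_headers or {}).items():
        lines.append(f"{name}: {value}")
    return "\r\n".join(lines) + "\r\n\r\n"   # the blank line ends the headers


def parse_response(raw: str) -> Tuple[int, str, Dict[str, str], str]:
    """Split a response into (status code, reason, headers, body)."""
    head, _, body = raw.partition("\r\n\r\n")
    status_line, *header_lines = head.split("\r\n")
    _version, code, reason = status_line.split(" ", 2)
    headers = {}
    for line in header_lines:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()  # header names are case-insensitive
    return int(code), reason, headers, body


request = build_get_request("www.example.com", "/~gates/")
status, reason, headers, body = parse_response(
    "HTTP/1.0 200 OK\r\nContent-Type: text/html\r\n\r\n<html></html>"
)
```

In practice a library such as `http.client` or `urllib.request` builds and parses these messages.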
HTTP 1.1 Request Headers
Accept
- Indicates MIME types browser can handle
- Can send different content to different clients
Accept-Encoding
- Indicates encodings (e.g., gzip) browser can handle
Authorization
- User identification for password-protected pages.
Connection
- In HTTP 1.0, keep-alive means browser can handle persistent connection. In HTTP 1.1, persistent connection is default. Persistent connections mean that the server can reuse the same socket over again for requests very close together from the same client