Overlap Analysis and Web Crawling Techniques for Improving Search Engine Coverage, Slides of Fundamentals of E-Commerce

These slides introduce overlap analysis in the context of search engines and web crawling. They give the formula for the overlap between the page sets indexed by two search engines and relate overlap to independence. They also cover ways to improve search engine coverage through meta-search engines and web crawlers, and explain the difference between breadth-first and depth-first crawlers and their respective algorithms.

Typology: Slides

2012/2013

Uploaded on 07/30/2013

ekyan


Overlap Analysis

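For search engines a and b, let $W_a$ and $W_b$ be the sets of pages they index, drawn from the whole web $W$:

$$P(W_a \cap W_b \mid W_b) = \frac{P(W_a \cap W_b)}{P(W_b)} = \frac{|W_a \cap W_b|}{|W_b|}$$

If a and b are independent:

$$P(W_a \cap W_b) = P(W_a)\,P(W_b)$$

$$P(W_a \cap W_b \mid W_b) = \frac{P(W_a)\,P(W_b)}{P(W_b)} = \frac{\dfrac{|W_a|}{|W|} \cdot \dfrac{|W_b|}{|W|}}{\dfrac{|W_b|}{|W|}} = \frac{|W_a|}{|W|} = P(W_a)$$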
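As a small illustration, the conditional overlap can be computed directly from two sets of URLs. This is only a sketch: the page names and the helper `overlap_fraction` are hypothetical and chosen to keep the arithmetic visible.

```python
def overlap_fraction(wa: set, wb: set) -> float:
    """Return |Wa ∩ Wb| / |Wb|, the share of b's pages that a also indexes."""
    if not wb:
        raise ValueError("wb must be non-empty")
    return len(wa & wb) / len(wb)


# Hypothetical indexes of two engines (illustrative data only).
wa = {"p1", "p2", "p3", "p4"}
wb = {"p3", "p4", "p5", "p6"}

print(overlap_fraction(wa, wb))  # 0.5, since 2 of b's 4 pages are also in a
```

If the two engines index pages independently, this fraction is also an estimate of P(Wa), the share of the whole web that engine a covers.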

Overlap Analysis

  • Using $|W| = |W_a| / P(W_a)$, the researchers found:
      • Web had at least 320 million pages in 1997
      • 60% of web was covered by six major engines
      • Maximum coverage of a single engine was 1/ of the web
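The size estimate above can be sketched in a few lines, assuming independence so that P(Wa) is approximated by the measured overlap |Wa ∩ Wb| / |Wb|. The function name `estimate_web_size` and the sample counts are hypothetical. They are not the figures used by the researchers.

```python
def estimate_web_size(size_a: int, size_b: int, overlap: int) -> float:
    """Estimate |W| = |Wa| / P(Wa), with P(Wa) ~ |Wa ∩ Wb| / |Wb|.

    Assumes engines a and b index pages independently of each other.
    """
    if overlap <= 0 or size_b <= 0:
        raise ValueError("overlap and size_b must be positive")
    p_a = overlap / size_b   # estimated coverage of engine a
    return size_a / p_a      # estimated total number of pages


# Hypothetical counts: a indexes 100 pages, b indexes 50, and 25 are shared.
print(estimate_web_size(100, 50, 25))  # 200.0
```

Because real engines tend to index the same popular pages, independence rarely holds exactly. The measured overlap is then too high and the estimate is better read as a lower bound, which matches the "at least" wording in the slide.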

Web Crawler

  • A crawler is a program that picks up a page and follows all the links on that page
  • Crawler = Spider
  • Types of crawler:
      • Breadth First
      • Depth First

Breadth First Crawlers

  • Use breadth-first search (BFS) algorithm
  • Get all links from the starting page, and add them to a queue
  • Pick the 1st link from the queue, get all links on the page and add to the queue
  • Repeat above step till queue is empty
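The breadth-first steps above can be sketched as follows. To keep the example runnable without network access, pages are modelled as a hypothetical in-memory link graph (a dict from page name to outgoing links). The `seen` set is an added assumption that the slide does not mention. Without it the queue would never empty on a graph that contains cycles.

```python
from collections import deque


def bfs_crawl(graph: dict, start: str) -> list:
    """Visit pages level by level, returning them in visit order."""
    order = []
    seen = {start}            # pages already queued (assumption: avoids loops)
    queue = deque([start])
    while queue:              # repeat until the queue is empty
        page = queue.popleft()            # pick the first link in the queue
        order.append(page)
        for link in graph.get(page, []):  # get all links on the page
            if link not in seen:
                seen.add(link)
                queue.append(link)        # add them to the queue
    return order


# Hypothetical link graph: A links to B and C, both link to D, D links back to A.
graph = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": ["A"]}
print(bfs_crawl(graph, "A"))  # ['A', 'B', 'C', 'D']
```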

Depth First Crawlers

  • Use depth-first search (DFS) algorithm
  • Get the 1st link not visited from the start page
  • Visit link and get 1st non-visited link
  • Repeat above step till no non-visited links
  • Go to next non-visited link in the previous level and repeat 2nd step
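The depth-first steps above can be sketched recursively. As an illustration only, pages are again modelled as a hypothetical in-memory link graph rather than fetched over the network, and the names `dfs_crawl` and `visit` are not from the slides.

```python
def dfs_crawl(graph: dict, start: str) -> list:
    """Follow the first non-visited link as deep as possible, then backtrack."""
    order = []
    seen = set()

    def visit(page: str) -> None:
        seen.add(page)
        order.append(page)
        for link in graph.get(page, []):
            if link not in seen:   # take the next non-visited link
                visit(link)        # go one level deeper before trying siblings
        # returning here moves back to the previous level

    visit(start)
    return order


# Hypothetical link graph: A links to B and C, both link to D, D links back to A.
graph = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": ["A"]}
print(dfs_crawl(graph, "A"))  # ['A', 'B', 'D', 'C']
```

On this graph the breadth-first order would be A, B, C, D, because all links on a page are visited before any link found one level deeper. The recursive form is limited by Python's recursion depth, so a crawl over a deep link graph would use an explicit stack instead.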

Depth First Crawlers