Advanced Crawling Techniques - E-Commerce - Lecture Slides, Slides of Fundamentals of E-Commerce

Students of Communication study E-Commerce as an auxiliary subject. These are the key points discussed in these lecture slides on E-Commerce: Advanced Crawling Techniques, Selective Crawling, Focused Crawling, Distributed Crawling, Web Dynamics, Downloads Documents, Reachable Pages, Navigates, Exhaustive Crawl, Selective Crawl

Typology: Slides

2012/2013

Uploaded on 07/29/2013

alok-sarath


Advanced Crawling Techniques

Chapter 6

Outline

Selective Crawling

Focused Crawling

Distributed Crawling

Web Dynamics

Nature of Crawl

Broadly categorized into

Exhaustive crawl

- broad coverage
- used by general-purpose search engines

Selective crawl

- fetch pages according to some criteria, e.g., popular pages, similar pages
- exploit semantic content, rich contextual aspects

Selective Crawling

Retrieve web pages according to some criteria

Page relevance is determined by a scoring function

$$s_\theta^{(\xi)}(u)$$

where $\xi$ is the relevance criterion, $\theta$ the parameters, and $u$ the URL being scored

e.g., a boolean relevance function

- $s(u) = 1$: document is relevant
- $s(u) = 0$: document is irrelevant
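One way to picture how a scoring function drives a selective crawl is a best-first traversal: the crawler always expands the highest-scoring URL it knows about. The sketch below is only an illustration under assumptions that are not in the slides: the web is a small in-memory dictionary (`TOY_WEB`), "downloading" is simulated by appending to a list, and `is_shop` is a made-up boolean relevance function.

```python
import heapq

def selective_crawl(graph, seeds, score, limit):
    """Visit up to `limit` pages, always expanding the highest-scoring known URL."""
    # heapq is a min-heap, so scores are negated; the counter breaks ties
    # in insertion order and keeps URLs from being compared directly.
    frontier, seen, order, counter = [], set(seeds), [], 0
    for url in seeds:
        heapq.heappush(frontier, (-score(url), counter, url))
        counter += 1
    while frontier and len(order) < limit:
        _, _, url = heapq.heappop(frontier)
        order.append(url)  # stands in for "download the page"
        for link in graph.get(url, []):
            if link not in seen:
                seen.add(link)
                heapq.heappush(frontier, (-score(link), counter, link))
                counter += 1
    return order

# Hypothetical toy web: URL -> outgoing links.
TOY_WEB = {
    "home": ["news", "shop"],
    "news": ["news/a", "shop"],
    "shop": ["shop/x", "shop/y"],
}

def is_shop(url):
    """Assumed boolean relevance: 1 if the URL looks like a shop page, else 0."""
    return 1 if url.startswith("shop") else 0

visited = selective_crawl(TOY_WEB, ["home"], is_shop, limit=4)
print(visited)
```

With a constant score this degenerates into a plain breadth-first (exhaustive) crawl, which is one way to see the two crawl types as points on the same spectrum.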

Examples of Scoring Function

Depth

- length of the path from the site homepage to the document
- limit the total number of levels retrieved from a site
- maximize coverage breadth

$$s_\delta^{(\mathrm{depth})}(u) = \begin{cases} 1, & \text{if } |\mathrm{root}(u) \rightsquigarrow u| < \delta \\ 0, & \text{otherwise} \end{cases}$$

Popularity

- assign relevance according to which pages are more important than others
- estimate the number of backlinks

$$s_\tau^{(\mathrm{backlinks})}(u) = \begin{cases} 1, & \text{if } \mathrm{indegree}(u) > \tau \\ 0, & \text{otherwise} \end{cases}$$
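The depth and backlink criteria can be sketched as two small threshold functions. This is a minimal illustration, assuming the link graph is already available as a dictionary (`SITE`, a made-up example), that depth means the shortest link path from the site root, and that the backlink threshold is strict (indegree greater than tau). A real crawler only sees the part of the graph it has fetched so far, so its indegree values are estimates.

```python
from collections import deque

def depth_from_root(graph, root):
    """Breadth-first search: shortest link-path length from root to each page."""
    depth = {root: 0}
    queue = deque([root])
    while queue:
        page = queue.popleft()
        for link in graph.get(page, []):
            if link not in depth:
                depth[link] = depth[page] + 1
                queue.append(link)
    return depth

def score_depth(depth, u, delta):
    """1 if u lies fewer than delta links from the root, else 0."""
    return 1 if depth.get(u, float("inf")) < delta else 0

def indegree(graph):
    """Count backlinks per page, counting each linking page once."""
    counts = {}
    for page, links in graph.items():
        counts.setdefault(page, 0)
        for link in set(links):
            counts[link] = counts.get(link, 0) + 1
    return counts

def score_backlinks(counts, u, tau):
    """1 if u has more than tau known backlinks, else 0."""
    return 1 if counts.get(u, 0) > tau else 0

# Hypothetical site: page -> outgoing links.
SITE = {"/": ["/a", "/b"], "/a": ["/b", "/a/deep"], "/b": ["/a/deep"], "/a/deep": []}
depth = depth_from_root(SITE, "/")
counts = indegree(SITE)
print(score_depth(depth, "/a/deep", 2), score_backlinks(counts, "/b", 1))
```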

Examples of Scoring Function

PageRank

assign value of importance

- value is proportional to the popularity of the source document
- estimated by a measure of the indegree of a page
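The slides do not give a formula for PageRank, so the following is only a sketch of the commonly used power-iteration formulation, in which each page repeatedly passes its current value to the pages it links to. The damping factor of 0.85, the iteration count, and the `LINKS` graph are assumptions chosen for illustration; the function also assumes every link target appears as a key in the graph.

```python
def pagerank(graph, damping=0.85, iterations=50):
    """Power iteration over a dict mapping each page to its list of outlinks."""
    pages = list(graph)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Rank held by pages with no outlinks is spread evenly over all pages.
        dangling = sum(rank[p] for p in pages if not graph[p])
        new_rank = {p: (1 - damping) / n + damping * dangling / n for p in pages}
        for p in pages:
            for q in graph[p]:
                # Each page splits its current rank equally among its outlinks.
                new_rank[q] += damping * rank[p] / len(graph[p])
        rank = new_rank
    return rank

# Hypothetical link graph: page -> pages it links to.
LINKS = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(LINKS)
for page, value in sorted(ranks.items(), key=lambda kv: -kv[1]):
    print(page, round(value, 3))
```

In this toy graph the page with the most inlinks ends up with the highest value, while a page that nobody links to keeps only the baseline share.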

Focused Crawling

Fetch pages within a certain topic

Relevance function

use text categorization techniques

$$s_\theta^{(\mathrm{topic})}(u) = P(c \mid d(u), \theta)$$

Parent based method

score of the parent page is extended to its child URLs

Anchor based method

anchor text is used for scoring pages
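The two methods differ in what evidence they use before a child page has been downloaded. The sketch below is a rough illustration only: the parent-based score simply copies the parent's relevance, and the anchor-based score uses a keyword-overlap heuristic as a stand-in for a real text classifier. The `TOPIC` term set, the example links, and the parent score of 0.8 are all made up.

```python
def parent_based_score(parent_score):
    """Parent-based: an unvisited child URL inherits its parent's relevance score."""
    return parent_score

def anchor_based_score(anchor_text, topic_terms):
    """Anchor-based (assumed heuristic): fraction of anchor words that are topic terms."""
    words = anchor_text.lower().split()
    if not words:
        return 0.0
    return sum(w in topic_terms for w in words) / len(words)

# Hypothetical topic vocabulary and outlinks found on one parent page.
TOPIC = {"crawler", "crawling", "spider"}
links = [
    ("http://example.org/a", "focused crawling tutorial"),
    ("http://example.org/b", "contact us"),
]
parent_score = 0.8

scored = [
    (url, parent_based_score(parent_score), anchor_based_score(text, TOPIC))
    for url, text in links
]
for url, by_parent, by_anchor in scored:
    print(url, by_parent, round(by_anchor, 2))
```

The parent-based score cannot tell the two links apart, whereas the anchor text separates them, which is the practical motivation for looking at anchors.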

Focused Crawler

Basic approach

classify crawled pages into categories

- use a topic taxonomy, provide example URLs, and mark categories of interest
- use a Bayesian classifier to find P(c|p)
- compute a relevance score for each page

$$R(p) = \sum_{c \in \mathrm{good}} P(c \mid p)$$
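As a small worked example of the relevance score, the sketch below adds up the class probabilities of the categories marked as good. The probability values are invented and stand in for the output of a trained Bayesian classifier, and the category names and the `GOOD` set are likewise hypothetical.

```python
def relevance(posteriors, good):
    """R(p): total probability mass the classifier assigns to the good categories."""
    return sum(prob for category, prob in posteriors.items() if category in good)

# Hypothetical classifier output P(c|p) for one crawled page p.
posteriors = {
    "cycling": 0.55,
    "mountain biking": 0.25,
    "cooking": 0.15,
    "finance": 0.05,
}
GOOD = {"cycling", "mountain biking"}  # categories the user marked as of interest

r = relevance(posteriors, GOOD)
print(round(r, 2))
```

A crawler could then prioritise the outlinks of pages whose score is high, in the spirit of the parent-based method.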