



Prof. Jitesh Bhaskar delivered this lecture for the Distributed and Parallel Data Management course at Dhirubhai Ambani Institute of Information and Communication Technology. Its main points are: Distributed, Information, Retrieval, Search, Engine, Crawling, Indexing, Ranking, Queries, Parallelization, Scalability
Typology: Slides




Contravariance
Two agents:
Agent A: url 1, url 3, url 5
Agent B: url 2, url 4, url 6
Three agents, first reassignment:
Agent A: url 1, url 2
Agent B: url 3, url 4
Agent C: url 5, url 6
Three agents, second reassignment:
Agent A: url 1, url 3
Agent B: url 2, url 4
Agent C: url 5, url 6
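The assignments above can be compared with a small check. This is a minimal sketch, assuming "contravariance" means that when the set of agents grows, each existing agent may only lose URLs and never gain one. The slide does not spell the definition out, so that reading is an assumption, and the function name `is_contravariant` is hypothetical.

```python
def is_contravariant(before, after):
    """True if every agent present in both assignments ends up with a
    subset of the URLs it held before the new agent joined."""
    return all(
        after[agent] <= urls
        for agent, urls in before.items()
        if agent in after
    )

# Assignments copied from the slide.
two_agents = {"A": {"url 1", "url 3", "url 5"}, "B": {"url 2", "url 4", "url 6"}}
first = {"A": {"url 1", "url 2"}, "B": {"url 3", "url 4"}, "C": {"url 5", "url 6"}}
second = {"A": {"url 1", "url 3"}, "B": {"url 2", "url 4"}, "C": {"url 5", "url 6"}}

print(is_contravariant(two_agents, first))   # False: url 2 moves from B to A
print(is_contravariant(two_agents, second))  # True: only url 5 and url 6 move, both to C
```

Under this reading, the second three-agent assignment is the desirable one: agents A and B never have to fetch anything they were not already responsible for.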
Assignment
Mapping persistent across agent restarts
live replica
[Figure: replicas of agents A and B, and later C, with url 6 placed among them]
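One common way to get an assignment that survives restarts and skips dead agents is consistent hashing with several replicas per agent. The sketch below assumes that this is what the figure depicts; the class `Ring`, the helper `_point`, the replica count, and the host names are all hypothetical. A cryptographic digest is used instead of Python's built-in `hash()`, because the latter is salted per process and would not give the same mapping after a restart.

```python
import bisect
import hashlib

def _point(key: str) -> int:
    # Stable position on the ring: same value in every process and after restarts.
    return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")

class Ring:
    def __init__(self, agents, replicas=64):
        # Each agent contributes `replicas` points to the ring.
        self.points = sorted(
            (_point(f"{agent}#{i}"), agent)
            for agent in agents
            for i in range(replicas)
        )
        self.keys = [p for p, _ in self.points]

    def assign(self, host, live):
        """Walk clockwise from the host's point to the first replica of a live agent."""
        start = bisect.bisect_right(self.keys, _point(host))
        n = len(self.points)
        for step in range(n):
            agent = self.points[(start + step) % n][1]
            if agent in live:
                return agent
        raise RuntimeError("no live agent")

hosts = [f"host{i}.example" for i in range(200)]
ring_ab = Ring(["A", "B"])
ring_abc = Ring(["A", "B", "C"])

before = {h: ring_ab.assign(h, {"A", "B"}) for h in hosts}
after = {h: ring_abc.assign(h, {"A", "B", "C"}) for h in hosts}
moved = [h for h in hosts if before[h] != after[h]]

# Hosts that change owner only ever move to the new agent C.
print(all(after[h] == "C" for h in moved))
# If C is down, its replicas are skipped and the old mapping reappears.
print(all(ring_abc.assign(h, {"A", "B"}) == before[h] for h in hosts))
```

Both checks print `True` by construction: adding C only inserts new points on the ring, so a host either keeps its owner or lands on one of C's replicas.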
Crawl Partitioning
E.g., relative to absolute URL
Reduces communication between agents
Small vs. large hosts
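The two ideas above, normalizing links to absolute URLs and partitioning so that agents rarely need to talk to each other, can be sketched as follows. It assumes partitioning by host name, so that links staying within a site stay with one agent; the helper names `absolutize` and `agent_for` and the example URLs are hypothetical. Hash-modulo assignment is used only for brevity and does not balance small against large hosts.

```python
import hashlib
from urllib.parse import urljoin, urlsplit

def absolutize(base_url, link):
    """Resolve a link found on a page against that page's URL."""
    return urljoin(base_url, link)

def agent_for(url, agents):
    """Pick an agent from the host name only, so one host maps to one agent."""
    host = urlsplit(url).hostname or ""
    digest = hashlib.sha256(host.encode()).digest()
    return agents[int.from_bytes(digest[:8], "big") % len(agents)]

page = "http://example.org/docs/index.html"
links = ["intro.html", "../about.html", "/contact"]
absolute = [absolutize(page, link) for link in links]
print(absolute)
# ['http://example.org/docs/intro.html', 'http://example.org/about.html', 'http://example.org/contact']

agents = ["A", "B", "C"]
owners = {agent_for(url, agents) for url in absolute}
print(len(owners))  # 1: every link on the same host goes to the same agent
```

Because most links on a page point back to the same host, the agent that fetched the page usually owns the newly discovered URLs as well and does not have to forward them.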
Fault Tolerance
Checkpoints
Communication logs
Before fetch, check with near neighbor agents
Indexing
[Figure: matrix of terms t1 ... tm (Lexicon) against documents d1 ... dn (Collection); the marked entries in the row of t1 form the posting for t1]
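The lexicon, collection, and posting structure can be illustrated with a tiny inverted index. This is a minimal sketch with whitespace tokenization and a made-up three-document collection; the function name `build_index` is hypothetical.

```python
from collections import defaultdict

def build_index(collection):
    """Map each term of the lexicon to its posting list:
    the sorted ids of the documents that contain the term."""
    postings = defaultdict(set)
    for doc_id, text in collection.items():
        for term in text.lower().split():
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}

collection = {
    1: "distributed web crawling",
    2: "web indexing",
    3: "distributed indexing and ranking",
}
index = build_index(collection)

print(index["distributed"])  # [1, 3]
print(index["web"])          # [1, 2]
print(sorted(index))         # the lexicon
```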
Architecture
[Figure: indexing pipeline. Labels: Web pages; Distributors, Indexers, Query servers; Map, Reduce; Intermediate runs; Inverted index files]
Issues
Query routing
Result retrieval
Document Partitioning
[Figure: the matrix of terms t1 ... tm against documents d1 ... dn, split by document into partitions d1 d2 d3, d4 d5 d6, ..., dn-2 dn-1 dn; every partition covers all terms t1 ... tm]
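With document partitioning, every server holds a complete index over its own slice of the collection, so a query goes to all servers and a coordinator merges their local results. The sketch below assumes a toy score (the number of query-term occurrences in a document) and made-up documents; `search_partition` and `search_all` are hypothetical names.

```python
import heapq

def search_partition(partition, query_terms, k):
    """Local step on one server: score its own documents and return the top k."""
    scored = []
    for doc_id, text in partition.items():
        words = text.lower().split()
        score = sum(words.count(term) for term in query_terms)
        if score:
            scored.append((score, doc_id))
    return heapq.nlargest(k, scored)

def search_all(partitions, query_terms, k):
    """Coordinator: query every partition, then merge the local top-k lists."""
    candidates = []
    for partition in partitions:
        candidates.extend(search_partition(partition, query_terms, k))
    return [doc_id for _, doc_id in heapq.nlargest(k, candidates)]

partitions = [
    {"d1": "web search engine", "d2": "parallel index", "d3": "web web search crawler"},
    {"d4": "search ranking", "d5": "web cache", "d6": "query routing"},
]
print(search_all(partitions, ["web", "search"], 2))  # ['d3', 'd1']
```

Each server only needs to return its own k best documents, because no document outside a local top k can appear in the global top k.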
Document Partitioning
Architecture
Issues
E.g., identical sites + routing by dynamic DNS lookup
Issues
Partitioning | Routing | Merging
Document partition | All servers | Results selected by servers; ranking by coordinator
Term partition | Servers containing query terms | Selection and ranking by coordinator
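The term-partition case can be sketched as follows: the coordinator contacts only the servers that hold posting lists for the query terms, then does the selection and ranking itself. The server names, the posting lists, and the scoring rule (number of query terms a document matches) are assumptions made for illustration.

```python
def route(term_servers, query_terms):
    """Return only the servers whose slice of the lexicon holds a query term."""
    return sorted({term_servers[t] for t in query_terms if t in term_servers})

def search_term_partitioned(servers, term_servers, query_terms):
    """Coordinator: fetch posting lists from the routed servers, then select and rank."""
    scores = {}
    for name in route(term_servers, query_terms):
        for term in query_terms:
            for doc_id in servers[name].get(term, []):
                scores[doc_id] = scores.get(doc_id, 0) + 1
    # Highest score first; ties broken by document id.
    return sorted(scores, key=lambda d: (-scores[d], d))

# Each server stores the complete posting lists of a subset of the terms.
servers = {
    "s1": {"cache": ["d5"], "crawler": ["d3"]},
    "s2": {"search": ["d1", "d4"], "web": ["d1", "d3", "d5"]},
    "s3": {"ranking": ["d4"]},
}
term_servers = {term: name for name, postings in servers.items() for term in postings}

print(route(term_servers, ["web", "search"]))  # ['s2']
print(search_term_partitioned(servers, term_servers, ["web", "search"]))
# ['d1', 'd3', 'd4', 'd5']
```

In contrast to document partitioning, fewer servers are involved per query, but the coordinator receives whole posting lists and carries the ranking work.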
Caching
Faster response
More hits
Query terms repeated more frequently than whole queries
Caching Policy
high hit ratio
require more cache space (longer postings)
query/document frequency ratio
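The trade-off above can be made concrete with a small static cache of posting lists. This sketch assumes the policy ranks terms by query frequency divided by document frequency (posting length) and fills a fixed budget greedily; the frequencies are invented and `choose_cached_terms` is a hypothetical name.

```python
def choose_cached_terms(query_freq, doc_freq, capacity):
    """Fill a cache that can hold `capacity` postings, preferring terms with
    the highest query-frequency / document-frequency ratio."""
    ranked = sorted(query_freq, key=lambda t: query_freq[t] / doc_freq[t], reverse=True)
    chosen, used = [], 0
    for term in ranked:
        if used + doc_freq[term] <= capacity:
            chosen.append(term)
            used += doc_freq[term]
    return chosen

# Invented statistics: how often a term is queried, and how long its posting list is.
query_freq = {"the": 900, "lexicon": 50, "crawler": 120, "index": 300}
doc_freq = {"the": 10000, "lexicon": 10, "crawler": 200, "index": 1500}

print(choose_cached_terms(query_freq, doc_freq, capacity=2000))
# ['lexicon', 'crawler', 'index']
```

The most frequently queried term, "the", is left out: its posting list alone would take five times the cache budget, while the three chosen terms together occupy 1710 postings.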