Distributed Information Retrieval-Parallel and Distributed Data Management-Lecture Slides, Slides of Distributed Database Management Systems

Prof. Jitesh Bhaskar delivered this lecture for the Distributed and Parallel Data Management course at Dhirubhai Ambani Institute of Information and Communication Technology. Its main points are: Distributed, Information, Retrieval, Search, Engine, Crawling, Indexing, Ranking, Queries, Parallelization, Scalability

Typology: Slides

2011/2012

Uploaded on 07/16/2012

sambandam


CS 347 Distributed IR 7

Contravariance

  • Two agents
    • Agent A: url 1, url 3, url 5
    • Agent B: url 2, url 4, url 6
  • Three agents
    • Agent A: url 1, url 2
    • Agent B: url 3, url 4
    • Agent C: url 5, url 6

CS 347 Distributed IR 8

Contravariance

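  • Two agents
    • Agent A: url 1, url 3, url 5
    • Agent B: url 2, url 4, url 6
  • Three agents, URLs reshuffled
    • Agent A: url 1, url 2
    • Agent B: url 3, url 4
    • Agent C: url 5, url 6
  • Three agents, URLs moved only to the new agent
    • Agent A: url 1, url 3
    • Agent B: url 2, url 4
    • Agent C: url 5, url 6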
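
The slides do not define the property. One common reading, taken here as an assumption, is that each agent's share of URLs should only shrink when agents are added, so URLs move to the new agent and never between old ones. A minimal Python sketch of that check on the assignments above (the name `is_contravariant` is hypothetical):

```python
def is_contravariant(before, after):
    """True if every agent that existed before keeps only a subset of its old URLs."""
    return all(after[agent] <= urls for agent, urls in before.items() if agent in after)

two_agents = {"A": {"url1", "url3", "url5"}, "B": {"url2", "url4", "url6"}}
reshuffled = {"A": {"url1", "url2"}, "B": {"url3", "url4"}, "C": {"url5", "url6"}}
shrunk = {"A": {"url1", "url3"}, "B": {"url2", "url4"}, "C": {"url5", "url6"}}

print(is_contravariant(two_agents, reshuffled))  # False: url2 and url3 swap owners
print(is_contravariant(two_agents, shrunk))      # True: only url5 and url6 move, both to C
```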

CS 347 Distributed IR 9

Assignment

  • Consistent hashing
    • Hash function: URL → agent
    • Each agent “replicated” k times
    • Each replica mapped randomly on unit circle
      • Mapping persistent across agent restarts
    • Lookup: map URL on unit circle; find closest live replica
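
The scheme above can be sketched in Python as follows. This is an illustration under assumptions, not the lecture's implementation: the class `Ring`, the helper `point`, the use of SHA-256, and the choice of "closest" as the next replica clockwise are all hypothetical.

```python
import bisect
import hashlib

def point(key: str) -> float:
    """Map a string to a pseudo-random position in [0, 1) on the unit circle."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64

class Ring:
    def __init__(self, agents, k=100):
        # Each agent is "replicated" k times. A replica's position depends only
        # on the agent name and replica index, so it survives agent restarts.
        self.ring = sorted((point(f"{a}#{i}"), a) for a in agents for i in range(k))
        self.positions = [p for p, _ in self.ring]

    def lookup(self, url, live):
        """Return the live agent whose replica follows the URL's position (assumed: clockwise)."""
        start = bisect.bisect_left(self.positions, point(url))
        n = len(self.ring)
        for step in range(n):
            _, agent = self.ring[(start + step) % n]
            if agent in live:          # skip replicas of agents that are down
                return agent
        raise LookupError("no live agent")

ring = Ring(["A", "B", "C"], k=50)
everyone = {"A", "B", "C"}
owner = ring.lookup("http://example.com/page", everyone)
fallback = ring.lookup("http://example.com/page", everyone - {owner})
```

If `owner` fails, only its URLs are reassigned; URLs held by the other agents stay where they are.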

CS 347 Distributed IR 10

Assignment

[Figure: unit circle with two replicas of agent A, two replicas of agent B, and url 6 mapped onto it]

CS 347 Distributed IR 11

Assignment

[Figure: the unit circle with replicas A, A, B, B and url 6, shown again after two replicas of a new agent C are added]

  • Balancing 
  • Contravariance 

CS 347 Distributed IR 12

Crawl Partitioning

  • Ideas
    • URL normalization
      • E.g., relative to absolute URL
    • Host-based partitioning
      • Reduces communication between agents
      • Small vs. large hosts
    • Geographic distribution
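
A minimal Python sketch of the first two ideas, using the standard `urllib.parse` module. The function names `normalize` and `agent_for` are hypothetical, and the modulo hash is only a simple stand-in for whatever assignment function the crawler actually uses.

```python
import hashlib
from urllib.parse import urljoin, urlsplit

def normalize(base_url, link):
    """Resolve a link against the page it was found on and canonicalize it."""
    parts = urlsplit(urljoin(base_url, link))
    # Lower-case the host and drop the fragment, so equivalent URLs compare equal.
    return parts._replace(netloc=parts.netloc.lower(), fragment="").geturl()

def agent_for(url, agents):
    """Host-based partitioning: every URL of one host goes to the same agent."""
    host = urlsplit(url).hostname or ""
    h = int.from_bytes(hashlib.sha256(host.encode("utf-8")).digest()[:8], "big")
    return agents[h % len(agents)]

agents = ["A", "B", "C"]
absolute = normalize("http://Example.com/a/b.html", "../c.html#top")
print(absolute)  # http://example.com/c.html
```

Because most links point to the same host, an agent that owns a whole host rarely has to forward discovered URLs to other agents.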

CS 347 Distributed IR 13

Fault Tolerance

  • Repartitioning
  • Permanent failure
    • Recovering list of URLs to visit
      • Checkpoints
      • Communication logs
  • Transient failure
    • Avoiding re-visiting URLs
      • Before fetch, check with near neighbor agents
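
One way to read the permanent-failure case is sketched below in Python: the to-visit list of a failed agent is rebuilt from its last checkpoint plus the communication log written since. The log format (`"received"` and `"fetched"` entries) and the function name `recover_frontier` are assumptions made for illustration; the slides do not specify them.

```python
def recover_frontier(checkpoint_frontier, checkpoint_visited, log_entries):
    """Rebuild the to-visit list of a failed agent.

    log_entries: (kind, url) pairs recorded after the checkpoint, where kind is
    "received" (URL handed to the agent) or "fetched" (URL the agent completed).
    """
    visited = set(checkpoint_visited)
    frontier = list(checkpoint_frontier)
    for kind, url in log_entries:
        if kind == "fetched":
            visited.add(url)
        elif kind == "received" and url not in frontier:
            frontier.append(url)
    # Anything already fetched must not be visited again.
    return [u for u in frontier if u not in visited]

log = [("received", "u3"), ("fetched", "u1"), ("received", "u2")]
todo = recover_frontier(["u1", "u2"], {"u0"}, log)
print(todo)  # ['u2', 'u3']
```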

CS 347 Distributed IR 14

Indexing

  • Build term-document index

[Figure: term-document matrix over documents d1 … dn and terms t1 … tm, with a dot where a term occurs in a document. Labels: Collection (the documents), Lexicon (the terms), Posting for t1]
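
As a small illustration, the term-document index can be built in Python like this. The whitespace tokenizer and the name `build_index` are assumptions; a real indexer would also store positions or frequencies.

```python
from collections import defaultdict

def build_index(collection):
    """collection: dict doc_id -> text. Returns dict term -> sorted list of doc ids."""
    postings = defaultdict(set)
    for doc_id, text in collection.items():
        for term in text.lower().split():
            postings[term].add(doc_id)
    # Sorting the terms gives the lexicon a stable order.
    return {term: sorted(docs) for term, docs in sorted(postings.items())}

collection = {"d1": "distributed crawling", "d2": "distributed indexing", "d3": "ranking"}
index = build_index(collection)
lexicon = list(index)
print(index["distributed"])  # ['d1', 'd2']
```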

CS 347 Distributed IR 15

Architecture

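[Figure: indexing architecture. Web pages pass through distributors, indexers, and query servers; the stored data are intermediate runs and inverted index files; the two stages are labeled Map and Reduce]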
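
The two stages can be sketched in Python as a map step that produces sorted intermediate runs and a reduce step that merges them into an inverted index. This is a single-process illustration; the function names `map_run` and `reduce_runs` and the whitespace tokenizer are hypothetical.

```python
import heapq
from itertools import groupby

def map_run(docs):
    """Indexer step: turn a batch of (doc_id, text) into one sorted run of (term, doc_id)."""
    return sorted({(term, doc_id) for doc_id, text in docs for term in text.lower().split()})

def reduce_runs(runs):
    """Merge sorted runs into an inverted index: term -> list of doc ids."""
    merged = heapq.merge(*runs)  # streams the runs in sorted order
    return {term: [doc for _, doc in group]
            for term, group in groupby(merged, key=lambda pair: pair[0])}

run1 = map_run([("d1", "web search"), ("d2", "web crawl")])
run2 = map_run([("d3", "search engine")])
index = reduce_runs([run1, run2])
print(index["search"])  # ['d1', 'd3']
```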

CS 347 Distributed IR 16

Issues

  • Index partitioning
  • Efficient query processing
    • Query routing
    • Result retrieval

CS 347 Distributed IR 17

Document Partitioning

[Figure: the term-document matrix (documents d1 … dn, terms t1 … tm) split by documents into partitions d1–d3, d4–d6, …, dn-2–dn; each partition covers all terms t1 … tm]

CS 347 Distributed IR 18

Document Partitioning

  • Split the collection of documents
  • Advantages
    • Easy to add new documents
    • Load balanced
    • High processing throughput
  • Disadvantages
    • Communication with all query servers
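
The disadvantage is visible in a small Python sketch: the coordinator has to contact every partition and merge the local results. Scoring by summed term frequency and the names `search_partition` and `search_all` are assumptions made for illustration.

```python
import heapq

def search_partition(partition, query_terms, k):
    """One query server: score its own documents, return its local top k as (score, doc_id)."""
    scores = {}
    for term in query_terms:
        for doc_id, tf in partition.get(term, {}).items():
            scores[doc_id] = scores.get(doc_id, 0) + tf
    return heapq.nlargest(k, ((s, d) for d, s in scores.items()))

def search_all(partitions, query_terms, k):
    """Coordinator: send the query to every partition and merge the local results."""
    candidates = [hit for p in partitions for hit in search_partition(p, query_terms, k)]
    return heapq.nlargest(k, candidates)

# Each partition indexes its own documents: term -> {doc_id: term frequency}.
partitions = [
    {"web": {"d1": 2, "d2": 1}, "search": {"d1": 1}},
    {"web": {"d4": 5}, "search": {"d5": 3}},
]
top = search_all(partitions, ["web", "search"], k=2)
print(top)  # [(5, 'd4'), (3, 'd5')]
```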

CS 347 Distributed IR 25

Formula

$$p(x) = d \cdot \sum_{y \to x} \frac{p(y)}{\mathrm{out}(y)} + \frac{1 - d}{n}$$

  • p(x): PageRank of page x
  • p(y): PageRank of y, where y links to x
  • out(y): out-degree of page y
  • (1 – d) / n: probability of random restart at x

CS 347 Distributed IR 26

Algorithm

i = 0
p[i](x) = (1 – d) / n
repeat
    i += 1
    p[i](x) = (1 – d) / n
    for all y→x
        p[i](x) += d · p[i–1](y) / out(y)
until | p[i] – p[i–1] …

CS 347 Distributed IR 27

Implementation

• Two vectors, current and next
• Initialize vectors
• Iterate over all pages y, distribute PageRank from current(y) to next(x) for all links y→x
• current = next, re-initialize next
• Go back to iteration over pages or stop
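
These steps can be sketched in Python as follows. The damping factor `d=0.85`, the stopping threshold `eps`, the L1 norm, and the iteration cap are assumptions; the sketch also assumes every link target is itself a key of `links`.

```python
def pagerank(links, d=0.85, eps=1e-10, max_iter=1000):
    """links: dict page -> list of pages it links to (every target must be a key)."""
    n = len(links)
    current = {x: (1 - d) / n for x in links}      # initialize current
    for _ in range(max_iter):
        nxt = {x: (1 - d) / n for x in links}      # (re-)initialize next
        for y, targets in links.items():           # distribute current(y) over links y -> x
            for x in targets:
                nxt[x] += d * current[y] / len(targets)
        change = sum(abs(nxt[x] - current[x]) for x in links)
        current = nxt                              # current = next
        if change < eps:                           # stop, or go back to the iteration
            break
    return current

ranks = pagerank({"a": ["b"], "b": ["c"], "c": ["a"]})
```

On this three-page cycle every page converges to a rank of about 1/3. Pages without out-links simply distribute nothing in this sketch, so their rank mass is lost.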

CS 347 Distributed IR 28

Distribution

• MapReduce for each iteration i

• Map

– Take <y, (current(y), edges(y))>

– For each yx in edges(y)

emit <x, current(y) / | edges(y) |>

– Also emit <y, edges(y)>

• Reduce

– Take <x, val> and <x, edges(x)>

– Sum (d · val) into next(x), add (1 – d) / n

– Emit <x, (next(x), edges(x))>
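
One iteration of this scheme can be simulated in a single Python process as sketched below. The tags `"rank"` and `"edges"` that tell the two kinds of values apart, the in-memory shuffle, and the function names are assumptions made for illustration.

```python
from collections import defaultdict

def map_page(y, value):
    """Map: take <y, (current(y), edges(y))>."""
    current_y, edges_y = value
    for x in edges_y:
        yield x, ("rank", current_y / len(edges_y))   # emit <x, current(y) / |edges(y)|>
    yield y, ("edges", edges_y)                       # also emit <y, edges(y)>

def reduce_page(x, values, d, n):
    """Reduce: take <x, val> and <x, edges(x)>, emit <x, (next(x), edges(x))>."""
    nxt, edges = (1 - d) / n, []
    for tag, v in values:
        if tag == "rank":
            nxt += d * v
        else:
            edges = v
    return x, (nxt, edges)

def iterate(state, d=0.85):
    groups = defaultdict(list)                        # stands in for the shuffle phase
    for y, value in state.items():
        for key, tagged in map_page(y, value):
            groups[key].append(tagged)
    n = len(state)
    return dict(reduce_page(x, vals, d, n) for x, vals in groups.items())

state = {"a": (1 / 3, ["b"]), "b": (1 / 3, ["c"]), "c": (1 / 3, ["a"])}
state = iterate(state)
```

Passing `edges(y)` through the map step is what lets the output of one iteration serve directly as the input of the next.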

CS 347 Distributed IR 29

Distribution

[Figure: data flow of one iteration: <y, (current(y), edges(y))> → Map → <x, val>, <x, val> → Reduce → <x, (next(x), edges(x))>]

CS 347 Distributed IR 30

Query Processing

• Locate, retrieve, process, and serve query results

[Figure: query processing components: Query, Query coordinator, Cache, Query servers, Inverted index files, Results]

CS 347 Distributed IR 31

Architecture

  • Multiple sites connected by WAN
    • Site = coordinator + servers + cache
  • Partitioning
    • Parallel processing
    • Distributed storage of data
    • E.g., index partitioning
  • Replication
    • Availability
    • Throughput
    • Response time

CS 347 Distributed IR 32

Issues

  • Routing the query
    • To sites
      • E.g., identical sites + routing by dynamic DNS lookup
    • Within sites
  • Merging the results
  • Caching

CS 347 Distributed IR 33

Issues

                     Routing                           Merging
Document partition   All servers                       Results selected by servers; ranking by coordinator
Term partition       Servers containing query terms    Selection and ranking by coordinator
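
The term-partition row can be sketched in Python as follows: the coordinator contacts only the servers that hold the query terms, then does selection and ranking itself. The data layout, the scoring by summed term frequency, and the names `route` and `answer` are assumptions made for illustration.

```python
def route(query_terms, term_to_server):
    """Return the servers that hold postings for at least one query term."""
    return {term_to_server[t] for t in query_terms if t in term_to_server}

def answer(query_terms, servers, term_to_server, k):
    """Coordinator under term partitioning: fetch postings, then select and rank itself."""
    scores = {}
    for term in query_terms:
        server = term_to_server.get(term)
        if server is None:
            continue                     # no server holds this term
        for doc_id, tf in servers[server][term].items():
            scores[doc_id] = scores.get(doc_id, 0) + tf
    # Highest score first; ties broken by document id.
    return sorted(scores, key=lambda doc: (-scores[doc], doc))[:k]

# Each server stores the full postings of its own terms: term -> {doc_id: term frequency}.
servers = {
    "s1": {"web": {"d1": 2, "d4": 5}},
    "s2": {"search": {"d1": 1, "d5": 3}},
    "s3": {"crawl": {"d2": 1}},
}
term_to_server = {"web": "s1", "search": "s2", "crawl": "s3"}

print(sorted(route(["web", "search"], term_to_server)))          # ['s1', 's2']
print(answer(["web", "search"], servers, term_to_server, k=2))   # ['d4', 'd1']
```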

CS 347 Distributed IR 34

Caching

  • What to cache?
    • Query answers
    • Term postings

CS 347 Distributed IR 35

Caching

  • What to cache?
    • Query answers
      • Faster response
    • Term postings
      • More hits
      • Query terms repeated more frequently than whole queries

CS 347 Distributed IR 36

Caching Policy

  • Terms most frequent in queries
    • High hit ratio
  • Terms most frequent in documents
    • Require more cache space (longer postings)
  • Use static caching based on query/document frequency ratio
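
A minimal Python sketch of such a static policy: terms are ranked by the ratio of query frequency to document frequency and cached greedily until the space budget is used up. The greedy fill, the use of document frequency as a proxy for postings length, the sample numbers, and the name `choose_cached_terms` are all assumptions.

```python
def choose_cached_terms(query_freq, doc_freq, budget):
    """Cache postings lists in decreasing order of query/document frequency ratio.

    The length of a postings list is approximated by the term's document frequency.
    """
    ranked = sorted(query_freq, key=lambda t: query_freq[t] / doc_freq[t], reverse=True)
    cached, used = [], 0
    for term in ranked:
        if used + doc_freq[term] <= budget:   # skip lists that no longer fit
            cached.append(term)
            used += doc_freq[term]
    return cached

query_freq = {"the": 900, "stanford": 300, "crawler": 120}
doc_freq = {"the": 100000, "stanford": 500, "crawler": 300}
print(choose_cached_terms(query_freq, doc_freq, budget=1000))  # ['stanford', 'crawler']
```

The most queried term, "the", is left out because its postings list would use far more space than the hits it brings.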