






An in-depth explanation of map-reduce, its generalization, and the issues in its implementation. It includes examples of map-reduce usage, such as counting word occurrences and sorting records. It also discusses Hadoop and Pig, two open-source systems based on map-reduce, and Pig Latin, the query language used in Pig. The document also covers data models, user-defined functions, and specifying input data.
Typology: Slides







Page stream:
  page 1: rat dog
  page 2: dog cat
  page 3: rat dog
Loading -> Tokenizing -> Sorting
Tokenized: (rat, 1) (dog, 1) (dog, 2) (cat, 2) (rat, 3) (dog, 3)
Sorted (intermediate run): (cat, 2) (dog, 1) (dog, 2) (dog, 3) (rat, 1) (rat, 3)
CS347 Notes 09 8
Intermediate runs:
  run 1: (cat, 2) (dog, 1) (dog, 2) (dog, 3) (rat, 1) (rat, 3)
  run 2: (ant, 5) (cat, 4) (dog, 4) (dog, 5) (eel, 6)
Merge:
  (ant, 5) (cat, 2) (cat, 4) (dog, 1) (dog, 2) (dog, 3) (dog, 4) (dog, 5) (eel, 6) (rat, 1) (rat, 3)
Final index:
  (ant: 5) (cat: 2, 4) (dog: 1, 2, 3, 4, 5) (eel: 6) (rat: 1, 3)
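The tokenize, sort, and merge steps above can be sketched in a few lines of Python. This is only an illustrative sketch, not the implementation from the notes: the function names are hypothetical, the page texts for the first run are read off the tokenized pairs on the slide, and the page texts for the second run are made up so that they produce the second intermediate run shown.

```python
import heapq
from itertools import groupby

def tokenize(pages):
    """Turn (page_id, text) pairs into (word, page_id) postings."""
    return [(word, page_id) for page_id, text in pages for word in text.split()]

def make_run(pages):
    """Build one sorted intermediate run from a batch of pages."""
    return sorted(tokenize(pages))

def merge_runs(runs):
    """Merge sorted runs and collapse postings for the same word into one entry."""
    merged = heapq.merge(*runs)  # streaming merge of already-sorted runs
    return [(word, [page_id for _, page_id in group])
            for word, group in groupby(merged, key=lambda posting: posting[0])]

# Run 1 uses the pages implied by the slide; run 2's page texts are assumed.
run1 = make_run([(1, "rat dog"), (2, "dog cat"), (3, "rat dog")])
run2 = make_run([(4, "cat dog"), (5, "ant dog"), (6, "eel")])
index = merge_runs([run1, run2])
```

Because each run is already sorted, the merge only needs to look at the head of each run, which is why the runs can be larger than memory.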
•Why does map have 2 parameters?
Figure: map workers emit pairs such as [cat 1], [cat 1], [cat 1]... and [dog 1], [dog 1]...; reduce workers emit totals such as [cat 3]... and [dog 2]...
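The word-count flow in the figure can be imitated on a single machine. The sketch below is a minimal, hedged illustration: `map_fn`, `reduce_fn`, and `map_reduce` are hypothetical names, the two input documents are made up to reproduce the counts in the figure, and the in-memory grouping stands in for the shuffle that a real system performs across workers.

```python
from collections import defaultdict

def map_fn(doc_id, text):
    """Map: emit (word, 1) for every word in the document."""
    for word in text.split():
        yield word, 1

def reduce_fn(word, counts):
    """Reduce: sum all the 1s emitted for one word."""
    yield word, sum(counts)

def map_reduce(inputs, map_fn, reduce_fn):
    """Run map on every (key, value) input, group by intermediate key, then reduce."""
    groups = defaultdict(list)
    for key, value in inputs:
        for k2, v2 in map_fn(key, value):
            groups[k2].append(v2)
    output = []
    for k2 in sorted(groups):  # reducers see keys in sorted order
        output.extend(reduce_fn(k2, groups[k2]))
    return output

# Assumed input: two small documents keyed by document id.
docs = [(1, "cat dog cat"), (2, "dog cat")]
counts = map_reduce(docs, map_fn, reduce_fn)
```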
worker must be able to access any part of input file
reduce worker must be able to access local disks on map workers
any worker must be able to write its part of answer; answer is left as distributed file
all data transfers are through distributed file system
How many splits?
How many workers? Best to have many splits per worker: Improves load balance; if worker fails, easier to spread its tasks
Should workers be assigned to splits “near” them?
Similar questions for reduce workers
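The load-balance argument for many splits per worker can be checked with a toy simulation. This is a sketch under simple assumptions that are not in the notes: each record has a fixed processing cost, splits are contiguous chunks of records, and each split goes to whichever worker becomes free first. The names `split_costs` and `makespan` are hypothetical.

```python
import heapq

def split_costs(record_costs, n_splits):
    """Cut the records into n_splits contiguous chunks and total each chunk's cost."""
    size = -(-len(record_costs) // n_splits)  # ceiling division
    return [sum(record_costs[i:i + size]) for i in range(0, len(record_costs), size)]

def makespan(costs, n_workers):
    """Give each split, in order, to the worker that frees up first; return the finish time."""
    free_at = [0] * n_workers
    heapq.heapify(free_at)
    for cost in costs:
        earliest = heapq.heappop(free_at)
        heapq.heappush(free_at, earliest + cost)
    return max(free_at)

# Assumed skewed input: the first two records are expensive.
records = [5, 5, 1, 1, 1, 1]
one_split_per_worker = makespan(split_costs(records, 2), n_workers=2)
many_splits_per_worker = makespan(split_costs(records, 6), n_workers=2)
```

With one split per worker, the worker holding the expensive records finishes at time 11; with three splits per worker, the cheap splits fill in behind the expensive ones and both workers finish at time 7.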
Figure: the master checks on the workers ("ok?") and has task j redone ("redo j").
SQL:
SELECT category, AVG(pagerank)
FROM urls WHERE pagerank > 0.2
GROUP BY category HAVING COUNT(*) > 10^6

Pig Latin:
good_urls = FILTER urls BY pagerank > 0.2;
groups = GROUP good_urls BY category;
big_groups = FILTER groups BY COUNT(good_urls) > 10^6;
output = FOREACH big_groups GENERATE category, AVG(good_urls.pagerank);
urls (url, category, pagerank):
  z.cnn.com    .com   0.9
  y.yale.edu   .edu   0.5
  w.uc.edu     .edu   0.
  x.nyt.com    .com   0.8
  y.ut.edu     .edu   0.6
  w.wh.gov     .gov   .07
good_urls = FILTER urls BY pagerank > 0.2;
good_urls (url, category, pagerank):
  z.cnn.com    .com   0.9
  y.yale.edu   .edu   0.5
  x.nyt.com    .com   0.8
  y.ut.edu     .edu   0.6
  w.wh.gov     .gov   .07
groups = GROUP good_urls BY category;
good_urls (url, category, pagerank):
  z.cnn.com    .com   0.9
  y.yale.edu   .edu   0.5
  x.nyt.com    .com   0.8
  y.ut.edu     .edu   0.6
  w.wh.gov     .gov   .07
groups (category, good_urls):
  .com   {(z.cnn.com, .com, 0.9) (x.nyt.com, .com, 0.8)}
  .edu   {(y.yale.edu, .edu, 0.5) (y.ut.edu, .edu, 0.6)}
  .gov   {(w.wh.gov, .gov, .07)}
big_groups = FILTER groups BY COUNT(good_urls)>1;
groups (category, good_urls):
  .com   {(z.cnn.com, .com, 0.9) (x.nyt.com, .com, 0.8)}
  .edu   {(y.yale.edu, .edu, 0.5) (y.ut.edu, .edu, 0.6)}
  .gov   {(w.wh.gov, .gov, .07)}
big_groups (category, good_urls):
  .com   {(z.cnn.com, .com, 0.9) (x.nyt.com, .com, 0.8)}
  .edu   {(y.yale.edu, .edu, 0.5) (y.ut.edu, .edu, 0.6)}
output = FOREACH big_groups GENERATE category, AVG(good_urls.pagerank);
big_groups (category, good_urls):
  .com   {(z.cnn.com, .com, 0.9) (x.nyt.com, .com, 0.8)}
  .edu   {(y.yale.edu, .edu, 0.5) (y.ut.edu, .edu, 0.6)}
output (category, AVG(good_urls.pagerank)):
  .com   0.
  .edu   0.
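The four Pig Latin steps (FILTER, GROUP, FILTER on group size, FOREACH ... GENERATE) map onto plain Python as in the sketch below. It is only an illustration of the dataflow, not how Pig executes it. The pagerank values are taken from the nested tuples shown in the slides, with two assumptions: w.uc.edu is given 0.1 (its value is not shown, only that it fails the filter), and w.wh.gov is given 0.7 so that it passes the `pagerank > 0.2` filter as the slides show. The group-size threshold is 1, as in the small example.

```python
from collections import defaultdict

# urls: (url, category, pagerank); the w.uc.edu and w.wh.gov values are assumptions.
urls = [
    ("z.cnn.com", ".com", 0.9),
    ("y.yale.edu", ".edu", 0.5),
    ("w.uc.edu", ".edu", 0.1),   # assumed: any value <= 0.2 behaves the same
    ("x.nyt.com", ".com", 0.8),
    ("y.ut.edu", ".edu", 0.6),
    ("w.wh.gov", ".gov", 0.7),   # assumed: must exceed 0.2 to match the slides
]

# good_urls = FILTER urls BY pagerank > 0.2;
good_urls = [u for u in urls if u[2] > 0.2]

# groups = GROUP good_urls BY category;
groups = defaultdict(list)
for u in good_urls:
    groups[u[1]].append(u)

# big_groups = FILTER groups BY COUNT(good_urls) > 1;
big_groups = {cat: bag for cat, bag in groups.items() if len(bag) > 1}

# output = FOREACH big_groups GENERATE category, AVG(good_urls.pagerank);
output = {cat: sum(u[2] for u in bag) / len(bag) for cat, bag in big_groups.items()}
```

Each intermediate name holds a full relation, which mirrors how a Pig Latin program names every step of the dataflow instead of nesting them in one declarative query.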
spam_urls = FILTER urls BY isSpam(url);
culprit_urls = FILTER spam_urls BY pagerank > 0.8;
UDF top10 can return scalar or set
should be groups.url?
a1   b1, b2     {(c1) (c2)}
a1   b3, b4     {(c1) (c2)}
a1   b5         {(c1) (c2)}
a2   b6         {(c3) (c4)}
a2   (b7, b8)   {(c3) (c4)}
Flatten is not recursive
Note first tuple is (a1, b1, b2, {(c1)(c2)})
Note attribute naming gets complicated. For example, $2 for first tuple is b2; for third tuple it is {(c1)(c2)}.
Input:
  a1   {(b1, b2) (b3, b4) (b5)}   {(c1) (c2)}
  a2   {(b6, (b7, b8))}           {(c3) (c4)}
After flattening the second attribute:
  a1   b1   b2         {(c1) (c2)}
  a1   b3   b4         {(c1) (c2)}
  a1   b5              {(c1) (c2)}
  a2   b6   (b7, b8)   {(c3) (c4)}
a1   b1, b2     {(c1) (c2)}
a1   b3, b4     {(c1) (c2)}
a1   b5         {(c1) (c2)}
a2   b6         {(c3) (c4)}
a2   (b7, b8)   {(c3) (c4)}
After flattening both bags:
  a1   b1   b2         c1
  a1   b1   b2         c2
  a1   b3   b4         c1
  a1   b3   b4         c2
  a1   b5              c1
  a1   b5              c2
  a2   b6   (b7, b8)   c3
  a2   b6   (b7, b8)   c4
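A one-level flatten like the one in these examples can be sketched as follows. This is an illustrative model only, not Pig's implementation: bags are represented as Python lists of tuples, the function name `flatten` and its `positions` argument are hypothetical, and flattening several bags at once is modeled as a cross product of their tuples.

```python
from itertools import product

def flatten(record, positions):
    """Flatten the bag-valued fields at the given positions, one level only."""
    choices = []
    for i, field in enumerate(record):
        if i in positions:
            # One alternative per tuple in the bag; its fields get spliced in.
            choices.append([tuple(t) for t in field])
        else:
            # Untouched fields contribute themselves unchanged.
            choices.append([(field,)])
    # Concatenate one alternative from each field into an output tuple.
    return [sum(combo, ()) for combo in product(*choices)]

r1 = ("a1", [("b1", "b2"), ("b3", "b4"), ("b5",)], [("c1",), ("c2",)])
r2 = ("a2", [("b6", ("b7", "b8"))], [("c3",), ("c4",)])

flat_b = flatten(r1, {1}) + flatten(r2, {1})         # flatten second attribute only
flat_bc = flatten(r1, {1, 2}) + flatten(r2, {1, 2})  # flatten both bags
```

The nested tuple (b7, b8) survives as a single field because the flatten is not recursive, and the output tuples have different lengths, which is why positional names such as $2 refer to different things in different tuples.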
queryString, revenue BY queryString;
FLATTEN(distributeRevenue(results, revenue));
First relation:
  1   1   d
  1   2   d
  2   1   d
  2   2   d
Second relation:
  1   1   c1
  1   1   c2
  2   2   c3
  2   2   c4
1   {(1, 1, c1) (1, 1, c2)}
2   {(2, 2, c3) (2, 2, c4)}
Syntax not in paper but being added
(1, 1)   {(1, 1, c1) (1, 1, c2)}
(2, 2)   {(2, 2, c3) (2, 2, c4)}
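The two grouping results above, keyed by the first attribute and keyed by the pair of the first two attributes, can be reproduced with a small helper. This is a sketch only: `group_by` is a hypothetical helper, not a Pig function, and the rows are the tuples that appear inside the bags shown above.

```python
from collections import defaultdict

def group_by(rows, key_fn):
    """Collect rows into bags (lists) keyed by key_fn(row)."""
    groups = defaultdict(list)
    for row in rows:
        groups[key_fn(row)].append(row)
    return dict(groups)

rows = [(1, 1, "c1"), (1, 1, "c2"), (2, 2, "c3"), (2, 2, "c4")]

by_first = group_by(rows, lambda r: r[0])         # key is the first attribute
by_pair = group_by(rows, lambda r: (r[0], r[1]))  # key is a tuple of two attributes
```

In both cases each bag keeps the full original tuples, so the grouping key appears again inside every tuple of its bag.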
all attributes
key is first attribute
custom serializer
output file