Map-Reduce and its Implementation: CS347 Notes, Slides of Distributed Database Management Systems

These slides explain map-reduce, its generalization, and the issues that arise in implementing it, with examples that count word occurrences and sort records. They also cover Hadoop and Pig, two open-source systems based on map-reduce, and Pig Latin, the query language used in Pig, including its data model, user-defined functions, and ways of specifying input data.

Typology: Slides

2011/2012

Uploaded on 07/16/2012

sambandam



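Generalizing: Map-Reduce

Map phase of building the index, as drawn on the slide:

  • Loading: a page stream is read from disk: page 1 "rat dog", page 2 "dog cat", page 3 "rat dog"
  • Tokenizing: (rat, 1) (dog, 1) (dog, 2) (cat, 2) (rat, 3) (dog, 3)
  • Sorting: (cat, 2) (dog, 1) (dog, 2) (dog, 3) (rat, 1) (rat, 3)
  • Flushing: the sorted pairs are written to disk as an intermediate run

CS347 Notes 09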

Generalizing: Map-Reduce

Reduce phase of building the index, as drawn on the slide:

  • Intermediate runs:
    • (cat, 2) (dog, 1) (dog, 2) (dog, 3) (rat, 1) (rat, 3)
    • (ant, 5) (cat, 4) (dog, 4) (dog, 5) (eel, 6)
  • Merge: (ant, 5) (cat, 2) (cat, 4) (dog, 1) (dog, 2) (dog, 3) (dog, 4) (dog, 5) (eel, 6) (rat, 1) (rat, 3)
  • Final index: (ant: 5) (cat: 2,4) (dog: 1,2,3,4,5) (eel: 6) (rat: 1,3)
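The two phases of the index example can be sketched in a few lines of Python. This is only an illustration under assumptions: the function names are invented here, and the texts of pages 4 to 6 are hypothetical, chosen so that the second run matches the pairs on the slide.

```python
import heapq
from itertools import groupby
from operator import itemgetter

def map_phase(pages):
    """Tokenize (page_no, text) pairs and return one sorted run of (word, page_no)."""
    pairs = [(word, page_no) for page_no, text in pages for word in text.split()]
    return sorted(pairs)

def reduce_phase(runs):
    """Merge the sorted runs and collect the page numbers of each word."""
    merged = heapq.merge(*runs)
    return {word: [page for _, page in group]
            for word, group in groupby(merged, key=itemgetter(0))}

run1 = map_phase([(1, "rat dog"), (2, "dog cat"), (3, "rat dog")])
# Hypothetical page texts that reproduce the second intermediate run.
run2 = map_phase([(4, "cat dog"), (5, "dog ant"), (6, "eel")])
index = reduce_phase([run1, run2])
print(index)
```

Because each run is already sorted, the merge only needs one sequential pass over the runs, which is why the map side sorts before flushing.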

Map Reduce

  • Input: $R = \{r_1, r_2, \ldots, r_n\}$, functions $M$, $R$
    • $M(r_i) \rightarrow \{ [k_1, v_1], [k_2, v_2], \ldots \}$
    • $R(k_i, \mathit{valSet}) \rightarrow [k_i, \mathit{valSet}']$
  • Let $S = \{ [k, v] \mid [k, v] \in M(r) \text{ for some } r \in R \}$ ($S$ is a bag)
  • Let $K = \{ k \mid [k, v] \in S \text{ for any } v \}$
  • Let $G(k) = \{ v \mid [k, v] \in S \}$ ($G(k)$ is a bag)
  • Output $= \{ [k, T] \mid k \in K,\ T = R(k, G(k)) \}$
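The definitions of S, K, G and Output translate almost directly into code. The following is a minimal sequential sketch, not a distributed implementation; the example functions M and R (counting words by length) are hypothetical and only serve to exercise it.

```python
from collections import defaultdict

def map_reduce(records, M, R):
    """Evaluate Output = { [k, T] | k in K, T = R(k, G(k)) } sequentially."""
    # S: bag of all [k, v] pairs emitted by M over the input records.
    S = [pair for r in records for pair in M(r)]
    # G(k): bag of values paired with key k (a list, so duplicates are kept).
    # The keys of G are exactly the set K.
    G = defaultdict(list)
    for k, v in S:
        G[k].append(v)
    return {k: R(k, values) for k, values in G.items()}

# Hypothetical M and R: count how many words there are of each length.
def M(record):
    return [(len(word), 1) for word in record.split()]

def R(key, values):
    return sum(values)

output = map_reduce(["rat dog", "dog cat", "eel mouse"], M, R)
print(output)
```

Keeping G(k) as a list rather than a set matters here: the five identical values emitted for key 3 would collapse to a single value if G were a set, and the count would be wrong.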

References

  • MapReduce: Simplified Data Processing on Large Clusters, Jeffrey Dean and Sanjay Ghemawat, available at http://labs.google.com/papers/mapreduce-osdi04.pdf
  • Pig Latin: A Not-So-Foreign Language for Data Processing, Christopher Olston, Benjamin Reed, Utkarsh Srivastava, Ravi Kumar, Andrew Tomkins, available at http://wiki.apache.org/pig/

Example: Counting Word Occurrences

  • map(String doc, String value):
        // doc is document name
        // value is document content
        for each word w in value:
            EmitIntermediate(w, "1");
  • Example:
    • map(doc, "cat dog cat bat dog") emits [cat 1], [dog 1], [cat 1], [bat 1], [dog 1]
  • Why does map have 2 parameters?

Example: Counting Word Occurrences

  • reduce(String key, Iterator values):
        // key is a word
        // values is a list of counts
        int result = 0;
        for each v in values:
            result += ParseInt(v);
        Emit(AsString(result));
  • Example:
    • reduce("dog", "1 1 1 1") emits "4"
  • Should it emit ("dog", 4) instead?
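For readers who want to run the word-count example, here is a small Python rendering of the map and reduce pseudocode. The driver function word_count is an assumption added for illustration; it pairs each key with the reducer's result, which is one possible answer to the slide's question about what reduce should emit.

```python
from collections import defaultdict

def map_fn(doc, value):
    """doc is the document name (unused here); value is the document content."""
    for w in value.split():
        yield (w, "1")

def reduce_fn(key, values):
    """key is a word; values is an iterable of counts encoded as strings."""
    result = 0
    for v in values:
        result += int(v)
    return str(result)

def word_count(docs):
    """Hypothetical sequential driver: group intermediate pairs by word, then reduce."""
    groups = defaultdict(list)
    for name, content in docs.items():
        for w, one in map_fn(name, content):
            groups[w].append(one)
    # Keep the key next to the reducer's result.
    return {w: reduce_fn(w, vals) for w, vals in groups.items()}

counts = word_count({"doc": "cat dog cat bat dog"})
print(counts)
```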

Google MR Overview


Implementation Issues

  • Combine function
  • File system
  • Partition of input, keys
  • Failures
  • Backup tasks
  • Ordering of results


Combine Function

Combine is like a local reduce applied before distribution:

  • Without combine, the pairs sent between workers are [cat 1], [cat 1], [cat 1]... and [dog 1], [dog 1]...
  • With combine, the pairs sent are [cat 3]... and [dog 2]...
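A combiner for word counting can be sketched as follows. The input splits and function names are hypothetical; the point is only that summing locally on each map worker leaves the final counts unchanged while fewer pairs are shipped, which works here because addition is associative and commutative.

```python
from collections import Counter

def map_words(text):
    """Map step: one (word, 1) pair per word occurrence."""
    return [(w, 1) for w in text.split()]

def combine(pairs):
    """Local reduce on one map worker: sum the counts per word before sending."""
    local = Counter()
    for w, c in pairs:
        local[w] += c
    return sorted(local.items())

def reduce_counts(pair_lists):
    """Final reduce: sum the counts received from all map workers."""
    total = Counter()
    for pairs in pair_lists:
        for w, c in pairs:
            total[w] += c
    return dict(total)

splits = ["cat cat cat dog", "dog cat"]   # hypothetical input, one split per map worker
raw = [map_words(s) for s in splits]
combined = [combine(p) for p in raw]
print(combined)
```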

Distributed File System

  • A worker must be able to access any part of the input file
  • A reduce worker must be able to access the local disks of map workers
  • Any worker must be able to write its part of the answer; the answer is left as a distributed file
  • All data transfers go through the distributed file system

Partition of input, keys

  • How many workers, partitions of input file?
    • How many splits?
    • How many workers?
  • Best to have many splits per worker: improves load balance; if a worker fails, it is easier to spread its tasks
  • Should workers be assigned to splits "near" them?
  • Similar questions for reduce workers
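The load-balance argument for many splits per worker can be illustrated with a toy simulation. Everything here is assumed for illustration: the per-record costs, the contiguous splitting, and the greedy policy of handing the next split to the least-loaded worker are not taken from the slides.

```python
import heapq

def make_splits(costs, n_splits):
    """Cut the per-record costs into n_splits contiguous splits of near-equal length."""
    size, extra = divmod(len(costs), n_splits)
    splits, start = [], 0
    for i in range(n_splits):
        end = start + size + (1 if i < extra else 0)
        splits.append(costs[start:end])
        start = end
    return splits

def makespan(splits, n_workers):
    """Give each split, in order, to the currently least-loaded worker."""
    loads = [(0, w) for w in range(n_workers)]
    heapq.heapify(loads)
    for split in splits:
        load, w = heapq.heappop(loads)
        heapq.heappush(loads, (load + sum(split), w))
    return max(load for load, _ in loads)

costs = [9] + [1] * 11                           # one expensive record, eleven cheap ones
few = makespan(make_splits(costs, 3), 3)         # one split per worker
many = makespan(make_splits(costs, 12), 3)       # four splits per worker
print(few, many)
```

With one split per worker, the worker that draws the expensive record also carries three cheap ones; with finer splits the cheap records can be spread over the other workers.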

Failures

  • Distributed implementation should produce same output as would have been produced by a non-faulty sequential execution of the program.
  • General strategy: Master detects worker failures, and has work re-done by another worker.
  • On the slide, the master checks ("ok?") the worker handling split j and, on failure, has another worker redo j.
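The redo strategy can be sketched as a small simulation. The round-robin assignment, the worker names, and the stand-in task are assumptions made for this example; the sketch also assumes the task is deterministic, since otherwise redoing a split on another worker could not be guaranteed to match a sequential run.

```python
def process(split):
    """Stand-in for a map task: the sum of squares of the split's records."""
    return sum(x * x for x in split)

def run_with_master(splits, workers, failed):
    """Assign splits round-robin; the master has failed workers' splits redone elsewhere."""
    healthy = [w for w in workers if w not in failed]
    results, redone = {}, []
    for j, split in enumerate(splits):
        worker = workers[j % len(workers)]
        if worker in failed:                  # the master's "ok?" check fails
            worker = healthy[j % len(healthy)]
            redone.append(j)                  # "redo j" on a healthy worker
        results[j] = (worker, process(split))
    return results, redone

splits = [[1, 2], [3, 4], [5, 6], [7, 8]]
sequential = [process(s) for s in splits]
results, redone = run_with_master(splits, ["w0", "w1", "w2"], failed={"w1"})
print(results, redone)
```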

Questions

  • Can MR be made more "declarative"?
  • How can we perform joins?
  • How can we perform approximate grouping?
    • example: for all keys that are similar, reduce all values for those keys

Additional Topics

  • Hadoop: open-source Map-Reduce system
  • Pig: Yahoo system that builds on MR but is more declarative


Pig & Pig Latin

  • A layer on top of map-reduce (Hadoop)
    • Pig is the system
    • Pig Latin is the query language
  • Pig Latin is a hybrid between:
    • high-level declarative query language in the spirit of SQL
    • low-level, procedural programming à la map-reduce.

Example

  • Table urls: (url, category, pagerank)
  • Find, for each sufficiently large category, the average pagerank of high-pagerank urls in that category. In SQL:
  • SELECT category, AVG(pagerank)
    FROM urls WHERE pagerank > 0.2
    GROUP BY category HAVING COUNT(*) > 10^6

Example in Pig Latin

  • SELECT category, AVG(pagerank)
    FROM urls WHERE pagerank > 0.2
    GROUP BY category HAVING COUNT(*) > 10^6
  • In Pig Latin:
  • good_urls = FILTER urls BY pagerank > 0.2;
    groups = GROUP good_urls BY category;
    big_groups = FILTER groups BY COUNT(good_urls) > 10^6;
    output = FOREACH big_groups GENERATE category, AVG(good_urls.pagerank);

good_urls = FILTER urls BY pagerank > 0.2;

urls: (url, category, pagerank)
    z.cnn.com    .com   0.9
    y.yale.edu   .edu   0.5
    w.uc.edu     .edu   0.
    x.nyt.com    .com   0.8
    y.ut.edu     .edu   0.6
    w.wh.gov     .gov   0.7

good_urls: (url, category, pagerank)
    z.cnn.com    .com   0.9
    y.yale.edu   .edu   0.5
    x.nyt.com    .com   0.8
    y.ut.edu     .edu   0.6
    w.wh.gov     .gov   0.7

groups = GROUP good_urls BY category;

good_urls: (url, category, pagerank)
    z.cnn.com    .com   0.9
    y.yale.edu   .edu   0.5
    x.nyt.com    .com   0.8
    y.ut.edu     .edu   0.6
    w.wh.gov     .gov   0.7

groups: (category, good_urls)
    .com   {(z.cnn.com, .com, 0.9) (x.nyt.com, .com, 0.8)}
    .edu   {(y.yale.edu, .edu, 0.5) (y.ut.edu, .edu, 0.6)}
    .gov   {(w.wh.gov, .gov, 0.7)}

big_groups = FILTER groups BY COUNT(good_urls) > 1;

groups: (category, good_urls)
    .com   {(z.cnn.com, .com, 0.9) (x.nyt.com, .com, 0.8)}
    .edu   {(y.yale.edu, .edu, 0.5) (y.ut.edu, .edu, 0.6)}
    .gov   {(w.wh.gov, .gov, 0.7)}

big_groups: (category, good_urls)
    .com   {(z.cnn.com, .com, 0.9) (x.nyt.com, .com, 0.8)}
    .edu   {(y.yale.edu, .edu, 0.5) (y.ut.edu, .edu, 0.6)}

output = FOREACH big_groups GENERATE category, AVG(good_urls.pagerank);

big_groups: (category, good_urls)
    .com   {(z.cnn.com, .com, 0.9) (x.nyt.com, .com, 0.8)}
    .edu   {(y.yale.edu, .edu, 0.5) (y.ut.edu, .edu, 0.6)}

output: (category, average pagerank)
    .com   0.85
    .edu   0.55
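The four Pig Latin steps of this walkthrough can be mirrored in plain Python to check the intermediate relations. This is a sketch, not how Pig executes the script. Two values are assumptions: the pagerank of w.uc.edu (the slides only show that the filter drops it) and the pagerank of w.wh.gov, read as 0.7 so that it passes the filter as in the walkthrough. The variable output_rel is named that way only to keep it distinct from the Pig alias.

```python
from collections import defaultdict
from statistics import mean

urls = [
    ("z.cnn.com", ".com", 0.9),
    ("y.yale.edu", ".edu", 0.5),
    ("w.uc.edu", ".edu", 0.1),    # assumed value; it only needs to fail the filter
    ("x.nyt.com", ".com", 0.8),
    ("y.ut.edu", ".edu", 0.6),
    ("w.wh.gov", ".gov", 0.7),    # assumed reading of the slide's value
]

# good_urls = FILTER urls BY pagerank > 0.2;
good_urls = [u for u in urls if u[2] > 0.2]

# groups = GROUP good_urls BY category;
groups = defaultdict(list)
for u in good_urls:
    groups[u[1]].append(u)

# big_groups = FILTER groups BY COUNT(good_urls) > 1;
big_groups = {cat: bag for cat, bag in groups.items() if len(bag) > 1}

# output = FOREACH big_groups GENERATE category, AVG(good_urls.pagerank);
output_rel = {cat: mean(u[2] for u in bag) for cat, bag in big_groups.items()}
print(output_rel)
```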

Features

  • Similar to specifying a query execution plan (i.e., a dataflow graph), thereby making it easier for programmers to understand and control how their data processing task is executed.
  • Support for a flexible, fully nested data model
  • Extensive support for user-defined functions
  • Ability to operate over plain input files without any schema information.
  • Novel debugging environment useful when dealing with enormous data sets.


Execution Control: Good or Bad?

  • Example:
    spam_urls = FILTER urls BY isSpam(url);
    culprit_urls = FILTER spam_urls BY pagerank > 0.8;
  • Should system re-order filters?
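One way to see what is at stake in re-ordering the filters is to count how often the UDF runs under each order. In this sketch, is_spam is a hypothetical stand-in for an expensive isSpam UDF and the rows are invented; both orders return the same rows, but the cheap pagerank test, applied first, spares some UDF calls.

```python
calls = {"is_spam": 0}

def is_spam(url):
    """Hypothetical expensive UDF; here it just flags urls containing 'spam'."""
    calls["is_spam"] += 1
    return "spam" in url

urls = [("spam-a.com", 0.9), ("b.com", 0.95), ("spam-c.com", 0.3), ("d.com", 0.1)]

def as_written(rows):
    """Filters in the order of the Pig Latin program: isSpam first, then pagerank."""
    spam_urls = [r for r in rows if is_spam(r[0])]
    return [r for r in spam_urls if r[1] > 0.8]

def reordered(rows):
    """Filters swapped: the cheap pagerank test runs before the UDF."""
    high = [r for r in rows if r[1] > 0.8]
    return [r for r in high if is_spam(r[0])]

calls["is_spam"] = 0
first = as_written(urls)
calls_written = calls["is_spam"]

calls["is_spam"] = 0
second = reordered(urls)
calls_reordered = calls["is_spam"]
print(first == second, calls_written, calls_reordered)
```

The swap is only safe if the UDF has no side effects that the programmer relies on, which is part of why the slide poses it as a question.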

User Defined Functions

  • Example
    • groups = GROUP urls BY category;
    • output = FOREACH groups GENERATE category, top10(urls);
  • UDF top10 can return scalar or set
  • Should the argument be groups.url?

groups:
    .gov   {(x.fbi.gov, .gov, 0.7) ...}
    .edu   {(y.yale.edu, .edu, 0.5) ...}
    .com   {(z.cnn.com, .com, 0.9) ...}

output:
    .gov   {(fbi.gov) (cia.gov) ...}
    .edu   {(yale.edu) ...}
    .com   {(cnn.com) (ibm.com) ...}
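A set-valued UDF applied per group can be sketched as follows. The slides do not say how top10 ranks urls, so ranking by pagerank is an assumption, as are the sample rows and the parameter n, which is set to 2 to keep the example small.

```python
from collections import defaultdict

def top_n(bag, n=10):
    """Hypothetical UDF: the n urls with the highest pagerank, as a bag of 1-tuples."""
    ranked = sorted(bag, key=lambda t: t[2], reverse=True)
    return [(url,) for url, _, _ in ranked[:n]]

urls = [
    ("z.cnn.com", ".com", 0.9),
    ("a.ibm.com", ".com", 0.4),
    ("x.nyt.com", ".com", 0.8),
    ("y.yale.edu", ".edu", 0.5),
    ("x.fbi.gov", ".gov", 0.7),
]

# groups = GROUP urls BY category;
groups = defaultdict(list)
for row in urls:
    groups[row[1]].append(row)

# FOREACH groups GENERATE category, top_n(urls);
top_urls = {cat: top_n(bag, n=2) for cat, bag in groups.items()}
print(top_urls)
```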

Flattening Example (Fill In)

Y = FOREACH X GENERATE A, FLATTEN(B), C

Y:
    a1   b1   b2         {(c1) (c2)}
    a1   b3   b4         {(c1) (c2)}
    a1   b5              {(c1) (c2)}
    a2   b6   (b7, b8)   {(c3) (c4)}

Z = FOREACH Y GENERATE A, B, FLATTEN(C)

Z' = FOREACH X GENERATE A, FLATTEN(B), FLATTEN(C)

Is Z = Z'?

Flattening Example

X: (A, B, C)
    a1   {(b1, b2) (b3, b4) (b5)}   {(c1) (c2)}
    a2   {(b6, (b7, b8))}           {(c3) (c4)}

Y = FOREACH X GENERATE A, FLATTEN(B), C

Y:
    a1   b1   b2         {(c1) (c2)}
    a1   b3   b4         {(c1) (c2)}
    a1   b5              {(c1) (c2)}
    a2   b6   (b7, b8)   {(c3) (c4)}

  • Flatten is not recursive
  • Note first tuple is (a1, b1, b2, {(c1)(c2)})
  • Note attribute naming gets complicated. For example, $2 for first tuple is b2; for third tuple it is {(c1)(c2)}.

Flattening Example

Y = FOREACH X GENERATE A, FLATTEN(B), C

Y:
    a1   b1   b2         {(c1) (c2)}
    a1   b3   b4         {(c1) (c2)}
    a1   b5              {(c1) (c2)}
    a2   b6   (b7, b8)   {(c3) (c4)}

Z = FOREACH Y GENERATE A, B, FLATTEN(C)

Z' = FOREACH X GENERATE A, FLATTEN(B), FLATTEN(C)

Note that Z = Z', where both are:
    a1   b1   b2         c1
    a1   b1   b2         c2
    a1   b3   b4         c1
    a1   b3   b4         c2
    a1   b5              c1
    a1   b5              c2
    a2   b6   (b7, b8)   c3
    a2   b6   (b7, b8)   c4
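The flattening examples can be checked with a small model in which a bag is a Python list of tuples. The function flatten below is an illustrative stand-in for FOREACH ... GENERATE with FLATTEN, not Pig's implementation; like the slides, it flattens one level only and takes the cross product when several fields are flattened.

```python
from itertools import product

def flatten(rows, positions):
    """Emit every field of each row, applying FLATTEN to the fields at `positions`."""
    out = []
    for row in rows:
        flat = {p % len(row) for p in positions}   # allow negative positions such as -1
        # A flattened bag offers one choice per tuple it holds; any other field offers itself.
        choices = [row[i] if i in flat else [(row[i],)] for i in range(len(row))]
        for combo in product(*choices):
            out.append(tuple(value for part in combo for value in part))
    return out

B1 = [("b1", "b2"), ("b3", "b4"), ("b5",)]
B2 = [("b6", ("b7", "b8"))]
X = [("a1", B1, [("c1",), ("c2",)]),
     ("a2", B2, [("c3",), ("c4",)])]

Y = flatten(X, {1})            # A, FLATTEN(B), C
Z = flatten(Y, {-1})           # flatten C, which is the last field of every Y tuple
Z_prime = flatten(X, {1, 2})   # A, FLATTEN(B), FLATTEN(C)
print(Z == Z_prime)
```

The sketch addresses C as "the last field" because, as noted on the slide, its position differs between tuples of Y once B has been flattened.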

Filter

  • real_queries = FILTER queries BY userId neq 'bot';
  • real_queries = FILTER queries BY NOT isBot(userId);
    • isBot is a UDF

Co-Group

  • Two data sets for example:
    • results: (queryString, url, position)
    • revenue: (queryString, adSlot, amount)
  • grouped_data = COGROUP results BY queryString, revenue BY queryString;
  • url_revenues = FOREACH grouped_data GENERATE FLATTEN(distributeRevenue(results, revenue));
  • Co-Group more flexible than SQL JOIN
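The sense in which co-group is more flexible than a join can be sketched in Python: co-grouping keeps the two bags of each key separate, and an equi-join is just one thing that can be computed from them (the cross product within each group). The sample rows are hypothetical, and the two functions are illustrations rather than Pig's operators.

```python
from collections import defaultdict
from itertools import product

def cogroup(results, revenue):
    """COGROUP results BY queryString, revenue BY queryString (field 0 of both)."""
    grouped = defaultdict(lambda: ([], []))
    for row in results:
        grouped[row[0]][0].append(row)
    for row in revenue:
        grouped[row[0]][1].append(row)
    return {k: bags for k, bags in sorted(grouped.items())}

def join(grouped):
    """An equi-join is the cross product of the two bags of every group."""
    return [r + v for res_bag, rev_bag in grouped.values()
            for r, v in product(res_bag, rev_bag)]

# Hypothetical rows in the shape of the two data sets above.
results = [("lakers", "nba.com", 1), ("lakers", "espn.com", 2), ("kings", "nhl.com", 1)]
revenue = [("lakers", "top", 50), ("lakers", "side", 20), ("ipod", "top", 40)]

grouped_data = cogroup(results, revenue)
joined = join(grouped_data)
print(len(grouped_data), len(joined))
```

Keys that appear in only one input ("kings", "ipod") still have a group with one empty bag, so a UDF such as distributeRevenue can decide what to do with them; the join silently drops them.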

CoGroup vs Join


Group (Simple CoGroup)

  • grouped_revenue = GROUP revenue BY queryString;
  • query_revenues = FOREACH grouped_revenue GENERATE queryString, SUM(revenue.amount) AS totalRevenue;



CoGroup Example 1

X: (A, B, C)
    1   1   c1
    1   1   c2
    2   2   c3
    2   2   c4

Y: (A, B, D)
    1   1   d
    1   2   d
    2   1   d
    2   2   d

Z1 = GROUP X BY A

Z1: (A, X)
    1   {(1, 1, c1) (1, 1, c2)}
    2   {(2, 2, c3) (2, 2, c4)}


CoGroup Example 2

X: (A, B, C)
    1   1   c1
    1   1   c2
    2   2   c3
    2   2   c4

Y: (A, B, D)
    1   1   d
    1   2   d
    2   1   d
    2   2   d

Z2 = GROUP X BY (A, B)

Syntax not in paper but being added

Z2: (A/B?, X)
    (1, 1)   {(1, 1, c1) (1, 1, c2)}
    (2, 2)   {(2, 2, c3) (2, 2, c4)}

CoGroup Example 3

X: (A, B, C)
    1   1   c1
    1   1   c2
    2   2   c3
    2   2   c4

Y: (A, B, D)
    1   1   d
    1   2   d
    2   1   d
    2   2   d

Z3 = COGROUP X BY A, Y BY A

Z3: (A, X, Y)
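The three CoGroup examples can be reproduced with one small helper that treats GROUP as a COGROUP over a single input. The helper is a sketch written for these notes, and the D values d1 to d4 are hypothetical labels, since the slide text does not preserve them.

```python
from collections import defaultdict

def cogroup(*inputs):
    """Each input is (rows, key_fn); returns {key: tuple holding one bag per input}."""
    grouped = defaultdict(lambda: tuple([] for _ in inputs))
    for i, (rows, key_fn) in enumerate(inputs):
        for row in rows:
            grouped[key_fn(row)][i].append(row)
    return dict(grouped)

X = [(1, 1, "c1"), (1, 1, "c2"), (2, 2, "c3"), (2, 2, "c4")]
# The D values are hypothetical labels for illustration.
Y = [(1, 1, "d1"), (1, 2, "d2"), (2, 1, "d3"), (2, 2, "d4")]

Z1 = cogroup((X, lambda r: r[0]))                        # GROUP X BY A
Z2 = cogroup((X, lambda r: (r[0], r[1])))                # GROUP X BY (A, B)
Z3 = cogroup((X, lambda r: r[0]), (Y, lambda r: r[0]))   # COGROUP X BY A, Y BY A
print(Z3)
```

Under these assumptions, each key of Z3 carries two bags: the X tuples and the Y tuples that share that value of A.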

MapReduce in Pig Latin

  • map_result = FOREACH input GENERATE FLATTEN(map(*));
    • * means all attributes
  • key_groups = GROUP map_result BY $0;
    • $0: key is first attribute
  • output = FOREACH key_groups GENERATE reduce(*);

Store

  • To materialize result in a file:
  • STORE query_revenues INTO 'myoutput' USING myStore();
    • 'myoutput' is the output file
    • myStore() is a custom serializer

Hadoop

  • HDFS: Hadoop file system
  • How to use Hadoop, examples
  • Material covered by David...
