Map-Reduce and its Implementation: CS347 Notes, Slides of Distributed Database Management Systems

Dhirubhai Ambani Institute of Information and Communication Technology Distributed Database Management Systems

An in-depth explanation of map-reduce, its generalization, and the issues in its implementation. It includes examples of map-reduce usage: counting word occurrences and sorting records. It also discusses Hadoop and Pig, two open-source systems based on map-reduce, and Pig Latin, the query language used in Pig, and covers data models, user-defined functions, and specifying input data.

Typology: Slides

2011/2012

Uploaded on 07/16/2012

sambandam 🇮🇳


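Generalizing: Map-Reduce

[Figure: the Map phase of index construction. Pages stream in from disk (page 1: "rat dog", page 2: "dog cat", page 3: "rat dog") and pass through Loading, Tokenizing, and Sorting.
Tokenizing yields: (rat, 1) (dog, 1) (dog, 2) (cat, 2) (rat, 3) (dog, 3)
Sorting yields: (cat, 2) (dog, 1) (dog, 2) (dog, 3) (rat, 1) (rat, 3)
The sorted pairs are flushed to disk as intermediate runs.]

CS347 Notes 09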

Generalizing: Map-Reduce

[Figure: the Reduce phase of index construction.
Intermediate runs: (cat, 2) (dog, 1) (dog, 2) (dog, 3) (rat, 1) (rat, 3) and (ant, 5) (cat, 4) (dog, 4) (dog, 5) (eel, 6)
Merge yields: (ant, 5) (cat, 2) (cat, 4) (dog, 1) (dog, 2) (dog, 3) (dog, 4) (dog, 5) (eel, 6) (rat, 1) (rat, 3)
Reduce yields the final index: (ant: 5) (cat: 2, 4) (dog: 1, 2, 3, 4, 5) (eel: 6) (rat: 1, 3)]
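The two figures above can be traced in a few lines of Python. This is only a sketch, assuming in-memory lists stand in for the on-disk runs; the function names and the page texts for pages 4 to 6 are hypothetical, chosen so that the second run matches the pairs shown in the figure.

```python
import heapq
from itertools import groupby

def map_pages(pages):
    """Tokenize (page_id, text) pairs and return one sorted run of (word, page_id)."""
    pairs = [(word, page_id) for page_id, text in pages for word in text.split()]
    return sorted(pairs)

def merge_and_reduce(runs):
    """Merge the sorted runs, then collapse each word's pairs into one posting list."""
    merged = heapq.merge(*runs)
    return {word: [page for _, page in group]
            for word, group in groupby(merged, key=lambda pair: pair[0])}

# Run 1 uses the pages from the Map figure; run 2's page texts are assumed.
run1 = map_pages([(1, "rat dog"), (2, "dog cat"), (3, "rat dog")])
run2 = map_pages([(4, "cat dog"), (5, "dog ant"), (6, "eel")])

index = merge_and_reduce([run1, run2])
print(index)
```

Because every run is already sorted, the merge step only needs a sequential pass over the runs, which is why sorting is done before flushing.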

Map Reduce

Input: $R = \{r_1, r_2, \ldots, r_n\}$, functions $M$, $R$
- $M(r_i) \rightarrow \{[k_1, v_1], [k_2, v_2], \ldots\}$
- $R(k_i, \mathit{valSet}) \rightarrow [k_i, \mathit{valSet}']$
Let $S = \{[k, v] \mid [k, v] \in M(r) \text{ for some } r \in R\}$
Let $K = \{k \mid [k, v] \in S \text{ for any } v\}$
Let $G(k) = \{v \mid [k, v] \in S\}$
Output $= \{[k, T] \mid k \in K,\ T = R(k, G(k))\}$

S is a bag. G(k) is a bag.
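The definition above translates almost line for line into a sequential driver. The following is a minimal sketch, not a distributed implementation; the example functions `M` and `R` are hypothetical and were picked only to show that duplicates survive in the bags.

```python
from collections import defaultdict

def map_reduce(records, M, R):
    """Apply M to every record, group the values by key, then apply R per key."""
    groups = defaultdict(list)      # G(k), kept as a list because it is a bag
    for r in records:
        for k, v in M(r):           # every emitted [k, v] belongs to the bag S
            groups[k].append(v)
    # Output: one entry [k, T] per key k in K, with T = R(k, G(k))
    return {k: R(k, values) for k, values in groups.items()}

# Hypothetical example: total length of the words starting with each letter.
def M(word):
    return [(word[0], len(word))]

def R(key, values):
    return sum(values)

result = map_reduce(["cat", "cow", "dog", "cat"], M, R)
print(result)  # {'c': 9, 'd': 3}
```

The record "cat" appears twice and contributes twice to the total for "c", which is the practical meaning of S and G(k) being bags rather than sets.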

References

MapReduce: Simplified Data Processing on Large Clusters, Jeffrey Dean and Sanjay Ghemawat, available at http://labs.google.com/papers/mapreduce-osdi04.pdf
Pig Latin: A Not-So-Foreign Language for Data Processing, Christopher Olston, Benjamin Reed, Utkarsh Srivastava, Ravi Kumar, Andrew Tomkins, available at http://wiki.apache.org/pig/

Example: Counting Word Occurrences

map(String doc, String value):
    // doc is the document name
    // value is the document content
    for each word w in value:
        EmitIntermediate(w, "1");

Example:
- map(doc, "cat dog cat bat dog") emits [cat 1], [dog 1], [cat 1], [bat 1], [dog 1]

Why does map have 2 parameters?

Example: Counting Word Occurrences

reduce(String key, Iterator values):
    // key is a word
    // values is a list of counts
    int result = 0;
    for each v in values:
        result += ParseInt(v);
    Emit(AsString(result));

Example:
- reduce("dog", "1 1 1 1") emits "4"

Should it emit ("dog", 4)?
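For readers who want to run the word-count example, here is a rough Python rendering of the two functions. It is a sketch under assumptions: `word_count` is a hypothetical sequential driver, and `reduce_fn` returns the (word, count) pair, which is one possible answer to the question raised on the slide rather than the behaviour of the original pseudocode.

```python
from collections import defaultdict

def map_fn(doc, value):
    """doc is the document name (unused here); value is the document content."""
    for w in value.split():
        yield (w, "1")

def reduce_fn(key, values):
    """Sum the string counts and return the word together with its total."""
    result = 0
    for v in values:
        result += int(v)
    return (key, result)

def word_count(docs):
    """Hypothetical driver: run map on every document, group by word, then reduce."""
    grouped = defaultdict(list)
    for name, content in docs.items():
        for k, v in map_fn(name, content):
            grouped[k].append(v)
    return dict(reduce_fn(k, vs) for k, vs in sorted(grouped.items()))

counts = word_count({"doc": "cat dog cat bat dog"})
print(counts)  # {'bat': 1, 'cat': 2, 'dog': 2}
```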

Google MR Overview


Implementation Issues

Combine function
File system
Partition of input, keys
Failures
Backup tasks
Ordering of results


Combine Function

Combine is like a local reduce applied before distribution:

[Figure: without combine, a map worker sends [cat 1], [cat 1], [cat 1], ... to one reduce worker and [dog 1], [dog 1], ... to another. With combine, it sends [cat 3], ... and [dog 2], ... instead.]
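The effect of a combine function can be seen by counting how many pairs leave the map workers. The sketch below assumes two hypothetical input splits and simulates the workers with plain lists; all function names are illustrative.

```python
from collections import Counter

def map_words(text):
    """Emit one (word, 1) pair per word, as in the word-count map."""
    return [(w, 1) for w in text.split()]

def combine(pairs):
    """Local reduce on one map worker: sum the counts per word before sending."""
    local = Counter()
    for word, count in pairs:
        local[word] += count
    return sorted(local.items())

def reduce_counts(all_pairs):
    """Global reduce over the pairs received from every map worker."""
    total = Counter()
    for word, count in all_pairs:
        total[word] += count
    return dict(total)

worker_inputs = ["cat cat dog cat", "dog cat dog"]   # hypothetical splits
raw = [map_words(t) for t in worker_inputs]
combined = [combine(p) for p in raw]

sent_without = sum(len(p) for p in raw)        # pairs shipped without combine
sent_with = sum(len(p) for p in combined)      # pairs shipped with combine
without = reduce_counts(pair for p in raw for pair in p)
with_combine = reduce_counts(pair for p in combined for pair in p)
print(sent_without, sent_with, without == with_combine)
```

Combining is safe here because addition is associative and commutative, so summing partial sums gives the same totals as summing the individual ones.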

Distributed File System

worker must be able to access any part of input file

reduce worker must be able to access local disks on map workers

any worker must be able to write its part of answer; answer is left as distributed file

all data transfers are through distributed file system


Partition of input, keys

How many workers, and how many partitions (splits) of the input file?
- How many splits?
- How many workers?
- Best to have many splits per worker: this improves load balance, and if a worker fails, it is easier to spread its tasks.

Should workers be assigned to splits "near" them?

Failures

The distributed implementation should produce the same output as would have been produced by a non-faulty sequential execution of the program.
General strategy: the master detects worker failures and has the work redone by another worker.

[Figure: the master asks the worker processing split j "ok?"; if the worker has failed, the master has split j redone.]
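The general strategy can be mimicked with a small simulation. This is a sketch only: failures are injected through a hypothetical `fails` set of (worker, split index) pairs, and the round-robin assignment is an assumption made for illustration, not the scheduling policy of any real system.

```python
def run_with_retries(splits, workers, task, fails):
    """Master loop: assign each split to a worker; on failure, redo it elsewhere.

    fails is a set of (worker, split_index) pairs that simulate crashes.
    """
    results = {}
    log = []
    for j, split in enumerate(splits):
        for attempt in range(len(workers)):
            worker = workers[(j + attempt) % len(workers)]
            if (worker, j) in fails:        # the master's "ok?" check fails
                log.append(f"redo {j}")
                continue
            results[j] = task(split)
            break
        else:
            raise RuntimeError(f"split {j} failed on every worker")
    return [results[j] for j in range(len(splits))], log

def count_words(text):
    return len(text.split())

splits = ["rat dog", "dog cat", "rat dog"]
sequential = [count_words(s) for s in splits]     # non-faulty sequential run
output, log = run_with_retries(splits, ["w0", "w1"], count_words, {("w1", 1)})
print(output == sequential, log)
```

Re-execution gives the same output as the sequential run here because the task is deterministic; that property is what makes simply redoing a split a valid recovery step.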

Questions

Can MR be made more “declarative”?
How can we perform joins?
How can we perform approximate grouping?
- example: for all keys that are similar, reduce all values for those keys

Additional Topics

Hadoop: open-source Map-Reduce system
Pig: Yahoo system that builds on MR but is more declarative


Pig & Pig Latin

A layer on top of map-reduce (Hadoop)
- Pig is the system
- Pig Latin is the query language
Pig Latin is a hybrid between:
- high-level declarative query language in the spirit of SQL
- low-level, procedural programming à la map-reduce.

Example

Table urls: (url, category, pagerank)
Find, for each sufficiently large category, the average pagerank of high-pagerank urls in that category. In SQL:
SELECT category, AVG(pagerank)
FROM urls
WHERE pagerank > 0.2
GROUP BY category
HAVING COUNT(*) > 10^6

Example in Pig Latin

In SQL:
SELECT category, AVG(pagerank)
FROM urls
WHERE pagerank > 0.2
GROUP BY category
HAVING COUNT(*) > 10^6

In Pig Latin:
good_urls = FILTER urls BY pagerank > 0.2;
groups = GROUP good_urls BY category;
big_groups = FILTER groups BY COUNT(good_urls) > 10^6;
output = FOREACH big_groups GENERATE category, AVG(good_urls.pagerank);

urls: (url, category, pagerank)
z.cnn.com    .com   0.9
y.yale.edu   .edu   0.5
w.uc.edu     .edu   0.
x.nyt.com    .com   0.8
y.ut.edu     .edu   0.6
w.wh.gov     .gov   0.7

good_urls = FILTER urls BY pagerank > 0.2;

good_urls: (url, category, pagerank)
z.cnn.com    .com   0.9
y.yale.edu   .edu   0.5
x.nyt.com    .com   0.8
y.ut.edu     .edu   0.6
w.wh.gov     .gov   0.7

groups = GROUP good_urls BY category;

good_urls: (url, category, pagerank)
z.cnn.com    .com   0.9
y.yale.edu   .edu   0.5
x.nyt.com    .com   0.8
y.ut.edu     .edu   0.6
w.wh.gov     .gov   0.7

groups: (category, good_urls)
.com   {(z.cnn.com, .com, 0.9) (x.nyt.com, .com, 0.8)}
.edu   {(y.yale.edu, .edu, 0.5) (y.ut.edu, .edu, 0.6)}
.gov   {(w.wh.gov, .gov, 0.7)}

big_groups = FILTER groups BY COUNT(good_urls) > 1;

groups: (category, good_urls)
.com   {(z.cnn.com, .com, 0.9) (x.nyt.com, .com, 0.8)}
.edu   {(y.yale.edu, .edu, 0.5) (y.ut.edu, .edu, 0.6)}
.gov   {(w.wh.gov, .gov, 0.7)}

big_groups: (category, good_urls)
.com   {(z.cnn.com, .com, 0.9) (x.nyt.com, .com, 0.8)}
.edu   {(y.yale.edu, .edu, 0.5) (y.ut.edu, .edu, 0.6)}

output = FOREACH big_groups GENERATE category, AVG(good_urls.pagerank);

big_groups: (category, good_urls)
.com   {(z.cnn.com, .com, 0.9) (x.nyt.com, .com, 0.8)}
.edu   {(y.yale.edu, .edu, 0.5) (y.ut.edu, .edu, 0.6)}

output: (category, AVG(good_urls.pagerank))
.com   0.85
.edu   0.55
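The four Pig Latin statements walked through above can be mirrored step by step in Python, which may help when checking the intermediate tables by hand. This is a sketch, not how Pig executes the query. Two values are assumptions: the pagerank of w.uc.edu is not recoverable from the slides (0.1 is used so that the filter drops it), and w.wh.gov is taken as 0.7 so that it passes the filter.

```python
from collections import defaultdict

urls = [  # (url, category, pagerank); w.uc.edu's value is an assumption
    ("z.cnn.com", ".com", 0.9), ("y.yale.edu", ".edu", 0.5),
    ("w.uc.edu", ".edu", 0.1), ("x.nyt.com", ".com", 0.8),
    ("y.ut.edu", ".edu", 0.6), ("w.wh.gov", ".gov", 0.7),
]

# good_urls = FILTER urls BY pagerank > 0.2;
good_urls = [u for u in urls if u[2] > 0.2]

# groups = GROUP good_urls BY category;
groups = defaultdict(list)
for u in good_urls:
    groups[u[1]].append(u)

# big_groups = FILTER groups BY COUNT(good_urls) > 1;
big_groups = {category: bag for category, bag in groups.items() if len(bag) > 1}

# output = FOREACH big_groups GENERATE category, AVG(good_urls.pagerank);
output = {category: sum(u[2] for u in bag) / len(bag)
          for category, bag in big_groups.items()}
print(output)
```

Each intermediate name holds a complete relation, so any step can be printed and compared with the corresponding table.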

Features

Similar to specifying a query execution plan (i.e., a dataflow graph), thereby making it easier for programmers to understand and control how their data processing task is executed.
Support for a flexible, fully nested data model
Extensive support for user-defined functions
Ability to operate over plain input files without any schema information.
Novel debugging environment useful when dealing with enormous data sets.


Execution Control: Good or Bad?

Example:

spam_urls = FILTER urls BY isSpam(url);
culprit_urls = FILTER spam_urls BY pagerank > 0.8;

Should the system re-order the filters?
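To make the trade-off concrete, the sketch below counts how often an expensive filter is invoked under each ordering. Everything in it is hypothetical: `is_spam` is a toy stand-in for the isSpam UDF, and the four urls are invented.

```python
calls = {"is_spam": 0}

def is_spam(url):
    """Hypothetical stand-in for an expensive spam-detection UDF."""
    calls["is_spam"] += 1
    return "spam" in url

urls = [("a.spam.com", 0.9), ("b.com", 0.95), ("c.spam.org", 0.3), ("d.org", 0.1)]

# Order as written: isSpam runs on every url, then the pagerank test.
spam_urls = [u for u in urls if is_spam(u[0])]
culprit_urls = [u for u in spam_urls if u[1] > 0.8]
calls_as_written = calls["is_spam"]

# Re-ordered: the cheap pagerank test runs first, so isSpam sees fewer urls.
calls["is_spam"] = 0
high_rank = [u for u in urls if u[1] > 0.8]
culprit_reordered = [u for u in high_rank if is_spam(u[0])]
calls_reordered = calls["is_spam"]

print(culprit_urls == culprit_reordered, calls_as_written, calls_reordered)
```

Both orders return the same tuples, but the re-ordered plan calls the UDF on fewer inputs. Whether the system or the programmer should make that choice is the question the slide raises.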

User Defined Functions

Example
- groups = GROUP urls BY category;
- output = FOREACH groups GENERATE category, top10(urls);

Should the argument urls be groups.url?

UDF top10 can return a scalar or a set.

groups:
.gov   {(x.fbi.gov, .gov, 0.7) ...}
.edu   {(y.yale.edu, .edu, 0.5) ...}
.com   {(z.cnn.com, .com, 0.9) ...}

output:
.gov   {(fbi.gov) (cia.gov) ...}
.edu   {(yale.edu) ...}
.com   {(cnn.com) (ibm.com) ...}

Flattening Example (Fill In)

Y:
a1   b1   b2         {(c1) (c2)}
a1   b3   b4         {(c1) (c2)}
a1   b5              {(c1) (c2)}
a2   b6   (b7, b8)   {(c3) (c4)}

Y = FOREACH X GENERATE A, FLATTEN(B), C
Z = FOREACH Y GENERATE A, B, FLATTEN(C)

Is Z = Z', where
Z' = FOREACH X GENERATE A, FLATTEN(B), FLATTEN(C)?

Flattening Example

X: (A, B, C)
a1   {(b1, b2) (b3, b4) (b5)}   {(c1) (c2)}
a2   {(b6, (b7, b8))}           {(c3) (c4)}

Y = FOREACH X GENERATE A, FLATTEN(B), C

Y:
a1   b1   b2         {(c1) (c2)}
a1   b3   b4         {(c1) (c2)}
a1   b5              {(c1) (c2)}
a2   b6   (b7, b8)   {(c3) (c4)}

- Flatten is not recursive.
- Note the first tuple is (a1, b1, b2, {(c1)(c2)}).
- Note attribute naming gets complicated. For example, $2 for the first tuple is b2; for the third tuple it is {(c1)(c2)}.

Flattening Example

Y:
a1   b1   b2         {(c1) (c2)}
a1   b3   b4         {(c1) (c2)}
a1   b5              {(c1) (c2)}
a2   b6   (b7, b8)   {(c3) (c4)}

Y = FOREACH X GENERATE A, FLATTEN(B), C
Z = FOREACH Y GENERATE A, B, FLATTEN(C)

Note that Z = Z', where
Z' = FOREACH X GENERATE A, FLATTEN(B), FLATTEN(C)

Z:
a1   b1   b2         c1
a1   b1   b2         c2
a1   b3   b4         c1
a1   b3   b4         c2
a1   b5              c1
a1   b5              c2
a2   b6   (b7, b8)   c3
a2   b6   (b7, b8)   c4
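The flattening examples can be checked mechanically. In the sketch below, bags are modelled as Python lists of tuples and rows as tuples; this representation and the `flatten` helper are assumptions made for illustration, not Pig's internals.

```python
def flatten(rows, positions):
    """Un-nest the bags at the given positions of each row (one level only)."""
    out = []
    for row in rows:
        partial = [()]
        for i, field in enumerate(row):
            if i in positions:
                # Cross product with each tuple of the bag; nested tuples stay intact.
                partial = [p + t for p in partial for t in field]
            else:
                partial = [p + (field,) for p in partial]
        out.extend(partial)
    return out

X = [
    ("a1", [("b1", "b2"), ("b3", "b4"), ("b5",)], [("c1",), ("c2",)]),
    ("a2", [("b6", ("b7", "b8"))], [("c3",), ("c4",)]),
]

Y = flatten(X, {1})                       # A, FLATTEN(B), C
# C is the last field of each Y row, but its index differs from row to row.
Z = [r for row in Y for r in flatten([row], {len(row) - 1})]
Z_prime = flatten(X, {1, 2})              # A, FLATTEN(B), FLATTEN(C)

print(Z == Z_prime, len(Z))
```

The inner tuple (b7, b8) survives as a single field in every result, matching the remark that flatten is not recursive, and the need to locate C by `len(row) - 1` reflects the attribute-naming complication noted for Y.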

Filter

real_queries = FILTER queries BY userId neq `bot';
real_queries = FILTER queries BY NOT isBot(userId);

isBot is a user-defined function (UDF).

Co-Group

Two data sets for example:
- results: (queryString, url, position)
- revenue: (queryString, adSlot, amount)
grouped_data = COGROUP results BY queryString, revenue BY queryString;
url_revenues = FOREACH grouped_data GENERATE FLATTEN(distributeRevenue(results, revenue));

Co-Group is more flexible than SQL JOIN

CoGroup vs Join
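One way to see the relationship is to build an equi-join out of a co-group. The sketch below assumes in-memory lists and invented rows for the results and revenue data sets; `cogroup` and `join` are illustrative helpers, not Pig functions.

```python
from collections import defaultdict

def cogroup(left, right, key_left, key_right):
    """Return {key: (bag of left tuples, bag of right tuples)}."""
    groups = defaultdict(lambda: ([], []))
    for t in left:
        groups[key_left(t)][0].append(t)
    for t in right:
        groups[key_right(t)][1].append(t)
    return dict(groups)

def join(left, right, key_left, key_right):
    """Equi-join expressed as a co-group followed by a cross product per key."""
    grouped = cogroup(left, right, key_left, key_right)
    return [a + b for bag_a, bag_b in grouped.values() for a in bag_a for b in bag_b]

def query_string(t):
    return t[0]

# Hypothetical rows: (queryString, url, position) and (queryString, adSlot, amount).
results = [("lakers", "nba.com", 1), ("lakers", "espn.com", 2), ("kings", "nhl.com", 1)]
revenue = [("lakers", "top", 50), ("lakers", "side", 20), ("kings", "top", 30)]

grouped_data = cogroup(results, revenue, query_string, query_string)
joined = join(results, revenue, query_string, query_string)
print(len(grouped_data), len(joined))
```

The co-group keeps the two bags separate for each key, so a UDF such as distributeRevenue can process them however it likes; the join is only the special case that takes their cross product.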


Group (Simple CoGroup)

grouped_revenue = GROUP revenue BY queryString;
query_revenues = FOREACH grouped_revenue GENERATE queryString, SUM(revenue.amount) AS totalRevenue;


CoGroup Example 1

X: (A, B, C)
1   1   c1
1   1   c2
2   2   c3
2   2   c4

Y: (A, B, D)
1   1   d
1   2   d
2   1   d
2   2   d

Z1 = GROUP X BY A

Z1: (A, X)
1   {(1, 1, c1) (1, 1, c2)}
2   {(2, 2, c3) (2, 2, c4)}

CoGroup Example 2

X: (A, B, C)
1   1   c1
1   1   c2
2   2   c3
2   2   c4

Y: (A, B, D)
1   1   d
1   2   d
2   1   d
2   2   d

Z2 = GROUP X BY (A, B)
(Syntax not in paper but being added.)

Z2: (A/B?, X)
(1, 1)   {(1, 1, c1) (1, 1, c2)}
(2, 2)   {(2, 2, c3) (2, 2, c4)}

CoGroup Example 3

X: (A, B, C)
1   1   c1
1   1   c2
2   2   c3
2   2   c4

Y: (A, B, D)
1   1   d
1   2   d
2   1   d
2   2   d

Z3 = COGROUP X BY A, Y BY A

Z3: (A, X, Y)

MapReduce in Pig Latin

map_result = FOREACH input GENERATE FLATTEN(map(*));
key_groups = GROUP map_result BY $0;
output = FOREACH key_groups GENERATE reduce(*);

Here * denotes all attributes, and $0 (the first attribute) is the key.

Store

To materialize result in a file:
STORE query_revenues INTO `myoutput' USING myStore();

Here `myoutput' is the output file and myStore() is a custom serializer.

Hadoop

HDFS: Hadoop file system
How to use Hadoop, examples
Material covered by David...
