MapReduce: Simplified Data Processing on Large Clusters | CS 5410, Study notes of Computer Science

Cornell University Computer Science

Material Type: Notes; Class: Intermediate Computer Systems; Subject: Computer Science; University: Cornell University; Term: Spring 2008;

Typology: Study notes

Pre 2010

Uploaded on 08/30/2009

koofers-user-2qe 🇺🇸

Theseareslideswithahistory.Ifoundthemonthe

web...TheyareapparentlybasedonDanWeld’sclassat

U.Washington,(whointurnbasedhisslidesonthose



byJeffDean,SanjayGhemawat,Google,Inc.)


Motivation: Large-Scale Data Processing
- Want to use 1000s of CPUs
- But don't want hassle of managing things
- MapReduce provides
  - Automatic parallelization & distribution
  - Fault tolerance
  - I/O scheduling
  - Monitoring & status updates

Map in Lisp (Scheme)
- (map f list [list2 list3 …])
- (map square '(1 2 3 4))
  - (1 4 9 16)
- (reduce + '(1 4 9 16))
  - (+ 16 (+ 9 (+ 4 1)))
  - 30
- (reduce + (map square (map - l1 l2)))
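For readers more comfortable with Python than Scheme, the same map and reduce calls can be sketched as below. This is an illustration only, not part of the original slides: `square`, the sample lists `l1` and `l2`, and the element-wise subtraction in the last expression are assumptions chosen to make the example concrete.

```python
from functools import reduce
from operator import add, sub


def square(x):
    """Return x squared."""
    return x * x


# Like (map square '(1 2 3 4)): apply square to every element.
squares = list(map(square, [1, 2, 3, 4]))

# Like (reduce + '(1 4 9 16)): fold the list into one value with +.
total = reduce(add, squares)

# Composition: sum of squared element-wise differences of two lists.
# l1 and l2 are made-up sample data.
l1 = [5, 7, 9]
l2 = [1, 2, 3]
sum_sq_diff = reduce(add, map(square, map(sub, l1, l2)))

print(squares, total, sum_sq_diff)
```

Note that Python's `map` accepts several iterables, much like the optional extra lists in the Scheme form, and passes one element from each to the function.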

Map/Reduce à la Google
- map(key, val) is run on each item in set
  - emits new-key / new-val pairs
- reduce(key, vals) is run for each unique key emitted by map()
  - emits final output
- Often, one application will need to run map/reduce many times in succession

Count, Illustrated

map(key=url, val=contents):
    For each word w in contents, emit (w, "1")
reduce(key=word, values=uniq_counts):
    Sum all "1"s in values list
    Emit result "(word, sum)"

Input: "see bob throw", "see spot run"
Map output: (see, 1), (bob, 1), (throw, 1), (see, 1), (spot, 1), (run, 1)
Reduce output: (bob, 1), (run, 1), (see, 2), (spot, 1), (throw, 1)
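The word count above can be tried out in a single process. The sketch below is a minimal illustration, not Google's implementation: `run_mapreduce` is a hypothetical in-memory driver written for this example, and the document names `doc1` and `doc2` are made up.

```python
from collections import defaultdict


def map_count(url, contents):
    """Emit (word, "1") for every word in the document contents."""
    for w in contents.split():
        yield (w, "1")


def reduce_count(word, uniq_counts):
    """Sum all the "1"s collected for one word."""
    return (word, sum(int(c) for c in uniq_counts))


def run_mapreduce(inputs, mapper, reducer):
    """Hypothetical single-process driver: map, group by key, reduce."""
    groups = defaultdict(list)
    for key, val in inputs:
        for new_key, new_val in mapper(key, val):
            groups[new_key].append(new_val)
    # One reduce call per unique key emitted by the mapper.
    return [reducer(k, groups[k]) for k in sorted(groups)]


docs = [("doc1", "see bob throw"), ("doc2", "see spot run")]
counts = run_mapreduce(docs, map_count, reduce_count)
print(counts)
```

The grouping step in the middle is the part a real system performs across machines; here it is just a dictionary of lists.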

Grep
- Input consists of (url+offset, single line)
- map(key=url+offset, val=line):
  - If contents matches regexp, emit (line, "1")
- reduce(key=line, values=uniq_counts):
  - Don't do anything; just emit line
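A small sketch of the grep job, under stated assumptions: the regular expression, the file names, and the sample lines are invented for illustration, and `run_mapreduce` is a hypothetical single-process driver rather than a real library call.

```python
import re
from collections import defaultdict

PATTERN = re.compile(r"spot")  # hypothetical regexp to search for


def map_grep(key, line):
    """key is (url, offset); emit the line itself when it matches."""
    if PATTERN.search(line):
        yield (line, "1")


def reduce_grep(line, uniq_counts):
    """Identity reduce: ignore the counts and just emit the line."""
    return line


def run_mapreduce(inputs, mapper, reducer):
    """Hypothetical single-process driver: map, group by key, reduce."""
    groups = defaultdict(list)
    for key, val in inputs:
        for new_key, new_val in mapper(key, val):
            groups[new_key].append(new_val)
    return [reducer(k, groups[k]) for k in sorted(groups)]


lines = [
    (("a.txt", 0), "see bob throw"),
    (("a.txt", 14), "see spot run"),
    (("b.txt", 0), "spot sleeps"),
]
matches = run_mapreduce(lines, map_grep, reduce_grep)
print(matches)
```

Because the matching line is used as the key, identical lines from different files collapse into one output entry in this sketch.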

Compute an Inverted Index
(The index maps words to files.)
- Map
  - For each file f and each word w in the file, output (w, f) pairs
- Reduce
  - Merge, eliminating duplicates
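The inverted index can be sketched the same way. This example assumes the map step keys each pair by the word, so that the reduce step sees all files for one word together; the file names and contents are made up, and `run_mapreduce` is a hypothetical in-memory driver.

```python
from collections import defaultdict


def map_index(filename, contents):
    """Emit (word, filename) for each word occurrence in the file."""
    for w in contents.split():
        yield (w, filename)


def reduce_index(word, filenames):
    """Merge the file names for one word, eliminating duplicates."""
    return (word, sorted(set(filenames)))


def run_mapreduce(inputs, mapper, reducer):
    """Hypothetical single-process driver: map, group by key, reduce."""
    groups = defaultdict(list)
    for key, val in inputs:
        for new_key, new_val in mapper(key, val):
            groups[new_key].append(new_val)
    return [reducer(k, groups[k]) for k in sorted(groups)]


files = [("a.txt", "see bob see"), ("b.txt", "see spot")]
index = dict(run_mapreduce(files, map_index, reduce_index))
print(index)
```

The word "see" appears twice in `a.txt`, so the duplicate elimination in the reduce step is what keeps that file from being listed twice.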

Model is Widely Applicable
[Figure: MapReduce Programs In Google Source Tree]

Example uses:
- distributed grep
- distributed sort
- web link-graph reversal
- term-vector / host
- web access log stats
- inverted index construction
- document clustering
- machine learning
- statistical machine translation
- ...

Execution
- How is this distributed?
  1. Partition input key/value pairs into chunks, run map() tasks in parallel
  2. After all map()s are complete, consolidate all emitted values for each unique emitted key
  3. Now partition space of output map keys, and run reduce() in parallel
- If map() or reduce() fails, reexecute!

Job Processing
[Figure: one JobTracker and TaskTracker 0 through TaskTracker 5]

1. Client submits "grep" job, indicating code and input files
2. JobTracker breaks input file into k chunks (in this case 6). Assigns work to ttrackers.
3. After map(), tasktrackers exchange map-output to build reduce() keyspace
4. JobTracker breaks reduce() keyspace into chunks (in this case 6). Assigns work.
5. reduce() output may go to NDFS
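The chunking and assignment step can be illustrated with a short sketch. The helper names `split_into_chunks` and `assign` are hypothetical, the input lines are made up, and round-robin assignment is an assumption chosen for simplicity; the slides do not say how the JobTracker picks a tracker for each chunk.

```python
def split_into_chunks(records, k):
    """Break the input into k contiguous chunks of near-equal size."""
    size, extra = divmod(len(records), k)
    chunks, start = [], 0
    for i in range(k):
        # The first `extra` chunks take one additional record.
        end = start + size + (1 if i < extra else 0)
        chunks.append(records[start:end])
        start = end
    return chunks


def assign(chunks, trackers):
    """Round-robin assignment of chunks to trackers (an assumption)."""
    work = {t: [] for t in trackers}
    for i, chunk in enumerate(chunks):
        work[trackers[i % len(trackers)]].append(chunk)
    return work


lines = [f"line {i}" for i in range(20)]
trackers = [f"TaskTracker {i}" for i in range(6)]
chunks = split_into_chunks(lines, 6)
work = assign(chunks, trackers)

for tracker, assigned in work.items():
    print(tracker, [len(c) for c in assigned])
```

With 20 records and 6 chunks, the first two chunks hold 4 records and the rest hold 3, and each of the six trackers receives exactly one chunk.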

Parallel Execution

Task Granularity & Pipelining
- Fine granularity tasks: many more map tasks than machines
  - Minimizes time for fault recovery
  - Can pipeline shuffling with map execution
  - Better dynamic load balancing
- Often use 200,000 map & 5000 reduce tasks, running on 2000 machines
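The load-balancing benefit of fine-grained tasks can be seen in a toy simulation. Everything here is an assumption made for illustration: four machines, one of them running at half speed, 400 units of work, and a scheduler in which whichever machine becomes idle first pulls the next task.

```python
import heapq


def makespan(task_sizes, speeds):
    """Simulate idle machines pulling the next task; return finish time."""
    # Heap of (time the machine becomes free, machine index).
    free_at = [(0.0, m) for m in range(len(speeds))]
    heapq.heapify(free_at)
    finish = 0.0
    for size in task_sizes:
        t, m = heapq.heappop(free_at)   # earliest idle machine
        done = t + size / speeds[m]
        finish = max(finish, done)
        heapq.heappush(free_at, (done, m))
    return finish


speeds = [1.0, 1.0, 1.0, 0.5]   # hypothetical: one slow machine
total_work = 400.0

# Coarse: one big task per machine, so the slow machine holds everyone up.
coarse = makespan([total_work / 4] * 4, speeds)

# Fine: 100 small tasks per machine, so fast machines absorb more of them.
fine = makespan([total_work / 400] * 400, speeds)

print(coarse, fine)
```

In this setup the coarse split finishes at time 200, because the slow machine needs twice as long for its quarter of the work, while the fine split finishes at time 115, close to the ideal of about 114.3 for the combined speed of 3.5.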
