MapReduce: Simplified Data Processing on Large Clusters | CS 5410, Study notes of Computer Science

Material Type: Notes; Class: Intermediate Computer Systems; Subject: Computer Science; University: Cornell University; Term: Spring 2008;

Typology: Study notes

Pre 2010

Uploaded on 08/30/2009

These are slides with a history. I found them on the web... They are apparently based on Dan Weld's class at U. Washington (who in turn based his slides on those by Jeff Dean, Sanjay Ghemawat, Google, Inc.)


Motivation: Large-Scale Data Processing
• Want to use 1000s of CPUs
  • But don't want the hassle of managing things
• MapReduce provides
  • Automatic parallelization & distribution
  • Fault tolerance
  • I/O scheduling
  • Monitoring & status updates

Map in Lisp (Scheme)
• (map f list [list2 list3 …])
• (map square '(1 2 3 4))
  • (1 4 9 16)
• (reduce + '(1 4 9 16))
  • (+ 16 (+ 9 (+ 4 1)))
  • 30
• (reduce + (map square (map - l1 l2)))
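
For readers who do not use Scheme, the same expressions can be sketched in Python. This is only an illustration: the lists `l1` and `l2` are made-up sample data, and `functools.reduce` folds from the left, whereas the expansion on the slide nests from the right (the sum is the same either way because addition is associative).

```python
from functools import reduce
from operator import add, sub


def square(x):
    return x * x


# (map square '(1 2 3 4))
squares = list(map(square, [1, 2, 3, 4]))

# (reduce + '(1 4 9 16))
total = reduce(add, squares)

# (reduce + (map square (map - l1 l2))): sum of squared differences.
# l1 and l2 are hypothetical sample lists, not taken from the slides.
l1 = [5, 7, 9]
l2 = [1, 2, 3]
sum_sq_diff = reduce(add, map(square, map(sub, l1, l2)))

print(squares, total, sum_sq_diff)
```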

Map/Reduce à la Google
• map(key, val) is run on each item in set
  • emits new-key / new-val pairs
• reduce(key, vals) is run for each unique key emitted by map()
  • emits final output
• Often, one application will need to run map/reduce many times in succession

Count, Illustrated
map(key=url, val=contents):
  For each word w in contents, emit (w, "1")
reduce(key=word, values=uniq_counts):
  Sum all "1"s in values list
  Emit result "(word, sum)"

Input:          "see bob throw", "see spot run"
Map output:     (see, 1) (bob, 1) (throw, 1) (see, 1) (spot, 1) (run, 1)
Reduce output:  (bob, 1) (run, 1) (see, 2) (spot, 1) (throw, 1)
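
The word-count pseudocode can be tried out with a minimal single-process sketch in Python. The helper `run_mapreduce` and the document names `doc1` and `doc2` are assumptions made for illustration; a real MapReduce system distributes this work across machines rather than looping in one process.

```python
from collections import defaultdict


def map_count(url, contents):
    # For each word w in contents, emit (w, "1").
    for w in contents.split():
        yield (w, "1")


def reduce_count(word, values):
    # Sum all "1"s in the values list and emit (word, sum).
    return (word, sum(int(v) for v in values))


def run_mapreduce(inputs, map_fn, reduce_fn):
    # Hypothetical in-memory driver: group map output by key, then reduce.
    groups = defaultdict(list)
    for key, val in inputs:
        for k, v in map_fn(key, val):
            groups[k].append(v)
    return [reduce_fn(k, groups[k]) for k in sorted(groups)]


docs = [("doc1", "see bob throw"), ("doc2", "see spot run")]
counts = run_mapreduce(docs, map_count, reduce_count)
print(counts)
```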

Grep
• Input consists of (url+offset, single line)
• map(key=url+offset, val=line):
  • If contents matches regexp, emit (line, "1")
• reduce(key=line, values=uniq_counts):
  • Don't do anything; just emit line
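
A rough Python sketch of the grep job follows. The file name, offsets, sample lines, the pattern, and the in-memory driver `run_mapreduce` are all assumptions for illustration; only the shape of map() and reduce() comes from the slide.

```python
import re
from collections import defaultdict


def make_grep_map(pattern):
    regex = re.compile(pattern)

    def map_grep(key, line):
        # key is (url, offset); emit (line, "1") when the line matches.
        if regex.search(line):
            yield (line, "1")

    return map_grep


def reduce_grep(line, values):
    # Identity reduce: ignore the counts and just emit the line.
    return line


def run_mapreduce(inputs, map_fn, reduce_fn):
    # Hypothetical in-memory driver: group map output by key, then reduce.
    groups = defaultdict(list)
    for key, val in inputs:
        for k, v in map_fn(key, val):
            groups[k].append(v)
    return [reduce_fn(k, groups[k]) for k in sorted(groups)]


lines = [
    (("log.txt", 0), "see bob throw"),
    (("log.txt", 14), "see spot run"),
    (("log.txt", 27), "bob runs"),
]
matches = run_mapreduce(lines, make_grep_map(r"\bbob\b"), reduce_grep)
print(matches)
```

Because the matching line itself is the key, identical lines collapse into a single output record.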

Compute an Inverted Index
• Index maps words to files
• Map
  • For each file f and each word w in the file, output (w, f) pairs, keyed by the word
• Reduce
  • Merge, eliminating duplicates
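
One way to sketch the inverted index in Python is shown below. The word is used as the emitted key so that reduce sees every file containing that word; the file names, contents, and the driver `run_mapreduce` are made up for this example.

```python
from collections import defaultdict


def map_index(filename, contents):
    # Emit one (word, filename) pair per word occurrence.
    for w in contents.split():
        yield (w, filename)


def reduce_index(word, filenames):
    # Merge the file names for this word, eliminating duplicates.
    return (word, sorted(set(filenames)))


def run_mapreduce(inputs, map_fn, reduce_fn):
    # Hypothetical in-memory driver: group map output by key, then reduce.
    groups = defaultdict(list)
    for key, val in inputs:
        for k, v in map_fn(key, val):
            groups[k].append(v)
    return [reduce_fn(k, groups[k]) for k in sorted(groups)]


files = [("a.txt", "see bob throw see"), ("b.txt", "see spot run")]
index = dict(run_mapreduce(files, map_index, reduce_index))
print(index["see"])
```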

Model is Widely Applicable
[Figure: MapReduce Programs In Google Source Tree]
Example uses:
• distributed grep
• distributed sort
• web link-graph reversal
• term-vector / host
• web access log stats
• inverted index construction
• document clustering
• machine learning
• statistical machine translation
• ...

Execution
• How is this distributed?
  1. Partition input key/value pairs into chunks, run map() tasks in parallel
  2. After all map()s are complete, consolidate all emitted values for each unique emitted key
  3. Now partition space of output map keys, and run reduce() in parallel
• If map() or reduce() fails, reexecute!
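
The three steps and the re-execution rule can be imitated on one machine with a thread pool. This is a sketch under stated assumptions, not how Google's implementation works: `map_reduce`, `run_with_retry`, the chunking scheme, and the CRC-based key partitioning are all illustrative choices, and re-execution is only safe here because the tasks are deterministic and have no side effects.

```python
import zlib
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def map_fn(key, contents):
    return [(w, 1) for w in contents.split()]


def reduce_fn(key, values):
    return (key, sum(values))


def run_with_retry(task, arg, attempts=3):
    # If a task fails, simply re-execute it (up to `attempts` times).
    for attempt in range(attempts):
        try:
            return task(arg)
        except Exception:
            if attempt == attempts - 1:
                raise


def map_task(chunk):
    out = []
    for key, val in chunk:
        out.extend(map_fn(key, val))
    return out


def reduce_task(items):
    return [reduce_fn(k, vs) for k, vs in items]


def map_reduce(inputs, n_map=2, n_reduce=2):
    # Step 1: partition the input into chunks; run map tasks in parallel.
    chunks = [inputs[i::n_map] for i in range(n_map)]
    with ThreadPoolExecutor() as pool:
        map_outputs = list(
            pool.map(lambda c: run_with_retry(map_task, c), chunks)
        )
        # Step 2: once all maps are done, gather the values for each key.
        groups = defaultdict(list)
        for output in map_outputs:
            for k, v in output:
                groups[k].append(v)
        # Step 3: partition the key space; run reduce tasks in parallel.
        parts = [[] for _ in range(n_reduce)]
        for k, vs in groups.items():
            parts[zlib.crc32(k.encode()) % n_reduce].append((k, vs))
        reduce_outputs = pool.map(
            lambda p: run_with_retry(reduce_task, p), parts
        )
        return sorted(pair for part in reduce_outputs for pair in part)


result = map_reduce([("d1", "see bob throw"), ("d2", "see spot run")])
print(result)
```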

Job Processing
[Figure: a JobTracker coordinating TaskTracker 0 through TaskTracker 5 for a "grep" job]

  1. Client submits "grep" job, indicating code and input files
  2. JobTracker breaks input file into k chunks (in this case 6). Assigns work to tasktrackers.
  3. After map(), tasktrackers exchange map-output to build reduce() keyspace
  4. JobTracker breaks reduce() keyspace into m chunks (in this case 6). Assigns work.
  5. reduce() output may go to NDFS

Parallel Execution

Task Granularity & Pipelining
• Fine granularity tasks: map tasks >> machines
  • Minimizes time for fault recovery
  • Can pipeline shuffling with map execution
  • Better dynamic load balancing
• Often use 200,000 map & 5000 reduce tasks, running on 2000 machines
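
To see why many small tasks balance load better than a few large ones, consider a toy scheduling simulation. Everything in it is hypothetical: four machines, one of them half as fast as the others, the same total amount of work in both runs, and a greedy scheduler that hands the next task to whichever machine becomes free first.

```python
import heapq


def makespan(task_sizes, speeds):
    # Greedy dynamic scheduling: each task goes to the machine that
    # becomes free first. Returns the time at which the last task ends.
    free_at = [(0.0, i) for i in range(len(speeds))]
    heapq.heapify(free_at)
    finish = 0.0
    for size in task_sizes:
        t, i = heapq.heappop(free_at)
        done = t + size / speeds[i]
        finish = max(finish, done)
        heapq.heappush(free_at, (done, i))
    return finish


speeds = [1.0, 1.0, 1.0, 0.5]           # made-up relative machine speeds
coarse = makespan([300] * 4, speeds)    # one large task per machine
fine = makespan([10] * 120, speeds)     # same total work in small tasks
print(coarse, fine)
```

With coarse tasks the whole job waits for the slow machine to finish its single large task; with fine tasks the faster machines pick up more of the work, so the job ends sooner.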