





















An overview of MapReduce, a data processing model for large clusters based on Lisp's map and reduce higher-order functions. MapReduce simplifies data center applications by handling data distribution, replication, access, and fault tolerance. Includes an explanation of the map and reduce functions, a word count example, and a comparison of MapReduce and multithreading.






















Source: MapReduce: Simplified Data Processing on Large Clusters, Jeffrey Dean and Sanjay Ghemawat, Google Inc. (Wim Bohm, cs.colostate.edu). Except as otherwise noted, the content of this presentation is licensed under the Creative Commons Attribution 2.5 license.
Based on Lisp's Map and Reduce higher-order functions
(Lisp: Lots of Irritating Superfluous Parentheses)

Map(fM, L) = cons(fM(first(L)), Map(fM, rest(L)))
Reduce(fR, L) = fR(first(L), Reduce(fR, rest(L)))
MapReduce(fM, fR, L) = Reduce(fR, Map(fM, L))
(base cases left out)
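The recursive definitions above can be written out directly. The following is a minimal Python sketch, not Lisp; the names `map_list`, `reduce_list`, and `map_reduce` are made up for illustration, and the base cases (empty list, plus an `identity` value for the reduction) are assumptions added here because the slide leaves them out.

```python
def map_list(f_m, lst):
    """Map(fM, L): apply fM to the first element, then recurse on the rest."""
    if not lst:  # base case (assumed; omitted on the slide)
        return []
    return [f_m(lst[0])] + map_list(f_m, lst[1:])


def reduce_list(f_r, lst, identity):
    """Reduce(fR, L): combine the first element with the reduction of the rest."""
    if not lst:  # base case (assumed): return the identity element of fR
        return identity
    return f_r(lst[0], reduce_list(f_r, lst[1:], identity))


def map_reduce(f_m, f_r, lst, identity):
    """MapReduce(fM, fR, L) = Reduce(fR, Map(fM, L))."""
    return reduce_list(f_r, map_list(f_m, lst), identity)


# Sum of squares of 1..4: map squares each element, reduce adds them up.
squares_sum = map_reduce(lambda x: x * x, lambda a, b: a + b, [1, 2, 3, 4], 0)
print(squares_sum)  # 30
```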
High throughput, high performance, rack aware
Functional: the run-time system (RTS) takes care of fault tolerance (FT), restart, and distribution (parallelism)
Map(String key, String value)
    // key: document name, value: document contents
    for each word w in value
        EmitIntermediate(w, "1");

Reduce(String key, Iterator values)
    // key: a word, values: a list of counts
    int sum = 0;
    for each v in values
        sum += ParseInt(v);
    Emit((String) sum);
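To make the word count pseudocode concrete, here is a small single-process Python sketch. It assumes the framework's job is only to group intermediate pairs by key; the function names `map_fn`, `reduce_fn`, and `word_count` and the sample documents are invented for illustration and are not part of any MapReduce library.

```python
from collections import defaultdict


def map_fn(doc_name, contents):
    # key: document name, value: document contents
    for word in contents.split():
        yield word, "1"  # same role as EmitIntermediate(w, "1")


def reduce_fn(word, counts):
    # key: a word, values: a list of counts (strings, as in the pseudocode)
    return str(sum(int(c) for c in counts))


def word_count(docs):
    groups = defaultdict(list)
    for name, contents in docs.items():
        for word, count in map_fn(name, contents):
            groups[word].append(count)  # stand-in for the shuffle/group step
    return {word: reduce_fn(word, counts) for word, counts in groups.items()}


docs = {"a.txt": "the quick fox", "b.txt": "the lazy dog the end"}
counts = word_count(docs)
print(counts["the"])  # 3
```

In a real deployment the grouping step would be done by the run-time system across machines; here a dictionary plays that role.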
Idea: generate random points in a square
Count how many are inside the circle and how many are in the square (producing area estimates)
Square area: $A_s = (2r)^2 = 4r^2$, so $r^2 = A_s / 4$
Circle area: $A_c = \pi r^2$, so $\pi = A_c / r^2$
Therefore $\pi = 4 A_c / A_s$
Example of a Monte Carlo method: simulating a physical phenomenon using many random samples
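The area argument can be checked with a short sequential sketch in Python. It assumes a square of side 1 centered at the origin (so the inscribed circle has radius 0.5); the function name `estimate_pi` and the fixed seed are illustrative choices, not part of the original slides.

```python
import random


def estimate_pi(n_points, seed=0):
    """Estimate pi as 4 * Ac / As using n_points random samples."""
    rng = random.Random(seed)  # seeded so that runs are repeatable
    inside = 0                 # points that fall in the circle (estimates Ac)
    for _ in range(n_points):
        x = rng.uniform(-0.5, 0.5)
        y = rng.uniform(-0.5, 0.5)
        if x * x + y * y < 0.25:  # distance from origin below r = 0.5
            inside += 1
    # Every point lands in the square, so n_points estimates As.
    return 4 * inside / n_points


print(estimate_pi(200_000))
```

The estimate tightens slowly: the error shrinks roughly with the square root of the number of samples, which is why the work is worth spreading over several workers.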
Master:
    get input params (nWorkers, nPoints)
    for (i = 0; i < nWorkers; i++)
        thrCreate(i, nPoints);
    for (i = 0; i < nWorkers; i++)
        join;
    As = 0; Ac = 0;
    for (i = 0; i < nWorkers; i++) { As += nPoints; Ac += cPoints[i]; }
    piEst = 4 * Ac / As;

Slave i:
    cPoints[i] = 0;
    for (j = 0; j < nPoints; j++) {
        create 2 random pts x, y in (-.5 .. .5);
        if (sqrt(x*x + y*y) < .5) cPoints[i]++;
    }
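The master/slave pseudocode maps onto Python's standard `threading` module roughly as follows. This is a sketch under assumptions: the names `worker` and `master` are chosen here, each worker gets its own random generator seeded by its id, and the shared list `c_points` plays the role of `cPoints`.

```python
import random
import threading


def worker(i, n_points, c_points):
    rng = random.Random(i)  # assumption: one generator per worker
    count = 0
    for _ in range(n_points):
        x = rng.uniform(-0.5, 0.5)
        y = rng.uniform(-0.5, 0.5)
        if x * x + y * y < 0.25:  # same test as sqrt(x*x + y*y) < .5
            count += 1
    c_points[i] = count  # each worker writes only its own slot


def master(n_workers, n_points):
    c_points = [0] * n_workers
    threads = [
        threading.Thread(target=worker, args=(i, n_points, c_points))
        for i in range(n_workers)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()  # wait for all workers before combining results
    a_s = n_workers * n_points  # every point lies in the square
    a_c = sum(c_points)         # points that fell inside the circle
    return 4 * a_c / a_s


pi_est = master(4, 50_000)
print(pi_est)
```

On standard CPython builds the global interpreter lock keeps these threads from running the loop truly in parallel, so the sketch mirrors the structure of the pseudocode rather than its speedup.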
Either it can impose an order
Or it can make sure the reduction function is associative and commutative
Take parallel grep: if you want outcomes sorted by line #, make the line # part of the key, and sort
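Both options can be illustrated in a few lines of Python. The sketch below is an assumption-laden toy: reversing a list stands in for values arriving out of order, and the sample lines for grep are invented.

```python
from functools import reduce

values = [1, 2, 3, 4, 5]
arrived = list(reversed(values))  # pretend the values arrived out of order

# Addition is associative and commutative: arrival order does not matter.
sum_in_order = reduce(lambda a, b: a + b, values)
sum_arrived = reduce(lambda a, b: a + b, arrived)

# Subtraction is neither, so the result depends on arrival order.
diff_in_order = reduce(lambda a, b: a - b, values)
diff_arrived = reduce(lambda a, b: a - b, arrived)

print(sum_in_order, sum_arrived)    # 15 15
print(diff_in_order, diff_arrived)  # -13 -5

# Grep with the line number as part of the key: sorting restores file order.
lines = ["alpha map", "beta", "gamma map reduce", "delta map"]
matches = [(n, text) for n, text in enumerate(lines, start=1) if "map" in text]
out_of_order = list(reversed(matches))
ordered = sorted(out_of_order)  # tuples sort by line number first
print(ordered)
```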
We need the MapReduce plugins to create a MapReduce Eclipse perspective.
MapReduce projects contain three classes:
1. A Driver (like the master in the multithreading case): creating a configuration, defining #mappers and #reducers, starting the app, and gathering the final result.
2. A mapper (inherited class implementing the mapper interface): getting data from files in a directory specified by the driver.
3. A reducer (inherited class implementing the reducer interface): getting data from files in a directory specified by the driver, produced by the mappers.
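The division of labor among the three classes can be sketched in plain Python. This is not the Hadoop API and does not use the Eclipse plugin: the class names `Driver`, `Mapper`, and `Reducer`, their `run` methods, the `.map` file suffix, and the tab-separated intermediate format are all assumptions made for illustration, and the sketch runs one mapper and one reducer sequentially rather than a configurable number in parallel.

```python
import os
import tempfile
from collections import defaultdict


class Mapper:
    """Reads every file in the input directory named by the driver."""

    def run(self, in_dir, out_dir):
        for name in sorted(os.listdir(in_dir)):
            with open(os.path.join(in_dir, name)) as src, \
                 open(os.path.join(out_dir, name + ".map"), "w") as dst:
                for word in src.read().split():
                    dst.write(f"{word}\t1\n")  # one intermediate pair per line


class Reducer:
    """Reads the files the mapper produced and sums the counts per key."""

    def run(self, in_dir):
        totals = defaultdict(int)
        for name in sorted(os.listdir(in_dir)):
            with open(os.path.join(in_dir, name)) as src:
                for line in src:
                    key, value = line.rstrip("\n").split("\t")
                    totals[key] += int(value)
        return dict(totals)


class Driver:
    """Sets up the directories, starts the job, and gathers the result."""

    def run(self, documents):
        with tempfile.TemporaryDirectory() as in_dir, \
             tempfile.TemporaryDirectory() as mid_dir:
            for name, text in documents.items():
                with open(os.path.join(in_dir, name), "w") as f:
                    f.write(text)
            Mapper().run(in_dir, mid_dir)
            return Reducer().run(mid_dir)


result = Driver().run({"a.txt": "map reduce map", "b.txt": "reduce"})
print(result)
```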
Multiple sequential mappers do not bring the performance down
Twenty parallel mappers: five-fold speedup. Twelve seems better
the Google cluster architecture. IEEE Micro, 23(2):
[Figure: MapReduce execution overview. The user program (1) forks the master and the workers. The master (2) assigns map tasks and reduce tasks. In the map phase, workers (3) read the input file splits (split 0, split 1, ...) and (4) write intermediate files to local disk. In the reduce phase, workers (5) read the intermediate files remotely and (6) write the output files (output file 0, output file 1).]