MapReduce: A Data Processing System for Large-Scale Data Sets - Prof. Kristen R. Lefevre | Papers Database Management Systems (DBMS)

MapReduce: Simplified Data Processing on Large Clusters

Jeffrey Dean and Sanjay Ghemawat

[email protected], [email protected]

Google, Inc.

Abstract

MapReduce is a programming model and an associ-

ated implementation for processing and generating large

data sets. Users specify a map function that processes a

key/value pair to generate a set of intermediate key/value

pairs, and a reduce function that merges all intermediate

values associated with the same intermediate key. Many

real world tasks are expressible in this model, as shown

in the paper.

Programs written in this functional style are automati-

cally parallelized and executed on a large cluster of com-

modity machines. The run-time system takes care of the

details of partitioning the input data, scheduling the pro-

gram’s execution across a set of machines, handling ma-

chine failures, and managing the required inter-machine

communication. This allows programmers without any

experience with parallel and distributed systems to eas-

ily utilize the resources of a large distributed system.

Our implementation of MapReduce runs on a large

cluster of commodity machines and is highly scalable:

a typical MapReduce computation processes many ter-

abytes of data on thousands of machines. Programmers

find the system easy to use: hundreds of MapReduce pro-

grams have been implemented and upwards of one thou-

sand MapReduce jobs are executed on Google’s clusters

every day.

1 Introduction

Over the past five years, the authors and many others at

Google have implemented hundreds of special-purpose

computations that process large amounts of raw data,

such as crawled documents, web request logs, etc., to

compute various kinds of derived data, such as inverted

indices, various representations of the graph structure

of web documents, summaries of the number of pages

crawled per host, the set of most frequent queries in a

given day, etc. Most such computations are conceptu-

ally straightforward. However, the input data is usually

large and the computations have to be distributed across

hundreds or thousands of machines in order to finish in

a reasonable amount of time. The issues of how to par-

allelize the computation, distribute the data, and handle

failures conspire to obscure the original simple compu-

tation with large amounts of complex code to deal with

these issues.

As a reaction to this complexity, we designed a new

abstraction that allows us to express the simple computa-

tions we were trying to perform but hides the messy de-

tails of parallelization, fault-tolerance, data distribution

and load balancing in a library. Our abstraction is in-

spired by the map and reduce primitives present in Lisp

and many other functional languages. We realized that

most of our computations involved applying a map op-

eration to each logical “record” in our input in order to

compute a set of intermediate key/value pairs, and then

applying a reduce operation to all the values that shared

the same key, in order to combine the derived data ap-

propriately. Our use of a functional model with user-

specified map and reduce operations allows us to paral-

lelize large computations easily and to use re-execution

as the primary mechanism for fault tolerance.

The major contributions of this work are a simple and

powerful interface that enables automatic parallelization

and distribution of large-scale computations, combined

with an implementation of this interface that achieves

high performance on large clusters of commodity PCs.

Section 2 describes the basic programming model and

gives several examples. Section 3 describes an imple-

mentation of the MapReduce interface tailored towards

our cluster-based computing environment. Section 4 de-

scribes several refinements of the programming model

that we have found useful. Section 5 has performance

measurements of our implementation for a variety of

tasks. Section 6 explores the use of MapReduce within

Google including our experiences in using it as the basis

To appear in OSDI 2004 1

MapReduce: A Data Processing System for Large-Scale Data Sets - Prof. Kristen R. Lefevre, Papers of Database Management Systems (DBMS)

Related documents

Partial preview of the text

Download MapReduce: A Data Processing System for Large-Scale Data Sets - Prof. Kristen R. Lefevre and more Papers Database Management Systems (DBMS) in PDF only on Docsity!

MapReduce: Simplified Data Processing on Large Clusters

Jeffrey Dean and Sanjay Ghemawat

Google, Inc.

Abstract

1 Introduction

2 Programming Model

2.1 Example

2.2 Types

2.3 More Examples

3.2 Master Data Structures

3.3 Fault Tolerance

3.4 Locality

3.5 Task Granularity

4.5 Side-effects

4.6 Skipping Bad Records

4.7 Local Execution

4.8 Status Information

4.9 Counters

5 Performance

5.1 Cluster Configuration

5.2 Grep

5.3 Sort

5.4 Effect of Backup Tasks

5.5 Machine Failures

6 Experience

6.1 Large-Scale Indexing

7 Related Work

A Word Frequency