Understanding MapReduce: Google's Algorithm for Processing Large Data Sets, Study notes of Data Analysis & Statistical Methods

These notes explain why Google developed MapReduce: to address the limitations of traditional enterprise systems in handling large volumes of data. MapReduce is a programming model for processing large datasets distributed across a large cluster, using the divide-and-conquer approach. It consists of two methods: map() and reduce(). The notes give an overview of how MapReduce works, including its architecture, phases, and example use cases.

Typology: Study notes

2019/2020

Uploaded on 08/01/2020

bagga-dhruv

CH-1
INTRODUCTION TO BIG DATA
BY: PROF. AJAYSINH RATHOD

Why MapReduce?

Google solved the bottleneck that traditional enterprise systems face when handling large volumes of data by using an algorithm called MapReduce. MapReduce divides a task into small parts and assigns them to many computers. Later, the results are collected in one place and integrated to form the result dataset.
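The divide, assign, and collect idea can be sketched in a few lines of Python. This is only an illustration on a single machine: a thread pool stands in for the "many computers", and the helper names `split_into_parts`, `process_part`, and `run` are hypothetical, not part of any MapReduce library.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_parts(data, num_parts):
    """Divide the task: cut the data into roughly equal slices."""
    size = max(1, -(-len(data) // num_parts))  # ceiling division
    return [data[i:i + size] for i in range(0, len(data), size)]


def process_part(part):
    """Work done independently on each part (here: a partial sum)."""
    return sum(part)


def run(data, num_workers=4):
    parts = split_into_parts(data, num_workers)
    # Assign each part to a worker; threads stand in for separate computers.
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        partial_results = list(pool.map(process_part, parts))
    # Collect the partial results in one place and integrate them.
    return sum(partial_results)


total = run(list(range(1, 101)))
print(total)
```

Because each part is processed without looking at any other part, the same structure carries over when the workers are separate machines rather than threads.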

Algorithms using map reduce

How MapReduce Works?

  • The MapReduce algorithm contains two important tasks, namely Map and Reduce.
  • The Map task takes a set of data and converts it into another set of data, where individual elements are broken down into tuples (key-value pairs).
  • The Reduce task takes the output from the Map as an input and combines those data tuples (key-value pairs) into a smaller set of tuples.
  • The Reduce task is always performed after the Map job.

MapReduce algorithms

  • MapReduce is a programming model designed for processing large volumes of data in parallel by dividing the work into a set of independent tasks. For example, Twitter data can be split by month, with each month processed on a different server.
  • Hadoop is an open-source implementation of MapReduce.
  • A job is a combination of two Java functions: Mapper() and Reducer().
  • Example: checking the popularity of a piece of text by counting how often each word occurs.

Big Data and Its Sources

The Mapper function processes the split files and provides input to the Reducer.

    Mapper(filename, file-contents):
        for each word in file-contents:
            emit(word, 1)

The Reducer function combines the input provided by the Mapper and produces the output.

    Reducer(word, values):
        sum = 0
        for each value in values:
            sum = sum + value
        emit(word, sum)
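The Mapper and Reducer pseudocode above can be tried out as a small Python sketch. It runs in a single process and is not Hadoop's Java API; the `word_count` driver and the sample file names are assumptions added only to connect the two functions.

```python
from collections import defaultdict


def mapper(filename, file_contents):
    """Emit (word, 1) for every word in the file contents."""
    for word in file_contents.split():
        yield (word, 1)


def reducer(word, values):
    """Sum all the counts emitted for one word."""
    total = 0
    for value in values:
        total += value
    return (word, total)


def word_count(files):
    # Group the mapper output by key; a real framework does this
    # between the map and reduce steps.
    grouped = defaultdict(list)
    for filename, contents in files.items():
        for word, count in mapper(filename, contents):
            grouped[word].append(count)
    return dict(reducer(word, values) for word, values in grouped.items())


# Hypothetical input: two small "split files".
files = {"a.txt": "big data big cluster", "b.txt": "data data"}
counts = word_count(files)
print(counts)
```

The mapper never needs to know about other files, and the reducer never needs to know about other words, which is what lets a framework run many copies of each in parallel.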

How MapReduce Works?

  • Shuffle and Sort: The Reducer task starts with the Shuffle and Sort step. It downloads the grouped key-value pairs onto the local machine, where the Reducer is running. The individual key-value pairs are sorted by key into a larger data list. The data list groups the equivalent keys together so that their values can be iterated easily in the Reducer task.
  • Reducer: The Reducer takes the grouped key-value paired data as input and runs a Reducer function on each one of them. Here, the data can be aggregated, filtered, and combined in a number of ways, and it requires a wide range of processing. Once the execution is over, it gives zero or more key-value pairs to the final step.
  • Output Phase: In the output phase, we have an output formatter that translates the final key-value pairs from the Reducer function and writes them onto a file using a record writer.
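The three phases above can be imitated in plain Python as a rough sketch. Everything here is an assumption made for illustration: the function names, the rule that drops keys seen only once (to show a Reducer emitting zero pairs), the tab-separated record format, and the in-memory buffer that stands in for the output file.

```python
import io
from itertools import groupby
from operator import itemgetter


def shuffle_and_sort(mapped_pairs):
    """Sort the pairs by key and group the values of equal keys together."""
    ordered = sorted(mapped_pairs, key=itemgetter(0))
    for key, group in groupby(ordered, key=itemgetter(0)):
        yield key, [value for _, value in group]


def reducer(key, values):
    """Return zero or more pairs; here, keys seen only once are filtered out."""
    total = sum(values)
    if total > 1:
        yield key, total


def write_output(pairs, out):
    """Output phase: format each final pair as one tab-separated record."""
    for key, value in pairs:
        out.write(f"{key}\t{value}\n")


mapped = [("data", 1), ("big", 1), ("data", 1), ("cluster", 1), ("big", 1)]
grouped = list(shuffle_and_sort(mapped))
reduced = [pair for key, values in grouped for pair in reducer(key, values)]

buffer = io.StringIO()  # stands in for the output file
write_output(reduced, buffer)
print(buffer.getvalue(), end="")
```

Sorting before grouping matters: `itertools.groupby` only merges adjacent items, so equal keys must already sit next to each other, which is the role of the sort step.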

How MapReduce Works?

As shown in the illustration, the MapReduce algorithm performs the following actions:

  • Tokenize: Tokenizes the tweets into maps of tokens and writes them as key-value pairs.
  • Filter: Filters unwanted words from the maps of tokens and writes the filtered maps as key-value pairs.
  • Count: Generates a token counter per word.
  • Aggregate Counters: Prepares an aggregate of similar counter values into small manageable units.
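The four actions listed above might look like this in a single-process Python sketch. The sample tweets, the `STOP_WORDS` list, and the function names are hypothetical; the notes do not say which words count as unwanted or how the tweets are stored.

```python
from collections import Counter

STOP_WORDS = {"a", "the", "is", "to", "of"}  # hypothetical filter list


def tokenize(tweet):
    """Tokenize: split one tweet into lowercase tokens, each paired with 1."""
    return [(token, 1) for token in tweet.lower().split()]


def filter_tokens(pairs):
    """Filter: drop unwanted words from the token pairs."""
    return [(token, n) for token, n in pairs if token not in STOP_WORDS]


def count(pairs):
    """Count: build a per-word counter for one tweet."""
    counter = Counter()
    for token, n in pairs:
        counter[token] += n
    return counter


def aggregate(counters):
    """Aggregate Counters: merge the per-tweet counters into one total."""
    total = Counter()
    for counter in counters:
        total.update(counter)
    return total


# Hypothetical sample tweets.
tweets = ["Hadoop is the future", "The future of Big Data is Hadoop"]
per_tweet = [count(filter_tokens(tokenize(t))) for t in tweets]
totals = aggregate(per_tweet)
print(totals.most_common(2))
```

Tokenize, Filter, and Count each work on one tweet at a time, so they fit the Map side; Aggregate Counters combines results across tweets, which is the Reduce side.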

Algorithms using map reduce