Understanding the Core of Cloud Computing: MapReduce and Datastores, Study notes of Computer Science

These notes cover the core of cloud computing: MapReduce, serial databases versus parallel datastores, and distributed data storage. Topics include operating on large amounts of data, design considerations for the core, and why MapReduce is important in cloud computing.

Typology: Study notes

2012/2013

Uploaded on 04/23/2013

aslesha

The core
Tuesday, February 23, 2010, 5:50 PM

So far, we've studied the "edge" of the cloud:
    Service switching
    Serial computations
    Horizontal scaling

Now, we turn our attention to the "core" of the cloud:
    Distributed objects
    Parallel computations
    Core scaling

Docsity.com


Design considerations for the core

    Operate on really large amounts of data.
    Use redundant data storage for robustness.
    Compute queries quickly regardless of data size.
    Continue functioning even if there are multiple points of failure.
    Provide effective programming abstractions for manipulating data without having to know details.

Databases and Datastores

    (Serial) Databases      (Cloud) Datastores
    Tables and rows         Key/value pairs
    SQL queries             MapReduce and Pig
    Serial execution        Parallel execution

NoSQL: a protest against how hard it is to parallelize SQL. If you do everything with <k,v> pairs, then it is much easier to put it into the cloud.

Some key concepts

Any datastore can be represented by a collection of sets of <key,value> pairs where:
    The key is a unique identifier for data.
    The value might be data associated with the key, or alternatively, a key to other data.
(This might be considered the fundamental theorem of database theory; the resulting representation of data is called 4th normal form.)

Thus distributed datastores can concentrate on being able to store and retrieve key-value pairs rather than tables and rows.
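As a rough illustration of the key/value representation described above, the sketch below splits table rows into one set of <key,value> pairs per column. The table contents, the column names, and the helper `rows_to_pair_sets` are hypothetical and chosen only for this example.

```python
# Hypothetical table: each row has a unique "id" plus two ordinary columns.
rows = [
    {"id": 17, "name": "Ada", "dept": "math"},
    {"id": 42, "name": "Alan", "dept": "cs"},
]


def rows_to_pair_sets(rows, key_column):
    """Turn table rows into one {key: value} mapping per non-key column."""
    pair_sets = {}
    for row in rows:
        key = row[key_column]
        for column, value in row.items():
            if column == key_column:
                continue
            # Each column becomes its own set of <key,value> pairs.
            pair_sets.setdefault(column, {})[key] = value
    return pair_sets


pair_sets = rows_to_pair_sets(rows, "id")
print(pair_sets)
```

Here a value such as "cs" could itself serve as a key into another set of pairs (for example, one describing departments), which is the "key to other data" case.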

Why MapReduce is important

The mapping phase is "embarrassingly parallel": parallel time = (total serial time)/(# of processors). This parallelism is ideal; we can't do any better than that!

Time for the reduce phase is proportional to the log of the number of processors utilized. Not quite perfect runtime, but as good as it gets.
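The two timing claims above can be illustrated with a small sketch. It assumes the reduce phase combines partial results pairwise in a tree, which is one common way to get logarithmic depth; the function names `map_time` and `tree_reduce` are hypothetical, and the rounds run serially here rather than on real processors.

```python
def map_time(total_serial_time, processors):
    """Ideal map-phase time: the serial work divided evenly among processors."""
    return total_serial_time / processors


def tree_reduce(values, combine):
    """Combine values pairwise, level by level; return (result, rounds).

    Each round stands for one parallel step, so the number of rounds
    grows with the log of the number of inputs. Expects a non-empty list.
    """
    rounds = 0
    while len(values) > 1:
        paired = []
        for i in range(0, len(values) - 1, 2):
            paired.append(combine(values[i], values[i + 1]))
        if len(values) % 2 == 1:
            # An unpaired value passes through to the next round unchanged.
            paired.append(values[-1])
        values = paired
        rounds += 1
    return values[0], rounds


# Eight partial results need three rounds of pairwise combining (8 -> 4 -> 2 -> 1).
total, rounds = tree_reduce(list(range(1, 9)), lambda a, b: a + b)
print(total, rounds)
print(map_time(120.0, 8))
```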

Theory of operation of MapReduce

Three kinds of nodes:
    Trackers: initiate a MapReduce, gather results.
    Mappers: perform the map part of an operation, contain data.
    Reducers: perform the reduce part of an operation, don't contain data.

How a MapReduce is implemented:
    Select a Tracker by flowless switching; send it the query.
    Tracker contacts its mappers (mechanisms differ; Google uses UDP broadcast; Hadoop uses tree propagation).
    Answer flows back from mappers to reducers to tracker to client, in a tree ("funnel") shape.
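The funnel-shaped flow described above might be simulated in a single process as follows. This is only a sketch: a plain loop stands in for the mappers and reducers running in parallel, the function `run_mapreduce` and its `fan_in` parameter are hypothetical, and no switching or network propagation is modeled.

```python
def run_mapreduce(mapper_data, map_fn, reduce_fn, fan_in=2):
    """Tracker role: query every mapper, then funnel answers through reducers.

    fan_in must be at least 2, and there must be at least one mapper.
    """
    # Map step: each mapper answers the query from its own local data.
    answers = [map_fn(local_data) for local_data in mapper_data]
    # Reduce steps: each reducer merges up to fan_in answers and holds no data.
    while len(answers) > 1:
        answers = [
            reduce_fn(answers[i:i + fan_in])
            for i in range(0, len(answers), fan_in)
        ]
    # The tracker hands the single remaining answer back to the client.
    return answers[0]


# Four mappers, each holding some local lines (made-up data).
mapper_data = [
    ["line a", "line b"],
    ["line c"],
    ["line d"],
    ["line e", "line f", "line g"],
]

# Query: how many lines are stored in total?
total_lines = run_mapreduce(mapper_data, map_fn=len, reduce_fn=sum)
print(total_lines)
```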

Picture of a MapReduce: making request (Google)
[Figure: one switch (S), one tracker (T), three reducers (R), and four mappers (M). S=switch, T=tracker, R=reducer, M=mapper.]

Picture of a MapReduce: making request (Hadoop)
[Figure: one switch (S), one tracker (T), three reducers (R), and four mappers (M). S=switch, T=tracker, R=reducer, M=mapper.]

Edge service versus MapReduce

    Edge service                      MapReduce
    Can't operate on local data       Operates solely on local data
    Queries a datastore               Is a datastore
    Scales horizontally               Scales vertically
    Adds serial instances             Adds parallel instances
    Switches between edge servers     Switches between tracker nodes

Example: book index

    Input: numbered lines of text.
    Output: index of the line numbers in which each word appears, sorted by word.

Part 2: Mapping

For every line in the datastore, label words with the line in which they appear.

E.g.,
    1 When that Aprillis with his showers swoot,
    5 When Zephyrus eke with his swoote breath
becomes, after mapping (depicting them in order of discovery):
    when: 1 5
    that: 1
    Aprillis: 1
    with: 1 5
    his: 1 5
    showers: 1
    swoot: 1
    Zephyrus: 5
    eke: 5
    swoote: 5
    breath: 5
This is the result of one node's mapping.
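One node's mapping step for the book index could be sketched as below. The function name `map_lines` is hypothetical. As a simplifying assumption, the sketch lowercases every word and strips surrounding punctuation, so proper names come out as "aprillis" and "zephyrus" rather than capitalized as in the notes.

```python
import string


def map_lines(numbered_lines):
    """Return {word: [line numbers]} for one node's lines, in order of discovery."""
    index = {}
    for number, text in numbered_lines:
        for raw in text.split():
            # Assumption: fold case and drop punctuation such as the trailing comma.
            word = raw.strip(string.punctuation).lower()
            if not word:
                continue
            lines = index.setdefault(word, [])
            if number not in lines:
                lines.append(number)
    return index


# The node that holds lines 1 and 5.
node01 = map_lines([
    (1, "When that Aprillis with his showers swoot,"),
    (5, "When Zephyrus eke with his swoote breath"),
])

for word, lines in node01.items():
    print(word + ":", *lines)
```

Python dictionaries preserve insertion order, so the printout follows the order of discovery without any extra bookkeeping.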

Part 3: Reduce

Create a global view of data from the local views.
Combine work from several nodes.

E.g., if
    Node01 handles lines 1 and 5
    Node02 handles lines 2 and 6
    Node03 handles lines 3 and 7
    Node04 handles lines 4 and 8
then the reduce depicts results for lines 1-8. (It is easy enough to produce this in sorted order.)
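A reduce step that merges per-node indexes into one sorted global index might look like the sketch below. The function name `reduce_indexes` is hypothetical, the first partial index is a subset of the mapping example from these notes, and the second partial index is made-up data standing in for another node's output.

```python
def reduce_indexes(partial_indexes):
    """Merge per-node {word: [line numbers]} maps into one index sorted by word."""
    merged = {}
    for partial in partial_indexes:
        for word, lines in partial.items():
            merged.setdefault(word, set()).update(lines)
    # Sort the words and each word's line numbers for the final index.
    return {word: sorted(merged[word]) for word in sorted(merged)}


node01 = {"when": [1, 5], "with": [1, 5], "his": [1, 5]}
node02 = {"the": [2, 6], "with": [6]}  # hypothetical output of a second node

global_index = reduce_indexes([node01, node02])
for word, lines in global_index.items():
    print(word + ":", *lines)
```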

Common MapReduce patterns

    CRUD: create/retrieve/update/delete for datastores.
    MergeSort: sort data into any desired order.
    Index: produce an index for existing data.
    Search: produce data only if there's a match.
    Count: count instances of a search term.
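The Search and Count patterns from the list above can be sketched as pairs of map and reduce functions. All function names and the sample data are hypothetical, matching is on whole whitespace-separated words, and a list comprehension stands in for the mappers running in parallel.

```python
def search_map(local_lines, term):
    """Map: each node returns only its lines that contain the term."""
    return [line for line in local_lines if term in line.split()]


def search_reduce(partials):
    """Reduce: concatenate the matches from every node."""
    results = []
    for partial in partials:
        results.extend(partial)
    return results


def count_map(local_lines, term):
    """Map: each node counts occurrences of the term in its own lines."""
    return sum(line.split().count(term) for line in local_lines)


def count_reduce(partials):
    """Reduce: add up the per-node counts."""
    return sum(partials)


# Two nodes with made-up local data.
nodes = [
    ["the cat sat", "a dog ran"],
    ["the dog sat", "the end"],
]

matches = search_reduce([search_map(lines, "dog") for lines in nodes])
total = count_reduce([count_map(lines, "the") for lines in nodes])
print(matches)
print(total)
```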

CRUD and MapReduce

    Create a <key,value> pair.
    Retrieve a <key,value> pair.
    Update a value for a key.
    Delete a <key,value> pair.

MapReduce can implement distributed CRUD:

    Create
        Map: choose one (or more) elements to store <key,value>.
        Reduce: count number of created pairs.
    Retrieve
        Map: return value for key, nothing if no match.
        Reduce: ignore empty returns.
    Update
        Map: put new value everywhere old value lives.
        Reduce: count changes made.
    Delete
        Map: delete all instances of given key.
        Reduce: count deletions done.
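The four operations above could be sketched over a list of in-memory dictionaries, one per mapper node. Everything here is an assumption made for illustration: the function names, the `replicas` parameter, and the CRC-based choice of which nodes store a new pair are not from the notes, and `None` is used to mean "no match", so this sketch cannot store `None` as a value.

```python
import zlib


def create(stores, key, value, replicas=2):
    """Map: pick nodes to hold the pair. Reduce: count the pairs created."""
    start = zlib.crc32(key.encode()) % len(stores)  # assumed placement rule
    created = []
    for i in range(min(replicas, len(stores))):
        stores[(start + i) % len(stores)][key] = value
        created.append(1)
    return sum(created)


def retrieve(stores, key):
    """Map: each node returns its value or None. Reduce: ignore empty returns."""
    answers = [store.get(key) for store in stores]
    found = [answer for answer in answers if answer is not None]
    return found[0] if found else None


def update(stores, key, new_value):
    """Map: replace the value wherever the key lives. Reduce: count changes."""
    changes = []
    for store in stores:
        if key in store:
            store[key] = new_value
            changes.append(1)
    return sum(changes)


def delete(stores, key):
    """Map: remove every instance of the key. Reduce: count deletions."""
    deletions = []
    for store in stores:
        if key in store:
            del store[key]
            deletions.append(1)
    return sum(deletions)


stores = [{}, {}, {}]  # three mapper nodes, initially empty
print(create(stores, "color", "red"))
print(retrieve(stores, "color"))
print(update(stores, "color", "blue"))
print(delete(stores, "color"))
```

Storing each pair on more than one node is what makes the update and delete counts meaningful, and it ties back to the redundancy goal in the design considerations.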