





















Explore the core concepts of cloud computing, including mapreduce, serial and parallel databases, and distributed data storage. Learn how to operate on large amounts of data, design considerations for the core, and the importance of mapreduce in cloud computing.






















The core
Tuesday, February 23, 2010 5:50 PM

So far, we've studied the "edge" of the cloud:
  Service switching
  Serial computations
  Horizontal scaling
Now, we turn our attention to the "core" of the cloud:
  Distributed objects
  Parallel computations
  Core scaling

Design considerations for the core

  Operate on really large amounts of data.
  Use redundant data storage for robustness.
  Compute queries quickly regardless of data size.
  Continue functioning even if there are multiple points of failure.
  Provide effective programming abstractions for manipulating data without having to know details.

Databases and Datastores

  (Serial) Databases      (Cloud) Datastores
  Tables and rows         Key/value pairs
  SQL queries             MapReduce and Pig
  Serial execution        Parallel execution

NoSQL: a protest against how hard it is to parallelize SQL. If you do everything with <k,v> pairs, then it is much easier to put it into the cloud.

Some key concepts

Any datastore can be represented by a collection of sets of <key,value> pairs where:
  The key is a unique identifier for data.
  The value might be data associated with the key, or alternatively, a key to other data.
(This might be considered the fundamental theorem of database theory; the resulting representation of data is called 4th normal form.)
Thus distributed datastores can concentrate on being able to store and retrieve key-value pairs rather than tables and rows.
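
As a rough illustration of the idea above, the sketch below flattens table rows into <key,value> pairs, where a value is either data or a key to other data. The table contents, the "table:id:column" key format, and the function name are all hypothetical, chosen only for this example.

```python
# Hypothetical example: flatten table rows into <key, value> pairs.
# The "table:id:column" key format is an assumption made for this sketch.

def rows_to_pairs(table_name, rows, id_column):
    """Return a dict mapping a unique key to a value for every non-id cell."""
    pairs = {}
    for row in rows:
        row_id = row[id_column]
        for column, value in row.items():
            if column == id_column:
                continue  # the id becomes part of the key, not a value
            pairs[f"{table_name}:{row_id}:{column}"] = value
    return pairs


students = [
    # "faculty:7" is a value that acts as a key to other data
    {"id": 1, "name": "Ada", "advisor": "faculty:7"},
    {"id": 2, "name": "Bob", "advisor": "faculty:7"},
]
pairs = rows_to_pairs("students", students, "id")
print(pairs["students:1:name"])
```

Once the data is in this shape, a datastore only needs to store and retrieve individual pairs.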

Why MapReduce is important

The mapping phase is "embarrassingly parallel": parallel time = (total serial time)/(# of processors). This parallelism is ideal; we can't do any better than that!
Time for the reduce phase is proportional to the log of the number of processors utilized. Not quite perfect runtime, but as good as it gets.
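
The two runtime claims above can be sketched numerically. This is a minimal sketch that assumes the reduce combines results pairwise in a tree (a fan-in of 2, which the notes do not specify); the function names are hypothetical.

```python
def ideal_map_time(serial_time, processors):
    """Ideal parallel time for the map phase: serial time split evenly."""
    return serial_time / processors


def reduce_rounds(processors):
    """Rounds of pairwise combining needed to funnel results to one node.

    Equals ceil(log2(processors)); computed with integer arithmetic
    to avoid floating-point rounding.
    """
    return (processors - 1).bit_length()


for p in (1, 8, 1024):
    print(p, ideal_map_time(1024.0, p), reduce_rounds(p))
```

Doubling the number of processors halves the ideal map time but adds only one more round to the reduce.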

Theory of operation of MapReduce

Three kinds of nodes:
  Trackers: initiate a MapReduce, gather results.
  Mappers: perform the map part of an operation, contain data.
  Reducers: perform the reduce part of an operation, don't contain data.
How a MapReduce is implemented:
  Select a Tracker by flowless switching; send it the query.
  Tracker contacts its mappers (mechanisms differ; Google uses UDP broadcast; Hadoop uses tree propagation).
  Answer flows back from mappers to reducers to tracker to client, in a tree ("funnel") shape.
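
The three roles above can be imitated in a toy, single-process simulation. This is only a sketch: the class and method names are hypothetical, the fan-in of 2 is an assumption, and real systems run these roles on separate machines with their own messaging.

```python
class Mapper:
    """Holds data and performs the map part of an operation."""

    def __init__(self, data):
        self.data = data

    def map(self, query):
        return [item for item in self.data if query in item]


class Reducer:
    """Holds no data; only combines partial results."""

    def reduce(self, partial_results):
        combined = []
        for part in partial_results:
            combined.extend(part)
        return combined


class Tracker:
    """Initiates the MapReduce and gathers the final result."""

    def __init__(self, mappers, fan_in=2):
        self.mappers = mappers
        self.fan_in = fan_in  # assumed; must be at least 2

    def run(self, query):
        level = [mapper.map(query) for mapper in self.mappers]
        reducer = Reducer()
        # Funnel: combine fan_in results at a time until one answer remains.
        while len(level) > 1:
            level = [reducer.reduce(level[i:i + self.fan_in])
                     for i in range(0, len(level), self.fan_in)]
        return level[0] if level else []


mappers = [
    Mapper(["apple pie", "pear tart"]),
    Mapper(["apple jam"]),
    Mapper(["plum cake"]),
]
result = Tracker(mappers).run("apple")
print(result)
```

Each pass through the loop corresponds to one level of the funnel between the mappers and the tracker.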

Picture of a MapReduce: making request (Google)
[Figure not reproduced. Legend: S=switch, T=tracker, R=reducer, M=mapper]

Picture of a MapReduce: making request (Hadoop)
[Figure not reproduced. Legend: S=switch, T=tracker, R=reducer, M=mapper]

Edge service versus MapReduce

  Edge service                     MapReduce
  Can't operate on local data      Operates solely on local data
  Queries a datastore              Is a datastore
  Scales horizontally              Scales vertically
  Adds serial instances            Adds parallel instances
  Switches between edge servers    Switches between tracker nodes

Example: book index

Input: numbered lines of text.
Output: index of the line numbers in which each word appears, sorted by word.

Part 2: Mapping

For every line in the datastore, label words with the line in which they appear.
E.g.,
  1 When that Aprillis with his showers swoot,
  5 When Zephyrus eke with his swoote breath
becomes, after mapping (depicting them in order of discovery):
  when: 1 5
  that: 1
  Aprillis: 1
  with: 1 5
  his: 1 5
  showers: 1
  swoot: 1
  Zephyrus: 5
  eke: 5
  swoote: 5
  breath: 5
This is the result of one node's mapping.
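
The mapping step for one node might be sketched as follows. The function name is hypothetical, and the sketch assumes that punctuation is stripped and that every word is lowercased, so proper names appear in lowercase here, unlike in the listing above.

```python
import string


def map_lines(numbered_lines):
    """Map step for one node: label each word with the lines it appears in."""
    index = {}
    for number, text in numbered_lines:
        for raw in text.split():
            # Assumption: strip punctuation and ignore case.
            word = raw.strip(string.punctuation).lower()
            if not word:
                continue
            lines = index.setdefault(word, [])
            if number not in lines:
                lines.append(number)
    return index


node01 = map_lines([
    (1, "When that Aprillis with his showers swoot,"),
    (5, "When Zephyrus eke with his swoote breath"),
])
for word, lines in node01.items():
    print(word, lines)
```

Because Python dictionaries preserve insertion order, the words print in order of discovery.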

Part 3: Reduce

Create a global view of data from the local views.
Combine work from several nodes.
E.g., if
  Node01 handles lines 1 and 5
  Node02 handles lines 2 and 6
  Node03 handles lines 3 and 7
  Node04 handles lines 4 and 8
then the reduce depicts results for lines 1-8. (It is easy enough to produce this in sorted order.)
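
Combining the local views might look like the sketch below. The function name is hypothetical, and the contents of node02 are invented for illustration, since the notes do not give the text of lines 2 and 6.

```python
def reduce_indexes(partial_indexes):
    """Reduce step: merge per-node indexes into one index sorted by word."""
    merged = {}
    for partial in partial_indexes:
        for word, lines in partial.items():
            merged.setdefault(word, set()).update(lines)
    return {word: sorted(merged[word]) for word in sorted(merged)}


# Part of one node's mapping of lines 1 and 5.
node01 = {"when": [1, 5], "with": [1, 5], "his": [1, 5]}
# Hypothetical mapping output for lines 2 and 6 (not from the notes).
node02 = {"his": [2, 6], "the": [2, 6], "with": [6]}

book_index = reduce_indexes([node01, node02])
for word, lines in book_index.items():
    print(word, lines)
```

Sorting happens once, at the end of the reduce, which is why producing the index in sorted order costs little extra.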

Common MapReduce Patterns

  CRUD: create/retrieve/update/delete for datastores.
  MergeSort: sort data into any desired order.
  Index: produce an index for existing data.
  Search: produce data only if there's a match.
  Count: count instances of a search term.
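
As one illustration, the Count pattern could be sketched like this: each node counts matches in its local data (map), and the partial counts are summed (reduce). The function names and the sample data are hypothetical, and the sketch assumes whole-word, case-insensitive matching.

```python
def map_count(local_lines, term):
    """Map: count occurrences of the term in one node's local lines."""
    term = term.lower()
    return sum(line.lower().split().count(term) for line in local_lines)


def reduce_count(partial_counts):
    """Reduce: add up the per-node counts."""
    return sum(partial_counts)


# Hypothetical data held by three nodes.
nodes = [["the cat sat", "the dog"], ["a bird"], ["The end"]]
total = reduce_count(map_count(lines, "the") for lines in nodes)
print(total)
```

Search follows the same shape, with the map returning the matching lines instead of a count.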

CRUD and MapReduce

  Create a <key,value> pair.
  Retrieve a <key,value> pair.
  Update a value for a key.
  Delete a <key,value> pair.
MapReduce can implement distributed CRUD:
Create:
  Map: choose one (or more) elements to store <key,value>.
  Reduce: count number of created pairs.
Retrieve:
  Map: return value for key, nothing if no match.
  Reduce: ignore empty returns.
Update:
  Map: put new value everywhere old value lives.
  Reduce: count changes made.
Delete:
  Map: delete all instances of given key.
  Reduce: count deletions done.
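
The four operations above can be sketched over a list of in-memory dictionaries standing in for nodes. Everything here is an assumption made for illustration: the function names, the use of a CRC32 hash to choose the storing node, and the use of None to mean "nothing".

```python
import zlib


def create(nodes, key, value):
    """Map: store the pair on one chosen node. Reduce: count created pairs."""
    chosen = zlib.crc32(key.encode()) % len(nodes)  # placement rule is assumed
    nodes[chosen][key] = value
    return sum(1 if i == chosen else 0 for i in range(len(nodes)))


def retrieve(nodes, key):
    """Map: each node returns its value or None. Reduce: ignore empty returns."""
    returns = [node.get(key) for node in nodes]
    return [value for value in returns if value is not None]


def update(nodes, key, new_value):
    """Map: put the new value wherever the key lives. Reduce: count changes."""
    changes = 0
    for node in nodes:
        if key in node:
            node[key] = new_value
            changes += 1
    return changes


def delete(nodes, key):
    """Map: delete every instance of the key. Reduce: count deletions."""
    deletions = 0
    for node in nodes:
        if key in node:
            del node[key]
            deletions += 1
    return deletions


nodes = [{}, {}, {}]
created = create(nodes, "k1", "v1")
found = retrieve(nodes, "k1")
changed = update(nodes, "k1", "v2")
found_after = retrieve(nodes, "k1")
deleted = delete(nodes, "k1")
remaining = retrieve(nodes, "k1")
print(created, found, changed, found_after, deleted, remaining)
```

In each function the loop over nodes plays the role of the map, and the returned count or filtered list plays the role of the reduce.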