Parallel and Distributed Databases: An Overview of MapReduce and Dynamo, Slides of Introduction to Database Management Systems

University of California - Los Angeles (UCLA), Introduction to Database Management Systems

An introduction to parallel and distributed databases, focusing on two recent systems: MapReduce and Dynamo. MapReduce is a programming model for processing large datasets in parallel, while Dynamo is a distributed data store that offers eventual consistency. The slides cover the basics of MapReduce, its implementation, and its use cases, and explain Dynamo's design principles, such as eventual consistency and conflict resolution. Useful for students and researchers interested in distributed systems, databases, and parallel computing.

Typology: Slides

2011/2012

Uploaded on 02/12/2012

dylanx 🇺🇸

Parallel and distributed databases II

Some interesting recent systems

• MapReduce
• Dynamo
• Peer-to-peer

Then and now

A modern search engine

MapReduce

• How do I write a massively parallel data-intensive program?
  • Develop the algorithm
  • Write the code to distribute work to machines
  • Write the code to distribute data among machines
  • Write the code to retry failed work units
  • Write the code to redistribute data for a second stage of processing
  • Write the code to start the second stage after the first finishes
  • Write the code to store intermediate result data
  • Write the code to reliably store final result data

MapReduce

• Two phases
  • Map: take input data and map it to zero or more key/value pairs
  • Reduce: take key/value pairs with the same key and reduce them to a result
• MapReduce framework takes care of the rest
  • Partitioning data, repartitioning data, handling failures, tracking completion…

MapReduce

[Figure: data flowing through a Map stage and then a Reduce stage]

Example

Count the number of times each word appears on the web

Input documents:

• apple banana
• apple grape
• apple apple
• grape apple

Map output (one pair per word occurrence):

(apple, 1), (banana, 1), (apple, 1), (grape, 1), (apple, 1), (apple, 1), (grape, 1), (apple, 1)

Reduce output (one total per word):

(apple, 5), (grape, 2), (banana, 1)
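The word-count example above can be sketched in a few lines of Python. This is only a single-process illustration: `run_mapreduce` is a hypothetical stand-in for the framework, which in a real deployment would partition the input, shuffle pairs between machines, and retry failed work. The function names are assumptions made for this sketch.

```python
from collections import defaultdict

def map_words(document):
    """Map: emit one (word, 1) pair per word occurrence."""
    for word in document.split():
        yield (word, 1)

def reduce_counts(word, counts):
    """Reduce: sum all the counts that share the same key."""
    return (word, sum(counts))

def run_mapreduce(documents, mapper, reducer):
    """Toy single-process driver standing in for the framework."""
    groups = defaultdict(list)
    for doc in documents:                  # map phase
        for key, value in mapper(doc):
            groups[key].append(value)      # shuffle: group values by key
    # reduce phase: one call per distinct key
    return dict(reducer(key, values) for key, values in groups.items())

documents = ["apple banana", "apple grape", "apple apple", "grape apple"]
word_counts = run_mapreduce(documents, map_words, reduce_counts)
print(word_counts)  # {'apple': 5, 'banana': 1, 'grape': 2}
```

Only `map_words` and `reduce_counts` are specific to the problem; everything in `run_mapreduce` is the part the framework is meant to take over.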

Other MapReduce uses

• Grep
• Sort
• Analyze web graph
• Build inverted indexes
• Analyze access logs
• Document clustering
• Machine learning
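As one illustration of these uses, building an inverted index fits the same two-phase shape: map emits (word, document ID) pairs and reduce collects the IDs for each word. The sketch below is a single-process toy under assumed names (`map_postings`, `reduce_postings`, `build_inverted_index`) with a made-up three-document corpus.

```python
from collections import defaultdict

def map_postings(doc_id, text):
    """Map: emit (word, doc_id) once for each distinct word in a document."""
    for word in set(text.lower().split()):
        yield (word, doc_id)

def reduce_postings(word, doc_ids):
    """Reduce: collapse the document IDs for one word into a sorted posting list."""
    return (word, sorted(set(doc_ids)))

def build_inverted_index(corpus):
    """Toy driver: run map over every document, group by word, then reduce."""
    groups = defaultdict(list)
    for doc_id, text in corpus.items():
        for word, posting in map_postings(doc_id, text):
            groups[word].append(posting)
    return dict(reduce_postings(word, ids) for word, ids in groups.items())

# Made-up corpus: document ID -> text
corpus = {1: "apple banana", 2: "apple grape", 3: "grape apple"}
index = build_inverted_index(corpus)
print(index["grape"])  # [2, 3]
```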

Dynamo

• Always writable data store
• Do I need ACID for this?

Eventual consistency

• Weak consistency guarantee for replicated data
  • Updates initiated at any replica
  • Updates eventually reach every replica
  • If updates cease, eventually all replicas will have same state
• Tentative versus stable writes
  • Tentative writes applied in per-server partial order
  • Stable writes applied in global commit order
• Bayou system at PARC

[Figure labels: commit order; storage unit; inconsistent copies visible; local write order preserved]

Eventual consistency

[Figure: a master storage unit and its replicas start from the record (Joe, 22, Arizona). Three updates (22 → 32, Arizona → Montana, Joe → Bob) reach the replicas in different orders, so the replicas pass through different intermediate states such as (Joe, 32, Arizona), (Joe, 22, Montana), (Joe, 32, Montana), and (Bob, 22, Montana).]

All replicas end up in same state: (Bob, 32, Montana)
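The convergence in this example can be checked with a short sketch. It assumes the record is a dictionary with hypothetical field names (`name`, `age`, `state`) and that each update overwrites exactly one field; it then applies the three updates in every possible order.

```python
from itertools import permutations

initial = {"name": "Joe", "age": 22, "state": "Arizona"}

# Each update overwrites a different field, as in the slide's example.
updates = [
    ("age", 32),
    ("state", "Montana"),
    ("name", "Bob"),
]

def apply_updates(record, ordered_updates):
    """Apply updates in the given order and return the resulting record."""
    result = dict(record)
    for field, value in ordered_updates:
        result[field] = value
    return result

# Collect the final state reached under every delivery order.
final_states = {
    tuple(apply_updates(initial, order).values())
    for order in permutations(updates)
}
print(final_states)  # {('Bob', 32, 'Montana')}
```

Every order converges here only because the three updates touch different fields. Two updates to the same field would not commute, which is where a global commit order or a conflict-resolution rule becomes necessary.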

How to resolve conflicts?

• Commutative operations: allow both
  • Add “Fight Club” to shopping cart
  • Add “Legends of the Fall” to shopping cart
  • Doesn’t matter what order they occur in
• Thomas write rule: take the last update
  • That’s the one we “meant” to have stick
• Let the application cope with it
  • Expose possible alternatives to application
  • Application must write back one answer
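The three strategies above might look roughly like this in code. Everything here is an assumption made for illustration: carts are modeled as Python sets, versions as (timestamp, value) pairs with made-up timestamps, and the application's policy as a caller-supplied function. It is not Dynamo's actual interface.

```python
def merge_carts(cart_a, cart_b):
    """Commutative adds: the union is the same whichever side merges first."""
    return cart_a | cart_b

def last_writer_wins(version_a, version_b):
    """Thomas write rule: keep the (timestamp, value) pair with the later timestamp."""
    return max(version_a, version_b, key=lambda version: version[0])

def resolve_in_application(versions, choose):
    """Expose every conflicting version; the application writes back one answer."""
    return choose(versions)

cart_1 = {"Fight Club"}
cart_2 = {"Legends of the Fall"}
merged = merge_carts(cart_1, cart_2)

# Timestamps 10 and 12 are invented; the later write wins.
address = last_writer_wins((10, "Arizona"), (12, "Montana"))

# Hypothetical application policy: keep the largest cart.
picked = resolve_in_application([cart_1, merged], choose=lambda vs: max(vs, key=len))

print(merged == merge_carts(cart_2, cart_1))  # True
print(address)                                # (12, 'Montana')
```

Last-writer-wins silently discards the earlier update, so it suits values where only the latest state matters; the commutative merge keeps both additions.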

Peer-to-peer

• Great technology
• Shady business model
• Focus on the technology for now

Peer-to-peer origins

• Where can I find songs for download?

[Figure: a query (Q?) sent through a web interface, through Napster, and through Gnutella]

Characteristics

• Peers both generate and process messages
  • Server + client = “servent”
• Massively parallel
• Distributed
• Data-centric
  • Route queries and data, not packets

Gnutella

Joining the network

[Figure: ping messages spread from peer to peer and pong messages travel back]

Search

[Figure: a query (Q?) forwarded between peers with TTL = 4]
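A TTL-limited flood of this kind can be modeled as a breadth-first traversal with a hop budget. The sketch below is a simplification under stated assumptions: the overlay is a hypothetical six-peer chain, TTL is read as "the query may travel this many hops", and peers drop queries they have already seen. Gnutella's message format and routing of replies are not modeled.

```python
from collections import deque

def flood_query(graph, start, ttl):
    """Return the set of peers a query reaches when flooded with a hop limit."""
    reached = {start}
    frontier = deque([(start, ttl)])
    while frontier:
        peer, remaining = frontier.popleft()
        if remaining == 0:
            continue  # TTL exhausted: this peer does not forward the query
        for neighbor in graph[peer]:
            if neighbor not in reached:  # duplicates are dropped
                reached.add(neighbor)
                frontier.append((neighbor, remaining - 1))
    return reached

# Hypothetical overlay: a chain of six peers, A - B - C - D - E - F.
overlay = {
    "A": ["B"], "B": ["A", "C"], "C": ["B", "D"],
    "D": ["C", "E"], "E": ["D", "F"], "F": ["E"],
}
print(sorted(flood_query(overlay, "A", ttl=4)))  # ['A', 'B', 'C', 'D', 'E']
```

Peer F sits five hops from A, so a query with TTL = 4 never reaches it. A file held only by F would go unfound even though the network is connected.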

Download

Failures

Supernodes

Kazaa, Yang and Garcia-Molina 2003

Some interesting observations

• Most peers are short-lived
  • Average up-time: 60 minutes
  • For a 100K network, this implies a churn rate of about 1,600 nodes per minute (100,000 / 60 ≈ 1,667)
  • Saroiu et al 2002
• Most peers are “freeloaders”
  • 70 percent of peers share no files
  • Most results come from 1 percent of peers
  • Adar and Huberman 2000
• Network tends toward a power-law topology
  • Power-law: the $n$th most connected peer has $k / n^{\alpha}$ connections
  • A few peers have many connections, most peers have few
  • Ripeanu and Foster 2002
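Two of these observations are simple arithmetic and can be checked directly. The churn estimate assumes the whole population turns over once per average up-time, which gives 100,000 / 60 ≈ 1,667 departures per minute, in line with the slide's round figure. For the power law, the values of `k` and `alpha` below are invented purely to show the shape of the curve.

```python
def churn_per_minute(network_size, avg_uptime_minutes):
    """Peers leaving per minute if every peer stays for the average up-time."""
    return network_size / avg_uptime_minutes

def power_law_connections(rank, k, alpha):
    """Connections of the rank-th most connected peer under k / rank**alpha."""
    return k / rank ** alpha

rate = churn_per_minute(100_000, 60)
print(round(rate))  # 1667

# k and alpha are made-up values chosen only for illustration.
degrees = [power_law_connections(n, k=1000, alpha=1.0) for n in (1, 10, 100, 1000)]
print(degrees)  # [1000.0, 100.0, 10.0, 1.0]
```

With these illustrative parameters the best-connected peer has a thousand times as many connections as the thousandth, which is the "a few peers have many, most have few" pattern.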

“Structured” networks

• Idea: form a structured topology that gives certain performance guarantees
  • Number of hops needed for searches
  • Amount of state required by nodes
  • Maintenance cost
  • Tolerance to churn
• Part of a larger application
  • Peer-to-peer substrate for data management

Distributed Hash Tables

• Basic operation
  • Given key K, return associated value V
• Examples of DHTs
  • Chord
  • CAN
  • Tapestry
  • Pastry
  • Koorde
  • Kelips
  • Kademlia
  • Viceroy
  • Freenet

Chord

[Figure: identifier ring numbered from 0 to $2^m - 1$]

Node ID = Hash(IP)

Object ID = Hash(Key)

Stoica et al 2001
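A minimal sketch of how both kinds of ID land on the same ring, assuming SHA-1 as the hash and a small illustrative `M = 16` bits. In Chord an object is stored at the first node whose ID is equal to or follows the object's ID (its successor). The IP addresses and the key name are invented for the example.

```python
import hashlib
from bisect import bisect_left

M = 16  # identifier bits; a small illustrative value

def ring_id(text, m=M):
    """Hash a string onto the identifier ring [0, 2**m - 1] using SHA-1."""
    digest = hashlib.sha1(text.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % (2 ** m)

def successor(node_ids, ident):
    """First node ID at or after ident, wrapping past the top of the ring."""
    ordered = sorted(node_ids)
    index = bisect_left(ordered, ident)
    return ordered[index % len(ordered)]

# Hypothetical peers, identified by IP address: Node ID = Hash(IP).
nodes = {ring_id(ip): ip for ip in ("10.0.0.1", "10.0.0.2", "10.0.0.3", "10.0.0.4")}

# Object ID = Hash(Key); the object lives on the successor of that ID.
key_id = ring_id("some-song.mp3")
owner = nodes[successor(nodes.keys(), key_id)]
print(key_id, owner)
```

Because nodes and objects share one ID space, a node joining or leaving only affects the keys between it and its neighbor on the ring.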

Searching

[Figure: lookup for key K on the ring]

O(N)

Better searching

[Figure: lookup for key K on the ring]

O(log N)

Finger table: the $i$th entry is the node that succeeds me by at least $2^{i-1}$; $m$ entries total

Joining
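The finger-table search, and the effect of a node joining, can be sketched with a static model of the ring. `ChordRing` is a hypothetical class written for this illustration, with node IDs chosen by hand rather than hashed. Joining is modeled naively by rebuilding every finger table; the actual Chord protocol updates them incrementally through stabilization, which is not shown.

```python
from bisect import bisect_left

class ChordRing:
    """Static model of a Chord ring: m-bit IDs, one finger table per node."""

    def __init__(self, m, node_ids):
        self.m = m
        self.size = 2 ** m
        self.nodes = sorted(node_ids)
        # fingers[n][i - 1] is the first node at or after n + 2**(i - 1), i = 1..m
        self.fingers = {
            n: [self.successor((n + 2 ** (i - 1)) % self.size) for i in range(1, m + 1)]
            for n in self.nodes
        }

    def successor(self, ident):
        """First node at or after ident, wrapping around the ring."""
        index = bisect_left(self.nodes, ident)
        return self.nodes[index % len(self.nodes)]

    def distance(self, a, b):
        """Clockwise distance from a to b."""
        return (b - a) % self.size

    def lookup(self, start, key):
        """Return (owner, hops) for key, routing only through finger tables."""
        current, hops = start, 0
        while True:
            if key == current:
                return current, hops
            nxt = self.fingers[current][0]  # immediate successor
            span = self.distance(current, nxt) or self.size
            if self.distance(current, key) <= span:
                return nxt, hops + 1        # key lies in (current, nxt]
            # Otherwise jump to the finger that most closely precedes the key.
            candidates = [f for f in self.fingers[current]
                          if 0 < self.distance(current, f) < self.distance(current, key)]
            current = max(candidates, key=lambda f: self.distance(current, f))
            hops += 1

ring = ChordRing(6, [1, 8, 14, 21, 32, 38, 42, 48, 51, 56])
print(ring.lookup(8, 54))    # (56, 3): 8 -> 42 -> 51 -> 56

# Naive join: node 55 arrives and all tables are rebuilt from scratch.
joined = ChordRing(6, ring.nodes + [55])
print(joined.lookup(8, 54))  # (55, 3): key 54 now belongs to the new node
```

Each jump at least halves the remaining clockwise distance to the key, which is where the O(log N) hop count comes from, at the cost of m finger entries of state per node.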
