Lecture Notes for CIS 700 Spring 2007 - Distributed and Parallel Databases, Study notes of Computer Science

University of Pennsylvania (UPenn), Computer Science

A set of lecture notes for the CIS 700 course at the University of Pennsylvania, taught by Boon Thau Loo during the Spring 2007 semester. The notes cover distributed and parallel databases, including their motivation, design, and comparison to traditional databases. They also include references to related textbooks and papers, as well as important definitions and concepts.

Typology: Study notes

Pre 2010

Uploaded on 03/28/2010


Boon Thau Loo

Spring 2007

Lecture 2

Note: Several slides are courtesy of CSE 599C (Winter '06) and CSE 544 (Fall '06) from UW-Seattle, lecture slides from http://www.cs.wisc.edu/~dbbook and cs186 Fall '06 lectures from UC Berkeley.

,.-/-+01/-/24345634-798

:

Reminder:

;

Introduction email (year, backgro und, research

interest, advisor, audit/enroll )

<

Office hours: Wed 3-4 pm (605 Levine)

=

http://www.cis.upenn.edu/~boonloo/cis700-

sp07/ideas/ideas.html

>

All slides will be online (within UPenn and

Drexel)

Last Week

- Review of Databases
  - SQL, query plan, relational algebra, architecture of a database
- Reference for students with systems background:
  - Hellerstein, Stonebraker. Anatomy of a Database System. http://www.cs.brown.edu/courses/cs295-11/anatomyofadatabase.pdf
  - http://redbook.cs.berkeley.edu

Today

- Overview of Distributed Databases
- Overview of Parallel Databases

Why is this relevant to CIS 700?

- Understand the motivation and design of distributed and parallel databases
- Relate to the design choices in the rest of the course. For example:
  - Is PIER/P2/TinyDB more similar to a parallel or to a distributed database?
  - What techniques carry over?
  - What are we gaining/sacrificing as we scale up to millions, or scale down to motes?

fQSA9CTUgQih+jCTE

M

f@4C@h+@BA'EkA

l

Background textbooks:

m

Ramakrishnan and G ehrke, Database Management Syste ms,

3rd edition, Chapter 22

I

M. Tamer Özsu and Patrick Valduriez, Principles of

Distributed Database Systems. Prentice Hall, 1999

l

Papers:

m

C. Mohan, B. Lindsay, and R. Obermarck. Transaction

Management in the R* Distributed Database Management

System. ACM Transactions On Database Systems 11(4), 1986

m

M. Stonebraker, P. M. Aoki, W . Litwin, A. Pfeffer, A. Sah, J.

Sidell, C. Staelin, and A. Yu. Mariposa: A Wide-area

Distributed Database System. VLDB Journal (5)1, 1996

l

Great lecture notes:

I

http://www.stanford.edu/class/cs347/


Distributed DBMS

- Important: many forms and definitions
- Our definition: "shared nothing" infrastructure
  - Multiple machines connected with a network

Reasons for a Distributed DBMS

- Scalability (e.g. Amazon, eBay, Google)
  - Many small servers are cheaper than a large mainframe
  - Need to scale incrementally
- Inherent distribution
  - Large organizations have data at multiple locations (different offices) -> original motivation
  - Web-based and Internet-based applications
  - Different types of data in different DBMSs

Ideals of a Distributed DBMS

- Recall data independence: users write SQL and do not worry about how data is physically stored.
- Ideal:
  - Distributed data independence:
    - Location transparency
    - Fragmentation transparency
  - Performance transparency
    - Distributed query optimizer ensures good performance no matter where the query is submitted
  - Distributed transaction atomicity

Distributing Data

- Fragmentation:
  - Horizontal
  - Vertical: Lossless-join, TIDs
- Replication:
  - Gives increased availability
  - Faster query evaluation
  - Synchronous vs Asynchronous
    - Vary in how current copies are
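
The two fragmentation styles can be illustrated with a small sketch. The relation contents, the tuple-id (tid) column, and the variable names below are hypothetical and chosen only for illustration; the point is that a horizontal fragmentation is undone by a union, while a vertical fragmentation is lossless only if each fragment carries a key or TID to join on.

```python
# Hypothetical toy relation: (tid, sid, sname, rating, age).
sailors = [
    (0, 22, "dustin", 7, 45.0),
    (1, 31, "lubber", 8, 55.5),
    (2, 58, "rusty", 3, 35.0),
]

# Horizontal fragmentation: each fragment holds whole tuples.
low = [t for t in sailors if t[3] < 5]
high = [t for t in sailors if t[3] >= 5]
# The union of the fragments gives back the relation.
horizontal_ok = sorted(low + high) == sorted(sailors)

# Vertical fragmentation: each fragment holds some columns plus the tid.
frag_a = [(tid, sid, rating) for tid, sid, _, rating, _ in sailors]
frag_b = [(tid, sname, age) for tid, _, sname, _, age in sailors]

# Lossless join: joining the fragments on tid recovers the original tuples.
b_by_tid = {tid: (sname, age) for tid, sname, age in frag_b}
rebuilt = [
    (tid, sid, b_by_tid[tid][0], rating, b_by_tid[tid][1])
    for tid, sid, rating in frag_a
]
vertical_ok = rebuilt == sailors
```

Without the tid (or another key) in both vertical fragments, there would be nothing to join on and the decomposition could lose information.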

Distributed Queries

- Sailors(SID, sname, rating, age)
- Horizontal fragmentation: tuples with rating < 5 at site A, >= 5 at site B
  - Must compute SUM(age), COUNT(age) at both sites A and B.
  - If WHERE contains just rating > 6, run the query only at site B
- Vertical fragmentation:
  - TableA(SID, rating), TableB(SID, sname, age)
  - Must reconstruct the relation by a join on SID, then evaluate the query
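
A minimal sketch of the horizontally fragmented case follows. The query itself is not preserved in these notes; the mention of SUM(age) and COUNT(age) suggests an average over age with a range predicate on rating, and that is assumed here. The site contents, the function names, and the split constant are hypothetical.

```python
import math

SPLIT = 5  # site A holds rating < 5, site B holds rating >= 5

# Hypothetical fragments of Sailors(SID, sname, rating, age).
site_a = [(58, "rusty", 3, 35.0), (71, "zorba", 4, 16.0)]
site_b = [(22, "dustin", 7, 45.0), (31, "lubber", 8, 55.0)]
fragments = {"A": site_a, "B": site_b}


def partial_sum_count(fragment, lo, hi):
    """Runs locally at one site: SUM(age), COUNT(age) for lo < rating < hi."""
    ages = [age for _, _, rating, age in fragment if lo < rating < hi]
    return sum(ages), len(ages)


def avg_age(lo=-math.inf, hi=math.inf):
    """Average age for lo < rating < hi, contacting only the relevant sites."""
    contacted = []
    if lo < SPLIT:   # the predicate can match tuples stored at site A
        contacted.append("A")
    if hi > SPLIT:   # the predicate can match tuples stored at site B
        contacted.append("B")
    partials = [partial_sum_count(fragments[s], lo, hi) for s in contacted]
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return (total / count if count else None), contacted


both_sites = avg_age(lo=3, hi=8)   # needs partial results from A and B
only_b = avg_age(lo=6)             # rating > 6 can only match site B
```

The average cannot be computed by averaging the per-site averages, which is why each site ships a (sum, count) pair instead.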

Review: Query Evaluation

SELECT S.sname FROM Supplier S, Supplies S1 WHERE S.sno = S1.sno AND S.scity = 'Seattle'

- Autonomy: different administrative domains
  - Cannot always assume full cooperation
  - Do not require distributed transactions
- Heterogeneity:
  - Different capabilities at different locations
  - Different data types, different semantics
- Large-scale
  - Internet-scale query processor

- Goal is to get rid of the single-administrative-domain assumption:
  - Dynamic data allocation
  - Multiple administrative structures
  - Heterogeneity of nodes
- Interesting idea based on economic models
  - Processing sites buy and sell data and query processing services.
  - Sites declare their local costs for a query based on:
    - Estimates of resource consumption
    - Runtime constraints (e.g. current system load)
    - Relationships with competition sites

http://mariposa.cs.berkeley.edu/
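
The economic idea above can be pictured with a toy sketch. This is not Mariposa's actual protocol or pricing scheme: the `Site` class, the pricing rule in `bid`, and the budget-based `choose_site` broker are all hypothetical, and only illustrate sites declaring a cost from a resource estimate and their current load.

```python
from dataclasses import dataclass


@dataclass
class Site:
    name: str
    est_resource_cost: float  # estimated resource consumption for the query
    load: float               # current system load, in [0, 1)

    def bid(self):
        # Hypothetical pricing rule: a busier site asks a higher price.
        return self.est_resource_cost / (1.0 - self.load)


def choose_site(sites, budget):
    """Toy broker: pick the cheapest bid that fits the budget, else None."""
    best = min(sites, key=lambda s: s.bid())
    return best if best.bid() <= budget else None


sites = [Site("A", 10.0, 0.5), Site("B", 12.0, 0.0)]
winner = choose_site(sites, budget=15.0)      # A bids 20.0, B bids 12.0
no_winner = choose_site(sites, budget=5.0)    # no bid fits the budget
```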

- Cohera -> PeopleSoft -> Oracle
- From the Redbook:
  - “The Mariposa system was commercialized as Cohera (later bought by PeopleSoft) and was demonstrated to work across administrative domains in fields
  - “…..flexibility and efficiency of its computation economy ideas have yet to be significantly tested
  - “…..unclear whether corporate IT is ready for significant investments in federated query processing.”
  - “It is possible that we will see ideas from Mariposa re-emerge in the peer-to-peer space, where there is significant grassroots interest.”

- Overview of Distributed Databases
- Overview of Parallel Databases

- Background:
  - Ramakrishnan and Gehrke, Database Management Systems, Chapter 22
  - David J. DeWitt, Jim Gray. Parallel Database Systems: The Future of High Performance Database Systems. Commun. ACM, 35(6), 1992, 85-98.
  - Goetz Graefe. Encapsulation of Parallelism in the Volcano Query Processing System. Proc. SIGMOD Conference, 1990, 102-111.

- Performance: loading data, building indexes, executing queries
- Data is distributed within a single site
- Distributed solely for performance

Parallelism is natural to DBMS processing:

- Pipelined parallelism: many machines, each doing one step in a multi-step process
- Partition parallelism: many machines doing the same thing to different pieces of the data
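
The two forms can be contrasted in a short sketch. Everything here is an illustrative assumption: threads stand in for machines in the partitioned case, and the generator chain only shows the tuple-at-a-time structure of a pipeline (it runs in a single thread, whereas a real system would place each stage on its own machine).

```python
from concurrent.futures import ThreadPoolExecutor

rows = list(range(1, 101))


# Partition parallelism: the same operator runs on different pieces of data.
def count_even(chunk):
    return sum(1 for r in chunk if r % 2 == 0)


chunks = [rows[i::4] for i in range(4)]  # four partitions of the input
with ThreadPoolExecutor(max_workers=4) as pool:
    partition_result = sum(pool.map(count_even, chunks))


# Pipelined parallelism: each stage is one step of a multi-step process and
# consumes the previous stage's output one tuple at a time.
def scan(data):
    for r in data:
        yield r


def select_even(stream):
    for r in stream:
        if r % 2 == 0:
            yield r


def count(stream):
    return sum(1 for _ in stream)


pipeline_result = count(select_even(scan(rows)))
```

Both plans compute the same answer; they differ only in how the work is split across workers.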

- DBMSs are the most successful application of parallelism
  - Every major DBMS vendor has some parallel product
  - Key concepts are modified and reused in all major search engines
- Reasons for success:
  - Natural pipelining
  - Partitioned parallelism
  - Inexpensive hardware
  - Users / app-programmers don't have to think about parallelism

- Intra-operator parallelism
  - Get all machines to compute a given operation (scan, sort, join)
- Inter-operator parallelism
  - Each operator runs at a different site
- Inter-query parallelism
  - Different queries run at different sites
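
As one possible illustration of intra-operator parallelism, the sketch below splits a single join across several workers by hash-partitioning both inputs on the join key. The function names, the tuple layout (join key in position 0), and the sample data are hypothetical; the loop over partition pairs runs sequentially here, and each iteration stands in for work that could run on its own machine.

```python
def hash_partition(rows, n):
    """Route each row to one of n partitions by hashing its join key."""
    parts = [[] for _ in range(n)]
    for row in rows:
        parts[hash(row[0]) % n].append(row)
    return parts


def local_join(left, right):
    """Ordinary hash join run by one worker on its own pair of partitions."""
    index = {}
    for l in left:
        index.setdefault(l[0], []).append(l)
    return [l + r[1:] for r in right for l in index.get(r[0], [])]


def parallel_join(left, right, n=3):
    lparts = hash_partition(left, n)
    rparts = hash_partition(right, n)
    out = []
    # Matching keys always land in the same partition number, so each pair
    # can be joined independently of the others.
    for lp, rp in zip(lparts, rparts):
        out.extend(local_join(lp, rp))
    return out


suppliers = [(1, "acme"), (2, "bolt")]
supplies = [(1, "p1"), (1, "p2"), (3, "p9")]
joined = sorted(parallel_join(suppliers, supplies))
```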

- Main idea:
  - Scan in parallel
  - Range re-partition
  - At each receiving node, local sorting
- Problem: Skew!
- Solution: "sample" the data at start to determine partition points
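
The steps above can be sketched as follows. This is a single-process illustration under assumptions: lists stand in for nodes, and the function name, `sample_size`, and the choice of evenly spaced sample positions as split points are hypothetical details, not taken from the slides.

```python
import bisect
import random


def parallel_sort(partitions, num_nodes, sample_size=32, seed=0):
    """Sample, range re-partition, then sort locally at each receiving node."""
    rng = random.Random(seed)

    # Sample each input partition to estimate the data distribution; the
    # split points come from the sample so that skewed data still spreads
    # roughly evenly across nodes.
    sample = sorted(
        x for p in partitions for x in rng.sample(p, min(sample_size, len(p)))
    )
    if sample:
        splits = [sample[len(sample) * i // num_nodes] for i in range(1, num_nodes)]
    else:
        splits = []

    # Scan (in parallel in a real system) and range re-partition.
    buckets = [[] for _ in range(num_nodes)]
    for p in partitions:
        for x in p:
            buckets[bisect.bisect_right(splits, x)].append(x)

    # Local sort at each receiving node.
    for b in buckets:
        b.sort()
    return buckets


data = list(range(1000))
random.Random(1).shuffle(data)
partitions = [data[i::4] for i in range(4)]
buckets = parallel_sort(partitions, num_nodes=4)
flat = [x for b in buckets for x in b]
```

Because the buckets cover disjoint, increasing key ranges, concatenating them in node order yields the globally sorted result with no merge step.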