Lecture Notes for CIS 700 Spring 2007 - Distributed and Parallel Databases, Study notes of Computer Science

Lecture notes for the CIS 700 course at the University of Pennsylvania, taught by Boon Thau Loo during the Spring 2007 semester. The notes cover distributed and parallel databases, including their motivation, design, and comparison to traditional databases. The document also includes references to related textbooks and papers, as well as important definitions and concepts.

Typology: Study notes

Pre 2010

Uploaded on 03/28/2010

koofers-user-cng

Boon Thau Loo
Spring 2007
Lecture 2
Note: Several slides are courtesy of CSE 599C (Winter '06) and CSE 544 (Fall '06) from UW-Seattle, lecture slides from http://www.cs.wisc.edu/~dbbook, and CS186 Fall '06 lectures from UC Berkeley.

Reminder:
- Introduction email (year, background, research interest, advisor, audit/enroll)
- Office hours: Wed 3-4 pm (605 Levine)
- http://www.cis.upenn.edu/~boonloo/cis700-sp07/ideas/ideas.html
- All slides will be online (within UPenn and Drexel)

Last Week

- Review of Databases
  - SQL, query plan, relational algebra, architecture of a database
- Reference for students with systems background:
  - Hellerstein, Stonebraker. Anatomy of a Database System. http://www.cs.brown.edu/courses/cs295-11/anatomyofadatabase.pdf
  - http://redbook.cs.berkeley.edu

Today

- Overview of Distributed Databases
- Overview of Parallel Databases

Why is this relevant to CIS 700?

- Understand the motivation and design of distributed and parallel databases
- Relate to the design choices in the rest of the course. For example:
  - Is PIER/P2/TinyDB more similar to a parallel or a distributed database?
  - What techniques carry over?
  - What are we gaining/sacrificing as we scale up to millions, or scale down to motes?

Distributed Databases

- Background textbooks:
  - Ramakrishnan and Gehrke, Database Management Systems, 3rd edition, Chapter 22
  - M. Tamer Özsu and Patrick Valduriez, Principles of Distributed Database Systems. Prentice Hall, 1999
- Papers:
  - C. Mohan, B. Lindsay, and R. Obermarck. Transaction Management in the R* Distributed Database Management System. ACM Transactions on Database Systems 11(4), 1986
  - M. Stonebraker, P. M. Aoki, W. Litwin, A. Pfeffer, A. Sah, J. Sidell, C. Staelin, and A. Yu. Mariposa: A Wide-area Distributed Database System. VLDB Journal 5(1), 1996
- Great lecture notes:
  - http://www.stanford.edu/class/cs347/

Distributed DBMS

- Important: many forms and definitions
- Our definition: "shared nothing" infrastructure
  - Multiple machines connected with a network

Reasons for a Distributed DBMS

- Scalability (e.g., Amazon, eBay, Google)
  - Many small servers are cheaper than one large mainframe
  - Need to scale incrementally
- Inherent distribution
  - Large organizations have data at multiple locations (different offices) -> original motivation
  - Web-based and Internet-based applications
  - Different types of data in different DBMSs

Ideals of a Distributed DBMS

- Recall data independence: users write SQL and do not worry about how data is physically stored.
- Ideal:
  - Distributed data independence:
    - Location transparency
    - Fragmentation transparency
  - Performance transparency
    - Distributed query optimizer ensures good performance no matter where the query is submitted
  - Distributed transaction atomicity

Distributing Data

- Fragmentation:
  - Horizontal
  - Vertical: lossless-join, TIDs
- Replication:
  - Gives increased availability
  - Faster query evaluation
  - Synchronous vs. asynchronous
    - Vary in how current copies are

Distributed Queries

- Sailors(SID, sname, rating, age)
- Horizontal fragmentation: tuples with rating < 5 at site A, >= 5 at site B
  - Must compute SUM(age), COUNT(age) at both sites A and B when the selection spans both fragments (the partial results combine into a global AVG(age))
  - If WHERE contains just rating > 6, run the query only at site B
- Vertical fragmentation:
  - TableA(SID, rating), TableB(SID, sname, age)
  - Must reconstruct the relation by a join on SID, then evaluate the query
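
The horizontal-fragmentation case can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the sample tuples, the function names, and the use of in-memory lists to stand in for sites A and B are all hypothetical, and ratings are assumed to be integers.

```python
# Hypothetical in-memory fragments of Sailors(SID, sname, rating, age).
# Site A holds tuples with rating < 5, site B holds rating >= 5.
SITE_A = [(1, "dustin", 3, 45.0), (2, "lubber", 4, 55.5)]
SITE_B = [(3, "rusty", 7, 35.0), (4, "horatio", 9, 25.5), (5, "zorba", 5, 30.0)]


def local_sum_count(fragment, low, high):
    """Return (SUM(age), COUNT(age)) for tuples with low < rating < high."""
    ages = [age for (_sid, _name, rating, age) in fragment if low < rating < high]
    return sum(ages), len(ages)


def sites_for_range(low, high):
    """Pick only the fragments that can hold an integer rating r with low < r < high."""
    sites = []
    if low + 1 < 5:      # some qualifying rating is below 5 -> site A
        sites.append(SITE_A)
    if high - 1 >= 5:    # some qualifying rating is 5 or more -> site B
        sites.append(SITE_B)
    return sites


def distributed_avg_age(low, high):
    """Combine per-site partial aggregates into a global AVG(age)."""
    partials = [local_sum_count(site, low, high) for site in sites_for_range(low, high)]
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count if count else None


# A range that spans both fragments needs partial results from A and B.
print(distributed_avg_age(3, 7))
# "rating > 6" touches only site B.
print(len(sites_for_range(6, float("inf"))))
```

Shipping SUM and COUNT instead of a local average matters because averages of unequal-sized fragments cannot simply be averaged again.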

Review: Query Evaluation

SELECT Sname
FROM Supplier S, Supplies S1
WHERE S.sno = S1.sno AND S.scity = 'Seattle'
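
To make the query concrete, here is a minimal sketch that runs it against an in-memory SQLite database using Python's standard `sqlite3` module. It assumes the second table alias is meant to be `S1` and that the city column is named `scity`; the schema, the `pno` column, and all sample rows are hypothetical.

```python
import sqlite3

# Hypothetical schema and rows; only the table names, sname, and the sno join
# column come from the slide. The scity and pno columns are assumptions.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE Supplier (sno INTEGER PRIMARY KEY, sname TEXT, scity TEXT);
    CREATE TABLE Supplies (sno INTEGER, pno INTEGER);
    INSERT INTO Supplier VALUES (1, 'Acme', 'Seattle');
    INSERT INTO Supplier VALUES (2, 'Globex', 'Portland');
    INSERT INTO Supplier VALUES (3, 'Initech', 'Seattle');
    INSERT INTO Supplies VALUES (1, 10);
    INSERT INTO Supplies VALUES (2, 10);
    INSERT INTO Supplies VALUES (1, 20);
    """
)

QUERY = """
    SELECT S.sname
    FROM Supplier S, Supplies S1
    WHERE S.sno = S1.sno AND S.scity = 'Seattle'
"""

# Declarative query: the DBMS decides how to join and filter.
rows = conn.execute(QUERY).fetchall()

# Ask SQLite which plan it chose (the text varies between SQLite versions).
plan = conn.execute("EXPLAIN QUERY PLAN " + QUERY).fetchall()

print(rows)
for step in plan:
    print(step)
```

The plan output is the single-site analogue of what a distributed optimizer must decide, with the added question of where each step runs.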

- Autonomy: different administrative domains
  - Cannot always assume full cooperation
  - Do not require distributed transactions
- Heterogeneity:
  - Different capabilities at different locations
  - Different data types, different semantics
- Large-scale
  - Internet-scale query processor

- Goal is to get rid of the single-administrative-domain assumption:
  - Dynamic data allocation
  - Multiple administrative structures
  - Heterogeneity of nodes
- Interesting idea based on economic models
  - Processing sites buy and sell data and query processing services.
  - Sites declare their local costs for a query based on:
    - Estimates of resource consumption
    - Runtime constraints (e.g., current system load)
    - Relationships with competing sites
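
The economic idea can be illustrated with a toy bidding round. This is not Mariposa's actual protocol or pricing rule: the `Site` class, the `bid` formula, and the `choose_site` broker are hypothetical, chosen only to show how resource estimates, current load, and a competitive stance could feed into a declared cost.

```python
from dataclasses import dataclass


@dataclass
class Site:
    name: str
    est_cpu_seconds: float   # estimate of resource consumption
    current_load: float      # runtime constraint, 0.0 (idle) to 1.0 (saturated)
    markup: float            # pricing stance relative to competing sites

    def bid(self) -> float:
        """Declare a local cost for the query (hypothetical pricing formula)."""
        load_penalty = 1.0 + self.current_load
        return self.est_cpu_seconds * load_penalty * self.markup


def choose_site(sites):
    """A broker buys the work from the cheapest bidder."""
    return min(sites, key=lambda s: s.bid())


sites = [
    Site("A", est_cpu_seconds=2.0, current_load=0.5, markup=1.0),
    Site("B", est_cpu_seconds=4.0, current_load=0.0, markup=1.0),
    Site("C", est_cpu_seconds=1.0, current_load=1.0, markup=2.0),
]
winner = choose_site(sites)
print(winner.name, winner.bid())
```

Each site prices the work using only local knowledge, which is what lets the scheme operate without a single administrative domain.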

http://mariposa.cs.berkeley.edu/

- Cohera -> PeopleSoft -> Oracle
- From the Redbook:
  - "The Mariposa system was commercialized as Cohera (later bought by PeopleSoft) and was demonstrated to work across administrative domains in fields"
  - "... flexibility and efficiency of its computation economy ideas have yet to be significantly tested"
  - "... unclear whether corporate IT is ready for significant investments in federated query processing."
  - "It is possible that we will see ideas from Mariposa re-emerge in the peer-to-peer space, where there is significant grassroots interest."

- Overview of Distributed Databases
- Overview of Parallel Databases

- Background:
  - Ramakrishnan and Gehrke, Database Management Systems, Chapter 22
  - David J. DeWitt, Jim Gray. Parallel Database Systems: The Future of High Performance Database Systems. Commun. ACM, 35(6), 1992, 85-98.
  - Goetz Graefe. Encapsulation of Parallelism in the Volcano Query Processing System. Proc. SIGMOD Conference, 1990, 102-111.

- Performance: loading data, building indexes, executing queries
- Data is distributed within a single site
- Distributed solely for performance

- Parallelism is natural to DBMS processing:
  - Pipelined parallelism: many machines, each doing one step in a multi-step process
  - Partition parallelism: many machines doing the same thing to different pieces of the data
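
The two forms of parallelism can be contrasted in a small sketch. Threads stand in for machines here, so this shows the structure rather than any real speedup. The function names and the sum-of-squares workload are hypothetical, and Python 3.8 or later is assumed.

```python
import queue
import threading
from concurrent.futures import ThreadPoolExecutor

_DONE = object()  # sentinel marking the end of a stream


def partition_parallel_count(partitions, predicate):
    """Partition parallelism: every worker runs the same scan on its own partition."""
    def scan(part):
        return sum(1 for row in part if predicate(row))

    with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
        return sum(pool.map(scan, partitions))


def pipelined_sum_of_squares(rows):
    """Pipelined parallelism: each thread does one step and streams results onward."""
    q1, q2 = queue.Queue(), queue.Queue()
    result = []

    def scan():
        for row in rows:
            q1.put(row)
        q1.put(_DONE)

    def square():
        while (item := q1.get()) is not _DONE:
            q2.put(item * item)
        q2.put(_DONE)

    def total():
        acc = 0
        while (item := q2.get()) is not _DONE:
            acc += item
        result.append(acc)

    threads = [threading.Thread(target=step) for step in (scan, square, total)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return result[0]


print(partition_parallel_count([[1, 2, 3], [4, 5], [6]], lambda x: x % 2 == 0))
print(pipelined_sum_of_squares([1, 2, 3]))
```

In the partitioned version every worker runs identical code on different data, while in the pipelined version each worker runs different code on the same stream.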

- DBMSs are the most successful application of parallelism
  - Every major DBMS vendor has some parallel product
  - Key concepts are modified and reused in all major search engines
- Reasons for success:
  - Natural pipelining
  - Partitioned parallelism
  - Inexpensive hardware
  - Users / app programmers don't have to think in parallel

- Intra-operator parallelism
  - Get all machines to compute a given operation (scan, sort, join)
- Inter-operator parallelism
  - Each operator runs at a different site
- Inter-query parallelism
  - Different queries run at different sites

- Main idea:
  - Scan in parallel
  - Range re-partition
  - At each receiving node, local sorting
- Problem: skew!
- Solution: "sample" the data at start to determine partition points
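
A minimal single-process sketch of this sort follows, with threads standing in for nodes. The function names, the sample size, and the fixed random seed are assumptions made for illustration. The initial parallel scan is not modeled, since the input is already one in-memory list.

```python
import bisect
import random
from concurrent.futures import ThreadPoolExecutor


def pick_split_points(data, num_nodes, sample_size=100, seed=0):
    """Sample the input and take num_nodes - 1 split points from the sorted sample."""
    rng = random.Random(seed)
    sample = sorted(rng.sample(data, min(sample_size, len(data))))
    return [sample[(i * len(sample)) // num_nodes] for i in range(1, num_nodes)]


def parallel_sort(data, num_nodes=4):
    if not data:
        return []
    splits = pick_split_points(data, num_nodes)
    # Range re-partition: route each value to the node that owns its key range.
    buckets = [[] for _ in range(num_nodes)]
    for value in data:
        buckets[bisect.bisect_right(splits, value)].append(value)
    # Each receiving node sorts locally; threads stand in for nodes here.
    with ThreadPoolExecutor(max_workers=num_nodes) as pool:
        sorted_buckets = list(pool.map(sorted, buckets))
    # Concatenating the buckets in range order gives a globally sorted result.
    return [v for bucket in sorted_buckets for v in bucket]


data = [5, 3, 9, 1, 7, 3, 8, 2, 6, 4] * 20
print(parallel_sort(data) == sorted(data))
```

Taking split points from a sample rather than from fixed key ranges is what keeps the buckets roughly equal in size when the key distribution is skewed.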