



A set of lecture notes for the CIS 700 course at the University of Pennsylvania, taught by Boon Thau Loo during the Spring 2007 semester. The notes cover distributed and parallel databases, including their motivation, design, and comparison to traditional databases. The document also includes references to related textbooks and papers, as well as important definitions and concepts.




Boon Thau Loo, Spring 2007, Lecture 2
Note: Several slides are courtesy of CSE 599C (Winter '06) and CSE 544 (Fall '06) from UW-Seattle, lecture slides from http://www.cs.wisc.edu/~dbbook and cs186 Fall '06 lectures from UC Berkeley.
interest, advisor, audit/enroll)
sp07/ideas/ideas.html
Drexel)
of a database
http://www.cs.brown.edu/courses/cs295-
http://redbook.cs.berkeley.edu
distributed and parallel databases
course? For example,
parallel or distributed database?
millions, or scale down to motes?
M. Tamer Özsu and Patrick Valduriez. Principles of Distributed Database Systems. Prentice Hall, 1999.
Management in the R* Distributed Database Management System. ACM Transactions on Database Systems 11(4), 1986.
Sidell, C. Staelin, and A. Yu. Mariposa: A Wide-Area Distributed Database System. VLDB Journal 5(1), 1996.
locations (different offices) -> original motivation
Distributed data independence:
  Location transparency
  Performance transparency: distributed query optimizer ensures good performance no
Distributed transaction atomicity
Must reconstruct relation by join on SID, then evaluate the query
SELECT Sname FROM Supplier S, Supplies S1 WHERE S.sno = S1.sno AND S.scity = 'Seattle'
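The reconstruct-then-query step for a vertically fragmented relation can be sketched in a few lines of Python. This is only an illustration: the fragment contents, the function names, and the use of in-memory dictionaries in place of remote sites are all assumptions, not something taken from the slides.

```python
# Hypothetical vertical fragments of one relation, both keyed on SID.
# In a real system each fragment would live at a different site.
frag_names = {1: "Ann", 2: "Bob", 3: "Cy"}                      # (SID, name)
frag_cities = {1: "Seattle", 2: "Philadelphia", 3: "Seattle"}   # (SID, city)


def reconstruct(left, right):
    """Rebuild the relation by joining two vertical fragments on SID."""
    return {sid: (left[sid], right[sid]) for sid in left.keys() & right.keys()}


def names_in_city(relation, city):
    """Evaluate a selection and projection on the reconstructed relation."""
    return sorted(name for name, c in relation.values() if c == city)


relation = reconstruct(frag_names, frag_cities)
result = names_in_city(relation, "Seattle")
```

The join has to happen before the selection here because the selection attribute (city) and the output attribute (name) are stored in different fragments.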
Autonomy: different administrative domains
  Cannot always assume full cooperation
  Do not require distributed transactions
Heterogeneity:
  Different capabilities at different locations
  Different data types, different semantics
Large-scale
  Internet-scale query processor
Goal is to get rid of the single-administrative-domain assumption:
  Dynamic data allocation
  Multiple administrative structures
  Heterogeneity of nodes
Interesting idea based on economic models:
  Processing sites buy and sell data and query processing services.
  Sites declare their local costs for a query based on:
    Estimates of resource consumption
    Runtime constraints (e.g. current system load)
    Relationships with competition sites
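To make the economic model concrete, here is a toy sketch in Python of sites declaring a cost for a query and a broker choosing the cheapest one. It is not Mariposa's actual bidding protocol: the `Bid` class, the pricing formula (resource estimate scaled by a load factor), and the site names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Bid:
    """A site's declared cost for running a query (hypothetical model)."""
    site: str
    resource_cost: float  # estimate of resource consumption
    load_factor: float    # runtime constraint: above 1.0 means the site is busy

    def price(self):
        # Assumed pricing rule: a busy site charges more for the same work.
        return self.resource_cost * self.load_factor


def choose_site(bids):
    """The broker buys the query processing service from the cheapest bidder."""
    return min(bids, key=lambda b: b.price())


bids = [
    Bid("site-A", resource_cost=10.0, load_factor=2.0),   # cheap but overloaded
    Bid("site-B", resource_cost=15.0, load_factor=1.0),
    Bid("site-C", resource_cost=40.0, load_factor=0.5),   # idle but expensive
]
winner = choose_site(bids)
```

With these made-up numbers, the site with the lowest raw resource estimate does not win, because its current load raises its declared price.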
http://mariposa.cs.berkeley.edu/
Cohera -> PeopleSoft -> Oracle
From the Redbook:
  "The Mariposa system was commercialized as Cohera (later bought by PeopleSoft) and was demonstrated to work across administrative domains in fields
  "...flexibility and efficiency of its computation economy ideas have yet to be significantly tested
  "...unclear whether corporate IT is ready for significant investments in federated query processing."
  "It is possible that we will see ideas from Mariposa re-emerge in the peer-to-peer space, where there is significant grassroots interest."
Overview of Distributed Databases
Overview of Parallel Databases
Background:
  Ramakrishnan and Gehrke. Database Management Systems, Chapter 22.
  David J. DeWitt, Jim Gray. Parallel Database Systems: The Future of High Performance Database Systems. Commun. ACM, 35(6), 1992, 85-98.
  Goetz Graefe. Encapsulation of Parallelism in the Volcano Query Processing System. Proc. SIGMOD Conference, 1990, 102-111.
Performance: loading data, building indexes, executing queries
Data is distributed within a single site
Distributed solely for performance
Parallelism is natural to DBMS processing:
  Pipelined parallelism: many machines, each doing one step in a multi-step process
  Partition parallelism: many machines doing the same thing to different pieces of the data
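The two forms of parallelism can be contrasted in a small Python sketch. Threads stand in for machines here, which is an assumption made purely for illustration (in CPython they will not speed up CPU-bound work); the function names `pipelined` and `partitioned` are hypothetical.

```python
import queue
import threading
from concurrent.futures import ThreadPoolExecutor

DONE = object()  # sentinel marking the end of the stream


def stage(fn, inbox, outbox):
    """One pipeline step: apply fn to each incoming item until DONE arrives."""
    while True:
        item = inbox.get()
        if item is DONE:
            outbox.put(DONE)
            return
        outbox.put(fn(item))


def pipelined(items, steps):
    """Pipelined parallelism: each step runs in its own thread, linked by queues."""
    queues = [queue.Queue() for _ in range(len(steps) + 1)]
    threads = [
        threading.Thread(target=stage, args=(fn, queues[i], queues[i + 1]))
        for i, fn in enumerate(steps)
    ]
    for t in threads:
        t.start()
    for item in items:
        queues[0].put(item)
    queues[0].put(DONE)
    out = []
    while True:
        item = queues[-1].get()
        if item is DONE:
            break
        out.append(item)
    for t in threads:
        t.join()
    return out


def partitioned(items, fn, workers=3):
    """Partition parallelism: every worker applies the same fn to its own slice."""
    parts = [items[i::workers] for i in range(workers)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(lambda part: [fn(x) for x in part], parts)
    return [y for part in results for y in part]
```

In the pipelined version, a later step can start on the first item while an earlier step is still working on the next one. In the partitioned version, the output order follows the partitioning, so a caller that needs the original order must restore it.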
DBMSs are the most successful application of parallelism
  Every major DBMS vendor has some parallel product
  Key concepts are modified and reused in all major search engines
Reasons for success:
  Natural pipelining
  Partitioned parallelism
  Inexpensive hardware
  Users / app-programmers don't have to think in parallel
Intra-operator parallelism
  Get all machines to compute a given operation (scan, sort, join)
Inter-operator parallelism
  Each operator runs at a different site
Inter-query parallelism
  Different queries run at different sites
Main idea:
  Scan in parallel
  Range re-partition
  At each receiving node, local sorting
Problem: Skew!
Solution: "sample" the data at start to determine partition points
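The steps above can be sketched in Python as follows. This is a minimal single-process illustration under stated assumptions: threads stand in for nodes, the data is an in-memory list, and the sample size, seed, and function names are hypothetical choices rather than anything specified in the lecture.

```python
import bisect
import random
from concurrent.futures import ThreadPoolExecutor


def pick_splitters(data, n_nodes, sample_size=100, seed=0):
    """Sample the data up front and return n_nodes - 1 partition points."""
    rng = random.Random(seed)
    sample = sorted(rng.sample(data, min(sample_size, len(data))))
    return [sample[len(sample) * i // n_nodes] for i in range(1, n_nodes)]


def range_partition(data, splitters):
    """Send each value to the node whose key range contains it."""
    parts = [[] for _ in range(len(splitters) + 1)]
    for x in data:
        parts[bisect.bisect_right(splitters, x)].append(x)
    return parts


def parallel_sort(data, n_nodes=4):
    """Range re-partition, sort locally at each node, then concatenate."""
    if not data:
        return []
    splitters = pick_splitters(data, n_nodes)
    parts = range_partition(data, splitters)
    with ThreadPoolExecutor(max_workers=n_nodes) as pool:
        sorted_parts = list(pool.map(sorted, parts))
    # Ranges are disjoint and ordered, so concatenation yields a sorted result.
    return [x for part in sorted_parts for x in part]
```

Because the splitters come from a sample of the actual data rather than from fixed, evenly spaced key values, a skewed input tends to be divided into partitions of more similar size. Heavy duplication of a single value can still overload one node, since all copies of a value land in the same range.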