Download Distributed Databases: Understanding Client-Server Communications and Advantages and more Study notes Computer Science in PDF only on Docsity!
Distributed Database Systems
Motivation
Growth of data communication
networks
Faster, cheaper & reliable
Needs of organizations
Multiple sites
Global presence
Interacting organizations
Need for transaction capabilities
Availability and reliability are crucial
What is Distributed Database
System?
A Distributed Database (DDB) is a
collection of multiple,logically
interrelated databases distributed over
a computer network
A Distributed Database System is
software that manages a distributed
database while making the distribution
transparent to the user
Distributed DB Environment
Site 1 Site 2
Site 3 Site 4
Communication
Network
Client-Server: Characteristics
Client
presents the user interface
forms queries or commands in a predefined language
communicates these to the server according to an
accepted interprocess communication method
performs data analysis on the results returned from the
server
presents the results to the user
Server
provides a service to client(s)
only responds to queries or commands from clients;
does not initiate conversation with clients
hides the existence of the client-server system from the
client (ideally)
Client-Server: Advantages &
Limitations
Advantages
Efficient division of labor
Client access to remote data (via
standards)
Full DBMS functionality provided to client
workstations
Limitations
Server forms bottleneck
Server forms single point of failure
Database scaling difficult
Multiple Clients/Multiple
Servers
Communications
Client
Services
Applications
LAN
z directory
z caching
z query decomposition
z commit protocols
Communications
DBMS Services
Database
Communications
DBMS Services
Database
Server-to-Server
Communications
DBMS Services
LAN
Communications
DBMS Services
z SQL interface
z programmatic
interface
z other application
support environments
Communications
Client
Services
Applications
Database Database
Transparency
High-level users are shielded from the
implementational details such as the
location, replication of data
Users still see a monolithic view of the
database
Different types of transparencies
Distribution transparency
Replication transparency
Fragmentation transparency
Example
TITLE SAL
SKILL
Elect. Eng. 40000 Syst. Anal. 34000 Mech. Eng. 27000 Programmer 24000
PROJ
JNO JNAME BUDGET
ENO ENAME TITLE
E1 J. Doe Elect. Eng. E2 M. Smith Syst. Anal. E3 A. Lee Mech. Eng. E4 J. Miller Programmer E5 B. Casey Syst. Anal. E6 L. Chu Elect. Eng. E7 R. Davis Mech. Eng. E8 J. Jones Syst. Anal.
EMP
ENO JNO RESP
E1 J1 Manager 12
DUR
E2 J1 Analyst 24 E2 J2 Analyst 6 E3 J3 Consultant 10 E3 J4 Engineer 48 E4 J2 Programmer 18 E5 J2 Manager 24 E6 J4 Manager 48 E7 J3 Engineer 36
E8 J3 Manager 40
ASG
LOC
J1 Instrumentation 150000 Montreal
J3 CAD/CAM 250000 New York
J2 Database Develop. 135000 New York
J4 Maintenance 310000 Paris
E7 J5 Engineer 23
J5 CAD/CAM 500000 Bangalore
Transparency - Illustration
Paris
New York Bangalore
Berlin
Paris Employees Paris Projects Paris Assignments
Bangalore Employees Bangalore Projects Bangalore Assignments New York Projects
New York Employees New York Projects New York Assignments Paris Employees
Communication
Network
SELECT ENAME,SAL
FROM EMP,ASG,SKILL
WHERE DUR > 12
AND EMP.ENO = ASS.ENO
AND EMP.TITLE = SKILL.TITLE
Reliability & Performance
Distributed DBMS avoidssingle point of
failure
Only data and software at the failed site
are inaccessible
Further improvement by replication
Performance improvements
Proximity of data to points of its use
Parallelism in execution
Inter-query parallelism
Intra-query parallelism
Distributed Query Processing
Methodology
Query Decomposition
Data Localization
Global Optimization
Local Optimization
Calculus query on Distributed
relations
Algebraic query on
Distributed relations
Fragment Query
Optimized Fragment Query with
Communication Operations
Optimized local queries
Global Schema
Fragment Schema
Stats on fragments
Local Schema
Control
Site
Local
Sites
Steps
Query Decomposition
Input: Calculus query on global relations
Output: Algebraic query on global relations
Transformation rules used
Same as Centralized
Data Localization
Input: Algebraic query on global relations
Output: Algebraic query on fragment relations
Global Optimization
Input: Algebraic query on fragment relations
The best global schedule
Local optimization
Input: Best global execution schedule
Output: The optimal access path for each local sub-query
Decomposition &
Restructuring
Convert relational calculus
to relational algebra
Make use of query trees
Example
Find the names of all employees other than J. Doe who worked on CAD/CAM project for either 1 or 2 years SELECT ENAME FROM E, W, P WHERE E.ENO = W.ENO AND W.JNO = P.JNO AND E.ENAME ≠ “J. Doe” AND P.JNAME = “CAD/CAM” AND (W.DUR = 12 or W.DUR = 24)
ΛENAME
σ(DUR = 12 OR DUR = 24) AND JNAME = “CAD/CAM” AND ENAME = “JDOE”
|X| (^) JNO
|X|ENO
W E
P
Project
Select
Join
Step 2 – Data Localization
Input: Algebraic query on distributed
relations
Determine which fragments are
involved
Localization program
Substitute for each global query its
materialization program
Optimize
Step 3 – Global Query
Optimization
Input: Fragment Query
Find the best global schedule
Minimizes a cost function
Distributed join processing
Which relations to ship where?
Decide on the use of semi-joins
Semi joins saves communication at the expense
of more local processing
Join methods
Nested loop Vs. Ordered joins
Cost-Based Optimization
Solution Space
The set of equivalent algebra expressions (query
trees)
Cost function (in terms of time)
I/O cost + CPU cost + Communication cost
Might have different weights in different
distributed environments
Search algorithms
How do we find the best plan?
Exhaustive search, heuristic algorithms (iterative
deepening, simulated annealing, etc.)
Cost Functions
Total Time
Reduce each cost component
Do as little of each cost component as
possible
Optimizes the utilization of resources
Response Time
Do as many things as possible in parallel
May increase total time
Total Cost
Total cost = CPU cost + I/O cost +
communication cost
CPU cost = unit instruction cost * no. of instructions
I/O cost = unit disk I/O cost * no. of disk I/O
Communication cost = message initiation +
data transmission
Example
Assume that only the communication cost is considered
Total time = 2 * message initialization time + unit
transmission time * (x+y)
Response time = max(time to send x from 1 to 3 and y
from 2 to 3)
Time to send x from 1 to 3 = message initialization time
+ unit transmission time * x
Time to send y from 2 to 3 = message initialization time
+ unit transmission time * y
Site 1
Site 2
Site 3
X Units
Y Units
Distributed Transaction
Execution
User application
Results &
User notifications
Read, Write,
EOT, Abort
TM
SC
RM
SC
RM
TM
Local
Recovery
Protocol
Distributed
Concurrency Control
Protocol
Replica Control
Protocol
Distributed
Transaction Execution
Model
Begin_transaction,
Read, Write, EOT,
Abort
Serializability in Distributed
DBMS
Somewhat more involved: Two histories have
to be considered
Local histories
Global history
For global transactions (i.e., global history) to
be serializable two conditions are necessary
Each local history should be serializable
Two conflicting operations should in the same
relative order in all of the local histories where they
appear together
Global Non-serializability
T1: Read (X)
X <- X+ Write(X) Commit
T2: Read (X)
X <- X* Write(X) Commit
The following two local histories are individually serializable
(in fact serial). But the two transactions are not globally
serializable
LH 1 = {R 1 (x), W 1 (X), C 1 , R 2 (X), W 2 (X), C 2 }
LH 2 = {R 2 (x), W 2 (X), C 2 , R 1 (X), W 1 (X), C 1 }
Deadlock
A transaction is deadlocked if it is blocked and
will remain blocked until external intervention
Locking based CC algorithms may result in
deadlocks
Wait-for graph
If transaction T i waits for another transaction T j to
release a lock, then Ti Æ T j in WFG
T
i
T
j
Local Vs. Global WFG
Assume T1 and T2 run at site 1, T3 and T4 at
site 2. Also assume T3 waits for a lock held
by T4 which waits for a lock held by T1 which
waits for a lock held by T2 which, in turn,
waits for a lock held by T
T (^1)
T (^2)
T (^4)
T (^3)
Local WFG
Site 1 Site 2
T (^1)
T (^2)
T (^4)
T (^3)
Global WFG
Deadlock detection
Transactions are allowed to wait freely
Deadlock detection through cycles in
Wait-for graphs
Topologies for deadlock detection
algorithms
Centralized
Hierarchical
Distributed
Centralized Deadlock detection
One site designated as the deadlock detector for
the system.
Each scheduler periodically sends its local WFG
to the central site, which merges to obtain a
global WFG to determine cycles
How often to transmit
Too often Higher communication cost
Too late Higher delays
Would be a reasonable choice if CC algorithm is
also centralized