Distributed Databases: Understanding Client-Server Communications and Advantages, Study notes of Computer Science

An overview of distributed databases, their architecture, and the advantages of using a client-server communication model. Topics include task distribution, communication methods, caching, and the characteristics, advantages, and limitations of client-server systems. Additionally, it discusses the importance of transparency, reliability, and performance in distributed database management systems.

Typology: Study notes

Pre 2010

Uploaded on 08/05/2009

koofers-user-0uh
koofers-user-0uh 🇺🇸

10 documents

1 / 29

Toggle sidebar

This page cannot be seen from the preview

Don't miss anything!

bg1
1
Distributed Database Systems
Motivation
Growth of data communication
networks
Faster, cheaper & reliable
Needs of organizations
Multiple sites
Global presence
Interacting organizations
Need for transaction capabilities
Availability and reliability are crucial
pf3
pf4
pf5
pf8
pf9
pfa
pfd
pfe
pff
pf12
pf13
pf14
pf15
pf16
pf17
pf18
pf19
pf1a
pf1b
pf1c
pf1d

Partial preview of the text

Download Distributed Databases: Understanding Client-Server Communications and Advantages and more Study notes Computer Science in PDF only on Docsity!

Distributed Database Systems

Motivation

Growth of data communication

networks

„ Faster, cheaper & reliable

Needs of organizations

„ Multiple sites

„ Global presence

„ Interacting organizations

„ Need for transaction capabilities

„ Availability and reliability are crucial

What is Distributed Database

System?

A Distributed Database (DDB) is a

collection of multiple,logically

interrelated databases distributed over

a computer network

A Distributed Database System is

software that manages a distributed

database while making the distribution

transparent to the user

Distributed DB Environment

Site 1 Site 2

Site 3 Site 4

Communication

Network

Client-Server: Characteristics

Client

„ presents the user interface

„ forms queries or commands in a predefined language

„ communicates these to the server according to an

accepted interprocess communication method

„ performs data analysis on the results returned from the

server

„ presents the results to the user

Server

„ provides a service to client(s)

„ only responds to queries or commands from clients;

does not initiate conversation with clients

„ hides the existence of the client-server system from the

client (ideally)

Client-Server: Advantages &

Limitations

Advantages

„ Efficient division of labor

„ Client access to remote data (via

standards)

„ Full DBMS functionality provided to client

workstations

Limitations

„ Server forms bottleneck

„ Server forms single point of failure

„ Database scaling difficult

Multiple Clients/Multiple

Servers

Communications

Client

Services

Applications

LAN

z directory

z caching

z query decomposition

z commit protocols

Communications

DBMS Services

Database

Communications

DBMS Services

Database

Server-to-Server

Communications

DBMS Services

LAN

Communications

DBMS Services

z SQL interface

z programmatic

interface

z other application

support environments

Communications

Client

Services

Applications

Database Database

Transparency

High-level users are shielded from the

implementational details such as the

location, replication of data

„ Users still see a monolithic view of the

database

Different types of transparencies

„ Distribution transparency

„ Replication transparency

„ Fragmentation transparency

Example

TITLE SAL

SKILL

Elect. Eng. 40000 Syst. Anal. 34000 Mech. Eng. 27000 Programmer 24000

PROJ

JNO JNAME BUDGET

ENO ENAME TITLE

E1 J. Doe Elect. Eng. E2 M. Smith Syst. Anal. E3 A. Lee Mech. Eng. E4 J. Miller Programmer E5 B. Casey Syst. Anal. E6 L. Chu Elect. Eng. E7 R. Davis Mech. Eng. E8 J. Jones Syst. Anal.

EMP

ENO JNO RESP

E1 J1 Manager 12

DUR

E2 J1 Analyst 24 E2 J2 Analyst 6 E3 J3 Consultant 10 E3 J4 Engineer 48 E4 J2 Programmer 18 E5 J2 Manager 24 E6 J4 Manager 48 E7 J3 Engineer 36

E8 J3 Manager 40

ASG

LOC

J1 Instrumentation 150000 Montreal

J3 CAD/CAM 250000 New York

J2 Database Develop. 135000 New York

J4 Maintenance 310000 Paris

E7 J5 Engineer 23

J5 CAD/CAM 500000 Bangalore

Transparency - Illustration

Paris

New York Bangalore

Berlin

Paris Employees Paris Projects Paris Assignments

Bangalore Employees Bangalore Projects Bangalore Assignments New York Projects

New York Employees New York Projects New York Assignments Paris Employees

Communication

Network

SELECT ENAME,SAL

FROM EMP,ASG,SKILL

WHERE DUR > 12

AND EMP.ENO = ASS.ENO

AND EMP.TITLE = SKILL.TITLE

Reliability & Performance

Distributed DBMS avoidssingle point of

failure

„ Only data and software at the failed site

are inaccessible

„ Further improvement by replication

Performance improvements

„ Proximity of data to points of its use

„ Parallelism in execution

Š Inter-query parallelism

Š Intra-query parallelism

Distributed Query Processing

Methodology

Query Decomposition

Data Localization

Global Optimization

Local Optimization

Calculus query on Distributed

relations

Algebraic query on

Distributed relations

Fragment Query

Optimized Fragment Query with

Communication Operations

Optimized local queries

Global Schema

Fragment Schema

Stats on fragments

Local Schema

Control

Site

Local

Sites

Steps

Query Decomposition

„ Input: Calculus query on global relations

„ Output: Algebraic query on global relations

„ Transformation rules used

„ Same as Centralized

Data Localization

„ Input: Algebraic query on global relations

„ Output: Algebraic query on fragment relations

Global Optimization

„ Input: Algebraic query on fragment relations

„ The best global schedule

Local optimization

„ Input: Best global execution schedule

„ Output: The optimal access path for each local sub-query

Decomposition &

Restructuring

Convert relational calculus

to relational algebra

Make use of query trees

Example

Find the names of all employees other than J. Doe who worked on CAD/CAM project for either 1 or 2 years SELECT ENAME FROM E, W, P WHERE E.ENO = W.ENO AND W.JNO = P.JNO AND E.ENAME ≠ “J. Doe” AND P.JNAME = “CAD/CAM” AND (W.DUR = 12 or W.DUR = 24)

ΛENAME

σ(DUR = 12 OR DUR = 24) AND JNAME = “CAD/CAM” AND ENAME = “JDOE”

|X| (^) JNO

|X|ENO

W E

P

Project

Select

Join

Step 2 – Data Localization

Input: Algebraic query on distributed

relations

Determine which fragments are

involved

Localization program

„ Substitute for each global query its

materialization program

„ Optimize

Step 3 – Global Query

Optimization

Input: Fragment Query

Find the best global schedule

„ Minimizes a cost function

„ Distributed join processing

Š Which relations to ship where?

„ Decide on the use of semi-joins

Š Semi joins saves communication at the expense

of more local processing

„ Join methods

Š Nested loop Vs. Ordered joins

Cost-Based Optimization

Solution Space

„ The set of equivalent algebra expressions (query

trees)

Cost function (in terms of time)

„ I/O cost + CPU cost + Communication cost

„ Might have different weights in different

distributed environments

Search algorithms

„ How do we find the best plan?

„ Exhaustive search, heuristic algorithms (iterative

deepening, simulated annealing, etc.)

Cost Functions

Total Time

„ Reduce each cost component

„ Do as little of each cost component as

possible

„ Optimizes the utilization of resources

Response Time

„ Do as many things as possible in parallel

„ May increase total time

Total Cost

Total cost = CPU cost + I/O cost +

communication cost

CPU cost = unit instruction cost * no. of instructions

I/O cost = unit disk I/O cost * no. of disk I/O

Communication cost = message initiation +

data transmission

Example

Assume that only the communication cost is considered

Total time = 2 * message initialization time + unit

transmission time * (x+y)

Response time = max(time to send x from 1 to 3 and y

from 2 to 3)

Time to send x from 1 to 3 = message initialization time

+ unit transmission time * x

Time to send y from 2 to 3 = message initialization time

+ unit transmission time * y

Site 1

Site 2

Site 3

X Units

Y Units

Distributed Transaction

Execution

User application

Results &

User notifications

Read, Write,

EOT, Abort

TM

SC

RM

SC

RM

TM

Local

Recovery

Protocol

Distributed

Concurrency Control

Protocol

Replica Control

Protocol

Distributed

Transaction Execution

Model

Begin_transaction,

Read, Write, EOT,

Abort

Serializability in Distributed

DBMS

Somewhat more involved: Two histories have

to be considered

„ Local histories

„ Global history

For global transactions (i.e., global history) to

be serializable two conditions are necessary

„ Each local history should be serializable

„ Two conflicting operations should in the same

relative order in all of the local histories where they

appear together

Global Non-serializability

T1: Read (X)

X <- X+ Write(X) Commit

T2: Read (X)

X <- X* Write(X) Commit

The following two local histories are individually serializable

(in fact serial). But the two transactions are not globally

serializable

LH 1 = {R 1 (x), W 1 (X), C 1 , R 2 (X), W 2 (X), C 2 }

LH 2 = {R 2 (x), W 2 (X), C 2 , R 1 (X), W 1 (X), C 1 }

Deadlock

A transaction is deadlocked if it is blocked and

will remain blocked until external intervention

Locking based CC algorithms may result in

deadlocks

Wait-for graph

„ If transaction T i waits for another transaction T j to

release a lock, then Ti Æ T j in WFG

T

i

T

j

Local Vs. Global WFG

Assume T1 and T2 run at site 1, T3 and T4 at

site 2. Also assume T3 waits for a lock held

by T4 which waits for a lock held by T1 which

waits for a lock held by T2 which, in turn,

waits for a lock held by T

T (^1)

T (^2)

T (^4)

T (^3)

Local WFG

Site 1 Site 2

T (^1)

T (^2)

T (^4)

T (^3)

Global WFG

Deadlock detection

Transactions are allowed to wait freely

Deadlock detection through cycles in

Wait-for graphs

Topologies for deadlock detection

algorithms

„ Centralized

„ Hierarchical

„ Distributed

Centralized Deadlock detection

One site designated as the deadlock detector for

the system.

Each scheduler periodically sends its local WFG

to the central site, which merges to obtain a

global WFG to determine cycles

How often to transmit

„ Too often Higher communication cost

„ Too late Higher delays

Would be a reasonable choice if CC algorithm is

also centralized