Distributed Systems: Server Crashes, Lost Reply Messages, and Reliable Multicasting, Study notes of Computer Science

A series of slides from a university course on distributed systems. It covers various topics related to distributed systems, including server crashes, lost reply messages, and reliable multicasting. The slides discuss strategies for handling server crashes, the concept of idempotent operations, and different schemes for reliable multicasting. The document also introduces the atomic multicast problem and the concept of virtually synchronous reliable multicasting.

Typology: Study notes

Pre 2010

Uploaded on 07/30/2009

West Virginia University Copyright © K.Goseva 2008 CS 757 Distributed Systems Slide 1
Fault Tolerance
Chapter 8

Outline

• Introduction to fault tolerance
• Process resilience
• Reliable client-server communication
• Reliable group communication: Not covered (Slides 33-54)
• Distributed commit: Not covered (Slides 55-67)
• Recovery: Only partially covered, high level

Basic concepts

• Failure – the system does not meet its requirements
• Error – part of the system’s state that may lead to failure
• Fault – cause of an error
• High dependability can be achieved by
  – Preventing faults
  – Removing faults
  – Fault tolerance – system provides its services even in the presence of faults

Basic concepts

• Types of faults
  – Transient
    • Occurs once and then disappears
    • If the operation is repeated, the fault goes away
  – Intermittent
    • Appears, vanishes, then reappears
  – Permanent
    • Continues to exist until the faulty component is repaired

Failure masking by redundancy

• Failure masking is achieved by redundancy
  – Information redundancy
    • Extra bits are added to allow recovery (e.g., Hamming code)
  – Time redundancy
    • Action is performed, and if needed performed again (e.g., if a transaction aborts, it can be redone)
    • Especially helpful for transient or intermittent faults
  – Physical redundancy
    • Extra equipment or software is added (e.g., a 747 has four engines, but can fly on three)
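Time redundancy can be illustrated with a simple retry loop. The sketch below is only an illustration under stated assumptions: the function name `retry`, its `attempts` parameter, and the idea of signalling a fault with an exception are hypothetical and do not come from the slides.

```python
def retry(operation, attempts=3):
    """Run operation(); if it raises, run it again, up to `attempts` times.

    Hypothetical helper: a fault is assumed to show up as an exception.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for _ in range(attempts):
        try:
            return operation()          # success: the fault (if any) went away
        except Exception as error:      # failure: remember it and try again
            last_error = error
    raise last_error                    # still failing: treat the fault as permanent
```

Repeating the action masks a transient fault and may mask an intermittent one. A permanent fault exhausts the attempts and is reported to the caller.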

Failure masking by redundancy

Triple modular redundancy
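In triple modular redundancy, each component is replicated three times and a voter passes on the value that the majority of replicas produced. A minimal sketch of such a voter follows; the function name `vote` and the choice to raise an error when all three inputs differ are assumptions made for illustration.

```python
from collections import Counter

def vote(a, b, c):
    """Return the value produced by at least two of the three replicas."""
    value, votes = Counter([a, b, c]).most_common(1)[0]
    if votes < 2:
        # Assumed behaviour: with three different inputs no fault can be masked.
        raise ValueError("no majority: all three replicas disagree")
    return value

# One faulty replica is outvoted by the two correct ones.
masked = vote(7, 7, 3)
```

A single faulty replica is masked because the other two still agree. Two simultaneous faults can defeat the voter.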

Process resilience

• The key approach to tolerating a faulty process is to organize several identical processes into a group
  – A process can send a message to a group of servers without having to know how many servers there are or where they are
  – All members of the group receive the messages
  – Process groups may be dynamic
    • Old groups are destroyed, new groups are created
    • A process can join or leave a group
    • A process can be a member of several groups
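The group abstraction described above can be sketched in a few lines. This is an in-memory illustration only: the class names `Process` and `ProcessGroup` and their methods are hypothetical, and real systems would deliver messages over a network.

```python
class Process:
    """A group member; its inbox collects every message delivered to it."""
    def __init__(self, name):
        self.name = name
        self.inbox = []

class ProcessGroup:
    """Dynamic group: senders address the group, not individual members."""
    def __init__(self):
        self.members = []

    def join(self, process):
        if process not in self.members:
            self.members.append(process)

    def leave(self, process):
        if process in self.members:
            self.members.remove(process)

    def send(self, message):
        # The sender does not need to know how many members exist or where they are.
        for member in self.members:
            member.inbox.append(message)

# A process may belong to several groups at once.
p, q = Process("p"), Process("q")
servers, loggers = ProcessGroup(), ProcessGroup()
servers.join(p); servers.join(q); loggers.join(q)
servers.send("request-1")
loggers.send("log-1")
```

After this runs, `p` has received only the message sent to `servers`, while `q` has received both messages.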

Flat groups vs. hierarchical groups

• Flat group
  – Advantage: no single point of failure
  – Disadvantage: more complex decision making
• Hierarchical group
  – Advantage: coordinator makes the decisions
  – Disadvantage: coordinator is a single point of failure

Failure masking and replication

• Replace a single process with a group of replicated processes
  – Primary-based protocols
    • Hierarchical group in which the primary coordinates all write operations
    • The primary is fixed
    • When primary crashes, the backups execute election algorithms to choose a new primary
  – Replicated-write protocols
    • Active replication or quorum-based protocols
    • Flat group

Failure masking and replication

• How much replication is needed?
  – A system is k fault tolerant if it can survive faults in k components and still meet its specification
    • If processes exhibit fail-stop failures, k+1 components are enough to provide k fault tolerance
    • If processes exhibit Byzantine failures, 2k+1 components are needed to provide k fault tolerance
  – This holds only if all requests arrive at all servers in the same order, also called the atomic multicast problem
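The two replication bounds can be written down as a small helper. This is a sketch for checking the arithmetic only; the function name `replicas_needed` and the failure-model strings are hypothetical labels.

```python
def replicas_needed(k, failure_model):
    """Minimum number of components for k fault tolerance (hypothetical helper)."""
    if k < 0:
        raise ValueError("k must be non-negative")
    if failure_model == "fail-stop":
        # If k components stop, one survivor is still enough to answer.
        return k + 1
    if failure_model == "byzantine":
        # k faulty components may lie; k + 1 correct ones outvote them.
        return 2 * k + 1
    raise ValueError("unknown failure model: " + failure_model)

# Tolerating two faults:
fail_stop = replicas_needed(2, "fail-stop")   # 3 components
byzantine = replicas_needed(2, "byzantine")   # 5 components
```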

Agreement in faulty systems

• The case of perfect processes and unreliable communication
  – Two army problem: red army with 5000 troops in the valley and two blue armies, each 3000 troops, on surrounding hills
    • If the two blue armies can coordinate their attacks on the red army they will be victorious
    • The blue armies need to reach an agreement about attacking
    • They can communicate using unreliable channel – sending a messenger who can be captured by the red army
    • Even with non-faulty processes (generals), agreement is not possible in the face of unreliable communication

Agreement in faulty systems

• The case of perfect communication, but faulty processors
  – Byzantine generals problem: red army is still in the valley, but n blue armies are on the nearby hills
    • Communication is done pairwise and it is instantaneous and perfect
    • m (out of n) generals are traitors (faulty) and are actively trying to prevent the loyal generals from reaching agreement by providing incorrect and contradictory information
    • Assumption: Generals exchange the troop strengths. Each general ends up with a vector of length n; if general i is loyal, element i is its troop strength; otherwise, it is undefined
    • The question is whether the loyal generals can reach the agreement

Agreement in faulty systems

The algorithm fails to produce an agreement

In general, in a system with m faulty processors, agreement can be achieved only if 2m+1 correctly functioning processors are present, for a total of 3m+1 processors; that is, more than 2/3 of the processors are working properly
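The effect of the 3m+1 bound can be seen in a small simulation of the vector-exchange scheme usually presented with this problem. It is a sketch under stated assumptions, not necessarily the exact algorithm shown on the slide: the function name `byzantine_agreement` is hypothetical, traitors are modelled as sending a fresh, distinct lie in every message, and each loyal general takes the majority over the vectors it receives from the others.

```python
from collections import Counter
from itertools import count

UNKNOWN = None

def byzantine_agreement(strengths, traitors):
    """strengths: general id -> true troop strength; traitors: set of ids.

    Returns, for every loyal general, its final result vector.
    """
    generals = sorted(strengths)
    lie = count(1000)  # assumed adversary: every lie is a new distinct value

    # Round 1: everyone announces its strength; each general builds a vector.
    heard = {g: {} for g in generals}  # heard[receiver][sender]
    for sender in generals:
        for receiver in generals:
            lying = sender in traitors and sender != receiver
            heard[receiver][sender] = next(lie) if lying else strengths[sender]

    # Round 2: vectors are exchanged; traitors forward arbitrary vectors.
    results = {}
    for receiver in generals:
        if receiver in traitors:
            continue
        vectors = []
        for sender in generals:
            if sender == receiver:
                continue
            if sender in traitors:
                vectors.append({g: next(lie) for g in generals})
            else:
                vectors.append(dict(heard[sender]))
        # Element i is accepted only if a strict majority of vectors agree on it.
        result = {}
        for g in generals:
            value, votes = Counter(v[g] for v in vectors).most_common(1)[0]
            result[g] = value if votes > len(vectors) / 2 else UNKNOWN
        results[receiver] = result
    return results

four = byzantine_agreement({1: 1, 2: 2, 3: 3, 4: 4}, traitors={3})
three = byzantine_agreement({1: 1, 2: 2, 3: 3}, traitors={3})
```

With four generals and one traitor (n = 3m+1), every loyal general ends with the same vector, in which only the traitor's element is unknown. With three generals and one traitor, no element reaches a majority, so the loyal generals cannot establish any troop strength.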

Reliable client-server communication