Download Distributed Systems: Server Crashes, Lost Reply Messages, and Reliable Multicasting and more Study notes Computer Science in PDF only on Docsity!
West Virginia University Copyright © K.Goseva 2008 CS 757 Distributed Systems
Fault Tolerance
Chapter 8
West Virginia University Copyright © K.Goseva 2008 CS 757 Distributed Systems
Outline
• Introduction to fault tolerance • Process resilience • Reliable client-server communication •
Reliable group communication: Not covered (Slides 33-54)
Distributed commit: Not covered (Slides 55-67)
Recovery: Only partially covered, high level
West Virginia University Copyright © K.Goseva 2008 CS 757 Distributed Systems
Basic concepts
• Failure – the system does not meet its requirements • Error – part of the system’s state that may lead to
failure
• Fault – cause of an error • High dependability can be achieved by
– Preventing faults – Removing faults – Fault tolerance – system provides its services even in the
presence of faults
West Virginia University Copyright © K.Goseva 2008 CS 757 Distributed Systems
Basic concepts
• Types of faults
– Transient
• Occur once and then disappear • If the operation is repeated, the fault goes away
– Intermittent
• Appears, vanishes, then reappears
– Permanent
• Continues to exist until the faulty component is repaired
West Virginia University Copyright © K.Goseva 2008 CS 757 Distributed Systems
Failure masking by redundancy
• Failure masking is achieved by redundancy
– Information redundancy
• Extra bits are added to allow recovery (e.g., Hamming code)
– Time redundancy
• Action is performed, and if needed performed again (e.g., if
transaction aborts, it can be redone)
• Especially helpful for transient or intermittent faults
– Physical redundancy
• Extra equipment or software are added (e.g., 747 has four
engines, but can fly on three)
West Virginia University Copyright © K.Goseva 2008 CS 757 Distributed Systems
Failure masking by redundancy
Triple modular redundancy
West Virginia University Copyright © K.Goseva 2008 CS 757 Distributed Systems
Process resilience
• The key approach to tolerating a faulty process is to
organize several identical processes into a group
– A process can send a message to a group of servers
without having to know how many servers there are orwhere they are
– All members of the group receive the messages – Process groups may be dynamic
• Old groups are destroyed, new groups are created • A process can join or leave a group • A process can be a member of several groups
West Virginia University Copyright © K.Goseva 2008 CS 757 Distributed Systems
Flat groups vs. hierarchical groups
Advantage: no single point of failureDisadvantage: more complex decision
making
Advantage: coordinator makes the decisionsDisadvantage: coordinator is a single point
of failure
West Virginia University Copyright © K.Goseva 2008 CS 757 Distributed Systems
Failure masking and replication
• Replace a single process with a group of replicated
processes
– Primary-based protocols
• Hierarchical group in which the primary coordinates all write
operations
• The primary is fixed • When primary crashes, the backups execute election algorithms
to choose a new primary
– Replicated-write protocols
• Active replication or quorum-based protocols • Flat group
West Virginia University Copyright © K.Goseva 2008 CS 757 Distributed Systems
Failure masking and replication
• How much replication is needed?
– A system is k fault tolerant if it can survive faults in k
components and still meet its specification
• If processes exhibit fail-stop failures, k+1 components is
enough to provide k fault tolerance
• If processes exhibit Byzantine failures, 2k+1 components are
needed to provide k fault tolerance
– All requests arrive at all servers in the same order, also
called the atomic multicast problem
West Virginia University Copyright © K.Goseva 2008 CS 757 Distributed Systems
Agreement in faulty systems
• The case of perfect processes and unreliable
communication
– Two army problem: red army with 5000 troops in the
valley and two blue armies, each 3000 troops, onsurrounding hills
• If the two blue armies can coordinate their attacks on the red
army they will be victorious
• The blue armies need to reach an agreement about attacking • They can communicate using unreliable channel – sending a
messenger who can be captured by the red army
• Even with non-faulty processes (generals), agreement is not
possible in the face of unreliable communication
West Virginia University Copyright © K.Goseva 2008 CS 757 Distributed Systems Slide 17
Agreement in faulty systems
• The case of perfect communication, but faulty
processors
– Byzantine generals problem: red army is still in the
valley, but
n
blue armies are on the nearby hills
• Communication is done pairwise and it is instantaneous and
perfect
m
(out of
n
) generals are traitors (faulty) and are actively trying
to prevent the loyal generals from reaching agreement byproviding incorrect and contradictory information
• Assumption: Generals exchange the troop strengths. If general
i
is loyal, the element
i
is the troop strength; otherwise, it is
undefined
• The question is whether the loyal generals can reach the
agreement
West Virginia University Copyright © K.Goseva 2008 CS 757 Distributed Systems
Agreement in faulty systems
The algorithm fails to produce an agreement
In general, in a system with
m
faulty processors, agreement can be achieved only if
2m+
correctly functioning processors are present for total of
3m+
processors, that is,
2/3 of the processors are working properly
West Virginia University Copyright © K.Goseva 2008 CS 757 Distributed Systems
Reliable client-sever communication