Download Fault Tolerance - Advanced Operating Systems - Lecture Slides and more Slides Advanced Operating Systems in PDF only on Docsity!
CIS 620
Advanced Operating Systems
Lecture 11 – Fault Tolerance
Fault Tolerance
- Dependable systems have the following
requirements
- Availability
- Reliability
- Safety
- Maintainability
Failure Models
- Different types of failures.
Type of failure Description Crash failure A server halts, but is working correctly until it halts
Omission failure Receive omission Send omission
A server fails to respond to incoming requests A server fails to receive incoming messages A server fails to send messages Timing failure A server's response lies outside the specified time interval
Response failure Value failure State transition failure
The server's response is incorrect The value of the response is wrong The server deviates from the correct flow of control Arbitrary failure A server may produce arbitrary responses at arbitrary times
Failure Masking by Redundancy
- Triple modular redundancy.
Agreement in Faulty Systems
- The Byzantine generals problem for 3 loyal generals and1 traitor. a) The generals announce their troop strengths (in units of 1 kilosoldiers). b) The vectors that each general assembles based on (a) c) The vectors that each general receives in step 3.
Agreement in Faulty Systems
- The same as in previous slide, except now with 2 loyal generals and one traitor.
RPC Failures
- Lost request message.
- This is easy if known. That is, if we are sure the request was lost.
- Also easy if idempotent and we think it might be lost. - Simply retransmit the request. - Assumes the client still knows the request.
- Lost reply message.
- If it is known the reply was lost, have server retransmit.
RPC Failures
- Assumes the server still has the reply.
- How long should the server hold the reply?
- Wait forever for the reply to be ack'ed? No!
- Discard after "enough" time.
- Discard after we receive another request from this client.
- Ask the client if the reply was received.
- Keep resending reply.
- What if we are not sure of whether we lost the request or the reply?
- If the server is stateless, it doesn't know and the client can't tell!
- If idempotent, simply retransmit the request.
RPC Failures
- From databases, we get the idea of transactions and commits. - This really does solve the problem but is not cheap.
- Fairly easy to get “at least once” (try request again if timer expires) or “at most once (give up if timer expires)” semantics. Hard to get “exactly once” without transactions.
- To be more precise. A transaction either happens exactly once or not at all (sounds like at most once) and the client knows which.
RPC Failures
- Client crashes
- Orphan computations exist.
- Again transactions work but are expensive.
- We can have the rebooted client start another epoch and all computations of previous epoch are killed and clients resubmit. - It is better is to let old computations with owners that can be found continue.
- This isn’t a great solution.
Basic Reliable-Multicasting
Schemes
- A simple solution to reliable multicasting when all receivers are known and are assumed not to fail
a) Message transmission b) Reporting feedback Docsity.com
Nonhierarchical Feedback Control
- Several receivers have scheduled a request for retransmission, but the first retransmission request leads to the suppression of others.
Virtual Synchrony
- The logical organization of a distributed system to distinguish between message receipt and message delivery
Virtual Synchrony
- The principle of virtual synchronous multicast.