Fault Tolerance - Advanced Operating Systems - Lecture Slides, Slides of Advanced Operating Systems

Main points of this lecture are: Fault Tolerance, Dependable Systems, Types of Failures, Failure Models, Arbitrary Failure, Omission Failure, Failure Masking by Redundancy, Triple Modular Redundancy, Hierarchical Groups, Agreement in Faulty Systems

Typology: Slides

2012/2013

Uploaded on 04/23/2013

atasi
atasi 🇮🇳

4.6

(32)

134 documents

1 / 62

Toggle sidebar

This page cannot be seen from the preview

Don't miss anything!

bg1
CIS 620
Advanced Operating Systems
Lecture 11 Fault Tolerance
Docsity.com
pf3
pf4
pf5
pf8
pf9
pfa
pfd
pfe
pff
pf12
pf13
pf14
pf15
pf16
pf17
pf18
pf19
pf1a
pf1b
pf1c
pf1d
pf1e
pf1f
pf20
pf21
pf22
pf23
pf24
pf25
pf26
pf27
pf28
pf29
pf2a
pf2b
pf2c
pf2d
pf2e
pf2f
pf30
pf31
pf32
pf33
pf34
pf35
pf36
pf37
pf38
pf39
pf3a
pf3b
pf3c
pf3d
pf3e

Partial preview of the text

Download Fault Tolerance - Advanced Operating Systems - Lecture Slides and more Slides Advanced Operating Systems in PDF only on Docsity!

CIS 620

Advanced Operating Systems

Lecture 11 – Fault Tolerance

Fault Tolerance

  • Dependable systems have the following

requirements

  • Availability
  • Reliability
  • Safety
  • Maintainability

Failure Models

  • Different types of failures.

Type of failure Description Crash failure A server halts, but is working correctly until it halts

Omission failure Receive omission Send omission

A server fails to respond to incoming requests A server fails to receive incoming messages A server fails to send messages Timing failure A server's response lies outside the specified time interval

Response failure Value failure State transition failure

The server's response is incorrect The value of the response is wrong The server deviates from the correct flow of control Arbitrary failure A server may produce arbitrary responses at arbitrary times

Failure Masking by Redundancy

  • Triple modular redundancy.

Agreement in Faulty Systems

  • The Byzantine generals problem for 3 loyal generals and1 traitor. a) The generals announce their troop strengths (in units of 1 kilosoldiers). b) The vectors that each general assembles based on (a) c) The vectors that each general receives in step 3.

Agreement in Faulty Systems

  • The same as in previous slide, except now with 2 loyal generals and one traitor.

RPC Failures

  • Lost request message.
    • This is easy if known. That is, if we are sure the request was lost.
    • Also easy if idempotent and we think it might be lost. - Simply retransmit the request. - Assumes the client still knows the request.
  • Lost reply message.
    • If it is known the reply was lost, have server retransmit.

RPC Failures

  • Assumes the server still has the reply.
  • How long should the server hold the reply?
    • Wait forever for the reply to be ack'ed? No!
    • Discard after "enough" time.
    • Discard after we receive another request from this client.
    • Ask the client if the reply was received.
    • Keep resending reply.
  • What if we are not sure of whether we lost the request or the reply?
  • If the server is stateless, it doesn't know and the client can't tell!
  • If idempotent, simply retransmit the request.

RPC Failures

  • From databases, we get the idea of transactions and commits. - This really does solve the problem but is not cheap.
  • Fairly easy to get “at least once” (try request again if timer expires) or “at most once (give up if timer expires)” semantics. Hard to get “exactly once” without transactions.
  • To be more precise. A transaction either happens exactly once or not at all (sounds like at most once) and the client knows which.

RPC Failures

  • Client crashes
    • Orphan computations exist.
    • Again transactions work but are expensive.
    • We can have the rebooted client start another epoch and all computations of previous epoch are killed and clients resubmit. - It is better is to let old computations with owners that can be found continue.
    • This isn’t a great solution.

Basic Reliable-Multicasting

Schemes

  • A simple solution to reliable multicasting when all receivers are known and are assumed not to fail

a) Message transmission b) Reporting feedback Docsity.com

Nonhierarchical Feedback Control

  • Several receivers have scheduled a request for retransmission, but the first retransmission request leads to the suppression of others.

Virtual Synchrony

  • The logical organization of a distributed system to distinguish between message receipt and message delivery

Virtual Synchrony

  • The principle of virtual synchronous multicast.