Fault Tolerance - Distributed Operating Systems - Lecture Slides

The Distributed Operating Systems course examines the fundamental principles of distributed systems and gives students hands-on experience in developing distributed protocols. This lecture includes: Fault Tolerance, Process Resilience, Reliable Communication, Distributed Commit, Recovery, Dependability, Failure Models, Failure Masking, Hierarchical Groups, Byzantine Generals

Typology: Slides

2013/2014

Uploaded on 02/01/2014

sailendra

Fault Tolerance
Part I Introduction
Part II Process Resilience
Part III Reliable Communication
Part IV Distributed Commit
Part V Recovery

Fault Tolerance

  • A distributed system (DS) should be fault-tolerant
    • It should be able to continue functioning in the presence of faults
  • Fault tolerance is related to dependability

Availability & Reliability (1)

  • Availability : A measurement of whether a system is ready to be used immediately - System is up and running at any given moment
  • Reliability : A measurement of whether a system can run continuously without failure - System continues to function for a long period of time

Availability & Reliability (2)

  • A system that goes down for 1 ms every hour has an availability of more than 99.99%, but is unreliable
  • A system that never crashes but is shut down for a week once every year is 100% reliable but only 98% available
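The two percentages above can be checked with a few lines of arithmetic. This is a minimal sketch that assumes availability is simply the fraction of time the system is up; the helper name `availability` is illustrative and does not come from the slides.

```python
def availability(downtime, period):
    """Fraction of the period during which the system is up.

    Both arguments must use the same time unit.
    """
    return 1.0 - downtime / period


HOUR_MS = 60 * 60 * 1000  # milliseconds in one hour

# Down for 1 ms in every hour: highly available, yet it fails every hour.
a_short_outages = availability(1, HOUR_MS)

# Down for 7 days in every 365-day year: never crashes, but less available.
a_yearly_shutdown = availability(7, 365)

print(f"{a_short_outages:.7%}")
print(f"{a_yearly_shutdown:.2%}")
```

The first system is available almost all of the time but cannot run for more than an hour without a failure. The second runs without failure but is unavailable for about 2% of the year.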

Faults

  • A system fails when it cannot meet its promises (specifications)
  • An error is part of a system state that may lead to a failure
  • A fault is the cause of the error
  • Fault-Tolerance : the system can provide services even in the presence of faults

  • Faults can be:
    • Transient (appear once and disappear)
    • Intermittent (appear-disappear-reappear behavior)
      • Example: a loose contact on a connector causes an intermittent fault
    • Permanent (appear and persist until repaired)

Failure Models

Type of failure : Description

  • Crash failure : A server halts, but is working correctly until it halts
  • Omission failure : A server fails to respond to incoming requests
    • Receive omission : A server fails to receive incoming messages
    • Send omission : A server fails to send messages
  • Timing failure : A server's response lies outside the specified time interval
  • Response failure : The server's response is incorrect
    • Value failure : The value of the response is wrong
    • State transition failure : The server deviates from the correct flow of control
  • Arbitrary failure (Byzantine failure) : A server may produce arbitrary responses at arbitrary times

Example – Redundancy in Circuits (1)

Example – Redundancy in Circuits (2)

Triple modular redundancy.
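The idea behind triple modular redundancy can be sketched in software: run three copies of a module and let a voter pass on the value that at least two of them agree on. The sketch below is illustrative only. The names `vote` and `tmr` are hypothetical, and it assumes the modules are plain Python functions that return hashable values.

```python
from collections import Counter


def vote(a, b, c):
    """Majority voter: return the value at least two of the three inputs share."""
    value, count = Counter([a, b, c]).most_common(1)[0]
    if count < 2:
        # All three inputs differ, so more than one module must be faulty.
        raise ValueError("no two inputs agree")
    return value


def tmr(modules, x):
    """Feed x to three replicated modules and vote on their outputs."""
    return vote(*(module(x) for module in modules))


def good(x):
    return x + 1


def faulty(x):
    return -1  # a stuck-at fault: always the same wrong output


result = tmr([good, faulty, good], 4)
print(result)  # 5: the single faulty module is outvoted
```

A single faulty module is masked. If two modules fail in different ways, the voter raises an error instead of guessing.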

Process Resilience

  • Mask process failures by replication
  • Organize processes into groups; a message sent to a group is delivered to all members
  • If a member fails, another should fill in
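A toy version of these three points, written as a sketch rather than as a real group communication layer: the classes `Replica` and `FlatGroup` are hypothetical, a crash is modelled by a flag, and "delivery" is an ordinary method call.

```python
class Replica:
    """One member of the group; a crash is simulated with the alive flag."""

    def __init__(self, name):
        self.name = name
        self.alive = True
        self.log = []  # messages this replica has processed

    def deliver(self, message):
        if not self.alive:
            raise ConnectionError(f"{self.name} has crashed")
        self.log.append(message)
        return f"{self.name} handled {message}"


class FlatGroup:
    """A message sent to the group is delivered to every live member."""

    def __init__(self, replicas):
        self.replicas = list(replicas)

    def send(self, message):
        replies = []
        for replica in self.replicas:
            try:
                replies.append(replica.deliver(message))
            except ConnectionError:
                continue  # the crash is masked: the other members fill in
        if not replies:
            raise RuntimeError("all group members have failed")
        return replies[0]


a, b, c = Replica("a"), Replica("b"), Replica("c")
group = FlatGroup([a, b, c])
first = group.send("m1")
a.alive = False  # member a crashes
second = group.send("m2")
print(first, "|", second)
```

The caller still gets an answer after `a` crashes because `b` and `c` received every message as well.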

Flat Groups versus Hierarchical Groups

a) Communication in a flat group.
b) Communication in a simple hierarchical group

Agreement

  • Agreement is needed in a DS:
    • To elect a leader, commit a transaction, synchronize
  • Distributed agreement algorithm : all non-faulty processes reach consensus in a finite number of steps
  • Perfect processes, faulty channels: the two-army problem
  • Faulty processes, perfect channels: the Byzantine generals problem

Two-Army Problem

Byzantine Generals – Example (1)

The Byzantine generals problem for 3 loyal generals and 1 traitor.
a) The generals announce the time to launch the attack (by messages marked by their ids).
b) The vectors that each general assembles based on (a).
c) The vectors that each general receives in step 3, where every general passes his vector from (b) to every other general.

Byzantine Generals – Example (2)

The same as in the previous slide, except now with 2 loyal generals and 1 traitor.
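The vector exchange in the two examples can be simulated in a few lines. This is a sketch under stated assumptions, not the lecture's own code: generals are numbered from 0, loyal general i announces the value i + 1, the traitor's lies are fixed made-up numbers so that the run is reproducible, and each loyal general takes a strict majority over the vectors it receives. The function name `byzantine_round` is hypothetical.

```python
from collections import Counter

UNKNOWN = None


def byzantine_round(n, traitor):
    """Simulate the vector exchange for n generals, one of whom is a traitor.

    Returns {loyal general id: result vector}.
    """
    loyal = [g for g in range(n) if g != traitor]

    def announced(sender, receiver):
        # (a) A loyal general tells everyone the same value;
        # the traitor tells each receiver something different.
        if sender == traitor:
            return 100 + receiver
        return sender + 1

    # (b) Each general assembles a vector from the values it was told.
    vectors = {g: [announced(s, g) for s in range(n)] for g in range(n)}

    def forwarded(sender, receiver):
        # (c) Loyal generals forward their vector unchanged;
        # the traitor sends a different made-up vector to each receiver.
        if sender == traitor:
            return [1000 * (receiver + 1) + k for k in range(n)]
        return vectors[sender]

    results = {}
    for g in loyal:
        received = [forwarded(s, g) for s in range(n) if s != g]
        result = []
        for i in range(n):
            value, count = Counter(v[i] for v in received).most_common(1)[0]
            # Keep element i only if a strict majority of received vectors agree.
            result.append(value if count > len(received) / 2 else UNKNOWN)
        results[g] = result
    return results


four = byzantine_round(4, traitor=2)   # 3 loyal generals, 1 traitor
three = byzantine_round(3, traitor=2)  # 2 loyal generals, 1 traitor
print(four)
print(three)
```

With 3 loyal generals and 1 traitor, every loyal general ends up with the same result vector, in which only the traitor's entry is unknown. With 2 loyal generals and 1 traitor, no entry reaches a majority, so the loyal generals cannot agree on anything.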