Fault Tolerant Systems: Understanding Hardware and Software Errors in Computer Systems, Lecture notes of Aeronautical Engineering

An overview of fault tolerant systems, including definitions, types of faults and errors, and key ingredients for error processing and fault treatment. It covers both hardware and software faults, and discusses the need for fault tolerance in critical applications, harsh environments, and complex systems.

Typology: Lecture notes

2014/2015

Uploaded on 11/21/2015

haniya.siddiqui

Fault Tolerant Systems
Instructor: Engr. Nabiha Faisal

Prerequisites

  • Basic courses in
    • Digital Design
    • Hardware Organization/Computer Architecture
    • Probability

Terminology of Fault Tolerance

Fault --(causes)--> Error --(results in)--> Failure

  • Fault - a defect within the system
  • Error - observed as a deviation from the expected behaviour of the system
  • Failure - occurs when the system can no longer perform as required (does not meet its specification)
  • Fault Tolerance - the ability of a system to provide a service even in the presence of errors

Fault Tolerance - Basic Definition

  • Fault-tolerant systems - ideally, systems capable of executing their tasks correctly regardless of either hardware failures or software errors
  • In practice - we can never guarantee the flawless execution of tasks under any circumstances
  • We limit ourselves to the types of failures and errors that are more likely to occur

Types of Fault (w.r.t. attributes)

Type of failure: description

  • Crash failure: a server halts, but is working correctly until it halts
    • Amnesia crash: all history is lost; the server must be rebooted
    • Pause crash: the server still remembers its state before the crash and can be recovered
    • Halting crash: hardware failure; the server must be replaced or re-installed
  • Omission failure: a server fails to respond to incoming requests
    • Receive omission: a server fails to receive incoming messages
    • Send omission: a server fails to send messages
  • Timing failure: a server's response lies outside the specified time interval
  • Response failure: the server's response is incorrect
    • Value failure: the value of the response is wrong
    • State transition failure: the server deviates from the correct flow of control
  • Arbitrary failure: a server may produce arbitrary responses at arbitrary times
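The failure types listed above can be told apart, to a first approximation, by what a client observes for a single request. The following is a minimal sketch, not a definitive detector: the names `Observation`, `classify`, `earliest_s` and `latest_s` are hypothetical, and it cannot distinguish a crash from an omission or recognise arbitrary failures, since those need more than one observation.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Observation:
    """What a client saw for one request (all fields are illustrative)."""
    replied: bool                # did any response arrive?
    latency_s: Optional[float]   # time until the response, if one arrived
    value: Optional[int]         # response payload, if one arrived


def classify(obs: Observation, expected: int,
             earliest_s: float, latest_s: float) -> str:
    """Map one observation onto the coarse failure types listed above."""
    if not obs.replied:
        # No response at all: seen from outside, this is an omission.
        return "omission failure"
    if obs.latency_s is None or not (earliest_s <= obs.latency_s <= latest_s):
        # Response arrived outside the specified time interval.
        return "timing failure"
    if obs.value != expected:
        # Response arrived in time but carries the wrong value.
        return "response failure (value)"
    return "correct"


print(classify(Observation(False, None, None), 42, 0.0, 1.0))
print(classify(Observation(True, 2.5, 42), 42, 0.0, 1.0))
print(classify(Observation(True, 0.2, 41), 42, 0.0, 1.0))
print(classify(Observation(True, 0.2, 42), 42, 0.0, 1.0))
```

The checks are ordered from the coarsest symptom (no reply) to the finest (wrong value), so each observation receives exactly one label.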

Fault Vs Error

  • Fault - either a hardware defect or a software/programming mistake
  • Error - a manifestation of a fault (an event or action in which the fault shows itself)
  • Example: an adder circuit with one output line stuck at 1
    • This is a fault, but not (yet) an error
    • It becomes an error when the adder is used and the result on that line should have been 0
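The stuck-at-1 adder example can be sketched in a few lines. This is an illustrative model only: the 4-bit width, the choice of output bit 0 as the stuck line, and the function names are assumptions made for the sketch.

```python
def ideal_adder(a: int, b: int, width: int = 4) -> int:
    """Fault-free reference: sum of a and b, truncated to `width` bits."""
    return (a + b) & ((1 << width) - 1)


def faulty_adder(a: int, b: int, stuck_bit: int = 0, width: int = 4) -> int:
    """Same adder, but output line `stuck_bit` is stuck at 1 (the fault)."""
    return ideal_adder(a, b, width) | (1 << stuck_bit)


def is_error(a: int, b: int, stuck_bit: int = 0) -> bool:
    """The fault becomes an error only if the observed output is wrong."""
    return faulty_adder(a, b, stuck_bit) != ideal_adder(a, b)


# 1 + 2 = 3 (0b0011): bit 0 should be 1 anyway, so the fault stays hidden.
print(is_error(1, 2))   # False
# 1 + 1 = 2 (0b0010): bit 0 should be 0, so the stuck line corrupts the result.
print(is_error(1, 1))   # True
```

The fault is present in both calls; only the second set of inputs activates it and turns it into an error.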

Fault Tolerance - Key Ingredients

  • Error processing: removing errors before a failure occurs
  • Fault treatment: preventing the fault(s) from being activated again

Error Processing

  • Error detection: identification of the erroneous state(s)
  • Error diagnosis: damage assessment
  • Error recovery: an error-free state is substituted for the erroneous state
    • Backward recovery: the system is brought back to a state visited before the error occurred, using recovery points (checkpoints)
    • Forward recovery: the erroneous state is discarded and a correct one is determined, without losing any computation
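Backward recovery with a recovery point can be sketched as follows. This is a toy model under stated assumptions: the class `CheckpointedCounter`, its methods, and the invariant used for error detection (the total is never negative) are all hypothetical and stand in for whatever state and checks a real system would have.

```python
import copy


class CheckpointedCounter:
    """Toy component with a recovery point (checkpoint) for backward recovery."""

    def __init__(self) -> None:
        self.state = {"total": 0}
        self._checkpoint = copy.deepcopy(self.state)

    def checkpoint(self) -> None:
        """Save a recovery point: a state believed to be error-free."""
        self._checkpoint = copy.deepcopy(self.state)

    def add(self, amount: int) -> None:
        self.state["total"] += amount

    def detect_error(self) -> bool:
        """Error detection via an assumed invariant: total is never negative."""
        return self.state["total"] < 0

    def rollback(self) -> None:
        """Backward recovery: return to the state visited before the error."""
        self.state = copy.deepcopy(self._checkpoint)


c = CheckpointedCounter()
c.add(5)
c.checkpoint()            # recovery point with total == 5
c.add(-20)                # simulated corruption drives the total to -15
if c.detect_error():
    c.rollback()
print(c.state["total"])   # 5
```

Any work done between the checkpoint and the error is lost on rollback, which is the cost that forward recovery tries to avoid.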

Need For Fault Tolerance - Critical Applications

  • Aircraft, nuclear reactors, chemical plants, medical equipment
  • A malfunction of a computer in such applications can lead to catastrophe
  • Their probability of failure must be extremely low, possibly one in a billion per hour of operation
  • Also included - financial applications
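To give a feel for the "one in a billion per hour" figure, here is a short worked estimate. It assumes a constant failure rate (the exponential failure model), which the notes do not state, and a hypothetical 10-hour mission chosen purely for illustration.

```latex
% Assumption: constant failure rate \lambda (exponential failure model)
F(t) = 1 - e^{-\lambda t} \approx \lambda t \quad (\lambda t \ll 1)

% Hypothetical mission length t = 10 h at the quoted rate
\lambda = 10^{-9}\ \text{h}^{-1},\; t = 10\ \text{h}
\;\Rightarrow\; F(10) \approx 10^{-9} \times 10 = 10^{-8}
```

Under these assumptions, roughly one such mission in a hundred million would experience a failure.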

Need for Fault Tolerance - Harsh Environments

  • A computing system operating in a harsh environment, where it is subjected to
    • electromagnetic disturbances
    • particle hits and the like
  • The very large number of failures means that the system will not produce useful results unless some fault tolerance is incorporated

Hardware Faults Classification

  • Three types of faults:
    • Transient Faults - disappear after a relatively short time
      • Example - a memory cell whose contents are changed spuriously due to some electromagnetic interference
      • Overwriting the memory cell with the right content will make the fault go away
    • Permanent Faults - never go away; the component has to be repaired or replaced
    • Intermittent Faults - cycle between active and benign states
      • Example - a loose connection
  • Another classification: benign vs malicious
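The difference between a transient and a permanent fault in a memory cell can be illustrated with a small simulation. This is a sketch under assumptions: `MemoryCell`, `upset` and `rewrite_repairs` are hypothetical names, the permanent fault is modelled as a stuck-at value, and intermittent faults are not modelled.

```python
from typing import Optional


class MemoryCell:
    """One-bit cell; `stuck_at` models a permanent fault (None = healthy)."""

    def __init__(self, stuck_at: Optional[int] = None) -> None:
        self.stuck_at = stuck_at
        self._bit = 0

    def write(self, bit: int) -> None:
        self._bit = bit

    def upset(self) -> None:
        """Transient fault: interference flips the stored bit once."""
        self._bit ^= 1

    def read(self) -> int:
        # A stuck-at cell ignores what was written to it.
        return self._bit if self.stuck_at is None else self.stuck_at


def rewrite_repairs(cell: MemoryCell, correct: int) -> bool:
    """Overwrite the cell with the right content and check the read-back."""
    cell.write(correct)
    return cell.read() == correct


transient = MemoryCell()
transient.write(0)
transient.upset()                      # content is now spuriously 1
print(rewrite_repairs(transient, 0))   # True: the fault is gone

permanent = MemoryCell(stuck_at=1)
print(rewrite_repairs(permanent, 0))   # False: needs repair or replacement
```

Rewriting the correct value is enough for the transient case, while the stuck-at cell keeps returning the wrong bit until the component is repaired or replaced.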