Fault Detection in Processors
Prof. Naga Kandasamy
ECE Department, Drexel University, Philadelphia, PA 19104.
February 2, 2009
1 Duplication

Duplication is the simplest fault-detection technique. Two identical copies are used, and when a failure occurs in one of them, their outputs no longer agree. A simple comparison detects the fault. Duplication detects all single faults except a fault in the comparator itself. Duplication is applicable to all areas and levels of computer design, and thus is widely used. Let us now look at examples of circuit-level duplication and system-level duplication.

When considering fault detection using duplicated components, we must note that both copies may be subject to identical failures (common-mode failures), particularly if both have the same design error. Fig. 1 shows the use of complementary circuits to address this problem in VLSI chips^1. Here, one copy of the logic is the dual of the other. A common failure mode may then cause different failure effects in the two copies, resulting in increased coverage for these modes.

Fig. 1: Use of duplicate complementary circuits for fault detection in VLSI chips.

Fig. 2: Use of duplicate sub-systems for fault detection.

Duplication can also be performed at the sub-system (or system) level. Fig. 2 shows the Sperry Univac 1100/60 computer that uses comparison at the bus level for its processors^2. The processor is split into two 36-bit sub-processors. Each sub-processor is duplicated, and only one of the two duplicates drives the master bus during any one processor cycle. The other drives the duplicate data bus. Both copies perform identical operations on the same data values, and at the end of a processor cycle, the results are compared. A disagreement causes the ongoing operation to be interrupted.
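The duplicate-and-compare idea described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `adder`, `faulty_adder`, and `duplicate_and_compare` are hypothetical, the datapath is assumed to be 8 bits wide, and a stuck-at-1 output bit stands in for a hardware fault.

```python
def adder(a, b):
    """Fault-free copy: 8-bit addition (assumed word width)."""
    return (a + b) & 0xFF


def faulty_adder(a, b):
    """Hypothetical faulty copy: output bit 0 is stuck at 1."""
    return adder(a, b) | 0x01


def duplicate_and_compare(copy1, copy2, a, b):
    """Run both copies on the same inputs and compare the results.

    Returns (result_of_copy1, fault_detected). The comparison itself is
    assumed fault-free; a faulty comparator is the one single fault that
    duplication does not detect.
    """
    r1, r2 = copy1(a, b), copy2(a, b)
    return r1, r1 != r2


# One faulty copy: the two results disagree, so the fault is detected.
print(duplicate_and_compare(adder, faulty_adder, 2, 4))         # (6, True)
# Common-mode failure: both copies share the fault and agree on a wrong result.
print(duplicate_and_compare(faulty_adder, faulty_adder, 2, 4))  # (7, False)
```

The second call illustrates why common-mode failures matter: identical copies with an identical fault produce identical wrong outputs, and the comparison passes.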

Cost of Duplication

The cost of duplication is twice that of the corresponding simplex (non-fault-tolerant) system, plus the cost of the comparator. Performance degradation can occur due to: (1) the lack of synchronization between the compared signals, since a component may be slower or faster than its copy (this problem can be solved by using a common clock for the components); and (2) the decision time required by the comparator. Normally, the performance loss due to these factors is minor.

In some cases, the dollar cost incurred by duplication can be halved by using the same hardware to perform the duplicate operations, one following the other in time. This technique, called time redundancy, doubles the execution time and therefore degrades performance. Transient faults can be tolerated using this method. For permanent faults, the coverage can also be increased somewhat. Although a failed ALU in a processor would produce bad results both times, for some permanent failures the two results would differ and still produce a mismatch, provided the second computation is carried out differently from the first. For example, a string of ALU operations (additions, subtractions, etc.) can be performed twice in a different order, or could be done a second time by forming the two's complements of the numbers, adding them, and negating the result.
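The two's-complement variant of time redundancy can be illustrated with a small sketch. It assumes an 8-bit ALU, models a permanent fault as one output bit stuck at 1, and assumes that negation itself is fault-free; the names `make_alu`, `neg`, and `add_with_time_redundancy` are hypothetical and not taken from the notes.

```python
MASK = 0xFF  # assumed 8-bit ALU


def neg(x):
    """Two's-complement negation in 8 bits (assumed fault-free here)."""
    return (-x) & MASK


def make_alu(stuck_at_one_bit=None):
    """Return an 8-bit adder, optionally with one output bit stuck at 1."""
    def add(a, b):
        r = (a + b) & MASK
        if stuck_at_one_bit is not None:
            r |= 1 << stuck_at_one_bit
        return r
    return add


def add_with_time_redundancy(add, a, b):
    """Compute a + b twice on the same ALU.

    The second pass computes -((-a) + (-b)), so a permanent fault may
    affect the two passes differently. Returns (first_result, mismatch).
    """
    first = add(a, b)
    second = neg(add(neg(a), neg(b)))
    return first, first != second


good, bad = make_alu(), make_alu(stuck_at_one_bit=0)
print(add_with_time_redundancy(good, 5, 3))  # (8, False)
print(add_with_time_redundancy(bad, 5, 3))   # (9, True)
# Plain repetition on the faulty ALU gives the same wrong answer twice.
print(bad(5, 3) == bad(5, 3))                # True
```

With this particular fault model, simply repeating the addition never exposes the stuck bit, whereas the complemented second pass yields a different wrong value and so produces a mismatch.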

(^1) R. M. Sedmak and H. L. Liebergot, "Fault Tolerance of a General Purpose Computer Implemented by Very Large Scale Integration," IEEE Transactions on Computers, C-29, pp. 492-500, June 1980.

(^2) L. A. Boone, H. L. Liebergot, and R. M. Sedmak, "Availability, Reliability, and Maintainability Aspects of the Sperry Univac 1100/60," 10th IEEE Symposium on Fault-Tolerant Computing, pp. 3-8, 1980.

(a) Application code:

    ...
    cond = v1;
    if (cond == 1) then v2;
    else v3;
    v4;
    ...

(b) Corresponding control-flow graph (CFG): node v1 branches to either v2 or v3, and both lead to v4. The nodes carry the embedded signatures SIG(v1), SIG(v2), SIG(v3), and SIG(v4).

Fig. 4: (a) An example application code, (b) the corresponding control-flow graph (CFG), and control-flow monitoring using embedded signatures with the nodes.

Checking for valid node transitions. This technique checks if nodes in the CFG are executed in an allowed sequence. The allowed branching sequences for the CFG in Fig. 4 are (v1, v2, v4) and (v1, v3, v4). Any other sequence of node execution is invalid.

Checking for valid node transitions is done using assigned-signature checking in which unique signatures are first assigned arbitrarily to each node, as shown in Fig. 4(b). This is done as a preprocessing step by the compiler when generating the executable code. When the CFG is executed by the processor, the signatures are explicitly transmitted to the watchdog. The watchdog executes the checker code shown in Fig. 5(a).
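A minimal sketch of assigned-signature checking for the CFG of Fig. 4 follows. The signature values, the `SUCCESSORS` table, and the function `watchdog_accepts` are assumptions made for illustration; the notes only state that signatures are assigned arbitrarily and transmitted to the watchdog.

```python
# Arbitrarily assigned signatures (hypothetical values).
SIG = {"v1": 0x3A, "v2": 0x5C, "v3": 0x96, "v4": 0xE1}

# Allowed transitions of the CFG in Fig. 4: v1 -> (v2 or v3) -> v4.
SUCCESSORS = {
    SIG["v1"]: {SIG["v2"], SIG["v3"]},
    SIG["v2"]: {SIG["v4"]},
    SIG["v3"]: {SIG["v4"]},
    SIG["v4"]: set(),
}


def watchdog_accepts(received):
    """Return True if the received signature stream is a complete,
    allowed path through the CFG (starts at v1, ends at v4)."""
    if not received or received[0] != SIG["v1"]:
        return False
    for prev, cur in zip(received, received[1:]):
        if cur not in SUCCESSORS.get(prev, set()):
            return False
    return received[-1] == SIG["v4"]


print(watchdog_accepts([SIG["v1"], SIG["v2"], SIG["v4"]]))  # True
print(watchdog_accepts([SIG["v1"], SIG["v3"], SIG["v4"]]))  # True
print(watchdog_accepts([SIG["v1"], SIG["v4"]]))             # False
```

The third stream skips both branch nodes, which is not one of the two allowed sequences, so the watchdog would flag a control-flow error.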

Checking for correct instruction sequencing. Derived signatures are used to check for correct sequencing of instructions within a single node. In this case, the compiler first computes a checksum for the sequence of instructions within a node (e.g., by XORing the opcodes of these instructions). At run time, the watchdog executes the checker code shown in Fig. 5(b). Here, before entering a node, the processor first informs the watchdog of the node ID. The watchdog then snoops the instruction bus to observe the sequence of instructions transferred between the memory and the CPU, and computes a checksum by XORing the instruction opcodes. Finally, before exiting the node, this run-time signature is compared against the checksum that was computed for the node at compile time.
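The derived-signature check can be sketched as follows, using the XOR checksum mentioned above. The opcode values, the table `COMPILED_SIG`, and the function names are hypothetical; a real watchdog would obtain the opcodes by snooping the instruction bus rather than from a Python list.

```python
from functools import reduce
from operator import xor


def derive_signature(opcodes):
    """XOR all opcodes of a node into a single checksum."""
    return reduce(xor, opcodes, 0)


# Hypothetical opcodes for the instructions inside node v2.
NODE_V2 = [0x12, 0x34, 0x56]

# Computed once at compile time and stored under the node ID.
COMPILED_SIG = {"v2": derive_signature(NODE_V2)}


def check_node(node_id, observed_opcodes):
    """Watchdog check at node exit: compare the run-time signature of the
    observed opcodes with the compile-time signature of the node."""
    return derive_signature(observed_opcodes) == COMPILED_SIG[node_id]


print(check_node("v2", [0x12, 0x34, 0x56]))  # True: correct sequence
print(check_node("v2", [0x12, 0x35, 0x56]))  # False: one corrupted opcode
print(check_node("v2", [0x34, 0x12, 0x56]))  # True: reordering goes unnoticed
```

The last call shows a limitation of a plain XOR checksum: XOR is commutative, so a reordering of the same opcodes yields the same signature. An order-sensitive checksum would be needed to catch that case; this is an observation about the sketch, not a claim about the scheme used in the notes.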

(a) Assigned-signature checking for node transitions:

    accept SIG(v1);
    either
        accept SIG(v2);
    or
        accept SIG(v3);
    accept SIG(v4);

(b) Derived-signature checking for instruction sequencing within a single node:

    case node {
        v1: check SIG(v1);
        v2: check SIG(v2);
        v3: check SIG(v3);
        v4: check SIG(v4);
    }

(c) Derived-signature checking for both node transitions and instruction sequencing:

    accept SIG(v1), check SIG(v1);
    either
        accept SIG(v2), check SIG(v2);
    or
        accept SIG(v3), check SIG(v3);
    accept SIG(v4), check SIG(v4);

Fig. 5: The code executed by the watchdog for (a) checking node transitions, (b) checking instruction sequencing within a node, and (c) checking both node transitions and instruction sequencing within a node.

Checking for node transitions and instruction sequencing. Finally, Fig. 5(c) shows the watchdog code that checks both node transitions as well as the sequencing of instructions within the node.

Assertion Checking

In many systems, software-based methods provide a low-cost alternative to hardware redundancy. Two widely used approaches are assertions and acceptance tests, both of which use application-specific knowledge to detect failures. Assertions check whether a task satisfies specific constraints at particular points during its execution, while acceptance tests use range, bounds, and sanity checks to verify its result. These tests must be carefully designed and evaluated to achieve high fault coverage while minimizing false alarms. For example, an overly sensitive test can reject a result because of a small deviation from optimal performance, even though the risk of an accident may be small. A more systematic method of detecting transient failures is simply to execute a task twice and compare the results.
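The three software checks mentioned above can be sketched together. The task `average_temperature`, the bounds of -40 to 125 degrees Celsius, and the helper `run_checked` are all hypothetical and chosen only for illustration.

```python
def average_temperature(readings):
    """Hypothetical task: average a list of temperature readings (deg C)."""
    # Assertion: a constraint that must hold at this point in the task.
    assert len(readings) > 0, "no readings to average"
    return sum(readings) / len(readings)


def acceptance_test(result, low=-40.0, high=125.0):
    """Range check on the final result; the bounds are assumed sensor limits."""
    return low <= result <= high


def run_checked(task, data):
    """Execute the task twice, compare the results, then apply the
    acceptance test. Returns (result_or_None, status)."""
    r1, r2 = task(data), task(data)
    if r1 != r2:
        return None, "mismatch between the two executions"
    if not acceptance_test(r1):
        return None, "result failed the acceptance test"
    return r1, "ok"


print(run_checked(average_temperature, [20.0, 22.0, 24.0]))  # (22.0, 'ok')
print(run_checked(average_temperature, [500.0, 500.0]))
# (None, 'result failed the acceptance test')
```

How tight to make the bounds is the design trade-off described in the text: narrow bounds catch more failures but also raise more false alarms.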