






Topics in Dependable Computing Systems include Terminology and Concepts, Software Fault Tolerance, Markov Models, Information Redundancy, Fault Detection in Processors, and Defect Tolerance in VLSI Circuits. The main points of this lecture are: Combinational Techniques, Canonical Structures, Series and Parallel Systems, Series-Parallel Combinations, Nonseries/Nonparallel Models, Minimal Cut Sets, Masking Redundancy, and Triple Modular Redundancy.







The following material is derived from these text books.
When designing a system, it is important to be able to predict the reliability of the final system containing many components. The two most common methods of estimating the reliability of complex systems are combinational modeling and Markov state modeling.
We will first consider some canonical structures and discuss how their reliability can be quantified using combinational techniques.
In a series combination of components, the failure of any of the components will result in the failure of the overall system. If a system contains N components arranged in series, and if the failure rates of the components are independent, then the system’s failure rate λ is given by
$$\lambda = \sum_{i=1}^{N} \lambda_i$$
where $\lambda_i$ is the failure rate of the $i$th component.
The reliability of the series arrangement may also be expressed in terms of the reliability of the individual components. If $R_i(t)$ is the reliability of the $i$th component in the system, the overall system reliability $R(t)$ is given by
$$R(t) = R_1(t)\, R_2(t) \cdots R_N(t)$$
which may be written as
$$R(t) = \prod_{i=1}^{N} R_i(t)$$
The reliability of a parallel combination of components is given by
$$R(t) = 1 - \prod_{i=1}^{N} \left[1 - R_i(t)\right]$$
Fig. 1: Series and parallel combination of system components.
Consider the system shown in Fig. 2. Suppose the reliability of module M1 is 0.99, that of M2, M3, and M4 is 0.80, that of M5 and M6 is 0.90, that of M7 and M8 is 0.95, and that of M9 is 0.94. Then the reliability of the parallel combination of modules M2, M3, and M4 is given by
$$R(t) = 1 - [1 - 0.8]^3 = 0.992$$
The series combination of M5 and M7 (and likewise of M6 and M8) has reliability $0.90 \times 0.95 = 0.855$, and the parallel combination of these two paths has reliability
$$R(t) = 1 - [1 - 0.855]^2 = 0.979$$
The system can therefore be simplified as shown in Fig. 2, with M10 and M11 standing for the two parallel groups just computed. The series combination of M10 and M11 has reliability $0.992 \times 0.979 = 0.971$, and placing this in parallel with M9 gives
$$R(t) = 1 - [1 - 0.971][1 - 0.94] = 0.998$$
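The reduction above can be reproduced with two small helper functions. This is a minimal sketch: the helper names `series` and `parallel` are illustrative and not part of the original notes, and the module values are the ones quoted in the text.

```python
from math import prod

def series(*reliabilities):
    """Reliability of components in series: the product of the R_i."""
    return prod(reliabilities)

def parallel(*reliabilities):
    """Reliability of components in parallel: 1 - product of (1 - R_i)."""
    return 1.0 - prod(1.0 - r for r in reliabilities)

# Parallel group M2, M3, M4 (called M10 in the simplified diagram).
m10 = parallel(0.80, 0.80, 0.80)

# Two identical series paths (M5-M7 and M6-M8) placed in parallel (M11).
path = series(0.90, 0.95)
m11 = parallel(path, path)

# M10 and M11 in series, then in parallel with M9.
m10_m11 = series(m10, m11)
result = parallel(m10_m11, 0.94)

print(round(m10, 3), round(m11, 3), round(m10_m11, 3), round(result, 3))
```

The code carries full precision between steps, whereas the text rounds each intermediate value to three decimals; the rounded results agree.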
Sometimes, a “success” diagram is used to describe the operational mode of a system. A success diagram may not be directly reducible by applying the series/parallel formulas. In such cases, one can obtain a lower bound on system reliability in terms of the minimal cut sets of the system. A cut set of a graph is a set of branches whose removal interrupts all connections between the input and the output. A minimal cut set is a cut set from which no branch can be dropped without restoring a connection. Every system failure corresponds to the removal of at least one minimal cut set from the graph, so the probability of system failure is the probability that at least one minimal cut set fails. Let $Q_{cut_i}$ denote the probability that the $i$th minimal cut set fails, that is, that all of its branches fail. The lower bound on system reliability is then given by
$$R_{sys} \ge \prod_i \left(1 - Q_{cut_i}\right)$$
Fig. 4: The reliability of an NMR system comprising 2n + 1 modules as a function of time.
A tie set is a group of branches that, when traversed, forms a connection between the input and the output. A minimal tie set is one containing a minimum number of elements: if no node is traversed more than once in tracing out the tie set, then the tie set is minimal. If $R_{path_i}$ denotes the series reliability of path $i$, then
$$R_{sys} \le 1 - \prod_i \left(1 - R_{path_i}\right)$$
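The minimal tie sets in Fig. 3 are {M1, M2, M3, M4}, {M1, M6, M4}, and {M5, M3, M4}. If every module has the same reliability $R$, the path reliabilities are $R^4$, $R^3$, and $R^3$, and bounding the system reliability by their sum gives
$$R_s \le R^4 + 2R^3$$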
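To see how the two bounds bracket the true value, the sketch below takes the three tie sets quoted for Fig. 3, derives the minimal cut sets from them by brute force, and compares both bounds with an exact enumeration. It assumes independent modules with a common reliability of 0.9, a value chosen only for illustration; all function names are hypothetical.

```python
from itertools import combinations, product
from math import prod

MODULES = ["M1", "M2", "M3", "M4", "M5", "M6"]
TIE_SETS = [{"M1", "M2", "M3", "M4"}, {"M1", "M6", "M4"}, {"M5", "M3", "M4"}]

def minimal_cut_sets(modules, tie_sets):
    """Brute force: smallest module sets that intersect every tie set."""
    cuts = []
    for size in range(1, len(modules) + 1):
        for combo in combinations(modules, size):
            candidate = set(combo)
            breaks_all_paths = all(candidate & t for t in tie_sets)
            is_minimal = not any(c <= candidate for c in cuts)
            if breaks_all_paths and is_minimal:
                cuts.append(candidate)
    return cuts

def exact_reliability(modules, tie_sets, r):
    """Sum the probability of every state in which some tie set is fully up."""
    total = 0.0
    for states in product([True, False], repeat=len(modules)):
        up = {m for m, ok in zip(modules, states) if ok}
        if any(t <= up for t in tie_sets):
            total += prod(r if ok else 1.0 - r for ok in states)
    return total

def tie_set_upper_bound(tie_sets, r):
    """1 - prod(1 - R_path_i), with R_path_i = r ** (path length)."""
    return 1.0 - prod(1.0 - r ** len(t) for t in tie_sets)

def cut_set_lower_bound(cut_sets, r):
    """prod(1 - Q_cut_i), with Q_cut_i = (1 - r) ** (cut size)."""
    return prod(1.0 - (1.0 - r) ** len(c) for c in cut_sets)

r = 0.9  # assumed common module reliability (illustrative only)
cuts = minimal_cut_sets(MODULES, TIE_SETS)
lower = cut_set_lower_bound(cuts, r)
exact = exact_reliability(MODULES, TIE_SETS, r)
upper = tie_set_upper_bound(TIE_SETS, r)
print(sorted(sorted(c) for c in cuts))
print(lower, exact, upper)
```

For highly reliable modules the cut-set bound tends to be the tighter of the two, which is one reason it is the one usually quoted.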
2 Masking Redundancy using M-out-of-N structures
Another simple structure that serves as a useful model for many reliability problems is an M-out-of-N structure. Such a model represents a system of N components in which at least M of the N components must be good for the system to succeed. The probability that exactly M out of N identical and independent components are good is given by
$$\binom{N}{M} p^{M} (1 - p)^{N-M}$$
Here, p denotes the (identical) reliability of each component. For a constant failure rate λ and using the exponential failure law $p = e^{-\lambda t}$ for each item, the probability that at least M out of N items are good is given by
$$R(t) = \sum_{i=M}^{N} \binom{N}{i} e^{-i\lambda t} \left(1 - e^{-\lambda t}\right)^{N-i}$$
In general, N is an odd integer. However, as we shall soon see, if we can diagnose and lock out faulty modules, it is feasible to let N be an even integer. If we let N = 2n + 1, n ≥ 1, then, in a simple masking scheme, we need a majority of the modules to work correctly, that is, M = n + 1. Fig. 4 shows the reliability function for various values of n assuming $p = e^{-\lambda t}$. The figure shows that NMR is superior to a single unit only in the high-reliability region: specifically, NMR is superior to the single unit for λt < 0.69 (that is, λt < ln 2, or p > 0.5). Therefore, when designing any system, we must carefully evaluate the reliability values obtained over the range 0 < t < maximum mission time for various values of n and λ.
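The crossover at λt ≈ 0.69 can be checked numerically. The following is a minimal sketch of the M-out-of-N sum for the majority case; the function names are illustrative, and the sample values of λt are arbitrary.

```python
from math import comb, exp, log

def m_out_of_n(m, n, p):
    """Probability that at least m of n independent components are good."""
    return sum(comb(n, i) * p**i * (1.0 - p) ** (n - i) for i in range(m, n + 1))

def nmr_reliability(n, lam_t):
    """NMR with N = 2n + 1 modules and majority M = n + 1, p = exp(-lam_t)."""
    p = exp(-lam_t)
    return m_out_of_n(n + 1, 2 * n + 1, p)

for lam_t in (0.1, log(2), 1.0):
    single = exp(-lam_t)
    tmr = nmr_reliability(1, lam_t)    # N = 3
    five = nmr_reliability(2, lam_t)   # N = 5
    print(f"lambda*t={lam_t:.3f}  single={single:.4f}  N=3: {tmr:.4f}  N=5: {five:.4f}")
```

Below the crossover, adding modules raises reliability; above it, the redundant configurations do worse than a single unit, and more so as n grows.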
A special case of an M -out-of-N structure is triple modular redundancy or TMR. The basic TMR structure, shown in Fig. 5, consists of three parallel modules where each module is provided with the same input. The outputs of the three modules are compared by the voter, which gives the majority opinion as the system output. If all three modules are operating properly, all outputs agree, and thus the system output is correct. However, if one module has failed so that it has produced an incorrect output, the voter chooses the output of the two good modules as the system output because they both agree, and thus the system output is correct. If two modules have failed, the voter agrees with the majority (the two that have failed), and thus the system output is incorrect.
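The voter behaviour described above can be sketched in a few lines. This is an illustrative software model only (real TMR voters are typically hardware), and the function name `majority_vote` is hypothetical.

```python
from collections import Counter

def majority_vote(outputs):
    """Return the value produced by a strict majority of modules, else None."""
    value, count = Counter(outputs).most_common(1)[0]
    return value if count > len(outputs) / 2 else None

correct = 42
print(majority_vote([42, 42, 42]))  # all modules good: output is correct
print(majority_vote([42, 7, 42]))   # one faulty module is masked
print(majority_vote([42, 7, 7]))    # two faulty modules that agree outvote the good one
print(majority_vote([42, 7, 9]))    # no two outputs agree: no majority
```

The third case is the failure mode the text describes: the voter has no way to tell a correct majority from an incorrect one.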
A TMR system will function correctly provided that at least two modules are operational, and assuming that the voter does not fail, that is Rv = 1. Thus, the probability of the system working correctly is given by
$$R = R_v \left[ \binom{3}{3} p^3 (1 - p)^0 + \binom{3}{2} p^2 (1 - p)^1 \right] = 3p^2 - 2p^3 = p^2 (3 - 2p)$$
This is, of course, the reliability expression for a two-out-of-three system. If we assume a constant failure rate λ, then each module has a reliability $p = e^{-\lambda t}$, and substituting in the above equation yields
$$R(t) = 3e^{-2\lambda t} - 2e^{-3\lambda t}$$
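We can compute the MTTF for this system by integrating the reliability function:
$$\mathrm{MTTF} = \int_0^{\infty} \left(3e^{-2\lambda t} - 2e^{-3\lambda t}\right) dt = \frac{3}{2\lambda} - \frac{2}{3\lambda} = \frac{5}{6\lambda}$$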
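As a sanity check on the closed form 5/(6λ), the TMR reliability function can be integrated numerically. This is a rough sketch using a plain trapezoidal rule; the failure rate `lam`, the integration horizon, and the step count are arbitrary assumptions.

```python
from math import exp

def tmr_reliability(lam, t):
    """R(t) = 3 exp(-2 lam t) - 2 exp(-3 lam t), assuming a perfect voter."""
    return 3.0 * exp(-2.0 * lam * t) - 2.0 * exp(-3.0 * lam * t)

def mttf_numeric(reliability, lam, horizon_factor=50.0, steps=100_000):
    """Trapezoidal integral of R(t) over [0, horizon_factor / lam]."""
    t_max = horizon_factor / lam
    h = t_max / steps
    total = 0.5 * (reliability(lam, 0.0) + reliability(lam, t_max))
    total += sum(reliability(lam, k * h) for k in range(1, steps))
    return total * h

lam = 1e-3  # assumed failure rate per hour (illustrative only)
mttf_tmr = mttf_numeric(tmr_reliability, lam)
mttf_closed_form = 5.0 / (6.0 * lam)
mttf_simplex = 1.0 / lam
print(mttf_tmr, mttf_closed_form, mttf_simplex)
```

Note that the TMR value is shorter than the 1/λ of a single module with the same failure rate: TMR improves reliability early in the mission, not the mean time to failure.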
This TMR system can be called a “3-2 system” because the system succeeds if 3 or 2 units are good. When a second failure occurs, however, the voter does not know which of the components has failed and cannot determine which is the good component.
In some cases, additional information is available by such means as observation (from a human operator or a diagnostic system) of the remaining two units after the first failure occurs. If one of the two remaining units
Fig. 5: The basic triple-modular redundancy (TMR) scheme.
Fig. 7: Comparison of three different systems: (a) a simplex (non-redundant) system comprising two components, (b) system redundancy, and (c) component redundancy.
For the simplex system of Fig. 7(a), the two components are in series, so $R_a(t) = p^2$, where the components M1 and M2 are independent, but have identical reliability R(t) = p.
The reliability expression for Fig. 7(b), comprising two simplex units connected in parallel, is given by
$$R_b(t) = R_a^2(t) + \binom{2}{1} R_a(t)\left(1 - R_a(t)\right) = p^2 (2 - p^2) \qquad (1)$$
For Fig. 7(c), we combine each component pair in parallel to obtain
$$R_c(t) = \left[p^2 + 2p(1 - p)\right]^2 = p^2 (2 - p)^2 \qquad (2)$$
To compare Equations 1 and 2, we use the ratio
$$\frac{R_c(t)}{R_b(t)} = \frac{p^2 (2 - p)^2}{p^2 (2 - p^2)} = \frac{(2 - p)^2}{2 - p^2} \qquad (3)$$
Some algebraic manipulation yields
$$\frac{R_c(t)}{R_b(t)} = 1 + \frac{2(1 - p)^2}{2 - p^2}$$
Since 0 < p < 1, both $2(1 - p)^2$ and $2 - p^2$ are positive, and so $R_c(t)/R_b(t) > 1$. Therefore, component redundancy is superior to system redundancy for this example. (The two reliabilities are, of course, equal at the extremes p = 0 and p = 1.)
We can extend the above analysis to m components, in which case Equation 3 becomes
$$\frac{R_c(t)}{R_b(t)} = \frac{(2 - p)^m}{2 - p^m}$$
It can be shown by induction that this ratio is always greater than 1 and that component redundancy is superior regardless of the number of components. The superiority of component redundancy over system redundancy also holds true for non-identical components.
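The comparison can also be checked numerically for a range of p and m. This is a minimal sketch for identical, independent components only; the function names are illustrative.

```python
def system_redundancy(p, m):
    """Two copies of an m-component series chain, placed in parallel."""
    chain = p**m
    return 1.0 - (1.0 - chain) ** 2

def component_redundancy(p, m):
    """Each of the m components duplicated in parallel, pairs in series."""
    pair = 1.0 - (1.0 - p) ** 2
    return pair**m

def ratio(p, m):
    """R_c / R_b, which should equal (2 - p)^m / (2 - p^m)."""
    return component_redundancy(p, m) / system_redundancy(p, m)

for m in (2, 3, 5):
    for p in (0.5, 0.9, 0.99):
        closed_form = (2.0 - p) ** m / (2.0 - p**m)
        print(f"m={m} p={p}: ratio={ratio(p, m):.6f} closed form={closed_form:.6f}")
```

The advantage of component redundancy grows with the number of components m and shrinks as p approaches 1.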
Fig. 8: Redundancy comparison: (a) component redundancy and (b) system redundancy
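A simpler proof of the foregoing principle can be formulated by considering tie sets. In Fig. 7(b), the tie sets are $M_1 M_2$ and $M'_1 M'_2$, whereas in Fig. 7(c), the tie sets are $M_1 M_2$, $M_1 M'_2$, $M'_1 M_2$, and $M'_1 M'_2$. Every tie set of the system-redundant structure is also a tie set of the component-redundant structure, which has two additional ones, so component redundancy succeeds whenever system redundancy does.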
The reliability of system and component redundancy are compared graphically in Fig. 8.
3 Voter Design Issues
This section considers various issues related to voter design including relaxing our assumption of perfect voters.
Returning to the TMR reliability equation, if the unit reliability is denoted as pc and the voter reliability by
Fig. 10: Synchronization in the COMTRAC railroad traffic control computer.
Synchronizing the outputs of the replicated units is another concern. The problem of synchronization is often solved using a common (fault-tolerant) clock. Another method of synchronization is used by the COMTRAC railroad traffic control computer shown in Fig. 10.¹ Synchronization is maintained at the program task level. The system controller (DSC) ensures that both processors are performing the same calculation. When both computers have finished the calculation, the DSC compares the two results. If a mismatch occurs, the controller forces both processors to run identical test programs. The test program exercises the entire processor during the course of calculating a single constant.
4 Dynamic Redundancy
One of the drawbacks of an NMR scheme is that the fault masking ability deteriorates as more copies fail. In its pure form, fault masking neutralizes the effects of failed units without notification of their failures. Therefore, the faulty modules can eventually outvote the good modules. However, an NMR system could continue to function longer if the known bad modules could be discounted in the vote. Two methods of reconfiguration based on NMR are: (1) Hybrid redundancy where failed modules are replaced with good spares and (2) Dynamic modification of the voting process or adaptive voting.
Fig. 11 illustrates the basic concept of hybrid redundancy, in which a core group of N identical modules is used at any one time, and their outputs are voted upon to produce the system output. When a disagreement is detected, the module(s) in the minority are assumed to have failed and are replaced by an equivalent number of spare modules. Initially, the system contains a total of (N + S) modules. As long as the number of failed modules in the core group does not exceed $t = \lfloor N/2 \rfloor$ before reconfiguration can take place, the system in Fig. 11 can tolerate the failure of t + S of its modules.
¹ Ihara et al., Fault-Tolerant Computer System with Three Symmetric Computers, Proceedings of the IEEE, pp. 1160-1177, October 1978.
Fig. 11: Organization of a system with hybrid redundancy.
Fig. 12: The quad-redundant flight control system of the space shuttle.
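The t + S figure for hybrid redundancy can be illustrated with a small counting model. This sketch rests on simplifying assumptions that are not stated in the notes: N is odd, failures occur one at a time, idle spares never fail, and detection and switching are perfect. The function name is hypothetical.

```python
def hybrid_failures_tolerated(n_core, spares):
    """Count sequential module failures survived by a hybrid-redundant system.

    Each failure hits a good module in the core group. While spares remain,
    the failed module is swapped for a good spare; afterwards the core simply
    degrades. The system is lost once good modules are no longer a majority.
    """
    good_in_core = n_core
    failures_survived = 0
    while True:
        good_in_core -= 1              # next failure strikes a good core module
        if spares > 0:
            spares -= 1
            good_in_core += 1          # failed module replaced by a good spare
        if good_in_core <= n_core // 2:
            return failures_survived   # majority lost on this failure
        failures_survived += 1

for n_core, spares in [(3, 0), (3, 2), (5, 1), (5, 3)]:
    tolerated = hybrid_failures_tolerated(n_core, spares)
    print(f"N={n_core}, S={spares}: tolerates {tolerated} failures")
```

Under these assumptions the count matches t + S with t = ⌊N/2⌋, for example 1 + 2 = 3 failures for a TMR core with two spares.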
When N -modular redundancy is used and N is greater than three, additional considerations emerge. For example, consider the quad-redundant system shown in Fig. 12. This is the architecture used for the Space Shuttle’s primary flight control system (FCS).
Let us focus on the first four computers in the FCS. Here, we have an example of a 4-level voting with lockout. Let us assume that unit B fails permanently. There is no reason to leave B in the system if we have a way to remove it from the voting process. The rationale here is that a second failure, say that of unit C, can lead to a situation where the two failed units agree and the two good elements agree, leading to a stand-off. Clearly, this can be avoided, if, after the failure of B, it is locked out, and the system reconfigures to become a TMR system.
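The benefit of lockout can be illustrated with a toy voter that removes dissenting units from later votes. This is only a sketch of the idea and not the Space Shuttle's actual voting logic; the class name, the unit labels, and the output values are all hypothetical.

```python
from collections import Counter

class LockoutVoter:
    """Majority voter that locks out units disagreeing with the majority."""

    def __init__(self, unit_names):
        self.active = list(unit_names)

    def vote(self, outputs):
        """outputs maps unit name -> value; only active units are counted."""
        counts = Counter(outputs[u] for u in self.active)
        winner, count = counts.most_common(1)[0]
        if 2 * count <= len(self.active):
            raise RuntimeError("stand-off: no strict majority among active units")
        # Units in the minority are assumed faulty and removed from future votes.
        self.active = [u for u in self.active if outputs[u] == winner]
        return winner

voter = LockoutVoter("ABCD")

# Unit B fails; the other three outvote it and B is locked out.
first = voter.vote({"A": 1, "B": 9, "C": 1, "D": 1})
active_after_first = list(voter.active)

# Unit C later fails and happens to agree with B. B no longer votes,
# so A and D still form a majority of the three active units.
second = voter.vote({"A": 2, "B": 9, "C": 9, "D": 2})
active_after_second = list(voter.active)

print(first, active_after_first, second, active_after_second)
```

Without the lockout, the second vote would be a 2-2 split between the good pair and the failed pair, which is the stand-off described above.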