Combinational Techniques - Dependable Computing Systems - Lecture Notes, Study notes of Computer Science

Some concepts in Dependable Computing Systems are Terminology and Concepts, Software Fault Tolerance, Markov Models, Information Redundancy, Fault Detection in Processors, and Defect Tolerance in VLSI Circuits. The main points of this lecture are: Combinational Techniques, Canonical Structures, Series and Parallel Systems, Series-Parallel Combinations, Nonseries/Nonparallel Models, Minimal Cut Sets, Masking Redundancy, and Triple Modular Redundancy.

Typology: Study notes

2012/2013

Uploaded on 05/18/2013

maazi
maazi 🇮🇳

Combinational Techniques for Reliability Modeling

Prof. Naga Kandasamy,

ECE Department Drexel University, Philadelphia, PA 19104.

January 24, 2009

The following material is derived from these textbooks.

  • D. P. Siewiorek and R. S. Swarz, Reliable Computer Systems, 3rd Edition, A. K. Peters, Natick, Massachusetts, 1998.
  • M. L. Shooman, Reliability of Computer Systems and Networks, John Wiley & Sons, 2002.

When designing a system, it is important to be able to predict the reliability of the final system containing many components. The two most common methods of estimating the reliability of complex systems are combinational modeling and Markov state modeling.

1 Canonical Structures

We will first consider some canonical structures and discuss how their reliability can be quantified using combinational techniques.

1.1 Series and Parallel Systems

In a series combination of components, the failure of any of the components will result in the failure of the overall system. If a system contains N components arranged in series, and if the failure rates of the components are independent, then the system’s failure rate λ is given by

$$\lambda = \sum_{i=1}^{N} \lambda_i$$

where λi is the failure rate of the ith component.

The reliability of the series arrangement may also be expressed in terms of the reliability of the individual components. If Ri(t) is the reliability of the ith component in the system, the overall system reliability R(t) is given by

$$R(t) = R_1(t)\,R_2(t) \cdots R_N(t)$$

which may be written as

$$R(t) = \prod_{i=1}^{N} R_i(t)$$

The reliability of a parallel combination of components is given by

$$R(t) = 1 - \prod_{i=1}^{N} \left[ 1 - R_i(t) \right]$$

[Figure: top panel, series combination of components 1, 2, ..., N between input and output; bottom panel, parallel combination of components 1, 2, ..., N between input and output.]

Fig. 1: Series and parallel combination of system components.
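The series and parallel formulas of Section 1.1 can be sketched in a few lines of Python. This is a minimal illustration, assuming independent components that follow the exponential failure law; the failure rates and mission time are hypothetical values chosen only for the example, and the function names are not from the notes.

```python
from math import exp, prod

def series_reliability(reliabilities):
    """Series system: every component must work, so R = product of R_i."""
    return prod(reliabilities)

def parallel_reliability(reliabilities):
    """Parallel system: fails only if all components fail,
    so R = 1 - product of (1 - R_i)."""
    return 1.0 - prod(1.0 - r for r in reliabilities)

def series_reliability_from_rates(failure_rates, t):
    """Series system under the exponential law: the failure rates add,
    so R(t) = exp(-(sum of lambda_i) * t)."""
    return exp(-sum(failure_rates) * t)

# Hypothetical failure rates (per hour) and mission time (hours).
rates = [1e-4, 2e-4, 5e-4]
t = 1000.0
component_r = [exp(-lam * t) for lam in rates]

r_series = series_reliability(component_r)
r_parallel = parallel_reliability(component_r)
print("series:  ", r_series)
print("parallel:", r_parallel)
```

The product of the component reliabilities and the exponential of the summed failure rates agree, which is why the two series formulas are interchangeable under the exponential law.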

1.2 Series-Parallel Combinations

Consider the system shown in Fig. 2. Suppose the reliability of module M1 is 0.99, that of M2, M3, and M4 is 0.80 each, that of M5 and M6 is 0.90 each, that of M7 and M8 is 0.95 each, and that of M9 is 0.94. Then the reliability of the parallel combination of modules M2, M3, and M4 is given by

$$R(t) = 1 - [1 - 0.8]^3 = 0.992$$

The series combination of M5 and M7 (and likewise of M6 and M8) has reliability 0.90 × 0.95, or 0.855, and the parallel combination of these two paths is

$$R(t) = 1 - [1 - 0.855]^2 = 0.979$$

The system can be simplified as shown in Fig. 2, with the two parallel blocks computed above replaced by the equivalent modules M10 and M11. The series combination of M10 and M11 has reliability 0.992 × 0.979 ≈ 0.971, and combining this in parallel with M9 gives us

$$R(t) = 1 - [1 - 0.971][1 - 0.94] = 0.998$$
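The reduction above can be reproduced step by step. The sketch below assumes that M10 and M11 stand for the two parallel blocks (M2, M3, M4) and (M5-M7 in parallel with M6-M8), which is consistent with the 0.971 figure but cannot be confirmed without Fig. 2. M1 is left out because the text does not show how it enters the final result. The helper and variable names are illustrative.

```python
from math import prod

def series(*r):
    """Series combination: product of the reliabilities."""
    return prod(r)

def parallel(*r):
    """Parallel combination: one minus the product of the unreliabilities."""
    return 1.0 - prod(1.0 - x for x in r)

# Module reliabilities quoted in the text.
R = {"M2": 0.80, "M3": 0.80, "M4": 0.80,
     "M5": 0.90, "M6": 0.90, "M7": 0.95, "M8": 0.95, "M9": 0.94}

block_234 = parallel(R["M2"], R["M3"], R["M4"])     # assumed to be M10
path_57 = series(R["M5"], R["M7"])
path_68 = series(R["M6"], R["M8"])
block_5678 = parallel(path_57, path_68)              # assumed to be M11
m10_m11 = series(block_234, block_5678)
with_m9 = parallel(m10_m11, R["M9"])

for name, value in [("M2||M3||M4", block_234), ("(M5-M7)||(M6-M8)", block_5678),
                    ("M10-M11", m10_m11), ("(M10-M11)||M9", with_m9)]:
    print(f"{name:>18}: {value:.3f}")
```

Carrying full precision through the reduction, rather than rounding at each step as the text does, leads to the same three-digit results.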

1.3 Nonseries/Nonparallel Models

Sometimes, a “success” diagram is used to describe the operational modes of a system. A success diagram may not be directly reducible by applying the series/parallel formulas. In such cases, one can obtain a lower bound on system reliability in terms of the minimal cut sets of the system.

A cut set of a graph is a set of branches which, when removed from the graph, interrupts all connections between the input and the output. A minimal cut set is a cut set from which no branch can be dropped without restoring a connection. All system failures can be represented by the removal of at least one minimal cut set from the graph. The probability of system failure is, therefore, given by the probability that at least one minimal cut set fails. Let Qcuti denote the probability that the ith minimal cut set fails, that is, that all of its branches fail. Then a lower bound on system reliability is given by

$$R_{sys} \ge \prod_{i} \left( 1 - Q_{cut_i} \right)$$

Fig. 4: The reliability of a NMR system comprising 2n + 1 modules as a function of time.

A tie set is a group of branches which, when traversed, forms a connection between the input and the output. A minimal tie set is one containing a minimum number of elements. If no node is traversed more than once in tracing out the tie set, then the tie set is minimal. If Rpathi denotes the serial reliability of path i, then

$$R_{sys} \le 1 - \prod_{i} \left( 1 - R_{path_i} \right)$$

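The minimal tie sets in Fig. 3 are {M1, M2, M3, M4}, {M1, M6, M4}, and {M5, M3, M4}. If every module has the same reliability R, bounding the probability of the union of the tie sets by the sum of the path reliabilities gives

$$R_s \le R^4 + 2R^3$$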
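The cut-set and tie-set bounds can be compared against the exact reliability for the tie sets listed above. The sketch below assumes that these three tie sets are the complete set of minimal tie sets for Fig. 3 (the figure itself is not reproduced here), that all modules are independent, and that they share one hypothetical reliability r = 0.9. Minimal cut sets are derived by brute force as the smallest sets of modules that intersect every tie set.

```python
from itertools import combinations, product
from math import prod

# Minimal tie sets quoted in the text for Fig. 3.
TIE_SETS = [{"M1", "M2", "M3", "M4"}, {"M1", "M6", "M4"}, {"M5", "M3", "M4"}]
COMPONENTS = sorted(set().union(*TIE_SETS))

def minimal_cut_sets(tie_sets, components):
    """Smallest sets of components whose failure breaks every tie set."""
    cuts = []
    for size in range(1, len(components) + 1):
        for candidate in combinations(components, size):
            candidate = set(candidate)
            if any(cut <= candidate for cut in cuts):
                continue  # contains a smaller cut set, so not minimal
            if all(candidate & tie for tie in tie_sets):
                cuts.append(candidate)
    return cuts

def exact_reliability(tie_sets, components, r):
    """Sum the probability of every state in which some tie set is fully up."""
    total = 0.0
    for states in product([True, False], repeat=len(components)):
        up = {c for c, ok in zip(components, states) if ok}
        if any(tie <= up for tie in tie_sets):
            total += prod(r if ok else 1.0 - r for ok in states)
    return total

r = 0.9  # hypothetical common module reliability
CUT_SETS = minimal_cut_sets(TIE_SETS, COMPONENTS)
lower = prod(1.0 - (1.0 - r) ** len(cut) for cut in CUT_SETS)
upper = 1.0 - prod(1.0 - r ** len(tie) for tie in TIE_SETS)
exact = exact_reliability(TIE_SETS, COMPONENTS, r)

print("minimal cut sets:", [sorted(c) for c in CUT_SETS])
print(f"lower bound {lower:.6f} <= exact {exact:.6f} <= upper bound {upper:.6f}")
```

Exhaustive enumeration is practical only for small diagrams, since the number of states doubles with each added module; for larger systems the two bounds are the cheaper option.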

2 Masking Redundancy using M-out-of-N structures

Another simple structure that serves as a useful model for many reliability problems is an M-out-of-N structure. Such a model represents a system of N components in which M out of N components must be good for the system to succeed. The probability that exactly M out of N identical and independent components are good is given by

$$\binom{N}{M} p^{M} (1 - p)^{N - M}$$

Here, p denotes the (identical) reliability of each component. For a constant failure rate λ, and using the exponential failure law p = e^{−λt} for each item, the probability that at least M out of N items are good is given by

$$R(t) = \sum_{i=M}^{N} \binom{N}{i} e^{-i\lambda t} \left( 1 - e^{-\lambda t} \right)^{N-i}$$

In general, N is an odd integer. However, as we shall soon see, if we can diagnose and lock out faulty modules, it is feasible to let N be an even integer. If we let N = 2n + 1, n ≥ 1, then, in a simple masking scheme, we need a majority of the modules to work correctly, that is, M = n + 1. Fig. 4 shows the reliability function for various values of n assuming p = e^{−λt}. The figure shows that NMR is superior to a single unit in the high-reliability region; specifically, NMR is superior to the single unit for λt < 0.69 (λt = ln 2 ≈ 0.69 is the point at which p = 0.5). Therefore, when designing any system, we must carefully evaluate the reliability values obtained over the range 0 < t < maximum mission time for various values of n and λ.
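The M-out-of-N sum and the crossover at λt ≈ 0.69 can be checked numerically. The following is a minimal sketch assuming identical, independent modules and a perfect voter; the function names and the sample values of λt are illustrative choices, not part of the notes.

```python
from math import comb, exp, log

def m_of_n_reliability(m, n, p):
    """Probability that at least m of n identical, independent units are good."""
    return sum(comb(n, i) * p**i * (1.0 - p) ** (n - i) for i in range(m, n + 1))

def nmr_reliability(n, lam_t):
    """NMR with N = 2n + 1 modules and majority M = n + 1, with p = exp(-lam_t)."""
    p = exp(-lam_t)
    return m_of_n_reliability(n + 1, 2 * n + 1, p)

# Compare NMR against a single unit below, at, and above the crossover.
for lam_t in (0.3, log(2), 1.0):
    simplex = exp(-lam_t)
    row = ", ".join(f"n={n}: {nmr_reliability(n, lam_t):.4f}" for n in (1, 2, 3))
    print(f"lambda*t={lam_t:.3f}  simplex={simplex:.4f}  {row}")
```

At λt = ln 2 every module is as likely to be bad as good, so a majority vote over any odd number of modules has reliability exactly 0.5, the same as the single unit. This is why all the curves in a plot such as Fig. 4 cross at that point.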

2.1 Triple Modular Redundancy

A special case of an M-out-of-N structure is triple modular redundancy or TMR. The basic TMR structure, shown in Fig. 5, consists of three parallel modules where each module is provided with the same input. The outputs of the three modules are compared by the voter, which gives the majority opinion as the system output. If all three modules are operating properly, all outputs agree, and thus the system output is correct. However, if one module has failed so that it has produced an incorrect output, the voter chooses the output of the two good modules as the system output because they both agree, and thus the system output is correct. If two modules have failed, the voter agrees with the majority (the two that have failed), and thus the system output is incorrect.
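The voter behaviour described above can be sketched as a simple majority function. This is an illustrative software model, not the design of any particular hardware voter; it assumes module outputs are directly comparable values and returns None when no strict majority exists.

```python
from collections import Counter

def majority_vote(outputs):
    """Return the value produced by a strict majority of modules, or None."""
    value, count = Counter(outputs).most_common(1)[0]
    if 2 * count > len(outputs):
        return value
    return None

correct = 42  # hypothetical correct module output
print(majority_vote([correct, correct, correct]))  # all three modules good
print(majority_vote([correct, 17, correct]))       # one failed module is masked
print(majority_vote([17, 17, correct]))            # two agreeing failures win the vote
```

The last call shows the limitation noted in the text: the voter follows the majority, so two failed modules that agree on a wrong value produce a wrong system output.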

A TMR system will function correctly provided that at least two modules are operational, and assuming that the voter does not fail, that is, Rv = 1. Thus, the probability of the system working correctly is given by

$$R = R_v \left[ \binom{3}{3} p^3 (1 - p)^0 + \binom{3}{2} p^2 (1 - p)^1 \right] = 3p^2 - 2p^3 = p^2 (3 - 2p)$$

This is, of course, the reliability expression for a two-out-of-three system. If we assume a constant failure rate λ, then each module/component has a reliability p = e^{−λt}, and substituting in the above equation yields

$$R(t) = 3e^{-2\lambda t} - 2e^{-3\lambda t}$$

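We can compute the MTTF for this system by integrating the reliability function as

$$\mathrm{MTTF} = \int_0^{\infty} \left( 3e^{-2\lambda t} - 2e^{-3\lambda t} \right) dt = \frac{3}{2\lambda} - \frac{2}{3\lambda} = \frac{5}{6\lambda}$$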
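The closed-form MTTF can be checked by integrating the TMR reliability function numerically. The sketch below uses a plain trapezoidal rule over a finite horizon of 40/λ, beyond which the integrand is negligible; the failure rate, horizon, and step count are hypothetical choices made for the illustration.

```python
from math import exp

def tmr_reliability(lam, t):
    """R(t) = 3 exp(-2 lam t) - 2 exp(-3 lam t), assuming a perfect voter."""
    return 3.0 * exp(-2.0 * lam * t) - 2.0 * exp(-3.0 * lam * t)

def mttf_numeric(reliability, lam, horizon_factor=40.0, steps=100_000):
    """Trapezoidal integral of reliability(lam, t) from 0 to horizon_factor / lam."""
    t_end = horizon_factor / lam
    h = t_end / steps
    total = 0.5 * (reliability(lam, 0.0) + reliability(lam, t_end))
    for k in range(1, steps):
        total += reliability(lam, k * h)
    return total * h

lam = 1e-3  # hypothetical failure rate per hour
mttf_tmr = mttf_numeric(tmr_reliability, lam)
print("numeric TMR MTTF:", mttf_tmr)
print("5 / (6 lam):     ", 5.0 / (6.0 * lam))
print("simplex 1 / lam: ", 1.0 / lam)
```

The comparison with 1/λ shows that TMR has a shorter MTTF than a single module even though it is more reliable for short missions (λt < 0.69), so MTTF alone is a poor figure of merit for masking redundancy.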

This TMR system can be called a “3-2 system” because the system succeeds if 3 or 2 units are good. However, when a second failure occurs, the voter does not know which of the components have failed and cannot determine which is the good component.

In some cases, additional information is available by such means as observation (from a human operator or a diagnostic system) of the remaining two units after the first failure occurs. If one of the two remaining units

[Figure: the input feeds Module 1, Module 2, and Module 3 in parallel; their outputs go to a voter, which produces the system output.]

Fig. 5: The basic triple-modular redundancy (TMR) scheme.

[Figure: (a) a simplex (non-redundant) system comprising two components M1 and M2 in series; (b) system redundancy, with the chain M1, M2 in parallel with a duplicate chain M′1, M′2; (c) component redundancy, with M1 in parallel with M′1, followed in series by M2 in parallel with M′2.]

Fig. 7: Comparison of three different systems; (a) A simplex (or non-redundant) system, (b) system redundancy, and (c) component redundancy.

where the components M1 and M2 are independent, but have identical reliability R(t) = p.

The reliability expression for Fig. 7(b), comprising two simplex units connected in parallel, is given by

$$R_b(t) = R_a^2(t) + 2R_a(t)\left(1 - R_a(t)\right) = p^2 (2 - p^2) \qquad (1)$$

where Ra(t) = p^2 is the reliability of the simplex system in Fig. 7(a).

For Fig. 7(c), we combine each component pair in parallel to obtain

$$R_c(t) = \left[ p^2 + 2p(1 - p) \right]^2 = p^2 (2 - p)^2 \qquad (2)$$

To compare Equations 1 and 2, we use the ratio

$$\frac{R_c(t)}{R_b(t)} = \frac{p^2 (2 - p)^2}{p^2 (2 - p^2)} = \frac{(2 - p)^2}{2 - p^2} \qquad (3)$$

Some algebraic manipulation yields

$$\frac{R_c(t)}{R_b(t)} = 1 + \frac{2(1 - p)^2}{2 - p^2}$$

Since 0 < p < 1, the term 2 − p^2 > 0 and 2(1 − p)^2 ≥ 0, so Rc(t)/Rb(t) ≥ 1. Therefore, component redundancy is superior to system redundancy for this example. (They are, of course, equal at the extremes when p = 0 or p = 1.)

We can extend the above analysis to m components, in which case Equation 3 becomes

$$\frac{R_c(t)}{R_b(t)} = \frac{(2 - p)^m}{2 - p^m}$$

It can be shown by induction that this ratio is always greater than 1 and that component redundancy is superior regardless of the number of components. The superiority of component redundancy over system redundancy also holds true for non-identical components.
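The comparison for m components can be spot-checked numerically. The sketch below assumes m identical, independent components of reliability p, duplicated either as a whole chain (system redundancy) or one component at a time (component redundancy); the function names and the grid of test values are illustrative.

```python
def system_redundancy(p, m):
    """Two m-component series chains in parallel: R_b = p^m (2 - p^m)."""
    chain = p**m
    return chain * (2.0 - chain)

def component_redundancy(p, m):
    """m duplicated pairs in series: R_c = [p (2 - p)]^m."""
    return (p * (2.0 - p)) ** m

for m in (2, 3, 5):
    for p in (0.5, 0.9, 0.99):
        rb = system_redundancy(p, m)
        rc = component_redundancy(p, m)
        print(f"m={m} p={p:.2f}  R_b={rb:.6f}  R_c={rc:.6f}  ratio={rc / rb:.4f}")
```

A numerical check like this is not a proof, but it shows the advantage of component redundancy growing with m, in line with the induction argument.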

Fig. 8: Redundancy comparison: (a) component redundancy and (b) system redundancy

A simpler proof of the foregoing principle can be formulated by considering tie sets. In Fig. 7(b), the tie sets are M1M2 and M′1M′2, whereas in Fig. 7(c), the tie sets are M1M2, M1M′2, M′1M2, and M′1M′2. Since the system reliability is the probability of the union of tie sets, and since the redundant system in Fig. 7(c) has the same two tie sets as Fig. 7(b) as well as two additional ones, the component-redundancy configuration has a greater reliability than the configuration with two simplex units connected in parallel. This tie-set proof can be extended to the general case.

The reliability of system and component redundancy are compared graphically in Fig. 8.

3 Voter Design Issues

This section considers various issues related to voter design including relaxing our assumption of perfect voters.

Returning to the TMR reliability equation, if the unit reliability is denoted as pc and the voter reliability by

Fig. 10: Synchronization in the COMTRAC railroad traffic control computer.

3.3 Synchronization Issues

Synchronizing the outputs of the replicated units is another concern. The problem of synchronization is often solved using a common (fault-tolerant) clock. Another method of synchronization is used by the COMTRAC railroad traffic control computer shown in Fig. 10 (see footnote 1). Synchronization is maintained at the program task level. The system controller (DSC) ensures that both processors are performing the same calculation. When both computers have finished the calculation, the DSC compares the two results. If a mismatch occurs, the controller forces both processors to run identical test programs. The test program exercises the entire processor during the course of calculating a single constant.

4 Dynamic Redundancy

One of the drawbacks of an NMR scheme is that the fault masking ability deteriorates as more copies fail. In its pure form, fault masking neutralizes the effects of failed units without notification of their failures. Therefore, the faulty modules can eventually outvote the good modules. However, an NMR system could continue to function longer if the known bad modules could be discounted in the vote. Two methods of reconfiguration based on NMR are: (1) Hybrid redundancy where failed modules are replaced with good spares and (2) Dynamic modification of the voting process or adaptive voting.

4.1 Hybrid Redundancy

Fig. 11 illustrates the basic concept, in which a core group of N identical modules is used at any one time, and their outputs are voted upon to produce the system output. When a disagreement is detected, the module(s) in the minority are assumed to have failed and are replaced by an equivalent number of spare modules. Initially, the system contains a total of (N + S) modules. As long as the number of failed modules does not exceed t = ⌊N/2⌋ in the core group before reconfiguration can take place, the system in Fig. 11 can tolerate the failure of t + S of its modules.

(1) Ihara et al., Fault-Tolerant Computer System with Three Symmetric Computers, Proceedings of the IEEE, pp. 1160-1177, October 1978.

Fig. 11: Organization of a system with hybrid redundancy.

Fig. 12: The quad-redundant flight control system of the space shuttle.
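The t + S figure for hybrid redundancy in Section 4.1 can be illustrated with a small counting model. The sketch below rests on simplifying assumptions that are not stated in the notes: N is odd, failures arrive one at a time, each failure is detected and the module swapped for a spare before the next failure occurs, spares do not fail while idle, and the voter and switching logic are perfect.

```python
def tolerated_failures(n_core, spares):
    """Count the failures survived before the core loses its good majority.

    Assumes one failure at a time, with immediate replacement from the
    spare pool for as long as spares remain.
    """
    good_in_core = n_core
    spares_left = spares
    failures = 0
    while True:
        good_in_core -= 1          # a working core module fails
        failures += 1
        if spares_left > 0:        # swap in a good spare
            spares_left -= 1
            good_in_core += 1
        if 2 * good_in_core <= n_core:
            return failures - 1    # this failure broke the majority

for n_core, spares in [(3, 0), (3, 2), (5, 1), (5, 3)]:
    print(f"N={n_core} S={spares}: tolerates {tolerated_failures(n_core, spares)} failures")
```

Under these assumptions the count equals ⌊N/2⌋ + S. If several core modules fail before reconfiguration completes, the core can lose its majority while spares are still available, which is the condition the text attaches to the bound.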

4.2 Adaptive voting with lockout

When N-modular redundancy is used and N is greater than three, additional considerations emerge. For example, consider the quad-redundant system shown in Fig. 12. This is the architecture used for the Space Shuttle’s primary flight control system (FCS).

Let us focus on the first four computers in the FCS. Here, we have an example of a 4-level voting with lockout. Let us assume that unit B fails permanently. There is no reason to leave B in the system if we have a way to remove it from the voting process. The rationale here is that a second failure, say that of unit C, can lead to a situation where the two failed units agree and the two good elements agree, leading to a stand-off. Clearly, this can be avoided, if, after the failure of B, it is locked out, and the system reconfigures to become a TMR system.
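The lockout idea can be sketched as a voter that drops any unit disagreeing with the majority. This is an illustrative model only, not the Space Shuttle's actual voting logic; it assumes failures are permanent (so a single disagreement justifies lockout), and the unit names A to D and the output values are hypothetical.

```python
from collections import Counter

class LockoutVoter:
    """Majority voter that removes disagreeing units from later votes."""

    def __init__(self, unit_names):
        self.active = list(unit_names)

    def vote(self, outputs):
        """outputs maps unit name -> value. Returns the strict-majority value
        among active units, or None on a stand-off (no lockout in that case)."""
        ballots = {unit: outputs[unit] for unit in self.active}
        value, count = Counter(ballots.values()).most_common(1)[0]
        if 2 * count <= len(ballots):
            return None
        self.active = [u for u in self.active if ballots[u] == value]
        return value

GOOD, BAD = 1, 0  # hypothetical correct and faulty output values

# B fails first and is locked out; C's later failure is then outvoted 2 to 1.
voter = LockoutVoter("ABCD")
first = voter.vote({"A": GOOD, "B": BAD, "C": GOOD, "D": GOOD})
second = voter.vote({"A": GOOD, "B": BAD, "C": BAD, "D": GOOD})
print(first, second, voter.active)

# Without an intervening lockout, B and C failing together gives a 2-2 stand-off.
no_lockout = LockoutVoter("ABCD").vote({"A": GOOD, "B": BAD, "C": BAD, "D": GOOD})
print(no_lockout)
```

After B is locked out the system behaves as a TMR system, and after C is also removed the two remaining units can only detect a disagreement, not resolve it.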