Software Fault Tolerance: Techniques for Error Detection and Correction in Software, Study notes of Computer Science

Sam Higginbottom Institute of Agriculture, Technology and Sciences Computer Science

Various techniques for ensuring fault tolerance in software, including timing checks, results checking and correction, and range checks. The document also covers software rejuvenation and data diversity as approaches to proactively managing software faults. Examples and mathematical explanations are provided for each technique.

Typology: Study notes

2012/2013

Uploaded on 05/18/2013

Software Fault Tolerance

Prof. Naga Kandasamy

ECE Department, Drexel University, Philadelphia, PA 19104.

February 15, 2009

1 Acceptance Tests

Acceptance tests are, simply put, a check of reasonableness, and fall into the following categories.

1.1 Timing Checks

If we have a rough idea of how long the code should run, a watchdog timer can be set appropriately. If the timer goes off before being reset, we can assume that a failure has occurred (say, the program has “hung”). The timing check can be used in parallel with other acceptance tests.
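The watchdog idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `Watchdog`, `kick`, and `run_with_watchdog` are hypothetical and do not come from the notes, and a real system would typically use a hardware or operating-system timer rather than a thread.

```python
import threading
import time


class Watchdog:
    """Calls on_expire if kick() is not called within `timeout` seconds."""

    def __init__(self, timeout, on_expire):
        self.timeout = timeout
        self.on_expire = on_expire
        self._timer = None

    def start(self):
        # threading.Timer runs on_expire once after `timeout` seconds.
        self._timer = threading.Timer(self.timeout, self.on_expire)
        self._timer.daemon = True
        self._timer.start()

    def kick(self):
        """Reset the timer; the monitored code calls this to signal progress."""
        self.stop()
        self.start()

    def stop(self):
        if self._timer is not None:
            self._timer.cancel()
            self._timer = None


def run_with_watchdog(task, timeout):
    """Run task(kick) and report whether the watchdog expired meanwhile."""
    expired = threading.Event()
    dog = Watchdog(timeout, expired.set)
    dog.start()
    try:
        task(dog.kick)
    finally:
        dog.stop()
    return expired.is_set()
```

A task that returns, or calls `kick`, before the timeout leaves the result `False`; a task that stalls for longer than `timeout` yields `True`, which the caller can treat as a failed acceptance test.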

1.2 Results Checking and Correction

The material discussed in this section is derived from: M. Blum and H. Wasserman, “Reflections on the Pentium Division Bug,” IEEE Transactions on Computers, vol. 45, no. 4, pp. 385-393, April 1996. The paper is available via the course web site.

Simple checking is based on the observation that for certain mathematical functions f, the task of determining, given inputs x and y, whether or not f(x) = y is easier than the task of, on input x, computing f(x). So, a checker for function f is a program which, given inputs x and y, returns the correct answer to the question, “Does f(x) = y?” The checker may be randomized, in which case, for any x and y, it must give the correct answer with high probability over its internal randomization. Furthermore, the checker must take asymptotically less time than any possible program for computing f.

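As an example, we will consider a simple checker for matrix multiplication. Let us consider two matrices A and B. The product matrix C = A × B is computed as follows:

\[
A \times B =
\begin{bmatrix} 1 & -2 & 1 \\ 0 & 3 & 3 \\ 1 & 0 & 0 \end{bmatrix}
\times
\begin{bmatrix} 0 & -1 & 0 \\ 2 & 2 & 2 \\ 3 & 0 & 1 \end{bmatrix}
=
\begin{bmatrix} -1 & -5 & -3 \\ 15 & 6 & 9 \\ 0 & -1 & 0 \end{bmatrix}
\]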

The naive way of multiplying two n × n matrices takes time O(n^3). More sophisticated methods have been developed that take time O(n^c) for various values of c, 2 < c < 3. A program implementing such an algorithm could be quite complex and so, perhaps, buggy and in need of a checker.

The matrix-multiplication checker takes as inputs n × n matrices A, B, and C. It must determine whether or not A × B = C. It begins by generating an n-high column vector r of random numbers. It then calculates and compares the products A(Br) and Cr. Note that these calculations can be done in time O(n^2), using straightforward vector-by-matrix multiplication. For example, we can pick a random value for r and use it to check the matrix multiplication from above.


A × (Br) = [matrix entries missing from the source]

and

Cr = [matrix entries missing from the source]

Clearly, if A × B = C, then A(Br) = (AB)r = Cr for any r. If A × B ≠ C, it is possible that, for certain values of r, A(Br) may nevertheless equal Cr. However, it can be shown that the probability (over the random choice of r) of the checker being fooled in this way is small. If we repeat the above process for several values of r and accept C as correct if A(Br) = Cr for all r tried, we have a simple checker whose probability of error can be made arbitrarily low.
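The repeated random check can be sketched as follows. This is a minimal illustration, not the code from the cited paper: the function names are hypothetical, and drawing each entry of r from {0, 1} is an assumption made here (the notes only say "random numbers"). With that choice, a wrong C survives a single trial with probability at most 1/2, so k independent trials miss it with probability at most 2^-k.

```python
import random


def mat_vec(M, v):
    """Multiply an n x n matrix (a list of rows) by a length-n vector: O(n^2)."""
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in M]


def check_product(A, B, C, trials=20, rng=random):
    """Return False as soon as some random r gives A(Br) != Cr, else True.

    Only matrix-vector products are used, so each trial costs O(n^2).
    """
    n = len(A)
    for _ in range(trials):
        # Assumption: entries of r are independent and uniform on {0, 1}.
        r = [rng.randint(0, 1) for _ in range(n)]
        if mat_vec(A, mat_vec(B, r)) != mat_vec(C, r):
            return False  # definite proof that A x B != C
    return True  # C passed every trial; correct with high probability


# The matrices from the example above.
A = [[1, -2, 1], [0, 3, 3], [1, 0, 0]]
B = [[0, -1, 0], [2, 2, 2], [3, 0, 1]]
C = [[-1, -5, -3], [15, 6, 9], [0, -1, 0]]
```

Calling `check_product(A, B, C)` accepts the correct product on every trial. Note the asymmetry: a `False` answer is always right, while a `True` answer is right only with high probability.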

We have here checked a complicated, O(n^c)-time program with a simple, O(n^2)-time program. The checker is thus easy to code compared to the program it checks, unlikely to be buggy compared to the program it checks, and quick to execute compared to the program it checks.

Now, we will consider the possibility that having detected its output to be incorrect, the program attempts to correct itself. Correction is based on the fact that, for many functions f , we can efficiently compute f (x) if we know the value of f at several random inputs other than x. So, it is possible to work our way around occasional “bad inputs” at which a program for computing f fails. Since it is easy to determine, via a simple stage of random testing, that such a program is correct on most inputs, a self-corrector of this sort will then suffice to patch over the remaining, occasional errors, making the program effectively perfect.

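Let us return to our matrix-multiplication example to illustrate this concept. We have the following matrices:

\[
A = \begin{bmatrix} 1 & -2 & 1 \\ 0 & 3 & 3 \\ 1 & 0 & 0 \end{bmatrix},
\qquad
B = \begin{bmatrix} 0 & -1 & 0 \\ 2 & 2 & 2 \\ 3 & 0 & 1 \end{bmatrix}
\]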

Say that the program has calculated the output C = A × B incorrectly, and we wish to correct it. Now, we can easily write each of the matrices A and B as a sum of two matrices in an essentially random way. For example, with random matrices R and R':

\[
A = R + (A - R), \qquad B = R' + (B - R')
\]

We are just writing A as the sum of a random matrix R and A − R, and similarly for B. So, in place of calculating A × B directly, we calculate the sum of four entirely different matrix products. Of course, this method multiplies run-time by a factor of 4. But we assume that such correction is not often necessary. Also needed are matrix addition and subtraction, which we assume to be simple, quick, and reliable. The […]

How much work is needed to complete this calculation? The multiplications by 1/2 or 2 or 4 are easy on binary numbers, and we assume addition and subtraction to be quick and reliable compared to multiplication. Thus, the computation reduces to doing four multiplications. This is a significant cost, but, as usual, we assume that it will only be needed in the rare case that the multiplier's original output is incorrect.
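The random-splitting correction for matrix multiplication described earlier in this section can be sketched as below. Everything named here is hypothetical: `buggy_multiply` simulates a routine that fails on one occasional "bad input", and the range of the random entries is an arbitrary assumption. The formal argument draws the random matrices uniformly from a finite domain, which this integer sketch only approximates.

```python
import random


def mat_add(X, Y):
    return [[x + y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]


def mat_sub(X, Y):
    return [[x - y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]


def naive_multiply(X, Y):
    n = len(X)
    return [[sum(X[i][k] * Y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def random_matrix(n, rng, low=-10, high=10):
    # Assumption: small integer entries, chosen only for illustration.
    return [[rng.randint(low, high) for _ in range(n)] for _ in range(n)]


def self_correcting_multiply(multiply, A, B, rng=random):
    """Compute A x B through four calls to `multiply` on randomized inputs.

    With A = R1 + A2 and B = R2 + B2:
        A x B = R1 R2 + R1 B2 + A2 R2 + A2 B2
    """
    n = len(A)
    R1, R2 = random_matrix(n, rng), random_matrix(n, rng)
    A2, B2 = mat_sub(A, R1), mat_sub(B, R2)
    total = multiply(R1, R2)
    for X, Y in ((R1, B2), (A2, R2), (A2, B2)):
        total = mat_add(total, multiply(X, Y))
    return total


A = [[1, -2, 1], [0, 3, 3], [1, 0, 0]]
B = [[0, -1, 0], [2, 2, 2], [3, 0, 1]]
C = [[-1, -5, -3], [15, 6, 9], [0, -1, 0]]


def buggy_multiply(X, Y):
    """Hypothetical faulty routine: wrong only when its left input equals A."""
    Z = naive_multiply(X, Y)
    if X == A:
        Z[0][0] += 1  # simulated occasional error
    return Z
```

Here `buggy_multiply(A, B)` returns a wrong product, while `self_correcting_multiply(buggy_multiply, A, B)` almost certainly never presents the bad input to the faulty routine and so recovers C, at the cost of four multiplications plus some additions and subtractions.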

1.3 Range Checks

We can use our knowledge of the software application to set acceptable bounds for the output. When setting bounds on acceptance tests, we have to balance between sensitivity and specificity. Here, sensitivity is the probability that the acceptance test catches an erroneous output (the coverage factor). It is the conditional

2 Single-Version Fault Tolerance

We first discuss software rejuvenation and then describe data diversity.

2.1 Software Aging and Rejuvenation

The topic discussed in this section is derived, in part, from: V. Castelli et al., “Proactive Management of Software Aging,” IBM Journal of Research and Development, vol. 45, no. 2, pp. 311-332, March 2001.

Unplanned computer system outages are more likely to be the result of software failures than of hardware failures. Software often exhibits an increasing failure rate over time, typically because of increasing and unbounded resource consumption, data corruption, and numerical error accumulation. This constitutes a phenomenon called software aging, and may be caused by errors in the application or operating system. Under aging conditions, the state of the software degrades gradually with time, inevitably resulting in undesirable consequences (e.g., the failure of the Patriot missile system, which resulted in loss of human lives, might have been prevented had the operators heeded the advice that the system had to be restarted after every eight hours of running time). Some typical causes of this degradation are memory bloating and leaking, unterminated threads, unreleased file-locks, data corruption, storage-space fragmentation, and accumulation of round-off errors.

To counteract software aging, a proactive technique called software rejuvenation involves stopping the running software occasionally, “cleaning” its internal state (e.g., garbage collection, flushing operating system kernel tables, and re-initializing internal data structures) and restarting it. An extreme but well-known example of rejuvenation is a system reboot. There are numerous examples in real-life systems where software rejuvenation is being used. For example, it has been implemented in the real-time system collecting billing data for most telephone exchanges in the United States.

Proactive fault management takes appropriate corrective action to prevent a failure before the system experiences a fault. So, software rejuvenation is a specific form of proactive fault management which can be performed at suitable times, such as when there is no load on the system, and thus typically results in less downtime and cost than the reactive approach. Since proactive fault management incurs some overhead, an important issue is to determine the optimal times to invoke it in operational software systems. Proactive fault management can be greatly enhanced by the ability to predict the fault far enough in advance that one can take action to avoid or mitigate its effects. Resource exhaustion by its very nature offers clues that failure is imminent, in the form of parameters that can be monitored, extrapolated, and compared to thresholds via suitable algorithms.
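The monitor, extrapolate, and compare-to-threshold loop can be illustrated with a simple trend fit. This is a sketch under assumptions, not the prediction algorithm from the cited paper: it assumes resource usage grows roughly linearly, uses an ordinary least-squares line, and all function and parameter names are hypothetical.

```python
def fit_line(times, values):
    """Least-squares slope and intercept for values ~ slope * t + intercept.

    Needs at least two samples taken at distinct times.
    """
    n = len(times)
    mean_t = sum(times) / n
    mean_v = sum(values) / n
    sxx = sum((t - mean_t) ** 2 for t in times)
    sxy = sum((t - mean_t) * (v - mean_v) for t, v in zip(times, values))
    slope = sxy / sxx
    return slope, mean_v - slope * mean_t


def time_to_threshold(times, values, threshold):
    """Time at which the fitted trend reaches `threshold`, or None."""
    slope, intercept = fit_line(times, values)
    if slope <= 0:
        return None  # usage is flat or shrinking: no exhaustion predicted
    return (threshold - intercept) / slope


def should_rejuvenate(times, values, threshold, lead_time):
    """True if exhaustion is predicted within `lead_time` of the last sample."""
    t_hit = time_to_threshold(times, values, threshold)
    return t_hit is not None and t_hit - times[-1] <= lead_time
```

For instance, memory samples of 100, 110, 120, and 130 units at times 0 through 3 extrapolate to a threshold of 200 at time 10, so a lead time of 8 triggers rejuvenation and a lead time of 5 does not. Choosing `lead_time` is where the overhead trade-off mentioned above enters: too large and the system restarts needlessly, too small and there is no time to act.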

2.2 Data Diversity

Please read the following paper on data diversity, available from the course web site: P. E. Ammann and J. C. Knight, “Data Diversity: An Approach to Software Fault Tolerance,” IEEE Trans. Computers, vol. 37, no. 4, pp. 418-425, 1988.

2.3 Software Implemented Hardware Fault Tolerance

Read the discussion in the book.
