Software Reliability Engineering: Overview by Bojan Cukic, WVU - Prof. Bojan Cukic, Study notes of Computer Science

An overview of software reliability engineering, a topic taught at West Virginia University. The notes cover the importance of software reliability in safety-critical systems, the role of formal verification and testing in assessing software reliability, and the difficulty of collecting operational software data for reliability analysis. They also introduce software reliability engineering (SRE) and its benefits for product reliability, speed to market, and cost reduction.

Typology: Study notes

Pre 2010

Uploaded on 07/30/2009



West Virginia University
Software Reliability Class

- Software Engineering
  - SENG 691 D
- Computer Science
  - CS 791 X
- Time: Tuesdays, 6 PM – 8:30 PM
- SENG section is online; CS section meets in ESB 801
- Instructor: Bojan Cukic
- Office phone: 304-293-0405 ext. 2526
- Email: [email protected]

Rules of Operation

- Email communication is strongly preferred.
- Chats are possible (Gmail or Skype); arrange a time in advance.
- Office visits are encouraged.
  - ESB 731, Evansdale campus, Morgantown.
- Textbook:
  - John Musa, Software Reliability Engineering: More Reliable Software Faster and Cheaper, McGraw-Hill, 1998. The book is out of print, but the 2nd edition can be obtained as Print On Demand (POD) from AuthorHouse publishers at a considerable discount. See http://members.aol.com/JohnDMusa/book.htm and follow the appropriate links.

Rules of Operation (continued)

- Tests: midterm and final
- Presentations and research papers
  - Each student chooses a topic in agreement with the instructor.
  - Presentations take place during regular classes (online students present by phone).
  - Presentations grow into papers by the end of the semester.
  - SENG students can report on reliability best practices or on the application of reliability engineering to projects they are involved with in their organizations.
  - CS students' topics require research content (project).

Rules of Operation (continued)

- See the syllabus for details on tests, presentations, and papers.
- Grading:
  - Midterm + final = 40%
  - Class presentation = 15%
  - Term papers / project reports = 35%
  - Class participation = 10%
  - 90+ = A, 80+ = B, 70+ = C …
- Students must obtain a passing grade on both the tests and the papers/presentations.

Software Reliability Engineering: A Short Overview

Bojan Cukic
Lane Department of Computer Science and Electrical Engineering
West Virginia University

Introduction

- Hardware for safety-critical systems is very reliable, and its reliability continues to improve.
- Software is not as reliable as hardware, yet its role in safety-critical systems is increasing.
- “Today, the majority of engineers understand very little about the science of programming or the mathematics that one needs to analyze a program. On the other hand, the scientists who study programming know very little about what it means to be an engineer...” [Parnas 1997]

Introduction: Ariane flight 501 failure

- The Ariane 4 SRI (Inertial Reference System) software was reused on Ariane 5.
- Ariane 4 accelerated much more slowly and flew a different trajectory.
- In SRI-1 and SRI-2, an Operand Error exception was raised because of an overflow when converting a 64-bit floating-point value to a 16-bit signed integer.
- The SRIs declared failure in two successive data cycles (72 ms).
- The On-Board Computer interpreted the SRI-2 diagnostic pattern as flight data and commanded nozzle deflection.
- 39 s after launch, the launcher disintegrated under high aerodynamic loads caused by an angle of attack of more than 20 degrees.

Infamous Software Failures

- July 22, 1962: Mariner I space probe.
  - A formula written in pencil on paper was improperly transcribed into code. The trajectory was miscalculated, the rocket diverted from its path at launch, and it was destroyed over the Atlantic.
- 1982: Trans-Siberian gas pipeline explosion.
  - A fault was planted in Canadian pipeline control software that had been covertly acquired by the Russians. Not much is known about the nature of the fault.
- 1985 – 1987: Therac-25 medical accelerator.
  - The radiation therapy device delivered lethal doses at several facilities. A software safety interlock had replaced the electromechanical one and failed. An operating system race condition was at fault.

Infamous Software Failures (continued)

- 1988: Buffer overflow in the Berkeley Unix finger daemon.
  - Allowed the spread of the first Internet worm. The gets() function did not check the length of the input string, so the worm's code was able to take control of the machines.
- 1988 – 1996: Kerberos random number generator.
  - The generator was not properly “seeded”. For 8 years, it was possible to break into the most “secure” authentication system using trivial mathematics. It is not known whether the fault was ever exploited.

Infamous Software Failures (continued)

- Jan 15, 1990: AT&T network outage.
  - A fault in a new software release caused long-distance switches to crash when they received a crash recovery message from a neighboring machine. 114 switches kept crashing and rebooting every 6 seconds for 9 hours. The old software release was loaded back to fix the problem.
- 1993: Intel Pentium floating-point division.
  - An error of 0.006% in division caused a public relations nightmare. With 3 to 5 million chips in circulation, Intel offered a replacement to anyone who complained. $475M in damages.

Infamous Software Failures (continued)

- 1995/96: The Ping of Death.
  - Malformed “ping” packets were not checked and caused computers to display the “blue screen of death”. The cause was a lack of sanity checks in error handling. Windows, Mac, and UNIX systems were affected.
- November 2000: National Cancer Institute, Panama City.
  - Therapy planning software miscalculated the proper dosage of radiation. Doctors “tricked” the software by placing additional shielding blocks that the software had not planned for, and the dosage calculation depended on peculiar user interaction. 8 patients died and at least 20 received overdoses. The doctors, who were supposed to double-check the computer's calculations by hand, were indicted for murder.

Software Reliability

- Software reliability: P(A | B)
  - A: the software does not fail when operated for t time units under specified conditions.
  - B: the software has not failed at time 0.
- Ultra-high reliability requirements for safety-critical systems (Draft Int'l Standard IEC65A123 for Safety Integrity Level 4):
  - Continuous control systems: < 10^-8 failures per hour
    - Airbus 320/330/340 and Boeing 777: < 10^-9 failures/h
    - This translates to roughly 114,000 years (10^9 hours) of operation without encountering a failure.
  - Protection systems (emergency shutdown): < 10^-4 failures/h
    - UK Sizewell B nuclear reactor (emergency): < 10^-3 failures/h
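The requirements above can be related to operating time with a small calculation. This is a minimal sketch that assumes a constant failure rate λ, so that R(t) = exp(-λt) and the mean time to failure is 1/λ; the constant-rate assumption, the 8,760-hour year, and the function names are illustrative and do not come from the slides.

```python
import math

HOURS_PER_YEAR = 24 * 365  # assumption: non-leap year


def reliability(failure_rate_per_hour: float, hours: float) -> float:
    """Probability of no failure during `hours`, assuming a constant failure rate."""
    return math.exp(-failure_rate_per_hour * hours)


def mttf_years(failure_rate_per_hour: float) -> float:
    """Mean time to failure in years, assuming a constant failure rate."""
    return 1.0 / failure_rate_per_hour / HOURS_PER_YEAR


if __name__ == "__main__":
    # Failure-rate bounds quoted on the slide (failures per hour).
    for label, rate in [("continuous control", 1e-8), ("civil aircraft", 1e-9)]:
        print(f"{label}: MTTF ~ {mttf_years(rate):,.0f} years, "
              f"R(10 h flight) = {reliability(rate, 10.0):.10f}")
```

Under these assumptions a bound of 10^-9 failures/h corresponds to a mean time to failure of 10^9 hours, which is far longer than any feasible test campaign.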

Time Domain Approach

- Observed failure data from testing are fitted to various statistical models.
  - Time-between-failures models and period failure-count models
- Used for:
  - assessing current reliability
  - predicting future reliability
  - controlling software testing
- CONS:
  - Perfect fault removal is assumed.
  - Cannot be used to predict ultra-high reliability levels.

[Figure: software reliability assessment divides into formal verification and testing; testing divides into the time domain and the input domain. The accompanying plot shows failure intensity λi versus time.]

Time Domain Models

- Reliability growth models
  - Jelinski-Moranda model (JM)
    - The number of initial faults is unknown but fixed.
    - Fault removal is perfect (no new faults are introduced).
    - Times between failure occurrences are independent, exponentially distributed random quantities.
    - All remaining faults contribute equally to the failure intensity.
- General problems (more assumptions)
  - All faults are detectable.
  - Statistical independence of inter-failure arrival times
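Under the JM assumptions listed above, the failure intensity before the i-th failure is commonly written as λ_i = φ(N − i + 1), where N is the initial number of faults and φ is the contribution of each remaining fault. The sketch below uses this standard form; the parameter names and the example values are assumptions for illustration and do not appear in the slides.

```python
def jm_failure_intensity(n_faults: int, phi: float, i: int) -> float:
    """Failure intensity in effect before the i-th failure (i = 1..n_faults)."""
    if not 1 <= i <= n_faults:
        raise ValueError("i must be between 1 and n_faults")
    # Each of the (n_faults - i + 1) remaining faults contributes phi.
    return phi * (n_faults - i + 1)


def jm_expected_interfailure_times(n_faults: int, phi: float) -> list[float]:
    """Expected time to each successive failure (mean of an exponential is 1/rate)."""
    return [1.0 / jm_failure_intensity(n_faults, phi, i)
            for i in range(1, n_faults + 1)]


if __name__ == "__main__":
    # Hypothetical program: 10 initial faults, each contributing 0.01 failures/hour.
    for i, t in enumerate(jm_expected_interfailure_times(10, 0.01), start=1):
        print(f"failure {i}: expected wait {t:.1f} h")
```

The expected waiting times grow as faults are removed, which is the reliability growth the model is meant to capture.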

Related Work: Statistical Testing

- PROS
  - System-level assessment
  - Theoretically sound
- CONS
  - A large number of test cases and an oracle are needed.
  - Depends on the operational profile

[Figure: software reliability assessment divides into formal verification and testing; testing divides into the time domain and the input domain. Program P maps the input space to the output space.]
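Because statistical testing depends on the operational profile, test inputs are usually drawn with the same probabilities with which operations occur in the field. The following is a minimal sketch; the three operations and their probabilities are a made-up profile used only for illustration.

```python
import random
from collections import Counter

# Hypothetical operational profile: operation name -> probability of occurrence.
OPERATIONAL_PROFILE = {
    "query_balance": 0.70,
    "transfer": 0.25,
    "close_account": 0.05,
}


def draw_test_operations(profile: dict[str, float], n_tests: int, seed: int = 0) -> list[str]:
    """Select operations to test by sampling with replacement from the profile."""
    rng = random.Random(seed)  # seeded so a test campaign can be reproduced
    operations = list(profile)
    weights = [profile[op] for op in operations]
    return rng.choices(operations, weights=weights, k=n_tests)


if __name__ == "__main__":
    sample = draw_test_operations(OPERATIONAL_PROFILE, 1000)
    print(Counter(sample))
```

Rarely used operations receive few test cases under this scheme, which is one reason the quality of the profile matters so much.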

Urn Model of Software Testing

- Random software testing is modeled as sampling with replacement.
- If repeated sampling reveals no black balls, all we gain is confidence that there are none.
- Only one type of testing can prove that no faults exist: exhaustive testing.
  - A program with 20 variables, 10 values per variable, and 1 ns per test case ==> about 3,000 years of testing!

[Figure: operational distribution, showing the probability of selecting each input value (balls 1, 2, 3, ...).]
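Both claims of the urn model can be checked with a short calculation. The sketch below assumes that test cases are independent draws from the operational distribution and that each draw fails with a fixed probability θ, so n failure-free tests occur with probability (1 − θ)^n. The function names and the example bound and confidence level are illustrative assumptions.

```python
import math

SECONDS_PER_YEAR = 365 * 24 * 3600  # assumption: non-leap year


def exhaustive_testing_years(n_variables: int, values_per_variable: int,
                             seconds_per_test: float) -> float:
    """Time needed to execute every input combination once."""
    n_inputs = values_per_variable ** n_variables
    return n_inputs * seconds_per_test / SECONDS_PER_YEAR


def failure_free_tests_needed(theta_max: float, confidence: float) -> int:
    """Smallest n such that (1 - theta_max)**n <= 1 - confidence.

    After n failure-free random tests we can claim, with the given confidence,
    that the per-test failure probability is below theta_max.
    """
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - theta_max))


if __name__ == "__main__":
    print(f"exhaustive: {exhaustive_testing_years(20, 10, 1e-9):,.0f} years")
    print(f"tests for theta < 1e-3 at 99%: {failure_free_tests_needed(1e-3, 0.99)}")
```

The number of failure-free tests grows roughly as 1/θ, which is why random testing alone cannot demonstrate ultra-high reliability levels.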

Introduction: Dependability

- Safety-critical systems require both
  - best practices for software development, with dependability as the major concern
  - rigorous validation procedures

Dependability
- Attributes: availability, reliability, safety, integrity, confidentiality, maintainability
- Means: fault prevention, fault tolerance, fault removal, fault forecasting
- Impairments: faults, errors, failures

A Reality Check

- Collection of operational software data is difficult.
- Problem occurrence rates for essential aircraft flight functions [Shooman 96]:
  - 2×10^-8 to 10^-6 occurrences per hour of operation
  - The reported failure occurrence rates are higher than required.
- Error, Fault and Failure (EFF) data collection initiatives
  - Come and go
  - We still lack data!

SRE Process

- Tasks frequently iterate.
- Post-delivery and maintenance phase (not shown)
- Testers must be involved throughout the process.
  - Allows a better understanding of the user's perspective
  - Improves system requirements and planning
- Selection of an appropriate mix of
  - fault prevention
  - fault removal
  - fault tolerance

SRE

- Types of tests applicable to SRE (based on objectives rather than on phases in the life cycle)
  - Reliability growth tests (find and remove faults)
    - A minimum of 10-20 detected faults is needed to achieve statistically meaningful results.
    - Feature tests (minimize the impact of the environment), load tests (maximize environmental impacts), and regression tests (following a major change)
  - Certification tests
    - No debugging; accept or reject the software under test.
    - The number of observed failures is not important.
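An accept/reject decision of this kind is often implemented as a sequential test. The sketch below uses Wald's sequential probability ratio test for a constant failure intensity, the idea behind reliability demonstration charts. The slides do not specify the procedure, so the risk levels alpha and beta, the discrimination ratio gamma, and all names here are assumptions.

```python
import math


def certification_decision(n_failures: int, normalized_time: float,
                           alpha: float = 0.1, beta: float = 0.1,
                           gamma: float = 2.0) -> str:
    """Return "accept", "reject", or "continue".

    normalized_time = failure intensity objective * execution time so far.
    H0: true intensity equals the objective; H1: it is gamma times higher.
    alpha is the risk of rejecting good software, beta of accepting bad software.
    """
    # Log likelihood ratio of H1 versus H0 for a Poisson failure process.
    log_lr = n_failures * math.log(gamma) - (gamma - 1.0) * normalized_time
    accept_bound = math.log(beta / (1.0 - alpha))
    reject_bound = math.log((1.0 - beta) / alpha)
    if log_lr <= accept_bound:
        return "accept"
    if log_lr >= reject_bound:
        return "reject"
    return "continue"


if __name__ == "__main__":
    # Hypothetical observations: (failures so far, normalized execution time).
    for n, tau in [(0, 1.0), (0, 2.2), (1, 1.0), (4, 0.5)]:
        print(n, tau, certification_decision(n, tau))
```

With these default parameters, software that runs failure-free for a normalized time of about 2.2 is accepted, while several early failures lead to rejection.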

Defining the “system”

- A system is an independently tested unit.
- SRE should be applied to subsystems (acquired COTS components or the OS, for example), systems, and supersystems.
- A different configuration represents a different system.
- Interface stubs may not be correct.
- But more “systems” implies higher cost.
  - Aggregation is welcome.
  - Product lines help reduce the cost.

SRE and the SW Design & Test Process

- Use knowledge of the operational profile to guide and focus design efforts.
- The established failure intensity drives the quality assurance efforts.
- The failure intensity goal determines when to stop testing.
- Measurement throughout the life cycle helps identify better methodologies.
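A stopping rule based on a failure intensity goal can be sketched as follows. This is a deliberately simple illustration that estimates the current failure intensity from the most recent inter-failure times; the window size, the data, and the function names are assumptions, and a fitted reliability growth model would normally replace the moving-window estimate.

```python
def estimated_failure_intensity(interfailure_hours: list[float], window: int = 5) -> float:
    """Failures per hour over the most recent `window` inter-failure times."""
    recent = interfailure_hours[-window:]
    if not recent:
        raise ValueError("no failure data available")
    return len(recent) / sum(recent)


def should_stop_testing(interfailure_hours: list[float],
                        objective_per_hour: float, window: int = 5) -> bool:
    """Stop once the estimated intensity is at or below the objective."""
    if len(interfailure_hours) < window:
        return False  # too little data for a meaningful estimate
    return estimated_failure_intensity(interfailure_hours, window) <= objective_per_hour


if __name__ == "__main__":
    # Hypothetical inter-failure times (hours) showing reliability growth.
    observed = [1.0, 2.0, 4.0, 8.0, 16.0, 32.0, 64.0]
    print(estimated_failure_intensity(observed))
    print(should_stop_testing(observed, objective_per_hour=0.05))
```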

Is Reliability Important?

- It should be, since it is a measurable property.
  - Unlike “software quality”
- It is useful, since the software is tested under the conditions of perceived usage.
- The number of resident faults, for example, is a developer-oriented measure. Reliability is a user-oriented measure.
- The number of faults found has NO correlation to reliability. Neither does program complexity.
- Accurate measurements of reliability are feasible.

Why Measure Reliability?

- Isn't the “best software development process” sufficient?
  - What is “best”?
  - It is important to measure the results of the process.
- Early consideration of target reliability is beneficial, since it impacts cost and schedule.
- CMM levels 4 and 5 (and 3, indirectly) recommend reliability measurement.

Summary

- Definition of software reliability.
- Software reliability engineering is the process that leads to high-reliability software.
  - Based on statistical evaluation of quality factors throughout the development life cycle.
- Reliability can be assessed using different approaches.
- Simple activities can significantly reduce software failure rates.