West Virginia University
Software Reliability Class
• Software Engineering
  • SENG 691 D
• Computer Science
  • CS 791 X
• Time: Tuesdays, 6:00 PM – 8:30 PM
• The SENG section is online; the CS section meets in ESB 801
• Instructor: Bojan Cukic
• Office phone: 304-293-0405 ext. 2526
Rules of Operation
• Email communication is strongly preferred.
• Chats are possible (Gmail or Skype) but must be arranged in advance.
• Office visits are encouraged.
  • ESB 731, Evansdale campus, Morgantown.
• Textbook:
  • John Musa, Software Reliability Engineering: More Reliable Software Faster and Cheaper, McGraw-Hill, 1998. The book is out of print, but the 2nd edition can be obtained as a Print On Demand (POD) title from AuthorHouse publishers at a considerable discount. See http://members.aol.com/JohnDMusa/book.htm and follow the appropriate links.
Rules of Operation
• Tests: midterm and final
• Presentations and research papers
  • Each student will choose a topic in agreement with the instructor.
  • Presentations take place during regular classes (online students present by phone).
  • Presentations will grow into papers by the end of the semester.
  • SENG students can report on reliability best practices or on the application of reliability engineering to projects in their own organizations.
  • CS students’ topics will require research content (a project).
Rules of Operation
• See the syllabus for details on tests, presentations, and papers.
• Grading:
  • Midterm + final = 40%
  • Class presentation = 15%
  • Term papers / project reports = 35%
  • Class participation = 10%
  • 90+ = A, 80+ = B, 70+ = C, …
  • A passing grade is required on both the tests and the papers/presentations.
Software Reliability Engineering: A Short Overview
Bojan Cukic
Lane Department of Computer Science and Electrical Engineering, West Virginia University
Introduction
• Hardware for safety-critical systems is very reliable, and its reliability continues to improve.
• Software is not as reliable as hardware, yet its role in safety-critical systems keeps growing.
• “Today, the majority of engineers understand very little about the science of programming or the mathematics that one needs to analyze a program. On the other hand, the scientists who study programming know very little about what it means to be an engineer...” [Parnas 1997]
Introduction: Ariane flight 501 failure
• The Ariane 4 SRI (Inertial Reference System) software was reused on Ariane 5.
• Ariane 4 accelerated much more slowly and flew a different trajectory.
• In SRI-1 and SRI-2, an Operand Error exception was raised by an overflow when converting a 64-bit floating-point value to a 16-bit signed integer.
• The SRIs declared failure in two successive data cycles (72 ms).
• The On-Board Computer interpreted the SRI-2 diagnostic pattern as flight data and commanded nozzle deflection.
• 39 s after launch, the launcher disintegrated because of high aerodynamic loads caused by an angle of attack of more than 20 degrees.
Infamous Software Failures
• July 22, 1962: Mariner 1 space probe.
  • A formula written in pencil on paper was coded incorrectly. The trajectory was miscalculated, the rocket veered off its path at launch, and it was destroyed over the Atlantic.
• 1982: Trans-Siberian gas pipeline explosion.
  • A fault was allegedly planted in Canadian pipeline control software that the Soviets had covertly acquired. Little is known about the nature of the fault.
• 1985–1987: Therac-25 medical accelerator.
  • The radiation therapy device delivered lethal doses at several facilities. A software safety interlock had replaced the electromechanical one, and it failed. A race condition in the operating software was at fault.
Infamous Software Failures
• 1988: Buffer overflow in the Berkeley Unix finger daemon.
  • Allowed the spread of the first Internet worm. The gets() function did not check the length of the input string, so the worm code was able to take control of the machines.
• 1988–1996: Kerberos random number generator.
  • The generator was not properly “seeded”. For 8 years, it was possible to break into the most “secure” authentication system using trivial mathematics. It is not known whether the fault was ever exploited.
Infamous Software Failures
• January 15, 1990: AT&T network outage.
  • A fault in a new software release caused long-distance switches to crash when they received a crash-recovery message from a neighboring machine. 114 switches kept crashing and rebooting every 6 seconds for 9 hours. The old software release was reloaded to fix the problem.
• 1993: Intel Pentium floating-point division.
  • An error of 0.006% in division caused a public relations nightmare. 3 to 5 million chips were in circulation, with a replacement offered to anyone who complained. $475M in damages.
Infamous Software Failures
• 1995/96: The Ping of Death.
  • Malformed “ping” packets were not checked and caused computers to crash with the “blue screen of death”. The cause was a lack of sanity checks in the error-handling code. Windows, Mac, and UNIX systems were affected.
• November 2000: National Cancer Institute, Panama City.
  • Therapy planning software miscalculated the proper dosage of radiation. Doctors “tricked” the software by placing additional shielding blocks that the software did not plan for, and the dosage calculation depended on peculiar user interaction. 8 patients died and at least 20 received overdoses. The doctors, who were supposed to double-check the computer’s calculations by hand, were indicted for murder.
Software Reliability
• Software reliability: P(A|B)
  • A: the software does not fail when operated for t time units under specified conditions.
  • B: the software has not failed at time 0.
• Ultra-high reliability requirements for safety-critical systems (Draft Int’l Standard IEC65A123 for Safety Integrity Level 4):
  • Continuous control systems: < 10^-8 failures per hour
    • Airbus 320/330/340 and Boeing 777: < 10^-9 failures/h
    • This translates to about 114,155 years of operation without encountering a failure
  • Protection systems (emergency shutdown): < 10^-4 failures/h
    • UK Sizewell B nuclear reactor (emergency shutdown): < 10^-3 failures/h
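The conversion from a failure-rate requirement to years of failure-free operation can be checked with a few lines of Python. This is a minimal sketch: the function names are hypothetical, a year is taken as 365 days, and the reliability function assumes a constant failure rate (an exponential model), which the slide does not state.

```python
import math

HOURS_PER_YEAR = 24 * 365  # assumption: non-leap year


def mean_years_between_failures(failures_per_hour: float) -> float:
    """Mean time between failures, in years, for a constant failure rate."""
    return 1.0 / failures_per_hour / HOURS_PER_YEAR


def reliability(failures_per_hour: float, hours: float) -> float:
    """R(t) = exp(-lambda * t): probability of no failure in `hours` of
    operation, assuming a constant failure rate (illustrative assumption)."""
    return math.exp(-failures_per_hour * hours)


if __name__ == "__main__":
    requirements = [
        ("continuous control (SIL 4)", 1e-8),
        ("civil aircraft flight control", 1e-9),
        ("protection system (SIL 4)", 1e-4),
    ]
    for label, rate in requirements:
        years = mean_years_between_failures(rate)
        print(f"{label}: {rate:g} failures/h ~ {years:,.0f} years between failures")
```

For a rate of 10^-9 failures/h the result is roughly 114,000 years, which is why such requirements cannot be demonstrated by operating the system and waiting.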
Time Domain Approach
• Observed failure data from testing are fitted to various statistical models
  • Time-between-failure models and period failure count models
• Used for:
  • assessing current reliability
  • predicting future reliability
  • controlling software testing
[Figure: Software Reliability Assessment divides into Formal Verification and Testing; Testing divides into Time Domain and Input Domain]
[Figure: failure intensity λi plotted against time]
• CONS:
  • Perfect fault removal assumed
  • Cannot be used to predict ultra-high reliability levels
Time domain models
• Reliability growth models
• Jelinski-Moranda model (JM)
  • The number of initial faults is unknown but fixed
  • Fault removal is perfect (no new faults are introduced)
  • Times between failure occurrences are independent, exponentially distributed random quantities
  • All remaining faults contribute equally to the failure intensity
• General problems (more assumptions)
  • All faults are detectable
  • Statistical independence of inter-failure arrival times
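The JM assumptions listed above imply that, while waiting for the i-th failure, the failure intensity is proportional to the number of faults still in the program. A minimal sketch of that relationship follows; the names `n_faults` (initial fault count) and `phi` (per-fault contribution to the failure intensity) are chosen here for illustration, as are the numeric values.

```python
import random


def jm_failure_intensity(n_faults: int, phi: float, i: int) -> float:
    """Failure intensity while waiting for the i-th failure (1 <= i <= n_faults).

    Each of the (n_faults - i + 1) remaining faults contributes phi."""
    if not 1 <= i <= n_faults:
        raise ValueError("i must be between 1 and n_faults")
    return phi * (n_faults - i + 1)


def simulate_jm_interfailure_times(n_faults: int, phi: float, seed: int = 0):
    """Draw independent exponential inter-failure times, one per fault,
    assuming every failure leads to perfect removal of one fault."""
    rng = random.Random(seed)
    return [
        rng.expovariate(jm_failure_intensity(n_faults, phi, i))
        for i in range(1, n_faults + 1)
    ]


if __name__ == "__main__":
    n_faults, phi = 10, 0.01  # hypothetical values
    times = simulate_jm_interfailure_times(n_faults, phi)
    for i, t in enumerate(times, start=1):
        lam = jm_failure_intensity(n_faults, phi, i)
        print(f"failure {i:2d}: intensity {lam:.3f}/h, "
              f"expected wait {1 / lam:7.1f} h, simulated wait {t:7.1f} h")
```

The intensity drops by the same step `phi` after every repair, which is the "reliability growth" the model predicts. In practice `n_faults` and `phi` are unknown and have to be estimated from observed failure times.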
Related Work: Statistical testing
• PROS
  • System-level assessment
  • Theoretically sound
• CONS
  • Requires a large number of test cases and an oracle
  • Depends on the operational profile
[Figure: Software Reliability Assessment divides into Formal Verification and Testing; Testing divides into Time Domain and Input Domain]
[Figure: Input Space → Program P → Output Space]
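Statistical testing draws test inputs according to the operational profile and relies on an oracle to judge each result. The sketch below illustrates the idea under stated assumptions: the operations and their probabilities are invented, and the `fails` callback stands in for running program P and consulting the oracle.

```python
import random

# Hypothetical operational profile: operation -> probability of occurrence in use
OPERATIONAL_PROFILE = {
    "query_balance": 0.70,
    "transfer_funds": 0.25,
    "close_account": 0.05,
}


def draw_test_operations(profile, n_tests, seed=0):
    """Select operations to test with the same frequencies as in field use."""
    rng = random.Random(seed)
    operations = list(profile)
    weights = [profile[op] for op in operations]
    return rng.choices(operations, weights=weights, k=n_tests)


def estimate_failure_probability(profile, fails, n_tests, seed=0):
    """Fraction of profile-driven tests that the oracle judges as failed.

    `fails(op)` is a placeholder for executing the program on `op` and
    comparing the output against the oracle; it returns True on failure."""
    operations = draw_test_operations(profile, n_tests, seed)
    return sum(1 for op in operations if fails(op)) / n_tests


if __name__ == "__main__":
    # Pretend that only the rarely used operation is faulty.
    estimate = estimate_failure_probability(
        OPERATIONAL_PROFILE, lambda op: op == "close_account", n_tests=10_000
    )
    print(f"estimated failure probability per run: {estimate:.3f}")
```

Because the estimate is weighted by usage, a fault in a rarely used operation contributes little to it. This is why the result depends so strongly on how accurate the operational profile is.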
Urn Model of Software Testing
• Random software testing is modeled as sampling with replacement
• If repeated sampling reveals no black balls (failure-causing inputs), all we gain is confidence that there are none
• Only one type of testing can prove that there are no existing faults: exhaustive testing
  • Program with 20 variables, 10 values/var, 1 ns per test case ==> about 3,000 years of testing!
[Figure: operational distribution, probability plotted over input values (balls) 1, 2, 3, 4, 5, 6, 7, ...]
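Both claims on this slide can be put into numbers. The sketch below first reproduces the exhaustive-testing estimate. It then uses the standard result for sampling with replacement: if the probability of drawing a failing input is theta, the chance that n independent draws all pass is (1 - theta)^n. The function names are hypothetical, and the tests are assumed to be drawn independently from the operational distribution.

```python
import math

SECONDS_PER_YEAR = 365 * 24 * 3600  # assumption: non-leap year


def exhaustive_test_years(n_variables, values_per_variable, seconds_per_test):
    """Time needed to run every input combination once, in years."""
    n_inputs = values_per_variable ** n_variables
    return n_inputs * seconds_per_test / SECONDS_PER_YEAR


def failure_free_tests_needed(theta, confidence):
    """Smallest n with (1 - theta)^n <= 1 - confidence, i.e. the number of
    consecutive failure-free random tests that support the claim
    'failure probability per run is below theta' at the given confidence."""
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - theta))


def failure_probability_bound(n_tests, confidence):
    """Upper bound on the failure probability per run after n_tests
    failure-free random tests, at the given confidence."""
    return 1.0 - (1.0 - confidence) ** (1.0 / n_tests)


if __name__ == "__main__":
    years = exhaustive_test_years(20, 10, 1e-9)
    print(f"exhaustive testing: about {years:,.0f} years")
    for theta in (1e-3, 1e-6, 1e-9):
        n = failure_free_tests_needed(theta, 0.99)
        print(f"theta < {theta:g} at 99% confidence: {n:,} failure-free tests")
```

The number of tests grows roughly as 1/theta, so failure-free random testing yields confidence rather than proof, and ultra-high reliability targets are out of reach for this approach alone.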
Introduction: Dependability
• Safety-critical systems require both
  • best practices for software development, with dependability being the major concern
  • rigorous validation procedures
[Figure: dependability tree]
• Dependability
  • Attributes: Availability, Reliability, Safety, Integrity, Confidentiality, Maintainability
  • Means: Fault Forecasting, Fault Tolerance, Fault Removal, Fault Prevention
  • Impairments: Faults, Errors, Failures
A Reality Check
• Collection of operational software data is difficult
• Problem occurrence rates for essential aircraft flight functions [Shooman 96]:
  • 2×10^-8 to 10^-6 occurrences per hour of operation
  • The reported failure occurrence rates are higher than required
• Error, Fault and Failure (EFF) data collection initiatives
  • Come and go
  • We are still missing data!
SRE Process
• Tasks frequently iterate
• Post-delivery and maintenance phase (not shown)
• Testers must be involved throughout the process
  • Allows better understanding of the user’s perspective
  • Improvement of system requirements and planning
• Selection of an appropriate mix of
  • fault prevention
  • fault removal
  • fault tolerance
SRE
• Types of tests applicable to SRE (based on objectives rather than phases in the life cycle)
  • Reliability growth tests (find and remove faults)
    • Need a minimum of 10-20 detected faults to achieve statistically meaningful results
    • Feature tests (minimize the impact of the environment), load tests (maximize environmental impacts), regression tests (following a major change)
  • Certification tests
    • No debugging; accept or reject the software under test
    • The number of observed failures is not important
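One common way to make the accept/reject decision of a certification test is a sequential probability ratio test on failure times. The course textbook presents it as a reliability demonstration chart. The slide does not say which procedure is meant, so the following is a sketch under stated assumptions: failures follow a Poisson process, the test is evaluated at each failure, and the defaults (discrimination ratio 2, supplier and consumer risks of 10%) are typical illustrative values. All function and parameter names are hypothetical.

```python
import math


def certification_decision(failure_times, objective, gamma=2.0, alpha=0.1, beta=0.1):
    """Sequential accept/reject decision for a certification test.

    failure_times: cumulative test time at each observed failure
    objective:     failure intensity objective (failures per unit time)
    gamma:         discrimination ratio (> 1)
    alpha, beta:   supplier and consumer risk levels
    """
    if gamma <= 1.0:
        raise ValueError("gamma must be greater than 1")
    accept_const = math.log(beta / (1.0 - alpha))
    reject_const = math.log((1.0 - beta) / alpha)
    for n, t in enumerate(failure_times, start=1):
        normalized = objective * t  # time in units of 1 / objective
        accept_at = (accept_const - n * math.log(gamma)) / (1.0 - gamma)
        reject_at = (reject_const - n * math.log(gamma)) / (1.0 - gamma)
        if normalized >= accept_at:
            return "accept"
        if normalized <= reject_at:
            return "reject"
    return "continue"


def failure_free_time_to_accept(gamma=2.0, alpha=0.1, beta=0.1):
    """Normalized failure-free test time after which the software is accepted."""
    return math.log(beta / (1.0 - alpha)) / (1.0 - gamma)


if __name__ == "__main__":
    objective = 0.001  # hypothetical: 1 failure per 1000 h
    print(certification_decision([100.0, 200.0, 300.0, 400.0], objective))
    print(certification_decision([3000.0], objective))
    print(certification_decision([1000.0], objective))
```

With the default parameters, a failure-free run of about 2.2 times the mean time between failures implied by the objective is enough to accept. Failures that arrive in quick succession push the test toward rejection.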
Defining the “system”
• A system is an independently tested unit
• SRE should be applied to subsystems (acquired COTS components or the OS, for example), systems, and supersystems
• A different configuration represents a different system
• Interface stubs may not be correct
• But more “systems” implies higher cost
  • Aggregation is welcome
  • Product lines help reduce the cost
SRE and SW design & test process
• Use knowledge of the operational profile to guide and focus design efforts
• The established failure intensity drives the quality assurance efforts
• The failure intensity goal determines when to stop testing
• Measurement throughout the life cycle helps identify better methodologies
Is Reliability Important?
• It should be, since it is a measurable property
  • Unlike “software quality”
• Useful, since the software is tested under the conditions of perceived usage.
• The number of resident faults, for example, is a developer-oriented measure. Reliability is a user-oriented measure.
• The number of faults found has NO correlation to reliability. Neither does program complexity.
• Accurate measurements of reliability are feasible.
Why Measure Reliability?
• Isn’t the “best software development process” sufficient?
  • What is “best”?
  • It is important to measure the results of the process.
• Early consideration of target reliability is beneficial, since it impacts cost and schedule.
• CMM levels 4 and 5 (and 3, indirectly) recommend reliability measurement.
Summary
• Definition of software reliability.
• Software reliability engineering is the process that leads to high-reliability software.
  • Based on statistical evaluation of quality factors throughout the development life cycle.
• Reliability can be assessed using different approaches.
• Simple activities can significantly reduce software failure rates.