Dependable Software Systems
Topics in Software Reliability
Material drawn from [Sommerville, Mancoridis]
What is Software Reliability?
- A Formal Definition: Reliability is the probability of failure-free operation of a system over a specified time within a specified environment for a specified purpose.
Software Reliability
- It is difficult to define the term objectively.
- Difficult to measure user expectations.
- Difficult to measure environmental factors.
- It’s not enough to consider simple failure rate:
- Not all failures are created equal; some have much more serious consequences.
- Might be able to recover from some failures reasonably.
Failures and Faults
- A failure corresponds to unexpected run-time behavior observed by a user of the software.
- A fault is a static software characteristic which causes a failure to occur.
Improving Reliability
- Primary objective: Remove faults with the most serious consequences.
- Secondary objective: Remove faults that are encountered most often by users.
Improving Reliability (Cont’d)
- Fixing N% of the faults does not, in general, lead to an N% reliability improvement.
- 90-10 Rule: 90% of the time you are executing 10% of the code.
- One study showed that removing 60% of software “defects” led to a 3% reliability improvement.
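One way to see why the two percentages diverge is to weight each fault by how often it is actually triggered. The sketch below is illustrative only: the fault inventory, its failure figures, and the helper name `failure_rate_reduction` are invented assumptions, not data from the study mentioned above.

```python
# Hypothetical fault inventory: failures caused per 1000 runs by each fault.
# These numbers are invented for illustration only.
faults = [50.0, 30.0, 10.0, 5.0, 2.0, 1.0, 1.0, 0.5, 0.3, 0.2]


def failure_rate_reduction(rates, removed_indices):
    """Fraction of the total failure rate eliminated by removing the given faults."""
    total = sum(rates)
    removed = sum(rates[i] for i in removed_indices)
    return removed / total


# Remove the 6 least frequently triggered faults (60% of all faults).
rarest = sorted(range(len(faults)), key=lambda i: faults[i])[:6]

# Remove only the 2 most frequently triggered faults (20% of all faults).
most_frequent = sorted(range(len(faults)), key=lambda i: faults[i], reverse=True)[:2]

print(f"Rarest 60% removed: {failure_rate_reduction(faults, rarest):.1%} fewer failures")
print(f"Top 20% removed:    {failure_rate_reduction(faults, most_frequent):.1%} fewer failures")
```

With these invented numbers, removing 60% of the faults cuts observed failures by about 5%, while removing the two most frequently triggered faults cuts them by about 80%.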
The Cost of Reliability (Cont’d)
- Cost of software failure often far outstrips the cost of the original system:
- data loss
- down-time
- cost to fix
Measuring Reliability
- Hardware failures are almost always physical failures (i.e., the design is correct).
- Software failures, on the other hand, are due to design faults.
- Hardware reliability metrics are not always appropriate for measuring software reliability, but software reliability metrics evolved from them.
Reliability Metrics (ROCOF)
- Rate Of Occurrence Of Failure (ROCOF):
- Frequency of occurrence of failures.
- E.g., ROCOF of 0.02 means 2 failures are likely in each 100 time units.
- Relevant for transaction processing systems.
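The ROCOF example above can be sketched as a simple ratio of observed failures to observed time units. The function name `rocof` and the choice to reject a non-positive observation period are assumptions made for this sketch.

```python
def rocof(failure_count, observed_time_units):
    """Rate of occurrence of failure: failures per time unit."""
    if observed_time_units <= 0:
        raise ValueError("observation period must be positive")
    return failure_count / observed_time_units


# 2 failures observed over 100 time units, as in the example above.
rate = rocof(2, 100)
print(rate)  # 0.02
```

The time unit is whatever suits the system, so for a transaction processing system the denominator would be a transaction count rather than elapsed time.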
Reliability Metrics (MTTF)
- Mean Time To Failure (MTTF):
- Measure of time between failures.
- E.g., MTTF of 500 means an average of 500 time units passes between failures.
- Relevant for systems with long transactions.
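As a rough illustration, MTTF can be estimated by averaging the gaps between consecutive failure times. The failure log below is hypothetical, and the helper name `mttf` is an assumption of this sketch; it also ignores repair time.

```python
def mttf(failure_times):
    """Mean gap between consecutive failures, given failure times in ascending order."""
    if len(failure_times) < 2:
        raise ValueError("need at least two failures to measure a gap")
    gaps = [later - earlier for earlier, later in zip(failure_times, failure_times[1:])]
    return sum(gaps) / len(gaps)


# Hypothetical failure log, in time units since the system started.
failure_times = [0, 450, 1000, 1500, 2000]
print(mttf(failure_times))  # 500.0
```

If the failure rate is roughly constant, the two metrics are reciprocals of each other: an MTTF of 500 time units corresponds to a ROCOF of 1/500 = 0.002 failures per time unit.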
Time Units
- What is an appropriate time unit?
- Some examples:
- Raw execution time, for non-stop real-time systems.
- Number of transactions, for transaction-based systems.
Types of Failures
- Not all failures are equal in their seriousness:
- Transient vs permanent
- Recoverable vs non-recoverable
- Corrupting vs non-corrupting
- Consequences of failure:
- Malformed HTML document.
- Inode table trashed.
- Incorrect radiation dosage reported.
- Incorrect radiation dosage given!
Automatic Bank Teller Example
- Bank has 1000 machines; each machine in the network is used 300 times per day.
- Lifetime of software release is 2 years.
- Therefore, there are about 300,000 database transactions per day, and each machine handles about 200,000 transactions over the 2 years.
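The figures above follow from simple multiplication. The sketch below reproduces them; the variable names and the 365-day year are assumptions for illustration.

```python
machines = 1000
uses_per_machine_per_day = 300
release_lifetime_years = 2
days_per_year = 365  # assumption: leap days ignored

# Transactions hitting the database across the whole network each day.
network_transactions_per_day = machines * uses_per_machine_per_day

# Transactions handled by a single machine over the life of the release.
per_machine_lifetime_transactions = (
    uses_per_machine_per_day * days_per_year * release_lifetime_years
)

print(network_transactions_per_day)       # 300000
print(per_machine_lifetime_transactions)  # 219000
```

The per-machine total of 219,000 is what the slide rounds to "about 200,000".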
Example Reliability Specification
| Failure class | Example | Reliability metric |
| --- | --- | --- |
| Permanent, non-corrupting | The system fails to operate with any card; must be restarted. | ROCOF = 1 occ./ days |
| Transient, non-corrupting | The magnetic strip on an undamaged card cannot be read. | POFOD = 1 in 1000 trans. |
| Transient, corrupting | A pattern of transactions across the network causes DB corruption. | Should never happen |
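To get a feel for what the POFOD (probability of failure on demand) target means in practice, it can be combined with the transaction volumes from the bank teller example. This is a back-of-the-envelope sketch: the variable names are assumptions, and it treats every transaction as an independent demand.

```python
# Figures taken from the bank teller example.
network_transactions_per_day = 300_000
uses_per_machine_per_day = 300

# POFOD target for unreadable cards: 1 failure in 1000 transactions.
pofod_one_in = 1000

# Expected number of card-read failures per day, network-wide and per machine.
expected_network_failures_per_day = network_transactions_per_day / pofod_one_in
expected_machine_failures_per_day = uses_per_machine_per_day / pofod_one_in

print(expected_network_failures_per_day)  # 300.0
print(expected_machine_failures_per_day)  # 0.3
```

Under these assumptions, the target tolerates roughly 300 unreadable-card failures per day across the network, or about one every three days at any single machine.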