Special Section on Software Testing - Dependable Software Systems | CS 576, Exams of Computer Science

Material Type: Exam; Class: Dependable Software Systems; Subject: Computer Science; University: Drexel University; Term: Unknown 1989;



SPECIAL ARTICLE

SPECIAL SECTION ON SOFTWARE TESTING

RICHARD HAMLET, Guest Editor

The field of software testing spans mathematical theory, the art and practice of validation, and methodology of software development. To cover this range would require a textbook (or several texts), not a trio of articles. But the work presented in this special section is a kind of “test set.” Each paper is a significant contribution within one of the three broad areas. The reader must now make the assessment that is critical to any review of test points: are they representative? My own answer is ‘no’; these articles are provocative and revealing rather than routine summaries. And perhaps that is what software testing is all about: good tests are the ones that provide new insights, not the ones that cover well worn ground.

The Evaluation of Program-based Software Test Data Adequacy Criteria

The twin foundations of any scientific or engineering discipline are experimental knowledge and abstract conceptualization. Progress rests not only on “knowing what works,” but on a framework for organizing and thinking about what is known. Without a theoretical framework, even the best experimental results cannot be fitted into a useful whole. Author Elaine Weyuker has long advocated attention to the foundations of program testing, and has used her background in mathematical logic and recursive function theory to contribute to those foundations. Her article is a study of the logical basis for structural testing. By critically examining the properties that program-based testing methods must possess, she seeks to characterize those methods in a mathematical theory.

© 1988 ACM 0001-0782/88/0600-0662 $1.50

Perhaps the most interesting part of this work is its analysis of conflicts between “obvious” properties of testing methods. For example, Weyuker has shown that there is a logical conflict between insisting that tests succeed and calling a more inclusive test better [22]. Weyuker’s theory is in its early stages: the axioms she has in hand do not yet characterize test adequacy. In her article she analyzes examples that “escape” characterization, and seeks ways to close these loopholes.

The Category-Partition Method for Specifying and Generating Functional Tests

The paradox of program testing is that its techniques are compute-intensive, but there is a deep-seated resistance to computer aids. Research tools, many of them originating in practical environments, are well understood but woefully little used [24]. The only explanation seems to be that testers are too short on resources to consider automating their tasks, reasoning that is acknowledged to be short-sighted, but which too often carries the day. Thomas Ostrand and Marc Balcer have written a compelling counterexample to show how helpful computer assistance can be at modest cost. They describe a test-description language and a processing tool that isolates functional test cases, with the aim of providing specification coverage in a minimum number of tests.

The Growth of Software Testing

David Gelperin and Bill Hetzel begin their article with a history of program testing; no one is more qualified to characterize the field. They show how debugging and testing diverged in their aims, how demonstrating that programs work gave way to searching for program faults, and how detection methods are now leading to fault prevention. They argue that “testing” has a wider scope in program development than merely the execution of programs. The testing viewpoint can be invaluable in providing feedback to improve specifications and designs as well as code. Early in a program’s development, testing feedback leads to improvements so that potential errors never appear in the final code.

Communications of the ACM, June 1988, Volume 31, Number 6

AN ESSAY ON TESTING THEORY AND PRACTICE

My interest in program testing springs from practical and theoretical roots. While I was a student attempting to characterize properties of programs that could be pinned down by testing [7], I was also a systems programmer responsible for maintenance and reliability of two large operating systems [8]. I was introduced to computer science by Halstead’s wonderful book on NELIAC self-compiling compilers [6]. What fascinated me was the prospect of capturing an infinite object in a finite description. (I thought, at first, that a self-compiling compiler did just that: in one program it held the seeds of all programs.) Program testing has exactly the right character. Programs are complex, hard to understand, hard to prove, and consequently often riddled with errors. But might not a small set of tests, conducted with ease, pinpoint problems for repair? And when all problems have been corrected, is the program not perfect? Testing seems far easier than programming, certainly easier than proving a program. The infinite complications of what the program does, what it is supposed to do, etc., are replaced by a finite set of data.


In principle, it could work that way. The defect rate for carefully made programs can be about 5 faults/1,000 lines, so that a 1M-line program might have only 5,000 problems. Even supposing that a unique test point is needed to expose each one, thousands of trials are not impractical. The fallacy may be obvious, but it underlies our intuition about tests more than we like to admit: the points may be few, but we do not know how to find them. Worse, we know no systematic way to search, no way to judge points selected, and no way to decide when to stop. So when 5,000 points have been tried and no errors found, we do not know if there are none, or if 10 remain, or 5,000.

Formal testing research is only about 15 years old, and its results are modest. But even what is known is not finding its way into practice very well. This essay explores testing theory and practice, suggests reasons for the present situation, and predicts future work.

TESTING AND SOFTWARE ENGINEERING

I would describe software engineering as heavy-handed management based on modest common-sense principles. I do not intend to disparage the work on structured programming and development methodology, which has made large-system development a predictable craft instead of an arcane art. But its technical basis is not impressive. Modular, top-down programs were written before 1960 [15], and recognized as a good thing. To enforce such standards across the board is a management breakthrough, but not a technical one.

The Gelperin and Hetzel article would argue that a great deal can be done in testing methodology, and that the proper arena is not the narrow one of programs and their execution, but rather the whole scope of development. But it is the technical problems of software that interest me: what can we do automatically, that necessarily amplifies the intelligence and effectiveness of programmers, that makes their work more rewarding and less routine?

The field of programming languages provides a technical model. Languages and their compilers have theoretical backing and tremendous practical utility. Indeed, high-level languages enabled most of the software-engineering common sense in use today. Unless program testing has a similar technical base, software-engineering methodology for testing is premature. Programs must be tested, however little we understand the process, and managers must try to find out what works, but strict legislation based on wishful thinking is a prescription for disaster.

The technical discipline of testing can be seen as theory versus practice. Practical testing began with the first program, and so is at least twice as old as formal research in the subject. Clever programmers, often without understanding why things seem to work, have devised many testing methods, some with great intuitive appeal. In trying to analyze these methods, testing theoreticians have only been able to show that in the worst case, all fail [10, 23]. Perhaps by accident, theoreticians have come from absolute disciplines like logic, program proving, and algebra, and they have sought absolute results. They wanted to discover how testing is related to correctness. In general the two are unrelated, so the theoretical results are negative. Whatever we would like to do is usually an unsolvable problem. When we isolate tractable special cases to dodge the unsolvability, they turn out to be too narrow and atypical.

SOME TYPICAL RESULTS

I will describe the state of the program-testing discipline through an example. The method and results are not the most recently proposed, but I think they are typical of the field. Branch testing is a method that arose in practice, as a refinement of trying to execute every statement of a program. Every conditional expression in a program is singled out. A collection of test points achieves branch coverage just in case each conditional is forced to take both its true and false branches by the


expected to do some thinking: we are not calling for more of the same in the same place.

Suppose a programmer attempts to attain branch coverage, and all test points produce correct results, but some branch remains unexecuted. In the process of finding data to cover it, a fault is found. This can happen because:

  1. The fault is seen while studying the program to find additional test points.
  2. Additional test points are tried that fail to cover the branch, but happen to produce incorrect results.
  3. Additional test points do cover the branch, and also produce incorrect results.

The experimental conclusion that “branch testing found the fault” is questionable. Perhaps in case 1 the study was in a different part of the program from the untried branch, and the problem had nothing to do with that branch. In case 2 the branch seems irrelevant. Case 3 seems to support the conclusion, but suppose that the incorrect results are traced to logic that should have been inserted before the branch? In such circumstances it is only accidental that the error was found while doing branch testing. The person found the fault, and maybe the “method” did no more than keep the person interested in more tests and more study.

If branch testing can be credited with finding the fault, so could any method that rubs the programmer’s nose in the code. (For example, upon being presented with any test results at all, a manager could say, “Not enough! Try to exercise routine (fill in the blank) more!”) Although no study has been attempted to demonstrate the distortions of nose-rubbing, evidence for it exists. In [1] it was found that a coverage technique detected about 40 percent of the faults of omission in code (and another 20 percent could have been seen but were simply not noticed). It is ludicrous to say that covering something that is not there discovers it 60 percent of the time.

Programming skill (and presumably testing skill is related) is one of the most widely varying aspects of human behavior. Therefore, testing experiments suffer maximum distortion from the nose-rubbing effect. Comparisons between methods are doubly uncertain, because the skills needed to use the methods may be different. This points out another defect in the subsumes ordering: an experiment can show that method A is inferior to method B, even though A subsumes B. The technically more powerful method could be harder to use, and so people choose less useful data to satisfy it. The fault lies both with the subsumes ordering and with such experiments.
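The point about faults of omission can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: `classify`, `expected`, `branch`, and the test points are hypothetical and come neither from the article nor from the study in [1]. The sketch records both outcomes of the single conditional by hand instead of using a real coverage tool. The two test points achieve branch coverage and produce correct results, yet the case that was never written goes unnoticed.

```python
covered = set()  # (conditional label, outcome) pairs observed so far


def branch(label, outcome):
    """Record which way a conditional went, then pass the outcome through."""
    covered.add((label, bool(outcome)))
    return outcome


def classify(x):
    # Hypothetical unit under test. Intended results: "neg", "zero", "pos".
    # Fault of omission: the x == 0 case was never written.
    if branch("x<0", x < 0):
        return "neg"
    return "pos"


def expected(x):
    # Hypothetical oracle standing in for the specification.
    if x < 0:
        return "neg"
    if x == 0:
        return "zero"
    return "pos"


# Two test points force the only conditional both ways.
tests = [-1, 5]
all_pass = all(classify(x) == expected(x) for x in tests)
full_coverage = covered == {("x<0", True), ("x<0", False)}

# The omitted case is exposed only by an input nobody was told to try.
missed_fault = classify(0) != expected(0)
```

Nothing in the coverage criterion suggests the input 0; it is found only by someone who studies the code or the specification and suspects the missing case.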
Management based on this shifting ground can succeed, but relies heavily on the goodwill of technical staff. Ordered to conduct (say) branch testing, which is unreliable in itself, the staff can fulfill the letter of the law without real effect. However, branch testing may be as good as any other nose-rubbing technique if used creatively with intent to discover problems. It is almost as if a hardline manager relied on a time clock to check employees’ time sheets, but the clock does not keep time, so the people have to set it before punching in and out. Programmers have a real interest in making their code work, and there is an immense store of goodwill on which management can draw, but it is ultimately dangerous to put trust in unsound technical procedures. Under pressure of tight schedules, or because of personality conflicts, cheating might become desirable.


STATISTICAL THEORY

Had the dominant theoretical view of testing been probabilistic, the field might look quite different. There is nothing very clever in random sampling of program inputs, and the result cannot be expected to magically show correctness, but statistical measures seem just the right way to evaluate the dubious experiments that tests are. Intuitively, one of the reasons we trust programs is because they have a long history of working properly, and statistical quantification of this idea should be investigated.

Most of the statistical work is based on the failure-rate model [20], in which it is assumed that a program has a probability R of failing for a randomly selected input, and thus that as more and more tests are conducted, the observed number of failures will approach fraction R of the trials. It must further be assumed that random selection during the tests follows the same distribution as that for the program’s long term operation. On this basis it is possible to model the testing process, for example to compute the number of successful test points needed to guarantee that R is below a given value, with a given confidence [5]. Or, the mean time to failure for a continuously operating system can be predicted from measurements of the time sequence of its crashes [20]. Or, the total time required to debug a program can be predicted from its initial history of problems and fixes

The failure-rate model is not widely accepted. Its assumptions do not seem appropriate for programs, and the mathematics is tractable only for the simplest cases. Some of the results are intuitively wrong. For example, the number of tests required to force high confidence in a low failure rate does not depend on the size of the program or on the size of the input domain. Critics explain successful empirical studies (particularly for models of the debugging process) as curve fitting with an adequate supply of parameters. However, even if failure-rate models are wrong, probabilistic analysis seems appropriate for testing theory because it is capable of comparing methods and assessing confidence in successful tests [10].

FUTURE NEEDS AND DIRECTIONS

From the technical side, what testing needs most is a sound, fundamental theory, one capable of analyzing and comparing particular methods. The “absolute” approach has shown that in general nothing but exhaustive testing can be relied upon, and that special cases do not escape this result except in unreal situations. One promising direction seems to remain, so-called “error-based” testing [16]. Its roots are in mutation theory and hardware testing for specific defects. The tester begins with a precise list of code faults that are to be precluded, and devises tests that can succeed only if those faults are not present. Running the tests then shows not that the software is correct, but that it at least does not contain the faults originally listed. Error-based testing theory requires elements of program proving, in the analysis that demonstrates test points would expose the given faults, and it is a difficult field in which work is only beginning.

There are two candidates for a general theory of testing:

  1. Statistical methods can be used for precise analysis, but they are limited to measures such as the number of test points needed. Although statistical theories are routinely used for other kinds of quality control, their application to program testing is suspect because it relies on the failure-rate model. There has been very little progress in finding better theories, perhaps because researchers with appropriate training in probability lack an intuitive grasp of testing issues, and those who understand testing know nothing of decision theory.
  2. An axiomatic approach to program testing has broad promise because it can be applied to any property of tests or test methods. A collection of axioms attempts to formally capture properties that tests must possess. Each axiom is self-evident, but their interplay can lead to deeper results. Weyuker’s article represents the beginnings of axiomatics.
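Returning to the error-based testing described before the list, the idea of a precise fault list paired with tests that can succeed only if those faults are absent can be sketched in Python. This is a toy illustration under stated assumptions, not the method of [16]: `max_of`, the three faulty variants, and the two test points are all hypothetical, and the "analysis" is done by simply running each variant rather than by proof.

```python
def max_of(a, b):
    # Hypothetical unit under test.
    return a if a >= b else b


# The precise list of faults to be precluded, each written as a faulty variant.
faulty_variants = {
    "comparison reversed": lambda a, b: a if a <= b else b,
    "always returns first argument": lambda a, b: a,
    "always returns second argument": lambda a, b: b,
}

# Test points (arguments, expected result) chosen so that every listed
# fault causes at least one of them to fail.
tests = [((3, 1), 3), ((1, 3), 3)]


def passes(impl):
    """True if the implementation gives the expected result on every test."""
    return all(impl(*args) == want for args, want in tests)


def surviving_faults():
    """Names of listed faults that the test set fails to expose."""
    return [name for name, impl in faulty_variants.items() if passes(impl)]
```

If `passes(max_of)` holds and `surviving_faults()` is empty, the tests show only that none of the three listed faults is present; they say nothing about faults outside the list.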

Technology transfer from testing research to industrial software development has been slow, and the methods and tools that have been devised have not been widely accepted. This seems strange, because testing tools are often easier to build than those for the now ubiquitous high-level programming languages. Perhaps the reason that testing technology goes unused is that languages must be acquired first and can stand alone, while test systems rely on language details and the operating environment, and have low budget priority. Standardization of operating systems and languages, which has accelerated in both the PC and microprocessor worlds, may make a difference in the near future. The article by Ostrand and Balcer in this section is a good example of what should be happening far more frequently. Their testing methodology is adapted to a special-case problem, and supported by an easily implemented tool.

Decrying the lack of testing foundations while advocating the spread of the technology based on that theory may sound like talking out of both sides of the mouth. But well engineered test tools are worthwhile, even when they use something like branch testing that cannot be relied on to find errors. Nose-rubbing works: programmers who systematically look for bugs find and fix them. A tool that checks branch coverage costs very little to build and use. Other tools (such as symbolic execution systems [2, 3]) require skill to use, but are powerful in talented hands. Debugging tools are the best, because they give nose-rubbing full play, and an error found is a gain that in no way relies on a missing theory. Such tools are best placed in the hands of programmers who have the greatest knowledge of and interest in their code. Confidence measures based on testing are not as good, because they are not credible without a theory to back them up, and because they are used by evaluation or management people who do not know the code.


Another reason for advocating the routine use of imperfect test methods is the need for empirical information gathering. The raw material of sound theory is experimentation. Using a poor debugger shows what it lacks; using a false model for reliability can expose its flaws and suggest changes. The infrastructure needed to put testing methods in place is slowly built, and built relatively independent of any flaws in those methods. The danger, of course, is that a premature attempt to use the technology will give it a bad name without collecting information for improvement.

New languages and new styles of programming require new intuitions about testing. The whole of so-called structural test methods (e.g., branch testing) is based on the imperative, state-dependent execution model of von Neumann machine languages and unit testing in conventional high-level languages like FORTRAN and C that make efficient use of such machines. Functional and logic-programming languages do not share these properties. Conventional parallel programs have the usual properties, but repeated in ensembles that add a new dimension of complexity. The methods and intuitions built up over 30 years of conventional program testing do not apply in these situations, which are growing in importance. The opportunity exists to do better testing theory for these new programs than we have done in the past.
