



Material Type: Exam; Class: Dependable Software Systems; Subject: Computer Science; University: Drexel University; Term: Unknown 1989;




The field of software testing spans mathematical theory, the art and practice of validation, and methodology of software development. To cover this range would require a textbook (or several texts), not a trio of articles. But the work presented in this special section is a kind of “test set.” Each paper is a significant contribution within one of the three broad areas. The reader must now make the assessment that is critical to any review of test points: are they representative? My own answer is ‘no’; these articles are provocative and revealing rather than routine summaries. And perhaps that is what software testing is all about: good tests are the ones that provide new insights, not the ones that cover well worn ground.
The Evaluation of Program-based Software Test Data Adequacy Criteria
The twin foundations of any scientific or engineering discipline are experimental knowledge and abstract conceptualization. Progress rests not only on “knowing what works,” but on a framework for organizing and thinking about what is known. Without a theoretical framework, even the best experimental results cannot be fitted into a useful whole. Author Elaine Weyuker has long advocated attention to the foundations of program testing, and has used her background in mathematical logic and recursive function theory to contribute to those foundations. Her article is a study of the logical basis for structural testing. By critically examining the properties that program-based testing methods must possess, she seeks to characterize those methods in a mathematical theory.
© 1988 ACM 0001-0782/88/0600-0662 $1.
Perhaps the most interesting part of this work is its analysis of conflicts between “obvious” properties of testing methods. For example, Weyuker has shown that there is a logical conflict between insisting that tests succeed and calling a more inclusive test better [22]. Weyuker’s theory is in its early stages: the axioms she has in hand do not yet characterize test adequacy. In her article she analyzes examples that “escape” characterization, and seeks ways to close these loopholes.
The Category-Partition Method for Specifying and Generating Functional Tests
The paradox of program testing is that its techniques are compute-intensive, but there is a deep-seated resistance to computer aids. Research tools, many of them originating in practical environments, are well understood but woefully little used [24]. The only explanation seems to be that testers are too short on resources to consider automating their tasks, reasoning that is acknowledged to be short-sighted, but which too often carries the day. Thomas Ostrand and Marc Balcer have written a compelling counterexample to show how helpful computer assistance can be at modest cost. They describe a test-description language and a processing tool that isolates functional test cases, with the aim of providing specification coverage in a minimum number of tests.
The Growth of Software Testing
David Gelperin and Bill Hetzel begin their article with a history of program testing; no one is more qualified to characterize the field. They show how debugging and testing diverged in their aims, how demonstrating that programs work gave way to searching for program faults, and how detection methods are now leading to fault prevention. They argue that “testing” has a wider scope in program development than merely the execution of programs. The testing viewpoint can be invaluable in providing feedback to improve specifications and designs as well as code. Early in a program’s development, testing feedback leads to improvements so that potential errors never appear in the final code.
662 Communications of the ACM June 1988 Volume 31 Number 6
AN ESSAY ON TESTING THEORY AND PRACTICE
My interest in program testing springs from practical and theoretical roots. While I was a student attempting to characterize properties of programs that could be pinned down by testing [7], I was also a systems programmer responsible for maintenance and reliability of two large operating systems [8]. I was introduced to computer science by Halstead’s wonderful book on NELIAC self-compiling compilers [6]. What fascinated me was the prospect of capturing an infinite object in a finite description. (I thought, at first, that a self-compiling compiler did just that: in one program it held the seeds of all programs.) Program testing has exactly the right character. Programs are complex, hard to understand, hard to prove, and consequently often riddled with errors. But might not a small set of tests, conducted with ease, pinpoint problems for repair? And when all problems have been corrected, is the program not perfect? Testing seems far easier than programming, certainly easier than proving a program. The infinite complications of what the program does, what it is supposed to do, etc., are replaced by a finite set of data.
In principle, it could work that way. The defect rate for carefully made programs can be about 5 faults/1,000 lines, so that a 1M-line program might have only 5,000 problems. Even supposing that a unique test point is needed to expose each one, thousands of trials are not impractical. The fallacy may be obvious, but it underlies our intuition about tests more than we like to admit: the points may be few, but we do not know how to find them. Worse, we know no systematic way to search, no way to judge points selected, and no way to decide when to stop. So when 5,000 points have been tried and no errors found, we do not know if there are none, or if 10 remain, or 5,000.
Formal testing research is only about 15 years old, and its results are modest. But even what is known is not finding its way into practice very well. This essay explores testing theory and practice, suggests reasons for the present situation, and predicts future work.
I would describe software engineering as heavy-handed management based on modest common-sense principles. I do not intend to disparage the work on structured programming and development methodology, which has made large-system development a predictable craft instead of an arcane art. But its technical basis is not impressive. Modular, top-down programs were written before 1960 [15], and recognized as a good thing. To enforce such standards across the board is a management breakthrough, but not a technical one.
The Gelperin and Hetzel article would argue that a great deal can be done in testing methodology, and that the proper arena is not the narrow one of programs and their execution, but rather the whole scope of development. But it is the technical problems of software that interest me: what can we do automatically, that necessarily amplifies the intelligence and effectiveness of programmers, that makes their work more rewarding and less routine?
The field of programming languages provides a technical model. Languages and their compilers have theoretical backing and tremendous practical utility. Indeed, high-level languages enabled most of the software-engineering common sense in use today. Unless program testing has a similar technical base, software-engineering methodology for testing is premature. Programs must be tested, however little we understand the process, and managers must try to find out what works, but strict legislation based on wishful thinking is a prescription for disaster.
The technical discipline of testing can be seen as theory versus practice. Practical testing began with the first program, and so is at least twice as old as formal research in the subject. Clever programmers, often without understanding why things seem to work, have devised many testing methods, some with great intuitive appeal. In trying to analyze these methods, testing theoreticians have only been able to show that in the worst case, all fail [10, 23]. Perhaps by accident, theoreticians have come from absolute disciplines like logic, program proving, and algebra, and they have sought absolute results. They wanted to discover how testing is related to correctness. In general the two are unrelated, so the theoretical results are negative. Whatever we would like to do is usually an unsolvable problem. When we isolate tractable special cases to dodge the unsolvability, they turn out to be too narrow and atypical.
SOME TYPICAL RESULTS
I will describe the state of the program-testing discipline through an example. The method and results are not the most recently proposed, but I think they are typical of the field. Branch testing is a method that arose in practice, as a refinement of trying to execute every statement of a program. Every conditional expression in a program is singled out. A collection of test points achieves branch coverage just in case each conditional is forced to take both its true and false branches by the
expected to do some thinking: we are not calling for more of the same in the same place.
Suppose a programmer attempts to attain branch coverage, and all test points produce correct results, but some branch remains unexecuted. In the process of finding data to cover it, a fault is found. This can happen because:
1. The fault is seen while studying the program to find additional test points.
2. Additional test points are tried that fail to cover the branch, but happen to produce incorrect results.
3. Additional test points do cover the branch, and also produce incorrect results.
The experimental conclusion that “branch testing found the fault” is questionable. Perhaps in case 1 the study was in a different part of the program from the untried branch, and the problem had nothing to do with that branch. In case 2 the branch seems irrelevant. Case 3 seems to support the conclusion, but suppose that the incorrect results are traced to logic that should have been inserted before the branch? In such circumstances it is only accidental that the error was found while doing branch testing. The person found the fault, and maybe the “method” did no more than keep the person interested in more tests and more study.
If branch testing can be credited with finding the fault, so could any method that rubs the programmer’s nose in the code. (For example, upon being presented with any test results at all, a manager could say, “Not enough! Try to exercise routine (fill in the blank) more!”) Although no study has been attempted to demonstrate the distortions of nose-rubbing, evidence for it exists. In [1] it was found that a coverage technique detected about 40 percent of the faults of omission in code (and another 20 percent could have been seen but were simply not noticed). It is ludicrous to say that covering something that is not there discovers it 60 percent of the time.
Programming skill (and presumably testing skill is related) is one of the most widely varying aspects of human behavior. Therefore, testing experiments suffer maximum distortion from the nose-rubbing effect. Comparisons between methods are doubly uncertain, because the skills needed to use the methods may be different. This points out another defect in the subsumes ordering: an experiment can show that method A is inferior to method B, even though A subsumes B. The technically more powerful method could be harder to use, and so people choose less useful data to satisfy it. The fault lies both with the subsumes ordering and with such experiments.
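The difficulty with faults of omission can be made concrete with a small sketch. The Python below is a hypothetical illustration, not taken from any of the cited studies: the function `abs_ratio`, its intended specification, and the `probe` helper are all assumptions invented for the example. Two test points force the single conditional both ways and produce correct results, so branch coverage is complete, yet the missing zero check is never exercised because there is no branch for it.

```python
# Hypothetical illustration: all names here are invented for this sketch.
covered = set()  # (branch_id, outcome) pairs observed while running tests


def probe(branch_id, outcome):
    """Record which way conditional `branch_id` went, then pass the value on."""
    covered.add((branch_id, bool(outcome)))
    return outcome


def abs_ratio(a, b):
    """Assumed spec: return |a| / b, or 0.0 when b is zero.

    The zero check was never written. That is a fault of omission:
    there is no conditional for branch coverage to force.
    """
    if probe("a_negative", a < 0):
        a = -a
    return a / b


def branch_coverage(branch_ids):
    """Fraction of (conditional, outcome) pairs that the tests exercised."""
    needed = {(b, o) for b in branch_ids for o in (True, False)}
    return len(needed & covered) / len(needed)


# Two test points: one takes the true branch, one takes the false branch.
results = [abs_ratio(-4, 2), abs_ratio(6, 3)]
coverage = branch_coverage(["a_negative"])
```

Both results are correct and `coverage` is 1.0, yet a call such as `abs_ratio(1, 0)` still raises `ZeroDivisionError`. Any tester who finds that input does so by thinking about the specification, not by satisfying the coverage criterion.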
Management based on this shifting ground can succeed, but relies heavily on the goodwill of technical staff. Ordered to conduct (say) branch testing, which is unreliable in itself, the staff can fulfill the letter of the law without real effect. However, branch testing may be as good as any other nose-rubbing technique if used creatively with intent to discover problems. It is almost as if a hardline manager relied on a time clock to check employees’ time sheets, but the clock does not keep time, so the people have to set it before punching in and out. Programmers have a real interest in making their code work, and there is an immense store of goodwill on which management can draw, but it is ultimately dangerous to put trust in unsound technical procedures. Under pressure of tight schedules, or because of personality conflicts, cheating might become desirable.
Had the dominant theoretical view of testing been probabilistic, the field might look quite different. There is nothing very clever in random sampling of program inputs, and the result cannot be expected to magically show correctness, but statistical measures seem just the right way to evaluate the dubious experiments that tests are. Intuitively, one of the reasons we trust programs is because they have a long history of working properly, and statistical quantification of this idea should be investigated.
Most of the statistical work is based on the failure-rate model [20], in which it is assumed that a program has a probability R of failing for a randomly selected input, and thus that as more and more tests are conducted, the observed number of failures will approach fraction R of the trials. It must further be assumed that random selection during the tests follows the same distribution as that for the program’s long term operation. On this basis it is possible to model the testing process, for example to compute the number of successful test points needed to guarantee that R is below a given value, with a given confidence [5]. Or, the mean time to failure for a continuously operating system can be predicted from measurements of the time sequence of its crashes [20]. Or, the total time required to debug a program can be predicted from its initial history of problems and fixes
The failure-rate model is not widely accepted. Its assumptions do not seem appropriate for programs, and the mathematics is tractable only for the simplest cases. Some of the results are intuitively wrong. For example, the number of tests required to force high confidence in a low failure rate does not depend on the size of the program or on the size of the input domain. Critics explain successful empirical studies (particularly for models of the debugging process) as curve fitting with an adequate supply of parameters. However, even if failure-rate models are wrong, probabilistic analysis seems appropriate for testing theory because it is capable of comparing methods and assessing confidence in successful tests [10].
From the technical side, what testing needs most is a sound, fundamental theory, one capable of analyzing and comparing particular methods. The “absolute” approach has shown that in general nothing but exhaustive testing can be relied upon, and that special cases do not escape this result except in unreal situations. One promising direction seems to remain, so-called “error-based” testing [16]. Its roots are in mutation theory and hardware testing for specific defects. The tester begins with a precise list of code faults that are to be precluded, and devises tests that can succeed only if those faults are not present. Running the tests then shows not that the software is correct, but that it at least does not contain the faults originally listed. Error-based testing theory requires elements of program proving, in the analysis that demonstrates test points would expose the given faults, and it is a difficult field in which work is only beginning. There are two candidates for a general theory of testing:
Technology transfer from testing research to industrial software development has been slow, and the methods and tools that have been devised have not been widely accepted. This seems strange, because testing tools are often easier to build than those for the now ubiquitous high-level programming languages. Perhaps the reason that testing technology goes unused is that languages must be acquired first and can stand alone, while test systems rely on language details and the operating environment, and have low budget priority. Standardization of operating systems and languages, which has accelerated in both the PC and microprocessor worlds, may make a difference in the near future. The article by Ostrand and Balcer in this section is a good example of what should be happening far more frequently. Their testing methodology is adapted to a special-case problem, and supported by an easily implemented tool.
Decrying the lack of testing foundations while advocating the spread of the technology based on that theory may sound like talking out of both sides of the mouth. But well engineered test tools are worthwhile, even when they use something like branch testing that cannot be relied on to find errors. Nose-rubbing works: programmers who systematically look for bugs find and fix them. A tool that checks branch coverage costs very little to build and use. Other tools (such as symbolic execution systems [2, 3]) require skill to use, but are powerful in talented hands. Debugging tools are the best, because they give nose-rubbing full play, and an error found is a gain that in no way relies on a missing theory. Such tools are best placed in the hands of programmers who have the greatest knowledge of and interest in their code. Confidence measures based on testing are not as good, because they are not credible without a theory to back them up, and because they are used by evaluation or management people who do not know the code.
Another reason for advocating the routine use of imperfect test methods is the need for empirical information gathering. The raw material of sound theory is experimentation. Using a poor debugger shows what it lacks; using a false model for reliability can expose its flaws and suggest changes. The infrastructure needed to put testing methods in place is slowly built, and built relatively independent of any flaws in those methods. The danger, of course, is that a premature attempt to use the technology will give it a bad name without collecting information for improvement.
New languages and new styles of programming require new intuitions about testing. The whole of so-called structural test methods (e.g., branch testing) is based on the imperative, state-dependent execution model of von Neumann machine languages and unit testing in conventional high-level languages like FORTRAN and C that make efficient use of such machines. Functional and logic-programming languages do not share these properties. Conventional parallel programs have the usual properties, but repeated in ensembles that add a new dimension of complexity. The methods and intuitions built up over 30 years of conventional program testing do not apply in these situations, which are growing in importance. The opportunity exists to do better testing theory for these new programs than we have done in the past.