





Material Type: Paper; Class: Dependable Software Systems; Subject: Computer Science; University: Drexel University; Term: Unknown 1999;
Typology: Papers






Department of Computer and Information Science, The Ohio State University, Columbus OH, USA
Abstract
Regression testing, which attempts to validate modified software and ensure that no new errors are introduced into previously tested code, is used extensively during maintenance of evolving software. Despite efforts to reduce its cost, regression testing remains one of the most expensive activities performed during a software system's lifetime. Because regression testing is important and expensive, many researchers have focused on ways to make it more efficient and effective. Research on regression testing spans a wide variety of topics, including test environments and automation, capture-playback mechanisms, regression-test selection, coverage identification, test suite maintenance, regression testability, and regression-testing process. This paper discusses the state of the art in several important aspects of regression testing, and presents some promising areas for future research. © 1999 Elsevier Science Inc. All rights reserved.
Software maintenance can account for as much as two-thirds of the cost of software production (Beizer, 1990; Leung and White, 1989). This expense occurs in part because, in today's market, software development and testing are dominated by rework of existing software, not design of new software (Beizer, 1990). Regression testing, which attempts to validate modified software and ensure that no new errors are introduced into previously tested code, is used extensively during this evolution process. Regression testing is used to test safety-critical software that must be retested often, to test software that is being developed under constant evolution as the market or technology changes, to test new or modified components of a system, and to test new members in a family of similar products. Despite efforts to reduce its cost, regression testing remains one of the most expensive activities performed during a software system's lifetime.

Because regression testing is expensive, but important, researchers have focused on ways to make it more efficient and effective. Research on regression testing spans a wide variety of topics. Test environments and automation (e.g., Hoffman and Brealey, 1989), and capture-playback mechanisms (e.g., Lewis et al., 1989) provide support for regression testing. Techniques for regression-test selection (e.g., Ball, 1998; Chen et al., 1994; Harrold and Soffa, 1988; Rothermel and Harrold, 1997), coverage identification (e.g., Harrold and Soffa, 1988; Ostrand and Weyuker, 1988; Rothermel and Harrold, 1994), and test suite maintenance (e.g., Harrold et al., 1993; Rothermel et al., 1998; Wong et al., 1995, 1997a) facilitate selective testing of the modified software. Regression testability permits estimation, prior to regression test selection, of the number of tests that will be selected by a method (e.g., Harrold et al., 1998; Leung and White, 1989; Rosenblum and Weyuker, 1997), or evaluation, prior to implementation, of the difficulty of regression testing (e.g., Harrold, 1998; Stafford et al., 1997). A regression-testing process (e.g., Onoma et al., 1998) can integrate many of these key techniques into development and maintenance of the evolving software.

Some of these techniques are already being used in practice. For example, many companies have used capture-playback techniques to automate part of their regression-testing process. Most of this technology, however, is not being used in practice, in part because the scalability and the usefulness of the techniques have not been convincingly demonstrated. The increased awareness by researchers of the importance of empirical evaluation and the support for such empirical work by agencies, such as the National Science Foundation,(^3) have yielded initial empirical data on the cost benefits of these techniques. In addition to showing the potential scalability and usefulness of some of the techniques, these data also highlight the need for additional evaluation of existing techniques and for development of modified and new techniques.

The increased interest in techniques that help to reduce the costs of regression testing, the maturity of the research in regression testing, and the commitment to empirical evaluation of regression-testing techniques should hasten the transfer of regression-testing technology to industry. This paper discusses the state of the art, existing empirical results, and promising future work in selective retest, one aspect of regression testing for which I believe industrial-strength tools will soon be available.

The Journal of Systems and Software 47 (1999) 173– www.elsevier.com/locate/jss
(^1) This work was supported in part by grants from Microsoft Inc. and Boeing Airplane Group, and by NSF under NYI Award CCR- and ESS Award CCR-9707792 to Ohio State University.
(^2) E-mail: [email protected]
0164-1212/99/$ – see front matter © 1999 Elsevier Science Inc. All rights reserved. PII: S0164-1212(99)00037-0
During regression testing, a test suite T, used to test a program P,(^4) and information about the results of testing P with T are available. Selective-retest techniques attempt to reduce the cost of regression testing by reusing T and identifying portions of the modified program P′ or its specification that should be tested. For example, a selective-retest technique may select all tests from T that execute code that is new or modified from P to P′. Selective-retest techniques differ from a retest-all approach, which runs all tests in T or reanalyzes and retests all of P′. Leung and White (1991) show that a selective-retest technique is more economical than a retest-all technique only if the cost of selecting a reduced subset T′ of tests from T is less than the cost of running the tests in T − T′. Selective retest can be integrated into a maintenance process that consists of the following steps:(^5)
Although each of the problems discussed in the preceding section is significant for maintenance, this paper focuses on three important selective-retest problems: the regression-test-selection problem, the coverage-identification problem, and the test-suite-minimization problem.
3.1. Regression test selection
Regression-test-selection techniques attempt to reduce the cost of regression testing by selecting T′ and using T′ to test P′. Testing professionals are reluctant, however, to omit any tests from T that might cause P′ to expose faults. Safe regression-test-selection techniques ensure that the test suite selected, T′, contains all tests in
(^3) Evidence of the increased awareness of the importance of experimental work is seen in the support of the National Science Foundation for significant grants for experimental research and for the Workshop on Empirical Research in Software Engineering, which was held during Summer 1998.
(^4) P could be either a procedure or a program.
(^5) Rothermel and Harrold (1997) and Onoma et al. (1998) present a discussion of the steps involved in selective retest. In this paper, we integrate these steps into one maintenance process.
(^6) Test t is obsolete for P′ if t specifies an input to P′ that, according to P′'s specification, is invalid for P′, or t specifies an invalid input–output relation for P′.
lesser computation cost. For example, on encountering new statement J, Pythia identifies the statement that precedes J, finds E, and selects t1 and t2, those tests that executed E in P. For the same change, DejaVu and Ball's algorithm select only t2.

Chen et al. (1994) present a regression-test-selection algorithm that detects modified code entities, which are defined as functions or as non-executable components, such as storage locations. They implemented the technique as a tool, called TestTube, that performs regression test selection for C programs. The technique selects all tests associated with changed entities. Because this technique is based on entities that are coarser-grained than those used by statement- or control-flow-based techniques, it may select more tests than those techniques, with lesser computation cost. For example, for all changes from P to P′ in Fig. 1, TestTube selects t1, ..., t4, all tests in T.

Table 1 shows the results of test selection using all four approaches, and illustrates the relative selectivity of the approaches. Because TestTube selects at the function level, it selects all four tests for any change. For the modification from C to C′, all approaches except TestTube are able to identify only t4 for inclusion in T′; when there are no control-flow changes, Pythia, DejaVu, and Ball's algorithm select the same tests. For the addition of J, both DejaVu and Ball's algorithm are able to identify only t2 for inclusion in T′; when there are control-flow changes, Pythia can produce less precise results than DejaVu and Ball's algorithm. For the duplication of H as H1 and H2, only Ball's algorithm is able to select t1 and t2; in the case of multiply-visited nodes, DejaVu can produce less precise results than Ball's algorithm.
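The effect of entity granularity on selectivity can be illustrated with a minimal Python sketch. It is not TestTube's, Pythia's, or DejaVu's actual algorithm; it only shows coverage-based selection at two granularities. The function name `select_by_changed_entities` and the coverage data are hypothetical and are not taken from Fig. 1.

```python
from typing import Dict, Set


def select_by_changed_entities(coverage: Dict[str, Set[str]],
                               changed: Set[str]) -> Set[str]:
    """Return every test whose covered entities include a changed entity."""
    return {test for test, entities in coverage.items() if entities & changed}


# Hypothetical coverage data, invented for illustration only.
# Coarse granularity: every test enters the single function "main".
function_level = {"t1": {"main"}, "t2": {"main"},
                  "t3": {"main"}, "t4": {"main"}}

# Fine granularity: each test covers only some statements of that function.
statement_level = {"t1": {"A", "B", "E"}, "t2": {"A", "E", "H"},
                   "t3": {"A", "D"}, "t4": {"A", "C"}}

# A change to statement C marks the whole function as changed at the
# coarse level, but only statement C at the fine level.
coarse = select_by_changed_entities(function_level, {"main"})
fine = select_by_changed_entities(statement_level, {"C"})
```

Under these assumed data, the coarse selection contains all four tests, while the fine selection contains only t4, which mirrors the first row of Table 1. The finer selection requires statement-level coverage information, which is costlier to collect and analyze.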
One subset of unsafe regression-test-selection techniques selects tests for inclusion in T′ using associations between tests in T and test-coverage requirements based on the data flow in the program (e.g., Bates and Horwitz, 1993; Harrold and Soffa, 1988; Ostrand and Weyuker, 1988).(^8) For example, in previous work (Harrold and Soffa, 1988) we present a regression-test-selection technique that first associates tests in T with definition-use pairs(^9) in P, and then selects those tests from T that execute definition-use pairs that are associated with code that is modified or deleted from P to P′. To illustrate, consider P and G in Fig. 1, and suppose that variable x is assigned a value in statement B (a def of x) and that this value of x is used in a computation in statement F (a use of x). Then (B, F) forms a test-coverage requirement based on the data flow in P, and t1, which executes this pair, is associated with that requirement. If test selection determines that this definition-use pair is affected by a change in P, then t1 is added to T′.

New code induces definition-use pairs in P′ that are not present in P. Because these definition-use pairs do not exist for P, there are no tests in T associated with these new pairs, and thus, no tests in T are selected for inclusion in T′. For example, suppose that there is a use of x in the newly inserted statement J, and that (B, J) forms a new definition-use pair for x. Because this definition-use pair is new and there are no tests yet associated with it, the test-selection algorithm selects no tests in T to include in T′. Although t2 will execute this definition-use pair in P′, it was not selected. Thus, this technique is unsafe.

To date, there have been a number of empirical studies that evaluate these regression-test-selection techniques. Using DejaVu, we investigated the costs and benefits of using our regression-test-selection algorithm (Rothermel and Harrold, 1996, 1998). We used DejaVu to select tests for a variety of 100–500 line programs, for which savings averaged 45%, and for a larger (50, line) software system, for which savings averaged 95%. Our studies show that the cost effectiveness of test selection can vary widely based on a number of factors: the cost of analysis required to select the tests for T′, the cost of executing and validating T′ on P′, the composition of T, and the nature of the modifications to P for P′.
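Returning to the definition-use example earlier in this subsection, the following Python sketch shows why selection driven by stored test-to-pair associations is unsafe. It is a simplified illustration under assumed data, not the algorithm of Harrold and Soffa (1988); the names `select_by_def_use`, `old_associations`, and the tuple encoding of a pair are hypothetical.

```python
from typing import Dict, Set, Tuple

# A definition-use pair: (variable, defining statement, using statement).
DefUse = Tuple[str, str, str]


def select_by_def_use(associations: Dict[DefUse, Set[str]],
                      affected: Set[DefUse]) -> Set[str]:
    """Select the tests recorded against each affected definition-use pair.

    Only associations computed for the old program P are available, so a
    pair that first appears in P' has no recorded tests.
    """
    selected: Set[str] = set()
    for pair in affected:
        selected |= associations.get(pair, set())
    return selected


# Assumed association for P: t1 executes the pair (B, F) for variable x.
old_associations: Dict[DefUse, Set[str]] = {("x", "B", "F"): {"t1"}}

# Case 1: an existing pair is affected by a change, so t1 is selected.
existing_pair_selection = select_by_def_use(old_associations, {("x", "B", "F")})

# Case 2: inserting statement J creates the new pair (B, J). No test in T
# is associated with it, so nothing is selected, even though t2 would
# execute this pair in P'.
new_pair_selection = select_by_def_use(old_associations, {("x", "B", "J")})
```

The empty result in the second case is the source of the unsafety: the omitted test t2 exercises the new code but is never rerun.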
Our studies also show that, for the subjects used, there were no multiply-visited vertices. Thus, for these subjects, our algorithm is edge-optimal.

Rosenblum and Weyuker (1997) used TestTube to select tests for 31 versions of the KornShell and its associated test suites. For 80% of the versions, TestTube selected 100% of the tests. The authors note, however, that the test suite they used contained only 16 tests, many of which caused all components of the system to be exercised.

Rosenblum and Rothermel (1997) present the first comparative evaluation of two different regression-test-selection techniques, DejaVu and TestTube, on the same
Table 1
Results of test selection for four safe regression-test-selection algorithms

Change from P to P′ | Tests selected by TestTube | Tests selected by Pythia | Tests selected by DejaVu | Tests selected by Ball's algorithm
C is changed to C′ | t1, ..., t4 | t4 | t4 | t4
J is added | t1, ..., t4 | t1, t2 | t2 | t2
H is duplicated as H1 and H2 | t1, ..., t4 | t1, ..., t4 | t1, ..., t4 | t1, t2
(^8) See Rothermel and Harrold (1996) for a thorough discussion of these techniques.
(^9) A definition-use pair is a pair of statements (S1, S2), such that S1 defines some variable v, S2 uses v, and there is a path in the program from S1 to S2 along which v is not redefined.
set of subjects. This study compared the relative precision of the two techniques; work is underway to compare the relative efficiency of the two techniques. Their study suggests that, in some cases, the coarse-grained TestTube and the more fine-grained DejaVu produce similar reductions in the tests that can be selected. Their study also found, however, that DejaVu sometimes selects a test suite that is substantially smaller than the original test suite.

Vokolos and Frankl (1998) performed an experiment to evaluate their regression-test-selection algorithm. In their experiment, they used 33 different versions of a C program of approximately 11,000 lines that had been used by the European Space Agency. The versions represented the correction of 33 different faults. They randomly created a test suite for use in the experiment. They performed the regression test selection using Pythia, and found that, for their subject program and versions, Pythia substantially reduced the size of the regression test suite. For example, in almost 50% of the versions, an 80% reduction was achieved. They did not directly compare Pythia with DejaVu and they did not compare the precision of their results with those obtained using DejaVu.

Graves et al. (1998) performed an experiment that compared a number of regression-test-selection techniques: minimization techniques (these attempt to select a minimal set of tests from T), safe techniques, data-flow-coverage-based techniques, random techniques, and retest all. They drew a number of observations from their experiment. Firstly, minimization produced the smallest and least effective test suites. However, in cases where testing is very expensive, minimization may be cost-effective. Secondly, safe and data-flow methods had nearly equivalent average behavior in terms of cost effectiveness, typically detecting the same faults, and selecting the same size test suite.
Data-flow methods are more expensive if test selection is the only goal. However, they provide additional information that can be used for coverage identification, which may justify the additional cost in some cases. Thirdly, they found that the test selection methods were not the only factors affecting their results; other factors include the programs, the nature of the modifications, and the composition of test suites.

In some cases, the regression-test-selection tools select a T′ that contains almost all tests in T; in these cases, the test selection may not be cost effective. Rosenblum and Weyuker (1997) proposed coverage-based predictors for use in predicting the cost-effectiveness of selective regression-testing strategies. With such a predictor, a testing professional could quickly estimate the number of tests that would be selected by a test-selection algorithm. If the time to run the tests omitted is more than the estimated cost of analysis to select the tests, the tester can then run the test-selection algorithm to select the tests; if not, the tester can simply run all tests in T.
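The decision rule just described can be sketched in a few lines of Python. This is a minimal illustration of the comparison between analysis cost and the cost of the omitted tests, not Rosenblum and Weyuker's predictor itself. The function name, the uniform per-test cost, and the time values are assumptions; only the 16-test suite size and the 87.3% predicted selection rate come from the KornShell study reported in this section.

```python
def selective_retest_pays_off(analysis_cost: float,
                              cost_per_test: float,
                              total_tests: int,
                              predicted_fraction_selected: float) -> bool:
    """Return True when the analysis costs less than running the omitted tests.

    Assumes every test costs the same to run and validate, which is a
    simplification made only for this sketch.
    """
    if not 0.0 <= predicted_fraction_selected <= 1.0:
        raise ValueError("predicted_fraction_selected must be in [0, 1]")
    omitted_tests = total_tests * (1.0 - predicted_fraction_selected)
    return analysis_cost < omitted_tests * cost_per_test


# 16 tests with 87.3% predicted to be selected; the 300-second analysis
# cost and 60-second test cost are hypothetical.
small_suite = selective_retest_pays_off(300.0, 60.0, 16, 0.873)

# A hypothetical large suite where only 5% of the tests are predicted
# to be selected.
large_suite = selective_retest_pays_off(300.0, 2.0, 1000, 0.05)
```

With these assumed costs, selection does not pay off for the small suite (about two omitted tests cannot repay the analysis), whereas it does for the large suite, where 950 tests are omitted.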
One of their predictors is used to predict whether a safe selective-regression-testing strategy will be cost-effective. Using the regression testing cost model of Leung and White (1991), Rosenblum and Weyuker (1997) demonstrate the usefulness of this predictor by describing the results of a case study they performed involving 31 versions of the KornShell. In that study, the predictor reported that, on average, it was expected that 87.3% of the tests would be selected. Using the TestTube approach, 88.1% were actually selected on average over the 31 versions. The authors explain, however, that because of the way their selective regression testing model employs averages, the accuracy of their predictor might vary significantly in practice from version to version.

In later work, we present additional empirical studies to evaluate the effectiveness and accuracy of Rosenblum and Weyuker's model (Harrold et al., 1998). Our results suggest that the distribution of modifications made to a program can play a significant role in determining the accuracy of a predictive model of test selection, and that a useful prediction model must account for both code coverage and modification distribution.

The studies of regression-test-selection techniques and test-suite-size predictors suggest that there are a number of factors that affect the cost and precision of test selection: the precision of the test-selection algorithm, the composition and size of T, and the nature of the modifications from P to P′. Based on these findings, there are a number of areas for future work, both in research and in experimentation, related to regression test selection.
3.2. Coverage identification
The regression-test-selection techniques described in the preceding sections identify those tests in T to rerun after modifications are made to P. However, after running the selected tests, T′, there may be parts of P′ that are not tested or are not covered according to some testing criterion. Coverage-identification techniques compute test requirements for these untested or uncovered parts of P′.

Several techniques (Bates and Horwitz, 1993; Harrold and Soffa, 1988; Ostrand and Weyuker, 1988) store the data-flow testing requirements (i.e., those definition-use pairs that were required for the original program), and incrementally update these requirements using the program modifications.(^10) Instead of storing the testing requirements for the original program, we presented a technique that uses a demand approach to transitively compute data- and control-dependences for the modifications (Gupta et al., 1996). We later presented an
(^10) See Rothermel and Harrold (1996) for a thorough discussion of these techniques.
sional testers select the appropriate algorithm to use? The studies by Graves et al. compared the relative effectiveness of control-flow and data-flow-based approaches. But how do these techniques compare in efficiency? Empirical studies that consider software systems of varying sizes and types, evolving test suites for those systems, and real change histories can provide information that will help researchers identify the tradeoffs of using these techniques and develop guidelines that professional testers can use when making regression-testing decisions.

Generalized algorithms. The regression-test-selection techniques described in preceding sections use a source-code representation of the software. Can these techniques be generalized so that they apply to other formal representations of the software, such as its requirements or architecture? Our technique (Rothermel and Harrold,
results. Preliminary evaluation shows that these techniques also vary in the precision of the tests selected and the analysis time to select the tests. Could a less precise technique be used to get an approximate solution, and then, with input from the tester, be refined until the desired level of precision is achieved? Such an approach would let the tester tailor the precision of T′ to the application or focus on critical parts of the modified software. The study by Graves et al. (1998) showed that, for the subjects they studied, the tests selected by DejaVu and by using a data-flow-coverage approach differed very little in the faults detected or the precision of the tests selected; they did not measure the time to perform the analyses to select the tests. Could these two techniques be integrated so that the test selection would consider both control flow and data flow? Such an approach could combine the best features of each technique.

Test-suite minimization and prioritization. In some cases, after T′ is selected using a test-selection algorithm, there may not be sufficient time to run all tests in T′. In this case, we may want to minimize the tests in T′ according to some criteria. For example, suppose that we can approximate the time required to run the tests in T′ but the time allocated for testing is not sufficient to run all tests in T′. Can we select a subset of T′ for use in testing P′ that could be run in the time allocated and would be effective for testing P′? Although not safe, such a test suite should be more effective than a randomly selected test suite. Another approach to selecting a subset of T′ for use in testing P′ is to prioritize the tests in T′ by some criteria. For example, can we prioritize tests in T′ by the coverage that they provide? Wong et al. (1997b) discussed this approach. For another example, can we prioritize tests in T′ by an estimation of their fault-detection ability?
Under the first approach, we should achieve coverage of P′ quickly, whereas under the second approach, we should expose faults early in the testing of P′.

Regression testability. Regression testability refers to the property of a program, modification, or test suite that lets it be effectively and efficiently regression tested. Leung and White (1989) classify a program as regression testable if most single-statement modifications to P entail rerunning a small proportion of T. Extending Leung and White's work, Rosenblum and Weyuker (1997) consider P and T, and present a formal model of the cost-effectiveness of regression-test-selection techniques. Leung and White (1989) also discuss the regression testability of a software system: a system is regression testable if most single-statement modifications will entail rerunning a small proportion of the current test suite. Under this definition, regression testability is a function of both the design of the program and the test suite. Rosenblum and Weyuker (1996) presented a model for predicting the cost-effectiveness of regression-test-selection techniques in terms of the coverage provided by a particular test
suite. Under their approach, both the program and the test suite are used for prediction. Using these notions of regression testability, can we design regression testable test suites? Additionally, can we identify this testability using various representations of the software, such as its architecture? These techniques can help design software and test suites on which efficient regression testing can be performed, and the ability to consider regression testability early in the development process has the potential to provide significant savings in the cost of development and maintenance of the software.
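As one possible reading of the coverage-based prioritization question raised above under test-suite minimization and prioritization, the following Python sketch orders tests by the additional coverage each contributes and then fits the ordering to a time budget. The greedy heuristic is a common choice substituted here for illustration; it is not claimed to be the method of Wong et al. (1997b). All names, coverage sets, and run times are hypothetical.

```python
from typing import Dict, List, Set


def prioritize_by_additional_coverage(coverage: Dict[str, Set[str]]) -> List[str]:
    """Greedily order tests by how many not-yet-covered entities each adds.

    Ties are broken by test name so that the ordering is deterministic.
    """
    remaining = dict(coverage)
    covered: Set[str] = set()
    order: List[str] = []
    while remaining:
        best = max(sorted(remaining), key=lambda t: len(remaining[t] - covered))
        order.append(best)
        covered |= remaining.pop(best)
    return order


def fit_to_budget(order: List[str], runtimes: Dict[str, float],
                  budget: float) -> List[str]:
    """Keep tests in priority order, skipping any that would exceed the budget."""
    chosen: List[str] = []
    used = 0.0
    for test in order:
        if used + runtimes[test] <= budget:
            chosen.append(test)
            used += runtimes[test]
    return chosen


# Hypothetical coverage of P' by the tests in T', and assumed run times.
selected_coverage = {"t1": {"A", "B"}, "t2": {"A", "B", "C", "D"},
                     "t3": {"E"}, "t4": {"C"}}
runtimes = {"t1": 5.0, "t2": 10.0, "t3": 4.0, "t4": 1.0}

priority_order = prioritize_by_additional_coverage(selected_coverage)
within_budget = fit_to_budget(priority_order, runtimes, budget=15.0)
```

With these assumed data, t2 and t3 come first because together they cover every entity, and the budgeted subset drops t1, which adds no new coverage and does not fit in the remaining time. As the text notes, such a subset is not safe.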
References
Ball, T., 1998. The limit of control-flow analysis for regression testing. In: International Symposium on Software Testing and Analysis, pp. 143–242.
Bates, S., Horwitz, S., 1993. Incremental program testing using program dependence graphs. In: Proceedings of the 20th ACM Symposium on Principles of Programming Languages, pp. 384–
Beizer, B., 1990. Software Testing Techniques. Van Nostrand Reinhold, New York, NY.
Chen, Y.F., Rosenblum, D.S., Vo, K.P., 1994. TestTube: A system for selective regression testing. In: Proceedings of the 16th International Conference on Software Engineering, pp. 211–222.
Garey, M.R., Johnson, D.S., 1979. Computers and Intractability. W.H. Freeman, New York.
Graves, T.L., Harrold, M.J., Kim, J-M, Porter, A., Rothermel, G.,
Leung, H.K.N., White, L.J., 1991. A cost model to compare regression test strategies. In: Proceedings of the Conference on Software Maintenance '91, pp. 201–208.
Lewis, R., Beck, D.W., Hartmann, J., 1989. Assay – a tool to support regression testing. In: ESEC '89. Second European Software Engineering Conference Proceedings, Springer, Berlin, pp. 487–
Onoma, A.K., Tsai, W.-T., Poonawala, M.H., Suganuma, H., 1998. Regression testing in an industrial environment. Communications of the ACM 41 (5).
Ostrand, T.J., Weyuker, E.J., 1988. Using dataflow analysis for regression testing. In: Sixth Annual Pacific Northwest Software Quality Conference, pp. 233–247.
Rapps, S., Weyuker, E.J., 1985. Selecting software test data using data flow information. IEEE Transactions on Software Engineering 11 (4), 367–375.
Richardson, D.J., Stafford, J., 1996. What makes one software architecture more testable than another? In: Proceedings of the International Software Architecture Symposium.
Rosenblum, D., Rothermel, G., 1997. An empirical comparison of regression test selection techniques. In: Proceedings of the International Workshop for Empirical Studies of Software Maintenance.
Rosenblum, D., Weyuker, E.J., 1996. Predicting the cost-effectiveness of regression testing strategies. In: ACM SIGSOFT '96 Fourth Symposium on the Foundations of Software Engineering.
Rosenblum, D.S., Weyuker, E.J., 1997. Using coverage information to predict the cost-effectiveness of regression testing strategies. IEEE Transactions on Software Engineering 23 (3), 146–156.
Rothermel, G., 1996. Efficient, Effective Regression Testing Using Safe Test Selection Techniques. Ph.D. dissertation, Clemson University.
Rothermel, G., Harrold, M.J., 1994. Selecting tests and identifying test coverage requirements for modified software. In: Proceedings of the 1994 International Symposium on Software Testing and Analysis, pp. 169–184.
Rothermel, G., Harrold, M.J., 1996. Analyzing regression test selection techniques. IEEE Transactions on Software Engineering 22 (8).
Rothermel, G., Harrold, M.J., 1997. A safe, efficient regression test selection technique. In: ACM Transactions on Software Engineering and Methodology, pp. 173–210.
Rothermel, G., Harrold, M.J., 1998. Empirical studies of a safe regression test selection technique. IEEE Transactions on Software Engineering 24 (6).
Rothermel, G., Harrold, M.J., Ostrin, J., Hong, C., 1998. An empirical study of the effects of minimization on the fault detection capabilities of test suites. In: Proceedings of the International Conference on Software Maintenance.
Stafford, J., Richardson, D.J., Wolf, A.L., 1997. Chaining: A dependence analysis technique for software architecture.
Vokolos, F., Frankl, P., 1997. Pythia: A regression test selection tool based on text differencing. In: International Conference on Reliability, Quality, and Safety of Software Intensive Systems.
Vokolos, F., Frankl, P., 1998. Empirical evaluation of the textual differencing of regression testing techniques. In: International Conference on Software Maintenance.
Wong, W., Horgan, R., London, S., Mathur, A., 1995. Effect of test set minimization on fault detection effectiveness. In: 17th International Conference on Software Engineering, pp. 41–50.
Wong, W., Horgan, R., London, S., Mathur, A., 1997a. A study of effective regression testing in practice. In: Eighth International Symposium on Software Reliability Engineering, pp. 264–275.
Wong, W.E., Horgan, J.R., Mathur, A.P., Pasquini, A., 1997b. Test Set Size Minimization and Fault Detection Effectiveness: A Case Study in a Space Application. Technical Report SERC-TR-173-P, Software Engineering Research Council.