Automatic Identification of Class Refactorings using Vector Space Cosine Similarity, Lecture notes of Engineering

This document proposes an approach to identify class refactorings between two subsequent software releases using vector space cosine similarity. The method identifies cases of class replacement, split, merge, and factoring in/out of features. The approach relies on techniques inspired by IR vector space models and requires the extraction of identifiers from class source code. The document also discusses the scalability of the approach and provides examples of class refactorings in the dnsjava system.


An Automatic Approach to identify Class Evolution Discontinuities

Giuliano Antoniol∗, Massimiliano Di Penta∗ and Ettore Merlo∗∗


∗ RCOST - Research Centre on Software Technology

University of Sannio, Department of Engineering

Palazzo ex Poste, Via Traiano 82100 Benevento, Italy

∗∗ École Polytechnique de Montréal

Montréal, Canada

Abstract

When a software system evolves, features are added, removed and changed. Moreover, refactoring activities are periodically performed to improve the software's internal structure. A class may be replaced by another, two classes can be merged, or a class may be split into two others. As a consequence, it may not be possible to trace software features between one release and another. When studying software evolution, we should be able to trace a class lifetime even when the class disappears because it is replaced by a similar one, split or merged. Such a capability is also essential to perform impact analysis.

This paper proposes an automatic approach, inspired by vector space information retrieval, to identify class evolution discontinuities and, therefore, cases of possible refactoring. The approach has been applied to identify refactorings performed over 40 releases of a Java open source domain name server. Almost all the refactorings found were actually performed in the analyzed system, thus indicating the helpfulness of the approach and of the developed tool.

Keywords: Software Evolution, Releases, Refactoring, Traceability

1. Introduction

Software systems continuously evolve to meet ever-changing user needs. As a system evolves, new functionalities are added and existing ones are removed or modified. In particular, when we look at the evolution of an Object-Oriented (OO) software system, we see that the lifetime of a class is only a limited segment of the whole system evolution. When a class is no longer considered useful, it can be removed. Conversely, new features can imply the creation of new classes. This is, however, only part of the picture. To improve the software's internal structure, maintainability and comprehensibility, refactoring [8, 11] activities are periodically performed. At class level, such refactorings may imply that new classes are obtained by splitting or merging old ones. Moreover, a class may be obtained by factoring out part of another class or, conversely, a class may be merged into another.

Often, for different reasons, those refactorings are not documented. The lack of configuration management and, in general, of a well-defined software development process can cause the loss of traceability between related classes. As a consequence, the maintainability of the software system and, in general, its quality tend to deteriorate. Software evolution activities rapidly become extremely difficult, as any change may produce unpredictable side effects on other portions of the system.

It would be very useful to connect the independent segments representing class evolution during the system lifetime. If, at a given release, a class terminates its life and two other classes, obtained by splitting the first one, appear, then the three segments representing such classes should be connected to indicate such a relationship.

The first, intuitive consequence of this information relates to understanding software evolution: the lifetime of a class should also be studied across events such as renaming, replacement, merge and split. Second, the detection of refactorings helps to locate functionalities over classes, thus giving relevant support to software maintenance and, in particular, impact analysis. Last but not least, the approach can facilitate the reuse of test cases developed for the old class(es).

This paper proposes to adopt techniques inspired by Information Retrieval (IR) approaches to automatically identify and document evolution discontinuities when analyzing the evolution of OO source code at class level. The approach is inspired by a number of studies [3, 5, 16, 17] aimed at recovering a mapping between software artifacts (e.g., free text documentation and code), or between subsequent releases of a software system.

Without losing the generality of the proposed approach, this paper will focus on a limited number of refactoring events, namely class renaming or replacement, class merge and split, and factoring in/out a class into/from another one. Nevertheless, the approach can be extended to other (even finer-grained, for example at method level) types of refactoring.

The primary contributions of this paper are the following:

  • it proposes a novel, automatic approach to locate software evolution discontinuities due to possible refactorings;
  • it explicitly defines, in terms of such an approach, a way to identify merge, split and rename; and
  • it presents results of a study aiming at identifying such discontinuities in the evolution of 40 releases of a Java system.

The remainder of the paper is organized as follows. Section 2 discusses related work and its relationship with the present paper. Section 3 presents, for completeness' sake, a brief overview of IR-based traceability link recovery. Section 4 describes the proposed approach, while Section 5 reports and discusses the results from the case study. Section 6 concludes.

2. Related Work

The problem of identifying refactorings in subsequent releases of a software system has been previously tackled by Demeyer et al. [7]. They proposed heuristics, based on class and method-level metrics, to identify several forms of refactoring. As stated by the authors, their approach requires metric analysis to be complemented by manual browsing. We agree with the authors that, in general, each category of refactoring requires a particular heuristic to be identified and that, in general, manual browsing is necessary. However, our aim is to propose an approach that automatically, without any human intervention, identifies some potential refactorings. Moreover, in this paper we are mainly interested in those refactorings that create discontinuities in a class lifetime.

Zimmermann et al. [23, 22] presented a data mining approach over CVS repositories to identify changes, to detect coupling between fine-grained entities and to predict future or missing changes. Xing and Stroulia [21] presented an approach to understand class evolution in OO software.

More generally, the issue of recovering traceability links between portions of code, or between code and free text documentation, has gained interest in recent years. In particular, a number of papers have been published in the area of impact analysis. For example, Turver and Munro [20] assumed the existence of some form of ripple propagation graph, describing relations between software artifacts.

When available, Concurrent Versions System (CVS) data is used to track features between releases. Fischer et al. [10] proposed to combine CVS revision data with bug reporting data and to add some missing information such as, for example, merge points. The same authors also performed, on the same data, an analysis devoted to tracking features [9]. Finally, Gall et al. [13] analyzed CVS release history data for detecting logical coupling. Ratiu et al. [18] used version histories to detect design flaws.

IR approaches were first adopted by Maarek et al. [15], who introduced an IR method for automatically assembling software libraries based on a free text indexing scheme. Maletic and Marcus [16] presented a system called PROCSSI that uses Latent Semantic Indexing (LSI) to identify semantic similarity between pieces of code. The authors also show that this semantic similarity measure is helpful in the comprehension task.

Antoniol et al. [6] presented a method to establish and maintain traceability links between code and free text documents. The method exploits probabilistic IR techniques to estimate a language model for each document or document section, and applies Bayesian classification to score the sequence of mnemonics extracted from a selected area of code against the language models.

The same method was applied in [2] to recover traceability links between the functional requirements and the Java source code, extending and validating the previous results on a more complex and difficult case study. The investigation was then extended [4, 5] to the vector space model, to compare different model families and assess the relative influence of affecting factors. Maletic and Marcus performed a comparison of the probabilistic and vector space approaches with LSI [17].

Antoniol et al. [3] used string matching and graph algorithms to recover a traceability mapping between different releases of OO systems. In particular, thresholds were used to discard matchings (graph edges) unlikely to represent class evolution. While that work [3] was focused on traceability among classes, the present paper focuses on situations that create discontinuities in a class lifetime and, as will become clearer in Section 4, we rely on sums of vectors to model and identify situations of class split and merge.

3. Background

To identify links between classes obtained from refactoring, we applied techniques inspired by IR vector space approaches. The vector space model treats documents and queries as vectors [14]; documents are ranked against queries by computing some similarity functions between

Figure 1. Possible cases of class refactoring

  1. above a threshold, under which the two classes are to be considered significantly different, i.e., the difference is not simply due to the evolution of the class across two subsequent releases.

Heuristic:

  1. classes A and B must belong to different and subsequent releases, say releases n and n + 1;
  2. class B is not present in release n and class A is not present in release n + 1.
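The check described above can be illustrated with a minimal Python sketch. It assumes that each class is represented by the list of identifiers extracted from its source, uses raw identifier counts instead of tf-idf weights for brevity, and reports, for each class that disappears, the best-matching class that appears, provided the cosine exceeds the threshold. The function names, the input layout and the best-match rule are assumptions made for illustration; they are not the authors' implementation.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine of the angle between two sparse vectors (dict: term -> weight)."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def find_replacements(release_n, release_n1, threshold):
    """Return (old_class, new_class, cosine) candidates for replacement.

    release_n, release_n1: dict mapping class name -> list of identifiers
    (hypothetical input format). Only classes absent from the other release
    are compared, as required by the heuristic.
    """
    removed = sorted(set(release_n) - set(release_n1))  # A: gone in n + 1
    added = sorted(set(release_n1) - set(release_n))    # B: new in n + 1
    results = []
    for old in removed:
        old_vec = Counter(release_n[old])
        scored = [(cosine(old_vec, Counter(release_n1[new])), new) for new in added]
        if not scored:
            continue
        score, new = max(scored)  # assumed rule: keep the best match only
        if score > threshold:
            results.append((old, new, score))
    return results
```

Classes that exist in both releases never enter the comparison, so ordinary evolution of a surviving class is not reported as a replacement.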

4.1.2 Class Extraction

This happens (see Fowler's book [11]) when a class A continues to exist between releases n and n + 1, while part of it is factored out into a class B.

Motivation: a new class B is extracted from A when B, for example, contains a data structure or both methods and attributes that i) are not highly cohesive with the rest of class A and ii) are used by classes other than class A.

Condition: the class B will be identified as a class extracted from A if, as shown in Figure 1-b:

  1. the cosine of the angle between class A_n and the vector resulting from the sum of A_{n+1} and B is above a threshold (the same used in the previous case) under which the “merge” of A_{n+1} and B is significantly different from A_n; and
  2. such a cosine is greater than those obtained for any other candidate extracted class.

It is important to note that the tf-idf weights for the sum vector are computed after the identifiers of the two classes to be summed have been merged into a single list. This avoids giving a low weight to identifiers appearing in both classes.

Heuristic:

  • class A must exist in releases n and n + 1;
  • class B does not exist in release n;
  • the cosine between A_n and A_{n+1} is below the threshold of change that is considered to be due to class evolution.
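A minimal Python sketch of the extraction condition follows. It merges the identifiers of A_{n+1} and B into a single list before weighting, as noted above, and then compares the result with A_n. The exact tf-idf variant is not given in this section, so the weighting used here (term frequency times log(1 + N/df), with the idf computed over only the two documents being compared) is an assumption, as are all function names. The heuristic pre-filter on the cosine between A_n and A_{n+1} is not shown.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine of the angle between two sparse vectors (dict: term -> weight)."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def tfidf(doc, corpus):
    """Weight each identifier of doc by tf * log(1 + N / df) (assumed variant).

    doc must be one of the documents in corpus, so df is never zero.
    """
    n_docs = len(corpus)
    weights = {}
    for term, freq in Counter(doc).items():
        df = sum(1 for d in corpus if term in d)
        weights[term] = freq * math.log(1 + n_docs / df)
    return weights


def extraction_score(a_old, a_new, b):
    """Cosine between A_n and the sum vector of A_{n+1} and B."""
    merged = list(a_new) + list(b)  # merge identifiers BEFORE weighting
    corpus = [list(a_old), merged]
    return cosine(tfidf(corpus[0], corpus), tfidf(merged, corpus))


def best_extracted_class(a_old, a_new, new_classes, threshold):
    """Return (name, score) of the best candidate B above threshold, else None.

    new_classes: dict mapping names of classes absent from release n to
    their identifier lists (hypothetical input format).
    """
    scored = [(extraction_score(a_old, a_new, ids), name)
              for name, ids in new_classes.items()]
    if not scored:
        return None
    score, name = max(scored)
    return (name, score) if score > threshold else None
```

Because the merged list is treated as one document, an identifier that occurs in both A_{n+1} and B contributes to the term frequency of the sum vector instead of being penalized by a higher document frequency.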

4.1.3 Class Split

This case is similar to class extraction. Developers may decide to split a class into two different ones, for example because the original class is not highly cohesive and models two different concepts or entities. Unlike the previous case, class A disappears after release n, while classes B and C appear in release n + 1.

Motivation: new classes B and C are obtained by splitting class A, to achieve higher cohesion (i.e., B and C have different purposes) and low coupling.

Condition: classes B and C will be identified as classes obtained by splitting A if, as shown in Figure 1-d:

  1. the cosine of the angle between class A and the vector resulting from the sum of B and C is above a threshold (the same used in the previous case); and
  2. such a cosine is greater than those obtained for any other combination of classes.

Heuristic:

  • class A must exist in release n and not in release n + 1;
  • classes B and C do not exist in release n; and
  • the cosine between A and B+C is above the threshold.
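The split condition can be sketched in Python as a search over pairs of newly appeared classes. The sketch assumes raw identifier counts as vector components (tf-idf weighting is omitted for brevity) and an exhaustive enumeration of pairs; the function names and input layout are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter
from itertools import combinations


def cosine(u, v):
    """Cosine of the angle between two sparse vectors (dict: term -> weight)."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def find_split(removed_ids, added, threshold):
    """Return (B, C, cosine) for the pair of new classes that best matches A.

    removed_ids: identifiers of class A, present in release n only.
    added: dict mapping names of classes new in release n + 1 to identifiers.
    Returns None when no pair exceeds the threshold.
    """
    a_vec = Counter(removed_ids)
    best = None
    for (name_b, ids_b), (name_c, ids_c) in combinations(sorted(added.items()), 2):
        sum_vec = Counter(ids_b) + Counter(ids_c)  # vector sum B + C
        score = cosine(a_vec, sum_vec)
        if best is None or score > best[0]:
            best = (score, name_b, name_c)
    if best is not None and best[0] > threshold:
        return best[1], best[2], best[0]
    return None
```

Reversing the roles of the two releases turns the same search into a check for a merge into a new class.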

4.1.4 Class Merge

By reversing the time axis, class merge reduces to class extraction; thus very similar, dual considerations hold. For example, developers may add a feature, then realize that such a feature should be part of an already existing class, and thus perform the merge.

Motivation: class B is actually implementing part of the state of class A, a subset of the behavior of A, or a mix of both.

Condition: see the condition of class extraction;

Heuristic: see the heuristic for class extraction.

4.1.5 Class Merge into a new Class

This case is the dual of class split. In this case, two classes, say B and C, disappear after release n, while class A appears.

Motivation: as in class merge; however, the overall behavior of the entity modeled by the new class substantially differs from that of the merged classes, thus class replacement takes place.

Condition: see the condition of class split;

Heuristic: see the heuristic of class split.

4.1.6 Recombining Classes

This is the most general case. Developers may have assigned some responsibilities to the wrong classes; they then decide to perform refactoring, and thus move methods/attributes from one class to another. Two different situations may occur:

  1. given two classes, A and B, at release n, some methods/attributes can migrate from A to B and vice versa; while being substantially changed, both classes will still be present in release n + 1;
  2. the combination of A and B will produce two new classes, C and D, that will appear in release n + 1, where A and B will no longer be present.

Motivation: better (re)assign responsibilities to classes, improve cohesion, reduce coupling.

Condition: given the two vectors S_n and S_{n+1}, obtained from the sum of A and B at releases n and n + 1 respectively,

Figure 2. a) Classes contained in each release - b) Classes added/removed (both panels: releases on the x-axis, classes on the y-axis; series in panel b: new classes, classes removed after this release)

  1. a careful choice of the threshold, as explained in Section 4.2, to achieve a compromise between accuracy and the manual intervention required;
  2. optimizing the search for an approximate solution using approaches such as genetic algorithms, simulated annealing or hill climbing. The set of classes involved in the refactoring can, in fact, be encoded in a genome, and the computed cosine value can be used as a fitness function. Such optimization approaches can, at least, reduce the computation time (while leaving the number of potential false positives high).

Fortunately, the number of classes changed by a split, merge or reorganization is upper bounded, and most of the time no more than four classes are involved (class recombination). Thus, in the worst case, a complexity of O(n^2) may be experienced. However, while the theoretical complexity remains polynomial, in practical cases this does not constitute a problem: the number of classes removed and added between two subsequent releases is, most of the time, considerably lower than the system size.
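To make the size of the search space concrete, the following Python sketch counts the candidate combinations to be scored between two subsequent releases when only removed and added classes are considered. The function name and the three cases counted are illustrative assumptions derived from the definitions of replacement, split and merge into a new class given in Section 4.1; the sketch does not cover class recombination.

```python
from math import comb


def candidate_counts(num_removed, num_added):
    """Number of cosine evaluations needed for each kind of discontinuity.

    num_removed: classes present in release n but not in release n + 1.
    num_added: classes present in release n + 1 but not in release n.
    """
    return {
        # one removed class compared with one added class
        "replacement": num_removed * num_added,
        # one removed class compared with the sum of a pair of added classes
        "split": num_removed * comb(num_added, 2),
        # the sum of a pair of removed classes compared with one added class
        "merge_into_new": comb(num_removed, 2) * num_added,
    }
```

For instance, with 3 removed and 5 added classes (hypothetical figures), this gives 15 replacement candidates, 30 split candidates and 15 merge candidates, which is far smaller than what would result from considering every class of the system.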

4.4 Tool Support

Different tools have been implemented and reused to support the proposed approach, in particular:

  • a tool for extracting the identifiers from Java source code, implemented using the freely available parser generator JavaCC [1] and its Java lexer; and
  • a set of Perl scripts to implement the search algorithm itself.

Notably, the use of this toolkit is completely automatic: it proposes a series of candidate refactorings over the software release history, without any kind of human intervention.
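The identifier extraction step can be approximated with a short Python sketch. The tool described above relies on a JavaCC-based lexer; the regular-expression approach below is only a rough stand-in that strips comments and string literals and discards Java keywords. Whether the original tool filters keywords or normalizes identifiers in any way is not stated here, so those choices, like the function name, are assumptions. For simplicity the sketch also ignores identifiers containing `$`.

```python
import re

# Java reserved words plus the literals true, false and null.
JAVA_KEYWORDS = frozenset("""
abstract assert boolean break byte case catch char class const continue
default do double else enum extends final finally float for goto if
implements import instanceof int interface long native new package private
protected public return short static strictfp super switch synchronized this
throw throws transient try void volatile while true false null
""".split())

# Comments and literals are removed first so their words are not counted.
_SKIP = re.compile(
    r"""
    //[^\n]*                 # line comment
  | /\*.*?\*/                # block comment
  | "(?:\\.|[^"\\])*"        # string literal
  | '(?:\\.|[^'\\])*'        # character literal
    """,
    re.DOTALL | re.VERBOSE,
)

# The leading \b keeps suffixes of numeric literals (e.g. 0x1F) out.
_IDENTIFIER = re.compile(r"\b[A-Za-z_]\w*")


def extract_identifiers(java_source):
    """Return the identifiers of a Java source string, in order of appearance."""
    code = _SKIP.sub(" ", java_source)
    return [tok for tok in _IDENTIFIER.findall(code) if tok not in JAVA_KEYWORDS]
```

The resulting list, one per class and per release, is the kind of input assumed by a vector space comparison of classes.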

5. Case Study

To obtain preliminary evidence for the proposed approach, we applied it to 40 releases of dnsjava¹. dnsjava is an open source Domain Name Server (DNS) written in Java. The system comprises classes for handling DNS names, records and addresses, for caching name resolutions, and many others. The analyzed releases range from 0.1 to 1.4.3. The system size varies from 4.3 KLOC to 15.4 KLOC, and the number of classes from 39 to 99. As shown in Figure 2-a, the number of classes is, except in a few cases, always increasing between subsequent releases. Figure 2-b shows, for each release, the number of classes added and the number of classes that terminate their life in that release. In other words, the figure shows the candidate classes among which refactoring activities should be searched.

Figure 3 shows the lifetime of dnsjava classes across releases. The plot clearly indicates situations in which some classes appear while others are removed. For the sake of simplicity, class names were omitted (a total of 123 classes were involved in dnsjava evolution).

5.1 Case Study Results

The application of the proposed approach produced the list of potential class refactoring operations shown in Table 2. The table highlights how, in the initial results, a class could potentially be involved in different types of refactoring in the same release. For example, it may not be clear whether, between releases 12 and 13, class CacheElement has been replaced by class Element or, instead, merged with class IO to create class Element.

Type         From  To  Classes involved in release n  Classes involved in release n + 1          Cosine
Replacement  3     4   dnsServer                      jnamed                                     0.
Replacement  4     5   CountedDataInputStream         DataByteInputStream                        0.
Replacement  7     8   Resolver                       SimpleResolver                             0.
Replacement  7     8   FindResolver                   FindServer                                 0.
Replacement  11    12  CacheElement                   Element                                    0.
Replacement  32    33  AXFREnumeration                AXFRIterator                               0.
Merge        7     8   FindResolver, Resolver         SimpleResolver                             0.
Merge        11    12  CacheElement, IO               Element                                    0.
Merge        12    13  CacheResponse, ZoneResponse    SetResponse                                0.
Merge        32    33  AXFREnumeration, Enumerator    AXFRIterator                               0.
Split        4     5   CountedDataInputStream         DataByteInputStream, DataByteOutputStream  0.
Split        7     8   FindResolver                   ExtendedResolver, FindServer               0.
Factor out   2     3   dns                            dns, Type                                  0.

Table 2. Potential refactorings found in dnsjava (columns: type of refactoring performed, from release, to release, classes involved in release n, classes involved in release n + 1, cosine)

Figure 3. Class lifetime for 40 dnsjava releases (x-axis: releases; y-axis: classes)

¹ http://www.dnsjava.org

In some cases, the decision can be taken automatically by analyzing the cosine value. In the last example, the cosine obtained for the merge is 0.77, while it is 0.85 for the replacement. This clearly weighs in favor of the replacement. There may, however, be cases in which the decision is not that easy, either because the difference between the cosines is not that evident, or because the actual decision made by the developers may not be in full agreement with the computed cosines. Again, it is worth highlighting that these results should (as for traceability recovery) not be considered as absolute truth, but rather as indications that should be supported by code inspection.

Let us now focus on the results obtained. The results for replacement indicated that class jnamed replaced class dnsServer in release 4. Code inspection (and even the class names) confirmed that this was a good result, since both classes model DNS daemons. The classes were quite similar, except for the fact that jnamed was able to handle DNS record sets and zones.

The second case of replacement is somewhat more complex to interpret. In fact, it is not clear whether class Resolver has simply been replaced by SimpleResolver or whether, instead, SimpleResolver was obtained from the merge of Resolver and FindResolver. The higher similarity agrees with the manual inspection in favor of the replacement. The class FindResolver does not seem to have been included in SimpleResolver. Instead, we found that it was replaced by class FindServer. However, the first run of the algorithm missed that replacement. This is because the class FindServer contained additional features with respect to SimpleResolver (it also analyzes properties of the DNS configuration, in addition to reading them from a file). A threshold of 0.3 for the cosine allowed the automatic identification of this replacement.

The lowered threshold also allowed us to discover that at release 5 CountedDataInputStream was replaced by DataByteInputStream. The similarity (0.32) was not very high, since several new features were added in the new class. The dual class CountedDataOutputStream was also replaced by DataByteOutputStream. However, here the similarity was even smaller (0.10), and therefore this refactoring could have been discovered only by using a very low threshold (acceptable for this case study, but not for others). Then, we analyzed if, in release 12, class

In our opinion, this can provide useful support when a project manager is interested in studying the evolution of a feature (or of a class) across all the releases of a software system. In fact, even if a class disappears at a given release, it could have been replaced, or simply merged with another into a new class. Such a refactoring therefore causes broken links in class evolution traceability. Identifying those links is also relevant for other important purposes, such as program comprehension and impact analysis.

Work in progress is devoted to improving the identification algorithm by introducing search heuristics to find more complex cases of refactoring. Moreover, we will also investigate how to complement the approach with clone detection. Finally, we will also focus on the identification of finer-grained cases of refactoring (e.g., of methods), also relying on the analysis of data from CVS repositories.

References

[1] Java Compiler Compiler (JavaCC) - The Java Parser Generator. https://javacc.dev.java.net.
[2] G. Antoniol, G. Canfora, G. Casazza, and A. De Lucia. Information retrieval models for recovering traceability links between code and documentation. In Proceedings of IEEE International Conference on Software Maintenance, pages 40–49, San Jose CA USA, October 2000. IEEE Comp. Soc. Press.
[3] G. Antoniol, G. Canfora, G. Casazza, and A. De Lucia. Maintaining traceability links during object-oriented software evolution. Software - Practice and Experience, 31:1–25, 2001.
[4] G. Antoniol, G. Canfora, G. Casazza, A. De Lucia, and E. Merlo. Tracing object-oriented code into functional requirements. In Proceedings of the 8th International Workshop on Program Comprehension, pages 227–230, Limerick Ireland, June 2000.
[5] G. Antoniol, G. Canfora, G. Casazza, A. De Lucia, and E. Merlo. Recovering traceability links between code and documentation. IEEE Transactions on Software Engineering, 28(10):970–983, October 2002.
[6] G. Antoniol, G. Canfora, A. De Lucia, and E. Merlo. Recovering code to documentation links in OO systems. Proc. of the Working Conference on Reverse Engineering, pages 136–144, Oct 1999.
[7] S. Demeyer, S. Ducasse, and O. Nierstrasz. Finding refactorings via change metrics. In Proceedings of the International Conference on Object-Oriented Programming, Systems, Languages, and Application (OOPSLA), 2000.
[8] S. Demeyer, S. Ducasse, and O. Nierstrasz. Object-Oriented Reengineering Patterns. Morgan Kaufmann/Elsevier, 2002.
[9] M. Fischer, M. Pinzger, and H. Gall. Analyzing and Relating Bug Report Data for Feature Tracking. In 10th Working Conference on Reverse Engineering (WCRE), Victoria, Canada, pages 90–99, November 2003.
[10] M. Fischer, M. Pinzger, and H. Gall. Populating a release history database from version control and bug tracking systems. In Proceedings of IEEE International Conference on Software Maintenance, pages 23–32, Amsterdam, The Netherlands, Sep 2003.
[11] M. Fowler, K. Beck, J. Brant, W. Opdyke, and D. Roberts. Refactoring: Improving the Design of Existing Code. Addison-Wesley Publishing Company, 1999.
[12] W. B. Frakes and R. Baeza-Yates. Information Retrieval: Data Structures and Algorithms. Prentice-Hall, Englewood Cliffs, NJ, 1992.
[13] H. Gall, M. Jazayeri, and J. Krajewski. CVS release history data for detecting logical couplings. In Proceedings of the International Workshop on Principles of Software Evolution, pages 13–23, Helsinki, Finland, Sep 2003.
[14] D. Harman. Ranking algorithms. In Information Retrieval: Data Structures and Algorithms, pages 363–392. Prentice-Hall, Englewood Cliffs, NJ, 1992.
[15] Y. Maarek, D. Berry, and G. Kaiser. An information retrieval approach for automatically constructing software libraries. IEEE Transactions on Software Engineering, 17:800–813, 8
[16] J. I. Maletic and A. Marcus. Supporting program comprehension using semantic and structural information. In Proc. of 23rd International Conference on Software Engineering, pages 103–112, Toronto, 2001.
[17] A. Marcus and J. I. Maletic. Recovering documentation-to-source-code traceability links using latent semantic indexing. In Proceedings of the International Conference on Software Engineering, pages 125–135, Portland Oregon USA, May 2003.
[18] D. Ratiu, S. Ducasse, T. Gîrba, and R. Marinescu. Using history information to improve design flaws detection. In European Conference on Software Maintenance and Reengineering, pages 223–232, Tampere, Finland, Mar 2004.
[19] G. Salton and C. Buckley. Term-weighting approaches in automatic text retrieval. Information Processing and Management, 24(5):513–523, 1988.
[20] R. J. Turver and M. Munro. An early impact analysis technique for software maintenance. Journal of Software Maintenance - Research and Practice, 6(1):35–52, 1994.
[21] Z. Xing and E. Stroulia. Understanding class evolution in object-oriented software. In Proceedings of the IEEE International Workshop on Program Comprehension, pages 34–43, Bari, Italy, Jun 2004.
[22] T. Zimmermann, S. Diehl, and A. Zeller. How history justifies system architecture (or not). In Proceedings of the International Workshop on Principles of Software Evolution, pages 73–83, Helsinki, Finland, Sep 2003.
[23] T. Zimmermann, P. Weißgerber, S. Diehl, and A. Zeller. Mining version histories to guide software changes. In Proceedings of the International Conference on Software Engineering, pages 563–572, Edinburgh, Scotland, UK, May