Computational Biology and Chemistry 28 (2004) 141–148
An adaptive and iterative algorithm for
refining multiple sequence alignment
Yi Wang, Kuo-Bin Li
Bioinformatics Institute, 30 Biopolis Street, Singapore 138671, Singapore
Received 2 December 2003; received in revised form 10 February 2004; accepted 10 February 2004
Abstract
Multiple sequence alignment is a basic tool in computational genomics. The art of multiple sequence alignment is about placing gaps. This paper presents a heuristic algorithm that iteratively improves multiple protein sequence alignments. A consistency-based objective function is used to evaluate the candidate moves. During the iterative optimization, well-aligned regions can be detected and kept intact. Columns of gaps are inserted to help the algorithm escape from locally optimal alignments. The algorithm has been evaluated using the BAliBASE benchmark alignment database. Results show that the performance of the algorithm depends little on the initial or seed alignments. Given a perfect consistency library, the algorithm is able to produce alignments that are close to the global optimum. We demonstrate that the algorithm is able to refine alignments produced by other software, including ClustalW, SAGA and T-COFFEE. The program is available upon request.
© 2004 Elsevier Ltd. All rights reserved.
Keywords: Iterative algorithm; Multiple sequence alignment; Alignment improver
Corresponding author. Tel.: +65-6478-8265; fax: +65-6478-9047.
E-mail address: [email protected] (K.-B. Li).
1476-9271/$ see front matter © 2004 Elsevier Ltd. All rights reserved.
doi:10.1016/j.compbiolchem.2004.02.001

1. Introduction

Multiple sequence alignment has for decades been an essential tool for analyzing sequences of proteins and nucleic acids. It is used in the detection of characteristic motifs and conserved regions, in the determination of phylogenetic trees, and in the prediction of secondary and tertiary structure. To accommodate rapidly growing demands for automated alignment, various algorithms have been developed to obtain sound alignments in terms of both quality and speed (Notredame, 2002; Thompson et al., 1999b). This evolution started with the successful optimization of pairwise alignment using dynamic programming (Needleman and Wunsch, 1970; Smith and Waterman, 1981). When aligning more than two sequences, however, the available computing resources are more often than not dwarfed by the complexity of the problem. Although dynamic programming extends naturally in theory to the multiple alignment of N sequences (Carrillo and Lipman, 1988), it requires prohibitive memory space for an N-dimensional array as well as computing time on the order of the Nth power of the sequence length.

In order to achieve approximate alignments within feasible time, two types of heuristics are generally used: progressive and iterative approaches. The progressive approach builds up a multiple sequence alignment gradually by aligning the closest pair of sequences first and successively adding the more distant ones. This family includes MULTALIGN (Barton and Sternberg, 1987), MULTAL (Taylor, 1988) and ClustalW (Thompson et al., 1994), which differ mainly in how they decide the order in which sequences are added. The fundamental flaw of the progressive approach is its inability to adjust the existing alignment in light of newly added sequences. Trivial misalignments made in early stages remain uncorrected and are conserved; they then accumulate into serious ones, preventing newly added sequences from being properly aligned.

Iterative approaches (Gotoh, 1996; Heringa, 1999, 2002; Notredame and Higgins, 1996), on the other hand, start with an initial alignment that includes all the sequences and then attempt to improve it at each iteration. An iterative algorithm ends when a specified number of iterations has been performed or when no further effective change can be identified. An objective function is employed to evaluate alignments and determine the changes to be committed. The iterative approach is mainly used to refine and improve alignments.

Like most other applications of iterative algorithms, iterative multiple sequence alignment faces the challenge of how to escape from locally optimal alignments, and whether and to what extent worsening moves should be taken. Heuristics such as simulated annealing (Kirkpatrick et al., 1983) and genetic algorithms (Holland, 1975) have been applied to the problem of multiple sequence alignment (Ishikawa et al., 1993; Kim et al., 1994; Notredame and Higgins, 1996). They facilitate global optimization by accepting deteriorating moves. However, doing so does not guarantee that a globally optimal alignment will be found.

In this paper, we propose an adaptive and iterative algorithm for refining multiple sequence alignment. Since misplacement of gaps is an important issue in multiple sequence alignment, the first feature of our algorithm is the ability to detect and move gaps. Unlike previous attempts at this problem (Korostensky et al., 1999), we use a simple greedy approach. Secondly, well-aligned regions can be automatically detected and isolated through a local objective function. This concept was seen in RASCAL (Thompson et al., 2003), where a function called NorMD (Thompson et al., 2001) was used. The ability to detect well-aligned regions has at least two advantages. It not only prevents the well-aligned regions from being modified in later iterations, but also speeds up the subsequent optimization, since the attention of the algorithm can be shifted to the poorly aligned regions. Finally, the introduction of column-gaps provides a mechanism to jump off local optima without having to take deteriorating moves.

Main Iteration:
    Loop B:
        Evaluate all possible gap-moves in the entire alignment.
        Adopt the best improving move.
        Termination: No improving gap-move can be found.

    Detect and isolate well-aligned regions.

    Loop C:
        Evaluate all possible gap-moves in poorly aligned regions.
        Adopt the best improving move.
        Termination: No improving gap-move can be found.

    If (no successful move found in Loop C) {
        For a user-defined number of iterations:
            Randomly insert columns of gaps.

            Loop D:
                Evaluate all possible gap-moves in the insertion-affected regions.
                Adopt the best improving move.
                Termination: No improving gap-move can be found.
    }

    Termination: No improving gap-move can be found in Loop D.

Fig. 1. The iterative algorithms of AIMSA in pseudo code.

To determine the optimal candidate move in each iteration, we use the consistency-based objective function for alignment evaluation (COFFEE, Notredame et al., 1998) as the objective function. The first consistency-based multiple alignment method was described over 10 years ago (Kececioglu, 1993). The idea is that the optimal multiple alignment is the one that agrees the most with all the possible optimal pairwise alignments. This concept has been implemented in both progressive (Notredame et al., 2000) and iterative approaches (Heringa, 2002).

Our algorithm, adaptive iterative multiple sequence alignment (AIMSA), has been shown to produce high quality alignments consistently on the BAliBASE benchmark alignment dataset (Thompson et al., 1999a).
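The consistency idea described above can be illustrated with a minimal sketch. It is not the COFFEE function itself: COFFEE weights each pairwise alignment, whereas this simplified version is unweighted and only counts how many residue pairs aligned in a multiple alignment also appear in a library of pairwise alignments. The function names (`aligned_pairs`, `library_from_alignment`, `consistency_score`) and the data layout are assumptions made for illustration.

```python
from itertools import combinations


def aligned_pairs(row_a, row_b):
    """Return the set of (i, j) residue-index pairs aligned in two gapped rows."""
    pairs = set()
    i = j = 0  # running residue indices (gaps are not counted)
    for ca, cb in zip(row_a, row_b):
        if ca != "-" and cb != "-":
            pairs.add((i, j))
        if ca != "-":
            i += 1
        if cb != "-":
            j += 1
    return pairs


def library_from_alignment(alignment):
    """Build a pairwise library (hypothetical layout) from an existing alignment."""
    return {
        (a, b): aligned_pairs(alignment[a], alignment[b])
        for a, b in combinations(sorted(alignment), 2)
    }


def consistency_score(alignment, library):
    """Unweighted fraction of aligned residue pairs that the library supports."""
    found = total = 0
    for a, b in combinations(sorted(alignment), 2):
        pairs = aligned_pairs(alignment[a], alignment[b])
        total += len(pairs)
        found += len(pairs & library.get((a, b), set()))
    return found / total if total else 0.0


# Toy data: a "reference" alignment and an ungapped candidate.
reference = {"A": "has--reviewed", "B": "willpreview--"}
candidate = {"A": "hasreviewed", "B": "willpreview"}
library = library_from_alignment(reference)

print(consistency_score(reference, library))  # fully consistent with its own library
print(consistency_score(candidate, library))  # only the first three columns agree
```

Building the library from a reference alignment mirrors, in spirit only, the "perfect library" experiment reported in the Results; a real library would come from independent pairwise alignments.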

2. Algorithm

AIMSA iteratively optimizes a multiple sequence alignment against an objective function. The optimization code serves as a standalone module that is able to work on initial alignments provided by any third-party method. The

sub regions. As a result, each sub region contains far fewer potential gap-blocks than the whole region. Loop B first divides the whole alignment into several sub regions and works on them one by one. For each sub region, the algorithm tries to identify the best improving block move. This process is iterated until no more improving block moves can be found. Loop B terminates when block moves in all sub regions are completed.
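The "adopt the best improving move until none remains" pattern of Loop B can be sketched as a plain greedy loop. This is an illustrative sketch under simplifying assumptions, not AIMSA's implementation: it moves single gaps rather than gap-blocks, it does not split the alignment into sub regions, and it uses a toy identity-based score in place of COFFEE. All function names are hypothetical.

```python
from itertools import combinations


def identity_score(rows):
    """Toy objective: number of identical residue pairs per column (gaps ignored)."""
    total = 0
    for column in zip(*rows):
        for x, y in combinations(column, 2):
            if x == y and x != "-":
                total += 1
    return total


def single_gap_moves(rows):
    """Yield every alignment obtained by moving one gap within one row."""
    for r, row in enumerate(rows):
        for p, ch in enumerate(row):
            if ch != "-":
                continue
            rest = row[:p] + row[p + 1:]          # row with this gap removed
            for q in range(len(rest) + 1):
                moved = rest[:q] + "-" + rest[q:]  # gap re-inserted at position q
                if moved != row:
                    yield rows[:r] + (moved,) + rows[r + 1:]


def greedy_refine(rows, score, moves):
    """Repeatedly adopt the best improving move; stop when none improves."""
    current = score(rows)
    while True:
        best, best_score = None, current
        for candidate in moves(rows):
            s = score(candidate)
            if s > best_score:
                best, best_score = candidate, s
        if best is None:
            return rows
        rows, current = best, best_score


start = ("AC-GT", "ACG-T")
refined = greedy_refine(start, identity_score, single_gap_moves)
print(identity_score(start), identity_score(refined))
```

Because only strictly improving moves are accepted, the loop always terminates, but it can stop at a local optimum, which is the situation the column-gap insertions described later are meant to address.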

2.4. Detection and isolation of well-aligned regions

After a few iterations of block moves, the alignment evolves into a shape in which some portions are well aligned. The remaining tasks are to extend the well-aligned regions further and to prevent these regions from being affected by subsequent gap-moving attempts.

The ability to detect and isolate well-aligned regions from their inferior neighborhoods therefore becomes crucial. Detection of such regions is implemented by evaluating a local objective function that measures the quality of any local region within the entire multiple alignment. A preset threshold is used to identify regions with significantly high objective scores. A sliding-window approach is adopted that attempts to extend possible high scoring alignment windows in one direction, stopping when certain criteria are met.

An example from T-COFFEE (Notredame et al., 2000) is shown in Fig. 3 to explain the procedure (spaces are for illustration purposes). In this example, the size of the sliding-window is set to four and the window moves rightward. Upon reaching “GARF”, the algorithm identifies it as a high scoring region. It stops sliding and starts to extend the window until hitting “V” and “L”, a mismatching pair. If the maximum number of allowed mismatches is equal to or greater than four, the window adopts “VERY” and “LAST” and the extension continues. Finally, the region “GARFIELD THE VERY FAST CAT” is detected as a well-aligned region. If the maximum number of allowed mismatches is less than four (two in Fig. 3b), the window only brings in “VE” and “LA”. Because their respective successors, “R” and “S”, still do not match, the window extension stops, rejects “VE” and “LA” and takes “GARFIELD THE” as a good region. In this case, the sliding-window algorithm resumes from the next window after “R” and “S”, that is, “YFAS” and “TFAS”. It soon recognizes another high scoring region, “FAST CAT”.

(a) ... [GARFIELD THE VERY FAST CAT] ...
    ... [GARFIELD THE LAST FAST CAT] ...
(b) ... [GARFIELD THE] VERY [FAST CAT] ...
    ... [GARFIELD THE] LAST [FAST CAT] ...

Fig. 3. An example of the detection of well-aligned regions (denoted by bold typeface in the original, by square brackets here): (a) the whole region will be recognized as a well-aligned region when the maximum number of mismatches is set to four; (b) two well-aligned regions are to be recognized when the maximum number of mismatches is set to two.

The size of the sliding-window and the maximum number of allowed mismatches were set to four and five, respectively, in our benchmark tests.

Once the well-aligned regions have been identified, they are isolated by inserting column-gaps that serve as a buffering zone between the well-aligned regions and their neighborhoods.
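The window extension walked through in the GARFIELD example can be sketched for two sequences as follows. This is a simplified reading, not AIMSA's code: a window counts as "high scoring" only if every column matches exactly (the paper uses a thresholded local objective function instead), and the mismatch limit is interpreted as the longest run of consecutive mismatching columns the window may absorb. The function name and the half-open `(start, end)` region format are assumptions.

```python
def well_aligned_regions(a, b, window=4, max_mismatches=5):
    """Return half-open (start, end) column ranges judged well aligned."""
    n = len(a)
    match = [x == y and x != "-" for x, y in zip(a, b)]
    regions = []
    i = 0
    while i + window <= n:
        if not all(match[i:i + window]):
            i += 1                      # keep sliding rightward
            continue
        end = i + window                # seed window found; start extending
        resume = None
        while end < n:
            if match[end]:
                end += 1
                continue
            # Tentatively bring in up to max_mismatches mismatching columns.
            run = 0
            while run < max_mismatches and end + run < n and not match[end + run]:
                run += 1
            if end + run < n and match[end + run]:
                end += run              # successor matches: keep the columns
            else:
                resume = end + run + 1  # reject them; resume after the failing column
                break
        regions.append((i, end))
        i = end if resume is None else resume
    return regions


seq1 = "GARFIELDTHEVERYFASTCAT"
seq2 = "GARFIELDTHELASTFASTCAT"
for limit in (4, 2):
    found = well_aligned_regions(seq1, seq2, window=4, max_mismatches=limit)
    print(limit, [seq1[s:e] for s, e in found])
```

With a limit of four the whole stretch is reported as one region, and with a limit of two the sketch reports `GARFIELDTHE` and `FASTCAT`, matching cases (a) and (b) of Fig. 3.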

2.5. Insertion of column-gaps as buffer

The main approach that AIMSA uses to improve an alignment is to shuffle existing gaps to appropriate positions. However, gap insertion and deletion are still necessary on many occasions, and can be achieved indirectly by inserting and moving column-gaps.

To achieve a gap insertion, columns of gaps are added at a position near the expected insertion point. Gaps are then moved from the column-gaps to the expected insertion point. Deletions can be achieved in a similar manner.

A direct gap-insertion algorithm is not proposed because it might damage the neighboring part of the alignment, starting from the point of direct insertion. An example is shown in Fig. 4 (spaces are for illustration purposes). In the original alignment (Fig. 4a) there are two well-aligned regions, namely “Someone” and “this paper”. If two gaps are directly inserted before “reviewed” in sequence A in order to align “review”, the alignment of “this paper” is broken, as shown in Fig. 4b. Instead, we may insert a few columns of gaps before “this paper” as a buffering zone, as shown in Fig. 4c. Once the column-gaps have been inserted, two individual gaps can be moved to the expected optimal position (Fig. 4d). The excess column-gaps are removed at the end of each iteration, so they do not affect the final alignment (Fig. 4e).

(a) Sequence A: Someone hasreviewed this paper
    Sequence B: Someone willpreview this paper
(b) Sequence A: Someone has--reviewed this paper
    Sequence B: Someone willpreview this paper
(c) Sequence A: Someone hasreviewed---- this paper
    Sequence B: Someone willpreview---- this paper
(d) Sequence A: Someone has--reviewed-- this paper
    Sequence B: Someone willpreview---- this paper
(e) Sequence A: Someone has--reviewed this paper
    Sequence B: Someone willpreview-- this paper

Fig. 4. An example of the indirect gap-insertion: (a) the alignment initially has two well-aligned regions; (b) inserting two gaps would damage the second well-aligned region; (c) instead, columns of gaps could be inserted; (d) two gaps are then moved to the targeted insertion site; (e) the redundant column gaps can be removed easily. Now the alignment has three well-aligned regions.

A real case is given in Fig. 5 to demonstrate how AIMSA refines a poorly aligned region flanked by well-aligned ones. The sequences are taken from 1bgl ref1 of BAliBASE.

Before:
bga2_ecoli DIISTMYTRV ------PLM--------NEFG-----EYP-H PKPRIICEY
bga1_klepn DIICPMYARV ERDQPIPAV--------PKWGIKKWISLPGE QRPLILCEY
bga1_klula DIFSFMYPTF ------EIMERWRKNHTDENG-----KF--- EKPLILCEY
bga1_strtr DIESRMYAKP ADIEEYLTT--------GKLVDLSSVSTNKP QKPYISCEY
After:
bga2_ecoli DIISTMYTRV ------PLMNEFG-----EYP-H- PKPRIICEY
bga1_klepn DIICPMYARV ERDQPIPAVPKWGIKKWISLPGE- QRPLILCEY
bga1_klula DIFSFMYPTF E------IMERWR-KNHTDENGKF EKPLILCEY
bga1_strtr DIESRMYAKP ADIEEYLTTGKLVDLSSVSTNKP- QKPYISCEY

Fig. 5. An example showing that a poorly aligned region may be refined by AIMSA without damaging its neighboring well-aligned regions. The middle block of each sequence (in bold in the original) is the poorly aligned region between two well-aligned regions.

In summary, column-gaps are introduced as buffers to protect well-aligned regions from being affected by poorly aligned ones. They enable the alignment to escape from local optima without deteriorating it, whereas other iterative alignment algorithms may need to compromise in similar situations.
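The indirect insertion of Fig. 4 can be replayed with three small string operations. This is a sketch for illustration only; the helper names are hypothetical, the "Someone" prefix is dropped for brevity, and the choice of which gap to move and where is hard-coded rather than selected by an objective function as in AIMSA.

```python
def insert_gap_columns(rows, pos, count):
    """Insert `count` all-gap columns before column `pos` in every row."""
    return [row[:pos] + "-" * count + row[pos:] for row in rows]


def move_gap(row, src, dst):
    """Remove the gap at column `src`, then re-insert a gap before column `dst`."""
    if row[src] != "-":
        raise ValueError("column src does not hold a gap")
    rest = row[:src] + row[src + 1:]
    return rest[:dst] + "-" + rest[dst:]


def strip_gap_columns(rows):
    """Drop every column that contains only gaps."""
    keep = [i for i, col in enumerate(zip(*rows)) if any(c != "-" for c in col)]
    return ["".join(row[i] for i in keep) for row in rows]


# Fig. 4a: middle region plus the well-aligned tail.
rows = ["hasreviewedthispaper", "willpreviewthispaper"]

# Fig. 4c: buffer of four gap columns in front of "thispaper".
rows = insert_gap_columns(rows, pos=11, count=4)

# Fig. 4d: move two gaps of sequence A from the buffer to just before "reviewed".
seq_a = move_gap(rows[0], src=11, dst=3)
seq_a = move_gap(seq_a, src=12, dst=3)
rows = [seq_a, rows[1]]

# Fig. 4e: remove the leftover all-gap columns.
rows = strip_gap_columns(rows)
print(rows[0])
print(rows[1])
```

Throughout the steps the two rows keep equal length and "thispaper" stays in the same columns in both sequences, which is the point of using a buffer instead of a direct insertion.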

2.6. Random insertion of column-gaps

Besides being inserted at specific positions to isolate well-aligned regions, column-gaps are also inserted randomly in an attempt to accommodate the need for gap insertions and deletions. As described previously, regions for block-move evaluation are divided into sub regions to reduce the complexity of the problem. Consequently, the inserted buffering gaps only affect their neighboring sub regions and cannot facilitate adding or deleting gaps in distant sub regions, as illustrated in Fig. 6. This is the motivation for inserting column-gaps randomly.

This stochastic approach could be replaced by a deterministic one that inserts column-gaps at the boundaries of each sub region. However, doing so would introduce a large number of block-gaps and result in an unnecessarily long computing time.

Random insertion is performed in Loop D of AIMSA, where the isolation of well and poorly aligned regions no longer helps. It acts as the last attempt in Loop A to further improve alignment quality.

Well-aligned Region | Buffer | Poorly Aligned Region (sub regions A, B, C) | Buffer | Well-aligned Region

Fig. 6. The two inserted buffering zones will only affect the sub regions A and C. As a result, additional column-gaps need to be inserted to a randomly chosen position so that adding or deleting gaps in the sub region B is possible.
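The random insertion step can be sketched as below. This only shows the insertion itself under assumed parameters (`n_columns`, `width`, and a seedable `rng` are hypothetical names); in AIMSA each insertion is followed by the Loop D evaluation of gap-moves in the affected regions, and leftover column-gaps are removed afterwards.

```python
import random


def insert_random_gap_columns(rows, n_columns, width=1, rng=None):
    """Insert `n_columns` blocks of `width` all-gap columns at random positions."""
    if rng is None:
        rng = random.Random()
    rows = list(rows)
    length = len(rows[0])
    for _ in range(n_columns):
        pos = rng.randint(0, length)  # any boundary, including both ends
        rows = [row[:pos] + "-" * width + row[pos:] for row in rows]
        length += width
    return rows


def strip_gap_columns(rows):
    """Drop every column that contains only gaps."""
    keep = [i for i, col in enumerate(zip(*rows)) if any(c != "-" for c in col)]
    return ["".join(row[i] for i in keep) for row in rows]


original = ["ACGT", "A-GT"]
padded = insert_random_gap_columns(original, n_columns=3, width=2, rng=random.Random(0))
print(padded)
print(strip_gap_columns(padded) == original)
```

Because every inserted column is a gap in all rows, removing all-gap columns afterwards restores the original alignment unless gaps were moved in between, so an unproductive insertion costs nothing in alignment quality.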

3. Results

The algorithm described was implemented in Java 2 Platform, Standard Edition (J2SE) 1.4.1. Because it is organized as an application programming interface (API), it can be conveniently incorporated into users' own Java applications.

To evaluate the ability of this program to refine alignments obtained from other popular alignment tools, we used BAliBASE (Benchmark Alignment dataBASE, Thompson et al., 1999a) as the benchmark data. BAliBASE is a database of manually refined multiple sequence alignments. Core blocks, defined as the sequences that should be aligned together, are clearly indicated by BAliBASE. The five BAliBASE reference sets we employed are categorized as follows:

  • Reference 1: equidistant sequences of similar length.
  • Reference 2: family versus orphans.
  • Reference 3: equidistant divergent families.
  • Reference 4: N/C-terminal extensions.
  • Reference 5: internal insertions.

A total of 141 datasets from these five reference sets were aligned using AIMSA. The initial alignments were produced by ClustalW (Thompson et al., 1994), SAGA (Notredame and Higgins, 1996) and T-COFFEE (Notredame et al., 2000), respectively.

Since AIMSA is an optimization algorithm aimed at finding good alignments, we first demonstrate its capability as an optimizer. Table 1 shows how AIMSA improves the objective scores, in our case the COFFEE scores, on the BAliBASE test sequences. The “Average” is the arithmetic mean of the scores from all the tests in the five reference sets. The “Increment” is the percentage increase achieved by AIMSA. AIMSA performs well in terms of optimization, achieving an improvement of at least 3.7% in the objective scores.

Although Table 1 shows that AIMSA performs well mathematically, the biological relevance of the alignments remains to be tested. To evaluate the alignment quality, we adopt the BAliBASE sum-of-pair (SP) scoring scheme. Given a multiple alignment, the SP scheme computes the percentage of amino acid pairs that occur in the manually

Table 4
Comparison of the average time costs (in seconds) to complete a BAliBASE test in the five reference sets using ClustalW, SAGA, T-COFFEE and AIMSA

Reference set   ClustalW   SAGA      T-COFFEE   AIMSA
1               0.3        338.9     1.4        11.
2               2.1        4479.1    34.9       322.
3               2.6        14905.5   47.8       884.
4               0.8        4108.5    11.6       201.
5               0.7        1443.5    8.8        197.

The initial alignments for AIMSA were created by T-COFFEE. The time cost of AIMSA does not include the time used to create the initial alignment. All tests were performed on a 1.4 GHz Pentium III computer. AIMSA is implemented in Java, whereas the remaining three are implemented in the C programming language.

As another test of AIMSA’s capability to optimize a multiple alignment given “correct” guidance, we created a COFFEE pairwise library from the BAliBASE reference alignments. This manually produced pairwise library is fully consistent with the reference alignments. Table 3 shows that AIMSA is indeed able to improve the initial alignments into ones that are very close to the reference alignments.

Table 4 shows the average time costs for each test across the five BAliBASE reference sets using ClustalW, SAGA, T-COFFEE and AIMSA. In this test AIMSA was refining alignments created by T-COFFEE.

4. Discussion

Table 2 shows that AIMSA may be used to refine alignments created by other basic alignment methods. AIMSA performs particularly well when guided by a correct objective function, as can be seen in Table 3. AIMSA has the most consistent performance, with a lower standard deviation of scores (shown in Table 2) across the different alignment categories. This suggests that, unlike many alignment programs that work well on certain types of data but poorly on others, AIMSA may be used to align multiple sequences of various combinations.

We believe that the ability of AIMSA to obtain good alignments depends on good pairwise libraries and not very much on the initial or seed alignments. From Table 2, we found that the dependencies on the various seed alignments (ClustalW, SAGA and T-COFFEE) are relatively small. However, as shown in Table 3, a larger improvement in alignment quality was indeed observed given a theoretically perfect pairwise library. This suggests that, although better seed alignments still produce better final alignments, AIMSA as an optimization algorithm is able to reach a solution close to the global optimum regardless of which initial or seed solution is used.

Combined with the data from Table 4, we can make further comparisons between SAGA and AIMSA, both of which are iterative methods for refining alignments. The quality of the final alignments obtained by AIMSA is slightly higher (∼1%) than that of SAGA. Since both rely on the same COFFEE scoring scheme, which is the essential factor for the quality of optimization, we may attribute AIMSA’s improvement to its unique features, such as the ability to jump off local optima. The main difference between the two programs is their execution time: AIMSA is faster than SAGA, with a speedup ranging from 10 to 30 over the five reference sets, even though SAGA is implemented in C while AIMSA is implemented in Java, which has lower runtime performance than C.

Comparing ClustalW and T-COFFEE with AIMSA, AIMSA is 40–250 times slower than ClustalW, and 10– times slower than T-COFFEE, as illustrated in Table 4. Even assuming an implementation in C, AIMSA would still have a 10-fold speed disadvantage compared with ClustalW. This is due to the progressive and the iterative nature that these two programs, respectively, adopt. The strength of AIMSA over ClustalW and T-COFFEE lies in its refining ability. Although the edge is still slim (1.8% over ClustalW and 0.7% over T-COFFEE), AIMSA has potential for improvement, provided that an enhanced objective function and extra biological information can be used as better guidance for the optimization. Without a clear separation between quality assessment and alignment optimization, progressive approaches such as ClustalW need a redesign to take advantage of new data.

There are several possibilities for improving AIMSA, the main one being the objective function. The objective function is the soul of an iterative algorithm in the sense that it determines the candidate move to be taken to improve the solution quality. In multiple sequence alignment, the objective function is the key factor controlling the evolution of an alignment into a mature one. The current COFFEE scheme that AIMSA adopts has yet to be improved. It employs weights based on the overall similarity of a pairwise alignment instead of on the substitution score of each pair of residues. This leads to a possible overestimation of non-functionally related regions in similar sequences and an underestimation of functionally related regions in sequences with low overall similarity.

A main disadvantage of AIMSA is that it is time-consuming, which stems from its iterative nature. When aligning a large number of highly diverse sequences, the time cost can extend to several hours. This weakness provides a strong incentive to parallelize the program in future efforts.

Acknowledgements

The authors wish to thank Tariq Riaz and Sheng Zeng of Bioinformatics Institute for useful discussion, Julie Thompson of IGBMC for providing data of BAliBASE and anonymous reviewers for valuable suggestions.

References

Barton, G.J., Sternberg, M.J., 1987. A strategy for the rapid multiple alignment of protein sequences. Confidence levels from tertiary structure comparisons. J. Mol. Biol. 198, 327–337.
Carrillo, H., Lipman, D.J., 1988. The multiple sequence alignment problems in biology. SIAM J. Appl. Math. 48, 1073–1082.
Gotoh, O., 1996. Significant improvement in accuracy of multiple protein sequence alignments by iterative refinement as assessed by reference to structural alignments. J. Mol. Biol. 264, 823–838.
Heringa, J., 1999. Two strategies for sequence comparison: profile-preprocessed and secondary structure-induced multiple alignment. Comput. Chem. 23, 341–364.
Heringa, J., 2002. Local weighting schemes for protein multiple sequence alignment. Comput. Chem. 26, 459–477.
Holland, J.H., 1975. Adaptation in Natural and Artificial Systems. University of Michigan Press, Ann Arbor, Michigan, USA.
Ishikawa, M., Toya, T., Hoshida, M., Nitta, K., Ogiwara, A., Kanehisa, M., 1993. Multiple sequence alignment by parallel simulated annealing. Comput. Appl. Biosci. 9, 267–273.
Kececioglu, J.D., 1993. The maximum weight trace problem in multiple sequence alignment. Lect. Notes Comput. Sci. 684, 106–119.
Kim, J., Pramanik, S., Chung, M.J., 1994. Multiple sequence alignment using simulated annealing. Comput. Appl. Biosci. 10, 419–426.
Kirkpatrick, S., Gelatt, C.D., Vecchi, M.P., 1983. Optimization by simulated annealing. Science 220, 671–680.
Korostensky, C., Stege, U., Gonnet, G., 1999. Using insertion and deletion events for improving multiple sequence alignments and building the corresponding evolutionary trees. http://www.inf.ethz.ch/personal/gonnet/papers/gaps12/.
Needleman, S.B., Wunsch, C.D., 1970. A general method applicable to the search for similarities in the amino acid sequence of two proteins. J. Mol. Biol. 48, 443–453.
Notredame, C., 2002. Recent progress in multiple sequence alignment: a survey. Pharmacogenomics 3, 131–144.
Notredame, C., Higgins, D.G., 1996. SAGA: sequence alignment by genetic algorithm. Nucl. Acids Res. 24, 1515–1524.
Notredame, C., Holm, L., Higgins, D.G., 1998. COFFEE: an objective function for multiple sequence alignments. Bioinformatics 14, 407–
Notredame, C., Higgins, D.G., Heringa, J., 2000. T-COFFEE: a novel method for fast and accurate multiple sequence alignment. J. Mol. Biol. 302, 205–217.
Smith, T.F., Waterman, M.S., 1981. Identification of common molecular subsequences. J. Mol. Biol. 147, 195–197.
Taylor, W.R., 1988. A flexible method to align large numbers of biological sequences. J. Mol. Evol. 28, 161–169.
Thompson, J.D., Higgins, D.G., Gibson, T.J., 1994. CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucl. Acids Res. 22, 4673–4680.
Thompson, J.D., Plewniak, F., Poch, O., 1999a. BAliBASE: a benchmark alignment database for the evaluation of multiple alignment programs. Bioinformatics 15, 87–88.
Thompson, J.D., Plewniak, F., Poch, O., 1999b. A comprehensive comparison of multiple sequence alignment programs. Nucl. Acids Res. 27, 2682–2690.
Thompson, J.D., Plewniak, F., Ripp, R., Thierry, J.C., Poch, O., 2001. Towards a reliable objective function for multiple sequence alignments. J. Mol. Biol. 314, 937–951.
Thompson, J.D., Thierry, J.C., Poch, O., Plewniak, F., Ripp, R., 2003. RASCAL: rapid scanning and correction of multiple sequence alignments. Bioinformatics 19, 1155–1161.