ProbCons: Practical Tool for Protein Sequence Alignment - Prof. Yingshu Li, Papers of Computer Science

ProbCons is a pair-hidden Markov model-based progressive alignment algorithm that uses maximum expected accuracy and the probabilistic consistency transformation to incorporate multiple sequence conservation information during pairwise alignment. The document evaluates the performance of ProbCons on several standard alignment benchmark data sets and compares it with several leading alignment tools. It also discusses the problem of protein multiple sequence alignment and the utility of the probabilistic consistency transformation.

Uploaded on 08/31/2009

ProbCons: Probabilistic consistency-based multiple sequence alignment

Chuong B. Do,^1 Mahathi S.P. Mahabhashyam,^1 Michael Brudno,^1 and Serafim Batzoglou^1,2

^1 Department of Computer Science, Stanford University, Stanford, California 94305, USA

To study gene evolution across a wide range of organisms, biologists need accurate tools for multiple sequence alignment of protein families. Obtaining accurate alignments, however, is a difficult computational problem because of not only the high computational cost but also the lack of proper objective functions for measuring alignment quality. In this paper, we introduce probabilistic consistency, a novel scoring function for multiple sequence comparisons. We present ProbCons, a practical tool for progressive protein multiple sequence alignment based on probabilistic consistency, and evaluate its performance on several standard alignment benchmark data sets. On the BAliBASE, SABmark, and PREFAB benchmark alignment databases, ProbCons achieves statistically significant improvement over other leading methods while maintaining practical speed. ProbCons is publicly available as a Web resource.

[Supplemental material is available online at www.genome.org. Source code and executables are available as public domain software at http://probcons.stanford.edu.]

Given a set of biological sequences, a multiple alignment provides a way of identifying and visualizing patterns of sequence conservation by organizing homologous positions across different sequences in columns. As sequence similarity often implies divergence from a common ancestor or functional similarity, sequence comparisons facilitate evolutionary and phylogenetic studies (Phillips et al. 2000; Castillo-Davis et al. 2004) and isolation of the most relevant regions (Attwood 2002) for a variety of biological analyses. In particular, conserved amino acid stretches in proteins are strong indicators of preserved three-dimensional structural domains, so protein alignments have been widely used in aiding structure prediction (Rost and Sander 1994; Jones 1999) and characterization of protein families (Sonnhammer et al. 1998; Johnson and Church 1999; Bateman et al. 2004). However, when sequence identity falls below 30%, called the “twilight zone” of protein alignments, the accuracies of most automatic sequence alignment methods drop considerably (Rost 1999; Thompson et al. 1999b). As a result, alignment quality is often the limiting factor in biological analyses of amino acid sequences (Jaroszewski et al. 2002).

The problem of alignment construction consists of defining either explicitly or implicitly an objective function for assessing alignment quality and employing an efficient algorithm to find the optimal, or a near optimal, alignment according to the objective function. Two-sequence alignments are usually evaluated by addition of match/mismatch scores for aligned pairs of positions and affine gap penalties for unaligned amino acids (Needleman and Wunsch 1970; Smith and Waterman 1981). Quantitatively, scores for aligned residues are given by log-odds (Altschul 1991) substitution matrices such as PAM (Dayhoff et al. 1978), GONNET (Gonnet et al. 1992), or BLOSUM (Henikoff and Henikoff 1992). Estimation of appropriate gap penalties, however, is often regarded as a “black art” based on trial and error (Vingron and Waterman 1994). For two sequences of length L, an optimal alignment according to this metric may be computed in $O(L^2)$ time (Gotoh 1982) and $O(L)$ space (Myers and Miller 1988) via dynamic programming.

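As a rough illustration of the dynamic program described above, the following Python sketch computes an optimal global alignment score under affine gap penalties using three score tables. The function name, the toy match/mismatch scores, and the gap parameters are assumptions chosen for illustration; a protein aligner would instead look up residue scores in a matrix such as BLOSUM62. For clarity it stores full quadratic tables rather than applying the linear-space refinement.

```python
NEG_INF = float("-inf")


def affine_global_score(x, y, match=2.0, mismatch=-1.0,
                        gap_open=-4.0, gap_extend=-1.0):
    """Optimal global alignment score with affine gaps (toy scoring scheme).

    A gap of length k costs gap_open + (k - 1) * gap_extend.
    All parameter values are illustrative assumptions.
    """
    n, m = len(x), len(y)
    # M: x[i-1] aligned to y[j-1]
    # X: x[i-1] aligned to a gap; Y: y[j-1] aligned to a gap
    M = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    X = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    Y = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    M[0][0] = 0.0
    for i in range(1, n + 1):
        X[i][0] = gap_open + (i - 1) * gap_extend
    for j in range(1, m + 1):
        Y[0][j] = gap_open + (j - 1) * gap_extend
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            M[i][j] = s + max(M[i - 1][j - 1], X[i - 1][j - 1], Y[i - 1][j - 1])
            X[i][j] = max(M[i - 1][j] + gap_open,
                          X[i - 1][j] + gap_extend,
                          Y[i - 1][j] + gap_open)
            Y[i][j] = max(M[i][j - 1] + gap_open,
                          Y[i][j - 1] + gap_extend,
                          X[i][j - 1] + gap_open)
    return max(M[n][m], X[n][m], Y[n][m])
```

Keeping one table per alignment state is what lets gap opening and gap extension be charged differently; the same three-state structure reappears in the pair-HMM formulation discussed next.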
Pair-hidden Markov models (HMMs) provide an alternative formulation of the sequence alignment problem in which alignment generation is directly modeled as a first-order Markov process involving state emissions and transitions. In this approach, model parameters obtain an intuitive probabilistic interpretation and can be trained on real data using standard supervised or unsupervised likelihood-based methods. The Viterbi (1967) algorithm computes the highest probability alignment of two input sequences according to an alignment pair-HMM. In the standard three-state pair-HMM for alignment, the Viterbi algorithm may be viewed as an instantiation of the Needleman–Wunsch algorithm in which alignment parameters are determined by a log-odds transformation of the HMM scoring scheme (Durbin et al. 1998).

Since they specify a conditional probability distribution over the space of all suboptimal alignments, pair-HMMs also allow the computation of the posterior probability, $P(x_i \sim y_j \in a^* \mid x, y)$, that particular positions $x_i$ and $y_j$ of two sequences x and y, respectively, will be matched in an alignment $a^*$ generated by the model. Running the Needleman–Wunsch algorithm with these posterior probabilities as substitution scores and no gap penalties gives rise to the maximum expected accuracy alignment method (see Methods), also known as optimal accuracy alignment (Holmes and Durbin 1998).

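The maximum expected accuracy idea can be sketched in a few lines of Python: given a precomputed matrix of match posteriors, a gap-free Needleman–Wunsch recurrence picks the set of matched pairs with the largest total posterior. The function name and the list-of-lists input format are assumptions made for this illustration, and the sketch is not taken from the ProbCons source code.

```python
def mea_alignment(post):
    """Maximum expected accuracy alignment from a posterior matrix.

    post[i][j] is assumed to hold the posterior probability that the
    (i+1)-th residue of x pairs with the (j+1)-th residue of y.
    Returns (expected number of correct matches, list of matched index pairs).
    """
    n = len(post)
    m = len(post[0]) if n else 0
    A = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            A[i][j] = max(A[i - 1][j - 1] + post[i - 1][j - 1],  # match
                          A[i - 1][j],                           # gap in y
                          A[i][j - 1])                           # gap in x
    # Traceback: recover one optimal set of matched pairs.
    pairs = []
    i, j = n, m
    while i > 0 and j > 0:
        if A[i][j] == A[i - 1][j]:
            i -= 1
        elif A[i][j] == A[i][j - 1]:
            j -= 1
        else:
            pairs.append((i - 1, j - 1))
            i -= 1
            j -= 1
    pairs.reverse()
    return A[n][m], pairs
```

Because gaps carry no penalty, the only quantity being optimized is the summed posterior of the matched pairs, which is the expected number of correctly aligned residue pairs under the model.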
^2 Corresponding author. E-mail [email protected]; fax (650) 725-1449. Article and publication are at http://www.genome.org/cgi/doi/10.1101/gr.2821705.

Resource. Genome Research 15:330–340 ©2005 by Cold Spring Harbor Laboratory Press; ISSN 1088-9051/05; www.genome.org

In the general case of multiple sequence comparisons, theoretically sound and biologically motivated scoring methods are not straightforward to devise. In practice, ad hoc sum-of-pairs schemes (Carrillo and Lipman 1988), which combine the projected pairwise log-odds scores for all pairs of sequences in the alignment, and their weighted variants (Altschul et al. 1989) are commonly used. Unfortunately, direct application of dynamic programming is too inefficient for alignment of more than a few sequences. Instead, a variety of heuristic strategies have been proposed, including genetic algorithms (Notredame and Higgins 1996), simulated annealing (Kim et al. 1994), alignment to a profile HMM (Krogh et al. 1994; Eddy 1995), or greedy assemblage of multiple segment-to-segment comparisons (Morgenstern et al. 1996). By far, the most popular heuristic strategies involve tree-based progressive alignment (Feng and Doolittle 1987) in which groups of sequences are assembled into a complete multiple alignment via several pairwise alignment steps. As with any hierarchical approach, however, errors at early stages in the alignment not only propagate to the final alignment but also may increase the likelihood of misalignment due to incorrect conservation signals. Post-processing steps such as iterative refinement (Gotoh 1996) alleviate some of the errors made during progressive alignment.

Consistency-based schemes take the alternative view that “prevention is the best medicine.” Note that for any multiple alignment, the induced pairwise alignments are necessarily consistent; that is, given a multiple alignment containing three sequences x, y, and z, if position $x_i$ aligns with position $z_k$ and position $z_k$ aligns with $y_j$ in the projected xz and zy alignments, then $x_i$ must align with $y_j$ in the projected xy alignment. Consistency-based techniques apply this principle in reverse, using evidence from intermediate sequences to guide the pairwise alignment of x and y, such as needed during the steps of a progressive alignment. By adjusting the score for an $x_i \sim y_j$ residue pairing according to support from some position $z_k$ that aligns to both $x_i$ and $y_j$ in the respective xz and yz pairwise comparisons, consistency-based objective functions incorporate multiple sequence information in scoring pairwise alignments.

Gotoh (1990) first introduced consistency to identify anchor points for reducing the search space of a multiple alignment.
A mathematically elegant reformulation of consistency in terms of boolean matrix multiplication was later given by Vingron and Argos (1991) and implemented in the program MALI, which builds multiple alignments from dot matrices (Vingron and Argos 1989). An alternative formulation of consistency was employed in the DIALIGN tool, which finds ungapped local alignments via segment-to-segment comparisons, determines new weights for these alignments using consistency, and assembles them into a multiple alignment by a greedy selection procedure (Morgenstern et al. 1996).

More recently, Notredame et al. (1998) introduced COFFEE, a new consistency-based objective function for scoring residue pairs in a pairwise alignment. In this approach, an alignment library is computed by merging consistent CLUSTALW (Thompson et al. 1994) global and LALIGN (Huang and Miller 1991) local pairwise alignments to form three-way alignments, which are assigned percent identity weights. Then, the score for aligning $x_i$ to $y_j$ is defined to be the sum of the weights of all alignments in the library containing that aligned residue pair. The program T-Coffee (Notredame et al. 2000), which implements multiple sequence alignment under this objective function using progressive maximum weight trace computations (Kececioglu 1993), has demonstrated superior accuracy on the BAliBASE test suite (Thompson et al. 1999a) over competing methods, including CLUSTALW, DIALIGN, and PRRP (Gotoh 1996).

In this article, we introduce probabilistic consistency, a novel modification of the traditional sum-of-pairs scoring system that incorporates HMM-derived posterior probabilities and three-way alignment consistency. We discuss the theoretical motivations behind the probabilistic consistency scoring system and demonstrate its applicability with ProbCons, a protein progressive multiple alignment tool based on this technique. To assess the utility of our methods, we compared ProbCons to several current leading alignment tools including Align-m (Van Walle et al. 2004), CLUSTALW, DIALIGN, MAFFT (Katoh et al. 2002), MUSCLE (Edgar 2004), and T-Coffee on the BAliBASE, SABmark (Van Walle et al. 2004), and PREFAB (Edgar 2004) benchmark alignment databases, using commonly accepted accuracy measures for validating alignment quality. In this comparison, ProbCons shows a clear statistically significant improvement in accuracy over all other alignment tools in every benchmark test, while maintaining practical running times. Moreover, all parameters for the program are derived through unsupervised training methods without making any manual adjustments. ProbCons is publicly available as a Web resource. Source code and executables are available as public domain software at http://probcons.stanford.edu.

Results

Algorithm overview

Fundamentally, ProbCons is a pair-hidden Markov model-based progressive alignment algorithm that primarily differs from most typical approaches in its use of maximum expected accuracy rather than Viterbi alignment, and of the probabilistic consistency transformation to incorporate multiple sequence conservation information during pairwise alignment. ProbCons uses the HMM shown in Figure 1 to specify the probability distribution over all alignments between a pair of sequences. Emission probabilities, which correspond to traditional substitution scores, are based on the BLOSUM62 matrix (Henikoff and Henikoff 1992). Transition probabilities, which correspond to gap penalties, are trained with unsupervised expectation maximization (EM).
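To give a feel for the consistency transformation mentioned in this overview, here is a minimal Python sketch. It assumes the commonly stated form of the update, in which each pairwise posterior matrix is replaced by the average over all sequences z of the matrix products $P_{xz} P_{zy}$, with $P_{xx}$ taken to be the identity; that formula is not spelled out in this excerpt, so treat it as an assumption. The dictionary layout and function names are hypothetical, and dense lists are used for readability only.

```python
def identity(n):
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]


def transpose(a):
    return [list(col) for col in zip(*a)]


def matmul(a, b):
    cols = list(zip(*b))
    return [[sum(p * q for p, q in zip(row, col)) for col in cols] for row in a]


def consistency_transform(post, lengths):
    """One round of the (assumed) probabilistic consistency update.

    post[(x, y)] holds the |x|-by-|y| posterior matrix for a sequence pair,
    stored once per unordered pair; lengths[x] is the length of sequence x.
    """
    names = sorted(lengths)

    def get(x, z):
        if x == z:
            return identity(lengths[x])  # a sequence aligns to itself
        return post[(x, z)] if (x, z) in post else transpose(post[(z, x)])

    out = {}
    for (x, y) in post:
        n, m = lengths[x], lengths[y]
        acc = [[0.0] * m for _ in range(n)]
        for z in names:
            prod = matmul(get(x, z), get(z, y))
            for i in range(n):
                for j in range(m):
                    acc[i][j] += prod[i][j]
        out[(x, y)] = [[v / len(names) for v in row] for row in acc]
    return out
```

In this toy form, a weak x–y pairing is raised when some third sequence z supports it through strong x–z and z–y pairings, which is the sense in which multiple-sequence information enters a pairwise score.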

ProbCons algorithm

Given m sequences, $S = \{s^{(1)}, \ldots, s^{(m)}\}$:

Step 1: Computation of posterior-probability matrices

For every pair of sequences $x, y \in S$ and all $i \in \{1, \ldots, |x|\}$, $j \in \{1, \ldots, |y|\}$, compute the matrix $P_{xy}$, where $P_{xy}(i, j) = P(x_i \sim y_j \in a^* \mid x, y)$ is the probability that letters $x_i$ and $y_j$ are paired in $a^*$, an alignment of x and y generated by the model.

Figure 1. Basic pair-HMM for sequence alignment between two sequences, x and y. State M emits two letters, one from each sequence, and corresponds to the two letters being aligned together. State $I_x$ emits a letter in sequence x that is aligned to a gap, and similarly state $I_y$ emits a letter in sequence y that is aligned to a gap. Finding the most likely alignment according to this model by using the Viterbi algorithm corresponds to applying Needleman–Wunsch with appropriate parameters. The logarithm of the emission probability function p(.,.) at M corresponds to a substitution scoring matrix, while affine gap penalty parameters can be derived from the transition probabilities $\delta$ and $\epsilon$ (Durbin et al. 1998).
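The posterior matrix of Step 1 can be obtained with the forward and backward recursions on the three-state model of Figure 1. The Python sketch below is one way to do this under stated assumptions: the start of the alignment is treated like state M, any state may end the alignment, there are no direct transitions between the two insertion states, and the emission functions and the values of `delta` and `epsilon` are toy placeholders on a four-letter alphabet rather than the trained ProbCons parameters.

```python
def toy_emit_pair(a, b):
    # Hypothetical match-state emissions on a 4-letter alphabet.
    return 0.2 if a == b else 0.2 / 12


def toy_emit_single(a):
    # Hypothetical insertion-state emissions (uniform background).
    return 0.25


def pair_hmm_posteriors(x, y, emit_pair=toy_emit_pair,
                        emit_single=toy_emit_single,
                        delta=0.02, epsilon=0.4):
    """Return (post, total), where post[i][j] is the posterior probability
    that x[i] and y[j] are emitted together by state M."""
    n, m = len(x), len(y)

    def zeros():
        return [[0.0] * (m + 1) for _ in range(n + 1)]

    # Forward pass: f*[i][j] sums over paths that emit x[:i], y[:j]
    # and end in the given state.
    fM, fX, fY = zeros(), zeros(), zeros()
    fM[0][0] = 1.0  # virtual start, treated like state M
    for i in range(n + 1):
        for j in range(m + 1):
            if i > 0 and j > 0:
                fM[i][j] = emit_pair(x[i - 1], y[j - 1]) * (
                    (1 - 2 * delta) * fM[i - 1][j - 1]
                    + (1 - epsilon) * (fX[i - 1][j - 1] + fY[i - 1][j - 1]))
            if i > 0:
                fX[i][j] = emit_single(x[i - 1]) * (
                    delta * fM[i - 1][j] + epsilon * fX[i - 1][j])
            if j > 0:
                fY[i][j] = emit_single(y[j - 1]) * (
                    delta * fM[i][j - 1] + epsilon * fY[i][j - 1])
    total = fM[n][m] + fX[n][m] + fY[n][m]

    # Backward pass: b*[i][j] sums over ways to emit x[i:], y[j:]
    # given the current state.
    bM, bX, bY = zeros(), zeros(), zeros()
    bM[n][m] = bX[n][m] = bY[n][m] = 1.0
    for i in range(n, -1, -1):
        for j in range(m, -1, -1):
            if i == n and j == m:
                continue
            to_m = emit_pair(x[i], y[j]) * bM[i + 1][j + 1] if i < n and j < m else 0.0
            to_x = emit_single(x[i]) * bX[i + 1][j] if i < n else 0.0
            to_y = emit_single(y[j]) * bY[i][j + 1] if j < m else 0.0
            bM[i][j] = (1 - 2 * delta) * to_m + delta * (to_x + to_y)
            bX[i][j] = (1 - epsilon) * to_m + epsilon * to_x
            bY[i][j] = (1 - epsilon) * to_m + epsilon * to_y

    post = [[fM[i][j] * bM[i][j] / total for j in range(1, m + 1)]
            for i in range(1, n + 1)]
    return post, total
```

Each entry divides the total weight of paths that pass through state M at cell (i, j) by the weight of all paths, so every row of the returned matrix sums to at most one. A production implementation would typically work in log space to avoid underflow on long sequences.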


1998), a local aligner using segment-based homology; (3) T-Coffee 1.37 (Notredame et al. 2000), a heuristic consistency-based aligner that combines global and local alignments; (4) MAFFT 3.88 (Katoh et al. 2002), a set of six scripts for performing multiple alignment with a variety of iterative refinement techniques; (5) MUSCLE 3.3 (Edgar 2004), a new aligner reporting the best published results on BAliBASE to date; and (6) Align-m 1.0 (Van Walle et al. 2004), a consistency-based method for computing all-pairs pairwise alignments of multiple sequences. Of the six scripts comprising the MAFFT alignment utilities, we chose to test nw-ns-i, the most accurate script. For Align-m 1.0, we used the parameter settings picked for testing the program in Van Walle et al. (2004). All other programs were run with default parameters.

Emission probabilities for the ProbCons HMM were adapted from the BLOSUM62 scoring matrix (Henikoff and Henikoff 1992). The default transition parameters of ProbCons were trained via unsupervised Expectation-Maximization (EM) on unaligned sequences from the BAliBASE benchmark database; thus, the tests on the PREFAB and SABmark databases provide external validation of the results shown on BAliBASE. The default options for the ProbCons program included applying two iterations of the consistency transformation and 100 rounds of iterative refinement for every alignment. We also experimented with a modified version of ProbCons (ProbCons-ext) in which the HMM model was extended to include an extra pair of insertion states ($I'_x$ and $I'_y$) to model long or terminal insertions.

The results of testing on the BAliBASE benchmark alignments database are shown in Table 1. To assess the significance of the differences in overall SP and CS scores, we performed a Friedman rank test for all pairs of programs; these results are summarized in Table 2. A typical BAliBASE alignment and its corresponding plot of column reliability are shown in Figure 2. The correlation between predicted and actual column reliability scores as shown in the diagram demonstrates the ability of pairwise posterior matrices to predict the expected proportion of correctly aligned residue pairs per column.

With the exception of Reference 4, ProbCons achieves the strongest performance in both SP and CS scores in all references. Reference 4 sequences are marked by long N/C-terminal extensions in which local alignment methods tend to be more successful, suggesting that incorporation of a local alignment probabilistic model into ProbCons might improve its performance on such sequences. Alternatively, we found that extending the HMM model with an extra pair of insertion states (ProbCons-ext) did improve BAliBASE performance in Reference 4; however, this addition roughly doubled the running time, with variable performance benefit in the other databases.^3

The results of testing six of the methods on the PREFAB database are shown in Table 3. Results for the Align-m program are omitted, since the program failed to complete all alignments in the PREFAB database. Again, ProbCons and ProbCons-ext demonstrate a strong lead in SP score although their running times are longer than those of the other aligners except for T-Coffee. This is due to the computation of all-pairs pairwise posterior probability matrices in the first step of the algorithm; other schemes for formulating probabilistic consistency that avoid this need for a quadratic number of initial alignments may be possible. The significance results for these values are given in Table 4.^4

The results of testing on the SABmark benchmark alignment database are shown in Table 5. Many of the same trends as found in the BAliBASE alignments are seen in SABmark, with the difference between ProbCons and the next best aligner in terms of $f_D$ (SP) scores even more exaggerated. It should be noted, however, that while the Align-m aligner lags far behind in SP score^5 (which may be thought of as a measure of sensitivity), its $f_M$ scores, which are the proportion of correctly predicted amino acid matches among all predicted matches (and which may be regarded as a measure of specificity), are the highest. Due to this disparity, it is difficult to make a precise quantitative statement regarding the relative performance of Align-m compared to the other methods without characterizing the sensitivity/specificity trade-off of each method, such as performed in a ROC analysis (Metz 1978).^6 Nevertheless, compared to all other aligners, ProbCons demonstrates significantly higher $f_D$ and $f_M$ scores overall, as seen in Table 6.

Table 1. Performance of aligners on the BAliBASE benchmark alignments database

               Ref 1 (82)   Ref 2 (23)   Ref 3 (12)   Ref 4 (12)   Ref 5 (12)   Overall (141)  Time
Aligner        SP    CS     SP    CS     SP    CS     SP    CS     SP    CS     SP    CS       (mm:ss)
Align-m        76.6  n/a    88.4  n/a    68.4  n/a    91.1  n/a    91.7  n/a    80.4  n/a      19:
DIALIGN        81.1  70.9   89.3  35.9   68.4  34.4   89.7  76.2   94.0  84.3   83.2  63.7     2:
CLUSTALW       86.1  77.3   93.2  56.8   75.3  46.0   83.4  52.2   85.9  63.8   86.1  68.0     1:
MAFFT          86.7  78.1   92.4  50.2   78.8  50.4   91.6  72.7   96.3  85.9   88.2  71.4     1:
T-Coffee       86.6  77.4   93.4  56.1   78.5  48.7   91.8  73.0   95.8  90.3   88.3  72.2     21:
MUSCLE         88.7  80.8   93.5  56.3   82.5  56.4   87.6  60.9   96.8  90.2   89.6  73.9     1:
ProbCons       90.1  82.6   94.4  61.3   84.1  61.3   90.1  72.3   97.9  91.9   91.0  77.2     5:
ProbCons-ext   90.0  82.5   94.2  59.1   84.3  61.1   93.8  81.0   98.1  92.2   91.2  77.6     8:

Columns show the average sum-of-pairs (SP) and column scores (CS) achieved by each aligner for each of the five BAliBASE references. All scores have been multiplied by 100. The number of sequences in each reference is given in parentheses. Overall numbers for the entire database are reported in addition to the total running time of each aligner for all 141 alignments. The best results in each column are shown in bold.

Table 2. Significance test for differences in BAliBASE performance

              Align-M        DIALIGN        CLUSTALW       MAFFT          T-Coffee       MUSCLE         ProbCons       ProbCons-ext
Align-M                      -(0.61)        -8.2 × 10^-6   -<10^-10       -<10^-10       -<10^-10       -<10^-10       -<10^-10
DIALIGN                                     -1.9 × 10^-5   -<10^-10       -<10^-10       -<10^-10       -<10^-10       -<10^-10
CLUSTALW                     +2.4 × 10^-3                  -1.0 × 10^-3   -3.0 × 10^-5   -4.9 × 10^-8   -6.1 × 10^-10  -<10^-10
MAFFT                        +1.2 × 10^-9   +1.0 × 10^-3                  -(0.65)        -1.7 × 10^-5   -2.6 × 10^-9   -4.9 × 10^-8
T-Coffee                     +<10^-10       +8.4 × 10^-6   -(0.92)                       -7.0 × 10^-3   -1.5 × 10^-6   -8.4 × 10^-6
MUSCLE                       +<10^-10       +1.9 × 10^-8   +9.6 × 10^-6   +1.7 × 10^-3                  -3.0 × 10^-3   -6.6 × 10^-3
ProbCons                     +<10^-10       +<10^-10       +1.6 × 10^-7   +1.9 × 10^-6   +0.012                        +0.
ProbCons-ext                 +<10^-10       +<10^-10       +8.3 × 10^-6   +3.2 × 10^-5   +(0.092)       -(0.088)

Entries show the p-value indicating the significance of a difference in performance between two alignment methods as measured using a Friedman rank test. Nonitalicized values above the diagonal were calculated using SP scores on all alignments, whereas italicized values (here, those below the diagonal) were computed using CS scores. (+) Method on the left had lower average rank (better performance); (-) Method on the left had higher average rank (worse performance); parentheses denote (nonsignificant) p-values >0.05.
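The accuracy measures used in these comparisons can be made concrete with a small Python sketch. It computes the sum-of-pairs score (SP, equivalently $f_D$: the fraction of reference residue pairs recovered), $f_M$ (the fraction of predicted residue pairs that are correct), and the column score (CS: the fraction of reference columns reproduced exactly). The function names and the gapped-string input format are assumptions for illustration, and, unlike the benchmark protocols, the sketch scores every column rather than only annotated core blocks; it also assumes each alignment contains at least one residue pair.

```python
def residue_pairs(alignment):
    """Set of residue pairs placed in the same column.

    alignment is a list of equal-length gapped strings ('-' marks a gap).
    Each residue is identified as (sequence index, ungapped position).
    """
    counters = [0] * len(alignment)
    pairs = set()
    for col in zip(*alignment):
        present = []
        for s, ch in enumerate(col):
            if ch != "-":
                present.append((s, counters[s]))
                counters[s] += 1
        for a in range(len(present)):
            for b in range(a + 1, len(present)):
                pairs.add((present[a], present[b]))
    return pairs


def columns(alignment):
    """Set of columns, each a tuple of ungapped positions (None for a gap)."""
    counters = [0] * len(alignment)
    cols = set()
    for col in zip(*alignment):
        key = []
        for s, ch in enumerate(col):
            if ch == "-":
                key.append(None)
            else:
                key.append(counters[s])
                counters[s] += 1
        cols.add(tuple(key))
    return cols


def alignment_scores(test, reference):
    """Return (sp, fm, cs) for a test alignment against a reference."""
    ref_pairs = residue_pairs(reference)
    test_pairs = residue_pairs(test)
    correct = len(ref_pairs & test_pairs)
    sp = correct / len(ref_pairs)    # sensitivity-like (f_D)
    fm = correct / len(test_pairs)   # specificity-like (f_M)
    ref_cols = columns(reference)
    cs = len(ref_cols & columns(test)) / len(ref_cols)
    return sp, fm, cs
```

For instance, a cautious aligner that predicts few matches can reach a high $f_M$ while scoring poorly on SP, which is the sensitivity/specificity disparity noted for Align-m.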

Comparison of ProbCons variants

To understand the features of ProbCons that give it a strong increase in performance, we compared several ProbCons variants on the “Twilight Zone” set from the SABmark alignment database. In particular, we examined the effects of four main algorithmic changes: (1) using the Viterbi algorithm to compute the highest probability alignment, instead of the highest expected accuracy alignment that is computed by ProbCons; (2) using the posterior probability matrices generated by ProbCons to produce all-pairs pairwise alignments instead of full multiple alignments; (3) varying the number of applications of consistency transformation applied before alignment; and (4) omitting the application of iterative refinement to optimize the alignment with respect to the sum-of-pairs probabilistic consistency metric. In this article, we have omitted a full comparison of expected accuracy

^3 Previous results on the BAliBASE 2.01 benchmark alignments database reported in an abstract (Do et al. 2004), which correspond to the ProbCons-ext program, differ slightly from those shown in the text. These small differences are attributable to (1) a change in the methods used for extracting BAliBASE core blocks as suggested by Robert C. Edgar (pers. comm.), and (2) minor changes in the HMM model and training procedure for the current version of ProbCons.

^4 The results for the nw-ns-i script from MAFFT on the PREFAB database given in Edgar (2004) contain an editing error (R.C. Edgar, pers. comm.); the values shown here are correct. Interestingly, although MAFFT achieves a slightly higher overall average SP score than MUSCLE, a Friedman rank test indicates that MUSCLE consistently produces better alignments than MAFFT (see Table 4).

^5 The numbers reported for the Align-m aligner are similar to those given in Edgar (2004), but differ from the results reported in Van Walle et al. (2004). The primary reason for this difference is that the averages in the latter study were computed across all SABmark pairwise alignments; this fails to account for dependencies within each subset, so the weight of each subset scales quadratically with the number of sequences present. We avoid this by averaging pairwise alignment scores within each subset before averaging all subset scores.

^6 While a ROC analysis would better characterize aligner performance, properly defining sensitivity and specificity measures for alignment accuracy involves subtle issues regarding the alignability of particular positions in sequences. Furthermore, the appropriate manner for adjusting program parameters so as to observe the sensitivity/specificity trade-off for the expected accuracy alignment algorithm is also an open problem. We leave these questions for future work.

Table 3. Performance of aligners on the PREFAB protein reference alignment benchmark

Aligner        Overall (1927)   Time
DIALIGN        57.2             12 h, 25 min
CLUSTALW       58.9             2 h, 57 min
T-Coffee       63.6             144 h, 51 min
MUSCLE         64.8             3 h, 11 min
MAFFT          64.8             2 h, 36 min
ProbCons       66.9             19 h, 41 min
ProbCons-ext   68.0             37 h, 46 min

Entries show the average Q (equivalent to SP) score achieved by each aligner on all 1927 alignments of the PREFAB database. All scores have been multiplied by 100. Running times for programs over the entire database are given for each program in hours and minutes. The best results in each column are shown in bold.

Figure 2. Column reliability plot for 1csy_ref1 from BAliBASE, Reference 1. The red line and solid regions indicate the predicted and actual proportion of correct pairwise matches at each alignment position, respectively. All column reliability values have been multiplied by 100. Below, the actual ProbCons alignment is shown with core block residues highlighted in green. Note that only pairwise matches in core block regions of the BAliBASE alignment are considered correct when computing the “actual” proportion of correct pairwise matches; however, some residues outside of the core block regions may also be alignable. Thus, regions in which predicted homology exceeds actual homology do not necessarily indicate overprediction of homology by the aligner.


racy are the use of maximum expected accuracy as an objective function and the application of the probabilistic consistency transformation. The methodology employed in developing the ProbCons algorithm is straightforward and widely applicable: (1) specify an appropriate quality measure and (2) maximize its expected value according to the probability distribution given by the model. For example, the accuracy measure used in this article maximizes the expected number of correct matches in an alignment; if one is concerned about overprediction of matches, one may use an alternative objective function that penalizes overprediction of matches and, provided it is easily decomposable, derive the corresponding optimization algorithm. Exploring this framework provides a novel and exciting direction for future work in pursuing even higher accuracy alignment approaches.

The principles employed, however, are not unique to sequence alignment alone. As an example, consider the related problem of motif finding among a set of divergent sequences. Consistency-based approaches have previously been applied to motif-finding tasks with strong empirical results (Heger et al. 2003). A more principled algorithm based on probabilistic consistency may further increase the sensitivity of motif detection methods. Comparative gene finding and RNA or protein structural prediction methods may also benefit from a probabilistic consistency-based approach.

Methods

The ProbCons algorithm works by (1) computing posterior-probability matrices, (2) computing expected accuracies for each pairwise comparison, (3) applying the probabilistic consistency transformation, (4) computing an expected accuracy guide tree, and (5) performing progressive alignment. As a default, we also perform iterative refinement as a post-processing step. In the subsections that follow, we consider each of these steps in greater detail, describe the EM training procedure used to obtain parameters for the ProbCons HMM, and present a novel technique for estimating column reliability scores based on the alignment scoring matrices.

1. Posterior probability matrices

Let $x$ and $y$ be two proteins represented as character strings in which $x_i$ is the $i$th amino acid of $x$. Consider the pair-HMM given in Figure 1, where $A$ is the space of all possible alignments of $x$ and $y$. An alignment $a$ corresponds uniquely to a sequence of state-emission pairs, $\langle s_1, o_1 \rangle, \ldots, \langle s_n, o_n \rangle$. The probability of $a$ is given by

$$P(a \mid x, y) = \pi(s_1) \prod_{i=1}^{n-1} \alpha(s_i \to s_{i+1}) \prod_{i=1}^{n} \beta(o_i \mid s_i),$$

where $\pi(s)$ is the initial probability of starting in state $s$, $\alpha(s_i \to s_{i+1})$ is the transition probability from $s_i$ to $s_{i+1}$, and $\beta(o_i \mid s_i)$ is the emission probability for either a single letter or aligned residue pair $o_i$ in the state $s_i$.

In the derivation which follows, let $a^*$ be the (unknown) alignment from $A$ that most nearly represents the “true” biological alignment of $x$ and $y$. Ideally, we wish to determine $a^*$ based on the sequence information in $x$ and $y$ alone. To do this, we use the distribution $P(A \mid x, y)$ to represent our beliefs regarding $a^*$; i.e., we assume that $P(a \mid x, y)$ is the probability that an alignment $a$ is equal to $a^*$. Let the notation $x_i \sim y_j \in a$ denote the event that two positions $x_i$ and $y_j$ are matched in an alignment $a$. Formally, the posterior probability of $x_i \sim y_j \in a^*$ is

$$P(x_i \sim y_j \in a^* \mid x, y) = \sum_{a \in A} P(a \mid x, y)\, \mathbf{1}\{x_i \sim y_j \in a\},$$

where the common indicator notation $\mathbf{1}\{\mathit{condition}\}$ is used to define a function that evaluates to 1 whenever condition is true and 0 otherwise. Then, the posterior probability matrix $P_{xy}$ for the alignment of $x$ and $y$ is a table of $P(x_i \sim y_j \in a^* \mid x, y)$ values for $1 \le i \le |x|$, $1 \le j \le |y|$. The ProbCons algorithm begins by calculating these posterior probability matrices using a modification of the Forward and Backward algorithms for computing posterior probabilities in pair-HMMs as described in Durbin et al. (1998). This computation step takes time $O(m^2 L^2)$, where $m$ is the number of sequences and $L$ is the length of each sequence.

Table 6. Significance test for differences in SABmark performance

              Align-M      DIALIGN      CLUSTALW     MAFFT        T-Coffee     MUSCLE       ProbCons     ProbCons-ext
Align-M                    -<10^-10     -<10^-10     -<10^-10     -<10^-10     -<10^-10     -<10^-10     -<10^-10
DIALIGN       -<10^-10                  -<10^-10     -<10^-10     -<10^-10     -<10^-10     -<10^-10     -<10^-10
CLUSTALW      -<10^-10     -<10^-10                  -0.02        -0.01        -7.5×10^-6   -<10^-10     -<10^-10
MAFFT         -<10^-10     -<10^-10     +(0.083)                  -1.5×10^-5   -<10^-10     -<10^-10     -<10^-10
T-Coffee      -<10^-10     -2.5×10^-3   +<10^-10     +<10^-10                  -0.052       -<10^-10     -<10^-10
MUSCLE        -<10^-10     -1.2×10^-7   +<10^-10     +1.2×10^-4   -1.5×10^-5                -<10^-10     -<10^-10
ProbCons      -<10^-10     +<10^-10     +<10^-10     +<10^-10     +<10^-10     +<10^-10                  +6.4×10^-4
ProbCons-ext  -<10^-10     +<10^-10     +<10^-10     +<10^-10     +<10^-10     +<10^-10     +(0.31)

Entries show the p-value indicating the significance of a difference in performance between two alignment methods as measured using a Friedman rank test. Values above the diagonal (nonitalicized in the original) were calculated using $f_D$ (SP) scores on all alignments, whereas values below the diagonal (italicized in the original) were computed using $f_M$ scores. (+) Method on the left had lower average rank (better performance); (-) Method on the left had higher average rank (worse performance); parentheses denote (nonsignificant) p-values >0.05.

Table 7. Performance of ProbCons Variants on SABmark “Twilight Zone” set

Algorithm      c   ir    Output     fD     fM     Time (mm:ss)
1. Viterbi     0   0     Pairwise   27.5   17.2   0:
2. Posterior   0   0     Pairwise   29.6   18.5   2:
3. Posterior   1   0     Pairwise   32.5   20.4   3:
4. Posterior   2   0     Pairwise   33.2   21.0   3:
5. Posterior   0   0     Multiple   29.1   19.8   2:
6. Posterior   1   0     Multiple   30.9   20.8   3:
7. Posterior   2   0     Multiple   31.5   21.3   3:
8. Posterior   0   100   Multiple   30.6   20.8   4:
9. Posterior   2   100   Multiple   32.1   21.7   5:

The first column indicates whether the Viterbi algorithm (highest probability alignment) or posterior decoding (maximal expected accuracy alignment) was used. The next two columns indicate c, the number of iterations of the consistency transformation used, and ir, the number of rounds of iterative refinement used as post-processing. The fourth column indicates whether ProbCons was set to generate all-pairs pairwise alignments or consistent multiple alignments. The next two columns show the average developer ($f_D$) score (equivalent to sum-of-pairs [SP] score) and modeler ($f_M$) score achieved by each aligner for the “Twilight Zone” set in the SABmark database. The last column gives the total running time for each method over all 236 alignments. All scores have been multiplied by 100. Note that the last row corresponds to the parameter settings that are the default in the ProbCons program. The best results in each column are shown in bold.
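To make the posterior match probabilities of Section 1 concrete, the following Python sketch runs the Forward and Backward recursions on a three-state pair-HMM and normalizes their product. It is an illustration only, not the ProbCons implementation: the state layout (match state `M`, gap states `X` and `Y` with no direct transitions between them), the parameter values `delta` and `epsilon`, and the toy emission scores are all assumptions made for this example.

```python
STATES = ("M", "X", "Y")  # M: x_i matched to y_j; X: x_i unaligned; Y: y_j unaligned
STEP = {"M": (1, 1), "X": (1, 0), "Y": (0, 1)}  # residues of (x, y) consumed per state


def toy_model(delta=0.1, epsilon=0.4):
    """Hypothetical pair-HMM parameters chosen for illustration only."""
    init = {"M": 1 - 2 * delta, "X": delta, "Y": delta}
    trans = {
        "M": {"M": 1 - 2 * delta, "X": delta, "Y": delta},
        "X": {"M": 1 - epsilon, "X": epsilon, "Y": 0.0},
        "Y": {"M": 1 - epsilon, "X": 0.0, "Y": epsilon},
    }
    return init, trans


def posterior_matrix(x, y, delta=0.1, epsilon=0.4):
    """Return P[i][j], the posterior probability that x[i] is matched to y[j]."""
    init, trans = toy_model(delta, epsilon)
    n, m = len(x), len(y)

    def emit(state, i, j):
        # Toy emission of `state` when it ends at prefix lengths (i, j).
        if state == "M":
            return 0.04 if x[i - 1] == y[j - 1] else 0.002
        return 0.05  # a single unaligned residue

    # Forward: fwd[s][i][j] sums all paths emitting x[:i], y[:j] and ending in s.
    fwd = {s: [[0.0] * (m + 1) for _ in range(n + 1)] for s in STATES}
    for i in range(n + 1):
        for j in range(m + 1):
            for s in STATES:
                pi_, pj = i - STEP[s][0], j - STEP[s][1]
                if pi_ < 0 or pj < 0:
                    continue
                if pi_ == 0 and pj == 0:
                    incoming = init[s]  # first state of the path
                else:
                    incoming = sum(fwd[t][pi_][pj] * trans[t][s] for t in STATES)
                fwd[s][i][j] = incoming * emit(s, i, j)

    # Backward: bwd[s][i][j] sums all ways to emit the remaining residues.
    bwd = {s: [[0.0] * (m + 1) for _ in range(n + 1)] for s in STATES}
    for i in range(n, -1, -1):
        for j in range(m, -1, -1):
            for s in STATES:
                if i == n and j == m:
                    bwd[s][i][j] = 1.0
                    continue
                total = 0.0
                for t in STATES:
                    ni, nj = i + STEP[t][0], j + STEP[t][1]
                    if ni <= n and nj <= m:
                        total += trans[s][t] * emit(t, ni, nj) * bwd[t][ni][nj]
                bwd[s][i][j] = total

    z = sum(fwd[s][n][m] for s in STATES)  # total weight of all alignments
    return [[fwd["M"][i][j] * bwd["M"][i][j] / z for j in range(1, m + 1)]
            for i in range(1, n + 1)]
```

Because each residue of x is matched to at most one residue of y in any alignment, every row of the returned matrix sums to at most 1; for two identical short sequences under these toy parameters, nearly all of the mass sits on the diagonal.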

2. Maximal expected accuracy alignment

Most alignment schemes build an “optimal” pairwise alignment by finding the highest probability alignment using the Viterbi algorithm. In this approach, one computes $\arg\max_a P(a \mid x, y)$, which may be alternatively written as $\arg\max_a E_{a^*}[\mathbf{1}\{a = a^*\} \mid x, y]$; that is, the Viterbi algorithm finds the alignment whose probability of being exactly equal to $a^*$ is optimal. When the odds of recovering the exact correct alignment are low but partially correct alignments are still useful, this is not necessarily the best choice. In this work, we explore an alternative strategy that finds the alignment $a$ that does not maximize the probability of $a = a^*$ but rather tries to guarantee high accuracy for $a$, which we define with respect to the alignment $a^*$ as

$$\mathrm{accuracy}(a, a^*) = \frac{1}{\min\{|x|, |y|\}} \sum_{x_i \sim y_j \in a} \mathbf{1}\{x_i \sim y_j \in a^*\}.$$
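As a small illustration of this accuracy measure, the Python sketch below counts the matched pairs of an alignment that also appear in a reference alignment and divides by the length of the shorter sequence. Representing an alignment as a collection of 0-based `(i, j)` index pairs is an assumption of this example, not something prescribed by the text.

```python
def accuracy(a, a_star, len_x, len_y):
    """Fraction of matches in `a` that also occur in `a_star`,
    normalized by min(|x|, |y|) as in the definition above."""
    shared = set(a) & set(a_star)  # pairs (i, j) matched in both alignments
    return len(shared) / min(len_x, len_y)


# Hypothetical example: x has 3 residues, y has 4.
predicted = [(0, 0), (1, 1), (2, 3)]
reference = [(0, 0), (1, 2), (2, 3)]
score = accuracy(predicted, reference, 3, 4)  # two of three possible matches agree
```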

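During the alignment process, however, $a^*$ is not known, so we instead maximize the expected accuracy of the reported alignment. Computing this quantity is straightforward since

$$\begin{aligned}
E_{a^*}\bigl(\mathrm{accuracy}(a, a^*) \mid x, y\bigr)
&= \sum_{a' \in A} P(a' \mid x, y)\, \frac{\sum_{x_i \sim y_j \in a} \mathbf{1}\{x_i \sim y_j \in a'\}}{\min\{|x|, |y|\}} \\
&= \frac{\sum_{x_i \sim y_j \in a} \sum_{a' \in A} P(a' \mid x, y)\, \mathbf{1}\{x_i \sim y_j \in a'\}}{\min\{|x|, |y|\}} \\
&= \frac{1}{\min\{|x|, |y|\}} \sum_{x_i \sim y_j \in a} P(x_i \sim y_j \in a^* \mid x, y),
\end{aligned}$$

where $a'$ ranges over all alignments in $A$.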

Using this decomposition, we compute the maximal expected accuracy alignment by a simple variant of the Needleman–Wunsch algorithm, where all match/mismatch scores are given by the posterior probability terms for corresponding letters and gap penalties are set to zero. This form of alignment bears strong resemblance to the problem of finding the maximum weight trace of a matrix (Kececioglu 1993), and a similar scheme is used to compute final progressive alignments in the T-Coffee program.
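The dynamic program described here can be sketched in a few lines of Python. The code below is a minimal illustration under stated assumptions rather than the ProbCons source: the function name `mea_align` and the dense list-of-lists posterior matrix are choices made for this example. Match scores are the posterior probabilities and gaps cost nothing, so the recurrence simply takes the best of a match move and the two gap moves.

```python
def mea_align(post):
    """Maximal expected accuracy alignment from a posterior matrix.

    post[i][j] is the posterior probability that x[i] is matched to y[j].
    Returns (matched index pairs, expected accuracy)."""
    n = len(post)
    m = len(post[0]) if n else 0
    score = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + post[i - 1][j - 1],  # match x[i-1] with y[j-1]
                score[i - 1][j],                            # gap, zero penalty
                score[i][j - 1],                            # gap, zero penalty
            )

    # Traceback: a cell is a match only if neither gap move explains its score.
    pairs = []
    i, j = n, m
    while i > 0 and j > 0:
        if score[i][j] == score[i - 1][j]:
            i -= 1
        elif score[i][j] == score[i][j - 1]:
            j -= 1
        else:
            pairs.append((i - 1, j - 1))
            i -= 1
            j -= 1
    pairs.reverse()

    shorter = min(n, m)
    expected_accuracy = score[n][m] / shorter if shorter else 0.0
    return pairs, expected_accuracy


# Hypothetical posterior matrix for a 2-residue x and a 3-residue y.
pairs, acc = mea_align([[0.9, 0.05, 0.0],
                        [0.1, 0.8, 0.05]])
```

In this toy case the alignment pairs x[0] with y[0] and x[1] with y[1], and the expected accuracy is the summed posterior (0.9 + 0.8) divided by the shorter length, 2.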

3. Probabilistic consistency transformation

In the previous section, we described a method for performing pairwise sequence alignment of two sequences $x$ and $y$ based on computing $P(x_i \sim y_j \in a^* \mid x, y)$ values for all positions in $x$ and $y$, and subsequently using these posterior probabilities as match/mismatch scores in a Needleman–Wunsch-like alignment procedure. In this section we introduce probabilistic consistency, a method for obtaining more accurate substitution scores when a third homologous sequence $z$ is available.

One way to use sequence $z$ is to generalize the pair-HMM given in Figure 1 to a triple-HMM that parameterizes a conditional distribution over three-sequence alignments of $x$, $y$, and $z$, and similarly generalize the previous formulas for expected accuracy to handle three-way alignments. Such an approach, however, leads to impractical $O(L^3)$ algorithms for computing posterior matrices of sequences of length $L$. Here, we follow a heuristic approach that allows us to derive an algorithm with an approximately $O(L^2)$ running time.

For a sequence $z$, let $z_{(k,k+1)}$ denote the interletter regions (or gaps) between amino acids $k$ and $k+1$ of $z$ for $0 \le k \le |z|$ (where $z_{(0,1)}$ and $z_{(|z|,|z|+1)}$ denote the gaps at the beginning and ends of $z$). Generalizing our notation for posterior probabilities of matches, an alternative estimate for the quality of an $x_i \sim y_j$ match is given by the marginalized probability,

$$P(x_i \sim y_j \in a^* \mid x, y, z) = \sum_{z_k} P(x_i \sim y_j \sim z_k \in a^* \mid x, y, z) + \sum_{z_{(k,k+1)}} P(x_i \sim y_j \sim z_{(k,k+1)} \in a^* \mid x, y, z),$$

where $a^*$ now refers to a three-sequence alignment of $x$, $y$, and $z$. We refer to the concept of re-estimating pairwise alignment match quality scores based on three-sequence information as probabilistic consistency.

As stated, computing $P(x_i \sim y_j \in a^* \mid x, y, z)$ values for each $x_i \sim y_j$ pair requires $O(L^3)$ time for the Forward and Backward algorithms (given an appropriate three-sequence HMM); to avoid this, we simplify the computation as follows. First, we heuristically ignore the second summation over gaps in $z$ to get

$$\sum_{z_k} P(x_i \sim y_j \sim z_k \in a^* \mid x, y, z).$$

Second, we change the inner condition to an equivalent expression,

$$\sum_{z_k} P(x_i \sim z_k \in a^* \wedge z_k \sim y_j \in a^* \mid x, y, z).$$

Then, we use the chain rule to factorize each inner term of the summation to obtain

$$\sum_{z_k} P(x_i \sim z_k \in a^* \mid x, y, z)\, P(z_k \sim y_j \in a^* \mid x, y, z,\ x_i \sim z_k \in a^*).$$

Finally, we make heuristic independence assumptions to get

$$\sum_{z_k} P(x_i \sim z_k \in a^* \mid x, z)\, P(z_k \sim y_j \in a^* \mid z, y).$$

This latter expression still requires $O(L^3)$ time to be computed. Now, however, we transform the $P_{xz}$ and $P_{zy}$ matrices into sparse matrices by discarding all values smaller than a threshold (by default, 0.01). For alignable sequences, posterior probability alignment matrices tend to be sparse, with most entries near zero, so this step is justified. This effectively reduces the probabilistic consistency re-estimation step to sparse matrix multiplication; therefore, $P_{xy}$ is re-estimated in time $O(c^2 L)$, where $c$ is the average number of nonzero elements per row (typically $1 \le c \le 5$ in practice).

With the procedure described above, we can align two sequences given information from a third sequence. To align two sequences $x$ and $y$ given a set of sequences, $S$, we would ideally like to estimate $P(x_i \sim y_j \in a^* \mid S)$. In practice, we use the following heuristic decomposition:

$$\frac{1}{|S|} \sum_{z \in S} \sum_{z_k} P(x_i \sim z_k \in a^* \mid x, z)\, P(z_k \sim y_j \in a^* \mid z, y),$$

where we set $P(x_i \sim x_j \mid x)$ to 1 if $i = j$ and 0 otherwise (this covers the terms in which $z$ is $x$ or $y$ itself).
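The re-estimation step can be illustrated with a short Python sketch that stores each posterior matrix as a list of sparse rows (dictionaries from column index to probability). This is a sketch under assumptions, not the ProbCons code: the names `sparsify`, `sparse_product`, and `consistency_transform`, and the dictionary `post` keyed by ordered pairs of sequence names, are hypothetical choices for this example. When $z$ is $x$ or $y$, the identity convention above makes the inner sum equal to the current $P_{xy}$, so the code adds $P_{xy}$ directly for those two terms.

```python
def sparsify(matrix, threshold=0.01):
    """Drop entries smaller than `threshold`; keep rows as {column: probability}."""
    return [{j: p for j, p in enumerate(row) if p >= threshold} for row in matrix]


def sparse_product(pxz, pzy):
    """Row-wise sparse product: out[i][j] = sum_k pxz[i][k] * pzy[k][j]."""
    out = []
    for row in pxz:
        acc = {}
        for k, p_ik in row.items():
            for j, p_kj in pzy[k].items():
                acc[j] = acc.get(j, 0.0) + p_ik * p_kj
        out.append(acc)
    return out


def consistency_transform(x, y, names, post):
    """Re-estimate the sparse posterior matrix for (x, y) from all z in `names`.

    post[(a, b)] holds the sparse posterior matrix for sequences a and b."""
    acc = [{} for _ in post[(x, y)]]
    for z in names:
        if z == x or z == y:
            term = post[(x, y)]  # identity factor: the sum reduces to P_xy
        else:
            term = sparse_product(post[(x, z)], post[(z, y)])
        for i, row in enumerate(term):
            for j, p in row.items():
                acc[i][j] = acc[i].get(j, 0.0) + p
    return [{j: p / len(names) for j, p in row.items()} for row in acc]


# Hypothetical posteriors for three sequences of length 2.
post = {
    ("x", "y"): sparsify([[0.5, 0.0], [0.0, 0.5]]),
    ("x", "z"): sparsify([[1.0, 0.0], [0.0, 1.0]]),
    ("z", "y"): sparsify([[0.8, 0.0], [0.0, 0.6]]),
}
new_pxy = consistency_transform("x", "y", ["x", "y", "z"], post)
```

Here the support from the third sequence raises the first diagonal entry from 0.5 to (0.5 + 0.5 + 0.8) / 3 = 0.6, which is the kind of reinforcement the transformation is meant to provide.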

ProbCons multiple alignment tool

Genome Research 337

and Ohlsson 2002). It is clear that both of these approaches deal with many of the questions answered by match posterior probabilities (Miyazawa 1995; Kschischo and Lässing 2000), which represent the likelihood that specific pairs of residues are aligned.

In the multiple alignment case, one possible generalization is to estimate the expected proportion of correct pairwise matches in each column of the alignment. Given a set $C$ of the aligned residues in a particular column, this expected proportion of correct pairwise matches is given by

$$\binom{|C|}{2}^{-1} \sum_{x_i, y_j \in C,\ x \neq y} P(x_i \sim y_j \in a^* \mid S),$$

which we approximate using the pairwise posterior matrices calculated in Step 1. Though this is certainly not the only possible measure of column reliability based on posterior probabilities, we leave extensions of this method as future work.
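A column reliability score of this form can be sketched in Python as follows. The data layout is an assumption made for illustration: `column` maps each sequence name that has a residue in the column to that residue's 0-based index, and `post[(a, b)]` is a dense pairwise posterior matrix keyed with the two names in sorted order. The sketch also assumes that each unordered pair of residues is counted once, which is what the normalization by the number of pairs suggests.

```python
from itertools import combinations


def column_reliability(column, post):
    """Average pairwise posterior over all residue pairs in one alignment column.

    column: {sequence name: residue index} for sequences without a gap here.
    post[(a, b)][i][j]: posterior that residue i of a matches residue j of b,
    with keys ordered so that a < b."""
    members = sorted(column.items())
    if len(members) < 2:
        return 0.0  # no pairs to score in this column
    total = 0.0
    for (a, i), (b, j) in combinations(members, 2):
        total += post[(a, b)][i][j]
    n_pairs = len(members) * (len(members) - 1) // 2
    return total / n_pairs


# Hypothetical column containing x[0], y[1], and z[0].
post = {
    ("x", "y"): [[0.1, 0.9]],
    ("x", "z"): [[0.6]],
    ("y", "z"): [[0.0], [0.3]],
}
reliability = column_reliability({"x": 0, "y": 1, "z": 0}, post)
```

For this column the three pairwise posteriors are 0.9, 0.6, and 0.3, so the score is their mean, 0.6; multiplying by 100 gives values on the scale used in the column reliability plot.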

Acknowledgments

The authors thank Arend Sidow and Robert Edgar for useful discussions and Sandhya Kunnatur for help in program development. C.B.D. was partly supported by a Siebel Fellowship. M.B. was partly supported by an NSF Graduate Fellowship. Work in the Batzoglou laboratory is supported in part by NSF grant EF-0312459, NIH grant U01-HG003162, the NSF CAREER Award, and the Alfred P. Sloan Fellowship.

References

Altschul, S.F. 1991. Amino acid substitution matrices from an information theoretic perspective. J. Mol. Biol. 219: 555–565.
Altschul, S.F., Carroll, R.J., and Lipman, D.J. 1989. Weights for data related by a tree. J. Mol. Biol. 207: 647–653.
Altschul, S.F., Madden, T.L., Schaffer, A.A., Zhang, J., Zhang, Z., Miller, W., and Lipman, D.J. 1997. Gapped BLAST and PSI-BLAST: A new generation of protein database search programs. Nucleic Acids Res. 25: 3389–3402.
Attwood, T.K. 2002. The PRINTS database: A resource for identification of protein families. Brief. Bioinform. 3: 252–263.
Bateman, A., Coin, L., Durbin, R., Finn, R.D., Hollich, V., Griffiths-Jones, S., Khanna, A., Moxon, M.M., Sonnhammer, E.L., Studholme, D.J., et al. 2004. The Pfam protein families database. Nucleic Acids Res. 32: D138–D141.
Berger, M.P. and Munson, P.J. 1991. A novel randomized iterative strategy for aligning multiple protein sequences. Comput. Appl. Biosci. 7: 479–484.
Boutonnet, N.S., Rooman, M.J., Ochagavia, M.E., Richelle, J., and Wodak, S.J. 1995. Optimal protein structure alignments by multiple linkage clustering: Application to distantly related proteins. Protein Eng. 8: 647–662.
Brenner, S.E., Koehl, P., and Levitt, M. 2000. The ASTRAL compendium for protein structure and sequence analysis. Nucleic Acids Res. 28: 254–256.
Carrillo, H. and Lipman, D. 1988. The multiple sequence alignment problem in biology. SIAM J. Appl. Math. 48: 1073–1082.
Castillo-Davis, C.I., Kondrashov, F.A., Hartl, D.L., and Kulathinal, R.J. 2004. The functional genomic distribution of protein divergence in two animal phyla: Coevolution, genomic conflict, and constraint. Genome Res. 14: 802–811.
Chao, K.-M., Hardison, R.C., and Miller, W. 1993. Locating well-conserved regions within a pairwise alignment. Comput. Appl. Biosci. 9: 387–396.
Dayhoff, M.O., Schwartz, R.M., and Orcutt, B.C. 1978. A model of evolutionary change in proteins. In Atlas of protein sequence and structure, vol. 5, Suppl. 2, pp. 345–352. National Biomedical Research Foundation, Washington, D.C.
Do, C.B., Brudno, M., and Batzoglou, S. 2004. ProbCons: Probabilistic consistency-based multiple alignment of amino acid sequences. In Proceedings of the Thirteenth National Conference on Artificial Intelligence, pp. 703–708. AAAI Press, San Jose, CA.
Durbin, R., Eddy, S., Krogh, A., and Mitchison, G. 1998. Biological sequence analysis. Cambridge University Press, Cambridge, UK.
Eddy, S.R. 1995. Multiple alignment using hidden Markov models. In Proceedings of the Third International Conference on Intelligent Systems in Molecular Biology, pp. 114–120. AAAI Press, Cambridge, UK.
Edgar, R.C. 2004. MUSCLE: Multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res. 32: 1792–1797.
Feng, D.F. and Doolittle, R.F. 1987. Progressive sequence alignment as a prerequisite to correct phylogenetic trees. J. Mol. Evol. 25: 351–360.
Gonnet, G.H., Cohen, M.A., and Benner, S.A. 1992. Exhaustive matching of the entire protein sequence database. Science 256: 1443–1445.
Gotoh, O. 1982. An improved algorithm for matching biological sequences. J. Mol. Biol. 162: 705–708.
Gotoh, O. 1990. Consistency of optimal sequence alignments. Bull. Math. Biol. 52: 509–525.
Gotoh, O. 1996. Significant improvement in accuracy of multiple protein sequence alignments by iterative refinement as assessed by reference to structural alignments. J. Mol. Biol. 264: 823–838.
Heger, A., Lappe, M., and Holm, L. 2003. Accurate detection of very sparse sequence motifs. In Proceedings of the Seventh Annual International Conference on Computational Molecular Biology, pp. 139–147. ACM Press, Berlin, Germany.
Henikoff, S. and Henikoff, J.G. 1992. Amino acid substitution matrices from protein blocks. Proc. Nat. Acad. Sci. 89: 10915–10919.
Holm, L. and Sander, C. 1994. The FSSP database of structurally aligned protein fold families. Nucleic Acids Res. 22: 3600–3609.
Holmes, I. and Durbin, R. 1998. Dynamic programming alignment accuracy. J. Comput. Biol. 5: 493–504.
Huang, X. and Miller, W. 1991. A time-efficient, linear space local similarity algorithm. Adv. Appl. Math. 12: 337–357.
Jaroszewski, L., Li, W., and Godzik, A. 2002. In search for more accurate alignments in the twilight zone. Protein Sci. 11: 1702–1713.
Johnson, J.M. and Church, G.M. 1999. Alignment and structure prediction of divergent protein families: Periplasmic and outer membrane proteins of bacterial efflux pumps. J. Mol. Biol. 287: 695–715.
Jones, D.T. 1999. Protein secondary structure prediction based on position-specific scoring matrices. J. Mol. Biol. 292: 195–202.
Katoh, K., Misasa, K., Kuma, K., and Miyata, T. 2002. MAFFT: A novel method for rapid multiple sequence alignment based on fast Fourier transform. Nucleic Acids Res. 30: 3059–3066.
Kececioglu, J. 1993. The maximum weight trace problem in multiple sequence alignment. In Proceedings of the Fourth Symposium on Combinatorial Pattern Matching, Springer-Verlag Lecture Notes in Computer Science, vol. 684, pp. 106–119. London.
Kim, J., Pramanik, S., and Chung, M.J. 1994. Multiple sequence alignment using simulated annealing. Comput. Appl. Biosci. 10: 419–426.
Krogh, A., Brown, M., Mian, I.S., Sjolander, K., and Haussler, D. 1994. Hidden Markov models in computational biology: Applications to protein modeling. J. Mol. Biol. 235: 1501–1531.
Kschischo, M. and Lässing, M. 2000. Finite-temperature sequence alignment. Pac. Symp. Biocomput. 5: 621–632.
Metz, C.E. 1978. Basic principles of ROC analysis. Semin. Nucl. Med. 8: 283–298.
Mizuguchi, K., Deane, C.M., Blundell, T.L., and Overington, J.P. 1998. HOMSTRAD: A database of protein structure alignments for homologous families. Protein Sci. 7: 2469–2471.
Miyazawa, S. 1995. A reliable sequence alignment method based on probabilities of residue correspondences. Protein Eng. 8: 999–1009.
Morgenstern, B., Dress, A., and Werner, T. 1996. Multiple DNA and protein sequence alignment based on segment-to-segment comparison. Proc. Nat. Acad. Sci. 93: 12098–12103.
Morgenstern, B., Frech, K., Dress, A., and Werner, T. 1998. DIALIGN: Finding local similarities by multiple sequence alignment. Bioinformatics 14: 290–294.
Murzin, A.G., Brenner, S.E., Hubbard, T., and Chothia, C. 1995. SCOP: A structural classification of proteins database for the investigation of sequences and structures. J. Mol. Biol. 247: 536–540.
Myers, E.W. and Miller, W. 1988. Optimal alignments in linear space. Comput. Appl. Biosci. 4: 11–17.
Needleman, S.B. and Wunsch, C.D. 1970. A general method applicable to the search for similarities in the amino acid sequence of two proteins. J. Mol. Biol. 48: 443–453.
Notredame, C. and Higgins, D.G. 1996. SAGA: Sequence alignment by genetic algorithm. Nucleic Acids Res. 24: 1515–1524.
Notredame, C., Holm, L., and Higgins, D.G. 1998. COFFEE: An objective function for multiple sequence alignments. Bioinformatics 14: 407–422.
Notredame, C., Higgins, D.G., and Heringa, J. 2000. T-Coffee: A novel method for multiple sequence alignments. J. Mol. Biol. 302: 205–217.


Phillips, A., Janies, D., and Wheeler, W. 2000. Multiple sequence alignments in phylogenetic analysis. Mol. Phylogenet. Evol. 16: 317–330.
Pruitt, K.D., Tatusova, T., and Maglott, D.R. 2003. NCBI Reference Sequence project: Update and current status. Nucleic Acids Res. 31: 34–37.
Rost, B. 1999. Twilight zone of protein sequence alignments. Protein Eng. 12: 85–94.
Rost, B. and Sander, C. 1994. Combining evolutionary information and neural networks to predict protein secondary structure. Proteins 19: 55–77.
Saitou, N. and Nei, M. 1987. The neighbor-joining method: A new method for reconstructing phylogenetic trees. Mol. Biol. Evol. 4: 406–425.
Sauder, J.M., Arthur, J.W., and Dunbrack Jr., R.L. 2000. Large-scale comparison of protein sequence alignment algorithms with structure alignments. Proteins 40: 6–22.
Schlosshauer, M. and Ohlsson, M. 2002. A novel approach to local reliability of sequence alignments. Bioinformatics 18: 847–854.
Shindyalov, I.N. and Bourne, P.E. 1998. Protein structure alignment by incremental combinatorial extension (CE) of the optimal path. Protein Eng. 11: 739–747.
Smith, T.F. and Waterman, M.S. 1981. Identification of common molecular subsequences. J. Mol. Biol. 147: 195–197.
Sneath, P.H.A. and Sokal, R.R. 1973. Numerical Taxonomy. Freeman, San Francisco, CA.
Sonnhammer, E.L.L., Eddy, S.R., Birney, E., Bateman, A., and Durbin, R. 1998. Pfam: Multiple sequence alignments and HMM-profiles of protein domains. Nucleic Acids Res. 26: 320–322.
Thompson, J.D., Higgins, D.G., and Gibson, T.J. 1994. CLUSTAL W: Improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res. 22: 4673–4680.
Thompson, J.D., Plewniak, F., and Poch, O. 1999a. BAliBASE: A benchmark alignment database for the evaluation of multiple alignment programs. Bioinformatics 15: 87–88.
Thompson, J.D., Plewniak, F., and Poch, O. 1999b. A comprehensive comparison of multiple sequence alignment programs. Nucleic Acids Res. 27: 2682–2690.
Van Walle, I., Lasters, I., and Wyns, L. 2004. Align-m: A new algorithm for multiple alignment of highly divergent sequences. Bioinformatics 20: 1428–1435.
Vingron, M. and Argos, P. 1989. A fast and sensitive multiple sequence alignment algorithm. Comput. Appl. Biosci. 5: 115–121.
Vingron, M. and Argos, P. 1990. Determination of reliable regions in protein sequence alignments. Protein Eng. 3: 565–569.
Vingron, M. and Argos, P. 1991. Motif recognition and alignment for many sequences by comparison of dot matrices. J. Math. Biol. 218: 34–43.
Vingron, M. and Waterman, M.S. 1994. Sequence alignment and penalty choice: Review of concepts, case studies and implications. J. Mol. Biol. 235: 1–12.
Viterbi, A.J. 1967. Error bounds for convolutional codes and an asymptotically optimal decoding algorithm. IEEE Trans. Inf. Theory IT-13: 260–269.

Web site references

http://probcons.stanford.edu; ProbCons alignment tool.

Received May 24, 2004; accepted in revised form November 29, 2004.
