
ProbCons: Probabilistic consistency-based multiple sequence alignment

Chuong B. Do, Mahathi S.P. Mahabhashyam, Michael Brudno, and Serafim Batzoglou

Department of Computer Science, Stanford University, Stanford, California 94305, USA

To study gene evolution across a wide range of organisms, biologists need accurate tools for multiple sequence alignment of protein families. Obtaining accurate alignments, however, is a difficult computational problem because of not only the high computational cost but also the lack of proper objective functions for measuring alignment quality. In this paper, we introduce probabilistic consistency, a novel scoring function for multiple sequence comparisons. We present ProbCons, a practical tool for progressive protein multiple sequence alignment based on probabilistic consistency, and evaluate its performance on several standard alignment benchmark data sets. On the BAliBASE, SABmark, and PREFAB benchmark alignment databases, ProbCons achieves statistically significant improvement over other leading methods while maintaining practical speed. ProbCons is publicly available as a Web resource.

[Supplemental material is available online at www.genome.org. Source code and executables are available as public domain software at http://probcons.stanford.edu.]

Given a set of biological sequences, a multiple alignment provides a way of identifying and visualizing patterns of sequence conservation by organizing homologous positions across different sequences in columns. As sequence similarity often implies divergence from a common ancestor or functional similarity, sequence comparisons facilitate evolutionary and phylogenetic studies (Phillips et al. 2000; Castillo-Davis et al. 2004) and isolation of the most relevant regions (Attwood 2002) for a variety of biological analyses. In particular, conserved amino acid stretches in proteins are strong indicators of preserved three-dimensional structural domains, so protein alignments have been widely used in aiding structure prediction (Rost and Sander 1994; Jones 1999) and characterization of protein families (Sonnhammer et al. 1998; Johnson and Church 1999; Bateman et al. 2004). However, when sequence identity falls below 30%, called the “twilight zone” of protein alignments, the accuracies of most automatic sequence alignment methods drop considerably (Rost 1999; Thompson et al. 1999b). As a result, alignment quality is often the limiting factor in biological analyses of amino acid sequences (Jaroszewski et al. 2002).

The problem of alignment construction consists of defining, either explicitly or implicitly, an objective function for assessing alignment quality and employing an efficient algorithm to find the optimal, or a near optimal, alignment according to the objective function. Two-sequence alignments are usually evaluated by addition of match/mismatch scores for aligned pairs of positions and affine gap penalties for unaligned amino acids (Needleman and Wunsch 1970; Smith and Waterman 1981). Quantitatively, scores for aligned residues are given by log-odds (Altschul 1991) substitution matrices such as PAM (Dayhoff et al. 1978), GONNET (Gonnet et al. 1992), or BLOSUM (Henikoff and Henikoff 1992). Estimation of appropriate gap penalties, however, is often regarded as a “black art” based on trial and error (Vingron and Waterman 1994). For two sequences of length L, an optimal alignment according to this metric may be computed in $O(L^2)$ time (Gotoh 1982) and $O(L)$ space (Myers and Miller 1988) via dynamic programming.
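The quadratic-time dynamic program for affine gap penalties can be sketched as follows. This is a minimal illustration and not code from ProbCons: the function name and the scoring values (`match`, `mismatch`, `gap_open`, `gap_extend`) are assumptions chosen for readability, a simple match/mismatch score stands in for a substitution matrix, and the full tables are stored, so the sketch uses quadratic rather than linear space.

```python
# Sketch: global alignment score with affine gap penalties, using three
# dynamic-programming tables in the style of Gotoh's recurrences.
# All scoring values are illustrative assumptions, not ProbCons parameters.
NEG_INF = float("-inf")


def affine_global_score(x, y, match=1.0, mismatch=-1.0,
                        gap_open=2.0, gap_extend=0.5):
    """Return the best global alignment score of sequences x and y.

    A gap of length k costs gap_open + (k - 1) * gap_extend.
    """
    n, m = len(x), len(y)
    # M[i][j]: best score of x[:i] vs y[:j] ending with x[i-1] aligned to y[j-1]
    # X[i][j]: best score ending with x[i-1] aligned to a gap
    # Y[i][j]: best score ending with y[j-1] aligned to a gap
    M = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    X = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    Y = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    M[0][0] = 0.0
    for i in range(1, n + 1):
        X[i][0] = -gap_open - (i - 1) * gap_extend
    for j in range(1, m + 1):
        Y[0][j] = -gap_open - (j - 1) * gap_extend

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            M[i][j] = s + max(M[i - 1][j - 1], X[i - 1][j - 1], Y[i - 1][j - 1])
            X[i][j] = max(M[i - 1][j] - gap_open,
                          X[i - 1][j] - gap_extend,
                          Y[i - 1][j] - gap_open)
            Y[i][j] = max(M[i][j - 1] - gap_open,
                          Y[i][j - 1] - gap_extend,
                          X[i][j - 1] - gap_open)
    return max(M[n][m], X[n][m], Y[n][m])
```

With these assumed scores, aligning "ACGT" against "AGT" yields three matches and one single-residue gap. Each cell is filled in constant time, which gives the quadratic running time for two sequences of length L.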

Pair-hidden Markov models (HMMs) provide an alternative formulation of the sequence alignment problem in which alignment generation is directly modeled as a first-order Markov process involving state emissions and transitions. In this approach, model parameters obtain an intuitive probabilistic interpretation and can be trained on real data using standard supervised or unsupervised likelihood-based methods. The Viterbi (1967) algorithm computes the highest probability alignment of two input sequences according to an alignment pair-HMM. In the standard three-state pair-HMM for alignment, the Viterbi algorithm may be viewed as an instantiation of the Needleman–Wunsch algorithm in which alignment parameters are determined by a log-odds transformation of the HMM scoring scheme (Durbin et al. 1998).
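To make the connection to dynamic programming concrete, the following sketch runs Viterbi in log space on a three-state pair-HMM. It is an illustration under stated assumptions, not the ProbCons model: the parametrization (gap-open probability `delta`, gap-extend probability `epsilon`, no direct transitions between the two gap states, a path that starts as if leaving the match state, and no explicit end state) and the toy emission functions `p_demo` and `q_demo` are all hypothetical. Because logarithms turn products of probabilities into sums, the recurrences have the same shape as an affine-gap alignment recurrence.

```python
import math

NEG_INF = float("-inf")


def pair_hmm_viterbi(x, y, delta, epsilon, p, q):
    """Log-probability of the most likely state path that emits x and y.

    States: M emits an aligned pair, X emits a residue of x against a gap,
    Y emits a residue of y against a gap. p(a, b) and q(a) are the assumed
    emission probabilities for M and for the gap states, respectively.
    """
    n, m = len(x), len(y)
    log_stay_m = math.log(1.0 - 2.0 * delta)  # M -> M
    log_open = math.log(delta)                # M -> X or M -> Y
    log_extend = math.log(epsilon)            # X -> X or Y -> Y
    log_close = math.log(1.0 - epsilon)       # X -> M or Y -> M

    vM = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    vX = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    vY = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    vM[0][0] = 0.0  # assumption: the path starts as if leaving state M

    for i in range(n + 1):
        for j in range(m + 1):
            if i > 0 and j > 0:
                vM[i][j] = math.log(p(x[i - 1], y[j - 1])) + max(
                    log_stay_m + vM[i - 1][j - 1],
                    log_close + vX[i - 1][j - 1],
                    log_close + vY[i - 1][j - 1])
            if i > 0:
                vX[i][j] = math.log(q(x[i - 1])) + max(
                    log_open + vM[i - 1][j],
                    log_extend + vX[i - 1][j])
            if j > 0:
                vY[i][j] = math.log(q(y[j - 1])) + max(
                    log_open + vM[i][j - 1],
                    log_extend + vY[i][j - 1])
    return max(vM[n][m], vX[n][m], vY[n][m])


# Toy emissions over a two-letter alphabet {A, B} (hypothetical values).
def p_demo(a, b):
    return 0.4 if a == b else 0.1


def q_demo(a):
    return 0.5
```

For example, `pair_hmm_viterbi("AB", "AB", 0.1, 0.5, p_demo, q_demo)` returns the log-probability of the path that matches both positions, since every gapped path is far less probable under these toy parameters.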

Since they specify a conditional probability distribution over the space of all suboptimal alignments, pair-HMMs also allow the computation of the posterior probability, $P(x_i \sim y_j \in a^* \mid x, y)$, that particular positions $x_i$ and $y_j$ of two sequences $x$ and $y$, respectively, will be matched in an alignment $a^*$ generated by the model. Running the Needleman–Wunsch algorithm with these posterior probabilities as substitution scores and no gap penalties gives rise to the maximum expected accuracy alignment method (see Methods), also known as optimal accuracy alignment (Holmes and Durbin 1998).
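The maximum expected accuracy computation described here can be sketched as a gap-penalty-free dynamic program over a posterior matrix. This is a minimal illustration, not the ProbCons implementation: the function name `mea_alignment` and the example matrix are assumptions, and the posterior probabilities are taken as given rather than computed from a pair-HMM.

```python
def mea_alignment(P):
    """Maximum expected accuracy alignment from a posterior match matrix.

    P[i][j] is assumed to hold the posterior probability that position i of
    x is matched to position j of y (0-based indices).
    Returns (sum of posteriors over matched pairs, list of matched (i, j)).
    """
    n = len(P)
    m = len(P[0]) if n else 0
    # A[i][j]: best total posterior for aligning the first i positions of x
    # with the first j positions of y; unmatched positions cost nothing.
    A = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            A[i][j] = max(A[i - 1][j - 1] + P[i - 1][j - 1],
                          A[i - 1][j],
                          A[i][j - 1])

    # Trace back, preferring gap moves on ties so that every reported pair
    # contributes a strictly positive posterior.
    pairs = []
    i, j = n, m
    while i > 0 and j > 0:
        if A[i][j] == A[i - 1][j]:
            i -= 1
        elif A[i][j] == A[i][j - 1]:
            j -= 1
        else:
            pairs.append((i - 1, j - 1))
            i -= 1
            j -= 1
    pairs.reverse()
    return A[n][m], pairs


# Hypothetical posterior matrix for sequences of lengths 2 and 3.
P_demo = [[0.9, 0.05, 0.0],
          [0.1, 0.2, 0.7]]
score_demo, pairs_demo = mea_alignment(P_demo)
```

In this example the second position of x is matched to the third position of y (posterior 0.7) rather than the second (posterior 0.2), because skipping a position carries no penalty and only the total posterior mass of the matched pairs is maximized.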

In the general case of multiple sequence comparisons, theoretically sound and biologically motivated scoring methods are not straightforward to devise. In practice, ad hoc sum-of-pairs schemes (Carrillo and Lipman 1988), which combine the projected pairwise log-odds scores for all pairs of sequences in the alignment, and their weighted variants (Altschul et al. 1989) are commonly used. Unfortunately, direct application of dynamic programming is too inefficient for alignment of more than a few sequences. Instead, a variety of heuristic strategies have been

Corresponding author.

E-mail [email protected]; fax (650) 725-1449.

Article and publication are at http://www.genome.org/cgi/doi/10.1101/gr.2821705.

Resource

330 Genome Research

www.genome.org


Algorithm overview

ProbCons algorithm

Comparison of ProbCons variants

1. Posterior probability matrices

2. Maximal expected accuracy alignment

3. Probabilistic consistency transformation