Algorithm for Memory-Efficient Constrained Multiple Sequence Alignment, Papers of Computer Science

An algorithm for finding an optimal constrained multiple sequence alignment (cmsa) using dynamic programming and a progressive approach. The algorithm is based on the work of tang et al. (2003) and is used to address the sum-of-pairs msa problem. The notation and recurrences used in the algorithm, as well as the need for additional space for computation and storage. Five families of protein/rna sequences are used as testing datasets.

Typology: Papers

Pre 2010

Uploaded on 08/31/2009

koofers-user-6lu
koofers-user-6lu 🇺🇸

8 documents

1 / 11

Toggle sidebar

This page cannot be seen from the preview

Don't miss anything!

bg1
BIOINFORMATICS ORIGINAL PAPER Vol. 21 no. 1 2005, pages 20–30
doi:10.1093/bioinformatics/bth468
A memory-efficient algorithm for multiple
sequence alignment with constraints
Chin Lung Luand Yen Pin Huang
Department of Biological Science and Technology, National Chiao Tung University,
Hsinchu 300, Taiwan, Republic of China
Received on April 20, 2004; revised July 16, 2004; accepted August 3, 2004
Advance Access publication August 12, 2004
ABSTRACT
Motivation: Recently, the concept of the constrained
sequence alignment was proposed to incorporate the know-
ledgeofbiologistsaboutstructures/functionalities/consensuses
of their datasets into sequence alignment such that the user-
specified residues/nucleotides are aligned together in the
computed alignment. The currently developed programs use
the so-called progressive approach to efficiently obtain a con-
strained alignment of severalsequences. However, the kernels
of these programs, the dynamic programming algorithms for
computing an optimal constrained alignment between two
sequences, run in O n2)memory, where γis the number
of the constraints and nis the maximum of the lengths of
sequences. As a result, such a high memory requirement limits
the overall programs to align short sequences only.
Results: We adopt the divide-and-conquer approach to
design a memory-efficient algorithm for computing an optimal
constrained alignment between two sequences, which greatly
reduces the memory requirement of the dynamic program-
ming approaches at the expense of a small constant factor
in CPU time. This new algorithm consumes only On)space,
where αis the sum of the lengths of constraints and usually
αnin practical applications. Based on this algorithm, we
have developed a memory-efficient tool for multiple sequence
alignment with constraints.
Availability: http://genome.life.nctu.edu.tw/MUSICME
1 INTRODUCTION
Multiplesequencealignment(MSA)isoneofthefundamental
problems in computational molecular biology that have been
studied extensively, because it is a useful tool in the phylo-
genetic analyses among various organisms, the identification
of conserved motifs and domains in a group of related pro-
teins, the secondary and tertiary structure prediction of a
protein (or RNA), and so on (Carrillo and Lipman, 1988; Chan
et al., 1992; Gusfield, 1997; Nicholas et al., 2002; Notredame,
2002). Moreover, MSA is one of the most challenging
To whom correspondence should be addressed.
problems in computational molecular biology because it has
been shown to be NP-complete under the consideration of
sum-of-pairs scoring criteria (Kececioglu, 1993; Wang and
Jiang, 1994; Bonizzoni and Vedova, 2001), which means
that it seems to be hard to design an efficient algorithm for
finding the mathematically optimal alignment. Hence, some
approximate methods (Gusfield, 1993; Pevzner, 1992; Bafna
et al., 1997; Li et al., 2000) and heuristic methods (Feng
and Doolittle, 1987; Taylor, 1987; Corpet, 1988; Higgins and
Sharpe, 1988; Thompson et al., 1994) were introduced to
overcome this problem.
Recently, the concept of the constrained sequence align-
ment was proposed to incorporate the knowledge of biolo-
gists regarding the structures/functionalities/consensuses of
their datasets into sequence alignment such that the user-
specified residues/nucleotides are aligned together in the
computed alignment (Tang et al., 2003). Tang et al. (2003)
first designed a dynamic programming algorithm for finding
an optimal constrained alignment of two sequences and then
used it as a kernel to develop a constrained multiple sequence
alignment (CMSA) tool based on the progressive approach,
where each constraint considered by Tang et al. is a single
residue/nucleotide only. Their proposed algorithm for the
two sequences runs in O n4)time and consumes O(n4)
space, where γis the number of constrained residues and
nis the maximum lengths of the sequences. Later, this res-
ult was improved independently by two groups of researchers
to O n2)time and O n2)space using the same approach
of dynamic programming (Yu, 2003; Chin et al., 2003). In
fact, each constraint requested to be aligned together can
represent a conserved site of a protein/DNA/RNA family
and each conserved site may consist of a short segment of
residues/nucleotides, instead of a single residue/nucleotide.
In other words, the constraint specified by the biologists
can be a fragment of several residues/nucleotides. For some
applications, biologists may further expect that some mis-
matches are allowed among the residues/nucleotides of the
columns requested to be aligned. Hence, Tsai et al. (2004)
studied such a kind of the constrained sequence alignment
and designed an algorithm of O n2)time and O n2)
20 Bioinformatics vol. 21 issue 1 © Oxford University Press 2004; all rights reserved.
pf3
pf4
pf5
pf8
pf9
pfa

Partial preview of the text

Download Algorithm for Memory-Efficient Constrained Multiple Sequence Alignment and more Papers Computer Science in PDF only on Docsity!

BIOINFORMATICS ORIGINAL PAPER

Vol. 21 no. 1 2005, pages 20– doi:10.1093/bioinformatics/bth

A memory-efficient algorithm for multiple

sequence alignment with constraints

Chin Lung Lu ∗^ and Yen Pin Huang

Department of Biological Science and Technology, National Chiao Tung University, Hsinchu 300, Taiwan, Republic of China

Received on April 20, 2004; revised July 16, 2004; accepted August 3, 2004 Advance Access publication August 12, 2004

ABSTRACT Motivation: Recently, the concept of the constrained sequence alignment was proposed to incorporate the know- ledge of biologists about structures/functionalities/consensuses of their datasets into sequence alignment such that the user- specified residues/nucleotides are aligned together in the computed alignment. The currently developed programs use the so-called progressive approach to efficiently obtain a con- strained alignment of several sequences. However, the kernels of these programs, the dynamic programming algorithms for computing an optimal constrained alignment between two sequences, run in O(γ n^2 ) memory, where γ is the number of the constraints and n is the maximum of the lengths of sequences. As a result, such a high memory requirement limits the overall programs to align short sequences only. Results: We adopt the divide-and-conquer approach to design a memory-efficient algorithm for computing an optimal constrained alignment between two sequences, which greatly reduces the memory requirement of the dynamic program- ming approaches at the expense of a small constant factor in CPU time. This new algorithm consumes only O(α n ) space, where α is the sum of the lengths of constraints and usually α  n in practical applications. Based on this algorithm, we have developed a memory-efficient tool for multiple sequence alignment with constraints. Availability: http://genome.life.nctu.edu.tw/MUSICME Contact: [email protected]

1 INTRODUCTION

Multiple sequence alignment (MSA) is one of the fundamental problems in computational molecular biology that have been studied extensively, because it is a useful tool in the phylo- genetic analyses among various organisms, the identification of conserved motifs and domains in a group of related pro- teins, the secondary and tertiary structure prediction of a protein (or RNA), and so on (Carrillo and Lipman, 1988; Chan et al. , 1992; Gusfield, 1997; Nicholas et al. , 2002; Notredame, 2002). Moreover, MSA is one of the most challenging

∗To whom correspondence should be addressed.

problems in computational molecular biology because it has been shown to be NP-complete under the consideration of sum-of-pairs scoring criteria (Kececioglu, 1993; Wang and Jiang, 1994; Bonizzoni and Vedova, 2001), which means that it seems to be hard to design an efficient algorithm for finding the mathematically optimal alignment. Hence, some approximate methods (Gusfield, 1993; Pevzner, 1992; Bafna et al. , 1997; Li et al. , 2000) and heuristic methods (Feng and Doolittle, 1987; Taylor, 1987; Corpet, 1988; Higgins and Sharpe, 1988; Thompson et al. , 1994) were introduced to overcome this problem. Recently, the concept of the constrained sequence align- ment was proposed to incorporate the knowledge of biolo- gists regarding the structures/functionalities/consensuses of their datasets into sequence alignment such that the user- specified residues/nucleotides are aligned together in the computed alignment (Tang et al. , 2003). Tang et al. (2003) first designed a dynamic programming algorithm for finding an optimal constrained alignment of two sequences and then used it as a kernel to develop a constrained multiple sequence alignment (CMSA) tool based on the progressive approach, where each constraint considered by Tang et al. is a single residue/nucleotide only. Their proposed algorithm for the two sequences runs in O(γ n^4 ) time and consumes O(n^4 ) space, where γ is the number of constrained residues and n is the maximum lengths of the sequences. Later, this res- ult was improved independently by two groups of researchers to O(γ n^2 ) time and O(γ n^2 ) space using the same approach of dynamic programming (Yu, 2003; Chin et al. , 2003). In fact, each constraint requested to be aligned together can represent a conserved site of a protein/DNA/RNA family and each conserved site may consist of a short segment of residues/nucleotides, instead of a single residue/nucleotide. In other words, the constraint specified by the biologists can be a fragment of several residues/nucleotides. For some applications, biologists may further expect that some mis- matches are allowed among the residues/nucleotides of the columns requested to be aligned. Hence, Tsai et al. (2004) studied such a kind of the constrained sequence alignment and designed an algorithm of O(γ n^2 ) time and O(γ n^2 )

20 Bioinformatics vol. 21 issue 1 © Oxford University Press 2004; all rights reserved.

Memory-efficient algorithm for MSA constraints

space for two sequences. The improvements and extension above greatly increase the performances and practical usage of the CMSA tools developed using the progressive approach. However, the requirement of O(γ n^2 ) memory still limits the existing CMSA tools to align a set of short sequences, at most several hundreds of residues/nucleotides. To align large genomic sequences of at least several thousands of residues/nucleotides, there is a need to design a memory- efficient algorithm for the constrained pairwise sequence alignment (CPSA) problem, which is the key limiting factor relating to the applicable extent of the progressive CMSA tools. Hence, in this paper, we adopt the so-called divide-and- conquer approach to design a memory-efficient algorithm for solving the CPSA problem, which runs in O(γ n^2 ) time, but consumes only O(αn) space, where α is the sum of the lengths of constraints and usually α  n in practical applications. Based on this algorithm, we have finally developed a memory- efficient CMSA tool using the progressive approach. Note that applying the divide-and-conquer approach to memory- efficiently align two or more sequences without any con- straints has been studied extensively (Myers and Miller, 1988; Chao et al. , 1994; Tönges et al. , 1996; Stoye et al. , 1997a,b; Stoye, 1998). In contrast to the progressive approach used here, the divide-and-conquer algorithms proposed by Stoye et al. (Tönges et al. , 1996; Stoye et al. , 1997a,b; Stoye, 1998) considered the input sequences simultaneously and heuristic- ally compute the good, but not necessarily optimal, dividing positions so that the resulting total MSA is close to an optimal MSA of the original sequences. In fact, many other CMSAs have been proposed from various perspectives, even using different approaches (Schuler et al. , 1991; Depiereux and Feytmans, 1992; Taylor, 1994; Myers et al. , 1996; Notredame et al. , 2000; Thompson et al. , 2000; Sammeth et al. , 2003). Of these various CMSAs, it is worth mentioning that Myers et al. (1996) obtained their CMSA by performing progressive mul- tiple alignment under position-based constraints that are given by users; Sammeth et al. (2003) got their CMSA by perform- ing simultaneous multiple alignment under segment-based constraints (as same as we studied here) that are pre-computed via a local segmented-based algorithm (Morgenstern, 1999). We refer the reader to their papers for details.

2 PROBLEM FORMULATION

Let S = {S 1 , S 2 ,... , Sχ } be the set of χ sequences over the alphabet . Then an MSA of S is a rectangular matrix con- sisting of χ rows of characters of  ∪ {-} such that no column consists entirely of dashes and removing dashes from row i leaves Si for any 1 ≤ i ≤ χ. The sum-of-pairs score (SP score) of an MSA is defined to be the sum of the scores of all columns, where the score of each column is the sum of the scores of all distinct pairs of characters in the column. In practice, the score of the pair of two dashes is usually set to zero. Then the problem of finding an MSA of S with the

optimal SP score is the so-called sum-of-pairs MSA problem (Carrillo and Lipman, 1988; Chan et al. , 1992; Gusfield, 1997; Nicholas et al. , 2002; Notredame, 2002). Let δ(T 1 , T 2 ) denote the Hamming distance between two subsequences T 1 and T 2 of equal length, which is equal to the number of mismatched pairs in the alignment of T 1 and T 2 without any gap. Given an alignment L of S, a band is defined as a block of consecutive columns in L (i.e. a sub- matrix of L). For any band L′^ of L, let subseq(S (^) i , L′) denote the subsequence of Si whose residues/nucleotides are all in the band L′, where 1 ≤ i ≤ χ. A subsequence T = t 1 t 2... tλ is said to appear in L if L contains a band L′^ of λ columns, say π 1 , π 2 ,... , πλ, such that the charac- ters of column π (^) j , 1 ≤ j ≤ λ, are all equal to tj , or equivalently, subseq(Si , L′) = T for each 1 ≤ i ≤ χ. If δ[subseq(S (^) i , L′), T ] ≤ λ ×  for a given error ratio 0 ≤  < 1 [i.e. some mismatches are allowed between subseq(Si , L′) and T ], then T is said to approximately appear in L. From the biological viewpoint, T can be con- sidered as the consensus among the subsequences in L′^ and hence T is also called as an induced consensus by the band L′. For any two subsequences T 1 and T 2 , T 1 ≺ T 2 is used to denote that T 1 (approximately) appears strictly before T 2 in L (i.e. their corresponding bands do not overlap). Let = (C 1 , C 2 ,... , Cγ ) be an ordered set of γ constraints (i.e. subsequences), each C (^) i = c i 1 c i 2... c iλ (^) i with length of λi , where 1 ≤ i ≤ γ. Then the CMSA of S with respect to is defined as an alignment L of S in which all the constraints of approximately appear in the order C 1 ≺ C 2 ≺ · · · ≺ Cγ such that δ(subseq(S (^) i , L′ j ), Cj ) ≤ λj ×  for all 1 ≤ i ≤ χ and 1 ≤ j ≤ γ , where L′ j is the band of L whose induced consensus is Cj. Given a set S of χ sequences along with an ordered set of γ constraints and an error ratio , the so-called CMSA problem is to find a CMSA w.r.t. with the optimal SP score. When the number of sequences in S is restricted to two (i.e. χ = 2), the CMSA problem is called as the CPSA problem.

3 ALGORITHM

In this section, we shall first design a memory-efficient algorithm for solving the CPSA problem with two given sequences A = a 1 a 2... am and B = b 1 b 2... bn, a given ordered set = (C 1 , C 2 ,... , Cγ ) of γ constraints, each C (^) i = c i 1 c i 2... c (^) λi (^) i with length of λi , 1 ≤ i ≤ γ , and a given error threshold . After that, we shall use it as the kernel to heuristically solve the CMSA problem. For any sequence T , let pref(T , l) [respectively, suff(T , l)] phase don’t change denote the prefix (respect- ively, suffix) of T with length l. For any two characters a, b ∈ , let σ (a, b) denote the score of aligning a with b. The gap penalty adopted here is the so-called affine gap penalty that penalizes a gap of length l with wo +l×we , where wo > 0

Memory-efficient algorithm for MSA constraints

Fig. 1. The schematic diagram of four adjacent entries of G, where entry (i, j , k) consists of four nodes Mk , MDk , MIk and Nk corresponding to Mk (i, j ), MDk (i, j ), MIk (i, j ) and Nk (i, j , λ (^) k ), respectively.

Fig. 2. Schematic diagram of divide-and-conquer approach: two light gray areas are the reduced subproblems after middle position (imid , jmid , kmid) is determined, each of which will be further divided into two subproblems of dark gray areas.

Define Mk (i, j ) to be the score of an optimal constrained

alignment of A (^) i and B (^) j w.r.t. (^) k , and define M S k (i,^ j ) (M D k (i,^ j )^ and^ M

I k (i,^ j ), respectively) to be the maximum score of all constrained alignments of A (^) i and B (^) j w.r.t.

k that begin with a substitution [deletion and insertion, respectively] pair (a (^) i+ 1 , bj + 1 ) [(a (^) i+ 1 , −) and (−, b (^) j + 1 ),

respectively]. Let (^) k (h) = [C 1 , C 2 ,... , Ck− 1 , pref(C (^) k , h)] and (^) k (h) = [suff(C (^) k , λk − h), Ck+ 1 ,... , Cγ ], where 1 ≤ h ≤ λk. Let N (^) k (i, j , h) denote the score of an optimal semi-constrained alignment L of A (^) i and B (^) j w.r.t. k (h)^ that begins with a band whose induced consensus is equal to suff(C (^) k , λk − h). Note that the recurrences for

C.L.Lu and Y.P.Huang

computing matrices Mk , M S k ,^ M

D k ,^ M

I k and^ N^ k^ can be developed similarly as those for computing Mk , MSk , MDk , MIk and Nk , respectively. Clearly, MSk (i, j ) = Mk (i − 1, j − 1 ) + σ (ai , bj ). If δ[suff(A (^) i , λk ), Ck ] ≤ λ (^) k ×  and δ[suff(Bj , λk ), Ck ] ≤ λ (^) k × , then we can reformulate the recurrence of Nk as follows: Nk (i, j , 1) = Mk− 1 (i − 1, j − 1 ) + σ (ai , bj ) and Nk (i, j , h) = Nk (i − 1, j − 1, h − 1 ) + σ (ai , bj ) for each 1 < h ≤ λ (^) k. Next, we describe our divide-and-conquer algorithm, termed as CPSA-DC algorithm, for computing an optimal constrained alignment between A and B w.r.t. as fol- lows. The key point is to determine the middle position (imid , jmid , kmid) of the optimal path in G to divide the prob- lem into two subproblems, each of which is recursively divided into two smaller subproblems using the same way. Given an alignment L, we use score(L) to denote the score of L. Let Lγ (A, B) be an optimal constrained align- ments of A and B w.r.t. and clearly score[Lγ (A, B)] = Mγ (m, n). Let imid = m 2. Then, we partition Lγ (A, B) into two parts by cutting it at the position immediately after aimid and we let Lkmid (A (^) imid , Bjmid ) denote the part containing aimid and Lkmid (A (^) imid , B (^) jmid ) denote the remaining part, where b (^) jmid denotes the last character in Lkmid (A (^) imid , Bjmid ) from B, and kmid denotes the largest index so that pref(C (^) kmid , hmid) (approximately) appears in Lkmid (A (^) imid , Bjmid ). Then there are two possibilities when we consider the last aligned pair of Lkmid (A (^) imid , Bjmid ).

Case 1: The last aligned pair of Lkmid (A (^) imid , Bjmid ) is a substitution pair [i.e. (aimid , bjmid )]. In this case, we have Mγ (m, n) = score(Lγ (A, B)) = score(Lkmid (A (^) imid , Bjmid )) + score(Lkmid (A (^) imid , B (^) jmid )). If (a (^) imid , bjmid ) is not a constrained column in Lγ (A, B), then Lkmid (A (^) imid , Bjmid ) is an optimal constrained alignment of Aimid and B (^) jmid w.r.t. (^) kmid ending with a substitution pair (a (^) imid , bjmid ), and Lkmid (A (^) imid , B (^) jmid ) is an optimal constrained alignment of A (^) imid and B (^) jmid w.r.t. (^) kmid. Hence, Mγ (m, n) = MSkmid (imid , jmid) + Mkmid (imid , jmid). If (a (^) imid , bjmid ) is a con- strained column in Lkmid (A (^) imid , Bjmid ), then Lkmid (A (^) imid , Bjmid ) is an optimal semi-constrained alignment of A (^) imid and Bjmid w.r.t. (^) kmid (hmid) ending with a band L′^ whose induced con- sensus is equal to pref(C (^) kmid , hmid). If hmid < λ (^) kmid , then Lkmid (A (^) imid , B (^) jmid ) is an optimal semi-constrained alignment of A (^) imid and B (^) jmid w.r.t. (^) kmid (hmid) beginning with a band L′^ whose induced consensus is equal to suff(C (^) kmid , λkmid − hmid). Moreover, the induced consensus of the merge of L′^ and L′^ have to be equal to C (^) kmid. In this case, we have Mγ (m, n) = Nkmid (imid , jmid , hmid) + N (^) kmid (imid , jmid , hmid). If hmid = λkmid , then Lkmid (A (^) imid , B (^) jmid ) is an optimal constrained align- ment of A (^) imid and B (^) jmid w.r.t. (^) kmid (hmid), and hence Mγ (m, n) = Nkmid (imid , jmid , λkmid ) + Mkmid (imid , jmid). Case 2: The last aligned pair of Lkmid (A (^) imid , Bjmid ) is a deletion pair [i.e. (aimid , −)]. If the first aligned

pair in Lkmid (A (^) imid , B (^) jmid ) is not a deletion pair, then Mγ (m, n) = max{MDkmid (imid , jmid) + M S kmid (imid^ ,^ jmid), MDkmid (imid , jmid) + M I kmid (imid^ ,^ jmid)}. If the first aligned pair in Lkmid (A (^) imid , B (^) jmid ) is a deletion pair, then Mγ (m, n) = MDkmid (imid , jmid) + M D kmid (imid^ ,^ jmid)^ +^ wo^. We need to com- pensate it by adding wo because the open penalty of the gap containing aimid and a (^) imid + 1 in Lγ (A, B) is charged twice by MDkmid (imid , jmid) and M D kmid (imid^ ,^ jmid). In summary, the recurrence of Mγ (m, n) is derived as follows: Mγ (m, n)

= max

   

  

MDkmid (imid , jmid ) + MSkmid (imid , jmid ), MDkmid (imid , jmid ) + MIkmid (imid , jmid ), MDkmid (imid , jmid ) + MDkmid (imid , jmid ) + wo , MSkmid (imid , jmid ) + Mkmid (imid , jmid ), Nkmid (imid , jmid , hmid ) + N (^) kmid (imid , jmid , hmid ), Nkmid (imid , jmid , λ (^) kmid ) + Mkmid (imid , jmid )

   

  

.

When MDkmid (imid , jmid) + M D kmid (imid^ ,^ jmid)^ is added to the right-hand side, the above recurrence is not changed, but can be reformulated as follows:

Mγ (m, n) = max

  

 

MDkmid (imid , jmid ) + Mkmid (imid , jmid ), MDkmid (imid , jmid ) + MDkmid (imid , jmid ) + wo , MSkmid (imid , jmid ) + Mkmid (imid , jmid ), Nkmid (imid , jmid , hmid ) + N (^) kmid (imid , jmid , hmid ), Nkmid (imid , jmid , λ (^) kmid ) + Mkmid (imid , jmid )

  

 

.

In other words, jmid , kmid and hmid are the indices j , k and h, where 1 ≤ j ≤ n, 0 ≤ k ≤ γ and 1 ≤ h < λ (^) k , such that the following maximal value is the maximum.

max

MDk (imid , j ) + Mk (imid , j ), MDk (imid , j ) + M D k (imid^ ,^ j )^ +^ wo^ , MSk (imid , j ) + Mk (imid , j ), Nk (imid , j , h) + N (^) k (imid , j , h), Nk (imid , j , λk ) + Mk (imid , j )

Now, we show how to use O(αn), instead of O(γ mn), memory to determine∑ jmid , kmid and hmid , where α = 1 ≤k≤γ λ^ k^ and^ α^ ≤^ min{m,^ n}^ intrinsically. In fact, a single matrix E of size (γ + 1 ) × (n + 1 ) with each entry E(k, j ) of λ (^) k + 4 space is enough to compute Mk (imid , j ), MSk (imid , j ), MDk (imid , j ) MIk (imid , j ) and Nk (imid , j , h), for 1 ≤ j ≤ n, 0 ≤ k ≤ γ and 1 ≤ h ≤ λk. When reaching the entry (i, j , k) of 3D grid graph G, we use entry E(k, j ) of E to hold the most recently computed values of Mk (i, j ), MSk (i, j ), MDk (i, j ) MIk (i, j ) and Nk (i, j , h), which clearly needs a total of λk + 4 space. Note that the old values in entry E(k, j ) will be moved into an extra entry, termed as V (^) k whose space is equal to E(k, j ), before they are overwritten by their newly computed values. Before moving the old values in E(k, j ) into

C.L.Lu and Y.P.Huang

if

HA (k,λ (^) k ) λ (^) k ≤^ 

and

HB (j ,k,λ (^) k ) λ (^) k ≤^ 

then if E(Nk (imid , j , λk )) + E(Mk (imid , j )) > max then max = E(Nk (imid , j , λk )) + E(Mk (imid , j )); jmid = j ; kmid = k; hmid = h; type = case 5; end if end if end if end for end for 3: if type = case 1 then CPSA-DC(istart , imid − 1, jstart , jmid , kstart , kmid); Align aimid with a space; CPSA-DC(imid + 1, iend , jmid + 1, jend , kmid + 1, kend); end if if type = case 2 then CPSA-DC(istart , imid − 1, jstart , jmid , kstart , kmid); Align aimid a (^) imid + 1 with two spaces; CPSA-DC(imid + 2, iend , jmid + 1, jend , kmid + 1, kend); end if if type = case 3 then CPSA-DC(istart , imid − 1, jstart , jmid − 1, kstart , kmid); Align aimid with b (^) jmid ; CPSA-DC(imid + 1, iend , jmid + 1, jend , kmid + 1, kend); end if if type = case 4 then CPSA-DC(istart , imid − hmid , jstart , jmid − hmid , kstart , kmid − 1 ); Align a (^) imid −hmid + 1 · · · aimid +λ (^) k −hmid with b (^) jmid −hmid + 1 · · · bjmid +λ (^) k −hmid ; CPSA-DC(imid + λ (^) k − hmid + 1, iend , jmid + λ (^) k − hmid

  • 1, jend , kmid + 1, kend); end if if type = case 5 then CPSA-DC(istart , imid − λ (^) k , jstart , jmid − λ (^) k , kstart , kmid − 1 ); Align a (^) imid −λ (^) k + 1 · · · aimid with b (^) jmid −λ+ 1 · · · bjmid ; CPSA-DC(imid + 1, iend , jmid + 1, jend , kmid + 1, kend); end if

Algorithm BestScore (istart , iend , jstart , jend , kstart , kend ) Input: Sequences aistart · · · aiend and b (^) jstart · · · bjend with constraints (C (^) kstart ,... , Ckend ) 1: / Reindex / m = istart − iend + 1; n = jstart − jend + 1; γ = kstart − kend + 1; 2: / Initialization / for j = 0 to n do for k = 0 to γ do E(MSk (imid , j )) = E(MDk (imid , j )) = −∞; if (j = 0 ) or (k > 0 ) then E(MIk (imid , j )) = −∞; else E(MIk (imid , j )) = −wo − j we ; if (j = 0 ) and (k = 0 ) then E(Mk (imid , j )) = 0; else E(Mk (imid , j )) = −∞;

if k ≥ 1 then for h = 1 to λ (^) k do E(Nk (imid , j , h)) = −∞; end if end for end for 3: / Computation / for i = 1 to m do for k = 0 to γ do / For the case of* j = 0 / Vk (Mk (imid , 0)) = E(Mk (imid , 0)); if k ≥ 1 then for h = 1 to λk do V (^) k (Nk (imid , 0, h)) = E(Nk (imid , 0, h))); end if E(MSk (imid , 0)) = E(MIk (imid , 0)) = −∞; E(Mk (imid , 0)) = E(MDk (imid , 0)) = −wo − j we ; end for for j = 1 to n do / For the case of j > *0 / for k = 0 to γ do tempk (Mk (imid , j )) = E(Mk (imid , j )) ; if k ≥ 1 then for h = 1 to λk do tempk (Nk (imid , j , h)) = E(Nk (imid , j , h)); end if E(MSk (imid , j )) = Vk (Mk (imid , j ))

  • σ (aistart +i− 1 , bjstart +j − 1 ); E(MDk (imid , j )) = max{E(MDk (imid , j )) − we , E(Mk (imid , j )) − wo − we}; E(MIk (imid , j )) = max{E(MIk (imid , j − 1 )) − we , E(Mk (imid , j − 1 )) − wo − we}; if k ≥ 1 then for h = 1 to λk do if h = 1 then E(Nk (imid , j , h)) = vk−1,k + σ (a (^) istart +i−λk , bjstart +j −λk ); else E(Nk (imid , j , h)) = Vk (Nk (imid , j , h − 1 ))
  • σ (a (^) istart +i−λk +h− 1 , b (^) jstart +j −λk +h− 1 ); end if end for end if E(Mk (imid , j ))

= max

E(MDk (imid , j )), E(MIk (imid , j )), E(Nk (imid , j , λk )) Vk (Mk (imid , j )) + σ (a (^) istart +i− 1 , bjstart +j − 1 ),

v (^) k,k+ 1 = Vk (M (^) k (imid , j )); Vk (Mk (imid , j )) = tempk (Mk (imid , j )); if k ≥ 1 then for h = 1 to λk do Vk (N (imid , j , h)) = tempk (Nk (imid , j , h)); end for end if

Memory-efficient algorithm for MSA constraints

if i = m and k ≥ 1 then for h = 1 to λ (^) k do if h = 1 then if j = 1 and a (^) istart +i−λk = c kh then HA (k, h) = 1; else HA (k, h) = 0; if bjstart +j −λk = c kh then HB (j , k, h) = 1; else HB (j , k, h) = 0; else if j = 1 and aistart +i−λk +h− 1 = c kh then HA (k, h) = HA (k, h − 1 ) + 1; if b (^) jstart +j −λk +h− 1 = c (^) hk then HB (j , k, h) = HB (j , k, h − 1 ) + 1; end if end for end if end for end for end for

Now, we analyze the time-complexity of our CPSA-DC algorithm for solving the CPSA. As shown in Figure 2, after determining the middle position (imid , jmid , kmid) of the optimal path in G, we can divide the original problem into two subproblems, each of which further can be recursively divided into two smaller subproblems using the same way. Note that regardless of where the optimal path passes through (imid , jmid , kmid), the total size of the two reduced subproblems is just half the size of the original problem, where the size is measured by the number of entries in G. It is not hard to see that the time-complexity of determining the middle position of each subproblem at each recursive stage is proportional to the size of the subproblem. Let denote the size of the original problem (i.e. = γ mn). Then the total time-complexity of our CPSA-DC algorithm is equal to + 2 + 4 + · · · = 2 , which is twice as high as the CPSA-DP algorithm. Using the CPSA-DC algorithm as a kernel, we were able to design a memory-efficient algorithm, termed CMSA-DC, for pro- gressively aligning multiple input sequences into a CMSA according to the branching order of a guide tree. The above progressive method we adopted was proposed by Tang et al. (2003). Owing to space limitation, we refer the reader to their paper for the details of its implementation.

4 EXPERIMENTAL RESULTS

We use Java language to implement the CMSA-DC algorithm as a web server, called as MuSiC-ME (Memory-Efficient tool for Multiple Sequence Alignment with Constraints). The input of the MuSiC-ME system consists of a set of protein/DNA/RNA sequences and a set of user-specified con- straints, each with a fragment of residue/nucleotide that (approximately) appears in all input sequences. The output of MuSiC-ME is a CMSA in which the fragments of the input sequences whose residues/nucleotides exhibit a given degree

Table 1. The information of the tested families and their constraints

Family #SEQ MAXSEQ #CON MAXCON

Protease 6 123 4 1 Globin 6 146 7 2 RNase 6 185 3 1 Kinase 6 353 10 3 CoV-3′-UTR 6 422 12 2

#SEQ is the number of sequences, MAXSEQ is the maximum length of sequences, #CON is the number of constraints and MAXCON is the maximum length of constraints.

Table 2. The comparison of CPU time and memory usage between MuSiC and MuSiC-ME

Family MuSiC MuSiC-ME CPU Time Memory CPU Time Memory (s) (MB) (s) (MB)

Protease 6 25.4 6 15. Globin 23 42.0 18 15. RNase 11 32.0 8 15. Kinase 131 160.8 96 15. CoV-3′-UTR — — 165 17.

The memory usage includes JVM (Java Virtual Machine), code (MuSiC/MuSiC-ME) and data, and MuSiC cannot deal with the case of CoV-3′-UTR due to running out of memory.

of similarity to a constraint are aligned together. For its biolo- gical applications, we refer the reader to other related papers (Tang et al. , 2003; Tsai et al. , 2004). In the following, we evaluate our memory-efficient MuSiC- ME system and compare its running time and memory to the original MuSiC system (Tsai et al. , 2004), whose kernel CPSA algorithm was implemented by the dynamic programming approach. We chose five families of protein/RNA sequences as our testing datasets, each of which has been shown to contain an ordered series of conserved motifs related to the structures/functionalities/consensuses of the family (McClure et al. , 1994; Chin et al. , 2003; Tang et al. , 2003; Tsai et al. , 2004): (1) the aspartic acid protease family (Protease), (2) the hemoglobins family (Globin), (3) the ribonuclease family (RNase), (4) the kinase family (Kinase) and (5) the 3′- untrans- lated region of the coronaviruses (CoV-3′-UTR). From each family, we have selected a representative set of sequences and adopted the ordered series of conserved motifs as the con- straints. Table 1 lists the information of the tested families and their constraints. All tests were run with default parameters on IBM PC with 1.26 GHz processor and 512 MB RAM under Linux system. Table 2 lists the CPU time and memory usage of our experiments using MuSiC and MuSiC-ME. It shows that the memory usage of MuSiC-ME is much smaller than that of MuSiC for large-scale sequences, and the CPU time required by MuSiC-ME is smaller than that required by MuSiC for

Memory-efficient algorithm for MSA constraints

Fig. 5. The partial display of the resulting MSA of Clustal W 1.82 by aligning the 3′-UTR sequences of six coronaviruses, where the bases not in the pseudoknots are marked with dots.

general MSA problem without constraints (Ikeda and Imai, 1994, 1999; Kobayashi and Imai, 1999; Lermen and Reinert, 2000). Hence, it is interesting to study whether or not the A∗ algorithm can still be applied to the CMSA problem.

ACKNOWLEDGEMENTS

The authors would like to thank the anonymous referees for their constructive comments to the presentation of this paper. This work was supported in part by National Science Council of Republic of China under grant NSC93-2213-E-009-113.

REFERENCES

Bafna,V., Lawler,E.L. and Pevzner,P.A. (1997) Approximation algorithms for multiple sequence alignment. Theoret. Comput. Sci. , 182 , 233–244. Bonizzoni,P. and Vedova,G.D. (2001) The complexity of multiple sequence alignment with SP-score that is a metric. Theoret. Comput. Sci. , 259 , 63–79. Carrillo,H. and Lipman,D. (1988) The multiple sequence alignment problem in biology. SIAM J. Appl. Math. , 48 , 1073–1082. Chan,S.C., Wong,A.K.C. and Chiu,D.K.Y. (1992) A survey of multiple sequence comparison methods. Bull. Math. Biol. , 54 , 563–598. Chao,K.M., Hardison,R.C. and Miller,W. (1994) Recent develop- ments in linear-space alignment methods: a survey. J. Comput. Biol. , 1 , 271–291. Chin,F.Y.L., Ho,N.L., Lamy,T.W., Wong,P.W.H. and Chan,M.Y. (2003) Efficient constrained multiple sequence alignment with performance guarantee. In Proceedings of the IEEE Computer Society Bioinformatics Conference (CSB 2003). IEEE, Los Alamitos, CA, pp. 337–346. Corpet,F. (1988) Multiple sequence alignment with hierarchical clustering. Nucleic Acids Res. , 16 , 10881–10890. Deiman,B. and Pleij,C.W.A. (1997) Pseudoknots: a vital feature in viral RNA. Semin. Virol. , 8 , 166–175.

Depiereux,E. and Feytmans,E. (1992) MATCH-BOX: a fundament- ally new algorithm for the simultaneous alignment of several protein sequences. Comput. Appl. Biosci. , 8 , 501–509. Feng,D.F. and Doolittle,R.F. (1987) Progressive sequence alignment as a prerequisite to correct phylogenetic trees. J. Mol. Evol. , 25 , 351–360. Gusfield,D. (1993) Efficient methods for multiple sequence align- ment with guaranteed error bounds. Bull. Math. Biol. , 55 , 141–154. Gusfield,D. (1997) Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology. Cambridge University Press, NY. Higgins,D. and Sharpe,P. (1988) CLUSTAL: a package for perform- ing multiple sequence alignment on a microcomputer. Gene , 73 , 237–244. Hirschberg,D.S. (1975) A linear space algorithm for computing maximal common subsequences. Commun. ACM , 18 , 341–343. Ikeda,T. and Imai,H. (1994) Fast A∗^ algorithms for multiple sequence alignment. In Proceedings of the Genome Informatics Workshop. Universal Academy Press, Tokyo, pp. 90–99. Ikeda,T. and Imai,H. (1999) Enhanced A∗^ algorithms for multiple alignments: optimal alignments for several sequences and k-opt approximate alignments for large cases. Theoret. Comput. Sci. , 210 , 341–374. Kececioglu,J.D. (1993) The maximum weight trace problem in mul- tiple sequence alignment. In Proceedings of the Fourth Annual Symposium on Combinatorial Pattern Matching (CPM 2004) , LNCS 684. Springer-Verlag, Heidelberg, Germany, pp. 106–119. Kobayashi,H. and Imai,H. (1999) Improvement of the A∗^ algorithm for multiple sequence alignment. In Proceedings of the Gen- ome Informatics Workshop. Universal Academy Press, Tokyo, pp. 120–130. Lermen,M. and Reinert,K. (2000) The practical use of the A∗ algorithm for exact multiple sequence alignment. J. Comput. Biol. , 7 , 655–672. Li,M., Ma,B. and Wang,L. (2000) Near optimal multiple alignment within a band in polynomial time. In Proceedings of the Thirty

C.L.Lu and Y.P.Huang

Second Annual ACM Symposium on Theory of Computing (STOC 2000). ACM Press, Portland, OR, pp. 425–434. McClure,M.A., Vasi,T.K. and Fitch,W.M. (1994) Comparat- ive analysis of multiple protein-sequence alignment methods. Mol. Biol. Evol. , 11 , 571–592. Morgenstern,B. (1999) DIALIGN 2: improvement of the segment-to-segment approach to multiple sequence alignment. Bioinformatics , 15 , 211–218. Myers,E.W. and Miller,W. (1988) Optimal alignment in linear space. Comput. Appl. Biosci. , 4 , 11–17. Myers,G., Selznick,S., Zhang,Z. and Miller,W. (1996) Progressive multiple alignment with constraints. J. Comput. Biol. , 3 , 563–572. Nicholas,H.B., Ropelewski,A.J. and Deerfield,D.W. (2002) Strategies for multiple sequence alignment. Biotechniques , 32 , 592–603. Notredame,C. (2002) Recent progresses in multiple sequence align- ment: a survey. Pharmacogenomics , 3 , 131–144. Notredame,C., Higgins,D.G. and Heringa,J. (2000) T-Coffee: a novel method for fast and accurate multiple sequence alignment. J. Mol. Biol. , 302 , 205–217. Pevzner,P.A. (1992) Multiple alignment, communication cost, and graph matching. SIAM J. Appl. Math. , 52 , 1763–1779. Pleij,C.W.A. (1994) RNA pseudoknots. Curr. Opin. Struct. Biol. , 4 , 337–344. Sammeth,M., Morgenstern,B. and Stoye,J. (2003) Divide-and- conquer multiple alignment with segment-based constraints. Bioinformatics , 19 (Suppl. 2), ii189–ii195. Schuler,G.D., Altschul,S.F. and Lipman,D.J. (1991) A workbench for multiple alignment construction and analysis. Proteins , 9 , 180–190. Stoye,J. (1998) Multiple sequence alignment with the divide-and- conquer method. Gene , 211 , GC45–GC56. Stoye,J., Moulton,V. and Dress,A.W.M. (1997a) DCA: an efficient implementation of the divide-and-conquer approach to simultaneous multiple sequence alignment. Comput. Appl. Biosci. , 13 , 625–626.

Stoye,J., Perrey,S.W. and Dress,A.W.M. (1997b) Improving the divide-and-conquer approach to sum-of-pairs multiple sequence alignment. Appl. Math. Lett. , 10 , 67–73. Tang,C.Y., Lu,C.L., Chang,M.D.T., Tsai,Y.T., Sun,Y.J., Chao,K.M., Chang,J.M., Chiou,Y.H., Wu,C.M., Chang,H.T. and Chou,W.I. (2003) Constrained multiple sequence alignment tool develop- ment and its application to RNase family alignment. J. Bioinform. Comput. Biol. , 1 , 267–287. Taylor,W.R. (1987) Multiple sequence alignment by a pairwise algorithm. Comput. Appl. Biosci. , 3 , 81–87. Taylor,W.R. (1994) Motif-biased protein sequence alignment. J. Comput. Biol. , 1 , 297–310. Thompson,J.D., Higgs,D.G. and Gibson,T.J. (1994) CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position specific gap penalties, and weight matrix choice. Nucleic Acids Res. , 22 , 4673–4680. Thompson,J.D., Plewniak,F., Thierry,J.-C. and Poch,O. (2000) DbClustal: rapid and reliable global multiple alignments of protein sequences detected by database searches. Nucleic Acids Res. , 28 , 2919–2926. Tönges,U., Perrey,S.W., Stoye,J. and Dress,A.W.M. (1996) A gen- eral method for fast multiple sequence alignment. Gene , 172 , GC33–GC41. Tsai,Y.T., Huang,Y.P., Yu,C.T. and Lu,C.L. (2004) MuSiC: a tool for multiple sequence alignment with constraints. Bioinformatics , (in press). Wang,L. and Jiang,T. (1994) On the complexity of multiple sequence alignment. J. Comput. Biol. , 1 , 337–348. Williams,G.D., Chang,R.-Y. and Brian,D.A. (1999) A phylogenetic- ally conserved hairpin-type 39 untranslated region pseudoknot functions in coronavirus RNA replication. J. Virol. , 73 , 8349–8355. Yu,C.T. (2003) Efficient algorithms for constrained sequence alignment problems. Master’s Thesis, Department of Computer Science and Information Management, Providence University.