Algorithm for Constrained Multiple Data Sequences Alignment: MUSCA, Papers of Computer Science

This paper describes the multiple sequence alignment problem and introduces the MUSCA algorithm, a two-stage approach of motif discovery followed by sequence alignment. The algorithm uses the notion of an alignment number K, a user-controlled parameter that requires at least K sequences to agree on a character, whenever possible, in the alignment. The document also covers pairwise compatibility of motifs and domain crossing errors.

MUSCA: An Algorithm for Constrained Alignment of Multiple Data Sequences

Laxmi Parida, Aris Floratos, Isidore Rigoutsos
Bioinformatics and Pattern Discovery Group
Computational Biology Center
IBM Thomas J. Watson Research Center
Yorktown Heights, NY 10598, USA

Abstract

Given a set of $N$ sequences, the Multiple Sequence Alignment problem is to align these $N$ sequences, possibly with gaps, in a way that brings out the best commonality of the $N$ sequences. MUSCA¹ is a two-stage approach to the alignment problem that identifies two relatively simpler sub-problems whose solutions are used to obtain the alignment of the sequences. We first discover motifs in the $N$ sequences and then extract an appropriate subset of compatible motifs to obtain a "good" alignment. The motifs of interest to us are the irredundant motifs, whose number is only polynomial in the input size. In practice, however, the number is much smaller (sub-linear). Notice that this step aids in a direct $N$-wise alignment, as opposed to composing the alignments from lower order (say pairwise) alignments, and the solution is also independent of the order of the input sequences; hence the algorithm works very well while dealing with a large number of sequences. The second part of the problem, which deals with obtaining a good alignment, is solved using a graph-theoretic approach that computes an induced subgraph satisfying certain simple constraints. We reduce a version of this problem to that of solving an instance of a set covering problem, and thus offer the best possible approximate solution to the problem (provided $P \neq NP$). Our experimental results, while preliminary, indicate that this approach is efficient, particularly on large numbers of long sequences, and gives good alignments when tested on biological data such as DNA and protein sequences.

We introduce the notion of an alignment number $K$ ($2 \le K \le N$), a user-controlled parameter that lends a useful flexibility to the aligning program: this additional requirement constrains the alignment to have at least $K$ sequences agree on a character, whenever possible, in the alignment. The usefulness of the alignment number is corroborated by the users, who view this as a natural constraint while dealing with a large number of sequences.

1 Introduction

Given a set of $N$ sequences, the Multiple Sequence Alignment problem is to align these $N$ sequences, possibly with gaps, in a way that brings out the best commonality of the $N$ sequences. Various alignment cost functions [2, 3, 4, 6, 8, 7, 14, 15, 12, 9] have been used in the literature. The general approach to solving the pairwise ($N = 2$) sequence alignment problem has been a dynamic programming technique using different scoring mechanisms that are functions of the edit distance, along with gap penalties, to evaluate the similarity of the sequences. In [16, 13] the case of $N > 2$ has been handled by first doing a pairwise alignment for some or all possible pairs in some order and then building an $N$-wise alignment from these.

MUSCA¹ uses a two-stage approach to the alignment problem by identifying two relatively simpler sub-problems which deal separately with the two issues, one of identifying the "local similarities" and the other of aligning the similarities appropriately. We first discover motifs in the $N$ sequences, and then use these motifs to obtain a "good" alignment. Informally, a motif is a repeated pattern that appears more than once in a sequence. In the alignment context, a motif is a pattern that appears in two or more input sequences.

¹ Musca is a constellation in the polar region of the Southern Hemisphere near Apus and Carina. Also, MUSCA is an anagram of the salient characters in Constrained Multiple Sequence Alignment.

A major point of criticism regarding the use of motifs is that they are usually very large in number (exponential in the input size); however, we show that the motifs that are relevant to the alignment problem are the irredundant motifs, and the number of these motifs is polynomial in the input size [10]. Moreover, in practice, this number is much smaller (sub-linear). Thus, using motifs for the alignment helps in at least two ways: (1) it aids in a direct $N$-wise alignment, as opposed to composing the alignments from lower order (say pairwise) alignments, and (2) the solution is independent of the order of the input sequences. We believe that, in practice, these have important consequences.

The second sub-problem of the alignment problem is that of obtaining a good alignment. Notice that an arbitrary set of motifs need not give rise to an alignment, under the premise that an alignment that uses a motif does not introduce gaps in the motif. Having obtained all possible motifs in the first stage, this stage involves pruning this set to obtain a (sub)set that gives an alignment. We solve this problem by mapping the motifs of the first stage to a suitable directed graph. Next we show that obtaining an alignment of the motifs is equivalent to solving a set-covering problem. Thus we present a very systematic way of aligning sequences based on motifs. It is well known that the multiple sequence alignment problem, in addition to being a hard-to-solve problem, is also very hard to model to the satisfaction of evolutionary biologists, geneticists and other users.
Does our approach have any theoretical contributions to the multiple sequence alignment problem in general? We introduce the notion of an alignment number $K$ ($2 \le K \le N$), a user-controlled parameter that lends a useful flexibility to the aligning program: this additional requirement constrains the alignment to have at least $K$ sequences agree on a character, whenever possible, in the alignment. This is particularly of interest when a large number of sequences are being aligned. The utility of the alignment number is corroborated by the users, who view this as a natural constraint while dealing with a large number of sequences.
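To make the role of the alignment number concrete, the following minimal sketch measures, for each column of a finished alignment, how many sequences agree on a character, and reports the columns that meet a given $K$. The function names `column_agreement` and `columns_meeting_k`, the gap symbol `-`, and the toy rows are assumptions introduced for illustration; they are not part of MUSCA as described here.

```python
from collections import Counter


def column_agreement(alignment, gap="-"):
    """For each column, return the largest number of rows that share
    one non-gap character (0 if the column holds only gaps)."""
    width = len(alignment[0])
    if any(len(row) != width for row in alignment):
        raise ValueError("all rows of an alignment must have equal length")
    result = []
    for col in range(width):
        counts = Counter(row[col] for row in alignment if row[col] != gap)
        result.append(max(counts.values()) if counts else 0)
    return result


def columns_meeting_k(alignment, k, gap="-"):
    """Indices (0-based) of columns where at least k rows agree."""
    agreement = column_agreement(alignment, gap)
    return [i for i, n in enumerate(agreement) if n >= k]


# Hypothetical toy alignment of N = 3 sequences.
rows = ["ab-cde", "-bxcd-", "abxc-e"]
print(column_agreement(rows))      # [2, 3, 2, 3, 2, 2]
print(columns_meeting_k(rows, 3))  # [1, 3]
```

With $K = 2$ every column of this toy alignment qualifies, while $K = 3$ keeps only the two columns on which all three rows agree.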

Roadmap. We describe our two-stage approach of motif discovery in Section 2 and aligning sequences in Section 3. We discuss the issues involved in using motifs for alignment and present a simple graph-theoretic formulation in Section 3.1. We present a heuristic algorithm for this problem by mapping it to a set covering problem in Section 3.2.

2 Motif Discovery (Stage 1)

We begin by giving a rigorous definition of a motif.

Definition 1 ($K$-motif $m$, location list $\mathcal{L}_m$) Given a string $s$ on alphabet $\Sigma$ and an integer $K$, $2 \le K < |s|$, a string $m$ on $\Sigma \cup \{.\}$ is a $K$-motif with location list $\mathcal{L}_m = (l_1, l_2, \ldots, l_p)$, if all of the following hold:

  1. $m[0], m[|m|-1] \in \Sigma$, [First and last characters of the motif are solid characters; if "don't care" characters (the `.' character) are allowed at the ends, the motifs can be made arbitrarily long without conveying any extra information.]
  2. $p \ge K$,
  3. there does not exist a location $l$, $l \ne l_i$, $1 \le i \le p$, such that $m$ occurs at $l$ on $s$ (the location list is of maximal size), and, [This ensures that any two distinct location lists must have distinct motifs associated with each.]
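The three conditions listed above can be sketched in a few lines. This is an illustrative sketch only: the helper names `occurs_at`, `location_list` and `is_k_motif` are hypothetical, positions are 0-based, the `.` character is assumed not to belong to the alphabet, and any condition of the definition beyond the three shown is not checked.

```python
def occurs_at(motif, s, pos, dont_care="."):
    """True if motif matches s starting at pos; '.' matches any character."""
    if pos < 0 or pos + len(motif) > len(s):
        return False
    window = s[pos:pos + len(motif)]
    return all(m == dont_care or m == c for m, c in zip(motif, window))


def location_list(motif, s, dont_care="."):
    """All positions where motif occurs on s. Collecting every occurrence
    makes the list maximal by construction (condition 3)."""
    last_start = len(s) - len(motif)
    return [i for i in range(last_start + 1) if occurs_at(motif, s, i, dont_care)]


def is_k_motif(motif, s, k, dont_care="."):
    """Check conditions 1 to 3 shown above (and nothing more)."""
    if not motif or motif[0] == dont_care or motif[-1] == dont_care:
        return False  # condition 1: solid first and last characters
    return len(location_list(motif, s, dont_care)) >= k  # condition 2


s = "axcaycazd"
print(location_list("a.c", s))  # [0, 3]
print(is_k_motif("a.c", s, 2))  # True
print(is_k_motif("a.", s, 2))   # False, ends with a don't-care
```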
  1. If $p_i$ and $p_j$ do not overlap in all the sequences, then $p_i$ is to the left of $p_j$, without loss of generality (otherwise a domain crossing mismatch is said to have occurred).
  2. If $p_i$ and $p_j$ overlap in any sequence, then $p_i$ is at some fixed distance $d$ to the left of $p_j$, without loss of generality (otherwise an overlap mismatch is said to have occurred).
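The two conditions above can be checked mechanically once the start positions of both motifs are known in every sequence that contains them. The sketch below is an assumption-laden illustration: it supposes one occurrence of each motif per sequence, and the function name `classify_pair` and its return strings are hypothetical labels chosen to mirror the terms used in the conditions.

```python
def classify_pair(starts, len_i, len_j):
    """Classify motifs p_i and p_j from their start positions.

    starts: list of (start_i, start_j), one pair per sequence that
            contains both motifs (one occurrence each is assumed).
    len_i, len_j: lengths of the two motifs.
    """
    if not starts:
        return "unrelated"  # the motifs never share a sequence

    def overlaps(si, sj):
        # Half-open intervals [si, si+len_i) and [sj, sj+len_j) intersect.
        return si < sj + len_j and sj < si + len_i

    offsets = [sj - si for si, sj in starts]
    if any(overlaps(si, sj) for si, sj in starts):
        # Condition 2: the offset must be the same fixed d everywhere.
        return "overlap" if len(set(offsets)) == 1 else "overlap mismatch"
    # Condition 1: one motif must be on the same side in every sequence.
    if all(o > 0 for o in offsets) or all(o < 0 for o in offsets):
        return "nonoverlap"
    return "domain crossing mismatch"


print(classify_pair([(0, 2), (5, 7)], 3, 3))  # overlap
print(classify_pair([(0, 2), (5, 8)], 3, 3))  # overlap mismatch
print(classify_pair([(0, 5), (1, 9)], 3, 3))  # nonoverlap
print(classify_pair([(0, 5), (9, 1)], 3, 3))  # domain crossing mismatch
```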

We define the alignment using a set of motifs as follows.

Definition 6 (sequence alignment, compatible set) Given a set $S$ of motifs, $v_{i_1}, v_{i_2}, \ldots, v_{i_n}$, a motif-alignment of the sequences $s_1, s_2, \ldots, s_m$ is the alignment such that in all the sequences, with no gaps in the motifs, the motifs $v_{i_1}, v_{i_2}, \ldots, v_{i_n}$ are aligned (in all the sequences in which they appear). If such an alignment exists, the set $s_1, s_2, \ldots, s_m$ is called a compatible set.

Definition 7 (linear ordering of motifs) Given a set of compatible motifs, a linear ordering is a consistent ordering of the motifs such that, in every sequence, the motifs that are present in that sequence appear in this order from left to right.

Is it sufficient to just check for pairwise incompatibility of motifs while seeking an alignment?

Definition 8 (domain crossing error) Given a set of motifs, $m_1, m_2, \ldots, m_n$, a domain crossing error is said to occur if there exists a linear ordering of the motifs $m_{i_1}, m_{i_2}, \ldots, m_{i_n}$, yet there exists no alignment that respects all the $n$ motifs.

Lemma 2 A set of irredundant motifs $p_1, p_2, \ldots, p_n$ is feasible if and only if none of the following holds:

  1. There exist distinct motifs $p_i$ and $p_j$ such that $p_i$ and $p_j$ are pairwise incompatible.
  2. There exists a non-empty subset of the motifs without a linear ordering.
  3. There exists a non-empty subset of the motifs that demonstrates a domain crossing error.

3.1 The Graph-theoretic Formulation

Next we wish to capture these conditions in a graph as follows. Construct a directed graph $G = (V, E)$ where every motif $p_i$ corresponds to a vertex $v_i$, thus $N = |V|$. The directed edges are introduced as follows:

  1. There is no edge between two vertices where the two corresponding motifs do not occur simultaneously in any sequence.
  2. If $p_i$ is to the left of $p_j$ in every sequence in which the two motifs are present, then a directed edge is placed from $v_i$ to $v_j$. This is to indicate that in the alignment the motif $p_i$ appears to the left of $p_j$. The edges are labeled as follows:

     (a) Label forbidden, if the motifs corresponding to $v_1$ and $v_2$ are not pairwise compatible.
     (b) Label overlap, if the motifs corresponding to $v_1$ and $v_2$ overlap.
     (c) Label nonoverlap, if the motifs corresponding to $v_1$ and $v_2$ are pairwise compatible but do not overlap.

The linear ordering of motifs is captured by checking for cycles in the graph. However, the domain crossing mismatch requires more careful handling, as described below.
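Checking for cycles and producing a linear ordering can be done in one pass with a topological sort. The sketch below uses Kahn's algorithm, a standard technique swapped in here for illustration; the excerpt does not say which cycle test the authors use. The function name `linear_ordering` and the vertex names are hypothetical.

```python
from collections import deque


def linear_ordering(vertices, edges):
    """Return the vertices in a left-to-right order consistent with the
    directed edges, or None if the graph contains a cycle."""
    indegree = {v: 0 for v in vertices}
    successors = {v: [] for v in vertices}
    for u, v in edges:
        successors[u].append(v)
        indegree[v] += 1

    ready = deque(v for v in vertices if indegree[v] == 0)
    order = []
    while ready:
        u = ready.popleft()
        order.append(u)
        for v in successors[u]:
            indegree[v] -= 1
            if indegree[v] == 0:
                ready.append(v)

    # Vertices left unprocessed lie on, or behind, a cycle.
    return order if len(order) == len(vertices) else None


motifs = ["p1", "p2", "p3"]
print(linear_ordering(motifs, [("p1", "p2"), ("p2", "p3")]))  # ['p1', 'p2', 'p3']
print(linear_ordering(motifs, [("p1", "p2"), ("p2", "p1")]))  # None
```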

Handling domain crossing mismatches. We associate a distance $D_{v_1 v_2}$ with every edge that is not labeled forbidden. This is used to compute the feasibility of a collection of motifs corresponding to a solution; it does not contribute to the cost of the alignment. (We discuss the weight corresponding to the cost in the next section.) Let $p_i$ and $p_j$ be the two motifs corresponding to the vertices. Then, if $d$ is the minimum of the distances between the occurrences of the two motifs over every sequence in which both of them appear, $D_{v_i v_j} = d$. To detect the domain crossing mismatches of motifs (that are pairwise compatible), we define the notion of a consistent graph w.r.t. a vertex.
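As a small illustration of the edge distance, the sketch below takes the minimum over the sequences that contain both motifs. It assumes that distance is measured between start positions and that each motif occurs once per sequence; the excerpt does not fix either convention. The name `edge_distance` and the dictionary layout (sequence id mapped to start position) are hypothetical.

```python
def edge_distance(occ_i, occ_j):
    """Distance D for the edge v_i -> v_j.

    occ_i, occ_j: dicts mapping a sequence id to the start position of
    the motif in that sequence. The distance is the minimum, over all
    sequences containing both motifs, of start_j - start_i.
    """
    shared = occ_i.keys() & occ_j.keys()
    if not shared:
        raise ValueError("no edge: the motifs never occur in the same sequence")
    return min(occ_j[seq] - occ_i[seq] for seq in shared)


p_i = {"s1": 2, "s2": 0, "s3": 4}
p_j = {"s1": 7, "s2": 9}
print(edge_distance(p_i, p_j))  # 5, from sequence s1
```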

Definition 9 Let $G = (V, E)$ be a labeled, weighted, directed graph with weights on the edge $uv$ given by $D(u, v)$ and a label $\in \{forbidden, overlap, nonoverlap\}$. A path $P$ is valid if it has no edges labeled forbidden. Further, a valid path $P$ is called an overlap-path if all the edges in the path are labeled overlap. The weight of the valid path $P$, $D_P$, is the sum of the weights of its constituent edges. Let $p \in V$. The graph is consistent w.r.t. $p$ if $\forall q \in V$, for all pairs of vertex-disjoint valid paths from $p$ to $q$, $P_1$ and $P_2$,

  1. $D_{P_1} = D_{P_2}$, if $P_1$ and $P_2$ are both overlap-paths, or,
  2. $D_{P_1} \ge D_{P_2}$, if $P_1$ is an overlap-path and $P_2$ is not.

We now present the straightforward observation that relates the set of compatible motifs to a feasible subgraph.

Lemma 3 The following two statements are equivalent:

• Given a subset of motifs $p_1, p_2, \ldots, p_n$ from the set of all motifs from $m$ sequences of input, the subset is compatible, if the following holds:

  1. the motifs are not pairwise incompatible,
  2. there exists a linear ordering of $p_1, p_2, \ldots, p_n$, and,
  3. there is no domain crossing mismatch in $p_1, p_2, \ldots, p_n$.

• Given a subset of vertices $v_1, v_2, \ldots, v_n$, constructed as defined in this section, the induced subgraph on $v_1, v_2, \ldots, v_n$ is feasible, if the following holds:

  1. there is no edge labeled forbidden in the induced subgraph,
  2. the induced subgraph is acyclic, and,
  3. the induced subgraph is consistent w.r.t. every vertex $v_i$, $1 \le i \le n$.

Lemma 4 If $p$ is a redundant motif, then using the motif $p$ does not improve the cost of the optimization problem.

Proof Sketch. Let $p$ be rendered redundant by motifs $p_1, p_2, \ldots, p_n$, $n \ge 1$. By definition, motif $p$ has fewer solid characters than each of $p_i$, $1 \le i \le n$. Thus if an alignment can use motif $p$, it can certainly use all the motifs $p_1, p_2, \ldots, p_n$, giving a larger number of solid characters and thus a higher cost for the optimization problem. □

3.2 Algorithm to compute the "best" alignment

Given a set of incompatible motifs, the set can be grouped into sets (not necessarily disjoint) such that each set violates exactly one of the three conditions of Lemma 2. These sets are called basic incompatible sets. Next, it can be easily shown that we can remove exactly one motif from a basic incompatible set to make it compatible. The algorithm proceeds in the following three steps.
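Since removing one motif from each basic incompatible set suffices, choosing which motifs to drop resembles a hitting-set problem, the dual view of set covering. The sketch below is the textbook greedy heuristic for that problem, shown only to illustrate the idea; it is an assumption on our part and not necessarily the procedure the authors use, and the name `greedy_removal` and the motif labels are hypothetical.

```python
def greedy_removal(incompatible_sets):
    """Pick motifs to remove so that every basic incompatible set loses
    at least one member. Greedy rule: repeatedly remove the motif that
    appears in the most sets not yet hit (ties broken by name)."""
    remaining = [set(s) for s in incompatible_sets if s]
    removed = []
    while remaining:
        counts = {}
        for group in remaining:
            for motif in group:
                counts[motif] = counts.get(motif, 0) + 1
        # sorted() makes the tie-break deterministic: max() keeps the
        # first of several equally frequent motifs.
        best = max(sorted(counts), key=counts.get)
        removed.append(best)
        remaining = [group for group in remaining if best not in group]
    return removed


sets = [{"p1", "p2"}, {"p2", "p3"}, {"p4", "p5"}]
print(greedy_removal(sets))  # ['p2', 'p4']
```

The greedy rule gives the usual logarithmic approximation guarantee for set cover, but it ignores motif weights; a weighted variant would divide each count by the cost of dropping that motif.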

3.1 We compute a linear ordering (see Definition 7), using the graph $G'$ of Step 2, of the motifs $p^i_{j_1}, p^i_{j_2}, \ldots, p^i_{j_i}$ for each sequence $i$. Such a linear ordering of the motifs exists, since the set of motifs is feasible.

3.2 From the original sequence $s_i$, we obtain the fillers (if any) between two consecutive motifs as $f^i_0, f^i_{j_1}, f^i_{j_2}, \ldots, f^i_{j_i}$. $f^i_0$ is the leftmost portion, possibly empty. For example, let sequence $s_i = abcdefghijkl$ and let two motifs be as follows: $p^i_1 = cde$ and $p^i_2 = ghi$. Then $f^i_0 = ab$, $f^i_1 = f$, $f^i_2 = jkl$.

3.3 We obtain an alignment of the sequences by appropriately aligning each $(p^i_{j_l} + f^i_{j_l})$ and $f^i_0$, $l = 1, 2, \ldots, j_i$, filling the gaps with `-'. The symbol `+' denotes a string concatenation operation. The alignment is such that each motif of a sequence is perfectly aligned with the corresponding motif in all the other sequences. For example, let $s_1 = abcdefghijkl$ and $s_2 = cdexyzpqrghitu$. Then $p^1_1 = p^2_1 = cde$ and $p^1_2 = p^2_2 = ghi$, and $f^1_0 = ab$, $f^1_1 = f$, $f^1_2 = jkl$, $f^2_0 =$ empty, $f^2_1 = xyzpqr$, $f^2_2 = tu$. Then the alignment of the sequences is as follows:

(1) a b c d e f - - - - - g h i j k l
(2) - - c d e x y z p q r g h i t u -
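Steps 3.2 and 3.3 can be sketched as follows. This is a minimal illustration under stated assumptions: every sequence contains every motif, motifs do not overlap, the leading filler is padded on its left and every other filler on its right. Where the gaps go inside a filler block is our choice, not something the text specifies. The names `split_sequence` and `align_on_motifs` and the `(start, end)` span format are hypothetical.

```python
def split_sequence(seq, spans):
    """Split seq into the leading filler f_0 and (motif, filler) pairs.

    spans: non-overlapping (start, end) motif positions, left to right,
           with end exclusive.
    """
    f0 = seq[:spans[0][0]]
    pieces = []
    for idx, (start, end) in enumerate(spans):
        next_start = spans[idx + 1][0] if idx + 1 < len(spans) else len(seq)
        pieces.append((seq[start:end], seq[end:next_start]))
    return f0, pieces


def align_on_motifs(seqs, spans_per_seq, gap="-"):
    """Build one aligned row per sequence so that corresponding motifs
    share columns; fillers are padded with the gap character."""
    parts = [split_sequence(s, sp) for s, sp in zip(seqs, spans_per_seq)]
    lead = max(len(f0) for f0, _ in parts)
    rows = [f0.rjust(lead, gap) for f0, _ in parts]
    for l in range(len(parts[0][1])):
        width = max(len(pieces[l][1]) for _, pieces in parts)
        for r, (_, pieces) in enumerate(parts):
            motif, filler = pieces[l]
            rows[r] += motif + filler.ljust(width, gap)
    return rows


s1, s2 = "abcdefghijkl", "cdexyzpqrghitu"
rows = align_on_motifs([s1, s2], [[(2, 5), (6, 9)], [(0, 3), (9, 12)]])
print(rows[0])  # abcdef-----ghijkl
print(rows[1])  # --cdexyzpqrghitu-
```

Both rows have the same length, and the motifs `cde` and `ghi` occupy the same columns in each.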

4 Summary

We have proposed a two-stage approach to the alignment problem by handling two relatively simpler sub-problems which deal separately with the two issues, one of identifying the "local similarities" and the other of aligning the similarities appropriately. In the first stage we identify all possible $K$-wise motifs, i.e., all motifs that appear simultaneously in at least $K$ of the $N$ input sequences ($2 \le K \le N$). In the second stage, we give plausible alignments of a carefully chosen subset of these motifs (that optimize certain cost functions). Using this approach for the alignment helps in at least two ways: (1) it aids in a direct $N$-wise alignment, as opposed to composing the alignments from lower order (say pairwise) alignments, and (2) the resulting alignment is independent of the order of the input sequences. $K$ is an input parameter and is called the alignment number. In practice, our approach works particularly well for the alignment of a large set of (long) sequences. We have presented the result of running our alignment algorithm on biological data and the results look very promising. In the full version of the paper we discuss the computational complexity of the underlying optimization problem along with an analysis of the approximation factor of any alignment that results from the algorithm.

References

[1] Arora, S., Karger, D., Karpinski, M., Polynomial time approximation schemes for dense instances of NP-hard problems, Proceedings of the Twenty-Seventh Annual ACM Symposium on the Theory of Computing, ACM, 284-93, 1995.

[2] Altschul, S.F., Gap costs for multiple sequence alignment, J. Theor. Biol., 138:297-309, 1989.

[3] Chao, K.M., Hardison, R., and Miller, W., Recent developments in linear-space alignment methods: a survey, J. Comput. Biol., 3:271-291, 1994.

[4] Carrillo, H. and Lipman, D., The multiple sequence alignment problem in biology, SIAM J. Appl. Math., 48:1073-1082, 1988.

[5] Cormen, T.H., Leiserson, C.E., and Rivest, R.L., Introduction to Algorithms, The MIT Press, Cambridge, Massachusetts, 1990.

[6] Gupta, S.K., Kececioglu, J.D., and Schaffer, A.A., Improving the practical space and time efficiency of the shortest-paths approach to sum-of-pairs multiple sequence alignment, J. Comput. Biol., 2(3):459-472, 1995.

[7] Higgins, D.G. and Sharp, P.M., CLUSTAL: a package for performing multiple sequence alignment on a microcomputer, Gene, 73:237-244, 1988.

[8] Hirosawa, M., Totoki, Y., Hoshida, M., and Ishikawa, M., Comprehensive study on iterative algorithms of multiple sequence alignment, Comput. Appl. Biosci., 11(1):13-18, 1995.

[9] Lawrence, C.E., Altschul, S.F., Boguski, M.S., Liu, J.S., Neuwald, A.F., and Wootton, J.C., Detecting subtle sequence signals: a Gibbs sampling strategy for multiple alignment, Science, 262(5131):208-214, 1993.

[10] Parida, L., Algorithmic Techniques in Computational Genomics, Ph.D. thesis, Courant Institute of Mathematical Sciences, New York University, September, 1998.

[11] Rigoutsos, I. and Floratos, A., Motif discovery in biological sequences without alignment or enumeration, In Proceedings of the Annual Conference on Computational Molecular Biology (RECOMB98), 221-227, ACM Press, 1998.

[12] Rocke, E. and Tompa, M., An algorithm for finding novel gapped motifs in DNA sequences, In Proceedings of the Annual Conference on Computational Molecular Biology (RECOMB98), 228-233, ACM Press, 1998.

[13] Vihinen, M., An algorithm for simultaneous comparison of several sequences, Comput. Appl. Biosci., 4(1):89-92, 1988.

[14] Waterman, M.S., Parametric and ensemble alignment algorithms, Bull. Math. Biol., 56(4):743-767, 1994.

[15] Waterman, M.S., An Introduction to Computational Biology: Maps, Sequences and Genomes, Chapman & Hall, 1995.

[16] Miller, W., Zhang, Z., and He, B., Local multiple alignment via subgraph enumeration, Discrete Appl. Math., 71:337-365, 1996.