




This paper describes the multiple sequence alignment problem and introduces the MUSCA algorithm, a two-stage approach based on motif discovery followed by sequence alignment. The algorithm uses the notion of an alignment number K, a user-controlled parameter that requires at least K sequences in the alignment to agree on a character whenever possible. The paper also covers pairwise compatibility of motifs and domain crossing errors.





Laxmi Parida, Aris Floratos, Isidore Rigoutsos
Bioinformatics and Pattern Discovery Group
Computational Biology Center
IBM Thomas J. Watson Research Center
Yorktown Heights, NY 10598, USA

Abstract

Given a set of $N$ sequences, the Multiple Sequence Alignment problem is to align these $N$ sequences, possibly with gaps, in a way that brings out the best commonality of the $N$ sequences. MUSCA$^1$ is a two-stage approach to the alignment problem: it identifies two relatively simpler sub-problems whose solutions are used to obtain the alignment of the sequences. We first discover motifs in the $N$ sequences and then extract an appropriate subset of compatible motifs to obtain a "good" alignment. The motifs of interest to us are the irredundant motifs, whose number is only polynomial in the input size. In practice, however, the number is much smaller (sub-linear). Notice that this step aids in a direct $N$-wise alignment, as opposed to composing the alignment from lower order (say pairwise) alignments, and the solution is also independent of the order of the input sequences; hence the algorithm works very well when dealing with a large number of sequences. The second part of the problem, which deals with obtaining a good alignment, is solved using a graph-theoretic approach that computes an induced subgraph satisfying certain simple constraints. We reduce a version of this problem to that of solving an instance of a set covering problem, and thus offer the best possible approximate solution to the problem (provided $P \neq NP$). Our experimental results, while preliminary, indicate that this approach is efficient, particularly on large numbers of long sequences, and gives good alignments when tested on biological data such as DNA and protein sequences.
We introduce the notion of an alignment number $K$ ($2 \le K \le N$), a user-controlled parameter that lends a useful flexibility to the aligning program: this additional requirement constrains the alignment to have at least $K$ sequences agree on a character, whenever possible. The usefulness of the alignment number is corroborated by users, who view it as a natural constraint when dealing with a large number of sequences.
Given a set of $N$ sequences, the Multiple Sequence Alignment problem is to align these $N$ sequences, possibly with gaps, in a way that brings out the best commonality of the $N$ sequences. Various alignment cost functions [2, 3, 4, 6, 8, 7, 14, 15, 12, 9] have been used in the literature. The general approach to solving the pairwise ($N = 2$) sequence alignment problem has been a dynamic programming technique that uses different scoring mechanisms, each a function of the edit distance along with gap penalties, to evaluate the similarity of the sequences. In [16, 13] the case of $N > 2$ has been handled by first doing a pairwise alignment for some or all possible pairs in some order and then building an $N$-wise alignment from these.

MUSCA$^1$ uses a two-stage approach to the alignment problem by identifying two relatively simpler sub-problems which deal separately with the two issues, one of identifying the "local similarities" and the other of aligning the similarities appropriately. We first discover motifs in the $N$ sequences, and then use these motifs to obtain a "good" alignment. Informally, a motif is a repeated pattern that appears more than once in a sequence. In the alignment context a motif is a pattern that appears in two or more input sequences. A major point of criticism regarding the use of motifs is that they are usually very large in number (exponential in the input size); however, we show that the motifs that are relevant to the alignment problem are the irredundant motifs, and the number of these motifs is polynomial in the input size [10]. Moreover, in practice, this number is much smaller (sub-linear). Thus, using motifs for the alignment helps in at least two ways: (1) it aids in a direct $N$-wise alignment, as opposed to composing the alignment from lower order (say pairwise) alignments, and (2) the solution is independent of the order of the input sequences. We believe that, in practice, these have important consequences.

$^1$ Musca is a constellation in the polar region of the Southern Hemisphere near Apus and Carina. Also, MUSCA is an anagram of the salient characters in Constrained Multiple Sequence Alignment.

The second sub-problem of the alignment problem is that of obtaining a good alignment. Notice that an arbitrary set of motifs need not give rise to an alignment, under the premise that an alignment that uses a motif does not introduce gaps in the motif. Having obtained all possible motifs in the first stage, this stage involves pruning this set to obtain a (sub)set that gives an alignment. We solve this problem by mapping the motifs of the first stage to a suitable directed graph. Next we show that obtaining an alignment of the motifs is equivalent to solving a set-covering problem. Thus we present a very systematic way of aligning sequences based on motifs. It is well known that the multiple sequence alignment problem, in addition to being hard to solve, is also very hard to model to the satisfaction of evolutionary biologists, geneticists and other users.
Does our approach make any theoretical contribution to the multiple sequence alignment problem in general? We introduce the notion of an alignment number $K$ ($2 \le K \le N$), a user-controlled parameter that lends a useful flexibility to the aligning program: this additional requirement constrains the alignment to have at least $K$ sequences agree on a character, whenever possible. This is particularly of interest when a large number of sequences are being aligned. The utility of the alignment number is corroborated by users, who view it as a natural constraint when dealing with a large number of sequences.
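To make the alignment number concrete, the following minimal Python sketch counts, for each column of a gapped alignment, the largest number of sequences that agree on a non-gap character, and lists the columns that reach $K$. The function names, the `-` gap symbol and the toy rows are illustrative assumptions, not part of MUSCA.

```python
from collections import Counter

GAP = "-"  # assumed gap symbol


def column_agreement(rows):
    """For each column, return the largest number of rows sharing one non-gap character."""
    if len({len(r) for r in rows}) > 1:
        raise ValueError("all aligned rows must have the same length")
    agreement = []
    for column in zip(*rows):
        counts = Counter(c for c in column if c != GAP)
        agreement.append(max(counts.values(), default=0))
    return agreement


def columns_meeting_k(rows, k):
    """Indices of the columns in which at least k rows agree on a character."""
    return [i for i, a in enumerate(column_agreement(rows)) if a >= k]


# Toy alignment of three sequences (hypothetical data).
rows = ["abcde-", "-bcdxf", "abc-ef"]
print(column_agreement(rows))      # [2, 3, 3, 2, 2, 2]
print(columns_meeting_k(rows, 3))  # [1, 2]
```

With $K = 2$ every column of this toy alignment qualifies, while $K = 3$ keeps only the two columns on which all three rows agree.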
Roadmap. We describe our two-stage approach of motif discovery in Section 2 and aligning sequences in Section 3. We discuss the issues involved in using motifs for alignment and present a simple graph-theoretic formulation in Section 3.1. We present a heuristic algorithm for this problem by mapping it to a set covering problem in Section 3.2.
We begin by giving a rigorous definition of a motif.
Definition 1 ($K$-motif $m$, location list $\mathcal{L}_m$) Given a string $s$ on alphabet $\Sigma$ and an integer $K$, $2 \le K < |s|$, a string $m$ on $\Sigma \cup \{.\}$ is a $K$-motif with location list $\mathcal{L}_m = (l_1, l_2, \ldots, l_p)$, if all of the following hold:
We define the alignment using a set of motifs as follows.
Definition 6 (sequence alignment, compatible set) Given a set $S$ of motifs, $v_{i_1}, v_{i_2}, \ldots, v_{i_n}$, a motif-alignment of the sequences, $s_1, s_2, \ldots, s_m$, is the alignment such that in all the sequences, with no gaps in the motifs, the motifs $v_{i_1}, v_{i_2}, \ldots, v_{i_n}$ are aligned (in all the sequences they appear). If such an alignment exists, the set $s_1, s_2, \ldots, s_m$ is called a compatible set.
Definition 7 (linear ordering of motifs) Given a set of compatible motifs, a consistent ordering of the motifs is one such that, in every sequence, the motifs that are present in that sequence appear in this same left-to-right order. Such an ordering is called the linear ordering.
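As an illustration of Definition 7, a linear ordering can be searched for with a topological sort over "appears to the left of" constraints. The sketch below is a minimal version under stated assumptions: each motif is identified by a hypothetical name, occurs at most once per sequence, and the input lists the motifs of each sequence from left to right. It uses the standard-library `graphlib` module (Python 3.9 or later) and is not the paper's own procedure.

```python
from graphlib import CycleError, TopologicalSorter


def linear_ordering(motif_orders):
    """Return one ordering consistent with every sequence, or None if none exists.

    motif_orders: one list per sequence, naming its motifs from left to right.
    """
    predecessors = {}
    for order in motif_orders:
        for motif in order:
            predecessors.setdefault(motif, set())
        # Consecutive pairs are enough: the remaining constraints follow by transitivity.
        for left, right in zip(order, order[1:]):
            predecessors[right].add(left)
    try:
        return list(TopologicalSorter(predecessors).static_order())
    except CycleError:
        # Two sequences disagree on the relative order of some motifs.
        return None


print(linear_ordering([["A", "B", "C"], ["A", "C"], ["B", "C", "D"]]))  # ['A', 'B', 'C', 'D']
print(linear_ordering([["A", "B"], ["B", "A"]]))                        # None
```

A cycle among the precedence constraints means that no single left-to-right order fits all sequences, so the motif set admits no linear ordering.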
Is it sufficient to just check for pairwise incompatibility of motifs while seeking an alignment?
Definition 8 (domain crossing error) Given a set of motifs, $m_1, m_2, \ldots, m_n$, a domain crossing error is said to occur if there exists a linear ordering of the motifs $m_{i_1}, m_{i_2}, \ldots, m_{i_n}$, yet there exists no alignment that respects all the $n$ motifs.
Lemma 2 A set of irredundant motifs $p_1, p_2, \ldots, p_n$ is feasible if and only if none of the following holds:
3.1 The Graph-theoretic Formulation
Next we wish to capture these conditions in a graph as follows. Construct a directed graph $G = (V, E)$ where every motif $p_i$ corresponds to a vertex $v_i$, thus $N = |V|$. The directed edges are introduced as follows:
(a) Label forbidden, if the motifs corresponding to $v_1$ and $v_2$ are not pairwise compatible.
(b) Label overlap, if the motifs corresponding to $v_1$ and $v_2$ overlap.
(c) Label nonoverlap, if the motifs corresponding to $v_1$ and $v_2$ are pairwise compatible but do not overlap.
The linear ordering of motifs is captured by checking for cycles in the graph. However, the domain crossing mismatch requires more careful handling, as described below.
Handling domain crossing mismatches. We associate a distance $D_{v_1 v_2}$ with every edge that is not labeled forbidden. This distance is used to compute the feasibility of a collection of motifs corresponding to a solution; it does not contribute to the cost of the alignment. (We discuss the weight corresponding to the cost in the next section.) Let $p_i$ and $p_j$ be the two motifs corresponding to the vertices. If $d$ is the minimum, over every sequence in which both motifs appear, of the distance between the occurrences of the two motifs, then $D_{v_i v_j} = d$. To detect the domain crossing mismatches of motifs (that are pairwise compatible), we define the notion of a consistent graph w.r.t. a vertex.
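The edge distance can be sketched in a few lines of Python. The text does not say how the distance between two occurrences is measured, so this sketch assumes it is the number of characters between the end of the first motif and the start of the second, and that each motif occurs at most once per sequence. The function name and the dictionary layout are hypothetical.

```python
def edge_distance(occ_u, occ_v, len_u):
    """Minimum gap between motif u and motif v over the sequences containing both.

    occ_u, occ_v: dict mapping a sequence id to the start position of the motif
                  in that sequence (assumption: one occurrence per sequence).
    len_u:        length of motif u.
    Returns None when the two motifs share no sequence.
    """
    shared = occ_u.keys() & occ_v.keys()
    if not shared:
        return None
    # Gap = start of v minus the position just past the end of u (assumed measure).
    return min(occ_v[s] - (occ_u[s] + len_u) for s in shared)


# Motifs cde and ghi in s1 = abcdefghijkl and s2 = cdexyzpqrghitu.
cde = {"s1": 2, "s2": 0}
ghi = {"s1": 6, "s2": 9}
print(edge_distance(cde, ghi, len_u=3))  # 1
```

In this example the gap is 1 in the first sequence and 6 in the second, so the edge carries the distance 1. Under the assumed measure, overlapping motifs would yield a negative value.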
Definition 9 Let $G = (V, E)$ be a labeled, weighted, directed graph with weights on the edge $uv$ given by $D(u, v)$ and a label $\in \{\mathit{forbidden}, \mathit{overlap}, \mathit{nonoverlap}\}$. A path, $P$, is valid if it has no edges labeled forbidden. Further, a valid path, $P$, is called an overlap-path if all the edges in the path are labeled overlap. The weight of the valid path $P$, $D_P$, is the sum of the weights of its constituent edges. Let $p \in V$. The graph is consistent w.r.t. $p$ if $\forall q \in V$, for all pairs of vertex-disjoint valid paths from $p$ to $q$, $P_1$ and $P_2$,
Lemma 3 The following two statements are equivalent:
Given a subset of motifs $p_1, p_2, \ldots, p_n$ from the set of all motifs from $m$ sequences of input, the subset is compatible if the following holds:
Given a subset of vertices $v_1, v_2, \ldots, v_n$, constructed as defined in this section, the induced subgraph on $v_1, v_2, \ldots, v_n$ is feasible if the following holds:
Lemma 4 If $p$ is a redundant motif, then using the motif $p$ does not improve the cost of the optimization problem.
Proof Sketch. Let $p$ be rendered redundant by motifs $p_1, p_2, \ldots, p_n$, $n \ge 1$. By definition, motif $p$ has fewer solid characters than each $p_i$, $1 \le i \le n$. Thus if an alignment can use motif $p$, it can certainly use all the motifs $p_1, p_2, \ldots, p_n$, giving a larger number of solid characters and thus a higher cost for the optimization problem. $\Box$
3.2 Algorithm to compute the "best" alignment
Given a set of incompatible motifs, the set can be grouped into sets (not necessarily disjoint) such that each set violates exactly one of the three conditions of Lemma 2. These sets are called basic incompatible sets. Next, it can be easily shown that we can remove exactly one motif from a basic incompatible set to make it compatible. The algorithm proceeds in the following three steps.
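Since one removal repairs each basic incompatible set, choosing which motifs to discard can be viewed as a covering problem: pick motifs so that every basic incompatible set loses at least one member. The sketch below applies the standard greedy set-cover heuristic to that view. It is an illustration under assumptions, not the paper's stated procedure: the motif identifiers are hypothetical, and `weight` is an assumed positive cost of discarding a motif (for instance its number of solid characters).

```python
def greedy_removal(incompatible_sets, weight):
    """Choose motifs to remove so that every basic incompatible set is hit.

    incompatible_sets: iterable of sets of motif ids.
    weight:            dict motif id -> positive cost of discarding it (assumed).
    """
    remaining = [set(s) for s in incompatible_sets if s]
    removed = []
    while remaining:
        candidates = set().union(*remaining)
        # Greedy rule: most unresolved sets hit per unit of weight.
        # Sorting first makes ties resolve deterministically by motif id.
        best = max(
            sorted(candidates),
            key=lambda m: sum(m in s for s in remaining) / weight[m],
        )
        removed.append(best)
        remaining = [s for s in remaining if best not in s]
    return removed


sets = [{"p1", "p2"}, {"p2", "p3"}, {"p3", "p4"}]
print(greedy_removal(sets, {"p1": 1, "p2": 1, "p3": 1, "p4": 1}))  # ['p2', 'p3']
print(greedy_removal(sets, {"p1": 1, "p2": 5, "p3": 1, "p4": 1}))  # ['p3', 'p1']
```

Raising the weight of `p2` in the second call steers the heuristic towards keeping it, at the price of discarding `p1` instead.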
3.1 We compute a linear ordering (see Definition 7), using the graph $G'$ of Step 2, of the motifs $p^i_{j_1}, p^i_{j_2}, \ldots, p^i_{j_i}$ for each sequence $i$. Such a linear ordering of the motifs exists, since the set of motifs is feasible.
3.2 From the original sequence $s_i$, we obtain the fillers (if any) between two consecutive motifs as $f^i_0, f^i_{j_1}, f^i_{j_2}, \ldots, f^i_{j_i}$. $f^i_0$ is the leftmost portion, possibly empty. For example, let sequence $s_i$ = abcdefghijkl and let two motifs be as follows: $p^i_1$ = cde and $p^i_2$ = ghi. Then $f^i_0$ = ab, $f^i_1$ = f, $f^i_2$ = jkl.
3.3 We obtain an alignment of the sequences by appropriately aligning each $(p^i_{j_l} + f^i_{j_l})$ and $f^i_0$, $l = 1, 2, \ldots, j_i$, filling the gaps with '-'. The symbol '+' denotes a string concatenation operation. The alignment is such that each motif of a sequence is perfectly aligned with the corresponding motif in all the other sequences. For example, let $s_1$ = abcdefghijkl and $s_2$ = cdexyzpqrghitu. Then $p^1_1 = p^2_1$ = cde and $p^1_2 = p^2_2$ = ghi, and $f^1_0$ = ab, $f^1_1$ = f, $f^1_2$ = jkl, $f^2_0$ = empty, $f^2_1$ = xyzpqr, $f^2_2$ = tu. Then the alignment of the sequences is as follows (the motifs are shown in bold):
(1) a b c d e f g h i j k l
(2) c d e x y z p q r g h i t u
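Steps 3.2 and 3.3 can be sketched in Python as follows. This is a minimal illustration under several assumptions that the text does not state: every motif is a solid string present in every sequence, the motifs are given in their linear order, the leading filler is padded on its left, and every other filler is padded on its right. The function names are hypothetical.

```python
def split_fillers(seq, motifs):
    """Return [f0, f1, ..., fn]: the text before the first motif and after each motif."""
    fillers, pos = [], 0
    for motif in motifs:
        start = seq.find(motif, pos)
        if start < 0:
            raise ValueError(f"motif {motif!r} not found in order in {seq!r}")
        fillers.append(seq[pos:start])
        pos = start + len(motif)
    fillers.append(seq[pos:])
    return fillers


def motif_alignment(seqs, motifs, gap="-"):
    """Align sequences so that each motif occupies the same columns in every row."""
    fillers = [split_fillers(s, motifs) for s in seqs]
    # Each filler slot is as wide as the longest filler found in that slot.
    widths = [max(len(f[k]) for f in fillers) for k in range(len(motifs) + 1)]
    rows = []
    for f in fillers:
        row = f[0].rjust(widths[0], gap)  # leading filler: gaps on the left (assumed)
        for k, motif in enumerate(motifs, start=1):
            row += motif + f[k].ljust(widths[k], gap)  # other fillers: gaps on the right (assumed)
        rows.append(row)
    return rows


print(split_fillers("abcdefghijkl", ["cde", "ghi"]))  # ['ab', 'f', 'jkl']
for row in motif_alignment(["abcdefghijkl", "cdexyzpqrghitu"], ["cde", "ghi"]):
    print(row)
# abcdef-----ghijkl
# --cdexyzpqrghitu-
```

Under these padding assumptions both rows have 17 columns, and the motifs cde and ghi occupy the same columns in each row.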
We have proposed a two-stage approach to the alignment problem by handling two relatively simpler sub-problems which deal separately with the two issues, one of identifying the "local similarities" and the other of aligning the similarities appropriately. In the first stage we identify all possible $K$-wise motifs, i.e., all motifs that appear simultaneously in at least $K$ of the $N$ input sequences ($2 \le K \le N$). In the second stage, we give plausible alignments of a carefully chosen subset of these motifs (that optimize certain cost functions). Using this approach for the alignment helps in at least two ways: (1) it aids in a direct $N$-wise alignment, as opposed to composing the alignment from lower order (say pairwise) alignments, and (2) the resulting alignment is independent of the order of the input sequences. $K$ is an input parameter and is called the alignment number. In practice, our approach works particularly well for the alignment of a large set of (long) sequences. We have presented the results of running our alignment algorithm on biological data, and the results look very promising. In the full version of the paper we discuss the computational complexity of the underlying optimization problem along with an analysis of the approximation factor of any alignment that results from the algorithm.
[1] Arora, S., Karger, D., and Karpinski, M., Polynomial time approximation schemes for dense instances of NP-hard problems, Proceedings of the Twenty-Seventh Annual ACM Symposium on the Theory of Computing, ACM, 284-93, 1995.
[2] Altschul, S.F., Gap costs for multiple sequence alignment, J. Theor. Biol., 138:297-309, 1989.
[3] Chao, K.M., Hardison, R., and Miller, W., Recent developments in linear-space alignment methods: a survey, J. Comput. Biol., 3:271-291, 1994.
[4] Carrillo, H. and Lipman, D., The multiple sequence alignment problem in biology, SIAM J. Appl. Math., 48:1073-1082, 1988.
[5] Cormen, T.H., Leiserson, C.E., and Rivest, R.L., Introduction to Algorithms, The MIT Press, Cambridge, Massachusetts, 1990.
[6] Gupta, S.K., Kececioglu, J.D., and Schaffer, A.A., Improving the practical space and time efficiency of the shortest-paths approach to sum-of-pairs multiple sequence alignment, J. Comput. Biol., 2(3):459-472, 1995.
[7] Higgins, D.G. and Sharp, P.M., CLUSTAL: a package for performing multiple sequence alignment on a microcomputer, Gene, 73:237-244, 1988.
[8] Hirosawa, M., Totoki, Y., Hoshida, M., and Ishikawa, M., Comprehensive study on iterative algorithms of multiple sequence alignment, Comput. Appl. Biosci., 11(1):13-18, 1995.
[9] Lawrence, C.E., Altschul, S.F., Boguski, M.S., Liu, J.S., Neuwald, A.F., and Wootton, J.C., Detecting subtle sequence signals: a Gibbs sampling strategy for multiple alignment, Science, 262(5131):208-214, 1993.
[10] Parida, L., Algorithmic Techniques in Computational Genomics, Ph.D. thesis, Courant Institute of Mathematical Sciences, New York University, September, 1998.
[11] Rigoutsos, I. and Floratos, A., Motif discovery in biological sequences without alignment or enumeration, In Proceedings of the Annual Conference on Computational Molecular Biology (RECOMB98), 221-227, ACM Press, 1998.
[12] Rocke, E. and Tompa, M., An algorithm for finding novel gapped motifs in DNA sequences, In Proceedings of the Annual Conference on Computational Molecular Biology (RECOMB98), 228-233, ACM Press, 1998.
[13] Vihinen, M., An algorithm for simultaneous comparison of several sequences, Comput. Appl. Biosci., 4(1):89-92, 1988.
[14] Waterman, M.S., Parametric and ensemble alignment algorithms, Bull. Math. Biol., 56(4):743-767, 1994.
[15] Waterman, M.S., An Introduction to Computational Biology: Maps, Sequences and Genomes, Chapman Hall, 1995.
[16] Miller, W., Zhang, Z., and He, B., Local multiple alignment via subgraph enumeration, Discrete Appl. Math., 71:337-365, 1996.