
MUSCA: An Algorithm for Constrained Alignment of Multiple Data Sequences

Laxmi Parida Aris Floratos Isidore Rigoutsos

[email protected] [email protected] [email protected]

Bioinformatics and Pattern Discovery Group

Computational Biology Center

IBM Thomas J. Watson Research Center

Yorktown Heights, NY 10598, USA

Abstract

Given a set of N sequences, the Multiple Sequence Alignment problem is to align these N sequences, possibly with gaps, so as to bring out the best commonality of the N sequences. MUSCA is a two-stage approach to the alignment problem that identifies two relatively simpler sub-problems whose solutions are used to obtain the alignment of the sequences. We first discover motifs in the N sequences and then extract an appropriate subset of compatible motifs to obtain a "good" alignment. The motifs of interest to us are the irredundant motifs, whose number is only polynomial in the input size; in practice, the number is much smaller (sub-linear). Notice that this step aids in a direct N-wise alignment, as opposed to composing the alignment from lower-order (say pairwise) alignments, and the solution is also independent of the order of the input sequences; hence the algorithm works very well when dealing with a large number of sequences. The second part of the problem, obtaining a good alignment, is solved using a graph-theoretic approach that computes an induced subgraph satisfying certain simple constraints. We reduce a version of this problem to an instance of the set covering problem, and thus offer the best possible approximate solution to the problem (provided P ≠ NP). Our experimental results, while preliminary, indicate that this approach is efficient, particularly on large numbers of long sequences, and gives good alignments when tested on biological data such as DNA and protein sequences. We introduce the notion of an alignment number K (2 ≤ K ≤ N), a user-controlled parameter that lends a useful flexibility to the aligning program: this additional requirement constrains the alignment to have at least K sequences agree on a character, whenever possible. The usefulness of the alignment number is corroborated by users, who view it as a natural constraint when dealing with a large number of sequences.
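The alignment-number constraint can be illustrated with a small check on a finished alignment. The sketch below is not MUSCA itself: it only assumes that an alignment is given as equal-length strings, that `-` marks a gap, and that the alignment number is passed as a parameter `k`. The function names `column_agreement` and `columns_meeting_k` are hypothetical and introduced here for illustration.

```python
from collections import Counter

GAP = "-"  # assumed gap symbol; the source does not fix one


def column_agreement(alignment):
    """For each column, return the largest number of rows that
    share the same non-gap character."""
    if len({len(row) for row in alignment}) > 1:
        raise ValueError("all aligned rows must have the same length")
    counts = []
    for column in zip(*alignment):
        # Count characters in this column, ignoring gaps.
        tally = Counter(ch for ch in column if ch != GAP)
        counts.append(max(tally.values(), default=0))
    return counts


def columns_meeting_k(alignment, k):
    """Indices of columns in which at least k rows agree on a character."""
    return [i for i, c in enumerate(column_agreement(alignment)) if c >= k]


rows = ["AC-GT", "ACTGA", "A-TGT", "GCTG-"]
print(column_agreement(rows))      # [3, 3, 3, 4, 2]
print(columns_meeting_k(rows, 3))  # [0, 1, 2, 3]
```

With `k = 3`, only the last column of this toy alignment falls short, which is the kind of column the "whenever possible" clause allows for.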

1 Introduction

Given a set of N sequences, the Multiple Sequence Alignment problem is to align these N sequences, possibly with gaps, so as to bring out the best commonality of the N sequences. Various alignment cost functions [2, 3, 4, 6, 8, 7, 14, 15, 12, 9] have been used in the literature. The general approach to solving the pairwise (N = 2) sequence alignment problem has been a dynamic programming technique that uses different scoring schemes, each a function of the edit distance, along with gap penalties, to evaluate the similarity of the sequences. In [16, 13] the case of N > 2 has been handled by first doing a pairwise alignment for some or all possible pairs in some order and then building an N-wise alignment from these.
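For reference, the pairwise dynamic programming technique mentioned above can be sketched as follows. This is the classical global alignment recurrence (Needleman-Wunsch style) with a linear gap penalty, shown only as background; it is not part of MUSCA, and the scoring values (`match=1`, `mismatch=-1`, `gap=-2`) and the function name `global_align` are assumptions chosen for illustration.

```python
def global_align(a, b, match=1, mismatch=-1, gap=-2):
    """Return (score, aligned_a, aligned_b) for a global alignment
    of strings a and b under a linear gap penalty."""
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + sub,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,      # gap in b
                score[i][j - 1] + gap,      # gap in a
            )

    # Trace back from the bottom-right cell to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = match if a[i - 1] == b[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + sub:
                out_a.append(a[i - 1])
                out_b.append(b[j - 1])
                i, j = i - 1, j - 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


best, row_a, row_b = global_align("ACGT", "AGT")
print(best)  # 1: three matches and one gap
print(row_a)
print(row_b)
```

The table has (|a| + 1)(|b| + 1) cells, so the cost grows with the product of the sequence lengths; extending the same recurrence directly to N sequences multiplies in one more length per sequence, which is why approaches for N > 2 typically build on pairwise alignments instead.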

MUSCA uses a two-stage approach to the alignment problem by identifying two relatively simpler sub-problems which deal separately with the two issues, one of identifying the "local similarities" and

Musca is a constellation in the polar region of the Southern Hemisphere near Apus and Carina. Also, MUSCA is an anagram of the salient characters in Constrained Multiple Sequence Alignment.

