
BIOINFORMATICS ORIGINAL PAPER Vol. 21 no. 1 2005, pages 20–30

doi:10.1093/bioinformatics/bth468

A memory-efficient algorithm for multiple sequence alignment with constraints

Chin Lung Lu∗ and Yen Pin Huang

Department of Biological Science and Technology, National Chiao Tung University, Hsinchu 300, Taiwan, Republic of China

Received on April 20, 2004; revised July 16, 2004; accepted August 3, 2004

Advance Access publication August 12, 2004

ABSTRACT

Motivation: Recently, the concept of constrained sequence alignment was proposed to incorporate the knowledge of biologists about the structures/functionalities/consensuses of their datasets into sequence alignment, such that the user-specified residues/nucleotides are aligned together in the computed alignment. The currently developed programs use the so-called progressive approach to efficiently obtain a constrained alignment of several sequences. However, the kernels of these programs, the dynamic programming algorithms for computing an optimal constrained alignment between two sequences, require O(γn²) memory, where γ is the number of constraints and n is the maximum of the sequence lengths. As a result, this high memory requirement limits these programs to aligning short sequences only.

Results: We adopt the divide-and-conquer approach to design a memory-efficient algorithm for computing an optimal constrained alignment between two sequences, which greatly reduces the memory requirement of the dynamic programming approaches at the expense of a small constant factor in CPU time. This new algorithm consumes only O(αn) space, where α is the sum of the lengths of the constraints and usually α ≪ n in practical applications. Based on this algorithm, we have developed a memory-efficient tool for multiple sequence alignment with constraints.

Availability: http://genome.life.nctu.edu.tw/MUSICME

Contact: [email protected]

1 INTRODUCTION

Multiple sequence alignment (MSA) is one of the fundamental problems in computational molecular biology and has been studied extensively, because it is a useful tool in phylogenetic analyses among various organisms, the identification of conserved motifs and domains in a group of related proteins, the secondary and tertiary structure prediction of a protein (or RNA), and so on (Carrillo and Lipman, 1988; Chan et al., 1992; Gusfield, 1997; Nicholas et al., 2002; Notredame, 2002). Moreover, MSA is one of the most challenging problems in computational molecular biology because it has been shown to be NP-complete under the sum-of-pairs scoring criterion (Kececioglu, 1993; Wang and Jiang, 1994; Bonizzoni and Vedova, 2001), which means that designing an efficient algorithm for finding the mathematically optimal alignment appears to be hard. Hence, some approximate methods (Gusfield, 1993; Pevzner, 1992; Bafna et al., 1997; Li et al., 2000) and heuristic methods (Feng and Doolittle, 1987; Taylor, 1987; Corpet, 1988; Higgins and Sharpe, 1988; Thompson et al., 1994) were introduced to overcome this problem.

∗To whom correspondence should be addressed.

Recently, the concept of constrained sequence alignment was proposed to incorporate the knowledge of biologists regarding the structures/functionalities/consensuses of their datasets into sequence alignment, such that the user-specified residues/nucleotides are aligned together in the computed alignment (Tang et al., 2003). Tang et al. (2003) first designed a dynamic programming algorithm for finding an optimal constrained alignment of two sequences and then used it as a kernel to develop a constrained multiple sequence alignment (CMSA) tool based on the progressive approach, where each constraint considered by Tang et al. is a single residue/nucleotide only. Their algorithm for two sequences runs in O(γn⁴) time and consumes O(n⁴) space, where γ is the number of constrained residues and n is the maximum length of the sequences. Later, this result was improved independently by two groups of researchers to O(γn²) time and O(γn²) space using the same dynamic programming approach (Yu, 2003; Chin et al., 2003). In fact, each constraint requested to be aligned together can represent a conserved site of a protein/DNA/RNA family, and each conserved site may consist of a short segment of residues/nucleotides instead of a single residue/nucleotide. In other words, a constraint specified by biologists can be a fragment of several residues/nucleotides. For some applications, biologists may further expect some mismatches to be allowed among the residues/nucleotides of the columns requested to be aligned. Hence, Tsai et al. (2004) studied this kind of constrained sequence alignment and designed an algorithm of O(γn²) time and O(γn²)
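To make the single-residue constraint setting concrete, the following is a minimal Python sketch of a three-dimensional dynamic program in the spirit of the O(γn²) formulations cited above. It is an illustration under simplifying assumptions, not the algorithm of any of the cited papers: the function name `constrained_alignment_score`, the linear gap penalty and the default match/mismatch/gap values are all hypothetical choices made for this example. Each constraint is a single character that must occupy a column in which both sequences carry that character, and the constraints must be satisfied in the given order.

```python
NEG_INF = float("-inf")


def constrained_alignment_score(a, b, constraints, match=1, mismatch=-1, gap=-2):
    """Best global alignment score of a and b in which every character of
    `constraints` is, in order, placed in a column where both sequences
    hold that character. Returns -inf if no such alignment exists.

    Scoring scheme (assumed for illustration): linear gap penalty.
    """
    n, m, g = len(a), len(b), len(constraints)
    # table[k][i][j]: best score for aligning a[:i] with b[:j] when exactly
    # the first k constraints have been satisfied. The table holds
    # (g + 1) * (n + 1) * (m + 1) entries, which is the O(gamma * n^2)
    # memory cost discussed in the text.
    table = [[[NEG_INF] * (m + 1) for _ in range(n + 1)] for _ in range(g + 1)]
    for k in range(g + 1):
        for i in range(n + 1):
            for j in range(m + 1):
                if i == 0 and j == 0:
                    # Empty prefixes can satisfy no constraint.
                    table[k][i][j] = 0 if k == 0 else NEG_INF
                    continue
                best = NEG_INF
                if i > 0:  # a[i-1] aligned with a gap
                    best = max(best, table[k][i - 1][j] + gap)
                if j > 0:  # b[j-1] aligned with a gap
                    best = max(best, table[k][i][j - 1] + gap)
                if i > 0 and j > 0:
                    s = match if a[i - 1] == b[j - 1] else mismatch
                    # Ordinary (unconstrained) column.
                    best = max(best, table[k][i - 1][j - 1] + s)
                    # Column that satisfies the k-th constraint.
                    if k > 0 and a[i - 1] == b[j - 1] == constraints[k - 1]:
                        best = max(best, table[k - 1][i - 1][j - 1] + s)
                table[k][i][j] = best
    return table[g][n][m]
```

With the default scores assumed here, aligning "AB" with "BA" gives -2 without constraints (two mismatch columns), -3 when the constraint "A" forces the two A residues into one column, and negative infinity for the constraint "AB", because no alignment can place an A column before a B column for these two sequences. Storing the full table is what makes the memory grow with γn², which is the cost the memory-efficient algorithm of this paper is designed to avoid.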

