A Parallel Algorithm for the Constrained Multiple Sequence Alignment Problem

Dan He and Abdullah N. Arslan

Department of Computer Science

University of Vermont

Burlington, VT 05405, USA

{dhe, aarslan}@cs.uvm.edu

Abstract

We propose a parallel algorithm for the constrained multiple sequence alignment (CMSA) problem, which seeks an optimal multiple alignment constrained to include a given pattern. We consider the dynamic programming computations in layers indexed by the symbols of the given pattern. In each layer we compute, as a potential part of an optimal alignment for the CMSA problem, shortest paths for multiple sources and multiple destinations. These shortest-path problems are independent of one another (which enables parallel execution), and each can be solved using an A* algorithm specialized for the shortest-paths problem with multiple sources and multiple destinations. The final step of our algorithm solves a single-source, single-destination shortest-path problem. Our experiments on real sequences show that our algorithm is generally faster than the existing sequential dynamic programming solutions.

1. Introduction

The constrained multiple sequence alignment (CMSA) problem was introduced by Tang et al. [10]. It aims to incorporate biologically meaningful prior knowledge about the structure or pattern of the input sequences into the alignment process. The problem is to find an optimal multiple alignment of $n$ given strings $S_1, S_2, \ldots, S_n$ such that the alignment contains a given pattern string $P$; i.e., in the alignment matrix there exists a sequence $c$ of columns, one for every $k$, $1 \le k \le |P|$, each composed entirely of the symbol $P[k]$ (the $k$th symbol of $P$), and in the sequence $c$ the column containing $P[i]$ appears before the column containing $P[j]$ for all $i, j$ with $i < j$.
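
The constraint above can be checked mechanically on a finished alignment. The following is a minimal sketch, assuming the alignment is given as a list of equal-length rows with `-` as the space character; the function name `satisfies_constraint` and this representation are illustrative assumptions, not part of the original formulation.

```python
def satisfies_constraint(alignment, pattern):
    """Return True if the alignment contains `pattern` as an ordered
    sequence of columns, each made up entirely of one pattern symbol.

    `alignment` is a list of equal-length strings (rows); '-' is a gap.
    """
    if not alignment:
        return len(pattern) == 0
    width = len(alignment[0])
    if any(len(row) != width for row in alignment):
        raise ValueError("all rows of an alignment must have equal length")

    k = 0  # index of the next pattern symbol still to be matched
    for col in range(width):
        if k == len(pattern):
            break
        # A column counts only if every row holds the symbol P[k].
        if all(row[col] == pattern[k] for row in alignment):
            k += 1
    return k == len(pattern)


if __name__ == "__main__":
    rows = ["A-CG", "ATCG", "A-CG"]
    print(satisfies_constraint(rows, "ACG"))
    print(satisfies_constraint(rows, "AT"))
```

A greedy left-to-right scan is sufficient here because matching each pattern symbol at the earliest possible column never rules out a later match.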

There are many dynamic programming algorithms for the CMSA problem and its variations [10, 3, 11, 12, 1, 4, 7].

For the CMSA problem, Chin et al. [3] present a dynamic programming formulation that modifies the solution for the multiple sequence alignment (MSA) problem [13] to take the additional string $P$ into account.

Let $D(i_1, i_2, \ldots, i_n, k)$ be the optimum constrained sequence alignment score of the sequences $S_1[1..i_1], S_2[1..i_2], \ldots, S_n[1..i_n]$ with constrained pattern sequence $P[1..k]$, where $s_j = |S_j|$ and $r = |P|$. Then this score can be computed by the following recurrence:

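Theorem 1 [3] $D(i_1, \ldots, i_n, k) = \infty$ if $i_1 = 0$, or $i_2 = 0$, or $\ldots$, or $i_n = 0$, for all $k$, $1 \le k \le r$, and $D(\{0\}^n, 0) = 0$, and for all $i_1, i_2, \ldots, i_n, k$ with $0 \le i_1 \le s_1$, $0 \le i_2 \le s_2$, $\ldots$, $0 \le i_n \le s_n$, $0 \le k \le r$,

$$
D(i_1, i_2, \ldots, i_n, k) = \min
\begin{cases}
D(i_1 - 1, i_2 - 1, \ldots, i_n - 1, k - 1) + \delta(S_1[i_1], S_2[i_2], \ldots, S_n[i_n]) & \text{if } S_1[i_1] = S_2[i_2] = \cdots = S_n[i_n] = P[k], \\[4pt]
\min_{e \in \{0,1\}^n} \bigl\{ D(i_1 - e_1, i_2 - e_2, \ldots, i_n - e_n, k) + \delta(e_1 * S_1[i_1], e_2 * S_2[i_2], \ldots, e_n * S_n[i_n]) \bigr\} & \text{for } i_j - e_j \ge 0 \text{ for all } j,\ 1 \le j \le n,
\end{cases}
$$

where $e_j = 0$ or $1$; $e_j * S_j[i_j]$ represents the space character $-$ when $e_j = 0$ and $S_j[i_j]$ when $e_j = 1$; and $\delta(x_1, x_2, \ldots, x_n) = \sum_{1 \le i < j \le n} \delta(x_i, x_j)$.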

$D(s_1, s_2, \ldots, s_n, r)$ is the optimum score for the CMSA problem.
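
To make the recurrence of Theorem 1 concrete, the following is a minimal memoized sketch for very small inputs. It rests on two assumptions that are not stated in the text: the pairwise cost $\delta(a, b)$ is taken to be 0 for identical symbols (including two spaces) and 1 otherwise, and the all-zero vector $e$ is skipped in the inner minimum so that no cell refers to itself. The names `pair_cost`, `column_cost`, and `cmsa_score` are illustrative.

```python
from functools import lru_cache
from itertools import combinations, product

GAP = "-"


def pair_cost(a, b):
    # Assumed unit-cost scheme: 0 for equal symbols, 1 otherwise.
    return 0 if a == b else 1


def column_cost(column):
    # Sum-of-pairs cost: delta(x_1..x_n) = sum over i < j of delta(x_i, x_j).
    return sum(pair_cost(a, b) for a, b in combinations(column, 2))


def cmsa_score(seqs, pattern):
    """Evaluate D(s_1, ..., s_n, r) by direct recursion on Theorem 1."""
    n = len(seqs)
    # All e in {0,1}^n except the zero vector (assumption: avoids self-reference).
    moves = [e for e in product((0, 1), repeat=n) if any(e)]

    @lru_cache(maxsize=None)
    def D(idx, k):
        if k == 0 and not any(idx):
            return 0                      # D({0}^n, 0) = 0
        if k >= 1 and 0 in idx:
            return float("inf")           # some prefix is empty but k >= 1
        best = float("inf")
        # Case 1: all current symbols equal P[k]; consume one pattern symbol.
        if k >= 1 and all(seqs[j][idx[j] - 1] == pattern[k - 1] for j in range(n)):
            col = tuple(seqs[j][idx[j] - 1] for j in range(n))
            prev = tuple(i - 1 for i in idx)
            best = D(prev, k - 1) + column_cost(col)
        # Case 2: ordinary column; e_j = 0 places a space in row j.
        for e in moves:
            if any(i - ej < 0 for i, ej in zip(idx, e)):
                continue
            col = tuple(seqs[j][idx[j] - 1] if e[j] else GAP for j in range(n))
            prev = tuple(i - ej for i, ej in zip(idx, e))
            best = min(best, D(prev, k) + column_cost(col))
        return best

    return D(tuple(len(s) for s in seqs), len(pattern))


if __name__ == "__main__":
    print(cmsa_score(("ACG", "AG", "AG"), "AG"))
```

The recursion touches every cell of the $(n+1)$-dimensional table, so it is only practical for a handful of short strings; it is meant to illustrate the recurrence, not to serve as an efficient solver.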

He and Arslan [7] improved the naive dynamic programming algorithm of Chin et al. [3] using the observation that if the symbol $P[k]$ is aligned with a symbol of $S_i$, then the region before this symbol $P[k]$ in $S_i$ can never be aligned with the region after $P[k]$ in $S_1, S_2, \ldots, S_{i-1}, S_{i+1}, \ldots, S_n$. Although the worst-case time complexity of this algorithm is the same as that of the solution in Theorem 1 (i.e., $O(2^n s_1 s_2 \cdots s_n r)$), experiments show that for 5 RNA sequences and a pattern of length 4 the algorithm is more than 60 times faster than a naive implementation of the dynamic programming solution in Theorem 1.
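
To give a feel for the size of the $O(2^n s_1 s_2 \cdots s_n r)$ bound, the short sketch below counts the table cells of Theorem 1 and an upper bound on the predecessor lookups per cell. The sequence length of 100 is a hypothetical value chosen only for illustration (the lengths used in the experiments are not stated here), and `dp_work_bound` is an illustrative name.

```python
from math import prod


def dp_work_bound(lengths, r):
    """Return (cells, lookups) for the dynamic programming table of Theorem 1.

    cells   = (r + 1) * product of (s_j + 1), one cell per (i_1..i_n, k)
    lookups = cells * 2**n, since each cell inspects at most 2**n - 1
              nonzero vectors e plus one pattern-matching predecessor.
    """
    n = len(lengths)
    cells = (r + 1) * prod(s + 1 for s in lengths)
    lookups = cells * 2 ** n
    return cells, lookups


if __name__ == "__main__":
    # Hypothetical instance: 5 sequences of length 100, pattern of length 4.
    cells, lookups = dp_work_bound([100] * 5, 4)
    print(f"{cells:,} cells, at most {lookups:,} predecessor lookups")
```

Even at this modest hypothetical size the table has tens of billions of cells, which is why pruning cells that cannot lie on an optimal constrained alignment can pay off in practice even though the worst-case bound is unchanged.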

Proceedings of the 5th IEEE Symposium on Bioinformatics and Bioengineering (BIBE’05)

