A Parallel Algorithm for the Constrained Multiple Sequence Alignment Problem | CSC 8910, Papers of Computer Science

Material Type: Paper; Class: COMPUTER SCIENCE TOPICS SEMINR; Subject: COMPUTER SCIENCE; University: Georgia State University; Term: Unknown 1989;

Typology: Papers

Pre 2010

Uploaded on 08/31/2009

koofers-user-fde
koofers-user-fde 🇺🇸

10 documents

1 / 5

Toggle sidebar

This page cannot be seen from the preview

Don't miss anything!

bg1
A Parallel Algorithm for the Constrained Multiple Sequence Alignment Problem
Dan He and Abdullah N. Arslan
Department of Computer Science
University of Vermont
Burlington, VT 05405, USA
{dhe, aarslan}@cs.uvm.edu
Abstract
We propose a parallel algorithm for the constrained mul-
tiple sequence alignment (CMSA) problem that seeks an
optimal multiple alignment constrained to include a given
pattern. We consider the dynamic programming computa-
tions in layers indexed by the symbols of the given pattern.
In each layer we compute as a potential part of an optimal
alignment for the CMSA problem, shortest paths for multi-
ple sources and multiple destinations. These shortest paths
problems are independent from one another (which enables
parallel execution), and each can be solved using an Aal-
gorithm specialized for the shortest paths problem for mul-
tiple sources and multiple destinations. The final step of our
algorithm solves a single source single destination shortest
path problem. Our experiments on real sequences show that
our algorithm is faster in general than the existing sequen-
tial dynamic programming solutions.
1. Introduction
The constrained multiple sequence alignment (CMSA)
problem was introduced by Tang et al. [10]. The prob-
lem aims to incorporate the biologically meaningful prior
knowledge of the structure or pattern of the input sequences
into the alignment process. The problem is to find an op-
timal multiple alignment of given nstrings S1,S
2, ..., Sn
such that the alignment contains a given pattern string P,
i.e. in the alignment matrix there exists a sequence cof
columns each entirely composed of symbol P[k]for every
kwhere P[k]is the kth symbol in P,1k≤|P|, and
in the sequence c, a column containing P[i]appears before
column containing P[j]for all i, j, i < j.
There are many dynamic programming algorithms for
the CMSA problem and its variations [10, 3, 11, 12, 1, 4,
7].
For the CMSA problem Chin et. al [3] presents a dy-
namic programming formulation that modifies the solution
for the multiple sequence alignment (MSA) problem [13]
to consider the additional string P.
Let D(i1,i
2, ..., in,k)be the optimum con-
strained sequence alignment score of sequences
S1[1..i1],S2[1..i2],...,Sn[1..in]with constrained pat-
tern sequence P[1..r]. Then this score can be computed by
the following recurrence:
Theorem 1 [3] D(i1,...,i
n,k)=if i1=0,or i2=0,
or ...,or in=0for all k,1kr, and D({0}n,0) =
0, and for all i1,i
2,...,i
n,k,0i1s1,0i2
s2,...,0insn,0kr,
D(i1,i
2, ..., in,k)=
min
D(i11,i
21, ..., in1,k1)
+δ(S1[i1],S
2[i2], ..., Sn[in])
if (S1[i1]=S2[i2]=... =Sn[in]=P[k])
mine∈{0,1}nD(i1e1,i
2e2, ..., inen,k)
+δ(e1S1[i1],e
2S2[i2], ..., enSn[in])
for ijej0for all j, 1jn
where ej=0or 1,ejSj[ij]with ej=0represents
the space character , and Sj[ij]when ej=1, and
δ(x1,x
2, ..., xk)=1i<jnδ(xi,x
j).
D(s1,s
2,...,s
n,r)is the optimum score for the CMSA
problem.
He and Arslan [7] improved the naive dynamic pro-
gramming algorithm of Chin et al. [3] using the obser-
vation that if the symbol P[k]is aligned with a sym-
bol of Sithen the region before this symbol P[k]in
Sican never be aligned with the region after P[k]in
S1,S
2,...,S
i1,S
i+1,...,S
n. Although the worst-case
time complexity of this algorithm is the same as that of the
solution in Theorem 1 (i.e. O(2ns1s2...snr)), experiments
show that for 5RNA sequences, and a pattern of length 4
the algorithm is more than 60 times faster than a naive im-
plementation of the dynamic programming solution in The-
orem 1.
Proceedings of the 5th IEEE Symposium on Bioinformatics and Bioengineering (BIBE’05)
0-7695-2476-1/05 $20.00 © 2005 IEEE
pf3
pf4
pf5

Partial preview of the text

Download A Parallel Algorithm for the Constrained Multiple Sequence Alignment Problem | CSC 8910 and more Papers Computer Science in PDF only on Docsity!

A Parallel Algorithm for the Constrained Multiple Sequence Alignment Problem

Dan He and Abdullah N. Arslan

Department of Computer Science

University of Vermont

Burlington, VT 05405, USA

{dhe, aarslan}@cs.uvm.edu

Abstract

We propose a parallel algorithm for the constrained mul- tiple sequence alignment ( CM SA ) problem that seeks an optimal multiple alignment constrained to include a given pattern. We consider the dynamic programming computa- tions in layers indexed by the symbols of the given pattern. In each layer we compute as a potential part of an optimal alignment for the CM SA problem, shortest paths for multi- ple sources and multiple destinations. These shortest paths problems are independent from one another (which enables parallel execution), and each can be solved using an A∗^ al- gorithm specialized for the shortest paths problem for mul- tiple sources and multiple destinations. The final step of our algorithm solves a single source single destination shortest path problem. Our experiments on real sequences show that our algorithm is faster in general than the existing sequen- tial dynamic programming solutions.

1. Introduction

The constrained multiple sequence alignment (CM SA) problem was introduced by Tang et al. [10]. The prob- lem aims to incorporate the biologically meaningful prior knowledge of the structure or pattern of the input sequences into the alignment process. The problem is to find an op- timal multiple alignment of given n strings S 1 , S 2 , ..., Sn such that the alignment contains a given pattern string P , i.e. in the alignment matrix there exists a sequence c of columns each entirely composed of symbol P [k] for every k where P [k] is the kth symbol in P , 1 ≤ k ≤ |P |, and in the sequence c, a column containing P [i] appears before column containing P [j] for all i, j, i < j. There are many dynamic programming algorithms for the CM SA problem and its variations [10, 3, 11, 12, 1, 4, 7]. For the CM SA problem Chin et. al [3] presents a dy- namic programming formulation that modifies the solution

for the multiple sequence alignment (M SA) problem [13] to consider the additional string P. Let D(i 1 , i 2 , ..., in, k) be the optimum con- strained sequence alignment score of sequences S 1 [1..i 1 ],S 2 [1..i 2 ],...,Sn[1..in] with constrained pat- tern sequence P [1..r]. Then this score can be computed by the following recurrence:

Theorem 1 [3] D(i 1 ,... , in, k) = ∞ if i 1 = 0, or i 2 = 0, or... , or in = 0 for all k , 1 ≤ k ≤ r , and D({ 0 }n, 0) = 0 , and for all i 1 , i 2 ,... , in, k , 0 ≤ i 1 ≤ s 1 , 0 ≤ i 2 ≤ s 2 ,... , 0 ≤ in ≤ sn , 0 ≤ k ≤ r , D(i 1 , i 2 , ..., in, k) =

min

D(i 1 − 1 , i 2 − 1 , ..., in − 1 , k − 1) +δ(S 1 [i 1 ], S 2 [i 2 ], ..., Sn[in]) if (S 1 [i 1 ] = S 2 [i 2 ] = ... = Sn[in] = P [k])

mine∈{ 0 , 1 }n D(i 1 − e 1 , i 2 − e 2 , ..., in − en, k) +δ(e 1 ∗ S 1 [i 1 ], e 2 ∗ S 2 [i 2 ], ..., en ∗ Sn[in]) for ij − ej ≥ 0 for all j, 1 ≤ j ≤ n

where ej = 0 or 1 , ej ∗ Sj [ij ] with ej = 0 represents the space character ′−′ , and Sj [ij ] when ej = 1 , and δ(x 1 , x 2 , ..., xk) =

1 ≤i<j≤n δ(xi, xj^ ). D(s 1 , s 2 ,... , sn, r) is the optimum score for the CM SA problem. He and Arslan [7] improved the naive dynamic pro- gramming algorithm of Chin et al. [3] using the obser- vation that if the symbol P [k] is aligned with a sym- bol of Si then the region before this symbol P [k] in Si can never be aligned with the region after P [k] in S 1 , S 2 ,... , Si− 1 , Si+1,... , Sn. Although the worst-case time complexity of this algorithm is the same as that of the solution in Theorem 1 (i.e. O(2ns 1 s 2 ...snr)), experiments show that for 5 RN A sequences, and a pattern of length 4 the algorithm is more than 60 times faster than a naive im- plementation of the dynamic programming solution in The- orem 1.

Proceedings of the 5th IEEE Symposium on Bioinformatics and Bioengineering (BIBE’05)

In this paper, we propose a parallel algorithm for the CM SA problem. Our algorithm uses the observation of He and Arslan [7], and uses as a subroutine the A∗^ algorithm presented by Shibuya [9]. The outline of this paper is as follows: in Section 2 we give the formulation for the subproblems that arise in com- puting the CM SA, and describe our parallel CM SA al- gorithm. This section is organized in two subsections. In Subsection 2.1 we explain the bidirectional-method-based A∗^ algorithm of Shibuya [9] that we use to solve the sub- problems we formulate. We present our parallel algorithm in Subsection 2.2. In Section 3, we show the results of our experiments. We include our final remarks in Section 4.

2. Parallel Computation of CM SA

The dynamic programming solution for the CM SA problem can be considered as finding a shortest path in the dynamic programming matrix which we can visualize in layers indexed by its last dimension (the positions in the pattern string) as shown in Figure 1 where each layer is an n-dimensional matrix. We observe that a shortest path has to go through each layer of the dynamic programming ma- trix beginning at layer 0 and ending at layer r, where r is the length of the pattern string. We call an optimal solution of the CM SA problem as a global shortest path. A global shortest path enters each layer at a vertex (we call it an entry vertex ), after traversing a number of vertices in each layer exits the layer at a vertex (we call it an exit vertex ) never to come back to this layer again. An exit vertex of layer k is also the entry vertex of layer k + 1 for 0 ≤ k < r. The length of the global shortest path is the sum of the length of the sub-paths on each layer, and each sub-path on layer k in the global shortest path is the shortest path between the en- try and exit vertices on layer k. We show in Subsection 2. that we can actually precalculate all of the candidate entry and exit vertices on each layer. For given all possible entry and exit vertices on each layer, we consider the computation of shortest paths between all entry-exit vertex-pairs. We ob- serve the following about these computations:

  • They are independent of each other. This means that we can do the computations for each layer in parallel.
  • They are indeed the all pairs shortest paths prob- lem among multiple sources and multiple destinations. Shibuya [9] presents a bidirectional-method-based A∗ algorithm for solving this problem efficiently. We next summarize the algorithm of Shibuya [9].

2.1. Bidirectional-Method-Based A∗^ Algorithm

The A∗^ algorithm [6] is a very popular heuristic search algorithm which is the extension of Dijkstra’s single source

shortest path algorithm [5]. It uses a heuristic estimator for the distance from each vertex in the graph to the destina- tion. The score for each vertex is the sum of the heuris- tic value and the real distance found from the source to the vertex. The algorithm always expands the vertex with the minimum score. In most practical cases, the A∗^ algorithm is very efficient. The bidirectional algorithm [2] applies the Dijkstra al- gorithm simultaneously from both the source s and the des- tination e. The search of the Dijkstra algorithm terminates if the forward and backward explorations meet. Then the shortest distance is obtained by picking a point f in the for- ward exploration points set ps and a point b in the backward exploration points set pe such that f and b have direct link and e(f, b) + ps(f ) + pe(b) is minimized, where e(f, b) is the distance of the edge between f and b, ps(f ) is the short- est distance from s to f , and pe(b) is the shortest distance from e to b. The shortest path can be obtained by combining the s − f shortest path, edge (f, b) and the e − b shortest path. Ordinarily the bidirectional method and A∗^ algorithm for the shortest path problem are used for the case of single source and single destination. It has been believed that both algorithms are only suitable for this case of the problem. Shibuya [9] was the first to apply the A∗^ algorithm to the n × m shortest paths problem. Shibuya [9] presents the bidirectional-method-based A∗^ algorithm. Shibuya’s algorithm uses the heuristic function:

h(v) = minih∗(v, ti)

where h∗(v, ti) is the heuristic estimator from vertex v to destination ti, and h(v) is the smallest estimator to the set of destinations. We can compute this heuristic function by the backward Dijkstra algorithm which is a variation of the ordinary Di- jkstra algorithm, and it has the same time complexity. The backward Dijkstra algorithm is similar to the ordinary Di- jkstra algorithm except that the start scores p(v) for all ver- tices v in the destination set T are initialized to 0. The order of the points to be expanded is decided by the distances of those points to the destination set T , while in the ordinary Dijkstra algorithm, the order is decided by the distance to the single destination. In the backward algorithm the heuris- tic estimator h(v) is the real distance found from the vertex v to the destination set T , which is the distance from v to its closest destination. After we compute the heuristic estimators for all source vertices, we can use the A∗^ algorithm directly to compute shortest paths to each destination vertex. Since the heuris- tic estimator for each vertex is the same in the search of every pair of source and destination, we do not need to re- compute these estimators thus the heuristic search would be very efficient. Shibuya [9] reports that the bidirectional-

Proceedings of the 5th IEEE Symposium on Bioinformatics and Bioengineering (BIBE’05)

layer entry vertices exit vertices total vertices 0 1 50 6 , 208 , 512 1 50 10 5 , 702 , 400 2 10 576 27 , 433 , 920 3 576 96 21 , 168 , 000 4 96 1 15 , 349 , 824

Figure 2. The numbers of entry, exit and to- tal vertices on each layer for the Data Set 1 in Figure 7 and Data Set 2 in Figure 8 of [3]. We considered only the first 4 sequences and took HKSH as the pattern.

that we use |P | + 1 processors, the time taken by this step is the longest time spent on a layer. Our experiments show that it is much shorter than the total time of sequential execution through all the layers. In Step 3 of P CM SA, after we compute all possible shortest paths on each layer, we find the global shortest path between vertices (0, 0 ,... , 0 , 0) and (s 1 , s 2 ,... , sn, r) by selecting one shortest path connecting an entry and an exit vertex on each layer such that the sum of the shortest paths from all layers is minimized. The global shortest path is the combination of these shortest paths on each layer. In this fi- nal step of the algorithm we can use a single-source shortest paths algorithm. We only need to consider part of the graph composed of these shortest paths on layers. Since we have empirical evidence suggesting that the numbers of entry and exit vertices are small, the total size of the graph used in shortest path computations is expected to be insignificant compared to 2 ns 1 s 2... sn. Therefore, we expect that if we use Dijstra’s single source shortest paths algorithm, the time spent in execution of this step is only a small part in the en- tire execution time. To summarize, given that the total numbers of entry and exit vertices on layers are small (as is the case in practice) the expected time requirement of Algorithm P CM SA is O(2ns 1 s 2... sn). Taking into account the fact that our al- gorithm considers only small relevant regions on each layer, we expect that the algorithm runs even faster in practice.

3. Experiments

In our experiments, we first use the RN ase sequences used by Chin et al. [3]. We performed separate tests using the first 3 sequences, and the first 4 sequences. Seq1 : gi| 119124 |sp|p 12724 |ecp human Seq2 : gi| 2500564 |sp|p 70709 |ecp rat Seq3 : gi| 13400006 |pdb|ldyt| Seq4 : gi| 20930966 |ref |xp 142859. 1

#s. p.l. seq.l. FCMSA PCMSA 3 5 160 / 135 / 133 0. 621 0. 209 4 5 160 / 135 / 133 / 131 79. 199 56. 124 4 6 160 / 135 / 133 / 131 35. 133 13. 194

Figure 3. The execution times of Algorithm F CM SA and Algorithm P CM SA for the Data Set 1 in Figure 7 and Data Set 2 in Figure 8 of [3]. We considered the first 3 and 4 se- quences in separate tests. In this figure and Figure 4, #s. is the number of sequences, p.l. is the pattern length, seq.l. is the sequence length, and the execution times in the last two columns are measured in seconds.

For the 3 -sequence-case, we use pattern strings as HKST H, and for the 4 -sequence-case, we use pattern strings as HKST H and HKST HH separately. We then compare the execution times of Algorithm F CM SA of He and Arslan [7], and our P CM SA algorithm. The results of the experiments are shown in Figure 3. We have not run Algorithm P CM SA on a parallel machine. We estimate the execution time of Algorithm P CM SA by measuring and adding the execution time of each step on a sequential machine with Intel Xeon 2. 4 GHz CPU and 2 GB memory, and considering the longest execution time for the parallel step (Step 2 ). We ignore the communication cost which we believe insignificant if we ran Algorithm P CM SA on a parallel machine. We also conducted tests using 4 sequences we show be- low. We chose these sequences from 4 different families: the aspartic acid protease family, the hemoglobins family, the ribonuclease family and the kinase family. The pattern strings are HKAST H and HKAST F H. We show the results of the experiments in Figure 4. Seq1 : gi| 70778788 |ref |N P 001020501. 1 | Seq2 : gi| 24817320 |emb|CAA 98468. 2 | Seq3 : gi| 68344985 |gb|AAY 92591. 1 | Seq4 : gi| 62295087 |sp|P 69615 |Y IO 1 CV M 3 When we increase the pattern length, both Algorithm F CM SA and Algorithm P CM SA perform better. This is probably because with the increase of pattern length, the regions on each layer are reduced more, and the numbers of entry vertices and exit vertices on each layer are also re- duced, therefore both algorithms perform more efficiently. Since the implementation of the bidirectional-method- based A∗^ algorithm is much more complicated than the dy- namic programming algorithm, an efficient implementation is of great importance. In our tests with 4 sequences and a pattern of length 5 , by only fine-tuning the implementation

Proceedings of the 5th IEEE Symposium on Bioinformatics and Bioengineering (BIBE’05)

#s. p.l. seq.l. FCMSA PCMSA 4 6 204 / 235 / 219 / 207 241. 649 160. 850 4 7 204 / 235 / 219 / 207 170. 185 115. 415

Figure 4. The execution times of algorithms F CM SA and P CM SA on 4 sequences cho- sen from 4 families: the aspartic acid pro- tease family, the hemoglobins family, the ri- bonuclease family and the kinase family.

of the bidirectional-method-based A∗^ algorithm, we man- age to reduce the execution time of P CM SA from the ini- tial 70. 461 seconds to 56. 124 seconds. We believe that our P CM SA algorithm could run even faster with a more effi- cient implementation of the bidirectional-method-based A∗ algorithm.

4. Concluding Remarks

We develop a parallel algorithm for the constrained mul- tiple sequence alignment problem. Following the approach proposed in the constrained multiple sequence alignment al- gorithm presented by He and Arslan [7], we consider com- putations in the dynamic programming matrix in layers in- dexed by one dimension. We formulate the computations on each layer as multiple sources multiple destinations short- est paths problem which can be solved efficiently by using the bidirectional-method-based A∗^ algorithm proposed by Shibuya [9]. Each such problem can be solved in paral- lel for each layer. The final step of our algorithm solves a single source shortest path problem on a graph which is composed of shortest paths on layers. The empirical evi- dence in practical settings of the constrained multiple se- quence alignment of RN ase sequences suggests that the total number of source and destination vertices on layers in the constrained alignment is insignificant compared to the size of a layer. The expected time requirement of our al- gorithm is dominated by the time taken to solve multiple sources-multiple destinations problem on one layer because the numbers of sources and destinations on a layer are very small in practice. Our algorithm considers only the relevant regions in each layer applying the method used in the algo- rithm of He and Arslan [7]. Our tests on real data verify that in general our algorithm is much faster than the existing se- quential dynamic programming algorithms for the CM SA problem.

References

[1] A. N. Arslan and O. E˘¨ gecio˘glu. Algorithms for the constrained common sequence problem. Proc. Prague Stringology Conference 2004, (Eds. M. Simanek and J. Holub) , pp. 24-32, Prague, August 2004.

[2] D. Champeaus. A note on two problems in connection with graphs. Num. Math. , 1, 1959, 395-142.

[3] F.Y.L. Chin, N. L. Ho, T.W.Lam, P.W.H. Wong, M.Y. Chan. Efficient constrained multiple sequence align- ment with performance guarantee. CSB 2003 , pp. 337- 346, 2003.

[4] F.Y.L. Chin, A.D. Santis, A.L. Ferrara, N.L. Ho, S.K. Kim. A simple algorithm for the constrained sequence problems. IPL Vol. 90, pp. 175-179, 2004.

[5] E. Dijkstra. Bidirectional Heuristic Search Again. J. ACM , vol. 30, 1983, pp.22-32.

[6] P. Hart, N. Nilsson, and B. Raphael. A formal basis for the heuristic determination of minimum cost paths. IEEE Trans. Syst. Sci. Cybernet. , 4(2):100-107,1968.

[7] D. He and A. N. Arslan. A fast algorithm for the Con- strained Multiple Sequence Alignment problem. Proc. 11th International Conference on Automata and For- mal Languages (AFL 2005) (Eds. Zoltan ´Esik, Zoltan F¨ul¨op), Institute of Informatics, University of Szeged , pp. 131-143, Dobogoko, Hungary, May 2005.

[8] T. Ikeda and H. Imai. Fast A* algorithm for multiple sequence alignment. GIW 94 , 90-99.

[9] T. Shibuya. Computing the n × m Shortest Paths Effi- ciently. the ACM J. of Experimental Algorithmics , ISSN 1084-6654, Vol. 5, No. 9, 2000.

[10] C.Y. Tang, C.L. Lu, M.D.-T. Chang, Y.-T. Tsai, Y.- J. Sun, K.-M. Chao, J.-M. Chang, Y.-H. Chiou, C.-M. Wu, H.-T. Chang, and W.-I. Chou. Constrained mul- tiple sequence alignment tool development and its ap- plications to rnase family alignment. CSB 2002, pp. 127-137, 2002.

[11] Y.-T. Tsai. The constrained common sequence prob- lem. IPL , 88:173–176, 2003.

[12] Y.-T. Tsai, C.L. Lu, C.T. Yu, and Y.P. Huang. MuSiC: A tool for multiple sequence alignment with constraint. Bioinformatics , 20(14):2309-2311, 2004.

[13] M.S. Waterman. Introduction to computational biol- ogy. Chapman & Hall, 1995.

Proceedings of the 5th IEEE Symposium on Bioinformatics and Bioengineering (BIBE’05)