Download Dynamic Programming for Biomolecular Sequence Analysis: Alignment and Local Alignments - P and more Study notes Computer Science in PDF only on Docsity!
Dynamic Programming Method for
Analyzing Biomolecular Sequences
Tao Jiang
Department of Computer Science
University of California - Riverside
(Typeset by Kun-Mao Chao)
E-mail: [email protected] http://www.cs.ucr.edu/~jiang 2
Outline
• The paradigm of dynamic programming
• Sequence alignment – a general framework
for comparing sequences in bioinformatics
• Dynamic programming algorithms for
sequence alignment
• Techniques for improving the efficiency of
the algorithms
• Multiple sequence alignment
3
Dynamic Programming
• Dynamic programming is an algorithmic
method for solving optimization problems
with a compositional/recursive cost
structure.
• Richard Bellman was one of the principal
founders of this approach.
4
Two key ingredients
- Two key ingredients for an optimization problem
to be suitable for a dynamic programming solution:
Each substructure is optimal. (principle of optimality)
- optimal substructures 2. overlapping subinstances
Subinstances are dependent. (Otherwise, a divide-and-conquer approach is the choice.)
5
Three basic components
• The development of a dynamic programming
algorithm has three basic components:
- A recurrence relation (for defining the value/cost
of an optimal solution);
- A tabular computation (for computing the value of
an optimal solution);
- A backtracing procedure (for delivering an
optimal solution).
6
Fibonacci numbers
for.
i>
i
F
i
Fi F
F
F
The Fibonacci numbers are defined by the
following recurrence:
7
How to compute F 10 ?
F 10
F 9
F 8
F 8
F 7
F 7
F 6
8
Tabular computation
• Tabular computation can avoid redundant
computation steps.
F 0 F 1 F 2 F 3 F 4 F 5 F 6 F 7 F 8 F 9 F 10
13
Maximum sum interval
(tabular computation)
9 –3 1 7 –15 2 3 –4 2 –7 6 –2 8 4 - S ( i ) 9 6 7 14 –1 2 5 1 3 –4 6 4 12 16 7
The maximum sum
14
Maximum sum interval
(backtracing)
9 –3 1 7 –15 2 3 –4 2 –7 6 –2 8 4 - S ( i ) 9 6 7 14 –1 2 5 1 3 –4 6 4 12 16 7
The maximum-sum interval: 6 -2 8 4 Running time: O(n).
15
Defining scores for alignment columns
• infocon [Stojanovic et al ., 1999]
- Each column is assigned a score that measures its information content, based on the frequencies of the letters both within the column and within the alignment.
CGGATCAT—GGA CTTAACATTGAA GAGAACATAGTA
16
Defining scores (cont’d)
• phylogen [Stojanovic et al ., 1999]
- columns are scored based on the evolutionary
relationships among the sequences implied by a
supplied phylogenetic tree.
T
T
T
C
C T T T C C
TC
T
T
T T T C C
T T
T
T
Score = 1 Score = 2
17
Two fundamental problems we solved
(joint work with Lin and Chao)
• Given a sequence of real numbers of length
n and an upper bound U , find a consecutive
subsequence of length at most U with the
maximum sum --- an O ( n )-time algorithm.
U = 3 9 –3 1 7 –15 2 3 –4 2 –7 6 –2 8 4 -
18
Two fundamental problems we solved
(joint work with Lin and Chao)
• Given a sequence of real numbers of length
n and a lower bound L , find a consecutive
subsequence of length at least L with the
maximum average --- an O ( n log L )-time
algorithm. This has been improved to O(n)
by others later.
L = 4
19
Another example
Given a sequence as follows:
and L = 2 , the highest-average interval is the
squared area, which has the average value
20
GC-rich regions
• Our method can be used to locate a region
of length at least L with the highest C+G
ratio in O ( n log L ) time.
ATGACTCGAGCTCGTCA
Search for an interval of length at least L with the highest average.
25
Longest increasing subsequence (LIS)
- The longest increasing subsequence is to find a
longest increasing subsequence of a given
sequence of distinct integers a 1 a 2 …an.
e.g. 9 2 5 3 7 11 8 10 13 6
are increasing subsequences.
are not increasing subsequences.
We want to find a longest one.
26
A naive approach for LIS
• Let L [ i ] be the length of a longest increasing
subsequence ending at position i.
L [ i ] = 1 + max j = 0..i-1 { L [ j ] | a j < a i }
(use a dummy a 0 = minimum, and L [0]=0)
L [ i ] 1 1 2 2 3 4?
27
A naive approach for LIS
L [ i ] 1 1 2 2 3 4 4 5 6 3
L [ i ] = 1 + max j = 0..i-1 { L [ j ] | a j < a i }
The maximum length
The subsequence 2, 3, 7, 8, 10, 13 is a longest increasing subsequence.
This method runs in O ( n 2 ) time. 28
Binary Search
• Given an ordered sequence x 1 x 2 ... xn , where
x 1 <x 2 < ... <x n , and a number y , a binary
search finds the largest x i such that x i < y in
O (log n ) time.
n ...
n/ n/
29
Binary Search
• How many steps would a binary search
reduce the problem size to 1?
n n/2 n/4 n/8 n/16 ... 1
How many steps? O (log n ) steps.
s n
n s
log 2
30
An O ( n log n ) method for LIS
- Define BestEnd [ k ] to be the smallest end number
of an increasing subsequence of length k.
BestEnd [1] BestEnd [2] BestEnd [3] BestEnd [4] BestEnd [5] BestEnd [6]
31
An O ( n log n ) method for LIS
- Define BestEnd [ k ] to be the smallest end number
of an increasing subsequence of length k.
BestEnd [1] BestEnd [2] BestEnd [3] BestEnd [4] BestEnd [5] BestEnd [6]
For each position, we perform a binary search to update BestEnd. Therefore, the running time is O ( n log n ). (^32)
Longest Common Subsequence (LCS)
• A subsequence of a sequence S is obtained
by deleting zero or more symbols from S.
For example, the following are all
subsequences of “president”: pred, sdn,
predent.
• The longest common subsequence problem
is to find a maximum length common
subsequence between two sequences.
37
i j 0 1 p
2 r
3 o
4 v
5 i
6 d
7 e
8 n
9 c
10 e 0 0 0 0 0 0 0 0 0 0 0 0 1 p 0 1 1 1 1 1 1 1 1 1 1 2 r (^) 0 1 2 2 2 2 2 2 2 2 2 3 e 0 1 2 2 2 2 2 3 3 3 3 4 s 0 1 2 2 2 2 2 3 3 3 3 5 i 0 1 2 2 2 3 3 3 3 3 3 6 d (^) 0 1 2 2 2 3 4 4 4 4 4 7 e 0 1 2 2 2 3 4 5 5 5 5 8 n 0 1 2 2 2 3 4 5 6 6 6 9 t 0 1 2 2 2 3 4 5 6 6 6
Running time and memory: O(mn) and O(mn). 38
procedure Output-LCS(A, prev, i, j) 1 if i = 0 or j = 0 then return 2 if prev(i, j)=” “ then (^) ⎢ ⎣
⎡ − − − a i
Output LCSAprevi j print
(, , 1 , 1 )
3 else if prev(i, j)=” “ then Output-LCS(A, prev, i-1, j) 4 else Output-LCS(A, prev, i, j-1)
The backtracing algorithm
39
i j 0 1 p
2 r
3 o
4 v
5 i
6 d
7 e
8 n
9 c
10 e (^0) 0 0 0 0 0 0 0 0 0 0 0 1 p (^) 0 1 1 1 1 1 1 1 1 1 1 2 r 0 1 2 2 2 2 2 2 2 2 2 3 e 0 1 2 2 2 2 2 3 3 3 3 4 s 0 1 2 2 2 2 2 3 3 3 3 5 i 0 1 2 2 2 3 3 3 3 3 3 6 d 0 1 2 2 2 3 4 4 4 4 4 7 e 0 1 2 2 2 3 4 5 5 5 5 8 n 0 1 2 2 2 3 4 5 6 6 6 9 t (^) 0 1 2 2 2 3 4 5 6 6 6
Output: priden 40
Dot Matrix
Sequence A:CTTAACT Sequence B:CGGATCAT C G G A T C A T
C T T A A C T
41
C---TTAACT
CGGATCA--T
Pairwise Alignment
Sequence A: CTTAACT
Sequence B: CGGATCAT
An alignment of A and B:
Sequence A Sequence B
42
C---TTAACT
CGGATCA--T
Pairwise Alignment
Sequence A: CTTAACT
Sequence B: CGGATCAT
An alignment of A and B:
Insertion gap
Match Mismatch
Deletion gap
43
Alignment (or Edit) Graph
Sequence A: CTTAACT
Sequence B: CGGATCAT C G G A T C A T
C T T A A C T
C---TTAACT
CGGATCA--T
44
A simple scoring scheme
• Match: +8 ( w ( x, y ) = 8, if x = y )
• Mismatch: -5 ( w ( x, y ) = -5, if x ≠ y )
• Each gap symbol: -3 ( w (-, x )= w ( x ,-)=-3)
C - - - T T A A C T
C G G A T C A - - T
alignment score
( i.e. space)
49
The Blosum50 Scoring Matrix
50
An optimal alignment
-- an alignment of maximum score
- Let A =a 1 a 2 …a (^) m and B=b 1 b 2 …b (^) n.
- S (^) i,j : the score of an optimal alignment between
a 1 a 2 …a i and b 1 b 2 …b j
- With proper initializations, S (^) i,j can be computed
as follows.
− −
−
−
max
1 , 1
, 1
1 , , i j i j
ij j
i j i ij
s wa b
s w b
s wa
s
51
Computing S i,j
i
j
w ( a (^) i,- )
w (-, b (^) j )
w ( a (^) i,b (^) j )
S (^) m,n (^) 52
Initialization
C G G A T C A T
C T T A A C T
53
S 3,5 = ?
C G G A T C A T
C T T A A C
T 54
S 3,5 = ?
C G G A T C A T
C T T A A C T
optimal score
55
C T T A A C – T
C G G A T C A T
C G G A T C A T
C T T A A C T
56
Global Alignment vs. Local Alignment
• global alignment:
• local alignment:
61
Affine gap penalties
- Match: +8 ( w ( x, y ) = 8, if x = y )
- Mismatch: -5 ( w ( x, y ) = -5, if x ≠ y )
- Each gap symbol: -3 ( w (-, x )= w ( x ,-)=-3)
- E.g. each gap is charged an extra gap-open penalty: -4.
- In general, a gap of length k should have penalty g(k)
C - - - T T A A C T
C G G A T C A - - T
alignment score: 12 – 4 – 4 = 4 62
Affine gap penalties
• A gap of length k is penalized x + k·y.
gap-open penalty gap-symbol penalty Three cases for alignment endings:
- ...x ...x
- ...x ...-
- ...- ...x
an aligned pair
a deletion
an insertion
63
Affine gap penalties
- Let D ( i, j ) denote the maximum score of any
alignment between a 1 a 2 …ai and b 1 b 2 …b j ending
with a deletion.
- Let I ( i, j ) denote the maximum score of any
alignment between a 1 a 2 …ai and b 1 b 2 …b j ending
with an insertion.
- Let S ( i, j ) denote the maximum score of any
alignment between a 1 a 2 …ai and b 1 b 2 …b j.
64
Affine gap penalties
(, ) max
(, ) max
(, ) max
Ii j
Dij
Si j wa b
Sij
Sij x y
Iij y
Iij
Si j x y
Di j y
Dij
i j
65
Affine gap penalties
(Gotoh’s algorithm)
I S
D
I S
D
I S
D
I S
D
-y -x-y
-x-y
-y
w ( a (^) i,b (^) j )
66
k best local alignments
• Smith-Waterman
(Smith and Waterman, 1981; Waterman and Eggert, 1987)
• FASTA
(Wilbur and Lipman, 1983; Lipman and Pearson, 1985)
• BLAST
(Altschul et al., 1990; Altschul et al., 1997)
BLAST and FASTA are key genomic database search tools.
67
k best local alignments
- Smith-Waterman (Smith and Waterman, 1981; Waterman and Eggert, 1987) - linear-space version:sim (Huang and Miller, 1991) - linear-space variants:sim2 (Chao et al., 1995); sim3 (Chao et al., 1997)
- FASTA (Wilbur and Lipman, 1983; Lipman and Pearson, 1985) - linear-space band alignment (Chao et al., 1992)
- BLAST (Altschul et al., 1990; Altschul et al., 1997) - restricted affine gap penalties (Chao, 1999)
68
FASTA
1) Find runs of identities, and identify
regions with the highest density of
identities.
2) Re-score using PAM matrix, and keep top
scoring segments.
3) Eliminate segments that are unlikely to be
part of the alignment.
4) Optimize the alignment in a band.
Its running time is O(n).
73
BLAST
1) Build the hash table for sequence A (the
database sequence).
2) Scan sequence B for hits.
3) Extend hits.
Also O(n) time.
74
BLAST
Step 1: Build the hash table for sequence A. (3-tuple example) For DNA sequences:
Seq. A = AGATCGAT 12345678 AAA AAC .. AGA 1 .. ATC 3 .. CGA 5 .. GAT 2 6 .. TCG 4 .. TTT
For protein sequences: Seq. A = ELVIS
Add xyz to the hash table if Score(xyz, ELV) ≧ T; Add xyz to the hash table if Score(xyz, LVI) ≧ T; Add xyz to the hash table if Score(xyz, VIS) ≧ T;
75
BLAST
Step2: Scan sequence B for hits.
76
BLAST
Step2: Scan sequence B for hits.
Step 3: Extend hits.
hit Terminate if the score of the extension fades away.
BLAST 2.0 saves the time spent in extension, and considers gapped alignments.
77
Remarks
• Filtering is based on the observation that a
good alignment usually includes short
identical or very similar fragments.
• The idea of filtration was used in both
FASTA and BLAST to achieve high speed
78
Linear space ideas
Hirschberg, 1975; Myers and Miller, 1988
m/
(i) scores can be computed in O(n) space (ii) divide-and-conquer
S(a 1 …a m/2,b 1 …b j ) +
S(a m…a m/2+1,b n …b j+1 )
maximized
j
79
Two subproblems
½ original problem size
m/
m/
3m/
80
Four subproblems
¼ original problem size
m/
m/
3m/