Dynamic Programming for Biomolecular Sequence Analysis: Alignment and Local Alignments - P, Study notes of Computer Science

The use of dynamic programming in analyzing biomolecular sequences, specifically in the context of sequence alignment. The basics of dynamic programming, its application to sequence alignment, techniques for improving algorithm efficiency, and multiple sequence alignment. It also touches on future developments in the field.

Typology: Study notes

Pre 2010

Uploaded on 03/28/2010

koofers-user-ite
koofers-user-ite 🇺🇸

1

(1)

10 documents

1 / 23

Toggle sidebar

This page cannot be seen from the preview

Don't miss anything!

bg1
1
Dynamic Programming Method for
Analyzing Biomolecular Sequences
Tao Jiang
Department of Computer Science
University of California - Riverside
(Typeset by Kun-Mao Chao)
http://www.cs.ucr.edu/~jiang
2
Outline
The paradigm of dynamic programming
Sequence alignment – a general framework
for comparing sequences in bioinformatics
Dynamic programming algorithms for
sequence alignment
Techniques for improving the efficiency of
the algorithms
Multiple sequence alignment
3
Dynamic Programming
Dynamic programming is an algorithmic
method for solving optimization problems
with a compositional/recursive cost
structure.
Richard Bellman was one of the principal
founders of this approach.
4
Two key ingredients
Two key ingredients for an optimization problem
to be suitable for a dynamic programming solution:
Each substructure is optimal.
(principle of optimality)
1. optimal substructures 2. overlapping subinstances
Subinstances are dependent.
(Otherwise, a divide-and-conquer
approach is the choice.)
pf3
pf4
pf5
pf8
pf9
pfa
pfd
pfe
pff
pf12
pf13
pf14
pf15
pf16
pf17

Partial preview of the text

Download Dynamic Programming for Biomolecular Sequence Analysis: Alignment and Local Alignments - P and more Study notes Computer Science in PDF only on Docsity!

Dynamic Programming Method for

Analyzing Biomolecular Sequences

Tao Jiang

Department of Computer Science

University of California - Riverside

(Typeset by Kun-Mao Chao)

E-mail: [email protected] http://www.cs.ucr.edu/~jiang 2

Outline

• The paradigm of dynamic programming

• Sequence alignment – a general framework

for comparing sequences in bioinformatics

• Dynamic programming algorithms for

sequence alignment

• Techniques for improving the efficiency of

the algorithms

• Multiple sequence alignment

3

Dynamic Programming

• Dynamic programming is an algorithmic

method for solving optimization problems

with a compositional/recursive cost

structure.

• Richard Bellman was one of the principal

founders of this approach.

4

Two key ingredients

  • Two key ingredients for an optimization problem

to be suitable for a dynamic programming solution:

Each substructure is optimal. (principle of optimality)

  1. optimal substructures 2. overlapping subinstances

Subinstances are dependent. (Otherwise, a divide-and-conquer approach is the choice.)

5

Three basic components

• The development of a dynamic programming

algorithm has three basic components:

  • A recurrence relation (for defining the value/cost

of an optimal solution);

  • A tabular computation (for computing the value of

an optimal solution);

  • A backtracing procedure (for delivering an

optimal solution).

6

Fibonacci numbers

for.

i>

i

F

i

Fi F

F

F

The Fibonacci numbers are defined by the

following recurrence:

7

How to compute F 10 ?

F 10

F 9

F 8

F 8

F 7

F 7

F 6

8

Tabular computation

• Tabular computation can avoid redundant

computation steps.

F 0 F 1 F 2 F 3 F 4 F 5 F 6 F 7 F 8 F 9 F 10

13

Maximum sum interval

(tabular computation)

9 –3 1 7 –15 2 3 –4 2 –7 6 –2 8 4 - S ( i ) 9 6 7 14 –1 2 5 1 3 –4 6 4 12 16 7

The maximum sum

14

Maximum sum interval

(backtracing)

9 –3 1 7 –15 2 3 –4 2 –7 6 –2 8 4 - S ( i ) 9 6 7 14 –1 2 5 1 3 –4 6 4 12 16 7

The maximum-sum interval: 6 -2 8 4 Running time: O(n).

15

Defining scores for alignment columns

• infocon [Stojanovic et al ., 1999]

  • Each column is assigned a score that measures its information content, based on the frequencies of the letters both within the column and within the alignment.

CGGATCAT—GGA CTTAACATTGAA GAGAACATAGTA

16

Defining scores (cont’d)

• phylogen [Stojanovic et al ., 1999]

  • columns are scored based on the evolutionary

relationships among the sequences implied by a

supplied phylogenetic tree.

T

T

T

C

C T T T C C

TC

T

T

T T T C C

T T

T

T

Score = 1 Score = 2

17

Two fundamental problems we solved

(joint work with Lin and Chao)

• Given a sequence of real numbers of length

n and an upper bound U , find a consecutive

subsequence of length at most U with the

maximum sum --- an O ( n )-time algorithm.

U = 3 9 –3 1 7 –15 2 3 –4 2 –7 6 –2 8 4 -

18

Two fundamental problems we solved

(joint work with Lin and Chao)

• Given a sequence of real numbers of length

n and a lower bound L , find a consecutive

subsequence of length at least L with the

maximum average --- an O ( n log L )-time

algorithm. This has been improved to O(n)

by others later.

L = 4

19

Another example

Given a sequence as follows:

and L = 2 , the highest-average interval is the

squared area, which has the average value

20

GC-rich regions

• Our method can be used to locate a region

of length at least L with the highest C+G

ratio in O ( n log L ) time.

ATGACTCGAGCTCGTCA

Search for an interval of length at least L with the highest average.

25

Longest increasing subsequence (LIS)

  • The longest increasing subsequence is to find a

longest increasing subsequence of a given

sequence of distinct integers a 1 a 2 …an.

e.g. 9 2 5 3 7 11 8 10 13 6

are increasing subsequences.

are not increasing subsequences.

We want to find a longest one.

26

A naive approach for LIS

• Let L [ i ] be the length of a longest increasing

subsequence ending at position i.

L [ i ] = 1 + max j = 0..i-1 { L [ j ] | a j < a i }

(use a dummy a 0 = minimum, and L [0]=0)

L [ i ] 1 1 2 2 3 4?

27

A naive approach for LIS

L [ i ] 1 1 2 2 3 4 4 5 6 3

L [ i ] = 1 + max j = 0..i-1 { L [ j ] | a j < a i }

The maximum length

The subsequence 2, 3, 7, 8, 10, 13 is a longest increasing subsequence.

This method runs in O ( n 2 ) time. 28

Binary Search

• Given an ordered sequence x 1 x 2 ... xn , where

x 1 <x 2 < ... <x n , and a number y , a binary

search finds the largest x i such that x i < y in

O (log n ) time.

n ...

n/ n/

29

Binary Search

• How many steps would a binary search

reduce the problem size to 1?

n n/2 n/4 n/8 n/16 ... 1

How many steps? O (log n ) steps.

s n

n s

log 2

30

An O ( n log n ) method for LIS

  • Define BestEnd [ k ] to be the smallest end number

of an increasing subsequence of length k.

BestEnd [1] BestEnd [2] BestEnd [3] BestEnd [4] BestEnd [5] BestEnd [6]

31

An O ( n log n ) method for LIS

  • Define BestEnd [ k ] to be the smallest end number

of an increasing subsequence of length k.

BestEnd [1] BestEnd [2] BestEnd [3] BestEnd [4] BestEnd [5] BestEnd [6]

For each position, we perform a binary search to update BestEnd. Therefore, the running time is O ( n log n ). (^32)

Longest Common Subsequence (LCS)

• A subsequence of a sequence S is obtained

by deleting zero or more symbols from S.

For example, the following are all

subsequences of “president”: pred, sdn,

predent.

• The longest common subsequence problem

is to find a maximum length common

subsequence between two sequences.

37

i j 0 1 p

2 r

3 o

4 v

5 i

6 d

7 e

8 n

9 c

10 e 0 0 0 0 0 0 0 0 0 0 0 0 1 p 0 1 1 1 1 1 1 1 1 1 1 2 r (^) 0 1 2 2 2 2 2 2 2 2 2 3 e 0 1 2 2 2 2 2 3 3 3 3 4 s 0 1 2 2 2 2 2 3 3 3 3 5 i 0 1 2 2 2 3 3 3 3 3 3 6 d (^) 0 1 2 2 2 3 4 4 4 4 4 7 e 0 1 2 2 2 3 4 5 5 5 5 8 n 0 1 2 2 2 3 4 5 6 6 6 9 t 0 1 2 2 2 3 4 5 6 6 6

Running time and memory: O(mn) and O(mn). 38

procedure Output-LCS(A, prev, i, j) 1 if i = 0 or j = 0 then return 2 if prev(i, j)=” “ then (^) ⎢ ⎣

⎡ − − − a i

Output LCSAprevi j print

(, , 1 , 1 )

3 else if prev(i, j)=” “ then Output-LCS(A, prev, i-1, j) 4 else Output-LCS(A, prev, i, j-1)

The backtracing algorithm

39

i j 0 1 p

2 r

3 o

4 v

5 i

6 d

7 e

8 n

9 c

10 e (^0) 0 0 0 0 0 0 0 0 0 0 0 1 p (^) 0 1 1 1 1 1 1 1 1 1 1 2 r 0 1 2 2 2 2 2 2 2 2 2 3 e 0 1 2 2 2 2 2 3 3 3 3 4 s 0 1 2 2 2 2 2 3 3 3 3 5 i 0 1 2 2 2 3 3 3 3 3 3 6 d 0 1 2 2 2 3 4 4 4 4 4 7 e 0 1 2 2 2 3 4 5 5 5 5 8 n 0 1 2 2 2 3 4 5 6 6 6 9 t (^) 0 1 2 2 2 3 4 5 6 6 6

Output: priden 40

Dot Matrix

Sequence A:CTTAACT Sequence B:CGGATCAT C G G A T C A T

C T T A A C T

41

C---TTAACT

CGGATCA--T

Pairwise Alignment

Sequence A: CTTAACT

Sequence B: CGGATCAT

An alignment of A and B:

Sequence A Sequence B

42

C---TTAACT

CGGATCA--T

Pairwise Alignment

Sequence A: CTTAACT

Sequence B: CGGATCAT

An alignment of A and B:

Insertion gap

Match Mismatch

Deletion gap

43

Alignment (or Edit) Graph

Sequence A: CTTAACT

Sequence B: CGGATCAT C G G A T C A T

C T T A A C T

C---TTAACT

CGGATCA--T

44

A simple scoring scheme

• Match: +8 ( w ( x, y ) = 8, if x = y )

• Mismatch: -5 ( w ( x, y ) = -5, if x ≠ y )

• Each gap symbol: -3 ( w (-, x )= w ( x ,-)=-3)

C - - - T T A A C T

C G G A T C A - - T

alignment score

( i.e. space)

49

The Blosum50 Scoring Matrix

50

An optimal alignment

-- an alignment of maximum score

  • Let A =a 1 a 2 …a (^) m and B=b 1 b 2 …b (^) n.
  • S (^) i,j : the score of an optimal alignment between

a 1 a 2 …a i and b 1 b 2 …b j

  • With proper initializations, S (^) i,j can be computed

as follows.

− −

max

1 , 1

, 1

1 , , i j i j

ij j

i j i ij

s wa b

s w b

s wa

s

51

Computing S i,j

i

j

w ( a (^) i,- )

w (-, b (^) j )

w ( a (^) i,b (^) j )

S (^) m,n (^) 52

Initialization

C G G A T C A T

C T T A A C T

53

S 3,5 = ?

C G G A T C A T

C T T A A C

T 54

S 3,5 = ?

C G G A T C A T

C T T A A C T

optimal score

55

C T T A A C – T

C G G A T C A T

C G G A T C A T

C T T A A C T

56

Global Alignment vs. Local Alignment

• global alignment:

• local alignment:

61

Affine gap penalties

  • Match: +8 ( w ( x, y ) = 8, if x = y )
  • Mismatch: -5 ( w ( x, y ) = -5, if xy )
  • Each gap symbol: -3 ( w (-, x )= w ( x ,-)=-3)
  • E.g. each gap is charged an extra gap-open penalty: -4.
  • In general, a gap of length k should have penalty g(k)

C - - - T T A A C T

C G G A T C A - - T

alignment score: 12 – 4 – 4 = 4 62

Affine gap penalties

• A gap of length k is penalized x + k·y.

gap-open penalty gap-symbol penalty Three cases for alignment endings:

  1. ...x ...x
  2. ...x ...-
  3. ...- ...x

an aligned pair

a deletion

an insertion

63

Affine gap penalties

  • Let D ( i, j ) denote the maximum score of any

alignment between a 1 a 2 …ai and b 1 b 2 …b j ending

with a deletion.

  • Let I ( i, j ) denote the maximum score of any

alignment between a 1 a 2 …ai and b 1 b 2 …b j ending

with an insertion.

  • Let S ( i, j ) denote the maximum score of any

alignment between a 1 a 2 …ai and b 1 b 2 …b j.

64

Affine gap penalties

(, ) max

(, ) max

(, ) max

Ii j

Dij

Si j wa b

Sij

Sij x y

Iij y

Iij

Si j x y

Di j y

Dij

i j

65

Affine gap penalties

(Gotoh’s algorithm)

I S

D

I S

D

I S

D

I S

D

-y -x-y

-x-y

-y

w ( a (^) i,b (^) j )

66

k best local alignments

• Smith-Waterman

(Smith and Waterman, 1981; Waterman and Eggert, 1987)

• FASTA

(Wilbur and Lipman, 1983; Lipman and Pearson, 1985)

• BLAST

(Altschul et al., 1990; Altschul et al., 1997)

BLAST and FASTA are key genomic database search tools.

67

k best local alignments

  • Smith-Waterman (Smith and Waterman, 1981; Waterman and Eggert, 1987) - linear-space version:sim (Huang and Miller, 1991) - linear-space variants:sim2 (Chao et al., 1995); sim3 (Chao et al., 1997)
  • FASTA (Wilbur and Lipman, 1983; Lipman and Pearson, 1985) - linear-space band alignment (Chao et al., 1992)
  • BLAST (Altschul et al., 1990; Altschul et al., 1997) - restricted affine gap penalties (Chao, 1999)

68

FASTA

1) Find runs of identities, and identify

regions with the highest density of

identities.

2) Re-score using PAM matrix, and keep top

scoring segments.

3) Eliminate segments that are unlikely to be

part of the alignment.

4) Optimize the alignment in a band.

Its running time is O(n).

73

BLAST

1) Build the hash table for sequence A (the

database sequence).

2) Scan sequence B for hits.

3) Extend hits.

Also O(n) time.

74

BLAST

Step 1: Build the hash table for sequence A. (3-tuple example) For DNA sequences:

Seq. A = AGATCGAT 12345678 AAA AAC .. AGA 1 .. ATC 3 .. CGA 5 .. GAT 2 6 .. TCG 4 .. TTT

For protein sequences: Seq. A = ELVIS

Add xyz to the hash table if Score(xyz, ELV)T; Add xyz to the hash table if Score(xyz, LVI)T; Add xyz to the hash table if Score(xyz, VIS)T;

75

BLAST

Step2: Scan sequence B for hits.

76

BLAST

Step2: Scan sequence B for hits.

Step 3: Extend hits.

hit Terminate if the score of the extension fades away.

BLAST 2.0 saves the time spent in extension, and considers gapped alignments.

77

Remarks

• Filtering is based on the observation that a

good alignment usually includes short

identical or very similar fragments.

• The idea of filtration was used in both

FASTA and BLAST to achieve high speed

78

Linear space ideas

Hirschberg, 1975; Myers and Miller, 1988

m/

(i) scores can be computed in O(n) space (ii) divide-and-conquer

S(a 1 …a m/2,b 1 …b j ) +

S(a m…a m/2+1,b n …b j+1 )

maximized

j

79

Two subproblems

½ original problem size

m/

m/

3m/

80

Four subproblems

¼ original problem size

m/

m/

3m/