Prepare for your exams
Get points
Guidelines and tips
Sell on Docsity
Docsity AI

Prepare for your exams

Study with the several resources on Docsity

Earn points to download

Earn points by helping other students or get them with a premium plan

Guidelines and tips

Sell on Docsity

Docsity AI

Log in Sign up

Prepare for your exams

Study with the several resources on Docsity

Find documents

Prepare for your exams with the study notes shared by other students like you on Docsity

Search for your university

Find the specific documents for your university's exams

Docsity AINEW

Summarize your documents, ask them questions, convert them into quizzes and concept maps

Explore questions

Clear up your doubts by reading the answers to questions asked by your fellow students

Earn points to download

Earn points by helping other students or get them with a premium plan

Share documents

20 Points

For each uploaded document

Answer questions

5 Points

For each given answer (max 1 per day)

All the ways to get free points

Get points immediately

Choose a premium plan with all the points you need

Study Opportunities

Choose your next study program

Get in touch with the best universities in the world. Search through thousands of universities and official partners

Community

Ask the community

Ask the community for help and clear up your study doubts

Free resources

Our save-the-student-ebooks!

Download our free guides on studying techniques, anxiety management strategies, and thesis advice from Docsity tutors

Dynamic Programming for Biomolecular Sequence Analysis: Alignment and Local Alignments - P, Study notes of Computer Science

University of California-Riverside Computer Science

Prof. Tao Jiang

The use of dynamic programming in analyzing biomolecular sequences, specifically in the context of sequence alignment. The basics of dynamic programming, its application to sequence alignment, techniques for improving algorithm efficiency, and multiple sequence alignment. It also touches on future developments in the field.

Typology: Study notes

Pre 2010

Uploaded on 03/28/2010

koofers-user-ite 🇺🇸

1

(1)

10 documents

1 / 23

This page cannot be seen from the preview

Don't miss anything!

1

Dynamic Programming Method for

Analyzing Biomolecular Sequences

Tao Jiang

Department of Computer Science

University of California - Riverside

(Typeset by Kun-Mao Chao)

E-mail: [email protected]

http://www.cs.ucr.edu/~jiang

2

Outline

• The paradigm of dynamic programming

• Sequence alignment – a general framework

for comparing sequences in bioinformatics

• Dynamic programming algorithms for

sequence alignment

• Techniques for improving the efficiency of

the algorithms

• Multiple sequence alignment

3

Dynamic Programming

• Dynamic programming is an algorithmic

method for solving optimization problems

with a compositional/recursive cost

structure.

• Richard Bellman was one of the principal

founders of this approach.

4

Two key ingredients

• Two key ingredients for an optimization problem

to be suitable for a dynamic programming solution:

Each substructure is optimal.

(principle of optimality)

1. optimal substructures 2. overlapping subinstances

Subinstances are dependent.

(Otherwise, a divide-and-conquer

approach is the choice.)

Discover Study notes of Computer Science University of California-Riverside

Partial preview of the text

Download Dynamic Programming for Biomolecular Sequence Analysis: Alignment and Local Alignments - P and more Study notes Computer Science in PDF only on Docsity!

Dynamic Programming Method for

Analyzing Biomolecular Sequences

Tao Jiang

Department of Computer Science

University of California - Riverside

(Typeset by Kun-Mao Chao)

E-mail: [email protected] http://www.cs.ucr.edu/~jiang 2

Outline

• The paradigm of dynamic programming

• Sequence alignment – a general framework

for comparing sequences in bioinformatics

• Dynamic programming algorithms for

sequence alignment

• Techniques for improving the efficiency of

the algorithms

• Multiple sequence alignment

3

Dynamic Programming

• Dynamic programming is an algorithmic

method for solving optimization problems

with a compositional/recursive cost

structure.

• Richard Bellman was one of the principal

founders of this approach.

4

Two key ingredients

Two key ingredients for an optimization problem

to be suitable for a dynamic programming solution:

Each substructure is optimal. (principle of optimality)

optimal substructures 2. overlapping subinstances

Subinstances are dependent. (Otherwise, a divide-and-conquer approach is the choice.)

5

Three basic components

• The development of a dynamic programming

algorithm has three basic components:

A recurrence relation (for defining the value/cost

of an optimal solution);

A tabular computation (for computing the value of

an optimal solution);

A backtracing procedure (for delivering an

optimal solution).

6

Fibonacci numbers

for.

i>

i

F

i

Fi F

F

The Fibonacci numbers are defined by the

following recurrence:

7

How to compute F 10 ？

F 10

F 9

F 8

F 7

F 6

8

Tabular computation

• Tabular computation can avoid redundant

computation steps.

F 0 F 1 F 2 F 3 F 4 F 5 F 6 F 7 F 8 F 9 F 10

13

Maximum sum interval

(tabular computation)

9 –3 1 7 –15 2 3 –4 2 –7 6 –2 8 4 - S ( i ) 9 6 7 14 –1 2 5 1 3 –4 6 4 12 16 7

The maximum sum

14

Maximum sum interval

(backtracing)

9 –3 1 7 –15 2 3 –4 2 –7 6 –2 8 4 - S ( i ) 9 6 7 14 –1 2 5 1 3 –4 6 4 12 16 7

The maximum-sum interval: 6 -2 8 4 Running time: O(n).

15

Defining scores for alignment columns

• infocon [Stojanovic et al ., 1999]

Each column is assigned a score that measures its information content, based on the frequencies of the letters both within the column and within the alignment.

CGGATCAT—GGA CTTAACATTGAA GAGAACATAGTA

16

Defining scores (cont’d)

• phylogen [Stojanovic et al ., 1999]

columns are scored based on the evolutionary

relationships among the sequences implied by a

supplied phylogenetic tree.

T

C

C T T T C C

TC

T

T T T C C

T T

T

Score = 1 Score = 2

17

Two fundamental problems we solved

(joint work with Lin and Chao)

• Given a sequence of real numbers of length

n and an upper bound U , find a consecutive

subsequence of length at most U with the

maximum sum --- an O ( n )-time algorithm.

U = 3 9 –3 1 7 –15 2 3 –4 2 –7 6 –2 8 4 -

18

Two fundamental problems we solved

(joint work with Lin and Chao)

• Given a sequence of real numbers of length

n and a lower bound L , find a consecutive

subsequence of length at least L with the

maximum average --- an O ( n log L )-time

algorithm. This has been improved to O(n)

by others later.

L = 4

19

Another example

Given a sequence as follows:

and L = 2 , the highest-average interval is the

squared area, which has the average value

20

GC-rich regions

• Our method can be used to locate a region

of length at least L with the highest C+G

ratio in O ( n log L ) time.

ATGACTCGAGCTCGTCA

Search for an interval of length at least L with the highest average.

25

Longest increasing subsequence (LIS)

The longest increasing subsequence is to find a

longest increasing subsequence of a given

sequence of distinct integers a 1 a 2 …an.

e.g. 9 2 5 3 7 11 8 10 13 6

are increasing subsequences.

are not increasing subsequences.

We want to find a longest one.

26

A naive approach for LIS

• Let L [ i ] be the length of a longest increasing

subsequence ending at position i.

L [ i ] = 1 + max j = 0..i-1 { L [ j ] | a j < a i }

(use a dummy a 0 = minimum, and L [0]=0)

L [ i ] 1 1 2 2 3 4?

27

A naive approach for LIS

L [ i ] 1 1 2 2 3 4 4 5 6 3

L [ i ] = 1 + max j = 0..i-1 { L [ j ] | a j < a i }

The maximum length

The subsequence 2, 3, 7, 8, 10, 13 is a longest increasing subsequence.

This method runs in O ( n 2 ) time. 28

Binary Search

• Given an ordered sequence x 1 x 2 ... xn , where

x 1 <x 2 < ... <x n , and a number y , a binary

search finds the largest x i such that x i < y in

O (log n ) time.

n ...

n/ n/

29

Binary Search

• How many steps would a binary search

reduce the problem size to 1?

n n/2 n/4 n/8 n/16 ... 1

How many steps? O (log n ) steps.

s n

n s

log 2

30

An O ( n log n ) method for LIS

Define BestEnd [ k ] to be the smallest end number

of an increasing subsequence of length k.

BestEnd [1] BestEnd [2] BestEnd [3] BestEnd [4] BestEnd [5] BestEnd [6]

31

An O ( n log n ) method for LIS

Define BestEnd [ k ] to be the smallest end number

of an increasing subsequence of length k.

BestEnd [1] BestEnd [2] BestEnd [3] BestEnd [4] BestEnd [5] BestEnd [6]

For each position, we perform a binary search to update BestEnd. Therefore, the running time is O ( n log n ). (^32)

Longest Common Subsequence (LCS)

• A subsequence of a sequence S is obtained

by deleting zero or more symbols from S.

For example, the following are all

subsequences of “president”: pred, sdn,

predent.

• The longest common subsequence problem

is to find a maximum length common

subsequence between two sequences.

37

i j 0 1 p

2 r

3 o

4 v

5 i

6 d

7 e

8 n

9 c

10 e 0 0 0 0 0 0 0 0 0 0 0 0 1 p 0 1 1 1 1 1 1 1 1 1 1 2 r (^) 0 1 2 2 2 2 2 2 2 2 2 3 e 0 1 2 2 2 2 2 3 3 3 3 4 s 0 1 2 2 2 2 2 3 3 3 3 5 i 0 1 2 2 2 3 3 3 3 3 3 6 d (^) 0 1 2 2 2 3 4 4 4 4 4 7 e 0 1 2 2 2 3 4 5 5 5 5 8 n 0 1 2 2 2 3 4 5 6 6 6 9 t 0 1 2 2 2 3 4 5 6 6 6

Running time and memory: O(mn) and O(mn). 38

procedure Output-LCS(A, prev, i, j) 1 if i = 0 or j = 0 then return 2 if prev(i, j)=” “ then (^) ⎢ ⎣

⎡ − − − a i

Output LCSAprevi j print

(, , 1 , 1 )

3 else if prev(i, j)=” “ then Output-LCS(A, prev, i-1, j) 4 else Output-LCS(A, prev, i, j-1)

The backtracing algorithm

39

i j 0 1 p

2 r

3 o

4 v

5 i

6 d

7 e

8 n

9 c

10 e (^0) 0 0 0 0 0 0 0 0 0 0 0 1 p (^) 0 1 1 1 1 1 1 1 1 1 1 2 r 0 1 2 2 2 2 2 2 2 2 2 3 e 0 1 2 2 2 2 2 3 3 3 3 4 s 0 1 2 2 2 2 2 3 3 3 3 5 i 0 1 2 2 2 3 3 3 3 3 3 6 d 0 1 2 2 2 3 4 4 4 4 4 7 e 0 1 2 2 2 3 4 5 5 5 5 8 n 0 1 2 2 2 3 4 5 6 6 6 9 t (^) 0 1 2 2 2 3 4 5 6 6 6

Output: priden 40

Dot Matrix

Sequence A：CTTAACT Sequence B：CGGATCAT C G G A T C A T

C T T A A C T

41

C---TTAACT

CGGATCA--T

Pairwise Alignment

Sequence A: CTTAACT

Sequence B: CGGATCAT

An alignment of A and B:

Sequence A Sequence B

42

C---TTAACT

CGGATCA--T

Pairwise Alignment

Sequence A: CTTAACT

Sequence B: CGGATCAT

An alignment of A and B:

Insertion gap

Match Mismatch

Deletion gap

43

Alignment (or Edit) Graph

Sequence A: CTTAACT

Sequence B: CGGATCAT C G G A T C A T

C T T A A C T

C---TTAACT

CGGATCA--T

44

A simple scoring scheme

• Match: +8 ( w ( x, y ) = 8, if x = y )

• Mismatch: -5 ( w ( x, y ) = -5, if x ≠ y )

• Each gap symbol: -3 ( w (-, x )= w ( x ,-)=-3)

C - - - T T A A C T

C G G A T C A - - T

alignment score

( i.e. space)

49

The Blosum50 Scoring Matrix

50

An optimal alignment

-- an alignment of maximum score

Let A =a 1 a 2 …a (^) m and B=b 1 b 2 …b (^) n.
S (^) i,j : the score of an optimal alignment between

a 1 a 2 …a i and b 1 b 2 …b j

With proper initializations, S (^) i,j can be computed

as follows.

− −

−

max

1 , 1

, 1

1 , , i j i j

ij j

i j i ij

s wa b

s w b

s wa

s

51

Computing S i,j

i

j

w ( a (^) i,- )

w (-, b (^) j )

w ( a (^) i,b (^) j )

S (^) m,n (^) 52

Initialization

C G G A T C A T

C T T A A C T

53

S 3,5 = ？

C G G A T C A T

C T T A A C

T 54

S 3,5 = ？

C G G A T C A T

C T T A A C T

optimal score

55

C T T A A C – T

C G G A T C A T

C T T A A C T

56

Global Alignment vs. Local Alignment

• global alignment:

• local alignment:

61

Affine gap penalties

Match: +8 ( w ( x, y ) = 8, if x = y )
Mismatch: -5 ( w ( x, y ) = -5, if x ≠ y )
Each gap symbol: -3 ( w (-, x )= w ( x ,-)=-3)
E.g. each gap is charged an extra gap-open penalty: -4.
In general, a gap of length k should have penalty g(k)

C - - - T T A A C T

C G G A T C A - - T

alignment score: 12 – 4 – 4 = 4 62

Affine gap penalties

• A gap of length k is penalized x + k·y.

gap-open penalty gap-symbol penalty Three cases for alignment endings:

...x ...x
...x ...-
...- ...x

an aligned pair

a deletion

an insertion

63

Affine gap penalties

Let D ( i, j ) denote the maximum score of any

alignment between a 1 a 2 …ai and b 1 b 2 …b j ending

with a deletion.

Let I ( i, j ) denote the maximum score of any

alignment between a 1 a 2 …ai and b 1 b 2 …b j ending

with an insertion.

Let S ( i, j ) denote the maximum score of any

alignment between a 1 a 2 …ai and b 1 b 2 …b j.

64

Affine gap penalties

(, ) max

Ii j

Dij

Si j wa b

Sij

Sij x y

Iij y

Iij

Si j x y

Di j y

Dij

i j

65

Affine gap penalties

(Gotoh’s algorithm)

I S

D

I S

D

I S

D

I S

D

-y -x-y

-x-y

-y

w ( a (^) i,b (^) j )

66

k best local alignments

• Smith-Waterman

(Smith and Waterman, 1981; Waterman and Eggert, 1987)

• FASTA

(Wilbur and Lipman, 1983; Lipman and Pearson, 1985)

• BLAST

(Altschul et al., 1990; Altschul et al., 1997)

BLAST and FASTA are key genomic database search tools.

67

k best local alignments

Smith-Waterman (Smith and Waterman, 1981; Waterman and Eggert, 1987) - linear-space version：sim (Huang and Miller, 1991) - linear-space variants：sim2 (Chao et al., 1995); sim3 (Chao et al., 1997)
FASTA (Wilbur and Lipman, 1983; Lipman and Pearson, 1985) - linear-space band alignment (Chao et al., 1992)
BLAST (Altschul et al., 1990; Altschul et al., 1997) - restricted affine gap penalties (Chao, 1999)

68

FASTA

1) Find runs of identities, and identify

regions with the highest density of

identities.

2) Re-score using PAM matrix, and keep top

scoring segments.

3) Eliminate segments that are unlikely to be

part of the alignment.

4) Optimize the alignment in a band.

Its running time is O(n).

73

BLAST

1) Build the hash table for sequence A (the

database sequence).

2) Scan sequence B for hits.

3) Extend hits.

Also O(n) time.

74

BLAST

Step 1: Build the hash table for sequence A. (3-tuple example) For DNA sequences:

Seq. A = AGATCGAT 12345678 AAA AAC .. AGA 1 .. ATC 3 .. CGA 5 .. GAT 2 6 .. TCG 4 .. TTT

For protein sequences: Seq. A = ELVIS

Add xyz to the hash table if Score(xyz, ELV) ≧ T; Add xyz to the hash table if Score(xyz, LVI) ≧ T; Add xyz to the hash table if Score(xyz, VIS) ≧ T;

75

BLAST

Step2: Scan sequence B for hits.

76

BLAST

Step2: Scan sequence B for hits.

Step 3: Extend hits.

hit Terminate if the score of the extension fades away.

BLAST 2.0 saves the time spent in extension, and considers gapped alignments.

77

Remarks

• Filtering is based on the observation that a

good alignment usually includes short

identical or very similar fragments.

• The idea of filtration was used in both

FASTA and BLAST to achieve high speed

78

Linear space ideas

Hirschberg, 1975; Myers and Miller, 1988

m/

(i) scores can be computed in O(n) space (ii) divide-and-conquer

S(a 1 …a m/2,b 1 …b j ) +

S(a m…a m/2+1,b n …b j+1 )

maximized

j

79

Two subproblems

½ original problem size

m/

3m/

80

Four subproblems

¼ original problem size

m/

3m/

Dynamic Programming for Biomolecular Sequence Analysis: Alignment and Local Alignments - P, Study notes of Computer Science

Related documents

Partial preview of the text

Download Dynamic Programming for Biomolecular Sequence Analysis: Alignment and Local Alignments - P and more Study notes Computer Science in PDF only on Docsity!

Dynamic Programming Method for

Analyzing Biomolecular Sequences

Tao Jiang

Department of Computer Science

University of California - Riverside

(Typeset by Kun-Mao Chao)

Outline

• The paradigm of dynamic programming

• Sequence alignment – a general framework

for comparing sequences in bioinformatics

• Dynamic programming algorithms for

sequence alignment

• Techniques for improving the efficiency of

the algorithms

• Multiple sequence alignment

Dynamic Programming

• Dynamic programming is an algorithmic

method for solving optimization problems

with a compositional/recursive cost

structure.

• Richard Bellman was one of the principal

founders of this approach.

Two key ingredients

to be suitable for a dynamic programming solution:

Three basic components

• The development of a dynamic programming

algorithm has three basic components:

of an optimal solution);

an optimal solution);

optimal solution).

Fibonacci numbers

for.

i>

i

F

i

Fi F

F

F

The Fibonacci numbers are defined by the

following recurrence:

How to compute F 10 ？

F 10

F 9

F 8

F 8

F 7

F 7

F 6

Tabular computation

• Tabular computation can avoid redundant

computation steps.

F 0 F 1 F 2 F 3 F 4 F 5 F 6 F 7 F 8 F 9 F 10

Maximum sum interval

(tabular computation)

Maximum sum interval

(backtracing)

Defining scores for alignment columns

• infocon [Stojanovic et al ., 1999]

Defining scores (cont’d)

• phylogen [Stojanovic et al ., 1999]

relationships among the sequences implied by a

supplied phylogenetic tree.

T

T

T

C

C T T T C C

TC

T

T

T T T C C

T T

T

T

Two fundamental problems we solved