Inexact Alignment Dynamic Programming, and Gapped Alignment | CMSC 423, Study notes of Computer Science

Material Type: Notes; Class: BIOINFO ALGS, DB, TOOLS; Subject: Computer Science; University: University of Maryland; Term: Fall 2008;

Uploaded on 07/30/2009

CMSC423: Bioinformatic Algorithms, Databases and Tools
Fall 2008, Lecture 9
inexact alignment, dynamic programming, gapped alignment
Recap

Global alignment recap

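[Figure: dynamic programming matrix for the global alignment of AGCGTAG (columns) against GTCAGAC (rows). Only two rows of scores survive:]

C: -28 -14 0 14 10 9 20 19
A: -10 4 3 14 13 24 20

Resulting alignment:

AG-C-GTAG
-GTCAG-AC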

Value(A,A) = 10 Value(A,G) = - Value(A,-) = -

Score[i,j] is the maximum of:

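  1. Score[i-1, j-1] + Value[S1[i], S2[j]] (S1[i] aligned to S2[j])
  2. Score[i-1, j] + Value[S1[i], -] (S1[i] aligned to gap)
  3. Score[i, j-1] + Value[-, S2[j]] (S2[j] aligned to gap)

Here S1 and S2 are indexed from 1, and Score[i,j] is the best score for aligning the first i characters of S1 with the first j characters of S2.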
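The recurrence above can be sketched in Python as follows. This is a minimal illustration, not the lecture's own code: the function name `global_alignment` is made up, and the mismatch score (-5) and gap score (-4) are assumptions, since those values were lost from the slide. They were chosen because, together with the match score of 10, they reproduce the two surviving rows of the matrix for AGCGTAG and GTCAGAC.

```python
def global_alignment(s1, s2, match=10, mismatch=-5, gap=-4):
    """Global alignment by dynamic programming.

    mismatch and gap are assumed values (the slide's were lost).
    Returns (score table, aligned s1, aligned s2).
    """
    n, m = len(s1), len(s2)

    def value(a, b):
        return match if a == b else mismatch

    # score[i][j] = best score for s1[:i] versus s2[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap          # s1 prefix aligned to gaps only
    for j in range(1, m + 1):
        score[0][j] = j * gap          # s2 prefix aligned to gaps only

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + value(s1[i - 1], s2[j - 1]),
                score[i - 1][j] + gap,  # s1 character aligned to gap
                score[i][j - 1] + gap,  # s2 character aligned to gap
            )

    # Trace back from the bottom-right corner to recover one optimal alignment.
    a1, a2 = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if (i > 0 and j > 0 and
                score[i][j] == score[i - 1][j - 1] + value(s1[i - 1], s2[j - 1])):
            a1.append(s1[i - 1]); a2.append(s2[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            a1.append(s1[i - 1]); a2.append("-"); i -= 1
        else:
            a1.append("-"); a2.append(s2[j - 1]); j -= 1
    return score, "".join(reversed(a1)), "".join(reversed(a2))


table, top, bottom = global_alignment("AGCGTAG", "GTCAGAC")
print(table[-1][-1])  # 19 under the assumed scores
print(top)
print(bottom)
```

When several alignments tie for the best score, the traceback returns one of them, so the printed alignment need not be identical to the one shown on the slide.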

Local alignment recap

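[Figure: dynamic programming matrix for the local alignment of AGCGTAG (columns) against GTCAGAC (rows); the cell scores are not recoverable.]

AGCGTAG

GTCAGAC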

Value(A,A) = 10 Value(A,G) = - Value(A,-) = -

Score[i,j] is the maximum of:

  1. 0
  2. Score[i-1, j-1] + Value[S1[i], S2[j]] (S1[i] aligned to S2[j])
  3. Score[i-1, j] + Value[S1[i], -] (S1[i] aligned to gap)
  4. Score[i, j-1] + Value[-, S2[j]] (S2[j] aligned to gap)
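A minimal sketch of the local (Smith-Waterman) recurrence, assuming the same illustrative scores as for the global case (match 10; mismatch -5 and gap -4 are assumptions, not values from the slide). The function name `local_alignment_score` is hypothetical. The only differences from global alignment are the extra option 0, the all-zero first row and column, and taking the best cell anywhere in the table rather than the bottom-right corner.

```python
def local_alignment_score(s1, s2, match=10, mismatch=-5, gap=-4):
    """Return (best local score, (i, j) cell where the best alignment ends)."""
    n, m = len(s1), len(s2)
    # First row and column stay 0: a local alignment may start anywhere.
    score = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_cell = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if s1[i - 1] == s2[j - 1] else mismatch
            score[i][j] = max(
                0,                          # start a new alignment here
                score[i - 1][j - 1] + sub,  # characters aligned
                score[i - 1][j] + gap,      # s1 character aligned to gap
                score[i][j - 1] + gap,      # s2 character aligned to gap
            )
            if score[i][j] > best:
                best, best_cell = score[i][j], (i, j)
    return best, best_cell


print(local_alignment_score("XXACGTYY", "ZZACGTWW"))  # (40, (6, 6))
```

The alignment itself can be recovered by tracing back from `best_cell` until a cell with score 0 is reached.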

Where do the alignment scores come from?

• PAM matrices

  • PAM1 – based on frequency of mutations between closely related proteins (within 1 "evolutionary step")
  • PAM2 – ... within 2 evolutionary steps
  • ... PAM 250 – commonly used

• BLOSUM matrices

  • Frequency of mutations between proteins that are x% similar
  • BLOSUM100 – based on proteins that are exactly the same (e.g. score(A,A) is defined but not score(A,G) )
  • BLOSUM62 – commonly used

• gap scores usually determined empirically

  • BLOSUM
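For orientation, both matrix families are usually described as log-odds scores. The formulas below are the standard textbook formulation, not something taken from the slide, and the symbols are introduced here for illustration: q_ab is the observed frequency of residues a and b aligned to each other, p_a and p_b are background frequencies, λ is a scaling constant, and M is the PAM mutation probability matrix.

```latex
s(a,b) = \frac{1}{\lambda}\,\log\frac{q_{ab}}{p_a\,p_b},
\qquad
M_{\mathrm{PAM}\,n} = \left(M_{\mathrm{PAM}\,1}\right)^{n}
```

A positive score means the pair is seen together more often than chance predicts. The second relation is what "within n evolutionary steps" amounts to: PAM n is obtained by applying the PAM1 mutation model n times.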

Heuristics

• What if we limit the number of differences allowed? E.g., we expect the sequences to be very similar.

• Compute a 'banded' alignment – stay within k differences of the diagonal.

• The optimal alignment cannot stray more than k cells from the diagonal: every step off the diagonal is a gap, which counts as a difference.

• What if we do not know k? Do binary search to find it.

[Figure: band extending k cells on either side of the diagonal]

O(km) running time and space
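A minimal sketch of banded alignment, assuming the band is defined as the cells with |i - j| <= k and reusing illustrative scores (match 10; mismatch -5 and gap -4 are assumptions). The function name `banded_global_score` is hypothetical. Only cells inside the band are stored, which is where the O(km) bound comes from.

```python
def banded_global_score(s1, s2, k, match=10, mismatch=-5, gap=-4):
    """Global alignment score restricted to cells with |i - j| <= k.

    Returns None when the length difference exceeds k, because then the
    bottom-right corner lies outside the band.
    """
    n, m = len(s1), len(s2)
    if abs(n - m) > k:
        return None
    neg_inf = float("-inf")
    score = {(0, 0): 0}                      # only in-band cells are stored
    for i in range(n + 1):
        for j in range(max(0, i - k), min(m, i + k) + 1):
            if i == 0 and j == 0:
                continue
            best = neg_inf
            if i > 0 and j > 0:              # diagonal neighbour is always in band
                sub = match if s1[i - 1] == s2[j - 1] else mismatch
                best = score[(i - 1, j - 1)] + sub
            # Neighbours outside the band count as unreachable (-inf).
            best = max(best,
                       score.get((i - 1, j), neg_inf) + gap,
                       score.get((i, j - 1), neg_inf) + gap)
            score[(i, j)] = best
    return score[(n, m)]


for k in (0, 1, 7):
    print(k, banded_global_score("AGCGTAG", "GTCAGAC", k))
```

With k = 0 no gaps are possible and the score is -5; already at k = 1 the band contains an optimal path, so the result equals the unrestricted score of 19 under these assumed values.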

Exclusion methods

• Assume P (length n) must match T (length m) with at most k errors. Find places in T where P cannot match.

• Split P into k+1 chunks, each of size floor(n/(k+1)).

• If P matches T with at most k errors => at least one chunk matches with no errors, because k errors cannot touch all k+1 chunks.

• Use any exact matching algorithm to find places where a chunk matches T, then run dynamic programming in that vicinity.

• Running time: O(m) on average
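The steps above can be sketched as follows. This is an illustration under stated assumptions rather than the lecture's implementation: errors are counted as edit distance, exact matching uses Python's built-in `str.find` in place of a dedicated algorithm, the last chunk absorbs any remainder of P, and the names `split_chunks`, `best_edit_distance` and `exclusion_search` are made up.

```python
def split_chunks(p, k):
    """Split p into k+1 pieces of size floor(n/(k+1)); the last takes the rest.

    Returns a list of (offset in p, piece).
    """
    size = len(p) // (k + 1)
    if size == 0:
        raise ValueError("pattern too short for k errors")
    return [(i * size, p[i * size:(i + 1) * size] if i < k else p[i * size:])
            for i in range(k + 1)]


def best_edit_distance(p, window):
    """Smallest edit distance between p and any substring of window."""
    prev = [0] * (len(window) + 1)           # a match may start anywhere
    for i in range(1, len(p) + 1):
        cur = [i] + [0] * len(window)
        for j in range(1, len(window) + 1):
            cur[j] = min(prev[j - 1] + (p[i - 1] != window[j - 1]),
                         prev[j] + 1,
                         cur[j - 1] + 1)
        prev = cur
    return min(prev)                          # a match may end anywhere


def exclusion_search(p, t, k):
    """Return sorted (lo, hi) windows of t that contain p with <= k errors."""
    hits = set()
    for offset, piece in split_chunks(p, k):
        pos = t.find(piece)
        while pos != -1:
            # If this chunk is the error-free one, p starts near pos - offset.
            lo = max(0, pos - offset - k)
            hi = min(len(t), pos - offset + len(p) + k)
            if best_edit_distance(p, t[lo:hi]) <= k:
                hits.add((lo, hi))
            pos = t.find(piece, pos + 1)
    return sorted(hits)


print(exclusion_search("ACGTACGT", "TTTTACGAACGTTTTT", 1))  # [(3, 13)]
```

Everything in T that is not near an exact chunk match is excluded without running any dynamic programming, which is why the expensive step is rarely executed when matches are sparse.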

"Famous" approaches

• FASTA (Pearson et al.)

  • Take all k-mers (substrings of length k) from Pattern and identify whether and where they match in the Text
  • Assume the k-mer starting at position i in Pattern matches at position j in Text; remember (j – i), the diagonal on which the match occurred
  • Identify "heavy" diagonals – diagonals where many k-mers match, then refine the diagonals with Smith-Waterman
  • Also look for off-diagonal matches to account for gaps
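The diagonal bookkeeping can be sketched as follows. This covers only the k-mer matching and diagonal counting steps, not the Smith-Waterman refinement or the handling of off-diagonal matches, and the function name `diagonal_counts` is made up for illustration.

```python
from collections import Counter, defaultdict


def diagonal_counts(pattern, text, k):
    """Count exact k-mer matches on each diagonal j - i."""
    # Index every k-mer of the text by its start position.
    index = defaultdict(list)
    for j in range(len(text) - k + 1):
        index[text[j:j + k]].append(j)

    counts = Counter()
    for i in range(len(pattern) - k + 1):
        for j in index.get(pattern[i:i + k], ()):
            counts[j - i] += 1               # diagonal of this k-mer match
    return counts


counts = diagonal_counts("ACGTAC", "TTACGTACTT", 3)
print(counts.most_common(1))  # [(2, 4)]: diagonal 2 carries four 3-mer matches
```

In this example the pattern occurs in the text starting at position 2, so all four of its 3-mers land on diagonal 2, while the single stray match of TAC lands on diagonal -2.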

"Famous" approaches

• BLAST (Altschul et al.)

  • Find short k-mer matches
  • Also search for possible inexact matches, e.g. all k-mers within 1 difference from current one.
  • Extend exact matches with Smith-Waterman algorithm
  • Assign probabilistic scores to matches: what is the probability of finding a match with the same S-W alignment score just by chance (e.g. matching a random string)?
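A minimal sketch of the seeding step only, taking "within 1 difference" literally as at most one substituted character. This is a simplification for illustration and not how the BLAST program itself is implemented; the extension and scoring steps are not shown, and the names `neighborhood` and `find_seeds` are hypothetical.

```python
from collections import defaultdict


def neighborhood(kmer, alphabet="ACGT"):
    """The k-mer itself plus every k-mer that differs in exactly one position."""
    words = {kmer}
    for pos, ch in enumerate(kmer):
        for a in alphabet:
            if a != ch:
                words.add(kmer[:pos] + a + kmer[pos + 1:])
    return words


def find_seeds(pattern, text, k, alphabet="ACGT"):
    """Return sorted (i, j) pairs: pattern k-mer at i seeds a hit at text position j."""
    index = defaultdict(list)
    for j in range(len(text) - k + 1):
        index[text[j:j + k]].append(j)

    seeds = []
    for i in range(len(pattern) - k + 1):
        for word in neighborhood(pattern[i:i + k], alphabet):
            for j in index.get(word, ()):
                seeds.append((i, j))
    return sorted(seeds)


print(len(neighborhood("ACG")))         # 10: the word itself plus 3 positions x 3 letters
print(find_seeds("ACG", "TTAGGTT", 3))  # [(0, 2)]: AGG is one substitution from ACG
```

Each seed would then be extended into a longer alignment and scored, as described in the bullets above.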

Chaining in 1-D

  • Input: multiple overlapping intervals on a line
  • Output: highest weight set of non-overlapping intervals
  • Weight could be length of interval, or Smith-Waterman score, etc.
  • Sort the endpoints (starts, ends) of the intervals
  • For every interval j, store V[j] – best score of a chain ending in j
  • MAX – store highest V[j] seen so far
  • Process endpoints in increasing order of x coordinate
  • If we encounter left end (start) of interval j
    • V[j] = weight(j) + MAX
  • If we encounter right end (end) of interval j
    • MAX = max{V[j], MAX}
  • Running time?
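The sweep above can be sketched as follows. Two details are assumptions not stated on the slide: intervals are given as (start, end, weight) tuples, and when one interval starts at the same coordinate where another ends, the start is processed first, so touching intervals count as overlapping. The function name `chain_1d` is made up.

```python
def chain_1d(intervals):
    """Best total weight of a set of non-overlapping intervals.

    intervals: list of (start, end, weight) tuples.
    """
    START, END = 0, 1
    events = []
    for j, (start, end, _weight) in enumerate(intervals):
        events.append((start, START, j))
        events.append((end, END, j))
    # Sort by coordinate; at equal coordinates START (0) sorts before END (1).
    events.sort()

    v = [0] * len(intervals)   # v[j]: best score of a chain ending in interval j
    best = 0                   # MAX: best v[j] over intervals already closed
    for _x, kind, j in events:
        if kind == START:
            v[j] = intervals[j][2] + best
        else:
            best = max(best, v[j])
    return best


print(chain_1d([(0, 5, 5), (3, 8, 5), (6, 10, 4)]))  # 9: first and third interval
```

In this sketch the sweep itself is linear in the number of endpoints, so sorting dominates and the total is O(n log n) for n intervals.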