CMSC423: Bioinformatic Algorithms, Databases and Tools
Lecture 9
Sequence alignment: inexact alignment
dynamic programming, gapped alignment, heuristics
Play around with alignments
• USC alignment library (seqaln)
http://www.mhoenicka.de/software/cygwinports/seqaln.html
Global alignment recap
[Score table, not yet filled in: AGCGTAG across the columns, GTCAGAC along the rows, read from the bottom up]
AGCGTAG
GTCAGAC
Value(A,A) = 10 Value(A,G) = - Value(A,-) = -
Score[i,j] is the maximum of:
- Score[i-1, j-1] + Value(S1[i], S2[j]) (S1[i] aligned with S2[j])
- Score[i-1, j] + Value(S1[i], -) (S1[i] aligned to a gap)
- Score[i, j-1] + Value(-, S2[j]) (S2[j] aligned to a gap)
Global alignment recap
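[Score table, partially filled in: AGCGTAG across the columns, GTCAGAC along the rows, read from the bottom up. Only the top two rows are legible:]
     -    A    G    C    G    T    A    G
C  -28  -14    0   14   10    9   20   19
A       -10    4    3   14   13   24   20
Resulting alignment:
AG-C-GTAG
-GTCAG-AC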
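The recurrence above can be sketched in a few lines of Python. This is only an illustration, not the course's reference code: the function name is made up, and mismatch = -5 and gap = -4 are assumptions, since only Value(A,A) = 10 is legible on the slide. They were picked because they reproduce the table rows shown in the recap.

```python
def global_alignment_table(s1, s2, match=10, mismatch=-5, gap=-4):
    """Fill the Score table; rows index prefixes of s1, columns prefixes of s2.

    mismatch and gap are assumed values (not legible in the notes).
    """
    m, n = len(s1), len(s2)
    score = [[0] * (n + 1) for _ in range(m + 1)]
    # First column / first row: a prefix aligned entirely against gaps.
    for i in range(1, m + 1):
        score[i][0] = i * gap
    for j in range(1, n + 1):
        score[0][j] = j * gap
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            value = match if s1[i - 1] == s2[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + value,  # s1[i] aligned with s2[j]
                score[i - 1][j] + gap,        # s1[i] aligned to a gap
                score[i][j - 1] + gap,        # s2[j] aligned to a gap
            )
    return score


table = global_alignment_table("GTCAGAC", "AGCGTAG")
print(table[-1])      # last row, i.e. the row labeled C in the recap
print(table[-1][-1])  # score of the best global alignment
```

With these assumed values the last row comes out as -28 -14 0 14 10 9 20 19, the same numbers as the row labeled C in the recap table.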
Dynamic programming solution
• With affine gap penalties (GapOpen + GapExtend), the traditional 1-table approach doesn't work anymore
• Instead, use 4 tables:
- V – stores value of best alignment between S1[1..i], S2[1..j]
- G – best alignment between S1[1..i], S2[1..j] s.t. S1[i] aligned with S2[j]
- E – best alignment between S1[1..i], S2[1..j], s.t. alignment ends with gap in S1
- F – best alignment between S1[1..i], S2[1..j], s.t. alignment ends with gap in S2
• V[i,j] = max(E[i,j], F[i,j], G[i,j])
• As in the traditional approach, find the cell in the V matrix where
V[i,j] is maximal.
Affine gap recurrences
• V[i,j] = max(E[i,j], F[i,j], G[i,j])
• G[i,j] = V[i-1, j-1] + Value(S1[i], S2[j])
- irrespective of how we got here (hence the use of V), S1[i] and S2[j] are matched
• E[i,j] = max(E[i, j-1], V[i, j-1] - GapOpen) - GapExtend
- either we add a gap in S1 to an existing one (E - GapExtend)
- or we add a gap in S1 when there was none (V - GapOpen - GapExtend)
• F[i,j] = max(F[i-1, j], V[i-1, j] - GapOpen) - GapExtend
- either we add a gap in S2 to an existing one (F - GapExtend)
- or we add a gap in S2 when there was none (V - GapOpen - GapExtend)
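A minimal Python sketch of the four-table recurrences above. Several things here are assumptions that the slides do not state: the boundary conditions (a leading gap of length j costs GapOpen + j * GapExtend), the scoring values match = 10 and mismatch = -5, the function name, and the choice to read off V[m][n], which gives a global alignment score.

```python
NEG_INF = float("-inf")


def affine_gap_score(s1, s2, gap_open, gap_extend, match=10, mismatch=-5):
    """Global alignment score with affine gaps, using tables V, G, E, F."""
    m, n = len(s1), len(s2)
    V = [[NEG_INF] * (n + 1) for _ in range(m + 1)]
    G = [[NEG_INF] * (n + 1) for _ in range(m + 1)]
    E = [[NEG_INF] * (n + 1) for _ in range(m + 1)]
    F = [[NEG_INF] * (n + 1) for _ in range(m + 1)]

    # Assumed boundary conditions: prefixes aligned entirely against one gap.
    V[0][0] = 0
    for j in range(1, n + 1):
        E[0][j] = -gap_open - j * gap_extend  # gap in S1
        V[0][j] = E[0][j]
    for i in range(1, m + 1):
        F[i][0] = -gap_open - i * gap_extend  # gap in S2
        V[i][0] = F[i][0]

    for i in range(1, m + 1):
        for j in range(1, n + 1):
            value = match if s1[i - 1] == s2[j - 1] else mismatch
            G[i][j] = V[i - 1][j - 1] + value
            E[i][j] = max(E[i][j - 1], V[i][j - 1] - gap_open) - gap_extend
            F[i][j] = max(F[i - 1][j], V[i - 1][j] - gap_open) - gap_extend
            V[i][j] = max(E[i][j], F[i][j], G[i][j])
    return V[m][n]


# One gap of length 2 (cost 5 + 2) beats two gaps of length 1 (cost 2 * 6).
print(affine_gap_score("AAAA", "AA", gap_open=5, gap_extend=1))
```

Setting gap_open to 0 collapses the model back to the single-table case with a per-character gap cost of gap_extend, which is a convenient sanity check.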
Running times
• All these algorithms run in O(mn) time, i.e. quadratic time
• Note: this is significantly worse than exact matching, which runs in linear time, O(m + n)
• On Wednesday we'll talk about speed-up opportunities
• BTW, how much space is needed?
• If we only need to find the best score (not the exact
alignment as well) – O(min(m,n))
• If we need to find the best alignment: an elegant divide
and conquer algorithm (Hirschberg's algorithm) leads to a linear-space solution.
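The O(min(m,n)) space bound for the score-only case follows from the fact that each row of the table depends only on the row before it. A sketch under assumptions: the linear gap model from the recap, illustrative scoring values (match = 10, mismatch = -5, gap = -4), a made-up function name, and a symmetric scoring function, which is what makes swapping the two strings safe.

```python
def global_score_linear_space(s1, s2, match=10, mismatch=-5, gap=-4):
    """Best global alignment score, keeping only two rows of the table."""
    # Let the shorter string index the row, so the row has min(m, n) + 1 cells.
    # Swapping is only valid because the assumed scoring is symmetric.
    if len(s2) > len(s1):
        s1, s2 = s2, s1
    prev = [j * gap for j in range(len(s2) + 1)]
    for i in range(1, len(s1) + 1):
        curr = [i * gap] + [0] * len(s2)
        for j in range(1, len(s2) + 1):
            value = match if s1[i - 1] == s2[j - 1] else mismatch
            curr[j] = max(
                prev[j - 1] + value,  # both characters aligned
                prev[j] + gap,        # s1[i] aligned to a gap
                curr[j - 1] + gap,    # s2[j] aligned to a gap
            )
        prev = curr
    return prev[-1]


print(global_score_linear_space("AGCGTAG", "GTCAGAC"))
```

This returns only the score. Recovering the alignment itself in linear space needs the divide and conquer step mentioned above, which is not shown here.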
Where do the alignment scores come from?
• PAM matrices
- PAM1 – based on frequency of mutations between closely related proteins (within 1 "evolutionary step")
- PAM2 – ... within 2 evolutionary steps
- ...
- PAM250 – commonly used
• BLOSUM matrices
- BLOSUMx – substitution frequencies in aligned blocks of related proteins, counted after merging sequences that are more than x% identical
- BLOSUM100 – only identical sequences are merged, so the counts are dominated by closely related proteins
- BLOSUM62 – commonly used
• gap scores usually determined empirically
Exclusion methods
• Assume P must match T with at most k errors. Find
places in T where P cannot match.
• Split P (of length n) into k+1 chunks of size floor(n/(k+1)).
• If P matches T with at most k errors => at least one
chunk matches with no errors, since k errors can touch at most k of the k+1 chunks
• Use any exact matching algorithm to find places
where a chunk matches T, then run dynamic
programming in that vicinity.
• Running time: O(m) on average, where m is the length of T
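The steps above can be sketched as follows. This is a hedged illustration, not a tuned implementation: the function names are made up, errors are counted as unit-cost edits (mismatch, insertion, deletion), Python's built-in str.find stands in for "any exact matching algorithm", and the verification window is simply the pattern's footprint widened by k on each side.

```python
def chunk_pattern(p, k):
    """Split p into k+1 chunks of size floor(n/(k+1)); the last takes the rest."""
    size = len(p) // (k + 1)
    if size == 0:
        raise ValueError("pattern too short for k errors")
    chunks = []
    for c in range(k + 1):
        start = c * size
        end = start + size if c < k else len(p)
        chunks.append((start, p[start:end]))
    return chunks


def best_errors_in_window(p, window):
    """Fewest edits needed to match p against some substring of window."""
    prev = [0] * (len(window) + 1)  # a match may start anywhere in the window
    for i in range(1, len(p) + 1):
        curr = [i] + [0] * len(window)
        for j in range(1, len(window) + 1):
            cost = 0 if p[i - 1] == window[j - 1] else 1
            curr[j] = min(prev[j - 1] + cost, prev[j] + 1, curr[j - 1] + 1)
        prev = curr
    return min(prev)  # a match may end anywhere in the window


def exclusion_search(p, t, k):
    """Return (start, end) windows of t where p matches with at most k errors."""
    hits = set()
    for offset, chunk in chunk_pattern(p, k):
        pos = t.find(chunk)
        while pos != -1:
            # Dynamic programming only in the vicinity of the exact chunk match.
            lo = max(0, pos - offset - k)
            hi = min(len(t), pos - offset + len(p) + k)
            if best_errors_in_window(p, t[lo:hi]) <= k:
                hits.add((lo, hi))
            pos = t.find(chunk, pos + 1)
    return sorted(hits)


print(exclusion_search("ACGTACGT", "TTTTACGAACGTTTTT", k=1))
```

Everything outside the reported windows is excluded without ever running dynamic programming on it, which is where the average-case saving comes from.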
Exclusion methods
[Figure: an exact match of one chunk of the Pattern in the Text marks the location of a putative alignment]
"Famous" approaches
• FASTA (Pearson et al.)
- Take all k-mers (substrings of length k) from Pattern and identify whether and where they match in the Text
- Assume the k-mer starting at position i in Pattern matches at position j in Text; remember (j - i), the diagonal on which the match occurred
- Identify "heavy" diagonals, i.e. diagonals where many k-mers match, then refine the diagonals with Smith-Waterman
- Also look for off-diagonal matches to account for gaps
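The diagonal bookkeeping described above can be sketched like this. It covers only the k-mer lookup and diagonal counting, not the Smith-Waterman refinement or the off-diagonal joining, and it is not FASTA's actual code: the function name and the min_hits threshold are hypothetical.

```python
from collections import Counter, defaultdict


def heavy_diagonals(pattern, text, k, min_hits):
    """Count k-mer matches per diagonal (j - i) and keep the heavy ones."""
    # Index every k-mer of the text by its start positions.
    index = defaultdict(list)
    for j in range(len(text) - k + 1):
        index[text[j:j + k]].append(j)

    # A k-mer at pattern position i matching text position j lies on diagonal j - i.
    hits = Counter()
    for i in range(len(pattern) - k + 1):
        for j in index.get(pattern[i:i + k], ()):
            hits[j - i] += 1

    return {diag: count for diag, count in hits.items() if count >= min_hits}


print(heavy_diagonals("ACGTAC", "GGACGTACGG", k=3, min_hits=2))
```

In this toy input all four k-mers of the pattern match along diagonal 2, while the repeated ACG at text position 6 contributes a single stray hit on diagonal 6 that the threshold filters out.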
"Famous" approaches
• BLAST (Altschul et al.)
- Find short k-mer matches
- Also search for possible inexact matches, e.g. all k-mers within 1 difference from current one.
- Extend exact matches with the Smith-Waterman algorithm
- Assign probabilistic scores to matches: what is the probability of finding a match with the same S-W alignment score just by chance (e.g. matching a random string)?
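The seeding step above (exact k-mer matches plus all k-mers within 1 difference) can be sketched as follows. This is a simplification for illustration only: "difference" is taken to mean a single substitution, the alphabet is assumed to be DNA, the scan over the text is deliberately naive, and the function names are made up. BLAST itself is more elaborate than this.

```python
def neighborhood(kmer, alphabet="ACGT"):
    """All words obtained from kmer by changing at most one letter."""
    words = {kmer}
    for pos in range(len(kmer)):
        for letter in alphabet:
            words.add(kmer[:pos] + letter + kmer[pos + 1:])
    return words


def find_seeds(pattern, text, k, alphabet="ACGT"):
    """Return (i, j) pairs where the k-mer at pattern position i matches the
    k-mer at text position j exactly or with one substitution."""
    seeds = []
    for i in range(len(pattern) - k + 1):
        words = neighborhood(pattern[i:i + k], alphabet)
        for j in range(len(text) - k + 1):
            if text[j:j + k] in words:
                seeds.append((i, j))
    return seeds


print(len(neighborhood("ACG")))        # 1 exact word + 3 positions * 3 other letters
print(find_seeds("ACG", "TTAGGTT", k=3))
```

Each seed would then be handed to the extension step. A practical version would index the text once instead of rescanning it for every k-mer of the pattern.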