CMSC423: Bioinformatic Algorithms, Databases and Tools
Lecture 9
Inexact alignment, dynamic programming, gapped alignment
Recap
Global alignment recap
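[Figure: global alignment score matrix for AGCGTAG (horizontal axis) and GTCAGAC (vertical axis). Only two rows of scores survive:]
C -28 -14 0 14 10 9 20 19
A -10 4 3 14 13 24 20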
AG-C-GTAG
-GTCAG-AC
Value(A,A) = 10 Value(A,G) = - Value(A,-) = -
Score[i,j] is the maximum of:
- Score[i-1, j-1] + Value(S1[i-1], S2[j-1]) (S1[i-1] aligned to S2[j-1])
- Score[i-1, j] + Value(S1[i-1], -) (S1[i-1] aligned to a gap)
- Score[i, j-1] + Value(-, S2[j-1]) (S2[j-1] aligned to a gap)
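The recurrence can be sketched in a few lines of Python. This is only an illustration: the match value of 10 comes from the slide, but the mismatch and gap values are not legible in these notes, so `mismatch=-2` and `gap=-5` are assumed placeholders, and the function name is hypothetical. Score[i][j] covers the first i characters of S1 and the first j characters of S2.

```python
def global_alignment_score(s1, s2, match=10, mismatch=-2, gap=-5):
    """Score of the best global alignment of s1 and s2.

    mismatch and gap are assumed values, not the ones used in the lecture.
    """
    n, m = len(s1), len(s2)
    score = [[0] * (m + 1) for _ in range(n + 1)]

    # First column and first row: a prefix aligned entirely to gaps.
    for i in range(1, n + 1):
        score[i][0] = score[i - 1][0] + gap
    for j in range(1, m + 1):
        score[0][j] = score[0][j - 1] + gap

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if s1[i - 1] == s2[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + sub,  # s1[i-1] aligned to s2[j-1]
                score[i - 1][j] + gap,      # s1[i-1] aligned to a gap
                score[i][j - 1] + gap,      # s2[j-1] aligned to a gap
            )
    return score[n][m]
```

Recovering the alignment itself would additionally require remembering which of the three cases produced each maximum and tracing back from Score[n][m].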
Local alignment recap
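[Figure: local alignment score matrix for AGCGTAG (horizontal axis) and GTCAGAC (vertical axis); the scores are not recoverable.]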
AGCGTAG
GTCAGAC
Value(A,A) = 10 Value(A,G) = - Value(A,-) = -
Score[i,j] is the maximum of:
- 0
- Score[i-1, j-1] + Value(S1[i-1], S2[j-1]) (S1[i-1] aligned to S2[j-1])
- Score[i-1, j] + Value(S1[i-1], -) (S1[i-1] aligned to a gap)
- Score[i, j-1] + Value(-, S2[j-1]) (S2[j-1] aligned to a gap)
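The only changes from the global case are the extra 0 option and the fact that the answer is the largest cell anywhere in the table. A minimal sketch, again with assumed `mismatch=-2` and `gap=-5` (only the match value of 10 is taken from the slide) and a hypothetical function name:

```python
def local_alignment_score(s1, s2, match=10, mismatch=-2, gap=-5):
    """Score of the best local alignment of s1 and s2.

    mismatch and gap are assumed values, not the ones used in the lecture.
    """
    n, m = len(s1), len(s2)
    # Row 0 and column 0 stay at 0: a local alignment may start anywhere.
    score = [[0] * (m + 1) for _ in range(n + 1)]
    best = 0

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if s1[i - 1] == s2[j - 1] else mismatch
            score[i][j] = max(
                0,                          # start a fresh alignment here
                score[i - 1][j - 1] + sub,  # s1[i-1] aligned to s2[j-1]
                score[i - 1][j] + gap,      # s1[i-1] aligned to a gap
                score[i][j - 1] + gap,      # s2[j-1] aligned to a gap
            )
            best = max(best, score[i][j])
    return best
```

For example, `local_alignment_score("TTTACGT", "GGACGGG")` picks out the shared substring ACG and ignores the unrelated flanks.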
Where do the alignment scores come from?
• PAM matrices
- PAM1 – based on frequency of mutations between closely related proteins (within 1 "evolutionary step")
- PAM2 – ... within 2 evolutionary steps
- ... PAM250 – commonly used
• BLOSUM matrices
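- Based on substitution frequencies observed in conserved blocks of related proteins; sequences that are at least x% identical are clustered so that close relatives are not over-counted
- BLOSUM100 – only identical sequences are clustered, so the scores suit comparisons of very closely related proteins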
- BLOSUM62 – commonly used
• gap scores usually determined empirically
Heuristics
• What if we limit the number of differences allowed? E.g. we
expect the sequences to be very similar.
• Compute a 'banded' alignment: stay within k cells of the
diagonal, where k is the number of differences allowed.
• With at most k differences, the optimal alignment cannot stray more than k cells from the diagonal.
• What if we do not know k? Do a binary search to find it.
O(km) running time and space
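A rough sketch of the banded idea for global alignment: only cells with |i - j| <= k are ever computed, and anything outside the band is treated as unreachable. The function name and the scoring values (`mismatch=-2`, `gap=-5`) are assumptions for illustration. This version keeps only two rows because it returns just the score; storing the whole band, as needed for traceback, gives the O(km) space quoted above.

```python
NEG_INF = float("-inf")

def banded_global_score(s1, s2, k, match=10, mismatch=-2, gap=-5):
    """Best global alignment score using only cells with |i - j| <= k."""
    n, m = len(s1), len(s2)
    if abs(n - m) > k:
        return NEG_INF  # the final cell (n, m) lies outside the band

    # Row 0 of the table, restricted to the band: j = 0 .. min(m, k).
    prev = {j: j * gap for j in range(min(m, k) + 1)}

    for i in range(1, n + 1):
        cur = {}
        for j in range(max(0, i - k), min(m, i + k) + 1):
            # Cells missing from prev/cur are outside the band.
            candidates = [prev.get(j, NEG_INF) + gap]  # s1[i-1] vs gap
            if j > 0:
                sub = match if s1[i - 1] == s2[j - 1] else mismatch
                candidates.append(prev.get(j - 1, NEG_INF) + sub)  # diagonal
                candidates.append(cur.get(j - 1, NEG_INF) + gap)   # s2[j-1] vs gap
            cur[j] = max(candidates)
        prev = cur
    return prev[m]
```

Each row touches at most 2k + 1 cells, so the work is proportional to k times the sequence length rather than to the full table.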
Exclusion methods
• Assume P must match T with at most k errors. Find
places in T where P cannot match.
• Split P into k+1 chunks, each of size floor(n/(k+1)), where n is the length of P.
• If P matches T with at most k errors => at least one
chunk matches with no errors (k errors can touch at most k of the k+1 chunks).
• Use any exact matching algorithm to find places
where a chunk matches T, then run dynamic
programming in that vicinity.
• Running time, on average O(m)
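The filtering step might look like the sketch below. It uses Python's built-in `str.find` as the exact matcher (any exact matching algorithm would do) and returns the windows of T in which dynamic programming would then be run. The function name, the choice to let the last chunk absorb the remainder of P, and the window padding of k on each side are assumptions made for this example.

```python
def candidate_regions(pattern, text, k):
    """Windows (lo, hi) of text where pattern may match with <= k errors."""
    n = len(pattern)
    size = n // (k + 1)
    if size == 0:
        raise ValueError("pattern too short to split into k + 1 chunks")

    regions = set()
    for c in range(k + 1):
        start = c * size
        # Assumption: the last chunk absorbs any leftover characters.
        end = n if c == k else start + size
        chunk = pattern[start:end]

        pos = text.find(chunk)
        while pos != -1:
            # If this chunk sits at offset `start` in the pattern, the full
            # match begins near pos - start; pad by k to allow for indels.
            lo = max(0, pos - start - k)
            hi = min(len(text), pos - start + n + k)
            regions.add((lo, hi))
            pos = text.find(chunk, pos + 1)
    return sorted(regions)
```

Every position of T outside the returned windows is excluded without running any dynamic programming there.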
"Famous" approaches
• FASTA (Pearson et al.)
- Take all k-mers (substrings of length k) from Pattern and identify whether and where they match in the Text
- Assume the k-mer starting at position i in Pattern matches at position j in Text; remember (j - i), the diagonal on which the match occurred
- Identify "heavy" diagonals (diagonals where many k-mers match), then refine the diagonals with Smith-Waterman
- Also look for off-diagonal matches to account for gaps
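The diagonal bookkeeping can be illustrated with a small k-mer index. This is a simplified sketch of the counting step only, not FASTA's actual implementation; the function name is hypothetical.

```python
from collections import Counter, defaultdict

def diagonal_counts(pattern, text, k):
    """Count exact k-mer matches on each diagonal (j - i)."""
    # Index every k-mer of the pattern by its start position(s) i.
    positions = defaultdict(list)
    for i in range(len(pattern) - k + 1):
        positions[pattern[i:i + k]].append(i)

    counts = Counter()
    for j in range(len(text) - k + 1):
        # .get avoids creating empty entries for k-mers absent from pattern.
        for i in positions.get(text[j:j + k], ()):
            counts[j - i] += 1
    return counts
```

Calling `diagonal_counts(pattern, text, k).most_common(1)` then gives the heaviest diagonal, which is where the Smith-Waterman refinement would start.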
"Famous" approaches
• BLAST (Altschul et al.)
- Find short k-mer matches
- Also search for possible inexact matches, e.g. all k-mers within 1 difference from current one.
- Extend exact matches with Smith-Waterman algorithm
- Assign probabilistic scores to matches: what is the probability of finding a match with the same S-W alignment score just by chance (e.g. matching a random string)?
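To make the "within 1 difference" step concrete, the sketch below enumerates the neighbourhood of a k-mer. It assumes a difference means a single substitution and a DNA alphabet; real BLAST selects neighbouring words with a substitution-matrix score threshold, so this is only the simplified version described on the slide. The function name is hypothetical.

```python
def neighbors_within_one(kmer, alphabet="ACGT"):
    """All strings that differ from kmer by at most one substitution."""
    result = {kmer}
    for pos in range(len(kmer)):
        for letter in alphabet:
            # Replace the character at `pos`; re-adding kmer itself is harmless.
            result.add(kmer[:pos] + letter + kmer[pos + 1:])
    return result
```

For a DNA k-mer this yields 3k + 1 seeds, each of which can be looked up exactly in the text before extension.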
Chaining in 1-D
- Input: multiple overlapping intervals on a line
- Output: highest weight set of non-overlapping intervals
- Weight could be length of interval, or Smith-Waterman score, etc.
- Sort the endpoints (starts, ends) of the intervals
- For every interval j, store V[j] – best score of a chain ending in j
- MAX – store highest V[j] seen so far
- Process endpoints in increasing order of x coordinate
- If we encounter left end (start) of interval j: V[j] = weight(j) + MAX
- If we encounter right end (end) of interval j: MAX = max(MAX, V[j])
- Running time?
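One way to fill in the two endpoint cases is sketched below: at a start, the best chain ending in j is its own weight plus the best chain that has already finished; at an end, j becomes available to extend later chains. The input format of (start, end, weight) tuples, the function name, and the rule that intervals sharing an endpoint count as overlapping are assumptions for this example.

```python
def best_chain_weight(intervals):
    """Highest total weight of a set of non-overlapping intervals.

    intervals: list of (start, end, weight) tuples with start <= end.
    """
    START, END = 0, 1
    events = []
    for j, (start, end, _weight) in enumerate(intervals):
        events.append((start, START, j))
        events.append((end, END, j))
    # Sort by x coordinate; at equal x, starts come before ends, so two
    # intervals that merely touch are treated as overlapping (assumption).
    events.sort()

    V = [0] * len(intervals)  # V[j]: best score of a chain ending in j
    best = 0                  # MAX: best V[j] among intervals already ended
    for _x, kind, j in events:
        if kind == START:
            V[j] = intervals[j][2] + best
        else:
            best = max(best, V[j])
    return best
```

In this sketch the sweep itself is linear in the number of endpoints, so the sort dominates, giving O(n log n) for n intervals.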