CMSC423: Bioinformatic Algorithms,
Databases and Tools
Lecture 12
chaining algorithms
multiple alignment
Jobs
• Applied Predictive Technologies – looking for the best
students – focus on databases, not bioinformatics
(forwarded by Daniel Hackner)
Path “planning” and dynamic programming
• One intuitive way to think about dynamic programming
- similar to finding shortest path between two points
- at each “point” ask – what are all possible ways to get here?
- pick the best (shortest, fastest, etc.)
[Figure: route map with the cities DC, Frederick, Baltimore, Harrisburg, Philly, NYC]
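The "what are all possible ways to get here?" idea can be sketched on a small road network. This is only an illustration: the city names come from the slide, but the road segments, the mileages, and the function name `shortest_route` are made-up assumptions, and the cities are assumed to be listed so that every road leads from an earlier city to a later one.

```python
# Hypothetical one-way road segments; the mileages are invented for illustration.
ROADS = {
    "DC": {"Frederick": 50, "Baltimore": 40},
    "Frederick": {"Harrisburg": 70},
    "Baltimore": {"Harrisburg": 85, "Philly": 100},
    "Harrisburg": {"Philly": 105, "NYC": 170},
    "Philly": {"NYC": 95},
    "NYC": {},
}
# Assumed processing order: every road goes from an earlier to a later city.
ORDER = ["DC", "Frederick", "Baltimore", "Harrisburg", "Philly", "NYC"]


def shortest_route(roads, order, source, target):
    """Return (distance, path) from source to target, or None if unreachable."""
    # For each city, collect all the ways to arrive there.
    incoming = {city: [] for city in order}
    for a, outs in roads.items():
        for b, miles in outs.items():
            incoming[b].append((a, miles))

    best = {source: 0}   # best[city] = shortest known distance from source
    prev = {}            # prev[city] = previous city on that shortest route
    for city in order:
        if city == source:
            continue
        # "What are all possible ways to get here?" Then pick the best one.
        options = [(best[a] + miles, a) for a, miles in incoming[city] if a in best]
        if options:
            best[city], prev[city] = min(options)

    if target not in best:
        return None
    path = [target]
    while path[-1] != source:
        path.append(prev[path[-1]])
    return best[target], path[::-1]


print(shortest_route(ROADS, ORDER, "DC", "NYC"))
```

Each city is solved once, using only the already-solved cities that lead into it, which is the same table-filling pattern used in the alignment recurrences.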
Chaining in 1D
- Sort the endpoints (starts, ends) of the intervals
- For every interval j, store V[j] – best score of a chain ending in j
- MAX – store highest V[j] seen so far
- Process endpoints in increasing order of x coordinate
- If we encounter left end (start) of interval j
- If we encounter right end (end) of interval j
- Running time?
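The slide leaves the two update steps open. One common way to fill them in, shown here as a sketch rather than as the lecture's own answer, is: at the start of interval j set V[j] to the weight of j plus MAX, and at the end of interval j update MAX with V[j]. The per-interval weight, the tuple layout `(start, end, weight)`, the rule that two chained intervals may not share an endpoint, and the name `chain_1d` are all assumptions of this example.

```python
def chain_1d(intervals):
    """intervals: list of (start, end, weight) with start <= end.

    Returns the best total weight of a chain of pairwise non-overlapping
    intervals (an interval may follow another only if it starts strictly
    after the other one ends).
    """
    events = []
    for j, (start, end, _weight) in enumerate(intervals):
        events.append((start, 0, j))  # kind 0 = left end (start)
        events.append((end, 1, j))    # kind 1 = right end (end)
    # Sort by x coordinate; at equal x, starts come before ends, so an
    # interval cannot chain onto one that ends at the same coordinate.
    events.sort()

    V = [0] * len(intervals)  # V[j] = best score of a chain ending in j
    best_closed = 0           # the slide's MAX: best V[j] over closed intervals
    for _x, kind, j in events:
        if kind == 0:
            # Start of j: extend the best chain that has already ended.
            V[j] = intervals[j][2] + best_closed
        else:
            # End of j: chains ending in j become available to later intervals.
            best_closed = max(best_closed, V[j])
    return best_closed


example = [(1, 4, 3), (2, 3, 1), (5, 8, 4), (3, 6, 5), (9, 10, 2)]
print(chain_1d(example))
```

Under these assumptions the sort costs O(n log n) and the sweep itself is linear, so sorting dominates the running time.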
Multiple sequence alignment
• Simultaneously identify the relationships among multiple
sequences
• Note: a multiple alignment implies a (not necessarily
optimal) pairwise alignment between every pair of the
individual sequences
HBB_HUMAN   FFESFGDLSTPDAVMGNPKVKAHGKKVL-----GAFSDGLAHLDNLKGTF
HBB_HORSE   FFDSFGDLSNPGAVMGNPKVKAHGKKVL-----HSFGEGVHHLDNLKGTF
HBA_HUMAN   YFPHF-DLS-----HGSAQVKGHGKKVA-----DALTNAVAHVDDMPNAL
HBA_HORSE   YFPHF-DLS-----HGSAQVKAHGKKVG-----DALTLAVGHLDDLPGAL
MYG_PHYCA   KFDRFKHLKTEAEMKASEDLKKHGVTVL-----TALGAILKKKGHHEAEL
GLB5_PETMA  FFPKFKGLTTADQLKKSADVRWHAERII-----NAVNDAVASMDDTEKMS
LGB2_LUPLU  LFSFLKGTSEVP--QNNPELQAHAGKVFKLVYEAAIQLQVTGVVVTDATL
[Conservation line, column positions lost: :.. .:: *. : :. :]

HBA_HUMAN   YFPHF-DLS-----HGSAQVKGHGKKVA-----DALTNAVAHVDDMPNAL
HBA_HORSE   YFPHF-DLS-----HGSAQVKAHGKKVG-----DALTLAVGHLDDLPGAL
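The note that a multiple alignment implies pairwise alignments can be made concrete: take two rows of the multiple alignment and drop every column in which both rows have a gap. The sketch below does this for the HBA_HUMAN and HBA_HORSE rows shown on the slide; the function name `induced_pairwise` is a hypothetical helper, not something from the lecture.

```python
HBA_HUMAN = "YFPHF-DLS-----HGSAQVKGHGKKVA-----DALTNAVAHVDDMPNAL"
HBA_HORSE = "YFPHF-DLS-----HGSAQVKAHGKKVG-----DALTLAVGHLDDLPGAL"


def induced_pairwise(row_a, row_b, gap="-"):
    """Project two rows of a multiple alignment onto a pairwise alignment.

    Columns where both rows carry a gap say nothing about this pair,
    so they are removed; all other columns are kept unchanged.
    """
    kept = [(a, b) for a, b in zip(row_a, row_b) if not (a == gap and b == gap)]
    return "".join(a for a, _ in kept), "".join(b for _, b in kept)


human, horse = induced_pairwise(HBA_HUMAN, HBA_HORSE)
print(human)
print(horse)
```

The result is a valid pairwise alignment, but nothing guarantees that it matches what an optimal pairwise alignment of the same two sequences would produce.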
But... here's a solution
• Dynamic programming solution. e.g. 3 sequences
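• s(i, j, k) – score of the optimal alignment of s1[1..i],
s2[1..j], s3[1..k] – do DP as usual
• s(i, j, k) = max {
    s(i-1, j-1, k-1) + match(s1[i], s2[j], s3[k]),
    ... (six more cases, in which one or two of the three sequences contribute a gap)
  }
[Figure: DP cube with axes s1, s2, s3]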
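A minimal sketch of the three-sequence recurrence follows. It enumerates all seven ways a column can be formed (every non-empty choice of which sequences contribute a character). The scoring is an assumption of this example, not something the slides fix: a column is scored as the sum of its three pairwise scores, with match = 1, mismatch = -1, character against gap = -1, and gap against gap = 0. The names `pair_score`, `column_score`, and `align3_score` are hypothetical.

```python
from itertools import product


def pair_score(a, b, match=1, mismatch=-1, gap=-1):
    """Assumed pairwise scoring; two gaps facing each other cost nothing."""
    if a == "-" and b == "-":
        return 0
    if a == "-" or b == "-":
        return gap
    return match if a == b else mismatch


def column_score(a, b, c):
    """Sum-of-pairs score of one alignment column (an assumed choice)."""
    return pair_score(a, b) + pair_score(a, c) + pair_score(b, c)


def align3_score(s1, s2, s3):
    """Score of the best alignment of three sequences: fills an n1*n2*n3 cube."""
    n1, n2, n3 = len(s1), len(s2), len(s3)
    neg_inf = float("-inf")
    S = [[[neg_inf] * (n3 + 1) for _ in range(n2 + 1)] for _ in range(n1 + 1)]
    S[0][0][0] = 0
    # The 7 moves: each sequence either consumes a character (1) or takes a gap (0).
    moves = [m for m in product((0, 1), repeat=3) if any(m)]
    for i in range(n1 + 1):
        for j in range(n2 + 1):
            for k in range(n3 + 1):
                for di, dj, dk in moves:
                    pi, pj, pk = i - di, j - dj, k - dk
                    if pi < 0 or pj < 0 or pk < 0:
                        continue
                    a = s1[i - 1] if di else "-"
                    b = s2[j - 1] if dj else "-"
                    c = s3[k - 1] if dk else "-"
                    cand = S[pi][pj][pk] + column_score(a, b, c)
                    if cand > S[i][j][k]:
                        S[i][j][k] = cand
    return S[n1][n2][n3]


print(align3_score("ACG", "ACG", "ACG"))
```

The triple loop over the cube is what makes this approach expensive as sequences get longer or more numerous.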
But... it's expensive
• 3 sequences – need to fill in the cube: $O(n^3)$
• k sequences – k-dimensional cube: $O(n^k)$ time/space
• There are tricks that can help – similar to AI
techniques for reducing the search space
• Basic idea – if we can estimate optimal score, we can
prune the search space.
• Note – these are just heuristics – not guaranteed to
work faster
Iterative alignment
• Take sequences si in order:
- align s1 with sc - results in gaps being inserted in both sequences
- align s2 with sc - if gaps must be inserted – insert in previously aligned sequences
- and so on (note: if gaps coincide with previously introduced gaps, there is no need to change the previously aligned sequences)

SC YFPHFDLSHGSAQVKAHGKKVGDALTLAVGHLDDLPGAL

SC YFPHFDLSHGSAQVKAHGKKVGDALTLAVGHLDDLPGAL
S1 YFPHFDLSHG-AQVKG--KKVADALTNAVAHVDDMPNAL

SC YFPHF-DLS-----HGSAQVKAHGKKVG-----DALTLAVGHLDDLPGAL
S1 YFPHF-DLS-----HG-AQVKG--KKVA-----DALTNAVAHVDDMPNAL
S2 FFPKFKGLTTADQLKKSADVRWHAERII-----NAVNDAVASMDDTEKMS

SC YFPHF-DLS-----HGSAQVKAHGKKVG-----DALTLAVGHLDDLPGAL
S1 YFPHF-DLS-----HG-AQVKG--KKVA-----DALTNAVAHVDDMPNAL
S2 FFPKFKGLTTADQLKKSADVRWHAERII-----NAVNDAVASMDDTEKMS
S3 LFSFLKGTSEVP--QNNPELQAHAGKVFKLVYEAAIQLQVTGVVVTDATL
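The gap bookkeeping in the steps above can be sketched as a merge of pairwise alignments against sc. This example assumes the pairwise alignments are already computed and passed in as `(center_row, other_row)` pairs; it merges them in one pass instead of one sequence at a time, which should give the same kind of result because a gap slot in sc is only widened when some sequence needs more room there. The function names `split_by_center` and `merge_star` are hypothetical.

```python
def split_by_center(center_row, other_row, gap="-"):
    """Split one pairwise alignment relative to the center sequence.

    inserts[p] holds the characters of the other sequence that sit opposite
    gaps in the center just before center character p (the last slot is the
    tail); aligned[p] is the character (or gap) opposite center character p.
    """
    inserts, aligned = [""], []
    for c, o in zip(center_row, other_row):
        if c == gap:
            inserts[-1] += o
        else:
            aligned.append(o)
            inserts.append("")
    return inserts, aligned


def merge_star(center, pairwise, gap="-"):
    """Merge pairwise alignments to `center` into one multiple alignment."""
    parts = [split_by_center(c_row, o_row, gap) for c_row, o_row in pairwise]
    n = len(center)
    # Width of each gap slot: the largest insertion any sequence needs there.
    width = [max((len(ins[p]) for ins, _ in parts), default=0) for p in range(n + 1)]

    def build(inserts, aligned):
        out = []
        for p in range(n + 1):
            # Pad shorter insertions with gaps so all rows share the same columns.
            out.append(inserts[p].ljust(width[p], gap))
            if p < n:
                out.append(aligned[p])
        return "".join(out)

    center_row = build([""] * (n + 1), list(center))
    return [center_row] + [build(ins, al) for ins, al in parts]


rows = merge_star("ACGT", [("ACGT", "A-GT"), ("AC-GT", "ACTGT")])
for row in rows:
    print(row)
```

In the small usage example, the second pairwise alignment forces a gap into the center, and the first sequence receives a matching gap column even though its own pairwise alignment did not need one.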
Theorem proof
- Theorem: star alignment is 2-optimal
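- Assumption: distances obey the triangle inequality

$$\mathrm{OPT} = \sum_{s_i,s_j} d^*(s_i,s_j) \;\ge\; \sum_{s_i,s_j} D(s_i,s_j) \;\ge\; k \sum_{s_i} D(s_i,s_c)$$

$$\begin{aligned}
\mathrm{STAR} = \sum_{s_i,s_j} d(s_i,s_j) &\le \sum_{s_i,s_j} \bigl(D(s_i,s_c) + D(s_j,s_c)\bigr) && \text{(triangle inequality)} \\
&= \sum_{s_i,s_j} D(s_j,s_c) + \sum_{s_i,s_j} D(s_i,s_c) \\
&= 2k \sum_{s_i} D(s_i,s_c)
\end{aligned}$$

$$\Rightarrow\; \mathrm{STAR}/\mathrm{OPT} \le 2 \qquad \text{Q.E.D.}$$

Notes:
- $\sum_{s_i} D(s_i,s_c)$ is the score optimized by the choice of $s_c$
- $d^*(s_i,s_j)$ – score of the alignment between $s_i$ and $s_j$ within the optimal alignment
- $d(s_i,s_j)$ – score of the alignment between $s_i$ and $s_j$ within the star alignment
- $D(s_i,s_j)$ – score of the optimal alignment between $s_i$ and $s_j$

[Figure: sc, s_i, s_j]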
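The proof relies on sc being the sequence that minimizes the sum of its distances to all the others. A small sketch of that choice is shown below. It uses unit-cost edit distance as D, which is an assumption made here because it obeys the triangle inequality; the lecture's actual scoring may differ, and the names `edit_distance` and `choose_center` are hypothetical.

```python
def edit_distance(a, b):
    """Unit-cost edit (Levenshtein) distance, computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # match or substitute
        prev = cur
    return prev[-1]


def choose_center(seqs):
    """Return (index of the center, list of distance sums for every sequence)."""
    totals = [sum(edit_distance(s, t) for t in seqs) for s in seqs]
    best = min(range(len(seqs)), key=totals.__getitem__)
    return best, totals


seqs = ["ACGT", "ACGA", "TCGT", "ACGTT"]
center, totals = choose_center(seqs)
print(seqs[center], totals)
```

Because the center has the smallest distance sum, the total over all ordered pairs is at least k times the center's sum, which is the second inequality in the OPT line of the proof.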