Assignment 2. (Due on October 2.)
Instructions on submitting programming solutions:
a. Your code should be easy to compile and run on csil-linux machines.
b. Email your code directory as a tarball (.tar.gz) to the instructor.
c. Give clear instructions in the email about compiling (if necessary) and running the program.
1. (Problem 6.22 from text). Define an overlap alignment between two sequences v = v1…vn and w
= w1…wm to be an alignment between a suffix of v and a prefix of w. For example, if v = TATATA
and w = AAATTT, then a (not necessarily optimal) overlap alignment between v and w is
ATA
AAA
An optimal overlap alignment is an alignment that maximizes the global alignment score between
vi…vn and w1…wj, where the maximum is taken over all suffixes vi…vn of v and all prefixes w1…wj of
w. Give an algorithm that computes the optimal overlap alignment and runs in time O(nm).
Assume that the score of an alignment is computed as usual, i.e., with a (5 x 5) scoring matrix δ.
(10 points)
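
To make the definition concrete, the following minimal sketch checks that a proposed alignment really is an overlap alignment (a suffix of v against a prefix of w) and scores it column by column. It does not compute the optimal alignment. The function names and the toy scoring function toy_delta (+1 match, -1 mismatch or indel) are assumptions made for illustration; the assignment only states that δ is a 5 x 5 scoring matrix.

```python
def is_overlap_alignment(v, w, top, bottom):
    """True if top (gaps removed) is a suffix of v and bottom is a prefix of w."""
    if len(top) != len(bottom):
        return False
    return v.endswith(top.replace("-", "")) and w.startswith(bottom.replace("-", ""))


def toy_delta(a, b):
    """Hypothetical scoring function: +1 match, -1 mismatch or indel."""
    if a == "-" or b == "-":
        return -1
    return 1 if a == b else -1


def score_alignment(top, bottom, delta):
    """Sum delta over the aligned columns; '-' denotes a gap."""
    if len(top) != len(bottom):
        raise ValueError("alignment rows must have equal length")
    return sum(delta(a, b) for a, b in zip(top, bottom))


# The example from the problem statement: ATA over AAA.
v, w = "TATATA", "AAATTT"
example_ok = is_overlap_alignment(v, w, "ATA", "AAA")
example_score = score_alignment("ATA", "AAA", toy_delta)
```

Under the toy scores, the example alignment has two matches and one mismatch.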
2. Suppose u and v are two DNA sequences of length n each. Define a “k-parse” of a string s as a
sequence of substrings: π(s) = {s1, s2, … sr} where each si is a k-length substring of s, with the
condition that for any pair si = s[bi…ei] and sj = s[bj…ej], if i < j then ei < bj. That is, π(s) is a
sequence of non-overlapping k-length substrings of s.
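
The k-parse condition can be checked mechanically. The sketch below assumes a k-parse is represented by the list of 0-based start positions of its k-mers, so that the i-th substring spans positions b to b + k - 1. This representation and the function names are assumptions, not part of the assignment.

```python
def is_k_parse(s, starts, k):
    """Check that starts describes ordered, non-overlapping k-mers of s.

    Each k-mer spans [b, b + k - 1]. The condition e_i < b_j for all i < j
    follows from checking consecutive pairs only.
    """
    prev_end = -1
    for b in starts:
        e = b + k - 1
        if b <= prev_end or e >= len(s):
            return False
        prev_end = e
    return True


def k_parse_substrings(s, starts, k):
    """Return the k-mers selected by the given start positions."""
    return [s[b:b + k] for b in starts]


s = "ACGTACGT"
valid = is_k_parse(s, [0, 4], 3)        # ACG at 0..2 and ACG at 4..6
overlapping = is_k_parse(s, [0, 2], 3)  # second k-mer starts inside the first
```

Note that a k-parse need not cover every position of s: position 3 and position 7 are skipped in the valid example.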
Given a k-parse π(u) of sequence u and a k-parse π(v) of sequence v, consider the task of finding
the optimal global alignment of π(u) and π(v) under the following scheme:
• A k-mer x in π(u) may be aligned with a k-mer y in π(v).
• A k-mer in either π(u) or π(v) may be aligned with a gap in the other k-parse.
• The alignment preserves the order of k-mers in each k-parse. That is, if x1 and y1 are
aligned and x2 and y2 are aligned, and if x1 is to the left of x2 in π(u), then y1 must be to the
left of y2 in π(v).
• The score for aligning two k-mers x and y is equal to k – h(x,y)^2, where h(x,y) is the
number of mismatches between x and y, or in other words, the Hamming distance between
x and y.
• The score for aligning a k-mer with a gap is zero.
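
The scoring rules above can be sketched as two small helpers. The code reads the k-mer score as k minus the square of the Hamming distance; treating the trailing 2 in the formula as an exponent is an assumption, and the names hamming, kmer_score and GAP_SCORE are illustrative.

```python
GAP_SCORE = 0  # score for aligning a k-mer with a gap


def hamming(x, y):
    """Number of positions at which the equal-length strings x and y differ."""
    if len(x) != len(y):
        raise ValueError("k-mers must have the same length")
    return sum(a != b for a, b in zip(x, y))


def kmer_score(x, y):
    """Score for aligning k-mers x and y: k - h(x, y)^2 (exponent assumed)."""
    k = len(x)
    h = hamming(x, y)
    return k - h ** 2


one_mismatch = kmer_score("ACGT", "ACGA")  # k = 4, h = 1
all_mismatch = kmer_score("ACGT", "TGCA")  # k = 4, h = 4
```

Because the penalty grows quadratically under this reading, a pair with many mismatches scores below zero, in which case aligning each k-mer with a gap is the better choice.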
(a) Write down the recurrences and initialization for a dynamic programming algorithm to find the
highest scoring global alignment of two given k-parses. (5 points)
(b) The optimal alignment of u and v is the highest scoring global alignment of a k-parse of u and
a k-parse of v, among all possible k-parses of the respective sequences. Write down the
recurrences and initialization for a dynamic programming algorithm to find the optimal alignment of
two given sequences, under this definition. (15 points)
(c) What is the time complexity of your alignment program, in terms of n and k? (5 points)
(d) Implement the algorithm in a programming language of your choice. The inputs to the
algorithm are two FASTA files, each with one DNA sequence. The command line should look
something like <yourprogramfilename> <fastafilename1> <fastafilename2>. The output should go
into standard output. (15 points)
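
For the input handling only, a minimal Python sketch of reading two single-sequence FASTA files from the command line might look as follows. The function names are hypothetical, the alignment step is deliberately left as a placeholder, and the choice to upper-case the sequence and reject files with more than one record is an assumption.

```python
import sys


def parse_single_fasta(lines):
    """Parse FASTA text holding exactly one record; return (header, sequence)."""
    header = None
    chunks = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                raise ValueError("expected exactly one sequence per file")
            header = line[1:]
        else:
            if header is None:
                raise ValueError("sequence data found before a '>' header")
            chunks.append(line.upper())
    if header is None:
        raise ValueError("no FASTA header found")
    return header, "".join(chunks)


def read_single_fasta(path):
    """Read a one-record FASTA file from disk."""
    with open(path) as handle:
        return parse_single_fasta(handle)


def main(argv):
    """Entry point: <yourprogramfilename> <fastafilename1> <fastafilename2>."""
    if len(argv) != 3:
        print(f"usage: {argv[0]} <fastafilename1> <fastafilename2>", file=sys.stderr)
        return 2
    _, u = read_single_fasta(argv[1])
    _, v = read_single_fasta(argv[2])
    # Placeholder: run the alignment here and print its result to standard output.
    print(len(u), len(v))
    return 0

# In a script, finish with: sys.exit(main(sys.argv))
```

Keeping the parser separate from the file access makes it possible to check the parsing on a list of lines without creating files.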