Assignment: Optimal Overlap Alignment and K-Parse Alignment of DNA Sequences

Assignment 2. (Due on October 2.)

Instructions on submitting programming solutions:

a. Your code should be easy to compile and run on csil-linux machines.

b. Email your code directory as a tarball (.tar.gz) to the instructor.

c. Give clear instructions in the email about compiling (if necessary) and running the program.

1. (Problem 6.22 from text). Define an overlap alignment between two sequences v = v1…vn and w = w1…wm to be an alignment between a suffix of v and a prefix of w. For example, if v = TATATA and w = AAATTT, then a (not necessarily optimal) overlap alignment between v and w is

ATA

AAA

Optimal overlap alignment is an alignment that maximizes the global alignment score between vi…vn and w1…wj, where the maximum is taken over all suffixes vi…vn of v and all prefixes w1…wj of w. Give an algorithm which computes the optimal overlap alignment and runs in time O(nm). Assume that the score of an alignment is computed as usual, i.e., with a (5 x 5) scoring matrix δ.

(10 points)
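To make the definition concrete, here is a minimal sketch of one possible dynamic program for the overlap score. It is an illustration under stated assumptions, not the expected solution: the function name `overlap_score` is hypothetical, and a simple match/mismatch/indel scheme stands in for the (5 x 5) matrix δ. The first column is held at zero so the alignment may start at any position of v, and the answer is read from the last row so the alignment must run to the end of v.

```python
def overlap_score(v, w, match=1, mismatch=-1, indel=-1):
    """Best score of aligning some suffix of v with some prefix of w.

    Assumption: match/mismatch/indel values replace the scoring matrix delta.
    """
    n, m = len(v), len(w)
    # S[i][j]: best score of aligning a (possibly empty) suffix of v[:i]
    # with all of w[:j].
    S = [[0] * (m + 1) for _ in range(n + 1)]
    for j in range(1, m + 1):
        S[0][j] = S[0][j - 1] + indel  # w[:j] aligned against gaps only
    for i in range(1, n + 1):
        # S[i][0] stays 0: skipping a prefix of v costs nothing.
        for j in range(1, m + 1):
            sub = match if v[i - 1] == w[j - 1] else mismatch
            S[i][j] = max(
                S[i - 1][j - 1] + sub,   # align v[i-1] with w[j-1]
                S[i - 1][j] + indel,     # v[i-1] against a gap
                S[i][j - 1] + indel,     # w[j-1] against a gap
            )
    # The suffix must reach the end of v; the prefix of w may stop anywhere.
    return max(S[n])


if __name__ == "__main__":
    print(overlap_score("TATATA", "AAATTT"))
```

The table has (n+1)(m+1) cells and each is filled in constant time, which matches the O(nm) bound the problem asks for. Recovering the alignment itself would additionally need backtracking pointers, which this sketch omits.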

2. Suppose u and v are two DNA sequences of length n each. Define a “k-parse” of a string s as a sequence of substrings: π(s) = {s1, s2, … sr} where each si is a k-length substring of s, with the condition that for any pair si = s[bi…ei] and sj = s[bj…ej], if i < j then ei < bj. That is, π(s) is a sequence of non-overlapping k-length substrings of s.
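As a small illustration of this definition, the sketch below represents a k-parse by the 0-based start positions of its substrings and checks the condition ei < bj. The representation and the names `is_k_parse` and `k_parse_substrings` are assumptions made for this example only.

```python
def is_k_parse(s, starts, k):
    """Return True if the 0-based start positions, taken in order,
    describe non-overlapping k-length substrings of s."""
    prev_end = -1  # end index of the previous substring
    for b in starts:
        e = b + k - 1
        # Each substring must lie inside s and begin after the previous one ends.
        if b <= prev_end or e >= len(s):
            return False
        prev_end = e
    return True


def k_parse_substrings(s, starts, k):
    """Return the substrings s1, s2, ... selected by the start positions."""
    return [s[b:b + k] for b in starts]


if __name__ == "__main__":
    s = "ACGTACGT"
    print(is_k_parse(s, [0, 3], 3), k_parse_substrings(s, [0, 3], 3))
    print(is_k_parse(s, [0, 2], 3))  # the two substrings would overlap
```

Note that a k-parse need not cover every position of s; positions between the chosen substrings are simply skipped.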

Given a k-parse π(u) of sequence u and a k-parse π(v) of sequence v, consider the task of finding the optimal global alignment of π(u) and π(v) under the following scheme:

• A k-mer x in π(u) may be aligned with a k-mer y in π(v).

• A k-mer in either π(u) or π(v) may be aligned with a gap in the other k-parse.

• The alignment preserves the order of k-mers in each k-parse. That is, if x1 and y1 are aligned and x2 and y2 are aligned, and if x1 is to the left of x2 in π(u), then y1 must be to the left of y2 in π(v).

• The score for aligning two k-mers x and y is equal to k - h(x,y)^2, where h(x,y) is the number of mismatches between x and y, or in other words, the Hamming distance between x and y.

• The score for aligning a k-mer with a gap is zero.
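The k-mer scoring rule in the scheme above can be written directly in code. This is a minimal sketch; the function names `hamming` and `kmer_score` are hypothetical and chosen only for this example.

```python
def hamming(x, y):
    """Number of positions at which two equal-length strings differ."""
    if len(x) != len(y):
        raise ValueError("k-mers must have the same length")
    return sum(a != b for a, b in zip(x, y))


def kmer_score(x, y):
    """Score for aligning k-mers x and y: k minus the squared Hamming distance."""
    return len(x) - hamming(x, y) ** 2


if __name__ == "__main__":
    print(kmer_score("ACGT", "ACGT"))  # identical k-mers
    print(kmer_score("ACGT", "ACGA"))  # one mismatch
    print(kmer_score("AAAA", "TTTT"))  # four mismatches
```

Because the penalty grows with the square of the number of mismatches, a pair of k-mers with more than sqrt(k) mismatches scores below zero, which is worse than aligning each of them with a gap.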

(a) Write down the recurrences and initialization for a dynamic programming algorithm to find the highest scoring global alignment of two given k-parses. (5 points)
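For orientation, one possible shape of such a dynamic program is sketched below. It treats each k-parse as a list of k-mers and is a hedged illustration rather than the expected write-up; the names `kmer_score` and `align_parses` are assumptions. Since a gap scores zero, the first row and column of the table are initialized to zero.

```python
def kmer_score(x, y):
    """k minus the squared Hamming distance between equal-length k-mers."""
    h = sum(a != b for a, b in zip(x, y))
    return len(x) - h * h


def align_parses(pu, pv):
    """Highest global alignment score of two k-parses given as lists of k-mers."""
    r, t = len(pu), len(pv)
    # S[i][j]: best score for the first i k-mers of pu and the first j of pv.
    # Row 0 and column 0 are zero because aligning with a gap scores zero.
    S = [[0] * (t + 1) for _ in range(r + 1)]
    for i in range(1, r + 1):
        for j in range(1, t + 1):
            S[i][j] = max(
                S[i - 1][j],      # pu[i-1] aligned with a gap
                S[i][j - 1],      # pv[j-1] aligned with a gap
                S[i - 1][j - 1] + kmer_score(pu[i - 1], pv[j - 1]),
            )
    return S[r][t]


if __name__ == "__main__":
    print(align_parses(["ACG", "TTT"], ["ACG", "TTA"]))
```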

(b) The optimal alignment of u and v is the highest scoring global alignment of a k-parse of u and a k-parse of v, among all possible k-parses of the respective sequences. Write down the recurrences and initialization for a dynamic programming algorithm to find the optimal alignment of two given sequences, under this definition. (15 points)
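One way to think about this definition is to index the table by sequence positions instead of k-mer indices, so that each cell either skips a position or closes a pair of k-mers ending at that cell. The sketch below follows that idea. It is only one possible formulation, offered under the assumption that skipped positions and gap-aligned k-mers both contribute zero, and the name `best_k_parse_alignment` is hypothetical.

```python
def best_k_parse_alignment(u, v, k):
    """Best score over all pairs of k-parses of u and v (score only)."""
    n, m = len(u), len(v)
    # S[i][j]: best score using only u[:i] and v[:j]; zero when either is empty.
    S = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Position i-1 of u or j-1 of v does not end an aligned k-mer pair.
            best = max(S[i - 1][j], S[i][j - 1])
            if i >= k and j >= k:
                # Align the k-mers ending at u[i-1] and v[j-1].
                h = sum(a != b for a, b in zip(u[i - k:i], v[j - k:j]))
                best = max(best, S[i - k][j - k] + k - h * h)
            S[i][j] = best
    return S[n][m]


if __name__ == "__main__":
    print(best_k_parse_alignment("ACGT", "ACGA", 2))
```

The sketch returns only the score; reporting the chosen k-mers would require recording which of the three options produced each cell and tracing back from S[n][m].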

(c) What is the time complexity of your alignment program, in terms of n and k? (5 points)

(d) Implement the algorithm in a programming language of your choice. The inputs to the algorithm are two Fasta files, each with one DNA sequence. The command line should look something like <yourprogramfilename> <fastafilename1> <fastafilename2>. The output should go into standard output. (15 points)
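For the input handling in part (d), a minimal sketch of reading a single-sequence Fasta file and checking the command line might look like the following. The function names are hypothetical, and the sketch assumes the usual layout of one header line starting with ">" followed by one or more sequence lines. It prints only the sequence lengths as a placeholder for the alignment output.

```python
import sys


def read_fasta_sequence(path):
    """Return the sequence in a single-record Fasta file as one uppercase string."""
    parts = []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            # Skip blank lines and the header line that starts with '>'.
            if not line or line.startswith(">"):
                continue
            parts.append(line.upper())
    return "".join(parts)


def main(argv):
    """Entry point: expects the program name plus two Fasta file names."""
    if len(argv) != 3:
        print(f"usage: {argv[0]} <fastafilename1> <fastafilename2>",
              file=sys.stderr)
        return 1
    u = read_fasta_sequence(argv[1])
    v = read_fasta_sequence(argv[2])
    # Placeholder: the alignment result would be written to standard output here.
    print(len(u), len(v))
    return 0
```

To use it as a script, call `sys.exit(main(sys.argv))` at the bottom of the file and run it with the two Fasta file names as arguments.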
