Assignment 1 for Bioinformatics | CS 5263
Policy on collaboration
When solving your homework problems and working on your project, you
may discuss HIGH-LEVEL approaches to the homework problems with your
classmates. HOWEVER, you are to work out all details of any solutions
discussed and write up the solution completely on your own. In particular, when
working with a student on an assigned homework problem you should do so
verbally: nothing should be written. This is aimed at keeping your discussion
at a high level so everyone can work out the details on their own. Please follow
the spirit of this rather than working to find ways to share details verbally.
Also, you must clearly acknowledge anyone (except the instructor) with whom
you discussed any problem and say briefly what you discussed.
COPYING MATERIAL FROM WEB SITES, BOOKS, OR OTHER SOURCES
IS CONSIDERED EQUIVALENT TO COPYING FROM ANOTHER STUDENT.
INFORMATION FROM THESE SOURCES MAY BE USED IF THE
SOURCES ARE ACKNOWLEDGED AND YOU HAVE WRITTEN DOWN
THE IDEAS ON YOUR OWN.
Violations of any of the above rules will be dealt with harshly! The homework
problems and projects are designed to help you learn the material being taught.
Being told the solution and understanding it is VERY different from working
through the process of actually finding a solution. If you do not take an active
role in the process of solving the homework problems and project, then you
won't get much out of it.
I have read the above policy and accept the terms.
Print Your Name
Your Signature
Homework 1
Due: September 24, 7:00pm
Problem 1 (15 points)
You are given the following DNA sequence, which is believed to contain a small
protein-coding gene.
GGAGGCGTAA AATGCGTACT GGTAATGCAA ACTAATGG
- If this sequence is fully transcribed (used as a coding strand), what is the
corresponding mRNA sequence?
- Which region of the mRNA do you think can be translated into a protein
(hint: Can you identify the start codon and stop codon from the mRNA
sequence?)
- What is the protein sequence encoded by the gene?
- If the reverse-complementary strand of the DNA sequence is also
transcribed, what will be the mRNA sequence?
- Do you think the reverse-complementary strand can encode a protein?
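The steps this problem walks through (transcribing a coding strand, taking the reverse complement, and locating a start and stop codon) can be sketched in a few lines of Python. This is only an illustrative sketch: the function names `transcribe`, `reverse_complement`, and `find_orf` are hypothetical, the example sequence is made up (it is not the homework sequence), and translation into amino acids is left out because it needs a full codon table.

```python
# Hypothetical helper names; the example sequence is invented for illustration.
COMPLEMENT = str.maketrans("ACGT", "TGCA")
STOP_CODONS = {"UAA", "UAG", "UGA"}


def transcribe(coding_strand: str) -> str:
    """mRNA for a coding strand: the same sequence with T replaced by U."""
    return coding_strand.upper().replace("T", "U")


def reverse_complement(dna: str) -> str:
    """Complement every base, then reverse, so the result reads 5' to 3'."""
    return dna.upper().translate(COMPLEMENT)[::-1]


def find_orf(mrna: str):
    """Return (start, end) of the first AUG followed by an in-frame stop codon.

    The slice mrna[start:end] includes the stop codon. Returns None when no
    AUG has an in-frame stop downstream.
    """
    start = mrna.find("AUG")
    while start != -1:
        for i in range(start + 3, len(mrna) - 2, 3):
            if mrna[i:i + 3] in STOP_CODONS:
                return start, i + 3
        start = mrna.find("AUG", start + 1)
    return None


example = "CCATGAAATGACC"                 # made-up sequence
mrna = transcribe(example)                # "CCAUGAAAUGACC"
orf = find_orf(mrna)                      # (2, 11), i.e. "AUGAAAUGA"
other_strand = transcribe(reverse_complement(example))
```

The same `find_orf` call can be applied to `other_strand` to check whether the opposite strand contains a start codon with an in-frame stop.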
Problem 2 (20 points)
Consider the sequences v = TACGGGTAT and w = GGACGTACG. Assume
that the match score is +1, and the mismatch and gap penalties are -1.
- Fill out the dynamic programming table for a global alignment between v
and w. Draw arrows in the cells to store traceback information. What is
the score of the optimal global alignment, and which alignment(s) achieve
this score?
- Fill out the dynamic programming table for a local alignment between v
and w. Draw arrows in the cells to store traceback information. What is
the score of the optimal local alignment in this case, and which alignment(s)
achieve this score?
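For checking a hand-filled table, the two recurrences can be sketched as follows. This is a minimal sketch under the scoring stated in this problem (match +1, mismatch -1, gap -1); the function names are hypothetical, only the scores are computed, and the traceback arrows are not stored.

```python
def global_score(v, w, match=1, mismatch=-1, gap=-1):
    """Score of an optimal global alignment (Needleman-Wunsch style table)."""
    n, m = len(v), len(w)
    S = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        S[i][0] = i * gap            # v[:i] aligned against gaps only
    for j in range(1, m + 1):
        S[0][j] = j * gap            # w[:j] aligned against gaps only
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = S[i - 1][j - 1] + (match if v[i - 1] == w[j - 1] else mismatch)
            S[i][j] = max(diag, S[i - 1][j] + gap, S[i][j - 1] + gap)
    return S[n][m]


def local_score(v, w, match=1, mismatch=-1, gap=-1):
    """Score of an optimal local alignment (Smith-Waterman style table)."""
    n, m = len(v), len(w)
    S = [[0] * (m + 1) for _ in range(n + 1)]   # first row and column stay 0
    best = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = S[i - 1][j - 1] + (match if v[i - 1] == w[j - 1] else mismatch)
            # The extra 0 lets an alignment restart at any cell.
            S[i][j] = max(0, diag, S[i - 1][j] + gap, S[i][j - 1] + gap)
            best = max(best, S[i][j])
    return best
```

To recover the alignments themselves, one would additionally record in each cell which of the candidate terms produced the maximum, which corresponds to the arrows the problem asks you to draw.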
Problem 3 (10 points)
Problem 6.23 in Jones & Pevzner (page 215; also see the circled region on
page 3 of this document). Note that this problem corresponds to one of the
four overlap variants we described in class (lecture 3, slide #61). The algorithm
shown on slide #60 is not directly applicable because it allows all four cases.
6.16 Problems 215
Problem 6.
For a pair of strings $v = v_1 \ldots v_n$ and $w = w_1 \ldots w_m$, define $M(v, w)$ to be the
matrix whose $(i, j)$th entry is the score of the optimal global alignment which aligns
the character $v_i$ with the character $w_j$. Give an $O(nm)$ algorithm which computes
$M(v, w)$.
Define an overlap alignment between two sequences $v = v_1 \ldots v_n$ and $w = w_1 \ldots w_m$ to be an alignment between a suffix of v and a prefix of w. For example, if v = TATATA and w = AAATTT, then a (not necessarily optimal) overlap alignment between v and w is
    ATA
    AAA
An optimal overlap alignment is an alignment that maximizes the global alignment score between $v_i, \ldots, v_n$ and $w_1, \ldots, w_j$, where the maximum is taken over all suffixes $v_i, \ldots, v_n$ of v and all prefixes $w_1, \ldots, w_j$ of w.
Problem 6.
Give an algorithm which computes the optimal overlap alignment, and runs in time
O(nm).
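One common way to set up such a table is sketched below, as an illustration rather than a reference solution. It assumes a scoring of +1 for a match and -1 for a mismatch or indel (the overlap definition above does not fix a scoring), allows the empty overlap, and uses the hypothetical name `overlap_score`. The idea is that a suffix of v may start anywhere at no cost, while the prefix of w must start at $w_1$ and may end anywhere.

```python
def overlap_score(v, w, match=1, mismatch=-1, indel=-1):
    """Best score of aligning some suffix of v against some prefix of w."""
    n, m = len(v), len(w)
    S = [[0] * (m + 1) for _ in range(n + 1)]
    # Column 0 stays 0: skipping a prefix of v is free (the suffix starts later).
    for j in range(1, m + 1):
        S[0][j] = j * indel          # w[:j] before any character of v costs indels
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = S[i - 1][j - 1] + (match if v[i - 1] == w[j - 1] else mismatch)
            S[i][j] = max(diag, S[i - 1][j] + indel, S[i][j - 1] + indel)
    # The suffix must reach the end of v; the prefix of w may stop at any column.
    return max(S[n])


print(overlap_score("CCCTAG", "TAGGGG"))   # 3: suffix TAG against prefix TAG
```

The table has (n+1)(m+1) cells and each is filled in constant time, so the sketch runs in O(nm) time.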
Suppose that we have sequences $v = v_1 \ldots v_n$ and $w = w_1 \ldots w_m$, where v is longer than w. We wish to find a substring of v which best matches all of w. Global alignment won't work because it would try to align all of v. Local alignment won't work because it may not align all of w. Therefore this is a distinct problem which we call the Fitting problem. Fitting a sequence w into a sequence v is a problem of finding a substring $v'$ of v that maximizes the score of alignment $s(v', w)$ among all substrings of v. For example, if v = GTAGGCTTAAGGTTA and w = TAGATA, the best alignments might be

         global             local   fitting
v        GTAGGCTTAAGGTTA    TAG     TAGGCTTA
w        -TAG----A---T-A    TAG     TAGA--TA
score    -3                 3       2

The scores are computed as 1 for match, -1 for mismatch or indel. Note that the optimal local alignment is not a valid fitting alignment. On the other hand, the optimal global alignment contains a valid fitting alignment, but it achieves a suboptimal score among all fitting alignments.
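The three scores in this example can be checked column by column. The sketch below uses a hypothetical helper, `alignment_score`, that takes the two rows of an already gapped alignment and applies the stated scoring (1 for a match, -1 for a mismatch or indel).

```python
def alignment_score(row_v, row_w, match=1, mismatch=-1, indel=-1):
    """Sum the column scores of a gapped alignment given as two equal-length rows."""
    if len(row_v) != len(row_w):
        raise ValueError("alignment rows must have the same length")
    score = 0
    for a, b in zip(row_v, row_w):
        if a == "-" or b == "-":
            score += indel           # a gap in either row is an indel
        elif a == b:
            score += match
        else:
            score += mismatch
    return score


print(alignment_score("GTAGGCTTAAGGTTA", "-TAG----A---T-A"))  # -3 (global)
print(alignment_score("TAG", "TAG"))                          # 3  (local)
print(alignment_score("TAGGCTTA", "TAGA--TA"))                # 2  (fitting)
```

For the global case this is 6 matches and 9 indels; for the fitting case it is 5 matches, 1 mismatch, and 2 indels.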
Problem 6.
Give an algorithm which computes the optimal fitting alignment. Explain how to fill
in the first row and column of the dynamic programming table and give a recurrence
to fill in the rest of the table. Give a method to find the best alignment once the table
is filled in. The algorithm should run in time O(nm).
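A possible shape for such a table, offered only as a hedged sketch for checking your own answer and not as the intended solution, is shown below. It assumes the scoring from the example above (1 for a match, -1 for a mismatch or indel) and uses the hypothetical name `fitting_score`; it returns the score only and does not reconstruct the alignment.

```python
def fitting_score(v, w, match=1, mismatch=-1, indel=-1):
    """Best score of aligning all of w against some substring of v."""
    n, m = len(v), len(w)
    S = [[0] * (m + 1) for _ in range(n + 1)]
    # Column 0 stays 0: the substring of v may start at any position for free.
    for j in range(1, m + 1):
        S[0][j] = j * indel          # all of w must be used, so early w characters cost indels
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = S[i - 1][j - 1] + (match if v[i - 1] == w[j - 1] else mismatch)
            S[i][j] = max(diag, S[i - 1][j] + indel, S[i][j - 1] + indel)
    # w must be consumed completely (last column); the substring of v may end at any row.
    return max(S[i][m] for i in range(n + 1))


print(fitting_score("GTAGGCTTAAGGTTA", "TAGATA"))   # 2, matching the example above
```

Starting a traceback from the row that attains the maximum in the last column and stopping on reaching column 0 would give the fitted substring of v.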
A tandem repeat $P^k$ of a pattern $P = p_1 \ldots p_n$ is a pattern of length $n \cdot k$ formed by concatenation of $k$ copies of $P$. Let $P$ be a pattern and $T$ be a text of length $m$. The Tandem Repeat problem is to find a best local alignment of $T$ with some tandem repeat of $P$. This amounts to aligning $P^k$ against $T$ and the standard local alignment algorithm solves this problem in $O(km^2)$ time.
Problem 6.
Devise a faster algorithm for solving the tandem repeat problem.
An alignment of circular strings is defined as an alignment of linear strings formed by cutting (linearizing) these circular strings at arbitrary positions. The following problem asks to find the cut points of two circular strings that maximize the alignment of the resulting linear strings.
Problem 6.
Devise an efficient algorithm to find an optimal alignment (local and global) of
circular strings.
The early graphical method for comparing nucleotide sequences, the dot matrix, still yields one of the best visual representations of sequence similarities. The axes in a dot matrix correspond to the two sequences $v = v_1 \ldots v_n$ and $w = w_1 \ldots w_m$. A dot is placed at coordinates $(i, j)$ if the substrings $v_i \ldots v_{i+k}$ and $w_j \ldots w_{j+k}$ are sufficiently similar. Two such substrings are considered to be sufficiently similar if the Hamming distance between them is at most $d$. When the sequences are very long, it is not necessary to show exact coordinates; figure 6.29 is based on the sequences corresponding to the β-globin gene in human and mouse. In these plots each axis is on the order of 1000 base pairs long, k = 10 and d = 2.
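The dot-placement rule can be sketched directly. The function names `hamming` and `dot_matrix` below are hypothetical, coordinates are 0-based (the text uses 1-based indices), and each window covers positions i through i+k, so it has k+1 characters, following the indexing in the paragraph above.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def dot_matrix(v, w, k, d):
    """Set of 0-based (i, j) where v[i..i+k] and w[j..j+k] differ in at most d places."""
    size = k + 1                      # a window spans indices i through i+k
    dots = set()
    for i in range(len(v) - size + 1):
        for j in range(len(w) - size + 1):
            if hamming(v[i:i + size], w[j:j + size]) <= d:
                dots.add((i, j))
    return dots


# Toy sequences, invented for illustration.
print(sorted(dot_matrix("ACGTAC", "ACGTTC", k=2, d=0)))  # [(0, 0), (1, 1)]
print(sorted(dot_matrix("ACGTAC", "ACGTTC", k=2, d=1)))  # [(0, 0), (1, 1), (2, 2), (3, 3)]
```

Runs of dots along a diagonal mark stretches where the two sequences stay similar, which is the pattern one looks for when reading a plot such as figure 6.29. This direct version compares every pair of windows and is meant for small inputs only.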
Problem 6.
Use figure 6.29 to answer the following questions:
- How many exons are in the human β-globin gene?
- The dot matrix in figure 6.29 (top) is between the mouse and human genes (i.e.,
all introns and exons are present). Do you think the number of exons in the
β-globin gene is different in the human genome as compared to the mouse
genome?
- Label segments of the axes of the human and mouse genes in figure 6.29 to show
where the introns and exons would be located.
A local alignment between two different strings v and w finds a pair of substrings, one in v and the other in w, with maximum similarity. Suppose that we want to find a pair of (nonoverlapping) substrings within string v with maximum similarity (Optimal Inexact Repeat problem). Computing an optimal local alignment between v and v does not solve the problem, since the resulting alignment may correspond to overlapping substrings.
Problem 6.
Devise an algorithm for the Optimal Inexact Repeat problem.