Assignment 1 for Bioinformatics | CS 5263, Assignments of Computer Science

Material Type: Assignment; Class: Bioinformatics; Subject: Computer Science; University: University of Texas - San Antonio; Term: Fall 2000;

Uploaded on 08/18/2009

Policy on collaboration
When solving your homework problems and working on your project, you may discuss HIGH-LEVEL approaches to the homework problems with your classmates; HOWEVER, you are to work out all details of any solutions discussed and write up the solution completely on your own. In particular, when working with a student on an assigned homework problem you should do so verbally: nothing should be written. This is aimed at keeping your discussion at a high level so everyone can work out the details on their own. Please follow the spirit of this rather than working to find ways to share details verbally. Also, you must clearly acknowledge anyone (except the instructor) with whom you discussed any problem and say briefly what you discussed.
COPYING MATERIAL FROM WEB SITES, BOOKS, OR OTHER SOURCES IS CONSIDERED EQUIVALENT TO COPYING FROM ANOTHER STUDENT. INFORMATION FROM THESE SOURCES MAY BE USED IF THE SOURCES ARE ACKNOWLEDGED AND YOU HAVE WRITTEN DOWN THE IDEAS ON YOUR OWN.
Violations of any of the above rules will be dealt with harshly! The homework problems and projects are designed to help you learn the material being taught. Being told the solution and understanding it is VERY different from working through the process of actually finding a solution. If you do not take an active role in the process of solving the homework problems and project, then you won’t get much out of it.
I have read the above policy and accept the terms.
Print Your Name
Your Signature

Homework 1

Due: September 24, 7:00pm

Problem 1 (15 points)

You are given the following DNA sequence, which is believed to contain a small protein-coding gene.

GGAGGCGTAA AATGCGTACT GGTAATGCAA ACTAATGG

  • If this sequence is fully transcribed (used as a coding strand), what is the corresponding mRNA sequence?
  • Which region of the mRNA do you think can be translated into a protein? (Hint: can you identify the start codon and stop codon from the mRNA sequence?)
  • What is the protein sequence encoded by the gene?
  • If the reverse-complementary strand of the DNA sequence is also transcribed, what will be the mRNA sequence?
  • Do you think the reverse-complementary strand can encode a protein?
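
The steps in Problem 1 (transcribe, locate a start and stop codon, translate, repeat on the reverse complement) can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions: the function names and the toy sequence `CCATGGCTTAAGG` are hypothetical and are not the homework sequence, the standard genetic code is assumed, and "first open reading frame" is taken to mean the first AUG that is followed in frame by a stop codon.

```python
# Standard genetic code, written compactly with bases ordered T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def transcribe(coding_dna):
    """mRNA read from the coding strand: same sequence with T replaced by U."""
    return coding_dna.upper().replace(" ", "").replace("T", "U")


def reverse_complement(dna):
    """Reverse-complementary strand, written 5' to 3'."""
    complement = str.maketrans("ACGT", "TGCA")
    return dna.upper().replace(" ", "").translate(complement)[::-1]


def first_orf(mrna):
    """Return (start, end, protein) for the first AUG followed in frame by a
    stop codon, or None if no such reading frame exists (0-based, end exclusive)."""
    start = mrna.find("AUG")
    while start != -1:
        protein = []
        for pos in range(start, len(mrna) - 2, 3):
            amino = CODON_TABLE[mrna[pos:pos + 3].replace("U", "T")]
            if amino == "*":  # stop codon closes the reading frame
                return start, pos + 3, "".join(protein)
            protein.append(amino)
        start = mrna.find("AUG", start + 1)
    return None


if __name__ == "__main__":
    toy = "CCATGGCTTAAGG"  # hypothetical toy sequence, not the homework input
    print(transcribe(toy), first_orf(transcribe(toy)))
    rc = reverse_complement(toy)
    print(transcribe(rc), first_orf(transcribe(rc)))
```

On the toy sequence the forward strand contains a short reading frame, while the reverse complement has a start codon but no in-frame stop, so `first_orf` returns `None` for it.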

Problem 2 (20 points)

Consider the sequences v = TACGGGTAT and w = GGACGTACG. Assume that the match score is +1, and the mismatch and gap penalties are -1.

  • Fill out the dynamic programming table for a global alignment between v and w. Draw arrows in the cells to store traceback information. What is the score of the optimal global alignment, and which alignment(s) achieve this score?
  • Fill out the dynamic programming table for a local alignment between v and w. Draw arrows in the cells to store traceback information. What is the score of the optimal local alignment in this case, and which alignment(s) achieve this score?
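
For checking a hand-filled table, the two dynamic programs in Problem 2 can be sketched as follows. This is a minimal sketch, assuming the scoring stated above (+1 match, -1 mismatch, -1 gap); the function name `align` and its `local` flag are hypothetical, and the traceback returns only one optimal alignment even when several exist. Instead of storing arrows, it recovers the move at each cell by re-checking which predecessor produced the cell's value.

```python
def align(v, w, match=1, mismatch=-1, gap=-1, local=False):
    """Return (score, aligned_v, aligned_w) for one optimal global alignment,
    or one optimal local alignment when local=True."""
    n, m = len(v), len(w)
    S = [[0] * (m + 1) for _ in range(n + 1)]
    if not local:
        # Global alignment: leading gaps are paid for in the first row/column.
        for i in range(1, n + 1):
            S[i][0] = i * gap
        for j in range(1, m + 1):
            S[0][j] = j * gap

    def pair(i, j):
        return match if v[i - 1] == w[j - 1] else mismatch

    best, best_cell = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            S[i][j] = max(S[i - 1][j - 1] + pair(i, j),
                          S[i - 1][j] + gap,
                          S[i][j - 1] + gap)
            if local:
                S[i][j] = max(0, S[i][j])  # a local alignment may restart here
                if S[i][j] > best:
                    best, best_cell = S[i][j], (i, j)
    if not local:
        best, best_cell = S[n][m], (n, m)

    # Traceback: follow any predecessor that explains the current cell.
    i, j = best_cell
    top, bottom = [], []
    while (i > 0 or j > 0) and not (local and S[i][j] == 0):
        if i > 0 and j > 0 and S[i][j] == S[i - 1][j - 1] + pair(i, j):
            top.append(v[i - 1]); bottom.append(w[j - 1]); i -= 1; j -= 1
        elif i > 0 and S[i][j] == S[i - 1][j] + gap:
            top.append(v[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(w[j - 1]); j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))


if __name__ == "__main__":
    # Example strings from the Jones & Pevzner excerpt quoted later in this
    # handout, not the v and w of Problem 2.
    print(align("GTAGGCTTAAGGTTA", "TAGATA"))
    print(align("GTAGGCTTAAGGTTA", "TAGATA", local=True))
```

On that excerpt's example the sketch should reproduce the global and local scores listed there (-3 and 3), though the global alignment it prints may differ from the one shown, since optimal alignments are not unique.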

Problem 3 (10 points)

Problem 6.23 in Jones & Pevzner (page 215; see also the circled region on page 3 of this document). Note that this problem corresponds to one of the four overlap variants we described in class (lecture 3, slide #61). The algorithm shown on slide #60 is not directly applicable because it allows all four cases.

Problem 6.

For a pair of strings $v = v_1 \ldots v_n$ and $w = w_1 \ldots w_m$, define $M(v, w)$ to be the matrix whose $(i, j)$th entry is the score of the optimal global alignment which aligns the character $v_i$ with the character $w_j$. Give an $O(nm)$ algorithm which computes $M(v, w)$.
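
One possible way to read this problem is sketched below; it is not presented as the textbook's intended solution. The idea assumed here is that an alignment forced to pair $v_i$ with $w_j$ splits into an optimal global alignment of the prefixes before them, the pair itself, and an optimal global alignment of the suffixes after them, so two ordinary tables (one on the strings, one on their reversals) suffice. The function names and the +1/-1/-1 scoring are assumptions made for illustration.

```python
def global_table(v, w, match=1, mismatch=-1, gap=-1):
    """Full global-alignment table: S[i][j] scores v[:i] against w[:j]."""
    n, m = len(v), len(w)
    S = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        S[i][0] = i * gap
    for j in range(1, m + 1):
        S[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = S[i - 1][j - 1] + (match if v[i - 1] == w[j - 1] else mismatch)
            S[i][j] = max(diag, S[i - 1][j] + gap, S[i][j - 1] + gap)
    return S


def forced_pair_scores(v, w, match=1, mismatch=-1, gap=-1):
    """M[i][j] (0-based) = best global score among alignments that place
    v[i] and w[j] in the same column."""
    n, m = len(v), len(w)
    F = global_table(v, w, match, mismatch, gap)              # prefix scores
    R = global_table(v[::-1], w[::-1], match, mismatch, gap)  # suffix scores
    M = [[0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            pair = match if v[i] == w[j] else mismatch
            # Prefixes v[:i], w[:j] + forced pair + suffixes v[i+1:], w[j+1:].
            M[i][j] = F[i][j] + pair + R[n - i - 1][m - j - 1]
    return M


if __name__ == "__main__":
    for row in forced_pair_scores("AC", "AC"):
        print(row)
```

Both tables take $O(nm)$ time and each entry of $M$ is then a constant-time lookup, which keeps the total within the requested bound.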

Define an overlap alignment between two sequences $v = v_1 \ldots v_n$ and $w = w_1 \ldots w_m$ to be an alignment between a suffix of $v$ and a prefix of $w$. For example, if v = TATATA and w = AAATTT, then a (not necessarily optimal) overlap alignment between v and w is

ATA

AAA

Optimal overlap alignment is an alignment that maximizes the global alignment score between $v_i, \ldots, v_n$ and $w_1, \ldots, w_j$, where the maximum is taken over all suffixes $v_i, \ldots, v_n$ of $v$ and all prefixes $w_1, \ldots, w_j$ of $w$.

Problem 6.

Give an algorithm which computes the optimal overlap alignment, and runs in time $O(nm)$.
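
The definition above translates into a small change to the global alignment table: starting anywhere in $v$ is free, and the alignment must end at $v_n$. A minimal sketch of the score computation follows, assuming +1 match and -1 mismatch/indel scoring and allowing the empty overlap (score 0); the function name is hypothetical and no traceback is included.

```python
def overlap_score(v, w, match=1, mismatch=-1, gap=-1):
    """Best score of a global alignment between some suffix of v and some
    prefix of w (the empty overlap, with score 0, is allowed)."""
    m = len(w)
    # Row 0: no character of v used yet, so w[:j] is aligned against gaps.
    prev = [j * gap for j in range(m + 1)]
    for i in range(1, len(v) + 1):
        # Column 0 is 0 in every row: the suffix of v may start after v[i-1].
        cur = [0] * (m + 1)
        for j in range(1, m + 1):
            diag = prev[j - 1] + (match if v[i - 1] == w[j - 1] else mismatch)
            cur[j] = max(diag, prev[j] + gap, cur[j - 1] + gap)
        prev = cur
    # The suffix must run to the end of v, so only the last row counts;
    # the prefix of w may end at any column j.
    return max(prev)


if __name__ == "__main__":
    print(overlap_score("TATATA", "AAATTT"))  # strings from the example above
    print(overlap_score("AAACGT", "CGTTTT"))  # hypothetical: suffix CGT meets prefix CGT
```

Only two rows are kept because the sketch reports the score alone; recovering the alignment itself would need the full table or stored traceback arrows.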

Suppose that we have sequences $v = v_1 \ldots v_n$ and $w = w_1 \ldots w_m$, where $v$ is longer than $w$. We wish to find a substring of $v$ which best matches all of $w$. Global alignment won’t work because it would try to align all of $v$. Local alignment won’t work because it may not align all of $w$. Therefore this is a distinct problem which we call the Fitting problem. Fitting a sequence $w$ into a sequence $v$ is a problem of finding a substring $v'$ of $v$ that maximizes the score of alignment $s(v', w)$ among all substrings of $v$. For example, if v = GTAGGCTTAAGGTTA and w = TAGATA, the best alignments might be

         global             local    fitting
v        GTAGGCTTAAGGTTA    TAG      TAGGCTTA
w        -TAG----A---T-A    TAG      TAGA--TA
score    -3                 3        2

The scores are computed as 1 for match, -1 for mismatch or indel. Note that the optimal local alignment is not a valid fitting alignment. On the other hand, the optimal global alignment contains a valid fitting alignment, but it achieves a suboptimal score among all fitting alignments.

Problem 6.

Give an algorithm which computes the optimal fitting alignment. Explain how to fill in the first row and column of the dynamic programming table and give a recurrence to fill in the rest of the table. Give a method to find the best alignment once the table is filled in. The algorithm should run in time $O(nm)$.
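
One way the fitting table could be set up is sketched below, as an illustration rather than a definitive answer. It assumes rows index $v$ and columns index $w$, the scoring from the example (1 for match, -1 for mismatch or indel), free starts anywhere in $v$ (first column all zeros), and a fully charged first row so that all of $w$ must be aligned. The function name is hypothetical and only the score is returned.

```python
def fitting_score(v, w, match=1, mismatch=-1, gap=-1):
    """Best score of a global alignment between w and some substring of v."""
    m = len(w)
    # First row: w[:j] aligned before any character of v, so gaps are charged.
    prev = [j * gap for j in range(m + 1)]
    best = prev[m]  # covers the case where the chosen substring of v is empty
    for i in range(1, len(v) + 1):
        # First column is 0: skipping a prefix of v costs nothing.
        cur = [0] * (m + 1)
        for j in range(1, m + 1):
            diag = prev[j - 1] + (match if v[i - 1] == w[j - 1] else mismatch)
            cur[j] = max(diag, prev[j] + gap, cur[j - 1] + gap)
        # All of w must be used, so read the answer from the last column;
        # taking the maximum over rows makes skipping a suffix of v free.
        best = max(best, cur[m])
        prev = cur
    return best


if __name__ == "__main__":
    # Example from the text above; the table there lists a fitting score of 2.
    print(fitting_score("GTAGGCTTAAGGTTA", "TAGATA"))
```

To recover the alignment rather than just the score, one would keep the whole table, start the traceback at the best cell in the last column, and stop on reaching the first column.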

A tandem repeat $P^k$ of a pattern $P = p_1 \ldots p_n$ is a pattern of length $n \cdot k$ formed by concatenation of $k$ copies of $P$. Let $P$ be a pattern and $T$ be a text of length $m$. The Tandem Repeat problem is to find a best local alignment of $T$ with some tandem repeat of $P$. This amounts to aligning $P^k$ against $T$, and the standard local alignment algorithm solves this problem in $O(km^2)$ time.

Problem 6.

Devise a faster algorithm for solving the tandem repeat problem.

An alignment of circular strings is defined as an alignment of linear strings formed by cutting (linearizing) these circular strings at arbitrary positions. The following problem asks to find the cut points of two circular strings that maximize the alignment of the resulting linear strings.

Problem 6.

Devise an efficient algorithm to find an optimal alignment (local and global) of circular strings.

The early graphical method for comparing nucleotide sequences, the dot matrix, still yields one of the best visual representations of sequence similarities. The axes in a dot matrix correspond to the two sequences $v = v_1 \ldots v_n$ and $w = w_1 \ldots w_m$. A dot is placed at coordinates $(i, j)$ if the substrings $v_i \ldots v_{i+k}$ and $w_j \ldots w_{j+k}$ are sufficiently similar. Two such substrings are considered to be sufficiently similar if the Hamming distance between them is at most $d$. When the sequences are very long, it is not necessary to show exact coordinates; figure 6.29 is based on the sequences corresponding to the β-globin gene in human and mouse. In these plots each axis is on the order of 1000 base pairs long, $k = 10$ and $d = 2$.
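
The dot-placement rule just described can be sketched directly. This is a brute-force illustration under stated assumptions: the function name `dot_matrix` is hypothetical, coordinates are 0-based rather than the 1-based indices used in the text, and each window spans positions $i$ through $i+k$ inclusive, that is, $k + 1$ characters.

```python
def dot_matrix(v, w, k, d):
    """Return the list of (i, j) pairs (0-based) where the windows
    v[i..i+k] and w[j..j+k] differ in at most d positions."""
    span = k + 1  # a window from i to i+k inclusive has k+1 characters
    dots = []
    for i in range(len(v) - span + 1):
        for j in range(len(w) - span + 1):
            hamming = sum(a != b for a, b in zip(v[i:i + span], w[j:j + span]))
            if hamming <= d:
                dots.append((i, j))
    return dots


if __name__ == "__main__":
    # Tiny hypothetical inputs; identical strings give dots on the diagonal.
    for i, j in dot_matrix("ACGTACGT", "ACGTTCGT", k=3, d=1):
        print(i, j)
```

This direct version takes time proportional to $n \cdot m \cdot (k+1)$, which is adequate for axes on the order of 1000 base pairs; longer sequences would call for an incremental update of the Hamming distance along each diagonal.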

Problem 6.

Use figure 6.29 to answer the following questions:

  • How many exons are in the human β-globin gene?
  • The dot matrix in figure 6.29 (top) is between the mouse and human genes (i.e., all introns and exons are present). Do you think the number of exons in the β-globin gene is different in the human genome as compared to the mouse genome?
  • Label segments of the axes of the human and mouse genes in figure 6.29 to show where the introns and exons would be located.

A local alignment between two different strings v and w finds a pair of substrings, one in v and the other in w, with maximum similarity. Suppose that we want to find a pair of (nonoverlapping) substrings within string v with maximum similarity (Optimal Inexact Repeat problem). Computing an optimal local alignment between v and v does not solve the problem, since the resulting alignment may correspond to overlapping substrings.

Problem 6.

Devise an algorithm for the Optimal Inexact Repeat problem.