Overlap Graph - Computational Biology - Assignment, Exercises of Computational Biology

Main points of this assignment are: Overlap Graph, Set of Sequences, Corresponding Sequences, Hamiltonian Path, Maximum Overlap, Suffix Tree, Overlap Detection Program, Sequence File, Source Code, Successive ID Numbers, Overlapping Sequences

Typology: Exercises

2012/2013

Uploaded on 04/23/2013

ashwini

HW 1: due Thursday, September 29th in class
Readings:
  • Green, E.D. (2001). Strategies for the systematic sequencing of complex genomes. Nature Reviews Genetics, 2:573-83.
  • Pop, Mihai and Salzberg, Steven (2008). Bioinformatics challenges of new sequencing technology. Trends in Genetics, 24(3): 142-149.
  • (Optional) Miller, J.R., Koren, S. and Sutton, G. (2010). Assembly algorithms for next-generation sequencing data. Genomics, 95(6): 315-327.
Problems: Submit hard copy of answers to problems 1 and 2. For Problem
3, please use provide to submit your output file as well as any program files.
This works from any EECS machine by typing:
conbrio% provide comp167 hw1 myfilename1.here myfilename2.here ..
  1. Recall that the overlap graph for a set of sequences is defined by including a vertex for every sequence, and there is a directed edge from vertex a to vertex b of weight t, if for the corresponding sequences, the maximum overlap of a suffix of a with a prefix of b contains t characters. Draw the overlap graph for the following sequences. Find a maximum weight Hamiltonian path in this graph, and show the assembly this corresponds to.
(a) ACCA
(b) CAGGG
(c) CCAATA
(d) CGCC
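
The definition above can be turned into a few lines of code. The following is a minimal sketch, not a reference solution: the function names (overlap, overlap_graph, best_hamiltonian_path, assemble) and the toy reads are hypothetical and do not come from the assignment. It assumes that a missing edge counts as weight 0 when scoring a path, and it finds the maximum weight Hamiltonian path by brute force over all vertex orders, which is only practical for a handful of sequences.

```python
from itertools import permutations


def overlap(a, b):
    """Length of the longest suffix of a that equals a prefix of b."""
    for t in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:t]):
            return t
    return 0


def overlap_graph(seqs):
    """Map each ordered pair (x, y), x != y, with a nonzero overlap to its weight."""
    edges = {}
    for x in seqs:
        for y in seqs:
            if x != y:
                t = overlap(seqs[x], seqs[y])
                if t > 0:
                    edges[(x, y)] = t
    return edges


def best_hamiltonian_path(seqs):
    """Brute-force search over all vertex orders (assumption: absent edges weigh 0)."""
    edges = overlap_graph(seqs)

    def weight(order):
        return sum(edges.get(pair, 0) for pair in zip(order, order[1:]))

    best = max(permutations(seqs), key=weight)
    return list(best), weight(best)


def assemble(seqs, order):
    """Merge sequences along a path, dropping each overlapping prefix."""
    result = seqs[order[0]]
    for prev, cur in zip(order, order[1:]):
        result += seqs[cur][overlap(seqs[prev], seqs[cur]):]
    return result


# Hypothetical toy reads, chosen only to illustrate the functions.
toy = {"x": "GATTA", "y": "TTACA", "z": "ACAGG"}
order, total = best_hamiltonian_path(toy)
print(order, total)          # ['x', 'y', 'z'] 6
print(assemble(toy, order))  # GATTACAGG
```

Checking longer overlaps first lets overlap return as soon as it finds a match, so the first hit is the maximum. The same functions can be applied to sequences (a) through (d) to check a hand-drawn graph.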

  2. Draw the suffix tree associated with the sequence ATCCATTATG.
  3. (Inspired by S. Salzberg) In this programming assignment, you are to write an overlap detection program that could be used as part of a sequence assembler. The program should assemble the sequences in the text file "hw1.reads" available from the course web site. To make this easier, the sequences are all from the same strand (you do not need to reverse-complement any of them) and there are no errors. Sequence A is considered to overlap sequence B if there is a suffix of A at least 40 bp in length that exactly matches a prefix of B. Note that this relation is not necessarily symmetric; if A overlaps B, B may not overlap A!

     You should submit both your source code and an output file showing all the overlaps detected in the sequence file. Sort the file by the ID number of each sequence; the file should contain EXACTLY one line per sequence. In other words, you will have one line in your file corresponding to each of the sequences in the data file, in sorted order. That line should contain the ID of the sequence followed by the IDs of all sequences that it overlaps. The list of overlapping sequences should also be sorted in order by ID number. For example:

     R1 R24 R R175
     R2 R33 R109 R138
     ... etc.

     We will be comparing your files to the correct answer using 'diff', so the format should match exactly. Put exactly one space between successive ID numbers and no whitespace after the last ID number in each line. We will also check your program on another data set.
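
The overlap rule and the output format described in Problem 3 can be sketched as follows. This is a brute-force illustration under stated assumptions, not the expected solution: the layout of "hw1.reads" is not given here, so the sketch takes the reads as an already-parsed dictionary; it assumes IDs consist of a letter prefix followed by a number and are sorted numerically; and it assumes a sequence is never listed as overlapping itself. The names overlaps, id_number, and overlap_report are hypothetical.

```python
MIN_OVERLAP = 40  # threshold stated in the assignment


def overlaps(a, b, min_len=MIN_OVERLAP):
    """True if a suffix of a, at least min_len long, exactly matches a prefix of b."""
    longest = min(len(a), len(b))
    return any(a.endswith(b[:k]) for k in range(min_len, longest + 1))


def id_number(read_id):
    """Numeric part of an ID such as 'R175' (assumed ID shape)."""
    return int("".join(ch for ch in read_id if ch.isdigit()))


def overlap_report(reads, min_len=MIN_OVERLAP):
    """One line per read: its ID, then the IDs it overlaps, all in numeric order."""
    ids = sorted(reads, key=id_number)
    lines = []
    for a in ids:
        hits = [b for b in ids
                if b != a and overlaps(reads[a], reads[b], min_len)]
        # Joining with single spaces leaves no trailing whitespace on the line.
        lines.append(" ".join([a] + hits))
    return "\n".join(lines) + "\n"


# Hypothetical short reads with a lowered threshold, for illustration only.
demo = {"R2": "CCCCAAAA", "R1": "AAAATTTT", "R10": "TTTTGGGG"}
print(overlap_report(demo, min_len=4), end="")
# R1 R10
# R2 R1
# R10
```

The demo shows the asymmetry noted in the problem: R2 overlaps R1, but R1 does not overlap R2. Comparing every pair of reads at every candidate length is quadratic in the number of reads, so a larger data set may call for an index over prefixes or a suffix tree instead.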