CMSC423 Projects: Gene Finder, Sequence Overlapper, Aligner, RNA Folding

Four potential projects for the CMSC423 class: a gene finder, a shotgun sequence overlapper, a multiple aligner, and an RNA folding program. Students can work individually or in teams of up to two, with team projects requiring additional features. Project deliverables include source code, documentation, and a report. The handout gives project requirements, input and output formats, and suggested resources.

Uploaded on 02/13/2009

CMSC423 Project 2
Handed out: 11/11/2008
Due: 12/9/2008
Project choice due by 11/18/2008
The second project in the class is your choice. For convenience I've listed four possible
projects below; however, you can come up with your own project. Furthermore, for this
project you can work in a team of up to two people. You have until next Tuesday
(11/18/2008) to email me both your project choice and the name of your partner.
If you choose to work as a team, the project will have to be more complex than if you
work alone - I have outlined how in the description of the projects below.
Note that the description of the projects is intentionally vague - part of your assignment is
to figure out what will make your project good.
Project deliverables:
1. Your source code (should run on the glue machines)
2. Documentation describing how to run your code.
3. Report (1-2 pages) describing the results of applying your code to some public data.
The report should also highlight potential improvements/extensions you might want to
implement if you had more time available.
1. Gene finder
Implement a simple gene finder for bacteria: find stop codons, identify in-frame start
codons, then evaluate the resulting ORFs (open reading frames) based on codon preferences
in the organism being analyzed. Your program must accept as input the sequence of an
organism as well as a codon bias table (http://www.kazusa.or.jp/codon/). As output you
should provide a list of genes in GFF format (http://www.sanger.ac.uk/Software/formats/GFF/).
If working in a team you must also implement an additional module that identifies long
stretches of DNA without a valid ORF, then searches these regions against a protein
database using the blastx program in order to identify genes potentially missed by the
initial module.
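The three steps above (stop codons, in-frame starts, codon-preference scoring) could be sketched roughly as follows. This is only a minimal illustration under several assumptions that are not part of the assignment: it scans the forward strand only, treats ATG as the only start codon, scores an ORF as the mean log2 ratio of each codon's frequency (per thousand, as in a codon usage table) to a uniform background, and uses a made-up source name ("simple_orf") in the GFF line. The function names are hypothetical.

```python
import math

STOP_CODONS = {"TAA", "TAG", "TGA"}
START_CODONS = {"ATG"}  # assumption: alternative bacterial starts (GTG, TTG) are ignored


def find_orfs(seq, min_codons=3):
    """Return sorted (start, end) pairs for forward-strand ORFs.

    Coordinates are 0-based; end is exclusive and includes the stop codon.
    Each ORF begins at the first in-frame start after the previous stop.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if codon in STOP_CODONS:
                if start is not None and (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None
            elif start is None and codon in START_CODONS:
                start = i
    return sorted(orfs)


def codon_score(orf_seq, usage_per_thousand, pseudo=0.1):
    """Mean log2(codon frequency / uniform background), stop codon excluded.

    The scoring rule is an assumption chosen for illustration; codons missing
    from the table get a small pseudo-frequency.
    """
    background = 1000.0 / 64
    codons = [orf_seq[i:i + 3] for i in range(0, len(orf_seq) - 3, 3)]
    if not codons:
        return 0.0
    total = sum(math.log2(usage_per_thousand.get(c, pseudo) / background)
                for c in codons)
    return total / len(codons)


def gff_line(seqname, start, end, score):
    """Format one ORF using the eight mandatory GFF columns (1-based, inclusive)."""
    fields = [seqname, "simple_orf", "CDS", str(start + 1), str(end),
              f"{score:.3f}", "+", "0"]
    return "\t".join(fields)
```

A real submission would also need to handle the reverse strand (for example by running the same scan on the reverse complement and converting coordinates) and to choose a score threshold for reporting genes.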
2. Shotgun sequence overlapper
Write a program that performs pairwise comparisons among the set of sequences
provided as input to a shotgun sequence assembler and reports which sequences overlap.
This project will require you to use a combination of exact matching (e.g. k-mer hashing)
and inexact matching (Smith-Waterman) techniques. Your program must accept as input
a set of sequences in FASTA format and must produce the output in AMOS format (see
http://amos.sourceforge.net - the RED and OVL records).
If working in a team you should also integrate your program into the minimus assembler
(a component of AMOS) and compare its performance with the original implementation.
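One way the combination of exact and inexact matching might fit together is sketched below: a k-mer hash table proposes candidate pairs that share at least one exact k-mer, and a Smith-Waterman score then decides whether to keep each pair. This is a sketch under assumptions, not a prescribed design: the scoring parameters, the values of k and min_score, and all function names are made up for illustration; reverse-complement overlaps are ignored; and a plain local alignment does not check that the match is a true suffix-prefix overlap. Writing AMOS RED and OVL records is left out.

```python
from collections import defaultdict
from itertools import combinations


def candidate_pairs(reads, k):
    """Return pairs of read ids that share at least one exact k-mer."""
    index = defaultdict(set)  # k-mer -> ids of reads containing it
    for rid, seq in reads.items():
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].add(rid)
    pairs = set()
    for ids in index.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


def smith_waterman_score(a, b, match=1, mismatch=-1, gap=-2):
    """Best local alignment score with a linear gap penalty (score only)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        cur = [0]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cur.append(max(0, diag, prev[j] + gap, cur[j - 1] + gap))
        best = max(best, max(cur))
        prev = cur
    return best


def overlaps(reads, k=4, min_score=5):
    """Return sorted (id1, id2, score) for candidate pairs scoring >= min_score."""
    result = []
    for x, y in candidate_pairs(reads, k):
        score = smith_waterman_score(reads[x], reads[y])
        if score >= min_score:
            result.append((x, y, score))
    return sorted(result)
```

The k-mer filter is what keeps the number of quadratic-time alignments manageable: only pairs that already share an exact seed are aligned, rather than every pair of input sequences.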
3. Multiple aligner
Implement a simple multiple alignment program, extending the Smith-Waterman algorithm
implemented in project 1 to support the progressive alignment approach. As input you must
accept a set of protein sequences in FASTA format, then output their multiple alignment in
ClustalW format (http://www.bioperl.org/wiki/ClustalW_multiple_alignment_format).
If working in a team your aligner must use a guide tree, and you need to evaluate the
impact of the tree structure on the multiple alignment (e.g. implement UPGMA and
neighbor-joining and compare the quality of the alignments).
Test data and reference alignments can be obtained from BAliBASE
(http://bips.u-strasbg.fr/fr/Products/Databases/BAliBASE/).
4. RNA folding
Write a program that computes the secondary structure of an RNA molecule. The input to
your program will consist of an RNA sequence in FASTA format, and the output must be
presented in parenthesized form:
Sample RNA
AAAAAAAAAAAAAGGGGGGGUUUUUUUUUUUUUUCCCCCCCCCCCCCCCCC
.............(((((((..............)))))))..........
Note: RNA sequences can be obtained from the NCBI: http://www.ncbi.nlm.nih.gov
If working in a team, your program must take into account base-pairing energies as well as
a stacking term (the energy of a stem depends on the number of stacked bases, in a way
similar to affine gap penalties in Smith-Waterman). Energy values can be found at
http://www.bioinfo.rpi.edu/zukerm/cgi-bin/efiles-3.0.cgi.
Note: simplify the information on this website - just pick two sets of energies for each pair
of bases - one energy if the bases are part of a stem, another if they are not.
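As a possible starting point for the RNA folding project, the sketch below uses a Nussinov-style dynamic program that maximizes the number of base pairs and then traces back to a parenthesized string. The handout does not name an algorithm, so this choice is an assumption, as are the function names, the restriction to A-U and G-C pairs, and the minimum hairpin loop of three unpaired bases. Because it simply maximizes pair count, it will not necessarily reproduce the sample structure shown above.

```python
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G")}  # assumption: no G-U wobble


def pair_table(seq, min_loop=3):
    """dp[i][j] = maximum number of nested base pairs within seq[i..j]."""
    n = len(seq)
    dp = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            best = dp[i][j - 1]  # case 1: base j is unpaired
            for k in range(i, j - min_loop):  # case 2: base j pairs with base k
                if (seq[k], seq[j]) in PAIRS:
                    left = dp[i][k - 1] if k > i else 0
                    best = max(best, left + 1 + dp[k + 1][j - 1])
            dp[i][j] = best
    return dp


def fold(seq, min_loop=3):
    """Return one optimal structure in parenthesized (dot-bracket) form."""
    seq = seq.upper()
    dp = pair_table(seq, min_loop)
    structure = ["."] * len(seq)
    stack = [(0, len(seq) - 1)]
    while stack:
        i, j = stack.pop()
        if j - i <= min_loop:
            continue  # too short to contain a pair
        if dp[i][j] == dp[i][j - 1]:
            stack.append((i, j - 1))  # base j can be left unpaired
            continue
        for k in range(i, j - min_loop):
            if (seq[k], seq[j]) in PAIRS:
                left = dp[i][k - 1] if k > i else 0
                if left + 1 + dp[k + 1][j - 1] == dp[i][j]:
                    structure[k], structure[j] = "(", ")"
                    if k > i:
                        stack.append((i, k - 1))
                    stack.append((k + 1, j - 1))
                    break
    return "".join(structure)
```

For the team version, the `+ 1` per pair could be replaced by a pairing energy, with an extra term when a pair stacks directly on another pair, turning the maximization into an energy minimization; that extension is not shown here.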