

An assignment on implementing a hidden Markov model (HMM) to detect G-rich and uniform regions in genomic sequences. The assignment asks students to build an HMM from the given emission probabilities and from transition probabilities derived from an assumed average region length, to implement the Viterbi and forward-backward algorithms, and to write a sequence simulator. Students must write a program that takes a FASTA file as input and outputs the region type or probability for each position in the sequence.
Typology: Assignments


Assignment 3. (Due by 11:59 PM on November 2.)

a. Your code should be easy to compile and run on csil-linux machines.
b. Email your code directory as a tarball (.tar.gz) to the instructor.
c. Give clear instructions in the email about compiling (if necessary) and running the program.
d. Include some sample input files that you tested successfully on.

Suppose that a genomic sequence contains two types of regions: “G-rich” regions and “uniform” regions. The “uniform” regions are drawn from a uniform distribution of the four nucleotides, i.e., P(A) = P(C) = P(G) = P(T) = 0.25. The “G-rich” regions have an abundance of “G”s, i.e., P(G) = 0.4, and P(A) = P(C) = P(T) = 0.2. Your task is to use an HMM for detecting the two types of regions in any given DNA sequence.

(a) You begin with a guess that a contiguous region of any type (i.e., G-rich or uniform) has an average length of 100 bp. This guess will suggest values for the transition probabilities of your HMM. What are the values of these probabilities? (10 points)

(b) Using the emission probability distributions given above and the transition probability distribution from your answer to (a), create an HMM. Implement the Viterbi algorithm for this HMM. Your program should take as input a Fasta file containing a single DNA sequence (of length L < 10,000). Its output should have one line for each position of the input string, and each line has the format: “string_position
0.5 on the probability value. That is, a position with probability >= 0.5 as per the output of (c) is predicted as being a “G-rich” position.) Repeat this 100 times, and compute an average accuracy with each method. Report the average accuracy values. (15 points).
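
As a rough illustration of the pieces described above, the following Python sketch shows log-space Viterbi decoding for the two-state HMM, the 0.5 threshold on per-position probabilities, and a per-position accuracy measure. It is a sketch under stated assumptions, not a reference solution. The emission probabilities come from the assignment text. The transition probabilities (0.99 to stay, 0.01 to switch) are one possible reading of the 100 bp guess, assuming region lengths are geometrically distributed. The equal start probabilities, the function names, and the restriction to uppercase A, C, G, T are also assumptions made for this example. The posterior probabilities passed to `threshold` would come from a forward-backward implementation, which is not shown here.

```python
import math

STATES = ("G-rich", "uniform")

# Emission probabilities as stated in the assignment.
EMIT = {
    "G-rich": {"A": 0.2, "C": 0.2, "G": 0.4, "T": 0.2},
    "uniform": {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
}

# ASSUMPTION: region lengths are geometric with mean 100 bp, so the
# per-position probability of switching region type is 1/100.
SWITCH = 1.0 / 100
TRANS = {s: {t: (1 - SWITCH if s == t else SWITCH) for t in STATES}
         for s in STATES}

# ASSUMPTION: both region types are equally likely at the first position.
START = {s: 0.5 for s in STATES}


def viterbi(seq):
    """Return the most probable state path for seq (uppercase ACGT only)."""
    if not seq:
        return []
    # Log scores avoid numerical underflow on long sequences.
    score = {s: math.log(START[s]) + math.log(EMIT[s][seq[0]]) for s in STATES}
    back = []  # back[i][t] = best predecessor of state t at position i + 1
    for ch in seq[1:]:
        new_score, ptr = {}, {}
        for t in STATES:
            best = max(STATES, key=lambda s: score[s] + math.log(TRANS[s][t]))
            ptr[t] = best
            new_score[t] = (score[best] + math.log(TRANS[best][t])
                            + math.log(EMIT[t][ch]))
        back.append(ptr)
        score = new_score
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    path.reverse()
    return path


def threshold(posteriors, cutoff=0.5):
    """Label a position G-rich when its posterior P(G-rich) >= cutoff."""
    return ["G-rich" if p >= cutoff else "uniform" for p in posteriors]


def accuracy(predicted, truth):
    """Fraction of positions whose predicted label matches the true label."""
    return sum(p == t for p, t in zip(predicted, truth)) / len(truth)
```

With a simulator that records the true region type at each position, `accuracy(viterbi(seq), truth)` and `accuracy(threshold(posteriors), truth)` could be averaged over the 100 repetitions to compare the two methods.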