HMM Assignment for Genomic Sequence Analysis: Detection of G-rich and Uniform Regions

An assignment on implementing a hidden Markov model (HMM) to detect G-rich and uniform regions in genomic sequences. Students build an HMM from the given emission probabilities and from transition probabilities derived from an assumed average region length, implement the Viterbi and Forward-Backward algorithms, and write a sequence simulator. The programs take a FASTA file as input and output the region type or the G-rich probability for each position in the sequence.

Typology: Assignments

Pre 2010

Uploaded on 03/11/2009

Assignment 3. (Due by 11:59 PM on November 2.)
a. Your code should be easy to compile and run on csil-linux machines.
b. Email your code directory as a tarball (.tar.gz) to the instructor.
c. Give clear instructions in the email about compiling (if necessary) and running the program.
d. Include some sample input files that you tested successfully on.
Suppose that a genomic sequence contains two types of regions: “G-rich” regions and
“uniform” regions. The “uniform” regions are drawn from a uniform distribution of the
four nucleotides, i.e., P(A) = P(C) = P(G) = P(T) = 0.25. The “G-rich” regions have an
abundance of “G”s, i.e., P(G) = 0.4, and P(A) = P(C) = P(T) = 0.2. Your task is to use an
HMM for detecting the two types of regions in any given DNA sequence.
(a) You begin with a guess that a contiguous region of any type (i.e., G-rich or
uniform) has an average length of 100 bp. This guess will suggest values for
the transition probabilities of your HMM. What are the values of these
probabilities? (10 points)
(b) Using the emission probability distributions given above and the transition
probability distribution from your answer to (a), create an HMM. Implement
the Viterbi algorithm for this HMM. Your program should take as input a Fasta
file containing a single DNA sequence (of length L < 10,000). Its output
should have one line for each position of the input string, and each line has
the format: “string_position<TAB>region_type”, where string_position takes
the values 1, 2, …, L for successive lines, and region_type takes the values
“G” or “U”. The letter “G” at the ith position means that the ith symbol of the
input string lies in a “G-rich” region, and “U” is defined likewise. The output
should go to standard output. In your implementation, assume that the first
nucleotide is equally likely to be in a “G-rich” or in a “uniform” region. (20
points)
(c) Implement the Forward-Backward algorithm for your HMM, so as to output for
every position i in the input string, the probability that the ith position lies in a
“G-rich” region. The output should have a similar format to that in (b), except
that the “region_type” field is now replaced by a probability, which is the
probability that the ith position is in a “G-rich” region. (20 points)
(d) Implement a “sequence simulator” using your HMM. This is a program that
takes a sequence length (L < 10,000) as input, and “runs” the HMM by using
a pseudo-random generator. It outputs two files: a Fasta file with the
generated DNA sequence, and a “log” file that records whether each
successive symbol was generated from the “G-rich” state or the “uniform”
state. The log file should have the same format as the output file format of (b).
(15 points)
(e) Compare the performance of your solutions for (b) and (c), as follows.
Generate a string of length 5000 using your simulator. Run the methods of (b)
and (c) separately on the Fasta file output by the simulator, and compare
each method’s output to the “log” file that was generated by the simulator. In
this manner, calculate, for each method, what fraction of the positions are
correctly predicted (i.e., accuracy). (For this you will need to convert the
probabilistic output of (c) to a format similar to that of (b), using a threshold of
0.5 on the probability value. That is, a position with probability >= 0.5 as per the output of (c) is predicted as being a “G-rich” position.) Repeat this 100 times, and compute an average accuracy with each method. Report the average accuracy values. (15 points).
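
For part (a), one common reading is that region lengths are geometrically distributed: if the HMM stays in its current state with probability p at each step, the expected length of a region is 1 / (1 - p). That reading is an assumption, since the assignment does not state it. Under it, an average length of 100 bp suggests p = 0.99 for staying and 0.01 for switching. A minimal Python sketch, with hypothetical names:

```python
def transition_matrix(mean_length):
    """Two-state transition table for states "G" (G-rich) and "U" (uniform).

    Assumption: run lengths are geometric, so E[length] = 1 / (1 - stay),
    which gives stay = 1 - 1 / mean_length.
    """
    switch = 1.0 / mean_length
    stay = 1.0 - switch
    return {
        "G": {"G": stay, "U": switch},
        "U": {"G": switch, "U": stay},
    }


TRANS = transition_matrix(100)
```

Each row of the table sums to 1, and both states use the same values because the guess of 100 bp applies to regions of either type.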
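
For part (b), the Viterbi recursion can be sketched as follows. This is one possible outline and not a reference solution. It works in log space to avoid underflow on long sequences, assumes a stay probability of 0.99 (derived from the 100 bp guess under a geometric-length assumption), and assumes the input contains only the symbols A, C, G and T. All names are hypothetical.

```python
import math

STATES = ("G", "U")
EMIT = {
    "G": {"A": 0.2, "C": 0.2, "G": 0.4, "T": 0.2},
    "U": {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
}
STAY = 0.99  # assumption: geometric region lengths with mean 100 bp
TRANS = {s: {t: (STAY if s == t else 1.0 - STAY) for t in STATES} for s in STATES}
START = {"G": 0.5, "U": 0.5}  # first nucleotide equally likely in either region


def viterbi(seq):
    """Return the most probable state path for seq as a string of G/U labels."""
    if not seq:
        return ""
    # Log-probability of the best path ending in each state at position 1.
    score = {s: math.log(START[s]) + math.log(EMIT[s][seq[0]]) for s in STATES}
    back = []  # back[i][t] = best predecessor of state t at position i + 2
    for sym in seq[1:]:
        prev, score, ptr = score, {}, {}
        for t in STATES:
            best = max(STATES, key=lambda s: prev[s] + math.log(TRANS[s][t]))
            ptr[t] = best
            score[t] = prev[best] + math.log(TRANS[best][t]) + math.log(EMIT[t][sym])
        back.append(ptr)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    return "".join(reversed(path))


def format_lines(labels):
    """Render labels as 'string_position<TAB>region_type' lines, 1-based."""
    return "\n".join(f"{i}\t{lab}" for i, lab in enumerate(labels, start=1))
```

Calling print(format_lines(viterbi(sequence))) would write the result to standard output in the requested format. Reading the sequence from the Fasta file is left out of the sketch.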
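
For part (c), the posterior probability of the G-rich state at each position can be computed with the Forward-Backward algorithm. The sketch below rescales the forward and backward values at every position so they do not underflow. Rescaling is one standard choice, and working in log space would be another. As before, the stay probability of 0.99 and all names are assumptions.

```python
STATES = ("G", "U")
EMIT = {
    "G": {"A": 0.2, "C": 0.2, "G": 0.4, "T": 0.2},
    "U": {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
}
STAY = 0.99  # assumption: geometric region lengths with mean 100 bp
TRANS = {s: {t: (STAY if s == t else 1.0 - STAY) for t in STATES} for s in STATES}
START = {"G": 0.5, "U": 0.5}


def posterior_g(seq):
    """Return a list with P(state at position i is G-rich | whole sequence)."""
    n = len(seq)
    if n == 0:
        return []
    # Forward pass; each column is rescaled to sum to 1.
    fwd, scale = [], []
    for i, sym in enumerate(seq):
        if i == 0:
            cur = {s: START[s] * EMIT[s][sym] for s in STATES}
        else:
            cur = {
                t: EMIT[t][sym] * sum(fwd[-1][s] * TRANS[s][t] for s in STATES)
                for t in STATES
            }
        c = sum(cur.values())
        fwd.append({s: cur[s] / c for s in STATES})
        scale.append(c)
    # Backward pass, reusing the forward scaling factors.
    bwd = [None] * n
    bwd[n - 1] = {s: 1.0 for s in STATES}
    for i in range(n - 2, -1, -1):
        bwd[i] = {
            s: sum(TRANS[s][t] * EMIT[t][seq[i + 1]] * bwd[i + 1][t] for t in STATES)
            / scale[i + 1]
            for s in STATES
        }
    # Combine and renormalize at each position to guard against rounding drift.
    result = []
    for i in range(n):
        g = fwd[i]["G"] * bwd[i]["G"]
        u = fwd[i]["U"] * bwd[i]["U"]
        result.append(g / (g + u))
    return result
```

Each returned value would be printed next to its 1-based position, in the same tab-separated layout as part (b).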
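
For parts (d) and (e), a simulator and an accuracy check might look like the sketch below. It is illustrative only: the stay probability of 0.99, the Fasta header text, and all function names are assumptions, and the predictors from (b) and (c) are expected to be supplied separately.

```python
import random

STATES = ("G", "U")
EMIT = {
    "G": {"A": 0.2, "C": 0.2, "G": 0.4, "T": 0.2},
    "U": {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
}
STAY = 0.99  # assumption: geometric region lengths with mean 100 bp
BASES = "ACGT"


def simulate(length, rng):
    """Run the HMM for `length` steps; return (sequence, state labels)."""
    state = rng.choice(STATES)  # first nucleotide equally likely in either region
    seq, states = [], []
    for _ in range(length):
        weights = [EMIT[state][b] for b in BASES]
        seq.append(rng.choices(BASES, weights=weights)[0])
        states.append(state)
        if rng.random() >= STAY:  # switch region with probability 1 - STAY
            state = "U" if state == "G" else "G"
    return "".join(seq), "".join(states)


def write_outputs(seq, states, fasta_path, log_path):
    """Write the Fasta file and the log file in the format of part (b)."""
    with open(fasta_path, "w") as fasta:
        fasta.write(">simulated\n" + seq + "\n")  # header text is hypothetical
    with open(log_path, "w") as log:
        for i, state in enumerate(states, start=1):
            log.write(f"{i}\t{state}\n")


def threshold(probs, cutoff=0.5):
    """Convert G-rich probabilities to G/U labels; >= cutoff means G-rich."""
    return "".join("G" if p >= cutoff else "U" for p in probs)


def accuracy(predicted, truth):
    """Fraction of positions where the predicted label matches the log."""
    return sum(p == t for p, t in zip(predicted, truth)) / len(truth)
```

For the comparison in (e), one would call simulate(5000, random.Random()) 100 times, score the Viterbi labels and the thresholded Forward-Backward labels against the simulated states with accuracy, and average each set of 100 values.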