HMM Assignment for Genomic Sequence Analysis: Detection of G-rich and Uniform Regions

Assignment 3. (Due by 11:59 PM on November 2.)

a. Your code should be easy to compile and run on csil-linux machines.

b. Email your code directory as a tarball (.tar.gz) to the instructor.

c. Give clear instructions in the email about compiling (if necessary) and running the program.

d. Include some sample input files that you tested successfully on.

Suppose that a genomic sequence contains two types of regions: “G-rich” regions and

“uniform” regions. The “uniform” regions are drawn from a uniform distribution of the

four nucleotides, i.e., P(A) = P(C) = P(G) = P(T) = 0.25. The “G-rich” regions have an

abundance of “G”s, i.e., P(G) = 0.4, and P(A) = P(C) = P(T) = 0.2. Your task is to use an

HMM for detecting the two types of regions in any given DNA sequence.

(a) You begin with a guess that a contiguous region of any type (i.e., G-rich or

uniform) has an average length of 100 bp. This guess will suggest values for

the transition probabilities of your HMM. What are the values of these

probabilities? (10 points)
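
One common way to read part (a), offered here only as a sketch: in an HMM, the number of consecutive positions spent in a state with self-transition probability p follows a geometric distribution with mean 1/(1 - p). Setting that mean to 100 gives a switch probability of 1/100. The function name `transition_matrix` and the state order (G-rich, uniform) are illustrative choices, not part of the assignment.

```python
def transition_matrix(mean_length):
    """Two-state transition matrix, assuming geometric region lengths.

    Assumption: a region's length is geometric with mean 1 / (1 - stay),
    so the per-position probability of leaving the state is 1 / mean_length.
    """
    switch = 1.0 / mean_length  # probability of moving to the other state
    stay = 1.0 - switch         # probability of remaining in the same state
    # Rows are the current state, columns the next state,
    # in the (assumed) order: G-rich, uniform.
    return [[stay, switch], [switch, stay]]


matrix = transition_matrix(100)
for row in matrix:
    print("\t".join(f"{value:.2f}" for value in row))
```

Both region types share the same average length in the problem statement, which is why the matrix is symmetric under this reading.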

(b) Using the emission probability distributions given above and the transition

probability distribution from your answer to (a), create an HMM. Implement

the Viterbi algorithm for this HMM. Your program should take as input a Fasta

file containing a single DNA sequence (of length L < 10,000). Its output

should have one line for each position of the input string, and each line has

the format: “string_position<TAB>region_type”, where string_position takes

the values 1, 2, …, L for successive lines, and region_type takes the values

“G” or “U”. The letter “G” at the ith position means that the ith symbol of the

input string lies in a “G-rich” region, and “U” is defined likewise. The output

should go to standard output. In your implementation, assume that the first

nucleotide is equally likely to be in a “G-rich” or in a “uniform” region. (20

points)
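
A minimal sketch of how part (b) could be approached, not a reference solution. It works in log space to avoid underflow on long sequences. The stay probability of 0.99 is an assumption carried over from a geometric-length reading of part (a); the names `read_fasta` and `viterbi` are illustrative, and the sketch assumes the sequence contains only A, C, G and T.

```python
import math
import sys

STATES = "GU"  # G = G-rich, U = uniform
EMIT = {
    "G": {"A": 0.2, "C": 0.2, "G": 0.4, "T": 0.2},
    "U": {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
}
STAY = 0.99  # assumed self-transition probability (mean region length 100)


def read_fasta(path):
    """Return the single sequence in a Fasta file, skipping header lines."""
    with open(path) as handle:
        return "".join(
            line.strip() for line in handle if not line.startswith(">")
        ).upper()


def viterbi(seq):
    """Most probable state path for seq, as a string over 'G' and 'U'."""
    log_trans = {
        (a, b): math.log(STAY if a == b else 1.0 - STAY)
        for a in STATES
        for b in STATES
    }
    # The first nucleotide is equally likely to come from either state.
    score = {s: math.log(0.5) + math.log(EMIT[s][seq[0]]) for s in STATES}
    back = []  # back[i][s] = best predecessor of state s at position i + 1
    for symbol in seq[1:]:
        new_score, pointers = {}, {}
        for s in STATES:
            prev = max(STATES, key=lambda p: score[p] + log_trans[(p, s)])
            pointers[s] = prev
            new_score[s] = (
                score[prev] + log_trans[(prev, s)] + math.log(EMIT[s][symbol])
            )
        back.append(pointers)
        score = new_score
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return "".join(reversed(path))


if __name__ == "__main__" and len(sys.argv) == 2:
    for position, label in enumerate(viterbi(read_fasta(sys.argv[1])), start=1):
        print(f"{position}\t{label}")
```

Run as a script with a Fasta path as its only argument, this prints one "position<TAB>label" line per nucleotide to standard output.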

(c) Implement the Forward-Backward algorithm for your HMM, so as to output, for

every position i in the input string, the probability that the ith position lies in a

“G-rich” region. The output should have a similar format to that in (b), except

that the “region_type” field is now replaced by a probability, which is the

probability that the ith position is in a “G-rich” region. (20 points)
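
The posterior computation asked for in (c) can be sketched as follows. This version rescales the forward and backward values at every position so that products of small probabilities do not underflow; it is one possible layout rather than the required one. The stay probability of 0.99 and the name `posterior_g` are assumptions, and the input is assumed to contain only A, C, G and T.

```python
STATES = "GU"  # G = G-rich, U = uniform
EMIT = {
    "G": {"A": 0.2, "C": 0.2, "G": 0.4, "T": 0.2},
    "U": {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
}
STAY = 0.99  # assumed self-transition probability


def trans(a, b):
    """Transition probability from state a to state b."""
    return STAY if a == b else 1.0 - STAY


def posterior_g(seq):
    """P(position i is in the G-rich state | whole sequence), for every i."""
    n = len(seq)
    # Forward pass; each row is normalised and its scale factor is stored.
    fwd, scales, prev = [], [], None
    for symbol in seq:
        row = {}
        for s in STATES:
            if prev is None:
                prior = 0.5  # both states equally likely at position 1
            else:
                prior = sum(prev[p] * trans(p, s) for p in STATES)
            row[s] = prior * EMIT[s][symbol]
        scale = sum(row.values())
        row = {s: value / scale for s, value in row.items()}
        fwd.append(row)
        scales.append(scale)
        prev = row
    # Backward pass, divided by the same scale factors.
    bwd = [None] * n
    bwd[n - 1] = {s: 1.0 for s in STATES}
    for i in range(n - 2, -1, -1):
        bwd[i] = {
            s: sum(
                trans(s, t) * EMIT[t][seq[i + 1]] * bwd[i + 1][t]
                for t in STATES
            ) / scales[i + 1]
            for s in STATES
        }
    # With this scaling, fwd * bwd is already the posterior probability.
    return [fwd[i]["G"] * bwd[i]["G"] for i in range(n)]


for position, prob in enumerate(posterior_g("GGGGACGTGG"), start=1):
    print(f"{position}\t{prob:.4f}")
```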

(d) Implement a “sequence simulator” using your HMM. This is a program that

takes a sequence length (L < 10,000) as input, and “runs” the HMM by using

a pseudo-random generator. It outputs two files: a Fasta file with the

generated DNA sequence, and a “log” file that records whether each

successive symbol was generated from the “G-rich” state or the “uniform”

state. The log file should have the same format as the output file format of (b).

(15 points)
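
A possible shape for the simulator in (d), shown as a sketch under stated assumptions: the stay probability of 0.99, the Fasta header text, the 60-column line wrap, and the function names `simulate` and `write_outputs` are all illustrative choices that the assignment does not specify. The starting state is drawn with equal probability, matching the assumption given in (b).

```python
import random

EMIT = {
    "G": {"A": 0.2, "C": 0.2, "G": 0.4, "T": 0.2},
    "U": {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
}
STAY = 0.99  # assumed self-transition probability


def simulate(length, seed=None):
    """Run the HMM for `length` steps; return (sequence, state_log)."""
    rng = random.Random(seed)
    state = rng.choice("GU")  # first state: equally likely G-rich or uniform
    symbols, states = [], []
    for i in range(length):
        # After the first position, switch state with probability 1 - STAY.
        if i > 0 and rng.random() >= STAY:
            state = "U" if state == "G" else "G"
        weights = [EMIT[state][nucleotide] for nucleotide in "ACGT"]
        symbols.append(rng.choices("ACGT", weights=weights)[0])
        states.append(state)
    return "".join(symbols), "".join(states)


def write_outputs(seq, states, fasta_path, log_path):
    """Write the sequence as Fasta and the states as position<TAB>label."""
    with open(fasta_path, "w") as fasta:
        fasta.write(">simulated\n")  # header text is an arbitrary choice
        for start in range(0, len(seq), 60):
            fasta.write(seq[start:start + 60] + "\n")
    with open(log_path, "w") as log:
        for position, label in enumerate(states, start=1):
            log.write(f"{position}\t{label}\n")


sequence, state_log = simulate(20, seed=0)
print(sequence)
print(state_log)
```

Passing a seed makes a run reproducible, which helps when comparing predictions against the log file.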

(e) Compare the performance of your solutions for (b) and (c), as follows.

Generate a string of length 5000 using your simulator. Run the methods of (b)

and (c) separately on the Fasta file output by the simulator, and compare

each method’s output to the “log” file that was generated by the simulator. In

this manner, calculate, for each method, what fraction of the positions are

correctly predicted (i.e., accuracy). (For this you will need to convert the

probabilistic output of (c) to a format similar to that of (b), using a threshold of

0.5 on the probability value. That is, a position with probability >= 0.5 as per the output of (c) is predicted as being a “G-rich” position.) Repeat this 100 times, and compute an average accuracy with each method. Report the average accuracy values. (15 points).
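
The comparison described in (e) could be organised along these lines. This is only a sketch: `simulate` and `predict` are hypothetical callables supplied by the caller (for example a simulator returning a sequence with its true state string, and a predictor returning a string of "G" and "U" labels), and the helper names are illustrative.

```python
def threshold(probabilities, cutoff=0.5):
    """Convert G-rich posteriors to labels: 'G' if p >= cutoff, else 'U'."""
    return "".join("G" if p >= cutoff else "U" for p in probabilities)


def accuracy(predicted, truth):
    """Fraction of positions where the predicted label matches the truth."""
    if len(predicted) != len(truth):
        raise ValueError("predicted and truth must have the same length")
    matches = sum(p == t for p, t in zip(predicted, truth))
    return matches / len(truth)


def average_accuracy(simulate, predict, runs=100, length=5000):
    """Mean accuracy of `predict` over `runs` freshly simulated sequences.

    simulate(length) -> (sequence, true_states)   # hypothetical callable
    predict(sequence) -> predicted_states         # hypothetical callable
    """
    total = 0.0
    for _ in range(runs):
        sequence, truth = simulate(length)
        total += accuracy(predict(sequence), truth)
    return total / runs


print(threshold([0.5, 0.49, 0.9]))
print(accuracy("GGUU", "GUUU"))
```

For the Forward-Backward method, the predictor would wrap the posterior computation in `threshold` so that both methods are scored on the same kind of label string.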
