Probabilistic Models in Biology: Understanding Probabilities and Markov Chains, Study notes of Mathematics

These notes cover probabilistic models in biology, focusing on probabilities, probabilistic models of proteins and nucleic acids, and Markov chains. They define probabilities, conditional probability, and independence of events, and introduce Markov chains as a statistical tool for modeling sequences. They also give examples of applying these concepts to biological sequences, such as CpG islands in DNA.

Typology: Study notes

Pre 2010

Uploaded on 07/31/2009

koofers-user-pen


BIOL 495S/ CS 490B/ MATH 490B/ STAT 490B

Spring 2002, Jan. 14 and 16 lectures

Probabilities and probabilistic models

Reading: S. M. Ross, “Introduction to Probability Models”, 7th ed., Chapter 1.

Reference: R. Durbin, S. Eddy, A. Krogh, and G. Mitchison, “Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids”, 1998, Sections 1.3 and 3.1.

Probabilities

Let us consider a very simple example. A familiar probabilistic system with a set of discrete outcomes is the roll of a six-sided die. To define probabilities, we must first define the space of possible events. In this example, there are six events for a roll of a die, faces 1 through 6. Following the notation of the reading (Ross), $S = \{1, 2, 3, 4, 5, 6\}$. A model of a roll of a die (possibly loaded) would have six parameters $p_1, \ldots, p_6$; the probability of rolling $i$ is $p_i$. To be probabilities, the parameters must satisfy the conditions that $p_i \ge 0$ and

$$\sum_{i=1}^{6} p_i = 1.$$

For example, suppose that all six numbers are equally likely to appear (a fair die); then we will have

$$p_1 = p_2 = p_3 = p_4 = p_5 = p_6 = \frac{1}{6}.$$
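The two conditions on the parameters can be checked mechanically. The Python sketch below is only an illustration: the function name `is_valid_distribution` and the loaded-die numbers are assumptions introduced here, not part of the lecture.

```python
from fractions import Fraction


def is_valid_distribution(p):
    """Return True if every p_i >= 0 and the p_i sum to exactly 1."""
    return all(pi >= 0 for pi in p.values()) and sum(p.values()) == 1


# Fair die: p_1 = ... = p_6 = 1/6.
fair_die = {face: Fraction(1, 6) for face in range(1, 7)}

# A hypothetical loaded die (numbers invented for illustration only).
loaded_die = {1: Fraction(1, 10), 2: Fraction(1, 10), 3: Fraction(1, 10),
              4: Fraction(1, 10), 5: Fraction(1, 10), 6: Fraction(1, 2)}

print(is_valid_distribution(fair_die))    # True
print(is_valid_distribution(loaded_die))  # True
```

Exact fractions are used so that the sum-to-one test is not disturbed by floating-point rounding.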

Another example closer to our biological subject matter is to define probabilities of amino acids or nucleotides. For instance, at a given position of a protein sequence, we assume that amino acid $a$ occurs at random with probability $q_a$. In other words, the amino acid at this position can be any one of the twenty types, and each type $a$ occurs with probability $q_a$. Each probability $q_a$ is a number between 0 and 1, and the sum of all twenty probabilities equals 1.

Conditional probabilities and independence

Suppose that we toss two dice. There are 36 possible outcomes in total, combining the possible numbers of the first and second toss. Suppose that each of the 36 possible outcomes is equally likely to occur and hence has probability 1/36. If we observe that the first die is a four, then given this information, what is the probability that the sum of the two dice equals six? Given that the first die is a four, there are only six possible outcomes of our experiment, namely (4,1), (4,2), (4,3), (4,4), (4,5), and (4,6). Since each of these outcomes originally had the same probability of occurring, they should still have equal probabilities. That is, given that the first die is a four, the (conditional) probability of each of the outcomes (4,1), (4,2), (4,3), (4,4), (4,5), and (4,6) is 1/6, while the (conditional) probability of the other 30 points is 0. Only the outcome (4,2) gives a sum of six; hence, the desired probability is 1/6.

A conditional probability is the probability that one event will occur given that we already know that some other event has occurred. If we let $E$ and $F$ denote respectively the event that the sum of the dice is six and the event that the first die is a four, then the probability just obtained is the conditional probability that $E$ occurs given that $F$ has occurred, and is denoted by $P(E \mid F)$.

A general formula for $P(E \mid F)$, which is valid for all events $E$ and $F$ when $P(F) > 0$, is

$$P(E \mid F) = \frac{P(EF)}{P(F)} \qquad (1)$$

where $P(EF)$ is the probability of the intersection of $E$ and $F$, that is, the probability that both $E$ and $F$ occur. Back to the previous example, $P(EF)$ is the probability that the first die is a four and the sum of the two dice equals six, and $P(F)$ is the probability that the first die is a four. Therefore,

$$P(E \mid F) = \frac{P(EF)}{P(F)} = \frac{P(\text{two dice are } (4,2))}{P(F)} = \frac{1/36}{1/6} = \frac{1}{6}.$$
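Equation (1) can be verified for the two-dice example by enumerating the 36 outcomes. This is a minimal sketch; the helper `prob` and the predicate names `E` and `F` are chosen here to mirror the notation of the notes.

```python
from fractions import Fraction
from itertools import product

# Sample space: all 36 ordered pairs (first die, second die), equally likely.
outcomes = list(product(range(1, 7), repeat=2))


def prob(event):
    """Probability of an event, given as a predicate on a single outcome."""
    favourable = sum(1 for o in outcomes if event(o))
    return Fraction(favourable, len(outcomes))


def E(o):
    """The sum of the two dice is six."""
    return o[0] + o[1] == 6


def F(o):
    """The first die is a four."""
    return o[0] == 4


p_EF = prob(lambda o: E(o) and F(o))  # only (4, 2) qualifies
p_F = prob(F)
p_E_given_F = p_EF / p_F              # Equation (1)

print(p_EF, p_F, p_E_given_F)  # 1/36 1/6 1/6
```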

Using the definition of conditional probability, we can also write the probability that both $E$ and $F$ occur as $P(EF) = P(E \mid F)\,P(F)$.

Similarly, we have $P(EF) = P(F \mid E)\,P(E)$, if $P(E) > 0$.

Independence

Two events $E$ and $F$ are said to be independent if $P(EF) = P(E)\,P(F)$.

By Equation (1) this implies that $E$ and $F$ are independent if $P(E \mid F) = P(E)$.

That is, $E$ and $F$ are independent if knowledge that $F$ has occurred does not affect the probability that $E$ occurs. Two events $E$ and $F$ that are not independent are said to be dependent. As we will see below, some models for nucleotide sequences (or protein sequences) assume that a nucleotide (or amino acid) occurs independently of the other residues in the sequence. Under such a model, the probability of the sequence is the product of the probabilities of the residues in the sequence.
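The definition can be tested directly on the two-dice sample space. In the sketch below, the event "the sum is seven" is an extra example added here for contrast and does not appear in the notes; the helper names are likewise assumptions.

```python
from fractions import Fraction
from itertools import product

# All 36 equally likely outcomes of tossing two dice.
outcomes = list(product(range(1, 7), repeat=2))


def prob(event):
    """Probability of an event, given as a predicate on a single outcome."""
    return Fraction(sum(1 for o in outcomes if event(o)), len(outcomes))


def independent(e, f):
    """True if P(EF) == P(E) P(F)."""
    return prob(lambda o: e(o) and f(o)) == prob(e) * prob(f)


def first_is_four(o):
    return o[0] == 4


def sum_is_six(o):
    return o[0] + o[1] == 6


def sum_is_seven(o):
    return o[0] + o[1] == 7


print(independent(sum_is_six, first_is_four))    # False: 1/36 != 5/36 * 1/6
print(independent(sum_is_seven, first_is_four))  # True:  1/36 == 1/6 * 1/6
```

The sum being six depends on the first die (a first roll of six makes it impossible), whereas a sum of seven can be reached from any first roll with the same probability.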

Probabilistic models of sequences

When we talk about a model, we normally mean a system that simulates the object under consideration. A probabilistic model is one that produces different outcomes with different probabilities. A probabilistic model can therefore simulate a whole class of objects, assigning each an associated probability. In our case the objects will normally be sequences, and a model might describe a family of related sequences. A model of a sequence of three consecutive rolls of a die might be that they were all independent, so that the probability of the sequence [1,6,3] would be the product of the individual probabilities, $p_1 p_6 p_3$.
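As a small illustration of the independent-rolls model just described, the sketch below multiplies the per-symbol probabilities of a sequence. The function name `sequence_probability` is an assumption made here, and a fair die is assumed for the numbers; the same function would apply to residues with probabilities $q_a$.

```python
from fractions import Fraction
from math import prod  # available in Python 3.8 and later


def sequence_probability(seq, p):
    """Probability of seq when each symbol is drawn independently from p."""
    return prod(p[s] for s in seq)


# Assumed for illustration: a fair die, p_i = 1/6 for every face.
fair_die = {face: Fraction(1, 6) for face in range(1, 7)}

print(sequence_probability([1, 6, 3], fair_die))  # 1/216
```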

What sort of probabilistic model might we use for CpG island regions? We know that dinucleotides are important. We therefore want a model that generates sequences in which the probability of a symbol depends on the previous symbol. The simplest such model is a classical Markov chain. We can show a Markov chain graphically as a collection of ‘states’, each of which corresponds to a particular residue, with arrows between the states. A Markov chain for DNA can be drawn like this:

[Figure: a Markov chain with four states, A, C, G, and T, connected by arrows.]

where there is a state for each of the four letters A, C, G, and T in the DNA alphabet. A probability parameter is associated with each arrow in the figure, which determines the probability of a certain residue following another residue, or one state following another state. These probability parameters are called the transition probabilities, which we will write $p_{st}$:

$$p_{st} = P(x_i = t \mid x_{i-1} = s),$$

where $s$ and $t$ each indicate one of the four nucleotides.
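To make the generative reading of the transition probabilities concrete, here is a minimal sketch that samples a DNA sequence from a Markov chain. The transition values in `P` are invented purely for illustration, and the function name `generate` and the choice of first symbol are assumptions made here.

```python
import random

# Hypothetical transition probabilities p_st (outer key: s, inner key: t).
# These numbers are invented for illustration; each row sums to one.
P = {
    "A": {"A": 0.4, "C": 0.2, "G": 0.2, "T": 0.2},
    "C": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
    "G": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
    "T": {"A": 0.2, "C": 0.2, "G": 0.2, "T": 0.4},
}


def generate(length, first, transitions, rng):
    """Generate a sequence with x_1 = first and x_i drawn from row x_{i-1}."""
    seq = [first]
    for _ in range(length - 1):
        row = transitions[seq[-1]]
        nxt = rng.choices(list(row), weights=list(row.values()))[0]
        seq.append(nxt)
    return "".join(seq)


rng = random.Random(0)  # fixed seed so that a run is repeatable
print(generate(20, "A", P, rng))
```

Each new symbol is drawn using only the row of the previous symbol, which is exactly the dependence expressed by $p_{st}$.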

Markov models for the CpG island example are illustrated below. From a set of human DNA sequences we extracted a total of 48 putative CpG islands and derived two Markov chain models, one for the regions labeled as CpG islands (the ‘+’ model) and the other from the remainder of the sequence (the ‘-’ model). The transition probabilities for each model were set using the equation

$$p^{+}_{st} = \frac{c^{+}_{st}}{\sum_{t'} c^{+}_{st'}}$$

and its analogue for $p^{-}_{st}$, where $c^{+}_{st}$ is the number of times letter $t$ followed letter $s$ in the island regions. These are the maximum likelihood estimators for the transition probabilities. The resulting tables are
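The maximum likelihood estimate above amounts to counting adjacent pairs and normalizing each row. The sketch below shows this on toy data; the function name `estimate_transitions` and the two toy sequences are assumptions made for illustration, not the 48 islands used in the notes.

```python
from collections import Counter


def estimate_transitions(sequences, alphabet="ACGT"):
    """Maximum likelihood estimate p_st = c_st / sum over t' of c_st'."""
    counts = Counter()
    for seq in sequences:
        counts.update(zip(seq, seq[1:]))  # adjacent pairs (s, t)
    p = {}
    for s in alphabet:
        row_total = sum(counts[(s, t)] for t in alphabet)
        if row_total == 0:
            continue  # s is never followed by anything; its row is undefined
        p[s] = {t: counts[(s, t)] / row_total for t in alphabet}
    return p


# Toy training data, invented for illustration.
toy = ["ACGCG", "CGCGT"]
p_hat = estimate_transitions(toy)
print(p_hat["G"])  # C follows G twice and T follows G once
```

Training the '+' model would use only the island regions and the '-' model only the remainder; with little data, some counts are zero and the corresponding estimates are zero as well.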

+      A       C       G       T
A    0.180   0.274   0.426   0.120
C    0.171   0.368   0.274   0.188
G    0.161   0.339   0.375   0.125
T    0.079   0.355   0.384   0.182

-      A       C       G       T
A    0.300   0.205   0.285   0.
C    0.322   0.298   0.078   0.
G    0.248   0.246   0.298   0.
T    0.177   0.239   0.292   0.

where the first row in each table contains the frequencies with which an A is followed by each of the four bases, and so on for the other rows, so each row sums to one. The entries within a row are not all equal; for example, G following A is much more common than T following A. Notice also that the tables are asymmetric. In both tables the probability of G following C is lower than that of C following G. However, the transition probabilities C→G and G→C are both higher in the island model than in the non-island model, C→G markedly so (0.274 versus 0.078).
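The observations about the tables can be checked with a few lines of arithmetic. This sketch uses the '+' table as given and only the two '-' entries needed for the comparison; the variable names are assumptions made here.

```python
# '+' model transition probabilities, copied from the table above.
plus = {
    "A": {"A": 0.180, "C": 0.274, "G": 0.426, "T": 0.120},
    "C": {"A": 0.171, "C": 0.368, "G": 0.274, "T": 0.188},
    "G": {"A": 0.161, "C": 0.339, "G": 0.375, "T": 0.125},
    "T": {"A": 0.079, "C": 0.355, "G": 0.384, "T": 0.182},
}
# The two '-' model entries needed here: C followed by G, G followed by C.
minus_CG, minus_GC = 0.078, 0.246

# Each row sums to one, up to rounding of the three-decimal entries.
for s, row in plus.items():
    assert abs(sum(row.values()) - 1.0) < 0.002, s

# Asymmetry: G following C is less likely than C following G in both models.
print(plus["C"]["G"] < plus["G"]["C"], minus_CG < minus_GC)  # True True

# Ratio of island to non-island transition probabilities.
print(round(plus["C"]["G"] / minus_CG, 2))  # 3.51
print(round(plus["G"]["C"] / minus_GC, 2))  # 1.38
```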

The different transition probabilities allow CpG islands to be distinguished from other regions. For any probabilistic model of sequences we can write the probability of the sequence, denoted by $x = (x_1, x_2, \ldots, x_{L-1}, x_L)$, as

$$P(x) = P(x_L, x_{L-1}, \ldots, x_1) = P(x_L \mid x_{L-1}, \ldots, x_1)\, P(x_{L-1} \mid x_{L-2}, \ldots, x_1) \cdots P(x_2 \mid x_1)\, P(x_1) \qquad (2)$$

by applying $P(EF) = P(F \mid E)\,P(E)$ many times. The key property of a Markov chain is that the probability of each symbol $x_i$ depends only on the value of the preceding symbol $x_{i-1}$, not on the entire previous sequence, i.e.

$$P(x_i \mid x_1, \ldots, x_{i-1}) = P(x_i \mid x_{i-1}) = p_{x_{i-1} x_i}.$$

Equation (2) therefore becomes

$$\begin{aligned}
P(x) &= P(x_L, x_{L-1}, \ldots, x_1) \\
&= P(x_L \mid x_{L-1}, \ldots, x_1)\, P(x_{L-1} \mid x_{L-2}, \ldots, x_1) \cdots P(x_2 \mid x_1)\, P(x_1) \\
&= P(x_L \mid x_{L-1})\, P(x_{L-1} \mid x_{L-2}) \cdots P(x_2 \mid x_1)\, P(x_1) \qquad (3)
\end{aligned}$$
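Equation (3) translates directly into a loop over adjacent symbols. The sketch below works with logarithms to avoid numerical underflow on long sequences. The notes do not specify $P(x_1)$, so a uniform initial distribution is assumed here; the names `log_probability` and `uniform_start` are likewise assumptions.

```python
import math

# '+' model transition probabilities from the table in these notes.
PLUS = {
    "A": {"A": 0.180, "C": 0.274, "G": 0.426, "T": 0.120},
    "C": {"A": 0.171, "C": 0.368, "G": 0.274, "T": 0.188},
    "G": {"A": 0.161, "C": 0.339, "G": 0.375, "T": 0.125},
    "T": {"A": 0.079, "C": 0.355, "G": 0.384, "T": 0.182},
}


def log_probability(x, transitions, initial):
    """log P(x) = log P(x_1) + sum over i >= 2 of log p_{x_{i-1} x_i}."""
    logp = math.log(initial[x[0]])
    for s, t in zip(x, x[1:]):
        logp += math.log(transitions[s][t])
    return logp


# Assumption: P(x_1) is uniform over the four bases.
uniform_start = {base: 0.25 for base in "ACGT"}

p = math.exp(log_probability("CGCG", PLUS, uniform_start))
print(p)  # equals 0.25 * p_CG * p_GC * p_CG under the '+' model
```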

Although we have derived this equation in the context of CpG islands in DNA sequences, it is in fact the general equation for the probability of a specific sequence from any Markov chain. Equation (3) is used to calculate the likelihood of a sequence. For instance, we use (3) to obtain the likelihood of a sequence under the ‘+’ model and under the ‘-’ model. If the likelihood under the ‘+’ model is much higher than that under the ‘-’ model, we may conclude that the sequence is a CpG island region (see exercise 5).
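One common way to carry out this comparison is to take the logarithm of the likelihood ratio, so that a positive score favors the ‘+’ model. The sketch below makes two assumptions that are not in the notes: $P(x_1)$ is taken to be the same under both models, so it cancels, and the T column of the ‘-’ table, which is truncated in these notes, is filled with values chosen so that each row sums to one. The function name `log_odds` is also an assumption.

```python
import math

# '+' model transition probabilities from the table in these notes.
PLUS = {
    "A": {"A": 0.180, "C": 0.274, "G": 0.426, "T": 0.120},
    "C": {"A": 0.171, "C": 0.368, "G": 0.274, "T": 0.188},
    "G": {"A": 0.161, "C": 0.339, "G": 0.375, "T": 0.125},
    "T": {"A": 0.079, "C": 0.355, "G": 0.384, "T": 0.182},
}

# '-' model. ASSUMPTION: the T column is truncated in the notes; the values
# used here are 1 minus the other three entries of each row.
MINUS = {
    "A": {"A": 0.300, "C": 0.205, "G": 0.285, "T": 0.210},
    "C": {"A": 0.322, "C": 0.298, "G": 0.078, "T": 0.302},
    "G": {"A": 0.248, "C": 0.246, "G": 0.298, "T": 0.208},
    "T": {"A": 0.177, "C": 0.239, "G": 0.292, "T": 0.292},
}


def log_odds(x):
    """log [P(x | +) / P(x | -)], assuming P(x_1) is equal in both models."""
    return sum(math.log(PLUS[s][t] / MINUS[s][t]) for s, t in zip(x, x[1:]))


print(log_odds("CGCG") > 0)  # True: CG-rich, more likely under '+'
print(log_odds("TAAA") < 0)  # True: more likely under '-'
```

How large a score must be before calling a region a CpG island is a separate decision that the notes leave open.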

Exercises Due on Wednesday, Jan. 23rd.

  1. In a DNA sequence set, there are $2 \times 10^4$ nucleotides in total, among which $6 \times 10^3$ are adenine, $4 \times 10^3$ are guanine, and $8 \times 10^3$ are thymine. Use this data set to estimate the probabilities of the four types of nucleotides.

  2. A protein sequence is 500 amino acids in length. Assume amino acids occur independently in the sequence. If the probability of leucine is $q_L = 0.04$, how many leucine residues do we expect to see in this sequence?
  3. Prove Bayes’ theorem, that

$$P(E \mid F) = \frac{P(F \mid E)\,P(E)}{P(F)} = \frac{P(F \mid E)\,P(E)}{P(F \mid E)\,P(E) + P(F \mid E^c)\,P(E^c)},$$

where all the conditional probabilities are assumed to be well defined.

  4. Suppose that the chance of rain tomorrow depends on previous weather conditions only through whether or not it is raining today. Suppose also that if it rains today, then it will rain tomorrow with probability 0.6; and if it does not rain today, then it will rain tomorrow with probability 0.3. Show that the process is a two-state Markov chain. Calculate all the transition probabilities.