



























Readings Strongly recommended:
Optional:
A canonical reference on Markov chains is Norris (1997).
We will begin by discussing Markov chains. In Lectures 2 & 3 we will discuss discrete-time Markov chains, and Lecture 4 will cover continuous-time Markov chains.
We consider a discrete-time, discrete-space stochastic process, which we write as $X(t) = X_t$ for $t = 0, 1, \ldots$. The state space $S$ is discrete, i.e. finite or countable, so we can let it be a set of integers, as in $S = \{1, 2, \ldots, N\}$ or $S = \{1, 2, \ldots\}$.
Definition. The process $X(t) = X_0, X_1, X_2, \ldots$ is a discrete-time Markov chain if it satisfies the Markov property:
$$P(X_{n+1} = s \mid X_0 = x_0, X_1 = x_1, \ldots, X_n = x_n) = P(X_{n+1} = s \mid X_n = x_n). \tag{1}$$
The quantities $P(X_{n+1} = j \mid X_n = i)$ are called the transition probabilities. In general the transition probabilities are functions of $i$, $j$, $n$. It is convenient to write them as
$$p_{ij}(n) = P(X_{n+1} = j \mid X_n = i). \tag{2}$$
Definition. The transition matrix at time $n$ is the matrix $P(n) = (p_{ij}(n))$, i.e. the $(i,j)$th element of $P(n)$ is $p_{ij}(n)$.¹ The transition matrix satisfies:
(i) $p_{ij}(n) \ge 0$ for all $i, j$ (the entries are non-negative)
(ii) $\sum_j p_{ij}(n) = 1$ for all $i$ (the rows sum to 1)
Any matrix that satisfies (i), (ii) above is called a stochastic matrix. Hence, the transition matrix is a stochastic matrix.
Exercise 2.1. Show that the transition probabilities satisfy (i), (ii) above.
Exercise 2.2. Show that if $X(t)$ is a discrete-time Markov chain, then
$$P(X_n = s \mid X_0 = x_0, X_1 = x_1, \ldots, X_m = x_m) = P(X_n = s \mid X_m = x_m)$$
for any $0 \le m < n$. That is, the probabilities at the current time depend only on the most recent known state in the past, even if it is not exactly one step before.
¹ We call it a matrix even if $|S| = \infty$.
Remark. Note that a “stochastic matrix” is not the same thing as a “random matrix”! Usually “random” can be substituted for “stochastic” but not here. A random matrix is a matrix whose entries are random. A stochastic matrix has completely deterministic entries. It probably gets its name because it is used to describe a stochastic phenomenon, but this is an unfortunate accident of history.
Definition. The Markov chain $X(t)$ is time-homogeneous if $P(X_{n+1} = j \mid X_n = i) = P(X_1 = j \mid X_0 = i)$, i.e. the transition probabilities do not depend on the time $n$. If this is the case, we write $p_{ij} = P(X_1 = j \mid X_0 = i)$ for the probability to go from $i$ to $j$ in one step, and $P = (p_{ij})$ for the transition matrix.
We will only consider time-homogeneous Markov chains in this course, though we will occasionally remark on how some results may be generalized to the time-inhomogeneous case.
Examples
$$\begin{array}{c|cc} & \text{sun} & \text{rain} \\ \hline \text{sun} & 0.8 & 0.2 \\ \text{rain} & 0.4 & 0.6 \end{array}$$
This says that if it is sunny today, then the chance it will be sunny tomorrow is 0.8, whereas if it is rainy today, then the chance it will be sunny tomorrow is 0.4. One question you might be interested in is: what is the long-run fraction of sunny days in New York?
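One way to get a feel for the long-run question is to simulate the chain and count sunny days. The sketch below is only illustrative: the function names, the seed, the starting state and the run length are assumptions, not part of the notes. For this matrix the estimate should land near 2/3, which is the value obtained by solving $\pi = \pi P$ by hand.

```python
import random

# Transition probabilities from the weather example: P[today][tomorrow].
P = {
    "sun": {"sun": 0.8, "rain": 0.2},
    "rain": {"sun": 0.4, "rain": 0.6},
}

def step(state, rng):
    """Draw tomorrow's weather given today's, using one uniform random number."""
    u = rng.random()
    cumulative = 0.0
    for nxt, prob in P[state].items():
        cumulative += prob
        if u < cumulative:
            return nxt
    return nxt  # guard against floating-point round-off in the cumulative sum

def sunny_fraction(n_days, seed=0):
    """Simulate n_days of weather (assumed to start sunny) and return the sunny fraction."""
    rng = random.Random(seed)
    state = "sun"
    sunny = 0
    for _ in range(n_days):
        sunny += state == "sun"
        state = step(state, rng)
    return sunny / n_days

estimate = sunny_fraction(100_000)
print(estimate)
```

The starting state matters less and less as the run gets longer, which is the intuition behind the limiting distribution discussed later in these notes.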
$$\begin{array}{c|cc} & \text{heads} & \text{tails} \\ \hline \text{heads} & 0.51 & 0.49 \\ \text{tails} & 0.49 & 0.51 \end{array}$$
This shows that if you throw a Heads on your first toss, there is a very slightly higher chance of throwing heads on your second, and similarly for Tails.
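$$X_n = \sum_{j=1}^{n} \xi_j, \qquad \xi_j = \pm 1 \text{ with probability } \tfrac{1}{2} \text{ each}, \quad \xi_j \text{ i.i.d.}$$
The transition probabilities are
$$p_{i,i+1} = \tfrac{1}{2}, \qquad p_{i,i-1} = \tfrac{1}{2}, \qquad p_{i,j} = 0 \quad (j \ne i \pm 1).$$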
choose an edge uniformly at random from the set of edges leading out of the node, and move along the edge to the node at its other end. Then repeat. If there are $N$ nodes labelled by consecutive integers then this is a Markov chain on state space $S = \{1, 2, \ldots, N\}$. Here are a couple of examples:
The corresponding transition matrices are:
Every Markov chain can be represented as a random walk on a weighted, directed graph. A weighted graph is one where each edge has a positive real number assigned to it, its “weight,” and the random walker chooses an edge from the set of available edges, in proportion to each edge’s weight. In a directed graph each edge also has a direction, and a walker can only move in that direction. Here is an example:
The corresponding transition matrix is:
In fact, such a directed graph forms the foundation for Google’s PageRank algorithm, which has revolutionized internet searches. The earliest and best-known version of PageRank constructs a directed graph of the internet, where nodes are webpages and there is a directed edge from webpage A to webpage B if A contains a link to B. PageRank assumes an internet surfer follows links at random, and ranks pages according to the long-time average fraction of time that the surfer spends on each page.
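The step from a weighted, directed graph to a transition matrix is just a row normalisation of the edge weights. A minimal sketch follows; the three-node graph and its weights are hypothetical (they are not the example pictured in the notes), and the code assumes every node has at least one outgoing edge.

```python
from fractions import Fraction

# Hypothetical weighted, directed graph: weights[u][v] is the weight of edge u -> v.
weights = {
    "A": {"B": 1, "C": 3},
    "B": {"A": 2},
    "C": {"A": 1, "B": 1},
}

def transition_matrix(weights):
    """Row-normalise edge weights: P[u][v] = w(u, v) / (total weight leaving u).

    Assumes every node has at least one outgoing edge, so the total is nonzero.
    """
    nodes = sorted(weights)
    P = {}
    for u in nodes:
        total = sum(weights[u].values())
        P[u] = {v: Fraction(weights[u].get(v, 0), total) for v in nodes}
    return P

P = transition_matrix(weights)
for u, row in P.items():
    print(u, [str(p) for p in row.values()])
```

Each row of the result is a probability distribution over the next node, so the matrix is stochastic in the sense defined earlier.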
$$P(X_1 = \sigma' \mid X_0 = \sigma) = \begin{cases} \frac{1}{52} & \text{if } \sigma' \text{ is obtained by taking an item in } \sigma \text{ and moving it to the top,} \\ 0 & \text{otherwise.} \end{cases}$$
One can also model more complicated shuffles, such as the riffle shuffle. While the state space is enormous ($|S| = 52!$), so you would not want to write down the whole transition matrix, one can still analyze these models using other techniques, from analysis and probability theory. Various authors have proven results about the number of shuffles needed to make the deck “close to random”. For example, it takes seven riffle shuffles to get close to random, but it takes 11 or 12 to get so close that a gambler in a casino cannot exploit the deviations from randomness to win a typical game. See the online essay Austin (line) for an accessible introduction to these ideas, and Aldous and Diaconis (1986) for the mathematical proofs. (I first learned about this phenomenon in the beautiful Proofs from the Book, by Aigner and Ziegler.)
Some other examples of applications that use Markov chains include:
Given a Markov chain with transition probabilities $P$ and initial condition $X_0 = i$, we know how to calculate the probability distribution of $X_1$; indeed, this is given directly by the transition probabilities. The natural question to ask next is: what is the distribution at later times? That is, we would like to know the n-step transition probabilities $P^{(n)}$, defined by
$$P^{(n)}_{ij} = P(X_n = j \mid X_0 = i). \tag{3}$$
For example, for $n = 2$, we have that
$$\begin{aligned} P^{(2)}_{ij} &= \sum_k P(X_2 = j \mid X_1 = k, X_0 = i)\, P(X_1 = k \mid X_0 = i) && \text{Law of Total Probability} \\ &= \sum_k P(X_2 = j \mid X_1 = k)\, P(X_1 = k \mid X_0 = i) && \text{Markov property} \\ &= \sum_k P_{kj} P_{ik} && \text{time-homogeneity} \\ &= (P^2)_{ij}. \end{aligned}$$
That is, the two-step transition matrix is $P^{(2)} = P^2$.
This generalizes:
Theorem. Let $X_0, X_1, \ldots$ be a time-homogeneous Markov chain with transition probabilities $P$. The n-step transition probabilities are $P^{(n)} = P^n$, i.e.
$$P(X_n = j \mid X_0 = i) = (P^n)_{ij}. \tag{4}$$
To make the notation cleaner we will write $(P^n)_{ij} = P^n_{ij}$. Note that this does not equal $(P_{ij})^n$.
Exercise 2.3. Prove this theorem. Hint: use induction.
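As a numerical illustration of the theorem (not a proof), the sketch below computes the two-step transition matrix for the weather example by repeated matrix multiplication, in exact rational arithmetic. The helper names `matmul` and `matpow` are illustrative choices, not notation from the notes.

```python
from fractions import Fraction as F

def matmul(A, B):
    """Product of two matrices stored as lists of rows."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def matpow(P, n):
    """n-th power of a square matrix by repeated multiplication (n >= 0)."""
    size = len(P)
    result = [[F(int(i == j)) for j in range(size)] for i in range(size)]
    for _ in range(n):
        result = matmul(result, P)
    return result

# Weather chain, states ordered (sun, rain).
P = [[F(4, 5), F(1, 5)],
     [F(2, 5), F(3, 5)]]

P2 = matpow(P, 2)
print(P2)

# (P^2)_{00} is the two-step probability sun -> sun; it differs from (P_{00})^2.
print(P2[0][0], P[0][0] ** 2)
```

Here $(P^2)_{\text{sun},\text{sun}} = 18/25$, whereas $(P_{\text{sun},\text{sun}})^2 = 16/25$, which is the distinction the remark on notation warns about.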
A more general equation relating the transition probabilities, that holds even in the time-inhomogeneous case, is:
Chapman-Kolmogorov Equation.
$$P(X_n = j \mid X_0 = i) = \sum_k P(X_n = j \mid X_m = k)\, P(X_m = k \mid X_0 = i). \tag{5}$$
Therefore, if we know the initial probability distribution $\alpha^{(0)}$, then we can find the distribution at any later time using powers of the matrix $P$.
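Concretely, treating the distribution as a row vector, each step multiplies it by $P$ on the right, so that $\alpha^{(n)} = \alpha^{(0)} P^n$. The sketch below applies this to the weather chain; the initial distribution (certainly rainy on day 0) and the function names are assumptions made for illustration.

```python
from fractions import Fraction as F

# Weather chain, states ordered (sun, rain).
P = [[F(4, 5), F(1, 5)],
     [F(2, 5), F(3, 5)]]

def evolve(alpha, P):
    """One step of the forward evolution: returns the row vector alpha P."""
    return [sum(alpha[i] * P[i][j] for i in range(len(P)))
            for j in range(len(P[0]))]

def distribution_at(alpha0, P, n):
    """Distribution after n steps, i.e. alpha0 P^n, built one step at a time."""
    alpha = list(alpha0)
    for _ in range(n):
        alpha = evolve(alpha, P)
    return alpha

alpha0 = [F(0), F(1)]  # assumed start: rain with probability 1
alpha2 = distribution_at(alpha0, P, 2)
alpha50 = distribution_at(alpha0, P, 50)
print(alpha2, float(alpha50[0]))
```

After two steps the distribution is $(14/25, 11/25)$, and after many steps the probability of sun is numerically indistinguishable from 2/3.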
Now consider what happens if we ask for the expected value of some function of the state of the Markov chain, such as $E X_n^2$, $E X_n^3$, $E|X_n|$, etc. Can we derive an evolution equation for this quantity?
Let $f : S \to \mathbb{R}$ be a function defined on state space, and let
$$u^{(n)}_i = E_i f(X_n) = E[f(X_n) \mid X_0 = i]. \tag{8}$$
You should think of $u^{(n)}$ as a column vector; again this is a convention whose convenience will become more transparent later in the course. Then $u^{(n)}$ evolves in time as:
Backward Kolmogorov Equation. (for a time-homogeneous, discrete-time Markov chain)
$$u^{(n+1)} = P u^{(n)}, \qquad u^{(0)}(i) = f(i) \quad \forall i \in S. \tag{9}$$
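Proof. We have
$$\begin{aligned} u^{(n+1)}_i &= \sum_j f(j)\, P(X_{n+1} = j \mid X_0 = i) && \text{definition of expectation} \\ &= \sum_j \sum_k f(j)\, P(X_{n+1} = j \mid X_1 = k, X_0 = i)\, P(X_1 = k \mid X_0 = i) && \text{LoTP} \\ &= \sum_j \sum_k f(j)\, P(X_{n+1} = j \mid X_1 = k)\, P(X_1 = k \mid X_0 = i) && \text{Markov property} \\ &= \sum_j \sum_k f(j)\, P^n_{kj} P_{ik} && \text{time-homogeneity} \\ &= \sum_k \sum_j f(j)\, P^n_{kj} P_{ik} && \text{switch order of summation} \\ &= \sum_k u^{(n)}(k)\, P_{ik} && \text{definition of } u^{(n)} \\ &= (P u^{(n)})_i. \end{aligned}$$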
We can switch the order of summation above, provided we assume that $E_i |f(X_n)| < \infty$ for each $i$ and each $n$.
This proof illustrates a technique sometimes known as first-step analysis, where one conditions on the first step of the Markov chain and uses the Law of Total Probability. Of course, you could also derive this equation more directly from the n-step transition probabilities.
Exercise 2.5. Do this! Derive (9) directly from the formula for the n-step transition probabilities.
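The recursion (9) is easy to run numerically. In the sketch below the test function is an assumption chosen for illustration: $f(\text{sun}) = 1$ and $f(\text{rain}) = 0$ on the weather chain, so that $u^{(n)}_i$ is the probability of sun on day $n$ given the weather $i$ on day 0.

```python
from fractions import Fraction as F

# Weather chain, states ordered (sun, rain).
P = [[F(4, 5), F(1, 5)],
     [F(2, 5), F(3, 5)]]

f = [F(1), F(0)]  # assumed test function: indicator of "sun"

def backward_step(P, u):
    """Return the column vector P u, i.e. (P u)_i = sum_k P_ik u_k."""
    return [sum(P[i][k] * u[k] for k in range(len(u))) for i in range(len(P))]

u = list(f)          # u^(0) = f
history = [u]
for n in range(2):
    u = backward_step(P, u)   # u^(n+1) = P u^(n)
    history.append(u)

print(history)
```

With this choice of $f$, the vector $u^{(2)} = (18/25, 14/25)$ is the first column of $P^2$, consistent with the n-step transition probabilities.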
Remark. What is so backward about the backward equation? It gets its name from the fact that it can be used to describe how conditional expectations propagate backwards in time. To see this, suppose that instead of (8), which computes the expectation of a function after a certain number of steps has passed, we choose a fixed time $T$ and compute the expectation at that time, given an earlier starting position. That is, for each $n \le T$, define a column vector $u^{(n)}$ with components
$$u^{(n)}_i = E[f(X_T) \mid X_n = i]. \tag{10}$$
Such a quantity is studied a lot in financial applications, where, say, $X_n$ is the price of a stock at time $n$, $f$ is a value function representing the value of an option to sell, $T$ might be a time at which you decide (in advance) to sell a stock, and quantities of the form (10) above would represent your expected payout, conditional on being in state $i$ at time $n$. Then, the vector $u^{(n)}$ evolves according to
$$u^{(n)} = P u^{(n+1)}, \qquad u^{(T)}_i = f(i) \quad \forall i \in S. \tag{11}$$
Therefore you find $u^{(n)}$ by evolving it backwards in time: you are given a final condition at time $T$, and you can solve for $u^{(n)}$ at all earlier times $n \le T$.
Interestingly, (11) holds even when the chain is not time-homogeneous, provided that P in (11) is replaced by P(n), the transition probabilities starting at time n. This same statement is not true for (9).
Exercise 2.6. Show (11), and argue it holds even when the Markov chain is not time-homogeneous.
2.2.1 Evolution of the full transition probabilities*
Another approach to the forward/backward equations is to define a function $P(j,t \mid i,s)$ to be the transition probability to be in state $j$ at time $t$, given the system started in state $i$ at time $s$, i.e.
$$P(j,t \mid i,s) = P(X_t = j \mid X_s = i). \tag{12}$$
One can then derive equations for how $P(j,t \mid i,s)$ evolves in $t$ and $s$. For evolution in $t$ (forward in time) we have, from the Chapman-Kolmogorov equations,
$$P(j,t+1 \mid i,s) = \sum_k P(k,t \mid i,s)\, P(j,t+1 \mid k,t). \tag{13}$$
For evolution in $s$ (backward in time) we have, again from the Chapman-Kolmogorov equations,
$$P(j,t \mid i,s) = \sum_k P(j,t \mid k,s+1)\, P(k,s+1 \mid i,s). \tag{14}$$
These are general versions of the forward and backward equations, respectively. They hold regardless of whether the chain is time-homogeneous or not. From them, we can derive the time-inhomogeneous versions of the forward and backward equations (7), (9).
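To derive the time-inhomogeneous forward equation, notice that the probability distribution at time $t$, $\alpha^{(t)}$, has components $\alpha^{(t)}_j = \sum_i P(j,t \mid i,0)\, \alpha^{(0)}_i$. Therefore, multiplying (13) by $\alpha^{(0)}$ on the left (contracting it with index $i$) and letting $s = 0$, we obtain
$$\alpha^{(t+1)}_j = \sum_k \alpha^{(t)}_k\, P(j,t+1 \mid k,t) \quad\Leftrightarrow\quad \alpha^{(t+1)} = \alpha^{(t)} P(t), \tag{15}$$
where $P(t)$ is the transition matrix at time $t$, with entries $p_{kj}(t) = P(j,t+1 \mid k,t)$.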
To derive the time-inhomogeneous backward equation, let $f : S \to \mathbb{R}$, and let $u^{(s)}_i = E[f(X_t) \mid X_s = i]$ (recall (8), (10)). Notice that $u^{(s)}_i = \sum_k f(k)\, P(k,t \mid i,s)$, so multiplying (14) by the column vector $f$ on the right (contracting it with index $j$) gives
$$u^{(s)}_i = \sum_k P(k,s+1 \mid i,s)\, u^{(s+1)}_k \quad\Leftrightarrow\quad u^{(s)} = P(s)\, u^{(s+1)}. \tag{16}$$
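$$\begin{array}{c|cc} n & P(0) & P(1) \\ \hline 0 & 1 & 0 \\ 1 & 0 & 1 \\ 2 & 1 & 0 \\ 3 & 0 & 1 \\ 4 & 1 & 0 \\ 5 & 0 & 1 \\ 6 & 1 & 0 \\ \vdots & \vdots & \vdots \end{array}$$
You can see the pattern. Clearly the distribution doesn’t converge. Yet, if we start with initial distribution $\alpha^{(0)} = (0.5, 0.5)$, then we obtain
$$\begin{array}{c|cc} n & P(0) & P(1) \\ \hline 0 & 0.5 & 0.5 \\ 1 & 0.5 & 0.5 \\ 2 & 0.5 & 0.5 \\ \vdots & \vdots & \vdots \end{array}$$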
The distribution never changes!
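Both tables can be reproduced in a few lines. The transition matrix below is an assumption inferred from the tabulated values (a two-state chain that switches state at every step); the helper names are illustrative.

```python
from fractions import Fraction as F

# Assumed transition matrix, inferred from the tables: the chain flips state each step.
P = [[F(0), F(1)],
     [F(1), F(0)]]

def evolve(alpha, P):
    """One step of the forward evolution: returns the row vector alpha P."""
    return [sum(alpha[i] * P[i][j] for i in range(len(P)))
            for j in range(len(P[0]))]

def trajectory(alpha0, P, n_steps):
    """List of distributions alpha^(0), ..., alpha^(n_steps)."""
    out = [list(alpha0)]
    for _ in range(n_steps):
        out.append(evolve(out[-1], P))
    return out

deterministic = trajectory([F(1), F(0)], P, 6)    # start in state 0 with certainty
uniform = trajectory([F(1, 2), F(1, 2)], P, 6)    # start from (0.5, 0.5)

for n, (a, b) in enumerate(zip(deterministic, uniform)):
    print(n, [str(x) for x in a], [str(x) for x in b])
```

The first trajectory alternates between $(1, 0)$ and $(0, 1)$ forever, while the second stays at $(0.5, 0.5)$ at every step.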
2.3.1 Limiting and stationary distributions
In applications we are often interested in the long-term probability of visiting each state.
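Definition. Consider a time-homogeneous Markov chain with transition matrix $P$. A row vector $\lambda$ is a limiting distribution if $\lambda_i \ge 0$, $\sum_j \lambda_j = 1$ (so that $\lambda$ is a probability distribution), and if, for every $i$,
$$\lim_{n\to\infty} (P^n)_{ij} = \lambda_j \quad \forall j \in S.$$
In other words,
$$P^n \to \begin{pmatrix} \lambda_1 & \lambda_2 & \lambda_3 & \cdots \\ \lambda_1 & \lambda_2 & \lambda_3 & \cdots \\ \lambda_1 & \lambda_2 & \lambda_3 & \cdots \\ \vdots & \vdots & \vdots & \end{pmatrix} \quad \text{as } n \to \infty.$$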
Exercise 2.8. Show that, if $|S| < \infty$, then $\lambda$ is a limiting distribution if and only if $\lim_{n\to\infty} \alpha P^n = \lambda$ for any initial probability distribution $\alpha$.
As we saw in the earlier examples, a limiting distribution doesn’t have to exist. If it exists, it must be unique. What happens if we start the chain in the limiting distribution? Let’s calculate the distribution $\alpha^{(1)}$ at the next step of the chain, assuming initial distribution $\alpha^{(0)} = \lambda$. For simplicity, we will assume a finite state space, $|S| < \infty$, which lets us interchange the sum and the limit in the calculations below. Choose any $i$, and calculate, from (7) (writing $A_{i,\cdot}$ for the row vector corresponding to the $i$th row of matrix $A$):
$$\alpha^{(1)} = \lambda P = \Big( \lim_{n\to\infty} P^n_{i,\cdot} \Big) P = \lim_{n\to\infty} P^{n+1}_{i,\cdot} = \lambda.$$
Therefore if we start the chain in the limiting distribution, its distribution remains there forever. This motivates the following definition:
Definition. Given a Markov chain with transition matrix $P$, a stationary distribution is a probability distribution $\pi$ which satisfies
$$\pi_j = \sum_i \pi_i P_{ij} \quad \forall j. \tag{17}$$
This says that if we start with distribution $\pi$ and run the Markov chain, the distribution will not change. That is why it is called “stationary.” In other words, if $X_0 \sim \pi$, then $X_1 \sim \pi$, $X_2 \sim \pi$, etc.
Remark. Other synonyms you might hear for stationary distribution include invariant measure, invariant distribution, steady-state probability, equilibrium probability or equilibrium distribution (the latter two are from physics).
In applications we want to know the limiting distribution, but it is usually far easier to calculate the stationary distribution, because it is obtained by solving a system of linear equations. Therefore we will restrict our focus to the stationary distribution. Some questions we might ask about π include:
(i) Does it exist?
(ii) Is it unique?
(iii) When is it a limiting distribution, i.e. when does an arbitrary distribution converge to it?
For (iii), we saw that a limiting distribution is a stationary distribution, but the converse is not always true. Indeed, in our second example, you can calculate that a stationary distribution is π = ( 0. 5 , 0. 5 ), but this is not a limiting distribution. What are the conditions that guarantee a stationary distribution is also the limiting distribution?
This is the subject of a rich body of work on the limiting behaviour of Markov chains. We will not go deeply into the results, but will briefly survey a couple of the major theorems.
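To make concrete the earlier remark that the stationary distribution comes from a system of linear equations, the sketch below solves $\pi = \pi P$ together with $\sum_i \pi_i = 1$ by Gauss-Jordan elimination in exact arithmetic. The solver is a generic textbook method chosen here for illustration, not a routine from the notes, and it assumes the stationary distribution is unique. The second matrix is the two-state switching chain assumed to underlie the second example.

```python
from fractions import Fraction as F

def stationary(P):
    """Solve pi P = pi with sum(pi) = 1 (assumes a unique solution exists)."""
    n = len(P)
    # Equation j (for j = 0..n-2): sum_i pi_i (P_ij - delta_ij) = 0.
    # The last balance equation is redundant and is replaced by normalisation.
    A = [[F(P[i][j]) - (1 if i == j else 0) for i in range(n)] + [F(0)]
         for j in range(n - 1)]
    A.append([F(1)] * n + [F(1)])
    for col in range(n):
        # Find a row with a nonzero pivot and swap it into place.
        pivot = next(r for r in range(col, n) if A[r][col] != 0)
        A[col], A[pivot] = A[pivot], A[col]
        A[col] = [x / A[col][col] for x in A[col]]
        for r in range(n):
            if r != col and A[r][col] != 0:
                factor = A[r][col]
                A[r] = [x - factor * y for x, y in zip(A[r], A[col])]
    return [A[i][n] for i in range(n)]

weather = [[F(4, 5), F(1, 5)], [F(2, 5), F(3, 5)]]
switching = [[F(0), F(1)], [F(1), F(0)]]

pi_weather = stationary(weather)
pi_switching = stationary(switching)
print(pi_weather, pi_switching)
```

For the weather chain this gives $\pi = (2/3, 1/3)$, and for the switching chain $\pi = (1/2, 1/2)$, which is stationary even though it is not a limiting distribution.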
2.3.2 A limit theorem or two
Definition. A matrix $A$ is positive if it has all positive entries: $A_{ij} > 0$ for all $i, j$. In these notes we will write $A > 0$ when $A$ is positive.
Remark. This is not the same as being positive-definite!
Definition. A stochastic matrix $P$ is regular if there exists some $s > 0$ such that $P^s$ is positive, i.e. the $s$-step transition probabilities are positive for all $i, j$: $(P^s)_{ij} > 0$ $\forall i, j$.
Remark. Some books call such a matrix primitive. The text Koralov and Sinai (2010) calls it ergodic (when the state space is finite), though usually this word is reserved for something slightly different.
This means that there is a time s such that, no matter where you start, there is a non-zero probability of being at any other state.
Theorem (Ergodic Theorem for Markov Chains, one version). Assume a Markov chain is regular and has a finite state space with size $N$. Then there exists a unique stationary probability distribution $\pi = (\pi_1, \ldots, \pi_N)$, with $\pi_j > 0$ $\forall j$. The n-step transition probabilities converge to $\pi$: that is, $\lim_{n\to\infty} P^n_{ij} = \pi_j$.
There are limit theorems for irreducible chains, with slightly weaker conditions. Irreducible chains also have a unique stationary distribution; this follows from the Perron-Frobenius Theorem (see below). However, it is not true that an arbitrary distribution converges to it; rather, we have that $\mu^{(0)} \bar{P}^{(n)} \to \pi$ as $n \to \infty$, where $\bar{P}^{(n)} = \frac{1}{n} \sum_{k=1}^{n} P^k$. This means that the average distribution converges. We need to form the average, because there may be a built-in periodicity, as in the chain in the second example. In this case $P^{2n} = I$ and $P^{2n+1} = P$, so $\alpha^{(n)}$ oscillates between two distributions, instead of converging to a fixed limit.
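The averaging can be checked directly on the periodic chain. The sketch below forms $\bar{P}^{(n)} = \frac{1}{n}\sum_{k=1}^{n} P^k$ for the two-state switching matrix (assumed, as before, to be the chain of the second example); the helper names are illustrative.

```python
from fractions import Fraction as F

def matmul(A, B):
    """Product of two matrices stored as lists of rows."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def cesaro_average(P, n):
    """Return (1/n) * (P^1 + P^2 + ... + P^n)."""
    size = len(P)
    power = [row[:] for row in P]                     # current power, starting at P^1
    total = [[F(0)] * size for _ in range(size)]
    for _ in range(n):
        total = [[total[i][j] + power[i][j] for j in range(size)]
                 for i in range(size)]
        power = matmul(power, P)
    return [[x / n for x in row] for row in total]

P = [[F(0), F(1)],
     [F(1), F(0)]]

avg_even = cesaro_average(P, 10)
avg_odd = cesaro_average(P, 11)
print(avg_even, avg_odd)
```

For even $n$ the average is exactly the matrix with every entry $1/2$; for odd $n$ it differs from that by $O(1/n)$, even though $P^n$ itself never settles down.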
2.3.3 The linear algebra connection
Questions about the stationary and limiting distributions can also be addressed using linear algebra, by examining the eigenvalues of P. (We assume in this section that |S| = N < ∞.) Indeed, if π is a stationary distribution, then π is a left eigenvector of P corresponding to eigenvalue λ = 1.
We know that $P$ has an eigenvalue $\lambda = 1$, since the rows of $P$ sum to 1, so we have
$$P\,(1, 1, \ldots, 1)^T = (1, 1, \ldots, 1)^T,$$
and therefore $(1, 1, \ldots, 1)^T$ is a right eigenvector. To ensure that the corresponding left eigenvector is a stationary distribution, we need to know that its entries are all nonnegative.
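Let’s put this issue on hold for a moment, and just assume that the corresponding left eigenvector $\pi$ is a stationary distribution. When is it also a limiting distribution? Suppose that $P$ has a full set of eigenvalues $\lambda_1, \lambda_2, \ldots, \lambda_N$ which are distinct, with $\lambda_1 = 1$. Then there exists a matrix $B$ such that
$$P = B^{-1} \Lambda B, \quad \text{where } \Lambda = \begin{pmatrix} \lambda_1 & 0 & 0 & \cdots & 0 \\ 0 & \lambda_2 & 0 & \cdots & 0 \\ \vdots & & \ddots & & \vdots \\ 0 & \cdots & 0 & 0 & \lambda_N \end{pmatrix}.$$
The rows of $B$ are left eigenvectors of $P$, and the columns of $B^{-1}$ are right eigenvectors. Therefore
$$P^n = B^{-1} \Lambda^n B, \quad \text{where } \Lambda^n = \begin{pmatrix} \lambda_1^n & 0 & 0 & \cdots & 0 \\ 0 & \lambda_2^n & 0 & \cdots & 0 \\ \vdots & & \ddots & & \vdots \\ 0 & \cdots & 0 & 0 & \lambda_N^n \end{pmatrix}.$$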
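What happens as $n \to \infty$? For the first eigenvalue we have $\lambda_1^n = 1$. For any eigenvalue with $|\lambda_i| < 1$, we have $\lambda_i^n \to 0$. Therefore, there is a limiting distribution only if $|\lambda_i| < 1$ for $i \ge 2$. In this case we have
$$\lim_{n\to\infty} P^n = B^{-1} \begin{pmatrix} 1 & 0 & \cdots & 0 \\ 0 & 0 & \cdots & 0 \\ \vdots & & \ddots & \vdots \\ 0 & 0 & \cdots & 0 \end{pmatrix} B.$$
We know that the right eigenvector associated with $\lambda_1$ is $v = (1, 1, \ldots, 1)^T$. By assumption, the left eigenvector is a stationary distribution $\pi$. Therefore we have
$$\lim_{n\to\infty} P^n = \begin{pmatrix} 1 \\ 1 \\ \vdots \\ 1 \end{pmatrix} \begin{pmatrix} \pi_1 & \pi_2 & \cdots & \pi_N \end{pmatrix} = \begin{pmatrix} \pi_1 & \pi_2 & \cdots & \pi_N \\ \pi_1 & \pi_2 & \cdots & \pi_N \\ \vdots & \vdots & & \vdots \\ \pi_1 & \pi_2 & \cdots & \pi_N \end{pmatrix},$$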
so π is also a limiting distribution. (If P does not have a full set of distinct eigenvalues, then we can do a similar calculation using the Jordan canonical form of the matrix.)
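For a $2 \times 2$ stochastic matrix the eigenvalue calculation can be written out in closed form, which makes a convenient sanity check. The sketch below uses the weather chain and relies on two facts that are specific to the $2 \times 2$ case and stated here as assumptions of the example: the second eigenvalue equals the trace minus 1, and the decomposition collapses to $P^n = \Pi + \lambda_2^n (I - \Pi)$, where every row of $\Pi$ is $\pi$.

```python
from fractions import Fraction as F

# Weather chain, states ordered (sun, rain).
P = [[F(4, 5), F(1, 5)],
     [F(2, 5), F(3, 5)]]

pi = [F(2, 3), F(1, 3)]         # left eigenvector for eigenvalue 1, normalised to sum to 1
lam2 = P[0][0] + P[1][1] - 1    # second eigenvalue of a 2x2 stochastic matrix (trace - 1)

def matmul(A, B):
    """Product of two matrices stored as lists of rows."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def spectral_power(n):
    """P^n from the eigen-decomposition: Pi + lam2^n (I - Pi), rows of Pi equal to pi."""
    return [[pi[j] + lam2 ** n * ((1 if i == j else 0) - pi[j])
             for j in range(2)] for i in range(2)]

# Compare against direct repeated multiplication for the first few powers.
power = [row[:] for row in P]
for n in range(1, 6):
    assert power == spectral_power(n)
    power = matmul(power, P)

print(lam2, [[float(x) for x in row] for row in spectral_power(40)])
```

Since $|\lambda_2| = 2/5 < 1$, the correction term decays geometrically and every row of $P^n$ approaches $\pi$.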
We can justify the above calculations using some results from linear algebra.
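Lemma. The spectral radius of a stochastic matrix $P$ is 1, i.e. $\rho(P) = \max_\lambda |\lambda| = 1$, where the max is over all eigenvalues.
Proof. Let $\eta$ be a left eigenvector with eigenvalue $\lambda$. Then $\lambda \eta_i = \sum_{j=1}^{N} \eta_j p_{ji}$, and
$$|\lambda| \sum_{i=1}^{N} |\eta_i| = \sum_{i=1}^{N} \Big| \sum_{j=1}^{N} \eta_j p_{ji} \Big| \le \sum_{i,j=1}^{N} |\eta_j|\, p_{ji} = \sum_{j=1}^{N} |\eta_j|.$$
Therefore $|\lambda| \le 1$. Since $\lambda = 1$ is itself an eigenvalue (the rows of $P$ sum to 1), the maximum is attained and $\rho(P) = 1$.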
Whew. This is good news – it shows that no eigenvalue of P has complex norm greater than 1 – but it still doesn’t rule out the possibility that there are other eigenvalues with complex norm equal to 1. But, you may recall this theorem from linear algebra.
Theorem. (Perron-Frobenius Theorem, for aperiodic positive matrices.) Let $M$ be a positive³ $k \times k$ matrix, with $k < \infty$. Then the following statements hold:
(i) There is a positive real number $\lambda_1$ which is an eigenvalue of $M$. All other eigenvalues $\lambda$ of $M$ satisfy $|\lambda| < \lambda_1$.
(ii) The eigenspace of eigenvectors associated with $\lambda_1$ is one-dimensional.
(iii) There exists a positive right eigenvector $v$ and a positive left eigenvector $w$ associated with $\lambda_1$.
(iv) $M$ has no other eigenvector with nonnegative entries.
For a proof, see an advanced linear algebra textbook, such as Lax (1997), Chapter 16. There is also a brief description of the proof in Strang (1988), section 5.3.
The Perron-Frobenius theorem implies that if $P$ is positive, then it has a one-dimensional eigenspace associated with the eigenvalue $\lambda = 1$, and the corresponding left eigenvector $\pi$ is positive. Therefore, $\pi$ is the unique stationary distribution. The theorem also shows that all other eigenvalues have complex norm less than 1, so combined with the calculations above (or an enhanced version of the Perron-Frobenius theorem⁴), we have that $\pi$ is a limiting distribution.
³ $M_{ij} > 0$ for all $i, j$.
⁴ See for example Theorems 8.2.7, 8.2.8, in “Matrix Analysis” by Horn & Johnson.
Definition. The first-passage time or first-hitting time of a set $A \subset S$ is defined by
$$T_A = \min\{n \ge 0 : X_n \in A\}.$$
To show that $T_A$ is a stopping time, observe that
$$\{T_A = n\} = \{X_0 \in A^c, X_1 \in A^c, \ldots, X_{n-1} \in A^c, X_n \in A\},$$
and the event on the right-hand side depends only on the random variables $X_0, \ldots, X_n$.
Here are some other examples of stopping times:
Exercise 2.9. Show that all of the examples above are stopping times.
An example of a random time that is not a stopping time is the last visit to a set $A$, i.e. $T = \max\{n : X_n \in A\}$. The event $\{T = n\}$ depends on all future values $X_n, X_{n+1}, \ldots$, so it cannot be a stopping time.
Here are some other examples of random times that are not stopping times:
Exercise 2.10. Argue why each of the above examples is not a stopping time.
We can answer many questions about stopping times, by solving linear equations. A common quantity of interest is the average time it takes to hit a set A ⊂ S.
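Definition. The mean first passage time (mfpt) to set $A$ starting at state $i$ is
$$\tau_i = E(T_A \mid X_0 = i). \tag{20}$$
Let’s compute the mfpt $\tau_i$, using a first-step analysis. Let’s assume that $P(T_A < \infty \mid X_0 = i) = 1$ for all $i \in S$, and furthermore that $\tau_i < \infty$ for all $i \in S$.
For $i \in A$, we know that $T_A = 0$, so $\tau_i = 0$. Consider $i \notin A$. Then we have
$$\begin{aligned} \tau_i &= \sum_{t=1}^{\infty} t\, P(T_A = t \mid X_0 = i) \\ &= \sum_{t=1}^{\infty} \sum_{j=1}^{\infty} t\, P(T_A = t \mid X_0 = i, X_1 = j)\, P(X_1 = j \mid X_0 = i) && \text{LoTP} \\ &= \sum_{t=1}^{\infty} \sum_{j=1}^{\infty} t\, P(T_A = t \mid X_1 = j)\, P(X_1 = j \mid X_0 = i) && \text{Markov property} \end{aligned}$$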
Because the chain is time-homogeneous, we expect that $P(T_A = t \mid X_1 = j) = P(T_A = t-1 \mid X_0 = j)$. To show this explicitly, write
$$\begin{aligned} P(T_A = t \mid X_1 = j) &= P(X_2 \in A^c, \ldots, X_{t-1} \in A^c, X_t \in A \mid X_1 = j) && \text{by definition} \\ &= P(X_1 \in A^c, \ldots, X_{t-2} \in A^c, X_{t-1} \in A \mid X_0 = j) && \text{by time-homogeneity} \\ &= P(T_A = t-1 \mid X_0 = j). \end{aligned}$$
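Therefore, substituting into the above and changing the index $t \to t + 1$, we have
$$\begin{aligned} \tau_i &= \sum_{t=0}^{\infty} \sum_{j=1}^{\infty} (t+1)\, P(T_A = t \mid X_0 = j)\, P_{ij} \\ &= \sum_{j=1}^{\infty} \sum_{t=0}^{\infty} t\, P(T_A = t \mid X_0 = j)\, P_{ij} + \sum_{j=1}^{\infty} \sum_{t=0}^{\infty} P(T_A = t \mid X_0 = j)\, P_{ij} \\ &= \sum_{j=1}^{\infty} \tau_j P_{ij} + 1. \end{aligned}$$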
The second term is 1 because $\sum_{t=0}^{\infty} P(T_A = t \mid X_0 = j) = 1$, since this sum is the probability that $T_A$ takes any finite value (we are assuming that $P(T_A < \infty) = 1$). Summing over $j$ then gives $\sum_{j=1}^{\infty} P_{ij} = 1$, which holds because we are simply summing along row $i$ of $P$, which forms a probability distribution. We can interchange the order of summation in the second step, because all the terms we are adding up are nonnegative, and we assume the sum exists since the mfpt exists.
We just showed the following:
Theorem. Let $\tau = (\tau_1, \tau_2, \ldots)^T$ be a vector of mean first passage times from each state $i \in S$. Then $\tau$ solves the following system of equations:
$$\tau_i = \begin{cases} 0 & i \in A, \\ 1 + \sum_j P_{ij} \tau_j & i \notin A. \end{cases} \tag{21}$$
Remark. Another way to write (21) is
$$P' \tau' + 1 = \tau', \tag{22}$$
where $P'$ is $P$ with the rows and columns corresponding to elements in $A$ removed, and $\tau'$ is $\tau$ with the elements in $A$ removed. Equation (22) can in turn be written as
$$(P' - I)\tau' = -1. \tag{23}$$
This form will make it easier to make the connection to continuous-time Markov chains and processes, later in the course.
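As a worked instance of (23), the sketch below computes mean first passage times for a hypothetical example that is not taken from the notes: a symmetric random walk on states $0, \ldots, 4$ with target set $A = \{0, 4\}$. The linear solver is a generic Gauss-Jordan elimination in exact arithmetic, and all function names are illustrative.

```python
from fractions import Fraction as F

def solve(A, b):
    """Solve A x = b by Gauss-Jordan elimination (assumes A is invertible)."""
    n = len(A)
    M = [list(row) + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = next(r for r in range(col, n) if M[r][col] != 0)
        M[col], M[pivot] = M[pivot], M[col]
        M[col] = [x / M[col][col] for x in M[col]]
        for r in range(n):
            if r != col and M[r][col] != 0:
                factor = M[r][col]
                M[r] = [x - factor * y for x, y in zip(M[r], M[col])]
    return [M[i][n] for i in range(n)]

def mean_first_passage_times(P, target):
    """Solve (P' - I) tau' = -1 on the states outside target; tau is 0 on target."""
    states = [i for i in range(len(P)) if i not in target]
    A = [[F(P[i][j]) - (1 if i == j else 0) for j in states] for i in states]
    b = [F(-1)] * len(states)
    tau_prime = solve(A, b)
    tau = [F(0)] * len(P)
    for i, t in zip(states, tau_prime):
        tau[i] = t
    return tau

half = F(1, 2)
# Hypothetical chain: symmetric random walk on {0,...,4}; rows 0 and 4 are never used
# in the reduced system because those states belong to the target set.
P = [
    [F(1), 0, 0, 0, 0],
    [half, 0, half, 0, 0],
    [0, half, 0, half, 0],
    [0, 0, half, 0, half],
    [0, 0, 0, 0, F(1)],
]

tau = mean_first_passage_times(P, target={0, 4})
print([str(t) for t in tau])
```

For this walk the solution is $\tau = (0, 3, 4, 3, 0)$, which can be checked by hand against (21): for instance $\tau_2 = 1 + \tfrac{1}{2}\tau_1 + \tfrac{1}{2}\tau_3 = 4$.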