Lecture 2: Markov Chains (I), Slides of Probability and Statistics

Northwestern University (NU)Probability and Statistics

Typology: Slides

2022/2023

Uploaded on 05/11/2023

Miranda Holmes-Cerfon Applied Stochastic Analysis, Spring 2019

Lecture 2: Markov Chains (I)

Readings Strongly recommended:

• Grimmett and Stirzaker (2001) 6.1, 6.4-6.6

Optional:

• Hayes (2013) for a lively history and gentle introduction to Markov chains.

• Koralov and Sinai (2010) 5.1-5.5, pp. 67-78 (more mathematical)

A canonical reference on Markov chains is Norris (1997).

We will begin by discussing Markov chains. In Lectures 2 & 3 we will discuss discrete-time Markov chains, and Lecture 4 will cover continuous-time Markov chains.

2.1 Setup and definitions

We consider a discrete-time, discrete-space stochastic process, which we write as $X(t) = X_t$ for $t = 0, 1, \dots$. The state space $S$ is discrete, i.e. finite or countable, so we can let it be a set of integers, as in $S = \{1, 2, \dots, N\}$ or $S = \{1, 2, \dots\}$.

Definition. The process $X(t) = X_0, X_1, X_2, \dots$ is a discrete-time Markov chain if it satisfies the Markov property:

$$P(X_{n+1}=s \mid X_0=x_0, X_1=x_1, \dots, X_n=x_n) = P(X_{n+1}=s \mid X_n=x_n). \tag{1}$$

The quantities $P(X_{n+1}=j \mid X_n=i)$ are called the transition probabilities. In general the transition probabilities are functions of $i$, $j$, $n$. It is convenient to write them as

$$p_{ij}(n) = P(X_{n+1}=j \mid X_n=i). \tag{2}$$

Definition. The transition matrix at time $n$ is the matrix $P(n) = (p_{ij}(n))$, i.e. the $(i,j)$th element of $P(n)$ is $p_{ij}(n)$.^1 The transition matrix satisfies:

(i) $p_{ij}(n) \ge 0 \;\; \forall i, j$ (the entries are non-negative)

(ii) $\sum_j p_{ij}(n) = 1 \;\; \forall i$ (the rows sum to 1)

Any matrix that satisfies (i), (ii) above is called a stochastic matrix. Hence, the transition matrix is a stochastic matrix.
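Conditions (i) and (ii) are easy to check numerically. The following is a minimal sketch, assuming the matrix is stored as a list of rows; the function name `is_stochastic`, the tolerance, and the two small matrices are illustrative choices, not part of the lecture.

```python
from typing import Sequence


def is_stochastic(P: Sequence[Sequence[float]], tol: float = 1e-12) -> bool:
    """Return True if P has non-negative entries (i) and rows summing to 1 (ii)."""
    for row in P:
        if any(p < 0 for p in row):        # condition (i) fails
            return False
        if abs(sum(row) - 1.0) > tol:      # condition (ii) fails
            return False
    return True


# Hypothetical examples: the first matrix is stochastic, the second is not.
P_ok = [[0.5, 0.5], [0.25, 0.75]]
P_bad = [[0.5, 0.6], [0.25, 0.75]]         # first row sums to 1.1
```

The tolerance only guards against floating-point rounding in the row sums; with exact rational entries it could be set to zero.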

Exercise 2.1. Show that the transition probabilities satisfy (i), (ii) above.

Exercise 2.2. Show that if $X(t)$ is a discrete-time Markov chain, then

$$P(X_n=s \mid X_0=x_0, X_1=x_1, \dots, X_m=x_m) = P(X_n=s \mid X_m=x_m)$$

for any $0 \le m < n$. That is, the probabilities at the current time depend only on the most recent known state in the past, even if it is not exactly one step before.

^1 We call it a matrix even if $|S| = \infty$.


Remark. Note that a “stochastic matrix” is not the same thing as a “random matrix”! Usually “random” can be substituted for “stochastic” but not here. A random matrix is a matrix whose entries are random. A stochastic matrix has completely deterministic entries. It probably gets its name because it is used to describe a stochastic phenomenon, but this is an unfortunate accident of history.

Definition. The Markov chain $X(t)$ is time-homogeneous if $P(X_{n+1}=j \mid X_n=i) = P(X_1=j \mid X_0=i)$, i.e. the transition probabilities do not depend on time $n$. If this is the case, we write $p_{ij} = P(X_1=j \mid X_0=i)$ for the probability to go from $i$ to $j$ in one step, and $P = (p_{ij})$ for the transition matrix.

We will only consider time-homogeneous Markov chains in this course, though we will occasionally remark on how some results may be generalized to the time-inhomogeneous case.

Examples

Weather model Let $X_n$ be the state of the weather on day $n$ in New York, which we assume is either rainy or sunny. We could use a Markov chain as a crude model for how the weather evolves day-by-day. The state space is $S = \{\text{rain}, \text{sun}\}$. One transition matrix might be

$$P = \begin{array}{c|cc} & \text{sun} & \text{rain} \\ \hline \text{sun} & 0.8 & 0.2 \\ \text{rain} & 0.4 & 0.6 \end{array}$$

This says that if it is sunny today, then the chance it will be sunny tomorrow is 0.8, whereas if it is rainy today, then the chance it will be sunny tomorrow is 0.4. One question you might be interested in is: what is the long-run fraction of sunny days in New York?
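One way to get a feel for the long-run question is to simulate the chain and count sunny days. The sketch below does this with the transition matrix above; the function name `simulate`, the seed, and the path length are arbitrary choices made here for illustration. For this particular matrix the estimate should come out close to 2/3, the value obtained by solving $\pi = \pi P$ by hand.

```python
import random
from typing import List

STATES = ["sun", "rain"]
# Transition probabilities of the weather model: P[today][tomorrow].
P = {"sun": {"sun": 0.8, "rain": 0.2},
     "rain": {"sun": 0.4, "rain": 0.6}}


def simulate(n_days: int, start: str = "sun", seed: int = 0) -> List[str]:
    """Sample a path X_0, ..., X_{n_days} of the weather chain."""
    rng = random.Random(seed)
    path = [start]
    for _ in range(n_days):
        today = path[-1]
        weights = [P[today][s] for s in STATES]
        # Draw tomorrow's weather from the row of P belonging to today.
        path.append(rng.choices(STATES, weights=weights)[0])
    return path


path = simulate(100_000)
sunny_fraction = path.count("sun") / len(path)
print(f"fraction of sunny days: {sunny_fraction:.3f}")
```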

Coin flipping Another two-state Markov chain is based on coin flips. Usually coin flips are used as the canonical example of independent Bernoulli trials. However, Diaconis et al. (2007) studied sequences of coin tosses empirically, and found that outcomes in a sequence of coin tosses are not independent. Rather, they are well-modelled by a Markov chain with the following transition probabilities:

$$P = \begin{array}{c|cc} & \text{heads} & \text{tails} \\ \hline \text{heads} & 0.51 & 0.49 \\ \text{tails} & 0.49 & 0.51 \end{array}$$

This shows that if you throw a Heads on your first toss, there is a very slightly higher chance of throwing heads on your second, and similarly for Tails.

Random walk on the line Suppose we perform a walk on the integers, starting at 0. At each time we move right or left by one unit, with probability 1/2 each. This gives a Markov chain, which can be constructed explicitly as

$$X_n = \sum_{j=1}^{n} \xi_j, \qquad \xi_j = \pm 1 \text{ with probability } \tfrac{1}{2} \text{ each}, \quad \xi_j \text{ i.i.d.}$$

The transition probabilities are

$$p_{i,i+1} = \tfrac{1}{2}, \qquad p_{i,i-1} = \tfrac{1}{2}, \qquad p_{i,j} = 0 \quad (j \ne i \pm 1).$$

[...] choose an edge uniformly at random from the set of edges leading out of the node, and move along the edge to the node at its other end. Then repeat. If there are $N$ nodes labelled by consecutive integers then this is a Markov chain on state space $S = \{1, 2, \dots, N\}$. Here are a couple of examples:

[Figure: example graphs, not recovered]

The corresponding transition matrices are:

[Transition matrices not recovered]

Random walk on a graph (weighted, directed)

Every Markov chain can be represented as a random walk on a weighted, directed graph. A weighted graph is one where each edge has a positive real number assigned to it, its “weight,” and the random walker chooses an edge from the set of available edges, in proportion to each edge’s weight. In a directed graph each edge also has a direction, and a walker can only move in that direction. Here is an example:

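[Figure: a weighted, directed graph on the nodes A, B, C]

The corresponding transition matrix is:

$P$ = [a 3 × 3 matrix with rows and columns ordered A, B, C; the row for A is $(0, 1, 0)$, and the fractional entries in the rows for B and C were not recovered]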

In fact, such a directed graph forms the foundation for Google’s PageRank algorithm, which has revolutionized internet searches. The earliest and best-known version of PageRank constructs a directed graph of the internet, where nodes are webpages and there is a directed edge from webpage A to webpage B if A contains a link to B. PageRank assumes an internet surfer follows links at random, and ranks pages according to the long-time average fraction of time that the surfer spends on each page.
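To make the rule "choose an edge in proportion to its weight" concrete, here is a small sketch that turns a table of edge weights into a transition matrix by normalizing each node's outgoing weights. The function name `transition_matrix` and the weights are hypothetical (they are not the ones in the lecture's figure), and the sketch assumes integer weights and at least one outgoing edge per node.

```python
from fractions import Fraction
from typing import Dict


def transition_matrix(weights: Dict[str, Dict[str, int]]) -> Dict[str, Dict[str, Fraction]]:
    """weights[u][v] is the positive integer weight of the directed edge u -> v."""
    P = {}
    for u, out_edges in weights.items():
        total = sum(out_edges.values())
        # Probability of stepping u -> v is weight(u, v) / (total outgoing weight of u).
        P[u] = {v: Fraction(w, total) for v, w in out_edges.items()}
    return P


# Hypothetical weighted, directed graph on three nodes.
weights = {"A": {"B": 1},
           "B": {"A": 1, "C": 3},
           "C": {"A": 2, "B": 1, "C": 1}}
P = transition_matrix(weights)
```

Edges that are absent from `weights` simply have transition probability zero, so the result is stored sparsely as a dictionary of rows.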

Ehrenfest model of diffusion Consider a container with a membrane in the middle, and $m$ particles distributed in some way between the left and right sides. At each step, pick one particle at random and move it to the other side. Let $X_n$ = # of particles in the left side at time $n$. Then $X_n$ is a Markov chain, with transition probabilities $p_{i,i+1} = 1 - \frac{i}{m}$, $p_{i,i-1} = \frac{i}{m}$.
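As a quick check on these transition probabilities, the sketch below builds the full transition matrix for a given $m$, taking the state space to be $\{0, 1, \dots, m\}$ (an assumption that follows from the definition of $X_n$). The function name `ehrenfest_matrix` is an illustrative choice.

```python
from fractions import Fraction
from typing import List


def ehrenfest_matrix(m: int) -> List[List[Fraction]]:
    """(m+1) x (m+1) transition matrix for X_n = number of particles on the left."""
    P = [[Fraction(0)] * (m + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        if i < m:
            P[i][i + 1] = 1 - Fraction(i, m)   # the chosen particle was on the right
        if i > 0:
            P[i][i - 1] = Fraction(i, m)       # the chosen particle was on the left
    return P


P = ehrenfest_matrix(4)
```

At the endpoints the chain is forced inward: from state 0 it moves to 1 with probability 1, and from state $m$ it moves to $m-1$ with probability 1.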
Card shuffling Shuffling a pack of cards can be modeled as a Markov chain. The state space $S$ is the set of permutations of $\{1, 2, \dots, 52\}$. A shuffle takes one permutation $\sigma \in S$, and outputs another permutation $\sigma' \in S$ with a certain probability. Perhaps the simplest model is the top-to-random shuffle: at each step, take a card from the top of the deck, and put it back in at a random location. The transition matrix has elements

$$P(X_1 = \sigma' \mid X_0 = \sigma) = \begin{cases} \frac{1}{52} & \text{if } \sigma' \text{ is obtained by taking the top item of } \sigma \text{ and moving it to one of the 52 positions,} \\ 0 & \text{otherwise.} \end{cases}$$

One can also model more complicated shuffles, such as the riffle shuffle. While the state space is enormous ($|S| = 52!$), so you would not want to write down the whole transition matrix, one can still analyze these models using other techniques, from analysis and probability theory. Various authors have proven results about the number of shuffles needed to make the deck “close to random”. For example, it takes seven riffle shuffles to get close to random, but it takes 11 or 12 to get so close that a gambler in a casino cannot exploit the deviations from randomness to win a typical game. See the online essay Austin (line) for an accessible introduction to these ideas, and Aldous and Diaconis (1986) for the mathematical proofs. (I first learned about this phenomenon in the beautiful Proofs from the Book, by Aigner and Ziegler.)
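Although the transition matrix is far too large to write down, a single step of the top-to-random shuffle is simple to simulate. The sketch below is one possible implementation; it assumes the deck is a list whose first element is the top card, and the function name `top_to_random` and the seed are illustrative.

```python
import random
from typing import List


def top_to_random(deck: List[int], rng: random.Random) -> List[int]:
    """One top-to-random step: remove the top card, reinsert it at a uniform position."""
    top, rest = deck[0], deck[1:]
    # len(deck) possible positions (including back on top), each equally likely.
    position = rng.randrange(len(deck))
    return rest[:position] + [top] + rest[position:]


rng = random.Random(0)
deck = list(range(1, 53))       # cards labelled 1..52, card 1 on top
shuffled = deck
for _ in range(10):
    shuffled = top_to_random(shuffled, rng)
```

Each call returns a new list, so the sequence of decks produced in the loop is a sample path of the Markov chain on permutations.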

Autoregressive model of order $k$ (AR($k$)) Given constants $a_1, \dots, a_k \in \mathbb{R}$, let $Y_n = a_1 Y_{n-1} + a_2 Y_{n-2} + \dots + a_k Y_{n-k} + W_n$, where $W_n$ are i.i.d. random variables.

Markov chains in applications. Markov chains arise in a great many modern applications. Here is an example, from Rogers et al. (2013), where the configuration space of two DNA-coated colloids was modelled as a two-state Markov chain, with states “bound” and “unbound,” depending on whether the distance between the particles was small or large:

Some other examples of applications that use Markov chains include:

• models of physical processes
   - rainfall from day-to-day
   - neural networks
   - population dynamics
   - lineups, e.g. in grocery stores, computer servers, telephone call centers, etc.
   - chemical reactions
   - protein folding
   - baseball statistics
• discretize a continuous system
• sampling from high-dimensional systems, e.g. Markov-Chain Monte-Carlo
• data/network analysis
   - clustering
   - speech recognition
   - PageRank algorithm in Google’s search engine.

2.2 Evolution of probability

Given a Markov chain with transition probabilities $P$ and initial condition $X_0 = i$, we know how to calculate the probability distribution of $X_1$; indeed, this is given directly by the transition probabilities. The natural question to ask next is: what is the distribution at later times? That is, we would like to know the $n$-step transition probabilities $P^{(n)}$, defined by

$$P^{(n)}_{ij} = P(X_n = j \mid X_0 = i). \tag{3}$$

For example, for $n = 2$, we have that

$$
\begin{aligned}
P(X_2 = j \mid X_0 = i) &= \sum_k P(X_2 = j \mid X_1 = k, X_0 = i)\, P(X_1 = k \mid X_0 = i) && \text{Law of Total Probability} \\
&= \sum_k P(X_2 = j \mid X_1 = k)\, P(X_1 = k \mid X_0 = i) && \text{Markov property} \\
&= \sum_k P_{kj} P_{ik} && \text{time-homogeneity} \\
&= (P^2)_{ij}.
\end{aligned}
$$

That is, the two-step transition matrix is $P^{(2)} = P^2$.

This generalizes:

Theorem. Let $X_0, X_1, \dots$ be a time-homogeneous Markov chain with transition probabilities $P$. The $n$-step transition probabilities are $P^{(n)} = P^n$, i.e.

$$P(X_n = j \mid X_0 = i) = (P^n)_{ij}. \tag{4}$$

To make the notation cleaner we will write $(P^n)_{ij} = P^n_{ij}$. Note that this does not equal $(P_{ij})^n$.

Exercise 2.3. Prove this theorem. Hint: use induction.
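The theorem can be tried out on the weather chain from the examples. The sketch below computes matrix powers with exact rational arithmetic; the helper names `mat_mul` and `mat_pow` are illustrative, and the states are ordered (sun, rain) as in the matrix given earlier.

```python
from fractions import Fraction


def mat_mul(A, B):
    """Product of two matrices stored as lists of rows."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B))) for j in range(len(B[0]))]
            for i in range(len(A))]


def mat_pow(P, n):
    """n-th power of a square matrix by repeated multiplication (P^0 = identity)."""
    size = len(P)
    result = [[Fraction(int(i == j)) for j in range(size)] for i in range(size)]
    for _ in range(n):
        result = mat_mul(result, P)
    return result


# Weather chain, states ordered (sun, rain).
P = [[Fraction(4, 5), Fraction(1, 5)],
     [Fraction(2, 5), Fraction(3, 5)]]
P2 = mat_pow(P, 2)   # two-step transition probabilities
```

Here `P2[1][0]` equals 14/25 = 0.56, the probability that it is sunny two days after a rainy day; it is the sum over both possible intermediate days, 0.4 · 0.8 + 0.6 · 0.4.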

A more general equation relating the transition probabilities, that holds even in the time-inhomogeneous case, is:

Chapman-Kolmogorov Equation.

$$P(X_n = j \mid X_0 = i) = \sum_k P(X_n = j \mid X_m = k)\, P(X_m = k \mid X_0 = i). \tag{5}$$

Therefore, if we know the initial probability distribution $\alpha^{(0)}$, then we can find the distribution at any later time using powers of the matrix $P$.
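In practice one rarely forms the matrix power explicitly; it is enough to multiply the row vector of probabilities by $P$ once per time step. A minimal sketch, using the weather chain with states ordered (sun, rain) and a start on a rainy day (the function names `step` and `evolve` and the initial condition are illustrative assumptions):

```python
from typing import List


def step(alpha: List[float], P: List[List[float]]) -> List[float]:
    """One time step: the row vector alpha multiplied by the matrix P."""
    return [sum(alpha[i] * P[i][j] for i in range(len(P))) for j in range(len(P[0]))]


def evolve(alpha0: List[float], P: List[List[float]], n: int) -> List[float]:
    """Distribution after n steps, starting from the distribution alpha0."""
    alpha = list(alpha0)
    for _ in range(n):
        alpha = step(alpha, P)
    return alpha


P = [[0.8, 0.2], [0.4, 0.6]]    # weather chain, states ordered (sun, rain)
alpha0 = [0.0, 1.0]             # it is certainly rainy on day 0
alpha3 = evolve(alpha0, P, 3)   # distribution of the weather on day 3
```

Working step by step gives (0.4, 0.6), then (0.56, 0.44), then (0.624, 0.376), so the chance of sun on day 3 is 0.624 up to floating-point rounding.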

Now consider what happens if we ask for the expected value of some function of the state of the Markov chain, such as $E X_n^2$, $E X_n^3$, $E|X_n|$, etc. Can we derive an evolution equation for this quantity?

Let $f : S \to \mathbb{R}$ be a function defined on state space, and let

$$u^{(n)}_i = E_i f(X_n) = E[f(X_n) \mid X_0 = i]. \tag{8}$$

You should think of $u^{(n)}$ as a column vector; again this is a convention whose convenience will become more transparent later in the course. Then $u^{(n)}$ evolves in time as:

Backward Kolmogorov Equation. (for a time-homogeneous, discrete-time Markov Chain)

$$u^{(n+1)} = P u^{(n)}, \qquad u^{(0)}(i) = f(i) \quad \forall i \in S. \tag{9}$$

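Proof. We have

$$
\begin{aligned}
u^{(n+1)}(i) &= \sum_j f(j)\, P(X_{n+1} = j \mid X_0 = i) && \text{definition of expectation} \\
&= \sum_j \sum_k f(j)\, P(X_{n+1} = j \mid X_1 = k, X_0 = i)\, P(X_1 = k \mid X_0 = i) && \text{LoTP} \\
&= \sum_j \sum_k f(j)\, P(X_{n+1} = j \mid X_1 = k)\, P(X_1 = k \mid X_0 = i) && \text{Markov property} \\
&= \sum_j \sum_k f(j)\, P^n_{kj} P_{ik} && \text{time-homogeneity} \\
&= \sum_k \sum_j f(j)\, P^n_{kj} P_{ik} && \text{switch order of summation} \\
&= \sum_k u^{(n)}(k)\, P_{ik} && \text{definition of } u^{(n)} \\
&= (P u^{(n)})_i.
\end{aligned}
$$

We can switch the order of summation above, provided we assume that $E_i |f(X_n)| < \infty$ for each $i$ and each $n$.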

This proof illustrates a technique sometimes known as first-step analysis, where one conditions on the first step of the Markov chain and uses the Law of Total Probability. Of course, you could also derive this equation more directly from the n-step transition probabilities.

Exercise 2.5. Do this! Derive (9) directly from the formula for the n-step transition probabilities.
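Equation (9) translates directly into an iteration: start from the column vector $f$ and apply $P$ once per step. The sketch below does this for the weather chain with $f$ the indicator of a sunny day, so that $u^{(n)}(i)$ is the probability of sun on day $n$ given the weather on day 0. The function names `apply` and `expected_value` and the choice of $f$ are illustrative assumptions.

```python
from fractions import Fraction


def apply(P, u):
    """Matrix-vector product P u, with u treated as a column vector."""
    return [sum(P[i][k] * u[k] for k in range(len(u))) for i in range(len(P))]


def expected_value(P, f, n):
    """u^(n)(i) = E[f(X_n) | X_0 = i], iterating u^(k+1) = P u^(k) from u^(0) = f."""
    u = list(f)
    for _ in range(n):
        u = apply(P, u)
    return u


# Weather chain, states ordered (sun, rain).
P = [[Fraction(4, 5), Fraction(1, 5)],
     [Fraction(2, 5), Fraction(3, 5)]]
f = [Fraction(1), Fraction(0)]      # f(sun) = 1, f(rain) = 0
u2 = expected_value(P, f, 2)        # probability of sun on day 2, for each starting state
```

Because $f$ is an indicator, `u2` is the first column of $P^2$, namely (18/25, 14/25), which is consistent with computing the expectation directly from the two-step transition probabilities.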

Remark. What is so backward about the backward equation? It gets its name from the fact that it can be used to describe how conditional expectations propagate backwards in time. To see this, suppose that instead of (8), which computes the expectation of a function after a certain number of steps has passed, we choose a fixed time $T$ and compute the expectation at that time, given an earlier starting position. That is, for each $n \le T$, define a column vector $u^{(n)}$ with components

$$u^{(n)}_i = E[f(X_T) \mid X_n = i]. \tag{10}$$

Such a quantity is studied a lot in financial applications, where, say, $X_n$ is the price of a stock at time $n$, $f$ is a value function representing the value of an option to sell, $T$ might be a time at which you decide (in advance) to sell a stock, and quantities of the form (10) above would represent your expected payout, conditional on being in state $i$ at time $n$. Then, the vector $u^{(n)}$ evolves according to

$$u^{(n)} = P u^{(n+1)}, \qquad u^{(T)}_i = f(i) \quad \forall i \in S. \tag{11}$$

Therefore you find $u^{(n)}$ by evolving it backwards in time: you are given a final condition at time $T$, and you can solve for $u^{(n)}$ at all earlier times $n \le T$.

Interestingly, (11) holds even when the chain is not time-homogeneous, provided that P in (11) is replaced by P(n), the transition probabilities starting at time n. This same statement is not true for (9).

Exercise 2.6. Show (11), and argue it holds even when the Markov chain is not time-homogeneous.

2.2.1 Evolution of the full transition probabilities*

Another approach to the forward/backward equations is to define a function $P(j, t \mid i, s)$ to be the transition probability to be in state $j$ at time $t$, given the system started in state $i$ at time $s$, i.e.

$$P(j, t \mid i, s) = P(X_t = j \mid X_s = i). \tag{12}$$

One can then derive equations for how $P(j, t \mid i, s)$ evolves in $t$ and $s$. For evolution in $t$ (forward in time) we have, from the Chapman-Kolmogorov equations,

$$P(j, t+1 \mid i, s) = \sum_k P(k, t \mid i, s)\, P(j, t+1 \mid k, t). \tag{13}$$

For evolution in $s$ (backward in time) we have, again from the Chapman-Kolmogorov equations,

$$P(j, t \mid i, s) = \sum_k P(j, t \mid k, s+1)\, P(k, s+1 \mid i, s). \tag{14}$$

These are general versions of the forward and backward equations, respectively. They hold regardless of whether the chain is time-homogeneous or not. From them, we can derive the time-inhomogeneous versions of the forward and backward equations (7), (9).

To derive the time-inhomogeneous forward equation, notice that the probability distribution at time $t$, $\alpha^{(t)}$, has components $\alpha^{(t)}_j = \sum_i P(j, t \mid i, 0)\, \alpha^{(0)}_i$. Therefore, multiplying (13) by $\alpha^{(0)}$ on the left (contracting it with index $i$) and letting $s = 0$, we obtain

$$\alpha^{(t+1)}_j = \sum_k \alpha^{(t)}_k P(j, t+1 \mid k, t) \quad \Longleftrightarrow \quad \alpha^{(t+1)} = \alpha^{(t)} P(t), \qquad P_{kj}(t) = P(j, t+1 \mid k, t). \tag{15}$$

To derive the time-inhomogeneous backward equation, let $f : S \to \mathbb{R}$, and let $u^{(s)}_i = E[f(X_t) \mid X_s = i]$ (recall (8), (10)). Notice that $u^{(s)}_i = \sum_k f(k)\, P(k, t \mid i, s)$, so multiplying (14) by the column vector $f$ on the right (contracting it with index $j$) gives

$$u^{(s)}_i = \sum_k P(k, s+1 \mid i, s)\, u^{(s+1)}_k \quad \Longleftrightarrow \quad u^{(s)} = P(s)\, u^{(s+1)}, \qquad P_{ik}(s) = P(k, s+1 \mid i, s). \tag{16}$$

n    P(0)   P(1)
0    1      0
1    0      1
2    1      0
3    0      1
4    1      0
5    0      1
6    1      0
...

You can see the pattern. Clearly the distribution doesn’t converge. Yet, if we start with initial distribution $\alpha^{(0)} = (0.5, 0.5)$, then we obtain

n    P(0)   P(1)
0    0.5    0.5
1    0.5    0.5
2    0.5    0.5
...

The distribution never changes!
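The two tables can be reproduced in a few lines. The sketch below assumes the chain in this example is the two-state chain on $\{0, 1\}$ that switches state at every step, i.e. $P$ has rows $(0, 1)$ and $(1, 0)$; this matrix is inferred from the tables rather than taken from the text, and the function names are illustrative.

```python
def step(alpha, P):
    """One time step: the row vector alpha multiplied by the matrix P."""
    return [sum(alpha[i] * P[i][j] for i in range(len(P))) for j in range(len(P[0]))]


# Assumed transition matrix: the chain deterministically swaps between states 0 and 1.
P = [[0, 1], [1, 0]]


def trajectory(alpha0, n_steps):
    """List of distributions alpha^(0), alpha^(1), ..., alpha^(n_steps)."""
    out = [list(alpha0)]
    for _ in range(n_steps):
        out.append(step(out[-1], P))
    return out


from_state_0 = trajectory([1, 0], 6)        # oscillates between (1, 0) and (0, 1)
from_uniform = trajectory([0.5, 0.5], 6)    # stays at (0.5, 0.5)
```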

2.3.1 Limiting and stationary distributions

In applications we are often interested in the long-term probability of visiting each state.

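Definition. Consider a time-homogeneous Markov chain with transition matrix $P$. A row vector $\lambda$ is a limiting distribution if $\lambda_i \ge 0$, $\sum_j \lambda_j = 1$ (so that $\lambda$ is a probability distribution), and if, for every $i$,

$$\lim_{n \to \infty} (P^n)_{ij} = \lambda_j \quad \forall j \in S.$$

In other words,

$$P^n \to \begin{pmatrix} \lambda_1 & \lambda_2 & \lambda_3 & \cdots \\ \lambda_1 & \lambda_2 & \lambda_3 & \cdots \\ \lambda_1 & \lambda_2 & \lambda_3 & \cdots \\ \vdots & \vdots & \vdots & \end{pmatrix} \quad \text{as } n \to \infty.$$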

Exercise 2.8. Show that, if $|S| < \infty$, then $\lambda$ is a limiting distribution if and only if $\lim_{n \to \infty} \alpha P^n = \lambda$ for any initial probability distribution $\alpha$.

As we saw in the earlier examples, a limiting distribution doesn’t have to exist. If it exists, it must be unique. What happens if we start the chain in the limiting distribution? Let’s calculate the distribution $\alpha^{(1)}$ at the next step of the chain, assuming initial distribution $\alpha^{(0)} = \lambda$. For simplicity, we will assume a finite state space, $|S| < \infty$, which lets us interchange the sum and the limit in the calculations below. Choose any $i$, and calculate, from (7) (writing $A_{i,\cdot}$ for the row vector corresponding to the $i$th row of matrix $A$):

$$\alpha^{(1)} = \lambda P = \Big( \lim_{n \to \infty} P^n_{i,\cdot} \Big) P = \lim_{n \to \infty} P^{n+1}_{i,\cdot} = \lambda.$$

Therefore if we start the chain in the limiting distribution, its distribution remains there forever. This motivates the following definition:

Definition. Given a Markov chain with transition matrix $P$, a stationary distribution is a probability distribution $\pi$ which satisfies

$$\pi = \pi P \quad \Longleftrightarrow \quad \pi_j = \sum_i \pi_i P_{ij} \quad \forall j. \tag{17}$$

This says that if we start with distribution $\pi$ and run the Markov chain, the distribution will not change. That is why it is called “stationary.” In other words, if $X_0 \sim \pi$, then $X_1 \sim \pi$, $X_2 \sim \pi$, etc.

Remark. Other synonyms you might hear for stationary distribution include invariant measure, invariant distribution, steady-state probability, equilibrium probability or equilibrium distribution (the latter two are from physics).

In applications we want to know the limiting distribution, but it is usually far easier to calculate the stationary distribution, because it is obtained by solving a system of linear equations. Therefore we will restrict our focus to the stationary distribution. Some questions we might ask about π include:
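To illustrate "solving a system of linear equations", the sketch below finds $\pi$ for the weather chain by solving $\pi(P - I) = 0$ together with the normalization $\sum_i \pi_i = 1$, using exact Gauss-Jordan elimination. The helper names `solve` and `stationary` are illustrative, and the sketch assumes the stationary distribution is unique (otherwise the linear system is singular and the pivot search fails).

```python
from fractions import Fraction


def solve(A, b):
    """Solve A x = b by Gauss-Jordan elimination; A must be square and invertible."""
    n = len(A)
    M = [[Fraction(x) for x in row] + [Fraction(rhs)] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot_row = next(r for r in range(col, n) if M[r][col] != 0)
        M[col], M[pivot_row] = M[pivot_row], M[col]
        pivot = M[col][col]
        M[col] = [x / pivot for x in M[col]]
        for r in range(n):
            if r != col:
                factor = M[r][col]
                M[r] = [x - factor * y for x, y in zip(M[r], M[col])]
    return [M[r][n] for r in range(n)]


def stationary(P):
    """Solve pi (P - I) = 0 with sum(pi) = 1, assuming a unique solution exists."""
    n = len(P)
    # Equation j reads sum_i pi_i (P_ij - delta_ij) = 0; one of the n equations is
    # redundant, so the last one is replaced by the normalization condition.
    A = [[P[i][j] - (1 if i == j else 0) for i in range(n)] for j in range(n - 1)]
    A.append([1] * n)
    b = [0] * (n - 1) + [1]
    return solve(A, b)


P = [[Fraction(4, 5), Fraction(1, 5)],
     [Fraction(2, 5), Fraction(3, 5)]]   # weather chain, states (sun, rain)
pi = stationary(P)
```

For the weather chain this gives $\pi = (2/3, 1/3)$, which can be confirmed by substituting into $\pi = \pi P$.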

(i) Does it exist?
(ii) Is it unique?
(iii) When is it a limiting distribution, i.e. when does an arbitrary distribution converge to it?

For (iii), we saw that a limiting distribution is a stationary distribution, but the converse is not always true. Indeed, in our second example, you can calculate that a stationary distribution is $\pi = (0.5, 0.5)$, but this is not a limiting distribution. What are the conditions that guarantee a stationary distribution is also the limiting distribution?

This is the subject of a rich body of work on the limiting behaviour of Markov chains. We will not go deeply into the results, but will briefly survey a couple of the major theorems.

2.3.2 A limit theorem or two

Definition. A matrix $A$ is positive if it has all positive entries: $A_{ij} > 0$ for all $i, j$. In these notes we will write $A > 0$ when $A$ is positive.

Remark. This is not the same as being positive-definite!

Definition. A stochastic matrix is regular if there exists some $s > 0$ such that $P^s$ is positive, i.e. the $s$-step transition probabilities are positive for all $i, j$: $(P^s)_{ij} > 0 \;\; \forall i, j$.

Remark. Some books call such a matrix primitive. The text Koralov and Sinai (2010) calls it ergodic (when the state space is finite), though usually this word is reserved for something slightly different.

This means that there is a time $s$ such that, no matter where you start, there is a non-zero probability of being at any other state.

Theorem (Ergodic Theorem for Markov Chains, one version). Assume a Markov chain is regular and has a finite state space with size $N$. Then there exists a unique stationary probability distribution $\pi = (\pi_1, \dots, \pi_N)$, with $\pi_j > 0 \;\; \forall j$. The $n$-step transition probabilities converge to $\pi$: that is, $\lim_{n \to \infty} P^n_{ij} = \pi_j$.

There are limit theorems for irreducible chains, with slightly weaker conditions. Irreducible chains also have a unique stationary distribution; this follows from the Perron-Frobenius Theorem (see below). However, it is not true that an arbitrary distribution converges to it; rather, we have that $\mu^{(0)} \bar{P}^{(n)} \to \pi$ as $n \to \infty$, where $\bar{P}^{(n)} = \frac{1}{n} \sum_{k=1}^{n} P^k$. This means that the average distribution converges. We need to form the average, because there may be a built-in periodicity, as in the chain in the second example. In this case $P^{2n} = I$ and $P^{2n+1} = P$, so $\alpha^{(n)}$ oscillates between two distributions, instead of converging to a fixed limit.

2.3.3 The linear algebra connection

Questions about the stationary and limiting distributions can also be addressed using linear algebra, by examining the eigenvalues of $P$. (We assume in this section that $|S| = N < \infty$.) Indeed, if $\pi$ is a stationary distribution, then $\pi$ is a left eigenvector of $P$ corresponding to eigenvalue $\lambda = 1$.

We know that $P$ has an eigenvalue $\lambda = 1$, since the rows of $P$ sum to 1, so we have

$$P \begin{pmatrix} 1 \\ 1 \\ \vdots \\ 1 \end{pmatrix} = \begin{pmatrix} 1 \\ 1 \\ \vdots \\ 1 \end{pmatrix},$$

and therefore $(1, 1, \dots, 1)^T$ is a right eigenvector. To ensure that the corresponding left eigenvector is a stationary distribution, we need to know that its entries are all nonnegative.

Let’s put this issue on hold for a moment, and just assume that the corresponding left eigenvector $\pi$ is a stationary distribution. When is it also a limiting distribution? Suppose that $P$ has a full set of eigenvalues $\lambda_1, \lambda_2, \dots, \lambda_N$ which are distinct, with $\lambda_1 = 1$. Then there exists a matrix $B$ such that

$$P = B^{-1} \Lambda B, \quad \text{where } \Lambda = \begin{pmatrix} \lambda_1 & 0 & \cdots & 0 \\ 0 & \lambda_2 & \cdots & 0 \\ \vdots & & \ddots & \vdots \\ 0 & \cdots & 0 & \lambda_N \end{pmatrix}.$$

The rows of $B$ are left eigenvectors of $P$, and the columns of $B^{-1}$ are right eigenvectors. Therefore

$$P^n = B^{-1} \Lambda^n B, \quad \text{where } \Lambda^n = \begin{pmatrix} \lambda_1^n & 0 & \cdots & 0 \\ 0 & \lambda_2^n & \cdots & 0 \\ \vdots & & \ddots & \vdots \\ 0 & \cdots & 0 & \lambda_N^n \end{pmatrix}.$$

What happens as $n \to \infty$? For the first eigenvalue we have $\lambda_1^n = 1$. Any eigenvalue such that $|\lambda_i| < 1$ will converge to zero, $\lambda_i^n \to 0$. Therefore, there is a limiting distribution only if $|\lambda_i| < 1$ for $i \ge 2$. In this case we have

$$\lim_{n \to \infty} P^n = B^{-1} \begin{pmatrix} 1 & 0 & \cdots & 0 \\ 0 & 0 & \cdots & 0 \\ \vdots & & \ddots & \vdots \\ 0 & \cdots & 0 & 0 \end{pmatrix} B.$$

We know that the right eigenvector associated with $\lambda_1$ is $v = (1, 1, \dots, 1)^T$. By assumption, the left eigenvector is a stationary distribution $\pi$. Therefore we have

$$\lim_{n \to \infty} P^n = \begin{pmatrix} \pi_1 & \pi_2 & \cdots & \pi_N \\ \pi_1 & \pi_2 & \cdots & \pi_N \\ \vdots & \vdots & & \vdots \\ \pi_1 & \pi_2 & \cdots & \pi_N \end{pmatrix},$$

so $\pi$ is also a limiting distribution. (If $P$ does not have a full set of distinct eigenvalues, then we can do a similar calculation using the Jordan canonical form of the matrix.)
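For a two-state chain the calculation above can be carried out by hand, which gives a convenient check. The eigenvalues of a $2 \times 2$ stochastic matrix are 1 and $\lambda_2 = \operatorname{tr}(P) - 1$, and the diagonalization collapses to $P^n = \Pi + \lambda_2^n (I - \Pi)$, where every row of $\Pi$ equals $\pi$. The sketch below verifies this closed form against direct matrix powers for the weather chain; the helper names and the use of the weather chain are illustrative choices.

```python
from fractions import Fraction


def mat_mul(A, B):
    """Product of two matrices stored as lists of rows."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B))) for j in range(len(B[0]))]
            for i in range(len(A))]


def mat_pow(P, n):
    """n-th power of a square matrix by repeated multiplication."""
    size = len(P)
    result = [[Fraction(int(i == j)) for j in range(size)] for i in range(size)]
    for _ in range(n):
        result = mat_mul(result, P)
    return result


P = [[Fraction(4, 5), Fraction(1, 5)],
     [Fraction(2, 5), Fraction(3, 5)]]    # weather chain, states (sun, rain)
pi = [Fraction(2, 3), Fraction(1, 3)]     # its stationary distribution
lam2 = P[0][0] + P[1][1] - 1              # second eigenvalue: trace minus 1


def predicted_power(n):
    """Closed form Pi + lam2**n * (I - Pi) for the n-step transition matrix."""
    return [[pi[j] + lam2 ** n * ((1 if i == j else 0) - pi[j]) for j in range(2)]
            for i in range(2)]
```

Since $|\lambda_2| = 2/5 < 1$ here, the correction term shrinks geometrically and every row of $P^n$ approaches $\pi$.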

We can justify the above calculations using some results from linear algebra.

Lemma. The spectral radius of a stochastic matrix $P$ is 1, i.e. $\rho(P) = \max_\lambda |\lambda| = 1$, where the max is over all eigenvalues.

Proof. Let $\eta$ be a left eigenvector with eigenvalue $\lambda$. Then $\lambda \eta_i = \sum_{j=1}^{N} \eta_j p_{ji}$, and

$$|\lambda| \sum_{i=1}^{N} |\eta_i| = \sum_{i=1}^{N} \Big| \sum_{j=1}^{N} \eta_j p_{ji} \Big| \le \sum_{i,j=1}^{N} |\eta_j|\, p_{ji} = \sum_{j=1}^{N} |\eta_j|.$$

Therefore $|\lambda| \le 1$.

Whew. This is good news – it shows that no eigenvalue of P has complex norm greater than 1 – but it still doesn’t rule out the possibility that there are other eigenvalues with complex norm equal to 1. But, you may recall this theorem from linear algebra.

Theorem. (Perron-Frobenius Theorem, for aperiodic positive matrices.) Let $M$ be a positive^3 $k \times k$ matrix, with $k < \infty$. Then the following statements hold:

(i) There is a positive real number $\lambda_1$ which is an eigenvalue of $M$. All other eigenvalues $\lambda$ of $M$ satisfy $|\lambda| < \lambda_1$.
(ii) The eigenspace of eigenvectors associated with $\lambda_1$ is one-dimensional.
(iii) There exists a positive right eigenvector $v$ and a positive left eigenvector $w$ associated with $\lambda_1$.
(iv) $M$ has no other eigenvector with nonnegative entries.

For a proof, see an advanced linear algebra textbook, such as Lax (1997), Chapter 16. There is also a brief description of the proof in Strang (1988), section 5.3.

The Perron-Frobenius theorem implies that if $P$ is positive, then it has a one-dimensional eigenspace associated with the eigenvalue $\lambda = 1$, and the corresponding left eigenvector $\pi$ is positive. Therefore, $\pi$ is the unique stationary distribution. The theorem also shows that all other eigenvalues have complex norm less than 1, so combined with the calculations above (or an enhanced version of the Perron-Frobenius theorem^4), we have that $\pi$ is a limiting distribution.

^3 $M_{ij} > 0$ for all $i, j$.
^4 See for example Theorems 8.2.7, 8.2.8, in “Matrix Analysis” by Horn & Johnson.

Definition. The first-passage time or first-hitting time of a set $A \subset S$ is defined by

$$T_A = \min\{n \ge 0 : X_n \in A\}.$$

To show that $T_A$ is a stopping time, observe that

$$\{T_A = n\} = \{X_0 \in A^c, X_1 \in A^c, \dots, X_{n-1} \in A^c, X_n \in A\},$$

and the event on the right-hand side depends only on the random variables $X_0, \dots, X_n$.
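The fact that $\{T_A = n\}$ depends only on $X_0, \dots, X_n$ shows up clearly in code: the first-passage time can be computed while reading the path one step at a time, stopping as soon as the set is hit. A minimal sketch, with an illustrative function name and a made-up path:

```python
from typing import Iterable, Optional, Set


def first_passage_time(path: Iterable[int], A: Set[int]) -> Optional[int]:
    """Return T_A = min{n >= 0 : X_n in A}, reading the path one step at a time."""
    for n, x in enumerate(path):
        if x in A:
            return n      # decided using X_0, ..., X_n only
    return None           # the observed (finite) path never entered A


# Hypothetical observed path X_0, X_1, ...
path = [3, 1, 4, 1, 5, 9, 2, 6]
T = first_passage_time(path, {4, 5})   # the path first enters {4, 5} at n = 2
```

The function never looks past the hitting time, which would not be possible for a quantity such as the last visit to $A$.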

Here are some other examples of stopping times:

• $T = c$, where $c \in \mathbb{N}$ is a constant.
• Given two stopping times $S$ and $T$, the random variables $U = S \wedge T$ (minimum of $S$, $T$) and $V = S \vee T$ (maximum of $S$, $T$) are stopping times.
• Given two stopping times $S$ and $T$, the random variable $\tau = S + T$ is a stopping time.
• $T = \min\{n \ge 0 : X_i > a \text{ for } i \in \{n-2, n-1, n\}\}$, where $a \in \mathbb{R}$ is some constant. That is, the first time the process has remained above a level for a sufficiently long amount of time.

Exercise 2.9. Show that all of the examples above are stopping times.

An example of a random time that is not a stopping time is the last visit to a set $A$, i.e. $T = \max\{n : X_n \in A\}$. The event $\{T = n\}$ depends on all future values $X_n, X_{n+1}, \dots$, so it cannot be a stopping time.

Here are some other examples of random times that are not stopping times:

• $T - 1$, where $T$ is a stopping time.
• $S - T$, where $S$, $T$ are stopping times.
• $\frac{1}{2}(S + T)$, where $S$, $T$ are stopping times.
• $T$ = first time to reach $\max(X_0, X_1, \dots)$ (such as, in gambling, the first time to reach the maximum amount of money you will ever reach).

Exercise 2.10. Argue why each of the above examples is not a stopping time.

We can answer many questions about stopping times, by solving linear equations. A common quantity of interest is the average time it takes to hit a set A ⊂ S.

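Definition. The mean first passage time (mfpt) to set $A$ starting at state $i$ is

$$\tau_i = E(T_A \mid X_0 = i). \tag{20}$$

Let’s compute the mfpt $\tau_i$, using a first-step analysis. Let’s assume that $P(T_A < \infty \mid X_0 = i) = 1$ for all $i \in S$, and furthermore that $\tau_i < \infty$ for all $i \in S$.

For $i \in A$, we know that $T_A = 0$. Consider $i \notin A$. Then we have

$$
\begin{aligned}
\tau_i &= \sum_{t=1}^{\infty} t\, P(T_A = t \mid X_0 = i) \\
&= \sum_{t=1}^{\infty} \sum_{j=1}^{\infty} t\, P(T_A = t \mid X_0 = i, X_1 = j)\, P(X_1 = j \mid X_0 = i) && \text{LoTP} \\
&= \sum_{t=1}^{\infty} \sum_{j=1}^{\infty} t\, P(T_A = t \mid X_1 = j)\, P(X_1 = j \mid X_0 = i) && \text{Markov property}
\end{aligned}
$$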

Because the chain is time-homogeneous, we expect that $P(T_A = t \mid X_1 = j) = P(T_A = t-1 \mid X_0 = j)$. To show this explicitly, write

$$
\begin{aligned}
P(T_A = t \mid X_1 = j) &= P(X_2 \in A^c, \dots, X_{t-1} \in A^c, X_t \in A \mid X_1 = j) && \text{by definition} \\
&= P(X_1 \in A^c, \dots, X_{t-2} \in A^c, X_{t-1} \in A \mid X_0 = j) && \text{by time-homogeneity} \\
&= P(T_A = t-1 \mid X_0 = j).
\end{aligned}
$$

Therefore, substituting into the above and changing the index $t \to t + 1$, we have

$$
\begin{aligned}
\tau_i &= \sum_{t=0}^{\infty} \sum_{j=1}^{\infty} (t+1)\, P(T_A = t \mid X_0 = j)\, P_{ij} \\
&= \sum_{j=1}^{\infty} \sum_{t=0}^{\infty} t\, P(T_A = t \mid X_0 = j)\, P_{ij} + \sum_{j=1}^{\infty} \sum_{t=0}^{\infty} P(T_A = t \mid X_0 = j)\, P_{ij} \\
&= \sum_{j=1}^{\infty} \tau_j P_{ij} + 1.
\end{aligned}
$$

The second term is 1, because $\sum_{t=0}^{\infty} P(T_A = t \mid X_0 = j) = 1$, since this sum is the probability that $T_A$ takes any value (we are assuming that $P(T_A < \infty) = 1$). Summing over $j$ gives $\sum_{j=1}^{\infty} P_{ij} = 1$, which holds because we are simply summing the entries of a row of $P$, which form a probability distribution. We can interchange the order of summation in the second step, because all the terms we are adding up are nonnegative, and we assume the sum exists since the mfpt exists.

We just showed the following:

Theorem. Let $\tau = (\tau_1, \tau_2, \dots)^T$ be a vector of mean first passage times from each state $i \in S$. Then $\tau$ solves the following system of equations:

$$\tau_i = \begin{cases} 0 & i \in A, \\ 1 + \sum_j P_{ij} \tau_j & i \notin A. \end{cases} \tag{21}$$

Remark. Another way to write (21) is

$$P' \tau' + \mathbf{1} = \tau', \tag{22}$$

where $P'$ is $P$ with the rows and columns corresponding to elements in $A$ removed, $\tau'$ is $\tau$ with the elements in $A$ removed, and $\mathbf{1}$ is the vector of ones. Equation (22) can in turn be written as

$$(P' - I)\tau' = -\mathbf{1}. \tag{23}$$

This form will make it easier to make the connection to continuous-time Markov chains and processes, later in the course.
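As a worked instance of (23), the sketch below computes mean first passage times for a small chain chosen here for illustration (it does not appear in the lecture): the symmetric random walk on $\{0, 1, 2, 3, 4\}$ with target set $A = \{0, 4\}$. The rows of $P$ for states in $A$ are irrelevant, since they are removed when forming $P'$; they are set to absorbing rows only to keep $P$ stochastic. The helper names `solve` and `mean_first_passage_times` are illustrative.

```python
from fractions import Fraction


def solve(A, b):
    """Solve A x = b by Gauss-Jordan elimination; A must be square and invertible."""
    n = len(A)
    M = [[Fraction(x) for x in row] + [Fraction(rhs)] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot_row = next(r for r in range(col, n) if M[r][col] != 0)
        M[col], M[pivot_row] = M[pivot_row], M[col]
        pivot = M[col][col]
        M[col] = [x / pivot for x in M[col]]
        for r in range(n):
            if r != col:
                factor = M[r][col]
                M[r] = [x - factor * y for x, y in zip(M[r], M[col])]
    return [M[r][n] for r in range(n)]


def mean_first_passage_times(P, A_set):
    """Return tau with tau_i = 0 on A and (P' - I) tau' = -1 off A."""
    n = len(P)
    keep = [i for i in range(n) if i not in A_set]     # states outside A
    reduced = [[P[i][j] - (1 if i == j else 0) for j in keep] for i in keep]
    tau_reduced = solve(reduced, [-1] * len(keep))
    tau = [Fraction(0)] * n
    for i, t in zip(keep, tau_reduced):
        tau[i] = t
    return tau


half = Fraction(1, 2)
P = [[1, 0, 0, 0, 0],
     [half, 0, half, 0, 0],
     [0, half, 0, half, 0],
     [0, 0, half, 0, half],
     [0, 0, 0, 0, 1]]
tau = mean_first_passage_times(P, {0, 4})
```

For this walk the system reads $\tau_1 = 1 + \tfrac{1}{2}\tau_2$, $\tau_2 = 1 + \tfrac{1}{2}\tau_1 + \tfrac{1}{2}\tau_3$, $\tau_3 = 1 + \tfrac{1}{2}\tau_2$, whose solution is $\tau = (0, 3, 4, 3, 0)$.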
