MDP , Lecture Notes - Computer Science, Study notes of Computer Numerical Control

Prof. David C Parkes, Computer Science, Markov Decision Processes, Infinite Horizon Value Iteration, Comparing VI and PI, Harvard, Lecture Notes

CS181 Lecture 17 MDPs
Avi Pfeffer; Revised by David Parkes
April 4, 2011

Today we continue to explore MDPs, including an extension of value iteration to an infinite horizon setting and a discussion of the important method of policy iteration.

1 Infinite Horizon Value Iteration

So far, we’ve focused only on the finite horizon case. What about the infinite horizon, total discounted reward case? It turns out that the value iteration idea generalizes quite easily to this case. First, we need to modify the value-iteration equations, defining the k-step-to-go rewards and policies, to take into account the discount factor:

$$V_0(s) = 0 \tag{1}$$

$$Q_k(s, a) = R(s, a) + \gamma \sum_{s'} P(s' \mid s, a)\, V_{k-1}(s') \tag{2}$$

$$\pi_k^*(s) = \arg\max_{a \in A} Q_k(s, a) \tag{3}$$

$$V_k(s) = Q_k(s, \pi_k^*(s)) \tag{4}$$

We consider now what happens as we take k to infinity. Does this limit generate a policy $\pi^*$ that is optimal for the infinite horizon case? This idea is implemented in the infinite horizon version of value iteration:

ValueIteration(γ) = // Infinite horizon version
    // Takes the discount factor γ
    // Returns the optimal value and policy for each state
    For each state s ∈ S
        V(s) = 0
    Repeat
        For each state s ∈ S
            V′(s) = V(s) // store the old value
        For each state s ∈ S
            For each action a ∈ A
                Q(s, a) = R(s, a) + γ Σ_{s′} P(s′ | s, a) V′(s′)
            π∗(s) = arg max_{a∈A} Q(s, a)
            V(s) = Q(s, π∗(s))
    Until convergence // ∀s |V(s) − V′(s)| < ε
    Return V, π∗.
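As a minimal sketch, the ValueIteration pseudocode above might be written in Python as follows. The nested-list layout (`R[s][a]` for rewards, `P[s][a][s2]` for transition probabilities), the function name `value_iteration`, the default tolerance, and the two-state example MDP are all illustrative assumptions, not part of the lecture.

```python
def value_iteration(R, P, gamma, eps=1e-8):
    """Infinite-horizon value iteration (sketch).

    R[s][a] is the reward and P[s][a][s2] the probability of moving
    from s to s2 under action a. Stops once max_s |V(s) - V'(s)| < eps.
    """
    n_states = len(R)
    n_actions = len(R[0])
    V = [0.0] * n_states
    while True:
        V_old = V[:]  # store the old value
        pi = [0] * n_states
        for s in range(n_states):
            Q = [
                R[s][a] + gamma * sum(P[s][a][s2] * V_old[s2] for s2 in range(n_states))
                for a in range(n_actions)
            ]
            pi[s] = max(range(n_actions), key=lambda a: Q[a])  # ties go to the lowest index
            V[s] = Q[pi[s]]
        if max(abs(V[s] - V_old[s]) for s in range(n_states)) < eps:
            return V, pi


# Hypothetical two-state, two-action MDP, used only for illustration.
R = [[0.0, 1.0], [2.0, 0.0]]
P = [
    [[1.0, 0.0], [0.2, 0.8]],  # state 0: action 0 stays, action 1 mostly moves to state 1
    [[0.0, 1.0], [1.0, 0.0]],  # state 1: action 0 stays, action 1 returns to state 0
]
V, pi = value_iteration(R, P, gamma=0.9)
print(V, pi)
```

The stopping test mirrors the pseudocode's convergence check; the tolerance `eps` is a free choice that trades accuracy for the number of sweeps.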

1.1 Analysis

Does the infinite horizon value iteration algorithm work? Does the value function converge? If it does converge, does it converge to the optimal value function? Intuitively the algorithm makes sense, as it takes the finite horizon case to the limit, but sometimes our intuitions break down going from finite problems to infinite problems. In order to answer these questions, we need some analysis.

The first concept to define is that of a stationary policy. A stationary policy is one that only depends on the current state, and not on time or history. A stationary policy π will always specify the same action π(s) at a state s. (We continue to consider only deterministic policies.)

It is a fact that in the infinite horizon case, there is an optimal stationary policy. Intuitively this is clear. No matter what has happened in the past, when we are at a state s we need to choose the action that maximizes our expected utility from here on out. And the future from state s looks the same any time we are in state s because the horizon is infinite. This is in contrast to the finite horizon case. With a finite horizon, the policy depends on the horizon. If the horizon is small, a greedy policy that optimizes immediate reward will be better. If the horizon is large, a long-term view makes more sense.

For the infinite horizon case, since we know that there exists an optimal stationary (and deterministic) policy, we can restrict our attention to such policies. The next question to ask is, suppose we play a stationary policy π. What is the expected utility of beginning at a state s, and playing the policy π? This is called the value of s under π, or the MDP value of π, and is denoted by $V^\pi(s)$.

We can answer this question by the following observation. Since π is stationary, we know that if we reach a state s′ after one step, the expected future reward starting from s′ will be $V^\pi(s')$. Furthermore, π specifies what action to take at s, so we know the distribution over the successor states s′. Therefore, the expected reward from s under π is given by

$$V^\pi(s) = R(s, \pi(s)) + \gamma \sum_{s'} P(s' \mid s, \pi(s))\, V^\pi(s') \tag{5}$$

We have an equation like this for each state s. These equations define a set of n (for n states) linear equations with n variables. It can be shown that they in fact have a unique solution.

So, where are we? We know that there is a stationary optimal policy $\pi^*$. We also know that for any stationary policy, the value function under that policy satisfies Equation 5. So we can plug in $\pi^*$ to get

$$V^{\pi^*}(s) = R(s, \pi^*(s)) + \gamma \sum_{s'} P(s' \mid s, \pi^*(s))\, V^{\pi^*}(s') \tag{6}$$
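Because Equation (5) is a linear system in the n unknowns $V^\pi(s)$, it can be solved directly. The sketch below rewrites it as $(I - \gamma P_\pi) V = R_\pi$ and solves it with a small hand-written Gaussian elimination. The helper names `solve_linear` and `evaluate_policy`, the nested-list layout, and the example MDP are assumptions made for illustration.

```python
def solve_linear(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        tail = sum(M[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (M[r][n] - tail) / M[r][r]
    return x


def evaluate_policy(R, P, gamma, pi):
    """Return V^pi by solving (I - gamma * P_pi) V = R_pi, i.e. Equation (5)."""
    n = len(R)
    A = [
        [(1.0 if s == s2 else 0.0) - gamma * P[s][pi[s]][s2] for s2 in range(n)]
        for s in range(n)
    ]
    b = [R[s][pi[s]] for s in range(n)]
    return solve_linear(A, b)


# Hypothetical two-state, two-action MDP, used only for illustration.
R = [[0.0, 1.0], [2.0, 0.0]]
P = [
    [[1.0, 0.0], [0.2, 0.8]],
    [[0.0, 1.0], [1.0, 0.0]],
]
V_pi = evaluate_policy(R, P, gamma=0.9, pi=[1, 0])
print(V_pi)
```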

We have

$$\begin{aligned}
&\text{initialize } V_0(s) = 0 \text{ for all } s \\
&V_1 \leftarrow B(V_0) \\
&V_2 \leftarrow B(V_1) \\
&V_3 \leftarrow B(V_2) \\
&\cdots
\end{aligned}$$

The questions we asked at the beginning of this section can now be refined: Does this process of iterating B converge to a limit? Is the limit unique? The answer is yes, because B has a special property: it is a contraction. This means that given any two value functions $V_1$ and $V_2$, with $V_1 \neq V_2$, B brings them closer to each other. Formally, the contraction property that is satisfied by the Bellman operator is:

$$\|B(V_1) - B(V_2)\| \le \gamma \|V_1 - V_2\|, \tag{10}$$

where $V_1 \neq V_2$, the norm $\|V\| = \max_s |V(s)|$ is the max-norm, and $\gamma \in (0, 1)$ is the discount factor (recall that γ < 1, so the distance between two distinct points strictly decreases). Intuitively, if we have a process that keeps bringing points closer to each other, and we iterate the process, then we will eventually reach a fixpoint. The Contraction Mapping Theorem says that under appropriate conditions, which are satisfied here, this is in fact the case: the process of iterating B has a limit, which is the unique fixpoint of B.
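The contraction property can be checked numerically on a small example. The definition of B does not appear in this excerpt, so the sketch below assumes the usual Bellman backup, $(BV)(s) = \max_a [R(s,a) + \gamma \sum_{s'} P(s' \mid s, a) V(s')]$. The function names, the random test points, and the example MDP are likewise illustrative assumptions.

```python
import random


def bellman_backup(V, R, P, gamma):
    """Assumed form of B: (B V)(s) = max_a [R(s,a) + gamma * sum_s' P(s'|s,a) V(s')]."""
    n = len(R)
    return [
        max(
            R[s][a] + gamma * sum(P[s][a][s2] * V[s2] for s2 in range(n))
            for a in range(len(R[s]))
        )
        for s in range(n)
    ]


def max_norm_dist(U, V):
    """Max-norm distance ||U - V|| = max_s |U(s) - V(s)|."""
    return max(abs(u - v) for u, v in zip(U, V))


# Hypothetical two-state, two-action MDP, used only for illustration.
R = [[0.0, 1.0], [2.0, 0.0]]
P = [
    [[1.0, 0.0], [0.2, 0.8]],
    [[0.0, 1.0], [1.0, 0.0]],
]
gamma = 0.9

# Empirically compare ||B(V1) - B(V2)|| with ||V1 - V2|| on random pairs.
rng = random.Random(0)
worst_ratio = 0.0
for _ in range(1000):
    V1 = [rng.uniform(-10, 10) for _ in range(2)]
    V2 = [rng.uniform(-10, 10) for _ in range(2)]
    before = max_norm_dist(V1, V2)
    after = max_norm_dist(bellman_backup(V1, R, P, gamma), bellman_backup(V2, R, P, gamma))
    worst_ratio = max(worst_ratio, after / before)

# Iterating B from two different starting points should approach the same fixpoint.
Va, Vb = [0.0, 0.0], [100.0, -50.0]
for _ in range(200):
    Va = bellman_backup(Va, R, P, gamma)
    Vb = bellman_backup(Vb, R, P, gamma)
print(worst_ratio, max_norm_dist(Va, Vb))
```

A random check like this is evidence for one MDP, not a proof; the guarantee comes from the Contraction Mapping Theorem.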

Example: The function $f(x) = x/2$ has a contraction property, so that $x' \leftarrow x/2$ is a contraction operator, and has unique fixpoint $x^* = 0$.

When an operator is a contraction then we always have a unique fixpoint. Recall that the contraction property insists that $\|f(x) - f(y)\| < \|x - y\|$ when $x \neq y$, and can be easily seen to hold for $f(x) = x/2$. Now, suppose for contradiction that there are two distinct fixpoints, $x^* \neq y^*$, so that $\|f(x^*) - f(y^*)\| = \|x^* - y^*\|$; but this contradicts the contraction property. In addition, iterating also converges to the fixpoint. Consider any $x \neq x^*$. Then $\|f(x) - x^*\| = \|f(x) - f(x^*)\| < \|x - x^*\|$, where the equality holds because $x^*$ is a fixpoint and the inequality holds by the contraction property.
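The example $f(x) = x/2$ can be run directly. The helper `iterate_to_fixpoint`, its tolerance, and the starting points below are hypothetical choices made only to illustrate repeated application of a contraction.

```python
def iterate_to_fixpoint(f, x0, tol=1e-9, max_steps=10_000):
    """Apply x <- f(x) until successive iterates differ by less than tol."""
    x = x0
    for step in range(1, max_steps + 1):
        x_next = f(x)
        if abs(x_next - x) < tol:
            return x_next, step
        x = x_next
    raise RuntimeError("no convergence within max_steps")


def halve(x):
    """The contraction f(x) = x / 2 from the example, with fixpoint 0."""
    return x / 2


x_star, steps = iterate_to_fixpoint(halve, 8.0)
y_star, _ = iterate_to_fixpoint(halve, -3.0)  # a different start reaches the same fixpoint
print(x_star, y_star, steps)
```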

1.2 Convergence rate of VI

For value iteration, the smaller γ is, the faster the iteration process converges. So in fact, γ has a crucial impact on the complexity of infinite horizon value iteration. For γ close to 1, convergence will be slow. In fact, if we think about it, this is not surprising. When γ is close to 1, it means that future rewards are only slightly discounted, so the optimal policy needs to take a long-term view, just like a large horizon in the finite horizon case. (This section is based on the discussion in Russell and Norvig.)

Theorem 1 states that value iteration will converge in the limit to the unique solution of the Bellman equations. For any desired error between $V^*(s)$ and $V(s)$, there is some finite number of iterations after which this error gap will be closed. In this section, we discuss the impact of the contraction property,

$$\|B(V_1) - B(V_2)\| \le \gamma \|V_1 - V_2\|,$$

for any $V_1 \neq V_2$, on the ability of value iteration (VI) to make rapid progress. We will also consider the impact of a residual error in our approximation on the quality of the policy finally adopted by an agent. We first consider the number of updates that will be required for VI to achieve a particular error bound $\epsilon$. For this, we assume that the reward for any state-action pair is bounded, with

$$-R_{\max} \le R(s, a) \le R_{\max}, \quad \forall (s, a) \tag{11}$$

By the property of a geometric series, we have

$$-\frac{R_{\max}}{1 - \gamma} \le V^\pi(s) \le \frac{R_{\max}}{1 - \gamma}, \quad \forall \pi, \forall s$$

and in particular, we have

$$\|V_0 - V^*\| \le \frac{2 R_{\max}}{1 - \gamma}$$

for any initial value function $V_0$ that itself lies within these bounds (such as $V_0 = 0$). One update achieves the following

$$\|B(V) - V^*\| \le \gamma \|V - V^*\|$$

We wish to find K, the number of iterations, such that

$$\gamma^K \frac{2 R_{\max}}{1 - \gamma} \le \epsilon$$

and the final value function is within $\epsilon$ of the optimal value function. Solving, we find:

$$K = \left\lceil \frac{\log\left[\frac{2 R_{\max}}{\epsilon (1 - \gamma)}\right]}{\log \frac{1}{\gamma}} \right\rceil$$

where the notation $\lceil x \rceil$ takes the ceiling of x, i.e., rounds up to the nearest integer. This shows that the rate of convergence is good with respect to the error, with the number of rounds increasing as $\log(1/\epsilon)$. On the other hand we see a weakness with respect to the discount factor γ. As this goes towards one, $\log(1/\gamma)$ approaches zero from above, and K grows rapidly.
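To get a feel for how K depends on γ, the bound can be evaluated for a few discount factors. This is a small sketch; the function name `iterations_needed` and the sample values of γ, ε and $R_{\max}$ are assumptions chosen for illustration.

```python
import math


def iterations_needed(gamma, eps, r_max):
    """Smallest K with gamma**K * 2*r_max / (1 - gamma) <= eps, per the bound above."""
    return math.ceil(math.log(2 * r_max / (eps * (1 - gamma))) / math.log(1 / gamma))


# Illustrative values only: the number of iterations grows quickly as gamma nears 1.
for gamma in (0.5, 0.9, 0.99, 0.999):
    print(gamma, iterations_needed(gamma, eps=0.01, r_max=1.0))
```

Note that K is a worst-case bound derived from the contraction property; value iteration may reach the desired accuracy, or the optimal policy, in fewer iterations.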

PolicyIteration(γ) =
    // Takes the discount factor γ
    // Returns the optimal policy for each state
    Let π be any policy
    Repeat
        π_old = π
        Solve the system of equations
            V(s) = R(s, π_old(s)) + γ Σ_{s′} P(s′ | s, π_old(s)) V(s′)
        to obtain V(s)
        For each state s
            For each action a
                Q(s, a) = R(s, a) + γ Σ_{s′} P(s′ | s, a) V(s′)
            π_new(s) = arg max_a Q(s, a) // Assume ties are broken in a consistent way
        π = π_new
    Until π_new = π_old
    Return π.

Note that we repeat until the policy stops changing. Policy iteration is a neat idea. Does it actually work? The homework asks you to prove that it does. The key steps in the proof are to prove the following two claims:

  1. If policy iteration returns policy π, then π is an optimal policy.
  2. The policies produced by policy iteration get better and better, in the sense that $V^{\pi_{\text{new}}}(s) \ge V^{\pi_{\text{old}}}(s)$ for every state s, so the value increases monotonically from iteration to iteration.

From these one can deduce that policy iteration always terminates with the optimal policy after a finite number of steps. Contrast this with value iteration, which only achieves the optimal value function in the limit (but may achieve the optimal policy earlier.)
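The PolicyIteration pseudocode above might be sketched in Python as follows. The evaluation step solves the linear system with a small hand-written Gaussian elimination. The function names, the nested-list layout (`R[s][a]`, `P[s][a][s2]`), the choice to return V alongside π, and the example MDP are assumptions made for illustration. Ties are broken toward the lowest action index, which is one way to satisfy the pseudocode's consistency requirement.

```python
def solve_linear(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        tail = sum(M[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (M[r][n] - tail) / M[r][r]
    return x


def policy_iteration(R, P, gamma):
    """Alternate exact policy evaluation and greedy improvement until the policy is stable."""
    n, m = len(R), len(R[0])
    pi = [0] * n  # "Let pi be any policy"
    while True:
        pi_old = pi
        # Policy evaluation: solve (I - gamma * P_pi) V = R_pi.
        A = [
            [(1.0 if s == s2 else 0.0) - gamma * P[s][pi_old[s]][s2] for s2 in range(n)]
            for s in range(n)
        ]
        b = [R[s][pi_old[s]] for s in range(n)]
        V = solve_linear(A, b)
        # Policy improvement: act greedily with respect to V.
        pi = []
        for s in range(n):
            Q = [
                R[s][a] + gamma * sum(P[s][a][s2] * V[s2] for s2 in range(n))
                for a in range(m)
            ]
            pi.append(max(range(m), key=lambda a: Q[a]))  # ties go to the lowest index
        if pi == pi_old:
            return pi, V


# Hypothetical two-state, two-action MDP, used only for illustration.
R = [[0.0, 1.0], [2.0, 0.0]]
P = [
    [[1.0, 0.0], [0.2, 0.8]],
    [[0.0, 1.0], [1.0, 0.0]],
]
pi, V = policy_iteration(R, P, gamma=0.9)
print(pi, V)
```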

3 Comparing VI and PI

How do value iteration and policy iteration compare? It is a fact that policy iteration takes at most as many iterations to reach the optimal policy as value iteration does. In practice, it usually takes far fewer iterations. In policy iteration, the policy always changes every iteration. In contrast, in value iteration, the value function changes every iteration, but the optimal policy relative to that value function, i.e., the optimal policy given that that value function will be achieved in future, may stay the same for several successive iterations.

On the other hand, each individual iteration of policy iteration takes longer, because it requires solving the complete system of linear equations of Equation (5). This can be done via Gaussian elimination, for example, in $O(n^3)$. Compare with value iteration, in which each step of the algorithm takes $O(nmL)$ (for nm updates of Q-values, on n states and m actions, with each taking time L, where L is the max number of states reachable by any action a from any state s). In general, however, policy iteration is considered the better algorithm in practice.

One way to make policy iteration more efficient than value iteration is to dispense with solving Equation (5) exactly, and to be satisfied with an approximate solution. Note that Equation (5) is itself a fixpoint equation, and can be solved by the same kind of iterative method we have been using for solving the optimality equations! We can start with an arbitrary value function, and apply the right hand side of (5) for a fixed number of steps or until approximate convergence, to get better and better approximations to $V^\pi$. This method of approximating the value function is called value propagation, and can be used instead of solving the equations in the policy iteration algorithm. We obtain the modified policy iteration algorithm, which tends to be more effective than value iteration and regular policy iteration.
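A rough sketch of modified policy iteration replaces the exact solve with a fixed number of value propagation sweeps. Several details here are assumptions rather than part of the lecture: the value function is carried over between iterations as a warm start, the loop stops when a round of sweeps changes V by less than `eps`, and the function name, the `sweeps` parameter, and the example MDP are hypothetical.

```python
def modified_policy_iteration(R, P, gamma, sweeps=5, eps=1e-10):
    """Policy iteration with approximate evaluation by value propagation (sketch)."""
    n, m = len(R), len(R[0])
    V = [0.0] * n  # arbitrary starting value function
    while True:
        # Policy improvement: greedy policy with respect to the current V.
        pi = []
        for s in range(n):
            Q = [
                R[s][a] + gamma * sum(P[s][a][s2] * V[s2] for s2 in range(n))
                for a in range(m)
            ]
            pi.append(max(range(m), key=lambda a: Q[a]))  # ties go to the lowest index
        # Value propagation: apply the right-hand side of Equation (5)
        # a fixed number of times instead of solving the system exactly.
        V_new = V
        for _ in range(sweeps):
            V_new = [
                R[s][pi[s]] + gamma * sum(P[s][pi[s]][s2] * V_new[s2] for s2 in range(n))
                for s in range(n)
            ]
        # Assumed stopping rule: a full round of sweeps barely changes V.
        if max(abs(V_new[s] - V[s]) for s in range(n)) < eps:
            return pi, V_new
        V = V_new


# Hypothetical two-state, two-action MDP, used only for illustration.
R = [[0.0, 1.0], [2.0, 0.0]]
P = [
    [[1.0, 0.0], [0.2, 0.8]],
    [[0.0, 1.0], [1.0, 0.0]],
]
pi, V = modified_policy_iteration(R, P, gamma=0.9)
print(pi, V)
```

With `sweeps=1` each round is a single backup under the greedy policy, which behaves like value iteration; larger values of `sweeps` move the method toward regular policy iteration.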