




Prof. David C Parkes, Computer Science, Markov Decision Processes, Infinite Horizon Value Iteration, Comparing VI and PI, Harvard, Lecture Notes





Today we continue to explore MDPs, including an extension of value iteration to an infinite horizon setting and a discussion of the important method of policy iteration.
So far, we've focused only on the finite horizon case. What about the infinite horizon, total discounted reward case? It turns out that the value iteration idea generalizes quite easily to this case. First, we need to modify the value-iteration equations, defining the k-step-to-go rewards and policies, to take into account the discount factor:
$$V_0(s) = 0 \tag{1}$$
$$Q_k(s, a) = R(s, a) + \gamma \sum_{s'} P(s' \mid s, a)\, V_{k-1}(s') \tag{2}$$
$$\pi_k^*(s) = \arg\max_{a \in A} Q_k(s, a) \tag{3}$$
$$V_k(s) = Q_k(s, \pi_k^*(s)) \tag{4}$$
We consider now what happens as we take $k$ to infinity. Does this limit generate a policy $\pi^*$ that is optimal for the infinite horizon case? This idea is implemented in the infinite horizon version of value iteration:
ValueIteration(γ) =  // Infinite horizon version
  // Takes the discount factor γ
  // Returns the optimal value and policy for each state
  For each state s ∈ S
    V(s) = 0
  Repeat
    For each state s ∈ S
      V′(s) = V(s)  // store the old value
    For each state s ∈ S
      For each action a ∈ A
        Q(s, a) = R(s, a) + γ Σ_{s′} P(s′ | s, a) V′(s′)
      π∗(s) = arg max_{a∈A} Q(s, a)
      V(s) = Q(s, π∗(s))
  Until convergence  // ∀s: |V(s) − V′(s)| < ε
  Return V, π∗
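As a rough illustration, the infinite horizon algorithm can be sketched in Python. The nested-list representation (`R[s][a]` for rewards, `P[s][a][t]` for the probability of moving from `s` to `t` under `a`), the tolerance `eps`, and the two-state MDP are all assumptions made for this sketch and do not come from the notes.

```python
def value_iteration(R, P, gamma, eps=1e-8):
    """Infinite horizon value iteration.

    R[s][a]    : reward for taking action a in state s
    P[s][a][t] : probability of reaching state t from s under action a
    """
    n_states, n_actions = len(R), len(R[0])
    V = [0.0] * n_states
    while True:
        V_old = V[:]  # store the old values
        policy = [0] * n_states
        for s in range(n_states):
            Q = [R[s][a] + gamma * sum(P[s][a][t] * V_old[t] for t in range(n_states))
                 for a in range(n_actions)]
            policy[s] = max(range(n_actions), key=Q.__getitem__)  # ties -> lowest index
            V[s] = Q[policy[s]]
        # Stop once no state's value moved by eps or more (max-norm test)
        if max(abs(V[s] - V_old[s]) for s in range(n_states)) < eps:
            return V, policy


# Hypothetical MDP: state 1 pays reward 1 for action 0 (stay);
# from state 0, action 1 reaches state 1 with probability 0.8.
R = [[0.0, 0.0], [1.0, 0.0]]
P = [[[1.0, 0.0], [0.2, 0.8]],
     [[0.0, 1.0], [1.0, 0.0]]]
V, policy = value_iteration(R, P, gamma=0.9)
print(V, policy)
```

The loop stops when the largest change in any state's value drops below `eps`, which is only an approximation of "convergence"; a smaller `eps` gives a tighter answer at the cost of more sweeps.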
Does the infinite horizon value iteration algorithm work? Does the value function converge? If it does converge, does it converge to the optimal value function? Intuitively the algorithm makes sense, as it takes the finite horizon case to the limit, but sometimes our intuitions break down going from finite problems to infinite problems. In order to answer these questions, we need some analysis.
The first concept to define is that of a stationary policy. A stationary policy is one that only depends on the current state, and not on time or history. A stationary policy $\pi$ will always specify the same action $\pi(s)$ at a state $s$. (We continue to consider only deterministic policies.) It is a fact that in the infinite horizon case, there is an optimal stationary policy. Intuitively this is clear. No matter what has happened in the past, when we are at a state $s$ we need to choose the action that maximizes our expected utility from here on out. And the future from state $s$ looks the same any time we are in state $s$ because the horizon is infinite.
This is in contrast to the finite horizon case. With a finite horizon, the policy depends on the horizon. If the horizon is small, a greedy policy that optimizes immediate reward will be better. If the horizon is large, a long-term view makes more sense.
For the infinite horizon case, since we know that there exists an optimal stationary (and deterministic) policy, we can restrict our attention to such policies. The next question to ask is, suppose we play a stationary policy $\pi$. What is the expected utility of beginning at a state $s$, and playing the policy $\pi$? This is called the value of $s$ under $\pi$, or the MDP value of $\pi$, and is denoted by $V^\pi(s)$.
We can answer this question by the following observation. Since $\pi$ is stationary, we know that if we reach a state $s'$ after one step, the expected future reward starting from $s'$ will be $V^\pi(s')$. Furthermore, $\pi$ specifies what action to take at $s$, so we know the distribution over the successor states $s'$. Therefore, the expected reward from $s$ under $\pi$ is given by
$$V^\pi(s) = R(s, \pi(s)) + \gamma \sum_{s'} P(s' \mid s, \pi(s))\, V^\pi(s') \tag{5}$$
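Written for all states at once, Equation (5) is a linear system in the unknowns $V^\pi(s)$, so for a small MDP it can be solved directly. The following is a minimal sketch using Gauss-Jordan elimination in plain Python; the data layout (`R[s][a]`, `P[s][a][t]`), the function name `evaluate_policy`, and the example MDP are assumptions made for illustration.

```python
def evaluate_policy(R, P, policy, gamma):
    """Solve V(s) = R(s, pi(s)) + gamma * sum_t P(t | s, pi(s)) V(t) for all s.

    Rearranged as (I - gamma * P_pi) V = R_pi and solved by Gauss-Jordan
    elimination with partial pivoting.
    """
    n = len(R)
    # Augmented matrix [I - gamma * P_pi | R_pi]
    A = [[(1.0 if s == t else 0.0) - gamma * P[s][policy[s]][t] for t in range(n)]
         + [R[s][policy[s]]]
         for s in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(n):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [x - factor * y for x, y in zip(A[r], A[col])]
    return [A[s][n] / A[s][s] for s in range(n)]


# Hypothetical two-state, two-action MDP
R = [[0.0, 0.0], [1.0, 0.0]]
P = [[[1.0, 0.0], [0.2, 0.8]],
     [[0.0, 1.0], [1.0, 0.0]]]
pi = [1, 0]  # action 1 in state 0, action 0 in state 1
V_pi = evaluate_policy(R, P, pi, gamma=0.9)
print(V_pi)
```

For $\gamma < 1$ the matrix $I - \gamma P_\pi$ is strictly diagonally dominant, so the elimination never meets a zero pivot.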
We have an equation like this for each state $s$. These equations define a set of $n$ (for $n$ states) linear equations with $n$ variables. It can be shown that they in fact have a unique solution.
So, where are we? We know that there is a stationary optimal policy $\pi^*$. We also know that for any stationary policy, the value function under that policy satisfies Equation 5. So we can plug in $\pi^*$ to get
$$V^{\pi^*}(s) = R(s, \pi^*(s)) + \gamma \sum_{s'} P(s' \mid s, \pi^*(s))\, V^{\pi^*}(s') \tag{6}$$
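Writing $B$ for the Bellman operator, which applies one value-iteration update to a value function, $(B(V))(s) = \max_{a \in A} \big[ R(s, a) + \gamma \sum_{s'} P(s' \mid s, a)\, V(s') \big]$, we have
  initialize $V_0(s) = 0$ for all $s$
  $V_1 \leftarrow B(V_0)$
  $V_2 \leftarrow B(V_1)$
  $V_3 \leftarrow B(V_2)$
  $\cdots$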
The questions we asked at the beginning of this section can now be refined: Does this process of iterating $B$ converge to a limit? Is the limit unique? The answer is yes, because $B$ has a special property: it is a contraction. This means that given any two value functions $V_1$ and $V_2$, with $V_1 \ne V_2$, $B$ brings them closer to each other. Formally, the contraction property that is satisfied by the Bellman operator is:
$$\|B(V_1) - B(V_2)\| \le \gamma \|V_1 - V_2\|, \tag{10}$$
where $V_1 \ne V_2$, the norm $\|V\| = \max_s |V(s)|$ is the max-norm, and $\gamma \in (0, 1)$ is the discount factor (recall that $\gamma < 1$, so the distance between two distinct points strictly decreases). Intuitively, if we have a process that keeps bringing points closer to each other, and we iterate the process, then we will eventually reach a fixpoint. The Contraction Mapping Theorem says that under appropriate conditions, which are satisfied here, this is in fact the case: the process of iterating $B$ has a limit, which is the unique fixpoint of $B$.
Example: The function $f(x) = x/2$ has a contraction property, so that $x' \leftarrow x/2$ is a contraction operator, and has unique fixpoint $x^* = 0$.
When an operator is a contraction then we always have a unique fixpoint. Recall that the contraction property insists that $\|f(x) - f(y)\| < \|x - y\|$ when $x \ne y$, and can be easily seen to hold for $f(x) = x/2$. Now, suppose for contradiction that there are two distinct fixpoints, $x^* \ne y^*$, so that $\|f(x^*) - f(y^*)\| = \|x^* - y^*\|$, but then this is a contradiction with the contraction property. In addition, iterating also converges to the fixpoint. Consider any $x \ne x^*$. Then $\|f(x) - x^*\| = \|f(x) - f(x^*)\| < \|x - x^*\|$, where the equality holds by the fact that $x^*$ is a fixpoint and the inequality by the contraction property.
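The same behaviour can be checked numerically for the Bellman operator. The sketch below assumes that $B$ is the one-step value-iteration update, $(B(V))(s) = \max_a [R(s, a) + \gamma \sum_{s'} P(s' \mid s, a) V(s')]$, and applies it to two arbitrary value functions on a hypothetical two-state MDP; the data layout, the helper names, and the starting vectors are illustrative assumptions.

```python
def bellman(V, R, P, gamma):
    """One application of B to the value function V (a list indexed by state)."""
    n = len(V)
    return [max(R[s][a] + gamma * sum(P[s][a][t] * V[t] for t in range(n))
                for a in range(len(R[s])))
            for s in range(n)]


def max_norm_distance(U, V):
    """Max-norm distance ||U - V|| = max_s |U(s) - V(s)|."""
    return max(abs(u - v) for u, v in zip(U, V))


# Hypothetical two-state, two-action MDP
R = [[0.0, 0.0], [1.0, 0.0]]
P = [[[1.0, 0.0], [0.2, 0.8]],
     [[0.0, 1.0], [1.0, 0.0]]]
gamma = 0.9

V1, V2 = [5.0, -3.0], [0.0, 20.0]  # two arbitrary starting points
distances = [max_norm_distance(V1, V2)]
for _ in range(10):
    V1, V2 = bellman(V1, R, P, gamma), bellman(V2, R, P, gamma)
    distances.append(max_norm_distance(V1, V2))

# Each step should shrink the distance by at least a factor of gamma
# (a small tolerance absorbs floating-point rounding).
shrinks = [after <= gamma * before + 1e-12
           for before, after in zip(distances, distances[1:])]
print(distances)
print(all(shrinks))
```

Because both sequences are pulled towards each other at every step, they must approach the same limit, which is the fixpoint argument made above.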
For value iteration, the smaller $\gamma$ is, the faster the iteration process converges. So in fact, $\gamma$ has a crucial impact on the complexity of infinite horizon value iteration. For $\gamma$ close to 1, convergence will be slow. In fact, if we think about it, this is not surprising. When $\gamma$ is close to 1, it means that future rewards are only slightly discounted, so the optimal policy needs to take a long-term view, just like a large horizon in the finite horizon case. (This section is based on the discussion in Russell and Norvig.)
Theorem 1 states that value iteration will converge in the limit to the unique solution of the Bellman equations. For any desired bound on the error between $V^*(s)$ and $V(s)$, there is some finite number of iterations after which the error falls within that bound. In this section, we discuss the impact of the contraction property,
$$\|B(V_1) - B(V_2)\| \le \gamma \|V_1 - V_2\|,$$
for any $V_1 \ne V_2$, on the ability of value iteration (VI) to make rapid progress. We will also consider the impact of a residual error in our approximation on the quality of the policy finally adopted by an agent.
We first consider the number of updates that will be required for VI to achieve a particular error bound $\epsilon$. For this, we assume that the reward for any state action pair is bounded, with
$$-R_{\max} \le R(s, a) \le R_{\max}, \quad \forall (s, a) \tag{11}$$
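By the property of a geometric series, we have
$$-\frac{R_{\max}}{1 - \gamma} \le V^\pi(s) \le \frac{R_{\max}}{1 - \gamma}, \quad \forall \pi, \forall s$$
and in particular, we have
$$\|V_0 - V^*\| \le \frac{2 R_{\max}}{1 - \gamma}$$
for any initial value function $V_0$ whose values lie in the same range. One update achieves the following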
$$\|B(V) - V^*\| \le \gamma \|V - V^*\|$$
so that $K$ updates shrink the initial error by a factor of $\gamma^K$. We wish to find $K$, the number of iterations, such that
$$\gamma^K \frac{2 R_{\max}}{1 - \gamma} \le \epsilon$$
and the final value function is within $\epsilon$ of the optimal value function. Solving, we find:
$$K = \left\lceil \frac{\log\left(\frac{2 R_{\max}}{\epsilon (1 - \gamma)}\right)}{\log\left(\frac{1}{\gamma}\right)} \right\rceil$$
where the notation $\lceil x \rceil$ takes the ceiling of $x$, i.e. rounds up to the next integer. This shows that the rate of convergence is good with respect to the error $\epsilon$, with the number of rounds increasing as $\log(1/\epsilon)$. On the other hand we see a weakness with respect to the discount factor $\gamma$. As this goes towards one, $\log(1/\gamma)$ approaches zero from above, and $K$ grows rapidly.
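To get a feel for how the iteration count depends on the discount factor, the bound can be evaluated directly. This is a small sketch of the formula $K = \lceil \log(2 R_{\max} / (\epsilon (1 - \gamma))) / \log(1/\gamma) \rceil$; the function name `iterations_needed` and the sample values of $\gamma$, $\epsilon$ and $R_{\max}$ are chosen here purely for illustration.

```python
import math


def iterations_needed(gamma, eps, r_max):
    """Smallest K with gamma**K * 2 * r_max / (1 - gamma) <= eps."""
    return math.ceil(math.log(2 * r_max / (eps * (1 - gamma))) / math.log(1 / gamma))


# Same error target and reward bound, increasingly patient discount factors
for gamma in (0.5, 0.9, 0.99):
    print(gamma, iterations_needed(gamma, eps=0.01, r_max=1.0))
```

Tightening `eps` by a factor of ten adds only a fixed number of iterations for a given `gamma`, whereas moving `gamma` towards one multiplies the count.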
PolicyIteration(γ) =
  // Takes the discount factor γ
  // Returns the optimal policy for each state
  Let π be any policy
  Repeat
    π_old = π
    Solve the system of equations
      V(s) = R(s, π_old(s)) + γ Σ_{s′} P(s′ | s, π_old(s)) V(s′)
    to obtain V(s)
    For each state s
      For each action a
        Q(s, a) = R(s, a) + γ Σ_{s′} P(s′ | s, a) V(s′)
      π_new(s) = arg max_a Q(s, a)  // Assume ties are broken in a consistent way
    π = π_new
  Until π_new = π_old
  Return π
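A compact Python sketch of the policy iteration loop follows. It assumes the nested-list layout `R[s][a]` and `P[s][a][t]`, solves the evaluation step with a small hand-written Gauss-Jordan routine (`solve` is a helper introduced here, not something from the notes), and breaks ties by taking the lowest-numbered action.

```python
def solve(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(n):
            if r != c:
                f = M[r][c] / M[c][c]
                M[r] = [x - f * y for x, y in zip(M[r], M[c])]
    return [M[i][n] / M[i][i] for i in range(n)]


def policy_iteration(R, P, gamma):
    n, m = len(R), len(R[0])
    policy = [0] * n  # any initial policy will do
    while True:
        old = policy[:]
        # Policy evaluation: (I - gamma * P_old) V = R_old
        A = [[(1.0 if s == t else 0.0) - gamma * P[s][old[s]][t] for t in range(n)]
             for s in range(n)]
        V = solve(A, [R[s][old[s]] for s in range(n)])
        # Policy improvement: act greedily with respect to V
        policy = []
        for s in range(n):
            Q = [R[s][a] + gamma * sum(P[s][a][t] * V[t] for t in range(n))
                 for a in range(m)]
            policy.append(max(range(m), key=Q.__getitem__))  # ties -> lowest index
        if policy == old:
            return policy, V


# Hypothetical two-state, two-action MDP
R = [[0.0, 0.0], [1.0, 0.0]]
P = [[[1.0, 0.0], [0.2, 0.8]],
     [[0.0, 1.0], [1.0, 0.0]]]
policy, V = policy_iteration(R, P, gamma=0.9)
print(policy, V)
```

The returned `V` is the exact value of the final policy (up to floating-point error), since it comes from the last linear solve rather than from an iterative approximation.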
Note that we repeat until the policy stops changing. Policy iteration is a neat idea. Does it actually work? The homework asks you to prove that it does. The key steps in the proof are to prove the following two claims:
$V^{\pi_{\text{new}}}(s) \ge V^{\pi_{\text{old}}}(s)$ for every state $s$; that is, the value of the policy increases monotonically from iteration to iteration.
From these one can deduce that policy iteration always terminates with the optimal policy after a finite number of steps. Contrast this with value iteration, which only achieves the optimal value function in the limit (but may achieve the optimal policy earlier.)
3 Comparing VI and PI
How do value iteration and policy iteration compare? It is a fact that policy iteration takes at most as many iterations to reach the optimal policy as value iteration does. In practice, it usually takes far fewer iterations. In policy iteration, the policy always changes every iteration. In contrast, in value iteration, the value function changes every iteration, but the optimal policy relative to that value function, i.e., the optimal policy given that that value function will be achieved in future, may stay the same for several successive iterations.
On the other hand, each individual iteration of policy iteration takes longer, because it requires solving the complete system of linear equations of Equation (5). This can be done via Gaussian elimination, for example, in $O(n^3)$. Compare with value iteration, in which each step of the algorithm takes $O(nmL)$ (for $nm$ updates of Q-values, on $n$ states and $m$ actions, with each taking time $L$, where $L$ is the max number of states reachable by any action $a$ from any state $s$). In general, however, policy iteration is considered the better algorithm in practice.
One way to make policy iteration more efficient than value iteration is to dispense with solving Equation (5) exactly, and to be satisfied with an approximate solution. Note that Equation (5) is itself a fixpoint equation, and can be solved by the same kind of iterative method we have been using for solving the optimality equations! We can start with an arbitrary value function, and apply the right hand side of (5) for a fixed number of steps or until approximate convergence, to get better and better approximations to $V^\pi$. This method of approximating the value function is called value propagation, and can be used instead of solving the equations in the policy iteration algorithm. We obtain the modified policy iteration algorithm, which tends to be more effective than value iteration and regular policy iteration.
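The modified algorithm can be sketched as follows. The notes do not spell out a stopping rule, so this sketch assumes one: stop when a greedy one-step backup changes no state's value by `eps` or more. The number of propagation sweeps, the data layout, and the example MDP (with non-negative rewards and a zero initial value function) are likewise assumptions.

```python
def modified_policy_iteration(R, P, gamma, sweeps=5, eps=1e-8):
    n, m = len(R), len(R[0])
    V = [0.0] * n
    while True:
        # Policy improvement: greedy policy with respect to the current V
        Q = [[R[s][a] + gamma * sum(P[s][a][t] * V[t] for t in range(n))
              for a in range(m)]
             for s in range(n)]
        policy = [max(range(m), key=Q[s].__getitem__) for s in range(n)]
        # Assumed stopping rule: V is (nearly) unchanged by a greedy backup
        if max(abs(Q[s][policy[s]] - V[s]) for s in range(n)) < eps:
            return policy, V
        # Value propagation: apply the right-hand side of Equation (5)
        # a fixed number of times instead of solving it exactly
        for _ in range(sweeps):
            V = [R[s][policy[s]]
                 + gamma * sum(P[s][policy[s]][t] * V[t] for t in range(n))
                 for s in range(n)]


# Hypothetical two-state, two-action MDP
R = [[0.0, 0.0], [1.0, 0.0]]
P = [[[1.0, 0.0], [0.2, 0.8]],
     [[0.0, 1.0], [1.0, 0.0]]]
policy, V = modified_policy_iteration(R, P, gamma=0.9)
print(policy, V)
```

With `sweeps=1` each pass reduces to a value iteration update, while a very large `sweeps` approaches exact policy evaluation, so the parameter trades cost per iteration against the number of iterations.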