






Prof. David C Parkes, Computer Science, Markov Decision Processes, Decision Theoretic Framework, Finite horizon, Total reward, Harvard, Lecture Notes
Typology: Study notes







We now turn to a new component of the course in which agents take actions and become active in their environments. Up until this point agents have been passive learners: hanging out and analyzing data. We are heading towards the interesting concept of reinforcement learning (RL), in which an agent learns to act in an uncertain environment from training data that consists of sequences of state, action, reward, state, action, reward, ... The fun part of RL is that the training data depends on the actions of the agent, and this will lead us to a discussion of exploitation vs. exploration. But for now we will consider planning problems in which the agent is given a probabilistic model of its environment and should decide which actions to take.

We begin with a brief introduction to utility theory and the maximum expected utility principle (MEU). This provides the foundation for what it means to be a rational agent: we will take a rational agent as one that tries to maximize its expected utility, given its uncertainty about the world.

Our focus in Markov Decision Processes will be on the more realistic situation of repeated interactions. The agent gets some information about the world, chooses an action, receives a reward, gets some more information, chooses another action, receives another reward, and so on. This type of agent needs to consider not only its immediate reward when making its decision, but also its utility in the long run.

Optional readings: Russell & Norvig 16, 17-17.
Here is a capsule summary of the decision theoretic framework:
Decision theory can be viewed as a definition of rationality. The maximum expected utility principle states that a rational agent is one that chooses its policy so as to maximize its expected utility. The decision theoretic framework provides a precise and concrete formulation of the problem we wish to solve and a method for designing an intelligent agent.
There are various sources of uncertainty that an agent faces when trying to achieve good performance in complex environments:
In the decision theoretic framework, all these different kinds of uncertainty can be modeled using probabilities. For the most part we will ignore the last kind of uncertainty, uncertainty over the model itself. But this can be handled through the Bayesian prior approach that we know from our earlier discussions of learning, in which there is an explicit distribution on the parameters of the distribution. For now we will also be studying problems in which the first problem is ignored and the agent knows the current state. This will put us in the space of Markov Decision Processes (MDPs). Partially-Observable MDPs will be briefly discussed next class.

Continuing our elaboration of the problem definition, "performs well" means that an agent gets good utility. Of course, because of the uncertainty, we can't devise a policy that guarantees that it will get good utility in all circumstances. Instead, we need to try to design an agent so that it gets good utility on average, i.e., in expectation.

Let's make this notion a little more precise. For now, we will focus on the "open-loop" setting, in which the agent has to make a single decision and then it is done. We will soon get to MDPs and the "closed-loop" setting, in which the agent repeatedly interacts with its environment. Assume that our agent has some evidence $o$ about the world, derived from its sensors. The agent can take various actions $a_1, \ldots, a_M$. Each action can have various different results. For example, action $a_1$ has possible results $\mathrm{Result}_1(a_1), \ldots, \mathrm{Result}_K(a_1)$. In order to frame the problem, we need to assess two things. First, with each result $\mathrm{Result}$ we need to associate a utility $U(\mathrm{Result})$. Second, we need to assess the probability of each possible result for each action, given our evidence. We'll write this as $P_a(\mathrm{Result} \mid o)$. Now, we can define the expected utility of taking action $a_1$ as
$$E[U(\mathrm{Do}(a_1) \mid o)] = \sum_{j=1}^{K} P_{a_1}(\mathrm{Result}_j(a_1) \mid o)\, U(\mathrm{Result}_j(a_1)) \qquad (1)$$
and similarly for the other actions. The maximum expected utility principle, or MEU principle, says that we should choose the action that maximizes this expected utility. Let’s look at Equation (1) again. We need to distinguish between design time and execution time. At execution time, the agent has some evidence o about the world, and needs to choose its action a on the basis of that evidence. At design time, we need to design an agent that will work for any possible evidence. In other words, we need to define an agent function f : O → A from the space of possible evidence O to the space of actions A. We would like this to solve the expected utility maximization problem for the agent. If we succeed in doing this, the agent will be fully rational. Usually, we will establish full rationality as an ideal, and try to get as close to it as possible.
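As a minimal sketch of the MEU principle, the following Python computes the expected utility of Equation (1) for each action and picks the maximizer. The umbrella scenario, its probabilities and its utilities are hypothetical values chosen for illustration; they do not come from the notes. Each action maps to a list of (probability, utility) pairs, assumed to be already conditioned on the evidence o.

```python
from typing import Dict, List, Tuple

# Hypothetical model: action -> list of (P_a(Result_j | o), U(Result_j)).
Model = Dict[str, List[Tuple[float, float]]]


def expected_utility(outcomes: List[Tuple[float, float]]) -> float:
    """Sum of probability * utility over the possible results of one action."""
    return sum(p * u for p, u in outcomes)


def meu_action(model: Model) -> Tuple[str, float]:
    """Return the action with maximum expected utility, and that utility."""
    best = max(model, key=lambda a: expected_utility(model[a]))
    return best, expected_utility(model[best])


# Illustrative numbers only (assumed): 30% chance of rain given the evidence.
model: Model = {
    "take_umbrella": [(0.3, 8.0), (0.7, 6.0)],    # rain, no rain
    "leave_umbrella": [(0.3, 0.0), (0.7, 10.0)],  # rain, no rain
}

action, utility = meu_action(model)
print(action, utility)
```

An agent function f : O → A in the sense of the text would wrap `meu_action` around a model that depends on the evidence.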
We’ve been blithely throwing around the term “utility” without stopping to think what exactly it means. The simplest answer is that utility is a measure of happiness. But what does that mean? What is a happiness of 1, 3.27, or -15?
We can illustrate the agent’s interaction with the environment as:
where S is the current state of the world, observed by the agent for the moment, A is its action (drawn as a box because it is a decision variable rather than a random variable) and R is its reward or utility, drawn as a diamond. We are interested in “closed loop” problems in which the agent affects the state of the environment through its action. This is contrasted with “open loop” problems in which only a single action is taken and any effect on the subsequent state is irrelevant. The MDP framework consists of the following elements (S, A, R, P ):
$\sum_{s'} P(s' \mid s, a) = 1$ for all $s, a$, and defining the probability of reaching state $s'$ in the next period given action $a$ in state $s$.
This framework describes an agent with repeated interactions with the world. The agent starts in some state $s_0 \in S$. The agent then gets to choose an action $a_0 \in A$. The agent receives reward $R(s_0, a_0)$, while the world transitions to a new state $s_1$, according to the probability distribution $P(s_1 \mid s_0, a_0)$. Then the cycle continues: the agent chooses another action $a_1 \in A$, and the world transitions to $s_2$ according to the distribution $P(s_2 \mid s_1, a_1)$, the agent receiving $R(s_1, a_1)$. And so on. An agent in an MDP goes through $s_0, a_0, r_0, s_1, a_1, r_1, s_2, a_2, r_2, s_3, \ldots$ where $s_t, a_t, r_t$ indicate the state, action and reward in period $t$.

In writing this model, we have adopted a stationary and deterministic (non-random) reward model $R(s, a)$. Taking the same action in the same state provides the same reward irrespective of time. Furthermore, we have adopted a stationary transition model, so that $P(S_t = s' \mid S_{t-1} = s, a) = P(s' \mid s, a)$ for all periods $t$. This can be relaxed as necessary by folding some temporal features into the state.

The framework makes a fundamental assumption: that the reward model and transition model depend only on the current state and current action, and not on previous history. This assumption is known as the Markov assumption (hence the name Markov Decision Process), which is a basic assumption used in reasoning about temporal models. The Markov assumption is often stated as "The future is independent of the past given the present." As we discussed in class, thinking about whether this is reasonable is a bit subtle: it depends on the state representation. The point is that we would like a model with a small number of states that satisfies the Markov property.

A dynamic decision network, a simple extension of the idea of a BN that includes actions and rewards, for an MDP is illustrated below:
Actions are again in boxes, random variables (the state) in circles as in Bayes Nets, and rewards in diamonds. This decision network uses the same conditional independence semantics for arrows that we understand from BNs. The agent proceeds through time periods $t \in \{0, 1, \ldots\}$, perhaps for an infinite sequence of periods. Another component that we require in order to solve MDPs is the concept of a policy,
$$\pi_t(s) \in A,$$
which defines an action in every state $s \in S$, here allowing the policy to also depend on time $t$. Dependence on time is required in settings with a finite decision horizon because different actions become useful as the deadline approaches. A policy describes the solution to an MDP: it is a complete description of how an agent will act in all possible states of the world. The assumptions made in the MDP formalism are:
The transition model $P$ tells us how actions affect the state of the world, and also how the world state changes exogenously. We can think about this as a matrix of parameters, where $\theta^a_{i,j}$ for row $i$ and column $j$ specifies the probability of getting to state $j \in S$ if the agent takes action $a$ in state $i \in S$. Let $L$ denote the maximum number of non-zero entries in any row of the transition matrix $\theta$, i.e., the largest number of next states that can be reached by any action from any state. Typical problems will have a sparse transition matrix with $L$ considerably smaller than the number of states $N$.

The MDP formulation above is general. For example, it is sometimes convenient to describe the reward model as $R(s, a, s') \in \mathbb{R}$, the reward for action $a$ in state $s$ given that we transition to $s'$. But this is equivalent to $R(s, a) = \sum_{s'} P(s' \mid s, a)\, R(s, a, s')$. Moreover, it is sometimes simpler to associate rewards with states, rather than with states and actions. When we do that, we will mean that the reward for taking any action in the state is the reward associated with the state.

The MDP framework describes how the agent interacts with the world, but we haven't yet described what the goals of the agent actually are, i.e., what problem we are trying to solve. Here are some variants of the problem:
Finite horizon We are given a number H, called the horizon, and assume that the agent is only interested in maximizing the reward accumulated in the first H steps. In a sense, H represents the degree to
Someone else might also bid, but only one successful bid is accepted on each round. The bidding closes if a bid of $200 is received, or no bid is received for two successive rounds. You value the item being sold at $150. If you do not buy the item, you receive 0 reward. If you buy it for $x, you receive a reward of 150 − x. Formally, we specify the problem as follows:
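$$R(\langle x, y, z \rangle) = \begin{cases} 150 - x & \text{if } y = \text{true and } z = 2 \text{ or } x = 200 \\ 0 & \text{otherwise} \end{cases}$$

$$\begin{cases} \langle x + 100, \text{false}, 0 \rangle & \text{with probability } 0. \\ \langle x, y, z + 1 \rangle & \text{with probability } 0. \end{cases}$$

$$\tau(\langle x, y, z \rangle, b) = \begin{cases} \langle x + 100, \text{true}, 0 \rangle & \text{with probability } 0. \\ \langle x + 100, \text{false}, 0 \rangle & \text{with probability } 0. \end{cases}$$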
Another common example of an MDP is robot navigation. Suppose that your robot is navigating on a grid. It has a goal to get to, and perhaps some dangerous spots like stairwells that it needs to avoid. Unfortunately, because of slippage problems, its operators are not deterministic, and it needs to take that into account when planning a path to the goal. Here, the state space consists of possible locations of the robot, and the direction in which it is currently facing. The total number of states is 4 times the number of locations, which is quite manageable. The actions might be to move forward, turn left or turn right. The reward model gives the agent a reward if it gets to the goal, and a punishment if it falls down a stairwell. The transition model states that in most cases actions have their expected effect, e.g. moving forward will normally successfully move the robot forward one space, but with some probability the robot will stick in the same place, and with some probability it will move forward two spaces. We can extend this example to the plant-eating robot that will star in our final class project. In addition to navigating around the planet, the robot also may eat plants. Let us assume that the robot has explored the planet, and located and analyzed all the plants. In fact, let us assume that for each plant, the robot has determined the probability that the plant is nutritious. It is at a particular location, facing a particular direction, with a particular energy level, and its goal is to survive as long as possible. The state space needs to be much richer for this problem than for the simple navigation task. Specifically, in addition to the location and direction of the robot, the state needs to include a variable indicating how
much energy the robot currently has, and a variable for each plant, indicating whether or not it has been eaten. The set of actions is expanded to include an eating action. The transition model for the eating action is specified as follows: If the robot eats plant w, and the probability that w is nutritious is p, then it gains 10 units of energy with probability p, and loses 20 energy with probability 1 − p. The reward model is very simple. Since the goal of the robot is to stay alive as long as possible, we can simply assign a reward of 1 to every state in which the robot is alive, i.e. has positive energy.
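The eating transition and the survival reward described above can be sketched as follows. The state encoding (a tuple of location, facing direction, energy, and the set of eaten plants) and the function names are assumptions made for illustration; the notes do not fix a representation, and the sketch does not check whether the plant was already eaten.

```python
from typing import Dict, FrozenSet, Tuple

# Hypothetical state encoding: (location, facing, energy, eaten plants).
State = Tuple[Tuple[int, int], str, int, FrozenSet[str]]


def eat_transition(state: State, plant: str, p_nutritious: float) -> Dict[State, float]:
    """Distribution over next states after eating `plant`.

    With probability p the robot gains 10 energy; with probability 1 - p
    it loses 20. Either way the plant is marked as eaten.
    """
    loc, facing, energy, eaten = state
    now_eaten = eaten | {plant}
    nutritious = (loc, facing, energy + 10, now_eaten)
    poisonous = (loc, facing, energy - 20, now_eaten)
    return {nutritious: p_nutritious, poisonous: 1.0 - p_nutritious}


def reward(state: State) -> float:
    """Reward 1 in every state where the robot is alive (positive energy)."""
    return 1.0 if state[2] > 0 else 0.0


start: State = ((0, 0), "N", 15, frozenset())
for next_state, prob in eat_transition(start, "w", 0.25).items():
    print(next_state, prob, reward(next_state))
```

Note how the state space grows: every subset of eaten plants is part of the state, which is why this problem is much richer than plain navigation.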
4 Expectimax Search: Finite Horizon
Given an MDP, the task is to find a policy that maximizes the agent's utility. How do we find an optimal policy? Let's concentrate on the finite horizon case first. One approach is to view an MDP as a game against nature, where an opponent called "nature" behaves in a specific probabilistic manner.

One way to solve this problem is to build a search tree. The tree will have two kinds of nodes: those where you get to move, and those where nature moves. Nodes where you get to move are associated with the current state $s$; nodes where nature moves are associated with the pair $(s, a)$, where $s$ is the current state and $a$ is your chosen action. The root of the search tree is the initial state $s_0$. If you choose action $a$ at node $s$, then we transition to tree node $(s, a)$. From node $(s, a)$, we can transition to any tree node $s'$ such that $P(s' \mid s, a) > 0$. The edge from $(s, a)$ to $s'$ is annotated with probability $P(s' \mid s, a)$. This models the move by nature, which is probabilistic and could take the state of the world to any one of the possible next states.

How do we adapt search to find the optimal decision policy? In expectimax search we will solve from the end of time to the start of time, propagating a value for every tree node up from the leaves to the root. The value of a tree node is simply the expected total reward that an agent can expect to get forward from that node under its optimal policy. The name corresponds to the fact that the algorithm alternates between taking expectations and maximizations. The algorithm is as follows:
Expectimax(s) =
    // Takes an initial state s
    // Returns the value of s
    // Stores the optimal action at s
    If s is terminal Then
        return 0
    Else
        For each action a
            Q(s, a) = R(s, a) + Σ_{s′} P(s′ | s, a) Expectimax(s′)
        π∗(s) = arg max_a Q(s, a)
        Return Q(s, π∗(s)).
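The pseudocode above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not a definitive implementation: the MDP is stored in plain dictionaries, a node is treated as terminal once no decisions remain (`steps_left == 0`), and the two-state MDP with its rewards and probabilities is a hypothetical toy, not the auction example.

```python
from typing import Dict, Hashable, List, Optional, Tuple

State = Hashable
Action = Hashable
Transitions = Dict[Tuple[State, Action], Dict[State, float]]
Rewards = Dict[Tuple[State, Action], float]


def expectimax(s: State, steps_left: int, actions: List[Action],
               P: Transitions, R: Rewards) -> Tuple[float, Optional[Action]]:
    """Return (value, optimal action) at state s with steps_left decisions to go."""
    if steps_left == 0:  # terminal: no reward left to collect
        return 0.0, None
    best_value, best_action = float("-inf"), None
    for a in actions:
        # Immediate reward plus the expectation over nature's move.
        q = R[(s, a)]
        for s_next, prob in P[(s, a)].items():
            q += prob * expectimax(s_next, steps_left - 1, actions, P, R)[0]
        if q > best_value:
            best_value, best_action = q, a
    return best_value, best_action


# Hypothetical toy MDP (all numbers are assumptions for illustration).
ACTIONS = ["stay", "go"]
P: Transitions = {
    ("A", "stay"): {"A": 1.0},
    ("A", "go"): {"A": 0.5, "B": 0.5},
    ("B", "stay"): {"B": 1.0},
    ("B", "go"): {"A": 1.0},
}
R: Rewards = {("A", "stay"): 1.0, ("A", "go"): 0.0,
              ("B", "stay"): 4.0, ("B", "go"): 0.0}

for horizon in (1, 2, 3):
    print(horizon, expectimax("A", horizon, ACTIONS, P, R))
```

With one step to go the greedy action at A is to stay; with a longer horizon it pays to give up the immediate reward and try to reach B.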
We adopt $Q(s, a)$ to denote the value at a tree node $(s, a)$, just before Nature is about to take an action and determine the next state. This is the immediate reward for action $a$ in state $s$ plus the expectation over the values of the children $s'$ of $(s, a)$, where the expectation is taken according to the transition model. The algorithm works from the leaves towards the root, calling Expectimax on each new state. For a given state it considers the possible actions, determining the $Q(s, a)$ value based on the work done at the children of the state. The optimal action $\pi^*(s)$ is then stored, and the Q-value under that optimal action returned as the expectimax value for the state.

Figure 1 shows part of the expectimax tree for the Internet auction example. The left subtree, corresponding to the initial action of bid, has been expanded completely and analyzed. The right subtree has only been expanded to the second level. Nature's edges are annotated with probabilities, while nodes whose values have been computed are annotated with the value. From the analysis, we see that the auction is worth at least $8.75 to the agent. The agent can use the policy of bidding right away, hoping that the bid
succeeds, and hoping that no-one bids for the next two rounds. We do not know yet whether the agent can do better by passing in the first round. As it happens, the agent cannot do as well by passing. If the agent passes, it will hope that no-one else bids, and then have to bid immediately in the next round, hoping its bid will be successful. It will then have to hope that no-one bids for the next two rounds. This policy has a smaller chance of succeeding than bidding in the first round.

How expensive is the expectimax algorithm? Its cost is proportional to the size of the tree, which is exponential in the horizon. What is the base of the exponent? The number of actions times the number of possible transitions for nature. To be precise, let the horizon be $H$, the number of actions be $m$, and the maximum number of transitions for any state and action be $L$. Then the cost of the algorithm is $O((mL)^H)$. How large is $L$? It depends on the model. $L$ is the maximum number of non-zero entries in any row of the transition matrix for any action. In general, if the matrices are sparse, the branching factor will be small and the algorithm will be more efficient.

As an alternative, we will turn to value iteration, which uses dynamic programming. But we will make a comment at the end of the next section about expectimax in comparison to value iteration.
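To get a feel for this bound, here is a worked instance with hypothetical numbers chosen only for illustration: m = 2 actions, L = 3 possible next states, and horizon H = 10.

$$(mL)^H = (2 \cdot 3)^{10} = 6^{10} = 60{,}466{,}176$$

Even for this small problem the tree has tens of millions of leaves, and each additional period multiplies the count by another factor of mL = 6.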
5 Value Iteration: Finite Horizon
Instead, we'll look at another idea. Notice that the same state may be reached by many different paths. This is true even in deterministic settings, but is especially true in probabilistic settings: Nature tends to act as an equalizer, negating the results of your actions (especially actions in the distant past). If there are states that can be reached in multiple ways, there's a tremendous amount of wasted work in the expectimax algorithm, because the entire tree beneath the repeated states is searched multiple times.

An alternative approach is to work backwards from the decision horizon but memorize the optimal thing to do from each possible state reached and reuse this computation. This is the approach of dynamic programming. It recognizes that the optimal policy forward from state $s$ with $k$ periods to go is the same irrespective of the way $s$ is reached, and moreover can be found by taking the best action now and then following the optimal $(k-1)$-to-go policy from the next state reached.

To flesh this out, we will denote by $V_k(s)$ the $k$-step-to-go value of state $s$. That is, the value that you can expect to get if you start in state $s$ and proceed for $k$ steps under the optimal policy. So, $V_0(s)$ is just 0 for all $s$. Similarly, let $Q_k(s, a)$ denote the $k$-step-to-go value of taking $a$ now and then following the optimal policy. Finally, let $\pi^*_k(s)$ denote the optimal $k$-to-go action. The base case is
$$V_0(s) = 0, \quad \forall s \qquad (6)$$

The inductive case is

$$Q_k(s, a) = R(s, a) + \sum_{s'} P(s' \mid s, a)\, V_{k-1}(s') \qquad (7)$$

$$\pi^*_k(s) = \arg\max_{a \in A} Q_k(s, a) \qquad (8)$$

$$V_k(s) = Q_k(s, \pi^*_k(s)) \qquad (9)$$
Look carefully at Eq. (7). This holds that the value of each action $a$ taken now is the immediate reward plus the expected continuation value from the next state $s'$ reached, given the $V_{k-1}(s')$ values already computed. Also, note that $\pi^*_k(s)$ is the optimal action with $k$ steps to go and not the optimal action in time period $k$. This suggests an algorithm, where we solve this series of equations layer by layer, from the bottom up. The algorithm, called value iteration, is as follows:
ValueIteration(H) =
    // Finite horizon version
    // Takes the horizon H
    // Stores the k-step-to-go value and policy
    // for each state and each k from 1 to H
    For each state s ∈ S
        V_0(s) = 0
    For k = 1 to H
        For each state s ∈ S
            For each action a ∈ A
                Q_k(s, a) = R(s, a) + Σ_{s′} P(s′ | s, a) V_{k−1}(s′)
            π∗_k(s) = arg max_{a ∈ A} Q_k(s, a)
            V_k(s) = Q_k(s, π∗_k(s)).
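A minimal Python sketch of finite horizon value iteration, following Eqs. (6) to (9). The dictionary-based MDP representation and the two-state toy problem are assumptions for illustration; the numbers are hypothetical.

```python
from typing import Dict, Hashable, List, Tuple

State = Hashable
Action = Hashable
Transitions = Dict[Tuple[State, Action], Dict[State, float]]
Rewards = Dict[Tuple[State, Action], float]


def value_iteration(states: List[State], actions: List[Action],
                    P: Transitions, R: Rewards, H: int):
    """Return (V, pi) where V[k][s] and pi[k][s] are the k-step-to-go
    value and optimal action, for k = 0..H (pi[0] is empty)."""
    V = [{s: 0.0 for s in states}]           # base case, Eq. (6)
    pi: List[Dict[State, Action]] = [{}]
    for k in range(1, H + 1):
        Vk, pik = {}, {}
        for s in states:
            # Eq. (7): immediate reward plus expected (k-1)-to-go value.
            q = {a: R[(s, a)] + sum(p * V[k - 1][s2]
                                    for s2, p in P[(s, a)].items())
                 for a in actions}
            pik[s] = max(q, key=q.get)       # Eq. (8)
            Vk[s] = q[pik[s]]                # Eq. (9)
        V.append(Vk)
        pi.append(pik)
    return V, pi


# Hypothetical toy MDP (all numbers are assumptions for illustration).
STATES = ["A", "B"]
ACTIONS = ["stay", "go"]
P: Transitions = {
    ("A", "stay"): {"A": 1.0},
    ("A", "go"): {"A": 0.5, "B": 0.5},
    ("B", "stay"): {"B": 1.0},
    ("B", "go"): {"A": 1.0},
}
R: Rewards = {("A", "stay"): 1.0, ("A", "go"): 0.0,
              ("B", "stay"): 4.0, ("B", "go"): 0.0}

V, pi = value_iteration(STATES, ACTIONS, P, R, H=3)
for k in range(1, 4):
    print(k, V[k], pi[k])
```

Each layer reuses the stored values of the previous layer, so every (state, k) pair is solved once no matter how many paths lead to it.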
What is the meaning of $\pi^*_k$ and $V_k$? $\pi^*_k$ is the optimal policy for the $k$-th step before the end, assuming you have only $k$ steps to go. $V_k$ is the value you get in the last $k$ steps, assuming you play optimally. As $k$ grows larger, the nature of the optimal policy changes. For $k = 1$, the policy will be greedy, considering only the reward in the next step. For small $k$, only the next few rewards will be considered, and there will be no long-term planning. For large $k$, there is incentive to sacrifice short-term gain for long run benefit, because there is plenty of time to reap the reward of your investment.

What is the complexity of finite horizon value iteration? If the number of states is $n$, the number of actions is $m$, and the horizon is $H$, then there are $nm$ Q-values to be updated every time through the loop, each Q-value update takes $O(L)$ work for the at most $L$ next states $s'$ that occur with non-zero probability, and there are $H$ iterations of the loop. Altogether, we have $O(nmLH)$, which is linear in the horizon, instead of exponential as we found for expectimax! Great.

However, expectimax still has one advantage over dynamic programming: it only has to compute values for states that are actually reached. If states generally cannot be reached in multiple ways, and there are many states that are not reached at all, expectimax may be better. There is a simple algorithm that combines the advantages of both, using an idea called reachability analysis. The first stage is simply to determine which of the possible states are reachable. Then, value iteration is performed over this reduced set of states.
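The two stages of reachability analysis can be sketched as follows. This is one possible realization under stated assumptions: reachability is computed by breadth-first search from a known start state, ignoring the horizon, so the reduced set is closed under transitions. The three-state toy MDP, in which state C can never be reached from A, is hypothetical.

```python
from collections import deque
from typing import Dict, Hashable, List, Set, Tuple

State = Hashable
Action = Hashable
Transitions = Dict[Tuple[State, Action], Dict[State, float]]
Rewards = Dict[Tuple[State, Action], float]


def reachable_states(s0: State, actions: List[Action], P: Transitions) -> Set[State]:
    """Stage 1: breadth-first search over transitions with non-zero probability."""
    seen = {s0}
    frontier = deque([s0])
    while frontier:
        s = frontier.popleft()
        for a in actions:
            for s_next, prob in P[(s, a)].items():
                if prob > 0 and s_next not in seen:
                    seen.add(s_next)
                    frontier.append(s_next)
    return seen


def value_iteration(states, actions, P: Transitions, R: Rewards, H: int):
    """Stage 2: H-step-to-go values, computed only over the given states."""
    V = {s: 0.0 for s in states}
    for _ in range(H):
        V = {s: max(R[(s, a)] + sum(p * V[s2]
                                    for s2, p in P[(s, a)].items() if p > 0)
                    for a in actions)
             for s in states}
    return V


# Hypothetical toy MDP: C is never reached when starting from A.
ACTIONS = ["stay", "go"]
P: Transitions = {
    ("A", "stay"): {"A": 1.0}, ("A", "go"): {"A": 0.5, "B": 0.5},
    ("B", "stay"): {"B": 1.0}, ("B", "go"): {"A": 1.0},
    ("C", "stay"): {"C": 1.0}, ("C", "go"): {"A": 1.0},
}
R: Rewards = {("A", "stay"): 1.0, ("A", "go"): 0.0,
              ("B", "stay"): 4.0, ("B", "go"): 0.0,
              ("C", "stay"): 100.0, ("C", "go"): 0.0}

reach = reachable_states("A", ACTIONS, P)
V = value_iteration(reach, ACTIONS, P, R, H=3)
print(sorted(reach), V)
```

The search costs at most O(nmL) once, after which every value iteration sweep touches only the reachable states.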