Reinforcement Learning - Machine Learning | CMSC 726, Study notes of Computer Science

Material Type: Notes; Class: Machine Learning; Subject: Computer Science; University: University of Maryland

Slide 1: Reinforcement Learning
Slides from Sutton and Barto (R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction).

Slide 2: The Agent-Environment Interface
[Figure: the agent-environment loop; the agent emits action $a_t$, the environment returns reward $r_{t+1}$ and next state $s_{t+1}$.]
Agent and environment interact at discrete time steps $t = 0, 1, 2, \ldots$
- The agent observes the state at step $t$: $s_t \in S$
- produces an action at step $t$: $a_t \in A(s_t)$
- gets the resulting reward $r_{t+1} \in \Re$ and the resulting next state $s_{t+1}$.
The resulting trajectory is $s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1}, r_{t+2}, s_{t+2}, a_{t+2}, r_{t+3}, s_{t+3}, a_{t+3}, \ldots$

Slide 3: The Agent Learns a Policy
Policy at step $t$, $\pi_t$: a mapping from states to action probabilities, with $\pi_t(s, a)$ = probability that $a_t = a$ when $s_t = s$.
Reinforcement learning methods specify how the agent changes its policy as a result of experience. Roughly, the agent's goal is to get as much reward as it can over the long run.

Slide 4: Returns
Suppose the sequence of rewards after step $t$ is $r_{t+1}, r_{t+2}, r_{t+3}, \ldots$ What do we want to maximize? In general, we want to maximize the expected return, $E\{R_t\}$, for each step $t$.
Episodic tasks: interaction breaks naturally into episodes, e.g., plays of a game or trips through a maze. Then
$$R_t = r_{t+1} + r_{t+2} + \cdots + r_T,$$
where $T$ is a final time step at which a terminal state is reached, ending an episode.

Slide 9: The Markov Property
By "the state" at step $t$, the book means whatever information is available to the agent at step $t$ about its environment. The state can include immediate "sensations," highly processed sensations, and structures built up over time from sequences of sensations.
Ideally, a state should summarize past sensations so as to retain all "essential" information, i.e., it should have the Markov Property:
$$\Pr\{s_{t+1} = s',\, r_{t+1} = r \mid s_t, a_t, r_t, s_{t-1}, a_{t-1}, \ldots, r_1, s_0, a_0\} = \Pr\{s_{t+1} = s',\, r_{t+1} = r \mid s_t, a_t\}$$
for all $s'$, $r$, and histories $s_t, a_t, r_t, s_{t-1}, a_{t-1}, \ldots, r_1, s_0, a_0$.

Slide 10: Markov Decision Processes
If a reinforcement learning task has the Markov Property, it is basically a Markov Decision Process (MDP). If the state and action sets are finite, it is a finite MDP. To define a finite MDP, you need to give:
- the state and action sets
- one-step "dynamics" defined by transition probabilities: $P^a_{ss'} = \Pr\{s_{t+1} = s' \mid s_t = s, a_t = a\}$ for all $s, s' \in S$, $a \in A(s)$
- expected rewards: $R^a_{ss'} = E\{r_{t+1} \mid s_t = s, a_t = a, s_{t+1} = s'\}$ for all $s, s' \in S$, $a \in A(s)$.

Slide 11: Recycling Robot (An Example Finite MDP)
At each step, the robot has to decide whether it should (1) actively search for a can, (2) wait for someone to bring it a can, or (3) go to home base and recharge. Searching is better but runs down the battery; if the robot runs out of power while searching, it has to be rescued (which is bad). Decisions are made on the basis of the current energy level: high or low. Reward = number of cans collected.
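To make the finite-MDP definition concrete before the robot's transition diagram on the next slide, here is a minimal sketch (not from the slides) of one way to store the one-step dynamics $P^a_{ss'}$ and $R^a_{ss'}$ in Python. The container layout and the numeric values of ALPHA, BETA, R_SEARCH, and R_WAIT are illustrative assumptions; only the structure of the outcomes follows the slide's description.

```python
# Hypothetical encoding of a finite MDP: for each (state, action) pair,
# list the possible outcomes as (probability, next_state, expected_reward).
ALPHA, BETA = 0.9, 0.6          # P(stay high | search), P(stay low | search); made-up values
R_SEARCH, R_WAIT = 2.0, 1.0     # expected cans per step; R_SEARCH > R_WAIT as on the slide

recycling_robot = {
    ("high", "search"):  [(ALPHA, "high", R_SEARCH), (1 - ALPHA, "low", R_SEARCH)],
    ("high", "wait"):    [(1.0, "high", R_WAIT)],
    ("low", "search"):   [(BETA, "low", R_SEARCH), (1 - BETA, "high", -3.0)],  # battery dies, rescued
    ("low", "wait"):     [(1.0, "low", R_WAIT)],
    ("low", "recharge"): [(1.0, "high", 0.0)],
}

STATES = ["high", "low"]
ACTIONS = {"high": ["search", "wait"], "low": ["search", "wait", "recharge"]}

# Sanity check: outgoing probabilities for each (state, action) sum to 1.
for (s, a), outcomes in recycling_robot.items():
    assert abs(sum(p for p, _, _ in outcomes) - 1.0) < 1e-9
```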
Slide 12: Recycling Robot MDP
[Figure: transition graph; the (probability, reward) edge labels reconstruct to the following dynamics.]
- high, search: stays high with probability α, goes to low with probability 1−α; reward R^search either way
- high, wait: stays high with probability 1; reward R^wait
- low, search: stays low with probability β with reward R^search; with probability 1−β the battery runs out, the robot is rescued and returned to high with reward −3
- low, wait: stays low with probability 1; reward R^wait
- low, recharge: goes to high with probability 1; reward 0

$S = \{\text{high}, \text{low}\}$, $A(\text{high}) = \{\text{search}, \text{wait}\}$, $A(\text{low}) = \{\text{search}, \text{wait}, \text{recharge}\}$.
$R^{\text{search}}$ = expected number of cans while searching, $R^{\text{wait}}$ = expected number of cans while waiting, with $R^{\text{search}} > R^{\text{wait}}$.

Slide 13: Value Functions
The value of a state is the expected return starting from that state; it depends on the agent's policy. State-value function for policy $\pi$:
$$V^\pi(s) = E_\pi\{R_t \mid s_t = s\} = E_\pi\Big\{\sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \,\Big|\, s_t = s\Big\}$$
The value of taking an action in a state under policy $\pi$ is the expected return starting from that state, taking that action, and thereafter following $\pi$. Action-value function for policy $\pi$:
$$Q^\pi(s, a) = E_\pi\{R_t \mid s_t = s, a_t = a\} = E_\pi\Big\{\sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \,\Big|\, s_t = s, a_t = a\Big\}$$

Slide 14: Bellman Equation for a Policy π
The basic idea:
$$R_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \gamma^3 r_{t+4} + \cdots = r_{t+1} + \gamma\,(r_{t+2} + \gamma r_{t+3} + \gamma^2 r_{t+4} + \cdots) = r_{t+1} + \gamma R_{t+1}$$
So:
$$V^\pi(s) = E_\pi\{R_t \mid s_t = s\} = E_\pi\{r_{t+1} + \gamma V^\pi(s_{t+1}) \mid s_t = s\}$$
Or, without the expectation operator:
$$V^\pi(s) = \sum_a \pi(s,a) \sum_{s'} P^a_{ss'}\big[R^a_{ss'} + \gamma V^\pi(s')\big]$$

Slide 19: Why Optimal State-Value Functions are Useful
Any policy that is greedy with respect to $V^*$ is an optimal policy. Therefore, given $V^*$, one-step-ahead search produces the long-term optimal actions. E.g., back to the gridworld:
[Figure: (a) the gridworld with special states A (+10, jumps to A') and B (+5, jumps to B'); (b) the optimal state-value function $V^*$; (c) the optimal policy $\pi^*$.]

Slide 20: What About Optimal Action-Value Functions?
Given $Q^*$, the agent does not even have to do a one-step-ahead search:
$$\pi^*(s) = \arg\max_{a \in A(s)} Q^*(s, a)$$

Slide 21: Solving the Bellman Optimality Equation
Finding an optimal policy by solving the Bellman Optimality Equation requires the following:
- accurate knowledge of the environment dynamics;
- enough space and time to do the computation;
- the Markov Property.
How much space and time do we need? Polynomial in the number of states (via dynamic programming methods; Chapter 4), BUT the number of states is often huge (e.g., backgammon has about $10^{20}$ states). We usually have to settle for approximations. Many RL methods can be understood as approximately solving the Bellman Optimality Equation.

Slide 22: Policy Evaluation
Policy evaluation: for a given policy $\pi$, compute the state-value function $V^\pi$. Recall the state-value function for policy $\pi$:
$$V^\pi(s) = E_\pi\{R_t \mid s_t = s\} = E_\pi\Big\{\sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \,\Big|\, s_t = s\Big\}$$
and the Bellman equation for $V^\pi$:
$$V^\pi(s) = \sum_a \pi(s,a) \sum_{s'} P^a_{ss'}\big[R^a_{ss'} + \gamma V^\pi(s')\big]$$
This is a system of $|S|$ simultaneous linear equations.

Slide 23: Iterative Methods
$$V_0 \rightarrow V_1 \rightarrow \cdots \rightarrow V_k \rightarrow V_{k+1} \rightarrow \cdots \rightarrow V^\pi$$
A "sweep" consists of applying a backup operation to each state. A full policy evaluation backup:
$$V_{k+1}(s) \leftarrow \sum_a \pi(s,a) \sum_{s'} P^a_{ss'}\big[R^a_{ss'} + \gamma V_k(s')\big]$$

Slide 24: Iterative Policy Evaluation
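As a companion to the iterative-methods slides, here is a minimal sketch of the full policy evaluation backup applied as in-place sweeps over the dictionary-style MDP sketched earlier. The function name, stopping tolerance theta, and the uniform-random example policy are illustrative assumptions rather than the book's exact pseudocode.

```python
def policy_evaluation(states, actions, dynamics, policy, gamma=0.9, theta=1e-8):
    """Repeatedly apply V_{k+1}(s) <- sum_a pi(s,a) sum_{s'} P[R + gamma * V_k(s')]
    to every state (a sweep) until the largest change in a sweep is below theta."""
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            v_new = 0.0
            for a in actions[s]:
                pi_sa = policy[s][a]  # pi(s, a): probability of choosing a in s
                for prob, s_next, reward in dynamics[(s, a)]:
                    v_new += pi_sa * prob * (reward + gamma * V[s_next])
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < theta:
            return V

# Example: evaluate a uniform-random policy on the recycling robot sketched above.
uniform = {s: {a: 1.0 / len(ACTIONS[s]) for a in ACTIONS[s]} for s in STATES}
V_pi = policy_evaluation(STATES, ACTIONS, recycling_robot, uniform)
print(V_pi)  # {'high': ..., 'low': ...}
```

This version overwrites V[s] during the sweep; a two-array version that computes every $V_{k+1}$ value from $V_k$ alone is an equally valid choice and converges to the same $V^\pi$.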
Slide 29: Policy Iteration

Slide 30: Value Iteration
Recall the full policy evaluation backup:
$$V_{k+1}(s) \leftarrow \sum_a \pi(s,a) \sum_{s'} P^a_{ss'}\big[R^a_{ss'} + \gamma V_k(s')\big]$$
Here is the full value iteration backup:
$$V_{k+1}(s) \leftarrow \max_a \sum_{s'} P^a_{ss'}\big[R^a_{ss'} + \gamma V_k(s')\big]$$

Slide 31: Value Iteration (continued)

Slide 32: Asynchronous DP
All the DP methods described so far require exhaustive sweeps of the entire state set. Asynchronous DP does not use sweeps. Instead it works like this: repeat until a convergence criterion is met:
- Pick a state at random and apply the appropriate backup.
It still needs lots of computation, but it does not get locked into hopelessly long sweeps. Can you select the states to back up intelligently? YES: an agent's experience can act as a guide.

Slide 33: Generalized Policy Iteration
[Figure: policy evaluation ($V \rightarrow V^\pi$) and policy improvement ($\pi \rightarrow \text{greedy}(V)$) pulling against each other and converging to $V^*$ and $\pi^*$.]
Generalized Policy Iteration (GPI): any interaction of policy evaluation and policy improvement, independent of their granularity. The figure is a geometric metaphor for the convergence of GPI.

Slide 34: Efficiency of DP
Finding an optimal policy is polynomial in the number of states... BUT the number of states is often astronomical, e.g., often growing exponentially with the number of state variables (what Bellman called "the curse of dimensionality"). In practice, classical DP can be applied to problems with a few million states. Asynchronous DP can be applied to larger problems and is appropriate for parallel computation. It is surprisingly easy to come up with MDPs for which DP methods are not practical.

Slide 39: TD Bootstraps and Samples
Bootstrapping: the update involves an estimate.
- MC does not bootstrap
- DP bootstraps
- TD bootstraps
Sampling: the update does not involve an expected value.
- MC samples
- DP does not sample
- TD samples

Slide 40: Advantages of TD Learning
TD methods do not require a model of the environment, only experience. TD methods, but not MC methods, can be fully incremental:
- You can learn before knowing the final outcome (less memory, less peak computation).
- You can learn without the final outcome (from incomplete sequences).
Both MC and TD converge (under certain assumptions to be detailed later), but which is faster?

Slide 41: Learning an Action-Value Function
[Figure: backup diagram along the trajectory $s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1}, r_{t+2}, s_{t+2}, a_{t+2}, \ldots$]
Estimate $Q^\pi$ for the current behavior policy $\pi$. After every transition from a nonterminal state $s_t$, do this:
$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha\big[r_{t+1} + \gamma Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t)\big]$$
If $s_{t+1}$ is terminal, then $Q(s_{t+1}, a_{t+1}) = 0$.

Slide 42: Sarsa: On-Policy TD Control
Turn this into a control method by always updating the policy to be greedy with respect to the current estimate.

Slide 43: Q-Learning: Off-Policy TD Control
One-step Q-learning:
$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha\big[r_{t+1} + \gamma \max_a Q(s_{t+1}, a) - Q(s_t, a_t)\big]$$
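The Sarsa and Q-learning updates above differ only in their bootstrap target. Here is a minimal sketch of both tabular updates with an epsilon-greedy behavior policy, reusing the recycling-robot dictionary sketched earlier; the helper names and the values of alpha, gamma, and epsilon are illustrative assumptions.

```python
import random
from collections import defaultdict

Q = defaultdict(float)  # Q[(state, action)], defaults to 0.0 (covers terminal states too)

def epsilon_greedy(state, actions, epsilon=0.1):
    """Behavior policy: explore with probability epsilon, otherwise act greedily on Q."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])

def sarsa_update(s, a, r, s_next, a_next, alpha=0.1, gamma=0.9):
    """On-policy target: bootstrap from the action a_next actually chosen in s_next."""
    target = r + gamma * Q[(s_next, a_next)]
    Q[(s, a)] += alpha * (target - Q[(s, a)])

def q_learning_update(s, a, r, s_next, next_actions, alpha=0.1, gamma=0.9):
    """Off-policy target: bootstrap from the greedy value max_a' Q(s_next, a')."""
    best = max(Q[(s_next, a_)] for a_ in next_actions) if next_actions else 0.0
    target = r + gamma * best
    Q[(s, a)] += alpha * (target - Q[(s, a)])

def sample_step(s, a):
    """Sample one transition from the recycling-robot dynamics sketched earlier."""
    outcomes = recycling_robot[(s, a)]
    _, s_next, r = random.choices(outcomes, weights=[p for p, _, _ in outcomes])[0]
    return r, s_next

# One illustrative Sarsa step (Q-learning would use q_learning_update analogously).
s = "low"
a = epsilon_greedy(s, ACTIONS[s])
r, s_next = sample_step(s, a)
a_next = epsilon_greedy(s_next, ACTIONS[s_next])
sarsa_update(s, a, r, s_next, a_next)
```

Sarsa bootstraps from the action the behavior policy actually takes next, which makes it on-policy; Q-learning bootstraps from the greedy maximum over next actions regardless of what will actually be taken, which makes it off-policy.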