Reinforcement Learning - Machine Learning | CMSC 726, Study notes of Computer Science

Material Type: Notes; Class: MACHINE LEARNING; Subject: Computer Science; University: University of Maryland; Term: Unknown 1989;

Typology: Study notes

Pre 2010

Uploaded on 02/13/2009

koofers-user-cwr

R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction

Reinforcement Learning

Slides from Sutton and Barto

The Agent-Environment Interface

[Figure: agent-environment loop. The environment sends state $s_t$ and reward $r_t$ to the agent; the agent sends action $a_t$ to the environment, which responds with $r_{t+1}$ and $s_{t+1}$.]

Agent and environment interact at discrete time steps: $t = 0, 1, 2, \ldots$

    Agent observes state at step $t$: $s_t \in S$
    produces action at step $t$: $a_t \in A(s_t)$
    gets resulting reward: $r_{t+1} \in \mathbb{R}$
    and resulting next state: $s_{t+1}$

$$\cdots \; s_t \xrightarrow{a_t} r_{t+1},\, s_{t+1} \xrightarrow{a_{t+1}} r_{t+2},\, s_{t+2} \xrightarrow{a_{t+2}} r_{t+3},\, s_{t+3} \xrightarrow{a_{t+3}} \cdots$$
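The interaction loop can be sketched in a few lines of Python. This is a minimal illustration rather than code from the slides: `ToyEnv`, its `step` method, and the reward rule are hypothetical stand-ins, chosen only so that the loop has something to run against.

```python
import random

class ToyEnv:
    """Hypothetical two-state environment (not from the slides).

    Reward is 1.0 when the action equals the current state, else 0.0;
    the next state is drawn uniformly at random.
    """
    def __init__(self, seed=0):
        self.rng = random.Random(seed)
        self.state = 0

    def step(self, action):
        reward = 1.0 if action == self.state else 0.0   # r_{t+1}
        self.state = self.rng.randint(0, 1)             # s_{t+1}
        return reward, self.state

def run(env, policy, num_steps):
    """Run the loop s_t -> a_t -> (r_{t+1}, s_{t+1}) for num_steps steps."""
    s = env.state
    trajectory = []
    for _ in range(num_steps):
        a = policy(s)                  # agent produces a_t from s_t
        r, s_next = env.step(a)        # environment returns r_{t+1}, s_{t+1}
        trajectory.append((s, a, r, s_next))
        s = s_next
    return trajectory

# A policy that copies the state into the action earns reward 1.0 on every step.
trajectory = run(ToyEnv(), policy=lambda s: s, num_steps=5)
```

Each tuple in `trajectory` is one step of the sequence $s_t, a_t, r_{t+1}, s_{t+1}$, and the next tuple starts from the state the previous one ended in.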

The Agent Learns a Policy

Policy at step $t$, $\pi_t$: a mapping from states to action probabilities.

$\pi_t(s, a)$ = probability that $a_t = a$ when $s_t = s$

• Reinforcement learning methods specify how the agent changes its policy as a result of experience.
• Roughly, the agent’s goal is to get as much reward as it can over the long run.
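A stochastic policy can be stored as a table of action probabilities per state and sampled from. The sketch below assumes a made-up two-state policy; the state names, action names, and probabilities are illustrative only.

```python
import random

# pi[s][a] = probability that a_t = a when s_t = s (hypothetical numbers).
pi = {
    "s0": {"left": 0.2, "right": 0.8},
    "s1": {"left": 1.0, "right": 0.0},
}

def sample_action(pi, s, rng):
    """Draw one action for state s according to the probabilities pi[s]."""
    actions = list(pi[s])
    weights = [pi[s][a] for a in actions]
    return rng.choices(actions, weights=weights, k=1)[0]

rng = random.Random(0)
counts = {"left": 0, "right": 0}
for _ in range(1000):
    counts[sample_action(pi, "s0", rng)] += 1
```

In state `"s0"` roughly four out of five draws should come out `"right"`, while `"s1"` always yields `"left"` because that row of the table is deterministic.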

Returns

Suppose the sequence of rewards after step $t$ is $r_{t+1}, r_{t+2}, r_{t+3}, \ldots$ What do we want to maximize?

In general, we want to maximize the expected return, $E\{R_t\}$, for each step $t$.

Episodic tasks: interaction breaks naturally into episodes, e.g., plays of a game, trips through a maze.

$$R_t = r_{t+1} + r_{t+2} + \cdots + r_T,$$

where $T$ is a final time step at which a terminal state is reached, ending an episode.

Another Example

Get to the top of the hill as quickly as possible.

reward = −1 for each step where not at top of hill ⇒ return = −(number of steps before reaching top of hill)

Return is maximized by minimizing the number of steps taken to reach the top of the hill.

A Unified Notation

• In episodic tasks, we number the time steps of each episode starting from zero.
• We usually do not have to distinguish between episodes, so we write $s_t$ instead of $s_{t,j}$ for the state at step $t$ of episode $j$.
• Think of each episode as ending in an absorbing state that always produces a reward of zero:

$$s_0 \xrightarrow{r_1 = +1} s_1 \xrightarrow{r_2 = +1} s_2 \xrightarrow{r_3 = +1} \text{(absorbing state)}, \qquad r_4 = 0,\; r_5 = 0,\; \ldots$$

• We can cover all cases by writing

$$R_t = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1},$$

where $\gamma$, the discount rate, can be 1 only if a zero-reward absorbing state is always reached.
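The unified return formula translates directly into code. The helper below is a sketch; the function name `discounted_return` and the example reward lists are assumptions made for illustration.

```python
def discounted_return(rewards, gamma=1.0):
    """Return sum_k gamma**k * rewards[k], where rewards[k] plays the role of r_{t+k+1}.

    A finite list is enough for an episodic task: every reward after the
    absorbing state is zero and contributes nothing to the sum.
    """
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

# Three rewards of +1 followed by zeros from the absorbing state.
episode = [1.0, 1.0, 1.0, 0.0, 0.0]
undiscounted = discounted_return(episode)            # gamma = 1: 1 + 1 + 1
discounted = discounted_return(episode, gamma=0.5)   # 1 + 0.5 + 0.25

# Hill-climbing style task: reward -1 per step for 7 steps (hypothetical length).
hill = discounted_return([-1.0] * 7)
```

With `gamma = 1` the hill-climbing return is just minus the number of steps, which is why maximizing return means reaching the top quickly.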

The Markov Property

• By “the state” at step $t$, the book means whatever information is available to the agent at step $t$ about its environment.
• The state can include immediate “sensations,” highly processed sensations, and structures built up over time from sequences of sensations.
• Ideally, a state should summarize past sensations so as to retain all “essential” information, i.e., it should have the Markov Property:

$$\Pr\{s_{t+1} = s', r_{t+1} = r \mid s_t, a_t, r_t, s_{t-1}, a_{t-1}, \ldots, r_1, s_0, a_0\} = \Pr\{s_{t+1} = s', r_{t+1} = r \mid s_t, a_t\}$$

for all $s'$, $r$, and histories $s_t, a_t, r_t, s_{t-1}, a_{t-1}, \ldots, r_1, s_0, a_0$.

Markov Decision Processes

• If a reinforcement learning task has the Markov Property, it is basically a Markov Decision Process (MDP).
• If state and action sets are finite, it is a finite MDP.
• To define a finite MDP, you need to give:
    • state and action sets
    • one-step “dynamics” defined by transition probabilities:

$$P^a_{ss'} = \Pr\{s_{t+1} = s' \mid s_t = s, a_t = a\} \quad \text{for all } s, s' \in S,\ a \in A(s).$$

    • expected rewards:

$$R^a_{ss'} = E\{r_{t+1} \mid s_t = s, a_t = a, s_{t+1} = s'\} \quad \text{for all } s, s' \in S,\ a \in A(s).$$
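One way to write down a finite MDP in code is as nested tables indexed by state, action, and next state. The two-state MDP below is hypothetical (its states, actions, and numbers are not from the slides), and the helper names are illustrative.

```python
# Hypothetical finite MDP (illustrative numbers, not from the slides).
STATES = ["A", "B"]
ACTIONS = {"A": ["stay", "go"], "B": ["stay", "go"]}

# P[s][a][s2] = Pr{ s_{t+1} = s2 | s_t = s, a_t = a }
P = {
    "A": {"stay": {"A": 1.0}, "go": {"A": 0.1, "B": 0.9}},
    "B": {"stay": {"B": 1.0}, "go": {"A": 1.0}},
}
# R[s][a][s2] = E{ r_{t+1} | s_t = s, a_t = a, s_{t+1} = s2 }
R = {
    "A": {"stay": {"A": 0.0}, "go": {"A": 0.0, "B": 1.0}},
    "B": {"stay": {"B": 2.0}, "go": {"A": 0.0}},
}

def is_valid_mdp(states, actions, P, tol=1e-9):
    """Check that every (s, a) pair has a probability distribution over next states."""
    for s in states:
        for a in actions[s]:
            probs = P[s][a]
            if any(p < 0 or s2 not in states for s2, p in probs.items()):
                return False
            if abs(sum(probs.values()) - 1.0) > tol:
                return False
    return True

def expected_reward(P, R, s, a):
    """E{ r_{t+1} | s_t = s, a_t = a }, averaging R over next states."""
    return sum(p * R[s][a][s2] for s2, p in P[s][a].items())
```

Storing only the next states with nonzero probability keeps the tables small; a missing entry is read as probability zero.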

Value Functions

• The value of a state is the expected return starting from that state; it depends on the agent’s policy.

State-value function for policy $\pi$:

$$V^\pi(s) = E_\pi\{R_t \mid s_t = s\} = E_\pi\left\{\sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \,\middle|\, s_t = s\right\}$$

• The value of taking an action in a state under policy $\pi$ is the expected return starting from that state, taking that action, and thereafter following $\pi$.

Action-value function for policy $\pi$:

$$Q^\pi(s, a) = E_\pi\{R_t \mid s_t = s, a_t = a\} = E_\pi\left\{\sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \,\middle|\, s_t = s, a_t = a\right\}$$

Bellman Equation for a Policy π

The basic idea:

$$\begin{aligned}
R_t &= r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \gamma^3 r_{t+4} + \cdots \\
    &= r_{t+1} + \gamma \left( r_{t+2} + \gamma r_{t+3} + \gamma^2 r_{t+4} + \cdots \right) \\
    &= r_{t+1} + \gamma R_{t+1}
\end{aligned}$$

So:

$$\begin{aligned}
V^\pi(s) &= E_\pi\{R_t \mid s_t = s\} \\
         &= E_\pi\{r_{t+1} + \gamma V^\pi(s_{t+1}) \mid s_t = s\}
\end{aligned}$$

Or, without the expectation operator:

$$V^\pi(s) = \sum_a \pi(s, a) \sum_{s'} P^a_{ss'} \left[ R^a_{ss'} + \gamma V^\pi(s') \right]$$

More on the Bellman Equation

$$V^\pi(s) = \sum_a \pi(s, a) \sum_{s'} P^a_{ss'} \left[ R^a_{ss'} + \gamma V^\pi(s') \right]$$

This is a set of equations (in fact, linear), one for each state.

The value function for $\pi$ is its unique solution.
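Because the Bellman equation for a fixed policy is linear in the unknowns $V^\pi(s)$, a small problem can be solved directly as a linear system $(I - \gamma P^\pi) V = r^\pi$. The sketch below uses a hypothetical two-state MDP, a uniform random policy, and a hand-rolled Gaussian elimination so that it needs only the standard library; none of these choices come from the slides.

```python
# Hypothetical MDP: MDP[s][a] = [(probability, next_state, reward), ...]
MDP = {
    "A": {"stay": [(1.0, "A", 0.0)], "go": [(1.0, "B", 1.0)]},
    "B": {"stay": [(1.0, "B", 2.0)], "go": [(1.0, "A", 0.0)]},
}

def solve_linear(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]   # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):                       # back substitution
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x

def solve_bellman(mdp, pi, gamma):
    """Build and solve (I - gamma * P_pi) V = r_pi, one equation per state."""
    states = list(mdp)
    idx = {s: i for i, s in enumerate(states)}
    n = len(states)
    A = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    b = [0.0] * n
    for s in states:
        for a, p_a in pi[s].items():
            for p, s2, r in mdp[s][a]:
                A[idx[s]][idx[s2]] -= gamma * p_a * p
                b[idx[s]] += p_a * p * r
    v = solve_linear(A, b)
    return {s: v[idx[s]] for s in states}

uniform = {s: {a: 0.5 for a in mdp_s} for s, mdp_s in MDP.items()}
V = solve_bellman(MDP, uniform, gamma=0.5)
```

For this toy problem the two equations are $3V(A) - V(B) = 2$ and $-V(A) + 3V(B) = 4$, so the solver should return $V(A) = 1.25$ and $V(B) = 1.75$.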

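Backup diagrams:

[Figure: (a) backup diagram for $V^\pi$, from state $s$ through action $a$ and reward $r$ to next state $s'$; (b) backup diagram for $Q^\pi$, from the pair $(s, a)$ through reward $r$ and next state $s'$ to next action $a'$.]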

Optimal Value Functions

• For finite MDPs, policies can be partially ordered:

$$\pi \ge \pi' \text{ if and only if } V^\pi(s) \ge V^{\pi'}(s) \text{ for all } s \in S$$

• There is always at least one policy (and possibly many) that is better than or equal to all the others. This is an optimal policy. We denote them all $\pi^*$.
• Optimal policies share the same optimal state-value function:

$$V^*(s) = \max_\pi V^\pi(s) \text{ for all } s \in S$$

• Optimal policies also share the same optimal action-value function:

$$Q^*(s, a) = \max_\pi Q^\pi(s, a) \text{ for all } s \in S \text{ and } a \in A(s)$$

This is the expected return for taking action $a$ in state $s$ and thereafter following an optimal policy.

Why Optimal State-Value Functions are Useful

Any policy that is greedy with respect to $V^*$ is an optimal policy.

Therefore, given $V^*$, one-step-ahead search produces the long-term optimal actions.

E.g., back to the gridworld:

[Figure: a) gridworld with special states A and B and their destinations A' and B' (a reward label of 10 is shown); b) $V^*$; c) $\pi^*$]

$V^*$ (panel b), row by row:

    22.0  24.4  22.0  19.4  17.
    19.8  22.0  19.8  17.8  16.
    17.8  19.8  17.8  16.0  14.
    16.0  17.8  16.0  14.4  13.
    14.4  16.0  14.4  13.0  11.
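The one-step-ahead search can be sketched on a much smaller problem than the gridworld. The two-state MDP and its optimal values below are hypothetical (the values were worked out by hand for this toy problem), and the function names are illustrative.

```python
# Hypothetical MDP: MDP[s][a] = [(probability, next_state, reward), ...]
MDP = {
    "A": {"stay": [(1.0, "A", 0.0)], "go": [(1.0, "B", 1.0)]},
    "B": {"stay": [(1.0, "B", 2.0)], "go": [(1.0, "A", 0.0)]},
}
GAMMA = 0.5
V_STAR = {"A": 3.0, "B": 4.0}   # optimal values of this toy MDP, computed by hand

def lookahead(mdp, V, s, a, gamma):
    """One-step-ahead value of action a in state s: sum_{s'} P [R + gamma V(s')]."""
    return sum(p * (r + gamma * V[s2]) for p, s2, r in mdp[s][a])

def greedy_policy(mdp, V, gamma):
    """In every state, pick an action with the largest one-step lookahead value."""
    return {s: max(mdp[s], key=lambda a: lookahead(mdp, V, s, a, gamma)) for s in mdp}

pi_star = greedy_policy(MDP, V_STAR, GAMMA)
```

Only the immediate reward and the value of the next state are consulted, yet the resulting actions are optimal in the long run because $V^*$ already summarizes everything that follows.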

What About Optimal Action-Value Functions?

Given $Q^*$, the agent does not even have to do a one-step-ahead search:

$$\pi^*(s) = \arg\max_{a \in A(s)} Q^*(s, a)$$
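With $Q^*$ stored as a table, acting optimally is a lookup followed by an argmax, with no model of the dynamics involved. The table below is hypothetical and its numbers are illustrative only.

```python
# Q*(s, a) for a hypothetical two-state, two-action problem (illustrative numbers).
Q_STAR = {
    "A": {"stay": 1.5, "go": 3.0},
    "B": {"stay": 4.0, "go": 1.5},
}

def greedy_from_q(Q):
    """pi*(s) = argmax_a Q*(s, a); ties go to the first action listed."""
    return {s: max(q_s, key=q_s.get) for s, q_s in Q.items()}

pi_star = greedy_from_q(Q_STAR)
# The optimal state values can be read off the same table: V*(s) = max_a Q*(s, a).
v_star = {s: max(q_s.values()) for s, q_s in Q_STAR.items()}
```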

Solving the Bellman Optimality Equation

• Finding an optimal policy by solving the Bellman Optimality Equation requires the following:
    • accurate knowledge of environment dynamics;
    • enough space and time to do the computation;
    • the Markov Property.
• How much space and time do we need?
    • polynomial in the number of states (via dynamic programming methods; Chapter 4),
    • BUT, the number of states is often huge (e.g., backgammon has about $10^{20}$ states).
• We usually have to settle for approximations.
• Many RL methods can be understood as approximately solving the Bellman Optimality Equation.

Policy Evaluation

Policy Evaluation: for a given policy $\pi$, compute the state-value function $V^\pi$.

Recall:

State-value function for policy $\pi$:

$$V^\pi(s) = E_\pi\{R_t \mid s_t = s\} = E_\pi\left\{\sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \,\middle|\, s_t = s\right\}$$

Bellman equation for $V^\pi$:

$$V^\pi(s) = \sum_a \pi(s, a) \sum_{s'} P^a_{ss'} \left[ R^a_{ss'} + \gamma V^\pi(s') \right]$$

This is a system of $|S|$ simultaneous linear equations.
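A common alternative to solving the linear system directly is to apply the Bellman equation repeatedly as an update rule, sweeping over all states until the values stop changing. The following is a sketch of that iterative scheme on a hypothetical two-state MDP; the stopping threshold `theta` and all names are assumptions.

```python
# Hypothetical MDP: MDP[s][a] = [(probability, next_state, reward), ...]
MDP = {
    "A": {"stay": [(1.0, "A", 0.0)], "go": [(1.0, "B", 1.0)]},
    "B": {"stay": [(1.0, "B", 2.0)], "go": [(1.0, "A", 0.0)]},
}

def policy_evaluation(mdp, pi, gamma, theta=1e-10):
    """Sweep the Bellman equation for V^pi as an in-place update until convergence.

    pi[s][a] is the probability of taking action a in state s.
    """
    V = {s: 0.0 for s in mdp}
    while True:
        delta = 0.0
        for s in mdp:
            v_new = sum(
                pi[s][a] * sum(p * (r + gamma * V[s2]) for p, s2, r in mdp[s][a])
                for a in mdp[s]
            )
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < theta:      # largest change in this sweep is negligible
            return V

uniform = {s: {a: 1.0 / len(mdp_s) for a in mdp_s} for s, mdp_s in MDP.items()}
V_uniform = policy_evaluation(MDP, uniform, gamma=0.5)
```

For the uniform random policy on this toy problem the values should approach $V(A) = 1.25$ and $V(B) = 1.75$, the solution of the corresponding pair of linear equations.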

Policy Improvement

Suppose we have computed $V^\pi$ for a deterministic policy $\pi$.

For a given state $s$, would it be better to do an action $a \ne \pi(s)$?

The value of doing $a$ in state $s$ is:

$$\begin{aligned}
Q^\pi(s, a) &= E_\pi\{r_{t+1} + \gamma V^\pi(s_{t+1}) \mid s_t = s, a_t = a\} \\
            &= \sum_{s'} P^a_{ss'} \left[ R^a_{ss'} + \gamma V^\pi(s') \right]
\end{aligned}$$

It is better to switch to action $a$ for state $s$ if and only if

$$Q^\pi(s, a) > V^\pi(s)$$
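The switching test can be checked numerically. The sketch below assumes a hypothetical two-state MDP and a deterministic "always stay" policy whose value function was worked out by hand; it lists every state-action pair for which switching would help.

```python
# Hypothetical MDP: MDP[s][a] = [(probability, next_state, reward), ...]
MDP = {
    "A": {"stay": [(1.0, "A", 0.0)], "go": [(1.0, "B", 1.0)]},
    "B": {"stay": [(1.0, "B", 2.0)], "go": [(1.0, "A", 0.0)]},
}
GAMMA = 0.5
pi = {"A": "stay", "B": "stay"}   # deterministic policy to be improved
V_pi = {"A": 0.0, "B": 4.0}       # its value function, computed by hand

def q_value(mdp, V, s, a, gamma):
    """Q^pi(s, a) = sum_{s'} P [R + gamma V^pi(s')]."""
    return sum(p * (r + gamma * V[s2]) for p, s2, r in mdp[s][a])

def better_actions(mdp, pi, V, gamma):
    """List (s, a) pairs with a != pi(s) and Q^pi(s, a) > V^pi(s)."""
    return [
        (s, a)
        for s in mdp
        for a in mdp[s]
        if a != pi[s] and q_value(mdp, V, s, a, gamma) > V[s]
    ]

switches = better_actions(MDP, pi, V_pi, GAMMA)
```

Here $Q^\pi(A, \text{go}) = 1 + 0.5 \cdot 4 = 3 > V^\pi(A) = 0$, so switching to `go` in state `A` is an improvement, while in state `B` the alternative is worth $0 < 4$ and the policy is left alone.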

Policy Improvement Cont.

Do this for all states to get a new policy $\pi'$ that is greedy with respect to $V^\pi$:

$$\begin{aligned}
\pi'(s) &= \arg\max_a Q^\pi(s, a) \\
        &= \arg\max_a \sum_{s'} P^a_{ss'} \left[ R^a_{ss'} + \gamma V^\pi(s') \right]
\end{aligned}$$

Then $V^{\pi'} \ge V^\pi$.

Policy Improvement Cont.

What if $V^{\pi'} = V^\pi$? I.e., for all $s \in S$,

$$V^{\pi'}(s) = \max_a \sum_{s'} P^a_{ss'} \left[ R^a_{ss'} + \gamma V^\pi(s') \right]?$$

But this is the Bellman Optimality Equation. So $V^{\pi'} = V^*$ and both $\pi$ and $\pi'$ are optimal policies.

Policy Iteration

$$\pi_0 \to V^{\pi_0} \to \pi_1 \to V^{\pi_1} \to \cdots \to \pi^* \to V^* \to \pi^*$$

Each step $\pi_k \to V^{\pi_k}$ is policy evaluation; each step $V^{\pi_k} \to \pi_{k+1}$ is policy improvement (“greedification”).
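The alternation of evaluation and greedification can be sketched as a loop that stops when the policy no longer changes. The two-state MDP, the iterative evaluation with threshold `theta`, and the function names are all assumptions made for this illustration.

```python
# Hypothetical MDP: MDP[s][a] = [(probability, next_state, reward), ...]
MDP = {
    "A": {"stay": [(1.0, "A", 0.0)], "go": [(1.0, "B", 1.0)]},
    "B": {"stay": [(1.0, "B", 2.0)], "go": [(1.0, "A", 0.0)]},
}

def q_value(mdp, V, s, a, gamma):
    """sum_{s'} P [R + gamma V(s')] for action a in state s."""
    return sum(p * (r + gamma * V[s2]) for p, s2, r in mdp[s][a])

def evaluate(mdp, pi, gamma, theta=1e-10):
    """Policy evaluation for a deterministic policy pi[s] -> action."""
    V = {s: 0.0 for s in mdp}
    while True:
        delta = 0.0
        for s in mdp:
            v_new = q_value(mdp, V, s, pi[s], gamma)
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < theta:
            return V

def improve(mdp, V, gamma):
    """Greedification: pick an action maximizing Q^pi(s, a) in every state."""
    return {s: max(mdp[s], key=lambda a: q_value(mdp, V, s, a, gamma)) for s in mdp}

def policy_iteration(mdp, gamma):
    pi = {s: next(iter(mdp[s])) for s in mdp}   # arbitrary initial policy pi_0
    while True:
        V = evaluate(mdp, pi, gamma)            # pi_k -> V^{pi_k}
        new_pi = improve(mdp, V, gamma)         # V^{pi_k} -> pi_{k+1}
        if new_pi == pi:                        # policy is stable: stop
            return pi, V
        pi = new_pi

pi_star, V_star = policy_iteration(MDP, gamma=0.5)
```

Starting from "always stay", one round of improvement switches state `A` to `go`, and the next round changes nothing, so the loop ends with values close to $V^*(A) = 3$ and $V^*(B) = 4$.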

Value Iteration Cont.

Asynchronous DP

• All the DP methods described so far require exhaustive sweeps of the entire state set.
• Asynchronous DP does not use sweeps. Instead it works like this:
    • Repeat until convergence criterion is met:
        • Pick a state at random and apply the appropriate backup
• Still needs lots of computation, but does not get locked into hopelessly long sweeps.
• Can you select states to back up intelligently? YES: an agent’s experience can act as a guide.
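A minimal sketch of the random-state scheme follows. It assumes the value-iteration backup as "the appropriate backup", uses a fixed budget of backups in place of a real convergence criterion, and runs on a hypothetical two-state MDP; none of these choices come from the slides.

```python
import random

# Hypothetical MDP: MDP[s][a] = [(probability, next_state, reward), ...]
MDP = {
    "A": {"stay": [(1.0, "A", 0.0)], "go": [(1.0, "B", 1.0)]},
    "B": {"stay": [(1.0, "B", 2.0)], "go": [(1.0, "A", 0.0)]},
}

def backup(mdp, V, s, gamma):
    """Value-iteration backup for one state: max_a sum_{s'} P [R + gamma V(s')]."""
    return max(
        sum(p * (r + gamma * V[s2]) for p, s2, r in mdp[s][a])
        for a in mdp[s]
    )

def asynchronous_dp(mdp, gamma, num_backups, seed=0):
    """Back up one randomly chosen state at a time, in place, with no sweeps."""
    rng = random.Random(seed)
    states = list(mdp)
    V = {s: 0.0 for s in states}
    for _ in range(num_backups):
        s = rng.choice(states)            # pick a state at random
        V[s] = backup(mdp, V, s, gamma)   # update only that state
    return V

V = asynchronous_dp(MDP, gamma=0.5, num_backups=2000)
```

Replacing `rng.choice(states)` with the states an agent actually visits is one way to let experience guide which states get backed up.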

Generalized Policy Iteration

Generalized Policy Iteration (GPI): any interaction of policy evaluation and policy improvement, independent of their granularity.

[Figure: evaluation drives $V \to V^\pi$ and improvement drives $\pi \to \text{greedy}(V)$; the two processes converge to $\pi^*$ and $V^*$.]

A geometric metaphor for convergence of GPI:

[Figure: starting from some $V$ and $\pi$, the process alternates between the lines $V = V^\pi$ and $\pi = \text{greedy}(V)$, converging to their intersection at $V^*$, $\pi^*$.]

Efficiency of DP

• Finding an optimal policy takes time polynomial in the number of states…
• BUT, the number of states is often astronomical, e.g., often growing exponentially with the number of state variables (what Bellman called “the curse of dimensionality”).
• In practice, classical DP can be applied to problems with a few million states.
• Asynchronous DP can be applied to larger problems, and is appropriate for parallel computation.
• It is surprisingly easy to come up with MDPs for which DP methods are not practical.

Simplest TD Method

$$V(s_t) \leftarrow V(s_t) + \alpha \left[ r_{t+1} + \gamma V(s_{t+1}) - V(s_t) \right]$$

[Figure: backup diagram. The update uses a single sampled transition from $s_t$ via $r_{t+1}$ to $s_{t+1}$; T marks terminal states.]
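The TD update needs only one observed transition at a time. The sketch below applies it to a hypothetical two-step episode replayed many times; the episode, the step size `alpha`, and the function name are assumptions.

```python
def td0_update(V, s, r, s_next, alpha, gamma):
    """V(s_t) <- V(s_t) + alpha * [r_{t+1} + gamma * V(s_{t+1}) - V(s_t)].

    Updates V in place and returns the TD error (the bracketed term).
    """
    td_error = r + gamma * V[s_next] - V[s]
    V[s] += alpha * td_error
    return td_error

# Hypothetical episode: A --(r=0)--> B --(r=1)--> T, where T is terminal (value stays 0).
episode = [("A", 0.0, "B"), ("B", 1.0, "T")]
V = {"A": 0.0, "B": 0.0, "T": 0.0}

# TD error of the very first update to B, computed on a copy so V is untouched.
first_error = td0_update(dict(V), "B", 1.0, "T", alpha=0.1, gamma=1.0)

for _ in range(200):                       # replay the same episode many times
    for s, r, s_next in episode:
        td0_update(V, s, r, s_next, alpha=0.1, gamma=1.0)
```

Each update happens as soon as a transition is observed, without waiting for the episode to end; the estimate for `A` is pulled toward the current estimate for `B`, which is the bootstrapping step.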

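cf. Dynamic Programming

$$V(s_t) \leftarrow E_\pi\{r_{t+1} + \gamma V(s_{t+1})\}$$

[Figure: backup diagram. The update takes an expectation over all one-step transitions from $s_t$, each yielding a reward $r_{t+1}$ and a next state $s_{t+1}$, rather than a single sample; T marks terminal states.]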

TD Bootstraps and Samples

• Bootstrapping: update involves an estimate
    • MC (Monte Carlo) does not bootstrap
    • DP bootstraps
    • TD bootstraps
• Sampling: update does not involve an expected value
    • MC samples
    • DP does not sample
    • TD samples

Advantages of TD Learning

• TD methods do not require a model of the environment, only experience.
• TD, but not MC, methods can be fully incremental.
    • You can learn before knowing the final outcome
        • Less memory
        • Less peak computation
    • You can learn without the final outcome
        • From incomplete sequences
• Both MC and TD converge (under certain assumptions to be detailed later), but which is faster?