CS 510 Lecture 8:
Reinforcement Learning
Rachel Greenstadt
November 18, 2008
or what if no one will label your examples?
Reminders
Bayesian learning exercise due next week
Project presentations December 2,
12 min per group
+ 4 min for questions
Try to ask a question (last chance
for participation)
Reinforcement Learning
The promise: Program agents to do a task
without specifying how -- just give rewards
and punishment!
Learn through trial and error interaction
with an environment
Use Markov Decision Processes (MDPs) to
model problems
Markov Decision Process (MDP)
Models decision-making in partially random
situations
Characterized by states, actions, and
transitions between states
Markov property: given that the state of the MDP
at time t is known, the transition probabilities
to the state at time t + 1 are independent
of all previous states and actions.
Value Iteration
For each state, apply the Bellman Equation
until it converges
Bellman Equation
$U_{i+1}(s) \leftarrow R(s) + \gamma \max_a \sum_{s'} T(s,a,s')\,U_i(s')$
γ is a discount factor that reflects
uncertainty about getting to the next
state
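A minimal sketch of value iteration, applying the Bellman update above to every state until the utilities stop changing. The dictionary layout for T and R, the tolerance, and the two-state chain are illustrative assumptions, not part of the lecture.

```python
def value_iteration(states, actions, T, R, gamma=0.9, tol=1e-6, max_iters=10_000):
    """Repeat the Bellman update until the largest change falls below tol.

    T[(s, a)] is a dict {s_next: probability}; R[s] is the reward of state s.
    (This data layout is an assumption made for the sketch.)
    """
    U = {s: 0.0 for s in states}
    for _ in range(max_iters):
        U_next = {}
        for s in states:
            # Expected utility of each available action; a state with no
            # entry in T is treated as terminal.
            expected = [
                sum(p * U[s2] for s2, p in T[(s, a)].items())
                for a in actions
                if (s, a) in T
            ]
            U_next[s] = R[s] + gamma * (max(expected) if expected else 0.0)
        delta = max(abs(U_next[s] - U[s]) for s in states)
        U = U_next
        if delta < tol:
            break
    return U


# Hypothetical two-state chain: "go" moves from "a" to terminal "b" with certainty.
states = ["a", "b"]
actions = ["go"]
T = {("a", "go"): {"b": 1.0}}
R = {"a": -0.04, "b": 1.0}
U = value_iteration(states, actions, T, R, gamma=1.0)
```

In this toy chain the utility of "a" settles at its own reward plus the utility of "b", which is the Bellman equation with a single action and a single successor.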
MDP Exercise
• $U_{i+1}(s) \leftarrow R(s) + \gamma \max_a \sum_{s'} T(s,a,s')\,U_i(s')$
Sometimes MDPs are formulated with a reward function R(s,a) that depends on the action taken or R(s,a,s’) that also depends on the outcome state (exercise 17.5 in book)
Write the Bellman equations for these functions
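One way to answer the exercise, as a sketch: the reward moves inside the max once it depends on the action, and inside the sum once it also depends on the outcome state. The notation follows the Bellman update used in these slides.

```latex
% Reward depends on the action taken, R(s,a):
U_{i+1}(s) \leftarrow \max_a \Big[ R(s,a) + \gamma \sum_{s'} T(s,a,s')\, U_i(s') \Big]

% Reward also depends on the outcome state, R(s,a,s'):
U_{i+1}(s) \leftarrow \max_a \sum_{s'} T(s,a,s') \Big[ R(s,a,s') + \gamma\, U_i(s') \Big]
```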
Practice Problems for the Final
More later
Learning: Passive vs Active
- Passive learning simply watches the world and tries to learn utilities of being in each state
- Active learning must also act using the learned information and explore the environment
What is the Agent learning?
Agent is learning the expected utility of a
state: U(I)
This is the thing that was computed
during value iteration
But we can’t compute it because we
don’t know the reward function (R(s))
What is the Agent learning from?
Agent gets a set of training episodes
1 episode = sequence of states
In each episode, Agent starts in (1,1) and
experiences a sequence of state transitions until it
reaches a termination state and gets a reward
Given a fixed policy, agent is gathering and learning
from experience
Weakness of LMS
Calculates the utility of each state independently, but the utilities of states are not independent!
The actual utility of a state is constrained by the utilities of the states reachable from it.
The utility of a state is its own reward plus the probability-weighted average of its successor states' utilities
More precisely:
• $U_{i+1}(s) \leftarrow R(s) + \gamma \max_a \sum_{s'} T(s,a,s')\,U_i(s')$ (oh look, Bellman)
• LMS ignores this and converges slowly
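For comparison, a minimal sketch of the LMS idea, read here as direct utility estimation: each state's utility is the average of the observed reward-to-go over all visits, with no use of the link between neighboring states. The episode format (lists of (state, reward) pairs) and the state names are assumptions made for illustration.

```python
from collections import defaultdict


def lms_estimate(episodes, gamma=1.0):
    """Average the observed reward-to-go for every visited state."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for episode in episodes:
        reward_to_go = 0.0
        # Walk backwards so the reward-to-go accumulates from the terminal state.
        for state, reward in reversed(episode):
            reward_to_go = reward + gamma * reward_to_go
            totals[state] += reward_to_go
            counts[state] += 1
    return {s: totals[s] / counts[s] for s in totals}


# Two hypothetical training episodes, each a list of (state, reward) pairs.
episodes = [
    [("start", -0.04), ("mid", -0.04), ("win", 1.0)],
    [("start", -0.04), ("lose", -1.0)],
]
U = lms_estimate(episodes)
```

Each state's estimate is built only from its own samples, which is why this approach needs many episodes before the estimates agree with the Bellman constraint.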
Adaptive Dynamic Programming (ADP)
Observe rewards of all states
Use Bellman to compute utilities
Linear system of equations
But many, many equations for large state
spaces
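A rough sketch of the ADP idea for a fixed policy: estimate transition probabilities from observed counts, then compute utilities from the Bellman equations. Two assumptions to flag: because the policy is fixed there is no max over actions, and the equations are solved here by repeated sweeps rather than by a direct linear solve. The function name and data layout are hypothetical.

```python
from collections import defaultdict


def adp_estimate(transitions, rewards, gamma=0.9, sweeps=1000):
    """transitions: observed (s, s_next) pairs under the fixed policy.
    rewards: dict {state: observed reward}, covering every state seen.
    """
    counts = defaultdict(lambda: defaultdict(int))
    for s, s_next in transitions:
        counts[s][s_next] += 1
    # Maximum-likelihood estimate of the transition model T[s][s_next].
    T = {
        s: {s2: n / sum(nexts.values()) for s2, n in nexts.items()}
        for s, nexts in counts.items()
    }
    U = {s: 0.0 for s in rewards}
    for _ in range(sweeps):
        # Fixed-policy Bellman update; states with no outgoing
        # observations are treated as terminal.
        U = {
            s: rewards[s] + gamma * sum(p * U[s2] for s2, p in T.get(s, {}).items())
            for s in rewards
        }
    return U


# Hypothetical observations: "start" led to "mid" once and to "lose" once.
transitions = [("start", "mid"), ("mid", "win"), ("start", "lose")]
rewards = {"start": -0.04, "mid": -0.04, "win": 1.0, "lose": -1.0}
U = adp_estimate(transitions, rewards, gamma=1.0)
```

The number of unknowns equals the number of states, which is the scaling problem the slide points out for large state spaces.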
TD Learning
Instead of solving equations for all states, incrementally update state utilities on each transition
When observing transition from I to J:
• $U(I) \leftarrow U(I) + \alpha\,(R(I) + \gamma\,U(J) - U(I))$
U(J) initialized as R(J)
• α is the learning rate
• γ is the discount factor
• If α is properly adjusted, TD is guaranteed to converge to the optimal value
function
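A minimal sketch of a single TD update for an observed transition from state I to state J. It uses the standard TD form, in which the successor's current utility estimate U(J) is what appears inside the parentheses. The dictionary-based tables and the example values are assumptions for illustration.

```python
def td_update(U, R, i, j, alpha=0.1, gamma=1.0):
    """Nudge U[i] toward R[i] + gamma * U[j] after observing i -> j."""
    # A state seen for the first time starts at its observed reward.
    U.setdefault(i, R[i])
    U.setdefault(j, R[j])
    U[i] += alpha * (R[i] + gamma * U[j] - U[i])
    return U


# Hypothetical values: one transition from "i" to "j".
U = {"i": 0.5, "j": 1.0}
R = {"i": -0.04, "j": 1.0}
td_update(U, R, "i", "j", alpha=0.5, gamma=1.0)
```

Only the utility of the state just left is changed, so each observed transition costs constant work regardless of the size of the state space.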
Example TD Learning
Let U(1,3) = 0.84, U(2,3) = 0.
Assume prob of transition from (1,3) to (2,3) is 1
Let reward of all states be -0.04
R(2,3) = R(1,3) = -0.04
• Let α = γ = 1
Using TD:
U(1,3) = U(1,3) - 0.04 + U(2,3) - U(1,3)
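As a short worked step, with α = γ = 1 the old estimate cancels out of the update, so the new value of U(1,3) depends only on the reward and the successor's utility:

```latex
U(1,3) \leftarrow U(1,3) + 1 \cdot \big( R(1,3) + 1 \cdot U(2,3) - U(1,3) \big)
       = R(1,3) + U(2,3)
       = -0.04 + U(2,3)
```

With a smaller α the update would move U(1,3) only part of the way toward this target.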