





































An overview of deep learning applications in reinforcement learning. It covers the basics of reinforcement learning, including the Markov property, state-action value function, Bellman equation, and value iteration. The document also discusses the differences between model-based and model-free approaches, as well as the advantages and disadvantages of policy-based RL. Examples of applications in Atari games, robotics, object detection, and AlphaGo are provided.
Typology: Lecture notes






































Courtesy of Qiang Ji
Yann LeCun's Cake Analogy
Self-supervised learning
Reinforcement learning: learning to control a system through actions so as to maximize a numerical value (reward) that represents a long-term objective.
- There is no supervisor, only a reward signal
- The agent interacts directly with the world
- The agent's actions affect the state of the world
- Feedback is sequential and non-i.i.d.; it is not instantaneous and often arrives with a delay
Like human learning, learning is via implicit or weak supervision, and is incremental and continuous.
Slides from David Silver, Google DeepMind
Computer agent
- At each step $t$ the agent:
  - Receives state $s_t$
  - Executes action $a_t$ based on policy $\pi$
  - Receives scalar reward $r_t$
- The environment:
  - Receives action $a_t$
  - Emits state $s_{t+1}$ based on the transition probability $p^{a}_{ss'}$
  - Emits scalar reward $r_t$
- Repeats until the goal is achieved
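The agent-environment loop above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `ToyEnvironment` (a five-state corridor with the goal at the right end), `random_policy`, and `run_episode` are hypothetical names that do not come from the slides.

```python
import random


class ToyEnvironment:
    """Hypothetical 1-D corridor with states 0..n_states-1; the goal is the last state."""

    def __init__(self, n_states=5):
        self.n_states = n_states
        self.state = 0

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        # The environment receives the action (-1 = left, +1 = right),
        # emits the next state, and emits a scalar reward.
        self.state = min(max(self.state + action, 0), self.n_states - 1)
        done = self.state == self.n_states - 1
        reward = 1.0 if done else 0.0
        return self.state, reward, done


def random_policy(state, rng):
    """Assumed policy: pick left or right uniformly at random."""
    return rng.choice([-1, 1])


def run_episode(env, policy, rng, max_steps=1000):
    """Repeat the receive-state / act / receive-reward cycle until the goal is reached."""
    state = env.reset()
    total_reward = 0.0
    steps = 0
    for steps in range(1, max_steps + 1):
        action = policy(state, rng)              # agent executes an action
        state, reward, done = env.step(action)   # environment responds
        total_reward += reward
        if done:
            break
    return total_reward, steps


rng = random.Random(0)
print(run_episode(ToyEnvironment(), random_policy, rng))
```

Passing a different `policy` function changes only the agent side of the loop; the environment code stays untouched.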
Interaction between agent and environment:
The state-action-transition-reward diagram for a simple MDP with three states (green circles), two actions (orange circles), two rewards (yellow arrows), and transition probabilities.
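Picture from Wikipedia
State-action value function (Q function) under policy $\pi$:
$$Q^{\pi}(s_t, a_t) = E\left[\sum_{k=0}^{\infty} \gamma^{k} r_{t+k} \,\middle|\, s_t, a_t\right]$$
Bellman equation, using the expected Q function at time $t+1$:
$$Q^{\pi}(s_t, a_t) = E\left[r_t + \gamma\, Q^{\pi}(s_{t+1}, a_{t+1}) \,\middle|\, s_t, a_t\right]$$
Optimal Q function: select the best action at time $t+1$ instead of using the current policy $\pi$:
$$Q^{*}(s_t, a_t) = E\left[r_t + \gamma \max_{a_{t+1}} Q^{*}(s_{t+1}, a_{t+1}) \,\middle|\, s_t, a_t\right]$$
With a deterministic state transition to $s_{t+1}$:
$$Q^{*}(s_t, a_t) = r_t + \gamma \max_{a_{t+1}} Q^{*}(s_{t+1}, a_{t+1})$$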
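As a rough illustration of the deterministic Bellman backup, the sketch below repeatedly applies Q(s, a) = r + γ max Q(s', a') to every state-action pair (value iteration on Q). The three-state MDP in `NEXT_STATE` and `REWARD`, the discount `GAMMA = 0.9`, and the function name `q_value_iteration` are assumptions made for this example, not values from the slides.

```python
GAMMA = 0.9  # assumed discount factor

# Hypothetical deterministic MDP: NEXT_STATE[s][a] is the successor state,
# REWARD[s][a] the immediate reward. State 2 is absorbing with zero reward.
NEXT_STATE = {0: {0: 0, 1: 1}, 1: {0: 0, 1: 2}, 2: {0: 2, 1: 2}}
REWARD = {0: {0: 0.0, 1: 0.0}, 1: {0: 0.0, 1: 1.0}, 2: {0: 0.0, 1: 0.0}}


def q_value_iteration(next_state, reward, gamma, sweeps=100):
    """Apply the deterministic Bellman optimality backup to all (s, a) pairs."""
    q = {s: {a: 0.0 for a in actions} for s, actions in next_state.items()}
    for _ in range(sweeps):
        # Synchronous sweep: the new table is computed entirely from the old one.
        q = {
            s: {
                a: reward[s][a] + gamma * max(q[next_state[s][a]].values())
                for a in actions
            }
            for s, actions in next_state.items()
        }
    return q


q_star = q_value_iteration(NEXT_STATE, REWARD, GAMMA)
for state, values in q_star.items():
    print(state, {a: round(v, 3) for a, v in values.items()})
```

In this toy problem the table stops changing after a handful of sweeps, because each sweep propagates the reward one step further back from the absorbing state.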
[Figure labels: current state; legal moves; rewards for legal moves]
Model-free approach
- 6 states: 0, 1, 2, 3, 4, 5
  - 0: room 0, 1: room 1, ..., 5: outside
- 6 actions: 0, 1, 2, 3, 4, 5
  - 0: go to room 0, 1: go to room 1, ..., 5: go outside
[Tables: reward table; initial Q table; updated Q table]
- 3 legal moves at room 5; illegal moves are ignored
- Rewards: 100 for a move to the outside, -1 for illegal moves, and 0 for other moves.
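The rooms example can be sketched as tabular Q-learning. Several pieces here are assumptions, since the figures with the actual tables did not survive: the door layout in `LEGAL` is hypothetical (chosen so that state 5 has three legal moves), the discount `GAMMA = 0.8` is not stated in the slides, and the update uses a learning rate of 1, which is reasonable only because transitions are deterministic.

```python
import random

GAMMA = 0.8  # assumed discount factor; not given in the slides
GOAL = 5     # state 5 is "outside"

# Hypothetical door layout: LEGAL[s] lists the rooms reachable from state s.
LEGAL = {0: [4], 1: [3, 5], 2: [3], 3: [1, 2, 4], 4: [0, 3, 5], 5: [1, 4, 5]}


def reward(state, action):
    """100 for a move to the outside, -1 for illegal moves, 0 for other moves."""
    if action not in LEGAL[state]:
        return -1
    return 100 if action == GOAL else 0


def train(episodes=2000, seed=0):
    """Model-free Q-learning with random exploration over legal moves only."""
    rng = random.Random(seed)
    q = [[0.0] * 6 for _ in range(6)]  # initial Q table: all zeros
    for _ in range(episodes):
        state = rng.randrange(6)       # start each episode in a random state
        while True:
            action = rng.choice(LEGAL[state])  # illegal moves are ignored
            next_state = action                # action k leads to state k
            q[state][action] = reward(state, action) + GAMMA * max(q[next_state])
            state = next_state
            if state == GOAL:
                break
    return q


q_table = train()
for state, row in enumerate(q_table):
    print(state, [round(value) for value in row])
```

Entries for illegal moves stay at zero because those moves are never sampled, so they never win the `max` once any legal move has a positive value.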
Normalize: /
Obtain the policy $\pi$ from the final Q table:
$$\pi(s) = \arg\max_{a} Q(s, a)$$
Learning converges very fast.
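Extracting the policy from a final Q table amounts to taking, for each state, the action with the largest Q value. The sketch below shows this with a small made-up table; `greedy_policy` and `q_example` are hypothetical names and values used only for illustration.

```python
def greedy_policy(q_table):
    """Return pi(s) = argmax_a Q(s, a) for every state s (first maximum wins ties)."""
    return [max(range(len(row)), key=row.__getitem__) for row in q_table]


# Hypothetical 3-state, 3-action Q table; rows are states, columns are actions.
q_example = [
    [0.0, 80.0, 0.0],
    [64.0, 0.0, 100.0],
    [0.0, 80.0, 100.0],
]

print(greedy_policy(q_example))  # [1, 2, 2]
```

Dividing every entry of the table by the same positive constant leaves the result unchanged, since the argmax depends only on the ordering within each row.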