Deep Learning Application in Reinforcement Learning: Qiang Ji's Lecture Notes, Lecture notes of Computer Science

An overview of deep learning applications in reinforcement learning. It covers the basics of reinforcement learning, including the Markov property, state-action value function, Bellman equation, and value iteration. The document also discusses the differences between model-based and model-free approaches, as well as the advantages and disadvantages of policy-based RL. Examples of applications in Atari games, robotics, object detection, and AlphaGo are provided.

Typology: Lecture notes

2021/2022

Uploaded on 10/13/2022

jiawei-sun

4/18/2022


Chapter 6

Deep reinforcement learning

Qiang Ji

Outline

• Introduction
• Fundamentals of reinforcement learning
  - Markov Decision Process (MDP)
  - Methods for solving reinforcement learning
• Deep Reinforcement Learning
  - Value-Based Deep RL: Q-function method
  - Policy-Based Deep RL: Policy gradient method
• Applications of deep reinforcement learning

Courtesy of Qiang Ji

Yann LeCun’s Cake Analogy

Self-supervised learning

Reinforcement Learning

Learning to control a system (by choosing actions) so as to maximize some numerical value (reward) which represents a long-term objective.

• There is no supervisor, only a reward signal
• Directly interacts with the world
• Agent's actions affect the state of the world
• Feedback is sequential, non-i.i.d. data, not instantaneous, often with a delay

Like human learning, learning is via implicit or weak supervision, and is incremental and continuous.

=> MDP framework

Slides from David Silver, Google DeepMind

Computer agent

Markov Decision Process (MDP)

• At each step $t$ the agent:
  - Receives state $s_t$
  - Executes action $a_t$ based on $s_t$
  - Receives scalar reward $r_t$
• The environment:
  - Receives action $a_t$
  - Emits state $s_{t+1}$ based on $p^a_{ss'}$
  - Emits scalar reward $r_t$
• Repeats until the goal is achieved


Interaction between agent and environment

State-Action-Transition-Reward Diagram for an MDP

The state-action-transition-reward diagram for a simple MDP with three states (green circles), two actions (orange circles), two rewards (yellow arrows), and transition probabilities.

Picture from Wikipedia
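The agent-environment loop described above can be sketched in a few lines of Python. This is only an illustration: the three-state, two-action MDP below uses made-up probabilities and rewards (they are not read off the diagram), the names `P`, `env_step`, `run_episode`, and `random_policy` are hypothetical, and the loop runs for a fixed number of steps instead of stopping when a goal is reached.

```python
import random

# Hypothetical MDP with 3 states and 2 actions (illustrative numbers only).
# P[s][a] is a list of (probability, next_state, reward) outcomes.
P = {
    0: {0: [(0.5, 0, 0.0), (0.5, 2, 0.0)],
        1: [(1.0, 2, 0.0)]},
    1: {0: [(0.7, 0, 5.0), (0.1, 1, 0.0), (0.2, 2, 0.0)],
        1: [(0.95, 1, 0.0), (0.05, 2, 0.0)]},
    2: {0: [(0.4, 0, 0.0), (0.6, 2, 0.0)],
        1: [(0.3, 0, -1.0), (0.3, 1, 0.0), (0.4, 2, 0.0)]},
}


def env_step(state, action, rng):
    """Environment: receive a_t, emit s_{t+1} sampled from p, and a scalar reward."""
    outcomes = P[state][action]
    weights = [prob for prob, _, _ in outcomes]
    _, next_state, reward = rng.choices(outcomes, weights=weights, k=1)[0]
    return next_state, reward


def random_policy(state, rng):
    """Agent: pick an action given the current state (here uniformly at random)."""
    return rng.choice([0, 1])


def run_episode(policy, start_state, num_steps, seed=0):
    """Run the interaction loop and record (s_t, a_t, r_t, s_{t+1}) tuples."""
    rng = random.Random(seed)
    state, trajectory = start_state, []
    for _ in range(num_steps):
        action = policy(state, rng)
        next_state, reward = env_step(state, action, rng)
        trajectory.append((state, action, reward, next_state))
        state = next_state
    return trajectory


trajectory = run_episode(random_policy, start_state=0, num_steps=5)
```

Attaching the reward to each sampled outcome is a modeling choice of this sketch; the slides write the reward as a function of the state and action only.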

Bellman Equation

State-action value function can be computed recursively:

$$Q^{\pi}(s_t, a_t) = E\Big[\sum_{k=0}^{\infty} \gamma^{k} r_{t+k} \,\Big|\, \pi\Big] = E\big[r_t + \gamma\, Q^{\pi}(s_{t+1}, a_{t+1})\big]$$

The second term is the expected Q function at time t+1.

The optimal state-action value function can also be computed recursively:

$$Q^{*}(s_t, a_t) = E\big[r_t + \gamma \max_{a_{t+1} \in A} Q^{*}(s_{t+1}, a_{t+1})\big]$$

Here the best action at time t+1 is selected instead of using the current policy $\pi$.

In fact, it can be proved that the recursive updating can be applied to any initial Q; with sufficient iterations, $Q_0 \to Q_1 \to Q_2 \to Q_3 \to \dots \to Q^{*}$.
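The claim that the recursive update reaches the same Q* from any initial Q can be checked numerically on a toy problem. The sketch below assumes a hypothetical deterministic MDP with three states and two actions (`NEXT_STATE`, `REWARD`, and the discount `GAMMA = 0.5` are invented for illustration), so the expectation in the Bellman equation reduces to a single term.

```python
GAMMA = 0.5  # assumed discount factor

# Hypothetical deterministic MDP: NEXT_STATE[s][a] and REWARD[s][a].
# State 2 is absorbing and gives no further reward.
NEXT_STATE = [[0, 1], [0, 2], [2, 2]]
REWARD = [[0.0, 0.0], [0.0, 10.0], [0.0, 0.0]]


def bellman_backup(q):
    """One sweep of Q(s,a) = r(s,a) + gamma * max_a' Q(s',a')."""
    return [
        [REWARD[s][a] + GAMMA * max(q[NEXT_STATE[s][a]]) for a in range(2)]
        for s in range(3)
    ]


def iterate(q, num_iterations=60):
    """Apply the backup repeatedly: Q_0 -> Q_1 -> ... -> (approximately) Q*."""
    for _ in range(num_iterations):
        q = bellman_backup(q)
    return q


# Two very different starting points end up at the same table.
q_from_zeros = iterate([[0.0, 0.0] for _ in range(3)])
q_from_other = iterate([[100.0, -50.0] for _ in range(3)])
```

With a discount factor below 1, each sweep shrinks the largest gap between the two tables by a factor of `GAMMA`, which is why the starting values stop mattering after enough iterations.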



MDP - Solving

Solving an MDP involves identifying the optimal policy $\pi^{*}$. There are two approaches:

1. Value Iteration
   1) Start from an initial Q and iteratively update Q using the Bellman equation until convergence. It can be proven that the final Q is the optimal $Q^{*}$.
   2) Obtain the optimal policy $\pi^{*}$ from the final $Q^{*}$.
   Depending on whether the transition probability $p(s_{t+1} \mid s_t, a_t)$ is known or not, it can be performed through either a model-based or a model-free approach.
2. Policy Iteration
   1) Start from an initial $\pi$
   2) Update $Q^{\pi}$ using the Bellman equation
   3) Update $\pi$ using the updated $Q^{\pi}$
   4) Repeat steps 2-3 until convergence
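The policy iteration steps can be sketched as follows. This is a minimal illustration on a hypothetical deterministic MDP (three states, two actions, discount 0.5, all invented here); `evaluate_policy` approximates $Q^{\pi}$ by a fixed number of sweeps, which is one simple choice among several.

```python
GAMMA = 0.5  # assumed discount factor

# Hypothetical deterministic MDP: NEXT_STATE[s][a] and REWARD[s][a].
NEXT_STATE = [[0, 1], [0, 2], [2, 2]]
REWARD = [[0.0, 0.0], [0.0, 10.0], [0.0, 0.0]]
STATES, ACTIONS = range(3), range(2)


def evaluate_policy(policy, num_sweeps=100):
    """Step 2: Q^pi(s,a) = r(s,a) + gamma * Q^pi(s', pi(s')), iterated from zeros."""
    q = [[0.0 for _ in ACTIONS] for _ in STATES]
    for _ in range(num_sweeps):
        q = [
            [REWARD[s][a]
             + GAMMA * q[NEXT_STATE[s][a]][policy[NEXT_STATE[s][a]]]
             for a in ACTIONS]
            for s in STATES
        ]
    return q


def improve_policy(q):
    """Step 3: make the policy greedy with respect to the updated Q^pi."""
    return [max(ACTIONS, key=lambda a: q[s][a]) for s in STATES]


def policy_iteration(policy):
    """Steps 1-4: alternate evaluation and improvement until pi stops changing."""
    while True:
        q = evaluate_policy(policy)
        new_policy = improve_policy(q)
        if new_policy == policy:
            return policy, q
        policy = new_policy


final_policy, final_q = policy_iteration([0, 0, 0])
```

Stopping when the policy no longer changes is the convergence test used here; ties in `improve_policy` are broken in favor of the lower action index.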

Value Iteration: Model-free Approach

We don't know the state transition probability $p(s_{t+1} \mid s_t, a_t)$.

1. If we assume the state transition is deterministic, i.e. $s_{t+1}$ is given:

$$Q^{*}(s_t, a_t) = r(s_t, a_t) + \gamma \max_{a_{t+1} \in A} Q^{*}(s_{t+1}, a_{t+1})$$

Here $s_t$ is the current state and $s_{t+1}$ is the state reached by the deterministic state transition.

2. If we assume the state transition is probabilistic but unknown (TD-learning):

$$Q^{*}(s_t, a_t) = (1 - \alpha)\, Q^{*}(s_t, a_t) + \alpha \big( r(s_t, a_t) + \gamma \max_{a_{t+1} \in A} Q^{*}(s_{t+1}, a_{t+1}) \big)$$

where $s_{t+1}$ is a sample state obtained when applying action $a_t$. Note that because the state transition is probabilistic, $s_{t+1}$ may be different for the same $s_t$ and $a_t$. $\alpha$ is a weight between 0 and 1.
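A minimal sketch of the TD-learning update for the probabilistic case is given below. The two-state environment, its 0.8 transition probability, the rewards, and the values of `GAMMA` and `ALPHA` are all assumptions made for illustration; only the form of `td_update` follows the weighted update described above.

```python
import random

GAMMA = 0.9   # assumed discount factor
ALPHA = 0.02  # assumed weight between 0 and 1


def td_update(q, s, a, r, s_next):
    """Blend the old estimate with the target built from one sampled next state."""
    target = r + GAMMA * max(q[s_next])
    q[s][a] = (1 - ALPHA) * q[s][a] + ALPHA * target


# Hypothetical rewards r(s, a): only action 1 in state 0 pays off.
REWARD = [[0.0, 1.0], [0.0, 0.0]]


def sample_next_state(s, a, rng):
    """Unknown to the learner: action 1 in state 0 reaches state 1 with prob. 0.8."""
    if s == 0 and a == 1:
        return 1 if rng.random() < 0.8 else 0
    return s  # action 0 stays put; state 1 is absorbing


rng = random.Random(0)
q = [[0.0, 0.0], [0.0, 0.0]]
state = 0
for _ in range(20000):
    action = rng.choice([0, 1])
    next_state = sample_next_state(state, action, rng)
    td_update(q, state, action, REWARD[state][action], next_state)
    # Restart the episode from state 0 once the absorbing state is reached.
    state = 0 if next_state == 1 else next_state
```

Because the same state and action can lead to different sampled next states, a single target is noisy; the small weight `ALPHA` averages over many samples so that the estimate settles near the expected value.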

Example of Value Iteration (Q-learning)

• A house has 5 rooms; the goal is to get out of the house.
• If there is a door between two rooms, there exists an edge.
• The reward for a move that leads directly outside is set to 100, -1 for illegal or nonsensical moves, and all others are set to 0.

Figures: legal moves; rewards for legal moves

Model-free approach

• 6 states: 0, 1, 2, 3, 4, 5 (0: room 0, 1: room 1, ..., 5: outside)
• 6 actions: 0, 1, 2, 3, 4, 5 (0: go to room 0, 1: go to room 1, ..., 5: go outside)
• Reward table: 100 for a move to the outside, -1 for illegal moves, and 0 for other moves.
• Reward matrix and initial Q matrix (figures: reward table, initial Q table)
• Episode 1: Start from room 1, go to room 5. The agent reaches state 5, so this episode is over. There are 3 legal moves at state 5; illegal moves are ignored. (Figure: updated Q table)
• Episode 2: Start from room 3, go to room 1. (Figure: updated Q table)
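The two episodes can be reproduced with the deterministic update Q(s,a) = r(s,a) + γ max Q(s',a'). The figures with the door layout and the discount factor are not part of this text, so the reward matrix `R` below (which doors exist) and `GAMMA = 0.8` are assumptions; they were chosen to agree with the statements that room 1 leads outside, room 3 connects to room 1, and state 5 has 3 legal moves.

```python
GAMMA = 0.8  # assumed discount factor

# R[s][a]: -1 = illegal move, 0 = legal move, 100 = move that leads outside.
# The door layout encoded here is an assumption, not taken from the text.
R = [
    [-1, -1, -1, -1,  0,  -1],   # room 0
    [-1, -1, -1,  0, -1, 100],   # room 1
    [-1, -1, -1,  0, -1,  -1],   # room 2
    [-1,  0,  0, -1,  0,  -1],   # room 3
    [ 0, -1, -1,  0, -1, 100],   # room 4
    [-1,  0, -1, -1,  0, 100],   # state 5 (outside)
]
Q = [[0.0] * 6 for _ in range(6)]  # initial Q table: all zeros


def legal_actions(state):
    """Illegal moves (reward -1) are ignored."""
    return [a for a in range(6) if R[state][a] != -1]


def update(state, action):
    """Deterministic update: taking action k moves the agent to state k."""
    next_state = action
    best_next = max(Q[next_state][a] for a in legal_actions(next_state))
    Q[state][action] = R[state][action] + GAMMA * best_next


update(1, 5)  # Episode 1: room 1 -> outside, Q(1,5) = 100 + 0.8 * 0
update(3, 1)  # Episode 2: room 3 -> room 1,  Q(3,1) = 0 + 0.8 * 100
```

Under these assumptions the first episode sets Q(1,5) to 100 and the second sets Q(3,1) to 80, since the best value available from room 1 is now 100.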

After several episodes, Q will converge to

Normalize: /

Obtain the policy $\pi^{*}$ from the final $Q$:

$$\pi^{*}(s) = \arg\max_{a} Q(s, a)$$

Figure: updated Q table

Value Iteration - model free (deterministic)

Initialize Q(s,a) to arbitrary values
Specify r(s,a) with sensible and generic values
Repeat
    For all s ∈ S
        For all a ∈ A
            Q(s,a) = r(s,a) + γ max_{a'} Q(s',a')
Until Q(s,a) converges

π*(s) = arg max_a Q(s,a)

Learning converges very fast.
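The procedure above can be written as a short Python sketch for the house example. As before, the door layout in `R`, the discount `GAMMA = 0.8`, and the stopping tolerance are assumptions, since the figures are not available in this text; `value_iteration` and `greedy_policy` are hypothetical helper names.

```python
GAMMA = 0.8       # assumed discount factor
TOLERANCE = 1e-9  # assumed convergence threshold

# R[s][a]: -1 = illegal, 0 = legal, 100 = leads outside (layout is assumed).
R = [
    [-1, -1, -1, -1,  0,  -1],
    [-1, -1, -1,  0, -1, 100],
    [-1, -1, -1,  0, -1,  -1],
    [-1,  0,  0, -1,  0,  -1],
    [ 0, -1, -1,  0, -1, 100],
    [-1,  0, -1, -1,  0, 100],
]
NUM_STATES = len(R)


def legal_actions(state):
    return [a for a in range(NUM_STATES) if R[state][a] != -1]


def value_iteration():
    """Repeat the sweep over all (s, a) until no entry changes noticeably."""
    q = [[0.0] * NUM_STATES for _ in range(NUM_STATES)]
    while True:
        largest_change = 0.0
        for s in range(NUM_STATES):
            for a in legal_actions(s):
                s_next = a  # deterministic transition: action k leads to state k
                new_value = R[s][a] + GAMMA * max(
                    q[s_next][a2] for a2 in legal_actions(s_next)
                )
                largest_change = max(largest_change, abs(new_value - q[s][a]))
                q[s][a] = new_value
        if largest_change < TOLERANCE:
            return q


def greedy_policy(q):
    """pi*(s) = argmax over legal actions of Q(s, a)."""
    return [max(legal_actions(s), key=lambda a: q[s][a]) for s in range(NUM_STATES)]


Q = value_iteration()
policy = greedy_policy(Q)
```

With this assumed layout the greedy policy sends room 2 to room 3, room 0 to room 4, and rooms 1 and 4 straight outside; from room 3 the routes through room 1 and room 4 are equally good.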