Deep Learning Application in Reinforcement Learning: Qiang Ji's Lecture Notes, Lecture notes of Computer Science

A.T. Still University of Health Sciences (ATSU), Computer Science

An overview of deep learning applications in reinforcement learning. It covers the basics of reinforcement learning, including the Markov property, state-action value function, Bellman equation, and value iteration. The document also discusses the differences between model-based and model-free approaches, as well as the advantages and disadvantages of policy-based RL. Examples of applications in Atari games, robotics, object detection, and AlphaGo are provided.

Typology: Lecture notes

2021/2022

Uploaded on 10/13/2022

jiawei-sun 🇺🇸


4/18/2022

Chapter6

Deepreinforcementlearning

Qiang Ji


Outline

• Introduction
• Fundamentals of reinforcement learning
• Markov Decision Process (MDP)
• Methods for solving reinforcement learning
• Deep Reinforcement Learning
• Value-Based Deep RL: Q-function method
• Policy-Based Deep RL: policy gradient method
• Applications of deep reinforcement learning

Courtesy of Qiang Ji

Yann LeCun’s Cake Analogy

Self-supervised learning

Reinforcement Learning

Learning to control a system (by taking actions) so as to maximize a numerical value (the reward) that represents a long-term objective.

• There is no supervisor, only a reward signal.
• The agent interacts directly with the world.
• The agent's actions affect the state of the world.
• Feedback is sequential, non-i.i.d. data; it is not instantaneous and often arrives with a delay.

Like human learning, learning is via implicit or weak supervision, and is incremental and continuous.

=> MDP framework

Slides from David Silver, Google DeepMind


Markov Decision Process (MDP)

At each step $t$, the agent:

• Receives state $s_t$
• Executes action $a_t$ based on $s_t$
• Receives scalar reward $r_t$

The environment:

• Receives action $a_t$
• Emits state $s_{t+1}$ based on the transition probability $p^{a}_{ss'}$
• Emits scalar reward $r_t$

Repeats until the goal is achieved


Interaction between agent and environment
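The agent-environment loop described above can be sketched in a few lines of Python. This is only an illustration: the `ToyEnvironment` class, its five-state corridor, the 0.9 success probability, and the always-move-right `policy` are hypothetical and do not come from the slides.

```python
import random


class ToyEnvironment:
    """Hypothetical 1-D corridor with states 0..4; the goal is state 4."""

    def __init__(self):
        self.state = 0

    def step(self, action):
        # action is -1 (left) or +1 (right). The move succeeds with
        # probability 0.9, otherwise the state is unchanged; this plays
        # the role of the transition probability p^a_{ss'}.
        if random.random() < 0.9:
            self.state = min(4, max(0, self.state + action))
        reward = 1.0 if self.state == 4 else 0.0
        done = self.state == 4
        return self.state, reward, done


def policy(state):
    """Hypothetical fixed policy: always move right."""
    return +1


def run_episode(env, max_steps=1000):
    state = env.state
    total_reward = 0.0
    for t in range(max_steps):
        action = policy(state)                  # agent executes a_t based on s_t
        state, reward, done = env.step(action)  # environment emits next state and reward
        total_reward += reward                  # agent receives the scalar reward
        if done:                                # repeat until the goal is achieved
            break
    return state, total_reward, t + 1
```

Calling `run_episode(ToyEnvironment())` returns the final state, the accumulated reward, and the number of steps taken.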

State-Action-Transition-Reward Diagram for an MDP

The state-action-transition-reward diagram for a simple MDP with three states (green circles), two actions (orange circles), two rewards (yellow arrows), and transition probabilities.

Picture from Wikipedia

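Bellman Equation

The state-action value function is the expected discounted sum of rewards under policy $\pi$:

$$Q^{\pi}(s_t, a_t) = E\left[\sum_{k=0}^{T} \gamma^{k} r_{t+k} \,\Big|\, \pi\right]$$

The state-action value function can be computed recursively:

$$Q^{\pi}(s_t, a_t) = r_t + \gamma\, E\left[Q^{\pi}(s_{t+1}, a_{t+1})\right]$$

where the expectation is the expected Q function at time $t+1$.

The optimal state-action value function can also be computed recursively:

$$Q^{*}(s_t, a_t) = r_t + \gamma\, E\left[\max_{a_{t+1} \in A} Q^{*}(s_{t+1}, a_{t+1})\right]$$

which selects the best action at time $t+1$ instead of using the current policy $\pi$.

In fact, it can be proved that the recursive update can be applied to any initial Q and, with sufficient iterations, converges to the optimal one: $Q_0 \to Q_1 \to Q_2 \to \dots \to Q^{*}$.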
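To make the recursive form concrete, the sketch below checks on a single fixed trajectory that the discounted sum of rewards equals the recursion "current reward plus discounted future value". With one fixed trajectory the expectation drops out. The reward sequence and the discount factor 0.8 are made-up values for illustration.

```python
def discounted_return(rewards, gamma):
    """Direct sum: sum over k of gamma^k * r_{t+k}."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))


def discounted_return_recursive(rewards, gamma):
    """Recursive form: G_t = r_t + gamma * G_{t+1}, with G = 0 after the last step."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g


rewards = [0.0, 0.0, 100.0]  # hypothetical reward sequence
gamma = 0.8                  # assumed discount factor
direct = discounted_return(rewards, gamma)
recursive = discounted_return_recursive(rewards, gamma)
```

Both functions return the same value up to floating-point rounding, which is the property the Bellman equation exploits.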



MDP: Solving

Solving an MDP involves identifying the optimal policy $\pi^{*}$. There are two approaches.

1. Value Iteration

1) Start from an initial Q and iteratively update Q using the Bellman equation until convergence. It can be proven that the final Q is the optimal $Q^{*}$.
2) Obtain the optimal policy $\pi^{*}$ from the final $Q^{*}$.

Depending on whether the transition probability $p(s_{t+1} \mid s_t, a_t)$ is known or not, value iteration can be performed through either a model-based or a model-free approach.

2. Policy Iteration

1) Start from an initial policy $\pi$
2) Update $Q^{\pi}$ using the Bellman equation
3) Update $\pi$ using the updated $Q^{\pi}$
4) Repeat steps 2 and 3 until convergence
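The policy iteration steps can be sketched for the model-based case, where the transition probabilities are known. The three-state MDP below (its transitions `P`, rewards `R`, and the discount factor 0.9) is invented purely for illustration and is not from the lecture.

```python
GAMMA = 0.9          # assumed discount factor
STATES = [0, 1, 2]   # hypothetical states; state 2 is absorbing
ACTIONS = [0, 1]     # hypothetical actions: 0 = stay, 1 = advance

# P[s][a] is a list of (probability, next_state) pairs (assumed known).
P = {
    0: {0: [(1.0, 0)], 1: [(0.8, 1), (0.2, 0)]},
    1: {0: [(1.0, 1)], 1: [(0.8, 2), (0.2, 1)]},
    2: {0: [(1.0, 2)], 1: [(1.0, 2)]},
}
# R[s][a] is the immediate reward for taking action a in state s.
R = {0: {0: 0.0, 1: 0.0}, 1: {0: 0.0, 1: 10.0}, 2: {0: 0.0, 1: 0.0}}


def evaluate(policy, tol=1e-10):
    """Step 2: compute Q^pi by repeating the Bellman update until it stops changing."""
    q = {s: {a: 0.0 for a in ACTIONS} for s in STATES}
    while True:
        delta = 0.0
        for s in STATES:
            for a in ACTIONS:
                expected_next = sum(p * q[s2][policy[s2]] for p, s2 in P[s][a])
                new_value = R[s][a] + GAMMA * expected_next
                delta = max(delta, abs(new_value - q[s][a]))
                q[s][a] = new_value
        if delta < tol:
            return q


def improve(q):
    """Step 3: make the policy greedy with respect to the updated Q."""
    return {s: max(ACTIONS, key=lambda a: q[s][a]) for s in STATES}


def policy_iteration():
    policy = {s: 0 for s in STATES}  # Step 1: arbitrary initial policy
    while True:                      # Step 4: repeat until the policy is stable
        q = evaluate(policy)
        new_policy = improve(q)
        if new_policy == policy:
            return policy, q
        policy = new_policy


policy, q = policy_iteration()
```

In this toy problem the loop stops once the greedy policy no longer changes between two consecutive rounds, which is the convergence test assumed here.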

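Value Iteration: Model-free Approach

We don't know the state transition probability $p(s_{t+1} \mid s_t, a_t)$.

1. If we assume the state transition is deterministic, i.e. $s_{t+1}$ is given:

$$Q^{*}(s_t, a_t) = r_t + \gamma \max_{a_{t+1} \in A} Q^{*}(s_{t+1}, a_{t+1})$$

Here $s_t$ is the current state and $s_{t+1}$ is the state reached by the deterministic transition.

2. If we assume the state transition is probabilistic but unknown (TD-learning):

$$Q^{*}(s_t, a_t) = (1-\alpha)\, Q^{*}(s_t, a_t) + \alpha \left[ r_t + \gamma \max_{a_{t+1} \in A} Q^{*}(s_{t+1}, a_{t+1}) \right]$$

where $s_{t+1}$ is a sample state obtained when applying action $a_t$. Note that because the state transition is probabilistic, $s_{t+1}$ may be different for the same $a_t$. $\alpha$ is a weight between 0 and 1.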
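A single TD-learning update can be sketched as follows. The function name `td_update`, the dictionary-based Q table, and the default values for the weight (0.5) and the discount factor (0.8) are assumptions made for this example only.

```python
from collections import defaultdict


def td_update(q, s, a, r, s_next, actions, alpha=0.5, gamma=0.8):
    """Blend the old estimate with the sampled target using a weight alpha in [0, 1]."""
    # Sampled target: reward plus discounted best value at the sampled next state.
    target = r + gamma * max(q[(s_next, a2)] for a2 in actions)
    q[(s, a)] = (1 - alpha) * q[(s, a)] + alpha * target
    return q[(s, a)]


q = defaultdict(float)  # every Q(s, a) starts at 0
actions = [0, 1]        # hypothetical action set
first = td_update(q, s=0, a=1, r=10.0, s_next=1, actions=actions)
second = td_update(q, s=0, a=1, r=10.0, s_next=1, actions=actions)
```

Setting `alpha=1.0` discards the old estimate entirely, which reduces the update to the deterministic rule of case 1.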

Example of Value Iteration (Q-learning)

• A house has 5 rooms; the goal is to get out of the house.
• If there is a door between two rooms, there exists an edge between them.
• The reward for a move that leads directly outside is set to 100, the reward for illegal or nonsensical moves is -1, and all other rewards are set to 0.

[Figures: legal moves; rewards for legal moves]

Model-free approach

• 6 states: 0, 1, 2, 3, 4, 5 (0 = room 0, 1 = room 1, ..., 5 = outside)
• 6 actions: 0, 1, 2, 3, 4, 5 (0 = go to room 0, 1 = go to room 1, ..., 5 = go outside)
• Rewards: 100 for a move to the outside, -1 for illegal moves, and 0 for other moves.

Reward matrix and initial Q matrix

[Figures: reward table; initial Q table]

Episode 1: Start from room 1, go to room 5 (state 5, outside).

There are 3 legal moves at room 5; illegal moves are ignored.

The agent reaches state 5, so this episode is over.

[Figure: updated Q table]

Episode 2: Start from room 3, go to room 1.

[Figure: updated Q table]
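The two episode updates can be worked through by hand with the deterministic rule. The numbers below rest on assumptions that the text does not state: a discount factor of 0.8, a Q table initialized to zero, legal moves from state 5 leading to states 1, 4, and 5, and legal moves from room 1 leading to room 3 and state 5.

```latex
% Episode 1: from room 1, go to state 5 (outside); all Q entries are still 0.
Q(1,5) = r(1,5) + \gamma \max\{Q(5,1),\, Q(5,4),\, Q(5,5)\}
       = 100 + 0.8 \cdot 0 = 100

% Episode 2: from room 3, go to room 1; Q(1,5) = 100 from Episode 1.
Q(3,1) = r(3,1) + \gamma \max\{Q(1,3),\, Q(1,5)\}
       = 0 + 0.8 \cdot \max\{0,\, 100\} = 80
```

Under these assumptions, the reward for reaching the outside propagates one step backward per episode, discounted by the factor 0.8 each time.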

After several episodes, Q will converge to

[Figure: converged Q table]

Normalize the converged Q.

Obtain the policy $\pi^{*}$ from the final Q:

$$\pi^{*}(s) = \arg\max_{a} Q(s, a)$$

[Figure: updated Q table]

Value Iteration: model-free (deterministic)

Initialize Q(s,a) to arbitrary values
Specify r(s,a) with sensible and generic values
Repeat
    For all s ∈ S
        For all a ∈ A
            Q(s,a) = r(s,a) + γ max_{a'} Q(s',a')
Until Q(s,a) converges
π*(s) = argmax_a Q(s,a)

Learning converges very fast.
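The pseudocode above can be turned into a small runnable sketch for the house example. The door layout in `DOORS`, the discount factor 0.8, the zero initial Q table, and the choice to treat "stay outside" as a legal move are all assumptions, since the floor plan and the tables did not survive in the text.

```python
GAMMA = 0.8   # assumed discount factor (not given in the text)
N = 6         # states 0-4 are rooms, state 5 is outside

# Assumed door layout; the floor-plan figure is not available in the text.
DOORS = [(0, 4), (1, 3), (1, 5), (2, 3), (3, 4), (4, 5)]


def build_rewards():
    """r[s][a]: -1 for illegal moves, 0 for legal moves, 100 for moves to outside."""
    r = [[-1] * N for _ in range(N)]
    for i, j in DOORS:
        r[i][j] = 0
        r[j][i] = 0
    r[5][5] = 0                      # assumed: staying outside is a legal move
    for s in range(N):
        if r[s][5] == 0:
            r[s][5] = 100            # any legal move that ends outside
    return r


def value_iteration(r, gamma=GAMMA, tol=1e-9):
    """Repeat Q(s,a) = r(s,a) + gamma * max_a' Q(s',a') until Q stops changing."""
    q = [[0.0] * N for _ in range(N)]
    while True:
        delta = 0.0
        for s in range(N):
            for a in range(N):
                if r[s][a] < 0:
                    continue         # illegal moves are ignored
                s_next = a           # deterministic: action a means "go to a"
                best = max(q[s_next][a2] for a2 in range(N) if r[s_next][a2] >= 0)
                new_value = r[s][a] + gamma * best
                delta = max(delta, abs(new_value - q[s][a]))
                q[s][a] = new_value
        if delta < tol:
            return q


def greedy_policy(q, r):
    """pi(s) = argmax over legal actions a of Q(s,a)."""
    return [max((a for a in range(N) if r[s][a] >= 0), key=lambda a: q[s][a])
            for s in range(N)]


def escape_path(policy, start):
    """Follow the greedy policy from `start` until the agent is outside."""
    path = [start]
    while path[-1] != 5:
        path.append(policy[path[-1]])
    return path


rewards = build_rewards()
q_table = value_iteration(rewards)
policy = greedy_policy(q_table, rewards)
```

With these assumed values, the largest entries of `q_table` approach 100 / (1 - 0.8) = 500, so dividing every entry by that maximum is one possible way to normalize the table. `escape_path(policy, 2)` lists the rooms visited on the way out from room 2.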
