Efficient Reinforcement Learning with REINFORCE, Lecture notes of Algorithms and Programming

These notes cover the use of REINFORCE in reinforcement learning. They give an overview of policy gradient methods, including REINFORCE, and their practical applications. They also cover the success of reinforcement learning, the importance of policy gradient methods, energy-based policies, and policy optimization, with mathematical formulas and examples to illustrate the concepts.

Typology: Lecture notes

2021/2022

Uploaded on 05/11/2023


Sample Efficient Reinforcement Learning with REINFORCE

Junzi Zhang¹, Jongho Kim¹, Brendan O’Donoghue², Stephen Boyd¹

¹ EE & ICME Departments, Stanford University
² Google DeepMind

AAAI 2021 Virtual Presentation


Overview

1. Why Policy Gradient & REINFORCE?

2. Review of Policy Gradient Methods

3. REINFORCE & Practical Policy Gradient Methods

Reinforcement Learning (RL)

RL: algorithms for solving MDPs when only incomplete information about the MDP M is given as input (e.g., the transition model p and the reward r are accessible only by interacting with the environment).

Today: episodic (restarts are allowed within a trajectory) and model-free (transition and reward models are not stored).

Success of RL

Why Policy Gradient?

Heroes Behind the Success: RL algorithms

Value function learning (global convergence ✓): Q-learning, SARSA, Bellman Residual Minimization, etc.

Monte Carlo Tree Search (global convergence ✓): ε-greedy tree search, UCT, BRUE, etc.

Policy optimization (global convergence ✓✗): policy gradient, random search, actor-critic, etc.

Why REINFORCE?

REINFORCE: balance between good empirical performance & implementation simplicity

Neural Architecture Search, Semantic Program Parser, Visual Question Answering, Dialogue generation, Coreference resolution, ...

2. Review of Policy Gradient Methods

Underlying MDP

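MDP (stationary, discounted): $M = (\mathcal{S}, \mathcal{A}, p, r, \gamma, \rho)$ with $\gamma \in [0, 1)$, where $\mathcal{S}$ is the state space, $\mathcal{A}$ the action space, $p$ the transition probability, $r$ the reward function, $\gamma$ the discount factor, and $\rho$ the initial state distribution.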
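As a rough illustration of the tuple above, a finite MDP can be stored as a handful of arrays. The sketch below is not from the slides: the names `TabularMDP`, `validate`, and `random_mdp` are hypothetical, and it assumes finite state and action sets indexed from 0.

```python
from typing import NamedTuple

import numpy as np


class TabularMDP(NamedTuple):
    """Hypothetical container for M = (S, A, p, r, gamma, rho), finite case."""
    p: np.ndarray    # shape (S, A, S): p[s, a, s2] = P(next state s2 | s, a)
    r: np.ndarray    # shape (S, A): reward r(s, a)
    gamma: float     # discount factor, must lie in [0, 1)
    rho: np.ndarray  # shape (S,): initial state distribution


def validate(mdp: TabularMDP) -> None:
    """Raise AssertionError if the arrays do not describe a valid MDP."""
    S, A, S2 = mdp.p.shape
    assert S == S2 and mdp.r.shape == (S, A) and mdp.rho.shape == (S,)
    assert 0.0 <= mdp.gamma < 1.0
    # Each p[s, a, :] and rho must be a probability distribution.
    assert np.all(mdp.p >= 0) and np.allclose(mdp.p.sum(axis=2), 1.0)
    assert np.all(mdp.rho >= 0) and np.isclose(mdp.rho.sum(), 1.0)


def random_mdp(S: int, A: int, gamma: float = 0.9, seed: int = 0) -> TabularMDP:
    """Draw an arbitrary MDP for experimentation (illustrative only)."""
    rng = np.random.default_rng(seed)
    p = rng.dirichlet(np.ones(S), size=(S, A))  # shape (S, A, S)
    r = rng.uniform(0.0, 1.0, size=(S, A))
    rho = rng.dirichlet(np.ones(S))
    return TabularMDP(p, r, gamma, rho)
```

Calling `validate(random_mdp(4, 3))` should pass silently; in the model-free setting of the talk the arrays `p` and `r` would not be available to the learner and would only be used by the simulator.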

Policy Optimization

Policy optimization reformulation:

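$$\text{maximize}_{\pi \in \Pi} \; F(\pi),$$

where

$$F(\pi) = \mathbf{E} \sum_{t=0}^{\infty} \gamma^t \, r(s_t, a_t),$$

with $s_0 \sim \rho$, $a_t \sim \pi(s_t, \cdot)$, $s_{t+1} \sim p(\cdot \mid s_t, a_t)$ for all $t \ge 0$, and

$$\Pi = \left\{ \pi \in \mathbf{R}^{SA} \;\middle|\; \sum_{a=1}^{A} \pi_{s,a} = 1 \ (\forall s \in \mathcal{S}), \quad \pi_{s,a} \ge 0 \ (\forall s \in \mathcal{S},\ a \in \mathcal{A}) \right\}.$$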
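To make the objective concrete, the expectation defining F(π) can be approximated by averaging discounted returns over sampled trajectories. The following is a minimal sketch under stated assumptions, not the method of the talk: it assumes a finite MDP stored as arrays, truncates the infinite sum at a finite `horizon` (which introduces a bias of order γ^horizon), and uses the hypothetical function names `in_policy_set` and `estimate_F`.

```python
import numpy as np


def in_policy_set(pi: np.ndarray, tol: float = 1e-9) -> bool:
    """Check membership in Pi: nonnegative entries, each row sums to 1."""
    return bool(np.all(pi >= -tol) and np.allclose(pi.sum(axis=1), 1.0))


def estimate_F(pi, p, r, gamma, rho, horizon=100, episodes=500, seed=0):
    """Monte Carlo estimate of F(pi) from truncated rollouts.

    pi:  (S, A) array, pi[s, a] = probability of action a in state s
    p:   (S, A, S) array of transition probabilities
    r:   (S, A) array of rewards
    rho: (S,) initial state distribution
    """
    rng = np.random.default_rng(seed)
    S, A = r.shape
    total = 0.0
    for _ in range(episodes):
        s = rng.choice(S, p=rho)             # s_0 ~ rho
        ret, discount = 0.0, 1.0
        for _ in range(horizon):             # assumption: truncate the sum
            a = rng.choice(A, p=pi[s])       # a_t ~ pi(s_t, .)
            ret += discount * r[s, a]        # accumulate gamma^t r(s_t, a_t)
            discount *= gamma
            s = rng.choice(S, p=p[s, a])     # s_{t+1} ~ p(. | s_t, a_t)
        total += ret
    return total / episodes
```

The estimator only needs sampled transitions and rewards, which matches the episodic, model-free setting; here the arrays `p` and `r` merely play the role of the simulator.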

$F(\pi)$ is also written as $V^{\pi}(\rho)$ in the value function learning literature.
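When the model is known and finite, this value-function view gives a direct way to compute the objective: the state values solve the linear Bellman system V = r_π + γ P_π V, and the objective is the ρ-weighted average of those values. The sketch below assumes a tabular MDP stored as arrays; the names `policy_value` and `F_exact` are hypothetical and it is meant only as a sanity-check tool, since the model-free setting does not give access to `p` and `r`.

```python
import numpy as np


def policy_value(pi, p, r, gamma):
    """Solve (I - gamma * P_pi) V = r_pi for the state values of policy pi.

    pi: (S, A) policy, p: (S, A, S) transitions, r: (S, A) rewards.
    """
    S = r.shape[0]
    # P_pi[s, t] = sum_a pi[s, a] * p[s, a, t]: state-to-state kernel under pi.
    P_pi = np.einsum("sa,sat->st", pi, p)
    # r_pi[s] = sum_a pi[s, a] * r[s, a]: expected one-step reward under pi.
    r_pi = (pi * r).sum(axis=1)
    return np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)


def F_exact(pi, p, r, gamma, rho):
    """Objective value: initial-distribution-weighted average of state values."""
    return float(rho @ policy_value(pi, p, r, gamma))
```

The linear system is well posed because γ < 1 and P_π is a stochastic matrix, so I − γ P_π is invertible.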