


























































This document discusses the use of REINFORCE in reinforcement learning. It gives an overview of policy gradient methods, including REINFORCE, and their practical applications. It also covers the success of reinforcement learning, the importance of policy gradient methods, energy-based policies, and policy optimization, with mathematical formulas and examples to illustrate the concepts.
Typology: Lecture notes



























































Junzi Zhang¹, Jongho Kim¹, Brendan O'Donoghue², Stephen Boyd¹
¹ EE & ICME Departments, Stanford University; ² Google DeepMind
AAAI 2021 Virtual Presentation
1 Why Policy Gradient & REINFORCE?
2 Review of Policy Gradient Methods
3 REINFORCE & Practical Policy Gradient Methods
RL: algorithms for solving MDPs with incomplete information about M as input (e.g., p and r are accessible only by interacting with the environment).
Today: episodic (the trajectory may be restarted) and model-free (no transition or reward model is stored).
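As an illustration of the episodic, model-free setting, here is a minimal Python sketch. The environment `TwoStateEnv`, its dynamics, and the helper `rollout_return` are hypothetical and not taken from the talk; the point is only that the agent calls `reset` and `step` and never reads p or r directly.

```python
import random


class TwoStateEnv:
    """Hypothetical toy environment; the agent only sees sampled states and rewards."""

    def __init__(self, seed=0):
        self._rng = random.Random(seed)
        self._state = 0

    def reset(self):
        # Episodic setting: the trajectory can be restarted from a fresh initial state.
        self._state = self._rng.choice([0, 1])
        return self._state

    def step(self, action):
        # p and r live inside the environment and are never exposed to the agent.
        reward = 1.0 if action == self._state else 0.0
        if self._rng.random() < 0.9:
            self._state = action
        else:
            self._state = 1 - action
        return self._state, reward


def rollout_return(env, policy, gamma, horizon, rng):
    """Sample one truncated episode and return its discounted reward sum."""
    state = env.reset()
    total, discount = 0.0, 1.0
    for _ in range(horizon):
        # policy[state] holds the action probabilities in the current state.
        action = rng.choices([0, 1], weights=policy[state])[0]
        state, reward = env.step(action)
        total += discount * reward
        discount *= gamma
    return total


# Usage: a policy that always picks the action equal to the current state.
matching_policy = [[1.0, 0.0], [0.0, 1.0]]
print(rollout_return(TwoStateEnv(seed=1), matching_policy, 0.5, 10, random.Random(2)))
```

Nothing in `rollout_return` stores a transition or reward table; it uses only the samples returned by the environment, which is what "model-free" refers to here.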
Heroes Behind the Success: RL algorithms
Value function learning (global convergence ✓): Q-learning, SARSA, Bellman Residual Minimization, etc.
Monte Carlo Tree Search (global convergence ✓): ε-greedy tree search, UCT, BRUE, etc.
Policy optimization (global convergence ✓✗): policy gradient, random search, actor-critic, etc.
REINFORCE: balance between good empirical performance & implementation simplicity
Neural Architecture Search, Semantic Program Parser, Visual Question Answering, Dialogue generation, Coreference resolution, ...
2 Review of Policy Gradient Methods
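MDP (stationary, discounted): $\mathcal{M} = (\mathcal{S}, \mathcal{A}, p, r, \gamma, \rho)$, $\gamma \in [0, 1)$.
Policy optimization reformulation:
$$\underset{\pi \in \Pi}{\text{maximize}} \quad F(\pi),$$
where
$$F(\pi) = \mathbf{E} \sum_{t=0}^{\infty} \gamma^t \, r(s_t, a_t),$$
$s_0 \sim \rho$, $a_t \sim \pi(s_t, \cdot)$, $s_{t+1} \sim p(\cdot \mid s_t, a_t)$, $\forall t \ge 0$, and
$$\Pi = \left\{ \pi \in \mathbf{R}^{SA} \;\middle|\; \sum_{a=1}^{A} \pi_{s,a} = 1 \ (\forall s \in \mathcal{S}), \ \pi_{s,a} \ge 0 \ (\forall s \in \mathcal{S}, \, a \in \mathcal{A}) \right\}.$$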
$F(\pi)$ is also written as $V^{\pi}(\rho)$ in the value function learning literature.
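To make the objective concrete, the following Python sketch evaluates F(π) for a small tabular MDP. It assumes the standard reading V^π(ρ) = Σ_s ρ(s) V^π(s) and approximates V^π by repeatedly applying the Bellman equation for a fixed policy. The function names, the nested-list layout `p[s][a][s2]`, and the toy MDP are illustrative assumptions rather than part of the talk.

```python
def in_policy_set(pi, tol=1e-9):
    """Check that pi lies in Pi: every row is a probability distribution over actions."""
    return all(
        abs(sum(row) - 1.0) <= tol and all(x >= 0.0 for x in row)
        for row in pi
    )


def objective(pi, p, r, gamma, rho, iters=2000):
    """Approximate F(pi) = sum_s rho[s] * V[s] by fixed-point iteration.

    p[s][a][s2] is the transition probability, r[s][a] the reward (assumed layout).
    """
    n_states = len(pi)
    v = [0.0] * n_states
    for _ in range(iters):
        # One application of the Bellman equation for the fixed policy pi.
        v = [
            sum(
                pi[s][a]
                * (r[s][a] + gamma * sum(p[s][a][s2] * v[s2] for s2 in range(n_states)))
                for a in range(len(pi[s]))
            )
            for s in range(n_states)
        ]
    return sum(rho[s] * v[s] for s in range(n_states))


# Toy MDP (hypothetical): action a moves to state a; only action 1 is rewarded.
p = [[[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]]]
r = [[0.0, 1.0], [0.0, 1.0]]
rho = [0.5, 0.5]
uniform = [[0.5, 0.5], [0.5, 0.5]]
always_one = [[0.0, 1.0], [0.0, 1.0]]
print(objective(uniform, p, r, 0.9, rho), objective(always_one, p, r, 0.9, rho))
```

In this toy problem the policy that always takes action 1 collects reward 1 at every step, so its objective is 1 / (1 - γ), while the uniform policy collects half of that in expectation. The iteration converges because γ < 1 makes the update a contraction.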