Reinforcement Learning
Jesús Fernández-Villaverde¹ and Galo Nuño²
September 1, 2022
¹ University of Pennsylvania
² Banco de España

A short introduction

Why reinforcement learning?

  • Useful when:
    1. The dynamics of the state are unknown but simulation is easy: model-free vs. model-based reinforcement learning.
    2. Or the dimensionality is so high that we cannot store the information about the dynamic programming (DP) problem in a table.
  • These methods work surprisingly well in a wide range of situations, although none of them is guaranteed to work.
  • Key for success in economic applications: the ability to simulate fast (link with massive parallelization). Also, reinforcement learning combines very well with neural networks.

Some history

  • Ideas go back to at least Edward Thorndike (1874-1949).
  • Grey Walter (1910-1977)’s “mechanical tortoise” (1951).
  • Marvin Minsky (1927-2016)’s 1954 Ph.D. thesis.
  • Widrow, Gupta, and Maitra (1973): modified Least-Mean-Square (LMS) algorithm.
  • Chris Watkins’s development of Q-learning (1989).

Two applications

  • More than a general theory, reinforcement learning is a set of related ideas.
  • Thus, I will present two applications taken from Sutton and Barto:
    1. The multi-armed bandit problem.
    2. Dynamic programming.
  • Also, for more examples, see:
    1. http://incompleteideas.net/book/code/code2nd.html (and the links therein).
    2. https://www.deepmind.com/learning-resources/introduction-to-reinforcement-learning-with-david-silver.
    3. https://github.com/TikhonJelvis/RL-book/.

The multi-armed bandit problem

  • You need to choose action a among k available options.
  • Each option is associated with a probability distribution of payoffs.
  • You want to maximize the expected (discounted) payoffs.
  • But you do not know which action is best; you only have estimates of your value function (dual control problem of identification and optimization).
  • You can observe actions and period payoffs.
  • The problem goes back to the study of “sequential design of experiments” by Thompson (1933, 1934) and Bellman (1956).
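To make the setup concrete, here is a minimal sketch of a k-armed bandit environment. The class name `GaussianBandit`, its methods, and the choice of Gaussian payoffs with unit variance are illustrative assumptions, not something the slides specify.

```python
import random


class GaussianBandit:
    """A k-armed bandit with Gaussian payoffs (an assumption made for illustration)."""

    def __init__(self, k, seed=0):
        self.rng = random.Random(seed)
        self.k = k
        # True mean payoff of each arm; the decision maker never sees these.
        self.true_means = [self.rng.gauss(0.0, 1.0) for _ in range(k)]

    def pull(self, a):
        """Return one noisy period payoff for action a."""
        return self.rng.gauss(self.true_means[a], 1.0)

    def best_action(self):
        """Index of the arm with the highest true mean (for evaluation only)."""
        return max(range(self.k), key=lambda a: self.true_means[a])


bandit = GaussianBandit(k=10, seed=42)
payoffs = [bandit.pull(3) for _ in range(5)]  # five observed payoffs from arm 3
```

The decision maker only sees the values returned by `pull`, so it has to estimate each arm's value from observed payoffs while also deciding which arm to try next.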

A policy-based method I

  • Proposed by Thathachar and Sastry (1985).
  • A very simple method that uses the average $Q_n(a)$ of the rewards $R_i(a)$, $i = 1, \ldots, n-1$, actually received the first $n-1$ times action $a$ was picked:

$$Q_n(a) = \frac{1}{n-1} \sum_{i=1}^{n-1} R_i(a)$$

  • We start with $Q_1(a) = 0$ for all $a$. Here (and later), we randomize among ties.
  • We update $Q_n(a)$ with a simple recursive formula that follows from the linearity of means:

$$Q_{n+1}(a) = Q_n(a) + \frac{1}{n} \left[ R_n(a) - Q_n(a) \right]$$

Averages of actions not picked are not updated.
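As a quick check of the recursive update, the sketch below applies it to a short list of made-up rewards for a single action and compares the result with the ordinary batch average. The function name `update_average` and the reward values are hypothetical; the indexing assumes that the n-th reward turns the n-th estimate into the (n+1)-th, starting from an initial estimate of 0.

```python
def update_average(q, n, reward):
    """Return Q_{n+1} from Q_n = q and the n-th reward: Q_{n+1} = Q_n + (R_n - Q_n) / n."""
    return q + (reward - q) / n


rewards = [1.0, 0.0, 2.0, 5.0]  # made-up payoffs from one action
q = 0.0                          # initial estimate, before any reward is seen
for n, r in enumerate(rewards, start=1):
    q = update_average(q, n, r)

batch_mean = sum(rewards) / len(rewards)
```

Here `q` and `batch_mean` agree, but the recursive version only stores the current estimate and a counter, not the whole reward history.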

A policy-based method II

  • How do we pick actions?
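    1. Pure greedy method: $\arg\max_a Q_t(a)$.
    2. ϵ-greedy method: mix the best action with a random tremble, i.e., pick the greedy action with probability 1 − ϵ and a uniformly random action with probability ϵ.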
  • Easy to generalize to more sophisticated strategies.
  • In particular, we can connect with genetic algorithms (AlphaGo).
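The two selection rules can be sketched as follows, combined with sample-average estimates and random tie-breaking. The function names `choose_action` and `run`, the three arm means, and the Gaussian payoffs are assumptions made for illustration only.

```python
import random


def choose_action(q, epsilon, rng):
    """Epsilon-greedy choice: explore with probability epsilon, otherwise be greedy."""
    if rng.random() < epsilon:
        return rng.randrange(len(q))            # random tremble
    best = max(q)
    ties = [a for a, v in enumerate(q) if v == best]
    return rng.choice(ties)                     # randomize among ties


def run(true_means, epsilon, steps, seed=0):
    """Simulate one decision maker; return its estimates and pick counts per arm."""
    rng = random.Random(seed)
    k = len(true_means)
    q = [0.0] * k        # sample-average estimates, initialized at 0
    counts = [0] * k     # number of times each action has been picked
    for _ in range(steps):
        a = choose_action(q, epsilon, rng)
        reward = rng.gauss(true_means[a], 1.0)  # Gaussian payoff: an assumption
        counts[a] += 1
        q[a] += (reward - q[a]) / counts[a]     # only the picked action is updated
    return q, counts


true_means = [0.2, 0.8, 0.5]                    # hypothetical arms
q_greedy, n_greedy = run(true_means, epsilon=0.0, steps=2000)
q_eps, n_eps = run(true_means, epsilon=0.1, steps=2000)
```

Setting `epsilon=0.0` recovers the pure greedy method, which can lock onto whichever arm looks good early on. A positive `epsilon` keeps sampling every arm, so all estimates continue to be refined.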