CSE 475 Lab 8: Code your first Deep Reinforcement Learning Algorithm (Reinforce) with PyTorch
Arizona State University

In this notebook, you'll code your first Deep Reinforcement Learning algorithm from scratch: Reinforce (also called Monte Carlo Policy Gradient).

Reinforce is a Policy-based method: a Deep Reinforcement Learning algorithm that tries to optimize the policy directly without using an action-value function. More precisely, Reinforce is a Policy-gradient method, a subclass of Policy-based methods that aims to optimize the policy directly by estimating the weights of the optimal policy using gradient ascent.

To test its robustness, we're going to train it in a simple environment:
* CartPole-v1

[Image: example of what you will achieve at the end of this notebook]

Environments:
* CartPole-v1

RL-Library:
* Python
* PyTorch

Objectives of this notebook

At the end of the notebook, you will:
* Be able to code a Reinforce algorithm from scratch using PyTorch.
* Be able to test the robustness of your agent using simple environments.

Let's code the Reinforce algorithm from scratch

To validate this hands-on for the certification process, you need to push your trained models to the Hub.
* Get a result of >= 350 for CartPole-v1.

To find your result, go to the leaderboard and find your model; the result = mean_reward - std of reward. If you don't see your model on the leaderboard, go to the bottom of the leaderboard page and click the refresh button.

A piece of advice

It's better to run this Colab in a copy on your Google Drive, so that if it times out you still have the saved notebook on your Google Drive and do not need to fill in everything from scratch. To do that, you can either press Ctrl + S or use File > Save a copy in Google Drive.

Set the GPU
* To accelerate the agent's training, we'll use a GPU.
To do that, go to Runtime > Change Runtime type > Hardware Accelerator > GPU

Create a virtual display

During the notebook, we'll need to generate a replay video. To do so with Colab, we need a virtual screen to be able to render the environment (and thus record the frames). Hence the following cell will install the libraries and create and run a virtual screen.

Import the packages

In addition to importing the installed libraries, we also import:
* imageio: a library that will help us generate a replay video

    import numpy as np

    from collections import deque

    import matplotlib.pyplot as plt
    %matplotlib inline

    # PyTorch
    import torch
    import torch.nn as nn
    import torch.nn.functional as F
    import torch.optim as optim
    from torch.distributions import Categorical

    # Gym
    import gym
    import gym_pygame

    # Hugging Face Hub
    # To log in to our Hugging Face account to be able to upload models to the Hub
    from huggingface_hub import notebook_login
    import imageio

Check if we have a GPU
* Let's check if we have a GPU.
* If that's the case, you should see device: cuda:0

    device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
    print(device)

We're now ready to implement our Reinforce algorithm.

First agent: Playing CartPole-v1

Create the CartPole environment and understand how it works

The environment

Why do we use a simple environment like CartPole-v1? As explained in Reinforcement Learning Tips and Tricks, when you implement your agent from scratch you need to be sure that it works correctly and find bugs with easy environments before going deeper, since finding bugs will be much easier in simple environments.

    Try to have some "sign of life" on toy problems.
    Validate the implementation by making it run on harder and harder envs (you can compare results against the RL zoo). You usually need to run hyperparameter optimization for that step.

The CartPole-v1 environment

A pole is attached by an un-actuated joint to a cart, which moves along a frictionless track.
The pendulum is placed upright on the cart and the goal is to balance the pole by applying forces in the left and right direction on the cart.

So, we start with CartPole-v1. The goal is to push the cart left or right so that the pole stays in equilibrium.

The episode ends if:
* The pole angle is greater than ±12°
* The cart position is greater than ±2.4
* The episode length is greater than 500

We get a reward of +1 every timestep the pole stays in equilibrium.

    s_size = env.observation_space.shape[0]
    a_size = env.action_space.n

    print(" OBSERVATION SPACE \n")
    print("The State Space is: ", s_size)

In the def act(self, state) function, we have action = m.sample(). Why do we use sample rather than action = np.argmax(m)?

Answer: We use m.sample() so that the policy is stochastic: actions are drawn according to the probabilities output by the network. REINFORCE optimizes the expected return by using gradients of log π(a|s), which assumes we sample from that distribution. If we used np.argmax(m), we'd always pick the single most probable action (a greedy, deterministic policy), which:
1. kills exploration (the agent might get stuck in a sub-optimal behavior), and
2. no longer matches the stochastic policy the REINFORCE gradient is derived from.

So m.sample() is required both to explore and to get a correct, unbiased policy-gradient estimate.

Let's build the Reinforce Training Algorithm

This is the Reinforce algorithm pseudocode:

[Image: Reinforce pseudocode]

* When we calculate the return G_t (line 6), we see that we calculate the sum of discounted rewards starting at timestep t.
* Why? Because our policy should only reinforce actions on the basis of their consequences: rewards obtained before taking an action are useless (since they were not caused by the action); only the ones that come after the action matter.
* Before coding this, you should read the section "Don't let the past distract you", which explains why we use the reward-to-go policy gradient.
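To make the reward-to-go idea concrete, here is a minimal sketch in plain Python (no PyTorch needed). The helper name `discounted_returns` and the toy reward list are assumptions made for illustration, not part of the lab code. It walks the episode backwards so each return reuses the one computed for the following timestep.

```python
def discounted_returns(rewards, gamma):
    """Return the reward-to-go G_t for every timestep t of one episode.

    Hypothetical helper for illustration: G_t = r_t + gamma * G_{t+1},
    computed from the last timestep back to the first.
    """
    returns = []
    running = 0.0
    for reward in reversed(rewards):
        # Reuse the return of the next timestep instead of re-summing.
        running = reward + gamma * running
        returns.append(running)
    returns.reverse()  # restore chronological order
    return returns


# Toy episode: three timesteps with reward +1 each, as in CartPole.
rewards = [1.0, 1.0, 1.0]
returns = discounted_returns(rewards, gamma=0.5)
print(returns)  # [1.75, 1.5, 1.0]
```

The first action is credited with everything that followed it (1 + 0.5 + 0.25), while the last action is credited only with its own reward.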
But overall the idea is to compute the return at each timestep efficiently.

    def reinforce(policy, optimizer, n_training_episodes, max_t, gamma, print_every):
        # Helps us calculate the score during training
        scores_deque = deque(maxlen=100)
        scores = []
        # Line 3 of pseudocode
        for i_episode in range(1, n_training_episodes + 1):
            saved_log_probs = []
            rewards = []
            # Reset the environment at the beginning of the episode
            state = env.reset()
            # Line 4 of pseudocode: generate an episode
            for t in range(max_t):
                # Sample an action from the current policy
                action, log_prob = policy.act(state)
                saved_log_probs.append(log_prob)
                # Step the environment
                state, reward, done, _ = env.step(action)
                rewards.append(reward)
                if done:
                    break
            # Bookkeeping of scores
            scores_deque.append(sum(rewards))
            scores.append(sum(rewards))

            # Line 6 of pseudocode: calculate the return
            returns = deque(maxlen=max_t)
            n_steps = len(rewards)
            # Compute discounted returns G_t backwards using dynamic programming:
            # iterate from the last timestep to the first
            for t in range(n_steps)[::-1]:
                if len(returns) > 0:
                    disc_return_t = returns[0]
                else:
                    disc_return_t = 0
                returns.appendleft(gamma * disc_return_t + rewards[t])

            # Standardize the returns for more stable training
            eps = np.finfo(np.float32).eps.item()
            returns = torch.tensor(returns)
            returns = (returns - returns.mean()) / (returns.std() + eps)

            # Line 7: compute policy loss
            policy_loss = []
            for log_prob, disc_return in zip(saved_log_probs, returns):
                policy_loss.append(-log_prob * disc_return)
            policy_loss = torch.cat(policy_loss).sum()

            # Line 8: gradient descent step on the negated objective
            optimizer.zero_grad()
            policy_loss.backward()
            optimizer.step()

            if i_episode % print_every == 0:
                print("Episode {}\tAverage Score: {:.2f}".format(i_episode, np.mean(scores_deque)))

[...]

                       cartpole_hyperparameters["gamma"],
                       100)

Define evaluation method
* Here we define the evaluation method that we're going to use to test our Reinforce agent.

    def evaluate_agent(env, max_steps, n_eval_episodes, policy):
        """
        Evaluate the agent for ``n_eval_episodes`` episodes and return the average reward and std of reward.
        :param env: The evaluation environment
        :param n_eval_episodes: Number of episodes to evaluate the agent
        :param policy: The Reinforce agent
        """
        episode_rewards = []
        for episode in range(n_eval_episodes):
            state = env.reset()
            step = 0
            done = False
            total_rewards_ep = 0

            for step in range(max_steps):
                action, _ = policy.act(state)
                new_state, reward, done, info = env.step(action)
                total_rewards_ep += reward

                if done:
                    break
                state = new_state
            episode_rewards.append(total_rewards_ep)
        mean_reward = np.mean(episode_rewards)
        std_reward = np.std(episode_rewards)

        return mean_reward, std_reward

Evaluate our agent

    !pip install --upgrade numpy

    evaluate_agent(eval_env,
                   cartpole_hyperparameters["max_t"],
                   cartpole_hyperparameters["n_evaluation_episodes"],
                   cartpole_policy)

    def record_video(env, policy, out_directory, fps=30):
        """
        Generate a replay video of the agent
        :param env: The environment to record
        :param policy: The policy of our agent
        :param out_directory: Directory where the video is written
        :param fps: how many frames per second (with taxi-v3 and frozenlake-v1 we use 1)
        """
        images = []
        done = False
        state = env.reset()
        img = env.render(mode="rgb_array")
        images.append(img)
        while not done:
            # Sample an action from the policy given the current state
            action, _ = policy.act(state)
            state, reward, done, info = env.step(action)  # We directly put next_state = state for recording logic
            img = env.render(mode="rgb_array")
            images.append(img)

        # Create the full path for the output video file
        video_path = f"{out_directory}/cartpole_video.mp4"
        imageio.mimsave(video_path, [np.array(img) for i, img in enumerate(images)], fps=fps)

    record_video(eval_env, cartpole_policy, "/content/")

* Push your new trained model on the Hub
* Improving the implementation for more complex environments (for instance, what about changing the network to a Convolutional Neural Network to handle frames as observation)?

Congrats on finishing this unit! There was a lot of information. And congrats on finishing the tutorial.
You've just coded your first Deep Reinforcement Learning agent from scratch using PyTorch. Keep learning, stay awesome.
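For reference, the policy class whose act method is discussed in the sampling question does not appear in this text. The sketch below is one way such a policy could look, not the lab's actual implementation: the class name `SketchPolicy`, the single hidden layer, and the sizes (4 inputs, 2 actions, 16 hidden units) are all assumptions chosen for illustration. It contrasts sampling from a `Categorical` distribution with a greedy argmax over the same probabilities.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.distributions import Categorical


class SketchPolicy(nn.Module):
    """Hypothetical policy network; layer sizes are illustrative assumptions."""

    def __init__(self, s_size=4, a_size=2, h_size=16):
        super().__init__()
        self.fc1 = nn.Linear(s_size, h_size)
        self.fc2 = nn.Linear(h_size, a_size)

    def forward(self, x):
        x = F.relu(self.fc1(x))
        # Softmax turns the outputs into action probabilities.
        return F.softmax(self.fc2(x), dim=1)

    def act(self, state):
        # Add a batch dimension: shape (s_size,) -> (1, s_size).
        state = torch.as_tensor(state, dtype=torch.float32).unsqueeze(0)
        probs = self.forward(state)
        m = Categorical(probs)
        # Stochastic choice: actions are drawn according to probs.
        action = m.sample()
        return action.item(), m.log_prob(action)


torch.manual_seed(0)
policy = SketchPolicy()
state = [0.0, 0.1, -0.05, 0.02]  # made-up CartPole-like observation

action, log_prob = policy.act(state)

# Greedy alternative: always the most probable action, no exploration.
greedy_action = policy(torch.tensor([state])).argmax(dim=1).item()
```

The returned `log_prob` keeps its gradient history, which is what lets the training loop backpropagate through `-log_prob * disc_return`; an argmax over the probabilities yields no such log-probability of a sampled action.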