18-661 Introduction to Machine Learning
Reinforcement Learning
Spring 2020, ECE – Carnegie Mellon University

Announcements
• Homework 7 is due on Friday, April 24. You may use a late day if you have any left.
• Wednesday's lecture will be a set of four guest mini-lectures from Samarth Gupta, Jianyu Wang, Mike Weber, and Yuhang Yao. Please attend!
• Recitation on Friday will be a review for the final exam.
• Practice final (multiple choice only) on Monday, April 27.
• Final exam, Part I: Wednesday, April 29 during the usual course time. We will email students with a timezone conflict about starting and finishing 1 hour earlier. Closed book except one double-sided letter/A4-size handwritten cheat sheet.
• Final exam, Part II: Take-home exam from Friday, May 1 (evening) to Sunday, May 3. Open everything except working with other people; designed to take 2 to 3 hours.

Online Learning Recap

What Is Online Learning?
Online learning occurs when we do not have access to our entire training dataset when we start training. We consider a sequence of data and update the predictor based on the latest sample(s) in the sequence.
• Stochastic gradient descent
• Perceptron training algorithm
• Multi-armed bandits
Online learning has practical advantages:
• It helps us handle large datasets.
• It automatically adapts models to changes in the underlying process over time (e.g., changes in house prices' relationship to square footage).

Feedback in Online Learning Problems
What if we don't get full feedback on our actions?
• Spam classification: we know whether our prediction was correct.
• Online advertising: we only know the appeal of the ad we showed. We have no idea whether we were right about the appeal of the other ads.
• Partial information often occurs because we observe feedback from an action taken on the prediction, not the prediction itself.
• Analogy to linear regression: instead of learning the ground truth y, we only observe the value of the loss function, l(y).
• This makes it very hard to optimize the parameters!
• Such partial feedback is often studied via multi-armed bandit problems.

Bandit Formulation
We play multiple rounds t = 1, 2, ..., T. In each round, we select an arm i_t from a fixed set i = 1, 2, ..., n and observe the reward r(i_t) that the arm gives.
(figure: three slot-machine arms labeled Arm 1, Arm 2, Arm 3)
Objective: Maximize the total reward over time, or equivalently, minimize the regret compared to the best arm in hindsight.
• The reward at each arm is stochastic (e.g., 0 with probability p_i and 1 otherwise).
• Usually, the rewards are i.i.d. over time. The best arm is then the arm with the highest expected reward.
• We cannot observe the reward of each arm (the entire reward function): we only know the reward of the arm that we played.
Online ads example: arm = ad, reward = 1 if the user clicks on the ad and 0 otherwise.

Exploration vs. Exploitation Tradeoff
Which arm should I play?
• The best arm observed so far? (exploitation)
• Or should I look around to try to find a better arm? (exploration)
We need both in order to maximize the total reward.

Thompson Sampling
Which arm should you pick next?
• ε-greedy: the best arm so far (exploit) with probability 1 − ε, otherwise a random arm (explore).
• UCB1: the arm with the highest upper confidence bound (UCB) value.
Thompson sampling instead fits a Gaussian distribution to the observations for each arm:
• Assume the reward of arm i is drawn from a normal distribution.
• Find the posterior distribution of the expected reward for each arm: N(r(i), (T_i + 1)^{-1}), where r(i) here denotes the average reward observed from arm i so far and T_i is the number of times arm i has been played.
• Generate a synthetic sample from this posterior distribution for each arm; the sample represents your current belief about that arm's average reward.
• Play the arm with the highest sample.
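A minimal sketch of this procedure in Python, assuming unit-variance rewards and made-up true arm means (the environment here is purely hypothetical; only the posterior form above comes from the notes):

import numpy as np

rng = np.random.default_rng(0)

# Hypothetical bandit: three arms with mean rewards unknown to the learner.
true_means = np.array([0.2, 0.5, 0.8])
n_arms = len(true_means)

sum_rewards = np.zeros(n_arms)  # running sum of observed rewards per arm
plays = np.zeros(n_arms)        # T_i: number of times arm i has been played

for t in range(1000):
    # Posterior over each arm's expected reward: N(average observed reward, 1/(T_i + 1)).
    avg_reward = np.where(plays > 0, sum_rewards / np.maximum(plays, 1), 0.0)
    samples = rng.normal(avg_reward, 1.0 / np.sqrt(plays + 1))

    arm = int(np.argmax(samples))               # play the arm with the highest sample
    reward = rng.normal(true_means[arm], 1.0)   # observe a noisy reward

    sum_rewards[arm] += reward
    plays[arm] += 1

print("plays per arm:", plays)  # most plays should concentrate on the best arm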
Continuous Bandits
• So far we have assumed a finite number of discrete arms.
• What happens if we instead allow continuous arms?
(figure: the x-axis is the arm, the y-axis is the (stochastic) reward)

Bayesian Optimization
Assume that f(x) is drawn from a Gaussian process:
(f(x_1), f(x_2), ..., f(x_n)) ∼ N(0, K),
where K is the kernel (covariance) matrix, e.g., K_{ij} = exp(−‖x_i − x_j‖_2^2).

Overview of Reinforcement Learning
Reinforcement learning is also sometimes called approximate dynamic programming. It can be viewed as a type of optimal control theory with no pre-defined model of the environment.

Grasping an Object
(figure only)

RL Applications
Reinforcement learning can be applied to many different areas.
• Robotics: in which direction and how fast should a robot arm move?
• Mobility: where should taxis go to pick up passengers?
• Transportation: when should traffic lights turn green?
• Recommendations: which news stories will users click on?
• Network configuration: which parameter settings lead to the best allocation of resources?
These problems are similar to multi-armed bandits, but with a notion of state or context.

Objectives of Reinforcement Learning
We choose the actions that maximize the expected total reward:
R(T) = ∑_{t=0}^{T} E[r(a(t), s(t))],    R(∞) = ∑_{t=0}^{∞} E[γ^t r(a(t), s(t))].
We discount the reward at future times by γ < 1 to ensure convergence when T = ∞. The expectation is taken over the probabilistic evolution of the state, and possibly over a probabilistic reward function.
A policy tells us which action to take, given the current state.
• Deterministic policy: π : S → A maps each state s to an action a.
• Stochastic policy: π(a|s) specifies a probability of taking each action a ∈ A given state s. We draw an action from this probability distribution whenever we encounter state s.

Example: Robot Movements
(figure: a grid-world environment; cells are indexed [column, row])
• Reward of +1 if we reach [4,3] and −1 if we reach [4,2]; −0.04 for taking each step.
• What action should we take at state [3,3]? RIGHT
• How about at state [3,2]? UP

Key Challenges of Reinforcement Learning
• The relationship of future states to past states and actions, s(t+1) ∼ σ(a(t), s(t)), must be learned.
• Partial information feedback: the reward feedback r(a(t), s(t)) only applies to the action taken, a(t), and may itself be stochastic. Moreover, we may not be able to observe the full state s(t) (more on this later).
• Since actions affect future states, they should be chosen so as to maximize the total future reward ∑_{t=0}^{∞} γ^t r(a(t), s(t)), not just the current reward.
We can address these challenges by formulating RL using Markov decision processes.

Transition Matrices
A Markov chain can be represented with a transition matrix:
P = [ p_{1,1}  p_{1,2}  ...  p_{1,n}
      p_{2,1}  p_{2,2}  ...  p_{2,n}
      ...      ...      ...  ...
      p_{n,1}  p_{n,2}  ...  p_{n,n} ]
Each entry (i, j) is the probability of transitioning from state i to state j. For example:
         S1   S2   S3   S4
  S1  [  0    1    0    0   ]
  S2  [  0.5  0    0.2  0.1 ]
  S3  [  0.9  0    0    0.1 ]
  S4  [  0    0    0.8  0.2 ]

Markov Processes
Markov chains are special types of Markov processes; Markov processes extend the notion of a Markov chain to possibly infinite numbers of states and to continuous time.
• Infinite states: for instance, if the state is the temperature in this room.
• Continuous time: for instance, if the state is the velocity of a robot arm.
• In both cases, we can still define transition probabilities, due to the Markov property.
The Markov property says that the state evolution is memoryless: the probability distribution of future states depends only on the value of the current state, not the values of previous states.
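To make the Markov property concrete, here is a minimal simulation sketch in Python using a hypothetical 4-state transition matrix (the numbers are made up for illustration, with each row summing to 1); note that the next state is sampled using only the current state's row:

import numpy as np

rng = np.random.default_rng(1)

# Hypothetical transition matrix: entry (i, j) = P(next state j | current state i).
P = np.array([
    [0.0, 1.0, 0.0, 0.0],
    [0.5, 0.0, 0.2, 0.3],
    [0.9, 0.0, 0.0, 0.1],
    [0.0, 0.0, 0.8, 0.2],
])

state = 0               # start in state S1 (index 0)
trajectory = [state]
for _ in range(10):
    # Markov property: the distribution of the next state depends only on the current state.
    state = int(rng.choice(len(P), p=P[state]))
    trajectory.append(state)

print("visited states:", trajectory)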
Motivating the Markov Property
The memorylessness of the Markov property significantly simplifies our predictions of future states (and thus of future rewards).
• Robotic arms: What is the probability that a block moves from point A to point B in the next 5 minutes? Does it matter where the block was 5 minutes ago? 10 minutes ago?
• Taxi mobility: What is the probability that there will be, on average, 10 taxis at SFO in the next 10 minutes? Does this depend on the number of taxis that were there 10 minutes ago? 20 minutes ago?

Markov Decision Processes in RL
• At each time t, the agent experiences a state s(t). We sometimes call the state the "environment."
• At each time t, the agent takes an action a(t), which is chosen from some feasible set A. It then experiences a reward r(a(t), s(t)).
• The next state (at time t+1) is a (probabilistic) function of the current state and action taken: s(t+1) ∼ σ(a(t), s(t)).
The state-action relationships form a Markov decision process! σ(a(t), s(t)) is given by the transition probabilities p^{a(t)}_{s(t),j}, where j ranges over the possible next states.

Finding RL Policies

The State-Value Function
The state-value function of a given policy π : S → A gives its expected future reward when starting at state s:
V_π(s) = E[ ∑_{t=0}^{∞} γ^t r(π(s(t)), s(t)) | s(0) = s ].
The expectation may be taken over a stochastic policy and reward, as well as over the Markov decision process (MDP) of the state transitions.
• The action a(t) at any time t is determined by the policy: a(t) = π(s(t)).
• Due to the Markov property of the underlying MDP, the optimal policy at any time is only a function of the last observed state.

Optimizing the Action-Value Function
Q_{π*}(a, s) = E[ r(a, s) + ∑_{t=1}^{∞} γ^t r(π*(s(t)), s(t)) | s(0) = s, a(0) = a ],
where the sum over t ≥ 1 follows the policy π*.
• We maximize the reward by choosing the action at state s as a*(s) = arg max_a Q_{π*}(a, s).
• What does π* do at time t = 1 and state s(1)? We don't care about what we did at t = 0, so we can pretend t = 0 again and just choose a to maximize Q_{π*}(a, s(1))!
• This logic is related to Bellman's Principle of Optimality (for those familiar with dynamic programming).

Policy Search Methods
Given the above insight, it suffices to learn either the optimal policy π* or the optimal action-value function Q*. When we know the MDP of the state transitions, these approaches are called policy iteration and value iteration. They depend on the Bellman equation:
Q_{π*}(a, s) = r(a, s) + γ ∑_{s'∈S} p^a_{s,s'} V_{π*}(s'),
where the second term is the (discounted) maximum expected future reward. This follows from the fact that V_{π*}(s) = max_a Q_{π*}(a, s).

Value Iteration
Initialize V(s) for each state s. Suppose we know the reward function r(a, s) and the transition probabilities p^a_{s,s'}.
• Update Q(a, s) ← E[r(a, s)] + γ ∑_{s'∈S} p^a_{s,s'} V(s'), where the second term is an estimate of the expected future reward.
• Update V(s) ← max_a Q(a, s).
• Repeat until convergence.
Value iteration is guaranteed to converge to the optimal value function V*(s) and the optimal action-value function Q*(a, s).
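A minimal value-iteration sketch in Python for a small, randomly generated MDP (the states, actions, rewards, and transition probabilities are invented for illustration; only the update rules above come from the notes):

import numpy as np

rng = np.random.default_rng(2)

gamma = 0.9
n_states, n_actions = 3, 2

# Hypothetical MDP: r[a, s] is the expected reward, P[a, s, s'] the transition probability.
r = rng.uniform(0.0, 1.0, size=(n_actions, n_states))
P = rng.uniform(0.0, 1.0, size=(n_actions, n_states, n_states))
P /= P.sum(axis=2, keepdims=True)       # normalize so each row sums to 1

V = np.zeros(n_states)
for _ in range(1000):
    # Q(a, s) <- E[r(a, s)] + gamma * sum_{s'} p^a_{s,s'} V(s')
    Q = r + gamma * (P @ V)             # shape: (n_actions, n_states)
    V_new = Q.max(axis=0)               # V(s) <- max_a Q(a, s)
    if np.max(np.abs(V_new - V)) < 1e-8:
        V = V_new
        break                           # converged
    V = V_new

print("optimal values V*:", V)
print("greedy policy (action per state):", Q.argmax(axis=0))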
Direct Policy Search
If we parameterize the policy, then finding the optimal policy simply means finding the optimal parameter values.
• Gradient descent: try to estimate the gradients of the action-value function Q_π(a, s) and evolve the parameters accordingly. This can be difficult since we don't know Q_π(a, s) in the first place.
• Evolutionary optimization: simulated annealing, cross-entropy search, etc. are generic optimization algorithms that do not require knowledge of the gradient. Such methods often require temporal-difference adjustments in order to converge fast enough.

Q-Learning
Evolve the action-value function to find its optimal value Q_{π*}(a, s).
• Initialize our estimate of Q(a, s) to some arbitrary value. We drop the dependence on π*, since π* is determined by Q_{π*}(a, s).
• After playing action a in state s(t) and observing the next state s(t+1), update Q as follows:
Q(a, s(t)) ← (1 − α) Q(a, s(t)) + α ( r(a, s(t)) + γ max_{a'} Q(a', s(t+1)) ),
i.e., a weighted combination of the old value and the newly learned value. Here α is the learning rate and r(a, s(t)) is our observed reward. The term max_{a'} Q(a', s(t+1)) is our estimate of the expected future reward under the optimal policy π*.
Q-learning has many variants: for instance, deep Q-learning uses a neural network to approximate Q.

Exploration vs. Exploitation
Given Q(a, s), how do we choose our action a?
• Exploitation: take the action a* = arg max_a Q(a, s). Given our current estimate of Q, we want to take what we think is the optimal action.
• Exploration: but we might not have a good estimate of Q, and we don't want to bias our estimate towards an action that turns out not to be optimal.
• ε-greedy: with probability 1 − ε, choose a* = arg max_a Q(a, s); otherwise choose a randomly. Usually, we decrease ε over time as additional exploration becomes less important.
(A minimal sketch combining the Q-learning update with ε-greedy exploration appears at the end of these notes.)

Extensions and Variations of RL
• Multi-agent reinforcement learning: suppose multiple agents are simultaneously using RL to find their optimal actions, and one agent's actions affect another's. The agents must then learn to compete with each other.
• Distributed reinforcement learning: we can speed up the search for the optimal policy by having multiple agents explore the state space in parallel.
• Hierarchical reinforcement learning: lower-level learners try to satisfy goals specified by a higher-level learner, which are designed to maximize an overall reward.
• Transfer learning: learn how to perform a new task based on already-learned methods for performing a related one.

Types of Machine Learning
Supervised Learning
• Training data: (x, y) (features, label) samples. We want to predict y so as to minimize a loss function.
• Examples: regression, classification.
Unsupervised Learning
• Training data: x (features) samples only. We want to find "similar" points in the x space.
• Examples: clustering, PCA/ICA.
Reinforcement Learning
• Training data: (s, a, r) (state, action, reward) samples. We want to find the best sequence of decisions so as to maximize the long-term reward.
• Examples: robotics, multi-armed bandits.

Summary
You should know:
• What a Markov decision process is (action and state variables, transition probabilities).
• What the action-value and state-value functions are.
• The differences between supervised, unsupervised, and reinforcement learning.
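As referenced above, here is a minimal tabular Q-learning sketch with ε-greedy exploration on a small, randomly generated MDP (the environment is entirely hypothetical; only the update rule and the exploration scheme come from these notes):

import numpy as np

rng = np.random.default_rng(3)

n_states, n_actions = 4, 2
gamma, alpha, eps = 0.9, 0.1, 0.1

# Hypothetical environment: random rewards and transition probabilities.
r = rng.uniform(0.0, 1.0, size=(n_actions, n_states))
P = rng.uniform(0.0, 1.0, size=(n_actions, n_states, n_states))
P /= P.sum(axis=2, keepdims=True)   # each row of transition probabilities sums to 1

Q = np.zeros((n_actions, n_states))  # arbitrary initial estimate of Q(a, s)
s = 0
for t in range(20000):
    # epsilon-greedy: exploit with probability 1 - eps, otherwise explore.
    if rng.random() < eps:
        a = int(rng.integers(n_actions))
    else:
        a = int(Q[:, s].argmax())

    reward = r[a, s]                               # observed reward r(a, s(t))
    s_next = int(rng.choice(n_states, p=P[a, s]))  # next state s(t+1)

    # Q(a, s) <- (1 - alpha) Q(a, s) + alpha (r + gamma * max_a' Q(a', s'))
    Q[a, s] = (1 - alpha) * Q[a, s] + alpha * (reward + gamma * Q[:, s_next].max())
    s = s_next

print("learned Q(a, s):\n", Q)
print("greedy policy (action per state):", Q.argmax(axis=0))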