Optimizing Rewards in Non-Stationary Dynamic Systems: State Action Rewards vs Horizon NVPs | Study notes Machine Learning

MachineLearning-Lecture18

Instructor (Andrew Ng): Okay. Welcome back. What I want to do today is talk about one of my favorite algorithms for controlling MDPs that I think is one of the more elegant and efficient and powerful algorithms that I know of. So what I'll do is I'll first start by talking about a couple of variations of MDPs that are slightly different from the MDP definition you've seen so far. These are pretty common variations.

One is state-action rewards, and the other is finite-horizon MDPs. Using this slightly modified definition of an MDP, I'll talk about linear dynamical systems. I'll spend a little bit of time talking about models of linear dynamical systems, and then talk about LQR, or linear quadratic regulation control, which will lead us to something called the Riccati equation, which is something we will solve in order to do LQR control.

So just to recap, and we've seen this definition many times now: we've been defining an MDP as a tuple of states, actions, state transition probabilities, gamma, and a reward function, where gamma is the discount factor, a number between zero and one. And R, the reward function, was the function mapping from the states to the real numbers.

So we had value iteration, which would do this. So after a while, value iteration will cause V to converge to V star. Then, having found the optimal value function, you compute the optimal policy by taking essentially the argmax of this equation above: the argmax over A of that [inaudible].
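The equations written on the board are not captured in the transcript. As a sketch, assuming the standard definitions from earlier lectures (the notation P_sa(s') for the state transition probabilities is an assumption here), the value iteration update and the policy extraction being referred to would read:

```latex
% Value iteration update, applied to every state s:
V(s) := R(s) + \max_{a} \gamma \sum_{s'} P_{sa}(s') \, V(s')

% Optimal policy, read off once V has (approximately) reached V^*:
\pi^*(s) = \arg\max_{a} \sum_{s'} P_{sa}(s') \, V^*(s')
```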

So in value iteration, as you iterate this, you know, perform this update, the function V will [inaudible] converge to V star. So with a finite number of iterations, you get closer and closer to V star. This actually converges exponentially quickly to V star. But we will never exactly converge to V star in a finite number of iterations.
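To make the convergence claim concrete, here is a minimal sketch of value iteration on a small finite MDP. The two-state, two-action MDP, the function name value_iteration, and the use of a very long run as a stand-in for V star are all assumptions made for illustration; none of them come from the lecture.

```python
import numpy as np

def value_iteration(P, R, gamma, num_iters):
    """Run value iteration and return the final V plus every intermediate V.

    P[a, s, s2] is the probability of moving from s to s2 under action a.
    R[s] is the reward for being in state s.
    """
    V = np.zeros(R.shape[0])
    history = [V.copy()]
    for _ in range(num_iters):
        # Bellman backup: V(s) := R(s) + gamma * max_a sum_s' P_sa(s') V(s')
        V = R + gamma * np.max(P @ V, axis=0)
        history.append(V.copy())
    return V, history

# Hypothetical 2-state, 2-action MDP (illustrative numbers only).
P = np.array([
    [[0.9, 0.1], [0.2, 0.8]],  # transitions under action 0
    [[0.5, 0.5], [0.6, 0.4]],  # transitions under action 1
])
R = np.array([0.0, 1.0])
gamma = 0.9

# A very long run serves as a numerical stand-in for V star.
V_ref, _ = value_iteration(P, R, gamma, 2000)

# Max-norm distance to V_ref after each of the first 50 updates.
_, history = value_iteration(P, R, gamma, 50)
errors = [np.max(np.abs(V - V_ref)) for V in history]

# Greedy policy with respect to the (approximately) optimal value function.
policy = np.argmax(P @ V_ref, axis=0)
```

In this sketch each update shrinks the max-norm error by at least a factor of gamma, which is the exponential convergence mentioned above, while the error after any finite number of updates stays strictly positive.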

So what I want to do now is describe a couple of common variations of MDPs that use slightly different definitions. First, of the reward function, and then second, we'll do something slightly different from discounting. Then, remember in the last lecture, I said that for infinite-state or continuous-state MDPs, we couldn't apply the most straightforward version of value iteration, because if you have a continuous-state MDP, we need to use some approximation of the optimal value function.

[inaudible] Later in this lecture, I'll talk about a special case of MDPs where you can actually represent the value function exactly, even if you have an infinite-state space or even if you have a continuous-state space. I'll actually do that, talk about these special cases of infinite-state MDPs, using this new variation of the reward function and the alternative to discounting, so as to make the formulation a little easier.

So the first variation I want to talk about is state-action rewards. So I'm going to change the definition of the reward function. It turns out this won't be a huge deal. In particular, I
