














basic cs with exercise and full detail
Typology: Schemes and Mind Maps















Problem Context: In previous value-based RL approaches, we learned value functions $V(s)$ or $Q(s, a)$ and then derived policies indirectly (e.g., $\epsilon$-greedy). This raises a fundamental question: if we care about optimal behavior, why not learn the policy directly? Policy-based RL directly parameterizes and learns the policy $\pi_\theta(a \mid s)$, which offers several advantages over value-based approaches:
The logical progression in policy gradient learning follows this flow:
Key Insight: In fully observable MDPs, there is always an optimal deterministic policy. Under partial observability, which is common in realistic problems, the optimal policy may have to be stochastic.
1.3.1 Example - Rock-Paper-Scissors
1.3.2 Example - Aliased Gridworld
A policy network takes a state as input and outputs a probability distribution over actions:
$$\pi_\theta(a \mid s) = \mathrm{NeuralNetwork}_\theta(s) \;\rightarrow\; \text{probability for each action}$$
For discrete actions, the final layer uses softmax activation:
$$\pi_\theta(a \mid s) = \frac{e^{z_a}}{\sum_{a'} e^{z_{a'}}}$$
where $z_a$ is the output from the second-to-last layer (the logit) for action $a$.
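A minimal sketch of such a softmax policy, assuming a single linear layer in place of a full network. The names `policy_probs` and `theta` and the sizes (3 actions, 4 state features) are illustrative choices, not taken from the notes.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over the last axis."""
    z = z - np.max(z, axis=-1, keepdims=True)  # shifting logits leaves the result unchanged
    e = np.exp(z)
    return e / np.sum(e, axis=-1, keepdims=True)

def policy_probs(theta, s):
    """pi_theta(.|s) for a linear stand-in 'network': logits z = theta @ s."""
    return softmax(theta @ s)

rng = np.random.default_rng(0)
theta = rng.normal(size=(3, 4))            # 3 actions, 4 state features (made-up sizes)
s = rng.normal(size=4)                     # an arbitrary state vector
probs = policy_probs(theta, s)             # one probability per action
action = rng.choice(len(probs), p=probs)   # sample a ~ pi_theta(.|s)
```

Subtracting the maximum logit before exponentiating avoids overflow without changing the resulting distribution.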
1.5.1 Starting Point
We want to maximize the expected return:
$$J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}[R(\tau)]$$
where $\tau$ is a trajectory and $R(\tau) = \sum_{t=0}^{T} \gamma^t r_t$ is the cumulative discounted reward.
1.5.2 Key Derivation - Why Log Trick Works
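\begin{align}
\nabla_\theta J(\theta) &= \nabla_\theta \mathbb{E}_{\tau \sim \pi_\theta}[R(\tau)] \tag{1} \\
&= \nabla_\theta \int \pi_\theta(\tau)\, R(\tau)\, d\tau \tag{2} \\
&= \int \nabla_\theta \pi_\theta(\tau)\, R(\tau)\, d\tau \tag{3}
\end{align}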
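The numbered steps stop at the integral form of the gradient. As a sketch of the usual continuation (the standard log-derivative identity, not necessarily the exact steps the notes go on to use), the integrand can be turned back into an expectation:

```latex
\begin{align*}
\nabla_\theta \pi_\theta(\tau)
  &= \pi_\theta(\tau)\,\nabla_\theta \log \pi_\theta(\tau) \\
\int \nabla_\theta \pi_\theta(\tau)\, R(\tau)\, d\tau
  &= \int \pi_\theta(\tau)\,\nabla_\theta \log \pi_\theta(\tau)\, R(\tau)\, d\tau \\
  &= \mathbb{E}_{\tau \sim \pi_\theta}\big[\nabla_\theta \log \pi_\theta(\tau)\, R(\tau)\big]
\end{align*}
```

The last line is an expectation under $\pi_\theta$, so it can be estimated from sampled trajectories without differentiating through the sampling process.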
1.7.1 Interpretation 1 - Connection to Maximum Likelihood
Maximum likelihood objective (supervised learning): $J_{ML} = \mathbb{E}[\log \pi_\theta(a \mid s)]$
Policy gradient objective: $J_{PG} = \mathbb{E}[\log \pi_\theta(a \mid s)\, G(s, a)]$
Key Difference: We weight the log-probability by the return $G(s, a)$:
1.8.1 Key Theorem
If $b(s)$ is any baseline function that depends only on the state (not on actions), then:
$$\mathbb{E}_a\big[b(s)\, \nabla_\theta \log \pi_\theta(a \mid s)\big] = 0$$
1.8.2 Proof
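\begin{align}
\mathbb{E}_a\big[b(s)\, \nabla_\theta \log \pi_\theta(a \mid s)\big]
  &= b(s)\, \mathbb{E}_a\big[\nabla_\theta \log \pi_\theta(a \mid s)\big] \tag{8} \\
  &= b(s)\, \mathbb{E}_a\!\left[\frac{\nabla_\theta \pi_\theta(a \mid s)}{\pi_\theta(a \mid s)}\right] \tag{9} \\
  &= b(s)\, \nabla_\theta \sum_a \pi_\theta(a \mid s) \tag{10} \\
  &= b(s)\, \nabla_\theta 1 = 0 \tag{11}
\end{align}
Step (10) holds because the expectation over $a \sim \pi_\theta$ multiplies each term by $\pi_\theta(a \mid s)$, which cancels the denominator in (9); the remaining probabilities sum to 1.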
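The identity can be checked numerically. The sketch below assumes a softmax policy over three actions and differentiates with respect to the logits; the logits and the baseline value are arbitrary numbers chosen for illustration.

```python
import numpy as np

def softmax(z):
    z = z - np.max(z)
    e = np.exp(z)
    return e / e.sum()

def grad_log_softmax(z, a):
    """Gradient of log softmax(z)[a] with respect to the logits z."""
    onehot = np.zeros_like(z)
    onehot[a] = 1.0
    return onehot - softmax(z)

z = np.array([0.5, -1.0, 2.0])   # arbitrary logits for one state
p = softmax(z)                   # pi(a|s)
b = 3.7                          # arbitrary baseline value b(s)

# Exact expectation over actions: sum_a pi(a|s) * b(s) * grad log pi(a|s)
expected = sum(p[a] * b * grad_log_softmax(z, a) for a in range(len(z)))
```

Every component of `expected` is zero up to floating-point error, whatever value of `b` is chosen.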
1.8.3 Modified Policy Gradient with Baseline
$$\nabla_\theta J(\theta) = \mathbb{E}\left[\sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)\,\big(G_t - b(s_t)\big)\right]$$
The baseline reduces variance without introducing bias.
1.8.4 Best Baseline Choice
$b(s) = V(s)$ (the state-value function)
The advantage function is defined as: $A(s, a) = Q(s, a) - V(s)$
Updated policy gradient:
$$\nabla_\theta J(\theta) = \mathbb{E}\left[\sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, A(s_t, a_t)\right]$$
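One way to obtain advantage estimates in practice is to subtract a value estimate from the sampled return-to-go. The sketch below assumes the critic values are already available; the rewards, the values in `values`, and the discount are made-up numbers.

```python
import numpy as np

def returns_to_go(rewards, gamma):
    """G_t = sum_{k>=t} gamma^(k-t) * r_k, computed backwards in one pass."""
    G = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        G[t] = running
    return G

rewards = [1.0, 0.0, 2.0]
values = np.array([2.0, 1.5, 1.0])   # hypothetical critic outputs V(s_t)
G = returns_to_go(rewards, gamma=0.5)
advantages = G - values              # Monte Carlo estimate of A(s_t, a_t)
```

Here `G` is (1.5, 1.0, 2.0) and `advantages` is (-0.5, -0.5, 1.0), so only the last action is reinforced.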
1.9.1 Problem
REINFORCE is on-policy: we must collect data with the current policy, so samples gathered under earlier policies cannot be reused. This is very sample-inefficient.
1.9.2 Solution: Importance Sampling
Use importance sampling to correct for policy mismatch.
Importance Sampling Formula:
$$\mathbb{E}_{a \sim \pi(a \mid s)}[f(a)] = \mathbb{E}_{a \sim \beta(a \mid s)}\left[\frac{\pi(a \mid s)}{\beta(a \mid s)}\, f(a)\right]$$
where $\beta$ is the behavior policy (used to collect data) and $\pi$ is the target policy (what we want to improve).
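A small numerical illustration of the importance sampling identity, assuming three discrete actions. The two distributions and the function `f` are made-up numbers.

```python
import numpy as np

pi = np.array([0.7, 0.2, 0.1])     # target policy pi(a|s)
beta = np.array([0.3, 0.3, 0.4])   # behavior policy beta(a|s)
f = np.array([1.0, 5.0, -2.0])     # any function of the action

exact = np.sum(pi * f)                       # E_{a~pi}[f(a)]
exact_is = np.sum(beta * (pi / beta) * f)    # E_{a~beta}[(pi/beta) f(a)], same value

rng = np.random.default_rng(0)
a = rng.choice(3, size=100_000, p=beta)      # actions drawn from beta only
estimate = np.mean(pi[a] / beta[a] * f[a])   # importance-weighted sample mean
```

The weighted sample mean approaches `exact` even though no action was ever drawn from `pi`. The identity requires `beta` to assign nonzero probability to every action that `pi` can take.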
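1.9.3 Off-Policy Policy Gradient Derivation
Starting from:
$$J(\theta) = \mathbb{E}_{s \sim \rho^{\beta}}\big[V^{\pi_\theta}(s)\big]$$
For a new policy $\theta'$:
$$J(\theta') = \mathbb{E}_{s \sim \rho^{\beta}}\big[V^{\pi_{\theta'}}(s)\big]$$
Using importance sampling:
\begin{align*}
J(\theta') &= \mathbb{E}_{s \sim \rho^{\beta}}\left[\sum_a \pi_{\theta'}(a \mid s)\, Q^{\pi_{\theta'}}(s, a)\right] \\
&= \mathbb{E}_{s \sim \rho^{\beta}}\left[\sum_a \beta(a \mid s)\, \frac{\pi_{\theta'}(a \mid s)}{\beta(a \mid s)}\, Q^{\pi_{\theta'}}(s, a)\right]
\end{align*}
Taking derivatives (writing $\theta$ for the parameters being updated):
$$\nabla_\theta J(\theta) = \mathbb{E}_{s, a \sim \beta}\left[\frac{\pi_\theta(a \mid s)}{\beta(a \mid s)}\, \nabla_\theta \log \pi_\theta(a \mid s)\, Q^{\pi_\theta}(s, a)\right]$$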
Important Note: Ignoring the importance weights on future states introduces a bias, but this approximation is commonly accepted in practice.
Algorithm 2: Online Actor-Critic with Discount Factor
Initialize policy parameters $\theta$ and critic parameters $w$
repeat:
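The loop body of Algorithm 2 is not shown here. The following is a sketch of a common one-step actor-critic with a discount factor, under stated assumptions: a tabular softmax actor, a tabular critic, and a hypothetical two-state toy environment (`step`) invented for this example. It is not necessarily the exact procedure the notes intend.

```python
import numpy as np

def softmax(z):
    z = z - np.max(z)
    e = np.exp(z)
    return e / e.sum()

def step(s, a):
    """Hypothetical toy MDP: action 1 in state 0 reaches the goal."""
    if s == 0 and a == 1:
        return 1, 1.0, True              # next state, reward, done
    return 0, 0.0, False

rng = np.random.default_rng(0)
n_states, n_actions = 2, 2
theta = np.zeros((n_states, n_actions))  # actor: tabular softmax logits
w = np.zeros(n_states)                   # critic: tabular V_w(s)
gamma, alpha_theta, alpha_w = 0.9, 0.1, 0.1

for episode in range(500):
    s, discount = 0, 1.0                 # discount tracks gamma^t
    for t in range(50):                  # cap the episode length
        probs = softmax(theta[s])
        a = rng.choice(n_actions, p=probs)
        s_next, r, done = step(s, a)
        # One-step TD error, used as the advantage estimate
        target = r + (0.0 if done else gamma * w[s_next])
        delta = target - w[s]
        w[s] += alpha_w * delta          # critic update
        grad_log = -probs                # d log pi(a|s) / d logits = onehot(a) - probs
        grad_log[a] += 1.0
        theta[s] += alpha_theta * discount * delta * grad_log  # actor update
        discount *= gamma
        s = s_next
        if done:
            break
```

After training, `softmax(theta[0])` puts most of its mass on action 1, the only action that earns reward in this toy problem.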
Batch Version of Actor-Critic:
$$A(s_t, a_t)\, \nabla_\theta \log \pi_\theta(a_t \mid s_t)$$
Challenge: Policy gradient is stochastic gradient ascent with step size $\alpha$:
$$\theta \leftarrow \theta + \alpha \nabla_\theta J(\theta)$$
Two critical problems in RL (unlike supervised learning):
Key Insight: Euclidean distance in parameter space doesn’t correspond to distance in policy space.
3.2.1 Example: Bernoulli Policy
Consider a Bernoulli policy with
$$\pi(a \mid s) = \begin{cases} p & \text{if } a = 1 \\ 1 - p & \text{if } a = 0 \end{cases}$$
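To see why distance in parameter space can mislead, the sketch below compares the KL divergence caused by the same change in $p$ at two different starting points. It assumes the policy is parameterized directly by $p$; the step size and starting values are illustrative.

```python
import numpy as np

def kl_bernoulli(p, q):
    """KL( Bernoulli(p) || Bernoulli(q) ) in nats."""
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

step = 0.009                                  # identical parameter change in both cases
kl_middle = kl_bernoulli(0.5, 0.5 + step)     # starting near p = 0.5
kl_edge = kl_bernoulli(0.99, 0.99 + step)     # starting near p = 1
```

The parameter moves by the same amount in both cases, yet `kl_edge` is roughly 87 times larger than `kl_middle`: near the boundary, a small parameter step changes the policy far more.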
3.3.1 Setup
We want to maximize the objective $J(\theta')$ subject to a KL constraint:
$$\max_{\theta'} J(\theta') \quad \text{subject to} \quad D_{KL}(\pi_\theta \,\|\, \pi_{\theta'}) \le \delta$$
3.3.2 Lagrangian
$$\mathcal{L} = J(\theta') - \frac{\lambda}{2}\, D_{KL}(\pi_\theta \,\|\, \pi_{\theta'})$$
3.5.2 Clipping Mechanism
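$$\mathrm{clip}(r, 1 - \epsilon, 1 + \epsilon) = \begin{cases} 1 - \epsilon & \text{if } r < 1 - \epsilon \\ r & \text{if } 1 - \epsilon \le r \le 1 + \epsilon \\ 1 + \epsilon & \text{if } r > 1 + \epsilon \end{cases}$$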
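A short sketch of the clipping function and how it is typically combined with an advantage in a PPO-style objective. It assumes $r$ is the probability ratio between the new and old policy and uses $\epsilon = 0.2$; the `min` form of the objective is the commonly used one and is an assumption here, since only the clip function itself is given.

```python
import numpy as np

def clipped_surrogate(ratio, advantage, eps=0.2):
    """Per-sample objective: min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps)
    return np.minimum(ratio * advantage, clipped * advantage)

ratios = np.array([0.5, 1.0, 1.5])
clipped_ratios = np.clip(ratios, 0.8, 1.2)                 # values: 0.8, 1.0, 1.2
gain_pos = clipped_surrogate(ratios, advantage=1.0)        # values: 0.5, 1.0, 1.2
gain_neg = clipped_surrogate(ratios, advantage=-1.0)       # values: -0.8, -1.0, -1.5
```

With a positive advantage the objective stops growing once the ratio exceeds $1 + \epsilon$; with a negative advantage it stops improving once the ratio drops below $1 - \epsilon$. In both cases the incentive to move far from the old policy is removed.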
4.1.1 Discrete Action Spaces
$a \in \{1, 2, \ldots, |\mathcal{A}|\}$
4.1.2 Continuous Action Spaces
$a \in \mathbb{R}^d$
4.1.3 Naive Solution - Discretization Divide each dimension into discrete bins. Problem: Curse of dimensionality!
Key Insight: For continuous actions, instead of sampling from a stochastic policy, use a deterministic policy: $a = \mu_\theta(s)$. This directly outputs the action (no randomness).
Advantage: Can backpropagate directly through the policy to the Q-function without sampling!
Theorem: For a deterministic policy $\mu_\theta(s)$, the policy gradient is:
$$\nabla_\theta J(\theta) = \mathbb{E}_{s \sim \rho^{\mu}}\Big[\nabla_\theta \mu_\theta(s)\, \nabla_a Q(s, a)\big|_{a = \mu_\theta(s)}\Big]$$
4.5.2 Actor Update (Deterministic Policy Gradient)
Maximize: $J(\theta) = \mathbb{E}_{s \sim B}\big[Q_w(s, \mu_\theta(s))\big]$
Gradient: $\nabla_\theta J(\theta) = \mathbb{E}_{s \sim B}\big[\nabla_a Q_w(s, a)\big|_{a = \mu_\theta(s)}\, \nabla_\theta \mu_\theta(s)\big]$
Update: $\theta \leftarrow \theta + \alpha_\theta \nabla_\theta J(\theta)$
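The chain rule in the actor update can be illustrated with a toy problem. The sketch below assumes a linear actor $\mu_\theta(s) = \theta \cdot s$ with scalar actions and a made-up critic $Q_w(s, a) = -(a - w \cdot s)^2$ whose best action is known in closed form; random normal states stand in for a replay buffer $B$.

```python
import numpy as np

rng = np.random.default_rng(0)
w = np.array([1.0, -2.0])     # parameters of the made-up critic
theta = np.zeros(2)           # actor parameters, mu_theta(s) = theta . s
alpha_theta = 0.05

for _ in range(2000):
    batch = rng.normal(size=(32, 2))          # states s ~ B (stand-in replay buffer)
    actions = batch @ theta                   # a = mu_theta(s)
    dq_da = -2.0 * (actions - batch @ w)      # grad_a Q_w(s, a) at a = mu_theta(s)
    # Chain rule: grad_theta J = mean_s[ dQ/da * d mu/d theta ], with d mu/d theta = s
    grad_theta = np.mean(dq_da[:, None] * batch, axis=0)
    theta = theta + alpha_theta * grad_theta  # gradient ascent step
```

Because this critic is maximized at $a = w \cdot s$, the actor parameters converge to `w`. No action sampling is involved: the gradient flows from the critic through the action into the actor.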
Fundamental Challenge with Trial-and-Error RL:
Given:
Simplest Approach: Reduce imitation learning to supervised learning. Algorithm 5: Behavior Cloning
$\lVert a_i - \pi_\theta(s_i) \rVert^2$, or with cross-entropy for discrete actions:
$$\max_\theta \sum_i \log \pi_\theta(a_i \mid s_i)$$
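For the squared-error case, behavior cloning with a linear policy reduces to ordinary least squares. The sketch below assumes noise-free demonstrations from a hypothetical linear expert (`W_expert`); a real policy would usually be a neural network trained by gradient descent.

```python
import numpy as np

rng = np.random.default_rng(0)
W_expert = np.array([[2.0, -1.0]])     # hypothetical expert: a = W_expert @ s
states = rng.normal(size=(200, 2))     # demonstration states s_i
actions = states @ W_expert.T          # demonstration actions a_i (noise-free here)

# Linear policy pi_theta(s) = theta @ s. Minimizing sum_i ||a_i - pi_theta(s_i)||^2
# is ordinary least squares, which has a closed-form solution.
solution, *_ = np.linalg.lstsq(states, actions, rcond=None)
theta = solution.T                     # shape (action_dim, state_dim)

mse = np.mean((actions - states @ theta.T) ** 2)
```

With noise-free data the fitted `theta` recovers `W_expert` and the training error is essentially zero. That says nothing about states the expert never visited, which is where behavior cloning typically struggles.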
5.6.1 Problem Setup Given:
5.6.2 Feature-Based Linear Reward
Assume the reward is a linear combination of features:
$$R(s, a) = w^\top \phi(s, a)$$
where $\phi$ are features and $w$ are weights to learn.
5.6.3 Value Function
$$V_w^{\pi}(s) = w^\top \underbrace{\mathbb{E}_{\pi}\left[\sum_t \gamma^t \phi(s_t, a_t)\right]}_{\text{feature expectation } \mu_\pi}$$
Key Insight: We don’t need to exactly recover $w$.
Theorem: If we learn a policy $\pi$ whose feature expectations match the expert's feature expectations:
$$\mu_\pi = \mu_{\pi^*}$$
then for any weight vector $w$, our policy is as good as the expert:
$$V_w^{\pi}(s_0) \approx V_w^{\pi^*}(s_0)$$
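A small sketch of why matching feature expectations is enough under a linear reward. The feature sequences below are made up so that two different trajectories share the same discounted feature sum; `feature_expectation` is an illustrative helper, not a function from the notes.

```python
import numpy as np

def feature_expectation(trajectories, gamma):
    """Monte Carlo estimate of mu = E[ sum_t gamma^t phi(s_t, a_t) ].

    Each trajectory is an array of shape (T, k): one feature vector per step.
    """
    totals = []
    for phis in trajectories:
        discounts = gamma ** np.arange(len(phis))
        totals.append(discounts @ phis)
    return np.mean(totals, axis=0)

gamma = 0.9
expert = [np.array([[1.0, 0.0], [0.0, 1.0]])]    # made-up feature sequences
learner = [np.array([[1.0, 0.9], [0.0, 0.0]])]   # different steps, same discounted sum

mu_expert = feature_expectation(expert, gamma)
mu_learner = feature_expectation(learner, gamma)

w = np.array([0.3, -1.2])                        # any reward weights
value_expert, value_learner = w @ mu_expert, w @ mu_learner
```

Both feature expectations equal (1.0, 0.9), so the two values agree for this `w` and for any other weight vector, even though the underlying trajectories differ.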
Question: What if we knew the environment dynamics $P(s' \mid s, a)$ and could simulate trajectories?
Advantage of Model-Based RL:
6.2.1 Open-Loop Planning
$$\max_{a_0, \ldots, a_T} J(a_0, \ldots, a_T) \quad \text{s.t.} \quad s_{t+1} = f(s_t, a_t)$$
6.2.2 Closed-Loop Planning
$$\max_{\pi} J(\pi) \quad \text{where } a_t = \pi(s_t)$$
Key Idea: Re-plan at every step with short horizon. Why Short Horizon Works:
Why MCTS for Go?:
7.1.1 Phase 1: Selection At each node, select action with high UCB score:
$$\mathrm{UCB}(a) = Q(a) + c \sqrt{\frac{\ln N}{N(a)}}$$
where:
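As a sketch of the selection rule, the code below scores each action at a node and picks the highest. It assumes $N$ is the total visit count of the node, $N(a)$ the count for action $a$, and an exploration constant `c = 1.4`; giving unvisited actions an infinite score is a common convention, assumed here.

```python
import math

def ucb_scores(q_values, visit_counts, c=1.4):
    """UCB(a) = Q(a) + c * sqrt(ln N / N(a)); unvisited actions get +inf."""
    total = sum(visit_counts)
    scores = []
    for q, n in zip(q_values, visit_counts):
        if n == 0:
            scores.append(math.inf)   # force each action to be tried once
        else:
            scores.append(q + c * math.sqrt(math.log(total) / n))
    return scores

def select_action(q_values, visit_counts, c=1.4):
    """Index of the action with the highest UCB score."""
    scores = ucb_scores(q_values, visit_counts, c)
    return max(range(len(scores)), key=scores.__getitem__)
```

For example, with equal value estimates, `select_action([0.5, 0.5], [100, 1])` picks the rarely visited second action because its exploration bonus is larger. With `c=0.0` the rule becomes purely greedy.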
7.1.2 Phase 2-4: Expansion, Simulation, Backup
Architecture: