basic cs with exercise and full detail, Schemes and Mind Maps of Computer science

Computer science

basic cs with exercise and full detail

Typology: Schemes and Mind Maps

2025/2026

Uploaded on 12/11/2025

situ-kam-chung 🇭🇰

3 documents


Topics 3.1-3.4 & 4.1-4.3 Reinforcement Learning Study Guide

1 Topic 3.1: Policy Gradient

1.1 Why Policy Gradient Appears

Problem Context: In previous value-based RL approaches, we learned value functions $V(s)$ or $Q(s, a)$ and then derived policies indirectly (e.g., $\epsilon$-greedy). This raises a fundamental question: If we care about optimal behavior, why not learn the policy directly?

Policy-based RL directly parameterizes and learns the policy $\pi_\theta(a|s)$, which offers several advantages over value-based approaches:

1. True objective alignment: We directly optimize the quantity we care about (the policy)
2. Better for complex value functions: Sometimes policies are simple while values and models are complex
3. Continuous action spaces: Much more effective in high-dimensional or continuous action spaces
4. Stochastic policies: Can learn policies that are inherently stochastic (not just derived from value functions)
5. Convergence properties: Often converges more smoothly than value-based methods, though typically only to a local optimum

1.2 Logic Flow: The Complete Picture

The logical progression in policy gradient learning follows this flow:

1. Start with policy parametrization: $\pi_\theta(a|s)$, a neural network outputting action probabilities
2. Define objective: Maximize expected cumulative reward $J(\theta) = \mathbb{E}\left[\sum_{t=0}^{T} \gamma^t r_t\right]$
3. Derive policy gradient: Find $\nabla_\theta J(\theta)$
4. Apply gradient ascent: Update $\theta \leftarrow \theta + \alpha \nabla_\theta J(\theta)$
5. Address practical issues: High variance → add baselines, on-policy inefficiency → off-policy corrections
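The quantity inside the expectation in step 2 can be computed for a single recorded trajectory as follows. This is a minimal sketch; the function name `discounted_return` and the example rewards are illustrative assumptions, not part of the study guide.

```python
def discounted_return(rewards, gamma):
    """Return sum_t gamma**t * r_t for one trajectory (rewards listed from t = 0)."""
    total = 0.0
    # Accumulate backwards so each step is a single multiply-add.
    for r in reversed(rewards):
        total = r + gamma * total
    return total


print(discounted_return([1.0, 1.0, 1.0], gamma=0.5))  # 1 + 0.5 + 0.25 = 1.75
```

Averaging this value over many sampled trajectories gives a Monte Carlo estimate of $J(\theta)$.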

1.3 Stochastic Policies: Why They Matter

Key Insight: In fully-observable MDPs, there is always an optimal deterministic policy. However, in most realistic problems with partial observability, the optimal policy is stochastic.


1.3.1 Example - Rock-Paper-Scissors

A deterministic policy can always be exploited (opponent just counters your fixed choice)
The Nash equilibrium is a uniform random policy: $\pi(a|s) = 1/3$ for each action
Conclusion: A stochastic policy is optimal
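The exploitability argument can be checked numerically. The sketch below assumes the usual zero-sum payoffs (+1 win, -1 loss, 0 tie); the names `PAYOFF` and `worst_case_payoff` are illustrative.

```python
import numpy as np

# Row player's payoff; rows and columns are (rock, paper, scissors).
PAYOFF = np.array([[0, -1, 1],
                   [1, 0, -1],
                   [-1, 1, 0]])


def worst_case_payoff(policy):
    """Expected payoff of `policy` against an opponent who best-responds to it."""
    expected_vs_each_reply = np.asarray(policy) @ PAYOFF  # one entry per opponent action
    return expected_vs_each_reply.min()


always_rock = worst_case_payoff([1.0, 0.0, 0.0])   # opponent answers with paper
uniform = worst_case_payoff([1 / 3, 1 / 3, 1 / 3])  # no reply does better than a tie
```

A deterministic policy has a worst case of -1, while the uniform policy guarantees an expected payoff of 0 whatever the opponent plays.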

1.3.2 Example - Aliased Gridworld

Two different states look identical to the agent (aliased observations)
A deterministic policy must choose the same action in both states
Even if one action is optimal for one state, it's suboptimal for the other
Solution: Use stochastic policy with $\pi(a_1|s) = 0.5$, $\pi(a_2|s) = 0.5$ to reach goal with high probability

1.4 Policy Network Architecture

A policy network takes a state as input and outputs a probability distribution over actions:

$$\pi_\theta(a|s) = \text{NeuralNetwork}_\theta(s) \to \text{probability for each action}$$

For discrete actions, the final layer uses softmax activation:

$$\pi_\theta(a|s) = \frac{e^{z_a}}{\sum_{a'} e^{z_{a'}}}$$

where $z_a$ is the output from the second-to-last layer for action $a$ (the logit fed into the softmax).
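A minimal sketch of the softmax output layer, assuming the logits $z_a$ have already been produced by the network. The function name `softmax_policy` and the example logits are illustrative.

```python
import numpy as np


def softmax_policy(logits):
    """Convert logits z_a into action probabilities pi(a|s)."""
    z = np.asarray(logits, dtype=float)
    z = z - z.max()  # shifting by the max leaves the result unchanged but avoids overflow
    exp_z = np.exp(z)
    return exp_z / exp_z.sum()


probs = softmax_policy([2.0, 1.0, 0.1])
# Sample an action a ~ pi(.|s) from the resulting distribution.
action = np.random.default_rng(0).choice(len(probs), p=probs)
```

Subtracting the maximum logit is a common numerical-stability choice; it cancels in the ratio, so the probabilities are the same as in the formula.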

1.5 Policy Gradient Derivation (Complete Proof )

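1.5.1 Starting Point
We want to maximize the expected return:

$$J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}[R(\tau)]$$

where $\tau$ is a trajectory and $R(\tau) = \sum_{t=0}^{T} \gamma^t r_t$ is the cumulative discounted reward.

1.5.2 Key Derivation - Why Log Trick Works

$$
\begin{aligned}
\nabla_\theta J(\theta) &= \nabla_\theta \mathbb{E}_{\tau \sim \pi_\theta}[R(\tau)] && (1) \\
&= \nabla_\theta \int \pi_\theta(\tau) R(\tau)\, d\tau && (2) \\
&= \int \nabla_\theta \pi_\theta(\tau) R(\tau)\, d\tau && (3)
\end{aligned}
$$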
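An integral of the form $\int \nabla_\theta \pi_\theta(\tau) R(\tau)\, d\tau$ is usually turned back into an expectation with the log-derivative trick. The steps below are the standard textbook continuation, written here as a sketch rather than recovered from the guide; $p(s_0)$ and $P(s_{t+1}|s_t, a_t)$ denote the initial-state distribution and the dynamics, which do not depend on $\theta$.

```latex
\begin{align*}
\nabla_\theta \pi_\theta(\tau)
  &= \pi_\theta(\tau)\,\frac{\nabla_\theta \pi_\theta(\tau)}{\pi_\theta(\tau)}
   = \pi_\theta(\tau)\,\nabla_\theta \log \pi_\theta(\tau) \\
\nabla_\theta J(\theta)
  &= \int \pi_\theta(\tau)\,\nabla_\theta \log \pi_\theta(\tau)\,R(\tau)\,d\tau
   = \mathbb{E}_{\tau \sim \pi_\theta}\!\left[\nabla_\theta \log \pi_\theta(\tau)\,R(\tau)\right] \\
\log \pi_\theta(\tau)
  &= \log p(s_0) + \sum_{t=0}^{T} \log \pi_\theta(a_t|s_t) + \sum_{t=0}^{T} \log P(s_{t+1}|s_t, a_t) \\
\nabla_\theta \log \pi_\theta(\tau)
  &= \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t|s_t)
\end{align*}
```

The last line is why the gradient can be estimated from samples without knowing the dynamics: only the policy terms depend on $\theta$.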

1.7 Understanding Policy Gradients

1.7.1 Interpretation 1 - Connection to Maximum Likelihood
Maximum likelihood objective (supervised learning): $J_{ML} = \mathbb{E}[\log \pi_\theta(a|s)]$
Policy gradient objective: $J_{PG} = \mathbb{E}[\log \pi_\theta(a|s)\, G(s, a)]$
Key Difference: We weight the log-probability by the return $G(s, a)$:

Good actions (high return) → increase probability
Bad actions (low return) → decrease probability
Formalizes trial-and-error learning
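The effect of weighting the log-probability by the return can be seen in a tiny example. This sketch assumes a softmax policy whose parameters are the logits themselves; `reinforce_step` and the learning rate are illustrative choices.

```python
import numpy as np


def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()


def reinforce_step(logits, action, ret, lr=0.1):
    """One gradient-ascent step on ret * log pi(action) for a softmax over logits."""
    probs = softmax(logits)
    grad_log_pi = -probs
    grad_log_pi[action] += 1.0  # d log pi(a) / d z = onehot(a) - pi
    return logits + lr * ret * grad_log_pi


logits = np.zeros(3)
p_before = softmax(logits)[0]
p_good = softmax(reinforce_step(logits, action=0, ret=+1.0))[0]  # positive return
p_bad = softmax(reinforce_step(logits, action=0, ret=-1.0))[0]   # negative return
```

With `ret = 1` for every sample this reduces to a maximum-likelihood step; the return only rescales and possibly flips the direction of the same gradient.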

1.8 Policy Gradient with Baselines

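1.8.1 Key Theorem
If $b(s)$ is any baseline function that depends only on state (not on actions), then:

$$\mathbb{E}_a[b(s) \nabla_\theta \log \pi_\theta(a|s)] = 0$$

1.8.2 Proof

$$
\begin{aligned}
\mathbb{E}_a[b(s)\nabla_\theta \log \pi_\theta(a|s)] &= b(s)\,\mathbb{E}_a[\nabla_\theta \log \pi_\theta(a|s)] && (8) \\
&= b(s)\,\mathbb{E}_a\left[\frac{\nabla_\theta \pi_\theta(a|s)}{\pi_\theta(a|s)}\right] && (9) \\
&= b(s)\,\nabla_\theta \sum_a \pi_\theta(a|s) && (10) \\
&= b(s)\,\nabla_\theta 1 = 0 && (11)
\end{aligned}
$$

Step (10) writes the expectation as $\sum_a \pi_\theta(a|s) \frac{\nabla_\theta \pi_\theta(a|s)}{\pi_\theta(a|s)} = \sum_a \nabla_\theta \pi_\theta(a|s)$ and moves the gradient outside the sum; the probabilities sum to 1, so the gradient vanishes.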
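The theorem can be verified numerically for a softmax policy over three actions. The logits and the baseline value below are arbitrary illustrative numbers.

```python
import numpy as np


def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()


logits = np.array([0.5, -1.0, 2.0])
probs = softmax(logits)
baseline = 7.3  # arbitrary b(s); any action-independent value works

# Row a holds grad_z log pi(a) = onehot(a) - probs.
score = np.eye(3) - probs
# E_a[b(s) * grad log pi(a|s)]: weight each row by pi(a|s) and sum.
expected_term = probs @ (baseline * score)
```

Every component of `expected_term` is zero up to floating-point error, which is why subtracting the baseline leaves the gradient estimate unbiased.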

1.8.3 Modified Policy Gradient with Baseline

$$\nabla_\theta J(\theta) = \mathbb{E}\left[\sum_t \nabla_\theta \log \pi_\theta(a_t|s_t)\,(G_t - b(s_t))\right]$$

The baseline reduces variance without introducing bias.

1.8.4 Best Baseline Choice
$b(s) = V(s)$ (the state-value function)
The advantage function is defined as: $A(s, a) = Q(s, a) - V(s)$
Updated policy gradient:

$$\nabla_\theta J(\theta) = \mathbb{E}\left[\sum_t \nabla_\theta \log \pi_\theta(a_t|s_t)\,A(s_t, a_t)\right]$$

1.9 Off-Policy Policy Gradients

1.9.1 Problem
REINFORCE is on-policy - we must collect data with the current policy. This is very inefficient.

1.9.2 Solution: Importance Sampling
Use importance sampling to correct for policy mismatch.
Importance Sampling Formula:

$$\mathbb{E}_{a \sim \pi(a|s)}[f(a)] = \mathbb{E}_{a \sim \beta(a|s)}\left[\frac{\pi(a|s)}{\beta(a|s)} f(a)\right]$$

where $\beta$ is the behavior policy (used to collect data) and $\pi$ is the target policy (what we want to improve).
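A small numerical check of the importance sampling identity for a single state with three actions. The two distributions and the function values `f` are made-up illustrative numbers.

```python
import numpy as np

rng = np.random.default_rng(0)
target = np.array([0.7, 0.2, 0.1])        # pi(a|s), the policy we care about
behavior = np.array([1 / 3, 1 / 3, 1 / 3])  # beta(a|s), the policy that collected the data
f = np.array([1.0, 5.0, 10.0])            # any function of the action, e.g. a return

exact = target @ f  # E_{a~pi}[f(a)] computed directly

actions = rng.choice(3, size=200_000, p=behavior)  # samples drawn from beta only
weights = target[actions] / behavior[actions]      # importance ratios pi / beta
estimate = np.mean(weights * f[actions])
```

`estimate` approaches `exact` as the sample size grows, even though no action was ever sampled from the target policy. The variance of the estimate depends on how far the ratios stray from 1.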

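1.9.3 Off-Policy Policy Gradient Derivation
Starting from:

$$J(\theta) = \mathbb{E}_{s \sim \rho^{\beta}}\left[V^{\pi_\theta}(s)\right]$$

For a new policy $\theta'$:

$$J(\theta') = \mathbb{E}_{s \sim \rho^{\beta}}\left[V^{\pi_{\theta'}}(s)\right]$$

Using importance sampling:

$$
\begin{aligned}
J(\theta') &= \mathbb{E}_{s \sim \rho^{\beta}}\left[\sum_a \pi_{\theta'}(a|s)\, Q^{\pi_{\theta'}}(s, a)\right] \\
&= \mathbb{E}_{s \sim \rho^{\beta}}\left[\sum_a \beta(a|s)\, \frac{\pi_{\theta'}(a|s)}{\beta(a|s)}\, Q^{\pi_{\theta'}}(s, a)\right]
\end{aligned}
$$

Taking derivatives:

$$\nabla_\theta J(\theta) \approx \mathbb{E}_{s, a \sim \beta}\left[\frac{\pi_\theta(a|s)}{\beta(a|s)}\, \nabla_\theta \log \pi_\theta(a|s)\, Q^{\pi_\theta}(s, a)\right]$$

Important Note: This gradient is only approximate. Ignoring the importance weights on future states introduces a bias, which is usually accepted in practice.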

2.4 Complete Algorithm: Online Actor-Critic

Algorithm 2: Online Actor-Critic with Discount Factor

Initialize policy parameters $\theta$ and critic parameters $w$
repeat:
    Take action $a \sim \pi_\theta(a|s)$, observe $r$, $s'$
    Update $w$ using TD target (model-free): $\delta = r + \gamma V_w(s') - V_w(s)$, $w \leftarrow w + \alpha_w \delta \nabla_w V_w(s)$
    Compute advantage using critic: $A(s, a) = \delta = r + \gamma V_w(s') - V_w(s)$
    Compute policy gradient: $\nabla_\theta J(\theta) \approx A(s, a) \nabla_\theta \log \pi_\theta(a|s)$
    Update: $\theta \leftarrow \theta + \alpha_\theta A(s, a) \nabla_\theta \log \pi_\theta(a|s)$
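One iteration of the loop can be sketched for the tabular case, where the critic is a table `V[s]` and the policy is a softmax over a table of logits `theta[s]`. The function name, the learning rates, and the `done` flag for terminal transitions are assumptions added for this example.

```python
import numpy as np


def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()


def actor_critic_step(theta, V, s, a, r, s_next, done,
                      gamma=0.99, lr_actor=0.1, lr_critic=0.1):
    """Apply one online update in place; theta[s] holds logits, V[s] the critic value."""
    target = r if done else r + gamma * V[s_next]
    delta = target - V[s]              # TD error, also used as the advantage estimate
    grad_log_pi = -softmax(theta[s])
    grad_log_pi[a] += 1.0              # gradient of log pi(a|s) w.r.t. theta[s]
    V[s] += lr_critic * delta          # critic: semi-gradient TD(0) step
    theta[s] += lr_actor * delta * grad_log_pi  # actor: policy-gradient step
    return delta


theta = np.zeros((2, 2))  # 2 states x 2 actions
V = np.zeros(2)
delta = actor_critic_step(theta, V, s=0, a=1, r=1.0, s_next=1, done=False)
```

After this single transition the critic value of state 0 moves toward the TD target and the logit of the taken action rises, because the TD error was positive.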

2.5 Advantage Actor-Critic (A2C)

Batch Version of Actor-Critic:

Sample $n$ trajectories from current policy
For each trajectory, compute advantages using critic: $A(s_t, a_t) = r_t + \gamma V_w(s_{t+1}) - V_w(s_t)$
Update critic by fitting to bootstrapped returns:
- Target: $\hat{G}_t = r_t + \gamma V_w(s_{t+1})$ (or MC return)
- Loss: $L_w = (V_w(s_t) - \hat{G}_t)^2$
Update actor using advantage: $\theta \leftarrow \theta + \alpha_\theta \frac{1}{n} \sum_t A(s_t, a_t) \nabla_\theta \log \pi_\theta(a_t|s_t)$

3 Topic 3.3: Advanced Policy Gradients (TRPO and PPO)

3.1 The Problem: Step Size Selection

Challenge: Policy gradient is stochastic gradient ascent with step size $\alpha$:

$$\theta \leftarrow \theta + \alpha \nabla_\theta J(\theta)$$

Two critical problems in RL (unlike supervised learning):

Step too large: Bad policy discovered → all future data collected from bad policy → hard to recover
Step too small: Very slow progress, inefficient use of experience

In supervised learning, a bad step is easy to recover from because the data is fixed. In RL, every policy update changes the data distribution.

3.2 Natural Gradient: Distribution Space vs Parameter Space

Key Insight: Euclidean distance in parameter space doesn’t correspond to distance in policy space.

3.2.1 Example: Bernoulli Policy

Consider Bernoulli policy with

$$\pi(a|s) = \begin{cases} p & \text{if } a = 1 \\ 1 - p & \text{if } a = 0 \end{cases}$$

Parameterization 1: $\theta = p$ directly
- Step of 0.1: $p: 0.4 \to 0.5$ (Euclidean distance = 0.1)
Parameterization 2: $\theta = \log(p/(1 - p))$ (logit)
- Step of 0.1: $p: 0.4 \to 0.424$ (same Euclidean distance = 0.1)
- Policy change is different!

Solution: Use natural gradient based on KL divergence instead of Euclidean distance.
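The two parameterizations can be compared directly. This sketch takes the same step of 0.1 from $p = 0.4$ in each parameter space and measures the resulting policy change with the KL divergence; the helper names are illustrative.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def logit(p):
    return math.log(p / (1.0 - p))


def kl_bernoulli(p, q):
    """KL(Bernoulli(p) || Bernoulli(q))."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))


p0 = 0.4
p_direct = p0 + 0.1                 # step of 0.1 when theta = p
p_logit = sigmoid(logit(p0) + 0.1)  # step of 0.1 when theta = logit(p)

print(round(p_direct, 3), kl_bernoulli(p0, p_direct))
print(round(p_logit, 3), kl_bernoulli(p0, p_logit))
```

The same Euclidean step produces a much larger KL divergence under the direct parameterization than under the logit one, which is the mismatch the natural gradient is meant to remove.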

3.3 Natural Gradient Derivation

3.3.1 Setup
We want to maximize objective $J(\theta)$ subject to KL constraint:

$$\max_{\theta'} J(\theta') \quad \text{subject to} \quad D_{KL}(\pi_\theta \,\|\, \pi_{\theta'}) \le \delta$$

3.3.2 Lagrangian

$$L = J(\theta') - \frac{\lambda}{2} D_{KL}(\pi_\theta \,\|\, \pi_{\theta'})$$

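$\hat{A}_t$ is the advantage estimate
$\epsilon$ is a hyperparameter (typically 0.2)

3.5.2 Clipping Mechanism

$$\text{clip}(r, 1 - \epsilon, 1 + \epsilon) = \begin{cases} 1 - \epsilon & \text{if } r < 1 - \epsilon \\ r & \text{if } 1 - \epsilon \le r \le 1 + \epsilon \\ 1 + \epsilon & \text{if } r > 1 + \epsilon \end{cases}$$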
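The clipping function is a plain clamp. The sketch below also shows how it is commonly combined with the advantage in the PPO clipped surrogate, $\min(r \hat{A}, \text{clip}(r, 1-\epsilon, 1+\epsilon) \hat{A})$; that surrogate is the standard PPO form and is an assumption here, since the objective itself is not shown in this part of the guide.

```python
def clip(r, low, high):
    """Clamp r to the interval [low, high]."""
    return max(low, min(r, high))


def ppo_clipped_term(ratio, advantage, eps=0.2):
    """Per-sample clipped surrogate: min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    return min(ratio * advantage, clip(ratio, 1 - eps, 1 + eps) * advantage)


# Ratio above 1 + eps with a positive advantage: the gain is capped at (1 + eps) * A.
capped_gain = ppo_clipped_term(1.5, 1.0)
# Ratio below 1 - eps with a negative advantage: the clipped (more pessimistic) value is used.
capped_loss = ppo_clipped_term(0.5, -1.0)
```

Taking the minimum means the clipping only ever removes incentive to move the ratio further from 1; it never makes the objective look better than the unclipped term.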

4 Topic 3.4: Continuous Control with Deterministic Policies

4.1 The Discrete vs Continuous Challenge

4.1.1 Discrete Action Spaces
$a \in \{1, 2, \ldots, |A|\}$

Can enumerate all actions
Softmax outputs probabilities
Standard policy gradient works

4.1.2 Continuous Action Spaces
$a \in \mathbb{R}^d$

Cannot enumerate actions
Cannot use standard softmax
How to handle infinitely many actions?

4.1.3 Naive Solution - Discretization
Divide each dimension into discrete bins. Problem: Curse of dimensionality! With 100 bins per dimension:

d = 2: 100 bins → $100^2$ = 10,000 actions
d = 10: 100 bins → $100^{10} = 10^{20}$ actions
Infeasible for robotics (high DOF)
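The growth of the discretized action set is a single exponentiation. The helper name below is illustrative.

```python
def num_discrete_actions(bins_per_dim, dims):
    """Size of the joint action grid when every dimension gets the same number of bins."""
    return bins_per_dim ** dims


for d in (1, 2, 10):
    print(d, num_discrete_actions(100, d))
```

With 100 bins per dimension, ten dimensions already give $10^{20}$ joint actions, far too many for a softmax output layer.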

4.2 Deterministic Policy Gradient

Key Insight: For continuous actions, instead of sampling from a stochastic policy, use a deterministic policy:

$$a = \mu_\theta(s)$$

This directly outputs the action (no randomness).
Advantage: Can backpropagate from the Q-function directly into the policy without sampling!

4.3 Deterministic Policy Gradient Theorem

Theorem: For a deterministic policy $\mu_\theta(s)$, the policy gradient is:

$$\nabla_\theta J(\theta) = \mathbb{E}_{s \sim \rho^\mu}\left[\nabla_\theta \mu_\theta(s)\, \nabla_a Q(s, a)\big|_{a=\mu_\theta(s)}\right]$$

4.5.2 Actor Update (Deterministic Policy Gradient)
Maximize: $J(\theta) = \mathbb{E}_{s \sim B}[Q_w(s, \mu_\theta(s))]$
Gradient: $\nabla_\theta J(\theta) = \mathbb{E}_{s \sim B}\left[\nabla_a Q_w(s, a)\big|_{a=\mu_\theta(s)} \nabla_\theta \mu_\theta(s)\right]$
Update: $\theta \leftarrow \theta + \alpha_\theta \nabla_\theta J(\theta)$
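The actor update can be traced on a toy problem where everything is available in closed form. This sketch assumes a scalar linear policy $\mu_\theta(s) = \theta s$ and a fixed, hand-written critic $Q(s, a) = -(a - 2s)^2$, so the best policy is $\theta = 2$; both are illustrative stand-ins for learned networks.

```python
import numpy as np

rng = np.random.default_rng(0)
states = rng.uniform(1.0, 2.0, size=256)  # stand-in for a batch sampled from the buffer B


def grad_a_Q(s, a):
    """Analytic dQ/da for the toy critic Q(s, a) = -(a - 2*s)**2."""
    return -2.0 * (a - 2.0 * s)


theta, lr = 0.0, 0.05
for _ in range(200):
    actions = theta * states   # a = mu_theta(s) = theta * s
    grad_theta_mu = states     # d mu_theta(s) / d theta
    # Chain rule: average of dQ/da (evaluated at a = mu_theta(s)) times d mu / d theta.
    grad_J = np.mean(grad_a_Q(states, actions) * grad_theta_mu)
    theta += lr * grad_J       # gradient ascent on J
```

No action is ever sampled: the gradient flows from the critic through the action into the policy parameter, and `theta` converges to 2.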

5 Topic 4.1: Imitation Learning

5.1 Why Imitation Learning Appears

Fundamental Challenge with Trial-and-Error RL:

Requires many failed attempts to learn
In safety-critical domains (surgery, autonomous driving), failures are dangerous or costly
Each interaction with environment takes time and resources
For some tasks it is hard to specify a reward function

Key Insight: Expert demonstrations contain valuable information about good behavior. Can we learn policies by imitating experts without explicit reward functions?

5.2 Problem Setup

Given:

Set of demonstrations (trajectories from expert): $\{(s_0, a_0), (s_1, a_1), \ldots\}$
State and action spaces $S$, $A$
No reward function

Goal: Learn policy $\pi$ that mimics expert demonstrations

Key Question: How is this different from supervised learning?
In SL: Training and test data are drawn i.i.d. from the same fixed distribution
In IL: Policy affects data distribution during execution (sequential decision making)

5.3 Behavior Cloning

Simplest Approach: Reduce imitation learning to supervised learning.

Algorithm 5: Behavior Cloning

Collect expert demonstrations: $D = \{(s_1, a_1), (s_2, a_2), \ldots, (s_N, a_N)\}$
Supervised learning objective: $\min_\theta \sum_i \|a_i - \pi_\theta(s_i)\|^2$, or with cross-entropy for discrete actions: $\max_\theta \sum_i \log \pi_\theta(a_i|s_i)$
Train policy network like standard supervised learning
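For a linear policy and the squared-error objective, behavior cloning is ordinary least squares. The sketch below assumes a hypothetical linear expert `expert_W` that generates the demonstrations; a neural network policy would be trained on the same objective with gradient descent instead.

```python
import numpy as np

rng = np.random.default_rng(0)
expert_W = np.array([[1.0, -2.0]])  # hypothetical expert: a = W s
states = rng.normal(size=(500, 2))  # demonstration states s_i
actions = states @ expert_W.T       # expert actions a_i, shape (500, 1)

# Linear policy pi_theta(s) = s @ theta; least squares minimises sum_i ||a_i - pi_theta(s_i)||^2.
theta, *_ = np.linalg.lstsq(states, actions, rcond=None)


def policy(s):
    """Cloned policy: maps a state (or a batch of states) to an action."""
    return s @ theta
```

Because the demonstrations here are noise-free and the expert is linear, the fit recovers the expert weights. It says nothing about states the expert never visited, which is where behavior cloning tends to fail.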

5.6.1 Problem Setup
Given:

Expert demonstrations $\{\tau_1, \tau_2, \ldots\}$
MDP without reward function: $(S, A, P(s'|s, a), \gamma)$

Find:
Reward function $R(s, a)$ such that expert policy is optimal under $R$

5.6.2 Feature-Based Linear Reward
Assume reward is linear combination of features:

$$R(s, a) = w^T \phi(s, a)$$

where $\phi$ are features and $w$ are weights to learn.

5.6.3 Value Function

$$V_w^\pi(s) = w^T \underbrace{\mathbb{E}_\pi\left[\sum_t \gamma^t \phi(s_t, a_t)\right]}_{\text{Feature expectation } \mu_\pi}$$

5.7 Apprenticeship Learning

Key Insight: We don’t need to exactly recover $w$.
Theorem: If we learn policy $\pi$ such that its feature expectations match expert feature expectations:

$$\mu_\pi = \mu_{\pi^*}$$

Then for any weight vector $w$, our policy is as good as expert:

$$V_w^\pi(s_0) = w^T \mu_\pi = w^T \mu_{\pi^*} = V_w^{\pi^*}(s_0)$$
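Feature expectations can be estimated from demonstrations by averaging discounted feature sums over trajectories. In this sketch the feature map `phi`, the two short trajectories, and the weight vector `w` are hypothetical.

```python
import numpy as np


def feature_expectation(trajectories, phi, gamma):
    """Monte Carlo estimate of mu = E[sum_t gamma**t * phi(s_t, a_t)]."""
    totals = [
        sum(gamma ** t * phi(s, a) for t, (s, a) in enumerate(traj))
        for traj in trajectories
    ]
    return np.mean(totals, axis=0)


def phi(s, a):
    """Hypothetical features: one-hot over two actions, ignoring the state."""
    return np.eye(2)[a]


demos = [[(0, 0), (1, 1)], [(0, 1), (1, 1)]]  # lists of (state, action) pairs
mu = feature_expectation(demos, phi, gamma=0.5)
w = np.array([1.0, 3.0])
value = w @ mu  # value under the linear reward R = w^T phi
```

Two policies with the same `mu` get the same `value` for every choice of `w`, which is the content of the theorem.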

6 Topic 4.2: Model-Based RL 1 - Planning and Control

6.1 Why Model-Based RL Appears

Question: What if we knew the environment dynamics $P(s'|s, a)$ and could simulate trajectories?

Advantage of Model-Based RL:

Can plan with simulated rollouts
Dramatically reduces real experience needed
Applicable to domains with known/learned dynamics

6.2 Closed-Loop vs Open-Loop Planning

6.2.1 Open-Loop Planning

Compute entire action sequence $[a_0, a_1, \ldots, a_T]$ at beginning
Execute without re-planning
Problem: Sensitive to model errors (compound over time)
Advantage: Computationally simpler

$$\max_{a_0, \ldots, a_T} J(a_0, \ldots, a_T) \quad \text{s.t.} \quad s_{t+1} = f(s_t, a_t)$$

6.2.2 Closed-Loop Planning

At each step, observe state and replan
Use feedback to correct for model errors
Problem: Computationally expensive
Advantage: Robust to errors

$$\max_\pi J(\pi) \quad \text{where } a_t = \pi(s_t)$$

6.3 Model Predictive Control (MPC)

Key Idea: Re-plan at every step with short horizon.

Why Short Horizon Works:

Errors compound over time
1-step rollout has minimal error
2-step rollout has some error
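A minimal sketch of the re-planning loop on a toy problem. It assumes a known 1-D model $s_{t+1} = s_t + a_t$, three discrete actions, a cost of $s^2$ per step, and exhaustive search over short action sequences; none of these choices come from the guide.

```python
from itertools import product

ACTIONS = (-1, 0, 1)


def model(s, a):
    """Assumed known dynamics f(s, a)."""
    return s + a


def plan(s, horizon):
    """Exhaustively search action sequences; return the first action of the best one."""
    def cost(seq):
        total, state = 0, s
        for a in seq:
            state = model(state, a)
            total += state ** 2  # penalise distance from the origin
        return total

    best = min(product(ACTIONS, repeat=horizon), key=cost)
    return best[0]


def run_mpc(s, steps, horizon=3):
    """Re-plan from the observed state at every step and execute only the first action."""
    visited = [s]
    for _ in range(steps):
        a = plan(s, horizon)
        s = model(s, a)  # here the "real" system equals the model
        visited.append(s)
    return visited


print(run_mpc(3, steps=5))  # prints [3, 2, 1, 0, 0, 0]
```

Only the first action of each plan is executed, so an error in the later part of a plan is discarded at the next step instead of compounding.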

7 Topic 4.3: Model-Based RL 2 - Deep MBRL and Advanced Planning

7.1 Monte Carlo Tree Search (MCTS)

Why MCTS for Go?

State space is enormous ($19 \times 19$ board)
Cannot evaluate all possible moves
MCTS focuses computation on promising moves

7.1.1 Phase 1: Selection
At each node, select action with high UCB score:

$$\text{UCB}(a) = Q(a) + c \sqrt{\frac{\ln N}{N(a)}}$$

where:

Q(a) = average reward from taking action a
N = total visits to parent node
N (a) = visits to action a
c = exploration constant
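A sketch of the selection rule with the square-root exploration bonus. Giving unvisited actions a score of infinity, computing $N$ as the sum of the child visit counts, and the default `c = 1.4` are assumptions made for this example.

```python
import math


def ucb_score(q, n_parent, n_action, c=1.4):
    """UCB(a) = Q(a) + c * sqrt(ln N / N(a)); unvisited actions are tried first."""
    if n_action == 0:
        return math.inf
    return q + c * math.sqrt(math.log(n_parent) / n_action)


def select_action(q_values, visit_counts, c=1.4):
    """Return the index of the action with the highest UCB score."""
    n_parent = sum(visit_counts)
    scores = [ucb_score(q, n_parent, n, c) for q, n in zip(q_values, visit_counts)]
    return max(range(len(scores)), key=scores.__getitem__)


# Slightly lower average reward but far fewer visits: the bonus tips the choice.
choice = select_action([0.6, 0.55], [90, 10])
```

With equal visit counts the bonus terms are equal and the rule falls back to picking the highest average reward.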

7.1.2 Phases 2-4: Expansion, Simulation, Backup

Expansion: Add new node at frontier of tree
Simulation: Run random playout from new node to terminal state
Backup: Update statistics for all nodes in path

7.2 AlphaGo: Combining Learning and MCTS

Architecture:

Policy Network $\pi_\theta(a|s)$:
- Trained with behavior cloning on expert games
- Then refined with REINFORCE self-play
- Used for rollout policy during MCTS simulations
Value Network $V_w(s)$:
- Predicts winner from position
- Trained on self-play game outcomes
- Used to initialize MCTS value estimates

MCTS:

Replaces direct policy execution
UCB formula uses both value network and visit counts
Much more powerful than policy network alone!
