basic cs with exercise and full detail, Schemes and Mind Maps of Computer science


Typology: Schemes and Mind Maps

2025/2026

Uploaded on 12/11/2025

situ-kam-chung
Topics 3.1-3.4 & 4.1-4.3 Reinforcement Learning Study Guide
1 Topic 3.1: Policy Gradient

1.1 Why Policy Gradient Appears

Problem Context: In previous value-based RL approaches, we learned value functions $V(s)$ or $Q(s, a)$ and then derived policies indirectly (e.g., $\epsilon$-greedy). This raises a fundamental question: If we care about optimal behavior, why not learn the policy directly?

Policy-based RL directly parameterizes and learns the policy $\pi_\theta(a|s)$, which offers several advantages over value-based approaches:

  1. True objective alignment: We directly optimize the quantity we care about (the policy)
  2. Better for complex value functions: Sometimes policies are simple while values and models are complex
  3. Continuous action spaces: Much more effective in high-dimensional or continuous action spaces
  4. Stochastic policies: Can learn policies that are inherently stochastic (not just derived from value functions)
  5. Convergence properties: Often converges better than value-based methods

1.2 Logic Flow: The Complete Picture

The logical progression in policy gradient learning follows this flow:

  1. Start with policy parametrization: $\pi_\theta(a|s)$, a neural network outputting action probabilities
  2. Define objective: Maximize expected cumulative reward $J(\theta) = \mathbb{E}\left[\sum_{t=0}^{T} \gamma^t r_t\right]$
  3. Derive policy gradient: Find $\nabla_\theta J(\theta)$
  4. Apply gradient ascent: Update $\theta \leftarrow \theta + \alpha \nabla_\theta J(\theta)$
  5. Address practical issues: High variance → add baselines; on-policy inefficiency → off-policy corrections

1.3 Stochastic Policies: Why They Matter

Key Insight: In fully-observable MDPs, there is always an optimal deterministic policy. However, in many realistic problems with partial observability, the optimal policy is stochastic.

1.3.1 Example - Rock-Paper-Scissors

  • A deterministic policy can always be exploited (opponent just counters your fixed choice)
  • The optimal Nash equilibrium is a uniform random policy: $\pi(a|s) = 1/3$ for each action
  • Conclusion: A stochastic policy is optimal
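
The exploitability argument can be checked numerically. The following is a minimal sketch, assuming a payoff of +1 for a win, 0 for a draw and -1 for a loss; the helper names (`payoff`, `worst_case_payoff`) are illustrative and not part of the guide.

```python
ACTIONS = ("rock", "paper", "scissors")
# BEATS[x] is the action that x defeats.
BEATS = {"rock": "scissors", "paper": "rock", "scissors": "paper"}

def payoff(mine, theirs):
    """Assumed payoff: +1 for a win, 0 for a draw, -1 for a loss."""
    if mine == theirs:
        return 0
    return 1 if BEATS[mine] == theirs else -1

def worst_case_payoff(policy):
    """Expected payoff against the opponent's best response to `policy`."""
    return min(
        sum(p * payoff(a, opp) for a, p in policy.items())
        for opp in ACTIONS
    )

always_rock = {"rock": 1.0, "paper": 0.0, "scissors": 0.0}
uniform = {a: 1.0 / 3.0 for a in ACTIONS}

print(worst_case_payoff(always_rock))  # the opponent plays paper every time
print(worst_case_payoff(uniform))      # no opponent action does better than a draw on average
```

The deterministic policy loses every round to its counter, while the uniform policy cannot be pushed below an expected payoff of zero.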

1.3.2 Example - Aliased Gridworld

  • Two different states look identical to the agent (aliased observations)
  • A deterministic policy must choose the same action in both states
  • Even if one action is optimal for one state, it’s suboptimal for the other
  • Solution: Use a stochastic policy with $\pi(a_1|s) = 0.5$, $\pi(a_2|s) = 0.5$ to reach the goal with high probability

1.4 Policy Network Architecture

A policy network takes a state as input and outputs a probability distribution over actions:

$$\pi_\theta(a|s) = \text{NeuralNetwork}_\theta(s) \rightarrow \text{probability for each action}$$

For discrete actions, the final layer uses softmax activation:

$$\pi_\theta(a|s) = \frac{e^{z_a}}{\sum_{a'} e^{z_{a'}}}$$

where $z_a$ is the output from the second-to-last layer for action $a$.
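
A minimal sketch of the softmax step in plain Python. Subtracting the largest logit before exponentiating is a common numerical-stability choice assumed here; it does not change the resulting probabilities.

```python
import math

def softmax(logits):
    """Turn a list of logits z_a into action probabilities pi(a|s)."""
    m = max(logits)  # shift by the max logit to avoid overflow (assumed convention)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([1.0, 2.0, 3.0])
print(probs)
```

Larger logits receive larger probabilities, and the outputs always sum to one.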

1.5 Policy Gradient Derivation (Complete Proof)

1.5.1 Starting Point

We want to maximize the expected return:

$$J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}[R(\tau)]$$

where $\tau$ is a trajectory and $R(\tau) = \sum_{t=0}^{T} \gamma^t r_t$ is the cumulative discounted reward.

1.5.2 Key Derivation - Why Log Trick Works

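$$\begin{aligned}
\nabla_\theta J(\theta) &= \nabla_\theta \mathbb{E}_{\tau \sim \pi_\theta}[R(\tau)] && (1) \\
&= \nabla_\theta \int \pi_\theta(\tau) R(\tau)\, d\tau && (2) \\
&= \int \nabla_\theta \pi_\theta(\tau) R(\tau)\, d\tau && (3)
\end{aligned}$$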
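
Step (3) is usually continued with the log-derivative identity. The lines that follow are a sketch of that standard continuation rather than a transcription of the guide's own numbered steps; $p(s_0)$ and $p(s_{t+1}|s_t, a_t)$ are notation introduced here for the initial-state and transition distributions, neither of which depends on $\theta$.

```latex
\begin{aligned}
\nabla_\theta \pi_\theta(\tau) &= \pi_\theta(\tau)\, \nabla_\theta \log \pi_\theta(\tau) \\
\nabla_\theta J(\theta) &= \int \pi_\theta(\tau)\, \nabla_\theta \log \pi_\theta(\tau)\, R(\tau)\, d\tau
  = \mathbb{E}_{\tau \sim \pi_\theta}\left[\nabla_\theta \log \pi_\theta(\tau)\, R(\tau)\right] \\
\log \pi_\theta(\tau) &= \log p(s_0) + \sum_{t=0}^{T} \log \pi_\theta(a_t|s_t) + \sum_{t=0}^{T} \log p(s_{t+1}|s_t, a_t) \\
\nabla_\theta \log \pi_\theta(\tau) &= \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t|s_t)
\end{aligned}
```

The dynamics terms vanish under $\nabla_\theta$, which is why the gradient can be estimated from sampled trajectories without a model of the environment.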

1.7 Understanding Policy Gradients

1.7.1 Interpretation 1 - Connection to Maximum Likelihood

Maximum likelihood objective (supervised learning): $J_{ML} = \mathbb{E}[\log \pi_\theta(a|s)]$

Policy gradient objective: $J_{PG} = \mathbb{E}[\log \pi_\theta(a|s)\, G(s, a)]$

Key Difference: We weight the log-probability by the return $G(s, a)$:

  • Good actions (high return) → increase probability
  • Bad actions (low return) → decrease probability
  • Formalizes trial-and-error learning
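
To make the return weighting concrete, here is a small sketch that computes the expected policy gradient exactly for a one-step problem (a bandit) with a softmax policy. The bandit setting, the per-action returns and the function names are assumptions chosen for illustration.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def expected_policy_gradient(logits, returns):
    """Exact E_a[ G(a) * d/dz log pi(a) ] for a softmax policy over logits z.

    Uses d/dz_k log pi(a) = 1[a == k] - pi(k).
    """
    probs = softmax(logits)
    n = len(logits)
    grad = [0.0] * n
    for a in range(n):
        for k in range(n):
            dlog = (1.0 if a == k else 0.0) - probs[k]
            grad[k] += probs[a] * returns[a] * dlog
    return grad

# Action 0 has return 1, action 1 has return 0; the policy starts uniform.
grad = expected_policy_gradient([0.0, 0.0], [1.0, 0.0])
print(grad)  # [0.25, -0.25]
```

Gradient ascent on these logits raises the probability of the higher-return action and lowers the other, which is the trial-and-error behaviour described in the bullets.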

1.8 Policy Gradient with Baselines

1.8.1 Key Theorem

If $b(s)$ is any baseline function that depends only on state (not on actions), then:

$$\mathbb{E}_a[b(s) \nabla_\theta \log \pi_\theta(a|s)] = 0$$

1.8.2 Proof

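$$\begin{aligned}
\mathbb{E}_a[b(s) \nabla_\theta \log \pi_\theta(a|s)] &= b(s)\, \mathbb{E}_a[\nabla_\theta \log \pi_\theta(a|s)] && (8) \\
&= b(s)\, \mathbb{E}_a\left[\frac{\nabla_\theta \pi_\theta(a|s)}{\pi_\theta(a|s)}\right] && (9) \\
&= b(s)\, \nabla_\theta \sum_a \pi_\theta(a|s) && (10) \\
&= b(s)\, \nabla_\theta 1 = 0 && (11)
\end{aligned}$$

Step (10) writes the expectation as $\sum_a \pi_\theta(a|s)\,(\cdot)$, so the $\pi_\theta(a|s)$ factors cancel and the gradient can be moved outside the sum.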
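
The theorem can be verified numerically for a softmax policy by summing over all actions. This is a sketch under the assumption of a discrete action set parameterized directly by logits; the function name is illustrative.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def expected_baseline_term(logits, b):
    """Exact E_a[ b * d/dz log pi(a) ] for a softmax policy over logits z."""
    probs = softmax(logits)
    n = len(logits)
    out = [0.0] * n
    for a in range(n):
        for k in range(n):
            dlog = (1.0 if a == k else 0.0) - probs[k]
            out[k] += probs[a] * b * dlog
    return out

# Every component is zero up to floating-point error, whatever b is.
print(expected_baseline_term([0.3, -1.2, 2.0], b=5.0))
```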

1.8.3 Modified Policy Gradient with Baseline

$$\nabla_\theta J(\theta) = \mathbb{E}\left[\sum_t \nabla_\theta \log \pi_\theta(a_t|s_t)\,(G_t - b(s_t))\right]$$

The baseline reduces variance without introducing bias.

1.8.4 Best Baseline Choice

$b(s) = V(s)$ (the state-value function)

The advantage function is defined as:

$$A(s, a) = Q(s, a) - V(s)$$

Updated policy gradient:

$$\nabla_\theta J(\theta) = \mathbb{E}\left[\sum_t \nabla_\theta \log \pi_\theta(a_t|s_t)\, A(s_t, a_t)\right]$$

1.9 Off-Policy Policy Gradients

1.9.1 Problem

REINFORCE is on-policy: we must collect data with the current policy. This is sample-inefficient, because data gathered under an earlier policy cannot be reused after an update.

1.9.2 Solution: Importance Sampling

Use importance sampling to correct for policy mismatch.

Importance Sampling Formula:

$$\mathbb{E}_{a \sim \pi(a|s)}[f(a)] = \mathbb{E}_{a \sim \beta(a|s)}\left[\frac{\pi(a|s)}{\beta(a|s)} f(a)\right]$$

where $\beta$ is the behavior policy (used to collect data) and $\pi$ is the target policy (what we want to improve).
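
A short sketch of the identity on a three-action example: actions are sampled from the behavior policy and reweighted by $\pi/\beta$, then compared with the exact expectation under the target policy. The distributions, the function `f`, the sample count and the seed are illustrative assumptions; the identity also requires $\beta(a|s) > 0$ wherever $\pi(a|s) > 0$.

```python
import random

def exact_expectation(dist, f):
    """E_{a ~ dist}[f(a)] for a discrete distribution given as {action: prob}."""
    return sum(p * f(a) for a, p in dist.items())

def importance_sampling_estimate(target, behavior, f, n_samples, seed=0):
    """Estimate E_{a ~ target}[f(a)] using samples drawn from `behavior`."""
    rng = random.Random(seed)
    actions = list(behavior)
    weights = [behavior[a] for a in actions]
    samples = rng.choices(actions, weights=weights, k=n_samples)
    return sum(target[a] / behavior[a] * f(a) for a in samples) / n_samples

target = {0: 0.7, 1: 0.2, 2: 0.1}               # pi: the policy we care about
behavior = {0: 1 / 3, 1: 1 / 3, 2: 1 / 3}       # beta: the policy that collected data

def f(a):
    return a * a + 1

exact = exact_expectation(target, f)
estimate = importance_sampling_estimate(target, behavior, f, n_samples=20000)
print(exact, estimate)
```

The estimate approaches the exact value as the number of samples grows, even though no sample was drawn from the target policy.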

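1.9.3 Off-Policy Policy Gradient Derivation

Starting from:

$$J(\theta) = \mathbb{E}_{s \sim \rho^\pi}[V^\pi(s)]$$

For a new policy $\theta'$:

$$J(\theta') = \mathbb{E}_{s \sim \rho^\pi}\left[V^{\pi'}(s)\right]$$

Using importance sampling:

$$\begin{aligned}
J(\theta') &= \mathbb{E}_{s \sim \rho^\pi}\left[\sum_a \pi_{\theta'}(a|s)\, Q^{\pi_{\theta'}}(s, a)\right] \\
&= \mathbb{E}_{s \sim \rho^\pi}\left[\sum_a \beta(a|s)\, \frac{\pi_{\theta'}(a|s)}{\beta(a|s)}\, Q^{\pi_{\theta'}}(s, a)\right]
\end{aligned}$$

Taking derivatives:

$$\nabla_\theta J(\theta) = \mathbb{E}_{s, a \sim \beta}\left[\frac{\pi_\theta(a|s)}{\beta(a|s)}\, \nabla_\theta \log \pi_\theta(a|s)\, Q^\pi(s, a)\right]$$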

Important Note: This gradient is biased because it ignores the importance weights on future states. The bias is generally accepted in practice.

2.4 Complete Algorithm: Online Actor-Critic

Algorithm 2: Online Actor-Critic with Discount Factor

Initialize policy parameters $\theta$ and critic parameters $w$
repeat:

  1. Take action $a \sim \pi_\theta(a|s)$, observe $r$, $s'$
  2. Update $w$ using TD target (model-free): $\delta = r + \gamma V_w(s') - V_w(s)$, then $w \leftarrow w + \alpha_w\, \delta\, \nabla_w V_w(s)$ (a semi-gradient descent step on $\tfrac{1}{2}\delta^2$)
  3. Compute advantage using critic: $A(s, a) = \delta = r + \gamma V_w(s') - V_w(s)$
  4. Compute policy gradient: $\nabla_\theta J(\theta) \approx A(s, a)\, \nabla_\theta \log \pi_\theta(a|s)$
  5. Update: $\theta \leftarrow \theta + \alpha_\theta\, A(s, a)\, \nabla_\theta \log \pi_\theta(a|s)$
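
One iteration of the loop can be sketched for the simplest possible function approximators: a tabular critic `V[s]` and a tabular softmax policy with logits `theta[s][a]`. These tabular choices, the function name and the default step sizes are assumptions made for illustration; the guide's networks would replace the table lookups.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def actor_critic_step(theta, V, s, a, r, s_next,
                      gamma=0.99, alpha_w=0.1, alpha_theta=0.1):
    """Apply one online actor-critic update in place and return the TD error."""
    # TD error, also used as the advantage estimate A(s, a).
    delta = r + gamma * V[s_next] - V[s]
    # Critic: for a table, the gradient of V[s] with respect to itself is 1.
    V[s] += alpha_w * delta
    # Actor: d/dtheta[s][b] log pi(a|s) = 1[a == b] - pi(b|s) for softmax logits.
    probs = softmax(theta[s])
    for b in range(len(theta[s])):
        dlog = (1.0 if b == a else 0.0) - probs[b]
        theta[s][b] += alpha_theta * delta * dlog
    return delta

theta = [[0.0, 0.0], [0.0, 0.0]]  # two states, two actions
V = [0.0, 0.0]
delta = actor_critic_step(theta, V, s=0, a=1, r=1.0, s_next=1)
print(delta, V, theta)
```

A positive TD error raises both the value estimate of the visited state and the logit of the action that was taken.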

2.5 Advantage Actor-Critic (A2C)

Batch Version of Actor-Critic:

  1. Sample n trajectories from current policy
  2. For each trajectory, compute advantages using critic: $A(s_t, a_t) = r_t + \gamma V_w(s_{t+1}) - V_w(s_t)$
  3. Update critic by fitting to bootstrapped returns:
    • Target: $\hat{G}_t = r_t + \gamma V_w(s_{t+1})$ (or MC return)
    • Loss: $L_w = (V_w(s_t) - \hat{G}_t)^2$
  4. Update actor using advantage: $\theta \leftarrow \theta + \alpha_\theta \frac{1}{n} \sum_t A(s_t, a_t)\, \nabla_\theta \log \pi_\theta(a_t|s_t)$
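
Steps 2 and 3 reduce to simple arithmetic once the critic's predictions are available. The sketch assumes the critic values for one trajectory are given as a list with one extra entry for the state after the last reward (0 if that state is terminal); the function name and the averaging of the squared loss over time steps are illustrative choices.

```python
def a2c_targets(rewards, values, gamma=0.99):
    """Return (targets, advantages, critic_loss) for one trajectory.

    values[t] is V_w(s_t); it has len(rewards) + 1 entries so that
    values[t + 1] exists for every reward.
    """
    targets = [r + gamma * values[t + 1] for t, r in enumerate(rewards)]
    advantages = [g - values[t] for t, g in enumerate(targets)]
    # Mean squared error between V_w(s_t) and the bootstrapped target.
    critic_loss = sum(a * a for a in advantages) / len(advantages)
    return targets, advantages, critic_loss

targets, advantages, loss = a2c_targets(
    rewards=[1.0, 0.0], values=[0.5, 1.0, 0.0], gamma=0.9
)
print(targets, advantages, loss)
```

With the bootstrapped target, the advantage and the critic's regression error are the same quantity, so one pass over the trajectory serves both updates.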

3 Topic 3.3: Advanced Policy Gradients (TRPO and PPO)

3.1 The Problem: Step Size Selection

Challenge: Policy gradient is stochastic gradient ascent with step size $\alpha$:

$$\theta \leftarrow \theta + \alpha \nabla_\theta J(\theta)$$

Two critical problems in RL (unlike supervised learning):

  1. Step too large: Bad policy discovered → all future data collected from bad policy → hard to recover
  2. Step too small: Very slow progress, inefficient use of experience

In supervised learning, a bad step is easy to recover from because the data is fixed. In RL, every policy update changes the distribution of the data collected next.

3.2 Natural Gradient: Distribution Space vs Parameter Space

Key Insight: Euclidean distance in parameter space doesn’t correspond to distance in policy space.

3.2.1 Example: Bernoulli Policy

Consider a Bernoulli policy with

$$\pi(a|s) = \begin{cases} p & \text{if } a = 1 \\ 1 - p & \text{if } a = 0 \end{cases}$$

  • Parameterization 1: $\theta = p$ directly
    • Step of 0.1: $p: 0.4 \rightarrow 0.5$ (Euclidean distance = 0.1)
  • Parameterization 2: $\theta = \log(p/(1 - p))$ (logit)
    • Step of 0.1: $p: 0.4 \rightarrow 0.424$ (same Euclidean distance = 0.1)
    • Policy change is different!

Solution: Use natural gradient based on KL divergence instead of Euclidean distance.
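
The two parameter steps can be compared directly, together with the KL divergence each one induces between the old and new policy. This is a small sketch; the helper names are illustrative.

```python
import math

def logit(p):
    return math.log(p / (1.0 - p))

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def bernoulli_kl(p, q):
    """KL( Bernoulli(p) || Bernoulli(q) )."""
    return p * math.log(p / q) + (1.0 - p) * math.log((1.0 - p) / (1.0 - q))

p0 = 0.4
p_direct = p0 + 0.1                    # step of 0.1 when theta = p
p_logit = sigmoid(logit(p0) + 0.1)     # step of 0.1 when theta = logit(p)

kl_direct = bernoulli_kl(p0, p_direct)
kl_logit = bernoulli_kl(p0, p_logit)
print(p_direct, p_logit)
print(kl_direct, kl_logit)
```

The same parameter-space step moves the policy much further in one parameterization than in the other, which is the mismatch a KL-based step size removes.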

3.3 Natural Gradient Derivation

3.3.1 Setup

We want to maximize objective $J(\theta)$ subject to KL constraint:

$$\max_{\theta'} J(\theta') \quad \text{subject to} \quad D_{KL}(\pi_\theta \,\|\, \pi_{\theta'}) \le \delta$$

3.3.2 Lagrangian

$$L = J(\theta') - \frac{\lambda}{2} D_{KL}(\pi_\theta \,\|\, \pi_{\theta'})$$

  • $\hat{A}_t$ is the advantage estimate
  • $\epsilon$ is a hyperparameter (typically 0.2)

3.5.2 Clipping Mechanism

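$$\text{clip}(r, 1 - \epsilon, 1 + \epsilon) = \begin{cases} 1 - \epsilon & \text{if } r < 1 - \epsilon \\ r & \text{if } 1 - \epsilon \le r \le 1 + \epsilon \\ 1 + \epsilon & \text{if } r > 1 + \epsilon \end{cases}$$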
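
A minimal sketch of the clipping function, together with how it is typically used. The per-sample surrogate `min(r * A, clip(r) * A)` is the commonly cited PPO clipped objective and is an assumption here, since the objective itself is not shown in this excerpt; `r` stands for the probability ratio between the new and old policy.

```python
def clip(r, lo, hi):
    """Clamp r to the interval [lo, hi]."""
    return max(lo, min(r, hi))

def ppo_clipped_term(ratio, advantage, eps=0.2):
    """Assumed per-sample PPO surrogate: min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    return min(ratio * advantage,
               clip(ratio, 1.0 - eps, 1.0 + eps) * advantage)

print(clip(1.5, 0.8, 1.2), clip(0.5, 0.8, 1.2), clip(1.0, 0.8, 1.2))
print(ppo_clipped_term(1.5, advantage=2.0))   # gain is capped once r exceeds 1 + eps
print(ppo_clipped_term(1.5, advantage=-1.0))  # a worse-than-clipped value is kept
```

Taking the minimum means the clip only removes the incentive to move the ratio further in the direction that already improved the objective; it never hides a loss.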

4 Topic 3.4: Continuous Control with Deterministic Policies

4.1 The Discrete vs Continuous Challenge

4.1.1 Discrete Action Spaces

$a \in \{1, 2, \ldots, |A|\}$

  • Can enumerate all actions
  • Softmax outputs probabilities
  • Standard policy gradient works

4.1.2 Continuous Action Spaces

$a \in \mathbb{R}^d$

  • Cannot enumerate actions
  • Cannot use standard softmax
  • How to handle infinitely many actions?

4.1.3 Naive Solution - Discretization

Divide each dimension into discrete bins. Problem: Curse of dimensionality! With 100 bins per dimension:

  • d = 2: $100^2$ = 10,000 actions
  • d = 10: $100^{10} = 10^{20}$ actions
  • Infeasible for robotics (high DOF)
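
The growth is easy to tabulate: with the same number of bins in every dimension, the number of joint actions is bins raised to the power d. A short sketch, with an illustrative function name:

```python
def num_discrete_actions(bins_per_dim, d):
    """Number of joint actions when each of d dimensions has bins_per_dim bins."""
    return bins_per_dim ** d

for d in (1, 2, 6, 10):
    print(d, num_discrete_actions(100, d))
```

Each added degree of freedom multiplies the action count by the number of bins, so a softmax output layer over all joint actions quickly becomes impossible to build.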

4.2 Deterministic Policy Gradient

Key Insight: For continuous actions, instead of sampling from a stochastic policy, use a deterministic policy:

$$a = \mu_\theta(s)$$

This directly outputs the action (no randomness).

Advantage: Can backpropagate from the Q-function through the action into the policy, without sampling!

4.3 Deterministic Policy Gradient Theorem

Theorem: For a deterministic policy $\mu_\theta(s)$, the policy gradient is:

$$\nabla_\theta J(\theta) = \mathbb{E}_{s \sim \rho^\mu}\left[\nabla_\theta \mu_\theta(s)\, \nabla_a Q(s, a)\big|_{a = \mu_\theta(s)}\right]$$

4.5.2 Actor Update (Deterministic Policy Gradient)

Maximize: $J(\theta) = \mathbb{E}_{s \sim B}[Q_w(s, \mu_\theta(s))]$

Gradient: $\nabla_\theta J(\theta) = \mathbb{E}_{s \sim B}\left[\nabla_a Q_w(s, a)\big|_{a = \mu_\theta(s)}\, \nabla_\theta \mu_\theta(s)\right]$

Update: $\theta \leftarrow \theta + \alpha_\theta \nabla_\theta J(\theta)$
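
The chain rule in the actor update can be traced by hand on a toy problem. The sketch assumes a scalar linear policy $\mu_\theta(s) = \theta s$ and a fixed, hand-written critic $Q(s, a) = -(a - 2s)^2$, whose best action is $a = 2s$; both are hypothetical stand-ins for the learned networks.

```python
def dpg_actor_update(theta, states, alpha=0.1):
    """One deterministic policy gradient step for mu_theta(s) = theta * s.

    The critic is the assumed toy function Q(s, a) = -(a - 2 s)^2.
    """
    grad = 0.0
    for s in states:
        a = theta * s                     # action chosen by the policy
        dq_da = -2.0 * (a - 2.0 * s)      # dQ/da evaluated at a = mu_theta(s)
        dmu_dtheta = s                    # d mu_theta(s) / d theta
        grad += dq_da * dmu_dtheta        # chain rule
    grad /= len(states)                   # average over the batch of states
    return theta + alpha * grad           # gradient ascent

theta = 0.0
batch = [1.0, -1.0, 0.5]
for _ in range(200):
    theta = dpg_actor_update(theta, batch)
print(theta)  # approaches 2, the slope of the best action a = 2 s
```

No action is ever sampled: the gradient of the critic with respect to the action tells the policy directly which way to move its output.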

5 Topic 4.1: Imitation Learning

5.1 Why Imitation Learning Appears

Fundamental Challenge with Trial-and-Error RL:

  1. Requires many failed attempts to learn
  2. In safety-critical domains (surgery, autonomous driving), failures are dangerous or costly
  3. Each interaction with environment takes time and resources
  4. For some tasks, a reward function is hard to specify

Key Insight: Expert demonstrations contain valuable information about good behavior. Can we learn policies by imitating experts without explicit reward functions?

5.2 Problem Setup

Given:

  • Set of demonstrations (trajectories from expert): $\{(s_0, a_0), (s_1, a_1), \ldots\}$
  • State and action spaces $S$, $A$
  • No reward function

Goal: Learn policy $\pi$ that mimics expert demonstrations

Key Question: How is this different from supervised learning?

  • In SL: Training and test data are assumed i.i.d. from the same fixed distribution
  • In IL: Policy affects data distribution during execution (sequential decision making)

5.3 Behavior Cloning

Simplest Approach: Reduce imitation learning to supervised learning.

Algorithm 5: Behavior Cloning

  1. Collect expert demonstrations: $D = \{(s_1, a_1), (s_2, a_2), \ldots, (s_N, a_N)\}$
  2. Supervised learning objective: $\min_\theta \sum_i \|a_i - \pi_\theta(s_i)\|^2$, or with cross-entropy for discrete actions: $\max_\theta \sum_i \log \pi_\theta(a_i|s_i)$
  3. Train policy network like standard supervised learning
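
For the squared-error objective, the simplest instance has a closed form. The sketch assumes scalar states and actions and a linear policy $\pi_\theta(s) = \theta s$ in place of a network; the demonstrations are made up and generated by a hypothetical expert that always plays $a = 3s$.

```python
def fit_linear_policy(demos):
    """Least-squares fit of theta in pi_theta(s) = theta * s.

    Minimizes sum_i (a_i - theta * s_i)^2, whose solution is
    theta = sum(s_i * a_i) / sum(s_i^2).
    """
    num = sum(s * a for s, a in demos)
    den = sum(s * s for s, _ in demos)
    return num / den

demos = [(1.0, 3.0), (2.0, 6.0), (-1.0, -3.0)]  # (state, expert action) pairs
theta = fit_linear_policy(demos)
print(theta)  # 3.0
```

The fit only sees states the expert visited, which is where the difference from ordinary supervised learning shows up once the learned policy starts steering the system itself.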

5.6.1 Problem Setup

Given:

  • Expert demonstrations $\{\tau_1, \tau_2, \ldots\}$
  • MDP without reward function: $(S, A, P(s'|s, a), \gamma)$

Find:

  • Reward function $R(s, a)$ such that expert policy is optimal under $R$

5.6.2 Feature-Based Linear Reward

Assume reward is linear combination of features:

$$R(s, a) = w^T \phi(s, a)$$

where $\phi$ are features and $w$ are weights to learn.

5.6.3 Value Function

$$V_w^\pi(s) = w^T \underbrace{\mathbb{E}_\pi\left[\sum_t \gamma^t \phi(s_t, a_t)\right]}_{\text{Feature expectation } \mu_\pi}$$

5.7 Apprenticeship Learning

Key Insight: We don’t need to exactly recover $w$.

Theorem: If we learn policy $\pi$ such that its feature expectations match expert feature expectations:

$$\mu_\pi = \mu_{\pi^*}$$

Then for any weight vector $w$, our policy is as good as expert:

$$V_w^\pi(s_0) = w^T \mu_\pi = w^T \mu_{\pi^*} = V_w^{\pi^*}(s_0)$$
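
Feature expectations can be estimated by averaging discounted feature sums over sampled trajectories, after which the value under any linear reward is a dot product. The sketch below assumes trajectories are lists of (state, action) pairs and uses a made-up two-dimensional feature map; all names are illustrative.

```python
def feature_expectation(trajectories, phi, gamma):
    """Monte Carlo estimate of mu = E[ sum_t gamma^t phi(s_t, a_t) ]."""
    dim = len(phi(*trajectories[0][0]))
    mu = [0.0] * dim
    for traj in trajectories:
        for t, (s, a) in enumerate(traj):
            features = phi(s, a)
            for k in range(dim):
                mu[k] += (gamma ** t) * features[k]
    return [m / len(trajectories) for m in mu]

def value(w, mu):
    """V_w = w^T mu for a linear reward R(s, a) = w^T phi(s, a)."""
    return sum(wi * mi for wi, mi in zip(w, mu))

def phi(s, a):
    return [float(s), float(a)]  # hypothetical features: the raw state and action

expert_trajs = [[(1, 0), (2, 1)]]
mu_expert = feature_expectation(expert_trajs, phi, gamma=0.5)
print(mu_expert)                     # [2.0, 0.5]
print(value([1.0, 2.0], mu_expert))  # 3.0
```

Because the value depends on the policy only through the feature expectation, any policy that reproduces the expert's vector obtains the same value for every choice of weights.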

6 Topic 4.2: Model-Based RL 1 - Planning and Control

6.1 Why Model-Based RL Appears

Question: What if we knew the environment dynamics $P(s'|s, a)$ and could simulate trajectories?

Advantage of Model-Based RL:

  • Can plan with simulated rollouts
  • Dramatically reduces real experience needed
  • Applicable to domains with known/learned dynamics

6.2 Closed-Loop vs Open-Loop Planning

6.2.1 Open-Loop Planning

  • Compute entire action sequence $[a_0, a_1, \ldots, a_T]$ at beginning
  • Execute without re-planning
  • Problem: Sensitive to model errors (compound over time)
  • Advantage: Computationally simpler

$$\max_{a_0, \ldots, a_T} J(a_0, \ldots, a_T) \quad \text{s.t.} \quad s_{t+1} = f(s_t, a_t)$$

6.2.2 Closed-Loop Planning

  • At each step, observe state and replan
  • Use feedback to correct for model errors
  • Problem: Computationally expensive
  • Advantage: Robust to errors

$$\max_\pi J(\pi) \quad \text{where} \quad a_t = \pi(s_t)$$

6.3 Model Predictive Control (MPC)

Key Idea: Re-plan at every step with short horizon.

Why Short Horizon Works:

  • Errors compound over time
  • 1-step rollout has minimal error
  • 2-step rollout has some error

7 Topic 4.3: Model-Based RL 2 - Deep MBRL and Advanced Planning

7.1 Monte Carlo Tree Search (MCTS)

Why MCTS for Go?

  • State space is enormous (19×19 board)
  • Cannot evaluate all possible moves
  • MCTS focuses computation on promising moves

7.1.1 Phase 1: Selection

At each node, select the action with the highest UCB score:

$$UCB(a) = Q(a) + c \sqrt{\frac{\ln N}{N(a)}}$$

where:

  • Q(a) = average reward from taking action a
  • N = total visits to parent node
  • N (a) = visits to action a
  • c = exploration constant
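
A minimal sketch of the selection rule. Statistics are stored as `{action: (Q, visits)}`, the parent count is taken to be the sum of the child counts, and unvisited actions receive an infinite score so that each is tried once; these conventions and the default value of `c` are assumptions, not details given in the guide.

```python
import math

def ucb_score(q, n_parent, n_action, c=1.4):
    """Q(a) + c * sqrt(ln N / N(a)); unvisited actions get +infinity (assumed)."""
    if n_action == 0:
        return float("inf")
    return q + c * math.sqrt(math.log(n_parent) / n_action)

def select_action(stats, c=1.4):
    """Pick the action with the highest UCB score from {action: (Q, visits)}."""
    n_parent = sum(n for _, n in stats.values())
    return max(stats, key=lambda a: ucb_score(stats[a][0], n_parent, stats[a][1], c))

stats = {"a": (0.6, 10), "b": (0.5, 1)}
print(select_action(stats))         # "b": rarely visited, so the bonus dominates
print(select_action(stats, c=0.0))  # "a": pure exploitation of the higher average
```

The exploration constant trades off the two terms: a larger `c` favours rarely tried moves, a smaller one favours moves with the best average so far.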

7.1.2 Phases 2-4: Expansion, Simulation, Backup

  • Expansion: Add new node at frontier of tree
  • Simulation: Run random playout from new node to terminal state
  • Backup: Update statistics for all nodes in path

7.2 AlphaGo: Combining Learning and MCTS

Architecture:

  1. Policy Network $\pi_\theta(a|s)$:
    • Trained with behavior cloning on expert games
    • Then refined with REINFORCE self-play
    • Used for rollout policy during MCTS simulations
  2. Value Network $V_w(s)$:
    • Predicts winner from position
    • Trained on self-play game outcomes
    • Used to initialize MCTS value estimates
  3. MCTS:
    • Replaces direct policy execution
    • UCB formula uses both value network and visit counts
    • Much more powerful than policy network alone!