Reinforcement Learning Practice Exam: Questions and Answers, Exams of Technology

A practice exam for reinforcement learning, featuring multiple-choice questions with detailed explanations. It covers key concepts such as Markov decision processes (MDPs), Bellman equations, Q-learning, SARSA, deep Q-networks (DQN), policy gradient methods like REINFORCE and PPO, and exploration techniques. The questions test understanding of model-based and model-free RL, temporal difference learning, and neural network architectures for RL. It is designed to help students and practitioners assess their knowledge and prepare for certification or further study in reinforcement learning. The exam includes questions on topics such as target networks, experience replay, and actor-critic methods, providing a comprehensive review of essential RL concepts.

Typology: Exams

2025/2026

Available from 12/21/2025

shilpi-jain-1

Practical Reinforcement Learning Certificate
Practice Exam
**Question 1.** Which component of the reinforcement-learning loop directly determines the
set of actions an agent can take at any given moment?
A) State space (S)
B) Reward function (R)
C) Action space (A)
D) Transition dynamics (P)
Answer: C
Explanation: The action space defines the actions the agent may select. The state space,
reward function, and transition dynamics describe situations and consequences of acting,
not which actions are available.
**Question 2.** In the Markov property, the probability of the next state depends on:
A) The entire history of states and actions
B) Only the current state and action
C) The discount factor γ
D) The reward received at the previous step
Answer: B
Explanation: The Markov property states that the future is conditionally independent of the past
given the present state-action pair.
**Question 3.** Which of the following correctly lists the five elements of an MDP?
A) ⟨S, A, R, γ, π⟩
B) ⟨S, A, P, R, γ⟩
C) ⟨S, A, V, Q, γ⟩
D) ⟨S, A, R, G, γ⟩
Answer: B
Explanation: An MDP is defined by the state set S, action set A, transition probability P, reward function R, and discount factor γ.
**Question 4.** The goal of reinforcement learning is to maximize the expected value of which quantity?
A) Immediate reward rₜ
B) Cumulative discounted return Gₜ = Σₖ₌₀^∞ γᵏ rₜ₊ₖ₊₁
C) State-value function V(s) only
D) Policy entropy
Answer: B
Explanation: RL seeks a policy that maximizes the expected sum of discounted future rewards, known as the return Gₜ.
**Question 5.** The Bellman expectation equation for a given policy π expresses Vπ(s) in terms of:
A) The maximum over actions of Qπ(s,a)
B) The immediate reward plus discounted Vπ of successor states
C) The policy gradient ∇π
D) The TD error δₜ
Answer: B
Explanation: Vπ(s) = Σₐ π(a|s) Σ_{s'} P(s'|s,a)[R(s,a,s') + γ Vπ(s')].
**Question 6.** Which statement best describes the difference between model-based and model-free RL?
A) Model-based learns Q-values directly, model-free learns a transition model.
B) Model-based requires P(s'|s,a) explicitly, model-free does not.
C) Model-free always converges faster than model-based.
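The Bellman expectation equation from Question 5 can be applied repeatedly to evaluate a fixed policy. The sketch below is illustrative only: the two-state MDP, its transition table `P`, the uniform policy `pi`, and the helper name `policy_evaluation` are assumptions made for this example and are not part of the exam.

```python
# Hypothetical two-state, two-action MDP used only for illustration.
# P[s][a] is a list of (probability, next_state, reward) triples.
P = {
    0: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 1.0)]},
    1: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 2.0)]},
}
# Assumed policy: each action is chosen with probability 0.5 in every state.
pi = {s: {0: 0.5, 1: 0.5} for s in P}


def policy_evaluation(P, pi, gamma=0.9, tol=1e-10):
    """Sweep V(s) = sum_a pi(a|s) sum_s' P(s'|s,a) [R + gamma V(s')] until stable."""
    V = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            v_new = sum(
                pi[s][a] * sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
                for a in P[s]
            )
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new  # in-place update; the sweep still converges for gamma < 1
        if delta < tol:
            return V


V = policy_evaluation(P, pi)
print(V)
```

For this toy MDP the fixed point can be checked by hand: solving the two linear equations gives V(0) = 7.25 and V(1) = 7.75, which the loop approaches to within the tolerance.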

C) γ[V(sₜ₊₁) – V(sₜ)]
D) V(sₜ) – rₜ
Answer: A
Explanation: The TD error measures the difference between the one-step TD target (rₜ + γ V(sₜ₊₁)) and the current estimate V(sₜ).
**Question 10.** Why is Q-learning considered an off-policy algorithm?
A) It updates the target using the action actually taken by the behavior policy.
B) It updates the target using the greedy action regardless of the behavior policy.
C) It requires a model of the environment.
D) It only works with deterministic policies.
Answer: B
Explanation: Q-learning’s update uses maxₐ Q(sₜ₊₁,a), which corresponds to the greedy policy, not necessarily the behavior policy that generated the transition.
**Question 11.** In SARSA, the update rule uses which of the following action-value estimates?
A) maxₐ Q(sₜ₊₁,a)
B) Q(sₜ₊₁, aₜ₊₁) where aₜ₊₁ is the next action actually taken
C) The expected value under the current policy π
D) The advantage function A(sₜ₊₁,aₜ₊₁)
Answer: B
Explanation: SARSA (State-Action-Reward-State-Action) updates using the Q-value of the next state-action pair actually experienced.
**Question 12.** Which technique in DQN helps to break the correlation between sequential experiences?

A) Target network
B) Experience replay buffer
C) Double Q-learning
D) ε-greedy exploration
Answer: B
Explanation: Experience replay stores transitions and samples them uniformly at random, reducing temporal correlation during training.
**Question 13.** The purpose of the target network in DQN is to:
A) Provide a separate policy for exploration.
B) Stabilize the TD target by keeping it fixed for several updates.
C) Generate intrinsic rewards.
D) Reduce the dimensionality of the state space.
Answer: B
Explanation: The target network’s parameters are updated slowly, preventing rapid oscillations in the TD target.
**Question 14.** The REINFORCE algorithm updates the policy parameters θ in the direction of:
A) ∇θ Vπ(s)
B) ∇θ log πθ(a|s) * Gₜ
C) ∇θ Qπ(s,a)
D) ∇θ H(π) (entropy) only
Answer: B
Explanation: REINFORCE uses the policy gradient theorem: Δθ ∝ ∇θ log πθ(a|s) * Gₜ.
**Question 15.** High variance in REINFORCE updates is typically reduced by:

**Question 18.** Which of the following is a key advantage of deterministic policy gradient (DPG) over stochastic policy gradient in continuous action spaces?
A) It eliminates the need for a value function.
B) It can directly compute the gradient of the expected return with respect to actions.
C) It always yields higher sample efficiency.
D) It guarantees exploration without extra mechanisms.
Answer: B
Explanation: DPG follows ∇θ J ≈ E[∇ₐ Q(s,a) · ∇θ μθ(s)], with ∇ₐ Q evaluated at a = μθ(s). The critic's gradient with respect to the action is chained through the deterministic actor μθ, so no integral over actions is needed, which is more natural for continuous actions.
**Question 19.** Soft Actor-Critic (SAC) differs from DDPG mainly because SAC:
A) Uses a deterministic actor.
B) Maximizes a trade-off between expected return and policy entropy.
C) Does not require a replay buffer.
D) Operates only on discrete action spaces.
Answer: B
Explanation: SAC adds an entropy term to the objective, encouraging exploration and robustness.
**Question 20.** In ε-greedy exploration, the probability of selecting a random action is:
A) ε²
B) 1-ε
C) ε
D) ε/(|A|)
Answer: C

Explanation: ε-greedy chooses a random action with probability ε and the greedy action with probability 1-ε.
**Question 21.** Boltzmann (softmax) action selection chooses actions with probability proportional to:
A) exp( Q(s,a)/τ ) where τ is the temperature.
B) Q(s,a)²
C) ε / |A|
D) The gradient of the policy.
Answer: A
Explanation: Softmax converts Q-values into a probability distribution using a temperature τ that controls exploration.
**Question 22.** Count-based exploration methods assign higher exploration bonuses to states that are:
A) Frequently visited.
B) Rarely visited.
C) Near the terminal state.
D) Associated with high immediate rewards.
Answer: B
Explanation: By counting visits, the algorithm rewards novelty, encouraging exploration of less-visited states.
**Question 23.** When designing a neural network for visual RL tasks (e.g., Atari), which architecture component is most appropriate for extracting spatial features?
A) Fully connected layers only
B) Recurrent neural networks (RNNs)
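The ε-greedy and Boltzmann rules from Questions 20 and 21 can be compared by writing out the action probabilities each one induces. This is a minimal sketch; the function names and the example Q-values are assumptions chosen for illustration.

```python
import math


def epsilon_greedy_probs(q_values, epsilon):
    """Action probabilities: explore uniformly w.p. epsilon, otherwise act greedily."""
    n = len(q_values)
    greedy = max(range(n), key=lambda a: q_values[a])  # ties go to the lowest index
    probs = [epsilon / n] * n        # the random branch spreads epsilon over all actions
    probs[greedy] += 1.0 - epsilon   # the greedy branch adds the remaining mass
    return probs


def boltzmann_probs(q_values, tau):
    """Softmax over Q/tau; subtracting the max keeps exp() numerically stable."""
    m = max(q_values)
    weights = [math.exp((q - m) / tau) for q in q_values]
    total = sum(weights)
    return [w / total for w in weights]


q = [1.0, 2.0, 0.5]  # hypothetical Q(s, a) values for one state
print(epsilon_greedy_probs(q, epsilon=0.1))
print(boltzmann_probs(q, tau=1.0))
```

Because the random branch can also land on the greedy action, the greedy action ends up with total probability 1 - ε + ε/|A|, while each other action gets ε/|A|. For the Boltzmann rule, a small τ concentrates probability on the highest Q-value and a large τ approaches a uniform distribution.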

A) Speed up convergence.
B) Reduce the size of the replay buffer.
C) Improve the agent’s ability to generalize to real-world variations.
D) Eliminate the need for a target network.
Answer: C
Explanation: Randomizing visual and physical properties forces the policy to become robust to differences between simulation and reality.
**Question 27.** Overfitting in reinforcement learning most commonly manifests as:
A) High training return but low test return on unseen environments.
B) Diverging loss during training.
C) Constant ε throughout training.
D) Increasing entropy over time.
Answer: A
Explanation: The agent memorizes trajectories from the training environment, leading to poor performance when the environment changes.
**Question 28.** A constrained MDP differs from a standard MDP by:
A) Adding a secondary cost function that must satisfy a bound.
B) Removing the discount factor.
C) Using deterministic policies only.
D) Having continuous state spaces only.
Answer: A
Explanation: Constrained MDPs incorporate additional constraints (e.g., safety cost) that the policy must respect while maximizing reward.
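The constrained MDP in Question 28 is commonly written as a constrained optimization problem. In the formulation below, the cost function c and the budget d are notation assumed here for illustration; the exam itself names only "a secondary cost function" and "a bound".

```latex
\max_{\pi} \; \mathbb{E}_{\pi}\!\left[\sum_{t=0}^{\infty} \gamma^{t}\, r(s_t, a_t)\right]
\quad \text{subject to} \quad
\mathbb{E}_{\pi}\!\left[\sum_{t=0}^{\infty} \gamma^{t}\, c(s_t, a_t)\right] \le d
```

Setting the constraint aside recovers the standard MDP objective of maximizing the expected discounted return.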

**Question 29.** In multi-agent reinforcement learning, the non-stationarity problem arises because:
A) The environment dynamics change as other agents update their policies.
B) Agents share the same replay buffer.
C) The discount factor varies across agents.
D) Each agent uses a different neural network architecture.
Answer: A
Explanation: From any single agent’s perspective, the joint policy of other agents is part of the environment, which evolves during training.
**Question 30.** Which RL library provides a high-level API for building agents that can run on both TensorFlow and PyTorch backends?
A) OpenAI Gym
B) TensorFlow Agents (TF-Agents)
C) Stable-Baselines
D) Ray RLlib
Answer: D
Explanation: Ray RLlib has supported both TensorFlow and PyTorch as backends for the same agent configuration. TF-Agents is built on TensorFlow only, and OpenAI Gym supplies environments rather than agents.
**Question 31.** The primary benefit of using a replay buffer in off-policy algorithms is:
A) Guaranteeing on-policy updates.
B) Enabling mini-batch stochastic gradient descent with decorrelated samples.
C) Reducing the need for a target network.
D) Automatically tuning the learning rate.
Answer: B
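The replay buffer behind Question 31 can be sketched with the Python standard library alone. The class name `ReplayBuffer`, its method names, and the dummy transitions below are assumptions for illustration, not the API of any particular RL library.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-capacity FIFO store of transitions with uniform random sampling."""

    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)  # oldest transitions are evicted first

    def add(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Drawing without replacement from the whole buffer mixes old and new
        # transitions, so a mini-batch is not a run of consecutive time steps.
        return random.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)


buf = ReplayBuffer(capacity=100)
for t in range(150):                     # dummy transitions: state is the step index
    buf.add(t, 0, 0.0, t + 1, False)
batch = buf.sample(32)
```

After 150 insertions into a buffer of capacity 100, only the most recent 100 transitions remain, and each sampled mini-batch draws 32 distinct transitions from them.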

D) π(a|s) – 1/|A|
Answer: A
Explanation: Advantage measures how much better an action is compared to the average value of the state.
**Question 35.** In asynchronous advantage actor-critic (A3C), parallel workers improve learning efficiency primarily because:
A) They share a single replay buffer.
B) They generate diverse experiences, reducing correlation and stabilizing updates.
C) They use deterministic policies only.
D) They eliminate the need for a target network.
Answer: B
Explanation: Independent workers explore different parts of the state space, providing varied gradients that speed up convergence.
**Question 36.** The entropy regularization term added to a policy’s objective encourages:
A) Faster convergence to a deterministic policy.
B) Higher exploration by preventing premature collapse to a single action.
C) Decreased variance in Q-value estimates.
D) Larger learning rates.
Answer: B
Explanation: Maximizing entropy keeps the policy stochastic, promoting exploration.
**Question 37.** Which of the following is a typical schedule for decaying ε in ε-greedy exploration?
A) Linear decay from 1.0 to 0.0 over a fixed number of steps.

B) Exponential increase.
C) Constant ε = 0.5 throughout training.
D) Randomly sampling ε each episode.
Answer: A
Explanation: Linear or exponential decay gradually reduces exploration as the agent becomes more confident.
**Question 38.** In a partially observable environment, which neural architecture is commonly used to retain information over time?
A) Convolutional Neural Network (CNN)
B) Feed-forward Multi-Layer Perceptron (MLP)
C) Recurrent Neural Network (RNN) or LSTM
D) Decision Tree
Answer: C
Explanation: RNNs/LSTMs maintain hidden states that capture temporal dependencies, useful for POMDPs.
**Question 39.** The term “bootstrapping” in TD learning refers to:
A) Initializing all Q-values to zero.
B) Updating estimates using other learned estimates (e.g., V(sₜ₊₁)).
C) Using a large batch size.
D) Resetting the replay buffer each episode.
Answer: B
Explanation: Bootstrapping means the target includes a value estimate rather than a full return.
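The contrast in Question 39 between a bootstrapped target and a full return can be made concrete with two small functions. This is a sketch under assumed names (`td0_update`, `monte_carlo_return`) and made-up values; it is not taken from the exam.

```python
def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.9, terminal=False):
    """One TD(0) step: the target bootstraps from the current estimate V[s_next]."""
    bootstrap = 0.0 if terminal else V[s_next]
    target = r + gamma * bootstrap    # uses a learned estimate, not a full return
    V[s] += alpha * (target - V[s])   # move V[s] toward the target by the TD error
    return target


def monte_carlo_return(rewards, gamma=0.9):
    """Full discounted return of a finished episode; no bootstrapping involved."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g


V = {"A": 0.0, "B": 5.0}              # hypothetical value table
target = td0_update(V, "A", r=1.0, s_next="B")
full_return = monte_carlo_return([1.0, 0.0, 2.0])
```

Here the TD(0) target is 1.0 + 0.9 × 5.0 = 5.5 and depends on the current guess for V("B"), whereas the Monte Carlo return 1.0 + 0.9 × 0.0 + 0.81 × 2.0 = 2.62 depends only on observed rewards.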

Explanation: Policies trained in simulation often perform poorly on real robots due to modeling inaccuracies; bridging this gap is a major research focus.
**Question 43.** Which of the following is NOT a typical component of a reinforcement-learning experiment tracking system?
A) Hyperparameter logging
B) Real-time video capture of agent behavior
C) Automatic gradient clipping
D) Versioned code snapshots
Answer: C
Explanation: Gradient clipping is part of the training algorithm, not an experiment-tracking feature.
**Question 44.** In a stochastic policy for continuous actions, which distribution is most commonly used?
A) Uniform distribution
B) Categorical distribution
C) Gaussian (Normal) distribution
D) Bernoulli distribution
Answer: C
Explanation: Gaussian policies parameterize a mean and a variance, giving a smooth, differentiable distribution over actions.
**Question 45.** The term “policy entropy” is defined as:
A) The sum of squared policy probabilities.
B) – Σₐ π(a|s) log π(a|s)
C) The variance of the Q-values.

D) The discount factor γ.
Answer: B
Explanation: Entropy measures the randomness of the policy; higher entropy encourages exploration.
**Question 46.** In the context of RL, the term “off-policy evaluation” (OPE) refers to:
A) Estimating the performance of a target policy using data collected from a different behavior policy.
B) Training a policy without any interaction with the environment.
C) Updating the policy after each single step.
D) Using a deterministic policy to evaluate a stochastic one.
Answer: A
Explanation: OPE uses logged trajectories from a behavior policy to assess a new policy’s expected return.
**Question 47.** Which of the following loss functions is used to train the critic in DDPG?
A) Cross-entropy loss
B) Huber (smooth L1) loss between TD target and Q-estimate
C) KL-divergence between policy and target
D) Mean-squared error between action and state
Answer: B
Explanation: DDPG’s critic regresses Q(s,a) toward the TD target r + γ Q'(s', μ'(s')). The original DDPG formulation uses a mean-squared error for this regression; Huber loss is a common, more outlier-robust substitute in implementations, and B is the only option that compares the TD target with the Q-estimate.
**Question 48.** The “replay ratio” in DQN training denotes:
A) The number of environment steps per gradient update.
B) The number of gradient updates performed per environment step.

A) The number of CPU cores used.
B) The amount of interaction steps needed to achieve a certain performance level.
C) The size of the neural network.
D) The speed of the simulator.
Answer: B
Explanation: Sample efficiency measures how quickly an algorithm learns from limited experience.
**Question 52.** Which of the following is a common technique to reduce variance in multi-step TD methods?
A) Using a larger discount factor γ.
B) Employing eligibility traces with λ < 1.
C) Removing the bootstrap term.
D) Increasing the batch size to the entire replay buffer.
Answer: B
Explanation: Eligibility traces blend n-step returns, balancing bias and variance; λ controls the trace decay.
**Question 53.** In the context of intrinsic motivation, the “prediction error” bonus encourages the agent to:
A) Seek states where its learned model is most accurate.
B) Avoid states with high uncertainty.
C) Visit states where its model’s prediction error is large, i.e., novel states.
D) Maximize immediate extrinsic reward.
Answer: C
Explanation: Prediction-error based curiosity rewards the agent for exploring poorly predicted regions.
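The blending of n-step returns mentioned in Question 52 can be illustrated with the forward-view λ-return, the quantity that eligibility traces approximate incrementally. The sketch below is an assumption-laden illustration: the function name `lambda_returns`, the zero-based indexing (`rewards[t]` is the reward for step t), and the sample numbers are not from the exam.

```python
def lambda_returns(rewards, next_values, gamma=0.99, lam=0.9):
    """Backward recursion G_t = r_t + gamma * ((1 - lam) * V(s_{t+1}) + lam * G_{t+1}).

    rewards[t] is the reward for step t; next_values[t] is the value estimate of
    the state reached after step t (use 0.0 if that state is terminal).
    """
    returns = [0.0] * len(rewards)
    g_next = next_values[-1]  # past the last step, fall back to the bootstrap value
    for t in reversed(range(len(rewards))):
        g_next = rewards[t] + gamma * ((1.0 - lam) * next_values[t] + lam * g_next)
        returns[t] = g_next
    return returns


rewards = [1.0, 1.0, 1.0]          # hypothetical three-step episode
next_values = [0.5, 0.5, 0.0]      # last entry is 0.0 because the episode ends
td_like = lambda_returns(rewards, next_values, gamma=1.0, lam=0.0)
mc_like = lambda_returns(rewards, next_values, gamma=1.0, lam=1.0)
```

With λ = 0 each target reduces to the one-step TD target (lower variance, more bias from the value estimates), and with λ = 1 it becomes the full Monte Carlo return (no bootstrap bias, higher variance). Intermediate λ values interpolate between the two.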

**Question 54.** Which of the following statements about the “advantage actor-critic (A2C)” algorithm is true?
A) It uses a single synchronous worker.
B) It updates the policy using the advantage estimate A(s,a) = Q(s,a) – V(s).
C) It requires a model of the environment.
D) It cannot be combined with entropy regularization.
Answer: B
Explanation: A2C computes the advantage to reduce variance in the policy gradient update.
**Question 55.** In the context of RL, the term “off-policy” is synonymous with:
A) Learning from data generated by a different behavior policy than the one being improved.
B) Using a deterministic policy only.
C) Not using a replay buffer.
D) Updating the policy after every single step.
Answer: A
Explanation: Off-policy methods can learn about a target policy while following another (behavior) policy.
**Question 56.** Which algorithm combines Q-learning with policy gradients to achieve both value-based and policy-based learning?
A) DQN
B) REINFORCE
C) Actor-Critic with Experience Replay (ACER)
D) PPO
Answer: C
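The advantage A(s,a) = Q(s,a) – V(s) from Question 54 can be computed directly for a single state once V(s) is taken as the policy-weighted average of the Q-values. The helper name `advantages` and the numbers below are assumptions for illustration only.

```python
def advantages(q_values, policy_probs):
    """A(s,a) = Q(s,a) - V(s), with V(s) = sum_a pi(a|s) Q(s,a) for one state."""
    v = sum(p * q for p, q in zip(policy_probs, q_values))
    return [q - v for q in q_values]


q = [1.0, 3.0, 2.0]        # hypothetical Q(s, a) for three actions
pi = [0.25, 0.5, 0.25]     # hypothetical policy probabilities pi(a|s)
adv = advantages(q, pi)
print(adv)
```

In this example V(s) = 2.25, so the advantages are -1.25, 0.75, and -0.25. Their policy-weighted average is zero, which is why subtracting V(s) recenters the learning signal without changing the expected policy gradient.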