Reinforcement 101 Ultimate Exam, Exams of Technology

The Reinforcement 101 Ultimate Exam is an educational assessment resource designed to introduce and strengthen understanding of reinforcement principles in psychology, education, and behavioral science. Topics include positive reinforcement, negative reinforcement, conditioning theories, behavior modification techniques, learning models, motivation, and practical applications. This exam preparation tool helps learners improve conceptual understanding and mastery of reinforcement strategies in academic and professional settings.

Typology: Exams

2025/2026

Available from 05/27/2026

nicky-jone

Reinforcement 101 Ultimate Exam
**Question 1.** Which of the following best distinguishes
reinforcement learning (RL) from supervised learning?
A) RL requires labeled input-output pairs.
B) RL learns from a scalar reward signal rather than explicit
targets.
C) RL only works with static datasets.
D) RL does not involve any form of exploration.
**Answer:** B
**Explanation:** In RL the agent receives only a scalar reward
that evaluates its behavior; there are no explicit correct-answer
labels as in supervised learning.
**Question 2.** The “learning by doing” paradigm in RL primarily
refers to:
A) Pre-training a model on a large labeled dataset.
B) Updating policies after each observed transition.
C) Using gradient descent on a fixed loss function.
D) Performing offline batch updates only.
**Answer:** B
**Explanation:** “Learning by doing” means the agent
continuously improves its policy by interacting with the
environment and observing the consequences of its actions.
**Question 3.** Which historical figure is most closely associated
with the concept of reward-based learning that inspired modern
RL?
A) Alan Turing
B) B.F. Skinner
C) Claude Shannon
D) John von Neumann
**Answer:** B
**Explanation:** Skinner’s operant conditioning experiments introduced the idea of reinforcing behavior with rewards, a direct precursor to RL.
**Question 4.** In the agent-environment loop, which element comes immediately after the agent selects an action \(A_t\)?
A) State \(S_t\)
B) Reward \(R_{t+1}\)
C) Next state \(S_{t+1}\)
D) Policy \(\pi\)
**Answer:** B
**Explanation:** After the action \(A_t\) is taken, the environment returns the reward \(R_{t+1}\) together with the next state \(S_{t+1}\). By convention the reward is indexed \(t+1\) and listed first, giving the sequence \(S_t, A_t, R_{t+1}, S_{t+1}\).
**Question 5.** The Reward Hypothesis states that:
A) All tasks can be solved without a reward signal.
B) Maximizing cumulative reward is equivalent to achieving any goal.
C) Rewards must be dense for learning to succeed.
D) Rewards are only useful in episodic tasks.
**Answer:** B
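As an illustration of the agent-environment loop from Question 4, here is a minimal Python sketch. The `CoinFlipEnv` class, its `reset`/`step` methods, and the fixed policy are hypothetical names invented for this example and do not come from the exam or from any library.

```python
import random


class CoinFlipEnv:
    """Hypothetical toy environment: guess a coin flip for a fixed number of steps."""

    def __init__(self, horizon=5, seed=0):
        self.horizon = horizon
        self.rng = random.Random(seed)
        self.t = 0

    def reset(self):
        self.t = 0
        return self.t  # initial state S_0 (here the state is just the time step)

    def step(self, action):
        # The environment answers action A_t with reward R_{t+1} and state S_{t+1}.
        coin = self.rng.randint(0, 1)
        reward = 1.0 if action == coin else 0.0
        self.t += 1
        done = self.t >= self.horizon
        return self.t, reward, done


def run_episode(env, policy):
    """Run one episode and record (S_t, A_t, R_{t+1}, S_{t+1}) tuples."""
    state = env.reset()
    trajectory = []
    done = False
    while not done:
        action = policy(state)                       # agent selects A_t from S_t
        next_state, reward, done = env.step(action)  # environment returns S_{t+1}, R_{t+1}
        trajectory.append((state, action, reward, next_state))
        state = next_state
    return trajectory


trajectory = run_episode(CoinFlipEnv(), policy=lambda s: 0)
```

Each recorded tuple follows the ordering \(S_t, A_t, R_{t+1}, S_{t+1}\) used in the explanation of Question 4.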

A) Giving a reward of +1 at every time step.
B) Providing a reward only when the goal state is reached.
C) Assigning a small penalty for each action taken.
D) Using a shaping reward that mirrors the distance to goal.
**Answer:** B
**Explanation:** Sparse rewards appear infrequently (e.g., only upon reaching the goal), making it hard for the agent to associate actions with outcomes.
**Question 9.** The Markov Property requires that:
A) The next state depends only on the entire history of past states.
B) Future states are independent of the present state.
C) The probability of the next state depends solely on the current state and action.
D) Rewards are deterministic functions of the current state.
**Answer:** C
**Explanation:** By definition, a process is Markovian if the future is conditionally independent of the past given the present state (and action).
**Question 10.** In a fully observable MDP, the agent:
A) Receives a belief distribution over hidden states.
B) Has direct access to the true environment state at each step.
C) Must infer the state from a sequence of observations.
D) Cannot use value functions.

**Answer:** B
**Explanation:** Full observability means the agent observes the exact state \(S_t\) rather than a partial or noisy observation.
**Question 11.** Which component of an MDP defines the probability of moving from state \(s\) to state \(s'\) after taking action \(a\)?
A) Reward function \(R\)
B) Transition function \(P\)
C) Policy \(\pi\)
D) Discount factor \(\gamma\)
**Answer:** B
**Explanation:** The transition probability \(P(s'|s,a)\) captures the dynamics of the environment.
**Question 12.** A deterministic policy \(\pi(s)\) maps a state to:
A) A probability distribution over actions.
B) A single specific action.
C) A value estimate.
D) A next state.
**Answer:** B
**Explanation:** Deterministic policies select exactly one action for each state.
**Question 13.** The state-value function \(V^{\pi}(s)\) represents:

D) \(V^{\pi}(s)=\sum_{s'}P(s'|s)[R(s)+\gamma V^{\pi}(s')]\)
**Answer:** B
**Explanation:** The expectation is taken over both the policy’s action distribution and the environment’s transition probabilities.
**Question 16.** In the Bellman Optimality Equation, the term \(\max_a Q^{*}(s,a)\) indicates:
A) The worst possible action.
B) The value of the action that yields the highest expected return from state \(s\).
C) The average reward over all actions.
D) The probability of taking action \(a\).
**Answer:** B
**Explanation:** The optimal value of a state is obtained by selecting the action that maximizes the optimal action-value function.
**Question 17.** A greedy policy with respect to a given \(Q\) function selects actions that:
A) Randomly explore with probability \(\epsilon\).
B) Minimize the TD error.
C) Have the highest estimated \(Q\) value in the current state.
D) Are chosen according to a softmax distribution.
**Answer:** C
**Explanation:** Greedy policies always pick the action with maximal estimated value, without exploration.
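To make Questions 16 and 17 concrete, the sketch below shows greedy action selection and the \(\max_a Q(s,a)\) term on a small tabular example. The table `Q`, the state and action names, and both helper functions are made up for illustration.

```python
def greedy_action(q_values, state):
    """Return the action with the highest estimated Q value in `state` (no exploration)."""
    actions = q_values[state]
    return max(actions, key=actions.get)


def state_value_from_q(q_values, state):
    """Return max_a Q(state, a), the state value implied by acting greedily."""
    return max(q_values[state].values())


# Hypothetical Q table: Q[state][action] -> estimated return.
Q = {
    "s0": {"left": 0.2, "right": 0.7},
    "s1": {"left": 1.0, "right": 0.4},
}

best_in_s0 = greedy_action(Q, "s0")       # "right"
value_of_s1 = state_value_from_q(Q, "s1")  # 1.0
```

The first helper returns an action, while the second returns a value; the \(\max_a\) term in the Bellman Optimality Equation corresponds to the second.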

**Question 18.** The exploration-exploitation dilemma arises because:
A) Agents must balance computational cost with memory usage.
B) High exploitation can prevent discovering better actions, while excessive exploration reduces cumulative reward.
C) Exploration always leads to higher regret.
D) Exploitation requires a model of the environment.
**Answer:** B
**Explanation:** Exploiting known good actions yields immediate reward, but may miss superior actions; exploring discovers them but may incur short-term loss.
**Question 19.** In the \(k\)-armed bandit setting, the regret after \(n\) steps is defined as:
A) The total number of times the optimal arm was selected.
B) The sum of differences between the reward of the optimal arm and the reward actually obtained.
C) The variance of the reward distribution.
D) The probability of selecting a sub-optimal arm.
**Answer:** B
**Explanation:** Regret quantifies the performance loss relative to always pulling the optimal arm.
**Question 20.** The \(\epsilon\)-greedy strategy selects a random action with probability \(\epsilon\). What is a typical effect of decreasing \(\epsilon\) over time?

D) The learning rate for arm \(i\).
**Answer:** B
**Explanation:** The confidence term grows when an arm has been selected few times, encouraging exploration of uncertain actions.
**Question 23.** Thompson Sampling selects actions based on:
A) Deterministic maximization of posterior means.
B) Sampling from the posterior distribution over each arm’s reward and picking the highest sample.
C) A fixed schedule of exploration probability.
D) Gradient ascent on the reward function.
**Answer:** B
**Explanation:** Thompson Sampling draws a sample from each arm’s posterior and selects the arm with the highest sampled value, balancing exploration and exploitation probabilistically.
**Question 24.** Dynamic Programming (DP) methods require which of the following?
A) No knowledge of the transition probabilities.
B) A complete and accurate model of \(P\) and \(R\).
C) Only a single sample trajectory.
D) Continuous action spaces only.
**Answer:** B
**Explanation:** DP algorithms (e.g., policy iteration) assume full knowledge of the environment’s dynamics.
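The Thompson Sampling procedure described in Question 23 can be sketched for a Bernoulli bandit as follows. This is one common instantiation, assuming 0/1 rewards and a uniform Beta(1, 1) prior per arm; the arm probabilities, step count, and function name are illustrative choices, not part of the exam.

```python
import random


def thompson_sampling(true_probs, steps=2000, seed=0):
    """Bernoulli-bandit Thompson Sampling with Beta posteriors; returns pull counts per arm."""
    rng = random.Random(seed)
    k = len(true_probs)
    alpha = [1] * k  # 1 + observed successes (Beta(1, 1) prior, an assumption)
    beta = [1] * k   # 1 + observed failures
    pulls = [0] * k
    for _ in range(steps):
        # Draw one sample from each arm's posterior over its success probability.
        samples = [rng.betavariate(alpha[i], beta[i]) for i in range(k)]
        # Play the arm whose sampled value is highest.
        arm = max(range(k), key=lambda i: samples[i])
        reward = 1 if rng.random() < true_probs[arm] else 0
        # Conjugate posterior update for the arm that was played.
        alpha[arm] += reward
        beta[arm] += 1 - reward
        pulls[arm] += 1
    return pulls


pulls = thompson_sampling([0.2, 0.5, 0.8])
```

Arms with few pulls have wide posteriors and therefore occasionally produce the highest sample, which is how exploration arises without an explicit exploration schedule.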

**Question 25.** In Policy Evaluation (a DP step), the value update equation is: \(V_{k+1}(s)=\sum_a \pi(a|s)\sum_{s'}P(s'|s,a)[R(s,a,s')+\gamma V_k(s')]\). What does iterating this equation to convergence compute?
A) The optimal policy.
B) The optimal Q-function.
C) The exact state-value function for the current policy \(\pi\).
D) The next best action.
**Answer:** C
**Explanation:** Repeatedly applying the Bellman expectation equation yields the true \(V^{\pi}\) for the fixed policy.
**Question 26.** Policy Improvement modifies a policy by:
A) Randomly swapping actions in each state.
B) Making the policy greedy w.r.t. the current value function.
C) Decreasing the discount factor.
D) Adding noise to the reward signal.
**Answer:** B
**Explanation:** The improvement step selects the action that maximizes expected return given the current value estimates.
**Question 27.** Which of the following statements correctly distinguishes Policy Iteration from Value Iteration?
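The policy evaluation update from Question 25 and the greedy improvement step from Question 26 can be sketched on a tiny MDP. The two-state model `P`, its rewards, the discount factor of 0.9, and the function names are all assumptions made for this example.

```python
GAMMA = 0.9

# Hypothetical model: P[s][a] is a list of (probability, next_state, reward).
P = {
    0: {"stay": [(1.0, 0, 0.0)], "go": [(1.0, 1, 1.0)]},
    1: {"stay": [(1.0, 1, 2.0)], "go": [(1.0, 0, 0.0)]},
}


def backup(V, s, a, gamma=GAMMA):
    """Expected one-step return of action a in state s under value estimates V."""
    return sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])


def policy_evaluation(policy, gamma=GAMMA, tol=1e-10):
    """Iterate V_{k+1}(s) = sum_a pi(a|s) * backup(V_k, s, a) until the change is below tol."""
    V = {s: 0.0 for s in P}
    while True:
        new_V = {s: sum(policy[s][a] * backup(V, s, a, gamma) for a in P[s]) for s in P}
        delta = max(abs(new_V[s] - V[s]) for s in P)
        V = new_V
        if delta < tol:
            return V


def policy_improvement(V, gamma=GAMMA):
    """Return the deterministic policy that is greedy with respect to V."""
    return {s: max(P[s], key=lambda a: backup(V, s, a, gamma)) for s in P}


uniform = {s: {"stay": 0.5, "go": 0.5} for s in P}
V = policy_evaluation(uniform)   # approximately {0: 7.25, 1: 7.75}
greedy = policy_improvement(V)   # {0: "go", 1: "stay"}
```

Solving the two linear Bellman equations for the uniform policy by hand gives \(V^{\pi}(0)=7.25\) and \(V^{\pi}(1)=7.75\), which the iteration approaches; both functions rely on the known model `P`, as DP methods require.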

C) The deterministic nature of the policy.
D) The discount factor being set to 0.9.
**Answer:** B
**Explanation:** Since MC averages entire episode returns, the variability of those returns can be large.
**Question 30.** In First-Visit MC, the value of a state is updated using:
A) The average of all returns observed after each occurrence of the state.
B) Only the return from the first time the state appears in each episode.
C) The maximum return ever observed from that state.
D) The TD-error computed at each step.
**Answer:** B
**Explanation:** First-Visit MC uses only the return following the first occurrence of a state in each episode, so each episode contributes one independent sample and the averaged estimate is unbiased.
**Question 31.** Temporal-Difference (TD) learning combines ideas from MC and DP because it:
A) Requires a full model of the environment.
B) Updates after partial episodes using bootstrapped estimates.
C) Only works for deterministic policies.
D) Never uses a reward signal.

**Answer:** B
**Explanation:** TD methods update value estimates after each step, using the current estimate of the next state (bootstrapping), thus blending MC sampling with DP’s recursive updates.
**Question 32.** The TD error \(\delta_t\) in TD(0) is defined as:
A) \(\delta_t = R_{t+1} + \gamma V(S_{t+1}) - V(S_t)\).
B) \(\delta_t = V(S_t) - R_{t+1}\).
C) \(\delta_t = \max_a Q(S_t,a) - Q(S_t,a_t)\).
D) \(\delta_t = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}\).
**Answer:** A
**Explanation:** The TD error measures the difference between the observed reward plus discounted next-state value and the current value estimate.
**Question 33.** SARSA is classified as an on-policy algorithm because:
A) It learns the optimal policy while following a random behavior policy.
B) The update uses the action actually taken by the current policy (including exploration).
C) It never updates the Q-function.
D) It requires a model of the environment.
**Answer:** B
**Explanation:** SARSA’s update uses the next action \(A_{t+1}\) drawn from the same policy that generated the data, thus learning the value of the behavior policy.
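The TD(0) error from Question 32 and the SARSA update from Question 33 can be written out in a few lines. The step size `alpha`, the discount of 0.9, the dictionary-based Q table, and the sample numbers are assumptions chosen for illustration.

```python
def td_error(reward, v_next, v_current, gamma=0.9):
    """TD(0) error: delta_t = R_{t+1} + gamma * V(S_{t+1}) - V(S_t)."""
    return reward + gamma * v_next - v_current


def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.5, gamma=0.9):
    """One SARSA step; the target uses the action a_next actually chosen in s_next."""
    target = r + gamma * Q[(s_next, a_next)]
    Q[(s, a)] += alpha * (target - Q[(s, a)])
    return Q[(s, a)]


# Hypothetical numbers: R_{t+1} = 1.0, V(S_{t+1}) = 2.0, V(S_t) = 0.5.
delta = td_error(1.0, 2.0, 0.5)  # 1.0 + 0.9 * 2.0 - 0.5 = 2.3

# Hypothetical Q table keyed by (state, action).
Q = {(0, "a"): 0.0, (1, "b"): 1.0}
new_q = sarsa_update(Q, s=0, a="a", r=1.0, s_next=1, a_next="b")  # 0.5 * 1.9 = 0.95
```

Because `a_next` is supplied by the same policy that is being followed, including any exploratory choice, the update estimates the value of that behavior policy rather than of the greedy one.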

A) The immediate reward.
B) A vector of handcrafted features for state \(s\).
C) The transition probability matrix.
D) The optimal policy parameters.
**Answer:** B
**Explanation:** \(\mathbf{\phi}(s)\) is the feature representation of the state used in a linear combination with weights \(\mathbf{w}\).
**Question 37.** One advantage of using deep neural networks as function approximators in RL is:
A) They guarantee zero bias in value estimates.
B) They can automatically learn hierarchical features from raw high-dimensional inputs.
C) They eliminate the need for exploration.
D) They only work for tabular problems.
**Answer:** B
**Explanation:** Deep networks can extract complex representations directly from raw data (e.g., images), enabling RL in large or continuous spaces.
**Question 38.** Policy-gradient methods directly optimize which of the following?
A) The state-value function \(V(s)\).
B) The action-value function \(Q(s,a)\).

C) The parameters \(\theta\) of a stochastic policy \(\pi_\theta(a|s)\).
D) The transition matrix \(P\).
**Answer:** C
**Explanation:** Policy-gradient algorithms adjust the policy parameters directly to maximize expected return, rather than deriving the policy from an estimated value function.
**Question 39.** The REINFORCE algorithm updates policy parameters using which quantity?
A) The TD error \(\delta_t\).
B) The Monte-Carlo return \(G_t\) multiplied by \(\nabla_\theta \log \pi_\theta(A_t|S_t)\).
C) The Bellman optimality equation.
D) The gradient of the value function.
**Answer:** B
**Explanation:** REINFORCE is a Monte-Carlo policy-gradient method that uses sampled returns to weight the log-gradient of the policy.
**Question 40.** In the classic OpenAI Gym API, the step function returns a tuple (observation, reward, done, info). (Gym 0.26+ and its successor Gymnasium instead return (observation, reward, terminated, truncated, info).) In the classic tuple, which field indicates that the episode has terminated?
A) observation
B) reward
C) done
D) info

**Question 43.** In continuous-action environments, which RL method is most naturally applicable?
A) Tabular Q-learning.
B) SARSA with a discrete action set.
C) Policy-gradient or actor-critic methods that output parameters of a continuous distribution.
D) Multi-armed bandit algorithms.
**Answer:** C
**Explanation:** Policy-gradient approaches can output mean and variance for continuous distributions, handling infinite action spaces.
**Question 44.** A robotic manipulation task suffers from reward hacking when:
A) The robot learns to move efficiently.
B) The robot discovers a shortcut that maximizes the engineered reward but violates safety constraints.
C) The robot fails to converge to any policy.
D) The robot uses too much computational power.
**Answer:** B
**Explanation:** Reward hacking occurs when the agent exploits loopholes in the reward design to achieve high reward in unintended ways.
**Question 45.** Safe exploration aims to:

A) Maximize cumulative reward regardless of risk.
B) Prevent the agent from taking actions that could cause catastrophic outcomes during learning.
C) Eliminate the need for a discount factor.
D) Ensure the agent always follows a deterministic policy.
**Answer:** B
**Explanation:** Safe exploration techniques constrain actions to avoid dangerous states while still allowing learning.
**Question 46.** Alignment in RL refers to:
A) Matching the agent’s learned policy to the optimal policy of the MDP.
B) Ensuring the agent’s objectives (reward function) correspond to human intentions.
C) Aligning the state and action spaces.
D) Synchronizing multiple agents.
**Answer:** B
**Explanation:** Alignment addresses the problem of the agent pursuing the programmed reward in ways that may diverge from what designers actually want.
**Question 47.** In a partially observable MDP (POMDP), the agent typically maintains a belief state, which is:
A) The true underlying state of the environment.
B) A probability distribution over possible hidden states given the observation history.
C) A deterministic mapping from observations to actions.