












































The Transformers Test 1 PracticeUltimate Exam prepares learners for electrical transformer theory and power distribution examinations. Topics include transformer operation, voltage regulation, magnetic principles, testing procedures, electrical safety, maintenance practices, and troubleshooting methods. This resource is ideal for electricians, technicians, and engineering students.
Typology: Exams













































Question 1. Which of the following best describes why vanilla RNNs struggle with long-range dependencies?
A) They use fixed-size hidden states that cannot store unlimited information.
B) Their recurrent connections cause gradients to explode or vanish over many timesteps.
C) They process all tokens in parallel, losing temporal order.
D) They rely on attention weights that become uniform for long sequences.
Answer: B
Explanation: In deep or long RNNs, repeated multiplication of the recurrent weight matrix causes gradients to either shrink to zero (vanish) or grow exponentially (explode), making learning of distant relationships difficult.

Question 2. The primary advantage of the Transformer’s parallel processing over recurrent models is:
A) Reducing the number of parameters.
B) Converting O(n) sequential steps into O(1) steps per layer.
C) Eliminating the need for positional encodings.
D) Allowing the model to be trained without back-propagation.
Answer: B
Explanation: Transformers compute attention for all positions simultaneously, removing the sequential dependency that forces RNNs to process tokens one after another (O(n) time).

Question 3. In the original Transformer, positional encodings are generated using sine and cosine functions of different frequencies. What property does this design provide?
A) It makes the positional vectors orthogonal for all positions.
B) It enables the model to extrapolate to sequence lengths longer than seen during training.
C) It guarantees that each position has a unique binary code.
D) It forces the model to ignore absolute positions and focus on relative distances.
Answer: B
Explanation: The sinusoids can be evaluated at any position, and for a fixed offset k the encoding at position pos + k is a linear function of the encoding at pos. The original authors hypothesized that this may let the model attend by relative position and extrapolate to lengths longer than those seen during training.

Question 4. Which component in the Transformer architecture directly addresses the problem of training deep networks (i.e., vanishing/exploding gradients) and stabilizes learning?
A) Multi-Head Attention.
B) Feed-Forward Network.
C) Residual Connections combined with Layer Normalization.
D) Positional Encoding.
Answer: C
Explanation: The “Add & Norm” step adds the input to the sub-layer output (residual) and then normalizes, preserving gradient flow and preventing degradation as depth increases.

Question 5. In Scaled Dot-Product Attention, the QKᵀ matrix is divided by the scaling factor √d_k before the softmax. Why is this scaling necessary?
A) To ensure the attention weights sum to one.
B) To keep the dot-product values in a range that prevents softmax saturation.
C) To convert the dot-product into a probability distribution.
D) To increase the model’s capacity to learn long-range dependencies.
Answer: B
Explanation: As d_k grows, the magnitude of the entries of QKᵀ increases, causing the softmax to become extremely peaked (saturation). Dividing by √d_k keeps the values at a moderate scale, yielding more stable gradients.

Question 6. Multi-Head Attention improves the model’s expressiveness because:
A) Each head processes a different segment of the input sequence.
B) Heads operate on distinct learned linear projections of Q, K, V, allowing the model to capture varied relational patterns.
C) It reduces the total number of parameters compared to a single attention head.
D) It forces each head to attend to a unique token position.
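
The softmax saturation described in Question 5 can be illustrated with a small sketch. It is not taken from the exam itself: the function names `softmax` and `attention_weights`, the random Gaussian query and keys, and the choice of 8 keys are all assumptions made for illustration. It compares the largest attention weight with and without division by √d_k.

```python
import math
import random

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_weights(query, keys, scale=True):
    """Softmax over the dot products of one query with each key.

    When scale is True, the scores are divided by sqrt(d_k) first.
    """
    d_k = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    if scale:
        scores = [s / math.sqrt(d_k) for s in scores]
    return softmax(scores)

# Hypothetical toy data: unit-variance random vectors with d_k = 64.
random.seed(0)
d_k = 64
query = [random.gauss(0, 1) for _ in range(d_k)]
keys = [[random.gauss(0, 1) for _ in range(d_k)] for _ in range(8)]

unscaled = attention_weights(query, keys, scale=False)
scaled = attention_weights(query, keys, scale=True)

# The unscaled distribution is at least as peaked as the scaled one.
print("largest weight, unscaled:", round(max(unscaled), 3))
print("largest weight, scaled:  ", round(max(scaled), 3))
```

Dividing the scores by a positive constant acts like raising the softmax temperature, so the largest weight can only stay the same or shrink; with d_k = 64 the divisor is 8.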
B) A single convolutional layer that mixes neighboring positions.
C) A recurrent sub-layer that processes tokens sequentially.
D) An attention mechanism that revisits the original embeddings.
Answer: A
Explanation: The FFN projects each token’s representation from d_model to d_ff, applies a non-linearity, then projects back to d_model, operating identically across all positions.

Question 10. If a Transformer model has d_model = 768, d_ff = 3072, and h = 12 heads, what is the dimensionality of each head’s query/key/value vectors?
A) 64
B) 128
C) 256
D) 768
Answer: A
Explanation: The total model dimension is split equally among heads: d_k = d_model / h = 768 / 12 = 64.

Question 11. Label smoothing in the Transformer’s loss function primarily helps to:
A) Increase the model’s capacity by adding more parameters.
B) Prevent the model from becoming over-confident on the training data, improving generalization.
C) Accelerate convergence by scaling the gradients.
D) Reduce the vocabulary size during tokenization.
Answer: B
Explanation: By assigning a small probability mass to all non-target tokens, label smoothing discourages extreme confidence, which mitigates overfitting.

Question 12. The Adam optimizer used with the Transformer’s learning-rate scheduler includes a warm-up phase. What is the purpose of warm-up?
A) To gradually increase the batch size.
B) To linearly increase the learning rate for the first N steps, avoiding large updates when parameters are still random.
C) To pre-train the embedding layer separately.
D) To freeze the attention heads during early training.
Answer: B
Explanation: Warm-up prevents instability at the start of training by starting with a small learning rate and then scaling it up before the standard decay schedule begins.

Question 13. Padding masks in the Transformer are essential because:
A) They prevent attention from considering padded (non-existent) tokens, which would otherwise affect the softmax distribution.
B) They replace the need for positional encodings.
C) They are used to compute the loss only on padded positions.
D) They allow the model to attend to future tokens during training.
Answer: A
Explanation: Padding tokens are artificial and carry no information; masking ensures they receive zero attention weight, preserving the integrity of the representation.

Question 14. A look-ahead (causal) mask in the decoder ensures that:
A) The decoder can attend to encoder positions that are ahead in the source sequence.
B) Each position can only attend to earlier positions in the target sequence, preventing “cheating”.
C) The model learns bidirectional context during training.
D) The attention scores are normalized across the entire batch.
Answer: B
Explanation: Causal masking blocks attention to future target tokens, enforcing the autoregressive property required for generation.

Question 15. In BERT (an encoder-only model), the pre-training objective “Next Sentence Prediction (NSP)” is designed to:
A) Teach the model to generate the next token in a sequence.
B) Computing attention using a tiled, cache-friendly algorithm that reduces memory bandwidth.
C) Removing the need for Q, K, V projections.
D) Merging the encoder and decoder into a single pass.
Answer: B
Explanation: Flash Attention tiles the attention computation so that blocks of Q, K, and V are processed in fast on-chip GPU memory (SRAM), avoiding materializing the full n × n attention matrix in slower high-bandwidth memory. This reduces memory traffic and yields faster, more memory-efficient execution.

Question 19. Sparse Attention mechanisms differ from dense attention by:
A) Attending to every token pair equally.
B) Limiting each query to a subset of keys (e.g., local window or learned pattern), reducing O(n²) complexity.
C) Using convolutional kernels instead of dot-products.
D) Removing the scaling factor √d_k.
Answer: B
Explanation: Sparse attention restricts the attention matrix to a sparse pattern (local windows, strided, or learned), lowering computational and memory cost.

Question 20. Byte-Pair Encoding (BPE) tokenization primarily operates by:
A) Splitting text into fixed-length character n-grams.
B) Iteratively merging the most frequent pair of adjacent symbols to create a vocabulary of subword units.
C) Assigning each word a unique integer ID based on frequency.
D) Using a neural network to predict token boundaries.
Answer: B
Explanation: BPE starts with characters and repeatedly merges the highest-frequency adjacent pair, building a compact subword vocabulary.

Question 21. When generating text with a Transformer decoder, greedy search differs from beam search in that greedy search:
A) Explores multiple hypotheses simultaneously.
B) Always selects the token with the highest probability at each step, potentially missing better overall sequences.
C) Uses a temperature parameter to diversify output.
D) Guarantees the globally optimal sequence under the model.
Answer: B
Explanation: Greedy search makes a locally optimal choice at each timestep, which can lead to sub-optimal overall sentences compared to beam search’s broader exploration.

Question 22. Nucleus (top-p) sampling selects the next token from:
A) The top-k most probable tokens.
B) The smallest set of tokens whose cumulative probability exceeds p, then samples proportionally.
C) All tokens with probability above a fixed threshold.
D) A uniformly random token from the vocabulary.
Answer: B
Explanation: Top-p sampling dynamically determines a cutoff such that the summed probability reaches p (e.g., 0.9), then draws a token from that truncated distribution.

Question 23. BLEU score primarily measures:
A) The semantic similarity between generated and reference texts.
B) The n-gram overlap precision, penalized for brevity.
C) The recall of generated tokens.
D) The model’s perplexity on a test set.
Answer: B
Explanation: BLEU computes modified n-gram precision (typically up to 4-grams) and includes a brevity penalty, serving as a proxy for translation quality.

Question 24. Perplexity of a language model is defined as:
A) The exponential of the cross-entropy loss; lower perplexity indicates better predictive performance.
B) The ability of the model to reorder its internal parameters during training.
C) The fact that softmax is invariant to the order of its inputs.
D) The requirement that Q, K, V matrices be symmetric.
Answer: A
Explanation: Pure attention treats inputs as a set; without explicit position information, the order would not affect the computation, necessitating positional encodings.

Question 28. In the Transformer’s scaled dot-product attention, if d_k = 64, what scalar is used to scale QKᵀ?
A) 8
B) 64
C) √64 = 8
D) 1/
Answer: C
Explanation: The scaling factor is 1/√d_k; equivalently, you divide QKᵀ by √64 = 8.

Question 29. Which of the following best explains why Layer Normalization is applied after the residual addition (“Add & Norm”) rather than before?
A) It normalizes the summed signal, keeping the residual magnitude under control and stabilizing gradients.
B) It reduces the number of parameters required for the attention heads.
C) It ensures that positional encodings are not altered.
D) It allows the model to skip the normalization step during inference.
Answer: A
Explanation: Normalizing after addition guarantees that the combined representation (input + sub-layer output) has stable statistics, facilitating training of deep stacks.

Question 30. In a Transformer decoder, which mask(s) are applied simultaneously during training?
A) Only the padding mask.
B) Only the look-ahead (causal) mask.
C) Both padding and look-ahead masks, combined via logical OR.
D) No mask is needed because the decoder attends to the whole sequence.
Answer: C
Explanation: The decoder must ignore padded positions (padding mask) and future tokens (causal mask); both are applied to the attention logits.

Question 31. The “feed-forward dimension” d_ff is typically set to:
A) The same value as d_model.
B) 2 × d_model.
C) 4 × d_model.
D) 0.5 × d_model.
Answer: C
Explanation: In the original Transformer, d_ff = 4 × d_model (e.g., 2048 when d_model = 512), providing a larger hidden capacity in the FFN.

Question 32. Which of the following statements about linear Transformers is true?
A) They replace the softmax attention with a kernel that enables O(n) time and memory complexity.
B) They use a convolutional layer to mimic attention.
C) They completely discard the concept of queries, keys, and values.
D) They increase the number of attention heads to achieve linear scaling.
Answer: A
Explanation: Linear Transformers approximate the softmax kernel with a feature map that allows the attention computation to be rewritten as a series of matrix multiplications, achieving linear complexity.

Question 33. In the context of Transformer training, “weight tying” usually refers to:
A) Sharing the same weight matrix between the token embedding layer and the final linear projection layer that produces logits.
A) Byte Pair Encoding (BPE).
B) WordPiece.
C) SentencePiece’s unigram model.
D) Character-level tokenization.
Answer: C
Explanation: SentencePiece’s unigram language model learns a probabilistic subword vocabulary directly from raw text and can operate without external dictionaries, making it more language-agnostic.

Question 37. When evaluating a summarization system with ROUGE-L, the metric focuses on:
A) Exact n-gram overlap up to 4-grams.
B) Longest common subsequence (LCS) between generated and reference summaries.
C) Semantic similarity using embeddings.
D) Per-token perplexity.
Answer: B
Explanation: ROUGE-L computes the length of the longest common subsequence, capturing sentence-level order without requiring exact n-gram matches.

Question 38. In the original Transformer paper, the dropout rate applied to attention weights and FFN outputs was set to:
A) 0.
B) 0.
C) 0.
D) 0.
Answer: B
Explanation: A dropout probability of 0.1 was used throughout the model to regularize both attention scores and feed-forward activations.

Question 39. Which of the following is a primary drawback of using a very large number of attention heads (e.g., h = 64) while keeping d_model constant?
A) Each head’s dimensionality becomes too small, limiting its capacity to learn distinct patterns.
B) The model will run out of GPU memory due to increased parameters in Q, K, V projections.
C) The softmax function becomes unstable.
D) Positional encodings become ineffective.
Answer: A
Explanation: Splitting d_model into many heads reduces d_k (d_model/h); very low d_k may not capture enough information, diminishing each head’s usefulness.

Question 40. In a Transformer, why is the softmax function applied to the attention scores before multiplying by V?
A) To convert raw scores into a probability distribution that determines how much each value contributes.
B) To normalize the gradients across layers.
C) To enforce sparsity in the attention matrix.
D) To make the attention operation linear.
Answer: A
Explanation: Softmax ensures the attention weights are non-negative and sum to one, effectively weighting the values V in a convex combination.

Question 41. Which of the following best characterizes the “causal” property of decoder self-attention?
A) It allows each token to attend to all other tokens, including future ones.
B) It restricts each token to attend only to tokens at earlier positions.
C) It forces the model to use bidirectional context.
D) It removes the need for positional encodings.
Answer: B
Explanation: Causal (or autoregressive) attention masks out future positions, ensuring that predictions depend only on past tokens.

Question 42. In the T5 model, the “span-masking” pre-training objective differs from BERT’s token-level masking by:
Question 45. The “feed-forward” sub-layer in each Transformer layer is sometimes referred to as a “position-wise” network because:
A) It applies the same linear transformation to each token independently, without mixing positions.
B) It uses convolution over neighboring positions.
C) It attends to the global context via self-attention.
D) It only processes the first token of the sequence.
Answer: A
Explanation: The FFN treats each position separately (identical MLP applied to each vector), hence “position-wise”.

Question 46. In the context of attention, what does the term “key-value memory” refer to?
A) The set of all Q, K, V matrices stored for later reuse.
B) The fact that each token’s representation can be seen as a key-value pair that other tokens query.
C) A caching mechanism for inference speed.
D) The embedding matrix used for token lookup.
Answer: B
Explanation: In attention, each token’s projected K and V act as a memory entry; queries retrieve information by computing similarity to these keys.

Question 47. Which of the following is a reason to use GELU instead of ReLU in the Transformer’s FFN?
A) GELU is computationally cheaper.
B) GELU provides a smoother, probabilistic activation that can improve performance on some tasks.
C) GELU forces sparsity in the output.
D) GELU eliminates the need for layer normalization.
Answer: B
Explanation: GELU (Gaussian Error Linear Unit) weights each input by the standard Gaussian cumulative distribution function, GELU(x) = x · Φ(x), giving a smooth curve instead of ReLU’s hard cutoff at zero; this often yields slight gains over ReLU.
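
The ideas in Questions 45 and 47 can be sketched together in a few lines of plain Python. This is a minimal illustration under stated assumptions, not a reference implementation: the helper names (`gelu`, `relu`, `matvec`, `ffn`, `position_wise`) and the toy weights with d_model = 2 and d_ff = 4 are hypothetical. It uses the exact GELU form x · Φ(x) computed with `math.erf`, and applies the same two-layer network to every position independently.

```python
import math

def relu(x):
    """ReLU: hard cutoff at zero."""
    return max(0.0, x)

def gelu(x):
    """Exact GELU: x times the standard normal CDF of x."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def matvec(weights, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in weights]

def ffn(x, w1, b1, w2, b2):
    """Two-layer feed-forward network: d_model -> d_ff -> d_model."""
    hidden = [gelu(h + b) for h, b in zip(matvec(w1, x), b1)]
    return [o + b for o, b in zip(matvec(w2, hidden), b2)]

def position_wise(tokens, w1, b1, w2, b2):
    """Apply the same FFN to each position; positions never mix."""
    return [ffn(t, w1, b1, w2, b2) for t in tokens]

# Hypothetical toy weights: d_model = 2, d_ff = 4.
W1 = [[0.5, -0.2], [0.1, 0.3], [-0.4, 0.6], [0.2, 0.2]]
B1 = [0.0, 0.1, -0.1, 0.0]
W2 = [[0.3, -0.1, 0.2, 0.4], [-0.2, 0.5, 0.1, 0.0]]
B2 = [0.0, 0.0]

tokens = [[1.0, 2.0], [0.5, -1.0], [-2.0, 0.3]]
outputs = position_wise(tokens, W1, B1, W2, B2)

for x in (-2.0, -0.5, 0.0, 0.5, 2.0):
    print(f"x={x:+.1f}  relu={relu(x):.4f}  gelu={gelu(x):.4f}")
```

Because each position is processed on its own, reordering the input tokens simply reorders the outputs in the same way, which is what "position-wise" means here.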
Question 48. When fine-tuning BERT for a classification task, the typical architecture adds:
A) A new decoder stack on top of the encoder.
B) A single linear layer (softmax classifier) on the [CLS] token’s final hidden state.
C) A recurrent layer after the encoder.
D) An additional positional encoding layer.
Answer: B
Explanation: The [CLS] token aggregates sequence-level information; a linear layer maps its final representation to class logits.

Question 49. In a Transformer, the term “head-dimensionality consistency” ensures that after concatenating the outputs of h heads, the resulting vector is projected back to:
A) d_k × h.
B) d_model.
C) d_ff.
D) The original token embedding size.
Answer: B
Explanation: Concatenated head outputs have size h × d_k (which equals d_model). A final linear layer projects this back to d_model, maintaining consistent dimensionality across layers.

Question 50. Which of the following best explains why the original Transformer uses a fixed sinusoidal positional encoding rather than learned positional embeddings?
A) Fixed encodings guarantee that positions beyond the training length can be represented via linear extrapolation.
B) Learned embeddings are computationally more expensive.
C) Sinusoidal encodings reduce the number of parameters to zero.
D) Fixed encodings are required for the attention mechanism to function.
Answer: A
Explanation: Sinusoidal encodings have a closed-form definition that can be evaluated for any position without extra parameters. The original authors chose them on the hypothesis that this may help the model generalize to longer sequences; they reported nearly identical results with learned positional embeddings, so the benefit is expected rather than guaranteed.
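
The sinusoidal encoding discussed in Question 50 can be sketched as follows. This is an illustrative sketch, assuming an even d_model and the base 10000 from the original formulation; the function names `positional_encoding` and `shift_encoding` are hypothetical. The second function shows the linear relationship between the encoding at pos and the encoding at pos + k: each sine/cosine pair is rotated by an angle that depends only on k.

```python
import math

def positional_encoding(pos, d_model):
    """Sinusoidal encoding for one position (d_model must be even).

    Even index i holds sin(pos / 10000^(i / d_model)),
    the following odd index holds the matching cosine.
    """
    encoding = []
    for i in range(0, d_model, 2):
        angle = pos / (10000 ** (i / d_model))
        encoding.append(math.sin(angle))
        encoding.append(math.cos(angle))
    return encoding

def shift_encoding(encoding, k, d_model):
    """Compute the encoding at pos + k from the encoding at pos.

    Uses only a rotation per (sin, cos) pair, i.e. a linear map
    whose coefficients depend on the offset k but not on pos.
    """
    shifted = []
    for i in range(0, d_model, 2):
        freq = 1.0 / (10000 ** (i / d_model))
        s, c = encoding[i], encoding[i + 1]
        shifted.append(s * math.cos(freq * k) + c * math.sin(freq * k))
        shifted.append(c * math.cos(freq * k) - s * math.sin(freq * k))
    return shifted

d_model = 8
pe_5 = positional_encoding(5, d_model)
pe_8_direct = positional_encoding(8, d_model)
pe_8_shifted = shift_encoding(pe_5, 3, d_model)

# The formula can also be evaluated far beyond any training length.
pe_far = positional_encoding(100000, d_model)
print([round(v, 4) for v in pe_8_direct])
print([round(v, 4) for v in pe_8_shifted])
```

The two printed vectors agree up to floating-point error, which is the property the explanation above refers to. Whether a trained model actually makes use of it on longer sequences is an empirical question this sketch does not address.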
Question 54. Which of the following best describes the “causal” masking pattern for a sequence of length 4? (Use “1” for allowed and “0” for masked.)
A)
1 1 1 1
0 1 1 1
0 0 1 1
0 0 0 1
B)
1 0 0 0
1 1 0 0
1 1 1 0
1 1 1 1
C)
0 0 0 0
0 0 0 0
0 0 0 0
0 0 0 0
D)
1 1 1 1
1 1 1 1
1 1 1 1
1 1 1 1
Answer: B
Explanation: Causal masking permits each position to attend only to itself and earlier positions; the matrix shown in B reflects that pattern.

Question 55. In the original Transformer, the dropout applied to the attention weights is performed:
A) Before the softmax.
B) After the softmax, on the resulting attention probabilities.
C) On the Q, K, V matrices before dot-product.
D) Only during inference.
Answer: B
Explanation: Dropout is applied to the attention probability matrix after softmax, randomly zeroing some attention links while preserving the distribution’s normalization.

Question 56. Which of the following is a key benefit of using a “shared” Q-K-V projection (i.e., the same linear layer for Q, K, and V) in a Transformer?
A) It reduces the total number of parameters, potentially improving efficiency.
B) It increases the model’s capacity to learn distinct representations.
C) It eliminates the need for positional encodings.
D) It allows the model to process variable-length sequences without padding.
Answer: A
Explanation: Sharing projection matrices reduces parameter count, though it may limit expressiveness; some lightweight variants adopt this trade-off.

Question 57. In a multilingual BERT model, how does the model differentiate between languages during processing?
A) By using separate embedding matrices per language.
B) By inserting a language-specific token at the beginning of each sentence.
C) By training separate attention heads for each language.
D) It does not differentiate; all languages share the same parameters.
Answer: D
Explanation: Multilingual BERT uses a single shared WordPiece vocabulary and one set of parameters for all languages, with no explicit marker denoting the input language. Language ID tokens or language embeddings are a feature of some other multilingual models, not of multilingual BERT.

Question 58. Which of the following best explains why the Transformer’s attention matrix is O(n²) in memory?
A) Because each token computes a dot-product with every other token, producing an n × n score matrix.
B) Because of the residual connections.
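
The causal pattern from Question 54 and the n × n size from Question 58 can be checked with a short sketch. The names `causal_mask` and `combine_with_padding` are hypothetical, and the convention is the one used in Question 54 (1 for allowed, 0 for masked). Under this convention a padding mask is combined with the causal mask by a logical AND over allowed entries, which is equivalent to a logical OR over masked entries.

```python
def causal_mask(n):
    """n x n mask: row i may attend to columns 0..i (1 = allowed)."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]

def combine_with_padding(mask, key_is_real):
    """Also block every key position that is padding.

    key_is_real[j] is 1 for a real token and 0 for a padded one.
    """
    return [[allowed & key_is_real[j] for j, allowed in enumerate(row)]
            for row in mask]

mask = causal_mask(4)
for row in mask:
    print(" ".join(str(v) for v in row))

# Hypothetical example: the last of four positions is padding.
combined = combine_with_padding(mask, [1, 1, 1, 0])

# The score matrix has n * n entries, hence O(n^2) memory.
entries = sum(len(row) for row in mask)
```

For n = 4 the printed rows match option B of Question 54, and the mask holds 16 entries; doubling n quadruples that count.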