PrepIQ Deep Learning Specialization Ultimate Exam, Exams of Technology

The Deep Learning Specialization Practice Exam helps learners assess their mastery of advanced machine learning concepts such as neural networks, CNNs, RNNs, sequence models, and optimization algorithms. It includes both theoretical and scenario-based questions covering Python frameworks like TensorFlow and PyTorch. Designed for data scientists and AI professionals, it focuses on understanding model architecture, hyperparameter tuning, and real-world AI applications.

Typology: Exams

2025/2026

Available from 06/26/2026

shilpi-jain-2
shilpi-jain-2 🇮🇳

PrepIQ Deep Learning Specialization Ultimate Exam
**Question 1. Which of the following best describes the role of the softmax
activation function in a neural network?**
A) It introduces non-linearity for hidden layers.
B) It maps outputs to probabilities that sum to 1 for multi-class classification.
C) It prevents vanishing gradients by scaling inputs.
D) It normalizes inputs to zero mean and unit variance.
Answer: B
Explanation: Softmax converts the raw scores of the output layer into a probability
distribution over classes, ensuring the probabilities sum to 1.
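To make Question 1 concrete, here is a minimal sketch of softmax in NumPy. The score values are hypothetical, and subtracting the maximum is a common numerical-stability step rather than something the exam text specifies.

```python
import numpy as np

def softmax(logits):
    """Convert a 1-D array of raw scores into probabilities that sum to 1."""
    shifted = logits - np.max(logits)  # subtract the max for numerical stability
    exps = np.exp(shifted)
    return exps / np.sum(exps)

scores = np.array([2.0, 1.0, 0.1])  # hypothetical output-layer scores
probs = softmax(scores)
print(probs, probs.sum())
```

The largest raw score receives the largest probability, and the outputs sum to 1 up to floating-point error.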
**Question 2. In the context of deep learning, which technological trend most
directly enabled the training of very large neural networks?**
A) Development of relational databases.
B) Increases in CPU clock speed.
C) Availability of GPUs and TPUs for parallel computation.
D) Expansion of low-latency networking.
Answer: C
Explanation: GPUs/TPUs provide massive parallelism, allowing the efficient
computation of large matrix operations required for deep networks.
**Question 3. Which loss function is most appropriate for a binary classification
problem using a sigmoid output neuron?**
A) Mean Squared Error
B) Hinge loss
C) Cross-entropy loss
D) Kullback-Leibler divergence
Answer: C
Explanation: Binary cross-entropy (log loss) aligns with the probabilistic interpretation of the sigmoid output and penalizes confident wrong predictions heavily.
**Question 4. In a vectorized implementation of forward propagation, which NumPy operation replaces a for-loop over training examples when computing Z = W·X + b?**
A) np.dot(W, X.T) + b
B) np.multiply(W, X) + b
C) np.add(np.dot(W, X), b[:, None])
D) np.sum(W * X, axis=0) + b
Answer: C
Explanation: np.dot performs matrix multiplication, and broadcasting adds the bias vector to each column (example) efficiently.
**Question 5. Why is the ReLU activation function preferred over sigmoid in many hidden layers?**
A) ReLU outputs are bounded between 0 and 1, aiding probability interpretation.
B) ReLU reduces the likelihood of vanishing gradients because its derivative is 1 for positive inputs.
C) ReLU guarantees sparsity in the weight matrix.
D) ReLU prevents overfitting by acting as a regularizer.
Answer: B
Explanation: For positive inputs, ReLU’s derivative is 1, preserving gradient magnitude and mitigating vanishing-gradient problems.
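The broadcasting in Question 4 can be checked against an explicit loop. This is a sketch under assumed shapes: W is (units, inputs), X stores one example per column, and b is a 1-D vector; the sizes and random values are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n_x, n_h, m = 3, 4, 5                # hypothetical sizes: inputs, units, examples
W = rng.standard_normal((n_h, n_x))
X = rng.standard_normal((n_x, m))    # one training example per column
b = rng.standard_normal(n_h)         # 1-D bias, one entry per unit

# Vectorized: b[:, None] has shape (n_h, 1) and broadcasts across the m columns.
Z_vec = np.add(np.dot(W, X), b[:, None])

# Reference: explicit loop over training examples.
Z_loop = np.zeros((n_h, m))
for i in range(m):
    Z_loop[:, i] = W @ X[:, i] + b

print(np.allclose(Z_vec, Z_loop))
```

Both versions produce the same (n_h, m) matrix; the vectorized form simply avoids the Python-level loop.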

Answer: C
Explanation: Dropout randomly disables neurons, forcing the network to learn redundant representations and reducing overfitting.
**Question 9. In inverted dropout, why is the activation scaled by 1/(1-p) during training?**
A) To keep the expected activation magnitude the same during training and inference.
B) To accelerate convergence of the optimizer.
C) To enforce sparsity in the weight matrix.
D) To improve numerical stability of gradients.
Answer: A
Explanation: Scaling ensures that the average output of a neuron remains unchanged whether dropout is applied or not, eliminating the need for scaling at test time.
**Question 10. Which of the following statements about mini-batch gradient descent is FALSE?**
A) It provides a more stable estimate of the gradient than stochastic gradient descent.
B) It can exploit vectorized operations on modern hardware.
C) It always converges to the global minimum for non-convex loss surfaces.
D) The batch size is a hyperparameter that influences training speed and noise.
Answer: C
Explanation: Mini-batch GD still operates on non-convex objectives and may converge to local minima or saddle points; there is no guarantee of reaching the global optimum.
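The scaling in Question 9 can be illustrated with a small simulation. This sketch assumes p is the probability of dropping a unit (so 1-p is the keep probability); the function name and the all-ones activations are hypothetical.

```python
import numpy as np

def inverted_dropout(a, p, rng):
    """Drop each activation with probability p, then rescale survivors by 1/(1-p)."""
    mask = rng.random(a.shape) >= p   # True where the unit is kept
    return a * mask / (1.0 - p)

rng = np.random.default_rng(0)
a = np.ones(100_000)                  # hypothetical activations, all equal to 1
dropped = inverted_dropout(a, p=0.2, rng=rng)
print(a.mean(), dropped.mean())       # the two means should be close
```

Because the surviving activations are scaled up during training, no compensating factor is needed at inference time.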

**Question 11. The momentum optimizer updates parameters using a velocity term. Which equation correctly describes the velocity update?**
A) v = β·v + (1-β)·∇J
B) v = β·v + (1-β)·∇J / m
C) v = β·v - α·∇J
D) v = β·v - α·∇J / m
Answer: C
Explanation: Momentum accumulates a fraction β of the previous velocity and subtracts the current gradient scaled by the learning rate α.
**Question 12. In the Adam optimizer, what is the purpose of the bias-corrected first-moment estimate m̂ₜ?**
A) To compensate for the fact that mₜ is initialized at zero, especially in early steps.
B) To normalize gradients across different layers.
C) To enforce sparsity in the parameter updates.
D) To replace the learning rate schedule.
Answer: A
Explanation: Because mₜ is initialized at zero, it is biased toward zero early on; dividing by (1-β₁ᵗ) corrects this bias.
**Question 13. Which of the following is a common learning-rate decay schedule?**
A) αₜ = α₀ / (1 + decay·t)
B) αₜ = α₀·exp(-decay·t)
C) αₜ = α₀·(1 / (1 + decay·t))

**Question 16. During inference, batch normalization uses moving averages of μ and σ². Why is this necessary?**
A) To reduce computational cost by avoiding per-example statistics.
B) To keep the model deterministic across different batch sizes.
C) Because training-time batch statistics would depend on the specific mini-batch, causing inconsistency.
D) All of the above.
Answer: D
Explanation: Using the stored moving averages yields stable, deterministic behavior and avoids the overhead of recomputing statistics on each test example.
**Question 17. Which metric is most appropriate when evaluating a model on a highly imbalanced binary classification problem where the positive class is rare?**
A) Accuracy
B) Precision
C) Recall
D) F1 score
Answer: D
Explanation: F1 balances precision and recall, providing a single measure that accounts for both false positives and false negatives, which is crucial in imbalanced settings.
**Question 18. In the context of orthogonalization, what does it mean to “tune one hyperparameter to affect only one metric”?**
A) Adjust a hyperparameter while keeping all other variables constant, observing its isolated impact.
B) Simultaneously optimize all hyperparameters for all metrics.
C) Use a genetic algorithm to evolve hyperparameters.
D) Perform cross-validation on every hyperparameter.
Answer: A
Explanation: Orthogonalization aims to change a single hyperparameter at a time so its effect on a specific metric can be measured without confounding factors.
**Question 19. The Bayes error rate represents:**
A) The error achieved by the best possible classifier given the data distribution.
B) The error of a logistic regression model on the training set.
C) The error due to overfitting.
D) The error caused by label noise only.
Answer: A
Explanation: Bayes error is the theoretical minimum error achievable by any classifier, assuming perfect knowledge of the underlying distribution.
**Question 20. When the training set and test set come from different distributions (e.g., web images vs. mobile-camera images), which strategy helps diagnose whether the issue is data mismatch or overfitting?**
A) Increase the model depth.
B) Create a train-dev set drawn from the training distribution and compare training, train-dev, and dev errors.
C) Apply stronger L2 regularization.
D) Use early stopping on the original training set.
Answer: B
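Returning to Question 17, a small worked example shows why accuracy can mislead on imbalanced data. The counts below are hypothetical, and the helper name is an assumption made for illustration.

```python
def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, F1) from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical test set: 1,000 examples, only 10 positives.
# A model that always predicts "negative" finds none of the positives.
tp, fp, fn, tn = 0, 0, 10, 990
accuracy = (tp + tn) / (tp + fp + fn + tn)
print(accuracy, precision_recall_f1(tp, fp, fn))
```

The always-negative model scores 99% accuracy yet has an F1 of 0, which is the kind of failure F1 is meant to expose.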

A) 28×
B) 32×
C) 30×
D) 27×
Answer: A
Explanation: Output size = floor((N-F)/S) + 1 = floor((32-5)/1) + 1 = 27 + 1 = 28.
**Question 24. Which pooling operation provides translation invariance by selecting the maximum activation within each window?**
A) Average pooling
B) Max pooling
C) Min pooling
D) Median pooling
Answer: B
Explanation: Max pooling retains the strongest response in each region, making the representation less sensitive to small translations.
**Question 25. The primary advantage of residual (skip) connections in very deep networks is:**
A) They reduce the number of parameters.
B) They allow gradients to flow directly to earlier layers, mitigating vanishing gradients.
C) They enforce sparsity in the weight matrices.
D) They replace the need for batch normalization.
Answer: B

Explanation: Skip connections provide an alternate path for the gradient, preserving signal strength across many layers.
**Question 26. In the YOLO object-detection framework, what is the purpose of non-max suppression (NMS)?**
A) To increase the number of predicted bounding boxes.
B) To merge overlapping boxes that predict the same object, keeping only the highest-confidence one.
C) To compute the IoU between predicted and ground-truth boxes.
D) To augment training images with random crops.
Answer: B
Explanation: NMS discards redundant boxes with lower confidence that overlap heavily with a higher-confidence box, producing a cleaner set of detections.
**Question 27. Which loss function is typically used to train a Siamese network for face verification?**
A) Cross-entropy loss
B) Hinge loss
C) Triplet loss
D) Mean squared error
Answer: C
Explanation: Triplet loss encourages the distance between an anchor and a positive example to be smaller than the distance between the anchor and a negative example by a margin.
**Question 28. In neural style transfer, which component of the cost function captures the “style” of the target image?**
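The triplet loss described in Question 27 can be sketched in a few lines. This version assumes squared Euclidean distance and a margin of 0.2, neither of which the exam text specifies, and the 2-D embeddings are hypothetical.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge on squared distances: max(d(a, p) - d(a, n) + margin, 0)."""
    d_pos = np.sum((anchor - positive) ** 2)
    d_neg = np.sum((anchor - negative) ** 2)
    return max(d_pos - d_neg + margin, 0.0)

a = np.array([0.0, 0.0])   # anchor embedding
p = np.array([0.1, 0.0])   # same identity, close to the anchor
n = np.array([1.0, 0.0])   # different identity, far from the anchor

print(triplet_loss(a, p, n))   # margin already satisfied
print(triplet_loss(a, n, p))   # roles swapped, margin violated
```

The loss is zero once the negative is farther from the anchor than the positive by at least the margin, and positive otherwise.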

Explanation: The decoder emits a probability distribution for each target word position, forming a sequence of predictions.
**Question 31. Which of the following statements about the vanishing-gradient problem in standard RNNs is TRUE?**
A) It occurs because gradients explode exponentially.
B) It is mitigated by using ReLU activations in the hidden state.
C) It arises when repeated multiplication by values < 1 shrinks gradients over long time steps.
D) It only affects networks with more than 100 layers.
Answer: C
Explanation: In RNNs, the same weight matrix is applied at each time step; eigenvalues with magnitude below 1 cause gradients to decay exponentially, leading to vanishing gradients.
**Question 32. The update gate in a GRU controls:**
A) How much of the previous hidden state to forget.
B) The magnitude of the cell state.
C) The proportion of new candidate activation that replaces the old hidden state.
D) The learning rate of the optimizer.
Answer: C
Explanation: The update gate interpolates between the previous hidden state and the candidate activation, determining how much new information is incorporated.
**Question 33. Which of the following best characterizes the “skip-gram” variant of Word2Vec?**
A) It predicts the target word from surrounding context words.
B) It predicts surrounding context words given a target word.
C) It factorizes a co-occurrence matrix directly.
D) It uses a hierarchical softmax to reduce computational cost.
Answer: B
Explanation: Skip-gram takes a central word as input and tries to predict its surrounding words, learning embeddings that capture context.
**Question 34. In the context of GloVe embeddings, what does the term “global co-occurrence matrix” refer to?**
A) A matrix storing pairwise word frequencies across the entire corpus.
B) A matrix of word embeddings after training.
C) A confusion matrix of classification results.
D) A matrix of gradients accumulated during backpropagation.
Answer: A
Explanation: GloVe builds a large word-word co-occurrence matrix counting how often word i appears near word j, then factorizes it to obtain embeddings.
**Question 35. Which of the following is a primary advantage of using a 1×1 convolution in modern CNN architectures?**
A) It reduces spatial dimensions without affecting channel depth.
B) It acts as a linear projection that can reduce (or increase) the number of channels, decreasing computational cost.
C) It replaces the need for pooling layers.
D) It enforces sparsity in the weight tensor.
Answer: B
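The channel projection behind Question 35 (a 1×1 convolution) can be sketched with NumPy. The feature-map sizes are hypothetical, a channels-last layout is assumed, and bias and activation are omitted for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)
H, W_, C_in, C_out = 8, 8, 64, 16          # hypothetical feature-map sizes
x = rng.standard_normal((H, W_, C_in))     # input feature map, channels last
w = rng.standard_normal((C_in, C_out))     # a 1x1 kernel is one C_in x C_out matrix

# Apply the same linear projection to the channel vector at every pixel.
y = np.einsum("hwc,cd->hwd", x, w)

print(x.shape, "->", y.shape)
```

The spatial dimensions are unchanged while the channel count drops from 64 to 16, which is why such layers are often used to cut the cost of a following larger convolution.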

B) It skips positions when sliding the filter, resulting in fewer output locations.
C) It adds zero padding to the input.
D) It changes the activation function to a pooling operation.
Answer: B
Explanation: A stride of s means the filter moves s pixels at a time, so the number of positions visited (and thus output size) is reduced.
**Question 39. In the context of object detection, what is an “anchor box”?**
A) A fixed-size bounding box used to generate proposals.
B) A learned transformation applied to the entire image.
C) A set of predefined box shapes and aspect ratios that serve as reference for predicting offsets.
D) The final output after non-max suppression.
Answer: C
Explanation: Anchor boxes provide multiple reference shapes per grid cell, allowing the network to predict adjustments for objects of varying sizes.
**Question 40. Which of the following statements about the “vanilla” RNN’s backpropagation through time (BPTT) is correct?**
A) Gradients are computed only for the final time step.
B) The same weight matrices are unrolled across all time steps, and gradients are summed over them.
C) BPTT does not require storing intermediate activations.
D) BPTT automatically solves the exploding-gradient problem.
Answer: B

Explanation: In BPTT the network is unrolled, the same parameters are reused at each step, and the total gradient is the sum of contributions from each time step.
**Question 41. Which hyperparameter primarily controls the amount of regularization applied by L2 weight decay?**
A) Learning rate α
B) Decay rate λ (lambda)
C) Momentum β
D) Batch size m
Answer: B
Explanation: The L2 penalty term λ·∑w² is added to the cost; larger λ increases the penalty on large weights, strengthening regularization.
**Question 42. In a deep network, why might you adapt He initialization for layers using Leaky ReLU rather than standard ReLU?**
A) The adapted form accounts for the non-zero negative slope, adjusting the variance accordingly.
B) He initialization is independent of the activation function.
C) Leaky ReLU requires Xavier initialization for stability.
D) Both activations behave identically, so any initialization works.
Answer: A
Explanation: Leaky ReLU passes a small gradient for negative inputs; He initialization can be adapted (variance = 2 / ((1 + α²)·nₗ)) to account for the leak factor α.
**Question 43. Which of the following is a common technique for handling class imbalance during training of a neural network?**

Explanation: Multiplying the learning rate by (1-decay) each epoch reduces it quickly; if decay is too aggressive, learning may stop improving early.
**Question 46. In a convolutional neural network, what is the main purpose of using “dilated” (or atrous) convolutions?**
A) To increase the number of parameters without changing receptive field.
B) To enlarge the receptive field without increasing the number of parameters or reducing spatial resolution.
C) To perform depthwise separable convolutions.
D) To replace pooling layers entirely.
Answer: B
Explanation: Dilated convolutions insert zeros between filter elements, expanding the effective receptive field while keeping filter size and computational cost modest.
**Question 47. Which of the following correctly describes the relationship between the number of parameters in a fully connected layer and the sizes of its input and output vectors?**
A) Parameters = (input size + output size) × bias
B) Parameters = input size × output size + output size (bias term)
C) Parameters = (input size × output size) / 2
D) Parameters = max(input size, output size)
Answer: B
Explanation: A dense layer has a weight matrix of shape (output, input) plus a bias vector of length output.
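As a quick check of the formula in Question 47, the count can be computed directly. The layer sizes (784 inputs, 128 outputs) and the function name are hypothetical, chosen only for illustration.

```python
def dense_param_count(n_in, n_out):
    """Weights (n_out x n_in) plus one bias per output unit."""
    return n_in * n_out + n_out

print(dense_param_count(784, 128))  # 784 * 128 + 128 = 100480
```

The bias term adds only n_out parameters, so the weight matrix dominates the count for all but the smallest layers.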

**Question 48. In the context of model deployment, which of the following is a primary advantage of using a smaller, quantized model (e.g., 8-bit integers) over a full-precision (32-bit float) model?**
A) Higher training accuracy.
B) Faster inference and lower memory footprint on edge devices.
C) Better gradient flow during backpropagation.
D) Automatic improvement in generalization performance.
Answer: B
Explanation: Quantization reduces the size of weights and activations, enabling faster compute and reduced memory usage on resource-constrained hardware.
**Question 49. Which of the following statements about the “bias-variance decomposition” is correct?**
A) Bias measures how much the model’s predictions vary across different training sets.
B) Variance measures the systematic error due to model assumptions.
C) Total expected error = Bias² + Variance + Irreducible noise.
D) Increasing model capacity always reduces both bias and variance.
Answer: C
Explanation: The decomposition separates error into bias² (error from erroneous assumptions), variance (sensitivity to training data), and irreducible noise.
**Question 50. In a deep network, why might you place batch normalization before the activation function rather than after?**
A) It makes the activation function linear.
B) It stabilizes the distribution of inputs to the activation, improving gradient flow.