Deep Learning with Python and PyTorch: Certificate Practice Exam, Exams of Technology

A practice exam for deep learning with Python and PyTorch. It includes multiple-choice questions covering fundamental concepts, neural network architectures, optimization techniques, and PyTorch-specific implementations. Each question is followed by an explanation of the correct answer, making it a useful resource for students and professionals preparing for deep learning certifications or reinforcing their understanding of the subject. The exam covers topics such as activation functions, loss functions, optimizers, regularization, and convolutional neural networks.

Typology: Exams

2025/2026

Available from 12/21/2025

shilpi-jain-1
shilpi-jain-1 🇮🇳

Deep Learning with Python and PyTorch
Certificate Practice Exam
**Question 1.** Which of the following best describes the main difference between
traditional machine learning and deep learning?
A) Deep learning requires manual feature engineering, while traditional ML learns features
automatically.
B) Traditional ML can only handle image data, whereas deep learning works only with text.
C) Deep learning models learn hierarchical representations directly from raw data, reducing
the need for manual feature engineering.
D) Traditional ML always uses neural networks, while deep learning never does.
**Answer:** C
**Explanation:** Deep learning automatically discovers multiple levels of abstraction
(features) from raw inputs, whereas traditional ML typically relies on handcrafted features.
**Question 2.** The perceptron learning rule was first introduced in the late 1950s. Which
limitation of the perceptron motivated the later development of multilayer networks?
A) It could not learn nonlinear decision boundaries.
B) It required GPU acceleration.
C) It could only process sequential data.
D) It could not be trained with gradient descent.
**Answer:** A
**Explanation:** A single-layer perceptron can only represent linearly separable functions;
adding hidden layers allows modeling of nonlinear relationships.
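
The limitation in Question 2 can be checked with a few lines of plain Python. The sketch below trains a single threshold unit with the perceptron learning rule on AND (linearly separable) and on XOR (not linearly separable). The function names, the 25-epoch budget, and the learning rate of 1.0 are illustrative assumptions, not part of the exam.

```python
# Minimal perceptron sketch; all names and hyperparameters are illustrative.

def predict(weights, bias, x):
    """Threshold unit: output 1 if the weighted sum is positive, else 0."""
    total = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if total > 0 else 0


def train_perceptron(data, epochs=25, lr=1.0):
    """Perceptron rule: w <- w + lr * (target - prediction) * x."""
    weights, bias = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for x, target in data:
            error = target - predict(weights, bias, x)
            weights = [w + lr * error * xi for w, xi in zip(weights, x)]
            bias += lr * error
    return weights, bias


def accuracy(data, weights, bias):
    """Fraction of examples the trained unit classifies correctly."""
    correct = sum(predict(weights, bias, x) == target for x, target in data)
    return correct / len(data)


AND_DATA = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]  # separable
XOR_DATA = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]  # not separable

and_acc = accuracy(AND_DATA, *train_perceptron(AND_DATA))
xor_acc = accuracy(XOR_DATA, *train_perceptron(XOR_DATA))
print(f"AND accuracy: {and_acc:.2f}")
print(f"XOR accuracy: {xor_acc:.2f}")
```

No single line separates the XOR classes, so the unit can get at most three of the four XOR points right regardless of how long it trains.
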
**Question 3.** Which activation function is most prone to the vanishing gradient problem
for deep networks?
A) ReLU
B) Leaky ReLU
C) Sigmoid
D) Swish
**Answer:** C
**Explanation:** The sigmoid squashes inputs to (0, 1); its derivative becomes very small for saturated inputs, causing gradients to vanish in deep layers.
**Question 4.** In a feed-forward neural network, which layer is responsible for producing the final prediction?
A) Input layer
B) Hidden layer
C) Output layer
D) Bias layer
**Answer:** C
**Explanation:** The output layer maps the final hidden representation to the desired output dimension (e.g., class scores).
**Question 5.** Xavier (Glorot) initialization is most appropriate for layers using which activation function?
A) ReLU
B) Tanh
C) Linear
D) Softmax
**Answer:** B
**Explanation:** Xavier initialization balances variance for sigmoid/tanh activations; for ReLU, Kaiming (He) initialization is preferred.
**Question 6.** Which loss function is ideal for a binary classification problem?
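
As a numeric illustration of the saturation described in the explanation to Question 3, the sketch below evaluates the sigmoid derivative at its peak and at a saturated input, then bounds the product of such factors over a hypothetical 20-layer chain. It deliberately ignores the weight matrices, so it only bounds the activation's contribution to the gradient.

```python
import math


def sigmoid(x):
    """Logistic sigmoid, mapping any real input into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_grad(x):
    """Derivative of the sigmoid: s(x) * (1 - s(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)


peak = sigmoid_grad(0.0)        # the derivative is largest at x = 0
saturated = sigmoid_grad(10.0)  # far from 0 the derivative is close to zero

depth = 20                      # hypothetical number of stacked sigmoid layers
upper_bound = peak ** depth     # best case: every layer sits at the peak

print(f"peak derivative:       {peak}")
print(f"saturated derivative:  {saturated:.2e}")
print(f"bound over {depth} layers: {upper_bound:.2e}")
```

Even in the best case the activation factor shrinks geometrically with depth, which is the mechanism behind the vanishing gradient.
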

**Question 9.** Which PyTorch method converts a NumPy array np_arr into a tensor on the same device as the model?
A) torch.tensor(np_arr, device=model.device)
B) torch.from_numpy(np_arr).to(model.device)
C) torch.as_tensor(np_arr, dtype=model.dtype)
D) torch.convert(np_arr)
**Answer:** B
**Explanation:** torch.from_numpy creates a CPU tensor sharing memory with the NumPy array, and .to(device) moves it to the model's device. A plain nn.Module has no .device attribute, so model.device here stands for the device of the model's parameters, e.g. next(model.parameters()).device. torch.tensor(np_arr, device=...) also yields a tensor on that device, but it always copies the data.
**Question 10.** In PyTorch, what does the attribute requires_grad=True indicate?
A) The tensor should be stored on the GPU.
B) The tensor will be part of the computational graph and its gradients will be computed during back-propagation.
C) The tensor is immutable.
D) The tensor will be automatically converted to a NumPy array.
**Answer:** B
**Explanation:** Setting requires_grad=True tells autograd to track operations on the tensor for gradient computation.
**Question 11.** After calling loss.backward() in PyTorch, why is it necessary to invoke optimizer.zero_grad() before the next forward pass?
A) To free GPU memory.
B) To prevent accumulation of gradients from previous batches.
C) To reset the model parameters to their initial values.
D) To convert gradients to CPU tensors.
**Answer:** B

**Explanation:** PyTorch accumulates gradients in each parameter's .grad; zeroing them ensures each update uses only the current batch's gradients.
**Question 12.** Which of the following is a correct way to define a custom PyTorch model with two linear layers?
A) class MyNet(nn.Module): def __init__(self): super().__init__(); self.fc1 = nn.Linear(10, 20); self.fc2 = nn.Linear(20, 5)
B) class MyNet(nn.Module): def __init__(self): self.fc1 = nn.Linear(10, 20); self.fc2 = nn.Linear(20, 5); super().__init__()
C) class MyNet(nn.Module): def __init__(self): super().__init__(); self.layers = [nn.Linear(10,20), nn.Linear(20,5)]
D) class MyNet(nn.Module): def __init__(self): super().__init__(); self.fc = nn.Sequential(nn.Linear(10,20), nn.Linear(20,5))
**Answer:** A
**Explanation:** The proper pattern is to call super().__init__() first, then assign sub-modules as attributes. Option B assigns sub-modules before super().__init__() has run, which raises an error. Option C stores the layers in a plain Python list, which PyTorch does not register, so their parameters never appear in model.parameters(). Option D is also valid PyTorch, because nn.Sequential registers its sub-modules; A is the direct form with two separately named layers.
**Question 13.** In a torch.utils.data.DataLoader, what does the shuffle=True argument accomplish?
A) Randomly permutes the order of samples each epoch.
B) Balances class distribution in each batch.
C) Converts data to tensors automatically.
D) Enables multi-GPU training.
**Answer:** A
**Explanation:** shuffle=True reshuffles the dataset indices at every epoch before batches are formed, so each epoch sees the samples in a different order.
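
The behaviours behind Questions 11 and 12 are easy to observe directly. The sketch below is a minimal illustration: the scalar "loss" 3 * w, the class name ListNet, and the ReLU in MyNet.forward are assumptions made for the example, since the exam options define no forward method.

```python
import torch
from torch import nn

# --- Gradient accumulation (Question 11) ---
w = torch.tensor(2.0, requires_grad=True)
optimizer = torch.optim.SGD([w], lr=0.1)

(3.0 * w).backward()
grad_first = w.grad.item()        # d(3w)/dw = 3

(3.0 * w).backward()              # no zero_grad() in between
grad_accumulated = w.grad.item()  # the new gradient is added to the old one

optimizer.zero_grad()             # clear the stored gradient
(3.0 * w).backward()
grad_after_reset = w.grad.item()  # only the latest backward pass counts


# --- Sub-module registration (Question 12) ---
class MyNet(nn.Module):
    def __init__(self):
        super().__init__()            # must run before assigning sub-modules
        self.fc1 = nn.Linear(10, 20)
        self.fc2 = nn.Linear(20, 5)

    def forward(self, x):
        # The ReLU between the layers is an assumption for this sketch.
        return self.fc2(torch.relu(self.fc1(x)))


class ListNet(nn.Module):
    def __init__(self):
        super().__init__()
        # A plain Python list is invisible to nn.Module's bookkeeping.
        self.layers = [nn.Linear(10, 20), nn.Linear(20, 5)]


registered = len(list(MyNet().parameters()))      # two weights and two biases
unregistered = len(list(ListNet().parameters()))  # nothing is registered

print(grad_first, grad_accumulated, grad_after_reset)
print(registered, unregistered)
```

If a list of layers is genuinely needed, nn.ModuleList is the registered counterpart of the plain list used in ListNet.
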

**Explanation:** Pooling summarizes local neighborhoods, decreasing resolution and making the network less sensitive to small translations.
**Question 17.** Batch Normalization helps training by:
A) Adding dropout to the network.
B) Normalizing layer inputs to have zero mean and unit variance, reducing internal covariate shift.
C) Increasing the learning rate automatically.
D) Converting convolutional layers to fully connected layers.
**Answer:** B
**Explanation:** BatchNorm stabilizes learning by normalizing activations, which allows higher learning rates and faster convergence.
**Question 18.** When fine-tuning a pre-trained ResNet for a new classification task with 10 classes, which of the following is a common practice?
A) Freeze all convolutional layers and replace the final fully connected layer with a new Linear(512,10).
B) Re-initialize the entire network with random weights.
C) Remove all BatchNorm layers.
D) Train only the first convolutional block.
**Answer:** A
**Explanation:** Freezing the feature extractor preserves learned representations; a new classifier head adapts to the target classes. The input size of 512 matches ResNet-18 and ResNet-34; deeper variants such as ResNet-50 feed 2048 features into the final layer.
**Question 19.** Residual connections in ResNet address which problem in deep networks?
A) Overfitting due to too many parameters.
B) Vanishing gradients by providing identity shortcuts that facilitate gradient flow.

C) Lack of non-linearity.
D) Inability to process variable-length sequences.
**Answer:** B
**Explanation:** Skip connections allow gradients to bypass several layers, mitigating vanishing gradients and enabling very deep architectures.
**Question 20.** In an LSTM cell, which gate determines how much of the previous cell state is retained?
A) Input gate
B) Forget gate
C) Output gate
D) Update gate
**Answer:** B
**Explanation:** The forget gate multiplies the previous cell state, controlling what information is discarded.
**Question 21.** Which of the following statements about GRU (Gated Recurrent Unit) is true?
A) GRU has three gates: input, forget, and output.
B) GRU combines the input and forget gates into a single update gate, making it computationally cheaper than LSTM.
C) GRU cannot be used for language modeling.
D) GRU requires twice as many parameters as an LSTM with the same hidden size.
**Answer:** B
**Explanation:** GRU merges the input and forget mechanisms into one update gate and keeps no separate cell state, so it uses fewer parameters than an LSTM of the same hidden size while often performing comparably.
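
The parameter saving mentioned in Question 21 can be measured rather than taken on trust. The sketch below counts the parameters of a single-layer nn.LSTM and nn.GRU; the sizes 8 and 16 are arbitrary choices for the example. PyTorch's LSTM stores four gate-sized weight blocks and the GRU three, so the counts should stand in a 4:3 ratio.

```python
from torch import nn


def count_parameters(module):
    """Total number of scalar parameters in a module."""
    return sum(p.numel() for p in module.parameters())


input_size, hidden_size = 8, 16  # arbitrary sizes for illustration

lstm = nn.LSTM(input_size, hidden_size)
gru = nn.GRU(input_size, hidden_size)

lstm_params = count_parameters(lstm)
gru_params = count_parameters(gru)

# Per gate block: input weights + recurrent weights + two bias vectors.
per_gate = hidden_size * (input_size + hidden_size + 2)

print(f"LSTM: {lstm_params} parameters (4 blocks of {per_gate})")
print(f"GRU:  {gru_params} parameters (3 blocks of {per_gate})")
```
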

**Explanation:** Adam maintains per-parameter estimates of the first moment (mean) and second moment (uncentered variance) of the gradients, adapting learning rates accordingly.
**Question 25.** In PyTorch, which function saves both model architecture and its learned parameters in a single file?
A) torch.save(model.state_dict(), path)
B) torch.save(model, path)
C) torch.save(torch.nn.Module, path)
D) torch.save(model.parameters(), path)
**Answer:** B
**Explanation:** torch.save(model, path) pickles the entire module object, including its structure and weights. Loading the file still requires the model's class definition to be importable, which is why saving model.state_dict() is the more portable choice when the architecture code is available.
**Question 26.** The ROC-AUC metric is particularly useful when:
A) Classes are perfectly balanced.
B) The decision threshold needs to be varied to evaluate trade-offs between true positive and false positive rates.
C) Only precision is important.
D) Regression performance is measured.
**Answer:** B
**Explanation:** ROC-AUC summarizes the model's ability to discriminate across all possible thresholds rather than at a single operating point. Under heavy class imbalance it can look optimistic, which is why precision-recall AUC is often preferred in that setting.
**Question 27.** Which of the following best describes model quantization?
A) Reducing the number of layers in a network.
B) Converting model weights from 32-bit floating point to lower-precision (e.g., 8-bit integer) to accelerate inference.

C) Pruning neurons with zero gradients.
D) Adding noise to the inputs during training.
**Answer:** B
**Explanation:** Quantization lowers numerical precision, decreasing memory bandwidth and improving latency on compatible hardware.
**Question 28.** In a Variational Autoencoder (VAE), the KL-divergence term in the loss encourages:
A) The encoder to output deterministic codes.
B) The latent distribution to match a prior (usually standard normal).
C) The decoder to produce sharper images.
D) The reconstruction loss to be zero.
**Answer:** B
**Explanation:** KL-divergence penalizes deviation from the prior, ensuring a smooth latent space for generative sampling.
**Question 29.** During GAN training, "mode collapse" refers to:
A) The discriminator becoming too strong, causing the generator loss to explode.
B) The generator producing a limited variety of outputs despite diverse inputs.
C) The optimizer diverging due to high learning rates.
D) The loss functions converging to zero simultaneously.
**Answer:** B
**Explanation:** Mode collapse occurs when the generator maps many latent vectors to the same or few outputs, reducing diversity.
**Question 30.** Which component of a transformer provides information about the position of tokens in a sequence?
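
For the KL term in Question 28, a small worked example may help. Assuming a diagonal Gaussian encoder and a standard normal prior (the usual setup, though the exam does not fix one), the divergence has a closed form per latent dimension: 0.5 * (mu^2 + sigma^2 - 1 - ln sigma^2). The function name below is illustrative.

```python
import math


def kl_to_standard_normal(mu, sigma):
    """KL( N(mu, sigma^2) || N(0, 1) ) for one latent dimension."""
    return 0.5 * (mu ** 2 + sigma ** 2 - 1.0 - math.log(sigma ** 2))


kl_at_prior = kl_to_standard_normal(0.0, 1.0)  # encoder matches the prior
kl_shifted = kl_to_standard_normal(1.0, 1.0)   # mean moved away from 0
kl_narrow = kl_to_standard_normal(0.0, 0.1)    # near-deterministic code

print(f"matches prior: {kl_at_prior}")
print(f"shifted mean:  {kl_shifted}")
print(f"narrow sigma:  {kl_narrow:.4f}")
```

The penalty is zero only when the encoder output equals the prior, and it grows both when the mean drifts and when the variance collapses toward a deterministic code.
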

**Question 33.** What does the torch.nn.Dropout(p=0.5) layer do during training?
A) Sets 50% of the weights to zero permanently.
B) Randomly zeros out 50% of the activations in each forward pass.
C) Adds Gaussian noise with variance 0.5.
D) Scales the output by 0.5.
**Answer:** B
**Explanation:** Dropout randomly zeros activations with probability p during training and scales the surviving ones by 1/(1 - p), which helps prevent co-adaptation of neurons; in evaluation mode it acts as the identity.
**Question 34.** Which of the following statements about the AdamW optimizer is correct?
A) It decouples weight decay from the gradient update, applying the decay directly to the parameters rather than through an L2 term in the loss.
B) It is identical to standard Adam.
C) It requires a learning-rate scheduler to work.
D) It cannot be used with sparse gradients.
**Answer:** A
**Explanation:** AdamW applies weight decay directly to the parameters, separately from the adaptive gradient step, which prevents the decay from being absorbed into the moment estimates.
**Question 35.** In a CNN, dilated convolutions are useful for:
A) Increasing the receptive field without increasing the number of parameters.
B) Performing down-sampling without pooling.
C) Enforcing weight sharing across channels.
D) Normalizing feature maps.

**Answer:** A
**Explanation:** Dilation inserts gaps between kernel elements, expanding the receptive field without adding parameters.
**Question 36.** Which loss function is commonly used to train a semantic segmentation model with imbalanced classes?
A) Binary Cross-Entropy
B) Dice Loss
C) Hinge Loss
D) Mean Absolute Error
**Answer:** B
**Explanation:** Dice loss directly optimizes overlap between predicted and ground-truth masks, handling class imbalance better than pixel-wise cross-entropy.
**Question 37.** What is the effect of using a larger batch size on the stochasticity of gradient estimates?
A) Increases stochasticity, leading to noisier updates.
B) Decreases stochasticity, providing a gradient closer to the full-batch gradient.
C) Has no effect on stochasticity.
D) Causes gradients to become zero.
**Answer:** B
**Explanation:** Larger batches average over more samples, reducing variance in the gradient estimate.
**Question 38.** In PyTorch Lightning, which method should contain the validation logic?
A) training_step
B) validation_step

A) To randomly change the optimizer type each epoch.
B) To systematically adjust the learning rate during training, often decreasing it to fine-tune convergence.
C) To increase the batch size automatically.
D) To freeze layers after a certain number of epochs.
**Answer:** B
**Explanation:** Schedulers modify the learning rate according to a predefined policy (e.g., step, cosine) to improve training stability.
**Question 42.** In a transformer encoder, the "multi-head" mechanism:
A) Allows the model to attend to different representation subspaces simultaneously.
B) Increases the depth of the network.
C) Performs convolution over the sequence.
D) Provides positional information.
**Answer:** A
**Explanation:** Multiple attention heads compute independent attention distributions, capturing diverse relational patterns.
**Question 43.** Which technique can be used to visualize the learned filters of the first convolutional layer in a CNN?
A) Plotting the weight tensors as images after normalizing them.
B) Using torch.save to export them.
C) Applying PCA to the output activations.
D) Converting them to NumPy arrays and computing their FFT.
**Answer:** A
**Explanation:** The first-layer kernels are 2-D filters that can be visualized directly as grayscale or RGB images after scaling.
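
The "step" and "cosine" policies named in the scheduler explanation above can be written as closed-form functions of the epoch. PyTorch ships schedulers such as StepLR and CosineAnnealingLR in torch.optim.lr_scheduler; the functions below are a stand-alone sketch of the same two policies in plain Python, not that API, and the base rate of 0.1 and the 30-epoch horizon are arbitrary assumptions.

```python
import math


def step_lr(base_lr, epoch, step_size, gamma):
    """Multiply the rate by gamma once every step_size epochs."""
    return base_lr * gamma ** (epoch // step_size)


def cosine_lr(base_lr, epoch, total_epochs, min_lr=0.0):
    """Anneal from base_lr to min_lr along half a cosine wave."""
    cosine = 0.5 * (1.0 + math.cos(math.pi * epoch / total_epochs))
    return min_lr + (base_lr - min_lr) * cosine


step_schedule = [step_lr(0.1, e, step_size=10, gamma=0.5) for e in range(30)]
cosine_schedule = [cosine_lr(0.1, e, total_epochs=30) for e in range(31)]

for epoch in (0, 10, 20):
    print(f"epoch {epoch:2d}: step={step_schedule[epoch]:.4f} "
          f"cosine={cosine_schedule[epoch]:.4f}")
```

The step policy drops the rate abruptly at fixed intervals, while the cosine policy lowers it smoothly and spends more epochs near the two ends of the range.
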

**Question 44.** What is the primary advantage of using torch.jit.trace over regular Python execution?
A) It enables dynamic graph construction.
B) It produces a TorchScript representation that can be run independently of Python, improving inference speed.
C) It automatically adds dropout during inference.
D) It converts models to ONNX format.
**Answer:** B
**Explanation:** torch.jit.trace records the tensor operations of one example run to create a static TorchScript graph, which can be executed in C++ environments. Because only that one run is recorded, data-dependent control flow is not captured.
**Question 45.** In a binary classification problem with highly imbalanced data, which metric is more informative than accuracy?
A) Mean Squared Error
B) Precision-Recall AUC
C) R² score
D) Log-Loss
**Answer:** B
**Explanation:** Precision-Recall curves focus on the positive class and are less affected by class imbalance than accuracy.
**Question 46.** Which of the following statements about the Softmax function is true?
A) It can be used for multi-label classification directly.
B) It outputs a probability distribution over mutually exclusive classes.
C) It is identical to the sigmoid function for binary tasks.
D) It does not require a temperature parameter.

C) Adding Gaussian noise to the signal
D) Random color jitter
**Answer:** C
**Explanation:** Adding noise perturbs the series while preserving its temporal structure, helping improve robustness.
**Question 50.** In a VAE, the reparameterization trick is used to:
A) Compute KL-divergence analytically.
B) Enable back-propagation through stochastic sampling by expressing the random variable as a deterministic function of a noise term.
C) Replace the decoder with a linear layer.
D) Freeze the encoder during training.
**Answer:** B
**Explanation:** By sampling z = μ + σ · ε with ε ~ N(0, I), the randomness is isolated in ε, so gradients can flow through μ and σ.
**Question 51.** Which of the following best describes "teacher forcing" in training RNNs?
A) Using the model's own predictions as inputs at the next time step.
B) Providing the ground-truth token as the next input regardless of the model's prediction.
C) Freezing the decoder weights.
D) Applying dropout to the hidden state.
**Answer:** B
**Explanation:** Teacher forcing feeds the true previous token during training, accelerating convergence but potentially causing exposure bias.
**Question 52.** The nn.Embedding layer in PyTorch is typically used for:

A) Converting integer indices (e.g., word IDs) into dense vector representations.
B) Performing one-hot encoding.
C) Applying a linear transformation to images.
D) Normalizing inputs to zero mean.
**Answer:** A
**Explanation:** nn.Embedding maps discrete tokens to continuous embeddings, essential for NLP models.
**Question 53.** In a multi-class classification problem with 4 classes, what is the shape of the output tensor from a model using nn.Linear(128,4) before applying softmax?
A) (batch_size, 128)
B) (batch_size, 4)
C) (4, batch_size)
D) (batch_size, 1)
**Answer:** B
**Explanation:** The linear layer projects each sample to a 4-dimensional logit vector.
**Question 54.** Which of the following statements about the "bias" term in a linear layer is correct?
A) It is optional and always set to zero.
B) It allows the activation to shift, enabling the model to fit data that does not pass through the origin.
C) It is shared across all neurons in the layer.
D) It is only used in convolutional layers.
**Answer:** B
**Explanation:** The bias adds a constant offset to each neuron's pre-activation, increasing representational flexibility.
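
The shapes discussed in Questions 52 to 54 can be traced end to end in a short sketch. The vocabulary size, sequence length, batch size, and the mean pooling over tokens are all assumptions made for illustration; only nn.Linear(128, 4) comes from the exam text.

```python
import torch
from torch import nn

batch_size, seq_len = 32, 6          # hypothetical batch of token sequences
vocab_size, embed_dim = 1000, 128    # hypothetical vocabulary and width

embedding = nn.Embedding(vocab_size, embed_dim)
classifier = nn.Linear(128, 4)       # the layer from Question 53

token_ids = torch.randint(0, vocab_size, (batch_size, seq_len))
embedded = embedding(token_ids)      # one dense vector per integer ID
pooled = embedded.mean(dim=1)        # average over tokens (an assumption)
logits = classifier(pooled)          # one 4-dimensional logit vector per sample
probs = torch.softmax(logits, dim=1) # each row sums to 1

# Without a bias term, a zero input can only map to a zero output.
no_bias = nn.Linear(128, 4, bias=False)
origin_out = no_bias(torch.zeros(1, 128))

print(tuple(embedded.shape), tuple(logits.shape))
print(origin_out)
```

The last two lines illustrate Question 54: a bias-free linear layer is forced through the origin, which is the restriction the bias term removes.
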