[DL PY] Certificate in Deep Learning with Python Certification Exam Guide, Exams of Technology

This exam guide provides a structured approach to deep learning using Python, covering data preprocessing, neural networks, training techniques, evaluation metrics, and model deployment concepts. Candidates strengthen both theoretical and practical understanding through exam-focused exercises.

Typology: Exams

2025/2026

Available from 02/10/2026

shilpi-jain-3
[DL PY] Certificate in Deep Learning with Python
Certification Exam Guide
**Question 1. Which mathematical operation defines the dot product of two vectors in a neural network?**
A) Element-wise multiplication
B) Matrix addition
C) Sum of pairwise products
D) Convolution
Answer: C
Explanation: The dot product (or inner product) multiplies corresponding elements of two vectors and sums the results, which is the core operation in a neuron's weighted sum.
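
As a minimal sketch of the operation described in Question 1, the snippet below computes a dot product in plain Python and uses it for a neuron's weighted sum. The function names `dot` and `neuron_pre_activation`, and the example weights, inputs, and bias, are illustrative assumptions rather than part of the exam guide.

```python
def dot(u, v):
    """Sum of pairwise products of two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    return sum(a * b for a, b in zip(u, v))


def neuron_pre_activation(weights, inputs, bias=0.0):
    """Weighted sum of the inputs plus a bias (before any activation)."""
    return dot(weights, inputs) + bias


# Hypothetical example values.
weights = [0.5, -1.0, 2.0]
inputs = [2.0, 3.0, 1.0]
z = neuron_pre_activation(weights, inputs, bias=0.5)
print(z)  # 0.5*2.0 + (-1.0)*3.0 + 2.0*1.0 + 0.5 = 0.5
```

Note that element-wise multiplication alone (option A) would return a vector; it is the final summation that turns it into a scalar.
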
**Question 2. In calculus, what does the partial derivative of a loss function with respect to a weight represent?**
A) The change in loss when the weight is increased by one unit
B) The slope of the loss surface along that weight’s dimension
C) The total error of the network
D) The learning rate of the optimizer
Answer: B
Explanation: A partial derivative measures how the loss changes as a single weight varies, holding all other parameters constant; it is the gradient component used in back-propagation.
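
The idea in Question 2 can be checked numerically. The sketch below assumes a hypothetical squared-error loss for a single linear neuron and compares a finite-difference estimate of the partial derivative (nudging one weight while holding the others fixed) with the analytic slope. All names and values are illustrative.

```python
def loss(weights, inputs, target):
    """Squared error of a linear prediction (an assumed toy loss)."""
    prediction = sum(w * x for w, x in zip(weights, inputs))
    return (prediction - target) ** 2


def numerical_partial(weights, inputs, target, index, eps=1e-6):
    """Central-difference estimate of d(loss)/d(weights[index])."""
    plus = list(weights)
    minus = list(weights)
    plus[index] += eps    # vary only one weight ...
    minus[index] -= eps   # ... all other weights stay constant
    return (loss(plus, inputs, target) - loss(minus, inputs, target)) / (2 * eps)


def analytic_partial(weights, inputs, target, index):
    """Closed-form slope for the toy loss: 2 * (prediction - target) * x_index."""
    prediction = sum(w * x for w, x in zip(weights, inputs))
    return 2 * (prediction - target) * inputs[index]


weights, inputs, target = [1.0, -2.0], [3.0, 0.5], 1.0
for i in range(len(weights)):
    print(i, numerical_partial(weights, inputs, target, i),
          analytic_partial(weights, inputs, target, i))
```

The two estimates should agree closely, which illustrates why the partial derivative is best read as a local slope rather than the effect of a one-unit change.
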
**Question 3. Which variant of gradient descent uses a single randomly chosen training example to update parameters at each step?**
A) Batch Gradient Descent
B) Mini-batch Gradient Descent
C) Stochastic Gradient Descent (SGD)
D) Full-batch Gradient Descent
Answer: C
Explanation: SGD computes the gradient using one training sample, leading to noisy but frequent updates that can help escape shallow local minima.
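
To make the single-example update in Question 3 concrete, here is a small sketch that fits one weight to hypothetical data generated from y = 2x. The function name `sgd_fit`, the learning rate, and the step count are assumptions chosen for illustration.

```python
import random


def sgd_fit(data, lr=0.05, steps=500, seed=0):
    """Fit y = w * x with one randomly chosen example per update."""
    rng = random.Random(seed)
    w = 0.0
    for _ in range(steps):
        x, y = rng.choice(data)          # a single random training example
        grad = 2 * (w * x - y) * x       # gradient of (w*x - y)**2 w.r.t. w
        w -= lr * grad                   # immediate parameter update
    return w


# Hypothetical noise-free data with true slope 2.0.
data = [(x, 2.0 * x) for x in (0.5, 1.0, 1.5, 2.0)]
w = sgd_fit(data)
print(w)
```

Batch gradient descent would instead average the gradient over all four examples before each update, and mini-batch gradient descent over a small subset.
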


**Question 4. Which activation function suffers from vanishing gradients for large positive inputs?**
A) ReLU
B) Leaky ReLU
C) Sigmoid
D) Softmax
Answer: C
Explanation: The sigmoid saturates near 0 and 1; its derivative approaches 0 for large-magnitude inputs, causing vanishing gradients during back-propagation.

**Question 5. What is the primary purpose of the bias term in a neuron?**
A) To scale the input features
B) To shift the activation function horizontally
C) To regularize the model
D) To compute the loss
Answer: B
Explanation: The bias adds a constant offset to the weighted sum, allowing the activation function to be shifted left or right, enabling the neuron to fit data not passing through the origin.

**Question 6. In Keras, which API allows you to define models with multiple inputs or outputs?**
A) Sequential API
B) Functional API
C) Subclassing API
D) TensorFlow Lite API
Answer: B
Explanation: The Functional API lets you create complex architectures by connecting layers as a directed acyclic graph, supporting multiple inputs and outputs.
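
The saturation described in Question 4 can be seen by evaluating the sigmoid derivative at a few points. This is a minimal sketch in plain Python; the helper names and the sample inputs are illustrative, and the ReLU gradient is included only for contrast.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_grad(z):
    """Derivative of the sigmoid: s * (1 - s)."""
    s = sigmoid(z)
    return s * (1.0 - s)


def relu_grad(z):
    """Derivative of ReLU (taking 0 at z == 0 by convention)."""
    return 1.0 if z > 0 else 0.0


for z in (0.0, 2.0, 5.0, 10.0):
    print(f"z={z:5.1f}  sigmoid'={sigmoid_grad(z):.6f}  relu'={relu_grad(z):.1f}")
```

The sigmoid derivative peaks at 0.25 for z = 0 and shrinks rapidly as |z| grows, while the ReLU gradient stays at 1 for any positive input.
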

A) One forward pass through a single batch
B) One complete pass through the entire training dataset
C) The number of hidden layers in the model
D) The learning rate schedule step
Answer: B
Explanation: An epoch is defined as processing all training samples once; multiple epochs are usually required for the model to converge.

**Question 11. Which parameter controls the number of samples processed before the model’s weights are updated?**
A) Epoch
B) Learning Rate
C) Batch Size
D) Momentum
Answer: C
Explanation: Batch size determines how many training examples are used to compute each gradient update.

**Question 12. In a 2-D convolution, what does “same” padding achieve?**
A) No padding, resulting in reduced spatial dimensions
B) Padding that keeps the output size equal to the input size
C) Padding only on the right and bottom edges
D) Random padding values for data augmentation
Answer: B
Explanation: “Same” padding adds zeros as evenly as possible around the input so that the height and width of the output feature map match the input dimensions (when stride = 1).

**Question 13. What is the effect of a stride of (2,2) in a convolutional layer?**
A) The filter moves two pixels horizontally and vertically each step, halving the spatial dimensions
B) The filter size doubles
C) The number of filters doubles
D) No effect; stride only matters for pooling layers
Answer: A
Explanation: Stride controls how far the filter slides; a stride of 2 reduces the output height and width roughly by half.

**Question 14. Which pooling operation computes the average value within each window?**
A) Max Pooling
B) Average Pooling
C) Global Max Pooling
D) Stochastic Pooling
Answer: B
Explanation: Average Pooling takes the mean of all activations inside the pooling window, providing a smoother down-sampling.

**Question 15. After several convolution and pooling layers, which Keras layer is commonly used to convert 2-D feature maps into a 1-D vector before a dense classifier?**
A) Conv2D
B) Flatten
C) Reshape
D) GlobalAveragePooling2D
Answer: B
Explanation: Flatten reshapes the multi-dimensional tensor into a single long vector, preserving the order of elements for the fully connected layers.

**Question 16. In the VGG architecture, what is the primary design principle?**
A) Use of residual connections
B) Stacking many 3×3 convolutional layers with increasing depth
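
The padding and stride behaviour in Questions 12 and 13 can be checked with a little arithmetic. The helper below is a hypothetical function (not a Keras API) that assumes the output-size convention used by TensorFlow/Keras: ceil(n / stride) for "same" padding and floor((n - kernel) / stride) + 1 for "valid" padding.

```python
import math


def conv_output_size(n, kernel, stride=1, padding="valid"):
    """Output length along one spatial dimension of a convolution."""
    if padding == "same":
        return math.ceil(n / stride)
    if padding == "valid":
        return (n - kernel) // stride + 1
    raise ValueError("padding must be 'same' or 'valid'")


# Hypothetical 28-pixel input with a 3-wide kernel.
print(conv_output_size(28, 3, stride=1, padding="same"))   # 28
print(conv_output_size(28, 3, stride=1, padding="valid"))  # 26
print(conv_output_size(28, 3, stride=2, padding="same"))   # 14
```

With stride 1, "same" padding preserves the size; with stride 2 the output is about half the input, which matches the explanation for Question 13.
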

D) The loss value at each timestep
Answer: B
Explanation: The hidden state is the internal representation updated at each timestep, enabling the network to retain temporal context.

**Question 20. Which issue is most commonly associated with training standard RNNs on long sequences?**
A) Exploding gradients only
B) Vanishing gradients only
C) Both exploding and vanishing gradients
D) No gradient issues; RNNs are stable
Answer: C
Explanation: Standard RNNs suffer from both exploding and vanishing gradients because repeated multiplication by the recurrent weight matrix during back-propagation through time can cause gradients to shrink or grow exponentially.

**Question 21. In an LSTM cell, which gate decides what new information to add to the cell state?**
A) Forget gate
B) Input gate
C) Output gate
D) Reset gate
Answer: B
Explanation: The input gate, combined with a candidate vector, determines how much of the new information is written into the cell state.

**Question 22. Which of the following best describes the purpose of the forget gate in an LSTM?**
A) To reset the hidden state to zero each timestep
B) To decide which parts of the previous cell state to discard
C) To compute the final output of the LSTM
D) To increase the learning rate adaptively
Answer: B
Explanation: The forget gate multiplies the previous cell state element-wise by values between 0 and 1, allowing the network to forget irrelevant past information.

**Question 23. What is the dimensionality of the output of a Keras Embedding layer with input dimension 10,000 and output dimension 128?**
A) (batch_size, 10,000)
B) (batch_size, sequence_length, 128)
C) (batch_size, 128)
D) (batch_size, sequence_length, 10,000)
Answer: B
Explanation: The Embedding layer maps each integer token to a 128-dimensional vector, producing a 3-D tensor: (batch, time steps, embedding_dim).

**Question 24. Which algorithm learns word vectors by predicting surrounding words given a target word?**
A) GloVe
B) Word2Vec Skip-gram
C) FastText
D) BERT
Answer: B
Explanation: The Skip-gram variant of Word2Vec maximizes the probability of context words conditioned on the target word.

**Question 25. In the Transformer architecture, what is the main function of the self-attention mechanism?**
A) To replace convolutional layers for image processing
B) To allow each token to weigh the relevance of every other token in the sequence
C) To perform pooling over time steps
D) To generate word embeddings from raw characters
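
Questions 21 and 22 above describe how the input and forget gates act on the cell state. Assuming the standard LSTM update c_t = f_t * c_{t-1} + i_t * g_t (applied element-wise, with gate activations already computed), the sketch below shows the effect of the two gates. The function name and the example values are hypothetical.

```python
def lstm_cell_state_update(prev_cell, forget_gate, input_gate, candidate):
    """Element-wise cell-state update: keep part of the old state, add part of the new."""
    return [
        f * c_prev + i * g
        for c_prev, f, i, g in zip(prev_cell, forget_gate, input_gate, candidate)
    ]


prev_cell = [5.0, 5.0]
forget_gate = [1.0, 0.0]   # keep the first unit, discard the second
input_gate = [0.0, 1.0]    # write nothing to the first unit, everything to the second
candidate = [9.0, 9.0]

print(lstm_cell_state_update(prev_cell, forget_gate, input_gate, candidate))
# [5.0, 9.0]
```

In a real LSTM the gate values come from sigmoid layers and lie strictly between 0 and 1; the extreme values here are only meant to make each gate's role visible.
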

Answer: A
Explanation: L1 regularization (Lasso) adds a penalty proportional to the sum of |w| to the loss, encouraging sparsity in the learned parameters.

**Question 29. In Keras, which callback stops training when a monitored metric has stopped improving for a specified number of epochs?**
A) ModelCheckpoint
B) EarlyStopping
C) ReduceLROnPlateau
D) TensorBoard
Answer: B
Explanation: EarlyStopping monitors a metric (e.g., validation loss) and halts training if no improvement is observed for 'patience' epochs.

**Question 30. What does Batch Normalization primarily aim to reduce during training?**
A) Model size
B) Internal covariate shift
C) Number of epochs needed for convergence
D) Overfitting via weight decay
Answer: B
Explanation: By normalizing layer inputs per mini-batch, BatchNorm stabilizes the distribution of activations; it was introduced with the stated aim of reducing internal covariate shift, and in practice it allows higher learning rates.

**Question 31. Which technique allows you to reuse the learned filters of a pre-trained model for a new, smaller dataset?**
A) Data augmentation
B) Transfer learning with fine-tuning
C) Gradient clipping
D) Weight pruning
Answer: B
Explanation: Transfer learning typically freezes early layers (which capture generic features) and fine-tunes later layers on the new task, leveraging prior knowledge.

**Question 32. In a Generative Adversarial Network (GAN), what is the role of the discriminator?**
A) To generate realistic samples from noise
B) To map latent vectors to images
C) To distinguish between real and generated samples
D) To compute reconstruction loss
Answer: C
Explanation: The discriminator outputs a probability that an input image is real; it guides the generator to produce more realistic data.

**Question 33. Which loss function is typically used to train a vanilla autoencoder for image reconstruction?**
A) Binary Cross-Entropy
B) Categorical Cross-Entropy
C) Mean Squared Error
D) Hinge Loss
Answer: C
Explanation: Autoencoders aim to minimize the pixel-wise difference between input and output; MSE quantifies this reconstruction error effectively.

**Question 34. When saving a TensorFlow model for later deployment, which format preserves the full computation graph and variables?**
A) .h5 (Keras HDF5)
B) SavedModel directory
C) Pickle file
D) JSON architecture only
Answer: B
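
The 'patience' behaviour described in Question 29 can be sketched without any framework. The function below is a simplified, hypothetical model of the idea, not Keras's implementation: it ignores options such as a minimum improvement threshold or restoring the best weights, and the loss values are made up.

```python
def early_stopping_epoch(val_losses, patience):
    """Return the 0-based epoch at which training would stop, or None if it never stops."""
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best = loss      # improvement: remember it and reset the counter
            wait = 0
        else:
            wait += 1        # no improvement this epoch
            if wait >= patience:
                return epoch
    return None


# Hypothetical validation losses: the best value (0.8) is reached at epoch 1.
print(early_stopping_epoch([1.0, 0.8, 0.9, 0.85, 0.95], patience=2))  # 3
```

Here epochs 2 and 3 both fail to beat 0.8, so with a patience of 2 the run stops at epoch 3.
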

Answer: B
Explanation: Leaky ReLU replaces the zero gradient for negative inputs with a small slope (e.g., 0.01), mitigating “dying ReLU” problems.

**Question 38. In a neural network, which term describes the process of adjusting weights to minimize a loss function?**
A) Inference
B) Forward propagation
C) Backpropagation
D) Data preprocessing
Answer: C
Explanation: Backpropagation computes the gradients of the loss w.r.t. each parameter; an optimizer then uses those gradients to update the parameters.

**Question 39. Which of the following statements about the Softmax activation is true?**
A) It can be used for binary classification without modification
B) Its outputs sum to 1, representing a probability distribution over classes
C) It is identical to the sigmoid function for multi-class problems
D) It does not require a loss function like cross-entropy
Answer: B
Explanation: Softmax normalizes a vector of raw scores into probabilities that add up to 1, making it suitable for multi-class classification.

**Question 40. What is the primary advantage of using the Adam optimizer over plain SGD?**
A) Adam guarantees convergence to the global minimum
B) Adam automatically tunes the learning rate for each parameter
C) Adam does not require a learning rate hyperparameter
D) Adam eliminates the need for momentum
Answer: B
Explanation: Adam computes adaptive learning rates per parameter using estimates of first and second moments, often leading to faster convergence.

**Question 41. Which loss function is most appropriate for a multi-label classification problem where each instance can belong to multiple classes simultaneously?**
A) Categorical Cross-Entropy
B) Binary Cross-Entropy (applied per label)
C) Hinge Loss
D) Mean Absolute Error
Answer: B
Explanation: Binary Cross-Entropy treats each label independently, allowing multiple positive classes per sample.

**Question 42. In Keras, what does the argument return_sequences=True do in an LSTM layer?**
A) Returns only the final hidden state
B) Returns the hidden state at every time step
C) Returns the cell state instead of the hidden state
D) Enables bidirectional processing automatically
Answer: B
Explanation: Setting return_sequences=True makes the LSTM output a sequence (one vector per timestep), useful for stacking recurrent layers.

**Question 43. Which of the following is a common strategy to mitigate exploding gradients?**
A) Using a larger learning rate
B) Gradient clipping
C) Removing activation functions
D) Increasing batch size indefinitely
Answer: B
Explanation: Gradient clipping caps the norm of gradients, preventing them from growing excessively large during back-propagation.
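
As a rough illustration of the gradient clipping mentioned in Question 43, the sketch below rescales a gradient vector whenever its L2 norm exceeds a threshold. The helper `clip_by_norm` is a hypothetical plain-Python function written for this example, not a framework API, and the gradient values are made up.

```python
import math


def clip_by_norm(gradients, max_norm):
    """Rescale the gradient vector so that its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in gradients))
    if norm <= max_norm or norm == 0.0:
        return list(gradients)           # already small enough: leave unchanged
    scale = max_norm / norm
    return [g * scale for g in gradients]


large = clip_by_norm([3.0, 4.0], max_norm=1.0)   # norm 5.0 is scaled down to 1.0
small = clip_by_norm([0.3, 0.4], max_norm=1.0)   # norm 0.5 is left as-is
print(large, small)
```

Clipping by norm preserves the direction of the gradient and only limits its length, which is why it is often preferred over clipping each component independently.
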

**Question 47. Which of the following optimizers includes a decay term that reduces the learning rate over time automatically?**
A) Adam
B) RMSprop
C) Adagrad
D) SGD with learning-rate decay schedule
Answer: D
Explanation: While Adam adapts per-parameter rates, explicit decay of the base learning rate is commonly implemented with SGD (or via callbacks) to gradually lower it. Note that Adagrad's accumulated squared gradients also shrink its effective step size over time, but this is a side effect of its update rule rather than an explicit decay schedule.

**Question 48. What is the main purpose of using a learning-rate scheduler during training?**
A) To increase the number of epochs automatically
B) To adjust the learning rate based on epoch or performance, improving convergence
C) To change the optimizer type mid-training
D) To shuffle the dataset more frequently
Answer: B
Explanation: Schedulers modify the learning rate over time (e.g., step decay, cosine annealing), helping the optimizer escape plateaus and fine-tune near minima.

**Question 49. Which of the following statements about the “vanishing gradient” problem is true?**
A) It occurs only with ReLU activations
B) It is more severe in deep networks with sigmoid or tanh activations
C) It can be completely solved by increasing batch size
D) It is unrelated to the choice of optimizer
Answer: B
Explanation: Sigmoid and tanh have derivatives close to zero for large-magnitude inputs, so gradients shrink exponentially as they propagate back through many layers.

**Question 50. In a bidirectional LSTM, how are the forward and backward hidden states typically combined?**
A) They are added element-wise
B) They are concatenated along the feature dimension
C) Only the forward state is kept, the backward is discarded
D) The backward state overwrites the forward state
Answer: B
Explanation: Concatenation preserves information from both temporal directions, doubling the hidden size for subsequent layers.

**Question 51. Which regularization technique randomly drops entire neurons (i.e., whole feature maps) during training?**
A) Dropout
B) Spatial Dropout
C) L1 regularization
D) Batch Normalization
Answer: B
Explanation: Spatial Dropout zeroes out entire channels (feature maps) in convolutional layers, encouraging robustness across spatial features.

**Question 52. What does the “temperature” parameter control in a softmax function used for sampling?**
A) The number of classes considered
B) The steepness of the probability distribution, affecting randomness of sampling
C) The learning rate of the optimizer
D) The size of the hidden state vector
Answer: B
Explanation: Higher temperature smooths the distribution (more random sampling), while lower temperature sharpens it (more deterministic).

**Question 53. In TensorFlow, which API call converts a Keras model to a TensorFlow Lite model for mobile deployment?**
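
The temperature effect described in Question 52 above can be seen by dividing the logits by the temperature before applying softmax. This is a minimal sketch; the function name and the example logits are assumptions made for illustration.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Softmax over logits / temperature, computed in a numerically stable way."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [z / temperature for z in logits]
    m = max(scaled)                              # subtract the max to avoid overflow
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.1]                         # hypothetical raw scores
sharp = softmax_with_temperature(logits, temperature=0.5)
base = softmax_with_temperature(logits, temperature=1.0)
smooth = softmax_with_temperature(logits, temperature=2.0)
print(sharp, base, smooth)
```

Lower temperatures concentrate probability on the largest logit, while higher temperatures spread it more evenly across the classes.
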

C) Normalizing only the output layer
D) Applying dropout after each layer
Answer: B
Explanation: Layer normalization computes mean and variance over the features within a single training example, making it independent of batch size.

**Question 57. In a CNN, what is the effect of using depthwise separable convolutions compared to standard convolutions?**
A) Increases the number of parameters dramatically
B) Reduces computational cost by factorizing spatial and channel mixing
C) Prevents overfitting by adding regularization
D) Requires larger kernels to achieve the same receptive field
Answer: B
Explanation: Depthwise separable convolutions first apply a spatial filter per channel (depthwise) then combine channels with a pointwise 1×1 convolution, lowering FLOPs.

**Question 58. Which metric is most appropriate for evaluating a highly imbalanced binary classification problem?**
A) Accuracy
B) Mean Squared Error
C) Area Under the ROC Curve (AUC-ROC)
D) Categorical Cross-Entropy
Answer: C
Explanation: AUC-ROC assesses the trade-off between true positive and false positive rates across thresholds, so it remains far more informative than accuracy under class imbalance.

**Question 59. In Keras, what does setting shuffle=False in model.fit() affect?**
A) The order of batches is fixed, preserving temporal sequence for stateful RNNs
B) The optimizer will not update weights
C) Validation data will be shuffled instead
D) The learning rate schedule will be disabled
Answer: A
Explanation: Disabling shuffling maintains the original order of samples, which is required for stateful recurrent networks that rely on sequence continuity across batches.

**Question 60. Which of the following is a characteristic of the “AdamW” optimizer compared to vanilla Adam?**
A) It decouples weight decay from the gradient update
B) It uses a fixed learning rate throughout training
C) It does not compute second-moment estimates
D) It only works with convolutional layers
Answer: A
Explanation: AdamW applies weight decay directly to the weights, separately from the adaptive gradient update, instead of adding an L2 penalty to the gradient; this often improves generalization.

**Question 61. What does the term “epoch-wise learning rate decay” refer to?**
A) Reducing the learning rate after each batch
B) Decreasing the learning rate after a fixed number of epochs
C) Keeping the learning rate constant throughout training
D) Increasing the learning rate as training progresses
Answer: B
Explanation: Epoch-wise decay schedules lower the learning rate at predetermined epoch intervals, often using step or exponential decay.

**Question 62. Which of the following best explains why the ReLU activation can lead to sparse representations?**
A) It outputs negative values for half the inputs
B) It sets all negative activations to zero, deactivating neurons
C) It normalizes the output to have zero mean
D) It forces all activations to be exactly one
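
The epoch-wise decay described in Question 61 above can be written as a small schedule function. The sketch below assumes a step decay that halves the rate every 10 epochs; the function name, the drop factor, and the interval are illustrative choices rather than values from the exam guide.

```python
def step_decay(initial_lr, epoch, drop=0.5, epochs_per_drop=10):
    """Learning rate after `epoch` epochs, multiplied by `drop` every `epochs_per_drop` epochs."""
    return initial_lr * drop ** (epoch // epochs_per_drop)


# The rate stays constant within each 10-epoch interval, then steps down.
for epoch in (0, 9, 10, 19, 20):
    print(epoch, step_decay(0.1, epoch))
```

A function of this kind could be handed to a scheduler callback that queries the learning rate once per epoch, such as Keras's LearningRateScheduler.
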