













This is the midterm examination for the CS230: Deep Learning course offered in Spring Quarter 2021 at Stanford University. The exam contains 6 problems with a total of 105 points. The exam is open book, but collaboration with anyone else is strictly forbidden pursuant to The Stanford Honor Code. The document also includes instructions for completing the exam in LaTeX and the Stanford University Honor Code.














Problem                          Full Points    Your Score
1  Multiple Choice                    16
2  Short Answers                      16
3  Convolutional Architectures        20
4  L^1 regularization                 13
5  Backpropagation with GANs          25
6  Numpy Coding                       15
   Total                             105
The exam contains 21 pages including this cover page.
Name:
SUNETID: @stanford.edu
The Stanford University Honor Code: I attest that I have not given or received aid in this examination, and that I have done my share and taken an active part in seeing to it that others as well as myself uphold the spirit and letter of the Honor Code.
Signature:
Question 1 (Multiple Choice Questions, 16 points)
For each of the following questions, circle the letter of your choice. Each question has AT LEAST one correct option unless explicitly mentioned. No explanation is required.
(a) (2 points) Suppose you have a CNN model for image classification, with multiple Conv, Max pooling, ReLU activation layers, and a final Softmax output. Ignoring the bias and numerical precision issues, which of the statements below are true?
(i) Multiplying the weights by a factor of 10 during inference does not affect the prediction accuracy. (ii) Multiplying the weights by a factor of 10 during training does not affect training convergence. (iii) Multiplying the input data by a factor of 10 during inference does not affect the prediction accuracy. (iv) Subtracting the input data by its mean per channel during inference does not affect the prediction accuracy.
Solution: (i) (iii). For (ii), multiplying the weights will change the gradient/weight ratio. For (iv), subtracting by mean is a data normalization technique and will affect prediction accuracy.
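The reasoning behind (i) and (iii) is that bias-free Conv, Max pooling, and ReLU layers are all positively homogeneous, so scaling the weights or the input by a positive constant only rescales the logits and leaves the argmax unchanged. A minimal sketch of this, assuming a small bias-free fully connected ReLU network as a stand-in for the CNN (the network, shapes, and names below are hypothetical):

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def forward_logits(x, weights):
    """Bias-free ReLU network (hypothetical stand-in for the CNN); returns pre-softmax logits."""
    h = x
    for W in weights[:-1]:
        h = relu(h @ W)
    return h @ weights[-1]

rng = np.random.default_rng(0)
weights = [rng.normal(size=(8, 16)), rng.normal(size=(16, 16)), rng.normal(size=(16, 4))]
x = rng.normal(size=(32, 8))

# Predicted class = argmax of the logits (softmax is monotonic, so it does not change the argmax).
base_pred = forward_logits(x, weights).argmax(axis=1)
scaled_w_pred = forward_logits(x, [10.0 * W for W in weights]).argmax(axis=1)
scaled_x_pred = forward_logits(10.0 * x, weights).argmax(axis=1)

print(np.array_equal(base_pred, scaled_w_pred))  # weights scaled by 10
print(np.array_equal(base_pred, scaled_x_pred))  # input scaled by 10
```

Scaling changes the softmax probabilities (they become sharper), but not which class has the largest one.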
(b) (2 points) Select the methods that can mitigate exploding gradients.
(i) Using ReLU activation instead of sigmoid. (ii) Adding Batch Normalization layers. (iii) Applying gradient clipping. (iv) Using residual connections.
Solution: (ii) (iii). (i) and (iv) were proposed to combat vanishing gradients. (ii) can mitigate both vanishing and exploding gradients. (iii) bounds the gradient magnitude and so avoids exploding gradients.
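As an illustration of option (iii), gradient clipping is often implemented by rescaling all gradients when their combined norm exceeds a threshold. The sketch below is one common variant (clipping by global L2 norm); the function name and the threshold are assumptions chosen for illustration.

```python
import numpy as np

def clip_by_global_norm(grads, max_norm):
    """Rescale a list of gradient arrays so their combined L2 norm is at most max_norm."""
    total_norm = np.sqrt(sum(float(np.sum(g ** 2)) for g in grads))
    # Scale is 1 when the norm is already small enough, so small gradients pass through unchanged.
    scale = min(1.0, max_norm / (total_norm + 1e-12))
    return [g * scale for g in grads], total_norm

grads = [np.array([3.0, 4.0]), np.array([12.0])]   # combined norm is 13
clipped, norm_before = clip_by_global_norm(grads, max_norm=1.0)
norm_after = np.sqrt(sum(float(np.sum(g ** 2)) for g in clipped))
print(norm_before, norm_after)
```

The direction of the update is preserved; only its length is bounded.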
(c) (2 points) Select the statements that are true
(i) For a linear classifier, initializing all the weights and biases to zero will result in all the elements in the final W matrix to be the same. (ii) If we have a small dataset consisting of handwritten alphabets A-Z, and we wish to train a handwriting recognition model, it would be a good idea to augment the dataset by randomly flipping each image horizontally and vertically, given that we know that each letter is equally represented in the original dataset.
(ii) It better avoids local minima by keeping running gradient statistics. (iii) If the network has n parameters, momentum requires that we keep track of O(n) extra parameters (iv) None of the above
Solution: (ii), (iii).
(g) (2 points) Which of the following are true for early stopping during training:
(i) It may reduce the necessity to tune the hyperparameter for number of training epochs (ii) It increases model variance (iii) When accuracy reaches a trough on the validation set, we invoke early stopping (iv) It may dramatically speed up training at the cost of learning less optimal parameters
Solution: (i), (iv).
(h) (2 points) Which of the following are true for backpropagation:
(i) If the neural network has O(n) parameters, backpropagation is an O(n^2 ) operation (ii) Backpropagation works only if a computational graph has no directed cyclic paths (iii) We update the β parameter in Adam using backpropagation (iv) We update the γ parameter in Batch Normalization using backpropagation
Solution: (ii), (iv).
Question 2 (Short Answers, 16 points)
The questions in this section can be answered in 2-4 sentences. Please be concise in your responses.
(a) (2 points) You begin training a Neural Network, but the loss evolves to be completely flat. List two possible reasons for this.
Solution:
Solutions based on the regularization parameter being too high or uneven distribution of class labels in the dataset are not accepted.
(b) (2 points) Your CS 230 project is in collaboration with the California PD, and they require you to identify criminals, given their data. Since being imprisoned is a very severe punishment, it is very important for your deep learning system to not incorrectly identify the criminals, and simultaneously ensure that your city is as safe as possible. What evaluation metric would you choose and why?
Solution: For the model to be a good one, you want to be sure that the person you catch is a criminal (Precision) and you also want to capture as many criminals (Recall) as possible. The F1 score manages this tradeoff. Students who mention both precision and recall can also be given full credit.
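For concreteness, precision, recall, and the F1 score can be computed from confusion-matrix counts as sketched below. The counts in the usage example are made up for illustration.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 from true-positive, false-positive, false-negative counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2.0 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical counts: 8 criminals correctly flagged, 2 innocents flagged, 8 criminals missed.
p, r, f1 = precision_recall_f1(tp=8, fp=2, fn=8)
print(p, r, f1)
```

Because F1 is a harmonic mean, it is pulled toward the smaller of the two values, so a model cannot score well by optimizing only one of them.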
(c) (2 points) Although Pooling layers certainly cause a loss of information between Convolutional layers, why would we add Pooling layers to our network?
Solution: Pooling layers cause spatial dimensions to shrink and allow us to use fewer parameters to obtain smaller and smaller hidden representations of the input.
(d) (2 points) What does it mean for your model to have high variance? Give one possible way to reduce variance in your model.
Solution: It means that your model has overfit the train data/does not exhibit generalizability. Accept any answer that helps with generalizability such as adding more data, adding regularization/dropout, creating a smaller model, etc.
(e) (2 points) List one advantage and one disadvantage of having a small batch size for training.
Question 3 (Convolutional Architectures, 20 points)
Say you have an input image whose shape is 128 × 128 × 3. You are deciding on the hyperparameters for a Convolutional Neural Network; in particular, you are in the process of determining the settings for the first Convolutional layer. Compute the output activation volume dimensions and number of parameters of each of the possible settings of the first Convolutional layer, given the input has the shape described above. You can write the activation shapes in the format (H, W, C) where H, W, C are the height, width, and channel dimensions, respectively.
i. (2 points) The first Convolutional layer has a stride of 1, a filter size of 3, input padding of 0, and 64 filters.
Solution: Activation volume dimensions: 126 × 126 × 64, since (128 − 3)/1 + 1 = 126. Number of parameters: (3 ∗ 3 ∗ 3 + 1) ∗ 64 = 1792
ii. (2 points) The first Convolutional layer has a stride of 1, a filter size of 5, input padding of 2, and 16 filters.
Solution: Activation volume dimensions: 128 × 128 × 16 Number of parameters: (5 ∗ 5 ∗ 3 + 1) ∗ 16 = 1216
iii. (2 points) The first Convolutional layer has a stride of 2, a filter size of 2, input padding of 0, and 32 filters.
Solution: Activation volume dimensions: 64 × 64 × 32 Number of parameters: (2 ∗ 2 ∗ 3 + 1) ∗ 32 = 416
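The arithmetic for parts i to iii follows the standard formulas: output size = floor((input + 2 * padding - filter) / stride) + 1, and parameters = (filter * filter * input channels + 1) * number of filters, where the +1 is the bias. A small sketch, with a hypothetical helper name, applied to the three settings:

```python
def conv_output(h, w, c_in, filter_size, stride, padding, n_filters):
    """Return ((H, W, C), n_params) for a square-filter conv layer with one bias per filter."""
    h_out = (h + 2 * padding - filter_size) // stride + 1
    w_out = (w + 2 * padding - filter_size) // stride + 1
    n_params = (filter_size * filter_size * c_in + 1) * n_filters
    return (h_out, w_out, n_filters), n_params

settings = {
    "i": dict(filter_size=3, stride=1, padding=0, n_filters=64),
    "ii": dict(filter_size=5, stride=1, padding=2, n_filters=16),
    "iii": dict(filter_size=2, stride=2, padding=0, n_filters=32),
}
for name, cfg in settings.items():
    print(name, conv_output(128, 128, 3, **cfg))
# i ((126, 126, 64), 1792)
# ii ((128, 128, 16), 1216)
# iii ((64, 64, 32), 416)
```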
Now that you have determined the output shapes and number of parameters for these configurations, you are going to create a deeper CNN. Say you create a CNN made of three identical modules, each of which consists of: a Convolutional layer, a Max-Pooling layer, and a ReLU layer. All Pooling layers will have a stride of 2 and a width/height of 2. For example, say we define the Convolutional layer to have stride 1, filter size 1, input padding of 0, and 8 filters. Then the module architecture would be:
Three such modules make up the entire network. Given the following Convolutional hyperparameters, compute the output activation volume dimensions after passing the input through the entire network, as well as the number of parameters in the entire network.
iv. (4 points) The Conv layers have a stride of 1, a filter size of 3, input padding of 0, and 64 filters.
Solution: Activation volume dimensions: 14 × 14 × 64 Number of parameters: (3 ∗ 3 ∗ 3 + 1) ∗ 64 + (3 ∗ 3 ∗ 64 + 1) ∗ 64 ∗ 2 = 75, 648
v. (4 points) The Conv layers have a stride of 1, a filter size of 5, input padding of 2, and 16 filters.
Solution: Activation volume dimensions: 16 × 16 × 16 Number of parameters: (5 ∗ 5 ∗ 3 + 1) ∗ 16 + (5 ∗ 5 ∗ 16 + 1) ∗ 16 ∗ 2 = 14, 048
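The answers to parts iv and v can be checked by tracing the spatial size and channel count through the three Conv, Max-Pool, ReLU modules. The sketch below assumes a square input and uses hypothetical helper names; pooling and ReLU contribute no parameters.

```python
def out_size(size, filter_size, stride, padding):
    """Spatial output size of a conv or pooling window along one dimension."""
    return (size + 2 * padding - filter_size) // stride + 1

def three_module_network(size, c_in, filter_size, stride, padding, n_filters, n_modules=3):
    """Trace shape and parameter count through n_modules of Conv -> 2x2 MaxPool (stride 2) -> ReLU."""
    total_params = 0
    for _ in range(n_modules):
        total_params += (filter_size * filter_size * c_in + 1) * n_filters  # conv weights + biases
        size = out_size(size, filter_size, stride, padding)                 # conv
        size = out_size(size, 2, 2, 0)                                      # max pool
        c_in = n_filters                                                    # next conv sees n_filters channels
    return (size, size, n_filters), total_params

print(three_module_network(128, 3, filter_size=3, stride=1, padding=0, n_filters=64))
# ((14, 14, 64), 75648)
print(three_module_network(128, 3, filter_size=5, stride=1, padding=2, n_filters=16))
# ((16, 16, 16), 14048)
```

In part iv the spatial size goes 128, 126, 63, 61, 30, 28, 14; the pooling of the odd size 61 rounds down to 30.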
Question 4 (L^1 regularization (Lasso), 13 points)
L^1 regularization on the model parameter w is defined as

$$\|w\|_1 = \sum_i |w_i|$$

In this question, you are asked to apply L^1 regularization to a neural network model, of which the original objective function is defined as J(w; X, y). The regularized objective function after adding the L^1 regularization term becomes:

$$\tilde{J}(w; X, y) = J(w; X, y) + \alpha \|w\|_1$$
i. (3 points) Write down the corresponding gradient of the regularized objective function J˜. Hint: your answer should include an element-wise sign function sign(x).
Solution:
$$\nabla_w \tilde{J}(w; X, y) = \nabla_w J(w; X, y) + \alpha \, \mathrm{sign}(w)$$
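In NumPy, this gradient is a one-line modification of the unregularized gradient. The sketch below assumes the gradient of J has already been computed elsewhere and is passed in as `grad_J` (a hypothetical name); `np.sign` returns 0 at w_i = 0, which is one valid choice of subgradient there.

```python
import numpy as np

def l1_regularized_grad(grad_J, w, alpha):
    """Gradient (subgradient at zeros) of J(w) + alpha * ||w||_1, given the gradient of J."""
    return grad_J + alpha * np.sign(w)

w = np.array([1.0, -2.0, 0.0])
grad_J = np.zeros(3)            # placeholder for the unregularized gradient
print(l1_regularized_grad(grad_J, w, alpha=0.1))
```

Note that the penalty's contribution has constant magnitude alpha for every nonzero weight, regardless of how large the weight is.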
ii. (6 points) It could be difficult to get a clean algebraic solution to the previous gradient expression. To study how this L^1 regularization term could affect the converged weights, we make the following simplifications:

$$\hat{J}(w; X, y) = J(w^*) + \frac{1}{2} (w - w^*)^T H (w - w^*)$$

where w^* is the optimal parameter for the objective function without the regularization term and H is the Hessian matrix.
Now, write down an analytical solution for each element w_i of w when the gradient equals zero. Your answer should be expressed with the weights w_i^* (w^* is the optimal weight vector for the unregularized objective function), the Hessian matrix elements H_{i,i}, and the coefficient α. Hint: the gradient of |x| at x = 0 can take any value in [−1, 1].
Solution:
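$$w_i = \mathrm{sign}(w_i^*) \max\left\{ |w_i^*| - \frac{\alpha}{H_{i,i}},\ 0 \right\}$$

It is okay to write down the expression under different conditions like:

$$w_i = w_i^* + \frac{\alpha}{H_{i,i}}, \quad \text{when } w_i^* \le -\frac{\alpha}{H_{i,i}}$$

$$w_i = 0, \quad \text{when } \frac{\alpha}{H_{i,i}} > w_i^* > -\frac{\alpha}{H_{i,i}}$$

$$w_i = w_i^* - \frac{\alpha}{H_{i,i}}, \quad \text{when } w_i^* \ge \frac{\alpha}{H_{i,i}}$$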
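This closed form is the soft-thresholding operator. The sketch below evaluates it and compares it against a brute-force grid minimization of the per-coordinate objective 0.5 * H_ii * (w_i - w_i^*)^2 + alpha * |w_i|. It assumes a diagonal Hessian with positive entries (the solution above only uses the diagonal elements); the numeric values and function names are made up for illustration.

```python
import numpy as np

def l1_solution(w_star, h_diag, alpha):
    """Soft-thresholding: shrink each w_i^* toward zero by alpha / H_ii, clipping at zero."""
    return np.sign(w_star) * np.maximum(np.abs(w_star) - alpha / h_diag, 0.0)

def objective(w, w_star_i, h_ii, alpha):
    """Per-coordinate regularized quadratic objective (assumes a diagonal Hessian)."""
    return 0.5 * h_ii * (w - w_star_i) ** 2 + alpha * np.abs(w)

w_star = np.array([2.0, 0.3, -1.5])   # hypothetical unregularized optimum
h_diag = np.array([1.0, 2.0, 4.0])    # hypothetical positive Hessian diagonal
alpha = 1.0

w_closed = l1_solution(w_star, h_diag, alpha)

# Brute-force check: minimize each coordinate's objective over a fine grid.
grid = np.linspace(-3.0, 3.0, 60001)
w_grid = np.array([grid[np.argmin(objective(grid, ws, h, alpha))]
                   for ws, h in zip(w_star, h_diag)])

print(w_closed)   # the middle weight is set exactly to zero
print(w_grid)
```

The middle coordinate has |w_i^*| = 0.3 below its threshold alpha / H_ii = 0.5, so it is driven exactly to zero, while the other two are shifted toward zero by their thresholds.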
iii. (4 points) Given your answer to previous question, explain how L^1 regularization will affect the weights differently from L^2 regularization. You must discuss large weights and small weights separately.
Solution:
Question 5 (Backpropagation with GANs, 25 points)
In this question, we will work out backpropagation with Generative Adversarial Networks (GANs).
Recall that a GAN consists of a Generator and a Discriminator playing a game. The Generator takes as input a random sample from some noise distribution (e.g., Gaussian), and its goal is to produce something from a target distribution (which we observe via samples from this distribution). The Discriminator takes as input a batch consisting of a mix of samples from the true dataset and the Generator’s output, and its goal is to correctly classify whether its input comes from the true dataset or the Generator.
Definitions:
The log likelihood of the output produced by the discriminator is:

$$L(\theta_d, \theta_g) = \frac{1}{n} \sum_{i=1}^{n} \left[ \log D(X_i) + \log\left(1 - D(G(Z_i))\right) \right]$$
The training of such a GAN system proceeds as follows; given the generator’s parameters, the discriminator is optimized to maximize the above likelihood. Then, given the discriminator’s parameters, the generator is optimized to minimize the above likelihood. This process is iteratively repeated. Once training completes, we only require the generator to generate samples from our distribution of interest; we sample a point from our noise distribution and map it to a sample using our generator.
The Discriminator (The generator architecture is defined analogously)
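i. The goal of the discriminator is to maximize the above likelihood function. Write down $\nabla_{\theta_d} L(X; \theta_d, \theta_g)$ in terms of $\nabla_{\theta_d} D(\cdot)$. (3 points)
Solution:

$$\nabla_{\theta_d} L(\theta_d, \theta_g) = \frac{1}{n} \sum_{i=1}^{n} \left[ \frac{\nabla_{\theta_d} D(X_i)}{D(X_i)} - \frac{\nabla_{\theta_d} D(G(Z_i))}{1 - D(G(Z_i))} \right]$$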
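This expression can be sanity-checked numerically. The sketch below assumes a toy logistic discriminator D(x) = sigmoid(theta^T x), treats the generator outputs G(Z_i) as a fixed array `x_fake`, and compares the analytic gradient with central finite differences. The discriminator form, shapes, and names are hypothetical and are not the architecture defined in the exam.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def D(theta, x):
    """Toy logistic discriminator (an assumption for illustration)."""
    return sigmoid(x @ theta)

def likelihood(theta, x_real, x_fake):
    return np.mean(np.log(D(theta, x_real)) + np.log(1.0 - D(theta, x_fake)))

def grad_D(theta, x):
    """Per-sample gradient of D with respect to theta: sigma * (1 - sigma) * x."""
    d = D(theta, x)
    return (d * (1.0 - d))[:, None] * x

def grad_likelihood(theta, x_real, x_fake):
    """Analytic gradient: mean of grad D(X)/D(X) - grad D(G(Z))/(1 - D(G(Z)))."""
    term_real = grad_D(theta, x_real) / D(theta, x_real)[:, None]
    term_fake = grad_D(theta, x_fake) / (1.0 - D(theta, x_fake))[:, None]
    return np.mean(term_real - term_fake, axis=0)

rng = np.random.default_rng(0)
theta = rng.normal(size=3)
x_real = rng.normal(size=(5, 3)) + 1.0
x_fake = rng.normal(size=(5, 3)) - 1.0   # stands in for G(Z_i)

eps = 1e-6
numeric = np.array([
    (likelihood(theta + eps * e, x_real, x_fake) - likelihood(theta - eps * e, x_real, x_fake)) / (2 * eps)
    for e in np.eye(3)
])
analytic = grad_likelihood(theta, x_real, x_fake)
print(np.max(np.abs(numeric - analytic)))
```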
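ii. Write down $\partial L(\theta_d, \theta_g) / \partial z_d^{L_d}$, taking help from your answer in the previous subpart. Remember that the activation function in the last layer of the discriminator is a sigmoid function as the output of the discriminator is a probability. (4 points)
Solution:

$$\frac{\partial L(\theta_d, \theta_g)}{\partial z_d^{L_d}} = \frac{1}{n} \sum_{i=1}^{n} \left[ \frac{\sigma(z_d^{L_d}(X_i)) \left(1 - \sigma(z_d^{L_d}(X_i))\right)}{D(X_i)} - \frac{\sigma(z_d^{L_d}(G(Z_i))) \left(1 - \sigma(z_d^{L_d}(G(Z_i)))\right)}{1 - D(G(Z_i))} \right]$$

OR

$$\frac{\partial L(\theta_d, \theta_g)}{\partial z_d^{L_d}} = \frac{1}{n} \sum_{i=1}^{n} \left[ \left(1 - \sigma(z_d^{L_d}(X_i))\right) - \sigma(z_d^{L_d}(G(Z_i))) \right]$$

OR

$$\frac{\partial L(\theta_d, \theta_g)}{\partial z_d^{L_d}} = \frac{1}{n} \sum_{i=1}^{n} \left[ \left(1 - D(X_i)\right) - D(G(Z_i)) \right]$$

OR, partial credit (2 points) for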
Solution:

$$\frac{\partial L(\theta_d, \theta_g)}{\partial g(z_i; \theta_g)} = (w_d^1)^T \frac{\partial L(\theta_d, \theta_g)}{\partial z_d^1}$$
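v. Now we move to the generator. The goal of the generator is to minimize the above likelihood function. Write down $\nabla_{\theta_g} L(\theta_d, \theta_g)$ in terms of $\nabla_{\theta_g} g(\cdot)$ and in terms of $\partial L(\theta_d, \theta_g) / \partial g(z_i; \theta_g)$ calculated in the previous part. (5 points)
Solution:

$$\nabla_{\theta_g} L(\theta_d, \theta_g) = \frac{1}{n} \sum_{i=1}^{n} \nabla_{\theta_g} g(Z_i) \cdot \frac{\partial L(\theta_d, \theta_g)}{\partial g(z_i; \theta_g)}$$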
vi. Write down a simple gradient based update rule (don’t use RMSprop or Momentum, and assume no regularization) for $\theta_d^{t+1}$ in terms of $\nabla_{\theta_d} L(\theta_d, \theta_g)$, fixed learning rate α and the current parameters $\theta_d^t$. (1 point)
Solution: $\theta_d^{t+1} = \theta_d^t + \alpha \nabla_{\theta_d} L(\theta_d, \theta_g)$
vii. Write down a simple gradient based update rule (don’t use RMSprop or Momentum, and assume no regularization) for $\theta_g^{t+1}$ in terms of $\nabla_{\theta_g} L(\theta_d, \theta_g)$, fixed learning rate α and the current parameters $\theta_g^t$. (1 point)
Solution: $\theta_g^{t+1} = \theta_g^t - \alpha \nabla_{\theta_g} L(\theta_d, \theta_g)$
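The only difference between the two rules is the sign: the discriminator takes a gradient ascent step (it maximizes L) and the generator takes a gradient descent step (it minimizes L). A minimal sketch, with hypothetical function names and made-up gradient values:

```python
import numpy as np

def discriminator_step(theta_d, grad_d, lr):
    """Gradient ascent: move theta_d in the direction that increases L."""
    return theta_d + lr * grad_d

def generator_step(theta_g, grad_g, lr):
    """Gradient descent: move theta_g in the direction that decreases L."""
    return theta_g - lr * grad_g

theta = np.array([1.0, -1.0])
grad = np.array([0.5, 0.5])     # placeholder gradient of L
print(discriminator_step(theta, grad, lr=0.1))
print(generator_step(theta, grad, lr=0.1))
```

In the training procedure described earlier, these two steps alternate, each holding the other network's parameters fixed.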
viii. Now assume you decide to (lazily) use a simplified version of the objective, which we call “Likelihood Version”, just to test how it works. Define the following alternative likelihood function.

$$L'(\theta_d, \theta_g) = \frac{1}{n} \sum_{i=1}^{n} \left[ D(X_i) + D(G(Z_i)) \right]$$
Note that this likelihood is high when the original likelihood is high, and low otherwise, in general, so it is possible that it works. Write down ∇θd L′(θd, θg) in terms of ∇θd D(.) (3 points)
Solution:

$$\nabla_{\theta_d} L'(\theta_d, \theta_g) = \frac{1}{n} \sum_{i=1}^{n} \left[ \nabla_{\theta_d} D(X_i) - \nabla_{\theta_d} D(G(Z_i)) \right]$$

You get full credit if you fixed the typo in this likelihood (there is a minus instead of a plus).
It turns out this new Likelihood (along with gradient clipping) ends up minimizing the Earth Mover (Wasserstein) distance between the target distribution and your generated samples, and the GAN optimized as such is called a WGAN. WGANs have been shown to work better than standard GANs in many cases, and are often preferred to GANs.
Computing Attention Scores. Now, we compute attention scores for each pixel in the image following Figure 2. At each index (i, j), i ∈ H, j ∈ W, we consider the query array $q_{ij}$ as the pixel array of the image across all channels ($q_{ij} \in \mathbb{R}^{iC}$) and compute the corresponding score $y_{ij}$ using the following equation. Hint: You might want to look at np.squeeze.

$$y_{ij} = \sum_{a,b=(1,1)}^{(3,3)} \mathrm{softmax}\left(q_{ij}^T k_{a,b}\right) v_{a,b}$$

This process is very similar to convolutions, but instead of convolving with a filter/kernel, we convolve the image with the key kernel, compute the softmax output, and compute the final score by multiplying and summing with the value kernel. For simplicity, assume that the model parameters k and v of the given shapes are provided. In this problem, for each output channel, we take key kernel k of size 3 × 3 × iC and value kernel v of size 3 × 3 × 1, and $k_{a,b} \in \mathbb{R}^{iC}$.
Figure 2: Computation of attention scores for a single pixel across input channels
Given the image, and kernels k and v, you must use NumPy operations to compute the attention scores. Note that comparing attention score calculation to convolution operations, the stride size is 1, and there is no padding.
Solution:
Number of trainable parameters = 3 ∗ 3 ∗ iC ∗ oC + 3 ∗ 3 ∗ 1 ∗ oC = 180
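The score equation for a single pixel and a single output channel can be sketched in NumPy as follows. This is an illustration under stated assumptions, not the exam's reference solution: it assumes the softmax is taken over the nine kernel positions (a, b), it handles only one query vector at a time, and the function names and the channel counts in the usage example are hypothetical.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - np.max(x))
    return e / e.sum()

def attention_score(q, k, v):
    """Score y_ij for one pixel: q has shape (iC,), k has shape (3, 3, iC), v has shape (3, 3, 1)."""
    logits = k @ q                                        # (3, 3): q^T k_ab for every position (a, b)
    weights = softmax(logits.ravel()).reshape(3, 3)       # assumption: softmax over the nine positions
    return float(np.sum(weights * np.squeeze(v, axis=-1)))

def n_trainable_params(iC, oC):
    """Key kernel (3*3*iC) plus value kernel (3*3*1) per output channel."""
    return 3 * 3 * iC * oC + 3 * 3 * 1 * oC

iC = 3                                                    # hypothetical number of input channels
q = np.ones(iC)
k = np.zeros((3, 3, iC))                                  # all-zero keys give uniform attention weights
v = np.arange(9, dtype=float).reshape(3, 3, 1)
print(attention_score(q, k, v))                           # uniform weights, so this is the mean of v: 4.0
print(n_trainable_params(iC=3, oC=5))                     # one hypothetical (iC, oC) pair that gives 180
```

Computing the full output would repeat `attention_score` over every pixel position and every output channel, each channel with its own k and v.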