CS230: Deep Learning Midterm Examination - Spring Quarter 2021, Exams of Machine Learning

Stanford University Machine Learning

The midterm examination for the CS230: Deep Learning course offered in Spring Quarter 2021 at Stanford University. The exam contains 6 problems with a total of 105 points. The exam is open book, but collaboration with anyone else is strictly forbidden pursuant to The Stanford Honor Code. The document also includes instructions for completing the exam in LATEX and the Stanford University Honor Code.

Typology: Exams

2020/2021

Uploaded on 05/11/2023

sheetal_101 🇺🇸


CS230: Deep Learning

Spring Quarter 2021

Stanford University

Midterm Examination

Suggested duration: 180 minutes

Problem Full Points Your Score

1 Multiple Choice 16

2 Short Answers 16

3 Convolutional Architectures 20

4 L^1 regularization 13

5 Backpropagation with GANs 25

6 Numpy Coding 15

Total 105

The exam contains 21 pages including this cover page.

• If you wish to complete the midterm in LaTeX, please download the project source’s ZIP file here. (The Stanford Box link, just in case you face issues with the hyperlink: https://stanford.box.com/s/9dhx0l4jaqmk3o1fk3egfuisbvbx69k1 )

• This exam is open book, but collaboration with anyone else, either in person or online, is strictly forbidden pursuant to The Stanford Honor Code.

• In all cases, and especially if you’re stuck or unsure of your answers, explain your work, including showing your calculations and derivations! We’ll give partial credit for good explanations of what you were trying to do.

Name:

SUNETID: @stanford.edu

The Stanford University Honor Code:

I attest that I have not given or received aid in this examination, and that I have done my share and taken an active part in seeing to it that others as well as myself uphold the spirit and letter of the Honor Code.

Signature:


Question 1 (Multiple Choice Questions, 16 points)

For each of the following questions, circle the letter of your choice. Each question has AT LEAST one correct option unless explicitly mentioned. No explanation is required.

(a) (2 points) Suppose you have a CNN model for image classification, with multiple Conv, Max pooling, ReLU activation layers, and a final Softmax output. Ignoring the bias and numerical precision issues, which of the statements below are true?

(i) Multiplying the weights by a factor of 10 during inference does not affect the prediction accuracy.
(ii) Multiplying the weights by a factor of 10 during training does not affect training convergence.
(iii) Multiplying the input data by a factor of 10 during inference does not affect the prediction accuracy.
(iv) Subtracting the input data by its mean per channel during inference does not affect the prediction accuracy.

Solution: (i) (iii). For (ii), multiplying the weights will change the gradient/weight ratio. For (iv), subtracting by mean is a data normalization technique and will affect prediction accuracy.
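The reasoning behind (i) and (iii) can be checked numerically. The sketch below is an illustration only, not part of the exam: it assumes a tiny bias-free fully connected ReLU network (the name `tiny_relu_net` and all shapes are hypothetical) standing in for the CNN. Because ReLU and max pooling are positively homogeneous, scaling the weights or the input by 10 rescales the logits by a positive factor and leaves the argmax unchanged.

```python
import numpy as np

def tiny_relu_net(x, weights):
    """Bias-free stack of linear layers with ReLU in between; returns logits."""
    h = x
    for w in weights[:-1]:
        h = np.maximum(h @ w, 0.0)
    return h @ weights[-1]

rng = np.random.default_rng(0)
weights = [
    rng.standard_normal((8, 16)),
    rng.standard_normal((16, 16)),
    rng.standard_normal((16, 4)),
]
x = rng.standard_normal((32, 8))

# Predicted class for each example under the original setting.
base = tiny_relu_net(x, weights).argmax(axis=1)
# (i) every weight matrix multiplied by 10: logits scale by 10**3.
scaled_w = tiny_relu_net(x, [10.0 * w for w in weights]).argmax(axis=1)
# (iii) input multiplied by 10: logits scale by 10.
scaled_x = tiny_relu_net(10.0 * x, weights).argmax(axis=1)

print(np.array_equal(base, scaled_w), np.array_equal(base, scaled_x))
```

Softmax is monotonic in the logits, so the predicted class is the argmax of the logits and the check can skip the softmax itself.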

(b) (2 points) Select the methods that can mitigate exploding gradients.

(i) Using ReLU activation instead of sigmoid.
(ii) Adding Batch Normalization layers.
(iii) Applying gradient clipping.
(iv) Using residual connections.

Solution: (ii) (iii). (i) and (iv) are proposed to combat the vanishing gradient problem. (ii) can mitigate both vanishing and exploding gradients. (iii) can avoid exploding gradients.
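As an illustration of option (iii), here is a minimal sketch of gradient clipping by global norm. The function name `clip_by_global_norm` and the threshold are assumptions made for this example; the exam does not prescribe a particular clipping rule.

```python
import numpy as np

def clip_by_global_norm(grads, max_norm):
    """Rescale a list of gradient arrays so their joint L2 norm is at most max_norm."""
    total_norm = np.sqrt(sum(float(np.sum(g ** 2)) for g in grads))
    # Scale is 1 when the norm is already small enough, otherwise max_norm / total_norm.
    scale = min(1.0, max_norm / (total_norm + 1e-12))
    return [g * scale for g in grads], total_norm

# A deliberately huge gradient, as one might see when gradients explode.
exploding = [np.array([300.0, 400.0])]
clipped, norm_before = clip_by_global_norm(exploding, max_norm=1.0)
norm_after = float(np.linalg.norm(clipped[0]))
```

Clipping keeps the gradient direction and only bounds its magnitude, so a single bad step cannot move the parameters arbitrarily far.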

(c) (2 points) Select the statements that are true

(i) For a linear classifier, initializing all the weights and biases to zero will result in all the elements in the final W matrix to be the same.
(ii) If we have a small dataset consisting of handwritten alphabets A-Z, and we wish to train a handwriting recognition model, it would be a good idea to augment the dataset by randomly flipping each image horizontally and vertically, given that we know that each letter is equally represented in the original dataset.

(ii) It better avoids local minima by keeping running gradient statistics.
(iii) If the network has n parameters, momentum requires that we keep track of O(n) extra parameters.
(iv) None of the above.

Solution: (ii), (iii).

(g) (2 points) Which of the following are true for early stopping during training:

(i) It may reduce the necessity to tune the hyperparameter for number of training epochs.
(ii) It increases model variance.
(iii) When accuracy reaches a trough on the validation set, we invoke early stopping.
(iv) It may dramatically speed up training at the cost of learning less optimal parameters.

Solution: (i), (iv).

(h) (2 points) Which of the following are true for backpropagation:

(i) If the neural network has O(n) parameters, backpropagation is an O(n^2) operation.
(ii) Backpropagation works only if a computational graph has no directed cyclic paths.
(iii) We update the β parameter in Adam using backpropagation.
(iv) We update the γ parameter in Batch Normalization using backpropagation.

Solution: (ii), (iv).

Question 2 (Short Answers, 16 points)

The questions in this section can be answered in 2-4 sentences. Please be concise in your responses.

(a) (2 points) You begin training a Neural Network, but the loss evolves to be completely flat. List two possible reasons for this.

Solution:

It is possible that the weights are incorrectly initialized.
It is also possible that the learning rate is too low.
Some other answers may also be possible (e.g., X not being correlated with Y at all).

Solutions based on the regularization parameter being too high or uneven distribution of class labels in the dataset are not accepted.

(b) (2 points) Your CS 230 project is in collaboration with the California PD, and they require you to identify criminals, given their data. Since being imprisoned is a very severe punishment, it is very important for your deep learning system to not incorrectly identify the criminals, and simultaneously ensure that your city is as safe as possible. What evaluation metric would you choose and why?

Solution: For the model to be a good one, you want to be sure that the person you catch is a criminal (Precision) and you also want to capture as many criminals (Recall) as possible. The F1 score manages this tradeoff. Students who mention both precision and recall can also be given full credit.
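To make the tradeoff concrete, the following sketch computes precision, recall, and F1 from binary labels. The helper name `precision_recall_f1` and the toy labels are made up for illustration and are not exam data.

```python
def precision_recall_f1(y_true, y_pred):
    """Compute precision, recall and F1 for binary labels (1 = positive class)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean, so it is low whenever either component is low.
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1

# Toy example: 3 true positives in the data, the model flags 3 people, 2 of them correctly.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
```

Here one false positive lowers precision and one missed positive lowers recall, and F1 summarizes both in a single number.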

(c) (2 points) Although Pooling layers certainly cause a loss of information between Convolutional layers, why would we add Pooling layers to our network?

Solution: Pooling layers cause spatial dimensions to shrink and allow us to use fewer parameters to obtain smaller and smaller hidden representations of the input.

(d) (2 points) What does it mean for your model to have high variance? Give one possible way to reduce variance in your model.

Solution: It means that your model has overfit the train data/does not exhibit generalizability. Accept any answer that helps with generalizability such as adding more data, adding regularization/dropout, creating a smaller model, etc.

(e) (2 points) List one advantage and one disadvantage of having a small batch size for training.

Question 3 (Convolutional Architectures, 20 points)

Say you have an input image whose shape is 128 × 128 × 3. You are deciding on the hyperparameters for a Convolutional Neural Network; in particular, you are in the process of determining the settings for the first Convolutional layer. Compute the output activation volume dimensions and number of parameters of each of the possible settings of the first Convolutional layer, given the input has the shape described above. You can write the activation shapes in the format (H, W, C) where H, W, C are the height, width, and channel dimensions, respectively.

i. (2 points) The first Convolutional layer has a stride of 1, a filter size of 3, input padding of 0, and 64 filters.

Solution: Activation volume dimensions: 126 × 126 × 64 Number of parameters: (3 ∗ 3 ∗ 3 + 1) ∗ 64 = 1792

ii. (2 points) The first Convolutional layer has a stride of 1, a filter size of 5, input padding of 2, and 16 filters.

Solution: Activation volume dimensions: 128 × 128 × 16 Number of parameters: (5 ∗ 5 ∗ 3 + 1) ∗ 16 = 1216

iii. (2 points) The first Convolutional layer has a stride of 2, a filter size of 2, input padding of 0, and 32 filters.

Solution: Activation volume dimensions: 64 × 64 × 32 Number of parameters: (2 ∗ 2 ∗ 3 + 1) ∗ 32 = 416
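The three single-layer settings can be checked with the standard output-size formula floor((n + 2p − f) / s) + 1 and the parameter count (f · f · C_in + 1) · C_out. The helper names below are hypothetical and exist only for this sketch.

```python
def conv_output_shape(h, w, filter_size, stride, padding, num_filters):
    """Output (H, W, C) of a square-filter Conv layer applied to an h x w input."""
    out_h = (h + 2 * padding - filter_size) // stride + 1
    out_w = (w + 2 * padding - filter_size) // stride + 1
    return out_h, out_w, num_filters

def conv_num_params(filter_size, in_channels, num_filters):
    """Weights plus one bias per filter."""
    return (filter_size * filter_size * in_channels + 1) * num_filters

# Settings i, ii, iii for the 128 x 128 x 3 input: (filter, stride, padding, filters).
settings = {"i": (3, 1, 0, 64), "ii": (5, 1, 2, 16), "iii": (2, 2, 0, 32)}
for name, (f, s, p, n) in settings.items():
    shape = conv_output_shape(128, 128, f, s, p, n)
    params = conv_num_params(f, 3, n)
    print(name, shape, params)
```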

Now that you have determined the output shapes and number of parameters for these configurations, you are going to create a deeper CNN. Say you create a CNN made of three identical modules, each of which consists of: a Convolutional layer, a Max-Pooling layer, and a ReLU layer. All Pooling layers will have a stride of 2 and a width/height of 2. For example, say we define the Convolutional layer to have stride 1, filter size 1, input padding of 0, and 8 filters. Then the module architecture would be:

1 × 1 × 8 Conv with stride 1 and 0 padding
2 × 2 Max-Pool with stride 2
ReLU

Three such modules make up the entire network. Given the following Convolutional hyperparameters, compute the output activation volume dimensions after passing the input through the entire network, as well as the number of parameters in the entire network.

iv. (4 points) The Conv layers have a stride of 1, a filter size of 3, input padding of 0, and 64 filters.

Solution: Activation volume dimensions: 14 × 14 × 64 Number of parameters: (3 ∗ 3 ∗ 3 + 1) ∗ 64 + (3 ∗ 3 ∗ 64 + 1) ∗ 64 ∗ 2 = 75,648

v. (4 points) The Conv layers have a stride of 1, a filter size of 5, input padding of 2, and 16 filters.

Solution: Activation volume dimensions: 16 × 16 × 16 Number of parameters: (5 ∗ 5 ∗ 3 + 1) ∗ 16 + (5 ∗ 5 ∗ 16 + 1) ∗ 16 ∗ 2 = 14,048
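The stacked case follows by applying the same formulas module by module. This sketch assumes the module described above (Conv, then 2 × 2 Max-Pool with stride 2, then ReLU) and uses a hypothetical helper `network_shape_and_params`; pooling and ReLU contribute no parameters.

```python
def network_shape_and_params(h, w, c, filter_size, stride, padding,
                             num_filters, num_modules=3):
    """Trace the activation shape and count parameters through Conv -> MaxPool -> ReLU modules."""
    total_params = 0
    for _ in range(num_modules):
        # Conv layer: weights plus one bias per filter.
        total_params += (filter_size * filter_size * c + 1) * num_filters
        h = (h + 2 * padding - filter_size) // stride + 1
        w = (w + 2 * padding - filter_size) // stride + 1
        c = num_filters
        # 2 x 2 max pooling with stride 2 (no parameters); ReLU keeps the shape.
        h = (h - 2) // 2 + 1
        w = (w - 2) // 2 + 1
    return (h, w, c), total_params

shape_iv, params_iv = network_shape_and_params(128, 128, 3, 3, 1, 0, 64)
shape_v, params_v = network_shape_and_params(128, 128, 3, 5, 1, 2, 16)
print(shape_iv, params_iv)
print(shape_v, params_v)
```

For setting iv the spatial size evolves as 128 → 126 → 63 → 61 → 30 → 28 → 14, which is where the floor in the pooling step matters.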

Question 4 (L^1 regularization (Lasso), 13 points)

L^1 regularization on the model parameter w is defined as

$$\|w\|_1 = \sum_i |w_i|$$

In this question, you are asked to apply L^1 regularization to a neural network model, of which the original objective function is defined as J(w; X, y). The regularized objective function after adding the L^1 regularization term becomes:

$$\tilde{J}(w; X, y) = J(w; X, y) + \alpha \|w\|_1$$

i. (3 points) Write down the corresponding gradient of the regularized objective function $\tilde{J}$. Hint: your answer should include an element-wise sign function sign(x).

Solution:

$$\nabla_w \tilde{J}(w; X, y) = \nabla_w J(w; X, y) + \alpha\,\mathrm{sign}(w)$$
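In NumPy the regularized gradient is a one-liner. The sketch below assumes the gradient of the unregularized objective is already available as an array (the name `grad_J` and the numbers are placeholders); `np.sign` returns 0 at w = 0, which is one valid choice of subgradient there.

```python
import numpy as np

def l1_regularized_grad(grad_J, w, alpha):
    """Gradient of J(w) + alpha * ||w||_1, using sign(w) as the (sub)gradient of the L1 term."""
    return grad_J + alpha * np.sign(w)

w = np.array([1.5, -2.0, 0.0])
grad_J = np.array([0.1, 0.1, 0.1])  # placeholder gradient of the unregularized objective
grad_total = l1_regularized_grad(grad_J, w, alpha=0.5)
```

The penalty adds a constant push of size α toward zero for every nonzero weight, regardless of the weight's magnitude.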

ii. (6 points) It can be difficult to get a clean algebraic solution to the previous gradient expression. To study how this L^1 regularization term affects the converged weights, we make the following simplifications:

We apply a Taylor expansion to J(w; X, y) and discard high-order terms. Specifically, we approximate the original objective function J with

$$\hat{J}(w; X, y) = J(w^*) + \frac{1}{2}(w - w^*)^T H (w - w^*),$$

where $w^*$ is the optimal parameter for the objective function without the regularization term and H is the Hessian matrix.

We assume the Hessian is diagonal, $H = \mathrm{diag}([H_{1,1}, \ldots, H_{n,n}])$, where each $H_{i,i} > 0$. This assumes that there is no correlation between the input features.

Now, write down an analytical solution for each element $w_i$ of w when the gradient equals zero. Your answer should be expressed in terms of the weights $w_i^*$ ($w^*$ is the optimal weight vector for the unregularized objective function), the Hessian matrix elements $H_{i,i}$, and the coefficient α. Hint: the gradient of |x| at x = 0 can take any value in [−1, 1].

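Solution:

$$w_i = \mathrm{sign}(w_i^*) \max\left\{ |w_i^*| - \frac{\alpha}{H_{i,i}},\ 0 \right\}$$

It is okay to write down the expression under different conditions like:

$$w_i = w_i^* + \frac{\alpha}{H_{i,i}}, \quad \text{when } w_i^* \le -\frac{\alpha}{H_{i,i}}$$

$$w_i = 0, \quad \text{when } \frac{\alpha}{H_{i,i}} > w_i^* > -\frac{\alpha}{H_{i,i}}$$

$$w_i = w_i^* - \frac{\alpha}{H_{i,i}}, \quad \text{when } w_i^* \ge \frac{\alpha}{H_{i,i}}$$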
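The closed form is a soft-thresholding rule, and it can be sanity-checked against a brute-force minimization of the quadratic approximation plus the L^1 penalty. Everything in this sketch (function names, the example values of w*, H, and α, and the search grid) is an assumption chosen for illustration.

```python
import numpy as np

def l1_solution(w_star, h_diag, alpha):
    """Soft-thresholding: shrink each w*_i toward zero by alpha / H_ii, clamping at zero."""
    return np.sign(w_star) * np.maximum(np.abs(w_star) - alpha / h_diag, 0.0)

def brute_force_solution(w_star, h_diag, alpha, grid):
    """Minimize 0.5 * H_ii * (w - w*_i)^2 + alpha * |w| per coordinate over a grid."""
    best = np.empty_like(w_star)
    for i, (ws, h) in enumerate(zip(w_star, h_diag)):
        objective = 0.5 * h * (grid - ws) ** 2 + alpha * np.abs(grid)
        best[i] = grid[np.argmin(objective)]
    return best

w_star = np.array([2.0, -1.5, 0.3, -0.1])
h_diag = np.array([1.0, 2.0, 1.0, 4.0])
alpha = 0.5

closed_form = l1_solution(w_star, h_diag, alpha)
numeric = brute_force_solution(w_star, h_diag, alpha, np.linspace(-3.0, 3.0, 6001))
```

With these values the thresholds α / H_ii are 0.5, 0.25, 0.5, and 0.125, so the two weights whose magnitude falls below their threshold are set exactly to zero while the other two are shrunk by their threshold.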

iii. (4 points) Given your answer to previous question, explain how L^1 regularization will affect the weights differently from L^2 regularization. You must discuss large weights and small weights separately.

Solution:

(2 points) For small weights, L^1 regularization will push them to zeros while L^2 will not. Or: L^1 regularization can lead to sparse weights.

Question 5 (Backpropagation with GANs)

In this question, we will work out backpropagation with Generative Adversarial Networks (GANs).

Recall that a GAN consists of a Generator and a Discriminator playing a game. The Generator takes as input a random sample from some noise distribution (e.g., Gaussian), and its goal is to produce something from a target distribution (which we observe via samples from this distribution). The Discriminator takes as input a batch consisting of a mix of samples from the true dataset and the Generator’s output, and its goal is to correctly classify whether its input comes from the true dataset or the Generator.

Definitions:

$X^1, \ldots, X^n$ is a minibatch of n samples from the target data generating distribution. For this question, we suppose that each $X^i$ is a k-dimensional vector. For example, we might be interested in generating a synthetic dataset of customer feature vectors in a credit scoring application.
$Z^1, \ldots, Z^n$ is a minibatch of n samples from some predetermined noise distribution. Note that in general these minibatch sizes may be different.
The generator $g(\cdot; \theta_g) : Z \to X$ is a neural network.
The discriminator $d(\cdot; \theta_d) : X \to (0, 1)$ is a neural network.

The log likelihood of the output produced by the discriminator is:

$$L(\theta_d, \theta_g) = \frac{1}{n} \sum_{i=1}^{n} \left[ \log D(X^i) + \log\left(1 - D(G(Z^i))\right) \right]$$

The training of such a GAN system proceeds as follows: given the generator’s parameters, the discriminator is optimized to maximize the above likelihood. Then, given the discriminator’s parameters, the generator is optimized to minimize the above likelihood. This process is iteratively repeated. Once training completes, we only require the generator to generate samples from our distribution of interest; we sample a point from our noise distribution and map it to a sample using our generator.

The Discriminator (The generator architecture is defined analogously)

Consider the discriminator to be a network with layers indexed by $1, 2, \ldots, L_d$ for a total of $L_d$ layers.
Let the discriminator’s weight matrix for layer l be $W_d^l$; let’s assume there are no biases for simplicity.
Let the activations produced by a layer l be given by $A_d^l$, and the pre-activation values by $z_d^l$. Write down

Let $g_d^l(\cdot)$ be the activation function at layer l.

i. The goal of the discriminator is to maximize the above likelihood function. Write down ∇θd L(X; θd, θg) in terms of ∇θd D(.) (3 points)

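Solution:

$$\nabla_{\theta_d} L(\theta_d, \theta_g) = \frac{1}{n} \sum_{i=1}^{n} \left[ \frac{\nabla_{\theta_d} D(X^i)}{D(X^i)} - \frac{\nabla_{\theta_d} D(G(Z^i))}{1 - D(G(Z^i))} \right]$$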

ii. Write down $\frac{\partial L(\theta_d, \theta_g)}{\partial z_d^{L_d}}$, taking help from your answer in the previous subpart. Remember that the activation function in the last layer of the discriminator is a sigmoid function as the output of the discriminator is a probability. (4 points)

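Solution:

$$\frac{\partial L(\theta_d, \theta_g)}{\partial z_d^{L_d}} = \frac{1}{n} \sum_{i=1}^{n} \left[ \frac{\sigma(z_d^{L_d}(X^i)) \left(1 - \sigma(z_d^{L_d}(X^i))\right)}{D(X^i)} - \frac{\sigma(z_d^{L_d}(G(Z^i))) \left(1 - \sigma(z_d^{L_d}(G(Z^i)))\right)}{1 - D(G(Z^i))} \right]$$

OR

$$\frac{\partial L(\theta_d, \theta_g)}{\partial z_d^{L_d}} = \frac{1}{n} \sum_{i=1}^{n} \left[ \left(1 - \sigma(z_d^{L_d}(X^i))\right) - \sigma(z_d^{L_d}(G(Z^i))) \right]$$

OR

$$\frac{\partial L(\theta_d, \theta_g)}{\partial z_d^{L_d}} = \frac{1}{n} \sum_{i=1}^{n} \left[ \left(1 - D(X^i)\right) - D(G(Z^i)) \right]$$

OR, partial credit (2 points) for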
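The simplified form of the logit gradient can be verified with finite differences. This sketch treats the last-layer pre-activations for the real and generated samples as two free vectors (`z_real` and `z_fake`, names assumed here), so that D = σ(z); it checks the per-sample terms (1 − σ(z)) / n and −σ(z) / n rather than any particular network.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gan_log_likelihood(z_real, z_fake):
    """L = mean(log D(X^i) + log(1 - D(G(Z^i)))) with D = sigmoid of the last-layer logit."""
    return np.mean(np.log(sigmoid(z_real)) + np.log(1.0 - sigmoid(z_fake)))

def logit_grads(z_real, z_fake):
    """Analytic gradients of L with respect to the real and fake logits."""
    n = z_real.shape[0]
    return (1.0 - sigmoid(z_real)) / n, -sigmoid(z_fake) / n

def numerical_grad(f, z, eps=1e-6):
    """Central finite differences, one coordinate at a time."""
    grad = np.zeros_like(z)
    for i in range(z.size):
        step = np.zeros_like(z)
        step[i] = eps
        grad[i] = (f(z + step) - f(z - step)) / (2.0 * eps)
    return grad

rng = np.random.default_rng(0)
z_real = rng.standard_normal(6)
z_fake = rng.standard_normal(6)

g_real, g_fake = logit_grads(z_real, z_fake)
num_real = numerical_grad(lambda z: gan_log_likelihood(z, z_fake), z_real)
num_fake = numerical_grad(lambda z: gan_log_likelihood(z_real, z), z_fake)
```

The cancellation that produces the short form is σ(1 − σ) / σ = 1 − σ for the real term and σ(1 − σ) / (1 − σ) = σ for the generated term.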

Solution:

$$\frac{\partial L(\theta_d, \theta_g)}{\partial g(z_i; \theta_g)} = (W_d^1)^T \frac{\partial L(\theta_d, \theta_g)}{\partial z_d^1}$$

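v. Now we move to the generator. The goal of the generator is to minimize the above likelihood function. Write down $\nabla_{\theta_g} L(\theta_d, \theta_g)$ in terms of $\nabla_{\theta_g} g(\cdot)$ and in terms of $\frac{\partial L(\theta_d, \theta_g)}{\partial g(z_i; \theta_g)}$ calculated in the previous part (5 points)

Solution:

$$\nabla_{\theta_g} L(\theta_d, \theta_g) = \frac{1}{n} \sum_{i=1}^{n} \nabla_{\theta_g} g(Z^i) \cdot \frac{\partial L(\theta_d, \theta_g)}{\partial g(z_i; \theta_g)}$$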

vi. Write down a simple gradient based update rule (don’t use RMSprop or Momentum, and assume no regularization) for $\theta_d^{t+1}$ in terms of $\nabla_{\theta_d} L(\theta_d, \theta_g)$, fixed learning rate α and the current parameters $\theta_d^t$ (1 point)

Solution: $\theta_d^{t+1} = \theta_d^t + \alpha \nabla_{\theta_d} L(\theta_d, \theta_g)$

vii. Write down a simple gradient based update rule (don’t use RMSprop or Momentum, and assume no regularization) for $\theta_g^{t+1}$ in terms of $\nabla_{\theta_g} L(\theta_d, \theta_g)$, fixed learning rate α and the current parameters $\theta_g^t$ (1 point)

Solution: $\theta_g^{t+1} = \theta_g^t - \alpha \nabla_{\theta_g} L(\theta_d, \theta_g)$
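The only difference between the two rules is the sign: the discriminator ascends the likelihood and the generator descends it. The sketch below illustrates this on two unrelated toy scalar objectives (a concave one for ascent, a convex one for descent); the function names and objectives are assumptions for illustration and do not form a GAN.

```python
def discriminator_step(theta_d, grad_d, lr):
    """Gradient ascent: move along the gradient to increase L."""
    return theta_d + lr * grad_d

def generator_step(theta_g, grad_g, lr):
    """Gradient descent: move against the gradient to decrease L."""
    return theta_g - lr * grad_g

theta_d, theta_g, lr = 0.0, 0.0, 0.1
for _ in range(200):
    # Toy concave objective -(theta_d - 1)^2, maximized at theta_d = 1.
    theta_d = discriminator_step(theta_d, -2.0 * (theta_d - 1.0), lr)
    # Toy convex objective (theta_g - 2)^2, minimized at theta_g = 2.
    theta_g = generator_step(theta_g, 2.0 * (theta_g - 2.0), lr)
```

In actual GAN training the two gradients come from the same likelihood and the steps alternate, as described in the training procedure earlier in this question.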

viii. Now assume you decide to (lazily) use a simplified version of the objective, which we call “Likelihood Version”, just to test how it works. Define the following alternative likelihood function.

$$L'(\theta_d, \theta_g) = \frac{1}{n} \sum_{i=1}^{n} \left[ D(X^i) + D(G(Z^i)) \right]$$

Note that this likelihood is high when the original likelihood is high, and low otherwise, in general, so it is possible that it works. Write down ∇θd L′(θd, θg) in terms of ∇θd D(.) (3 points)

Solution:

$$\nabla_{\theta_d} L'(\theta_d, \theta_g) = \frac{1}{n} \sum_{i=1}^{n} \left[ \nabla_{\theta_d} D(X^i) - \nabla_{\theta_d} D(G(Z^i)) \right]$$

You get full credit if you fixed the typo in this likelihood (there should be a minus instead of a plus).

It turns out this new Likelihood (along with gradient clipping) ends up minimizing the earth mover or Wasserstein distance between the target distribution and your generated samples, and the GAN optimized as such is called a WGAN. WGANs have been shown to work better than standard GANs in many cases, and are often preferred to GANs.

Computing Attention Scores. Now, we compute attention scores for each pixel in the image following figure 2. At each index (i, j), i ∈ H, j ∈ W, we consider the query array $q_{ij}$ as the pixel array of the image across all channels ($\in \mathbb{R}^{iC}$) and compute the corresponding score $y_{ij}$ using the following equation. Hint: You might want to look at np.squeeze

$$y_{ij} = \sum_{a,b=(1,1)}^{(3,3)} \mathrm{softmax}\left(q_{ij}^T k_{a,b}\right) v_{a,b}$$

This process is very similar to convolutions, but instead of convolving with a filter/kernel, we convolve the image with the key kernel, compute the softmax output, and compute the final score by multiplying and summing with value kernel. For simplicity, assume that the model parameters k and v of the given shapes are provided. In this problem, for each output channel, we take key kernel k of size 3 × 3 × iC and value kernel v of size 3 × 3 × 1, and $k_{a,b} \in \mathbb{R}^{iC}$.

Figure 2: Computation of attention scores for a single pixel across input channels

Given the image, and kernels k and v, you must use NumPy operations to compute the attention scores. Note that comparing attention score calculation to convolution operations, the stride size is 1, and there is no padding.

What would be the total number of trainable parameters in your self-attention model, given that the output array $Y \in \mathbb{R}^{H, W, oC}$?
Following the above pseudocode and carefully following the dimensional rules, write down an attention model implemented using Python for-loops. We will be awarding extra credit if you could figure out how to replace the loops with NumPy subroutines. We have also provided a snippet of the code as a starter.

Solution:

Number of trainable parameters = 3 ∗ 3 ∗ iC ∗ oC + 3 ∗ 3 ∗ 1 ∗ oC = 180
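The starter code, the figure, and the values of iC and oC are not part of this excerpt, so the following is only a sketch of one reading of the formula, not the official solution. Assumptions: the image has shape (H, W, iC); the keys are stored as an array of shape (3, 3, iC, oC) and the values as (3, 3, 1, oC); the query is the single pixel vector at (i, j); the softmax runs over the 9 kernel positions; and iC = 3, oC = 5 are picked only because they reproduce the count of 180.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def attention_scores_loops(image, k, v):
    """Loop version: one softmax-weighted sum of value entries per pixel and output channel."""
    H, W, iC = image.shape
    oC = k.shape[-1]
    v = np.squeeze(v, axis=2)  # (3, 3, 1, oC) -> (3, 3, oC)
    Y = np.zeros((H, W, oC))
    for i in range(H):
        for j in range(W):
            q = image[i, j, :]  # query: pixel vector across input channels
            for c in range(oC):
                logits = np.array([[q @ k[a, b, :, c] for b in range(3)]
                                   for a in range(3)])
                Y[i, j, c] = np.sum(softmax(logits) * v[:, :, c])
    return Y

def attention_scores_vectorized(image, k, v):
    """Same computation with einsum replacing the Python loops."""
    H, W, _ = image.shape
    oC = k.shape[-1]
    logits = np.einsum("hwi,abio->hwabo", image, k).reshape(H, W, 9, oC)
    logits = logits - logits.max(axis=2, keepdims=True)  # numerical stability
    weights = np.exp(logits)
    weights = weights / weights.sum(axis=2, keepdims=True)
    return np.sum(weights * np.squeeze(v, axis=2).reshape(9, oC), axis=2)

rng = np.random.default_rng(0)
iC, oC = 3, 5  # assumed values, chosen to match the parameter count above
image = rng.standard_normal((4, 6, iC))
k = rng.standard_normal((3, 3, iC, oC))
v = rng.standard_normal((3, 3, 1, oC))

Y_loops = attention_scores_loops(image, k, v)
Y_fast = attention_scores_vectorized(image, k, v)
num_params = k.size + v.size  # 3*3*iC*oC + 3*3*1*oC
```

Because the softmax weights are non-negative and sum to one, each output entry is a convex combination of the nine value entries for its channel, which gives a quick sanity check on any implementation.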
