CS230: Deep Learning Midterm Examination - Fall Quarter 2018

A midterm examination for the CS230: Deep Learning course offered in Fall Quarter 2018 at Stanford University. The exam contains 6 questions with a total of 117 points. The exam is closed book, but students are allowed to use one A4 sheet (front and back) of notes as reference. The questions cover topics such as attacks on neural networks, autonomous driving case study, and traversability estimation using GANs. The exam also includes multiple-choice and short answer questions.

Typology: Exams

2017/2018

Uploaded on 05/11/2023

houhou 🇺🇸

CS230: Deep Learning
Fall Quarter 2018
Stanford University
Midterm Examination
180 minutes
Problem                                   Full Points   Your Score
1  Multiple Choice Questions              10
2  Short Answer Questions                 35
3  Attacks on Neural Networks             15
4  Autonomous Driving Case Study          27
5  Traversability Estimation Using GANs   14
6  LogSumExp                              16
Total                                     117
The exam contains 25 pages including this cover page.
This exam is closed book i.e. no laptops, notes, textbooks, etc. during the
exam. However, you may use one A4 sheet (front and back) of notes as reference.
In all cases, and especially if you’re stuck or unsure of your answers, explain your
work, including showing your calculations and derivations! We’ll give partial
credit for good explanations of what you were trying to do.
Name:
SUNETID: @stanford.edu
The Stanford University Honor Code:
I attest that I have not given or received aid in this examination, and that I have done my
share and taken an active part in seeing to it that others as well as myself uphold the spirit
and letter of the Honor Code.
Signature:

Question 1 (Multiple Choice Questions, 10 points)

For each of the following questions, circle the letter of your choice. There is only ONE correct choice unless explicitly mentioned. No explanation is required. There is no penalty for a wrong answer.

(a) (1 point) Which of the following techniques does NOT prevent a model from overfitting?

(i) Data augmentation (ii) Dropout (iii) Early stopping (iv) None of the above

Solution: (iv)

(b) (3 points) Consider the following data sets:

  • $X_{train} = (x^{(1)}, x^{(2)}, ..., x^{(m_{train})})$, $Y_{train} = (y^{(1)}, y^{(2)}, ..., y^{(m_{train})})$
  • $X_{test} = (x^{(1)}, x^{(2)}, ..., x^{(m_{test})})$, $Y_{test} = (y^{(1)}, y^{(2)}, ..., y^{(m_{test})})$

You want to normalize your data before training your model. Which of the following propositions are true? (Circle all that apply.)

(i) The normalizing mean and variance computed on the training set, and used to train the model, should be used to normalize test data. (ii) Test data should be normalized with its own mean and variance before being fed to the network at test time because the test distribution might be different from the train distribution. (iii) Normalizing the input impacts the landscape of the loss function. (iv) In imaging, just like for structured data, normalization consists in subtracting the mean from the input and multiplying the result by the standard deviation.

Solution: (i) and (iii)

Solution: (iii)

(e) (1 point) Consider the model defined in question (d) with parameters initialized with zeros. $W^{[1]}$ denotes the weight matrix of the first layer. You forward propagate a batch of examples, and then backpropagate the gradients and update the parameters. Which of the following statements is true?

(i) Entries of $W^{[1]}$ may be positive or negative (ii) Entries of $W^{[1]}$ are all negative (iii) Entries of $W^{[1]}$ are all positive (iv) Entries of $W^{[1]}$ are all zeros

Solution: (i)

(f) (2 points) Consider the layers l and l − 1 in a fully connected neural network:

The forward propagation equations for these layers are:

$z^{[l-1]} = W^{[l-1]} a^{[l-2]} + b^{[l-1]}$
$a^{[l-1]} = g^{[l-1]}(z^{[l-1]})$
$z^{[l]} = W^{[l]} a^{[l-1]} + b^{[l]}$
$a^{[l]} = g^{[l]}(z^{[l]})$

Which of the following propositions is true? Xavier initialization ensures that :

(i) $Var(W^{[l-1]})$ is the same as $Var(W^{[l]})$. (ii) $Var(b^{[l]})$ is the same as $Var(b^{[l-1]})$. (iii) $Var(a^{[l]})$ is the same as $Var(a^{[l-1]})$, at the end of training. (iv) $Var(a^{[l]})$ is the same as $Var(a^{[l-1]})$, at the beginning of training.

Solution: (iv)

Question 2 (Short Answer Questions, 35 points)

Please write concise answers.

(a) (2 points) You are training a logistic regression model. You initialize the parameters with 0’s. Is this a good idea? Explain your answer.

Solution: There is no symmetry problem with this approach. In logistic regression, we have a = W x + b where a is a scalar and W and x are both vectors. The derivative of the binary cross entropy loss with respect to a single dimension in the weight vector W[i] is a function of x[i], which is in general different from x[j] when i ≠ j.
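As a minimal sketch of this argument, the snippet below computes the first gradient of a zero-initialized logistic regression on a single made-up example. It assumes the usual sigmoid output and binary cross-entropy loss; the input values and label are hypothetical and chosen only for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical toy example: one input with three features and label y = 1.
x = np.array([0.5, -2.0, 3.0])
y = 1.0

w = np.zeros(3)  # zero initialization
b = 0.0

a = sigmoid(w @ x + b)   # prediction is 0.5 at initialization
grad_w = (a - y) * x     # gradient of binary cross-entropy w.r.t. w
grad_b = a - y

print(grad_w)  # each entry is proportional to the matching x[i]
```

Because the entries of x differ, the entries of grad_w differ too, so the weights move apart after the first update.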

(b) (2 points) You design a fully connected neural network architecture where all activations are sigmoids. You initialize the weights with large positive numbers. Is this a good idea? Explain your answer.

Solution: Large W causes W x to be large. When W x is large, the gradient is small for sigmoid activation function. Hence, we will encounter the vanishing gradient problem.

(c) (2 points) You are given a dataset of 10 × 10 grayscale images. Your goal is to build a 5-class classifier. You have to adopt one of the following two options:

  • the input is flattened into a 100-dimensional vector, followed by a fully-connected layer with 5 neurons
  • the input is directly given to a convolutional layer with five 10 × 10 filters

Explain which one you would choose and why.

Solution: The 2 approaches are the same. But the second one seems better in terms of computational costs (no need to flatten the input). We accept the answer "the 2 approaches are the same".

(g) (2 points) Data augmentation is often used to increase the amount of data you have. Should you apply data augmentation to the test set? Explain why.

Solution: Both answers are okay but need to be justified. If no, then explain that we want to test on real data only. If yes, then explain in which situation doing data augmentation on test set might make sense (e.g. as an ensemble approach in image classifiers).

(h) (2 points) Weight sharing allows CNNs to deal with image data without using too many parameters. Does weight sharing increase the bias or the variance of a model?

Solution: Increases bias.

(i) (2 points) You’d like to train a fully-connected neural network with 5 hidden layers, each with 10 hidden units. The input is 20-dimensional and the output is a scalar. What is the total number of trainable parameters in your network?

Solution: $(20+1) \cdot 10 + (10+1) \cdot 10 \cdot 4 + (10+1) = 661$
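The count can be checked with a short sketch that sums weights and biases layer by layer. The helper name count_parameters is hypothetical; the layer sizes are those stated in the question.

```python
def count_parameters(layer_sizes):
    """Count weights and biases of a fully connected network."""
    total = 0
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        total += (n_in + 1) * n_out  # n_in weights plus one bias per unit
    return total

# 20 inputs, 5 hidden layers of 10 units, 1 scalar output
layer_sizes = [20, 10, 10, 10, 10, 10, 1]
n_params = count_parameters(layer_sizes)
print(n_params)  # 661
```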

(j) (3 points) Consider the figure below:

Figure 1: Input of shape $(n_H, n_W, n_C) = (10, 10, 1)$; There are five 4 × 4 convolutional filters with 'valid' padding and a stride of (2, 2)

What is the output shape after performing the convolution step in Figure 1? Write your answer in the following format: $(n_H, n_W, n_C)$.

Solution: (h=4,w=4,c=5)
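The output shape follows from the standard formula for a 'valid' convolution, floor((n - f) / s) + 1 per spatial dimension, with one output channel per filter. A small sketch, using a hypothetical helper conv_output_shape and assuming square filters and equal strides:

```python
def conv_output_shape(n_h, n_w, filter_size, stride, n_filters):
    """Output shape of a 'valid' convolution with square filters."""
    out_h = (n_h - filter_size) // stride + 1
    out_w = (n_w - filter_size) // stride + 1
    return (out_h, out_w, n_filters)

shape = conv_output_shape(10, 10, filter_size=4, stride=2, n_filters=5)
print(shape)  # (4, 4, 5)
```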

(k) (2 points) Recall that $\sigma(z) = \frac{1}{1+e^{-z}}$ and $\tanh(z) = \frac{e^{z} - e^{-z}}{e^{z} + e^{-z}}$. Calculate $\frac{\partial \sigma(z)}{\partial z}$ in terms of $\sigma(z)$ and $\frac{\partial \tanh(z)}{\partial z}$ in terms of $\tanh(z)$.

Solution: Gradient for sigmoid: $\sigma(z)(1 - \sigma(z))$. Gradient for tanh: $1 - \tanh^2(z)$.
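Both identities can be sanity-checked numerically. The sketch below compares them against central finite differences on a few sample points; the step size and the sample grid are arbitrary choices made for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def central_difference(f, z, h=1e-5):
    """Numerical derivative of f at z."""
    return (f(z + h) - f(z - h)) / (2.0 * h)

z = np.linspace(-3.0, 3.0, 13)

# Largest gap between the closed-form derivative and the numerical one
sigmoid_error = np.max(np.abs(
    central_difference(sigmoid, z) - sigmoid(z) * (1.0 - sigmoid(z))))
tanh_error = np.max(np.abs(
    central_difference(np.tanh, z) - (1.0 - np.tanh(z) ** 2)))

print(sigmoid_error, tanh_error)  # both should be tiny
```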

(l) (2 points) Assume that before training your neural network the setting is: (1) The data is zero centered. (2) All weights are initialized independently with mean 0 and variance 0.001. (3) The biases are all initialized to 0. (4) Learning rate is small and cannot be tuned.

Using the result from (k), explain which activation function between tanh and sigmoid is likely to lead to a higher gradient during the first update.

Solution: tanh. During initialization, expected value of z is 0. Derivative of σ w.r.t. z evaluated at zero = 0.5 ∗ 0.5 = 0.25. Derivative of tanh w.r.t. z evaluated at zero = 1. tanh has higher gradient magnitude close to zero.

(m) You want to build a 10-class neural network classifier. Given a cat image, you want to classify which of the 10 cat breeds it belongs to.

(i) (2 points) What loss function do you use? Introduce the appropriate notation and write down the formula of the loss function.

Solution: You would want to use the cross entropy loss given by $L = -\sum_{i=1}^{n} y_i \log(\hat{y}_i)$

(ii) (2 points) Assuming you train your network using mini-batch gradient descent with a batch size of 64, what cost function do you use? Introduce the appropriate notation and write down the formula of the cost function.

Solution: If there are m training examples, $J = \frac{1}{m} \sum_{i=1}^{m} L^{(i)}$

(iii) (3 points) One of your friends has trained a cat vs. non-cat classifier. It performs very well and you want to use transfer learning to build your own model. Explain what additional hyperparameters (due to the transfer learning) you will need to tune.

Solution: The parameters you would need to choose are: 1) How many layers of the original network to keep. 2) How many new layers to introduce. 3) How many of the layers of the original network you would want to keep frozen while fine-tuning.

Question 3 (Attacks on Neural Networks, 15 points)

Alice and Bob are deep learning engineers working at two rival start-ups. They are both trying to deliver the same neural network-based product.

Alice and Bob do not have access to each other’s models and code. However they can query each other’s models as much as they’d like.

(a) (2 points) Name a type of neural network attack that Alice cannot use against Bob’s model, and explain why it cannot be used in this case.

Solution: White-box attacks. White-box attacks require access to the weights of the model, whereas black-box attacks do not.

(b) (3 points) How can Alice forge an image $x_{iguana}$ which looks like an iguana but will be wrongly classified as a plant by Bob's model? Give an iterative method and explicitly mention the loss function.

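Solution: $L = \|\hat{y} - y_{plant}\| + \gamma \|x - x_{iguana}\|$, where the first term pushes the prediction toward the plant class and the second keeps the forged image close to the iguana image.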

(c) (3 points) It is possible to add an invisible perturbation η to an image x, such that $\tilde{x} = x + \eta$ is misclassified by a model. Assuming you have access to the target model, explain how you would find η.

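Solution: η can be chosen using a method such as the Fast Gradient Sign Method (FGSM). It perturbs the input along the sign of the gradient of the loss with respect to the input, in a single step or repeatedly in its iterative variants, and it requires access to the model (a white-box attack).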
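A minimal sketch of a single gradient-sign step on a toy logistic regression model is shown below. Everything here is an assumption made for illustration: the "trained" weights are random, the bias is chosen so the clean input is classified as class 1, and the value of epsilon is arbitrary. It is not the exam's reference implementation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
n = 1000                                 # number of input dimensions ("pixels")
w = rng.normal(size=n)                   # hypothetical trained weights
x = rng.uniform(0.0, 1.0, size=n)        # hypothetical clean input
b = 1.0 - w @ x                          # bias chosen so the clean logit is 1
y = 1.0                                  # true label of the clean input

# Gradient of binary cross-entropy w.r.t. the input x (white-box access)
a = sigmoid(w @ x + b)
grad_x = (a - y) * w

epsilon = 0.01
eta = epsilon * np.sign(grad_x)          # perturbation bounded by epsilon per entry
x_adv = x + eta

p_clean = sigmoid(w @ x + b)
p_adv = sigmoid(w @ x_adv + b)
print(p_clean, p_adv)  # the predicted probability of class 1 drops
```

A real attack on images would also clip x_adv to the valid pixel range, which is omitted here.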

(d) (2 points) Given that you have obtained η, you notice that |η| << 1. Explain why even though the adversarial noise has a small magnitude, it can cause a large change in the output of a model.

Solution: Since the dimensionality of the images is very large, even though the noise is small, it can cause a large swing in the output. To illustrate, consider $\tilde{x} = x + \eta$. Passing this through a single layer gives $W\tilde{x} = Wx + W\eta$, where the $i$-th entry of the extra term is $\sum_j W_{ij} \eta_j$. If the number of input dimensions (the range of $j$) is very large, this sum can make a significant contribution.
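The dependence on dimensionality can be seen in a small sketch. It assumes one row of W with entries of magnitude 1 and a perturbation aligned with the signs of those weights; both are simplifications chosen so the effect is easy to read.

```python
import numpy as np

rng = np.random.default_rng(0)
epsilon = 0.01  # per-entry magnitude of the perturbation

shifts = {}
for n in (10, 1_000, 100_000):
    w = rng.choice([-1.0, 1.0], size=n)   # hypothetical weight row
    eta = epsilon * np.sign(w)            # tiny perturbation aligned with w
    shifts[n] = float(w @ eta)            # change in the pre-activation

print(shifts)  # the shift grows linearly with n, as epsilon * n
```

Each entry of eta stays at 0.01, yet the change in the pre-activation scales with the number of inputs.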

(e) (3 points) Alice doesn’t have access to Bob’s network. How can she still generate an adversarial example using the method described above?

Solution: Adversarial examples are transferable, and so Alice can use the Fast Gradient Sign Method on a different model, built for the same task (cat vs. non-cat classification), and it is likely that this adversarial example will also be misclassified by Bob's model.

(f) (2 points) To defend himself against Alice’s attacks, Bob is thinking of using dropout. Dropout randomly shuts down certain neurons of the network, and makes it more robust to changes in the input. Thus, Bob has the intuition that the network will be less vulnerable to adversarial examples. Is Bob correct? Explain why or why not.

Solution: No, dropout isn’t used at test time while Alice will forge her adversarial examples by querying a network at test time.

(c) You finally design the following pipeline:

Figure 2: The input camera image is given to two modules: the Car Detector C and the Pedestrian Detector P. C outputs a set of bounding boxes localizing the cars. P outputs a set of bounding boxes localizing the pedestrians. The bounding boxes are then given to a Path Planner S which outputs the steering angle. Assume all these submodules are supervised learning algorithms.

(i) (3 points) What data do you need to train the submodules in the pipeline presented in Figure 2?

Solution:

  • To train the Car Detector, we need to collect Xc (images from a camera on the hood of a car) and Yc (bounding box labels localizing the cars).
  • To train the Pedestrian Detector, we need to collect Xp (images from a camera on the hood of a car) and Yp (bounding box labels localizing the pedestrians).
  • To train the Path Planner, we need to collect Xs (bounding boxes localizing the cars and the pedestrians) and Ys (steering angle).

(ii) (3 points) Explain how you would collect this data.

Solution:

  • (Xc, Yc) : Put a camera on the hood of a car, label the bounding boxes by hand. You can also download images from roads online or use online datasets such as COCO or PASCAL VOC.
  • (Xp, Yp) : Put a camera on the hood of a car, label the bounding boxes by hand. You can also download images from roads online or use online datasets such as COCO or PASCAL VOC.
  • (Xs, Ys) : Put a camera on the hood of a car, label the bounding boxes by hand. Track the steering angle θ using a sensor in the car.

Note that the most challenging module to train seems to be the path planner. You collect images of roads with and without cars and pedestrians. Each image should be labelled with bounding boxes around cars and pedestrians, and indicate the true steering angle. You can thus set up a camera on the hood of your car and a sensor capturing the steering angle. Drive it in various environments, and track the live variations in steering angle mapped to the camera's video stream. Finally, label the video frames with bounding boxes around pedestrians and cars. To boost the performance of your car detector and pedestrian detector, you can also add images from other sources such as PASCAL VOC and COCO which have bounding box labels.

(d) (2 points) Propose a metric to measure the performance of your pipeline model.

Solution:

  • Sum of absolute deviations (i.e. L1 distance) between ground truth and predicted steering angle.
  • Sum of squared errors (i.e. L2 distance) between ground truth and predicted steering angle.
  • (not expected) You can also design metrics for submodules. For C and P it might be mAP (/mean IoU)

(e) Assume that you have designed a metric that scores your pipeline between 0% (bad) and 100% (good.) On unseen data from the real world, your entire pipeline gets a score of 54%.

(i) (2 points) Define Bayes error and human level error. How do these two compare with each other? (≤, ≥, =)

Solution: Bayes error is the lowest error that any model can achieve on the task. Human level error is the error achieved by an expert human on the same task. The Bayes error is ≤ human level error.

(ii) (2 points) How would you measure human level error for your autonomous driving task?

Solution: Possible solution: Create a simulator with the same path followed by the car while recording data, and have an expert try following the path.

Question 5 (Traversability Estimation Using GANs, 14 points)

In robot navigation, the traversability problem aims to answer the question: can the robot traverse through a situation?

Figure 3: Example of different situations. Left: traversable; Right: non-traversable

You want to estimate the traversability of a situation for a robot. Traversable data is easy to collect (e.g. going through a corridor) while non-traversable data is very costly (e.g. going down the stairs). You have a large and rich dataset X of traversable images, but no non-traversable images.

The question you are trying to answer is: ’Is it possible to train a neural network that classifies whether or not a situation is traversable using only dataset X ?’ More precisely, if a non-traversable image was fed into the network, you want the network to predict that it is non-traversable. In this part, you will use a Generative Adversarial Network (GAN) to solve this problem.

(a) Before considering the traversability problem, let us do a warm-up question. Consider that you have trained a network $f_w : \mathbb{R}^{n_x \times 1} \to \mathbb{R}^{n_y \times 1}$. The parameters of the network are denoted $w$. Given an input $x \in \mathbb{R}^{n_x \times 1}$, the network outputs $\hat{y} = f_w(x) \in \mathbb{R}^{n_y \times 1}$.

Given an arbitrary output $\hat{y}^*$, you would like to use gradient descent optimization to generate an input $x^*$ such that $f_w(x^*) = \hat{y}^*$.

(i) (2 points) Write down the formula of the $\ell_2$ loss function you would use.

Solution: $L(x^*, \hat{y}^*) = \|f_w(x^*) - \hat{y}^*\|_2^2$

(ii) (2 points) Write down the update rule of the gradient descent optimizer in terms of $\ell_2$ norm.

Solution: $x^*_{t+1} = x^*_t - \alpha \cdot \left. \frac{\partial \|f_w(x) - \hat{y}^*\|_2^2}{\partial x} \right|_{x = x^*_t}$

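(iii) (2 points) Calculate the gradient of the loss in your update rule in terms of $\frac{\partial f_w(x)}{\partial w}$ and $\frac{\partial f_w(x)}{\partial x}$ (it is not necessary to use both terms).

Solution: $x^*_{t+1} = x^*_t - 2\alpha \cdot \left( \left. \frac{\partial f_w(x)}{\partial x} \right|_{x = x^*_t} \right)^T \cdot (f_w(x^*_t) - \hat{y}^*)$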
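The update rule can be tried on a toy network. The sketch below assumes a hypothetical one-layer network f_w(x) = tanh(Wx + b) with fixed random weights, for which the Jacobian with respect to x is known in closed form; the sizes, step size, and iteration count are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n_x, n_y = 4, 3
W = rng.normal(size=(n_y, n_x))   # hypothetical fixed ("trained") weights
b = rng.normal(size=n_y)

def f_w(x):
    return np.tanh(W @ x + b)

def jacobian(x):
    # d f_w / d x for f_w(x) = tanh(Wx + b): diag(1 - tanh^2(Wx + b)) @ W
    return (1.0 - f_w(x) ** 2)[:, None] * W

y_target = f_w(rng.normal(size=n_x))  # a reachable target output
x = np.zeros(n_x)                     # initial guess for x*
alpha = 0.01

losses = []
for _ in range(2000):
    residual = f_w(x) - y_target
    losses.append(float(residual @ residual))       # squared l2 loss
    x = x - 2.0 * alpha * jacobian(x).T @ residual  # update rule from (iii)

print(losses[0], losses[-1])  # the loss should decrease
```

Only x is updated here; the weights W and b stay fixed, which is why the derivative with respect to w does not appear.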

(b) Now, let us go back to the traversability problem. In particular, imagine that you have successfully trained a perfect GAN with generator G and discriminator D on X. As a consequence, given a code $z \in \mathbb{R}^{C}$, G(z) will look like a traversable image.

(i) (2 points) Consider a new image x. How can you find a code z such that the output of the generator G(z) would be as close as possible to x?

Solution: We can apply the backpropagation technique developed in part (a), where z plays the role of $x^*$, and x plays the role of $\hat{y}^*$.

(ii) (2 points) Suppose you’ve found z such that G(z) is the closest possible value to x out of all possible z. How can you decide if x represents a traversable situation or not? Give a qualitative explanation.

Solution: We compare G(z) to the image x in the sense that if $\|G(z) - x\|_2$ is "big" then x is non-traversable, and vice versa.

(iii) (2 points) Instead of using the method above, Amelia suggests directly running x through the discriminator D. Amelia believes that if D(x) predicts that it is a real image, then x is likely a traversable situation. Else, it is likely to be a non-traversable situation. Do you think that Amelia’s method would work and why?

Solution: No. If the GAN is trained perfectly, which is the case here, then the discriminator cannot tell whether an image produced by the generator is real or fake, that is, traversable or non-traversable.

(c) (2 points) The iterative method developed in part (a) is too slow for your self-driving application. Given $\hat{y}^*$ you need to generate $x^*$ in real time. Come up with a method using an additional network to generate $x^*$ faster, e.g., with a single forward propagation.

Solution: We can train a second network to spit out the inverse of the network in hand.

Some ideas for this question are borrowed from the paper titled GONet: A Semi-Supervised Deep Learning Approach For Traversability Estimation (Hirose et al.).

In the following questions, you will express the cross-entropy loss in a different way.

(b) (2 points)

LogSumExp (LSE), defined below, is an operation commonly encountered in Deep Learning:

$$\mathrm{LSE}(x_1, \ldots, x_n) = \log \sum_{i=1}^{n} \exp(x_i) = \log(\exp(x_1) + \cdots + \exp(x_n)) \quad (1)$$

Express the loss $L_{CE}(\hat{y}, y)$ in terms of the logits vector $z$ and the LSE function. (Hint: $\hat{y} = \mathrm{softmax}(z)$)

Solution:

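$$L_{CE}(\hat{y}, y) = -\log \hat{y}_c = -\log \frac{\exp(z_c)}{Z} = -\log(\exp(z_c)) + \log(Z) = -z_c + \mathrm{LSE}(z)$$

where $Z = \sum_{i} \exp(z_i)$ is the softmax normalizer, so that $\log(Z) = \mathrm{LSE}(z)$.

Thus, $L_{CE}(\hat{y}, y) = \mathrm{LSE}(z) - z_c$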
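The identity can be checked numerically. The sketch below compares the LSE form with the direct −log softmax form on made-up logits. The max-shift inside logsumexp is a common numerical-stability trick that is not part of the exam solution; it leaves the value of LSE unchanged.

```python
import numpy as np

def logsumexp(z):
    """LSE(z), computed after shifting by max(z) to avoid overflow."""
    z_max = np.max(z)
    return z_max + np.log(np.sum(np.exp(z - z_max)))

def cross_entropy_from_logits(z, c):
    """L_CE = LSE(z) - z_c for the correct class index c."""
    return logsumexp(z) - z[c]

z = np.array([2.0, -1.0, 0.5])   # hypothetical logits
c = 0                            # hypothetical correct class

y_hat = np.exp(z) / np.sum(np.exp(z))   # plain softmax, fine for small logits
loss_direct = -np.log(y_hat[c])
loss_lse = cross_entropy_from_logits(z, c)

print(loss_direct, loss_lse)  # the two values should agree
```

With the shift, the LSE form also stays finite for very large logits, where the plain softmax would overflow.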

(c) (2 points) Compute the following partial derivative:

$$\frac{\partial}{\partial z_j} \mathrm{LSE}(z)$$

Solution: $\frac{\exp(z_j)}{\sum_{i=1}^{K} \exp(z_i)}$, or $\hat{y}_j$, or $(\mathrm{softmax}(z))_j$

(d) (2 points) Compute the following partial derivative (for the correct class c):

$$\frac{\partial}{\partial z_c} L_{CE}(\hat{y}, y)$$

(Hint: use Part (b))

Solution: $\hat{y}_c - 1$

(e) (2 points) Compute the following partial derivative (for an incorrect class $j \neq c$):

$$\frac{\partial}{\partial z_j} L_{CE}(\hat{y}, y)$$

Solution: $\hat{y}_j$

(f) (2 points) Using the results of Part (d) and (e), express the following gradient using $\hat{y}$ and $y$:

$$\frac{\partial}{\partial z} L_{CE}(\hat{y}, y)$$

Solution: $\hat{y} - y$
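As a quick check of this result, the sketch below compares ŷ − y with a central finite-difference gradient of the cross-entropy loss. The logits and the one-hot label are made-up values, and the step size is an arbitrary choice.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / np.sum(e)

def cross_entropy(z, y):
    return -np.sum(y * np.log(softmax(z)))

z = np.array([1.0, -0.5, 2.0, 0.3])   # hypothetical logits
y = np.array([0.0, 0.0, 1.0, 0.0])    # one-hot label, correct class c = 2

analytic = softmax(z) - y             # gradient from Part (f)

h = 1e-6
numeric = np.zeros_like(z)
for j in range(len(z)):
    step = np.zeros_like(z)
    step[j] = h
    numeric[j] = (cross_entropy(z + step, y) - cross_entropy(z - step, y)) / (2 * h)

max_error = np.max(np.abs(analytic - numeric))
print(max_error)  # should be close to zero
```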