Homework 3 CSE 446: Machine Learning, Study notes of Artificial Intelligence


2022/2023

Uploaded on 05/11/2023

Homework 3
CSE 446: Machine Learning
University of Washington
1 Policies [0 points]
Please read these policies. Please answer the three questions below and include your answers
marked in a “problem 0” in your solution set. Homeworks which do not include these answers
will not be graded.
Gradescope submission: When submitting your HW, please tag your pages correctly as is
requested in gradescope. Untagged homeworks will not be graded, until the tagging is fixed.
Readings: Read the required material.
Submission format: Submit your report as a single pdf file. Also, please include all your
code in the PDF file in a section at the end of your document, marked “Code”; also specify which
problem(s) the code corresponds to. The report (in a single pdf file) must include all the plots
and explanations for programming questions (if required). Homework solutions must be organized
in order, with all plots arranged in the correct location in your submitted solutions. We highly
recommend typesetting your scientific writing using LaTeX (see the website for references for free
tools). Writing solutions by hand will be accepted provided they are neat; written solutions need
to be scanned and included into a single pdf.
Written work: Please provide succinct answers along with succinct reasoning for all your
answers. Points may be deducted if long answers demonstrate a lack of clarity. Similarly, when
discussing the experimental results, concisely create tables and figures to organize the experimental
results. In other words, all your explanations, tables, and figures for any particular part of a question
must be grouped together.
Including your Python source code: For the programming assignments, submit your code
in the pdf file along with a neatly written README file that instructs us how you ran your code
with different settings (if applicable). Please note that we will not accept screenshots of Jupyter
notebooks. If you do use Jupyter, you must export your code to a text file and put the text of your
code in the submitted pdf file (in the last section) in a manner that can be executed in that order
(without any extraneous or missing code).
We assume that you always follow good practice of coding (commenting, structuring); these
factors are not central to your grade.
Coding policies: You must write your own code. You are welcome to use any Python libraries
for data munging, visualization, and numerical linear algebra. Examples include NumPy, Pandas,
and Matplotlib. You may not, however, use any machine learning libraries such as Scikit-Learn,

TensorFlow, or PyTorch, unless explicitly specified for that question. If in doubt, post to the message boards.
Collaboration: It is acceptable for you to discuss problems with other students; it is not acceptable for students to look at another student's written answers. It is acceptable for you to discuss coding questions with others; it is not acceptable for students to look at another student's code. Each student must understand, write, and hand in their own answers. In addition, each student must write and submit their own code in the programming part of the assignment.
Acknowledgments: We expect students not to refer to or seek out solutions in published material from previous years, on the web, or from other textbooks. Students are certainly encouraged to read extra material for a deeper understanding.
Extra Credit Policy: In order to get extra credit, you must do all the regular problems. Extra credit points will only be awarded if there are (honest attempts at) answers to all the regular questions. This is because the extra credit questions are not designed to be alternatives to the regular questions.

1.1 List of Collaborators

List the names of all people you have collaborated with and for which question(s).

1.2 List of Acknowledgements

If you do inadvertently find an assignment’s answer, acknowledge for which question and provide an appropriate citation (there is no penalty, provided you include the acknowledgement). If not, then write “none”.

1.3 Certify that you have read the instructions

Please make sure to read and follow these instructions. Write “I have read and understood these policies” to certify this.

Remark: You might find this expression to be rather curious! It looks identical to the expression for our gradient in the average squared error case. You are free to think about why this was a fortunate coincidence. The choice of y ∈ { 0 , 1 } was indeed intentional.

  1. [2 points] Again, in order to make our code fast, simplify this gradient expression with matrix algebra by expressing it in terms of $X \in \mathbb{R}^{N \times d}$, $Y \in \mathbb{R}^{N \times 1}$, $\hat{Y} \in \mathbb{R}^{N \times 1}$ (and other relevant quantities).

3.2 Let’s try it out! [12 points]

Implement logistic regression on our “2” vs “9” dataset. Note: here you need to modify the decision rule: we should label a digit as a “2”, i.e. y = 1, if w · x ≥ 0 (You should be able to see why this threshold is appropriate.). Make sure to explicitly include a bias term in the model and do not regularize this term. To be precise, let the un-regularized loss be:

$$L(w, b) = -\frac{1}{N} \sum_{n=1}^{N} \log p_{w,b}(y = y_n \mid x_n),$$

and so

$$L_\lambda(w, b) = L(w, b) + \frac{\lambda}{2} \|w\|^2.$$

Here pw,b is the model which includes a bias term. Note that gradient descent here would be:

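$$w \leftarrow w - \eta \nabla_w L_\lambda(w, b) = w - \eta \left( \nabla_w L(w, b) + \lambda w \right)$$

$$b \leftarrow b - \eta \frac{\partial L_\lambda(w, b)}{\partial b} = b - \eta \frac{\partial L(w, b)}{\partial b}$$

where the last step follows since our cost function is not regularizing the bias term. Now run gradient descent: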

  1. [1 point] Specify all your parameter choices (this should be your step size and λ). What stepsize do you find works well, and what value of λ did you use? You might have to search around a little (it helps to search by going up or down in multiples of 10).
  2. [5 points] Show your log loss on the training set and the development set (where the y-axis is the log loss and the x-axis is the iteration). You should be able to convince yourself that the log loss (sometimes referred to as the cross entropy) can be computed as:

$$-\frac{1}{N} \sum_{n} \left( y_n \log \hat{y}_n + (1 - y_n) \log(1 - \hat{y}_n) \right),$$

     which can be directly computed in Python with operations on the vectors $Y$ and $\hat{Y}$. Both curves should be on the same plot. What value of λ did you use?
  3. [4 points] Make this plot again (with both curves), except use the misclassification error, as a percentage, instead of the average log loss. Here, make sure to start your x-axis at a slightly later iteration, so that your error starts below 5%, which makes the behavior easier to view (it is difficult to view the long-run behavior if the y-axis covers too large a range).
  4. [2 points] Again, it is expected that you obtain a good test error (meaning you train long enough and you regularize appropriately, if needed). Report your lowest test error.
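
The update rule and the log loss above can be sketched as follows. This is only a minimal illustration under stated assumptions: it uses synthetic two-class data instead of the “2” vs “9” digits, and the helper names (`sigmoid`, `log_loss`, `gd_logistic`), the step size, and the value of λ are hypothetical choices, not values prescribed by the assignment.

```python
import numpy as np

def sigmoid(z):
    # Logistic function written with tanh so that large |z| does not overflow.
    return 0.5 * (1.0 + np.tanh(0.5 * z))

def log_loss(Y, Y_hat, eps=1e-12):
    # Average cross entropy, computed with vector operations on Y and Y_hat.
    Y_hat = np.clip(Y_hat, eps, 1.0 - eps)
    return -np.mean(Y * np.log(Y_hat) + (1.0 - Y) * np.log(1.0 - Y_hat))

def gd_logistic(X, Y, eta=0.5, lam=1e-3, iters=200):
    # Gradient descent on L(w, b) + (lam / 2) * ||w||^2; the bias b is not regularized.
    N, d = X.shape
    w, b = np.zeros(d), 0.0
    losses = []
    for _ in range(iters):
        Y_hat = sigmoid(X @ w + b)
        losses.append(log_loss(Y, Y_hat))
        grad_w = -(X.T @ (Y - Y_hat)) / N + lam * w
        grad_b = -np.mean(Y - Y_hat)
        w -= eta * grad_w
        b -= eta * grad_b
    return w, b, losses

# Synthetic stand-in data (an assumption; the assignment uses MNIST digits).
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 5))
Y = (X @ np.array([1.0, -2.0, 0.5, 0.0, 1.5]) + 0.3 > 0).astype(float)

w, b, losses = gd_logistic(X, Y)
# Misclassification error as a percentage, using the threshold at 0.
error_pct = 100.0 * np.mean((X @ w + b >= 0) != (Y == 1))
```

The list `losses` holds one value per iteration, so it can be passed directly to a plotting call for the training curve; the development curve would be computed the same way on the held-out data.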

3.3 Let’s use stochastic gradient descent [12 points]

Now use stochastic gradient descent, using one point at a time:

  1. [1 point] Roughly, what is the largest stepsize at which SGD makes progress? (above this you will find that things start behaving very poorly).
  2. [1 point] Specify your step size schedule, if you choose to decay it or what you did.
  3. [5 points] After every 500 updates (starting before your first update, at 0), make a plot showing your training average log loss and your development average log loss on the y-axis and the iteration on the x-axis. Both curves should be on the same plot. What value of λ did you use?
  4. [5 points] Make this plot again (with both curves), except use the misclassification error, as a percentage, instead of the average log loss. Here, make sure to start your x-axis at a slightly later iteration, so that your error starts below 5%, which makes the behavior easier to view (it is difficult to view the long-run behavior if the y-axis covers too large a range). Again, it is expected that you obtain a good test error (meaning you train long enough and you regularize appropriately, if needed). Report the lowest test error.
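
A single-point SGD loop with periodic logging might look like the sketch below. The decay schedule `eta0 / (1 + t / 1000)`, the function names, and the synthetic data are all assumptions made for illustration; the assignment leaves the schedule up to you.

```python
import numpy as np

def sigmoid(z):
    return 0.5 * (1.0 + np.tanh(0.5 * z))

def avg_log_loss(X, Y, w, b, eps=1e-12):
    Y_hat = np.clip(sigmoid(X @ w + b), eps, 1.0 - eps)
    return -np.mean(Y * np.log(Y_hat) + (1.0 - Y) * np.log(1.0 - Y_hat))

def sgd_logistic(X, Y, eta0=0.1, lam=1e-3, updates=3000, record_every=500, seed=0):
    rng = np.random.default_rng(seed)
    N, d = X.shape
    w, b = np.zeros(d), 0.0
    history = []  # (update index, average training log loss)
    for t in range(updates + 1):
        if t % record_every == 0:
            # Recorded before the update, so the first entry is at update 0.
            history.append((t, avg_log_loss(X, Y, w, b)))
        if t == updates:
            break
        n = rng.integers(N)              # one randomly chosen training point
        eta = eta0 / (1.0 + t / 1000.0)  # hypothetical decay schedule
        err = Y[n] - sigmoid(X[n] @ w + b)
        w -= eta * (-err * X[n] + lam * w)
        b -= eta * (-err)                # bias is not regularized
    return w, b, history

# Synthetic stand-in data (an assumption).
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 5))
Y = (X @ np.array([1.0, -2.0, 0.5, 0.0, 1.5]) + 0.3 > 0).astype(float)
w, b, history = sgd_logistic(X, Y)
```

Evaluating the loss only every 500 updates keeps the logging cost small relative to the updates themselves.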

3.4 EXTRA CREDIT: Mini-batch, stochastic gradient descent [10 points]

Again, due to the manner in which matrix multiplication methods (as opposed to using “For Loops”) allow for faster runtimes (often through GPU processors), it is often much faster to use “mini-batch” methods by sampling m points at a time. By increasing the batch size, we reduce the variance in stochastic gradient descent. In practice (and in theory), this tends to be very helpful as we increase m, and then there tend to be (relatively sharp) diminishing returns.
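
The variance reduction can be checked numerically. The sketch below, on synthetic data and with hypothetical helper names, estimates the total variance of the mini-batch gradient at a fixed w for m = 1 and m = 100, sampling indices with replacement.

```python
import numpy as np

def minibatch_grad(X, Y, w, idx):
    # Logistic-loss gradient averaged over the sampled rows idx (no regularization).
    Y_hat = 0.5 * (1.0 + np.tanh(0.5 * (X[idx] @ w)))
    return -(X[idx].T @ (Y[idx] - Y_hat)) / len(idx)

def grad_variance(X, Y, w, m, trials=2000, seed=0):
    # Total variance (summed over coordinates) of the m-point gradient estimate.
    rng = np.random.default_rng(seed)
    grads = np.array([minibatch_grad(X, Y, w, rng.integers(len(X), size=m))
                      for _ in range(trials)])
    return grads.var(axis=0).sum()

# Synthetic stand-in data (an assumption).
rng = np.random.default_rng(1)
X = rng.normal(size=(5000, 10))
Y = (X[:, 0] > 0).astype(float)
w = np.zeros(10)

var_1 = grad_variance(X, Y, w, m=1)
var_100 = grad_variance(X, Y, w, m=100)
ratio = var_1 / var_100  # expected to be on the order of m = 100
```

Because the mini-batch gradient is an average of m independent draws, its variance scales like 1/m, which is what the ratio estimates.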

  1. Now run stochastic gradient descent, using a mini-batch size of m = 100 points at a time. Here, each parameter update means you use m = 100 randomly sampled training points.
     (a) [1 point] Roughly, what is the stepsize at which SGD starts to make significant progress? (Above this it is poorly behaved.) You might find it interesting that this stepsize is different from the m = 1 case.
     (b) [4 points] After every 500 updates (starting before your first update, at 0), make a plot showing your training average log loss and your development log loss on the y-axis and the iteration on the x-axis. Both curves should be on the same plot. What value of λ did you use (if you used it)? Specify your learning rate scheme if you chose to decay your learning rate.

Remark: Note that every update now touches 100 points. However, an update should not be 100 times slower (even though, technically, your computer is doing 100 times as much computation). This is,

Let us now build a classifier for digit recognition on all 10 digits. You will use the full dataset (the same one you used for PCA in HW2), where x has 784 dimensions.

5.1 “One vs all classification” with Linear Regression

In the previous two class problem, we used linear regression with y ∈ { 0 , 1 }. Now we have 10 classes. Here we will use a “one-hot” encoding of the label. The label yn will be a 10 dimensional vector, where the k-th entry is 1 if the label is for the k-th class and all other entries will be 0.

  1. [0 points] Create a label matrix $Y \in \mathbb{R}^{N \times 10}$ for each of your training, dev, and test sets.
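
One way to build such a label matrix is sketched below. It assumes the labels arrive as integers 0 through 9; the function name `one_hot` is hypothetical.

```python
import numpy as np

def one_hot(labels, num_classes=10):
    # Row n has a 1 in column labels[n] and 0 elsewhere.
    labels = np.asarray(labels, dtype=int)
    Y = np.zeros((labels.shape[0], num_classes))
    Y[np.arange(labels.shape[0]), labels] = 1.0
    return Y

Y = one_hot([3, 0, 9])  # three example labels
```

Taking `np.argmax(Y, axis=1)` recovers the original integer labels.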

Here, we can consider a vector-valued prediction:

$$\hat{y}_n = W^\top x_n,$$

where $W \in \mathbb{R}^{d \times 10}$ is a matrix. As discussed in class, we can define the objective function here as:

$$L_\lambda(W) := \frac{1}{N} \sum_{n=1}^{N} \frac{1}{2} \|y_n - W^\top x_n\|^2 + \frac{\lambda}{2} \|W\|^2$$

where you can view the penalty as the sum of the squares of the entries of W. Note that this formulation is literally the same as doing k binary classification problems, one for each class separately: for each k, you do a linear regression where you label a digit as Y = 1 if and only if the label for this digit is k (for k = 0, 1, 2, ..., 9). It is straightforward to verify that the solution is:

$$W_\lambda^* = \left( \frac{1}{N} X^\top X + \lambda I_d \right)^{-1} \left( \frac{1}{N} X^\top Y \right)$$

Note that here we are stacking the vectors $y_n$ and $\hat{y}_n$ into the matrices $Y \in \mathbb{R}^{N \times 10}$ and $\hat{Y} \in \mathbb{R}^{N \times 10}$. For classification, you will then take the largest predicted score among your 10 predictors.

Def. of the misclassification error: We say a mistake is made on an example (x, y) if our prediction in {1, ..., k} does not equal the label y. The % misclassification error (on our training, dev, test, etc.) is the % of such mistakes made by our prediction method on the respective dataset.

Remark: This is sometimes referred to as “one against all” prediction. Note that we are just doing 10 separate linear regressions and are stacking our answers together. Also, the gradient of this loss function can be expressed as:

$$\frac{dL_\lambda(W)}{dW} = -\frac{1}{N} \sum_{n=1}^{N} x_n (y_n - \hat{y}_n)^\top + \lambda W. \tag{1}$$

Note that this expression is of size d × k.

Dataset You will use the MNIST dataset you used from the last assignment. It contains all 10 digits with the labels. The instructor will post a function on Piazza for computing the misclassification % error that you are free to use.

  1. [4 points] Based on the above gradient expression, write out the matrix algebra expression for this gradient in terms of $X \in \mathbb{R}^{N \times d}$, $Y \in \mathbb{R}^{N \times 10}$, $\hat{Y} \in \mathbb{R}^{N \times 10}$ (and other relevant quantities), where there is no “sum over n” in your expression.
  2. [12 points] Decide which method you would like to use: the closed form expression, GD, or SGD. Specify your method along with all your parameters. On the training set, dev set, and test set, what are your average square losses and what is your misclassification % error?
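
As a sanity check on the closed form, the sketch below solves the regularized linear system on synthetic clustered data and classifies by the largest predicted score. It assumes the objective carries a factor of 1/2 on the squared error, so that the gradient vanishes at the closed-form solution; the function names, the data, and λ = 0.1 are hypothetical.

```python
import numpy as np

def ridge_closed_form(X, Y, lam):
    # Solves ((1/N) X^T X + lam * I_d) W = (1/N) X^T Y for W.
    N, d = X.shape
    A = X.T @ X / N + lam * np.eye(d)
    return np.linalg.solve(A, X.T @ Y / N)

def loss_gradient(X, Y, W, lam):
    # Gradient of the regularized square loss, written without a sum over n.
    N = X.shape[0]
    return -(X.T @ (Y - X @ W)) / N + lam * W

def misclassification_pct(X, labels, W):
    # Predict the class with the largest score among the 10 regressors.
    return 100.0 * np.mean(np.argmax(X @ W, axis=1) != labels)

# Synthetic stand-in for MNIST (an assumption): 10 well-separated clusters.
rng = np.random.default_rng(0)
N, d, k = 500, 20, 10
labels = rng.integers(k, size=N)
centers = rng.normal(size=(k, d))
X = 3.0 * centers[labels] + rng.normal(size=(N, d))
Y = np.eye(k)[labels]  # one-hot label matrix

W = ridge_closed_form(X, Y, lam=0.1)
grad_norm = np.linalg.norm(loss_gradient(X, Y, W, lam=0.1))  # should be near 0
train_error = misclassification_pct(X, labels, W)
```

Using `np.linalg.solve` rather than forming an explicit inverse is a design choice for numerical stability; both express the same solution.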

5.2 EXTRA CREDIT: Take a matrix derivative on your own [5 points]

Prove equation 1. To do this, look up some facts about matrix derivatives on the internet (there are all sorts of “matrix cookbooks”, “cheat sheets”, etc. out there). Provide the one or two rules for how one takes a matrix derivative to obtain the proof. The proof should be just a few steps. Also, you should be able to convince yourself as to how this follows from the vector proof you did earlier.

6 Multi-Class Classification using the softmax [20 points]

We now turn to the softmax classifier. Here, y takes values in the set {1, ..., k}. The model is as follows: we have k weight vectors, $w^{(1)}, w^{(2)}, \ldots, w^{(k)}$. Here, we can view these parameters as the columns of a matrix $W \in \mathbb{R}^{d \times 10}$. For $\ell \in \{1, \ldots, k\}$,

$$p_W(y = \ell \mid x) = \frac{\exp(w^{(\ell)} \cdot x)}{\sum_{i=1}^{k} \exp(w^{(i)} \cdot x)}$$

Again, note that this is a valid probability distribution (the probabilities are positive and they sum to 1 ). Also, note that we have “over-parameterized” the model, since:

$$p_W(y = k \mid x) = 1 - \sum_{i=1}^{k-1} p_W(y = i \mid x)$$

We could define the model without using $w^{(k)}$. However, the instructor likes this choice as the derivative expressions become a little simpler (and it makes it easier to re-use code). As before, it is helpful to define the “prediction vector”:

$$\hat{y}_n = p_W(y \mid x_n)$$

where we view $p_W(y \mid x_n)$ as a k-dimensional (column) vector whose i-th component is $p_W(y = i \mid x_n)$.
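
The prediction vectors can be computed for a whole batch at once. The sketch below is an illustration with random inputs and a hypothetical function name; it subtracts the row-wise maximum score before exponentiating, a standard trick that leaves the probabilities unchanged but avoids overflow. The last lines check the over-parameterization numerically: adding the same vector to every column of W does not change the probabilities.

```python
import numpy as np

def softmax_predictions(X, W):
    # Row n is the prediction vector for x_n; column i holds p_W(y = i | x_n).
    scores = X @ W
    scores = scores - scores.max(axis=1, keepdims=True)  # shift for stability
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum(axis=1, keepdims=True)

# Random stand-in inputs and weights (an assumption).
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 784))
W = rng.normal(size=(784, 10))
Y_hat = softmax_predictions(X, W)

# Shifting every weight vector by the same vector c gives the same distribution.
c = rng.normal(size=784)
Y_hat_shifted = softmax_predictions(X, W + c[:, None])
```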

learning. The instructor’s hope is that, with more hands-on experience, you will be better informed about the issues in play. The talk is relevant, since we are basically implementing the method discussed in this paper. The paper itself does not provide the most lucid justification; the method is really just a “quick and dirty” procedure to make features. In practice, there are often better feature generation methods; this one is remarkably simple.

In this problem, we will engage in the bad practice where we do not have a dev set. To a large extent, looking “a little” at the test set is done in practice (and this shouldn’t hurt us too much if we understand how confidence intervals work). However, this has been done for quite some time on this dataset, which is why the instructor is suspicious of test errors below 1.2% among those methods that do not use “distortions” or “pre-processing” or “convolutional” methods (we should expect the latter methods to give performance bumps).

The view of the instructor is that about 1.4% or less is “state of the art” without “distortions” or “pre-processing” or “convolutional” methods (as discussed on the MNIST website). If we wanted even higher accuracy, we should really move to convolutional methods, which we may briefly discuss later in the class.

Finally, the approach below might seem a little nonsensical. However, an important lesson is that large feature representations, appropriately blown up, often perform remarkably well once you have a lot of labeled data.

Making the features

Grab the “mnist_all_50pca_dims.gz” dataset. It contains all the datapoints reduced down to 50 dimensions. There is no dev set, and there are 60,000 training points. The inputs have been normalized so that the feature vectors x are, on average, unit length, i.e. $E[\|x\|^2] = 1$. Load the modified MNIST dataset in Python as follows:

    import gzip
    import pickle

    # Load the pickled dictionary; its keys are bytes objects.
    with gzip.open("mnist_all_50pca_dims.gz") as f:
        data = pickle.load(f, encoding="bytes")
    Xtrain, Xtest = data[b"Xtrain"], data[b"Xtest"]
    Ytrain, Ytest = data[b"Ytrain"], data[b"Ytest"]

Now let us try to make “better” features; we are not going to be particularly clever in the way we make these features, though they do provide remarkable improvements. Let x be an image (as a vector in $\mathbb{R}^d$). Now we will map each x to a k-dimensional feature vector as follows: we will first construct k random vectors, $v_1, v_2, \ldots, v_k$ (these will be sampled from a Gaussian distribution). In other words, you first sample a matrix $V \in \mathbb{R}^{d \times k}$, where the columns of this matrix are $v_1$ to $v_k$; this can be done in Python with the command np.random.randn(d, k). Then our feature vector will be the following vector:

$$\phi(x) = \left( \sin(2 v_1^\top x), \sin(2 v_2^\top x), \ldots, \sin(2 v_k^\top x) \right)$$

Note that φ(x) is a k-dimensional vector; sin(·) is the usual trigonometric function; and the factor of 2 is a hyperparameter chosen by the instructor¹. You are welcome to try and alter the 2 to another value if you find it works better. Note that you only generate V once; you always use the same V whenever you compute φ. We will use (drumroll please....) k = 60,000 features. This seems like an unwieldy number. However, it will not actually be so bad, since you never explicitly construct and store this dataset. You will construct it “on the fly”.
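
A minimal sketch of the feature map follows. It uses a seeded NumPy generator in place of np.random.randn so that the result is reproducible, and a small k to keep the example light; the function name and the random input vector are assumptions.

```python
import numpy as np

def random_sine_features(X_batch, V, scale=2.0):
    # Row i is phi(x_i) = (sin(scale * v_1 . x_i), ..., sin(scale * v_k . x_i)).
    return np.sin(scale * (X_batch @ V))

d, k = 50, 1000                    # k is kept small here; the assignment uses k = 60,000
rng = np.random.default_rng(0)
V = rng.standard_normal((d, k))    # sampled once and reused for every call

x = rng.standard_normal(d) / np.sqrt(d)        # stand-in input of roughly unit length
phi = random_sine_features(x[None, :], V)[0]   # k-dimensional feature vector
```

Passing a matrix of several inputs instead of `x[None, :]` returns one feature row per input, which is the form needed for mini-batches.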

Tips

With only your laptop in hand (or the compute resources provided, which are hopefully not particularly impressive), this problem is just hard enough that it will force you to understand many of the issues at play in large scale machine learning. In fact, if you try to explicitly construct your feature matrix of size N × k, which is of size 60,000 × 60,000, you will hopefully run out of memory. Regardless, the problem is very much solvable, in a timely manner, with even meager compute resources. The suggestions below are more broadly applicable to how we address many of the issues in large scale machine learning.

  • (mini-batching) Mini-batching helps. It is too costly to do full gradient updates. Use m = 50.
  • (memory) As the dimension is large in this problem, we seek to avoid explicitly computing and storing the full feature matrix, which is of size N × k = N × N. Instead, you can compute the feature vector φ(x) “on the fly”, i.e. you recompute the vector φ(x) whenever you access an image x. In particular, you must do this on your mini-batch with matrix operations for your code to be fast enough. If $\tilde{X}$ is your m × d mini-batch data matrix, do you see how the matrix $\sin(2 \tilde{X} V)$ relates to the features you desire? Here sin is applied componentwise.
  • (regularization) It is up to you to determine how (and if) you set it. We do expect you to get good performance.
  • (learning rates) With the square loss case, I like to set my learning rates large. And then I decay them only if I need to.
  • (interrupting your code) Sometimes I find it helpful to be able to interrupt my code (with “Ctrl-C” or whatever you use) and have the ability to restart it without losing the current state of my parameters. Make sure you understand how to do this, and feel free to discuss this on the discussion board. This can be helpful. For example, for some problems, I may want to adjust my learning rate “by hand”, and this allows me to do this.
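
The tips above can be combined into one training loop, sketched below for the square loss. Everything specific here is an assumption made for illustration: the synthetic clustered data, the small k, the step size, and the function names. Features are recomputed for each mini-batch, and a Ctrl-C during training returns the current weights so that a later call can resume from them.

```python
import numpy as np

def sine_features(X_batch, V):
    # Features sin(2 * X_batch @ V), computed only for the rows passed in.
    return np.sin(2.0 * (X_batch @ V))

def avg_square_loss(X, Y, V, W, chunk=500):
    # Average of 0.5 * ||y_n - W^T phi(x_n)||^2, evaluated chunk by chunk.
    total = 0.0
    for start in range(0, X.shape[0], chunk):
        Phi = sine_features(X[start:start + chunk], V)
        total += 0.5 * np.sum((Y[start:start + chunk] - Phi @ W) ** 2)
    return total / X.shape[0]

def train(X, Y, V, eta=0.01, lam=0.0, m=50, updates=200, seed=0, W=None):
    # Mini-batch SGD on the square loss; pass W to resume from saved weights.
    rng = np.random.default_rng(seed)
    if W is None:
        W = np.zeros((V.shape[1], Y.shape[1]))
    try:
        for _ in range(updates):
            idx = rng.integers(X.shape[0], size=m)
            Phi = sine_features(X[idx], V)      # m x k, for this batch only
            residual = Y[idx] - Phi @ W
            W -= eta * (-(Phi.T @ residual) / m + lam * W)
    except KeyboardInterrupt:
        pass  # Ctrl-C: fall through and return the current parameters
    return W

# Synthetic stand-in data (an assumption): 10 tight clusters of unit-length points.
rng = np.random.default_rng(0)
N, d, k, classes = 1000, 50, 500, 10
labels = rng.integers(classes, size=N)
centers = rng.standard_normal((classes, d))
centers /= np.linalg.norm(centers, axis=1, keepdims=True)
X = centers[labels] + 0.02 * rng.standard_normal((N, d))
Y = np.eye(classes)[labels]
V = rng.standard_normal((d, k))

loss_before = avg_square_loss(X, Y, V, np.zeros((k, classes)))
W = train(X, Y, V)
loss_after = avg_square_loss(X, Y, V, W)
```

Evaluating the loss in chunks follows the same principle as training: the full N × k feature matrix is never held in memory at once.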

Loss functions

You are free to try out both the square loss and the logistic loss. The “objective function error” refers to either the square loss or the logistic/softmax loss (whichever you used). You are encouraged to try the square loss as well (if you tried both, tell us!). In practice, in the small feature regime

¹ The aforementioned normalization of the data by the instructor makes this factor of 2 naturally correspond to a certain scale of the data. You can understand this more by looking at the paper in the link. It is analogous to the choice of a “bandwidth” in certain radial basis function kernel methods.

rate.

  1. [3 points] After every 500 updates, make a plot showing your training average objective function error and your test average objective function error, with your average objective function error on the y-axis and the iteration # on the x-axis. Both curves should be on the same plot. Also make sure to start these plots sufficiently many updates after 0 and to label the x-axis appropriately based on where you start plotting (if you start plotting at update 0, your plots will be difficult to read and interpret due to the average objective function error initially dropping so quickly).
  2. [3 points] For the misclassification error, make the same plot, again with two curves (training and test). Do not start your plot at update 0, and make sure your plot is readable.
  3. [3 points] Plot the Euclidean norm of the weight vector, where the x-axis is the iteration number and the y-axis is the norm of the weight vector at that update (you need only compute/store the norm every 500 iterations, as before). It is often helpful to plot the norms of your weight vectors, and you might find it striking how this curve behaves.
  4. [2 points] What is the lowest training and test average objective function error achieved during your runs? Make sure you have run for long enough.
  5. [6 points] What is the lowest training misclassification % error achieved over all of your runs? What is the smallest number of total mistakes made (out of the 60K points) on your training set and on your test set (over all updates)? Note you can just derive this from your lowest misclassification % errors on your train and test sets, respectively (remember that you need to divide by a factor of 100 when dealing with %’s!). If you estimated your training error with a 10K subset, then make sure to multiply by 6 when estimating the total number of errors on your training set.
  6. [6 points] Provide a short discussion (about a paragraph) on overfitting. Do you see your training average objective function error rise? Did you make an extremely small number of total mistakes on your training set and was this very different from your test set? Comment on your findings.
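
The conversion from a misclassification % to a count of mistakes is simple arithmetic, sketched below. The 0.5% figure is a made-up number used only to illustrate the calculation, and the function name is hypothetical.

```python
def total_mistakes(error_pct, num_points):
    # Convert a misclassification % into a count of mistakes.
    return round(error_pct / 100.0 * num_points)

# Hypothetical error rate, for illustration only.
full_train = total_mistakes(0.5, 60_000)   # error measured on all 60K points
subset = total_mistakes(0.5, 10_000)       # error measured on a 10K subset
scaled = 6 * subset                        # estimate for the full 60K training set
```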

7.3 Reflections [0 points]

You are welcome to jot down a few thoughts about what you found here. If you tried both square loss and logistic loss, please let us know. We may give credit adjustments if you tried both the square loss and the logistic loss.

8 EXTRA CREDIT: Proving a rate of convergence for GD for the least squares problem [20 points]

This is a fundamental convergence result in mathematical optimization. With a good understanding of the SVD, the proofs are short and within your reach. Let us consider gradient descent on the least squares problem.

$$L(w) = \frac{1}{N} \|Y - Xw\|^2$$

Gradient descent is the update rule:

$$w^{(k+1)} = w^{(k)} - \eta \nabla L(w^{(k)})$$

Let $\lambda_1, \lambda_2, \ldots, \lambda_d$ be the eigenvalues of $\frac{1}{N} X^\top X$ in descending order (so $\lambda_1$ is the largest eigenvalue).

  1. [8 points] In terms of the aforementioned eigenvalues, what is the threshold stepsize such that, for any η above this threshold, gradient descent diverges, and, for any η below this threshold, gradient descent converges? You must provide a technically correct proof.
  2. [8 points] Set η so that:

$$\|w^{(k+1)} - w^*\| \le \exp(-\kappa) \|w^{(k)} - w^*\|$$

     where κ is some (positive) scalar. In particular, set η so that κ is as large as possible. What is the value of η you used and what is κ? Again, you must provide a proof. You should be able to upper bound your expression so that you can state it in terms of the maximal eigenvalue $\lambda_1$ and the minimal eigenvalue $\lambda_d$. The above equation shows a property called contraction.
  3. [4 points] Now suppose that you want your parameter to be ε-close to the optimal one, i.e. you seek $\|w^{(k)} - w^*\| \le \epsilon$. How large does k need to be to guarantee this?
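
Before attempting the proofs, it can help to watch the contraction numerically. The sketch below assumes the objective $L(w) = \frac{1}{N}\|Y - Xw\|^2$, synthetic data, and a deliberately small stepsize (one tenth of $1/\lambda_1$, an arbitrary choice that is neither the threshold nor the optimal value asked for above); the function name is hypothetical.

```python
import numpy as np

def gd_least_squares(X, Y, eta, steps):
    # Gradient descent on L(w) = (1/N) * ||Y - X w||^2, tracking ||w_k - w*||.
    N, d = X.shape
    w_star = np.linalg.lstsq(X, Y, rcond=None)[0]
    w = np.zeros(d)
    dists = [np.linalg.norm(w - w_star)]
    for _ in range(steps):
        grad = (2.0 / N) * (X.T @ (X @ w - Y))
        w = w - eta * grad
        dists.append(np.linalg.norm(w - w_star))
    return np.array(dists)

# Synthetic regression data (an assumption).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
Y = X @ rng.normal(size=5) + 0.1 * rng.normal(size=200)

# Eigenvalues of (1/N) X^T X in descending order: lambda_1 >= ... >= lambda_d.
eigvals = np.sort(np.linalg.eigvalsh(X.T @ X / X.shape[0]))[::-1]

dists = gd_least_squares(X, Y, eta=0.1 / eigvals[0], steps=50)
ratios = dists[1:] / dists[:-1]   # per-step contraction factors
```

Rerunning with other values of `eta` and inspecting `ratios` gives an empirical picture to compare against the bounds you derive.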

9 Code

Please include all your code in the PDF file in this section. Specify which problem(s) the code corresponds to. Re Jupyter: refer to the policies section of the HW.