

















Please read these policies. Please answer the three questions below and include your answers marked in a “problem 0” in your solution set. Homeworks which do not include these answers will not be graded.
Gradescope submission: When submitting your HW, please tag your pages correctly as requested in Gradescope. Untagged homeworks will not be graded until the tagging is fixed.
Readings: Read the required material.
Submission format: Submit your report as a single PDF file. Also, please include all your code in the PDF file in a section at the end of your document, marked “Code”; also specify which problem(s) the code corresponds to. The report (in a single PDF file) must include all the plots and explanations for programming questions (if required). Homework solutions must be organized in order, with all plots arranged in the correct location in your submitted solutions. We highly recommend typesetting your scientific writing using LaTeX (see the website for references for free tools). Handwritten solutions will be accepted provided they are neat; they need to be scanned and included in the single PDF.
Written work: Please provide succinct answers along with succinct reasoning for all your answers. Points may be deducted if long answers demonstrate a lack of clarity. Similarly, when discussing the experimental results, concisely create tables and figures to organize them. In other words, all your explanations, tables, and figures for any particular part of a question must be grouped together.
Including your Python source code: For the programming assignments, submit your code in the PDF file along with a neatly written README file that explains how you ran your code with different settings (if applicable). Please note that we will not accept screenshots of Jupyter notebooks. If you do use Jupyter, you must export your code to a text file and put the text of your code in the submitted PDF file (in the last section) in a manner that can be executed in that order (without any extraneous or missing code). We assume that you always follow good coding practice (commenting, structuring); these factors are not central to your grade.
Coding policies: You must write your own code. You are welcome to use any Python libraries for data munging, visualization, and numerical linear algebra. Examples include NumPy, Pandas, and Matplotlib. You may not, however, use any machine learning libraries such as Scikit-Learn, TensorFlow, or PyTorch, unless explicitly specified for that question. If in doubt, post to the message boards.
Collaboration: It is acceptable for you to discuss problems with other students; it is not acceptable for students to look at another student's written answers. It is acceptable for you to discuss coding questions with others; it is not acceptable for students to look at another student's code. Each student must understand, write, and hand in their own answers. In addition, each student must write and submit their own code in the programming part of the assignment.
Acknowledgments: We expect students not to refer to or seek out solutions in published material from previous years, on the web, or from other textbooks. Students are certainly encouraged to read extra material for a deeper understanding.
Extra Credit Policy: In order to get extra credit, you must do all the regular problems. Extra credit points will only be awarded if there are (honest attempts at) answers to all the regular questions. This is because the extra credit questions are not designed to be alternatives to the regular questions.
List the names of all people you have collaborated with and for which question(s).
If you do inadvertently find an assignment’s answer, acknowledge for which question and provide an appropriate citation (there is no penalty, provided you include the acknowledgement). If not, then write “none”.
Please make sure to read and follow these instructions. Write “I have read and understood these policies” to certify this.
Remark: You might find this expression to be rather curious! It looks identical to the expression for our gradient in the average squared error case. You are free to think about why this was a fortunate coincidence. The choice of y ∈ { 0 , 1 } was indeed intentional.
Implement logistic regression on our “2” vs “9” dataset. Note: here you need to modify the decision rule: we should label a digit as a “2”, i.e. y = 1, if w · x ≥ 0 (You should be able to see why this threshold is appropriate.). Make sure to explicitly include a bias term in the model and do not regularize this term. To be precise, let the un-regularized loss be:
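$$L(w, b) = -\frac{1}{N}\sum_{n=1}^{N} \log p_{w,b}(y = y_n \mid x_n),$$
and so
$$L_\lambda(w, b) = L(w, b) + \frac{\lambda}{2}\|w\|^2.$$
Here $p_{w,b}$ is the model which includes a bias term. Note that gradient descent here would be:
$$w \leftarrow w - \eta \nabla_w L_\lambda(w, b) = w - \eta\,(\nabla_w L(w, b) + \lambda w)$$
$$b \leftarrow b - \eta \frac{\partial L_\lambda(w, b)}{\partial b} = b - \eta \frac{\partial L(w, b)}{\partial b}$$
where the last step follows since our cost function is not regularizing the bias term. Now run gradient descent: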
$$-\frac{1}{N}\sum_{n=1}^{N}\left(y_n \log \hat{y}_n + (1 - y_n)\log(1 - \hat{y}_n)\right),$$
which can be directly computed in Python with operations on the vectors Y and Ŷ. Both curves should be on the same plot. What value of λ did you use?
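A minimal sketch of the gradient descent update and the log loss computation may help here. It assumes the loss is averaged over the N points (so the gradient with respect to w is Xᵀ(Ŷ − Y)/N), uses a synthetic stand-in for the “2” vs “9” data, and the names `sigmoid`, `log_loss`, `gd_step`, as well as the values of η and λ, are illustrative choices rather than anything prescribed by the assignment.

```python
import numpy as np

def sigmoid(z):
    # tanh form of the logistic function; avoids overflow for large |z|.
    return 0.5 * (1.0 + np.tanh(0.5 * z))

def log_loss(Y, Y_hat, eps=1e-12):
    # Average log loss on vectors Y and Y_hat.
    # The clipping is an added safeguard against log(0), not part of the formula.
    Y_hat = np.clip(Y_hat, eps, 1.0 - eps)
    return -np.mean(Y * np.log(Y_hat) + (1.0 - Y) * np.log(1.0 - Y_hat))

def gd_step(w, b, X, Y, eta, lam):
    # One gradient descent step; the bias b is NOT regularized.
    residual = sigmoid(X @ w + b) - Y          # shape (N,)
    grad_w = X.T @ residual / len(Y) + lam * w
    grad_b = residual.mean()
    return w - eta * grad_w, b - eta * grad_b

# Synthetic stand-in data (assumption: NOT the MNIST "2" vs "9" set).
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 5))
Y = (X @ rng.standard_normal(5) + 0.3 > 0).astype(float)

w, b = np.zeros(5), 0.0
losses = []
for _ in range(200):
    losses.append(log_loss(Y, sigmoid(X @ w + b)))
    w, b = gd_step(w, b, X, Y, eta=0.5, lam=1e-3)

print(losses[0], losses[-1])
```

With w = 0 and b = 0 every prediction is 0.5, so the first recorded loss is log 2; the later values show whether the chosen step size is making progress.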
Now use stochastic gradient descent, using one point at a time:
Again, because matrix multiplication methods (as opposed to “for loops”) allow for faster runtimes (often through GPU processors), it is often much faster to use “mini-batch” methods that sample m points at a time. By increasing the batch size, we reduce the variance in stochastic gradient descent. In practice (and in theory), this tends to be very helpful as we increase m, up to a point; beyond that there tend to be (relatively sharp) diminishing returns.
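The mini-batch idea can be sketched as follows. This is only an illustration on synthetic data: sampling with replacement, the function names, and the values of η, λ, and m are assumptions made for the example. The last few lines estimate the variance of the gradient estimate at w = 0 for m = 1 and m = 100 to show the variance reduction described above.

```python
import numpy as np

def sigmoid(z):
    return 0.5 * (1.0 + np.tanh(0.5 * z))

def minibatch_gradient(w, b, X, Y, lam, m, rng):
    # Gradient estimate from m points sampled uniformly with replacement.
    idx = rng.integers(0, X.shape[0], size=m)
    residual = sigmoid(X[idx] @ w + b) - Y[idx]
    return X[idx].T @ residual / m + lam * w, residual.mean()

def minibatch_sgd(X, Y, eta, lam, m, num_steps, seed=0):
    rng = np.random.default_rng(seed)
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(num_steps):
        grad_w, grad_b = minibatch_gradient(w, b, X, Y, lam, m, rng)
        w, b = w - eta * grad_w, b - eta * grad_b
    return w, b

# Synthetic stand-in data (assumption: not the homework dataset).
rng = np.random.default_rng(1)
X = rng.standard_normal((500, 5))
Y = (X @ rng.standard_normal(5) > 0).astype(float)

w, b = minibatch_sgd(X, Y, eta=0.5, lam=1e-3, m=10, num_steps=500)
accuracy = np.mean((X @ w + b >= 0) == (Y == 1))

def gradient_variance(m, trials=500):
    # Total variance (summed over coordinates) of the gradient estimate at w = 0.
    w0 = np.zeros(X.shape[1])
    grads = np.array([minibatch_gradient(w0, 0.0, X, Y, 0.0, m, rng)[0]
                      for _ in range(trials)])
    return grads.var(axis=0).sum()

var_1, var_100 = gradient_variance(1), gradient_variance(100)
print(accuracy, var_1, var_100)
```

Each step costs one small matrix product instead of a Python loop over points, which is where the runtime advantage comes from.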
Let us now build a classifier for digit recognition on all 10 digits. You will use the full dataset (same as you used for PCA in HW2), where x has 784 dimensions.
In the previous two-class problem, we used linear regression with y ∈ { 0 , 1 }. Now we have 10 classes. Here we will use a “one-hot” encoding of the label. The label yn will be a 10-dimensional vector, where the k-th entry is 1 if the label is for the k-th class and all other entries are 0.
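As a small illustration, the one-hot encoding can be built in NumPy as below. The sketch assumes the labels arrive as integers 0 through 9; the function name `one_hot` is hypothetical.

```python
import numpy as np

def one_hot(labels, num_classes=10):
    # Row n has a single 1 in the column given by labels[n].
    labels = np.asarray(labels, dtype=int)
    Y = np.zeros((labels.size, num_classes))
    Y[np.arange(labels.size), labels] = 1.0
    return Y

Y = one_hot([3, 0, 9])
print(Y.shape)  # (3, 10)
```

Stacking the rows this way gives the N × 10 label matrix directly.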
Here, we can consider a vector-valued prediction:
$$\hat{y}_n = W^\top x_n,$$
where $W \in \mathbb{R}^{d \times 10}$ is a matrix. As discussed in class, we can define the objective function here as:
$$L_\lambda(W) := \frac{1}{N}\sum_{n=1}^{N} \frac{1}{2}\|y_n - W^\top x_n\|^2 + \frac{\lambda}{2}\|W\|^2,$$
where you can view the penalty as the sum of the squares of the entries of W. Note that this formulation is literally the same as doing 10 binary classification problems, one for each class separately: for each k = 0, 1, 2, ..., 9, you do a linear regression where you label a digit as Y = 1 if and only if the label for this digit is k. It is straightforward to verify that the solution is:
$$W^*_\lambda = \left(\frac{1}{N} X^\top X + \lambda I_d\right)^{-1}\left(\frac{1}{N} X^\top Y\right).$$
Note that here we are stacking the vectors $y_n$ and $\hat{y}_n$ into the matrices $Y \in \mathbb{R}^{N \times 10}$ and $\hat{Y} \in \mathbb{R}^{N \times 10}$. For classification, you will then take the largest predicted score among your 10 predictors.
Def. of the misclassification error: We say a mistake is made on an example (x, y) if our prediction in {1, ..., k} does not equal the label y. The % misclassification error (on our training, dev, test, etc.) is the % of such mistakes made by our prediction method on the respective dataset.
Remark: This is sometimes referred to as “one against all” prediction. Note that we are just doing 10 separate linear regressions and are stacking our answers together. Also, the gradient of this loss function can be expressed as:
$$\frac{dL_\lambda(W)}{dW} = -\frac{1}{N}\sum_{n=1}^{N} x_n (y_n - \hat{y}_n)^\top + \lambda W. \tag{1}$$
Note that this expression is of size d × k.
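A hedged numerical sketch of the closed-form solution, the gradient, and the misclassification error follows. It assumes the objective is normalized as (1/N) Σₙ ½‖yₙ − Wᵀxₙ‖² + (λ/2)‖W‖², so that the sum over n of xₙ(yₙ − ŷₙ)ᵀ becomes the matrix product Xᵀ(Y − Ŷ). The data are synthetic, and `misclassification_pct` is a hypothetical stand-in, not the function the instructor posts.

```python
import numpy as np

def ridge_closed_form(X, Y, lam):
    # Solves ((1/N) X^T X + lam I) W = (1/N) X^T Y for W (shape d x k).
    N, d = X.shape
    return np.linalg.solve(X.T @ X / N + lam * np.eye(d), X.T @ Y / N)

def gradient(W, X, Y, lam):
    # Matrix form of the gradient; the result has shape d x k.
    Y_hat = X @ W
    return -X.T @ (Y - Y_hat) / X.shape[0] + lam * W

def misclassification_pct(W, X, labels):
    # Predict the class with the largest score, then count mistakes.
    predictions = np.argmax(X @ W, axis=1)
    return 100.0 * np.mean(predictions != labels)

# Synthetic stand-in data with 3 classes (assumption: not MNIST).
rng = np.random.default_rng(0)
N, d, k = 300, 8, 3
X = rng.standard_normal((N, d))
labels = np.argmax(X @ rng.standard_normal((d, k)), axis=1)
Y = np.eye(k)[labels]                      # one-hot label matrix

W_star = ridge_closed_form(X, Y, lam=0.1)
grad_norm = np.linalg.norm(gradient(W_star, X, Y, lam=0.1))
error_pct = misclassification_pct(W_star, X, labels)
print(grad_norm, error_pct)
```

The gradient norm at the closed-form solution should be zero up to floating-point error, which is a quick sanity check that the two expressions agree under the stated normalization.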
Dataset: You will use the MNIST dataset from the last assignment. It contains all 10 digits with the labels. The instructor will post a function on Piazza for computing the misclassification % error that you are free to use.
Prove equation 1. To do this, look up some facts about matrix derivatives on the internet (there are all sorts of “matrix cookbooks”, “cheat sheets”, etc. out there). Provide the one or two rules for how one takes a matrix derivative to obtain the proof. The proof should be just a few steps. Also, you should be able to convince yourself as to how this follows from the vector proof you did earlier.
6 Multi-Class Classification using the softmax [20 points]
We now turn to the softmax classifier. Here, y takes values in the set {1, ..., k}. The model is as follows: we have k weight vectors, $w^{(1)}, w^{(2)}, \ldots, w^{(k)}$. Here, we can view these parameters as the columns of a matrix $W \in \mathbb{R}^{d \times 10}$. For $\ell \in \{1, \ldots, k\}$,
$$p_W(y = \ell \mid x) = \frac{\exp(w^{(\ell)} \cdot x)}{\sum_{i=1}^{k} \exp(w^{(i)} \cdot x)}.$$
Again, note that this is a valid probability distribution (the probabilities are positive and they sum to 1). Also, note that we have “over-parameterized” the model, since:
$$p_W(y = k \mid x) = 1 - \sum_{i=1}^{k-1} p_W(y = i \mid x).$$
We could define the model without using $w^{(k)}$. However, the instructor likes this choice as the derivative expressions become a little simpler (and it makes it easier to re-use code). As before, it is helpful to define the “prediction vector”:
$$\hat{y}_n = p_W(y \mid x_n),$$
where we view $p_W(y \mid x_n)$ as a k-dimensional (column) vector whose i-th component is $p_W(y = i \mid x_n)$.
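The prediction vectors can be computed for a whole batch at once, as in the sketch below. Subtracting the row-wise maximum score before exponentiating is an added numerical safeguard that leaves the probabilities unchanged; the function name and the random test data are assumptions. Row n of the output holds the prediction vector for the n-th point (as a row rather than a column).

```python
import numpy as np

def softmax_predictions(W, X):
    # X: (N, d), W: (d, k). Returns an (N, k) matrix of class probabilities.
    scores = X @ W
    scores = scores - scores.max(axis=1, keepdims=True)  # stability shift
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 6))
W = rng.standard_normal((6, 10))
Y_hat = softmax_predictions(W, X)

# Over-parameterization: subtracting the last weight vector from every
# column (so the last column becomes zero) gives the same probabilities.
W_shifted = W - W[:, [-1]]
Y_hat_shifted = softmax_predictions(W_shifted, X)
print(np.allclose(Y_hat, Y_hat_shifted))  # True
```

The second half of the example illustrates the over-parameterization remark: one weight vector can be fixed to zero without changing any prediction.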
learning. The instructor’s hope is that, with more hands-on experience, you will be better informed about the issues in play. The talk is relevant, since we are basically implementing the method discussed in this paper. The paper itself does not provide the most lucid justification; the method is really just a “quick and dirty” procedure to make features. In practice, there are often better feature generation methods; this one is remarkably simple. In this problem, we will engage in the bad practice where we do not have a dev set. To a large extent, looking “a little” at the test set is done in practice (and this shouldn’t hurt us too much if we understand how confidence intervals work). However, this has been done for quite some time on this dataset, which is why the instructor is suspicious of test errors below 1.2% among those methods that do not use “distortions”, “pre-processing”, or “convolutional” methods (we should expect the latter methods to give performance bumps). The view of the instructor is that about 1.4% or less is “state of the art” without “distortions”, “pre-processing”, or “convolutional” methods (as discussed on the MNIST website). If we wanted even higher accuracy, we should really move to convolutional methods, which we may briefly discuss later in the class. Finally, the approach below might seem a little nonsensical. However, an important lesson is that large feature representations, appropriately blown up, often perform remarkably well once you have a lot of labeled data.
Grab the “mnist_all_50pca_dims.gz” dataset. It contains all the datapoints reduced down to 50 dimensions. There is no dev set, and there are 60,000 training points. The inputs have been normalized so that the feature vectors x are, on average, unit length, i.e. E[‖x‖^2] = 1. Load the modified MNIST dataset in Python as follows:
    import gzip
    import pickle

    # encoding="bytes" reads Python 2 pickled strings as bytes, hence the b"..." keys.
    with gzip.open("mnist_all_50pca_dims.gz") as f:
        data = pickle.load(f, encoding="bytes")

    Xtrain, Xtest = data[b"Xtrain"], data[b"Xtest"]
    Ytrain, Ytest = data[b"Ytrain"], data[b"Ytest"]
Now let us try to make “better” features; we are not going to be particularly clever in the way we make these features, though they do provide remarkable improvements. Let x be an image (as a vector in $\mathbb{R}^d$). Now we will map each x to a k-dimensional feature vector as follows: we will first construct k random vectors, $v_1, v_2, \ldots, v_k$ (these will be sampled from a Gaussian distribution). In other words, you first sample a matrix $V \in \mathbb{R}^{d \times k}$, where the columns of this matrix are $v_1$ to $v_k$; this can be done in Python with the command np.random.randn(d, k). Then our feature vector will be the following vector:
$$\phi(x) = \left(\sin(2 v_1^\top x), \sin(2 v_2^\top x), \ldots, \sin(2 v_k^\top x)\right)$$
Note that φ(x) is a k-dimensional vector; sin(·) is the usual trigonometric function; and the factor of 2 is a hyperparameter chosen by the instructor [1]. You are welcome to try altering the 2 to another value if you find it works better. Note that you only generate V once; you always use the same V whenever you compute φ. We will use (drumroll please....) k = 60,000 features. This seems like an unwieldy number. However, it will not actually be so bad since you never actually explicitly construct and store this dataset. You will construct it “on the fly”.
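One way to read “on the fly” is to keep only V in memory and compute the features for each mini-batch when it is needed. The sketch below uses a much smaller k and a synthetic stand-in for the 50-dimensional inputs so that it runs quickly; the function names, the seeded generator (used in place of np.random.randn for reproducibility), and the batch size are assumptions.

```python
import numpy as np

def make_projection(d, k, seed=0):
    # Sample V once; reuse the SAME V for every call to features().
    return np.random.default_rng(seed).standard_normal((d, k))

def features(X_batch, V, scale=2.0):
    # phi(x) = sin(scale * V^T x), computed for a whole batch: shape (m, k).
    return np.sin(scale * (X_batch @ V))

d, k, m = 50, 2000, 32                     # small k for illustration only
V = make_projection(d, k)

# Stand-in inputs scaled so that E[||x||^2] = 1 (assumption: not the real data).
rng = np.random.default_rng(1)
X_stand_in = rng.standard_normal((500, d)) / np.sqrt(d)

batch_idx = rng.integers(0, X_stand_in.shape[0], size=m)
Phi_batch = features(X_stand_in[batch_idx], V)   # only m x k values are stored
print(Phi_batch.shape)  # (32, 2000)
```

Only the batch of features exists at any time; the full N × k matrix is never formed.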
With only your laptop in hand (or the compute resources provided, which are hopefully not particularly impressive), this problem is just hard enough that it will force you to understand many of the issues at play in large scale machine learning. In fact, if you try to explicitly construct your feature matrix of size N × k, which is of size 60,000 × 60,000, you will hopefully run out of memory. Regardless, the problem is very much solvable, in a timely manner, with even meager compute resources. The suggestions below are more broadly applicable to how we address many of the issues in large scale machine learning.
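A quick back-of-the-envelope calculation shows why the explicit feature matrix is out of reach while V and a single mini-batch are not. It assumes 8-byte (float64) entries, NumPy's default, and a hypothetical batch size of m = 100.

```python
N, k, d, m = 60_000, 60_000, 50, 100   # m is a hypothetical batch size
bytes_per_float = 8                    # float64

full_matrix_gb = N * k * bytes_per_float / 1e9   # explicit N x k feature matrix
projection_mb = d * k * bytes_per_float / 1e6    # the random matrix V
batch_mb = m * k * bytes_per_float / 1e6         # features for one mini-batch

print(full_matrix_gb, projection_mb, batch_mb)   # 28.8 24.0 48.0
```

So the full matrix would need roughly 28.8 GB, while V takes about 24 MB and one batch of features about 48 MB.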
You are free to try out both square loss and the logistic loss. The “objective function error” refers to either the square loss or the logistic/softmax loss (whichever you used). You are encouraged to try the square loss as well (if you tried both, tell us!). In practice, in the small feature regime
[1] The aforementioned normalization of the data by the instructor makes this factor of 2 naturally correspond to a certain scale of the data. You can understand this more by looking at the paper in the link. It is analogous to the choice of a “bandwidth” in certain radial basis function kernel methods.
rate.
You are welcome to jot down a few thoughts about what you found here. If you tried both square loss and logistic loss, please let us know. We may give credit adjustments if you tried both the square loss and the logistic loss.
8 EXTRA CREDIT: Proving a rate of convergence for GD for the least squares problem [20 points]
This is a fundamental convergence result in mathematical optimization. With a good understanding of the SVD, the proofs are short and within your reach. Let us consider gradient descent on the least squares problem.
$$L(w) = \frac{1}{2N}\|Y - Xw\|^2$$
Gradient descent is the update rule:
$$w^{(k+1)} = w^{(k)} - \eta \nabla L(w^{(k)})$$
Let $\lambda_1, \lambda_2, \ldots, \lambda_d$ be the eigenvalues of $\frac{1}{N} X^\top X$ in descending order (so $\lambda_1$ is the largest eigenvalue).
$$\|w^{(k+1)} - w^*\| \le \exp(-\kappa)\,\|w^{(k)} - w^*\|$$
where κ is some (positive) scalar. In particular, set η so that κ is as large as possible. What is the value of η you used and what is κ? Again, you must provide a proof. You should be able to upper bound your expression so that you can state it in terms of the maximal eigenvalue λ 1 and the minimal eigenvalue λd. The above equation shows a property called contraction.
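Before attempting the proof, it can help to watch the contraction numerically. The sketch below assumes the loss is normalized as (1/(2N))‖Y − Xw‖², so that its Hessian is (1/N)XᵀX, and uses a synthetic problem. The step size η = 1/λ₁ is just one convenient illustrative choice, not necessarily the one that makes κ as large as possible.

```python
import numpy as np

# Synthetic least squares problem (assumption: for illustration only).
rng = np.random.default_rng(0)
N, d = 200, 5
X = rng.standard_normal((N, d))
Y = X @ rng.standard_normal(d) + 0.1 * rng.standard_normal(N)

H = X.T @ X / N                       # Hessian under the assumed normalization
eigvals = np.linalg.eigvalsh(H)       # returned in ascending order
lam_1, lam_d = eigvals[-1], eigvals[0]
w_star = np.linalg.solve(H, X.T @ Y / N)

eta = 1.0 / lam_1                     # illustrative step size
w = np.zeros(d)
ratios = []
for _ in range(20):
    grad = X.T @ (X @ w - Y) / N
    w_next = w - eta * grad
    # Per-step shrink factor of the distance to the minimizer.
    ratios.append(np.linalg.norm(w_next - w_star) / np.linalg.norm(w - w_star))
    w = w_next

bound = 1.0 - lam_d / lam_1           # largest |eigenvalue| of I - eta * H
print(max(ratios), bound)
```

Every per-step ratio stays at or below 1 − λ_d/λ₁ for this step size, which is the kind of statement the contraction inequality formalizes.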
9 Code
Please include all your code in the PDF file in this section. Specify which problem(s) the code corresponds to. Re Jupyter: refer to the policies section of the HW.