COMP 652 - ECSE 608: Machine Learning - Assignment 1
Posted Monday, September 10, 2018
Due Friday, September 28, 2018
You should submit an archive of your code, as well as a pdf file with your answers (either typed or
scanned), uploaded to MyCourses. If you cannot access MyCourses, email the assignment directly to both
Riashat and Audrey by the deadline (11:59pm EST on the day the assignment is due).
1. [55 points] Regression, Overfitting and Regularization
For this exercise, you will experiment with regression, regularization and cross-validation. You are
provided with a data set in files hw1-q1x.csv (inputs) and hw1-q1y.csv (targets). You are
allowed to use any programming language of your choice (Python, Matlab, R, etc.). You can also
use any toolbox/package (e.g. scikit-learn) as long as you understand what is going on
behind the scenes (for example, does the fit_linear_model() method you use fit a bias term, or
does it assume that the data is centered?).
(a) [5 points] Load the data into memory. Make an appropriate $X$ matrix and $y$ vector. Split the
data at random into one set $(X_{\text{train}}, y_{\text{train}})$ containing 80% of the instances, which will be
used for training + validation, and a testing set $(X_{\text{test}}, y_{\text{test}})$ containing the remaining instances.
Describe here any preprocessing of the data that you performed for this exercise (did you
handle the bias term in linear regression directly or did you add a 1 to each input? Did you
center the data? Why and how? etc.).
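For part (a), one possible way to do the random 80/20 split and to handle the bias by adding a 1 to each input is sketched below. The helper names `split_train_test` and `add_bias_column`, the seed, and the synthetic stand-in data are illustrative assumptions, not part of the assignment; the CSV delimiter mentioned in the comment is also a guess.

```python
import numpy as np

def split_train_test(X, y, train_frac=0.8, seed=0):
    """Shuffle the rows once, then cut at train_frac (the seed is an arbitrary choice)."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(X.shape[0])
    n_train = int(round(train_frac * X.shape[0]))
    tr, te = perm[:n_train], perm[n_train:]
    return X[tr], y[tr], X[te], y[te]

def add_bias_column(X):
    """Prepend a column of ones so that the first weight acts as the bias term."""
    return np.hstack([np.ones((X.shape[0], 1)), X])

# With the real files one would presumably load the arrays with something like
# np.loadtxt("hw1-q1x.csv", delimiter=","); the delimiter is an assumption.
# Synthetic stand-in data keeps this sketch runnable on its own.
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=100)

X_train, y_train, X_test, y_test = split_train_test(X, y)
X_train_b = add_bias_column(X_train)
```

Centering the inputs and targets instead of adding the column of ones is the other common option; whichever is chosen should be applied to the test set using statistics computed on the training set only.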
(b) [5 points] Run linear regression on the data using $L_2$ regularization, varying the regularization
parameter $\lambda \in \{0, 0.1, 1, 10, \dots, 10^5\}$. Plot on one graph the root-mean-square error (RMSE)
for the training data and the testing data, as a function of $\lambda$ (you should use a log scale for $\lambda$).
Plot on another graph the $L_2$ norm of the weight vector you obtain. Plot on a third graph the
actual values of the weights obtained (one curve per weight). Explain briefly what you see.
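For part (b), a minimal sketch of the closed-form $L_2$-regularized solution $w = (X^\top X + \lambda I)^{-1} X^\top y$ and of the RMSE is given below. It assumes centered data with no separate bias column; the function names and the synthetic data are hypothetical, and plotting is left out.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge solution; assumes X has no bias column (e.g. centered data)."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

def rmse(X, y, w):
    """Root-mean-square error of the linear predictor X @ w."""
    return float(np.sqrt(np.mean((X @ w - y) ** 2)))

# The grid from the question: 0, 0.1, 1, 10, ..., 1e5.
lambdas = [0.0] + [10.0 ** p for p in range(-1, 6)]

# Synthetic stand-in data so the sketch runs on its own.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 4))
y = X @ np.array([2.0, -1.0, 0.0, 0.5]) + 0.1 * rng.normal(size=60)

weights = [ridge_fit(X, y, lam) for lam in lambdas]
norms = [float(np.linalg.norm(w)) for w in weights]
train_rmse = [rmse(X, y, w) for w in weights]
```

Since $\lambda = 0$ has no position on a log-scaled axis, it would have to be plotted separately or at a small surrogate value.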
(c) [5 points] Perform five-fold cross-validation on the training data to determine the best value
of the regularization parameter $\lambda$ (report the training and validation errors for each fold and
briefly explain how you chose the best value of $\lambda$). Compare the best value of $\lambda$ obtained to
your results in question 1b and briefly comment.
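For part (c), five-fold cross-validation over the same grid of $\lambda$ values could be organized roughly as follows. Selecting the $\lambda$ with the lowest mean validation RMSE is one common criterion, assumed here rather than prescribed by the assignment; all names and the synthetic data are illustrative.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge solution; assumes no separate bias column."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

def rmse(X, y, w):
    return float(np.sqrt(np.mean((X @ w - y) ** 2)))

def kfold_indices(n, k=5, seed=0):
    """Shuffle 0..n-1 once and cut the result into k nearly equal folds."""
    rng = np.random.default_rng(seed)
    return np.array_split(rng.permutation(n), k)

def cross_validate(X, y, lam, k=5, seed=0):
    """Return a list of (training RMSE, validation RMSE), one pair per fold."""
    folds = kfold_indices(len(y), k, seed)
    results = []
    for i, val in enumerate(folds):
        tr = np.concatenate([f for j, f in enumerate(folds) if j != i])
        w = ridge_fit(X[tr], y[tr], lam)
        results.append((rmse(X[tr], y[tr], w), rmse(X[val], y[val], w)))
    return results

# Synthetic stand-in for the training split.
rng = np.random.default_rng(0)
X = rng.normal(size=(80, 4))
y = X @ np.array([2.0, -1.0, 0.0, 0.5]) + 0.3 * rng.normal(size=80)

lambdas = [0.0] + [10.0 ** p for p in range(-1, 6)]
mean_val = {lam: np.mean([v for _, v in cross_validate(X, y, lam)]) for lam in lambdas}
best_lam = min(mean_val, key=mean_val.get)
```

Using the same seed for every $\lambda$ keeps the folds identical across the grid, so the validation errors are directly comparable.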
(d) [5 points] Suppose that the training data was sorted in increasing value of the target variable $y$,
and you simply partitioned it by splitting it in $k$ folds (without shuffling the data first). Explain
what would happen if you tried to perform cross-validation with these folds.
(e) [5 points] Re-format the data in the following way: take each of the input variables, and feed
it through a set of Gaussian basis functions, defined as follows. For each variable, use 5
univariate basis functions with means evenly spaced between $-1$ and $1$ and variance $\sigma^2$. You
will experiment with $\sigma^2$ values of 0.1, 0.5, 1 and 5.
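For part (e), the feature expansion might look like the sketch below. The text does not spell out the basis function itself, so the unnormalized form $\exp\!\left(-(x - \mu_j)^2 / (2\sigma^2)\right)$ is an assumption, as are the function name and its defaults.

```python
import numpy as np

def gaussian_basis(X, sigma2, n_basis=5, low=-1.0, high=1.0):
    """Expand each input column into n_basis Gaussian features.

    Centers are evenly spaced in [low, high]. The unnormalized form
    exp(-(x - mu)^2 / (2 * sigma2)) is an assumed definition.
    Output columns are grouped per input variable: (n, d * n_basis).
    """
    centers = np.linspace(low, high, n_basis)
    diff = X[:, :, None] - centers[None, None, :]   # shape (n, d, n_basis)
    Phi = np.exp(-diff ** 2 / (2.0 * sigma2))
    return Phi.reshape(X.shape[0], -1)

# One expanded design matrix per sigma^2 value from the question.
X = np.zeros((4, 2))
features = {s2: gaussian_basis(X, s2) for s2 in (0.1, 0.5, 1.0, 5.0)}
```

Each expanded matrix can then be passed to an ordinary least-squares solver, with the bias handled in the same way as for the raw inputs.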
(f) [5 points] Using no regularization and doing regression with this new set of basis functions,
plot the training and testing error as a function of $\sigma^2$ (when using only basis functions of a given
$\sigma^2$). Add constant lines showing the training and testing error you had obtained in question 1b.
Explain how $\sigma^2$ influences overfitting and the bias-variance trade-off.
(g) [10 points] Suppose that instead of using a fixed set of evenly-spaced basis functions, you
would like to adapt the placement of these functions. Derive a learning algorithm that computes
both the placement of the basis functions, $\mu_i$, and the weight vector $w$ from data (assuming that
the width $\sigma^2$ is fixed). You should still allow for $L_2$ regularization of the weight vector. Note
that your algorithm will need to be iterative.
(h) [5 points] Does your algorithm converge? If so, does it obtain a locally or globally optimal
solution? Explain your answer.
(i) [10 points] Consider the input set $X' = [x_{i,2}]_{x_i \in X}$ containing only the second features of
instances from $X$, and use the same train/test split as before. Consider the $d$-degree polynomial
kernel such that $\phi_i(x) = x^i$ for all $i \in \{1, 2, \dots, d\}$. Perform five-fold cross-validation on
the training data to determine the best value of degree $d \in \{1, 2, 3, 5, 9\}$ when using this kernel
within non-regularized linear regression. On a single figure, plot $X'$ and $y$ along with the
predicted function obtained for each degree $d$. Using the polynomial kernel with degree $d = 9$,
perform five-fold cross-validation on the training data to determine the best value of
regularization $\lambda \in \{0.01, 0.1, 1, 10\}$ when using this kernel within $L_1$ and $L_2$ regularized
linear regression. On a single figure, plot $X'$ and $y$ along with the predicted function obtained
by $L_1$-regression for each $\lambda$. On another figure, plot $X'$ and $y$ along with the predicted
function obtained by $L_2$-regression for each $\lambda$. Report the test error of all three approaches
trained on $X'_{\text{train}}$ using their optimal parameter found in cross-validation. For $L_1$ and
$L_2$-regression, report the weights associated with each polynomial feature and explain briefly
what you observe. How do these weights relate to the optimal degree $d$ found previously?
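For part (i), the polynomial feature map $\phi_i(x) = x^i$ and a non-regularized fit can be sketched as follows. Adding an explicit bias column is an assumption about how the constant term is handled, and the names `poly_features`, `fit_poly` and `predict_poly` are hypothetical; the $L_1$-regularized variant is not covered here.

```python
import numpy as np

def poly_features(x, d):
    """Map a 1-D input array to the columns [x, x^2, ..., x^d]."""
    x = np.asarray(x, dtype=float).ravel()
    return np.column_stack([x ** i for i in range(1, d + 1)])

def fit_poly(x, y, d):
    """Least-squares fit on [1, x, ..., x^d]; the bias column is an assumed choice."""
    P = poly_features(x, d)
    Phi = np.column_stack([np.ones(P.shape[0]), P])
    w, *_ = np.linalg.lstsq(Phi, y, rcond=None)
    return w

def predict_poly(x, w):
    """Evaluate the fitted polynomial; w[0] is the bias, w[i] multiplies x^i."""
    P = poly_features(x, len(w) - 1)
    return np.column_stack([np.ones(P.shape[0]), P]) @ w

# Noise-free quadratic used only to illustrate the calls.
x = np.linspace(-1.0, 1.0, 20)
y = 1.0 + 2.0 * x - x ** 2
w = fit_poly(x, y, 2)
```

For the degree sweep, `fit_poly` would be called inside a cross-validation loop for each $d \in \{1, 2, 3, 5, 9\}$, and `predict_poly` on a dense grid gives the curves to plot.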

2. [10 points] Maximum likelihood

Suppose that you are given a training set $\{(x_i, y_i)\}_{i=1}^m \subset \mathbb{R}^n \times \mathbb{R}$ of $m$ i.i.d. examples. In class we discussed that minimizing the mean squared error corresponds to an assumption that the labels of the data came from some target hypothesis $h_w$, but then were observed after being perturbed by Gaussian noise, with the noise variables drawn i.i.d. from the same distribution. Suppose now that the standard deviation with which we can observe the label of example $i$ is $\sigma_i$. More precisely: $y_i = h_w(x_i) + \varepsilon_i$ with $\varepsilon_i \sim \mathcal{N}(0, \sigma_i)$. Derive the maximum likelihood estimate of $w$ in this case.
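As a possible starting point for the derivation, the log-likelihood can be written out explicitly. This sketch reads $\sigma_i$ as the standard deviation of $\varepsilon_i$, as the text states, and assumes the noise terms are independent across examples.

```latex
\begin{align*}
\log L(w)
  &= \sum_{i=1}^{m} \log \left[ \frac{1}{\sigma_i \sqrt{2\pi}}
     \exp\!\left( -\frac{(y_i - h_w(x_i))^2}{2\sigma_i^2} \right) \right] \\
  &= -\sum_{i=1}^{m} \frac{(y_i - h_w(x_i))^2}{2\sigma_i^2}
     \;-\; \sum_{i=1}^{m} \log\!\left( \sigma_i \sqrt{2\pi} \right).
\end{align*}
% The second sum does not depend on w, so maximizing \log L(w) is
% equivalent to minimizing the weighted squared error
%   \sum_{i=1}^{m} \frac{(y_i - h_w(x_i))^2}{\sigma_i^2}.
```

Under these assumptions, each example is weighted by $1/\sigma_i^2$, so labels observed with more noise contribute less to the objective.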

3. [10 points] Bayesian analysis and biased coin

Suppose you flip a coin with unknown bias $\theta$ (i.e. $P(x = H) = \theta$) three times and observe the outcome HHH. What is the maximum likelihood estimator for $\theta$? Do you think this is a good estimator? Would you want to make predictions using this estimator? Consider a Bayesian analysis of $\theta$ with a beta prior $p(\theta \mid \alpha, \beta) = B(\theta; \alpha, \beta)$. What are the posterior mean and posterior mode of $\theta$? Consider $(\alpha, \beta) = (50, 50)$. Plot the posterior density in this case. Is the maximum likelihood estimator a good summary of the distribution?
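The standard Beta-Bernoulli conjugate update can be sketched in a few lines, which may help when checking a hand derivation. The function names are hypothetical, and the mode formula below is only valid when both posterior parameters exceed 1.

```python
from math import exp, lgamma, log

def beta_posterior(alpha, beta, heads, tails):
    """Conjugate update: a Beta(alpha, beta) prior with Bernoulli observations."""
    return alpha + heads, beta + tails

def beta_mean(a, b):
    return a / (a + b)

def beta_mode(a, b):
    """Mode of Beta(a, b); this formula requires a > 1 and b > 1."""
    return (a - 1) / (a + b - 2)

def beta_pdf(theta, a, b):
    """Beta density for 0 < theta < 1, computed in log space for stability."""
    log_norm = lgamma(a + b) - lgamma(a) - lgamma(b)
    return exp(log_norm + (a - 1) * log(theta) + (b - 1) * log(1 - theta))

# Prior (50, 50) and the observed outcome HHH: 3 heads, 0 tails.
a, b = beta_posterior(50, 50, heads=3, tails=0)
post_mean = beta_mean(a, b)
post_mode = beta_mode(a, b)
```

Evaluating `beta_pdf` on a grid of $\theta$ values strictly between 0 and 1 gives the points needed for the posterior density plot.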

4. [15 points] Multivariate Regression

(a) [5 points] We consider the problem of learning a vector-valued function $f : \mathbb{R}^d \to \mathbb{R}^p$ from input-output training data $\{(x_i, y_i)\}_{i=1}^m$ where each $x_i$ is a $d$-dimensional vector and each $y_i$ is