ENCS5341 Machine Learning and Data Science
Regression
Yazan Abu Farha - Birzeit University

Introduction
• Regression is a supervised learning task where the target variable that we are trying to predict is continuous. Examples: predicting house prices based on the living area, predicting a stock price based on the history of previous prices.
• When there is a single input variable (x), the method is referred to as simple linear regression. E.g.: predicting blood pressure as a function of drug dose.
• When there are multiple input variables, the statistics literature often refers to the method as multiple linear regression. E.g.: predicting crop yields as a function of fertilizer and water.
• Linear regression is a model that assumes a linear relationship between the input variables (x) and the single output variable (y). More specifically, it assumes that y can be calculated from a linear combination of the input variables (x).

Linear regression
• For x in ℝ, linear regression fits a line in a 2-dimensional space (simple linear regression).
• For x in ℝ², linear regression fits a plane in a 3-dimensional space (multiple linear regression).

Linear regression
• In general, if we have d features as input x = (x1, x2, …, xd)ᵀ, then the linear regression model has the following form:
  y = f(x) = w0 + w1x1 + w2x2 + … + wdxd
• The coefficients w0, …, wd are the parameters of the model. The goal of learning is to find the "best" values for these parameters, i.e. the values that describe the relationship between the input features x and the target label y, based on a set of training examples (dataset).
• Once we have estimated the parameters, we can use the learned model to predict y values for new inputs.

Prediction with linear regression model
• Example: hours studying and grades. We want to learn w0 and w1 such that
  Predicted final grade in class = w0 + w1 * (# hours you study/week)
• Assume after learning we have:
  Predicted final grade in class = 59.95 + 3.17 * (# hours you study/week)
• We can now use this function to predict grades for new values of # hours.
  Ex: someone who studies for 12 hours: Final grade = 59.95 + (3.17 * 12) = 97.99

[Figure: scatter plot of final grade in course vs. number of hours spent studying, with the fitted line Final grade in course = 59.95 + 3.17 * study hours, R-Square = 0.88]

Task Definition
Problem: Given a sample S = {(x1, y1), …, (xn, yn)} ⊆ ℝᵈ × ℝ, find a vector w ∈ ℝᵈ such that f(x) = ⟨x, w⟩ best interpolates S.
"best interpolates": for (x, y) we measure the discrepancy between f(x) and y by the square loss function E(f(x), y) = (f(x) − y)².

Linear regression solution – simple case
• Let's first consider the solution for the simple linear regression case, i.e. the input is only one variable x.
• Given a set of n training examples (x1, y1), …, (xn, yn), we want to learn w0 and w1 such that f(x) = y = w0 + w1x.
• The solution is found by minimizing the sum of squared errors:
  argmin_{w0, w1} Σ_{i=1}^n (yi − f(xi))²
  argmin_{w0, w1} Σ_{i=1}^n (yi − w0 − w1 xi)²
• Find the derivative of the error function E with respect to each parameter and set it to 0:
  ∂E/∂w0 = ∂/∂w0 Σ_{i=1}^n (yi − w0 − w1 xi)²
         = Σ_{i=1}^n ∂/∂w0 (yi − w0 − w1 xi)²
         = Σ_{i=1}^n −2 (yi − w0 − w1 xi)
         = −2 Σ_{i=1}^n yi + 2 Σ_{i=1}^n w0 + 2 Σ_{i=1}^n w1 xi
• Setting ∂E/∂w0 = 0 and using Σ_{i=1}^n w0 = n w0 gives
  w0 = (1/n) Σ_{i=1}^n yi − w1 (1/n) Σ_{i=1}^n xi
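A minimal numpy sketch of this closed-form solution on made-up data: the intercept uses the formula just derived, and the slope uses the analogous derivative with respect to w1, which gives w1 = Σ(xi − x̄)(yi − ȳ) / Σ(xi − x̄)².

```python
import numpy as np

# Made-up training data: hours studied per week vs. final grade.
x = np.array([2.0, 4.0, 5.0, 7.0, 9.0, 10.0])
y = np.array([65.0, 72.0, 78.0, 83.0, 88.0, 92.0])

x_bar, y_bar = x.mean(), y.mean()

# Slope: closed form from the analogous derivative w.r.t. w1.
w1 = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
# Intercept: w0 = y_bar - w1 * x_bar, as derived above.
w0 = y_bar - w1 * x_bar

print(f"f(x) = {w0:.2f} + {w1:.2f} * x")
print("prediction for 12 hours:", w0 + w1 * 12)
```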
Linear Regression – The normal equations
• Empirical risk w.r.t. the square loss function:
  E[f] = (1/n) Σ_{i=1}^n (f(xi) − yi)² = (1/n) Σ_{i=1}^n (⟨xi, w⟩ − yi)² = (1/n) (Xw − y)ᵀ (Xw − y) = (1/n) ‖Xw − y‖²
• Solve min_w (1/n) ‖Xw − y‖² (known as least squares).

Linear Regression – The normal equations
• Convex minimization problem: min_w E[w] = min_w (1/n) ‖Xw − y‖²
• Calculate the gradient:
  ∇w E[w] = ∂/∂w ( (1/n) ‖Xw − y‖² ) = ∂/∂w ( (1/n) (wᵀXᵀXw − 2 wᵀXᵀy + yᵀy) ) = (1/n) (2 XᵀXw − 2 Xᵀy)
• Set it to 0: XᵀXw = Xᵀy
• And solve the linear system of equations: w = (XᵀX)⁻¹ Xᵀy

Linear Regression and overfitting
• Linear regression solution: w = (XᵀX)⁻¹ Xᵀy
• High values in w correspond to an overfitting problem.
• Solution: use a regularizer to discourage the coefficients from taking large values.
• Penalize the sum of the squares of the coefficients, i.e. ‖w‖².
• Solve min_w (1/n) ‖Xw − y‖² + λ ‖w‖²
• Solution: w = (XᵀX + λ I_d)⁻¹ Xᵀy (λ is a hyper-parameter and I_d is the d × d identity matrix).
• This case is called Ridge Regression (ridge regression = regularized least squares).

Probabilistic Interpretation of Linear Regression
• Assume yi = f′(xi) + εi, where the noise εi is Gaussian with zero mean and variance σ². The maximum likelihood estimate is
  f′_ML = argmax_{f′} Σ_{i=1}^n ln[ (1/√(2πσ²)) exp( −(yi − f′(xi))² / (2σ²) ) ]
        = argmax_{f′} Σ_{i=1}^n −(yi − f′(xi))² / (2σ²)
        = argmax_{f′} Σ_{i=1}^n −(yi − f′(xi))²
  f′_ML = argmin_{f′} Σ_{i=1}^n (yi − f′(xi))²
• The Maximum Likelihood estimate f′_ML minimizes the sum of squared errors.

[Figure: a fitted function f with the residuals e1, …, e5 between the training points and f]

Non-Linear Regression
• Linear regression fits a linear model to the data.
• In real-world applications, many problems are non-linear. In this case, fitting a linear model will underfit.
• Can we still use linear regression to fit a non-linear model?

Non-Linear Regression
Examples of non-linear basis functions:
• Radial basis functions, e.g. the Gaussian f(x) = exp( −(x − c)² / (2σ²) )
• Arctan functions
• Monomials: x ↦ (x, x², …, x^m); (x1, x2) ↦ (x1, x2, x1x2, x1², x2²)
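Before the polynomial regression example, here is a minimal numpy sketch of the two closed-form solutions derived earlier in this part, ordinary least squares w = (XᵀX)⁻¹Xᵀy and ridge regression w = (XᵀX + λI)⁻¹Xᵀy. The data, the value of λ, and the prepended column of ones (so that w0 is learned along with the other weights) are illustrative choices, not part of the slides.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up dataset: n = 50 examples, d = 3 features.
n, d = 50, 3
X = rng.normal(size=(n, d))
y = X @ np.array([2.0, -1.0, 0.5]) + 4.0 + 0.1 * rng.normal(size=n)

# Prepend a column of ones so the first coefficient plays the role of w0.
Xb = np.hstack([np.ones((n, 1)), X])

# Ordinary least squares: solve (X^T X) w = X^T y rather than forming the inverse explicitly.
w_ols = np.linalg.solve(Xb.T @ Xb, Xb.T @ y)

# Ridge regression: (X^T X + lambda I) w = X^T y, with a made-up lambda.
lam = 0.1
w_ridge = np.linalg.solve(Xb.T @ Xb + lam * np.eye(Xb.shape[1]), Xb.T @ y)

print("OLS  :", np.round(w_ols, 3))
print("ridge:", np.round(w_ridge, 3))
```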
Example: polynomial regression
• example: polynomial curve fitting
• hypothesis space: polynomials { Σ_{i=1}^M wi xⁱ + w0 : M ∈ ℕ, wi ∈ ℝ }
• feature space embedding: x ↦ (x⁰, x¹, …, x^M)
• patterns: hyperplanes in the embedded feature space
• loss function: square loss
• unknown target function: sin(2πx)
• training data: S = {(x1, y1), …, (x10, y10)}, with yi = sin(2πxi) + random noise
A small numpy sketch of this setup is given below; the fits of different polynomial orders are compared next.
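A minimal sketch of the setup above: the equispaced x values, the noise scale, and the random seed are made-up choices; the fit itself is the least-squares solution on the monomial embedding just described.

```python
import numpy as np

rng = np.random.default_rng(1)

# Training data: 10 points, y_i = sin(2*pi*x_i) + Gaussian noise (noise scale is made up).
x = np.linspace(0.0, 1.0, 10)
y = np.sin(2 * np.pi * x) + 0.2 * rng.normal(size=x.shape)

def fit_poly(x, y, M):
    """Least-squares fit of an order-M polynomial via the embedding x -> (x^0, ..., x^M)."""
    Z = np.vander(x, M + 1, increasing=True)   # design matrix with columns x^0 ... x^M
    w, *_ = np.linalg.lstsq(Z, y, rcond=None)  # minimizes ||Z w - y||^2
    return w, Z

for M in (0, 1, 3, 9):
    w, Z = fit_poly(x, y, M)
    train_err = np.mean((Z @ w - y) ** 2)
    print(f"M = {M}: training error = {train_err:.4f}")
```

The training error shrinks as M grows, which is exactly the behaviour the lessons below warn about: the lowest training error is not the same thing as the best generalization.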
Example: polynomial regression
[Figure: fits of the 0th, 1st, 3rd, and 9th order polynomials to the 10 training points]
Lessons:
• the 0th and 1st order polynomials fit the data (blue points) badly
• too simple models underfit
• the 9th order polynomial fits the data best, but it generalizes badly
• too complex models overfit
• the 3rd order polynomial is expected to generalize best
• a model that generalizes well is neither too simple nor too complex
A regularized version of the 9th order fit is sketched after this list.
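Connecting these lessons back to the ridge regression slide: the sketch below (same made-up data as in the previous snippet) fits the 9th order polynomial with and without the λ‖w‖² penalty and prints the largest coefficient magnitude; the value of λ is an illustrative choice.

```python
import numpy as np

rng = np.random.default_rng(1)

# Same made-up setup as above: 10 noisy samples of sin(2*pi*x).
x = np.linspace(0.0, 1.0, 10)
y = np.sin(2 * np.pi * x) + 0.2 * rng.normal(size=x.shape)

M = 9
Z = np.vander(x, M + 1, increasing=True)

# Plain least squares: with 10 points and 10 coefficients the fit interpolates the noisy data.
w_ls, *_ = np.linalg.lstsq(Z, y, rcond=None)

# Ridge: w = (Z^T Z + lambda I)^(-1) Z^T y, with a small made-up lambda.
lam = 1e-3
w_ridge = np.linalg.solve(Z.T @ Z + lam * np.eye(M + 1), Z.T @ y)

print("max |w|, least squares:", np.abs(w_ls).max())
print("max |w|, ridge        :", np.abs(w_ridge).max())
```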
Non-Linear Regression
• Evaluate x against some basis functions to create the z vector.
• Apply linear regression on z:
  f(x) = ⟨z, w⟩, where z = g(x)
• Solution: w = (ZᵀZ)⁻¹ Zᵀy

Problems of the normal equations solution
Linear regression solution: w = (XᵀX)⁻¹ Xᵀy
• Issues:
  • Computing the inverse is costly: O(d³), where d is the number of features.
  • The matrix XᵀX may not be invertible.
  • The dataset could be very large.
• Solution: use iterative methods such as gradient descent (next lecture). A rough preview sketch is given below.
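Gradient descent is the subject of the next lecture, but as a rough preview sketch, the snippet below minimizes the least-squares objective iteratively using the gradient (2/n)·Xᵀ(Xw − y) computed in the normal-equations derivation above; the data, step size, and iteration count are made-up choices.

```python
import numpy as np

rng = np.random.default_rng(2)

# Made-up data: n examples, d features, plus a column of ones for w0.
n, d = 200, 3
X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, d))])
y = X @ np.array([4.0, 2.0, -1.0, 0.5]) + 0.1 * rng.normal(size=n)

w = np.zeros(X.shape[1])   # start from w = 0
lr = 0.1                   # made-up step size

for _ in range(500):
    grad = (2.0 / n) * X.T @ (X @ w - y)   # gradient of (1/n) * ||Xw - y||^2
    w -= lr * grad

print("gradient descent :", np.round(w, 3))
print("normal equations :", np.round(np.linalg.solve(X.T @ X, X.T @ y), 3))
```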