Lecture 4: Linear regression
TTIC 31020: Introduction to Machine Learning
Instructor: Greg Shakhnarovich
TTI–Chicago
October 4, 2010

Administrivia

TA: Feng Zhao
5 homework assignments (1.5–2 weeks apart)
• Each HW is worth 12% of the grade
Final exam during finals week
• 40% of the grade
My office hours: by e-mail appointment
Feng's office hours: Friday 3–5pm

Fitting a function to data

Two goals in mind:
1. Explain the data (traditional statistics)
2. Make predictions (emphasized in machine learning)

We will proceed in two steps:
1. Choose a model class of functions
2. Design a fitting criterion to guide the selection of a function from the class.

Let's start with (almost) the simplest model class: linear functions.

Linear fitting to data

We want to fit a linear function to an observed set of points X = [x_1, ..., x_N] with associated labels Y = [y_1, ..., y_N].
• Once we fit the function, we want to use it to predict y for a new x.

Least squares (LSQ) fitting criterion: find the function that minimizes the sum (or average) of squared distances between the actual y's in the training set and the predicted ones.

[Figure: training points (x_i, y_i) and the fitted line; the squared vertical distances from the points to the line are summed and minimized. The fitted line is then used as a predictor at a new input x_0.]
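To make the criterion concrete, here is a minimal MATLAB sketch (the data points and the candidate line are made up for illustration): it evaluates the least-squares objective, i.e. the sum of squared vertical distances between the observed labels and a candidate line's predictions.

% Evaluate the LSQ criterion for one candidate line on hypothetical toy data.
x = [0.5; 1.0; 1.8; 2.3; 3.1];         % inputs
y = [1.1; 1.9; 3.2; 3.9; 5.3];         % observed labels
w0 = 0.2; w1 = 1.6;                    % an arbitrary candidate intercept and slope
yhat = w0 + w1*x;                      % the candidate line's predictions
sse  = sum((y - yhat).^2);             % sum of squared errors (what LSQ minimizes)
mse  = mean((y - yhat).^2);            % averaged form of the same criterion
fprintf('SSE = %.3f, MSE = %.3f\n', sse, mse);

Least squares picks the (w0, w1) that make this quantity as small as possible; the following slides formalize and then minimize it.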
Loss function

Suppose target labels are in a set Y:
• Binary classification: Y = {−1, +1}
• (Univariate) regression: Y ≡ ℝ

A loss function L : Y × Y → ℝ maps decisions to costs:
• L(ŷ, y) defines the penalty paid for predicting ŷ when the true value is y.

Standard choice for classification: 0/1 loss
$$L_{0/1}(\hat{y}, y) = \begin{cases} 0 & \text{if } y = \hat{y} \\ 1 & \text{otherwise} \end{cases}$$

Standard choice for regression: squared loss
$$L(\hat{y}, y) = (\hat{y} - y)^2$$

Is it a good loss function?..

Empirical loss

We consider a parametric function f(x; w).
Linear function: f(x; w) = wᵀx

The empirical loss of the function y = f(x; w) on a set X:
$$L(w, X) = \frac{1}{N} \sum_{i=1}^{N} L\bigl(f(x_i; w), y_i\bigr)$$

LSQ minimizes the empirical loss for the squared loss L.

We care about the accuracy of predicting labels for new examples. Why/when does minimizing the empirical loss help us achieve that?

Loss: empirical and expected

Fundamental assumption: each example x / label y pair is drawn from a joint probability distribution p(x, y).
Data are i.i.d.: the same (unknown!) distribution generates all pairs (x, y), in both training and test data.

We can measure the empirical loss on the training set:
$$L(w, X) = \frac{1}{N} \sum_{i=1}^{N} L\bigl(f(x_i; w), y_i\bigr)$$

The ultimate goal is to minimize the expected loss, also known as the risk:
$$R(w) = \mathbb{E}_{(x_0, y_0) \sim p(x, y)} \bigl[ L\bigl(f(x_0; w), y_0\bigr) \bigr]$$

Least squares: estimation

We need to minimize with respect to w:
$$L(w, X) = L(w) = \frac{1}{N} \sum_{i=1}^{N} \bigl(y_i - f(x_i; w)\bigr)^2 = \frac{1}{N} \sum_{i=1}^{N} \Bigl( y_i - w_0 - w_1 x_1^{(i)} - \ldots - w_d x_d^{(i)} \Bigr)^2$$

Let's look at 1D for the moment:
$$L(w_0, w_1) = \frac{1}{N} \sum_{i=1}^{N} \Bigl( y_i - w_0 - w_1 x^{(i)} \Bigr)^2$$

Necessary condition to minimize L: the derivatives with respect to w_0 and w_1 must be zero.

Setting the derivative with respect to w_0 to zero:
$$\frac{\partial}{\partial w_0} L(w_0, w_1) = \frac{1}{N} \sum_{i=1}^{N} \frac{\partial}{\partial w_0} \Bigl( y_i - w_0 - w_1 x^{(i)} \Bigr)^2 = \frac{1}{N} \sum_{i=1}^{N} 2 \Bigl( y_i - w_0 - w_1 x^{(i)} \Bigr) \cdot (-1) = -\frac{2}{N} \sum_{i=1}^{N} \Bigl( y_i - w_0 - w_1 x^{(i)} \Bigr) = 0.$$

y_i − w_0 − w_1 x^(i) is the prediction error on the i-th example.
⇒ A necessary condition for the optimal w is that the errors have zero mean. (Why?)
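This zero-mean-error condition is easy to check numerically. Below is a minimal MATLAB sketch (made-up noisy 1D data, not from the lecture) that uses the built-in polyfit as a stand-in for the least-squares estimator derived next, and confirms that the residuals of the fitted line average to (numerically) zero.

% Numerical check of the zero-mean-error condition on hypothetical 1D data.
% polyfit(x, y, 1) is MATLAB's built-in degree-1 least-squares fit:
% p(1) is the slope w1, p(2) is the intercept w0.
rng(0);                                    % fix the random seed for repeatability
x = linspace(0, 5, 50)';                   % hypothetical inputs
y = 1.5 + 0.8*x + 0.3*randn(size(x));      % hypothetical noisy linear labels
p = polyfit(x, y, 1);
residuals = y - (p(2) + p(1)*x);           % prediction errors y_i - w0 - w1*x^(i)
fprintf('mean residual = %.2e\n', mean(residuals));   % ~0 at the LSQ optimum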
Setting the derivative with respect to w_1 to zero in the same way:
$$\frac{\partial}{\partial w_1} L(w_0, w_1) = -\frac{2}{N} \sum_{i=1}^{N} \Bigl( y_i - w_0 - w_1 x^{(i)} \Bigr) x^{(i)} = 0.$$

Second necessary condition: the errors are uncorrelated with the data! (And with any linear function of the data.)

Two linear equations in the two unknowns w_0, w_1:
$$\sum_{i=1}^{N} \Bigl( y_i - w_0 - w_1 x^{(i)} \Bigr) x^{(i)} = 0, \qquad (1)$$
$$\sum_{i=1}^{N} \Bigl( y_i - w_0 - w_1 x^{(i)} \Bigr) = 0. \qquad (2)$$

Solving for w*

From (2), dividing by N:
$$w_0 = \frac{1}{N} \sum_i \bigl( y_i - w_1 x^{(i)} \bigr) = \underbrace{\frac{1}{N} \sum_i y_i}_{\bar{y}} - w_1 \underbrace{\frac{1}{N} \sum_i x^{(i)}}_{\bar{x}} = \bar{y} - w_1 \bar{x}.$$

Dividing (1) by N and substituting this expression for w_0:
$$\frac{1}{N} \sum_i y_i x^{(i)} - w_0 \frac{1}{N} \sum_i x^{(i)} - w_1 \frac{1}{N} \sum_i \bigl( x^{(i)} \bigr)^2 = 0$$
$$\overline{yx} - (\bar{y} - w_1 \bar{x})\,\bar{x} - w_1 \overline{x^2} = 0$$

$$w_1^* = \frac{\overline{yx} - \bar{y}\,\bar{x}}{\overline{x^2} - \bar{x}^2}, \qquad w_0^* = \bar{y} - w_1^* \bar{x},$$
where the bars denote empirical averages over the training set.

General case (d-dim, matrix form)

$$X = \begin{bmatrix} 1 & x_1^{(1)} & \cdots & x_d^{(1)} \\ \vdots & & & \vdots \\ 1 & x_1^{(N)} & \cdots & x_d^{(N)} \end{bmatrix}, \qquad y = \begin{bmatrix} y_1 \\ \vdots \\ y_N \end{bmatrix}, \qquad w = \begin{bmatrix} w_0 \\ \vdots \\ w_d \end{bmatrix}.$$

Predictions: ŷ = Xw; errors: y − Xw; empirical loss:
$$L(w, X) = \frac{1}{N} (y - Xw)^T (y - Xw) = \frac{1}{N} \bigl( y^T - w^T X^T \bigr) (y - Xw),$$
using (AB)ᵀ = BᵀAᵀ, (A + B)ᵀ = Aᵀ + Bᵀ, (Aᵀ)ᵀ = A.

Derivative of loss

$$L(w) = \frac{1}{N} \bigl( y^T - w^T X^T \bigr) (y - Xw).$$

Using the identities ∂(aᵀb)/∂a = ∂(bᵀa)/∂a = b and ∂(aᵀBa)/∂a = 2Ba (for symmetric B):
$$\frac{\partial L(w)}{\partial w} = \frac{1}{N} \frac{\partial}{\partial w} \bigl[ y^T y - w^T X^T y - y^T X w + w^T X^T X w \bigr] = \frac{1}{N} \bigl[ 0 - X^T y - (y^T X)^T + 2 X^T X w \bigr] = -\frac{2}{N} \bigl( X^T y - X^T X w \bigr).$$

General case

$$\frac{\partial}{\partial w} L(w) = -\frac{2}{N} \bigl( X^T y - X^T X w \bigr) = 0$$
$$X^T y = X^T X w \quad \Rightarrow \quad w^* = \bigl( X^T X \bigr)^{-1} X^T y.$$

X† ≜ (XᵀX)⁻¹Xᵀ is called the Moore–Penrose pseudoinverse of X.

Linear regression in Matlab:

% X(i,:) is i-th example, y(i) is i-th label
wLSQ = pinv([ones(size(X,1),1) X])*y;

Prediction for a new input x_0:
$$\hat{y} = w^{*T} \begin{bmatrix} 1 \\ x_0 \end{bmatrix} = y^T X^{\dagger\,T} \begin{bmatrix} 1 \\ x_0 \end{bmatrix}.$$
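As a sanity check, here is a small MATLAB sketch on synthetic data (the data-generating numbers are arbitrary, not from the lecture) comparing the scalar formulas for w_0*, w_1* derived above with the pseudoinverse solution, and using the result to predict at a new input.

% Sanity check on hypothetical synthetic 1D data: the scalar closed form
% should agree with the pseudoinverse solution from the slide.
rng(1);
N = 200;
x = 4*rand(N,1);                          % hypothetical inputs
y = -1 + 2.5*x + 0.5*randn(N,1);          % hypothetical noisy linear labels

% Scalar solution from the 1D derivation
w1 = (mean(x.*y) - mean(x)*mean(y)) / (mean(x.^2) - mean(x)^2);
w0 = mean(y) - w1*mean(x);

% Matrix solution with an explicit bias column, as on the slide
wLSQ = pinv([ones(N,1) x]) * y;           % wLSQ(1) = w0, wLSQ(2) = w1

fprintf('scalar: w0 = %.4f, w1 = %.4f\n', w0, w1);
fprintf('matrix: w0 = %.4f, w1 = %.4f\n', wLSQ(1), wLSQ(2));

% Prediction at a new input x0
x0 = 2.0;
yhat0 = [1 x0] * wLSQ;
fprintf('prediction at x0 = %.1f: %.4f\n', x0, yhat0);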
Data set size and regression

What happens when we only have a single data point (in 1D)?
• Ill-posed problem: an infinite number of lines pass through the point and produce a "perfect" fit.

Two points in 1D? Two points in 2D?

This is a general phenomenon: the amount of data needed to obtain a meaningful estimate of a model is related to the number of parameters in the model (its complexity).
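To make the ill-posedness concrete, here is a small MATLAB sketch with a hypothetical single data point (the numbers are made up): with one 1D example the 2×2 matrix XᵀX is rank-deficient, so the normal equations do not determine a unique solution; pinv still returns the minimum-norm one, but other parameter vectors fit the point just as perfectly.

% Ill-posedness with a single 1D data point (hypothetical numbers).
X1 = [1 2.0];                  % one augmented example [1 x], with x = 2
y1 = 3.0;                      % its label
fprintf('rank(X''*X) = %d (need 2 for a unique solution)\n', rank(X1'*X1));
w_minnorm = pinv(X1) * y1;     % minimum-norm solution; fits the point exactly
% Any w = w_minnorm + t*[-2; 1] also fits exactly, because [1 2]*[-2; 1] = 0:
w_alt = w_minnorm + 5*[-2; 1];
fprintf('fit errors: %.2e (min-norm), %.2e (alternative)\n', ...
        y1 - X1*w_minnorm, y1 - X1*w_alt);

With two points at distinct x values in 1D the line is uniquely determined; in 2D, where the linear model has three parameters, two points still leave it underdetermined.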