Lecture 4: Linear regression
TTIC 31020: Introduction to Machine Learning
Instructor: Greg Shakhnarovich
TTI–Chicago
October 4, 2010

Administrivia

TA: Feng Zhao
5 homework assignments (1.5-2 weeks apart)
• Each HW worth 12% of the grade
Final during finals week
• 40% of the grade
My office hours: by e-mail appointment
Feng's office hours: Friday 3-5pm

Fitting function to data

Two goals in mind:
1. Explain the data (traditional statistics)
2. Make predictions (emphasized in machine learning)
We will proceed in two steps:
1. Choose a model class of functions
2. Design a fitting criterion, to guide selection of a function from the class.
Let's start with (almost) the simplest model class: linear functions.

Linear fitting to data

We want to fit a linear function to an observed set of points X = [x_1, ..., x_N] with associated labels Y = [y_1, ..., y_N].
• Once we fit the function, we want to use it to predict the y for a new x.
Least squares (LSQ) fitting criterion: find the function that minimizes the sum (or average) of squared distances between the actual ys in the training set and the predicted ones.
[Figure: data points (x_i, y_i) and a fitted line; the squared vertical distances from the points to the line are summed and minimized, and the fitted line is used as a predictor at a new input x_0.]
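As a concrete numerical illustration of the criterion (a minimal sketch; the data and the candidate line below are made up for the example, not taken from the lecture), the following Matlab snippet evaluates the sum of squared distances for one candidate line. The slides that follow derive the line that actually minimizes it.

% Toy data and one candidate line, made up for illustration only.
x = [0.5; 1.0; 1.5; 2.0; 2.5];   % inputs x_1, ..., x_N
y = [1.1; 1.9; 2.4; 3.2; 3.9];   % labels y_1, ..., y_N
w0 = 0.2;  w1 = 1.4;             % a candidate (not necessarily optimal) line y = w0 + w1*x
yhat = w0 + w1*x;                % predictions at the training inputs
lsq  = sum((y - yhat).^2);       % sum of squared distances: the quantity LSQ minimizes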
Loss function

Suppose target labels come from a set Y:
• binary classification: Y = {−1, +1};
• (univariate) regression: Y ≡ R.
A loss function L : Y × Y → R maps decisions to costs:
• L(ŷ, y) defines the penalty paid for predicting ŷ when the true value is y.
Standard choice for classification: 0/1 loss

L_{0/1}(\hat{y}, y) = \begin{cases} 0 & \text{if } y = \hat{y} \\ 1 & \text{otherwise} \end{cases}

Standard choice for regression: squared loss

L(\hat{y}, y) = (\hat{y} - y)^2

Is it a good loss function?

Empirical loss

We consider a parametric function f(x; w).
Linear function: f(x; w) = w^T x.
The empirical loss of the function y = f(x; w) on a set X:

L(w, X) = \frac{1}{N} \sum_{i=1}^{N} L(f(x_i; w), y_i)

LSQ minimizes the empirical loss for the squared loss L.
We care about the accuracy of predicting labels for new examples. Why/when does empirical loss minimization help us achieve that?

Loss: empirical and expected

Fundamental assumption: examples x and labels y are drawn from a joint probability distribution p(x, y).
Data are i.i.d.: the same (unknown!) distribution for all pairs (x, y), in both training and test data.
We can measure the empirical loss on the training set:

L(w, X) = \frac{1}{N} \sum_{i=1}^{N} L(f(x_i; w), y_i)

The ultimate goal is to minimize the expected loss, also known as risk:

R(w) = \mathbb{E}_{(x_0, y_0) \sim p(x, y)} \left[ L(f(x_0; w), y_0) \right]

Least squares: estimation

We need to minimize w.r.t. w

L(w, X) = L(w) = \frac{1}{N} \sum_{i=1}^{N} \left( y_i - f(x_i; w) \right)^2
                = \frac{1}{N} \sum_{i=1}^{N} \left( y_i - w_0 - w_1 x_1^{(i)} - \ldots - w_d x_d^{(i)} \right)^2

Let's look at 1D for the moment:

L(w_0, w_1) = \frac{1}{N} \sum_{i=1}^{N} \left( y_i - w_0 - w_1 x^{(i)} \right)^2

Necessary condition to minimize L: the derivatives w.r.t. w_0 and w_1 must be zero.

\frac{\partial}{\partial w_0} L(w_0, w_1)
  = \frac{1}{N} \sum_{i=1}^{N} \frac{\partial}{\partial w_0} \left( y_i - w_0 - w_1 x^{(i)} \right)^2
  = \frac{1}{N} \sum_{i=1}^{N} 2 \left( y_i - w_0 - w_1 x^{(i)} \right) \cdot (-1)
  = -\frac{2}{N} \sum_{i=1}^{N} \left( y_i - w_0 - w_1 x^{(i)} \right) = 0.

y_i - w_0 - w_1 x^{(i)} is the prediction error on the i-th example.
⇒ Necessary condition for the optimal w: the errors have zero mean. (Why?)

\frac{\partial}{\partial w_1} L(w_0, w_1) = -\frac{2}{N} \sum_{i=1}^{N} \left( y_i - w_0 - w_1 x^{(i)} \right) x^{(i)} = 0.

Second necessary condition: the errors are uncorrelated with the data! (And with any linear function of the data.)
Two linear equations in the two unknowns w_0, w_1:

\sum_{i=1}^{N} \left( y_i - w_0 - w_1 x^{(i)} \right) x^{(i)} = 0,    (1)

\sum_{i=1}^{N} \left( y_i - w_0 - w_1 x^{(i)} \right) = 0.            (2)

Solving for w*

From (2):

w_0 = \frac{1}{N} \sum_i \left( y_i - w_1 x^{(i)} \right) = \bar{y} - w_1 \bar{x},
\quad \text{where } \bar{y} = \frac{1}{N} \sum_i y_i, \; \bar{x} = \frac{1}{N} \sum_i x^{(i)}.

Substituting into (1), with \overline{xy} = \frac{1}{N} \sum_i y_i x^{(i)} and \overline{x^2} = \frac{1}{N} \sum_i (x^{(i)})^2:

\overline{xy} - w_0 \bar{x} - w_1 \overline{x^2} = 0
\quad\Rightarrow\quad
\overline{xy} - (\bar{y} - w_1 \bar{x}) \bar{x} - w_1 \overline{x^2} = 0

w_1^* = \frac{\overline{xy} - \bar{y}\,\bar{x}}{\overline{x^2} - \bar{x}^2},
\qquad
w_0^* = \bar{y} - w_1^* \bar{x}.
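A minimal sketch of the 1D closed-form solution on made-up toy data (not part of the original slides): it computes w_0^* and w_1^* from the sample means and numerically checks the two necessary conditions derived above.

% 1D least squares via the closed-form solution; toy data made up for illustration.
x = [0.5; 1.0; 1.5; 2.0; 2.5];
y = [1.1; 1.9; 2.4; 3.2; 3.9];

xbar = mean(x);  ybar = mean(y);
w1 = (mean(x.*y) - ybar*xbar) / (mean(x.^2) - xbar^2);   % w1* from the slide
w0 = ybar - w1*xbar;                                     % w0* from the slide

err = y - w0 - w1*x;     % prediction errors at the optimum
mean(err)                % ~0: errors have zero mean (condition (2))
mean(err .* x)           % ~0: errors uncorrelated with the data (condition (1))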
General case (d-dim, matrix form)

X = \begin{bmatrix} 1 & x_1^{(1)} & \cdots & x_d^{(1)} \\ \vdots & & & \vdots \\ 1 & x_1^{(N)} & \cdots & x_d^{(N)} \end{bmatrix},
\qquad
y = \begin{bmatrix} y_1 \\ \vdots \\ y_N \end{bmatrix},
\qquad
w = \begin{bmatrix} w_0 \\ \vdots \\ w_d \end{bmatrix}.

Predictions: ŷ = Xw; errors: y − Xw; empirical loss:

L(w, X) = \frac{1}{N} (y - Xw)^T (y - Xw) = \frac{1}{N} \left( y^T - w^T X^T \right) (y - Xw),

using (AB)^T = B^T A^T, (A + B)^T = A^T + B^T, (A^T)^T = A.

Derivative of loss

L(w) = \frac{1}{N} \left( y^T - w^T X^T \right) (y - Xw).

Useful identities: \frac{\partial a^T b}{\partial a} = \frac{\partial b^T a}{\partial a} = b, \qquad \frac{\partial a^T B a}{\partial a} = 2 B a (for symmetric B, such as X^T X).

\frac{\partial L(w)}{\partial w}
  = \frac{1}{N} \frac{\partial}{\partial w} \left[ y^T y - w^T X^T y - y^T X w + w^T X^T X w \right]
  = \frac{1}{N} \left[ 0 - X^T y - (y^T X)^T + 2 X^T X w \right]
  = -\frac{2}{N} \left( X^T y - X^T X w \right).

General case

\frac{\partial}{\partial w} L(w) = -\frac{2}{N} \left( X^T y - X^T X w \right) = 0
\quad\Rightarrow\quad
X^T y = X^T X w
\quad\Rightarrow\quad
w^* = \left( X^T X \right)^{-1} X^T y.

X^\dagger = \left( X^T X \right)^{-1} X^T is called the Moore-Penrose pseudoinverse of X.

Linear regression in Matlab:

% X(i,:) is i-th example, y(i) is i-th label
wLSQ = pinv([ones(size(X,1),1) X])*y;

Prediction:

\hat{y} = w^{*T} \begin{bmatrix} 1 \\ x_0 \end{bmatrix} = y^T X^{\dagger\,T} \begin{bmatrix} 1 \\ x_0 \end{bmatrix}.
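A usage sketch of the matrix-form fit (synthetic data; all numbers, including w_true and x0, are made up for illustration): build the design matrix with a constant column, fit with pinv as on the slide, and predict for a new input. The commented-out backslash line is Matlab's standard least-squares solver, which should agree with the pinv solution when the design matrix has full column rank.

% Synthetic example: N = 100 examples with d = 2 features.
N = 100;  d = 2;
X = randn(N, d);                              % X(i,:) is the i-th example
w_true = [1; -2; 0.5];                        % arbitrary "true" [w0; w1; w2]
y = [ones(N,1) X]*w_true + 0.1*randn(N,1);    % noisy linear labels

Xa   = [ones(N,1) X];                         % prepend the constant feature
wLSQ = pinv(Xa)*y;                            % the pseudoinverse fit from the slide
% wLSQ = Xa \ y;                              % alternative solver; same answer for full column rank

x0   = [0.3 -1.2];                            % a new (arbitrary) input
yhat = [1 x0]*wLSQ;                           % prediction for x0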
Data set size and regression

What happens when we have only a single data point (in 1D)?
• Ill-posed problem: an infinite number of lines pass through the point and produce a "perfect" fit.
Two points in 1D? Two points in 2D?
This is a general phenomenon: the amount of data needed to obtain a meaningful estimate of a model is related to the number of parameters in the model (its complexity).
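To make the ill-posedness concrete (a small sketch with made-up numbers, not from the original slides): with a single 1D point, the normal-equations matrix X^T X is rank deficient, and more than one line reproduces the lone label exactly.

X1 = [1 2.0];            % one 1D example with the constant feature: [1 x]
y1 = 3.0;                % its label
rank(X1'*X1)             % = 1 < 2 parameters, so X'X is not invertible

wa = [3.0; 0.0];         % horizontal line through (2, 3)
wb = [1.0; 1.0];         % a different line through the same point
[X1*wa, X1*wb]           % both equal y1 = 3: a "perfect" fit either way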