A Note on Machine Learning Methods
Ying Nian Wu, UCLA Statistics, Based on M231B lectures
Updated March 2020
Contents
1 Ptolemy’s Epicycle and Gauss Paradigm
1.1 The model
1.2 Boosting machine
1.3 Lasso, ridge and kernel machine
1.4 Neural network
1.5 Model complexity and regularization
1.6 Ptolemy or Newton?
1.7 Euler’s linear model
1.8 Laplace’s estimating equation
1.9 Gauss paradigm
1.10 Continuing Gauss paradigm
1.11 Three modes of learning
1.12 Bayesian, Frequentist, variational
2 Basics: Linear Models
2.1 Linear regression
2.2 Ridge regression and shrinkage estimator
2.3 Linear spline
2.4 Lasso regression
2.5 Logistic regression
2.6 Classification and perceptron
2.7 Loss functions
2.8 Regularization
2.9 Gradient descent: learning from errors
2.10 Iterated reweighted least squares
2.11 Multivariate or multinomial response
2.12 Non-linear and non-parametric functions
3 Model Complexity and Overfitting
3.1 Gauss’ analysis of least squares
3.2 Bias and variance tradeoff
3.3 Stein’s estimator
3.4 Stein’s estimator as empirical Bayes
3.5 Model bias
3.6 Training and testing errors, overfitting

11.11 Partially observed MDP (POMDP)
11.12 Multi-agent reinforcement learning
11.13 Inverse reinforcement learning (IRL)
11.14 Energy-based model
12 Causal Learning
12.1 Counterfactual, potential outcome
12.2 Matching and propensity score
12.3 Structural equations, causal diagram
12.4 Training data and testing data
12.5 Do calculus, back door and front door adjustment
13 Background: Convex Optimization and Duality
13.1 von Neumann minimax theorem
13.2 Constrained optimization and Lagrange multipliers
13.3 Legendre-Fenchel convex conjugate

Preface: “no time to be brief”

Credits: Most of the figures in the note are taken from the web, and the credits and copyrights belong to the original authors. Materials presented in the note are based on the original papers and related books. References are being added. This note is about machine learning methods. The physicist Wolfgang Pauli apologized in a letter for “no time to be brief”. This also applies to this note, which should be further compactified.

1 Ptolemy’s Epicycle and Gauss Paradigm

We begin with an ancient example of learning to highlight the methods and issues in machine learning.

Figure 1: Left: epicycle model, with Earth at the center. Right: Newtonian model, with Sun at the center.

1.1 The model

Let $(x(t), y(t))$ be the position of a planet at time $t$, with Earth being the origin $(0, 0)$. Suppose we observe $(x_i, y_i)$ at times $t_i$, $i = 1, ..., n$. We may consider $t_i$ as the input, and $(x_i, y_i)$ as the output. We want to learn a model that can predict the position of the planet at a future time. Of course we also hope to understand the physics of planetary motion.

A single circle

The simplest model is a circular trajectory,

$$x(t) = r\cos(\omega t), \quad y(t) = r\sin(\omega t),$$

where $r$ is the radius and $\omega$ is the angular speed. We can learn the model by minimizing the following least squares loss:

$$L(r, \omega) = \sum_{i=1}^{n} \left[ (x_i - r\cos(\omega t_i))^2 + (y_i - r\sin(\omega t_i))^2 \right].$$

We may use gradient descent to find a local minimum of $L$ over $(r, \omega)$. If $\omega$ is given, the estimation of $r$ becomes a least squares linear regression, where $r$ is the regression coefficient ($\beta$ in modern notation) and $(\cos(\omega t), \sin(\omega t))$ are predictors ($x$ in modern notation). We may search $\omega$ over a regular grid of points and, for each $\omega$, estimate $r$ by least squares. We then choose the $\omega$ that gives us the minimal value of $L$.
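The grid search described above can be sketched as follows. This is a minimal illustration, not the note's own code: the function name `fit_circle`, the synthetic data, and the grid of candidate frequencies are all assumptions made for the example.

```python
import numpy as np

def fit_circle(t, x, y, omegas):
    """Grid search over omega; closed-form least squares for r at each omega."""
    best = None
    for omega in omegas:
        c, s = np.cos(omega * t), np.sin(omega * t)
        # Regress (x, y) on (cos, sin) with the single coefficient r.
        r = (x @ c + y @ s) / (c @ c + s @ s)
        loss = np.sum((x - r * c) ** 2 + (y - r * s) ** 2)
        if best is None or loss < best[2]:
            best = (r, omega, loss)
    return best

# Synthetic observations (assumed values, for illustration only).
rng = np.random.default_rng(0)
t = np.linspace(0.0, 10.0, 200)
x = 2.0 * np.cos(1.3 * t) + 0.05 * rng.standard_normal(t.size)
y = 2.0 * np.sin(1.3 * t) + 0.05 * rng.standard_normal(t.size)

r_hat, omega_hat, loss = fit_circle(t, x, y, np.linspace(0.1, 3.0, 291))
```

Because $L$ is quadratic in $r$ but oscillatory in $\omega$, the grid over $\omega$ avoids the local minima that plain gradient descent could fall into.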

Complex number

We can also write the model using a complex number. Let $z(t) = x(t) + i y(t)$ and $z_i = x_i + i y_i$. Then the model becomes

$$z(t) = r e^{i\omega t},$$

and

$$L(r, \omega) = \sum_{i=1}^{n} \left| z_i - r e^{i\omega t_i} \right|^2.$$

Circle on top of circle

Ptolemy found out that a single circle did not fit the data well enough. He then assumed that the planet is moving around a center in a circular motion, and this center itself is moving around the Earth in a circular motion. This is an epicycle model that can be written as

$$z(t) = r_1 e^{i\omega_1 t} + r_2 e^{i\omega_2 t},$$

where $r_1 e^{i\omega_1 t}$ is the original circle, and $r_2 e^{i\omega_2 t}$ is the circle on top of the original circle. If two circles are not enough, we can add a third circle. In general, we may consider a model

$$z(t) = \sum_{k=1}^{d} r_k e^{i\omega_k t},$$

where d (for dimensionality or degrees of freedom) is the number of circles, and defines the complexity of the model. This is a stroke of genius. It is a precursor to Fourier analysis, which shows that with enough circles, we can fit any trajectory. The above model is by all means a good model, as good as any model we can find in machine learning literature. It is flexible enough to fit all the curves. It has a clear geometric meaning.
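To see how adding circles improves the fit, here is a small sketch that fits the epicycle model by least squares when the angular speeds are treated as known. Two assumptions go beyond the text: the frequencies are fixed in advance, and each amplitude is allowed to be complex (so it also absorbs a phase), which keeps the problem linear. The name `fit_epicycles` and the data are hypothetical.

```python
import numpy as np

def fit_epicycles(t, z, omegas):
    """Least squares fit of z(t) ~ sum_k c_k * exp(i * omega_k * t), omegas fixed."""
    # Design matrix: one complex exponential (one circle) per column.
    E = np.exp(1j * np.outer(t, omegas))
    c, *_ = np.linalg.lstsq(E, z, rcond=None)
    return c, E @ c

# A trajectory made of two circles (assumed values for the demo).
t = np.linspace(0.0, 20.0, 400)
z = 3.0 * np.exp(1j * 0.5 * t) + 1.0 * np.exp(1j * 2.0 * t)

c1, fit1 = fit_epicycles(t, z, np.array([0.5]))        # single circle
c2, fit2 = fit_epicycles(t, z, np.array([0.5, 2.0]))   # circle on top of circle
err1 = np.sum(np.abs(z - fit1) ** 2)
err2 = np.sum(np.abs(z - fit2) ** 2)
```

The single circle leaves a large residual, while the two-circle model reproduces the trajectory up to rounding error.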

In science, we prefer simple models. As von Neumann said, “With four parameters I can fit an elephant, and with five I can make him wiggle his trunk.” In fact, “adding an epicycle” is synonymous with bad science. However, in machine learning, we usually use models that are similar to Ptolemy’s because of the lack of domain knowledge. Such models tend to have a lot of parameters and sometimes are even “over-parametrized”, meaning that the number of parameters exceeds the number of training examples. Such models are either explicitly or implicitly regularized to avoid overfitting. In Lasso and ridge regression, we use explicit regularization. In boosting, we use implicit regularization by early stopping. For neural networks, gradient descent tends to settle into a local mode near the random initialization, which is also a form of implicit regularization.

1.6 Ptolemy or Newton?

Ptolemy’s model is actually better than Newton’s model in general, because the former is more universal, and the latter is more specialized to our universe. If our universe were more complex, we might have to use Ptolemy’s model or even more complex models such as kernel machines or neural networks. Which model is better depends on what data the model is intended to explain. For instance, for the stock market, we do not expect a model as simple as Newton’s to fit well. On the other hand, the current kernel machines or neural networks may be too complex for explaining the experiences of a biological intelligent system, which may actually use models similar to Ptolemy’s.

1.7 Euler’s linear model

In the above discussion, we used the modern notation and ideas in statistics and machine learning. Least squares, Lasso, ridge regression, etc. were invented nearly two thousand years after Ptolemy’s genius idea. But indeed we can trace the origin of statistics and machine learning to astronomy. The model is Newtonian mechanics (even though modern machine learning models are more in the style of Ptolemy). The data are those obtained by the observatories. Mathematicians such as Gauss worked for such observatories. Euler studied the orbit of Jupiter around the Sun. The motion of Jupiter is influenced by Saturn. By Taylor expansion or perturbation analysis, Euler obtained the following equation:

$$
\begin{aligned}
\phi ={}& \eta - 23525'' \sin q + 168'' \sin 2q - 32'' \sin 2w - 257'' \sin(w - q) - 243'' \sin(2w - p) + m'' \\
& - x'' \sin q + y'' \sin 2q - z'' \sin(w - p) - u(\alpha + 360 v + p)\cos(w - p) + N n'' \\
& - 11405\, k'' \cos q + \tfrac{1}{600}\, k'' \cos 2q,
\end{aligned}
$$

where (ϕ, η, q, w, p, N, v) are observed and vary from observation to observation, and (x, y, m, z, α, k, n, u) are unknown parameters. Euler had 75 observations, and by treating uα as γ he had 7 unknown parameters. In other words, he had 75 equations and 7 unknowns. Let us translate the above equation into more familiar notation in modern statistics. Let

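$$
y_i = \phi_i, \quad
\beta = \begin{pmatrix} \beta_1 \\ \beta_2 \\ \vdots \\ \beta_p \end{pmatrix}
      = \begin{pmatrix} x \\ y \\ \vdots \\ u \end{pmatrix},
\quad \text{and} \quad
x_i = \begin{pmatrix} x_{i1} \\ x_{i2} \\ \vdots \\ x_{ip} \end{pmatrix}
    = \begin{pmatrix} \sin q_i \\ \sin 2q_i \\ \vdots \\ \sin(w_i - p_i) \end{pmatrix}.
$$

Then the equation can be written as

$$y_i = x_i^\top \beta = x_{i1}\beta_1 + x_{i2}\beta_2 + \cdots + x_{ip}\beta_p \quad \text{for } i = 1, \cdots, n = 75.$$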

The following are some observations: (1) yi is linear in β , but it can be non-linear in the original variables (q, w, p, N, u). (2) The model is known to be correct a priori. While (1) is common in linear models, (2) is rare in machine learning. Euler did not go very far in solving the above problem.

1.8 Laplace’s estimating equation

Laplace proposed the method of combination of equations, e.g., we combine the 75 equations into 7 equations, so that we can solve for the 7 unknowns. Specifically we solve $\beta$ from the following estimating equations:

$$\sum_{i=1}^{n} w_{ik}\, y_i = \sum_{i=1}^{n} w_{ik} \sum_{j=1}^{p} x_{ij}\beta_j, \quad k = 1, ..., p,$$

where $(w_{ik})$ is a set of pre-designed weights. Laplace designed a special set of weights. But he did not give a general principle on how to design the weights.

1.9 Gauss paradigm

Gauss (and Legendre) proposed to estimate $\beta$ by least squares, i.e., minimizing the loss function

$$L(\beta) = \sum_{i=1}^{n} \left( y_i - \sum_{j=1}^{p} x_{ij}\beta_j \right)^2.$$

The above loss function can be minimized in closed form by solving the linear equation $L'(\beta) = 0$. This leads to the following estimating equation:

$$\frac{\partial}{\partial \beta_k} L(\beta) = -2 \sum_{i=1}^{n} x_{ik} \left( y_i - \sum_{j=1}^{p} x_{ij}\beta_j \right) = 0, \quad k = 1, ..., p.$$

This estimating equation corresponds to Laplace’s estimating equation with $w_{ik} = x_{ik}$.

Gauss did three things that set the paradigm for statistics and machine learning.

(1) Gauss started with a loss function. In machine learning, most of the methods start from loss functions.

(2) Gauss motivated the loss function by a probabilistic formulation. He assumed that

$$y_i = \sum_{j=1}^{p} x_{ij}\beta_j + \varepsilon_i, \quad \varepsilon_i \sim N(0, \sigma^2),$$

independently for $i = 1, ..., n$. Assuming a prior distribution $p(\beta)$, the posterior distribution is

$$
p(\beta \mid (x_i, y_i), i = 1, ..., n) \propto p(\beta) \prod_{i=1}^{n} p(y_i \mid x_i, \beta)
\propto \exp\left[ -\frac{1}{2\sigma^2} \sum_{i=1}^{n} \left( y_i - \sum_{j=1}^{p} x_{ij}\beta_j \right)^2 + \log p(\beta) \right].
$$

Assuming a uniform $p(\beta)$, maximizing $p(\beta \mid (x_i, y_i), i = 1, ..., n)$ is equivalent to minimizing the loss function $L(\beta)$. $\prod_{i=1}^{n} p(y_i \mid x_i, \beta)$ is called the likelihood. The least squares estimate is also the maximum likelihood estimate.

(3) Gauss analyzed the property of the least squares estimator $\hat\beta_{LS}$, which is a function of $((x_i, y_i), i = 1, ..., n)$. He used Frequentist thinking, even though the loss function is motivated by Bayesian thinking. Specifically, we assume $(x_i, y_i) \sim p(x, y)$ independently, where $p(x, y)$ is the joint distribution whose conditional distribution $p(y \mid x)$ is such that $y_i \mid x_i \sim N(x_i^\top \beta_{true}, \sigma^2)$, where $\beta_{true}$ is the true value of $\beta$. If we believe Newtonian mechanics, then such a $\beta_{true}$ does exist. In Frequentist thinking, we assume $\beta_{true}$ is fixed but unknown (whereas in Bayesian thinking, we treat $\beta$ as a random variable). The Frequentist thinking is

       input         features      output
1      $x_1^\top$    $h_1^\top$    $y_1$
2      $x_2^\top$    $h_2^\top$    $y_2$
...
n      $x_n^\top$    $h_n^\top$    $y_n$

The supervised learning can be represented by the diagram below,

output: $y_i$
↑
features: $h_i$
↑
input: $x_i$

where the vector of features $h_i$ is computed from $x_i$ via $h_i = h(x_i)$.

Encoder and decoder: In the above diagram, the transformation $x_i \to h_i$ is called an encoder, and the transformation $h_i \to y_i$ is called a decoder.

Both classification and regression are about supervised learning because for each input $x_i$, an output $y_i$ is provided as supervision. In regression, $y_i$ is continuous. In classification, $y_i$ is categorical. We can represent $y_i$ by a one-hot vector, i.e., if $y_i$ denotes the $k$-th category, then $y_i$ is a vector where the $k$-th element is 1 and all the other elements are 0.

Unsupervised learning

In unsupervised learning, the dataset is as below, where $y_i$ are not provided as supervision.

       input         hidden        output
1      $x_1^\top$    $h_1^\top$    ?
2      $x_2^\top$    $h_2^\top$    ?
...
n      $x_n^\top$    $h_n^\top$    ?

In a generative model, the vector $h_i$ is not a vector of features extracted from the signal $x_i$. Instead, $h_i$ is a vector of hidden variables that is used to generate $x_i$, as illustrated by the following diagram:

hidden: $h_i$
↓
input: $x_i$

The components of the $d$-dimensional $h_i$ are variably called factors, sources, components or causes. The prototype example is factor analysis or principal component analysis.

Auto-encoder: $h_i$ is also called a code in the auto-encoder illustrated by the following diagram:

code: $h_i$
↑↓
input: $x_i$

The direction from $h_i$ to $x_i$ is called the decoder, and the direction from $x_i$ to $h_i$ is called the encoder.

Distributed representation and disentanglement: $h_i = (h_{ik}, k = 1, ..., d)$ is called a distributed representation of $x_i$. Usually the components of $h_i$, $(h_{ik}, k = 1, ..., d)$, are assumed to be independent, and $(h_{ik})$ are said to disentangle the variations in $x_i$.

Embedding: $h_i$ can also be considered the coordinates of $x_i$, if we embed $x_i$ in a low-dimensional space, as illustrated by the following diagram:

← $h_i$ →
|
← $x_i$ →

In the training data, we find an $h_i$ for each $x_i$, so that $\{h_i, i = 1, ..., n\}$ preserve the relative relations between $\{x_i, i = 1, ..., n\}$. The prototype example of embedding is multi-dimensional scaling, where we want to preserve the Euclidean distances between the examples.

Reinforcement learning

Reinforcement learning is similar to supervised learning except that the guidance is in the form of reward. Here xi is the state. yi can be the action taken at this state. yi can also be the value of this state, where value is defined as the accumulated reward.

1.12 Bayesian, Frequentist, variational

We can write $(x_i, y_i) \sim p_\theta(x_i, y_i)$ for $i = 1, ..., n$. In supervised learning, we let $p_\theta(x, y) = p_\theta(y \mid x)\, p(x)$, where we learn $p_\theta(y \mid x)$ and we leave $p(x)$ alone. In unsupervised learning, we do not observe $y$, and we model $p_\theta(x)$ instead. In the following, we focus on supervised learning.

In the Bayesian framework, we treat $\theta$ as a random variable. We assume its marginal distribution to be $p(\theta)$. It is called the prior distribution. The learning is based on the posterior distribution

$$p(\theta \mid (x_i, y_i), i = 1, ..., n) \propto p(\theta) \prod_{i=1}^{n} p(y_i \mid x_i, \theta),$$

where we write $p_\theta(y \mid x)$ as $p(y \mid x, \theta)$ to emphasize that $\theta$ is a random variable to be conditioned upon. $\prod_{i=1}^{n} p(y_i \mid x_i, \theta)$ is called the likelihood. $l(\theta \mid (x_i, y_i), i = 1, ..., n) = \sum_{i=1}^{n} \log p(y_i \mid x_i, \theta)$ is called the log-likelihood. If we estimate $\theta$ by maximizing $p(\theta \mid (x_i, y_i), i = 1, ..., n)$ over $\theta$, we get the so-called Maximum A Posteriori (MAP) estimate. Up to an additive constant that does not depend on $\theta$,

$$\log p(\theta \mid (x_i, y_i), i = 1, ..., n) = l(\theta \mid (x_i, y_i), i = 1, ..., n) + \log p(\theta) + \text{const}.$$

If $p(\theta)$ is uniform within a range, MAP becomes the maximum likelihood estimate (MLE). For non-uniform $p(\theta)$, MAP is penalized or regularized likelihood.

MAP only captures the maximum or mode of the posterior distribution but misses the uncertainty in the posterior distribution. To capture the uncertainty, we may draw multiple samples $\theta_m \sim p(\theta \mid (x_i, y_i), i = 1, ..., n)$ for $m = 1, ..., M$. This can be accomplished by Monte Carlo, such as Markov chain Monte Carlo (MCMC). These $\theta_m$ are the multiple guesses of $\theta$.

The posterior $p(\theta \mid (x_i, y_i), i = 1, ..., n)$ is often not tractable in the sense that we cannot calculate the normalizing constant of $p(\theta) \prod_{i=1}^{n} p(y_i \mid x_i, \theta)$ to make it a probability distribution. In variational inference, we find a simpler distribution $q_\phi(\theta)$ to approximate $p(\theta \mid (x_i, y_i), i = 1, ..., n)$, where $\phi$ is the variational parameter that we choose to minimize the divergence from $q_\phi(\theta)$ to $p(\theta \mid (x_i, y_i), i = 1, ..., n)$.
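As a worked instance of MAP as regularized likelihood, consider the linear model $y_i = x_i^\top \beta + \varepsilon_i$ with $\varepsilon_i \sim N(0, \sigma^2)$, and suppose, purely as an assumption for illustration, a Gaussian prior $\beta \sim N(0, \tau^2 I_p)$, where $\tau^2$ is a hypothetical prior variance not specified in the note. Then

$$
\log p(\beta \mid (x_i, y_i), i = 1, ..., n)
= -\frac{1}{2\sigma^2} \sum_{i=1}^{n} (y_i - x_i^\top \beta)^2 - \frac{1}{2\tau^2} |\beta|^2 + \text{const},
$$

so the MAP estimate minimizes

$$
\sum_{i=1}^{n} (y_i - x_i^\top \beta)^2 + \lambda |\beta|^2, \quad \lambda = \sigma^2 / \tau^2 .
$$

Under this assumed prior, MAP coincides with a ridge-type penalized least squares problem, and letting $\tau^2 \to \infty$ (a flat prior) gives $\lambda \to 0$ and recovers the MLE.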

Background: derivatives in matrix form

Suppose $Y = (y_i)_{m \times 1}$ and $X = (x_j)_{n \times 1}$. Suppose $Y = h(X)$. We can define

$$\frac{\partial Y}{\partial X^\top} = \left( \frac{\partial y_i}{\partial x_j} \right)_{m \times n}.$$

To understand the notation, we can treat $\partial Y = (\partial y_i, i = 1, ..., m)^\top$ as a column vector, and $1/\partial X = (1/\partial x_j, j = 1, ..., n)^\top$ as another column vector. Now we have two vectors of operations, instead of numbers. The product of the elements of the two vectors is understood as composition of the two operators, i.e., $\partial y_i (1/\partial x_j) = \partial y_i / \partial x_j$. Then $\partial Y / \partial X^\top$ is an $m \times n$ matrix according to the matrix multiplication rule.

If $Y = AX$, then $y_i = \sum_k a_{ik} x_k$. Thus $\partial y_i / \partial x_j = a_{ij}$. So $\partial Y / \partial X^\top = A$. If $Y = X^\top S X$, where $S$ is symmetric, then $\partial Y / \partial X = 2SX$. If $S = I$, $Y = |X|^2$, $\partial Y / \partial X = 2X$.

The chain rule in matrix form is as follows. If $Y = h(X)$ and $X = g(Z)$, then

$$\frac{\partial y_i}{\partial z_j} = \sum_k \frac{\partial y_i}{\partial x_k} \frac{\partial x_k}{\partial z_j}.$$

Thus

$$\frac{\partial Y}{\partial Z^\top} = \frac{\partial Y}{\partial X^\top} \frac{\partial X}{\partial Z^\top}.$$
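The two derivative formulas above ($\partial Y / \partial X^\top = A$ for $Y = AX$, and $\partial Y / \partial X = 2SX$ for $Y = X^\top S X$) can be checked numerically. The sketch below uses central finite differences; the helper `numerical_jacobian` and the random test matrices are assumptions introduced for the check, not part of the note.

```python
import numpy as np

def numerical_jacobian(h, x, eps=1e-6):
    """Central-difference estimate of dY/dX^T, an m-by-n matrix."""
    x = np.asarray(x, dtype=float)
    y0 = np.atleast_1d(h(x))
    J = np.zeros((y0.size, x.size))
    for j in range(x.size):
        step = np.zeros_like(x)
        step[j] = eps
        # Column j holds the partial derivatives with respect to x_j.
        J[:, j] = (np.atleast_1d(h(x + step)) - np.atleast_1d(h(x - step))) / (2 * eps)
    return J

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 4))
B = rng.standard_normal((4, 4))
S = B + B.T                                   # a symmetric matrix
x = rng.standard_normal(4)

J_linear = numerical_jacobian(lambda v: A @ v, x)      # compare with A
J_quad = numerical_jacobian(lambda v: v @ S @ v, x)    # 1-by-4 row, compare with (2 S x)^T
```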

Least squares estimator

For general $(X, Y)$,

$$L(\beta) = |Y - X\beta|^2.$$

Let $e = Y - X\beta$, then $L(\beta) = |e|^2$. Applying the chain rule,

$$\frac{\partial L}{\partial \beta^\top} = \frac{\partial L}{\partial e^\top} \frac{\partial e}{\partial \beta^\top} = -2 e^\top X,$$

hence

$$L'(\beta) = \frac{\partial L}{\partial \beta} = -2 X^\top (Y - X\beta).$$

Setting $L'(\beta) = 0$, we get the least squares estimator

$$\hat\beta = (X^\top X)^{-1} X^\top Y.$$

Geometrically, $\hat Y = X\hat\beta$ is the projection of $Y$ onto the subspace spanned by $X = (X_1, ..., X_p)$, so that at $\beta = \hat\beta$ the residual $e = Y - X\beta$ is perpendicular to each $X_j$, i.e., $\langle e, X_j \rangle = X_j^\top e = 0$ for $j = 1, ..., p$, i.e., $X^\top (Y - X\beta) = 0$, which leads to the least squares $\hat\beta$. The projection is

$$\hat Y = X\hat\beta = X (X^\top X)^{-1} X^\top Y = HY,$$

where the hat matrix $H = X (X^\top X)^{-1} X^\top$ encodes the projection operation.
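The closed-form estimator and the projection interpretation can be illustrated with a short NumPy sketch. The simulated design matrix, the coefficients in `beta_true`, and the noise level are assumptions chosen for the demo.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 3
X = rng.standard_normal((n, p))
beta_true = np.array([1.0, -2.0, 0.5])        # assumed values for the demo
Y = X @ beta_true + 0.1 * rng.standard_normal(n)

# Normal equations (X^T X) beta = X^T Y; solving is preferable to inverting.
beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)

# Hat matrix H = X (X^T X)^{-1} X^T and the projection Y_hat = H Y.
H = X @ np.linalg.solve(X.T @ X, X.T)
Y_hat = H @ Y
e = Y - Y_hat                                  # residual, orthogonal to every column of X
```

In practice a routine such as `np.linalg.lstsq` is usually preferred for numerical stability; the normal equations are used here only because they mirror the formula in the text.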

2.2 Ridge regression and shrinkage estimator

In order to reduce the model bias, we want the number of parameters to be large. However, this will cause overfitting if we continue to use the least squares estimator. We may reduce overfitting by using a biased estimator such as ridge regression. The ridge regression minimizes

$$L(\beta) = |Y - X\beta|^2 + \lambda |\beta|^2,$$

for $\lambda \ge 0$. Setting

$$L'(\beta) = -2 X^\top (Y - X\beta) + 2\lambda\beta = 0,$$

we have

$$\hat\beta = (X^\top X + \lambda I_p)^{-1} X^\top Y.$$

The ridge regression can also be expressed as minimizing the least squares loss $|Y - X\beta|^2$ subject to $|\beta|^2 \le t$ for a certain $t > 0$. More generally, we can minimize

$$L(\beta) = |Y - X\beta|^2 + \lambda \beta^\top D \beta,$$

for a symmetric matrix $D$, and $\hat\beta = (X^\top X + \lambda D)^{-1} X^\top Y$. For ridge regression, in the case $X^\top X = I_p$, we have

$$\hat\beta = \hat\beta_{LS} / (1 + \lambda),$$

which is a shrinkage estimator.
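The shrinkage property in the orthonormal case can be verified directly. In this sketch the orthonormal design is built with a QR decomposition, which is an assumption of the demo rather than anything prescribed by the note; `ridge` is a hypothetical helper name.

```python
import numpy as np

def ridge(X, Y, lam):
    """Closed-form ridge estimate (X^T X + lam * I_p)^{-1} X^T Y."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ Y)

rng = np.random.default_rng(0)
# Orthonormal columns, so that X^T X = I_p.
X, _ = np.linalg.qr(rng.standard_normal((40, 4)))
Y = rng.standard_normal(40)

lam = 0.5
beta_ls = ridge(X, Y, 0.0)        # lam = 0 gives the least squares estimate
beta_ridge = ridge(X, Y, lam)     # every coefficient shrinks by the factor 1 / (1 + lam)
```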

2.3 Linear spline

In the one-dimensional case, the linear spline model is of the form

$$f(x) = \beta_0 + \sum_{k=1}^{d} \beta_k (x - \alpha_k)_+,$$

where $(x - \alpha_k)_+ = \max(0, x - \alpha_k)$, $\alpha_k$, $k = 1, ..., d$, are the knots, and $\beta_k$ is the change of slope at knot $\alpha_k$. We can learn the model from the training data $(x_i, y_i)$, $i = 1, ..., n$, by ridge regression, which minimizes

$$L(\beta) = \sum_{i=1}^{n} \left[ y_i - \beta_0 - \sum_{k=1}^{d} \beta_k (x_i - \alpha_k)_+ \right]^2 + \lambda \sum_{k=1}^{d} \beta_k^2,$$

where $\sum_{k=1}^{d} \beta_k^2$ measures the smoothness of $f(x)$. Let $x_{ik} = (x_i - \alpha_k)_+$, $x_{i0} = 1$. The objective function is

$$L(\beta) = |Y - X\beta|^2 + \beta^\top D \beta,$$

where $X$ is the $n \times (d + 1)$ matrix, and $D$ is the $(d + 1) \times (d + 1)$ diagonal matrix with $D_{kk} = \lambda$, except $D_{11} = 0$ because we do not penalize $\beta_0$. Then

$$\hat\beta = (X^\top X + D)^{-1} X^\top Y.$$
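A minimal sketch of the linear spline fit follows. The knot positions, the value of $\lambda$, and the target function $|x - 0.5|$ are assumptions chosen so that the fit is easy to inspect; the helper names are hypothetical.

```python
import numpy as np

def spline_design(x, knots):
    """Columns: intercept, then hinge features (x - alpha_k)_+ for each knot."""
    hinges = np.maximum(0.0, x[:, None] - knots[None, :])
    return np.column_stack([np.ones_like(x), hinges])

def fit_linear_spline(x, y, knots, lam):
    """Ridge solution (X^T X + D)^{-1} X^T Y with an unpenalized intercept."""
    X = spline_design(x, knots)
    D = lam * np.eye(X.shape[1])
    D[0, 0] = 0.0                               # do not penalize beta_0
    return np.linalg.solve(X.T @ X + D, X.T @ y)

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0.0, 1.0, 100))
y = np.abs(x - 0.5) + 0.02 * rng.standard_normal(100)   # a kink at 0.5
knots = np.linspace(0.0, 0.9, 10)                        # assumed knot positions
beta = fit_linear_spline(x, y, knots, lam=1e-3)
y_fit = spline_design(x, knots) @ beta
```

Since one of the assumed knots sits at the kink, the piecewise linear fit can follow the target closely; a larger `lam` would shrink the slope changes and flatten the fit.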

If $\hat\beta_\lambda$ is the solution to the dual form, then it must be the solution to the primal form with $t = \|\hat\beta_\lambda\|_{\ell_1}$. The reason is that if a different $\hat\beta$ is the solution to the primal form, then $\hat\beta$ is a better solution to the dual form than $\hat\beta_\lambda$, which results in a contradiction. The primal form also reveals the sparsity inducing property of $\ell_1$ regularization in that the $\ell_1$ ball has low-dimensional corners, edges, and faces, but is still barely convex.

Figure 2: Lasso in primal form.

The above is the well known figure of Lasso. Take the left plot for example. The blue region is $\|\beta\|_{\ell_1} \le t$. The red curves are the contour plot, where each red ellipse consists of those $\beta$ that have the same value of $\|Y - X\beta\|_{\ell_2}^2$. An ellipse on the outside has a bigger $\|Y - X\beta\|_{\ell_2}^2$ than an ellipse inside. The solution to the problem of minimizing $\|Y - X\beta\|_{\ell_2}^2$ subject to $\|\beta\|_{\ell_1} \le t$ is where a red ellipse touches the blue region. Any other point in the blue region will be outside that red ellipse and thus have a bigger value of $\|Y - X\beta\|_{\ell_2}^2$. The reason that $\ell_1$ regularization induces sparsity is that it is likely for the red ellipse to touch the blue region at a corner, which is a sparse solution. If we use $\ell_2$ regularization, as is the case with the plot on the right, then the solution is not sparse in general.

Coordinate descent for Lasso solution path

For multi-dimensional $X = (X_j, j = 1, ..., p)$, we can use the coordinate descent algorithm to compute $\hat\beta_\lambda$. The algorithm updates one component at a time: given the current values of $\beta = (\beta_j, j = 1, ..., p)$, let $R_j = Y - \sum_{k \ne j} X_k \beta_k$; we then update

$$\beta_j = \mathrm{sign}(\hat\beta_j) \max\left(0, |\hat\beta_j| - \lambda / \|X_j\|_{\ell_2}^2\right), \quad \text{where } \hat\beta_j = \langle R_j, X_j \rangle / \|X_j\|_{\ell_2}^2.$$

We can find the solution path of Lasso by starting from a big $\lambda$ so that all of the estimated $\beta_j$ are zeros. Then we gradually reduce $\lambda$. For each $\lambda$, we cycle through $j = 1, ..., p$ for coordinate descent until convergence, and then we lower $\lambda$. This gives us $\hat\beta(\lambda)$ for the whole range of $\lambda$. The whole process is a forward selection process, which sequentially selects new variables and occasionally removes selected variables.
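The coordinate descent procedure and the warm-started solution path can be sketched as below. The code assumes the objective $\tfrac{1}{2}\|Y - X\beta\|_{\ell_2}^2 + \lambda\|\beta\|_{\ell_1}$, which is the scaling under which the soft-thresholding update uses the threshold $\lambda / \|X_j\|_{\ell_2}^2$; the function names, the fixed number of sweeps, and the simulated data are illustrative assumptions.

```python
import numpy as np

def soft_threshold(b, thr):
    return np.sign(b) * np.maximum(0.0, np.abs(b) - thr)

def lasso_cd(X, Y, lam, beta=None, n_sweeps=200):
    """Coordinate descent for 0.5 * ||Y - X beta||^2 + lam * ||beta||_1."""
    p = X.shape[1]
    beta = np.zeros(p) if beta is None else beta.copy()
    col_sq = np.sum(X ** 2, axis=0)                  # ||X_j||^2 for each column
    for _ in range(n_sweeps):
        for j in range(p):
            # Partial residual R_j leaves out the contribution of X_j.
            R_j = Y - X @ beta + X[:, j] * beta[j]
            b_ls = X[:, j] @ R_j / col_sq[j]         # univariate least squares
            beta[j] = soft_threshold(b_ls, lam / col_sq[j])
    return beta

def lasso_path(X, Y, lams):
    """Warm-started solutions for a decreasing sequence of lambda values."""
    beta, path = np.zeros(X.shape[1]), []
    for lam in lams:
        beta = lasso_cd(X, Y, lam, beta)
        path.append(beta.copy())
    return np.array(path)

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5))
Y = X @ np.array([3.0, 0.0, 0.0, -2.0, 0.0]) + 0.1 * rng.standard_normal(100)

lam_max = np.max(np.abs(X.T @ Y))    # smallest lambda at which every beta_j is zero
lams = lam_max * np.array([1.0, 0.5, 0.1, 0.01])
path = lasso_path(X, Y, lams)
```

Each row of `path` is the estimate at one value of $\lambda$; starting each fit from the previous solution is what makes tracing the whole path cheap.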

Least angle regression

In the above algorithm, at any given $\lambda$, let $R = Y - \sum_{j=1}^{p} X_j \beta_j$, then $\hat\beta_j = \beta_j + \langle R, X_j \rangle / \|X_j\|_{\ell_2}^2$. If $\beta$ is the Lasso solution, then

$$
\langle R, X_j \rangle =
\begin{cases}
\lambda, & \text{if } \beta_j > 0, \\
-\lambda, & \text{if } \beta_j < 0, \\
s\lambda, & \text{if } \beta_j = 0,
\end{cases}
$$

where $|s| \le 1$. Thus in the above process, the algorithm maintains $\langle R, X_j \rangle$ at $\lambda$ or $-\lambda$ for all selected $X_j$. If we interpret $|\langle R, X_j \rangle|$ in terms of the angle between $R$ and $X_j$, then we may call the above process the equal angle regression or the least angle regression (LARS). In fact, the solution path is piecewise linear, and LARS computes the linear pieces analytically instead of gradually reducing $\lambda$ as in coordinate descent.

Stagewise regression or epsilon-boosting

The stagewise regression iterates the following steps. Given the current $R = Y - \sum_{j=1}^{p} X_j \beta_j$, find $j$ with the maximal $|\langle R, X_j \rangle|$. Then update $\beta_j \leftarrow \beta_j + \varepsilon \langle R, X_j \rangle$ for a small $\varepsilon$. This is similar to matching pursuit but is much less greedy. Such an update will change $R$ and reduce $|\langle R, X_j \rangle|$, until another $X_j$ catches up. So overall, the algorithm ensures that all of the selected $X_j$ have the same $|\langle R, X_j \rangle|$, which is the case with the algorithms in the above two sections. The stagewise regression is also called $\varepsilon$-boosting.

We can also view the stagewise regression from the perspective of the primal form of the Lasso problem: minimize $\|Y - X\beta\|_{\ell_2}^2$ subject to $\|\beta\|_{\ell_1} \le t$. If we relax the constraint by increasing $t$ to $t + \Delta t$, then we want to update the $\beta_j$ with the maximal $|\langle R, X_j \rangle|$ in order to maximally reduce $\|Y - X\beta\|_{\ell_2}^2$.
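The stagewise update can be sketched in a few lines. In this illustration the columns of $X$ are rescaled to unit norm so that $\varepsilon = 0.01$ is a genuinely small step; that normalization, the step counts, and the simulated coefficients are assumptions of the demo.

```python
import numpy as np

def stagewise(X, Y, eps=0.01, n_steps=5000):
    """Epsilon-boosting: repeatedly nudge the coefficient most correlated with R."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_steps):
        R = Y - X @ beta
        corr = X.T @ R                       # <R, X_j> for every j
        j = np.argmax(np.abs(corr))
        beta[j] += eps * corr[j]             # the update stated in the text
    return beta

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5))
X /= np.linalg.norm(X, axis=0)               # unit-norm columns
Y = X @ np.array([30.0, 0.0, 0.0, -20.0, 0.0]) + 0.1 * rng.standard_normal(100)

beta_early = stagewise(X, Y, n_steps=30)     # early stopping: small coefficients
beta_late = stagewise(X, Y, n_steps=5000)    # approaches the least squares fit
```

Stopping early keeps $\|\beta\|_{\ell_1}$ small, which is the sense in which the number of steps acts as an implicit regularization parameter.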

2.5 Logistic regression

Consider a dataset with $n$ training examples, where $x_i^\top = (x_{i1}, \cdots, x_{ip})$ consists of $p$ predictors and $y_i \in \{0, 1\}$ is the outcome or class label. We assume $[y_i \mid x_i, \beta] \sim \mathrm{Bernoulli}(p_i)$, i.e., $\Pr(y_i = 1 \mid x_i, \beta) = p_i$, and we assume

$$\mathrm{logit}(p_i) = \log \frac{p_i}{1 - p_i} = s_i = x_i^\top \beta.$$

Then

$$p_i = \mathrm{sigmoid}(s_i) = \frac{e^{s_i}}{1 + e^{s_i}} = \frac{1}{1 + e^{-s_i}},$$

where the sigmoid function is the inverse of the logit function.
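A quick numerical check that the sigmoid inverts the logit, and that its two algebraic forms agree, might look like this (the score values are arbitrary):

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def logit(p):
    return np.log(p / (1.0 - p))

s = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])    # arbitrary scores s_i
p = sigmoid(s)                                # probabilities in (0, 1)
s_back = logit(p)                             # recovers the scores
```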

2.6 Classification and perceptron

For logistic regression, we want to learn $\beta$ either for the purpose of explanation or understanding, or for the purpose of classification or prediction. In the context of classification, we usually let $y_i \in \{+1, -1\}$ instead of $y_i \in \{1, 0\}$. Those $x_i$ with $y_i = +1$ are called positive examples, and those $x_i$ with $y_i = -1$ are called negative examples. We may call $\beta$ a classifier. $s_i = x_i^\top \beta = \langle x_i, \beta \rangle$ is the projection of $x_i$ on the vector $\beta$, so the vector $\beta$ is the direction that reveals the difference between positive $x_i$ and negative $x_i$. Thus $\beta$ should be aligned with positive $x_i$ and negatively aligned with negative $x_i$, i.e., $\beta$ should point from the negative examples to the positive examples. According to the previous subsection,

$$\Pr(y_i = +1 \mid x_i, \beta) = \frac{1}{1 + \exp(-s_i)}.$$

A deterministic version of the logistic regression is the perceptron model

yi = sign(si),

where sign(s) = +1 if s ≥ 0, and sign(s) = −1 if s < 0. The perceptron model is inspired by neuroscience. See Figure 3. It can be considered an over-simplified model of a neuron, which takes input xi, and emits output yi. See Figure 4.
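The relation between the perceptron rule and the logistic probability can be made concrete with a toy example. The inputs and the classifier `beta` below are made-up numbers, and the helper names are hypothetical.

```python
import numpy as np

def scores(X, beta):
    """s_i = <x_i, beta> for each row x_i of X."""
    return X @ beta

def perceptron_predict(X, beta):
    """Deterministic rule y_i = sign(s_i), with sign(0) = +1 as in the text."""
    return np.where(scores(X, beta) >= 0, 1, -1)

def prob_positive(X, beta):
    """Logistic regression: Pr(y_i = +1 | x_i, beta) = 1 / (1 + exp(-s_i))."""
    return 1.0 / (1.0 + np.exp(-scores(X, beta)))

# Toy inputs and classifier (made-up numbers for illustration).
X = np.array([[2.0, 1.0], [-1.0, -3.0], [0.0, 0.0]])
beta = np.array([1.0, 0.5])

labels = perceptron_predict(X, beta)
probs = prob_positive(X, beta)
```

The perceptron outputs $+1$ exactly when the logistic probability of the positive class is at least one half.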

Loss function for robust linear regression

We may also use the absolute value loss,

$$L(y_i, s_i) = |y_i - s_i|,$$

which penalizes large differences between $y_i$ and $s_i = x_i^\top \beta$ to a lesser degree than the least squares loss, so the estimated $\beta$ is less affected by outliers.
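The effect of outliers is easiest to see in the simplest possible model, where $s_i$ is a single constant $s$ (an intercept-only regression, which is an assumption made here for illustration). The least squares loss is then minimized by the mean, while the absolute value loss is minimized by a median. The data values below are made up.

```python
import numpy as np

y = np.array([1.0, 1.2, 0.9, 1.1, 1.0, 15.0])    # the last value is an outlier

grid = np.linspace(0.0, 16.0, 16001)              # candidate constants s
sq_loss = ((y[None, :] - grid[:, None]) ** 2).sum(axis=1)
abs_loss = np.abs(y[None, :] - grid[:, None]).sum(axis=1)

s_sq = grid[np.argmin(sq_loss)]     # close to the mean, pulled toward the outlier
s_abs = grid[np.argmin(abs_loss)]   # lies between the two middle observations
```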

Loss function for logistic regression with 0/1 responses

For logistic regression, we usually maximize the likelihood function, which is

$$\mathrm{Likelihood}(\beta) = \prod_{i=1}^{n} \Pr(y_i \mid x_i, \beta).$$

That is, we want to find $\beta$ to maximize the probability of the observed $(y_i, i = 1, ..., n)$ given $(x_i, i = 1, ..., n)$. The maximum likelihood estimate gives the most plausible explanation to the observed data. For $y_i \in \{0, 1\}$,

$$\Pr(y_i = 1 \mid x_i, \beta) = \mathrm{sigmoid}(s_i) = \frac{\exp(s_i)}{1 + \exp(s_i)},$$

$$\Pr(y_i = 0 \mid x_i, \beta) = 1 - \Pr(y_i = 1 \mid x_i, \beta) = \frac{1}{1 + \exp(s_i)}.$$

We can combine the above two equations by

$$\Pr(y_i \mid x_i, \beta) = \frac{\exp(y_i s_i)}{1 + \exp(s_i)}.$$

The log-likelihood is

$$\mathrm{LogLikelihood}(\beta) = \sum_{i=1}^{n} \log \Pr(y_i \mid x_i, \beta) = \sum_{i=1}^{n} \left[ y_i s_i - \log(1 + \exp(s_i)) \right].$$

We can define the loss function as the negative log-likelihood

$$L(y_i, s_i) = -\left[ y_i s_i - \log(1 + \exp(s_i)) \right].$$

Loss function for logistic regression with ± responses

If $y_i \in \{+1, -1\}$, we have

$$\Pr(y_i = +1 \mid x_i, \beta) = \frac{1}{1 + \exp(-s_i)},$$

and

$$\Pr(y_i = -1 \mid x_i, \beta) = \frac{1}{1 + \exp(s_i)}.$$

Combining them, we have

$$\Pr(y_i \mid x_i, \beta) = \frac{1}{1 + \exp(-y_i s_i)}.$$

The log-likelihood is

$$\sum_{i=1}^{n} \log \Pr(y_i \mid x_i, \beta) = -\sum_{i=1}^{n} \log\left(1 + \exp(-y_i s_i)\right).$$

We define the loss function as the negative log-likelihood. Thus

$$L(y_i, s_i) = \log\left[1 + \exp(-y_i s_i)\right].$$

This loss is called the logistic loss. The least squares loss for linear regression can also be derived from the log-likelihood if we assume the errors follow a normal distribution.
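The negative log-likelihood for 0/1 responses and the logistic loss for ± responses are the same quantity under the recoding $y^{\pm} = 2y^{0/1} - 1$. The sketch below checks this numerically; the scores and labels are arbitrary, and `np.logaddexp(0, s)` is used as a numerically stable way to compute $\log(1 + e^{s})$.

```python
import numpy as np

def nll_01(y, s):
    """Negative log-likelihood for y in {0, 1}: -(y*s - log(1 + exp(s)))."""
    return -(y * s - np.logaddexp(0.0, s))

def logistic_loss_pm(y, s):
    """Logistic loss for y in {+1, -1}: log(1 + exp(-y*s))."""
    return np.logaddexp(0.0, -y * s)

s = np.array([-4.0, -1.0, 0.0, 2.0, 5.0])     # arbitrary scores
y01 = np.array([0, 1, 1, 0, 1])               # labels coded as 0/1
ypm = 2 * y01 - 1                             # the same labels coded as -1/+1

loss_01 = nll_01(y01, s)
loss_pm = logistic_loss_pm(ypm, s)
```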

Loss functions for classification

Figure 5: Loss functions for classification. The horizontal axis is $m_i = y_i x_i^\top \beta$. The vertical axis is $L(y_i, x_i^\top \beta)$. The exponential loss and the hinge loss can be considered approximations to the logistic loss. These loss functions penalize negative $m_i$. The more negative $m_i$ is, the bigger the loss. The loss functions also penalize small positive $m_i$, e.g., those $m_i < 1$. Such loss functions encourage correct and confident classifications.

The following summarizes several possible choices for the loss function $L(y_i, s_i)$ for classification. See Figure 5.

Logistic loss = $\log(1 + \exp(-y_i s_i))$,
Exponential loss = $\exp(-y_i s_i)$,
Hinge loss = $\max(0, 1 - y_i s_i)$,
Zero-one loss = $1(y_i s_i < 0)$.

Both the exponential and hinge losses can be considered approximations to the logistic loss. The logistic loss is used by logistic regression. The exponential loss is used by AdaBoost. The hinge loss is used by support vector machines. The zero-one loss counts the number of mistakes. It is not differentiable and is not used for training.

All the above loss functions are based on $m_i = y_i s_i$. We call $m_i$ the margin for example $(y_i, x_i)$. We want $y_i$ and $s_i = x_i^\top \beta$ to be of the same sign for correct classification. If $y_i = +1$, we want $s_i = x_i^\top \beta$ to be very positive. If $y_i = -1$, we want $s_i = x_i^\top \beta$ to be very negative. We want the margin $m_i$ to be as large as possible for confident classification.
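The losses can be written as functions of the margin alone, as in the sketch below. The zero-one loss is taken here to be the indicator of a mistake (a negative margin), which is an interpretive assumption; the margin values are arbitrary.

```python
import numpy as np

def logistic_loss(m):
    return np.logaddexp(0.0, -m)          # log(1 + exp(-m)), computed stably

def exponential_loss(m):
    return np.exp(-m)

def hinge_loss(m):
    return np.maximum(0.0, 1.0 - m)

def zero_one_loss(m):
    return (m < 0).astype(float)          # 1 for a mistake, 0 otherwise

m = np.array([-2.0, -0.5, 0.5, 2.0])      # margins m_i = y_i * s_i
table = {
    "logistic": logistic_loss(m),
    "exponential": exponential_loss(m),
    "hinge": hinge_loss(m),
    "zero-one": zero_one_loss(m),
}
```

All four losses are non-increasing in the margin, and the three surrogate losses remain positive at $m_i = 0.5$, which is how they penalize correct but unconfident classifications.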