

























































































































































































11.11 Partially observed MDP (POMDP)
11.12 Multi-agent reinforcement learning
11.13 Inverse reinforcement learning (IRL)
11.14 Energy-based model
12 Causal Learning
12.1 Counterfactual, potential outcome
12.2 Matching and propensity score
12.3 Structural equations, causal diagram
12.4 Training data and testing data
12.5 Do calculus, back door and front door adjustment
13 Background: Convex Optimization and Duality
13.1 von Neumann minimax theorem
13.2 Constrained optimization and Lagrange multipliers
13.3 Legendre-Fenchel convex conjugate
Credits: Most of the figures in the note are taken from the web, and the credits and copyrights belong to the original authors. Materials presented in the note are based on the original papers and related books. References are being added. This note is about machine learning methods. The physicist Wolfgang Pauli apologized in a letter for “no time to be brief”. This also applies to this note, which should be further compactified.
We begin with an ancient example of learning to highlight the methods and issues in machine learning.
Figure 1: Left: epicycle model, with Earth at the center. Right: Newtonian model, with Sun at the center.
Let $(x(t), y(t))$ be the position of a planet at time $t$, with Earth at the origin $(0, 0)$. Suppose we observe $(x_i, y_i)$ at times $t_i$, $i = 1, ..., n$. We may treat $t_i$ as the input and $(x_i, y_i)$ as the output. We want to learn a model that can predict the position of the planet at a future time. Of course, we also hope to understand the physics of planetary motion.
A single circle
The simplest model is a circular trajectory,
x(t) = r cos(ωt), y(t) = r sin(ωt),
where r is the radius and ω is the angular speed. We can learn the model by minimizing the following least squares loss:
$$L(r, \omega) = \sum_{i=1}^{n} \left[ (x_i - r\cos(\omega t_i))^2 + (y_i - r\sin(\omega t_i))^2 \right].$$
We may use gradient descent to find a local minimum of $L$ over $(r, \omega)$. If $\omega$ is given, the estimation of $r$ becomes a least squares linear regression, where $r$ is the regression coefficient ($\beta$ in modern notation) and $(\cos(\omega t), \sin(\omega t))$ are the predictors ($x$ in modern notation). We may therefore search $\omega$ over a regular grid of points and, for each $\omega$, estimate $r$ by least squares. We then choose the $\omega$ that gives the minimal value of $L$.
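The grid search described above can be sketched as follows. This is a minimal illustration, not the method used historically: the data are synthetic, and the function name `fit_circle`, the grid, and the noise level are all assumptions made for the example. For a fixed ω, the least squares estimate of r has the closed form used in the loop, because the squared norm of the predictors (cos(ωt_i), sin(ωt_i)) sums to n.

```python
import numpy as np

def fit_circle(t, x, y, omegas):
    """Grid search over omega; for each omega, r is estimated by least squares."""
    best = None
    for omega in omegas:
        c, s = np.cos(omega * t), np.sin(omega * t)
        # Regression of (x, y) on (cos, sin) without intercept.
        # The denominator sum(c^2 + s^2) equals the number of observations.
        r = (x @ c + y @ s) / len(t)
        loss = np.sum((x - r * c) ** 2 + (y - r * s) ** 2)
        if best is None or loss < best[2]:
            best = (r, omega, loss)
    return best

# Synthetic observations (hypothetical values): r = 2, omega = 0.7, small noise.
rng = np.random.default_rng(0)
t = np.linspace(0.0, 20.0, 200)
x = 2.0 * np.cos(0.7 * t) + 0.01 * rng.standard_normal(t.size)
y = 2.0 * np.sin(0.7 * t) + 0.01 * rng.standard_normal(t.size)

r_hat, omega_hat, loss_hat = fit_circle(t, x, y, np.linspace(0.1, 2.0, 191))
print(r_hat, omega_hat, loss_hat)
```

The loss is highly non-convex in ω, which is why a grid is used here instead of gradient descent from an arbitrary starting point.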
Complex number
We can also write the model using complex numbers. Let $z(t) = x(t) + i y(t)$ and $z_i = x_i + i y_i$. Then the model becomes
$$z(t) = r e^{i\omega t},$$
and
$$L(r, \omega) = \sum_{i=1}^{n} \left| z_i - r e^{i\omega t_i} \right|^2.$$
Circle on top of circle
Ptolemy found out that a single circle did not fit the data well enough. He then assumed that the planet is moving around a center in a circular motion, and this center itself is moving around the Earth in a circular motion. This is an epicycle model that can be written as
$$z(t) = r_1 e^{i\omega_1 t} + r_2 e^{i\omega_2 t},$$
where $r_1 e^{i\omega_1 t}$ is the original circle, and $r_2 e^{i\omega_2 t}$ is the circle on top of the original circle. If two circles are not enough, we can add a third circle. In general, we may consider a model
$$z(t) = \sum_{k=1}^{d} r_k e^{i\omega_k t},$$
where d (for dimensionality or degrees of freedom) is the number of circles, and defines the complexity of the model. This is a stroke of genius. It is a precursor to Fourier analysis, which shows that with enough circles, we can fit any trajectory. The above model is by all means a good model, as good as any model we can find in machine learning literature. It is flexible enough to fit all the curves. It has a clear geometric meaning.
In science, we prefer simple models. As von Neumann is reported to have said, “With four parameters I can fit an elephant, and with five I can make him wiggle his trunk.” In fact, “adding an epicycle” is synonymous with bad science. However, in machine learning, we usually use models that are similar to Ptolemy’s because of the lack of domain knowledge. Such models tend to have a lot of parameters and sometimes are even “over-parametrized”, meaning that the number of parameters exceeds the number of training examples. Such models are either explicitly or implicitly regularized to avoid overfitting. In Lasso and ridge regression, we use explicit regularization. In boosting, we use implicit regularization by early stopping. For neural networks, gradient descent tends to settle into a local mode near the random initialization, which is also a form of implicit regularization.
Ptolemy’s model is actually better than Newton’s model in general, because the former is more universal, and the latter is more specialized to our universe. If our universe is more complex, we may have to use Ptolemy’s model or even more complex models such as kernel machines or neural networks. Which model is better depends on what data the model is intended to explain. For instance, for the stock market, we do not expect a model as simple as Newton’s can fit well. On the other hand, the current kernel machines or neural networks may be too complex for explaining the experiences of a biological intelligent system, which may actually use models similar to Ptolemy’s.
In the above discussion, we used the modern notation and ideas in statistics and machine learning. Least squares, Lasso, ridge regression, etc. were invented nearly two thousand years after Ptolemy’s genius idea. But indeed we can trace the origin of statistics and machine learning to astronomy. The model is Newtonian mechanics (even though modern machine learning models are more in the style of Ptolemy). The data are those obtained by the observatories. Mathematicians such as Gauss worked for such observatories. Euler studied the orbit of Jupiter around the Sun. The motion of Jupiter is influenced by Saturn. By Taylor expansion or perturbation analysis, Euler obtained the following equation:
$$\begin{aligned}
\phi = {} & \eta - 23525'' \sin q + 168'' \sin 2q - 32'' \sin 2w - 257'' \sin(w - q) - 243'' \sin(2w - p) + m'' - x'' \sin q + y'' \sin 2q \\
& - z'' \sin(w - p) - u(\alpha + 360 v + p) \cos(w - p) + N n'' - 11405 k'' \cos q + (1/600) k'' \cos 2q,
\end{aligned}$$
where (ϕ, η, q, w, p, N, v) are observed and vary from observation to observation, and (x, y, m, z, α, k, n, u) are unknown parameters. Euler had 75 observations, and by treating uα as γ he had 7 unknown parameters. In other words, he had 75 equations and 7 unknowns. Let us translate the above equation into more familiar notation in modern statistics. Let
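$$y_i = \phi_i, \quad \beta = \begin{pmatrix} \beta_1 \\ \beta_2 \\ \vdots \\ \beta_p \end{pmatrix} = \begin{pmatrix} x \\ y \\ \vdots \\ u \end{pmatrix}, \quad \text{and} \quad x_i = \begin{pmatrix} x_{i1} \\ x_{i2} \\ \vdots \\ x_{ip} \end{pmatrix} = \begin{pmatrix} \sin q_i \\ \sin 2q_i \\ \vdots \\ \sin(w_i - p_i) \end{pmatrix}.$$
Then the equation can be written as
$$y_i = x_i^\top \beta = x_{i1}\beta_1 + x_{i2}\beta_2 + \cdots + x_{ip}\beta_p \quad \text{for } i = 1, \dots, n = 75.$$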
The following are some observations: (1) yi is linear in β , but it can be non-linear in the original variables (q, w, p, N, u). (2) The model is known to be correct a priori. While (1) is common in linear models, (2) is rare in machine learning. Euler did not go very far in solving the above problem.
Laplace proposed the method of combination of equations, e.g., we combine the 75 equations into 7 equations, so that we can solve for the 7 unknowns. Specifically, we solve $\beta$ from the following estimating equations:
$$\sum_{i=1}^{n} w_{ik} y_i = \sum_{i=1}^{n} w_{ik} \sum_{j=1}^{p} x_{ij}\beta_j, \quad k = 1, ..., p,$$
where $(w_{ik})$ is a set of pre-designed weights. Laplace designed a special set of weights, but he did not give a general principle for designing them.
Gauss (and Legendre) proposed to estimate β by least squares, i.e., minimizing the loss function
$$\mathcal{L}(\beta) = \sum_{i=1}^{n} \Big( y_i - \sum_{j=1}^{p} x_{ij}\beta_j \Big)^2.$$
The above loss function can be minimized in closed form by solving the linear equation L′(β ) = 0. This leads to the following estimating equation:
$$\frac{\partial}{\partial \beta_k} \mathcal{L}(\beta) = -2 \sum_{i=1}^{n} x_{ik} \Big( y_i - \sum_{j=1}^{p} x_{ij}\beta_j \Big) = 0, \quad k = 1, ..., p.$$
This estimating equation corresponds to Laplace’s estimating equation with wik = xik. Gauss did three things that set the paradigm for statistics and machine learning. (1) Gauss started with a loss function. In machine learning, most of the methods start from loss functions. (2) Gauss motivated the loss function by a probabilistic formulation. He assumed that
$$y_i = \sum_{j=1}^{p} x_{ij}\beta_j + \varepsilon_i, \quad \varepsilon_i \sim N(0, \sigma^2),$$
independently for i = 1 , ..., n. Assuming a prior distribution p(β ), the posterior distribution is
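$$p(\beta \mid (x_i, y_i), i = 1, ..., n) \propto p(\beta) \prod_{i=1}^{n} p(y_i \mid x_i, \beta) \propto p(\beta) \exp\Big[ -\frac{1}{2\sigma^2} \sum_{i=1}^{n} \Big( y_i - \sum_{j=1}^{p} x_{ij}\beta_j \Big)^2 \Big].$$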
Assuming a uniform $p(\beta)$, maximizing $p(\beta \mid (x_i, y_i), i = 1, ..., n)$ is equivalent to minimizing the loss function $\mathcal{L}(\beta)$. $\prod_{i=1}^{n} p(y_i \mid x_i, \beta)$ is called the likelihood. The least squares estimate is also the maximum likelihood estimate. (3) Gauss analyzed the properties of the least squares estimator $\hat\beta_{LS}$, which is a function of $((x_i, y_i), i = 1, ..., n)$. He used Frequentist thinking, even though the loss function is motivated by Bayesian thinking. Specifically, we assume $(x_i, y_i) \sim p(x, y)$ independently, where $p(x, y)$ is a joint distribution whose conditional distribution $p(y \mid x)$ is given by $y_i \mid x_i \sim N(x_i^\top \beta_{true}, \sigma^2)$, where $\beta_{true}$ is the true value of $\beta$. If we believe Newtonian mechanics, then such a $\beta_{true}$ does exist. In Frequentist thinking, we assume $\beta_{true}$ is fixed but unknown (whereas in Bayesian thinking, we treat $\beta$ as a random variable). The Frequentist thinking is
       input    features   output
1      x_1^T    h_1^T      y_1
2      x_2^T    h_2^T      y_2
...
n      x_n^T    h_n^T      y_n
The supervised learning can be represented by the diagram below,
output: y_i
   ↑
features: h_i
   ↑
input: x_i
where the vector of features hi is computed from xi via hi = h(xi). Encoder and decoder: In the above diagram, the transformation xi → hi is called an encoder, and the transformation hi → yi is called a decoder. Both classification and regression are about supervised learning because for each input xi, an output yi is provided as supervision. In regression, yi is continuous. In classification, yi is categorical. We can represent yi by a one-hot vector, i.e., if yi denotes the k-th category, then yi is a vector where the k-th element is 1 and all the other elements are 0.
Unsupervised learning
In unsupervised learning, the dataset is as below, where yi are not provided as supervision.
       input    hidden     output
1      x_1^T    h_1^T      ?
2      x_2^T    h_2^T      ?
...
n      x_n^T    h_n^T      ?
In a generative model, the vector hi is not a vector of features extracted from the signal xi. hi is a vector of hidden variables that is used to generate xi, as illustrated by the following diagram:
hidden: h_i
   ↓
input: x_i
The components of the d-dimensional hi are variably called factors, sources, components or causes. The prototype example is factor analysis or principal component analysis. Auto-encoder: hi is also called a code in the auto-encoder illustrated by the following diagram:
code: h_i
   ↑↓
input: x_i
The direction from hi to xi is called the decoder, and the direction from xi to hi is called the encoder.
Distributed representation and disentanglement: $h_i = (h_{ik}, k = 1, ..., d)$ is called a distributed representation of $x_i$. Usually the components of $h_i$, $(h_{ik}, k = 1, ..., d)$, are assumed to be independent, and $(h_{ik})$ are said to disentangle the variations in $x_i$. Embedding: $h_i$ can also be considered the coordinates of $x_i$, if we embed $x_i$ in a low-dimensional space, as illustrated by the following diagram:
← hi → | ← xi →
In the training data, we find a hi for each xi, so that {hi, i = 1 , ..., n} preserve the relative relations between {xi, i = 1 , ..., n}. The prototype example of embedding is multi-dimensional scaling, where we want to preserve the Euclidean distances between the examples.
Reinforcement learning
Reinforcement learning is similar to supervised learning except that the guidance is in the form of reward. Here xi is the state. yi can be the action taken at this state. yi can also be the value of this state, where value is defined as the accumulated reward.
We can write (xi, yi) ∼ pθ (xi, yi) for i = 1 , ..., n. In supervised learning, we let pθ (x, y) = pθ (y|x)p(x), where we learn pθ (y|x) and we leave p(x) alone. In unsupervised learning, we do not observe y, and we model pθ (x) instead. In the following, we focus on supervised learning. In Bayesian framework, we treat θ as a random variable. We assume its marginal distribution to be p(θ ). It is called the prior distribution. The learning is based on posterior distribution
$$p(\theta \mid (x_i, y_i), i = 1, ..., n) \propto p(\theta) \prod_{i=1}^{n} p(y_i \mid x_i, \theta),$$
where we write $p_\theta(y \mid x)$ as $p(y \mid x, \theta)$ to emphasize that $\theta$ is a random variable to be conditioned upon. $\prod_{i=1}^{n} p(y_i \mid x_i, \theta)$ is called the likelihood, and $l(\theta \mid (x_i, y_i), i = 1, ..., n) = \sum_{i=1}^{n} \log p(y_i \mid x_i, \theta)$ is called the log-likelihood. If we estimate $\theta$ by maximizing $p(\theta \mid (x_i, y_i), i = 1, ..., n)$ over $\theta$, we get the so-called Maximum A Posteriori (MAP) estimate. Up to an additive constant that does not depend on $\theta$,
$$\log p(\theta \mid (x_i, y_i), i = 1, ..., n) = l(\theta \mid (x_i, y_i), i = 1, ..., n) + \log p(\theta) + \text{const}.$$
If $p(\theta)$ is uniform within a range, MAP becomes the maximum likelihood estimate (MLE). For non-uniform $p(\theta)$, MAP is penalized or regularized likelihood.
MAP only captures the maximum or mode of the posterior distribution but misses the uncertainty in the posterior distribution. To capture the uncertainty, we may draw multiple samples $\theta_m \sim p(\theta \mid (x_i, y_i), i = 1, ..., n)$ for $m = 1, ..., M$. This can be accomplished by Monte Carlo, such as Markov chain Monte Carlo (MCMC). These $\theta_m$ are multiple guesses of $\theta$.
The posterior $p(\theta \mid (x_i, y_i), i = 1, ..., n)$ is often not tractable in the sense that we cannot calculate the normalizing constant of $p(\theta) \prod_{i=1}^{n} p(y_i \mid x_i, \theta)$ to make it a probability distribution. In variational inference, we find a simpler distribution $q_\phi(\theta)$ to approximate $p(\theta \mid (x_i, y_i), i = 1, ..., n)$, where $\phi$ is the variational parameter that we choose to minimize the divergence from $q_\phi(\theta)$ to $p(\theta \mid (x_i, y_i), i = 1, ..., n)$.
Background: derivatives in matrix form
Suppose $Y = (y_i)_{m \times 1}$ and $X = (x_j)_{n \times 1}$. Suppose $Y = h(X)$. We can define
$$\frac{\partial Y}{\partial X^\top} = \left( \frac{\partial y_i}{\partial x_j} \right)_{m \times n}.$$
To understand the notation, we can treat $\partial Y = (\partial y_i, i = 1, ..., m)^\top$ as a column vector, and $1/\partial X = (1/\partial x_j, j = 1, ..., n)^\top$ as another column vector. Now we have two vectors of operations, instead of numbers. The product of the elements of the two vectors is understood as the composition of the two operators, i.e., $\partial y_i (1/\partial x_j) = \partial y_i/\partial x_j$. Then $\partial Y/\partial X^\top$ is an $m \times n$ matrix according to the matrix multiplication rule. If $Y = AX$, then $y_i = \sum_k a_{ik} x_k$. Thus $\partial y_i/\partial x_j = a_{ij}$. So $\partial Y/\partial X^\top = A$. If $Y = X^\top S X$, where $S$ is symmetric, then $\partial Y/\partial X = 2SX$. If $S = I$, then $Y = |X|^2$ and $\partial Y/\partial X = 2X$. The chain rule in matrix form is as follows. If $Y = h(X)$ and $X = g(Z)$, then
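$$\frac{\partial y_i}{\partial z_j} = \sum_k \frac{\partial y_i}{\partial x_k} \frac{\partial x_k}{\partial z_j}.$$
Thus
$$\frac{\partial Y}{\partial Z^\top} = \frac{\partial Y}{\partial X^\top} \frac{\partial X}{\partial Z^\top}.$$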
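The matrix derivative identities in this subsection can be checked numerically. The sketch below compares them with central finite differences; the helper `numerical_jacobian`, the random matrices, and the choice of tanh for the inner map g are assumptions made only for illustration.

```python
import numpy as np

def numerical_jacobian(h, x, eps=1e-6):
    """Central-difference estimate of dY/dX^T, an m x n matrix with entries dy_i/dx_j."""
    x = np.asarray(x, dtype=float)
    y0 = np.atleast_1d(h(x))
    jac = np.zeros((y0.size, x.size))
    for j in range(x.size):
        step = np.zeros_like(x)
        step[j] = eps
        jac[:, j] = (np.atleast_1d(h(x + step)) - np.atleast_1d(h(x - step))) / (2 * eps)
    return jac

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 4))
B = rng.standard_normal((4, 4))
S = B + B.T                      # a symmetric matrix
G = rng.standard_normal((4, 2))
x = rng.standard_normal(4)
z = rng.standard_normal(2)

# Y = A X  gives  dY/dX^T = A.
jac_linear = numerical_jacobian(lambda v: A @ v, x)

# Y = X^T S X  gives  dY/dX = 2 S X (the 1 x n Jacobian is its transpose).
jac_quad = numerical_jacobian(lambda v: v @ S @ v, x)

# Chain rule with X = g(Z) = tanh(G Z) and Y = A X:
# dY/dZ^T = (dY/dX^T) (dX/dZ^T) = A (dX/dZ^T).
g = lambda u: np.tanh(G @ u)
jac_chain = numerical_jacobian(lambda u: A @ g(u), z)
jac_g = numerical_jacobian(g, z)

print(np.max(np.abs(jac_linear - A)))
print(np.max(np.abs(jac_quad[0] - 2 * S @ x)))
print(np.max(np.abs(jac_chain - A @ jac_g)))
```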
Least squares estimator
For general (X,Y ), L (β ) = |Y − Xβ |^2.
Let e = Y − Xβ , then L (β ) = |e|^2. Applying the chain rule,
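$$\frac{\partial \mathcal{L}}{\partial \beta^\top} = \frac{\partial \mathcal{L}}{\partial e^\top} \frac{\partial e}{\partial \beta^\top} = -2 e^\top X,$$
hence
$$\mathcal{L}'(\beta) = \frac{\partial \mathcal{L}}{\partial \beta} = -2 X^\top (Y - X\beta).$$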
Setting L ′(β ) = 0, we get the least squares estimator
$$\hat\beta = (X^\top X)^{-1} X^\top Y.$$
Geometrically, $\hat Y = X\hat\beta$ is the projection of $Y$ onto the subspace spanned by the columns of $X = (X_1, ..., X_p)$, so that $e = Y - X\beta$ is perpendicular to $X_j$ at $\hat\beta$, i.e., $\langle e, X_j \rangle = X_j^\top e = 0$ for $j = 1, ..., p$, i.e., $X^\top (Y - X\beta) = 0$, which leads to the least squares $\hat\beta$. The projection is
$$\hat Y = X\hat\beta = X (X^\top X)^{-1} X^\top Y = HY,$$
where the hat matrix $H = X (X^\top X)^{-1} X^\top$ encodes the projection operation.
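As a small numerical illustration of the least squares estimator and the hat matrix, the sketch below uses synthetic data; the dimensions, the true coefficients, and the noise level are hypothetical. It solves the normal equations with `np.linalg.solve` rather than forming the inverse explicitly, which is a common numerical choice and not something prescribed by the derivation.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 3
X = rng.standard_normal((n, p))
beta_true = np.array([1.0, -2.0, 0.5])           # hypothetical true coefficients
Y = X @ beta_true + 0.1 * rng.standard_normal(n)

# Least squares estimator from the normal equations X^T X beta = X^T Y.
beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)

# Hat matrix H = X (X^T X)^{-1} X^T and the projection Y_hat = H Y.
H = X @ np.linalg.solve(X.T @ X, X.T)
Y_hat = H @ Y
residual = Y - Y_hat

print(beta_hat)
print(np.max(np.abs(X.T @ residual)))  # residual is orthogonal to every column of X
```

Because H is a projection, applying it twice gives the same result as applying it once, which can be checked with `np.allclose(H @ H, H)`.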
In order to reduce the model bias, we want the number of parameters to be large. However, this will cause overfitting if we continue to use the least squares estimator. We may reduce overfitting by using a biased estimator such as ridge regression. The ridge regression minimizes
$$\mathcal{L}(\beta) = |Y - X\beta|^2 + \lambda |\beta|^2,$$
for $\lambda \ge 0$. Setting
$$\mathcal{L}'(\beta) = -2 X^\top (Y - X\beta) + 2\lambda\beta = 0,$$
we have
$$\hat\beta = (X^\top X + \lambda I_p)^{-1} X^\top Y.$$
The ridge regression can also be expressed as minimizing the least squares loss |Y − Xβ |^2 subject to |β |^2 ≤ t for a certain t > 0. More generally, we can minimize
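$$\mathcal{L}(\beta) = |Y - X\beta|^2 + \lambda \beta^\top D \beta,$$
for a symmetric matrix $D$, and $\hat\beta = (X^\top X + \lambda D)^{-1} X^\top Y$. For ridge regression, in the case $X^\top X = I_p$, we have
$$\hat\beta = \hat\beta_{LS}/(1 + \lambda),$$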
which is a shrinkage estimator.
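The shrinkage property can be verified numerically. In the sketch below, the function name `ridge` and the synthetic data are assumptions; a design matrix with orthonormal columns is obtained from a QR decomposition so that the condition X^T X = I_p holds.

```python
import numpy as np

def ridge(X, Y, lam):
    """Ridge estimator (X^T X + lam I_p)^{-1} X^T Y."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ Y)

rng = np.random.default_rng(0)
# Orthonormal columns: Q from the reduced QR decomposition satisfies Q^T Q = I_p.
X, _ = np.linalg.qr(rng.standard_normal((40, 4)))
Y = rng.standard_normal(40)

lam = 3.0
beta_ls = ridge(X, Y, 0.0)       # lam = 0 recovers ordinary least squares
beta_ridge = ridge(X, Y, lam)

print(beta_ridge)
print(beta_ls / (1.0 + lam))     # the two printed vectors should agree
```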
In the one-dimensional case, the linear spline model is of the form
$$f(x) = \beta_0 + \sum_{k=1}^{d} \beta_k (x - \alpha_k)_+,$$
where (x − αk)+ = max( 0 , x − αk), αk, k = 1 , ...., d are the knots, and βk is the change of slope at knot αk. We can learn the model from the training data (xi, yi), i = 1 , ..., n by ridge regression which minimizes
$$\mathcal{L}(\beta) = \sum_{i=1}^{n} \Big( y_i - \beta_0 - \sum_{k=1}^{d} \beta_k (x_i - \alpha_k)_+ \Big)^2 + \lambda \sum_{k=1}^{d} \beta_k^2,$$
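where $\sum_{k=1}^{d} \beta_k^2$ penalizes the changes of slope and thus controls the smoothness of $f(x)$. Let $x_{ik} = (x_i - \alpha_k)_+$ and $x_{i0} = 1$. The objective function is
$$\mathcal{L}(\beta) = |Y - X\beta|^2 + \beta^\top D \beta,$$
where $X$ is the $n \times (d + 1)$ matrix with entries $x_{ik}$, and $D$ is the $(d + 1) \times (d + 1)$ diagonal matrix with diagonal entries equal to $\lambda$, except that the first diagonal entry, which corresponds to $\beta_0$, is 0 because we do not penalize $\beta_0$. Then
$$\hat\beta = (X^\top X + D)^{-1} X^\top Y.$$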
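A minimal sketch of the linear spline fit by ridge regression is given below. The function names, the knot locations, the target function |x − 0.5|, and the values of λ are all hypothetical choices for illustration.

```python
import numpy as np

def spline_design(x, knots):
    """Design matrix with columns 1, (x - a_1)_+, ..., (x - a_d)_+."""
    return np.column_stack([np.ones_like(x)] + [np.maximum(0.0, x - a) for a in knots])

def fit_linear_spline(x, y, knots, lam):
    X = spline_design(x, knots)
    D = lam * np.eye(X.shape[1])
    D[0, 0] = 0.0                      # the intercept beta_0 is not penalized
    return np.linalg.solve(X.T @ X + D, X.T @ y)

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0.0, 1.0, 200))
y = np.abs(x - 0.5) + 0.02 * rng.standard_normal(x.size)   # piecewise linear target
knots = np.linspace(0.0, 0.9, 10)                           # knots at 0, 0.1, ..., 0.9

beta = fit_linear_spline(x, y, knots, lam=0.01)
fitted = spline_design(x, knots) @ beta
rmse = np.sqrt(np.mean((fitted - y) ** 2))

# A very large penalty shrinks all slope changes toward zero, leaving a constant fit.
beta_big = fit_linear_spline(x, y, knots, lam=1e8)
print(rmse, beta_big[0], y.mean())
```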
is the solution to the dual form, then it must be the solution to the primal form with $t = \|\hat\beta_\lambda\|_{\ell_1}$. The reason is that if a different $\hat\beta$ is the solution to the primal form, then $\hat\beta$ is a better solution to the dual form than $\hat\beta_\lambda$, which results in a contradiction. The primal form also reveals the sparsity-inducing property of $\ell_1$ regularization, in that the $\ell_1$ ball has low-dimensional corners, edges, and faces, but is still barely convex.
Figure 2: Lasso in primal form.
The above is the well-known figure of Lasso. Take the left plot for example. The blue region is $\|\beta\|_{\ell_1} \le t$. The red curves are contours, where each red ellipse consists of those $\beta$ that have the same value of $\|Y - X\beta\|_{\ell_2}^2$. An outer ellipse has a bigger value of $\|Y - X\beta\|_{\ell_2}^2$ than an inner one. The solution to the problem of minimizing $\|Y - X\beta\|_{\ell_2}^2$ subject to $\|\beta\|_{\ell_1} \le t$ is where a red ellipse first touches the blue region. Any other point in the blue region lies outside this touching ellipse and thus has a bigger value of $\|Y - X\beta\|_{\ell_2}^2$. The reason that $\ell_1$ regularization induces sparsity is that the red ellipse is likely to touch the blue region at a corner, which is a sparse solution. If we use $\ell_2$ regularization, as in the plot on the right, then the solution is not sparse in general.
Coordinate descent for Lasso solution path
For multi-dimensional $X = (X_j, j = 1, ..., p)$, we can use the coordinate descent algorithm to compute $\hat\beta_\lambda$. The algorithm updates one component at a time, i.e., given the current values of $\beta = (\beta_j, j = 1, ..., p)$, let $R_j = Y - \sum_{k \ne j} X_k \beta_k$; we then update $\beta_j = \mathrm{sign}(\hat\beta_j) \max(0, |\hat\beta_j| - \lambda/\|X_j\|_{\ell_2}^2)$, where $\hat\beta_j = \langle R_j, X_j \rangle/\|X_j\|_{\ell_2}^2$. We can find the solution path of Lasso by starting from a big $\lambda$ so that all of the estimated $\beta_j$ are zero. Then we gradually reduce $\lambda$. For each $\lambda$, we cycle through $j = 1, ..., p$ for coordinate descent until convergence, and then we lower $\lambda$. This gives us $\hat\beta(\lambda)$ for the whole range of $\lambda$. The whole process is a forward selection process, which sequentially selects new variables and occasionally removes selected variables.
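The coordinate descent update and the warm-started solution path can be sketched as follows. The sketch assumes the penalized objective ½‖Y − Xβ‖² + λ‖β‖₁, which is the scaling under which the soft-thresholding amount is λ/‖X_j‖²; the function names, the synthetic data, and the sequence of λ values are hypothetical.

```python
import numpy as np

def soft_threshold(b, thresh):
    return np.sign(b) * np.maximum(0.0, np.abs(b) - thresh)

def lasso_cd(X, Y, lam, beta=None, n_sweeps=200, tol=1e-10):
    """Coordinate descent for 0.5 * |Y - X beta|^2 + lam * |beta|_1."""
    p = X.shape[1]
    beta = np.zeros(p) if beta is None else beta.copy()
    col_sq = np.sum(X ** 2, axis=0)               # |X_j|^2 for each column
    for _ in range(n_sweeps):
        max_change = 0.0
        for j in range(p):
            # Partial residual R_j = Y - sum_{k != j} X_k beta_k.
            R_j = Y - X @ beta + X[:, j] * beta[j]
            b_ls = (R_j @ X[:, j]) / col_sq[j]
            new = soft_threshold(b_ls, lam / col_sq[j])
            max_change = max(max_change, abs(new - beta[j]))
            beta[j] = new
        if max_change < tol:
            break
    return beta

def lasso_path(X, Y, lams):
    """Solutions for a decreasing sequence of lambda, each warm-started from the last."""
    beta = np.zeros(X.shape[1])
    path = []
    for lam in lams:
        beta = lasso_cd(X, Y, lam, beta)
        path.append(beta.copy())
    return np.array(path)

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5))
Y = X @ np.array([3.0, 0.0, 0.0, -2.0, 0.0]) + 0.1 * rng.standard_normal(100)

lam_max = np.max(np.abs(X.T @ Y))   # at this lambda all coefficients are zero
lams = lam_max * np.array([1.0, 0.5, 0.1, 0.01])
path = lasso_path(X, Y, lams)
print(path)
```

Each row of `path` is the estimate for one value of λ, so reading the rows from top to bottom shows variables entering the model as λ decreases.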
Least angle regression
In the above algorithm, at any given $\lambda$, let $R = Y - \sum_{j=1}^{p} X_j \beta_j$, then $\hat\beta_j = \beta_j + \langle R, X_j \rangle/\|X_j\|_{\ell_2}^2$. If $\beta$ is the Lasso solution, then
$$\langle R, X_j \rangle = \begin{cases} \lambda, & \text{if } \beta_j > 0, \\ -\lambda, & \text{if } \beta_j < 0, \\ s\lambda, & \text{if } \beta_j = 0, \end{cases}$$
where $|s| \le 1$. Thus in the above process, the algorithm maintains $\langle R, X_j \rangle$ at $\lambda$ or $-\lambda$ for all selected $X_j$. If we interpret $|\langle R, X_j \rangle|$ in terms of the angle between $R$ and $X_j$, then we may call the above process the equal angle regression or the least angle regression (LARS). In fact, the solution path is piecewise linear, and LARS computes the linear pieces analytically instead of gradually reducing $\lambda$ as in coordinate descent.
Stagewise regression or epsilon-boosting
The stagewise regression iterates the following steps. Given the current $R = Y - \sum_{j=1}^{p} X_j \beta_j$, find $j$ with the maximal $|\langle R, X_j \rangle|$. Then update $\beta_j \leftarrow \beta_j + \varepsilon \langle R, X_j \rangle$ for a small $\varepsilon$. This is similar to matching pursuit but is much less greedy. Such an update will change $R$ and reduce $|\langle R, X_j \rangle|$, until another $X_j$ catches up. So overall, the algorithm ensures that all of the selected $X_j$ have the same $|\langle R, X_j \rangle|$, which is the case with the algorithms in the above two sections. The stagewise regression is also called $\varepsilon$-boosting. We can also view the stagewise regression from the perspective of the primal form of the Lasso problem: minimize $\|Y - X\beta\|_{\ell_2}^2$ subject to $\|\beta\|_{\ell_1} \le t$. If we relax the constraint by increasing $t$ to $t + \Delta t$, then we want to update the $\beta_j$ with the maximal $|\langle R, X_j \rangle|$ in order to maximally reduce $\|Y - X\beta\|_{\ell_2}^2$.
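The stagewise iteration can be sketched in a few lines. In this illustration the columns of X are scaled to unit length, which is an assumption made here so that a step size such as ε = 0.01 is small relative to ‖X_j‖²; the function name `stagewise`, the data, and the step counts are hypothetical.

```python
import numpy as np

def stagewise(X, Y, eps=0.01, n_steps=20000):
    """Stagewise regression: repeatedly nudge the coefficient most correlated with R."""
    beta = np.zeros(X.shape[1])
    R = Y.copy()
    for _ in range(n_steps):
        corr = X.T @ R                    # <R, X_j> for every j
        j = np.argmax(np.abs(corr))       # variable with maximal |<R, X_j>|
        delta = eps * corr[j]
        beta[j] += delta                  # beta_j <- beta_j + eps <R, X_j>
        R -= delta * X[:, j]              # keep the residual in sync
    return beta

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5))
X /= np.linalg.norm(X, axis=0)            # unit-length columns (assumption of this sketch)
Y = X @ np.array([3.0, 0.0, 0.0, -2.0, 0.0]) + 0.01 * rng.standard_normal(100)

beta_early = stagewise(X, Y, n_steps=50)  # early stopping gives a sparse estimate
beta_late = stagewise(X, Y)               # many steps approach the least squares fit
beta_ls = np.linalg.lstsq(X, Y, rcond=None)[0]
print(beta_early)
print(beta_late)
print(beta_ls)
```

Stopping early plays the role of regularization here: after a few steps only the strongest variables have entered, while running to convergence removes the regularization.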
Consider a dataset with $n$ training examples, where $x_i^\top = (x_{i1}, \cdots, x_{ip})$ consists of $p$ predictors and $y_i \in \{0, 1\}$ is the outcome or class label. We assume $[y_i \mid x_i, \beta] \sim \mathrm{Bernoulli}(p_i)$, i.e., $\Pr(y_i = 1 \mid x_i, \beta) = p_i$, and we assume
$$\mathrm{logit}(p_i) = \log \frac{p_i}{1 - p_i} = s_i = x_i^\top \beta.$$
Then
$$p_i = \mathrm{sigmoid}(s_i) = \frac{e^{s_i}}{1 + e^{s_i}} = \frac{1}{1 + e^{-s_i}},$$
where the sigmoid function is the inverse of the logit function.
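The inverse relationship between the logit and sigmoid functions can be checked directly. The sketch below is only an illustration; the branch on the sign of s is a common numerical safeguard against overflow in the exponential and is an implementation choice, not part of the model.

```python
import numpy as np

def logit(p):
    return np.log(p / (1.0 - p))

def sigmoid(s):
    # Evaluate exp only at non-positive arguments to avoid overflow for large |s|.
    s = np.asarray(s, dtype=float)
    e = np.exp(-np.abs(s))
    return np.where(s >= 0, 1.0 / (1.0 + e), e / (1.0 + e))

p = np.array([0.1, 0.5, 0.9])
s = np.array([-3.0, 0.0, 2.0])
print(sigmoid(logit(p)))   # recovers p
print(logit(sigmoid(s)))   # recovers s
```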
For logistic regression, we want to learn $\beta$ either for the purpose of explanation or understanding, or for the purpose of classification or prediction. In the context of classification, we usually let $y_i \in \{+1, -1\}$ instead of $y_i \in \{1, 0\}$. Those $x_i$ with $y_i = +1$ are called positive examples, and those $x_i$ with $y_i = -1$ are called negative examples. We may call $\beta$ a classifier. $s_i = x_i^\top \beta = \langle x_i, \beta \rangle$ is the projection of $x_i$ on the vector $\beta$, so the vector $\beta$ is the direction that reveals the difference between positive $x_i$ and negative $x_i$. Thus $\beta$ should be aligned with positive $x_i$ and negatively aligned with negative $x_i$, i.e., $\beta$ should point from the negative examples to the positive examples. According to the previous subsection,
$$\Pr(y_i = +1 \mid x_i, \beta) = \frac{1}{1 + \exp(-s_i)}.$$
A deterministic version of the logistic regression is the perceptron model
yi = sign(si),
where sign(s) = +1 if s ≥ 0, and sign(s) = −1 if s < 0. The perceptron model is inspired by neuroscience. See Figure 3. It can be considered an over-simplified model of a neuron, which takes input xi, and emits output yi. See Figure 4.
Loss function for robust linear regression
We may also use the absolute value loss,
$$L(y_i, s_i) = |y_i - s_i|,$$
which penalizes large differences between $y_i$ and $s_i = x_i^\top \beta$ to a lesser degree than the least squares loss, so the estimated $\beta$ is less affected by outliers.
Loss function for logistic regression with 0/1 responses
For logistic regression, we usually maximize the likelihood function, which is
$$\mathrm{Likelihood}(\beta) = \prod_{i=1}^{n} \Pr(y_i \mid x_i, \beta).$$
That is, we want to find $\beta$ to maximize the probability of the observed $(y_i, i = 1, ..., n)$ given $(x_i, i = 1, ..., n)$. The maximum likelihood estimate gives the most plausible explanation of the observed data. For $y_i \in \{0, 1\}$,
$$\Pr(y_i = 1 \mid x_i, \beta) = \mathrm{sigmoid}(s_i) = \frac{\exp(s_i)}{1 + \exp(s_i)},$$
$$\Pr(y_i = 0 \mid x_i, \beta) = 1 - \Pr(y_i = 1 \mid x_i, \beta) = \frac{1}{1 + \exp(s_i)}.$$
We can combine the above two equations by
$$\Pr(y_i \mid x_i, \beta) = \frac{\exp(y_i s_i)}{1 + \exp(s_i)}.$$
The log-likelihood is
$$\mathrm{LogLikelihood}(\beta) = \sum_{i=1}^{n} \log \Pr(y_i \mid x_i, \beta) = \sum_{i=1}^{n} \left[ y_i s_i - \log(1 + \exp(s_i)) \right].$$
We can define the loss function as the negative log-likelihood
$$L(y_i, s_i) = -\left[ y_i s_i - \log(1 + \exp(s_i)) \right].$$
Loss function for logistic regression with ± responses
If yi ∈ {+ 1 , − 1 }, we have
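$$\Pr(y_i = +1 \mid x_i, \beta) = \frac{1}{1 + \exp(-s_i)},$$
and
$$\Pr(y_i = -1 \mid x_i, \beta) = \frac{1}{1 + \exp(s_i)}.$$
Combining them, we have
$$\Pr(y_i \mid x_i, \beta) = \frac{1}{1 + \exp(-y_i s_i)}.$$
The log-likelihood is
$$\sum_{i=1}^{n} \log \Pr(y_i \mid x_i, \beta) = -\sum_{i=1}^{n} \log(1 + \exp(-y_i s_i)).$$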
We define the loss function as the negative log-likelihood. Thus
L(yi, si) = log [ 1 + exp(−yisi)].
This loss is called the logistic loss. The least squares loss for linear regression can also be derived from the log-likelihood if we assume the errors follow a normal distribution.
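The two forms of the logistic loss agree once the labels are recoded as 2y − 1. The sketch below checks this numerically; the function names and the example scores are hypothetical, and `np.logaddexp(0, s)` is used as a numerically stable way to compute log(1 + exp(s)).

```python
import numpy as np

def nll_01(y, s):
    """Negative log-likelihood for y in {0, 1}: log(1 + exp(s)) - y * s."""
    return np.logaddexp(0.0, s) - y * s

def logistic_loss_pm(y_pm, s):
    """Logistic loss for y in {+1, -1}: log(1 + exp(-y * s))."""
    return np.logaddexp(0.0, -y_pm * s)

s = np.array([-2.0, -0.3, 0.0, 1.5, 4.0])   # example scores s_i = x_i^T beta
y = np.array([0, 1, 1, 0, 1])               # labels in {0, 1}
y_pm = 2 * y - 1                            # the same labels in {-1, +1}

loss_01 = nll_01(y, s)
loss_pm = logistic_loss_pm(y_pm, s)
prob = np.exp(y * s) / (1.0 + np.exp(s))    # Pr(y_i | x_i, beta) in the 0/1 form
print(loss_01)
print(loss_pm)
```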
Loss functions for classification
Figure 5: Loss functions for classification. The horizontal axis is $m_i = y_i x_i^\top \beta$. The vertical axis is $L(y_i, x_i^\top \beta)$. The exponential loss and the hinge loss can be considered approximations to the logistic loss. These loss functions penalize negative $m_i$. The more negative $m_i$ is, the bigger the loss. The loss functions also penalize small positive $m_i$, e.g., those $m_i < 1$. Such loss functions encourage correct and confident classifications.
The following summarizes several possible choices for the loss function L(yi, si) for classification. See Figure 5.
Logistic loss = $\log(1 + \exp(-y_i s_i))$,
Exponential loss = $\exp(-y_i s_i)$,
Hinge loss = $\max(0, 1 - y_i s_i)$,
Zero-one loss = $1(y_i s_i < 0)$.
Both the exponential and hinge losses can be considered approximations to the logistic loss. The logistic loss is used by logistic regression. The exponential loss is used by AdaBoost. The hinge loss is used by support vector machines. The zero-one loss counts the number of mistakes. It is not differentiable and is not used for training. All the above loss functions are based on $m_i = y_i s_i$. We call $m_i$ the margin for example $(y_i, x_i)$. We want $y_i$ and $s_i = x_i^\top \beta$ to be of the same sign for correct classification. If $y_i = +1$, we want $s_i = x_i^\top \beta$ to be very positive. If $y_i = -1$, we want $s_i = x_i^\top \beta$ to be very negative. We want the margin $m_i$ to be as large as possible for confident classification.
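The four losses can be written as functions of the margin and compared on a few values. This is a small sketch with hypothetical function names and margins; the zero-one loss is taken here as the indicator of a negative margin, consistent with its description as counting mistakes.

```python
import numpy as np

def logistic_loss(m):
    return np.logaddexp(0.0, -m)          # log(1 + exp(-m))

def exponential_loss(m):
    return np.exp(-m)

def hinge_loss(m):
    return np.maximum(0.0, 1.0 - m)

def zero_one_loss(m):
    return (m < 0).astype(float)          # 1 for a misclassified example

margins = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])   # m_i = y_i * s_i
for name, loss in [("logistic", logistic_loss), ("exponential", exponential_loss),
                   ("hinge", hinge_loss), ("zero-one", zero_one_loss)]:
    print(name, loss(margins))
```

All three surrogate losses decrease as the margin grows, and the hinge loss reaches exactly zero once the margin is at least 1.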