











These notes cover the optimization of classification line parameters using the log-loss and exponential loss functions: translating confidence in a classification into probabilities, finding the line parameters that maximize the likelihood, and minimizing the log-loss function using gradient descent. They also introduce boosting, which minimizes the exponential loss function while selecting features.
Typology: Study notes












Fall 2009
- Now, for any set of line constants, we can find the probability assigned to the correct label of each item in the training set:

$$\prod_{i=1}^{N} P[l_i \mid x_i] = \prod_{i=1}^{N} \frac{1}{1 + \exp(-l_i(a x_i + b y_i + c))}$$
- We've inserted $l_i$ into the exponent because, if $l_i = -1$,

$$P[l = -1 \mid x] = \frac{1}{1 + \exp(ax + by + c)}$$

- We multiply the probabilities because we believe the points are drawn independently.
- Note that for each item, this number tells us how strongly the model defined by that particular line believes that the ground-truth label is the correct answer.
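A minimal sketch of the per-item probability, assuming plain Python and a hypothetical helper name `label_probability`. The line constants and the point are made-up values. It illustrates that putting the label in the exponent gives two probabilities that sum to 1.

```python
import math

def label_probability(label, x, y, a, b, c):
    """P[l | (x, y)] = 1 / (1 + exp(-l * (a*x + b*y + c))) for l in {-1, +1}."""
    return 1.0 / (1.0 + math.exp(-label * (a * x + b * y + c)))

# Hypothetical line constants and point, chosen only for illustration.
a, b, c = 1.0, -2.0, 0.5
p_pos = label_probability(+1, 3.0, 1.0, a, b, c)
p_neg = label_probability(-1, 3.0, 1.0, a, b, c)

# The two label probabilities for the same point are complementary.
total = p_pos + p_neg
```

Here the point lies on the positive side of the line (a*x + b*y + c = 1.5), so `p_pos` is the larger of the two.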
$$\prod_{i=1}^{N} P[l_i \mid x_i] = \prod_{i=1}^{N} \frac{1}{1 + \exp(-l_i(a x_i + b y_i + c))}$$

- This is also called the likelihood of the data.
- Ideally, this likelihood should be as close to 1 as possible.
- We can find the parameters of the line by looking for the line parameters that maximize this likelihood.
- In other words, find the set of line parameters that make the right answers have as high a probability as possible.
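The likelihood can be sketched as a product over the training set. The function name `likelihood` and the four training points are assumptions made for illustration. A line that separates the points correctly should score a much higher likelihood than the same line with its sign flipped.

```python
import math

def likelihood(points, labels, a, b, c):
    """Product over the training set of P[l_i | x_i] for the line (a, b, c)."""
    total = 1.0
    for (x, y), l in zip(points, labels):
        total *= 1.0 / (1.0 + math.exp(-l * (a * x + b * y + c)))
    return total

# Hypothetical training set: positives have x > 0, negatives have x < 0.
points = [(2.0, 0.0), (1.0, 0.5), (-1.0, 0.0), (-2.0, 1.0)]
labels = [+1, +1, -1, -1]

good = likelihood(points, labels, 3.0, 0.0, 0.0)   # line agrees with the labels
bad = likelihood(points, labels, -3.0, 0.0, 0.0)   # same line, flipped sign
```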
$$\log \prod_{i=1}^{N} P[l_i \mid x_i] = \sum_{i=1}^{N} -\log\left(1 + \exp(-l_i(a x_i + b y_i + c))\right)$$
- Similarly, we can multiply this equation by $-1$ to change from a maximization problem to a minimization problem.
- This leads to a function that we will call a “loss function” or “cost function”, denoted $L$:

$$L = \sum_{i=1}^{N} \log\left(1 + \exp(-l_i(a x_i + b y_i + c))\right)$$
- Our goal is to find the line parameters that minimize $L$.
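As a quick check that the loss $L$ is the negative log of the likelihood, here is a sketch with the assumed function name `log_loss` and made-up data. It computes both quantities for the same line so they can be compared.

```python
import math

def log_loss(points, labels, a, b, c):
    """L = sum_i log(1 + exp(-l_i * (a*x_i + b*y_i + c)))."""
    return sum(math.log(1.0 + math.exp(-l * (a * x + b * y + c)))
               for (x, y), l in zip(points, labels))

# Hypothetical training set and line constants.
points = [(2.0, 0.0), (1.0, 0.5), (-1.0, 0.0), (-2.0, 1.0)]
labels = [+1, +1, -1, -1]
a, b, c = 1.0, -0.5, 0.2

loss = log_loss(points, labels, a, b, c)

# Negative log of the product of per-item probabilities.
product = 1.0
for (x, y), l in zip(points, labels):
    product *= 1.0 / (1.0 + math.exp(-l * (a * x + b * y + c)))
neg_log_likelihood = -math.log(product)
```

Working with the sum of logs rather than the raw product also avoids the product underflowing toward zero on large training sets.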
- We can minimize $L$ by using its gradient:

$$\nabla L = \left(\frac{\partial L}{\partial a}, \frac{\partial L}{\partial b}, \frac{\partial L}{\partial c}\right)$$

- The gradient can be viewed as a vector that points in the direction of steepest ascent.
- So, the negative gradient points in the direction of steepest descent.
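A minimal gradient-descent sketch for $L$. It assumes a fixed step size, a fixed iteration count, and a start at $a = b = c = 0$; none of these choices come from the notes. The partial derivatives follow from differentiating $\log(1 + \exp(-l z))$ with respect to $z$, which gives $-l / (1 + \exp(l z))$.

```python
import math

def loss_and_gradient(points, labels, a, b, c):
    """Return L and (dL/da, dL/db, dL/dc) for the line (a, b, c)."""
    loss, ga, gb, gc = 0.0, 0.0, 0.0, 0.0
    for (x, y), l in zip(points, labels):
        margin = l * (a * x + b * y + c)
        loss += math.log(1.0 + math.exp(-margin))
        # d/dz log(1 + exp(-l*z)) = -l / (1 + exp(l*z)), with z = a*x + b*y + c
        coeff = -l / (1.0 + math.exp(margin))
        ga += coeff * x
        gb += coeff * y
        gc += coeff
    return loss, (ga, gb, gc)

def gradient_descent(points, labels, step=0.1, iterations=200):
    """Repeatedly move the parameters along the negative gradient."""
    a = b = c = 0.0
    history = []
    for _ in range(iterations):
        loss, (ga, gb, gc) = loss_and_gradient(points, labels, a, b, c)
        history.append(loss)
        a, b, c = a - step * ga, b - step * gb, c - step * gc
    return (a, b, c), history

# Hypothetical training set.
points = [(2.0, 0.0), (1.0, 0.5), (-1.0, 0.0), (-2.0, 1.0)]
labels = [+1, +1, -1, -1]
params, history = gradient_descent(points, labels)
initial_loss, final_loss = history[0], history[-1]
```

The step size is small enough for this toy data that the recorded loss goes down at every iteration. On other data it would need tuning.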
- Let's look at the loss function that we are minimizing:

$$L(x) = \log(1 + e^{-x})$$

[Plot: $L(x) = \log(1 + e^{-x})$, horizontal axis from -4 to 5, vertical axis from 0 to 9]
- If the label is +1, then we are encouraging our linear classifier to return a positive value.
- The loss grows approximately linearly as the classifier's output becomes more and more negative.
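The approximately linear growth can be seen with a short rearrangement. This is a worked step added for illustration, using only the definition $L(x) = \log(1 + e^{-x})$:

```latex
\begin{align*}
L(x) &= \log\left(1 + e^{-x}\right)
      = \log\left(e^{-x}\,(e^{x} + 1)\right)
      = -x + \log\left(1 + e^{x}\right) \\
x \to -\infty &: \quad \log\left(1 + e^{x}\right) \to 0
      \quad\Rightarrow\quad L(x) \approx -x \\
x \to +\infty &: \quad L(x) \approx e^{-x} \to 0 \\
x = 0 &: \quad L(0) = \log 2 \approx 0.693
\end{align*}
```

So a confidently wrong output is penalized roughly in proportion to how wrong it is, while a confidently right output costs almost nothing.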
- This can be thought of as a modification of the zero-one loss.
- The 0-1 loss says, “I only care if I make a mistake!”
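To make the comparison concrete, here is a sketch that evaluates both losses on a few signed classifier outputs. The function names and the sample values are assumptions. Counting an output of exactly 0 as a mistake is also an assumption, since the notes do not say how ties are handled.

```python
import math

def zero_one_loss(margin):
    """1 if the signed output l * (a*x + b*y + c) is on the wrong side, else 0.
    Treating margin == 0 as a mistake is an assumption."""
    return 1.0 if margin <= 0 else 0.0

def log_loss(margin):
    """log(1 + exp(-margin)): smooth, and still positive for correct answers."""
    return math.log(1.0 + math.exp(-margin))

# Hypothetical signed outputs: two mistakes, then two correct answers.
margins = [-3.0, -1.0, 0.5, 3.0]
table = [(m, zero_one_loss(m), log_loss(m)) for m in margins]
zero_one_values = [row[1] for row in table]
log_values = [row[2] for row in table]
```

The 0-1 loss gives both mistakes the same penalty, whereas the log-loss penalizes the worse mistake more and keeps rewarding larger correct outputs.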
- In the logistic regression approach discussed in the previous lecture, we gathered all of our features, then optimized the weights.
- What if we had millions of features? We couldn't load those into memory and optimize using gradient descent.
- What if we added features greedily? We could then consider tons of features.
- This is often called boosting.
- To begin with, we will minimize the exponential loss.
- We will also define a new term, $F(\vec{x}_i)$:

$$F(\vec{x}_i) = \sum_{j=1}^{N_f} a_j \phi_j(\vec{x}_i)$$
- $F(\vec{x}_i)$ is the classifier. The predicted label of any sample is $\mathrm{sign}(F(\vec{x}_i))$.
- Each $\phi_j(\cdot)$ is a feature. Each $a_j$ is a weight.
- In the boosting algorithm, we will build up $F(\cdot)$ one feature at a time.
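A small sketch of $F$ as a weighted sum of features. The two thresholded features and their weights are hypothetical, and mapping a score of exactly 0 to the label +1 is an assumption.

```python
def classifier_score(weights, features, x):
    """F(x) = sum_j a_j * phi_j(x)."""
    return sum(a_j * phi_j(x) for a_j, phi_j in zip(weights, features))

def predict(weights, features, x):
    """sign(F(x)); a score of exactly 0 is mapped to +1 by assumption."""
    return 1 if classifier_score(weights, features, x) >= 0 else -1

# Hypothetical thresholded features on 2-D points, each returning +1 or -1.
features = [
    lambda x: 1.0 if x[0] > 0.0 else -1.0,
    lambda x: 1.0 if x[1] > 1.0 else -1.0,
]
weights = [0.8, 0.3]

score_a = classifier_score(weights, features, (2.0, 0.0))    # 0.8 - 0.3
score_b = classifier_score(weights, features, (-1.0, 2.0))   # -0.8 + 0.3
label_a = predict(weights, features, (2.0, 0.0))
label_b = predict(weights, features, (-1.0, 2.0))
```

Adding a feature to the classifier amounts to appending one entry to `features` and one to `weights`.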
$$L(a_{j+1}) = \sum_{i=1}^{N} \exp\left(-l_i\left[F(\vec{x}_i) + a_{j+1}\phi_{j+1}(\vec{x}_i)\right]\right)$$
- We need to find $a_{j+1}$.
- We could use gradient descent, or we could do a Newton step.
- In a Newton step, we use a Taylor series to approximate $L(a_{j+1})$ with a quadratic function, then minimize that quadratic function to find $a_{j+1}$:

$$f(x) \approx f(b) + f'(b)(x - b) + \frac{f''(b)}{2}(x - b)^2$$

- This is similar to Newton's algorithm for minimizing functions.
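Setting the derivative of the quadratic approximation to zero gives $f'(b) + f''(b)(x - b) = 0$, so $x = b - f'(b)/f''(b)$ whenever $f''(b) > 0$. A minimal sketch of that single step, with a hypothetical helper name and a made-up test function:

```python
def newton_step(first_derivative, second_derivative, b):
    """Minimizer of the quadratic Taylor model of f around b.
    Assumes second_derivative(b) > 0, so the model has a minimum."""
    return b - first_derivative(b) / second_derivative(b)

# Hypothetical test function f(x) = (x - 3)^2 + 1, expanded around b = 0.
# f is itself quadratic, so one step lands exactly on its minimizer.
x_new = newton_step(lambda x: 2.0 * (x - 3.0), lambda x: 2.0, 0.0)
```

For a function that is not quadratic, the step only approximates the minimizer and would normally be repeated.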
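$$L(a_{j+1}) = \sum_{i=1}^{N} \exp\left(-l_i\left[F(\vec{x}_i) + a_{j+1}\phi_{j+1}(\vec{x}_i)\right]\right)$$

$$L'(a_{j+1}) = \sum_{i=1}^{N} -l_i \phi_{j+1}(\vec{x}_i) \exp\left(-l_i\left[F(\vec{x}_i) + a_{j+1}\phi_{j+1}(\vec{x}_i)\right]\right)$$

$$L''(a_{j+1}) = \sum_{i=1}^{N} \phi_{j+1}^2(\vec{x}_i) \exp\left(-l_i\left[F(\vec{x}_i) + a_{j+1}\phi_{j+1}(\vec{x}_i)\right]\right)$$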
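These derivatives can be checked numerically. The sketch below compares them with central finite differences on made-up labels, current scores $F(\vec{x}_i)$, and feature values. All names and numbers are assumptions. The second derivative has no $l_i^2$ factor because $l_i^2 = 1$ for labels in $\{-1, +1\}$.

```python
import math

def exp_loss(a_new, labels, scores, phi):
    """L(a) = sum_i exp(-l_i * (F_i + a * phi_i))."""
    return sum(math.exp(-l * (f + a_new * p))
               for l, f, p in zip(labels, scores, phi))

def exp_loss_d1(a_new, labels, scores, phi):
    """L'(a) = sum_i -l_i * phi_i * exp(-l_i * (F_i + a * phi_i))."""
    return sum(-l * p * math.exp(-l * (f + a_new * p))
               for l, f, p in zip(labels, scores, phi))

def exp_loss_d2(a_new, labels, scores, phi):
    """L''(a) = sum_i phi_i^2 * exp(-l_i * (F_i + a * phi_i)), using l_i^2 = 1."""
    return sum(p * p * math.exp(-l * (f + a_new * p))
               for l, f, p in zip(labels, scores, phi))

# Hypothetical labels, current scores F(x_i), and new-feature values phi(x_i).
labels = [1, -1, 1, -1]
scores = [0.5, -0.2, -0.3, 0.1]
phi = [1.0, -1.0, 0.5, -2.0]
a0 = 0.3

h1, h2 = 1e-5, 1e-4
numeric_d1 = (exp_loss(a0 + h1, labels, scores, phi)
              - exp_loss(a0 - h1, labels, scores, phi)) / (2 * h1)
numeric_d2 = (exp_loss(a0 + h2, labels, scores, phi)
              - 2 * exp_loss(a0, labels, scores, phi)
              + exp_loss(a0 - h2, labels, scores, phi)) / (h2 * h2)
analytic_d1 = exp_loss_d1(a0, labels, scores, phi)
analytic_d2 = exp_loss_d2(a0, labels, scores, phi)
```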
- If we do the Taylor expansion around $a_{j+1} = 0$, then

$$L(a_{j+1}) \approx \sum_{i=1}^{N} \exp\left(-l_i F(\vec{x}_i)\right) + a_{j+1} L'(0) + \frac{a_{j+1}^2}{2} L''(0)$$
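- Differentiating this and setting the result to zero, we can solve for $a_{j+1}$:

$$a_{j+1} = -\frac{L'(0)}{L''(0)} = \frac{\sum_{i=1}^{N} l_i \phi_{j+1}(\vec{x}_i) \exp\left(-l_i F(\vec{x}_i)\right)}{\sum_{i=1}^{N} \phi_{j+1}^2(\vec{x}_i) \exp\left(-l_i F(\vec{x}_i)\right)}$$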
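The Newton-step weight can be sketched as a ratio of two weighted sums, where each sample is weighted by $\exp(-l_i F(\vec{x}_i))$. The function name `newton_weight` and the toy data are assumptions. The example uses $F = 0$ everywhere, as it would be before any feature has been added.

```python
import math

def newton_weight(labels, scores, phi):
    """a = sum_i w_i * l_i * phi_i / sum_i w_i * phi_i^2, with w_i = exp(-l_i * F_i)."""
    weights = [math.exp(-l * f) for l, f in zip(labels, scores)]
    numerator = sum(w * l * p for w, l, p in zip(weights, labels, phi))
    denominator = sum(w * p * p for w, p in zip(weights, phi))
    return numerator / denominator

def exp_loss(a_new, labels, scores, phi):
    """Exponential loss after adding the new feature with weight a_new."""
    return sum(math.exp(-l * (f + a_new * p))
               for l, f, p in zip(labels, scores, phi))

# Hypothetical data: the new feature agrees with three of the four labels.
labels = [1, 1, -1, -1]
scores = [0.0, 0.0, 0.0, 0.0]
phi = [1.0, 1.0, -1.0, 1.0]

a_new = newton_weight(labels, scores, phi)          # (1 + 1 + 1 - 1) / 4
loss_before = exp_loss(0.0, labels, scores, phi)
loss_after = exp_loss(a_new, labels, scores, phi)
```

A feature that agrees with the labels on most of the weighted samples gets a positive weight. One that mostly disagrees gets a negative weight.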
- This is called the GentleBoost algorithm.
- We still have to choose $\phi_{j+1}(\cdot)$. It is usually chosen greedily by searching over a set of candidate features that have been thresholded.
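One way the greedy search could look is sketched below. It rests on several assumptions: candidate features are axis-aligned threshold tests returning +1 or -1, the winning candidate in each round is the one with the lowest exponential loss after its Newton step, and the data are made up. The notes do not specify the selection criterion, so treat this as an illustration rather than the exact procedure.

```python
import math

def stump(dim, threshold):
    """Hypothetical thresholded feature: +1 if x[dim] > threshold, else -1."""
    return lambda x: 1.0 if x[dim] > threshold else -1.0

def boost(points, labels, candidates, rounds):
    """Build F one feature at a time, choosing each feature greedily."""
    scores = [0.0] * len(points)   # F(x_i) for every training point
    model = []                     # chosen (weight, feature) pairs
    for _ in range(rounds):
        weights = [math.exp(-l * f) for l, f in zip(labels, scores)]
        best = None
        for phi in candidates:
            values = [phi(x) for x in points]
            # Newton-step weight for this candidate feature.
            num = sum(w * l * p for w, l, p in zip(weights, labels, values))
            den = sum(w * p * p for w, p in zip(weights, values))
            a = num / den
            # Exponential loss if this candidate were added with weight a.
            loss = sum(math.exp(-l * (f + a * p))
                       for l, f, p in zip(labels, scores, values))
            if best is None or loss < best[0]:
                best = (loss, a, phi, values)
        _, a, phi, values = best
        model.append((a, phi))
        scores = [f + a * p for f, p in zip(scores, values)]
    return model, scores

# Hypothetical training set and candidate thresholds.
points = [(2.0, 0.0), (1.0, 0.5), (-1.0, 0.0), (-2.0, 1.0)]
labels = [1, 1, -1, -1]
candidates = [stump(0, t) for t in (-1.5, 0.0, 1.5)] + \
             [stump(1, t) for t in (0.25, 0.75)]

model, scores = boost(points, labels, candidates, rounds=3)
first_weight = model[0][0]
```

Only the per-sample scores and the small list of chosen features are kept between rounds, which is what makes it feasible to consider a very large pool of candidates.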