Lecture 7: Optimizing Classification Line Parameters using Log-Loss and Exponential Loss (Study notes, Computer Science)

These notes cover the optimization of classification line parameters using the log-loss and exponential loss functions: translating confidence in a classification into probabilities, finding the line parameters that maximize the likelihood, and minimizing the log-loss function using gradient descent. They also introduce boosting and minimizing the exponential loss function for feature selection.

Typology: Study notes

Pre 2010

Uploaded on 11/08/2009

koofers-user-1ci

Lecture 7 - More on Classification
Fall 2009

First, Some Review

Finding the line parameters

- Now, for any set of line constants, we can find out the probability assigned to the correct label of each item in the training set:

$$\prod_{i=1}^{N} P[l_i \mid x_i] = \prod_{i=1}^{N} \frac{1}{1 + \exp(-l_i(a x_i + b y_i + c))}$$

- We've inserted $l_i$ into the exponent because, if $l_i = -1$,

$$P[l = -1 \mid x] = \frac{1}{1 + \exp(a x + b y + c)}$$

- We multiply the probabilities because we believe the points are drawn independently.
- Note that for each item, this number tells us how much the model defined by that particular line believes that the ground-truth right answer is the actual right answer.
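To see why inserting $l_i$ into the exponent works, it can help to check the $l = -1$ case directly. The short derivation below uses the shorthand $s = ax + by + c$ (introduced here for illustration, not taken from the slides) and assumes the probabilities of the two labels sum to one:

$$\begin{aligned}
P[l = -1 \mid x] &= 1 - P[l = +1 \mid x] = 1 - \frac{1}{1 + e^{-s}} \\
&= \frac{e^{-s}}{1 + e^{-s}} = \frac{1}{1 + e^{s}}
\end{aligned}$$

This is the general formula $1 / (1 + \exp(-l\,s))$ evaluated at $l = -1$.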

Finding the line

$$\prod_{i=1}^{N} P[l_i \mid x_i] = \prod_{i=1}^{N} \frac{1}{1 + \exp(-l_i(a x_i + b y_i + c))}$$

- This is also called the likelihood of the data.
- Ideally, this likelihood should be as close to 1 as possible.
- We can find the parameters of the line by looking for the line parameters that maximize this likelihood.
- In other words, find the set of line parameters that make the right answers have as high a probability as possible.
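As a rough illustration, the likelihood can be evaluated for two candidate lines on a small data set. This is only a sketch: the function names, the toy points, and the two parameter settings are hypothetical and chosen for illustration.

```python
import math

def label_probability(label, x, y, a, b, c):
    """P[l | x] = 1 / (1 + exp(-l * (a*x + b*y + c))) for a label l in {-1, +1}."""
    return 1.0 / (1.0 + math.exp(-label * (a * x + b * y + c)))

def likelihood(points, labels, a, b, c):
    """Product of the probabilities assigned to the correct labels."""
    result = 1.0
    for (x, y), label in zip(points, labels):
        result *= label_probability(label, x, y, a, b, c)
    return result

# Hypothetical toy data: positive points have x > 0, negative points have x < 0.
points = [(2.0, 0.0), (1.0, 1.0), (-1.0, 0.5), (-2.0, -1.0)]
labels = [+1, +1, -1, -1]

# A line that separates the classes correctly, and the same line flipped.
good_line = likelihood(points, labels, a=1.0, b=0.0, c=0.0)
bad_line = likelihood(points, labels, a=-1.0, b=0.0, c=0.0)
print(good_line, bad_line)
```

The line that puts every point on the correct side gives a likelihood much closer to 1 than the flipped line, which is the behavior the maximization relies on.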

Finding the line

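$$\log\left(\prod_{i=1}^{N} P[l_i \mid x_i]\right) = \sum_{i=1}^{N} -\log\left(1 + \exp(-l_i(a x_i + b y_i + c))\right)$$

- Taking the logarithm does not change which line parameters maximize the likelihood, because the logarithm is monotonically increasing.
- Similarly, we can multiply this equation by $-1$ to change from a maximization problem to a minimization problem.
- This leads to a function that we will call a “loss function” or “cost function”, denoted $L$:

$$L = \sum_{i=1}^{N} \log\left(1 + \exp(-l_i(a x_i + b y_i + c))\right)$$

- Our goal is to find the line parameters that minimize $L$.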
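The relationship between the likelihood and $L$ can be checked numerically. The sketch below uses hypothetical helper names and toy data, and evaluates both the sum form of $L$ and the negative log of the product form for the same line.

```python
import math

def likelihood(points, labels, a, b, c):
    """Product form: prod_i 1 / (1 + exp(-l_i * (a*x_i + b*y_i + c)))."""
    product = 1.0
    for (x, y), label in zip(points, labels):
        product *= 1.0 / (1.0 + math.exp(-label * (a * x + b * y + c)))
    return product

def log_loss(points, labels, a, b, c):
    """Sum form: L = sum_i log(1 + exp(-l_i * (a*x_i + b*y_i + c)))."""
    total = 0.0
    for (x, y), label in zip(points, labels):
        # log1p(z) computes log(1 + z) accurately when z is small.
        total += math.log1p(math.exp(-label * (a * x + b * y + c)))
    return total

# Hypothetical toy data, for illustration only.
points = [(2.0, 0.0), (1.0, 1.0), (-1.0, 0.5), (-2.0, -1.0)]
labels = [+1, +1, -1, -1]

loss = log_loss(points, labels, 1.0, 0.0, 0.0)
neg_log_likelihood = -math.log(likelihood(points, labels, 1.0, 0.0, 0.0))
print(loss, neg_log_likelihood)  # the two values agree up to rounding
```

In practice the sum form is usually preferred, since a product of many probabilities can underflow to zero for large training sets.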

Minimizing L

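- We can minimize $L$ by using its gradient:

$$\nabla L = \begin{bmatrix} \dfrac{\partial L}{\partial a} \\[2ex] \dfrac{\partial L}{\partial b} \\[2ex] \dfrac{\partial L}{\partial c} \end{bmatrix}$$

- The gradient can be viewed as a vector that points in the direction of steepest ascent.
- So, the negative gradient points in the direction of steepest descent.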
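A minimal gradient-descent sketch for $L$ follows. The step size, the number of steps, the zero initialization, and the toy data are all assumptions made for illustration; the slides do not specify them. The gradient uses the fact that the derivative of $\log(1 + e^{-m})$ with respect to $m$ is $-1/(1 + e^{m})$.

```python
import math

def log_loss(points, labels, params):
    a, b, c = params
    return sum(math.log1p(math.exp(-l * (a * x + b * y + c)))
               for (x, y), l in zip(points, labels))

def gradient(points, labels, params):
    """Returns [dL/da, dL/db, dL/dc] for the log-loss."""
    a, b, c = params
    grad = [0.0, 0.0, 0.0]
    for (x, y), l in zip(points, labels):
        margin = l * (a * x + b * y + c)
        # d/dm log(1 + exp(-m)) = -1 / (1 + exp(m))
        weight = -1.0 / (1.0 + math.exp(margin))
        grad[0] += weight * l * x
        grad[1] += weight * l * y
        grad[2] += weight * l
    return grad

def gradient_descent(points, labels, step_size=0.1, num_steps=200):
    """Repeatedly steps along the negative gradient (fixed step size assumed)."""
    params = [0.0, 0.0, 0.0]
    for _ in range(num_steps):
        grad = gradient(points, labels, params)
        params = [p - step_size * g for p, g in zip(params, grad)]
    return params

# Hypothetical toy data, for illustration only.
points = [(2.0, 0.0), (1.0, 1.0), (-1.0, 0.5), (-2.0, -1.0)]
labels = [+1, +1, -1, -1]

initial_loss = log_loss(points, labels, [0.0, 0.0, 0.0])
fitted = gradient_descent(points, labels)
final_loss = log_loss(points, labels, fitted)
print(initial_loss, final_loss)
```

A fixed step size is the simplest choice; a line search or a decaying step size would be reasonable alternatives.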

More on Loss Functions

- Let’s look at the loss function that we are minimizing:

$$L(x) = \log(1 + e^{-x})$$

[Figure: plot of $L(x)$; horizontal axis from -4 to 5, vertical axis from 0 to 9]

- If the label is +1, then we are encouraging our linear classifier to return a positive value.
- The loss grows approximately linearly as the classifier's output gets more and more negative.

The Log-Loss

$$L(x) = \log(1 + e^{-x})$$

[Figure: plot of $L(x)$; horizontal axis from -4 to 5, vertical axis from 0 to 9]

- This can be thought of as a modification of the zero-one loss.
- The 0-1 loss says, “I only care if I make a mistake!”
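The two losses can be compared side by side at a few values of $x$. In this sketch, $x$ is read as the label times the classifier output (the margin), and a margin of exactly 0 is counted as a mistake; both are assumptions made here, not statements from the slides.

```python
import math

def log_loss(margin):
    """L(x) = log(1 + exp(-x))."""
    return math.log1p(math.exp(-margin))

def zero_one_loss(margin):
    # Assumption: a margin of exactly 0 is counted as a mistake.
    return 1.0 if margin <= 0 else 0.0

for margin in [-4, -2, -1, 0, 1, 2, 4]:
    print(margin, zero_one_loss(margin), round(log_loss(margin), 3))
```

The 0-1 loss is flat on either side of zero, so it gives no gradient to follow. The log-loss keeps growing for increasingly wrong answers and keeps shrinking for increasingly confident correct ones.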

Boosting Approach to Finding Classification Parameters

- In the logistic regression approach discussed in the previous lecture, we gathered all of our features, then optimized the weights.
- What if we had millions of features? We couldn’t load them all into memory and optimize using gradient descent.
- What if we added features greedily? We could then consider tons of features.
- This is often called boosting.

Describing This Mathematically

- To begin with, we will minimize the exponential loss.
- We will also define a new term, $F(\vec{x}_i)$:

$$F(\vec{x}_i) = \sum_{j=1}^{N_f} a_j \phi_j(\vec{x}_i)$$

- $F(\vec{x}_i)$ is the classifier. The predicted label of any sample is $\operatorname{sign}(F(\vec{x}_i))$.
- Each $\phi_j(\cdot)$ is a feature. Each $a_j$ is a weight.
- In the boosting algorithm, we will build up $F(\cdot)$ one feature at a time.
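The definition of $F$ translates into a weighted sum over feature functions. The sketch below is illustrative only: the two thresholded features, their weights, and the convention that a score of exactly 0 maps to +1 are hypothetical choices.

```python
def classifier_score(weights, features, x):
    """F(x) = sum_j a_j * phi_j(x)."""
    return sum(a_j * phi_j(x) for a_j, phi_j in zip(weights, features))

def predict_label(weights, features, x):
    """sign(F(x)); a score of exactly 0 is mapped to +1 here by assumption."""
    return 1 if classifier_score(weights, features, x) >= 0 else -1

# Hypothetical features: each thresholds one coordinate and returns +1 or -1.
features = [
    lambda x: 1.0 if x[0] > 0.0 else -1.0,
    lambda x: 1.0 if x[1] > 1.0 else -1.0,
]
weights = [0.8, 0.3]

print(predict_label(weights, features, (2.0, 0.5)))
print(predict_label(weights, features, (-1.0, 3.0)))
```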

Finding the parameter

$$L(a_{j+1}) = \sum_{i=1}^{N} \exp\left(-l_i\left[F(\vec{x}_i) + a_{j+1}\phi_{j+1}(\vec{x}_i)\right]\right)$$

- We need to find $a_{j+1}$.
- We could use gradient descent, or we could do a Newton step.
- In a Newton step, we will use a Taylor series to approximate $L(a_{j+1})$ with a quadratic function, then solve that quadratic function to find $a_{j+1}$:

$$f(x) \approx f(b) + f'(b)(x - b) + \frac{f''(b)}{2}(x - b)^2$$

- This is similar to Newton’s algorithm for minimizing functions.
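As a worked step, the quadratic approximation can be minimized in closed form. Setting the derivative of its right-hand side with respect to $x$ to zero, and assuming $f''(b) > 0$ so that the quadratic has a minimum, gives:

$$f'(b) + f''(b)(x - b) = 0 \quad\Longrightarrow\quad x = b - \frac{f'(b)}{f''(b)}$$

When the expansion point is $b = 0$, this reduces to $x = -f'(0)/f''(0)$.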

Finding the parameter

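$$L(a_{j+1}) = \sum_{i=1}^{N} \exp\left(-l_i\left[F(\vec{x}_i) + a_{j+1}\phi_{j+1}(\vec{x}_i)\right]\right)$$

$$L'(a_{j+1}) = \sum_{i=1}^{N} -l_i\,\phi_{j+1}(\vec{x}_i)\exp\left(-l_i\left[F(\vec{x}_i) + a_{j+1}\phi_{j+1}(\vec{x}_i)\right]\right)$$

$$L''(a_{j+1}) = \sum_{i=1}^{N} \phi_{j+1}^2(\vec{x}_i)\exp\left(-l_i\left[F(\vec{x}_i) + a_{j+1}\phi_{j+1}(\vec{x}_i)\right]\right)$$

- If we do the Taylor expansion around $a_{j+1} = 0$, then

$$L(a_{j+1}) \approx \sum_{i=1}^{N} \exp\left(-l_i F(\vec{x}_i)\right) + a_{j+1} L'(0) + \frac{a_{j+1}^2}{2} L''(0)$$

- Differentiating this, we can solve for $a_{j+1}$:

$$a_{j+1} = -\frac{L'(0)}{L''(0)} = \frac{\sum_{i=1}^{N} l_i\,\phi_{j+1}(\vec{x}_i)\exp\left(-l_i F(\vec{x}_i)\right)}{\sum_{i=1}^{N} \phi_{j+1}^2(\vec{x}_i)\exp\left(-l_i F(\vec{x}_i)\right)}$$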
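The closed-form weight can be computed directly from the labels, the current scores $F(\vec{x}_i)$, and the values of the new feature. This is a sketch with hypothetical function names and toy values. It performs a single Newton step, so the result approximates the minimizer of the exponential loss and is not exact.

```python
import math

def newton_weight(labels, scores, feature_values):
    """a_{j+1} = sum_i l_i*phi_i*w_i / sum_i phi_i^2*w_i, with w_i = exp(-l_i * F_i)."""
    numerator = 0.0
    denominator = 0.0
    for l, f, phi in zip(labels, scores, feature_values):
        w = math.exp(-l * f)
        numerator += l * phi * w
        denominator += phi * phi * w
    return numerator / denominator

def exp_loss(labels, scores, feature_values, a_new):
    """L(a_{j+1}) = sum_i exp(-l_i * (F_i + a_{j+1} * phi_i))."""
    return sum(math.exp(-l * (f + a_new * phi))
               for l, f, phi in zip(labels, scores, feature_values))

# Hypothetical toy values: F is still empty (all scores 0), and the new
# feature agrees with the label on three of the four samples.
labels = [+1, +1, -1, -1]
scores = [0.0, 0.0, 0.0, 0.0]
feature_values = [1.0, 1.0, -1.0, 1.0]

a_new = newton_weight(labels, scores, feature_values)
loss_before = exp_loss(labels, scores, feature_values, 0.0)
loss_after = exp_loss(labels, scores, feature_values, a_new)
print(a_new, loss_before, loss_after)
```

The factor $\exp(-l_i F(\vec{x}_i))$ acts as a per-sample weight: samples that the current classifier gets badly wrong contribute most to the new weight.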

Boosting

- This is called the GentleBoost algorithm.
- We still have to choose $\phi_{j+1}(\cdot)$. This is usually done greedily, by searching over a collection of different thresholded features.
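The greedy search can be sketched as a loop over a pool of candidate features. This is not a reference implementation of GentleBoost. The selection rule (keep the candidate with the lowest exponential loss after its Newton step), the helper names, the candidate thresholds, and the toy data are all assumptions made for illustration.

```python
import math

def make_stump(dim, threshold):
    """Thresholded feature: +1 if x[dim] > threshold, else -1."""
    return lambda x: 1.0 if x[dim] > threshold else -1.0

def exp_loss(labels, scores):
    return sum(math.exp(-l * f) for l, f in zip(labels, scores))

def boost(samples, labels, candidates, num_rounds):
    scores = [0.0] * len(samples)   # F(x_i); F starts with no features
    chosen = []                     # list of (weight, feature) pairs
    for _ in range(num_rounds):
        weights = [math.exp(-l * f) for l, f in zip(labels, scores)]
        best = None
        for phi in candidates:
            values = [phi(x) for x in samples]
            # Newton step for this candidate's weight.
            num = sum(l * v * w for l, v, w in zip(labels, values, weights))
            den = sum(v * v * w for v, w in zip(values, weights))
            a = num / den
            new_scores = [f + a * v for f, v in zip(scores, values)]
            loss = exp_loss(labels, new_scores)
            # Assumed selection rule: keep the candidate with the lowest loss.
            if best is None or loss < best[0]:
                best = (loss, a, phi, new_scores)
        _, a, phi, scores = best
        chosen.append((a, phi))
    return chosen, scores

# Hypothetical toy data and candidate pool.
samples = [(2.0, 0.0), (1.0, 1.0), (-1.0, 0.5), (-2.0, -1.0)]
labels = [+1, +1, -1, -1]
candidates = [make_stump(0, -1.5), make_stump(0, 0.0),
              make_stump(0, 1.5), make_stump(1, 0.25)]

chosen, scores = boost(samples, labels, candidates, num_rounds=2)
print([round(a, 3) for a, _ in chosen], scores)
```

Only the current scores and one candidate's feature values are needed at a time, which is why this approach can consider far more features than would fit in memory at once.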