Linear Classifiers: Perceptron Algorithm, Cost Function, and Support Vector Machines (Slides on Pattern Classification and Recognition)

These slides cover linear classifiers, focusing on the perceptron algorithm, the cost function, and support vector machines (SVMs). The perceptron algorithm trains a linear classifier when the classes are linearly separable. The cost function measures the classification error and is minimized to find a solution. Support vector machines (SVMs) are linear classifiers that select the hyperplane with the maximum margin between the classes.


1

LINEAR CLASSIFIERS

The Problem: Consider a two-class task with classes $\omega_1$, $\omega_2$ and the linear discriminant function

$$g(x) = w^T x + w_0 = w_1 x_1 + w_2 x_2 + \dots + w_l x_l + w_0 = 0$$

Assume $x_1$, $x_2$ are two points on the decision hyperplane:

$$w^T x_1 + w_0 = w^T x_2 + w_0 = 0 \;\Rightarrow\; w^T (x_1 - x_2) = 0, \quad \forall\, x_1, x_2 \text{ on the hyperplane}$$

2

Hence, $w$ is orthogonal ($\perp$) to the decision hyperplane $g(x) = w^T x + w_0 = 0$.

For the two-dimensional case, the distance of the origin from the hyperplane is

$$d = \frac{|w_0|}{\sqrt{w_1^2 + w_2^2}}$$

and the distance of a point $x$ from the hyperplane is

$$z = \frac{|g(x)|}{\sqrt{w_1^2 + w_2^2}}$$

3

The Perceptron Algorithm

Assume linearly separable classes, i.e., there exists a $w^*$ such that

$$w^{*T} x > 0 \quad \forall x \in \omega_1, \qquad w^{*T} x < 0 \quad \forall x \in \omega_2$$

The case $w^{*T} x + w_0$ falls under the above formulation, since with the augmented vectors

$$w' = \begin{bmatrix} w^* \\ w_0 \end{bmatrix}, \qquad x' = \begin{bmatrix} x \\ 1 \end{bmatrix}$$

we have $w^{*T} x + w_0 = w'^T x'$.

4

Our goal: Compute a solution, i.e., a hyperplane $w$, so that

$$w^T x \;>\; 0 \quad \forall x \in \omega_1, \qquad w^T x \;<\; 0 \quad \forall x \in \omega_2$$

  • The steps
    • Define a cost function to be minimized.
    • Choose an algorithm to minimize the cost function.
    • The minimum corresponds to a solution.

5

The Cost Function

  • Define the cost

$$J(w) = \sum_{x \in Y} \delta_x\, w^T x$$

  • Where $Y$ is the subset of the vectors wrongly classified by $w$. When $Y = \varnothing$ (empty set) a solution is achieved and $J(w) = 0$.

  • The indicators are chosen as

$$\delta_x = -1 \ \text{ if } x \in Y \text{ and } x \in \omega_1, \qquad \delta_x = +1 \ \text{ if } x \in Y \text{ and } x \in \omega_2$$

  • With this choice every misclassified vector contributes a positive term, so $J(w) \ge 0$.

6

  • $J(w)$ is piecewise linear (WHY?): it is a sum of linear terms whose composition changes only when the set $Y$ of misclassified vectors changes, so it is continuous but not differentiable at those switch points.

The Algorithm

  • The philosophy of gradient descent is adopted.

7

  • Gradient descent update:

$$w(\text{new}) = w(\text{old}) + \Delta w, \qquad \Delta w = -\rho \left.\frac{\partial J(w)}{\partial w}\right|_{w = w(\text{old})}$$

  • Wherever valid,

$$\frac{\partial J(w)}{\partial w} = \frac{\partial}{\partial w} \sum_{x \in Y} \delta_x\, w^T x = \sum_{x \in Y} \delta_x\, x$$

so the update becomes

$$w(t+1) = w(t) - \rho_t \sum_{x \in Y} \delta_x\, x$$

  • This is the celebrated Perceptron Algorithm.
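To make the update rule concrete, here is a minimal NumPy sketch of the batch perceptron iteration above. The toy data, the fixed learning rate rho, and the stopping test are illustrative assumptions, not part of the slides.

```python
import numpy as np

def batch_perceptron(X, labels, rho=0.1, max_iter=1000):
    """Batch perceptron. X holds one augmented vector [x, 1] per row;
    labels are +1 for omega_1 and -1 for omega_2."""
    w = np.zeros(X.shape[1])
    for _ in range(max_iter):
        scores = X @ w
        # a vector is misclassified when label * w^T x <= 0
        wrong = labels * scores <= 0
        if not wrong.any():            # Y is empty: a separating w has been found
            break
        delta = -labels[wrong]         # delta_x = -1 for omega_1, +1 for omega_2
        # w(t+1) = w(t) - rho * sum_{x in Y} delta_x * x
        w = w - rho * (delta[:, None] * X[wrong]).sum(axis=0)
    return w

# toy linearly separable data (augmented with a trailing 1)
X = np.array([[0.1, 0.7, 1.0], [0.3, 0.9, 1.0], [0.8, 0.2, 1.0], [0.9, 0.4, 1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
print(batch_perceptron(X, y))          # a w with y_i * w^T x_i > 0 for all i
```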

8

An example (figure omitted): each correction of the form

$$w(t+1) = w(t) + \rho\, x$$

rotates the hyperplane towards a currently misclassified vector $x$.

The perceptron algorithm converges in a finite number of iteration steps to a solution if the sequence $\rho_t$ is chosen appropriately, e.g.,

$$\lim_{t \to \infty} \sum_{k=0}^{t} \rho_k = \infty, \qquad \lim_{t \to \infty} \sum_{k=0}^{t} \rho_k^2 < \infty, \qquad \text{e.g., } \rho_t = \frac{c}{t}$$

9

A useful variant of the perceptron algorithm

It is a reward and punishment type of algorithm: the training vectors are presented one at a time (cyclically), and the weight vector is corrected only when the current vector $x^{(t)}$ is misclassified:

$$w(t+1) = w(t) + \rho\, x^{(t)} \quad \text{if } x^{(t)} \in \omega_1 \text{ and } w^T(t)\, x^{(t)} \le 0$$
$$w(t+1) = w(t) - \rho\, x^{(t)} \quad \text{if } x^{(t)} \in \omega_2 \text{ and } w^T(t)\, x^{(t)} \ge 0$$
$$w(t+1) = w(t) \quad \text{otherwise}$$
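A minimal sketch of this online variant in NumPy; the epoch loop and the data layout (one augmented row vector per sample) are assumptions made for the illustration.

```python
import numpy as np

def reward_punishment_perceptron(X, labels, rho=0.5, max_epochs=100):
    """Reward-and-punishment perceptron: cycle through the training vectors,
    correcting w only when the currently presented vector is misclassified."""
    w = np.zeros(X.shape[1])
    for _ in range(max_epochs):
        mistakes = 0
        for x, lab in zip(X, labels):
            s = w @ x
            if lab == 1 and s <= 0:        # x in omega_1 but w^T x <= 0: reward step
                w = w + rho * x
                mistakes += 1
            elif lab == -1 and s >= 0:     # x in omega_2 but w^T x >= 0: punishment step
                w = w - rho * x
                mistakes += 1
            # otherwise w(t+1) = w(t)
        if mistakes == 0:                  # one full clean pass: stop
            break
    return w
```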

10

The perceptron

(Figure omitted: the inputs $x_1, \dots, x_l$ enter through the synapses or synaptic weights $w_1, \dots, w_l$ and are summed together with the threshold $w_0$.)

It is a learning machine that learns from the training vectors via the perceptron algorithm. The network is called perceptron or neuron.

11

Example: At some stage $t$ the perceptron algorithm results in

$$w_1 = 1, \qquad w_2 = 1, \qquad w_0 = -0.5$$

The corresponding hyperplane is $x_1 + x_2 - 0.5 = 0$. With $\rho = 0.7$ and the two misclassified (augmented) vectors $[-0.2,\ 0.75,\ 1]^T$ and $[0.4,\ 0.05,\ 1]^T$, the update gives

$$w(t+1) = \begin{bmatrix} 1 \\ 1 \\ -0.5 \end{bmatrix} - 0.7\,(+1)\begin{bmatrix} -0.2 \\ 0.75 \\ 1 \end{bmatrix} - 0.7\,(-1)\begin{bmatrix} 0.4 \\ 0.05 \\ 1 \end{bmatrix} = \begin{bmatrix} 1.42 \\ 0.51 \\ -0.5 \end{bmatrix}$$
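A quick NumPy check of the arithmetic in this example; the $\delta_x$ values (+1 and -1) attached to the two misclassified vectors are inferred from the stated result, not given explicitly on the slide.

```python
import numpy as np

w_t = np.array([1.0, 1.0, -0.5])                  # current weights [w1, w2, w0]
rho = 0.7
x_a, delta_a = np.array([-0.2, 0.75, 1.0]), +1    # assumed misclassified omega_2 vector
x_b, delta_b = np.array([0.4, 0.05, 1.0]), -1     # assumed misclassified omega_1 vector
w_next = w_t - rho * (delta_a * x_a + delta_b * x_b)
print(w_next)                                     # [ 1.42  0.51 -0.5 ]
```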

12

Least Squares Methods

If the classes are linearly separable, the perceptron output reproduces the desired labels (±1) exactly.

If the classes are NOT linearly separable, we shall compute the weights $w_0, w_1, \dots, w_l$ so that the difference between

  • the actual output of the classifier, $w^T x$, and
  • the desired outputs, e.g., $y = +1$ if $x \in \omega_1$ and $y = -1$ if $x \in \omega_2$,

is SMALL.

13

SMALL, in the mean square error sense, means to choose $w$ so that the cost function

$$J(w) = E\big[(y - w^T x)^2\big]$$

becomes minimum, i.e.,

$$\hat{w} = \arg\min_{w} J(w)$$

where $y$ is the corresponding desired response.

14

Minimizing $J(w)$ w.r.t. $w$ results in:

$$\frac{\partial J(w)}{\partial w} = -2\,E\big[x\,(y - x^T w)\big] = 0 \;\Rightarrow\; E[x x^T]\,\hat{w} = E[x\,y] \;\Rightarrow\; \hat{w} = R_x^{-1} E[x\,y]$$

where

$$R_x \equiv E[x x^T] = \begin{bmatrix} E[x_1 x_1] & \dots & E[x_1 x_l] \\ \vdots & \ddots & \vdots \\ E[x_l x_1] & \dots & E[x_l x_l] \end{bmatrix}$$

is the autocorrelation matrix and

$$E[x\,y] = \big[E[x_1 y], \dots, E[x_l y]\big]^T$$

is the crosscorrelation vector.
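In practice the expectations are replaced by sample averages over a training set; a minimal sketch (the synthetic data are an assumption for illustration only):

```python
import numpy as np

rng = np.random.default_rng(0)
N = 200
# synthetic two-class data: labels +1/-1, features shifted by the label, plus a bias term
y = np.where(rng.random(N) < 0.5, 1.0, -1.0)
X = np.c_[y[:, None] * 0.8 + rng.normal(scale=0.5, size=(N, 2)), np.ones(N)]

R_x = (X.T @ X) / N                  # sample estimate of the autocorrelation matrix E[x x^T]
p   = (X.T @ y) / N                  # sample estimate of the crosscorrelation vector E[x y]
w_hat = np.linalg.solve(R_x, p)      # w_hat = R_x^{-1} E[x y]
print(w_hat)
```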

15

Multi-class generalization

  • The goal is to compute $M$ linear discriminant functions

$$g_i(x) = w_i^T x, \quad i = 1, 2, \dots, M$$

according to the MSE criterion.

  • Adopt as desired responses $y_i$:

$$y_i = 1 \ \text{ if } x \in \omega_i, \qquad y_i = 0 \ \text{ otherwise}$$

  • Let $y = [y_1, y_2, \dots, y_M]^T$ and the matrix $W = [w_1, w_2, \dots, w_M]$.

16

  • The goal is to compute $\hat{W}$:

$$\hat{W} = \arg\min_{W} E\big[\|\,y - W^T x\,\|^2\big] = \arg\min_{W} \sum_{i=1}^{M} E\big[(y_i - w_i^T x)^2\big]$$

  • The above is equivalent to a number $M$ of independent MSE minimization problems. That is: design each $w_i$ so that its desired output is 1 for $x \in \omega_i$ and 0 for any other class. A sketch of this construction follows the remark below.

Remark: The MSE criterion belongs to a more general class of cost functions with the following important property:

  • The value of $g_i(x)$ is an estimate, in the MSE sense, of the a-posteriori probability $P(\omega_i \mid x)$, provided that the desired responses used during training are $y_i = 1$ for $x \in \omega_i$ and 0 otherwise.
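Here is a small sketch of the multi-class MSE design: each column of W is fitted by least squares against a 0/1 indicator target. The synthetic data, the dimensions, and the use of np.linalg.lstsq are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
M, l, N = 3, 2, 300
means = rng.normal(size=(M, l))
labels = rng.integers(0, M, size=N)
X = means[labels] + 0.3 * rng.normal(size=(N, l))
X = np.c_[X, np.ones(N)]                   # augment each vector with a constant 1

Y = np.eye(M)[labels]                      # desired responses: y_i = 1 if x in omega_i, else 0
W, *_ = np.linalg.lstsq(X, Y, rcond=None)  # solves the M MSE problems, one per column of W
g = X @ W                                  # g_i(x); each value estimates P(omega_i | x)
print("training accuracy:", (g.argmax(axis=1) == labels).mean())
```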

17

Mean square error regression: Let $y \in \mathbb{R}^M$, $x \in \mathbb{R}^l$ be jointly distributed random vectors with a joint pdf $p(x, y)$.

  • The goal: Given the value of $x$, estimate the value of $y$. In the pattern recognition framework, given $x$ one wants to estimate the respective label $y = \pm 1$.

  • The MSE estimate $\hat{y}$ of $y$, given $x$, is defined as:

$$\hat{y} = \arg\min_{\tilde{y}} E\big[\|\,y - \tilde{y}\,\|^2\big]$$

  • It turns out that:

$$\hat{y} = E[y \mid x] = \int_{-\infty}^{+\infty} y\, p(y \mid x)\, dy$$

The above is known as the regression of $y$ given $x$ and it is, in general, a non-linear function of $x$. If $p(x, y)$ is Gaussian, the MSE regressor is linear.

18

SMALL in the sum of error squares sense means

$$J(w) = \sum_{i=1}^{N} \big(y_i - w^T x_i\big)^2$$

where $(y_i, x_i)$ are the training pairs, that is, the input $x_i$ and its corresponding class label $y_i$ (±1).

Minimizing w.r.t. $w$:

$$\frac{\partial J(w)}{\partial w} = 0 \;\Rightarrow\; \Big(\sum_{i=1}^{N} x_i x_i^T\Big)\hat{w} = \sum_{i=1}^{N} x_i y_i$$

19

Pseudoinverse Matrix

Define

$$X = \begin{bmatrix} x_1^T \\ x_2^T \\ \vdots \\ x_N^T \end{bmatrix} \ (\text{an } N \times l \text{ matrix}), \qquad y = [y_1, y_2, \dots, y_N]^T \ (\text{the corresponding desired responses})$$

$$X^T = [x_1, x_2, \dots, x_N] \ (\text{an } l \times N \text{ matrix})$$

Then

$$X^T X = \sum_{i=1}^{N} x_i x_i^T, \qquad X^T y = \sum_{i=1}^{N} x_i y_i$$

20

Thus

$$\Big(\sum_{i=1}^{N} x_i x_i^T\Big)\hat{w} = \sum_{i=1}^{N} x_i y_i \;\Leftrightarrow\; (X^T X)\,\hat{w} = X^T y \;\Rightarrow\; \hat{w} = (X^T X)^{-1} X^T y$$

$$X^{+} \equiv (X^T X)^{-1} X^T \quad \text{(the pseudoinverse of } X\text{)}$$

Assume $N = l$ and $X$ square and invertible. Then

$$X^{+} = (X^T X)^{-1} X^T = X^{-1}(X^T)^{-1} X^T = X^{-1}$$
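A minimal NumPy sketch of the pseudoinverse solution; the tiny data set is an arbitrary illustration, and in practice np.linalg.lstsq or np.linalg.pinv is preferred over forming (X^T X)^{-1} explicitly.

```python
import numpy as np

def ls_weights(X, y):
    """Least squares weights: w_hat = (X^T X)^{-1} X^T y = X^+ y."""
    return np.linalg.solve(X.T @ X, X.T @ y)

# small N > l illustration (arbitrary augmented data, labels +1/-1)
X = np.array([[0.2, 1.0], [0.5, 1.0], [0.9, 1.0], [1.3, 1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
print(ls_weights(X, y))
print(np.linalg.pinv(X) @ y)    # the same result via the explicit pseudoinverse
```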

21

Assume $N > l$. Then, in general, there is no solution satisfying all equations simultaneously:

$$X w = y : \quad x_1^T w = y_1, \quad x_2^T w = y_2, \quad \dots, \quad x_N^T w = y_N \qquad (N \text{ equations}, \ l \text{ unknowns})$$

The "solution" $\hat{w} = X^{+} y$ corresponds to the minimum sum of error squares solution.

22

Example: The two classes consist of the training vectors (a scatter plot of the two classes is omitted)

$$\omega_1:\ \begin{bmatrix}0.4\\0.5\end{bmatrix}, \begin{bmatrix}0.6\\0.5\end{bmatrix}, \begin{bmatrix}0.1\\0.4\end{bmatrix}, \begin{bmatrix}0.2\\0.7\end{bmatrix}, \begin{bmatrix}0.3\\0.3\end{bmatrix} \qquad
\omega_2:\ \begin{bmatrix}0.4\\0.6\end{bmatrix}, \begin{bmatrix}0.6\\0.2\end{bmatrix}, \begin{bmatrix}0.7\\0.4\end{bmatrix}, \begin{bmatrix}0.8\\0.6\end{bmatrix}, \begin{bmatrix}0.7\\0.5\end{bmatrix}$$

With desired responses $y_i = +1$ for $\omega_1$ and $y_i = -1$ for $\omega_2$, the augmented data matrix is

$$X = \begin{bmatrix}
0.4 & 0.5 & 1\\
0.6 & 0.5 & 1\\
0.1 & 0.4 & 1\\
0.2 & 0.7 & 1\\
0.3 & 0.3 & 1\\
0.4 & 0.6 & 1\\
0.6 & 0.2 & 1\\
0.7 & 0.4 & 1\\
0.8 & 0.6 & 1\\
0.7 & 0.5 & 1
\end{bmatrix}, \qquad y = [1, 1, 1, 1, 1, -1, -1, -1, -1, -1]^T$$

23

$$X^T X = \begin{bmatrix} 2.8 & 2.24 & 4.8 \\ 2.24 & 2.41 & 4.7 \\ 4.8 & 4.7 & 10 \end{bmatrix}, \qquad X^T y = \begin{bmatrix} -1.6 \\ 0.1 \\ 0.0 \end{bmatrix}$$

$$\hat{w} = (X^T X)^{-1} X^T y \approx \begin{bmatrix} -3.22 \\ 0.24 \\ 1.43 \end{bmatrix}$$
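These matrices can be reproduced directly from the data of the previous slide; a short NumPy check (the +1/-1 label convention for $\omega_1$/$\omega_2$ is the one used throughout the slides):

```python
import numpy as np

omega1 = np.array([[0.4, 0.5], [0.6, 0.5], [0.1, 0.4], [0.2, 0.7], [0.3, 0.3]])
omega2 = np.array([[0.4, 0.6], [0.6, 0.2], [0.7, 0.4], [0.8, 0.6], [0.7, 0.5]])
X = np.c_[np.vstack([omega1, omega2]), np.ones(10)]   # augmented 10 x 3 data matrix
y = np.r_[np.ones(5), -np.ones(5)]                    # +1 for omega_1, -1 for omega_2

print(X.T @ X)                            # [[2.8, 2.24, 4.8], [2.24, 2.41, 4.7], [4.8, 4.7, 10]]
print(X.T @ y)                            # [-1.6, 0.1, 0.0]
print(np.linalg.solve(X.T @ X, X.T @ y))  # approximately [-3.22, 0.24, 1.43]
```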

24

The Bias-Variance Dilemma

A classifier $g(x)$ is a learning machine that tries to predict the class label $y$ of $x$. In practice, a finite data set $D$ is used for its training. Let us write $g(x; D)$. Observe that:

  • For some training sets, $D = \{(y_i, x_i),\ i = 1, 2, \dots, N\}$, the training may result in good estimates; for some others the result may be worse.

  • The average performance of the classifier can be tested against the MSE optimal value, in the mean squares sense, that is:

$$E_D\Big[\big(g(x; D) - E[y \mid x]\big)^2\Big]$$

where $E_D$ is the mean over all possible data sets $D$.

25

  • The above is written as:

$$E_D\Big[\big(g(x; D) - E[y \mid x]\big)^2\Big] = \big(E_D[g(x; D)] - E[y \mid x]\big)^2 + E_D\Big[\big(g(x; D) - E_D[g(x; D)]\big)^2\Big]$$

  • In the above, the first term is the contribution of the bias and the second term is the contribution of the variance.

  • For a finite $D$, there is a trade-off between the two terms: increasing the bias reduces the variance and vice versa. This is known as the bias-variance dilemma.

  • Using a complex model results in low bias but high variance, as one changes from one training set to another. Using a simple model results in high bias but low variance. A small simulation illustrating the trade-off follows.
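The trade-off can be illustrated numerically by averaging over many training sets $D$. The sketch below uses a simple one-dimensional regression setup; the target function, the noise level, the polynomial models, and the sample sizes are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
f = lambda x: np.sin(2 * np.pi * x)          # E[y|x], the MSE-optimal predictor
x_test = np.linspace(0, 1, 50)

def fit_once(degree, n_train=20):
    """Train a polynomial model on one random data set D and return g(x_test; D)."""
    x = rng.random(n_train)
    y = f(x) + rng.normal(scale=0.3, size=n_train)
    return np.polyval(np.polyfit(x, y, degree), x_test)

for degree in (1, 7):                        # a simple model vs a complex model
    preds = np.array([fit_once(degree) for _ in range(200)])   # many data sets D
    mean_pred = preds.mean(axis=0)           # E_D[g(x; D)]
    bias2 = ((mean_pred - f(x_test)) ** 2).mean()
    variance = preds.var(axis=0).mean()
    print(f"degree {degree}: bias^2 = {bias2:.3f}, variance = {variance:.3f}")
```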