Linear classifiers, focusing on the perceptron algorithm, cost functions, and support vector machines (SVMs). The perceptron algorithm trains a linear classifier when the classes are linearly separable. A cost function measures the error of the classifier and is minimized to find an optimal solution. Support vector machines (SVMs) are linear classifiers that find the hyperplane with the maximum margin between the classes.
Typology: Slides
LINEAR CLASSIFIERS

1
Consider a two-class task with classes ω_1, ω_2 and a linear discriminant function
g(x) = w^T x + w_0 = w_1 x_1 + w_2 x_2 + \dots + w_l x_l + w_0
The decision hyperplane is defined by g(x) = 0.
Assume x_1, x_2 are two points on the decision hyperplane:
0 = w^T x_1 + w_0 = w^T x_2 + w_0 \;\Rightarrow\; w^T (x_1 - x_2) = 0, for all x_1, x_2 on the hyperplane.
2
Since x_1 - x_2 lies on the decision hyperplane, w is orthogonal to the hyperplane g(x) = w^T x + w_0 = 0.
[Figure: geometry of the decision hyperplane in two dimensions, showing its distance d from the origin and the distance z of a point x from it.]
d = \frac{|w_0|}{\sqrt{w_1^2 + w_2^2}}, \qquad z = \frac{|g(x)|}{\sqrt{w_1^2 + w_2^2}}
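These distance formulas are easy to check numerically; a minimal Python sketch (assuming NumPy, with an arbitrary illustrative hyperplane and point, not taken from the slides):

import numpy as np

# Illustrative hyperplane g(x) = w^T x + w_0 = 0 with w = [1, 1], w_0 = -0.5
w = np.array([1.0, 1.0])
w0 = -0.5

# Distance of the hyperplane from the origin: d = |w_0| / ||w||
d = abs(w0) / np.linalg.norm(w)

# Distance of a point x from the hyperplane: z = |g(x)| / ||w||
x = np.array([0.4, 0.05])
z = abs(w @ x + w0) / np.linalg.norm(w)

print(d, z)   # approximately 0.3536 and 0.0354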
3
Assume linearly separable classes, i.e., there exists a w^* such that
w^{*T} x > 0 \;\; \forall x \in \omega_1, \qquad w^{*T} x < 0 \;\; \forall x \in \omega_2
The case w^{*T} x + w_0^* falls under the above formulation, since writing
w' \equiv [w^{*T}, w_0^*]^T, \qquad x' \equiv [x^T, 1]^T
gives w'^T x' = w^{*T} x + w_0^*.
4
Our goal: compute a solution, i.e., a hyperplane w, so that
w^T x > 0 for x ∈ ω_1 and w^T x < 0 for x ∈ ω_2.
This is done by defining a suitable cost function and choosing an algorithm that minimizes it.
5
The cost function:
J(w) = \sum_{x \in Y} \delta_x w^T x
where Y is the subset of the training vectors wrongly classified by w. When Y = ∅ (the empty set), a solution has been achieved and J(w) = 0.
δ_x = -1 if x ∈ ω_1 and δ_x = +1 if x ∈ ω_2, so every term of the sum, and hence J(w), is non-negative.
6
J(w) is piecewise linear (WHY?)
The perceptron algorithm: to compute a minimum of J(w), an iterative scheme in the spirit of gradient descent is adopted.
7
w(\text{new}) = w(\text{old}) + \Delta w, \qquad \Delta w = -\rho \left. \frac{\partial J(w)}{\partial w} \right|_{w = w(\text{old})}
\frac{\partial J(w)}{\partial w} = \frac{\partial}{\partial w} \sum_{x \in Y} \delta_x w^T x = \sum_{x \in Y} \delta_x x
This leads to the perceptron update rule
w(t+1) = w(t) - \rho_t \sum_{x \in Y} \delta_x x
where ρ_t is the learning parameter (step size) at iteration t.
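A minimal Python sketch of this batch update rule (assuming NumPy; the stopping rule, the fixed step size ρ and the example data are illustrative choices, not from the slides). The vectors are assumed to be augmented with a trailing 1, so the last component of w plays the role of w_0:

import numpy as np

def batch_perceptron(X, labels, rho=0.1, max_iter=1000):
    """X: N x (l+1) augmented training vectors; labels: +1 for omega_1, -1 for omega_2."""
    w = np.zeros(X.shape[1])
    for _ in range(max_iter):
        delta = -labels                          # delta_x = -1 on omega_1, +1 on omega_2
        wrong = delta * (X @ w) >= 0             # Y: wrongly classified vectors
        if not wrong.any():
            return w                             # Y is empty: a solution has been found
        # w(t+1) = w(t) - rho * sum_{x in Y} delta_x * x
        w = w - rho * (delta[wrong, None] * X[wrong]).sum(axis=0)
    return w

# Illustrative linearly separable data (already augmented with a 1)
X = np.array([[0.2, 0.8, 1.0], [0.3, 0.9, 1.0], [0.8, 0.2, 1.0], [0.9, 0.3, 1.0]])
labels = np.array([1, 1, -1, -1])
print(batch_perceptron(X, labels))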
8
The perceptron algorithm converges to a solution in a finite number of iteration steps, provided the sequence of learning parameters ρ_t satisfies
\lim_{t \to \infty} \sum_{k=0}^{t} \rho_k = \infty, \qquad \lim_{t \to \infty} \frac{\sum_{k=0}^{t} \rho_k^2}{\left( \sum_{k=0}^{t} \rho_k \right)^2} = 0
e.g., ρ_t = c / t.
9
A useful variant of the perceptron algorithm considers the training vectors one at a time, in a reward-and-punishment fashion:
w(t+1) = w(t) + \rho x_{(t)}, if x_{(t)} ∈ ω_1 and w^T(t) x_{(t)} ≤ 0
w(t+1) = w(t) - \rho x_{(t)}, if x_{(t)} ∈ ω_2 and w^T(t) x_{(t)} ≥ 0
w(t+1) = w(t), otherwise
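A sketch of this variant in Python (assuming NumPy and augmented vectors; a constant ρ is used, though a decreasing sequence such as ρ_t = c/t also fits this loop):

import numpy as np

def perceptron_reward_punishment(X, labels, rho=0.7, n_epochs=100):
    """Cycle through the training vectors one at a time, correcting w only on mistakes."""
    w = np.zeros(X.shape[1])
    for _ in range(n_epochs):
        corrected = False
        for x, y in zip(X, labels):
            if y == 1 and w @ x <= 0:        # x in omega_1 wrongly classified: reward step
                w = w + rho * x
                corrected = True
            elif y == -1 and w @ x >= 0:     # x in omega_2 wrongly classified: punishment step
                w = w - rho * x
                corrected = True
        if not corrected:                    # a full pass with no corrections: solution found
            return w
    return w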
10
[Figure: the perceptron as a network — the inputs x_1, ..., x_l are weighted by w_1, ..., w_l and summed together with the threshold w_0 to form g(x) = \sum_i w_i x_i + w_0.]
It is a learning machine that learns from the training vectors via the perceptron algorithm.
The network is called a perceptron or neuron.
11
Example: at some stage t the perceptron algorithm results in
w_1(t) = 1, w_2(t) = 1, w_0(t) = -0.5
The corresponding hyperplane is x_1 + x_2 - 0.5 = 0. With ρ = 0.7, the correction based on the two misclassified vectors (-0.2, 0.75) ∈ ω_2 and (0.4, 0.05) ∈ ω_1 is
w(t+1) = [1, 1, -0.5]^T - 0.7\,[-0.2, 0.75, 1]^T + 0.7\,[0.4, 0.05, 1]^T = [1.42, 0.51, -0.5]^T
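This single correction step can be verified directly (a small check in Python, assuming NumPy, with the vectors written in augmented form [x_1, x_2, 1]):

import numpy as np

w_t = np.array([1.0, 1.0, -0.5])
rho = 0.7
x_a = np.array([-0.2, 0.75, 1.0])   # misclassified vector from omega_2: subtract rho * x
x_b = np.array([0.4, 0.05, 1.0])    # misclassified vector from omega_1: add rho * x

w_next = w_t - rho * x_a + rho * x_b
print(w_next)                        # approximately [ 1.42  0.51 -0.5 ]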
12
If classes are linearly separable, the perceptron outputresults in
If classes are NOT linearly separable, we shall computethe weights,
so that the difference
between
, and
to be SMALL.
1
0
2
1
w
,..., w , w
T
1 2
if 1
if 1
x x
13
SMALL, in the mean square error sense, means to chooseso that the cost function:
w
2
w
T
14
Minimizing J(w) with respect to w results in:
\frac{\partial J(w)}{\partial w} = \frac{\partial}{\partial w} E\left[ (y - x^T w)^2 \right] = -2\, E\left[ x (y - x^T w) \right] = 0
\Rightarrow R_x \hat{w} = E[x y] \Rightarrow \hat{w} = R_x^{-1} E[x y]
where R_x \equiv E[x x^T] is the autocorrelation matrix
R_x = \begin{bmatrix} E[x_1 x_1] & E[x_1 x_2] & \dots & E[x_1 x_l] \\ \vdots & \vdots & & \vdots \\ E[x_l x_1] & E[x_l x_2] & \dots & E[x_l x_l] \end{bmatrix}
and E[x y] = [E[x_1 y], \dots, E[x_l y]]^T is the crosscorrelation vector.
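In practice the expectations are unknown and are replaced by sample averages over the training set. A Python sketch (illustrative, assuming NumPy) that estimates R_x and E[xy] from data and solves R_x ŵ = E[xy]:

import numpy as np

def mse_weights(X, y):
    """X: N x l matrix of training vectors (rows); y: desired responses, e.g. +/-1."""
    N = X.shape[0]
    R = (X.T @ X) / N              # sample estimate of the autocorrelation matrix E[x x^T]
    p = (X.T @ y) / N              # sample estimate of the crosscorrelation vector E[x y]
    return np.linalg.solve(R, p)   # w_hat = R_x^{-1} E[x y]

Since the two 1/N factors cancel, this coincides with the sum of error squares solution discussed on the later slides.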
15
Multi-class generalization
M
linear discriminant functions:
according to the MSE.
y
:i
x
w
x
g
T i
i^
) (
otherwise 0
if 1
i
i
i y
x
y
^
T M
2
1
^
M w w w W
2
1
16
The goal is to compute W so that
\hat{W} = \arg\min_W E\left[ \| y - W^T x \|^2 \right] = \arg\min_W \sum_{i=1}^{M} E\left[ (y_i - w_i^T x)^2 \right]
This is equivalent to M independent MSE minimization problems. That is: design each w_i so that its desired output is 1 for x ∈ ω_i and 0 for any other class.
Remark: the MSE criterion belongs to a more general class of cost functions with the following important property: g_i(x) is an estimate, in the MSE sense, of the a-posteriori probability P(ω_i | x), provided that the desired responses used during training are y_i = 1 for x ∈ ω_i and 0 otherwise.
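A Python sketch of this multi-class MSE design (illustrative, assuming NumPy): the 0/1 desired responses are arranged as the columns of a target matrix, and each w_i is the solution of its own least-squares problem, obtained here with a single call:

import numpy as np

def mse_multiclass(X, class_index, M):
    """X: N x l training matrix; class_index: N integer labels in {0, ..., M-1}."""
    N = X.shape[0]
    Y = np.zeros((N, M))
    Y[np.arange(N), class_index] = 1.0        # y_i = 1 for the true class, 0 otherwise
    W, *_ = np.linalg.lstsq(X, Y, rcond=None) # column i of W minimizes the i-th MSE problem
    return W

def classify(W, x):
    return int(np.argmax(W.T @ x))            # assign x to the class with the largest g_i(x)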
17
estimate the value of
In the pattern recognition framework, given
one wants
to estimate the respective label
of
given
is defined as:
The above is known as the regression of
given
and
it is, in general, a non-linear function of
. If
is
Gaussian the MSE regressor is linear.
M
x
y
x
ˆy
y
x
^
2
~
y^
^
^
y
18
SMALL in the sum of error squares sense means
J(w) = \sum_{i=1}^{N} (y_i - x_i^T w)^2
where (y_i, x_i), i = 1, 2, \dots, N, are the available training pairs; that is, the input x_i and its corresponding class label y_i (±1).
Minimizing J(w) with respect to w gives
\frac{\partial J(w)}{\partial w} = 0 \;\Rightarrow\; \left( \sum_{i=1}^{N} x_i x_i^T \right) w = \sum_{i=1}^{N} x_i y_i
19
Define
responses
desired
ing
correspond
y
matrix)
(an
T 1 T^2 T N (^1) N y ... y
Nxl
x x ... x
2
1
N
T^
N i
T i i
T^
x x
X
X
1
i
N i
i
T
y x
y
X
1
20
Thus
Assume
N=l
square and invertible.
Then
y
X
y
X
X X
w
y
X
w X X
y x
w x x
T
T
T
T N i
N i
i i
i T i
^
1
1
1
)
(
ˆ
ˆ )
(
)
(
ˆ )
(
T
T^
1 )
^
Pseudoinverse of
1
1
1
1 )
(
X
X
X X X X X X X
T
T
T
T
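A Python sketch of this pseudoinverse solution (assuming NumPy; np.linalg.lstsq or np.linalg.pinv is preferable to forming (X^T X)^{-1} explicitly, for numerical reasons):

import numpy as np

def sum_of_squares_weights(X, y):
    """w_hat = (X^T X)^{-1} X^T y = X^+ y."""
    return np.linalg.pinv(X) @ y              # X^+ is the pseudoinverse of X

# Equivalent least-squares form:
# w_hat, *_ = np.linalg.lstsq(X, y, rcond=None)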
21
Assume
N>l
Then, in general, there is no solution to
satisfy all equations simultaneously:
The “solution”
corresponds to the minimum
sum of squares solution.
2
2
1
1
N
T T T N
22
Example:
(^5). 0
(^7). 0 , (^6). 0
(^8). 0 , (^4). 0
(^7). 0 , (^2). 0
(^6). 0 , (^6). 0
(^4). 0 :
(^3). 0
(^3). 0 , (^7). 0
(^2). 0 , (^4). 0
(^1). 0 , (^5). 0
(^6). 0 , (^5). 0
(^4). 0 : 2 1
y
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
X
1 1 1 1 1 1 1 1 1 1
1
5 0 7 0
1
6 0 8 0
1
4 0 7 0
1
2 0 6 0
1
6 0 4 0
1
3 0 3 0
1
7 0 2 0
1
4 0 1 0
1
5 0 6 0
1
5 0 4 0
23
For these data,
X^T X =
[2.8   2.24  4.8]
[2.24  2.41  4.7]
[4.8   4.7   10 ]
X^T y = [-1.6, 0.1, 0.0]^T
\hat{w} = (X^T X)^{-1} X^T y = [-3.13, 0.24, 1.34]^T
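The quantities in this example can be reproduced with a short script (assuming NumPy; the printed ŵ may differ slightly from the rounded values quoted above):

import numpy as np

X = np.array([
    [0.4, 0.5, 1], [0.6, 0.5, 1], [0.1, 0.4, 1], [0.2, 0.7, 1], [0.3, 0.3, 1],   # omega_1
    [0.4, 0.6, 1], [0.6, 0.2, 1], [0.7, 0.4, 1], [0.8, 0.6, 1], [0.7, 0.5, 1],   # omega_2
])
y = np.array([1, 1, 1, 1, 1, -1, -1, -1, -1, -1], dtype=float)

print(X.T @ X)     # [[2.8, 2.24, 4.8], [2.24, 2.41, 4.7], [4.8, 4.7, 10.0]]
print(X.T @ y)     # [-1.6, 0.1, 0.0]
w_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(w_hat)       # the minimum sum of error squares weight vector [w_1, w_2, w_0]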
24
is a learning machine that tries to predict
the class label
y
of
. In practice, a finite data set
is used
for its training. Let us write
. Observe that:
For some training sets,
the
training may result to good estimates, for some othersthe result may be worse.
The average performance of the classifier can be testedagainst the MSE optimal value, in the mean squaressense, that is:where
D
is the mean over all possible data sets
i
i^
x y E D x g E
D
25
The above quantity decomposes as
E_D\left[ (g(x; D) - E[y \mid x])^2 \right] = \left( E_D[g(x; D)] - E[y \mid x] \right)^2 + E_D\left[ (g(x; D) - E_D[g(x; D)])^2 \right]
The first term is the contribution of the bias and the second term is the contribution of the variance.
For a finite data set, there is a trade-off between the two terms: increasing the bias reduces the variance and vice versa. This is known as the bias-variance dilemma.
Using a complex model results in low bias but in high variance, as one changes from one training set to another. Using a simple model results in high bias but low variance.
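The decomposition can be illustrated with a small Monte Carlo experiment (a sketch with assumed toy settings, not taken from the slides): many training sets D are drawn, a least-squares fit g(x; D) is computed on each, and the bias and variance terms are estimated at a fixed test point for a simple and a complex model.

import numpy as np

rng = np.random.default_rng(0)

def target(x):
    return np.sin(2 * np.pi * x)                 # assumed E[y | x] for this toy problem

x0 = 0.3                                         # test point where bias and variance are measured
for degree in (1, 7):                            # simple (linear) vs. complex (degree-7) model
    preds = []
    for _ in range(500):                         # 500 independent training sets D
        x_tr = rng.uniform(0, 1, 15)
        y_tr = target(x_tr) + rng.normal(0, 0.3, 15)
        coeffs = np.polyfit(x_tr, y_tr, degree)  # least-squares fit g(x; D)
        preds.append(np.polyval(coeffs, x0))
    preds = np.array(preds)
    bias2 = (preds.mean() - target(x0)) ** 2     # (E_D[g(x0; D)] - E[y | x0])^2
    variance = preds.var()                       # E_D[(g(x0; D) - E_D[g(x0; D)])^2]
    print(degree, bias2, variance)               # the complex model: lower bias, higher variance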