Machine Learning: Classification Models & Rosenblatt's Perceptron Algorithm

These lecture notes introduce classification models in machine learning, covering generative models such as Fisher's linear discriminant analysis and Gaussian mixture models, as well as discriminative models such as Rosenblatt's perceptron learning algorithm. The notes also discuss nonlinear extensions and binary classification.



Introduction to Classification

Greg Grudic


Today’s Lecture Goals

  • Introduction to classification
  • Generative Models
    • Fisher (Linear Discriminant Analysis)
    • Gaussian Mixture Models
  • Discriminative Models
    • Rosenblatt's Perceptron Learning Algorithm
  • Nonlinear Extensions

Last Week: Learning Regression Models

  • Collect Training data
  • Build Model: stock value = F(feature space)
  • Make a prediction

[Figure: regression training data — stock value plotted over the feature (input) space as a scatter of points]

This Class: Learning Classification Models

  • Collect Training data
  • Build Model: happy = F(feature space)
  • Make a prediction

[Figure: classification training data in a high-dimensional feature (input) space]


Binary Classification

  • A binary classifier is a mapping from a set of d inputs to a single output which can take on one of TWO values
  • In the most general setting: inputs $\mathbf{x} \in \mathbb{R}^d$, output $y \in \{-1, +1\}$
  • Specifying the output classes as -1 and +1 is arbitrary!
    • Often done as a mathematical convenience


A Binary Classifier

[Diagram: input $\mathbf{x}$ → Classification Model → $\hat{y} \in \{-1, +1\}$]

Given learning data:
$$(\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_N, y_N)$$

A model is constructed:
$$\hat{y} = M(\mathbf{x})$$

The Learning Data

  • Learning algorithms don't care where the data comes from!
  • Here is a toy example from robotics…
    • Inputs from two sonar sensors: $x_1 \in \mathbb{R}$ (sensor 1), $x_2 \in \mathbb{R}$ (sensor 2)
    • Classification output:
      • Robot in Greg's office: y = +1
      • Robot NOT in Greg's office: y = -1

Classification Learning Data…

            x_1        x_2       y
Example 1   0.95013    0.58279   +1
Example 2   0.23114    0.4235    -1
Example 3   0.8913     0.43291   +1
Example 4   0.018504   0.76037   -1
…           …          …         …


The Learning Data

  • Symbolic representation of N learning examples of d-dimensional inputs:

$$\begin{bmatrix} x_{11} & \cdots & x_{1d} & y_1 \\ \vdots & \ddots & \vdots & \vdots \\ x_{N1} & \cdots & x_{Nd} & y_N \end{bmatrix}$$

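One way to hold these learning examples in code is as an N × d input matrix plus a length-N label vector. This is a minimal NumPy sketch (not part of the original notes); the values are the four sonar examples from the table above.

```python
import numpy as np

# Each row of X is one learning example (x_1 = sensor 1, x_2 = sensor 2);
# y holds the corresponding class labels (+1 = in Greg's office, -1 = not).
X = np.array([
    [0.95013,  0.58279],   # Example 1
    [0.23114,  0.4235],    # Example 2
    [0.8913,   0.43291],   # Example 3
    [0.018504, 0.76037],   # Example 4
])
y = np.array([+1, -1, +1, -1])

N, d = X.shape  # N = 4 learning examples, d = 2 input dimensions
```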

Graphical Representation of Classification Training Data

[Figure: scatter plot of the training data in the $(x_1, x_2)$ plane, with one marker type for y = +1 and another for y = -1]

Linear Separating Hyper-Planes

  • How many lines can separate these points?

[Figure: the two classes with a candidate boundary marked "NO!" that fails to separate them]

Linear Separating Hyper-Planes

[Figure: the two classes in the $(x_1, x_2)$ plane separated by a line]

The separating hyperplane is the set of points where
$$\beta_0 + \sum_{i=1}^{d} \beta_i x_i = 0$$
and the two sides of the boundary correspond to the two classes:
$$\beta_0 + \sum_{i=1}^{d} \beta_i x_i < 0 \;\Rightarrow\; y = -1, \qquad \beta_0 + \sum_{i=1}^{d} \beta_i x_i > 0 \;\Rightarrow\; y = +1$$


Linear Separating Hyper-Planes

  • The Model:

$$\hat{y} = M(\mathbf{x}) = \operatorname{sgn}\left(\hat{\beta}_0 + \left(\hat{\beta}_1, \ldots, \hat{\beta}_d\right)^T \mathbf{x}\right)$$

  • Where:

$$\operatorname{sgn}[A] = \begin{cases} +1 & \text{if } A > 0 \\ -1 & \text{otherwise} \end{cases}$$

  • The decision boundary:

$$\hat{\beta}_0 + \left(\hat{\beta}_1, \ldots, \hat{\beta}_d\right)^T \mathbf{x} = 0$$

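As a rough illustration of the model above, here is a minimal Python sketch of $\hat{y} = \operatorname{sgn}(\hat{\beta}_0 + \hat{\boldsymbol{\beta}}^T \mathbf{x})$; the function names and the example hyperplane $x_1 + x_2 - 1 = 0$ are my own, and the betas are assumed to have already been estimated.

```python
import numpy as np

def sgn(a):
    """sgn[a] = +1 if a > 0, and -1 otherwise (as defined above)."""
    return 1 if a > 0 else -1

def predict(beta0, beta, x):
    """Linear classifier: y_hat = sgn(beta0 + beta^T x)."""
    return sgn(beta0 + np.dot(beta, x))

# Hypothetical hyperplane x_1 + x_2 - 1 = 0, i.e. beta0 = -1, beta = (1, 1):
print(predict(-1.0, np.array([1.0, 1.0]), np.array([0.95013, 0.58279])))  # +1
print(predict(-1.0, np.array([1.0, 1.0]), np.array([0.23114, 0.4235])))   # -1
```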

Linear Separating Hyper-Planes

  • The model parameters are: $\left(\hat{\beta}_0, \hat{\beta}_1, \ldots, \hat{\beta}_d\right)$
  • The hat on the betas means that they are estimated from the data
    • In the class notes… sometimes the hat will be there and sometimes it won't!
  • Many different learning algorithms have been proposed for determining $\left(\hat{\beta}_0, \hat{\beta}_1, \ldots, \hat{\beta}_d\right)$

Rosenblatt's Perceptron Learning Algorithm

  • Dates back to the 1950s and is the motivation behind Neural Networks
  • The algorithm for determining $\left(\hat{\beta}_0, \hat{\beta}_1, \ldots, \hat{\beta}_d\right)$:
    • Start with a random hyperplane
    • Incrementally modify the hyperplane such that points that are misclassified move closer to the correct side of the boundary
    • Stop when all learning examples are correctly classified

Rosenblatt's Perceptron Learning Algorithm

  • The algorithm is based on the following property:
    • The signed distance of any point $\mathbf{x}$ to the boundary is proportional to $\hat{\beta}_0 + \left(\hat{\beta}_1, \ldots, \hat{\beta}_d\right)^T \mathbf{x}$
  • Therefore, if $M$ is the set of misclassified learning examples, we can push them closer to the boundary by minimizing the following:

$$D\left(\hat{\beta}_0, \hat{\beta}_1, \ldots, \hat{\beta}_d\right) = -\sum_{i \in M} y_i \left(\hat{\beta}_0 + \left(\hat{\beta}_1, \ldots, \hat{\beta}_d\right)^T \mathbf{x}_i\right)$$


Rosenblatt’s Minimization Function

  • This is classic Machine Learning!
  • First define a cost function in model parameter space:

$$D\left(\hat{\beta}_0, \hat{\beta}_1, \ldots, \hat{\beta}_d\right) = -\sum_{i \in M} y_i \left(\hat{\beta}_0 + \sum_{k=1}^{d} \hat{\beta}_k x_{ik}\right)$$

  • Then find an algorithm that modifies $\left(\hat{\beta}_0, \hat{\beta}_1, \ldots, \hat{\beta}_d\right)$ such that this cost function is minimized
  • One such algorithm is Gradient Descent


Gradient Descent

[Figure: error surface E[w] plotted over the weight space (w0, w1)]

The Gradient Descent Algorithm

$$\hat{\beta}_i \leftarrow \hat{\beta}_i - \rho \frac{\partial D\left(\hat{\beta}_0, \hat{\beta}_1, \ldots, \hat{\beta}_d\right)}{\partial \hat{\beta}_i}$$

where the learning rate is defined by $\rho > 0$.
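
The update rule above can be sketched generically: repeatedly step each parameter against the gradient of the cost, scaled by ρ. The toy quadratic cost below is my own illustration, not an example from the notes.

```python
import numpy as np

def gradient_descent(grad, beta_init, rho=0.1, n_steps=100):
    """Repeatedly apply: beta <- beta - rho * grad(beta)."""
    beta = np.array(beta_init, dtype=float)
    for _ in range(n_steps):
        beta = beta - rho * grad(beta)
    return beta

# Toy cost E[w] = (w0 - 1)^2 + (w1 + 2)^2 with gradient (2(w0 - 1), 2(w1 + 2)):
grad_E = lambda w: np.array([2 * (w[0] - 1), 2 * (w[1] + 2)])
print(gradient_descent(grad_E, [0.0, 0.0]))  # approaches the minimum at (1, -2)
```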

The Gradient Descent Algorithm for the Perceptron

For each misclassified example $i \in M$, the parameters are updated as

$$\begin{pmatrix} \hat{\beta}_0 \\ \hat{\beta}_1 \\ \vdots \\ \hat{\beta}_d \end{pmatrix} \leftarrow \begin{pmatrix} \hat{\beta}_0 \\ \hat{\beta}_1 \\ \vdots \\ \hat{\beta}_d \end{pmatrix} - \rho \begin{pmatrix} -y_i \\ -y_i x_{i1} \\ \vdots \\ -y_i x_{id} \end{pmatrix}$$

using the gradient components

$$\frac{\partial D\left(\hat{\beta}_0, \hat{\beta}_1, \ldots, \hat{\beta}_d\right)}{\partial \hat{\beta}_0} = -\sum_{i \in M} y_i, \qquad \frac{\partial D\left(\hat{\beta}_0, \hat{\beta}_1, \ldots, \hat{\beta}_d\right)}{\partial \hat{\beta}_j} = -\sum_{i \in M} y_i x_{ij}, \quad j = 1, \ldots, d$$

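A minimal NumPy sketch of the perceptron update above (not from the original slides): each misclassified example pushes $(\hat{\beta}_0, \hat{\boldsymbol{\beta}})$ by $\rho \, (y_i, y_i \mathbf{x}_i)$, and training stops once every example is correctly classified. The function name `perceptron_train` and the `max_passes` cap are my own additions; the cap matters because, as discussed below, the algorithm never converges on non-separable data.

```python
import numpy as np

def perceptron_train(X, y, rho=1.0, max_passes=1000):
    """Rosenblatt-style perceptron: adjust (beta0, beta) for each
    misclassified example until all examples are correctly classified
    (or max_passes is exhausted, since non-separable data never converges)."""
    N, d = X.shape
    beta0, beta = 0.0, np.zeros(d)
    for _ in range(max_passes):
        misclassified = 0
        for xi, yi in zip(X, y):
            if yi * (beta0 + np.dot(beta, xi)) <= 0:   # wrong side (or on boundary)
                beta0 += rho * yi                      # beta0  <- beta0  - rho * (-y_i)
                beta  += rho * yi * xi                 # beta_j <- beta_j - rho * (-y_i x_ij)
                misclassified += 1
        if misclassified == 0:                         # all examples correct: stop
            break
    return beta0, beta

# Usage on the four sonar examples from earlier (linearly separable):
X = np.array([[0.95013, 0.58279], [0.23114, 0.4235],
              [0.8913, 0.43291], [0.018504, 0.76037]])
y = np.array([+1, -1, +1, -1])
beta0, beta = perceptron_train(X, y)
print(np.sign(beta0 + X @ beta))  # reproduces y once a separating plane is found
```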

The Good Theoretical Properties of the Perceptron Algorithm

  • If a solution exists, the algorithm will always converge in a finite number of steps!
  • Question: Does a solution always exist?


Linearly Separable Data

  • Which of these datasets are separable by a linear boundary?

[Figure: two datasets, (a) and (b); one of them is NOT linearly separable!]

Bad Theoretical Properties of the Perceptron Algorithm

  • If the data is not linearly separable, the algorithm cycles forever!
    • It cannot converge!
    • This property stopped research in this area between 1968 and 1984…
      • Perceptrons, Minsky and Papert, 1969
  • There are infinitely many solutions
  • When the data is linearly separable, the number of steps to converge can be very large (it depends on the size of the gap between the classes)


What about Nonlinear Data?

  • Data that is not linearly separable is called nonlinear data
  • Nonlinear data can often be mapped into a nonlinear space where it is linearly separable


Nonlinear Models

  • The Linear Model:

$$\hat{y} = M(\mathbf{x}) = \operatorname{sgn}\left(\hat{\beta}_0 + \sum_{i=1}^{d} \hat{\beta}_i x_i\right)$$

  • The Nonlinear (basis function) Model:

$$\hat{y} = M(\mathbf{x}) = \operatorname{sgn}\left(\hat{\beta}_0 + \sum_{i=1}^{k} \hat{\beta}_i \phi_i(\mathbf{x})\right)$$

  • Examples of Nonlinear Basis Functions:

$$\phi_1(\mathbf{x}) = x_1^2 \qquad \phi_2(\mathbf{x}) = x_2^2 \qquad \phi_3(\mathbf{x}) = x_1 x_2 \qquad \phi_4(\mathbf{x}) = \sin(x_5)$$
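
To make the basis function model concrete, here is a small sketch (my own, not from the notes) that maps a 2-D input through a few of the basis functions listed above and then reuses the same linear/sgn machinery.

```python
import numpy as np

def phi(x):
    """One possible basis expansion: phi(x) = (x1^2, x2^2, x1*x2)."""
    x1, x2 = x
    return np.array([x1**2, x2**2, x1 * x2])

def predict_nonlinear(beta0, beta, x):
    """Nonlinear model: y_hat = sgn(beta0 + sum_i beta_i * phi_i(x))."""
    a = beta0 + np.dot(beta, phi(x))
    return 1 if a > 0 else -1

# The linear machinery is unchanged; only the input is mapped through phi first.
print(predict_nonlinear(-0.5, np.array([1.0, 1.0, 0.0]), np.array([0.9, 0.6])))  # +1
```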

Linear Separating Hyper-Planes in Nonlinear Basis Function Space

[Figure: the two classes plotted against the basis functions $(\phi_1, \phi_2)$, separated by a line]

The separating hyperplane in basis function space satisfies
$$\beta_0 + \sum_{i=1}^{k} \beta_i \phi_i = 0,$$
with $\beta_0 + \sum_{i=1}^{k} \beta_i \phi_i < 0 \;\Rightarrow\; y = -1$ and $\beta_0 + \sum_{i=1}^{k} \beta_i \phi_i > 0 \;\Rightarrow\; y = +1$.

An Example

[Figure: data that is not linearly separable in the original $(x_1, x_2)$ space is mapped by $\Phi$ into the space $\phi_1 = x_1^2$, $\phi_2 = x_2^2$, where the classes y = +1 and y = -1 become linearly separable]

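The effect in the figure can be reproduced on a hypothetical ring-shaped dataset (my own construction): classes that only a circle can separate in $(x_1, x_2)$ become separable by a straight line after the map $\Phi(\mathbf{x}) = (x_1^2, x_2^2)$.

```python
import numpy as np

# Hypothetical data: class +1 on an inner ring, class -1 on an outer ring.
rng = np.random.default_rng(0)
angles = rng.uniform(0, 2 * np.pi, 20)
r_pos, r_neg = 0.5, 0.9
X = np.vstack([np.c_[r_pos * np.cos(angles), r_pos * np.sin(angles)],
               np.c_[r_neg * np.cos(angles), r_neg * np.sin(angles)]])
y = np.array([+1] * 20 + [-1] * 20)

# Phi(x) = (x1^2, x2^2): the circular boundary x1^2 + x2^2 = const
# becomes the straight line phi1 + phi2 = const.
Phi = X ** 2
print(np.all((Phi.sum(axis=1) < 0.5) == (y == +1)))  # True: a line separates the classes
```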

Kernels as Nonlinear Transformations

  • Polynomial:

$$K(\mathbf{x}_i, \mathbf{x}_j) = \left(\left\langle \mathbf{x}_i, \mathbf{x}_j \right\rangle + q\right)^k$$

  • Sigmoid:

$$K(\mathbf{x}_i, \mathbf{x}_j) = \tanh\left(\kappa \left\langle \mathbf{x}_i, \mathbf{x}_j \right\rangle + \theta\right)$$

  • Gaussian or Radial Basis Function (RBF):

$$K(\mathbf{x}_i, \mathbf{x}_j) = \exp\left(-\frac{\left\| \mathbf{x}_i - \mathbf{x}_j \right\|^2}{\sigma}\right)$$



 

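These three kernels might be written as follows (a sketch, with q, k, kappa, theta, and sigma as the kernel parameters; the default values are arbitrary).

```python
import numpy as np

def polynomial_kernel(xi, xj, q=1.0, k=2):
    return (np.dot(xi, xj) + q) ** k

def sigmoid_kernel(xi, xj, kappa=1.0, theta=0.0):
    return np.tanh(kappa * np.dot(xi, xj) + theta)

def rbf_kernel(xi, xj, sigma=1.0):
    return np.exp(-np.sum((xi - xj) ** 2) / sigma)

xi, xj = np.array([0.95013, 0.58279]), np.array([0.23114, 0.4235])
print(polynomial_kernel(xi, xj), sigmoid_kernel(xi, xj), rbf_kernel(xi, xj))
```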

The Kernel Model

Given training data $(\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_N, y_N)$, the model is

$$\hat{y} = M(\mathbf{x}) = \operatorname{sgn}\left(\hat{\beta}_0 + \sum_{i=1}^{N} \hat{\beta}_i K(\mathbf{x}, \mathbf{x}_i)\right)$$

  • The number of basis functions equals the number of training examples!
    • Unless some of the beta's get set to zero…
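
A minimal sketch of evaluating this kernel model (my own illustration; the beta values below are hypothetical placeholders rather than learned coefficients).

```python
import numpy as np

def kernel_predict(beta0, beta, X_train, x, kernel):
    """Kernel model: y_hat = sgn(beta0 + sum_i beta_i * K(x, x_i)),
    with one basis function per training example."""
    a = beta0 + sum(b_i * kernel(x, x_i) for b_i, x_i in zip(beta, X_train))
    return 1 if a > 0 else -1

# Usage with an RBF kernel and placeholder beta values:
rbf = lambda u, v, sigma=1.0: np.exp(-np.sum((u - v) ** 2) / sigma)
X_train = np.array([[0.95013, 0.58279], [0.23114, 0.4235]])
beta = np.array([1.0, -1.0])          # one coefficient per training example
print(kernel_predict(0.0, beta, X_train, np.array([0.9, 0.6]), rbf))  # +1
```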

Gram (Kernel) Matrix

Given training data $(\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_N, y_N)$, the Gram (kernel) matrix is

$$K = \begin{bmatrix} K(\mathbf{x}_1, \mathbf{x}_1) & \cdots & K(\mathbf{x}_1, \mathbf{x}_N) \\ \vdots & \ddots & \vdots \\ K(\mathbf{x}_N, \mathbf{x}_1) & \cdots & K(\mathbf{x}_N, \mathbf{x}_N) \end{bmatrix}$$

Properties:
  • Positive definite matrix
  • Symmetric
  • Positive on diagonal
  • N by N
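
A short sketch (not from the notes) of building the N-by-N Gram matrix and checking the properties listed above, using an RBF kernel on the earlier sonar examples.

```python
import numpy as np

def gram_matrix(X, kernel):
    """Gram (kernel) matrix: K[i, j] = kernel(x_i, x_j)."""
    N = X.shape[0]
    K = np.empty((N, N))
    for i in range(N):
        for j in range(N):
            K[i, j] = kernel(X[i], X[j])
    return K

rbf = lambda u, v, sigma=1.0: np.exp(-np.sum((u - v) ** 2) / sigma)
X = np.array([[0.95013, 0.58279], [0.23114, 0.4235],
              [0.8913, 0.43291], [0.018504, 0.76037]])
K = gram_matrix(X, rbf)
print(np.allclose(K, K.T))                 # symmetric
print(np.all(np.diag(K) > 0))              # positive on the diagonal
print(np.all(np.linalg.eigvalsh(K) > 0))   # positive definite for these distinct points
```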

Picking a Model Structure?

  • How do you pick the Kernels?
    • Kernel parameters
  • These are called learning parameters or hyperparameters
  • Two approaches to choosing learning parameters:
    • Bayesian
      • Learning parameters must maximize the probability of correct classification based on prior biases
    • Frequentist
      • Use validation data (see the sketch below)
  • More on learning parameter selection later
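
One way the frequentist approach might look in code: hold out validation data and keep the kernel parameter with the fewest validation errors. This is a hedged sketch; `select_sigma`, `train_model`, and `count_errors` are hypothetical names standing in for any trainer and error counter (for example, the kernel model sketched earlier).

```python
import numpy as np

def select_sigma(X_train, y_train, X_val, y_val, candidate_sigmas,
                 train_model, count_errors):
    """Pick the kernel parameter whose model makes the fewest validation errors."""
    best_sigma, best_errors = None, np.inf
    for sigma in candidate_sigmas:
        model = train_model(X_train, y_train, sigma)     # fit on training data only
        errors = count_errors(model, X_val, y_val)       # evaluate on held-out data
        if errors < best_errors:
            best_sigma, best_errors = sigma, errors
    return best_sigma
```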


Perceptron Algorithm Convergence

  • Two problems:
    • No convergence when the data is not separable in basis function space
    • Gives infinitely many solutions when the data is separable
  • Can we modify the algorithm to fix these problems?