Perceptron Algorithm: Weights, Training Methods, and Predictions, Study notes of Chemistry

An in-depth explanation of the perceptron algorithm, including the concept of weights, different training methods, and predictions based on the input-output data. It covers various examples and iterations, highlighting the importance of convergence and the impact of different weights on the decision boundary.

Uploaded on 02/13/2009

koofers-user-l86

Perceptron -- with "OR" example. Instructor: Nam Sun Wang

A Threshold Unit (Perceptron)...

  • Has either binary (-1 or 1, etc.) or real inputs $x_j$. (If the input is binary 0/1, change the notation to -1/1. Otherwise, the weight has absolutely no effect on the 0 input.)
  • Sums up weighted inputs: $\sigma = \sum_j w_j x_j$.
  • Tests to see if $\sum_j w_j x_j > a$, where $a$ is the threshold. (We can simulate the threshold with $x_0 = 1$ and $w_0 = -a$, that is, an extra dummy input of unity with a weight of $-a$. This is also commonly called a bias, and it is equivalent to the constant part or intercept in linear regression.)
  • Yields a binary output (true or false, 0 or 1, -1 or 1, hit or miss, etc.). If $\sum_j w_j x_j > a$, then 1; otherwise, 0.

Training Procedure #1. If classification is correct, do nothing. If we get a false positive, set the new w to w-x. On the other hand, if we get a false negative, set the new w to w+x. For y={-1, 1}, we express this procedure algebraically as:

$$\Delta w_j = y_{\text{true}} \, x_j \, (y \neq y_{\text{true}})$$

where the comparison $(y \neq y_{\text{true}})$ evaluates to 1 when the prediction is wrong and to 0 otherwise.

To avoid divergence, we update the weights conservatively by taking a fractional step ε instead of a full step; ε is commonly known as the learning rate. We can start training with a small value of ε, then later increase it to unity near convergence.

$$\Delta w_j = \varepsilon \, y_{\text{true}} \, x_j \, (y \neq y_{\text{true}})$$
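The threshold unit and Training Procedure #1 can be sketched in a few lines of Python. This is a minimal illustration, not the worksheet's own code: the names `predict` and `update` are made up here, the output coding is assumed to be -1/1, and the threshold is assumed to be folded into the dummy input `x[0] = 1`.

```python
def predict(w, x):
    """Threshold unit: +1 if the weighted sum is positive, else -1.

    x[0] is assumed to be the dummy input of unity, so w[0] plays
    the role of -a (the bias).
    """
    s = sum(wj * xj for wj, xj in zip(w, x))
    return 1 if s > 0 else -1


def update(w, x, y_true, eps=1.0):
    """One step of Training Procedure #1 with learning rate eps."""
    if predict(w, x) == y_true:
        return list(w)                       # correct: do nothing
    # false negative (y_true = +1): w + eps*x
    # false positive (y_true = -1): w - eps*x
    return [wj + eps * y_true * xj for wj, xj in zip(w, x)]
```

With `eps=1.0` the function takes the full step w±x; a smaller `eps` gives the conservative fractional step.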

Training Procedure #2 (Steepest Descent Algorithm). We can derive the training procedure by minimizing the error function over all i training patterns.

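$$\text{Minimize} \quad sse(w) = E(w) = \sum_{i=0}^{N} error_i^2 = \sum_{i=0}^{N} \left( y_i - f(x_i \cdot w) \right)^2$$

$$\frac{d}{dw} sse = -2 \sum_{i=0}^{N} \left( y_i - f(x_i \cdot w) \right) x_i$$

Gradient of error:

$$\nabla E = 2 \sum_{i=0}^{N} \left( f(x_i \cdot w) - y_i \right) x_i$$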

The following is the steepest descent algorithm that searches in the negative gradient direction of error for updating the jth weight after sequentially presenting each example i.

$$\Delta w = -\varepsilon \left( f(x_i \cdot w) - y_i \right) x_i$$

where ε is a non-negative scalar learning rate. We can either 1) divide the threshold expression by a constant without changing its net effect, e.g., $\sum_j w_j x_j > a$ for j={1, ..., m} becomes $\sum_j (w_j/a) x_j > 1$, or 2) convert it to an un-thresholded form $\sum_j (w_j/\text{const}) x_j > 0$ for j={0, 1, ..., m}. In practice, because f is a step function, we can force f(x⋅w) to be simply x⋅w in weight calculation. This substituted form is more restrictive and is a special case of f(x⋅w)=±1=x⋅w.

$$\Delta w = -\varepsilon \left( x_i \cdot w - y_i \right) x_i$$
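Both forms of the steepest descent step can be written as one small function. The sketch below is illustrative only: `delta_w`, `sigma`, and `step` are hypothetical names, and the flag `use_step` (an assumption of this sketch) switches between the thresholded output f(x⋅w) and the substituted linear output x⋅w.

```python
def sigma(w, x):
    """Weighted sum x.w, with the dummy input assumed to be in x[0]."""
    return sum(wj * xj for wj, xj in zip(w, x))


def step(s):
    """Step output in -1/1 coding."""
    return 1 if s > 0 else -1


def delta_w(w, x, y, eps, use_step=True):
    """Weight change for one example: -eps * (output - y) * x.

    use_step=True  uses f(x.w) as the output (thresholded form);
    use_step=False uses x.w itself (the substituted form).
    """
    s = sigma(w, x)
    out = step(s) if use_step else s
    return [-eps * (out - y) * xj for xj in x]
```

Note that with `use_step=True` the change is zero whenever the example is already classified correctly, whereas the linear form keeps adjusting the weights until x⋅w equals y.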

We can either update the weights immediately after presenting each sample, or we can wait until the entire set of samples has gone through, then calculate the weight change by summing up the errors in that batch.

Training Procedure #3 (Linear Regression or Pseudo-Inverse Method). We realize that y is simply a linear combination of the $x_j$, each weighted by $w_j$. This is exactly the same as in plain linear regression. As opposed to the above algorithms, this is a batch procedure where we present all training examples simultaneously rather than sequentially.

$$w = \left( x^T x \right)^{-1} x^T y$$

Of course, as in multiple linear regression, if the independent variables are highly correlated, the inverse of $x^T x$ is very unstable or nonexistent.
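The pseudo-inverse formula can be sketched in plain Python by forming the normal equations and solving them. This is a minimal sketch under stated assumptions: `solve` and `pseudo_inverse_weights` are hypothetical helper names, the elimination routine is a generic Gauss-Jordan solver rather than anything from the worksheet, and the singularity tolerance `1e-12` is an arbitrary choice.

```python
def solve(a, b):
    """Solve a.w = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [bi] for row, bi in zip(a, b)]    # augmented matrix
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(m[r][c]))
        if abs(m[p][c]) < 1e-12:                    # assumed tolerance
            raise ValueError("x^T x is singular or nearly so")
        m[c], m[p] = m[p], m[c]
        for r in range(n):
            if r != c:
                factor = m[r][c] / m[c][c]
                m[r] = [u - factor * v for u, v in zip(m[r], m[c])]
    return [m[i][n] / m[i][i] for i in range(n)]


def pseudo_inverse_weights(xs, ys):
    """w = (x^T x)^-1 x^T y for input rows xs and targets ys."""
    n = len(xs[0])
    xtx = [[sum(x[i] * x[j] for x in xs) for j in range(n)] for i in range(n)]
    xty = [sum(x[i] * y for x, y in zip(xs, ys)) for i in range(n)]
    return solve(xtx, xty)
```

Calling `pseudo_inverse_weights([[1, 1], [1, 1]], [1, 1])` raises `ValueError`, because two identical (perfectly correlated) input columns make $x^T x$ singular, which is the failure mode described above.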

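Example: An OR function.

Number of inputs: $j := 0 .. 2$

Maximum number of training patterns: $i := 0 .. 10$

Two-dimensional inputs $x$ and one-dimensional output $y$:

$$x_{i,1} = (-1,\ -1,\ 1,\ 1), \qquad x_{i,2} = (-1,\ 1,\ -1,\ 1), \qquad y_i = (-1,\ 1,\ 1,\ 1)$$

Number of training patterns: $N := \text{last}(y)$, $i := 0 .. N$

Add a dummy input (in lieu of threshold): $x_{i,0} := 1$

Output function: $\sigma(x, w) := x \cdot w$

Element-wise notation: $\sigma(x, w) = \sum_{j=0}^{2} w_j x_j$

$$f(x, w) := \text{if}\left( \sigma(x, w) > 0,\ 1,\ -1 \right)$$

Assign initial weights: $w_j := \text{rnd}(2) - 1$

Present different examples and iterate (hopefully until convergence): $k := 0 .. 10$

$$w^{\langle k (N+1) + i + 1 \rangle} := w^{\langle k (N+1) + i \rangle} + y_i \, (x^T)^{\langle i \rangle} \left( f\!\left( \left( (x^T)^{\langle i \rangle} \right)^T, w^{\langle k (N+1) + i \rangle} \right) \neq y_i \right)$$

Results after iteration: w = [weight matrix, one column per update; values not preserved]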
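The OR iteration with Training Procedure #1 can be reproduced with a short Python loop. This is a sketch under assumptions: the -1/1 coding of the OR truth table and the ordering of the four patterns are assumed, `train_or` is a hypothetical name, and `random.Random(seed).uniform(-1, 1)` stands in for the worksheet's `rnd(2) - 1`.

```python
import random

# OR truth table in -1/1 coding; the first entry of each row is the dummy input.
X = [(1, -1, -1), (1, -1, 1), (1, 1, -1), (1, 1, 1)]
Y = [-1, 1, 1, 1]


def f(x, w):
    """Thresholded output in -1/1 coding."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) > 0 else -1


def train_or(epochs=11, seed=0):
    """Sequential full-step updates; returns the whole weight history."""
    rng = random.Random(seed)
    w = [rng.uniform(-1.0, 1.0) for _ in range(3)]   # stand-in for rnd(2) - 1
    history = [w]
    for _ in range(epochs):                          # k = 0 .. epochs-1
        for x, y in zip(X, Y):                       # i = 0 .. N
            if f(x, w) != y:
                w = [wj + y * xj for wj, xj in zip(w, x)]
            history.append(w)                        # one column per update
    return history
```

The default of 11 sweeps mirrors k = 0 .. 10. Because OR is linearly separable, the updates stop after a finite number of mistakes; a few extra sweeps (for example `epochs=20`) leave a comfortable margin for any starting weights in [-1, 1].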

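Example. -- Training Method #2 with x⋅w.

Reset weight: $w := 0$, $w_j := \text{rnd}(2) - 1$

Present different examples and iterate (hopefully until convergence): $k := 0 .. 10$

$$\Delta w = -\varepsilon \left( x_i \cdot w - y_i \right) x_i, \qquad \varepsilon := 0.…$$

$$w^{\langle k (N+1) + i + 1 \rangle} := w^{\langle k (N+1) + i \rangle} - \varepsilon \, (x^T)^{\langle i \rangle} \left( \left( (x^T)^{\langle i \rangle} \right)^T \cdot w^{\langle k (N+1) + i \rangle} - y_i \right)$$

Results after iteration: w = [weight matrix, one column per update; values not preserved]

The last weight: $\omega := w^{\langle \text{cols}(w) - 1 \rangle}$, ω = [values not preserved]

[Plot: Examples and Perceptron Weights]

Prediction for different cases: f((1, -1, -1), ω) = -1, f((1, -1, 1), ω) = 1, f((1, 1, -1), ω) = 1, f((1, 1, 1), ω) = 1 ← compare → y = (-1, 1, 1, 1)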

Note that the answer is not unique. Many combinations of weights give rise to the same result. Here is another set of weights that works just as well.

The next-to-last weight: $\omega := w^{\langle \text{cols}(w) - 2 \rangle}$, ω = [values not preserved]

[Plot: Examples and Perceptron Weights]

Prediction for different cases: f((1, -1, -1), ω) = -1, f((1, -1, 1), ω) = 1, f((1, 1, -1), ω) = 1, f((1, 1, 1), ω) = 1 ← compare → y = (-1, 1, 1, 1)

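Example. -- Training Method #2 with f(x,w)

Present different examples and iterate (hopefully until convergence): $k := 0 .. 10$

$$w^{\langle k (N+1) + i + 1 \rangle} := w^{\langle k (N+1) + i \rangle} - \varepsilon \, (x^T)^{\langle i \rangle} \left( f\!\left( \left( (x^T)^{\langle i \rangle} \right)^T, w^{\langle k (N+1) + i \rangle} \right) - y_i \right)$$

Results after iteration: w = [weight matrix, one column per update; values not preserved]

The last weight: $\omega := w^{\langle \text{cols}(w) - 1 \rangle}$, ω = [values not preserved]

[Plot: Examples and Perceptron Weights]

Prediction for different cases: f((1, -1, -1), ω) = -1, f((1, -1, 1), ω) = 1, f((1, 1, -1), ω) = 1, f((1, 1, 1), ω) = 1 ← compare → y = (-1, 1, 1, 1)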

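Example. -- Training Method #2 with f(x,w) but with batch updating of weights.

Present different examples and iterate (hopefully until convergence): $k := 0 .. 10$

$$w^{\langle k + 1 \rangle} := w^{\langle k \rangle} - \varepsilon \sum_{i=0}^{N} (x^T)^{\langle i \rangle} \left( f\!\left( \left( (x^T)^{\langle i \rangle} \right)^T, w^{\langle k \rangle} \right) - y_i \right)$$

Results after iteration: w = [weight matrix, one column per batch update; values not preserved]

The last weight: $\omega := w^{\langle 10 \rangle}$, ω = [values not preserved] ← Same set of weights as in the last problem, but converged quickly.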
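The batch variant differs from the sequential one only in that the error terms are summed over all patterns before the weights move. A minimal Python sketch follows; `batch_step` is a hypothetical name, the OR data in -1/1 coding and its ordering are assumed, and the learning rate is left to the caller since its value is not given above.

```python
# OR truth table in -1/1 coding; the first entry of each row is the dummy input.
X = [(1, -1, -1), (1, -1, 1), (1, 1, -1), (1, 1, 1)]
Y = [-1, 1, 1, 1]


def f(x, w):
    """Thresholded output in -1/1 coding."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) > 0 else -1


def batch_step(w, eps):
    """One batch update: w - eps * sum_i (f(x_i, w) - y_i) * x_i."""
    grad = [0.0] * len(w)
    for x, y in zip(X, Y):
        err = f(x, w) - y                 # 0 when the pattern is correct
        for j, xj in enumerate(x):
            grad[j] += err * xj
    return [wj - eps * gj for wj, gj in zip(w, grad)]
```

Once every pattern is classified correctly, all error terms vanish and further calls to `batch_step` return the same weights, so repeating the step a fixed number of times is harmless.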

Example. -- Output y is binary in 0/1 instead of -1/1. Training Method #2 with x⋅w.

Maximum number of training patterns: $i := 0 .. 10$

Reset input and output variables: $x := 0$, $y := 0$

Two-dimensional inputs $x$ and one-dimensional output $y$:

$$x_{i,1} = (-1,\ -1,\ 1,\ 1), \qquad x_{i,2} = (-1,\ 1,\ -1,\ 1), \qquad y_i = (0,\ 1,\ 1,\ 1)$$

Number of training patterns: $N := \text{last}(y)$, $i := 0 .. N$

Add a dummy input (in lieu of threshold): $x_{i,0} := 1$

Output function: $\sigma(x, w) := x \cdot w$

$$f(x, w) := \text{if}\left( \sigma(x, w) > 0,\ 1,\ 0 \right)$$

Reset weight: $w := 0$, $w_j := \text{rnd}(2) - 1$

Present different examples and iterate (hopefully until convergence): $k := 0 .. 10$

$$w^{\langle k (N+1) + i + 1 \rangle} := w^{\langle k (N+1) + i \rangle} - \varepsilon \, (x^T)^{\langle i \rangle} \left( \left( (x^T)^{\langle i \rangle} \right)^T \cdot w^{\langle k (N+1) + i \rangle} - y_i \right)$$

Results after iteration: w = [weight matrix, one column per update; values not preserved]

The last weight: $\omega := w^{\langle \text{cols}(w) - 1 \rangle}$, ω = [values not preserved]

[Plot: Examples and Perceptron Weights]

Prediction for different cases: f((1, -1, -1), ω) = 1, f((1, -1, 1), ω) = 1, f((1, 1, -1), ω) = 1, f((1, 1, 1), ω) = 1 ← compare → y = (0, 1, 1, 1)

No good! The approximation of f(x⋅w) with x⋅w is valid only when f(x⋅w)=±1=x⋅w. Now, f={0 or 1} violates this assumption.

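Example. -- Output y is binary in 0/1 instead of -1/1. Training Method #2 with f(x,w)

Output function: $\sigma(x, w) := x \cdot w$

$$f(x, w) := \text{if}\left( \sigma(x, w) > 0,\ 1,\ 0 \right)$$

Reset weight: $w := 0$, $w_j := \text{rnd}(2) - 1$

Present different examples and iterate (hopefully until convergence): $k := 0 .. 10$

$$w^{\langle k (N+1) + i + 1 \rangle} := w^{\langle k (N+1) + i \rangle} - \varepsilon \, (x^T)^{\langle i \rangle} \left( f\!\left( \left( (x^T)^{\langle i \rangle} \right)^T, w^{\langle k (N+1) + i \rangle} \right) - y_i \right)$$

Results after iteration: w = [weight matrix, one column per update; values not preserved]

The last weight: $\omega := w^{\langle \text{cols}(w) - 1 \rangle}$, ω = [values not preserved]

[Plot: Examples and Perceptron Weights]

Prediction for different cases: f((1, -1, -1), ω) = 0, f((1, -1, 1), ω) = 1, f((1, 1, -1), ω) = 1, f((1, 1, 1), ω) = 1 ← compare → y = (0, 1, 1, 1)

Now, the answer is O.K.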

Example. -- Output y is binary in 0/1 instead of -1/1. Training Method #3 with Pseudo-Inverse.

$$\omega := 0, \qquad \omega := \left( x^T x \right)^{-1} x^T y$$

ω = [values not preserved]

[Plot: Examples and Perceptron Weights]

Prediction for different cases: f((1, -1, -1), ω) = 1, f((1, -1, 1), ω) = 1, f((1, 1, -1), ω) = 1, f((1, 1, 1), ω) = 1 ← compare → y = (0, 1, 1, 1)

No good!
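The failure of the pseudo-inverse with 0/1 outputs can be checked by hand. The worked calculation below is a sketch that assumes the four input patterns are (-1, -1), (-1, 1), (1, -1), (1, 1) in that order, each with the dummy input of unity prepended, and y = (0, 1, 1, 1); the ordering is an assumption, but the resulting weights do not depend on it.

$$x = \begin{pmatrix} 1 & -1 & -1 \\ 1 & -1 & 1 \\ 1 & 1 & -1 \\ 1 & 1 & 1 \end{pmatrix}, \qquad x^T x = 4 I, \qquad x^T y = \begin{pmatrix} 3 \\ 1 \\ 1 \end{pmatrix}$$

$$\omega = \left( x^T x \right)^{-1} x^T y = \begin{pmatrix} 0.75 \\ 0.25 \\ 0.25 \end{pmatrix}$$

$$\sigma\left( (1, -1, -1), \omega \right) = 0.75 - 0.25 - 0.25 = 0.25 > 0 \;\Rightarrow\; f = 1 \neq y_0 = 0$$

The least-squares weights fit the 0/1 targets as numbers, so the fitted value for the first pattern lands at 0.25, on the wrong side of a threshold at 0.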

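Example: An XOR function trained with the batch method #2.

Two-dimensional inputs $x$ and one-dimensional output $y$:

$$x_{i,1} = (-1,\ -1,\ 1,\ 1), \qquad x_{i,2} = (-1,\ 1,\ -1,\ 1), \qquad y_i = (-1,\ 1,\ 1,\ -1)$$

Present different examples and iterate (hopefully until convergence): $k := 0 .. 10$

$$w^{\langle k + 1 \rangle} := w^{\langle k \rangle} - \varepsilon \sum_{i=0}^{N} (x^T)^{\langle i \rangle} \left( f\!\left( \left( (x^T)^{\langle i \rangle} \right)^T, w^{\langle k \rangle} \right) - y_i \right)$$

Results after iteration: w = [weight matrix, one column per batch update; values not preserved]

The last weight: $\omega := w^{\langle \text{cols}(w) - 1 \rangle}$, ω = [values not preserved]

[Plot: Examples and Perceptron Weights]

Prediction for different cases: f((1, -1, -1), ω), f((1, -1, 1), ω), f((1, 1, -1), ω), f((1, 1, 1), ω) [predicted values not preserved] ← compare → y = (-1, 1, 1, -1)

No good!

The next-to-last weight: $\omega := w^{\langle \text{cols}(w) - 2 \rangle}$, ω = [values not preserved]

[Plot: Examples and Perceptron Weights]

Prediction for different cases: f((1, -1, -1), ω), f((1, -1, 1), ω), f((1, 1, -1), ω), f((1, 1, 1), ω) [predicted values not preserved] ← compare → y = (-1, 1, 1, -1)

No good!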

The weights are oscillating. Like the first set of weights above, the second set of weights also fails to yield the correct answers. It is clear that one perceptron with the original unadulterated input variables $x^{\langle 1 \rangle}$ and $x^{\langle 2 \rangle}$ alone cannot adequately divide the $x^{\langle 1 \rangle}$-$x^{\langle 2 \rangle}$ plane into regions where the output is either +1 or -1 such that the patterns match the training data. In other words, one perceptron cannot learn the XOR function, because we need at least two straight lines to separate the +1 and -1 outputs, although it may be possible to do so with a nonlinear curve. Let us generate additional variables consisting of auto- and cross-products of $x^{\langle 1 \rangle}$ and $x^{\langle 2 \rangle}$.

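$$x^{\langle 3 \rangle} := \left( x^{\langle 1 \rangle} \right)^2, \qquad x^{\langle 4 \rangle} := x^{\langle 1 \rangle} \cdot x^{\langle 2 \rangle}, \qquad x^{\langle 5 \rangle} := \left( x^{\langle 2 \rangle} \right)^2, \qquad j := 0 .. 5$$

Likewise, let us enlarge the weight vector accordingly and assign initial weights.

$$w := 0, \qquad w_j := \text{rnd}(2) - 1$$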

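Present different examples and iterate (hopefully until convergence): $k := 0 .. 10$

$$w^{\langle k + 1 \rangle} := w^{\langle k \rangle} - \varepsilon \sum_{i=0}^{N} (x^T)^{\langle i \rangle} \left( f\!\left( \left( (x^T)^{\langle i \rangle} \right)^T, w^{\langle k \rangle} \right) - y_i \right)$$

Results after iteration: w = [weight matrix, one column per batch update; values not preserved]

The last weight: $\omega := w^{\langle \text{cols}(w) - 1 \rangle}$, $\omega^T = (1.068 \quad 0.302 \quad 0.314 \quad 0.167 \quad 1.16 \quad 0.352)$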
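To see why the added product terms make XOR learnable, it helps to check a weight vector chosen by hand. The sketch below is illustrative only: `expand` and `w_hand` are hypothetical names, the -1/1 coding of XOR is assumed, and the hand-picked weights are not the trained values listed above.

```python
def expand(x1, x2):
    """Return (1, x1, x2, x1^2, x1*x2, x2^2): dummy input plus products."""
    return (1, x1, x2, x1 * x1, x1 * x2, x2 * x2)


def f(x, w):
    """Thresholded output in -1/1 coding."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) > 0 else -1


# XOR in -1/1 coding: the output is +1 only when the inputs differ.
XOR = {(-1, -1): -1, (-1, 1): 1, (1, -1): 1, (1, 1): -1}

# Hypothetical hand-picked weights: only the cross-product term is used,
# so the unit outputs the sign of -x1*x2.
w_hand = (0, 0, 0, 0, -1, 0)

predictions = {p: f(expand(*p), w_hand) for p in XOR}
```

In the enlarged input space the four patterns are linearly separable, since the cross-product x1⋅x2 alone already distinguishes the two classes; a single threshold unit on the six inputs is therefore enough.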

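The following equation describes the curve in the $x^{\langle 1 \rangle}$-$x^{\langle 2 \rangle}$ plane defined by the weights of the perceptron.

$$w_5 x_2^2 + w_4 x_1 x_2 + w_3 x_1^2 + w_2 x_2 + w_1 x_1 + w_0 = 0$$

... mark $x_2$ and choose |Symbolic|Solve for|

$$x_{2a}(x_1, w) := \frac{1}{2 w_5} \left( -w_4 x_1 - w_2 + \sqrt{ \left( w_4 x_1 + w_2 \right)^2 - 4 w_5 \left( w_3 x_1^2 + w_1 x_1 + w_0 \right) } \right)$$

$$x_{2b}(x_1, w) := \frac{1}{2 w_5} \left( -w_4 x_1 - w_2 - \sqrt{ \left( w_4 x_1 + w_2 \right)^2 - 4 w_5 \left( w_3 x_1^2 + w_1 x_1 + w_0 \right) } \right)$$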
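The two branches of the boundary curve follow from the quadratic formula in $x_2$, and can be evaluated numerically as sketched below. `boundary_x2` is a hypothetical name; returning `None` when $w_5 = 0$ or when the discriminant is negative is a choice made for this sketch, not something the worksheet specifies.

```python
import math


def boundary_x2(x1, w):
    """Return the two x2 roots of the quadratic boundary, or None.

    The boundary is w5*x2^2 + (w4*x1 + w2)*x2 + (w3*x1^2 + w1*x1 + w0) = 0.
    """
    a = w[5]
    b = w[4] * x1 + w[2]
    c = w[3] * x1 ** 2 + w[1] * x1 + w[0]
    disc = b * b - 4 * a * c
    if a == 0 or disc < 0:
        return None                       # no quadratic term, or no real root
    root = math.sqrt(disc)
    return ((-b + root) / (2 * a), (-b - root) / (2 * a))
```

For example, weights `(-1, 0, 0, 0, 0, 1)` describe the curve $x_2^2 - 1 = 0$, so the two branches are the horizontal lines $x_2 = 1$ and $x_2 = -1$ for every $x_1$.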