






An in-depth explanation of the perceptron algorithm, including the concept of weights, different training methods, and predictions based on the input-output data. It covers various examples and iterations, highlighting the importance of convergence and the impact of different weights on the decision boundary.







Perceptron -- with "OR" example. Instructor: Nam Sun Wang
A Threshold Unit (Perceptron)...
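To avoid divergence, we update the weights conservatively by taking a fractional step ε instead of a full step. This ε is commonly known as the learning rate. We can start training with a small value of ε, then later increase it to unity near convergence.
$$\Delta w_j = \varepsilon \, y_{\text{true}} \, x_j \, (y \ne y_{\text{true}})$$
The comparison $(y \ne y_{\text{true}})$ evaluates to 1 when the prediction is wrong and to 0 otherwise, so only misclassified examples change the weights.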
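A minimal sketch of this conservative update for a single example follows. The function names, the test vector, and the choice to fire only when the weighted sum is strictly positive are assumptions made for illustration, not part of the notes.

```python
def step(sigma):
    """Threshold unit: +1 if the weighted sum is positive, else -1 (assumed convention)."""
    return 1 if sigma > 0 else -1


def update_rule_1(w, x, y_true, eps=1.0):
    """Return w + eps * y_true * x when x is misclassified, otherwise w unchanged."""
    sigma = sum(wj * xj for wj, xj in zip(w, x))
    if step(sigma) == y_true:
        return list(w)
    return [wj + eps * y_true * xj for wj, xj in zip(w, x)]


w = [0.0, 0.0, 0.0]
x = [1.0, -1.0, 1.0]   # dummy input 1, then x1 = -1, x2 = 1 (illustrative values)
w_new = update_rule_1(w, x, y_true=1, eps=0.5)
w_again = update_rule_1(w_new, x, y_true=1, eps=0.5)
print(w_new, w_again)
```

The second call leaves the weights alone because the example is already classified correctly after the first half step.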
Training Procedure #2 (Steepest Descent Algorithm). We can derive the training procedure by minimizing the error function over all i training patterns.
Minimize
$$\mathrm{sse}(w) = E(w) = \sum_{i=0}^{N} \mathrm{error}_i^2 = \sum_{i=0}^{N} \bigl(y_i - f(x_i \cdot w)\bigr)^2$$
$$\frac{d}{dw}\,\mathrm{sse} = -2 \sum_{i=0}^{N} \bigl(y_i - f(x_i \cdot w)\bigr)\, x_i$$
Gradient of error:
$$\nabla E = 2 \sum_{i=0}^{N} \bigl(f(x_i \cdot w) - y_i\bigr)\, x_i$$
The following is the steepest descent algorithm, which searches in the negative gradient direction of the error to update the jth weight after sequentially presenting each example i.
$$\Delta w = -\varepsilon \,\bigl(f(x_i \cdot w) - y_i\bigr)\, x_i$$
where ε is a non-negative scalar learning rate. We can either 1) divide the threshold expression by a constant without changing the net effect of the eventual expression, e.g., $\sum_j w_j x_j > a$ for j = {1, ..., m} is transformed to $\sum_j (w_j/a)\, x_j > 1$, or 2) transform it to an un-thresholded form $\sum_j (w_j/\mathrm{const})\, x_j > 0$ for j = {0, 1, ..., m}. In practice, because f is a step function, we can force f(x⋅w) to be simply x⋅w in the weight calculation. This substituted form is more restrictive and is a special case in which f(x⋅w) = ±1 = x⋅w.
$$\Delta w = -\varepsilon \,(x_i \cdot w - y_i)\, x_i$$
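The two forms of the steepest descent step can be compared side by side. The sketch below is illustrative only: `delta_step`, the flag `use_threshold`, the sample vector, and ε = 0.1 are hypothetical choices, and the step function is assumed to output +1 only for a strictly positive sum.

```python
def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def step(sigma):
    """Assumed threshold convention: +1 if sigma > 0, else -1."""
    return 1 if sigma > 0 else -1


def delta_step(w, x, y, eps, use_threshold=True):
    """One steepest-descent step: w - eps * (output - y) * x.

    use_threshold=True uses f(x.w); False substitutes x.w for f(x.w).
    """
    sigma = dot(x, w)
    output = step(sigma) if use_threshold else sigma
    return [wj - eps * (output - y) * xj for wj, xj in zip(w, x)]


w = [0.0, 0.0, 0.0]
x = [1.0, 1.0, -1.0]   # dummy input 1, then x1 = 1, x2 = -1 (illustrative)
y = 1
w_f = delta_step(w, x, y, eps=0.1, use_threshold=True)     # error is f - y = -2
w_lin = delta_step(w, x, y, eps=0.1, use_threshold=False)  # error is x.w - y = -1
print(w_f, w_lin)
```

With the thresholded output the error is always 0 or ±2, while the substituted form produces an error proportional to how far x⋅w is from the target.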
We can either update the weights immediately after presenting each sample, or we can wait until the entire set of samples has gone through and then calculate the weight change by summing up the errors in that batch.
Training Procedure #3 (Linear Regression or Pseudo-Inverse Method). We realize that y is simply a linear combination of x:
$$w = (x^T x)^{-1}\, x^T y$$
Of course, as in multiple linear regression, if the independent variables are highly correlated, the inverse of $x^T x$ is very unstable or nonexistent.
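As a rough illustration of the pseudo-inverse method, the sketch below solves the normal equations $(x^T x)\,w = x^T y$ directly instead of forming the inverse. The helper names and the four ±1 OR patterns (with a leading dummy input of 1) are assumptions chosen for the example.

```python
def solve_linear(a, b):
    """Solve a w = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]   # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("x^T x is singular or nearly singular")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [mr - factor * mc for mr, mc in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def least_squares(x, y):
    """Return w minimizing sum_i (y_i - x_i . w)^2 via the normal equations."""
    cols = list(zip(*x))
    xtx = [[sum(p * q for p, q in zip(ci, cj)) for cj in cols] for ci in cols]
    xty = [sum(p * q for p, q in zip(ci, y)) for ci in cols]
    return solve_linear(xtx, xty)


# Assumed OR data in +/-1 coding; first column is the dummy input.
x = [[1.0, -1.0, -1.0], [1.0, -1.0, 1.0], [1.0, 1.0, -1.0], [1.0, 1.0, 1.0]]
y = [-1.0, 1.0, 1.0, 1.0]
w = least_squares(x, y)
print(w)
```

The singularity check corresponds to the warning above: when columns of x are strongly correlated, the pivot becomes tiny and the solution is unreliable.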
Example: An OR function.
Number of inputs: $j := 0..2$
Maximum number of training patterns: $i := 0..10$
Two-dimensional inputs x and one-dimensional output y:
   i    x_{i,1}   x_{i,2}   y_i
   0      -1        -1      -1
   1      -1         1       1
   2       1        -1       1
   3       1         1       1
Number of training patterns: $N := \mathrm{last}(y)$, $i := 0..N$
Add a dummy input (in lieu of threshold): $x_{i,0} := 1$
Output function: $\sigma(x, w) := x \cdot w$. Element-wise notation: $\sigma(x, w) = \sum_{j=0}^{2} w_j x_j$
$f(x, w) := \mathrm{if}(\sigma(x, w) > 0,\ 1,\ -1)$
Assign initial weights: $w_j := \mathrm{rnd}(2) - 1$
Present different examples and iterate (hopefully until convergence): $k := 0..10$
$$w^{\langle k(N+1)+i+1 \rangle} := w^{\langle k(N+1)+i \rangle} + y_i \,(x^T)^{\langle i \rangle}\, \bigl(f((x^T)^{\langle i \rangle}, w^{\langle k(N+1)+i \rangle}) \ne y_i\bigr)$$
Results after iteration:
w =
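The column indexing in the iteration, where every intermediate weight vector is stored as column k(N+1)+i of w, can be sketched as below. To keep the run reproducible the sketch starts from zero weights rather than rnd(2) - 1, and it assumes the ±1 coding of the OR patterns and a strictly positive threshold; all of these are assumptions, as are the names `train_history` and `history`.

```python
X = [[1, -1, -1], [1, -1, 1], [1, 1, -1], [1, 1, 1]]   # dummy input first (assumed data)
Y = [-1, 1, 1, 1]


def f(x, w):
    """Assumed threshold convention: +1 if x.w > 0, else -1."""
    return 1 if sum(xj * wj for xj, wj in zip(x, w)) > 0 else -1


def train_history(w0, epochs=11):
    """Training Procedure #1, keeping every intermediate weight vector.

    history[c] plays the role of column w<c>; with n = N + 1 patterns,
    the weights seen by pattern i in sweep k are history[k * n + i].
    """
    history = [list(w0)]
    n = len(X)
    for k in range(epochs):          # k := 0..10
        for i in range(n):           # i := 0..N
            w = history[k * n + i]
            wrong = 1 if f(X[i], w) != Y[i] else 0
            history.append([wj + Y[i] * xj * wrong for wj, xj in zip(w, X[i])])
    return history


history = train_history([0, 0, 0])
omega = history[-1]                  # the last weight, w<cols(w)-1>
predictions = [f(x, omega) for x in X]
print(omega, predictions)
```

From this particular starting point the weights stop changing after the second sweep, so later columns of the history are identical.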
Example. -- Training Method #2 with x⋅w.
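Reset weight: $w := 0$, $w_j := \mathrm{rnd}(2) - 1$
Present different examples and iterate (hopefully until convergence): $k := 0..10$
$\Delta w = -\varepsilon\,(x_i \cdot w - y_i)\, x_i$, with $\varepsilon := 0.\ldots$
$$w^{\langle k(N+1)+i+1 \rangle} := w^{\langle k(N+1)+i \rangle} - \varepsilon\,(x^T)^{\langle i \rangle}\, \bigl((x^T)^{\langle i \rangle} \cdot w^{\langle k(N+1)+i \rangle} - y_i\bigr)$$
Results after iteration: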
w =
The last weight: $\omega := w^{\langle \mathrm{cols}(w)-1 \rangle}$
[Plot with legend "Examples" and "Perceptron Weights"; axes run from -2 to 2]
$\omega =$
Prediction for different cases ← compare → $y = (-1, 1, 1, 1)$:
   f((1, -1, -1), ω) = -1
   f((1, -1, 1), ω) = 1
   f((1, 1, -1), ω) = 1
   f((1, 1, 1), ω) = 1
Note that the answer is not unique. Many combinations of weights give rise to the same result. Here is another set of weights that works just as well.
The next-to-last weight: $\omega := w^{\langle \mathrm{cols}(w)-2 \rangle}$
[Plot with legend "Examples" and "Perceptron Weights"; axes run from -2 to 2]
$\omega =$
Prediction for different cases ← compare → $y = (-1, 1, 1, 1)$:
   f((1, -1, -1), ω) = -1
   f((1, -1, 1), ω) = 1
   f((1, 1, -1), ω) = 1
   f((1, 1, 1), ω) = 1
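The non-uniqueness can be checked with a short sketch of Training Method #2 using x⋅w. The learning rate 0.1, the zero starting weights, and the ±1 OR patterns are assumptions made here for a reproducible run; the names `train_lms`, `last`, and `next_to_last` are hypothetical.

```python
X = [[1.0, -1.0, -1.0], [1.0, -1.0, 1.0], [1.0, 1.0, -1.0], [1.0, 1.0, 1.0]]
Y = [-1.0, 1.0, 1.0, 1.0]


def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def f(x, w):
    """Assumed threshold convention: +1 if x.w > 0, else -1."""
    return 1.0 if dot(x, w) > 0 else -1.0


def train_lms(w0, eps=0.1, epochs=11):
    """Per-sample update w <- w - eps * (x.w - y) * x, keeping all weights."""
    history = [list(w0)]
    for _ in range(epochs):
        for x, y in zip(X, Y):
            w = history[-1]
            err = dot(x, w) - y
            history.append([wj - eps * err * xj for wj, xj in zip(w, x)])
    return history


history = train_lms([0.0, 0.0, 0.0])
last, next_to_last = history[-1], history[-2]
print(last, [f(x, last) for x in X])
print(next_to_last, [f(x, next_to_last) for x in X])
```

Because x⋅w never equals the ±1 targets exactly for all four patterns, each presentation still nudges the weights, yet both weight vectors lie on the correct side of every example.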
Example. -- Training Method #2 with f(x,w)
Present different examples and iterate (hopefully until convergence): $k := 0..10$
$$w^{\langle k(N+1)+i+1 \rangle} := w^{\langle k(N+1)+i \rangle} - \varepsilon\,(x^T)^{\langle i \rangle}\, \bigl(f((x^T)^{\langle i \rangle}, w^{\langle k(N+1)+i \rangle}) - y_i\bigr)$$
Results after iteration:
w =
The last weight: $\omega := w^{\langle \mathrm{cols}(w)-1 \rangle}$
[Plot with legend "Examples" and "Perceptron Weights"; axes run from -2 to 2]
$\omega =$
Prediction for different cases ← compare → $y = (-1, 1, 1, 1)$:
   f((1, -1, -1), ω) = -1
   f((1, -1, 1), ω) = 1
   f((1, 1, -1), ω) = 1
   f((1, 1, 1), ω) = 1
Example. -- Training Method #2 with f(x,w) but with batch updating of weights.
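Present different examples and iterate (hopefully until convergence): $k := 0..10$
$$w^{\langle k+1 \rangle} := w^{\langle k \rangle} - \varepsilon \sum_{i=0}^{N} (x^T)^{\langle i \rangle}\, \bigl(f((x^T)^{\langle i \rangle}, w^{\langle k \rangle}) - y_i\bigr)$$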
Results after iteration:
w =
The last weight: $\omega := w^{\langle 10 \rangle}$. This is the same set of weights as in the last problem, but it converged quickly.
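A sketch of the batch form, in which the errors of one whole sweep are summed before the weights move, is given below. The zero starting weights, ε = 0.1, and the ±1 OR patterns are assumptions for reproducibility, so the resulting numbers need not match the worksheet.

```python
X = [[1.0, -1.0, -1.0], [1.0, -1.0, 1.0], [1.0, 1.0, -1.0], [1.0, 1.0, 1.0]]
Y = [-1.0, 1.0, 1.0, 1.0]


def f(x, w):
    """Assumed threshold convention: +1 if x.w > 0, else -1."""
    return 1.0 if sum(xj * wj for xj, wj in zip(x, w)) > 0 else -1.0


def train_batch(w0, eps=0.1, epochs=11):
    """Batch update: w <- w - eps * sum_i x_i * (f(x_i, w) - y_i)."""
    w = list(w0)
    for _ in range(epochs):
        errors = [f(x, w) - y for x, y in zip(X, Y)]
        grad = [sum(e * x[j] for e, x in zip(errors, X)) for j in range(len(w))]
        w = [wj - eps * gj for wj, gj in zip(w, grad)]
    return w


omega = train_batch([0.0, 0.0, 0.0])
predictions = [f(x, omega) for x in X]
print(omega, predictions)
```

Once every pattern is classified correctly the summed error is zero and the weights stay fixed for the remaining sweeps.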
Example. -- Output y is binary in 0/1 instead of -1/1. Training Method #2 with x⋅w.
Maximum number of training patterns: $i := 0..10$
Reset input and output variables: $x := 0$, $y := 0$
Two-dimensional inputs x and one-dimensional output y:
   i    x_{i,1}   x_{i,2}   y_i
   0      -1        -1       0
   1      -1         1       1
   2       1        -1       1
   3       1         1       1
Number of training patterns: $N := \mathrm{last}(y)$, $i := 0..N$
Add a dummy input (in lieu of threshold): $x_{i,0} := 1$
Output function: $\sigma(x, w) := x \cdot w$, $f(x, w) := \mathrm{if}(\sigma(x, w) > 0,\ 1,\ 0)$
Reset weight: $w := 0$, $w_j := \mathrm{rnd}(2) - 1$
Present different examples and iterate (hopefully until convergence): $k := 0..10$
$$w^{\langle k(N+1)+i+1 \rangle} := w^{\langle k(N+1)+i \rangle} - \varepsilon\,(x^T)^{\langle i \rangle}\, \bigl((x^T)^{\langle i \rangle} \cdot w^{\langle k(N+1)+i \rangle} - y_i\bigr)$$
Results after iteration:
w =
The last weight: $\omega := w^{\langle \mathrm{cols}(w)-1 \rangle}$
[Plot with legend "Examples" and "Perceptron Weights"; axes run from -2 to 2]
$\omega =$
Prediction for different cases ← compare → $y = (0, 1, 1, 1)$:
   f((1, -1, -1), ω) = 1
   f((1, -1, 1), ω) = 1
   f((1, 1, -1), ω) = 1
   f((1, 1, 1), ω) = 1
No good! The approximation of f(x⋅w) by x⋅w is valid only when f(x⋅w) = ±1 = x⋅w, and f = {0 or 1} now violates this assumption.
Example. -- Output y is binary in 0/1 instead of -1/1. Training Method #2 with f(x,w)
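Output function: $\sigma(x, w) := x \cdot w$, $f(x, w) := \mathrm{if}(\sigma(x, w) > 0,\ 1,\ 0)$
Reset weight: $w := 0$, $w_j := \mathrm{rnd}(2) - 1$
Present different examples and iterate (hopefully until convergence): $k := 0..10$
$$w^{\langle k(N+1)+i+1 \rangle} := w^{\langle k(N+1)+i \rangle} - \varepsilon\,(x^T)^{\langle i \rangle}\, \bigl(f((x^T)^{\langle i \rangle}, w^{\langle k(N+1)+i \rangle}) - y_i\bigr)$$
Results after iteration: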
w =
The last weight: $\omega := w^{\langle \mathrm{cols}(w)-1 \rangle}$
[Plot with legend "Examples" and "Perceptron Weights"; axes run from -2 to 2]
$\omega =$
Prediction for different cases ← compare → $y = (0, 1, 1, 1)$:
   f((1, -1, -1), ω) = 0
   f((1, -1, 1), ω) = 1
   f((1, 1, -1), ω) = 1
   f((1, 1, 1), ω) = 1
Now, the answer is O.K.
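With the thresholded output in the update, the 0/1 coding poses no problem. The sketch below assumes the same ±1 inputs with 0/1 targets, zero starting weights, and ε = 0.1; these values and the name `train_threshold` are illustrative assumptions.

```python
X = [[1.0, -1.0, -1.0], [1.0, -1.0, 1.0], [1.0, 1.0, -1.0], [1.0, 1.0, 1.0]]
Y = [0, 1, 1, 1]   # OR with 0/1 outputs


def f(x, w):
    """0/1 threshold unit (assumed convention: 1 if x.w > 0, else 0)."""
    return 1 if sum(xj * wj for xj, wj in zip(x, w)) > 0 else 0


def train_threshold(w0, eps=0.1, epochs=11):
    """Per-sample update w <- w - eps * (f(x, w) - y) * x."""
    w = list(w0)
    for _ in range(epochs):
        for x, y in zip(X, Y):
            err = f(x, w) - y            # always -1, 0, or +1 here
            w = [wj - eps * err * xj for wj, xj in zip(w, x)]
    return w


omega = train_threshold([0.0, 0.0, 0.0])
predictions = [f(x, omega) for x in X]
print(omega, predictions)
```

The error term is computed from the same function that makes the prediction, so a zero error on every pattern means the predictions match y regardless of the output coding.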
Example. -- Output y is binary in 0/1 instead of -1/1. Training Method #3 with Pseudo-Inverse.
$\omega := 0$
$$\omega := (x^T x)^{-1}\, x^T y$$
$\omega =$
[Plot with legend "Examples" and "Perceptron Weights"; axes run from -2 to 2]
Prediction for different cases ← compare → $y = (0, 1, 1, 1)$:
   f((1, -1, -1), ω) = 1
   f((1, -1, 1), ω) = 1
   f((1, 1, -1), ω) = 1
   f((1, 1, 1), ω) = 1
No good!
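A worked version of the normal equations shows where this fails. It assumes the four patterns are (1, -1, -1), (1, -1, 1), (1, 1, -1), (1, 1, 1) with the dummy input first, targets y = (0, 1, 1, 1), and an output of 1 whenever σ > 0.

```latex
x^T x =
\begin{pmatrix} 4 & 0 & 0 \\ 0 & 4 & 0 \\ 0 & 0 & 4 \end{pmatrix},
\qquad
x^T y =
\begin{pmatrix} 0+1+1+1 \\ 0-1+1+1 \\ 0+1-1+1 \end{pmatrix}
=
\begin{pmatrix} 3 \\ 1 \\ 1 \end{pmatrix},
\qquad
\omega = (x^T x)^{-1} x^T y =
\begin{pmatrix} 0.75 \\ 0.25 \\ 0.25 \end{pmatrix}

\sigma\bigl((1,-1,-1),\omega\bigr) = 0.75 - 0.25 - 0.25 = 0.25 > 0
\;\Rightarrow\; f = 1 \ne 0
```

The least-squares fit places the output for (-1, -1) at 0.25, close to its target of 0 but on the wrong side of a threshold at zero.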
Example: An XOR function trained with the batch method #2.
Two-dimensional inputs x and one-dimensional output y:
   i    x_{i,1}   x_{i,2}   y_i
   0      -1        -1      -1
   1      -1         1       1
   2       1        -1       1
   3       1         1      -1
Present different examples and iterate (hopefully until convergence): $k := 0..10$
$$w^{\langle k+1 \rangle} := w^{\langle k \rangle} - \varepsilon \sum_{i=0}^{N} (x^T)^{\langle i \rangle}\, \bigl(f((x^T)^{\langle i \rangle}, w^{\langle k \rangle}) - y_i\bigr)$$
Results after iteration:
w =
The last weight: $\omega := w^{\langle \mathrm{cols}(w)-1 \rangle}$, $\omega =$
[Plot with legend "Examples" and "Perceptron Weights"; axes run from -2 to 2]
Prediction for different cases ← compare → y: the four predictions f((1, x_1, x_2), ω) do not all agree with y. No good!
The next-to-last weight: $\omega := w^{\langle \mathrm{cols}(w)-2 \rangle}$, $\omega =$
[Plot with legend "Examples" and "Perceptron Weights"; axes run from -2 to 2]
Prediction for different cases ← compare → y: the four predictions f((1, x_1, x_2), ω) again do not all agree with y. No good!
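The behaviour of the batch iteration on XOR can be reproduced with a small sketch. It assumes ±1 coding, zero starting weights, ε = 0.1, and a strictly positive threshold, so the specific weight values are those of this sketch and not of the worksheet.

```python
X = [[1.0, -1.0, -1.0], [1.0, -1.0, 1.0], [1.0, 1.0, -1.0], [1.0, 1.0, 1.0]]
Y = [-1.0, 1.0, 1.0, -1.0]   # XOR in +/-1 coding (assumed)


def f(x, w):
    """Assumed threshold convention: +1 if x.w > 0, else -1."""
    return 1.0 if sum(xj * wj for xj, wj in zip(x, w)) > 0 else -1.0


eps = 0.1
weights = [[0.0, 0.0, 0.0]]          # weights[k] plays the role of w<k>
for _ in range(11):                  # k := 0..10
    w = weights[-1]
    errors = [f(x, w) - y for x, y in zip(X, Y)]
    grad = [sum(e * x[j] for e, x in zip(errors, X)) for j in range(3)]
    weights.append([wj - eps * gj for wj, gj in zip(w, grad)])

# Number of misclassified patterns for every stored weight vector.
mismatches = [sum(f(x, w) != y for x, y in zip(X, Y)) for w in weights]
print(weights[:4])
print(mismatches)
```

In this run the weights alternate between two vectors and every stored weight vector misclassifies at least one pattern.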
The weights are oscillating. Like the first set of weights above, the second set of weights also fails to yield the correct answers. It is clear that one perceptron with the original unadulterated input variables $x^{\langle 1 \rangle}$ and $x^{\langle 2 \rangle}$ alone cannot divide the $x^{\langle 1 \rangle}$-$x^{\langle 2 \rangle}$ plane into regions where the output is either +1 or -1 such that the patterns match the training data. In other words, one perceptron cannot learn the XOR function, because we need at least two straight lines to separate the +1 and -1 outputs, although it may be possible to do so with a nonlinear curve. Let us generate additional variables consisting of auto- and cross-products of $x^{\langle 1 \rangle}$ and $x^{\langle 2 \rangle}$.
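$x^{\langle 3 \rangle} := (x^{\langle 1 \rangle})^2$, $x^{\langle 4 \rangle} := x^{\langle 1 \rangle} \cdot x^{\langle 2 \rangle}$, $x^{\langle 5 \rangle} := (x^{\langle 2 \rangle})^2$, $j := 0..5$
Likewise, let us enlarge the weight vector accordingly and assign initial weights.
$w := 0$, $w_j := \mathrm{rnd}(2) - 1$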
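Present different examples and iterate (hopefully until convergence): $k := 0..10$
$$w^{\langle k+1 \rangle} := w^{\langle k \rangle} - \varepsilon \sum_{i=0}^{N} (x^T)^{\langle i \rangle}\, \bigl(f((x^T)^{\langle i \rangle}, w^{\langle k \rangle}) - y_i\bigr)$$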
Results after iteration:
w =
The last weight: $\omega := w^{\langle \mathrm{cols}(w)-1 \rangle}$, $\omega^T = (1.068 \quad 0.302 \quad 0.314 \quad 0.167 \quad 1.16 \quad 0.352)$
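The effect of the added product terms can be sketched with the same batch update. The run below assumes ±1 XOR data, zero starting weights, and ε = 0.1, so its final weights differ from the values reported above; the helper `expand` is a hypothetical name.

```python
def expand(x1, x2):
    """Feature vector (1, x1, x2, x1^2, x1*x2, x2^2), matching j := 0..5."""
    return [1.0, x1, x2, x1 * x1, x1 * x2, x2 * x2]


PATTERNS = [(-1.0, -1.0), (-1.0, 1.0), (1.0, -1.0), (1.0, 1.0)]
X = [expand(x1, x2) for x1, x2 in PATTERNS]
Y = [-1.0, 1.0, 1.0, -1.0]   # XOR in +/-1 coding (assumed)


def f(x, w):
    """Assumed threshold convention: +1 if x.w > 0, else -1."""
    return 1.0 if sum(xj * wj for xj, wj in zip(x, w)) > 0 else -1.0


eps = 0.1
w = [0.0] * 6
for _ in range(11):                  # k := 0..10
    errors = [f(x, w) - y for x, y in zip(X, Y)]
    grad = [sum(e * x[j] for e, x in zip(errors, X)) for j in range(6)]
    w = [wj - eps * gj for wj, gj in zip(w, grad)]

predictions = [f(x, w) for x in X]
print(w, predictions)
```

In ±1 coding the XOR target equals the negative of the cross-product x1*x2, which is why a single unit can separate the patterns once that term is available as an input.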
The following equation describes the curve in the $x^{\langle 1 \rangle}$-$x^{\langle 2 \rangle}$ plane defined by the weights of the perceptron.
$$w_5 x_2^2 + w_4 x_1 x_2 + w_3 x_1^2 + w_2 x_2 + w_1 x_1 + w_0 = 0$$
(mark $x_2$ and choose |Symbolic|Solve for|)
$$x_{2a}(x_1, w) := \frac{1}{2 w_5} \Bigl[ -w_4 x_1 - w_2 + \sqrt{(w_4 x_1 + w_2)^2 - 4 w_5 \,(w_3 x_1^2 + w_1 x_1 + w_0)} \Bigr]$$
$$x_{2b}(x_1, w) := \frac{1}{2 w_5} \Bigl[ -w_4 x_1 - w_2 - \sqrt{(w_4 x_1 + w_2)^2 - 4 w_5 \,(w_3 x_1^2 + w_1 x_1 + w_0)} \Bigr]$$
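The two branches of the boundary can be evaluated numerically with a short sketch. The function name `boundary_x2` and the unit-circle test weights are hypothetical, and the formula only applies when w5 is nonzero.

```python
import math


def boundary_x2(x1, w):
    """Return (x2a, x2b) on the curve w.(1, x1, x2, x1^2, x1*x2, x2^2) = 0.

    Returns None when the curve has no real point at this x1.
    """
    a = w[5]
    b = w[4] * x1 + w[2]
    c = w[3] * x1 ** 2 + w[1] * x1 + w[0]
    if a == 0:
        raise ValueError("w5 must be nonzero to use the quadratic formula")
    disc = b * b - 4 * a * c
    if disc < 0:
        return None
    root = math.sqrt(disc)
    return ((-b + root) / (2 * a), (-b - root) / (2 * a))


# Illustrative weights for the unit circle x1^2 + x2^2 - 1 = 0.
w_circle = [-1.0, 0.0, 0.0, 1.0, 0.0, 1.0]
print(boundary_x2(0.0, w_circle))
print(boundary_x2(2.0, w_circle))
```

Sweeping x1 over the plotting range and drawing both branches wherever a real solution exists traces the nonlinear decision boundary.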