Neural Networks: Derivatives and Backpropagation, Exercises of Artificial Intelligence

These notes supplement lecture material on neural networks, focusing on the mathematics of multiple-neuron layers and the backpropagation rule. They cover the sigmoid function, its derivative, and their significance in neural networks, along with the performance function, gradient ascent, and gradient descent.



6.034f Neural Net Notes

October 28, 2010

These notes are a supplement to material presented in lecture. I lay out the mathematics more prettily and extend the analysis to handle multiple neurons per layer. Also, I develop the backpropagation rule, which is often needed on quizzes.

I use a notation that I think improves on previous explanations. The reason is that the notation here plainly associates each input, output, and weight with a readily identified neuron, a left-side one and a right-side one. When you arrive at the update formulas, you will have less trouble relating the variables in the formulas to the variables in a diagram.

On the other hand, seeing yet another notation may confuse you, so if you already feel comfortable with a set of update formulas, you will not gain by reading these notes.

The sigmoid function

The sigmoid function, $y = 1/(1 + e^{-x})$, is used instead of a step function in artificial neural nets because the sigmoid is continuous, whereas a step function is not, and you need continuity whenever you want to use gradient ascent. Also, the sigmoid function has several desirable qualities. For example, the sigmoid function's value, $y$, approaches 1 as $x$ becomes highly positive; 0 as $x$ becomes highly negative; and equals 1/2 when $x = 0$.

Better yet, the sigmoid function features a remarkably simple derivative of the output, $y$, with respect to the input, $x$:

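$$
\begin{aligned}
\frac{dy}{dx} &= \frac{d}{dx}\left(\frac{1}{1 + e^{-x}}\right) \\
&= \frac{d}{dx}\,(1 + e^{-x})^{-1} \\
&= -1 \times (1 + e^{-x})^{-2} \times e^{-x} \times -1 \\
&= \frac{1}{1 + e^{-x}} \times \frac{e^{-x}}{1 + e^{-x}} \\
&= \frac{1}{1 + e^{-x}} \times \frac{1 + e^{-x} - 1}{1 + e^{-x}} \\
&= \frac{1}{1 + e^{-x}} \times \left(\frac{1 + e^{-x}}{1 + e^{-x}} - \frac{1}{1 + e^{-x}}\right) \\
&= y(1 - y)
\end{aligned}
$$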

Thus, remarkably, the derivative of the output with respect to the input is expressed as a simple function of the output.
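
As a quick check on the derivation, here is a minimal Python sketch that compares $y(1 - y)$ with a finite-difference estimate of the slope. The function names and the step size `h` are illustrative choices, not part of the notes.

```python
import math

def sigmoid(x):
    """Sigmoid y = 1 / (1 + e^(-x))."""
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_derivative_from_output(y):
    """dy/dx written purely in terms of the output y."""
    return y * (1.0 - y)

def numerical_derivative(f, x, h=1e-6):
    """Central-difference estimate of df/dx, used only as a check."""
    return (f(x + h) - f(x - h)) / (2.0 * h)

# Compare the closed form with the numerical estimate at a few inputs.
for x in (-2.0, 0.0, 0.5, 3.0):
    y = sigmoid(x)
    analytic = sigmoid_derivative_from_output(y)
    numeric = numerical_derivative(sigmoid, x)
    print(f"x={x:5.1f}  y={y:.6f}  y(1-y)={analytic:.6f}  numeric={numeric:.6f}")
```

Because the derivative depends only on $y$, a program that has already computed a neuron's output does not need to evaluate the exponential a second time.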

The performance function

The standard performance function for gauging how well a neural net is doing is given by the following:

$$
P = -\frac{1}{2}\,(d_{\text{sample}} - o_{\text{sample}})^2
$$

where $P$ is the performance function, $d_{\text{sample}}$ is the desired output for some specific sample, and $o_{\text{sample}}$ is the observed output for that sample. From this point forward, assume that $d$ and $o$ are the desired and observed outputs for a specific sample so that we need not drag a subscript around as we work through the algebra.

The reason for choosing the given formula for $P$ is that the formula has convenient properties. The formula yields a maximum at $o = d$ and monotonically decreases as $o$ deviates from $d$. Moreover, the derivative of $P$ with respect to $o$ is simple:

$$
\begin{aligned}
\frac{dP}{do} &= \frac{d}{do}\left[-\frac{1}{2}\,(d - o)^2\right] \\
&= -\frac{2}{2} \times (d - o)^1 \times -1 \\
&= d - o
\end{aligned}
$$
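
The two properties claimed for $P$ can be seen in a few lines of Python. This is only a sketch; the names `performance` and `performance_derivative` are made up for illustration.

```python
def performance(d, o):
    """P = -1/2 (d - o)^2; never positive, and zero exactly when o equals d."""
    return -0.5 * (d - o) ** 2

def performance_derivative(d, o):
    """dP/do = d - o."""
    return d - o

d = 1.0
for o in (0.0, 0.5, 1.0, 1.5):
    # P rises toward 0 as o approaches d; the derivative points toward d.
    print(f"o={o:.1f}  P={performance(d, o):+.3f}  dP/do={performance_derivative(d, o):+.1f}")
```

The sign of the derivative tells you which way to move $o$: positive when the observed output is too small, negative when it is too large, and zero at the maximum.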

Gradient ascent

Backpropagation is a specialization of the idea of gradient ascent. You are trying to find the maximum of a performance function $P$ by changing the weights associated with neurons, so you move in the direction of the gradient in a space that gives $P$ as a function of the weights, $w$. That is, you move in the direction of most rapid ascent if you take a step in the direction with components governed by the following formula, which shows how much to change a weight, $w$, in terms of a partial derivative:

$$
\Delta w \propto \frac{\partial P}{\partial w}
$$

The actual change is influenced by a rate constant, α; accordingly, the new weight, w′, is given by the following:

$$
w' = w + \alpha \times \frac{\partial P}{\partial w}
$$
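
To see the update rule in motion, here is a small sketch that applies it repeatedly to a toy performance function of a single weight, $P(w) = -\frac{1}{2}(d - w)^2$. The toy function, the helper name `gradient_ascent`, and the values of the rate constant and step count are all assumptions chosen for illustration; a real net's $P$ depends on the weights through the neurons' outputs.

```python
def gradient_ascent(w, dP_dw, alpha, steps):
    """Repeatedly apply w' = w + alpha * dP/dw."""
    for _ in range(steps):
        w = w + alpha * dP_dw(w)
    return w

d = 2.0  # location of the maximum of the toy performance function

def toy_dP_dw(w):
    """Derivative of the toy P(w) = -1/2 (d - w)^2, which is d - w."""
    return d - w

w_final = gradient_ascent(w=0.0, dP_dw=toy_dP_dw, alpha=0.1, steps=200)
print(w_final)  # very close to d
```

With this toy function, each step closes a fixed fraction, $\alpha$, of the remaining gap to the maximum, so a larger rate constant converges in fewer steps as long as it stays small enough not to overshoot.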

Gradient descent

If the performance function were $\frac{1}{2}(d_{\text{sample}} - o_{\text{sample}})^2$ instead of $-\frac{1}{2}(d_{\text{sample}} - o_{\text{sample}})^2$, then you would be searching for the minimum rather than the maximum of $P$, and the change in $w$ would be subtracted from $w$ instead of added, so $w'$ would be $w - \alpha \times \frac{\partial P}{\partial w}$ instead of $w + \alpha \times \frac{\partial P}{\partial w}$. The two sign changes, one in the performance function and the other in the update formula, cancel, so in the end, you get the same result whether you use gradient ascent, as I prefer, or gradient descent.
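
The cancellation of the two sign changes can be checked numerically. The sketch below takes one ascent step on $P$ and one descent step on $-P$ from the same starting weight; the toy setup, in which the weight itself plays the role of the observed output, and all names are illustrative assumptions.

```python
def ascent_step(w, dP_dw, alpha):
    """Gradient ascent on P: add alpha times the derivative."""
    return w + alpha * dP_dw

def descent_step(w, dE_dw, alpha):
    """Gradient descent on E = -P: subtract alpha times the derivative."""
    return w - alpha * dE_dw

d, w, alpha = 1.0, 0.2, 0.5

dP_dw = d - w      # derivative of P = -1/2 (d - w)^2
dE_dw = -(d - w)   # derivative of E = +1/2 (d - w)^2

w_ascent = ascent_step(w, dP_dw, alpha)
w_descent = descent_step(w, dE_dw, alpha)
print(w_ascent, w_descent)  # the two updated weights agree
```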

The simplest neural net

Consider the simplest possible neural net: one input, one output, and two neurons, the left neuron and the right neuron. A net with two neurons is the smallest that illustrates how the derivatives can be computed layer by layer.
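
The intermediate steps between this setup and the formulas that follow are not reproduced in this text. As a sketch of one way to fill them in, assume that each neuron multiplies its single input $i$ by its weight $w$ to form a product $p$, passes $p$ through the sigmoid to produce its output $o$, and that the left neuron's output is the right neuron's input. The symbols $p_l$ and $p_r$ are introduced here for that purpose and are assumptions.

```latex
\begin{align*}
p_l &= w_l \times i_l, & o_l &= \frac{1}{1 + e^{-p_l}}, & i_r &= o_l, \\
p_r &= w_r \times i_r, & o_r &= \frac{1}{1 + e^{-p_r}}, & P &= -\tfrac{1}{2}(d - o_r)^2
\end{align*}

% Chain rule for the right neuron's weight
\begin{align*}
\frac{\partial P}{\partial w_r}
  &= \frac{\partial P}{\partial o_r} \times \frac{\partial o_r}{\partial p_r}
     \times \frac{\partial p_r}{\partial w_r} \\
  &= (d - o_r) \times o_r(1 - o_r) \times i_r
\end{align*}

% Chain rule for the left neuron's weight, reaching back through the right neuron
\begin{align*}
\frac{\partial P}{\partial w_l}
  &= \frac{\partial P}{\partial o_r} \times \frac{\partial o_r}{\partial p_r}
     \times \frac{\partial p_r}{\partial o_l} \times \frac{\partial o_l}{\partial p_l}
     \times \frac{\partial p_l}{\partial w_l} \\
  &= (d - o_r) \times o_r(1 - o_r) \times w_r \times o_l(1 - o_l) \times i_l
\end{align*}
```

Under these assumptions, the first two factors of both results are identical, which is what makes a layer-by-layer computation possible.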

Thus, the derivative consists of products of terms that have already been computed and terms in the vicinity of $w_l$. This is clearer if you write the two derivatives next to one another:

$$
\begin{aligned}
\frac{\partial P}{\partial w_r} &= (d - o_r) \times o_r(1 - o_r) \times i_r \\
\frac{\partial P}{\partial w_l} &= (d - o_r) \times o_r(1 - o_r) \times w_r \times o_l(1 - o_l) \times i_l
\end{aligned}
$$

You can simplify the equations by defining δs as follows, where each delta is associated with either the left or right neuron:

$$
\begin{aligned}
\delta_r &= o_r(1 - o_r) \times (d - o_r) \\
\delta_l &= o_l(1 - o_l) \times w_r \times \delta_r
\end{aligned}
$$

Then, you can write the partial derivatives with the δs:

$$
\begin{aligned}
\frac{\partial P}{\partial w_r} &= i_r \times \delta_r \\
\frac{\partial P}{\partial w_l} &= i_l \times \delta_l
\end{aligned}
$$

If you add more layers to the front of the network, each weight has a partial derivative that is computed like the partial derivative of the weight of the left neuron. That is, each has a partial derivative determined by its input and its delta, where its delta in turn is determined by its output, the weight to its right, and the delta to its right.

Thus, for the weights in the final layer, you compute the change as follows, where I use $f$ as the subscript instead of $r$ to emphasize that the computation is for the neuron in the final layer:

$$
\Delta w_f = \alpha \times i_f \times \delta_f
$$

where

$$
\delta_f = o_f(1 - o_f) \times (d - o_f)
$$

For all other layers, you compute the change as follows:

$$
\Delta w_l = \alpha \times i_l \times \delta_l
$$

where

$$
\delta_l = o_l(1 - o_l) \times w_r \times \delta_r
$$
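
The delta formulas translate almost line for line into code. Below is a minimal sketch for the two-neuron net, assuming each neuron simply passes the product of its weight and input through the sigmoid (no threshold or bias term is modeled). The function names and the sample numbers are illustrative only.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def forward(i_l, w_l, w_r):
    """Outputs of the left and right neurons; the right input is the left output."""
    o_l = sigmoid(w_l * i_l)
    o_r = sigmoid(w_r * o_l)
    return o_l, o_r

def weight_changes(i_l, w_l, w_r, d, alpha):
    """Return (delta_w_l, delta_w_r) computed from the delta formulas."""
    o_l, o_r = forward(i_l, w_l, w_r)
    i_r = o_l
    delta_r = o_r * (1.0 - o_r) * (d - o_r)       # final-layer delta
    delta_l = o_l * (1.0 - o_l) * w_r * delta_r   # uses the delta to its right
    return alpha * i_l * delta_l, alpha * i_r * delta_r

def performance(i_l, w_l, w_r, d):
    """P = -1/2 (d - o_r)^2, used here only to check the changes numerically."""
    _, o_r = forward(i_l, w_l, w_r)
    return -0.5 * (d - o_r) ** 2

dw_l, dw_r = weight_changes(i_l=1.0, w_l=0.5, w_r=-0.3, d=1.0, alpha=1.0)
print(dw_l, dw_r)
```

With the rate constant set to 1, each returned change equals the corresponding partial derivative of $P$, so comparing against a finite-difference estimate of `performance` is a convenient way to catch sign or index mistakes.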

More neurons per layer

Of course, you really want backpropagation formulas for not only any number of layers but also for any number of neurons per layer, each of which can have multiple inputs, each with its own weight. Accordingly, you need to generalize in another direction, allowing multiple neurons in each layer and multiple weights attached to each neuron.

The generalization is an adventure in summations, with lots of subscripts to keep straight, but in the end, the result matches intuition. For the final layer, there may be many neurons, so the formulas need an index, $k$, indicating which final-layer neuron is in play. For any weight contained in the final-layer neuron, $f_k$, you compute the change as follows from the input corresponding to the weight and from the δ associated with the neuron:

$$
\begin{aligned}
\Delta w &= \alpha \times i \times \delta_{f_k} \\
\delta_{f_k} &= o_{f_k}(1 - o_{f_k}) \times (d_k - o_{f_k})
\end{aligned}
$$

Note that the output of each final-layer neuron is subtracted from the output desired for that neuron.

For other layers, there may also be many neurons, and the output of each may influence all the neurons in the next layer to the right. The change in weight has to account for what happens to all of those neurons to the right, so a summation appears, but otherwise you compute the change, as before, from the input corresponding to the weight and from the δ associated with the neuron:

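$$
\begin{aligned}
\Delta w &= \alpha \times i \times \delta_{l_i} \\
\delta_{l_i} &= o_{l_i}(1 - o_{l_i}) \times \sum_j w_{l_i \to r_j} \times \delta_{r_j}
\end{aligned}
$$

Note that $w_{l_i \to r_j}$ is the weight that connects the $j$th right-side neuron to the output of the $i$th left-side neuron.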
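
The summation is easier to follow in code than in subscripts. The sketch below handles one left-side layer feeding a final layer, using plain Python lists. Several things here are assumptions made for illustration: the function names, the layout in which `weights[n][k]` is the weight on input `k` of neuron `n`, the absence of thresholds, and the use of $P = -\frac{1}{2}\sum_k (d_k - o_{f_k})^2$ as the performance for several outputs, which the text does not state explicitly.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def layer_outputs(weights, inputs):
    """weights[n][k] is the weight on input k of neuron n in this layer."""
    return [sigmoid(sum(w * x for w, x in zip(row, inputs))) for row in weights]

def backprop_changes(inputs, w_left, w_right, desired, alpha):
    """Weight changes for a left layer feeding a final (right) layer."""
    o_left = layer_outputs(w_left, inputs)
    o_right = layer_outputs(w_right, o_left)

    # Final layer: delta from each neuron's own output and desired output.
    delta_right = [o * (1.0 - o) * (d - o) for o, d in zip(o_right, desired)]

    # Left layer: sum over every right-side neuron j that neuron i feeds.
    # w_right[j][i] plays the role of the weight from left i to right j.
    delta_left = [
        o_left[i] * (1.0 - o_left[i])
        * sum(w_right[j][i] * delta_right[j] for j in range(len(w_right)))
        for i in range(len(w_left))
    ]

    # Each weight changes by alpha times its input times its neuron's delta.
    dw_right = [[alpha * x * dj for x in o_left] for dj in delta_right]
    dw_left = [[alpha * x * di for x in inputs] for di in delta_left]
    return dw_left, dw_right

def performance(inputs, w_left, w_right, desired):
    """Assumed multi-output performance: -1/2 times the sum of squared errors."""
    o_right = layer_outputs(w_right, layer_outputs(w_left, inputs))
    return -0.5 * sum((d - o) ** 2 for d, o in zip(desired, o_right))
```

A call such as `backprop_changes([0.5, -1.0], w_left, w_right, [1.0, 0.0], alpha=0.1)` returns two nested lists shaped like `w_left` and `w_right`, so the changes can be added to the weights element by element.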

Summary

Once you understand how to derive the formulas, you can combine and simplify them in preparation for solving problems. For each weight, you compute the weight's change from the input corresponding to the weight and from the δ associated with the neuron. Assuming that δ is the delta associated with that neuron, you have the following, where $w_{\to r_j}$ is the weight connecting the output of the neuron you are working on, the $i$th left-side neuron, to the $j$th right-side neuron, and $\delta_{r_j}$ is the δ associated with that right-side neuron.

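$$
\begin{aligned}
\delta_o &= o(1 - o) \times (d - o) && \text{for the final layer} \\
\delta_{l_i} &= o_{l_i}(1 - o_{l_i}) \times \sum_j w_{l_i \to r_j} \times \delta_{r_j} && \text{otherwise}
\end{aligned}
$$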

That is, you compute the change in a neuron's $w$, in every layer, by multiplying α times the neuron's input times its δ. The δ is determined for all but the final layer in terms of the neuron's output and all the weights that connect that output to neurons in the layer to the right and the δs associated with those right-side neurons. The δ for each neuron in the final layer is determined only by the output of that neuron and by the difference between the desired output and the actual output of that neuron.