Back Propagation - Lecture Notes | EEL 6812, Study notes of Electrical and Electronics Engineering

Material Type: Notes; Professor: Georgiopoulos; Class: Introduction to Neural Networks; Subject: Engineering: Electrical; University: University of Central Florida; Term: Unknown 1989;

Uploaded on 11/08/2009

5 Backpropagation

Overview
Fundamentals
    Architecture
    Simulation (sim)
    Training
Faster Training
    Variable Learning Rate (traingda, traingdx)
    Resilient Backpropagation (trainrp)
    Conjugate Gradient Algorithms
    Line Search Routines
    Quasi-Newton Algorithms
    Levenberg-Marquardt (trainlm)
    Reduced Memory Levenberg-Marquardt (trainlm)
Speed and Memory Comparison
    Summary
Improving Generalization
    Regularization
    Early Stopping
    Summary and Discussion
Preprocessing and Postprocessing
    Min and Max (premnmx, postmnmx, tramnmx)
    Mean and Stand. Dev. (prestd, poststd, trastd)
    Principal Component Analysis (prepca, trapca)
    Post-Training Analysis (postreg)
Sample Training Session
Limitations and Cautions
Overview

Backpropagation was created by generalizing the Widrow-Hoff learning rule to multiple-layer networks and nonlinear differentiable transfer functions. Input vectors and the corresponding target vectors are used to train a network until it can approximate a function, associate input vectors with specific output vectors, or classify input vectors in an appropriate way as defined by you. Networks with biases, a sigmoid layer, and a linear output layer are capable of approximating any function with a finite number of discontinuities. Standard backpropagation is a gradient descent algorithm, as is the Widrow-Hoff learning rule, in which the network weights are moved along the negative of the gradient of the performance function. The term backpropagation refers to the manner in which the gradient is computed for nonlinear multilayer networks. There are a number of variations on the basic algorithm that are based on other standard optimization techniques, such as conjugate gradient and Newton methods. The Neural Network Toolbox implements a number of these variations. This chapter explains how to use each of these routines and discusses the advantages and disadvantages of each.

Properly trained backpropagation networks tend to give reasonable answers when presented with inputs that they have never seen. Typically, a new input leads to an output similar to the correct output for input vectors used in training that are similar to the new input being presented. This generalization property makes it possible to train a network on a representative set of input/target pairs and get good results without training the network on all possible input/output pairs. There are two features of the Neural Network Toolbox that are designed to improve network generalization: regularization and early stopping. These features and their use are discussed later in this chapter. This chapter also discusses preprocessing and postprocessing techniques, which can improve the efficiency of network training.

Before beginning this chapter you may want to read a basic reference on backpropagation, such as D.E. Rumelhart, G.E. Hinton, R.J. Williams, “Learning internal representations by error propagation,” D. Rumelhart and J. McClelland, editors. Parallel Distributed Processing, Vol. 1, Chapter 8, the M.I.T. Press, Cambridge, MA 1986 pp. 318-362. This subject is also covered in detail in Chapters 11 and 12 of M.T. Hagan, H.B. Demuth, M.H. Beale, Neural Network Design, PWS Publishing Company, Boston, MA 1996.

Fundamentals

Architecture

This section presents the architecture of the network that is most commonly used with the backpropagation algorithm - the multilayer feedforward network. The routines in the Neural Network Toolbox can be used to train more general networks; some of these will be briefly discussed in later chapters.

Neuron Model (tansig, logsig, purelin)

An elementary neuron with R inputs is shown below. Each input is weighted with an appropriate w. The sum of the weighted inputs and the bias forms the input to the transfer function f. Neurons may use any differentiable transfer function f to generate their output.
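The neuron computation described above can be sketched in a few lines of plain MATLAB. This is only an illustration: the input, weight, and bias values are made up, and the log-sigmoid is written out by hand as one possible choice of f rather than calling any toolbox function.

```matlab
% Minimal sketch of a single neuron with R = 3 inputs (all values are illustrative).
p = [0.5; -1.0; 2.0];            % input vector, R-by-1
W = [0.2 -0.4 0.1];              % weight row vector, 1-by-R
b = 0.3;                         % bias
f = @(n) 1 ./ (1 + exp(-n));     % example transfer function (log-sigmoid, hand-written)

n = W*p + b;                     % net input: weighted sum of the inputs plus the bias
a = f(n);                        % neuron output a = f(Wp + b)
fprintf('n = %g, a = %g\n', n, a);
```

Any other differentiable function handle could be substituted for f without changing the rest of the sketch.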

Multilayer networks often use the log-sigmoid transfer function logsig.

[Figure: General neuron with inputs p1, p2, p3, ..., pR, weights w1,1 ... w1,R, bias b, net input n, and output a = f(Wp + b), where R = number of elements in input vector.]

[Figure: Log-sigmoid transfer function, a = logsig(n).]

The function logsig generates outputs between 0 and 1 as the neuron’s net input goes from negative to positive infinity.

Alternatively, multilayer networks may use the tan-sigmoid transfer function tansig.

Occasionally, the linear transfer function purelin is used in backpropagation networks.

If the last layer of a multilayer network has sigmoid neurons, then the outputs of the network are limited to a small range. If linear output neurons are used the network outputs can take on any value.
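For reference, the three transfer functions can be written out directly. The sketch below uses the standard closed forms of the log-sigmoid and hyperbolic tangent sigmoid; the handle names with the _f suffix are hypothetical and are chosen so that they do not shadow the toolbox functions.

```matlab
% Hand-written versions of the three transfer functions (names are hypothetical).
logsig_f  = @(n) 1 ./ (1 + exp(-n));        % output in (0, 1)
tansig_f  = @(n) 2 ./ (1 + exp(-2*n)) - 1;  % output in (-1, 1), same curve as tanh(n)
purelin_f = @(n) n;                         % output unbounded

n = [-10 -1 0 1 10];
disp([n; logsig_f(n); tansig_f(n); purelin_f(n)]);  % one row per function
```

The printed rows show the sigmoid outputs flattening toward their limits as n grows in magnitude, while the linear output simply follows n.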

In backpropagation it is important to be able to calculate the derivatives of any transfer functions used. Each of the transfer functions above, tansig, logsig, and purelin, has a corresponding derivative function: dtansig, dlogsig, and dpurelin. To get the name of a transfer function’s associated derivative function, call the transfer function with the string 'deriv'.

tansig('deriv')
ans = dtansig
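As a rough check of what such derivative functions return, the derivatives of the two sigmoids can be expressed in terms of the function outputs and compared against finite differences. This sketch does not call dtansig, dlogsig, or dpurelin; the handles and the step size h are assumptions made for illustration.

```matlab
% Analytic derivatives of the transfer functions versus central differences.
logsig_f = @(n) 1 ./ (1 + exp(-n));
tansig_f = @(n) tanh(n);

n = linspace(-3, 3, 13);
h = 1e-6;                                    % finite-difference step (illustrative)

d_logsig  = logsig_f(n) .* (1 - logsig_f(n));  % d/dn logsig = a*(1 - a)
d_tansig  = 1 - tansig_f(n).^2;                % d/dn tansig = 1 - a^2
d_purelin = ones(size(n));                     % d/dn purelin = 1

num_logsig = (logsig_f(n + h) - logsig_f(n - h)) / (2*h);
num_tansig = (tansig_f(n + h) - tansig_f(n - h)) / (2*h);

err_logsig = max(abs(d_logsig - num_logsig));
err_tansig = max(abs(d_tansig - num_tansig));
fprintf('max error: logsig %g, tansig %g\n', err_logsig, err_tansig);
```

Both derivatives shrink toward zero for large |n|, which is the property that motivates the resilient backpropagation algorithm later in this chapter.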

[Figure: Tan-sigmoid transfer function, a = tansig(n).]

[Figure: Linear transfer function, a = purelin(n).]

This network can be used as a general function approximator. It can approximate any function with a finite number of discontinuities, arbitrarily well, given sufficient neurons in the hidden layer.

Creating a Network (newff). The first step in training a feedforward network is to create the network object. The function newff creates a feedforward network. It requires four inputs and returns the network object. The first input is an R by 2 matrix of minimum and maximum values for each of the R elements of the input vector. The second input is an array containing the sizes of each layer. The third input is a cell array containing the names of the transfer functions to be used in each layer. The final input contains the name of the training function to be used.

For example, the following command creates a two-layer network. There is one input vector with two elements. The values for the first element of the input vector range between -1 and 2, the values of the second element of the input vector range between 0 and 5. There are three neurons in the first layer and one neuron in the second (output) layer. The transfer function in the first layer is tan-sigmoid, and the output layer transfer function is linear. The training function is traingd (which is described in a later section).

net=newff([-1 2; 0 5],[3,1],{'tansig','purelin'},'traingd');

This command creates the network object and also initializes the weights and biases of the network; therefore the network is ready for training. There are times when you may want to reinitialize the weights, or to perform a custom initialization. The next section explains the details of the initialization process.

[Figure: Two-layer feedforward network with a tan-sigmoid hidden layer, a^1 = tansig(IW^{1,1} p^1 + b^1), and a linear output layer, a^2 = purelin(LW^{2,1} a^1 + b^2). In the figure, p^1 is 2 x 1, IW^{1,1} is 4 x 2, and LW^{2,1} is 3 x 4.]

Initializing Weights (init). Before training a feedforward network, the weights and biases must be initialized. The newff command will automatically initialize the weights, but you may want to reinitialize them. This can be done with the command init. This function takes a network object as input and returns a network object with all weights and biases initialized. Here is how a network is initialized (or reinitialized):

net = init(net);

For specifics on how the weights are initialized, see Chapter 12.

Simulation (sim)

The function sim simulates a network. sim takes the network input p, and the network object net, and returns the network outputs a. Here is how you can use sim to simulate the network we created above for a single input vector:

p = [1;2];
a = sim(net,p)
a = -0.

(If you try these commands, your output may be different, depending on the state of your random number generator when the network was initialized.) Below, sim is called to calculate the outputs for a concurrent set of three input vectors. This is the batch mode form of simulation, in which all of the input vectors are placed in one matrix. This is much more efficient than presenting the vectors one at a time.

p = [1 3 2;2 4 1];
a=sim(net,p)
a = -0.1011 -0.2308 0.
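To make the batch computation concrete, here is a sketch of the forward pass that a two-layer tansig/purelin network with three hidden neurons performs. It is not the toolbox's sim; the weight and bias values are hypothetical stand-ins for whatever newff would have generated.

```matlab
% Forward pass of a 2-3-1 network, written out by hand (weights are hypothetical).
IW = [0.5 -0.2; -0.3 0.1; 0.8 0.4];   % hidden-layer weights, 3-by-2
b1 = [0.1; -0.1; 0.0];                % hidden-layer biases, 3-by-1
LW = [0.6 -0.5 0.3];                  % output-layer weights, 1-by-3
b2 = 0.2;                             % output-layer bias

P = [1 3 2; 2 4 1];                   % three input vectors stored as columns
Q = size(P, 2);                       % number of vectors in the batch

A1 = tanh(IW*P + b1*ones(1, Q));      % tan-sigmoid hidden layer, 3-by-Q
A2 = LW*A1 + b2*ones(1, Q);           % linear output layer, 1-by-Q

% The batch result matches presenting the first vector on its own.
a_single = LW*tanh(IW*P(:, 1) + b1) + b2;
disp(A2);
```

Because every column is processed by the same matrix products, the batch form replaces a loop over input vectors with a single pair of multiplications.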

Training

Once the network weights and biases have been initialized, the network is ready for training. The network can be trained for function approximation (nonlinear regression), pattern association, or pattern classification. The training process requires a set of examples of proper network behavior - network inputs p and target outputs t. During training the weights and biases of the network are iteratively adjusted to minimize the network performance function net.performFcn. The default performance function for feedforward networks is mean square error (mse).

Batch Gradient Descent (traingd). The batch steepest descent training function is traingd. The weights and biases are updated in the direction of the negative gradient of the performance function. If you want to train a network using batch steepest descent, you should set the network trainFcn to traingd, and then call the function train. There is only one training function associated with a given network. There are seven training parameters associated with traingd: epochs, show, goal, time, min_grad, max_fail, and lr. The learning rate lr is multiplied by the negative of the gradient to determine the changes to the weights and biases. The larger the learning rate, the bigger the step. If the learning rate is made too large, the algorithm becomes unstable. If the learning rate is set too small, the algorithm takes a long time to converge. See page 12-8 of [HDB96] for a discussion of the choice of learning rate.

The training status is displayed every show iterations of the algorithm. (If show is set to NaN, then the training status never displays.) The other parameters determine when the training stops. The training stops if the number of iterations exceeds epochs, if the performance function drops below goal, if the magnitude of the gradient is less than min_grad, or if the training time is longer than time seconds. We discuss max_fail, which is associated with the early stopping technique, in the section on improving generalization. The following code creates a training set of inputs p and targets t. For batch training, all of the input vectors are placed in one matrix.

p = [-1 -1 2 2;0 5 0 5];
t = [-1 -1 1 1];

Next, we create the feedforward network. Here we use the function minmax to determine the range of the inputs to be used in creating the network.

net=newff(minmax(p),[3,1],{'tansig','purelin'},'traingd');

At this point, we might want to modify some of the default training parameters.

net.trainParam.show = 50;
net.trainParam.lr = 0.05;
net.trainParam.epochs = 300;
net.trainParam.goal = 1e-5;

If you want to use the default training parameters, the above commands are not necessary.

Now we are ready to train the network.

[net,tr]=train(net,p,t);
TRAINGD, Epoch 0/300, MSE 1.59423/1e-05, Gradient 2.76799/1e-
TRAINGD, Epoch 50/300, MSE 0.00236382/1e-05, Gradient 0.0495292/1e-
TRAINGD, Epoch 100/300, MSE 0.000435947/1e-05, Gradient 0.0161202/1e-
TRAINGD, Epoch 150/300, MSE 8.68462e-05/1e-05, Gradient 0.00769588/1e-
TRAINGD, Epoch 200/300, MSE 1.45042e-05/1e-05, Gradient 0.00325667/1e-
TRAINGD, Epoch 211/300, MSE 9.64816e-06/1e-05, Gradient 0.00266775/1e-
TRAINGD, Performance goal met.

The training record tr contains information about the progress of training. An example of its use is given in the Sample Training Session near the end of this chapter.

Now the trained network can be simulated to obtain its response to the inputs in the training set.

a = sim(net,p)
a = -1.0010 -0.9989 1.0018 0.

Try the Neural Network Design Demonstration nnd12sd1 [HDB96] for an illustration of the performance of the batch gradient descent algorithm.
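For readers who want to see what a batch steepest descent epoch involves, the following sketch trains a 2-3-1 tansig/purelin network on the same p and t by hand. It is not the toolbox's traingd: the starting weights are fixed, hypothetical values (newff would pick them randomly), and the learning rate and epoch count are illustrative choices rather than the values used above.

```matlab
% Batch gradient descent on the example data, written out by hand.
P = [-1 -1 2 2; 0 5 0 5];             % inputs, one column per example
T = [-1 -1 1 1];                      % targets
Q = size(P, 2);

% Fixed, hypothetical starting weights and biases.
IW = [0.1 -0.1; -0.1 0.1; 0.1 0.1];   b1 = zeros(3, 1);
LW = [0.1 -0.1 0.1];                  b2 = 0;

lr = 0.01;  epochs = 500;             % illustrative values
perf = zeros(1, epochs + 1);

for k = 1:epochs + 1
    % Forward pass for the whole batch.
    A1 = tanh(IW*P + b1*ones(1, Q));
    A2 = LW*A1 + b2*ones(1, Q);
    E  = T - A2;
    perf(k) = mean(E.^2);             % mean square error for this epoch
    if k > epochs, break; end

    % Backward pass: gradient of the MSE with respect to each parameter.
    D2 = -2*E/Q;                      % sensitivity at the linear output layer
    D1 = (LW' * D2) .* (1 - A1.^2);   % propagated back through tanh (slope 1 - a^2)
    gLW = D2*A1';   gb2 = sum(D2, 2);
    gIW = D1*P';    gb1 = sum(D1, 2);

    % Steepest descent step: learning rate times the negative gradient.
    LW = LW - lr*gLW;   b2 = b2 - lr*gb2;
    IW = IW - lr*gIW;   b1 = b1 - lr*gb1;
end
fprintf('MSE before: %g, after %d epochs: %g\n', perf(1), epochs, perf(end));
```

The whole training set contributes to each gradient before any weight changes, which is what distinguishes batch training from presenting examples one at a time.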

Batch Gradient Descent with Momentum (traingdm). In addition to traingd, there is another batch algorithm for feedforward networks that often provides faster convergence - traingdm, steepest descent with momentum. Momentum allows a network to respond not only to the local gradient, but also to recent trends in the error surface. Acting like a low-pass filter, momentum allows the network to ignore small features in the error surface. Without momentum a network may get stuck in a shallow local minimum. With momentum a network can slide through such a minimum. See page 12-9 of [HDB96] for a discussion of momentum.
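The low-pass filter behavior can be seen in a small experiment. The sketch below assumes the filter form of the momentum update given in [HDB96], in which the new weight change is mc times the previous change plus (1 - mc) times the plain steepest descent step; the gradient sequence is synthetic, and mc and lr are illustrative values.

```matlab
% Momentum as a first-order low-pass filter on the weight changes.
mc = 0.9;  lr = 0.05;                 % momentum constant and learning rate (illustrative)
k  = 1:200;
g  = 1 + 0.5*(-1).^k;                 % synthetic gradient: steady trend plus an oscillation

dw_plain = -lr*g;                     % plain steepest descent weight changes
dw_mom   = zeros(size(g));
prev = 0;
for i = 1:numel(g)
    prev = mc*prev - (1 - mc)*lr*g(i);   % filtered weight change
    dw_mom(i) = prev;
end

% Peak-to-peak variation over the last ten steps, after the transient has died out.
ripple_plain = max(dw_plain(end-9:end)) - min(dw_plain(end-9:end));
ripple_mom   = max(dw_mom(end-9:end))   - min(dw_mom(end-9:end));
fprintf('ripple without momentum: %g, with momentum: %g\n', ripple_plain, ripple_mom);
```

The average step is the same in both cases, but the oscillating part of the gradient is strongly attenuated when momentum is used, so the weights follow the trend rather than the small features.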

TRAINGDM, Epoch 100/300, MSE 6.34868e-05/1e-05, Gradient 0.0409749/1e-
TRAINGDM, Epoch 114/300, MSE 9.06235e-06/1e-05, Gradient 0.00908756/1e-
TRAINGDM, Performance goal met.

a = sim(net,p)
a = -1.0026 -1.0044 0.9969 0.

Note that since we reinitialized the weights and biases before training (by calling newff again), we obtain a different mean square error than we did using traingd. If we were to reinitialize and train again using traingdm, we would get yet a different mean square error. The random choice of initial weights and biases will affect the performance of the algorithm. If you want to compare the performance of different algorithms, you should test each using several different sets of initial weights and biases. You may want to use net=init(net) to reinitialize the weights, rather than recreating the entire network with newff.

Try the Neural Network Design Demonstration nnd12mo [HDB96] for an illustration of the performance of the batch momentum algorithm.

Faster Training

The previous section presented two backpropagation training algorithms: gradient descent, and gradient descent with momentum. These two methods are often too slow for practical problems. In this section we discuss several high performance algorithms that can converge from ten to one hundred times faster than the algorithms discussed previously. All of the algorithms in this section operate in the batch mode and are invoked using train.

These faster algorithms fall into two main categories. The first category uses heuristic techniques, which were developed from an analysis of the performance of the standard steepest descent algorithm. One heuristic modification is the momentum technique, which was presented in the previous section. This section discusses two more heuristic techniques: variable learning rate backpropagation (traingda) and resilient backpropagation (trainrp).

The second category of fast algorithms uses standard numerical optimization techniques. (See Chapter 9 of [HDB96] for a review of basic numerical optimization.) Later in this section we present three types of numerical optimization techniques for neural network training: conjugate gradient (traincgf, traincgp, traincgb, trainscg), quasi-Newton (trainbfg, trainoss), and Levenberg-Marquardt (trainlm).

Variable Learning Rate (traingda, traingdx)

With standard steepest descent, the learning rate is held constant throughout training. The performance of the algorithm is very sensitive to the proper setting of the learning rate. If the learning rate is set too high, the algorithm may oscillate and become unstable. If the learning rate is too small, the algorithm will take too long to converge. It is not practical to determine the optimal setting for the learning rate before training, and, in fact, the optimal learning rate changes during the training process, as the algorithm moves across the performance surface. The performance of the steepest descent algorithm can be improved if we allow the learning rate to change during the training process. An adaptive learning rate will attempt to keep the learning step size as large as possible while keeping learning stable. The learning rate is made responsive to the complexity of the local error surface.
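The idea of an adaptive learning rate can be illustrated on a simple quadratic surface. The rule sketched here is a simplified one and is not claimed to match traingda exactly: a step that increases the error is discarded and the rate is reduced, while a step that does not increase the error is kept and the rate is raised. The surface A, the factors inc and dec, and the starting rate are all assumptions made for the example.

```matlab
% Fixed versus adaptive learning rate on a quadratic surface F(x) = x'*A*x/2.
A = [10 0; 0 1];                      % illustrative curvature; plain descent needs lr < 0.2
F = @(x) 0.5*(x'*A*x);
lr0 = 0.3;  inc = 1.05;  dec = 0.7;   % starting rate and adaptation factors (assumed)

% Fixed rate: the step is too large, so the iterates diverge.
x_fixed = [1; 1];
for k = 1:20
    x_fixed = x_fixed - lr0*(A*x_fixed);
end

% Adaptive rate: reject steps that raise the error, accept the rest.
x = [1; 1];  lr = lr0;  perf = F(x);
perf_hist = zeros(1, 50);
for k = 1:50
    x_new = x - lr*(A*x);             % tentative steepest descent step
    perf_new = F(x_new);
    if perf_new > perf
        lr = lr*dec;                  % error grew: discard the step, shrink the rate
    else
        x = x_new;  perf = perf_new;  % keep the step
        lr = lr*inc;                  % and try a slightly larger rate next time
    end
    perf_hist(k) = perf;
end
fprintf('fixed lr: F = %g, adaptive lr: F = %g\n', F(x_fixed), perf);
```

Starting from the same oversized rate, the fixed-rate run blows up while the adaptive run backs off until its steps are stable and then probes upward again.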

An adaptive learning rate requires some changes in the training procedure used by traingd. First, the initial network output and error are calculated. At

The function traingdx combines adaptive learning rate with momentum training. It is invoked in the same way as traingda, except that it has the momentum coefficient mc as an additional training parameter.

Resilient Backpropagation (trainrp)

Multilayer networks typically use sigmoid transfer functions in the hidden layers. These functions are often called “squashing” functions, since they compress an infinite input range into a finite output range. Sigmoid functions are characterized by the fact that their slope must approach zero as the input gets large. This causes a problem when using steepest descent to train a multilayer network with sigmoid functions: the gradient can have a very small magnitude and therefore cause only small changes in the weights and biases, even though the weights and biases are far from their optimal values.

The purpose of the resilient backpropagation (Rprop) training algorithm is to eliminate these harmful effects of the magnitudes of the partial derivatives. Only the sign of the derivative is used to determine the direction of the weight update; the magnitude of the derivative has no effect on the weight update. The size of the weight change is determined by a separate update value. The update value for each weight and bias is increased by a factor delt_inc whenever the derivative of the performance function with respect to that weight has the same sign for two successive iterations. The update value is decreased by a factor delt_dec whenever the derivative with respect to that weight changes sign from the previous iteration. If the derivative is zero, then the update value remains the same. Whenever the weights are oscillating the weight change will be reduced. If the weight continues to change in the same direction for several iterations, then the magnitude of the weight change will be increased. A complete description of the Rprop algorithm is given in [ReBr93].

In the following code we recreate our previous network and train it using the Rprop algorithm. The training parameters for trainrp are epochs, show, goal, time, min_grad, max_fail, delt_inc, delt_dec, delta0, deltamax. We have previously discussed the first eight parameters. The last two are the initial step size and the maximum step size, respectively. The performance of Rprop is not very sensitive to the settings of the training parameters. For the example below, we leave most of the training parameters at the default values. We do reduce show below our previous value, because Rprop generally converges much faster than the previous algorithms.

p = [-1 -1 2 2;0 5 0 5];
t = [-1 -1 1 1];
net=newff(minmax(p),[3,1],{'tansig','purelin'},'trainrp');
net.trainParam.show = 10;
net.trainParam.epochs = 300;
net.trainParam.goal = 1e-5;
[net,tr]=train(net,p,t);
TRAINRP, Epoch 0/300, MSE 0.469151/1e-05, Gradient 1.4258/1e-
TRAINRP, Epoch 10/300, MSE 0.000789506/1e-05, Gradient 0.0554529/1e-
TRAINRP, Epoch 20/300, MSE 7.13065e-06/1e-05, Gradient 0.00346986/1e-
TRAINRP, Performance goal met.

a = sim(net,p)
a = -1.0026 -0.9963 0.9978 1.

Rprop is generally much faster than the standard steepest descent algorithm. It also has the nice property that it requires only a modest increase in memory requirements. We do need to store the update values for each weight and bias, which is equivalent to storage of the gradient.
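The sign-based update can be sketched on a badly scaled quadratic, where the two derivatives differ in magnitude by a factor of 100. This is a simplified Rprop variant written for illustration and is not the trainrp implementation; the numerical values given to delt_inc, delt_dec, delta0, and deltamax are assumptions, as is the test surface.

```matlab
% Simplified Rprop on F(x) = x'*A*x/2 with very different curvatures per weight.
A = [100 0; 0 1];                     % illustrative, badly scaled surface
grad = @(x) A*x;
x = [1; 1];

delt_inc = 1.2;  delt_dec = 0.5;      % assumed increase and decrease factors
delta0 = 0.07;   deltamax = 50;       % assumed initial and maximum update values
delta  = delta0*ones(2, 1);           % one update value per weight
g_prev = zeros(2, 1);

for k = 1:100
    g = grad(x);
    s = sign(g .* g_prev);            % +1: same sign, -1: sign change, 0: a zero derivative
    delta(s > 0) = min(delta(s > 0)*delt_inc, deltamax);
    delta(s < 0) = delta(s < 0)*delt_dec;
    x = x - sign(g).*delta;           % only the sign of the derivative sets the direction
    g_prev = g;
end
disp(x');
```

Both weights follow exactly the same trajectory even though one derivative is 100 times larger than the other, which is the insensitivity to gradient magnitude described above.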

Conjugate Gradient Algorithms

The basic backpropagation algorithm adjusts the weights in the steepest descent direction (negative of the gradient). This is the direction in which the performance function is decreasing most rapidly. It turns out that, although the function decreases most rapidly along the negative of the gradient, this does not necessarily produce the fastest convergence. In the conjugate gradient algorithms a search is performed along conjugate directions, which produces generally faster convergence than steepest descent directions. In this section, we present four different variations of conjugate gradient algorithms.

See page 12-14 of [HDB96] for a discussion of conjugate gradient algorithms and their application to neural networks.
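A small quadratic example shows why conjugate directions can beat steepest descent directions. The sketch below uses the Fletcher-Reeves ratio for the direction update and an exact line minimization, which is available in closed form only because the surface is quadratic; the matrix A, the vector b, and the starting point are illustrative assumptions, and none of this is toolbox code.

```matlab
% Steepest descent versus conjugate gradient on F(x) = x'*A*x/2 - b'*x.
A = [10 2; 2 1];  b = [1; 1];         % illustrative quadratic, minimum at A\b
x_star = A\b;

% Two steepest descent iterations with exact line minimization.
x_sd = [0; 0];
for k = 1:2
    g = A*x_sd - b;
    alpha = (g'*g)/(g'*A*g);          % exact step along -g for a quadratic
    x_sd = x_sd - alpha*g;
end

% Two conjugate gradient iterations (Fletcher-Reeves ratio).
x_cg = [0; 0];
g = A*x_cg - b;
p = -g;                               % first direction is steepest descent
for k = 1:2
    alpha = -(g'*p)/(p'*A*p);         % exact step along p for a quadratic
    x_cg  = x_cg + alpha*p;
    g_new = A*x_cg - b;
    beta  = (g_new'*g_new)/(g'*g);    % ratio of new to previous squared gradient norms
    p = -g_new + beta*p;              % new direction, conjugate to the previous one
    g = g_new;
end
fprintf('error after 2 steps: SD %g, CG %g\n', norm(x_sd - x_star), norm(x_cg - x_star));
```

On a two-dimensional quadratic the conjugate gradient iterates reach the minimum in two line searches, while steepest descent is still zigzagging toward it.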

In most of the training algorithms that we discussed up to this point, a learning rate is used to determine the length of the weight update (step size). In most of the conjugate gradient algorithms, the step size is adjusted at each iteration. A search is made along the conjugate gradient direction to determine the step size, which minimizes the performance function along that line. There are five

The training parameters for traincgf are epochs, show, goal, time, min_grad, max_fail, srchFcn, scal_tol, alpha, beta, delta, gama, low_lim, up_lim, maxstep, minstep, bmax. We have previously discussed the first six parameters. The parameter srchFcn is the name of the line search function. It can be any of the functions described later in this section (or a user-supplied function). The remaining parameters are associated with specific line search routines and are described later in this section. The default line search routine srchcha is used in this example. traincgf generally converges in fewer iterations than trainrp (although there is more computation required in each iteration).

p = [-1 -1 2 2;0 5 0 5];
t = [-1 -1 1 1];
net=newff(minmax(p),[3,1],{'tansig','purelin'},'traincgf');
net.trainParam.show = 5;
net.trainParam.epochs = 300;
net.trainParam.goal = 1e-5;
[net,tr]=train(net,p,t);
TRAINCGF-srchcha, Epoch 0/300, MSE 2.15911/1e-05, Gradient 3.17681/1e-
TRAINCGF-srchcha, Epoch 5/300, MSE 0.111081/1e-05, Gradient 0.602109/1e-
TRAINCGF-srchcha, Epoch 10/300, MSE 0.0095015/1e-05, Gradient 0.197436/1e-
TRAINCGF-srchcha, Epoch 15/300, MSE 0.000508668/1e-05, Gradient 0.0439273/1e-
TRAINCGF-srchcha, Epoch 17/300, MSE 1.33611e-06/1e-05, Gradient 0.00562836/1e-
TRAINCGF, Performance goal met.

a = sim(net,p)
a = -1.0001 -1.0023 0.9999 1.

The conjugate gradient algorithms are usually much faster than variable learning rate backpropagation, and are sometimes faster than trainrp, although the results will vary from one problem to another. The conjugate gradient algorithms require only a little more storage than the simpler algorithms, so they are often a good choice for networks with a large number of weights.

Try the Neural Network Design Demonstration nnd12cg [HDB96] for an illustration of the performance of a conjugate gradient algorithm.

Polak-Ribière Update (traincgp)

Another version of the conjugate gradient algorithm was proposed by Polak and Ribière. As with the Fletcher-Reeves algorithm, the search direction at each iteration is determined by

$$p_k = -g_k + \beta_k p_{k-1}$$

For the Polak-Ribière update, the constant $\beta_k$ is computed by

$$\beta_k = \frac{\Delta g_{k-1}^T g_k}{g_{k-1}^T g_{k-1}}, \qquad \Delta g_{k-1} = g_k - g_{k-1}$$

This is the inner product of the previous change in the gradient with the current gradient divided by the norm squared of the previous gradient. See [FlRe64] or [HDB96] for a discussion of the Polak-Ribière conjugate gradient algorithm.

In the following code, we recreate our previous network and train it using the Polak-Ribière version of the conjugate gradient algorithm. The training parameters for traincgp are the same as those for traincgf. The default line search routine srchcha is used in this example. The parameters show and epochs are set to the same values as they were for traincgf.

p = [-1 -1 2 2;0 5 0 5];
t = [-1 -1 1 1];
net=newff(minmax(p),[3,1],{'tansig','purelin'},'traincgp');
net.trainParam.show = 5;
net.trainParam.epochs = 300;
net.trainParam.goal = 1e-5;
[net,tr]=train(net,p,t);
TRAINCGP-srchcha, Epoch 0/300, MSE 1.21966/1e-05, Gradient 1.77008/1e-
TRAINCGP-srchcha, Epoch 5/300, MSE 0.227447/1e-05, Gradient 0.86507/1e-
TRAINCGP-srchcha, Epoch 10/300, MSE 0.000237395/1e-05, Gradient 0.0174276/1e-
TRAINCGP-srchcha, Epoch 15/300, MSE 9.28243e-05/1e-05, Gradient 0.00485746/1e-
TRAINCGP-srchcha, Epoch 20/300, MSE 1.46146e-05/1e-05, Gradient 0.000912838/1e-
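As a small numerical illustration of the Polak-Ribière constant, the sketch below evaluates it for a pair of made-up gradient vectors and compares it with the Fletcher-Reeves ratio of squared gradient norms. The vectors g_prev, g, g_orth, and p_prev are hypothetical values chosen only to make the arithmetic easy to follow.

```matlab
% Polak-Ribiere constant for hypothetical successive gradients.
g_prev = [0.6; -0.6];                 % previous gradient (made up)
g      = [0.1;  0.3];                 % current gradient (made up)
p_prev = [-0.24; 0.96];               % previous search direction (made up)

% Inner product of the change in gradient with the current gradient,
% divided by the squared norm of the previous gradient.
beta_pr = ((g - g_prev)'*g)/(g_prev'*g_prev);
beta_fr = (g'*g)/(g_prev'*g_prev);    % Fletcher-Reeves ratio, for comparison
p = -g + beta_pr*p_prev;              % new search direction

% When successive gradients are orthogonal, the two constants coincide.
g_orth = [0.3; 0.3];                  % orthogonal to g_prev
beta_pr_orth = ((g_orth - g_prev)'*g_orth)/(g_prev'*g_prev);
beta_fr_orth = (g_orth'*g_orth)/(g_prev'*g_prev);
fprintf('beta_pr = %g, beta_fr = %g\n', beta_pr, beta_fr);
```

The two formulas differ only by the term involving the inner product of the previous and current gradients, which vanishes when an exact line search on a quadratic makes them orthogonal.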