Multilayered Feed forward Neural Network , Lecture Notes - Computer Science, Study notes of Artificial Intelligence

Prof. David C Parkes, Computer Science, Multilayer Feed-Forward Neural Networks, Logistic Regression, The Back-Propagation Algorithm, Neural Networks, Perceptrons, Artificial Neural Network, Gradient Descent, Harvard, Lecture Notes


CS181 Lecture 6: Multi-layer Feed-forward Neural Networks
Avi Pfeffer; Revised by David Parkes
Feb 9, 2011
1 Limitations of Perceptrons
(^1) As we saw, perceptrons can only represent linearly separable hypotheses. This
limitation of perceptrons comes despite having non-linear activation functions.
Minsky & Papert’s (1969) argument against perceptrons was based on this
observation, but was a little more subtle than simply saying “Perceptrons can’t
represent xor, so they can’t be a basis for learning and intelligence.”
There are two counters to this argument. One is that we don’t necessarily
care if a hypothesis space actually contains the correct hypothesis, only that it
can produce a hypothesis that generalizes well to unseen data. In fact, empirical
evidence shows that hypothesis spaces of linear separators sometimes perform
quite well even when the true hypothesis is not linearly separable.
A second counter-argument is that even if a concept is not linearly separable
in terms of the raw input data, we can apply some preprocessing to the data
to obtain high-level features, and then the hypothesis may be linearly separable
in terms of those features. One can define multiple features, φ1(x),...,φk(x),
each of which can be a non-linear function of attributes x, and then use these as
inputs to a perceptron algorithm. These functions, φ1,...,φk, are often called
basis functions.
This approach, of introducing additional features, was in fact the method
adopted in much of the perceptron research in the sixties. For example, a typical
system might have the following design:
x1--> φ1(x)
--->
... ... perceptron ----> output
--->
x1000 -->φ100(x)
(^1) This section is based in part on the discussion in Section 3.5.4 of “Neural Networks for
Pattern Recognition” by Bishop.

The input data in this example contains attributes (feature values) x1 to
x1000. These are pre-processed to produce 100 features, φ1(x) to φ100(x), which
are then passed to a perceptron. With this type of approach, many interesting properties of the data can be represented as linearly separable hypotheses, if the right features are used. But the key limitation is that the choice of basis functions (or features) is not adaptive: only the weights from the 100 basis functions were learned, and not the basis functions themselves.

The crux of Minsky & Papert’s argument is that using non-adaptive high-level features is not good enough. The number of pre-programmed features that would be needed in order to represent all interesting properties using linearly separable hypotheses is astronomical. For any reasonably sized set of pre-programmed features, there will be some interesting properties that cannot be represented using a linear separator.

An example provided by Minsky & Papert considers a pre-programmed set of features that are restricted to only consider small local regions of the image. Using such features, one cannot tell whether or not an image is connected. For example, consider the following four images:

[Figure: four images, each made of horizontal bars of X's, differing in the configuration of their leftmost and rightmost columns.]

Now, suppose we only have features that are restricted to looking at individual columns. Whether or not an image is connected depends on the combined configuration of the leftmost and rightmost columns. No local feature can represent the combined configuration of these two columns. In fact, if we consider the two column configurations appearing here, whether or not the image is connected is the xor (oh no!) of the configurations of the left and right columns, which is not linearly separable. That is, interpret the data as x = (0, 0), (1, 0), (1, 1) and (0, 1) going through the images in clockwise order from the top left, with target classes {0, 1, 0, 1}.

2 Multilayer Feed-Forward Neural Networks

A natural solution to the problem raised by Minsky & Papert is to make the basis functions adaptive. In fact, we might allow each basis function itself to be defined as a perceptron. Then one can hope to learn which high-level features

One possible activation function g is the perceptron activation function, defined as

$$
g(\mathit{in}) = \begin{cases} 1, & \text{if } \mathit{in} > 0 \\ -1, & \text{otherwise} \end{cases}
$$

Given that a neural network chains together multiple perceptron-like units into a complex network, it is natural to think that it should have a significantly richer hypothesis space. For example, one can easily represent the X1 xor X2 function. For this, we could construct two hidden units, the first of which would represent X1 ∧ ¬X2 and the second of which would represent ¬X1 ∧ X2. A single output unit would then represent the disjunction of these two concepts (this is an or). We can easily represent and, or, and negation logic with individual perceptrons, and now these can be nested to form general disjunctive normal form expressions.

The main remaining mystery is how to learn the weights in a two-layer, or more general, neural network. For this, we will take a detour into the world of logistic regression. We will then have something concrete to say about the representation power of neural networks, and also be able to understand the back-propagation algorithm.
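The xor construction described above can be checked with a few lines of code. This is only a sketch: the particular weights and biases are one hand-picked choice, not values from the notes, the inputs are assumed to be 0 or 1, and the helper names `unit` and `xor_net` are hypothetical.

```python
def g(total_input):
    """Perceptron threshold activation: +1 if the input is positive, else -1."""
    return 1 if total_input > 0 else -1


def unit(weights, bias, inputs):
    """One perceptron-like unit: threshold applied to a weighted sum plus bias."""
    return g(bias + sum(w * x for w, x in zip(weights, inputs)))


def xor_net(x1, x2):
    """Two hidden units feeding one output unit; inputs are assumed to be 0 or 1."""
    h1 = unit([1.0, -1.0], -0.5, [x1, x2])   # +1 only for x1 AND NOT x2
    h2 = unit([-1.0, 1.0], -0.5, [x1, x2])   # +1 only for NOT x1 AND x2
    # Hidden outputs are +1 or -1, so a bias of +1 makes the output unit an OR.
    return unit([1.0, 1.0], 1.0, [h1, h2])


for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, xor_net(x1, x2))
```

The output is +1 exactly when one input is 1 and the other is 0, and -1 otherwise, which no single threshold unit on the raw inputs can achieve.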

3 Logistic Regression

(^2) We should recall at this point that the discontinuous threshold activation function in the perceptron processing unit caused some difficulty for learning weights to minimize training error. In particular, we could use the perceptron learning rule, but this was only convergent (for any fixed α > 0) for linearly separable data. We could also use the adaline rule, which was convergent, but only to a hypothesis that minimized a proxy for the true training error.

As a prelude to general neural network models, we will briefly consider an alternative to the perceptron, in which the discontinuous activation function of the perceptron is replaced with a continuous, differentiable function:

$$
g(\mathit{in}) = \frac{1}{1 + e^{-\mathit{in}}}
$$

This is the logistic or sigmoid function. It outputs 0.5 for in = 0, and approaches 0 and 1 for large negative and large positive inputs respectively.

[Figure: graph of the sigmoid activation function for inputs from −5 to 5, with outputs between 0 and 1.]

(^2) This section follows Section 18.6.4 in Russell and Norvig.
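As a quick numerical illustration of the behaviour just described, the sigmoid can be evaluated at a few points. This is a minimal sketch using only the standard library; the function name `sigmoid` is our own choice.

```python
import math


def sigmoid(total_input):
    """Logistic activation g(in) = 1 / (1 + e^(-in))."""
    return 1.0 / (1.0 + math.exp(-total_input))


# Evaluate over the same range as the plot: close to 0 on the left,
# exactly 0.5 at zero, close to 1 on the right.
for value in (-5, -2, 0, 2, 5):
    print(value, round(sigmoid(value), 4))
```

For inputs that are very large and negative, `math.exp` can overflow; this sketch ignores that case.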

For this new activation function, the hypothesis for weight vector w (which still includes a bias weight w_0) becomes

$$
h_w(x) = g(w \cdot x) = \frac{1}{1 + e^{-w \cdot x}}
$$

and the process of fitting weights to minimize error on a data set D, given this single unit, is called logistic regression. Don’t be confused: despite its name, logistic “regression” is appropriate for classification problems, for example a binary classification problem where the training data places each x into class y = 0 or y = 1. To classify with logistic regression, one would ultimately predict y′ = 1 when h_w(x) > 0.5 and y′ = 0 otherwise.

Now that we have a differentiable activation function, we can again adopt the approach of stochastic gradient descent. We can adopt as our error the actual training error, and not some proxy for it, because this error is differentiable! For this, fix some instance (x, y), and consider the squared error. To derive the partial derivative, and thus the gradient descent rule, we have:

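\begin{align}
\frac{\partial}{\partial w_j} \mathrm{Error}(w)
  &= \frac{\partial}{\partial w_j} \bigl(y - h_w(x)\bigr)^2 \tag{8} \\
  &= 2 \bigl(y - h_w(x)\bigr) \frac{\partial}{\partial w_j} \bigl(y - h_w(x)\bigr) \tag{9} \\
  &= -2 \bigl(y - h_w(x)\bigr) \frac{\partial}{\partial w_j} g(w \cdot x) \tag{10} \\
  &= -2 \bigl(y - h_w(x)\bigr) g'(w \cdot x) \frac{\partial}{\partial w_j} (w \cdot x) \tag{11} \\
  &= -2 \bigl(y - h_w(x)\bigr) g'(w \cdot x) x_j, \tag{12}
\end{align}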

where we make repeated use of the chain rule: ∂g(f(x))/∂x = g′(f(x)) ∂f(x)/∂x. Now, the derivative g′ of the logistic function is well defined and satisfies g′(in) = g(in)(1 − g(in)).
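To make the gradient in Eq. (12) concrete, here is a minimal sketch of stochastic gradient descent for a single sigmoid unit. It assumes the identity g′(in) = g(in)(1 − g(in)) for the logistic function, folds the constant factor 2 into the learning rate, and uses a toy Boolean OR data set; the function names, learning rate, and epoch count are illustrative choices, not part of the notes.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict_prob(w, x):
    """h_w(x) = g(w . x); x[0] is assumed to be a constant 1 paired with the bias w[0]."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)))


def sgd_step(w, x, y, alpha):
    """One stochastic update on the squared error of a single instance (x, y).

    Moves against the gradient of Eq. (12), with g'(in) = g(in)(1 - g(in))
    and the factor 2 absorbed into alpha (an assumption of this sketch).
    """
    h = predict_prob(w, x)
    delta = (y - h) * h * (1.0 - h)
    return [wj + alpha * delta * xj for wj, xj in zip(w, x)]


def train(data, alpha=0.5, epochs=5000):
    w = [0.0] * len(data[0][0])
    for _ in range(epochs):
        for x, y in data:
            w = sgd_step(w, x, y, alpha)
    return w


# Toy data (hypothetical): Boolean OR, with a leading 1 as the bias input.
or_data = [([1, 0, 0], 0), ([1, 0, 1], 1), ([1, 1, 0], 1), ([1, 1, 1], 1)]
w = train(or_data)
predictions = [1 if predict_prob(w, x) > 0.5 else 0 for x, _ in or_data]
print(predictions)
```

Classification uses the 0.5 threshold on h_w(x) described earlier in this section.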

In fact, with a single hidden layer and this sigmoid activation function, one can approximate, to any desired degree of accuracy, any continuous function f : [−1, +1]^m → [0, 1]. This is a fundamental result from computational learning theory. But, before you get too excited by this, there is a caveat: the number of hidden units required may be very large! Just to represent all Boolean functions on m inputs, the number of hidden units required is exponential in the number of inputs.

The key remaining challenge in learning the weights that parameterize a general neural network (with sigmoid activations) is to determine the error to associate with the output of hidden units. The output from a hidden unit provides an input to one, and perhaps many, output units, and goes through an additional non-linear activation before it modifies the output from the neural network. The back-propagation algorithm provides an elegant and fast method to perform gradient descent on the weights of all units in a neural network, and provides an answer to this question.

In general, a neural network may have multiple output units, k ∈ {1, ..., K}. The training data in this case contains examples D = {(x_1, y_1), ..., (x_n, y_n)}, where y_i ∈ [0, 1]^K defines the target vector for input x_i. In the next lecture we get into a discussion about how to encode inputs and outputs for NNs in useful ways. Given this, we seek to minimize the total squared error on the training data. For a single instance (x, y), we sum the squared error on the output units,

$$
\mathrm{Error}(w) = \sum_{k=1}^{K} (y_k - a_k)^2, \tag{17}
$$

where a_k is the activation level of output unit k. Let err_k = y_k − a_k. The error over all the training data can be defined by summing Eq. (17) over all training examples.

In thinking about learning the weights in a two-layer neural network, there are two kinds of weights:

  • for each output unit k, there are weights w_kj associated with the activation, a_j, of each hidden unit that forms its input.
  • for each hidden unit j, there are weights w_ji associated with the activation, a_i = x_i, of each input attribute that forms its input.

For each k, the update rule for the first kind of weight is precisely that from logistic regression,

$$
w_{kj}^{(t+1)} \leftarrow w_{kj}^{(t)} + \alpha a_j \delta_k, \tag{18}
$$

where

$$
\delta_k = \mathit{err}_k \cdot g'(\mathit{in}_k) = \mathit{err}_k \cdot a_k (1 - a_k), \tag{19}
$$

and err_k = y_k − a_k. The derivation for weights associated with hidden units j, that is, for the second kind of weight, yields a similar rule:

$$
w_{ji}^{(t+1)} \leftarrow w_{ji}^{(t)} + \alpha a_i \delta_j, \tag{20}
$$

where δ_j = err_j · g′(in_j) = err_j · a_j(1 − a_j), and

$$
\mathit{err}_j = \sum_{k=1}^{K} w_{kj} \delta_k \tag{21}
$$

We provide the derivation below. This result is beautiful: the rule for updating weights associated with the hidden units looks exactly like the rule for updating weights associated with the output units! The key to making this work is the way err_j is defined on hidden unit j. Intuitively, the back-propagation algorithm works out how much each hidden unit j contributes to the actual error, and puts this amount into err_j. Of course, it is really all possible because of the differentiability of the sigmoid function and the chain rule!

The back-propagation algorithm generalizes immediately to arbitrary feed-forward networks: one simply propagates the weighted error err_j from downstream units back to upstream units. The general form of the back-propagation algorithm for processing a single training instance is as follows:

Repeat until all units have been processed
    Pick a unit j all of whose children have been processed
    If j is an output unit
        err_j = (y_j − a_j)
    Else
        err_j = Σ_{k ∈ Child(j)} w_kj^(t) δ_k
    δ_j = g′(in_j) err_j
For each unit j, for each parent i of j
    w_ji^(t+1) ← w_ji^(t) + α a_i δ_j

Here, we adopt Child(j) to denote the children of unit j (the downstream units connected to j), and a parent of unit j is an upstream unit that provides one of the inputs to j. This is the algorithm for training on a single instance. A neural network is trained by running back-propagation on each of the instances in a training set repeatedly. One training epoch occurs every time n instances have been considered.

Note that the δ_j values on hidden units are computed based on the existing weights w_kj^(t). In effect, back-propagation determines all the δ values and then updates all the weights. The update can also proceed in batch mode, where weights are adjusted according to the net contribution of all data points (x, y) ∈ D, but stochastic updates tend to allow for faster training.
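For the two-layer case, the updates in Eqs. (18) to (21) can be sketched in a few lines. This is an illustration only, under stated assumptions: weights are stored as nested lists (`w_hidden[j][i]`, `w_out[k][j]`), bias terms are omitted for brevity, and the small network and its numbers are hypothetical. Note that all δ values are computed from the old weights before any weight is changed.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(x, w_hidden, w_out):
    """Return hidden activations a_j and output activations a_k."""
    a_hidden = [sigmoid(sum(w * xi for w, xi in zip(row, x))) for row in w_hidden]
    a_out = [sigmoid(sum(w * aj for w, aj in zip(row, a_hidden))) for row in w_out]
    return a_hidden, a_out


def backprop_step(x, y, w_hidden, w_out, alpha):
    """One update on instance (x, y); returns new weight matrices."""
    a_hidden, a_out = forward(x, w_hidden, w_out)
    # Eq. (19): delta_k = err_k * a_k * (1 - a_k), with err_k = y_k - a_k.
    delta_out = [(yk - ak) * ak * (1 - ak) for yk, ak in zip(y, a_out)]
    # Eq. (21): err_j = sum_k w_kj * delta_k, using the weights before the update.
    delta_hidden = [
        sum(w_out[k][j] * delta_out[k] for k in range(len(w_out))) * aj * (1 - aj)
        for j, aj in enumerate(a_hidden)
    ]
    # Eqs. (18) and (20): w <- w + alpha * (input activation) * delta.
    new_w_out = [[w + alpha * aj * dk for w, aj in zip(row, a_hidden)]
                 for row, dk in zip(w_out, delta_out)]
    new_w_hidden = [[w + alpha * xi * dj for w, xi in zip(row, x)]
                    for row, dj in zip(w_hidden, delta_hidden)]
    return new_w_hidden, new_w_out


def squared_error(x, y, w_hidden, w_out):
    """Eq. (17) for a single instance."""
    _, a_out = forward(x, w_hidden, w_out)
    return sum((yk - ak) ** 2 for yk, ak in zip(y, a_out))


# Hypothetical small network: 2 inputs, 2 hidden units, 1 output, no bias terms.
w_hidden = [[0.1, -0.2], [0.4, 0.3]]
w_out = [[0.2, -0.1]]
x, y = [1.0, 0.0], [1.0]
before = squared_error(x, y, w_hidden, w_out)
w_hidden, w_out = backprop_step(x, y, w_hidden, w_out, alpha=0.5)
after = squared_error(x, y, w_hidden, w_out)
print(before, after)
```

With a modest learning rate, the squared error on the instance should go down after the step, which is a simple sanity check on the signs of the updates.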

where Parents(j) is the set of units that are upstream of j and provide inputs to j.

When implementing back- and forward-propagation in practice, we would rather not have to search for a unit to process next, but be able to choose the next unit automatically. This can easily be achieved by numbering the units in such a way that parents always have a smaller index than their children. This can either be forced in the design of the network, or a topological sort algorithm can be used to sort the units appropriately. Once this has been achieved, forward-propagation can simply iterate through the units in increasing sequence and back-propagation can iterate in decreasing sequence.
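The index-ordering idea can be sketched as follows. The code assumes the units are already numbered so that every parent has a smaller index than its children, with the input units first; the data layout (`parents` as a list of lists, weights in a dictionary keyed by `(child, parent)`) and the example network are hypothetical choices for illustration. Sigmoid units and squared error are assumed, as elsewhere in these notes.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(x, parents, w):
    """Forward pass in increasing index order, so every parent is computed first."""
    a = list(x) + [0.0] * (len(parents) - len(x))
    for j in range(len(x), len(parents)):
        a[j] = sigmoid(sum(w[(j, i)] * a[i] for i in parents[j]))
    return a


def backward(a, targets, parents, w, num_inputs, alpha):
    """Backward pass in decreasing index order, so every child is processed first.

    targets maps each output unit index to its target value y_j.
    """
    err = [0.0] * len(parents)
    for j, y_j in targets.items():
        err[j] = y_j - a[j]
    delta = [0.0] * len(parents)
    for j in range(len(parents) - 1, num_inputs - 1, -1):
        delta[j] = a[j] * (1.0 - a[j]) * err[j]      # g'(in_j) * err_j
        for i in parents[j]:
            err[i] += w[(j, i)] * delta[j]           # uses the pre-update weight
    # All deltas are known now, so every weight can be updated.
    return {(j, i): w_ji + alpha * a[i] * delta[j] for (j, i), w_ji in w.items()}


# Hypothetical network: units 0 and 1 are inputs, 2 and 3 are hidden, 4 is the output.
# Unit 4 also reads input 0 directly, so this is not a strictly layered network.
parents = [[], [], [0, 1], [0, 1], [0, 2, 3]]
w = {(2, 0): 0.3, (2, 1): -0.4, (3, 0): 0.2, (3, 1): 0.5,
     (4, 0): -0.1, (4, 2): 0.6, (4, 3): -0.7}
x, targets = [1.0, 0.5], {4: 1.0}

a = forward(x, parents, w)
new_w = backward(a, targets, parents, w, num_inputs=2, alpha=0.5)
print(a[4], forward(x, parents, new_w)[4])
```

Here each unit pushes its weighted δ to its parents rather than each parent pulling from its children; the sums are the same. If the units are not already numbered this way, a topological sort such as Python's standard `graphlib.TopologicalSorter` (Python 3.9 and later) is one way to obtain a suitable order.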

5.2 Computational Complexity

What is the cost of running forward and back-propagation on a single training instance? In forward propagation, each edge contributes one term to one of the inj sums and each unit results in one computation of g. Let v and e denote the number of units and edges respectively. The total running time for forward propagation is O(v + e). A similar calculation results in the same cost for back-propagation. Thus the total cost is O(v + e). If we consider networks with a single complete hidden layer, with m inputs, J hidden units and K outputs, then the network has J + K units and (m + K)J edges. Therefore the total cost is O(J + K + (m + K)J). Normally the number of inputs and outputs is fixed, while we vary the number J of hidden units. Looked at that way, the cost for a single forward and backward propagation phase, on a single instance, is O(J). But note that this is only the cost of running forward and back-propagation on a single example. It is not the cost of training an entire network, which takes multiple epochs. It is in fact possible for the number of epochs needed to train a network to be exponential in J, the number of hidden units.
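The unit and edge counts for a single complete hidden layer are easy to tabulate. The sketch below simply evaluates the formulas from the paragraph above; the function name `network_size` and the example sizes are hypothetical.

```python
def network_size(m, J, K):
    """Units and edges for m inputs, J hidden units and K outputs (one complete hidden layer)."""
    units = J + K              # input attributes are not counted as units
    edges = m * J + J * K      # equals (m + K) * J
    return units, edges


# With m and K fixed, doubling J roughly doubles units + edges,
# which is the O(J) cost per instance noted above.
for J in (10, 20, 40):
    units, edges = network_size(m=100, J=J, K=5)
    print(J, units, edges, units + edges)
```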

5.3 Convergence of Back-Propagation

So far, we’ve talked mainly about running one phase of the algorithm on a single example. Training a network requires training on each example many times. The basic form of the full training algorithm is as follows:

Repeat until convergence:
    For each training example x
        Run forward propagation on x
        Run back propagation on x

What does convergence mean here? It means that the difference in weights between successive iterations is smaller than some pre-defined tolerance ε > 0. As we said, back-propagation is a gradient descent algorithm, so it will eventually converge, although it may converge to a local minimum.(^3) The exact convergence is not especially important, however, because if it converges it is only to a local minimum and not, in general, a global minimum.

The number of epochs required until convergence depends on a number of factors. One, of course, is the learning rate. As we discussed earlier, if the learning rate is too small, too many epochs may be required, but if it is too big the algorithm may end up oscillating around a minimum. Unfortunately, the right learning rate is problem dependent, and choosing a good one can take some trial and error.

Another thing the number of epochs depends on is the function being learned. Simpler functions can generally be learned in fewer epochs than more complex ones. The network structure, and in particular the number of weights to learn, can also determine the number of epochs required. The reason is that the same function can be fitted in very different ways in different hypothesis spaces. So a function that looks simple in one hypothesis space can look very complex in another hypothesis space that is rich enough to model the complexities of the function.

Local minima can be a problem for back-propagation. However, there are some mitigating factors that can make them not much of an issue in practice. One reason is that if there are enough weights in the network, the extra weights can provide for “escape routes” away from local minima. A local minimum has to be a minimum in all the weights: even if a point is a minimum in some dimensions, other dimensions can allow it to change. Another reason given is that local minima can often be pretty good. Even if a neural network learns an “incorrect” function due to local minima, if the function has reasonably good performance on the training set it will often still generalize reasonably well.

One typical way to deal with local minima is to use random restarts, where each run is started with different initial weights. Using this approach, you end up with a set of networks. You can then choose the one with the best performance on the training set. Alternatively, you can use a voting hypothesis, in which the weight of each network is determined by its error on the training set. Finally, if you’re already using a voting hypothesis, you can go ahead and do boosting, reweighting each of the training examples on each round according to the AdaBoost formula. Weighted training examples can easily be taken into account by the back-propagation algorithm, by factoring the weight of an example into the learning rate.
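The random-restart procedure described above might look like the following sketch. Everything specific here is an assumption made for illustration: a single sigmoid output, two hidden units, bias handled by a constant 1 input, initial weights drawn uniformly from [−1, 1], a fixed number of epochs instead of a convergence test, and xor as the toy training set. Voting and boosting are not shown.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(net, x):
    w_hidden, w_out = net
    xb = x + [1.0]                                   # constant bias input
    hidden = [sigmoid(sum(w * v for w, v in zip(row, xb))) for row in w_hidden]
    hb = hidden + [1.0]                              # constant bias for the output unit
    out = sigmoid(sum(w * v for w, v in zip(w_out, hb)))
    return xb, hb, out


def train(data, num_hidden, alpha, epochs, rng):
    """Stochastic back-propagation from random initial weights (single output)."""
    num_inputs = len(data[0][0])
    w_hidden = [[rng.uniform(-1, 1) for _ in range(num_inputs + 1)]
                for _ in range(num_hidden)]
    w_out = [rng.uniform(-1, 1) for _ in range(num_hidden + 1)]
    for _ in range(epochs):
        for x, y in data:
            xb, hb, out = forward((w_hidden, w_out), x)
            delta_out = (y - out) * out * (1 - out)
            # Hidden deltas use the output weights from before this update.
            delta_hidden = [w_out[j] * delta_out * hb[j] * (1 - hb[j])
                            for j in range(num_hidden)]
            w_out = [w + alpha * h * delta_out for w, h in zip(w_out, hb)]
            w_hidden = [[w + alpha * v * dj for w, v in zip(row, xb)]
                        for row, dj in zip(w_hidden, delta_hidden)]
    return w_hidden, w_out


def training_error(net, data):
    return sum((y - forward(net, x)[2]) ** 2 for x, y in data)


def random_restarts(data, restarts, seed=0):
    """Train once per restart and keep the network with the lowest training error."""
    rng = random.Random(seed)
    nets = [train(data, num_hidden=2, alpha=0.5, epochs=2000, rng=rng)
            for _ in range(restarts)]
    errors = [training_error(net, data) for net in nets]
    best = min(range(restarts), key=lambda r: errors[r])
    return nets[best], errors


xor_data = [([0.0, 0.0], 0), ([0.0, 1.0], 1), ([1.0, 0.0], 1), ([1.0, 1.0], 0)]
best_net, errors = random_restarts(xor_data, restarts=5)
print(sorted(errors))
```

The spread of the printed errors gives a rough sense of how much the outcome depends on the initial weights; individual runs may or may not reach a good minimum.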

(^3) To be precise, because we are updating the weights after each training example, and not after seeing the entire data set, we are not exactly following the direction of the gradient. Actually, the algorithm performs a stochastic gradient descent, and is only convergent with a decaying learning rate at O(1/t) and random samples from the data.
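The two conditions in this footnote, a learning rate that decays as 1/t and instances sampled at random, can be sketched as a schedule that a training loop would consume. The names `decayed_rate` and `stochastic_schedule`, the initial rate, and the toy data are hypothetical.

```python
import random


def decayed_rate(alpha0, t):
    """Learning rate for update number t (t = 1, 2, ...), decaying as 1/t."""
    return alpha0 / t


def stochastic_schedule(data, num_updates, alpha0=1.0, seed=0):
    """Yield (alpha_t, instance) pairs, drawing instances uniformly at random."""
    rng = random.Random(seed)
    for t in range(1, num_updates + 1):
        yield decayed_rate(alpha0, t), rng.choice(data)


# Hypothetical training set of (x, y) pairs.
data = [([0.0, 1.0], 1), ([1.0, 0.0], 0), ([1.0, 1.0], 1)]
schedule = list(stochastic_schedule(data, num_updates=5))
for alpha_t, (x, y) in schedule:
    print(alpha_t, x, y)
```

Each yielded pair would drive one forward and one backward pass, with `alpha_t` used in place of a fixed α.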