CS181 Lecture 5 — Perceptrons

Avi Pfeffer; Revised by David Parkes

Feb 7, 2011

Today we shift towards a new topic, on which we shall spend a couple of lectures: neural networks. A good in-depth book on neural networks is “Neural Networks for Pattern Recognition” by C. Bishop. In this first class we discuss perceptrons, together with the perceptron and adaline learning rules. These are precursors to the development of methods for training (artificial) neural networks.

1 Introduction to Neural Networks

In one sentence, a neural network is a computational structure formed from a network of individual processing units (neurons) that send messages to each other. Neural networks are modeled after the neuronal structure in the human brain. They are important both as a framework for machine learning and as a model for understanding human intelligence. Some people believe they provide the most promising approach to developing intelligent agents.

In fact, neural networks have been around longer than AI itself. Here’s a brief timeline, to provide some cultural and historical perspective.

1943 McCulloch & Pitts inaugurated the study of neural networks by introducing a calculus of weights and threshold units.

1948 Norbert Wiener’s cybernetics introduced negative feedback loops as a model for learning.

1949 Hebb developed the first update rule for learning neurons.

1950 Turing proposed his test for determining whether a computer is intelligent.

1951 Minsky built the first working neural net system.

1956 AI “officially inaugurated” by McCarthy, Minsky, Shannon, Newell & Simon, Samuel and others.

1957 Rosenblatt develops the perceptron (a model for a single processing unit).

1960 Widrow & Hoff develop the adaline (another model for learning a single unit).

1960-1970 Lots of work on building systems of perceptrons and adalines.

1969 Minsky & Papert publish “Perceptrons”. This book describes the state of perceptron research at that time, but also points out some fundamental limitations of perceptrons.

1969 Bryson & Ho, two applied mathematicians, develop the back-propagation algorithm.^1 This work addresses the limitations of perceptrons, but was not known in the AI and neural network communities.

1970s The “perceptron winter”: loss of interest and funding for neural network research as a result of Minsky and Papert’s book.

(^1) Prof. Ho, T. Jefferson Coolidge Professor of Applied Mathematics, School of Engineering and Applied Sciences, Harvard University. Emeritus.

1980s There is a resurgence of interest in “connectionism” (as neural network research became known). The back-propagation algorithm is rediscovered by several researchers.

1986 Rumelhart & McClelland’s influential “Parallel Distributed Processing” collection popularizes connectionism in the AI community; Rumelhart, Hinton & Williams present the back-propagation algorithm.

1990s There is a gradual increase in applications and technical understanding of neural networks; by the end of the decade, neural network research is reintegrated with mainstream AI.

2000s Neural networks are viewed as just another machine learning technique, and not necessarily the best technique for many tasks.

1.1 Brain Networks

As we said earlier, neural networks are inspired by the workings of the brain. Studying how the brain works might help us understand how computational intelligence could work. Furthermore, brains are our best example of working intelligent systems, so it makes sense to try to mimic the brain in designing intelligent systems. By studying computational models of artificial neural networks, we might develop a better understanding of how the brain works.

I will describe very briefly some of the basic features of the brain that are mimicked by artificial neural networks. The main processing unit in the brain is the neuron, or nerve cell. Some of the parts of a neuron are: the soma or cell body; the axon, a long fiber leading out of the soma; a number of dendrites, shorter fibers branching out of the soma; and synapses that connect the axon to dendrites of other neurons. Information flows from the soma of one neuron through the axon, and is transmitted to dendrites of other neurons via the synapses.

Signals are propagated by electrochemical reactions. When the electric potential in the soma exceeds some threshold, a pulse or action potential is sent down the axon. This triggers a reaction in the synapses, releasing transmitters into the dendrites that alter the electric potential in the next cell. The reactions at the synapses may be excitatory or inhibitory depending on whether they increase or decrease the potential in the next cell.

When we abstract away from the biochemistry, we see that there are two basic features that determine how a network functions. One is the connectivity structure of the network, which specifies which neurons feed into which other neurons. The second is the nature of the synaptic reactions. When both of these are specified, a computational structure is defined that is capable of information processing. In addition, both the connectivity structure and the strength of connections can be learned. The strengths of the synaptic reactions are plastic; there can be long-term changes in the strengths of connections based on patterns of stimulation. Also, new connections are often formed, changing the structure of the network itself. As the network changes, so do the computations performed by the network.

It is important to emphasize that neural networks are a micro-level model of the brain. At a higher level, the brain is divided into regions that perform particular functions. Our understanding of which parts of the brain perform different tasks and how they perform them is quite incomplete. Nevertheless, one might still believe that modeling the overall architecture of intelligent agents on the macro-structure of the brain is a good idea. However, artificial neural networks do not model the higher-level structure of the brain at all. They take the neuron as the model of how information is stored and communicated, but do not follow the brain in determining what information is stored, or how it is organized.

In fact, even on the lower level, artificial neural networks are not exact models of the brain. Neuroscientists tell us that the types of computations performed in the brain are different from the ones performed by artificial networks. Nevertheless, artificial neural networks are an attractive design methodology for intelligent systems, because they mimic a number of attractive properties of the brain:

Massive Parallelism The “clock speed” of the brain is rather slow: less than 1 kHz, as compared to GHz for a modern PC. However, there are about 10^11 neurons, each of which functions as both a processing unit and a storage unit. This compares favorably with today’s computers. One key reason why brains are able to process so much data is that they are massively parallel — all the neurons process the data at once.

2.1 Batch Gradient Descent

For batch gradient descent, we combine this over n training examples, with the error being the sum of the individual errors on each example. Noting that the derivative of a sum is the sum of the derivatives, we have:

$$w_j^{(t+1)} \leftarrow w_j^{(t)} + \alpha \sum_{i=1}^{n} \left(y_i - w^{(t)} \cdot x_i\right) x_{ij}, \qquad (9)$$

where $x_{ij}$ is feature j of input i. This batch gradient descent method guarantees convergence to the global minimum for a small enough α > 0. But it is slow. The weight vector is moved along the direction of the greatest rate of decrease of the error function over all examples.
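As a rough illustration, the batch update in Eq. (9) can be sketched in plain Python. The function name, the toy data set (generated from y = 1 + 2x), the learning rate and the step count are all assumptions made for this example; they do not come from the lecture.

```python
def batch_gradient_descent(xs, ys, alpha=0.01, steps=2000):
    """Repeat the batch update of Eq. (9) for the hypothesis h_w(x) = w . x.

    xs: list of feature lists, each already extended with x_0 = 1.
    ys: list of real-valued targets.
    alpha and steps are illustrative defaults, not values from the notes.
    """
    w = [0.0] * len(xs[0])
    for _ in range(steps):
        # Residuals y_i - w . x_i, all computed with the current weights w^(t).
        residuals = [y - sum(wj * xj for wj, xj in zip(w, x))
                     for x, y in zip(xs, ys)]
        # Each weight w_j moves by alpha times the sum over all n examples.
        w = [wj + alpha * sum(r * x[j] for r, x in zip(residuals, xs))
             for j, wj in enumerate(w)]
    return w


# Toy data generated from y = 1 + 2x (an assumption for this sketch).
xs = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
ys = [1.0, 3.0, 5.0, 7.0]
w = batch_gradient_descent(xs, ys)
```

On this noise-free data the weights approach (1, 2). Every step touches all n examples, which is where the cost of the batch method comes from.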

2.2 Stochastic Gradient Descent

The rule in Eq. (8) can also be used directly for stochastic gradient descent, with a single training example taken at a time, either sampled (with replacement) from D or obtained by cycling over the points in D. Stochastic gradient descent is often faster than batch gradient descent, especially with lots of identical training examples. To consider an extreme example, if we take a data set and double its size by duplicating every example, then batch gradient descent requires twice as much computational effort per step, but the effect is the same as if the learning rate had simply been doubled. The stochastic gradient descent method is unaffected.

On the other hand, stochastic gradient descent may oscillate for any fixed α > 0, and is guaranteed to converge (in the limit) only for a carefully chosen learning rate: for example, an α that decays as O(1/t), where t is the iteration number, with examples presented in a random sequence. For instance, α(t) = 1000/(1000 + t) would satisfy this requirement.
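The following sketch shows stochastic gradient descent with the decaying schedule α(t) = 1000/(1000 + t). It assumes that the single-example rule of Eq. (8) has the same form as Eq. (20), namely w_j ← w_j + α x_j (y − w · x). The function names, the fixed random seed and the toy data are hypothetical, and the features are deliberately kept in [0, 1] so that the initial rate α(0) = 1 does not cause the weights to blow up.

```python
import random


def learning_rate(t):
    """Decaying schedule alpha(t) = 1000 / (1000 + t) quoted in the text."""
    return 1000.0 / (1000.0 + t)


def stochastic_gradient_descent(xs, ys, steps=20000, seed=0):
    """Apply the single-example update, sampling examples with replacement."""
    rng = random.Random(seed)  # fixed seed (an assumption) for repeatability
    w = [0.0] * len(xs[0])
    for t in range(steps):
        i = rng.randrange(len(xs))  # pick one training example from D
        x, y = xs[i], ys[i]
        residual = y - sum(wj * xj for wj, xj in zip(w, x))
        alpha = learning_rate(t)
        # Assumed form of Eq. (8): w_j <- w_j + alpha * x_j * (y - w . x).
        w = [wj + alpha * residual * xj for wj, xj in zip(w, x)]
    return w


# Toy data from y = 1 + 2x with x in [0, 1]; x_0 = 1 is prepended.
xs = [[1.0, 0.0], [1.0, 0.25], [1.0, 0.5], [1.0, 0.75], [1.0, 1.0]]
ys = [1.0, 1.5, 2.0, 2.5, 3.0]
w = stochastic_gradient_descent(xs, ys)
```

Each step here costs one example rather than n, so duplicating the data set would change nothing about the work done per update.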

2.3 Extending to Multivariate

For multivariate linear regression, we have training data D = {(x, y)}, where $x \in \mathbb{R}^m$ and m > 1. For this, it is again a simple matter to extend x with $x_0 = 1$, and then consider the linear hypothesis

$$h_w(x) = w \cdot x = w_0 + \sum_{j=1}^{m} w_j x_j, \qquad (10)$$

where input feature $x_0 = 1$ is introduced to allow for easy handling of the offset term $w_0$. The stochastic and batch gradient descent rules are exactly as above.

Overfitting can again be a concern for high-dimensional spaces, when some of the input features $X_j$ are in fact irrelevant. In this case it is common to adopt regularization, with overall cost

$$\mathrm{cost}(w) = \mathrm{error}(w) + \lambda \cdot \mathrm{complexity}(w) \qquad (11)$$

and complexity often captured as $\sum_j |w_j|$ or $\sum_j w_j^2 = w \cdot w$, together with some regularization parameter λ > 0 that can be tuned via cross-validation. For many problems it is advantageous to adopt the $L_1$ complexity measure, $\sum_j |w_j|$, which tends to produce sparse models. For the quadratic regularizer, this training problem can still be solved in closed form. More generally, solutions can continue to be found through gradient descent methods.
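A small sketch may help make Eq. (11) concrete. It assumes the error term is the total squared error used for linear regression, and it sums the complexity over every weight including w_0, a choice the text leaves open. All function names and numbers are made up for the example.

```python
def squared_error(w, xs, ys):
    """Total squared error of the linear hypothesis h_w(x) = w . x."""
    return sum((y - sum(wj * xj for wj, xj in zip(w, x))) ** 2
               for x, y in zip(xs, ys))


def l1_complexity(w):
    """Sum of absolute weights (the L1 measure)."""
    return sum(abs(wj) for wj in w)


def l2_complexity(w):
    """Sum of squared weights, i.e. w . w (the quadratic measure)."""
    return sum(wj * wj for wj in w)


def cost(w, xs, ys, lam, complexity):
    """Eq. (11): cost(w) = error(w) + lambda * complexity(w)."""
    return squared_error(w, xs, ys) + lam * complexity(w)


# Hypothetical weights and data: error = 1, L1 = 3, L2 = 5.
w = [1.0, -2.0]
xs = [[1.0, 1.0], [1.0, 2.0]]
ys = [0.0, -3.0]
cost_l1 = cost(w, xs, ys, lam=0.5, complexity=l1_complexity)  # 1 + 0.5 * 3
cost_l2 = cost(w, xs, ys, lam=0.5, complexity=l2_complexity)  # 1 + 0.5 * 5
```

Passing the complexity measure as a function keeps the two regularizers interchangeable, which matches how λ and the measure would be compared under cross-validation.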

3 Perceptrons

Linear functions can also be used to perform classification. We first consider binary classification. We adopt a nonlinear activation or threshold function, which takes as input the weighted sum of the features, w · x, and generates a classification y ∈ {−1, +1}. By doing so we construct a single perceptron, variants of which will comprise the basic processing unit in a general neural network. Here’s a simple picture of a perceptron, with notation:

x_1 ---- w_1 ----\
  ...             >--> in --> g --> OUTPUT
x_m ---- w_m ----/

A perceptron takes m inputs $x = (x_1, \ldots, x_m)$. Each input $x_j$ has an associated weight $w_j$. The output of the perceptron is produced by a two-stage process. The first stage computes the quantity in, which is the weighted sum of the inputs:

$$\mathit{in} = \sum_{j=1}^{m} w_j x_j.$$

The second stage applies an activation function g to in. For perceptrons, the activation function used is the threshold activation function:

$$g(\mathit{in}) = \begin{cases} 1 & \text{if } \mathit{in} > \text{threshold} \\ -1 & \text{otherwise} \end{cases}$$

The threshold is a parameter of the activation function. An alternative way of encoding a threshold activation function that we shall find useful is to create a special input $x_0$. This input will always be fixed at +1 for every instance. The negated weight associated with this input can then be used in place of the threshold. Now, we compute

$$\mathit{in} = \sum_{j=0}^{m} w_j x_j,$$

and set

$$g(\mathit{in}) = \begin{cases} 1 & \text{if } \mathit{in} > 0 \\ -1 & \text{otherwise} \end{cases}$$

What we have achieved by this trick is to turn the threshold into a weight, so that we can treat it in the same way as the other weights, which are the other parameters of the perceptron.

A perceptron is a classifier that takes m continuous inputs and returns a Boolean classification (either +1 or −1). What is the hypothesis space of perceptrons? That is, we want to understand which subset of functions from continuous inputs to Boolean classifications a perceptron can represent. Let us answer this question by analyzing the form of a perceptron. A perceptron on m inputs is defined by the set of weights $w = (w_0, \ldots, w_m)$. This defines the hypothesis

$$h_w(x) = g(w \cdot x) \qquad (13)$$

The classification is $h_w(x) = 1$ if w · x > 0, and −1 otherwise. Let us look at the points x where w · x = 0. This set of points defines a hyperplane in the space of inputs. On one side of this hyperplane, all points will be classified as +1; on the other side they will be classified as −1. A surface that divides the input space into regions of different classification is called a decision boundary. From this analysis, we see that perceptrons always have a linear decision boundary. A target function with a linear decision boundary is called linearly separable. Perceptrons can represent linearly separable functions.

We can use a perceptron as a classifier on Boolean inputs if we restrict the inputs to be +1 (for true) and −1 (for false). We can ask which Boolean functions can be represented by perceptrons on two Boolean inputs. An equivalent question is: which Boolean functions on two inputs are linearly separable? The answer is that perceptrons can represent many but not all of the Boolean functions on two inputs. For example, a perceptron can represent the and function by setting the weights to be +1 and the threshold to 1.5. Similarly, the or function can be represented with weights of +1 and threshold −0.5. However, the xor function is not linearly separable and cannot be represented.
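The and and or settings quoted above can be checked directly with a few lines of Python. The function names are hypothetical. The final grid search over a handful of weight values only illustrates the xor claim and is not a proof of it.

```python
from itertools import product


def perceptron(weights, threshold, x):
    """Return +1 if the weighted sum exceeds the threshold, else -1."""
    total = sum(w * xj for w, xj in zip(weights, x))
    return 1 if total > threshold else -1


def perceptron_with_bias(w, x):
    """Same unit with the threshold folded into w_0, using x_0 = 1."""
    total = sum(wj * xj for wj, xj in zip(w, [1] + list(x)))
    return 1 if total > 0 else -1


# All four Boolean inputs, with +1 for true and -1 for false.
inputs = list(product([-1, 1], repeat=2))

and_out = [perceptron([1, 1], 1.5, x) for x in inputs]
or_out = [perceptron([1, 1], -0.5, x) for x in inputs]

# The threshold 1.5 becomes the weight w_0 = -1.5 on the fixed input x_0.
and_bias_out = [perceptron_with_bias([-1.5, 1, 1], x) for x in inputs]

# Illustration only: no weight vector on a coarse grid reproduces xor.
xor_target = [1 if a != b else -1 for a, b in inputs]
grid = [k * 0.5 for k in range(-4, 5)]
xor_found = any(
    [perceptron_with_bias([w0, w1, w2], x) for x in inputs] == xor_target
    for w0, w1, w2 in product(grid, repeat=3)
)
```

Here `and_out` is +1 only for the input (1, 1), `or_out` is −1 only for (−1, −1), and `xor_found` stays false.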

we should increase $w_j$. For an instance on which the classifier is correct, we should not adjust the weights at all. We can get the right effect in all cases with the following rule for instance (x, y), called the perceptron rule:

$$w_j^{(t+1)} \leftarrow w_j^{(t)} + \alpha x_j (y - y'), \qquad (14)$$

where α > 0 is a small positive number called the learning rate of the algorithm and y′ = g(w · x) is the output of the perceptron. What this rule says is that whenever we see a training example, we adjust the weights in such a way as to nudge the sum in the right direction whenever there is an error. This is the same as the update rule for linear regression, but it has a different behavior because y, y′ ∈ {+1, −1}, whereas y and $h_w(x)$ are continuous in Eq. (8).

The above rule applies to a single instance, and can be applied repeatedly using each of the individual training instances in turn, either sampled at random with replacement, or going in cyclic order through D. In this method we proceed in a way analogous to stochastic gradient descent. Note, though, that the perceptron learning rule is not gradient descent because the threshold activation function is discontinuous and cannot be differentiated. Alternatively, we can adopt the following “batch” version of the rule:

$$w_j^{(t+1)} \leftarrow w_j^{(t)} + \alpha \sum_{i=1}^{n} x_{ij} (y_i - y'_i), \qquad (15)$$

where $D = \{(x_1, y_1), \ldots, (x_n, y_n)\}$, $x_{ij}$ is the jth feature of input $x_i$, and $y'_i$ is the output of the perceptron on the ith input.

If the training data is linearly separable, and a small enough learning rate is used, then repeated applications of the perceptron learning rule in batch mode will eventually result in a set of weights that classifies every training instance correctly. Similarly, with a suitably decaying learning rate, α = O(1/t) for iteration t, the stochastic update will converge in the limit. For example, adopting α(t) = 1000/(1000 + t) is one possible learning schedule.

On the other hand, if the training set is not linearly separable, the update process may never converge when used with a fixed learning rate. This problem can be addressed by adopting a learning rate that decays as O(1/t), where t is the iteration number, and with examples presented in random sequence, but learning can nevertheless remain slow and noisy.
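As a minimal sketch, the single-instance perceptron rule of Eq. (14) can be applied in cyclic order through D as follows. The function names, the fixed learning rate of 0.1, the epoch limit and the choice of the or function as training data are assumptions for illustration only.

```python
def predict(w, x):
    """Perceptron output y' = g(w . x): +1 if w . x > 0, otherwise -1."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) > 0 else -1


def train_perceptron(data, alpha=0.1, max_epochs=100):
    """Cycle through D applying Eq. (14) until an epoch has no mistakes.

    data: list of (x, y) pairs, each x already extended with x_0 = 1.
    alpha and max_epochs are illustrative values, not from the notes.
    """
    w = [0.0] * len(data[0][0])
    for _ in range(max_epochs):
        mistakes = 0
        for x, y in data:
            y_out = predict(w, x)
            if y_out != y:
                mistakes += 1
            # Eq. (14); the change is zero whenever y_out equals y.
            w = [wj + alpha * xj * (y - y_out) for wj, xj in zip(w, x)]
        if mistakes == 0:
            break
    return w


# The or function on inputs in {-1, +1}, with x_0 = 1 prepended.
data = [([1, -1, -1], -1), ([1, -1, 1], 1), ([1, 1, -1], 1), ([1, 1, 1], 1)]
w = train_perceptron(data)
correct = all(predict(w, x) == y for x, y in data)
```

Because this data set is linearly separable, the loop stops after a few passes with every training instance classified correctly.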

4.2 Training: The Adaline Rule

An alternative learning rule for perceptrons was developed at about the same time by Widrow and Hoff in 1960. The advantage of the adaline rule (short for ‘adaptive linear element’) is that it provides fast convergence to low (but not minimum) error solutions even when the data is not linearly separable. In fact, adaline is nothing more than stochastic gradient descent applied to an error function between the input to the perceptron activation function and the target class, rather than the output of the activation function and the target class. The adaline rule is based on the idea of gradient descent that we saw for linear regression. A natural approach would be to seek to minimize the squared error on the data set, and follow the gradient in weight space, with

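$$w^{(t+1)} \leftarrow w^{(t)} - \alpha \nabla \mathrm{error}(w), \qquad (16)$$

where ∇error(w) is the gradient of the error function, indicating the direction of steepest increase in the value of error, and equal to

$$\nabla \mathrm{error}(w) = \left[ \frac{\partial}{\partial w_0} \mathrm{error}(w), \ldots, \frac{\partial}{\partial w_m} \mathrm{error}(w) \right] \qquad (17)$$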

But this error function depends on the perceptron threshold function, and thus is not a differentiable function. Specifically, we have

$$\mathrm{error}(w) = \sum_{i=1}^{n} \left(y_i - g(\mathit{in}_i)\right)^2 \qquad (18)$$

In order to apply gradient descent, we need a differentiable function to optimize. We adopt instead the difference between the in value and the target output as a surrogate for the error, and define the function to be minimized as the total square error (measured this way) on the training set:

$$\mathrm{error}(w) = \sum_{i=1}^{n} (y_i - \mathit{in}_i)^2 = \sum_{i=1}^{n} \Big(y_i - \sum_{j=0}^{m} w_j x_{ij}\Big)^2 \qquad (19)$$

This is exactly the error function we adopted for linear regression, and the stochastic gradient descent update rule, applied to a particular instance (x, y), is:

$$w_j^{(t+1)} \leftarrow w_j^{(t)} + \alpha x_j \left(y - w^{(t)} \cdot x\right) \qquad (20)$$

We can also run the update in batch mode, incrementing each weight according to the total error between $\mathit{in}_i$ and $y_i$, aggregated over all $(x_i, y_i) \in D$.
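The adaline update of Eq. (20) can be sketched in the same style. The function names, the fixed learning rate of 0.05, the epoch count and the or training data are assumptions made for this example.

```python
def adaline_update(w, x, y, alpha):
    """One application of Eq. (20): the error is y - in, not y - g(in)."""
    in_value = sum(wj * xj for wj, xj in zip(w, x))
    return [wj + alpha * xj * (y - in_value) for wj, xj in zip(w, x)]


def train_adaline(data, alpha=0.05, epochs=200):
    """Cycle through D with a fixed learning rate (illustrative values)."""
    w = [0.0] * len(data[0][0])
    for _ in range(epochs):
        for x, y in data:
            w = adaline_update(w, x, y, alpha)
    return w


def classify(w, x):
    """Threshold the weighted sum to get a class in {-1, +1}."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) > 0 else -1


# The or function on inputs in {-1, +1}, with x_0 = 1 prepended.
data = [([1, -1, -1], -1), ([1, -1, 1], 1), ([1, 1, -1], 1), ([1, 1, 1], 1)]
w = train_adaline(data)
correct = all(classify(w, x) == y for x, y in data)

# A correctly classified point still moves the weights, because
# y - in = 1 - 1.5 is non-zero even though g(in) = y.
nudged = adaline_update([0.5, 0.5, 0.5], [1, 1, 1], 1, alpha=0.1)
```

On this data the weights settle near the least-squares solution (0.5, 0.5, 0.5), and the last line shows each weight moving to about 0.45 on a point that is already classified correctly.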

4.3 Discussion

If we compare the perceptron learning rule (14) and the adaline rule (20), we see that they differ only in the use of y′ vs. in in defining the error on the current example. But this small difference leads to a noticeable difference in the behavior of the algorithm.

A first difference is that the adaline rule is performing gradient descent, and will eventually reach a local minimum of the total squared error function when used with a small enough α > 0 (or with a suitably decaying α when used together with stochastic gradient descent). Furthermore, this particular error function has a unique minimum, and adaline will converge to the global minimum even when the data is not linearly separable. This is a big improvement on the perceptron rule, which might not converge if the data is not linearly separable.^3

A second difference between the perceptron rule and the adaline rule lies in the way they treat different incorrectly classified points. In the perceptron rule, all points that are incorrectly classified in the same way have the same value for their predicted output, so they contribute the same amount to the total weight adjustment. By contrast, in the adaline rule, the amount an incorrectly classified point affects the weights depends on the value of y − in, i.e., on the amount by which the point is misclassified. Points that are further away from being classified correctly will have more effect on the weights, which seems like good behavior.

A third difference lies in the behavior of the two rules after a correct classification has been identified on a point, and in particular in their application to linearly separable data sets. For the perceptron rule, once the classification is correct everywhere, the algorithm stops learning. This is the case even if the decision boundary produced goes very close to one of the points, and only just manages to classify it correctly. For the adaline rule, even when all points have zero classification error, y − in will generally remain non-zero and the algorithm may not yet have reached a global minimum in its error space. So the algorithm will be able to continue to learn, and perhaps find a decision boundary with higher margin, which again is a good thing.

On the other hand, it is possible that a model learned using the adaline rule will not classify all the training data correctly, even if the training data is linearly separable. The reason is that the error function minimized by the adaline rule is not the true error function corresponding to correct classification. It is possible that

(^3) One issue that needs to be dealt with is choosing the learning rate. If the learning rate is too small, the algorithm will take a long time to move from the starting point to the minimum. On the other hand, if the learning rate is too large, the algorithm may oscillate around a minimum point instead of converging. One trick that people use is to begin with a higher learning rate, to get quickly into the region of the minimum, and gradually decrease the learning rate so as to guarantee convergence.