Computational Learning Theory , Lecture Notes - Computer Science, Study notes of Artificial Intelligence

Prof. David C Parkes, Computer Science, Computational Learning Theory, Probably Approximately Correct, PAC-learnable, Conjunctive Formulas, Finite Hypothesis Spaces, PAC-Learnability of Conjunctive Formulas, Variations to the PAC Model, VC Dimension, Harvard, Lecture Notes

CS181 Lecture 20 — Computational Learning Theory

Avi Pfeffer; Revised by David Parkes

April 17, 2011

We turn now to the fundamental question of how and when it is possible to learn and the topic of computational learning theory. We will briefly survey some of the main ideas in this area. Les Valiant, who founded the field, teaches an in-depth course on the topic (CS228). The key concepts covered are PAC-learnability, sample complexity and the VC dimension of a hypothesis space.^1 ^2

1 Computational Learning Theory

Learning works. There’s lots of evidence of that, both from humans and also from the success of machine learning algorithms. Today we’ll consider the question “Why does learning work?” There are several reasons why one might want to answer this question, including:

  • Simply for the sake of understanding. To quote Russell and Norvig, “Unless we find some answers, machine learning will, at best, be puzzled by its own success.”
  • To understand when and under what circumstances learning works.
  • To be able to give guarantees about the performance of the hypothesis produced by an algorithm on some training set.
  • To be able to determine how many training samples are needed in order to produce a good hypothesis.

Simply stated, we’d like to understand how much data is required to provide good generalization. The basic insight provided by the probably approximately correct (PAC) model of learning theory is that we can provide formal guarantees that a learned hypothesis will generalize well by assuming that the future distribution on examples will be the same as the distribution used for training.

Suppose you are trying to learn a classifier for animals and you never saw a pink elephant: this is all right as long as you don’t expect to see pink elephants in the future!

We know that for learning to be possible we need some form of inductive bias. Computational learning theory focuses on restriction bias, in which there is a restricted set of hypotheses that can be represented by a learner. The basic question that is addressed is to understand whether it is possible to learn efficiently the true hypothesis, under the assumption that it meets the restriction. Learning efficiently requires both that the number of examples required is small and that the computational time required is small.

Computational learning theory also provides another way to think about Occam’s razor: it explains that we should prefer simple hypotheses because they can be learned with less data. Computational learning theory explains how to reason about the amount of training data required for a hypothesis space of a given (representation) complexity.

(^1) Additional material on computational learning theory can be found in the MIT Press book by Kearns and Vazirani, “An Introduction to Computational Learning Theory.”

(^2) These notes are based in part on Russell and Norvig, notes by Sally Goldman, class notes by Avrim Blum, and lecture notes by Raymond Mooney.

2 Basic Definitions

What does it mean to say that a learning algorithm works? One answer to this question is provided by the Probably Approximately Correct (PAC) framework (Valiant, 1984). PAC-learnability is defined for supervised learning, and in particular for problems of classification. PAC-learnability was originally introduced in the context of concept learning, in which the true concept that represents the world is a function from Boolean attributes $X = X_1 \times \cdots \times X_m$ to $y \in \{0, 1\}$. We say that examples are “positive” (an example $x$ satisfies the concept) or “negative” (an example $x$ does not satisfy the concept). An example question asked is:

Suppose that you are trying to teach a child the concept of “chair” and show positive and negative examples of furniture. How many examples are required?

2.1 Probably Approximately Correct

In general, we cannot expect a learning algorithm to produce a perfectly correct hypothesis without having seen all possible data instances. But we can seek a hypothesis with error less than some small $\epsilon > 0$. Also, we cannot expect an algorithm to produce a good hypothesis for every set of training data, because some training data may provide a very unrepresentative sample of the population. But we can seek a learning algorithm that will generate with probability at least $1 - \delta$, for some small $\delta > 0$, a hypothesis that is approximately correct (with error less than $\epsilon$).

In order to talk about the probability that an approximately correct hypothesis is learned, we define a probability distribution $P$ on examples. The PAC model makes the assumption that all examples, both training and test examples, are drawn independently and from the same probability distribution $P$. This assumption gives precise content to the statement that “the future resembles the past,” which is the fundamental assumption without which learning is impossible.

To be precise, $P$ is a joint probability distribution over labeled examples $(x, y)$, with the attribute vector $x \in X = X_1 \times \cdots \times X_m$, and $y \in Y$, where $Y$ is the set of class labels. We can equivalently view $P$ as consisting of a distribution over attribute values $P(X)$, and a conditional distribution $P(Y \mid X)$ over the class given the attribute values, with $P(X, Y) = P(X) P(Y \mid X)$. We assume that all training and all test examples are independent samples from the distribution $P$.

Given $n$ examples, the distribution $P$ determines a distribution over the different training sets. In particular, for $D = \{(x_1, y_1), \ldots, (x_n, y_n)\}$, with $(x_i, y_i) \in X \times Y$, we have

$$P(D) = \prod_{i=1}^{n} P(X = x_i, Y = y_i).$$

The same probability distribution $P$ also determines the expected error of some hypothesis $h : X \to Y$ on test data. This is just the probability, according to distribution $P$, that an incorrect classification will be produced on an arbitrary example $(x, y)$. We define

$$\mathrm{error}_P(h) = P_{(x,y)}\big(h(x) \neq y\big). \tag{1}$$

A domain is deterministic if, for every $x$, there is a class $y$ such that $P(Y = y \mid X = x) = 1$. In this case, we can say that there is a true model (of the world), which we denote $f : X \to Y$. Moreover, the distribution $P$ is simply defined by $P(X)$, with $P(Y = y \mid X = x)$ placing probability one on $y = f(x)$. Given this, the error of hypothesis $h$ is

$$\mathrm{error}_P(h) = P_x\big(h(x) \neq f(x)\big). \tag{2}$$

This leads to the following central definition of learning theory:

Definition 1. Consider a supervised learning problem with attributes $X_1, \ldots, X_m$ and target class $Y$. A hypothesis space $H$ is PAC-learnable if there exists an algorithm such that, for every deterministic domain whose true model $f$ is in $H$, for every distribution $P$ over $X_1, \ldots, X_m$, and every $\epsilon > 0$, $\delta > 0$, with probability at least $1 - \delta$ the algorithm returns a hypothesis $h \in H$ with $\mathrm{error}_P(h) < \epsilon$, in time polynomial in $1/\epsilon$, $1/\delta$ and $m$.
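To make the error in a deterministic domain concrete, here is a minimal Python sketch that estimates it by sampling. The domain (three Boolean attributes drawn uniformly), the concept `f`, the hypothesis `h`, and the helper names are all hypothetical choices made for illustration; they do not come from the notes.

```python
import random

def estimate_error(h, f, sample_x, n_samples=100_000, seed=0):
    """Monte Carlo estimate of P_x(h(x) != f(x)) for a deterministic domain."""
    rng = random.Random(seed)
    mistakes = 0
    for _ in range(n_samples):
        x = sample_x(rng)          # draw x from P(X)
        if h(x) != f(x):           # h disagrees with the true model f
            mistakes += 1
    return mistakes / n_samples

def sample_uniform(rng, m=3):
    """Assumed P(X): m independent fair coin flips."""
    return tuple(rng.randint(0, 1) for _ in range(m))

def f(x):
    """Assumed true concept: x1 AND x2."""
    return x[0] == 1 and x[1] == 1

def h(x):
    """Assumed hypothesis: x1 alone."""
    return x[0] == 1

err = estimate_error(h, f, sample_uniform)
print(f"estimated error: {err:.3f}")   # the exact value here is 1/4
```

Under these assumptions `h` and `f` disagree exactly when x1 = 1 and x2 = 0, so the estimate should land close to 0.25. A different distribution over the same attributes would in general give a different error for the same `h`.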

To establish a negative result for PAC-learnability, one useful approach is to show, through a reduction from some other hard problem, that finding a hypothesis consistent with the training data is hard.

3 Applications

3.1 Sample Complexity of Conjunctive Formulas

Let’s look at an example of a hypothesis space that we can prove has polynomial sample complexity. Consider the problem of learning Boolean formulas, and let the hypothesis space $H$ consist of all conjunctive formulas on $m$ Boolean variables. Recall that a literal has the form $x_j$ or $\neg x_j$ (also written $\bar{x}_j$), and that a conjunctive formula is a conjunction of literals, such as $x_1 \wedge \neg x_2$ (also written $x_1 \bar{x}_2$).

The first question we need to ask is what is the size of the hypothesis space, i.e., how many logically distinct conjunctive formulas are there? The answer is $3^m$, because for each attribute $x_j$ there are three possibilities: $x_j$ is in the formula, $\neg x_j$ is in the formula, or neither is in the formula.

Armed with the knowledge that $|H| = 3^m$, let’s proceed with the analysis. Given a number of training examples $n$, we’re going to find a bound on the probability that there is a bad hypothesis (one with error $> \epsilon$) that is consistent with the training data. We’re not going to make any assumptions about the distribution $P$ over examples, so this bound will hold for any such distribution.

First, let $h$ be any particular bad hypothesis. What is the probability that $h$ is consistent with a single training example? Clearly at most $1 - \epsilon$. What is the probability that $h$ is consistent with $n$ training examples? At most $(1 - \epsilon)^n$, since the examples are drawn independently from the distribution $P$.

Now, we can ask ourselves, what is the probability that there exists some bad hypothesis consistent with a training set $D$ of size $n$? Let’s enumerate the bad hypotheses $h_1, \ldots, h_k$. The event “there is some bad hypothesis consistent with $D$” is equal to the event “$h_1$ is consistent or $h_2$ is consistent or ... or $h_k$ is consistent”. For this we use the union bound: the probability that at least one of several events occurs is less than or equal to the sum of the probabilities of the individual events. This bound is exact when at most one such event can occur at the same time (e.g., only one person can be elected president), and otherwise provides an upper bound. The union bound can be quite weak when multiple events can occur simultaneously (e.g., “bad weather” is associated with “rain” and “cold” and “wind” and “ice” and so forth, but these events often occur at the same time). Using the union bound, we obtain:

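$$\begin{aligned}
P(\text{there is some bad hypothesis consistent with } D)
&= P(h_1 \text{ consistent with } D \vee \cdots \vee h_k \text{ consistent with } D) \\
&\le P(h_1 \text{ consistent with } D) + \cdots + P(h_k \text{ consistent with } D) \\
&\le k \cdot P(\text{a bad hypothesis } h \text{ is consistent with } D) \\
&< k (1 - \epsilon)^n \\
&\le 3^m (1 - \epsilon)^n
\end{aligned}$$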

The last line follows because the number $k$ of bad hypotheses is at most the size of the entire hypothesis space, which is $3^m$. Now that we have a bound on the probability that there is a bad hypothesis consistent with a training set of size $n$, we can ask what value of $n$ is large enough to guarantee that this probability is at most $\delta$. That is, we want $n$ such that $3^m (1 - \epsilon)^n < \delta$. First we use the standard inequality,

$$1 - z \le e^{-z}$$

and require instead that $3^m e^{-\epsilon n} < \delta$. Taking logarithms and solving, we get the following result for sample complexity for conjunctive formulas,

$$n > \frac{1}{\epsilon}\left(m \ln 3 + \ln\frac{1}{\delta}\right)$$
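Spelled out, the algebra behind this step is short. Since $(1 - \epsilon)^n \le e^{-\epsilon n}$, it suffices to make $3^m e^{-\epsilon n}$ smaller than $\delta$, and taking natural logarithms of both sides gives:

```latex
\begin{align*}
3^m e^{-\epsilon n} < \delta
  &\iff m \ln 3 - \epsilon n < \ln \delta \\
  &\iff \epsilon n > m \ln 3 - \ln \delta = m \ln 3 + \ln \frac{1}{\delta} \\
  &\iff n > \frac{1}{\epsilon}\left(m \ln 3 + \ln \frac{1}{\delta}\right).
\end{align*}
```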

For such an $n$, we can indeed guarantee that the probability that there is a bad hypothesis consistent with the training data is at most $\delta$. The number of training examples required increases linearly in $1/\epsilon$ and in the number of attributes $m$, and logarithmically in $1/\delta$, so we conclude that conjunctive formulas have polynomial sample complexity. To give you a sense of the numbers, if $m = 10$, $\epsilon = 0.1$ and $\delta = 0.05$, then 140 training examples suffice. Even though the hypothesis space is large, only a small number of training examples are required.
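The numbers quoted above can be checked with a few lines of Python. This is a small sketch; the function name `conjunction_sample_bound` is a hypothetical helper, not something defined in the notes.

```python
import math

def conjunction_sample_bound(m, eps, delta):
    """Smallest integer n with n > (1/eps) * (m ln 3 + ln(1/delta))."""
    bound = (m * math.log(3) + math.log(1 / delta)) / eps
    return math.floor(bound) + 1

m, eps, delta = 10, 0.1, 0.05
n = conjunction_sample_bound(m, eps, delta)
print(n)                                   # 140

# Sanity check against the bound before the e^{-z} relaxation:
print(3 ** m * (1 - eps) ** n < delta)     # True

# The bound grows only linearly in m:
for m_other in (10, 20, 40):
    print(m_other, conjunction_sample_bound(m_other, eps, delta))
```

Because the relaxation $1 - z \le e^{-z}$ is not tight, the value returned is sufficient for the guarantee but not necessarily the smallest $n$ satisfying $3^m (1 - \epsilon)^n < \delta$.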

3.2 Sample Complexity of Finite Hypothesis Spaces

In the above analysis, the only place in which it mattered that our hypothesis space consisted of conjunctive formulas was in proving that $|H| = 3^m$. We can go through the exact same analysis for any finite hypothesis space, substituting $|H|$ wherever $3^m$ appears. From this, we can easily obtain the following:

Theorem 1. The sample complexity of a finite hypothesis space $H$ is such that if the number of training examples,

$$n > \frac{1}{\epsilon}\left(\ln |H| + \ln\frac{1}{\delta}\right) \tag{4}$$

then the probability that there is a bad hypothesis (with error greater than $\epsilon$) consistent with a training set of $n$ samples is at most $\delta$.

The size of the hypothesis space enters through the quantity ln |H|. In particular, if |H| is “only” exponential in m, then the hypothesis space has polynomial sample complexity. This is good news! Focusing just on sample complexity (and leaving computational complexity to one side), any hypothesis space that is no more than exponential in the number of attributes can be efficiently learned.

Caution: The bound generated through the above analysis is an upper bound and can be quite weak. This is one of the main critiques of learning theory, which is that it does not always provide good practical guidance for the amount of data actually required for success with learning algorithms.

One nice thing about the analysis is that we obtain a simple form of Occam’s razor. Consider any hypothesis space with $|H| = 2^s$, e.g., a class for which all descriptions are less than $s$ bits long. If $n > \frac{1}{\epsilon}\left[s \ln 2 + \ln\frac{1}{\delta}\right]$, and the true hypothesis lies in this space, then we are unlikely to be fooled and the learned hypothesis will be accurate with high probability. This result critically depends on the hypothesis space being small.

3.3 PAC-Learnability of Conjunctive Formulas

So far we’ve just considered sample complexity and the approach is “non-constructive” in that it just reasons about properties of a consistent learner. We turn now to PAC-learnability of conjunctive formulas, and thus consider computational complexity as well as sample complexity. In particular, given the positive sample complexity result with a consistent learner, what we need is an efficient consistent learner.

To make headway, we can first consider monotone conjunctions, where the literals are always positive. An algorithm to find a consistent hypothesis in the class of monotone conjunctive formulae obtains significant information from positive examples. For example, if $m = 5$ and we see positive example “10011,” then we know that the target concept does not contain $X_2$ or $X_3$. A negative example “01000” does not provide nearly as much information because we do not know which one of $X_1$, $X_3$, $X_4$ or $X_5$ being false caused the failure. Here is Valiant’s consistent learner for monotone conjunctions:
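The observation about positive examples suggests a simple elimination procedure: start with the conjunction of all $m$ variables and delete every variable that is 0 in some positive example. The Python sketch below implements that standard idea. It is an assumption that it matches the learner referred to above in every detail, and the names `learn_monotone_conjunction` and `predict` are hypothetical.

```python
def learn_monotone_conjunction(examples, m):
    """Return the set of 0-based variable indices kept in the conjunction.

    examples: iterable of (x, y), with x a tuple of m bits and y in {0, 1}.
    Negative examples are ignored; only positive ones eliminate variables.
    """
    kept = set(range(m))                 # start with x_1 AND ... AND x_m
    for x, y in examples:
        if y == 1:
            # A variable that is 0 in a positive example cannot be in the target.
            kept -= {j for j in range(m) if x[j] == 0}
    return kept

def predict(kept, x):
    """Evaluate the monotone conjunction of the kept variables on x."""
    return int(all(x[j] == 1 for j in kept))

# The two examples from the text, plus one more positive example.
data = [((1, 0, 0, 1, 1), 1), ((0, 1, 0, 0, 0), 0), ((1, 1, 0, 1, 1), 1)]
h = learn_monotone_conjunction(data, m=5)
print(sorted(j + 1 for j in h))                     # [1, 4, 5]
print(all(predict(h, x) == y for x, y in data))     # True
```

The procedure makes one pass over the data and does $O(m)$ work per example. When the target really is a monotone conjunction, every target variable survives elimination, so the result is consistent with the negative examples as well.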

considered so far is what is known as proper PAC-learnability, which insists that the hypothesis space of the learner must be the same as that of the target class in the domain. Now we dispense with this requirement.

The language k-CNF (a conjunction of clauses, where each clause is the disjunction of at most $k$ literals) is a superset of the language k-term DNF. In fact, the k-CNF class is a strict superset of k-term DNF. To see this, note that the 2-CNF formula $(x_1 \vee x_2) \wedge (x_3 \vee x_4) \wedge \cdots \wedge (x_{2k-1} \vee x_{2k})$ has $2^k$ terms when we unfold it into a DNF formula and cannot be represented as a k-term DNF. To understand why k-term DNF is contained in k-CNF, we show that any k-term DNF can be rewritten as a k-CNF using the distributive law. To see this, consider the following k-term DNF:

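$$\begin{aligned}
f &= T_1 \vee T_2 \vee \cdots \vee T_k \\
T_1 &= y^{(1)}_1 \wedge y^{(1)}_2 \wedge \cdots \wedge y^{(1)}_{m_1} \\
T_2 &= y^{(2)}_1 \wedge y^{(2)}_2 \wedge \cdots \wedge y^{(2)}_{m_2} \\
&\;\;\vdots \\
T_k &= y^{(k)}_1 \wedge y^{(k)}_2 \wedge \cdots \wedge y^{(k)}_{m_k}
\end{aligned}$$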

From this, we can now distribute the ‘∨’ (OR) over the ‘∧’ (AND), rewriting as

$$f = \bigwedge_{i_1 \in \{1, \ldots, m_1\},\, \ldots,\, i_k \in \{1, \ldots, m_k\}} \left( y^{(1)}_{i_1} \vee y^{(2)}_{i_2} \vee \cdots \vee y^{(k)}_{i_k} \right).$$

For example, $a \vee (b \wedge c) = (a \vee b) \wedge (a \vee c)$. But the result of this is a k-CNF formula, and PAC-learnable as discussed above. By adopting a richer hypothesis space we can obtain a positive learnability result! This happens because the computational problem of finding a consistent k-CNF formula is now tractable, while that of finding a consistent k-term DNF formula is intractable. There is no contradiction with the hardness of learning k-term DNF as k-term DNF. The k-CNF algorithm might learn a close approximation in k-CNF that is not actually expressible as a k-term DNF.

There are parallel results available for a k-clause CNF formula, which is a conjunction of the form $C_1 \wedge C_2 \wedge \cdots \wedge C_k$ where each clause $C_j$ is a disjunction of literals. A hardness result applies for learning k-clause CNF by k-clause CNF, while PAC-learnability of k-clause CNF is possible through k-DNF formulas.

Putting this all together, note that if a hypothesis space has polynomial sample complexity then so does any subset of the hypothesis space. On the other hand, a hypothesis space might be PAC-learnable but a subset of the space might NOT be PAC-learnable. This highlights the additional emphasis placed in PAC-learnability on computational complexity.
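The rewriting of a k-term DNF as a k-CNF can be sketched in a few lines of Python: each clause picks one literal from each term, so every clause has at most $k$ literals. The representation of literals as `(variable_index, is_positive)` pairs and the function names are assumptions made for this illustration.

```python
from itertools import product

def dnf_to_cnf(terms):
    """terms: list of k terms, each a list of literals (var_index, is_positive).
    Returns clauses, one per way of choosing a single literal from each term."""
    return [tuple(choice) for choice in product(*terms)]

def eval_dnf(terms, x):
    """OR over terms of the AND of their literals."""
    return any(all(x[v] == pos for v, pos in term) for term in terms)

def eval_cnf(clauses, x):
    """AND over clauses of the OR of their literals."""
    return all(any(x[v] == pos for v, pos in clause) for clause in clauses)

# A 2-term DNF over 4 variables: (x1 AND NOT x2) OR (x3 AND x4)
dnf = [[(0, True), (1, False)], [(2, True), (3, True)]]
cnf = dnf_to_cnf(dnf)
print(len(cnf), max(len(c) for c in cnf))    # 4 clauses, each with 2 literals

# Check logical equivalence on the full truth table.
equivalent = all(eval_dnf(dnf, x) == eval_cnf(cnf, x)
                 for x in product([False, True], repeat=4))
print(equivalent)                             # True
```

The number of clauses is the product of the term lengths, $m_1 \cdots m_k$, which is at most $(2m)^k$ when each term has at most $2m$ literals. For fixed $k$ that is polynomial in $m$.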

3.6 PAC-Learnability of Decision Lists

The class of k-decision lists (k-DL) is PAC-learnable. A simple decision list is a Boolean function $f$ that is defined on $\{0, 1\}^m$ by a nested if-then-else statement of the form:

$$f(x_1, \ldots, x_m) = \text{if } l_1 \text{ then } c_1 \text{ elseif } l_2 \text{ then } c_2 \;\ldots\; \text{elseif } l_k \text{ then } c_k \text{ else } c_{k+1},$$

where the $l_j$’s are literals (either one of the variables or their negations) and the $c_j$’s are true or false. The class k-DL is the extension of this where the condition in each if statement may be the conjunction of up to $k$ literals. The class of k-DL concepts is PAC-learnable using a hypothesis from k-DL.
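To make the representation concrete, here is a minimal Python sketch of how a k-decision list is evaluated. The encoding of each rule as a list of `(variable_index, is_positive)` literals paired with an output, and the name `eval_decision_list`, are assumptions for illustration. The sketch covers evaluation only, not the learning algorithm.

```python
def eval_decision_list(rules, default, x):
    """Evaluate a decision list on input x.

    rules: list of (condition, output); condition is a list of literals
           (var_index, is_positive) that must all hold (a conjunction).
    default: the value c_{k+1} returned when no condition fires.
    """
    for condition, output in rules:
        if all(x[v] == pos for v, pos in condition):
            return output          # the first matching rule decides
    return default

# A 2-DL on 3 variables:
#   if x1 AND NOT x2 then True elseif x3 then False else True
rules = [([(0, True), (1, False)], True),
         ([(2, True)], False)]

print(eval_decision_list(rules, True, (True, False, True)))    # True  (rule 1)
print(eval_decision_list(rules, True, (False, False, True)))   # False (rule 2)
print(eval_decision_list(rules, True, (False, True, False)))   # True  (default)
```

The order of the rules matters: the first input above also satisfies the second condition, but the first rule fires before it is reached.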

3.7 PAC-Learnability of General DNF Formulas

General DNF formulas can express a hypothesis space of all Boolean concepts. A DNF formula is a disjunction of terms, where a term is a conjunction of literals; e.g., $x_1 x_3 x_4 \vee x_1 x_2$ is in DNF.

The hypothesis space has size $2^{2^m}$; it is doubly exponential in $m$! (To see this, notice that a Boolean function is associated with a subset of rows in a truth table that evaluate to true, and that there are $2^m$ rows and therefore $2^{2^m}$ subsets of rows.) In this case, plugging $|H|$ into the union bound on sample complexity gives

$$n > \frac{1}{\epsilon}\left(2^m \ln 2 + \ln\frac{1}{\delta}\right)$$

and we are unable to conclude that general DNF formulas can be learned with polynomial sample complexity. Plugging in some numbers, for $\delta = \epsilon = 0.05$ and $m = 20$ this bound implies that we would need roughly 14.5 million examples.

Careful: this does not preclude that DNF formulas can have polynomial sample complexity. The union bound can be very weak and is exact only when the different random events are disjoint. The greater the overlap between events, the weaker the bound. In our case, the events “$h_1$ is consistent with $D$” and “$h_2$ is consistent with $D$” are likely to have a good deal of overlap, and so the bound is expected to be weak. Still, it can be formally proved that general DNF formulas have exponential sample complexity. This is not surprising: there is no inductive bias! For this reason, general DNFs are not known to be PAC-learnable. (Again, careful here. A negative sample complexity result does not quite preclude PAC-learnability, because sample complexity restricts attention to consistent learners and there could be a positive PAC-learnability result by using inconsistent learners.)

The example of general DNF formulas is helpful in illustrating a basic tradeoff between sample complexity and computational complexity. For general DNF formulas, the problem of finding a consistent formula is trivial because an algorithm can simply create an explicit term for every positive example. On the other hand, the sample complexity is poor. As the hypothesis space gets larger it becomes trivial to find a consistent hypothesis for a given amount of training data, but we wouldn’t expect generalizability without a massive amount of data. In particular, simply adopting a hypothesis space of general DNF formulas does not allow for learning.
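The size of this bound is easy to reproduce. The sketch below takes $\ln |H|$ as its input rather than $|H|$, since $2^{2^m}$ is far too large to represent as a float. The helper name `finite_space_sample_bound` is hypothetical.

```python
import math

def finite_space_sample_bound(ln_H, eps, delta):
    """The quantity (1/eps) * (ln|H| + ln(1/delta)) from the union-bound analysis.
    Takes ln|H| directly so that a doubly exponential |H| is never materialized."""
    return (ln_H + math.log(1 / delta)) / eps

eps = delta = 0.05
for m in (5, 10, 20):
    ln_H = (2 ** m) * math.log(2)          # |H| = 2^(2^m) for general DNF
    bound = finite_space_sample_bound(ln_H, eps, delta)
    print(f"m = {m:2d}: about {bound:,.0f} examples")
```

For $m = 20$ this evaluates to a little over 14.5 million, and each additional attribute roughly doubles the requirement. That doubling is what exponential sample complexity means in practice.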

3.8 PAC-Learnability of Decision Trees

Similar to general DNFs, decision trees (familiar from decision tree learning and ID3) can represent all Boolean formulas. Moreover, it can be proved that general decision trees have exponential sample complexity, and decision trees are not known to be PAC-learnable.

But what about an algorithm such as ID3 that includes a preference bias in favor of finding the simplest possible decision tree (DT) consistent with training data? We’d like to apply PAC theory to such an algorithm. Let $s$ denote the minimal size of a DT representing the correct concept. Assume that a consistent learner is guaranteed to return the minimum-sized, consistent tree. Based on this, the effective hypothesis space is all hypotheses of size at most $s$. One can now define $|H|$ accordingly, and determine the sample complexity in terms of the minimum size of the representation of the target concept. Still, finding a minimum, consistent DT is computationally intractable, and no worst-case approximation bounds are known for machine learning algorithms such as ID3.

4 Variations to the PAC Model

Many variations have been proposed to Valiant’s basic PAC framework. We have already mentioned a couple. First, the hypothesis space adopted by the learner may be a superset of that from which the true concept is drawn. Second, when the target hypothesis may itself require exponential size to describe (e.g., for decision trees or DNF formulas), the PAC framework can be extended to allow the complexity to depend on size($f$), which is the minimal size of the representation of $f$ in the representation of the learner’s hypothesis class. Here are some additional examples in which the PAC framework has been extended:

  • Sometimes the true concept is outside of the hypothesis space H. The alternative framework of agnostic learning allows the true concept to be in some other class C ⊃ H, but still probably

Put another way, a set $S$ is said to be shattered if all possible ways of classifying $S$ are achievable using some $h \in H$. Now we can ask the question, how large a set can $H$ shatter? This is the VC dimension of $H$ and provides a measure for the complexity of the hypothesis space.

Definition 4. The VC dimension of $H$, denoted $VC(H)$, is the size of the largest set $S$ shattered by $H$, if that number is finite; otherwise $VC(H) = \infty$.

Specifically, if $VC(H) = d$ then some set of $d$ points can be shattered but no set of $d + 1$ points can be shattered. Given any $d + 1$ points in $X$, there is an assignment of labels to the points that cannot be represented by any $h \in H$.

Example 1 (Intervals on $\mathbb{R}$). $X$ is the space of real numbers, $H$ the space of closed intervals, where an interval $h = [a, b] \in H$ indicates a range of numbers $x \in [a, b]$ for which $f(x) = 1$. What is $VC(H)$? The answer is 2. It is clear that we can represent each of the four dichotomies on two points $x_1 < x_2$ with an interval, as follows:

  • $(-, -)$: an interval containing neither $x_1$ nor $x_2$
  • $(-, +)$: an interval containing only $x_2$
  • $(+, -)$: an interval containing only $x_1$
  • $(+, +)$: an interval containing both $x_1$ and $x_2$

However, no set of three distinct points can be shattered. We may assume, without loss of generality, that $x_1 < x_2 < x_3$. The following labeling cannot be represented by an interval:

  • $(+, -, +)$ on $(x_1, x_2, x_3)$

Note that having $x_1 = x_2$ does not help to avoid this failure.
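The interval example is small enough to verify by brute force. The Python sketch below checks every labeling of a point set. It relies on the observation that if any interval realizes a labeling, then so does the tightest interval around the positive points. The function names are hypothetical.

```python
from itertools import product

def interval_realizes(points, labels):
    """True if some closed interval [a, b] labels exactly the '+' points as 1."""
    positives = [p for p, y in zip(points, labels) if y == 1]
    if not positives:
        return True   # pick any interval lying away from all (finitely many) points
    a, b = min(positives), max(positives)     # tightest candidate interval
    return all((a <= p <= b) == (y == 1) for p, y in zip(points, labels))

def shattered_by_intervals(points):
    """True if every dichotomy of the points is realized by some interval."""
    return all(interval_realizes(points, labels)
               for labels in product([0, 1], repeat=len(points)))

print(shattered_by_intervals([0.0, 1.0]))         # True:  two points
print(shattered_by_intervals([0.0, 1.0, 2.0]))    # False: (+, -, +) fails
print(shattered_by_intervals([1.0, 1.0, 2.0]))    # False: x1 = x2 does not help
```

With coincident points the labelings that assign them different labels are already unrealizable, which is why repeating a point cannot rescue a three-point set.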

Example 2 (Linear separability on R^2 ). X is the R^2 plane. H is the set of linearly separable hypotheses in the plane, which is the hypothesis space of a perceptron with two inputs. H can obviously shatter 2 points. In section we will see that the VC dimension of linearly-separable hypotheses on R^2 is exactly 3. In general, the VC dimension of linear separators in m dimensions, i.e., of perceptrons with m inputs is m + 1.
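The lower-bound half of this claim, that some set of three points in the plane can be shattered, can be checked numerically. The sketch below runs the perceptron update rule on every labeling of three non-collinear points. The choice of points, the epoch limit, and the function name are assumptions for illustration. A return value of `False` only means that no separator was found within the limit, so it is not by itself a proof of non-separability.

```python
from itertools import product

def perceptron_separates(points, labels, max_epochs=1000):
    """Search for (w, b) with w.x + b > 0 exactly on the points labeled 1."""
    w, b = [0.0, 0.0], 0.0
    targets = [1 if y == 1 else -1 for y in labels]
    for _ in range(max_epochs):
        mistakes = 0
        for (x1, x2), t in zip(points, targets):
            if t * (w[0] * x1 + w[1] * x2 + b) <= 0:   # misclassified or on boundary
                w[0] += t * x1
                w[1] += t * x2
                b += t
                mistakes += 1
        if mistakes == 0:
            return True       # every point strictly on its correct side
    return False              # no separator found within max_epochs

triangle = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
shattered = all(perceptron_separates(triangle, labels)
                for labels in product([0, 1], repeat=3))
print(shattered)                                       # True

# The XOR labeling of four points is the classic non-separable case.
square = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
print(perceptron_separates(square, (1, 0, 0, 1)))      # False
```

Since the perceptron rule is guaranteed to converge on linearly separable data, the `True` results are genuine certificates. The XOR case is known to be non-separable, which is consistent with the search failing.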

Example 3. It might be tempting to think that the VC dimension is limited by the number of parameters that define a hypothesis. But this is not the case. For example, the hypothesis class

$$\{\mathrm{sgn}(\sin(\alpha \cdot x)) : \alpha \in \mathbb{R}\} \tag{6}$$

on $x \in \mathbb{R}$ is parameterized by the single number $\alpha$ but has infinite VC dimension. Here the function $\mathrm{sgn}(y) = 1$ if $y \ge 0$ and $0$ otherwise. You can draw some points on a line and think about sin functions with different periodicity to understand this claim.
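One way to see the claim is an explicit construction, which is a standard textbook argument and an addition here, not something stated in the notes: take the points $x_i = 10^{-i}$ and, for labels $y_i \in \{0, 1\}$, set $\alpha = \pi\,(1 + \sum_i (1 - y_i) 10^i)$. The Python sketch below checks numerically, for $n = 5$ points, that every one of the $2^n$ labelings is realized. The names `alpha_for` and `all_realized` are hypothetical.

```python
import math
from itertools import product

def sgn(y):
    """sgn as defined in the text: 1 if y >= 0, else 0."""
    return 1 if y >= 0 else 0

def alpha_for(labels):
    """Assumed construction for points x_i = 10^-i, i = 1..n."""
    return math.pi * (1 + sum((1 - y) * 10 ** i
                              for i, y in enumerate(labels, start=1)))

n = 5
points = [10.0 ** -i for i in range(1, n + 1)]

all_realized = all(
    [sgn(math.sin(alpha_for(labels) * x)) for x in points] == list(labels)
    for labels in product([0, 1], repeat=n)
)
print(all_realized)    # True: these 5 points are shattered
```

The same recipe works for any $n$ in exact arithmetic, which is why the VC dimension is infinite. In floating point the check becomes unreliable once $10^{-n}$ approaches the rounding error of $\alpha x$, so the sketch keeps $n$ small.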

Example 4 (All Boolean functions). VC dimension is also defined for a finite hypothesis space. For $m$ Boolean variables, a single example is an assignment of $\{0, 1\}$ to each of $m$ variables and there are $2^m$ examples in total. A set of $2^m$ examples can be shattered since each dichotomy corresponds to a particular Boolean function, and therefore the VC dimension of the class of all Boolean functions is $2^m$.

5.1 VC dimension and Sample Complexity

The VC dimension provides a measure of the complexity of the hypothesis space and can be used to upper bound the sample complexity for infinite hypothesis spaces (where the union bound approach fails). In fact, many infinite hypothesis spaces have finite VC dimension. The VC dimension actually gives us lower as well as upper bounds on the sample complexity.

Theorem 2. The sample complexity of a (perhaps continuous) hypothesis space $H$ is such that if the number of training examples,

$$n > \frac{1}{\epsilon}\left(4 \log_2\frac{2}{\delta} + 8\, VC(H) \log_2\frac{13}{\epsilon}\right)$$

then the probability that there is a bad hypothesis (with error greater than $\epsilon$) consistent with a training set of $n$ samples is at most $\delta$, where $VC(H)$ is the VC dimension of the hypothesis space.
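To get a feel for the magnitudes, the bound of Theorem 2 can be evaluated directly. The sketch below assumes the usual form of this bound, with arguments $2/\delta$ and $13/\epsilon$ inside the logarithms; those two constants are the assumption to check against the original statement. The function name `vc_sample_bound` is hypothetical.

```python
import math

def vc_sample_bound(vc_dim, eps, delta):
    """(1/eps) * (4 log2(2/delta) + 8 * VC(H) * log2(13/eps)).
    The constants 2 and 13 are assumed from the standard form of the bound."""
    return (4 * math.log2(2 / delta) + 8 * vc_dim * math.log2(13 / eps)) / eps

# A perceptron on m inputs has VC dimension m + 1 (stated in the text).
m, eps, delta = 10, 0.1, 0.05
n = vc_sample_bound(vc_dim=m + 1, eps=eps, delta=delta)
print(f"about {n:,.0f} examples")

# The bound is linear in the VC dimension: doubling it roughly doubles n.
print(vc_sample_bound(2 * (m + 1), eps, delta) / n)
```

For these values the bound comes out in the thousands, far above the 140 examples obtained for conjunctions on ten Boolean attributes. The constants in the VC-based bound are large, so it is mainly useful where no finite $|H|$ is available.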

Stated as an asymptotic result, we have that the sample complexity is

$$O\left(\frac{1}{\epsilon}\left[\ln\frac{1}{\delta} + VC(H) \ln\frac{1}{\epsilon}\right]\right)$$

Comparing with the result for a finite hypothesis space, we see that $VC(H) \ln\frac{1}{\epsilon}$ takes the role of $\ln |H|$ in Eq. (4). There is also an information-theoretic lower bound that demonstrates that this sample complexity bound is almost tight.

Theorem 3. For any hypothesis class $H$ with finite VC dimension, finding an $\epsilon$-good hypothesis with probability at least $1 - \delta$ requires at least

$$\Omega\left(\frac{1}{\epsilon}\left[\ln\frac{1}{\delta} + VC(H)\right]\right)$$

examples.

In general, we see that the number of examples required to PAC-learn in hypothesis space $H$ grows linearly with the VC dimension, $VC(H)$. In particular, if $VC(H) = \infty$ then $H$ is not PAC-learnable. We see that VC dimension works for both infinite and finite hypothesis spaces, and provides both an upper and lower bound on sample complexity. For domains with a finite hypothesis space, such as Boolean domains, the use of $\ln |H|$ from the union bound tends to provide a stronger upper bound on sample complexity than the use of the VC dimension.^4 For real-valued inputs and continuous hypothesis spaces, it is necessary to adopt the VC dimension, and this provides a way to extend sample complexity analysis to such domains.

5.2 Application: Neural Networks

The hypothesis space for a neural network is continuous because the weights on connections are continuous. Can we use VC dimension to determine how many hidden units to adopt in a neural network based on the availability of examples in the training data? Similarly, for a given architecture we can look to use the VC dimension to predict how many training examples are needed. The VC dimension for a perceptron on $m$ inputs is $m + 1$. Suppose we have a multi-layer network of nodes, where each node behaves like a perceptron, i.e., each node uses a threshold activation function. What sets of points can the entire network shatter? An answer is provided by the following theorem. This theorem requires that the network be layered, which means that the nodes can be partitioned so that there are no intra-layer edges and all edges go between adjacent layers. Most neural network designs, including all the ones we have considered, are layered.

(^4) One exception to this is the space of linear threshold hypotheses on m Boolean inputs, where the VC dimension is m + 1 but ln |H| is quadratic in m.