





















This document discusses the need to understand complex information-processing systems at multiple levels of description, using vision research as a case study. It describes the four levels of description, provides examples of theories at each level, and briefly reviews the prospects for the field. It is a working paper from the Artificial Intelligence Laboratory at the Massachusetts Institute of Technology.






















MASSACHUSETTS INSTITUTE OF TECHNOLOGY ARTIFICIAL INTELLIGENCE LABORATORY
Working Paper 131, August 1976
description: (1) that at which the nature of a computation is expressed; (2) that at which the algorithms that implement a computation are characterised; (3) that at which an algorithm is committed to particular mechanisms; and (4) that at which the mechanisms are realised in hardware. In general, the nature of a computation is determined by the problem to be solved, the mechanisms that are used depend upon the available hardware, and the particular algorithms chosen depend on the problem and on the available mechanisms. Examples are given of theories at each level from current research in vision, and a brief review of the immediate prospects for the field is given.
This report describes research done at the Artificial Intelligence Laboratory of the Massachusetts Institute of Technology. Support for the laboratory's artificial intelligence research is provided in part by the Advanced Research Projects Agency of the Department of Defense under Office of Naval Research contract N00014-75-C-0643.
Working papers are informal papers intended for internal use.
2 VISUAL INFORMATION PROCESSING
Introduction

Modern neurophysiology has learned much about the operation of the individual neuron, but deceivingly little about the meaning of the circuits they compose. The reason for this can be attributed, at least in part, to a failure to recognise what it means to understand a complex information-processing system. Complex systems cannot be understood as a simple extrapolation of the properties of their elementary components. One does not formulate a description of thermodynamical effects using a large set of wave equations, one for each of the particles involved. One describes such effects at their own level, and tries to show that, in principle, the microscopic and macroscopic descriptions are consistent with one another.

The core of the problem is that a system as complex as a nervous system or a developing embryo must be analyzed and understood at several different levels. For a system that solves an information processing problem, we may distinguish four important levels of description. At the lowest, there is basic component and circuit analysis -- how do transistors, neurons, diodes and synapses work? The second level is the study of particular mechanisms: adders, multipliers, and memories accessed by address or by content. The third level is that of the algorithm, and the top level contains the theory of the overall computation.

For example, take the case of Fourier analysis. The computational theory of the Fourier transform is well understood, and is expressed independently of the particular way in which it is computed. One level down, there are several algorithms for implementing a Fourier transform, for example the Fast Fourier transform (Cooley & Tukey 1965), which is a serial algorithm, and the parallel "spatial" algorithm that is based on the mechanisms of laser optics. All these algorithms carry out the same computation, and the choice of which one to use depends upon the particular mechanisms that are available. If one has fast digital memory, adders and multipliers, one will use the FFT, and if one has a laser and photographic plates, one will use an "optical" algorithm. In general, mechanisms are strongly determined by hardware, the nature of the computation is determined by the problem, and the algorithms are determined by the computation and the available mechanisms.

Each of these four levels of description has its place in the eventual understanding of perceptual information processing, and it is important to keep them separate. Of course, there are logical and causal relationships among them, but the important point is that these levels of description are only loosely related. Too often in attempts to relate psychophysical problems to physiology there is confusion about the level at which a problem arises - is it related mainly to biophysics (like after-images) or primarily to information processing (like the ambiguity of the Necker cube)? More disturbingly, although the top level is the most neglected, it is also the most important. This is because the structure of the computations that underlie perception depends more upon the computational problems that have to be solved than on the particular hardware in which their solutions are implemented. There is an analog of this in physics, where a thermodynamical approach represented, at least historically, the first stage in the study of matter. A description in terms of mechanisms or elementary components usually appears afterwards.
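The Fourier example in the introduction can be made concrete with a short sketch. The code below is not from the paper: it is a minimal Python illustration in which `dft_direct` and `fft_radix2` (names chosen here) are two different algorithms for one and the same computation, the discrete Fourier transform. The first evaluates the definition term by term; the second is a textbook recursive radix-2 variant of the Cooley & Tukey scheme and assumes the input length is a power of two.

```python
import cmath

def dft_direct(x):
    """Evaluate the DFT definition term by term (about n^2 operations)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]

def fft_radix2(x):
    """Recursive radix-2 FFT; assumes len(x) is a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft_radix2(x[0::2])   # transform of the even-indexed samples
    odd = fft_radix2(x[1::2])    # transform of the odd-indexed samples
    out = [0j] * n
    for k in range(n // 2):
        tw = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + tw
        out[k + n // 2] = even[k] - tw
    return out

signal = [1.0, 2.0, 0.0, -1.0, 0.5, 0.0, -2.0, 1.5]
a = dft_direct(signal)
b = fft_radix2(signal)
# Largest disagreement between the two algorithms (floating-point noise only).
max_diff = max(abs(p - q) for p, q in zip(a, b))
```

Both functions agree up to rounding error, which is the sense in which the level of the computation is independent of the level of the algorithm: a caller who only needs the transform cannot tell which one was used, except by its cost.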
Marr
Figure 1. Examples of pairs of perspective line drawings presented to the subjects of Shepard & Metzler's (1971) experiments on mental rotation. (A) A "same" pair, which differs by an 80 degree rotation in the picture plane; (B) a "same" pair which differs by an 80 degree rotation in depth; (C) a "different" pair, which cannot be brought into congruence by any rotation. The time taken to decide whether a pair is the "same" varies linearly with the (3-D) angle by which one must be rotated to be brought into correspondence with the other. (Reconstructed from figure 1 of Shepard & Metzler, 1971.)
different approach. In 1971, R. N. Shepard and J. Metzler (1971) made line drawings of simple objects, which differed from one another either by a 3-D rotation relative to the viewer, or by a rotation plus a reflection (see figure 1). They asked how long it took to decide whether two depicted objects differed by a rotation and reflection, or merely a rotation. They found that the time taken depended on the 3-D angle of rotation necessary to bring the two objects into correspondence, not the 2-D angle between their images; and that it varied linearly with this angle. Similar findings have been reported in many subsequent investigations, and have led to the resurgence of ideas about mental imagery, and to analogies between visual recognition and computer graphics systems (Shepard 1975).

Interesting and important though these findings are, one must sometimes be allowed the luxury of pausing to reflect upon the overall trends that they represent, in order to take stock of the kind of knowledge that is accessible to these techniques. This proposal is itself an attempt at examining the link between various current approaches, including those of neurophysiology and psychophysics. We would also like to know what the limitations of these approaches are, and how one can compensate for their deficiencies. Perhaps the most striking feature of these disciplines at present is their phenomenological character. They describe the behavior of cells or of subjects, but do not explain it. What is area 17 actually doing? What are the problems in doing it that need explaining, and at what level of description should such explanations be sought?

In trying to come to grips with these problems, D. Marr and his students at the M. I. T. Artificial Intelligence Laboratory have adopted a point of view that regards visual perception as a problem primarily in information processing. The problem commences with a large, gray-level intensity array, and it culminates in a description that depends on that array, and on the purpose that the viewer brings to it. Viewed in this light, a theory of visual information processing will exhibit the four levels of description that, as we saw in the introduction, are attached to any device that solves an information processing problem; and the first task of a theory of vision is to examine the top level. What exactly is the underlying nature of the computations being performed during visual perception?
A computational approach to vision

The empirical findings of the last 20 years, together with related anatomical (Allman 1972, 1973, 1974a, b & c, Zeki 1971) and clinical (e.g. Luria 1970, Critchley 1953, Vinken & Bruyn 1969) experience, have strengthened a view for which widespread indirect evidence previously existed, namely that the cerebral cortex is divided into many different areas that are distinguished structurally, functionally and by their anatomical connections. This suggests that, to a first approximation, visual information processing can be thought of as having a modular structure, a view which is strongly supported by evolutionary arguments. If this is true, the task of a top-level theory of vision is clear: what are the modules, what does each do, and how? The approach of the M. I. T. Artificial Intelligence Laboratory to the vision problem rests on these assumptions. We believe that the principal problems at present are (a) to formulate the likely modularization, and (b) to understand the computational problems each module presents. Unlike simpler systems like the fly
REPRESENTATION OF 3-D STRUCTURE
AXES FOUND IN IMAGE
2½-D LABELLING OF CONTOURS
FIGURE-GROUN
STEREO LIGHTNESS AGGREGATION TEXTURE MOTION
LEFT & RIGHT PRIMAL SKETCHES
LEFT & RIGHT IMAGES
Figure 2. This diagram summarises our overall view of the visual recognition problem, and it embodies several points that our approach takes as assumptions. The first is that the recognition process decomposes to a set of modules that are to a first approximation independent. The simplified subdivision shown here consists of four main stages, each of which may contain several modules. (1) The translation of the image into a primitive description called the primal sketch (Marr 1976b); (2) The division of the primal sketch into regions or forms, through the action of various grouping processes ranging in scope from the very local to global predicates like a rough type of connectedness; (3) The assignment of an axis-based description to each form (see figure 4); and (4) The construction of a 3-D model for the viewed shape, based initially on the axes delivered by (3). The relation between the 3-D model representation of a shape and the image of that shape is found and maintained with the help of the image-space processor. Finally, the representation of the geometry of a shape is separate from the representation of the shape's use or purpose (Warrington & Taylor 1973).
Figure 3. 3a shows the image of a toy bear, printed in a font with 16 grey levels. In 3b, the intensity at each point is represented along the z-axis. 3c illustrates the spatial component of the raw primal sketch as obtained from this image. Associated with each line segment are measures of contrast, type and extent of the intensity change, position and orientation. This image is so simple that purely local grouping processes suffice to extract the major forms from the primal sketch. These forms are exhibited in 3d, e & f.
Figure 4. The geometry of constraints on the computation of binocular disparity. 4a illustrates the constraints for the case of a one-dimensional image. Lx and Ly represent the positions of descriptive elements from the left and right views, and the horizontal and vertical lines indicate the range of disparity values that can be assigned to left-eye and right-eye elements. The uniqueness condition states that only one disparity value may be assigned to each descriptive element. That is, only one disparity value may be "on" along each horizontal or vertical line. The continuity condition states that we seek solutions in which disparity values vary smoothly almost everywhere. That is, solutions tend to spread along the dotted diagonals, which are lines of constant disparity, and between adjacent diagonals. 4b shows how this geometry appears at each intersection point. The constraints may be implemented by a network with positive and negative interactions that obey this geometry, because the stable states of such a network are precisely the states that satisfy the constraints on the computation. 4c shows the constraint geometry for a 2-dimensional image. The negative interactions remain essentially unchanged, but the positive ones now extend over a small 2-dimensional neighbourhood. A network with this geometry was used to perform the computation exhibited in figure 8.
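The one-dimensional constraint geometry described in this legend can be sketched as a small iterative network. The Python below is an illustrative toy, not the network used in the paper: the function name `cooperative_stereo`, the neighbourhood `radius`, the inhibition weight `eps`, the threshold `theta`, and the rule that a node may only stay "on" where an initial match exists are all assumptions chosen for this example. Nodes at the same disparity excite one another (continuity), and nodes that share a left-eye or right-eye element inhibit one another (uniqueness).

```python
def cooperative_stereo(left, right, disparities, radius=2, eps=0.4, theta=0.1, iters=5):
    """Toy 1-D cooperative network; node (x, d) pairs left[x] with right[x + d]."""
    n = len(left)
    # Candidate matches: both elements present (disparities assumed non-negative).
    c0 = {(x, d): 1 if left[x] and x + d < n and right[x + d] else 0
          for x in range(n) for d in disparities}
    state = dict(c0)
    for _ in range(iters):
        new = {}
        for (x, d), possible in c0.items():
            # Positive interactions: neighbours on the same line of constant disparity.
            excite = sum(state.get((x + k, d), 0)
                         for k in range(-radius, radius + 1) if k != 0)
            # Negative interactions: rivals sharing the same left or right element.
            same_left = sum(state[(x, e)] for e in disparities if e != d)
            same_right = sum(state.get((x + d - e, e), 0) for e in disparities if e != d)
            support = excite - eps * (same_left + same_right)
            new[(x, d)] = 1 if possible and support >= theta else 0
        if new == state:  # stable state reached
            break
        state = new
    return {x: [d for d in disparities if state[(x, d)]] for x in range(n) if left[x]}

left = [1, 0, 1, 1, 0, 0]
right = [0, 1, 0, 1, 1, 0]   # the left pattern shifted by one position
result = cooperative_stereo(left, right, disparities=[0, 1, 2])
print(result)  # {0: [1], 2: [1], 3: [1]}
```

In this input the elements at positions 2 and 3 each start with two candidate disparities. The false candidates receive no support along their own diagonal and are inhibited by the true ones, so the network settles with exactly one disparity per element.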
three-dimensional shapes. One component deals with the nature of the representation system that is used, and the other with how to obtain it from the types of description that can be delivered from the primal sketch. The key ingredients of the representation system are:

(a) The deep structure of the three-dimensional representation of an object consists of a stick figure, where in formal terms each stick represents one or more axes in the object's generalized cone representation, as illustrated in figure 5. In fact, a hierarchy of stick figures exists, which allows one to describe an object on various scales with varying degrees of detail.

(b) Each stick figure is defined by a propositional database called a 3-D model. The geometrical structure of a 3-D model is specified by storing the relative orientations of pairs of connecting axes. This specification is local rather than global, and it contrasts with schemes in which the position of each axis is specified in isolation, using some circumscribing frame of reference. (See legend to figure 5.)

(c) When a 3-D model is being used to interpret an image, the geometrical relationships in the model are interpreted by a computationally simple mechanism called the image-space processor, which may be thought of as a device for representing the positions of two vectors in 3-space, and for computing their projections onto the image.

(d) During recognition, a sophisticated interaction takes place between the image, the 3-D model, and the image-space processor. This interaction gradually relaxes the stored 3-D model onto the axes computed from the image. Some facets of this process resemble the computation of a 3-D rotation, but a simple computer graphics metaphor is misleading. In fact, the rotations take place on abstract vectors (the axes) that are not even present in the original image; and at any moment, only two such vectors are explicitly represented.

The essence of this part of the theory is a method for representing the spatial disposition of the parts of an object and their relation to the viewer.
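The local, relative specification in (b) and the projection step in (c) can be illustrated with a minimal sketch. Everything named here is hypothetical: the dictionary layout, the helper names, the convention that each axis points along its own local y direction, and the use of orthographic projection are assumptions made for the example, not details given in the text. Each entry stores only the orientation of an axis relative to the axis it is attached to.

```python
import math

def rot_z(deg):
    """Rotation matrix for a rotation of `deg` degrees in the picture plane."""
    c, s = math.cos(math.radians(deg)), math.sin(math.radians(deg))
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def apply(m, v):
    return tuple(sum(m[i][k] * v[k] for k in range(3)) for i in range(3))

def viewer_rotation(model, name):
    """Compose relative orientations up the chain of attachments."""
    parent, relative = model[name]
    if parent is None:
        return relative
    return matmul(viewer_rotation(model, parent), relative)

def project(model, name):
    """Orthographic image of the unit axis vector (drop the depth coordinate)."""
    x, y, _depth = apply(viewer_rotation(model, name), (0.0, 1.0, 0.0))
    return (round(x, 6), round(y, 6))

# name -> (axis it is attached to, orientation relative to that axis)
model = {
    "torso": (None, rot_z(0)),
    "upper_arm": ("torso", rot_z(-90)),
    "lower_arm": ("upper_arm", rot_z(90)),
}
upright = {name: project(model, name) for name in model}

# Re-posing the whole figure needs only one stored relation to change.
turned_model = dict(model, torso=(None, rot_z(90)))
turned = {name: project(turned_model, name) for name in turned_model}
```

Because no axis is stored in a global frame, turning the torso re-orients the arm axes automatically, which is one practical consequence of the local specification.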
6: 2 1/2 - dimensional analysis of an image (Marr 1976c, Marr & Vatan in preparation)

In simple images, the forms delivered from the primal sketch correspond to the contours of physical objects. Finally therefore, we need to bridge the gap between such forms and the beginning of the 3-D analysis described in the previous paragraph. We call this 2 1/2 - dimensional analysis, and it consists largely of assigning to contours labels that reflect aspects of their 3-dimensional configuration, before that configuration has been made explicit. The most powerful single idea here is the distinction between convex and concave edges and contour segments. One can show that these distinctions are preserved by orthogonal projections, and can be made the basis of a segmenting technique that decomposes a figure into 2-D regions that correspond to the appropriate 3-D decomposition for a wide range of viewing angles (see figure 6). Marr (1976c) has proved that the assumptions that are implicit in the use of the convex-concave distinction to analyze a contour are equivalent to assuming that the viewed shapes are composed of generalized cones. This lends additional support for using the stick-figure scheme based on generalized
Figure 5. Examples of 3-D models, and their arrangement into the 3-D model representation of a human shape. A 3-D model consists of a model axis (a) and component axes (b) that consist of a principal axis (the torso) and several auxiliary axes (the head and limbs) whose positions are described relative to the principal axis. The complete human 3-D model is enclosed in a rectangle (c). The 3-D model representation is obtained by concatenating 3-D models for different parts at different levels of detail. This is achieved by allowing a component axis of one 3-D model to be the model axis of another. Here, for example, the arm auxiliary axis in the human 3-D model acts as the model axis for the arm 3-D model, which itself has two component axes, the upper and lower arms. The figure shows how this scheme extends downwards as far as the fingers.
Figure 6. Analysis of a contour from Vatan and Marr (1976). The outline (a) was obtained by applying local grouping operations to a primal sketch, as in figure 4. It is then smoothed, and divided into convex and concave components (b). The outline is searched for deeply concave points or components, which correspond to strong segmentation points. One such point is marked with an open circle in (c). There are usually several possible matching points for each strong segmentation point, and the candidates for the marked point are shown here by filled circles (c). The correct mates for each segmentation point can usually be found by eliminating relatively poor candidates. The result of doing this here is the segmentation shown in (d). Once these segments have been defined, their corresponding axes (thick lines) are easy to obtain (e). They do not usually connect, but may be related to one another by intermediate lines which are called embedding relations (thin lines in f). According to the 3-D representation theory, the resulting stick figure (f) is the deep structure on which interpretation of this image is based.
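The first two steps of this analysis, dividing an outline into convex and concave parts and then picking out deeply concave points, can be sketched for a polygonal outline. The code below is a simplified stand-in, not the procedure of Vatan and Marr: it assumes a closed contour listed counter-clockwise, uses the signed turning angle at each vertex, and treats a hypothetical cutoff of -60 degrees as "deeply concave". Smoothing and the search for matching points are omitted.

```python
import math

def turning_angles(contour):
    """Signed turn (degrees) at each vertex of a closed, counter-clockwise contour."""
    n = len(contour)
    angles = []
    for i in range(n):
        (x0, y0), (x1, y1), (x2, y2) = contour[i - 1], contour[i], contour[(i + 1) % n]
        ax, ay = x1 - x0, y1 - y0          # incoming edge
        bx, by = x2 - x1, y2 - y1          # outgoing edge
        cross = ax * by - ay * bx          # > 0: left turn (convex), < 0: right turn
        dot = ax * bx + ay * by
        angles.append(math.degrees(math.atan2(cross, dot)))
    return angles

def label_contour(contour, deep_cutoff=-60.0):
    angles = turning_angles(contour)
    labels = ["convex" if a > 0 else "concave" for a in angles]
    strong = [p for p, a in zip(contour, angles) if a <= deep_cutoff]
    return labels, strong

# An L-shaped outline; its single re-entrant corner is at (1, 1).
outline = [(0, 0), (4, 0), (4, 1), (1, 1), (1, 4), (0, 4)]
labels, strong_points = label_contour(outline)
```

For the L-shaped outline the only concave vertex is the inner corner, which is also the natural place to cut the shape into its two arms.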
cones to represent 3-D shapes. The theory assigns many alternating figure effects like the Necker cube to the existence of alternative self-consistent labellings computed at this stage. It is perhaps worth mentioning one interesting point that has emerged from this way of recognising and representing 3-D shapes. Warrington & Taylor (1973) described patients with right parietal lesions who had difficulty in recognising objects seen in "unconventional" views - like the view of a water pail seen from above (see figure 7). They did not attempt to define what makes a view unconventional. According to our theory, the most troublesome views of an object will be those in which its stick-figure axes cannot easily be recovered from the image. The theory therefore predicts that unconventional views in the Warrington & Taylor sense will correspond to those views in which an important axis in the object's generalised cylinder representation is foreshortened. Such views are by no means uncommon - if a 35mm camera is directed towards you, you are seeing an unconventional view of it, since the axis of its lens is foreshortened.
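The notion of a foreshortened axis can be quantified in a few lines. This is an illustrative sketch under an assumption not stated in the text, namely orthographic projection: the fraction of an axis's length that survives in the image is then the sine of the angle between the axis and the line of sight. The function name and the example vectors are chosen here for illustration.

```python
import math

def projected_fraction(axis, line_of_sight):
    """Fraction of an axis's length visible under orthographic projection."""
    dot = sum(a * v for a, v in zip(axis, line_of_sight))
    norm_a = math.sqrt(sum(a * a for a in axis))
    norm_v = math.sqrt(sum(v * v for v in line_of_sight))
    cos_angle = dot / (norm_a * norm_v)
    # sin(angle) = sqrt(1 - cos^2); clamp to guard against rounding below zero.
    return math.sqrt(max(0.0, 1.0 - cos_angle * cos_angle))

lens_axis = (0.0, 0.0, 1.0)
camera_pointed_at_viewer = projected_fraction(lens_axis, (0.0, 0.0, 1.0))  # axis along the line of sight
camera_seen_from_side = projected_fraction(lens_axis, (1.0, 0.0, 0.0))     # axis across the line of sight
pail_seen_from_above = projected_fraction((0.0, 1.0, 0.0), (0.0, -1.0, 0.0))
```

On this measure the camera pointed at the viewer and the pail seen from above both score zero for their main axis, while the side view of the camera scores one, matching the distinction the text draws between unconventional and ordinary views.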
Examples of algorithms and mechanisms

Between the top and bottom of our four levels lie descriptions of algorithms and descriptions of mechanisms. The distinction between these two levels is rather subtle, since they are often closely related. The form of a specific algorithm can impose strong constraints on the mechanisms, and conversely. Let us consider three examples.
1: "Simple" algorithms

An algorithm operates on some kind of input and yields a corresponding output. In formal terms, an algorithm can be thought of as a mapping between the input and the output space. Perhaps the simplest of all nonlinear operators on a linear space are the so-called polynomial operators. They encompass a broad spectrum of applications including all linear problems, and they approximate all sufficiently smooth, nonlinear operators. For this particular class of "simple" algorithms (i.e. representable through a "smooth" operator) polynomial representations provide a canonical decomposition in a series of simpler, multilinear operators. Figure 8 shows this decomposition in terms of interactions or "graphs" of various orders: in this way an algorithm, or its network implementation, may be decomposed into an additive sequence of simple, canonical terms, just as in another context, a function can be conveniently characterized by its various Fourier terms. Moreover, functional and computational properties can be associated with interactions of a given order and type. Poggio & Reichardt (1976) used the polynomial representation of functionals to classify the algorithms underlying movement, position and figure-ground computation in the fly's visual system. The idea was to identify which terms, among the diversity of the possible ones, are implied by the experimental data. Figure 8 shows the graphs that play a significant role in the fly's control of flight and, in this sense, characterize the algorithms involved. The notion that seems to capture best the "computational complexity" of these simple, smooth mappings is the notion of p-order (perceptron-order, see Poggio and Reichardt, 1976). Movement computation in the fly is of
Separation of the three types of interactions in the fly

Movement computation (corresponding to ru): homogeneously distributed in the eye (no strong dependence on 0 and 0); no "age" dependence (?); light intensity threshold at about 10^-4 candela/m^2 (Eckert, 1973); present in the Drosophila mutant S 129 (Heisenberg, pers. comm.).

Position ("attractiveness") computation (corresponding to D(O)): mostly in the lower part of the eye (D(.) and L(M) dependence); "age" dependence; light intensity threshold (of fixation!) at about LO- cd/m^2 (Reichardt, 1973; Wehrhahn, 1976); disturbed in the Drosophila mutant S 129 (Heisenberg, pers. comm.).

Correction to superposition rule: mostly in the lower part of the eye; light intensity threshold ?; Drosophila mutant S 129 ?
Figure 8. Graphical representation (a) of the decomposition of a "simple" nonlinear, n-input "algorithm" into a sum of interactions of various order. The functional representation $S\{x_1(t), \ldots, x_n(t)\} = L_0 + \sum_i L_1\{x_i(t)\} + \sum_{i,j} L_2\{x_i(t), x_j(t)\} + \cdots$, where $L_n$ is an n-linear mapping, can be read from an appropriate sequence of such elementary graphs. Fig. 8b shows the graphs that implement the fly's orientation behavior, studied by Reichardt and Poggio. Several findings suggest that they may correspond to separate physiological modules. Characteristic functional and computational properties can be associated with each interaction type. (From Poggio and Reichardt, 1976).
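The decomposition into interactions of increasing order can be illustrated with a deliberately simplified sketch. The code assumes a memoryless operator truncated at second order, with hypothetical coefficient arrays `l0`, `l1` and `l2`; the functionals discussed in the text act on time-dependent signals, which this toy version ignores. It separates the output into its order-0, order-1 and order-2 contributions.

```python
def polynomial_terms(x, l0, l1, l2):
    """Return the order-0, order-1 and order-2 contributions for input vector x."""
    n = len(x)
    order0 = l0                                              # constant term
    order1 = sum(l1[i] * x[i] for i in range(n))             # single-input interactions
    order2 = sum(l2[i][j] * x[i] * x[j]                      # pairwise interactions
                 for i in range(n) for j in range(n))
    return [order0, order1, order2]

l0 = 0.5
l1 = [1.0, -1.0]
l2 = [[0.0, 1.0],
      [0.0, 0.0]]            # a single second-order interaction between inputs 0 and 1

x = [1.0, 2.0]
terms = polynomial_terms(x, l0, l1, l2)                       # [0.5, -1.0, 2.0]
output = sum(terms)                                           # 1.5

# Scaling every input by c scales the order-k contribution by c**k.
scaled_terms = polynomial_terms([3.0 * v for v in x], l0, l1, l2)   # [0.5, -3.0, 18.0]
```

The scaling behaviour is what makes the terms separable in principle: contributions of different order respond differently to a change in input amplitude, so they can be told apart from input-output measurements.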