Object Recognition: Stability vs. Flexibility in Human and Machine Vision | Exercises Experimental Psychology

Visual Object Recognition

Michael J. Tarr and Quoc C. Vuong

Department of Cognitive and Linguistic Sciences

Box 1978

Brown University

Providence, RI 02912

The study of object recognition concerns itself with a two-fold problem. First, what is the form of visual object representation? Second, how do observers match object percepts to visual object representations? Unfortunately, the world isn’t color coded or conveniently labeled for us. Many objects look similar (think about four-legged mammals, cars, or song birds) and most contain no single feature or mark that uniquely identifies them. Even worse, objects are rarely if ever seen under identical viewing conditions: objects change their size, position, orientation, and relations between parts, viewers move about, and sources of illumination turn on and off or move. Successful object recognition requires generalizing across such changes. Thus, even if an observer has never seen a bear outside of the zoo, on a walk in the woods they can tell that the big brown furry object with teeth 20 ft in front of them is an unfriendly bear and probably best avoided or that the orange-yellow blob hanging from a tree is a tasty papaya.

Consider how walking around an object alters one’s viewing direction. Unless the object is rotationally symmetric, for example, a cylinder, the visible shape of the object will change with observer movement – some surfaces will come into view, other surfaces will become occluded and the object’s geometry will change both quantitatively and qualitatively (Tarr & Kriegman, 2001). Changes in the image as a consequence of object movement are even more dramatic – not only do the same alterations in shape occur, but the positions of light sources relative to the object also change. This alters both the pattern of shading on the object’s surfaces and the shadows cast by some parts of the object on other parts. Transformations in size, position, and mean illumination also alter


the image of an object, although somewhat less severely as compared to viewpoint/orientation changes.

Recognizing objects across transformations of the image

Theories of object recognition must provide an account of how observers compensate for a wide variety of changes in the image. Although theories differ in many respects, most attempt to specify how perceptual representations of objects are derived from visual input, what processes are used to recognize these percepts, and the representational format used to encode objects in visual memory. Broadly speaking, two different approaches to these issues have been adopted. One class of theories assumes that there are specific invariant cues to object identity that may be recovered under almost all viewing conditions. These theories are said to be viewpoint-invariant in that these invariants provide sufficient information to recognize the object regardless of how the image of an object changes (within some limits) (Marr & Nishihara, 1978; Biederman, 1987). A second class of theories argues that no such general invariants exist¹ and that object features are represented much as they appeared when originally viewed, thereby preserving viewpoint-dependent shape information and surface appearance. The features visible in the input image are compared to features in object representations, either by normalizing the input image to approximately the same viewing position as represented in visual memory (Bülthoff & Edelman, 1992; Tarr, 1995) or by computing a statistical estimate of the quality of match between the input image and candidate representations (Perrett, Oram, & Ashbridge, 1998; Riesenhuber & Poggio, 1999).

Viewpoint-invariant and viewpoint-dependent approaches make very different predictions regarding how invariance is achieved (these labels are somewhat misleading in that the goal of all theories of recognition is to achieve invariance, that is, the successful recognition of objects across varying viewing conditions). Viewpoint-invariant theories propose that recognition is itself invariant across transformations. That is,

¹ Of course invariants can be found under certain contexts. For example, if there are only three objects to be distinguished and these objects are red, green, and blue, object color becomes an invariant in this context (Tarr & Bülthoff, 1995).

friendly Gentle Ben? The key point highlighted by these different recognition tasks is that objects may be visually recognized in different ways and, critically, at different categorical levels. In the cognitive categorization literature, there is a distinction made between the superordinate, basic, subordinate, and individual levels (Rosch et al., 1976). A bear can be classified as an animal, as a bear, as a grizzly bear, and as Gentle Ben – each of these labels corresponding respectively to a different categorization of the same object. Visual recognition can occur roughly at these same levels, although visual categorization is not necessarily isomorphic with the categorization process as studied by many cognitive psychologists. For example, some properties of objects are not strictly visual, but may be relevant to categorization – chairs are used to sit on, but there is no specific visual property that defines “sitability.” A second distinction between visual and cognitive categorization is the default level of access. Jolicoeur, Gluck, and Kosslyn (1984) point out that many objects are not recognized at their basic level. For example, the basic level for pelicans is “bird,” but most people seeing a pelican would label it “pelican” by default. This level, referred to as the “entry level,” places a much greater emphasis on the similarities and differences of an object’s visual features relative to other known objects in the same object class (Murphy & Brownell, 1985). The features of pelicans are fairly distinct from those of typical birds, hence, pelicans are labeled first as “pelicans”; in contrast, the features of sparrows are very typical and sparrows are much more likely to be labeled as “birds.” Why are these distinctions between levels of access important to object recognition? As reviewed in the next section, there are several controversies in the field that center on issues related to categorical level. 
Indeed, the stability/sensitivity tradeoff discussed above is essentially a distinction about whether object recognition should veer more towards the subordinate level (emphasizing sensitivity) or the entry level (emphasizing stability). This issue forms the core of a debate about the appropriate domain of explanation (Tarr & Bülthoff, 1995; Biederman & Gerhardstein, 1995). That is, what is the default (and most typical) level of recognition? Furthermore, the particular recognition mechanisms applied by default may vary with experience, that is, perceptual

experts may recognize objects in their domain of expertise at a more specific level than novices (Gauthier & Tarr, 1997a). Some theorists – most notably Biederman (1987) – presuppose that recognition typically occurs at the entry level and that any theory of recognition should concentrate on accounting for how the visual system accomplishes this particular task. In contrast, other theorists – Bülthoff, Edelman, Tarr, and others (see Tarr & Bülthoff, 1998; Hayward & Williams, 2000) – argue that the hallmark of human recognition abilities is flexibility and that any theory of recognition should account for how the visual system can recognize objects at the entry, subordinate, and individual levels (and anything in between). This distinction is almost isomorphic with the viewpoint-invariant/viewpoint-dependent distinction raised earlier. Specifically, viewpoint-invariant theories tend to assume the entry level as the default and concentrate on accounting for how visual recognition at this level may be achieved. In contrast, viewpoint-dependent theories tend to assume that object recognition functions at many different categorical levels, varying with context and task demands. A second, somewhat related, debate focuses on the scope of posited recognition mechanisms. Some theorists argue that there are at least two distinct mechanisms available for recognition – generally breaking down along the lines of whether the recognition discrimination is at the entry or the subordinate level (Jolicoeur, 1990; Farah, 1992). Some researchers have suggested that there may be several “special purpose devices” devoted to the task of recognizing specific object classes, for example, a neural module for face recognition, another for place recognition, and one for common object recognition (Kanwisher, 2000). 
Alternatively, it has been argued that recognition at many levels and for all object categories can be accomplished by a single, highly plastic system that adapts according to task constraints and experience (Tarr & Gauthier, 2000; Tarr, in press). This and the aforementioned debates have produced an extensive research literature addressing the nature of visual object recognition. In order to better understand these controversies, we next review the particular dimensions typically used both to characterize object representations and to constrain potential mechanisms of recognition.

responses measuring brightness or color, oriented lines, T-junctions, corners, etc. (Tanaka, 1996). Examples of global features include 3D component parts realized as simple volumes that roughly capture the actual shape of an object (Marr & Nishihara, 1978; Biederman, 1987). Immediately, significant differences between these two approaches are apparent. On the one hand, an appealing aspect of local features is that they are readily derivable from retinal input and the natural result of earlier visual processing as discussed in prior chapters in this volume. In contrast, 3D parts must be recovered from 2D images in a manner that is not entirely obvious given what is currently known about visual processing. On the other hand, it is hard to imagine how stability is achieved using only local features – the set of features visible in one viewpoint of an object is likely to be very different from the feature sets that are visible in other viewpoints of the same object or other similar objects. Even slight variations in viewpoint, illumination, or configuration may change the value of local responses and, hence, the object representation. Furthermore, 3D parts yield stability – so long as the same invariants are visible, the same set of 3D parts may be recovered from many different viewpoints and across many different instances of an object class. Thus, variations in viewpoint, illumination, or configuration are likely to have little impact on the qualitative representation of the object.

Dimensionality

The range of features that may form the representation is quite wide, but cutting across all possible formats is their degree of dimensionality, that is, how many spatial dimensions are encoded. The physical world is three-dimensional, yet the optic array sampled by the retinae is two-dimensional. As discussed in earlier chapters, one goal of vision is to recover properties of the 3D world from this 2D input (Marr, 1982). 
Indeed, 3D perception seems critical for grasping things, walking around them, playing ping-pong, etc. However, recovery of 3D shape may not be critical to the process of remembering and recognizing objects. Thus, one can ask whether object representations are faithful to the full 3D structure of objects or to the 2D optic array, or to something in between. As discussed above, some theories argue that complete, 3D

models of objects are recovered (Marr & Nishihara, 1978) or that object representations are 3D, but can vary depending on the features visible from different viewpoints (Biederman, 1987). Others have argued that object representations are strictly 2D; that is, preserving the appearance of the object in the image with no reference to 3D shape or relations (Edelman, 1993). An intermediate stance is that object representations are not strictly 2D or 3D, but rather represent objects in terms of visible surfaces, including local depth and orientation information. Such a representation is sometimes termed “two-and-one-half-dimensional” (2.5D; Marr, 1982). Critically, both 2D and 2.5D representations only depict surfaces visible in the original image – there is no recovery or reconstruction or extrapolation about the 3D structure of unseen surfaces or parts; 3D information instead arises from more local processes such as shape-from-shading, stereo, and structure-from-motion. In contrast, 3D representations include not only surface features visible in the input (the output of local 3D recovery mechanisms) but also additional globally recovered information about an object’s 3D structure (e.g., the 3D shape of an object part). Such 3D representations are appealing because they encode objects with a structure that is isomorphic with their instantiation in the physical world. However, deriving 3D representations is computationally difficult because 3D information must be recovered and integrated (Bülthoff & Edelman, 1993).

How are Features Related to One Another?

Features are the building blocks of object representations. But by themselves, they are not sufficient to characterize, either quantitatively or qualitatively, the appearance of objects. A face, for example, is not a random arrangement of eyes, ears, nose, and mouth, but rather a particular set of features in a particular spatial arrangement. 
Object representations must therefore express the spatial relations between features. One aspect of how this is accomplished is whether the spatial relations between features are represented at a single level, in which all features share the same status, or whether there is a hierarchy of relations. For instance, Marr and Nishihara (1978) hypothesized that a small number of parts at the top of a hierarchy are progressively decomposed into constituent parts and their spatial relationships at finer and finer scales – for example, an arm can be decomposed into an upper arm, forearm, and hand. The hand, in turn, can

these features in the image have been dubbed image-based models (although this term still encompasses a wide variety of approaches). Global, qualitative models tend to assume a much coarser coding of the spatial relations between features. Biederman (1987; Hummel & Biederman, 1992) argues that spatial relations are encoded in a qualitative manner that discards metric relations between object features yet preserves their critical structural relations. On this view, for example, the representation would code that one part is above another part, but not how far above or how directly above (a “top-of” relation). The resulting concatenation of features (in Biederman’s model, 3D parts) and qualitative structural relations is often referred to as a structural description. One other possibility should be noted. All spatial relations between features might be discarded and only the features themselves represented: so a face, for instance, might just be a jumble of features! Such a scheme can be conceptualized as an array of non-localized feature detectors uniformly distributed across the retinal array (dubbed “Pandemonium” by Selfridge, 1959). The resulting representation might be more stable, but only so long as the same features or a subset of these features are present somewhere in the image and their presence uniquely specifies the appropriate object. Although there are obvious problems with this approach, it may have more merit than it is often given credit for, particularly if one assumes an extremely rich feature vocabulary and a large number of features per object (see Tarr, 1999).

Frames of Reference

As pointed out in the previous section, most theorists agree that intermediate stages of visual processing preserve at least the rough geometry of retinal inputs. Thus, there is implicitly, from the perspective of the observer, a specification of the spatial relations between features for intermediate-level image representations. 
Ultimately, however, the spatial relations between features are typically assumed to be explicit in high-level object representations. This explicit coding is generally thought to place features in locations specified relative to one or more anchor points or frames of reference (Marr, 1982).
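
The contrast drawn above between a structural description (features plus qualitative relations) and a bare jumble of features can be made concrete with a small sketch. It is only an illustration under stated assumptions: the part names, the image-style coordinates, and the three-valued vertical relation are all hypothetical simplifications, not the coding scheme of any published model.

```python
def relation(a, b):
    """Coarse vertical relation of part a to part b from (x, y) centers.

    Image coordinates are assumed (y grows downward). How far apart the
    parts are is deliberately discarded, as in a qualitative coding.
    """
    if a[1] < b[1]:
        return "above"
    if a[1] > b[1]:
        return "below"
    return "level-with"


def structural_description(parts):
    """All (part, relation, part) triples over ordered pairs of distinct parts."""
    return {
        (p, relation(parts[p], parts[q]), q)
        for p in parts for q in parts if p != q
    }


def feature_jumble(parts):
    """Only which features are present; every spatial relation is dropped."""
    return frozenset(parts)


# Hypothetical part centers for three "faces".
face = {"eyes": (50, 30), "nose": (50, 50), "mouth": (50, 70)}
stretched = {"eyes": (50, 10), "nose": (50, 50), "mouth": (50, 95)}
scrambled = {"eyes": (50, 70), "nose": (50, 30), "mouth": (50, 50)}

print(structural_description(face) == structural_description(stretched))
print(structural_description(face) == structural_description(scrambled))
print(feature_jumble(face) == feature_jumble(scrambled))
```

In this toy case the stretched face keeps the same structural description as the original because only metric distances changed, the scrambled face does not, and the feature jumble cannot tell the original from the scrambled arrangement.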

The most common distinction between reference frames is whether they are viewpoint-independent or viewpoint-dependent. Embedded in these two types of approaches are several different kinds of frames, each relying on different anchor points. For example, viewpoint-independent models encompass both object-centered and viewpoint-invariant representations. Consider what happens if the features of an object are defined relative to the object itself: although changes in viewpoint alter the appearance of the object, they do not change the position of a given feature relative to other features in the object (so long as the object remains rigid). Thus, the representation of the object does not change with many changes in the image. The best known instantiation of an object-centered theory was proposed by Marr and Nishihara (1978). They suggest that an object’s features are specified relative to its axis of elongation, although other axes, such as the axis of symmetry, are also possible (McMullen & Farah, 1991). So long as an observer can recover the elongation axis for a

Figure 1 (panels a, b). Different viewpoints of objects often reveal (or occlude) different features. Thus, it seems unlikely that a complete 3D object representation could be derived from any single viewpoint.
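
The idea of specifying features relative to an object's own axis of elongation can be sketched for a flat, rigid set of 2D feature points. This is a minimal illustration, not Marr and Nishihara's procedure: the use of second moments to find the axis, the skewness rule for choosing which way the axis points, and the example coordinates are all assumptions made here.

```python
import math


def object_centered(points):
    """Re-express 2D feature locations relative to the object's own axes."""
    n = len(points)
    cx = sum(x for x, _ in points) / n
    cy = sum(y for _, y in points) / n
    centered = [(x - cx, y - cy) for x, y in points]

    # Second moments give the orientation of the axis of elongation.
    sxx = sum(x * x for x, _ in centered)
    syy = sum(y * y for _, y in centered)
    sxy = sum(x * y for x, y in centered)
    theta = 0.5 * math.atan2(2.0 * sxy, sxx - syy)
    ux, uy = math.cos(theta), math.sin(theta)

    # An axis has no built-in direction. As a hypothetical tie-break, point
    # it toward the side where the features trail off furthest (positive skew).
    skew = sum((x * ux + y * uy) ** 3 for x, y in centered)
    if skew < 0:
        ux, uy = -ux, -uy

    # Coordinates along the axis, and along its perpendicular (-uy, ux).
    return [(x * ux + y * uy, -x * uy + y * ux) for x, y in centered]


def change_view(points, angle_deg, shift):
    """Rotate the feature points in the image plane and translate them."""
    a = math.radians(angle_deg)
    c, s = math.cos(a), math.sin(a)
    return [(c * x - s * y + shift[0], s * x + c * y + shift[1])
            for x, y in points]


# Hypothetical feature locations of an elongated, asymmetric object.
features = [(0, 0), (1, 0), (2, 0), (6, 0.5), (0, 1), (1, 1)]
reference = object_centered(features)

for angle in (75, 150, 225, 300):
    seen = object_centered(change_view(features, angle, (3.0, -2.0)))
    same = all(math.isclose(p, q, abs_tol=1e-9)
               for a, b in zip(reference, seen) for p, q in zip(a, b))
    print(angle, same)
```

The image coordinates differ at every orientation, yet the object-centered coordinates coincide. The sketch only covers rotation in the picture plane; it says nothing about rotation in depth, where features become occluded.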

In contrast to viewpoint-invariant models, viewpoint-dependent models inherently encompass retinotopic, viewer-centered (egocentric), and environment-centered (spatiotopic or allocentric) frames, anchored to the retinal image, the observer, or the environment, respectively. That is, objects are represented from a particular viewpoint, which entails multiple representations. Put another way, object representations that use a viewer-centered reference frame are tied more or less directly to the object as it appears to the viewer or, in the case of allocentric frames, relative to the environment. As such, they are typically assumed to be less abstract and more visually-rich than viewpoint-independent representations (although this is simply a particular choice of the field; viewpoint-dependence/independence and the richness of the representation are technically separable issues). It is often thought that viewpoint-dependent representations may be more readily computed from retinal images as compared to viewpoint-independent representations. However, there is an associated cost in that viewpoint-dependent representations are less stable across changes in viewpoint in that they necessarily encode distinct viewpoints of the same object as distinct object representations. Thus, theories adopting this approach require a large number of viewpoints for each known object. Although this approach places higher demands on memory capacity, it does potentially reduce the degree of computation necessary for deriving high-level object representations for recognition.

Normalization Procedures

Regardless of the molar features of the representation – local features, 3D parts, or something in between – if some degree of viewpoint dependency is assumed, then the representation for a single object or class will consist of a set of distinct feature collections, each depicting the appearance of the object from a different vantage point. 
This leads to a significant theoretical problem: different viewpoints of the same object must somehow be linked to form a coherent representation of the 3D object. One solution might be to find a rough correspondence between the features present in different viewpoints. For example, the head of the bear is visible from both the front and the side, so this might be a clue that the two images arose from the same object. Unfortunately, simple geometric correspondence seems unlikely to solve this problem –

if such correspondences were available (i.e., if it were possible to map one viewpoint of an object into another viewpoint of that same object), then recognition might proceed without the need to learn the new viewpoint in the first place! So it would seem that viewpoints are either distinct or they aren’t (Jolicoeur, 1985). The conundrum of how an observer might recognize a novel viewpoint of a familiar object was addressed by Tarr and Pinker (1989). They built on the finding that human perceivers have available a “mental rotation” process (Shepard & Metzler, 1971) by which they can transform a mental image of a 3D object from one viewpoint to another. Shepard and others had reasoned that although the mental rotation process was useful for mental problem solving, it was not appropriate for object recognition. The argument was that in order to know the direction and “target” of a given rotation, an observer must already know the identity of the object in question; therefore executing a mental rotation would be moot. Put another way, how would the recognition system determine the correct direction and magnitude of the transformation prior to recognition? Ullman (1989) pointed out that an “alignment” between the input shape and known object representations could be carried out on the basis of partial information. That is, a small portion of the input could be used to compute both the most likely matches for the current input, as well as the transformation necessary to align this input with its putative matches. In practice, this means that a subset of local features in the input image are compared, in parallel, to features encoded in stored object representations. Each comparison returns a goodness-of-fit measure and the transformation necessary to align the image with the particular candidate representation (Ullman, 1989). The transformation actually executed is based on the best match among these. 
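
The logic of alignment from partial information can be sketched as follows. This is a toy version under strong assumptions that are not part of Ullman's proposal: the features are 2D points written as complex numbers, the correspondence between image and model features is taken as given, the transformation is limited to rotation, scaling, and translation in the image plane, and candidates are checked one after another rather than in parallel. The object names and coordinates are hypothetical.

```python
import cmath
import math


def fit_alignment(model_pts, image_pts):
    """Estimate z -> a*z + b from the first two point pairs only.

    The complex factor a encodes rotation and scale, b the translation.
    """
    a = (image_pts[1] - image_pts[0]) / (model_pts[1] - model_pts[0])
    b = image_pts[0] - a * model_pts[0]
    return a, b


def misfit(model_pts, image_pts, a, b):
    """Mean squared distance between the aligned model and the image."""
    errors = [abs(a * m + b - p) ** 2 for m, p in zip(model_pts, image_pts)]
    return sum(errors) / len(errors)


def recognize(image_pts, memory):
    """Align the image to every stored object and keep the best fit."""
    scored = {}
    for name, model_pts in memory.items():
        a, b = fit_alignment(model_pts, image_pts)
        scored[name] = (misfit(model_pts, image_pts, a, b), a, b)
    best = min(scored, key=lambda name: scored[name][0])
    return best, scored[best]


# Hypothetical stored objects, four feature points each.
memory = {
    "object_A": [0 + 0j, 4 + 0j, 4 + 1j, 1 + 3j],
    "object_B": [0 + 0j, 3 + 0j, 2 + 2j, -1 + 1j],
}

# An unfamiliar view of object_A: rotated 40 degrees, enlarged, and shifted.
view = cmath.rect(1.5, math.radians(40))
image = [view * z + (3 + 2j) for z in memory["object_A"]]

name, (error, a, b) = recognize(image, memory)
print(name, round(math.degrees(cmath.phase(a)), 3), error < 1e-12)
```

Each comparison returns both a goodness-of-fit value and the transformation that produced it, and the winning transformation is the one attached to the best match. Only two features are used to estimate the transformation; the remaining features serve to verify it.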
Thus, observers could learn one or more viewpoints of an object and then use these known viewpoints plus normalization procedures to map from unfamiliar to familiar viewpoints during recognition. Jolicoeur (1985) provided some of the first data suggesting that such a process exists by demonstrating that the time it takes to name a familiar object increases as that object is rotated further and further away from its familiar, upright orientation. However, this result was problematic in that upright viewpoints of mono-oriented objects are

Tarr and Pinker’s (1989) innovation was to use novel 2D objects shown to subjects in multiple viewpoints. Subjects learned the names of four of the objects and then practiced naming these objects plus three distractors (for which the correct response was “none-of-the-above”) in several orientations generated by rotations in the picture-plane (Figure 2a). Tarr and Pinker’s subjects rapidly became equally fast at naming these objects from all trained orientations. Subjects’ naming times changed when new picture-plane orientations were then introduced: they remained equally fast at familiar orientations, but naming times were progressively slower as the objects were rotated further and further away from a familiar orientation (Figure 2b). Thus, subjects were learning to encode and use those orientations that were seen most frequently (and not merely geometrically “good” views). This suggests that observers were able to invoke normalization procedures to map unfamiliar orientations of known objects to familiar orientations of those same objects. Tarr and Pinker (1989) hypothesized that these normalization procedures were based on the same mental rotation process discovered by Shepard and Metzler (1971). Tarr (1995) reported corroborating evidence using novel 3D versions of Tarr and Pinker’s 2D objects rotated in depth.

Figure 2. a) The novel 2D objects used as stimuli in Tarr and Pinker (1989). b) Average response times for subjects to name these objects in trained, familiar orientations and unfamiliar orientations (plot of Response Time in ms against Orientation, with separate Familiar and Unfamiliar series). Naming times increased systematically with distance from the nearest familiar orientation. Adapted from Tarr and Pinker (1989) Figures 1 (p. 243) and 6 (p. 255).

Some researchers have pointed out that mental rotation is an ill-defined process. What does it really mean to “rotate” a mental image? Several researchers have offered computational mechanisms for implementing normalization procedures with behavioral signatures similar to that predicted for mental rotation. These include linear combinations of views (Ullman & Basri, 1991), view interpolation (Poggio & Edelman, 1990), and statistical evidence accumulation (Perrett, Oram, & Ashbridge, 1998). Some of these normalization mechanisms make further predictions regarding recognition behavior for new viewpoints of known objects. 
For example, Bülthoff and Edelman (1992) obtained some evidence consistent with the view-interpolation models of normalization and Perrett, Oram, and Ashbridge (1998) found that responses of populations of neurons in monkey visual cortex were consistent with the evidence accumulation account of normalization. Note that viewpoint-invariant theories of recognition do not require normalization as a way for recognizing unfamiliar viewpoints of familiar objects. In particular, both Marr and Nishihara (1978) and Biederman (1987) assume that viewpoint-invariant recovery mechanisms are sufficient to recognize an object from any viewpoint, known or unknown, or that viewpoint-invariant mechanisms are viewpoint limited, but span

cylinders for the handle and spout. Based on such techniques, some researchers suggested that similar representations might be used by both biological and machine vision systems for object recognition (Binford, 1971). Specifically, objects would be learned by decomposing them into a collection of 3D parts and then remembering that part configuration. Recognition would proceed by recovering 3D parts from an image and then matching this new configuration to those stored in object memory. One appealing element of this approach was the representational power of the primitives – called “generalized cones” (or “generalized cylinders”) by Binford. A generalized cone represents 3D shape as three sets of parameters: (1) an arbitrarily shaped cross-section that (2) can scale arbitrarily as (3) it is swept along an arbitrarily shaped axis. These three parameter sets are typically defined by algebraic functions that together capture the shape of the object part.

Marr and Nishihara built on this concept in their seminal 1978 theory of recognition. In many ways they proposed the first viable account of human object recognition, presenting a model that seemed to address the factors of invariance, stability, and level of access. As mentioned previously, they placed a significant computational burden on the reconstruction of the 3D scene. In particular, in their model a necessary step in the recognition process is recovering 3D parts from the input image. Recognizing the power of generalized cones, Marr and Nishihara suggested that observers use information about an object’s bounding contour to locate its major axis of elongation. This axis can then be used as the sweeping axis for the creation of a generalized cone structural description relating individual 3D parts to one another at multiple scales (Figure 3). Invariance was accomplished by virtue of the object-centered coordinate system in which these 3D parts were parameterized. Thus, regardless of viewpoint and across most viewing conditions, the same viewpoint-invariant, 3D structural description would be recovered by identifying the appropriate image features (e.g., the bounding contour and major axes of the object), recovering a canonical set of 3D parts, and matching the resultant 3D representation to like representations in visual memory.

Figure 3. Schematic diagram of the multi-scale 3D representation of a human figure using object-centered generalized cones. Although cylinders are used for illustrative purposes, Marr and Nishihara’s (1978) model allows for the representation of much more complex shapes for each part of an object. This figure is adapted from Figure 3 of Marr and Nishihara (1978; p. 278).

Biederman’s (1987; Hummel & Biederman, 1992) model – “Recognition-By-Components” (RBC) – is quite similar to Marr and Nishihara’s theory. However, two innovations made the RBC model more plausible in the eyes of many researchers. First, RBC assumes a restricted set of volumetric primitives, dubbed “geons” (Figure 4). Second, RBC assumes that geons are recovered on the basis of highly stable non-accidental image properties, that is, shape configurations that are unlikely to have occurred purely by chance (Lowe, 1987). One example of a non-accidental property is three edges meeting at a single point (an arrow or Y junction) – it is far more likely that this image configuration is the result of an inside or outside corner of a rectangular 3D object than the chance meeting of some random disconnected edges. 
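
The claim that three edges are unlikely to meet at one image point by chance can be checked with a rough simulation. The setup is an assumption made purely for illustration: one endpoint from each of three unrelated edges is dropped uniformly at random in a unit square, and the endpoints count as "meeting" when every pair lies within a hypothetical tolerance of 0.02.

```python
import math
import random


def chance_of_coincidence(trials=100_000, tolerance=0.02, seed=0):
    """Estimate how often three unrelated edge endpoints coincide by chance."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        ends = [(rng.random(), rng.random()) for _ in range(3)]
        # A junction requires every pair of endpoints to be close together.
        meets = all(
            math.dist(ends[i], ends[j]) < tolerance
            for i in range(3) for j in range(i + 1, 3)
        )
        hits += meets
    return hits / trials


print(chance_of_coincidence())
```

Under these assumptions the estimated rate is very small, which is the sense in which a three-edge junction in the image is better explained by a real 3D corner than by coincidence. The exact value depends on the tolerance and on the uniform-placement assumption.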
Biederman considered the set of 36 or so 3D volumes specified by the combinations of non-accidental properties in the image: the presence of particular edge junctions or vertices, the shape of the major axes, symmetry of the cross section around these axes, and the scaling of the cross section. For example, a cylinder is specified as a curved cross section (i.e., a circle) with rotational and reflection symmetry, constant size, and a straight axis. An important point to keep in mind is that these attributes are defined qualitatively, for instance, a cross section is either straight or curved, there is no in-
