Dynamic Binding in a Neural Network for Shape Recognition

Psychological Review
1992, Vol. 99, No. 3, 480-517
Copyright 1992 by the American Psychological Association, Inc.

John E. Hummel and Irving Biederman
University of Minnesota, Twin Cities
Given a single view of an object, humans can readily recognize that object from other views that preserve the parts in the original view. Empirical evidence suggests that this capacity reflects the activation of a viewpoint-invariant structural description specifying the object's parts and the relations among them. This article presents a neural network that generates such a description. Structural description is made possible through a solution to the dynamic binding problem: Temporary conjunctions of attributes (parts and relations) are represented by synchronized oscillatory activity among independent units representing those attributes. Specifically, the model uses synchrony (a) to parse images into their constituent parts, (b) to bind together the attributes of a part, and (c) to bind the relations to the parts to which they apply. Because it conjoins independent units temporarily, dynamic binding allows tremendous economy of representation and permits the representation to reflect the attribute structure of the shapes represented.
A brief glance at Figure 1 is sufficient to determine that it depicts three views of the same object. The perceived equivalence of the object across its views evidences the fundamental capacity of human visual recognition: Object recognition is invariant with viewpoint. That is, the perceived shape of an object does not vary with position in the visual field, size, or, in general, orientation in depth. The object in Figure 1 is unfamiliar, so the ability to activate a viewpoint-invariant representation of its shape cannot depend on prior experience with it. This capacity for viewpoint invariance independent of object familiarity is so fundamental to visual shape classification that modeling visual recognition is largely a problem of accounting for it.1

This article presents a neural network model of viewpoint-invariant visual recognition. In contrast with previous models based primarily on template or feature list matching, we argue that human recognition performance reflects the activation of a viewpoint-invariant structural description specifying both the visual attributes of an object (e.g., edges, vertices, or parts) and the relations among them. For example, the simple object in Figure 1 might be represented as a vertical cone on top of a horizontal brick. Such a representation must bind (i.e., conjoin) the shape attribute cone shaped with the relational attribute on top of and the attribute brick shaped with the attribute below; otherwise, it would not specify which volume was on top of which.
which. It is also necessar y to use the same representatio n for
This research was supported by National Science Foundatio n gradu-
ate and Air Force Offic e of Scientifi c Research (AFOSR) postdoctoral
fellowship s to John E. Hummel and by AFOSR research grants (88-
0231 and 90-0274) to Irving Biederman. We are grateful to Dan Ker-
sten, Randy Fletcher, Gordon Legge, and the reviewer s for their help-
fu l comment s on an earlier draft of this article and to Peter C. Gerhard-
stein for his help on early simulations.
Irvin g Biederma n is now at the Departmen t of Psychology, Hedco
Neuroscienc e Build., Universit y of Souther n California.
Correspondenc e concernin g this article shoul d be addressed to John
E. Hummel, who is now at the Departmen t of Psychology, Universit y of
California, Franz Hall, 405 Hilgar d Avenue, Los Angeles, Californi a
cone shaped whether the cone is on top of a brick, below a
cylinder, or by itself in the image otherwis e the representatio n
woul d not specif y that it was the same shape each time. Tradi-
tional connectionist/neura l net architecture s cannot represent
structura l description s because they bind attributes by positing
separat e units for each conjunctio n (cf. Fodor & Pylyshyn,
1988). The units used to represent cone shaped would diffe r
depending on whether the cone was on top of a brick, below a
cylinder, or instantiate d in some other relation. Representin g
structura l description s in a connectionis t architectur e requires
a mechanis m for binding attribute s dynamically; that is, the
binding of attributes must be temporar y so that the same units
can be used in multipl e conjunctions. A primar y theoretica l
goal of this article is to describe how this dynami c binding can
be achieved. The remainde r of this section motivates the model
by presenting the aforementione d argument s in greater detail.
Approaches to Visual Recognition:
Why Structural Description?
Prima facie, it is not obvious that successful shape classification requires explicit and independent representation of shape attributes and relations. Indeed, most models of visual recognition are based either on template matching or feature list matching, and neither of these approaches explicitly represents relations. We critically evaluate each in turn. Some of the shortcomings of template matching and feature list matching models were described a quarter of a century ago (Neisser, 1967) but still have relevance to recent modeling efforts. We conclude from this critical review that template and feature models suffer from the same shortcoming: They trade off the capacity to represent attribute structures with the capacity to represent relations.

1 Though the perceived shape is invariant, we are of course aware of the differences in size, orientation, and position among the entities in Figure 1. Those attributes may be processed by a system subserving motor interaction rather than recognition (Biederman & Cooper,

Figure 1. This object is readily detectable as constant across the three views despite its being unfamiliar.
Template Matching
Template matching models operate by comparing an incom-
ing image (perhaps after filtering, noise reduction, etc.) against
a template, a representation of a specific view of an object. To
achieve viewpoint invariant recognition, a template model
must either (a) store a very large number of views (templates) for
each known object or (b) store a small number of views (or 3-D
models) for each object and match them against incoming
images by means of transformations such as translation, scal-
ing, and rotation. Recognition succeeds when a template is
found that fits the image within a tolerable range of error. Template models have three main properties: (a) The fit between a stimulus and a template is based on the extent to which active and inactive points2 in the image spatially correspond to active and inactive points in the template; (b) a template is used to represent an entire view of an object, with spatial relations among points in the object coded implicitly as spatial relations among points in the template; and (c) viewpoint invariance is a function of object classification. Different views of an object are seen as equivalent when they access the same object label (i.e., either by matching the same template or by matching different templates with pointers to the same object label).
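Properties (a) and the transformation-based route to invariance can be sketched in a few lines of toy code (binary grids and a translation search; all names are illustrative and not drawn from any published model): fit is the fraction of corresponding active and inactive points, and alignment searches over translations of the template.

```python
def match_score(image, template):
    """Fraction of corresponding points (active and inactive) that agree."""
    hits = sum(i == t for row_i, row_t in zip(image, template)
                      for i, t in zip(row_i, row_t))
    return hits / (len(image) * len(image[0]))

def shift(grid, dy, dx):
    """Translate a binary grid, filling vacated points with 0 (inactive)."""
    n, m = len(grid), len(grid[0])
    out = [[0] * m for _ in range(n)]
    for y in range(n):
        for x in range(m):
            ny, nx = y + dy, x + dx
            if 0 <= ny < n and 0 <= nx < m:
                out[ny][nx] = grid[y][x]
    return out

def best_fit(image, template, max_shift=2):
    """Try translations of the template; recognition succeeds when some
    alignment fits the image within a tolerable range of error."""
    return max(match_score(image, shift(template, dy, dx))
               for dy in range(-max_shift, max_shift + 1)
               for dx in range(-max_shift, max_shift + 1))

# A toy 5 x 5 view and the same shape translated down by one point.
template = [[0, 0, 0, 0, 0],
            [0, 1, 1, 1, 0],
            [0, 1, 1, 1, 0],
            [0, 1, 1, 1, 0],
            [0, 0, 0, 0, 0]]
image = shift(template, 1, 0)

print(best_fit(image, template))     # perfect fit once aligned: 1.0
print(match_score(image, template))  # lower without alignment
```

Note how the relations among points never appear explicitly: they are carried implicitly by the grid positions, which is exactly the property criticized below.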
Template matching currently enjoys considerable popularity
in computer vision (e.g., Edelman & Poggio, 1990; Lowe, 1987;
Ullman, 1989). These models can recognize objects under dif-
ferent viewpoints, and some can even find objects in cluttered
scenes. However, they evidence severe shortcomings as models
of human recognition. First, recall that template matching succeeds or fails according to the number of points that mismatch between an image and a stored template and therefore predicts little effect of where the mismatch occurs. (An exception to this generalization can be found in alignment models, such as those of Ullman, 1989, and Lowe, 1987. If enough critical alignment points are deleted from an image—in Ullman's model, three are required to compute an alignment—the alignment process will fail. However, provided alignment can be achieved, the degree of match is proportional to the number of corresponding points.) Counter to this prediction, Biederman (1987b) showed that human recognition performance varied greatly depending on the locus of contour deletion. Similarly, Biederman and Cooper (1991b) showed that an image formed by removing half the vertices and edges from each of an object's parts (Figure 2a) visually primed3 its complement (the image formed from the deleted contour, Figure 2b) as much as it primed itself, but an image formed by deleting half of the parts (Figure 2c) did not visually prime its complement at all (Figure 2d). Thus, visual priming was predicted by the number of parts shared by two images, not by the amount of contour.
Second, template models handle image transformations such
as translation, rotation, expansion, and compression by apply-
ing an equivalent transformation to the template. Presumably,
such transformations would require time. However, Biederman
and Cooper (1992) showed no loss in the magnitude of visual
priming for naming object images that differed in position,
size, or mirror image reflection from their original presenta-
tion. Similarly, Gerhardstein and Biederman (1991) showed no
loss in visual priming for objects rotated in depth (up to parts
A more critical difficulty arises from the template's representing an object in terms of a complete view. Global image transformations, albeit time consuming, are not catastrophic, but any transformation applied only to a part of the object, as when an object is missing a part or has an irrelevant part added, could readily result in complete classification failure. Yet only modest decrements, if any, are observed in recognition performance with such images (Biederman, 1987a). Biederman and Hilton (1991) found that visual priming was only modestly affected by changes of 20% to 40% in the aspect ratio of one part of two-part objects. However, qualitative changes (e.g., from a cylinder to a brick) that preserved aspect ratio resulted in priming decrements that were at least as great as those produced by the aspect ratio changes. In essence, the core difficulty for template models is that they do not make explicit the information that is critical to the representation of shape by humans. For example, which of the two objects in the bottom of Figure 3 is more similar to the standard in the top of the figure? Most people would judge Object A to be more similar because the cone on Object B is rounded rather than pointed. However, a template model would select B as more similar because the brick on the bottom of A is slightly flatter than the bricks of the other two objects. As a result, the best fit between A and the standard mismatches along the entire bottom edge. The number of mismatching pixels between A and the standard is 304, compared with only 70 pixels' mismatch for B.
2 Point is used here to refer to any simple, localized image element, including a raw image intensity value, zero-crossing, or local contour

3 Visual (rather than name or concept) priming was defined as the advantage enjoyed by an identical image over an object with the same name but a different shape. The task required that subjects rapidly name briefly presented masked images.

Figure 2. Examples of contour-deleted images used in the Biederman and Cooper (1991b) priming experiments. (a: An image of a piano as it might appear in the first block of the experiment in which half the image contour was removed from each of the piano's parts by removing every other edge and vertex. Long edges were treated as two segments. b: The complement to Panel a formed from the contours removed from Panel a. c: An image of a piano as it might appear in the first block of trials in the experiment in which half the parts were deleted from the piano. d: The complement to Panel c formed from the parts deleted from Panel c. From "Priming Contour-Deleted Images: Evidence for Intermediate Representations in Visual Object Recognition" by I. Biederman & E. E. Cooper, 1991, Cognitive Psychology, 23, Figures 1 and 4, pp. 397, 403. Copyright 1991 by Academic Press. Adapted by permission.)
Another difficulty with templates—one that has been particularly influential in the computer vision community—is their capacity to scale with large numbers of objects. In particular, the critical alignment features used to select candidate templates in Ullman's (1989) and Lowe's (1987) models may be difficult to obtain when large numbers of templates must be stored and differentiated. Finally, a fundamental difficulty with template matching as a theory of human recognition is that a template model must have a stored representation of an object before it can classify two views of it as equivalent. As such, template models cannot account for viewpoint invariance in unfamiliar objects. Pinker (1986) presented several other problems with template models.

The difficulties with template models result largely from their insensitivity to attribute structures. Because the template's unit of analysis is a complete view of an object, it fails to explicitly represent the individual attributes that define the object's shape. The visual similarity between different images is lost in the classification process. An alternative approach to recognition, motivated largely by this shortcoming, is feature matching.
Feature Matching
Feature matching models extract diagnostic features from an image and use those as the basis for recognition (e.g., Hinton, 1981; Hummel, Biederman, Gerhardstein, & Hilton, 1988; Lindsay & Norman, 1977; Selfridge, 1959; Selfridge & Neisser, 1960). These models differ from template models on each of the three main properties listed previously: (a) Rather than operating on pixels, feature matching operates on higher order image attributes such as parts, surfaces, contours, or vertices; (b) rather than classifying images all at once, feature models describe images in terms of multiple independent attributes; and (c) visual equivalence consists of deriving an equivalent set of visual features, not necessarily accessing the same final object representation. Feature matching is not subject to the same shortcomings as template matching. Feature matching models can differentially weight different classes of image features, for example, by giving more weight to vertices than contour midsegments; they can deal with transformations at the level of parts of an image; and they can respond in an equivalent manner to different views of novel objects.
However, viewpoint-invariant feature matching is prone to serious difficulties. Consider a feature matching model in which shapes are represented as collections of vertices. To achieve just simple translational invariance (thus ignoring the additional complications associated with scale and orientation invariance), each vertex must, by definition, be represented independently of its location in the visual field. However, this location independence makes the representation insensitive to the spatial configuration of the features. Consequently, any configuration of the appropriate features will produce recognition for the object, as illustrated in Figure 4. This problem is so obvious that it seems as if it must be a straw man: With the right set of features, the problem will surely just go away. It is tempting to think that the relations could be captured by extracting more complex features such as pairs of vertices in particular relations or perhaps triads of vertices (e.g., Figure 5). Indeed, as the features extracted become more complex, it becomes more difficult to find images that will fool the system. However, until the features become templates for whole objects, there will always be some disordered arrangement of them that will produce illusory recognition. Furthermore, this difficulty is not limited to models based on two-dimensional (2-D) features such as vertices. As depicted in Figure 6, 3-D features can also be reconfigured to produce illusory recognition.

Figure 3. Which object in the lower part of the figure looks more like the standard object in the top of the figure? (A template model would likely judge Panel b as more similar to the standard than Panel a. People do not.)

Figure 4. Illustration of illusory recognition with translationally invariant feature list matching on the basis of vertices. (Both images contain the same vertices and will therefore produce recognition for the object.)
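A minimal sketch of this failure mode (the vertex labels are hypothetical): a translation-invariant feature list reduces an image to a multiset of features, so any rearrangement of the same features matches.

```python
from collections import Counter

# Hypothetical vertex labels "extracted" from two images: a coherent
# object and a scrambled arrangement of the same features (cf. Figure 4).
object_features    = ["curved-L", "arrow", "fork", "curved-L", "arrow"]
scrambled_features = ["arrow", "curved-L", "fork", "arrow", "curved-L"]

def invariant_match(features_a, features_b):
    """Location-independent feature list matching: two images are judged
    equivalent whenever they contain the same features, in any arrangement."""
    return Counter(features_a) == Counter(features_b)

print(invariant_match(object_features, scrambled_features))  # True: illusory recognition
```

Replacing single vertices with vertex pairs or triads only enlarges the labels in the multiset; the comparison itself remains blind to arrangement, which is the point of the paragraph above.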
Figure 5. Illustration of illusory recognition, with translationally invariant feature list matching on the basis of feature complexes composed of two or more vertices in particular relations to one another. (Both images contain the critical vertex complexes and will produce recognition for the object.)
Feature and template models both belong to a class of models that treat spatial relations implicitly by coding for precise metric relationships among the points in an object's image. These models trade off the capacity to represent relations with the capacity to represent the individual attributes belonging to those relations. Models that code for small, simple features (feature models) ignore most of the relations among the attributes of an object. Such models will evidence sensitivity to attribute structures but will also be prone to illusory recognition (i.e., they will recognize objects given disordered arrangements of their features). Those that code for whole objects implicitly capture the relations among attributes as relations among points in the features (templates). Such models will avoid illusory recognition but will be insensitive to attribute structures. Models in between these extremes will simply suffer a combination of these difficulties. To escape this continuum, it is necessary to represent the relations among an object's attributes explicitly.
Structural Description
A structural description (Winston, 1975) explicitly represents objects as configurations of attributes (typically parts) in specified relations to one another. For example, a structural description of the object in Figure 1 might consist of vertical cone on top of horizontal brick. Structural description models avoid the pitfalls of the implicit relations continuum: They are sensitive to attribute structures, but because they represent the relations among those attributes, they are not prone to illusory recognition. These models can also provide a natural account of the fundamental characteristics of human recognition. Provided the elements of the structural description are invariant with viewpoint, recognition on the basis of that description will also be invariant with viewpoint. Provided the description can be activated bottom up on the basis of information in the 2-D image, the luxury of viewpoint invariance will be enjoyed for unfamiliar objects as well as familiar ones.

Figure 6. Illustration of illusory recognition with feature list matching on the basis of 3-D features. (Both images contain the critical 3-D features and will produce recognition for the object.)
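The contrast between a feature list and a structural description can be made concrete with a toy comparison; the part and relation labels below are illustrative, not the article's notation.

```python
# Two hypothetical two-part objects: the same parts in different relations
# (cf. the vertical-cone-on-horizontal-brick example of Figure 1).
standard = {"parts": frozenset({"vertical cone", "horizontal brick"}),
            "relations": frozenset({("vertical cone", "on top of", "horizontal brick")})}
swapped  = {"parts": frozenset({"vertical cone", "horizontal brick"}),
            "relations": frozenset({("horizontal brick", "on top of", "vertical cone")})}

def feature_list_match(a, b):
    """Relations ignored: only the collection of parts is compared."""
    return a["parts"] == b["parts"]

def structural_match(a, b):
    """Parts AND their explicit relations must both correspond."""
    return a["parts"] == b["parts"] and a["relations"] == b["relations"]

print(feature_list_match(standard, swapped))  # True: illusory equivalence
print(structural_match(standard, swapped))    # False: the relation distinguishes them
```

Because the parts are still coded independently of the relations, the same "vertical cone" entry serves in either configuration, preserving sensitivity to attribute structure.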
Proposing that object recognition operates on the basis of viewpoint-invariant structural descriptions allows us to understand phenomena that extant template and feature theories cannot handle. However, it is necessary to address how the visual system could possibly derive a structural description in the first place. This article presents an explicit theory, implemented as a connectionist network, of the processes and representations implicated in the derivation of viewpoint-invariant structural descriptions for object recognition. The point of departure for this effort is Biederman's (1987b) theory of recognition by components (RBC). RBC states that objects are recognized as configurations of simple, primitive volumes called geons in specified relations with one another. Geons are recognized from 2-D viewpoint-invariant4 properties of image contours such as whether a contour is straight or curved, whether a pair of contours is parallel or nonparallel, and what type of vertex is produced where contours coterminate. The geons derived from these 2-D contrasts are defined in terms of viewpoint-invariant 3-D contrasts such as whether the cross section is straight (like that of a brick or a pyramid) or curved (like that of a cylinder or a cone) and whether the sides are parallel (like those of a brick) or nonparallel (like those of a wedge).
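As a schematic of the RBC contrasts just listed (a sketch of the idea, not the model's implementation), geon classes can be read off a lookup over the two binary contrasts; where the text lists two shapes per contrast pair, a representative label is used.

```python
# Viewpoint-invariant 3-D contrasts -> geon class, following the examples
# in the text (labels are representative, not an exhaustive geon catalog).
GEONS = {
    ("straight", "parallel"):    "brick",
    ("straight", "nonparallel"): "wedge",     # or pyramid
    ("curved",   "parallel"):    "cylinder",
    ("curved",   "nonparallel"): "cone",
}

def classify_geon(cross_section, sides):
    """Classify a geon from its cross-section shape and side parallelism."""
    return GEONS[(cross_section, sides)]

print(classify_geon("curved", "nonparallel"))  # cone
print(classify_geon("straight", "parallel"))   # brick
```

The payoff of the scheme is that both inputs survive rotation in depth (in all but accidental views), so the classification inherits their viewpoint invariance.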
Whereas a template representation necessarily commits the theorist to a coordinate space, 2-D or 3-D, the structural description theory proposed here does not. This is important to note because structural descriptions have been criticized for assuming a metrically specified representation in a 3-D object-centered coordinate space. Failing to find evidence for 3-D invariant recognition for some unfamiliar objects, such as crumpled paper (Rock & DiVita, 1987), bent wires (Edelman & Weinshall, 1991; Rock & DiVita, 1987), and rectilinear arrangements of bricks (Tarr & Pinker, 1989), some researchers have rejected structural description theories in favor of representations based on multiple 2-D views. But the wholesale rejection of structural descriptions on the basis of such data is unwarranted. RBC (and the implementation proposed here) predicts these results. Because a collection of similarly bent wires, for example, tends not to activate distinctive viewpoint-invariant structural descriptions that allow one wire to be distinguished from the other wires in the experiment, such objects should be difficult to recognize from an arbitrary viewpoint.
Connectionist Representation and Structural
Description: The Binding Problem

Given a line drawing of an object, the goal of the current model is to activate a viewpoint-invariant structural description5 of the object and then use that description as the basis for recognition. Structural descriptions are particularly challenging to represent in connectionist architectures because of the binding problem. Binding refers to the representation of attribute conjunctions (Feldman, 1982; Feldman & Ballard, 1982; Hinton, McClelland, & Rumelhart, 1986; Sejnowski, 1986; Smolensky, 1987; von der Malsburg, 1981). Recall that a structural description of the nonsense object in Figure 1 must specify that vertical and on top of are bound with cone shaped, whereas horizontal and below are bound with brick shaped. This problem is directly analogous to the problem encountered by the feature matching models: Given that some set of attributes is present in the system's input, how can it represent whether they are in the proper configuration to define a given object?

The predominant class of solutions to this problem is conjunctive coding (Hinton et al., 1986) and its relatives, such as tensor product coding (Smolensky, 1987) and the interunits of Feldman's (1982) dynamic connections network. A conjunctive representation anticipates attribute conjunctions and allocates a separate unit or pattern for each. For example, a conjunctive representation for the previously mentioned shape attributes would posit one pattern for vertical cone on top of and completely separate patterns for horizontal cone on top of, horizontal cone below, and so forth. A fully local conjunctive representation (i.e., one in which each conjunction is coded by a single unit) can provide an unambiguous solution to the binding problem: On the basis of which units are active, it is possible to know exactly how the attributes are conjoined in the system's input. Vertical cone on top of horizontal brick and vertical brick on top of horizontal cone would activate nonoverlapping sets of units. However, local conjunctive representations suffer a number of shortcomings.

The most apparent difficulty with local conjunctive representations is the number of units they require: The size of such a representation grows exponentially with the number of attribute dimensions to be represented (e.g., 10 dimensions, each with 5 values, would require 5^10 units). As such, the number of units required to represent the universe of possible conjunctions can be prohibitive for complex representations. Moreover, the units have to be specified in advance, and most of them will go unused most of the time. However, from the perspective of representing structural descriptions, the most serious problem with local conjunctive representations is their insensitivity to attribute structures. Like a template, a local conjunctive unit responds to a specific conjunction of attributes in an all-or-none fashion, with different conjunctions activating different units. The similarity in shape between, say, a cone that is on top of something and a cone that is beneath something is lost in such a representation.

These difficulties with local conjunctive representations will not come as a surprise to many readers. In defense of conjunctive coding, one may reasonably protest that a fully local conjunctive representation represents the worst case, both in terms of the number of units required and in terms of the loss of attribute structures. Both these problems can be attenuated considerably by using coarse coding and other techniques for distributing the representation of an entity over several units (Feldman, 1982; Hinton et al., 1986; Smolensky, 1987). However, in the absence of a technique for representing attribute bindings, distributed representations are subject to cross talk (i.e., mutual interference among independent patterns), the likelihood of which increases with the extent of distribution. Specifically, when multiple entities are represented simultaneously, the likelihood that the units representing one entity will be confused with the units representing another grows with the proportion of the network used to represent each entity. Von der Malsburg (1987) referred to this familiar manifestation of the binding problem as the superposition catastrophe. Thus, the costs of a conjunctive representation can be reduced by using a distributed representation, but without alternative provisions for binding the benefits are also reduced.

4 The term viewpoint invariant as used here refers to the tendency of the image feature's classification to remain unchanged under changes in viewpoint. For instance, the degree of curvature associated with the image of the rim of a cup will change as the cup is rotated in depth relative to the viewer. However, the fact that the edge is curved rather than straight will remain unchanged in all but a few accidental views of the edge.

5 Henceforth, it is assumed that the elements of the structural description are geons and their relations.
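The combinatorics behind the exponential cost of a fully local conjunctive code (10 dimensions with 5 values each) are easy to check directly; the function names below are illustrative.

```python
def conjunctive_units(n_dimensions, values_per_dimension):
    """Fully local conjunctive code: one unit per conjunction of values
    across all attribute dimensions."""
    return values_per_dimension ** n_dimensions

def independent_units(n_dimensions, values_per_dimension):
    """Independent code: one unit per value per dimension; viable only if
    bindings can be carried some other way (e.g., dynamically)."""
    return n_dimensions * values_per_dimension

print(conjunctive_units(10, 5))  # 5**10 = 9,765,625 units
print(independent_units(10, 5))  # 50 units
```

The gap between the two counts is the "tremendous economy of representation" that dynamic binding is meant to buy: the small independent code becomes usable once temporary bindings supply the conjunctions.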
This tradeoff between unambiguous binding and distributed representation is the connectionist's version of the implicit relations continuum. In this case, however, the relations that are coded implicitly (i.e., in the responses of conjunctive units) are binding relations. Escaping this continuum requires a way to represent bindings dynamically. That is, we need a way to temporarily and explicitly bind independent units when the attributes for which they code occur in conjunction.
Recently, there has been considerable interest in the use of temporal synchrony as a potential solution to this problem. The basic idea, proposed as early as 1974 by Peter Milner (Milner, 1974), is that conjunctions of attributes can be represented by synchronizing the outputs of the units (or cells) representing those attributes. To represent that Attribute A is bound to Attribute B and Attribute C to Attribute D, the cells for A and B fire in synchrony, the cells for C and D fire in synchrony, and the AB set fires out of synchrony with the CD set. This suggestion has since been presented more formally by von der Malsburg (1981, 1987) and many others (Abeles, 1982; Atiya & Baldi, 1989; Baldi & Meir, 1990; Crick, 1984; Crick & Koch, 1990; Eckhorn et al., 1988; Eckhorn, Reitboeck, Arndt, & Dicke, 1990; Gray et al., 1989; Gray & Singer, 1989; Grossberg & Somers, 1991; Hummel & Biederman, 1990a; Shastri & Ajjanagadde, 1990; Strong & Whitehead, 1989; Wang, Buhmann, & von der Malsburg, 1991).
Dynamic binding has particularly important implications for the task of structural description because it makes it possible to bind the elements of a distributed representation. What is critical about the use of synchrony in this capacity is that it provides a degree of freedom whereby multiple independent cells can specify that the attributes to which they respond are currently bound. In principle, any variable could serve this purpose (e.g., Lange & Dyer, 1989, used signature activations), so the use of synchrony, per se, is not theoretically critical. Temporal synchrony simply seems a natural choice because it is easy to establish, easy to exploit, and neurologically plausible.
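As a rough illustration of the idea (a minimal sketch, not JIM's actual mechanism), give each active unit a firing phase, treat two units as bound when their phases nearly coincide, and read the parse out as groups of mutually synchronized units. The unit names and the tolerance are invented for the example.

```python
# Each active unit carries a firing phase; attributes are bound iff their
# units fire in (near-)synchrony. Two phase groups = two bound entities.
phases = {
    "cone-shaped":  0.00, "vertical":   0.01, "on-top-of": 0.02,
    "brick-shaped": 0.50, "horizontal": 0.51, "below":     0.49,
}

def bound(unit_a, unit_b, tolerance=0.05):
    """Bound when the two units' phases coincide within tolerance."""
    return abs(phases[unit_a] - phases[unit_b]) < tolerance

print(bound("cone-shaped", "on-top-of"))   # True: same phase group
print(bound("cone-shaped", "horizontal"))  # False: out of synchrony

# Recover the parse: greedily collect groups of mutually synchronized units.
groups = []
for unit in phases:
    for g in groups:
        if all(bound(unit, other) for other in g):
            g.append(unit)
            break
    else:
        groups.append([unit])
print(groups)  # two groups: the cone's attributes and the brick's
```

Note that the same "cone-shaped" unit can join any phase group on another occasion, which is exactly what a static conjunctive unit cannot do.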
This article presents an original proposal for exploiting synchrony to bind shape and relation attributes into structural descriptions. Specialized connections in the model's first two layers parse images into geons by synchronizing the oscillatory outputs of cells representing local image features (edges and vertices): Cells oscillate in phase if they represent features of the same geon and out of phase if they represent features of separate geons. These phase relations are preserved throughout the network and bind cells representing the attributes of geons and the relations among them. The bound attributes and relations constitute a simple viewpoint-invariant structural description of an object. The model's highest layers use this description as a basis for object recognition.
The model described here is broad in scope, and we have made only modest attempts to optimize any of its individual components. Rather, the primary theoretical statement concerns the general nature of the representations and processes implicated in the activation of viewpoint-invariant structural descriptions for real-time object recognition. We have designed these representations and processes to have a transparent neural analogy, but we make no claims as to their strict neural
A Neural Net Model of Shape Recognition
The model (JIM; John and Irv's model) is a seven-layer connectionist network that takes as input a representation of a line drawing of an object (specifying discontinuities in surface orientation and depth) and, as output, activates a unit representing the identity of the object. The model achieves viewpoint invariance in that the same output unit will respond to an object regardless of where its image appears in the visual field, the size of the image, and the orientation in depth from which the object is depicted. An overview of the model's architecture is shown in Figure 7. JIM's first layer (L1) is a mosaic of orientation-tuned cells with overlapping receptive fields. The second layer (L2) is a mosaic of cells that respond to vertices, 2-D axes of symmetry, and oriented, elongated blobs of activity. Cells in L1 and L2 group themselves into sets describing geons by synchronizing oscillations in their outputs.
Cells in L3 respond to attributes of complete geons, each cell representing a single value on a single dimension over which the geons can vary. For example, the shape of a geon's major axis (straight or curved) is represented in one bank of cells, and the geon's location is represented in another, thereby allowing the representation of the geon's axis to remain unchanged when the geon is moved in the visual field. The fourth and fifth layers (L4 and L5) determine the relations among the geons in an image. The L4-L5 module receives input from L3 cells representing the metric properties of geons (location in the visual field, size, and orientation). Once active, units representing relations are bound to the geons they describe by the same phase locking that binds image features together for geon recognition. The outputs of L3 and L5 together constitute a structural description of an object in terms of its geons and their relations. This representation is invariant with scale, translation, and orientation in depth.
The model's sixth layer (L6) takes its input from both the third and fifth layers. On a single time slice (ts), the input to L6 is a pattern of activation describing one geon and its relations to the other geons in the image (a geon feature assembly). Each cell in L6 responds to a particular geon feature assembly. The cells
[Figure 7 appears here. The example image (see the figure's Key) is a nuclear power plant, represented as a slightly elongated, vertical cone that is above, perpendicular to, and smaller than a horizontal brick.]
Figure 7. An overview of JIM's architecture indicating the representation activated at each layer by the image in the Key. (In Layers 3 and above, large circles indicate cells activated in response to the image, and dots indicate inactive cells. Cells in Layer 1 represent the edges (specifying discontinuities in surface orientation and depth) in an object's image. Layer 2 represents the vertices, axes, and blobs defined by conjunctions of edges in Layer 1. Layer 3 represents the geons in an image in terms of their defining dimensions: Axis shape (Axis), straight (s) or curved (c); Cross-section shape (X-Scn), straight or curved; whether the sides are parallel (p) or nonparallel (n); Coarse orientation (Orn.), vertical (v), diagonal (d), or horizontal (h); Aspect ratio, elongated (long) to flattened (flat); Fine orientation (Orientation), vertical, two different diagonals, and four different horizontals; Horizontal position (Horiz. Pos.) in the visual field, left (l) to right (r); Vertical position in the visual field, bottom (b) to top (t); and Size, small (near 0% of the visual field) to large (near 100% of the visual field). Layers 4 and 5 represent the relative orientations, locations, and sizes of the geons in an image. Cells in Layer 6 respond to specific conjunctions of cells activated in Layers 3 and 5, and cells in Layer 7 respond to complete objects, defined as conjunctions of cells in Layer 6.)
[Figure 8 appears here. Its annotations note that the model's first layer is divided into 22 × 22 locations and that at each location there are 48 cells.]
Figure 8. Detail of the model's first layer. (Image edges are represented in terms of their location in the visual field, orientation, curvature, and whether they terminate within the cell's receptive field or pass through it.)
in L7 sum the outputs of the L6 cells over time, combining two or more assemblies into a representation of a complete object.
Layer 1: Representation of a Line Drawing
The model's input layer (L1) is a mosaic of orientation-tuned cells with overlapping receptive fields (Figures 8 and 9). At each of 484 (22 × 22) locations,6 there are 48 cells that respond to image edges in terms of their orientation, curvature (straight vs. curved), and whether the edge terminates within the cell's receptive field (termination cells) or passes through it (segment cells). The receptive field of an L1 cell is thus defined over five dimensions: x, y, orientation, straight versus curved, and termination versus segment. The net input to L1 cell i (N_i) is calculated as the sum (over all image edges j) of products (over all dimensions k):

N_i = Σ_j Π_k (1 − |E_jk − C_ik|/W_k)^+, (1)
where E_jk is the value of edge j on dimension k (e.g., the value 1.67 radians on the dimension orientation), C_ik is cell i's preferred value on dimension k (e.g., 1.85 radians), and W_k is a parameter specifying the width of a cell's tuning function in dimension k. The superscripted + in Equation 1 indicates truncation below zero; L1 cells receive no negative inputs. The value of edge j on the dimensions x and y (location) is determined separately for each cell i as the point on j closest to C_ix, C_iy. Image segments are coded coarsely with respect to location: W_x and W_y are equal to the distance between adjacent clusters for segment cells (i.e., if and only if cells i and j are in adjacent clusters, then |C_ix − C_jx| = W_x). Within a cluster, segment cells are inhibited by termination cells of the same orientation. To reduce the calculations and data structures required by the computer simulation, edge orientation is coded discretely (i.e., one cell per cluster codes the orientation of a given image edge), and for terminations, location is also coded discretely (a given termination represented by activity in only one L1 cell). The activation of cell i (A_i) is computed as the Weber function of its net input:

A_i = N_i/(1 + N_i). (2)
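As a concrete illustration, Equations 1 and 2 can be sketched as follows. This is a minimal Python sketch, not the authors' code; the function names and the dictionary-based encoding of edges and preferences are ours.

```python
# Sketch of Layer 1 tuning (Equations 1 and 2), not the published code.
# Net input: sum over image edges of products over dimensions of
# truncated linear tuning terms; activation: Weber function of net input.

def net_input(edges, preferred, widths):
    """edges: list of dicts mapping dimension k -> value E_jk for edge j.
    preferred: dict of the cell's preferred values C_ik per dimension.
    widths: dict of tuning widths W_k per dimension."""
    total = 0.0
    for edge in edges:
        product = 1.0
        for k, w in widths.items():
            term = 1.0 - abs(edge[k] - preferred[k]) / w
            product *= max(term, 0.0)   # superscripted +: truncate below zero
        total += product
    return total

def activation(n):
    """Weber function: A_i = N_i / (1 + N_i)."""
    return n / (1.0 + n)
```

An edge exactly at the cell's preferred value contributes a net input of 1.0, giving an activation of 0.5; an edge farther from the preferred value than the tuning width W_k contributes nothing.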
Layer 2: Vertices, Two-Dimensional Axes, and Blobs
The model's second layer (L2) is a mosaic of cells that respond to vertices, 2-D axes of parallel and nonparallel symmetry, and elongated, oriented blobs at all locations in the visual field.
At each location in the visual field, there is one cell for every possible two- and three-pronged vertex. These include Ls, arrows, forks, and tangent Ys (see Biederman, 1987b; Malik, 1987) at all possible orientations (Figure 9a). In addition, there are cells that respond to oriented lone terminations, endpoints of edges that do not coterminate with other edges, such as the stem of a T vertex. Vertex cells at a given location receive input only from cells in the corresponding location of the first layer. They receive excitatory input from consistent L1 termination cells (i.e., cells representing terminations with the same orientation and curvature as any of the vertex's legs) and strong inhibition from segment cells and inconsistent termination cells (Figure 9b). Each L2 lone termination cell receives excitation from the corresponding L1 termination cell, strong inhibition from all other L1 terminations at the same location, and neither excitation nor inhibition from segment cells. The strong inhibition from L1 cells to inconsistent L2 cells ensures that (a) only one vertex cell will ever become active at a given location and (b) no vertex cells will become active in response to vertices with more than three prongs.
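The excitatory and inhibitory wiring just described can be caricatured as follows. This is an illustrative Python sketch, not the published implementation; the tuple encoding of cells and the inhibition weight of 10.0 are our assumptions (the article specifies only that the inhibition is "strong").

```python
# Sketch of the L1-to-L2 vertex wiring: consistent termination cells
# excite a vertex cell; segment cells and inconsistent termination cells
# strongly inhibit it. The weight 10.0 is our stand-in for "strong."

INHIBITION = 10.0

def vertex_net_input(active_cells, consistent_terminations):
    """active_cells: set of (kind, orientation, curvature) tuples active at
    one location, kind being 'termination' or 'segment'.
    consistent_terminations: (orientation, curvature) pairs matching the
    vertex's legs."""
    net = 0.0
    for kind, orientation, curvature in active_cells:
        if kind == 'segment':
            net -= INHIBITION            # any segment cell inhibits
        elif (orientation, curvature) in consistent_terminations:
            net += 1.0                   # a consistent termination excites
        else:
            net -= INHIBITION            # an inconsistent termination inhibits
    return net
```

With two or three consistent terminations the net input is positive; a single inconsistent termination (e.g., a fourth prong) drives it strongly negative, so no vertex cell responds to vertices with more than three prongs.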
Axes and Blobs
The model also posits arrays of axis-sensitive cells and blob-sensitive cells in L2. The axis cells represent 2-D axes of parallelism (straight and curved) and nonparallel symmetry (straight and curved). However, the connections between these cells and the edge cells of L1 have not been implemented. Computing axes of symmetry is a difficult problem (cf. Brady, 1983; Brady & Asada, 1984; Mohan, 1989) the solution of which we are admittedly assuming. Currently, the model is given, as part of its input, a representation of the 2-D axes in an image. Similarly, cells sensitive to elongated, oriented regions of activity (blobs) are posited in the model's second layer but have not been implemented. Instead, blobs are computed directly by a simple region-filling algorithm.7 These computations yield information about a geon's location, computed as the blob's central point, its size, and its elongation. Elongation is computed as the square of the blob's perimeter divided by its area (elongation is an inverse function of Ullman's, 1989, compactness measure).

6 To reduce the computer resources required by the implementation, a square lattice (rather than a more realistic hexagonal lattice) was used in the simulations reported here. However, the use of a square lattice is not critical to the model's performance.

(a) Classes of vertices detected by the model's second layer: Ls (two legs, either straight or curved); forks (three legs, no angles greater than 180°); arrows (three legs, one angle greater than 180°); and tangent Ys (three legs, two of which, including the inner leg, are curved, with one angle greater than 180°).

(b) Detecting vertices from conjunctions of terminations: an arrow vertex with a straight leg at Pi/4, a straight leg at Pi/2, and a curved leg at 0 is excited by the edge cells consistent with its legs and inhibited by the others.

Figure 9. a: An illustration of the types of vertices detected in the model's second layer. b: An illustration of the mapping from the edge cells in a given location in Layer 1 to a vertex cell in the corresponding location of Layer 2. deg. = degree.
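The blob measures described above (central point, size, and elongation as perimeter squared over area) can be sketched as follows. This is our Python sketch; the set-of-lattice-points representation and the 4-neighbor perimeter test are our assumptions, not the article's.

```python
# Sketch of the blob measures: location as the region's central point,
# size as its area (point count), and elongation as perimeter**2 / area.
# The region is a set of active (x, y) lattice points.

def blob_measures(points):
    n = len(points)
    cx = sum(x for x, y in points) / n       # central point (centroid)
    cy = sum(y for x, y in points) / n
    # A lattice point counts toward the perimeter if any 4-neighbor
    # lies outside the region (our discretization of perimeter).
    perimeter = sum(
        1 for (x, y) in points
        if not {(x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)} <= points
    )
    return (cx, cy), n, perimeter ** 2 / n   # location, size, elongation
```

A 10 × 1 strip yields an elongation of 100/10 = 10, larger than that of a compact 3 × 3 block (64/9), consistent with elongation as an inverse measure of compactness.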
Axis and blob cells are assumed to be sensitive to the phase relations established in L1 and therefore operate on parsed images (image parsing is described in the next section). Because this assumption restricts the computation of blobs and axes to operate on one geon at a time, it allows JIM to ignore axes of symmetry that would be formed between geons.
Image Parsing: Grouping Local
Image Features Into Geons

Image parsing is a special case of the binding problem in which the task is to group features at different locations in the visual field. For example, given that many local features (edges, vertices, axes, and blobs) are active, how can the model know which belong together as attributes of the same geon (Figure 10)? As solving this problem is a prerequisite to correct geon identification, an image-parsing mechanism must yield groups that are useful for this purpose. The most commonly proposed constraint for grouping—location or proximity (Crick, 1984; Treisman & Gelade, 1980)—is insufficient in this respect. Even
7 Regions containing geons are filled by an iterative algorithm that activates a point in the visual field if (a) there is an active edge cell at that location or (b) it is surrounded by active points in X or Y. This nonconnectionist algorithm fills regions occupied by geons. These regions are then assumed to correspond to the receptive fields of blob cells in the second layer. The properties of the receptive field (such as area, perimeter, and central point) are calculated directly from the region by counting active points.
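Footnote 7's iterative fill can be sketched as follows. This is a Python sketch under our reading of "surrounded by active points in X or Y," namely flanked by active points on both sides horizontally or on both sides vertically; all names and the lattice encoding are ours.

```python
# Sketch of the nonconnectionist region-filling algorithm of Footnote 7:
# iterate until quiescence, activating a lattice point if an edge cell is
# active there or if it is flanked by active points in X or in Y.

def fill_region(edge_points, width, height):
    active = set(edge_points)
    changed = True
    while changed:
        changed = False
        for x in range(width):
            for y in range(height):
                if (x, y) in active:
                    continue
                flanked_x = (any((a, y) in active for a in range(x)) and
                             any((a, y) in active for a in range(x + 1, width)))
                flanked_y = (any((x, b) in active for b in range(y)) and
                             any((x, b) in active for b in range(y + 1, height)))
                if flanked_x or flanked_y:
                    active.add((x, y))
                    changed = True
    return active
```

Given the outline of a closed region (e.g., the border of a square), the algorithm activates every interior point; an isolated active point fills nothing.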
Figure 10. How does the brain determine that Segments a and b belong together as features of the brick, whereas c and d belong together as features of the cone? (Note that proximity alone is not a reliable cue: a is closer to both c and d than to b.)
if there could be an a priori definition of a location (what criterion do we use to decide whether two visual entities are at the same location?), such a scheme would fail when appropriate grouping is inconsistent with proximity, as with Segments a and b in Figure 10. JIM parses images into geons by synchronizing cells in L1 and L2 in a pair-wise fashion according to three simple constraints on shape perception. The mechanism for synchronizing a pair of cells is described first, followed by a discussion of the constraints exploited for grouping.
Synchronizing a Single Pair of Cells
Each edge and vertex cell i is described by four continuous state variables: activation (A_i), output (O_i), output refractory (R_i), and refractory threshold (θ_i), each varying from zero to one. A cell will generate an output if and only if it is active and its output refractory is below threshold:

if and only if A_i > 0 and R_i ≤ θ_i,
then O_i = A_i, otherwise O_i = 0. (3)

When cell i is initially activated, R_i is set to 1.0 and θ_i is set to a random value between 0 and 1. R_i decays linearly over time:

R_i(t) = R_i(t − 1) − k, (4)

where t refers to the current time slice. (A time slice is a discrete interval within which the state of the network is updated.) When its refractory reaches threshold (R_i ≤ θ_i), the cell fires (O_i = A_i), resets its refractory (R_i = 1.0), and re-randomizes its refractory threshold. An active cell in isolation will fire with a mean period of 0.5/k time slices8 and an amplitude9 of A_i.
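The firing dynamics of Equations 3 and 4 can be simulated directly. This is a minimal Python sketch, not the published implementation; the class interface is ours.

```python
import random

# Sketch of a single oscillatory cell (Equations 3 and 4): the refractory
# R decays linearly by k per time slice; when R falls to the random
# threshold theta, the cell fires (output = A), resets R to 1.0, and
# re-randomizes theta.

class Cell:
    def __init__(self, activation_level, k, rng):
        self.A = activation_level
        self.k = k
        self.rng = rng
        self.R = 1.0                  # set on initial activation
        self.theta = rng.random()     # random threshold in (0, 1)

    def step(self):
        """Advance one time slice; return the cell's output O."""
        self.R -= self.k              # linear decay (Equation 4)
        if self.A > 0 and self.R <= self.theta:   # fire (Equation 3)
            self.R = 1.0
            self.theta = self.rng.random()
            return self.A
        return 0.0
```

Because the mean threshold is 0.5 and R decays from 1.0 at rate k, the mean inter-spike interval of an isolated active cell approaches 0.5/k time slices, as stated in the text.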
Two cells can synchronize their outputs by exchanging an enabling signal over a fast enabling link (FEL). FELs are a class of fast, binary links completely independent of the standard connections that propagate excitation and inhibition: Two cells can share a FEL without sharing a standard connection and vice versa. FELs induce synchrony in the following manner: When cell j fires, it propagates not only an output along its connections but also an enabling signal along its FELs. An enabling signal is assumed to traverse a FEL within a small fraction of a time slice. When the enabling signal arrives at an active cell i, it causes i to fire immediately by pushing its refractory below threshold:

R_i(t_n) = R_i(t_0) − Σ_j FEL_ij E_j, (5)

where R_i(t_0) is the refractory of cell i at the beginning of t, R_i(t_n) is the refractory at some later period within t, FEL_ij is the value of the FEL (0 or 1) from j to i, and E_j (the enabling signal from cell j) is 1 if O_j > 0, and 0 otherwise. Note that the refractory state of a cell will go below threshold10 (causing the cell to fire) whenever at least one cell with which it shares a FEL fires. When a cell fires because of an enabling signal, it behaves just as if it had fired on its own: It sends an output down its connections, an enabling signal down its FELs, sets its refractory to 1.0, and randomizes its refractory threshold.
An enabling signal induces firing on the same time slice in which it is generated; that is, its recipient will fire in synchrony with its sender. Furthermore, the synchrony is assumed to be transitive. If Cell A shares a FEL with Cell B and B shares one with C, then C will fire when A fires provided both B and C are active. It is important to note that enabling signals have no effect on, and do not pass through, inactive cells: if B is inactive, then A's firing will have no effect on either B or C. In the current model, the FELs are assumed to have functionally infinite propagation speed, allowing two cells to fire in synchrony regardless of the number of intervening FELs and active cells. Although this assumption is clearly incorrect, it is also much stronger than the computational task of image parsing requires. In the Discussion section, we explore the implications of the temporal parameters of cells and FELs.

The property of transitivity, the absence of propagation through inactive cells, and functionally infinite propagation speed allow us to define a FEL chain: Two cells, A and B, are said to lie on the same FEL chain if at least one path can be found from A to B by traversing FELs and active cells. All cells on the same FEL chain will necessarily fire in synchrony. Cells on separate FEL chains will not necessarily fire in synchrony, but they may fire in synchrony by accident (i.e., if their respective output refractories happen to go below threshold at the same time). The possibility of accidental synchrony has important implications for the use of synchrony to perform binding. These implications are addressed in the Discussion section. However, in this section, it is assumed that if two cells lie on different FEL chains, they will not fire in synchrony.
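The chain behavior described above (transitive propagation through active cells only, within a single time slice) can be sketched as follows. This is our Python sketch; the set-based graph encoding is an assumption, not the article's data structure.

```python
# Sketch of FEL-chain synchrony: when a cell fires, its enabling signal
# spreads over fast enabling links (FELs) through active cells within the
# same time slice, so every active cell on the chain fires together.
# Enabling signals do not pass through inactive cells.

def fire_group(trigger, active, fels):
    """trigger: the cell that fires first; active: set of active cells;
    fels: set of frozenset pairs, each a FEL between two cells.
    Returns the set of cells that fire on this time slice."""
    fired = {trigger}
    frontier = [trigger]
    while frontier:
        cell = frontier.pop()
        for link in fels:
            if cell in link:
                (other,) = link - {cell}
                if other in active and other not in fired:
                    fired.add(other)   # refractory pushed below threshold
                    frontier.append(other)
    return fired
```

With FELs A-B and B-C and all three cells active, A's firing recruits B and C; if B is inactive, the signal stops and C does not fire, mirroring the example in the text.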
The Set of Fast Enabling Links
Active cells sharing FELs will fire in synchrony. Because our goal is to synchronize local features of the same geon while
8 The mean refractory threshold for a cell (θ_i) will be 0.5. Assuming that θ_i = 0.5, it will take 0.5/k time slices for R_i to decay from 1.0 to θ_i.

9 Real neurons spike with an approximately constant amplitude. A cell's firing can be thought of as a burst of spikes, with amplitude of firing proportional to the number of spikes in the burst.

10 Σ_j FEL_ij E_j ≥ 1.0 > R_i(t_0). Therefore [R_i(t_0) − Σ_j FEL_ij E_j] = R_i(t_n) ≤ 0 < θ_i.
keeping separate geons out of synchrony, a FEL should only connect two cells if the features they represent are likely to belong to the same geon. A pair of cells will share a FEL if and only if that pair satisfies one of the following three conditions:

Condition I: Local coarse coding of image contours. This condition is satisfied if both cells represent image edges of the same curvature and approximately the same orientation and have overlapping receptive fields. As depicted in Figure 11, a single image contour (or edge) will tend to activate several cells in L1. All the cells activated by a single contour will typically belong to the same geon11 and should therefore be grouped. Locally, such cells will tend to satisfy Condition I. The model groups the local pieces of a contour (i.e., groups edge cells responding to the same contour) using FELs between all pairs of L1 cells with similar orientation and curvature preferences and overlapping receptive fields (Figure 12). Note that not all cells responding to a given contour will necessarily satisfy all these criteria; for example, the receptive fields of two cells at opposite ends of a long contour might not overlap. However, by virtue of the intervening units, they will lie on the same FEL chain. Indirect phase locking using long FEL chains does not pose a problem except insofar as the propagation time for an enabling signal from A to B increases with the number of FELs to be traversed. An important issue for exploration concerns how the synchrony for such distant units will generalize with more realistic assumptions about propagation speeds for FELs.
The FELs corresponding to Condition I can be derived strictly from the statistical properties of coarsely coded image edges. If two cells A and B satisfy Condition I, the conditional probability that A will be active given that B is active [p(A|B)] should (a) be much greater than the base probability that A is active [p(A|B) > p(A)] and (b) be approximately equal to p(B|A). In the case of the present model, the only cells that satisfy both criteria are L1 edge cells with overlapping receptive fields, identical curvature preferences, and similar orientation preferences. In the general case, two cells will tend to satisfy both criteria whenever they code overlapping regions of the same attribute space. For example, if our representation of edges coded degree of curvature (rather than simply coding an edge discretely as either straight or curved), then Condition I would have to be modified by adding "and have similar curvature preferences." Because the FELs corresponding to Condition I connect cells whose activity should covary, they should be capable of self-organizing with a simple Hebbian learning rule and a stimulus set consisting of contours longer than the diameter of the cells' receptive fields.
Figure 11. A single image edge will tend to activate several cells in Layer 1. (Locally, these cells will tend to have similar orientation preferences and overlapping receptive fields.)

We ran a small simulation that corroborated this conjecture. A 12-row × 10-column hexagonal lattice of segment- and termination-sensitive cells was exposed to random triangles (Figure 13), and FELs were updated using a modified Hebbian learning rule. Edges were coded coarsely, with respect to both location and orientation. Termination cells in this model did not inhibit consistent segment cells. Because of the small size of this simulation, it was not necessary to make all the simplifying assumptions required by JIM. Specifically, this model differed from JIM in its more realistic hexagonal lattice of edge cells, its coarse coding of orientation, and the fact that terminations did not inhibit same-orientation segments within a cluster. Initially, all cells were connected by FELs of strength zero to all other cells in the same or adjacent locations (see Figure 13). The simulation was run for 200 iterations. On each iteration, a randomly generated triangle was presented to the model. Cell activations were determined by Equations 1 and 2. FELs were updated by
the Hebbian rule:
ΔFEL_ij = vA_iA_j(1 − |FEL_ij|), if A_i + A_j > 0 and A_iA_j > 0;
ΔFEL_ij = −vr(A_i + A_j)(1 − |FEL_ij|), if A_i + A_j > 0 and A_iA_j = 0;
ΔFEL_ij = 0, if A_i + A_j = 0, (6)
where v is a learning rate parameter, and r determines the rate of growth toward negative values relative to the growth toward positive values. With this learning rule, FELs from active cells to other active cells grow more positive, FELs from active cells to inactive cells grow more negative, and FELs between pairs of inactive cells do not change. By the end of the 200 iterations, strong positive FELs had developed between cells with overlapping receptive fields and identical or adjacent orientation preferences. All other potential FELs were negative or zero.
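Equation 6 can be written out as follows. This is a Python sketch; the parameter values v = 0.1 and r = 0.5 are our choices for illustration, not values reported in the article.

```python
# Sketch of the Hebbian FEL update (Equation 6). v is the learning rate;
# r scales growth toward negative values relative to positive growth.
# The (1 - |FEL|) factor keeps FEL strengths bounded in [-1, 1].

def update_fel(fel, a_i, a_j, v=0.1, r=0.5):
    if a_i + a_j == 0:                       # both cells inactive: no change
        return fel
    if a_i * a_j > 0:                        # both active: grow positive
        return fel + v * a_i * a_j * (1 - abs(fel))
    # exactly one cell active: grow negative
    return fel - v * r * (a_i + a_j) * (1 - abs(fel))
```

Repeated co-activation drives a FEL toward +1; repeatedly pairing an active cell with an inactive one drives it toward −1, matching the behavior described in the text.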
Condition II: Cotermination in an intrageon vertex. This condition is satisfied if one cell represents a termination and the other represents a consistent vertex or lone termination at the same location. Image contours that coterminate in a two- or three-pronged vertex likely belong to the same geon. The model groups contours into geons by positing FELs between termination cells in L1 and cells representing consistent two- and three-pronged vertices in L2. Recall that by Condition I, an L1 termination cell will fire in phase with the rest of the cells representing the same contour. If at least one (but not more than two) other termination cells are active at the same location (reflecting the cotermination of two or three contours in an
11 There are important exceptions to this generalization, as evident in Figure 19.
Figure 12. Fast enabling links (FELs) corresponding to Condition I (local coarse coding of image contours) connect cells in Layer 1 that have similar orientation preferences and overlapping receptive fields. (FELs are indicated by arcs in the figure.)
intrageon vertex), then a vertex cell will be activated in L2 (Figure 14). By virtue of the termination-to-vertex FELs, each L1 termination cell will fire in phase with the corresponding L2 vertex cell, and by transitivity, all the terminations and their respective contours will fire in phase with one another. In this manner, FELs corresponding to Condition II group separate contours into geons (Figure 14).
In contrast with vertices produced by cotermination, T vertices are produced where one surface occludes another, the top belonging to the occluding surface and the stem to the occluded surface. Therefore, the parts of a T vertex should not be grouped. In JIM's first layer, a T vertex is represented by an active termination cell of one orientation (the stem) and an active segment cell of a different orientation (the top). In L2, a T vertex is represented only as an active lone termination with the orientation and curvature of the T's stem (recall that L1 edge cells inhibit all L2 vertex cells except lone terminations). Lone terminations share FELs with L1 termination cells, but not with L1 segment cells. Therefore, the contours forming the top and stem of a T vertex are not synchronized, allowing geons that meet at T vertices to remain out of synchrony with one another.
Condition III: Distant collinearity through lone terminations. This condition is satisfied if both cells represent lone terminations, their orientations are complementary, and they are collinear. Although the separate parts of a T vertex (viz., stem and top) will not belong to the same geon (except in some accidental views), the stems of separate Ts may. When an edge in the world projects as two separate image contours because of occlusion, the endpoints of those contours will activate lone termination cells at the points of occlusion (Figure 15a). If the edge is straight, the lone terminations will be collinear and have complementary orientations. Here, complementary has the specific meaning that (a) the orientations differ by 180° (± some error, e) and (b) the orientation of
Figure 13. FELs were allowed to self-organize in response to images of random triangles as illustrated in the figure. (In this simulation, cells were arrayed in a hexagonal lattice with overlapping receptive fields and orientation preferences. Thick lines in the figure indicate image edges, thin lines indicate active cells [the location and orientation of a line corresponding to the preferred location and orientation of the cell, respectively], and circles indicate locations over which cell receptive fields were centered. The degree of overlap between cell receptive fields is shown in the upper right corner of the figure. L = layer.)
Figure 14. If two or more image contours coterminate at a point, multiple termination cells (one for the end of each contour) will be activated in the corresponding location in Layer (L) 1. (As long as fewer than three L1 termination cells are active, a vertex cell will become active at the same location in L2. L2 vertex cells share fast enabling links [FELs; indicated by arcs] with all corresponding L1 termination cells.)
a vector extending from either termination point to the other differs by 180° ± e from the direction in which the contour extends from that point (Figure 15b). The model groups the separate parts of an occluded edge using FELs between all pairs of distant lone termination cells with collinear receptive fields and complementary orientation preferences.
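The complementarity test (contour directions differing by 180° ± e, and each differing by 180° ± e from the direction of the vector joining the two termination points) can be sketched as follows. This is our Python sketch; the use of degrees and the tolerance value are our choices.

```python
import math

# Sketch of the complementarity test for two lone terminations: their
# contour directions must differ by 180 degrees (within tolerance e), and
# each must point back along the vector joining the termination points.

def angle_diff(a, b):
    """Unsigned difference between two directions, in [0, 180] degrees."""
    d = abs(a - b) % 360.0
    return min(d, 360.0 - d)

def complementary(p1, a1, p2, a2, e=10.0):
    """p1, p2: termination points (x, y); a1, a2: directions (degrees) in
    which the contour extends away from each termination point."""
    a12 = math.degrees(math.atan2(p2[1] - p1[1], p2[0] - p1[0]))  # T1 -> T2
    return (abs(angle_diff(a1, a2) - 180.0) <= e and
            abs(angle_diff(a1, a12) - 180.0) <= e)
```

For a straight edge occluded in the middle, the two lone terminations face away from each other along the edge and pass both tests; terminations that merely point in opposite directions but are not collinear fail the second test.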
It is important to stress that only L2 lone termination cells share these distant FELs. Because lone terminations are inhibited by inconsistent L1 termination cells, multiple contours coterminating at a point will prevent distant collinear grouping. For example, Contours 1 and 2 in Figure 16a will group because lone terminations are activated at the occlusion points, but the L vertices in Figure 16b will prevent the same contours from grouping because the extra terminations (1' and 2') inhibit the lone termination cells. This property of distant collinear grouping only through lone terminations has the important consequence that separate geons will not group just because they happen to have collinear edges.
Fast Enabling Links: Summary and Extensions
Recall that cells on the same FEL chain will fire in synchrony, and barring accidental synchrony, cells on separate chains will fire out of synchrony. The FELs posited for this model will parse an image into its constituent geons provided (1) within geons, features lie on the same FEL chain and (2) features of separate geons lie on separate FEL chains. Provision 1 will be true for all geons composed of two- or three-pronged vertices (Figure 17) unless (a) enough vertices are occluded or deleted to leave the remaining vertices on separate FEL chains (Figure 18a) or (b) contour is deleted at midsegment, and additional contour is added that inhibits the resulting lone terminations (Figure 18b). Provision 2 will be true for all pairs of geons unless (a) they share a two- or three-pronged vertex (Figure 19a), (b) they share a contour (Figure 19b), or (c) they share a pair of complementary, collinear lone terminations (Figure 19c).
Conditions I, II, and III constitute a simplified set of constraints that can easily be generalized. Currently, curved edges are grouped on the basis of the same constraints that group straight edges (Conditions I and III), except that a greater difference in orientation is tolerated in grouping curved edge cells than in grouping straight edge cells. These conditions could be generalized by replacing the term collinear with the more general term cocircular. Two edge segments are cocircular if and only if both can be tangents to the same circle (collinearity is a special case of cocircularity with circles of infinite radius). If direction and degree of curvature were made explicit in the representation of edges, Condition I could be generalized to use this information. This generalization would also allow us to exploit an additional constraint on grouping: Matched pairs of coterminating edges with curvatures of opposite signs ("cusps") are formed when two convex volumes interpenetrate. This constraint, termed the transversality regularity, was introduced by Hoffman and Richards (1985) as an explicit cue to parsing. Like the top and stem of a T vertex, terminations forming cusps would not be linked by FELs, thereby implicitly implementing the transversality regularity.12
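Cocircularity of two undirected tangents can be tested via the tangent-chord angle: for a chord of a circle, the tangent-chord angles at its two endpoints are equal, which reduces to t1 + t2 ≡ 2 × chord direction (mod 180°). This formulation is ours, not the article's; the sketch below illustrates the test, with collinearity falling out as the special case where both tangents align with the chord.

```python
import math

# Sketch of a cocircularity test for two undirected edge tangents: the
# tangents at p1 and p2 can touch a common circle iff the chord between
# them makes equal tangent-chord angles at both ends, i.e.
# t1 + t2 = 2 * chord (mod 180 degrees).

def cocircular(p1, t1, p2, t2, e=1e-6):
    """p1, p2: points; t1, t2: undirected tangent orientations in degrees
    (mod 180); e: angular tolerance."""
    chord = math.degrees(math.atan2(p2[1] - p1[1], p2[0] - p1[0])) % 180.0
    d = (t1 + t2 - 2.0 * chord) % 180.0
    return min(d, 180.0 - d) <= e
```

Two horizontal segments on the same line are cocircular (the collinear special case), as are a vertical tangent at (1, 0) and a horizontal tangent at (0, 1) on the unit circle; a horizontal and a vertical tangent on the same horizontal line are not.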
Finally, we should note that the use of collinearity and vertices to group image edges into parts is not unique to the current proposal. A number of computer vision models, for example, Guzman (1971), Waltz (1975), and Malik (1987), group and label edges in line drawings. Line-labeling models operate by the propagation of symbolic constraints among data structures representing the local features (i.e., edges and vertices) in a line drawing. As output, these models produce not an object classification, but a representation in which the 3-D nature of each local feature is specified (e.g., as convex, concave, occluding, or shadow). For example, these models can detect when line drawings represent impossible objects. By contrast, human observers are slow at detecting the impossibility of an object. The current proposal for grouping differs from line labeling in that (a) it is concerned only with the grouping of local image features, not their labeling; (b) it explicitly seeks to derive the grouping in a neurally plausible manner; and (c) once grouped, the features are used to derive a parts-based structural description that is subsequently used for object classification. It remains to be seen what role line labeling might play in object recognition versus, say, depth interpretation of surfaces.
12 Parsing would only occur at matched cusps because a figure with only one cusp, such as a kidney, would allow enabling signals to pass, not through the cusp but around the back, along the smoothly curved edge.

Figure 15. a: Complementary, collinear lone terminations are produced when a single straight edge in the world is occluded by a surface. b: Terminations T1 and T2 are complementary iff (1) |a1 - a2| = 180° ± ε and (2) |a1 - a12| = 180° ± ε, where a1 and a2 are the directions in which the contours extend away from Termination Points 1 and 2 (T1 and T2), respectively, and a12 is the direction of the vector (V12) from T1 to T2.

Layer 3: Geons

The model's first two layers represent the local features of an image and parse those features into temporal packages corresponding to geons. The third layer uses these grouped features
to determine the attributes of the geons in the image. Each cell in L3 responds to a single geon attribute and will respond to any geon that possesses that attribute, regardless of the geon's other properties. For example, the cell that responds to geons with curved axes will respond to a large, curved brick in the upper left of the visual field, a small, curved cone in the lower right, and so forth. Because the geons' attributes are represented independently, an extraordinarily small number of units (58) is sufficient to represent the model's universe of geons and relations. The binding of attributes into geons is achieved by the temporal synchrony established in the first and second layers. It is at this level of the model that viewpoint invariance is achieved, and the first elements of a structural description are generated.
Representation of Geons
Cells in the model's third layer receive their inputs from the vertex, axis, and blob cells of L2. The details of the L2 to L3 mapping are described after the representation of L3 is described. Layer 3 consists of 51 cells that represent geons in terms of the following eight attributes (Figure 7):

Shape of the major axis. A geon's major axis is classified either as straight (as the axis of a cone) or curved (like a horn). This classification is contrastive in that degrees of curvature are not discriminated. One L3 cell codes for straight axes and one for curved.

Shape of the cross section. The shape of a geon's cross section is also classified as either straight (as the cross section of a brick or wedge) or curved (like a cylinder or cone). This attribute is contrastive, and the model does not discriminate different shapes within these categories: A geon with a triangular cross section would be classified as equivalent to a geon with a square or hexagonal cross section. One L3 cell codes for straight cross sections and one for curved.

Parallel versus nonparallel sides. Geons are classified according to whether they have parallel sides, such as cylinders and bricks, or nonparallel sides, like cones and wedges. This attribute is also contrastively coded in two cells: one for parallel and one for nonparallel.
Together, these three attributes constitute a distributed representation capable of specifying eight classes of geons: Brick (straight cross section, straight axis, and parallel sides), Cylinder (curved cross section, straight axis, and parallel sides), Wedge (straight cross section, straight axis, and nonparallel sides), Cone (curved cross section, straight axis, and nonparallel sides), and their curved-axis counterparts. Contrasts included in Biederman's (1987b) RBC theory that are not discriminated by JIM include (a) whether a geon with nonparallel sides contracts to a point (as a cone) or is truncated (as a lamp shade), (b) whether the cross section of a geon with nonparallel sides both expands and contracts as it is propagated along the geon's axis (as the cross section of a football) or only expands (as the cross section of a cone), and (c) whether the cross section is symmetrical or asymmetrical.

Figure 16. a: Contours 1 and 2 will group because lone terminations are activated at the occlusion points. b: Contours 1' and 2' produce L vertices that prevent the same contours (1 and 2) from grouping. The extra terminations (1' and 2') inhibit the lone termination cells. L2 and L1 = Layer 2 and Layer 1.
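The three contrastive attributes above span 2 × 2 × 2 = 8 geon classes, which can be enumerated directly. This is an illustrative sketch only; the labels for the curved-axis variants (e.g., "Curved Brick") are assumed names, not terminology from the model.

```python
# Straight-axis class names are taken from the text; curved-axis labels
# ("Curved ...") are assumed for illustration.
straight_axis_names = {
    ("straight", "parallel"):    "Brick",
    ("curved",   "parallel"):    "Cylinder",
    ("straight", "nonparallel"): "Wedge",
    ("curved",   "nonparallel"): "Cone",
}

classes = {}
for cross in ("straight", "curved"):          # cross-section shape
    for axis in ("straight", "curved"):       # axis shape
        for sides in ("parallel", "nonparallel"):
            base = straight_axis_names[(cross, sides)]
            classes[(cross, axis, sides)] = (
                base if axis == "straight" else "Curved " + base)

print(len(classes))  # 8
```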
Aspect ratio. A geon's aspect ratio is the ratio of the diameter of its cross section to the length of its major axis. The model codes five categories of aspect ratio: approximately 3 or more to 1 (3+:1), 2:1, 1:1, 1:2, and 1:3+. These categories are coded coarsely in one cell per category: Each aspect ratio cell responds to a range of aspect ratios surrounding its preferred value, and cells with adjacent preferred values respond to overlapping ranges.
Coarse orientation. A geon's orientation is represented in two separate banks of cells in L3: a coarse bank, used directly for recognition (i.e., the outputs go to L6), and a fine bank, described later, used to determine the orientation of one geon relative to another (the outputs go to L4). The coarse bank consists of three cells, one each for horizontal, diagonal, and vertical orientations (Figure 7). The coarse orientation cells pass activation to L6 because the orientation classes for which they are selective are diagnostic for object classification, as with the difference between the vertical cylinder in a coffee mug and the horizontal cylinder in a klieg light. However, finer distinctions, such as left-pointing horizontal versus right-pointing horizontal, typically do not distinguish among basic level classes. A klieg light is a klieg light regardless of whether it is pointing left or right.
Fine orientation. The coarse representation of orientation is not precise enough to serve as a basis for determining the relative orientation of two geons. For example, two geons could be perpendicular to one another and be categorized with the same orientation in the coarse representation (e.g., both legs of a T square lying on a horizontal surface would activate the coarse horizontal cell). The more precise fine representation is used to determine relative orientation. The fine cells code seven orientation classes (Figure 20): two diagonal orientations (left end up and right end up), four horizontal orientations (perpendicular to the line of sight, left end closer to viewer, right end closer, and end toward viewer), and one vertical orientation. Each orientation is represented by one cell.
Size. A geon's size is coded coarsely in 10 cells according to the proportion of the visual field it occupies. The activation (A_i) of a size cell in response to a geon is given by

A_i = (1 - |C_i - G|/W_i)^+,                    (7)

where C_i is the preferred size of cell i, W_i is the width of the cell's receptive field (0.1 for size cells), and G is the proportion of the visual field occupied by the geon. The preferred sizes of the L3 size cells start at 0.0 and advance in increments of 0.1 up to 0.9.
Location in the visual field. A geon's location in the visual field is defined as the position of its centroid (the mean x- and y-coordinates for the set of all points inside the geon). The horizontal and vertical components of a geon's position are coded independently and coarsely in 10 cells each. The activation of a location cell is given by Equation 7, where C_i corresponds to the cell's preferred position. Location cells are ordered by positions starting at the left and bottom edges of the visual field and are incremented in equal intervals to the right and top edges, respectively. For example, the cell for x = 1 (far left) responds when a geon's centroid is close to the left edge of the visual field, y = 1 responds to centroids near the bottom edge, and x = 5 responds to centroids just to the left of the vertical midline of the visual field.
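The coarse coding of Equation 7 can be sketched directly. This is a minimal illustration, not code from the model; the example value G = 0.12 is assumed.

```python
def coarse_activation(preferred, width, value):
    """Equation 7: A_i = (1 - |C_i - G| / W_i)^+, clipped at zero."""
    return max(0.0, 1.0 - abs(preferred - value) / width)

# A bank of 10 size cells with preferred sizes 0.0, 0.1, ..., 0.9 and W_i = 0.1.
size_cells = [i / 10 for i in range(10)]

# A geon occupying 12% of the visual field excites only the 0.1 and 0.2 cells,
# strongly and weakly, respectively.
G = 0.12
responses = {C: round(coarse_activation(C, 0.1, G), 2) for C in size_cells}
print({C: a for C, a in responses.items() if a > 0})  # {0.1: 0.8, 0.2: 0.2}
```

The same function serves the location cells, with C_i set to the cell's preferred position.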
Activating the Geon Attribute Cells
Figure 17. Illustration of grouping features into geons using fast enabling link (FEL) chains. (Consider Points a and b on the brick. When the cell representing Point a [Cell 1] fires, it will cause Cell 2 to fire, which will cause Cell 3 to fire, and so on, until this process has reached Point b [Cell 11]. L = layer.)

Figure 18. Conditions under which the proposed FELs will group the local features of a geon (recoverable) or fail to group local features into geons (nonrecoverable). (a: Features of the nonrecoverable geons will lie on separate FEL chains because of deletion of vertices. b: Features of the nonrecoverable geons will lie on separate FEL chains because of extraneous vertices introduced where the original contour has been deleted.)

Cells in the model's third layer receive their inputs from the vertex, axis, and blob cells of L2, but the second and third layers
of the model are not fully interconnected. Rather, L3 cells receive bottom-up excitation from consistent L2 cells only, and L3 cells representing inconsistent hypotheses (e.g., curved axis and straight axis) are mutually inhibitory. For example, L2 vertex cells send their outputs to the L3 cells for straight and curved cross section, but neither excite nor inhibit the L3 size cells. With the exception of cells for size, location, and aspect ratio, L3 cells compute their activations and outputs by Equation 2. The lateral inhibition among inconsistent L3 cells is
implemented by normalizing the net inputs (N_i) to the L3 cells by the equation

N_i = N_i^k / (N_i^k + Σ_j N_j^k),    k > 1,                    (8)

for all j such that cell j inhibits cell i.
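A brief sketch of this normalization follows. Note that the exact form of Equation 8 is reconstructed here from the surrounding text (the original equation was lost in extraction), so treat the formula, the exponent k = 2, and the example inputs as assumptions.

```python
def normalize(nets, k=2.0):
    """Divisive normalization with exponent k > 1 (reconstructed Equation 8).
    The denominator sums the powered input of cell i and of all cells that
    inhibit it; here the whole list is one mutually inhibitory pool."""
    powered = [n ** k for n in nets]
    total = sum(powered)
    return [p / total for p in powered]

# Straight-axis vs. curved-axis cells receiving conflicting bottom-up input:
straight, curved = normalize([0.6, 0.4])
print(round(straight, 2), round(curved, 2))  # 0.69 0.31
```

Because k > 1, the better supported hypothesis is amplified relative to the raw 0.6 : 0.4 ratio, implementing the winner-favoring lateral inhibition described above.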
The remainder of this section describes the pattern of connectivity from L2 to L3. In the discussion that follows, two conventions are adopted. First, if a class of L2 cells sends no output to a class of L3 cells, they are not mentioned in the discussion of that L3 attribute (e.g., vertex cells are not mentioned in the discussion of the L3 size cells). The second convention stems from those cases where a given L3 attribute receives input from a class of L2 cells, but is insensitive to some of the dimensions for which those cells code. For example, the L3 cells for size receive input from the L2 blob cells but are not sensitive to the locations, orientations, or elongations of those blobs. In such cases, the irrelevant dimensions are not mentioned, and the L3 cells compute their inputs by summing the outputs of L2 cells over the irrelevant dimensions (e.g., the L3 size cells sum the outputs of L2 blob cells over all orientations, locations, and elongations).
Axis shape and parallel versus nonparallel sides. Both the shape of a geon's major axis and whether its sides are parallel or nonparallel are determined on the basis of the 2-D axes in the image. L2 cells representing axes of nonparallel symmetry excite the L3 cell for nonparallel sides, and cells for axes of parallelism excite the L3 cell for parallel sides. L2 cells representing curved axes excite the L3 cell for curved axis, and cells representing straight axes excite L3's straight axis cell. The parallel-sides and non-parallel-sides cells are mutually inhibitory, as are the straight-axis and curved-axis cells.
Cross section. Whether a geon's cross section is straight or curved is determined on the basis of the vertices active in L2. Tangent-Y vertices and L vertices with at least one curved leg excite the L3 curved-cross-section cell. Forks and arrows excite the straight cross section cell. Straight cross section and curved cross section cells are mutually inhibitory. This mapping is summarized in Figure 21.
Size and location. L3 cells representing geon size and location take their input from the blobs derived in L2. Recall that a blob represents a region of the visual field with a particular location, size, orientation, and elongation. It is assumed that each blob cell excites a set of L3 size and location cells consistent with its receptive field properties. The receptive field properties of the size and location cells were described earlier; their inputs (G in Equation 7) come from the blob computations. It is assumed that the term (1 - |C_i - G|/W_i)^+ is equal to the connection weight to an L3 size or location cell from an active blob cell. For example, as shown in Figure 22, a given blob might
respond to a region of activity just left and above the middle of the visual field, elongated slightly, oriented at or around Π/4 (45°), and occupying between 10% and 15% of the visual field. In the L3 location bank, this blob will strongly excite x = 5 and more weakly excite x = 4, and it will strongly excite y = 5 and more weakly excite y = 6. In the L3 size bank, it will excite the cells for 10% and 15% of the visual field. Figure 23 shows the response of L3 size and location cells as a function of a geon's size and location.

Figure 19. Conditions under which the proposed fast enabling links (FELs) will separate the parts of an object (separable) or fail to separate the parts (nonseparable). (a: The nonseparable geons share a two- or three-pronged vertex. b: The nonseparable geons share a contour. c: The nonseparable geons share a pair of complementary, collinear lone terminations [along the central vertical edge].)

Figure 20. The fine orientation cells code seven geon orientations: (a) vertical; (b) diagonal, left end up; (c) diagonal, right end up; (d) horizontal, perpendicular to the line of sight; (e) horizontal, left end closer to viewer; (f) horizontal, end toward viewer; (g) horizontal, right end closer to viewer. (The coarse orientation cells code only three. Each fine orientation corresponds to exactly one coarse orientation.)
Aspect ratio. L3 aspect ratio cells calculate their activation by Equation 7, where C_i is the cell's preferred aspect ratio, and W_i is the width of the cell's receptive field. The receptive field properties of these cells are illustrated in Figure 23c. The elongation of the L2 blobs supplies the value of a geon's aspect ratio (G in Equation 7). However, as illustrated in Figure 24a, a blob with a given elongation and orientation (e.g., minor axis = 1 and major axis = 2 at orientation = 0 radians) could arise either from an elongated geon at the same orientation (cross-section width = 1 and axis length = 2 at 0 radians) or from a flattened geon at the perpendicular orientation (2:1 at Π/2 radians). The model resolves this ambiguity by comparing the blob's orientation to the orientation of the geon's longest 2-D axis. If the orientations are approximately equal, then the longer of the two aspect ratios is chosen (e.g., 1:2); if they are approximately orthogonal, then the shorter aspect ratio is chosen (2:1).

In the simulations reported here, these comparisons are made directly on the basis of the data structures describing the axes and blobs in an object's image. However, the comparisons could easily be performed by an intermediate layer of cells as follows (see Figure 24b): Each cell in the intermediate layer (labeled Layer 2.5 Orientation × Aspect Ratio Cells) represents a conjunction of aspect ratio and 2-D orientation13 and receives

13 Only one L2.5 unit need exist for Aspect Ratio 1:1 because such blobs, being round, have no 2-D orientation.
Figure 21. Forks and arrows excite the straight cross section cell. (Tangent-Y vertices and L vertices with at least one curved leg excite the Layer 3 curved-cross-section cell. Straight cross section and curved cross section cells are mutually inhibitory.)
inputs from both axis and blob cells. Each blob cell excites two
L2.5 units, one with the same orientation and aspect ratio and
one with the perpendicular orientation and reciprocal aspect
ratio. Each axis cell excites all L2.5 cells consistent with its
orientation. The correct aspect ratio can be chosen by comput-
ing the net input to each L2.5 cell as the product of its axis and
blob inputs. The L2.5 cell outputs can then be summed over
orientation to determine the inputs to the L3 aspect ratio cells.
For example, assume that the elongated horizontal cylinder on
the right of Figure 24a was presented to the network. It would
activate the blob and axis cells shown in Figure 24 (a and b,
respectively). The blob would excite two cells in L2.5 (vertically
hatched in Figure 24b): 2:1 at Orientation 4 and 1:2 at Orienta-
tion 0. The axis cells would excite all L2.5 cells for Orientation 0
(horizontally hatched cells). The only L2.5 cell receiving inputs
from both an axis and a blob would be the one for 1:2 at Orienta-
tion 0. This cell would excite the L3 aspect ratio cell for 1:2,
which is the correct aspect ratio for that geon.
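The L2.5 product rule in the example above can be sketched as follows. The eight-way orientation indexing and the unit (0/1) input values are illustrative assumptions; only the structure of the computation follows the text.

```python
from itertools import product

orientations = range(8)                       # hypothetical 2-D orientation bins
aspects = ["3:1", "2:1", "1:1", "1:2", "1:3"]

blob_input = {cell: 0.0 for cell in product(orientations, aspects)}
axis_input = dict(blob_input)

# An elongated horizontal cylinder: its blob is consistent with 1:2 at
# Orientation 0 or, equivalently, 2:1 at the perpendicular Orientation 4.
blob_input[(0, "1:2")] = 1.0
blob_input[(4, "2:1")] = 1.0

# Its 2-D axis excites every L2.5 cell sharing the axis's orientation (0).
for a in aspects:
    axis_input[(0, a)] = 1.0

# Net input to each L2.5 cell is the product of its axis and blob inputs,
# so only the cell consistent with both sources survives.
net = {cell: blob_input[cell] * axis_input[cell] for cell in blob_input}
winner = max(net, key=net.get)
print(winner)  # (0, '1:2')
```

Summing the surviving L2.5 activity over orientation then yields the input to the L3 aspect ratio cell for 1:2, as described above.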
Orientation (coarse and fine). L3 fine orientation cells re-
ceive input from the vertex and axis cells in L2. In practice, the
coarse orientation cells receive their input from the corre-
sponding fine cells (Figure 20), but in principle, they could just
as easily determine their inputs on the basis of the L2 axis and
vertex cells. Both axes and vertices are used to determine orientation because neither is sufficient alone. Consider a geon with
a straight axis (of parallelism or symmetry) that is vertical in the
visual plane (Figure 25). On the basis of the orientation of the
axis alone, it is impossible to determine whether the geon is
oriented vertically or horizontally, with its end toward the
viewer. Cells representing vertical axes therefore excite both the
L3 vertical and horizontal-end-on cells. The resulting ambigu-
ity is resolved by the geon's vertices. If the geon is standing
vertically, the straight legs of its tangent-Y vertices will extend
away from the points where they terminate with an angle greater than Π and less than 2Π. Therefore, all cells representing tangent-Y vertices with straight legs extending between Π and 2Π excite the L3 cell for vertical. All those with straight legs extending between 0 and Π excite horizontal-end-on. This
mapping is summarized in Figure 26. The orientation with the
most bottom-up support is selected by inhibitory interactions
(Equation 8) among the fine orientation cells.
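The disambiguation just described can be sketched minimally. The vote-and-max step below is a simplified stand-in for the model's inhibitory interactions (Equation 8), and the angle conventions are assumptions for illustration.

```python
import math

def fine_orientation(axis_vertical, leg_angle):
    """Resolve the vertical vs. horizontal-end-on ambiguity of a vertical
    2-D axis using the angle (radians) at which the straight leg of a
    tangent-Y vertex extends from its termination point."""
    votes = {"vertical": 0.0, "horizontal-end-on": 0.0}
    if axis_vertical:                       # the axis excites both candidates
        votes["vertical"] += 1.0
        votes["horizontal-end-on"] += 1.0
    if math.pi < leg_angle < 2 * math.pi:   # leg between pi and 2*pi: vertical
        votes["vertical"] += 1.0
    elif 0 < leg_angle < math.pi:           # leg between 0 and pi: end-on
        votes["horizontal-end-on"] += 1.0
    return max(votes, key=votes.get)        # inhibition selects the winner

print(fine_orientation(True, 1.5 * math.pi))  # vertical
print(fine_orientation(True, 0.5 * math.pi))  # horizontal-end-on
```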
Summary of Layers 1-3
The model's first three layers parse an image into its constitu-
ent geons and activate independent representations of each of
the geon's attributes. For example, given the image in Figure 1,
the model will parse the local features of the cone and the brick
into separate groups, features within a group firing in
synchrony. Each group then activates the cells in L3 that de-
scribe the geon they comprise. L3 cells are temporally yoked to
their inputs: They fire only on time slices in which they receive
inputs. When the L2 cells representing the cone fire, the L3
cells for straight axis, curved cross section, nonparallel sides,
vertical, and the cells for its aspect ratio, size, and location
become active and fire. Likewise, when the L2 cells represent-
ing the brick fire, the L3 cells for straight axis, straight cross
section, parallel sides, horizontal, and the cells for its aspect
ratio, size, and location will fire.
Figure 22. Layer 2 blobs activate Layer 3 cells representing a geon's size and location in the visual field.
Layers 4 and 5: Relations Among Geons
Of the eight attributes represented in L3, five—axis shape,
cross-section shape, parallelism of sides, coarse orientation,
and aspect ratio—pass activation directly to the model's sixth
layer (Figure 7). The remaining attributes—size, location, and
fine orientation—pass activation to Layer 4, which, in conjunc-
tion with Layer 5, derives relative size, relative location, and
relative orientation. The computational goals of Layers 4 and 5
are threefold. First, the relations among geons must be made
explicit. For example, rather than representing that one geon is
below another implicitly by representing each of their loca-
tions, the below relation is made explicit by activating a unit that
corresponds uniquely to it. Second, the relations must be
bound to the geons they describe. If one geon is below another
in the visual field, the unit for below must be synchronized
with the other units describing that geon. Finally, the relations
must be invariant with geon identity and viewpoint so that, for
example, the below unit will fire whenever one geon is below
another, regardless of the shape of the geons, their locations in
the visual field, orientation in depth, and size.
These goals are satisfied in two steps. In the first, L4 cells act
as AND gates, responding when conjunctions of L3 cells fire on
different (but nearby) time slices. In the second step, L5 cells
OR together the outputs of multiple L4 cells, responding to the
relations in a viewpoint-invariant manner. As illustrated in Fig-
ure 27, each L4 cell responds to a specific relation (e.g., below)
in conjunction with a specific value on the dimension over
which that relation is defined (e.g., y = 1). L4 cells respond to the
following types of conjunctions: above-below conjoined with
position in y, right-left with position in x, larger-smaller with
size, and perpendicular-oblique with orientation. L5 contains
only one cell for each relation: above, below, beside (which re-
places the left-right distinction), larger, smaller, perpendicular,
and oblique.
L4 cells receive both excitatory inputs and enabling signals from L3 (Figure 28). An L3 cell will excite an L4 cell if the L4 cell's value satisfies its relation with respect to the L3 cell. For example, the L3 y = 3 cell excites the L4 cell for below|y = 1, because y = 1 is below y = 3 (y = 1 satisfies below with respect to y = 3). y = 3 also excites above|y = 5, above|y = 6, and so forth. Excitatory connections from L3 to L4 are unit strength, and
there are no inhibitory connections. L4 cells sum their inputs over time by the equation

ΔA_i = γE_i(1 - A_i) - δA_i,                    (9)

where A_i is the activation of L4 cell i, E_i is its excitatory input, and γ and δ are growth and decay parameters, respectively. L3
cells send enabling signals to L4 cells that respond to the same
Figure 23. Receptive field properties of Layer 3 (a) location, (b) size, and (c) aspect ratio cells. (Geon position is expressed as distance from the left edge [0.0 = left, 1.0 = right] for horizontal position cells and distance from the bottom edge [0.0 = bottom, 1.0 = top] for vertical position cells; geon size is expressed as the proportion of the visual field occupied by the geon; blob elongation is defined as P²/A, where P is the number of points on the perimeter and A is the total number of points in the blob.)
value; for example, y = 1 sends enabling signals to below|y = 1. L4 cells ensure proper geon-relation binding by firing only when they receive enabling signals. The invariant L5 relation cells sum the outputs of the corresponding L4 cells. For example, the L5 below cell receives excitation from below|y = 1, below|y = 2, and so on.
The relation below is used to illustrate how this architecture activates and binds invariant relations to geons, but the mechanism works in exactly the same way for all relations. Suppose, as shown in Figure 28, there is a geon near the bottom of the visual field (Geon A) and another nearer the middle (Geon B). y = 1 and y = 2 will fire in synchrony with the other L3 units describing Geon A, say, on time slices 1, 5, and 9. Similarly, y = 2 and y = 3 will fire in synchrony with the other properties of Geon B, say, on time slices 3, 7, and 11. Recall that below|y = 1 receives an excitatory input from y = 3. Therefore, when y = 3 fires (Time Slice 3), below|y = 1 will receive an excitatory input, and its activation will go above zero. On Time Slice 5, y = 1 will fire and send an enabling signal to below|y = 1, causing it to fire (i.e., in synchrony with y = 1 and, by transitivity, Geon A's other properties). Then below|y = 1 sends an excitatory signal to the L5 below cell, causing it to fire with Geon A's other properties. In a directly analogous manner, above will come to fire in synchrony with the other properties of Geon B.
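The walkthrough above can be simulated with a minimal sketch. The bookkeeping here is a simplification: activation is a simple count rather than Equation 9, and the excitation rule (below|y = n is excited only by y-cells at least two units above n) also builds in the blind spot for directly flanking values described below.

```python
# below|y = n is excited by y-cells from n+2 upward (flanking value n+1 is
# the blind spot); it fires only on slices when y = n itself fires.
EXC_SOURCES = {n: set(range(n + 2, 11)) for n in range(1, 11)}

def simulate(slices):
    """slices: one set of active L3 y-cells per time slice.
    Returns the slices on which the L5 'below' cell fires."""
    active = {n: 0.0 for n in EXC_SOURCES}   # L4 below|y=n activations
    below_fires = []
    for t, y_cells in enumerate(slices, start=1):
        for n, sources in EXC_SOURCES.items():
            if y_cells & sources:            # excitatory input accrues
                active[n] += 1.0
            if n in y_cells and active[n] > 0:   # enabling signal: fire in
                below_fires.append(t)            # synchrony with y = n
    return sorted(set(below_fires))

# Geon A (y = 1, 2) fires on slices 1, 5, 9; Geon B (y = 2, 3) on 3, 7, 11.
slices = [{1, 2}, set(), {2, 3}, set(), {1, 2}, set(),
          {2, 3}, set(), {1, 2}, set(), {2, 3}]
print(simulate(slices))  # [5, 9]
```

As in the text, below first receives excitation on Slice 3 (when Geon B's y = 3 fires) but fires only on Geon A's slices, so the relation is bound to the correct geon.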
One problem with this architecture is a potential to "halluci-
nate" relations between a geon and itself. Such hallucinations
can result if a geon's metric properties are coded coarsely, as
they are in this model. For example, a given geon's vertical
Figure 24. Determining a geon's aspect ratio. (a: A given blob is consistent with two aspect ratios, an elongated geon with the same orientation as the blob or a flattened geon whose orientation is perpendicular to that of the blob. b: A layer of units that could determine aspect ratio from the axis and blob cells activated by a geon's image. rad. = radian.)
position in the visual field might be represented by simultaneous activity in both y = 2 and y = 3. Because y = 2 is below y = 3, the L4 below|y = 2 cell could become active and fire in response to the presence of that geon even if there are no others in the visual field. This problem is overcome by giving each L4 cell a blind spot for the L3 value directly flanking its preferred value. For example, below|y = 2 receives excitation from y = 4, y = 5, . . . , y = 10 but not from y = 3. The L4 blind spot prevents hallucinations, but it has the negative side effect that, for small enough differences in metric values, relations may be undetectable. For example, two very flat geons, one on top of the other, may activate the same L3 vertical position cells, only with slightly different ratios (say, one excites y = 2 strongly and y = 3 weakly; the other excites y = 2 weakly and y = 3 strongly). Because of the blind spot, the model would be insensitive to the above-below relation between such geons.
Layers 6 and 7: Geon Feature Assemblies and Objects
Together, Layers 3 and 5 produce a pattern of activation, termed a geon feature assembly (GFA), describing a geon in
Figure 25. Alone, the orientation of a geon's major axis is not sufficient to determine the 3-D orientation of that geon. (As shown on the left, a vertical axis could arise from a horizontal geon with its end toward the viewer or from a vertical geon. Similarly, as shown on the right, the orientation of a vertex is also insufficient to determine the orientation of the geon.)
terms of its shape and general orientation as well as its location, size, and orientation relative to the other geons in the image. The collection of GFAs (over different time slices) produced in response to an object constitutes a simple structural description that is invariant with translation, scale, and orientation in depth. Furthermore, the model will produce the same description just as quickly in response to an object whether it is "viewing" that object for the first time or the twentieth. In this sense, the model's first five layers accomplish the primary theoretical goals of this effort. However, to assess the sufficiency of this representation for viewpoint-invariant recognition, we have included two additional layers (6 and 7) that use the L3-L5 output to perform object classification.

The connections and FELs in JIM's first five layers are fixed, reflecting our assumption that the perceptual routines governing the activation of structural descriptions remain essentially unchanged past infancy. The acquisition of new objects is assumed to consist entirely in the recruitment of units (using modification of connections to existing cells) in Layers 6 and 7.
There is an important and difficult question here as to exactly how and when new object classes are acquired. For example, when we see an object that is similar in shape to some familiar class of objects, how do we decide whether it is similar enough to belong to that class or whether it should become the first member of a new object class? ("I think it's a hot dog stand, but it could be some kind of device for sweeping the sidewalk.") Proper treatment of this question requires a theory of categorization, and the emphasis of this effort is on viewpoint invariance. Although it is possible that JIM's GFAs could serve as the representation in such a theory, a complete theory also requires processes to operate on its representations. Rather than attempt to propose such a theory, we have chosen instead to use a theoretically neutral procedure for providing the model with its vocabulary of objects. This simplified "familiarization" procedure is described shortly. First, let us consider how the model's last two layers activate a representation of a familiar object class given a structural description as input.
Given that almost any upright view of an object will produce the same GFA pattern in L3-L5 (within a small range of error), the task of classifying an object from its GFA pattern is straightforward. It is accomplished by allowing cells in L6 to be recruited by specific GFAs and cells in L7 to be recruited by conjunctions of L6 cells. If an object is in JIM's vocabulary,
Figure 26. Illustration of the mapping from Layer 2 axis and vertex cells to Layer 3 orientation cells. (Rt. = right.)

Figure 27. Detail of Layers 4 and 5. (Each Layer 4 cell responds to a specific relation [e.g., below] in conjunction with a specific value on the dimension over which that relation is defined [e.g., y = 1]. Layer 5 contains only one cell for each relation.)
then each of its GFAs will activate a different cell in L6 (hence-
forth, GFA cells). The outputs of the GFA cells are summed over
separate time slices to activate an L7 unit representing the class
of the object.
The cells of L6 are fully interconnected to L5 and to the subset of L3 that passes activation to L6 (i.e., all L3 cells except those for size, position, and fine orientation). One L6 cell is a dummy cell, with unit strength excitatory connections from all L3-L5 cells. This cell is included to mimic the effect of GFA cells that are still free to be recruited in response to novel object classes. The remaining cells are selective for specific conjunctions of geon and relation attributes (for the current vocabulary, there are 20). For example, one GFA cell receives strong excitation from the L3 cells for curved cross section, straight axis, nonparallel sides, vertical, aspect ratio 1:1, and aspect ratio 1:2, and from the L5 cells for above, smaller than, and perpendicular to; the remaining connections to this cell are negligibly small. Thus, this GFA cell is selective for slightly elongated vertical cones that are above, smaller than, and perpendicular to other geons. Other GFA cells are selective for different patterns, and all GFA cells (including the dummy cell) are mutually inhibitory.

The most difficult problem confronting a GFA cell is that the pattern for which it is selective is likely to overlap considerably with the patterns selected by its competitors. For example, many objects contain geons with curved cross sections or geons that are below other geons, and so forth. Furthermore, some GFAs may be subsets of others: One GFA cell might respond to vertical cylinders below other geons, and another might respond to vertical cylinders below and beside other geons. To allow the GFA cells to discriminate such similar patterns, we adopted an excitatory input rule described by Marshall (1990), and others:
E_i = Σ_j O_j w_ij / (α + Σ_j w_ij),   (10)
where E_i is the excitatory input to L6 cell i, O_j is the output of cell j (in L3 or L5), w_ij is the weight of the connection from j to i, and α is a constant. This equation normalizes a cell's excitatory input to a Weber function of the sum of its excitatory connections, making it possible for the cell to select for patterns that overlap with, or are even embedded within, the patterns selected by other cells. To illustrate, consider a simple network with two output cells (corresponding to L6 cells in JIM) and three input cells (corresponding to the L3-L5 cells), as shown in Figure 29. For simplicity, assume outputs and connection weights of 0 or 1, and let α = 1. Output Cell 1 is selective for input Pattern ABC; that is, w_1A = w_1B = w_1C = 1, and Output Cell 2 is selective for Pattern AB. If ABC is presented to this network (i.e., O_A = O_B = O_C = 1), then, by Equation 10, E_1 will be 0.75, and E_2 will be 0.67. In response to ABC, Cell 1 receives a greater net input and therefore inhibits Cell 2. By contrast, if Pattern AB is presented to the network, E_1 will be 0.50, E_2 will be 0.67, and Cell 2 will inhibit Cell 1.
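The arithmetic of this two-cell example can be checked with a short sketch. This is our illustration, not the original implementation; the function name and data layout are our own.

```python
# Weber-fraction excitatory input rule (Equation 10), applied to the
# two-cell example of Figure 29.

def excitatory_input(outputs, weights, alpha=1.0):
    """E_i = sum_j O_j * w_ij / (alpha + sum_j w_ij)."""
    numerator = sum(o * w for o, w in zip(outputs, weights))
    return numerator / (alpha + sum(weights))

w1 = [1.0, 1.0, 1.0]   # Output Cell 1 is selective for Pattern ABC
w2 = [1.0, 1.0, 0.0]   # Output Cell 2 is selective for Pattern AB

abc = [1.0, 1.0, 1.0]  # O_A = O_B = O_C = 1
ab = [1.0, 1.0, 0.0]   # O_A = O_B = 1, O_C = 0

print(excitatory_input(abc, w1))  # 0.75 -> Cell 1 wins for ABC
print(excitatory_input(abc, w2))  # ~0.667
print(excitatory_input(ab, w1))   # 0.5
print(excitatory_input(ab, w2))   # ~0.667 -> Cell 2 wins for AB
```

Note that the normalization is what lets Cell 2 win for the embedded pattern AB despite Cell 1 also receiving input from A and B.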
By allowing the GFA cells to select for overlapping and em-
bedded patterns, this input rule allows JIM to discriminate
Figure 28. Operation of Layers 4 and 5 illustrated with the relation below. (FEL = fast enabling link.)
objects with very similar structural descriptions: Each possible pattern of activation over L3 and L5 cells could potentially recruit a different GFA cell in L6. Given 21 inputs to L6, the number of GFA cells could in principle be as great as 2^21 (or larger, if more than two degrees of activity were discriminated). However, because GFA cells are recruited only as the model adds objects to its vocabulary, the number of such cells that would realistically be required (even to represent the approximately 150,000 object models familiar to an adult [Biederman, 1988]) is considerably smaller.
JIM was implemented and run on an IBM PS/2, Model 70. Simulations tested JIM's capacity for invariance with translation, scale changes, left-right image reflection, and rotations in depth and in the visual plane. These simulations were conducted in two phases, a familiarization phase and a test phase. During familiarization, the model was presented with one view of each of 10 simple objects and allowed to modify the connection weights to L6 and L7 in response to the patterns of activity produced in L3 and L5. After familiarization, the connections were not allowed to change. During the test phase, JIM was presented with each object in 10 views: the original (baseline) view that was used for familiarization and 9 novel views. Its performance was evaluated by comparing its L7 responses to the test images with its baseline responses at that layer.
Familiarization: Creating the Object Vocabulary
Let us refer to the vector of bottom-up excitatory connection weights to a GFA cell as its receptive field. Before familiarization, all GFA cells (except for the dummy cell) were initialized: their receptive fields were set to vectors of zeros, and their output connections to all L7 object cells were set to -0.5.
JIM's vocabulary currently consists of 2 three-geon objects and 8 two-geon objects. The baseline view of each object is depicted in Figure 30. JIM was familiarized with one object at a time, each by the following procedure: The model was given the baseline image of an object as input and run for 20 time slices (ts). Each time the L1-L2 features of a geon fired, L3 and L5 produced a GFA as output. The GFA from the latest ts associated with each geon was selected as the familiarization GFA (denoted GFA*) for that geon. For example, the GFA*s for the telephone in Figure 30 were selected by presenting the baseline image to the model and running the model for 20 ts. Assume that the L2 features of the brick fired on ts 2, 7, 11, and 15, and those of the wedge fired on ts 4, 8, 13, and 18. The L3/L5 outputs for ts 15 would be selected as GFA* for the brick, and those for ts 18 as GFA* for the wedge.
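This selection step can be sketched concretely. The sketch is ours: the real GFAs are 21-dimensional L3/L5 activation vectors, abbreviated here to labels, and the firing times follow the telephone example above.

```python
# Selecting the familiarization GFA (GFA*) for each geon: take the
# L3/L5 output from the latest time slice on which that geon's
# features fired. The GFA "vectors" here are stand-in labels.

firing_ts = {"brick": [2, 7, 11, 15], "wedge": [4, 8, 13, 18]}
gfa_by_ts = {t: f"GFA@ts{t}" for slices in firing_ts.values() for t in slices}

gfa_star = {geon: gfa_by_ts[max(slices)] for geon, slices in firing_ts.items()}
print(gfa_star)  # brick -> GFA from ts 15, wedge -> GFA from ts 18
```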
Once an object's GFA*s were generated, the object was added to JIM's vocabulary by recruiting one L6 GFA cell for each GFA* and one L7 object cell for the object as a whole. For each GFA*, i, a GFA cell was recruited by computing the Euclidean distance14 (D_ij) between i and the receptive fields of all previously recruited GFA cells, j. If, for all j, D_ij was greater than 0.5, a new GFA cell was recruited for GFA* i by setting the receptive field of an unrecruited GFA cell (i.e., one still in its initialized state) equal to GFA* i and setting the connection from that GFA cell to the associated object cell to 1.0. If a previously recruited cell, j, was found such that D_ij < 0.5, then the receptive field of
Excitatory Weights:
w_1A = w_1B = w_1C = 1
w_2A = w_2B = 1, w_2C = 0
Net Excitatory Input:
E_i = Σ_{j=A..C} O_j w_ij / (1 + Σ_{j=A..C} w_ij)
Input Pattern [ABC]:
O_A = O_B = O_C = 1
E_1 = (3)/(1 + 3) = 0.75
E_2 = (2)/(1 + 2) = 0.67
Input Pattern [AB]:
E_1 = (2)/(1 + 3) = 0.50
E_2 = (2)/(1 + 2) = 0.67
Figure 29. The Weber fraction excitatory input rule allows Output Cell 2 to select for Input Pattern AB and Output Cell 1 to select for Input Pattern ABC.
Figure 30. The baseline view of each object in JIM's vocabulary. (Objects 7 and 8 contain three geons; all others contain two. Objects 8 and 9 contain ambiguous geons: the central ellipse in Object 8 contains no axis information, and the "handle" on the "frying pan" [Object 9] contains no cross-section information. Objects 1 and 10 have the same geon and relation attributes in different combinations [see Figure 7].)
that cell was set to the mean of itself and GFA* i, and its connection to the associated object cell was set to 1.0. This procedure recruited 20 GFA cells for the complete set of 22 GFA*s in the training set.
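A minimal sketch of this recruit-or-merge rule follows. It is our reconstruction: the vectors are shortened to 3-D for readability, whereas the model's GFAs and receptive fields are 21-dimensional, and the function names are ours.

```python
import math

# Recruit-or-merge rule for GFA cells: if the new GFA* lies within
# Euclidean distance 0.5 of an already-recruited receptive field, that
# field is set to the mean of itself and the GFA*; otherwise a fresh
# cell is recruited with the GFA* as its receptive field.

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def recruit(gfa_star, fields, threshold=0.5):
    """fields: list of recruited receptive fields (mutated in place).
    Returns the index of the cell that now codes this GFA*."""
    for j, rf in enumerate(fields):
        if euclidean(gfa_star, rf) < threshold:
            fields[j] = [(x + y) / 2.0 for x, y in zip(rf, gfa_star)]
            return j                  # merged into an existing cell
    fields.append(list(gfa_star))
    return len(fields) - 1            # new cell recruited

fields = []
recruit([1.0, 0.0, 0.0], fields)      # recruits cell 0
recruit([0.9, 0.0, 0.0], fields)      # distance 0.1 < 0.5: merges into cell 0
recruit([0.0, 1.0, 0.0], fields)      # far from cell 0: recruits cell 1
print(len(fields))  # 2
```

This reproduces the behavior described above: 22 GFA*s can map onto fewer GFA cells whenever two GFA*s fall within the distance threshold.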
It is particularly important to note that this procedure establishes the model's vocabulary on the basis of only one view of each object. As such, exposure to many different views of each object cannot account for any viewpoint invariance demonstrated in the model's recognition performance. Also, this procedure is tantamount to showing the model a line drawing and instructing it that "this is an X." In an earlier version of the model (Hummel & Biederman, 1990b), the object vocabulary was developed by allowing the sixth and seventh layers to self-organize in response to GFA*s selected as described earlier. With a five-object training set, that procedure settled on a
14 Recall that both the GFAs and the L6 receptive fields are 21-dimensional vectors. The Euclidean distance (D_ij) between two vectors, i and j, is calculated as the square root of the sum over vector elements of the squared differences of corresponding vector elements: D_ij = (Σ_k (i_k - j_k)^2)^1/2.
stable pattern of connectivity in L6 and L7, and the resulting connections produced recognition results very similar to those produced by the current version of the model. However, where the current familiarization procedure is capable of acquiring objects one at a time, the self-organizing system required all the objects to be presented at the same time. Where the current procedure establishes a pattern of connectivity in one exposure to an object, the self-organizing algorithm required 3,000 presentations of the training set to settle into a stable configuration. The self-organization was also highly sensitive to its parameters and to the ratio of the number of GFA*s to the number of L6 cells (e.g., it would not settle unless the number of L6 cells exactly equaled the number of GFA*s). We decided to use the current familiarization procedure instead because we want the model's behavior to reflect the performance of its first five layers rather than the idiosyncrasies of any particular learning algorithm.
Test Simulations: Conditions and Procedure
After familiarization, JIM's capacity for viewpoint-invariant recognition was evaluated by running two blocks of simulations. Each block consisted of 100 runs of the model, 10 objects presented in each of 10 conditions: Condition 1, Baseline, in which the original (familiarization) image was presented; Condition 2, Translation, in which the original image was moved to a new position in the visual field; Condition 3, Size, in which the original image was reduced to between 70% and 90% of its original size; Condition 4, Mirror Reversal of the original image; Condition 5, Depth rotation of 45° to 70° of the original object; Conditions 6 to 10, in which five images were created by rotating the original image in the visual plane (22.5°, 45°, 90°, 135°, and 180°). Blocks 1 and 2 differed only in the number, N, of ts for which the model was allowed to run on each image: In Block 1, N = 20, and in Block 2, N = 40.
Simulations were conducted by activating the set of L1 edge cells and L2 axis cells corresponding to the image of an object and allowing the model to run for N ts. Cells in all layers of the model updated their activations and outputs as described in the previous sections. On each of the n ts in which L1-L2 outputs were generated,15 the activity of the target cell (the L7 cell corresponding to the correct identity of the object) was monitored. No data were collected on those (N - n) ts in which no output was generated.
Four response metrics were calculated for each simulation because, alone, any one metric has the potential to yield a misleading characterization of performance. The metrics calculated were maximum activation of the target cell during the simulation (Max), mean activation of the target cell over the n ts in which data were collected, proportion (P) of all object cell mean activations attributable to the target cell, and the mean activation multiplied by proportion (MP). Max and mean provide raw measures of the target cell's response to an image. P and MP reflect the strength of the target cell's response relative to the responses of all other object cells. The Appendix discusses how these response metrics were calculated and their relative merits and disadvantages.
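The four metrics can be summarized in a short sketch. The variable names and data layout are ours; `acts` is a hypothetical record of each object cell's activation on the n data-bearing time slices.

```python
# Four response metrics for one simulation: Max and mean are raw
# measures of the target cell; P and MP are relative to all object
# cells.

def response_metrics(acts, target):
    means = {cell: sum(a) / len(a) for cell, a in acts.items()}
    mx = max(acts[target])                   # Max: peak target activation
    mean = means[target]                     # mean target activation
    p = means[target] / sum(means.values())  # P: target's share of all means
    return mx, mean, p, mean * p             # MP = mean * P

acts = {"target": [0.2, 0.4, 0.5], "other": [0.0, 0.1, 0.0]}
mx, mean, p, mp = response_metrics(acts, "target")
```

Note that P = 1.0 whenever the target is the only cell with a nonzero mean activation, which is how the simulations below should be read.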
As is evident in Figure 31, all the response metrics yielded the same qualitative picture of the model's performance, so most simulations are reported only in terms of max. For each block of simulations, JIM's performance in each condition (e.g., baseline, translation, and so forth) is reported in terms of the mean (over objects) of max, but the ordinate of each graph is labeled with the individual response metric (e.g., if the graph shows mean max's over objects in each condition, the ordinate will be labeled max). Error bars indicate the standard error of the mean. Because each metric is proportional to the strength and correctness of the model's response to an image, high values of the response metrics are assumed to correspond to low reaction times and error rates in human subjects.
Test Simulations: Results
There is a stochastic component to the refractory thresholds of the cells in L1 and L2, so the output of the target cell in response to an image is subject to random variation. Specifically, the cell's output will reflect the number of times the features of each geon in the image fire, the number of ts between different geons' firings, and the order in which they fire. To derive an estimate of the amount of variation that can be expected for a given image of an object, the baseline view of Object 1 was run 20 times for 20 ts per simulation and 20 times for 40 ts per simulation. The means and standard deviations (unbiased estimate) of the four response metrics obtained in the 20-ts runs were max = 0.499 (SD = 0.016), mean = 0.304 (SD = 0.022), P = 1.0 (SD = 0.0), and MP = 0.304 (SD = 0.022). The values obtained in the 40-ts runs were max = 0.605 (SD = 0.005), mean = 0.436 (SD = 0.012), P = 1.0 (SD = 0.0), and MP = 0.436 (SD = 0.012). These figures are reported only to provide an estimate of the amount of random variation that can be expected in JIM's performance.
Translation, Size, Mirror Reversal, and Rotation in Depth
Recall that humans evidence no perceptual cost for image translations, scale changes, and mirror-image reversals (Biederman & Cooper, 1991a, 1992), and only a very modest cost for rotations in depth, even with nonsense objects (Gerhardstein & Biederman, 1991). Similarly, JIM's performance reveals complete invariance with translation, size, and mirror-image reversals. Although every test condition entailed translating the original image (it is impossible to scale, rotate, or mirror reflect an image without affecting where its constituent edges fall in the visual field), JIM was tested on one image that underwent only a translation from the baseline image. JIM was also tested with one scaled image (reduced to between 70% and 90% of its original size), one left-right mirror-reflected image, and one depth-rotated image of each object. Limitations of the stimulus creation program made precise rotations in depth impossible (the stimulus creation program represents objects as 2-D line drawings rather than 3-D models). Therefore, each depth-rotated image was created by rotating the object approximately 45° to 70° from its baseline orientation; these included rotations both
15 Because there is a stochastic component to the L1 and L2 cells' firing, a subset of the time slices will pass during any given run without any L1 or L2 cells firing (and, therefore, no other cells will fire). On these time slices, no data were gathered.
Figure 31. JIM's performance in the baseline, translation (Trans.) only, size, mirror-image reversed (Mirror), and depth rotated (Depth) conditions expressed in terms of the average maximum (Max), mean, proportion (P), and mean activation multiplied by proportion (MP) response metrics over objects in each condition. (These data were gathered in simulations lasting 20 time slices.)
Figure 32. JIM's performance in the baseline, translation (Trans.) only, size, mirror-image reversed (Mirror), and depth rotated (Depth) conditions expressed in terms of the average maximum (Max) response metric over objects in each condition. (These data were gathered in simulations lasting 40 time slices.)
about the object's vertical axis and about a horizontal axis per-
pendicular to the line of sight.
Figure 31 depicts JIM's performance in each of these condi-
tions in terms of all four response metrics in Block 1 (simula-
tions lasting 20 ts). By each response metric, performance in
these conditions was indistinguishable fro m baseline perfor-
mance. Figure 32 shows the max for these conditions in Block 2
(40 ts). These figures also reveal complete invariance in each
condition, with the exception of a very modest cost for rotation
in depth. This cost most likely reflects changes in the geons'
aspect ratios resulting from the depth rotations. Note that mean
P over objects was 1.0 for all conditions in Block 1, indicating
that in each simulation only the target cell achieved a mean
activation greater than zero. Although it is not shown, the mean
P over objects was also 1.0 for all these conditions in Block 2.
Rotation in the Visual Plane
By contrast to rotations in depth, humans evidence a percep-
tual cost for recognizing images that have been rotated in the
visual plane (Jolicoeur, 1985). Typically, subjects show a monotonic increase in response time and error rate with the number of degrees rotated from upright to approximately 135°. Subjects respond faster at 180° (when the object is upside down) than they do at 135°, producing a W-shaped rotation function over 360° (Jolicoeur, 1985).
JIM's performance under rotations in the visual plane was
tested with stimuli rotated 22.5°, 45°, 90°, 135°, and 180° from
the baseline images. Figures 33 and 34 show max as a function
of degrees of rotation for Blocks 1 and 2, respectively. Again,
JIM's performance revealed a trend very similar to that of hu-
man subjects. In terms of JIM's operation, the cusp in the rota-
tion function at 180° reflects two effects. First, for objects with
a primarily vertical structure (i.e., the spatial relations among
the geons include above and below, but not beside), rotations
between upright (0°) and 180° create spurious beside relations
that are absent in the upright and 180° views. For example,
rotate a lamp 45° in the visual plane and the lampshade, which
is above the base in the upright view, will be above and beside
the base. Continue to rotate the lamp until it is upside down
and this spurious beside relation disappears. Second, a 180°
rotation in the visual plane preserves the original coarse orien-
tations of the object's component geons (horizontal, diagonal, or
vertical) more than rotations greater or less than 180°.
Figure 33. JIM's recognition performance as a function of degrees rotated in the visual plane from baseline (0.0°). (Performance is expressed in terms of the average maximum [Max] response metric over objects in each condition. These data were gathered in simulations lasting 20 time slices.)
Figure 34. JIM's recognition performance as a function of degrees rotated in the visual plane from baseline (0.0°). (Performance is expressed in terms of the average maximum [Max] response metric over objects in each condition. These data were gathered in simulations lasting 40 time slices.)
Discussion of Test Simulations
The simulations reported earlier reveal a high degree of
translation, scale, and mirror-reflection invariance in JIM's rec-
ognition of objects. Performance with objects rotated in depth
and in the visual plane also resembles the performance of hu-
man subjects under similar circumstances. Comparison of re-
sults from Block 1 and Block 2 suggests that the length of time
for which a simulation is run has little or no qualitative impact
on the model's performance (mean and max were higher on
average for the 40 ts simulations, but the relative values across
conditions were unaffected). Similarly, comparison of the
various response metrics suggests that the observed results are
not dependent on the particular metric used. However, because
the results are based on simulations with only 10 objects, it
could be argued that they reflect the model's use of some sort of
diagnostic feature for each object, rather than the objects' struc-
tural descriptions.
This diagnostic feature hypothesis is challenged by the re-
sults of two additional test conditions conducted with images of
the objects in JIM's vocabulary. In the first condition, a scram-
bled image of each object was created by randomly rearranging
blocks of the original image such that the edges' orientations
were unchanged and the vertices remained intact. An example
of a scrambled image is given in Figure 35. If JIM's perfor-
mance reflects the use of a diagnostic list of 2-D features for
each object, it would be expected to achieve at least moderate
activation in the target cell corresponding to each scrambled
image (although the axes of parallelism and symmetry are de-
stroyed by the scrambling, the vertices are preserved), but if it is
sensitive to the relations among these features, it would be ex-
pected to treat these images as unfamiliar. In the second condi-
tion, the intact baseline images were presented to the model,
but the binding mechanism was disabled, forcing the separate
geons to fire in synchrony. This condition demonstrated the effect of accidental synchrony on JIM's performance; if the model's capacity for image parsing is truly critical to its performance, recognition would be expected to fail in this condition as well. Simulations in both conditions lasted for 40 ts. JIM's
performance on the scrambled image and forced accidental
synchrony conditions is shown in Figure 36. In both conditions,
performance was reduced to noise levels.
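The scrambling manipulation itself can be sketched as follows. This is our reconstruction, assuming the input is stored as a 22 × 22 array (as described in the Figure 35 caption); the function name is ours.

```python
import numpy as np

# Scrambled-image manipulation: cut the 22x22 input array into 2x2
# blocks, randomly permute the block positions, and reassemble. Each
# block's contents (hence local edge orientations and vertices) are
# preserved; only their positions change.

def scramble(image, block=2, rng=None):
    rng = rng if rng is not None else np.random.default_rng(0)
    n = image.shape[0] // block
    # Split into (n*n) tiles of shape (block, block), permute, reassemble.
    tiles = (image.reshape(n, block, n, block)
                  .swapaxes(1, 2)
                  .reshape(n * n, block, block))
    tiles = tiles[rng.permutation(n * n)]
    return (tiles.reshape(n, n, block, block)
                 .swapaxes(1, 2)
                 .reshape(n * block, n * block))

img = np.arange(22 * 22).reshape(22, 22)
scr = scramble(img)
```

Because the tiles are moved whole, the local features survive while the relations among them are destroyed, which is exactly the contrast the scrambled-image condition tests.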
JIM is capable of activating a representation of an object
given a line drawing of that object as input. Moreover, that
representation is invariant with translation, scale, and left-right
mirror reversal even when the model has previously been ex-
posed to only one view of the object. This section discusses the
important aspects of JIM's design and explores their implica-
tions. Specifically, the implications of the independent attrib-
ute representation used in L3 and L5 are reviewed shortly. The
model's use of dynamic binding plays a pivotal role in this ca-
pacity for independent attribute representations. The theoreti-
cal and empirical implications of using temporal synchrony for
dynamic binding are discussed in some detail. Also addressed
are additional findings for which JIM provides an account,
some novel predictions for human recognition performance,
and some limitations of the current architecture.
The Nature of Temporal Binding and
Its Empirical Consequences
What Temporal Binding Is Not
The notion of temporal binding is only starting to become
familiar to most psychologists, and its empirical consequences
are not obvious. Specifically, binding through temporal
synchrony as described here should not be confused with the
grouping of stimulus elements that are presented in close tem-
poral contiguity. Indeed, JIM produces temporal asynchrony
for the different geons in an object even though they are pre-
sented simultaneously. The confusion between binding through
synchrony and grouping features presented in close temporal
Figure 35. Left: The representation of the baseline image of Object 1 in JIM's first layer. (Line segments correspond to active cells in Layer 1, the location and orientation of a segment corresponding to the preferred location and orientation of the cell, respectively. Circles indicate locations in Layer 2 containing active vertex [or lone termination] cells. Right: The scrambled image version of the baseline image for Object 1. Vertices and edge orientations are preserved, but the positions of 2 × 2 blocks of the 22 × 22 cluster input layer are randomly rearranged.)
Figure 36. JIM's recognition performance in the baseline, scrambled image (Scr. Image), and forced accidental synchrony (Acc. Sync.) conditions. (Performance is expressed in terms of all response metrics [averaged over objects in each condition]. These data were gathered in simulations lasting 40 time slices. Max = maximum activation of the target cell during the simulation; P = proportion; MP = mean activation multiplied by proportion.)
contiguity is not hypothetical. A report by Keele, Cohen, Ivry, Liotti, and Yee (1988) is critical of temporal binding on the grounds that temporal contiguity in stimulus presentation did not predict feature binding.
Keele et al. (1988) conducted three experiments testing whether common location or common time of occurrence is a stronger cue to feature binding in rapidly presented images. In each experiment, illusory conjunctions were better predicted by common location in the visual field than by co-occurrence in time. Failing to find support for the primacy of stimulus contiguity in feature binding, these researchers rejected temporal binding theory. However, what their data show is that location is a more critical cue than temporal coincidence in determining how features should be conjoined. Such data do not and cannot falsify the hypothesis that synchronous activity is the manner in which a feature conjunction, even one established on the basis of common location, is represented. How a binding is represented and what cues are used to determine that binding are different issues entirely. Although it is possible to evaluate a given set of proposed binding cues with behavioral measures, we suggest that behavioral tests are inadequate in principle to falsify the temporal binding hypothesis. Rather, questions about the neural mechanism of binding, by definition, reside in the domain of neuroscience.
Physiological Evidence for Temporal Binding
Is there any evidence for binding through temporal synchrony that would be compatible with our proposal? Recently, Gray et al. (1989) reported the existence of a 50-Hz oscillatory potential in Area 17 of the anesthetized cat. The potentials were not of individual spikes (which were filtered out) but the summed dendritic activity of a large number of neurons in a cortical column. With multiple recordings, Gray et al. computed the cross correlation of this activity at different sites whose receptive fields had been determined. Moderately high cross correlations, suggesting phase locking of oscillatory activity,16 were observed for placements at adjacent sites, whatever the cells' orientation preferences, and for nearby sites that had similar orientation preferences. In JIM's terms, these could reflect the FELs corresponding to Condition 1, local coarse coding of image contours, among units with overlapping receptive fields and similar orientation preferences.
The most provocative results were obtained from recordings made at widely separated sites that had nonoverlapping receptive fields. The cross-correlation values for these sites were essentially zero unless the orientation preferences were collinear, in which case the values were positive, but modest. If bars were translated separately through the receptive fields but in opposite directions, as shown in Figure 37a, then the correlations were low. Translating the separate bars in a correlated motion, as shown in Figure 37b, increased the cross correlation. However, joining the two bars into one bar, so that the intervening portion of the visual field was bridged, as shown in Figure 37c, dramatically increased the value of the cross correlation.
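The logic of the cross-correlation measure can be illustrated with synthetic signals. This is our toy sketch, not the physiological analysis itself: it applies the normalized zero-lag statistic to artificial 50-Hz traces with arbitrary noise parameters.

```python
import numpy as np

# Two sites sharing a 50-Hz oscillation are highly correlated at zero
# lag; a site with independent activity is not.

rng = np.random.default_rng(1)
t = np.arange(0, 1, 0.001)                  # 1 s sampled at 1 kHz
common = np.sin(2 * np.pi * 50 * t)         # shared 50-Hz component
site_a = common + 0.3 * rng.standard_normal(t.size)
site_b = common + 0.3 * rng.standard_normal(t.size)
site_c = 0.3 * rng.standard_normal(t.size)  # independent activity

def corr0(x, y):
    """Normalized cross correlation at zero lag."""
    x = x - x.mean()
    y = y - y.mean()
    return float(x @ y / np.sqrt((x @ x) * (y @ y)))

print(corr0(site_a, site_b))  # high: the sites are phase locked
print(corr0(site_a, site_c))  # near zero: no shared oscillation
```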
JIM's Single-Unit Phase-Locking Predictions
We suggest two additional tests of the proposed solution to the binding problem using the Gray et al. (1989) experimental paradigm. In particular, we describe experiments to assess (a) whether the phase locking can turn corners and (b) whether collinear phase locking can be broken through intervening vertices.
Can phase locking turn corners? As a test of whether phase
locking can turn corners, consider three sites in visual cortex with noncollinear receptive fields, as shown in Figure 38a. These experiments might be best performed with awake primates trained to maintain fixation, in the manner of Moran and Desimone (1985). Assume that in the absence of stimulation, or with independent stimulation, the cross-correlation values between the sites are low. However, if Sites 1 and 2 were grouped into a single shape (Figure 38b), with Site 3 grouped into another shape, would the cross correlation be high only between Sites 1 and 2? If Sites 1 and 3 were grouped into a single shape, with Site 2 the odd unit out, Sites 1 and 3 would be expected to have a high cross correlation, with Site 2 uncorrelated (Figure 38c). Similarly, Sites 2 and 3 could be grouped, with Site 1 the odd unit out. An attractive aspect of this test is that it can be performed on any three sites, because the shapes can be designed after the tuning preferences and receptive fields for the units are determined.17
16 The cross correlations are in the domain of frequency rather than phase (C. M. Gray, personal communication, April 1991). That is, high cross correlations result when neurons' firing rates increase and decrease together. It is the phases of the curves describing firing rate that are observed to be locked in these investigations.
17 It is possible that within-shape correlations could be produced by an attentional spotlight (Crick, 1984). If a fourth site was added to the shape with the odd unit out in Figures 38b or 38c, spotlight theory would predict that there would be correlated firing within only one shape at a time. The theory presented here predicts within-shape correlated firing for both shapes simultaneously. (This analysis was proposed by G. E. Hinton, personal communication, February 1992.)
Figure 37. Summary of the stimulus conditions in the experiments by Gray, Konig, Engel, and Singer (1989). (Recordings were made from sites with distant, collinear receptive fields in Area 17 of the anesthetized cat. a: When the receptive fields were stimulated with separate bars of light moving in opposite directions, the cross correlation of the activity in the separate sites was low. b: When the bars were translated across the receptive fields in phase, cross correlations were moderate. c: Cross correlations were highest when the receptive fields were stimulated with a single long bar of light.)
Do vertices prevent collinear phase locking? Can collinear phase locking be broken by intervening vertices that group the collinear segments into separate shapes? For this problem, units with collinear receptive fields would first have to be recruited, as shown in Figure 39a. Assume that collinear bars of light, as shown in Figure 39b, would result in a high correlation. The critical test would be whether the phase locking would be broken if the bars' collinear endpoints were replaced by vertices, as shown in Figure 39c. Here we would expect that the phase locking would be greatly reduced.
Open Questions About Synchrony for Binding and
Structural Description
The data reported by Gray et al. (1989) suggest the existence of neural activity that might provide a basis for perceptual organization through temporal binding in a manner compatible with the FELs developed for JIM. However, they should be interpreted with caution. First, phase locking is inferred from cross correlations indicating an average phase lag of zero, but it is unclear whether this statistic reflects the phase locking in individual cells or simply a distribution of phase relations whose mean is a zero phase lag. A related question concerns whether the phase locking reflects lateral interactions among cortical neurons or is simply a statistical epiphenomenon (e.g., perhaps the phase relations reflect the statistical or temporal properties of the inputs to cortex). This statistical epiphenomenon explanation was challenged by a recent report by Engel, König, Kreiter, and Singer (1991). Using a paradigm similar to that of Gray et al., Engel et al. showed that synchrony could extend across the corpus callosum. Section of the callosum resulted in a loss of synchrony across the hemispheres but left the synchrony unaffected within the hemispheres.
Second, there is some debate as to whether phase locking plays a functional role in binding visual attributes or subserves some different function. For example, Bower (1991) has argued that similar (but global) phase locking in olfactory cortex plays a role in the modification of synaptic strengths but serves no purpose for binding attributes. Indeed, global phase locking (in which all active units fire approximately in phase) could not serve to differentiate bound subsets of active units. Third, phase locking in visual cortex has only been observed with moving stimuli and is notoriously difficult to measure with stationary stimuli (C. M. Gray, personal communication, April 1991). Finally, even if we assume that the observed phase locking in Area 17 of the cat were serving to bind visual attributes, it is unclear whether a 50-Hz oscillation is rapid enough to implement binding for real-time visual recognition.
Figure 38. Illustration of a multiple-electrode test of whether phase locking can turn corners, (a: Three receptor sites, 1, 2, and 3, with noncollinear receptive fields, b: Will Sites 1 and 2 fire in phase, with 3 out of phase? c: Will Sites 1 and 3 fire in phase, with 2 out of phase?)
Figure 39. Illustration of a multiple-electrode test of whether intervening vertices can break phase locking in collinear sites, (a: Two receptor sites, 1 and 2, with collinear receptive fields, b: Sites 1 and 2 should fire in phase, c: Will the phase locking between Sites 1 and 2 be broken by the intervening vertices?)
These problems raise the issue of what would be required for temporal synchrony to be a plausible solution to the binding problem for structural description and real-time shape recognition. In the remainder of this section, we offer some considerations and speculations with regard to this issue.
Timing considerations. A primary issue concerns the minimum timing requirements for binding and their compatibility with known neurophysiology. For the purposes of JIM, we have assumed that FELs operate with infinite speed, allowing two cells to synchronize instantly. Realistically, there will be some time cost associated with the propagation of an enabling signal across a FEL (unless FELs are implemented as electrical gap junction synapses). How will this time cost manifest itself in the imposition of synchrony on cells? Two considerations bear on this question. First, strict synchrony is not required for phase relations to carry binding information. Rather, it is only necessary that the phase lag between separate groups of cells be detectably and reliably greater than the phase lag between cells within a group. Second, König and Schillen (1991) present simulations demonstrating that some classes of coupled oscillators can phase lock (with zero phase lag) provided the transmission time of the coupling signal is one third or less of the oscillation period. Thus, it is possible to achieve tight synchrony even with nonzero transmission times.
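The König and Schillen result can be illustrated with a toy simulation. This is a minimal sketch under our own assumptions, not their model: two identical phase oscillators with sinusoidally coupled, delayed interactions (all parameter values hypothetical), showing that a transmission delay well under a third of a 20-ms period still yields near-zero phase lag.

```python
import math

def simulate(delay, period=20.0, k=0.5, dt=0.01, t_max=500.0):
    """Two phase oscillators with delayed mutual coupling (Euler steps).

    Returns the absolute phase difference (radians) after t_max ms.
    Parameters are illustrative, not taken from any published model.
    """
    omega = 2 * math.pi / period          # intrinsic frequency (rad/ms)
    steps = int(t_max / dt)
    lag = int(delay / dt)                 # coupling delay in steps
    th1 = [0.0] * (steps + 1)
    th2 = [1.0] * (steps + 1)             # start ~1 rad out of phase
    for i in range(steps):
        j = max(0, i - lag)               # each cell sees the other's past phase
        th1[i + 1] = th1[i] + dt * (omega + k * math.sin(th2[j] - th1[i]))
        th2[i + 1] = th2[i] + dt * (omega + k * math.sin(th1[j] - th2[i]))
    # wrap the final phase difference into (-pi, pi]
    diff = (th1[-1] - th2[-1] + math.pi) % (2 * math.pi) - math.pi
    return abs(diff)

# A 2-ms delay is a tenth of the 20-ms period: the pair still settles
# into (near) zero-lag synchrony.
print(simulate(delay=2.0))
```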
Other constraints on timing derive from the task of shape recognition. For our proposal to be viable, synchrony must be exploitable within the approximately 100 ms that it is estimated (Biederman, 1987b; Thorpe, 1990) the brain requires to activate a representation of an object from the first appearance of a stimulus. Thorpe estimated that 10 synapses must be traversed from the retina to inferior temporal cortex (or IT, where the final object representation presumably resides), leaving 10 ms per layer of computation, just enough time to generate one spike. At first, these estimates seem to preclude the use of temporal binding for recognition. However, because cells in Layer L do not need to wait for a spike to arrive at Layer L + 1 to spike a second time, additional cycles (spikes) would take approximately 10 ms each. (If the first spike from a given neuron in Layer L arrives at IT at time t, then a second spike from that neuron would arrive at time t + 10 ms.) Thus, two cycles would take not 200 ms but 110 ms. To exploit temporal binding would likely require multiple assemblies (such as GFAs) firing out of phase with one another within the period between t and t + 10 ms. Satisfying this temporal constraint would require a very rapid mechanism for establishing synchrony among the cells in an assembly.
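The arithmetic behind the "110 ms, not 200 ms" claim is simple pipelining, which a few lines make explicit (a sketch; the layer count and step time are the estimates quoted above):

```python
SYNAPSES = 10   # synapses from retina to IT (Thorpe's estimate)
STEP_MS = 10    # ms per layer of computation, roughly one spike

def arrival_at_it_ms(cycle):
    """Arrival time at IT of the spike launched on a given cycle.

    The first spike must traverse every layer in series (100 ms), but
    each later spike follows 10 ms behind its predecessor through the
    pipeline rather than waiting for it to reach IT.
    """
    return SYNAPSES * STEP_MS + (cycle - 1) * STEP_MS

print(arrival_at_it_ms(1))  # first cycle arrives at 100 ms
print(arrival_at_it_ms(2))  # second cycle arrives at 110 ms, not 200 ms
```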
Are fast enabling links necessary? Synchronized oscillations can easily be produced in neural networks without positing specialized connections for that purpose (e.g., see Atiya & Baldi, 1989; Grossberg, 1973). Indeed, most neural network models that exploit phase locking establish the phase locking using standard excitatory-inhibitory interactions among the cells. However, the independence of FELs and standard excitatory-inhibitory connections in JIM has important computational consequences. Specifically, this independence allows JIM to treat the constraints on feature linking (by synchrony) separately from the constraints on property inference (by excitation and inhibition). That is, cells can phase lock without influencing one another's level of activity and vice versa. Although it remains an open question whether a neuroanatomical analog of FELs will be found to exist, we suggest that the distinction between feature linking and property inference is likely to remain an important one.
Compatibility with moving images. The temporal binding mechanism that we have described here was designed for use with stationary images. An important question about such a binding mechanism concerns its compatibility with moving images. The answer to this question will ultimately hinge, at least in part, on the speed of such a mechanism: At what image velocity would a given visual feature, say, an image edge, remain within the receptive field of a neuron in V1 or V2 of visual cortex (where we assume the proposed binding mechanism to be operating) for too short a time to allow the binding mechanism to work, and how does this figure compare with the velocity at which human recognition performance begins to fail?
Cells in Area V1 have receptive field diameters between 0.5° and 2° of visual angle (Gross, 1973). Let us return to the estimate discussed earlier for the duration of a binding cycle (the time between successive spikes) and assume that an image property must remain in the receptive field of a cell for 10 ms for the binding mechanism to work reliably. To move in and out of the receptive field of a V1 cell with a receptive field of 0.5° in under a single cycle of this duration (thereby disrupting binding), an image would have to be translated at a velocity of 50°/s. Thus, even if we relax the constraints on the speed of binding to conform with the 50-Hz oscillation (20 ms per cycle) reported by Gray et al. (1989), the binding mechanism would be expected to be robust with translations up to 25°/s. Although shape from motion is a readily demonstrated phenomenon, we know of no data on the effect of motion on recognition performance of images that can be readily recognized when stationary.
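The velocity bound in this argument is a one-line computation; the sketch below simply restates it (receptive field size and cycle durations are the figures quoted in the text):

```python
def disruption_velocity(rf_deg, cycle_ms):
    """Translation speed (deg/s) at which an edge crosses a receptive
    field of rf_deg degrees in less than one binding cycle of cycle_ms
    milliseconds, i.e. the speed above which binding could fail."""
    return rf_deg * 1000.0 / cycle_ms

print(disruption_velocity(0.5, 10))  # 10-ms cycle -> 50.0 deg/s
print(disruption_velocity(0.5, 20))  # 50-Hz (20-ms) cycle -> 25.0 deg/s
```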
Accidental Synchrony
As noted previously, cells on separate FEL chains will fire in synchrony if their output refractories happen to go below threshold at the same time. When this type of accidental synchrony occurs between two cells, all the members of their respective FEL chains will also fire in synchrony, resulting in a binding error. As demonstrated in JIM's performance, such binding errors can have devastating effects on recognition. Currently, JIM is equipped with no provisions for preventing accidental synchrony. As such, it is at the mercy of stimulus complexity: The probability of JIM's suffering an accidental synchrony increases with the number of cells in L1 and L2 that are active. Were JIM required to deal with scenes of realistic complexity, it would be incapable of keeping the separate components of those scenes from accidentally synchronizing with one another. Although synchrony provides a solution to the dynamic binding problem, it is subject to an intrinsic capacity limitation. The value of this limit is proportional to the duration of the period between firings (e.g., spikes or bursts of spikes) divided by the duration of a firing. Note that this limit does not refer to the number of things that can be bound together. There is no a priori reason to expect a limit on the number of cells in an assembly or a FEL chain. Rather, it is a limit on the number of things that can be simultaneously active without being bound together.
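The capacity limit just described — the period between firings divided by the duration of a firing — can be stated as a one-line function (a sketch with hypothetical numbers, not parameters taken from JIM):

```python
def max_distinct_assemblies(period_ms, firing_ms):
    """How many groups of cells can fire within one period without any
    two groups' firings overlapping (and thereby merging by accident)."""
    return period_ms // firing_ms

# e.g., 25 ms between bursts and 5-ms bursts: at most 5 assemblies can
# be simultaneously active yet remain distinguishably out of phase.
print(max_distinct_assemblies(25, 5))  # 5
```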
To deal with the complexity of the natural visual environ-
ment, a biological system that used synchrony to perform dy-
namic binding would need mechanisms for reducing the likeli-
hood of accidental synchrony. It is tempting to speculate that
visual attention may serve this purpose (Hummel & Bieder-
man, 1991). Although we shall defer detailed consideration of
such an account of visual attention to a later article, two impli-
cations are worthy of note here. First, such an account posits a
role for visual attention that is the opposite of that proposed in
traditional spotlight models (e.g., Kahneman & Treisman, 1984;
Treisman, 1982; Treisman & Gelade, 1980). The spotlight
model holds that visual attention serves to actively bind visual
attributes together. By contrast, the limitations of synchrony as
a solution to dynamic binding suggest that attention may be
required to keep independent visual attributes separate. Sec-
ond, it would follow from such an account that attention should
serve to inhibit activity in unattended regions rather than en-
hance activity in attended ones: A straightforward way to re-
duce the likelihood that irrelevant visual features will acciden-
tally fire in synchrony with those currently of interest is to
inhibit them. The suggestion that attention should inhibit unat-
tended regions is supported by single unit recordings in the
macaque by Moran and Desimone (1985). They showed that a
stimulus within the receptive field of a V4 neuron would fail to
elicit a response from that neuron if the animal was not attend-
ing (though maintaining fixation) to a region of the visual field
within the neuron's receptive field.
Additional Implications of Fast Enabling Link Chains
JIM was developed to model the fundamental invariances of
human visual recognition, and the simulations reported here
were designed to evaluate its capacity to do so. However, there is
a variety of other findings for which JIM also provides a natural
account. Although we have run no formal simulations of them,
this section briefly presents some of these findings. This sec-
tion is included primarily to suggest additional implications of
the model and to underscore the generality of its fundamental
principles. More formal treatment of these findings is
warranted but beyond the scope of this article.
One critical principle underlying JIM is that binding occurs automatically for features at different locations in the visual field, provided they meet the conditions embodied in the FELs. This principle, and the specific FELs that implement it, provide a natural account of data recently reported by Donnelly, Humphreys, and Riddoch (1991). These investigators recorded reaction times (RTs) for the detection of an inward pointing L vertex among a configuration of three, four, or five outward pointing distractor vertices that formed all but one of the corners of a quadrilateral, pentagon, or hexagon, respectively (Figure 40). When the target was absent it was replaced by another outward pointing vertex in such a position as to add the last vertex to the shape. Examples from target absent and target present trials are shown in Figure 40, left and right columns, respectively.
Because the endpoints of the distractor vertices were collinear, these vertices would be synchronized by the FELs between distant collinear lone terminations (Condition III), producing a single assembly containing all the distractors. If RT is assumed to be a positive function of the number of independent assemblies in a stimulus (rather than the number of vertices), then no increase in RTs would be expected with the increase in the number of distractors. This is precisely what Donnelly et al. (1991) found. The assumption that subjects were sensitive to the number of assemblies rather than the number of vertices is further evidenced by the shorter RTs for the target absent responses (there would be two assemblies in the target present condition, one for the target and one for the distractors, and only one in the target absent condition).
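This account reduces the RT prediction to counting assemblies. The sketch below is our own schematic rendering, not JIM's grouping code: Condition III either merges all collinear-terminated distractors into one assembly or leaves each vertex separate, and RT is assumed to grow with the assembly count.

```python
def n_assemblies(n_distractors, groupable, target_present):
    """Count independent assemblies in a Donnelly et al.-style display.

    groupable: True when the distractors' lone terminators are collinear
    without passing through a vertex (Condition III applies), so all
    distractors merge into a single assembly.
    """
    distractor_groups = 1 if groupable else n_distractors
    return distractor_groups + (1 if target_present else 0)

# Outward-pointing distractors (Figure 40): flat search function.
print([n_assemblies(n, True, False) for n in (3, 4, 5)])   # [1, 1, 1]
# Inward-pointing distractors (Figure 41): assemblies, and hence
# predicted RT, grow linearly with the number of distractors.
print([n_assemblies(n, False, False) for n in (3, 4, 5)])  # [3, 4, 5]
```

The same function captures the faster target-absent responses in the groupable displays: a target-present trial has two assemblies, a target-absent trial only one.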
By contrast, consider Figure 41, which shows another condition studied by Donnelly et al. (1991); target absent trials are shown in the left column and target present in the right. These stimuli were similar to those described earlier, except that the distractor vertices were pointing inward and the target outward. Note that with these stimuli, the lone terminators of the separate vertices no longer satisfy the conditions for grouping defined in Condition III, as they would be collinear only by passing through a vertex. As such, JIM would not bind these vertices, and there would be one assembly per vertex. By the assumption that RT is a positive function of the number of assemblies, JIM would predict an increase in search times as a function of the number of distractors in this condition. Again, this is what Donnelly et al. (1991) reported: RTs for detecting a target vertex increased linearly with the number of distractor vertices.
The principle of not grouping collinear edges through vertices also provides an account of the well-known demonstra-
Figure 40. Displays for a search task in which the target was an inward-pointing vertex and the terminators of the distractors were collinear without passing through a vertex. Search times were unaffected by the number of vertices. (From "Parallel Computation of Primitive Shape Descriptions" by N. Donnelly, G. W. Humphreys, and M. J. Riddoch, 1991, Journal of Experimental Psychology: Human Perception and Performance, 17, Figure 1, p. 563. Copyright 1991 by the American Psychological Association. Reprinted by permission.)
tions of Bregman (1981) and Leeper (1935; Hummel & Biederman, 1991). Bregman's demonstration is shown in Figure 42 (Panels a and b). It is difficult to form a coherent percept of the image in Panel a. However, when the same elements are presented in the presence of what appears to be an occluding surface (Panel b), it is easy to see that the figure depicts an array of Bs. Bregman interpreted this effect in terms of the cues for
Figure 41. Displays for a search task in which the target was an outward-pointing vertex and the terminators of the distractors were collinear only through a vertex. Search times increased with the number of vertices. (From "Parallel Computation of Primitive Shape Descriptions" by N. Donnelly, G. W. Humphreys, and M. J. Riddoch, 1991, Journal of Experimental Psychology: Human Perception and Performance, 17, Figure 3, p. 565. Copyright 1991 by the American Psychological Association. Reprinted by permission.)
grouping provided by the occluding surface. Blickle (1989) proposed a nonocclusional account of this demonstration. He noted that accidental L vertices were produced where the occluder was removed in the nonoccluded image (Figure 42a) and hypothesized that they may have prevented the elements from grouping. When the L vertices are removed, the elements once again form a coherent percept even in the absence of an explicit occluder (Figure 42c). Blickle applied the same analysis toward understanding why Leeper's (1935) figures, such as the elephant in Figure 42d, were so difficult to identify. Blickle argued that the L vertices prevent the visual system's grouping the separate relevant contours of the fragments into a common object. Indeed, when the L vertices are removed, the elements become much easier to group (Figure 42e). The role played by the L vertices in these demonstrations is precisely what would be expected based upon JIM's account of grouping.
A similar analysis applies to Biederman's (1987b) study on
Figure 42. Two demonstrations of the inhibition of collinear grouping of end-stopped segments by L vertices, (a: These image fragments [Bregman, 1982] are not identifiable as familiar forms, b: Bregman showed that with the addition of an occluder, the Bs become apparent. Blickle (1989) noted that the image fragments in Bregman's nonoccluded image (Panel a) contained accidental L vertices, formed where contour from the occluder was left in the figure, that impede the fragments' grouping into objects. When the L vertices are removed, as in Panel c (Blickle, 1989) the Bs readily appear, even in the absence of the occluder. Blickle performed a similar analysis of the difficulty in recognizing the Leeper (1935) figures, d: Leeper's elephant, e: Like Bregman's Bs, Blickle's removal of the accidental L vertices made the objects' shape more apparent. (Panels a and b are from "Asking the 'What for' Question in Auditory Perception," pp. 106-107, by A. S. Bregman, 1982, in M. Kubovy and J. R. Pomerantz, Perceptual Organization, Hillsdale, NJ: Erlbaum. Copyright 1982 by Erlbaum. Adapted by permission. Panel d is from "A Study of a Neglected Portion of the Field of Learning: The Development of Sensory Organization" by R. Leeper, 1935, Journal of Genetic Psychology, 46, Figure 2, p. 50. Reprinted with permission of the Helen Dwight Reid Educational Foundation. Published by Heldref Publications, 1319 18th Street, NW, Washington, DC 20036-1802. Copyright 1935. Panels c and e are from "Recognition of Contour Deleted Images" by T. W. Blickle, 1989, p. 20, Unpublished doctoral dissertation, State University of New York at Buffalo. Adapted by permission.)
the effects of amount and locus of contour deletion on recognition performance. Error rates and RTs increased with increases in the amount of contour deleted from an image, but for a constant amount of contour deletion, performance suffered more when the contour was removed from the vertices than when it was removed at midsegment. Recall that JIM groups the contours comprising a geon using the vertices produced where those contours coterminate. As such, it is unable to group contours into geons when all the vertices have been removed and predicts that recognition should fail. By contrast, when contour is removed at midsegment, the vertices can still be grouped by means of the FELs between collinear distant lone terminations. Although midsegment deletion partially removes the axes of symmetry in a geon's image, the remaining features can still be grouped, and recognition should be possible, though slowed. Thus, JIM predicts the difference between performance with vertex-deleted and midsegment-deleted stimuli. JIM cannot currently predict that recognition is possible at all in the vertex-deleted condition. This failure likely reflects JIM's current inability to exploit information at different spatial scales. It is possible that other sources of information, such as an object's axis structure, can be used for classification even when the edge contour has been removed near the points where the edges coterminate.
The final effect we shall note here was observed by Malcus (1982). Malcus found that extraneous contour introduced into an image impeded recognition performance more when it passed through the vertices in the image than when it passed through edge midsegments. If an extraneous contour crosses a contour of the original image at midsegment, an X vertex will be produced. Recall that JIM does not group contours that meet at vertices with more than three prongs. Contours forming X vertices (containing four prongs) will therefore remain independent; that is, the geon to which the original contour belongs will not fire in synchrony with the extraneous contour and will therefore be processed normally in JIM's third layer. By contrast, if an extraneous contour is passed through a vertex in the image of a geon, that vertex will no longer be detected in the model's second layer (e.g., a three-pronged vertex with an extra contour through it becomes a five-pronged vertex) and will not be available for grouping the contours that actually belong to the geon. If enough vertices are disrupted in this manner, the geon will be unrecoverable.
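The grouping rule at work in this account turns entirely on prong counts, which can be summarized in a few lines (a schematic restatement of the rule, not the model's actual second-layer code):

```python
def supports_grouping(prongs):
    """Contours are grouped only at vertices of two or three prongs;
    vertices with more prongs leave their contours independent."""
    return prongs in (2, 3)

# An extraneous contour crossing an edge at midsegment creates an
# X vertex (4 prongs): the original and extraneous contours stay in
# separate assemblies, and the geon is processed normally.
print(supports_grouping(4))                        # False
# The same contour drawn through an existing 3-pronged vertex yields
# a 5-pronged vertex, destroying a grouping site the geon needs.
print(supports_grouping(3), supports_grouping(5))  # True False
```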
Implications of Single-Place Predicate Relations of JIM
In this section, we describe two effects in human visual recognition predicted by JIM's treatment of relations. JIM represents the relations among objects as single-place predicates, that is, predicates that take only one argument. For example, the representation of above specifies its subject (i.e., of which geon "aboveness" is true) by firing in synchrony with the other attributes of that geon, but it does not specify its object (which geon the subject is above). Thus, the representation of Geon A in Figure 43 would specify that it is above another geon, but not whether it is above B or above C. The object of the relation above is specified only implicitly as the subject of below: That A is above B can be determined only because above fires in synchrony with A and below with B. As we have argued earlier, implicit specification of a property (a relation, a binding, or, in this case, a case role assignment) is prone to serious weaknesses. Here, the weakness is a propensity for confusions when multiple geons in an object share the same relations.
Consider, for example, a totem pole-like object in which multiple geons are stacked vertically, as shown in Figure 44. For totem poles of two or three geons, each geon will be described by a unique combination of above and below (Figure 44a), so there should be little possibility for confusion between one totem pole and the same geons stacked in a different order. However, when a fourth geon is added, the two central geons will share the same above-below description; that is, both geons will be bound to both above and below, as shown in Figure 44b. As such, they could switch places and the representation of the totem pole would not change (Figure 44c). JIM thus predicts greater confusability for totem poles with four or more geons than for totem poles with two or three geons. Confusability will likely increase with the number of geons, but it should jump markedly between three and four. Also, confusability should be greater when geons in the center of the stack change positions than for changes involving geons on either end.
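The totem-pole ambiguity can be made concrete by reducing each geon to the set of single-place relations bound to it, which is all the relational information JIM preserves (our own illustrative encoding, not the model's unit-level code):

```python
def stack_description(n_geons):
    """Single-place above/below description of a vertical stack, geon 0
    on top: each geon keeps only the relations bound to it, with no
    argument saying WHAT it is above or below."""
    return [frozenset(
                (["below"] if i > 0 else []) +
                (["above"] if i < n_geons - 1 else []))
            for i in range(n_geons)]

three, four = stack_description(3), stack_description(4)
print(len(set(three)) == 3)   # True: all three positions are unique
print(four[1] == four[2])     # True: the two middle geons are identical,
                              # so swapping them changes nothing
```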
JIM makes similar predictions for horizontal arrays of geons (Figure 45) except that, for an equivalent number of geons, confusability should be higher in horizontal arrays than for vertical arrays. This prediction derives from JIM's not discriminating left-right relations; rather, both are coded simply as beside. As such, all the geons in a horizontal array will have the beside relation, with their respective serial positions in the array unspecified.
Limitations of the Current Implementation
The current implementation of JIM is limited in a number of respects. Some of these limitations derive from simplifying assumptions we have made to keep the implementation manageable, and others reflect fundamental questions that remain to be addressed. We shall focus the discussion primarily on the limitations of the binding mechanism.
Figure 43. JIM would represent Geon A as above something, but it would not specify that it is above Geon B rather than Geon C.
Figure 44. Example stimuli in the proposed "totem pole" experiments, (a: For totem poles with two or three geons, each geon is described by a unique combination of above and below. The geons in these totem poles could not change places without changing their representation in JIM's third and fifth layers, b: The middle geons in a four-geon totem pole have the same above-below description, c: The middle geons in a four-geon totem pole can change places without changing their representation in Layers 3 and 5.)
Limitations of the Binding Mechanism
For many images, the set of FELs posited is sufficient to group local features into geons while allowing the features of separate geons to remain independent, as discussed earlier. Some images JIM cannot parse were presented earlier (Figures 18 and 19). These shortcomings may constitute empirical predictions in that human subjects may evidence greater difficulty recognizing images that JIM cannot segment than those that it can. However, it is clear that the degree of difficulty JIM would demonstrate with such images is unrealistically great.
To remedy these difficulties, at least two major extensions to the current grouping mechanism will likely be required. First, the FELs need to be generalized to allow cells to actively resist firing in phase. Such a mechanism could help the model deal with accidental alignments and parts that meet at L vertices by allowing many desynchronizing effects to overcome the synchronizing effects that occur at the points where separate geons meet. Naturally, the conditions under which image features should resist synchrony would have to be specified. The second extension that would improve the model's capacity for feature grouping is to provide contingencies for global constraints for grouping. For example, axes of symmetry could be used for grouping, as described in Mohan (1989), or convex image patches, as in Vaina and Zlateva (1990). However, how such grouping constraints could be plausibly implemented in a neural network architecture remains an open question.
Other Limitations of the Current Implementation
Several other extensions suggest themselves for exploration as well. Among the first is the compatibility of the proposed constraints for parsing and recognition with natural images. In this implementation, we have assumed that recognition starts with a clean representation of the surface and depth discontinuities describing a line drawing of an object. Implicitly, we have also assumed that this representation could be achieved without top-down influences from geons or objects. However, deriving a representation of orientation and depth discontinuities from natural images is clearly a nontrivial problem, and it may be the case that it cannot be solved without top-down mediation. These considerations motivate two lines of exploration. The first is the compatibility of our FELs with output from available edge-detection algorithms, and the second is the use of FELs and geons to interactively derive depth discontinuities from natural images. In particular, edge definition may be facilitated by constraints from geons (Biederman, 1987b).
JIM's capacity for representing shape is most limited by its impoverished vocabulary of geons and relations. JIM is capable of discriminating eight geon types, whereas RBC posits 24. The most important relations omitted from JIM's current vocabulary are the connectedness relations, which describe how the various geons in an image are connected to one another. The simplest of these is the binary relation connected versus not connected. JIM does not discriminate two geons that are physically joined from two that are not. For example, its representation of a table would express only the relative angles, locations, and sizes of the various parts (such as the top and the legs), neglecting that the top is connected to each of the legs but that the legs do not touch one another. Other connectedness relations include whether two geons are joined end to end (like the cylinders of a flashlight) or end to side (like the join between a camera body and lens) and centeredness (whether one geon connects to another near the center of the latter's side or near an edge).
Expanding the current architecture to capture more relational and shape attributes will require additional structures. In particular, it is not clear that the current architecture in L4 and L5 could be applied directly to the problem of deriving connectedness relations. However, given that architectures capable of
Figure 45. All geons in a horizontal array have the same relative position description: They are all described as beside something.
deriving them can be described, new attributes can be added to
JIM's representation at a cost of one cell per attribute. None of
the additional properties seem incompatible with the existing
structure. Therefore, there is no reason to think that expanding
the model to capture them will require violating any important
assumptions underlying its current design.
Among JIM's most serious weaknesses is an inability to deal
with multiple objects in an image. Expanding it to deal with
multiple objects will almost certainly entail addressing ques-
tions of scale, visual attention, and additional problems in
grouping such as figure-ground segmentation. Although we do
not expect these extensions to be straightforward, we also do
not expect that they will require abandoning the basic tenets
underlying JIM's current design. If we regard JIM's current
domain as a subset of the processing that occurs within the
focus of attention, then its failure with multiple objects is not
psychologically unrealistic. Biederman, Blickle, Teitelbaum,
and Klatsky (1988) found that search time for a target object in
a nonscene display (objects were arrayed in a circular arrange-
ment, as the numbers on the face of a clock) was a linear func -
tion of the number of distractor objects in the display, suggest-
ing that subjects were attending to one object at a time. This
finding seems anomalous in the context of the finding that
complete scenes can be recognized in the same time it takes to
recognize a single object (Biederman, 1981). However, Mezza-
notte (1981) demonstrated that scene recognition does not re-
quire that the individual objects in the scene be identifiable in
isolation. Rather, as argued by Biederman (1988), familiar
groups of interacting objects may be treated by the visual sys-
tem as a single object, that is, as configurations of geons in
particular relations.
Summary and Conclusions
We have described a neural net architecture that takes as
input a retinotopically mapped representation of the surface
and depth discontinuities in an object's image and activates a
representation of the object that is invariant with translation
and scale and largely invariant with viewpoint. JIM's behavior
conforms wel l to empirical data on human object recognition
performance. The fundamental principle underlying its design
is that an object is represented as a structural description spec-
ifyin g its parts and the relations among them. This design prin-
ciple frees JIM from trading of f attribute structures for an im-
plicit representation of the relations among those attributes.
Also, it permits shape representation to be achieved with a re-
markably small number of units. JIM's capacity for structural
description derives from its solution to the dynamic binding
problem. Dynamic binding is thus critical for shape representa-
tion, but it is subject to intrinsic capacity limitations. In the case
of binding through synchrony, the limits derive from the temporal parameters of cells and the links among them. We speculate that observed limitations on visual attention in human subjects may reflect the limits of a natural dynamic binding mechanism.
References
Abeles, M. (1982). Local cortical circuits: Studies of brain function (Vol. 6). New York: Springer.
Atiya, A., & Baldi, P. (1989). Oscillations and synchronizations in neural networks: An exploration of the labeling hypothesis. International Journal of Neural Systems, 1, 103-124.
Baldi, P., & Meir, R. (1990). Computing with arrays of coupled oscillators: An application to pre-attentive texture discrimination. Neural Computation, 2, 459-471.
Biederman, I. (1981). On the semantics of a glance at a scene. In M. Kubovy & J. R. Pomerantz (Eds.), Perceptual organization (pp. 213-263). Hillsdale, NJ: Erlbaum.
Biederman, I. (1987a). Matching image edges to object memory. Proceedings of the First International Conference on Computer Vision (pp. 384-392). Washington, DC: IEEE.
Biederman, I. (1987b). Recognition-by-components: A theory of human image understanding. Psychological Review, 94, 115-147.
Biederman, I. (1988). Aspects and extensions of a theory of human image understanding. In Z. Pylyshyn (Ed.), Computational processes in human vision (pp. 370-428). Norwood, NJ: Ablex.
Biederman, I., Blickle, T., Teitelbaum, R., & Klatsky, G. (1988). Object search in nonscene displays. Journal of Experimental Psychology: Learning, Memory, and Cognition, 14, 456-467.
Biederman, I., & Cooper, E. E. (1991a). Evidence for complete translational and reflectional invariance in visual object priming. Perception, 20, 585-593.
Biederman, I., & Cooper, E. E. (1991b). Priming contour-deleted images: Evidence for intermediate representations in visual object recognition. Cognitive Psychology, 23, 393-419.
Biederman, I., & Cooper, E. E. (1992). Size invariance in human shape recognition. Journal of Experimental Psychology: Human Perception and Performance, 18, 121-133.
Biederman, I., & Hilton, H. J. (1991). Metric versus nonaccidental determinants of object priming. Manuscript in preparation.
Blickle, T. W. (1989). Recognition of contour deleted images. Unpublished doctoral dissertation, State University of New York at Buffalo.
Bower, J. (1991, April). Oscillations in cerebral cortex: The mechanisms underlying an epiphenomenon. Paper presented at the Workshop on Neural Oscillations, Tucson, AZ.
Brady, M. (1983). Criteria for representations of shape. In J. Beck, B. Hope, & A. Rosenfeld (Eds.), Human and machine vision (pp. 39-84). San Diego, CA: Academic Press.
Brady, M., & Asada, H. (1984). Smoothed local symmetries and their implementation. International Journal of Robotics Research, 3, 36-61.
Bregman, A. S. (1981). Asking the "what for" question in auditory perception. In M. Kubovy & J. R. Pomerantz (Eds.), Perceptual organization (pp. 99-118). Hillsdale, NJ: Erlbaum.
Crick, F. H. C. (1984). The function of the thalamic reticular spotlight: The searchlight hypothesis. Proceedings of the National Academy of Sciences, 81, 4586-4590.
Crick, F. H. C., & Koch, C. (1990). Towards a neurobiological theory of consciousness. Seminars in Neuroscience, 2, 263-275.
Donnelly, N., Humphreys, G. W., & Riddoch, M. J. (1991). Parallel computation of primitive shape descriptions. Journal of Experimental Psychology: Human Perception and Performance, 17, 561-570.
Eckhorn, R., Bauer, R., Jordan, W., Brosch, M., Kruse, W., Munk, M., & Reitboeck, H. J. (1988). Coherent oscillations: A mechanism of feature linking in the visual cortex? Multiple electrode and correlation analysis in the cat. Biological Cybernetics, 60, 121-130.
Eckhorn, R., Reitboeck, H., Arndt, M., & Dicke, P. (1990). Feature linking via synchronization among distributed assemblies: Simulations of results from cat visual cortex. Neural Computation, 2, 293-307.
Edelman, S., & Poggio, T. (1990, April). Bringing the grandmother back into the picture: A memory-based view of object recognition. MIT Artificial Intelligence Memo No. 1181.
Edelman, S., & Weinshall, D. (1991). A self-organizing multiple-view representation of 3D objects. Biological Cybernetics, 64, 209-219.
Engel, A. K., Konig, P., Kreiter, A. K., & Singer, W. (1991). Interhemispheric synchronization of oscillatory neuronal responses in cat visual cortex. Science, 252, 1177-1179.
Feldman, J. A. (1982). Dynamic connections in neural networks. Biological Cybernetics, 46, 27-39.
Feldman, J. A., & Ballard, D. H. (1982). Connectionist models and their properties. Cognitive Science, 6, 205-254.
Fodor, J. A., & Pylyshyn, Z. W. (1988). Connectionism and cognitive architecture: A critical analysis. Cognition, 28, 1-71.
Gerhardstein, P. C., & Biederman, I. (1991, May). Priming depth-rotated object images: Evidence for 3D invariance. Paper presented at the Meeting of the Association for Research in Vision and Ophthalmology, Sarasota, FL.
Gray, C. M., Konig, P., Engel, A. E., & Singer, W. (1989). Oscillatory responses in cat visual cortex exhibit inter-column synchronization which reflects global stimulus properties. Nature, 338, 334-337.
Gray, C. M., & Singer, W. (1989). Stimulus specific neuronal oscillations in orientation columns of cat visual cortex. Proceedings of the National Academy of Sciences, 86, 1698-1702.
Gross, C. G. (1973). Visual functions of inferotemporal cortex. In R. Jung (Ed.), Handbook of sensory physiology: Vol. VII/3. Central processing of information. Part B. Visual centers in the brain (pp. 451-482). Berlin: Springer-Verlag.
Grossberg, S. (1973). Contour enhancement, short-term memory, and constancies in reverberating neural networks. Studies in Applied Mathematics, 52, 217-257.
Grossberg, S., & Somers, D. (1991). Synchronized oscillations during cooperative feature linking in a cortical model of visual perception. Neural Networks, 4, 453-466.
Guzman, A. (1971). Analysis of curved line drawings using context and global information. Machine Intelligence, 6, 325-375. Edinburgh, Scotland: Edinburgh Press.
Hinton, G. E. (1981). A parallel computation that assigns canonical object-based frames of reference. Proceedings of the 7th International Joint Conference on Artificial Intelligence. Symposium conducted at the University of British Columbia, Vancouver, British Columbia, Canada.
Hinton, G. E., McClelland, J. L., & Rumelhart, D. E. (1986). Distributed representations. In D. E. Rumelhart, J. L. McClelland, and the PDP Research Group, Parallel distributed processing: Explorations in the microstructure of cognition. Volume 1: Foundations (pp. 77-109). Cambridge, MA: MIT Press/Bradford Books.
Hoffman, D. D., & Richards, W. A. (1985). Parts of recognition. Cognition, 18, 65-96.
Hummel, J. E., & Biederman, I. (1990a). Dynamic binding: A basis for the representation of shape by neural networks. In M. P. Palmarini (Ed.), Proceedings of the 12th Annual Conference of the Cognitive Science Society (pp. 614-621). Hillsdale, NJ: Erlbaum.
Hummel, J. E., & Biederman, I. (1990b, November). Binding invariant shape descriptors for object recognition: A neural net implementation. Paper presented at the 21st Annual Meeting of the Psychonomics Society, New Orleans, LA.
Hummel, J. E., & Biederman, I. (1991, May). Binding by phase locked neural activity: Implications for a theory of visual attention. Paper presented at the Annual Meeting of the Association for Research in Vision and Ophthalmology, Sarasota, FL.
Hummel, J. E., Biederman, I., Gerhardstein, P. C., & Hilton, H. J. (1988). From image edges to geons: A connectionist approach. In D. Touretsky, G. Hinton, & T. Sejnowski (Eds.), Proceedings of the 1988 Connectionist Models Summer School (pp. 462-471). San Mateo, CA: Morgan Kaufman.
Jolicoeur, P. (1985). The time to name disoriented natural objects. Memory & Cognition, 13, 289-303.
Kahneman, D., & Treisman, A. (1984). Changing views of attention and automaticity. In R. Parasuraman, R. Davies, & J. Beatty (Eds.), Varieties of attention (pp. 29-62). San Diego, CA: Academic Press.
Keele, S. W., Cohen, A., Ivry, R., Liotti, M., & Yee, P. (1988). Test of a temporal theory of attentional binding. Journal of Experimental Psychology: Human Perception and Performance, 14, 444-452.
Konig, P., & Schillen, T. B. (1991). Stimulus-dependent assembly formation of oscillatory responses: I. Synchronization. Neural Computation, 3, 167-178.
Lange, T., & Dyer, M. (1989). High-level inferencing in a connectionist neural network (Tech. Rep. No. UCLA-AI-89-12). Los Angeles: University of California, Computer Science Department.
Leeper, R. (1935). A study of a neglected portion of the field of learning: The development of sensory organization. Journal of Genetic Psychology, 46, 41-75.
Lindsay, P. H., & Norman, D. A. (1977). Human information processing: An introduction to psychology (2nd ed.). San Diego, CA: Academic Press.
Lowe, D. G. (1987). The viewpoint consistency constraint. International Journal of Computer Vision, 1, 57-72.
Malcus, L. (1982). Contour formation, segmentation, and semantic access in scene and object perception. Unpublished doctoral dissertation, State University of New York at Buffalo.
Malik, J. (1987). Interpreting line drawings of curved objects. International Journal of Computer Vision, 1, 73-103.
Marshall, J. A. (1990). A self-organizing scale-sensitive neural network. Proceedings of the International Joint Conference on Neural Networks, 3, 649-654.
Mezzanotte, R. J. (1981). Accessing visual schemata: Mechanisms invoking world knowledge in the identification of objects in scenes. Unpublished doctoral dissertation, State University of New York at Buffalo, Department of Psychology.
Milner, P. M. (1974). A model for visual shape recognition. Psychological Review, 81, 521-535.
Mohan, R. (1989). Perceptual organization for computer vision. Unpublished doctoral dissertation, University of Southern California, Department of Computer Science.
Moran, J., & Desimone, R. (1985). Selective attention gates visual processing in the extrastriate cortex. Science, 229, 782-784.
Neisser, U. (1967). Cognitive psychology. New York: Appleton-Century-Crofts.
Pinker, S. (1986). Visual cognition: An introduction. In S. Pinker (Ed.), Visual cognition (pp. 1-63). Cambridge, MA: MIT Press.
Rock, I., & DiVita, J. (1987). A case of viewer-centered perception. Cognitive Psychology, 19, 280-293.
Sejnowski, T. J. (1986). Open questions about computation in cerebral cortex. In J. L. McClelland & D. E. Rumelhart (Eds.), Parallel distributed processing: Explorations in the microstructure of cognition (pp. 372-389). Cambridge, MA: MIT Press.
Selfridge, O. G. (1959). Pandemonium: A paradigm for learning. Symposium on the Mechanism of Thought Processes. London: Her Majesty's Stationery Office.
Selfridge, O. G., & Neisser, U. (1960). Pattern recognition by machine. Scientific American, 203, 60-68.
Shastri, L., & Ajjanagadde, V. (1990). From simple associations to systematic reasoning: A connectionist representation of rules, variables and dynamic bindings (Tech. Rep. No. MS-CIS-90-05). Philadelphia: University of Pennsylvania, Department of Computer and Information Sciences.
Smolensky, P. (1987). On variable binding and the representation of symbolic structures in connectionist systems (Internal Rep. No. CU-CS-355-87). Boulder, CO: University of Colorado, Department of Computer Science & Institute of Cognitive Science.
Strong, G. W., & Whitehead, B. A. (1989). A solution to the tag-assignment problem for neural networks. Behavioral and Brain Sciences, 12, 381-433.
Tarr, M. J., & Pinker, S. (1989). Mental rotation and orientation dependence in shape recognition. Cognitive Psychology, 21, 233-283.
Thorpe, S. (1990). Spike arrival times: A highly efficient coding scheme for neural networks. In R. Eckmiller, G. Hartmann, & G. Hauske (Eds.), Parallel processing in neural systems and computers (pp. 91-94). Amsterdam: North-Holland.
Treisman, A. (1982). Perceptual grouping and attention in visual search for objects. Journal of Experimental Psychology: Human Perception and Performance, 8, 194-214.
Treisman, A., & Gelade, G. (1980). A feature integration theory of attention. Cognitive Psychology, 12, 97-136.
Ullman, S. (1989). Aligning pictorial descriptions: An approach to object recognition. Cognition, 32, 193-254.
Vaina, L. M., & Zlateva, S. D. (1990). The largest convex patches: A boundary-based method for obtaining object parts. Biological Cybernetics, 62, 225-236.
von der Malsburg, C. (1981). The correlation theory of brain function (Internal Rep. No. 81-2). Gottingen, Germany: Max-Planck-Institute for Biophysical Chemistry, Department of Neurobiology.
von der Malsburg, C. (1987). Synaptic plasticity as a basis of brain organization. In J. P. Changeux & M. Konishi (Eds.), The neural and molecular bases of learning (pp. 411-432). New York: Wiley.
Waltz, D. (1975). Generating semantic descriptions from drawings of scenes with shadows. In P. Winston (Ed.), The psychology of computer vision (pp. 19-91). New York: McGraw-Hill.
Wang, D., Buhmann, J., & von der Malsburg, C. (1991). Pattern segmentation in associative memory. Neural Computation, 2, 94-106.
Winston, P. (1975). Learning structural descriptions from examples. In P. Winston (Ed.), The psychology of computer vision (pp. 157-209). New York: McGraw-Hill.
Appendix
Response Metrics
Max_i represents the highest activation value (A_i) achieved by a given target cell i during a run of N time slices. Mean_i is calculated as the arithmetic mean of the target cell's activation (A_i) over the N time slices. Max and mean provide raw estimates of the recognizability of an object in a particular condition (i.e., the match between the structural description activated in response to the image of an object in a given condition and the structural description used for familiarization with that object). Because they consider the activation of the target cell only, max and mean will tend to be misleading if all object cells achieve high activations in response to every image.
P is a response metric designed to overcome this difficulty. P_i for a given target cell i on a given run is calculated as the target cell's Mean_i divided by the sum of all above-zero object cell mean activations:

P_i = Mean_i / (Mean_i + Σ_j Mean_j),  Mean_j > 0,

where j corresponds to a nontarget object cell. This metric provides a measure of the discriminability of the target object from the population of nontargets given a particular image. Although it is not subject to the same criticism as max and mean, P is a less sensitive metric when the mean activation of nontarget object cells is low. For example, assume that the baseline view of a given object produces a mean target cell activation of 0.50 (with all nontarget object cell means below zero), and another view produces a mean of 0.01 (nontarget means below zero). With these values, P would provide the misleading impression that the model had performed identically with the two views: 0.50/(0.50 + 0 + . . . + 0) = 0.01/(0.01 + 0 + . . . + 0) = 1.0. An additional difficulty with P as a metric is that it is sensitive only to differences among object cells with mean activations above zero.
The final response metric, MP, is designed to reflect both the raw recognizability of a view of an object and its discriminability from the other objects in the set. MP_i is calculated as the product of Mean_i and P_i. This metric suffers the same insensitivity to negative numbers as does P, but is prone neither to P's tendency to mask large differences between conditions in the face of inactive nontargets, nor to the tendency of max and mean to mask indiscriminate responding.
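As a concrete illustration of the four metrics, they can be computed from a record of object cell activations over a run. The following sketch assumes a cells-by-time-slices NumPy array; the function name and array layout are our own choices for illustration, not part of the model's specification:

```python
import numpy as np

def response_metrics(activations, target):
    """Compute Max, Mean, P, and MP for one run, per the Appendix definitions.

    activations: array of shape (n_object_cells, n_time_slices) holding
        each object cell's activation at each time slice.
    target: row index of the target object cell.
    Assumes the target's mean activation is positive (as in the paper's
    examples); negative means get no special handling, matching the
    insensitivity to negative numbers noted in the text.
    """
    max_i = activations[target].max()           # Max: peak target activation
    means = activations.mean(axis=1)            # per-cell mean over time slices
    mean_i = means[target]                      # Mean: target's mean activation
    # Sum of the above-zero mean activations of the nontarget cells only.
    nontarget_means = np.delete(means, target)
    positive_sum = nontarget_means[nontarget_means > 0].sum()
    p_i = mean_i / (mean_i + positive_sum)      # P: discriminability
    mp_i = mean_i * p_i                         # MP: Mean x P
    return max_i, mean_i, p_i, mp_i
```

For a target with Mean_i = 0.50 and all nontarget means below zero, this yields P = 1.0 and MP = 0.50; a target with Mean_i = 0.01 under the same conditions also yields P = 1.0, but MP (0.01) preserves the difference between the two views, as the text describes.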
Received July 30, 1990
Revision received July 30, 1991
Accepted August 12, 1991